Crawl Budget
What is crawl budget?
Crawl budget is the number of pages Googlebot can and wants to crawl on your website within a given period. Google's resources are finite and it cannot crawl the entire internet at once, so it allocates a specific crawling budget to each site.
Crawl budget is particularly important for large websites (thousands or millions of pages). For small sites (up to a few hundred pages), it typically isn't a problem — Google can crawl the entire site without difficulty.
How does crawl budget work?
Crawl budget consists of two components:
Crawl capacity limit
This is the maximum number of simultaneous connections Googlebot can establish with your server without overloading it. Google automatically reduces crawling intensity when:
- The server responds slowly (high TTFB)
- The server returns 5xx errors
- The site owner has limited crawling (e.g., a Crawl-delay directive in robots.txt — note that Googlebot ignores Crawl-delay, though some other crawlers respect it)
Crawl demand
This is Google's interest in scanning your website, depending on:
- URL popularity — pages with more backlinks and traffic are crawled more frequently
- Freshness — regularly updated pages have higher priority
- Content type — new URLs discovered in the sitemap or via links are crawled with priority
Why is crawl budget important?
If Googlebot exhausts the crawl budget on unimportant pages (duplicates, URL parameters, error pages), it may never reach your most important content — articles, service pages, new products.
Consequences of crawl budget problems:
- Delayed indexing of new content — blog articles appear in Google after days or weeks instead of hours (see how to speed up Google indexing)
- Outdated data in the index — changes to existing pages aren't reflected in search results
- Unindexed pages — parts of the site may never be scanned
When is crawl budget a problem?
- The site has more than 10,000 pages
- The site generates many duplicates (URL parameters, filters, sorting)
- The server is slow or unstable
- The site has deep architecture — pages accessible only after many clicks
- A large portion of the site returns 404 or 5xx errors
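The "deep architecture" point can be checked programmatically: a breadth-first search over the internal-link graph gives each page's click depth from the homepage. A minimal sketch with a hypothetical site structure (the URLs below are made up):

```python
from collections import deque

def click_depths(links, home="/"):
    """BFS over an internal-link graph: depth = clicks from the homepage."""
    depth = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth

# Hypothetical structure: homepage -> category -> subcategory -> product
site = {
    "/": ["/blog", "/products"],
    "/products": ["/products/shoes"],
    "/products/shoes": ["/products/shoes/item-42"],
}
depths = click_depths(site)
deep = [page for page, d in depths.items() if d > 3]  # pages beyond 3 clicks
```

Pages that end up in `deep` are the ones worth surfacing through extra internal links or hub pages.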
How to optimize crawl budget?
Eliminate budget waste
- Remove or block duplicates — URL parameters, sorting, and filters should have canonical or noindex (more techniques in the technical SEO checklist)
- Fix 404 and 5xx errors — every request to an error page is a wasted crawl
- Limit pagination — hundreds of /page/2, /page/3 pages waste budget
- Block unimportant resources — admin pages, internal search results, shopping cart
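To illustrate what eliminating parameter duplicates looks like in practice, here is a minimal sketch that computes the value for a canonical tag by stripping filter/sort parameters from a URL. The parameter names are hypothetical — use whatever your shop actually generates:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical parameters that only filter/sort and should not
# produce separately crawlable URLs.
FILTER_PARAMS = {"color", "size", "price", "sort", "page"}

def canonical_url(url):
    """Return the canonical form of a URL by dropping filter/sort params."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in FILTER_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

print(canonical_url("https://shop.example/shoes?color=red&sort=price&id=42"))
# → https://shop.example/shoes?id=42
```

The returned URL is what goes into `<link rel="canonical" href="…">` on every parameterized variant of the page.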
Improve crawling efficiency
- Update sitemap.xml — submit an up-to-date sitemap in Google Search Console
- Optimize robots.txt — block crawling of sections that shouldn't be indexed
- Use flat architecture — every important page should be accessible within max 3 clicks from the homepage
- Link internally — new pages should be connected to existing content
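As an illustration of the sitemap point above, a minimal sketch that builds a valid sitemap.xml with Python's standard library (the URLs and dates are made up):

```python
import xml.etree.ElementTree as ET

def build_sitemap(urls):
    """Build a minimal sitemap.xml with <loc> and <lastmod> per URL."""
    ns = "http://www.sitemaps.org/schemas/sitemap/0.9"
    urlset = ET.Element("urlset", xmlns=ns)
    for loc, lastmod in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")

xml = build_sitemap([
    ("https://example.com/", "2024-01-15"),
    ("https://example.com/products/shoes", "2024-01-10"),
])
```

Regenerate the file whenever content changes and submit it once in Google Search Console; Google re-fetches it on its own afterwards.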
Improve server speed
- Optimize TTFB — server response time below 200ms
- Consider SSG — static files are served instantly
- Use a CDN — shortens response time for crawlers
- Monitor server logs — analyze how Googlebot crawls your site
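The log-monitoring step can start very simply: filter Googlebot requests out of the access log and count their status codes. A sketch over made-up common-log-format lines (note that the user-agent can be spoofed, so a production analysis should also verify Googlebot's IP ranges):

```python
import re
from collections import Counter

# Hypothetical access-log excerpt in common log format.
LOG = """\
66.249.66.1 - - [10/Jan/2024:10:00:01 +0000] "GET /products/shoes HTTP/1.1" 200 5120 "-" "Googlebot/2.1"
66.249.66.1 - - [10/Jan/2024:10:00:02 +0000] "GET /old-page HTTP/1.1" 404 320 "-" "Googlebot/2.1"
203.0.113.7 - - [10/Jan/2024:10:00:03 +0000] "GET /products/shoes HTTP/1.1" 200 5120 "-" "Mozilla/5.0"
"""

# Extract request path, status code, and user-agent from each line.
pattern = re.compile(r'"GET (?P<path>\S+) [^"]*" (?P<status>\d{3}) .*"(?P<agent>[^"]*)"$')

statuses = Counter()
for line in LOG.splitlines():
    m = pattern.search(line)
    if m and "Googlebot" in m.group("agent"):
        statuses[m.group("status")] += 1

print(statuses)  # how many Googlebot hits landed on errors vs. real pages
```

A high share of 404/5xx responses in this count is a direct measure of wasted crawl budget.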
How to monitor crawl budget?
- Google Search Console — the "Crawl Stats" report shows the number of crawled pages, response time, and errors
- Server logs — direct analysis of Googlebot requests (filter by the Googlebot user-agent; since user-agents can be spoofed, verify with a reverse DNS lookup)
- Screaming Frog — crawl simulation and problem identification
- Robots.txt tester — verification that blocking rules are correct
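The robots.txt check can also be scripted with Python's built-in parser. A sketch with a hypothetical robots.txt; note that the standard-library parser does not support Googlebot's wildcard extensions (`*`, `$`), so the rules here stick to plain path prefixes:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt blocking internal search and the cart.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /cart
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("Googlebot", "https://example.com/products"))        # allowed
print(rp.can_fetch("Googlebot", "https://example.com/search?q=shoes"))  # blocked
```

Running such a check in CI catches accidental rule changes before they block important sections of the site.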
Example
An e-commerce store has 5,000 products but generates 50,000 URLs through filtering parameters (color, size, price, sorting). Googlebot primarily crawls filter pages, ignoring new products. After implementing canonical URLs on parameterized pages, noindex on filter pages, and updating the sitemap, Google begins indexing new products within 24 hours instead of 2 weeks.
Related terms
- Indexing — the process of recording a page in Google's index, dependent on crawl budget
- Crawlability — the ability of crawlers to traverse a website
- Robots.txt — a file controlling crawler access
- Sitemap — a site map facilitating crawling
- Canonical URL — a tag preventing crawl budget waste on duplicates