Robots.txt
What is robots.txt?
Robots.txt is a plain-text file placed in the root directory of a website (/robots.txt) that tells search engine crawlers which pages they may crawl and which they should skip. It is the first thing Googlebot (and other crawlers, including AI bots) requests before crawling a site.
The robots.txt file does NOT block indexing — a page can still appear in Google results without being crawled (e.g., if other pages link to it). To block indexing, use the noindex meta tag instead.
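The noindex directive mentioned above goes into the page's HTML head (or into an X-Robots-Tag HTTP header). A minimal fragment:

```html
<!-- Blocks indexing of this page. The page must stay crawlable
     (not blocked in robots.txt), or Googlebot never sees this tag. -->
<meta name="robots" content="noindex">
```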
Why does it matter?
- Crawl budget control — blocking unimportant pages (admin, duplicates) conserves crawling resources
- Resource protection — blocking admin pages, dev versions, internal search pages
- AI crawlers — controlling access for GPTBot, ClaudeBot, and PerplexityBot to your content
- Sitemap reference — robots.txt is the standard place to include the sitemap URL
How does it work?
User-agent: *
Allow: /
Disallow: /_next/
Disallow: /api/

User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
Key directives:
- User-agent — which crawler the rules apply to (* = all crawlers)
- Allow — explicit permission to crawl a path
- Disallow — blocks crawling of a path
- Sitemap — the URL of the sitemap file
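The directives above can be checked programmatically. A minimal sketch using Python's standard-library urllib.robotparser (the domain is a placeholder; the blanket Allow: / is omitted because urllib.robotparser applies rules in file order, and crawling is allowed by default anyway):

```python
from urllib import robotparser

# Hypothetical robots.txt content, based on the example above
rules = """\
User-agent: *
Disallow: /_next/
Disallow: /api/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# A regular page is crawlable; a Disallow'ed path is not
print(rp.can_fetch("*", "https://yourdomain.com/blog/post"))  # True
print(rp.can_fetch("*", "https://yourdomain.com/api/users"))  # False
```

In production you would point the parser at the live file with `rp.set_url("https://yourdomain.com/robots.txt")` followed by `rp.read()`.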
Best practices
- Do not block important resources — CSS, JS, and images must be accessible for page rendering
- Allow AI crawlers — for GEO, it is important that GPTBot, ClaudeBot, and PerplexityBot can scan your content
- Block admin and duplicates — /admin/, /api/, internal search pages
- Add a sitemap — Sitemap: https://... at the end of the file
- Test in GSC — Google Search Console has a tool for testing robots.txt
- Do not use it to hide content — robots.txt does not protect private data
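One subtlety behind the "Allow AI crawlers" advice: a crawler that matches its own User-agent group ignores the generic * group entirely, so a dedicated GPTBot group with Allow: / lifts the * restrictions for that bot. A sketch of this behavior with urllib.robotparser (placeholder domain and rules):

```python
from urllib import robotparser

# Hypothetical file: /api/ is blocked for everyone,
# but GPTBot gets its own permissive group
rules = """\
User-agent: *
Disallow: /api/

User-agent: GPTBot
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# GPTBot matches its own group, so the * rules do not apply to it
print(rp.can_fetch("GPTBot", "https://yourdomain.com/api/data"))     # True
print(rp.can_fetch("Googlebot", "https://yourdomain.com/api/data"))  # False
```

If you want the * restrictions to also bind a bot you list separately, repeat the Disallow lines inside that bot's group.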
More on technical optimization in the article on technical SEO.
Related terms
- Sitemap — site map for search engines
- Crawlability — the ability of a site to be crawled
- Crawl budget — the number of URLs a search engine will crawl on a site
- GEO — optimization for AI
- Indexing — the process of adding pages to Google's index