GPTBot was blocked by 35.7% of the top 1,000 websites as of August 2024 — many without knowing they'd set that restriction. At the same time, only 0.3% of the top 1,000 sites have implemented llms.txt, the emerging standard for communicating site structure to AI systems. This guide is the definitive technical reference for both: controlling which AI crawlers access your site, and proactively communicating your site's content to those that do.
How Do I Allow AI Crawlers on My Website?
To allow AI crawlers, add explicit Allow: / rules for each AI user agent (GPTBot, ClaudeBot, PerplexityBot) in your robots.txt. If you have a wildcard Disallow, add a dedicated group for each AI crawler: compliant parsers follow the most specific matching User-agent group, so a named group takes precedence over the wildcard regardless of where it appears in the file.
Layer 1: robots.txt for AI Crawlers
robots.txt is the primary mechanism for controlling crawler access. The major AI crawlers state that they honor robots.txt directives, just as Google's crawler does, though compliance is voluntary rather than technically enforced. The file lives at yourdomain.com/robots.txt and is checked by well-behaved crawlers before fetching any page.
Known AI crawler user agents
| Company | User Agent | Purpose |
|---|---|---|
| OpenAI | GPTBot | Training data collection and ChatGPT indexing |
| OpenAI | ChatGPT-User | ChatGPT's real-time browsing capability |
| Anthropic | ClaudeBot | Training data and retrieval for Claude |
| Anthropic | anthropic-ai | Alternate Anthropic crawler identifier |
| Perplexity AI | PerplexityBot | Real-time retrieval for Perplexity answers |
| Google | GoogleOther | Experimental and non-search Google crawlers |
| Common Crawl | CCBot | Open training corpus used by many LLMs |
| Apple | Applebot-Extended | Opt-out token controlling whether Applebot-crawled content is used for Apple Intelligence training |
Allowing all AI crawlers (recommended for most sites)
If you have an existing wildcard disallow for scraper protection, add a dedicated allow group for each AI crawler. Per RFC 9309, a crawler obeys the most specific User-agent group that matches it, so these named groups take precedence over the wildcard:
```
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: CCBot
Allow: /

User-agent: GoogleOther
Allow: /

User-agent: Applebot-Extended
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```
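You can sanity-check group precedence locally with Python's standard-library `urllib.robotparser` (a minimal sketch; the rules and example.com URL are illustrative, and note that Python's parser uses substring user-agent matching, which approximates rather than exactly mirrors RFC 9309 semantics):

```python
from urllib.robotparser import RobotFileParser

# A robots.txt with a dedicated GPTBot group plus a catch-all disallow.
robots_txt = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# GPTBot matches its own named group, so the Allow applies...
print(rp.can_fetch("GPTBot", "https://example.com/pricing"))        # True
# ...while a crawler with no named group falls through to the wildcard block.
print(rp.can_fetch("SomeOtherBot", "https://example.com/pricing"))  # False
```

The same two-line check is a quick regression test after any robots.txt change: feed in the live file's contents and assert the agents you care about still resolve the way you expect.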
Blocking specific AI crawlers
To block a specific crawler from all pages (for example, if you want to prevent training data use while still allowing retrieval bots):
```
# Block OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# Still allow Perplexity's retrieval crawler
User-agent: PerplexityBot
Allow: /
```
Blocking AI crawlers from specific directories
To allow AI crawlers on most pages but block specific sections (account pages, admin, private content):
```
User-agent: GPTBot
Disallow: /account/
Disallow: /admin/
Disallow: /api/
Allow: /
```
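To confirm which paths a directory-level configuration actually exposes, you can evaluate sample URLs against the rules with `urllib.robotparser` (a sketch; the paths and example.com domain are placeholders):

```python
from urllib.robotparser import RobotFileParser

# The directory-level rules from above, as a GPTBot group.
robots_txt = """\
User-agent: GPTBot
Disallow: /account/
Disallow: /admin/
Disallow: /api/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Public sections remain fetchable; the listed directories are blocked.
for path in ("/blog/post", "/docs/", "/account/settings", "/admin/"):
    verdict = "allowed" if rp.can_fetch("GPTBot", f"https://example.com{path}") else "blocked"
    print(f"{path}: {verdict}")
```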
Layer 2: llms.txt
llms.txt is a plain-text file at yourdomain.com/llms.txt that provides AI systems with a curated overview of your site. Unlike robots.txt (access control), llms.txt is about communication — telling AI agents what your site contains, who operates it, and which pages are most important. Despite being a quick, low-effort implementation, only 0.3% of the top 1,000 sites have it.
What to include in llms.txt
- A 2–3 sentence description of what your site/company does.
- Your site's primary topic areas or content categories.
- Links to your most important pages, especially canonical reference content.
- Factual context AI agents should apply when processing your content.
- Your organization name and authoritative identifiers (official URL, Wikidata entity ID if applicable).
llms.txt example
```
# YourCompany

YourCompany provides [description of product/service].

## About

Founded in [year], YourCompany [1-2 sentences of context about what it does and who it serves].

## Key Pages

- [Homepage](https://yoursite.com/): Main product overview
- [About](https://yoursite.com/about/): Team and mission
- [Documentation](https://yoursite.com/docs/): Technical guides
- [Blog](https://yoursite.com/blog/): Product and industry articles

## Content Scope

This site covers [topics]. Our methodology for [core topic] is documented at [URL].
```
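If you maintain site metadata elsewhere (a CMS, a config file), generating llms.txt from it keeps the file from drifting out of date. A minimal sketch, where every company name, URL, and description is a placeholder:

```python
def build_llms_txt(name, description, about, pages, scope):
    """Render an llms.txt document following the structure shown above."""
    lines = [f"# {name}", "", description, "", "## About", "", about, "", "## Key Pages", ""]
    for title, url, note in pages:
        lines.append(f"- [{title}]({url}): {note}")
    lines += ["", "## Content Scope", "", scope]
    return "\n".join(lines) + "\n"

# All values below are illustrative placeholders.
content = build_llms_txt(
    name="YourCompany",
    description="YourCompany provides example widgets for small teams.",
    about="Founded in 2020, YourCompany builds widget tooling for developers.",
    pages=[
        ("Homepage", "https://yoursite.com/", "Main product overview"),
        ("Documentation", "https://yoursite.com/docs/", "Technical guides"),
    ],
    scope="This site covers widget engineering and deployment.",
)
print(content)
```

Write the result to your site root (or to your static-site build output) so it is served at /llms.txt.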
llms-full.txt convention
Some sites also publish llms-full.txt at /llms-full.txt — the complete plain-text content of all major pages concatenated into a single file, optimized for LLM consumption. Particularly useful for documentation-heavy sites.
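For documentation sites whose sources are already plain text or markdown, llms-full.txt can be generated by concatenation. A sketch, assuming a hypothetical ./docs directory of .md files (the paths are assumptions, not part of any standard):

```python
from pathlib import Path

def build_llms_full(docs_dir="docs", out_path="llms-full.txt"):
    """Concatenate all markdown sources into one plain-text file,
    each section prefixed with a heading derived from its filename."""
    parts = []
    for md in sorted(Path(docs_dir).rglob("*.md")):
        parts.append(f"# {md.stem}\n\n{md.read_text(encoding='utf-8').strip()}\n")
    Path(out_path).write_text("\n".join(parts), encoding="utf-8")
    return out_path
```

Run this as part of your build so /llms-full.txt stays in sync with the published docs.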
Layer 3: Per-Page Meta Tags
noai — block AI training use
```html
<meta name="robots" content="noai">
```
Signals that the page content should not be used for AI training. Not universally honored across all AI systems. Use on paywalled content, proprietary data, or user-generated content you don't want trained on. Do not use on public pages you want AI engines to cite.
noimageai — block AI image use only
```html
<meta name="robots" content="noimageai">
```
Blocks images on the page from AI training while still allowing text content to be indexed.
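When auditing pages at scale, these directives can be detected with the standard-library `html.parser` (a minimal sketch; the sample HTML is illustrative):

```python
from html.parser import HTMLParser

class RobotsMetaScanner(HTMLParser):
    """Collect the directive tokens from <meta name="robots"> tags."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.directives.update(
                token.strip().lower() for token in a.get("content", "").split(",")
            )

page = '<html><head><meta name="robots" content="noai, noimageai"></head></html>'
scanner = RobotsMetaScanner()
scanner.feed(page)
print("noai" in scanner.directives)        # True
print("noimageai" in scanner.directives)   # True
```

Pages flagged with noai that you actually want cited by AI engines are the case to watch for, since the tag is easy to inherit from a site-wide template.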
Configuration Decision Matrix
| Goal | Mechanism | Implementation |
|---|---|---|
| Allow all AI crawlers | robots.txt | Add Allow: / for each AI user agent |
| Block all AI crawlers | robots.txt | Add Disallow: / for each AI user agent |
| Block training bots, allow retrieval bots | robots.txt | Disallow GPTBot/CCBot; Allow PerplexityBot/ChatGPT-User |
| Block AI access to specific directories | robots.txt | Add Disallow: /path/ under target user agents |
| Block one page from AI training | Meta tag | Add <meta name='robots' content='noai'> |
| Block images from AI training only | Meta tag | Add <meta name='robots' content='noimageai'> |
| Communicate site structure to AI | llms.txt | Create and publish llms.txt at site root |
| Provide full site content to AI in one file | llms-full.txt | Create concatenated plain-text at /llms-full.txt |
Verifying Your Configuration
After configuring robots.txt and llms.txt, verify in three ways: (1) visit yourdomain.com/robots.txt and yourdomain.com/llms.txt directly to confirm they load and return correct content; (2) use a robots.txt tester to simulate each AI crawler's user agent; (3) run a full AEO audit — it checks crawler access, llms.txt validity, and the absence of noai meta tags as part of its 16 deterministic checks.
How do I allow GPTBot on my website?
Add `User-agent: GPTBot` followed by `Allow: /` to your robots.txt file. If you have a wildcard `User-agent: *` with `Disallow: /`, add the GPTBot rules as their own group; crawlers follow the most specific matching User-agent group, so the named group overrides the wildcard wherever it sits in the file. Verify by visiting yourdomain.com/robots.txt and confirming the rules are present.
Does robots.txt affect AI crawlers the same way it affects Google?
Largely, yes. The major AI crawlers (GPTBot, ClaudeBot, PerplexityBot) document that they respect robots.txt directives, just as Google's crawler does, though robots.txt is advisory and compliance is voluntary. A Disallow rule for a specific user agent will prevent a compliant crawler from fetching the specified pages, and the same syntax that works for Google's crawler works for AI crawlers.
What is llms.txt and do I need it?
llms.txt is a plain-text file at your site root that gives AI systems a curated overview of your content. It's not required and not all AI crawlers currently parse it, but it's a low-effort signal of AI readiness. Only 0.3% of the top 1,000 sites have implemented it, making it a quick differentiator for sites that do.
Can I block AI training data use while still allowing AI search citation?
Partially. You can block training-data crawlers (GPTBot, CCBot) while allowing retrieval crawlers (PerplexityBot, ChatGPT-User) — but this means your site won't appear in ChatGPT's base model responses, only in ChatGPT with browsing and Perplexity's real-time answers. The noai meta tag also signals training opt-out but is not universally honored.
How do I check if my robots.txt is blocking AI crawlers?
Visit yourdomain.com/robots.txt and scan for Disallow rules targeting GPTBot, ClaudeBot, PerplexityBot, CCBot, or a wildcard User-agent: * with Disallow: /. An AEO audit at aeo-check.vercel.app runs this check automatically and flags any AI-blocking configurations as part of its 16 deterministic checks.
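The manual scan described above can be automated by evaluating each known AI user agent against your robots.txt with `urllib.robotparser` (a sketch; example.com is a placeholder, and Python's substring-based agent matching approximates rather than exactly mirrors RFC 9309):

```python
from urllib.robotparser import RobotFileParser

# The AI user agents catalogued earlier in this guide.
AI_AGENTS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "anthropic-ai",
             "PerplexityBot", "CCBot", "GoogleOther", "Applebot-Extended"]

def blocked_ai_agents(robots_txt):
    """Return the AI agents whose matching group disallows the site root."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return [agent for agent in AI_AGENTS
            if not rp.can_fetch(agent, "https://example.com/")]

# Example: a file that singles out GPTBot while allowing everyone else.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
print(blocked_ai_agents(robots_txt))  # ['GPTBot']
```

In practice you would fetch the live file (e.g. with urllib.request) and pass its text to this function; an empty result means no AI crawler is blocked at the root.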