
Reference

AI Crawler Configuration Guide: robots.txt, llms.txt, Meta Tags

Technical reference for controlling AI crawler access: robots.txt rules for GPTBot, ClaudeBot, PerplexityBot, llms.txt setup, and noai meta tags — with copy-paste examples.

GPTBot was blocked by 35.7% of the top 1,000 websites as of August 2024 — many without knowing they'd set that restriction. At the same time, only 0.3% of the top 1,000 sites have implemented llms.txt, the emerging standard for communicating site structure to AI systems. This guide is the definitive technical reference for both: controlling which AI crawlers access your site, and proactively communicating your site's content to those that do.

How Do I Allow AI Crawlers on My Website?

To allow AI crawlers, add explicit Allow: / rules for each AI user agent (GPTBot, ClaudeBot, PerplexityBot) in your robots.txt. If you have a wildcard Disallow, add a dedicated group for each AI crawler: crawlers obey the group that most specifically matches their user agent, so a named group such as `User-agent: GPTBot` overrides `User-agent: *` no matter where it appears in the file.

Layer 1: robots.txt for AI Crawlers

robots.txt is the primary mechanism for controlling crawler access. The major AI crawlers document that they respect robots.txt directives, just as Google's crawler does. The file lives at yourdomain.com/robots.txt and is checked by compliant crawlers before they fetch any page.
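You can preview how a compliant crawler will interpret a given robots.txt locally with Python's standard-library parser; the rules and domain below are illustrative placeholders:

```python
# Sketch: simulate a crawler's reading of robots.txt with the stdlib
# parser. The rules and domain below are illustrative placeholders.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# GPTBot matches its own named group, so it is allowed
# despite the wildcard block that applies to everyone else.
print(parser.can_fetch("GPTBot", "https://yourdomain.com/blog/post"))
print(parser.can_fetch("SomeOtherBot", "https://yourdomain.com/blog/"))
```

Note that the named GPTBot group wins even though the wildcard group also appears in the file: group selection is by user-agent match, not file position.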

Known AI crawler user agents

| Company | User Agent | Purpose |
| --- | --- | --- |
| OpenAI | GPTBot | Training data collection and ChatGPT indexing |
| OpenAI | ChatGPT-User | ChatGPT's real-time browsing capability |
| Anthropic | ClaudeBot | Training data and retrieval for Claude |
| Anthropic | anthropic-ai | Alternate Anthropic crawler identifier |
| Perplexity AI | PerplexityBot | Real-time retrieval for Perplexity answers |
| Google | GoogleOther | Google experimental / non-search crawlers |
| Common Crawl | CCBot | Open training corpus used by many LLMs |
| Apple | Applebot-Extended | Apple Intelligence content crawler |

Allowing all AI crawlers (recommended for most sites)

If you have an existing wildcard disallow for scraper protection, add a dedicated group for each AI crawler. Crawlers obey the group that most specifically matches their user agent, so these named groups take precedence over `User-agent: *` wherever they appear in the file:

User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: CCBot
Allow: /

User-agent: GoogleOther
Allow: /

User-agent: Applebot-Extended
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Blocking specific AI crawlers

To block a specific crawler from all pages (for example, if you want to prevent training data use while still allowing retrieval bots):

# Block OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# Still allow Perplexity's retrieval crawler
User-agent: PerplexityBot
Allow: /

Blocking AI crawlers from specific directories

To allow AI crawlers on most pages but block specific sections (account pages, admin, private content):

User-agent: GPTBot
Disallow: /account/
Disallow: /admin/
Disallow: /api/
Allow: /
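Before deploying directory-level rules like these, you can replay them against sample paths with Python's standard-library parser (the domain and paths are placeholders):

```python
# Sketch: check the directory rules above against sample paths.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Disallow: /account/
Disallow: /admin/
Disallow: /api/
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

for path in ("/", "/blog/some-post", "/admin/settings", "/api/v1/users"):
    allowed = parser.can_fetch("GPTBot", f"https://yourdomain.com{path}")
    print(f"{path}: {'allowed' if allowed else 'blocked'}")
```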

Layer 2: llms.txt

llms.txt is a plain-text file at yourdomain.com/llms.txt that provides AI systems with a curated overview of your site. Unlike robots.txt (access control), llms.txt is about communication — telling AI agents what your site contains, who operates it, and which pages are most important. Despite the quick, low-effort implementation, only 0.3% of the top 1,000 sites have one.

What to include in llms.txt

  • A 2–3 sentence description of what your site/company does.
  • Your site's primary topic areas or content categories.
  • Links to your most important pages, especially canonical reference content.
  • Factual context AI agents should apply when processing your content.
  • Your organization name and authoritative identifiers (official URL, Wikidata entity ID if applicable).

llms.txt example

# YourCompany

YourCompany provides [description of product/service].

## About

Founded in [year], YourCompany [1-2 sentences of context about what it does and who it serves].

## Key Pages

- [Homepage](https://yoursite.com/): Main product overview
- [About](https://yoursite.com/about/): Team and mission
- [Documentation](https://yoursite.com/docs/): Technical guides
- [Blog](https://yoursite.com/blog/): Product and industry articles

## Content Scope

This site covers [topics]. Our methodology for [core topic] is documented at [URL].

llms-full.txt convention

Some sites also publish llms-full.txt at /llms-full.txt — the complete plain-text content of all major pages concatenated into a single file, optimized for LLM consumption. Particularly useful for documentation-heavy sites.
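As a sketch of how a documentation site might produce this file, the snippet below concatenates a directory of markdown files; the directory layout, the "Source:" headers, and the separator are assumptions of this example, not part of any published convention:

```python
# Hypothetical generator for llms-full.txt: concatenates every .md file
# under docs_dir into one plain-text file. The "Source:" header and
# "---" separator are illustrative choices, not a standard.
from pathlib import Path

def build_llms_full(docs_dir: str, out_file: str) -> int:
    """Write the concatenated file and return how many files it includes."""
    files = sorted(Path(docs_dir).rglob("*.md"))
    parts = [
        f"# Source: {path.relative_to(docs_dir)}\n\n"
        f"{path.read_text(encoding='utf-8')}"
        for path in files
    ]
    Path(out_file).write_text("\n\n---\n\n".join(parts), encoding="utf-8")
    return len(files)
```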

Layer 3: Per-Page Meta Tags

noai — block AI training use

<meta name="robots" content="noai">

Signals that the page content should not be used for AI training. Not universally honored across all AI systems. Use on paywalled content, proprietary data, or user-generated content you don't want trained on. Do not use on public pages you want AI engines to cite.

noimageai — block AI image use only

<meta name="robots" content="noimageai">

Blocks images on the page from AI training while still allowing text content to be indexed.
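To audit existing pages for these directives, the standard library's HTML parser is enough for a simple scanner; the sample markup below is illustrative:

```python
# Sketch: collect the directives from all robots meta tags on a page.
from html.parser import HTMLParser

class RobotsMetaScanner(HTMLParser):
    """Accumulates lowercase directives from <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and attr.get("name", "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives |= {t.strip().lower() for t in content.split(",")}

page = '<head><meta name="robots" content="noai, noimageai"></head>'
scanner = RobotsMetaScanner()
scanner.feed(page)
print(sorted(scanner.directives))  # ['noai', 'noimageai']
```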

Configuration Decision Matrix

| Goal | Mechanism | Implementation |
| --- | --- | --- |
| Allow all AI crawlers | robots.txt | Add Allow: / for each AI user agent |
| Block all AI crawlers | robots.txt | Add Disallow: / for each AI user agent |
| Block training bots, allow retrieval bots | robots.txt | Disallow GPTBot/CCBot; Allow PerplexityBot/ChatGPT-User |
| Block AI access to specific directories | robots.txt | Add Disallow: /path/ under target user agents |
| Block one page from AI training | Meta tag | Add <meta name='robots' content='noai'> |
| Block images from AI training only | Meta tag | Add <meta name='robots' content='noimageai'> |
| Communicate site structure to AI | llms.txt | Create and publish llms.txt at site root |
| Provide full site content to AI in one file | llms-full.txt | Create concatenated plain-text at /llms-full.txt |

Verifying Your Configuration

After configuring robots.txt and llms.txt, verify in three ways: (1) visit yourdomain.com/robots.txt and yourdomain.com/llms.txt directly to confirm they load and return correct content; (2) use a robots.txt tester to simulate each AI crawler's user agent; (3) run a full AEO audit — it checks crawler access, llms.txt validity, and the absence of noai meta tags as part of its 16 deterministic checks.

How do I allow GPTBot on my website?

Add `User-agent: GPTBot` followed by `Allow: /` to your robots.txt file. If you have a wildcard `User-agent: *` with `Disallow: /`, add a dedicated GPTBot group: crawlers follow the most specific matching user-agent group, so the GPTBot rules override the wildcard wherever they appear in the file. Verify by visiting yourdomain.com/robots.txt and confirming the rules are present.

Does robots.txt affect AI crawlers the same way it affects Google?

Largely, yes. The major AI crawlers — GPTBot, ClaudeBot, PerplexityBot — document that they respect robots.txt directives, just as Google's crawler does. A Disallow rule for a specific user agent tells that crawler not to fetch the specified pages. The same syntax that works for Google's crawler works for AI crawlers.

What is llms.txt and do I need it?

llms.txt is a plain-text file at your site root that gives AI systems a curated overview of your content. It's not required and not all AI crawlers currently parse it, but it's a low-effort signal of AI readiness. Only 0.3% of the top 1,000 sites have implemented it, making it a quick differentiator for sites that do.

Can I block AI training data use while still allowing AI search citation?

Partially. You can block training-data crawlers (GPTBot, CCBot) while allowing retrieval crawlers (PerplexityBot, ChatGPT-User) — but this means your site won't appear in ChatGPT's base model responses, only in ChatGPT with browsing and Perplexity's real-time answers. The noai meta tag also signals training opt-out but is not universally honored.

How do I check if my robots.txt is blocking AI crawlers?

Visit yourdomain.com/robots.txt and scan for Disallow rules targeting GPTBot, ClaudeBot, PerplexityBot, CCBot, or a wildcard User-agent: * with Disallow: /. An AEO audit at aeo-check.vercel.app runs this check automatically and flags any AI-blocking configurations as part of its 16 deterministic checks.
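That manual scan can be sketched in code. The function below flags known AI user agents hit by a Disallow: / rule; it is deliberately simplified (no longest-path matching, and it does not account for a named Allow group overriding a wildcard block), so treat its output as a first pass, not a verdict:

```python
# Simplified sketch: flag AI user agents blocked outright by robots.txt.
AI_AGENTS = {
    "gptbot", "chatgpt-user", "claudebot", "anthropic-ai",
    "perplexitybot", "ccbot", "googleother", "applebot-extended",
}

def blocked_ai_agents(robots_txt: str) -> set:
    """Return AI agents covered by a group containing Disallow: /."""
    blocked, group, in_rules = set(), set(), False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a rule line closed the previous group
                group, in_rules = set(), False
            group.add(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/":
                # A wildcard group blocks every agent without its own group;
                # this sketch treats it as blocking all AI agents.
                blocked |= AI_AGENTS if "*" in group else group & AI_AGENTS
    return blocked

sample = """\
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Allow: /
"""
print(blocked_ai_agents(sample))  # {'gptbot'}
```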