GPTBot was blocked by 35.7% of the top 1,000 websites as of August 2024 — many without knowing they'd set that restriction. At the same time, only 0.3% of the top 1,000 sites have implemented llms.txt, the emerging standard for communicating site structure to AI systems. This guide is the definitive technical reference for both: controlling which AI crawlers access your site, and proactively communicating your site's content to those that do.
How Do I Allow AI Crawlers on My Website?
To allow AI crawlers, add explicit Allow: / rules for each AI user agent (GPTBot, ClaudeBot, PerplexityBot) in your robots.txt. If you have a wildcard Disallow, add a dedicated group for each AI crawler: compliant parsers follow the most specific matching User-agent group, so a named group takes precedence over the wildcard regardless of where it appears in the file.
Layer 1: robots.txt for AI Crawlers
robots.txt is the primary mechanism for controlling crawler access. The major AI crawlers state that they honor robots.txt directives, just as Google's crawler does, though compliance is voluntary rather than technically enforced. The file lives at yourdomain.com/robots.txt and is checked by well-behaved crawlers before fetching any page.
Known AI crawler user agents
| Company | User Agent | Purpose |
|---|---|---|
| OpenAI | GPTBot | Training data collection and ChatGPT indexing |
| OpenAI | ChatGPT-User | ChatGPT's real-time browsing capability |
| Anthropic | ClaudeBot | Training data and retrieval for Claude |
| Anthropic | anthropic-ai | Alternate Anthropic crawler identifier |
| Perplexity AI | PerplexityBot | Real-time retrieval for Perplexity answers |
| Google | GoogleOther | Experimental and non-search Google crawlers |
| Common Crawl | CCBot | Open training corpus used by many LLMs |
| Apple | Applebot-Extended | Opt-out token controlling whether Applebot-crawled content is used for Apple Intelligence training |
Allowing all AI crawlers (recommended for most sites)
If you have an existing wildcard disallow for scraper protection, add a dedicated allow group for each AI crawler. Per RFC 9309, a crawler obeys the most specific User-agent group that matches it, so these named groups take precedence over the wildcard:
```
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: CCBot
Allow: /

User-agent: GoogleOther
Allow: /

User-agent: Applebot-Extended
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```
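You can sanity-check group precedence locally with Python's standard-library `urllib.robotparser` (a minimal sketch; the rules and example.com URL are illustrative, and note that Python's parser uses substring user-agent matching, which approximates rather than exactly mirrors RFC 9309 semantics):

```python
from urllib.robotparser import RobotFileParser

# A robots.txt with a dedicated GPTBot group plus a catch-all disallow.
robots_txt = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# GPTBot matches its own named group, so the Allow applies...
print(rp.can_fetch("GPTBot", "https://example.com/pricing"))        # True
# ...while a crawler with no named group falls through to the wildcard block.
print(rp.can_fetch("SomeOtherBot", "https://example.com/pricing"))  # False
```

The same two-line check is a quick regression test after any robots.txt change: feed in the live file's contents and assert the agents you care about still resolve the way you expect.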
Blocking specific AI crawlers
To block a specific crawler from all pages (for example, if you want to prevent training data use while still allowing retrieval bots):
```
# Block OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# Still allow Perplexity's retrieval crawler
User-agent: PerplexityBot
Allow: /
```
Blocking AI crawlers from specific directories
To allow AI crawlers on most pages but block specific sections (account pages, admin, private content):
```
User-agent: GPTBot
Disallow: /account/
Disallow: /admin/
Disallow: /api/
Allow: /
```
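To confirm which paths a directory-level configuration actually exposes, you can evaluate sample URLs against the rules with `urllib.robotparser` (a sketch; the paths and example.com domain are placeholders):

```python
from urllib.robotparser import RobotFileParser

# The directory-level rules from above, as a GPTBot group.
robots_txt = """\
User-agent: GPTBot
Disallow: /account/
Disallow: /admin/
Disallow: /api/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Public sections remain fetchable; the listed directories are blocked.
for path in ("/blog/post", "/docs/", "/account/settings", "/admin/"):
    verdict = "allowed" if rp.can_fetch("GPTBot", f"https://example.com{path}") else "blocked"
    print(f"{path}: {verdict}")
```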
Layer 2: llms.txt
llms.txt is a plain-text file at yourdomain.com/llms.txt that provides AI systems with a curated overview of your site. Unlike robots.txt (access control), llms.txt is about communication — telling AI agents what your site contains, who operates it, and which pages are most important. Despite being a quick, low-effort implementation, only 0.3% of the top 1,000 sites have it.
What to include in llms.txt
- A 2–3 sentence description of what your site/company does.
- Your site's primary topic areas or content categories.
- Links to your most important pages, especially canonical reference content.
- Factual context AI agents should apply when processing your content.
- Your organization name and authoritative identifiers (official URL, Wikidata entity ID if applicable).
llms.txt example
```
# YourCompany

YourCompany provides [description of product/service].

## About

Founded in [year], YourCompany [1-2 sentences of context about what it does and who it serves].

## Key Pages

- [Homepage](https://yoursite.com/): Main product overview
- [About](https://yoursite.com/about/): Team and mission
- [Documentation](https://yoursite.com/docs/): Technical guides
- [Blog](https://yoursite.com/blog/): Product and industry articles

## Content Scope

This site covers [topics]. Our methodology for [core topic] is documented at [URL].
```
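If you maintain site metadata elsewhere (a CMS, a config file), generating llms.txt from it keeps the file from drifting out of date. A minimal sketch, where every company name, URL, and description is a placeholder:

```python
def build_llms_txt(name, description, about, pages, scope):
    """Render an llms.txt document following the structure shown above."""
    lines = [f"# {name}", "", description, "", "## About", "", about, "", "## Key Pages", ""]
    for title, url, note in pages:
        lines.append(f"- [{title}]({url}): {note}")
    lines += ["", "## Content Scope", "", scope]
    return "\n".join(lines) + "\n"

# All values below are illustrative placeholders.
content = build_llms_txt(
    name="YourCompany",
    description="YourCompany provides example widgets for small teams.",
    about="Founded in 2020, YourCompany builds widget tooling for developers.",
    pages=[
        ("Homepage", "https://yoursite.com/", "Main product overview"),
        ("Documentation", "https://yoursite.com/docs/", "Technical guides"),
    ],
    scope="This site covers widget engineering and deployment.",
)
print(content)
```

Write the result to your site root (or to your static-site build output) so it is served at /llms.txt.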
llms-full.txt convention
Some sites also publish llms-full.txt at /llms-full.txt — the complete plain-text content of all major pages concatenated into a single file, optimized for LLM consumption. Particularly useful for documentation-heavy sites.
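For documentation sites whose sources are already plain text or markdown, llms-full.txt can be generated by concatenation. A sketch, assuming a hypothetical ./docs directory of .md files (the paths are assumptions, not part of any standard):

```python
from pathlib import Path

def build_llms_full(docs_dir="docs", out_path="llms-full.txt"):
    """Concatenate all markdown sources into one plain-text file,
    each section prefixed with a heading derived from its filename."""
    parts = []
    for md in sorted(Path(docs_dir).rglob("*.md")):
        parts.append(f"# {md.stem}\n\n{md.read_text(encoding='utf-8').strip()}\n")
    Path(out_path).write_text("\n".join(parts), encoding="utf-8")
    return out_path
```

Run this as part of your build so /llms-full.txt stays in sync with the published docs.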
Layer 3: Per-Page Meta Tags
noai — block AI training use
```html
<meta name="robots" content="noai">
```
Signals that the page content should not be used for AI training. Not universally honored across all AI systems. Use on paywalled content, proprietary data, or user-generated content you don't want trained on. Do not use on public pages you want AI engines to cite.
noimageai — block AI image use only
```html
<meta name="robots" content="noimageai">
```
Blocks images on the page from AI training while still allowing text content to be indexed.
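When auditing pages at scale, these directives can be detected with the standard-library `html.parser` (a minimal sketch; the sample HTML is illustrative):

```python
from html.parser import HTMLParser

class RobotsMetaScanner(HTMLParser):
    """Collect the directive tokens from <meta name="robots"> tags."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.directives.update(
                token.strip().lower() for token in a.get("content", "").split(",")
            )

page = '<html><head><meta name="robots" content="noai, noimageai"></head></html>'
scanner = RobotsMetaScanner()
scanner.feed(page)
print("noai" in scanner.directives)        # True
print("noimageai" in scanner.directives)   # True
```

Pages flagged with noai that you actually want cited by AI engines are the case to watch for, since the tag is easy to inherit from a site-wide template.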
Configuration Decision Matrix
| Goal | Mechanism | Implementation |
|---|---|---|
| Allow all AI crawlers | robots.txt | Add Allow: / for each AI user agent |
| Block all AI crawlers | robots.txt | Add Disallow: / for each AI user agent |
| Block training bots, allow retrieval bots | robots.txt | Disallow GPTBot/CCBot; Allow PerplexityBot/ChatGPT-User |
| Block AI access to specific directories | robots.txt | Add Disallow: /path/ under target user agents |
| Block one page from AI training | Meta tag | Add <meta name='robots' content='noai'> |
| Block images from AI training only | Meta tag | Add <meta name='robots' content='noimageai'> |
| Communicate site structure to AI | llms.txt | Create and publish llms.txt at site root |
| Provide full site content to AI in one file | llms-full.txt | Create concatenated plain-text at /llms-full.txt |
Verifying Your Configuration
After configuring robots.txt and llms.txt, verify in three ways: (1) visit yourdomain.com/robots.txt and yourdomain.com/llms.txt directly to confirm they load and return correct content; (2) use a robots.txt tester to simulate each AI crawler's user agent; (3) run a full AEO audit — it checks crawler access, llms.txt validity, and the absence of noai meta tags as part of its 16 deterministic checks.
How do I allow GPTBot on my website?
Add `User-agent: GPTBot` followed by `Allow: /` to your robots.txt file. If you have a wildcard `User-agent: *` with `Disallow: /`, add the GPTBot rules as their own group; crawlers follow the most specific matching User-agent group, so the named group overrides the wildcard wherever it sits in the file. Verify by visiting yourdomain.com/robots.txt and confirming the rules are present.
Does robots.txt affect AI crawlers the same way it affects Google?
Largely, yes. The major AI crawlers (GPTBot, ClaudeBot, PerplexityBot) document that they respect robots.txt directives, just as Google's crawler does, though robots.txt is advisory and compliance is voluntary. A Disallow rule for a specific user agent will prevent a compliant crawler from fetching the specified pages, and the same syntax that works for Google's crawler works for AI crawlers.
What is llms.txt and do I need it?
llms.txt is a plain-text file at your site root that gives AI systems a curated overview of your content. It's not required and not all AI crawlers currently parse it, but it's a low-effort signal of AI readiness. Only 0.3% of the top 1,000 sites have implemented it, making it a quick differentiator for sites that do.
Can I block AI training data use while still allowing AI search citation?
Partially. You can block training-data crawlers (GPTBot, CCBot) while allowing retrieval crawlers (PerplexityBot, ChatGPT-User) — but this means your site won't appear in ChatGPT's base model responses, only in ChatGPT with browsing and Perplexity's real-time answers. The noai meta tag also signals training opt-out but is not universally honored.
How do I check if my robots.txt is blocking AI crawlers?
Visit yourdomain.com/robots.txt and scan for Disallow rules targeting GPTBot, ClaudeBot, PerplexityBot, CCBot, or a wildcard User-agent: * with Disallow: /. An AEO audit at aeo-check.vercel.app runs this check automatically and flags any AI-blocking configurations as part of its 16 deterministic checks.
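The manual scan described above can be automated by evaluating each known AI user agent against your robots.txt with `urllib.robotparser` (a sketch; example.com is a placeholder, and Python's substring-based agent matching approximates rather than exactly mirrors RFC 9309):

```python
from urllib.robotparser import RobotFileParser

# The AI user agents catalogued earlier in this guide.
AI_AGENTS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "anthropic-ai",
             "PerplexityBot", "CCBot", "GoogleOther", "Applebot-Extended"]

def blocked_ai_agents(robots_txt):
    """Return the AI agents whose matching group disallows the site root."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return [agent for agent in AI_AGENTS
            if not rp.can_fetch(agent, "https://example.com/")]

# Example: a file that singles out GPTBot while allowing everyone else.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
print(blocked_ai_agents(robots_txt))  # ['GPTBot']
```

In practice you would fetch the live file (e.g. with urllib.request) and pass its text to this function; an empty result means no AI crawler is blocked at the root.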