# Vizelo.ai - Do AI engines respect robots.txt?
# Source: https://vizelo.ai/do-ai-engines-respect-robots-txt.html
# Last reviewed: 2026-05-26

# Do AI engines respect robots.txt?

**Short answer:** Mostly yes. The named crawlers from OpenAI, Anthropic, Perplexity, Google, and Microsoft honor it. The caveats are third-party scrapers, the persistence of training data already collected, and the difference between blocking a crawl and being absent from an answer.

## Which crawlers honor it

Every major vendor with a public AI product publishes the user agents it crawls under and commits to honoring robots.txt. The list as of 2026:

- **OpenAI**: `GPTBot` (training-data crawler), `ChatGPT-User` (live browsing from a ChatGPT session), `OAI-SearchBot` (indexing for ChatGPT search).
- **Perplexity**: `PerplexityBot` (indexing), `Perplexity-User` (live retrieval).
- **Anthropic**: `ClaudeBot`, and separately `anthropic-ai` for various data-collection paths.
- **Google**: `Google-Extended` (publishers can opt out of Gemini training while still being indexed by regular Googlebot).
- **Microsoft**: `Bingbot`; the AI surfaces on Bing inherit Bingbot's rules.
- **ByteDance, Apple, Meta**: all maintain their own AI-related agents.

In practice these named crawlers do what they say they'll do: they identify themselves accurately in the User-Agent header, and disallow directives stick.

## Which ones might not

The honest answer: anyone who decides not to. Robots.txt is a convention, not a fence. The biggest practical exposures are third-party scrapers that feed open dataset aggregators (the corpora that get reused as training inputs across many models) and smaller commercial scrapers that crawl on behalf of analytics products. Common Crawl (`CCBot`) does honor robots.txt, and its archives are a frequent ingredient in foundation-model training, but content that escaped into derivative datasets before you blocked it can persist in places you can't reach.
There's also a steady supply of unbranded scrapers that ignore robots.txt entirely, whatever your file says. The only defenses against them are server-level rate limiting, authentication, or accepting the risk.

## What allow / disallow means in practice

Three things to keep separate:

1. **Training-time crawlers** (GPTBot, Google-Extended, ClaudeBot in its training role): disallowing keeps you out of *future* training data. It does nothing to content already in earlier model versions.
2. **Query-time crawlers** (ChatGPT-User, OAI-SearchBot, Perplexity-User): disallowing keeps you out of *live answers*. You become invisible at the moment the engine tries to fetch and cite you.
3. **Training-data echo**: none of the above affects what the model has memorized statistically from past training. A model can still echo your brand by name based on training-data exposure even when every active crawler is blocked.

Blocking is a forward-looking control, not a reset button.

## The right setup for most sites

If your goal is to be cited, as it is for almost every commercial site, **allow the query-time bots**: ChatGPT-User, OAI-SearchBot, PerplexityBot, Perplexity-User, and ClaudeBot where it serves retrieval. Each one of those is a path into an answer. Disallowing them is choosing invisibility.

Make a conscious, separate decision about training-time inclusion (GPTBot, Google-Extended): if you write content you don't want repurposed wholesale, you can disallow training-only crawlers without giving up live citation. Pair the robots.txt with an `llms.txt` that points engines at your canonical content.

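
One way to sanity-check such a setup before deploying it is to generate the file and parse it back with Python's standard `urllib.robotparser`. The sketch below is illustrative only: the agent groupings follow the training-time versus query-time split described above, the helper names (`build_robots_txt`, `allowed_agents`) and the `example.com` URL are hypothetical, and real crawlers may interpret edge cases differently from the standard-library parser.

```python
from urllib.robotparser import RobotFileParser

# Groupings follow the article's split; treat them as illustrative,
# since vendors add and rename agents over time.
TRAINING_ONLY_AGENTS = ["GPTBot", "Google-Extended"]
QUERY_TIME_AGENTS = [
    "ChatGPT-User",
    "OAI-SearchBot",
    "PerplexityBot",
    "Perplexity-User",
]


def build_robots_txt(block_training: bool) -> str:
    """Return robots.txt text that always allows the query-time agents.

    When block_training is True, the training-only agents are disallowed
    site-wide; every other agent keeps default access.
    """
    lines: list[str] = []
    for agent in QUERY_TIME_AGENTS:
        lines += [f"User-agent: {agent}", "Allow: /", ""]
    if block_training:
        for agent in TRAINING_ONLY_AGENTS:
            lines += [f"User-agent: {agent}", "Disallow: /", ""]
    lines += ["User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


def allowed_agents(robots_txt: str, url: str) -> dict[str, bool]:
    """Map each known agent to whether the parsed rules let it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    agents = TRAINING_ONLY_AGENTS + QUERY_TIME_AGENTS
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    robots_txt = build_robots_txt(block_training=True)
    print(robots_txt)
    # Hypothetical URL, used only to exercise the rules.
    for agent, ok in allowed_agents(robots_txt, "https://example.com/pricing").items():
        print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

Running the same check with `block_training=False` should report every listed agent as allowed, which makes the training-time decision an explicit switch rather than a side effect of the file's layout.
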
## Related answers

- [How do I rank in ChatGPT?](https://vizelo.ai/how-to-rank-in-chatgpt.html)
- [How does Perplexity decide which sources to cite?](https://vizelo.ai/how-perplexity-chooses-citations.html)
- [Why aren't my pages cited by ChatGPT?](https://vizelo.ai/why-am-i-not-cited-by-chatgpt.html)
- [How do I track when AI engines cite my brand?](https://vizelo.ai/how-to-track-ai-citations.html)
- [All answers](https://vizelo.ai/answers.html)