# Vizelo.ai — Do AI engines respect robots.txt?
# Source: https://vizelo.ai/do-ai-engines-respect-robots-txt.html
# Last reviewed: 2026-05-26

# Do AI engines respect robots.txt?

**Short answer:** Mostly yes. The named crawlers from OpenAI, Anthropic, Perplexity, Google, and Microsoft honor robots.txt directives. The caveats are third-party scrapers, the persistence of training data already collected, and the difference between blocking a crawl and being absent from an answer.

## Which crawlers honor it

Every major vendor with a public AI product publishes the user agents it crawls under and commits to honoring robots.txt. The list as of 2026:

- **OpenAI** — `GPTBot` (training-data crawler), `ChatGPT-User` (live browsing from a ChatGPT session), `OAI-SearchBot` (indexing for ChatGPT search).
- **Perplexity** — `PerplexityBot` (indexing), `Perplexity-User` (live retrieval).
- **Anthropic** — `ClaudeBot`, and separately `anthropic-ai` for various data-collection paths.
- **Google** — `Google-Extended` (publishers can opt out of Gemini training while still being indexed by regular Googlebot).
- **Microsoft** — `Bingbot`; the AI surfaces on Bing inherit Bingbot's rules.
- **ByteDance, Apple, Meta** — all maintain their own AI-related agents.

In practice these named crawlers do what they say they'll do: their User-Agent headers identify them accurately, and disallow directives stick.
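As a rough illustration of how such directives get evaluated, the sketch below uses Python's standard `urllib.robotparser` to report which of the agents named above may fetch a given URL. The sample robots.txt body, the `audit` helper, and the example.com URL are assumptions made for illustration. The standard-library parser is only an approximation of how each vendor's crawler interprets the file, so treat the result as a sanity check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# Agent names come from the list above; everything else here is a made-up example.
AI_AGENTS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot",
    "PerplexityBot", "Perplexity-User",
    "ClaudeBot", "Google-Extended", "Bingbot",
]

# Hypothetical robots.txt: block the training crawler, allow everyone else.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def audit(robots_txt: str, url: str) -> dict:
    """Return {agent: allowed?} for each AI agent under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in AI_AGENTS}


report = audit(SAMPLE_ROBOTS_TXT, "https://example.com/pricing")
for agent, allowed in report.items():
    print(f"{agent:16} {'allowed' if allowed else 'blocked'}")
```

Under this sample file only `GPTBot` is reported as blocked; every other agent falls through to the catch-all `*` group.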

## Which ones might not

The honest answer: anyone who decides not to. Robots.txt is a convention, not a fence.

The biggest practical exposures are third-party scrapers that feed open dataset aggregators — the corpora that get reused as training inputs across many models — and smaller commercial scrapers that crawl on behalf of analytics products. Common Crawl (`CCBot`) does honor robots.txt, and its archives are a frequent ingredient in foundation-model training, but content that escaped into derivative datasets before you blocked it can persist in places you can't reach.

There's also a steady supply of unbranded scrapers that ignore robots.txt entirely, whatever your file says. The only defenses against them are server-level rate limiting, authentication, or accepting the risk.
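Rate limiting is normally configured at the web server, CDN, or WAF, but the idea fits in a few lines. The following is a minimal in-memory sliding-window limiter keyed by a client identifier such as an IP address. `SlidingWindowLimiter`, its parameters, and the thresholds shown are hypothetical and chosen for illustration; this is a sketch of the technique, not a production design.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most `max_requests` per client within `window_seconds`."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client_id -> timestamps of accepted requests

    def allow(self, client_id: str, now: Optional[float] = None) -> bool:
        # `now` is injectable so the behavior can be tested deterministically.
        now = time.monotonic() if now is None else now
        hits = self._hits[client_id]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over budget: the caller would respond with HTTP 429
        hits.append(now)
        return True


# Hypothetical policy: 2 requests per 10 seconds per client.
limiter = SlidingWindowLimiter(max_requests=2, window_seconds=10.0)
decisions = [limiter.allow("203.0.113.7", now=t) for t in (0.0, 1.0, 2.0, 10.0)]
```

Because rejected requests are not recorded, a client that keeps hammering the server regains access as soon as its earlier accepted requests age out of the window. A stricter policy could record rejections too.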

## What allow / disallow means in practice

Three things to keep separate:

1. **Training-time crawlers** (GPTBot, Google-Extended, ClaudeBot in its training role) — disallowing keeps you out of *future* training data. It does nothing to content already in earlier model versions.
2. **Query-time crawlers** (ChatGPT-User, OAI-SearchBot, Perplexity-User) — disallowing keeps you out of *live answers*. You become invisible at the moment the engine tries to fetch and cite you.
3. **Training-data echo** — none of the above affects what the model has memorized statistically from past training. A model can still echo your brand by name based on training-data exposure even when every active crawler is blocked.

Blocking is a forward-looking control, not a reset button.

## The right setup for most sites

If your goal is to be cited, which is the goal for almost every commercial site, **allow the query-time bots**: ChatGPT-User, OAI-SearchBot, PerplexityBot, Perplexity-User, and ClaudeBot. Each one of those is a path into an answer. Disallowing them is choosing invisibility. GPTBot and Google-Extended are training-time controls, so they belong to the separate decision below.

Make a conscious, separate decision about training-time inclusion: if you write content you don't want repurposed wholesale, you can disallow training-only crawlers without giving up live citation. Pair the robots.txt with an `llms.txt` that points engines at your canonical content.
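One way to express that split is to generate the robots.txt from an explicit policy and then check the result. The sketch below does this with the standard library. The grouping follows the training-time and query-time distinction described earlier; `build_robots_txt` and the example.com URL are hypothetical, and ClaudeBot is left to the catch-all group because it is described above as serving more than one role.

```python
from urllib.robotparser import RobotFileParser

# Assumed policy: block training-only agents, explicitly allow query-time agents.
TRAINING_ONLY = ["GPTBot", "Google-Extended"]
QUERY_TIME = ["ChatGPT-User", "OAI-SearchBot", "PerplexityBot", "Perplexity-User"]


def build_robots_txt(blocked, allowed) -> str:
    """Emit one group per agent, followed by a permissive catch-all group."""
    lines = []
    for agent in blocked:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    for agent in allowed:
        lines += [f"User-agent: {agent}", "Allow: /", ""]
    lines += ["User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(TRAINING_ONLY, QUERY_TIME)

# Verify the generated file says what we intended before deploying it.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
url = "https://example.com/guides/pricing"
blocked_ok = all(not parser.can_fetch(agent, url) for agent in TRAINING_ONLY)
allowed_ok = all(parser.can_fetch(agent, url) for agent in QUERY_TIME)
print(robots_txt)
```

Keeping the policy in two named lists makes the training decision and the citation decision visible as separate choices, which is the point of the section above.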

## Related answers

- [How do I rank in ChatGPT?](https://vizelo.ai/how-to-rank-in-chatgpt.html)
- [How does Perplexity decide which sources to cite?](https://vizelo.ai/how-perplexity-chooses-citations.html)
- [Why aren't my pages cited by ChatGPT?](https://vizelo.ai/why-am-i-not-cited-by-chatgpt.html)
- [How do I track when AI engines cite my brand?](https://vizelo.ai/how-to-track-ai-citations.html)
- [All answers](https://vizelo.ai/answers.html)