When I disallow AI crawlers in robots.txt, do they actually stop fetching my content?

Mostly yes. The named crawlers from the major AI vendors (OpenAI's GPTBot, ChatGPT-User, and OAI-SearchBot; Perplexity's PerplexityBot and Perplexity-User; Anthropic's ClaudeBot; Google's Google-Extended; Microsoft's Bingbot) publicly commit to honoring robots.txt and, in practice, do. The caveats matter. Third-party scrapers and the open datasets built from them (derivatives of Common Crawl, LAION-style scrapes) historically didn't always honor every site's directives, and content from those datasets may already be in current model training data. Blocking a crawler today doesn't retroactively erase what's already in a model. There's also a meaningful difference between blocking a training-time crawler (you don't want to be in future training data) and a query-time crawler (you don't want to be cited in live answers). For most sites the right setup is to allow the query-time bots (ChatGPT-User, OAI-SearchBot, PerplexityBot, Perplexity-User) because being citable is the goal, while making a conscious, separate decision about training-time crawlers such as GPTBot.

// Answer

Do AI engines respect robots.txt?

Mostly yes. The named crawlers from OpenAI, Anthropic, Perplexity, Google, and Microsoft honor it. The caveats are third-party scrapers, the persistence of training data already collected, and the difference between blocking a crawl and being absent from an answer.

// Honors robots.txt

Which crawlers honor it.

Every major vendor with a public AI product publishes the user agents it crawls under and commits to honoring robots.txt. The list as of 2026: OpenAI ships GPTBot (training-data crawler), ChatGPT-User (live browsing from a ChatGPT session), and OAI-SearchBot (indexing for ChatGPT search). Perplexity ships PerplexityBot (indexing) and Perplexity-User (live retrieval). Anthropic ships ClaudeBot and, separately, anthropic-ai for various data-collection paths. Google offers Google-Extended, a robots.txt token that lets publishers opt out of Gemini training while still being indexed by regular Googlebot. Microsoft uses Bingbot; the AI surfaces on Bing inherit Bingbot’s rules. ByteDance, Apple, and Meta all maintain their own AI-related agents. In practice these named crawlers do what they say they’ll do: the user-agent headers identify the crawler correctly and the disallow directives stick.
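To see which of these agents actually visit a site, one option is to match the User-Agent header in access logs against the tokens named above. The sketch below is illustrative only: the role labels paraphrase the list above, the sample header strings are made up for the example, and Google-Extended is left out on the assumption that it is a robots.txt token rather than a string that appears in request headers. User-Agent headers can be spoofed, so treat a match as a hint, not proof.

```python
# Tokens and roles follow the list in the text; verify them against each
# vendor's current documentation before relying on this mapping.
AI_AGENT_ROLES = {
    "GPTBot": ("OpenAI", "training"),
    "ChatGPT-User": ("OpenAI", "live browsing"),
    "OAI-SearchBot": ("OpenAI", "search indexing"),
    "PerplexityBot": ("Perplexity", "indexing"),
    "Perplexity-User": ("Perplexity", "live retrieval"),
    "ClaudeBot": ("Anthropic", "crawler"),
    "anthropic-ai": ("Anthropic", "data collection"),
    "bingbot": ("Microsoft", "search indexing"),
}


def classify_user_agent(header):
    """Return (token, vendor, role) for the first known token in the header, else None."""
    lowered = header.lower()
    for token, (vendor, role) in AI_AGENT_ROLES.items():
        if token.lower() in lowered:
            return token, vendor, role
    return None


if __name__ == "__main__":
    # Illustrative header strings, not copied from any vendor's documentation.
    samples = [
        "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)",
        "Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0",
    ]
    for sample in samples:
        print(sample, "->", classify_user_agent(sample))
```

Running this over a day of logs gives a rough picture of which named crawlers are fetching pages before any robots.txt rule is changed.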

// Doesn’t honor it

Which ones might not.

The honest answer: anyone who decides not to. Robots.txt is a convention, not a fence. The biggest practical exposures are third-party scrapers that feed open dataset aggregators (the corpora that get reused as training inputs across many models) and smaller commercial scrapers that crawl on behalf of analytics products. Common Crawl (CCBot) does honor robots.txt, and its archives are a frequent ingredient in foundation-model training, but content that escaped into derivative datasets before you blocked it can persist in places you can’t reach. There’s also a steady supply of unbranded scrapers that ignore robots.txt entirely, whatever your file says; the only defenses against them are server-level rate limiting, authentication, or accepting the risk.
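For the scrapers that ignore robots.txt, server-level rate limiting is one of the defenses mentioned above. A minimal sketch of the idea is a token bucket per client: each client gets a small burst allowance that refills at a fixed rate. The class name, the parameters, and the choice to key on a client identifier such as an IP address are all assumptions for illustration; in production this usually lives in the web server, CDN, or a shared store rather than in application memory.

```python
import time


class TokenBucket:
    """Per-client token bucket: `burst` requests at once, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.burst = burst
        self.state = {}  # client id -> (tokens remaining, time of last update)

    def allow(self, client, now=None):
        """Return True if the client may make a request at time `now`."""
        now = time.monotonic() if now is None else now
        tokens, last = self.state.get(client, (self.burst, now))
        # Refill for the elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.state[client] = (tokens - 1, now)
            return True
        self.state[client] = (tokens, now)
        return False


if __name__ == "__main__":
    limiter = TokenBucket(rate_per_sec=1.0, burst=2)
    for i in range(4):
        print(i, limiter.allow("203.0.113.7", now=0.0))
```

The `now` argument exists only so the behavior can be checked deterministically; real callers would omit it and let the monotonic clock drive the refill.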

// What blocking does

What allow / disallow means in practice.

Three things to keep separate. First, disallowing a training-time crawler (GPTBot, Google-Extended, ClaudeBot in its training role) keeps you out of future training data. It does nothing to content already in earlier model versions. Second, disallowing a query-time crawler (ChatGPT-User, OAI-SearchBot, Perplexity-User) keeps you out of live answers: you become invisible at the moment the engine tries to fetch and cite you. One caveat here is that some vendors document that user-triggered fetchers such as Perplexity-User may not apply robots.txt rules because a person initiated the request, so check each vendor’s current documentation. Third, none of this affects what the model has memorized statistically from past training. A model can still echo your brand by name based on training-data exposure even when every active crawler is blocked. Blocking is a forward-looking control, not a reset button.
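The split between training-time and query-time rules can be checked locally before a file ships. The sketch below uses Python's standard-library `urllib.robotparser` to evaluate a sample robots.txt that blocks GPTBot and allows ChatGPT-User. The file contents and the example.com URL are hypothetical, and the standard-library parser is a simple implementation, so real crawlers may resolve edge cases (such as overlapping Allow and Disallow rules) differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block the training crawler, allow live browsing.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: *
Allow: /
"""


def can_fetch(robots_txt, agent, url):
    """Evaluate a robots.txt string for one user-agent token and URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    url = "https://example.com/pricing"
    for agent in ("GPTBot", "ChatGPT-User", "SomeOtherBot"):
        print(agent, can_fetch(ROBOTS_TXT, agent, url))
```

A check like this only confirms what the file says to a compliant crawler; it says nothing about content already collected or about agents that ignore the file.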

// Recommended setup

The right setup for most sites.

If your goal is to be cited, which it is for almost every commercial site, allow the query-time bots: ChatGPT-User, OAI-SearchBot, PerplexityBot, Perplexity-User. Each one of those is a path into an answer. Disallowing them is choosing invisibility. Make a conscious, separate decision about training-time inclusion (GPTBot, Google-Extended, ClaudeBot in its training role): if you write content you don’t want repurposed wholesale, you can disallow training-only crawlers without giving up live citation. Pair the robots.txt with an llms.txt that points engines at your canonical content; the combination is what we recommend, and it’s what we run on vizelo.ai itself. Start free to see how your robots and llms configuration affects what engines actually do with you.
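As a rough sketch of that setup, the snippet below generates a robots.txt that keeps training-only crawlers out of selected paths while leaving query-time bots unrestricted, then checks the result with the standard-library parser. The function names, the `/research/` path, and the example.com URLs are hypothetical. ClaudeBot is deliberately left out of both lists because the text describes it in more than one role; here it falls under the catch-all group.

```python
from urllib.robotparser import RobotFileParser

# Agent groupings follow the text; adjust them to your own policy.
TRAINING_AGENTS = ["GPTBot", "Google-Extended"]
QUERY_TIME_AGENTS = ["ChatGPT-User", "OAI-SearchBot", "PerplexityBot", "Perplexity-User"]


def build_robots_txt(protected_paths):
    """Disallow training agents on protected paths; allow query-time agents everywhere."""
    lines = []
    for agent in TRAINING_AGENTS:
        lines.append(f"User-agent: {agent}")
        lines.extend(f"Disallow: {path}" for path in protected_paths)
        lines.append("")
    for agent in QUERY_TIME_AGENTS:
        lines.append(f"User-agent: {agent}")
        lines.append("Allow: /")
        lines.append("")
    lines += ["User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


def is_allowed(robots_txt, agent, url):
    """Check one agent and URL against a robots.txt string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    robots = build_robots_txt(["/research/"])  # hypothetical protected path
    print(robots)
    print(is_allowed(robots, "GPTBot", "https://example.com/research/report"))
    print(is_allowed(robots, "ChatGPT-User", "https://example.com/research/report"))
```

Generating the file from two explicit lists keeps the training-time and query-time decisions visibly separate, which is the point of the recommendation above.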

// Related

Common questions.

What’s the difference between GPTBot and ChatGPT-User?

GPTBot is OpenAI’s training-data crawler. ChatGPT-User is the user agent ChatGPT uses when a user’s session triggers a live browsing fetch. OAI-SearchBot powers the indexing behind ChatGPT’s search features. Blocking each one has a different consequence: blocking GPTBot keeps your content out of future training; blocking ChatGPT-User and OAI-SearchBot blocks live citation in real answers.

If I disallow GPTBot, will my brand still appear in ChatGPT?

Maybe. Content already in the training corpus before you disallowed it can still be echoed. Live browsing via ChatGPT-User is a separate channel — if you allow that bot, the model can still fetch you at query time. Most brands shouldn’t block both unless they have a specific reason to be invisible.

Do Common Crawl and other scrapers honor my robots.txt?

Common Crawl (CCBot) honors robots.txt, and its corpus is a frequent ingredient in foundation-model training. But many smaller scrapers that aggregate or repackage web data have historically been less consistent, and content that escapes into derivative datasets can persist in places you can’t unwind.

Does blocking AI crawlers protect me from being copied?

It reduces but doesn’t eliminate the risk. Robots.txt is a polite signal, not a security perimeter. If you have content that genuinely can’t be public, putting it behind authentication is the right control. Robots.txt is for shaping legitimate crawler behavior.

What’s a sensible default robots.txt setup for an AI-aware site?

Allow the query-time bots (ChatGPT-User, OAI-SearchBot, PerplexityBot, Perplexity-User) because being citable is the goal. Optionally disallow training-time-only crawlers (GPTBot, Google-Extended) on content you don’t want repurposed wholesale, and use server-level rate limits or auth for anything sensitive. Then ship an llms.txt to point engines at your canonical content.
