A 30-year-old standard meets an 18-month-old proposal

robots.txt is one of the oldest conventions on the web. Martijn Koster proposed it in February 1994 — when there were perhaps two dozen web crawlers in total — as a voluntary protocol for sites to tell those crawlers which paths to skip. The format has barely changed since. It's plain text, lives at /robots.txt, and uses the Robots Exclusion Protocol's User-agent / Allow / Disallow directives. Search engines that want to be trusted honor it; spam bots ignore it. That tension is built into the design.

llms.txt is the modern counterpart, proposed by Jeremy Howard at Answer.AI in September 2024 (llmstxt.org). It solves a different problem: AI language models — especially ones with limited context windows and aggressive token budgets — benefit from a clean, human-curated summary of what a website is about, what its key resources are, and how to navigate them. Crawling the full site is expensive, slow, and produces noisy results. llms.txt is intentionally short, written in Markdown, and lives at /llms.txt alongside robots.txt.

The two files are sometimes confused because they sit in the same place and share the .txt extension. They do different things. Treating one as a replacement for the other is the most common mistake we see in early-2026 e-commerce GEO (generative engine optimization) work.

Side-by-side comparison

| Dimension | robots.txt | llms.txt |
|---|---|---|
| Year proposed | 1994 | September 2024 |
| Purpose | Tell crawlers which paths to skip or honor. | Tell language models what the site is and where the key resources live. |
| Format | Directives. User-agent: X + Allow: / Disallow:. | Markdown. H1 title + blockquote summary + sections of links. |
| Length | 5–50 lines typically. | 30–200 lines. Brief but human-readable prose, not exhaustive. |
| Audience | Crawlers, machine-only. Humans rarely read it. | Language models. Humans can also read it usefully. |
| Enforcement | Voluntary. Honored by major search engines; ignored by spam bots. | Voluntary and still emerging: honored by some AI tools, not yet by all. |
| Updates | Rarely. Usually only when adding a new bot to allow/block. | Whenever site identity, services, or key resources change. Monthly is reasonable for active sites. |
| Failure mode | Blocked a bot you wanted, or accidentally allowed one you wanted to block. | Out-of-date summary creates a stale impression of the brand in AI training data. |

Notice the structural difference: robots.txt answers "may I", llms.txt answers "what is this". A site without a robots.txt is interpreted as fully open to all crawlers. A site without an llms.txt is interpreted as a site you have to crawl from scratch to understand.

What robots.txt looks like

The classic syntax. One User-agent block per bot or bot family, followed by allow/disallow rules. The path patterns support a limited form of wildcards. Most sites end with a Sitemap: reference to make the XML sitemap discoverable.

/robots.txt
# Allow everyone, with explicit allow-lists for AI crawlers
User-agent: *
Allow: /

# Explicit allow for AI training and retrieval bots
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

Sitemap: https://yourstore.com/sitemap.xml

This is essentially what we ship on geonexa.ai (live at geonexa.ai/robots.txt). A few decisions worth understanding:

  • Default-allow with explicit AI bot entries. The User-agent: * block already covers AI crawlers under the default policy. We add explicit AI-bot entries anyway because (a) some AI bots check for their named entry first before falling back to the wildcard, and (b) an explicit allow signals intent, which is useful if you ever want to tighten policy later: you can change one explicit entry without flipping the wildcard.
  • What we don't ship is a Disallow block. Stores selling a public catalog usually have no SEO reason to disallow anything. The temptation to Disallow: /cart is misguided: the cart page depends on per-visitor session state, so Google is unlikely to index it anyway, and AI crawlers benefit from seeing the URL structure even if they can't fetch dynamic state.
  • Treat the Sitemap directive as mandatory if you have an XML sitemap. The protocol doesn't require it, but without it, crawlers find your sitemap only via Google Search Console submission, which AI crawlers don't use.
Common robots.txt mistake. Shipping User-agent: GPTBot + Disallow: / to "protect content from being trained on" is a popular but counterproductive pattern. Blocking GPTBot doesn't prevent OpenAI from learning about your brand: it learns about you through every third-party mention, every review, every comparison page. What it does prevent is OpenAI surfacing accurate, up-to-date information about your products. You lose the citation; the competitor doesn't.
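To sanity-check a policy before deploying it, you can evaluate it locally with Python's standard-library urllib.robotparser. The sketch below is illustrative only: the product URL and the helper name allowed_agents are made up for the example, and robotparser is one parser's reading of the rules, not a guarantee of how any particular bot behaves.

```python
from urllib.robotparser import RobotFileParser

# Default-allow policy in the style of the template above (shortened).
OPEN_POLICY = """\
User-agent: *
Allow: /

User-agent: GPTBot
Allow: /

Sitemap: https://yourstore.com/sitemap.xml
"""

# The "protect content" variant described in the callout above.
BLOCKING_POLICY = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /
"""


def allowed_agents(robots_txt, agents, url):
    """Return {agent: may_fetch} for one URL under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


def declared_sitemaps(robots_txt):
    """Return the Sitemap: URLs declared in the file (empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.site_maps() or []


# Hypothetical product URL, used only for illustration.
PRODUCT_URL = "https://yourstore.com/products/example"
open_result = allowed_agents(OPEN_POLICY, ["GPTBot", "ClaudeBot"], PRODUCT_URL)
blocked_result = allowed_agents(BLOCKING_POLICY, ["GPTBot", "ClaudeBot"], PRODUCT_URL)
```

Under the blocking variant, only GPTBot loses access; every other agent still falls through to the wildcard group, which is why a single named entry is enough to change policy for one bot.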

What llms.txt looks like

The llms.txt format is Markdown. The opening H1 is the brand name. The blockquote underneath is a one-sentence elevator pitch. Then come free-form context paragraphs (transparent about who you are, what you do, what stage you're at), followed by H2 sections that group links to key resources.

The canonical structure, as proposed in the original spec:

/llms.txt
# Brand Name

> One-sentence summary of what the brand is and what it does.

A few paragraphs of free-form context. What problem the brand solves,
who it serves, what stage it's at, and anything else a language model
should know up front to talk about the brand accurately.

## Services

- [Service one](https://yoursite.com/path): short description.
- [Service two](https://yoursite.com/path): short description.

## Resources

- [Blog](https://yoursite.com/blog/): editorial content, methodology.
- [Pricing](https://yoursite.com/pricing): current public tiers.

## Optional

- [Press kit](https://yoursite.com/press): for journalists.
- [Investor info](https://yoursite.com/investors): for context.

The ## Optional section is a convention in the spec: anything listed there is fair game to skip if the LLM is operating under tight token budgets. Anything in non-Optional sections is meant to be processed.
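To make the Optional convention concrete, here is a minimal sketch of how a consumer might read the structure above and drop the Optional section when tokens are tight. It is not how any specific AI tool parses the file; the function names, the sample text, and the simple line-based parsing are all assumptions for illustration.

```python
import re

# Matches "- [Title](url): optional note" link lines.
LINK_RE = re.compile(
    r"^-\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)]+)\)(?::\s*(?P<note>.*))?$"
)


def parse_llms_txt(text):
    """Split llms.txt into title, summary, context lines, and link sections."""
    doc = {"title": None, "summary": None, "context": [], "sections": {}}
    current = None  # name of the H2 section being read, if any
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("## "):
            current = line[3:].strip()
            doc["sections"][current] = []
        elif line.startswith("# ") and doc["title"] is None:
            doc["title"] = line[2:].strip()
        elif line.startswith(">") and doc["summary"] is None and current is None:
            doc["summary"] = line[1:].strip()
        elif current is not None:
            match = LINK_RE.match(line)
            if match:
                doc["sections"][current].append(match.groupdict())
        else:
            doc["context"].append(line)
    return doc


def urls_for_budget(doc, tight_budget):
    """Collect link URLs, skipping the Optional section under a tight budget."""
    urls = []
    for name, links in doc["sections"].items():
        if tight_budget and name.lower() == "optional":
            continue
        urls.extend(link["url"] for link in links)
    return urls


SAMPLE = """\
# Brand Name

> One-sentence summary.

Some free-form context.

## Services

- [Service one](https://yoursite.com/one): short description.

## Optional

- [Press kit](https://yoursite.com/press): for journalists.
"""

parsed = parse_llms_txt(SAMPLE)
```

With tight_budget=True the press kit link is skipped; with tight_budget=False both links are returned in file order.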

You can see our own llms.txt at geonexa.ai/llms.txt as a working example. We use it to disclose the brand's current state honestly — including the 4/100 starting AI visibility score — because trying to hide that from a language model is impossible. The model is going to see it through Rankie, through Perplexity citations, through every third-party reference. We'd rather get to summarize it ourselves than have the model piece it together from fragments.

Who actually honors each file in 2026

This is the question that determines whether either file is worth shipping. Below is our current read, based on published statements from AI vendors plus practical testing in May 2026. Adoption is a moving target: at least one cell in the matrix will be wrong within 12 months, but the directional shape is stable.

robots.txt — honored by

  • Googlebot: Yes, strictly
  • Bingbot: Yes, strictly
  • GPTBot (OpenAI): Yes
  • ClaudeBot (Anthropic): Yes
  • PerplexityBot: Yes
  • Google-Extended: Yes (Gemini training)
  • CCBot (Common Crawl): Yes
  • Spam scrapers: No, by design

llms.txt — honored by

  • Anthropic Claude: Yes, since Q4 2024
  • Cursor (IDE AI): Yes
  • Perplexity: Partial / inconsistent
  • Smaller AI search tools: Most do
  • OpenAI ChatGPT: Not yet announced
  • Google Gemini: Not yet announced
  • DeepSeek: No
  • Common Crawl: Not relevant (not a model)

The honest read of this matrix: robots.txt is universally honored by every crawler whose opinion you care about; llms.txt is honored by enough engines that early adoption is worth the 20 minutes it takes to write, and the adoption curve is steepening fast. Shipping llms.txt in 2026 is roughly where shipping a sitemap.xml was in 2007 — partial coverage, growing, free upside if it works, no downside if it doesn't.

Where each one shines

Use robots.txt to

  • Make your sitemap discoverable. The Sitemap: directive is the only convention that consistently surfaces an XML sitemap to non-Google crawlers.
  • Block administrative or private paths. Admin panels, search result pages with parameter explosions, faceted navigation pages that produce infinite URL combinations.
  • Signal intent to AI training bots. Even if the default policy already allows them, an explicit User-agent: GPTBot + Allow: / entry is a clear statement: this site is fair game for training. It can matter in disputes about scraping consent.
  • Throttle aggressive crawlers. The Crawl-delay: directive (honored by Bing and some smaller bots, ignored by Google) tells a bot to wait N seconds between requests. Useful if a poorly-behaved bot is hitting your origin.
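The blocking and throttling cases in the list above can be sketched and checked locally with urllib.robotparser. The paths (/admin/, /search) and the 10-second delay are illustrative assumptions, not recommendations for any particular store. Paths are kept as plain prefixes because Python's parser does simple prefix matching and does not implement wildcard patterns.

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy: block admin and internal search for everyone,
# and ask bingbot to wait 10 seconds between requests.
# A named group does not inherit from the wildcard group, so the
# Disallow rules are repeated inside the bingbot group.
THROTTLED_POLICY = """\
User-agent: *
Disallow: /admin/
Disallow: /search

User-agent: bingbot
Crawl-delay: 10
Disallow: /admin/
Disallow: /search
"""

throttle_parser = RobotFileParser()
throttle_parser.parse(THROTTLED_POLICY.splitlines())

# Hypothetical URLs on a placeholder domain.
admin_blocked = not throttle_parser.can_fetch(
    "Googlebot", "https://yourstore.com/admin/orders"
)
catalog_open = throttle_parser.can_fetch(
    "Googlebot", "https://yourstore.com/products/example"
)
bing_delay = throttle_parser.crawl_delay("bingbot")      # seconds, from its group
google_delay = throttle_parser.crawl_delay("Googlebot")  # None: no delay declared
```

Repeating the Disallow lines in the named group is the detail most often missed: once a bot matches its own group, the wildcard rules no longer apply to it.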

Use llms.txt to

  • Summarize the brand in machine-readable prose. Especially valuable when the brand's name is generic or could be confused with another entity (e.g., a SaaS product named "Pivot" needs to disambiguate itself in the first paragraph).
  • Surface key pages without making AI crawl the whole site. Link to your services, pricing, blog index, and contact page directly. Saves the language model tokens and surfaces the right pages.
  • Make stage and provenance honest. If you're new, say so. If you've pivoted recently, say so. AI engines will eventually figure it out; better to be the source of the explanation than the subject of a guess.
  • Update on a meaningful cadence. Monthly is reasonable. Each update should reflect real changes — new services, removed services, new resource pages, updated positioning. Don't change it for the sake of recency signal; AI engines don't read it like a feed.

The five most common mistakes

1. Treating llms.txt as a replacement for robots.txt

They serve different purposes and operate at different stages of an AI engine's pipeline. robots.txt gates whether the crawler fetches the site at all; llms.txt shapes what the model knows about the site once it does fetch. Removing one because you have the other breaks something.

2. Blocking AI bots in robots.txt to "protect content"

Blocking GPTBot doesn't prevent OpenAI from learning about your brand. It only prevents OpenAI from getting accurate, current information from your authoritative source. The brand still gets discussed in AI answers — based on Reddit threads, third-party reviews, and competitor comparison pages, all of which are usually less flattering than your own site.

3. Shipping an llms.txt longer than 250 lines

The format is intentionally summary-grade. If you have more to say, link to a longer page from inside llms.txt. The point is to give the model an executive summary it can hold in context cheaply, not to reproduce the site in prose form.
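A length and structure check is easy to automate. The sketch below is a hypothetical lint helper, assuming the structure described earlier (one H1, a blockquote summary, H2 link sections) and using the 250-line ceiling from this section as its default; the function name and messages are made up for illustration.

```python
def lint_llms_txt(text, max_lines=250):
    """Return a list of human-readable problems; an empty list means it passes."""
    lines = text.splitlines()
    problems = []

    # Exactly one top-level title ("# Brand Name").
    h1_count = sum(1 for line in lines if line.startswith("# "))
    if h1_count != 1:
        problems.append(f"expected exactly one H1 title, found {h1_count}")

    # A blockquote summary line ("> ...").
    if not any(line.startswith(">") for line in lines):
        problems.append("missing blockquote summary")

    # At least one H2 section grouping links.
    if not any(line.startswith("## ") for line in lines):
        problems.append("no H2 sections of links")

    # Keep the file summary-grade.
    if len(lines) > max_lines:
        problems.append(f"{len(lines)} lines exceeds the {max_lines}-line budget")

    return problems


good = "# Brand\n\n> Summary.\n\n## Services\n\n- [One](https://yoursite.com/one): x.\n"
too_long = "\n".join(["# Brand", "> Summary.", "## Services"] + ["- item"] * 300)
```

Running it in CI or a pre-deploy hook means an overlong or malformed file is caught before it ships.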

4. Forgetting the Sitemap directive in robots.txt

If your XML sitemap exists but isn't referenced in robots.txt, AI crawlers that don't use Search Console may never find it. It's a single missing line that can cost months of crawl coverage.

5. Letting llms.txt go stale

A six-month-old llms.txt referencing services you don't offer anymore, prices you've raised, or a brand description that no longer fits actively damages your AI representation. The model treats it as authoritative because you wrote it. Set a calendar reminder to revisit it quarterly at minimum.

A practical recipe: ship both, then check

Total work: about 25 minutes the first time.

  1. Write robots.txt using the template above, extended with a few more named bots. Default-allow with explicit entries for GPTBot, ClaudeBot, PerplexityBot, Google-Extended, OAI-SearchBot, ChatGPT-User, and CCBot. End with a Sitemap: line pointing at your XML sitemap.
  2. Write llms.txt using the structure above. One H1, one blockquote, two or three context paragraphs, then sections for Services / Resources / Optional. Link out to your most important pages.
  3. Ship both to /robots.txt and /llms.txt at the site root. On Netlify, plain static files at site root just work. On Shopify, robots.txt is overridden with a robots.txt.liquid template, and llms.txt needs a workaround such as a Cloudflare Worker because the platform doesn't let you serve raw files at root (see the Shopify note below).
  4. Validate accessibility by curling each from the public web: curl -I https://yoursite.com/robots.txt and curl -I https://yoursite.com/llms.txt. Both should return HTTP 200 with a plain-text Content-Type (text/plain, or text/markdown for llms.txt). If you see a redirect to /index.html, or a 200 served as text/html, your SPA catch-all is swallowing the file, so fix the routing config.
  5. Submit the sitemap to Google Search Console and Bing Webmaster Tools. Optional but cheap; both tools verify that robots.txt is reachable and that the sitemap URL inside it is accessible.
  6. Re-check quarterly. Both files drift. Stale entries cost you more than the maintenance does.
A note on Shopify specifically. Shopify ships its own auto-generated robots.txt based on store settings. You can override it by adding a robots.txt.liquid template file. Custom llms.txt on Shopify is more awkward — the default behavior is for the path to fall through to a 404 because Shopify doesn't know what to do with it. The common workaround is a Cloudflare Worker (or Shopify Hydrogen route, if you're on Hydrogen) that serves the file directly. We've documented the Shopify-specific recipe in our Schema.org for Shopify companion post.
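The curl check in step 4 of the recipe can also be scripted. This is a minimal sketch using only the Python standard library; the function names and messages are made up for illustration, and treating any text/html response as a failure is an assumption based on the SPA catch-all problem described in that step.

```python
import urllib.error
import urllib.request


def diagnose(status, content_type):
    """Classify a HEAD response for /robots.txt or /llms.txt."""
    if 300 <= status < 400:
        return "redirected: the file is not served at this path"
    if status != 200:
        return f"unexpected status {status}"
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "text/html":
        return "served as HTML: a catch-all route is probably swallowing the file"
    return "ok"


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Surface 3xx responses as errors instead of silently following them."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def check(url):
    """Send a HEAD request (like curl -I) and return a diagnosis string."""
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=10) as response:
            return diagnose(response.status, response.headers.get("Content-Type", ""))
    except urllib.error.HTTPError as err:
        # Raised for 3xx (redirects are not followed), 4xx, and 5xx.
        return diagnose(err.code, err.headers.get("Content-Type", ""))
```

Usage would look like check("https://yoursite.com/robots.txt") and check("https://yoursite.com/llms.txt"), each of which should come back as "ok" on a correctly configured site.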

Where this fits in a GEO foundation

robots.txt and llms.txt together cost about a kilobyte and 25 minutes of work. They are the cheapest, lowest-friction lever you have to influence how AI engines see your site. They are also the easiest to get wrong in a way that quietly costs months of citation visibility — usually because someone blocked an AI bot they meant to allow, or because the llms.txt went stale.

Both files are part of GeoNexa's standard foundation build, alongside Schema.org structured data and the third-party citation seeding work that compounds over the following months. We ship them on day one and re-audit them at days 30, 60, and 90 of every engagement.

If you want to see what production examples look like rather than just templates: our own live files are at geonexa.ai/robots.txt and geonexa.ai/llms.txt. The full story of why we're shipping ours so transparently — including our 4/100 baseline AI visibility score — lives at Case Study Zero.

Want this shipped on your store?

Both files plus the rest of the AI search foundation, done for you. Book a free 30-minute audit to see what's missing.
