GEO Education · UK

llms.txt vs robots.txt: What UK Businesses Need to Know

llms.txt vs robots.txt: what each file does, how to configure both for AI visibility, and why blocking AI crawlers may be hurting your business in 2026.

Get Your Full Audit

See exactly how your site compares to competitors with our 200+ point audit.

Get Your Full Audit — £49

The Core Difference Between the Two Files

robots.txt controls which web crawlers can access which parts of your site — it is a 30-year-old standard that every major search engine respects. llms.txt is a new voluntary standard (proposed in 2024) that gives AI language models a curated, structured summary of your site and the content you most want used for context or citation. robots.txt talks to crawlers; llms.txt talks to AI systems. Both belong on your website in 2026, but they serve entirely different purposes: robots.txt sets access rules, while llms.txt only offers guidance and cannot block anything.


What robots.txt Does

robots.txt is a plain text file hosted at the root of your domain (e.g. www.seoandgeo.co.uk/robots.txt). It uses User-agent and Disallow directives to tell web crawlers which paths they can and cannot access.

A typical robots.txt looks like:

User-agent: *
Disallow: /admin/
Disallow: /checkout/
Allow: /

Sitemap: https://www.example.co.uk/sitemap.xml

What it controls:

  • Which bots (Googlebot, Bingbot, etc.) can crawl which paths
  • Whether your sitemap URL is declared for easy discovery
  • Whether staging environments, admin areas, or internal tools are hidden from crawlers

What it does not control:

  • Whether pages are indexed (a blocked page can still appear in Google's index if other pages link to it)
  • AI model training — GPTBot and ClaudeBot need separate handling
  • Whether AI systems cite your content in responses

Important: Blocking a page in robots.txt does not remove it from Google's index if it has already been indexed. Use a noindex meta tag instead, and leave the page crawlable so Google can see the tag.
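The noindex control lives in the page itself rather than in robots.txt. A minimal example:

```html
<!-- Placed in the <head> of the page you want kept out of the index.
     The page must remain crawlable, or Google will never see this tag. -->
<meta name="robots" content="noindex">
```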


What llms.txt Does

llms.txt is a voluntary specification proposed by Jeremy Howard in 2024. It is a markdown-formatted file hosted at yourdomain.com/llms.txt that provides a structured, machine-readable summary of your site specifically for large language models.

Unlike robots.txt, llms.txt is not a set of access rules. It is an information file — it tells AI systems:

  • What your site is about
  • Which pages contain your most important content
  • Which URLs are most relevant for citation
  • What context an AI needs to correctly represent your business

An example llms.txt:

# seoandgeo

> Combined SEO and GEO audit platform for UK small businesses.

## Key pages
- [How our audit works](https://www.seoandgeo.co.uk/insights/how-our-14-ai-agents-audit-your-website)
- [What is GEO?](https://www.seoandgeo.co.uk/blog/what-is-geo-ai-search-optimisation-guide)
- [Pricing](https://www.seoandgeo.co.uk/audit)

There is also a companion file llms-full.txt for sites that want to provide full content to AI systems rather than a curated index.
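Because llms.txt is just structured markdown, it is easy to generate from a list of your key pages. A minimal sketch in Python — the site name, summary, and page list below are placeholders, not real seoandgeo data:

```python
def build_llms_txt(site, summary, pages):
    """Build a minimal llms.txt body.

    pages: list of (title, url) tuples, most citable pages first.
    """
    lines = [f"# {site}", "", f"> {summary}", "", "## Key pages"]
    lines += [f"- [{title}]({url})" for title, url in pages]
    return "\n".join(lines) + "\n"

# Example values — swap in your own business details.
print(build_llms_txt(
    "example-shop",
    "Independent UK retailer of handmade furniture.",
    [("About us", "https://www.example.co.uk/about"),
     ("Pricing", "https://www.example.co.uk/pricing")],
))
```

Save the output to the web root as llms.txt and regenerate it whenever your pillar pages change.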

For a full explanation of how AI systems find and cite businesses, see Is your website visible to ChatGPT?


Why robots.txt Does Not Control AI Training

In 2023–2024, most major AI companies introduced their own crawler user-agent names:

Company     Crawler token
OpenAI      GPTBot
Google      Google-Extended
Anthropic   ClaudeBot
Meta        FacebookBot
Apple       Applebot-Extended

(Google-Extended and Applebot-Extended are control tokens honoured by those companies' existing crawlers, rather than separate bots you will see in your logs.)

Each of these respects robots.txt directives, but a crawler obeys only the most specific group that matches it. Rules aimed at Googlebot do nothing to GPTBot, and any bot with its own named group ignores the User-agent: * wildcard entirely. To block the AI crawlers explicitly, add:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

This is why many sites believed they were blocking AI crawlers when they were not: a bot follows its own named group where one exists and falls back to the User-agent: * wildcard only when it does not, so rules written for one crawler never carry over to another.
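You can check this group-matching behaviour with Python's standard urllib.robotparser before deploying a robots.txt (the domain below is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that allows everyone except GPTBot.
rules = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# GPTBot matches its own named group, so the wildcard never applies to it.
print(rp.can_fetch("GPTBot", "https://www.example.co.uk/page"))     # False
# ClaudeBot has no named group, so it falls back to User-agent: *.
print(rp.can_fetch("ClaudeBot", "https://www.example.co.uk/page"))  # True
```

Running the same check against your live file (rp.set_url(...) plus rp.read()) tells you exactly which crawlers can reach a given URL.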


When to Use robots.txt to Block AI Crawlers

There are legitimate reasons to restrict AI crawler access:

  • You have proprietary content — research, data, or original analysis you do not want used in AI training data
  • You have user-generated content with privacy sensitivities
  • You are building a competitive moat around original material

However, blocking AI crawlers also means your content cannot be cited by those systems when answering user questions. For most small business websites — where the goal is to be found by AI systems — blocking AI crawlers is counterproductive.


When to Use llms.txt to Guide AI Systems

llms.txt is the right tool when you want AI systems to find and cite you correctly:

  • You want to direct AI to your best content rather than letting it index everything and prioritise badly
  • Your site has a lot of content and you want AI systems to focus on specific pillar pages
  • You want to ensure accurate representation — providing structured context reduces the risk of AI misrepresenting your business
  • You are optimising for GEO (Generative Engine Optimisation) and want every advantage in AI citation

For most UK small businesses, the correct answer is: no AI blocking in robots.txt, and an llms.txt file that points AI systems toward your best pages.


The Relationship Between Both Files

File            Purpose                   Format                  Effect
robots.txt      Crawler access control    Plain-text directives   Hard rules that compliant bots follow
llms.txt        AI content guidance       Markdown                Soft guidance that AI systems may follow
sitemap.xml     Page discovery            XML                     Tells crawlers what exists
llms-full.txt   Full content for AI       Markdown                Extended AI context

None of these files replace each other. A complete site configuration in 2026 includes all four.

As AI search becomes a significant traffic channel, your robots.txt and llms.txt configuration determines whether AI systems can find, understand, and correctly cite your business. We cover both in the GEO section of every seoandgeo audit — see What makes our audit different for details on what we check.


Choose Your Configuration Based on Your Goal

Goal: Rank in Google and appear in AI search → Allow all crawlers, add llms.txt
Most small businesses should allow Googlebot, GPTBot, ClaudeBot, and Bingbot access. Add an llms.txt to guide AI systems toward your best content.

Goal: Protect proprietary content from AI training → Block specific AI crawlers in robots.txt
Add specific User-agent: GPTBot and User-agent: ClaudeBot blocks. Note this also prevents citation by those AI systems.

Goal: Protect some pages, expose others → Use path-specific robots.txt rules
Block admin and internal pages for all bots. Allow AI crawlers full access to your public-facing content.
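As a sketch of the third option, a path-specific robots.txt might look like this (the paths and sitemap URL are placeholders — adjust to your own site):

```
User-agent: *
Disallow: /admin/
Disallow: /checkout/

# Named groups override the wildcard, so repeat any path blocks here.
User-agent: GPTBot
Disallow: /admin/
Disallow: /checkout/
Allow: /

User-agent: ClaudeBot
Disallow: /admin/
Disallow: /checkout/
Allow: /

Sitemap: https://www.example.co.uk/sitemap.xml
```

Note that the blocks are repeated inside each named group: once a crawler has a group of its own, it ignores the User-agent: * rules entirely.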


Frequently Asked Questions

Does robots.txt stop AI companies from using my content? Robots.txt directives are voluntary — legally they are not enforceable. Reputable AI companies like OpenAI, Anthropic, and Google have committed to respecting robots.txt for their crawler user-agents. However, if content has already been scraped before you added the block, it may already be in training data. For new content going forward, explicit AI crawler blocks in robots.txt are the current best practice.

Is llms.txt an official standard? No — llms.txt is a community-proposed specification, not a W3C or IETF standard. However, it has gained adoption among GEO practitioners and some AI systems are beginning to recognise and process it. Its value is in providing structured context to AI systems rather than relying on them to infer what your site is about.

Can I have both robots.txt blocks and llms.txt on the same site? Yes, but it creates a contradiction — you are blocking AI access while also trying to guide AI understanding. If you want to be cited by AI systems, do not block their crawlers. If you are blocking their crawlers for legitimate reasons, llms.txt has limited value for those same systems.

How do I check if AI crawlers can access my site? Look for GPTBot, ClaudeBot, and Google-Extended entries in your server access logs, or check your robots.txt for any rules that might inadvertently block them. seoandgeo's audit checks AI crawler accessibility as part of the GEO section across all 3 audit tiers.
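A quick sketch of that log check in Python — the sample entries below stand in for lines from your real access log (e.g. /var/log/nginx/access.log; the path and log format vary by server):

```python
import re

# Sample log lines for illustration — replace with lines read from your log file.
log_lines = [
    '40.83.2.1 - - [01/Mar/2026] "GET / HTTP/1.1" 200 "-" '
    '"Mozilla/5.0; compatible; GPTBot/1.2; +https://openai.com/gptbot"',
    '66.249.66.1 - - [01/Mar/2026] "GET /blog HTTP/1.1" 200 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1)"',
]

# User-agent tokens the audit looks for, per the list above.
ai_crawlers = re.compile(r"GPTBot|ClaudeBot|Google-Extended")

hits = [line for line in log_lines if ai_crawlers.search(line)]
print(len(hits))  # number of AI-crawler requests found
```

If the count stays at zero over weeks of traffic, check your robots.txt, firewall, and CDN bot rules for anything blocking those user agents.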

What should my robots.txt say to maximise AI visibility? At minimum, ensure you are not blocking AI crawlers with a wildcard Disallow: /. Ideally, add explicit User-agent: GPTBot and Allow: / entries to make your intent clear. Then add an llms.txt pointing to your most citable pages.

Ready to see how you stack up?

14 AI agents. 200+ checks. A full competitive analysis that shows exactly where you’re winning — and where you need to improve.

Get Your Full Audit — £49