Robots.txt Protocol

Stable · RFC 9309 (2022) · Crawl Control

Last updated: 2026-01-10

01 The Rule

Use robots.txt to block crawlers from low-value URL spaces: faceted navigation parameters, internal search results, admin areas, and session-based URLs. Never block CSS or JavaScript resources required for rendering.
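As a sketch, a minimal robots.txt applying this rule might look like the following. All paths, parameter names, and the domain are illustrative assumptions, not prescriptions:

```
# Illustrative robots.txt — every path and parameter below is a hypothetical example
User-agent: *
# Faceted navigation / filter parameters
Disallow: /*?color=
Disallow: /*?sort=
# Internal search results
Disallow: /search
# Admin and session-based URLs
Disallow: /admin/
Disallow: /*?sessionid=
# Keep rendering resources crawlable
Allow: /*.css$
Allow: /*.js$

Sitemap: https://www.example.com/sitemap.xml
```

Per RFC 9309, the most specific (longest) matching rule wins, so the `Allow` patterns for CSS and JS override broader `Disallow` rules that might otherwise catch them.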

02 Rationale

Robots.txt is the first line of crawl budget defense. Every URL path that's crawlable but shouldn't be indexed wastes crawl capacity. For sites with millions of faceted URLs, proper robots.txt rules can reduce crawl waste by 90%+. Note that robots.txt controls crawling, not indexing: a disallowed URL can still appear in search results if it is linked from elsewhere. Index control requires a noindex directive, which only works on pages that remain crawlable.

03 Implementation

  • Block faceted/filtered parameter URLs with Disallow patterns
  • Block internal search results (/search?q=*)
  • Block admin, staging, and API endpoints
  • Declare sitemap location with Sitemap: directive
  • Allow CSS and JS files (required for rendering-based indexing)
  • Validate rules with Search Console's robots.txt report (the standalone robots.txt Tester was retired in 2023) or an RFC 9309-compliant parser

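Rules can also be validated offline before deployment. The sketch below uses Python's standard-library robots.txt parser against a hypothetical rule set; the domain and paths are assumptions. Note that `urllib.robotparser` does simple prefix matching and does not expand mid-path `*` wildcards, so the rules here avoid them:

```python
# Sketch: checking Disallow/Allow rules offline with the stdlib parser.
from urllib import robotparser

RULES = """
User-agent: *
Disallow: /search
Disallow: /admin/
Allow: /assets/
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

# Product pages stay crawlable (no rule matches, default is allow)
print(rp.can_fetch("*", "https://example.com/products/widget"))  # True
# Internal search is blocked; the query string is included in the match
print(rp.can_fetch("*", "https://example.com/search?q=shoes"))   # False
# Rendering resources under /assets/ remain fetchable
print(rp.can_fetch("*", "https://example.com/assets/app.css"))   # True
```

Running a list of known-important URLs through `can_fetch` as a pre-deploy test catches overly broad `Disallow` rules before they reach production.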
04 Common Violations & Consequences

Violation: Blocking CSS/JS resources
Consequence: Googlebot can't render pages; mobile-first indexing fails and content stays invisible

Violation: No robots.txt file at all
Consequence: The entire URL space is crawlable, including parameters, internal search, and admin areas

Violation: Overly broad Disallow rules
Consequence: Important content is blocked from crawling and indexing

05 The Fix

Audit your crawlable URL space using log file analysis. Identify URL patterns consuming crawl budget without generating organic traffic. Add targeted Disallow rules for those patterns while ensuring important content remains accessible.
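The log-file audit can be sketched as a small script. Assuming combined log format, it counts crawler requests per top-level path section, split by whether the URL carries a query string; the bot name, sample lines, and URL patterns are hypothetical:

```python
# Sketch: surfacing crawl-budget waste from an access log.
# Assumes combined log format; all sample data is illustrative.
import re
from collections import Counter

LOG_LINE = re.compile(r'"(?:GET|HEAD) (?P<url>\S+) HTTP/[\d.]+"')

def crawl_profile(lines, bot="Googlebot"):
    """Count bot requests per top-level path, split by parameterized URLs."""
    counts = Counter()
    for line in lines:
        if bot not in line:          # crude UA filter for the sketch
            continue
        m = LOG_LINE.search(line)
        if not m:
            continue
        path, _, query = m.group("url").partition("?")
        section = "/" + path.lstrip("/").split("/", 1)[0]
        counts[(section, bool(query))] += 1
    return counts

sample = [
    '66.249.66.1 - - [10/Jan/2026:00:00:01 +0000] "GET /search?q=a HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
    '66.249.66.1 - - [10/Jan/2026:00:00:02 +0000] "GET /products/widget HTTP/1.1" 200 2048 "-" "Googlebot/2.1"',
    '203.0.113.9 - - [10/Jan/2026:00:00:03 +0000] "GET /search?q=b HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]
print(crawl_profile(sample))
```

Sections with heavy parameterized crawling but little organic traffic are the natural candidates for new Disallow rules; cross-check the same URLs against analytics data before blocking.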