Stable · Sitemaps.org Protocol (2008) · Discovery

XML Sitemaps

Last updated: 2026-01-15

01 The Rule

Every site must provide an XML sitemap listing all canonical, indexable URLs. Sites exceeding 50,000 URLs must use a sitemap index. Only include URLs that return HTTP 200, are not blocked by robots.txt, and do not carry a noindex directive.
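
As a rough sketch of how the 50,000-URL threshold plays out in practice, the snippet below estimates how many child sitemap files a site needs and whether a sitemap index is required; plan_sitemaps and canonical_urls are illustrative names, not part of any real tooling.

  # Sketch of the 50,000-URL rule; plan_sitemaps and canonical_urls are
  # illustrative names, not a real API.
  import math

  SITEMAP_URL_LIMIT = 50_000  # per-file limit from the sitemaps.org protocol

  def plan_sitemaps(canonical_urls):
      """Return (number of sitemap files, whether a sitemap index is required)."""
      n_files = max(1, math.ceil(len(canonical_urls) / SITEMAP_URL_LIMIT))
      return n_files, n_files > 1

  # Example: a 1.2M-URL site needs 24 child sitemaps plus a sitemap index.
  print(plan_sitemaps(range(1_200_000)))  # (24, True)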

02 Rationale

Sitemaps are the most direct way to tell search engines about your URL space. For large sites, sitemaps are often the primary discovery mechanism for deep content that isn't well-linked internally. Accurate lastmod values drive crawl prioritization.

03 Implementation

  • Maximum 50,000 URLs per sitemap file, 50MB uncompressed
  • Use <sitemapindex> for sites exceeding 50K URLs
  • Include accurate <lastmod> dates — only update when content changes
  • Segment sitemaps by content type (products, categories, blog)
  • Gzip compress sitemap files
  • Reference the sitemap location in robots.txt; a generation sketch covering these points follows below
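
The sketch below pulls these points together under stated assumptions: SITE, OUT_DIR, and the sections dict are placeholders for your own URL database and hosting layout, the output directory is assumed to exist, and error handling is omitted. It writes one gzipped sitemap per 50,000 URLs per content type, then a sitemap index pointing at all of them.

  # Generation sketch; SITE, OUT_DIR, and the sections dict are placeholders
  # for your own URL database and hosting layout.
  import gzip
  from xml.sax.saxutils import escape

  SITE = "https://www.example.com"   # placeholder origin
  OUT_DIR = "public"                 # assumed to already exist
  LIMIT = 50_000                     # protocol limit per sitemap file

  def write_sitemap(path, entries):
      """Write one gzipped <urlset>; lastmod comes straight from the content record."""
      lines = ['<?xml version="1.0" encoding="UTF-8"?>',
               '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
      for loc, lastmod in entries:
          lines.append(f"  <url><loc>{escape(loc)}</loc>"
                       f"<lastmod>{lastmod}</lastmod></url>")
      lines.append("</urlset>")
      with gzip.open(path, "wt", encoding="utf-8") as f:
          f.write("\n".join(lines))

  def write_index(path, sitemap_urls):
      """Write the <sitemapindex> that points at each child sitemap."""
      lines = ['<?xml version="1.0" encoding="UTF-8"?>',
               '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
      for url in sitemap_urls:
          lines.append(f"  <sitemap><loc>{escape(url)}</loc></sitemap>")
      lines.append("</sitemapindex>")
      with gzip.open(path, "wt", encoding="utf-8") as f:
          f.write("\n".join(lines))

  def generate(sections):
      """sections maps a segment name (products, categories, blog) to a list of
      (canonical_url, lastmod) tuples from the canonical URL database."""
      child_urls = []
      for name, entries in sections.items():
          for i in range(0, len(entries), LIMIT):  # split at 50,000 URLs
              filename = f"sitemap-{name}-{i // LIMIT + 1}.xml.gz"
              write_sitemap(f"{OUT_DIR}/{filename}", entries[i:i + LIMIT])
              child_urls.append(f"{SITE}/{filename}")
      write_index(f"{OUT_DIR}/sitemap-index.xml.gz", child_urls)
      # robots.txt then carries one line:
      # Sitemap: https://www.example.com/sitemap-index.xml.gz

Search engines accept gzipped sitemap files, and the index itself is subject to the same 50,000-entry and 50MB uncompressed limits as a regular sitemap.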

04 Common Violations & Consequences

Violation: Including noindex or redirecting URLs in the sitemap
Consequence: Conflicting signals; the sitemap says 'index this' while the page says 'don't'

Violation: Stale lastmod dates (never updated)
Consequence: Google ignores lastmod signals for your entire domain

Violation: Single monolithic sitemap for a 1M+ URL site
Consequence: File too large to process; crawlers may abandon the download

05 The Fix

Generate sitemaps programmatically from your canonical URL database. Validate that every sitemap URL returns HTTP 200 and matches its canonical tag. Segment by content type and update lastmod only on actual content changes.
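
A minimal validation pass along those lines might look like the sketch below. It assumes the third-party requests library, uses deliberately simplistic regexes (they expect rel= to appear before href= and ignore JavaScript-rendered tags), and the file list at the bottom is a placeholder for your generated sitemaps.

  # Validation sketch; assumes the requests library and locally generated,
  # gzipped sitemap files. The regexes are simplistic by design.
  import gzip
  import re
  import requests

  LOC_RE = re.compile(r"<loc>([^<]+)</loc>")
  CANONICAL_RE = re.compile(
      r'<link[^>]+rel=["\']canonical["\'][^>]+href=["\']([^"\']+)["\']', re.I)

  def locs(path):
      """Yield every <loc> value from one gzipped sitemap file."""
      with gzip.open(path, "rt", encoding="utf-8") as f:
          yield from LOC_RE.findall(f.read())

  def check(url):
      """Return a list of problems for one sitemap URL (empty list means OK)."""
      problems = []
      resp = requests.get(url, allow_redirects=False, timeout=10)
      if resp.status_code != 200:
          problems.append(f"status {resp.status_code}, expected 200")
      if "noindex" in resp.headers.get("X-Robots-Tag", "").lower():
          problems.append("noindex sent via X-Robots-Tag header")
      match = CANONICAL_RE.search(resp.text)
      if match and match.group(1).rstrip("/") != url.rstrip("/"):
          problems.append(f"canonical points elsewhere: {match.group(1)}")
      return problems

  for path in ["public/sitemap-products-1.xml.gz"]:  # your generated files
      for url in locs(path):
          for problem in check(url):
              print(f"{url}: {problem}")

Running a pass like this on every deploy catches redirected, deleted, or de-canonicalized URLs before search engines re-fetch the sitemap.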