Intelligence Report

What is an XML Sitemap? (And Why Most Sites Are Using Them Wrong)

Submitting a sitemap isn't a strategy. Here's the crawl architecture framework that separates sites Google loves from sites Google ignores.

Most XML sitemap guides stop at 'submit it to Search Console.' This guide goes deeper: architecture, priority signals, and frameworks that actually move rankings.

Authority Specialist Editorial Team, SEO Strategists
Last Updated: March 2026

What You'll Learn

  1. An XML sitemap is a structured signal file — not a magic ranking lever — and treating it like one is why most sites see zero lift after submission.
  2. The 'Crawl Prioritisation Stack' framework: how to structure sitemaps so Google allocates crawl budget to your highest-value pages first.
  3. Why including every URL in your sitemap is one of the most common and damaging technical SEO mistakes.
  4. The 'Sitemap Signal Hierarchy' method: using lastmod, changefreq, and priority tags strategically, not as filler.
  5. Index sitemap vs. individual sitemap: when to split, when to combine, and the threshold that matters.
  6. How e-commerce, SaaS, and content-heavy sites each need a fundamentally different sitemap architecture.
  7. What sitemap errors in Google Search Console are actually telling you (and the ones worth acting on vs. ignoring).
  8. The pre-submission audit checklist that stops you from signalling low-quality pages to Google before you even rank.
  9. How AI-powered crawlers and SGE index models are changing what a 'good' sitemap looks like in 2025 and beyond.

Introduction

Here's the contrarian truth no sitemap guide wants to say upfront: submitting an XML sitemap to Google Search Console does almost nothing on its own. It is not a ranking signal. It will not rescue a poorly structured site. And yet, teams spend hours debating priority values and changefreq tags as if those micro-decisions will move the needle. They won't — unless everything else is already right.

The XML sitemap is best understood as a communication protocol between your site and a search engine crawler. Think of it less like a filing system and more like a tour guide briefing: you're telling Google which rooms exist, which ones have been recently renovated, and which ones are worth its time. The problem? Most guides teach you how to generate the file. Almost none teach you how to think about what goes in it, what stays out, and how to structure it as part of a broader crawl strategy.

When I first started auditing technical SEO for scaling businesses, sitemap files were almost always an afterthought — auto-generated by a plugin, submitted once, and never revisited. What I found consistently was that the sites struggling most with crawl efficiency and index coverage had sitemaps full of thin pages, redirects, and canonicalised-away URLs. Google was being sent to the wrong rooms, repeatedly.

This guide is designed to fix that. You'll get a full working definition, real XML examples, the frameworks we use to architect sitemaps for authority-driven growth, and the mistakes we see even experienced operators making. If you already 'have a sitemap,' this guide will help you audit whether it's working for you or quietly working against you.
Contrarian View

What Most Guides Get Wrong

The standard advice goes like this: generate a sitemap, submit it in Google Search Console, make sure it returns a 200 status, and move on. That framing reduces the sitemap to an administrative checkbox — something to tick off during a site launch and never think about again.

What that advice misses entirely is the sitemap's role in crawl budget management. For sites with fewer than a few hundred pages, crawl budget is rarely a concern. But for growing e-commerce stores, SaaS platforms with dynamic URLs, or content hubs publishing at volume, what you include in your sitemap directly affects which pages get crawled frequently, which get crawled rarely, and which effectively fall off Google's radar.

Most guides also teach you to include every URL your CMS generates. That sounds logical — surely you want Google to find everything? But including low-quality, thin, or near-duplicate pages in your sitemap actively degrades the signal quality of the file as a whole. You're telling Google to pay attention to pages that don't deserve attention, while diluting discovery of the pages that do.

The other common error: treating XML sitemaps as static documents. A sitemap should evolve with your content strategy, your site architecture, and your growth priorities.

Strategy 1

What Exactly is an XML Sitemap? (The Definition That Actually Matters for SEO)

An XML sitemap is a structured file, written in Extensible Markup Language (XML), that lists the URLs on your website you want search engines to discover and crawl. It sits at a readable web address — typically yoursite.com/sitemap.xml — and follows a standardised format defined by the Sitemaps protocol, which Google, Bing, and other major search engines all support.

The core function is simple: it's a roadmap. When Googlebot arrives at your domain, it doesn't automatically know every page that exists, especially on large, complex sites or newly published content. The sitemap removes the guesswork. It says, 'Here is a complete list of the URLs I want you to know about.'

But here's where the definition needs to go deeper than most guides allow: a sitemap is a signal file, not a guarantee. Google has explicitly stated that including a URL in a sitemap does not force it to crawl or index that URL. What the sitemap does is increase the likelihood and speed of discovery — particularly for pages that aren't well-linked internally.

A basic XML sitemap entry looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.yoursite.com/example-page/</loc>
    <lastmod>2025-01-15</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>

The four core tags each play a role:

  • loc (required): The canonical URL of the page.
  • lastmod (recommended): The date the page was last meaningfully changed. This is more valuable than most people realise — it helps Google prioritise recrawling updated content.
  • changefreq (optional): A hint about how often the page changes. Google treats this as advisory, not binding.
  • priority (optional): A relative value from 0.0 to 1.0 indicating how important this URL is compared to other URLs on your site. Default is 0.5.

For most sites, the sitemap file is generated automatically by a CMS plugin or platform. The critical SEO work isn't in the generation — it's in the curation.
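To make the curation point concrete, here is a minimal sketch of generating a sitemap from an explicitly curated list of entries, using only Python's standard library. The URLs and metadata values are hypothetical placeholders; the point is that the input list, not the generator, is where the SEO decisions live.

```python
# Minimal sketch: build a sitemap from a curated entry list.
# Entry values below are hypothetical examples.
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """entries: list of dicts with a required 'loc' key and optional
    'lastmod', 'changefreq', and 'priority' keys."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = entry["loc"]
        # Only emit the optional tags the curator actually set.
        for tag in ("lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="unicode")

xml_out = build_sitemap([
    {"loc": "https://www.yoursite.com/example-page/",
     "lastmod": "2025-01-15", "changefreq": "monthly", "priority": 0.8},
])
```

Because the function takes a plain list, the curation step (deciding which URLs deserve inclusion) stays separate from the serialisation step.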

Key Points

  • An XML sitemap lists URLs you want search engines to discover — it's a signal, not a command.
  • The four core tags are loc, lastmod, changefreq, and priority — only loc is required.
  • Sitemaps are especially valuable for large sites, new sites, or sites with poor internal linking.
  • Google uses lastmod to prioritise recrawling — keeping it accurate is more important than most realise.
  • The sitemap protocol is supported by all major search engines, not just Google.
  • Your sitemap lives at a publicly accessible URL, typically /sitemap.xml or /sitemap_index.xml.

💡 Pro Tip

Always verify that your sitemap URL is referenced in your robots.txt file using the 'Sitemap:' directive. This ensures all crawlers, not just those you've notified via Search Console, can find it without relying on manual submission alone.
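The directive itself is a single line in robots.txt. A hedged example, with a placeholder domain you would replace with your own:

```text
# robots.txt — example only; substitute your own host
User-agent: *
Sitemap: https://www.yoursite.com/sitemap.xml
```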

⚠️ Common Mistake

Setting every URL to priority 1.0. This tells Google nothing useful about relative importance — it's the equivalent of highlighting every word in a book. Use a genuine hierarchy: 1.0 for homepage and core service/product pages, 0.8 for key category pages, 0.6 for blog posts, 0.4 for supporting content.

Strategy 2

The Crawl Prioritisation Stack: A Framework for Sitemap Architecture That Actually Drives Results

This is the framework I wish had existed when I started doing technical SEO audits. I've named it the Crawl Prioritisation Stack because it forces you to think about your sitemap not as a flat list of URLs but as a tiered signal system that guides Googlebot toward your highest-value content first.

The core insight: crawl budget is finite. Google doesn't crawl every page of every site with equal frequency. Sites with strong authority and fast servers get more generous allocations, but no site gets unlimited crawl attention. The question your sitemap needs to answer is: if Google could only visit 20% of my pages this week, which 20% matter most?

The Crawl Prioritisation Stack has three tiers:

Tier 1 — Revenue Pages (Priority 0.9 – 1.0) These are your core landing pages, product or service pages, and conversion-critical content. They should be updated frequently, internally linked aggressively, and assigned the highest priority values. If these pages aren't being crawled and indexed reliably, everything else is moot.

Tier 2 — Authority Pages (Priority 0.7 – 0.8) These are your cornerstone content pieces, category pages, and pillar articles — pages designed to rank for competitive terms and build topical authority. They support Tier 1 by funnelling organic traffic toward conversion. Update these regularly and ensure lastmod values are accurate.

Tier 3 — Supporting Content (Priority 0.5 – 0.6) Blog posts, FAQ pages, supporting articles, and supplementary landing pages. These are valuable but not your primary crawl focus. Many sites have far too many Tier 3 pages included in their sitemap without the Tier 1 and Tier 2 infrastructure to support them.

What stays out entirely: Paginated archive pages (unless they have unique SEO value), tag and category pages with thin content, URLs with tracking parameters, pages blocked by robots.txt, pages with noindex tags, redirected URLs, and canonicalised-away duplicates.

When I started applying this framework to site audits, the consistent finding was that 30 – 50% of URLs in auto-generated sitemaps shouldn't be in the sitemap at all. Removing them tightened the crawl signal and, in several cases, accelerated indexing of the pages that actually mattered.
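The Crawl Prioritisation Stack can be sketched as a simple filter-and-assign pass. The tier names and priority bands follow the framework above; the page-type labels and exclusion flags are hypothetical field names you would map onto your own CMS data.

```python
# Sketch of the three-tier Crawl Prioritisation Stack.
# Tier priorities follow the article; field names are assumptions.
TIER_PRIORITY = {
    "revenue": 1.0,     # Tier 1: core service/product pages
    "authority": 0.8,   # Tier 2: pillar and category pages
    "supporting": 0.6,  # Tier 3: blog posts, FAQs
}

# Reasons a URL should never enter the sitemap at all.
EXCLUDE_REASONS = {"noindex", "redirect", "canonicalised", "blocked", "thin"}

def sitemap_entries(pages):
    """pages: list of dicts with 'url', 'tier', and an optional
    'exclude' reason. Returns curated sitemap entries."""
    entries = []
    for page in pages:
        if page.get("exclude") in EXCLUDE_REASONS:
            continue  # don't signal pages Google shouldn't visit
        entries.append({"loc": page["url"],
                        "priority": TIER_PRIORITY[page["tier"]]})
    return entries
```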

Key Points

  • Tier 1 (Revenue Pages): Highest priority, most frequent lastmod updates, maximum internal linking.
  • Tier 2 (Authority Pages): Cornerstone content and category pages that build topical depth.
  • Tier 3 (Supporting Content): Lower priority, still included if they offer genuine value.
  • Pages to exclude: noindex, redirects, canonicalised duplicates, thin content, parameter-based URLs.
  • The goal is signal clarity — every URL you include makes a claim about its value.
  • Re-audit your sitemap quarterly as your content strategy evolves.
  • A smaller, cleaner sitemap almost always outperforms a bloated one for crawl efficiency.

💡 Pro Tip

After removing low-value URLs from your sitemap, cross-reference the excluded list against your internal link structure. Pages that are weakly linked AND excluded from the sitemap are effectively invisible to Google — either improve their internal link equity or consolidate them into stronger pages.

⚠️ Common Mistake

Including paginated pages (page/2, page/3, etc.) in the sitemap without a clear SEO rationale. Unless these pages have unique ranking value, they dilute your sitemap signal and waste crawl budget on pages Google will often de-prioritise anyway.

Strategy 3

Sitemap Types Explained: Which Structure Does Your Site Actually Need?

Not all sitemaps are built the same, and choosing the wrong structure for your site type is a common technical oversight. There are four main sitemap types, each serving a distinct purpose.

1. XML URL Sitemap The standard sitemap format. Lists individual URLs with optional metadata. Best for most sites with fewer than 50,000 URLs and a file size under 50MB uncompressed. This is what most people mean when they say 'XML sitemap.'

2. Sitemap Index File When your site exceeds 50,000 URLs or the 50MB file size limit, you need a sitemap index — a parent file that references multiple child sitemap files. This is common for large e-commerce stores and content-heavy platforms.

A basic sitemap index looks like:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.yoursite.com/sitemap-products.xml</loc>
    <lastmod>2025-02-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://www.yoursite.com/sitemap-blog.xml</loc>
    <lastmod>2025-02-01</lastmod>
  </sitemap>
</sitemapindex>

This structure also gives you architectural control — you can segment by content type, which is strategically useful.
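Generating a segmented index programmatically is straightforward. This is a minimal sketch assuming the child sitemaps already exist at the listed (hypothetical) URLs:

```python
# Sketch: build a sitemap index from a list of child sitemap URLs.
# Child URLs and the lastmod date are placeholder examples.
import xml.etree.ElementTree as ET

def build_sitemap_index(children, lastmod):
    root = ET.Element("sitemapindex",
                      xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for child_url in children:
        sm = ET.SubElement(root, "sitemap")
        ET.SubElement(sm, "loc").text = child_url
        ET.SubElement(sm, "lastmod").text = lastmod
    return ET.tostring(root, encoding="unicode")

index_xml = build_sitemap_index(
    ["https://www.yoursite.com/sitemap-products.xml",
     "https://www.yoursite.com/sitemap-blog.xml"],
    "2025-02-01",
)
```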

3. Image Sitemap A dedicated sitemap (or sitemap extension) listing image URLs. Particularly valuable for photography sites, e-commerce product pages, and media publishers who want images indexed in Google Image Search. Use the image XML namespace extension within your standard sitemap.

4. Video Sitemap For sites publishing video content, a video sitemap provides metadata (title, description, thumbnail URL, duration) that helps Google index and surface video content in search results and video carousels.

Choosing the right structure by site type:

  • SaaS / Lead Gen sites (under 500 pages): Single XML sitemap, manually curated.
  • Content hubs / editorial sites (500 – 10,000 pages): Single sitemap or split by content type (posts vs. pages vs. authors).
  • E-commerce stores (10,000+ URLs): Sitemap index with child sitemaps segmented by product category, blog, and static pages.
  • Media / news publishers: Consider a Google News sitemap in addition to standard sitemaps, enabling inclusion in the Top Stories carousel.

The segmentation approach in sitemap index files is underused as a strategic tool. When you separate products, blog posts, and landing pages into discrete child sitemaps, you can monitor crawl performance by content type in Google Search Console — a level of visibility that a single monolithic sitemap file simply doesn't provide.

Key Points

  • Standard XML sitemap: best for sites under 50,000 URLs and 50MB file size.
  • Sitemap index: required above those thresholds, also valuable for segmentation strategy.
  • Image and video sitemaps extend discovery into specialised search surfaces.
  • Google News sitemaps unlock the Top Stories carousel for eligible publishers.
  • Segmenting by content type in a sitemap index enables per-content-type performance monitoring.
  • Always reference all child sitemaps in your sitemap index — partial submissions are a common oversight.
  • Choose structure based on site scale and content mix, not just URL count.

💡 Pro Tip

If you run an e-commerce store, create separate child sitemaps for in-stock products vs. out-of-stock products. This makes it easier to deprioritise or temporarily remove out-of-stock product URLs from your sitemap during prolonged stockouts, keeping crawl budget focused on pages that can actually convert.

⚠️ Common Mistake

Submitting only the sitemap index URL to Search Console and assuming the child sitemaps are automatically tracked. Verify that each child sitemap appears individually in the 'Sitemaps' report and shows expected URL counts — mismatched numbers often reveal generation errors.

Strategy 4

The Sitemap Signal Hierarchy: Using lastmod, changefreq, and Priority Tags Strategically

Most guides mention these three optional tags and then tell you to set them and forget them. That's a missed opportunity. The Sitemap Signal Hierarchy framework treats these tags as active communication tools — a way to consistently tell Google which content is fresh, important, and worth revisiting.

lastmod: The Most Underestimated Tag

Of the three optional tags, lastmod is the one Google actually pays the most attention to. When the lastmod value changes on a URL, it signals to Googlebot that the page has been updated and is worth recrawling. Used accurately, this can meaningfully accelerate the recrawling of refreshed content.

The operative word is accurately. Many CMS platforms update lastmod automatically on every plugin update or site-wide change — even if the page content itself didn't change. This trains Googlebot to distrust your lastmod signals over time, because it visits pages expecting fresh content and finds identical content. Audit your CMS to ensure lastmod only updates when actual content changes.
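One way to enforce this, sketched here under the assumption that you store a content hash alongside each URL record (the storage layer is hypothetical): bump lastmod only when the rendered content's hash actually changes.

```python
# Sketch: update lastmod only on genuine content changes, by comparing
# a stored content hash. Record fields are assumed, not a real CMS API.
import hashlib
from datetime import date

def refresh_lastmod(record, new_html, today=None):
    """record: dict with 'content_hash' and 'lastmod' keys.
    Mutates and returns the record; lastmod changes only if the
    content hash differs from the stored one."""
    new_hash = hashlib.sha256(new_html.encode("utf-8")).hexdigest()
    if new_hash != record.get("content_hash"):
        record["content_hash"] = new_hash
        record["lastmod"] = (today or date.today()).isoformat()
    return record
```

A plugin update or template change that leaves the rendered content identical produces the same hash, so lastmod stays put.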

changefreq: Treat as Advisory, Not Prescriptive

Google has confirmed that changefreq is used as a hint, not a schedule. Valid values are: always, hourly, daily, weekly, monthly, yearly, never. Use 'daily' for news or frequently updated content, 'weekly' for active blog sections, 'monthly' for evergreen guides, and 'yearly' or 'never' for static policy pages.

Don't overthink this tag. Its influence is minor compared to actual crawl behaviour signals Google derives from your site's history.

priority: Signal Hierarchy, Not Absolute Ranking

Priority values (0.0 to 1.0) are relative to your own site, not a global ranking signal. The classic mistake is assigning every page a high priority, which eliminates the relative signal entirely. Use the Crawl Prioritisation Stack tiers to assign values consistently:

  • Homepage and core conversion pages: 1.0
  • Key product/service/category pages: 0.8 – 0.9
  • Pillar content and authority articles: 0.7 – 0.8
  • Supporting blog content: 0.5 – 0.6
  • Archive and utility pages (if included): 0.3 – 0.4

The Sitemap Signal Hierarchy framework works best when all three tags are applied consistently and maintained over time — not set at launch and ignored. Schedule a quarterly sitemap review to check that lastmod values are accurate, priority assignments still reflect your current content strategy, and no new low-quality URLs have been auto-added.

Key Points

  • lastmod is the most influential optional tag — but only when it reflects genuine content changes.
  • Inaccurate lastmod values train Googlebot to distrust your sitemap signals over time.
  • changefreq is advisory — use logical values but don't expect Google to follow them literally.
  • priority signals relative importance within your site, not global ranking weight.
  • A consistent, hierarchical priority structure is far more useful than uniform high values.
  • Audit all three tag values quarterly, not just at launch.
  • CMS auto-updates often corrupt lastmod accuracy — investigate and correct this if present.

💡 Pro Tip

When you publish a significant update to an existing piece of content — not just a typo fix, but a genuine expansion or refresh — update the lastmod tag on that URL the same day. Then monitor Search Console for a recrawl within the following days. Over time, this behaviour trains Google to trust your lastmod signals and recrawl updated content faster.

⚠️ Common Mistake

Setting changefreq to 'always' or 'hourly' on pages that don't actually change at those intervals. This doesn't improve crawl frequency — it signals that your sitemap metadata isn't trustworthy, which reduces its value across the entire file.

Strategy 5

What Should You EXCLUDE From Your XML Sitemap? (The Inclusion List Everyone Ignores)

The most actionable thing most site owners can do to improve their sitemap today isn't adding anything — it's removing pages that shouldn't be there. This is the counterintuitive truth at the heart of good sitemap hygiene: a smaller, higher-quality sitemap sends a stronger crawl signal than a comprehensive one.

Here's the definitive exclusion list, with the rationale behind each:

1. Noindex Pages Any page with a noindex directive in its meta robots tag or X-Robots-Tag header should never appear in your sitemap. Including it creates a direct contradiction: you're telling Google 'find this page' and 'don't index this page' simultaneously. Google has to work out which instruction to follow, and you've wasted crawl budget on the decision.

2. Redirected URLs If a URL returns a 301 or 302 redirect, it should not be in your sitemap. Sitemaps should only contain canonical, live URLs that return a 200 status. Redirected URLs tell Google nothing useful and waste crawl resources on the intermediate hop.

3. Canonicalised-Away Duplicates If a URL has a canonical tag pointing to a different URL, the canonical destination belongs in the sitemap — not the duplicate. Including the duplicate creates conflicting signals about which version of the content you consider authoritative.

4. Parameter-Based URLs Filtered, sorted, or tracked URLs (e.g., /products?sort=price&color=red) are almost always duplicate or near-duplicate content. Unless they have genuine independent SEO value — and this is rare — keep them out of your sitemap and handle them via canonical tags or URL parameter settings.

5. Low-Quality and Thin Content Pages Tag pages, author archive pages with minimal content, stub pages, and thin landing pages created speculatively without real content investment should not be in your sitemap. Including them tells Google to evaluate these pages, and if they underperform quality thresholds, it can drag down your overall site quality signal.

6. Blocked Pages If a URL is blocked in robots.txt, there's no point including it in your sitemap. Google can see the sitemap entry but can't crawl the page — it's a contradictory signal.

The audit process: run your sitemap through a URL-level crawler, check each URL's status code, canonical tag, robots meta tag, and index status. Flag any URL that fails these checks and remove it from the sitemap. Resubmit and monitor Search Console for changes in crawl coverage.
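The check logic of that audit can be expressed as a pure function over crawl results you have already collected with your crawler of choice. The field names here are assumptions you would map onto your crawler's export format:

```python
# Sketch of the pre-submission audit over collected crawl data.
# Field names ('status', 'noindex', 'canonical') are assumed.
def audit_sitemap_urls(crawl_results):
    """crawl_results: list of dicts with 'url', 'status' (int),
    'noindex' (bool), and 'canonical' (canonical target URL, or
    None if self-canonical). Returns (keep, remove_with_reasons)."""
    keep, remove = [], []
    for r in crawl_results:
        if r["status"] != 200:
            remove.append((r["url"], f"status {r['status']}"))
        elif r.get("noindex"):
            remove.append((r["url"], "noindex"))
        elif r.get("canonical") and r["canonical"] != r["url"]:
            remove.append((r["url"], "canonicalised elsewhere"))
        else:
            keep.append(r["url"])
    return keep, remove
```

Everything in `remove` either gets fixed at the page level or dropped from the sitemap before resubmission.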

Key Points

  • Noindex pages and sitemap inclusion are mutually exclusive — one must go.
  • Only URLs returning a 200 status code should appear in your sitemap.
  • Canonical duplicates belong out of the sitemap — only include the canonical destination.
  • URL parameters almost never deserve sitemap inclusion.
  • Low-quality pages in your sitemap dilute the quality signal of your entire site.
  • robots.txt blocked pages should be removed from your sitemap.
  • Run a quarterly sitemap audit using a crawler to catch newly added problematic URLs.

💡 Pro Tip

After cleaning your sitemap, compare your total submitted URL count against your indexed URL count in Search Console. A large gap (significantly more submitted than indexed) often reveals that Google is finding quality issues with submitted pages. Investigate and address those pages rather than simply removing them — some may be salvageable with content improvement.

⚠️ Common Mistake

Using the 'Pages' report in Search Console and assuming all indexed pages have strong sitemap coverage. Some pages may be indexed via discovery (internal links) without ever being confirmed through the sitemap. Run a coverage audit to identify which indexed pages lack sitemap inclusion and evaluate whether that's intentional.

Strategy 6

Sitemap Architecture by Site Type: E-Commerce, SaaS, and Content Hubs

Different site types face fundamentally different sitemap challenges. A one-size-fits-all approach is why many sitemaps underperform. Here's the architecture logic for the three most common site types we work with.

E-Commerce: The Inventory Problem

E-commerce sites face unique sitemap challenges: large URL volumes, frequent product additions and removals, variant pages (colour, size), and pagination. The priority architecture looks like this:

  • Child sitemap 1: Core category and subcategory pages (high authority, frequently crawled)
  • Child sitemap 2: Active product pages (in-stock, full content)
  • Child sitemap 3: Blog and buying guide content
  • Child sitemap 4: Static pages (about, contact, policies)

Product variant pages (e.g., the same shirt in 12 colours) should almost always be handled via canonical tags pointing to the primary product page — not given independent sitemap entries. The exception: if each variant has meaningfully different content and genuine search demand.

Out-of-stock products are a judgment call. If the product will return, keep it in the sitemap and ensure the page has content value (reviews, related products, notify-me functionality). If it's permanently discontinued, remove it from the sitemap and redirect the URL.

SaaS: The Feature Page Challenge

SaaS sites typically have smaller URL volumes but higher stakes per page. The architecture should prioritise:

  • Homepage and core solution pages (Tier 1, priority 1.0)
  • Feature and use-case pages (Tier 1 – 2, priority 0.8 – 0.9)
  • Integration and comparison pages (Tier 2, priority 0.7 – 0.8)
  • Blog and resource content (Tier 2 – 3, priority 0.5 – 0.7)
  • Help documentation (Tier 3, priority 0.4 – 0.5, or a separate sitemap entirely)

Many SaaS sites make the mistake of including their entire help centre and knowledge base in the main sitemap without considering whether those pages should rank in organic search or serve only logged-in users. Help documentation often contains highly specific, low-volume queries — useful for existing customers, but not strategic organic targets.

Content Hubs: The Scale Problem

For editorial sites and content hubs publishing at volume, the sitemap becomes a content quality governance tool. Establish a content quality threshold — minimum word count, internal links, backlinks, or engagement signals — and only include posts that meet it in the sitemap. Posts below threshold should either be improved or excluded.

Consider segmenting by content age as well. Evergreen cornerstone content should have stable, accurate lastmod values. Trending or time-sensitive content should update more frequently with accurate lastmod tracking.
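A quality gate like that can be a one-line predicate applied during sitemap generation. The thresholds below are illustrative examples, not recommendations; tune them to your own content baseline.

```python
# Sketch: content-quality gate for sitemap inclusion on a content hub.
# Thresholds are illustrative assumptions.
MIN_WORDS = 600
MIN_INTERNAL_LINKS = 3

def passes_quality_gate(post):
    """post: dict with 'word_count' and 'internal_links' fields."""
    return (post["word_count"] >= MIN_WORDS
            and post["internal_links"] >= MIN_INTERNAL_LINKS)
```

Posts failing the gate go onto an improve-or-consolidate list rather than into the sitemap.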

Key Points

  • E-commerce sites benefit most from child sitemap segmentation by content type.
  • Product variant pages should typically use canonicals, not independent sitemap entries.
  • SaaS help documentation may deserve a separate sitemap from core marketing pages.
  • Content hubs should use content quality thresholds to decide sitemap inclusion.
  • Discontinued e-commerce products should be removed from sitemaps and redirected.
  • SaaS feature and use-case pages are high-priority sitemap entries — treat them as Tier 1.
  • Segmented sitemaps by site type enable more precise crawl performance monitoring.

💡 Pro Tip

For large e-commerce sites, consider generating your product sitemap dynamically from your product database rather than relying on CMS plugin generation. This ensures the sitemap always reflects current inventory status, canonical URLs, and lastmod dates based on actual product data updates — not arbitrary CMS timestamps.

⚠️ Common Mistake

Including SaaS app pages (dashboard URLs, user-specific pages) in the sitemap. These pages are typically behind authentication, have no SEO value, and may contain sensitive user data. Ensure your sitemap generation logic explicitly excludes any URL pattern associated with the authenticated app environment.

Strategy 7

How to Read Your Search Console Sitemap Data (And What the Errors Are Actually Telling You)

Submitting your sitemap to Google Search Console is the beginning of an ongoing diagnostic relationship, not a one-time task. The data returned in the Sitemaps report and the Index Coverage report tells a detailed story about how Google perceives your site's content — if you know how to read it.

The Sitemaps Report: Key Metrics

After submission, Search Console displays two key numbers: URLs submitted and URLs indexed. The gap between these numbers is your first diagnostic signal.

  • Small gap (submitted slightly higher than indexed): Normal. Some pages take longer to index; some may not meet Google's quality threshold yet.
  • Large gap (significant portion of submitted URLs not indexed): Investigate. This typically indicates quality issues, crawl errors, or canonicalisation conflicts.
  • Indexed higher than submitted: Google has discovered and indexed pages you didn't include in your sitemap. Evaluate whether those pages should be in your sitemap or whether they're being indexed unintentionally.
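That gap diagnostic reduces to a small classifier. The 10% band separating a "small" gap from a "large" one is an assumed example threshold, not a Google-published figure:

```python
# Sketch: classify the submitted-vs-indexed gap from Search Console.
# The 10% threshold is an assumed example, not an official cutoff.
def classify_index_gap(submitted, indexed, large_gap_ratio=0.10):
    if indexed > submitted:
        return "indexed-exceeds-submitted"  # pages indexed via discovery
    gap = (submitted - indexed) / submitted if submitted else 0.0
    return "large-gap" if gap > large_gap_ratio else "small-gap"
```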

The Index Coverage Report: Error Categories

The Coverage report categorises your URLs into four states: Error, Valid with warnings, Valid, and Excluded.

  • Submitted URL not found (404): A URL in your sitemap returns a 404. Remove it from the sitemap immediately and either restore the page or redirect the URL.
  • Submitted URL returns redirect: As discussed — remove redirected URLs from your sitemap.
  • Submitted URL blocked by robots.txt: Contradictory signal. Resolve by either removing the URL from the sitemap or updating robots.txt.
  • Indexed, not submitted in sitemap: A useful diagnostic. Evaluate whether these pages should be in your sitemap.
  • Crawled, currently not indexed: Google crawled the page but chose not to index it. This is the most important status to investigate — it often signals thin content, quality issues, or canonicalisation confusion.

The Recrawl Request Tool

After making significant improvements to a page or fixing a sitemap error, use the URL Inspection tool to request a recrawl of individual URLs. For bulk changes, resubmitting your sitemap (or updating lastmod values) signals Google that content has changed and warrants fresh attention.

Monitor sitemap data at least monthly for growing sites, quarterly for stable sites. Sudden drops in indexed URL counts often precede ranking changes and should trigger an immediate audit.

Key Points

  • The submitted vs. indexed gap is your first diagnostic signal — investigate large discrepancies.
  • 404 errors in your sitemap mean live pages have been removed without sitemap updates.
  • 'Crawled, currently not indexed' is the most actionable error — investigate content quality.
  • Redirected and robots.txt-blocked URLs in your sitemap need immediate cleanup.
  • Use URL Inspection for individual page recrawl requests after significant updates.
  • Monitor indexed URL counts monthly — drops often precede ranking declines.
  • URLs indexed but not in your sitemap deserve evaluation — intentional exclusions should be confirmed.

💡 Pro Tip

Set up a monthly calendar reminder to check your sitemap's submitted vs. indexed count. Create a simple tracking spreadsheet with date, submitted count, indexed count, and any notable site changes that month. Patterns in this data over 6 – 12 months reveal how changes to your content strategy, site architecture, or technical setup affect Google's crawl and index behaviour.
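That tracking log doesn't need a spreadsheet application; a small append-only CSV works. The file path and column names here are assumptions to adapt:

```python
# Sketch: append monthly sitemap counts to a CSV log.
# Path and column names are assumed conventions.
import csv
from pathlib import Path

def log_sitemap_counts(path, when, submitted, indexed, notes=""):
    file = Path(path)
    new_file = not file.exists()
    with file.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:  # write the header once, on first use
            writer.writerow(["date", "submitted", "indexed", "notes"])
        writer.writerow([when, submitted, indexed, notes])
```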

⚠️ Common Mistake

Dismissing the 'Discovered, currently not indexed' status as a temporary delay. While some indexing delays are normal, persistent 'Discovered, not indexed' status for important pages often signals that Google has assessed the page and found insufficient quality or relevance to justify indexing. These pages need content improvement, not just patience.

Strategy 8

XML Sitemaps in 2025 and Beyond: What AI Search and SGE Mean for Crawl Architecture

The fundamental role of the XML sitemap isn't going away — but the context around it is shifting in ways that smart SEO operators should already be thinking about.

AI Overviews and Crawl Prioritisation

As Google integrates AI-generated overviews (SGE) into search results, its appetite for structured, authoritative, and comprehensive content is increasing. What this means for sitemaps: pages that are clearly structured, consistently updated, and part of a coherent topical cluster are more likely to be crawled frequently and considered for AI-generated answer inclusion.

The implication for your sitemap strategy: ensure that your highest-priority pillar content — the pages most likely to be referenced in AI overviews — is clearly prioritised in your sitemap, has accurate and up-to-date lastmod values, and is grouped logically within your sitemap architecture to signal topical coherence.

Structured Data and Sitemap Alignment

One emerging best practice is aligning your sitemap architecture with your structured data (Schema.org) implementation. Pages marked up with Article, Product, FAQPage, or HowTo schema should have corresponding sitemap priority levels that reflect their intended search purpose. This alignment — sitemap priority + structured data type + content depth — creates a consistent authority signal across multiple SEO dimensions.

Real-Time Sitemaps for High-Frequency Publishers

For news publishers and high-frequency content sites, static XML sitemaps are increasingly being supplemented by real-time sitemap feeds or dynamically generated sitemaps that update within minutes of content publication. Google's crawlers can pick up new URLs from recently updated sitemaps far faster than they discover them through link crawling alone.
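A dynamically generated sitemap is straightforward to sketch with the standard library. The following is an illustrative Python example, assuming your CMS can hand the generator a list of (URL, last-modified) pairs on each publish event; it is a sketch, not a production generator:

```python
import xml.etree.ElementTree as ET
from datetime import datetime

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """Render a sitemap from (url, last_modified datetime) pairs.

    In a dynamic setup this runs on each publish event, so new URLs
    appear in the served sitemap within minutes of going live.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, modified in pages:
        node = ET.SubElement(urlset, "url")
        ET.SubElement(node, "loc").text = url
        ET.SubElement(node, "lastmod").text = modified.strftime("%Y-%m-%d")
    return ET.tostring(urlset, encoding="unicode")
```

Serving this from a route handler (rather than writing a static file) is what makes the sitemap "real-time": every crawler request sees the current publication state.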

What Won't Change

The fundamentals of sitemap hygiene — include only quality URLs, maintain accurate lastmod values, remove contradictory signals, monitor Search Console data — will remain relevant regardless of how AI search evolves. The sitemap is, at its core, a communication protocol. As long as crawlers need to discover and evaluate content, clear, well-structured communication will matter.

The best sitemap strategy for the AI search era is the same as the best strategy has always been: signal clearly, maintain quality, and audit consistently.

Key Points

  • AI Overviews increase the premium on well-structured, authoritative, frequently updated content — which should be reflected in your sitemap priority tiers.
  • Align sitemap priority levels with your structured data implementation for consistent authority signals.
  • High-frequency publishers should consider dynamically generated sitemaps for faster discovery.
  • Google News sitemaps remain a key lever for news publishers targeting the Top Stories carousel.
  • The core sitemap hygiene principles remain durable regardless of search evolution.
  • Topical cluster coherence — visible in your sitemap architecture — increasingly matters for AI-driven ranking.
  • Real-time sitemap updates are a competitive advantage for content sites publishing multiple pieces daily.

💡 Pro Tip

Review your sitemap architecture alongside your topical authority map. If you're building authority in a specific niche, your sitemap should visually reflect that cluster — when you look at the URLs in each child sitemap, they should tell a coherent content story. If they don't, your internal linking and content architecture likely needs attention before your sitemap strategy can be fully effective.

⚠️ Common Mistake

Treating the XML sitemap as a legacy technical requirement while investing in newer signals like structured data or entity optimisation. These elements work together, not in isolation. A technically sound sitemap amplifies the value of every other SEO investment on your site — neglecting it undermines the rest.

From the Founder

What I Wish I Knew Earlier About XML Sitemaps

When I started auditing technical SEO, I treated sitemaps as binary: either you had one or you didn't. It took working through dozens of sites with indexing problems before I understood that the sitemap is one of the clearest windows into how Google perceives your site's quality.

The moment that shifted my thinking: I was looking at a content site with around 3,000 published posts. They had a sitemap with all 3,000 URLs, and fewer than 40% were indexed. When we ran a content audit, we found that over half of those posts were under 400 words, had no internal links, and had received no backlinks. We removed those pages from the sitemap (and then improved or consolidated the underlying content), and the indexed ratio climbed significantly over the following quarters.

The sitemap had been faithfully presenting all of those thin pages to Google, essentially asking Google to evaluate content that didn't deserve evaluation. The sitemap wasn't the problem — the content was. But the sitemap was making the problem worse.

That's the insight I'd go back and give myself: your sitemap doesn't create quality, but it does amplify whatever quality signal exists. Make sure what you're amplifying is worth amplifying.

Action Plan

Your 30-Day XML Sitemap Action Plan

Days 1 – 3

Audit your current sitemap. Export all URLs from your sitemap and run them through a crawler to check status codes, canonical tags, and robots meta tags. Flag any URL that returns a non-200 status, is canonicalised away, or has a noindex tag.

Expected Outcome

A clean list of URLs that should NOT be in your sitemap.
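Parts of this audit step can be scripted. Below is a hedged Python sketch using only the standard library: one helper parses the sitemap file, the other classifies a URL from crawl results you gather separately. The fetching itself is omitted, and the flag wording is illustrative:

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text):
    """Extract every <loc> value from a sitemap file's contents."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(f"{NS}loc")]

def removal_reason(status_code, has_noindex, canonical_url, url):
    """Return why a URL should leave the sitemap, or None if it can stay."""
    if status_code != 200:
        return f"non-200 status ({status_code})"
    if has_noindex:
        return "noindex contradicts sitemap inclusion"
    if canonical_url and canonical_url != url:
        return f"canonicalised to {canonical_url}"
    return None
```

Feed `removal_reason` with the status code, robots meta, and canonical tag your crawler reports for each URL; every non-None result belongs on the removal list.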

Days 4 – 5

Remove flagged URLs from your sitemap. Update your CMS sitemap plugin settings or sitemap generation script to exclude these URL patterns going forward. Resubmit the cleaned sitemap in Search Console.

Expected Outcome

A leaner sitemap with reduced contradictory signals.

Days 6 – 10

Apply the Crawl Prioritisation Stack framework. Review the remaining URLs and assign priority tiers: Tier 1 (revenue pages), Tier 2 (authority content), Tier 3 (supporting content). Update priority tag values to reflect genuine hierarchy.

Expected Outcome

A prioritised sitemap that communicates relative page importance clearly.
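The tier assignment can be encoded as a simple lookup. A Python sketch follows; the path prefixes and the priority value chosen for each tier are assumptions for illustration, not values prescribed by the Sitemaps protocol:

```python
# Hypothetical mapping: tier number -> sitemap <priority> value.
TIER_PRIORITY = {1: "1.0", 2: "0.8", 3: "0.5"}

# Hypothetical path prefixes; replace with your own URL patterns.
TIER_RULES = [
    ("/services/", 1),  # Tier 1: revenue pages
    ("/guides/", 2),    # Tier 2: authority content
    ("/blog/", 3),      # Tier 3: supporting content
]

def priority_for(path):
    """Map a URL path to its sitemap priority value (Tier 3 by default)."""
    for prefix, tier in TIER_RULES:
        if path.startswith(prefix):
            return TIER_PRIORITY[tier]
    return TIER_PRIORITY[3]
```

The point of centralising the rules is that your sitemap generator and your content team share one definition of "important", rather than priority values drifting page by page.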

Days 11 – 14

Audit lastmod accuracy. Check whether your CMS updates lastmod values on genuine content changes or on every site-wide update. Correct any settings that generate inaccurate lastmod dates. Manually update lastmod for any pages refreshed in the past 90 days that weren't captured accurately.

Expected Outcome

Trustworthy lastmod signals that encourage faster recrawling of updated content.
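The lastmod cross-check lends itself to a small script. Here is a sketch assuming you can export real edit dates from your CMS as ISO date strings; the dictionary shapes are assumptions about that export, not a standard format:

```python
def stale_lastmod(sitemap_dates, cms_dates):
    """Flag URLs whose sitemap lastmod disagrees with the CMS edit date.

    sitemap_dates: {url: "YYYY-MM-DD"} parsed from the sitemap.
    cms_dates:     {url: "YYYY-MM-DD"} from your CMS export (assumed shape).
    Returns a sorted list of URLs needing a lastmod correction.
    """
    return sorted(
        url for url, edited in cms_dates.items()
        if sitemap_dates.get(url) != edited
    )
```

Any URL this returns either has a wrong lastmod or is missing from the sitemap entirely; both cases erode Google's trust in the signal.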

Days 15 – 20

Evaluate sitemap structure. If your site has more than 5,000 URLs, consider splitting into a sitemap index with child sitemaps by content type. If already using an index structure, review segmentation logic and update if your content mix has evolved.

Expected Outcome

An architecture that enables content-type-level monitoring in Search Console.
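The split described above maps to two small helpers: one that chunks a URL list at the protocol's 50,000-URL ceiling, and one that renders the index file. A Python sketch using the standard library:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def chunk_urls(urls, size=50000):
    """Split a URL list into protocol-compliant chunks (at most `size` each)."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

def build_index(child_sitemap_urls):
    """Render a sitemap index file pointing at the child sitemaps."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for child in child_sitemap_urls:
        node = ET.SubElement(index, "sitemap")
        ET.SubElement(node, "loc").text = child
    return ET.tostring(index, encoding="unicode")
```

In practice you would chunk by content type first (products, posts, landing pages) and only fall back to size-based chunking within a type, so each child sitemap stays a meaningful monitoring unit in Search Console.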

Days 21 – 25

Review Search Console Coverage and Sitemaps reports. Document your current submitted vs. indexed URL counts. Investigate any 'Crawled, currently not indexed' or 'Discovered, not indexed' URLs for content quality issues.

Expected Outcome

A baseline diagnostic snapshot to track improvement against.

Days 26 – 30

Set up ongoing monitoring. Create a monthly calendar reminder for sitemap review. Build a simple tracking sheet for submitted vs. indexed counts over time. Schedule a full sitemap re-audit in 90 days to assess improvement.

Expected Outcome

An ongoing sitemap governance system, not a one-time fix.

Related Guides

Continue Learning

Explore more in-depth guides

What is Crawl Budget? How to Optimise It for SEO

Understand how Googlebot allocates crawl resources across your site and learn the architectural decisions that ensure your highest-value pages get crawled most frequently.


Technical SEO Audit: The Complete Framework for 2025

A step-by-step guide to auditing every technical layer of your site — from crawlability and indexation to Core Web Vitals and structured data implementation.


Internal Linking Strategy: How to Build Authority Architecture

Internal links and sitemaps work together. Learn how to design an internal link structure that distributes authority to your highest-priority pages and reinforces your topical clusters.


Google Search Console: The Advanced Guide for SEO Operators

Go beyond the basics of Search Console. Learn how to extract actionable insight from Coverage reports, Performance data, and Core Web Vitals to drive measurable organic growth.


Robots.txt: Complete SEO Guide With Examples

Robots.txt and sitemaps are closely linked in your crawl architecture. This guide covers correct syntax, common mistakes, and how to use robots.txt to protect crawl budget without accidentally blocking important content.

FAQ

Frequently Asked Questions

Does an XML sitemap directly improve rankings?

Not directly. An XML sitemap improves crawl efficiency and discoverability — it helps Google find and index your pages faster and more reliably. Once pages are indexed, their rankings are determined by content quality, relevance, and authority signals like backlinks. Think of the sitemap as ensuring your pages are in the race; everything else determines how fast they run. The biggest ranking impact comes when a sitemap ensures that important, high-quality pages are being crawled and indexed that otherwise might have been missed or delayed.
How often should I update and resubmit my sitemap?

Your sitemap should update automatically whenever significant content changes occur — when new pages are published, when important pages are updated, or when pages are removed. For most CMS platforms, this is handled automatically by plugins. Resubmission to Google Search Console is only necessary if you've changed your sitemap URL, resolved major errors, or made significant structural changes to the file. For ongoing monitoring, review your sitemap data in Search Console monthly. A full sitemap audit — checking included URLs against quality criteria — should happen quarterly.
How many URLs can an XML sitemap contain?

The Sitemaps protocol limits individual sitemap files to 50,000 URLs and a maximum uncompressed file size of 50MB. If your site exceeds either threshold, you must use a sitemap index file that references multiple child sitemaps, each within those limits. However, just because you can include 50,000 URLs doesn't mean you should. Quality over quantity is the governing principle — a sitemap with 5,000 high-quality URLs will typically outperform one with 50,000 mixed-quality URLs from a crawl efficiency standpoint.
Should I reference my sitemap in robots.txt?

Yes — and this is often overlooked. Adding a 'Sitemap:' directive to your robots.txt file (e.g., Sitemap: https://www.yoursite.com/sitemap.xml) ensures that any crawler reading your robots.txt can discover your sitemap without relying on manual submission. This is especially valuable for Bing, Yandex, and other search engines that may not have a direct submission interface you're actively using. It's a simple, high-return addition that takes under a minute to implement and should be standard practice on every site.
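For placement, a minimal robots.txt might look like this (the domain is the same placeholder used above; per the Sitemaps protocol, the Sitemap directive is independent of user-agent groups, can appear anywhere in the file, and may be repeated for multiple sitemaps):

```
# robots.txt — illustrative example
User-agent: *
Disallow: /admin/

Sitemap: https://www.yoursite.com/sitemap.xml
```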
Can I use more than one sitemap on a single site?

Yes, and for many sites this is the preferred architecture. A sitemap index file acts as a parent file that references multiple child sitemaps. You might separate products, blog posts, landing pages, and images into distinct child sitemaps. This architecture has two practical benefits: it allows you to exceed the 50,000 URL limit across the full site, and it enables you to monitor crawl performance by content type in Google Search Console — because each child sitemap appears as a separate entry in the Sitemaps report with its own submitted vs. indexed count.
Which sitemap errors should I fix first?

Prioritise errors in this order: First, fix any URLs returning 404 or 5xx errors — remove them from the sitemap and either restore the page or implement a redirect. Second, resolve canonical conflicts — ensure your sitemap only contains URLs that are the canonical version of their content. Third, address noindex contradictions — pages can't be both in your sitemap and marked noindex. Fourth, investigate 'Crawled, not indexed' URLs — these usually signal content quality issues worth addressing. After fixing each category, resubmit your sitemap and monitor the Coverage report for improvement over the following weeks.
Do small sites still need an XML sitemap?

They're still worth having, but the impact is lower. For small sites with strong internal linking, Google's crawler will typically discover all pages through link crawling alone — the sitemap adds minimal discovery value. Where it does help even for small sites: ensuring new pages are discovered quickly after publication, giving Google clear canonical signals, and establishing a technical foundation you won't need to retrofit as you grow. Set it up correctly once and let it run — the setup cost is minimal and the long-term value is positive, even if the immediate impact is modest.
