Google has an insatiable appetite for finding and crawling new pages on the web. XML sitemaps are an efficient way for Google to discover all the pages on your site. However, after doing hundreds of SEO audits, I’ve come across many issues in how XML sitemaps are generated and maintained. Below is my list of the top 10 XML sitemap fails:
1. Not having them if you have a large site, or a site that frequently publishes new and time-sensitive content
If you have a large, authoritative site it will be crawled very frequently and any new content should be picked up pretty quickly. But on megasites, and on many ecommerce sites carrying tens or hundreds of thousands of SKUs, new products added without any supporting internal links will take longer for Google to find, crawl, and index. XML sitemaps are requested fairly regularly (sometimes daily, or more frequently than that) and help search engine crawlers find your new pages.
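A bare-bones sitemap is just a urlset element with one loc entry per page. As a rough illustration (the URLs and output filename here are placeholders), generating one with Python's standard library might look like this:

```python
# Minimal sketch: build a basic XML sitemap with only the standard library.
# The URLs and output filename are placeholders.
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls, out_path="sitemap.xml"):
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc in urls:
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = loc
    # xml_declaration=True writes the <?xml ...?> header search engines expect
    ET.ElementTree(urlset).write(out_path, encoding="utf-8", xml_declaration=True)

if __name__ == "__main__":
    build_sitemap([
        "https://www.example.com/",
        "https://www.example.com/products/widget-1",
        "https://www.example.com/products/widget-2",
    ])
```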
2. Not regenerating them when the site is updated
If you have a WordPress site with one of the popular SEO plugins, the plugin will either generate the XML sitemaps dynamically or update them when you publish new content. Sites with a custom Content Management System (CMS), however, may not have an automated way to generate XML sitemaps, or they may rely on a separate system or process that runs periodically. The scheduler or the sitemap generator can fail, and if there isn’t any checking done as part of the script, or by other means, the failure can go unnoticed for weeks or months. The more content you update on your site, the more important it is to regenerate the XML sitemap.
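One cheap safety net is a scheduled staleness check. This is only a sketch; the sitemap path and the seven-day threshold are assumptions you'd adapt to your own publishing cadence:

```python
# Rough sketch of a staleness check you could run on a schedule. The sitemap
# path and the "older than 7 days" threshold are placeholder assumptions.
import os
import sys
import time

SITEMAP_PATH = "/var/www/example.com/sitemap.xml"  # hypothetical location
MAX_AGE_DAYS = 7

def sitemap_is_stale(path=SITEMAP_PATH, max_age_days=MAX_AGE_DAYS):
    if not os.path.exists(path):
        return True  # never generated, or the generator stopped producing it
    age_days = (time.time() - os.path.getmtime(path)) / 86400
    return age_days > max_age_days

if __name__ == "__main__":
    if sitemap_is_stale():
        # Wire this into email/Slack/monitoring rather than just exiting non-zero
        print("WARNING: sitemap.xml is missing or hasn't been regenerated recently")
        sys.exit(1)
    print("sitemap.xml looks fresh")
```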
3. Sitemap contains pages that you don’t want indexed
So many times I’ve seen thank you pages, test pages, or default pages that came with the theme remain in the XML sitemap. People often forget to flag these pages with noindex, and many end up in the index. The thank you page after a form submission isn’t really going to cause major problems, apart from maybe corrupting your lead gen reporting, but test pages or other pages that aren’t yet meant for public consumption could lead to public relations or legal issues.
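The cleanest fix is to exclude these pages at generation time. Here's a sketch of what that filter might look like; the patterns are made-up examples, and your own list should mirror whatever you noindex on the site itself:

```python
# Illustrative sketch: drop obvious "don't index" pages from the URL list before
# it goes into the sitemap. The patterns below are made-up examples.
import re

EXCLUDE_PATTERNS = [
    re.compile(r"/thank-you"),    # post-form-submission confirmation pages
    re.compile(r"/test"),         # test pages published by mistake
    re.compile(r"/sample-page"),  # default pages that shipped with the theme
]

def keep_in_sitemap(url):
    return not any(pattern.search(url) for pattern in EXCLUDE_PATTERNS)

urls = [
    "https://www.example.com/products/widget-1",
    "https://www.example.com/thank-you",
    "https://www.example.com/sample-page/",
]
print([u for u in urls if keep_in_sitemap(u)])  # only the product URL survives
```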
4. Sitemap contains pages that return an HTTP status other than 200
Your XML sitemaps need to be clean; that is, the URLs contained within them should only return a 200 HTTP status code. If you find any that redirect, 404, or return 500 server errors, I strongly recommend tracking down the source of the issue and fixing it. It might be as simple as regenerating the XML sitemap to account for products that you no longer carry, or it could be an application error.
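A quick way to audit this is to fetch every URL in the sitemap and log anything that isn't a clean 200. A sketch, assuming the third-party requests library is installed and using a placeholder sitemap URL:

```python
# Audit sketch: report every sitemap URL that doesn't return a 200.
# Assumes the third-party `requests` library; the sitemap URL is a placeholder.
import xml.etree.ElementTree as ET
import requests

SITEMAP_URL = "https://www.example.com/sitemap.xml"
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def audit_sitemap(sitemap_url=SITEMAP_URL):
    root = ET.fromstring(requests.get(sitemap_url, timeout=30).content)
    for loc in root.findall(".//sm:loc", NS):
        url = loc.text.strip()
        # allow_redirects=False so 301/302s get reported instead of silently followed
        status = requests.head(url, allow_redirects=False, timeout=10).status_code
        if status != 200:
            print(f"{status}  {url}")

if __name__ == "__main__":
    audit_sitemap()
```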
5. Trying to game the last updated attribute
When Google crawls a page it generates a hash of the page, which is a unique, compressed representation of its content. Any change to the content will produce a different hash, so a quick comparison of the previous and current hash values lets a system quickly identify whether the page has changed and decide whether it wants to spend additional resources parsing through the actual HTML. Some sites will update the lastmod attribute to the current date whether the page has changed or not, in a vain attempt to fool the search engine into recrawling the page; the premise is that if a page is crawled frequently, or is seen to be fresh, it will get a ranking boost. From my experiments and testing across tens of millions of URLs, this theory does not work. Just be honest with the lastmod attribute and update it when the page is updated. For smaller sites of up to a few thousand pages, I’ve generated XML sitemaps without lastmod or priority at all (basically just a list of URLs in XML format) and didn’t see any differences. Google treats these attributes as hints, not directives. If you’re dishonest, Google will just ignore them.
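In practice, being honest with lastmod just means deriving the date from when the content actually changed, for example an updated-at field in your CMS, rather than stamping every URL with today's date on each regeneration. A small sketch with hard-coded stand-in dates:

```python
# Sketch of honest lastmod values: each URL carries the date its content actually
# changed (hard-coded stand-ins here for a CMS "updated_at" field), instead of
# every URL being stamped with today's date.
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

pages = {  # hypothetical URLs and their real last-edited dates
    "https://www.example.com/about": "2023-11-02",
    "https://www.example.com/products/widget-1": "2024-04-18",
}

urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
for loc, last_modified in pages.items():
    url_el = ET.SubElement(urlset, "url")
    ET.SubElement(url_el, "loc").text = loc
    ET.SubElement(url_el, "lastmod").text = last_modified  # W3C date format (YYYY-MM-DD)
ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```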
6. Not using XML sitemaps for troubleshooting indexing problems
If your rankings dip and you find that a large number of pages aren’t being ranked, or even indexed anymore, you should be looking at the sitemap report in Google Search Console. If you haven’t already, try generating and submitting some custom sitemaps and see if you can narrow down which groups of content might be problematic. Troubleshooting SEO issues is very often about finding a common or consistent pattern, then figuring out a test or a solution.
7. Not splitting sitemaps up into chunks
This is critical for large sites. WordPress does this naturally with a sitemap_index and separate XML sitemaps for posts, pages, categories, tags, authors, etc. If you have a large ecommerce site, or a media/publishing site, it’s important to generate multiple XML sitemaps that are logically grouped. It might be by product line, brand, or some other category, but as I mentioned in the previous point, if something goes awry, you’ll already have them broken out, which will speed up the troubleshooting process.
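To illustrate the mechanics, here's a sketch that groups URLs by their first path segment, writes one sitemap per group, and ties them together with a sitemap index. The grouping rule and example.com URLs are placeholders; you'd group by whatever dimension matches your catalog:

```python
# Illustrative sketch: one sitemap per site section, referenced from a sitemap
# index. The grouping rule (first path segment) and URLs are placeholders.
from collections import defaultdict
from urllib.parse import urlparse
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
BASE_URL = "https://www.example.com"

def write_grouped_sitemaps(urls):
    groups = defaultdict(list)
    for url in urls:
        section = urlparse(url).path.strip("/").split("/")[0] or "home"
        groups[section].append(url)

    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for section, section_urls in groups.items():
        filename = f"sitemap-{section}.xml"
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for loc in section_urls:
            ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = loc
        ET.ElementTree(urlset).write(filename, encoding="utf-8", xml_declaration=True)
        ET.SubElement(ET.SubElement(index, "sitemap"), "loc").text = f"{BASE_URL}/{filename}"
    ET.ElementTree(index).write("sitemap_index.xml", encoding="utf-8", xml_declaration=True)

if __name__ == "__main__":
    write_grouped_sitemaps(
        [f"{BASE_URL}/mens/shirt-{i}" for i in range(3)]
        + [f"{BASE_URL}/womens/dress-{i}" for i in range(3)]
    )
```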
8. Relying too heavily on sitemaps, especially for small sites
For small sites of up to a few thousand pages that have been around for a while, have a few decent backlinks, and have decent internal linking, XML sitemaps aren’t all that necessary. Search engines will already be crawling the site pretty frequently, will have found all the pages on your site, and will have established a crawl frequency for each URL. If you do have a process that generates XML sitemaps, you don’t need to be obsessive about constantly regenerating and updating them when nothing has changed on your site; your time and resources are much better spent elsewhere, like creating good content.
9. Not validating the syntax and checking for typos
One of the most common problems I see is syntax errors, especially a blank line at the beginning of the file. This seemingly innocuous issue can cause the search engines to fail to parse the file and just ignore it. HTML parsing is extremely forgiving, but XML and JSON are not. If you’ve recently made any changes to a custom process that generates your XML file, make sure it doesn’t contain any syntax errors.
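The check itself can be tiny; simply running the file through an XML parser will catch things like stray whitespace before the XML declaration. A sketch with a placeholder path:

```python
# Simple syntax check to run after the sitemap is generated. An XML parser will
# reject things like a blank line before the <?xml ...?> declaration.
import sys
import xml.etree.ElementTree as ET

SITEMAP_PATH = "sitemap.xml"  # placeholder path

def validate_sitemap_syntax(path=SITEMAP_PATH):
    try:
        ET.parse(path)  # raises ParseError on malformed XML
    except ET.ParseError as exc:
        print(f"INVALID: {path}: {exc}")
        return False
    print(f"OK: {path} parsed cleanly")
    return True

if __name__ == "__main__":
    sys.exit(0 if validate_sitemap_syntax() else 1)
```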
10. Not checking that they generated successfully, for sites that auto-generate them
I’ve come across sites that supposedly had a scheduled process to generate an XML sitemap, written by a developer who has long since left the company. Over the years there were lots of changes to the web development and marketing teams, and the process just got left behind. If you have an SEO team, they should be checking the sitemap reports in Google Search Console weekly or monthly to catch these types of issues. Ideally the sitemap generator has its own validation process (or a separate script) to check that the XML sitemaps generated successfully.
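That validation step doesn't have to be elaborate. One rough sketch: confirm the file parses and contains roughly as many URLs as you'd expect for your site (the path and the minimum count here are placeholder assumptions):

```python
# Post-generation sanity check sketch: make sure the sitemap parses and holds a
# plausible number of URLs, so a silent generator failure gets noticed.
import sys
import xml.etree.ElementTree as ET

SITEMAP_PATH = "sitemap.xml"  # placeholder path
MIN_EXPECTED_URLS = 500       # hypothetical floor for this site
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def generation_looks_healthy(path=SITEMAP_PATH, min_urls=MIN_EXPECTED_URLS):
    try:
        root = ET.parse(path).getroot()
    except (OSError, ET.ParseError) as exc:
        print(f"FAIL: couldn't read {path}: {exc}")
        return False
    count = len(root.findall(".//sm:loc", NS))
    if count < min_urls:
        print(f"FAIL: only {count} URLs in {path}, expected at least {min_urls}")
        return False
    print(f"OK: {count} URLs in {path}")
    return True

if __name__ == "__main__":
    sys.exit(0 if generation_looks_healthy() else 1)
```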