There are many XML sitemap generators available for purchase, or even for free. They do what they’re supposed to – they crawl your site and spit out a properly formatted XML sitemap.
But there’s often a problem with these XML sitemap generators: they don’t know which URLs should (or should not) be in the XML sitemap. Sure, you can tell some of them to obey directives like robots.txt and canonical tags, but unless your site is perfectly optimized, you’ll need to do some work by hand. It’s extremely rare to see a larger, database-driven site that’s perfectly optimized, so these tools rarely produce a flawless XML sitemap. Parameters tend to create duplication and page bloat. Language directories sometimes get included improperly. Runaway folder structures reveal process files and junk pages you didn’t know existed. The bigger and more dynamic the website, the higher the likelihood of unnecessary page/URL creation.
At the end of the day, the XML sitemap should expose only the URLs you actually want Google to see. Nothing more, nothing less. The XML sitemap gives search engines a data dump of all your important pages, supplementing what they haven’t found on their own. In return, these “unfound” pages can get found, crawled, and ideally rank.
So what should be in the ultimate XML sitemap?
- Only pages that return a 200 (OK) status code. No 404s, redirects, 500 errors, etc.
- Only pages that are not blocked by robots.txt.
- Only pages that are the canonical page.
- Only pages that relate to the second-level domain (meaning, no subdomains – they should get their own XML sitemap)
- In most cases, only pages of the same language (even if all your language pages live under the same TLD, each language usually gets its own sitemap)
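The checklist above can be sketched as a simple filter. This is a minimal sketch, assuming a hypothetical crawl export where each page record carries its URL, status code, robots.txt state, and canonical URL; the field names here are invented for illustration:

```python
from urllib.parse import urlparse

def sitemap_eligible(page, site_host):
    """Return True if a crawled page belongs in the XML sitemap.

    `page` is a dict with hypothetical keys (url, status_code,
    blocked_by_robots, canonical_url) standing in for whatever
    your crawler's export actually provides.
    """
    if page["status_code"] != 200:      # no 404s, redirects, or 500s
        return False
    if page["blocked_by_robots"]:       # respect robots.txt
        return False
    # only the canonical version of each page
    if page["canonical_url"] and page["canonical_url"] != page["url"]:
        return False
    # same host only: subdomains get their own sitemap
    if urlparse(page["url"]).hostname != site_host:
        return False
    return True

pages = [
    {"url": "https://example.com/a", "status_code": 200,
     "blocked_by_robots": False, "canonical_url": "https://example.com/a"},
    {"url": "https://example.com/a?color=red", "status_code": 200,
     "blocked_by_robots": False, "canonical_url": "https://example.com/a"},
    {"url": "https://blog.example.com/post", "status_code": 200,
     "blocked_by_robots": False, "canonical_url": None},
    {"url": "https://example.com/gone", "status_code": 404,
     "blocked_by_robots": False, "canonical_url": None},
]

keep = [p["url"] for p in pages if sitemap_eligible(p, "example.com")]
print(keep)  # → ['https://example.com/a']
```

Only the canonical, same-host, 200-status page survives; the parameterized duplicate, the subdomain page, and the 404 all get dropped.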
So, in the end, the perfect XML sitemap should 100% mirror what, in a perfect world, Google crawls and indexes. Ideally your website has a process for building these perfect sitemaps routinely, without your intervention. As new products or pages come and go, your XML sitemap should simply overwrite itself. However, the rest of this post explains how to create a one-off XML sitemap for those occasions where a prototype sitemap is needed, or as a quick fix for a broken sitemap generator.
What Do You Need?
Screaming Frog is an incredibly powerful site crawler ideal for SEOs. One of its several features is the ability to export perfectly written XML sitemaps. If your export is large, it properly breaks the sitemaps up and includes a sitemapindex.xml file. While you’re at it, you can even export an image sitemap. Screaming Frog is free for small crawls, but if you have a site larger than 500 URLs, pony up for the paid version. This is one tool you’ll be happy you paid for if you do SEO work. It’s a mere £99 per year (or $130).
Download > Screaming Frog
Once you install it on your desktop, you’re just about ready to go. If you are working on extremely large sites, like a CNN.com or Toysrus.com, you’re probably going to need to expand its memory usage. Out of the box, Screaming Frog allocates 512MB of RAM for its use. As you can imagine, the more you crawl, the more memory you’ll need. To do this, follow the steps 1/3 of the way down this page, called “Increasing memory on Windows 32 & 64-bit”: https://www.screamingfrog.co.uk/seo-spider/user-guide/general/. Mac users, your instructions are on that page too.
Now that Screaming Frog is installed and super-charged, you’re ready to go.
Setting Up For The Perfect Crawl
Screaming Frog looks like a lot but is very easy to use. In the Configuration > Spider settings, you have several checkboxes you can use to tell Screaming Frog how to behave. We’re trying to get Screaming Frog to emulate Google, so we want to check a few boxes here. This includes:
- Respect Noindex
- Respect Robots.txt
- Do not crawl nofollow
- Do not crawl external links
- Respect canonical
At this point, I recommend crawling the site. Consider this the first wave.
Examining The Data
Export the full site data from Screaming Frog. We’re going to evaluate all the pages in Excel. While we took steps to show only what search engines can access on their own, we want to make sure there aren’t pages they’re seeing that we didn’t know about. You know, those ?color= parameters on eCommerce sites, or /search/ URLs that maybe you didn’t want indexed. I like to sort the URL column A-Z so I can quickly scan down and spot duplicate URLs.
This data is super valuable not only for creating a strong XML sitemap, but also for going back and blocking pages on your website that need tightening up. Unless your site is 100% optimized, and hey… congrats if it is, this is a valuable, hard look at potentially runaway URLs. I recommend doing this crawl and looking at your data at least once a quarter.
Scrubbing Out Bad URLs
A “bad” URL in this case is simply one we don’t want Google to see. Ultimately, we need to feed these further exclusions back into Screaming Frog. At this point you have two options:
- Upload your cleaned Excel list back into Screaming Frog,
- or run a new Screaming Frog crawl with the exclusions built in.
Option 1: Using your spreadsheet, delete the rows containing URLs you don’t want. Speed up the process by using Excel’s filters (i.e., “contains,” “does not contain,” etc.). The only column of data we care about is the one with your URLs. Also, use Excel’s filters to show only the 200 (OK) URLs. The time it takes to audit this spreadsheet depends on how many URLs you have, how many different URL conventions are in play, and how comfortable you are with Excel.
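If you’d rather script this step than click through Excel filters, the same scrubbing can be done against the exported CSV. This sketch uses a miniature in-memory CSV; the column names (“Address,” “Status Code”) and the bad-URL patterns are assumptions for illustration, so match them to whatever your actual export contains:

```python
import csv
import io

# Hypothetical miniature of a crawler CSV export. Your real export's
# column names and contents may differ.
export = io.StringIO("""Address,Status Code
https://example.com/shoes,200
https://example.com/shoes?color=red,200
https://example.com/search/old,404
https://example.com/sale,301
""")

# URL fragments you decided to scrub after auditing the crawl
bad_patterns = ("?color=", "/search/")

good_urls = []
for row in csv.DictReader(export):
    if row["Status Code"] != "200":
        continue  # keep only pages that return 200
    if any(p in row["Address"] for p in bad_patterns):
        continue  # drop known-bad URL shapes
    good_urls.append(row["Address"])

print(good_urls)  # → ['https://example.com/shoes']
```

The surviving list is exactly what you’d paste back into Screaming Frog’s Mode > List crawl.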
Next, copy the entire column of “good” URLs, and return back to Screaming Frog. Start a new crawl using the Mode > List option. Paste your URLs and start your crawl. Once all the appropriate URLs are back into Screaming Frog, move on to the next section.
Option 2: Now that you know which URLs you want to block, you can do it with Screaming Frog’s exclude feature. Configuration > Exclude pulls up a small window where you enter regular expressions (regex). Not familiar with regex? No problem; it’s really very easy, and Screaming Frog gives you great examples you just need to bend to your will: https://www.screamingfrog.co.uk/seo-spider/user-guide/configuration/#exclude.
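To make that concrete, here are a few exclude-style patterns of the kind you might enter, verified in Python so you can see what each one matches. The patterns and URLs are invented examples, and this assumes each regex is matched against the full URL (as Screaming Frog’s documentation describes):

```python
import re

# Example exclude patterns: each regex must match the entire URL
# for that URL to be excluded from the crawl.
excludes = [
    r".*\?color=.*",   # parameterized color variants
    r".*/search/.*",   # internal search result pages
    r".*\.pdf$",       # PDF files
]

def excluded(url):
    """True if any exclude pattern matches the full URL."""
    return any(re.fullmatch(p, url) for p in excludes)

print(excluded("https://example.com/shoes?color=red"))  # True
print(excluded("https://example.com/docs/manual.pdf"))  # True
print(excluded("https://example.com/shoes"))            # False
```

Note the escaped `\?` and `\.`: unescaped, `?` and `.` are regex metacharacters and would match more than you intend.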
(Alternatively, you can use the include function if there’s certain types of URLs or sections you specifically want to crawl. Take the directions above, and simply reverse them.)
Once you have a perfect crawl in Screaming Frog, move on to the next section below.
Export The XML Sitemap
At this stage, you’ve either chosen Option 1 or Option 2 above. You have all the URLs you want indexed loaded in Screaming Frog. You just need to do the simplest step of all – export!
The export dialog gives you some extra checkboxes to consider. A very smart set of selections, if you ask me.
This helps you really refine what goes into the XML sitemap in case you missed something in the steps above. Simply select what makes sense to you, and execute the export. Screaming Frog will generate the sitemaps to your chosen location, ready for you to upload to your website. Don’t forget to submit these new sitemaps in Google Search Console and Bing’s sitemap uploader.
(If you need some clarity on what these definitions are, visit http://www.sitemaps.org/protocol.html)
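For reference, a single entry in a sitemap file looks like this. The URL and values shown are placeholders; per the protocol, only the `<loc>` element is required, while `<lastmod>`, `<changefreq>`, and `<priority>` are optional hints:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/shoes</loc>
    <lastmod>2016-01-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```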
You’re all set. Remember, this is just a snapshot of your ever-changing site. I still fully recommend a dynamic XML sitemap that updates as your site changes. Hope this was helpful.