Sitemap XML Guide – Do You Need Them and How to Build Them.
Thanks to the steady stream of SEO audits that Dave does each month, he sees odd things across many sites and content management systems, and more specifically in sitemap XML files.
Automate it on your backend (generate the files based on your local database). That way you can ping sitemap files immediately when something changes, and you have an exact last-modification date. Don’t crawl your own site, Google already does that. Via John Mu on Reddit
And while John is correct that the most efficient and reliable way to create a sitemap.xml file is through a backend process, the reality is that there are often technical or business reasons why this isn’t a viable solution. As we covered recently, sometimes best practices aren’t best practice. Your CMS, plugins, business goals, developer resources, setup, and 20 other things may make your situation pretty unique, and thus “best practice” simply won’t work for you.
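For sites where a backend process is an option, John’s suggestion might look something like the sketch below. It is only an illustration in Python, not how any particular CMS does it: the `pages` list is a hypothetical stand-in for a query against your own database, and the example.com URLs and dates are made up.

```python
from datetime import date
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML (as bytes) for an iterable of (url, last_modified) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, last_modified in pages:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        # The exact last-modification date comes straight from the database row.
        ET.SubElement(entry, "lastmod").text = last_modified.isoformat()
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


# Hypothetical rows; in practice these would come from your CMS database.
pages = [
    ("https://www.example.com/", date(2020, 1, 15)),
    ("https://www.example.com/blog/sitemap-guide/", date(2020, 2, 3)),
]
sitemap_xml = build_sitemap(pages)
```

The point of building from database rows is that only the pages you choose to select end up in the file, and the last-modified value is exact rather than guessed from a crawl.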
The goal of this podcast episode was to try and walk you through various options no matter your CMS and technical ability (as best we can in 20 minutes of course).
Steps for Building a Sitemap.xml
- Start with your CMS and use any extensions, plugins or default ways to build the file.
- Compare your known site to the sitemap.xml – are there pages in it you do NOT want in it? Is it missing any types of pages?
- Move on to looking at ways to customize your sitemap.xml as needed. This may be at the plugin/extension level or for something like WordPress this may be at the page level.
- If your out-of-the-box setup or plugin is still having issues, this is the point where you may need to look at custom options for building a sitemap.xml file.
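The comparison step above can be roughed out in a few lines of Python. This is a sketch under assumptions: `SAMPLE_SITEMAP` and `KNOWN_PAGES` are made-up data standing in for your real sitemap.xml and your own list of pages, and the function names are just for illustration.

```python
from xml.etree import ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the set of <loc> URLs found in a sitemap.xml document."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS) if loc.text}


def compare(in_sitemap, known):
    """Return (URLs only in the sitemap, known URLs missing from the sitemap)."""
    return sorted(in_sitemap - known), sorted(known - in_sitemap)


# Made-up sample data for illustration only.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/thank-you/</loc></url>
</urlset>"""
KNOWN_PAGES = {"https://www.example.com/", "https://www.example.com/services/"}

extra, missing = compare(sitemap_urls(SAMPLE_SITEMAP), KNOWN_PAGES)
print("In the sitemap but not on your list:", extra)
print("On your list but not in the sitemap:", missing)
```

Anything in the first list is a candidate for removal (a thank-you page here), and anything in the second is a page type the plugin may be skipping.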
Tips for Building a Sitemap.xml
For large dynamic sites the out-of-the-box solutions may or may not work. When they don’t work for your site, here is a quick breakdown of tips to help you build a solid sitemap file.
- Export all URLs that you can from your CMS
- Export all URLs from your Google Analytics or Web Analytics solution
- Export all URLs from your Google Search Console
- Crawl the site using whatever 3rd party crawler you prefer to use
- Combine all these sources to give you an idea of every URL there is on your site.
- Cut anything you don’t want in the sitemap.xml file – tag pages, author pages, add to cart, thank you pages or anything else.
- Double check it like you are Santa and push the file live.
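As a rough sketch of the combine-and-cut steps above, the Python below merges several URL exports into one set and drops the page types you don’t want. Everything specific here is an assumption for illustration: the four sample lists stand in for your real CMS, analytics, Search Console and crawler exports, `EXCLUDED_FRAGMENTS` is a hypothetical exclusion list, and stripping query strings is a choice that may not suit a site whose real pages depend on parameters.

```python
from urllib.parse import urlsplit

# Hypothetical path fragments to keep out of the sitemap; adjust for your site.
EXCLUDED_FRAGMENTS = ("/tag/", "/author/", "/cart", "/thank-you")


def normalize(url):
    """Lower-case the host and drop query strings and fragments."""
    parts = urlsplit(url.strip())
    return f"{parts.scheme}://{parts.netloc.lower()}{parts.path or '/'}"


def combine_sources(*sources):
    """Merge any number of URL lists into one de-duplicated set."""
    merged = set()
    for source in sources:
        merged.update(normalize(u) for u in source if u.strip())
    return merged


def sitemap_candidates(urls):
    """Return the sorted URLs whose path matches none of the excluded fragments."""
    return sorted(
        u for u in urls
        if not any(fragment in urlsplit(u).path for fragment in EXCLUDED_FRAGMENTS)
    )


# Made-up exports for illustration only.
cms = ["https://www.example.com/", "https://www.example.com/services/"]
analytics = ["https://www.example.com/services/?utm_source=news",
             "https://www.example.com/thank-you/"]
search_console = ["https://WWW.example.com/tag/seo/",
                  "https://www.example.com/blog/post-1/"]
crawl = ["https://www.example.com/author/dave/", "https://www.example.com/cart"]

all_urls = combine_sources(cms, analytics, search_console, crawl)
keep = sitemap_candidates(all_urls)
```

The output of a pass like this is a starting list, not a finished sitemap; the double check in the last step still has to be done by a person who knows the site.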
CMS Sitemap XML Resources
Below is a list of CMS resources on how or where to find plugins and information on implementing sitemap XML files for various content management systems. We didn’t cover EVERY CMS on the show but we did try and mention and cover the more heavily used systems.
Sitemap XML Podcast Transcript
Matt Siltala: [00:00:00] Hey guys, excited to be here with you today. How’s it going Dave?
Dave Rohrer: [00:00:15] It is going, sir. Awesome.
Matt Siltala: [00:00:17] So we’ve got a fun one for you guys. You’re gonna love this one today. We’re going to get really nerdy on ya.
Um. Well, Dave is, anyway,
Dave Rohrer: [00:00:27] so
Matt Siltala: [00:00:28] somewhat nerdy now. This is, this is a good one, I think is a good time. Like we were just talking about. It’s a good time to get into this and discuss it. Um, we want to make sure things are done right after all this as a digital marketing blog, there’s a lot of stuff that we talk about that’s really related to, um, just different technical.
Um, things regarding SEO or just technical with sites in general. And so, uh, what we wanted to, to jump into today, guys, is a guide to the sitemap XML files. And so
Dave Rohrer: [00:00:57] it might be overstating what we actually [00:01:00] ended up talking about,
Matt Siltala: [00:01:01] which is, you know, par
Dave Rohrer: [00:01:02] for the course. What is the goal is to kind of create something of a guide.
Matt Siltala: [00:01:06] Oh, so you know it, and if it doesn’t happen, it’s on you then, Dave.
Dave Rohrer: [00:01:11] it totally is on me because I didn’t know what to call it. Posted that. And Matt was like, all right, you’re the one talking. I’m like, no.
Matt Siltala: [00:01:20] So this came
Dave Rohrer: [00:01:21] from,
Matt Siltala: [00:01:22] yeah, no pressure. But, so this came from an article that you read on a search engine round table or something that a
Dave Rohrer: [00:01:28] John S E round table confined you?
Yeah, that’s all right.
Matt Siltala: [00:01:34] He can come all the way to Phoenix if he wants, but the se round table, I guess. Um, yeah.
Dave Rohrer: [00:01:44] Well, and it’s based on a Reddit comment from Mr. John Mueller. Um, and it, um, some of the things that I initially saw, John, where, uh, I think it’s actually Barry adding it, so I’m not sure, but the actual quote [00:02:00] from John Mueller is, um, uh, regarding XML sitemaps was to automate it on your backend, generate the files based on your local database.
That way you can ping site map files immediately when something changes. Um, dah, dah, dah, dah, dah. Don’t crawl your own site. Google already does that. For many projects, clients, CMSs, um, I don’t, it doesn’t always work that smoothly though, with the backend systems, and that’s my, my big problem with, um, this, this recommendation.
Um, and if you didn’t catch our recent episode where we talked about how best practices aren’t always best practices. There’s many times where I will do an audit and I will work with a client in the past where using the CMS’s built-in or the plugin, that extender.
Matt Siltala: [00:02:57] Oh, geez. Yeah.
Dave Rohrer: [00:02:59] That, you know, [00:03:00] because it doesn’t come by default.
Um, and that’s the other thing that’s interesting is that in this day and age. WordPress is used by what? Over 50% of the Internet’s. Yeah. Or the CMSs, depending on which number you look at. By default, it does not have a sitemap.xml creator. So think about that.
Matt Siltala: [00:03:26] Interesting.
Dave Rohrer: [00:03:27] Google’s telling you don’t crawl your own site.
WordPress, one of the largest used, most frequently used CMSs across the web, does not come built with one yet. Like there is no way for it to do it. You have to use a Yoast or some other type of plugin. There’s a Google XML sitemap builder plugin. Yoast does it. Um, I think the newer one, Rank Math, does it and it lets you control.
Some of them give you a lot of control. Some of them don’t. And the number of times [00:04:00] where that built in or that plugin creates a site map based on your local database. How often do we not want pages that are in that local database to show up in that site map? A lot. Like most recently. I am for my conference, popup.com site did not realize that when I flipped on, you know, the XML site map thing and went about my business, that it was plugging in, creating pages based on things that were not pages to me.
And so let me, let me step back a second. So for. Every session that would happen during a conference day. I had to create basically in the back end of WordPress, like not a post, but it was kind of a post. It’s just a different type
Matt Siltala: [00:04:57] add or go, go ahead.
Dave Rohrer: [00:04:58] Sorry. [00:05:00] Well, so like each author or each speaker had their own quote unquote page.
What’s that? Okay. Yeah, so in the back end each. Fragment or you know, whatever it was, it was a custom post type had to have quote unquote, a URL in the database so that knew what to pull in, you know, whatever. There’s no links to it on the page, but when you do in the site map, if you don’t know or if you don’t look or you don’t have that level of granular control, I didn’t realize that all these different custom product page types were actually being dropped into the site map.
I had blocked, you know, the date archives and author archives and stuff like that, but I didn’t realize that all these other custom ones were being dropped in there or that they’re even creating URLs that were any way crawlable.
Matt Siltala: [00:05:54] So, yeah, that’s just very interesting to me.
Dave Rohrer: [00:05:58] And if you have an [00:06:00] eCommerce site and you have all sorts of different faceted navigation, there is no.
Those, those URLs are not even in your database. Yeah. There’s a category page, but there’s no other way to build those other than by literally crawling the site,
Matt Siltala: [00:06:17] which is what he says
Dave Rohrer: [00:06:19] not to do. Yeah. So that was what got me thinking about the number of times where there was literally no other way. Um, and I pinged a friend of mine and former, um.
I guess client that we’d worked on a project with and we, I won’t throw the CMS under the bus now. I thought about it and I confirmed with them that I knew which one it was, but I will not throw it under the bus. Um, we spent nine months of deprioritizing the ability to make a site map on the back end because out of the [00:07:00] box.
That CMS kind of had one, but it was broken. So it took us nine months to be able to prioritize and get them to fix it. Cause each time they fixed it, it was wrong. Jeez. And all it was doing was it was creating a site map on the product side or the production level that was pulling the staging URLs.
For nine months we did. We basically had to suck it up and didn’t have one.
Matt Siltala: [00:07:30] That sounds like a nightmare.
Dave Rohrer: [00:07:31] Well, especially when you were changing lots of URLs and we did a site redesign and a URL change. So we actually built it with the sites and the pages we knew about and made sure that one, we weren’t including pages that were redirecting because that’ll get you in trouble with Bing much quicker than it will with Google.
At least it had in the past. Um. Like Google, Google has a threshold of, you can throw all sorts of junk in your site map. [00:08:00] Um, I don’t know what that threshold is for Bing anymore. Um, I haven’t tested it lately, but I know Duane at one point and there was some other people that actually came out and said that, you know, Bing’s threshold for junk in that file is very, very low.
Like they expect it to be clean and optimized. So if you’re throwing stuff in there that redirects because it’s in your back end, but you’ve actually changed stuff, you know, or you haven’t updated it because you killed off your plugin or something like that, you’re asking for problems. Interesting. Um, and I also thought it was interesting that, and I think it was Barry, but just the, the idea.
That the number of CMSs that come with an XML sitemap solution built in is large, and it’s not actually, so I already talked about WordPress and I have like all sorts of tabs open. [00:09:00] Um, and I’ll give you links to all of them too.
Matt Siltala: [00:09:03] I know, I was going to say you have quite a. List of sites here in our notes.
I don’t know.
Dave Rohrer: [00:09:07] I dropped a whole bunch.
Matt Siltala: [00:09:08] Let’s, you are wanting to talk about with all of them.
Dave Rohrer: [00:09:11] You know, Matt was like, Holy cow. I’m like, don’t, you don’t have to read them all. Don’t worry. Um, it’s just the notes for when we, when we pushed this live that we have them all there. Um, you know, like there was a, uh.
An article on WP beginner, and it talks about site maps and it, it recommends a Yoast and it says, uh, the Google XML site maps, which I’ve also used in the past. Um, both of them work, um, even today. So, you know, out of the box, again, WordPress doesn’t have one. You have to install the plugin. All right, well, what about something else?
Drupal? Drupal does not have one. You have to install an extension. And there’s a number of extensions in the directory, but again, [00:10:00] Drupal isn’t used at all as much as WordPress, but I picked some of the top five or 10 depending on your source, CMSs that are out there, and by far, WordPress is the number one, does not have one.
By by default, Joomla, one of the larger ASP. I think it’s, if not the largest ASP kind of based. CMS does not have one. So Joomla doesn’t out of the box. Shopify does. And the one thing that I noticed is a Wix, Squarespace, and Shopify, the all hosted, everything contained type of solutions actually do out of the box.
But personally, I don’t know to the level of control that you’ll get with all of those. So if there’s internal pages, I think there’s usually like a little button or a switch so you can hide stuff. But that’s at a very manual level. And I think at a page [00:11:00] level, so whatever CMS you’re on or looking at, you really have to look deep into the FAQ.
Matt Siltala: [00:11:08] It’d be neat to, if any listeners are. You know, if anyone that’s listening to this one specifically works with one of these, you know, like Shopify or Joomla or whatever, Wix, um, drop in some comments and give us some feedback on it. It’d be,
Dave Rohrer: [00:11:22] I’d be, if it’s great or if you want to bang your head against the,
Matt Siltala: [00:11:27] I’d love to, I’d love to know that.
Dave Rohrer: [00:11:30] because depending on which one you use, um, I think it’s queer space or Shopify that always has the collections. Um. And what if you don’t want, what if you’ve got multiple collections set up in weird ways? Like, do I have to very, very manually. You have to control it. There is no easy way to control it.
You have to go into each one and say, you know, hide this one and you can do that in WordPress and stuff. But yeah, it’s, it’s annoying. [00:12:00] Um, Sitefinity has one, but honestly it doesn’t, I don’t love it. Um. But you, you’re, the amount of control you get over it is very, yeah. Um, Drupal is one that I was just working with in the recently, the amount of control you get, and it does kind of, I think it’s out of the box.
Um, which is one of the few open source larger ones, but I don’t know how much control you get out of it. Um, I know what, maybe it’s not. Maybe it’s not out of the box as I’m looking at it. Download an extent. Okay. Yeah. So I correct myself. It is an extension module. So again, Drupal out of the box does not create XML sitemaps. So basically any open source or any main one does not create it.
Matt Siltala: [00:12:55] So what would you, I mean, what would you, what would be your argument? Um. [00:13:00] Or I guess
Dave Rohrer: [00:13:00] what to do or
Matt Siltala: [00:13:02] not argument. Yeah. Like how would you, how would you approach John and tell him, look, I respect you. You’re wrong on this one, but here’s, here’s the real world.
Oh, no, he’s
Dave Rohrer: [00:13:12] right. I think it just depends.
Matt Siltala: [00:13:14] Well, that’s what, no, that’s what I mean.
Dave Rohrer: [00:13:16] I mean, for many sites, I think crawling your site and maintaining it is the worst possible thing to do. And I think it should be your last option.
Matt Siltala: [00:13:24] But see, that’s what I’m saying. There’s so many people that are in these scenarios that you’re talking about.
And if they hear that from him, it’s going to confuse them. And so it’s like, how do you, how do you go about getting, I don’t know the right information out to .
Dave Rohrer: [00:13:38] No, I honestly, I know I don’t disagree with them. I just think I wanted to kind of cover all of the options, as many options as I could in our 15 to 20 minutes.
And walk people through what I think they should think about with their own. Because like, like I said, we spent nine months with it trying [00:14:00] to get them to fix what should have been a builtin XML site map thing. And honestly, for some sites you don’t need it. If you go to my personal site, I don’t have one.
If you go to my agency site, I don’t have one. Um, my personal site is 15 or 20 pages and my agency site is like 15, 20 pages. I don’t have them. Don’t care. Yeah. You don’t need it. Um, I think I have it turned off.
Matt Siltala: [00:14:26] What would you say is the threshold for needing one
Dave Rohrer: [00:14:29] Large, complex sites. And I think that’s the problem is that the larger the site and the more complex it is, the more likely you do have to crawl or you do have to come up with some sort of weird ad hoc XML site map system.
Because. I haven’t done it lately, but for really large sites with really complex or really just millions of pages, we would actually create site maps in the past based on [00:15:00] category levels or products or certain things. And it was like a mix of crawling the site, honestly, to understand what percentage was being indexed and what.
Traffic was coming from. And where were the errors and was it a template issue? Was it a site issue? Um, you know, was it just a content issue? We would use them a lot of times to help us do bug or testings. And you can’t do that without crawling or without doing some sort of export. But then you, it’s a very manual process.
The, um, another interesting article was. Uh, through, it’s Onely, uh, Onely. Yeah. I’m over
Matt Siltala: [00:15:48] in Poland as well.
Dave Rohrer: [00:15:50] Yeah, I’ll drop it. But it’s a deep dive into medium, which I guess technically is its own CMS.
Matt Siltala: [00:15:58] Oh, medium is, [00:16:00] yeah. I didn’t know they qualified it as its own CMS
Dave Rohrer: [00:16:02] via that they talk about it. And medium is number one 52 in terms of popularity among content, CMSs.
Um, but 2.4% of the most popular websites use it as a blog. Especially large brands.
Matt Siltala: [00:16:16] I know it ranks well. And I know it’s beautiful for reputation management purposes.
Dave Rohrer: [00:16:19] It is. Yes. But this was a deep dive into some of the problems their sitemaps have.
Matt Siltala: [00:16:28] Interesting.
Dave Rohrer: [00:16:31] So, you know, even large CMSs and large sites can have these problems.
And I, is there an easy answer? I don’t. I think if you can, by all means use the plugins. It saves so much time. Um, when we had that nine months of headaches every, every month, we would have to, if we wanted to, we would go through and just grab the URLs that had been posted on the blog. Or if there was any [00:17:00] other, you know, big changes, we would have to manually add it.
And that’s annoying. It is. Yeah. But I think in some cases it’s really the best way to go about it or to build some sort of backend system, but I think there’s also diminishing returns for that. But I think it’s a case by case basis. For what type of system is it out of the box default type of, you know, using just a plugin and letting it go.
But. If you do that, honestly, take a look, let it produce it. Go and double check and make sure that there’s no, you know, download thank you pages in there. Um, the number of times where I see those get in there because people don’t think that or don’t know or aren’t aware that when they’re creating a thank you page or a download page in WordPress, if they don’t tell it to hide it, so that.
[00:18:00] Email capture that lead gen page now has the thank you page indexed on the site map in Google because someone didn’t know that they had to click a little button to not make it show up in the sitemap
Matt Siltala: [00:18:16] where you guys go
Dave Rohrer: [00:18:17] because it’s automagically done. Yeah. I’m, I’m all about saving time dev resources, being efficient.
I think. Weirdly, the XML site map, if you have a small site, just don’t use it is really my suggestion. Don’t even use it. The larger your site is, the more you need to babysit it because I think it actually tends to cause more problems than help. Like a small to medium size site. From what I’ve seen, I just see more people cause problems that we have to then later fix than anything else.
Matt Siltala: [00:19:00] There you go guys.
Dave Rohrer: [00:19:02] So I guess this is a guide on how not to screw up your site map
Matt Siltala: [00:19:06] is going to be really nerdy.
Dave Rohrer: [00:19:09] Well, I didn’t get too technical. I didn’t say like what tools to scrape with and stuff, but I mean, we
Matt Siltala: [00:19:14] talked about site maps the whole time.
Dave Rohrer: [00:19:16] Well, so here, cause I don’t want to just be like, I don’t want it sound like I’m venting or, or, or whining.
The, the tip I would suggest when building a site map, if you haven’t done it or if you’re working on a new site or client site, is to treat it like a content audit. Go through, crawl the site, look at your analytics for the last year and look, pull out every URL. You look at your Google Search Console for the last 16 months or a year.
Pull out all of the URLs there. Crawl the site, and you can do this in like Sitebulb. Um, I think even OnCrawl and JetOctopus and, uh, [00:20:00] Screaming Frog, all of those will bring in like all sorts of different data and will show you where the overlap of things is. And you can even compare it to your current site map if you have one.
But grab as much data as you can. And even if you can export your whatever your CMS is. List of URLs, compare them all together and look at, then go through it and dig really deep into what pages should be in there, which ones shouldn’t, which ones are missing, and then figure out a way to build it that one time or figure out where your solution is breaking down or if you really need it.
You know, if your site is not that big. Honestly, I say forgo it. In this day and age, if you have a 20 page site. I don’t understand why you really need to build one. I might be in a minority on that. I don’t think so.
Matt Siltala: [00:20:52] Well, I’m sure. I’m sure the Internet’s will let us know.
Dave Rohrer: [00:20:56] I’m sure someone will correct me.
Matt Siltala: [00:20:59] With [00:21:00] that said, you guys got a whole lot of tips on that last final tip. That was a great, great, uh,
Dave Rohrer: [00:21:05] great Excel. Look at your site map XML this week.
Matt Siltala: [00:21:08] Exactly. All right. Well there you guys go. Your guide to site sitemap, uh, XML files. And so. Thank you very much, Dave, for, for all that information. And uh, you know, thank you guys for joining us.
I appreciate the time that you spent with us and, and listening, you know, for, for 20 minutes, for 20 minutes plus on a site maps. You guys were all, you guys are awesome. And so for Dave Rohrer with NorthSide Metrics, I’m Matt Siltala with Avalanche Media and we will, uh, look forward to having you guys on the next one.
Dave Rohrer: [00:21:36] Bye guys. Thanks.