Practically everyone who uses Drupal wants their site to be added to search engines. That's fine and good with static HTML websites but it gets a lot harder with dynamic ones. You can rely on bots to roam your site effectively, but until the XML Sitemap module there wasn't a good way to automatically generate a sitemap so search engines can find every page you want included.

XML Sitemap doesn't just create a sitemap, it also submits the sitemap to search engines when you want it to (on the 5.x version, that means on cron runs or when the sitemap gets updated - there's a pending feature request to add configurable times to the 6.x version).
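For anyone who hasn't seen one, here is a rough sketch of what "generating a sitemap" amounts to, following the sitemaps.org protocol (a `urlset` of `url` entries, each with a `loc` and an optional `lastmod`). This is plain Python for illustration, not the module's code; `build_sitemap` and the example.com URLs are made up.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML for an iterable of (url, last_modified) pairs.

    build_sitemap is a hypothetical helper; last_modified is an ISO date
    string such as "2008-05-01".
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, lastmod in pages:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Made-up pages standing in for a site's nodes.
example_pages = [
    ("http://example.com/node/1", "2008-05-01"),
    ("http://example.com/node/2", "2008-05-03"),
]
sitemap_xml = build_sitemap(example_pages)
print(sitemap_xml)
```

On a dynamic site the hard part is not writing the XML but keeping the list of pages and their modification dates current.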

The module isn't fully working in the 6.x port yet, but I think that such a commonly used feature should be in 7.x core.

Thoughts?

Module: http://drupal.org/project/xmlsitemap
Port to 6.x: http://drupal.org/node/157533

Comments

wayland76’s picture

+1

john.kenney’s picture

+1

lancets’s picture

+1

CompShack’s picture

+1 for core - a must module for SEO

catch’s picture

OK a few things with this:

1. In my opinion, core should only handle things which are difficult to do in contrib, or make things in contrib much easier, not simply because they're popular (and there's more popular modules than xmlsitemap which aren't in core).
2. I think I used the google sitemap module for a week or so back in 4.7, and with 10,000 nodes or so on a small VPS it used to almost bring my site down every time it got hit. Has this been dealt with in the new version at all?
3. The specification has changed within the past couple of years. Once it's in core, it's frozen for a year at a time (like jQuery), so if there's an update it'll be much harder to modify than if it stays in contrib.

So I'm finding it hard to see the benefit of this moving into core.

hass’s picture

@catch: Yep, this overloading of a site still exists... in the latest D6 patch it is not fixed yet, but hopefully we can solve this with the D6 Batch API... The main issue is the load that the install step puts on your site. The sitemap installer does one or more inserts into an xmlsitemap table for every node that exists. This will overwhelm your server, or rather, PHP will run out of time (the main issue). I tested it with 18,000 nodes once, and Darren added many performance enhancements in D5 times... it behaves much better now, but the issue is not fully solved.

However, we must move to the Batch API to get it fully fixed...
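To illustrate the idea (this is not Drupal's Batch API, just a sketch of the principle in Python): instead of inserting a row for every node in one request, the work is cut into bounded chunks so that no single step runs long enough to hit a timeout. `index_nodes_in_batches`, `insert_rows` and the batch size are hypothetical names and values.

```python
def index_nodes_in_batches(node_ids, insert_rows, batch_size=500):
    """Insert sitemap rows for node_ids in bounded chunks.

    insert_rows is a hypothetical callback that stores one chunk of node
    IDs. After each chunk the running total is yielded, so a caller could
    stop there and resume in a later request or cron run.
    """
    processed = 0
    for start in range(0, len(node_ids), batch_size):
        chunk = node_ids[start:start + batch_size]
        insert_rows(chunk)
        processed += len(chunk)
        yield processed


# Stand-in for the xmlsitemap table: a plain list.
sitemap_table = []
progress = list(
    index_nodes_in_batches(list(range(18000)), sitemap_table.extend, batch_size=5000)
)
print(progress)
```

Each step touches at most `batch_size` nodes, so the cost of a single request stays roughly constant no matter how many nodes the site has.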

wayland76’s picture

Status: Active » Postponed

Hmm. I believe that there's a redesign in the works too. Because of the code freeze, I think we should postpone it until after that, at the very least.

hass’s picture

D7 is in code freeze??? Maybe in half a year, but I haven't read anything about a date yet...

icecreamyou’s picture

> In my opinion, core should only handle things which are difficult to do in contrib, or make things in contrib much easier, not simply because they're popular (and there's more popular modules than xmlsitemap which aren't in core).

I understand where you're coming from, but it's not that this is just some popular thing that a lot of people use--it's basically essential to any website. Anyone who doesn't have a sitemap only doesn't because they don't know to do so or their site is for private use only. That's not the same as, say, CCK.

I actually think we should do the same thing with this module as Update Status or CCK though - incorporate the crucial parts into core and leave the rest in an "Advanced Sitemap" module.

CompShack’s picture

Maybe it doesn't have to be as fancy as the current module. But at least something in core that will create a sitemap!

catch’s picture

> it's basically essential to any website. Anyone who doesn't have a sitemap only doesn't because they don't know to do so or their site is for private use only.

My main Drupal site only ever had a sitemap for one week (which made it crash), it's got over 40,000 pages indexed by Google and gets over 200,000 referrals from them per month. Please explain why it's essential.

icecreamyou’s picture

I guess I should have added "or your site is too big" to my last message.

> Please explain why it's essential.

For sites that are either new or relatively low in Google PageRank - that is, the vast majority of Drupal sites - sitemaps tell search engines to update their information and ranking for your site. They also get pages included in searches that wouldn't otherwise be there. I don't think I need to explain why good search engine karma is essential.

With a large number of nodes it could certainly turn into a problem. However, for most Drupal sites it wouldn't be an issue, and for sites that could use a Sitemap tool without problems it enhances their popularity and growth.

I have three ideas for warning large sites that they should disable the Sitemap tool if the site has too many nodes.

1) Provide a status setting that becomes an error if there are over a given number of nodes on a site.
2) Warn the user when the module is enabled in the same way that users are warned when they enable modules that depend on other (disabled) modules.
3) Set an error message on enabling the module if the site has too many nodes, or a standard status message if the site doesn't have too many.

EDIT: Forgot about the new Batch API. Never mind about the overload then.

flickerfly’s picture

+1

hass’s picture

As far as I know there is no issue if you run gsitemap/xmlsitemap from the first day... the overloading issue is temporary... after all nodes are inserted the issue goes away. There are a few ways around it... wait some time for cron to complete this, or do it by hand and ignore the PHP timeouts. For D6 we can solve this in a much better way.

wayland76’s picture

I'm not saying that there's a code freeze right now, but that inclusion in core makes the code "more frozen", so we should wait until after the big rewrite (2.x?) before we try to get it into core. My estimate is that, even if the 2.x rewrite is done by the end of the year, it'll probably be too late for D7, but I'd still like to see it in D8.

icecreamyou’s picture

Fair enough--but getting a 2.x requires Darren Oh to be cooperative. We might be better off just writing 2.x for core since fixes for multisite/multilingual issues have generally been turned down.

wayland76’s picture

@IceCreamYou: There is that, although the multisite one I've been working on ( http://drupal.org/node/202294 ) hasn't been turned down, simply ignored. But it's been getting attention this week from non-Darren people. We'll have to see what happens.

I mentioned 2.x because Darren apparently has some big plan for 2.x. I have some ideas too, which I'm planning to put into a 2.x issue soon. But if we don't get some co-operation within the next month, I wouldn't be surprised if there's enough people interested that we could fork the module. But I'd prefer it if we didn't have to.

wayland76’s picture

To see the ideas for the 2.x version: http://drupal.org/node/253762

lancets’s picture

> My main Drupal site only ever had a sitemap for one week (which made it crash), it's got over 40,000 pages indexed by Google and gets over 200,000 referrals from them per month. Please explain why it's essential.

With that many pages indexed by webcrawlers, I'm surprised you don't agree that it's important (I won't go as far as to say essential). A while back I had another CMS package with Gallery on a local box with limited bandwidth to the internet. Over time, I built up my Gallery (and eventually Gallery2) to thousands of images. After a while, the web crawlers began discovering my site and crawling it regularly. Since it was loaded with images, they (especially the Googlebots) practically brought my internet connection to its knees every time they crawled, and I had crawlers from all over hitting it many times a day. And they'd keep coming back and downloading the same content over and over again.

That's when I discovered the benefit of sitemaps: a sitemap can state which content has changed, without the Googlebots needing to download the entire site just to see if anything changed. Soon after I fed the sitemap information to Google, I noticed a substantial drop in Googlebot downloading activity, as Google only needed to check the sitemap periodically to see if any content had changed. Since then, I've enjoyed improved performance of the site since they're not pounding on it for several hours per day, which far outweighs the few seconds of processing time for XML generation at the moment that Google comes to grab the sitemap.

Sure, you can control many of the bots with robots.txt, but rather than just blocking their ability to index the content, why not just allow them to do their work more efficiently? I've left the other CMS behind now, but the sitemap is probably the thing I miss the most, so I'm very anxious to see it supported in Drupal 6.x, whether it be core or contrib.
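The point about a sitemap telling crawlers what has changed can be sketched from the crawler's side. This is only an illustration in Python of how a `lastmod` date could be used; how real search engines actually treat it is up to them. `urls_changed_since` and the sample data are made up, and the sketch assumes `lastmod` values start with a full YYYY-MM-DD date.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Made-up sitemap with one old and one recently changed page.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/gallery/1</loc><lastmod>2008-01-10</lastmod></url>
  <url><loc>http://example.com/gallery/2</loc><lastmod>2008-05-20</lastmod></url>
</urlset>"""


def urls_changed_since(sitemap_xml, last_crawl):
    """Return the URLs whose lastmod is later than last_crawl.

    Hypothetical helper. Entries without a lastmod are treated as changed,
    since nothing is known about them.
    """
    root = ET.fromstring(sitemap_xml)
    changed = []
    for entry in root.findall("sm:url", NS):
        loc = entry.findtext("sm:loc", namespaces=NS)
        lastmod = entry.findtext("sm:lastmod", namespaces=NS)
        if lastmod is None or date.fromisoformat(lastmod[:10]) > last_crawl:
            changed.append(loc)
    return changed


to_fetch = urls_changed_since(SAMPLE, date(2008, 3, 1))
print(to_fetch)
```

One small XML download replaces a re-crawl of every image page, which is where the bandwidth saving comes from.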

Frieder’s picture

+1 from me

luti’s picture

+1

Anonymous’s picture

Not everybody will know how to figure it out. Setting up a sitemap.xml in my experience always requires a bit of tuning. Get it wrong and you can actually damage your SEO with things such as broken links, a badly set up robots.txt, or wrong .htaccess files, to name a few things off the top of my head.

For the sake of argument: whether search engines will use the sitemap.xml in the future remains to be seen, as does the importance they give to it. If it goes the same route as keywords, for example, or search engines start to work with sitemap.xpl (to name an example), it is stuck in core until the next release.

Reading about problems with large sites in the posts above. If this is the case this is something which could damage the reputation of Drupal as having buggy parts.

In my opinion you would be better off using a contributed module that allows for seamless integration.

So my vote goes for keeping sitemap.xml in something like a contributed module, handbook page, or something similar.

hass’s picture

@Designer:

> Reading about problems with large sites in the posts above. If this is the case this is something which could damage the reputation of Drupal as having buggy parts.

This is not a "Drupal" issue and has nothing to do with reputation.

Anonymous’s picture

I didn't say it was. I was referring to #5, #6 and #14.

Adding a component to core which would have those problems would not be good for the reputation of Drupal in my opinion. I think you misunderstood what I wrote or I misunderstood the posts in which case my apologies. If this needs to be clarified please let me know.

dman’s picture

Despite what some marketing folk would sell you, SEO is an optional function for some websites.

If your intention of the website is to make money, then pagerank is probably a priority to you. However, there are also intranets, extranets, hobby sites, members-only sites, private blogs and resources that do not need to push it.
Plus, it's actually dangerous to have sitemap enabled prematurely, when a site is being built. It's probably a good thing that amateur developers don't have it on by default.

It can easily continue to live as a contrib module, as it doesn't need to be tied any tighter to core code than is already possible.
Core is not for cool or even popular modules - it's for things that make the guts of the CMS possible. SEO tweaking is an optional choice.

icecreamyou’s picture

Yes, and despite what some people would tell you, update status and CCK are optional functions for some websites. So is Poll. So is Forum and Blog (although they're being migrated out now for different reasons). So is the ability to have a dynamic front page (after all, Views even comes with its own customizable version of the stock front page). So is the ability to have stickied posts.

"Some" websites that don't need these things means very few. Hobby sites, private blogs, and resources "don't need to push it" with their SEO but it always helps, especially because a lot of private blogs pay for their web hosting using advertising. And why wouldn't you want people reading what you write?

My only concern is members-only sites, in which I'm including intranets and extranets. I don't know how many of those Drupal has. However, it's the same deal with the update status module--sites that are stable don't really need it, but it's in core because it's incredibly useful to the point where it's practically an essential for any Drupal install. It's not like it needs to be part of core to get extra functionality -- in fact, when it went into core functionality was stripped out. Same with CCK.

I think the same should happen with the XML sitemap module--it should be diluted in core. It's something that's just so central to, as you call it, "the guts of the CMS" for almost every website.

luti’s picture

I agree this feature belongs in Drupal core more than some other modules that are already part of core. Besides, if it were optional, it could be switched off by default in any case...

darren oh’s picture

Status: Postponed » Closed (fixed)

Putting more features into core will eventually make core unmanageable. Contributed modules allow interested developers to collaborate much more effectively. Features should not go into core unless they change the way core works. CCK was such a feature. Views may be eventually. We have solved most of the problems in XML Sitemap without any changes to core. The remaining problem, of detecting when URL aliases are updated, is also not likely to involve changing core.

If specific changes to core can improve XML Sitemap, we can open issues for them. It is the responsibility of interested developers to become familiar enough with the code to know what those changes should be.

What would really help beginners would be an install profile that would include XML Sitemap. But that's not an issue for Drupal core.

asb’s picture

IMHO, Drupal Core does not need a tool to generate sitemaps for Search Engines; some might need it, others definitely don't, it's simply not a feature for Core. It's a feature for end users, not a part of a Framework.

However, what Drupal Core *badly* needs is a mechanism to handle hierarchically (by something like the menu or the book mechanism), or dynamically (by taxonomy) structured content. Such a mechanism would need to be able

* to scale massively (potentially millions of nodes, as the amazing "Eureka!" showcase illustrates: http://drupal.org/node/261340), and
* to output data about the site's structure to contributed tools like "XML sitemap" or *other* sitemap generators not targeted for search engines but humans (like "Sitemap" or "Sitemenu" modules).

Drupal core needs a feature enabling the Framework to *know* about the content it's maintaining, and an API that lets 3rd-party tools harvest this knowledge. That would allow scalability problems to be solved *once*, and solutions to be offered *multiple* times. That's what a Framework should do.

All tools like the "XML sitemap", "Site map", or "Site Menu" modules have the same problems coping with mid-sized sites (a few thousand categories, some 10k nodes), not even thinking about really large sites (imagine someone wanting to import a dump of en Wikipedia into Drupal with 2.5 million articles plus additional millions of discussion and administration pages). This problem should be solved conceptually *once*, and that mechanism should be in Drupal Core. The hard work invested in making "XML sitemap" perform reasonably on mid-sized sites could possibly offer a starting point for such a conceptual approach in Core.

Regards, -asb

icecreamyou’s picture

That's an interesting idea asb--will you open a separate issue?

asb’s picture

> will you open a separate issue?

Where would be the right place for this? Should this be an XML Sitemap issue, or a "feature request" for Drupal Core (e.g. http://drupal.org/node/add/project_issue/drupal/feature)?

-asb

icecreamyou’s picture

It's probably a core issue that would use code from XML Sitemap.

I have a bad feeling the issue will get bogged down though, and it's entirely possible there's another discussion about this somewhere.

Z2222’s picture

I don't think XML sitemaps should be added to the Drupal core. There is still debate over whether XML sitemaps are useful for SEO. (I work in SEO and I don't use them.)

XML sitemaps don't increase your rankings -- they only help search engines discover pages. Search engines don't need XML sitemaps to discover pages though. You just need a good internal linking structure. If you want Google to discover new content immediately, you can ping Google Blogsearch when you add new content.

More thoughts here:
http://groups.drupal.org/node/11304

Crell’s picture

@#29: Drupal already has a system for building large hierarchies. It's called the menu system. It already has an alternate interface for constructing them. It's called the book module. Those are already in core.

wayland76’s picture

There's also a "Node Hierarchy" module somewhere. I'm not sure the menu system solves the perceived problem.

http://drupal.org/project/nodehierarchy

It sounds to me like asb wants that module in core. That would be pretty cool, because then XML Sitemap would be able to depend on it.

asb’s picture

[ Node Hierarchy, http://drupal.org/project/nodehierarchy ]

> It sounds to me like asb wants that module in core. That would be pretty cool, because then XML Sitemap would be able to depend on it.

Yes and no; I think Drupal core needs to know about the relations of its nodes; those relations do not necessarily need to be hierarchical. That knowledge has to be somewhat abstract, so that contributed modules can harvest it. That's my point.

Some of this functionality is already in core (menus, books), but cannot be harvested by contributed modules like "XML Sitemap" as far as I know. Simple example: even a tightly hierarchical site that is completely structured into one "book" does not represent all the information required for a Google or Yahoo Sitemap. Also, there is a bunch of contributed modules that try to add this functionality; some examples:

* Simple Sitemap, http://drupal.org/project/simplesitemap
* Node Relativity, http://drupal.org/project/relativity
* Related Links, http://drupal.org/project/relatedlinks
* Site map, http://drupal.org/project/site_map

Some of these modules help create relations between nodes, but don't display them; some of them solely display nodes and their (hierarchical) relations; some of them are targeted at end users, others only at machines. All of them need the same information about a site's structure, and that's what IMHO needs to be in Core - nothing else.

I've created a feature request for this: Abstraction layer plus API for structural knowledge about a site's content in Core. Those who like the idea of integrating structural knowledge about a site's content into Core should join in there; I'm hoping that there will be pointers to similar discussions.

Greetings, -asb

thedevnull’s picture

YES, I would LOVE to see this module in core! Sitemaps belong in core as a feature that will be virtually required moving forward. Drupal is amazing but can improve on its SEO capabilities out of the box.

Also desirable would be adding a few existing modules to the core:

XML Sitemap - sitemaps are not optional for SEO
Meta tags - still needed by many engines, the configurable title tags here are also vital.
Path auto - clean urls don't cut it for SEO, they need to be more configurable. At this point they are not even humanly readable or usable either!

I hope XML sitemaps become part of core like RSS has because it's fundamentally important to the success of a website. Keep up the great work, you will always have my support. =)

thedevnull’s picture

I totally agree with a simple sitemap creation option in core. Doesn't have to be as complicated as the current module...

Z2222’s picture

I think it would be a mistake to put the sitemaps module in core. It's not essential for SEO. I think sitemaps benefit Google more than webmasters.

Vanessa Fox from Google says, "It’s really not about the ranking; it’s more about crawling… Sitemaps doesn’t impact your ranking at all."

(source)

avpaderno’s picture

Status: Closed (fixed) » Active

It's not essential, but it helps. Moreover, a sitemap is not just used by Google, but also by Microsoft Live, Yahoo, and Ask.com (at least).

Rather than adding the module, I would like to see some of the code ported, as the base for other modules. Apart from normal sitemaps, there are other types of sitemaps, like the code search sitemap (which is currently only used by Google); therefore, there could be a module for the normal sitemap, a module for code search sitemaps, etc...

hass’s picture

Status: Active » Closed (fixed)