Hi

I reviewed the module and I cannot find anything about what the module will do if I have 50,000+ URLs. As you may know, the maximum allowed URL count per sitemap XML file is 50,000. For more links you need to split the output into more than one file and create a sitemap index file.
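For reference, a sitemap index under the sitemaps.org protocol looks roughly like this (the hostnames and file names below are placeholders, not output from this module):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemap1.xml.gz</loc>
    <lastmod>2006-08-28</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap2.xml.gz</loc>
    <lastmod>2006-08-28</lastmod>
  </sitemap>
</sitemapindex>
```

Each `<loc>` entry points to a regular sitemap file, itself limited to 50,000 URLs.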

Are you able to fix this, please?

Regards
Marc

Comments

bharat’s picture

Status: Active » Needs review
Attached file: 7.32 KB (new)

We're having this problem on the Gallery website also. I modified gsitemap.module to create a sitemap index and provide individual sitemaps in 10k-node chunks. You don't get the full advantage of using an index when you chunk by node id, because if the chunks are large enough there's always one or two nodes in there that were modified recently. But if you set the chunk size down to something reasonably small (say, 1K), I suspect you'll save some bandwidth because Google won't download the chunks that haven't changed recently.
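As a rough illustration of the chunking arithmetic (function names here are hypothetical, not the actual patch code):

```php
<?php
// Hypothetical sketch of node-id chunking: with a fixed chunk size,
// compute how many sitemap chunks a site needs and which chunk a
// given node (by its 0-based position in id order) falls into.

function gsitemap_chunk_count($total_nodes, $chunk_size) {
  return (int) ceil($total_nodes / $chunk_size);
}

function gsitemap_chunk_for_node($position, $chunk_size) {
  return (int) floor($position / $chunk_size);
}
```

With 50,000 nodes and a chunk size of 10,000 this yields 5 chunks, matching the behavior described above.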

Firefox thinks that the XML output is well formed, but Google hasn't crawled our site yet with the new code so I don't know if Google likes it. But I'm attaching the patch for your inspection. Let me know if you want me to change anything.

bharat’s picture

Category: bug » feature

Just a followup to say that Google downloaded the sitemap index and then the 6 chunks with no problem:

gsitemap 2006-08-28 23:43 Sitemap chunk 3 downloaded by Google. Anonymous Coward (not verified)
gsitemap 2006-08-28 23:41 Sitemap chunk 0 downloaded by Google. Anonymous Coward (not verified)
gsitemap 2006-08-28 23:41 Sitemap chunk 1 downloaded by Google. Anonymous Coward (not verified)
gsitemap 2006-08-28 23:40 Sitemap chunk 2 downloaded by Google. Anonymous Coward (not verified)
gsitemap 2006-08-28 23:39 Sitemap chunk 4 downloaded by Google. Anonymous Coward (not verified)
gsitemap 2006-08-28 23:39 Sitemap chunk 5 downloaded by Google. Anonymous Coward (not verified)
gsitemap 2006-08-28 23:34 Sitemap index downloaded by Google. Anonymous Coward (not verified)

And on the Google webmasters site:

Googlebot has successfully accessed your home page. Last crawl date: Aug 27, 2006

So at least the XML is well formed :-)

marc.bau’s picture

Sorry, I haven't imported the data yet, and this can take some time, so I don't know how to test the 50,000 limit today.

Why use "gsitemap" as the alias? Why not use sitemap.xml and index_sitemap.xml, as Google does by default, and as the Yahoo sitemap generator does with an alias named "urllist.txt"? That way Google cannot learn that this content is generated dynamically *G*... hide the logic from Google as much as possible...

bharat’s picture

An easy way to test it is to modify the chunk size in the module. If you have 5,000 nodes, just set the chunk size to 1,000 and it should generate a sitemap index with 5 chunks for you.

m3avrck’s picture

Just subscribing for now. This is an issue I'm about to face as well so I'll have more thoughts / code soon too :-)

bharat’s picture

bump! We've been using this code in production on the Gallery website for the past 4 months with no problems.

m3avrck’s picture

The code looks to be Google-specific:

if(strpos(getenv('HTTP_USER_AGENT'),'Googlebot')) {

What about MSN and Yahoo, now that there's a standard sitemap format?
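A generalized check might look something like this (a sketch only; the bot list is illustrative and the function name is hypothetical). Note it also uses `!== FALSE`, since `strpos()` returns 0 for a match at the start of the string, which the quoted line would treat as "no match":

```php
<?php
// Hypothetical generalization: recognize any known sitemap-aware
// crawler in a user-agent string, not just Googlebot.
function gsitemap_crawler_name($user_agent) {
  $bots = array('Googlebot', 'msnbot', 'Slurp');  // Slurp is Yahoo's crawler.
  foreach ($bots as $bot) {
    // strpos() can return 0 (position zero), so compare against FALSE.
    if (strpos($user_agent, $bot) !== FALSE) {
      return $bot;
    }
  }
  return FALSE;
}
```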

hass’s picture

But the line you are writing about is not a problem... if it is not Google, the XML is still downloaded, and the real user agent and IP are logged in watchdog. OK, it's extensible, but I don't think this logic makes sense for the future. Maybe other search engines will support XML sitemaps too, and then the module would need to be extended for every engine just to have a "nicer" watchdog entry, which makes little sense to me... However, this is off topic here.

SamAMac’s picture

Assigned: Unassigned » SamAMac
Status: Needs review » Fixed

I updated this patch to support the new sitemaps.org namespace and committed it to 4.7 dev.

Anonymous’s picture

Status: Fixed » Closed (fixed)
darren oh’s picture

Version: 4.7.x-1.x-dev » 5.x-1.x-dev
Assigned: SamAMac » darren oh
Status: Closed (fixed) » Patch (to be ported)
darren oh’s picture

Status: Patch (to be ported) » Needs review
Attached file: 7.28 KB (new)

I need someone to test this patch for the Drupal 5 version.

jt6919’s picture

When I apply the patch, I get a weird error when I go to www.mywebsite.com/gsitemap:

Error loading stylesheet: A network error occurred loading an XSLT stylesheet: http://www.mywebsite.com/modules/gsitemap/gss.xsl

darren oh’s picture

Attached file: 24.85 KB (new)

See how this patch works.

hass’s picture

Sorry, but what does this patch, containing CSS and image links from http://www.baccoubonneville.com/gss.jpg, have to do with this gsitemap module, which produces only an XML response?

darren oh’s picture

The 4.7 version has this so that users can see that a sitemap is being generated correctly. I ported all the features from the 4.7 version because that is how users will expect the 5 version to work. If you think the features are inappropriate, you should open a new issue requesting that they be removed from both versions. I will mark this as fixed as soon as someone confirms that an index and multiple sitemaps are being generated.

jt6919’s picture

An index sitemap was generated for my site www.celebritynewslive.com, and I had set it in the admin options to break it down in chunks of 1,000. This site has about 23,000 nodes, so the sitemap index contained 23 parts (about right for 1,000 URLs per part).

However, when I click on each part from my sitemap index at www.celebritynewslive.com/gsitemap, each part (chunk) contains 16,000+ URLs. That seems to be the only problem.

hass’s picture

PLEASE name the URLs the way the Google generator names them. We shouldn't tell search engines what software we are using...

sitemap.xml
sitemap1.xml.gz
sitemap2.xml.gz
sitemap3.xml.gz
...

And finally, we should really change the way they are created... it should be done via cron, and only when changes have happened. That way your server will not be DoSed by 50,000 lookups with the url() function. I opened a bug for this...
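A minimal sketch of that cron-driven approach (the rebuild helper `_gsitemap_rebuild_files()` is hypothetical; `hook_cron()`, `variable_get()`, `variable_set()`, `db_query()`, and `db_result()` are standard Drupal APIs of this era):

```php
<?php
// Sketch only: rebuild the sitemap files on cron, and only when content
// has actually changed since the last build, instead of on every request.

// Pure helper: does the newest node change post-date the last build?
function gsitemap_needs_rebuild($latest_change, $last_build) {
  return $latest_change > $last_build;
}

function gsitemap_cron() {
  $last_build = variable_get('gsitemap_last_build', 0);
  $latest = db_result(db_query('SELECT MAX(changed) FROM {node} WHERE status = 1'));
  if (gsitemap_needs_rebuild($latest, $last_build)) {
    _gsitemap_rebuild_files();  // hypothetical: write sitemap*.xml.gz chunks
    variable_set('gsitemap_last_build', time());
  }
}
```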

darren oh’s picture

Status: Needs review » Fixed

Work on this issue is done. Bugs and requests for changes should be reported in new issues.

jt6919’s picture

I did want to add (for anyone using this patch) that if you have fewer than 10,000 nodes and tons of taxonomy terms, or Yahoo-extracted terms, etc., you will want to uncheck the option to include terms in your sitemap. It will eat your server processor alive if you don't... it also unnecessarily bloats your sitemap ten-fold. If you take out the terms, you limit duplicate content and listings, which is better for your SEO.

Anonymous’s picture

Status: Fixed » Closed (fixed)