Status: Closed (fixed)
Project: Google Sitemap
Version: 5.x-1.x-dev
Component: Code
Priority: Normal
Category: Feature request
Assigned:
Reporter:
Created: 19 Aug 2006 at 23:58 UTC
Updated: 9 Apr 2007 at 16:17 UTC
Comments
Comment #1
bharat commented: We're having this problem on the Gallery website also. I modified gsitemap.module to create a sitemap index and provide individual sitemaps in 10k-node chunks. You don't get the full advantage of using an index when you chunk by node ID, because if the chunks are large enough there are always one or two nodes in there that were modified recently. But if you set the chunks down to something reasonably small (say, 1K), I suspect you'll save some bandwidth because Google won't download the chunks that haven't changed recently.
Firefox thinks that the XML output is well formed, but Google hasn't crawled our site yet with the new code so I don't know if Google likes it. But I'm attaching the patch for your inspection. Let me know if you want me to change anything.
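For reference, a minimal sketch of what the index-plus-chunks approach described above might look like in a Drupal 5 module. This is not the attached patch; the function name, the chunk-size constant, and the sitemaps.org namespace (which comment #9 later adopts) are illustrative assumptions.

```php
<?php
// Minimal sketch of the index-plus-chunks approach; not the attached patch.
define('GSITEMAP_CHUNK_SIZE', 10000);

/**
 * Emit a sitemap index that points at one sitemap page per chunk of nodes.
 */
function gsitemap_index_page() {
  $count  = db_result(db_query('SELECT COUNT(nid) FROM {node} WHERE status = 1'));
  $chunks = ceil($count / GSITEMAP_CHUNK_SIZE);

  $output  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
  $output .= '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
  for ($i = 0; $i < $chunks; $i++) {
    // Absolute URL of each chunk, e.g. http://example.com/gsitemap/0
    $output .= '  <sitemap><loc>' . url('gsitemap/' . $i, NULL, NULL, TRUE) . "</loc></sitemap>\n";
  }
  $output .= "</sitemapindex>\n";

  drupal_set_header('Content-Type: text/xml; charset=utf-8');
  print $output;
}
```

Chunking by node-ID range keeps the contents of most chunks stable between crawls, which is what would let a crawler skip re-downloading the ones that haven't changed.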
Comment #2
bharat commented: Just a follow-up to say that Google downloaded the sitemap index and then the 6 chunks with no problem:
gsitemap 2006-08-28 23:43 Sitemap chunk 3 downloaded by Google. Anonymous Coward (not verified)
gsitemap 2006-08-28 23:41 Sitemap chunk 0 downloaded by Google. Anonymous Coward (not verified)
gsitemap 2006-08-28 23:41 Sitemap chunk 1 downloaded by Google. Anonymous Coward (not verified)
gsitemap 2006-08-28 23:40 Sitemap chunk 2 downloaded by Google. Anonymous Coward (not verified)
gsitemap 2006-08-28 23:39 Sitemap chunk 4 downloaded by Google. Anonymous Coward (not verified)
gsitemap 2006-08-28 23:39 Sitemap chunk 5 downloaded by Google. Anonymous Coward (not verified)
gsitemap 2006-08-28 23:34 Sitemap index downloaded by Google. Anonymous Coward (not verified)
And on the Google webmasters site:
Googlebot has successfully accessed your home page. Last crawl date: Aug 27, 2006
So at least the XML is well formed :-)
Comment #3
marc.bau commented: Sorry, I haven't imported the data yet, and that can take some time, so I don't know how to test the 50,000 limit today.
Why use "gsitemap" as the alias? Why not use sitemap.xml and index_sitemap.xml, as Google does by default, and as the Yahoo sitemap generator does with an alias named "urllist.txt"? That way Google can't learn about this dynamically generated content *G*... hide the logic from Google as much as possible.
Comment #4
bharat commented: An easy way to test it is to modify the chunk size in the module. If you have 5,000 nodes, just change the chunk size to 1,000 and it should generate a sitemap index with 5 chunks for you.
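As a quick illustration of the arithmetic (plain PHP, nothing module-specific):

```php
<?php
// Quick illustration of the chunk math: 5,000 nodes at a 1,000-node
// chunk size should yield a sitemap index listing 5 chunk sitemaps.
$node_count = 5000;
$chunk_size = 1000;                          // lowered from the default for testing
$chunks = ceil($node_count / $chunk_size);   // 5
```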
Comment #5
m3avrck commented: Just subscribing for now. This is an issue I'm about to face as well, so I'll have more thoughts / code soon too :-)
Comment #6
bharat commented: Bump! We've been using this code in production on the Gallery website for the past 4 months with no problems.
Comment #7
m3avrck commented: The code looks to be Google-specific:
if (strpos(getenv('HTTP_USER_AGENT'), 'Googlebot')) {
What about MSN and Yahoo, now that this is a standard sitemap format?
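A sketch of how that check could be generalized so the watchdog entry names whichever crawler fetched the sitemap; the list of user-agent substrings is illustrative, not exhaustive, and this is not code from the patch.

```php
<?php
// Sketch only: return the name of a known crawler from the user agent,
// or FALSE for ordinary visitors. Compare strpos() against FALSE
// explicitly, since the substring can legitimately sit at position 0.
function gsitemap_crawler_name() {
  $agent = getenv('HTTP_USER_AGENT');
  foreach (array('Googlebot', 'msnbot', 'Yahoo! Slurp') as $bot) {
    if (strpos($agent, $bot) !== FALSE) {
      return $bot;
    }
  }
  return FALSE;
}
```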
Comment #8
hass commented: But the line you are writing about isn't really a problem... if it's not Google, the XML is still downloaded, and the real user agent and IP are logged in watchdog. OK, it's extensible, but I think this logic doesn't make sense going forward. Other search engines may support XML sitemaps too, and then the module would need to be extended for every engine just to get a "nicer" watchdog entry, which doesn't make much sense to me... However, this is off-topic here.
Comment #9
SamAMac commented: I updated this patch to support the new sitemaps.org namespace and committed it to 4.7 dev.
Comment #10
(not verified) commented
Comment #11
darren oh commented
Comment #12
darren oh commented: I need someone to test this patch for the Drupal 5 version.
Comment #13
jt6919 commented: When I apply the patch, I get a weird error when I go to www.mywebsite.com/gsitemap:
Error loading stylesheet: A network error occurred loading an XSLT stylesheet: http://www.mywebsite.com/modules/gsitemap/gss.xsl
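A common cause of this kind of error is the xml-stylesheet processing instruction pointing at a path the web server doesn't actually serve. A hedged sketch of emitting the reference with Drupal's path helpers, so the URL tracks wherever the module is installed (not necessarily how the patch builds it):

```php
<?php
// Hedged sketch: build the XSL reference from the module's real location
// rather than a hard-coded modules/gsitemap/ path.
$xsl = base_path() . drupal_get_path('module', 'gsitemap') . '/gss.xsl';
$output  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
$output .= '<?xml-stylesheet type="text/xsl" href="' . $xsl . '"?>' . "\n";
```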
Comment #14
darren oh commented: See how this patch works.
Comment #15
hass commented: Sorry, but what does this patch, which contains CSS and image links to http://www.baccoubonneville.com/gss.jpg, have to do with this gsitemap module, which only produces an XML response?
Comment #16
darren oh commented: The 4.7 version has this so that users can see that a sitemap is being generated correctly. I ported all the features from the 4.7 version because that is how users will expect the Drupal 5 version to work. If you think the features are inappropriate, you should open a new issue requesting that they be removed from both versions. I will mark this as fixed as soon as someone confirms that an index and multiple sitemaps are being generated.
Comment #17
jt6919 commented: A sitemap index was generated for my site www.celebritynewslive.com, and I had set the admin options to break it into chunks of 1,000. The site has about 23,000 nodes, so the sitemap index contained 23 parts (about right for 1,000 URLs per part).
However, when I click on each part from my sitemap index at www.celebritynewslive.com/gsitemap, each part (chunk) contains 16,000+ URLs. That seems to be the only problem.
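The symptom (every chunk holding far more than the configured 1,000 URLs) suggests the per-chunk query may not be applying its range. For comparison, a bounded per-chunk selection in Drupal 5 would normally look something like the sketch below; the gsitemap_chunk_size variable name is an assumption, not necessarily what the patch uses.

```php
<?php
// Sketch of a bounded per-chunk node query in Drupal 5. $chunk comes from
// the gsitemap/<n> URL; the variable name below is hypothetical.
$size = variable_get('gsitemap_chunk_size', 1000);
$result = db_query_range('SELECT nid, changed FROM {node} WHERE status = 1 ORDER BY nid',
  $chunk * $size, $size);
while ($node = db_fetch_object($result)) {
  // Emit one <url> entry per node here.
}
```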
Comment #18
hass commented: Please name the URLs the way the Google generator names them. We shouldn't tell search engines what software we are using...
sitemap.xml
sitemap1.xml.gz
sitemap2.xml.gz
sitemap3.xml.gz
...
And finally, we should really change the way they are created... it should be done via cron, and only when something has changed. That way 50,000 lookups with the url() function won't DoS your server. I opened a bug for this...
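A minimal sketch of the cron-based regeneration being suggested, assuming hypothetical variable names and the file naming scheme above; none of this is existing module code.

```php
<?php
/**
 * Implementation of hook_cron(): regenerate static, gzipped sitemap files
 * only when content has changed since the last run. Variable names are
 * hypothetical; the file layout follows the comment above.
 */
function gsitemap_cron() {
  $last    = variable_get('gsitemap_last_regenerated', 0);
  $changed = db_result(db_query('SELECT MAX(changed) FROM {node} WHERE status = 1'));
  if ($changed <= $last) {
    return;  // Nothing changed; skip the expensive regeneration.
  }
  // ... write sitemap1.xml.gz, sitemap2.xml.gz, ... with gzencode() and a
  // sitemap.xml index into the files directory (file_directory_path()) ...
  variable_set('gsitemap_last_regenerated', time());
}
```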
Comment #19
darren oh commented: Work on this issue is done. Bugs and requests for changes should be reported in new issues.
Comment #20
jt6919 commented: I did want to add (for anyone using this patch) that if you have <10,000 nodes and tons of taxonomy terms, Yahoo-extracted terms, etc., you should uncheck the option to include terms in your sitemap. It will eat your server's processor alive if you don't... it also unnecessarily bloats your sitemap ten-fold. If you take out the terms, you limit duplicate content and listings, which is better for your SEO.
Comment #21
(not verified) commented