checked the code of gsitemap and saw this tons of url() requests done in the loops. From api side, this is correct, but from performance side this is a killer feature. We need aliases, but with the current logic every site using gsitemap is open to a DOS attack. Add 100000 pages to your site with and without url_alias and request the sitemap. Your DB get's overloaded... 100000 SINGLE requests to the aliases table... i'd like to see the database / hardware that can responds to this in a reasonable time (~2-5 sec)...
This looks fixable if the url_alias table will be selected once (for realtime requests) with an IN statement or if the sitemap is pregenerated with cron.
Comments
Comment #1
SamAMac commentedAdd 100000 nodes to your site, and Google will reject your sitemap out of hand, since there is a hard limit of 50000 URLs per sitemap.
It is ironic that you filed this bug agains 4.7 dev, since that version now contains support for breaking your sitemap into manageable chunks - IMHO a better solution for performance on large sites than circumventing the Drupal API and opening the door to DB compatibility problems.
Comment #2
hass commentedyes, i know gsitemap need to split sitemap files per max 50.000 URLs, max 10MB and so on... but the 100.000
url()request will come to you DB url_aliases table... if this are two URLs - or one - however you turn it - you will have a minimum of 50.000 DB requests per sitemap file request and probably every database server get's killed too, isn't it?Comment #3
Noki commentedI don't think it's ironic that hass filed this bug agains 4.7 dev because the way the sitemaps module operates is really stupid in various ways right now. For big sites it's always slow and always would if the way it operates is not changed.
A much better way to handle things would be:
1. Have a single table that contains all data that is needed to generate the sitemap.
2. Use the cron to keep the data in this table up to date.
3. Don't use complex joins over various tables and NEVER use String-Functions like CONCAT in a JOIN. When you need it use it in an INSERT and work with a second Table.
BTW.: I tried to use the sitemaps module with a site that has more than 10.000 nodes and more than 28.000 terms (free tagging) - the result was that I had lot's of slow queries in my MYSQL-log and things were not even working due to http://drupal.org/node/110484.
Comment #4
hass commentedAs said in a different bug case, cron should only update a "cached" sitemap.xml if really required. this is required if new nodes exists or updated nodes exists and so on. if nothing changed the cached version can stay there in drupal cache table... additional the "submit on cron" must only be done if something has really changed!
Aside, i don't think duplicating content in database is a good technically way. This is bad design and should only done if no other way possible. Creating with cron makes it possible to create a sitemap_index.xml containing the splited XML files ("chunk" per 50.000 urls) and finally we are able to create a gziped version with the splited files. this will close more then one case in the open issues... for sure.
thank you.
Comment #5
darren ohIt would be fairly simple to save the site maps as files, then retrieve the contents using a menu callback. Very little code would need to change. I think the best design would be to create temporary files for this purpose.
Comment #6
darren ohI will work on this for the DRUPAL-5 branch.
Comment #7
hass commentedCaching is more the second issue... i filed another bug regarding pre-compile and cache the sitemap files with cron. This is OT here.
The primary issue here is the 50.000 requests to URL function that should be change to a single request without the api function url().
Comment #8
darren ohActually, caching will be necessary to prevent DOS on large sites, since a query cannot break the sitemap into chunks properly. If you think it can, please design a query for issue 137449.
Comment #9
hass commentedyou are talking about a totally different problem!
Here we are talking about merging the 50.000 single api calls to url() and therefor 50.000 single SQL requests if i have a site with ONLY 50.000 nodes. I have a site with 120.000 nodes! If i do 120.000 single SELECT requests to the aliases table (what the url() function is doing if i call them in a loop) then my DB server gets down and overloaded.
this must be changed in a way like:
1. do only one big SQL request or chunked request to the alias table (MUCH MUCH FASTER) for e.g. with an
IN ()statement2. after this have finished, we are able to do the NEXT step - do caching of the resulting file
it is no solution to cache only... the cache must be filled with data and this filling will kill a DB server if there are 120.000 single requests to one table within some seconds. doing this will result in an totally overloaded sqlserver. whatever machine you have... this will take approximately an hour with 100% load. and this makes this case critical...
hope this helps.
i saw i haven't filed a feature request to create cache files with cron... i wonder... i will do later.
Comment #10
darren ohBy default, the site map is split into chunks of 10,000 links each. You can choose a smaller value. There should never be a case where more than 50,000 links are indexed at once. I just committed a modification to make each site map have the correct number of links.
I do understand the concern, and, if you will look at the code and give me some examples of how the queries can be consolidated, I would be happy to work on it.
Comment #11
hass commentedhave you ever done 10.000 single SELECT requests to a database? please add more then 10.000 aliases to nodes in your test environment and then request the gsitemap in "only" 10.000 link chunks. how does this affect your DB Server? Will you ever get an answer?
It really makes no difference to server load if you do 50.000 or "only" 10.000 - ok - with 10.000 the DB machine comes back responsible after ~20 minutes... and not after more then one hour for only one HTTP request - but the problem doesn't change.
sorry i haven't spend time on optimizing this code yet. i reviewed the module regarding chunking and performance in past and saw the url() function and from this point i know we will get VERY BIG performance problems with this module.
Comment #12
darren ohThis used to be accomplished with a single SQL query. The query that made this possible was changed for issue . Let's continue the discussion there.
Comment #13
darren ohSorry about the missing issue number: 124325.