insert into xmlsitemap_node ties up database
| Project: | XML Sitemap |
| Version: | 5.x-1.x-dev |
| Component: | xmlsitemap_node |
| Category: | bug report |
| Priority: | normal |
| Assigned: | Liam McDermott |
| Status: | closed |
We have had to restart mysql many times lately, and looking at our slow queries log, I see that each time is similar to the following:
# Query_time: 2141 Lock_time: 0 Rows_sent: 0 Rows_examined: 0
INSERT INTO xmlsitemap_node (nid, pid, last_changed, last_comment, previous_comment)
SELECT n.nid, ua.pid, n.changed, s.last_comment_timestamp, MAX(c.timestamp) FROM node n
LEFT JOIN node_comment_statistics s ON s.nid = n.nid
LEFT OUTER JOIN comments c ON c.nid = n.nid
LEFT JOIN url_alias ua ON ua.src = CONCAT('node/', n.nid)
LEFT JOIN xmlsitemap_node xn ON xn.nid = n.nid
WHERE xn.nid IS NULL AND (c.timestamp < s.last_comment_timestamp OR s.comment_count <= 1)
GROUP BY n.nid;
(Eventually we end up killing the query by restarting mysqld - most of these have 0 rows_examined, but there was one entry where it must have just finally timed out at Query_time: 2688 and the Rows_examined: 883585707)
I compared this to earlier logs prior to our last upgrade of xmlsitemap module, and saw only one entry (I log all >2 seconds), which took 13 seconds:
# Query_time: 13 Lock_time: 0 Rows_sent: 0 Rows_examined: 3900316
INSERT INTO xmlsitemap_node (nid, pid, last_changed, last_comment, previous_comment)
SELECT n.nid, ua.pid, n.changed, s.last_comment_timestamp, MAX(c.timestamp) FROM node n
LEFT JOIN node_comment_statistics s ON s.nid = n.nid
INNER JOIN comments c ON c.nid = n.nid
LEFT JOIN url_alias ua ON ua.src = CONCAT('node/', n.nid)
LEFT JOIN xmlsitemap_node xn ON xn.nid = n.nid
WHERE xn.nid IS NULL AND c.timestamp < s.last_comment_timestamp
GROUP BY n.nid;
My immediate guess would be that this issue is related to the change from an INNER JOIN to a LEFT OUTER JOIN with comments.
Is anyone else experiencing this issue? Any suggestions?

#1
I have the same problem. Only with this query:
INSERT INTO xmlsitemap_term (tid, last_changed, pid)
SELECT td.tid, 1197023421, ua.pid FROM term_data td
LEFT JOIN url_alias ua ON ua.src = CONCAT('taxonomy/term/', td.tid) OR ua.src = CONCAT('forum/', td.tid)
LEFT JOIN xmlsitemap_term xt ON xt.tid = td.tid
WHERE xt.tid IS NULL
I'm using InnoDB in a database with 130.000+ nodes.
#2
The xmlsitemap_node query can be very slow when there are lots of comments. It has to load all the comments to find the second-most-recent (the time between the last two comments is used to calculate the change frequency of the node).
If someone will provide a patch to do the query in batches, I will commit it.
#3
Just to note, when I disabled xmlsitemap_node, I still crashed, only this time on xmlsitemap_term, just like johsw.
I have currently had to completely disable this module, so as to not crash my entire database.
As far as the number of comments, I only have 166 entries in my comments table, which I don't see as being that large. However, I do have a very large node table (13,400).
Is there any further work with this issue? johsw - do you have any components of the xmlsitemap module enabled that don't crash your database? (since I currently have nothing enabled with this module, it may help to be able to use at least some component)
thanks
#4
Does anyone know if this issue is resolved with version 5.x-1.4? I just realized the new version
#5
The new version does nothing to affect this issue. I have been told that switching the node_access table to MyISAM can work around the problem. See issue 187451.
#6
Darren - thanks for the quick response.
I do have my node_access table as MyISAM (the only drupal tables i have as InnoDB are accesslog, cache, sessions and watchdog; everything else is MyISAM), so that wouldn't be the fix I need
#7
Am also seeing this problem, large number of nodes (29,000), large number of comments (152,000). The 1.4 version hangs on install, whilst previous versions crashed the server the site was running on. The table type for node_access is MyISAM and am trying to install the xmlsitemap_node module (I enabled the other sitemap modules seperately, just in case).
What needs to changed to stop the failure? I'd be glad to write a patch, just point me in the right direction. :)
#8
A patch would need to limit the number of nodes that are indexed in a single query. Ideally, the query should be repeated until all nodes are indexed, but that may cause the installation to time out.
#9
The installation times out at the moment! :)
Could we index an arbitrary number of nodes on installation, and index the same amount on a cron job (until all nodes are indexed)? I shall investigate to get a better idea of how this module works.
#10
I know the installation is timing out. The question is whether it would time out if the indexing were done in multiple queries. If necessary, the script can include a temporary change to the timeout limit.
#11
Having experimented a little, adding a limit to the insert query didn't work: even a limit of 1 locked the installation. It also seems the problem isn't with the amount of nodes, but the amount of url aliases. Removing the url alias join in the insert query (on install), and entering 0 as the pid made the install run within a couple of seconds, even with a large number of nodes.
What I'll do next is see how quickly an update query will run to change all those pid's from 0 to what they should be. Does this sound reasonable?
#12
First attempt at a patch. I've only tested it with MySQL, so PostgreSQL needs to be checked.
Installation of all XMLSitemap modules, with the patch, takes ~30 seconds for my site.
#13
Forgot to update status.
#14
I never tested the queries on a site with thousands of URL aliases. The problem seems to be the repeated search for an alias for each node when there are thousands of aliases. We should be able to eliminate this by searching for an entry in the xmlsitemap_node table for each alias instead.
That means two queries: one to populate the xmlsitemap_node table, and another to add the path ID. There is no reason to set the pid to zero on the first query; it does not need to be set at all.
#15
So the patch is fine apart from setting the pid to zero, nothing else needs to be done? Just wanting to check exactly what needs to be changed here. :)
#16
I hadn't realized your patch contained both queries. My thought was that we should eliminate the search for the pid, which the second query in the patch still contains. That would require that we find a way to join the xmlsitemap_node table to the url_alias table rather than the other way around. If your patch cuts the time down to 30 seconds, that would not be necessary.
The only other issue is that the pid should not be updated if it is already set.
#17
Sorry for the inactivity. Having another go at a patch. This one initially sets the pid to 0 (same as last patch), but now the update only sets the pid for rows where it's 0.
Regarding issue #199413, I tested with multiple aliases for the same node and the pid of the first alias was inserted. I tried to get it so the last (or MAX(pid) ) was inserted, but this seems impossible with an UPDATE--in MySQL at least--without a correlated sub-query. A correlated sub-query just made the install process time out again, so I ditched that idea.
Anyway, hope this makes sense. Might be an idea to get the guy with PostgreSQL to test this patch. :)
#18
Doh! Forgot to update the term and user modules. Here's a new patch.
#19
Hi,
where is this patch to be executed? it does neither work in the modules subdirectory (patch -p0), nor in the drupal root directory (patch -p1); am I missing something here?
Thanks, -asb
#20
I rolled it from inside the xmlsitemap directory. Just tested and it works if you apply the patch from there (patch -p0). In future I'll roll my patches from the modules directory (if that's what people are expecting). :)
#21
i apply your patch liam and receive this messages
patching file xmlsitemap_node/xmlsitemap_node.installHunk #1 FAILED at 107.
Hunk #2 FAILED at 116.
2 out of 2 hunks FAILED -- saving rejects to file xmlsitemap_node/xmlsitemap_node.install.rej
patching file xmlsitemap_node/xmlsitemap_node.module
Hunk #1 FAILED at 366.
1 out of 1 hunk FAILED -- saving rejects to file xmlsitemap_node/xmlsitemap_node.module.rej
patching file xmlsitemap_term/xmlsitemap_term.install
Hunk #1 FAILED at 40.
1 out of 1 hunk FAILED -- saving rejects to file xmlsitemap_term/xmlsitemap_term.install.rej
patching file xmlsitemap_term/xmlsitemap_term.module
Hunk #1 FAILED at 232.
1 out of 1 hunk FAILED -- saving rejects to file xmlsitemap_term/xmlsitemap_term.module.rej
patching file xmlsitemap_user/xmlsitemap_user.install
Hunk #1 FAILED at 50.
1 out of 1 hunk FAILED -- saving rejects to file xmlsitemap_user/xmlsitemap_user.install.rej
patching file xmlsitemap_user/xmlsitemap_user.module
Hunk #1 FAILED at 240.
1 out of 1 hunk FAILED -- saving rejects to file xmlsitemap_user/xmlsitemap_user.module.rej
#22
ops... i upgraded module to 5.x-1.4 and now receive this messages
patching file xmlsitemap_node/xmlsitemap_node.installHunk #2 FAILED at 116.
1 out of 2 hunks FAILED -- saving rejects to file xmlsitemap_node/xmlsitemap_node.install.rej
patching file xmlsitemap_node/xmlsitemap_node.module
Hunk #1 FAILED at 366.
1 out of 1 hunk FAILED -- saving rejects to file xmlsitemap_node/xmlsitemap_node.module.rej
patching file xmlsitemap_term/xmlsitemap_term.install
patching file xmlsitemap_term/xmlsitemap_term.module
patching file xmlsitemap_user/xmlsitemap_user.install
patching file xmlsitemap_user/xmlsitemap_user.module
#23
Probably needs a re-roll.
#24
Reroll, this time from modules directory. This is patched against the DRUPAL-5 branch of xmlsitemap. If you try this patch and it works, could you please come back here and say so? Unless you don't mind running a customised version of xmlsitemaps forever of course. ;)
#25
Liam:
After Applying "split_update_3.patch" to our version 5.x.1.3 of 'xmlsitemap', I got the same results as when I applied the "split_update_2.patch" (no upgrade to 5.x-1.4 involved).
Here are the exact errors..
Hunk #1 FAILED at 366.
1 out of 1 hunk FAILED -- saving rejects to file xmlsitemap_node/xmlsitemap_node.module.rej
Hunk #1 FAILED at 232.
1 out of 1 hunk FAILED -- saving rejects to file xmlsitemap_term/xmlsitemap_term.module.rej
Hunk #1 FAILED at 50.
1 out of 1 hunk FAILED -- saving rejects to file xmlsitemap_user/xmlsitemap_user.install.rej
Hunk #1 FAILED at 240.
1 out of 1 hunk FAILED -- saving rejects to file xmlsitemap_user/xmlsitemap_user.module.rej
Keeping you updated,
Robert D. S.
#26
Thanks for trying the patch. Have you tried applying it to a CVS copy of the code from the DRUPAL-5 branch? You can grab this from: http://ftp.drupal.org/files/projects/xmlsitemap-5.x-1.x-dev.tar.gz
I just tested the patch using:
patch -p0 < split_update_3.patchand that xmlsitemap download, it does apply. The only other thing I can suggest is to make sure the patch is in the directory above xmlsitemap (modules) and apply it from there. You probably already know that though. :)#27
This problem still has not been resolved. It seems the fix has been applied to dev, but NOT 1.4. Also, I didn't like the proposed fix as for large sites I believe that PHP should take more of the load and not the database. I say this because in hook_nodeapi of the xmlsitemap node module, there was still a *JOIN on a function* ( in this case the CONCAT function ) which makes submission on a large site painfully slow.
This is an initial patch to fix this issue. The pid is put in as 0, and there is a seperate function to go through and update the pid.
#28
Better patch to fix UPDATE query reference table.
This patch should apply cleanly to 5.x-1.4
#29
I believe the patch in #24 above hasn't been committed yet. It is awaiting testing.
Had a look at the patch in #28 and it only applies to the node module.
That's what you 'believe' but have you got any evidence to back that up? Parts of core join to
url_aliasusing a function, I agree that this is inefficient, but it's the way the rest of Drupal works. It would be good to see some evidence that pulling all the fields into PHP individually for processing is less efficient than getting the database to do it, particularly considering there is an index on thesrcfield.In defense of the patch in #24: with this applied, all xmlsitemap modules install within 30 seconds on a site with ~29,000 nodes and ~15,000 users. That's a reasonably sized site. If I get a moment I'll see whether the patch in #28 is faster.
#30
There are 2 primary reasons why placing the load on PHP and not the DB is better in my mind. I didn't put numbers because the decision is opinionated...also, honestly, I didn't have time to do benchmarking, diagramming, etc..
With the load on the application, and the processing done in more of an atomic method, one can control the number or operations run. Meaning, say it just takes too long to create the site map in a sledge-hammer, one-fell swoop method, one can inclemently create the site map. Sending a query...in this case a bad query, to the database for processing on a large site would obviously tie up the database much more so than if you moved the load to the application/PHP, and send well-formed better quires to the DB...even though one sends more of them.
You mention that there is an index on the source field, however, the indexes go out of the windows when you join on a function: e.g. ON A.id = CONCAT('strong/', id). Even with a limit, the join must happen, and since the join is on a computed ID, the Id (meaning for the entire table) must be computed. The usual operations of a database (slight varies depending on hints, etc.) is:
JOINS, WHERE, GROUP BY, SELECT, ORDER
Also, I just posted the patch as 24 didn't apply to 5.4 for me. I meant to post as needs-work, as i figured others could work off it , use it, and then commit something. I needed to fix the node version becase everytime xmlsitemap was turned on, it would bring the site to a halt.
During enabling, it took 54 seconds to run the initial hook_enable function via my patch on a site with ~500,000 nodes and ~380,000 users.
Node crud operations avg. ~14 seconds. That may seem high, but there are other factors and items going on behind the scene that attribute to the node CRUD times. (When I say crud, i mean editing, updating, and creating nodes ).
#31
You right though, I should either expand/merge the patch to include the sitemap users/etc. information/patches. perhaps i chopped that off.
#32
Found an error. I can roll another patch later today, or wait to hear from Liam about his patch to see if it applies agains the 1.4 branch instead of the 5.x-dev.
#33
My patch should apply against 5.x-dev, from the modules directory. Here's how I'm checking the module out and applying the patch (so tell me if I'm doing something incorrectly):
cd modulescvs checkout -r DRUPAL-5 -d xmlsitemap contributions/modules/xmlsitemap
patch -p0 < split_update_3.patch
Just tested, and it works (applying the patch in #24 that is).
#34
This patch works great. I have a site with a ton of nodes and aliases, so the installation of xmlsitemap node module would always time out. Once I had applied this patch to 5.x-dev, I was able to install xmlsitemap with no problems. I needed to run the cron script a few times, but after that the sitemap.xml looked pretty good. I will test it some more before I send this to our live site, but so far so good. . . .
thanks for the patch
ash
#35
Liam,
I tried applying your patch again today to no avail:
CVS co:
cvs -z6 -d:pserver:anonymous:anonymous@cvs.drupal.org:/cvs/drupal-contrib checkout -r DRUPAL-5 -d xmlsitemap contributions/modules/xmlsitemapErrors:
:# patch -p0 < split_update_3.patch
patching file xmlsitemap/xmlsitemap_node/xmlsitemap_node.install
Hunk #1 FAILED at 107.
Hunk #2 FAILED at 116.
2 out of 2 hunks FAILED -- saving rejects to file xmlsitemap/xmlsitemap_node/xmlsitemap_node.install.rej
patching file xmlsitemap/xmlsitemap_node/xmlsitemap_node.module
Hunk #1 FAILED at 366.
1 out of 1 hunk FAILED -- saving rejects to file xmlsitemap/xmlsitemap_node/xmlsitemap_node.module.rej
patching file xmlsitemap/xmlsitemap_term/xmlsitemap_term.install
Hunk #1 FAILED at 40.
1 out of 1 hunk FAILED -- saving rejects to file xmlsitemap/xmlsitemap_term/xmlsitemap_term.install.rej
patching file xmlsitemap/xmlsitemap_term/xmlsitemap_term.module
Hunk #1 FAILED at 232.
1 out of 1 hunk FAILED -- saving rejects to file xmlsitemap/xmlsitemap_term/xmlsitemap_term.module.rej
patching file xmlsitemap/xmlsitemap_user/xmlsitemap_user.install
Hunk #1 FAILED at 50.
1 out of 1 hunk FAILED -- saving rejects to file xmlsitemap/xmlsitemap_user/xmlsitemap_user.install.rej
patching file xmlsitemap/xmlsitemap_user/xmlsitemap_user.module
Hunk #1 FAILED at 240.
1 out of 1 hunk FAILED -- saving rejects to file xmlsitemap/xmlsitemap_user/xmlsitemap_user.module.rej
#36
Updated patch.
This patch fixes a few issues with some quires. I still need to migrate the changes to the users site map type.
This patch also adds a new settings page: admin/settings/xmlsitemap/node that allows to following:
1. See how many nodes in the table haven't had their path aliases processed.
2. Set a *BATCH NUMBER*, so that the module only processes so many nodes at a time.
3. One can *MANUALLY* run the process if they want.
As mentioned, nodes are now updated in batches. It starts at 10,000. However, one can update the number of nodes that are processed in the settings page.
I like updating each node atomically because I have more information and control about what's going on. Throwing everything to the query and hoping it doesn't time out doesn't bode well for me.
So far so good. I re-did my tests and did about 480K nodes in about 4 min in 100K batches.
Again, this solution could be totally off, but I'm just placing it here for discussion. I'll try and roll a patch out again 5.x-dev too.
Liam,
Hopefully you don't think i'm trying to hijack your thread. Perhaps we could merge out patches, and get this in to the current release branches fairly soon?
#37
I just did exactly what you did in #35 and the patch applied?!!!
*** EDIT ***
No worries. When it comes down to it I just want a sitemap module that doesn't hang my whole server when trying to install. :) Am really busy at the moment, but am hoping to have some time later to see which is the fastest method. Maybe try some PostgreSQL testing too.
Just wish I knew why that patch only applies for me and no-one else! Even if I use a clean checkout it works. Weird.
#38
I got the patch to work with no issues. I'm not sure why people are having an issue applying it. I was able to install xmlsitemap without any hang-ups, but now I am running into XML Parsing Error: no element found . I think it is a permissions issue because this does not occur on my dev site, just the live one, but this is an unrelated issue. I will poke around the forums more for a solution. . .
ash
#39
@ashtronaut: which patch did you apply, the one souvent is having trouble with is from comment #24. Is that the one you're applying?
Thanks for keeping us updated though.
#40
Liam,
Sorry for taking so long to respond. I have been on vacation for the past few days. Yes, I was able to successfully apply the patch from #24 to 5.x-dev with no issues.
patch -p0 < split_update_3.patch
thanks,
ash
#41
Just tried some benchmarking, using Postgres. Not a really scientific test, just running it on a GNU laptop whilst listening to Hitchhikers Guide to the Galaxy, but still interesting :)
Setup: Apache2, Postgres 8.2, Ubuntu GNU+Linux. A fresh Drupal 5 install with 150,000 nodes (and aliases), created by the devel module. Only xmlsitemap_node is tested, patch in #24 applies to all xmlsitemap modules whereas #34 applies to xmlsitemap_node.
Patch in comment #34
# warning: pg_query() [function.pg-query]: Query failed: ERROR: column "xn" of relation "xmlsitemap_node" does not exist LINE 1: UPDATE xmlsitemap_node xn SET xn.pid = ( ^ in /home/liam/public_html/drupal/includes/database.pgsql.inc on line 125.
# user warning: query: UPDATE xmlsitemap_node xn SET xn.pid = ( SELECT ua.pid FROM url_alias ua WHERE ua.src = 'node/1' ) WHERE nid = 1 in /home/liam/public_html/drupal/includes/database.pgsql.inc on line 144.
(once for each row).
Patch in comment #24
Am not sure this is the case, the src field--the one that's indexed--is being searched and is not included in the function call. It can certainly be searched using the index. I agree joining on a function call is a bit rubbish though.
#42
Am bumping version to dev, and status to code needs review, etc. etc.
Patch attached works in PostgreSQL (tested pretty thoroughly). As before the patch should be applied from the modules directory:
cd modulescvs checkout -r DRUPAL-5 -d xmlsitemap contributions/modules/xmlsitemap
patch -p0 < split_update_4.patch
#43
To satisfy my curiosity I did some--rough and ready--testing on this, it's not exactly on-topic, but does show whether the
srcfield should be avoided in joins. Setup is as shown above, this is just a Drupal page content type, using the PHP input format. The results are below, I repeated the test a few times to make sure each individual result is not a fluke, the results shown here are the$elapsed_timeseen in the code.Using a
whereclause<?php$start_time = time();
for ($i = 1; $i <= 10000; $i++) {
$result = db_query("SELECT dst FROM {url_alias} ua WHERE ua.src = 'node/' || %d", $i);
print db_result($result) .", ";
}
$finish_time = time();
$elapsed_time = $finish_time - $start_time;
?>
<p>Start time = <?php print $start_time; ?></p>
<p>Finish time = <?php print $finish_time; ?></p>
<p>Elapsed time = <?php print $elapsed_time; ?></p>
With an index on the src field:
10 seconds
8 seconds
7 seconds
34 seconds
7 seconds
33 seconds
10 seconds
8 seconds
7 seconds
7 seconds
6 seconds
(I got some weird results, e.g. ~30 seconds instead of ~7 seconds, so repeated the test a few extra times)
Without an index on the src field:
655 seconds
612 seconds
649 seconds
Then, using
inner joininstead of awhereclause<?php$start_time = time();
for ($i = 1; $i <= 10000; $i++) {
$result = db_query("
SELECT ua.dst, xn.pid FROM {url_alias} ua
INNER JOIN {xmlsitemap_node} xn ON ua.src = 'node/' || xn.nid
WHERE xn.nid = %d
", $i);
print db_result($result) .", ";
}
$finish_time = time();
$elapsed_time = $finish_time - $start_time;
?>
<p>Start time = <?php print $start_time; ?></p>
<p>Finish time = <?php print $finish_time; ?></p>
<p>Elapsed time = <?php print $elapsed_time; ?></p>
With an index on the src field:
9 seconds
8 seconds
8 seconds
Without an index on the src field:
794 seconds
766 seconds
889 seconds
Conclusion
Joining on a function doesn't throw out indexes at all. It's sub-optimal, but works fast enough.
#44
Hi,
I tried to apply the patch on 5.x-1.4, getting the following results:
#:/var/www/drupal/portal/modules# patch -p0 < split_update_4.patchpatching file xmlsitemap/xmlsitemap_node/xmlsitemap_node.install
Hunk #2 FAILED at 116.
1 out of 2 hunks FAILED -- saving rejects to file xmlsitemap/xmlsitemap_node/xmlsitemap_node.install.rej
patching file xmlsitemap/xmlsitemap_node/xmlsitemap_node.module
Hunk #1 FAILED at 366.
1 out of 1 hunk FAILED -- saving rejects to file xmlsitemap/xmlsitemap_node/xmlsitemap_node.module.rej
patching file xmlsitemap/xmlsitemap_term/xmlsitemap_term.install
patching file xmlsitemap/xmlsitemap_term/xmlsitemap_term.module
patching file xmlsitemap/xmlsitemap_user/xmlsitemap_user.install
patching file xmlsitemap/xmlsitemap_user/xmlsitemap_user.module
Should I post the *.rej files?
Regards, -asb
#45
I think the patch will only work if applied to 5.x.1.x-dev. I haven't tried applying this patch yet, but the previous version of this patch applied to 5.x.1.x-dev and not 5.x-1.4 . It's worth giving it a shot. You can get a copy of the dev file from
http://ftp.drupal.org/files/projects/xmlsitemap-5.x-1.x-dev.tar.gz
ash
#46
Hi Ash,
yes, patching against 5.x.1.x-dev does work fine. Let's see if it helps against crshing MySQL.
Thanks, -asbdpl
#47
Do pop-back and let us know how you got along. Unfortunately Darren (the maintainer--I've been hassling the poor chap on IRC) hasn't had a chance to review this patch yet, but it will give him confidence to see others have and whether it worked for them. Thanks. :)
#48
Hi Liam,
sure. I'm running 5.x-1.x-dev with the split_update.patch on one site for 24 hrs and had no long running SQL queries or similar problems so far. I just enabled it on a few other sites which look goot at the moment.
Thanks the the good work!
-asb
#49
just adding my two cents...I ran into this problem just now and after upgrading to -dev and applying the patch my server is feeling good again!
thanks for tackling this guys!
#50
Well, at the moment the patch in #42 has been independently reviewed by at least three people. I know Jaydub and I both use PostgreSQL, so that's MySQL and PostgreSQL tested. This is ready to be committed IMO. :)
*** EDIT ***
Also, #42 should be committed due to the cheesy Hitchhikers Guide to the Galaxy reference. :)
#51
I have cleanly applied the patch in #42 and have been testing it on my dev box (Mysql) for about 5 days now. I feel it's ready to be committed as well.
ash
#52
maybe this wasn't part of the patch's design, but i'm still getting memory exhaustion error when I create nodes... even though i don't get a cpu spike
#53
My two cents:
Talk about night and day! Trying to use 1.4 caused innumerable problems and hung up my browser for 7 minutes. There were "duplicate entry" errors and the resulting xmlsitemap_node table was mostly empty.
Using the dev version and applying the patch was a freakin' miracle in comparison:
Very fast, zero errors, and all the nodes are represented in the table!
Thank you for this!!
#54
If I remember correctly this patch will help a little with this problem, but XML sitemaps will still need a lot of time to populate the sitemap files from the database. Basically you need to run cron a load of times and keep trying, eventually XML sitemaps will create the files and things will run smoothly-ish. As you said: this patch isn't meant to fix all these problems, just the database ones. :)
#55
#43, I think you forgot to read the part where it says "when you join on a function: e.g. ON A.id = CONCAT('strong/', id)".
I totally agree with Souvent22, that join with the CONCAT function was killing my server too.
My problem started when I moved from gsitemap to xmlsitemap. After activating the xmlsitemap_node module, my server was "hanging up" on the first cron run (I have about 40,000 url aliases, 20,000 nodes and 170,000 comments). I tried to run the query manually but it always timed out. I have tried some of the above patches but I couldn't get it to work, so I fixed the problem manually.
Here is how.
First, I removed all joins from that query to run just this:
INSERT INTO xmlsitemap_node (nid, pid, last_changed, last_comment, previous_comment)SELECT n.nid, NULL, n.changed, NULL, NULL
FROM node n
LEFT JOIN xmlsitemap_node xn ON xn.nid = n.nid
WHERE xn.nid IS NULL;
That run super fast.
Next, I updated the comment timestamps per node. I only care about the last_comment field, since the previous_comment field will be updated for any node whenever a user posts a new comment.
UPDATE xmlsitemap_node xn
JOIN node_comment_statistics s on s.nid = xn.nid
SET xn.last_comment = s.last_comment_timestamp;
Again, that took few seconds to run.
Finally, it was time to update the pid from the url_alias table. Since the join with the CONCAT function is mega slow, I added my own nid field to the url_alias table and updated it with the corresponding nid value (btw, I think this field should be added on Drupal's urlalias core module).
alter table url_alias add column nid int(11) NULL;
update url_alias set nid = substr(src,6,10) where src like 'node/%';
That went very fast, faster than I expected.
So the final step was, again, update the pid on xmlsitemap_node table. Now that we have a nid field on url_alias, the join doesn't take very long.
update xmlsitemap_node xn join url_alias a on a.nid = xn.nid set xn.pid = a.pid;
So that was it. Now I hope that query won't kill my server whenever users add new nodes. If that happens I would have to fix or split that query on the module it self.
Regards,
Eneko Alonso
www.spaniards.es
#56
By the way, does anybody know why the pid field is required on xmlsitemap_node table?
I mean, Drupal has a nice function called url(), which you can use like this to get the url_alias for a node:
<?php$node_url = url("node/$node->nid");
$full_url = "http://www.spaniards.es" . url("node/$node->nid"); // On my website template I use it this way, for debugging.
?>
Maybe xmlsitemap_node module should use this function in order to generate the sitemap xml file?
#57
No, I did not 'forget' to read anything. I proved the index does make a difference. Try it for yourself.
What was the issue with the patch in #42? Did it not apply, was it too slow or what? This discussion is clouding the issue somewhat, the patch in #42 has worked for a number of people here, it is an improvement and should be committed.
Totally agree that many aspects of xmlsitemap could do with being re-thought (believe Darren has a grand plan already), but going off on tangents and hypothecating about changes isn't going to get this bug fixed. If you've got a better patch please put it up. :)
#58
Sorry, I think I didn't explain very well. The code you posted on comment 43 is not valid for this test, since 1st, it does not use a DB function like CONCAT, and 2nd, the query you put only affect one record at a time.
But that's not important here. The patch that you mention, comment 42, uses an INNER join on the update, which doesn't seem to have the problem the OUTER join had on the original query. So yes, it runs very fast. I do not understand why MySQL handles both joins differently so if anyone can explain this to me, I'll be gald to hear your comments (pls. email me at eneko[at]spaniards.es
Some how I missed that patch, so I'll give it a try and let you know how it goes.
Either way, good job! I'm very happey there is a good solution for this issue :)
#59
That's the PostgreSQL version of CONCAT. Same result, different syntactic sugar. :) I can repeat the test on MySQL if you like. Am pretty sure it'll have the same result.
What? It's a select query, it selects all the records. What test would you propose?
#60
Hi Liam,
Newbie here. I installed xml sitemap version 5.x-1.4 a couple days ago. I have about 100,000 nodes.
The first time it crashed the mysql db and my isp had to restart it. The second time, it took about 12 hours to create the sitemap. Afterwards, it created a bunch of duplicate nodes on the live site. The resulting sitemap looks mostly ok with some nodes without the pathauto url aliases.
Anyway, taking half a day to generate a sitemap is not ideal - that's why I'm interested in your patch. A couple questions:
- Does your latest patch work with 5.x-1.4? (I followed the entire thread, but I'm still confused).
If it does, how do I apply the patch? Newbie here, give me detail instructions, especially knowing that I've already installed the module and had already generated a sitemap. Would I have to uninstall my current module first?
If not, how do I go about installing it. Would I have to uninstall my current version of the module, install 5.x-1.x-dev, then apply the patch? How do go about applying the patch exactly?
- This is more of a question regarding XML Sitemap. After the initial sitemap is generated, does it regenerate the entire sitemap again from scratch each time cron is ran, or does it only add on new nodes?
Thx!
-
#61
I've replied to sinmao via other means, anyone wanting to know about applying patches should check out: http://drupal.org/patch/apply or see support options: http://drupal.org/support . Please keep this issue on-topic, thanks. :)
#62
Liam,
Thanks for the detail reply thru email! Much appreciated.
#63
I don't know what I was thinking when I wrote that... hahaha.
Either way, as I said before, I'm glad everything is working fine now.
#64
I guess I'll throw one more solution into the ring. The original queries are great in theory and I think they have the right idea: make the database move data around, that's what it's good at. The queries just needed a little tweaking to get rid of all those full table lookups.
I'm working with a fairly large database -- about 25 thousand nodes, 5 thousand taxonomy terms, half a million users, half a million url aliases (by the way, this is the table causing the slowdowns) -- and it took 2.18 seconds to build xmlsitemap_node, 0.24 seconds to build xmlsitemap_term, and 44.10 seconds to build xmlsitemap_user. Subsequent runs take less than half a second for nodes and terms and less than 2 seconds for users. Those times are from a VM running on my laptop, nothing remotely close to production hardware.
This patch is for 5.x-1.4 and only changes queries for MySQL.
#65
Marked issue 217718 as duplicate.
#66
Subscribing.
#67
Noticed some changes to the DRUPAL-5 branch, have tested applying the patch in #42 and it still applies.
Is there any reason the patch in #42 hasn't been applied? It's been reviewed and works with MySQL and PostgreSQL. It output: http://www.webmaster-forums.net/sitemap.xml in about 10 seconds for example (after I truncated the xmlsitemap tables).
Even if there's a plan to refactor how xmlsitemap generates URLs, to work better with large result sets, this fixes the problem now (for most people), will stop more issues being raised and more comments here.
#68
Sorry it took me so long to get to this. I was skeptical that the patch could actually provide the required performance, but replacing the LEFT JOIN to the url_alias table with an INNER JOIN seems to do the trick. I modified the patch so that it is not necessary to set the pid to zero initially, so it needs one more test.
#69
Seems to work. Fixed in CVS commit 115946.
#70
Hurrah! Thanks Darren. :)
#71
I have to say Darren's patch came at the most opportune time!
I just did research yesterday on this problem because we just had a major corporate Drupal launch and the cron was pegging our CPU for three minutes. Within that time there could be no writes to the database. Since the site was also a migration from WordPress, watchdog was having a field day writing. Now it's down to one second.
Thanks for the effort!
#72
Automatically closed -- issue fixed for two weeks with no activity.