Hi there,
My drupal site doesn't appear to be indexing past a certain point.
A user reported this to me but I can't figure out why so thought I'd ask for some help.
Here's some info :-
mysql> select max(sid) from search_index;
+----------+
| max(sid) |
+----------+
| 689 |
+----------+
mysql> select max(nid) from node;
+----------+
| max(nid) |
+----------+
| 2087 |
+----------+
Whatever happens, the max sid in the search_index table never goes beyond 689. Looking into it, if I search for words that I know appear in stories prior to 690 then search spits them out. But for stories beyond 689 no results get returned.
I'm using wget to run the indexing hourly. It doesn't report a problem, and neither does calling it manually.
The admin interface says "100% of the site has been indexed. There are 0 items left to index" but obviously there are, lots of them in fact. Just can't figure out why it's not doing them.
Any help appreciated.
Regards
AjK
| Comment | File | Size | Author |
|---|---|---|---|
| #25 | 57106.patch | 1.01 KB | douggreen |
| #22 | future-last-comment.patch | 2.6 KB | Steven |
| #21 | node_75.patch | 472 bytes | adixon |
| #20 | node_74.patch | 1.32 KB | adixon |
| #18 | node_73.patch | 1.19 KB | adixon |
Comments
Comment #1
AjK commented
I figured out why indexing had stopped. A node had been created with a date in the future (2008-12-01 15:25:22 to be precise).
Stored in the variable table I found
node_cron_last | s:10:"1228145122";
so I assumed that the indexer was waiting until that time to carry on from! Removing that and re-indexing cleared the fault. Worth noting how it failed though.
--AjK
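For anyone checking their own site: the value stored in the variable table is a PHP-serialized string holding a Unix timestamp. Here is a minimal Python sketch (not Drupal code; the helper name is made up for illustration) that decodes such a value and reports whether it lies in the future:

```python
import re
from datetime import datetime, timezone


def decode_serialized_timestamp(value):
    """Extract the integer from a PHP-serialized string like s:10:"1228145122";"""
    match = re.fullmatch(r's:(\d+):"(\d+)";', value.strip())
    if match is None or int(match.group(1)) != len(match.group(2)):
        raise ValueError("not a serialized numeric string: %r" % value)
    return int(match.group(2))


stamp = decode_serialized_timestamp('s:10:"1228145122";')
when = datetime.fromtimestamp(stamp, tz=timezone.utc)
print(when.isoformat())  # 2008-12-01T15:25:22+00:00
print("in the future" if when > datetime.now(timezone.utc) else "in the past")
```

The decoded time matches the future-dated node mentioned above, which is what points at node_cron_last as the blocker.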
Comment #2
AjK commented
I'm changing this issue from "support request" to "bug report" as the only way I found to fix this was to use a mysql command line client and do
delete from variable where name = "node_cron_last"
and then re-index.
This was probably not the correct way to fix this but it occurred to me that there was nowhere in the Drupal admin system I could "undo" the erroneous setting of the "node_cron_last" variable. So, I assume anyone who sets a node create date into the far future by mistake will run into an indexing problem with no obvious way out.
Correct me if I'm wrong here but if you can't "fix" something from within drupal itself but must resort to external tools, is that classed as a "bug"?
--AjK
Comment #3
magico commented
Has anyone else had this problem? I think this should be critical, but nobody else has said anything...
Comment #4
magico commented
Comment #5
catch
I'm running 5.1 on a database with c.12,000 nodes and 120,000 comments. Have been using external search tools due to this problem since 4.7 but decided we want to use drupal's own search again with 5.x
When indexing, cron indexes 300 nodes and then stops, which leaves over 11,000 un-indexed. The admin page says: "2% of the site has been indexed. There are 11771 items left to index."
Some information about our setup:
Debian Sarge, mysql 4.1, php 4.3
We're not using any node_access modules, although were doing so in the past - have manually inserted the 'all' line in that table though and checked it. We 'are' using CCK nodes, but there's at least 1,000 nodes entered before we started using it.
search is set up to index 10 nodes at a time, cron running every 2 minutes to check this, although it was getting stuck at once an hour as well. cron jobs are completing and no errors showing up in logs.
This is what I've tried:
empty all cache tables
empty all search tables
delete node_cron_last and node_cron_last_nid from variable
re-index button in search admin.
always sticks at 300.
The node referenced in node_cron_last_nid is: 371 - this is basically plain text so I don't think it's a character issue with indexing (although really that shouldn't affect it anyway). Entered in 2005 on a 4.6 db, and I've checked the authoring information to confirm it's not in the future.
I've searched the forums as much as possible, a few threads about this but nothing conclusive, this was the only issue with any information I could find, so re-opening. Marking as critical since 2% coverage of a site renders search unusable. Like I said this seems to have been a problem since at least 4.7.
Any help much appreciated!
Comment #6
catch
One more thing. We imported a lot of nodes using phpbb2drupal. I'm wondering if this might be anything to do with the issue on this forum topic: http://drupal.org/node/50151#comment-137637
not tried that code though.
Comment #7
dfserra commented
I have this problem. New content isn't being indexed, yet the status says 100% indexed.
Cron doesn't report any problems.
The problem started in 4.7 and continues in 5.1.
Any solution?
Thank you.
Comment #8
steven jones commented
If you were able to provide a database dump of the node that the search index fails on then we could try and reproduce the error on another system.
Comment #9
catch
You mean a database dump of just the node - so node, node_revisions, term_node etc. - not sure how to do that for just the one but would be happy to provide.
Nothing special about this node though - it's a book page, it's plain text etc. etc.
Comment #10
AjK commented
With a full database dump (mysqldump or via phpMyAdmin, whatever) I'd look at it. You'd need to use my contact tab to get hold of me. Btw, it's not really a good idea sending your database dumps to anon people like me but if you're really stuck then you may consider it. Depends how much you trust me with your data ;)
Comment #11
dfserra commented
I have just resolved the problem.
I uninstalled the Search.module and reindexed the content.
Thank you
Comment #12
adixon commented
Here's my diagnosis of one way this happens:
1. the search stops indexing because of the node_cron_last value (as noted already).
2. as per the node_update_index function, the node_cron_last value is set to the largest value of:
GREATEST(IF(c.last_comment_timestamp IS NULL, 0, c.last_comment_timestamp), n.changed) as last_change
of the nodes that get processed.
3. as per the comment_nodeapi function, a new node with a future date causes this:
db_query('INSERT INTO {node_comment_statistics} (nid, last_comment_timestamp, last_comment_name, last_comment_uid, comment_count) VALUES (%d, %d, NULL, %d, 0)', $node->nid, $node->created, $node->uid);
i.e., that future date gets inserted into node_comment_statistics as the last_comment_timestamp.
4. voila - the next time the search spider finishes a run with that node (it'll always be last), the node_cron_last gets set into the future.
Conclusion:
a. the comment module should be fixed to not insert future dates (replace ->created with ->changed? or the current timestamp?).
b. perhaps the node_update_index should make sure that it's not inserting a future date?
or both ...
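The four steps above can be modelled outside Drupal. This is a rough Python sketch, assuming the indexer picks up nodes whose last_change is newer than node_cron_last and then stores the largest last_change it processed. The function name and the sample data are illustrative only, not the actual node_update_index() code:

```python
def run_index_pass(nodes, cron_last):
    """One simplified cron pass.

    nodes maps nid -> last_change timestamp. Returns the nids processed
    (oldest change first) and the new node_cron_last value.
    """
    todo = sorted(
        (nid for nid, changed in nodes.items() if changed > cron_last),
        key=lambda nid: nodes[nid],
    )
    if todo:
        # Assumption: the lower limit becomes the newest last_change seen.
        cron_last = max(nodes[nid] for nid in todo)
    return todo, cron_last


NOW = 1175000000      # pretend "current time"
FUTURE = 1228145122   # the future date from the original report

nodes = {1: NOW - 500, 2: FUTURE}   # node 2 carries a future timestamp
first_pass, cron_last_1 = run_index_pass(nodes, 0)
print(first_pass, cron_last_1)      # [1, 2] 1228145122

nodes[3] = NOW + 60                 # an ordinary node saved a minute later
second_pass, cron_last_2 = run_index_pass(nodes, cron_last_1)
print(second_pass)                  # [] because node 3 is older than the limit
```

Once the limit sits in the future, every normally dated node falls below it, which would explain an index that reports nothing left to do.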
Comment #13
catch
OK I ran: SELECT * FROM `node_comment_statistics` WHERE last_comment_timestamp > 1175626800 and nothing showed up.
node_cron_last gives 1122034214 or Fri, 22 Jul 2005 12:10:14 GMT
So maybe two different bugs here?
Comment #14
Steven commented
The solution is simple, because node_cron_last is a lower limit. Whenever we update it in node_update_index(), we need to make sure it's never set to a future date, even if a future node was indexed.
The culprit:
Note that there is no way to fix it so that future dated nodes will not be reindexed until their publishing date has passed, but IMO this is an acceptable drawback. Future timestamps are mostly useful for unpublished announcements in moderation and such, i.e. temporary situations. At least they will not hinder the indexing of other, correctly dated content.
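A minimal sketch of the clamp described here, written in Python rather than Drupal's PHP. The function name is hypothetical and this is not the patch that was eventually attached, only an illustration of "never store a future lower limit":

```python
import time


def clamp_cron_last(newest_indexed, previous, now=None):
    """Return the new lower limit, never later than the current time."""
    if now is None:
        now = int(time.time())
    # Never move the limit backwards, and never let it pass "now".
    return min(max(newest_indexed, previous), now)


NOW = 1175626800
print(clamp_cron_last(1228145122, 1122034214, now=NOW))  # 1175626800
print(clamp_cron_last(1122132446, 1122034214, now=NOW))  # 1122132446
```

The drawback mentioned above falls out of this directly: a future-dated node stays newer than the clamped limit, so it would be picked up again on each pass until its date has passed.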
Comment #15
adixon commented
Yes - exactly. And now as I look again, I think the place it should really be fixed is in the node_update_index() function, since that future date means you don't want it indexed by the search engine either. Here's a simple patch for your consideration - I decided it would make more sense not to make the sql any more complicated, since it's a relatively rare occurrence (I hope).
Comment #16
Steven commented
I'm not sure future nodes shouldn't be indexed. If you want something to not show up, you should unpublish it. Future timestamped nodes still show up on the normal site too.
I think my suggestion is the better approach.
Comment #17
drumm
I agree with Steven
Comment #18
adixon commented
Okay, here's a modified patch that still allows the indexer to index future dated items.
Comment #19
dries commented
Looks OK but a code comment might be in order to decipher this 2 years from now :P
Comment #20
adixon commented
And here it is again with a helpful comment.
Comment #21
adixon commented
And here's another, better version I think, probably what Steven had in mind.
Comment #22
Steven commented
Actually now that I see where the bad node_last_comment timestamp is coming from... that should be changed.
$node->created is the publicly visible authoring date, which is used for sorting, for example. $node->changed is an internal housekeeping variable that is always set to the time of the last node_save(). The only way the latter could be set to a future date is if you ran a bad import script, which is something we should IMO not cater to.
I attached a patch which changes this. Instead of bogging down the indexing code with an (unnecessary) max(), I added a simple update which validates the timestamp values and resets any future node_last_comment entries in the database to now.
As this is the first update to comment.module, it can in fact be applied cleanly to both 5.x and 6.x, and will only get run once if people upgrade to 6.x later without extra housekeeping.
We do need to remember that users must run update.php for the next Drupal maintenance release (but this is already the case due to another bugfix).
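The attached patch is not reproduced in this thread. As a rough illustration of the kind of one-off clean-up described, here is a Python sketch using an in-memory SQLite table as a stand-in for Drupal's database layer and update system. The table and column names are taken from the query in comment #12; the function name and the sample rows are invented for the example:

```python
import sqlite3


def reset_future_comment_timestamps(conn, now):
    """Set any last_comment_timestamp later than `now` back to `now`.

    Returns the number of rows that were changed.
    """
    cur = conn.execute(
        "UPDATE node_comment_statistics "
        "SET last_comment_timestamp = ? "
        "WHERE last_comment_timestamp > ?",
        (now, now),
    )
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node_comment_statistics "
    "(nid INTEGER PRIMARY KEY, last_comment_timestamp INTEGER)"
)
conn.executemany(
    "INSERT INTO node_comment_statistics VALUES (?, ?)",
    [(1, 1122034214), (2, 1228145122)],  # nid 2 is dated in the future
)

now = 1175626800
fixed = reset_future_comment_timestamps(conn, now)
rows = conn.execute(
    "SELECT nid, last_comment_timestamp "
    "FROM node_comment_statistics ORDER BY nid"
).fetchall()
print(fixed, rows)  # 1 [(1, 1122034214), (2, 1175626800)]
```

Because rows that are already valid are left untouched, running such an update a second time changes nothing.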
Comment #23
adixon commented
Oh good, this was next on my list, thanks. The advantage of this patch is that it will prevent pointless re-indexing of the future-dated nodes every cron run until that date.
But I notice that there is a report of the symptom above (i.e. a future date in node_cron_last), that didn't seem to be caused by the mechanism I described, i.e. what you are fixing with this patch.
I'd also note that my patch is really very light, since it only runs once per cron, and protects against future bugs of this kind. The logic really does get pretty complicated (which is why this bug has taken so long to track down, i imagine).
Conclusion: I think both patches should be applied.
Comment #24
catch
I've been following this closely, but I'm completely stumped since the issue with my indexing doesn't seem to have anything to do with the patch as I understand it.
node cron last
node -> created
node -> changed
comment -> timestamp
none of these have timestamps in the future.
Having said that since I last posted it's indexed another 17 nodes - now 317 out of c.14,000. node_cron_last is Saturday, July 23rd 2005, 15:27:26 (GMT) or 1122132446
Comment #25
douggreen commented
If cron.php is aborted after the last_ variables are set, but before the search_index is updated, the node that we were working on will never get indexed. This is a very likely culprit for nodes not getting indexed.
Patch attached for 5.x but also works on 6.x.
Comment #26
catch
I've applied the patch and re-indexed the site from empty search tables. Stuck at 60 nodes, I'm afraid.
Comment #27
Steven commented
douggreen: we can't do that, since we need to be able to skip nodes that abort the page (e.g. PHP nodes with die; in them).
Comment #28
douggreen commented
@steven, I hadn't thought of this reason for setting it the way you do. I don't think it's a good assumption that the last node being processed was the source of the abort. PHP scripts die all the time on systems, and sometimes for no apparent reason - we run out of memory because it's the 99th thing that is done, the system just initiates a mailing and bounce handling consumes all of the processes, lightning strikes, someone hits the power button, etc.
I do think that this is a likely culprit to sites that aren't getting indexed. Would it make sense to keep a counter on how many times we've tried to index the node, and move on after the 2nd, 3rd, or 4th failed attempt?
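To make the counter idea concrete, here is a small Python sketch. Everything in it is hypothetical (the names, the in-memory queue, and the use of an exception to stand in for an aborted request); it only shows the bookkeeping, not how Drupal would store or apply it:

```python
def pick_next_node(queue, attempts, max_attempts=3):
    """Return the first queued nid that still has attempts left, or None."""
    for nid in queue:
        if attempts.get(nid, 0) < max_attempts:
            return nid
    return None


def index_one(queue, attempts, index_node, max_attempts=3):
    """Try to index one node, counting the attempt before doing the work."""
    nid = pick_next_node(queue, attempts, max_attempts)
    if nid is None:
        return None
    # Count first, so a request that dies mid-index still uses up an attempt.
    attempts[nid] = attempts.get(nid, 0) + 1
    index_node(nid)
    # Only reached when indexing finished: clear the node from the queue.
    queue.remove(nid)
    attempts.pop(nid, None)
    return nid


def flaky_indexer(nid):
    """Stand-in indexer: node 7 always kills the run."""
    if nid == 7:
        raise RuntimeError("simulated abort")


queue, attempts, done = [7, 8], {}, []
for _ in range(5):  # five simulated cron runs
    try:
        result = index_one(queue, attempts, flaky_indexer)
    except RuntimeError:
        continue
    if result is not None:
        done.append(result)

print(done, attempts, queue)  # [8] {7: 3} [7]
```

Node 7 is retried three times and then passed over, so node 8 still gets indexed. A node that failed only because of an unrelated crash would get a second and third chance instead of being skipped for good.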
Comment #29
David Lesieur commented
I haven't read the whole discussion, but Steven's patch on #22 would certainly solve a problem I've stumbled upon recently.
Comment #30
chx commented
This issue is actually two. Steven fixed the future timestamp issue in #22 and I deem that ready. catch has another problem which we can't fix without a database dump, sorry.
Comment #31
David Lesieur commented
catch's problem is discussed here.
Comment #32
catch
chx:
http://drupal.org/node/139537#comment-242977
on the issue David Lesieur linked to has brought my site's index from 3% to 100%. Either the patch fixes a bug, or it works around an issue with my db, either way it's worked. My database dates back to 4.5.x, and is bloated, so probably has the same issues that caused the fault pre-patch since I've not done any major tidy ups since then, happy to provide a dump.
Comment #33
drumm
Committed #22 to 5.x.
If the tables do get rearranged, watch out for that database update in case someone upgrades from 5.1 to 6.
Comment #34
dries commented
Committed to CVS HEAD. Thanks.
Comment #35
(not verified) commented