More forceful index clearing in the "all" case; Change INNER JOIN to LEFT JOIN

robertDouglass - July 21, 2009 - 09:10
Project:Apache Solr Search Integration
Version:6.x-2.x-dev
Component:Code
Category:bug report
Priority:normal
Assigned:Unassigned
Status:needs review
Description

Although it is to be expected that every node ever created invokes hook_nodeapi, sometimes this is impractical (when doing database node imports, for example). Currently the only way we have to build apachesolr_search_nodes is via hook_nodeapi. I propose a function that completely rebuilds the table by first deleting it and then importing all of the nids from the node table. I also propose that this is the action taken in apachesolr_clear_last_index when $namespace == ''.

<?php
/**
* Clear a specific namespace's last changed and nid, or clear all.
*/
function apachesolr_clear_last_index($namespace = '') {
  if (
$namespace) {
   
$stored = variable_get('apachesolr_index_last', array());
    unset(
$stored[$namespace]);
   
variable_set('apachesolr_index_last', $stored);
  }
  else {
   
variable_del('apachesolr_index_last');
  }
}
?>

#1

robertDouglass - July 21, 2009 - 10:04
Title:More forceful index clearing in the "all" case» More forceful index clearing in the "all" case; Change INNER JOIN to LEFT JOIN
Status:active» needs review

I also think that the JOIN should be a left join because not all nodes have rows in comment statistics.

AttachmentSize
reindex.patch 4.38 KB

#2

Scott Reynolds - July 21, 2009 - 16:00

Actually perhaps its better to do an if statement()

if (module_exists(comment)) {
INNER JOIN comment stats
}

If comment module is enabled, every node has node_comment_statistics (http://api.drupal.org/api/function/comment_nodeapi/6). This is true for the install etc (I think Im the one who wrote the INNER JOIN BTW, my bad).

Why do we care if the update timestamp makes sense? why can't it just be 0. If we are doing the edge case where nodes are imported, why not put them at the top? Or make the changed = time()? (and my reaction to this holds true for the install as well. Why do we need accurate changed in Solr table. "Changed" to apachesolr is still 0 as its never seen this node before).

And this has an added bonus of not having to worry about comment_stats.

#3

robertDouglass - July 21, 2009 - 16:44

Scott, without trying to apologize for bad data imports, it's actually quite possible to find sites where there are nodes without corresponding entries in comment_statistics. Anyway, assuming that and using an INNER join makes the system more brittle. Other than "cleanliness" or "purity", do you have objections to leaving the LEFT JOIN?

I'd like to consider your other points further, but want to commit the bug fix first.

#4

Scott Reynolds - July 21, 2009 - 17:07

no objections.

#5

Ryan Palmer - July 21, 2009 - 20:08

+1 on that JOIN being a LEFT JOIN. Records don't end up in the {node_comment_statistics} table unless the comment module is enabled. This caused a few hundred nodes to be absent from the apache solr search index for me.

As a temporary workaround, simply enabling the comments module momentarily, and disabling and re-enabling the apachesolr module seems to solve the issue I had of certain nodes not being indexed. The various implementations of hook_enable seem to take care of tidying up both the comment stats records and the indexing records.

#6

robertDouglass - July 23, 2009 - 10:39
Status:needs review» active

I've committed the patch in #1. Now, on to the next steps. As per Scott:

Why do we care if the update timestamp makes sense? why can't it just be 0. If we are doing the edge case where nodes are imported, why not put them at the top? Or make the changed = time()? (and my reaction to this holds true for the install as well. Why do we need accurate changed in Solr table. "Changed" to apachesolr is still 0 as its never seen this node before).

Valid question. Thoughts? Can anyone roll a patch?

#7

pwolanin - July 23, 2009 - 13:49

did you commit to 1.x also? this looks like a significant bug fix.

#8

robertDouglass - July 23, 2009 - 14:01

Now committed to DRUPAL-6--1 as well. Any followup patches need to be backported.

#9

Scott Reynolds - July 24, 2009 - 07:01
Status:active» needs review

So heres the followup applied to 6.2.

It nukes the search index table and then inserts the nid, status, 0 for the table.

Because the apachesolr_get_last_index() handles the batch process so well, during reindexing we shouldn't care about accuracy of the changed variable.

So the main case: Admin clicks reindex and then goes and changes/insert a node

Without this change: all nodes are processed before this new changed node is processed (the current code prevents dates in the future from being stored)

with this change: all nodes are processed before this new changed node is processed (0 <<<< time())

So the argument is, to the end user there is no difference. At the DB level, its less complicated. Is it a big deal? no not at all.

#10

Scott Reynolds - July 24, 2009 - 07:02
AttachmentSize
reindex.patch 1.24 KB

#11

pwolanin - July 25, 2009 - 16:08

@Scott - there is somewhat of a difference (may not matter) in that with your patch nodes are processed in nid order since all the timestamps will be the same.

We could also, possibly do something like:

LEAST(GREATEST(n.created, n.changed, c.last_comment_timestamp), %d)

where %d in the current timestamp.

But I think the extra query is pretty harmless.

 
 

Drupal is a registered trademark of Dries Buytaert.