In testing this module on scratch, we found that when indexing nodes that have a drupal_goto() in them, the cron process is redirected to the destination of the goto. Indexing therefore stalls on the first node with a goto.

Comments

jeremy’s picture

I have committed a less than optimal "fix" to this problem, but will leave this bug open.

The problem is quite simple: when we build nodes to index them, we call node_build_content(). This in turn calls node_prepare(), which calls check_markup() in the filter.module. In check_markup(), we apply all filters, including PHP filters. If a node has a drupal_goto() in it, and a PHP filter applied, the drupal_goto() is executed.

This affects the core module the same as the xapian module. However, it's a more serious problem for the xapian module, because the goto results in the cache not being flushed to disk. Thus, any nodes we've indexed in this cron run prior to reaching the drupal_goto() are lost.

My "fix" is to flush a node from the index_queue before we actually index it. This prevents us from trying to reindex the same node over and over and over forever, but it doesn't solve the problem of indexed content never being flushed to disk.
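
To make the trade-off concrete, here is a rough, language-neutral simulation in Python. It is not the module's actual code: Redirect, run_cron_stopgap and build_node are hypothetical names, and the exception merely stands in for drupal_goto() ending the cron request before the cache is flushed.

```python
class Redirect(Exception):
    """Hypothetical stand-in for drupal_goto() ending the cron request."""


def run_cron_stopgap(queue, index, build_node):
    """Simulate one cron run that dequeues each node BEFORE indexing it.

    queue: list of node ids waiting to be indexed (mutated in place)
    index: set of node ids that have been flushed to disk
    build_node: callable that raises Redirect for a node containing a goto
    """
    pending = []              # in-memory cache, lost if the run is aborted
    while queue:
        nid = queue.pop(0)    # remove from the queue first
        build_node(nid)       # may abort the whole run
        pending.append(nid)
    index.update(pending)     # the flush only happens on a clean finish


def simulate_stopgap():
    def build_node(nid):
        if nid == 3:          # assume node 3 contains the goto
            raise Redirect()

    queue, index = [1, 2, 3, 4], set()
    for _ in range(2):        # two consecutive cron runs
        try:
            run_cron_stopgap(queue, index, build_node)
        except Redirect:
            pass              # the run died; nothing was flushed
    return queue, index
```

In this simulation the second run gets past node 3, so indexing no longer stalls, but nodes 1 and 2 were dequeued in the aborted first run and never reach the index.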

I still consider this bug to be critical, and I'm leaving it open until we come up with an optimal fix.

jeremy’s picture

My current idea on how to fix this bug is to make indexing transactional:

  1. When _cron runs and we start indexing content, instead of deleting items from the index_queue as we index them, mark them as "indexing".
  2. When _cron finishes and we successfully flush the cache, delete all items from the index_queue that are marked as "indexing".
  3. If _cron fails for whatever reason, the next time it runs we see that there are some items in an "indexing" state.
  4. If there is more than one node in an "indexing" state, index all but the last one and exit.
  5. If there is exactly 1 node in an "indexing" state, delete it from the index_queue, and try to index it.
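
The five steps above can be sketched as a small Python simulation. This is only an illustration under stated assumptions, not the committed implementation: all names are hypothetical, nodes are assumed to be marked "indexing" one at a time just before they are built (so the last marked node is the one that aborted the run), and the sketch exits after handling the single stuck node in step 5.

```python
class Redirect(Exception):
    """Hypothetical stand-in for drupal_goto() ending the cron request."""


def run_cron_transactional(queue, index, build_node):
    """Simulate one cron run following steps 1-5.

    queue: dict of node id -> "queued" or "indexing" (insertion ordered)
    index: set of node ids that have been flushed to disk
    build_node: callable that raises Redirect for a node containing a goto
    """
    stuck = [nid for nid, state in queue.items() if state == "indexing"]
    pending = []                      # in-memory cache, lost on abort
    if len(stuck) > 1:
        # Step 4: a previous run failed; redo all but the last node.
        for nid in stuck[:-1]:
            build_node(nid)
            pending.append(nid)
    elif len(stuck) == 1:
        # Step 5: drop the suspect from the queue, then try it once.
        nid = stuck[0]
        del queue[nid]
        build_node(nid)
        pending.append(nid)
    else:
        # Step 1: normal run; mark each node before building it.
        for nid in list(queue):
            queue[nid] = "indexing"
            build_node(nid)
            pending.append(nid)
    # Step 2: flush succeeded, so remove the flushed nodes from the queue.
    index.update(pending)
    for nid in pending:
        queue.pop(nid, None)


def simulate_transactional():
    def build_node(nid):
        if nid == 3:                  # assume node 3 contains the goto
            raise Redirect()

    queue = {nid: "queued" for nid in (1, 2, 3, 4)}
    index = set()
    for _ in range(4):                # four consecutive cron runs
        try:
            run_cron_transactional(queue, index, build_node)
        except Redirect:
            pass                      # Step 3: markers survive the failure
    return queue, index
```

In this simulation the first run aborts on node 3, the second recovers nodes 1 and 2, the third discards node 3, and the fourth indexes node 4, so only the node with the goto is left out.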

jeremy’s picture

Status: Active » Fixed

I essentially implemented what I've described above, and verified that we successfully index all content except for nodes that cannot be indexed (i.e., those with a drupal_goto()).

Anonymous’s picture

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for two weeks with no activity.