The node_update_index function indexes all pages, including those with format = 2 (PHP code). As a result, when search_cron runs and reaches a PHP page, the page's code executes. Depending on the code, this might abort the rest of the indexing, and executing the code during cron may have other unpredictable consequences.

I encountered this bug with a PHP page that just calls drupal_goto to redirect to another page. When cron was called, it aborted in the middle of execution and redirected to the destination given to drupal_goto.

I solved this problem temporarily by excluding all nodes with the PHP input format from indexing, by adding " AND n.format <> 2" to the query. I believe it is a valid solution, but in my humble opinion a more general approach should be offered in the administer > input formats section: an option to include or exclude each format from search indexing. This would scale when the user adds more input formats.
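The more general approach suggested above could look roughly like the following. This is only a sketch: search_excluded_formats_sql() is a hypothetical helper, not an existing Drupal function, and it assumes the list of excluded format IDs comes from some per-format setting that does not exist yet.

```php
<?php
// Hypothetical helper, not part of Drupal: builds a SQL fragment that
// excludes the given input format IDs from the indexing query.
function search_excluded_formats_sql(array $excluded_formats) {
  // Cast to integers so the fragment is safe to append to a query string.
  $ids = array_unique(array_map('intval', $excluded_formats));
  if (empty($ids)) {
    // Nothing excluded: leave the query unchanged.
    return '';
  }
  return ' AND n.format NOT IN (' . implode(', ', $ids) . ')';
}

// With only the PHP format (ID 2, as in the report above) excluded:
echo search_excluded_formats_sql(array(2)), "\n";
```

With a single excluded format this gives the same filter as the hard-coded " AND n.format <> 2", and it keeps working when more input formats are added.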

Comments

weam’s picture

Another solution might be at the node level, i.e. being able to exclude a specific node from search by adding a field to the node named "searchable" or similar.

Steven’s picture

Status: Active » Closed (works as designed)

If the search index is aborted because of a php page, it'll continue with the next node later.

weam’s picture

Yes. But all the nodes created or modified between the previous cron run and this one will be excluded from the search.

The reason is that the abort happens before search_cron() exits the foreach module_list() as $module loop at its beginning, so it never reaches the foreach search_dirty() as $word => $dummy loop that UPDATEs the {search_total} table from the dirty words. Meanwhile, the node_cron_last variable has already been updated by the node_update_index() code.

Since the {search_total} rows are necessary for the pager_query call in do_search to succeed, because of the INNER JOIN, an out-of-date {search_total} will hide all the nodes that have changed since the last cron run. This is permanent, since node_cron_last has already moved past the affected nodes.
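To make the sequence concrete, here is a toy model of the failure. It is not the search module's code: every function and array key below is made up for illustration, an exception stands in for drupal_goto() ending the request, and the "INNER JOIN" is reduced to a simple lookup.

```php
<?php
// Toy model only; names are illustrative, not Drupal APIs.
class CronAborted extends Exception {}

function toy_search_cron(array $nodes, array &$db) {
  $dirty = array(); // per-request memory, lost if the request ends early
  foreach ($nodes as $node) {
    if ($node['changed'] <= $db['cron_last']) {
      continue; // already passed by an earlier run
    }
    // The marker advances before the node body runs.
    $db['cron_last'] = $node['changed'];
    if (!empty($node['redirects'])) {
      throw new CronAborted(); // stands in for drupal_goto()
    }
    foreach ($node['words'] as $word) {
      $db['index'][$word][] = $node['nid'];
      $dirty[$word] = TRUE;
    }
  }
  // Only reached when no node aborted the run.
  foreach ($dirty as $word => $dummy) {
    $db['total'][$word] = TRUE;
  }
}

function toy_search($word, array $db) {
  // Mimics the INNER JOIN: a word without a totals row matches nothing.
  if (!isset($db['total'][$word]) || !isset($db['index'][$word])) {
    return array();
  }
  return $db['index'][$word];
}

$db = array('cron_last' => 0, 'index' => array(), 'total' => array());
$nodes = array(
  array('nid' => 1, 'changed' => 10, 'words' => array('apple')),
  array('nid' => 2, 'changed' => 20, 'words' => array(), 'redirects' => TRUE),
  array('nid' => 3, 'changed' => 30, 'words' => array('pear')),
);

try {
  toy_search_cron($nodes, $db); // first run aborts at node 2
} catch (CronAborted $e) {
  // The request simply ends here in the real scenario.
}
toy_search_cron($nodes, $db); // second run resumes at node 3

var_dump(toy_search('apple', $db)); // empty: node 1 is indexed but hidden
var_dump(toy_search('pear', $db));  // node 3 is found
```

In this model node 1 stays hidden on every later run, because the marker has moved past it and nothing marks its words as dirty again.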

weam’s picture

Status: Closed (works as designed) » Active

Changed status back to active.

weam’s picture

Title: Cron Search Executes PHP pages » Cron Search Executes PHP pages - Node Range Lost Permanently in Search

dopry’s picture

weam, can you tell me how to duplicate this?

Zen’s picture

Priority: Critical » Normal

mfredrickson’s picture

Here's how to duplicate:

Create a page node.
Use the php filter:

<?php
drupal_goto("http://www.example.com");
?>

Make sure this node will be indexed and run mysite/cron.php

You should be redirected to example.com and site indexing should stop.

Here's a method to fix it: Wrap the goto in:

if ($_SERVER['REQUEST_URI'] != '/cron.php') {
... code ...
}
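A slightly more forgiving version of that check, as a sketch: the exact comparison with '/cron.php' will not match if the site lives in a subdirectory (as in mysite/cron.php above) or if the URL carries a query string. request_is_cron() is a hypothetical helper name, not a Drupal function.

```php
<?php
// Hypothetical helper, not a Drupal API: reports whether a request URI
// points at cron.php, ignoring any subdirectory and query string.
function request_is_cron($request_uri) {
  $path = parse_url($request_uri, PHP_URL_PATH);
  // parse_url() may return NULL or FALSE for URIs it cannot parse.
  return is_string($path) && basename($path) === 'cron.php';
}

// In the PHP page body, assuming Drupal's drupal_goto() is available:
// if (!request_is_cron($_SERVER['REQUEST_URI'])) {
//   drupal_goto('http://www.example.com');
// }
```

This only guards pages whose authors remember to add the check; it does not stop other PHP pages from running during indexing.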

I would also like to see a cron user created that can be checked for in PHP pages:

http://drupal.org/node/5380

Steven’s picture

Status: Active » Fixed

4.6.x search does do the redirect, but will continue indexing at the next node the next time cron is run. Only the {search_total} table will not be updated correctly.

In 4.7.x this was addressed. The search still does the redirect, but recovers gracefully. No search data is lost; the next time cron is run, it will continue at the next node. The only side effect is that one cron run will index slightly fewer nodes.

Anonymous’s picture

Status: Fixed » Closed (fixed)