Fix search index link handling for non-existent nodes

David_Rothstein - December 31, 2007 - 20:31
Project:Drupal
Version:7.x-dev
Component:search.module
Category:bug report
Priority:normal
Assigned:Unassigned
Status:patch (code needs work)
Description

If the text of a node contains a link to another node and that node doesn't exist yet (e.g., http://www.example.com/node/9999999999 or something like that), a PHP error is generated when the search index is updated:

notice: Trying to get property of non-object in /web/modules/search/search.module on line 503.
notice: Trying to get property of non-object in /web/modules/search/search.module on line 505.

The offending code is in search_index() and is the following:

if ($linknid > 0) {
  // Note: ignore links to uncachable nodes to avoid redirect bugs.
  $node = db_fetch_object(db_query('SELECT n.title, n.nid, n.vid, r.format FROM {node} n INNER JOIN {node_revisions} r ON n.vid = r.vid WHERE n.nid = %d', $linknid));
  if (filter_format_allowcache($node->format)) {
    $link = TRUE;
    $linktitle = $node->title;
  }
}

Basically it tries to load the node and do something with it regardless of whether or not the node exists.

I would write a patch, but I'm not sure what the correct behavior should be. Should the search index just ignore links to nodes that haven't been created yet? (I'm not sure that's ideal; think of Wikipedia, for example, where links to articles that don't exist yet are all over the place). But if it doesn't ignore them, what should be done about the $linktitle and the uncachable nodes issue?

#1

douggreen - January 1, 2008 - 00:09

I would think that if the node doesn't exist yet, then it's not using a format that can't be cached, so this check to prevent redirect bugs, could be skipped when the node doesn't exist (i.e., we could do !isset($node) || filter_format_allowcache($node->format).

Other possible problems:

  1. If the type isn't cachable, $linktitle doesn't get set here, but it's used in the else statement. We probably need logic probably needs to catch this case.
  2. We can have multiple revisions, but the node being retrieved is some random one, probably the first one. Shouldn't the SELECT ... ORDER BY r.timestamp DESC so that we get the most recent title?
  3. And while this isn't that important, we're pulling back a couple of columns that aren't needed (SELECT nid, vid).

it would be nice to get Steven's input on this...

#2

douggreen - January 1, 2008 - 00:10
Priority:minor» normal

#3

Gábor Hojtsy - January 1, 2008 - 17:33

Subscribe

#4

David_Rothstein - January 21, 2008 - 03:35
Status:active» patch (code needs review)

Here is a stab at a patch.

If the node doesn't exist yet, I set the $linktitle to be equal to the node id (I imagine it probably doesn't matter much). And of course you're right that there's no need to worry about a nonexistent node being uncacheable -- I have no idea what I was thinking when I wrote that the first time ;)

The attached patch also fixes #3 in @douggreen's comment above. I think #1 and #2 aren't necessary (but please review):

(1) $linktitle is only used in the else statement when $link is TRUE, so the current behavior (only setting $linktitle when $link is set to TRUE) seems okay.

(2) The SELECT statement acts on the {node} table, which only stores the most recent version of the node and the most recent vid, so it should be okay as is, right?

AttachmentSize
search_nonexistent_nodes.patch1.44 KB

#5

David_Rothstein - January 21, 2008 - 03:43

Also note that there is a minor problem which this patch doesn't address (and which is not limited to the case of a nonexistent node):

If you create node/1 containing http://www.example.com/node/2 in the body, then a link to node 2 will be put in the search index (under 'title' if node/2 exists and has 'title' as its title, or under '2' if node 2 doesn't exist yet, as per the above patch).

If you then change node 2's title to 'newtitle' (or create it with this title), the search index does not get correctly updated... if you search for 'title' you will still get references to node/2 (whereas searching for 'newtitle' will not correctly count the link from node 1). Things only get fixed if and when node/1 gets edited, so that the search reindexing for node/1 will get triggered.

This probably isn't a major problem, but if anyone can think of a simple way to deal with it, it could be added to this patch...

#6

robertDouglass - April 15, 2008 - 09:25
Title:Search index gives PHP error on links to nonexistent nodes» Fix search index link handling for non-existent nodes
Version:6.x-dev» 7.x-dev

Rerolled and tested. Works as advertised. One more review, please, but otherwise, should go in.

AttachmentSize
search_nonexistent_nodes.patch1.44 KB

#7

douggreen - April 27, 2008 - 19:02

I was only suggesting that we fix the php error. If the node does not exist, then I think we should just ignore the link. The entire point of the node linking is to score the title words from the linked to node, with the linking node, thus improving the search relevancy of the linked to node. But if the linked to node doesn't exist, then there's nothing to do...

#8

David_Rothstein - April 27, 2008 - 21:50

@douggreen, the use case for this is if you are adding pages sequentially. For example, you might create node 1:

Blah blah blah... See the next page for some information about <a href="/node/2">dinosaurs</a>.

and then go ahead and create node 2 later. (I think that's how I discovered this bug in the first place, although I don't really remember.) This link should be counted, and searching for "dinosaurs" should pull up node 2... but with your method, it won't happen.

Where this would be even more interesting is if it worked with path aliases. Then you could have:

Some day this site will have a page about <a href="/dinosaurs">dinosaurs</a>, but it doesn't yet.

As mentioned previously, the use case here is something like Wikipedia, which has those kinds of links all over the place. It would be nice if they could somehow be counted, although I don't see any obvious way to do this at the moment. However, if we put the infrastructure for dealing with nonexistent nodes in place, then maybe it will be possible eventually...

#9

douggreen - May 3, 2008 - 20:59
Status:patch (code needs review)» patch (code needs work)

@David, I don't see ever linking to a node that doesn't exist yet being anything but an error. How do you know what the node number will be of a future, yet to be created node?

 
 

Drupal is a registered trademark of Dries Buytaert.