Fix search index link handling for non-existent nodes [#205202]

If the text of a node contains a link to another node and that node doesn't exist yet (e.g., http://www.example.com/node/9999999999 or something like that), a PHP error is generated when the search index is updated:

notice: Trying to get property of non-object in /web/modules/search/search.module on line 503.
notice: Trying to get property of non-object in /web/modules/search/search.module on line 505.

The offending code is in search_index() and is the following:

if ($linknid > 0) {
  // Note: ignore links to uncachable nodes to avoid redirect bugs.
  $node = db_fetch_object(db_query('SELECT n.title, n.nid, n.vid, r.format FROM {node} n INNER JOIN {node_revisions} r ON n.vid = r.vid WHERE n.nid = %d', $linknid));
  if (filter_format_allowcache($node->format)) {
    $link = TRUE;
    $linktitle = $node->title;
  }
}

Basically it tries to load the node and do something with it regardless of whether or not the node exists.

I would write a patch, but I'm not sure what the correct behavior should be. Should the search index just ignore links to nodes that haven't been created yet? (I'm not sure that's ideal; think of Wikipedia, for example, where links to articles that don't exist yet are all over the place). But if it doesn't ignore them, what should be done about the $linktitle and the uncachable nodes issue?

Comment	File	Size	Author
#49	search.module.d7.remember-orphans.performance.comments.patch	3.52 KB	donquixote
#40	search.module.d7.remember-orphans.performance-is_numeric.patch	2.61 KB	donquixote
#32	search.module.d7.remember-orphans.performance.patch	2.54 KB	donquixote
#28	search.module.d7.remember-orphans.patch	741 bytes	donquixote
#27	search.module.d7.discard-orphans.patch	844 bytes	donquixote
#27	search.module.d7.remember-orphans.performance.patch	2.54 KB	donquixote
#27	search.module.d7.remember-orphans.patch	741 bytes	donquixote
#25	search-links-205202-24.patch	1.64 KB	David_Rothstein
#17	search.module.d7.patch	741 bytes	donquixote
#17	search.module.d7.alternative.patch	833 bytes	donquixote
#16	search.module.d7.alternative.patch	854 bytes	donquixote
#14	search.module.d7.patch.txt	758 bytes	donquixote
#15	search.module.d7.patch	758 bytes	donquixote
#6	search_nonexistent_nodes.patch	1.44 KB	robertdouglass
#4	search_nonexistent_nodes.patch	1.44 KB	David_Rothstein

Comments

Comment #1

douggreen commented 1 January 2008 at 00:09

I would think that if the node doesn't exist yet, then it's not using a format that can't be cached, so this check to prevent redirect bugs, could be skipped when the node doesn't exist (i.e., we could do !isset($node) || filter_format_allowcache($node->format).

Comment #2

douggreen commented 1 January 2008 at 00:10

Priority:

Minor

» Normal

Comment #3

gábor hojtsy

he/him

Hungarian

Hungary

commented 1 January 2008 at 17:33

Comment #4

David_Rothstein commented 21 January 2008 at 03:35

Status:

Active

» Needs review

Status	File	Size
new	search_nonexistent_nodes.patch	1.44 KB

Here is a stab at a patch.

If the node doesn't exist yet, I set the $linktitle to be equal to the node id (I imagine it probably doesn't matter much). And of course you're right that there's no need to worry about a nonexistent node being uncacheable -- I have no idea what I was thinking when I wrote that the first time ;)

The attached patch also fixes #3 in @douggreen's comment above. I think #1 and #2 aren't necessary (but please review):

(1) $linktitle is only used in the else statement when $link is TRUE, so the current behavior (only setting $linktitle when $link is set to TRUE) seems okay.

(2) The SELECT statement acts on the {node} table, which only stores the most recent version of the node and the most recent vid, so it should be okay as is, right?

Comment #5

David_Rothstein commented 21 January 2008 at 03:43

Also note that there is a minor problem which this patch doesn't address (and which is not limited to the case of a nonexistent node):

If you create node/1 containing http://www.example.com/node/2 in the body, then a link to node 2 will be put in the search index (under 'title' if node/2 exists and has 'title' as its title, or under '2' if node 2 doesn't exist yet, as per the above patch).

If you then change node 2's title to 'newtitle' (or create it with this title), the search index does not get correctly updated... if you search for 'title' you will still get references to node/2 (whereas searching for 'newtitle' will not correctly count the link from node 1). Things only get fixed if and when node/1 gets edited, so that the search reindexing for node/1 will get triggered.

This probably isn't a major problem, but if anyone can think of a simple way to deal with it, it could be added to this patch...

Comment #6

robertdouglass commented 15 April 2008 at 09:25

Title:	Search index gives PHP error on links to nonexistent nodes	» Fix search index link handling for non-existent nodes
Version:	6.x-dev	» 7.x-dev

Status	File	Size
new	search_nonexistent_nodes.patch	1.44 KB

Rerolled and tested. Works as advertised. One more review, please, but otherwise, should go in.

Comment #7

douggreen commented 27 April 2008 at 19:02

I was only suggesting that we fix the php error. If the node does not exist, then I think we should just ignore the link. The entire point of the node linking is to score the title words from the linked to node, with the linking node, thus improving the search relevancy of the linked to node. But if the linked to node doesn't exist, then there's nothing to do...

Comment #8

David_Rothstein commented 27 April 2008 at 21:50

@douggreen, the use case for this is if you are adding pages sequentially. For example, you might create node 1:

Blah blah blah... See the next page for some information about <a href="/node/2">dinosaurs</a>.

and then go ahead and create node 2 later. (I think that's how I discovered this bug in the first place, although I don't really remember.) This link should be counted, and searching for "dinosaurs" should pull up node 2... but with your method, it won't happen.

Where this would be even more interesting is if it worked with path aliases. Then you could have:

Some day this site will have a page about <a href="/dinosaurs">dinosaurs</a>, but it doesn't yet.

As mentioned previously, the use case here is something like Wikipedia, which has those kinds of links all over the place. It would be nice if they could somehow be counted, although I don't see any obvious way to do this at the moment. However, if we put the infrastructure for dealing with nonexistent nodes in place, then maybe it will be possible eventually...

Comment #9

douggreen commented 3 May 2008 at 20:59

Status:

Needs review

» Needs work

@David, I don't see ever linking to a node that doesn't exist yet being anything but an error. How do you know what the node number will be of a future, yet to be created node?

Comment #10

donquixote commented 19 February 2010 at 06:19

The other, more important use case is when you delete a node that is linked to from somewhere else.

My opinion: Always consider that db_fetch_object can return NULL. Drupal needs a lot more sanity checks.

About the patch:
If the linked node doesn't exist, then why do we store this in the index?
As far as I understand, the index is meant for "other nodes that link here" and "other nodes linked from here" listings. If the target node doesn't exist, then what is this good for?
- After deleting node X, the index can show me a listing of other nodes that now have orphan links.
- Viewing node A, the list of "other nodes linked from here" will include deleted nodes.
So, the main purpose would be maintenance. But we don't want to show these links in a "what links here" block for normal visitors.

I noticed a different problem:
If filter_format_allowcache($node->format) returns false, the "what links here" search (in views) doesn't work.
I totally don't get why this check is needed. Especially, if the node has some CCK textareas and the body field, and each has a different format, then what is the relevance of this check?
The most notable effect is that I spend hours of debugging to find out why the "what links here" views argument doesn't work with my site. I added a "|| true" to the filter_format_allowcache condition, and now the indexing works. Tataa!

I need to add, all my experiments were done in D6. I wish we can backport this when it's ready.

Comment #11

jhodgdon

she/her

English

commented 19 February 2010 at 16:14

Let's not combine things -- that thing about allowcache is being discussed on another issue, having to do with PHP input formats.
#286263: Make search indexing smart to handle nodes wth php redirects

So let's just stick to the main issue here, and we need another patch.

Comment #12

donquixote commented 20 February 2010 at 03:00

@jhdgdon (#11):
After reading a bit in the linked issue, I don't see how it is related to my problem. The title looks promising, but none of the patches does anything with filter_format_allowcache() inside search_index(), and none of the comments mention it. Comment #286263-57 uses filter_format_allowcache(), but this is in node.module.

Could someone explain why the filter_format_allowcache() is needed in this case? The only thing that is needed from this node is its title. No filters are executed for this node.

@robertDouglass (#6):
Why is this D7, with a patch that is clearly for D6 ? The D7 search_index() looks quite different, and does not use filter_format_allowcache() at all. Or did this change since your post?

Comment #13

jhodgdon

she/her

English

commented 20 February 2010 at 16:00

This is a Drupal 7 issue because we fix bugs in the latest version first, then port back to Drupal 6.

As far as whether this issue and the PHP formats issue are related, maybe they are not, but the issue there is again an input format that is dynamic and doesn't allow caching.

Comment #14

donquixote commented 20 February 2010 at 17:34

Status	File	Size
new	search.module.d7.patch.txt	758 bytes

Sure, but the patch in #6 will only work in D6.
D7 does not have any sanity checks either.

<?php
                $node = db_query('SELECT title, nid, vid FROM {node} WHERE nid = :nid', array(':nid' => $linknid), array('target' => 'slave'))->fetchObject();
                $link = TRUE;
                $linktitle = $node->title;
?>

Noone has reported a problem in D7 yet, but it can be expected to happen sooner or later. Sanity checks are always a good thing. So, here goes a D7 patch.

Comment #15

donquixote commented 20 February 2010 at 17:29

Status:

Needs work

» Needs review

Status	File	Size
new	search.module.d7.patch	758 bytes

It doesn't like me.. ("Ignored")

Comment #16

donquixote commented 20 February 2010 at 17:38

Status	File	Size
new	search.module.d7.alternative.patch	854 bytes

The previous patch will store the link to the non-existing node in the index table.

And here is an alternative patch, that will just ignore this node. So, if we change our minds we can use this one :)

Comment #17

donquixote commented 20 February 2010 at 17:48

Status	File	Size
new	search.module.d7.alternative.patch	833 bytes
new	search.module.d7.patch	741 bytes

Notepad++ moved the "Unix line endings" to a different menu..
So, another try, line endings fixed.

Comment #18

donquixote commented 20 February 2010 at 18:18

Thinking about this again, wouldn't it be better to store NULL as the title of a non-existing node, so it can be identified by other modules as a broken link?

Comment #19

damien tournoud commented 20 February 2010 at 18:36

Status:

Needs review

» Reviewed & tested by the community

search.module.d7.alternative.patch is the correct way to fix that.

Comment #20

David_Rothstein commented 21 February 2010 at 00:37

Status:

Reviewed & tested by the community

» Needs review

Wow, I think this might be my oldest open bug report - would be great to see it fixed :)

It's been a while since I looked at this, but I believe it's the case that search.module.d7.alternative.patch will make it so that "What links here" blocks don't work correctly in these situations. In these situations, a "What links here" kind of block should be auto-updated on the next cron run, not require any kind of manual reindexing.

Again, the main reason this issue exists is that currently you get a PHP notice. I think the goal here (especially for possible backport to D6) should be to remove the PHP notice, not change the feature.

So, I'm pretty sure that the first patch from #17 is the correct one. But I do agree with @donquixote that it is much better to store NULL or an empty string here - not sure what I was thinking when I tried using the node ID :) Also, I don't think is_object($node) is the cleanest check here. Probably better and more readable to do something like !empty($node).

Comment #21

donquixote commented 21 February 2010 at 06:04

@David_Rothstein (#20):

> that search.module.d7.alternative.patch will make it so that "What links here" blocks don't work correctly in these situations.

Really?
- Noone wants to see the "what links here" block for the nid of a non-existent node, except for housekeeping.
- If I create a new node, then the "what links here" really should be empty. If any links exist that point to the new node's id, then this will usually not be intended. Noone can know an nid before the node is created (unless we are copying parts of a database).
- If I delete a node, I am no longer interested in the links, except for housekeeping.

> In these situations, a "What links here" kind of block should be auto-updated on the next cron run, not require any kind of manual reindexing.

A cron run will do the complete job, yes.
And on top of that, when a node is saved/modified, I assume that Drupal will call search_index() for this node.
Noone needs to index the destination node!

So the only question is: Do we really need the orphan index entries for housekeeping tasks, or not? Do they have any negative side effects somewhere?

----

> Also, I don't think is_object($node) is the cleanest check here. Probably better and more readable to do something like !empty($node).

The truth is, I didn't want to look up what exactly db_query(..)->fetchObject() returns in case of failure, and if this is reliable. We know very well what it returns in case of success: An object. For anything else we rather don't want to write "$node->title". So, is_object() is actually safer than most other things that we could check for. (to be strict, we should even change the simple if($node) in the second patch, but it doesn't matter that much)

----

For backporting to D6, I would like to read an opinion or explanation about the filter_format_allowcache($node->format). This problem does not exist in D7.

Comment #22

David_Rothstein commented 21 February 2010 at 21:05

OK, this is painful but here is how to see what is going on here. On a fresh Drupal installation:

1. Create a node with title "First page", content along the lines of Go to the <a href="/node/2">next page</a>.
2. Run cron a couple times.
3. Create another node with title "Second page".
4. Run cron a couple times.

With either the current Drupal 7 codebase or search.module.d7.patch from #17, the results are the same:

mysql> select * from search_dataset;
+-----+------+--------------------------------------------------------------------+---------+
| sid | type | data                                                               | reindex |
+-----+------+--------------------------------------------------------------------+---------+
|   1 | node |  first page add new comment go to the next page  add new comment   |       0 | 
|   2 | node |  second page add new comment  next page add new comment            |       0 | 
+-----+------+--------------------------------------------------------------------+---------+
2 rows in set (0.00 sec)

mysql> select * from search_node_links;
+-----+------+-----+-----------------+
| sid | type | nid | caption         |
+-----+------+-----+-----------------+
|   1 | node |   1 | add new comment | 
|   1 | node |   2 | next page       | 
|   2 | node |   2 | add new comment | 
+-----+------+-----+-----------------+
3 rows in set (0.00 sec)

mysql> select * from search_index;
+---------+-----+------+-------+
| word    | sid | type | score |
+---------+-----+------+-------+
| add     |   1 | node |  13.2 | 
| add     |   2 | node |  13.2 | 
| comment |   1 | node |  13.2 | 
| comment |   2 | node |  13.2 | 
| first   |   1 | node |    26 | 
| new     |   1 | node |  13.2 | 
| new     |   2 | node |  13.2 | 
| next    |   1 | node |   2.2 | 
| next    |   2 | node |    11 | 
| page    |   1 | node |  28.2 | 
| page    |   2 | node |    37 | 
| second  |   2 | node |    26 | 
| the     |   1 | node |     1 | 
+---------+-----+------+-------+
13 rows in set (0.00 sec)

So, the "next page" link from node 1 to node 2 is correctly recorded everywhere, and the words "next" and "page" correctly count towards node 2 in the search index, in addition to node 1. (You can also see a nice little unrelated bug in there with the add new comment stuff, it seems...)

With search.module.d7.alternative.patch from #17:

mysql> select * from search_dataset;
+-----+------+--------------------------------------------------------------------+---------+
| sid | type | data                                                               | reindex |
+-----+------+--------------------------------------------------------------------+---------+
|   1 | node |  first page add new comment go to the next page  add new comment   |       0 | 
|   2 | node |  second page add new comment  add new comment                      |       0 | 
+-----+------+--------------------------------------------------------------------+---------+
2 rows in set (0.00 sec)

mysql> select * from search_node_links;
+-----+------+-----+-----------------+
| sid | type | nid | caption         |
+-----+------+-----+-----------------+
|   1 | node |   1 | add new comment | 
|   2 | node |   2 | add new comment | 
+-----+------+-----+-----------------+
2 rows in set (0.00 sec)

mysql> select * from search_index;
+---------+-----+------+-------+
| word    | sid | type | score |
+---------+-----+------+-------+
| add     |   1 | node |  13.2 | 
| add     |   2 | node |  13.2 | 
| comment |   1 | node |  13.2 | 
| comment |   2 | node |  13.2 | 
| first   |   1 | node |    26 | 
| new     |   1 | node |  13.2 | 
| new     |   2 | node |  13.2 | 
| next    |   1 | node |    11 | 
| page    |   1 | node |    37 | 
| page    |   2 | node |    26 | 
| second  |   2 | node |    26 | 
| the     |   1 | node |     1 | 
+---------+-----+------+-------+
12 rows in set (0.00 sec)

In this case, the link from node 1 to node 2 is never recorded, which means that a "What links here" block on node 2 will not show the correct data. The only way it will ever show the link is if someone goes back and (manually) forces a reindex of node 1 - e.g., by resaving the node. They should not have to do that in order to make it work correctly.

So, it seems to me like search.module.d7.patch is the correct approach.

I also realized why I had originally used the node ID as the link title in this case (rather than NULL) - it turns out that with the current code, you have to use something. If the caption is empty, it will never get recorded as a link. Refactoring it to allow that sounds possible, but looking at the code - probably very scary :) In any case, this rarely matters. The above example will not be affected either way, because "next page" will be used as the caption in that case. It's only if you pasted a URL directly into the node body (e.g., http://example.com/node/2) that the link title is used (which makes me wonder why in the world the search index code is querying the database for the title on every single link.... seems like an unnecessary performance hit).

As per the above, then, there is no reliable way for other modules to identify broken links via the {search_node_links} table alone - if they really care about that (a "What this links to" block is the only example I can think of), they would need to examine the node IDs themselves, which is exactly what they already need to do with the current codebase.

Comment #23

donquixote commented 21 February 2010 at 21:24

1. Create a node with title "First page", content along the lines of Go to the next page.
2. Run cron a couple times.
3. Create another node with title "Second page".
4. Run cron a couple times.

Do you seriously recommend this way of working?

1. User A creates node/1001, with a link to the yet not existing node/1002
2. If the indexer runs now (on 1001), it will find and remember a link from 1001 to not yet existing 1002. With alternative.patch it would not remember this link.
3. User B creates node/1002, totally not thinking about what user A does.
4. User A creates node/2003, which he thought to get the nid 1002.
5. If the indexer runs now (if 1001 is re-indexed), it will find the link from 1001 to existing 1002

The nid of a yet to be created node is only predictable in a very controlled one-user environment. On the average site this is likely to result in wrong link destinations. Usually a link to a non-existing node will be a link to a deleted node, or a typo.

And anyway, nowadays everyone uses pathauto, and would rather write href="alias/to/node", with a yet non-existing alias. In this case none of the existing or proposed solutions would work.

Conclusion:
- You want to support a technique that is rarely used and can't be recommended.
- A lot of things can be done on the links indexing, but this should happen after this bugfix, and in a separate issue.

Comment #24

donquixote commented 21 February 2010 at 21:30

which makes me wonder why in the world the search index code is querying the database for the title on every single link.... seems like an unnecessary performance hit

True. Let's do this in a follow-up.

Comment #25

David_Rothstein commented 21 February 2010 at 21:31

Status	File	Size
new	search-links-205202-24.patch	1.64 KB

I think something like this is probably the correct approach, which also has the nice side effect of removing the performance issue automatically. Instead of the node ID, this falls back on just using the URL itself (which probably makes some sense since that is how an external URL would be indexed)... again, I doubt it matters much.

Comment #26

David_Rothstein commented 21 February 2010 at 21:37

The nid of a yet to be created node is only predictable in a very controlled one-user environment.

Sure. But a site with a single content creator is a very valid use case.

On the average site this is likely to result in wrong link destinations. Usually a link to a non-existing node will be a link to a deleted node, or a typo.

Yes, the search index will record broken links in this case, but I don't see that as a problem. I think the point of a search index is to faithfully record the actual content, not try to guess people's intentions. If another module wants to use the search index for some other purpose (e.g., the "what this links to" example), I think it's up to them to properly check the data.

And anyway, nowadays everyone uses pathauto, and would rather write href="alias/to/node", with a yet non-existing alias. In this case none of the existing or proposed solutions would work.

Sure, but supporting that is probably for Drupal 8 at this point - see my comments above about Wikipedia. I still don't think we should change the current behavior just because it doesn't work perfectly for all cases. Maybe others feel differently.

Comment #27

donquixote commented 21 February 2010 at 21:58

Status	File	Size
new	search.module.d7.remember-orphans.patch	741 bytes
new	search.module.d7.remember-orphans.performance.patch	2.54 KB
new	search.module.d7.discard-orphans.patch	844 bytes

Ok, I leave the choice to others.
I created a new patch with the performance improvements you suggested. This only makes sense if we want to remember orphan links. If we want to discard them, we have to load the node to check if it exists. Unless we want to speed this up by querying all at once or in bigger chunks...

I re-uploaded the older patches with new names, to make the discussion easier. if($node) changed to if(is_object($node)) in discard-orphans.patch.

Comment #28

donquixote commented 21 February 2010 at 22:02

Status	File	Size
new	search.module.d7.remember-orphans.patch	741 bytes

I notice that remember-orphans.performance.patch will behave differently than remember-orphans.patch. the .performance version will store the $value, while the old version will use the $linknid, if no $node->title is found.

EDIT:
Duh! I was playing with the upload field. Please ignore the file!

Comment #29

David_Rothstein commented 21 February 2010 at 22:06

Sounds reasonable - I've already spent more time thinking about this issue than I ever wanted to :)

In the second patch:

+            if (is_object($node) && !emtpy($node->title)) {

This typo means that it better not pass the testbot - although somehow I bet it will :)

I also don't understand the differences between the second patch (search.module.d7.remember-orphans.performance.patch) and my patch from #25. Why did you need to make all those extra changes? Was there something I missed?

Comment #30

donquixote commented 21 February 2010 at 22:26

Because
a) I didn't notice your patch. My apologies.
b) I replaced the check for if ($link) with if ($linknid > 0). Matter of taste. I'm happy to do it your way.

if (is_object($node) && !emtpy($node->title)) {

Some people think that there is one correct solution for sanity checking, and this has to be used everywhere. This is not the case. Sanity checking is problem-specific, and !empty() is not always the best solution. In this case we want to make sure that $node is nothing but an object.

Comment #31

David_Rothstein commented 21 February 2010 at 22:46

I was mainly referring to the typo ("emtpy" vs "empty")....

In this case I think you are probably right that is_object($node) is good. Actually, it's probably enough - I'm not sure what the other check really adds. But if there is a reason, "empty" needs to be spelled correctly :)

Comment #32

donquixote commented 21 February 2010 at 22:54

Status	File	Size
new	search.module.d7.remember-orphans.performance.patch	2.54 KB

> what the other check really adds

It prevents the text to be replaced with an empty string or other things, if for whatever reason the node title is an empty string or NULL or whatever other things. We don't absolutely need that.

> But if there is a reason, "empty" needs to be spelled correctly

Wow, I didn't notice that.
I am cheating a bit with my patches, because I don't have a working D7 install for testing. Seeing that the test bots don't have a problem with the "emtpy", I ask kindly ask to be careful with my patches.

Comment #33

David_Rothstein commented 22 February 2010 at 03:05

I created a couple spinoff issues for some of the other things that were discussed above:

#721374: "Add new comment" is put into the search index along with each node (bug report for Drupal 7)

#721394: Allow links to non-existent URL aliases to be indexed correctly once the node with that alias is created (feature request for Drupal 8)

Comment #34

elally commented 25 April 2010 at 19:47

subscribing

Comment #35

jhodgdon

she/her

English

commented 26 April 2010 at 14:57

The patch in #32 looks quite reasonable to me. David_Rothstein - any further comments?

Comment #36

jhodgdon

she/her

English

commented 26 April 2010 at 22:09

#32: search.module.d7.remember-orphans.performance.patch queued for re-testing.

Comment #37

David_Rothstein commented 27 April 2010 at 02:44

It looks reasonable to me too, and I confirmed that it works - but I'm not the one who objected to the general approach :) In my opinion, it should go in.

In terms of the code, my only (minor) comments are that we are setting $linknid = NULL in some places but then doing checks like $linknid > 0 - I think it would be cleaner to go one way or the other, i.e. always keep it an integer (use $linknid = 0 instead) or the other way around.

Also, this part:

-  $link = FALSE; // State variable for link analyzer
+  $linknid = NULL; // State variable for link analyzer

The code comment seems a little off now - perhaps should be more like "ID of the most recent node link"?

Comment #38

jhodgdon

she/her

English

commented 27 April 2010 at 14:13

Status:

Needs review

» Needs work

Agreed on your two comments about the code. Let's fix these and go ahead and get it in.

Comment #39

donquixote commented 11 May 2010 at 17:42

I think the value NULL reflects better than a numeric zero that something is switched off. The information the variable wants to communicate in this case is not a numeric information.

The if ($linknid > 0) does return false for NULL, but it is a numeric comparison. To make it look more logical, we should first check for NULL and then do a numeric check (if that is still needed):

if (is_numeric($linknid) && $linknid > 0)

The $linknid is coming from a preg_match which does allow positive numbers and zero, so I guess the $linknid > 0 is needed.

Comment #40

donquixote commented 11 May 2010 at 17:52

Status	File	Size
new	search.module.d7.remember-orphans.performance-is_numeric.patch	2.61 KB

Changed the comment to
$linknid = NULL; // value NULL means that the link does not point to a node.

added the is_numeric in two places.

I did not test this modified patch in anyway. just downloaded and applied the changes to the patch file itself.

Comment #41

donquixote commented 11 May 2010 at 17:52

Status:

Needs work

» Needs review

Comment #42

jhodgdon

she/her

English

commented 12 May 2010 at 14:59

Status:

Needs review

» Needs work

This code is kind of a mess...
- Can't you just use !empty() to test whether something is non-zero and numeric?
- Our coding standards say that comments come before the line of code, not at the end of the line, and should be sentences.
http://drupal.org/coding-standards#comment

Comment #43

donquixote commented 12 May 2010 at 15:26

I want to test if it is positive.
Technically, $linknid > 0 does exactly that, but it does not indicate that we expect a non-numeric value in some cases.
empty() does accept non-numeric strings, negative integers and float numbers. The preg_match will only return non-negative integers, so the only possible values are NULL, zero and positive integers.
Making this explicit in the check is more transparent than just knowing what the preg_match does.

empty() is the laziest and least specific kind of check, I don't see this as something positive. It does accept not only positive integers, but also negative integers, float numbers, strings, objects etc. We never get these (thanks to the preg_match), but I think it is more transparent to explicitly check for the type of value we want.

I prefer checks that (i) indicate what kind of values we expect in either case (that would be NULL, zero and positive integers), and (ii) rule out any unexpected value (such as non-numeric strings, negative values or floats), so that a reader doesn't need to search where the values come from.

The $linknid > 0 satisfies condition (ii), but not condition (i). empty() satisfies none of these conditions.
I agree that having two checks looks messy..

Anyway, I don't care that much.

--------

About the comment, you are right, I was just too lazy to generate a new patch.

Comment #44

jhodgdon

she/her

English

commented 12 May 2010 at 15:54

How could it possibly be anything but a positive integer or Null/zero from a query on the node table?

Comment #45

donquixote commented 12 May 2010 at 16:12

It's funny how that comment totally proves my point :p

The $linknid is not from a query on the node table. It is a value extracted from the text, and filtered/manipulated with drupal_get_normal_path() and preg_match(). The final preg_match allows nothing but non-negative integers, so empty() is in fact safe. But it is not transparent for people who don't look where the value comes from (did you?).

Anyway, I don't care that much (again), just can't resist an argument...

Comment #46

jhodgdon

she/her

English

commented 12 May 2010 at 16:19

My mistake, sorry!

Anyway the code needs cleaning up. I think it would be cleaner just to cast it to an int.

Comment #47

donquixote commented 12 May 2010 at 16:32

So, my idea was:
NULL is a non-numeric value, carrying the non-numeric information that we are not dealing with a node. Previously this was done in an additional variable $link = FALSE, that I thought is no longer necessary.
Positive integer values can be assumed to be node nids (although we never know if that node exist, until we check the db).
We do not expect any other values (such as arbitrary strings, or negative numbers, or numeric zero), but I wanted to be strict and make things explicit.

Cast it to an int - where? A numeric zero would still be technically sufficient to express that we are not dealing with a node nid (because auto_increment starts at one, not zero), but I think the non-numeric information should not be expressed as a number.

That was my thinking.. now let's wait what others say. I bet few people enjoy to read this little controversy :(

---

Do you see more problems than these two points?

Comment #48

jhodgdon

she/her

English

commented 12 May 2010 at 19:50

OK, I'm convinced.

The code still needs to be brought up to standards. And if you needed to explain it to me, please add comments to explain it in the code. :)

Comment #49

donquixote commented 8 June 2010 at 13:45

Status	File	Size
new	search.module.d7.remember-orphans.performance.comments.patch	3.52 KB

Let's try another one.
More comments, and replaced the is_numeric() and > stuff with !empty().
I think this is fine, if the comments explain what kind of values we can expect.

Comment #50

donquixote commented 8 June 2010 at 13:45

Status:

Needs work

» Needs review

Comment #51

jhodgdon

she/her

English

commented 8 June 2010 at 15:16

Status:

Needs review

» Needs work

Please read (and follow) the coding standards for in-code comments and casting:
http://drupal.org/coding-standards#comment
http://drupal.org/coding-standards#cast

Aside from that:

             if (preg_match('!(?:node|book)/(?:view/)?([0-9]+)!i', $path, $match)) {
+              // The link looks like it points to a node.
+              // $match[1] can be a positive integer or zero (theoretically).

I think $match[1] could be empty/undefined, if the URL is node/view here? So this comment is maybe not quite right?

+            $node = db_query(
+              'SELECT title, nid, vid FROM {node} WHERE nid = :nid',
+              array(':nid' => $linknid),
+              array('target' => 'slave')
+            )->fetchObject();

I don't think the vid field is being used anywhere? And... Is this the right way to get the node title now, or is node title a field? I think it's a field, and this isn't the right way to get the node title?

Comment #52

donquixote commented 8 June 2010 at 16:09

Status:

Needs work

» Needs review

Coding standards for in-code comments:
What exactly is wrong? There is no capital letter for "$".
To be honest, there are some existing comments that I feel puzzled about, such as "Links score mainly for the target.". If you can suggest anything, let me know.
The same for other comment issues, it would help if you can suggest something better, if you don't like it.

Coding standards for casting:
Ok, one space missing. (only looking at those lines that are new)

             if (preg_match('!(?:node|book)/(?:view/)?([0-9]+)!i', $path, $match)) {
+              // The link looks like it points to a node.
+              // $match[1] can be a positive integer or zero (theoretically).
I think $match[1] could be empty/undefined, if the URL is node/view here? So this comment is maybe not quite right?

$m[1] is the result of a regex, which I don't fully understand, '!(?:node|book)/(?:view/)?([0-9]+)!i'. Why are we looking for book or view stuff here? Why not a simple /^node\/(\d+)$/ ?
So, hm, I guess I got this wrong, was only looking at the [0-9]+ part.

Could you explain how you understand this regex check, before I post the next patch?

I don't think the vid field is being used anywhere? And... Is this the right way to get the node title now, or is node title a field? I think it's a field, and this isn't the right way to get the node title?

This part is moved from some lines above, where it used to work (afaik). I guess this is not the only place in D7 where something does $node->title. I guess the title behaves like a field, but is still stored in the master node table. I did not intend to improve or rewrite it. I think this would be for another issue. For now we just want to fix a bug. Moving this piece of code down was suggested some posts before for performance reasons.

Yes, the vid seems to be unnecessary. Still, it's existing code that I would prefer not to touch.

Comment #53

donquixote commented 8 June 2010 at 16:10

Status:

Needs review

» Needs work

didn't mean it to change this to "needs review".
(I was about to post the next patch, but then didn't.)

Comment #54

David_Rothstein commented 8 June 2010 at 16:13

Last time I checked, D7 node titles were no longer fields:

#571654: Revert node titles as fields

Unless there was a rollback of that rollback which I missed :)

Comment #55

gpk commented 14 June 2010 at 12:43

@10, 11, 12, 13: the issue to do with the handling in 6.x of nodes with a non-cacheable format is actually #664124: {search_node_links} entries are not created if the target is uncachable (e.g. PHP input format). See also #513870-12: What links here tab / backlinks not finding nodes. (Apols, slightly off-topic.)

Comment #56

Sree commented 4 October 2011 at 06:10

If filter_format_allowcache($node->format) returns false, the "what links here" search (in views) doesn't work. in D6

Was there any patch released for this?

Comment #57

jhodgdon

she/her

English

commented 28 October 2013 at 02:22

Status:

Needs work

» Closed (duplicate)

Someone is working on this exact current issue on #1299530: php errors triggered during search indexing if content contains broken internal links now.

Comment #58

ykoech commented 29 July 2018 at 16:28

15 files were hidden/shown/deleted

Status	File	Size
hidden	search_nonexistent_nodes.patch	1.44 KB
hidden	search_nonexistent_nodes.patch	1.44 KB
hidden	search.module.d7.patch	758 bytes
hidden	search.module.d7.patch.txt	758 bytes
hidden	search.module.d7.alternative.patch	854 bytes
hidden	search.module.d7.alternative.patch	833 bytes
hidden	search.module.d7.patch	741 bytes
hidden	search-links-205202-24.patch	1.64 KB
hidden	search.module.d7.remember-orphans.patch	741 bytes
hidden	search.module.d7.remember-orphans.performance.patch	2.54 KB
hidden	search.module.d7.discard-orphans.patch	844 bytes
hidden	search.module.d7.remember-orphans.patch	741 bytes
hidden	search.module.d7.remember-orphans.performance.patch	2.54 KB
hidden	search.module.d7.remember-orphans.performance-is_numeric.patch	2.61 KB
hidden	search.module.d7.remember-orphans.performance.comments.patch	3.52 KB

I have an issue where index got to 96% and won't proceed even if I add more pages, this is the exact message "96% of the site has been indexed. There are 384 items left to index."

Note; I found a fix last year but a Drupal update messed up everything. I can't trace the fix right now, its been hours. I think it had to do with skipping deleted nodes from Drupal Index to avoid errors or something close. I can't quite remember.