If the text of a node contains a link to another node and that node doesn't exist yet (e.g., http://www.example.com/node/9999999999 or something like that), a PHP error is generated when the search index is updated:
notice: Trying to get property of non-object in /web/modules/search/search.module on line 503.
notice: Trying to get property of non-object in /web/modules/search/search.module on line 505.
The offending code is in search_index() and is the following:
if ($linknid > 0) {
// Note: ignore links to uncachable nodes to avoid redirect bugs.
$node = db_fetch_object(db_query('SELECT n.title, n.nid, n.vid, r.format FROM {node} n INNER JOIN {node_revisions} r ON n.vid = r.vid WHERE n.nid = %d', $linknid));
if (filter_format_allowcache($node->format)) {
$link = TRUE;
$linktitle = $node->title;
}
}
Basically it tries to load the node and do something with it regardless of whether or not the node exists.
I would write a patch, but I'm not sure what the correct behavior should be. Should the search index just ignore links to nodes that haven't been created yet? (I'm not sure that's ideal; think of Wikipedia, for example, where links to articles that don't exist yet are all over the place). But if it doesn't ignore them, what should be done about the $linktitle and the uncachable nodes issue?
| Comment | File | Size | Author |
|---|
Comments
Comment #1
douggreen commentedI would think that if the node doesn't exist yet, then it's not using a format that can't be cached, so this check to prevent redirect bugs, could be skipped when the node doesn't exist (i.e., we could do
!isset($node) || filter_format_allowcache($node->format).Other possible problems:
elsestatement. We probably need logic probably needs to catch this case.SELECT ... ORDER BY r.timestamp DESCso that we get the most recent title?SELECT nid, vid).it would be nice to get Steven's input on this...
Comment #2
douggreen commentedComment #3
gábor hojtsySubscribe
Comment #4
David_Rothstein commentedHere is a stab at a patch.
If the node doesn't exist yet, I set the $linktitle to be equal to the node id (I imagine it probably doesn't matter much). And of course you're right that there's no need to worry about a nonexistent node being uncacheable -- I have no idea what I was thinking when I wrote that the first time ;)
The attached patch also fixes #3 in @douggreen's comment above. I think #1 and #2 aren't necessary (but please review):
(1) $linktitle is only used in the
elsestatement when $link is TRUE, so the current behavior (only setting $linktitle when $link is set to TRUE) seems okay.(2) The
SELECTstatement acts on the {node} table, which only stores the most recent version of the node and the most recent vid, so it should be okay as is, right?Comment #5
David_Rothstein commentedAlso note that there is a minor problem which this patch doesn't address (and which is not limited to the case of a nonexistent node):
If you create node/1 containing
http://www.example.com/node/2in the body, then a link to node 2 will be put in the search index (under 'title' if node/2 exists and has 'title' as its title, or under '2' if node 2 doesn't exist yet, as per the above patch).If you then change node 2's title to 'newtitle' (or create it with this title), the search index does not get correctly updated... if you search for 'title' you will still get references to node/2 (whereas searching for 'newtitle' will not correctly count the link from node 1). Things only get fixed if and when node/1 gets edited, so that the search reindexing for node/1 will get triggered.
This probably isn't a major problem, but if anyone can think of a simple way to deal with it, it could be added to this patch...
Comment #6
robertdouglass commentedRerolled and tested. Works as advertised. One more review, please, but otherwise, should go in.
Comment #7
douggreen commentedI was only suggesting that we fix the php error. If the node does not exist, then I think we should just ignore the link. The entire point of the node linking is to score the title words from the linked to node, with the linking node, thus improving the search relevancy of the linked to node. But if the linked to node doesn't exist, then there's nothing to do...
Comment #8
David_Rothstein commented@douggreen, the use case for this is if you are adding pages sequentially. For example, you might create node 1:
and then go ahead and create node 2 later. (I think that's how I discovered this bug in the first place, although I don't really remember.) This link should be counted, and searching for "dinosaurs" should pull up node 2... but with your method, it won't happen.
Where this would be even more interesting is if it worked with path aliases. Then you could have:
As mentioned previously, the use case here is something like Wikipedia, which has those kinds of links all over the place. It would be nice if they could somehow be counted, although I don't see any obvious way to do this at the moment. However, if we put the infrastructure for dealing with nonexistent nodes in place, then maybe it will be possible eventually...
Comment #9
douggreen commented@David, I don't see ever linking to a node that doesn't exist yet being anything but an error. How do you know what the node number will be of a future, yet to be created node?
Comment #10
donquixote commentedThe other, more important use case is when you delete a node that is linked to from somewhere else.
My opinion: Always consider that db_fetch_object can return NULL. Drupal needs a lot more sanity checks.
About the patch:
If the linked node doesn't exist, then why do we store this in the index?
As far as I understand, the index is meant for "other nodes that link here" and "other nodes linked from here" listings. If the target node doesn't exist, then what is this good for?
- After deleting node X, the index can show me a listing of other nodes that now have orphan links.
- Viewing node A, the list of "other nodes linked from here" will include deleted nodes.
So, the main purpose would be maintenance. But we don't want to show these links in a "what links here" block for normal visitors.
I noticed a different problem:
If filter_format_allowcache($node->format) returns false, the "what links here" search (in views) doesn't work.
I totally don't get why this check is needed. Especially, if the node has some CCK textareas and the body field, and each has a different format, then what is the relevance of this check?
The most notable effect is that I spend hours of debugging to find out why the "what links here" views argument doesn't work with my site. I added a "|| true" to the filter_format_allowcache condition, and now the indexing works. Tataa!
I need to add, all my experiments were done in D6. I wish we can backport this when it's ready.
Comment #11
jhodgdonLet's not combine things -- that thing about allowcache is being discussed on another issue, having to do with PHP input formats.
#286263: Make search indexing smart to handle nodes wth php redirects
So let's just stick to the main issue here, and we need another patch.
Comment #12
donquixote commented@jhdgdon (#11):
After reading a bit in the linked issue, I don't see how it is related to my problem. The title looks promising, but none of the patches does anything with filter_format_allowcache() inside search_index(), and none of the comments mention it. Comment #286263-57 uses filter_format_allowcache(), but this is in node.module.
Could someone explain why the filter_format_allowcache() is needed in this case? The only thing that is needed from this node is its title. No filters are executed for this node.
@robertDouglass (#6):
Why is this D7, with a patch that is clearly for D6 ? The D7 search_index() looks quite different, and does not use filter_format_allowcache() at all. Or did this change since your post?
Comment #13
jhodgdonThis is a Drupal 7 issue because we fix bugs in the latest version first, then port back to Drupal 6.
As far as whether this issue and the PHP formats issue are related, maybe they are not, but the issue there is again an input format that is dynamic and doesn't allow caching.
Comment #14
donquixote commentedSure, but the patch in #6 will only work in D6.
D7 does not have any sanity checks either.
Noone has reported a problem in D7 yet, but it can be expected to happen sooner or later. Sanity checks are always a good thing. So, here goes a D7 patch.
Comment #15
donquixote commentedIt doesn't like me.. ("Ignored")
Comment #16
donquixote commentedThe previous patch will store the link to the non-existing node in the index table.
And here is an alternative patch, that will just ignore this node. So, if we change our minds we can use this one :)
Comment #17
donquixote commentedNotepad++ moved the "Unix line endings" to a different menu..
So, another try, line endings fixed.
Comment #18
donquixote commentedThinking about this again, wouldn't it be better to store NULL as the title of a non-existing node, so it can be identified by other modules as a broken link?
Comment #19
damien tournoud commentedsearch.module.d7.alternative.patch is the correct way to fix that.
Comment #20
David_Rothstein commentedWow, I think this might be my oldest open bug report - would be great to see it fixed :)
It's been a while since I looked at this, but I believe it's the case that search.module.d7.alternative.patch will make it so that "What links here" blocks don't work correctly in these situations. In these situations, a "What links here" kind of block should be auto-updated on the next cron run, not require any kind of manual reindexing.
Again, the main reason this issue exists is that currently you get a PHP notice. I think the goal here (especially for possible backport to D6) should be to remove the PHP notice, not change the feature.
So, I'm pretty sure that the first patch from #17 is the correct one. But I do agree with @donquixote that it is much better to store NULL or an empty string here - not sure what I was thinking when I tried using the node ID :) Also, I don't think
is_object($node)is the cleanest check here. Probably better and more readable to do something like!empty($node).Comment #21
donquixote commented@David_Rothstein (#20):
> that search.module.d7.alternative.patch will make it so that "What links here" blocks don't work correctly in these situations.
Really?
- Noone wants to see the "what links here" block for the nid of a non-existent node, except for housekeeping.
- If I create a new node, then the "what links here" really should be empty. If any links exist that point to the new node's id, then this will usually not be intended. Noone can know an nid before the node is created (unless we are copying parts of a database).
- If I delete a node, I am no longer interested in the links, except for housekeeping.
> In these situations, a "What links here" kind of block should be auto-updated on the next cron run, not require any kind of manual reindexing.
A cron run will do the complete job, yes.
And on top of that, when a node is saved/modified, I assume that Drupal will call search_index() for this node.
Noone needs to index the destination node!
So the only question is: Do we really need the orphan index entries for housekeeping tasks, or not? Do they have any negative side effects somewhere?
----
> Also, I don't think is_object($node) is the cleanest check here. Probably better and more readable to do something like !empty($node).
The truth is, I didn't want to look up what exactly db_query(..)->fetchObject() returns in case of failure, and if this is reliable. We know very well what it returns in case of success: An object. For anything else we rather don't want to write "$node->title". So, is_object() is actually safer than most other things that we could check for. (to be strict, we should even change the simple if($node) in the second patch, but it doesn't matter that much)
----
For backporting to D6, I would like to read an opinion or explanation about the filter_format_allowcache($node->format). This problem does not exist in D7.
Comment #22
David_Rothstein commentedOK, this is painful but here is how to see what is going on here. On a fresh Drupal installation:
1. Create a node with title "First page", content along the lines of
Go to the <a href="/node/2">next page</a>.2. Run cron a couple times.
3. Create another node with title "Second page".
4. Run cron a couple times.
With either the current Drupal 7 codebase or search.module.d7.patch from #17, the results are the same:
So, the "next page" link from node 1 to node 2 is correctly recorded everywhere, and the words "next" and "page" correctly count towards node 2 in the search index, in addition to node 1. (You can also see a nice little unrelated bug in there with the add new comment stuff, it seems...)
With search.module.d7.alternative.patch from #17:
In this case, the link from node 1 to node 2 is never recorded, which means that a "What links here" block on node 2 will not show the correct data. The only way it will ever show the link is if someone goes back and (manually) forces a reindex of node 1 - e.g., by resaving the node. They should not have to do that in order to make it work correctly.
So, it seems to me like search.module.d7.patch is the correct approach.
I also realized why I had originally used the node ID as the link title in this case (rather than NULL) - it turns out that with the current code, you have to use something. If the caption is empty, it will never get recorded as a link. Refactoring it to allow that sounds possible, but looking at the code - probably very scary :) In any case, this rarely matters. The above example will not be affected either way, because "next page" will be used as the caption in that case. It's only if you pasted a URL directly into the node body (e.g., http://example.com/node/2) that the link title is used (which makes me wonder why in the world the search index code is querying the database for the title on every single link.... seems like an unnecessary performance hit).
As per the above, then, there is no reliable way for other modules to identify broken links via the {search_node_links} table alone - if they really care about that (a "What this links to" block is the only example I can think of), they would need to examine the node IDs themselves, which is exactly what they already need to do with the current codebase.
Comment #23
donquixote commentedDo you seriously recommend this way of working?
1. User A creates node/1001, with a link to the yet not existing node/1002
2. If the indexer runs now (on 1001), it will find and remember a link from 1001 to not yet existing 1002. With alternative.patch it would not remember this link.
3. User B creates node/1002, totally not thinking about what user A does.
4. User A creates node/2003, which he thought to get the nid 1002.
5. If the indexer runs now (if 1001 is re-indexed), it will find the link from 1001 to existing 1002
The nid of a yet to be created node is only predictable in a very controlled one-user environment. On the average site this is likely to result in wrong link destinations. Usually a link to a non-existing node will be a link to a deleted node, or a typo.
And anyway, nowadays everyone uses pathauto, and would rather write href="alias/to/node", with a yet non-existing alias. In this case none of the existing or proposed solutions would work.
Conclusion:
- You want to support a technique that is rarely used and can't be recommended.
- A lot of things can be done on the links indexing, but this should happen after this bugfix, and in a separate issue.
Comment #24
donquixote commentedTrue. Let's do this in a follow-up.
Comment #25
David_Rothstein commentedI think something like this is probably the correct approach, which also has the nice side effect of removing the performance issue automatically. Instead of the node ID, this falls back on just using the URL itself (which probably makes some sense since that is how an external URL would be indexed)... again, I doubt it matters much.
Comment #26
David_Rothstein commentedSure. But a site with a single content creator is a very valid use case.
Yes, the search index will record broken links in this case, but I don't see that as a problem. I think the point of a search index is to faithfully record the actual content, not try to guess people's intentions. If another module wants to use the search index for some other purpose (e.g., the "what this links to" example), I think it's up to them to properly check the data.
Sure, but supporting that is probably for Drupal 8 at this point - see my comments above about Wikipedia. I still don't think we should change the current behavior just because it doesn't work perfectly for all cases. Maybe others feel differently.
Comment #27
donquixote commentedOk, I leave the choice to others.
I created a new patch with the performance improvements you suggested. This only makes sense if we want to remember orphan links. If we want to discard them, we have to load the node to check if it exists. Unless we want to speed this up by querying all at once or in bigger chunks...
I re-uploaded the older patches with new names, to make the discussion easier. if($node) changed to if(is_object($node)) in discard-orphans.patch.
Comment #28
donquixote commentedI notice that remember-orphans.performance.patch will behave differently than remember-orphans.patch. the .performance version will store the $value, while the old version will use the $linknid, if no $node->title is found.
EDIT:
Duh! I was playing with the upload field. Please ignore the file!
Comment #29
David_Rothstein commentedSounds reasonable - I've already spent more time thinking about this issue than I ever wanted to :)
In the second patch:
This typo means that it better not pass the testbot - although somehow I bet it will :)
I also don't understand the differences between the second patch (search.module.d7.remember-orphans.performance.patch) and my patch from #25. Why did you need to make all those extra changes? Was there something I missed?
Comment #30
donquixote commentedBecause
a) I didn't notice your patch. My apologies.
b) I replaced the check for if ($link) with if ($linknid > 0). Matter of taste. I'm happy to do it your way.
Some people think that there is one correct solution for sanity checking, and this has to be used everywhere. This is not the case. Sanity checking is problem-specific, and !empty() is not always the best solution. In this case we want to make sure that $node is nothing but an object.
Comment #31
David_Rothstein commentedI was mainly referring to the typo ("emtpy" vs "empty")....
In this case I think you are probably right that is_object($node) is good. Actually, it's probably enough - I'm not sure what the other check really adds. But if there is a reason, "empty" needs to be spelled correctly :)
Comment #32
donquixote commented> what the other check really adds
It prevents the text to be replaced with an empty string or other things, if for whatever reason the node title is an empty string or NULL or whatever other things. We don't absolutely need that.
> But if there is a reason, "empty" needs to be spelled correctly
Wow, I didn't notice that.
I am cheating a bit with my patches, because I don't have a working D7 install for testing. Seeing that the test bots don't have a problem with the "emtpy", I ask kindly ask to be careful with my patches.
Comment #33
David_Rothstein commentedI created a couple spinoff issues for some of the other things that were discussed above:
#721374: "Add new comment" is put into the search index along with each node (bug report for Drupal 7)
#721394: Allow links to non-existent URL aliases to be indexed correctly once the node with that alias is created (feature request for Drupal 8)
Comment #34
elally commentedsubscribing
Comment #35
jhodgdonThe patch in #32 looks quite reasonable to me. David_Rothstein - any further comments?
Comment #36
jhodgdon#32: search.module.d7.remember-orphans.performance.patch queued for re-testing.
Comment #37
David_Rothstein commentedIt looks reasonable to me too, and I confirmed that it works - but I'm not the one who objected to the general approach :) In my opinion, it should go in.
In terms of the code, my only (minor) comments are that we are setting
$linknid = NULLin some places but then doing checks like$linknid > 0- I think it would be cleaner to go one way or the other, i.e. always keep it an integer (use$linknid = 0instead) or the other way around.Also, this part:
The code comment seems a little off now - perhaps should be more like "ID of the most recent node link"?
Comment #38
jhodgdonAgreed on your two comments about the code. Let's fix these and go ahead and get it in.
Comment #39
donquixote commentedI think the value NULL reflects better than a numeric zero that something is switched off. The information the variable wants to communicate in this case is not a numeric information.
The
if ($linknid > 0)does return false for NULL, but it is a numeric comparison. To make it look more logical, we should first check for NULL and then do a numeric check (if that is still needed):if (is_numeric($linknid) && $linknid > 0)The $linknid is coming from a preg_match which does allow positive numbers and zero, so I guess the
$linknid > 0is needed.Comment #40
donquixote commentedChanged the comment to
$linknid = NULL; // value NULL means that the link does not point to a node.added the is_numeric in two places.
I did not test this modified patch in anyway. just downloaded and applied the changes to the patch file itself.
Comment #41
donquixote commented..
Comment #42
jhodgdonThis code is kind of a mess...
- Can't you just use !empty() to test whether something is non-zero and numeric?
- Our coding standards say that comments come before the line of code, not at the end of the line, and should be sentences.
http://drupal.org/coding-standards#comment
Comment #43
donquixote commentedI want to test if it is positive.
Technically,
$linknid > 0does exactly that, but it does not indicate that we expect a non-numeric value in some cases.empty() does accept non-numeric strings, negative integers and float numbers. The preg_match will only return non-negative integers, so the only possible values are NULL, zero and positive integers.
Making this explicit in the check is more transparent than just knowing what the preg_match does.
empty() is the laziest and least specific kind of check, I don't see this as something positive. It does accept not only positive integers, but also negative integers, float numbers, strings, objects etc. We never get these (thanks to the preg_match), but I think it is more transparent to explicitly check for the type of value we want.
I prefer checks that (i) indicate what kind of values we expect in either case (that would be NULL, zero and positive integers), and (ii) rule out any unexpected value (such as non-numeric strings, negative values or floats), so that a reader doesn't need to search where the values come from.
The
$linknid > 0satisfies condition (ii), but not condition (i). empty() satisfies none of these conditions.I agree that having two checks looks messy..
Anyway, I don't care that much.
--------
About the comment, you are right, I was just too lazy to generate a new patch.
Comment #44
jhodgdonHow could it possibly be anything but a positive integer or Null/zero from a query on the node table?
Comment #45
donquixote commentedIt's funny how that comment totally proves my point :p
The $linknid is not from a query on the node table. It is a value extracted from the text, and filtered/manipulated with drupal_get_normal_path() and preg_match(). The final preg_match allows nothing but non-negative integers, so empty() is in fact safe. But it is not transparent for people who don't look where the value comes from (did you?).
Anyway, I don't care that much (again), just can't resist an argument...
Comment #46
jhodgdonMy mistake, sorry!
Anyway the code needs cleaning up. I think it would be cleaner just to cast it to an int.
Comment #47
donquixote commentedSo, my idea was:
NULL is a non-numeric value, carrying the non-numeric information that we are not dealing with a node. Previously this was done in an additional variable
$link = FALSE, that I thought is no longer necessary.Positive integer values can be assumed to be node nids (although we never know if that node exist, until we check the db).
We do not expect any other values (such as arbitrary strings, or negative numbers, or numeric zero), but I wanted to be strict and make things explicit.
Cast it to an int - where? A numeric zero would still be technically sufficient to express that we are not dealing with a node nid (because auto_increment starts at one, not zero), but I think the non-numeric information should not be expressed as a number.
That was my thinking.. now let's wait what others say. I bet few people enjoy to read this little controversy :(
---
Do you see more problems than these two points?
Comment #48
jhodgdonOK, I'm convinced.
The code still needs to be brought up to standards. And if you needed to explain it to me, please add comments to explain it in the code. :)
Comment #49
donquixote commentedLet's try another one.
More comments, and replaced the is_numeric() and > stuff with !empty().
I think this is fine, if the comments explain what kind of values we can expect.
Comment #50
donquixote commented..
Comment #51
jhodgdonPlease read (and follow) the coding standards for in-code comments and casting:
http://drupal.org/coding-standards#comment
http://drupal.org/coding-standards#cast
Aside from that:
I think $match[1] could be empty/undefined, if the URL is node/view here? So this comment is maybe not quite right?
I don't think the vid field is being used anywhere? And... Is this the right way to get the node title now, or is node title a field? I think it's a field, and this isn't the right way to get the node title?
Comment #52
donquixote commentedCoding standards for in-code comments:
What exactly is wrong? There is no capital letter for "$".
To be honest, there are some existing comments that I feel puzzled about, such as "Links score mainly for the target.". If you can suggest anything, let me know.
The same for other comment issues, it would help if you can suggest something better, if you don't like it.
Coding standards for casting:
Ok, one space missing. (only looking at those lines that are new)
$m[1] is the result of a regex, which I don't fully understand,
'!(?:node|book)/(?:view/)?([0-9]+)!i'. Why are we looking for book or view stuff here? Why not a simple/^node\/(\d+)$/?So, hm, I guess I got this wrong, was only looking at the
[0-9]+part.Could you explain how you understand this regex check, before I post the next patch?
This part is moved from some lines above, where it used to work (afaik). I guess this is not the only place in D7 where something does $node->title. I guess the title behaves like a field, but is still stored in the master node table. I did not intend to improve or rewrite it. I think this would be for another issue. For now we just want to fix a bug. Moving this piece of code down was suggested some posts before for performance reasons.
Yes, the vid seems to be unnecessary. Still, it's existing code that I would prefer not to touch.
Comment #53
donquixote commenteddidn't mean it to change this to "needs review".
(I was about to post the next patch, but then didn't.)
Comment #54
David_Rothstein commentedLast time I checked, D7 node titles were no longer fields:
#571654: Revert node titles as fields
Unless there was a rollback of that rollback which I missed :)
Comment #55
gpk commented@10, 11, 12, 13: the issue to do with the handling in 6.x of nodes with a non-cacheable format is actually #664124: {search_node_links} entries are not created if the target is uncachable (e.g. PHP input format). See also #513870-12: What links here tab / backlinks not finding nodes. (Apols, slightly off-topic.)
Comment #56
Sree commentedIf filter_format_allowcache($node->format) returns false, the "what links here" search (in views) doesn't work. in D6
Was there any patch released for this?
Comment #57
jhodgdonSomeone is working on this exact current issue on #1299530: php errors triggered during search indexing if content contains broken internal links now.
Comment #58
ykoech commentedI have an issue where index got to 96% and won't proceed even if I add more pages, this is the exact message "96% of the site has been indexed. There are 384 items left to index."
Note; I found a fix last year but a Drupal update messed up everything. I can't trace the fix right now, its been hours. I think it had to do with skipping deleted nodes from Drupal Index to avoid errors or something close. I can't quite remember.