Early Bird Registration for DrupalCon Portland 2024 is open! Register by 23:59 PST on 31 March 2024, to get $100 off your ticket.
If using a php filter, and your php code includes a redirect, the search indexer will never get past this node. This is because it doesn't update the database until after it indexes the node, but it is unable to index the node because when the node is rendered the cron process gets redirected.
To duplicate:
- Install Drupal 6.
- Enable the Search module and the PHP filter module.
- Create a normal node at node/1.
- Create a PHP node at node/2, and including a php snippet that calls drupal_goto('node/1');
- Each time you try to index the website by visiting cron.php, you will redirect to node/1.
- Note that on the search admin page, the site is only 50% indexed, and never gets to node/2.
Problem originally discovered when reviewing the D6 search back port.
Comment | File | Size | Author |
---|---|---|---|
#43 | 286263-disabled_redirects_idea.patch | 1.7 KB | sime |
#42 | 286263-disabled_redirects_idea.patch | 791 bytes | sime |
#26 | 286263.patch | 7.47 KB | robeano |
#17 | 286263.patch | 7.47 KB | douggreen |
#16 | 286263.patch | 7.47 KB | douggreen |
Comments
Comment #1
douggreen CreditAttribution: douggreen commented(I finally got around to this on the plane to DrupalCon, and since webchick was just an aisle over, I discussed this with her.)
I think that the only solution is to set a global variable when in cron, and then check this global variable before doing a drupal_goto.
It's conceivable that other modules will try to render the node from within cron, and the real problem is cron shouldn't be redirected. I think that this is a reasonable general solution.
I have a similar patch for 7.x, but just in case there's some tweaking to this one, I'll wait to post it until we get this one done.
Comment #2
douggreen CreditAttribution: douggreen commentedActually, here's a slightly different version, that avoids on logical comparison (the !).
Comment #3
Jeremy CreditAttribution: Jeremy commentedThere are several problems with this solution: on d.o., nodes are calling header() directly to do redirects, not calling drupal_goto() -- I simply used the latter as an example way to easily duplicate the problem. Also, even if the calls were to drupal_goto(), we'll end up indexing the contained PHP code, as with your patch we return the node body instead of executing the contained PHP code -- it is wrong to index the php, as then searches for "drupal_goto" will return incorrect results.
The solution I used in the Xapian module was to make indexing stateful, and to detect the nodes that we fail to index, skipping them. This had a few advantages: it doesn't require a patch to core, it works with any method for redirecting pages, and it doesn't index content that shouldn't be indexed. Unfortunately it also feels a bit like a kludge, and it can be slow to index a website with a lot of redirects in it as it tries each failed node twice (to be sure it's not a memory issue or some other intermittent issue).
Comment #4
Damien Tournoud CreditAttribution: Damien Tournoud commentedI suggest we create a special menu callback, something like
node/<nid>/raw
, and make a drupal_http_request() to that during the indexing.Comment #5
Jeremy CreditAttribution: Jeremy commented@Damien: That doesn't solve the issue at hand. How would you handle redirects? You're either going to index raw PHP code, or you're going to index the same node multiple times (the original, and the one that's pointing to it via a redirect). Neither is desirable imo.
Comment #6
Damien Tournoud CreditAttribution: Damien Tournoud commentedI've not been that clear.
What I suggest is to make a special path (
node/nid/raw
) where the node is rendered and returned as-is (without being embedded in atheme('page')
). The search indexers would calldrupal_http_request()
on that path and thus index the *result* of the rendering (never the PHP code).Easy: you don't index pages that produce a redirect (ie. you don't follow redirects).
This architecture would allow us to render the node and index in separate threads. This way, we avoid pitfalls of the current strategy (for example: PHP fatal errors and other
exit()
currently block the indexing altogether, etc.).Comment #7
arhak CreditAttribution: arhak commentedsubscribing
Comment #8
douggreen CreditAttribution: douggreen commented@dww, If the code was really doing a drupal_goto, I don't see how we are indexing raw php code. Wouldn't the php filter simply return the empty string, and thus we would index nothing?
However, I get the problem about states and failures. So here's another patch that includes both the Cron global variable, and some basic state information. I've changed the {search_dataset}.reindex column to an unsigned int (don't forget to run update.php), and then set it to the negative value of the time, during indexing. Nodes that get stuck will always have this negative value. Some contrib module might choose to re-run these X-times, but I didn't put that in core. I did add a bit more information to the admin/settings/search page though, so that it will tell you how many of these nodes are stuck in pending.
There's also one more minor change to the current behavior, which is, if you are 100% indexed, it no longer says "There are 0 items left to index." I removed this because I thought it looked funny to say "99% of the site has been indexed. There are 0 items left to index. There is 1 item pending or failed indexing."
Comment #9
douggreen CreditAttribution: douggreen commentedOpps, forgot to change the Status, which explains why no one has reviewed this yet.
Comment #11
douggreen CreditAttribution: douggreen commentedComment #12
douggreen CreditAttribution: douggreen commentedComment #13
DamienMcKennaMight this help with the search status message too? #305659: Patch for node_search($op=status) to give correct status message totals
Comment #14
douggreen CreditAttribution: douggreen commented@DamienMcKenna, as you've noticed in [#305569], that problem has nothing to do with core search, and IMO is not related to this at all.
Comment #15
rares CreditAttribution: rares commentedsubscribing to the issue, I think it's fine that the search module evaluates php when doing indexing, but there should definitely be no redirections when doing so, or cron will never complete. one could patch search, but another way of fixing this is to add a line in drupal_goto to exit the function if cron_semaphore is on.
Comment #16
douggreen CreditAttribution: douggreen commented@rares, good idea! I've updated the patch to not use the new drupal_get_cron function, but just check cron_semaphore.
Comment #17
douggreen CreditAttribution: douggreen commentedI ran into this problem again today, our home site hasn't indexed anything in a couple weeks. This patch fixed the problem (... that and changing a header() call in node php to drupal_goto).
Is there another solution out there? I don't consider dww's header() to be a show-stopper. In my opinion, header() in the node php is a user error, and should be changed to drupal_goto().
I've attached a new patch just to increment the update handler.
Comment #18
Bevan CreditAttribution: Bevan commented#111744: Add configuration to exclude node types from search indexing was marked duplicate
Comment #19
salvisComment Mover for D5 produces nodes with drupal_goto() calls, which cause indexing failures when upgrading to D6 (see #357508: Incompatibility with D6 core Search module?).
I don't know how the Search indexing works exactly, but doesn't it index titles, too. So, if drupal_goto() calls are simply ignored during indexing, Search results would include nodes with drupal_goto() calls, which the user cannot see (because they redirect).
That and the annoyance of having items that will forever be "pending or failed" make this a bad solution, imo. Shouldn't redirecting nodes be skipped during indexing, so that they can never show up in Search results?
Comment #20
DamienMcKennaI'd go the other way. On a site I run we store all off-site destinations using cck_redirection fields so that we *can* index those pages and be able to search for them based on keywords. A common example would be the 3rd party store service being used, we have a page that is indexed with lots of store-related keywords so if someone searches for those they'll get a link to the store, rather than swearing at the site because they can't find where to buy our merchandise.
Comment #21
Bevan CreditAttribution: Bevan commented#19 and #20; Damien and Salvis, I think you've identified two separate use cases that need solutions;
This d.o issue node and patch deals with the first case, but not the second. In the second case, by telling the search engine not to index the content items of a certain type, the issue this patch solves goes away and is not even a problem.
I think there are some threads comments and modules that might allow you to disable indexing of certain types. Try #111744: Add configuration to exclude node types from search indexing and Search config module for starters. Searching d.o may reveal more recent developments.
Comment #22
salvisHaving Search results that you can't actually view (because they drupal_goto() elsewhere) seems odd at first, but I understand your use case now.
Still, having items that will forever be "pending or failed" (and without having any way to find out which items they are) isn't such a great solution.
Maybe drupal_goto() with a status code of 301 could be interpreted as use case 2.
Comment Mover for D6 is not using drupal_goto() anymore, and as far as it is concerned, it seems more appropriate now to convert the D5 nodes to the new mechanism and avoid the issue here.
Comment #23
rootworkIs it possible that a bunch of these errors:
warning: Invalid argument supplied for foreach() in /home/modules/taxonomy/taxonomy.module on line 1189.
upon running cron, and a lack of additional search indexing beyond a certain point, could be a symptom of this problem? I've been trying to track down the source of these errors and failed indexing over at #295235: Invalid argument supplied to foreach() in taxonomy.module (line 1189) due to incomplete/partially-deleted nodes but this seems promising.
However, I tried applying the patch and running update, and the errors and failed indexing remain. Any idea if this could be related? The errors and the failed indexing could, of course, be two separate problems.
Is there any way to quickly find out if any nodes have content entered with the PHP filter on them from the database side? I inherited this site with lots of content, and I'm not sure if there are nodes with PHP redirecting, but there might well be.
Comment #24
salvis@quixoticlife: I don't think so.
Use SQL to check your nodes, something like
SELECT * FROM node_revisions WHERE body LIKE '%drupal_goto%';
Look at the context to verify whether it's PHP code or just text about drupal_goto.
Comment #25
rootworkIt turned out I did have a page with a drupal_goto function (a cache-clear page left over from D5) but that didn't get rid of the errors. I think they're remnants of old nodes that weren't fully deleted.
Thanks for your help!
Comment #26
robeano CreditAttribution: robeano commentedI've attached a new patch which increments the update handler.
Comment #27
Berdirfunction drupal_goto($path = '', $query = NULL, $fragment = NULL, $http_response_code = 302) {
+ // Ignore drupal_goto's within cron.php.
+ if (variable_get('cron_semaphore', 0)) {
+ return;
+ }
This will also disable drupal_goto() for every other request while cron is running, I think. Not a good idea...
Comment #28
douggreen CreditAttribution: douggreen commentedA drupal_goto during cron is a bad idea... I can't think of any reason why you'd want drupal_goto to work, why is this "not a good idea." Drupal goto will forward you from cron.php to some other page!?
Comment #29
Berdir#28
Yes, but the cron_semaphore value is saved in the database, if a user does access a site *during* a cron run, drupal_goto() will not work. Instead, it should probably be a static variable, or something like that, so that only the actual cron crun is affected from that check.
Comment #30
douggreen CreditAttribution: douggreen commentedupdating version to 7.x, we need to solve this before code freeze!
Comment #31
douggreen CreditAttribution: douggreen commentedComment #34
simeWe only need to avoid sending headers.
(Sorry, can't install D7 on my mamp right now, I would have tried a patch.)
Comment #35
simeOh, implicit in the above solution is an appropriate listener. Something like:
Comment #36
simeit's not a patch so it needs work still I think
Comment #37
simeIt also occurs to me that this could be fairly marked 'by design' or 'won't fix'.
If you are issuing a redirect in a node, you should be checking for the context in which your node is being displayed. For example, it would be your responsibility to ensure a redirect didn't occur in a preview or in a view.
Comment #38
Bevan CreditAttribution: Bevan commentedI believe that
if (SOME CHECK THAT SEARCH IS INDEXING)
should be something likeif (variable_get('cron_semaphore', FALSE))
Comment #39
Bevan CreditAttribution: Bevan commentedDuplicate of #38
Comment #40
BerdirSee #27/29 why variable_get() does not work.
Comment #41
sime@bevan also search indexing can run outside of a cron run. you can do it on the search admin page.
Edit: no I'm wrong about that.
Comment #42
simeOK. I may have broken some rules here with the use of drupal_static(), but it makes sense to me!
I have some other stupid ideas for this, so just testing the water with this very simple patch.
I actually think it *should* be possible to disable header sending, but all through core
header()
is being called directly, so such an idea is not a simple patch.Comment #43
simeforgot function context
Comment #44
simeBecause of #37, I believe this is not a critical bug.
Also, because of the hook_drupal_goto_alter(), it is possible to write a workaround module in contrib.
As a feature request it goes to 8.x I presume.
Comment #45
Dave Cohen CreditAttribution: Dave Cohen commentedI just noticed this for the first time on a D6 site. I've followed this thread and while solutions sound reasonable for D8, that's long way off and I'm wondering what I can do right now.
My first idea is to put hook_cron() into my own module, weight it to come before search_cron(), and in that hook I'll print something, so that it will too late for search_cron() to set headers. Haven't tried that yet, but I'll let you know if it works...
Comment #46
simeHey Dave, not sure if it's useful, but when I've had this problem it's been with code in the node body. I've wrapped the problem code in an if statement to only run if we're on node/% path.
Comment #47
cledman CreditAttribution: cledman commentedhi... could you help me?
it looks like your function is missing some parts like:
drupal_cron_run();
}
else {
Index: includes/common.inc
(where ends this else?) or..
function drupal_goto($path = '', array $
drupal_alter('drupal_goto', $path, $options, $http_response_code);
(where this function starts?)
Comment #48
cledman CreditAttribution: cledman commentedhi.. i solve my problem, but could you complete my question latter?
Comment #49
pexxi CreditAttribution: pexxi commentedMy simple workaround...
At top of drupal_cron_run():
At top of drupal_goto():
This allows detection of 'cron-mode' anywhere in PHP-code with minimum of added code and doesn't limit users running out-of cron-mode.
Comment #50
jaypabs CreditAttribution: jaypabs commentedIs the patch applicable to v6.15? I am experiencing the same problem also. When I disabled search module cron execute successfully.
I cannot determine what is the correct answer from the comments above.
Comment #51
As If CreditAttribution: As If commentedI just used sime's method on a 6.13 site. It works fine. It doesn't attempt to solve the bigger problem of the indexer executing drupal_goto(); instead it simply makes sure the indexer doesn't get a chance to.
All you have to do is edit every page or block where you use drupal_goto(), and make sure it only gets called if arg(0) == 'node'.
Comment #52
Dave ReidThere are an infinite number of things I could put in PHP. I could use header(). I could use lots of other things that could break search.module when indexing. I also don't see how it would be possible to stop drupal_goto() by using hook_drupal_goto_alter(). There are no checks after the alter hook is run and it calls header() no matter what.
Comment #53
As If CreditAttribution: As If commentedI don't know if you're responding to me or to the earlier thread, but you don't need to do anything in a hook function. When I said "Sime's method" I simply meant: Wrap the goto (or the location header code, or anything you do that breaks the indexer) inside an if(arg(0) == 'node'). That's all you need to do.
Comment #54
Dave Cohen CreditAttribution: Dave Cohen commentedHere's a hack I'm planning to use. Seems to work so far. Drawback is you get PHP warnings about headers already sent when running cron.php.
I'm not suggesting this as a permanent fix for drupal, just a workaround. The diff below shows a few lines that I added to the beginning of cron.php.
Comment #55
Dave Cohen CreditAttribution: Dave Cohen commentedHas anyone noticed you can have the same problem in both search indexing and on the search results page?
That is, if your node does something wacky like a drupal_goto() or add javascript to the page or anything like that, code will run when rendering search results, too. I only encountered this after applying my patch above, because now the contents of these nodes is actually being indexed.
So, I like sime's idea more and more, as a workaround. I'd like to see core somehow prevent this. Maybe not indexing nodes with a php input format by default? I'm not sure the best solution; it's tricky.
Comment #56
As If CreditAttribution: As If commented> Maybe not indexing nodes with a php input format by default?
Yeah, I thought of that too, but many people (including myself on occasion) are using PHP to generate or manipulate some actual content. So that content wouldn't get indexed either. This *has* to be the reason we are even bothering to index PHP pages at all.
My next thought was "index only the *output* of PHP statements" - i.e., only the stuff that gets echo'd or print'd to the screen. We would execute the PHP during indexing (which we seem to already be doing), and I imagine then we would capture the output to a temporary table and index *that*. But obviously we are talking about major core-hacking at this point.
Comment #57
Dave Cohen CreditAttribution: Dave Cohen commentedI think it's reasonable to say, if the content is dynamic (i.e. generated from PHP input filter) then don't include it in search. As with all great ideas, not everyone will agree. ;)
One way to implement that would be to enhance the input filter API. Now, input filters can declare themselves as dynamic. We can make use of that, or add a way for them to declare "exclude me from search." Then in filter.module, add a check for whether its a search request and if so, return empty string. To implement this would require changes to format.module and search.module (so that format module would have some way of knowing whether the current request is to build search index or search results.
An almost-as-good approach is to enforce this only for content search. The node.module already keeps track of whether the request is search related (in $node->build_mode), so the change there could be quite small. Something like the following diff, which looks more complicated than it is; it's really just an if clause added to node_build_content():
I doubt consensus could ever be reached to make a change like this, but I plan to use something like this on my sites.
hack core, fix core - it's in the eye of the beholder.
Comment #58
manderson725 CreditAttribution: manderson725 commentedApparently I posted on the wrong issue. My issue is that i cannot run cron from the browser when Search module is enabled. I get the below error.
Fatal error: Class 'view' not found in /var/www/drupal6/includes/common.inc(1685) : eval()'d code on line 3
Disabling Search allows cron to run, but i'm not sure what the error indicates, a view with bad php code? The comments above seem like a great discussion but I don't understand enough of what's going on to fix it without good directions. I would appreciate any help/suggestions where to look/how to get around this.
Comment #59
jhodgdonYour issue sounds similar: If you have a node that is doing some kind of a PHP evaluation, and the PHP evaluation fails when the node is indexed for searching, cron will stop. If you disable the search module, the node will not be indexed, the PHP evaluation will not fail, and cron will run.
Comment #60
jhodgdonI don't know why this was changed from a bug report to a feature request. I'm putting it back to being a bug report.
Comment #61
Dave ReidBecause it is a feature request? Currently we don't special case anything if the node has PHP content, so if you have improper PHP, that's a bug in your node, not the search module. Since we're having to add functionality to avoid bugs in node content, not actual Drupal code, it sounds like a feature request to me.
Comment #62
jhodgdonOK.
Comment #63
Dave ReidYeah, let's really tackle this in D8 properly. At least we have contrib solutions we can explore with the new APIs in D7.
Comment #64
Dave Cohen CreditAttribution: Dave Cohen commentedOne could argue this is a bug. Since PHP content is dynamic, the node's content may change based on who's viewing the node and when. So what the search module indexes may be correct for anonymous users but not registered ones. In such a case, the search will return improper results to those registered users. Drupal knows the content is dynamic, and so chooses to leave it out of the filter cache. Perhaps Drupal should leave dynamic content out of the search index as well. (Essentially what I try to do in comment #57.)
Comment #65
DamienMcKennaCould it be handled by a the following rules:
??
Comment #66
jhodgdonRE #64: that is an interesting idea, but I still think I agree with Dave Reid that it's a feature request that search should do better with dynamic content.
RE #65: Yes, it's probably too late to do something smart with drupal_goto() for Drupal 7. I'm not sure what "smartness" it would need, though?
Comment #67
DamienMcKennajhodgdon: I wonder if maybe a tiny hook to allow for overriding whether goto's are acted on could be sneaked under the radar? =)
Comment #68
jhodgdonWhat would drupal_goto() do if it was told not to act -- what rendering would happen for this node, and what would be indexed or returned in searching? I think this needs more thought than a sneaking in under the radar...
Comment #69
Junro CreditAttribution: Junro commentedsubscribe
Comment #70
Todd Young CreditAttribution: Todd Young commentedsubscribe
Comment #71
IceCreamYou CreditAttribution: IceCreamYou commentedHow about something like this?
Comment #72
izmeez CreditAttribution: izmeez commentedsubscribing
Comment #73
kmillecam CreditAttribution: kmillecam commentedHmm,
This arg(0) fix isn't working for me. Indexing continues to stop when it hits the node that has this in its body:
Comment #74
As If CreditAttribution: As If commentedI'm wondering if your first line is messing things up somehow, because it is getting executed. Why are you saying if(arg(0) == 'node') anyway? Why can't you just say if($page)?
Comment #75
kmillecam CreditAttribution: kmillecam commented@As If -- I was just wrapping my PHP function in the if statement described (a couple of times) earlier in this thread.
@ all, I did end up using a solution described here (hack to filter.module) that fixed the problem: http://drupal.org/node/361171
Comment #76
As If CreditAttribution: As If commentedOh, I thought you were showing us the raw body (prior to the fix). Hm. I cannot see why that didn't work.
So in your new alternative, you are simply looking for format == 3 and bypassing those?
Comment #77
Bevan CreditAttribution: Bevan commentedA possible solution for people with this issue is http://drupal.org/project/search_config
Comment #78
arhak CreditAttribution: arhak commented#68 as an example, wikipedia got its own way to handle no-redirects
Comment #79
d_arc CreditAttribution: d_arc commentedsubscribe
Comment #80
ikeigenwijs CreditAttribution: ikeigenwijs commentedif(arg(0) == 'node'){
search breaking code
}
Is the only one that did the trick for me.
thx a lot
Comment #81
ikeigenwijs CreditAttribution: ikeigenwijs commentedI was wrong, did not work either, the search continues.
Comment #82
Gabriel R. CreditAttribution: Gabriel R. commentedA bug from 2008 is a feature request in 2011.
Oh Drupal, this is so you. I love you, you schyzo.
Here's an easy fix for those running cron in CLI server side. This will check that:
- the user can administer nodes, so you could see the page and its EDIT button.
- the page is open in a browser, i.e. SERVER_ADDR is set.
Comment #83
jonathan1055 CreditAttribution: jonathan1055 commentedSubscribe.
... and apologies to everyone who is going to get an e-mail telling them this. If there was a better way to subscribe to a thread then I would do it.
Comment #84
DWB Internet CreditAttribution: DWB Internet commented#43: 286263-disabled_redirects_idea.patch queued for re-testing.
Comment #86
DamienMcKennaThe Field Redirection module may help on D7 sites by allowing the redirection functionality to be handled on specific view modes, rather than for all uses of a given field.
Comment #87
DamienMcKennaPatch #43 needs to be rerolled, and, honestly, rewritten.
Comment #88
ianthomas_ukThe php filter module has been removed from Drupal 8 for exactly this sort of reason - it's too powerful and can't be locked down.
I'll set this back to D7 in case anyone wants to work on a 'fix', but my recommendation to anyone encountering this problem is to disable the php filter module and find alternative ways to do what you're trying to do (e.g. write a custom module).
Comment #89
ianthomas_ukActually this is a duplicate of #752184: Keep running cron when bad PHP content is encountered
Comment #90
bburgCommenting here since this issue seems to be highest search result on this particular bug. I frequently come across older sites with this problem when I trying to reindex the site's content. Since this is a manual reindexing, and not through cron, the redirect still happens. Changing checks to look for the "batch" path item seems to avoid redirecting during patch API processing.