I found that some nodes were not being indexed in search in my 4.7 installation. Looking forward, it looks like the same problem also applies to the HEAD.
How to reproduce:
1. Create some content (e.g. a page), initially unpublished. Say this is node/101
2. Create another, but publish it. Say this is node/102
3. Run cron.php
4. Go to the node admin page (admin/node in 4.7), and publish node/101 by selecting and bulk publishing ("Approve the selected posts" in 4.7).
5. Run cron.php again
6. Try searching for some content that is in node 101. Node 101 will not be found.
This certainly happens in 4.7, and following through the logic suggests it will also happen in 5 and HEAD (I'm still working on 4.7 myself). Any confirmation that this does in fact still happen would be useful. (A related problem had been noticed previously at http://drupal.org/node/98686, where this problem was attributed to a problem with the moderate status. But the problem still exists even if moderate is not used.)
Diagnosis:
Node search needs to know which nodes it has left to index. node_update_index() uses variables node_cron_last and node_cron_last_nid to do this. Any nodes with higher number than node_cron_last_nid are indexed, as are any nodes changed since node_cron_last. To decide whether something has "changed since node_cron_last", node.module checks "n.changed > node_cron_last".
But the batch publish option does not update n.changed. It simply does (HEAD version)
db_query('UPDATE {node} SET status = 1 WHERE nid IN(%s)', implode(',', $nodes));or in 4.7 version
UPDATE {node} SET status = 1, moderate = 0 WHERE nid = %d
This means that if node_update_index() has already passed over this node, it will not be indexed once published, since its n.changed is still the original one. This also applies to other actions that set status=1, such as "promote".
Possible fixes:
Probably the cleanest fix is to update n.changed when changing status, i.e. replacing the update operators with
db_query('UPDATE {node} SET status = 1, changed=%d WHERE nid IN(%s)', now(), implode(',', $nodes));Note that this will change the behaviour of batch status changes made by a node administrator. Currently a status change made by the user through the node form will update the n.changed time, but a batch change will not. I would argue that n.changed should really reflect *any* change, even a change of state, so this should be the correct behaviour anyway. However, a Drupal core guru may disagree.
Altering n.changed like this may will have an impact on anything that uses n.changed. In node.module, this appears to be only the node admin page ordering, so batch published nodes will then be listed at the top, rather than staying in the same place. However, anything that *really* relies on the current behaviour of n.changed is probably doing something dangerous anyway :-)
Alternative fix:
An alternative might be to change the logic in node_update_index() and node_search('status', ...) to specifically look for unindexed published nodes. However, node.module does not have easily have access to this information because search module does not provide a method to check for unindexed items: this is supposed to be the responsibility of the node module itself.
A workaround could perhaps be done by e.g. something like
SELECT n.nid FROM {node} n
LEFT JOIN {search_dataset} sd ON (sd.type = 'node' AND sd.sid=n.nid)
WHERE status = 1 AND sd.sid IS NULLOn my site, this SQL snippet caught about 1 in 20 nodes not indexed (about 50 in the last 900 nodes added since the site went live).
However, this workaround introduces an undesirable dependency with the implementation of the search module, so should probably only be used as a last resort. (Or maybe this could be introduced in an additional workaround module, designed to be dependent on both node and search modules, that is designed solely to ensure that all published nodes are indexed.)
I have not posted a patch for either of these fixes - I thought it would be better to get a decision on which fix (or any alternative) to apply first. Also, I have only tested on 4.7, but I believe the logic also applies to 5 and HEAD, so apologies if I have missed anything obvious in recent fixes.
Mark
| Comment | File | Size | Author |
|---|---|---|---|
| #49 | batch_node.patch | 7.32 KB | yched |
| #48 | batch_node.patch | 7.32 KB | yched |
| #45 | node_mass_update.patch | 7.43 KB | yched |
| #40 | node_update_0.patch | 7.34 KB | yched |
| #38 | node_mass_update_message.patch | 830 bytes | yched |
Comments
Comment #1
plumbley commentedSorry, did I say "
now()"? Of course I meant "time()". Too much Excel yesterday... Mark.Comment #2
chx commentedWhile there I saw the same function repeated six times, wow. Cleanup.
Comment #3
chx commentedtwo bonus y characters :)
Comment #4
moshe weitzman commentedwe have all agreed on the m/l that these mass operations are a bug because they do not fire nodeapi hooks. we should iterate over each selected node and call node_save().
Comment #5
yched commentedNode_save on (up to) 50 nodes ? Should those be run in a batch then ?
(Shouldn't be too hard, if we feel this is needed.)
Comment #6
moshe weitzman commentedi think batching is a good idea here.
Comment #7
yched commentedEr, I'm going to be extremely short on time to actually provide a patch for the batching stuff :-(
Anyone willing to tackle that (there should be no tricky subtleties AFAICT) should be aware of http://drupal.org/node/169079 ( "batch API callbacks cannot live in .inc files")
Comment #8
karens commentedI don't have this code quite right yet but here is an attempt to run the updates through batch processing and do a node_load() and node_save() on each. Taking chx's idea a bit further, and with no more code than it would take to update only the fields 'status', 'sticky', and 'promote', we can create a function that will take an array of key/value pairs and mass update any field(s) on a node to set values. This would not only solve the problem at hand but also open up a way that other modules could use this code to do mass updates (CCK would be one of them).
I put the batch processing all in node.module instead of node.admin.inc because of the problem getting batch processing to work in .inc files (as noted by yched). If that gets fixed it could be moved back to node.admin.inc.
The code needs work, but I post what I have so others can take a stab at it -- or throw cold water on the idea :)
Comment #9
yched commentedCool, thanks Karen :-)
Just a quick note : I think the only 2 things you need to store in $context['sandbox'] are 'nodes' and 'max' (the size of the original $node param).
'overrides' you get as a (constant) $overrides param for the operation callback,
'current_node' is not actually used,
'progress' simply becomes count($context['sandbox']['nodes'])
+ the array_shift call could probably replaced with an unset (array_shift is expensive) ?
(the batch engine itself relies on an array_shift, btw, I still have to provide a patch to fix that)
Comment #10
moshe weitzman commentedThats very cool, Karen. We might make this even more flexible I think and instead of $override instructions, let caller provide a list of actions that should fire (publish, sticky, etc.). it would be up to the caller to only name actions that make sense (i.e. actions that operate on nodes). er, substitute triggers where i said actions :)
Comment #11
karens commentedCalling actions and updating fields both would be useful. In this situation either would work because all the things we are trying to do are both actions and field values that need to be updated.
But I was thinking of something different, the ability to do things like pass in an array of nodes and change a specific field to a set value (or to unset a value) for all of them. I don't think you can set up actions for things like that. It looks like actions make more sense for things you would do over and over again rather than one-time changes.
This fix could go either way and solve the problem. So if we make this into something that does mass actions it's still useful (although I would rename the function in that case), but it would be nice to also add in a mass update function at some point.
If we don't add a mass field update process in core, I'll probably create one in CCK since I'm pretty sure we're going to need it anyway.
Comment #12
karens commentedAlso, what happens if we fire 50 actions? Do we need to batch that or does the Actions module handle that OK?
Comment #13
moshe weitzman commentedKaren - i would think your current batch implemntation would work equally well for actions. actions on its own does not batching
Comment #14
yched commentedI think it is important to keep the focus on 'batchifying' the stuff in admin/content/node's mass node update *form submission* (at best, possibly adding some abstraction level to include the similar mass user update form on admin/user/user ?).
Providing a generic API function involving batch processing is way more tricky. An API function can be called from virtually anywhere, and batch processing is not suitable in every context because of the page redirection / break execution it implies. Form submission handlers are one of the 'good' (simple and safe) contexts, because the batch API integrates into FAPI workflow.
This is just what made http://drupal.org/node/144337 (node_access_rebuild in a batch) not that easy.
Comment #15
karens commentedI started re-writing this to pass each change in as an action instead of updating the node value and realized that won't work. We might have up to 50 nodes to operate on in the mass operations and actions.inc has a hard-coded limit of 35 calls to actions_do. Plus the actions stack is a static value that is never decremented, so you couldn't do, say two pages of actions on the mass update even if one page had less than 35 nodes in it without running into that limit. Plus actions adds another level of code that isn't really needed here, so maybe just as well not to go that route.
So I'm going back to the original method of doing a node_load() and node_save() on each of our 50 nodes. I'll try to get something posted soon.
@yched, I am adding some wording in the function documentation to indicate this should only be called from a form submit handler. As to abstracting it out to work with users, too, I'll play around with that, too.
Comment #16
yched commentedI am adding some wording in the function documentation to indicate this should only be called from a form submit handler
Well, it's not exactly _that_ restrictive - install.php and update.php both use batches outside of form submit handlers. But batching outside of a form means a slightly different call sequence, and requires you have a 'good' knowledge of the current execution callstack, because _you_ have to decide when the interruption / redirection happens (batch_process() ). So it's hard to provide a 'one fits all' batch-enabled function.
Well, it's true that for regular module authors, most (all ?) use cases for batches will be in form submit handlers. I tried to make the docs in http://api.drupal.org/api/group/batch/6, http://api.drupal.org/api/function/batch_process/6, http://api.drupal.org/api/function/batch_set/6, reflect that, but any enhancement / clarification is welcome :-)
Comment #17
karens commentedOK here's a patch that actually works. It updates the array of nodes, running them through the batch process and showing in the progress bar the name of each node as it is updated. At the end it displays a message with a list of the nodes that were changed.
This runs each node through node_load() and node_save() so last changed time will get updated and hooks will fire.
I decided it was too difficult at this point to abstract it out to work for user updates, too, since there are so many differences between the way nodes and users work. That could be a project for D7.
Not sure if want the list of updated nodes at the end or not. If not, we probably should at least display a message to indicate which node was being processed when the update failed or maybe a list of nodes that did NOT get updated.
Comment #18
karens commentedThe patch again, with some debugging cruft removed and some cleanup on the documentation.
Comment #19
yched commentedTested : works, code looks fine.
I'm not sure about the $limit param on node_mass_update() : it's always called with a value of 5, and I don't think other values would have significant impact on the overall speed. There has been the same sort of debate in the 'batch node_acces_rebuild' issue, and hardcoding it to 5 is just OK, IMO.
Aside from that, this looks RTBC to me.
(sorry about my previous ramble, I thought you meant updating batch API PHPdoc)
Comment #20
yched commentedWe might also want to skip the batch overhead when the number of nodes is below, say, 10 ?
Going through the additional 'display progress page / wait in front of the progressbar for the processing request to perform' steps is going to be tedious+overkill if the user simply wants to unpublish one node :-)
Simple way :
just add 'progressive' => FALSE in the definition of the batch
(performs all the operations in one pass, without separate HTTP requests)
you still get the _processing_ overhead of the batch engine as opposed to 'regular' code execution, but that should be minor for these rather small (1 operation...) batches - we simply get more function calls.
Thorough way :
separate code branches, one using a batch, one directly updating the nodes in a foreach loop...
Comment #21
moshe weitzman commentedI also tested this patch thoroughly and it works as advertised. Nice work!. I added a progressive element to the patch as suggested by yched. I don't see that it makes any UI difference. I also removed an exclamation point in mass_update doxygen.
Since both yched and I have no tested this, we're RTBC.
Comment #22
yched commented'progressive' => FALSE doesn't work - Ah, true, sorry. It's because FAPI overrides that value based on whether the form is programmatic or not.
I'll try to come up with something shortly.
Comment #23
yched commentedOK, I should have thought about that use case earlier, but allowing batches to specify the 'progressive' flag themselves would be a new feature at that point, so I went the 'separate code branches' path.
Also hardcoded $limit to 5 as I suggested above, and reworked a bit the displayed messages to keep them inline with what we currently have (we display the list of processed nodes only if there was a problem during the HTTP iterations, so that the user knows what was updated, but we don't flood him with 50 node links when everything went OK).
Comment #24
karens commentedI mostly agree with these changes but a couple comments:
1) what is the number of nodes we can safely do in one pass? 5? 10 ? 20? We use 10 as the number to decide whether to batch this, then use 5 as the number when we run batches. And the node_access_rebuild() patch uses 20 as the number done in a batch. Do we have any benchmarks or ideas on what a 'safe' number is? If not, we should at least be consistent.
2) Don't we need a t() around the error message at the end?
I'll rework the patch once we settle on the answer to #1.
Comment #25
yched commented1) The numbers are estimated, of course. Given the hook_nodeapi calls, it's hard to predict how much time a node_save will take
(I'd say 'more than computing its access grants', for instance).
Mind you that the previous $limit param is not the number of nodes we treat in one HTTP iteration. It's the number of nodes we treat before we let the batch engine decide whether we still have time to treat another round, or if this has to wait for the next HTPP request. Even setting it to 1 would not be that bad, even if certainly dumb.
Setting it to half of the estimated 'reasonable amount ensuring no timeout' - we went for 10 nodes- seems like the right thing to me.
2) format_plural takes care of the t() itself.
Comment #26
karens commentedAh! I misunderstood how that part was working. One of the most confusing parts of the new batch API is trying to figure out why that iteration is in there. Why don't we get rid of the foreach() completely and just process one node at a time, letting the batch engine figure out when to intervene. That would certainly be easier to understand . It looks like the foreach() doesn't really add any benefit?? If so, why use it at all? Or maybe I'm missing something...
I am assuming that others will use this function as an example in setting up their own batch processes, so I am trying to make it as clear an example as possible.
BTW, when this is done I will go and see if I can add anything to the API docs and the handbook about batch processing -- from the perspective of someone who didn't already know how it was supposed to work :)
Comment #27
yched commentedYou're not the only one :-), the same miscunception arose during the node_access_rebuild patch. Trouble is, how to write a multipass batch operation is not part of the api itself, it's rather how to make proper use if it. So the documentation for this is not so much in the PHPdoc, but rather in batch_example.module in contrib/docs, for instance. Or, we add extensive comments in every instance of a multipass op we have in core ?
Multipass batch ops come straight out of the 'multipass updates' we already had, which were 'not simple' too, and not widely understood. But they were in .install files, and thus sort of below the radar.
It looks like the foreach() doesn't really add any benefit?? If so, why use it at all? Or maybe I'm missing something...
You mean the for() in _node_mass_update_batch_process() ? Well, removing it and going one node at a time works, but adds a little overhead. We could do that, but I still think the 'right' way is to have reasonable grouping, especially if we target this as example code.
Comment #28
karens commentedIn that case, I agree with your code. I think I am finally understanding how this should work :)
Now that I am more clear on the Batch API, I created a Batch API example at http://drupal.org/node/180528 and changed the Batch API docs in 'Updating your module' to point to it.
Comment #29
yched commentedThanks for this, Karen.
I still think pointing to http://cvs.drupal.org/viewvc.py/drupal/contributions/docs/developer/exam... in the update handbook page is valid, though. The example module provides a full-fledged, er, module, that you can play with and hack into, and provides more 'space' where detailed comments / hints can be fully expanded.
Additionnally, those tricky multipass ops are 'only' advanced use - batch api also supports 'basic' operations, which are less frightening :-)
But this discussion should probably go in a separate thread :-). I think the current patch here is about ready to go.
Comment #30
karens commentedThe patch looks good. I installed it and put it through its paces and I think it's RTBC.
On your example module, I found it very confusing (no offense, I hope! It's just a complex subject). Even though my example is a 'complex' example, it clarifies all the things I found confusing as I tried to figure the other one out. We could point to both if you like.
Comment #31
yched commentedNo offense at all :-) What I meant was that the clarifications made in your batch API handbook page would be perfectly suited in the example module as well. I think the two really complement.
Comment #32
dries commentedOK, this looks good. Committed. Thanks.
Comment #33
dries commentedFurther testing shows that it always shows 'remaining 0 of 1'. Is that correct? It confused me ...
Comment #34
yched commentedThis is because the batch has only one (multipass) operation, instead of x times the same op. Less data to store and retrieve, faster processing. Drawback : the 'remaining x of y' message is less informative, and true, in some cases including this one, this can be confusing (did it update all the nodes I asked or only one ?).
Possible workaround : do not display the 'Remaining x of x' message - patch attached.
but the batch engine still displays an empty line instead - could be fixed in a separate patch if solution is accepted ?
Or we can drop the 'multipass' approach and use 50 separate ops...
Side note : this feature makes two core cases for http://drupal.org/node/169079 ("batch API callbacks cannot live in .inc files") :-)
Comment #35
karens commentedThe fact that the progress message notes how many 'operations' are done rather than how many actions within the batch confused me, too. I played with the idea of 50 separate ops doing one node each in the original patch, but that adds overhead and bloats the code, just so the message would make more sense.
A better fix would be changing the progress message so it can count actions instead of operations by letting you set your own action counter inside the batch operation. Who cares how many operations there are? The end user doesn't see that part of the code. I assume it's too late in the D6 cycle for changes like that, though.
My remaining reservation is that this is in core, which means it is going to serve as an example to developers trying to use the Batch API, and setting a blank progress message here will confuse them. That would be the one reason to go back and do this as 50 separate ops with a message that counts how many nodes have been processed.
Comment #36
catchre KarenS' last post - is there no way to make the ops count display as a % completed? Would save doing it one by one and also the numeric ops vs. nodes confusion. Bear in mind I have no idea how this actually works so apologies if I've misunderstood.
Comment #37
karens commentedYes you can display percent completed, and that could be a good solution.
Just to see what would change, I tried reworking this to do 50 ops instead of a single op run 50 times (patch attached). It actually uses less code and provides a simpler example if we want to think of this as example code. It also creates a meaningful message with the correct node count. On the other hand it is definitely noticeably slower.
Just so others can see try it, I've attached the patch.
Comment #38
yched commentedUsing the percentage (@percentage) would work, but it is already displayed on the right side of the progressbar, so I'd propose 'Updating count($nodes) nodes'. Patch attached, with a word of comment on why we do it that way.
On 1 mutlipass op vs (up to) 50 separate ops : Yep, it is slower because the serialized array that gets stored in the db is a lot bigger. About educational aspects : it's true that a non-multipass op is simpler to write and understand (btw, karen, you don't need to use $context['sandbox'] at all in your patch above), but the PHPdoc already provides an example (http://api.drupal.org/api/group/batch/6). I think for these node admin tasks we're dealing with here, we should favor speed.
More generally - after a few actual uses of batch processing in core and contrib, it seems (er, of course ?) that most use cases for batch processing are 'do the same thing, on x sets of params', and are thus done faster with a single multipass op than with x separate operations... By the time I wrote the batch 'multipass' were the just an exception (a few update functions in contrib - cck...). Maybe future enhancements to the batch api could make those easier to write ?
Comment #39
dries commentedThe more I test this patch, the more I hate the scrollbar and the information it outputs. Can't we use a subtle throttler instead? The information presented by the scrollbar is useless anyway -- it goes too fast, I can't read it. We should definitely not use the word 'nodes' in output.
I think we need to go back to the drawing board with this patch.
Comment #40
yched commentedSorry for the delay, I'm mainly off-drupal for the next few weeks.
Dries : The progressbar looks useless for test sites with few nodeapi modules (which is likely with current D6 installs). The need for batching here relies on the assumption that updating 50 nodes with lots of modules will possibly take some time. Then, the batch will get processed in several requests and the progressbar does make sense (30%, 60%, done)
If it goes straight to 100%, then setting up a batch was useless, but how can we tell in advance ? We currently batch when updating more than 10 nodes - that number can be debated and fine-tuned, obviously. I'm not sure I see a reason why a throbber would be better suited for this specific instance of batch processing (and I currently don't have the time to add a 'throbber' version of the batch process page :-).
Attached patch displays no additional message apart from the progress bar itself. The option in #38 ('Processing x items') is also valid IMO.
The patch also moves the batch operations from node.module to node.admin.inc where they belong (now that http://drupal.org/node/169079 got in)
(not critical anymore, btw)
Comment #41
jaysmall commentedAny prospect of a patch that will work in 5.x?
Comment #42
agentrickard@jaysmall
We'll open a new issue for a backport once this is committed.
Comment #43
catchI applied the patch, then did some mass updates in admin/content. I also felt the progress bar was a bit pointless given the speed - but my live site has a lot of contrib modules installed, and single node save takes some time, hard to test this until more node api modules are available, by which time it'll be too late for the patch.
Nodes appears as updated, with the correct "updated" time in tracker etc. as well, so no bugs I could find and the patch solves the original issue that was raised.
I guess with this it'd be possible to make the admin form 2-300 (or more) posts long to enable much bigger operations with some tweaking, that'd be a feature request for after this goes in but very handy for otherwise extremely repetitive tasks working on hundreds or thousands of nodes at a time.
I think this should be RTBC as it is, but will leave as review in the hope someone else will take a look at it in more depth than I'm able to.
Comment #44
karens commentedI think there is a typo in the patch. It is supposed to use the batch processing if handling more than 10 nodes and that got changed to 1 in the patch. That means that for less than 10 nodes the batch doesn't even come into play. That would maybe affect catch's observation about the speed and pointlessness of the progress bar.
I think it is a reasonable compromise to switch to batch processing any time you have more than 10 nodes to update. As yched said, it's impossible to know ahead of time for sure which batches will benefit from batch processing and which won't, so you have to just draw a line in the sand somewhere.
This patch is specifically written for the node admin page, which will never allow you to process more than 50 nodes at a time, but it will also be extremely helpful to other modules to use for mass updating, and those modules may have lots more than 50 nodes to update. CCK will be one of them for sure, and will at times need to process hundreds or thousands of nodes at a time. As you get to larger and larger numbers of nodes to handle at once, you get to the point where the progress bar makes more and more sense while a throbber that goes on and on will appear to just be broken.
I suggest we commit this as-is (with the fix to change it back to a 10 node cut-off for using batch processing). Tweaking the distinction of when to use a the progress bar or a throbber can be done in later versions, once it becomes more clear exactly how and when this function will get actually used on live sites.
Comment #45
yched commentedI think there is a typo in the patch. It is supposed to use the batch processing if handling more than 10 nodes and that got changed to 1 in the patch./em>
Oops. My huge bad... Corrected patch attached.
And I think Karen just perfectly summed up the situation :-)
Comment #46
yched commentedRerolled.
Just to sum up, this patch :
1- moves the batch 'mass node operations' in node.admin.inc where it belongs, instead of cluttering node.module (the initial feature was committed before the 'file' property was added to batches)
2- removes the confusing 'Remaining 0 of 1' message - when you asked for an operation on 25 nodes, it makes you wonder if your nodes got updated alright.
I esp. think #2 is an important thing to fix before we release.
(edit : fixed typo)
Comment #47
catchI think you forgot something ;)
Comment #48
yched commentedEr, did I ? Thx :-)
Comment #49
yched commentedrerolled + update title to match current patch
Summary in #46 above
Comment #50
catchI tested this with both < 10 and > 10 nodes. Messages are much, much better and it only does batch operations over 10 nodes now so RTBC.
Comment #51
gábor hojtsyThanks, committed this one.
Comment #52
yched commentedThanks Gabor.
Resetting original title for posterity
Comment #53
(not verified) commentedAutomatically closed -- issue fixed for two weeks with no activity.