Project:Views Bulk Operations (VBO)
Version:7.x-3.x-dev
Component:Core
Category:task
Priority:normal
Assigned:Unassigned
Status:active

Issue Summary

I love this module, but one thing I have noticed is the batch performance is rather bad. A big reason for this is because a new batch operation is performed on every single row selected. A better approach is to execute as many rows as possible before doing another batch.

Patch to follow.

Thanks,

Travis.

Comments

#1

Here is my approach... I basically wanted to retain reverse compatibility, but just provide a new checkfield called "Maximize Batch Performance". This will then fall into the new way of batching that will execute as many rows as it possibly can before it does another batch process. This improves batch performance significantly.

Let me know your thoughts.

Travis Tidwell.

AttachmentSize
1176794-vbo-max-performance.patch 11.9 KB

#2

Status:active» needs review

#3

Great stuff! Thanks for the patch, I'll review it soon.

#4

Added some bade code in the previous patch. Use this one instead.

AttachmentSize
1176794-vbo-max-performance-4.patch 11.1 KB

#5

Version:6.x-1.10» 6.x-1.x-dev
Status:needs review» needs work

Would you mind re-rolling the patch against the latest dev? A lot has changed since 1.10. Thanks!

#6

I can do that this weekend. Today is pretty crazy. :)

#7

Take your time and thanks again.

#8

I want this to be the default behavior on 7.x-3.x...
Not familiar with possible pitfalls, but I'll ping a colleague to take a look.

#9

Sweet!

#10

Thanks travist! #4 works for me with latest 6.x-dev (just changed one or two lines). The numbering beneath the progress bar does not reflect actual number of nodes processed, but instead counts the number of batch operations.

In my case (heavy operations on each row), I do see some improvement in performance, but I think that on smaller operations this might be a really nice thing. Still testing...

#11

Version:6.x-1.x-dev» 7.x-3.x-dev
Assigned to:travist» Anonymous
Status:needs work» patch (to be ported)

After digi24 shamed me into applying and reviewing the patch... Committed to D6 with a few minor changes, thanks a lot!

#12

Sweet! Thanks infojunkie,

I will test the latest DEV version and see if my use case still works, but from looking at the diff, it seems you cleaned it up so it looks good.

I did start looking at migrating this to D7, which is taking some time due to the change in entities for the batch process, but will try to have something ready after this weekend. Thanks for your attention to this.

Travis.

#13

You can leave the 7.x part to me, since I'm planning to revisit how we handle the batching.

#14

I've finally reviewed the way we handle Batch API, as well as the patch that was committed here.
Our initial approach (that is still used in 7.x-3.x) was wrong, and this issue made it even worse.

Here's how Batch API should be used: http://api.drupal.org/api/drupal/includes--form.inc/group/batch/6
So, instead of adding an operation per object id, it should add one operation with all object ids.
The Batch API handles the timeouts and memory limits, so the complexity added by this issue is unnecessary.
If done correctly, the code takes object ids one by one, loads them, triggers the action, and repeats the process.
The Batch API makes sure that the process caries on across requests. Hence, good performance.

Additionally, this issue added "fake aggregation" support to Batch API, where it attempts to aggregate the objects, but there's no guarantee that it will aggregate ALL of the selected objects, might just aggregate a few, which is inconsistent and in the end not what the action is wanting. Falling back to the direct method is the only sane option for actions that need aggregation.

So, 6.x-1.x needs to revert this commit, and backport the work that I plan to do on 7.x-3.x.

Vaguely related: #1207316: Provide a way to override the execution type per action.

#15

I'll wait for your reorg of execution modes.

#16

Version:7.x-3.x-dev» 6.x-1.x-dev

Committed this to 7.x-3.x:
http://drupalcode.org/project/views_bulk_operations.git/commitdiff/fac9c...
and a tweak..
http://drupalcode.org/project/views_bulk_operations.git/commitdiff/fc561...

1. Made it use Batch API properly. Loads entities in groups ($entity_load_capacity, defaults to 10) to improve performance.
2. If there's less than $entity_load_capacity rows selected, falls back to the direct execution method (just like node_mass_operations() in core).
3. Removed the execution type setting completely.
4. Made the "Use queue" setting per action.

I'm open for suggestions and improvements.

#17

Subscribe - still interested in this for 6.x.

#18

Subscribe

#19

These months of using 7.x have shown that the 7.x solution is also not ideal.

So, instead of adding an operation per object id, it should add one operation with all object ids.

This allowed us good performance, but it also impossed a lower limit on the amount of items that can be processed, since batch api creates only one queue item for all entity ids, they are all serialized in one go, at a certain point you just have too many...

This means that we need some sort of a middle ground (between the old 6.x approach and the current 7.x approach).
Perhaps divide the items into groups of $entity_load_capacity, and then add a batch api operation for each group.
This means a new queue item is created for each group, serialization happens per each group, we don't have the problem with serializing too much, and the user can increase the $entity_load_capacity in the VBO field settings for even better performance (it is a conservative 10 by default. I worked on setups where 1000 wasn't the slightest problem).

Alternatively, we hardcode the group size to 1000 or 5000, but I like the idea of having it configurable.
Also, the max number of items varies based on whether the operation needs rows (which serializes the views rows along with the entity ids. I hate that, but people need it...).

#20

Version:6.x-1.x-dev» 7.x-3.x-dev
Category:feature request» task
Status:patch (to be ported)» active

Repurposing this issue, nobody is doing the D6 port anyway, and the D7 code still needs improvements.

I tried to implement what I outlined in #19, but it turns out I was wrong, although Batch API creates a queue item for each operation (and set of ids), it also serializes everything to the batch table, so there's no way to avoid all ids being serialized.

The real solution is this:
1) Change the entity_load_capacity help text to tell users that it is best if it matches the number of items per page
2) Get all selected ids, divide them into groups of $entity_load_capacity size.
3) Create a queue ("views_bulk_operations:batch:$unique_id", or something like that), and add a queue item for each group.
4) Call Batch API with just the unique id, and let it fetch and process the queue.

So you set $entity_load_capacity to 20. In the view, you select 100 items (5 pages of 20 items, for example). VBO adds five queue items, each containing 20 serialized ids each. So no way to overload the db row. VBO then starts the batch, which loads the queue items one by one, and processes the ids.

Additional work would involve batching the step where the ids are fetched and filled into the queue, so that when you do "select all on all pages" on 75 thousand items (which I have on one test instance, just for this issue). VBO loads the pages one by one, instead of trying to do it at once, and crashing.

Also related: #1367646: Stop passing the selection to action forms, improve performance by deferring loading all items on all pages until execution. and #1367644: Batch the selecting of all ids on all pages, refactor execution types.

#21

yes, backport to D6 would be very cool!! :)

#22

@bojanz
I am not using D7, just some thoughts from the distance.

Regarding your select all issue:
wouldn't it be easier, to use a different logic for the select all stuff? Instead of batching the filling of the queue and then batching the processing, I would try to avoid filling the queue with ids at all. If we use a select all, we can safely insert a sort on the numerical id. In that case you could omit to fill the queue with all ids and instead process n-items on each batch, store the last processed id and just check whether larger or smaller ids are present (depending wether the select all should be allowed to grow dynamically).

Regarding the entity_load_capacity:
I personally prefer the memory usage and execution time logic in D6. In certain environments, where php runs as a daemon, you cannot safily assume that memory will be sufficient to process n rows.

#23

@digi24
You're forgetting that "select all" doesn't mean "all entities on the system", it means all items in the view.
And the view can have any combination of filters which means we can't guess which ids it contains.

Technically, if you're batching filling the queue, then batching the processing, there's an option of doing both in one batch, but then you have a greater chance of hitting memory or execution limits, which is something we definitely want to avoid.

The logic in D7 is just a more powerful&tweakable variant of the D6 logic, allowing you better performance if you have the memory (I had backend installs with 512MB memory limits, so..).
If you set "entity_load_capacity" to 1, you get D6 behavior (and that's what the help text in Views tells you "Set to 1 if you're having problems").

#24

@bojanz

I was not overlooking that select all only selects items within a view, I just did not express it well. My idea was to take the views object, overwrite the sort part and execute it to return the next n items, based on the last processed id. After processing an item, the id is stored, so Drupal has the place to continue if something crashes.

In the current D6 version we have _views_bulk_operations_execute_next(), I think it is different than simply setting entity_load_capacity to 1, because it also processes multiple entities per batch, just with the difference that the number is determined on the fly, based on memory usage and execution time. I find this extremely useful, because predicting memory usage can vary if you have eg. node types of different complexity mixed up in a view.

Anyway, I did not mean to critisize your approach, the reason for my suggestion was to reduce the load on the DB, for very large views and select all operations. I tried to write something like outlined above myself for D6 some time ago, I just was not too familiar with views, so that I ended up using a non-views solution.

#25

I was not overlooking that select all only selects items within a view, I just did not express it well. My idea was to take the views object, overwrite the sort part and execute it to return the next n items, based on the last processed id. After processing an item, the id is stored, so Drupal has the place to continue if something crashes.

That is exactly what I plan to do. Batch starts, each invocation gets next $entity_load_reference items, and creates a queue item.

In the current D6 version we have _views_bulk_operations_execute_next(), I think it is different than simply setting entity_load_capacity to 1, because it also processes multiple entities per batch, just with the difference that the number is determined on the fly, based on memory usage and execution time. I find this extremely useful, because predicting memory usage can vary if you have eg. node types of different complexity mixed up in a view.

Yes, I agree that is useful, you're right. However, there's a tradeoff to be made here, because with that approach you need to load the entities one by one, you can't use entity_load() to load all the entities in the group at once (which improves performance). It's either "know the number of entities you can handle in one group by monitoring resources during processing" or "load all entities at once in a fixed size group".
Not sure which one is more important. A hunch tells me it might be knowing the exact number of entities, but I'll think on it.
In general, I'm not sure that the fixed group size is that big of a problem, since it's a very conservative setting. It is 10 by default, while it works with up to 100 on a 32MB memory limit (probably less with a decent amount of modules enabled).
Interested to know your opinion on this. You mentioned the php-as-a-daemon use case, it's probably still a question of how liberal the memory_limit is..

#26

@bojanz

You are right, loading multiple entities the Drupal 7 way totally makes sense.

Regarding the php as deamon stuff I was referring to, I had the following on my mind:
When running php as a deamon, the maximum memory is typically set to a high value. So in the optimal case, batching can take advantage of this. But depending on the quality of your scripts, small memory leaks can occur, which makes it difficult to predict how many items can be processed at once.

With the logic in the 6.x version, I am currently able to use one view with a multitude of different actions. Some simple actions are currently processing about 50 items per batch, while problematic tasks (eg. video conversion) only process one item per batch. The nice thing is, that I do not have to care about it, test adjustments or make special configurations/separate views.

The result of this is, that loading of multiple entities and the maxmise batch performance in D6 are two separate things that do not conflict. One is about the number of steps per batch and the other about the size of a step, so sorry about my remarks, nothing to worry about.

#27

sorry, but i can't seem to follow where this is at. is this correct:

D6 - no plan to fix
D7 - work still ongoing

#28

D6 - solid, a bit complex codewise (but with good gains because of it, as described by digi24. Flakey with aggregation, as I said in #14).
D7 - solid, a bit simpler codewise (since the D6 approach is less relevant because we have multiple entity loading, though I'm still looking into stealing bits from it).

The current discussion is about scaling the D7 code above what Batch API can currently give us (it craps out at about 70 thousand items. Probably significantly less on D6, since it serializes even more. That's the part we can't fix on D6, without breaking backwards compatibility.).
After that is implemented in D7, a backport to D6 could be possible. We can't talk about backporting code that doesn't exist yet ;)

nobody click here