Batch API performance/memory Question/Suggestion [#869776]

I was trying to update 10K+ rows and kept getting a memory exhaustion error. I am using the Batch API to process the rows. I found out why and have a question if there is a reason to store it this way.

File: views_bulk_operations.module
Line: 580.

You loop through the selected objects and add an operation for each row object. Well it is including all the fields you displayed in your view. So I had 8 fields in my view (name, address info...etc.). So the code is adding an object will all those field=>value pairs into the batch process. 10k+ objects in this fashion killed my memory. I even jacked it to 512. From tracing the code, as it processes through these rows, it does a node_load for each one. So my question is why save all this extra information if its not used?

This is what I did to fix my memory issue and it seems to work. (There is a better way but for hacking/testing purposes here it is)

Line: 580 I did this.

    foreach ($objects as $oid => $row) {
      $node = new stdClass;
      $node->nid = $row->nid; //could be user here or taxonomy term I am sure
      $operations[] = array('_views_bulk_operations_batch_process', array($oid, $node));
    }

I'm sure we can cut out the row earlier in the process just to save all that memory we are consuming passing the data around but wanted to find out the reasoning first.

Let me know if all we need is the $oid and I'll work up a patch. If not give me some clarification and maybe I can go at it at another angle.

Thanks,
-d

Comment	File	Size	Author
#11	batch_memory_improvements_1.patch	8.74 KB	iLLin

Comments

Comment #1

iLLin commented 3 August 2010 at 16:29

Also with adding this 10k+ set of objects to the batch table as a serialized object... Why not use temporary tables? Would be so much faster to process the results as it sits now, every time is runs it has to re-que up that big a$$ object. Using temporary tables we could LIMIT our results and it would process so much faster.

Anyone have any thoughts?

Comment #2

infojunkie

Vancouver

commented 3 August 2010 at 17:54

iLLin, thanks for your analysis.

The reason why the row is included in the batch invocation is that the row might contain extra information that the node does not contain - e.g. the result of JOINs and other SQL constructs - and that the operation might expect. I pass both the row and the object id (oid) to the batch process function, and this function loads the object from its oid.

In your experience, removing the row from the invocation decreased the amount of memory needed. Is that correct? In this case, is it possible to retrieve the row when we're inside the process function, instead of keeping it with us?

Concerning the temp tables, can you please outline how this would work and what it would achieve?

Comment #3

iLLin commented 4 August 2010 at 21:22

Yea I kinda guessed that was the case on the view, the views I am working with have relational data so I see that.

As far as temporary tables... well it wouldn't be a "true" temporary table as they only persist through the life of the script. So after the first batch the table would destroy itself. But what if we created a table on the fly before the batch and then dropped it during the batch finish function. For example:

create table batch_holder ( oid INT(11), batchinfo TEXT );
//process each row into a large mysql insert.
db_query("INSERT INTO {batch_holder} (oid, batchinfo) VALUES (%d, '%s'), (%d, '%s')", $oid, $batchinfo, $oid2, $batchinfo);
//for however many we have.
//Questions: 
// Is there a max values we can put into a single insert statement?
// It might be faster to not use %d OR '%s' and quote it ourselves?
// It might be faster to use native mysql commands (if large data sets poses a problem with the inserts)

//Then when running the batch we setup our max value by the total count of our batch_holder
//Then process our limit of rows using LIMIT.
//Here is an example batch I wrote a while for user alias's but should show the general point.
function batch_update_setup() {
  $batch = array(
    'operations' => array(array('batch_update_process', array())),
    'finished' => 'batch_update_finish',
    'title' => t('Processing Pets Batch'),
    'init_message' => t('Batch is starting.'),
    'progress_message' => t('Processed @current out of @total.'),
    'error_message' => t('Batch has encountered an error.'),
    'file' => drupal_get_path('module', 'batch_update') .'/inc/pets_alias.inc',
  );
  batch_set($batch);

  // If not called from a submit handler, add the following,
  // noting the url the user should be sent to once the batch
  // is finished.
  batch_process('admin/content/batch_update');
}

function batch_update_process(&$context) {
  $start_time = microtime();
  $node_type = 'pet';
  if (!isset($context['sandbox']['progress'])) {
    $context['sandbox']['progress'] = 0;
    $context['sandbox']['current_node'] = 0;
    $max = db_fetch_object(db_query("SELECT COUNT(*) AS total FROM {node} WHERE type='%s'", $node_type));
    $context['sandbox']['max'] = $max->total;
    watchdog('migration_status', 'Starting the pet alias updates');
  }
  //set our limit here
  $limit = 10000;

  //add our message
  $finish = $context['sandbox']['progress'] + $limit;
  if ($finish > $context['sandbox']['max']) $finish = $context['sandbox']['max'];
  $start = $context['sandbox']['progress'];

  $result = db_query_range("SELECT n.nid, n.title, u.name, u.uid FROM {node} n LEFT JOIN {users} u ON n.uid=u.uid WHERE n.nid > %d AND n.type='%s' ORDER BY nid", $context['sandbox']['current_node'], $node_type, 0, $limit);
  while ($row = db_fetch_object($result)) {
    $path = 'node/'. $row->nid;
    $alias = 'community/users/'. $row->uid .'/pets/'. $row->nid;
    db_query("DELETE FROM {url_alias} WHERE src = '%s'", $path);
    db_query("INSERT INTO {url_alias} (src, dst) VALUES ('%s', '%s')", $path, $alias);

    // Store some result for post-processing in the finished callback.
    $context['results'][] = $row->title;

    // Update our progress information.
    $context['sandbox']['progress']++;
    $context['sandbox']['current_node'] = $row->nid;
  }
  $execution_time = batch_update_parse_time($start_time);
  $total_iterations = ($context['sandbox']['max'] - $context['sandbox']['progress']) / $limit;
  $time_remaining = $execution_time * $total_iterations;
  $time_remaining = batch_update_time_left($time_remaining);
  
  //log our message
  $context['message'] = t('Processed records <b>%start</b> - <b>%finish</b> of a total of %total<br />It took %parse_time\'s total seconds to execute these records.  We calculate a total of %time remaining.', array('%start' => $start, '%finish' => $finish, '%total' => $context['sandbox']['max'], '%parse_time' => $execution_time, '%time' => $time_remaining));
  // Inform the batch engine that we are not finished,
  // and provide an estimation of the completion level we reached.
  if ($context['sandbox']['progress'] <= $context['sandbox']['max']) {
    $context['finished'] = $context['sandbox']['progress'] / $context['sandbox']['max'];
  }
  else {
    $context['finished'] = 0;
  }
}

function batch_update_finish($success, $results, $operations) {
  if ($success) {
    // Here we do something meaningful with the results.
    $message = count($results) .' pet aliases processed.';
    watchdog('migration_status', $message);
  }
  else {
    // An error occurred.
    // $operations contains the operations that remained unprocessed.
    $error_operation = reset($operations);
    $message = 'An error occurred while processing '. $error_operation[0] .' with arguments :'. print_r($error_operation[0], TRUE);
  }
  drupal_set_message($message);
}

I have some extra stuff in their for execution time and stuff but in the batch_update_finish function we would drop our table. I think from a memory standpoint this would be a lot faster, but It still may pose a problem with inserting all those rows into a table, BUT once in the table it would perform so much faster as it wouldn't be dealing with that large object each time. I haven't researched further on this but I am assuming that during the life of the batch (in its current state), as it processes each object, it removes it from the overall object, therefore the batch will continue to get faster and faster. If we used a temp table, it will be fast the whole way through. And we could play with indexes and stuff with the temp table to make sure its performing at its highest speed.

The reason you keep all the rows as once is the batch_set function inserts "one" row with "all" the data (serialized) with the operation attached to it. So the batch api queries for that row, extracts that big ole object and begins to process as much of it as it can. I haven't checked but I bet there is an internal timer thats being kept so it knows when to refresh the page.

Anyway that make more sense? Sorry for the length and let me know what your thoughts are with something like this or if its been tried and failed miserably :)

I am assuming quite a few things with the batch api and haven't actually verified my assumptions so shoot me down if I am way off bass and we can figure out something else :)

-d

Comment #4

infojunkie

Vancouver

commented 5 August 2010 at 03:35

I'd like to focus on the memory consumption rather than the performance.

We can create a permanent VBO table where we insert the batch data, indexed by some ID that we pass to the batch process function. We truncate the table at the end of the batch process. Let's not worry about optimizing the INSERTs for now, as we will hit the problem of max_packet_size.

Can you submit a patch to VBO-6.x-1.x-dev that does this?

Comment #5

iLLin commented 11 August 2010 at 18:22

Ok little more digginig. I wrote this patch to utilize a new database table to store the results. It would work if it ever got there. Well I found the problem for memory and I think I have the perfect solution.

views_bulk_operations.module
function _views_bulk_operations_adjust_selection()
Not sure of line as I have made some changes.

Anyway this function iterates through the view if select all is selected. Puts them in a $selection variable and then adds them to the SESSION. So now we have 10K+ objects in the session. That is not a good idea and if someone wanted to update 100K... well would never work.

Solution:
How about we save the view (name, params) to the session AND then use the $view object in our batch functions. Then we don't need an extra table, we will NEVER have to worry about memory issues as it will function just like your viewing the view on a page with a pager. Then we essentially page through our view and complete our actions.

Problem solved :)

Thoughts before I re-write this?

Comment #6

iLLin commented 11 August 2010 at 18:42

Assigned:

Unassigned

» iLLin

Comment #7

infojunkie

Vancouver

commented 12 August 2010 at 18:28

Re #5: If I understand you correctly, you want to re-execute the view every time you need the results. Correct? The immediate problem I see with this is that when using Batch API, we will need to recreate and re-execute the view for each one of those 10K records, right?

When you write the patch, can you please provide memory and timing comparisons between the current code and what you propose?

Also, maybe we could consider forcing the view to cache its results while we're retrieving them.

Comment #8

iLLin commented 14 August 2010 at 04:10

We would move the pointer essentially. The goal is to execute the records for each page. It would be like pressing "next" in the pager. So it's executing the query each time, with a limit and moving the offset. This way there is minimum memory consumption. In its initial state, the code is executing the view and storing the complete result set. This is what causes the memory consumption. For some reason though the view is not executing correctly from within the batch and that is where I left off. I could store the query and just run that myself, but views has handlers for certain fields and therefore would make that difficult. It would be better to let the view handle everything... now if it will just play nice.

Comment #9

jbomb commented 20 August 2010 at 13:09

Any progress on this? I would be happy to test help test.

[edit]

I was thinking along the same lines, but I was thinking of passing the view's build information to the batch api rather than the entire views object. I *think* all the necessary information is available in _views_bulk_operations_adjust_selection. Once the views has been executed, one can grab the SQL and count SQL statements in $view->build_info, and the object id -- if I am not mistaken -- is available in $view->base_field. Could that help get around some of the handler issues that were mentioned in the previous post?

I guess it doesn't solve all cases for operations.

Comment #10

iLLin commented 30 August 2010 at 12:44

I got pulled into another project (actually quite a few) but I am still working on this. I did run into a wall and haven't had time to figure my problem out. Basically I am saving the views_embed_view, view name, page, arguments... etc to the session. Then I was trying to build the view out myself like so:

function _views_bulk_operations_batch_process(&$context) {
  /*.....snip....*/
  $view = views_get_view($view_params['vid'], TRUE);
  $view->set_display($view_params['current_display']);
  $view->set_items_per_page(0);
  $view->set_exposed_input($view_params['exposed_input']);
  $view->set_arguments($view_params['arguments']);
  $view->pre_execute($view_params['arguments']);
  $view->execute();

But during the batch api, the view wasn't being executed correctly. The arguments weren't working, but it I called the view from a node page it worked... Well that's where I stopped. But my idea was to basically call the view each time and just move the pager.

I guess I could save the query and just re run that as a limit query?

But the _views_bulk_operations_adjust_selection is where I grab and save the view operations. The only thing I don't like about running the raw sql's is when the view is built, it adds more stuff... hmmm I wonder if I saved the whole $view object, before $view->execute is called and save that to a tmp table or session? Then adjust the pager and run execute each loop through the batch?

I'll give that a try later today.

Comment #11

iLLin commented 1 September 2010 at 16:50

Status	File	Size
new	batch_memory_improvements_1.patch	8.74 KB

Ok heres a draft version to keep this going. Still very rough.

Some notes:
What this does is push each $result from the view to a new table called vbo_batch (make sure you run update.php). Creates a unique id and stores each result with that unique id. Then while processing it uses the unique id to query the vbo_batch table with a limit and executes the rows that way.

Ok that's a basic summary. I still think we can execute the view in a more controlled way, as right now I can do 10k+ rows with zero memory issues BUT that view is still executed and all those results are saved to the view object. Before we also transferred the view object to a selection variable and then passed that on to the batch... as you see the memory continues to grow this way.

Right now when the Execute button is pressed, we enter all those rows into the database right then. Cancel button needs to remove those rows if pressed (proper cleanup).

On the confirmation page, we need a way to transfer the "FULL" count as right now it only shows 20 as thats all I add to the $selection variable so we can see the first 20 titles.

This is defaulted to happen the whole time, so this does this even if you don't use the batch api, so we need a way to only perform the operation in this fashion if using the batch api for select all.

Please review and lets continue to make some improvements, I will look into executing the view in a easier on the memory way. If we did like 50k+ rows, that $view object would have 50k results inside of it... we will def have memory issues again. I think if we can execute the view 5k rows at a time in a loop while adding to the vbo_batch table, that could potentially help.

Side Note:
I still think my original idea about saving the view object and then just paging through that instead would be better as we wouldn't have to worry about adding all these rows to a temp table, we would just page through the views object correctly. I may re look at that today if I have time as that makes more sense to me.

Ideas and thoughts are most welcome.

Comment #12

nally commented 2 September 2010 at 07:05

Comment #13

AntiNSA commented 4 September 2010 at 12:27

Comment #14

infojunkie

Vancouver

commented 10 September 2010 at 18:35

Re #11: Thanks for the patch. Your work seems to be going in the right direction - i.e. saving the results in the database, although a new factor has entered the equation: #907646: Concurrent Batch API operations.. According to this issue, new items can appear in the view result between the time items are selected and the time execution takes place, which leads to unwanted modifications. So we *really* need to save the initial view result + selection at each page.

What I've been thinking lately is to model the behaviour after the Views caching mechanism, instead of inserting the elements in a new table. The simplest scenario would be to save the views result both for the current page and for all pages as a blob in our own cache table (or even in the global cache).

But what seems to be clear now is that re-executing the view to retrieve the results every time we need them is not correct - we should no longer consider this an option.

Comment #15

iLLin commented 11 September 2010 at 12:56

You can't save the view result, because it contains all the records. While this is fine for smaller sets, this is just way to memory intensive for larger sets. This is where the memory problem stems from. Currently the view results are saved to the session and passed around. That wouldn't be any different than saving a serialized object to a db, its still going to be very large. While the patch still kinda does this as it takes the entire view result and then inserts it into the database, I think this is still wrong as the entire view is still executed. Right now it does work with 10K records I don't think it would work with 20k+... well maybe but 50K no way.

Concurrent actions is always a problem, that's why they're options for table locking (i.e. transactions)... etc. But I don't think we need to do that. I don't think we can account for that type of scenario. It would be no different now and I don't think its critical. If you run your action on all unpublished records, someone adds a new one right now... well it gets skipped. How is that any different than if they add it tomorrow? There will still be an unpublished node that needs to be published. I guess I fail to see the problem here? Maybe there is scenario where its critical, but if its that critical put your site in maintenance mode.

If you think saving the view and then paging through it is now wrong... well I'm at a stand still as that was the whole point so we didn't have to execute and entire view result. Do we have any options that doesn't included executing the entire view?

Comment #16

infojunkie

Vancouver

commented 12 September 2010 at 23:38

It turns out that the concurrent issue above is not critical after all. Phew! Back to our regular programming.

So to confirm my understanding of our issue here: the problem only occurs when "select all nodes across pages" is chosen, because otherwise the memory is not exhausted. Is this the case?

If so, then we can concentrate on this specific case. When finding the "select all" flag, we can postpone retrieving all results to execution time (instead of doing it in _views_bulk_operations_adjust_selection()). At execution time, in either execution mode (direct, batched, queued), we can retrieve the view results page by page until there are no more pages. This will be slower since we have to re-execute the view for every page. Maybe we can estimate the number of rows that we can retrieve per page, or let the user select this number depending on his situation.

What do you think?

Comment #17

iLLin commented 15 September 2010 at 00:33

Now we are thinking the same :)

Have had zero time lately but I agree with what you are saying. So I'm thinking we just need to save the view "pre-execute" ? Whats best approach with this?

Comment #18

mikeytown2 commented 12 October 2010 at 02:01

Grab the query from the view and do a direct table to table transfer. This way PHP never has to see the data in order to populate the vbo_batch table. I use the direct table to table transfer to move million+ entries in a short amount of time.

  // Run a query just like views does
  $replacements = module_invoke_all('views_query_substitutions', $view);
  $query = str_replace(array_keys($replacements), $replacements, $view->build_info['query']);
  $query_args = str_replace(array_keys($replacements), $replacements, $view->build_info['query_args']);
  $results = db_query($query, $query_args);

  // Normal insert
  db_query("INSERT INTO {boost_crawler} (url, hash) VALUES ('%s', '%s')", $url['url'], $url['hash']);
  // Use a query to insert into a table
  db_query_range("INSERT INTO {boost_crawler} (url, hash) SELECT url, hash_url FROM {boost_cache} WHERE push = 1 AND extension = '%s' AND expire = 0", $extension, $loaded, $count);

Putting the 2 together; theory here.

  // Run a query just like views does
  $replacements = module_invoke_all('views_query_substitutions', $view);
  $query = str_replace(array_keys($replacements), $replacements, $view->build_info['query']);
  $query_args = str_replace(array_keys($replacements), $replacements, $view->build_info['query_args']);
  $results = db_query("INSERT INTO {table} (field_a, field_b) " . $query, $query_args);

The other thing to think about is a multi insert code in Boost. Look at boost_crawler_add_alias_to_table in the latest dev.

What makes all this work is my own batch processor that is multi threaded... MT vbo? The sky is the limit in terms of where you want to take this. Moving to a table to store the VBO info is the right direction; allows for lots of tricks like multiple threads and direct table inserts. I set the max to 10k rows btw.

Comment #19

infojunkie

Vancouver

commented 15 October 2010 at 19:12

Issue tags:

+r1.11

Tagging for next release.

Comment #20

bojanz commented 4 July 2011 at 09:43

Status:

Active

» Closed (duplicate)

7.x-3.x no longer passes the full rows unless the action specifically demands it.
See #1180538: Pass the selected views row(s) to the action. It could be backported to 6.x-1.x.

Closing this issue as a duplicate. It's not going anywhere anyway (although mikeytown2's solution looks interesting).

Batch API performance/memory Question/Suggestion

Comments

Comment #1

Comment #2

Comment #3

Comment #4

Comment #5

Comment #6

Comment #7

Comment #8

Comment #9

Comment #10

Comment #11

Comment #12

Comment #13

Comment #14

Comment #15

Comment #16

Comment #17

Comment #18

Comment #19

Comment #20

News items

Our community

Documentation

Drupal code base

Governance of community