To reproduce: create a batch with 1001 items to process, and do them in batches of 10, 100 or any multiple of 10. Specific case is Apache Solr module trying to index 1001 nodes; it just does 1000 and leaves one node untouched.

The problem part is depending on $progress to shift into $op = 'finished', since $progress is rounded inside _batch_api_percentage().

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

janusman’s picture

Status: Active » Needs review
FileSize
597 bytes

Patch for review.

Would this need tests?

janusman’s picture

I managed to reproduce the bug this way:
1) On D7, created 501 dummy nodes
2) Using apachesolr: set # of records to process each cron run to "100" and then indexed all the nodes using the Batch API.

Since item #500 out of 501 equals 0.998003992, $percent gets rounded up to 100 thus Batch API finalizes.

Damien Tournoud’s picture

Priority: Normal » Critical
Issue tags: +Needs tests

Ew. This is ridiculous :)

Of course we need tests for this. The fix should work for D7, for D8 we will need to make sure that we don't rely on floating point comparisons to determine if the batch is finished, _batch_process() should report this information explicitly.

Marking as critical because this can break stuff in a lot of fun ways, including the upgrade process.

David_Rothstein’s picture

Bookmarking this to come back to later, but off the top of my head I'm not seeing where the bug is. I tried running this test script:

define('DRUPAL_ROOT', getcwd());
require_once DRUPAL_ROOT . '/includes/bootstrap.inc';
require_once DRUPAL_ROOT . '/includes/batch.inc';
drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);
print _batch_api_percentage(501, 500);

and it gives me "99.8", which is not equal to 100... Where do you see the incorrect rounding happen?

Note that there are existing (extensive) tests for this function in modules/simpletest/tests/batch.test, written by Charlie Gordon. It's possible that this is the most extensively-tested function in all of Drupal, but if it needs more, it needs more :)

janusman’s picture

Status: Needs review » Active

I bet others will know more about the inner workings of Batch API, but I'm still diving in... =)

I'll tell the story of what I'm seeing:

I'm processing 100 nodes at a time inside my batch operation function, and the entire batch consists of 501 nodes. So, the batch operation would have to run 6 times. BUT, it's never entering its sixth run.

After the fifth run, this is the $context:

array (
  'sandbox' => 
  array (
    'progress' => 500,
    'max' => '501',
  ),
  'results' => 
  array (
    'count' => 400,
  ),
  'finished' => 0.998003992016,
  'message' => 'Indexed 500 of 501 nodes',
)

So you see it has already processed 500 nodes, $context['finished'] is reporting 0.998 (or 99.800399...%), so clearly this hasn't finished. =) I'm assuming a batch operation function can set $context['finished'] to a float: the API docs exemplify it at http://api.drupal.org/api/drupal/includes--form.inc/group/batch/7

So, after the above context, we only have to process 1 node out of 1 for the next call to our batch operation function.

But, after just running the above 4th call to the batch operation function, $current is calculated...

...and IMO this seems a bit weird:

$current    = $total - $remaining + $finished;

which turs out like this:

$current = 1 - 1 + 0.998003992016 ( which equals 0.998003992016)

I would've thought "current" would somehow have to represent a whole number, but I think $finished was added thinking it might always just be "0" or "1" (or TRUE/FALSE?).

So the next line:
$percentage = _batch_api_percentage($total, $current);
which ends up being called as
$percentage = _batch_api_percentage(1, 0.998003992016);
Ends up returning 100. So, our batch operation is finalized, and it never gets around to call the batch operation function one last time to get at that final 501st node.

janusman’s picture

Damien Tournoud’s picture

Looking at the code, I can definitely see the bug:

In _batch_progress_page_nojs():

    list($percentage, $message) = _batch_process($batch);
    if ($percentage == 100) {
      $new_op = 'finished';
    }

In batch.js:

        if (progress == 100) {
          pb.stopMonitoring();
          window.location = settings.batch.uri + '&op=finished';
        }

The problem, as I described in #3, is that _batch_process() does not return a "I am finished" flag, so the front-ends (JS or non-JS) rely on floating point comparison to assess if the batch is finished or not.

aspilicious’s picture

I agree with damien, we shouldn't ever rely on floating point comparison. I saw to many troubles with this on non drupal projects I worked on. Now we need someone who actually can fix _batch_process() ;)

David_Rothstein’s picture

That might be a good idea, but I don't see how it's related to this issue.

Even if there is a flag, we still need the code to set the flag correctly or it won't work very well :) And we always want the percentages to be correct anyway, or otherwise we'd print "100%" on the screen for batches that aren't actually done yet.

Based on @janusman's description in #5, the bug is just that _batch_api_percentage() assumes it will only get integers passed in (for $total and $current), but sometimes it doesn't. So either the assumption or inputs have to be changed.

janusman’s picture

@David_Rothstein: IMO the fact that we're not figuring this out yet, and adding in some of the WTFs we've found so far, is a signal that:

- this needs a serious overview
- think that it's possible that this approach (changed from D6 behavior) is flawed and maybe needs an overhaul or rollback.

In restrospective, my initial patch truly is a horrific hack =)

janusman’s picture

Assigned: Unassigned » janusman

Tracking down commits that might have changed behavior, this one seems related:

commit 9066709443242752af89979ac7df8ed2474069be
Author: Dries Buytaert <dries@buytaert.net>
Date:   Wed May 6 10:41:43 2009 +0000

    - Patch #267333 by cwgordon7, David_Rothstein, lilou, et al: fixed batch api rounding.

I'll look into it.

janusman’s picture

FileSize
511 bytes

Yup, reverting that commit also seems to fix it. See patch.

Of course, the patch is incomplete, since the _batch_api_percentage() would be removed as well as batch.test. This patch is just to demonstrate things work now by rolling back a bit of code... I still don't fully grok why $current adds a mix of floats and integers to get... uhm... What It Wants.

catch’s picture

Issue tags: +Needs backport to D7

Adding backport tag.

janusman’s picture

Status: Active » Needs review

Setting needs review, as #12 seems to fix the problem.

The only thing remaining is to clear up the "weirdness" (IMO) in mixing floats and integers in this line:

$current    = $total - $remaining + $finished; 

I'm trying to think that just renaming it like so would help (at least as far as I can understand it):

$current_progress = $total_operations - $remaining_operations + $current_operation_progress_percentage;

... or sprinkle some comments, etc.

I'd really love for someone who really understands what's going on to comment on this one.

yched’s picture

@janusman :
Batch operations can be multipass. Some ops can therefore be 'halfway through', and the completion level of a given op is therefore a float between 0 and 1. The completion level of the whole batch is therefore a float too. Why would a calculation mixing ints and floats be weird ?

A patch that renames the $finished variable within _batch_process() to something clearer is definitely acceptable, if you feel it's needed. Or possibly just adding a one line comment above the $current = ... line: "Total progress is the number of operations that have fully run plus the completion level of the current operation".

As for the possible revert of _batch_api_percentage(), this would really need some feedback from the folks who introduced it in the first place (boombatower, cwgordon7).

David_Rothstein’s picture

FileSize
3.17 KB

I was also pretty involved in the issue that added _batch_api_percentage(), and I don't see why we would revert it. It was added for a reason (to make sure the percentages printed were always correct and meaningful). Always rounding down doesn't make sense.

As I described in #9, I think the bug is just that when we introduced that function, we made a mistake in thinking it would only ever get integers as input. However, we forgot about multi-step operations, where it turns out you can have a batch whose current state is something like 2.998 out of 3 items completed.

I think the attached patch should fix it, right? It comes with a test that is based directly on @janusman's Apache Solr example from above :)

janusman’s picture

I apologize for calling out "weirdness" when in fact I am not very experienced with the code in batch API (which now makes more sense to me personally). I guess I used the term because others were a bit perplexed by it too =)

@David_Rothenstein: your patch in #16 fixes the problem my particular test runs, and does show "99.8%" briefly when the batch is running when the progress hits 500 out of 501 items. BT

This new patch adds a two-line comment in _batch_process() similar to what yched proposed in #15.

David_Rothstein’s picture

That additional code comment looks very reasonable to me.

janusman’s picture

Issue tags: -Needs tests

Removing "needs tests" tag, as it doesn't anymore =)

ejustice’s picture

#16: 1089472-16-batch-api.patch queued for re-testing.

aspilicious’s picture

Status: Needs review » Reviewed & tested by the community

Ok it took me an hour to write this post lol. So I thought about this a lot. I tested a lot of cases and if the function gets called with integers it works perfectly . If it get called with floating point values it can get nasty.

Example: (500 steps done out of 501 but in floating point values)
---------

$total = 1;
$current = 0.998003992016; (== 99.8 %)
than it stops to fast. Primarily because

floor(log10($total / 2.0)) - 1

returns some kind of negative value. BUT that negative value gets 0 when we do.

$decimal_places = max(0, floor(log10($total / 2.0)) - 1); 

== 0 in our case.

SO if we were working with integer and we had 500 steps done out of 501, the previous calculation would return 1 an not 0.
If the calcualation returns 0 (like we have now) 99.8% gets rounded to 100% ==> error. Thats why we add one to $decimal_places each time we loop. After one step the number will be 99.8%, the loop will finish and we have a correct value.

This will not keep looping if we are in the final step (501 out of 501), because we check that before we start looping.
The tests are perfect. So this sounds RTBC.

Lets get this in ;)

Dries’s picture

Status: Reviewed & tested by the community » Fixed

I also spent about 45 minutes looking at this. I didn't really like the while-loop at first, but it actually looks like the proper fix. Committed to 7.x and 8.x. Thanks!

pillarsdotnet’s picture

Issue tags: -Needs backport to D7

Removing tag.

catch’s picture

Issue tags: +Needs backport to D7

Pleaseeave the tags in place. It's the only way to track issues affecting both versions at the moment.

Status: Fixed » Closed (fixed)
Issue tags: -Needs backport to D7

Automatically closed -- issue fixed for 2 weeks with no activity.