Problem/Motivation

  private function prepareDatabasePrefix() {
    $suffix = mt_rand(1000, 1000000);
    $this->siteDirectory = 'sites/simpletest/' . $suffix;
    $this->databasePrefix = 'simpletest' . $suffix;

This code in no way guarantees that the test suffix is unique. When running with a concurrency of 8 and a mixed of DrupalUnitTestBase (fast) and WebTestBase (slow) classes we have a high chance of collision.

See https://drupal.org/comment/8469693#comment-8469693 for examples of where this might have occurred... specifically https://qa.drupal.org/pifr/test/724593

rmdir(sites/simpletest/488385): Directory not empty<pre class="backtrace">rmdir('sites/simpletest/488385') drupal_rmdir('sites/simpletest/488385') file_unmanaged_delete_recursive('sites/simpletest/488385', Array) Drupal\simpletest\TestBase->restoreEnvironment() Drupal\simpletest\TestBase->run() simpletest_script_run_one_test('268', 'Drupal\field\Tests\reEnableModuleFieldTest')

Proposed resolution

Using the locking system in the parent side to guarantee uniqueness.

Remaining tasks

Review patch.

User interface changes

None

API changes

None

Comments

bzrudi71’s picture

@alexpot Do you think it's possible to shorten the 'simpletest' prefix to something like 'st' while working on this? E.g.
$this->databasePrefix = 'st' . $suffix;
It should have no impact on the uniqueness of databasePrefix but could/might help because of #998898: Make sure that the identifiers are not more the 63 characters on PostgreSQL which leads to broken PostgreSQl tests. This will not fix all and everything in PostgreSQL test, but may help a bit ;-)

bzrudi71’s picture

On the other hand, patch(es) from #998898: Make sure that the identifiers are not more the 63 characters on PostgreSQL doesn't look that bad. I think I will do a D8 port and see if it works to prevent the 63 chars limit all over...

berdir’s picture

Status: Needs review » Reviewed & tested by the community

Wrote a quick script to simulate a test run with a concurrency of 8 and then checking how long it takes until it conflicts:

https://gist.github.com/Berdir/8934380

While the mean value is usually between 100-150k loops for 100 runs, There are cases with <1000 loops and I had one that conflicted already after 58 loops.

So yes, this a problem and apparently more likely than you'd think.

Code looks fine, let's try if this helps with those weird, random missing table test cases and similar cases.

Note: This runs in a single process, not separate ones but I doubt that would change much?

sun’s picture

Status: Reviewed & tested by the community » Needs work

Why all this lock stuff?

We can simply check whether the generated test site directory exists already in a while loop.

do {
  $suffix = ...
} while (is_dir("sites/simpletest/$suffix");
sun’s picture

Status: Needs work » Needs review
StatusFileSize
new1.14 KB

Here. :)

sun’s picture

StatusFileSize
new1.15 KB

In addition, I always wondered why the range of possible random values varies so much in the first place?

1,000-1,000,000 → 100,000-999,999

That certainly contributes to the issue here, since often times I'm getting test prefixes of just 5 digits.

The last submitted patch, 5: drupal8.test-site-unique.patch, failed testing.

berdir’s picture

We just discussed that the dynamic prefixes might actually be the biggest problem here, not direct conflicts.

If one test gets the prefix "5000", then once that is completed, it will do the wildcard table delete things, which will match any suffix that starts with 5000, which means 5000* and 5000**, so it can conflict with up to 110 other suffixes. My guess is that those are the missing table cases, and "already installed" that sometimes also happen would be the exact match. But that's just a guess :)

sun’s picture

Indeed, this "edge-case" apparently seems to happen way more often than we imagined — just in https://qa.drupal.org/pifr/test/725078 only, it happened 4x times.

The latest/final patch resolves two discrete problems:

  1. The generated random database table suffix is not necessarily unique due to concurrency.

    → Check whether a test site directory exists already, and if so, generate a new one.

  2. The possible range of random numbers can result in a shorter database table prefix, but since TestBase::restoreEnvironment() uses the value as a wildcard prefix for finding database tables, the database tables of other tests (using the same prefix * 10 or * 100) are possibly deleted, too.

    → Ensure to use a fixed length of digits for the database prefix.

Would be great to commit this patch ASAP.

berdir’s picture

Status: Needs review » Reviewed & tested by the community

Agreed, this is an easier fix and the mt_rand() change is actually the most important part here. Sometimes I wonder why it takes us so long to figure out stuff like that ;)

Yes, let's get this in ASAP.

alexpott’s picture

Status: Reviewed & tested by the community » Needs work

I disagree with checking the file system as a means of guaranteeing uniqueness. Whether this actually works in a concurrent testing situation will depend on what file system you use. There is a reason will libraries such as this exist https://github.com/unlox775/File-NFSLock-php.

Patch incoming.

alexpott’s picture

Status: Needs work » Needs review
StatusFileSize
new2.26 KB

Patch which makes the same change to mt_rand().

Status: Needs review » Needs work

The last submitted patch, 12: 2193521.12.patch, failed testing.

alexpott’s picture

Status: Needs work » Reviewed & tested by the community

After a long discussion in IRC @sun and specifically @Berdir convinced me that my concerns about clustered file systems are going to be a lot bigger than just this. The patch in #6 is RTBC.

webchick’s picture

Status: Reviewed & tested by the community » Fixed

Looks good here. Here's to fewer random-ass test failures. :)

Committed and pushed to 8.x. Thanks!

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.