Early Bird Registration for DrupalCon Portland 2024 is open! Register by 23:59 PST on 31 March 2024, to get $100 off your ticket.
Problem/Motivation
I had a random test fail in ManageGitIgnoreTest yesterday, during standard patch review with PHP 7.1 & MySQL 5.7 -- and when I hit retest, the tests all passed.
Here's the failing job:
https://www.drupal.org/pift-ci-job/1347693
In case it goes away, here are the results:
fail: [Other] Line 0 of sites/default/files/simpletest/phpunit-555.xml:
PHPunit Test failed to complete; Error: PHPUnit 6.5.14 by Sebastian Bergmann and contributors.
Testing Drupal\Tests\Component\Scaffold\Functional\ManageGitIgnoreTest
..F 3 / 3 (100%)Command 'install --no-ansi --no-scripts' - Expected exit code: 0, actual exit code: 1
Time: 11.23 seconds, Memory: 287.00MB
There was 1 failure:
1) Drupal\Tests\Component\Scaffold\Functional\ManageGitIgnoreTest::testUnmanagedGitIgnoreWhenGitNotAvailable
Failed asserting that file "/tmp/composer-scaffold-test-bdbb6af8cfb6c321e9f0053d2e8e63f6/drupal-composer-drupal-project/docroot/index.php" exists.
/var/www/html/core/tests/Drupal/Tests/Component/Scaffold/Functional/ManageGitIgnoreTest.php:186
Proposed resolution
TBD
Remaining tasks
User interface changes
API changes
Data model changes
Release notes snippet
Comment | File | Size | Author |
---|---|---|---|
#50 | 3067768-50.patch | 7.35 KB | Krzysztof Domański |
#50 | interdiff-45-50.txt | 2.18 KB | Krzysztof Domański |
#45 | 3067768-45.patch | 7.32 KB | greg.1.anderson |
Comments
Comment #2
jhodgdonNoting PHP version.
Comment #3
greg.1.anderson CreditAttribution: greg.1.anderson at Pantheon commentedI'm not sure why the scaffold operation would sometimes fail to run. We could make this easier to diagnose by adding another assert against the output of the scaffold operation. If we asserted against the status result code + stdout + stderr all in a single assert, then we could get more data when the random fail happened again.
Comment #4
alexpott@greg.1.anderson also we probably at some point want to add -vvv to the composer commands as that really helps with debugging what's going on too.
Comment #5
alexpottOne thing that does look at bit interesting is the use of md5 + 1. The whole point of that is generate something unique BUT I have the feeling the way we're adding in time we're actually more likely to generate a clash!!! So let's use uniqid() instead. This still has problems of not always being unique but at least it is the right tool for the job.
Comment #6
Mixologic11:40:17
11:40:17
11:40:17 [Symfony\Component\Console\Exception\CommandNotFoundException]
11:40:17 There are no commands defined in the "composer" namespace.
11:40:17
11:40:17
11:40:18 Drupal\Tests\Component\Scaffold\Functional\ManageGitIgnoreTe 0 passes 1 fails
^this is what was showing on the console log. seems odd that it'd throw this kind of exception.
Comment #7
greg.1.anderson CreditAttribution: greg.1.anderson at Pantheon commentedAdditional improvements.
Comment #8
MixologicAdditionally I checked all the logs since this commit, and so far this is the only instance of this test failing.
Comment #9
greg.1.anderson CreditAttribution: greg.1.anderson at Pantheon commented@mixologic: That's the failure that I would expect. 'No commands defined in the "composer" namespace' can be explained by the composer install failing. I'm not sure why it failed, but such a failure could certainly be explained by conflicting temporary directories. My patch adds output to diagnose that situation, although I suspect that @alexpott's patch fixes the problem.
Comment #10
jhodgdonLucky me, I won the random test fail lottery! :)
Comment #11
alexpottFunnily enough I was thinking of adding a test like this for the iterable collection change I did at the very end of the scaffold patch so +1 for adding this. As I said before we really should consider adding
-vvv
to the composer commands because this gives way more helpful output if something does go wrong in one of our commands.Comment #12
greg.1.anderson CreditAttribution: greg.1.anderson at Pantheon commentedThe problem with `-vvv` in Composer commands is that I think it would be hard to do exact matches against the output, because there are a lot of local filesystem paths and variable output in very verbose mode.
For tests that do not assert on the output, or that only assert 'contains' on the output, then `-vvv` becomes viable to use.
Comment #13
greg.1.anderson CreditAttribution: greg.1.anderson at Pantheon commentedTests are green; anyone care to RTBC this, or should we do additional work here?
Comment #14
MixologicTurns out this is affecting Drupal\Tests\Component\Scaffold\Integration\ScaffoldFileCollectionTest::testCreate as well.
https://www.drupal.org/pift-ci-job/1356921
https://www.drupal.org/pift-ci-job/1354328
https://www.drupal.org/pift-ci-job/1354324
https://www.drupal.org/pift-ci-job/1353539
https://www.drupal.org/pift-ci-job/1353256
We dont really have a great way of proving that a random failure is no longer randomly failing, short of running the old test a bunch of times hoping to get it to rando fail, then running the new patch a bunch of times hoping that it doesnt.
Should we take those steps here?
Comment #15
MixologicYeah, Im going to take those steps of reproducing the randofail, and hoping it goes away with the fix.. I'd much rather have some level of confidence that the fix we're attempting actually helps.
Comment #16
MixologicHeres just the scaffold tests, running 20 times each on 32 processors.
This should hopefully fail.
Comment #17
MixologicThis one is a little bit more focused. Its *just* the Drupal\Tests\Component\Scaffold\Integration\ScaffoldFileCollectionTest
Comment #19
MixologicOkay, great. We've proven that we can reproduce. Lets try it with the patch from #7 and up the number of times we run this to 10x and see what happens.
Comment #22
Mixologicokay, so thats a bummer. We've still got an issue even with #7.
Comment #23
alexpott@mixologic++ thanks for adding the testing. Let's see if this does it.
Comment #24
greg.1.anderson CreditAttribution: greg.1.anderson at Pantheon commentedI don't think that the extra 'ensureDirectoryExists' is necessarily helpful. If this is a race condition, it is only a race condition because two unrelated tests chose the same temporary directory name.
The temporary directory is currently named as follows:
It is impossible that the same process should ever produce the same name on subsequent calls of this function. If there are parallel processes creating temporary directories, and they make a hundred in a second, then there is a one-in-ten-thousand chance that two processes will run at exactly the same microtime(). While I am not sure exactly how many temporary directories are created by these tests, or how long they take to run, it's probably something along those lines.
Maybe we should shorten the invariant part ('composer-scaffold-test-') and append some data from mt_rand() or some other source here. The chance that two pseudo-random sequences will both line up exactly should be about zero.
Whoops I wasn't looking at the patched version of the file. The current random name is generated via:
Comment #25
greg.1.anderson CreditAttribution: greg.1.anderson at Pantheon commentedLooking at the docs for uniqid, it seems that uniqid($prefix) is way less random than md5($prefix . microtime()). However, uniqid($prefix, true) appends another random value on the end of the output. The thing is, though, that the main randomness of uniqid('', false) is very poor, with very similar values being produced from one call to the next. Maybe what we should try instead is something like:
uniqueid(md5($prefix . microtime()), true);
Thoughts?
Comment #26
greg.1.anderson CreditAttribution: greg.1.anderson at Pantheon commentedHm, but #23 passed. I am skeptical that it is only fixing the symptom - the temporary directory being deleted by another process, caused by the other process choosing the same tmp directory name. There could still be other issues if the same tmp directory are used by different processes.
Maybe there is an environment variable set by Drupal CI with a unique job number that could also be used here. If this environment variable were unset in other environments, it should not matter, as other environments are not likely to be as concurrent. Not sure what is available, but I imagine there is something.
Comment #27
MixologicYeah, I think it must come down to two different scaffold tests both creating tmpdirs at the same time.
I sure wish composer didnt swallow the mkdir error so we can see whats actually happening.
But #23 seems to fix it, though even at 12800 times #19 only failed 3 times, so Im gonna resubmit a few more just to be sure.
Drupalci doesnt really know what iteration of the test is running.. that'd be something run-tests.sh should pass in to its child tests.
Comment #28
MixologicComment #29
greg.1.anderson CreditAttribution: greg.1.anderson at Pantheon commentedThe other thought would be for DrupalCI to set up sys_get_temp_dir() such that every test gets a separate tmp directory.
Comment #30
greg.1.anderson CreditAttribution: greg.1.anderson at Pantheon commentedFailing #29, the other thing we could do is have the scaffold test make itself a single unique / random prefix to use when generating all tmp directories. It would be very unlikely for two concurrent scaffold tests to get the exact same random prefix, and even if they did, it would then be only two processes in contention for directory collisions, rather than all of them.
Comment #32
greg.1.anderson CreditAttribution: greg.1.anderson at Pantheon commentedYep, #23 made failures less damaging, but did not prevent collisions. Working on another version of the patch.
Comment #33
greg.1.anderson CreditAttribution: greg.1.anderson at Pantheon commentedThis patch implements the ideas from #30 and #25. I also backed out the remediation in #23 to make collisions easier to detect, as ideally we will avoid collisions rather than attempt to repair collisions.
Comment #34
MixologicSo each test that is executed by run-tests.sh gets assigned a unique test-id that it gets from the database.
That test-id gets passed to the test class constructor:
$test = new $class_name($test_id);
Which means, in a run-tests.sh environment we could potentially avoid intra test collisions by using that unique test_id in paths.
Either we've thought of this before, and it didnt work, or we wonder why we never thought of it before.
Comment #35
Mixologicoh, the constructor assigns that $test_id to the name attribute of the TestCase base class.
So $this->name might just give us what we need.
Comment #36
greg.1.anderson CreditAttribution: greg.1.anderson at Pantheon commentedUpdating tests to pass $this->name through to the fixtures.
Comment #37
greg.1.anderson CreditAttribution: greg.1.anderson at Pantheon commentedUpdated per #36.
Comment #38
greg.1.anderson CreditAttribution: greg.1.anderson at Pantheon commentedIt turns out that $this->name is not passed all the way through from drupal.org CI to phpunit, so #37 is not helpful. Here's something that should be more stable.
Comment #39
greg.1.anderson CreditAttribution: greg.1.anderson at Pantheon commentedPlease pay no attention to "patch failed to apply", I named my interdiff wrong. :p
Comment #40
MixologicHeres another thrashing. If this is 4x green, then it 差不多 for me.
Comment #41
Mile23Just the one thing from #38:
Let's fail early on this instead of silencing the error.
In fact,
mkTmpDir()
might be a protected function.Comment #42
greg.1.anderson CreditAttribution: greg.1.anderson at Pantheon commentedAh I think that was a change that was treating symptoms that was left in by mistake. I agree it should come out.
Comment #43
greg.1.anderson CreditAttribution: greg.1.anderson at Pantheon commented#40 with #41 fixed, tested four times.
Comment #44
Mile23LGTM. Now we need to revert the drupalci.yml parts and take care of this:
Wrap docblock to 80. Probably other CS things, too.
Comment #45
greg.1.anderson CreditAttribution: greg.1.anderson at Pantheon commentedBack to ordinary testing. Fixed wordwrap CS in #44.
Comment #46
MixologicThis is solid, and should be pretty impervious to accidental collisions on the testbot in parallel environments. There was a head breaking test today on commit, so, getting this in before more disruptions happen would be good.
Comment #47
alexpottWe need to silence any errors in
getmypid()
- https://www.php.net/manual/en/function.getmypid.php - doens't work on windows apparently. I think this is okay but we just need to be make sure this doesn't break things. Highly concurrent testing on windows is not necessary to support.Comment #48
MixologicI cant find anywhere that indicates that getmypid() doesnt work on windows?
Comment #49
greg.1.anderson CreditAttribution: greg.1.anderson at Pantheon commentedI tried it on appveyor:
https://ci.appveyor.com/project/greg-1-anderson/site-alias/builds/262389...
Test line:
Result:
I don't see any reason to protect getmypid(), but please LMK if you'd like changes here anyway.
Comment #50
Krzysztof Domański1. The
$output
is an array.function exec ($command, array &$output = null, &$return_var = null) {}
2. Fixed coding standard and typo.
Comment #51
greg.1.anderson CreditAttribution: greg.1.anderson at Pantheon commented+1 on #50 thanks for the improvements.
Comment #52
larowlanthanks folks, this was a curly one indeed
Committed 3a6e13d and pushed to 8.8.x. Thanks!
Comment #54
cilefen CreditAttribution: cilefen at Institute for Advanced Study commented#3070853: Race condition during composer cache clear in functional composer scaffolding plugin tests
Comment #55
Mixologic