Problem/Motivation
Occasionally, the Build tests fail because they cannot allocate a web server:
- https://www.drupal.org/pift-ci-job/1426940
- https://www.drupal.org/pift-ci-job/1425967
- https://www.drupal.org/pift-ci-job/1426830
- https://www.drupal.org/pift-ci-job/1426933
- https://www.drupal.org/pift-ci-job/1427002
There are two potential causes that seem most likely:
- We only wait one second for the web server to start up; perhaps this is not enough time.
- We protect against race conditions in port selection using a cooperative locking scheme; however, if there are any other processes running anywhere else on the same machine that are occasionally acquiring and releasing ports, then this could cause a race condition with our code.
Proposed resolution
Increase the time we wait for a web server to start to 1.5-2s and see if that helps. When the server does start successfully, emit how long startup took, so that we can gather data to inform future decisions. If there are many instances where the web server did not start until the timeout had nearly been reached (e.g. under heavy test load), then the current timeout is probably insufficient.
Make the web server startup code more resilient: if we cannot start the server on the first port we pick, try a different port number.
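A rough sketch of both ideas, retrying on a different port and reporting how long startup took. The helper names below (findAvailablePortSketch(), portIsAvailableSketch(), waitForServerSketch()) and the port range are assumptions for illustration, not the actual BuildTestBase API:

```php
<?php

/**
 * Sketch: pick a port, retrying on failure (hypothetical helper names).
 */
function findAvailablePortSketch(int $attempts = 10): int {
  for ($i = 0; $i < $attempts; $i++) {
    // Stay inside the registered port range (1024-49151) and leave the
    // dynamic/ephemeral range (49152-65535) to the OS.
    $port = random_int(1024, 49151);
    if (portIsAvailableSketch($port)) {
      return $port;
    }
  }
  throw new \RuntimeException("No available port found after $attempts attempts.");
}

/**
 * Sketch: a port is "available" if nothing answers a connection attempt.
 */
function portIsAvailableSketch(int $port): bool {
  $socket = @fsockopen('localhost', $port, $errno, $errstr, 0.1);
  if ($socket !== FALSE) {
    // Something is already listening, so the port is taken.
    fclose($socket);
    return FALSE;
  }
  return TRUE;
}

/**
 * Sketch: wait for the server to come up, returning how long it took so
 * we can gather data on whether the timeout is generous enough.
 */
function waitForServerSketch(int $port, float $timeout = 2.0): float {
  $start = microtime(TRUE);
  while (microtime(TRUE) - $start < $timeout) {
    if (!portIsAvailableSketch($port)) {
      // The server is listening; report the elapsed startup time.
      return microtime(TRUE) - $start;
    }
    usleep(50000);
  }
  throw new \RuntimeException("Web server did not start within {$timeout}s.");
}
```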
Remaining tasks
tbd
User interface changes
None
API changes
None
Data model changes
None
Release notes snippet
tbd
Comment | File | Size | Author |
---|---|---|---|
#38 | 3086179-38.patch | 5.69 KB | alexpott |
#38 | 35-38-interdiff.txt | 2.18 KB | alexpott |
#35 | 3086179-35.patch | 5.03 KB | alexpott |
#35 | 33-35-interdiff.txt | 1.13 KB | alexpott |
#33 | 3086179-33.patch | 5.12 KB | alexpott |
Comments
Comment #2
greg.1.anderson CreditAttribution: greg.1.anderson at Pantheon commented
Remove extra unused list items in issue summary
Comment #3
Mile23
We could mark it incomplete to help out the 8.8.x effort.
Comment #4
Mixologic
If we *reduced* the time to something unreasonable, like 100ms, would/should we be able to get this to fail predictably, and thus have more confidence that the cure will help?
Comment #5
Mixologic
For cause #2 there shouldn't be much acquiring or releasing of ports in the Docker container. It might be that acquiring ports from within a Docker container takes longer than we'd expect.
Comment #6
greg.1.anderson CreditAttribution: greg.1.anderson at Pantheon commented
#4 would certainly cause more failures, but we wouldn't know any more about whether timeouts were the cause of the spurious failures before the change. If we log how long it is taking to spin up the web server, then we'll have data that we can use to determine how likely it is that the timeout is causing the problem. If there are not many / any other processes allocating ports in the Docker container, then it's likely that a slow web server startup is the cause.
Comment #7
Ghost of Drupal Past
Maybe:
do_cli_server
php_cli_server_ctor
php_network_listen_socket
php_network_getaddresses
getaddrinfo
Are we well set up on DrupalCI for a getaddrinfo syscall? I mean, it shouldn't be a problem because it's hardwired to 127.0.0.1 but I have seen a lot worse than a misconfigured nsswitch.conf especially in the limited world of containers.
Should we limit ourselves to registered ports (1024 to 49151) and leave the dynamic ports (49152 to 65535) alone? Simple enough change; the current version is:
$port = random_int(1024, 65535);
Comment #8
alexpott
Note it is always \Drupal\BuildTests\Framework\Tests\BuildTestTest::testPortMany. So let's set it to repeat and also try making it 5 seconds.
Comment #11
alexpott
Comment #14
alexpott
Comment #16
alexpott
Whoops...
Comment #18
alexpott
So I don't think our locking is working as people are expecting...
Comment #19
alexpott
So this test is almost designed to fail :)
\Drupal\BuildTests\Framework\BuildTestBase::lockAcquired() returns TRUE if a lock is acquired or if the lock has already been acquired. Therefore a test like \Drupal\BuildTests\Framework\Tests\BuildTestTest::testPortMany() can get the same port because it has already acquired it. Also, I'm not sure that we need to store all the locks on BuildTestBase. \Drupal\BuildTests\Framework\BuildTestBase::checkPortIsAvailable() shouldn't be involved in locking... the locking needs to be moved to \Drupal\BuildTests\Framework\BuildTestBase::findAvailablePort(), which is where it is needed.
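The failure mode described above can be sketched like this (an illustrative toy, not the real BuildTestBase code):

```php
<?php

/**
 * Illustrative sketch of the bug: "acquired" is reported as success even
 * when we ourselves already hold the lock for that port.
 */
class PortLockerSketch {

  /**
   * Locks already held by this process, keyed by port.
   *
   * @var array<int, bool>
   */
  protected static $locks = [];

  public static function lockAcquired(int $port): bool {
    if (isset(static::$locks[$port])) {
      // Buggy behaviour: we already hold this lock, but we still report
      // success, so a second caller is handed the same port.
      return TRUE;
    }
    static::$locks[$port] = TRUE;
    return TRUE;
  }

}
```

Two successive calls for the same port both return TRUE, which is how testPortMany() can end up being given a port it has already claimed.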
Comment #20
alexpott
I don't think we need 5 seconds.
Comment #23
alexpott
Comment #25
alexpott
So we can't let the lock objects destruct until we're done with them... so let's store them in state. I think this all shows that providing generic locking methods on the test base class is not a great idea.
Comment #27
alexpott
Maybe it's the timeout on the fsockopen... very weird.
Comment #29
alexpott
So maybe it's because we don't stop the processes... also trying smaller port numbers because, hey, why not.
Comment #30
alexpott
Yay... so was it the port range or the tidying up?
Comment #32
Mixologic
Looks like it was definitely the port range. I wonder if there's something about how Docker, running on top of AWS, handles dynamic ports, and maybe it takes longer for those to close.
But hey, restricting the range seems sensible to me. We only really need 32 of them (maybe 64 or 128 someday) at a time for parallelization, and even then we only need that many for this test and for any future tests written to use this test type.
Comment #33
alexpott
So this goes back to #29 and also provides a final patch.
I think all the changes this makes are worthwhile:
- Moving the locking to findAvailablePort() from checkPortIsAvailable() - this makes sense; we shouldn't be doing any locking in checkPortIsAvailable().
Comment #34
heddn
Fun little nit. Made me smile:
can use the...
I don't think we are https://www.youtube.com/watch?v=T59EDTqqW0A
Comment #35
alexpott
@heddn thanks for the review
Comment #36
heddn
We should comment why 9999, since this seems like a bit of a magic number.
Otherwise, I think these changes are fine.
Comment #37
Mile23
This shouldn't be static, because otherwise we never release lock objects until the runtime is over. It should be a class property that works the same way:
$this->ports[$port_number] = $lock;
It'll also need to be unset in tearDown().
Under run-tests.sh it works either way because we run one test class per process, so the runtime ends pretty quickly. Under PHPUnit, however, those locks will stay alive until after all tests have run and all reports are generated. (There are about a thousand variations of this test listener, which unsets test class properties for this reason: https://packagist.org/packages/mybuilder/phpunit-accelerator )
Also, since it's per object now and not per class, subsequent objects can start over at 8000 or whatever, and are not disallowed from trying a port that might have opened up since the last poll.
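The suggested shape, sketched out. The class and method names here are assumed for illustration and are not verified against the final patch:

```php
<?php

/**
 * Sketch: hold lock objects per test instance rather than statically.
 */
abstract class BuildTestBaseSketch {

  /**
   * Lock objects keyed by port number. As an instance property, the locks
   * are released when the test object is torn down, not only when the
   * whole PHPUnit run ends.
   *
   * @var array<int, object>
   */
  protected $ports = [];

  protected function rememberLock(int $port_number, object $lock): void {
    $this->ports[$port_number] = $lock;
  }

  protected function tearDown(): void {
    // Emptying the property lets the lock objects destruct, freeing the
    // ports for the next test object.
    $this->ports = [];
  }

}
```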
Comment #38
alexpott
@heddn & @Mile23 good calls...
Comment #39
heddn
I'm satisfied if this comes back green.
Comment #40
catch
Committed 9a70cb8 and pushed to 8.8.x. Thanks!