Problem/Motivation

Occasionally, the Build tests are unable to allocate a web server and therefore fail.

There are two potential causes that seem most likely:

  1. We only wait one second for the web server to start up; perhaps this is not enough time.
  2. We protect against race conditions in port selection using a cooperative locking scheme; however, any other process on the same machine that occasionally acquires and releases ports can still race with our code, because it does not participate in that locking.

Proposed resolution

  1. Increase the time we wait for the web server to 1.5s to 2s and see if that helps. When the server starts successfully, emit how long startup took, so that we can gather data to inform future decisions. If there are many instances where the web server did not start until the timeout had nearly been reached (e.g. under heavy test load), then the current timeout is probably insufficient.
  2. Make the web server startup code more resilient: try a different port number if we cannot start the server on the one we picked the first time around (see the sketch after this list).
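A hedged sketch of the retry idea follows. checkPortIsAvailable() exists on BuildTestBase; startServerOnPort() and the 2-second timeout are illustrative placeholders, not the current API:

  // Sketch only: pick a port, try to start the server on it, and fall back
  // to a different port if the one we picked was taken in the meantime.
  $maxAttempts = 5;
  $server = NULL;
  for ($attempt = 0; $attempt < $maxAttempts && $server === NULL; $attempt++) {
    $port = random_int(1024, 65535);
    if (!$this->checkPortIsAvailable($port)) {
      continue;
    }
    try {
      // Allow up to 2 seconds (instead of 1) for the server to respond.
      $server = $this->startServerOnPort($port, 2.0);
    }
    catch (\RuntimeException $e) {
      // Another process bound the port between the check and the bind;
      // loop around and try a different port number.
    }
  }
  if ($server === NULL) {
    throw new \RuntimeException("Unable to start the web server after $maxAttempts attempts.");
  }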

Remaining tasks

tbd

User interface changes

None

API changes

None

Data model changes

None

Release notes snippet

tbd

Comment | File | Size | Author
#38 | 3086179-38.patch | 5.69 KB | alexpott
#38 | 35-38-interdiff.txt | 2.18 KB | alexpott
#35 | 3086179-35.patch | 5.03 KB | alexpott
#35 | 33-35-interdiff.txt | 1.13 KB | alexpott
#33 | 3086179-33.patch | 5.12 KB | alexpott
#33 | 3086179-33-test.patch | 7.91 KB | alexpott
#30 | 3086179-30.patch | 7.87 KB | alexpott
#30 | 29-30-interdiff.txt | 536 bytes | alexpott
#29 | 3086179-29.patch | 7.91 KB | alexpott
#29 | 27-29-interdiff.txt | 1.63 KB | alexpott
#27 | 3086179-27.patch | 7.53 KB | alexpott
#27 | 25-27-interdiff.txt | 678 bytes | alexpott
#25 | 3086179-25.patch | 7.53 KB | alexpott
#25 | 23-25-interdiff.txt | 2.6 KB | alexpott
#23 | 3086179-23.patch | 6.65 KB | alexpott
#23 | 20-23-interdiff.txt | 1.37 KB | alexpott
#20 | 3086179-20.patch | 6.29 KB | alexpott
#20 | 19-20-interdiff.txt | 655 bytes | alexpott
#19 | 3086179-19.patch | 6.55 KB | alexpott
#19 | 16-19-interdiff.txt | 2.57 KB | alexpott
#16 | 3086179-16.patch | 4 KB | alexpott
#16 | 14-16-interdiff.txt | 708 bytes | alexpott
#14 | 11-14-interdiff.txt | 633 bytes | alexpott
#14 | 3086179-14.patch | 3.99 KB | alexpott
#11 | 3086179-11.patch | 3.74 KB | alexpott
#11 | 3086179-test-only-11.patch | 3.1 KB | alexpott
#8 | 3086179-8.patch | 2.49 KB | alexpott
#8 | 3086179-8.test-only.patch | 1.85 KB | alexpott

Comments

greg.1.anderson created an issue. See original summary.

greg.1.anderson’s picture

Issue summary: View changes

Remove extra unused list items in issue summary

Mile23’s picture

We could mark it incomplete to help out the 8.8.x effort.

Mixologic’s picture

If we *reduced* the time to something unreasonable, like 100ms, would/should we be able to get this to fail predictably, and thus have more confidence that the cure will help?

Mixologic’s picture

For cause #2, there shouldn't be much acquiring or releasing of ports in the Docker container. It might be that acquiring ports from within a Docker container takes longer than we'd expect.

greg.1.anderson’s picture

#4 would certainly cause more failures, but it wouldn't tell us any more about whether timeouts were the cause of the spurious failures before the change. If we log how long it takes to spin up the web server, then we'll have data we can use to determine how likely it is that the timeout is causing the problem. If there are not many / any other processes allocating ports in the Docker container, then a slow web server startup is the likely cause.
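A hedged sketch of what that logging could look like around the startup polling loop ($port comes from the surrounding code; the loop itself is an illustration, not the existing implementation):

  // Sketch only: time how long the built-in PHP server takes to accept a
  // connection and emit the measurement so DrupalCI logs can be mined later.
  $start = microtime(TRUE);
  $responded = FALSE;
  while (microtime(TRUE) - $start < 2.0) {
    $socket = @fsockopen('127.0.0.1', $port, $errno, $errstr, 0.1);
    if ($socket !== FALSE) {
      fclose($socket);
      $responded = TRUE;
      break;
    }
    // Wait 50ms before polling again.
    usleep(50000);
  }
  fwrite(STDERR, sprintf("Web server on port %d %s after %.3f seconds.\n", $port, $responded ? 'responded' : 'timed out', microtime(TRUE) - $start));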

Ghost of Drupal Past’s picture

  • We only wait one second for the web server to start up; perhaps this is not enough time.
  • Maybe:

    1. do_cli_server
    2. php_cli_server_ctor
    3. php_network_listen_socket
    4. php_network_getaddresses
    5. getaddrinfo

    Are we well set up on DrupalCI for a getaddrinfo syscall? I mean, it shouldn't be a problem because it's hardwired to 127.0.0.1, but I have seen a lot worse than a misconfigured nsswitch.conf, especially in the limited world of containers.

    however, if there are any other processes running anywhere else on the same machine that are occasionally acquiring and releasing ports, then this could cause a race condition with our code.

    Should we limit to registered ports (1024 to 49151) and leave the dynamic ports (49152 to 65535) alone? Simple enough change: $port = random_int(1024, 65535); is the current version.
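    If we went that route, the change is presumably a one-liner (sketch, not a patch):

      // Current: $port = random_int(1024, 65535);
      // Suggested: stay inside the registered range and leave the dynamic /
      // ephemeral ports (49152 to 65535) to the operating system.
      $port = random_int(1024, 49151);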

alexpott’s picture

Status: Active » Needs review
FileSize
1.85 KB
2.49 KB

Note it is always \Drupal\BuildTests\Framework\Tests\BuildTestTest::testPortMany.

So let's set it to repeat and also try making it 5 secs.

The last submitted patch, 8: 3086179-8.test-only.patch, failed testing. View results

Status: Needs review » Needs work

The last submitted patch, 8: 3086179-8.patch, failed testing. View results

alexpott’s picture

Status: Needs work » Needs review
FileSize
3.1 KB
3.74 KB

The last submitted patch, 11: 3086179-test-only-11.patch, failed testing. View results

Status: Needs review » Needs work

The last submitted patch, 11: 3086179-11.patch, failed testing. View results

alexpott’s picture

Status: Needs work » Needs review
FileSize
3.99 KB
633 bytes

Status: Needs review » Needs work

The last submitted patch, 14: 3086179-14.patch, failed testing. View results

alexpott’s picture

Status: Needs work » Needs review
FileSize
708 bytes
4 KB

Whoops...

Status: Needs review » Needs work

The last submitted patch, 16: 3086179-16.patch, failed testing. View results

alexpott’s picture

So I don't think our locking is working as people are expecting...

  RuntimeException: Unable to start the web server.

  ERROR_OUTPUT:
  [Mon Oct  7 11:15:18 2019] Failed to listen on 127.0.0.1:59410 (reason: Address already in use)

alexpott’s picture

Status: Needs work » Needs review
FileSize
2.57 KB
6.55 KB

So this test is almost designed to fail :)

\Drupal\BuildTests\Framework\BuildTestBase::lockAcquired() returns true if a lock is acquired or if the lock has already been acquired. Therefore a test like \Drupal\BuildTests\Framework\Tests\BuildTestTest::testPortMany() can get the same port because it has already acquired it. Also I'm not sure that we need to store all the locks on BuildTestBase.

\Drupal\BuildTests\Framework\BuildTestBase::checkPortIsAvailable() shouldn't be involved in locking... it needs to be moved to \Drupal\BuildTests\Framework\BuildTestBase::findAvailablePort(), which is where locking is needed.
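A rough sketch of that shape, keeping each acquired lock for the lifetime of the test; acquirePortLock() and the $portLocks property are illustrative placeholders, while findAvailablePort() and checkPortIsAvailable() are the real method names discussed above:

  // Sketch only: do the locking where the port is chosen, and treat "this
  // test already claimed that port" as a reason to pick another one.
  protected function findAvailablePort() {
    $counter = 100;
    while ($counter--) {
      $port = random_int(1024, 65535);
      if (isset($this->portLocks[$port])) {
        // This test already claimed this port; pick another one.
        continue;
      }
      $lock = $this->acquirePortLock($port);
      if ($lock === NULL) {
        // Another test process holds the lock for this port.
        continue;
      }
      if ($this->checkPortIsAvailable($port)) {
        // Keep a reference to the lock so it is not released prematurely.
        $this->portLocks[$port] = $lock;
        return $port;
      }
      // We hold the lock but something else is already listening; release
      // the lock and try again.
      unset($lock);
    }
    throw new \RuntimeException('Unable to find an available port.');
  }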

alexpott’s picture

I don't think we need 5 seconds.

The last submitted patch, 19: 3086179-19.patch, failed testing. View results

Status: Needs review » Needs work

The last submitted patch, 20: 3086179-20.patch, failed testing. View results

alexpott’s picture

Status: Needs work » Needs review
FileSize
1.37 KB
6.65 KB

Status: Needs review » Needs work

The last submitted patch, 23: 3086179-23.patch, failed testing. View results

alexpott’s picture

Status: Needs work » Needs review
FileSize
2.6 KB
7.53 KB

So we can't let the lock objects destruct until we're done with them... so let's store them in state. I think this all shows that providing generic locking methods on the test base class is not a great idea.
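In other words, whoever acquires the lock has to keep a reference to the lock object alive; a tiny sketch of the difference (the lock API is illustrative):

  // Broken: $lock goes out of scope when the method returns, the lock
  // object is destructed, and the lock is released immediately.
  $lock = $this->acquirePortLock($port);

  // Better: keep the object referenced somewhere with a longer lifetime
  // (a property, or the state storage mentioned above) so the lock is held
  // until we are genuinely finished with the port.
  $this->portLocks[$port] = $this->acquirePortLock($port);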

Status: Needs review » Needs work

The last submitted patch, 25: 3086179-25.patch, failed testing. View results

alexpott’s picture

Status: Needs work » Needs review
FileSize
678 bytes
7.53 KB

Maybe it's the timeout on the fsockopen... very weird.

Status: Needs review » Needs work

The last submitted patch, 27: 3086179-27.patch, failed testing. View results

alexpott’s picture

Status: Needs work » Needs review
FileSize
1.63 KB
7.91 KB

So maybe it's because we don't stop the processes... also trying smaller port numbers because, hey, why not.
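A hedged sketch of the tidy-up idea, assuming the test keeps the Symfony Process objects for the servers it started in a local $servers array (the variable name is illustrative, not the actual test code):

  // Sketch only: stop every PHP built-in server the test started before the
  // test method returns, so the ports it used are actually released.
  /** @var \Symfony\Component\Process\Process[] $servers */
  foreach ($servers as $server) {
    if ($server->isRunning()) {
      $server->stop();
    }
  }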

alexpott’s picture

Yay... so was it the port range or the tidying up?

Status: Needs review » Needs work

The last submitted patch, 30: 3086179-30.patch, failed testing. View results

Mixologic’s picture

Looks like it was definitely the port range. I wonder if there's something about how Docker, running on top of AWS, handles dynamic ports, and maybe it takes longer for those to close.

But hey, restricting the range seems sensible to me. We only really need 32 ports (maybe 64 or 128 someday) at a time for parallelization, and even then only for this test and for any future tests written to use this test type.

alexpott’s picture

Status: Needs work » Needs review
FileSize
7.91 KB
5.12 KB

So this goes back to #29 and also provides a final patch.

I think all the changes this makes are worthwhile:

• The locking is moved from checkPortIsAvailable() to findAvailablePort() - this makes sense - we shouldn't be doing any locking in checkPortIsAvailable().
• The lock helper methods are removed - locking is complex and I don't think we should have generic methods on the test base class - it's not as if the subclasses have a common need to lock.
• Cleanup at the end of \Drupal\BuildTests\Framework\Tests\BuildTestTest::testPortMany().

heddn’s picture

  +++ b/core/tests/Drupal/BuildTests/Framework/BuildTestBase.php
  @@ -449,11 +439,33 @@ protected function stopServer() {
  +      // Take a lock so that no other process can can the same port number even

Fun little nit. Made me smile: can use the... I don't think we are https://www.youtube.com/watch?v=T59EDTqqW0A

alexpott’s picture

@heddn thanks for the review

heddn’s picture

Status: Needs review » Needs work

  +++ b/core/tests/Drupal/BuildTests/Framework/BuildTestBase.php
  @@ -449,11 +439,33 @@ protected function stopServer() {
  +      $port = random_int(1024, 9999);

We should comment why 9999, since this seems like a bit of a magic number.

Otherwise, I think these changes are fine.

Mile23’s picture

  +++ b/core/tests/Drupal/BuildTests/Framework/BuildTestBase.php
  @@ -449,11 +439,33 @@ protected function stopServer() {
  +    // Prevent the same process finding the same port by storing a list of ports
  +    // already discovered. This also stores locks so they are not released until
  +    // the test class is destroyed.
  +    static $ports = [];

This shouldn't be static, because otherwise we never release the lock objects until the runtime is over. It should be a class property that works the same way: $this->ports[$port_number] = $lock; It'll also need to be unset in tearDown().

Under run-tests.sh it works either way because we run one test class per process, so the runtime ends pretty quickly. Under phpunit, however, those locks will be alive until after all tests have run and all reports are generated. (There are about a thousand variations of this test listener which unsets test class properties for this reason: https://packagist.org/packages/mybuilder/phpunit-accelerator)

Also, since it's per object now and not per class, subsequent objects can start over at 8000 or whatever, and not be disallowed from trying a port that might have opened up since the last poll.
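A sketch of the shape Mile23 describes, with the property dropped in tearDown() (the property name and lock type are illustrative):

  // Sketch only: keep the port locks on the test object rather than in a
  // static variable, and release them when the test instance is torn down.
  /** @var object[] Locks keyed by port number. */
  protected $portLocks = [];

  protected function tearDown() {
    // Dropping the references lets the lock objects destruct and free the
    // ports now, instead of at the end of the whole PHPUnit run.
    $this->portLocks = [];
    parent::tearDown();
  }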

alexpott’s picture

Status: Needs work » Needs review
FileSize
2.18 KB
5.69 KB

@heddn & @Mile23 good calls...

heddn’s picture

Status: Needs review » Reviewed & tested by the community

I'm satisfied if this comes back green.

catch’s picture

Status: Reviewed & tested by the community » Fixed

Committed 9a70cb8 and pushed to 8.8.x. Thanks!

• catch committed 9a70cb8 on 8.8.x
  Issue #3086179 by alexpott, greg.1.anderson, Mixologic, heddn, Charlie...

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.