Symptom: Tests on testbots complete, but then start over, cycling endlessly.

Root Cause: Testbot is not able to send results to qa.d.o.

- Testbot watchdog log contains: “Failed to send result: Request Entity Too Large”
- The error is raised at line 95 of pifr_client_xmlrpc_send_result() in pifr_client.xmlrpc.inc.
- That function is called by pifr_client_review_run() in pifr_client.review.inc.
- It is passed $review->get_result(), defined at line 355 of pifr_simpletest.client.inc.
- The result contains a large array of results for a given test run.
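
To illustrate why a large result array trips an HTTP 413, the following is a minimal Python sketch (not PIFR code) that measures the serialized XML-RPC body for a result and compares it to a limit. The method name "pifr.result", the shape of the result array, and the 1,000,000-byte limit are all assumptions made for illustration; the actual threshold is not stated in this issue.

```python
import xmlrpc.client

# Hypothetical limit; the real threshold on the proxy or server is not given here.
ASSUMED_MAX_BODY_BYTES = 1_000_000


def xmlrpc_body_size(method_name, params):
    """Return the size in bytes of the XML-RPC request body for this call."""
    body = xmlrpc.client.dumps(params, methodname=method_name)
    return len(body.encode("utf-8"))


def exceeds_limit(method_name, params, limit=ASSUMED_MAX_BODY_BYTES):
    """Return True if the serialized request body is larger than limit."""
    return xmlrpc_body_size(method_name, params) > limit


# Stand-in for a test result carrying many assertions (structure is illustrative).
large_result = {
    "assertions": [{"message": "debug output", "status": "pass"} for _ in range(20000)]
}
small_result = {"assertions": []}

print(xmlrpc_body_size("pifr.result", (large_result,)))
print(exceeds_limit("pifr.result", (large_result,)))
```

A check like this on the client side would at least allow the oversized case to be logged before the send is attempted.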

Additionally, after 3-4 test cycles on the same patch following this pattern, Apache segfaults.

Short-term resolution:

There is a views debug() statement which contributes 10k assertions to every test. Clearing it out should open up the communications channel again, at least for the average test. Patch is at https://drupal.org/node/1822048#comment-7479778
- [May 31st, 1am CST] This patch has been applied, and D8 tests will process successfully once D8 HEAD is unbroken.

Medium-term resolution:

Identify the network/server element which is causing the HTTP 413 response, and reconfigure it to allow the communication of large payloads.
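
One way to narrow down which element applies the limit is to probe each hop separately with payloads of increasing size and compare the thresholds. The Python sketch below is a hedged illustration only: try_send is a hypothetical callable that would issue a request of the given size against one hop and report whether it was accepted, and fake_hop stands in for a real endpoint. It assumes that acceptance depends only on body size.

```python
def largest_accepted_size(try_send, low, high):
    """Binary-search the largest payload size (in bytes) that try_send accepts.

    try_send(size) must return True when a request of that size succeeds and
    False when it is rejected (for example with HTTP 413). Assumes that if a
    size is rejected, every larger size is rejected too.
    """
    if not try_send(low):
        raise ValueError("even the smallest probe was rejected")
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if try_send(mid):
            low = mid
        else:
            high = mid - 1
    return low


def fake_hop(size, limit=1024 * 1024):
    """Stand-in for a real hop: rejects bodies larger than 1 MiB."""
    return size <= limit


print(largest_accepted_size(fake_hop, 1, 8 * 1024 * 1024))
```

Running the same probe against the proxy and against the servers behind it would show which one has the lowest threshold.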

Long-term resolution:

Refactor testbot communications to support parallel test processing, intermediate batch results, and greatly reduce the amount of information which needs to be transferred between the testbots and qa.d.o in each communications exchange.
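
As a rough illustration of the intermediate batch results idea, the Python sketch below splits a list of assertions into batches that each stay under a size budget, so that no single exchange carries the whole result set. JSON encoding, the 500-byte budget, and the assertion structure are assumptions chosen for the example; this is not how PIFR serializes results today.

```python
import json


def batch_by_size(items, max_bytes):
    """Yield lists of items whose JSON-encoded size stays within max_bytes.

    The size estimate is slightly conservative. An item that alone exceeds
    max_bytes is still yielded, in a batch of its own.
    """
    batch, batch_bytes = [], 2  # 2 bytes for the enclosing "[]"
    for item in items:
        # +2 accounts for the ", " separator json.dumps places between items.
        item_bytes = len(json.dumps(item).encode("utf-8")) + 2
        if batch and batch_bytes + item_bytes > max_bytes:
            yield batch
            batch, batch_bytes = [], 2
        batch.append(item)
        batch_bytes += item_bytes
    if batch:
        yield batch


# Illustrative assertions; the field names are hypothetical.
assertions = [{"id": i, "message": "x" * 50} for i in range(100)]
batches = list(batch_by_size(assertions, 500))
print(len(batches))
```

Each batch could then be sent as its own request, which keeps every payload below whatever limit the intermediate servers enforce.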

Comment   File              Size      Author
#2        2008626-1.patch   1.03 KB   jthorson

Comments

jthorson

Assigned: Unassigned » jthorson
Priority: Normal » Critical
jthorson

Status: Active » Needs review
Status: new, Size: 1.03 KB

Temporary workaround.

jthorson

https://drupal.org/node/1822048#comment-7479778 committed. Trying that first, as it should confirm the diagnosis.

jthorson

HTTP 413 responses were still seen this morning on a 2MB request, but they are no longer being sent by the proxy, where they were seen earlier. Updating the max_client_body_size parameter appears to have unplugged things there.

We're now looking at 1 in 100 tests, rather than every single test, which makes this harder to debug (but easier on my blood pressure). Next time we run into it, I'd suggest upping the limits on the testbot side of things, to see if that might be where the limit is being applied.

jthorson

Infra ticket opened at #2009884: Nginx blocking large requests, including testbot communications and file uploads.

Leaving this open to see if we might be able to enhance PIFR to better handle the failure scenario.
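
One possible shape for handling the failure scenario is to fall back to a reduced result when the full one is rejected, rather than restarting the test. The Python sketch below is illustrative only: ResultTooLarge, send, and summarize are hypothetical names, and fake_send simulates a server that rejects results with more than 10 assertions.

```python
class ResultTooLarge(Exception):
    """Raised by the hypothetical sender when the server answers HTTP 413."""


def report_result(send, result, summarize):
    """Send result once; on HTTP 413, send a summary instead of retesting.

    Returns "sent", "sent_summary", or "failed".
    """
    try:
        send(result)
        return "sent"
    except ResultTooLarge:
        pass
    try:
        send(summarize(result))
        return "sent_summary"
    except ResultTooLarge:
        return "failed"


def fake_send(result):
    """Stand-in sender: rejects results carrying more than 10 assertions."""
    if len(result["assertions"]) > 10:
        raise ResultTooLarge()


def summarize(result):
    """Drop the assertion details and keep only a count."""
    return {"assertions": [], "assertion_count": len(result["assertions"])}


big = {"assertions": ["a"] * 10000}
print(report_result(fake_send, big, summarize))
```

With a fallback like this, a rejected send ends in a terminal state instead of the endless retest cycle described in the summary.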

jthorson

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

Anonymous

Updated issue summary