Problem/Motivation
It would be useful to have automated performance testing for Drupal core. Manual performance testing is sometimes required for issues, but it has a number of limitations.
Much like test coverage for bugs, without a test it's easier to introduce a regression than it is to fix one. This is because performance regressions don't always look like 'performance' issues.
Even where we do manual performance testing, it can be hard to determine what and how to test - benchmarks, xhprof, Blackfire, devtools, EXPLAIN, etc. People often struggle to produce useful performance data - i.e. ensuring that before/after comparisons are done on a site in exactly the same state, for things like whether the cache is warm or not. It's also not easy to present performance data back to issues - links to Blackfire go stale before issues get fixed, xhprof screenshots aren't accessible, etc.
If we had performance testing built into our CI framework, we'd automatically see the gains from performance improvements and catch performance regressions. It would also provide examples for people to apply to manual testing, or to extend coverage when new improvements are added or regressions are found.
Steps to reproduce
Below are some recently fixed issues that introduced what should be measurable improvements or regressions. We can revert these to check whether performance testing shows a difference.
#3167034: Leverage the 'loading' html attribute to enable lazy-load by default for images in Drupal core
#1014086: Stampedes and cold cache performance issues with css/js aggregation
#2695871: Aggregation creates two extra aggregates when it encounters {media: screen} in a library declaration
#3327856: Performance regression introduced by container serialization solution
Proposed resolution
There are broadly two types of performance tests we can do:
1. Absolute/objective/hard-coded/deterministic - write a PHPUnit test that asserts a certain thing (database queries, network requests) only happens a certain number of times on a certain request.
An example of this that we already have in core is #2120457: Add test to guarantee that the Standard profile does not load any JavaScript for anonymous users on critical pages. These tests let us fail commits on regressions, but the number of things we can check this way is extremely limited - the metric needs to be consistent across hardware and configurations. Tests will also need adjusting for functional changes as well as actual regressions: an extra block on the Umami front page, say 'vegetable of the day', could mean an extra HTTP request for an image, but that wouldn't be a 'regression' as such, just a new UX element in Umami.
2. Relative/subjective/dynamic/non-deterministic - metrics which are useful, but which vary with hardware, network, and whatever else the machine is doing (like running other PHPUnit tests). For these, we can collect certain metrics (time to first byte, largest contentful paint, entire xhprof runs), store them permanently outside the test itself, i.e. with OpenTelemetry, then graph them over time, compare runs, show traces from specific pages, etc. This might allow us to do things like compare the runs between a known state like 10.0.0 and an MR, if we can find a way to show diffs.
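The first, deterministic kind of test might look like the following sketch, loosely modelled on the PerformanceTestBase approach from #3346765. Treat this as illustrative: the namespace, theme choice, and the asserted counts are assumptions, not the exact core API.

```php
<?php

namespace Drupal\Tests\my_module\FunctionalJavascript;

use Drupal\FunctionalJavascriptTests\PerformanceTestBase;

// Hedged sketch: based on the PerformanceTestBase work in #3346765; the
// asserted numbers are placeholders, not real front-page counts.
class FrontPagePerformanceTest extends PerformanceTestBase {

  protected $defaultTheme = 'olivero';

  public function testFrontPagePerformance(): void {
    // Collect request/asset metrics while the page is loaded.
    $performance_data = $this->collectPerformanceData(function () {
      $this->drupalGet('<front>');
    }, 'frontPage');

    // Hard-coded expectations: a patch that adds an extra stylesheet or
    // script aggregate to the front page fails CI and must justify itself.
    $this->assertSame(2, $performance_data->getStylesheetCount());
    $this->assertSame(1, $performance_data->getScriptCount());
  }

}
```

The trade-off described above applies: these assertions are cheap to run and fail loudly, but every deliberate functional change that adds an asset requires updating the expected counts.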
Remaining tasks
#3346765: Add PerformanceTestBase for allowing browser performance assertions within FunctionalJavaScriptTests adds PerformanceTestBase and allows counting of actual network requests via chromedriver.
#3352459: Add OpenTelemetry Application Performance Monitoring to core performance tests sends various non-deterministic data to OpenTelemetry for graphs/trends and possibly alerts.
#3352389: Add open-telemetry/sdk and open-telemetry/exporter-otlp as dev dependencies
Add more data collection for both phpunit assertions and OpenTelemetry
#3352851: Allow assertions on the number of database queries run during tests
#3354347: Add xhr and BigPipe assertions to PerformanceTestTrait
Needs issue: add support for database query logging - we can count the number of queries by SELECT/UPDATE/INSERT/DELETE, query time, etc. A possible 'absolute' test would be asserting the number of database queries executed on a warm page cache request.
Needs issue: consider adding a trait that handles instrumentation for unit/kernel/functional tests.
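As a sketch of the query-logging idea above, a small helper could tally logged statements by verb so a test can assert exact counts. This is plain PHP; the function name and log format are hypothetical, not an existing core API.

```php
<?php
// Hypothetical helper: classify logged SQL statements by their first verb
// so a test can assert e.g. 'exactly 2 SELECTs on a warm page cache hit'.
function tally_queries(array $queries): array {
  $counts = ['SELECT' => 0, 'INSERT' => 0, 'UPDATE' => 0, 'DELETE' => 0, 'OTHER' => 0];
  foreach ($queries as $query) {
    // First whitespace-delimited token of the statement, upper-cased.
    $verb = strtoupper(strtok(ltrim($query), " \t("));
    $counts[array_key_exists($verb, $counts) ? $verb : 'OTHER']++;
  }
  return $counts;
}

// Example log, in the shape Drupal's database logger could provide.
$log = [
  'SELECT cid, data FROM {cache_page} WHERE cid = :cid',
  'SELECT name, value FROM {key_value} WHERE collection = :collection',
  'UPDATE {sessions} SET timestamp = :timestamp WHERE sid = :sid',
];
print_r(tally_queries($log));
```

A test could then assert on the returned array, which gives a much more actionable failure message than a single total query count.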
Comments
Comment #1
catch commented
Suggested Drupal 6 site:
50,000 nodes.
20,000 comments.
20,000 users.
Draft test plan, several user classes with different behaviour.
Authenticated:
Casual (x100):
1. Log in
2. Visit user/$uid
3. Visit node/$nid
4. Log out.
Regular (x100):
1. Log in.
2. Hit 50 random /node/$nid paths from a pool of 500.
3. Hit 10 taxonomy/term/$tid paths.
4. Hit the front page once.
5. Possibly hit known bottlenecks like /tracker and /forum once each too.
6. Post one comment.
7. Log out.
Editor (x10):
1. Log in
2. Post a node
3. Edit a node.
4. Go to admin/content
5. Log out.
Anonymous users (x1000):
Casual:
1. Visit one random node/$nid page out of 500 and bounce.
Regular:
1. Visit front page
2. Visit 5 random node/$nid paths out of 500.
3. Visit 3 taxonomy/term/$tid pages.
This should give us a mixture of cache misses, cache hits, and cache rebuilds.
We'd absolutely need to be able to see separate results for each of these. In initial testing of JMeter + D7 with Jacob Singh, we found that logging in and out is extremely expensive - we may need to somehow factor that out of tests (like having a single authenticated user make 200 requests instead of four users making 50 each).
We also need to be able to easily separate authenticated user results from anonymous user results - since if a regression only affects pages for authenticated users, lightning-fast throughput elsewhere may make it really hard to detect in the overall figures.
Comment #2
Anonymous (not verified) commented
subscribe.
Comment #3
int commented
subscribe
Comment #4
moshe weitzman commented
For these purposes, logging out will just mean not sending the session cookie.
If we want to simplify further, we can just chart HEAD performance over time and skip the comparison against prior Drupal. I'm not really sure how that works. We didn't have a toolbar or fields in D6. Do we test with a few fields on our nodes or not?
We might consider 6 months of refining the performance suite in contrib before we put it in core. Simpletest had years in contrib. But that's a detail. As catch says, anything is a big improvement over today.
Comment #6
catch commented
My plan with testing D6 as well as HEAD is that we know D6 performance is more or less stable - that means over time we should be able to account somewhat for server variations when graphing. It's not the most important thing, but it would be nice to have a control.
Toolbar and fields - yeah, that's tricky. We could compare the upgraded Drupal 6, then also have a separate test for a site heavily loaded with lots of fields etc. I think we'll hopefully end up with a few different tests for different known pain points.
Comment #7
webchick commented
Just subscribing for now. Don't have anything to add other than a big-ass +1. :)
Comment #8
chx commented
Important, yes, but I can't see how this blocks the release of Drupal 7.
Comment #9
catch
Comment #10
scroogie commented
I think that would be really awesome. I'm glad that performance is of importance in Drupal development.
Comment #11
effulgentsia commented
Subscribing. Heck yeah! If possible, it would so rock to have this become part of the D8 dev process, so whenever a patch to D8 is submitted, there's a report as to how it impacts performance. Even if it's not fully integrated into the process until D9 development, it would still be awesome to be able to request a performance report for a particular patch. And if we're able to get anything along these lines up and running in some form within the next couple months, it will help with optimization efforts prior to D7 release.
Comment #12
carlos8f commented
Subscribing. I want to contribute to this if I can.
Comment #13
carlos8f commented
Thinking about this a bit...
A few things: I'm very excited about this project bringing more overall attention to performance, and accountability through documentation. Not to mention that we currently rely on poor catch for everything :)
Comment #14
boombatower commented
1) Attempt and thoughts on the update test: #377856: Provide D6 to D7 update test
2) I assume we want to integrate this into qa.drupal.org in some manner? I am happy to help.
Comment #15
carlos8f commented
In IRC, catch made the point that the D6 benchmark is also intended as a baseline that the benchmark server/hardware is tested against to ensure a level playing field. I think it's smart to adjust for the test server's abilities, but I would rather do it with a one-time calibration than before every D7 benchmark. The calibration could also take into account filesystem, network, and CPU speed. The problem is, applying the adjustment correctly to benchmarks would be tricky and a flat multiplier would definitely not work. Ultimately, the benchmark results are relative to the test machine, so it might be that we just have to live with that variability.
In IRC we also discussed that, apart from hitting common paths with ab/siege/jmeter, the benchmark suite should support what we call 'unit benchmarking' or 'microbenchmarking'. This would basically be bootstrapping Drupal and running one function, say menu_get_item() or _menu_item_localize(), 1,000-1,000,000 times, averaging the run time, and comparing changes that way. This way we can isolate the actual code from the bootstrap process, or, if the change affects the bootstrap, isolate the bootstrap from the code.
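The microbenchmarking idea above can be sketched in a few lines of plain PHP. The function under test and the iteration count are placeholders, and hrtime() assumes a modern PHP (7.3+), much newer than the D6/D7 era under discussion.

```php
<?php
// Minimal microbenchmark harness: call one function many times and report
// the mean wall time per call. Callable and iteration count are placeholders.
function microbench(callable $fn, int $iterations = 100000): float {
  $start = hrtime(true);
  for ($i = 0; $i < $iterations; $i++) {
    $fn();
  }
  // Mean nanoseconds per call.
  return (hrtime(true) - $start) / $iterations;
}

$ns = microbench(fn() => str_replace('@name', 'world', 'Hello @name'));
printf("%.1f ns/call\n", $ns);
```

In a real suite, the harness would bootstrap Drupal first and benchmark a function like menu_get_item(), comparing the per-call mean before and after a patch.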
Comment #16
Anonymous (not verified) commented
subscribe
Comment #17
Juanlu001 commented
Subscribing
Comment #18
casey commented
This might be interesting
http://www.hell.org.ua/Docs/oreilly/webprog/pcook/ch08_26.htm
Comment #19
casey commented
Hmmm, register_tick_function() is really interesting. e.g.:
Unfortunately not all statements are tickable. Typically, condition expressions and argument expressions are not tickable, and the bodies of internal functions aren't tickable either. But we could write a code generator (we could use the PGP module here) that creates a compiled version of Drupal that increments the tick counter manually.
Would become for example
Just a crazy idea...
But isn't there some library that counts clock cycles or something similar instead of execution time? That would make comparisons a lot easier.
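The tick-counting idea can be sketched in plain PHP. As noted above, condition and argument expressions don't tick, so the count approximates statements executed rather than matching them exactly; but unlike wall time, it is stable across hardware.

```php
<?php
// Count tickable statements executed inside a declare(ticks=1) block.
$tick_count = 0;
$counter = function () use (&$tick_count) {
  $tick_count++;
};
register_tick_function($counter);

declare(ticks=1) {
  $sum = 0;
  for ($i = 0; $i < 100; $i++) {
    $sum += $i;
  }
}

unregister_tick_function($counter);
// Tick count measures work done, not time taken, so before/after
// comparisons don't depend on machine speed.
echo "sum=$sum, ticks=$tick_count\n";
```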
Comment #20
willieseabrook commented
subscribe
Comment #21
damien tournoud commented
I wrote this proof of concept one year ago:
http://drupal.org/project/scenario
The idea is to use the internal browser of simpletest to build a set of test scenarios. The runner can definitely be improved, but the idea is simple and sound.
Comment #22
figaro commented
@#18: Link no longer operational. Please see here alternatively:
http://0-0.at/doc/webprog/pcook/ch08_26.htm
Comment #23
alippai commented
A simple xdebug profiler log to qa.drupal.org would be nice - we could catch future performance bottlenecks.
Comment #24
damien tournoud commented
@alippai: that would be a *huge file*.
Comment #25
webchick
Comment #26
podarok commented
subscribe
Comment #27
dave reid.
Comment #28
erikwebb commented
I second #11 wholeheartedly. Generalizing this would be great for d.o dev, but also testing of newly developed features offline.
JMeter testing is very hard for new users to configure, so an easy interface within Drupal would be a huge win. The difficulty would then be in parallel testing similar to JMeter. Is the end goal to be able to accurately simulate a load event (a la JMeter) or simply to measure individual request scenarios?
Comment #29
catch commented
So I think we should approach performance testing somewhat the same way that we approach automated testing for functionality - primarily as a way to stop regressions.
Here are some regressions that I would have absolutely loved to have been found by automated testing before or immediately after commit, so I didn't have to come across them days, months, or years after they were committed. We might not see exactly the same regressions again, but for example page caching was hamstrung at least twice in Drupal 7 by innocent-looking clean-up patches, so that's an obvious one to keep an eye on; some others will be much harder. I'm picking issues that are in my mind at the moment - there may well be better examples.
Regression 1: page caching broken
#1064212: Page caching performance has regressed by 30-40%
The issue where the regression was introduced: #978144: cache_get_multiple() inconsistent with cache_get(), and another issue where I was involved in trying to introduce it but which was abandoned: #344088: cache.inc cannot be fully converted to dbtng. Note that none of those issues discussed performance except to dismiss it - the patches looked completely innocent, just cleanup.
Regression 2: page caching broken (again)
#623992: Reduce {system} database hits is where this was both introduced and fixed (see http://drupal.org/node/623992#comment-2248022 for benchmarks). This started out as a performance issue (and, while I haven't bisected, may have been responsible for the smaller bug found in #978144: cache_get_multiple() inconsistent with cache_get()); then it was 'cleaned up' and page caching deteriorated by 50% that time (instead of 40% like the first time).
Regression 3: memory usage up 500kb
#887870: Make system_list() return full module records introduced the regression. I insisted on benchmarks in that issue; the benchmarks were fine but completely ignored memory, and the problem was only discovered later in #1061924: system_list() memory usage. So we should try to track at least peak memory usage of some kind of standard page over time, as well as requests per second.
Regression 4: file system scans on admin
#1014130: install_profile_info() does a file system scan on every request to admin/config (and etc.)
The regression was introduced by #509398: Install profiles should be modules with full access to the Drupal API and all it entails (.install files, dependencies, update_x). It was a new feature, and the hunk in the patch wasn't reviewed in that issue at all, let alone for performance. It went unnoticed for 18 months - at least in my case because I'd never profiled /admin with anything other than the Standard profile until I specifically needed to for work. I can't think of an automated performance testing plan that would cover this unless it was specifically looking for it.
More examples would be welcome. When we eventually set something up, we should intentionally break it to see how well it picks things up.
Comment #30
kirkilj commented
New to Drupal, first-time commenter, old-school web developer (Oracle).
I found my way to this issue after tracking #1064212: Page caching performance has regressed by 30-40% and reading http://ca.tchpole.net/node/2 and http://drupal.org/node/1020494.
At my local Drupal users group meeting last week, Drupal 7 was rumored, once again, to be slower than D6, at which point several attendees acknowledged that they'll take another look in a year or two. I'm one of several prospective Drupal developers in our city who is trying to decide between D6 or D7 as a starting platform, and when I hear these concerns, it gives me pause, even though I have a one-year runway for my first large project.
In addition, some of the new underpinnings of D7 are architectural in nature, but can appear esoteric to new developers. I myself am excited about the future possibilities with RDF and the database abstraction layer going forward, but most developers can't easily map those benefits to their clients' needs in the near to mid-term. D6 developers already know how to get around D6's issues, so they are looking for a compelling reason to take time away from their D6 projects/skills and invest in D7. Performance degradation is a fast way to terminate the conversation.
At my day job, for a semiconductor company, I interact with people whose sole job is performance analysis, where numbers in the nanosecond and picosecond range are bandied about as they compare our designs to the competition. These aren't the same people who design or implement the functional characteristics of a chip. They aren't emotionally invested in a particularly elegant architecture or an implementation of that architecture. They just tell it like it is. They spend all of their time writing and evaluating performance tests using a simulator before a design is sent to a fab, because it's a very expensive proposition ($millions) to implement performance improvements after the fact. It could require a complete respin of design, verification, layout and tape-out to the fabs. Fortunately, software is a bit more malleable, but some aspects still carry over. Once a product has a reputation, deservedly or not, for being slow, the damage can be hard to repair until the next product generation, when people are willing to take another look.
If there was an automated performance test suite that ran on a periodic basis, either nightly or weekly, or perhaps on-demand, performance improvements and regressions could be identified fairly quickly and it would expedite the task of determining root cause. Perhaps an effort analogous to SimpleTest could be made a priority. It appears that serious thought has already gone into this issue and some of the tooling may already be at hand, both for core and contrib.
If my battle scars could talk, they would urge you to put the creation of a seamless and automatic performance testing environment at the head of the list, before making things potentially worse with enhancements or bug fixes for D8/D7.
Comment #31
boombatower commented
Subscribe
Comment #32
ijf8090 commented
Subscribe
Comment #33
chiddicks commented
I've been thinking about this issue for a while now and I've got a few thoughts for moving forward. I think it might be smarter to approach this from a code analysis standpoint rather than time-based benchmarking, per se. The trouble with benchmarking is that it is so exposed to external influences - other processes running on the machine, faster or slower machines, etc. Obviously you would standardize the machine and reduce those effects as much as possible, but that's tough, and somewhat difficult to sustain. Not impossible, mind.
What we're trying to establish is the degree to which an individual patch would affect Drupal's overall performance. We're often looking at a small bit of code - take #1135950: Remove static caching in t() as an example - and extrapolating what impact it is going to have. These patches might range from a typo fix in a comment, which would have no discernible impact, to something like a modification of the t() function, which might run 300 times per pageload.
So I'm curious about what tools are out there for analyzing code executions, less from a CPU time point-of-view than statistics on number of function calls, memory fetches, recursions, etc - anything that would be useful in predicting performance issues. A reasonable first step for automated testing, I think, is to raise red flags so that the patch can be scrutinized further, should possible performance issues result from a patch. So, looking to see if functions are called significantly more times, or significantly more memory is used, or if database queries are taking significantly more time to execute. Thoughts on this approach?
Comment #34
catch commented
@chiddicks - there's active work going on with this at http://drupal.org/project/performance_testing, as well as some previous discussions at http://drupal.org/sandbox/catch/1186744.
Short version: we're trying to use xhprof and cgroups - which can track function calls and memory usage, as well as CPU time instead of wall time - to get that kind of data, and to allow people to write specific tests (i.e. it should be easy to write a test that calls t() 1000 times, then compare before/after).
Comment #41
andypost
Comment #49
catch commented
Thirteen years later: #3346765: Add PerformanceTestBase for allowing browser performance assertions within FunctionalJavaScriptTests.
Comment #50
catch
Comment #51
catch
Comment #52
catch
Comment #53
catch
Comment #54
catch
Comment #56
catch commented
#3391689: Add a performance tests job to gitlab and send data to OpenTelemetry landed, which means we actually have 'automated performance testing for core' now!
However, it still needs work to make it useful for finding performance regressions. I'm making some progress on #3352851: Allow assertions on the number of database queries run during tests, which for me is the highest priority thing to add.
Comment #57
joseph.olstad commented
Great news! Now that we have the beginnings of performance metrics, what we might need to add next after #3352851 is a scheduled performance test that runs daily or weekly. Part of the test would be creating 50,000 nodes, 5,000 menu links, 500 taxonomy terms, 300 blocks, and other entity types as well, then running through some operational tasks and monitoring whether performance improves or degrades between one period's worth of commits and the next, red-flagging all the related commits in any period that slows performance down.
The idea is to simulate some sort of realistic situation where performance would come into play.
Comment #58
slashrsm commented
Comment #59
andypost commented
There are only two child issues left; both are features, so the meta could be considered fixed.
Comment #60
catch commented
Yep, there are some other issues floating around too, but all the basics are in place, and if we want to do major new things those could have another plan issue! 15 years...
Comment #61
joseph.olstad commented
Awesome work on this! Very important milestone!
Comment #62
fabianx commented
Fantastic work, catch!