Problem/Motivation
It would be useful to have automated performance testing for Drupal core. Manual performance testing is sometimes required for issues, but it has a number of limitations.
Much like test coverage for bugs, without a test it's easier to introduce a regression than it is to fix one. This is because performance regressions don't always look like 'performance' issues.
Even where we do manual performance testing, it can be hard to determine what and how to test - benchmarks, xhprof, Blackfire, devtools, EXPLAIN, etc. People often struggle to produce useful performance data - i.e. ensuring that before/after comparisons are done on a site in exactly the same state, for things like whether the cache is warm or not. It's also not easy to present performance data back to issues - links to Blackfire go stale before issues get fixed, xhprof screenshots aren't accessible, etc.
If we had performance testing built into our CI framework, we'd automatically see the gains from performance improvements and catch performance regressions. It would also provide examples for people to apply to manual testing, or to extend coverage when new improvements are added or regressions are found.
Steps to reproduce
Below are some recently fixed issues that introduced what should be measurable improvements or regressions. We can revert these to check whether performance testing shows a difference.
#3167034: Leverage the 'loading' html attribute to enable lazy-load by default for images in Drupal core
#1014086: Stampedes and cold cache performance issues with css/js aggregation
#2695871: Aggregation creates two extra aggregates when it encounters {media: screen} in a library declaration
#3327856: Performance regression introduced by container serialization solution
Proposed resolution
There are broadly two types of performance tests we can do:
1. Absolute/objective/hard-coded/deterministic - write a PHPUnit test that asserts a certain thing (database queries, network requests) only happens a certain number of times on a certain request.
An example of this that we already have in core is #2120457: Add test to guarantee that the Standard profile does not load any JavaScript for anonymous users on critical pages. These tests let us fail commits on regressions, but the number of things we can check this way is extremely limited - the metric needs to be consistent across hardware and configurations. Tests will also need adjusting for functional changes as well as actual regressions: an extra block on the Umami front page, say 'vegetable of the day', could mean an extra HTTP request for an image, but that wouldn't be a 'regression' as such, just a new UX element in Umami.
2. Relative/subjective/dynamic/non-deterministic - metrics which are useful, but which vary with hardware, network, and whatever else the machine is doing (like running other PHPUnit tests). For these, we can collect certain metrics (time to first byte, largest contentful paint, entire xhprof runs), store them permanently outside the test itself, i.e. with OpenTelemetry, then graph them over time, compare runs, show traces from specific pages, etc. This might allow us to do things like compare the runs between a known state like 10.0.0 and an MR, if we can find a way to show diffs.
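The first, deterministic kind of test might look like the following sketch, loosely modelled on the PerformanceTestBase approach from #3346765. Treat this as illustrative: the namespace, theme choice, and the asserted counts are assumptions, not the exact core API.

```php
<?php

namespace Drupal\Tests\my_module\FunctionalJavascript;

use Drupal\FunctionalJavascriptTests\PerformanceTestBase;

// Hedged sketch: based on the PerformanceTestBase work in #3346765; the
// asserted numbers are placeholders, not real front-page counts.
class FrontPagePerformanceTest extends PerformanceTestBase {

  protected $defaultTheme = 'olivero';

  public function testFrontPagePerformance(): void {
    // Collect request/asset metrics while the page is loaded.
    $performance_data = $this->collectPerformanceData(function () {
      $this->drupalGet('<front>');
    }, 'frontPage');

    // Hard-coded expectations: a patch that adds an extra stylesheet or
    // script aggregate to the front page fails CI and must justify itself.
    $this->assertSame(2, $performance_data->getStylesheetCount());
    $this->assertSame(1, $performance_data->getScriptCount());
  }

}
```

The trade-off described above applies: these assertions are cheap to run and fail loudly, but every deliberate functional change that adds an asset requires updating the expected counts.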
Remaining tasks
#3346765: Add PerformanceTestBase for allowing browser performance assertions within FunctionalJavaScriptTests adds PerformanceTestBase and allows counting of actual network requests via chromedriver.
#3352459: Add OpenTelemetry Application Performance Monitoring to core performance tests sends various non-deterministic data to OpenTelemetry for graphs/trends and possibly alerts.
#3352389: Add open-telemetry/sdk and open-telemetry/exporter-otlp as dev dependencies
Add more data collection for both phpunit assertions and OpenTelemetry
#3352851: Allow assertions on the number of database queries run during tests
#3354347: Add xhr and BigPipe assertions to PerformanceTestTrait
Needs issue: add support for database query logging - we can count the number of queries by SELECT/UPDATE/INSERT/DELETE, query time, etc. A possible 'absolute' test would be asserting the number of database queries executed on a warm page cache request.
Needs issue: consider adding a trait that handles instrumentation for unit/kernel/functional tests.
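As a sketch of the query-logging idea above, a small helper could tally logged statements by verb so a test can assert exact counts. This is plain PHP; the function name and log format are hypothetical, not an existing core API.

```php
<?php
// Hypothetical helper: classify logged SQL statements by their first verb
// so a test can assert e.g. 'exactly 2 SELECTs on a warm page cache hit'.
function tally_queries(array $queries): array {
  $counts = ['SELECT' => 0, 'INSERT' => 0, 'UPDATE' => 0, 'DELETE' => 0, 'OTHER' => 0];
  foreach ($queries as $query) {
    // First whitespace-delimited token of the statement, upper-cased.
    $verb = strtoupper(strtok(ltrim($query), " \t("));
    $counts[array_key_exists($verb, $counts) ? $verb : 'OTHER']++;
  }
  return $counts;
}

// Example log, in the shape Drupal's database logger could provide.
$log = [
  'SELECT cid, data FROM {cache_page} WHERE cid = :cid',
  'SELECT name, value FROM {key_value} WHERE collection = :collection',
  'UPDATE {sessions} SET timestamp = :timestamp WHERE sid = :sid',
];
print_r(tally_queries($log));
```

A test could then assert on the returned array, which gives a much more actionable failure message than a single total query count.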
Comments
Comment #1
catch commented
Suggested Drupal 6 site:
50,000 nodes.
20,000 comments.
20,000 users.
Draft test plan, several user classes with different behaviour.
Authenticated:
Casual (x100):
1. Log in
2. Visit user/$uid
3. Visit node/$nid
4. Log out.
Regular (x100):
1. Log in.
2. Hit 50 random /node/$nid paths from a pool of 500.
3. Hit 10 taxonomy/term/$tid paths.
4. Hit the front page once.
5. Possibly hit known bottlenecks like /tracker and /forum once each too.
6. Post one comment.
7. Log out.
Editor (x10):
1. Log in
2. Post a node
3. Edit a node.
4. Go to admin/content
5. Log out.
Anonymous users (x1000):
Casual:
1. Visit one random node/$nid page out of 500 and bounce.
Regular:
1. Visit front page
2. Visit 5 random node/$nid paths out of 500.
3. Visit 3 taxonomy/term/$tid pages.
This should give us a mixture of cache misses, cache hits, and cache rebuilds.
We'd absolutely need to be able to see separate results for each of these. In initial testing of JMeter + D7 with Jacob Singh, we found that logging in and out is extremely expensive - we may need to somehow factor that out of tests (like having a single authenticated user make 200 requests instead of four users making 50 each).
We also need to be able to easily separate authenticated user results from anonymous user results - since if a regression only affects pages for authenticated users, lightning-fast throughput elsewhere may make it really hard to detect in the overall figures.
Comment #2
Anonymous (not verified) commented
subscribe.
Comment #3
int commented
subscribe
Comment #4
moshe weitzman commented
For these purposes, logging out will just mean not sending the session cookie.
If we want to simplify further, we can just chart HEAD performance over time and skip the comparison against prior Drupal. I'm not really sure how that works. We didn't have a toolbar or fields in D6. Do we test with a few fields on our nodes or not?
We might consider 6 months of refining the performance suite in contrib before we put it in core. Simpletest had years in contrib. But that's a detail. As catch says, anything is a big improvement over today.
Comment #6
catch commented
My plan with testing D6 as well as HEAD is that we know D6 performance is more or less stable - that means over time we should be able to account somewhat for server variations when graphing. It's not the most important thing, but it would be nice to have a control.
Toolbar and fields - yeah, that's tricky. We could compare the upgraded Drupal 6, then also have a separate test for a site heavily loaded with lots of fields etc. I think we'll hopefully end up with a few different tests for different known pain points.
Comment #7
webchick commented
Just subscribing for now. Don't have anything to add other than a big-ass +1. :)
Comment #8
chx commented
Important, yes, but I can't see how this blocks the release of Drupal 7.
Comment #9
catch
Comment #10
scroogie commented
I think that would be really awesome. I'm glad that performance is of importance in Drupal development.
Comment #11
effulgentsia commented
Subscribing. Heck yeah! If possible, it would so rock to have this become part of the D8 dev process, so whenever a patch to D8 is submitted, there's a report as to how it impacts performance. Even if it's not fully integrated into the process until D9 development, it would still be awesome to be able to request a performance report for a particular patch. And if we're able to get anything along these lines up and running in some form within the next couple months, it will help with optimization efforts prior to D7 release.
Comment #12
carlos8f commented
Subscribing. I want to contribute to this if I can.
Comment #13
carlos8f commented
Thinking about this a bit...
A few things: I'm very excited about this project bringing more overall attention to performance, and accountability through documentation. Not to mention that we currently rely on poor catch for everything :)
Comment #14
boombatower commented
1) Attempt and thoughts on the update test: #377856: Provide D6 to D7 update test
2) I assume we want to integrate this into qa.drupal.org in some manner? I am happy to help.
Comment #15
carlos8f commented
In IRC, catch made the point that the D6 benchmark is also intended as a baseline that the benchmark server/hardware is tested against to ensure a level playing field. I think it's smart to adjust for the test server's abilities, but I would rather do it with a one-time calibration than before every D7 benchmark. The calibration could also take into account filesystem, network, and CPU speed. The problem is, applying the adjustment correctly to benchmarks would be tricky and a flat multiplier would definitely not work. Ultimately, the benchmark results are relative to the test machine, so it might be that we just have to live with that variability.
In IRC we also discussed that, apart from hitting common paths with ab/siege/jmeter, the benchmark suite should support what we call 'unit benchmarking' or 'microbenchmarking'. This would basically be bootstrapping Drupal and running one function, say menu_get_item() or _menu_item_localize(), 1,000-1,000,000 times, averaging the run time, and comparing changes that way. This way we can isolate the actual code from the bootstrap process, or, if the change affects the bootstrap, isolate the bootstrap from the code.
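The microbenchmarking idea above can be sketched in a few lines of plain PHP. The function under test and the iteration count are placeholders, and hrtime() assumes a modern PHP (7.3+), much newer than the D6/D7 era under discussion.

```php
<?php
// Minimal microbenchmark harness: call one function many times and report
// the mean wall time per call. Callable and iteration count are placeholders.
function microbench(callable $fn, int $iterations = 100000): float {
  $start = hrtime(true);
  for ($i = 0; $i < $iterations; $i++) {
    $fn();
  }
  // Mean nanoseconds per call.
  return (hrtime(true) - $start) / $iterations;
}

$ns = microbench(fn() => str_replace('@name', 'world', 'Hello @name'));
printf("%.1f ns/call\n", $ns);
```

In a real suite, the harness would bootstrap Drupal first and benchmark a function like menu_get_item(), comparing the per-call mean before and after a patch.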
Comment #16
Anonymous (not verified) commented
subscribe
Comment #17
Juanlu001 commented
Subscribing
Comment #18
casey commented
This might be interesting
http://www.hell.org.ua/Docs/oreilly/webprog/pcook/ch08_26.htm
Comment #19
casey commented
Hmmm, register_tick_function() is really interesting. e.g.:
Unfortunately not all statements are tickable. Typically, condition expressions and argument expressions are not tickable, and the bodies of internal functions aren't tickable either. But we could write a code generator (we could use the PGP module here) that creates a compiled version of Drupal that increments the tick counter manually.
Would become for example
Just a crazy idea...
But isn't there some library that counts clock cycles or something similar instead of execution time? That would make comparisons a lot easier.
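The tick-counting idea can be sketched in plain PHP. As noted above, condition and argument expressions don't tick, so the count approximates statements executed rather than matching them exactly; but unlike wall time, it is stable across hardware.

```php
<?php
// Count tickable statements executed inside a declare(ticks=1) block.
$tick_count = 0;
$counter = function () use (&$tick_count) {
  $tick_count++;
};
register_tick_function($counter);

declare(ticks=1) {
  $sum = 0;
  for ($i = 0; $i < 100; $i++) {
    $sum += $i;
  }
}

unregister_tick_function($counter);
// Tick count measures work done, not time taken, so before/after
// comparisons don't depend on machine speed.
echo "sum=$sum, ticks=$tick_count\n";
```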
Comment #20
willieseabrook commented
subscribe
Comment #21
damien tournoud commented
I wrote this proof of concept one year ago:
http://drupal.org/project/scenario
The idea is to use the internal browser of simpletest to build a set of test scenarios. The runner can definitely be improved, but the idea is simple and sound.
Comment #22
figaro commented
@#18: Link no longer operational. Please see here alternatively:
http://0-0.at/doc/webprog/pcook/ch08_26.htm
Comment #23
alippai commented
A simple xdebug profiler log to qa.drupal.org would be nice - we could catch future performance bottlenecks.
Comment #24
damien tournoud commented
@alippai: that would be a *huge file*.
Comment #25
webchick
Comment #26
podarok commented
subscribe
Comment #27
dave reid.
Comment #28
erikwebb commented
I second #11 wholeheartedly. Generalizing this would be great for d.o dev, but also testing of newly developed features offline.
JMeter testing is very hard for new users to configure, so an easy interface within Drupal would be a huge win. The difficulty would then be in parallel testing similar to JMeter. Is the end goal to be able to accurately simulate a load event (a la JMeter) or simply to measure individual request scenarios?
Comment #29
catch commented
So I think we should approach performance testing somewhat the same way that we approach automated testing for functionality - primarily as a way to stop regressions.
Here are some regressions that I would have absolutely loved to have been found by automated testing before or immediately after commit, so I didn't have to come across them days, months, or years after they were committed. We might not see exactly the same regressions again, but for example page caching was hamstrung at least twice in Drupal 7 by innocent-looking clean-up patches, so that's an obvious one to keep an eye on; some others will be much harder. I'm picking issues that are in my mind at the moment - there may well be better examples.
Regression 1: page caching broken
#1064212: Page caching performance has regressed by 30-40%
The issue where the regression was introduced: #978144: cache_get_multiple() inconsistent with cache_get(), and another issue where I was involved in trying to introduce it but which was abandoned: #344088: cache.inc cannot be fully converted to dbtng. Note that none of those issues discussed performance except to dismiss it - the patches looked completely innocent, just cleanup.
Regression 2: page caching broken (again)
#623992: Reduce {system} database hits is where this was both introduced and fixed (see http://drupal.org/node/623992#comment-2248022 for benchmarks). This started out as a performance issue (and, while I haven't bisected, may have been responsible for the smaller bug found in #978144: cache_get_multiple() inconsistent with cache_get()); then it was 'cleaned up' and page caching deteriorated by 50% that time (instead of 40% like the first time).
Regression 3: memory usage up 500kb
#887870: Make system_list() return full module records introduced the regression. I insisted on benchmarks in that issue; the benchmarks were fine but completely ignored memory, and the problem was only discovered later in #1061924: system_list() memory usage. So we should try to track at least peak memory usage of some kind of standard page over time, as well as requests per second.
Regression 4: file system scans on admin
#1014130: install_profile_info() does a file system scan on every request to admin/config (and etc.)
The regression was introduced by #509398: Install profiles should be modules with full access to the Drupal API and all it entails (.install files, dependencies, update_x). It was a new feature, and the hunk in the patch wasn't reviewed in that issue at all, let alone for performance. It went unnoticed for 18 months - at least in my case because I'd never profiled /admin with anything other than the Standard profile until I specifically needed to for work. I can't think of an automated performance testing plan that would cover this unless it was specifically looking for it.
More examples would be welcome. When we eventually set something up, we should intentionally break it to see how well it picks things up.
Comment #30
kirkilj commented
New to Drupal, first-time commenter, old-school web developer (Oracle).
I found my way to this issue after tracking #1064212: Page caching performance has regressed by 30-40% and reading http://ca.tchpole.net/node/2 and http://drupal.org/node/1020494.
At my local Drupal users group meeting last week, Drupal 7 was rumored, once again, to be slower than D6, at which point several attendees acknowledged that they'll take another look in a year or two. I'm one of several prospective Drupal developers in our city who is trying to decide between D6 or D7 as a starting platform, and when I hear these concerns, it gives me pause, even though I have a one-year runway for my first large project.
In addition, some of the new underpinnings of D7 are architectural in nature, but can appear esoteric to new developers. I myself am excited about the future possibilities with RDF and the database abstraction layer going forward, but most developers can't easily map those benefits to their clients' needs in the near to mid-term. D6 developers already know how to get around D6's issues, so they are looking for a compelling reason to take time away from their D6 projects/skills and invest in D7. Performance degradation is a fast way to terminate the conversation.
At my day job, for a semiconductor company, I interact with people whose sole job is performance analysis, where numbers in the nanosecond and picosecond range are bandied about as they compare our designs to the competition. These aren't the same people who design or implement the functional characteristics of a chip. They aren't emotionally invested in a particularly elegant architecture or an implementation of that architecture. They just tell it like it is. They spend all of their time writing and evaluating performance tests using a simulator before a design is sent to a fab, because it's a very expensive proposition ($millions) to implement performance improvements after the fact. It could require a complete respin of design, verification, layout and tape-out to the fabs. Fortunately, software is a bit more malleable, but some aspects still carry over. Once a product has a reputation, deservedly or not, for being slow, the damage can be hard to repair until the next product generation, when people are willing to take another look.
If there was an automated performance test suite that ran on a periodic basis, either nightly or weekly, or perhaps on-demand, performance improvements and regressions could be identified fairly quickly and it would expedite the task of determining root cause. Perhaps an effort analogous to SimpleTest could be made a priority. It appears that serious thought has already gone into this issue and some of the tooling may already be at hand, both for core and contrib.
If my battle scars could talk, they would urge you to put the creation of a seamless and automatic performance testing environment at the head of the list, before making things potentially worse with enhancements or bug fixes for D8/D7.
Comment #31
boombatower commented
Subscribe
Comment #32
ijf8090 commented
Subscribe
Comment #33
chiddicks commented
I've been thinking about this issue for a while now and I've got a few thoughts for moving forward. I think it might be smarter to approach this from a code analysis standpoint rather than time-based benchmarking, per se. The trouble with benchmarking is that it is so exposed to external influences - other processes running on the machine, faster or slower machines, etc. Obviously you would standardize the machine and reduce those effects as much as possible, but that's tough, and somewhat difficult to sustain. Not impossible, mind.
What we're trying to establish is the degree to which an individual patch would affect Drupal's overall performance. We're often looking at a small bit of code - take #1135950: Remove static caching in t() as an example - and extrapolating what impact it is going to have. These patches might range from a typo fix in a comment, which would have no discernible impact, to something like a modification of the t() function, which might run 300 times per pageload.
So I'm curious about what tools are out there for analyzing code executions, less from a CPU time point-of-view than statistics on number of function calls, memory fetches, recursions, etc - anything that would be useful in predicting performance issues. A reasonable first step for automated testing, I think, is to raise red flags so that the patch can be scrutinized further, should possible performance issues result from a patch. So, looking to see if functions are called significantly more times, or significantly more memory is used, or if database queries are taking significantly more time to execute. Thoughts on this approach?
Comment #34
catch commented
@chiddicks - there's active work going on with this at http://drupal.org/project/performance_testing, as well as some previous discussions at http://drupal.org/sandbox/catch/1186744.
Short version: we're trying to use xhprof and cgroups - which can track function calls and memory usage, as well as CPU time instead of wall time - to get that kind of data, and to allow people to write specific tests (i.e. it should be easy to write a test that calls t() 1000 times, then compare before/after).
Comment #41
andypost
Comment #49
catch commented
Thirteen years later: #3346765: Add PerformanceTestBase for allowing browser performance assertions within FunctionalJavaScriptTests.
Comment #50
catch
Comment #51
catch
Comment #52
catch
Comment #53
catch
Comment #54
catch
Comment #56
catch commented
#3391689: Add a performance tests job to gitlab and send data to OpenTelemetry landed, which means we actually have 'automated performance testing for core' now!
However, it still needs work to make it useful for finding performance regressions. I'm making some progress on #3352851: Allow assertions on the number of database queries run during tests, which for me is the highest priority thing to add.
Comment #57
joseph.olstad commented
Great news! Now that we have the beginnings of performance metrics, what we might need to add next after #3352851 is a scheduled performance test that runs daily or weekly. Part of the test would be creating 50,000 nodes, 5,000 menu links, 500 taxonomy terms, 300 blocks, and other entity types as well, then running through some operational tasks and monitoring whether performance improves or degrades between one period's worth of commits and the next, red-flagging all the related commits in any period that slows performance down.
The idea is to simulate some sort of realistic situation where performance would come into play.
Comment #58
slashrsm commented
Comment #59
andypost commented
There are only two child issues left; both are features, so the meta could be considered fixed.
Comment #60
catch commented
Yep, there are some other issues floating around too, but all the basics are in place, and if we want to do major new things those could have another plan issue! 15 years...
Comment #61
joseph.olstad commented
Awesome work on this! Very important milestone!
Comment #62
fabianx commented
Fantastic work, catch!