Problem/Motivation
With #3259709: Create the database driver for MySQLi we'll have a database driver that is able to execute asynchronous queries.
I think we need potentially two APIs to handle that.
1. An interface indicating support for async queries, something like:
AsyncInterface{
function executeAsync();
}
As well as something to get the query when it comes back.
Ideally we can encapsulate at least MySQL and PostgreSQL's implementations so they'll work through one API. Core is only likely to need to execute one async query at a time rather than sending multiple, although you never know.
2. A way to conditionally execute a query async when in a Fiber and the driver supports it, this doesn't necessarily need to live in the database API as such.
Could look something like this:
class AsyncQueryHelper {
public function getAsyncConnectionIfInFiber($database, $target) {
// Given a database and target, get a dedicated connection for executing
// an asynchronous database query. This can maintain a stack so that existing
// connections where the query has already returned can be re-used. Avoids
// "commands out of sync; you can't run this command now".
if (\Fiber::getCurrent() === NULL) {
return Database::getConnection($database, $target);
}
return $this->getAsyncConnectionFromStack($database, $target);
}
public function executeAsyncSuspendFiber($query) {
// If the driver doesn't support async quer
if (!$query instanceof AsyncInterface) {
return $query->execute();
}
$fiber = \Fiber::getCurrent();
// If we're not within a fiber, we just have to return when the query comes back
if ($fiber === NULL) {
throw new \Exception('If you're not in a fiber and you have an async query interface, you didn't use getAsyncConnectionIfInFiber so shouldn't be using this API);
}
// Now we're cookin'.
$query->execute();
$fiber->suspend();
// @todo: infinite loop protection.
// The query holds a reference to the connection so can we poll the result from that?
while (!$query->hasResult()) {
$fiber->suspend();
}
return $query->getResult();
}
}
PHP's MySQL implementation, even with mysqli async support, can only handle one query at a time. If you execute two async queries without waiting for the first to come back (or an async query then a non-async query), you get Commands out of sync; you can't run this command now.
However, this can be worked around via a connection pool.
You have your main, non-async database connection.
When you want to make an async query, you request an async database connection via the skeleton API above. This would use a connection pool - if there's no async connection, create a new one, if there's an existing async connection, check if it's free and use it if it is, if there are existing, busy async connections, create new ones up to some kind of limit, if the limit gets hit, wait until one returns then use it.
Examples of this logic are in https://github.com/amphp/mysql/tree/3.x however we are going to want it to work with Drupal's database API, so probably can't use it directly, however it'll be a good reference point regardless.
The drawback of the connection pool approach is there's overhead in creating a new database connection, but if we only do that for listing queries and similar, then the absolute number of connections should be restricted, and we can artificially cap it too.
Comments
Comment #2
catchComment #3
catchComment #4
catchComment #5
catchComment #6
kingdutchI think with the learnings from #3425212: [PP-1] Migrate BigPipe and Renderer fiber implementation to use Revolt Event Loop I'd propose a different approach:
All database queries should always be async. This is possible with Fibers (and the Revolt Event Loop) because the code that requires the result of the database query will not be executed until the result is returned. This ensures that while any database query is happening, unrelated code can run (e.g. a cache prewarmer or another bigpipe task). In case a piece of code needs to make multiple database queries and wants to do them at the same time then a primitive such as
streamorconcurrentlycan be used to control how they're executed. That would lead to similar patterns as implemented in the BigPipe changes.Comment #7
catchThere are a couple of limitations which I think point against trying to do this:
1. PDO doesn't have any async support, we have an issue to add a mysqli driver to core, specifically to enable this issue. So we need a non-async code path for those drivers based on a capability check.
2.mysqli can only run one async query at a time per connection, it's not particularly well documented, but see for example https://stackoverflow.com/questions/12866366/php-mysqli-asynchronous-que... and https://dev.mysql.com/doc/refman/8.0/en/commands-out-of-sync.html
This means that to execute an async query, we need to open an additional MySQL connection each time, or re-use a connection from which a query has previously returned.
Especially with database caching, but also just in general, Drupal can easily execute dozens or hundreds of small database queries on each page request for authenticated users (or just on a cache miss). I think we would need/want to avoid a situation where we have say more than 5-10 connections open per request, since in a stampede we could end up hitting 'too many connections' limits. As soon as we have whatever limit of connections open (say 5) we'd be unable to execute any more until one returns, and we might not have anything else to do except issue more database queries. However, if we only execute async queries for longer database queries (entity queries and views etc.), then all the other ones can be handled by the main, non-async connection (whether that's using the mysqli driver or PDO) even if there are five, slow, async queries running.
The other reason is that some database queries (again, especially with the db cache but not only) are directly blocking the rest of the request, including in situations where most caches will be warm, something like a key value query, cache tag checksums, path alias, or route lookups. In those cases, we want those (usually sub 1ms) queries to finish as soon as possible so we can move onto the next blocking thing, and wouldn't want to go back into the event loop to do prewarming or whatever in between.
Comment #8
kingdutchI feel like this is missing the beauty of Fiber's in our mental model :D The whole point of Fibers is that we can have things that are async which look exactly the same as sync code: it solves the "What color is your function" problem (see "What problem do fibers solve?" in Christian Lück's great post.
So where we should end up (and if we don't we did something wrong) is that code calling
Entity::loadMultiple($ids)shouldn't care how we load the data (whether that's sync or async) because thanks to Fibers the code writing it can be sure that even if the loading itself happens asynchronously (and other things are done in the meantime) the code in this specific place won't run pastEntity::loadMultiple($ids);until those entities have loaded.I think in the form of a ConnectionPool this is a problem that's long since been solved in other programming languages and we shouldn't re-invent the wheel here. I also don't think we should differentiate between the type of query (because the workload is inherently unpredictable) and all queries should go through the connection pool; especially since there's no query that shouldn't happen.
From a caller's perspective I don't even think we should know about the ConnectionPool, that should be the database driver's responsibility. Calling code just makes a database call and that'll suspend until data is available and will resume when data is present. Whether that suspension is because there's no connection free or whether the connection is waiting for data from a different connection shouldn't matter.
For scheduling what code actually gets to load data the EventLoop helps us with priorities based on how things are added to it:
EventLoop::queuehappens immediately when the eventloop gets controlled;EventLoop::deferhappens first in the next tick;EventLoop::delayandEventLoop::repeathappen only after deferred callbacks have run. Responses to streams are also treated as higher priority as far as I understand.Comment #9
catchThat's not the point though - if the query is synchronous, then we shouldn't be trying to suspend between executing it and it coming back, because there's no opportunity to do so. So even if the logic is in the database driver, it still needs to know whether it supports async or not and behave accordingly.
Yes I mention a connction pool in the issue summary, but I don't think we should use that for cache queries (for example), there's literally no point to those being async and it's likely to make performance worse if we do that.
Comment #10
catch#3259709: Create the database driver for MySQLi just landed so this is possible to implement now.
Comment #12
catchStarted to look at this in a bit more detail, immediately ran into a problem:
mysqli async support only works with mysqli->query()/mysqli_quer(), not with prepared statements. We use prepared statements for everything in Drupal.
This means we'd either have to find a way to run async queries through mysqli->query() (e.g. a translation layer between the prepared statement that formats it as the raw query) or go even further and adopt something like https://github.com/amphp/mysql which doesn't use any driver at all.
The prepared statement -> raw query translation would be a horrible hack but would unblock this.
On the other hand, there's a current RFC for full async support in PHP at https://wiki.php.net/rfc/true_async which is on revision 1.7 - new revisions are being posted at the rate of about one per month. This will allow the existing PDO and mysqli drivers to be made asynchronous via coroutines. If that RFC gets to the point where it's accepted, we could wait for that.
Comment #13
mfbThanks for diving into this @catch, I have been wondering why amphp/mysql went in one direction (creating a low-level mysql driver) and folks here seemed to be going in a different direction. Last I read up on the async RFC, there was a great need for folks to review and shepherd it through the RFC process, see https://news-web.php.net/php.internals/129944
Comment #14
mondrakeComment #15
steinmb commentedI have been following https://wiki.php.net/rfc/true_async for quite a while.
Sadly I believe no one thinks v1.7 of it will pass voting if tested. For PHP async is a big change. It really puts the RFC process to the test, as I believe, very few fully understands the changes suggested in the RFC, me included, making an open discussion of the inner workings hard and the voting phase, more a leap of faith.
I still believe that we as a community can help out. Perhaps first by joining the discussion from a framework standpoint and try to explain where and why we would like to use async, as there even been claims that PHP do not need async support.