Problem/Motivation
While adding new tasks, the Search API does not check whether an identical task already exists. This leads to multiple copies of the same task being created and executed, which is a performance killer on large sites.
- Or are duplicate tasks allowed by design?
- Or should cron be disabled while tracking?
Steps to reproduce
Generate 100k to 1,000k nodes and add an index for them, with cron running every minute. While tracking, check the search_api_task table of your database; you should see duplicate task records being added.
Proposed resolution
Check the existence of the task before adding a new one.
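A minimal sketch of the proposed dedup logic, using a plain in-memory array rather than the module's real content-entity storage (the function name and array shapes here are illustrative, not the actual Search API code):

```php
<?php

// Illustrative sketch of the proposed fix: before inserting a task,
// look for an existing one with the same type/server/index/data and
// return it instead of creating a duplicate. The real module stores
// tasks as content entities; a plain array stands in for that here.
function addTask(array &$tasks, string $type, ?string $server, ?string $index, $data = NULL): array {
  foreach ($tasks as $task) {
    if ($task['type'] === $type
        && $task['server_id'] === $server
        && $task['index_id'] === $index
        && $task['data'] === $data) {
      // An identical task already exists: return it instead of adding one.
      return $task;
    }
  }
  $task = [
    'type' => $type,
    'server_id' => $server,
    'index_id' => $index,
    'data' => $data,
  ];
  $tasks[] = $task;
  return $task;
}

$tasks = [];
addTask($tasks, 'trackItemsUpdated', NULL, 'default_index');
addTask($tasks, 'trackItemsUpdated', NULL, 'default_index');  // duplicate, not added
echo count($tasks), "\n";  // 1
```

In the real module, the lookup would be an entity query against the task storage instead of a `foreach`, but the control flow is the same: query first, create only on a miss.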
Remaining tasks
| Comment | File | Size | Author |
|---|---|---|---|
| #15 | 3193690-15--duplicate_tasks.patch | 10.09 KB | drunken monkey |
Comments
Comment #2 (jungle)

Comment #3 (jungle)
Should be NULL, not 'NULL'; fixing.
Comment #6 (jungle)
Trying to fix the failing testTaskCountLimit() test.
Comment #7 (jungle)
Adding the question "Or are duplicate tasks allowed by design?" to the issue summary.
Comment #8 (jungle)
Adding the question "Or should cron be disabled while tracking?" to the issue summary.
Comment #9 (longwave)
I have a site with 150,000 nodes which I think is running into this bug. The index config has changed, so a reindex has been triggered during the config import step of a deployment. This is taking several hours to complete, and inspecting the search_api_task table during the import shows that 100 duplicate tasks seem to be inserted for each page.
Running the same query a few seconds later:
Is this expected behaviour, or is it actually indexing everything 100 times?
Comment #10 (drunken monkey)
Thanks for reporting this problem!
No, this is definitely not expected behavior, and it seems like a large performance problem when it happens. So it's really great that you spotted this, and it would be important to fix.
Regarding your patch, I spotted one problem: there is no `IS NULL` operator for entity queries; you need to use `notExists()` for that. However, I also re-arranged the code a bit, and I added a test to prove it's working correctly and hopefully prevent us from breaking it in the future. Patch attached, tests/reviews would be very welcome!
Also, it would be great to find out what code is even responsible for trying to create multiple tasks. Any chance of finding that out, either of you? Apparently, that code is buggy, too, so maybe it will also cause other problems if not fixed.
Finally, if we now ensure no duplicates are added, it would actually be good to add a `UNIQUE` index to the tasks table for that. However, as tasks are content entities, I'm not sure how to do that – does anyone here know?

Comment #11 (drunken monkey)
Any help with my questions, or any other feedback on the patch?
Comment #13 (drunken monkey)
Thanks to chx, I have now found out how to add that unique index, and with a bit of fiddling I even managed to write a database update function for it.
Please test/review so I can commit this.
Comment #14
Ghost of Drupal Past(that existing task is) instead returned.
returned instead
that foreach is not big enough for this sort just do
@todo Use named parameters here once we depend on PHP 8.0+.
you could pass an array right now if you so desired.
if (!empty($result)) {
if(!empty()) is only ever necessary if the variable can be undefined , but here it can't be . This is literally the exact same as if($result).
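For illustration (plain PHP, not the module's code): when a variable is guaranteed to be defined, `!empty($x)` and a bare truthiness test on `$x` evaluate identically, so the simplification is behavior-preserving.

```php
<?php

// Plain-PHP illustration: for a variable that is always defined,
// !empty($x) is exactly the same test as (bool) $x, so
// `if (!empty($result))` can be simplified to `if ($result)`.
function checksAgree($value): bool {
  return !empty($value) === (bool) $value;
}

$samples = [[], [1, 2], '', '0', 'x', 0, 1, NULL];
$allAgree = TRUE;
foreach ($samples as $value) {
  $allAgree = $allAgree && checksAgree($value);
}
var_dump($allAgree);  // bool(true)
```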
Comment #15 (drunken monkey)
Thanks for the review!
I don’t think the thing with named parameters is urgent enough right now, though. But great catches otherwise.
Comment #16 (drunken monkey)
Anyone available for testing this, so we can finally commit it?
Comment #18 (drunken monkey)
Well, let's just push this and hope for the best.
Committed. Thanks again, everyone!
Comment #19
DiegoPino CreditAttribution: DiegoPino as a volunteer commentedHi,
we are getting a MYSQL exception with this update (on updatedb) with search_api_update_8107
[error] Exception thrown while performing a schema update. SQLSTATE[42000]: Syntax error or access violation: 1170 BLOB/TEXT column 'data' used in key specification without a key length: ALTER TABLE "search_api_task" ADD UNIQUE KEY `task__unique` (`type`, `server_id`, `index_id`, `data`);
The Table alter to add the new keys (in specific) `data` is not passing a length and since data is of variable length that is invalid MYSQL
https://www.drupal.org/files/issues/2021-05-08/3193690-15--duplicate_tas...
Wonder if anyone has updated an existing one with this patch?
The offending code is this schema method:

```php
protected function getEntitySchema(ContentEntityTypeInterface $entity_type, $reset = FALSE): array {
  $schema = parent::getEntitySchema($entity_type, $reset);

  if ($data_table = $this->storage->getBaseTable()) {
    $schema[$data_table]['unique keys'] += [
      'task__unique' => ['type', 'server_id', 'index_id', 'data'],
    ];
  }

  return $schema;
}
```
Since the core MySQL driver defines
I wonder if you can pass an array there instead of just the key?
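For reference, Drupal's Schema API does allow a key column to be specified as a `[column_name, prefix_length]` pair instead of a plain column name, which is how core keys BLOB/TEXT columns in MySQL. A hypothetical sketch of what such a schema entry might look like; the prefix length of 64 is an illustrative assumption, not the fix that was actually committed:

```php
<?php

// Hypothetical sketch (not the committed fix): in Drupal's Schema API,
// a unique-key column may be either a plain column name or a
// [column_name, prefix_length] pair. Giving the variable-length 'data'
// column a prefix length would avoid MySQL error 1170 ("BLOB/TEXT column
// used in key specification without a key length").
$schema['search_api_task']['unique keys'] = [
  'task__unique' => [
    'type',
    'server_id',
    'index_id',
    ['data', 64],  // index only the first 64 bytes of the column
  ],
];

print_r($schema['search_api_task']['unique keys']['task__unique']);
```

Whether a short prefix is actually unique enough for serialized task data is a separate question; a hash column is another common workaround.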
Comment #20 (volman; CreditAttribution: volman)
Same as #19.
I get the same error when doing updatedb. We use MariaDB 10.3.29.
Comment #21 (drunken monkey)
Very strange, I don't have that problem with MariaDB 10.6.4. However, luckily the Drupal test bot caught this, too: see #3228210: Fix current test failures.
Comment #23 (maximpodorov; CreditAttribution: maximpodorov)
PostgreSQL 9.6 has a similar problem:
Exception thrown while performing a schema update. SQLSTATE[54000]: Program limit exceeded: 7 ERROR: index row requires 16216 bytes, maximum size is 8191: ALTER TABLE {search_api_task} ADD CONSTRAINT search_api_task__task__unique__key UNIQUE (type,server_id,index_id,data);
#3247781: Database update fails, database schema problem on latest release