Problem/Motivation

When adding new tasks, the task manager does not check whether an identical task already exists. This leads to the same task being stored multiple times and executed multiple times, which is a performance killer on large sites.

  1. Or are duplicate tasks allowed by design?
  2. Or should cron be disabled while tracking?

Steps to reproduce

Generate 100k–1000k nodes and add an index for them, with cron running every minute. While tracking is in progress, check the search_api_task table in your database; you should see duplicate task records being added.

Proposed resolution

Check whether an identical task already exists before adding a new one.
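
A minimal sketch of what such a check could look like (illustrative only, not the final patch; variable names are assumptions):

  // Before creating a new task, look for an identical pending one and
  // reuse it instead of inserting a duplicate.
  $storage = \Drupal::entityTypeManager()->getStorage('search_api_task');
  $ids = $storage->getQuery()
    ->accessCheck(FALSE)
    ->condition('type', $type)
    ->condition('index_id', $index_id)
    // Assuming the data value is stored serialized.
    ->condition('data', serialize($data))
    ->execute();
  // (NULL values for server_id / index_id need special handling; see the
  // notExists() discussion in the comments below.)
  if ($ids) {
    return $storage->load(reset($ids));
  }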

Remaining tasks


Comments

jungle created an issue. See original summary.

jungle’s picture

Title: Duplicated task added » Duplicated task records added into the search_api_task table
Status: Active » Needs review
FileSize
1.05 KB
jungle’s picture

FileSize
1.04 KB
744 bytes
+++ b/src/Task/TaskManager.php
@@ -136,6 +136,28 @@ class TaskManager implements TaskManagerInterface {
+      $query->condition('server_id', 'NULL', 'IS NULL');
...
+      $query->condition('index_id', 'NULL', 'IS NULL');

These should be NULL, not the string 'NULL'. Fixing.

The last submitted patch, 2: 3193690-2.patch, failed testing. View results

Status: Needs review » Needs work

The last submitted patch, 3: 3193690-3.patch, failed testing. View results

jungle’s picture

Status: Needs work » Needs review
FileSize
1.67 KB
638 bytes

Trying to fix the failing testTaskCountLimit().

jungle’s picture

Issue summary: View changes

Adding the question "Or are duplicate tasks allowed by design?" to the IS.

jungle’s picture

Issue summary: View changes

Adding the question "Or should cron be disabled while tracking?" to the IS.

longwave’s picture

I have a site with 150,000 nodes which I think is running into this bug. The index config has changed, so a reindex was triggered during the config import step of a deployment. This is taking several hours to complete, and inspecting the search_api_task table during the import shows what appears to be around 100 duplicate tasks inserted for each page:

mysql> select type, server_id, index_id, data, count(*) from search_api_task group by type, server_id, index_id, data;
+------------+-----------+---------------+---------------------------------------------------------------+----------+
| type       | server_id | index_id      | data                                                          | count(*) |
+------------+-----------+---------------+---------------------------------------------------------------+----------+
| trackItems | NULL      | article_index | a:2:{s:10:"datasource";s:11:"entity:node";s:4:"page";i:1443;} |       56 |
| trackItems | NULL      | article_index | a:2:{s:10:"datasource";s:11:"entity:node";s:4:"page";i:1444;} |       44 |
+------------+-----------+---------------+---------------------------------------------------------------+----------+
2 rows in set (0.00 sec)

Running the same query a few seconds later:

mysql> select type, server_id, index_id, data, count(*) from search_api_task group by type, server_id, index_id, data;
+------------+-----------+---------------+---------------------------------------------------------------+----------+
| type       | server_id | index_id      | data                                                          | count(*) |
+------------+-----------+---------------+---------------------------------------------------------------+----------+
| trackItems | NULL      | article_index | a:2:{s:10:"datasource";s:11:"entity:node";s:4:"page";i:1445;} |       81 |
| trackItems | NULL      | article_index | a:2:{s:10:"datasource";s:11:"entity:node";s:4:"page";i:1446;} |       19 |
+------------+-----------+---------------+---------------------------------------------------------------+----------+
2 rows in set (0.00 sec)

Is this expected behaviour, or is it actually indexing everything 100 times?

drunken monkey’s picture

Thanks for reporting this problem!

No, this is definitely not expected behavior, and it seems like a large performance problem when it happens. So it's really great that you spotted this, and it would be important to fix.

Regarding your patch, I spotted one problem: there is no 'IS NULL' operator for entity queries; you need to use notExists() for that. I also re-arranged the code a bit, and added a test for this to prove it's working correctly and hopefully prevent us from breaking it in the future.
Patch attached, tests/reviews would be very welcome!
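
To make the notExists() point concrete, the pattern is roughly the following (illustrative, not the exact hunk from the attached patch):

  // Invalid: entity queries have no 'IS NULL' operator, so a condition like
  //   $query->condition('server_id', 'NULL', 'IS NULL');
  // will not do what was intended.
  // Correct: use notExists() when the property should have no value.
  if ($server_id === NULL) {
    $query->notExists('server_id');
  }
  else {
    $query->condition('server_id', $server_id);
  }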

Also, it would be great to find out what code is even responsible for trying to create multiple tasks. Any chance of finding that out, either of you? Apparently, that code is buggy, too, so maybe it will also cause other problems if not fixed.

Finally, if we now ensure no duplicates are added, it would actually be good to add a UNIQUE index to the tasks table for that. However, as tasks are content entities, I’m not sure how to do that – does anyone here know?

drunken monkey’s picture

Any help with my questions, or any other feedback on the patch?

drunken monkey’s picture

Thanks to chx I now found out how to add that unique index, and with a bit of fiddling even managed to write a database update function for it.

Please test/review and I can commit this.
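
For reference, the moving parts are a custom storage schema handler on the task entity type that defines the unique key (the override is quoted in a later comment), plus an update function that tells the entity definition update manager the definition changed. A rough sketch of the latter, not necessarily identical to the committed search_api_update_8107():

  /**
   * Adds the unique key to the search_api_task base table.
   */
  function search_api_update_8107() {
    $manager = \Drupal::entityDefinitionUpdateManager();
    // Applying the updated entity type definition lets the storage schema
    // handler add the newly defined unique key.
    $entity_type = $manager->getEntityType('search_api_task');
    $manager->updateEntityType($entity_type);
  }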

Ghost of Drupal Past’s picture

(that existing task is) instead returned.

This should read "returned instead".

+      if ($values === NULL) {
+        $query->notExists($property);
+        continue;
+      }
       $query->condition($property, $values, is_array($values) ? 'IN' : '=');

That foreach is not big enough to warrant this sort of early continue; just do:

     if (isset($values)) {
       $query->condition($property, $values, is_array($values) ? 'IN' : '=');
     }
     else {
       $query->notExists($property);
     }

@todo Use named parameters here once we depend on PHP 8.0+.

You could pass an array right now if you so desired.

if (!empty($result)) {

if (!empty()) is only ever necessary if the variable can be undefined, but here it can't be. This is literally the same as if ($result).

drunken monkey’s picture

Thanks for the review!
I don’t think the thing with named parameters is urgent enough right now, though. But great catches otherwise.

drunken monkey’s picture

Anyone available for testing this so we can finally commit it?

  • drunken monkey committed e46ffd1 on 8.x-1.x authored by jungle
    Issue #3193690 by jungle, drunken monkey, Charlie ChX Negyesi, longwave...
drunken monkey’s picture

Status: Needs review » Fixed

Well, let’s just push this and hope for the best.
Committed. Thanks again, everyone!

DiegoPino’s picture

Hi,

We are getting a MySQL exception from this update (on updatedb), in search_api_update_8107:

[error] Exception thrown while performing a schema update. SQLSTATE[42000]: Syntax error or access violation: 1170 BLOB/TEXT column 'data' used in key specification without a key length: ALTER TABLE "search_api_task" ADD UNIQUE KEY `task__unique` (`type`, `server_id`, `index_id`, `data`);

The table alter that adds the new key does not pass a length for `data`, and since `data` is a BLOB/TEXT column of variable length, that is invalid in MySQL.

https://www.drupal.org/files/issues/2021-05-08/3193690-15--duplicate_tas...

I wonder if anyone has updated an existing site with this patch?

The offending code is this schema addition:

protected function getEntitySchema(ContentEntityTypeInterface $entity_type, $reset = FALSE): array {
  $schema = parent::getEntitySchema($entity_type, $reset);

  if ($data_table = $this->storage->getBaseTable()) {
    $schema[$data_table]['unique keys'] += [
      'task__unique' => ['type', 'server_id', 'index_id', 'data'],
    ];
  }

  return $schema;
}

Since the core MySQL driver defines:

protected function createKeySql($fields) {
    $return = [];
    foreach ($fields as $field) {
      if (is_array($field)) {
        $return[] = '`' . $field[0] . '`(' . $field[1] . ')';
      }
      else {
        $return[] = '`' . $field . '`';
      }
    }
    return implode(', ', $return);
  }

I wonder if you can pass an array there instead of just the key?
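
Given the createKeySql() code above, a key column can indeed be specified as a [name, length] array, so something along these lines (length picked arbitrarily) would produce a prefixed key that MySQL accepts. Note, though, that a prefix only enforces uniqueness on the first N bytes of data, so it is not necessarily an acceptable fix here:

  // Hypothetical variant of the schema addition with a key length prefix
  // on the BLOB/TEXT 'data' column.
  $schema[$data_table]['unique keys'] += [
    'task__unique' => ['type', 'server_id', 'index_id', ['data', 100]],
  ];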

volman’s picture

Status: Fixed » Needs review

Same as #19

I get the same error when doing updatedb. We use MariaDB 10.3.29.

drunken monkey’s picture

Status: Needs review » Fixed

Very strange, I don’t have that problem with MariaDB 10.6.4. However, luckily the Drupal test bot caught this, too: see #3228210: Fix current test failures.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

maximpodorov’s picture

PostgreSQL 9.6 has a similar problem:

Exception thrown while performing a schema update. SQLSTATE[54000]: Program limit exceeded: 7 ERROR: index row requires 16216 bytes, maximum size is 8191: ALTER TABLE {search_api_task} ADD CONSTRAINT search_api_task__task__unique__key UNIQUE (type,server_id,index_id,data);

#3247781: Database update fails, database schema problem on latest release