If the heartbeat_activity table is very large and joined to itself in a situation like this were it can't use indexes effectively due to the comparison operators, the query can take many minutes to execute, if it finishes at all. For example, on a site with ~20K entries in the heartbeat_activity table this query is effectively blocking cron from running.

    $users = db_query("SELECT uid FROM {users} WHERE status = 1");
    $query = array();
    $args = array();
    while ($account = db_fetch_object($users)) {
      $query[] = "(SELECT uid, uaid FROM {heartbeat_activity} WHERE uid = %d ORDER BY uaid DESC LIMIT %d)";
      $args[] = $account->uid;
      $args[] = $keep_latest_number;
    }
    
    $result = db_query(implode("UNION", $query), $args);

It looks more complex but on that same set of data executes in only a couple seconds.

There's a nice writeup of why/how this works here.

Comments

Stalski’s picture

Is it a solution to add indexed to some fields you think?

gcassie’s picture

This is an extension of the conversation at http://drupal.org/node/1191740 and an attempt to improve the scalability of this query:

    $result = db_query("SELECT
        t1.uid,
        t1.uaid as 'uaid',
        COUNT(*) as 'rows_per_user',
        t1.timestamp as 'real_date',
        MIN(t2.timestamp) as 'oldest_date',
        count(t2.uid) AS 'count'
      FROM {heartbeat_activity} AS t1
      INNER JOIN {heartbeat_activity} AS t2 ON t1.uid = t2.uid AND t2.timestamp >= t1.timestamp
      WHERE (t1.timestamp, t1.uaid) < (t2.timestamp, t2.uaid)
      GROUP BY t1.uid, t1.uaid HAVING COUNT(t2.uid) <= %d
      ORDER BY t1.uid, t1.uaid, t1.timestamp DESC", $keep_latest_number);

You mentioned over there that "the timestamp still needs to be put into query as well. So the last query will stay the same". I'm curious why we need the timestamp and can't rely on the uaid - they're created sequentially as records are inserted.

Adding indexes won't help the first query - the comparison operators make it so these can't work well.

I do have to admit I'm not certain about this original syntax:

WHERE (t1.timestamp, t1.uaid) < (t2.timestamp, t2.uaid)

Could you explain this a bit? Thanks.

Stalski’s picture

That's called a "weighted comparison". It will compare the the two fields on a target record.

The reason why I want to delete too old activity in general is just to cover all cases. Lots of people want the activity to only exist for 1 month in the database. The advanced query is only to make sure that there is some activity left to show on the profile pages of individual users. Not the other way around.

You also mentioned the uaid as parameter to check. Well since we're talking about activity, time is the leading factor imo. If someone does something with existing records through code, we can't loose the message. I once had an action that updated the record resetting the timestamp so the activity would be "on topic" again.

Any thoughts are welcome.

gcassie’s picture

I figured it was doing a comparison, but I'm not sure how the logic pans out. Is it equivalent to doing:

WHERE t1.timestamp < t2.timestamp AND t1.uaid < t2.uaid

?

Yep, I get the intention. Delete everything over a certain age, except for the n most recent entries. I agree it's a good approach for most sites. If we can't optimize this query though consider making it optional - just cleaning out everything over a certain age is simple and might still be a viable alternative for a lot of sites.

It seems to me there's a very tight correlation between the uaid and timestamp fields. Both move forward incrementally at the same time. I'll defer to you on timestamps if they're needed, but the index is going to be smaller if we can use uaid.

Also on the original query I noticed a lot of extra fields are being selected that don't seem to be used anywhere, like oldest_date and the COUNTS. Removing those should also be helpful if you don't want to go with a UNION approach.

Stalski’s picture

No, the query condition is not the same as the weighted approach. This is rather difficult to explain.

There is a correlation currently between uaid and timestamp, but I want to get rid of that actually.
E.g. Promoting activity could mean, updating an existing row's timestamp but keeping the same uaid.

No, I did not say that I won't go with the union option. That was my first approach as well. But then I realized, couldn't it be that I don't know SQL good enough. So I asked some people to lend me a hand in dealing with such things. The solution is what they think is most viable.
However, your point is well taken as I want to see it in practice as well.
So what's left for you and me to do, is check with lots and lots of users and lots and lots of activity and measure the scalibility of each approach ;) Are you up for it, since you already got an indication that it's not that good currently?

gcassie’s picture

Hmm, might consider keeping the timestamp as an unchangeable "created" stamp and adding a new "changed" column, like node does.

I'm looking at a site with ~40K activity records and can't get through a cron run at the moment. I'm going to try patching in the UNION queries and see how it goes.

Stalski’s picture

created, changed: Good point.

query contest: I am a bit curious I must say ;)

gcassie’s picture

sorry, haven't had a chance to benchmark yet. not forgotten.

gcassie’s picture

The new query to select what needs to be deleted is running in a few hundred milliseconds. The query to delete the records still takes a while, but at least that's linear.

The production data set I have is a little weird as it was created during a mass import script. It might be worth generating a few thousand dummy activity records and running both versions on the same datasets a few times?

catch’s picture

Subscribing.

Stalski’s picture

While testing the query in #0, I saw this is drupal6 code. Did you test this there?

I tried to rewrite it so it works on drupal7 db api, but without success.

    $users = db_query("SELECT uid FROM {users} WHERE status = 1");
    $query = array();
    $args = array();
    foreach ($users as $key => $account) {
      $query[] = " ( SELECT uid, uaid FROM {heartbeat_activity} WHERE uid = :uid_$key ORDER BY uaid DESC LIMIT 0, :limit ) ";
      $args[':uid_' . $key] = $account->uid;
      $args[':limit'] = $keep_latest_number;
    }

    $result = db_query(implode("UNION", $query), $args);

The error is
PDOException: SQLSTATE[42000]: Syntax error or access violation: 1064 You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ''10' ) UNION ( SELECT uid, uaid FROM heartbeat_activity WHERE uid = '2' ORDER BY' at line 1: ( SELECT uid, uaid FROM {heartbeat_activity} WHERE uid = :uid_0 ORDER BY uaid DESC LIMIT 0, :limit ) UNION ( SELECT uid, uaid FROM {heartbeat_activity} WHERE uid = :uid_1 ORDER BY uaid DESC LIMIT 0, :limit ) UNION ( SELECT uid, uaid FROM {heartbeat_activity} WHERE uid = :uid_2 ORDER BY uaid DESC LIMIT 0, :limit ) UNION ( SELECT uid, uaid FROM {heartbeat_activity} WHERE uid = :uid_3 ORDER BY uaid DESC LIMIT 0, :limit ) UNION ( SELECT uid, uaid FROM {heartbeat_activity} WHERE uid = :uid_4 ORDER BY uaid DESC LIMIT 0, :limit ) UNION ( SELECT uid, uaid FROM {heartbeat_activity} WHERE uid = :uid_5 ORDER BY uaid DESC LIMIT 0, :limit ) ; Array ( [:uid_0] => 1 [:limit] => 10 [:uid_1] => 2 [:uid_2] => 3 [:uid_3] => 4 [:uid_4] => 5 [:uid_5] => 6 ) in heartbeat_cron() (line 113 of /opt/d7/heartbeat/heartbeat.module).

Does anyone know what could be wrong?

Stalski’s picture

Status: Active » Needs work
catch’s picture

You should use db_select() instead of string concatenation in D7. #557318: UNION support is missing in database API added union support but I haven't personally used that yet.

Stalski’s picture

@catch: Can you guide me to the best query to run for things like this?

gcassie certainly has a point, so the advanced query i started with does not suffice. Any guidance on how we can proceed would be highly appreciated.

TS79’s picture

Isn't there an additional performance improvement, if you don't determine the $keep_uids in an php-array, which goes into the "NOT IN" operator

<?php
    $arguments = array_merge(array($expire), $keep_uaids);
    $delete_result = db_query("SELECT uaid
      FROM {heartbeat_activity}
      WHERE
        timestamp < %d
      AND
        uaid NOT IN (" . db_placeholders($keep_uaids) . ") ", $arguments);
?>

but flag these uaids temporary in heartbeat_activity table via UNION query:

<?php
    db_query("UPDATE {heartbeat_activity} SET keep_alive = 0");
    db_query("UPDATE {heartbeat_activity} SET keep_alive = 1 WHERE (".implode("UNION", $query).")", $args);

    $arguments = array($expire);
    $delete_result = db_query("SELECT uaid
      FROM {heartbeat_activity}
      WHERE
        timestamp < %d
      AND
        keep_alive = 0", $arguments);
?>

This might improve performance, if you have many users and many activity logs to keep alive.
A temporary table for uaids to be kept alive might be similar approach.

TS79’s picture

And it would be nice, if all the code in #0 is only executed, if $keep_latest_number > 0
:-)

Stalski’s picture

Ofcourse. I don't know who will create the patch, but that will be up to the programmer.
@master You can give it a try then :)

gcassie’s picture

The UNION approach works better up to a point, but now that the site has passed 10K users, the query string itself is causing a segfault in PHP due to its length. I'll look at how it can perhaps be processed in batches or something.

dppeak’s picture

Adding unique keys and indexes would speed up the queries as well.

Adding this to the heartbeat_translations table/schema:

    'indexes' => array(
      'uaid' => array('uaid'),
      'tuaid' => array('tuaid'),
      'uaid_tuaid' => array('uaid', 'tuaid'),
    ),

Add this to the heartbeat_activity table/schema:

    'unique keys' => array(
      'uaid' => array('uaid'),
      'uaid_uid' => array('uaid', 'uid'),
      'uaid_nid' => array('uaid', 'nid'),
      'uaid_uid_nid' => array('uaid', 'uid', 'nid'),
    ),

New hook_update would contain this:

  // Add in unique keys and indexes.
  db_add_unique_key($ret, 'heartbeat_activity', 'uaid', array('uaid'));
  db_add_unique_key($ret, 'heartbeat_activity', 'uaid_uid', array('uaid', 'uid'));
  db_add_unique_key($ret, 'heartbeat_activity', 'uaid_nid', array('uaid', 'nid'));
  db_add_unique_key($ret, 'heartbeat_activity', 'uaid_uid_uid', array('uaid', 'uid', 'nid'));
  db_add_index($ret, 'heartbeat_translations', 'uaid', array('uaid'));
  db_add_index($ret, 'heartbeat_translations', 'tuaid', array('tuaid'));
  db_add_index($ret, 'heartbeat_translations', 'uaid_tuaid', array('uaid', 'tuaid'));
Stalski’s picture

I did not get to this issue yet, but I promise I will make time for this one. Ofcourse any patches would be very welcome

Stalski’s picture

I already pushed the added keys to git.