Scalability of activity stream for large number of users

kbahey - March 29, 2009 - 18:45
Project:Activity Stream
Version:6.x-2.x-dev
Component:Code
Category:support request
Priority:normal
Assigned:Unassigned
Status:closed
Description

I was testing this module for potential inclusion on a large site (53,000+ active users, 700,000+ page views a day).

The cron design would not have a chance on such a site, because of the following reasons:

1. The query in hook_cron() selects all users who have opted to have their activity stream available. Even if this is a quarter or half of the users, that is tens of thousands of rows to process every time cron runs.

2. Since all of the activity stream info comes from external sites, one or more operations over the network are needed, and these are by definition slow. To estimate how much this would take, I added changed the code below as follows:

function activitystream_cron() {
  // 2bits.com -- start a timer
  timer_start('ascron');
  $result = db_query('SELECT uid, userid, password, feed, module from {activitystream_accounts}');
  while ($user = db_fetch_object($result)) {
    activitystream_invoke_streamapi($user);
  }
  // 2bits.com -- read the timer
  $ms = timer_read('ascron');
  // 2bits.com -- get the number of accounts
  $count = db_result(db_query('SELECT COUNT(*) FROM {activitystream_accounts}'));
  // 2bits.com -- Log the info to watchdog
  watchdog('activitystream', t('activity stream processed %count records in %ms milliseconds',
    array('%ms' => $ms, '%count' => $count)));
}

The result is that for an account for 4 services (RSS, twitter, facebook and digg) it took 3.7 SECONDS to process a single user. This does not scale at all.

There are several options to get around this, but none of them are ideal:

1. Process only a certain number of accounts on every cron run, e.g. 100. This will avoid overloading the server and lessens the chances of cron timing out, but will mean that data will not be fetched immediately, and we will probably la behind a lot as time goes on.

2. Doing the retrieval once a day only, as opposed to at every cron run. This has the drawback of losing the near immediacy, as well as it may take a long time to complete, and may in fact never complete.

3. Limiting the retrieval to people who have accessed the site in the last X days, and not updating those who are inactive on the site.

I am sure others will have better ideas, so throwing it to the maintainers and the community for comment/brainstorming.

#1

nquocbao - April 5, 2009 - 16:04

What do you think if we pull out user stream when his profile is visited ? This is, at least, reduce the long tun in cron

I'm willing to help with this feature.

#2

George2 - April 8, 2009 - 03:36

i like the idea of only grabbing x per cron run for the latest active users. or, maybe only updated when the user logs in? or posts something or does something! this would prevent a spammer from setting up an account with a feed, and deserting, and just spamming your site with links to other things.

#3

akalsey - April 8, 2009 - 04:11

How about implementing Job Queue if the module is installed?

I'm thinking the cron hooks could add their update jobs to the queue and each module could have a priority setting to allow site admins to pick which feeds to update first. Using a queue system would help ensure that the module never times out and that all users get updated (under the current model, if you have 10k users, the last few thousand might never get updated).

For truly high performance, however, we'd probably need to look at a distributed indexing farm. I've previously built a high-performance feed indexer capable of updating over 5k feeds per minute using a distributed job queue. That sort of queue would have to pull in all the feeds and keep track of changes, building a single blended feed of just the updates from every user. This way, Drupal would only have to make a single network connection -- it could likely handle several thousand updates per minute. Moving to a serialized PHP stream for the data would make it even faster. And in the event this still wasn't performant enough for a massive site, Activity Stream could support a web services API that would allow the external indexer to insert stream items directly into the site and bypass cron.

Doing this in an external service would require that someone actually run and maintain a server (or perhaps a server farm) to do the updating. I'd be willing to do that (and have some serious experience in running these sorts of highly-available services), but obviously this sort of thing isn't going to be free.

If you're interested in this sort of thing, I could get it up and running quickly. The longest part of the project would be putting together an interface for adding new users/services to the remote update queue. Not sure what pricing would look like. Probably a monthly fee based on some combination of the number of users and the number of sources the site is using.

Khalid, feel free to contact me privately about details of the site you're wanting to use this on if you feel it would help to come up with an answer.

#4

kbahey - April 8, 2009 - 12:13

Indexing farms and external servers are an option, but overkill for most sites. It can be an outsourced service like Apache solr's search.

Job Queue is a good option. I implemented it in a similar use case where users performance certain actions, then I checked those actions and submitted a job queue task for them. If they did any of the actions again, only one scheduled task would fire for them between doing the action and cron running.

Job queue is very simple actually: here is what I used:

...
// User has created a node
$uid = $node->uid;
job_queue_add('decision_matrix_process', 'Decision matrix', array($uid), '', TRUE);

One thing that needs to be addressed is what constitutes a user's activity? Login is one. Creating content or commenting is another. What else? There must be others.

We can also limit the user to one feed download per day or twice daily, so that a user's stream is not retrieved over and over from many services too frequently. So we need a table with uid and last_stream_update timestamp in it and we check that and not submit the above job queue unless a certain number of hours have passed (e.g. 6, 12 or 24 depending on a setting).

#5

kbahey - April 8, 2009 - 12:44

Also, we can have an .install function that triggers downloading the stream for the most currently logged in users. This gets triggered on installing the module, so we have some data for the most active users.

<?php
// Download only every 6 hours, should be a setting
$interval = 6 * 60 * 60;

// Number of users to process, should be a setting
$count = 50;

$result = db_query("SELECT  u.uid FROM {users} u INNER JOIN {users_activity_stream} a ON a.uid = u.uid WHERE a.last_timestamp > %d ORDER BY  u.access DESC LIMIT %d", $interval, $count);
while (...) {
// get the stream for each of those users
}
?>

A variation of the above can also be used on a daily or twice daily basis.

This is not enough to determine active users though. Good for the "on install" and "twice daily" loads only.

#6

akalsey - April 8, 2009 - 15:45

If we use Job Queue, do we really need a setting for how many users to process at a time? Rather than throw in a knob to twiddle with, I'd like to just say that using Job Queue is the preferred way to handle large numbers of users. Especially, if twiddling that knob will require knowledge of performance and timeouts that most users won't have.

#7

akalsey - April 8, 2009 - 15:48

There's no point in pulling in stream data on install. At install time, no users have any streams set up, so there'd be nothing to download.

We *could* kick off an update when the user saves their stream settings for the first time, but that should be discussed in a separate issue.

#8

kbahey - April 8, 2009 - 16:15

For install, I was thinking more when the module is enabled on a site that has been running for a while and has lots of users. Instant way to provide streams for some users. I am not that hung up on it, but want to avoid the support issues of "I installed the module and it does nothing" ...

As for job queue knob, I think it is needed. We do similar things in core re: watchdog logs, ...etc.

We can put that settings under an advanced fieldset and set it to a reasonable value that most users would not need changed. Advanced users can tweak it. If you insist on not having it, then I will settle for a hidden variable (meaning if (variable_get('something', 20)) { ... that I can put in settings.php without a user interface).

#9

akalsey - April 8, 2009 - 16:19

At install, you might have lots of users, but we don't know the twitter IDs, Flickr IDs, or blog feeds of any of them. So there'd still be nothing to fetch.

#10

akalsey - April 8, 2009 - 17:44

If Job Queue is installed, the "number of updates" settings would become useless. Job Queue already supports that feature.

#11

gausarts - April 9, 2009 - 13:38

subscribing. This has bothered me for some time as well.

The idea of a trigger, such as user log in, along with job queue to start downloading, will stop streams of the inactive users. A good screen to those engaged users. In the current state, a user can simply supply the feed url once, and leave the site without bothering much about the updates. A growing number of inactive users with the growing number of automated streams will make the site soon unhealthy, I imagine.

Any hint to go with rules as well?

Thanks

#12

kbahey - April 9, 2009 - 16:32

@gausarts

This is not how I meant it to run: "Login once and we have your stuff refreshed forever".
What I meant is that only the recently active users have their stuff updated, so they have to continue to be active. In other words job_queue will submit stream update jobs *only* for users who are active, not for users who have logged in once upon a time.

The challenge is what constitutes activity and how to detect it in Drupal, for example posting a comment or a node or voting is easy. Not all users are actively participating like that. Some just browse the site all the time. Detecting login is not easy, since the cookie by default is some 23 days. Configuring the cookie for a lesser amount (3 days? 1 week) will inconvenience the users since they have to login more frequently. We can rely on the timestamp of the session table, but joining it to the users for a large site can be slow.

Maybe something like this then?

<?php
$last_check
= variable_get('activity_stream_last_check', time());

$result = db_query("SELECT DISTINCT(uid), timestamp FROM {sessions} WHERE timestamp > %d ORDER BY timestamp DESC WHERE uid > 0", $last_check);
while (
$data = db_fetch_object($result)) {
 
// Check if the user has been processed in the last N hours
 
$user_last_update = db_result(db_query("SELECT last_update FROM {activity_stream_user) WHERE uid = %d", $data->uid));
  if (
$user_last_update > (time() - NUMBER_OF_HOURS * 60 * 60) {
   
// Process the user's activity stream
 
}

?>

I did not do a join in the code snippet because the first query is heavy because of the DISTINCT on a large site.

No need for job queues. Just being on the site a lot.

#13

akalsey - April 12, 2009 - 07:53

For determining user activity, how about the users.access field?

#14

kbahey - April 12, 2009 - 15:27

Yes, that works, but we have to limit the retrieval for the same user to once or twice a day (configurable) rather than at every cron run.

So here is the revised pseudo code:

<?php
$last_check
= variable_get('activity_stream_last_check', time());

$result = db_query("SELECT uid, access FROM {users} ORDER BY access DESC WHERE uid > 0");
while (
$data = db_fetch_object($result)) {
 
// Check if the user has been processed in the last N hours
 
$user_last_update = db_result(db_query("SELECT last_update FROM {activity_stream_user) WHERE uid = %d", $data->uid));
  if (
$user_last_update > (time() - NUMBER_OF_HOURS * 60 * 60) {
   
// Process the user's activity stream
 
}
}
?>

#15

kbahey - April 12, 2009 - 19:41
Version:5.x-1.3» 6.x-2.x-dev
Status:active» needs review

Here is a rewrite of the hook_cron() function to download only the active users, and do it only every N hours for each user.

<?php
function activitystream_cron() {
 
module_load_include('inc', 'node', 'node.pages');

 
// Download the user's data only every 6 hours, so we don't do it too frequently
  // TODO We can keep this to a reasonable default and no user interface. Those who need to
  // tweak it can do so in settings.php
 
$number_of_hours = variable_get('activitystream_cron_hours', 6);

 
$i = 0;
 
// Find the users who have been active lately
 
$result = db_query("SELECT uid, access FROM {users} ORDER BY access DESC WHERE uid > 0");
  while (
$data = db_fetch_object($result)) {
   
// Check if the user has been processed in the last N hours
    // TODO we need to amend the schema for this table: just the uid and last_update fields.
   
$user_last_update = db_result(db_query("SELECT last_update FROM {activitystream_user) WHERE uid = %d", $data->uid));
    if (
$user_last_update > (REQUEST_TIME - $number_of_hours * 60 * 60)) {
     
// Process the user's activity stream
     
$result = db_query('SELECT uid, userid, password, feed, module from {activitystream_accounts} WHERE uid = %d', $data->uid);
      while (
$user = db_fetch_object($result)) {
       
activitystream_invoke_streamapi($user);
       
// Update the last update time for this user, so we do not do it too often
       
db_query("UPDATE {activitystream_user} SET last_update = %d WHERE uid = %d", REQUEST_TIME, $data->uid);
       
$i++;
      }
    }
  }
  if (
$i > 0) {
   
// Add an informative message in the watchdog
   
watchdog('activitystream', 'Processed %count users.', array('%count' => $i));
  }
}
?>

Two things need to be done:

1. Amend the schema to include a new table with the last update date.

2. (Optional) Add a user interface for the number of hours to wait between retrieval of the stream for the same user.

#16

CorpX - April 12, 2009 - 20:13

This is an interesting module, but I too have been extremely concerned with scalability.
I agree that job queue is the way to go with this - its a great module that was design specifically for allowing modules like this to scale effectively. We also need to be sure that there is someway of deleting events that roll off the views or you risk filling your db with activities slowing things down for views if you are using them?

#17

akalsey - April 13, 2009 - 18:05

There's already a lastfetch field in the activitystream_accounts table. It's not being populated now, but it's intended for a use like this. We can use that and some creative date math in the query to simplify this quite a bit.

<?php
function activitystream_cron() {
 
module_load_include('inc', 'node', 'node.pages');
 
$number_of_hours = variable_get('activitystream_cron_hours', 1);
 
$result = db_query('SELECT uid, userid, password, feed, module from {activitystream_accounts} WHERE lastfetch < date_sub(now( ) , INTERVAL %d HOUR )', $number_of_hours);
 
$processed_users = array();
  while (
$user = db_fetch_object($result)) {
   
activitystream_invoke_streamapi($user);
   
// Update the last fetch time for this user and module.
   
db_query('UPDATE {activitystream_accounts} set lastfetch = now() where uid = %d AND module = \'%s\'', $user->uid, $user->module);
   
$processed_users[$user->uid] = $user->userid;
  }
 
$num_users = count($processed_users);
  if (
$num_users > 0) {
   
// Add an informative message in the watchdog
   
watchdog('activitystream', 'Processed %count users.', array('%count' => $num_users));
  }
}
?>

@corpX, there's really not a need to have this module delete nodes. There's very little chance of "filling your DB". Modern databases don't generally have finite numbers of rows they can handle. If you want to delete nodes matching certain criteria, I'd suggest creating a separate module for that. The ability to nodes of a certain content type and older than a certain date might be a useful module.

#18

akalsey - April 13, 2009 - 18:18

Now with job_queue support. If job_queue is installed, add the streamapi function call to the queue instead of calling it directly.

<?php
function activitystream_cron() {
 
module_load_include('inc', 'node', 'node.pages');
 
$use_job_queue = module_exists('job_queue') ? TRUE : FALSE;
 
$number_of_hours = variable_get('activitystream_cron_hours', 1);
 
$result = db_query('SELECT uid, userid, password, feed, module from {activitystream_accounts} WHERE lastfetch < date_sub(now( ) , INTERVAL %d HOUR )', $number_of_hours);
 
$processed_users = array();
  while (
$user = db_fetch_object($result)) {
    if (
$use_job_queue) {
     
job_queue_add('activitystream_invoke_streamapi', 'Activity Stream Update for '. $uid, array($user), '', TRUE);
    } else {
     
activitystream_invoke_streamapi($user);
    }
   
// Update the last fetch time for this user and module.
   
db_query('UPDATE {activitystream_accounts} set lastfetch = now() where uid = %d AND module = \'%s\'', $user->uid, $user->module);
   
$processed_users[$user->uid] = $user->userid;
  }
 
$num_users = count($processed_users);
  if (
$num_users > 0) {
   
// Add an informative message in the watchdog
   
watchdog('activitystream', 'Processed %count users.', array('%count' => $num_users));
  }
}
?>

#19

akalsey - April 14, 2009 - 04:51

The code I posted above is in the 2.x branch

#20

kbahey - April 14, 2009 - 15:16

It is getting better, but still have some points for consideration. I am looking at the CVS code in DRUPAL-6--2 for these:

1. In hook_cron() there is still no check for the fact that a user has logged in recently (e.g. access column in users table).

2. By checking the lastfetch, we can have a case where we fetched some accounts for a user but not others, giving a partial picture for a user.

3. This part can be memory intensive because you can end up with LOTS of users, and building an array of them may hit a limit somewhere.

<?php
 
while ($user = db_fetch_object($result)) {
   
$users[] = $user;
  }
?>

4. Setting the lastfetch when using job_queue can be misleading. The user's data would not have been fetched yet, rather scheduled for job queue to do that, but we have marked the last fetch. I don't have a good solution to this since job queue does not provide a callback when it runs.

Another idea here: we can still use the lastfetch column, but reduce it to most recent for every user, rather than for an account:

Here is a rewrite using your new function, calling it once per user, allowing it use job_queue (or not), and eliminating the need for a new table, and using lastfetch.

<?php
function activitystream_cron() {
 
module_load_include('inc', 'node', 'node.pages');

 
// Download the user's data only every 6 hours, so we don't do it too frequently
  // TODO We can keep this to a reasonable default and no user interface. Those who need to
  // tweak it can do so in settings.php
 
$number_of_hours = variable_get('activitystream_cron_hours', 6);

 
$i = 0;
 
// Find the users who have been active lately
 
$result = db_query("SELECT uid, access FROM {users} ORDER BY access DESC WHERE uid > 0");
  while (
$data = db_fetch_object($result)) {
   
// Check if the user has been processed in the last N hours
   
$user_last_update = db_result(db_query("SELECT MAX(lastfetch) FROM {activitystream_accounts) WHERE uid = %d", $data->uid));
    if (
$user_last_update > (REQUEST_TIME - ($number_of_hours * 60 * 60))) {
     
// Process the user's activity stream
     
$result = db_query('SELECT uid, userid, password, feed, module from {activitystream_accounts} WHERE uid = %d', $data->uid);
      while (
$user = db_fetch_object($result)) {
       
$users[] = $user;
      }
     
activitystream_update_streams($users);
     
$i++;
    }
  }
  if (
$i > 0) {
   
// Add an informative message in the watchdog
   
watchdog('activitystream', 'Processed %count users.', array('%count' => $i));
  }
}
?>

#21

akalsey - April 14, 2009 - 16:02

You're ordering by last access date but still processing all users, even those that haven't logged in recently. Users that have abandoned the site are still being updated. To prevent the problem that guasarts describes in #11, we should only update users that have accessed the site in some recent time period. I'm okay with creating good defaults and making both this and the fetch interval settings under a collapsed "Advanced" fieldset as long as the reasons why someone would want to change this are well-described. The code in CVS has a setting for fetch interval already.

I'm not thrilled with the complexity of the loops and queries. Unless substantially all your users are using activity stream, the function as written will perform rather poorly.
You're getting every user on the site in the cron and then looping over the result set looking for users that have activity stream accounts, then checking to see if each user has passed the fetch interval, then finally getting the accounts for that user. That's a lot of round trips to the database and potentially fetching a lot of rows that end up not being used.

We can filter all this in the database and reduce all those loops and queries to one DB query.

SELECT a.uid, a.userid, a.password, a.feed, a.module FROM activitystream_accounts a LEFT JOIN users u on a.uid = u.uid  WHERE a.lastfetch < date_sub(now( ) , INTERVAL 2 HOUR) AND u.access > unix_timestamp(date_sub(now( ) , INTERVAL 10 day)) ORDER BY a.lastfetch DESC, u.access DESC

By checking the lastfetch, we can have a case where we fetched some accounts for a user but not others, giving a partial picture for a user.

The way that updates happen, this shouldn't be an issue. The only time this might happen is if the user adds a bunch of accounts at once and then later adds another account. The newly added account would have a lastfetch time that's out of sync with the others. For the most part, this isn't a huge issue, especially since we're not trying to have real-time updates.

If it becomes a problem, we could always sync lastfetch times periodically. Say, every week, we get the oldest lastfetch for each uid and run "UPDATE activitystream_accounts SET lastfetch = '$max_lastfetch' WHERE uid = $users_uid".

I'm not terribly happy with sticking all users into an array and calling the new update function. I had to do it to add a second feature -- the immediate update of newly-added accounts. The alternative is to call the update function for every single user. I think I'd need a database with tens of thousands of users and streams to benchmark and see which is faster.

I'm not sure what to do about lastfetch and job queue, either. The same issue occurred to me when I was writing that code. The only solution I can think of is to move the updating of lastfetch into the activitystream_save_stream function. Doing that would cause a lot of unnecessary updates, though. A stream with 20 items in it would update the lastfetch 20 times. We could create a static variable in activitystream_save_stream and store an array of users for which we've already updated lastfetch, but that seems like a heavy solution to something that isn't likely a big issue.

#22

akalsey - November 10, 2009 - 00:36
Status:needs review» fixed

These changes are in the 2.0-BETA1

#23

System Message - November 24, 2009 - 00:40
Status:fixed» closed

Automatically closed -- issue fixed for 2 weeks with no activity.

 
 

Drupal is a registered trademark of Dries Buytaert.