This module has the potential to fill a need for Drupal users worldwide, especially newbies, by collecting module usage statistics.

The idea is that everyone who uses this module would have the option to allow drupal.org to collect data about their site's modules, and to use that data to generate popularity ratings for each module. A module's rating would be based on the percentage of users who have the module installed and enabled. It would also be possible to show the percentage of sites that keep a module installed/enabled (i.e., a module that is frequently scrapped within a week of being installed is probably not very good).

This kind of data would help newbies choose between similar modules. It would supplement (rather than replace) a subjective module-rating/feedback system. As more and more modules are added, we're going to need both objective and subjective ratings all the more to sort through them all.


Comments

hass’s picture

really, i'd love this idea of seeing module installation statistics... it helps to separate the wheat from the chaff. but the way this data is collected... i don't know.

dww’s picture

Project: Update Status » Project
Version: 5.x-1.0 » 5.x-1.x-dev
Component: Miscellaneous » Releases
Priority: Normal » Critical

Dries, merlin and I have already discussed this at length, we just didn't create an issue for it yet. ;)

yes, the XML-RPC server code that update_status.module queries will log the queries. there will be a way for sites to opt out of this, but we'll encourage sites to allow us to collect the data. we'll use this data to provide information about actual usage of modules. this data will become even more accurate once the functionality of update_status.module moves into core for 6.x.

anyway, yeah, it's high on the list, and a prerequisite for a bunch of nice usability improvements for the drupal.org download pages and project nodes. see http://drupal.org/node/128546 for a summary and links to older issues about this.

dww’s picture

hunmonk and i just spent a while last night thinking through some of the devilish details about this task. here's what we came up with:

  1. client's query will include the site's baseurl if the client UI has said to allow us to log the request. if there's no baseurl, we answer the query but don't log anything.
  2. if the baseurl exists, and passes some basic validation, we md5hash it (at the server) and use that as the unique key into our various DB tables to store the info (so we can uniquely, but anonymously identify which site we're talking about, avoid duplicate info from the same site, etc).
  3. the initial request is stored in a DB table that's optimized for quickly inserting another row, not for processing the data. so, instead of normalizing on the individual modules within the request (and therefore having to update/insert N rows for each request), we just serialize the request array and insert a single row into the following table:
    {project_usage_raw_data}
    varchar hostkey [the md5hash of the baseurl]
    varchar data [the serialized request array of (module name, version info) pairs]
    UNIQUE KEY(hostkey)
    
  4. there's a daily cron job that iterates through every row in {project_usage_raw_data}, unserializes everything, and populates 2 other tables that are the things we actually use to query and display any of this data. first, a summary that includes the # of sites using each major revision of each module:
    {project_usage_count_data}
    int nid (project id),
    int version_tid,
    int version_major,
    int count
    

    second, a table of the normalized stuff (for example, so we can do data mining on what modules are frequently installed together on the same sites to figure out what kinds of install profiles we should be working on, etc):

    {project_usage_normal_data}
    int nid (project id),
    int version_tid,
    int version_major,
    varchar hostkey
    
  5. then, we add a monthly cron job to keep track of historical summary data, for example, to have a page to show graphs of module usage over time:
    {project_usage_count_history}
    int nid (project id),
    int version_tid,
    int version_major,
    int count
    month/datestamp  (ISO date string or unix timestamp int, i don't really care)
    
  6. TBD: do we need any kinds of throttles to avoid someone trying to check too frequently?
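
A sketch of what the raw-data table from point 3 might look like as a Drupal 5 style install hook. This is hypothetical illustration code, not the actual patch, and MySQL-only:

```php
<?php
// Hypothetical hook_install() for the raw-request table sketched in
// point 3 above. Drupal 5 style: the schema is created with db_query().
function project_usage_install() {
  switch ($GLOBALS['db_type']) {
    case 'mysql':
    case 'mysqli':
      db_query("CREATE TABLE {project_usage_raw_data} (
        hostkey VARCHAR(32) NOT NULL, -- md5 hash of the site's baseurl
        data LONGTEXT NOT NULL,       -- serialized array of (module, version) pairs
        UNIQUE KEY (hostkey)
      ) /*!40100 DEFAULT CHARACTER SET utf8 */");
      break;
  }
}
```

The single UNIQUE KEY on hostkey is what keeps repeated requests from the same site down to one row, while avoiding any further indexes on the write-heavy table.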

the basic premise here is that instead of generating a bunch of additional work for the DB on each incoming request (deleting N rows that match the hostkey, inserting N new rows), and then doing a bunch of work in the DB to count all the rows that match a given module over and over again whenever we view a project node, want to sort modules by usage, etc, we do all that processing once per day, cache the results, and use those for displaying/mining this stuff. this keeps the load on the DB as low as possible while doing the critical task of responding to all the queries, and yet still gives us useful, cheap-to-use data for d.o project nodes, listings, etc. it requires more space in the DB, but that seems like the best thing to be "wasteful" of, instead of additional queries and processing time.

i'm not exactly sure how we're going to want to mine the data about sets of modules, so instead of trying to process and store summaries of that via the daily cron job, i'm advocating we just keep the current snapshot of normalized data completely, and figure out what kinds of queries we'll want to run against that later.

everyone happy with this? if so, it should be pretty easy now to crank out the patch.

dww’s picture

p.s. of course, we could also hash the IP address of the incoming request for our hostkey, instead of messing with the baseurl. i was just thinking about sites like d.o itself where there might be N web front-ends with different IPs that are all the same "site". in that case, we'd just have another attribute ("log_me = true/false" or something).

merlinofchaos’s picture

IP address will also totally fail on multi-site. Bryght might have a thousand sites from the same IP address. We can't consider that unique or valid information by any means.

Otherwise, this proposal is +1 from me -- I am really happy to see the optimize for insert and do calculations later. We need to get statistics.module to do the same kind of thing.

Steven’s picture

Actually, hashing the base_url provides very bad anonymity. Given a site, it is trivial to find its base_url and hash it. i think we should use hashes of the form hash($base_url . $private_key) (using the drupal site-wide secret key as used for form tokens for example). This means that hashes will be constant, gives people the chance to change their hash at will (if they really want to) and prevents us from doing reverse look-ups.

IMO we should also cull the stats based on timestamp to ensure freshness.

We talked about client-side vs server-side hashing on IRC, and verification methods, and basically there is no easy way to avoid spoofing. Even if we use DNS as verification, you could still game the system if you really wanted to (with subdomains) and it would be very hard to notice. So, I think we should do client side hashing and let the admin choose whether to submit their base url along.
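
A minimal sketch of the client-side salted hashing Steven describes (the function name and inputs here are made up for illustration; md5 just matches the hashing already discussed in this thread):

```php
<?php
// The client hashes its own base_url together with its site-private key,
// so the server only ever receives an opaque key that can't be reversed
// by hashing a list of known Drupal base_urls.
function example_site_key($base_url, $private_key) {
  return md5($base_url . $private_key);
}

// Changing the private key changes the site's hash, which also gives
// admins the "change your hash at will" escape hatch Steven mentions.
$key = example_site_key('http://example.com', 'site-private-key');
```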

dww’s picture

Assigned: Unassigned » dww
Status: Active » Needs work
FileSize
15.37 KB

there's a lot of // TODO in here, but this is a nearly complete (and basically working) version of what we've discussed above. the patch adds this code as a separate module in contributions/project/usage/project_usage.* so that we can isolate it from the rest of the project code. basically no sites other than d.o will have a use for this, so i don't want to include it directly in project_release.*

anyway, please review what's here, and i'll try to work on the // TODO items in parallel. thanks!

hass’s picture

For multisite installations "update_status" should collect the installed module instances and send them with the request... it should be easy to count the number of installed modules per multisite or normal installation. In the same way it would be possible to get the hostnames, too.

drewish’s picture

assuming that the baseurl/ip is hashed in a secure manner, it seems to me that it'd make sense to track which releases are being used. the big benefit would be for the security team to help determine how widely used a vulnerable module is.

dww’s picture

Status: Needs work » Postponed

a lot of the // TODO in my patch in #7 is really a symptom of a much bigger problem: http://drupal.org/node/136172

let's move discussion over there for now, to hash out how the XML-RPC protocol (and project_release.module's server code that provides the answers) should really work. once that's done, we can resume work in here.

Dries’s picture

  1. Using the $base_url is not 100% either. If someone is using poorman's cron, he might send requests from http://example.com OR http://www.example.com/. I don't think there is anything we can do about this, unless we use "application keys".
  2. Server side hashing of the URLs is naive and provides nothing but a false sense of security. Let's not get side-tracked too far with that. It's easy enough to come up with a list of Drupal sites ...
  3. A daily cron-job that iterates over all requests doesn't seem like the proper thing to do. I also don't think we should store the data in serialized format. We have to process these requests anyway -- why postpone them to a cron run? It saves absolutely no work, it only creates more as we also have to store them in an intermediate format. Let's just update the project module table(s) as the XML-RPC requests come in, and then worry about how frequent these tables are queried by the data visualizer/reporter. The project modules should simply cache their queries for x days.

merlinofchaos’s picture

A daily cron-job that iterates over all requests doesn't seem like the proper thing to do. I also don't think we should store the data in serialized format. We have to process these requests anyway -- why postpone them to a cron run? It saves absolutely no work, it only creates more as we also have to store them in an intermediate format. Let's just update the project module table(s) as the XML-RPC requests come in, and then worry about how frequent these tables are queried by the data visualizer/reporter. The project modules should simply cache their queries for x days.

The idea here isn't to 'save work' -- it's probably more work to do this. The idea here is to minimize the impact of index updates. Writes on tables with a lot of indexes are the most expensive operation we can do, database-wise. See statistics.module -- which we've had to turn off on drupal.org! -- as a good example of how badly that can go. Because this data does not need to be realtime, the way to reduce the impact is to write to a table which is indexed minimally and then process it all at once. While it doesn't need to be written serialized, I do agree with the processing only once or twice a day part.

xamox’s picture

would love to see this, I think it would be a huge asset to the community

dww’s picture

Status: Postponed » Needs work

http://drupal.org/node/136172 is basically RTBC now, and already deployed on scratch.d.o. merlin and i are hard at work on update_status 5.x-2.0. so, this is back on the table again....

drewish’s picture

Assigned: dww » drewish

okay so #136172 has been committed so this needs to get moving. over on IRC dww's been heckling me to get working on this part. i can take a hint.

greggles’s picture

subscribing

dww’s picture

See http://drupal.org/node/153741 for the client-side changes to update_status to make this possible.

BioALIEN’s picture

Subscribing

dww’s picture

http://drupal.org/node/153741 is committed and in the latest update_status 5.x-2.0-(beta2|rc) releases. There was a brief period there where we were fetching directly to apache and avoiding the Drupal menu paths entirely, since it seemed performance was killing us (http://drupal.org/node/146564). However, I spoke to Dries and he hated the inflexible approach, so we came up with something better: http://drupal.org/node/155281. Therefore, the original approach I wrote code for in here is still on the table as a possibility. Certainly, we won't need to scrape apache logs, which is good. ;)

drewish’s picture

so dww created the project-release-serve-history.php script that returns the XML files to the update(_status).module. that's where we'll save the usage info to the database. i need to sit down with dww and figure out what tables we'll need to log usage info. and what module should own them.

dww’s picture

I mentioned this to drewish in IRC, but we should add a project_usage.module that owns the schema for these tables, as per my patch in #7, even if project-release-serve-history.php is writing to the tables. This particular functionality is something only very few sites running project* will care about, so it doesn't belong directly in the project.install schema.

drewish’s picture

FileSize
14.12 KB

i re-rolled the patch dww posted to #7 to clean up the .install file a bit. i'm not sure if the changes to project_release_data() make much sense now that project-release-serve-history.php is dishing out the .xml files...

i'm not convinced that writing serialized data is a good thing. if it's only going to be queried once a day to build a daily/monthly tally then just don't index it and take that slow query hit once a day.

dww’s picture

I think we should probably rip out all the code in project_release to serve those xml history files entirely. Folks should use project-release-serve-history.php instead. If they want it to be drupal code doing a full bootstrap, they can do that via the CLI. Seems silly to have code in project_release.module itself for this. See http://drupal.org/node/159892

Similarly, as soon as update_status 5.x-2.0 is out for a few weeks, we should deprecate the entire XML-RPC interface and rip that out, too. Maybe we'll handle that in #159892, or maybe a separate issue...

So yeah, no need to touch any of that code with this patch anymore.

Other reasons this still needs work... ;)

1) This was all work-in-progress code before we completely redesigned the update_status protocol and introduced the .xml files. The $projects array shouldn't exist anymore in the code or schema.

Furthermore, the whole schema needs to be slightly redesigned. We need separate rows for each project a site is fetching. Also, update(_status?) is sending us an exact version string, so we should store that, too.

Basically, project_usage needs to be ported to the update_status 5.x-2.* or update.module code base. The act of writing the patch at #7 illustrated to me just how broken the original 5.x-1.* protocol was, and started me down the long road towards 5.x-2.0 and update.module in D6 core. I never came back to this issue and fixed anything in here now that the protocol is sane.

2) We don't do this anymore: +version = "$Name: $"

drewish’s picture

spent some time talking with dww on irc, the following is a summary.

so the project_usage module will end up as a holder for the .install file and a hook_cron() implementation...

the URL built by _update_build_fetch_url() demonstrates the data we'll be able to collect. the update.module will be sending the version and project values from the .info file (as described in #101009) and the core version.

if they provide a site_key but no version (checked out from CVS perhaps?) we might still want to log which version of core they're using.

the version string passed by the client should be the same as the value in the {project_release_nodes}.version field. we'll need to add an index to that column for JOINs / WHEREs.

when an XML file is served the following data will be recorded to {project_usage_raw_data}: timestamp, site_key, project, version, core_version.

dww was pushing for an index free table to make the logging quick. i'm still not totally convinced it will be possible. it seems there are two cases to consider:
- the "normal" case where a site checks in weekly and you want to log their usage, here a simple insert would be fine.
- the bad case where someone's manually updating every 5 minutes. you only want to count them once per reporting interval (day? week?) so at some point you'll have to skip/remove duplicates.

i'm leaning towards saying that the statistics should be processed weekly otherwise people that check back daily will be counted 7 times more than those checking in weekly...

the more i think about it, the more i wonder if it wouldn't be a better idea to just take the hit and do the computation when they update. round the timestamp to the start of the week, then do a "SELECT * FROM {project_usage_data} WHERE site_key=%s AND timestamp=%d AND project_uri=%s". if a record exists we UPDATE it, if not we INSERT a new one. then on a weekly basis we look back and add up how many site_keys used a particular project, in total and by version. i think a weekly breakdown would be pretty handy and would make a good basis for usage graphs.
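
The rounding step described here might look like the following (a sketch assuming Sunday-based weeks in UTC; the function name is made up):

```php
<?php
// Round a timestamp down to the start of its week so that repeated
// check-ins from the same site within one week land on the same
// (site_key, timestamp, project_uri) row.
function example_week_start($timestamp) {
  $day = 60 * 60 * 24;
  $days = (int) floor($timestamp / $day);
  // The Unix epoch fell on a Thursday, so shift by 4 to make 0 = Sunday.
  $dow = ($days + 4) % 7;
  return ($days - $dow) * $day; // midnight UTC on the preceding Sunday
}
```

With the timestamp normalized this way, the SELECT-then-UPDATE-or-INSERT above naturally collapses a site's daily check-ins into one row per week.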

drewish’s picture

so according to merlinofchaos the update.module and update_status.module both phone home daily. but he was also of the opinion that weekly made sense:

<merlinofchaos> drewish: I agree, more datapoints in a year. Daily is too many datapoints.
...
<merlinofchaos> Merging daily into weekly will help compress the fuzz.

dww’s picture

merlinofchaos doesn't seem to remember http://drupal.org/node/154049 ;)

There's a setting to control the frequency it checks. Granted, the default is daily, and few sites will probably change the setting.

That said, I'm not a huge fan of only processing weekly. It seems like "compressing the fuzz" is a job for the stuff digesting and displaying the data, not something we want to hard-code into the data collection itself. It's possible we'll want to drill down into daily for something, but if the schema and data collection don't allow for that, we're screwed. You can always go from more detail to lower resolution if you want, but if you start from low res, you can never see higher res.

drewish’s picture

dww, he did say it was an option but that it was daily by default.

you're right that it's easier to discard data once you've collected it, but i'm having a hard time seeing how daily data would be very useful considering it'll have to store 7 times as much data.

webchick’s picture

Just for clarification purposes, update(_status) would still be auto-checking daily, but we just would only store one request per site per week? If so, that sounds fine to me, and a lot more manageable.

If we're switching it so that update(_status) only checks for new updates once per week, rather than daily, that's definitely NOT a good idea, imo ... security updates that wait a week to get installed could result in a compromised site.

Sorry, it's kind of late, this might be a stupid question. :)

webchick’s picture

Hmmm... dww's point is sound...

Would it be silly to store data in two tables? A "raw" table that has all of the individual daily requests and a "calculated" table that stores aggregated statistical data? Sort of like how VotingAPI does it so it's quick to grab the average rating of something. This would potentially allow us to prune old logged data if its totals had already been stored somewhere else.

dww’s picture

@webchick: this thread is only about collecting data. The update(_status?) clients are already loose in the wild, we're not changing those now. ;)

And yeah, if you look at the original draft schema from my patch in #7, that's exactly what I had: multiple tables, 1 for very fast dumping of raw data (no indexes), and a few others of processed data, with indexes, to be used to actually display stats.

Now, the debates are:

A) How much raw data to save.
B) Does it make sense to bulk process at some regular interval, or always process everything as soon as it comes in.

I'm firmly in the camp of: A) save as much as we have, B) do as little as possible during the save, and offload all processing until we're at some regular interval. Of course, only benchmarks will really tell what's best overall in terms of B, but I'm very worried about response time for fetching each individual .xml file, both for the clients, and for the load on d.o's DB...

hass’s picture

A) i would collect two years of daily raw data, plus a calculated version in a second table. This second table will be small and therefore very fast. After two years, delete the raw data. 2 years is good for privacy (remember the google discussion :-) ). So you are free to change the stats and re-build other types of statistics within the two years.

b) i'm sure this bulk process is required. For drupal this raw table will be very big... and i think the bulk process will have performance issues without indexes on this raw table.

drewish’s picture

i'm fine with collecting daily data. i do think that the weekly data is going to be the most useful so a weekly computation would make sense. as hass points out trying to do any processing on a d.o sized data without indexes is going to be very slow.

in terms of logging the data, in project-release-serve-history.php the XML file has already been sent to the client before we log the usage. at that point, can't we just call ob_flush()? or maybe we'd need to do something a little more complex?

i also think it's a good idea to dump the raw data after a given length of time (2 years, a year?) and only keep the aggregate data after that point.

drewish’s picture

FileSize
9.06 KB

i was really hoping to get a hold of dww on irc today to bounce some of this off of him, but that didn't work out so i started hacking. here's what i've gotten worked out. daily data is recorded. during a daily cron job the nids for the project and release node are found. on a weekly basis, usage for the week is computed for projects and releases. after a year, old usage data is removed.

drewish’s picture

i just realized that the way i was computing the weekly totals is wrong. it should count distinct site_keys. i'll have another go at this tomorrow after i get some feedback.
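
Counting distinct site_keys rather than raw rows would make the weekly tally something like this (table and column names are assumptions based on the draft schema in this thread):

```php
<?php
// One row per project per week: the number of distinct sites that
// checked in during the interval, regardless of how often each did.
$result = db_query("SELECT nid, COUNT(DISTINCT site_key) AS sites
  FROM {project_usage_day}
  WHERE timestamp >= %d AND timestamp < %d
  GROUP BY nid", $week_start, $week_end);
```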

drewish’s picture

FileSize
11.82 KB

this works a bit better and, for testing, includes some metric module support.

dww’s picture

Yup, still "needs work"... ;)

A) As per http://drupal.org/node/156550, for this patch to be useful at all, project-release-serve-history.php has to start setting the right cache headers to completely disable caching all the time, or else squid will still have the release history .xml files and we'll only log a fraction of the usage since only a fraction of the requests will get through.

B) The @file comment for project_issue.modules is a mix of lies (XML-RPC) and truth (project-release-serve-history.php).

C) This:

+      'description' => t('Configure how long data on usage is retained.'),

Would probably be better as: "Configure how long usage data is retained."

D) My feeble brain can't do anything with this:

+  $times = array(94608000, 63072000, 31536000, 15724800, 7257600, 4838400, 2419200);

Can you either comment those, or define them like: (60*60*24*365)?
Is that really the best UI for this? Shouldn't we give them a text box to input whatever value they want, and a drop-down to select the magnitude (days, weeks, years) or something?

E) Why is a bunch of stuff commented out of project_usage_cron()? I guess that's just debugging/test code so that it fires every time. Even still, those comments shouldn't be in the patches you upload to this issue...

F) project_usage_cron() initializes $cur_time = time(); but then later, you use time() again unnecessarily:

+    project_usage_process_week(time());

G) This int literal shows up a lot: 31536000 -- that'd be better as a PHP constant.

H) s/$ageoptions/$age_options/

I) "Compiled usage summary will not be deleted." is not grammatically correct.

J) It's not immediately clear why you'd need to do this:

+  // Set any missing pid and nids.

The phpdoc at the top of the function doesn't explain it, either. Can you please make the phpdoc comment more verbose to explain why project_usage_process_day() needs to exist at all? It's really because you don't get pids and nids from project-release-serve-history.php at all, right? So, we *have* to do a daily processing run to convert the raw data into something we can actually use, anyway. Which brings me to...

K) Maybe it's my own prejudice from writing the initial patch, but the table name {project_usage_day} implies to me "data for 1 day". In fact, it really seems to be "all raw data". Perhaps {project_usage_raw} is a better name for the table? Also, I'm still worried that we store everything in the same table until the 2 year limit when we discard stuff (or whatever we configure that knob to be). I thought writes to big tables can get expensive. Seems like if we're going to do a daily processing where we're updating rows anyway, that we should move the data into a separate table at that point to keep the table where we have to do the high-performance writes as small as possible. E.g. {project_usage_day} becomes a table that just stores stuff as it comes in, until the next daily processing. Then, it gets moved into {project_usage_raw}. The weekly run processes {project_usage_raw} and populates {project_usage_week_project} and {project_usage_week_release} as appropriate, and prunes data from {project_usage_raw} we want to discard. That would have the benefit of also not having (count 'em) 4 indexes(!) on the table we have to do the frequent, speed-critical writes to. :(

Or, alternatively, we call them {project_usage_raw} (raw data coming in, not yet processed) and {project_usage_all} (all the data we have) and then the per-week summary tables as you have them...

L) "Beginning" is consistently misspelled in here as "begining".

M) Grammar bug: "Compute the begining and end a week containing a given timestamp."

N) Some phpdoc would be most helpful for the metrics hook implementations.

O) This:

+          'description' => t('This project is used on @count sites in the last @weeks.', array('@count' => $count, '@weeks' => $weeks)),

should probably use format_plural(). There might be modules only used by 1 site. ;) And, since $weeks defaults to 1, that *definitely* should be format_plural() material.

P) This code is completely cut and pasted:

+    case 'options':
+      $weeks = drupal_map_assoc(array(1, 2, 3, 4, 6, 8, 10, 20, 26, 52));
+      $form['weeks'] = array(
+        '#type' => 'select',
+        '#title' => t('Weeks'),
+        '#default_value' => isset($options['weeks']) ? $options['weeks'] : 1,
+        '#options' => $weeks,
+      );
+      return $form;
+  }

Can you put that in a private helper so both metric hooks can share it?

Q) It's too bad there's no pgsql case for project_usage.install. Obviously, I don't fault you since the schema might still be in flux, and we don't really need it for d.o, but it'd be the right thing to do (at least until we're porting to D6 and schema API). I wouldn't hold up this patch over this, especially given the impending port to schema API, but I wanted to raise it.

R) Oh, this comment is no longer a @todo once the patch goes in... ;)

  * @todo: Record usage statistics?

Other than that, the code looks reasonable. I haven't tried testing it at all. I didn't see any security holes, and I was looking fairly closely for them.

At some point (preferably once we resolve the schema debate from J and K), we should just install this on d.o and start gathering data. We won't really know what the problems are until we actually see them live, and we're just losing data we could otherwise be collecting now...

Sorry for the delay reviewing this. I've just been slammed at my real job. That said, other folks could have provided most of this input, so it's too bad you feel like you've been blocked on me for this. Oh well, such is life...

Thanks for moving this forward!
-Derek

drewish’s picture

Status: Needs work » Needs review
FileSize
11.96 KB

here's a patch with some of the simple things dww pointed out fixed. i've got to run right now so i don't have time to address the substantial stuff he pointed out.

drewish’s picture

got a little time after all so i'll address a few of the points.

G) 31536000: once the options in the select were built as 365 * $day it showed up twice. i went ahead and added it as a constant but it seems a bit silly to me.

O (format_plural): at one point i had started to use format_plural() but the problem is which one gets the pluralizing? and there's no easy way to do the replacement for the other value. in drupal 6 it looks like this has been improved, the upgrade handbook has a good comparison of how to do it in 5 and in 6.

Q (pgsql): I don't have PostgreSQL installed so I can't properly test the schema creation. I feel that it would be worse to put something in there that's untested than it would to leave it out. I imagine that hunmonk would be able to help us translate it.

J: if you have a look at the way that project-release-serve-history.php stores records it does a SELECT and then if there's no existing record for that day does INSERT, otherwise it UPDATEs that record. that's the way that the project_usage_save_raw_query() function in your patch on #7 did it.

at some point you need to match the usage up to a particular project and release node. my thinking was that there's no sense doing the lookup on the project node each time you have a request. just do it at the end of the day and get all the views-5.x-1.2 usages at once.

K: i'm not dead set on the table naming. i'd though day would make it obvious that the records were per day and week would indicate per week. if that's not the case then we can rename them.

i'm not certain on the performance of writes to large tables. my understanding was that if there weren't reads and writes at the same time it wasn't a problem.

several times you've mentioned keeping the "raw" table unindexed to get fast writes, that would be well and good if there wasn't the SELECT to look for existing records. i suppose we could just log every request, allowing duplicates, but then when you do your daily processing you're going to take a huge performance hit looking for duplicates either on the db server or by putting them into a PHP hash.

at this point i really do have to run. i'll comment on this further later.

dww’s picture

Status: Needs review » Needs work

G) How about renaming the constant to "YEAR" then, if that's what it is...

O) Just use placeholders that themselves are defined via format_plural(). Something like:

'description' => t('This project is used on !sites in the last !weeks.',
  array(
    '!sites' => format_plural($count, '1 site', '@count sites'),
    '!weeks' => format_plural($weeks, '1 week', '@count weeks'),
  )),

Q) Yeah, I'm happy to leave pgsql out of this patch for now. We can get that for free when we port to D6 + Schema API, or hunmonk can do it as a followup issue. ;)

Still not sure what to say about J and K. Obviously, at some point, we'll need benchmarks to see what's best.

However, re: indexes on the "raw" table (wherever we do the writes), if we wanted to make it the DB's problem to weed-out duplicates, we could just keep a single index on that table for (project_uri, timestamp, site_key) and only use all the other indexes on the "processed full data" table...
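
Letting the database weed out duplicates with that single index could be sketched like this (MySQL-flavored; table and column names are assumptions drawn from earlier comments in this thread):

```php
<?php
// One composite unique key: a second check-in from the same site for the
// same project in the same (rounded) interval replaces the earlier row,
// instead of requiring a SELECT-before-INSERT in PHP.
db_query("CREATE TABLE {project_usage_raw} (
  project_uri VARCHAR(64) NOT NULL,
  timestamp INT NOT NULL,       -- rounded to the reporting interval
  site_key VARCHAR(32) NOT NULL,
  version VARCHAR(64) NOT NULL,
  UNIQUE KEY (project_uri, timestamp, site_key)
)");

db_query("REPLACE INTO {project_usage_raw} (project_uri, timestamp, site_key, version)
  VALUES ('%s', %d, '%s', '%s')", $project_uri, $timestamp, $site_key, $version);
```

The trade-off is that maintaining even one composite index makes each write a bit more expensive than an append to an index-free table, which is the benchmarking question left open above.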

dww’s picture

FYI: I just committed the changes so that project-release-serve-history.php sets the right http headers so that caching is completely disabled. See http://drupal.org/cvs?commit=74854 and http://drupal.org/node/156550. Now we just gotta get the squid cache cleared...

drewish’s picture

FileSize
13.34 KB

okay, untested but looking for some conceptual feedback. i think this will address both of our concerns. there's a {project_usage_raw} table that is used to collect the day's usage. once a day, the previous day's usage records have a project and release node id computed and are then moved into the {project_usage_day} table. weekly statistics are computed once a week.
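
The pipeline described here roughly takes this shape as a cron hook. This is a sketch: project_usage_process_day() and project_usage_process_week() are names that appear elsewhere in this thread, but the variable names and interval checks are assumptions, not the patch's actual code:

```php
<?php
// Daily: resolve project/release node ids for yesterday's raw rows and
// move them from {project_usage_raw} into {project_usage_day}.
// Weekly: roll {project_usage_day} up into per-week usage counts.
function project_usage_cron() {
  $now = time();
  if (variable_get('project_usage_last_daily', 0) <= $now - 60 * 60 * 24) {
    project_usage_process_day();
    variable_set('project_usage_last_daily', $now);
  }
  if (variable_get('project_usage_last_weekly', 0) <= $now - 60 * 60 * 24 * 7) {
    project_usage_process_week($now);
    variable_set('project_usage_last_weekly', $now);
  }
}
```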

drewish’s picture

FileSize
11.71 KB

actually tested it a bit and fixed some bugs.

drewish’s picture

oh, forgot to mention that i realized the redundant weekly tables made no sense. nids are unique, so we can have a single weekly usage table and, depending on the context in which it's queried, it'll have information on projects or releases.

drewish’s picture

FileSize
1.87 KB

and here's a little script to generate some usage data for the raw table so you can exercise the cron job.

dww’s picture

This is looking much better. Haven't tested, but visual inspection of the patch yields the following:

A) format_plural() still not used where it needs to be (as above).

B) Yuck, we lose the data on usage where all we have is the core compatibility and not a full version when we do the daily processing?

C) Might need some more indexes on {project_usage_day} for other kinds of queries, but we can always add those later once we know what we need. Better to add indexes later when we need them than add ones we don't need now.

Otherwise, this just needs a little testing and it should go in!

Oh, final question -- only a *tiny* set of sites will want this. Do we really want to ship it as part of project.module, or should it be its own new project? I'm not sure.

Thanks again!
-Derek

drewish’s picture

FileSize
12.59 KB

finally fixed the format_plural bug. also fixed some unquoted table names.

adjusted the indexes a bit on {project_usage_day}. i added keys on pid and nid to speed up the weekly tallies... then again maybe it's an infrequent enough query that we can just take the hit and drop those indexes.

B) Yuck, we lose the data on usage where all we have is the core compatibility and not a full version when we do the daily processing?

i started digging into this and found it was actually worse than that: if we couldn't locate a release, the usage wasn't counted at all. i reworked it so that usage is now logged against the project even when it can't be tied to a release.

we could try to locate the "current" release for the major version, but i don't think that's a good idea. i'd rather discard incomplete data than make assumptions about what it should be and generate results that may mislead people later. we're capturing the important information: how widely the project is used (all versions), and which releases are being used.

we can create a major version summary by summing up the usage of all the release nodes assigned to that term. the value might be low, but in my view under-counting is safe. if you think the major version information is really important, though, we could add a tid field to the raw and day tables and store it there.
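that summation is straightforward in SQL. a sketch (Drupal 5's {term_node} table maps nodes to taxonomy terms; the weekly usage table and column names here are assumed):

```sql
-- Per-term (e.g. core compatibility) totals: sum the weekly counts of
-- every release node tagged with each term. Names are illustrative.
SELECT tn.tid, w.timestamp, SUM(w.count) AS total
FROM project_usage_week w
INNER JOIN term_node tn ON tn.nid = w.nid
GROUP BY tn.tid, w.timestamp;
```

releases missing from {term_node} simply drop out of the totals, which matches the under-counting preference above.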

dww’s picture

a) Why does this still need work from your perspective? I'd like to use the status more accurately if possible, and I'd like your replies to specifically mention which one of my previous points have been solved and which are still open. It takes a lot of extra effort for me to re-review each patch and figure out where things stand. But, it's not really any more work for you as you fix things in another iteration of the patch to say "new patch that solves A, B, D, E, F, I, L from #36. C, G, H, J and K still need help" (or whatever the case may be). Thanks.

b) In release-system-speak, "major" version is the "2" in 5.x-2.0. "5.x" is the "core compatibility" or "api version". So, you're talking about stats by api version, not major version, right? Yes, I think api versions are incredibly important, and we'll definitely want to have total usage for 5.x vs. 6.x for a given project.

Ideally, we'd also have summaries by api_version + major_version. For example, for the purposes of http://drupal.org/node/150278 and http://drupal.org/node/163616 I'd LOVE to know how many sites are actually using project* 4.7.x-1.* right now. I don't care about 4.7.x-1.2 vs. 4.7.x-1.3 so much, but total usage of 4.7.x-1.* would be invaluable. Obviously, that's impossible, but in the future, it'd be amazing to give project maintainers this kind of info.

Remember, aside from the general project-metrics applications of this data, I'd also like to expose the full data to project maintainers somehow. They don't just want to know what usage "bucket" they're in, they want to see the actual numbers. ;)

Thanks!
-Derek

hass’s picture

Please take a look at this small piece of code. There seems to be a small bug (a duplicated description key):

+        return array(
+          'value' => log($result), 
+          'description' => t('This project is used on @count sites in the last @weeks.', array('@count' => $count, '@weeks' => $weeks)),
+          'description' => t('This project is used on !sites in the last !weeks.', array(
+            '!sites' => format_plural($count, '1 site', '@count sites'),
+            '!weeks' => format_plural($weeks, '1 week', '@count weeks'),
+          )),
+        );

Additionally, what about the two # characters here?

+  // Figure out if it's been 24 hours since our last daily processing.
+#  if (variable_get('project_usage_last_daily', 0) <= ($time - (60 * 60 * 24))) {
+    project_usage_process_day();
+    variable_set('project_usage_last_daily', $time);
+#  }
drewish’s picture

Status: Needs work » Needs review
FileSize
12.96 KB

dww, sorry about that. i'd forgotten to change the status; it should have been needs review. i didn't see you on IRC so i assumed you were off at the day job. i appreciate you taking a look at this so quickly. i'd decided that rather than posting a follow-up just to change the status, i'd look at adding the api version term. turns out it's not that hard, and if it'll make my mentor happy... well, just consider it done ;)

i'm also into the idea of making this available to maintainers. i'm already thinking about adding some tabs on the project and release nodes to display a table with usage.

so looking back up the list:
#36 A-R are either fixed or no longer relevant.
#39 G, O are fixed. Q (pgsql) is open ended.
#45 Fixed A and B. C (indexes) is also an open question.
#47 A is addressed here. B you're correct i meant API version rather than major version, that's addressed in this patch.

hass, thanks, i've fixed those.

drewish’s picture

Realized I'd missed the unlettered question at the bottom of #45:

Oh, final question -- only a *tiny* set of sites will want this. Do we really want to ship it as part of project.module, or should it be its own new project? I'm not sure.

I think it should be left in the project module. project_usage.module depends on release/project-release-serve-history.php for its database input. Moving it off to a separate project would just increase the odds of incompatible changes being made to one of the two modules. It would also complicate determining which versions of the two projects are compatible.

Worst case with the current setup is that users will enable the module when they don't really need it. Since there won't be any data coming in, there won't be much additional overhead in the cron job, so I don't really see the harm. Maybe we just need to add a note to the module description that says "you really don't need this module".

dww’s picture

Re: #49 -- great, seems like everything's accounted for then. I'll give that patch a thorough (hopefully final) review and commit ASAP (if all goes well sometime this afternoon, we'll have to see).

Re: #50 -- sounds reasonable to me. As you can see from my original patch here, that's what I had in mind. I just wanted to get a second opinion. ;)

Thanks,
-Derek

drewish’s picture

FileSize
15.08 KB

These are some comments I made in an email to the infrastructure list:

It's really hard to gauge the table sizes without having an idea of how many sites are using Drupal. All the numbers turn on how many sites will be reporting usage and how many modules each has installed. Doing the math with "small" numbers like 5,000 sites with 15 modules each adds up to 27 million records over a year, leading me to the conclusion that we probably won't be storing the daily data for long. The weekly data would fare better, as its number of rows comes down to the number of project and release nodes on drupal.org times the number of weeks we retain the usage data.

I'm also starting to think that, for the purposes of pruning weekly data, it might be best to separate the weekly summary tables for projects and releases. That would make it easier to have separate settings to discard project usage after X weeks and release usage after Y weeks. Since there are usually many releases for a given project, shortening the life of release usage would create corresponding savings.

The other benefit of separate weekly usage tables would be that we could add a column to the project table to differentiate the API version. This would allow us to track weekly usage of, say, views 4.7.x vs. 5.x. In my mind that would be of more long-term importance than the usage of a specific release.

I'm attaching a patch that implements the changes discussed above.

It drops {project_usage_week} in favor of {project_usage_project_week} and {project_usage_release_week}. {project_usage_project_week} now has an additional tid field.

PROJECT_USAGE_YEAR was incorrectly defined as 356 days; fixed that.

Added separate settings to control how long the weekly project and release usage data is retained.

drewish’s picture

Status: Needs review » Needs work

I've been trying to add a UI to the usage data so I can more easily verify that it's being computed correctly. In the process I realized that it's probably a bug that the cron hook only processes the previous week: if it doesn't get run for a week, it doesn't catch up. I'm working on a fix.

drewish’s picture

Status: Needs work » Needs review
FileSize
13.87 KB

project_usage_uninstall() was deleting the wrong tables. Fixed that.

Added PROJECT_USAGE_DAY to define the number of seconds in a day.

Got the weekly processing checking back to the last weekly run and computing all weeks between then and now.
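the catch-up logic can be sketched in plain PHP (a standalone approximation: the real module stores its bookkeeping with Drupal's variable_get()/variable_set(), and the function name here is hypothetical):

```php
<?php
// Compute the start timestamp of every complete week between the last
// weekly run and now, so a cron hook that missed several weeks can
// tally each one instead of only the most recent.
define('PROJECT_USAGE_WEEK', 60 * 60 * 24 * 7);  // seconds in a week

function project_usage_weeks_to_process($last_run, $now) {
  $weeks = array();
  // Only weeks that have fully elapsed are ready to be tallied.
  for ($start = $last_run; $start + PROJECT_USAGE_WEEK <= $now; $start += PROJECT_USAGE_WEEK) {
    $weeks[] = $start;
  }
  return $weeks;
}

// Three full weeks since the last run yields three week-start stamps.
print count(project_usage_weeks_to_process(0, 3 * PROJECT_USAGE_WEEK));
```

each returned timestamp would then be handed to the weekly tally, and the last processed week stored back as the new high-water mark.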

I also removed the metrics stuff and put that into the metrics module as a .inc file. I'll look at moving it back in once this is committed.

dww’s picture

Status: Needs review » Fixed

After further review, committed a modified version of this to HEAD (http://drupal.org/cvs?commit=76685).

Oddly, it still wasn't working, either in local testing or on d.o. In IRC, we tracked it down to the fact that you can't include bootstrap.inc if you've already defined a copy of check_plain() (*doh*!). Now fixed (http://drupal.org/cvs?commit=76691).

So, d.o is hereby collecting usage stats! WOOT!

mysql> select count(*) from project_usage_raw;
+----------+
| count(*) |
+----------+
|       58 | 
+----------+
1 row in set (0.02 sec)

I'm sure that number is going to grow significantly over the next week... ;)

At this point, let's move all future work on metrics and ways for people to view these stats into new issues.

Thanks drewish!
-Derek

drewish’s picture

Component: Releases » Usage statistics

Sweet! I've got a follow up patch to make this data visible: #165380.

Anonymous’s picture

Status: Fixed » Closed (fixed)
Vacilando’s picture

Is there a module or a live stats page listing the module and theme usage stats that users keep sending to http://drupal.org/xmlrpc.php?

Cheers,

Tomas

drewish’s picture

Tomas, that's still a work in progress: http://drupal.org/node/165380

Vacilando’s picture

Hi Andrew,

However, I still don't understand the relation between this and the project you refer to, which I assume means the "project" module ( http://drupal.org/project/project ).

What I would like to know is: can this or another project (perhaps one in development) query the data that is *already* being sent by many people to Drupal's XML-RPC server ( http://drupal.org/xmlrpc.php )? Surely there must be a lot of information collected already. Is there a way to query it, using this or another module, or is there a regular dump of the stats on some page? If not, what's the roadmap to get there? How can I or others help (I could help with module coding)?

Can you please elaborate.

Thanks,

Tomas

sepeck’s picture

Currently the data is collected but not accessible without a manual database query. There are no regular database dumps at this time. There aren't that many sites actually sending data, as it hasn't really been a publicized feature.

Road maps, in general, are not a Drupal thing.

Please see the referenced issue for details on the work already in progress.