"Getting started with Drupal" statistics

Dries - February 10, 2009 - 17:09
Project:Project
Version:6.x-1.x-dev
Component:Usage statistics
Category:feature request
Priority:normal
Assigned:Dries
Status:active
Issue tags:drupal.org redesign
Description

For the "Getting started with Drupal" page as envisioned by Mark Boulton (http://drupal.markboultondesign.com/iteration11/getstarted.html), we need to

  1. Retrieve a list of the top 4 modules
  2. Retrieve a list of the top 4 themes
  3. Retrieve a list of the top 3 (or 4) translations

I'm thinking we should implement a function for this: project_get_most_popular_projects($tid, $limit = 5). The function would use the usage statistics to determine the top $limit projects with the specified taxonomy term $tid.

I suggest that we maintain that function in the project module. I can give this a try tomorrow if that sounds like the right thing to do.

#1

aclight - February 10, 2009 - 17:32

It seems to me that the better solution to this would be to use Views. Granted, project_usage isn't Views enabled yet, but since Views 2 can support using views on non-nodes I think that Views could be used.

If this did go into project* directly I don't think we'd want it in the project module, but instead in project_usage. For sites that didn't run project_usage (that is probably all sites running project* with the exception of drupal.org) this function wouldn't make any sense.

#2

Dries - February 10, 2009 - 18:05

Using Views seems overkill to me -- plus the query is non-trivial, IMO.

I'd basically implement a function in project_usage that executes the following query:

For modules:

mysql> SELECT n.nid, n.title, pw.count FROM node n INNER JOIN project_projects pp ON n.nid = pp.nid INNER JOIN project_usage_week_project pw ON pp.nid = pw.nid INNER JOIN term_node tn ON tn.nid = pp.nid WHERE pw.timestamp = (SELECT MAX(timestamp) FROM project_usage_week_project) AND tn.tid = 14 AND n.status = 1 ORDER BY pw.count DESC LIMIT 5;
+--------+--------------------------------+-------+
| nid    | title                          | count |
+--------+--------------------------------+-------+
|  38878 | Views                          | 44441 |
|  48429 | Content Construction Kit (CCK) | 39922 |
| 106016 | Token                          | 36428 |
|  17345 | Pathauto                       | 28593 |
| 122616 | Update Status                  | 23832 |
+--------+--------------------------------+-------+

For themes:

mysql> SELECT n.nid, n.title, pw.count FROM node n INNER JOIN project_projects pp ON n.nid = pp.nid INNER JOIN project_usage_week_project pw ON pp.nid = pw.nid INNER JOIN term_node tn ON tn.nid = pp.nid WHERE pw.timestamp = (SELECT MAX(timestamp) FROM project_usage_week_project) AND tn.tid = 15 AND n.status = 1 ORDER BY pw.count DESC LIMIT 5;
+--------+------------------+-------+
| nid    | title            | count |
+--------+------------------+-------+
|  88566 | Zen              |  7085 |
| 315548 | Acquia Marina    |  5182 |
| 185969 | Marinelli        |  3313 |
| 295505 | Pixture Reloaded |  1600 |
| 217032 | Four Seasons     |  1507 |
+--------+------------------+-------+

For translations we'd have to use a hard-coded list.

The query seems to be fast as it is always using the primary key:

mysql> EXPLAIN SELECT n.nid, n.title, pw.count FROM node n INNER JOIN project_projects pp ON n.nid = pp.nid INNER JOIN project_usage_week_project pw ON pp.nid = pw.nid INNER JOIN term_node tn ON tn.nid = pp.nid WHERE pw.timestamp = (SELECT MAX(timestamp) FROM project_usage_week_project) AND tn.tid = 14 AND n.status = 1 ORDER BY pw.count DESC LIMIT 10;
+----+-------------+----------------------------+--------+------------------------------------------------------------------+---------+---------+---------------------+-------+----------------------------------------------+
| id | select_type | table                      | type   | possible_keys                                                    | key     | key_len | ref                 | rows  | Extra                                        |
+----+-------------+----------------------------+--------+------------------------------------------------------------------+---------+---------+---------------------+-------+----------------------------------------------+
|  1 | PRIMARY     | pp                         | index  | PRIMARY                                                          | PRIMARY | 4       | NULL                |  3906 | Using index; Using temporary; Using filesort |
|  1 | PRIMARY     | tn                         | eq_ref | PRIMARY,nid,tid                                                  | PRIMARY | 8       | const,drupal.pp.nid |     1 | Using index                                  |
|  1 | PRIMARY     | pw                         | ref    | PRIMARY                                                          | PRIMARY | 8       | drupal.tn.nid,const |     1 | Using where                                  |
|  1 | PRIMARY     | n                          | ref    | PRIMARY,node_status_type,nid,tracker_global,node_status_type_uid | PRIMARY | 4       | drupal.tn.nid       |     1 | Using where                                  |
|  2 | SUBQUERY    | project_usage_week_project | index  | NULL                                                             | PRIMARY | 12      | NULL                | 84603 | Using index                                  |
+----+-------------+----------------------------+--------+------------------------------------------------------------------+---------+---------+---------------------+-------+----------------------------------------------+
5 rows in set (0.00 sec)

#3

dww - February 10, 2009 - 23:36
Component:Projects» Usage statistics

Note: we already export usage data into the document we send to the solr index. See the new project browsing pages where you can sort by module usage:

http://d6.drupal.org/project/modules

(grrr, at least you could when solr was up). So, it might be easier to just ask solr for this info, instead of the main DB.

In any event, this should either live in project_usage, or in project_solr, but definitely not project itself.

If you're going to do this in project_usage, a few notes on your queries:

A) SELECT MAX(timestamp) FROM project_usage_week_project is very expensive and unnecessary. See variable_get('project_usage_active_weeks').

B) INNER JOIN project_projects pp is unnecessary until #372061: Form_alter the theme vocabularies so that they only show up for theme node editing is done, which might be a long time. You can just have WHERE n.type == 'project_project' for now.

C) If this lives in project_usage, you should strongly consider caching the results -- they only change at most once per week, so it's silly to redo this query on every page load of "Getting started" -- we should just clear the cache when we do the weekly processing.

#4

Dries - February 11, 2009 - 09:28

Thanks for the feedback, dww.

  1. Using the WHERE n.type == 'project_project' actually slows down the query because we can no longer use primary keys only. I did some basic testing on the d.o database and the WHERE-clause actually slows down the query. The existing query completes in 0.8 seconds when not cached by MySQL, and in 0.00 seconds when cached by MySQL.
  2. SELECT MAX(timestamp) FROM project_usage_week_project completes in 0.01 seconds. I'll add the caching but the query isn't that bad as you seem to suggest it is. It remains to be seen if variable_get() is actually much faster -- I guess it could be a lot faster the day we have a lot more data in the weekly usage table.
  3. Putting this functionality in project_usage.module sounds like the right thing to do. I don't think it makes sense to move this to Solr -- it seems like this is perfect use case for using a relational database, and faster too.

Now I've done some additional testing and validation, I'll start working on a patch, and then hopefully we can take it from there. We'd love to implement the "Getting started with Drupal" page as soon as we can.

#5

Dries - February 11, 2009 - 09:51

Patch attached.

I don't have a real-life project module environment setup but I manually created some dummy tables for testing. It seems to be working.

Because there are two parameters, $tid and $limit, I decided not to add caching to this function. I'll add caching to the calling function instead. We'll call this function from custom code in the Drupal.org Customizations project.

The caching in the calling function should also eliminate the need to use variable_get('project_usage_active_weeks').

It would be great to get this committed so we can make progress with the "Getting started with Drupal" page. If you want me to clean certain things up, let me know (or at this point they might be easy enough to clean-up before committing).

AttachmentSize
most-active-projects.patch 1.1 KB

#6

aclight - February 11, 2009 - 12:27
Status:active» needs work

1. Typo: +  return $projecs;

2. I think the db_query call needs to first be run through db_rewrite_sql(). Here's (yet another) case where caching the results of the function could lead to inadvertent information disclosure of project titles, so perhaps it's better that the patch you've provided doesn't implement the caching itself. Caching what this function returns for d.o shouldn't be a problem.

3. Instead of putting a "LIMIT" clause in the query itself, isn't it safer to use db_query_range()? My understanding is that pgsql doesn't support using LIMIT the same way that MySQL does. I don't recall whether project_usage even works on pgsql, but we might as well make this query db agnostic from the start.

#7

Dries - February 11, 2009 - 14:29
Status:needs work» active

DamZ and I will brainstorm about this some more to explore using Solr.

#8

drewish - March 11, 2009 - 18:18

subscribing.

#9

courtney - April 10, 2009 - 18:58

Has there been any more thought on this? We're working on the getting started and download pages here at the redesign sprint in SF and would like to move forward with implementing a solution.

I've updated the function Dries supplied slightly since we don't need the count for the display on the landing pages and added another function for returning Newly added projects and attached this as a patch.

We also need a function for returning Most Active Projects, which we are thinking are the projects with the most recent commits or issue updates?

AttachmentSize
most_active_and_most_recent_projects.patch 1.91 KB

#10

dww - April 12, 2009 - 00:13

Oh, I just realized that my concern about SELECT MAX(timestamp) FROM project_usage_week_project being very expensive and unnecessary is actually about {project_usage_daily}. {project_usage_week_project} doesn't have nearly as many rows, so it's not so terrible (as Dries's benchmarks suggest). However, I still think it's probably worth just using variable_get('project_usage_active_weeks'), or better yet, the new project_usage_get_current_active_week() function from #371968: Clean APIs to retrieve project related statistics.

Still don't know what the deal is with using Solr for this -- I'll try to track down Damien.

 
 

Drupal is a registered trademark of Dries Buytaert.