Expose project statistics to the solr index [#372243]

For the "Getting started with Drupal" page as envisioned by Mark Boulton (http://drupal.markboultondesign.com/iteration11/getstarted.html), we need to

Retrieve a list of the top 4 modules
Retrieve a list of the top 4 themes
Retrieve a list of the top 3 (or 4) translations

I'm thinking we should implement a function for this: project_get_most_popular_projects($tid, $limit = 5). The function would use the usage statistics to determine the top $limit projects with the specified taxonomy term $tid.

I suggest that we maintain that function in the project module. I can give this a try tomorrow if that sounds like the right thing to do.

Comment	File	Size	Author
#9	most_active_and_most_recent_projects.patch	1.91 KB	courtney
#5	most-active-projects.patch	1.1 KB	dries

Comments

Comment #1

aclight commented 10 February 2009 at 17:32

It seems to me that the better solution to this would be to use Views. Granted, project_usage isn't Views enabled yet, but since Views 2 can support using views on non-nodes I think that Views could be used.

If this did go into project* directly I don't think we'd want it in the project module, but instead in project_usage. For sites that didn't run project_usage (that is probably all sites running project* with the exception of drupal.org) this function wouldn't make any sense.

Comment #2

dries commented 10 February 2009 at 18:05

Using Views seems overkill to me -- plus the query is non-trivial, IMO.

I'd basically implement a function in project_usage that executes the following query:

For modules:

mysql> SELECT n.nid, n.title, pw.count FROM node n INNER JOIN project_projects pp ON n.nid = pp.nid INNER JOIN project_usage_week_project pw ON pp.nid = pw.nid INNER JOIN term_node tn ON tn.nid = pp.nid WHERE pw.timestamp = (SELECT MAX(timestamp) FROM project_usage_week_project) AND tn.tid = 14 AND n.status = 1 ORDER BY pw.count DESC LIMIT 5;
+--------+--------------------------------+-------+
| nid    | title                          | count |
+--------+--------------------------------+-------+
|  38878 | Views                          | 44441 | 
|  48429 | Content Construction Kit (CCK) | 39922 | 
| 106016 | Token                          | 36428 | 
|  17345 | Pathauto                       | 28593 | 
| 122616 | Update Status                  | 23832 | 
+--------+--------------------------------+-------+

For themes:

mysql> SELECT n.nid, n.title, pw.count FROM node n INNER JOIN project_projects pp ON n.nid = pp.nid INNER JOIN project_usage_week_project pw ON pp.nid = pw.nid INNER JOIN term_node tn ON tn.nid = pp.nid WHERE pw.timestamp = (SELECT MAX(timestamp) FROM project_usage_week_project) AND tn.tid = 15 AND n.status = 1 ORDER BY pw.count DESC LIMIT 5;
+--------+------------------+-------+
| nid    | title            | count |
+--------+------------------+-------+
|  88566 | Zen              |  7085 | 
| 315548 | Acquia Marina    |  5182 | 
| 185969 | Marinelli        |  3313 | 
| 295505 | Pixture Reloaded |  1600 | 
| 217032 | Four Seasons     |  1507 | 
+--------+------------------+-------+

For translations we'd have to use a hard-coded list.

The query seems to be fast as it is always using the primary key:

mysql> EXPLAIN SELECT n.nid, n.title, pw.count FROM node n INNER JOIN project_projects pp ON n.nid = pp.nid INNER JOIN project_usage_week_project pw ON pp.nid = pw.nid INNER JOIN term_node tn ON tn.nid = pp.nid WHERE pw.timestamp = (SELECT MAX(timestamp) FROM project_usage_week_project) AND tn.tid = 14 AND n.status = 1 ORDER BY pw.count DESC LIMIT 10;
+----+-------------+----------------------------+--------+------------------------------------------------------------------+---------+---------+---------------------+-------+----------------------------------------------+
| id | select_type | table                      | type   | possible_keys                                                    | key     | key_len | ref                 | rows  | Extra                                        |
+----+-------------+----------------------------+--------+------------------------------------------------------------------+---------+---------+---------------------+-------+----------------------------------------------+
|  1 | PRIMARY     | pp                         | index  | PRIMARY                                                          | PRIMARY | 4       | NULL                |  3906 | Using index; Using temporary; Using filesort | 
|  1 | PRIMARY     | tn                         | eq_ref | PRIMARY,nid,tid                                                  | PRIMARY | 8       | const,drupal.pp.nid |     1 | Using index                                  | 
|  1 | PRIMARY     | pw                         | ref    | PRIMARY                                                          | PRIMARY | 8       | drupal.tn.nid,const |     1 | Using where                                  | 
|  1 | PRIMARY     | n                          | ref    | PRIMARY,node_status_type,nid,tracker_global,node_status_type_uid | PRIMARY | 4       | drupal.tn.nid       |     1 | Using where                                  | 
|  2 | SUBQUERY    | project_usage_week_project | index  | NULL                                                             | PRIMARY | 12      | NULL                | 84603 | Using index                                  | 
+----+-------------+----------------------------+--------+------------------------------------------------------------------+---------+---------+---------------------+-------+----------------------------------------------+
5 rows in set (0.00 sec)

Comment #3

dww

we/he/they

commented 10 February 2009 at 23:36

Component:

Projects

» Usage statistics

Note: we already export usage data into the document we send to the solr index. See the new project browsing pages where you can sort by module usage:

http://d6.drupal.org/project/modules

(grrr, at least you could when solr was up). So, it might be easier to just ask solr for this info, instead of the main DB.

In any event, this should either live in project_usage, or in project_solr, but definitely not project itself.

If you're going to do this in project_usage, a few notes on your queries:

A) SELECT MAX(timestamp) FROM project_usage_week_project is very expensive and unnecessary. See variable_get('project_usage_active_weeks').

B) INNER JOIN project_projects pp is unnecessary until #372061: Form_alter the theme vocabularies so that they only show up for theme node editing is done, which might be a long time. You can just have WHERE n.type == 'project_project' for now.

C) If this lives in project_usage, you should strongly consider caching the results -- they only change at most once per week, so it's silly to redo this query on every page load of "Getting started" -- we should just clear the cache when we do the weekly processing.

Comment #4

dries commented 11 February 2009 at 09:28

Thanks for the feedback, dww.

Using the WHERE n.type == 'project_project' actually slows down the query because we can no longer use primary keys only. I did some basic testing on the d.o database and the WHERE-clause actually slows down the query. The existing query completes in 0.8 seconds when not cached by MySQL, and in 0.00 seconds when cached by MySQL.
SELECT MAX(timestamp) FROM project_usage_week_project completes in 0.01 seconds. I'll add the caching but the query isn't that bad as you seem to suggest it is. It remains to be seen if variable_get() is actually much faster -- I guess it could be a lot faster the day we have a lot more data in the weekly usage table.
Putting this functionality in project_usage.module sounds like the right thing to do. I don't think it makes sense to move this to Solr -- it seems like this is perfect use case for using a relational database, and faster too.

Now I've done some additional testing and validation, I'll start working on a patch, and then hopefully we can take it from there. We'd love to implement the "Getting started with Drupal" page as soon as we can.

Comment #5

dries commented 11 February 2009 at 09:51

Status	File	Size
new	most-active-projects.patch	1.1 KB

Patch attached.

I don't have a real-life project module environment setup but I manually created some dummy tables for testing. It seems to be working.

Because there are two parameters, $tid and $limit, I decided not to add caching to this function. I'll add caching to the calling function instead. We'll call this function from custom code in the Drupal.org Customizations project.

The caching in the calling function should also eliminate the need to use variable_get('project_usage_active_weeks').

It would be great to get this committed so we can make progress with the "Getting started with Drupal" page. If you want me to clean certain things up, let me know (or at this point they might be easy enough to clean-up before committing).

Comment #6

aclight commented 11 February 2009 at 12:27

Status:

Active

» Needs work

1. Typo: + return $projecs;

2. I think the db_query call needs to first be run through db_rewrite_sql(). Here's (yet another) case where caching the results of the function could lead to inadvertent information disclosure of project titles, so perhaps it's better that the patch you've provided doesn't implement the caching itself. Caching what this function returns for d.o shouldn't be a problem.

3. Instead of putting a "LIMIT" clause in the query itself, isn't it safer to use db_query_range()? My understanding is that pgsql doesn't support using LIMIT the same way that MySQL does. I don't recall whether project_usage even works on pgsql, but we might as well make this query db agnostic from the start.

Comment #7

dries commented 11 February 2009 at 14:29

Status:

Needs work

» Active

DamZ and I will brainstorm about this some more to explore using Solr.

Comment #8

drewish commented 11 March 2009 at 18:18

subscribing.

Comment #9

courtney commented 10 April 2009 at 18:58

Status	File	Size
new	most_active_and_most_recent_projects.patch	1.91 KB

Has there been any more thought on this? We're working on the getting started and download pages here at the redesign sprint in SF and would like to move forward with implementing a solution.

I've updated the function Dries supplied slightly since we don't need the count for the display on the landing pages and added another function for returning Newly added projects and attached this as a patch.

We also need a function for returning Most Active Projects, which we are thinking are the projects with the most recent commits or issue updates?

Comment #10

dww

we/he/they

commented 12 April 2009 at 00:13

Oh, I just realized that my concern about SELECT MAX(timestamp) FROM project_usage_week_project being very expensive and unnecessary is actually about {project_usage_daily}. {project_usage_week_project} doesn't have nearly as many rows, so it's not so terrible (as Dries's benchmarks suggest). However, I still think it's probably worth just using variable_get('project_usage_active_weeks'), or better yet, the new project_usage_get_current_active_week() function from #371968: Clean APIs to retrieve project related statistics.

Still don't know what the deal is with using Solr for this -- I'll try to track down Damien.

Comment #11

lisarex commented 17 December 2009 at 02:45

Linking this from the Redesign project #661692: Meta issue for modules Project and Project issue tracking because this issue was tagged 'drupal.org redesign'

Comment #12

jwilson3

English

Ecuador

commented 5 January 2010 at 20:08

subscribing, same thing going on over here: #666120: Provide API to get the top N projects for various categories

I'll also point out that these same lists of most popular and most active modules / themes / translations / profiles / etc (aka project metrics), would potentially be used to build the lists on the Download & Extend homepage, as well as the sidebar blocks on the D&E Modules, D&E Themes page and, even though they weren't mocked-up in the redesign, presumably in similar sidebar blocks on the translations and profiles pages as well.

Comment #13

jwilson3

English

Ecuador

commented 6 January 2010 at 06:47

In the patches above, I would say that while project_usage_get_most_active_projects makes sense from the point of view of determining what modules are 'actively' reporting themselves... from an end user point of view, this is a metric of "popularity" not "activity".

Also, project_usage_get_most_recent_projects in the patch in #9 should actually be project_usage_get_newest_projects.

This input is based on the redesign mockups that use the terminology 'most popular', 'most installed', 'highest rated', and 'most active'.

An "active" project is one that is currently under heavy development. Measuring this is fairly non-trivial.
A "popular" project is one that has been used heavily. (For the moment, until popularity could be determined by some other method, this is synonymous with most installed).
A "new" project is one that has a recent creation date.
A "recent" project is one that has recently added a new release (as in recently updated).

"Most downloaded" statistics should probably be gathered from server logs (?)
"Highest Rated" statistics will depend on a separate node rating system, that could be susceptible to being "gamed". This is discussed elsewhere (sorry no link :(

Comment #14

damien tournoud commented 9 January 2010 at 00:13

Title:	"Getting started with Drupal" statistics	» Add an easy way to get simple statistics
Assigned:	dries	» damien tournoud

Actually we want to use solr for that now :) After all, we have all the nice data in there, we just need a simple way to grab a few statistics. Assigning myself.

Comment #15

dww

we/he/they

commented 12 July 2010 at 06:39

Title:	Add an easy way to get simple statistics	» Expose project statistics to the solr index
Component:	Usage statistics	» ApacheSolr integration
Assigned:	damien tournoud	» Unassigned

I repurposed #666120: Provide API to get the top N projects for various categories to make it more clear. That's about taking stuff out of Solr and making nice lists with it. This issue should be about making sure everything we care about is actually in the solr index in the first place. We've already got usage stats in there. So, now we need to add other things like download stats (see #32124: Enable download statistics), "most active" (whatever we're going to define that to be), "most reviewed" (once http://groups.drupal.org/node/80004 is done and deployed), etc.

Comment #16

hunmonk commented 28 July 2010 at 14:33

Issue tags:

+drupal.org redesign sprint 1, +drupal.org redesign project

Comment #17

dww

we/he/they

commented 29 July 2010 at 17:44

Status:

Active

» Fixed

Upon further discussion with csevb10 and hunmonk, we decided it's pointless to keep kicking this issue around. We have everything in the solr document we currently care about. As we add more features (download stats, reviews, etc) we can either handle adding those to the index in the issue where we add the feature, or as a more specific/focused/clear issue like "add download stats to the solr index" if needed.

Comment #18

12 August 2010 at 17:50

Status:	Fixed	» Closed (fixed)
Issue tags:	-drupal.org redesign, -drupal.org redesign sprint 1, -drupal.org redesign project

Automatically closed -- issue fixed for 2 weeks with no activity.

Comment #19

12 August 2010 at 17:50

Issue tags:

+drupal.org redesign, +drupal.org redesign sprint 1, +drupal.org redesign project

Restoring issue tags, see #2125755: System messages removed all issue tags during D7 upgrade.

Expose project statistics to the solr index

Comments

Comment #1

Comment #2

Comment #3

Comment #4

Comment #5

Comment #6

Comment #7

Comment #8

Comment #9

Comment #10

Comment #11

Comment #12

Comment #13

Comment #14

Comment #15

Comment #16

Comment #17

Comment #18

Comment #19

News items

Our community

Documentation

Drupal code base

Governance of community