Posted by Dries on February 10, 2009 at 5:09pm
| Project: | Project |
| Version: | 6.x-1.x-dev |
| Component: | ApacheSolr integration |
| Category: | feature request |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | closed (fixed) |
| Issue tags: | drupal.org redesign, drupal.org redesign project, drupal.org redesign sprint 1 |
Issue Summary
For the "Getting started with Drupal" page as envisioned by Mark Boulton (http://drupal.markboultondesign.com/iteration11/getstarted.html), we need to
- Retrieve a list of the top 4 modules
- Retrieve a list of the top 4 themes
- Retrieve a list of the top 3 (or 4) translations
I'm thinking we should implement a function for this: project_get_most_popular_projects($tid, $limit = 5). The function would use the usage statistics to determine the top $limit projects with the specified taxonomy term $tid.
I suggest that we maintain that function in the project module. I can give this a try tomorrow if that sounds like the right thing to do.
Comments
#1
It seems to me that the better solution to this would be to use Views. Granted, project_usage isn't Views enabled yet, but since Views 2 can support using views on non-nodes I think that Views could be used.
If this did go into project* directly I don't think we'd want it in the project module, but instead in project_usage. For sites that didn't run project_usage (that is probably all sites running project* with the exception of drupal.org) this function wouldn't make any sense.
#2
Using Views seems overkill to me -- plus the query is non-trivial, IMO.
I'd basically implement a function in project_usage that executes the following query:
For modules:
mysql> SELECT n.nid, n.title, pw.count FROM node n INNER JOIN project_projects pp ON n.nid = pp.nid INNER JOIN project_usage_week_project pw ON pp.nid = pw.nid INNER JOIN term_node tn ON tn.nid = pp.nid WHERE pw.timestamp = (SELECT MAX(timestamp) FROM project_usage_week_project) AND tn.tid = 14 AND n.status = 1 ORDER BY pw.count DESC LIMIT 5;+--------+--------------------------------+-------+
| nid | title | count |
+--------+--------------------------------+-------+
| 38878 | Views | 44441 |
| 48429 | Content Construction Kit (CCK) | 39922 |
| 106016 | Token | 36428 |
| 17345 | Pathauto | 28593 |
| 122616 | Update Status | 23832 |
+--------+--------------------------------+-------+
For themes:
mysql> SELECT n.nid, n.title, pw.count FROM node n INNER JOIN project_projects pp ON n.nid = pp.nid INNER JOIN project_usage_week_project pw ON pp.nid = pw.nid INNER JOIN term_node tn ON tn.nid = pp.nid WHERE pw.timestamp = (SELECT MAX(timestamp) FROM project_usage_week_project) AND tn.tid = 15 AND n.status = 1 ORDER BY pw.count DESC LIMIT 5;+--------+------------------+-------+
| nid | title | count |
+--------+------------------+-------+
| 88566 | Zen | 7085 |
| 315548 | Acquia Marina | 5182 |
| 185969 | Marinelli | 3313 |
| 295505 | Pixture Reloaded | 1600 |
| 217032 | Four Seasons | 1507 |
+--------+------------------+-------+
For translations we'd have to use a hard-coded list.
The query seems to be fast as it is always using the primary key:
mysql> EXPLAIN SELECT n.nid, n.title, pw.count FROM node n INNER JOIN project_projects pp ON n.nid = pp.nid INNER JOIN project_usage_week_project pw ON pp.nid = pw.nid INNER JOIN term_node tn ON tn.nid = pp.nid WHERE pw.timestamp = (SELECT MAX(timestamp) FROM project_usage_week_project) AND tn.tid = 14 AND n.status = 1 ORDER BY pw.count DESC LIMIT 10;+----+-------------+----------------------------+--------+------------------------------------------------------------------+---------+---------+---------------------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------------+--------+------------------------------------------------------------------+---------+---------+---------------------+-------+----------------------------------------------+
| 1 | PRIMARY | pp | index | PRIMARY | PRIMARY | 4 | NULL | 3906 | Using index; Using temporary; Using filesort |
| 1 | PRIMARY | tn | eq_ref | PRIMARY,nid,tid | PRIMARY | 8 | const,drupal.pp.nid | 1 | Using index |
| 1 | PRIMARY | pw | ref | PRIMARY | PRIMARY | 8 | drupal.tn.nid,const | 1 | Using where |
| 1 | PRIMARY | n | ref | PRIMARY,node_status_type,nid,tracker_global,node_status_type_uid | PRIMARY | 4 | drupal.tn.nid | 1 | Using where |
| 2 | SUBQUERY | project_usage_week_project | index | NULL | PRIMARY | 12 | NULL | 84603 | Using index |
+----+-------------+----------------------------+--------+------------------------------------------------------------------+---------+---------+---------------------+-------+----------------------------------------------+
5 rows in set (0.00 sec)
#3
Note: we already export usage data into the document we send to the solr index. See the new project browsing pages where you can sort by module usage:
http://d6.drupal.org/project/modules
(grrr, at least you could when solr was up). So, it might be easier to just ask solr for this info, instead of the main DB.
In any event, this should either live in project_usage, or in project_solr, but definitely not project itself.
If you're going to do this in project_usage, a few notes on your queries:
A)
SELECT MAX(timestamp) FROM project_usage_week_projectis very expensive and unnecessary. See variable_get('project_usage_active_weeks').B)
INNER JOIN project_projects ppis unnecessary until #372061: Form_alter the theme vocabularies so that they only show up for theme node editing is done, which might be a long time. You can just haveWHERE n.type == 'project_project'for now.C) If this lives in project_usage, you should strongly consider caching the results -- they only change at most once per week, so it's silly to redo this query on every page load of "Getting started" -- we should just clear the cache when we do the weekly processing.
#4
Thanks for the feedback, dww.
WHERE n.type == 'project_project'actually slows down the query because we can no longer use primary keys only. I did some basic testing on the d.o database and the WHERE-clause actually slows down the query. The existing query completes in 0.8 seconds when not cached by MySQL, and in 0.00 seconds when cached by MySQL.SELECT MAX(timestamp) FROM project_usage_week_projectcompletes in 0.01 seconds. I'll add the caching but the query isn't that bad as you seem to suggest it is. It remains to be seen ifvariable_get()is actually much faster -- I guess it could be a lot faster the day we have a lot more data in the weekly usage table.Now I've done some additional testing and validation, I'll start working on a patch, and then hopefully we can take it from there. We'd love to implement the "Getting started with Drupal" page as soon as we can.
#5
Patch attached.
I don't have a real-life project module environment setup but I manually created some dummy tables for testing. It seems to be working.
Because there are two parameters,
$tidand$limit, I decided not to add caching to this function. I'll add caching to the calling function instead. We'll call this function from custom code in the Drupal.org Customizations project.The caching in the calling function should also eliminate the need to use
variable_get('project_usage_active_weeks').It would be great to get this committed so we can make progress with the "Getting started with Drupal" page. If you want me to clean certain things up, let me know (or at this point they might be easy enough to clean-up before committing).
#6
1. Typo:
+ return $projecs;2. I think the db_query call needs to first be run through db_rewrite_sql(). Here's (yet another) case where caching the results of the function could lead to inadvertent information disclosure of project titles, so perhaps it's better that the patch you've provided doesn't implement the caching itself. Caching what this function returns for d.o shouldn't be a problem.
3. Instead of putting a "LIMIT" clause in the query itself, isn't it safer to use db_query_range()? My understanding is that pgsql doesn't support using LIMIT the same way that MySQL does. I don't recall whether project_usage even works on pgsql, but we might as well make this query db agnostic from the start.
#7
DamZ and I will brainstorm about this some more to explore using Solr.
#8
subscribing.
#9
Has there been any more thought on this? We're working on the getting started and download pages here at the redesign sprint in SF and would like to move forward with implementing a solution.
I've updated the function Dries supplied slightly since we don't need the count for the display on the landing pages and added another function for returning Newly added projects and attached this as a patch.
We also need a function for returning Most Active Projects, which we are thinking are the projects with the most recent commits or issue updates?
#10
Oh, I just realized that my concern about
SELECT MAX(timestamp) FROM project_usage_week_projectbeing very expensive and unnecessary is actually about {project_usage_daily}. {project_usage_week_project} doesn't have nearly as many rows, so it's not so terrible (as Dries's benchmarks suggest). However, I still think it's probably worth just using variable_get('project_usage_active_weeks'), or better yet, the new project_usage_get_current_active_week() function from #371968: Clean APIs to retrieve project related statistics.Still don't know what the deal is with using Solr for this -- I'll try to track down Damien.
#11
Linking this from the Redesign project #661692: Meta issue for modules Project and Project issue tracking because this issue was tagged 'drupal.org redesign'
#12
subscribing, same thing going on over here: #666120: Provide API to get the top N projects for various categories
I'll also point out that these same lists of most popular and most active modules / themes / translations / profiles / etc (aka project metrics), would potentially be used to build the lists on the Download & Extend homepage, as well as the sidebar blocks on the D&E Modules, D&E Themes page and, even though they weren't mocked-up in the redesign, presumably in similar sidebar blocks on the translations and profiles pages as well.
#13
In the patches above, I would say that while
project_usage_get_most_active_projectsmakes sense from the point of view of determining what modules are 'actively' reporting themselves... from an end user point of view, this is a metric of "popularity" not "activity".Also,
project_usage_get_most_recent_projectsin the patch in #9 should actually beproject_usage_get_newest_projects.This input is based on the redesign mockups that use the terminology 'most popular', 'most installed', 'highest rated', and 'most active'.
An "active" project is one that is currently under heavy development. Measuring this is fairly non-trivial.
A "popular" project is one that has been used heavily. (For the moment, until popularity could be determined by some other method, this is synonymous with most installed).
A "new" project is one that has a recent creation date.
A "recent" project is one that has recently added a new release (as in recently updated).
"Most downloaded" statistics should probably be gathered from server logs (?)
"Highest Rated" statistics will depend on a separate node rating system, that could be susceptible to being "gamed". This is discussed elsewhere (sorry no link :(
#14
Actually we want to use solr for that now :) After all, we have all the nice data in there, we just need a simple way to grab a few statistics. Assigning myself.
#15
I repurposed #666120: Provide API to get the top N projects for various categories to make it more clear. That's about taking stuff out of Solr and making nice lists with it. This issue should be about making sure everything we care about is actually in the solr index in the first place. We've already got usage stats in there. So, now we need to add other things like download stats (see #32124: Enable download statistics), "most active" (whatever we're going to define that to be), "most reviewed" (once http://groups.drupal.org/node/80004 is done and deployed), etc.
#16
#17
Upon further discussion with csevb10 and hunmonk, we decided it's pointless to keep kicking this issue around. We have everything in the solr document we currently care about. As we add more features (download stats, reviews, etc) we can either handle adding those to the index in the issue where we add the feature, or as a more specific/focused/clear issue like "add download stats to the solr index" if needed.
#18
Automatically closed -- issue fixed for 2 weeks with no activity.