Download statistics are a frequently requested feature. They would be easy enough to implement in a way parallel to what's done in the weblinks module, where a download link doesn't point directly to the file. This would require:

* add a field to the project_releases table, e.g.

ALTER TABLE project_releases ADD download int(10) unsigned NOT NULL default '0';

* change download links to point to a callback that first increments a counter and then redirects to the file.
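
A minimal sketch of the second step, written in Python with SQLite rather than Drupal's PHP API so that it runs standalone. Only the `download` counter column comes from the proposal above; the `nid` and `file_path` columns and the `make_db` and `count_and_redirect` names are assumptions made for illustration.

```python
import sqlite3


def make_db():
    """Create an in-memory stand-in for the project_releases table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE project_releases ("
        " nid INTEGER PRIMARY KEY,"            # hypothetical key column
        " file_path TEXT NOT NULL,"            # hypothetical column
        " download INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def count_and_redirect(db, nid):
    """Increment the counter for a release, then return a redirect.

    Returns an (HTTP status, headers) pair; 404 if the release is unknown.
    """
    # Increment in a single UPDATE so concurrent requests cannot lose counts.
    cur = db.execute(
        "UPDATE project_releases SET download = download + 1 WHERE nid = ?",
        (nid,),
    )
    if cur.rowcount == 0:
        return 404, {}
    db.commit()
    (path,) = db.execute(
        "SELECT file_path FROM project_releases WHERE nid = ?", (nid,)
    ).fetchone()
    return 302, {"Location": path}
```

The counter is bumped before the redirect is issued, so a request that is redirected but never completes the transfer still counts as a download.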

Thoughts about this approach? If no problems are seen, I'll code a patch.


Comments

hass’s picture

Version: x.y.z » 5.x-1.x-dev

Would be nice to get this together with usage statistics...

Amazon’s picture

Hi, going to link with this issue: http://drupal.org/node/188993 <= Screenscrape AWSTATs to have running tallies of Drupal downloads available publicly. Note OSUOSL is recommending a custom module to calculate stats from a compiled awstats data file.

aclight’s picture

I wrote a small module for my site that adds a menu path to track and count project_release downloads. If people would rather use that than scrape awstats I could post that.

greggles’s picture

@aclight, I don't think there's much debate about whether or not you should post that. I think it would be better as an optional feature for project_release module. That way your column could be added as nedjo mentions above and integrated into the general release schema instead of just in a related table (which would hurt performance on sites with lots of releases).

dww’s picture

Right, the "downloads" counter should go directly into the {project_release_nodes} table. project_release *is* about a web-based system for downloadable "software", so it makes sense that the module natively supports a download counter. however, i'd be happy if there were multiple ways to populate the field, namely the magic download menu path that increments the counters vs. some external system to populate based on other stats (from httpd or even another system entirely).

So, project_release itself should provide the column in its schema, the UI for displaying the results, and potentially the project browsing method for sorting by this value (although this last part should be delayed until the browsing methods are converted to views). Then, it should have another admin setting, something like the following radio toggle:

Collect download stats via:
(x) Special download URLs
( ) External source

That'd be ideal, as far as I'm concerned... If we really wanted to over-engineer the problem, we could have this setting as a global default, but allow project admins to define it individually on a per-project basis. ;)

greggles’s picture

@dww - what about having two columns: one for internal and one external and no admin option?

I think that there is value in having both external sources AND this web download column. Then we can say that a module that gets 5 downloads from the web page and 300 checkouts from CVS can easily be flagged as a 'developer favorite.'

dww’s picture

@greggles: interesting. i totally forgot about the effort to track CVS checkouts, too. I was thinking about external == awstats or something, if your site would rather go that route. but, fundamentally you've still got a single value to store, the # of times each release was downloaded via the web. however, yeah, there's the CVS checkouts (http://drupal.org/node/187019), too. and here's another potential complication: what about downloads of install profiles that include this release? do we record that, and if so, how? whose responsibility is it to keep track of that? that really has no place directly in project_release, since the install profile packaging question is specific to d.o. but, is there a general principle we could abstract out of this that could make sense on other project* sites?

all that said, I don't want this issue to get too complicated. ;) part of me thinks we should stick with the single value, # of downloads via the web, and give sites flexibility on how to collect that single metric. then, we put everything else (CVS checkouts, install profile downloads, etc) in other tables and handle them in other ways.

honestly, CVS checkouts probably belong more in versioncontrol_project and/or versioncontrol_cvs than project_release. In fact, there was recently a GHOP task about commit metrics for projects. seems like it'd make sense to get a similar "checkout" metric in the same place. but, that's really a discussion for #187019, not here, assuming we all agree that project_release itself should just focus on a single value, the # of web downloads, and everything else should be handled in the appropriate module that's extending project_release for a specific thing. sound good?

greggles’s picture

dww - yes, generally, agreed. There are lots of things we can do in our dreams and ideally there would be a project_release->download_statistics 1:N relationship, but let's not complicate this issue too much.

For this patch because it makes sense for the major consumer of this module (d.o) I think we should just have the one database column.

Any further fancy statistics action should probably be kept to separate issues.

aclight’s picture

I've attached the files for the module I'm running on one of my sites.

I have only tested this using public downloads. Also, users can always use the normal URL to download a file (eg. example.com/files/file1.txt) and in those cases the download won't be counted.

I probably won't have time to write this as a patch to project_release any time in the near future, but I thought I'd post this here in case it is useful to others making such a patch.

aclight’s picture

As an FYI, the file names in the comment I just sent were wrong, so I edited the comment and replaced with the appropriately named files. So if you click the link from your email to see the file attachments you won't get them, but if you click on the link from the issue itself you'll get the correct link.

Gábor Hojtsy’s picture

Issue tags: +drupal.org redesign
Dries’s picture

The Mark Boulton redesign exposes download statistics on a number of pages. Both on individual project pages, as well as landing pages and more.

We discussed this a bit at the drupal.org redesign sprint in Paris, and our recommendation would be to never generate the URLs to the "raw" downloads but to create fake URLs that do a redirect after incrementing a download counter.

The following information needs to be available to implement the Mark Boulton designs:

* The number of downloads for a specific version. For example, Views 2.1 has been downloaded 32,000 times.

* The number of downloads for a given core compatibility branch. For example, Views for Drupal 5 was downloaded 42,000 times.

* The number of downloads across all core compatibility branches. Views has been downloaded 96,000 times.

That is the data that we would need to feed to Solr so we can create all the different views, filters and sort orders.
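
The three levels listed above can all be derived from per-release counts. The following is a rough sketch in Python, not project module code; it assumes that the core compatibility branch is the part of the version string before the first hyphen (e.g. `6.x` in `6.x-2.1`), and the function name `summarize_downloads` is made up for illustration.

```python
from collections import defaultdict


def summarize_downloads(per_release):
    """Roll per-release counts up to per-branch and project-wide totals.

    per_release maps a version string such as "6.x-2.1" to its download
    count. The core compatibility branch is assumed to be the part of the
    version before the first "-".
    """
    per_branch = defaultdict(int)
    for version, count in per_release.items():
        core_branch = version.split("-", 1)[0]
        per_branch[core_branch] += count
    return {
        "per_release": dict(per_release),
        "per_branch": dict(per_branch),
        "total": sum(per_release.values()),
    }
```

Because the two higher levels are plain sums, only the per-release (or per-file) counter needs to be stored; the rest can be computed whenever the search index is rebuilt.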

Some comments on the discussion in this thread:

* External data sources sounds a bit complicated. I don't think that would be necessary for d.o itself, but I guess others might have a use case for it. Unless we have a concrete use case for it, I would hold off its implementation.

* CVS integration sounds like "nice to have" but I personally don't think it is a show-stopper. It would be very interesting to know the CVS vs download statistics but we could easily add this in a follow-up stage. I don't think we should obsess with getting the statistics 100% correct because they will never be.

Dries’s picture

Version: 5.x-1.x-dev » 6.x-1.x-dev

I've started looking into the project_release.module code. Some first thoughts:

  1. Should the download URL become project/%project_name/download/%release_version (e.g. http://drupal.org/project/mollom/download/6.x-1.7)? Advantages:
    1. I think that would be the most intuitive as it allows for quick URL manipulation for those that like fiddling with their browser's address bar. For example, it would make it very easy to download version 1.6 if 1.7 of the Mollom module wasn't working somehow, or if 1.6 was not easy to navigate to. It would be great for SEO purposes too -- it would give the download pages more weight in Google.
    2. http://drupal.org/project/mollom/download could be a landing page listing all the downloads/releases. It is a better URL than our current http://drupal.org/node/240806/release. It is easier to get to, better overall SEO, and it would be easy to manipulate the URL. For example, if I was at http://drupal.org/project/mollom/download/6.x-1.7 I could drop the 6.x-1.7 to see what other downloads were available. Very convenient!

    If we decide to use the URL scheme proposed above, we'd have to shuffle some things around but it feels like it would be for the better. It adds sexiness to the project module.

  2. theme_project_release_download_link() is responsible for generating the download links. It is great that we already have a centralized place for that. Unfortunately, the function declaration takes the wrong parameters; it takes a filename as input instead of a release node. Regardless of the URL namespace decision above, it probably makes sense to update the code in Drupal 6 to start passing around the release node instead of the filename.
Dries’s picture

I've attached a patch that implements a first step; i.e. changing the declaration of the function(s) that generate download URLs and updating all the callers. See item 2 of comment #13. The generated URLs are unchanged -- that should be a convenient next step left for a follow-up patch. In other words, once this patch has landed, we should be able to create a link like http://drupal.org/project/mollom/download/6.x-1.7 relatively easily.

This patch is only 90% there though; hopefully someone a little bit more experienced with project_release.module can drive this home! It is one of the key new features of the new Mark Boulton design, so I hope we can showcase this sooner rather than later. Thanks!

When working on this, I ran into two complications which explain why I had to stop at 90% completion:

  1. I noticed that release nodes can have multiple files. This took me by surprise because I don't remember seeing this in action. Furthermore, support for multiple files seems to be incomplete/experimental -- some functions assume that there is only one file (e.g. project_release_table()), while other functions assume that there can be multiple files attached to a release node. This complicated rolling a patch but I tried to plow through it as best as I could based on what I learned.
  2. I don't have a complete drupal.org mirror setup, causing files never to be loaded. This made it difficult to do testing so I assume that there might be some bugs still (or maybe not).
aclight’s picture

Status: Active » Needs work

1. There are a few places where you have "Mke" instead of "Make" in comments.

2. It looks like you've changed the calls to theme_project_release_download_link() without changing that function or its PHPDoc itself. So I don't think your patch will actually work.

3. The patch itself doesn't seem to implement any kind of download counting ability, so it's not clear to me how you plan to actually count file downloads. I don't know that we get much gain by creating a new menu structure for downloads, because ultimately it's the actual download of the file itself that we want to track, not whether or not a user went to a page with listings of files that can be downloaded.

In comment #9 above I posted a module I'm using on one of my project sites to track downloads. It's for D5 but was pretty simple and shouldn't be hard to port for D6. I've been using this for over a year now on my project site and it seems to work just fine. Of course if users go directly to the URL of the file itself and not the link that my module causes project to use when it prints the link in download tables, the download won't be counted, but I don't think there's any real way we can get around that.

Dries’s picture

Status: Needs work » Active
FileSize
881 bytes

Here is another independent patch that adds the downloads field to the project_release_nodes table. This was discussed earlier in this thread. Should be painless. :)

nedjo’s picture

Assigned: nedjo » Unassigned
dww’s picture

Assigned: Unassigned » nedjo
Status: Active » Needs work

Re: #14.1: I noticed that release nodes can have multiple files. This took me by surprise because I don't remember seeing this in action. Furthermore, support for multiple files seems to be incomplete/experimental -- some functions assume that there is only one file (e.g. project_release_table()), while other functions assume that there can be multiple files attached to a release node. This complicated rolling a patch but I tried to plow through it as best as I could based on what I learned.

That was introduced at #357920: Numerous errors when previewing/submitting a new release node and paves the way for things like #11416: Please provide *.zip downloads.. We didn't want to spend time at the sprint completely finishing the job (since that aspect wasn't critical), but since we had to store the uploads in the {files} table anyway, and yet we needed more metadata than {files} itself can hold (e.g. the md5hash), we added a {project_release_file} table and started undoing the assumption of 1-file-per-release. Hopefully, all the places that still assume 1 file were commented in the code.

If we're going to be gathering stats, and we're planning (for usability reasons) to provide both .zip and .gz downloads, it seems we should track per file, not per release node. This also makes much more sense for other project* users that have, for example, different compiled binaries for different platforms. I'd be happy to add a download count to the {project_release_file} table. The summaries across all branches and all versions should probably just be computed when reindexing a project node for solr. I don't think it's necessary to keep the other summaries in the DB (though if others disagree, I'm open to discussion about it).

If we're tracking downloads per file, then the URLs should either be of the form:

http://drupal.org/project/[name]/download/[version]/[extension]

E.g.:

http://drupal.org/project/mollom/download/6.x-1.7/gz
http://drupal.org/project/mollom/download/6.x-1.7/zip
...

or like so:

http://drupal.org/project/[name]/download/[filename]

E.g.:

http://drupal.org/project/mollom/download/mollom-6.x-1.7.tar.gz
http://drupal.org/project/mollom/download/mollom-6.x-1.7.zip
...
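
To make the two candidate URL schemes above concrete, here is a small Python sketch that resolves either form to a release file name. It is only an illustration of the routing idea, not Drupal menu code: the `SUFFIXES` mapping from `gz`/`zip` to real file suffixes and the `resolve_filename` name are assumptions, as is the `[name]-[version]` file naming, which is inferred from the examples above.

```python
import re

# /project/<name>/download/<version>/<extension>
BY_EXTENSION = re.compile(
    r"^/project/(?P<name>[^/]+)/download/(?P<version>[^/]+)/(?P<ext>[^/]+)$"
)
# /project/<name>/download/<filename>
BY_FILENAME = re.compile(
    r"^/project/(?P<name>[^/]+)/download/(?P<filename>[^/]+)$"
)

# Hypothetical mapping from the short URL extension to the file suffix.
SUFFIXES = {"gz": ".tar.gz", "zip": ".zip"}


def resolve_filename(path):
    """Return the release file name a download path refers to, or None."""
    match = BY_EXTENSION.match(path)
    if match:
        suffix = SUFFIXES.get(match["ext"])
        if suffix is None:
            return None
        return f"{match['name']}-{match['version']}{suffix}"
    match = BY_FILENAME.match(path)
    # Only accept file names that start with "<name>-", so that a bare
    # version segment is not mistaken for a file name.
    if match and match["filename"].startswith(match["name"] + "-"):
        return match["filename"]
    return None
```

Either scheme identifies a single file, which is what a per-file counter needs; the first form requires an extra lookup from extension to file, while the second carries the file name directly.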

I haven't yet looked at aclight's module, so I can't comment on the pros/cons of that approach yet.

I'll be able to look more closely at this whole thread next week. Just wanted to reply with some initial thoughts now before I'm offline for the rest of the day. ;)

Dries’s picture

Status: Needs work » Active
FileSize
7.84 KB

Thanks for the quick review.

1. There are a few places where you have "Mke" instead of "Make" in comments.

Fixed.

2. It looks like you've changed the calls to theme_project_release_download_link() without changing that function or its PHPDoc itself. So I don't think your patch will actually work.

Fixed (I hope).

3. The patch itself doesn't seem to implement any kind of download counting ability, so it's not clear to me how you plan to actually count file downloads. I don't know that we get much gain by creating a new menu structure for downloads, because ultimately it's the actual download of the file itself that we want to track, not whether or not a user went to a page with listings of files that can be downloaded.

I discussed this in my comment #14. This patch implements step 1 of a 2 step process. I've run out of time to work on this patch today, and will probably not be able to work on this patch next week. As indicated, I hope someone else can help drive this home.

aclight’s picture

Status: Active » Needs work

For the record, my module essentially does the latter of what dww suggests. Instead of http://example.com/files/file.tar.gz, it would create a link of http://example.com/project_doanload/files/file.tar.gz.

Most of that module is concerned with the tracking and displaying of download counts, etc. and would probably be done differently for the project module itself.

Dries’s picture

Status: Needs work » Active

dww's comments in #18 make sense to me and would work for the Mark Boulton design.

hass’s picture

I'd like to chime in about a few more tracking variants.

1. We could add onclick events to the links and count via a custom php script. Like GoogleAnalytics... with a JS limitation... not perfect, but works.

2. I would keep the download link as is and not create custom menu paths, so that multiple files per release node remain possible in future. If we go this way we could have a download link "http://ftp.drupal.org/files/projects/project-5.x-1.3.tar.gz", but behind the scenes a mod_rewrite rule executes a "project_download_count.php" and then redirects or, better, rewrites (one HTTP request less) to the real file.

I would go with #2.

lisarex’s picture

Linking this from the Redesign project #661692: Meta issue for modules Project and Project issue tracking because this issue was tagged 'drupal.org redesign'

bdragon’s picture

(See also http://drupal.org/node/324675 )

Assuming solr is doing the heavy lifting for sorting, the only need as far as statistics storage is to tack a download count on to {project_release_file}. Everything else can be computed from this. (Although if weekly counts are desired a separate table akin to {project_usage_week_release} but keyed on fid is needed.)

Everything more than this is just denormalization.

Log parsing to generate the stats is pretty easy at this level....

(time passes...)

Err, easy enough that I just did it, since the awstats already get synced to util and I was awake.
Script is ~bdragon/process-download-stats.php on util and the stats go to downloads column in {project_release_file} (Yeah, I owe an _update_xxx() for that now.)

I assume doing it daily would work. (The reports for the current month get regenerated each day??) It's fast enough (6 seconds on a warm fs cache) to just parse the whole set of awstats files that I am just doing that for simplicity.

Took 22 seconds to do the whole process, from loading the files to writing the stats to the table. Much faster than running stats ;)
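
The script mentioned above is not shown in this thread, and the awstats data file format is not reproduced here. As a rough illustration of the log-parsing idea only, the following Python sketch counts successful downloads per file from plain access-log lines. It assumes Common Log Format input and a `/files/projects/` path prefix (taken from the ftp.drupal.org URL quoted earlier in the thread); `count_downloads` is a hypothetical name.

```python
import re
from collections import Counter

# Matches the request and status fields of a Common Log Format line.
REQUEST = re.compile(r'"GET (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3}) ')

# Assumed location of release files on the download server.
PREFIX = "/files/projects/"


def count_downloads(log_lines):
    """Count successful (status 200) GET requests per file under PREFIX."""
    counts = Counter()
    for line in log_lines:
        match = REQUEST.search(line)
        if not match or match["status"] != "200":
            continue
        path = match["path"]
        if path.startswith(PREFIX):
            counts[path[len(PREFIX):]] += 1
    return counts
```

The resulting per-file counts are what would be written to a `downloads` column keyed on the file; requests made by wget or drush show up in the logs just like browser requests.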

hunmonk’s picture

we're looking at getting this functioning fully for phase one of the redesign. here's what i see needing to happen for that to be possible:

  1. review/commit process-download-stats.php to project_release module
  2. update function in project_release.install to add {project_release_file}.downloads
  3. investigate if we need an index on {project_release_file}.downloads
  4. collect summary of download statistics and insert into the solr document (for now, the facets can be the same as we're generating for usage stats)
  5. build 'Most Downloaded' block in project_solr that pulls the data from the document
drumm’s picture

awstats isn't actually the most reliable. If something goes wrong, it can take a while for it to catch up. I think we should use it to backfill current numbers, but not for ongoing logging.

I don't want to change the download URLs or anything. I think a decent solution is to add a JS click event that does an AJAX request, which increments the download counter. This will be skewed with drush, wget, and such, but that's okay. The numbers will still be relevant relative to each other.

drumm’s picture

Actually, our wget numbers are significant, 40.5%, so we do need to track awstats.

hunmonk’s picture

more tags for redesign

geerlingguy’s picture

Subscribe - this would be great not only for drupal.org, but also for a lot of other sites using project.module.

webchick’s picture

Subscribing. This is really important data for the community. I'd like to try and figure out how to expose it to more than 20 people.

webchick’s picture

So here's what I've managed to figure out from an evening's worth of poking around.

BDragon's "process-download-stats.php" script parses the log files in /var/log/DROP/drupal-awstats-data/ which appear to log all kinds of interesting stuff about ftp.drupal.org, including translations and what robots are pinging us.

Then, it puts the stuff about releases into a MongoDB database. Looks like the raw Update Status information is MongoDB, too. (The Drupal.org DB is missing data for project_usage_day and project_usage_raw.)

Finally, it updates the "downloads" column in the project_release_file table, which contains the release node IDs. Presumably, this is what we need to hook into Solr/Views/whatever. (I'll need some help with that part.)

I ran the script and I believe it's working, because if you check the Drupal.org DB, it contains counts there for like http://drupal.org/node/1172658, which was just created tonight. It was well-commented and easy to follow what was going on. It helped me play around with MongoDB inspection a bit in a script in my home directory, since I couldn't find a client on util, apart from PHP.

I don't know if BDragon's script realistically belongs in project_release module, though. It seemed pretty drupal.org-specific. Like you probably don't want a MongoDB dependency in Project Release module in order to collect download counts. So I'm wondering if we should split off this issue into two: one for a general feature request for Project module to track file downloads through the web interface, and one for a Drupal.org redesign-specific "Show counts on project pages" type of feature request.

webchick’s picture

Also, does anyone know about the feasibility of tracking git clone requests on the Git side? That's a non-trivial percentage of downloads, too. I realize it was talked about and dismissed for CVS above, but wasn't sure if Git made things easier in that regard.

webchick’s picture

dww asked for an update at #1353138: Display download count on project page. This continues to feel like a very wrong place to discuss this info, since it's extremely d.o specific and not related to Project module at all, but in the interest of completeness. :)

I had a conversation with BDragon last week in #drupal-infrastructure about the current status of these parsing scripts. I didn't get his permission to post the log, so paraphrasing:

- The scripts discussed above now live in the BZR scripts repository. (I don't have access to that to see where or what they're called.)
- There's a jenkins job util_parse_awstatsdata (triggered by util_sync_awstatsdata) that handles populating the MongoDB collection and the d.o database tables. These seem to run daily.
- There's also a neat script that was written for Kieran that you can run on util as /home/bdragon/download-report [project-name] that'll give you a nice pretty-printed table output of what's in MongoDB. I asked whether that should move into the scripts repo but BDragon said that would make it even less accessible since not everyone has access to BZR.

I don't know if that's enough info to get #1353138: Display download count on project page unpostponed or not. If not, please let me know what other info is needed.

webchick’s picture

And tagging.

sun’s picture

Priority: Normal » Minor
greggles’s picture

Priority: Minor » Normal

@sun - this is not your queue to prioritize things. Maybe there was some discussion that led you to do that and it makes sense, but without a body to your comment I can only think the proper action is to take this back to normal.

I feel very frustrated by your behavior on these issues. It seems quite counter-productive.

webchick’s picture

Hey, Derek, since you're around these days, any chance you could take a look at #33 and see if that's what you need?

dww’s picture

Not until later in January, sorry. I need to focus on http://drupal.org/community-initiatives/drupalorg/distribution-packaging for now, and then I've got some major personal stuff to attend to in the near future...

webchick’s picture

Issue tags: +Drupal.org priority, +Developer improvements, +Business improvements, +Site builder improvements

Tagging.

drumm’s picture

Assigned: nedjo » Unassigned
Status: Active » Fixed

There are two Jenkins jobs at work here. One to rsync a copy of awstats from wherever the OSL keeps them. That's one line and is good. The second processes them. It is now in Git at http://drupalcode.org/project/infrastructure.git/blob/HEAD:/live/process.... It would be cool if this were a drush command instead of manually bootstrapping, but that's okay. The

Hold onto yer hats!
Whew!

output is nice to see. I think this is good enough to call fixed and unblock #1353138: Display download count on project page.

bdragon’s picture

Assigned: Unassigned » bdragon

I'll leave this as fixed but I will see about drushifying. It's my fault that it bootstraps and I need to teach myself how to write drush commands again (haven't done it for a loong time).

bdragon’s picture

@drumm: I converted it to a drush command and moved it to drupalorg/drupalorg_project (since it relies on awstats and is somewhat drupal.org specific still) and updated the jenkins job.

Automatically closed -- issue fixed for 2 weeks with no activity.