Kieran asked me to post this code here.

Attached is a Perl script that processes an awstats download file and produces the top 30 contrib downloads. Run "perl rank-drupal-downloads.pl" for usage, etc. Also attached is a sample output file with a ranking section that's ready for CSV import.

I think it would be great if this info were made public: it gives a good snapshot of what's hot in the contrib world. For estimating usage, it would be handy to generate from the Apache logs a reverse list of IP addresses downloading module-x-version-y. That list could then be used to find upgrade downloads, which imply usage and a live running install.

See sample output below.


This program ranks drupal downloads by popularity based on an awstats file.
Using awstats122007.ftp.drupal.org.txt as awstats input file.

        Found data start: BEGIN_SIDER 12184
        Found data end: END_SIDER
                Read 12186 lines of data.

######################################################################

16588,ulink,5.x-1.x-dev
12358,ad_agency,5.x-1.x-dev
9119,views,5.x-1.6
8039,alek_2_0,5.x-1.x-dev
7824,cck,5.x-1.6-1
6433,amadou,6.x-1.x-dev
6347,aberdeen,5.x-1.7
6146,image,5.x-1.6
5944,tinymce,5.x-1.9
5141,token,5.x-1.9
4637,atck,5.x-5.x-dev
3983,spooner,5.x-1.4-5
3844,pathauto,5.x-2.0
3579,CristalX4Drupal,5.x-1.x-dev
3425,es,4.7.x-1.x-dev
3178,date,5.x-1.7
3137,captcha,5.x-3.1
2991,zen,5.x-0.8
2906,fckeditor,5.x-2.0-beta
2677,imce,5.x-1.0
2603,calendar,5.x-1.7
2460,gallery,5.x-2.0
2381,bluebreeze,5.x-1.2
2345,jquery_update,5.x-1.0
2335,event,5.x-1.0
2301,jstools,5.x-0.8
2299,contemplate,5.x-1.8
2298,img_assist,5.x-1.5
2296,devel,5.x-1.x-dev
2267,imagecache,5.x-1.3

######################################################################

ulink               5.x-1.x-dev    ******************************
ad_agency           5.x-1.x-dev    **********************
views               5.x-1.6        ****************
alek_2_0            5.x-1.x-dev    **************
cck                 5.x-1.6-1      **************
amadou              6.x-1.x-dev    ***********
aberdeen            5.x-1.7        ***********
image               5.x-1.6        ***********
tinymce             5.x-1.9        **********
token               5.x-1.9        *********
atck                5.x-5.x-dev    ********
spooner             5.x-1.4-5      *******
pathauto            5.x-2.0        ******
CristalX4Drupal     5.x-1.x-dev    ******
es                  4.7.x-1.x-dev  ******
date                5.x-1.7        *****
captcha             5.x-3.1        *****
zen                 5.x-0.8        *****
fckeditor           5.x-2.0-beta   *****
imce                5.x-1.0        ****
calendar            5.x-1.7        ****
gallery             5.x-2.0        ****
bluebreeze          5.x-1.2        ****
jquery_update       5.x-1.0        ****
event               5.x-1.0        ****
jstools             5.x-0.8        ****
contemplate         5.x-1.8        ****
img_assist          5.x-1.5        ****
devel               5.x-1.x-dev    ****
imagecache          5.x-1.3        ****

Thanx.

Comments

catch’s picture

Note that this report includes themes as well as modules.

joegml’s picture

True.

There is no string differentiation between the two in the input data. It wouldn't be much work to cross-reference this against a list of themes or modules. Got a clean list anywhere?

greggles’s picture

I've been analyzing the data for a few years now and since last January it became clear that there are some people who are working to inflate the numbers for their modules/themes.

I think a more valuable investment of time would be in #165380: Make usage statistics visible

gcassie’s picture

Here are the stats from above with the project type included:

http://spreadsheets.google.com/pub?key=pu7uEf_qdH5kDrsZji0UPIw

joegml’s picture

Wow: a lot of comments on http://drupal.org/node/165380.

Note that in my script I ignore drupal core downloads and videos, only a match on
#^.*projects/(\w*)-(\d.*)\.tar\.gz#
makes it into my analysis sieve.
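As a rough illustration of that filter, here is a minimal Python sketch (not the attached Perl script). It applies the same pattern to lines between the BEGIN_SIDER and END_SIDER markers seen in the sample output. The assumption that each data line is "URL count ..." separated by whitespace is mine, as are the function name and the sample paths. How the script skips core downloads is not stated in the thread, so that part is left out.

```python
import re
from collections import Counter

# The pattern quoted above, written as a Python regex.
RELEASE_RE = re.compile(r"^.*projects/(\w*)-(\d.*)\.tar\.gz")


def count_downloads(lines):
    """Tally downloads per (project, version) inside the SIDER section.

    Assumption: each data line looks like "<url> <count> ...".
    """
    totals = Counter()
    in_section = False
    for line in lines:
        if line.startswith("BEGIN_SIDER"):
            in_section = True
            continue
        if line.startswith("END_SIDER"):
            break
        if not in_section:
            continue
        fields = line.split()
        if len(fields) < 2 or not fields[1].isdigit():
            continue  # not a "<url> <count>" line
        match = RELEASE_RE.match(fields[0])
        if match:
            totals[(match.group(1), match.group(2))] += int(fields[1])
    return totals


# Made-up input lines for illustration only.
sample = [
    "BEGIN_SIDER 3",
    "/files/projects/views-5.x-1.6.tar.gz 9119 0 0 0",
    "/files/projects/cck-5.x-1.6-1.tar.gz 7824 0 0 0",
    "/files/videos/intro.mov 50 0 0 0",
    "END_SIDER",
]
for (project, version), count in count_downloads(sample).most_common():
    print(f"{count},{project},{version}")
```

The video line is dropped because it never matches the pattern; only release tarballs under projects/ are counted.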

"I've been analyzing the data for a few years now and since last January it became clear that there are some people who are working to inflate the numbers for their modules/themes."
>> Is there a discussion of analysis methods somewhere other than the 163 comments on post greggles mentions? Looking at downloads is primitive. Are bots and Scotty the Script kiddie who made a script to download his module once an hour excluded from awstats files?

"Reality" lies in the apache logs. Usage is different from download.

catch’s picture

Title: Ranking Contrib Modules based on Awstats Download File » Expose awstats download statistics for projects

Retitling to something more appropriate given we haven't decided on how to rank modules yet.

For much more discussion about this, see http://groups.drupal.org/module-metrics-and-ranking

Specifically: http://groups.drupal.org/node/10629 and http://groups.drupal.org/node/7191

Would be good to keep discussions around gaming to that group/those threads rather than in this issue.

Amazon’s picture

Title: Expose awstats download statistics for projects » Ranking Contrib Modules based on Awstats Download File

The goal here is to parse the logs and get the data regarding downloads per month back into the Drupal.org database. It would make sense to extend the project usage tables to include the download data. I assume the schema would look something like:

Month of downloads, Project name, version.

These could then be cross referenced against project module tables to determine which ones are themes and which are modules. It would be good to have some comparisons between what is downloaded and what is used on sites that are reporting back.
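To make the proposed schema concrete, here is a small Python/sqlite3 sketch. The table and column names are hypothetical, not the real Drupal.org project or project_usage tables, and the downloads count column is my addition, since the list above names only month, project, and version. The rows reuse three counts from the sample output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE project_downloads (      -- hypothetical table name
    month     TEXT NOT NULL,          -- e.g. '2007-12'
    project   TEXT NOT NULL,
    version   TEXT NOT NULL,
    downloads INTEGER NOT NULL,       -- assumed count column
    PRIMARY KEY (month, project, version)
);
CREATE TABLE project_type (           -- stand-in for the project tables
    project TEXT PRIMARY KEY,
    type    TEXT NOT NULL             -- 'module' or 'theme'
);
""")

conn.executemany(
    "INSERT INTO project_downloads VALUES (?, ?, ?, ?)",
    [
        ("2007-12", "views", "5.x-1.6", 9119),
        ("2007-12", "cck", "5.x-1.6-1", 7824),
        ("2007-12", "zen", "5.x-0.8", 2991),
    ],
)
conn.executemany(
    "INSERT INTO project_type VALUES (?, ?)",
    [("views", "module"), ("cck", "module"), ("zen", "theme")],
)

# Cross-reference downloads against project type for one month.
rows = conn.execute(
    """
    SELECT t.type, SUM(d.downloads)
    FROM project_downloads AS d
    JOIN project_type AS t ON t.project = d.project
    WHERE d.month = ?
    GROUP BY t.type
    ORDER BY t.type
    """,
    ("2007-12",),
).fetchall()
print(rows)
```

Keying on (month, project, version) means a re-run for the same month would have to replace rows rather than append them.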

I don't think it makes any sense that people are inflating their download numbers: the numbers simply aren't reported anywhere reliably. Instead we are seeing randomly selected modules downloaded repeatedly from specific countries, almost like a crawler gone bad.

It's worth noting that cross-checking IP addresses of downloads versus usage would not likely be reliable. Code repositories, and personal machines, probably lie between downloads and live site IPs. I also think it would be a violation of privacy, and we've been careful to restrict access to project usage IPs with hashes.

Amazon’s picture

Gcassie, did you map the project to modules or themes manually? I think we are looking at scalable automated solutions.

joegml’s picture

"It's worth noting that cross-checking IP addresses of downloads versus usage would not likely be reliable."
>> Not to belabor the point, but we don't have "usage" do we?

As regards hashed IP addresses: that's great, just the ticket: as long as IP addr "a.b.c.d" always hashes to "hashabcd" the analysis can be done and you can verify that the guy who downloaded mod-version-x last month, downloaded mod-version-x+1 this month and you can imply "usage" of the module. (I know IP addrs are not super reliable in this context, but would work for a start.)
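A minimal Python sketch of that idea, under stated assumptions: the salted SHA-256 hash, the record layout (month, ip_hash, project, version), and the function names are all illustrative and not how Drupal.org hashes project usage IPs. It only detects "a different version in a later month" and does not try to order version strings.

```python
import hashlib
from collections import defaultdict


def hash_ip(ip, salt="example-salt"):
    """Stable pseudonym for an IP: same input always gives the same hash."""
    return hashlib.sha256((salt + ip).encode("utf-8")).hexdigest()[:12]


def find_upgrades(records):
    """Return {(ip_hash, project)} where one hash fetched a different
    version of the same project in a later month.

    records: iterable of (month, ip_hash, project, version),
    with month as 'YYYY-MM' so that string comparison is chronological.
    """
    history = defaultdict(list)
    for month, ip_hash, project, version in records:
        history[(ip_hash, project)].append((month, version))

    upgrades = set()
    for key, events in history.items():
        if any(m1 < m2 and v1 != v2
               for m1, v1 in events
               for m2, v2 in events):
            upgrades.add(key)
    return upgrades


# Documentation-range addresses, made up for the example.
a = hash_ip("192.0.2.1")
b = hash_ip("192.0.2.2")
records = [
    ("2007-11", a, "views", "5.x-1.5"),
    ("2007-12", a, "views", "5.x-1.6"),
    ("2007-12", b, "views", "5.x-1.6"),
]
print(find_upgrades(records) == {(a, "views")})
```

Shared proxies and changing addresses would still blur the result, as noted above, so this is a starting point rather than a measure of usage.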

dww’s picture

FYI:
#32124: Enable download statistics
There's the start of a "release_download" module in there that should be the basis of what Amazon is talking about.

Amazon’s picture

FYI: http://drupal.org/node/188993. We never had a reliable way to process AWSTATS output into the database. I believe Greggles was doing something manual.

Joe >> Not to belabor the point, but we don't have "usage" do we?
We have those stats, they are collected but just not displayed. I am looking for alternate avenues to get that data public other than the project module.

Joe: FYI http://sourceforge.net/projects/phpawstats/ I was looking for that but didn't find it earlier.

Regarding: 32124: "Enable download statistics"
I think it's always good to go straight to the source, the logs of the server that is actually delivering the files. But this is a reasonable approach as well.

I think the next step is to confirm that our downloads are in monthly AWSTATs files and for Joe's perl script to be run against all those monthly log files. Getting the pre-processed data will prove valuable in our Drupal 6 marketing efforts.

joegml’s picture

Isn't the Bawstats module the way to integrate awstats into drupal? See http://equivocation.org/node/86
The image w/ geo stats is cool.

@ gcassie: can you get me a list of module names? I could then narrow my results to modules. (I think it's cool to see both too.)

Suggestions for more appropriate title for this issue?

gcassie’s picture

Yes, my post was a manual effort. I thought this was a one-off.

There is a project category of Themes. Maybe you could compare this list against the site's DB for projects with that tag to split apart modules and themes?
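That comparison could be sketched in Python roughly as follows. The theme list here is a tiny illustrative subset, the function name is made up, and matching names case-insensitively is my own choice. Any name missing from the theme list falls through as a module, so an incomplete list will misclassify themes.

```python
def split_by_type(ranked, theme_names):
    """Split (count, name, version) rows into modules and themes.

    A row counts as a theme only if its name is in theme_names.
    """
    themes = {name.lower() for name in theme_names}
    modules, theme_rows = [], []
    for row in ranked:
        _count, name, _version = row
        if name.lower() in themes:
            theme_rows.append(row)
        else:
            modules.append(row)
    return modules, theme_rows


# Rows taken from the sample output; theme list is an illustrative subset.
ranked = [
    (9119, "views", "5.x-1.6"),
    (7824, "cck", "5.x-1.6-1"),
    (2991, "zen", "5.x-0.8"),
    (2381, "bluebreeze", "5.x-1.2"),
]
modules, themes = split_by_type(ranked, ["Zen", "bluebreeze"])
print("modules:", [name for _, name, _ in modules])
print("themes: ", [name for _, name, _ in themes])
```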

joegml’s picture

After waiting a couple of minutes for http://drupal.org/project/Themes to download, I think I can parse through the html to get theme names.

I love an excuse to use perl ;-)

joegml’s picture

Attachment: status new, size 7.31 KB

Anybody got a theme count across all versions? I get 321. See attached list. Did I drop anything?

Amazon’s picture

"I am looking for alternate avenues to get that data public other than the project module."

Just to clarify since I wasn't explicit. I am not looking to duplicate the project_usage work that is being done in project module. Where it makes sense to work with project_usage we have, notably the integration of pivots recommendations with the project usage module.

I am looking to expose the Apache webserver logs download data. Download data != project_usage data.

How the download data gets exposed is up in the air. First we need to reliably pre-process it, and then get it into the Drupal.org database. I have suggested it be lined up with the project_usage data. But it also makes sense to just run the scripts and post the contributed module download stats to a mailing list or an issue queue.

danithaca’s picture

subscribe

joegml’s picture

Attachment: status new, size 3.86 KB

Getting closer. Got the comparison to the theme list incorporated. The problem is that Amare and andreas01 are not in the present list of themes and so were not recognized as themes. Other than that I think it's groovy. With an accurate list of all known themes *or* modules, this should do it. Anybody got a list of defunct themes no longer shown on the Drupal themes listing? ;-)

5127,tinymce,4.7.0
4887,image,4.7.0
4057,views,4.7.0
3088,cck,4.7.0
2752,event,4.7.0
2685,gallery,4.7.0
2421,Amare,4.7.0
2398,acidfree,4.7.0
2397,ecommerce,4.7.0
2301,andreas01,4.7.0
2298,imce,4.7.0
1913,pathauto,4.7.0
1791,flexinode,4.7.0
1788,accents,4.7.0
1775,front,4.7.0
1766,banner,4.7.0
1686,nice_menus,4.7.0
1648,img_assist,4.7.0
1638,gsitemap,4.7.0
1631,video,4.7.0
1619,adsense,4.7.0
1598,panels,4.7.0
1593,audio,4.7.0
1498,category,4.7.0
1473,controlpanel,4.7.0
1440,webform,4.7.0
1422,wordfilter,4.7.0
1398,filemanager,4.7.0
1372,taxonomy_access,4.7.0
1367,i18n,4.7.0

######################################################################

tinymce             4.7.0          ******************************
image               4.7.0          ****************************
views               4.7.0          ***********************
cck                 4.7.0          ******************
event               4.7.0          ****************
gallery             4.7.0          ***************
Amare               4.7.0          **************
acidfree            4.7.0          **************
ecommerce           4.7.0          **************
andreas01           4.7.0          *************
imce                4.7.0          *************
pathauto            4.7.0          ***********
flexinode           4.7.0          **********
accents             4.7.0          **********
front               4.7.0          **********
banner              4.7.0          **********
nice_menus          4.7.0          *********
img_assist          4.7.0          *********
gsitemap            4.7.0          *********
video               4.7.0          *********
adsense             4.7.0          *********
panels              4.7.0          *********
audio               4.7.0          *********
category            4.7.0          ********
controlpanel        4.7.0          ********
webform             4.7.0          ********
wordfilter          4.7.0          ********
filemanager         4.7.0          ********
taxonomy_access     4.7.0          ********
i18n                4.7.0          *******

dww’s picture

@Amazon (and other interested parties): FYI: #165380-164: Make usage statistics visible

@joegml: If you want accurate lists of projects of various types, #157514: Add possibility to retrieve a list of projects from the server would probably be of interest to you.

Amazon’s picture

I am trying to confirm the stats are monthly. Narayan provided us with the following sample logs.

logs/awstats102006.ftp.drupal.org.txt
logs/awstats102007.ftp.drupal.org.txt
logs/awstats102008.ftp.drupal.org.txt
logs/awstats112006.ftp.drupal.org.txt
logs/awstats112007.ftp.drupal.org.txt
logs/awstats122006.ftp.drupal.org.txt
logs/awstats122007.ftp.drupal.org.txt

I'd like to confirm where these logs are on the infrastructure and then work with someone who has access to run the script against the logs.
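One quick way to check the monthly claim from the file names alone is sketched below in Python. It assumes the names follow an awstatsMMYYYY.<site>.txt layout, which is how I read the list above; the function name is hypothetical.

```python
import re
from pathlib import PurePosixPath

# Assumed layout: awstats + 2-digit month + 4-digit year + "." + site + ".txt"
NAME_RE = re.compile(r"^awstats(\d{2})(\d{4})\.(.+)\.txt$")


def parse_log_name(path):
    """Return (year, month, site) for an awstats data file, or None."""
    match = NAME_RE.match(PurePosixPath(path).name)
    if not match:
        return None
    month, year = int(match.group(1)), int(match.group(2))
    if not 1 <= month <= 12:
        return None
    return year, month, match.group(3)


paths = [
    "logs/awstats102006.ftp.drupal.org.txt",
    "logs/awstats102007.ftp.drupal.org.txt",
    "logs/awstats102008.ftp.drupal.org.txt",
    "logs/awstats112006.ftp.drupal.org.txt",
    "logs/awstats112007.ftp.drupal.org.txt",
    "logs/awstats122006.ftp.drupal.org.txt",
    "logs/awstats122007.ftp.drupal.org.txt",
]

# Sort chronologically so a script can be run month by month.
for path in sorted(paths, key=parse_log_name):
    year, month, site = parse_log_name(path)
    print(f"{year}-{month:02d}  {site}  {path}")
```

Sorting by the parsed (year, month) rather than by file name matters because the month comes first in the name.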

Amazon’s picture

Gerhard confirmed the logs are on awstats.osuosl.org and accessible by those with drupal.org server access.

joegml’s picture

Title: Ranking Contrib Modules based on Awstats Download File » Contrib Module Downloads: Awstats Reporting

@dww: Some links on post you referenced were dead. Does this http://updates.drupal.org/release-history/project-list/all have the sacred regularly updated listing of all contrib modules throughout all of Drupal history and not themes? In terms of a programmatic approach, I need that as a reference.

Hopefully title is more appropriate ;-)

dww’s picture

@joegml: Right, that feature had to be backed out due to troubles I mentioned in the previously linked comment. But aclight has a patch that will probably solve it, which will be deployed for testing on project.d.o in a little while, and then if all goes well, we'll put it live on updates.d.o soon thereafter. Cheers.

dww’s picture

@joegml: http://updates.drupal.org/release-history/project-list/all is now live with real data. Enjoy.

Amazon’s picture

@joegml does your script process the stats for all projects or just the most popular? It seems to run stats for only 20 projects or so.

joegml’s picture

See the code:

# number of downloads to list: top x, could be argument if needed
my $lines_to_read = 30;

Setting this to something large (2000) would list all. Interim files generated also have more detail than final output to stdout.
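For readers without the attachment, here is a Python sketch of a top-N cutoff plus the star chart. It is not the Perl script itself. The function names are made up, and the bar scaling (30 stars for the largest count, rounded down for the rest) is inferred from the sample output above rather than read from the code.

```python
def top_n(rows, n=30):
    """Return the n rows with the highest download count.

    rows: iterable of (count, name, version).
    """
    return sorted(rows, key=lambda row: row[0], reverse=True)[:n]


def render_bars(ranked, width=30):
    """Format rows as 'name version stars', scaled to the largest count."""
    if not ranked:
        return []
    top = max(count for count, _, _ in ranked) or 1  # avoid dividing by zero
    lines = []
    for count, name, version in ranked:
        stars = "*" * (count * width // top)  # integer division rounds down
        lines.append(f"{name:<20}{version:<15}{stars}")
    return lines


# Counts taken from the sample output earlier in the thread.
rows = [
    (9119, "views", "5.x-1.6"),
    (16588, "ulink", "5.x-1.x-dev"),
    (2267, "imagecache", "5.x-1.3"),
    (12358, "ad_agency", "5.x-1.x-dev"),
]
for line in render_bars(top_n(rows, n=3)):
    print(line)
```

Passing a large n (or the length of the input) lists everything, which matches the suggestion of setting the limit to 2000.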

smartinm’s picture

subscribe

damien tournoud’s picture

Assigned: joegml » Unassigned
Issue tags: +drupal.org redesign

Marking as relevant for the redesign project. We have all the awstats files pulled to util, we only need to set this up and push the data into a table somewhere.

dww’s picture

@DCSF bdragon said he might be interested in working on this. It's very similar to what he's been doing for usage stats from parsing squid logs, doing some processing, and populating the d.o tables.

drumm’s picture

Status: Active » Closed (duplicate)

Looks like this work is moving forward at #32124: Enable download statistics

Component: Webserver » Servers