dww and I were talking on IRC about how once the project_usage data is visible and people are using that to determine which modules are popular, evil developers would have an incentive to submit fake usages data to increase the visibility of their modules. Joomla had a similar experience with this so, sadly, we should assume it will also happen to us.
At this point, were that type of abuse to occur, there's little we could do to catch them. Short of pouring over the data looking for sites that only submit one module, or looking for sharp increases in a module's usage, there's not much we can do. If we're going to be able to prevent this type of abuse we either need to trust the site keys or track the originating IP address so we can more easily spot the abuse and remove the usage records.
The only way I can think of making the site keys "trusted" would be to have them as public/private pairs that are registered to a drupal.org account. It would be a reasonably secure solution but it would be a pain to setup and would probably discourage most people from reporting usage.
I'm in favor of logging IP addresses though it isn't foolproof. Multi-site installs are all going to report the same IP address so there'd have to be some threshold before we ignore their usage. Or, perhaps we require that they submit data for a couple of weeks before we count them. But I think the best thing is to track the IP so we have some idea of where the usage data is coming from and can take some retroactive action if we discover abuse is taking place.
| Comment | File | Size | Author |
|---|---|---|---|
| #18 | project_168009.patch | 5.5 KB | drewish |
| #12 | project_168009_1.patch | 5.5 KB | dww |
| #3 | project_168009_0.patch | 4.99 KB | drewish |
| #1 | project_168009.patch | 4.92 KB | drewish |
Comments
Comment #1
drewish commentedHere's a simple patch to start collecting IPs.
Comment #2
dwwSorry, I guess I missed this while on vacation and just noticed it now... Sadly, this no longer applies due to update_N conflicts and a hunk failure in project-release-serve-history.php...
Comment #3
drewish commentedi'm hoping to direct a little attention at this issue from http://drupal.org/node/178776 so here's a re-roll.
Comment #4
dwwThe httpd logs _already_ include the IP addresses of all incoming requests to *.d.o, so it's not like this is fundamentally new data we'd be collecting. It's just that the data would live in the d.o DB itself, not on the d.o filesystem (which is harder to get to). However, unless we have this data directly in the same usage-related DB tables, it'll be MUCH more difficult and time consuming to try to track down suspicions of abuse and be able to do anything about it if/when we discover any.
I'm in favor of this patch going in, but I want more opinions from the rest of the community, so I'm about to email the devel list soliciting feedback. I just hope it remains a productive thread, not another bike-shed debate, since this is an important topic to get right...
Comment #5
kbahey commentedRe: patch, I think that you should use the new function ip_address() instead of $_SERVER['REMOTE_ADDR'], which works when drupal is behind a proxy (which d.o is). Hence the CNW.
Re: this from dww's comment on the mailing lists
We need to have the privacy policy say that we collect the IP address, and have the select few who have d.o access to see that (including OSUOSL folk) acknowledge that they are aware of the policy too.
Other than those points, I think it should go in.
Comment #6
dwwip_address() is new in D6. This is a patch for D5 (which is what d.o is currently running). however, d.o is patched to include the ip_address() + proxy stuff, which makes another mess for project* in this case. I'd hate to commit code that's the bizzaro mix of the D5 and D6 APIs running on d.o to the D5 version of project. :( OTOH, d.o is the main (only?) site that runs project-release-serve-history.php, so it's not that big a deal. And, clearly we're going to want the proxy-aware IP stuff to work correctly. So, perhaps the best thing is to conditionally check
if (function_exists('ip_address'))and if so, use it, else fall back to what's there now (with a nice juicy comment about why).That said, it's a 3 line change to the patch. I still want input on if people think this patch should go in. Thanks.
Comment #7
johnalbinI read the Joomla article mentioned in the issue description and that alone convinces me we need to collect enough data to recognize when abuse is happening.
However, I’d like to point out that the Joomla extension directory included commercial products, so there was a monetary reason for all the abuse. The Drupal modules and themes directory will likely have less abuse than Joomla experienced.
Still, +1 on the patch inclusion.
Our only other option is: plug your ears and cover your eyes as the abuse skews the stats.
Comment #8
catch+1 for the patch. You can opt out of update status, so anyone who really cares can do that.
Comment.module records hostname in the comment tables every time someone posts a comment to drupal.org (or any drupal site) - and that's from people's PCs rather than their servers, has been no complaints as far as I'm aware
Comment #9
moshe weitzman commentedI think writing this info to our DB is perfectly reasonable.
Comment #10
drewish commentedJohnAlbin, I'd forgotten that Joomla had commercial extensions listed. I think you're right that there'll less incentive in Drupal-land but still think this change is necessary.
Comment #11
catchEven if there's zero attempts at abuse, it'd be nice to know that there really isn't.
Comment #12
dwwNew patch with the check for ip_address(), and a nice comment about why. I also renamed the new column in the schema and the related local variables from "hostname" to "ip_addr", since this is *always* storing an IP, and I'd rather we were specific and self-documenting about what it is.
Comment #13
drewish commenteddww i'd copied that field from either some place in the project or watchdog module (i can't remember which). you could probably shorten down the field's width if it's only going to store IPs, 16 chars should be enough... that or widen it if you think you'll be storing IPv6 addresses ;)
Comment #14
drewish commentedi forgot to add that the patch looks good to me, i didn't test it though.
Comment #15
dwwheh, "ipv4_addr" ;) but yeah, we probably could shrink the size. although i usually have the cavalier "disk is cheap" attitude, in this case, it's closer to "RAM on the d.o DB server ain't cheap", so I agree we should be more careful. 16 chars is indeed fine...
Comment #16
drewish commentedhere's a re-roll that makes it a varchar(16).
Comment #17
johnalbinPatch? We don't need no stinkin' patches!
Comment #18
drewish commentedlets see if this sticks...
Comment #19
bart jansens commentedVarchars only use the storage space required, so I don't think it will make a significant difference (but ask the DB experts to be sure).
I was just testing drupals ipv6 support and was happy that it all just worked, lets try not to break it. In the US you can probably ignore ipv6 for a couple more years, but thats not true for other parts of the world (especially Asia).
Comment #20
aclight commentedThis is getting into borderline bike shed territory, but I'll point it out anyway. Core uses "hostname" for both {sessions} and {watchdog}. In both of those tables it's a varchar(128) as well. The value stored in those columns is also always the IP address. So we're somewhat more self-documenting if we call the column here ip_addr, but less consistent with core. I'd vote to stick with core and use a varchar(128) named hostname, but ultimately it doesn't matter that much to me.
However, I don't like "ip_addr" as a column name. Why not "ip_address", since we tend not to abbreviate variable names very frequently.
If we're super concerned with db memory we should be able to make it a varchar(15), since the longest valid ip address I can think of is "255.255.255.255" and that's only 15 characters.
As for privacy concerns, I don't think there are any. I think it's assumed that a web server will log the ip address for pretty much any hit, and of course this information is stored in the apache logs in any case. As long as we sanitize the IP information whenever a drupal db dump is made for an outside entity, I don't think we have anything to be worried about.
The code here looks good and it creates the db tables as expected. I also set up a release-history server and used a separate install running update_status (with the destination set as my local server) and confirmed that my ip address was added to the table properly.
Since drewish has done what dww requested, I'm setting this to RTBC. If it's decided to go back to storing the address as varchar(128) and/or changing the db column name, it won't be difficult to modify this patch to do so.
Comment #21
drewish commentedaclight, thanks for the review.
Comment #22
dwwI changed it back to a varchar 128, since Bart is right -- that's what the "var" in VARchar is about. ;) But, I'm sticking with the self-documenting "ip_addr" name -- core's own poorly named schema is no reason to perpetuate bogus names like this.
Committed to HEAD and DRUPAL-5, and deployed on d.o. I'm already seeing some IP data in {project_usage_raw}. Thanks, drewish, and sorry for the delays!
Comment #23
Anonymous (not verified) commentedAutomatically closed -- issue fixed for two weeks with no activity.