Justification:

For all but the most heavily-trafficked sites, the statistics reported by Drupal are severely skewed by visits from crawlers, and from the administrators themselves. Assuming that the purpose of the statistics is to inform administrators about visits from human beings other than themselves, it is highly desirable to do our best to ignore other visits. To that end, I developed the statistics_filter module (and its spinoff, the browscap module).

Why core?

There's enough concern about the logging the statistics module does in its exit hook that the performance issues are detailed in its help text. To work as a contributed module, the statistics_filter module needs to undo what the statistics module has just done, essentially doubling the overhead for accesses that are meant to be ignored. If incorporated into the statistics module directly, the filtering functionality would actually reduce the database overhead (no database queries at all for ignored roles).
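For illustration only, here is a minimal sketch (not the attached patch) of how role filtering inside the statistics module's exit hook could skip those queries entirely; the variable name 'statistics_filter_roles' and the overall shape are assumptions made for this example:

  function statistics_exit() {
    global $user;

    // Hypothetical setting: an array of role IDs whose hits should be ignored.
    $ignored_roles = variable_get('statistics_filter_roles', array());
    if (array_intersect(array_keys($user->roles), $ignored_roles)) {
      // Ignored role: return before any node counter or accesslog query is issued.
      return;
    }

    // ... the module's normal node counter and accesslog updates follow here ...
  }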

Open issue

Ignoring crawlers (which are the biggest part of the issue for most sites - my own site, with modest volume, gets 40% of its raw traffic from the Google crawler) requires the browscap database to identify crawlers. Currently I have maintenance of the browscap data (as well as provision for browser/crawler statistics) encapsulated in a separate module. Should this support be submitted to core as a separate module, or integrated into the statistics module?
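As an aside, PHP's get_browser() can answer the crawler question when a browscap.ini file is configured in php.ini; the browscap module does an equivalent lookup against its own copy of the data. A rough sketch, where statistics_filter_is_crawler() is just a hypothetical helper name:

  function statistics_filter_is_crawler($user_agent) {
    // get_browser() needs browscap.ini configured in php.ini; the browscap
    // module instead resolves the user agent against data it maintains itself.
    $browser = get_browser($user_agent, TRUE);
    return !empty($browser['crawler']);
  }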

Attached is a patch to statistics.module implementing filtering by roles, with filtering out crawlers dependent on an external browscap module. I hope this patch can be accepted into Drupal 4.7 - if the feeling is that the browscap code should be incorporated into statistics.module, I can do that.

Thanks.

File attached: statistics.module_2.patch (6.73 KB), by mikeryan

Comments

Bèr Kessels’s picture

a big -1.

We should STORE all (read: absolutely all) logs, yet FILTER them in the reports.

What makes you think crawlers are not users? Or that I am not interested in crawlers?

I think you might be more interested in adding value to xstatistics, which wants to be a more advanced stats module.

And last, but not least, adding checks for contrib modules in core (if_module_exists) is a no-go. In that case, you could try to introduce a hook, but hardcoded checks for modules will simply not do.

Let us hear more comments and then decide the status of this patch.

mikeryan’s picture

Of course, the filtering is optional - if an administrator wants to count crawlers as if they were users, then they just don't turn on filtering of crawlers. Personally, I don't find it useful to see that a node has 100 views when I know some unknown (but substantial) portion of those views were from Google/Inktomi/etc - I want to know what the human beings are reading, not what the crawlers' algorithms picked to index today.

Yes, I know better than to reference contrib code from core - as I said, the question is whether (if this is to go into core) it would be better to keep browscap as a separate (core) module or incorporate it into statistics.module. Probably the latter, but I figured I'd raise the issue before putting the integration work in...

The advantage of filtering at the point of logging is performance - reduced overhead in the exit hook, plus a substantially smaller accesslog table. The disadvantage is, of course, losing the log entries for crawlers and ignored roles, but if you're not interested in them anyway it's a win.

So, the question is whether others are interested in filtering these accesses out of the log entirely, or whether it's just me....

dopry’s picture

+1 for this patch

If I remember correctly, the popular content block, etc. are linked to the statistics module and the data it logs. So some sites may not want this data skewed by administrators and search engines. If you still want full logging capabilities, you can use the Apache access logs. For larger sites there may be a slight performance advantage and some saved database access time from smaller log tables, even though admin access and bot access would be a negligible percentage. As an option, I think it's a nice one.

I think Bèr's objection to core requiring a contrib module check is an important one, though, and if other people think this is something that should go into core, then it should be addressed.

Bèr Kessels’s picture

I know it is optional. But still: filtering your logs on *save* is unacceptable. Logs should contain *everything*. If you want to not show certain entries, you should filter them on *output*.

robertDouglass’s picture

I agree with Bèr.

varunvnair’s picture

+1 for what mikeryan is suggesting.

I use Drupal to power my blog (http://www.thoughtfulchaos.com). I have a shared webhosting package, and hundreds of other websites are hosted on the machine that hosts my blog (and the machine that hosts my database also has hundreds or thousands of other databases).

Sometimes my site seems to be quite slow. This is probably because the machine is receiving too much traffic. I cannot move to a better package because I cannot afford it. I often look to squeeze every ounce of performance I can from my installation, and one way of doing this is by reducing the number of SQL queries.

What to log and what not to log should be at the discretion of the site admin. After all, s/he is the one who is going to decide what to do with the logs. There is no one golden rule that applies to all installations. There is no sense in logging everything if the site admin has to go to extra lengths to ignore what s/he doesn't need.

Anyway, all accesses are logged by the provider, and most people can access the Apache logs and use them for more detailed analysis (I can).

For a CMS like Drupal, to capture everything in a log is probably unnecessary.

Kobus’s picture

I can't see any reason besides "taking up space" for not logging everything. I say -1 for not logging everything, +1 for filtering logs on output, with full logs available on demand.

Regards,
Kobus

mikeryan’s picture

Hmm, didn't expect the proposal to be so controversial... I'd like to point out a couple of things you can do when filtering at log time that you can't do at output time:

  1. Leave "ignored" hits out of the node counter table.
  2. Ignore crawlers, unless a user agent column is added to accesslog.

Either one makes filtering at output time unacceptable for my purposes.

Since opinion is divided, how about...

  $group = form_radios(t('When to apply filters'), 'statistics_filter_apply',
    variable_get('statistics_filter_apply', 0),
    array('1' => t('At logging time'), '0' => t('At display time')),
    t('If applied at display time, filtered accesses are logged to the database but ignored by default in reports. '.
      'If applied at logging time, they are not written to the database.'));
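
The logging path would then just consult that setting; roughly (sketch only, statistics_filter_is_ignored() being a hypothetical helper that combines the role and crawler checks):

  // In the exit hook: skip the accesslog/counter writes only when the
  // administrator chose to apply the filters at logging time.
  if (variable_get('statistics_filter_apply', 0) && statistics_filter_is_ignored()) {
    return;
  }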

P.S. Is Preview broken on drupal.org? I hit Preview and just get the edit form back...

Boris Mann’s picture

+1 for this. Full logs should be the job of the HTTP layer (i.e. Apache logs), not Drupal. Even Apache logs get rolled over -- there is no way to do this in Drupal other than to discard the logs outright. If people are so against it, then go for the admin option that Mike includes.

The goal of a statistics module is to give real results to people. These filtered results give a better picture of what is actually going on.

Much like the archive module, I'd almost like to see the stats module removed from core rather than have it stay in the decrepit state it is in today (often because of the difficulty of getting core commits -- so as Bèr suggests, maybe xstatistics; and Bèr, you maintain a lot of modules, maybe give that one to Mike?). As it is, everyone just needs to install a separate stats package to get real results for their Drupal site.

kbahey’s picture

I guess I have to agree with Mike and Boris here.

You can now get it the way you want: you want logging of all hits? You want logging of only human visits? You can get either.

I have said before what Boris said: Drupal can never do the level of logging that Apache does (for example, bytes per request, etc.), so there will always be a need for Apache logs and tools that analyze them.

So, let Drupal statistics do what it can do best, and make it more configurable.

Bèr Kessels’s picture

An even bigger -1 on the config options. These are very advanced settings that we really should not bother anyone with.

And still a big -1 on filtering the logs.
Why can you not filter on output? Is that too hard?
Who says logs of bots are useless? They are the logs I am most interested in, personally. For SEO and so on.
Not everyone can/has access to full server logs; not everyone /should/ have access to them, esp. if we can do better. Esp. if Drupal can make much more sense in showing stats than any log analyser can ever do. Drupal knows what it is and does; awstats has no clue about the setup of a Drupal site.
And last: who decides what a bot is and what not? That really differs per setup. I have some crawlers that IMO need to be treated like visitors, while I also have 'visitors' that look very much like humans, but that are in fact just spambots.

Boris Mann’s picture

Bèr: other than philosophical issues, have you actually run this code? With the config options, this does not affect your world at all -- you can still log anything you want.

"An even bigger -1 on the config options. These are very advanced settings that we really should not bother anyone with."

You're right -- but you're the one who insisted on having the option to log everything, so Mike added the config option in.

"esp. if we can do better. Esp. if Drupal can make much more sense in showing stats than any log analyser can ever do. Drupal knows what it is and does; awstats has no clue about the setup of a Drupal site"

Sorry, but Drupal will never be the #1 log analyzer in the world, and shouldn't strive to be. That is the whole point -- this gives better information about the stuff that matters.

Bèr Kessels’s picture

Moshe spoke to me about the initial ideas behind this patch: performance.
If it is all about performance, and if the patch actually improves performance, I can understand the need for it. I had not understood that that was the primary goal.
So, I believe that performance patches need some form of proof. Maybe a benchmark, or some figures on server load before and after?

Boris: no, Drupal will never be the #1 log analyser. But it can get so close that Joe Average has no need for server logs.

And no, I did not run the code, because I saw the option in the patch and thus immediately understood that I could keep running my system as I wish.

But here is what I believe is the worst threat in OSS: "We don't agree on Foo or Bar, so we put in both Foo and Bar and make it optional." That is the #1 usability threat for most OSS.
Especially when these options can be replaced perfectly well by well-thought-out defaults (the Mac way).

So, when it is not primarily about that option, but about speed, I think the patch will get a lot of +1s and stand a good chance of getting in.

Just a note: IMO we should discuss introducing a section in the administration called "optimisation" where all throttle, cache, memory, etc. settings can live. A place where Joe Average will not have to look. For I believe that settings like "log filters" frighten Joe Average away. They want it to Just Work, and they want the option pages to be understood without having to read manuals.

mikeryan’s picture

No, my primary motivation is not performance, it's to limit the statistics to the hits that matter most to most administrators - real human visitors other than themselves. The motivation for implementing the filtering by not logging the "uninteresting" hits, as opposed to filtering at reporting time, is twofold: performance, and function - crawlers can't be omitted at report time, because the core statistics module doesn't record which hits came from crawlers.

On the role of the statistics module - I see it as providing quick-and-dirty stats for low-to-moderate traffic sites. Logging every hit to a database would frankly be insane for a high-traffic site, and if you need more sophisticated reporting it's better to use a specialized analysis tool rather than reinventing the wheel via a Drupal module (now, a module that integrates one of those tools into the Drupal admin interface would be very nice....).

Assuming that role for the statistics module - well, who are the consumers of this module? Bèr, you and I aren't typical Drupal admins (if indeed there is such a thing); we have a better understanding both of Drupal itself and of the technical aspects of website administration than most. And I'm telling you, I don't have the slightest interest in tracking every individual Googlebot hit when I have the cumulative crawler stats from the browscap module, plus the referrer log to show me what search strings people are using to find my site. I believe most people running Drupal sites are even less interested in that level of detail than I am, and the filtering feature will help give them statistics that are relevant to their needs.

And one more time - this is an option. Your point about overloading the UI with too many options is valid, and if there wasn't any support for this suggestion I'd drop it, but I really believe it's useful to a large enough audience that it's worthwhile.

There is an alternative to putting this into core for 4.7 - I could contribute a fork of the core statistics module with this support integrated (what the hell, throw in statistics_trends too :-). This would enable me, on my own schedule, to add something else I'd like - archival of older stats. That is, the current statistics module won't retain more than 16 weeks' worth of the access log; I'd like to archive per-day counters beyond that, so statistics_trends could display traffic trends for a year back (or beyond). Anyway, I will bow to the community's will on how to proceed from here...

Thanks for listening.

Boris Mann’s picture

Status: Needs review » Reviewed & tested by the community

I'm +1 for putting this in core in time for 4.7.

Dries/others -- this is definitely not a code issue: I've set this to "ready to be committed", it's up to core committers to look this over and give some guidance.

killes@www.drop.org’s picture

Status: Reviewed & tested by the community » Needs work

Doesn't conform to coding standards.

I think that the idea of filtering on input is contrary to what we do in other areas of Drupal. So we should filter on output here, too. If your own visits make up a large percentage of your visitors, you probably have too much spare time anyway.

mikeryan’s picture

What specific coding standard?

The big thing isn't my own visits, it's the crawlers (with 5000+ nodes on my site, they significantly distort the stats). And they can't be filtered at output time...

killes@www.drop.org’s picture

} else {

Why can't bots be filtered out on display? We save the IP, so we can filter by IP, no?

mikeryan’s picture

Where would the data on what IP addresses represent bots come from? How would it be kept up-to-date? In my proposal, the browscap module uses Gary Keith's user agent data to identify bots (not perfect, but close enough for government work)...

Bèr Kessels’s picture

We really should store that user agent. I need it in my xstatistics (which aims to be a Drupal-specific Webalizer-like thing). But it seems you need it for filtering too.

However, filtering on these is IMO a bad choice. Content spam bots (gobbling up my 30%) use strings like Internet Explorer, Lynx, or other existing browsers. So filtering on the agent would only cover a certain class of skewed stats. IMO it needs to be slightly smarter than that.
I tried some smarter ideas in xstatistics that find bots by patterns (users don't iterate over /node/1, /node/2, nor open 1000 pages in one minute), but it's only in the test phase. Such filters are only possible on output.
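For what it's worth, an output-time heuristic along those lines could be as crude as flagging hosts with inhuman request rates while building the report; a sketch only, with the 60-hits-per-minute threshold being an arbitrary example value:

  // Flag any hostname that requested more than 60 tracked pages in the last
  // minute; treat those rows as probable bots when rendering the reports.
  $result = db_query('SELECT hostname, COUNT(*) AS hits FROM {accesslog}
    WHERE timestamp > %d GROUP BY hostname HAVING COUNT(*) > 60', time() - 60);
  while ($probable_bot = db_fetch_object($result)) {
    // e.g. exclude $probable_bot->hostname from the top pages / top visitors queries.
  }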

dekkeh’s picture

Hi,

First off: I'm a Drupal newbie. I switched my site from Postnuke to Drupal last weekend (reasons: inline image upload support, better themes, better community & support).

I'm not much of a coder, but if you really want non-skewed statistics you should at least log which accesses are bots and which are users or anonymous visitors. It will make reporting a lot easier and faster, as there may be hundreds of bots you would have to check your logs and IPs against to filter out the 'bad visits'. That's going to take a lot of time and CPU.

A point I haven't heard so far is that many of the bots visiting your site are NOT friendly, loving Google but actually spam bots. I suggest the following read about this subject: http://www.kloth.net/internet/bottrap.php

I use the non-bot-logging module right now because I want to know how many PEOPLE visit my site. If there are better options, I'd immediately install them. Until then, this seems to be the only way (from a non-coder perspective).

In the long run I would suggest taking one of the better open-source PHP stats packages out there and porting it to a Drupal module. As a user, this would be my statistical wet dream, so to speak.
(The same goes for the included forum module, I have to say. It's OK... but just OK.)

That being said: I love Drupal. Go community!

Grtz all,

Hans

dopry’s picture

Should this issue go to "won't fix"? It's been sitting around since 4.6. It doesn't seem like any consensus was reached, and I really don't use the statistics module.

dopry’s picture

Version: x.y.z » 4.6.x-dev
Status: Needs work » Closed (fixed)

Yeah, lost in the noise, and the statistics module is gone now, I believe.