One of my biggest long-time annoyances with Drupal is the way that the hit counter for a node behaves. The hit counter records every single page view for a node. If you reload a node/x page 10 times, then 10 hits will be recorded for that page. This is ridiculous: the hit counter system in statistics.module needs to be just a little bit more intelligent than this. It needs to identify hits that are not unique (i.e. hits from the same user/host for the same page), and to stop such hits from bumping up the counter.

Attached patch implements this checking system. With this patch, only the first hit for page 'x' from user 'y' (which can of course be an anonymous user, i.e. user 0) on host IP 'z' will increment the counter. All subsequent hits will be recognised as duplicates, and will not result in the counter being incremented. This results in 1 extra query per node page view, but the performance impact is minimal (considering that more performance-intensive INSERT queries are being run anyway, whereas this is just an extra little SELECT query). This checking only occurs if the node access log feature is enabled, because otherwise there is no way to check for non-unique page views.

Files: 
CommentFileSizeAuthor
#37 unique_page_views_37.patch3.08 KBPancho
Passed: 7565 passes, 0 fails, 0 exceptions
[ View ]
#26 90468.statistics_unique.patch3.13 KBDave Reid
Failed: Failed to apply patch.
[ View ]
#25 90468.patch3.51 KBDave Reid
Failed: Failed to apply patch.
[ View ]
#14 unique_page_views.patch2.16 KBPancho
#10 unique_page_views.patch2.15 KBPancho
#7 unique_views.patch2.56 KBDave Reid
#3 statistics_counter_0.patch1.84 KBJaza
statistics_counter.patch1.83 KBJaza

Comments

Version:x.y.z» 5.x-dev
Category:task» bug

Actually, I think this qualifies as a bug. The patch above still applies and still works.

Status:Needs review» Needs work

IMO it's the session id you need to check against. One IP != one anonymous user, could be several from the same IP.

Status:Needs work» Needs review
StatusFileSize
new1.84 KB

Good point, chx. New patch that uses session ID instead of IP address. The check for user ID is still in there: it's not really needed, but you never know (there could be two different logged-in users with session IDs that are hash collisions). And besides, I doubt that it has any effect on the performance of the query.

Apart from that, same patch. Tested, and it has the same visible effect as the original patch.

How does this affect page load time?

Status:Needs review» Needs work

Version:5.x-dev» 5.1

I just tried this latest patch and don't seem to have any results. I did run the update.php script and it did modify the accesslog table.

Is anybody working on this issue?

It would be good if the tracker/counter did not log duplicate views, edits, and tracker views either.

Status:Needs work» Needs review
StatusFileSize
new2.56 KB

Here's the patch I've tested and used on my own Drupal installation. I've added an option in the admin/logs/settings page to only count unique views. Also, if the users is currently logged in, the code will check the history table which logs when registered users last viewed nodes. Otherwise for anonymous users, it still checks the accesslog table with uid = 0 and the session id.

On the discussion of performance of this patch, I firmly believe that this has a minimal effect on performance. At minimum, pre-patch performs one UPDATE query, post-patch performs one SELECT query.

Version:5.1» 7.x-dev
Category:bug» feature

I would consider such setting additions to be features. This should be developed for the current development version of Drupal.

Version:7.x-dev» 6.x-dev
Category:feature» task
Status:Needs review» Needs work

...without the settings, it would reduce database load by eliminating a lot of writes to 'node_counter' (or possibly 'node' if it's merged back in - ameliorating some of the performance trade-off of that optimisation as well). So could maybe get in as a bug fix/performance enhancement?

StatusFileSize
new2.15 KB

Unbelievable... patch #3 still applies with some offset - after one year and a quarter!
I just broke the code down a little bit to make it more digestible, and added proper documentation.

This bug renders the node statistics rather useless. Also it improves performance per node view. Therefore I keep it in the D6 queue.
Please test this patch thoroughly.

Status:Needs work» Needs review

Hi,

I tested this patch on Drupal 6 RC3, and the dev tarball.

I keep getting the following error:

Fatal error: Call to undefined function db_num_rows() in <...>\modules\statistics\statistics.module on line 59

The db_num_rows function seems to be removed from Drupal 6.

Also, if you set the 'Enable access log' setting to disabled, but 'Count content views' to enabled, it will still count every pageload.
Is this supposed to happen?

Assigned:Jaza» Unassigned
Status:Needs review» Needs work

Thank you for testing this! I need to check this and will post an corrected patch asap.

However, it is in fact by design, that with disables access log every pageload is counted. Finally we depend on the access log to discriminate unique page loads by checking the Session ID. The only question is, whether we should disable 'Count content views' then as well to avoid confusion and as the results are rather useless. Anyway this needs to be documented better.

Status:Needs work» Needs review
StatusFileSize
new2.16 KB

Okay, I removed db_num_rows() and use SELECT COUNT instead.
Also I removed checking uid in the access log, as the session id is unique enough. The only difference this makes except for performance is in the following case: an authenticated user views a page (which is counted as a page view) and logs out. If he now views the page again this is not counted again, as he keeps his session id. This is IMHO even preferrable.

This time I tested it thouroughly and it works now as advertised.

Surely we can further improve the mechanism determining whether a given page view is unique. We might want to discuss if we want to count only one page view, in case the user views a page anonymously and then again logged in (cross session ids). We might also elaborate on whether for all authenticated users, a page view should be counted only once - ever. Or within a given timespan. Finally we could somehow use the IP address. We might even want to add configuration options for that.

But these are details that would take up too much time right now. This is a stop-gap measure, rendering the page counter usable at all. Also we improve performance on every page view. Mission accomplished, IMHO.

Please test some more and then RTBC.

Version:6.x-dev» 7.x-dev

Sorry, it is too late for D6 to change the meaning of these counters and change the mix of queries on every page view. D7 is open for business.

For those following this issue, I created a module to fix this until something can get committed. Check it out at http://drupal.org/project/statistics_advanced

You spelt Statistics wrong in the url, it should be http://drupal.org/project/statistics_advanced
I'd be interested to see this in D7

Patch in #14 still applies and works as intented. Should we add more features or setting it rtbc'd ?

This query can cause problems as the {accesslog} grows:

$count = db_result(db_query("SELECT COUNT(*) FROM {accesslog} WHERE path = '%s' AND sid = '%s'", $_GET['q'], session_id()));

To make it more effective, the accesslog should have an index added on path, sid (in that order). As it stands right now, neither field is indexed so every query requires a full table scan.

Is this to late for D7?

This seems like a good default behavior to deal with page views. It also helps prevent users from just clicking to view the node a hundred times to bump up the node views.

No it's not to late for Drupal 7 however we do need to deal with the issue in #19

Patch rolled against HEAD with DB API changes.
- Checks against history table if user is logged in (possibly faster than searching accesslog table)
- Adds accesslog_path_sid index in accesslog table for optimization
- Uses db_query_range and 'SELECT 1' to only return first result
- Also fixes some weird trailing space issues...

Patch failed to apply. More information can be found at http://testing.drupal.org/node/14001. If you need help with creating patches please look at http://drupal.org/patch/create

Hold on...retrying patch...

StatusFileSize
new3.51 KB
Failed: Failed to apply patch.
[ View ]

Alright, patch ready for testing.

StatusFileSize
new3.13 KB
Failed: Failed to apply patch.
[ View ]

Patch re-rolled to HEAD (time() function was removed).

Assigned:Unassigned» Dave Reid

Status:Needs review» Needs work

I will need to make a test for this before it is ready.

EDIT: wrong review..

I did not tested, but reading the patch, I think we are still incrementing the node counter when the tracker module is disabled and the node is viewed by an authenticated user..

cases of node counter:

disabled tracker, disabled statistics_enable_access_log
authenticated: increment always
anonymous: increment always

enabled tracker, disabled statistics_enable_access_log
authenticated: increment only if unique
anonymous: increment always

disabled tracker, enabled statistics_enable_access_log
authenticated: increment always
anonymous: increment only if unique

enabled tracker, enabled statistics_enable_access_log
authenticated: increment only if unique
anonymous: increment only if unique

conclusion: this works perfectly only when the tracker module and the statistics_enable_access_log variable are enabled..

I think it is ok to increment when we can't decide if it is unique or not, but it is not ok that the correct functionality depend on two settings (tracker, statistics_enable_access_log)..

sorry, absolutely wrong review..
the {history} table does not depend on the tracker module..

Actually, if the count content views is disabled (!variable_get('statistics_count_content_views', 0)) then it will never hit any logic to increment. It works best if the access log is enabled since we can test for unique anonymous hits, so maybe we should add a message under the count views option to let the user know that this feature will work best if the access log is enabled.

Hi there, I'm a bit of a drupal project newbie still and have yet to understand exactly how a lot of this works.

This appears to be a thread discussing how to improve the statistics module so that it's more efficient and doesn't record every single pageview sent its way (which I agree is lame and facile).

Somewhere in this thread, is there a patch for the D5 statistics module? Or is this all D7? I'm running a D5 site. If I want to make this change to my stats module there, does this thread have a way to do that in it? If not, could someone link me to a thread that does?

@bflora: the patch in #7 is for D5, but you should NOT hack core if you want to avoid problems.. I think this patch won't be backported to D5, as this is a new feature..

bflora, I should have a 5.x version of the Statistics Advanced module available soon which has this feature.

Very cool. I look forward to it.

I really need a D5 patch. Which one should I take? I tried the first three patches, but none of them worked with my D5.10 version, the counter still counts every hit.

I would really appreciate your help.

I did try the patch in #7. Unfortunately even this does not work with my version. If logged in it does not count anything, even if the node has not been viewed before. Not logged in, every hit is counted.

Anybody any idea?

Status:Needs work» Needs review
StatusFileSize
new3.08 KB
Passed: 7565 passes, 0 fails, 0 exceptions
[ View ]

Last patch fails. Here's a rerolled version with clearified comments.
Please review.

I wonder if we shouldn't add an admin option to select after which time page views by the same user should be counted again. Nodes can be updated, and if the user reads the node again after some time, that page view might be worth being counted.
Possible set of options: "immediately", "after 5 minutes", "after 15 minutes", "after 1 hour", "after 1 day", "after 7 days", "after 1 month", "never".

If that's the case, we can compare the node's updated timestamp to when the user last accessed the node, no admin option needed.

That's yet another way of thinking about this. However we can't easily determine how much a node was changed if at all. So we can't generally assume whether this is a reason a page view should be counted again or not. This depends heavily on the content.
I think if someone can decide after which time page views should be counted again, then only the webmaster.

Status:Needs review» Needs work

Some testing revealed that the code doesn't work at all: Authenticated users are not even counted once.

We had that working, but then in #25 we started checking against the history table, which is a mistake.
The problem is that the history record gets updated way earlier, so we can't know if the page has been viewed by this user before - we only find that the user has viewed the page right now (surprise, surprise). So a user viewing the page for the very first time, at this point already has a corresponding {history} record. Because we misinterpret this already as a repeated page view, we don't ever count a user's page view.
Only the {accesslog} table is reliable, as we are updating it later within the same hook.

Now, what can we do?
IMHO we should return to querying only the {access_log} table. We can further improve the indexing to make sure it's as fast as possible.
Even if we can fix the implementation of the query to the {history} table, there are some more reasons for this move:

Case 1: Access log is enabled
There is no point in adding up absolutely-unique page views (by authenticated users) with per-session-unique page views (by anonymous visitors). This handicaps authentificated users compared to anonymous visitors. The result is like adding apples and cherries.
For a default core behaviour it is also too complicated to explain where the data comes from in this case or in that case, and what exactly this does imply. Finally this is so difficult, because it doesn't really imply anything, but is only some kind of haltingly approximation to the popularity of a page.
Case 2: Access log is disabled
It also doesn't make much sense to count only authenticated users' page views. Surely there might be some use cases for this. But in most other cases the - in fact low - significance of such figures will be misinterpreted and overrated. So this is nothing we should ship as default core behavior. If we want, we can make this an option that the admin can decidedly choose.
General concept of Unique page-views
Finally, as already indicated in #37, I seriously question the concept of counting authenticated users' page views only once in a lifetime.
Google Analytics defines:
A unique pageview [...] aggregates pageviews that are generated by the same user during the same session.
[It] represents the number of sessions during which that page was viewed one or more times.
(see here.)
To comply with this definition, which makes a lot of sense and is the expected behaviour, our counts need to be session-based for both authenticated users and anonymous visitors. This means we need to rely on the {accesslog} data only.

From my point of view, this is a reasonable standard behavior that we can ship with D7. To allow for every other way of counting page-views, we should make sure this can be easily be overridden by a simple hook implementation in contrib.

If you guys follow my considerations, I'm going to rework this into a new patch ASAP. I also started work on some simpletests, which obviously need to be adapted to our new strategy, but otherwise are quite done.

Another remark: We need to make sure the pageview is not counted if a http status code "403 Forbidden" or "404 Not Found" is returned. Probably we need to add a 'status_code' column to the {accesslog} table. Though we could also check the Page title in the record, this would be no good idea performance-wise.

Page views are only counted in the {node_counter} table... They are not affected by 403s and 404s.

Page views [...] are not affected by 403s and 404s.

No. This statement is wrong in two respects:

  1. Both 403s and 404s do get counted in {node_counter}!
  2. {accesslog} records are added for both 403s and 404s, so they are also checked against in statistics_exit(). So after we have fixed the first bug, the next two edge-cases will arise:
    a) Some user wants to see node/3, but he gets a "403 forbidden". Now he gets permission. But even if he is successful on retry, then no pageview will be counted at all.
    b) node/135 doesn't exist now and the user gets a "404 Not found". Now that it has been created, but the next (successful) try won't be counted as well.
    Why? In both cases an {accesslog} entry was written already at the failed pageview, so the next (successful) pageview is considered a duplicate.

The second problem is admittedly not really big (I said: edge-cases), it's just not nice.
The first one though is really uncool and needs to be fixed asap (see #345133: 403s and 404s are counted in {node_counter}). IMHO this also needs to be backported to D6.

If you still can't believe it: go and check it, please.

Pancho. Wow. I did not believe that 403s and 404s were counted in {node_counter}. You were right. :) I posted a patch in that issue that I'm going to implement in my statistics advanced module, which continues to essentially be a D6 backport of this issue.

Any one has succesfully patched a D5 installation ? I am looking forward to a D5 fix and the Statistics Advance module is only for D6.

Version:7.x-dev» 8.x-dev
Category:task» feature

Bumping to D8.

No I am sorry to tell you, that I have never been able in patching a D5 version of the module...I am also still looking for this in a D5 version!

Does unique_page_views_37.patch work on Drupal 7.2?

subscribing

I'm amazed that this ancient issue is still going on. How hard is it to have a working core Statistics module ? There are patches dated 2008 in this thread. Wow.

subscribe

If you are using Google Analytics, I think this module can do the job: http://drupal.org/project/google_analytics_counter. It has an option to override drupals core node counter.

@haggster Just keep in mind that Google Analytics is not a replacement if you need 100% accurate stats. Users are able to opt-out of Google Analytics in which case the hit won't be logged. I'm also concerned about the future of Google Analytics if they start respecting the Do Not Track header this could happen if there was pressure from the FTC, making the stats even less accurate.

Assigned:Dave Reid» Unassigned

I don't think there is a good way todo this without bringing more performance issues, especially as I would like remove accesslog and only keep statistics.module as a node_counter.

Category:feature» support

Any movement on this? Statistics Filter (http://drupal.org/project/statistics_filter) helps with filtering out hits based on role, but are we still short of a D7 solution/fix for the core stats counter?