403s and 404s are counted in {node_counter}

Pancho - December 10, 2008 - 09:39
Project:Drupal
Version:7.x-dev
Component:statistics.module
Category:bug report
Priority:normal
Assigned:Unassigned
Status:needs work
Description

While working on #90468: Only record unique hits in node counter stats, I found out that 403s and 404s are counted in {node_counter}. This further invalidates the node view counter.
I see no quick and easy solution, as the status is not available in $GLOBALS nor elsewhere. Querying the node title in {accesslog} would be ugly and slow, so we'd need this to be set somewhere. Or: we use (or add) another hook that resembles hook_exit but is not called in these cases.
Unfortunately I don't have the time right now to further investigate this. Maybe someone can throw in her/his knowledge?

Note: This is loosely connected to #330542: Exempt custom 403 and 404 pages from popular content - customerror could probably help.

#1

kbahey - December 10, 2008 - 15:20

You are right. customerror will help in this case. As long as they are nodes, they have all the baggage that nodes have. The purpose of customerror is to make those pages non-nodes so they don't have all the side effects described here and in the other issue.

#2

Dave Reid - December 10, 2008 - 18:11
Assigned to:Anonymous» Dave Reid
Status:active» needs review

Steps to reproduce:
1. Go to an invalid node, e.g. node/1234567890
2. Look in the {node_counter} table in your database
3. Notice there has been a record added for an invalid nid '1234567890'

After some investigating, here's a patch that checks if it is a valid node and the user actually had access to view it. This will perform even better when we cache the data from node_load(). I couldn't think of any other solution.

AttachmentSizeStatusTest resultOperations
345133-dont-count-them-nodes-D7.patch1.04 KBIdlePassed: 7641 passes, 0 fails, 0 exceptionsView details | Re-test

#3

Dave Reid - December 10, 2008 - 23:41
Status:needs review» needs work

Arg... this doesn't work if the page is a cached view and hasn't loaded all modules.

node_load() requires node.module and common.inc (drupal_schema_fields_sql).
node_access() requires node.module and user.module.

#4

Pancho - December 11, 2008 - 07:00
Assigned to:Dave Reid» Anonymous
Status:needs work» needs review

I took another approach that should always work plus it also solves a bunch of other requests like #330542: Exempt custom 403 and 404 pages from popular content.
I'm extracting the '403' or '404' string out of the header. The value is then also written to a new 'status_code' field in the {accesslog} table, so 'popular content' lists or blocks can easily filter on that.
A variation of this would be: to do drupal_get_title() earlier and use it both for the 403/404 handling and for the title field in {accesslog}.
What do you think?

AttachmentSizeStatusTest resultOperations
statistics_handle_403-404.patch2.49 KBIdleFailed: 7658 passes, 1 fail, 0 exceptionsView details | Re-test

#5

Dave Reid - December 11, 2008 - 07:12

It's a good idea, although IMHO it should probably be a patch as a part of #330542: Exempt custom 403 and 404 pages from popular content since it's out of the context of this issue.

Plus this looks very dubious:

<?php
$status_code
= substr(drupal_get_headers(), strlen('Content-Type: text/html; charset=utf-8' . $_SERVER['SERVER_PROTOCOL']) +2, 3);
?>

#6

Pancho - December 11, 2008 - 07:53

I think that this is very right here in this issue. I'm not fixing the UI for the other issues, that remains to be solved over there. I'm just fixing this issue here, and only by the way I'm tackling the underlying problem that prevents the other issues to be easily solved. We can decouple it though, if that makes more sense to the committers.

I agree the obscure string operation in patch #4 is not optimal. So the other suggestion proposed in #4 will probably be the more transparent way. Patch enclosed.

AttachmentSizeStatusTest resultOperations
statistics_handle_403-404.patch2.7 KBIdlePassed: 7659 passes, 0 fails, 0 exceptionsView details | Re-test

#7

Pancho - December 11, 2008 - 07:56

Removed a leftover. Please test this one.

AttachmentSizeStatusTest resultOperations
statistics_handle_403-404_7.patch2.67 KBIdlePassed: 7659 passes, 0 fails, 0 exceptionsView details | Re-test

#8

kbahey - December 11, 2008 - 15:30

I feel that this is hackish, because it is an exception to the normal case. Exceptions and special cases are always problematic in the long run. Rather, I would see a cut down version of customerror in core. Simply two textareas and a menu path for each. Then they are not nodes and the problem goes away.

#9

Pancho - December 11, 2008 - 16:40

I don't know why you think this was more of an exception than it should be, and thus problematic. In fact, 403s and 404 are exceptions, so we need to treat them as such.

Still, I think for a very different reason that #7 is still not optimal as we're checking the title string to find out the status. If there were any nodes with one of the two "magic titles", they wouldn't be counted either.

However, I don't see any other possibility unless we introduce another global (e.g. $http_status) that is set in drupal_not_found() and drupal_access_denied(). This is what I first thought, and probably it remains the best way.

Simply including a cut down version of customerror would IMHO not help. Surely non-nodes don't have the problem. But we are offering the admin to choose any node as 403 or 404 page, and whatever choice the admin takes, the error page shouldn't be counted. So we'd have to /replace/ this feature by customerror, and if we did that, I can already see the complaints, because customerror is not CCK-capable and whatever else.
So no - unfortunately we can't go this way.

Enclosed is another corrected version of patch #7 (I was too much in a hurry). You may continue reviewing that, but I'm gonna write another alternative solution tonight.

AttachmentSizeStatusTest resultOperations
statistics_handle_403-404_9.patch2.66 KBIdlePassed: 7659 passes, 0 fails, 0 exceptionsView details | Re-test

#10

kbahey - December 11, 2008 - 16:47

Recording the HTTP status in statistics module is a good idea. I stopped using this module a long time ago because of scalability issues though.

My objection is mainly that we are continuing to treat the 403/404 as nodes.

We are going around the main problem, using nodes where nodes is not a good solution, than trying to find solutions for this.

#11

Pancho - December 11, 2008 - 17:29

Let's tackle this bug first, please, then we can still start a debate over the general idea! There would be a way to go before the non-node solution is as flexible as the the current solution, and I'm still not bought that this really is what we want.

How do you like the following alternative? IMHO it is the cleanest of the approaches we have seen.

AttachmentSizeStatusTest resultOperations
statistics_handle_403-404_11.patch4.4 KBIdlePassed: 7659 passes, 0 fails, 0 exceptionsView details | Re-test

#12

Dave Reid - December 11, 2008 - 17:34
Status:needs review» needs work

We don't need the 'description' key when adding a field:

+  $ret = array();
+  $field = array(
+    'type' => 'int',
+    'unsigned' => TRUE,
+    'not null' => FALSE,
+    'default' => 0,
+    'description' => 'HTTP/1.1 Status Code',
+  );
+  db_add_field($ret, 'accesslog', 'http_status', $field);
+  return $ret;
+}

Plus, you're adding an index in statistics_schema, but not providing an upgrade path to create the index. What need to we have to create an index on the status code field?

#13

Pancho - December 11, 2008 - 18:05
Status:needs work» needs review

Oh, now you're getting picky like me... ;)
But you're absolutely right about the 'description' and the missing index upgrade path!
However, I think an index is very appropriate to be able to filter out 403s and 404 in the 'Popular content' block (see #330542: Exempt custom 403 and 404 pages from popular content and mark them yellow or red in the 'Recent hits', 'Top referrers' and 'Top pages' report. The UI part would be a separate issue though.

My remaining question with this is whether you guys think that drupal_initialize_variables() is the correct (or the best) place to initialize the $http_status global or if we should put it somewhere else...

AttachmentSizeStatusTest resultOperations
statistics_handle_403-404_13.patch4.44 KBIdlePassed: 7659 passes, 0 fails, 0 exceptionsView details | Re-test

#14

Dries - December 11, 2008 - 21:04

This looks like a reasonable approach. Would be ideal if it came with some tests.

#15

gpk - December 12, 2008 - 00:25

Note that Drupal sometimes emits other response codes, e.g. 304...

#16

Dave Reid - December 14, 2008 - 18:09

I like the idea, but I think this may also be better implemented as a function instead of introducing a whole new global variable:

<?php
function drupal_http_status($set_status = NULL) {
  static
$status = 200;

  if (isset(
$set_status) && is_int($set_status)) {
   
$status = $set_status;
   
// Maybe even send the status header here?
    // header($set_status, ...) we'd need an extra $reason = '' parameter to do this
 
}

  return
$status;
}
?>

So in drupal_access_denied():

<?php
drupal_http_status
(403);
?>

And in statistics_exit():

<?php
$status
= drupal_http_status();
?>

Agreed with Dries that this will def. need tests.

#17

Pancho - December 14, 2008 - 20:07
Status:needs review» needs work

Yeah, that makes sense, though for code readability I'd prefer to call it drupal_set_http_status($status_code) and add a separate wrapper function drupal_get_http_status().

#18

Dave Reid - January 6, 2009 - 21:37
 
 

Drupal is a registered trademark of Dries Buytaert.