Hi,

more of an heads up than anything at the moment...

recently a module was contributed to improve statistics collection in the accesslog, Better Statistics.

Work is in progress to enhance it so that the accesslog table can be extended, adding additional fields whose content can be filled in by any module, via an API.

See:
#1779204: Create a statistics API
http://www.asmallwebfirm.net/blogs/2012/09/introducing-better-statistics...

I am reviewing and testing this development at the moment, also developing myself a very basic integration for IP Geoloc (more details in the posts above). Will post my work here once the new release will be on air.

At the moment there is a need for eyes to review Better Statistics itself before releasing, so if anyone can help, please join the effort.

P.S. Also, it looks like in D8 the accesslog will be dropped from core
#1446956: Remove the accesslog from statistics

Comments

mondrake’s picture

Title: Integartion with Better Statistics? » Integration with Better Statistics?

fix typo

rdeboer’s picture

Hi mondrake
Thanks for your interest and the heads-up.
I had a look at "Better Statistics" and BrowserCap also -- interesting projects!
Look forward to working together.
Let me know if I can make things easier, eg put in a drupal_alter() or something.
Rik

mondrake’s picture

Status: Needs review » Active

Great! Better Statistics 7.x-1.1 with the Statistics API was released yesterday, so we can start figuring out what to do.

I am summarising here my current understanding - as a strawman for integration work. Any comment/correction/proposal is welcome!

Better Statistics (alike the core Statistics module) gathers and writes the accesslog data at the time of hook_exit().
IP Geoloc stores data in the $_SESSION['ip_geoloc'] a session cache variable. However, it may be that at the time of hook_exit() this variable is not set as it may be waiting for return of the asynchronous client-side process for HTML5 geolocation. Moreover, it may well be that this is never returned, and IP Geoloc will fallback to less detailed geolocation via Smart IP or the likes.

So in my opinion we have the following scenarios to chase:
1.1) $_SESSION['ip_geoloc'] is available at the time of hook_exit().
1.2) $_SESSION['ip_geoloc'] is NOT available at the time of hook_exit(), but a previously saved record in {ip_geoloc} for the calling ip exists, and fallback data can be fetched from there.
2) an ajax call with reverse geocoded data is returning via ip_geoloc_current_location_ajax_recipient(). The page hits from when the ajax was started till when the ajax is returning can be associated to the geolocation of the returning ajax.
3) there are page hits that were not geolocated neither at hook_exit(), nor at ajax return. After a timeout period of time of ??, we can assume they will no longer be geolocated by a runtime an asynchronous process. So during a batch process (cron ?), they could be geolocated through a fallback method (Smart IP, or..).
4) 'historical' page hits recorded before IP Geoloc and/or Better Statistics were installed -these could be geolocated through a fallback method, if admins need to.

1.1 and 1.2 are easy, as they fall fully into Statistics API capabilities. Will post a patch proposal for this soon.

EDITS:

  • Scenario 4 added based on input in #5
  • Removed 1.2 based on #17
mondrake’s picture

Title: Integration with Better Statistics? » Integration with Better Statistics
Status: Active » Needs review
StatusFileSize
new30.23 KB
new10.74 KB

So...

first patch. This covers scenario 1.1 and 1.2 above.

The patch adds a ip_geoloc.statistics.inc file to enable IP Geoloc to feed the Statistics API, and implements in the main module file the hook_statistics_api().

No dependencies are introduced. Install Better Statistics module, then this patch to IP Geoloc, and go to Configuration > Settings > System > Statistics. Select from the menu the IP Geoloc fields you want to add to accesslog. Data collection (should) start immediately.

Added fields are exposed to Views - if you build a view on accesslog, it should be (relatively) straightforward to get to a hit list like this.

geocoded hits

A couple points:
1) there are more fields in the session array than in the db table. I tried to capture them all for now, although maybe they're not all needed (see next)
2) for the moment I exposed all the fields I could find in the array, although it seems to me there are 'duplications' e.g. 'region' with 'administrative_area_level_2', or 'city' and 'locality'. Is there any logic that can be applied to make that more uniform?

Reviews welcome

rdeboer’s picture

Well that all looks good mondrake!

Couple of remarks:

o we (I) should move away from storing stuff in the $_SESSION as it causes issues with Varnish on pages that could otherwise be cached by it -- I'm soon going to replace retrieval of the $location variable by a call like $location = ip_geoloc_get_visitor_location() which will use a storage mechanism that can be set by the site administrator (and will include $_SESSION for backward compatibility).

o will "Better Statistics" define the Views "schema"?

o yes Google, GeoIP API and Smart IP all return slightly different suburb/locality/city fields; I am not at my dev environment right now, but recall there is some "harmonisation" logic somewhere in the IP Geolocation module -- will get back to you

o if "Better Statistics" doesn't have it already it would be great for those who want to make the switch from core's Statistics module if historical Statistics data can be rolled over into Better Statistics' tables

Rik

mondrake’s picture

Hi

to your points @ #5:

1. fine - just a couple of changes on the ip_geoloc.statistics.inc, let me know when you have that in dev and I will accommodate

2. yes - Better Statistics exposes automatically to Views the added fields

3. ok

4. Better Statistics does not have tables of its own, it just adds fields to the {accesslog} table. But your point on getting the 'historical' accesslog refreshed is valid. This is an additional scenario I will add to #3 above.

mondrake’s picture

Hi Rik,

follow on thinking...

If you are already amending the way you get geolocation data by introducing a ip_geoloc_get_visitor_location() function, would it make sense to:

a) include an optional argument ip_geoloc_get_visitor_location($harmonize = FALSE) that, in case set to true, would do the 'harmonisation' you mentioned in your point 3 above? That way - one single point to trigger that process, and the 'harmonisation' logic would not have to be embedded in the ip_geoloc.statistics.inc separately.
b) include a 'validity' key with a timestamp indicating since when the data returned is valid? That would simplify the work for scenario 2.

Cheers

rdeboer’s picture

Totally makes sense!
Rik

rdeboer’s picture

Re #5 and ip_geoloc_get_visitor_location($harmonize = FALSE)

I have just released Session Cache API, so that I am now in a position to progress on this issue, starting by applying your patch (with minor mods, according to #5,6,7).

Will have to do some ground work first in the module itself to support Session Cache API, so bear with me...

Rik

mondrake’s picture

Thanks for update!

rdeboer’s picture

Note to self... do this.

rdeboer’s picture

Ok so the patch is in with minor mods (simplifications actually) due to the introduction of Session Cache API.
Haven't tested this at all yet.
Rik

mondrake’s picture

Thanks, will give it some testing.

rdeboer’s picture

Thanks in advance, mondrake!

mondrake’s picture

Hi Rik,

so just an update after testing for a few days on a site with
- Drupal 7.17
- Better Statistics 7.x-1.1+1-dev
- IP Geolocation Views and Maps 7.x-1.17+25-dev

(no Session Cache API)

In general, everything works fine :)

Pages are visited, the accesslog stores ip_geoloc data of the visitor, and I can build lists of accesslog entries, inclusive of geocoded info, with Views.

On a more detailed level, a couple of points:
a) ALL accesslog entries get geocoded data if available - even hits to pages specified as 'Exceptions: pages excluded from the set of pages specified above' in configuration. Just flagging, I am not sure whether the intent of that option is just to prevent the action of geocoding, or rather to really avoid tracking the information.
b) since there are different timelines to get to more accurate geolocation from first access to the site, we end up in a situation where the 'first' hits get more general data, and the 'last' hits more accurate one. However, as per my point in #4 above, we get a mixup of same data in different fields. Is there any chance to get to 'harmonization' as per points #7 #8?
c) sometimes the 'provider' field (e.g. 'google') is populated, sometimes not

Cheers

mondrake’s picture

StatusFileSize
new11 KB

Attached, the export of a view that return a list of geocoded hits from the accesslog.

mondrake’s picture

Hi Rik,

after a bit more testing, thinking, and understanding how IP Geolocation and view works...

1) I found a small glitch that causes the $location['provider'] to be void in ip_geoloc_use_smart_ip_if_enabled(), causing the behaviour under #15 point c. A fix proposed in the patch attached.
2) I believe the case under #3, point 1.2, does not make sense. At ip_geoloc_init() we have a synchronous fetch of geolocation info via smart_ip or geoip modules if they are enabled, so that will be available in _ip_geoloc_get_field_value() which occurs at hook_exit(). If smart_ip or geoip are not enabled, then we would be likely just fetching partial (no provider, accuracy) and stale data here. So better just return NULL, and leave such case to a process similar to the one used for realigning {ip_geoloc} and {accesslog} [i.e. ip_geoloc_sync_with_accesslog()]. So I removed the fallback db_query in _ip_geoloc_get_field_value() in the patch attached.
3) I think I found the place where to tackle point 2 of #3, i.e. 'backfilling' more accurate geolocation to entries occurred between 'first click' and ajax returning. A placeholder in ip_geoloc_current_location_ajax_recipient() in the patch. My idea would be to run an update to all the accesslog entries inserted after $since with the information in $location for the given ip address.

So here attached patch to my current testing. I am also editing #3 to reflect the above.

rdeboer’s picture

Status: Active » Needs review

Hi mondrake,
This is great! Thanks for your analysis and bug fix.
Hope to have time for a more detailed reply this weekend.
Rik

mondrake’s picture

Hi, just an update.

As per #17 sub 3, I implemented a backfill function. Very much in testing mode for now... will have to be cleaned up of debug watchdogs.

One point I came across was setting the 'position_pending_since' session value. Currently in ip_geoloc this is set to the time when the hook_init is executed. However, in a a backfill mode this would let the 'parent' page out of the update. So my proposal would be to subtract timer_read('page') and an additional safeguard second to be sure parent page is included.

Attached patch to my current testing.

rdeboer’s picture

Thanks again, mondrake,
Will review soon.
Rik

mondrake’s picture

Hi

I cleaned up the patch @19, which I now deem final from my POV.

This patch:

  • fixes #15.c including the change in #17.1
  • fixes the timestamp 'position_pending_since' to be as close as possible to the time when the page request was initiated, through logic in #19
  • implements the 'backfill' function as per #3.2

There are other points I noticed during testing, that I would anyhow suggest to leave to other issues:

  1. per #4 above, 'google' and 'smart_ip' location data still differ in what info goes in which field
  2. if we get an accurate position through 'google', but next time positioning is run, the routine fails, we have fallback data that is much less accurate than initially. I would suggest to retain the former data in this case instead of going to the fallback.
  3. the backfill routine relies on the session_id to update earlier accesslog entries. However, the session_id may change if the first call is anonymous (before login) and later the user authenticates.

Cheers

rdeboer’s picture

Hi mondrake,
Thanks for the latest patch and comments.
Hope to attend to this soon.
Rik

mondrake’s picture

Status: Needs review » Needs work

Hi,

as it stands, the implementation as per patch in #21 leads to a fatal exception if the statistics plugin is invoked within the context of a page served from cache. See #1879512: API - Dealing with statistics data collection on cached page requests.

A temporary patch will come soon.

mondrake’s picture

Status: Needs work » Needs review
StatusFileSize
new5.08 KB

Hi, until a stable solution comes for #1879512: API - Dealing with statistics data collection on cached page requests, with the attached patch, in case of page requests served from cache, ip_geoloc will return no information for filling in the accesslog - but the fatal exception will be avoided.

rdeboer’s picture

Re #24:
Sounds good, mondrake.
Thanks so much for the patch.
Rik

mondrake’s picture

Thanks Rik,

my suggestion at this point would be to get this in dev, if and when you are happy to, and close this issue - so to mark completion at this stage. That would also allow iamEAP (I suppose) to list 'IP Geolocation Views and Maps' as a module that has developed integration with Better Statistics, in its project page.

The case of IP geolocation in the context of cached requests will require further changes and separate review, and I would suggest to tackle that later in a new issue, to be opened as a follow up of this.

Cheers

iamEAP’s picture

Just curious, what's the status on this issue? Looks like the latest release has Statistics API code. Would love to feature this module on the Better Stats project page.

rdeboer’s picture

Hi iamEAP,
I need to apply mondrake's patch from #24.
Not sure if that will work well now with #1879512: API - Dealing with statistics data collection on cached page requests, though?
Maybe mondrake can comment?
Rik

mondrake’s picture

Hi

the situation is the following:

a) what is in dev right now (which is basically the patch in #4, revised by Rik after the integration with Session API) will work OK as long as Better Statistics is not trying to record accesslog entries for cached page requests. The problem is with the ip_geoloc_get_visitor_location() function in ip_geoloc.statistics.inc. On cached page requests, modules are not loaded as the bootstrap phase remains low. Better Statistics loads ip_geoloc.statistics.inc, but ip_geoloc_get_visitor_location() is defined in ip_geoloc.module and won't be available at that point. The result is that the call to the function will fail on hook_exit invocation, and an error posted to the error_log file, as described in #1839090: Call to undefined function drupal_get_path().

b) the patch in #24 will overcome the error, by simply returning null if ip_geoloc_get_visitor_location() is not defined.
Also, this patch:
- fixes #15.c (sometimes the 'provider' field (e.g. 'google') is populated, sometimes not)
- implements a 'backfill' function as per #3.2 (ajax call with reverse geocoded data is returning asynchronously, and the page hits from when the ajax was started till when the ajax is returning can be associated to the geolocation of the returning ajax)

c) I have a fully functioning implementation that covers also full data gathering on cached page requests, this has been running smoothly on a site for two months now. I am waiting for #24 to be committed as I was proposing in #26, then I will post a new issue with a patch for that. I would prefer to close this thread and start fresh with that if you are OK.

Cheers

rdeboer’s picture

Assigned: Unassigned » rdeboer

Thanks mondrake for the update.
Hope to find time to apply the patch soon.
Rik

rdeboer’s picture

Status: Needs review » Fixed

I have finally applied the patch from #24, with attribution.
Thanks mondrake.
Sorry for the delay.
Rik

rdeboer’s picture

Status: Fixed » Closed (fixed)