Hi,
more of an heads up than anything at the moment...
recently a module was contributed to improve statistics collection in the accesslog, Better Statistics.
Work is in progress to enhance it so that the accesslog table can be extended, adding additional fields whose content can be filled in by any module, via an API.
See:
#1779204: Create a statistics API
http://www.asmallwebfirm.net/blogs/2012/09/introducing-better-statistics...
I am reviewing and testing this development at the moment, also developing myself a very basic integration for IP Geoloc (more details in the posts above). Will post my work here once the new release will be on air.
At the moment there is a need for eyes to review Better Statistics itself before releasing, so if anyone can help, please join the effort.
P.S. Also, it looks like in D8 the accesslog will be dropped from core
#1446956: Remove the accesslog from statistics
Comments
Comment #1
mondrakefix typo
Comment #2
rdeboerHi mondrake
Thanks for your interest and the heads-up.
I had a look at "Better Statistics" and BrowserCap also -- interesting projects!
Look forward to working together.
Let me know if I can make things easier, eg put in a drupal_alter() or something.
Rik
Comment #3
mondrakeGreat! Better Statistics 7.x-1.1 with the Statistics API was released yesterday, so we can start figuring out what to do.
I am summarising here my current understanding - as a strawman for integration work. Any comment/correction/proposal is welcome!
Better Statistics (alike the core Statistics module) gathers and writes the accesslog data at the time of
hook_exit().IP Geoloc stores data in
thea session cache variable. However, it may be that at the time of$_SESSION['ip_geoloc']hook_exit()this variable is not set as it may be waiting for return of the asynchronous client-side process for HTML5 geolocation. Moreover, it may well be that this is never returned, and IP Geoloc will fallback to less detailed geolocation via Smart IP or the likes.So in my opinion we have the following scenarios to chase:
1.1)
$_SESSION['ip_geoloc']is available at the time ofhook_exit().1.2)
2) an ajax call with reverse geocoded data is returning via$_SESSION['ip_geoloc']is NOT available at the time ofhook_exit(), but a previously saved record in {ip_geoloc} for the calling ip exists, and fallback data can be fetched from there.ip_geoloc_current_location_ajax_recipient(). The page hits from when the ajax was started till when the ajax is returning can be associated to the geolocation of the returning ajax.3) there are page hits that were not geolocated
neitherathook_exit(), nor at ajax return. After a timeout periodof time of ??, we can assume they will no longer be geolocated bya runtimean asynchronous process. Soduring a batch process (cron ?),they could be geolocated through a fallback method (Smart IP, or..).4) 'historical' page hits recorded before IP Geoloc and/or Better Statistics were installed -these could be geolocated through a fallback method, if admins need to.
1.1 and 1.2 are easy, as they fall fully into Statistics API capabilities. Will post a patch proposal for this soon.
EDITS:
Comment #4
mondrakeSo...
first patch. This covers scenario 1.1 and 1.2 above.
The patch adds a ip_geoloc.statistics.inc file to enable IP Geoloc to feed the Statistics API, and implements in the main module file the hook_statistics_api().
No dependencies are introduced. Install Better Statistics module, then this patch to IP Geoloc, and go to Configuration > Settings > System > Statistics. Select from the menu the IP Geoloc fields you want to add to accesslog. Data collection (should) start immediately.
Added fields are exposed to Views - if you build a view on accesslog, it should be (relatively) straightforward to get to a hit list like this.
A couple points:
1) there are more fields in the session array than in the db table. I tried to capture them all for now, although maybe they're not all needed (see next)
2) for the moment I exposed all the fields I could find in the array, although it seems to me there are 'duplications' e.g. 'region' with 'administrative_area_level_2', or 'city' and 'locality'. Is there any logic that can be applied to make that more uniform?
Reviews welcome
Comment #5
rdeboerWell that all looks good mondrake!
Couple of remarks:
o we (I) should move away from storing stuff in the $_SESSION as it causes issues with Varnish on pages that could otherwise be cached by it -- I'm soon going to replace retrieval of the $location variable by a call like
$location = ip_geoloc_get_visitor_location()which will use a storage mechanism that can be set by the site administrator (and will include $_SESSION for backward compatibility).o will "Better Statistics" define the Views "schema"?
o yes Google, GeoIP API and Smart IP all return slightly different suburb/locality/city fields; I am not at my dev environment right now, but recall there is some "harmonisation" logic somewhere in the IP Geolocation module -- will get back to you
o if "Better Statistics" doesn't have it already it would be great for those who want to make the switch from core's Statistics module if historical Statistics data can be rolled over into Better Statistics' tables
Rik
Comment #6
mondrakeHi
to your points @ #5:
1. fine - just a couple of changes on the ip_geoloc.statistics.inc, let me know when you have that in dev and I will accommodate
2. yes - Better Statistics exposes automatically to Views the added fields
3. ok
4. Better Statistics does not have tables of its own, it just adds fields to the {accesslog} table. But your point on getting the 'historical' accesslog refreshed is valid. This is an additional scenario I will add to #3 above.
Comment #7
mondrakeHi Rik,
follow on thinking...
If you are already amending the way you get geolocation data by introducing a
ip_geoloc_get_visitor_location()function, would it make sense to:a) include an optional argument
ip_geoloc_get_visitor_location($harmonize = FALSE)that, in case set to true, would do the 'harmonisation' you mentioned in your point 3 above? That way - one single point to trigger that process, and the 'harmonisation' logic would not have to be embedded in the ip_geoloc.statistics.inc separately.b) include a 'validity' key with a timestamp indicating since when the data returned is valid? That would simplify the work for scenario 2.
Cheers
Comment #8
rdeboerTotally makes sense!
Rik
Comment #9
rdeboerRe #5 and
ip_geoloc_get_visitor_location($harmonize = FALSE)I have just released Session Cache API, so that I am now in a position to progress on this issue, starting by applying your patch (with minor mods, according to #5,6,7).
Will have to do some ground work first in the module itself to support Session Cache API, so bear with me...
Rik
Comment #10
mondrakeThanks for update!
Comment #11
rdeboerNote to self... do this.
Comment #12
rdeboerOk so the patch is in with minor mods (simplifications actually) due to the introduction of Session Cache API.
Haven't tested this at all yet.
Rik
Comment #13
mondrakeThanks, will give it some testing.
Comment #14
rdeboerThanks in advance, mondrake!
Comment #15
mondrakeHi Rik,
so just an update after testing for a few days on a site with
- Drupal 7.17
- Better Statistics 7.x-1.1+1-dev
- IP Geolocation Views and Maps 7.x-1.17+25-dev
(no Session Cache API)
In general, everything works fine :)
Pages are visited, the accesslog stores ip_geoloc data of the visitor, and I can build lists of accesslog entries, inclusive of geocoded info, with Views.
On a more detailed level, a couple of points:
a) ALL accesslog entries get geocoded data if available - even hits to pages specified as 'Exceptions: pages excluded from the set of pages specified above' in configuration. Just flagging, I am not sure whether the intent of that option is just to prevent the action of geocoding, or rather to really avoid tracking the information.
b) since there are different timelines to get to more accurate geolocation from first access to the site, we end up in a situation where the 'first' hits get more general data, and the 'last' hits more accurate one. However, as per my point in #4 above, we get a mixup of same data in different fields. Is there any chance to get to 'harmonization' as per points #7 #8?
c) sometimes the 'provider' field (e.g. 'google') is populated, sometimes not
Cheers
Comment #16
mondrakeAttached, the export of a view that return a list of geocoded hits from the accesslog.
Comment #17
mondrakeHi Rik,
after a bit more testing, thinking, and understanding how IP Geolocation and view works...
1) I found a small glitch that causes the $location['provider'] to be void in ip_geoloc_use_smart_ip_if_enabled(), causing the behaviour under #15 point c. A fix proposed in the patch attached.
2) I believe the case under #3, point 1.2, does not make sense. At ip_geoloc_init() we have a synchronous fetch of geolocation info via smart_ip or geoip modules if they are enabled, so that will be available in _ip_geoloc_get_field_value() which occurs at hook_exit(). If smart_ip or geoip are not enabled, then we would be likely just fetching partial (no provider, accuracy) and stale data here. So better just return NULL, and leave such case to a process similar to the one used for realigning {ip_geoloc} and {accesslog} [i.e. ip_geoloc_sync_with_accesslog()]. So I removed the fallback db_query in _ip_geoloc_get_field_value() in the patch attached.
3) I think I found the place where to tackle point 2 of #3, i.e. 'backfilling' more accurate geolocation to entries occurred between 'first click' and ajax returning. A placeholder in ip_geoloc_current_location_ajax_recipient() in the patch. My idea would be to run an update to all the accesslog entries inserted after $since with the information in $location for the given ip address.
So here attached patch to my current testing. I am also editing #3 to reflect the above.
Comment #18
rdeboerHi mondrake,
This is great! Thanks for your analysis and bug fix.
Hope to have time for a more detailed reply this weekend.
Rik
Comment #19
mondrakeHi, just an update.
As per #17 sub 3, I implemented a backfill function. Very much in testing mode for now... will have to be cleaned up of debug watchdogs.
One point I came across was setting the 'position_pending_since' session value. Currently in ip_geoloc this is set to the time when the hook_init is executed. However, in a a backfill mode this would let the 'parent' page out of the update. So my proposal would be to subtract timer_read('page') and an additional safeguard second to be sure parent page is included.
Attached patch to my current testing.
Comment #20
rdeboerThanks again, mondrake,
Will review soon.
Rik
Comment #21
mondrakeHi
I cleaned up the patch @19, which I now deem final from my POV.
This patch:
There are other points I noticed during testing, that I would anyhow suggest to leave to other issues:
Cheers
Comment #22
rdeboerHi mondrake,
Thanks for the latest patch and comments.
Hope to attend to this soon.
Rik
Comment #23
mondrakeHi,
as it stands, the implementation as per patch in #21 leads to a fatal exception if the statistics plugin is invoked within the context of a page served from cache. See #1879512: API - Dealing with statistics data collection on cached page requests.
A temporary patch will come soon.
Comment #24
mondrakeHi, until a stable solution comes for #1879512: API - Dealing with statistics data collection on cached page requests, with the attached patch, in case of page requests served from cache, ip_geoloc will return no information for filling in the accesslog - but the fatal exception will be avoided.
Comment #25
rdeboerRe #24:
Sounds good, mondrake.
Thanks so much for the patch.
Rik
Comment #26
mondrakeThanks Rik,
my suggestion at this point would be to get this in dev, if and when you are happy to, and close this issue - so to mark completion at this stage. That would also allow iamEAP (I suppose) to list 'IP Geolocation Views and Maps' as a module that has developed integration with Better Statistics, in its project page.
The case of IP geolocation in the context of cached requests will require further changes and separate review, and I would suggest to tackle that later in a new issue, to be opened as a follow up of this.
Cheers
Comment #27
iamEAP commentedJust curious, what's the status on this issue? Looks like the latest release has Statistics API code. Would love to feature this module on the Better Stats project page.
Comment #28
rdeboerHi iamEAP,
I need to apply mondrake's patch from #24.
Not sure if that will work well now with #1879512: API - Dealing with statistics data collection on cached page requests, though?
Maybe mondrake can comment?
Rik
Comment #29
mondrakeHi
the situation is the following:
a) what is in dev right now (which is basically the patch in #4, revised by Rik after the integration with Session API) will work OK as long as Better Statistics is not trying to record accesslog entries for cached page requests. The problem is with the ip_geoloc_get_visitor_location() function in ip_geoloc.statistics.inc. On cached page requests, modules are not loaded as the bootstrap phase remains low. Better Statistics loads ip_geoloc.statistics.inc, but ip_geoloc_get_visitor_location() is defined in ip_geoloc.module and won't be available at that point. The result is that the call to the function will fail on hook_exit invocation, and an error posted to the error_log file, as described in #1839090: Call to undefined function drupal_get_path().
b) the patch in #24 will overcome the error, by simply returning null if ip_geoloc_get_visitor_location() is not defined.
Also, this patch:
- fixes #15.c (sometimes the 'provider' field (e.g. 'google') is populated, sometimes not)
- implements a 'backfill' function as per #3.2 (ajax call with reverse geocoded data is returning asynchronously, and the page hits from when the ajax was started till when the ajax is returning can be associated to the geolocation of the returning ajax)
c) I have a fully functioning implementation that covers also full data gathering on cached page requests, this has been running smoothly on a site for two months now. I am waiting for #24 to be committed as I was proposing in #26, then I will post a new issue with a patch for that. I would prefer to close this thread and start fresh with that if you are OK.
Cheers
Comment #30
rdeboerThanks mondrake for the update.
Hope to find time to apply the patch soon.
Rik
Comment #31
rdeboerI have finally applied the patch from #24, with attribution.
Thanks mondrake.
Sorry for the delay.
Rik
Comment #32
rdeboer