Chrome, Opera, Safari and Firefox (not sure about IE9) all contain web page thumbnails which are displayed on the new tab page. These are updated semi-regularly and require the browser to render the entire page which causes Google Analytics to see this as a visitor with a really short bounce time.
When these thumbnail requests are made, a specific header is sent so they can be filtered out. I believe Chrome, Opera and Safari use "X-Purpose: preview" and Firefox uses "X-MOZ: prefetch". What we could do is watch for these headers and not output the Google Analytics script on these page requests.
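A minimal sketch of the header check, written in Python purely for illustration (it is not Drupal code). The header names and values are the ones reported in this issue and may not match every browser version; `is_preview_request` is a hypothetical helper name.

```python
# Header names/values as reported in this issue; treat them as assumptions.
PREVIEW_HEADERS = {
    "x-purpose": "preview",  # reported for Chrome, Opera and Safari
    "x-moz": "prefetch",     # reported for Firefox
}


def is_preview_request(headers):
    """Return True if the request headers mark a thumbnail or prefetch load.

    `headers` is a plain dict of request header names to values.
    """
    # HTTP header names are case-insensitive, so normalise before comparing.
    normalised = {
        name.lower(): value.strip().lower() for name, value in headers.items()
    }
    return any(
        normalised.get(name) == value for name, value in PREVIEW_HEADERS.items()
    )
```

A page callback could call this once per request and skip the tracking snippet when it returns True.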
The only problem with this solution is that Chrome has added a new feature called "Instant", where it *tries* to load a web page as you type the address (yes, this is rude because it nails the server with 404s, but it'll probably be a default feature before long). Every time one of these HTTP requests is made it sends the "X-Purpose: preview" header to the server. So if the user never hits enter (because Chrome has already loaded the page), the page they end up on was requested with "X-Purpose: preview", and with my solution above it wouldn't load the Google Analytics code. Maybe to solve this we could wait 5 seconds to confirm the user is actually on the page? I think this would be better than loading the Google Analytics code potentially 10+ times per page view.
I'm interested to hear what other people think about this as it appears to be a real problem that's going to mess with our stats.
Comments
Comment #1
hass commented
Can you share some more background information, please? It's the first time I've heard about this and I'm not sure if I really understand it.
Comment #2
SeanBannister commented
I just found a video on YouTube that shows the two things I'm talking about: http://www.youtube.com/watch?v=XnX9bbg1dYU
Notice that when the user clicks the new tab button the browser displays thumbnails of websites the user has viewed in the past. The browser keeps these thumbnails up to date, so they display the latest content on the site, by regularly visiting the website in the background (without the user's knowledge). This requires the browser to render the entire page, including the Google Analytics code, but the user isn't actually visiting the site as this is happening in the background.
Then watch what happens as the user starts typing a URL: the browser tries to guess the URL being typed and loads that page before the user hits enter. The problem is that it will often guess the wrong URL and load a website that the user didn't actually want to go to. Let's use this issue's URL as an example : http://drupal.org/node/1096478
The user starts to type : drup
The browser guesses and loads : drupal.com
The user continues to type : drupal.org
The browser loads : drupal.org
The user continues to type : drupal.org/nod
The browser loads : drupal.org/nod and gets a 404 error
The user types : drupal.org/node
The browser loads : drupal.org/node as this page actually exists
The user types : drupal.org/node/1
The browser loads : drupal.org/node/1 as this page exists
The user types : drupal.org/node/10
The browser loads : drupal.org/node/10
The user types : drupal.org/node/109
The browser loads : drupal.org/node/109
...
And so on until the browser finally loads http://drupal.org/node/1096478
This is a bit of an exaggeration, because if the user typed fast enough the browser wouldn't have time to load all of these pages, but it'd get a few loads in before the user finished typing. And as you can see above, the user didn't want to go to many of these pages, but they now show up in Google Analytics as visits with a high bounce rate.
As I mentioned in my first post, the browser sends "X-Purpose: preview" or "X-MOZ: prefetch" when it makes these preview requests to the server.
@hass if you need any more information or would like me to show you how it works and how the headers are sent, I can set up screen sharing.
Comment #3
hass commented
The user continues to type : drupal.org/nod
The browser loads : drupal.org/nod and gets a 404 error
What browser does this bullshit and overloads my webserver? I've never seen this yet, and the video only shows the Google search results... I only know that Google refreshes its own search results this way, but they don't punish my server with this useless load... Or at least I have not seen this yet... :-(((
Until today I thought the previews were images, cached from my last visit... Good to know. As it is still traffic, we should log it to learn how much traffic this produces for our sites. But if there is no standard and not all browsers send us something... I don't know how to implement it.
Do you have some docs that describe the headers and how this works per definition?
Comment #4
SeanBannister commented
Yeah, it's really rude sending all this extra traffic, but on the other hand I turned Instant on in my browser and find it really useful :| Catch-22. I don't believe Google Chrome currently turns this feature on by default, but it wouldn't surprise me if they turn it on in the future, so it'd be good to prepare.
Google does have a page about blocking the Instant feature but it only works per browser session.
While searching for some "official" information about these headers I turned up an interesting Mozilla article about specifying links that a browser should prefetch. Of course this is what we do when we want to prefetch images, so it certainly isn't a new concept, as this also adds the "X-MOZ: prefetch" header.
In the Chromium issue queue there isn't much mentioned:
https://code.google.com/p/chromium/issues/list?can=1&q=%22X-Purpose%22&c...
I really can't find much official information about these headers, but I can certainly see them in my server logs. To add the headers to your Apache logs you'll need to set up a custom log format like the following:
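As a hedged illustration only: assuming a custom format that appends the two request headers as quoted fields (the directive in the comment below uses Apache's `%{Header}i` syntax; the nickname "preview" and the field order are assumptions, not the format from the original post), a short Python sketch can count how many logged requests were previews.

```python
import re
from collections import Counter

# Assumed Apache directive (hypothetical nickname "preview"):
#   LogFormat "%h %l %u %t \"%r\" %>s %b \"%{X-Purpose}i\" \"%{X-Moz}i\"" preview
# Apache writes "-" for a header that was not sent.
LINE_RE = re.compile(
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<purpose>[^"]*)" "(?P<moz>[^"]*)"$'
)


def count_requests(lines):
    """Count preview vs. normal requests in log lines of the assumed format."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line.strip())
        if match is None:
            continue  # skip lines that do not follow the assumed format
        is_preview = (
            match.group("purpose").lower() == "preview"
            or match.group("moz").lower() == "prefetch"
        )
        counts["preview" if is_preview else "normal"] += 1
    return counts
```

Calling `count_requests(open("access.log"))` (file name hypothetical) gives a rough idea of how much of the traffic comes from thumbnails, Instant and prefetching.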
Comment #5
hass commented
Really bad stuff... If we want to block these requests we may need a new module (one may already exist) with a hook_boot() implementation that prevents a full bootstrap, sends a 403 back and exits page processing directly after this hook runs, for anyone who doesn't like this stuff... I'm still wondering whether anyone would still get these thumbnails for servers that send a 403 for these requests...
This is more or less out of scope for GA, but I see your point that we should not log this stuff as page views if someone doesn't have such a blocker module. Let's write a patch for this sh**
Comment #6
SeanBannister commented
@hass Sorry, I should have elaborated: if you block requests that carry these headers you're blocking the thumbnails as well!!! This is obviously a big negative as the thumbnails can be really useful for users, so I think most sites will want to keep the thumbnails and Instant working and just stop Google Analytics from loading when "X-Purpose: preview" or "X-MOZ: prefetch" is sent.
I'm interested to know if you think this is in scope for the GA module? I don't see any other use for this functionality outside of the GA module. Other analytics modules might like the functionality, but I don't think it warrants a separate module as the code sharing would be minimal. Yet I think it's a feature everyone with the GA module needs by default.
Comment #7
hass commented
This is all in scope for GA, but I wish I were able to block Instant requests (which I understand as the browser trying to load my website while the user is typing), as this produces tons of 404s on my server, and I really wish I could block/disable this in the client browser from the server side.
Comment #8
SeanBannister commented
Yeah, I totally agree, it may happen once they think it through. What they need is a separate header for Instant and for thumbnails so you can block one and not the other.
Comment #9
hass commented
Can you confirm that a request sent with "X-MOZ: prefetch" really executes the JavaScript inside the prefetched web page? If it does, it must really download all files linked in the prefetched document... this would really add some strange network load... I cannot find docs about the expected behavior...
Safari 5.0.4: I have just traced with Wireshark and it does not send the X-PURPOSE header at all if you reload the TOP SITES tab...
Chrome 10.x: I see nothing helpful (except a request to /dictionary/js/api_loader.js) in the network trace when the browser is first opened and presents me the thumbnails. Additionally, it seems to be as I first guessed: the images are cached locally. If you reload the thumbnail page you see nothing in the network trace. I'm not sure in what situation the images are generated, nor how I can rebuild them. Need more information.
Comment #10
hass commented
This article also looks interesting http://dev.opera.com/articles/view/opera-speed-dial-enhancements/#with-x... and it also describes that we may need to create an extra module to serve custom minified previews, e.g. if you'd like to reduce server load or remove the tracking altogether, check in hook_boot() / _init() for the X-PURPOSE header and return a pre-generated image of your website for these requests. In this case the preview image wouldn't have tracking code inside and your server is also not overloaded. Otherwise the rewrite rule is much better as Drupal is not bootstrapped.
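A rough sketch of the "return a pre-generated image" idea, written as a Python WSGI wrapper for illustration rather than as a Drupal hook_boot() implementation. `preview_middleware` and its arguments are hypothetical names, and whether a given browser accepts an image response as its thumbnail is an untested assumption.

```python
def preview_middleware(app, preview_png):
    """Wrap a WSGI app so preview requests get a static image instead.

    `app` is the real site; `preview_png` is the pre-generated image as bytes.
    """

    def wrapped(environ, start_response):
        # WSGI exposes the "X-Purpose" request header as HTTP_X_PURPOSE.
        purpose = environ.get("HTTP_X_PURPOSE", "").strip().lower()
        if purpose == "preview":
            # Short-circuit: the site is never rendered, so no tracking
            # code runs and the server does very little work.
            start_response(
                "200 OK",
                [
                    ("Content-Type", "image/png"),
                    ("Content-Length", str(len(preview_png))),
                ],
            )
            return [preview_png]
        # Normal visitors fall through to the real application.
        return app(environ, start_response)

    return wrapped
```

The early return plays the same role as exiting in hook_boot(): the full page is never built for preview requests.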
I have some code ready and we could commit it without testing, but I would be happier if I were able to test this feature.
Comment #11
hass commented
Additionally, I have read somewhere that Google only sends a HEAD request to the server while Instant is active, to check whether the URL exists and returns a 200 OK. As a side note, this does not cause the JavaScript to be loaded.
Committed some code to prepare a follow up patch.
Comment #12
hass commented
Sean: should we close this case?
Comment #13
SeanBannister commented
Let's keep it open because we need to work out a solution. I'll look into it soon and post a patch.
Comment #14
hass commented
But it's not only a question of a patch... HEAD requests cannot be tracked and I was not able to test anything.
Comment #15
SeanBannister commented
Sorry, I should have read back through the conversation. I just wanted to clarify our conversation so far as I'm a little confused.
Just wanted to reply to :
Yes, these are cached, so refreshing the page doesn't update them. However, at regular intervals in the background Safari (and, I presume, Chrome) updates this cache to keep the thumbnails fresh. When Safari updates them it sends the "X-Purpose: preview" header.
I have Google Chrome Instant on in my browser, and sometimes when I start typing URLs it displays the site before I even finish typing. Sometimes it displays the wrong page because I haven't typed the full URL, and it is definitely pulling in JavaScript.
Just so we're on the same page, what I'm talking about is adding some code to the Google Analytics module that checks whether "X-Purpose: preview" or "X-MOZ: prefetch" is sent by the browser, and if it is, still loads Drupal but doesn't output the Google Analytics code. This way the user still gets their thumbnails or previews, but we don't have to deal with inaccurate stats in Google Analytics such as high bounce rates.
Here's some quick example code just for demonstration purposes:
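A minimal sketch of the idea in Python, for demonstration only; it is not Drupal or Google Analytics module code. `render_page` and `TRACKING_SNIPPET` are hypothetical names, the snippet is a placeholder, and the header values are the ones reported in this issue.

```python
# Placeholder standing in for the real Google Analytics snippet.
TRACKING_SNIPPET = "<script>/* analytics code would go here */</script>"

# (header name, value) pairs that mark a preview request, per this issue.
PREVIEW_MARKERS = (("x-purpose", "preview"), ("x-moz", "prefetch"))


def render_page(body_html, request_headers):
    """Build the page, leaving out the tracking code for preview requests."""
    sent = {k.lower(): v.strip().lower() for k, v in request_headers.items()}
    is_preview = any(sent.get(name) == value for name, value in PREVIEW_MARKERS)

    parts = ["<html><body>", body_html]
    if not is_preview:
        # Only real visits get the tracking code.
        parts.append(TRACKING_SNIPPET)
    parts.append("</body></html>")
    return "".join(parts)
```

The page itself is still rendered in full either way, so thumbnails and previews keep working; only the tracking snippet is left out.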
Comment #16
hass commented
Found #1326396: Instant browsing clutters 404 URL list dramatically in my own core 404 logs.
Comment #17
hass commented
We cannot implement this in a reliable way.
The reasons are:
The master deal breaker here is the caching, and we cannot solve this... and telling people to turn off their caching will make things even worse. We have the same issue with the DNT header... :-(
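One way to picture the caching problem, as a small Python sketch under the assumption that the page cache is keyed by URL only (the helper names are hypothetical and this is not how any particular Drupal cache is implemented): whichever variant is rendered first is what every later request receives.

```python
TRACKING_SNIPPET = "<script>/* analytics placeholder */</script>"


def render(url, is_preview):
    """Render a page, omitting the tracking code for preview requests."""
    body = "<h1>" + url + "</h1>"
    return body if is_preview else body + TRACKING_SNIPPET


def serve(cache, url, is_preview):
    """Serve from a cache keyed by URL only, rendering on a miss."""
    if url not in cache:
        cache[url] = render(url, is_preview)
    # On a hit the request headers are never looked at again.
    return cache[url]


cache = {}
first = serve(cache, "/node/1", is_preview=True)    # thumbnail request fills the cache
second = serve(cache, "/node/1", is_preview=False)  # real visitor gets the same copy
```

Here the real visitor is served the cached copy without tracking code, and if the real visit had come first, the preview request would have been served the copy with it.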