Hi,
I just installed the statistics filtering module, mainly so that I can keep track of search bots.
I also downloaded the browscap.ini file from http://www.garykeith.com/browsers/downloads.asp and tweaked .htaccess to set the (PHP) value of browscap to the appropriate location.
When I look at the 'User Agents' page I see that the module identifies a lot of browsers. However, it incorrectly classifies Slurp! as a non-crawler (screenshot attached).
I tried all three versions of browscap.ini from the site (normal version, lite version, PHP version). I also opened the file in a text editor, and Slurp! is classified as a crawler there. I tried uploading the browscap.ini file in both ASCII mode and binary mode (FTP modes).
However the problem is still there. Could you please look into this?
Thanks.
| Comment | File | Size | Author |
|---|---|---|---|
| #7 | useragents.jpg | 79.09 KB | mikeryan |
| #2 | statistics_filter_1.module | 10.28 KB | mikeryan |
| #1 | statistics_filter_0.module | 10.28 KB | mikeryan |
| | statistics-filtering-screenshot.png | 31.41 KB | varunvnair |
Comments
Comment #1
mikeryan
The issues here are as follows:
1. php_browscap.ini is the version that should be used - was that the first one you tried, and did Inktomi show up as a browser with that one? If so, I don't really have an explanation, but I'm guessing you used another version of browscap.ini...
2. Once a user agent has been identified as a browser or crawler, that designation remains even if an updated browscap.ini changes it.
3. statistics_filter caches browser strings for a day - a change to browscap.ini won't have any effect on browser strings that are in the cache. I've added an option to the settings to clear the cache.
4. If you're running PHP as an Apache module, copying a new browscap.ini will have no effect until you restart Apache.
The attached statistics_filter.module will update the crawler flag according to the latest active browscap.ini, and lets you clear the cache, as well as providing better help. Please try installing this version, and (making sure you have the PHP version of browscap.ini) restart Apache, then clear the cache. See if the next Slurp access shows up properly (as Inktomi on the crawlers list).
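To check outside the module what the active browscap.ini says about Slurp, something like the following might help. This is only a rough sketch, not the module's code: is_crawler_entry() is a made-up helper name, the user agent string is just an example of what Slurp sends, and it assumes the browscap entries carry a 'crawler' key.

```php
<?php
// Hypothetical helper (not part of statistics_filter): interpret the
// 'crawler' field of a get_browser() result. Returns null when unknown.
function is_crawler_entry($info) {
  if (!is_array($info) || !isset($info['crawler'])) {
    return null;
  }
  // browscap values may arrive as strings or booleans.
  return in_array($info['crawler'], array(true, 1, '1', 'true'), true);
}

// Example user agent string; substitute one copied from your own logs.
$ua = 'Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)';

// get_browser() only works when the browscap setting is active, so check first.
$info = ini_get('browscap') ? get_browser($ua, true) : false;
var_dump(is_crawler_entry($info));
```

A null result here means PHP has no usable browscap data at all, which is a different problem from a wrong crawler flag.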
Thanks.
Comment #2
mikeryan
Oops...
Comment #3
varunvnair commented
Hi,
Thanks for the prompt reply. BTW did I tell you that this module is simply great? :-) At least for small sites like mine ( http://www.ThoughtfulChaos.com ).
I am fairly sure that I uploaded the PHP version of browscap first, but I am not 100% sure. So maybe I made a mistake and uploaded the incorrect version (could you add this to the README?).
I use a shared web hosting package to host my site, so I cannot really restart Apache :-(
Here is what I plan to do.
I will definitely keep you updated.
Thanks. Have a nice day.
Comment #4
john.money commented
As a suggestion, I have seen other stat filtering interfaces that let the admin toggle whether an agent is a crawler or not. This functionality would get around having to have the latest browscap.ini file. Also, I have had good experience with this browser detection class (http://phpsniff.sourceforge.net/). FWIW...
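A minimal sketch of how such a manual toggle could sit on top of automatic detection. Everything here is hypothetical: resolve_crawler_flag() and the $overrides array are illustrative names, not anything statistics_filter provides.

```php
<?php
// Hypothetical helper: an admin-maintained override wins over whatever
// the automatic detection reported for that agent.
function resolve_crawler_flag($agent, $detected, $overrides) {
  if (array_key_exists($agent, $overrides)) {
    return (bool) $overrides[$agent];
  }
  return (bool) $detected;
}

// Illustrative data only: the admin has marked "Inktomi" as a crawler.
$overrides = array('Inktomi' => true);

var_dump(resolve_crawler_flag('Inktomi', false, $overrides)); // override applies
var_dump(resolve_crawler_flag('IE 6.0', false, $overrides));  // falls back to detection
```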
Comment #5
varunvnair commented
Hi,
I installed the updated module yesterday and have been keeping track of how it is doing.
Unfortunately, it is still not classifying search bots as crawlers. I am using the PHP version of browscap.ini. I have also enabled monitoring of crawlers in the module settings.
You can see what the listing looks like at http://www.thoughtfulchaos.com/varun/page/2005/06/13/useragents-as-shown-by-statistics_filter.
I am positive that I have done everything correctly this time. Are you aware of a live site where this module has been installed and is working correctly?
On a side note, how can I find out how many MySQL queries are fired to serve a page? I also want to know the execution time. Is there a module that does these things?
Thanks.
Comment #6
mikeryan
Yep, it works just fine on my Fenway Views site (see the attached screen capture). The one you've got seems strange: only a couple of very basic things (e.g., IE 6.0) seem to be recognized even in terms of consolidating the user agent name, let alone recognizing them as crawlers. It's hard to see what in my code could cause this - I may be grasping at straws, but what version of PHP are you using? The reason there is a PHP version of browscap.ini is that some versions had a buggy implementation of get_browser...
Thanks.
Comment #7
mikeryan
When you reference an attached file, it helps to actually attach it :-).
Comment #8
varunvnair commented
I also thought that the listing should show more human-friendly names instead of the user agent string. Something seems to be seriously wrong with my setup here.
Here is my configuration:
PHP 4.3.10
Apache/1.3.29 (Unix)
Drupal 4.6.1
Browscap version=2005-05-29 18:29:06 GMT
All this was garnered from the phpinfo() provided by my host. The phpinfo() can be found at
http://s96134469.onlinehome.us/check_info.php
You can go to the following page to see how get_browser() behaves on my site:
http://www.thoughtfulchaos.com/varun/blog/454-user-agent-tester
Comment #9
varunvnair commented
I don't know if this will help or not, but only the following are classified as crawlers:
Comment #10
varunvnair commented
I poked around with my installation to find the problem, but I was largely unsuccessful.
I have set the location of the browscap.ini file with the following line in my .htaccess file:
NOTE: The actual .htaccess file has the actual path :-)
I want to know whether this value is actually being set or not. What exact variable do I print?
I tried to print_r($_GLOBALS), print_r($_SERVER) and print_r($_ENV) but the browscap value was not in any of them. What should I try to print to check that the value of browscap is correct?
One suggestion:
The settings page for the module alerts the user if the browscap value is not set. If the value is set, then it should print out the value. This will help in detecting misconfigurations.
On a side note, my host's phpinfo() shows that 'Server API' is CGI. Does this mean anything?
Comment #11
mikeryan
OK, that's it - I forgot that I didn't have much luck configuring browscap in .htaccess, it only worked for me in php.ini (if you look back at your phpinfo() output, you'll see that browscap is NULL).
I'll do a little more research to see if there's any way to make browscap work from .htaccess, or at least httpd.conf, and document this in README.txt. And I'll also look into having the settings page report whether browscap is properly set.
If you can't edit php.ini, you might be out of luck :-(.
Comment #12
mikeryan
Oh, please don't edit the title when replying - I've done it myself in the past, you think you're adding a title to your reply, but you're actually changing the title of the issue...
Comment #13
varunvnair commented
I did suspect that the browscap setting was not having any effect when placed in .htaccess. But I brushed aside that possibility because if I do not set browscap anywhere, the module settings page shows an error saying that browscap.ini could not be found (or something similar). So I was under the impression that the browscap value was being used. What value should I print to check the value of browscap?
If the path to browscap can be set only in php.ini then I am out of luck because I host my site using a shared web hosting package and do not have access to php.ini. Seems like there is no sense in pursuing the issue :-(
There are a couple of mysteries here:
[01] It seems that PHP 'knows' the value of browscap from .htaccess (otherwise it throws an error) but does not actually use it.
[02] Why are some user agents still being classified as crawlers?
Sorry for changing the title of the issue. I thought I was just adding a title to my followup :-)
Thanks, Mike, for helping me out. You were a great help. I will still keep the module running for some days in case I can help you out with any more bugs or improvements. I will try to find a workaround to this problem and post it here if I find something that you can use.
Have a nice day.
P.S. The status should be changed to 'postponed' or 'by design'.
Comment #14
mikeryan
The value of browscap can be checked with ini_get('browscap'). The php.ini documentation does say browscap can't be set in .htaccess. It claims the value should be settable from httpd.conf, but I had no luck with that either.
I'm going to leave it open for now - many (if not most) Drupal users are in hosted situations like yours where they can't use browscap, which obviously limits the use of this module. I'm playing around with importing the .csv version of browscap into a table; if I'm feeling ambitious enough I may make a "browscap" module which would automatically fetch and import the data weekly, and provide an API for statistics_filter (and other modules) to use.
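For anyone who wants to run that check as a quick standalone script, a minimal sketch follows. describe_browscap_setting() is a made-up helper name, not something the module provides; only ini_get() and is_readable() are real PHP calls.

```php
<?php
// Hypothetical helper: turn the raw browscap setting into a readable message.
function describe_browscap_setting($value) {
  // ini_get() yields false or an empty string when nothing usable is set.
  if ($value === false || $value === null || $value === '') {
    return 'browscap is not set';
  }
  if (!is_readable($value)) {
    return "browscap is set to '$value', but that file is not readable";
  }
  return "browscap is set to '$value'";
}

// Report what PHP itself sees, regardless of what .htaccess asked for.
echo describe_browscap_setting(ini_get('browscap')), "\n";
```

If this reports "not set" even though .htaccess contains a browscap line, the setting is being ignored at that level.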
Thanks for all your feedback!
Comment #15
mikeryan
I've submitted a new module, browscap, to handle the browser data without requiring you to do any PHP configuration. I've also committed a new version of statistics_filter which uses the browscap module's services (and moved the browser tracking from statistics_filter to browscap). Please try this new combo and let me know whether it works for your purposes.
Comment #16
mikeryan
Browscap.ini configuration issues addressed by introduction of browscap module.