Browser strings not properly saved to a database
| Project: | Browscap |
| Version: | 6.x-1.x-dev |
| Component: | Code |
| Category: | bug report |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | active |
Jump to:
If a browser string contains a single backslash, as in my example:
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; InfoPath.1; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648; TE\x8a; TE\x8a)'
it seems it is not properly saved into the database, and the error is reported:
Duplicate entry 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.' for key 1 query: INSERT INTO browscap_statistics (parent,counter,is_crawler) VALUES('Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; InfoPath.1; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648; TE
It is obvious that something went wrong starting from the location of the first backslash... I presume it is due to the MySQL query, which in this case would require a single backslash to be escaped by additional backslash.
Has anyone an idea about:
1. How to "adapt" a string to be appropriate for the MySQL query (and, properly saved into the MySQL database) - I think all occurencies of backslash should be replaced by 2 backslashes (or, is it not that "simple"?). If yes, what would be the simplest code to do that (unfortunately I have extremely few experience in PHP coding, and it probably doesn't make sense to reinvent the wheel...)?
2. Where in the module code, the string has to be "corrected" (I've found some backslashes in useragent column of browscap table, so I presume there is no need for that at the import stage, but only for new records during visits) - is it just the $browserstring around line 96 or also $browserstring around line 364?
Any help is appreciated.

#1
I've tried reproducing this (setting Safari's browser string to yours above) and had no problem. The code first tries updating the current row for the browser string, then inserting a new row for it if nothing is updated - it's using the same $browserstring both times, so I can't see how the update would fail and then the insert would come back with a duplicate key. Well, it could theoretically happen once if someone with the same browser string hit the site at precisely the right moment, but it wouldn't be reproducible (this would be highly unlikely).
Can you still reproduce this problem? If so, can you tell me what version of MySQL you're using? Also, it would really help if you could install the devel module and enable Collect query info and Display query log, to see the specific queries being passed to MySQL.
Thanks.
#2
Changing status...
#3
Unfortunately, it seems I can not reproduce the error.
I've tried with "tamper data" plugin in Firefox 2, but it seems (according to what I've found in Apache access log) it translates the entered string:
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; InfoPath.1; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648; TE\x8a; TE\x8a)'
into:
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; InfoPath.1; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648; TE\\x8a; TE\\x8a)'
(please note the double backslashes!).
With double backslashes, there is of course no problem... Are you sure your Safari has sent the browser string unmodified? Maybe we are dealing with some strange client (in my case) who is sending an invalid browser string out? But, also in this case, such weirdnesses should be handled without an error (algorythm crash) by my opinion. Maybe some note could be taken, but in any case we should somehow register such client, and count it as any other in statistics.
I also have to say that I can't find now this browser being logged (I've tried to look in browscap table and in browscap_statistics table for a backslash, but I couldn't find any such string which would look like the browser string above; there were 3 entries containing a backslash in browscap table, but something totally different...). So, it was not logged, as it looks...
I've also tried to put just 'TEST-1' as a browser string. This was logged (and, is displayed in browscap logs at the end of "All user agents" list, but with count = 0 (even if I've visited some pages using it...).
I am using the latest 5.x-1.x-dev (2008-Jun-04) version.
#4
Issues without concrete steps to repeat for so long go to closed.
I tried:
wget http://6d.local/ -o /dev/null -U mon\\keypantsAnd in my http://6d.local/admin/reports/browscap I see "mon\keypants" listed as a user agent.
If you can provide additional information to repeat the problem please re-open.
#5
I've finally found the way to manually reproduce this situation.
You need wget for Windows (it doesn't work in Linux, I am always getting no backslash or double backslash there...) - I've tested it with version 1.5.3.1 (unpack just wget.exe).
In Windows (I've tested it with XP; maybe in some other versions it works differently, and you will not get a single backslash in user-agent string...) open Command Prompt in the folder, where you've extracted wget.exe, and type:
wget -U "Test Agent: TEŠ" http://your-web-server-urlYou shall find in your server log a new line containing - among other stuff - also the string:
"Test Agent: TE\x8a"Checking browscap logs, you shall find only:
"Test Agent: TE"So, in some circumstances, Slavic letter Š (Scaron) is obviously transmitted within User-Agent string as \x8a. I presume it should not be like that (according to some RFCs), but unfortunately it seems to be a case in some environments and for sure it would be better that the complete string is logged and displayed in browscap (so at least we can distinguish such letters, if there would be more...).
If you believe it is only my issue, you can find quite some cases of such user-agent strings - just google for "TE\x8a" (including quotes); here are just some of the URLs found:
http://www.odskodnine.info/stats/usage_200602.html
http://www.bager.org/statistika/agent_200803.html
http://www.e-prospekt.com/log/agent_200807.html
http://www.ews.uiuc.edu/bstats/weeks/071028-week.html
http://www.ljudmila.org/webalizer/www.creativecommons.si/agent_200701.html
(search for x8a within each page...)
I have Englesh Windows XP with all latest updates installed and Slovene characters / keyboard set (if you will get some other results...).
For my issue (Š => \x8a), I've added a couple of lines in my module:
--- /browscap/browscap.module.ORIGINAL 2008-08-13 04:49:43.000000000 +0200+++ /browscap/browscap.module 2009-09-18 12:11:45.000000000 +0200
@@ -91,6 +91,11 @@ function browscap_exit() {
if ($browserstring == '' or $browserstring == 'Default Browser') {
$browserstring = trim($_SERVER['HTTP_USER_AGENT']);
}
+
+ if (substr_count($browserstring, "\x8a")) {
+ $browserstring = str_replace("\x8a", '\\x8a', $browserstring);
+ }
+
db_query("UPDATE {browscap_statistics} SET counter = counter + 1, is_crawler=%d ".
"WHERE parent='%s'", $browser['crawler'], $browserstring);
// If we affected 0 rows, this is the first time we've seen this browser
which seems to resolve my issue (I am getting now occasionally in browscap some logs with complete strings displayed...). But, it is not a general solution to replace single backslashes with double ones (to be properly saved in mysql), but I am not good enough in regular expressions and php coding... :-(
Could this be done in a better (more general) way?
#6
What is the webserver where the problem occurs? Does the problem only occur on the windows webserver or is it the same webserver just client requests are made from a windows machine?
#7
Web server is Apache (httpd-2.2.3-22.el5.centos.2) at CentOS 5.3 linux; everything else needed (mysql-server-5.0.77-3.el5, php-5.1.6-23.el5_3 ...) is at the same machine. But I guess it doesn't matter - probably Drupal at some Winblow$ machine would break as well, as underlying MySQL can not handle a single backslash properly...
wget client requests from linux machine (tested from server itself) are logged with double backslashes (as user-agent string should be provided, I guess), while wget client requests from win xp (as above) are logged (httpd logs) with a problematic single backslash.
Firefox 3.5 requests from the same win xp machine (setting general.useragent.extra.firefox value in about:config) provide double backslashes. So, probably win xp wget is buggy (as probably some other clients as well), but still - I believe Drupal shall not break (with MySQL error) in any case, but handle such exceptions somehow better...
#8
After digging a bit more, my conclusion is that in some cases user agent strings are sent out in some non-US codepage (in my case, it was Windows 1250), and some specific characters contained are in this case handled by Apache on linux with their hex values. So, probably we can find many different hex values in the range \x80 to \xFF within user-agent strings in such cases.
I've adapted my browscap module to be more (generally) sustainable to such exceptions:
--- /browscap/browscap.module.ORIGINAL 2009-09-30 19:16:40.000000000 +0200+++ /browscap/browscap.module 2009-10-02 11:37:45.000000000 +0200
@@ -86,6 +86,58 @@ function browscap_exit() {
if ($browserstring == '' or $browserstring == 'Default Browser') {
$browserstring = trim($_SERVER['HTTP_USER_AGENT']);
}
+
+// Replace all improperly encoded HEX characters (127-255)
+// with MySQL safe double-quoted HEX representations:
+ $pattern = array(
+ "\x80", "\x81", "\x82", "\x83", "\x84", "\x85", "\x86", "\x87",
+ "\x88", "\x89", "\x8a", "\x8b", "\x8c", "\x8d", "\x8e", "\x8f",
+ "\x90", "\x91", "\x92", "\x93", "\x94", "\x95", "\x96", "\x97",
+ "\x98", "\x99", "\x9a", "\x9b", "\x9c", "\x9d", "\x9e", "\x9f",
+ "\xa0", "\xa1", "\xa2", "\xa3", "\xa4", "\xa5", "\xa6", "\xa7",
+ "\xa8", "\xa9", "\xaa", "\xab", "\xac", "\xad", "\xae", "\xaf",
+ "\xb0", "\xb1", "\xb2", "\xb3", "\xb4", "\xb5", "\xb6", "\xb7",
+ "\xb8", "\xb9", "\xba", "\xbb", "\xbc", "\xbd", "\xbe", "\xbf",
+ "\xc0", "\xc1", "\xc2", "\xc3", "\xc4", "\xc5", "\xc6", "\xc7",
+ "\xc8", "\xc9", "\xca", "\xcb", "\xcc", "\xcd", "\xce", "\xcf",
+ "\xd0", "\xd1", "\xd2", "\xd3", "\xd4", "\xd5", "\xd6", "\xd7",
+ "\xd8", "\xd9", "\xda", "\xdb", "\xdc", "\xdd", "\xde", "\xdf",
+ "\xe0", "\xe1", "\xe2", "\xe3", "\xe4", "\xe5", "\xe6", "\xe7",
+ "\xe8", "\xe9", "\xea", "\xeb", "\xec", "\xed", "\xee", "\xef",
+ "\xf0", "\xf1", "\xf2", "\xf3", "\xf4", "\xf5", "\xf6", "\xf7",
+ "\xf8", "\xf9", "\xfa", "\xfb", "\xfc", "\xfd", "\xfe", "\xff",
+ );
+
+ $replace = array(
+ '\\x80', '\\x81', '\\x82', '\\x83', '\\x84', '\\x85', '\\x86', '\\x87',
+ '\\x88', '\\x89', '\\x8a', '\\x8b', '\\x8c', '\\x8d', '\\x8e', '\\x8f',
+ '\\x90', '\\x91', '\\x92', '\\x93', '\\x94', '\\x95', '\\x96', '\\x97',
+ '\\x98', '\\x99', '\\x9a', '\\x9b', '\\x9c', '\\x9d', '\\x9e', '\\x9f',
+ '\\xa0', '\\xa1', '\\xa2', '\\xa3', '\\xa4', '\\xa5', '\\xa6', '\\xa7',
+ '\\xa8', '\\xa9', '\\xaa', '\\xab', '\\xac', '\\xad', '\\xae', '\\xaf',
+ '\\xb0', '\\xb1', '\\xb2', '\\xb3', '\\xb4', '\\xb5', '\\xb6', '\\xb7',
+ '\\xb8', '\\xb9', '\\xba', '\\xbb', '\\xbc', '\\xbd', '\\xbe', '\\xbf',
+ '\\xc0', '\\xc1', '\\xc2', '\\xc3', '\\xc4', '\\xc5', '\\xc6', '\\xc7',
+ '\\xc8', '\\xc9', '\\xca', '\\xcb', '\\xcc', '\\xcd', '\\xce', '\\xcf',
+ '\\xd0', '\\xd1', '\\xd2', '\\xd3', '\\xd4', '\\xd5', '\\xd6', '\\xd7',
+ '\\xd8', '\\xd9', '\\xda', '\\xdb', '\\xdc', '\\xdd', '\\xde', '\\xdf',
+ '\\xe0', '\\xe1', '\\xe2', '\\xe3', '\\xe4', '\\xe5', '\\xe6', '\\xe7',
+ '\\xe8', '\\xe9', '\\xea', '\\xeb', '\\xec', '\\xed', '\\xee', '\\xef',
+ '\\xf0', '\\xf1', '\\xf2', '\\xf3', '\\xf4', '\\xf5', '\\xf6', '\\xf7',
+ '\\xf8', '\\xf9', '\\xfa', '\\xfb', '\\xfc', '\\xfd', '\\xfe', '\\xff',
+ );
+
+ $browserstring = str_replace($pattern, $replace, $browserstring);
+
db_query("UPDATE {browscap_statistics} SET counter = counter + 1, is_crawler=%d ".
"WHERE parent='%s'", $browser['crawler'], $browserstring);
// If we affected 0 rows, this is the first time we've seen this browser
Seems to work for me. Would it be possible to make a more simple replacement with preg_replace() somehow? I've tried to search for '/([\x80-\xFF])/', but don't know how to push an extra backslash to it, to achieve replacements as above...