I've run into a situation where characters in a requested path that doesn't exist is causing a PDOException in the core statistics module. Here's the error:
PDOException: in statistics_exit() (line 93 of /path/public_html/modules/statistics/statistics.module).
This error is raised when attempting to add the "page not found" error to my database. I caught the exception and was able to tease out this error message:
SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xC2' for column 'path' at row 1
This is the full URL that includes the path that doesn't exist:
http://www.mysite.org/id-governor%C2
The "%C2" appears to be causing the error.
Comment | File | Size | Author |
---|---|---|---|
#51 | invalid_utf8_database_errors-1757488-51.patch | 1.25 KB | sah62 |
#43 | invalid_utf8_database_errors-1757488-43.patch | 956 bytes | sah62 |
#41 | invalid_utf8_database_errors-1757488-40.patch | 956 bytes | sah62 |
#39 | invalid_utf8_database_errors-1757488-39.patch | 984 bytes | sah62 |
#38 | iconv_notice.png | 17.89 KB | salvis |
Comments
Comment #1
popzkg CreditAttribution: popzkg commentedthe same problem here..
Comment #2
sah62 CreditAttribution: sah62 commentedAdditional info if it helps: the path column in my database table has type=varchar(255), collation=utf8_general_ci.
Comment #3
zyxware CreditAttribution: zyxware commentedI can confirm that there are still issues related to this on a fresh Drupal 7.16 installation.
When I try to access /id-governor%C2
I get the following warning
Warning: htmlspecialchars(): Invalid multibyte sequence in argument in check_plain() (line 1572 of /home/user/public_html/drupal-7.16/includes/bootstrap.inc).
Comment #4
zyxware CreditAttribution: zyxware commentedI have more information about this.
When a 404 error occurs $_GET['q'] is still in iso8859-1 and when check_plain is called with $_GET['q'] as a parameter the htmlspecialchars($text, ENT_QUOTES, 'UTF-8') function inside check_plain raises this warning because the passed string is not UTF-8.
Passing $_GET['q'] via drupal_convert_to_utf8 before the call to check_plain fixes the problem but I am not sure if that is the right solution.
For regular page requests that does not result in 404 errors what encoding is $_GET['q'] in D7? UTF-8? If so shouldn't $_GET['q'] be in UTF-8 as well when there is a 404? If not, then the above fix to pass $_GET['q'] via drupal_convert_to_utf8 before calling check_plain should fix the issue.
Comment #5
Damien Tournoud CreditAttribution: Damien Tournoud commentedThis is a classic IIS bug. By default, IIS converts the encoding of the headers before passing them to the FastCGI handler. You need some trickery to make it behave appropriately.
Comment #6
sah62 CreditAttribution: sah62 commentedI'm using Apache, not IIS. Is it also an Apache bug?
Comment #7
zyxware CreditAttribution: zyxware commentedNot sure if this is an IIS issue. I had tested this issue on a fresh D7.16 install on Apache2 on Ubuntu 11.04. The issue was tested in Firefox 14.0.1
Comment #8
sah62 CreditAttribution: sah62 commentedComment #9
zyxware CreditAttribution: zyxware commentedI have more information on this. I added the following before the call to the check_plain for the $_GET['q'] in MENU_NOT_FOUND case in drupal_deliver_html_page in common.in
This is what I get
So in $_GET['q'] the url_unencoded character (chr(194)) is present whereas in $_SERVER['REQUEST_URI'] the original string typed into the browser address bar is present.
Comment #10
superspring CreditAttribution: superspring commented%C2 isn't valid UTF-8 as far as I know. This patch attempts to identify invalid unicode characters and fix them where it can. This propogates to fix the database errors (since it will then be valid).
Comment #11
sah62 CreditAttribution: sah62 commentedThe patch eliminates the error on my site. Thanks!
Comment #12
superspring CreditAttribution: superspring commentedThis second patch is applied against D8-dev. The previous patch is the D7-dev backport.
Comment #13
superspring CreditAttribution: superspring commentedRegarding the CP1252 conversion, the following URLs provide some details.
http://en.wikipedia.org/wiki/UTF-8
http://www.php.net/manual/en/function.iconv.php
http://www.linuxquestions.org/questions/linux-software-2/iconv-from-utf-...
They notes a similar encoding so UTF-8 cannot read it without a conversion. CP1252 is used in some Windows file encodings.
This patch is based on GIGO, so only if the URL is otherwise unreadable does it attempt to convert it from this.
This latest patch only contains a spelling correct, no code change.
Comment #14
superspring CreditAttribution: superspring commentedTo add a note for later:
Comment #15
superspring CreditAttribution: superspring commentedComment #16
catchI don't think we need this comment, the code is self-documenting enough. Quick re-roll to remove?
Comment #17
superspring CreditAttribution: superspring commentedSame as above but with the #16 comment change.
Comment #18
superspring CreditAttribution: superspring commented" [...] immediately die if the incoming path is not utf-8. Do not match it, do not do anything, just [400] out."
The patch has been modifed to, instead of repairing the string, return a 404 response.
Comment #19
Fabianx CreditAttribution: Fabianx commentedThe only possible concern I have with this is performance. I think drupal_normal_path is called quite a lot during a normal drupal requeust, but this should really be only done once for the request path from symfony - if I see this correctly.
It is great work, thanks for the patch, just thinking about this ...
Comment #20
superspring CreditAttribution: superspring commentedSame patch as before but caching has been added to the function.
Comment #21
superspring CreditAttribution: superspring commentedSame patch as above but using caching more efficiently.
Comment #22
sah62 CreditAttribution: sah62 commentedModulo minor differences in path.inc will this patch (with the Symfony dependency) work with Drupal 7?
Comment #23
superspring CreditAttribution: superspring commentedHey @sah62,
Are you asking for a Drupal 7 version of this patch? If so I can write one after/if it gets into Drupal 8.
Comment #24
sah62 CreditAttribution: sah62 commentedI'm the OP, I'm still running with Drupal 7, and yes, I'm asking about a D7 version of the most recent patch. I'm using the first version of the patch posted above but I wouldn't mind testing the more recent versions before (until?) this makes it into a 7.x core release.
Comment #25
superspring CreditAttribution: superspring commentedHey @sah62,
I think the patch I first wrote for you is more what you're looking for.
The later patch just gives a 404 error instead of a 500 error.
Comment #26
sah62 CreditAttribution: sah62 commentedOK, works for me. Will it get committed and be included as part of 7.18?
Comment #27
jthorson CreditAttribution: jthorson commentedsah62: It will be added to D7 only after it's added to D8 ... we always patch the newest version first, and then backport to previous major releases. This workflow makes sure we don't introduce regressions by patching D7 and then forgetting to patch D8.
Comment #28
salvisThe second isset() covers the first one.
This needs tests to be considered for inclusion in core.
(I came here from the nasty [#1906626] which should be fixed by this.)
Comment #29
sah62 CreditAttribution: sah62 commentedIn the hope of getting this moving I just modified the patch from #21 to include the change described in #28. What other tests are needed?
Comment #30
sah62 CreditAttribution: sah62 commentedComment #31
Damien Tournoud CreditAttribution: Damien Tournoud commentedThis looks like the wrong place to do this. We need this check up the stack, where we build the current path.
Comment #32
ansebul CreditAttribution: ansebul commentedDrupal 7.27 + PostgreSQL 8.4.21
When I try to access /id-governor%C2
I get the following
Warning: htmlspecialchars() [function.htmlspecialchars]: Invalid multibyte sequence in argument in check_plain() (line 1566 of /home/hosting/oooo/htdocs/includes/bootstrap.inc).
PDOException: in drupal_lookup_path() (line 176 of /home/hosting/oooo/htdocs/includes/path.inc).
PostgreSQL log:
postgres[9839]: [9-1] ERROR: invalid byte sequence for encoding "UTF8": 0xc220
postgres[9839]: [9-2] HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
The patch eliminates the error on my site
Comment #33
cs_shadow CreditAttribution: cs_shadow commentedMarking this for review so that the testbot can check the patch.
Comment #36
scor CreditAttribution: scor commentedI tried to reproduce this bug in D8. This exact same error will no longer happen because the access log was removed from the statistics module in #1446956: Remove the accesslog from statistics. However I'm seeing errors in the logs:
which is this line:
Comment #37
sah62 CreditAttribution: sah62 commentedFor whatever it's worth the patch in #17 has been eliminating this error on my D7 site for the last two years. If #36 is correct, and this isn't an issue for D8, can something be done to fix D7?
Comment #38
salvisWith #32 D7 on PHP 5.3.10 gives me
even though I've turned messages off (related info).
The bug is not present in D8.
Comment #39
sah62 CreditAttribution: sah62 commentedI'm moving this back to 7.x-dev because the issue still exists in 7.x and as noted above doesn't exist in 8.x. The patch in #17 consistently eliminates the error on my site so I'm going to pull it in here again and move it back to "needs review".
Comment #41
sah62 CreditAttribution: sah62 commentedLet's try creating the patch file again...
Comment #43
sah62 CreditAttribution: sah62 commentedTrying the patch again with core 7.42. If anyone sees anything that helps explain why the automated tests failed, please clue me in!
Comment #44
sah62 CreditAttribution: sah62 commentedCould someone else please review this patch and confirm that it works for you so we can move this to "reviewed and tested by the community"?
Comment #45
salvisI'm sorry, I lost track of this...
What you're saying is "if it's not valid UTF-8 then assume it's CP1252 (Windows Latin 1 / Western European) and force it to UTF-8". I suspect that this assumption is not correct.
Moreover, this page explains that not all code points in CP1252 can be converted to UTF-8.
I read RFC 3986 as saying roughly that anything that is not printable 7-bit ASCII must be percent-encoded UTF-8. If drupal_validate_utf8() returns FALSE, all bets are off and the best thing we can do is to generate a 404 error.
I may be wrong though...
Comment #46
sah62 CreditAttribution: sah62 commentedSo how would you change the patch? Would something like this work?
Comment #47
salvisI think anything that is not UTF-8 (which is a superset of US-ASCII) is more likely to be malicious than wrongly encoded and we should protect ourselves by aborting with a 404 error.
A browser that uses a non-standard encoding probably causes all sorts of grief to the user and is not likely to stay around for long.
IMHO this issue is just about avoiding the PHP warning/error, not about avoiding the 404.
(See the first note on https://secure.php.net/manual/en/function.mb-detect-encoding.php)
Comment #48
sah62 CreditAttribution: sah62 commentedDo you have a patch in mind to throw a 404?
Comment #49
asPagurus CreditAttribution: asPagurus commentedI think patch from #21 is more appropriate for this, but it is for D8.
Comment #50
sah62 CreditAttribution: sah62 commentedI just tested a modified version of that patch that works for D7. I'll try to get it posted here soon.
Comment #51
sah62 CreditAttribution: sah62 commentedUpdated patch from #21 for D7 attached.
Comment #52
YPCrumble CreditAttribution: YPCrumble commentedI'm also getting bot traffic with these odd URLs causing PDOException errors. I would love to get #51 integrated into D7 core - is there anything I could do to help with the process?
Comment #53
drilauri CreditAttribution: drilauri commentedBot traffic is causing these constantly on my site as well. I'm going to apply the patch, but would love to see it incorporated into D7 core.