I've run into a situation where characters in a requested path that doesn't exist is causing a PDOException in the core statistics module. Here's the error:

PDOException: in statistics_exit() (line 93 of /path/public_html/modules/statistics/statistics.module).

This error is raised when attempting to add the "page not found" error to my database. I caught the exception and was able to tease out this error message:

SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xC2' for column 'path' at row 1

This is the full URL that includes the path that doesn't exist:

http://www.mysite.org/id-governor%C2

The "%C2" appears to be causing the error.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

popzkg’s picture

the same problem here..

sah62’s picture

Additional info if it helps: the path column in my database table has type=varchar(255), collation=utf8_general_ci.

zyxware’s picture

Version: 7.15 » 7.16
Component: statistics.module » base system

I can confirm that there are still issues related to this on a fresh Drupal 7.16 installation.

When I try to access /id-governor%C2

I get the following warning

Warning: htmlspecialchars(): Invalid multibyte sequence in argument in check_plain() (line 1572 of /home/user/public_html/drupal-7.16/includes/bootstrap.inc).

zyxware’s picture

Title: PDOException: in statistics_exit() » $_GET['q'] not in UTF-8 leading to warning on 404 - htmlspecialchars(): Invalid multibyte sequence in argument in check_plain()

I have more information about this.

When a 404 error occurs $_GET['q'] is still in iso8859-1 and when check_plain is called with $_GET['q'] as a parameter the htmlspecialchars($text, ENT_QUOTES, 'UTF-8') function inside check_plain raises this warning because the passed string is not UTF-8.

Passing $_GET['q'] via drupal_convert_to_utf8 before the call to check_plain fixes the problem but I am not sure if that is the right solution.

For regular page requests that does not result in 404 errors what encoding is $_GET['q'] in D7? UTF-8? If so shouldn't $_GET['q'] be in UTF-8 as well when there is a 404? If not, then the above fix to pass $_GET['q'] via drupal_convert_to_utf8 before calling check_plain should fix the issue.

Damien Tournoud’s picture

Category: bug » support
Status: Active » Closed (works as designed)

This is a classic IIS bug. By default, IIS converts the encoding of the headers before passing them to the FastCGI handler. You need some trickery to make it behave appropriately.

sah62’s picture

I'm using Apache, not IIS. Is it also an Apache bug?

zyxware’s picture

Status: Closed (works as designed) » Active

Not sure if this is an IIS issue. I had tested this issue on a fresh D7.16 install on Apache2 on Ubuntu 11.04. The issue was tested in Firefox 14.0.1

sah62’s picture

Category: support » bug
zyxware’s picture

I have more information on this. I added the following before the call to the check_plain for the $_GET['q'] in MENU_NOT_FOUND case in drupal_deliver_html_page in common.in

$var = $_GET['q'];
for($i = 0; $i < strlen($var); $i++) {
   echo ord($var[$i]) . "-" . $var[$i] . "<br />";
}

$var = $_SERVER['REQUEST_URI'];
for($i = 0; $i < strlen($var); $i++) {
   echo ord($var[$i]) . "-" . $var[$i] . "<br />";
}

echo ":". htmlspecialchars($_GET['q']) . ":<br />";
echo ":". htmlspecialchars($_SERVER['REQUEST_URI']) . ":<br />";
exit;

This is what I get

105-i
100-d
45--
103-g
111-o
118-v
101-e
114-r
110-n
111-o
114-r
194-�
...
47-/
105-i
100-d
45--
103-g
111-o
118-v
101-e
114-r
110-n
111-o
114-r
37-%
67-C
50-2
:id-governor�:
:/~user/drupal7/id-governor%C2:

So in $_GET['q'] the url_unencoded character (chr(194)) is present whereas in $_SERVER['REQUEST_URI'] the original string typed into the browser address bar is present.

superspring’s picture

Status: Active » Needs review
FileSize
953 bytes

%C2 isn't valid UTF-8 as far as I know. This patch attempts to identify invalid unicode characters and fix them where it can. This propogates to fix the database errors (since it will then be valid).

sah62’s picture

The patch eliminates the error on my site. Thanks!

superspring’s picture

Version: 7.16 » 8.x-dev
FileSize
953 bytes

This second patch is applied against D8-dev. The previous patch is the D7-dev backport.

superspring’s picture

Regarding the CP1252 conversion, the following URLs provide some details.
http://en.wikipedia.org/wiki/UTF-8
http://www.php.net/manual/en/function.iconv.php
http://www.linuxquestions.org/questions/linux-software-2/iconv-from-utf-...
They notes a similar encoding so UTF-8 cannot read it without a conversion. CP1252 is used in some Windows file encodings.

This patch is based on GIGO, so only if the URL is otherwise unreadable does it attempt to convert it from this.

This latest patch only contains a spelling correct, no code change.

superspring’s picture

Status: Needs review » Reviewed & tested by the community

To add a note for later:

<Fabianx> superspring: I'd link to these sites from the issue summary and also give a little more explanation of the why there and then mark it RTBC.
superspring’s picture

catch’s picture

Status: Reviewed & tested by the community » Needs work

I don't think we need this comment, the code is self-documenting enough. Quick re-roll to remove?

+    // Oh noes! Invalid UTF-8. Now we fix it.
superspring’s picture

Status: Needs work » Needs review
FileSize
956 bytes

Same as above but with the #16 comment change.

superspring’s picture

" [...] immediately die if the incoming path is not utf-8. Do not match it, do not do anything, just [400] out."

The patch has been modifed to, instead of repairing the string, return a 404 response.

Fabianx’s picture

Issue tags: +Performance

The only possible concern I have with this is performance. I think drupal_normal_path is called quite a lot during a normal drupal requeust, but this should really be only done once for the request path from symfony - if I see this correctly.

It is great work, thanks for the patch, just thinking about this ...

superspring’s picture

Same patch as before but caching has been added to the function.

superspring’s picture

Same patch as above but using caching more efficiently.

sah62’s picture

Modulo minor differences in path.inc will this patch (with the Symfony dependency) work with Drupal 7?

superspring’s picture

Hey @sah62,

Are you asking for a Drupal 7 version of this patch? If so I can write one after/if it gets into Drupal 8.

sah62’s picture

I'm the OP, I'm still running with Drupal 7, and yes, I'm asking about a D7 version of the most recent patch. I'm using the first version of the patch posted above but I wouldn't mind testing the more recent versions before (until?) this makes it into a 7.x core release.

superspring’s picture

Hey @sah62,

I think the patch I first wrote for you is more what you're looking for.
The later patch just gives a 404 error instead of a 500 error.

sah62’s picture

OK, works for me. Will it get committed and be included as part of 7.18?

jthorson’s picture

sah62: It will be added to D7 only after it's added to D8 ... we always patch the newest version first, and then backport to previous major releases. This workflow makes sure we don't introduce regressions by patching D7 and then forgetting to patch D8.

salvis’s picture

Status: Needs review » Needs work
+++ b/core/includes/path.inc
@@ -264,16 +266,31 @@ function drupal_get_path_alias($path = NULL, $langcode = NULL) {
+  if (isset($cache[$path]) && isset($cache[$path][$langcode])) {

The second isset() covers the first one.

This needs tests to be considered for inclusion in core.

(I came here from the nasty [#1906626] which should be fixed by this.)

sah62’s picture

In the hope of getting this moving I just modified the patch from #21 to include the change described in #28. What other tests are needed?

sah62’s picture

Damien Tournoud’s picture

This looks like the wrong place to do this. We need this check up the stack, where we build the current path.

ansebul’s picture

Drupal 7.27 + PostgreSQL 8.4.21

When I try to access /id-governor%C2

I get the following

Warning: htmlspecialchars() [function.htmlspecialchars]: Invalid multibyte sequence in argument in check_plain() (line 1566 of /home/hosting/oooo/htdocs/includes/bootstrap.inc).
PDOException: in drupal_lookup_path() (line 176 of /home/hosting/oooo/htdocs/includes/path.inc).

PostgreSQL log:
postgres[9839]: [9-1] ERROR: invalid byte sequence for encoding "UTF8": 0xc220
postgres[9839]: [9-2] HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".

The patch eliminates the error on my site

cs_shadow’s picture

Status: Needs work » Needs review

Marking this for review so that the testbot can check the patch.

The last submitted patch, 30: invalid_utf8_database_errors-1757488-29.patch, failed testing.

Status: Needs review » Needs work

The last submitted patch, 32: invalid_byte_sequence_for_encoding_UTF8_error-fix.patch, failed testing.

scor’s picture

I tried to reproduce this bug in D8. This exact same error will no longer happen because the access log was removed from the statistics module in #1446956: Remove the accesslog from statistics. However I'm seeing errors in the logs:

Drupal\Core\Database\DatabaseExceptionWrapper:  in Drupal\Core\Path\AliasManager->writeCache() (line 150 of core/lib/Drupal/Core/Path/AliasManager. 

which is this line:

$this->cache->set($this->cacheKey, $path_lookups, REQUEST_TIME + $twenty_four_hours);
sah62’s picture

For whatever it's worth the patch in #17 has been eliminating this error on my D7 site for the last two years. If #36 is correct, and this isn't an issue for D8, can something be done to fix D7?

salvis’s picture

Issue summary: View changes
FileSize
17.89 KB

With #32 D7 on PHP 5.3.10 gives me

even though I've turned messages off (related info).

The bug is not present in D8.

sah62’s picture

Version: 8.0.x-dev » 7.x-dev
Status: Needs work » Needs review
FileSize
984 bytes

I'm moving this back to 7.x-dev because the issue still exists in 7.x and as noted above doesn't exist in 8.x. The patch in #17 consistently eliminates the error on my site so I'm going to pull it in here again and move it back to "needs review".

Status: Needs review » Needs work

The last submitted patch, 39: invalid_utf8_database_errors-1757488-39.patch, failed testing.

sah62’s picture

Status: Needs work » Needs review
FileSize
956 bytes

Let's try creating the patch file again...

Status: Needs review » Needs work

The last submitted patch, 41: invalid_utf8_database_errors-1757488-40.patch, failed testing.

sah62’s picture

Status: Needs work » Needs review
FileSize
956 bytes

Trying the patch again with core 7.42. If anyone sees anything that helps explain why the automated tests failed, please clue me in!

sah62’s picture

Could someone else please review this patch and confirm that it works for you so we can move this to "reviewed and tested by the community"?

salvis’s picture

I'm sorry, I lost track of this...

What you're saying is "if it's not valid UTF-8 then assume it's CP1252 (Windows Latin 1 / Western European) and force it to UTF-8". I suspect that this assumption is not correct.

Moreover, this page explains that not all code points in CP1252 can be converted to UTF-8.

I read RFC 3986 as saying roughly that anything that is not printable 7-bit ASCII must be percent-encoded UTF-8. If drupal_validate_utf8() returns FALSE, all bets are off and the best thing we can do is to generate a 404 error.

I may be wrong though...

sah62’s picture

So how would you change the patch? Would something like this work?

$from=mb_detect_encoding($path);
if ($from) {
  $path = drupal_convert_to_utf8($path, $from);
}
salvis’s picture

I think anything that is not UTF-8 (which is a superset of US-ASCII) is more likely to be malicious than wrongly encoded and we should protect ourselves by aborting with a 404 error.

A browser that uses a non-standard encoding probably causes all sorts of grief to the user and is not likely to stay around for long.

IMHO this issue is just about avoiding the PHP warning/error, not about avoiding the 404.

(See the first note on https://secure.php.net/manual/en/function.mb-detect-encoding.php)

sah62’s picture

Do you have a patch in mind to throw a 404?

asPagurus’s picture

I think patch from #21 is more appropriate for this, but it is for D8.

sah62’s picture

I think patch from #21 is more appropriate for this, but it is for D8.

I just tested a modified version of that patch that works for D7. I'll try to get it posted here soon.

sah62’s picture

Updated patch from #21 for D7 attached.

YPCrumble’s picture

I'm also getting bot traffic with these odd URLs causing PDOException errors. I would love to get #51 integrated into D7 core - is there anything I could do to help with the process?

drilauri’s picture

Bot traffic is causing these constantly on my site as well. I'm going to apply the patch, but would love to see it incorporated into D7 core.