Optimize check_plain()
catch - July 18, 2009 - 00:54
| Project: | Drupal |
| Version: | 6.x-dev |
| Component: | base system |
| Category: | bug report |
| Priority: | critical |
| Assigned: | Unassigned |
| Status: | needs review |
| Issue tags: | needs backport to D6, Performance |
Description
check_plain() spends 6% of page execution time in self, and 6% in drupal_validate_utf8(), nearly all of this time is just the expense of calling functions thousands of times, not what they're actually doing. Also drupal_validate_utf8() only exists because IE6 is stupid.
By moving the same preg_match() into check_plain(), we save about 50% of the time spent in profiling, this should be backportable.
| Attachment | Size | Status | Test result | Operations |
|---|---|---|---|---|
| ie6diediedieyoufucker.patch | 630 bytes | Idle | Failed: 13247 passes, 3 fails, 0 exceptions | View details | Re-test |
| head.png | 158.04 KB | Ignored | None | None |
| patch.png | 237.6 KB | Ignored | None | None |

#1
whoops.
#2
Hm. drupal_validate_utf8() additionally checked for empty()... why not here?
Also, even after some discussion in IRC, I'm still not sure whether we can't entirely drop that IE6 support function. See also #308865: Drop IE6 support in Drupal core.
#3
Additional insight for why this function exists: https://security.drupal.org/node/116
I'd propose to re-think whether we need this entire check at all. (See also the issue I linked to in my previous follow-up)
#4
Added the empty(). We should consider dropping at least some aspects of IE6 support in HEAD I think, but we can also backport this to 5/6.
Still twice as fast, this morning check_plain() is taking 6% / 3% instead of 12% / 6% but still good results.
#5
More sober title.
#6
empty() will return TURE for the string '0' as well as various non-strings. Are you sure you want to use that vs. the strlen==0 check?
You could also cast to a string and compare to the empty string, and return the cast variable.
#7
Just checked HEAD and we get between 1-3 empty strings per page, so how about just removing that altogether?
#8
@catch- makes sense to me - if there are that few we are not gaining anything since it costs lots of extra function calls.
#9
We should append a @see drupal_validate_utf8() to the PHPDoc though, and perhaps additionally explain why we are duplicating code.
Speaking of duplication, shouldn't we apply the same to filter_xss() (the second instance) ? So we'd keep drupal_validate_utf8() for (rather special) contrib modules that need it only?
#10
Added the phpdoc, did similar in filter_xss() and removed the strlen from drupal_validate_utf8().
#11
Yay! Thanks for updating!
+ * valid UTF-8. @see drupal_validate_utf8().(multiple occurrences) @see should always be on a separate line.
function check_plain($text) {- return drupal_validate_utf8($text) ? htmlspecialchars($text, ENT_QUOTES) : '';
+ // Validate strings as UTF-8 to prevent cross site scripting attacks on
+ // Internet Explorer 6. We duplicate the same preg_match() here to avoid
+ // additional function calls, since check_plain() is often called hundreds of
+ // times per request.
+ return (preg_match('/^./us', $text) == 1) ? htmlspecialchars($text, ENT_QUOTES) : '';
This lengthy comment should rather go into the PHPDoc.
#12
The last submitted patch failed testing.
#13
Fixed the notice and did some more comment tidying.
#14
Looks good to me - nice to have to validated performance improvements.
#15
The only problem with removing the strlen from drupal_validate_utf8 is that it changes the result for empty string. An empty string is a valid utf8 string, whereas the regex requires at least one character.
#16
Hmm, I'm wondering if we could switch the order - so if the preg_match() returns TRUE we're done, if it doesn't, then we do the strlen check and return true if it's an empty string? It's not a problem in check_plain() because we return an empty string anyway if the preg_match() fails. Or we could just delete that hunk from the patch since drupal_validate_utf8() isn't in the critical path anyway.
#17
strlen() should be lightening fast. What about benchmarking with and without that check?
#18
We only need to duplicate the preg_match() in check_plain() since that's in the critical path, so removed other changes.
#19
#20
This was already RTBC. It duplicates a small amount of code (referencing the original) in a critical path which reduces page load by a noticeable amount. Sounds good to me.
#21
So Moshe pointed out that since xdebug has to measure function calls, it's not the best way to measure performance improvements gained by removing function calls, which was a good point. Beyond this, I also found via micro-benchmarking that having xdebug extension enabled at all (i.e. xdebug on, profiling off) adds overhead to function calls.
So here's micro-benchmarks for before/after check_plain() to verify the speed-up in check_plain() itself. I've attached the benchmarking script to this issue so you can run it locally and/or see the logic:
(three runs, 2,000,000 iterations of each loop).
catch@catch-laptop:~/www/7$ php check_plain.phpnothing: 0.467885017395 seconds
function no_op(): 1.22173714638 seconds
check plain old: 8.99900794029 seconds
check plain new: 5.9074318409 seconds
catch@catch-laptop:~/www/7$ php check_plain.php
nothing: 0.499942064285 seconds
function no_op(): 1.27891206741 seconds
check plain old: 8.88757514954 seconds
check plain new: 5.72876691818 seconds
catch@catch-laptop:~/www/7$ php check_plain.php
nothing: 0.451987028122 seconds
function no_op(): 1.2409799099 seconds
check plain old: 8.81144690514 seconds
check plain new: 5.80533409119 seconds
With xdebug enabled:
nothing: 0.719228029251 secondsfunction no_op(): 6.70149278641 seconds
check plain old: 38.8574278355 seconds
check plain new: 24.8503789902 seconds
Actually in this case, the relative speed up in check_plain() is the same, it's just the absolute times which are different. Either way, it validates this patch, but I'm hoping to figure out how best to verify these changes on some of the other issues.
#22
Forgot to attach the test script.
#23
And just for completeness, here's numbers for 10,000 iterations - which it's possible to get in HEAD on some pages(or at least the webgrind screenshots had 8,900).
nothing: 0.0027801990509 seconds
function no_op(): 0.0087149143219 seconds
check plain old: 0.0544929504395 seconds
check plain new: 0.037360906601 seconds
catch@catch-laptop:~/www/7$ php check_plain.php
nothing: 0.00294899940491 seconds
function no_op(): 0.00863099098206 seconds
check plain old: 0.0572259426117 seconds
check plain new: 0.0334417819977 seconds
catch@catch-laptop:~/www/7$ php check_plain.php
nothing: 0.00277781486511 seconds
function no_op(): 0.00840902328491 seconds
check plain old: 0.0663850307465 seconds
check plain new: 0.0283868312836 seconds
That's a saving of around 25-30ms.
#24
+++ includes/bootstrap.inc 18 Jul 2009 19:54:35 -0000@@ -1024,11 +1024,21 @@ function drupal_unpack($obj, $field = 'd
+ * check_plain() also validates strings as UTF-8 to prevent cross site scripting
...slightly exceeds 80 chars.
Beer-o-mania starts in 22 days! Don't drink and patch.
#25
It only does that if you include the + at the beginning of the line, otherwise it's exactly 80 characters ;)
#26
In PHP 5.2.5 and newer, htmlspecialchars() automatically handles partial multibyte sequences (according to the release notes).
The following approach reduces the running time of check_plain() even further (approx. 11% AFAICT) when using a PHP 5.2.5 or later. With older versions, it is about as fast as than the current implementation.
<?phpfunction check_plain($text) {
static $new_php;
if (!isset($new_php)) {
$new_php = version_compare(PHP_VERSION, '5.2.5', '>=');
}
return ($new_php || preg_match('/^./us', $text) == 1) ? htmlspecialchars($text, ENT_QUOTES, 'UTF-8') : '';
}
?>
#27
Let's prove this.
NOTE: As of now, this is NOT the patch that is RTBC.
#28
The last submitted patch failed testing.
#29
Complete removal of check_plain(): (sorry, my machine is slow)
HEAD without patch:
# ab -c 2 -n 100 http://drupal.test/
Document Path: /
Document Length: 10865 bytes
Concurrency Level: 2
Time taken for tests: 38.092 seconds
Complete requests: 100
Failed requests: 0
Write errors: 0
Total transferred: 1128800 bytes
HTML transferred: 1086500 bytes
Requests per second: 2.63 [#/sec] (mean)
Time per request: 761.836 [ms] (mean)
Time per request: 380.918 [ms] (mean, across all concurrent requests)
Transfer rate: 28.94 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 13 11.7 16 62
Processing: 547 744 198.3 625 1641
Waiting: 531 728 197.8 609 1625
Total: 562 757 201.3 656 1656
Percentage of the requests served within a certain time (ms)
50% 656
66% 828
75% 844
80% 844
90% 937
95% 1281
98% 1422
99% 1656
100% 1656 (longest request)
Patch applied:
# ab -c 2 -n 100 http://drupal.test/
Document Path: /
Document Length: 10865 bytes
Concurrency Level: 2
Time taken for tests: 35.529 seconds
Complete requests: 100
Failed requests: 0
Write errors: 0
Total transferred: 1128800 bytes
HTML transferred: 1086500 bytes
Requests per second: 2.81 [#/sec] (mean)
Time per request: 710.589 [ms] (mean)
Time per request: 355.294 [ms] (mean, across all concurrent requests)
Transfer rate: 31.03 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 13 9.3 16 31
Processing: 547 694 146.5 687 1484
Waiting: 531 678 146.5 672 1469
Total: 562 707 145.7 687 1500
Percentage of the requests served within a certain time (ms)
50% 687
66% 750
75% 766
80% 766
90% 859
95% 937
98% 1344
99% 1500
100% 1500 (longest request)
EDIT: I mistakenly copied the benchmarks in the wrong order :-/
#30
And of course, here is the updated patch.
#31
The last submitted patch failed testing.
#32
Using a concurrency of 1:
HEAD:
# ab -c1 -n 100 http://drupal.test/
Document Path: /
Document Length: 10865 bytes
Concurrency Level: 1
Time taken for tests: 49.904 seconds
Complete requests: 100
Failed requests: 0
Write errors: 0
Total transferred: 1128800 bytes
HTML transferred: 1086500 bytes
Requests per second: 2.00 [#/sec] (mean)
Time per request: 499.037 [ms] (mean)
Time per request: 499.037 [ms] (mean, across all concurrent requests)
Transfer rate: 22.09 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 10 7.7 16 16
Processing: 453 487 48.1 469 734
Waiting: 437 471 48.4 453 719
Total: 469 497 48.5 484 734
Percentage of the requests served within a certain time (ms)
50% 484
66% 484
75% 500
80% 500
90% 562
95% 609
98% 687
99% 734
100% 734 (longest request)
Patch:
Document Path: /
Document Length: 10865 bytes
Concurrency Level: 1
Time taken for tests: 48.169 seconds
Complete requests: 100
Failed requests: 0
Write errors: 0
Total transferred: 1128800 bytes
HTML transferred: 1086500 bytes
Requests per second: 2.08 [#/sec] (mean)
Time per request: 481.694 [ms] (mean)
Time per request: 481.694 [ms] (mean, across all concurrent requests)
Transfer rate: 22.88 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 10 7.7 16 31
Processing: 453 470 41.3 469 766
Waiting: 437 454 41.4 437 750
Total: 453 480 42.3 469 781
Percentage of the requests served within a certain time (ms)
50% 469
66% 469
75% 484
80% 484
90% 484
95% 531
98% 719
99% 781
100% 781 (longest request)
#33
Please don't mark patches RTBC until they're reviewed and tested.
#34
The last submitted patch failed testing.
#35
#26 does make a lot of sense to me, but I would like to see that tested. I would not trust PHP to watch my food ;)
#36
@Damz: Here is the output from my system testing the code from #26 in the script provided on #22.
nothing: 0.342183113098 seconds
function no_op(): 2.46541881561 seconds
check plain old: 14.3335449696 seconds
check plain new: 9.67884707451 seconds
check plain new 5.2.5: 6.89955997467 seconds
#37
I meant "test that htmlentities() does the correct thing regarding broken UTF-8 sequences".
#38
htmlspecialchars assumes a default charset of ISO-8859-1 where no multibyte codes exist. In ISO-8859-1 mode, htmlspecialchars does not and cannot care about "invalid" multi-byte sequences.
Drupal strings are UTF-8; You must pass 'UTF-8' as the charset argument to get proper behaviour.
htmlspecialchars("Foo\xC0barbaz", ENT_QUOTES, 'UTF-8'); should return an empty string.
#39
which patch is under consideration? #18? Maybe we should just get that in and have any further discussion to post-freeze.
The code in #26 (no patch) seems to me to be as far as we would want to go in terms of taking advantage of PHP 5.2.5+. We should not require it.
#40
htmlspecialchars("Foo\xC0barbaz", ENT_QUOTES, 'UTF-8');That looks plenty of reason to keep a separate check_plain() wrapper. Since this is all internals, I don't see any reason why it couldn't go in after freeze as a performance improvement (and also be backported to D6 once we settle on one). One idea I had is maybe settings a constant for PHP version then checking that rather than the static - not sure how that compares in terms of speed.
#41
Yes, that was also my gut feeling when I replaced all instances of check_plain() with htmlspecialchars() directly. That won't work out for the "average developer and themer"... Additionally, it seems that the performance gain of that approach is minimal. So let's get back to the BC approach in #26.
Benchmarks, anyone?
#42
What about dynamically defining check_plain(), depending on the result of the PHP version check? This results in the fastest possible implementation for suitable PHP versions, and a slower one for the rest.
#43
I'm a big fan of #42. Defining the function so that we don't have to check the static on each call should save us even more time overall.
#44
#42 omits the basic optimization of duplicating the preg_match
#45
Inlined drupal_validate_utf8() as per #44.
#46
In the doxygen, "Uses drupal_validate_utf8()" is not really true. Maybe "uses the same method..."?
Also, the conditional define is abnormal for Drupal core, so it might be worth explaining in the comments why is doesn't look like http://api.drupal.org/api/function/drupal_clone/6 In oter words, this function is called so often for a non-cached page that any sub-calls add significant overhead and we thus put the preg code there directly adn conditionally define the function to avoid doing version_compare each time it's called.
#47
From #18
+ * check_plain() also validates strings as UTF-8 to prevent cross site scripting
+ * attacks on Internet Explorer 6. We duplicate the preg_match() from
+ * drupal_validate_utf8() here rather than calling the function to avoid
+ * the overhead of an additional function call, since check_plain() may be
+ * called hundreds of times during a request.
#48
This needs to be benchmarked. As far as I know, conditionally defined functions cannot be cached by most (if not all) op-code caches. I personally would recommend the static variable approach.
#49
hmm, well we certainly want the function to be cacheable in the opcode.
If we use a static, it should be a normal static (not use drupal_static()) since it doesn't make sense to me that the PHP version would ever change while the code is executing.
#50
ok, more like #18
#51
Heine points out that I need === to distinguish FALSE from NULL
#52
#51 appears to be about 6% faster than #26, and 33% faster than #18
nothing: 0.330275058746 seconds
function no_op(): 2.31866407394 seconds
check plain old: 14.1262290478 seconds
check plain new: 9.65815281868 seconds
check plain new 5.2.5: 6.80544900894 seconds
check plain new 5.2.5 pwolanin: 6.36292600632 seconds
It would be worthwhile to add #38 as a SimpleTest to ensure that the intended functionality remains intact.
#53
+++ includes/bootstrap.inc@@ -1120,11 +1120,35 @@ function drupal_unpack($obj, $field = 'data') {
+ static $php525 = NULL;
+
+ if ($php525 === NULL) {
Just use !isset() like everywhere else in core (and remove the = NULL from the static).
(Also note trailing white-space here)
+++ includes/bootstrap.inc@@ -1120,11 +1120,35 @@ function drupal_unpack($obj, $field = 'data') {
+ $php525 = version_compare(PHP_VERSION, '5.2.5', '>=');
...
+ if ($php525) {
An additional cast to Boolean may speed up the further tests of the variable.
Otherwise, looks good.
This review is powered by Dreditor.
#54
@sun - I assumed === might be faster than isset(), but I can do it that way.
The return value of version_compare() will be a boolean.
#55
minor fixes per sun, plus a simple unit test case.
#56
nothing: 0.338335990906 seconds
function no_op(): 2.31378507614 seconds
check plain old: 14.4185819626 seconds
check plain new: 9.58456897736 seconds
check plain new 5.2.5: 6.64569997787 seconds
check plain new 5.2.5 pwolanin: 6.31802201271 seconds
check plain new 5.2.5 pwolanin 2nd edition: 6.39244294167 seconds
About 1% slower than #51. Thankfully, isset isn't a function. :)
Thanks pwolanin! Let's get this in soon.
#57
Marking #55 RTBC, wonderful job everyone.
#58
+
#59
Added an explicit @todo to drop the ugly when we drop support for either IE6 or PHP5.2.5
#60
I support this approach, but most of the new PHPdoc at the top of the function needs to be moved (back) into code. The reader of the API documentation doesn't (and shouldn't care) about the fact that we inlined some code. Let's do a quick re-roll.
#61
Moved the docs inline. It's a pretty short function, so I just left it as a single block of documentation, no easy way to split it up.
#62
Can we have more tests? With overlong three byte encoding for example?
#63
moved the bulk of the comments to the function body.
#64
some possible test cases can be derived from http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
#65
@Heine, @pwolanin: Is this what you were thinking?
<?php$text = check_plain("\xE0\x80\xAF");
$this->assertEqual($text, '', 'check_plain() rejects invalid sequence "\xE0\x80\xAF"');
?>
Note:
- \x2F succeeds as expected.
- \xC0\x2F fails as expected.
- \xC0\xAF and \xE0\x80\xAF fail in the preg_match as expected, but unexpectedly does not fail in the htmlspecialchars.
The behavior for these 4 test cases is identical in 5.1.6 and 5.2.10, except that in 5.1.6 a PHP Warning is generated for \xC0\x2F from htmlspecialchars.
Perhaps I am doing something wrong; I would appreciate it if someone who knows more about UTF-8 than I do could let me know if this is correct.
#66
We shoudl build a little bigger test set - but basically, if the PHP internals don't handle this as robustly as the preg_match, we should just go back to the simpler earlier patch that doesn't try to rely on PHP version.
#67
The last patch here doing that was #18 - http://drupal.org/node/523058#comment-1842404
But that probably needs the comments in-lining per Dries review (since the comments are much the same).
I think we should probably go ahead and commit that, and backport to Drupal 6 too, since there's nothing breakable there. Then that still gives us a window to further optimize if we find away around the PHP wonkiness.
#68
I am no expert on this, but here is my understand of this:
AFAICT htmlspecialchars() only rejects partial multibyte sequences followed by a byte < 0x80 (or the end of the string). I think this is fine - the UTF-8 exploit is only relevant if one of the few characters that would otherwise be escaped by htmlspecialchars() (i.e.
" < > & ') is "hidden" by a partial multibyte sequence. These few characters all have character codes < 0x80. In other words, htmlspecialchars() does not block all invalid UTF-8 strings, but it does block the dangerous ones.Here is the change that was made to htmlspecialchars() in PHP 5.2.5:
http://svn.php.net/viewvc/php/php-src/branches/PHP_5_2/ext/standard/html...
#69
Thanks for the link to the diff. So, it seems like we can proceed as long as we are selective about picking test cases? Or would we rather pay the performance penalty to have more consistent drupal behavior across PHP version? I could imagine see this leading to some hard-to-debug problems.
#70
I'd be tempted to say commit the slightly slower version - at leastso we can do a straight backport to Drupal 6 for consistency there. Then consider changing this in Drupal 8 (even if we don't require 5.2.6 for D8 it seems reasonable that most servers will actually be running it by then).
For me that means re-rolling #18 with the documentation moved inline + a @todo for the PHP 5.2.6 version + leaving this open once we've done the backport so we can discuss more for Drupal 7 or push it to Drupal 8.
#71
Given the grave performance implications here and the need to backport ASAP I would strongly recommend committing the latest patch and building the test case later. This is an exception to a rule I know but it has good reason.
#72
@c960657: Thank you for the explanation.
+1 to #63 because it doesn't alter the intent of the preg_match, and has significant performance benefits.
#73
so?
#74
@scroogie: I'm pretty sure this is the point where more people review and approve the code or it gets committed. It takes a lot of patience sometimes. :)
#75
Given the last comments, it's entirely not clear to me which patch should be committed to D7. Please re-attach.
#76
This one is the one which was marked RTBC most recently.
#77
The last submitted patch failed testing.
#78
Tests failed.
#79
#80
#81
Would really be nice to see this committed, since the slowness of check_plain() also adds overhead to all the functions which call it, which makes it harder to track down as-yet-unfound bottlenecks when profiling.
Re-posting jrchamp's microbenchmarks, which are the most recent:
nothing: 0.338335990906 seconds
function no_op(): 2.31378507614 seconds
check plain old: 14.4185819626 seconds
check plain new: 9.58456897736 seconds
check plain new 5.2.5: 6.64569997787 seconds
check plain new 5.2.5 pwolanin: 6.31802201271 seconds
check plain new 5.2.5 pwolanin 2nd edition: 6.39244294167 seconds
The last result is for the current patch.
As you can see, this is more than twice as fast as the implementation in HEAD, and the RTBC patch is itself a third faster than the original patch on this issue, making it worth the static for PHP version until we can increase requirements.
I inherited a horrible, horrible site freelancing one time, where the theme was riddled with XSS holes - after explaining to the client that this would need to be cleaned up, they went back to the company which originally build the theme, which came back and explained they never use check_plain() because it 'slows things down'. I'm sure that's not an isolated case of such stupidity.
The only slight issue here is how much we trust the htmlspecialchars() fix in PHP 5.2.5 - c960657 has shown that it's secure, but not as restrictive as our original preg_match(). If we're uneasy about relying on that though, then we have only very slightly slower alternatives, also ready to go, earlier in the issue.
#82
Huh. Somehow I had missed Dries's #60 where he points out that he supports this approach apart from some commenting issues. It looks like those comments have been addressed, along with nicely expanded test coverage. Additionally, it looks like this speeds things up quite a bit, according to the microbenchmarks.
Therefore, committed to HEAD. Seems like it might be worth back-porting this to D6?
#83
And here's a backport for d6
#84
Didn't read through whole issue, but does this fix or is aware of:
htmlspecialchars(): Invalid multibyte sequence in argument Warning bootstrap.inc 1204 check_plain()when run on PHP 5.3.