check_plain() spends 6% of page execution time in self, and 6% in drupal_validate_utf8(), nearly all of this time is just the expense of calling functions thousands of times, not what they're actually doing. Also drupal_validate_utf8() only exists because IE6 is stupid.
By moving the same preg_match() into check_plain(), we save about 50% of the time spent in profiling, this should be backportable.
Comment | File | Size | Author |
---|---|---|---|
#91 | 523058_check_plain_D6-2.patch | 2.58 KB | c960657 |
#83 | 523058_check_plain_D6.patch | 1.71 KB | andypost |
#76 | check_plain.patch | 3.28 KB | catch |
#63 | 523058-check_plain-61.patch | 3.28 KB | pwolanin |
#61 | check_plain.patch | 0 bytes | catch |
Comments
Comment #1
catchwhoops.
Comment #2
sunHm. drupal_validate_utf8() additionally checked for empty()... why not here?
Also, even after some discussion in IRC, I'm still not sure whether we can't entirely drop that IE6 support function. See also #308865: Drop IE6 support in Drupal core.
Comment #3
sunAdditional insight for why this function exists: https://security.drupal.org/node/116
I'd propose to re-think whether we need this entire check at all. (See also the issue I linked to in my previous follow-up)
Comment #4
catchAdded the empty(). We should consider dropping at least some aspects of IE6 support in HEAD I think, but we can also backport this to 5/6.
Still twice as fast, this morning check_plain() is taking 6% / 3% instead of 12% / 6% but still good results.
Comment #5
catchMore sober title.
Comment #6
pwolanin CreditAttribution: pwolanin commentedempty() will return TURE for the string '0' as well as various non-strings. Are you sure you want to use that vs. the strlen==0 check?
You could also cast to a string and compare to the empty string, and return the cast variable.
Comment #7
catchJust checked HEAD and we get between 1-3 empty strings per page, so how about just removing that altogether?
Comment #8
pwolanin CreditAttribution: pwolanin commented@catch- makes sense to me - if there are that few we are not gaining anything since it costs lots of extra function calls.
Comment #9
sunWe should append a @see drupal_validate_utf8() to the PHPDoc though, and perhaps additionally explain why we are duplicating code.
Speaking of duplication, shouldn't we apply the same to filter_xss() (the second instance) ? So we'd keep drupal_validate_utf8() for (rather special) contrib modules that need it only?
Comment #10
catchAdded the phpdoc, did similar in filter_xss() and removed the strlen from drupal_validate_utf8().
Comment #11
sunYay! Thanks for updating!
(multiple occurrences) @see should always be on a separate line.
This lengthy comment should rather go into the PHPDoc.
Comment #13
catchFixed the notice and did some more comment tidying.
Comment #14
pwolanin CreditAttribution: pwolanin commentedLooks good to me - nice to have to validated performance improvements.
Comment #15
jrchamp CreditAttribution: jrchamp commentedThe only problem with removing the strlen from drupal_validate_utf8 is that it changes the result for empty string. An empty string is a valid utf8 string, whereas the regex requires at least one character.
Comment #16
catchHmm, I'm wondering if we could switch the order - so if the preg_match() returns TRUE we're done, if it doesn't, then we do the strlen check and return true if it's an empty string? It's not a problem in check_plain() because we return an empty string anyway if the preg_match() fails. Or we could just delete that hunk from the patch since drupal_validate_utf8() isn't in the critical path anyway.
Comment #17
Damien Tournoud CreditAttribution: Damien Tournoud commentedstrlen() should be lightening fast. What about benchmarking with and without that check?
Comment #18
catchWe only need to duplicate the preg_match() in check_plain() since that's in the critical path, so removed other changes.
Comment #19
catchComment #20
jrchamp CreditAttribution: jrchamp commentedThis was already RTBC. It duplicates a small amount of code (referencing the original) in a critical path which reduces page load by a noticeable amount. Sounds good to me.
Comment #21
catchSo Moshe pointed out that since xdebug has to measure function calls, it's not the best way to measure performance improvements gained by removing function calls, which was a good point. Beyond this, I also found via micro-benchmarking that having xdebug extension enabled at all (i.e. xdebug on, profiling off) adds overhead to function calls.
So here's micro-benchmarks for before/after check_plain() to verify the speed-up in check_plain() itself. I've attached the benchmarking script to this issue so you can run it locally and/or see the logic:
(three runs, 2,000,000 iterations of each loop).
With xdebug enabled:
Actually in this case, the relative speed up in check_plain() is the same, it's just the absolute times which are different. Either way, it validates this patch, but I'm hoping to figure out how best to verify these changes on some of the other issues.
Comment #22
catchForgot to attach the test script.
Comment #23
catchAnd just for completeness, here's numbers for 10,000 iterations - which it's possible to get in HEAD on some pages(or at least the webgrind screenshots had 8,900).
That's a saving of around 25-30ms.
Comment #24
sun...slightly exceeds 80 chars.
Beer-o-mania starts in 22 days! Don't drink and patch.
Comment #25
catchIt only does that if you include the + at the beginning of the line, otherwise it's exactly 80 characters ;)
Comment #26
c960657 CreditAttribution: c960657 commentedIn PHP 5.2.5 and newer, htmlspecialchars() automatically handles partial multibyte sequences (according to the release notes).
The following approach reduces the running time of check_plain() even further (approx. 11% AFAICT) when using a PHP 5.2.5 or later. With older versions, it is about as fast as than the current implementation.
Comment #27
sunLet's prove this.
NOTE: As of now, this is NOT the patch that is RTBC.
Comment #29
sunComplete removal of check_plain(): (sorry, my machine is slow)
HEAD without patch:
Patch applied:
EDIT: I mistakenly copied the benchmarks in the wrong order :-/
Comment #30
sunAnd of course, here is the updated patch.
Comment #32
sunUsing a concurrency of 1:
HEAD:
Patch:
Comment #33
webchickPlease don't mark patches RTBC until they're reviewed and tested.
Comment #35
Damien Tournoud CreditAttribution: Damien Tournoud commented#26 does make a lot of sense to me, but I would like to see that tested. I would not trust PHP to watch my food ;)
Comment #36
jrchamp CreditAttribution: jrchamp commented@Damz: Here is the output from my system testing the code from #26 in the script provided on #22.
nothing: 0.342183113098 seconds
function no_op(): 2.46541881561 seconds
check plain old: 14.3335449696 seconds
check plain new: 9.67884707451 seconds
check plain new 5.2.5: 6.89955997467 seconds
Comment #37
Damien Tournoud CreditAttribution: Damien Tournoud commentedI meant "test that htmlentities() does the correct thing regarding broken UTF-8 sequences".
Comment #38
Heine CreditAttribution: Heine commentedhtmlspecialchars assumes a default charset of ISO-8859-1 where no multibyte codes exist. In ISO-8859-1 mode, htmlspecialchars does not and cannot care about "invalid" multi-byte sequences.
Drupal strings are UTF-8; You must pass 'UTF-8' as the charset argument to get proper behaviour.
htmlspecialchars("Foo\xC0barbaz", ENT_QUOTES, 'UTF-8'); should return an empty string.
Comment #39
pwolanin CreditAttribution: pwolanin commentedwhich patch is under consideration? #18? Maybe we should just get that in and have any further discussion to post-freeze.
The code in #26 (no patch) seems to me to be as far as we would want to go in terms of taking advantage of PHP 5.2.5+. We should not require it.
Comment #40
catchhtmlspecialchars("Foo\xC0barbaz", ENT_QUOTES, 'UTF-8');
That looks plenty of reason to keep a separate check_plain() wrapper. Since this is all internals, I don't see any reason why it couldn't go in after freeze as a performance improvement (and also be backported to D6 once we settle on one). One idea I had is maybe settings a constant for PHP version then checking that rather than the static - not sure how that compares in terms of speed.
Comment #41
sunYes, that was also my gut feeling when I replaced all instances of check_plain() with htmlspecialchars() directly. That won't work out for the "average developer and themer"... Additionally, it seems that the performance gain of that approach is minimal. So let's get back to the BC approach in #26.
Benchmarks, anyone?
Comment #42
smk-ka CreditAttribution: smk-ka commentedWhat about dynamically defining check_plain(), depending on the result of the PHP version check? This results in the fastest possible implementation for suitable PHP versions, and a slower one for the rest.
Comment #43
jrchamp CreditAttribution: jrchamp commentedI'm a big fan of #42. Defining the function so that we don't have to check the static on each call should save us even more time overall.
Comment #44
pwolanin CreditAttribution: pwolanin commented#42 omits the basic optimization of duplicating the preg_match
Comment #45
smk-ka CreditAttribution: smk-ka commentedInlined drupal_validate_utf8() as per #44.
Comment #46
pwolanin CreditAttribution: pwolanin commentedIn the doxygen, "Uses drupal_validate_utf8()" is not really true. Maybe "uses the same method..."?
Also, the conditional define is abnormal for Drupal core, so it might be worth explaining in the comments why is doesn't look like http://api.drupal.org/api/function/drupal_clone/6 In oter words, this function is called so often for a non-cached page that any sub-calls add significant overhead and we thus put the preg code there directly adn conditionally define the function to avoid doing version_compare each time it's called.
Comment #47
jrchamp CreditAttribution: jrchamp commentedFrom #18
+ * check_plain() also validates strings as UTF-8 to prevent cross site scripting
+ * attacks on Internet Explorer 6. We duplicate the preg_match() from
+ * drupal_validate_utf8() here rather than calling the function to avoid
+ * the overhead of an additional function call, since check_plain() may be
+ * called hundreds of times during a request.
Comment #48
Damien Tournoud CreditAttribution: Damien Tournoud commentedThis needs to be benchmarked. As far as I know, conditionally defined functions cannot be cached by most (if not all) op-code caches. I personally would recommend the static variable approach.
Comment #49
pwolanin CreditAttribution: pwolanin commentedhmm, well we certainly want the function to be cacheable in the opcode.
If we use a static, it should be a normal static (not use drupal_static()) since it doesn't make sense to me that the PHP version would ever change while the code is executing.
Comment #50
pwolanin CreditAttribution: pwolanin commentedok, more like #18
Comment #51
pwolanin CreditAttribution: pwolanin commentedHeine points out that I need === to distinguish FALSE from NULL
Comment #52
jrchamp CreditAttribution: jrchamp commented#51 appears to be about 6% faster than #26, and 33% faster than #18
nothing: 0.330275058746 seconds
function no_op(): 2.31866407394 seconds
check plain old: 14.1262290478 seconds
check plain new: 9.65815281868 seconds
check plain new 5.2.5: 6.80544900894 seconds
check plain new 5.2.5 pwolanin: 6.36292600632 seconds
It would be worthwhile to add #38 as a SimpleTest to ensure that the intended functionality remains intact.
Comment #53
sunJust use !isset() like everywhere else in core (and remove the = NULL from the static).
(Also note trailing white-space here)
An additional cast to Boolean may speed up the further tests of the variable.
Otherwise, looks good.
This review is powered by Dreditor.
Comment #54
pwolanin CreditAttribution: pwolanin commented@sun - I assumed === might be faster than isset(), but I can do it that way.
The return value of version_compare() will be a boolean.
Comment #55
pwolanin CreditAttribution: pwolanin commentedminor fixes per sun, plus a simple unit test case.
Comment #56
jrchamp CreditAttribution: jrchamp commentednothing: 0.338335990906 seconds
function no_op(): 2.31378507614 seconds
check plain old: 14.4185819626 seconds
check plain new: 9.58456897736 seconds
check plain new 5.2.5: 6.64569997787 seconds
check plain new 5.2.5 pwolanin: 6.31802201271 seconds
check plain new 5.2.5 pwolanin 2nd edition: 6.39244294167 seconds
About 1% slower than #51. Thankfully, isset isn't a function. :)
Thanks pwolanin! Let's get this in soon.
Comment #57
catchMarking #55 RTBC, wonderful job everyone.
Comment #58
hass CreditAttribution: hass commented+
Comment #59
catchAdded an explicit @todo to drop the ugly when we drop support for either IE6 or PHP5.2.5
Comment #60
Dries CreditAttribution: Dries commentedI support this approach, but most of the new PHPdoc at the top of the function needs to be moved (back) into code. The reader of the API documentation doesn't (and shouldn't care) about the fact that we inlined some code. Let's do a quick re-roll.
Comment #61
catchMoved the docs inline. It's a pretty short function, so I just left it as a single block of documentation, no easy way to split it up.
Comment #62
Heine CreditAttribution: Heine commentedCan we have more tests? With overlong three byte encoding for example?
Comment #63
pwolanin CreditAttribution: pwolanin commentedmoved the bulk of the comments to the function body.
Comment #64
pwolanin CreditAttribution: pwolanin commentedsome possible test cases can be derived from http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
Comment #65
jrchamp CreditAttribution: jrchamp commented@Heine, @pwolanin: Is this what you were thinking?
Note:
- \x2F succeeds as expected.
- \xC0\x2F fails as expected.
- \xC0\xAF and \xE0\x80\xAF fail in the preg_match as expected, but unexpectedly does not fail in the htmlspecialchars.
The behavior for these 4 test cases is identical in 5.1.6 and 5.2.10, except that in 5.1.6 a PHP Warning is generated for \xC0\x2F from htmlspecialchars.
Perhaps I am doing something wrong; I would appreciate it if someone who knows more about UTF-8 than I do could let me know if this is correct.
Comment #66
pwolanin CreditAttribution: pwolanin commentedWe shoudl build a little bigger test set - but basically, if the PHP internals don't handle this as robustly as the preg_match, we should just go back to the simpler earlier patch that doesn't try to rely on PHP version.
Comment #67
catchThe last patch here doing that was #18 - http://drupal.org/node/523058#comment-1842404
But that probably needs the comments in-lining per Dries review (since the comments are much the same).
I think we should probably go ahead and commit that, and backport to Drupal 6 too, since there's nothing breakable there. Then that still gives us a window to further optimize if we find away around the PHP wonkiness.
Comment #68
c960657 CreditAttribution: c960657 commentedI am no expert on this, but here is my understand of this:
AFAICT htmlspecialchars() only rejects partial multibyte sequences followed by a byte < 0x80 (or the end of the string). I think this is fine - the UTF-8 exploit is only relevant if one of the few characters that would otherwise be escaped by htmlspecialchars() (i.e.
" < > & '
) is "hidden" by a partial multibyte sequence. These few characters all have character codes < 0x80. In other words, htmlspecialchars() does not block all invalid UTF-8 strings, but it does block the dangerous ones.Here is the change that was made to htmlspecialchars() in PHP 5.2.5:
http://svn.php.net/viewvc/php/php-src/branches/PHP_5_2/ext/standard/html...
Comment #69
pwolanin CreditAttribution: pwolanin commentedThanks for the link to the diff. So, it seems like we can proceed as long as we are selective about picking test cases? Or would we rather pay the performance penalty to have more consistent drupal behavior across PHP version? I could imagine see this leading to some hard-to-debug problems.
Comment #70
catchI'd be tempted to say commit the slightly slower version - at leastso we can do a straight backport to Drupal 6 for consistency there. Then consider changing this in Drupal 8 (even if we don't require 5.2.6 for D8 it seems reasonable that most servers will actually be running it by then).
For me that means re-rolling #18 with the documentation moved inline + a @todo for the PHP 5.2.6 version + leaving this open once we've done the backport so we can discuss more for Drupal 7 or push it to Drupal 8.
Comment #71
chx CreditAttribution: chx commentedGiven the grave performance implications here and the need to backport ASAP I would strongly recommend committing the latest patch and building the test case later. This is an exception to a rule I know but it has good reason.
Comment #72
jrchamp CreditAttribution: jrchamp commented@c960657: Thank you for the explanation.
+1 to #63 because it doesn't alter the intent of the preg_match, and has significant performance benefits.
Comment #73
scroogie CreditAttribution: scroogie commentedso?
Comment #74
jrchamp CreditAttribution: jrchamp commented@scroogie: I'm pretty sure this is the point where more people review and approve the code or it gets committed. It takes a lot of patience sometimes. :)
Comment #75
sunGiven the last comments, it's entirely not clear to me which patch should be committed to D7. Please re-attach.
Comment #76
catchThis one is the one which was marked RTBC most recently.
Comment #78
sunTests failed.
Comment #79
catchComment #80
catchComment #81
catchWould really be nice to see this committed, since the slowness of check_plain() also adds overhead to all the functions which call it, which makes it harder to track down as-yet-unfound bottlenecks when profiling.
Re-posting jrchamp's microbenchmarks, which are the most recent:
nothing: 0.338335990906 seconds
function no_op(): 2.31378507614 seconds
check plain old: 14.4185819626 seconds
check plain new: 9.58456897736 seconds
check plain new 5.2.5: 6.64569997787 seconds
check plain new 5.2.5 pwolanin: 6.31802201271 seconds
check plain new 5.2.5 pwolanin 2nd edition: 6.39244294167 seconds
The last result is for the current patch.
As you can see, this is more than twice as fast as the implementation in HEAD, and the RTBC patch is itself a third faster than the original patch on this issue, making it worth the static for PHP version until we can increase requirements.
I inherited a horrible, horrible site freelancing one time, where the theme was riddled with XSS holes - after explaining to the client that this would need to be cleaned up, they went back to the company which originally build the theme, which came back and explained they never use check_plain() because it 'slows things down'. I'm sure that's not an isolated case of such stupidity.
The only slight issue here is how much we trust the htmlspecialchars() fix in PHP 5.2.5 - c960657 has shown that it's secure, but not as restrictive as our original preg_match(). If we're uneasy about relying on that though, then we have only very slightly slower alternatives, also ready to go, earlier in the issue.
Comment #82
webchickHuh. Somehow I had missed Dries's #60 where he points out that he supports this approach apart from some commenting issues. It looks like those comments have been addressed, along with nicely expanded test coverage. Additionally, it looks like this speeds things up quite a bit, according to the microbenchmarks.
Therefore, committed to HEAD. Seems like it might be worth back-porting this to D6?
Comment #83
andypostAnd here's a backport for d6
Comment #84
boombatower CreditAttribution: boombatower commentedDidn't read through whole issue, but does this fix or is aware of:
when run on PHP 5.3.
Comment #85
andypost@boombatower not this patch not about this warning, suppose test should be changed a bit
Also why this improvement still not in?
Comment #86
pwolanin CreditAttribution: pwolanin commentedI pointed it out to Gabor today - we just need final confirmation that it's working cleanly on D6.
Comment #87
pwolanin CreditAttribution: pwolanin commentedPatch applies cleanly and matches what's already in D7.
Comment #88
Gábor HojtsyFor maintenance reasons, I'd add a strong, very visible reference to check_plain() in drupal_validate_utf8(), so if we need modification there, we don't forget to roll it to here as well. When we have multiple copies of code around, we should make sure to modify them in concert.
Comment #89
andypost@Gábor Hojtsy purpose of drupal_validate_utf8() is different - just to check a string. I see nothing to modify in drupal_validate_utf8()
Comment #90
Gábor Hojtsy@andypost: imagine drupal_validate_utf8() will have a change in a month or a year or two. Will people making that change remember that the same code appears in check_plain() and needs modification? I'm saying we should add a clear comment to drupal_validate_utf8() to notify people in the future that the same code is copied in check_plain().
Comment #91
c960657 CreditAttribution: c960657 commentedAdded an inline comment in validate_utf8() as requested in #88.
Comment #92
andypostSuppose this comment is enough
Comment #93
andypostThere's a follow up for D7 #755758: check_plain throwing warnings
Comment #94
Gábor HojtsyCommitted, thanks.
Comment #96
greywolfsspirit CreditAttribution: greywolfsspirit commentedapostrophe is being encoded to html code when used with Advanced Forum module's navigation. If a topic has an apostrophe, say for example, CiCi's Pizza, it's displaying the link as: CiCi's Pizza
Does anyone have any suggestions as to a fix for this?
Comment #97
Oleksa-1 CreditAttribution: Oleksa-1 commentedAfter this patch #91 was commited to D6.17 a lot such kind of questions "warning: htmlspecialchars() [function.htmlspecialchars]: Invalid multibyte sequence" you can find on forum.
I had to replace in my latest D6.19 file bootstrap.inc with same file from D6.16 to get rid of this warning error.