Optimize check_plain()

catch - July 18, 2009 - 00:54
Project:Drupal
Version:6.x-dev
Component:base system
Category:bug report
Priority:critical
Assigned:Unassigned
Status:needs review
Issue tags:needs backport to D6, Performance
Description

check_plain() spends 6% of page execution time in self, and 6% in drupal_validate_utf8(), nearly all of this time is just the expense of calling functions thousands of times, not what they're actually doing. Also drupal_validate_utf8() only exists because IE6 is stupid.

By moving the same preg_match() into check_plain(), we save about 50% of the time spent in profiling, this should be backportable.

AttachmentSizeStatusTest resultOperations
ie6diediedieyoufucker.patch630 bytesIdleFailed: 13247 passes, 3 fails, 0 exceptionsView details | Re-test
head.png158.04 KBIgnoredNoneNone
patch.png237.6 KBIgnoredNoneNone

#1

catch - July 18, 2009 - 00:55
Title:IE6 and one function call makes every Drupal site slow» IE6 and one function call makes every Drupal site slower

whoops.

#2

sun - July 18, 2009 - 01:36

Hm. drupal_validate_utf8() additionally checked for empty()... why not here?

Also, even after some discussion in IRC, I'm still not sure whether we can't entirely drop that IE6 support function. See also #308865: Drop IE6 support in Drupal core.

#3

sun - July 18, 2009 - 02:04

Additional insight for why this function exists: https://security.drupal.org/node/116

I'd propose to re-think whether we need this entire check at all. (See also the issue I linked to in my previous follow-up)

#4

catch - July 18, 2009 - 09:25

Added the empty(). We should consider dropping at least some aspects of IE6 support in HEAD I think, but we can also backport this to 5/6.

Still twice as fast, this morning check_plain() is taking 6% / 3% instead of 12% / 6% but still good results.

AttachmentSizeStatusTest resultOperations
check_plain.patch1.3 KBIdlePassed: 14661 passes, 0 fails, 0 exceptionsView details | Re-test

#5

catch - July 18, 2009 - 10:02
Title:IE6 and one function call makes every Drupal site slower» Optimize check_plain()

More sober title.

#6

pwolanin - July 18, 2009 - 15:46
Title:Optimize check_plain()» IE6 and one function call makes every Drupal site slower

empty() will return TURE for the string '0' as well as various non-strings. Are you sure you want to use that vs. the strlen==0 check?

You could also cast to a string and compare to the empty string, and return the cast variable.

#7

catch - July 18, 2009 - 16:00

Just checked HEAD and we get between 1-3 empty strings per page, so how about just removing that altogether?

AttachmentSizeStatusTest resultOperations
check_plain.patch893 bytesIdlePassed: 14685 passes, 0 fails, 0 exceptionsView details | Re-test

#8

pwolanin - July 18, 2009 - 16:32

@catch- makes sense to me - if there are that few we are not gaining anything since it costs lots of extra function calls.

#9

sun - July 18, 2009 - 17:03
Status:needs review» needs work

We should append a @see drupal_validate_utf8() to the PHPDoc though, and perhaps additionally explain why we are duplicating code.

Speaking of duplication, shouldn't we apply the same to filter_xss() (the second instance) ? So we'd keep drupal_validate_utf8() for (rather special) contrib modules that need it only?

#10

catch - July 18, 2009 - 18:03
Status:needs work» needs review

Added the phpdoc, did similar in filter_xss() and removed the strlen from drupal_validate_utf8().

AttachmentSizeStatusTest resultOperations
check_plain.patch2.46 KBIdleFailed: 11011 passes, 671 fails, 13048 exceptionsView details | Re-test

#11

sun - July 18, 2009 - 18:19

Yay! Thanks for updating!

+ *   valid UTF-8. @see drupal_validate_utf8().

(multiple occurrences) @see should always be on a separate line.

function check_plain($text) {
-  return drupal_validate_utf8($text) ? htmlspecialchars($text, ENT_QUOTES) : '';
+  // Validate strings as UTF-8 to prevent cross site scripting attacks on
+  // Internet Explorer 6. We duplicate the same preg_match() here to avoid
+  // additional function calls, since check_plain() is often called hundreds of
+  // times per request.
+  return (preg_match('/^./us', $text) == 1) ? htmlspecialchars($text, ENT_QUOTES) : '';

This lengthy comment should rather go into the PHPDoc.

#12

System Message - July 18, 2009 - 18:20
Status:needs review» needs work

The last submitted patch failed testing.

#13

catch - July 18, 2009 - 19:55
Status:needs work» needs review

Fixed the notice and did some more comment tidying.

AttachmentSizeStatusTest resultOperations
check_plain.patch2.69 KBIdleFailed: Failed to install HEAD.View details | Re-test

#14

pwolanin - July 19, 2009 - 19:40
Status:needs review» reviewed & tested by the community

Looks good to me - nice to have to validated performance improvements.

#15

jrchamp - July 20, 2009 - 13:55

The only problem with removing the strlen from drupal_validate_utf8 is that it changes the result for empty string. An empty string is a valid utf8 string, whereas the regex requires at least one character.

#16

catch - July 20, 2009 - 15:17
Status:reviewed & tested by the community» needs review

Hmm, I'm wondering if we could switch the order - so if the preg_match() returns TRUE we're done, if it doesn't, then we do the strlen check and return true if it's an empty string? It's not a problem in check_plain() because we return an empty string anyway if the preg_match() fails. Or we could just delete that hunk from the patch since drupal_validate_utf8() isn't in the critical path anyway.

#17

Damien Tournoud - July 20, 2009 - 15:50

strlen() should be lightening fast. What about benchmarking with and without that check?

#18

catch - July 23, 2009 - 01:24

We only need to duplicate the preg_match() in check_plain() since that's in the critical path, so removed other changes.

AttachmentSizeStatusTest resultOperations
check_plain_4.patch1.31 KBIdlePassed: 14684 passes, 0 fails, 0 exceptionsView details | Re-test

#19

catch - July 23, 2009 - 17:52
Title:IE6 and one function call makes every Drupal site slower» Optimize check_plain()

#20

jrchamp - August 3, 2009 - 15:20
Status:needs review» reviewed & tested by the community

This was already RTBC. It duplicates a small amount of code (referencing the original) in a critical path which reduces page load by a noticeable amount. Sounds good to me.

#21

catch - August 3, 2009 - 21:32

So Moshe pointed out that since xdebug has to measure function calls, it's not the best way to measure performance improvements gained by removing function calls, which was a good point. Beyond this, I also found via micro-benchmarking that having xdebug extension enabled at all (i.e. xdebug on, profiling off) adds overhead to function calls.

So here's micro-benchmarks for before/after check_plain() to verify the speed-up in check_plain() itself. I've attached the benchmarking script to this issue so you can run it locally and/or see the logic:

(three runs, 2,000,000 iterations of each loop).

catch@catch-laptop:~/www/7$ php check_plain.php
nothing: 0.467885017395 seconds
function no_op(): 1.22173714638 seconds
check plain old: 8.99900794029 seconds
check plain new: 5.9074318409 seconds
catch@catch-laptop:~/www/7$ php check_plain.php
nothing: 0.499942064285 seconds
function no_op(): 1.27891206741 seconds
check plain old: 8.88757514954 seconds
check plain new: 5.72876691818 seconds
catch@catch-laptop:~/www/7$ php check_plain.php
nothing: 0.451987028122 seconds
function no_op(): 1.2409799099 seconds
check plain old: 8.81144690514 seconds
check plain new: 5.80533409119 seconds

With xdebug enabled:

nothing: 0.719228029251 seconds
function no_op(): 6.70149278641 seconds
check plain old: 38.8574278355 seconds
check plain new: 24.8503789902 seconds

Actually in this case, the relative speed up in check_plain() is the same, it's just the absolute times which are different. Either way, it validates this patch, but I'm hoping to figure out how best to verify these changes on some of the other issues.

#22

catch - August 3, 2009 - 21:38

Forgot to attach the test script.

AttachmentSizeStatusTest resultOperations
check_plain.php.txt1.42 KBIgnoredNoneNone

#23

catch - August 3, 2009 - 21:50

And just for completeness, here's numbers for 10,000 iterations - which it's possible to get in HEAD on some pages(or at least the webgrind screenshots had 8,900).

nothing: 0.0027801990509 seconds
function no_op(): 0.0087149143219 seconds
check plain old: 0.0544929504395 seconds
check plain new: 0.037360906601 seconds
catch@catch-laptop:~/www/7$ php check_plain.php
nothing: 0.00294899940491 seconds
function no_op(): 0.00863099098206 seconds
check plain old: 0.0572259426117 seconds
check plain new: 0.0334417819977 seconds

catch@catch-laptop:~/www/7$ php check_plain.php
nothing: 0.00277781486511 seconds
function no_op(): 0.00840902328491 seconds
check plain old: 0.0663850307465 seconds
check plain new: 0.0283868312836 seconds

That's a saving of around 25-30ms.

#24

sun - August 9, 2009 - 03:20

+++ includes/bootstrap.inc 18 Jul 2009 19:54:35 -0000
@@ -1024,11 +1024,21 @@ function drupal_unpack($obj, $field = 'd
+ * check_plain() also validates strings as UTF-8 to prevent cross site scripting

...slightly exceeds 80 chars.

Beer-o-mania starts in 22 days! Don't drink and patch.

#25

catch - August 9, 2009 - 10:33

It only does that if you include the + at the beginning of the line, otherwise it's exactly 80 characters ;)

#26

c960657 - August 12, 2009 - 18:26

In PHP 5.2.5 and newer, htmlspecialchars() automatically handles partial multibyte sequences (according to the release notes).

The following approach reduces the running time of check_plain() even further (approx. 11% AFAICT) when using a PHP 5.2.5 or later. With older versions, it is about as fast as than the current implementation.

<?php
function check_plain($text) {
  static
$new_php;
  if (!isset(
$new_php)) {
   
$new_php = version_compare(PHP_VERSION, '5.2.5', '>=');
  }
  return (
$new_php || preg_match('/^./us', $text) == 1) ? htmlspecialchars($text, ENT_QUOTES, 'UTF-8') : '';
}
?>

#27

sun - August 24, 2009 - 01:42

Let's prove this.

NOTE: As of now, this is NOT the patch that is RTBC.

AttachmentSizeStatusTest resultOperations
drupal.remove-check-plain.patch118.37 KBIdleFailed: 11902 passes, 508 fails, 600 exceptionsView details | Re-test

#28

System Message - August 24, 2009 - 02:05
Status:reviewed & tested by the community» needs work

The last submitted patch failed testing.

#29

sun - August 24, 2009 - 02:43
Status:needs work» reviewed & tested by the community

Complete removal of check_plain(): (sorry, my machine is slow)

HEAD without patch:

# ab -c 2 -n 100 http://drupal.test/
Document
Path:          /
Document Length:        10865 bytes

Concurrency Level:      2
Time taken for tests:   38.092 seconds
Complete requests:      100
Failed requests:        0
Write errors:           0
Total transferred:      1128800 bytes
HTML transferred:       1086500 bytes
Requests per second:    2.63 [#/sec] (mean)
Time per request:       761.836 [ms] (mean)
Time per request:       380.918 [ms] (mean, across all concurrent requests)
Transfer rate:          28.94 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0   13  11.7     16      62
Processing:   547  744 198.3    625    1641
Waiting:      531  728 197.8    609    1625
Total:        562  757 201.3    656    1656

Percentage of the requests served within a certain time (ms)
  50%    656
  66%    828
  75%    844
  80%    844
  90%    937
  95%   1281
  98%   1422
  99%   1656
100%   1656 (longest request)

Patch applied:

# ab -c 2 -n 100 http://drupal.test/
Document
Path:          /
Document Length:        10865 bytes

Concurrency Level:      2
Time taken for tests:   35.529 seconds
Complete requests:      100
Failed requests:        0
Write errors:           0
Total transferred:      1128800 bytes
HTML transferred:       1086500 bytes
Requests per second:    2.81 [#/sec] (mean)
Time per request:       710.589 [ms] (mean)
Time per request:       355.294 [ms] (mean, across all concurrent requests)
Transfer rate:          31.03 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0   13   9.3     16      31
Processing:   547  694 146.5    687    1484
Waiting:      531  678 146.5    672    1469
Total:        562  707 145.7    687    1500

Percentage of the requests served within a certain time (ms)
  50%    687
  66%    750
  75%    766
  80%    766
  90%    859
  95%    937
  98%   1344
  99%   1500
100%   1500 (longest request)

EDIT: I mistakenly copied the benchmarks in the wrong order :-/

#30

sun - August 24, 2009 - 02:43

And of course, here is the updated patch.

AttachmentSizeStatusTest resultOperations
drupal.remove-check-plain.28.patch118.37 KBIdleFailed: 12313 passes, 1 fail, 0 exceptionsView details | Re-test

#31

System Message - August 24, 2009 - 03:00
Status:reviewed & tested by the community» needs work

The last submitted patch failed testing.

#32

sun - August 24, 2009 - 03:14
Status:needs work» reviewed & tested by the community

Using a concurrency of 1:

HEAD:

# ab -c1 -n 100 http://drupal.test/
Document
Path:          /
Document Length:        10865 bytes

Concurrency Level:      1
Time taken for tests:   49.904 seconds
Complete requests:      100
Failed requests:        0
Write errors:           0
Total transferred:      1128800 bytes
HTML transferred:       1086500 bytes
Requests per second:    2.00 [#/sec] (mean)
Time per request:       499.037 [ms] (mean)
Time per request:       499.037 [ms] (mean, across all concurrent requests)
Transfer rate:          22.09 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0   10   7.7     16      16
Processing:   453  487  48.1    469     734
Waiting:      437  471  48.4    453     719
Total:        469  497  48.5    484     734

Percentage of the requests served within a certain time (ms)
  50%    484
  66%    484
  75%    500
  80%    500
  90%    562
  95%    609
  98%    687
  99%    734
100%    734 (longest request)

Patch:

Document Path:          /
Document Length:        10865 bytes

Concurrency Level:      1
Time taken for tests:   48.169 seconds
Complete requests:      100
Failed requests:        0
Write errors:           0
Total transferred:      1128800 bytes
HTML transferred:       1086500 bytes
Requests per second:    2.08 [#/sec] (mean)
Time per request:       481.694 [ms] (mean)
Time per request:       481.694 [ms] (mean, across all concurrent requests)
Transfer rate:          22.88 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0   10   7.7     16      31
Processing:   453  470  41.3    469     766
Waiting:      437  454  41.4    437     750
Total:        453  480  42.3    469     781

Percentage of the requests served within a certain time (ms)
  50%    469
  66%    469
  75%    484
  80%    484
  90%    484
  95%    531
  98%    719
  99%    781
100%    781 (longest request)

#33

webchick - August 24, 2009 - 03:16
Status:reviewed & tested by the community» needs review

Please don't mark patches RTBC until they're reviewed and tested.

#34

System Message - August 24, 2009 - 03:16
Status:needs review» needs work

The last submitted patch failed testing.

#35

Damien Tournoud - August 24, 2009 - 07:16
Status:needs work» needs review

#26 does make a lot of sense to me, but I would like to see that tested. I would not trust PHP to watch my food ;)

#36

jrchamp - August 24, 2009 - 13:48

@Damz: Here is the output from my system testing the code from #26 in the script provided on #22.

nothing: 0.342183113098 seconds
function no_op(): 2.46541881561 seconds
check plain old: 14.3335449696 seconds
check plain new: 9.67884707451 seconds
check plain new 5.2.5: 6.89955997467 seconds

#37

Damien Tournoud - August 24, 2009 - 13:57

I meant "test that htmlentities() does the correct thing regarding broken UTF-8 sequences".

#38

Heine - August 24, 2009 - 14:20

htmlspecialchars assumes a default charset of ISO-8859-1 where no multibyte codes exist. In ISO-8859-1 mode, htmlspecialchars does not and cannot care about "invalid" multi-byte sequences.

Drupal strings are UTF-8; You must pass 'UTF-8' as the charset argument to get proper behaviour.

htmlspecialchars("Foo\xC0barbaz", ENT_QUOTES, 'UTF-8'); should return an empty string.

#39

pwolanin - August 24, 2009 - 14:52

which patch is under consideration? #18? Maybe we should just get that in and have any further discussion to post-freeze.

The code in #26 (no patch) seems to me to be as far as we would want to go in terms of taking advantage of PHP 5.2.5+. We should not require it.

#40

catch - August 24, 2009 - 16:04

htmlspecialchars("Foo\xC0barbaz", ENT_QUOTES, 'UTF-8');
That looks plenty of reason to keep a separate check_plain() wrapper. Since this is all internals, I don't see any reason why it couldn't go in after freeze as a performance improvement (and also be backported to D6 once we settle on one). One idea I had is maybe settings a constant for PHP version then checking that rather than the static - not sure how that compares in terms of speed.

#41

sun - August 24, 2009 - 17:02

Yes, that was also my gut feeling when I replaced all instances of check_plain() with htmlspecialchars() directly. That won't work out for the "average developer and themer"... Additionally, it seems that the performance gain of that approach is minimal. So let's get back to the BC approach in #26.

Benchmarks, anyone?

AttachmentSizeStatusTest resultOperations
drupal.check-plain-utf-8.patch1.27 KBIdlePassed: 14676 passes, 0 fails, 0 exceptionsView details | Re-test

#42

smk-ka - August 24, 2009 - 18:17

What about dynamically defining check_plain(), depending on the result of the PHP version check? This results in the fastest possible implementation for suitable PHP versions, and a slower one for the rest.

AttachmentSizeStatusTest resultOperations
523058-check_plain.patch1.48 KBIdlePassed: 14671 passes, 0 fails, 0 exceptionsView details | Re-test

#43

jrchamp - August 24, 2009 - 18:52

I'm a big fan of #42. Defining the function so that we don't have to check the static on each call should save us even more time overall.

#44

pwolanin - August 24, 2009 - 22:14
Status:needs review» needs work

#42 omits the basic optimization of duplicating the preg_match

#45

smk-ka - August 24, 2009 - 23:38
Status:needs work» needs review

Inlined drupal_validate_utf8() as per #44.

AttachmentSizeStatusTest resultOperations
523058-check_plain.patch1.48 KBIdlePassed: 14672 passes, 0 fails, 0 exceptionsView details | Re-test

#46

pwolanin - August 25, 2009 - 01:42
Status:needs review» needs work

In the doxygen, "Uses drupal_validate_utf8()" is not really true. Maybe "uses the same method..."?

Also, the conditional define is abnormal for Drupal core, so it might be worth explaining in the comments why is doesn't look like http://api.drupal.org/api/function/drupal_clone/6 In oter words, this function is called so often for a non-cached page that any sub-calls add significant overhead and we thus put the preg code there directly adn conditionally define the function to avoid doing version_compare each time it's called.

#47

jrchamp - August 25, 2009 - 13:53

From #18
+ * check_plain() also validates strings as UTF-8 to prevent cross site scripting
+ * attacks on Internet Explorer 6. We duplicate the preg_match() from
+ * drupal_validate_utf8() here rather than calling the function to avoid
+ * the overhead of an additional function call, since check_plain() may be
+ * called hundreds of times during a request.

#48

Damien Tournoud - August 25, 2009 - 14:00

This needs to be benchmarked. As far as I know, conditionally defined functions cannot be cached by most (if not all) op-code caches. I personally would recommend the static variable approach.

#49

pwolanin - August 25, 2009 - 14:31

hmm, well we certainly want the function to be cacheable in the opcode.

If we use a static, it should be a normal static (not use drupal_static()) since it doesn't make sense to me that the PHP version would ever change while the code is executing.

#50

pwolanin - August 25, 2009 - 22:21
Status:needs work» needs review

ok, more like #18

AttachmentSizeStatusTest resultOperations
523058-check_plain-50.patch1.59 KBIdlePassed: 14694 passes, 0 fails, 0 exceptionsView details | Re-test

#51

pwolanin - August 25, 2009 - 22:32

Heine points out that I need === to distinguish FALSE from NULL

AttachmentSizeStatusTest resultOperations
523058-check_plain-51.patch1.6 KBIdleFailed: 14479 passes, 10 fails, 0 exceptionsView details | Re-test

#52

jrchamp - August 26, 2009 - 13:59

#51 appears to be about 6% faster than #26, and 33% faster than #18

nothing: 0.330275058746 seconds
function no_op(): 2.31866407394 seconds
check plain old: 14.1262290478 seconds
check plain new: 9.65815281868 seconds
check plain new 5.2.5: 6.80544900894 seconds
check plain new 5.2.5 pwolanin: 6.36292600632 seconds

It would be worthwhile to add #38 as a SimpleTest to ensure that the intended functionality remains intact.

#53

sun - August 26, 2009 - 14:02

+++ includes/bootstrap.inc
@@ -1120,11 +1120,35 @@ function drupal_unpack($obj, $field = 'data') {
+  static $php525 = NULL;

+  if ($php525 === NULL) {

Just use !isset() like everywhere else in core (and remove the = NULL from the static).

(Also note trailing white-space here)

+++ includes/bootstrap.inc
@@ -1120,11 +1120,35 @@ function drupal_unpack($obj, $field = 'data') {
+    $php525 = version_compare(PHP_VERSION, '5.2.5', '>=');
...
+  if ($php525) {

An additional cast to Boolean may speed up the further tests of the variable.

Otherwise, looks good.

This review is powered by Dreditor.

#54

pwolanin - August 26, 2009 - 16:08

@sun - I assumed === might be faster than isset(), but I can do it that way.

The return value of version_compare() will be a boolean.

#55

pwolanin - August 26, 2009 - 17:12

minor fixes per sun, plus a simple unit test case.

AttachmentSizeStatusTest resultOperations
523058-check_plain-55.patch3.23 KBIdlePassed: 14712 passes, 0 fails, 0 exceptionsView details | Re-test

#56

jrchamp - August 27, 2009 - 13:46

nothing: 0.338335990906 seconds
function no_op(): 2.31378507614 seconds
check plain old: 14.4185819626 seconds
check plain new: 9.58456897736 seconds
check plain new 5.2.5: 6.64569997787 seconds
check plain new 5.2.5 pwolanin: 6.31802201271 seconds
check plain new 5.2.5 pwolanin 2nd edition: 6.39244294167 seconds

About 1% slower than #51. Thankfully, isset isn't a function. :)

Thanks pwolanin! Let's get this in soon.

#57

catch - August 27, 2009 - 22:32
Status:needs review» reviewed & tested by the community

Marking #55 RTBC, wonderful job everyone.

#58

hass - September 6, 2009 - 16:18

+

#59

catch - September 16, 2009 - 05:56

Added an explicit @todo to drop the ugly when we drop support for either IE6 or PHP5.2.5

AttachmentSizeStatusTest resultOperations
check_plain.patch3.64 KBIdleFailed: Invalid PHP syntax in modules/simpletest/tests/common.test.View details | Re-test

#60

Dries - September 16, 2009 - 15:55
Status:reviewed & tested by the community» needs work

I support this approach, but most of the new PHPdoc at the top of the function needs to be moved (back) into code. The reader of the API documentation doesn't (and shouldn't care) about the fact that we inlined some code. Let's do a quick re-roll.

#61

catch - September 16, 2009 - 16:19
Status:needs work» reviewed & tested by the community

Moved the docs inline. It's a pretty short function, so I just left it as a single block of documentation, no easy way to split it up.

AttachmentSizeStatusTest resultOperations
check_plain.patch0 bytesIdlePassed: 14686 passes, 0 fails, 0 exceptionsView details | Re-test

#62

Heine - September 16, 2009 - 16:31

Can we have more tests? With overlong three byte encoding for example?

#63

pwolanin - September 16, 2009 - 16:42
Status:reviewed & tested by the community» needs review

moved the bulk of the comments to the function body.

AttachmentSizeStatusTest resultOperations
523058-check_plain-61.patch3.28 KBIdlePassed: 14692 passes, 0 fails, 0 exceptionsView details | Re-test

#64

pwolanin - September 16, 2009 - 17:46

some possible test cases can be derived from http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

#65

jrchamp - September 18, 2009 - 13:08

@Heine, @pwolanin: Is this what you were thinking?

<?php
     $text
= check_plain("\xE0\x80\xAF");
    
$this->assertEqual($text, '', 'check_plain() rejects invalid sequence "\xE0\x80\xAF"');
?>

Note:
- \x2F succeeds as expected.
- \xC0\x2F fails as expected.
- \xC0\xAF and \xE0\x80\xAF fail in the preg_match as expected, but unexpectedly does not fail in the htmlspecialchars.

The behavior for these 4 test cases is identical in 5.1.6 and 5.2.10, except that in 5.1.6 a PHP Warning is generated for \xC0\x2F from htmlspecialchars.

Perhaps I am doing something wrong; I would appreciate it if someone who knows more about UTF-8 than I do could let me know if this is correct.

#66

pwolanin - September 18, 2009 - 17:18

We shoudl build a little bigger test set - but basically, if the PHP internals don't handle this as robustly as the preg_match, we should just go back to the simpler earlier patch that doesn't try to rely on PHP version.

#67

catch - September 18, 2009 - 17:26

The last patch here doing that was #18 - http://drupal.org/node/523058#comment-1842404

But that probably needs the comments in-lining per Dries review (since the comments are much the same).

I think we should probably go ahead and commit that, and backport to Drupal 6 too, since there's nothing breakable there. Then that still gives us a window to further optimize if we find away around the PHP wonkiness.

#68

c960657 - September 18, 2009 - 17:59

I am no expert on this, but here is my understand of this:

AFAICT htmlspecialchars() only rejects partial multibyte sequences followed by a byte < 0x80 (or the end of the string). I think this is fine - the UTF-8 exploit is only relevant if one of the few characters that would otherwise be escaped by htmlspecialchars() (i.e. " < > & ') is "hidden" by a partial multibyte sequence. These few characters all have character codes < 0x80. In other words, htmlspecialchars() does not block all invalid UTF-8 strings, but it does block the dangerous ones.

Here is the change that was made to htmlspecialchars() in PHP 5.2.5:
http://svn.php.net/viewvc/php/php-src/branches/PHP_5_2/ext/standard/html...

#69

pwolanin - September 20, 2009 - 23:52

Thanks for the link to the diff. So, it seems like we can proceed as long as we are selective about picking test cases? Or would we rather pay the performance penalty to have more consistent drupal behavior across PHP version? I could imagine see this leading to some hard-to-debug problems.

#70

catch - September 21, 2009 - 02:37

I'd be tempted to say commit the slightly slower version - at leastso we can do a straight backport to Drupal 6 for consistency there. Then consider changing this in Drupal 8 (even if we don't require 5.2.6 for D8 it seems reasonable that most servers will actually be running it by then).

For me that means re-rolling #18 with the documentation moved inline + a @todo for the PHP 5.2.6 version + leaving this open once we've done the backport so we can discuss more for Drupal 7 or push it to Drupal 8.

#71

chx - September 21, 2009 - 14:58
Status:needs review» reviewed & tested by the community

Given the grave performance implications here and the need to backport ASAP I would strongly recommend committing the latest patch and building the test case later. This is an exception to a rule I know but it has good reason.

#72

jrchamp - September 21, 2009 - 15:25

@c960657: Thank you for the explanation.

+1 to #63 because it doesn't alter the intent of the preg_match, and has significant performance benefits.

#73

scroogie - November 2, 2009 - 17:21

so?

#74

jrchamp - November 3, 2009 - 22:19

@scroogie: I'm pretty sure this is the point where more people review and approve the code or it gets committed. It takes a lot of patience sometimes. :)

#75

sun - November 3, 2009 - 23:54
Status:reviewed & tested by the community» needs review

Given the last comments, it's entirely not clear to me which patch should be committed to D7. Please re-attach.

#76

catch - November 4, 2009 - 01:30
Status:needs review» reviewed & tested by the community

This one is the one which was marked RTBC most recently.

AttachmentSizeStatusTest resultOperations
check_plain.patch3.28 KBIdlePassed: 14711 passes, 0 fails, 0 exceptionsView details | Re-test

#77

System Message - November 4, 2009 - 01:40
Status:reviewed & tested by the community» needs work

The last submitted patch failed testing.

#78

sun - November 4, 2009 - 02:02

Tests failed.

#79

catch - November 4, 2009 - 11:44
Status:needs work» needs review

#80

catch - November 4, 2009 - 13:13
Status:needs review» reviewed & tested by the community

#81

catch - November 5, 2009 - 18:51

Would really be nice to see this committed, since the slowness of check_plain() also adds overhead to all the functions which call it, which makes it harder to track down as-yet-unfound bottlenecks when profiling.

Re-posting jrchamp's microbenchmarks, which are the most recent:

nothing: 0.338335990906 seconds
function no_op(): 2.31378507614 seconds
check plain old: 14.4185819626 seconds
check plain new: 9.58456897736 seconds
check plain new 5.2.5: 6.64569997787 seconds
check plain new 5.2.5 pwolanin: 6.31802201271 seconds
check plain new 5.2.5 pwolanin 2nd edition: 6.39244294167 seconds

The last result is for the current patch.

As you can see, this is more than twice as fast as the implementation in HEAD, and the RTBC patch is itself a third faster than the original patch on this issue, making it worth the static for PHP version until we can increase requirements.

I inherited a horrible, horrible site freelancing one time, where the theme was riddled with XSS holes - after explaining to the client that this would need to be cleaned up, they went back to the company which originally build the theme, which came back and explained they never use check_plain() because it 'slows things down'. I'm sure that's not an isolated case of such stupidity.

The only slight issue here is how much we trust the htmlspecialchars() fix in PHP 5.2.5 - c960657 has shown that it's secure, but not as restrictive as our original preg_match(). If we're uneasy about relying on that though, then we have only very slightly slower alternatives, also ready to go, earlier in the issue.

#82

webchick - November 8, 2009 - 13:58
Version:7.x-dev» 6.x-dev
Status:reviewed & tested by the community» patch (to be ported)

Huh. Somehow I had missed Dries's #60 where he points out that he supports this approach apart from some commenting issues. It looks like those comments have been addressed, along with nicely expanded test coverage. Additionally, it looks like this speeds things up quite a bit, according to the microbenchmarks.

Therefore, committed to HEAD. Seems like it might be worth back-porting this to D6?

#83

andypost - November 15, 2009 - 13:19
Status:patch (to be ported)» needs review

And here's a backport for d6

AttachmentSizeStatusTest resultOperations
523058_check_plain_D6.patch1.71 KBIgnoredNoneNone

#84

boombatower - December 4, 2009 - 20:16

Didn't read through whole issue, but does this fix or is aware of:

htmlspecialchars(): Invalid multibyte sequence in argument Warning bootstrap.inc 1204 check_plain()

when run on PHP 5.3.

 
 

Drupal is a registered trademark of Dries Buytaert.