A site user submitted this article (pretty much as is) to my site, using an input format which uses the core line break converter.

For some reason, I get completely empty html output from the node body. It's there in the database, but not rendered, at all. Now, there's various markup issues in there, but this shouldn't result in the entire text being stripped.

On HEAD I get this as ouput for a 640 line document:

<div class="content clear-block">
</div>

fun eh?

I'd found this intermittently with other articles, and for now have 'fixed' it by switching to the line break handling provided by bbcode.module which doesn't cause the same problem. So it's definitely the line-break converter that's causing this.

To reproduce, copy and paste the attached .txt file, submit a post with 'filtered html' and watch the blank output. Try it again without the line break converter and it should show up.

Although this is probably a fairly obscure bug, it's also pretty nasty when it shows up, and has taken me ages to track down (I've yet to compare what bbcode.module and core do differently, which'd probably help with this). So marking as critical on the assumption that others can reproduce with the same input.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

redndahead’s picture

FileSize
92.85 KB

After some testing I'm pretty sure it's a size issue. I've attached a text file that will work. But add one more character and it should not work. This text file is 95078 characters long and it dies at the 95079th character. Hopefully this helps.

red

redndahead’s picture

FileSize
32.55 KB

Retrying with just numbers in the file and it came out to 33331 was the limit 33332 did not work. attached is the file.

redndahead’s picture

After more testing it looks like the issue occurs at line 903, in version 6.0 of filter.module. It looks like this:

$chunk = preg_replace('/\n?(.+?)(?:\n\s*\n|\z)/s', "<p>$1</p>\n", $chunk); // make paragraphs, including one at the end

Where $chunk is the text. I searched to see if there was a size limit on preg_replace in php and there doesn't seem to be. So I'm guessing it's an issue with the regular expression. Unfortunately regular expressions cause me to have twitches. So my skills stop here. Anyone else can help?

redndahead’s picture

FYI I added ! delimeters and it fixed the issue. That said I have no clue what the ! does I just copied it from another preg_replace in that file. Here is the updated preg_replace.

$chunk = preg_replace('!/\n?(.+?)(?:\n\s*\n|\z)/s!', "<p>$1</p>\n", $chunk); // make paragraphs, including one at the end
redndahead’s picture

FileSize
1.17 KB

Attached patch. With above changes. Not quite sure it's what would be preferred. It doesn't add p tags around the text.

redndahead’s picture

Status: Active » Needs review
vladimir.dolgopolov’s picture

Status: Postponed (maintainer needs more info) » Needs review

1. I wonder why #4 works. I've tested it and the regex (!/$pattern/s!) has no much sense.


$chunk = 'sometext';
$chunk = preg_replace('!/\n?(.+?)(?:\n\s*\n|\z)/s!', "<p>$1</p>\n", $chunk); // make paragraphs, including one at the end
print_r($chunk);

2. Anyway I've wrote a script for test different types of regexp delimiters:


$preg_delimiters = preg_split("//", '!@#%&`');

array_pop($preg_delimiters);
array_shift($preg_delimiters);

foreach ($preg_delimiters as $delimiter) {
  $pattern = $delimiter . '\n?(.+?)(?:\n\s*\n|\z)' . $delimiter . 's';
  for ($i = 1; $i < 1000000; $i++) {
    $chunk = str_repeat('0', $i);
    $chunk = preg_replace($pattern, "<p>$1</p>\n", $chunk);
    if (empty($chunk)) {
      echo("$delimiter: $i\n");
      break;
    }
  }
}

I've got 33333 bytes.

I think it's PCRE limitation: http://regexkit.sourceforge.net/Documentation/pcre/pcre.html#SEC3

The maximum length of a subject string is the largest positive number that an integer variable can hold. However, when using the traditional matching function, PCRE uses recursion to handle subpatterns and indefinite repetition. This means that the available stack space may limit the size of a subject string that can be processed by certain patterns.

3. However, here is the test for the issue. IMHO it's won't fix.


class TestCase225335 extends DrupalTestCase {
  function get_info() {
    return array(
      'name' => t('[225335] Line break filter can result in entire body being stripped on output'),
      'desc' => t('A site user submitted this article (pretty much as is) to my site, using an input format which uses the core line break converter. After some testing I\'m pretty sure it\'s a size issue.'),
      'group' => t('Drupal 7 Tests'),
    );
  }

  function testIssue() {
    $string_normal = str_repeat('0123456789', 3330);
    $string_long   = str_repeat('0123456789', 33330);

    $result_normal = _filter_autop($string_normal);
    $result_long   = _filter_autop($string_long);

    $result_normal_expected = "<p>$string_normal</p>\n";
    $result_long_expected   = "<p>$string_long</p>\n";

    $this->assertEqual($result_normal, $result_normal_expected, 'Check normal string');
    $this->assertEqual($result_long, $result_long_expected, 'Check long string');
  }

}

vladimir.dolgopolov’s picture

Status: Needs review » Postponed (maintainer needs more info)

We should get more info about regexp limitations.

Bug #24460 preg_match crashing on specific pattern/string size

catch’s picture

vladimir and redndahead thanks for taking a look at this. That it's an arbitrary PCRE bug makes sense, however since there's a possible fix (and I still didn't check out what the other filter was doing differently) I'm going to mark it back to needs review. I'll try to actually test the patch asap.

vladimir.dolgopolov’s picture

There is test script for recognition of preg_replace limits concerning the issue:


// patterns from _filter_autop()
$block = '(?:table|thead|tfoot|caption|colgroup|tbody|tr|td|th|div|dl|dd|dt|ul|ol|li|pre|select|form|blockquote|address|p|h[1-6]|hr)';

$patterns = array(
  '|\n*$|',
  '|<br />\s*<br />|',
  '!(<'. $block .'[^>]*>)!',
  '!(</'. $block .'>)!',
  "/\n\n+/", "\n\n",
  '/\n?(.+?)(?:\n\s*\n|\z)/s',
  '|<p>\s*</p>\n|',
  "|<p>(<li.+?)</p>|",
  '|<p><blockquote([^>]*)>|i',
  '!<p>\s*(</?'. $block .'[^>]*>)!',
  '!(</?'. $block .'[^>]*>)\s*</p>!',
  '|(?<!<br />)\s*\n|', "<br />\n",
  '!(</?'. $block .'[^>]*>)\s*<br />!',
  '!<br />(\s*</?(?:p|li|div|th|pre|td|ul|ol)>)!',
  '/&([^#])(?![A-Za-z0-9]{1,8};)/',
);

$start  = 0;
$finish = 10000000; // max string's lenght
$step   = 1000000; // initial step

foreach ($patterns as $pattern) {
  pcre_limit($pattern, $start, $finish, $step);
}


function pcre_limit($pattern, $start, $finish, $step) {
  for ($i = $start; $i < $finish; $i += $step) {
    $chunk = str_repeat('0', $i);
    $chunk = preg_replace($pattern, "$0", $chunk);
    if (empty($chunk) && $i != 0) {
      echo("Gotcha possible limit: '$pattern' '$i' bytes\n");
      if ($step == 1) {
        echo("Limit for pattern: '$pattern' is '$i' bytes\n");
        return;
      }
      pcre_limit($pattern, $i - $step, $i, $step / 10);
      break;
    }
  }
}

I've got this: Limit for pattern: '/\n?(.+?)(?:\n\s*\n|\z)/s' is '33333' bytes.

Fortunately, a 's' modifier (PCRE_DOTALL) has not used in Drupal core with few exceptions like this issue.

redndahead’s picture

So do you feel that it's the modifier that's causing the issue? It's only used for that one "." can we replace the "." with ^\n (I read that the "." with /s is used as a way to express not a newline.)

vladimir.dolgopolov’s picture

@redndahead Yes, you are right. I have some investigation: \s modifier does not causing the issue.
I've written another script with the regexp's modifications.


$patterns = array(
  '/\n?(.+?)(?:\n\s*\n|\z)/',
  '/\n?(.+?)(?:\z)/',
  '/\n?(.+?)$/',
  '/\n?(.+)$/',
  '/(.+?)$/',
  '/^(.+?)$/',
  '/^(.+?)/',
);

$start  = 0;
$finish = 1000 * 1000;
$step   = $finish / 10;

echo("Check limits from $start to $finish\n\n");

foreach ($patterns as $pattern) {
  echo("Checking pattern '$pattern' ...\n");
  pcre_limit($pattern, $start, $finish, $step);
  echo("\n");
}

function pcre_limit($pattern, $start, $finish, $step) {
  for ($i = $start; $i <= $finish; $i += $step) {
    $chunk = str_repeat('0', $i);
    $chunk = preg_replace($pattern, "$0", $chunk);
    if (empty($chunk) && $i != 0) {
      if ($step == 1) {
        echo("Found limit: $i\n");
        return;
      }
      pcre_limit($pattern, $i - $step, $i, $step / 10);
      break;
    }
  }
}

Check limits from 0 to 1000000

Checking pattern '/\n?(.+?)(?:\n\s*\n|\z)/' ...
Found limit: 33333

Checking pattern '/\n?(.+?)(?:\z)/' ...
Found limit: 49999

Checking pattern '/\n?(.+?)$/' ...
Found limit: 99997

Checking pattern '/\n?(.+)$/' ...

Checking pattern '/(.+?)$/' ...
Found limit: 99998

Checking pattern '/^(.+?)$/' ...
Found limit: 99998

Checking pattern '/^(.+?)/' ...

Things to think about.

redndahead’s picture

I've more narrowed it down to the \z. In other words checking for a new line at the end. Using $ doesn't work either. I've asked some smart drupal people and they're not sure what to do. Maybe someone that really knows regex can figure this out.

dana_johnson’s picture

Thank you for finding the solution in #4, I just discovered this problem today and the solution works for me.

redndahead’s picture

Status: Needs review » Needs work

Sorry for the long update. It was discovered that my patch above actually doesn't wrap any p tags on any paragraphs. This is bad. So this needs some work. I'm out of my league now. Anyone else want to pick this up? I have narrowed it even further to:

|\z

This part of the regex kills the text. Anyone else that can run with this?

catch’s picture

Status: Needs work » Closed (duplicate)

Marking as duplicate, this is an old one apparently :(

http://drupal.org/node/133188

yngens’s picture

subscribe. now not sure what to do - is #4 only temp solution?

bambangfals’s picture

I still have problem with my long post. Still doesnt show under title of posting... any other solutions???