Unless I'm misunderstanding the intention, I don't think this code is doing what the comments say ...

  // In some cases, no delimiter has been specified (e.g. when posting using
  // the Blogger API). In this case, we try to split at paragraph boundaries.
  // When even the first paragraph is too long, we try to split at the end of
  // the next sentence.
  $breakpoints = array('</p>' => 4, '<br />' => 0, '<br>' => 0, "\n" => 0, '. ' => 1, '! ' => 1, '? ' => 1, '„ÄÇ' => 3, 'ÿü ' => 1);
  foreach ($breakpoints as $point => $charnum) {
    if ($length = strpos($body, $point, $size)) {
      return substr($body, 0, $length + $charnum);
    }
  }

The comments lead me to believe that if the end of the first paragraph occurs after $size chars, that it will attempt to break before the end of that paragraph, correct?

I have a case where there is a <p> </p> surrounding the entire body (with none in between anywhere). In this case, it returns the entire body, no matter how long it is. Am I missing something?

Comments

dman’s picture

According to that block of code...
It will scan for punctuation if an end para is not found.
So your text should contain a '. ' (with trailing space) or a br or a newline etc, to break intelligently.

... however, yeah, looking closer, it seems that it starts scanning after the desired endpoint ... and yes, the first thing it sees is the very last [/p]

Probably not what you wanted, no.

It's making a guess that because you deliberatly wrapped the block in a single P, you never want that paragraph chopped up. This seems like sensible behaviour. More sensible than having all your input in one paragraph, anyway.
If you are clear enough to be using paras at all, you probably should be using them semantically.

.dan.

RayZ’s picture

My point was that the comments say ...

  // When even the first paragraph is too long, we try to split at the end of
  // the next sentence.

... but that's not what the code does. The code will never split a long first paragraph.

Btw, I'm trying to use node_teaser() to generate a teaser for a CCK text field and I want to do it after the filtering has been applied so I'm passing field_mytextfield[0]['view'] to node_teaser(), and it appears that field_mytextfield[0]['view'] puts P tags around it, since field_mytextfield[0]['value'] does not have those P tags.

I'm interested in hearing if there is a better approach to my specific case, but I still believe there is a bug in node_teaser(), either in the code or in the comments.

magico’s picture

Priority: Normal » Minor
gerd riesselmann’s picture

Version: x.y.z » 4.7.4
Priority: Minor » Critical
Status: Active » Needs review
StatusFileSize
new976 bytes

The OP is right: This code creates a teaser that has at least a length of teaser_length chars. But teaser_length is defined as the maximum chars the teaser is allowed to have, so this code is plain wrong.

I marked it as beeing critical, since a correct teaser length is crucial on portals, where space on the front page is limited.

The reason the code fails is in this line:

if ($length = strpos($body, $point, $size)) {

$size = teaser_length, so if teaser_length was 600, this code starts searching for a breakpoint after the 600th char. Instead, of course, we must search from the 600th chars backwards to find what we are looking for:

if ($length = strrpos(truncate_utf8($body, $size), $point)) {

An optimized version (calculating truncate_utf8 only once) is attached as a patch.

gerd riesselmann’s picture

BTW: Does anybody knows what '„ÄÇ' and 'ÿü' mean?

May it be these once had a meaning but got crumbled by some character set conversion? On my system for example they read '。' and '؟ ' - and this was taken fresh out of CVS

gerd riesselmann’s picture

StatusFileSize
new1.01 KB

Patch above was against HEAD, this patch is for 4.7 branch.

johnalbin’s picture

Version: 4.7.4 » 5.x-dev
Status: Needs review » Needs work

Gerd, those are UTF-8 characters: 。؟ And I assume they are end-of-sentence markers in some language.

strrpos($haystack, $needle) can only use single characters as a needle in PHP4. (See http://www.php.net/manual/en/function.strrpos.php). So we'll need a work around for PHP4.

johnalbin’s picture

Status: Needs work » Needs review
StatusFileSize
new2.38 KB

Here's an updated patch for 5.x-dev. It works around the PHP4 strrpos() issue. I believe I accounted for all edge cases, but please double check.

And it also gives all end-of-sentence characters equal standing when the teaser is shorter than the first paragraph; the old code would chop at a period even when chopping at an exclamation mark was better.

(This patch applies successfully to the 4.7.x-dev version as well. "Hunk #1 succeeded at 180 with fuzz 1 (offset 7 lines)." I can supply a 4.7-specific patch, if needed.)

ChrisKennedy’s picture

Priority: Critical » Normal

This is a good bug to fix but it isn't critical. See http://drupal.org/node/45111 for more info.

gerd riesselmann’s picture

StatusFileSize
new2.47 KB

I looked up both the UTF-8-chars: The first one is the ideographic full stop, the second the arabic question mark. However, PHP-Eclipse with UTF-8 encoding only shows "???" for the first and "?? " for the second (that's why I was wondering). On CVS diff, the last one reverts the writing direction (looks funny).

I therefore would suggest to replace them with their hexadcimal representation 0xEF 0xBD 0xA1 and 0xD8 0x9F. This avoids any kind of unwanted interference with local charset. I modified John's patch accordingly (against 4.7 branch).

Additionally: What do you think about implementing all international sentence dividers like specified by Unicode STerm: http://www.sql-und-xml.de/unicode-database/sterm.html

johnalbin’s picture

Gerd, please see http://drupal.org/node/12864 (anyone else have a better link to the UTF-8 in source code standard?). UTF-8 characters are included in much of Drupal's source code; it's the standard. Also, I don't think your patch would work since PHP doesn't do hexadecimal escaping inside single quotes.

Please continue to use my previous patch.

ChrisKennedy’s picture

Here's an earlier discussion of this issue: Teaser can be longer than maximum allowed..

johnalbin’s picture

StatusFileSize
new2.18 KB

Chris, thanks for pointing out the previous discussion.

The only major issue I see in the previous discussion is that people are worried about invalid html caused by chopping off closing tags (such as </div> or </p>.) However, since that is a problem with the current code as well as with my patch, I think "invalid html in teaser" is a separate issue.

Looking at the code presented by Adam (http://drupal.org/node/28211#comment-52638), I decided to do timing tests on his idea (search for each breakpoint using strpos() and compare the found positions) versus my idea (generate a regex using all breakpoints and use preg_match()).

  • using strpos() inside a foreach loop: 0.14s
  • using preg_match() with no loops: 0.24s

So, I rerolled my patch using strpos()

Steven’s picture

Status: Needs review » Fixed

Tested and verified. We should indeed do as the description says.

I did clean up the comments and code a bit. It could be simplified. Committed to HEAD.

RayZ’s picture

Version: 5.x-dev » 4.7.x-dev
Status: Fixed » Patch (to be ported)

Could use a backport to 4.7.x

johnalbin’s picture

Status: Patch (to be ported) » Needs review
StatusFileSize
new2.1 KB

Here is Steven's commit patch against 4.7.x-dev.

Steven, one extremely minor nitpick: In your comments, you say We use strpos on the reversed needle and haystack for speed. But it's really because strrpos() is useless/broken in PHP4.

killes@www.drop.org’s picture

Status: Needs review » Fixed

applied

bjaspan’s picture

IMHO, this is a substantial, possibly site-breaking, change. I agree that the previous method for computing teasers was not necessarily what was meant or expected, but the new method will result in incorrect behavior for sites that have optimized themselves to the old behavior.

For example, one client of mine wants one-sentence teasers, so we set the teaser length to 0. The old code found the first sentence/paragraph/whatever break after position 0, as desired. The new code (as I understand it) will result in empty teasers.

This deserves documentation.

Anonymous’s picture

Status: Fixed » Closed (fixed)
Bevan’s picture