node_teaser() not breaking properly

RayZ - July 16, 2006 - 00:04
Project:Drupal
Version:4.7.x-dev
Component:node.module
Category:bug report
Priority:normal
Assigned:Unassigned
Status:closed
Description

Unless I'm misunderstanding the intention, I don't think this code is doing what the comments say ...

<?php
 
// In some cases, no delimiter has been specified (e.g. when posting using
  // the Blogger API). In this case, we try to split at paragraph boundaries.
  // When even the first paragraph is too long, we try to split at the end of
  // the next sentence.
 
$breakpoints = array('</p>' => 4, '<br />' => 0, '<br>' => 0, "\n" => 0, '. ' => 1, '! ' => 1, '? ' => 1, '„ÄÇ' => 3, 'ÿü ' => 1);
  foreach (
$breakpoints as $point => $charnum) {
    if (
$length = strpos($body, $point, $size)) {
      return
substr($body, 0, $length + $charnum);
    }
  }
?>

The comments lead me to believe that if the end of the first paragraph occurs after $size chars, that it will attempt to break before the end of that paragraph, correct?

I have a case where there is a <p> </p> surrounding the entire body (with none in between anywhere). In this case, it returns the entire body, no matter how long it is. Am I missing something?

#1

dman - July 16, 2006 - 02:56

According to that block of code...
It will scan for punctuation if an end para is not found.
So your text should contain a '. ' (with trailing space) or a br or a newline etc, to break intelligently.

... however, yeah, looking closer, it seems that it starts scanning after the desired endpoint ... and yes, the first thing it sees is the very last [/p]

Probably not what you wanted, no.

It's making a guess that because you deliberatly wrapped the block in a single P, you never want that paragraph chopped up. This seems like sensible behaviour. More sensible than having all your input in one paragraph, anyway.
If you are clear enough to be using paras at all, you probably should be using them semantically.

.dan.

#2

RayZ - July 17, 2006 - 12:34

My point was that the comments say ...

  // When even the first paragraph is too long, we try to split at the end of
  // the next sentence.

... but that's not what the code does. The code will never split a long first paragraph.

Btw, I'm trying to use node_teaser() to generate a teaser for a CCK text field and I want to do it after the filtering has been applied so I'm passing field_mytextfield[0]['view'] to node_teaser(), and it appears that field_mytextfield[0]['view'] puts P tags around it, since field_mytextfield[0]['value'] does not have those P tags.

I'm interested in hearing if there is a better approach to my specific case, but I still believe there is a bug in node_teaser(), either in the code or in the comments.

#3

magico - September 20, 2006 - 22:59
Priority:normal» minor

#4

gerd riesselmann - December 6, 2006 - 16:12
Version:x.y.z» 4.7.4
Priority:minor» critical
Status:active» patch (code needs review)

The OP is right: This code creates a teaser that has at least a length of teaser_length chars. But teaser_length is defined as the maximum chars the teaser is allowed to have, so this code is plain wrong.

I marked it as beeing critical, since a correct teaser length is crucial on portals, where space on the front page is limited.

The reason the code fails is in this line:

<?php
if ($length = strpos($body, $point, $size)) {
?>

$size = teaser_length, so if teaser_length was 600, this code starts searching for a breakpoint after the 600th char. Instead, of course, we must search from the 600th chars backwards to find what we are looking for:

<?php
if ($length = strrpos(truncate_utf8($body, $size), $point)) {
?>

An optimized version (calculating truncate_utf8 only once) is attached as a patch.

AttachmentSize
node_teaser.patch976 bytes

#5

gerd riesselmann - December 6, 2006 - 19:07

BTW: Does anybody knows what '„ÄÇ' and 'ÿü' mean?

May it be these once had a meaning but got crumbled by some character set conversion? On my system for example they read '。' and '؟ ' - and this was taken fresh out of CVS

#6

gerd riesselmann - December 6, 2006 - 22:37

Patch above was against HEAD, this patch is for 4.7 branch.

AttachmentSize
node.module.4-7.node_teaser.patch1.01 KB

#7

JohnAlbin - December 6, 2006 - 23:55
Version:4.7.4» 5.x-dev
Status:patch (code needs review)» patch (code needs work)

Gerd, those are UTF-8 characters: 。؟ And I assume they are end-of-sentence markers in some language.

strrpos($haystack, $needle) can only use single characters as a needle in PHP4. (See http://www.php.net/manual/en/function.strrpos.php). So we'll need a work around for PHP4.

#8

JohnAlbin - December 7, 2006 - 06:00
Status:patch (code needs work)» patch (code needs review)

Here's an updated patch for 5.x-dev. It works around the PHP4 strrpos() issue. I believe I accounted for all edge cases, but please double check.

And it also gives all end-of-sentence characters equal standing when the teaser is shorter than the first paragraph; the old code would chop at a period even when chopping at an exclamation mark was better.

(This patch applies successfully to the 4.7.x-dev version as well. "Hunk #1 succeeded at 180 with fuzz 1 (offset 7 lines)." I can supply a 4.7-specific patch, if needed.)

AttachmentSize
node_teaser_0.patch2.38 KB

#9

ChrisKennedy - December 7, 2006 - 06:33
Priority:critical» normal

This is a good bug to fix but it isn't critical. See http://drupal.org/node/45111 for more info.

#10

gerd riesselmann - December 7, 2006 - 11:14

I looked up both the UTF-8-chars: The first one is the ideographic full stop, the second the arabic question mark. However, PHP-Eclipse with UTF-8 encoding only shows "???" for the first and "?? " for the second (that's why I was wondering). On CVS diff, the last one reverts the writing direction (looks funny).

I therefore would suggest to replace them with their hexadcimal representation 0xEF 0xBD 0xA1 and 0xD8 0x9F. This avoids any kind of unwanted interference with local charset. I modified John's patch accordingly (against 4.7 branch).

Additionally: What do you think about implementing all international sentence dividers like specified by Unicode STerm: http://www.sql-und-xml.de/unicode-database/sterm.html

AttachmentSize
node_teaser_1.patch2.47 KB

#11

JohnAlbin - December 7, 2006 - 16:25

Gerd, please see http://drupal.org/node/12864 (anyone else have a better link to the UTF-8 in source code standard?). UTF-8 characters are included in much of Drupal's source code; it's the standard. Also, I don't think your patch would work since PHP doesn't do hexadecimal escaping inside single quotes.

Please continue to use my previous patch.

#12

ChrisKennedy - December 8, 2006 - 08:54

Here's an earlier discussion of this issue: Teaser can be longer than maximum allowed..

#13

JohnAlbin - December 8, 2006 - 21:07

Chris, thanks for pointing out the previous discussion.

The only major issue I see in the previous discussion is that people are worried about invalid html caused by chopping off closing tags (such as </div> or </p>.) However, since that is a problem with the current code as well as with my patch, I think "invalid html in teaser" is a separate issue.

Looking at the code presented by Adam (http://drupal.org/node/28211#comment-52638), I decided to do timing tests on his idea (search for each breakpoint using strpos() and compare the found positions) versus my idea (generate a regex using all breakpoints and use preg_match()).

  • using strpos() inside a foreach loop: 0.14s
  • using preg_match() with no loops: 0.24s

So, I rerolled my patch using strpos()

AttachmentSize
node_teaser_2.patch2.18 KB

#14

Steven - December 12, 2006 - 08:34
Status:patch (code needs review)» fixed

Tested and verified. We should indeed do as the description says.

I did clean up the comments and code a bit. It could be simplified. Committed to HEAD.

#15

RayZ - December 12, 2006 - 12:29
Version:5.x-dev» 4.7.x-dev
Status:fixed» patch (to be ported)

Could use a backport to 4.7.x

#16

JohnAlbin - December 12, 2006 - 17:12
Status:patch (to be ported)» patch (code needs review)

Here is Steven's commit patch against 4.7.x-dev.

Steven, one extremely minor nitpick: In your comments, you say We use strpos on the reversed needle and haystack for speed. But it's really because strrpos() is useless/broken in PHP4.

AttachmentSize
node_teaser_47.patch2.1 KB

#17

killes@www.drop.org - January 4, 2007 - 19:50
Status:patch (code needs review)» fixed

applied

#18

bjaspan - January 9, 2007 - 17:26

IMHO, this is a substantial, possibly site-breaking, change. I agree that the previous method for computing teasers was not necessarily what was meant or expected, but the new method will result in incorrect behavior for sites that have optimized themselves to the old behavior.

For example, one client of mine wants one-sentence teasers, so we set the teaser length to 0. The old code found the first sentence/paragraph/whatever break after position 0, as desired. The new code (as I understand it) will result in empty teasers.

This deserves documentation.

#19

Anonymous - January 23, 2007 - 17:30
Status:fixed» closed

#20

Bevan - October 3, 2007 - 06:25
 
 

Drupal is a registered trademark of Dries Buytaert.