The truncation code in node_teaser doesn't make any sense to me:

 foreach ($breakpoints as $point => $charnum) {
   if ($length = strpos($body, $point, $size)) {
     return substr($body, 0, $length + $charnum);
   }
 }

I don't understand: why is the offset given to strpos the maximum teaser size? This would cause teasers to only be nicely split if they are enough longer than the max. teaser size to have one of the special characters. But this guarantees the teaser will either be abruptly truncated, or over the max. teaser size! In fact, this is what I'm seeing on my installation.

Attached is a patch that makes it work the way I think it was intended. At least, it makes more sense to me.

CommentFileSizeAuthor
diff_271.16 KBneale
Support from Acquia helps fund testing for Drupal Acquia logo

Comments

killes@www.drop.org’s picture

Version: 4.6.2 » x.y.z
Status: Active » Needs review

changing version and status.

Gábor Hojtsy’s picture

Status: Needs review » Needs work

Neale, your code does not reflect the intentions in Drupal. The problem with this patch is that it forces a truncate call on the body, regardless of content, and then reduces the teaser size even more, to the first paragraph, first sentence, etc. Problems with your approach are:

  • If the first paragraph is too short, it will be the teaser, regardless of the set size.
  • If there are no paragraph or break tags in the truncated teaser part, the first sentence will be used (regardless of how short is it).
  • On the other hand, if the first paragraph is too long, and the closing tag is out of the initial truncated part, you will end up with broken HTML.

Now all these are not the intention of the teaser generation code. Having the exact set length or shorter is sacrificed for having valid HTML, and complete content. While the current Drupal code might produce too long teasers, your code can easily end up producing too short ones. A teaser for this text with your patch would be: Neale, your code does not reflect the intentions in Drupal.

neale’s picture

Status: Needs work » Needs review

(Oddly, previewing this gives no trimmed version and no full version. I hope the formatting here looks okay.)

Goba, I'm not sure I understand your objections. You say:

  1. If the first paragraph is too short, it will be the teaser, regardless of the set size.

    I presume you mean that an article with two paragraphs (one short, one long) would be split at the end of the first paragraph.

    Currently, if I write an article with three paragraphs (short, long, long), the teaser will be the first two (short, long) even if the second paragraph runs over the max. teaser size. In fact, unless the entire article is under the max. size, it is guaranteed that the teaser will be over the max., since the code only looks for a breakpoint after $size.

    I'm not sure this is in line with what drupal wants, since the help for the max. teaser size states (emphasis mine) it is "The maximum number of characters used in the trimmed version of a post."

    My patch addresses this by looking for a split character in the characters before the max. length.

  2. If there are no paragraph or break tags in the truncated teaser part, the first sentence will be used (regardless of how short is it).

    This is untrue. My patch applies after the code that looks for the break tag, and does not alter the existing logic. Unfortunately, the patch doesn't show the (unmodified) part of the code you mention, so it might be worthwhile to look at the patch alongside the original code to see exactly what it does.

  3. On the other hand, if the first paragraph is too long, and the closing tag is out of the initial truncated part, you will end up with broken HTML.

    This is a different problem already present in the code. My patch doesn't try to address this. (Try it: create a new book page right here, and begin it with an em tag, then paste the same sentence over and over, and close the em tag. The teaser will have an unclosed tag.)

I humbly suggest the reviewer look over the patch again.

neale’s picture

It just occurred to me that when you said "break tag" you may have meant the "br" tag and not the "!--break--" thing. In that case, your second point is a little closer to true, but still not quite correct. My patch will use as many sentences as it can before it hits the max. size (note the change from strpos to strrpos).

Perhaps the help for the "size" should be changed to read minimum size of a teaser, instead of maximum size for a teaser, if the current behavior is what drupal is supposd to do.

Currently, the teaser will always be larger than the "maximum size" unless the !--break-- thing is used. You can test this here by creating a new book page with a very short paragraph followed by an incredibly long one, maybe 16000 characters in multiple paragraphs. The incredibly long one is always part of the "trimmed version".

If you'd like to see my patch in action, head on over to , where it is currently running.

Gábor Hojtsy’s picture

Excuse me, I have not seen that you actually used strrpos() in your patch. You are right that the previous version might have broken tags, but the intention was to search for typical block tags, which are not supposed to be enclosed in inline tags or anything, since it is not valid markup. Your code might break any tag, block tags and inline tags alike.

aaron’s picture

I agree -- I don't like the current behavior of node_teaser. I think it should find the last breakpoint before $size, rather than the next after. How about this? It does just that:

  $breakpoints = array('<p/>' => 4, '</ rb>' => 0, '<rb>' => 0, "\n" => 0, ' .' => 1, ' !' => 1, ' ?' => 1, '??ã' => 1, ' ?Ø' => 1);
  $reversed = strrev($body);
  foreach ($breakpoints as $point => $charnum) {
    if ($length = strpos($reversed, $point, $size)) {
      return strrev(substr($reversed, 0, $length + $charnum));
    }
  }

As I see it, we want to look backwards, assuming that we want the teaser to always be no greater than $size. It will look for the breakpoint closest before our $size. If not found, it would then go on to truncate. Of course, we have to reverse the character sequences in $breakpoints.

The case where it would fail is if someone wrote one short sentence, followed by a very long sentence. If we wanted, we could simply check to make sure our $length is within the second half of our $size range, or something. Seems a bit of overkill to me.

Anyway, I'm using this new version for now, and I think it might be worth making into core. Of course, most people may already be used to the current behavior, and might wonder why their posts are suddenly seeming so much shorter. Still, I think this is the functionality most people would expect, by setting a size limit on the teaser. Currently, it's like telling a cop, "I was only going 14 mph over the speed limit..."

- Aaron
Culture Fix Web Identity & Design
Digital Folk Art (my blog)

aaron’s picture

Whoops! Didn't look at it closely enough before using. Made myself look stupid... Don't use that code! Doesn't work... I'll rethink it and come up with something else, hopefully.

aaron’s picture

This works:

  $breakpoints = array('<p/>' => 0, '</ rb>' => 6, '<rb>' => 4, "\n" => 1, ' .' => 1, ' !' => 1, ' ?' => 1, '??ã' => 2, ' ?Ø' => 2);
  $reversed = strrev($body);
  $checkpoint = strlen($body) - $size;
  foreach ($breakpoints as $point => $charnum) {
    if ($length = strpos($reversed, $point, $checkpoint)) {
      return strrev(substr($reversed, $length + $charnum, strlen($body)));
    }
  }

The problem with it, however, is that there's an implicit weighting in $breakpoints. There could be a suitable "!" 20 characters before $size, but this would choose the "." 150 characters away instead. This is also true with the current behavior in reverse.

This could be fixed with the following:

$breakpoints = array('<p/>' => 0, '</ rb>' => 6, '<rb>' => 4, "\n" => 1, ' .' => 1, ' !' => 1, ' ?' => 1, '??ã' => 2, ' ?Ø' => 2);
  $reversed = strrev($body);
  $checkpoint = strlen($body) - $size;
  $weighted_length = 0;
  $weighted_charnum = 0;
  foreach ($breakpoints as $point => $charnum) {
    if ($length = strpos($reversed, $point, $checkpoint)) {
	    if ($length < $weighted_length || !$weighted_length)
	    {
		    $weighted_length = $length;
		    $weighted_charnum = $charnum;
	    }
    }
  }
  
  if ($weighted_length)
  {
	  return strrev(substr($reversed, $weighted_length + $weighted_charnum, strlen($body)));
  }
  
  // If all else fails, we simply truncate the string.
  return truncate_utf8($body, $size);

This is my final answer. (Knock on wood...)

- Aaron

Culture Fix Web Identity & Design
Digital Folk Art (my blog)

neale’s picture

Aaron, I don't get how reversing the string and doing a strpos is any different than just doing a strrpos. Aside from that, this looks pretty similar to my patch. Maybe if enough people weigh in saying the current behavior confuses them, someone will think about either changing the code or the description of $maxlen. :)

JurriaanRoelofs’s picture

Thanks a lot for the Hack! The huge teasers were driving me crazy, superteaser didn't work on my 4.6 sites ^_-
(I'm using Aaron's patch now)

JurriaanRoelofs’s picture

Unfortunately Aaron's code doesn't close paragraph tags, if they are truncated. Does anyone know how to fix this?

drumm’s picture

Status: Needs review » Needs work

This looks like an okay solution- looking behind instead of ahead. The weighting might need some consideration, the algorithm that does what most users expect will of course be best.

A proper diff file will be needed and this should be well commented. I think the search strings (</p>, <br />, etc) should probably be left unreversed in code so we can easily read them.

drumm’s picture

Title: Teaser always truncated at 600 characters » Teaser can be longer than maximum allowed.
JohnAlbin’s picture

Version: x.y.z » 5.x-dev
Status: Needs work » Closed (duplicate)

Since this was marked "Version: x.y.z" this issue sort of dropped off the radar. It now has a duplicate issue already marked "5.x-dev" with an active discussion. So let's continue the conversation there: http://drupal.org/node/73910