truncate_utf8() is used as a substring function [#200185]

When commenting any type of content, title of comment generates automatically. According to source code the title shold be 29 characters length. When using native characters, the length is shorter. Run a simple test:

write comment to node with the text: "A B C D E F G H I J K L M N O P Q R S T U W X Y Z"
the title should be "A B C D E F G H I J K L M N", it's ok

write comment to node with the text: "А Б В Г Д Е Ё Ж З И Й К Л М Н О П Р С Т У Ф Х Ц Ч Ъ Ы Ю Э Ю Я"
the title should be "А Б В Г Д Е Ё Ж З И Й К Л М",
but actual result is: "А Б В Г Д Е Ё Ж З"

Short titles are inconvinient for non-english speakers, the result is ofthen only one word in a title. I see problem in function truncate_utf8 in the file unicode.inc. I solved the problem by rewriting this function, the code is below. Please, cosider my variant:

function truncate_utf8($string, $len, $wordsafe = FALSE, $dots = FALSE) {

  $string = drupal_substr($string, 0, $len - ($dots ? 4 : 0));

  if ($wordsafe && ($last_space=mb_strrpos($string, ' ')))
    $string = drupal_substr($string, 0, $last_space);

  if ($dots)
    $string .= ' ...';

  return $string;
}

Comment	File	Size	Author
#57	200185_rename_truncate_utf8-56.patch	14.87 KB	Freso
#54	200185_rename_truncate_utf8-54.patch	14.02 KB	Freso
#52	200185_rename_truncate_utf8-52.patch	14.02 KB	Freso
#49	200185_rename_truncate_utf8-49.patch	12.9 KB	Freso
#47	rename_truncate_utf8.patch	12.78 KB	Freso
#44	unicode.test	3.03 KB	Freso
#40	rename_truncate_utf8.patch	11.42 KB	Freso
#39	rename_truncate_utf8.patch	11.42 KB	Freso
#35	rename_truncate_utf8.patch	11.67 KB	Freso
#33	rename_truncate_utf8.patch	11.13 KB	Freso
#31	rename_truncate_utf8.patch	16.82 KB	Susurrus
#27	rename_truncate_utf8.patch	9.19 KB	pancho
#21	truncate_bytes_and_chars.patch	2.66 KB	pancho
#19	truncate_bytes_and_chars.patch	2.92 KB	pancho
#15	x.php_.txt	2.01 KB	chx
#14	x.php_.txt	2.58 KB	chx
#10	truncate_bytes_and_chars.patch	2.83 KB	pancho
#10	truncate_bytes_and_chars.txt	2.07 KB	pancho
#9	truncate_utf8.patch	1.41 KB	pancho
#8	x.php_.txt	2.48 KB	chx
#7	x.php_.txt	2.17 KB	chx
#2	truncate_utf8.patch	1.93 KB	pancho
#1	comment titles.png	5.5 KB	pancho
#1	comment titles (utf).png	5.65 KB	pancho

Comments

Comment #1

pancho

UTC+2 🇪🇺 EU

commented 17 December 2007 at 16:09

Title:

Wrong title length in comments with native characters

» Non-latin chars shorten comment title

Status	File	Size
new	comment titles (utf).png	5.65 KB
new	comment titles.png	5.5 KB

I can reproduce this bug with other utf8 characters (here: German umlauts), though it will be most prominent in languages with a non latin alphabet. Should be fixed.

Comment #2

pancho

UTC+2 🇪🇺 EU

commented 17 December 2007 at 19:11

Priority:	Normal	» Critical
Status:	Active	» Needs review

Status	File	Size
new	truncate_utf8.patch	1.93 KB

I took a look at truncate_utf8 and found it to be completely broken (therefore I mark this critical)!

truncate_utf8 didn't even use drupal_substr() but tried to reinvent the wheel, and did so in a wrong way. Obviously this function has been incorrectly patched when introducing unicode support.

I also checked out aslw's replacement function, which unfortunately has three drawbacks:

It doesn't check if ($string <= $len), therefore expensive string manipulations are performed even if nothing has to be done.
It uses mb_strrpos() which we can't assume is present in the PHP installation.
It doesn't truncate the string at the right word ending, because it first truncates and then looks for the last space.

Therefore I had to rewrite the function.

As a replacement for mb_strrpos() is IMHO unavoidable for this task (correct me if I'm wrong), I adapted an existing one (written by Niels Leenheer & Andy Matsubara).
Now, instead of adding this replacement in line, which wouldn't have made much sense, I added the new API function drupal_substr() to unicode.inc, so the functionality can be used throughout Drupal.

If someone could come up with an alternative that gets along with only one drupal_substr() call, that might be even better. Otherwise I guess this is okay in terms of performance.

I tested the new function both in the comment title and the node teaser use case and found it to work absolutely correct with and without non-latin chars. Still, as truncate_utf8 is being used in many places, some more testing would be awesome.

One thing we might discuss is whether the function name should be changed to something more generic like drupal_truncate(). True that this function is UTF8-safe, but it is just our standard truncate function. This would also be more in line with all the other UTF8-safe functions in unicode.inc.

Comment #3

pancho

UTC+2 🇪🇺 EU

commented 17 December 2007 at 19:14

Title:

Non-latin chars shorten comment title

» truncate_utf8() is not unicode-safe

Comment #4

gábor hojtsy

he/him

Hungarian

Hungary

commented 17 December 2007 at 22:14

As pointed out elsewhere, truncate_utf8() is not used how it was designed. It is designed to split off a string at a certain number of **bytes**, and it walks back from that number of bytes, when that position is in the middle of an UTF-8 sequence. It is not designed to split off a string at a certain number of **chars**.

See Steven's note at http://drupal.org/node/26688:

One issue to deal with is database encodings. If the database encoding is UTF-8, then the (VAR)CHAR column lengths are expressed in characters. In that case, strings should be chopped off based on character counts (i.e. using drupal_substr). However, if the database encoding is an old 8-bit one like Latin-1 (ISO-8859-1), then the column lengths are expressed in bytes. In that case, strings should be chopped off based on byte counts (i.e. using truncate_utf8).

As far as I know, we only run on natively utf-8 supporting databases, so we might not need truncate_utf8() to do this. But in any case, all uses of truncate_utf8() should be examined and noted what it is used for at that place. Lots of places it is probably misused as a substring function, instead of a byte count based truncation function, some places might use it as a byte count based function, so any change in behavior will break them.

We seem to have these uses of it in core:

$ grep -r "truncate_utf8" *
includes/locale.inc:        array('data' => check_plain(truncate_utf8($value['source'], 150, FALSE, TRUE)) .'<br /><small>'. $value['location'] .'</small>'),
includes/unicode.inc:function truncate_utf8($string, $len, $wordsafe = FALSE, $dots = FALSE) {
includes/unicode.inc: * - We progressively cut-off a chunk with truncate_utf8(). This is to ensure
includes/unicode.inc:      $chunk = truncate_utf8($string, $chunk_size);
modules/aggregator/aggregator.module:      $title = preg_replace('/^(.*)[^\w;&].*?$/', "\\1", truncate_utf8($item['DESCRIPTION'], 40));
modules/book/book.module:      $toc[$data['link']['mlid']] = $indent .' '. truncate_utf8($data['link']['title'], 30, TRUE, TRUE);
modules/comment/comment.admin.inc:    $form['subject'][$comment->cid] = array('#value' => l($comment->subject, 'node/'. $comment->nid, array('title' => truncate_utf8($comment->comment, 128), 'fragment' => 'comment-'. $comment->cid)));
modules/comment/comment.module:    $comment_values['subject'] = trim(truncate_utf8(decode_entities(strip_tags(check_markup($comment_values['comment'], $comment_values['format']))), 29, TRUE));
modules/dblog/dblog.admin.inc:        l(truncate_utf8(_dblog_format_message($dblog), 56, TRUE, TRUE), 'admin/reports/event/'. $dblog->wid, array('html' => TRUE)),
modules/dblog/dblog.admin.inc:    $rows[] = array($dblog->count, truncate_utf8(_dblog_format_message($dblog), 56, TRUE, TRUE));
modules/menu/menu.module:      $title = $indent .' '. truncate_utf8($data['link']['title'], 30, TRUE, FALSE);
modules/node/node.module:  $teaser = truncate_utf8($body, $size);
modules/search/search.module:  $text = truncate_utf8($text, 50);
modules/search/search.module:    return truncate_utf8($text, 256) .' ...';
modules/statistics/statistics.module:  $title = truncate_utf8($title, $width, FALSE, TRUE);

I remember that there should be an issue, which was probably postponed to Drupal 7 which tried to do this same utf8-ification of truncate_utf8(), then suggested a new function to be added, so there might very well be previous analysis. Feel free to come up with creative search terms to find that issue, I was unable to.

Comment #5

gábor hojtsy

he/him

Hungarian

Hungary

commented 17 December 2007 at 22:17

Title:

truncate_utf8() is not unicode-safe

» truncate_utf8() is used as if it was unicode-safe

Better title :)

Comment #6

gábor hojtsy

he/him

Hungarian

Hungary

commented 17 December 2007 at 22:39

truncate_utf8() was introduced by Steven in http://drupal.org/cvs?commit=911 This was on April 15, 2004, more then 3 years ago. Actually, more then a year later, Steven proposed and soon committed a general utf8 subsystem for Drupal in the above mentioned and quoted issue: http://drupal.org/node/26688

When you look at the original truncate_utf8() commit, you will see that it was introduced in place of substr() calls, because substr() was not utf8 safe. Now what about simply turning all substr()-itended uses of truncate_utf8() to drupal_substr() which was introduced one year after truncate_utf8(), but was never found wide use in Drupal?

Seems like truncate_utf8() should have been retired with the requirement of utf-8 databases, which happened between 4.6 and 4.7 and the introduction of the utf8 lib, which happened much earlier.

Looking at my grep above, it looks like that outside unicode.inc, where mime_header_encode() reuses truncate_utf8(), it is only used in core as a (sometimes word safe) drupal_substr(). So if something needs to be developed here, then it is a wordsafe drupal_substr() with a $dots option under a different name and the usage of that where appropriate.

Comment #7

chx commented 18 December 2007 at 00:28

Status	File	Size
new	x.php_.txt	2.17 KB

I researched out characters having Separator property and then coded the beginning of what Gabor asks. Ugly. I want PHP6.

Comment #8

chx commented 18 December 2007 at 00:34

Status	File	Size
new	x.php_.txt	2.48 KB

Updated the code because the array was storing numbers and we had characters.

Comment #9

pancho

UTC+2 🇪🇺 EU

commented 18 December 2007 at 05:08

Status	File	Size
new	truncate_utf8.patch	1.41 KB

Gábor:

Thanks for your very useful comments. I checked out those old discussions with Steven and came up with a solution that doesn't introduce more unicode functions. Also, it uses the more expensive drupal_substr() only twice in the rather seldom case, that there is no space in the truncated string. Otherwise I resorted to a faster option using strrpos() and substr().

Surely, truncate_utf8() has been misused. But is has been misused at least from 2004 on, when the $wordsafe flag was overhastily introduced into truncate_utf8().

Enclosed is a new patch replacing the one in #2. As it only changes truncate_utf8(), it allows for nothing more and nothing less than testing my improved algorithm.

To avoid confusion, for the next step I follow up separately.

Comment #10

pancho

UTC+2 🇪🇺 EU

commented 18 December 2007 at 05:38

Status	File	Size
new	truncate_bytes_and_chars.txt	2.07 KB
new	truncate_bytes_and_chars.patch	2.83 KB

Gábor again:

Ok, now let's proceed to the next problem.
I acknowledged that changing truncate_utf8() would break mime_header_encode(), which seems to be the only proper use case.
Therefore I removed both the inadequate $wordsafe and $dots flags from the original truncate_utf8() and renamed the function to drupal_truncate_bytes().
Then, I created the revised function (as in #9) as the new function drupal_truncate_chars().
(Because I didn't want to change all calls of the former truncate_utf8(), I enclosed a wrapper function for testing purposes)

Aside from the complete patch, I enclosed the added functions in a textfile, because the patch is hardly readable.

Comment #11

pancho

UTC+2 🇪🇺 EU

commented 18 December 2007 at 06:10

chx:

This seems to be a more complete approach to the problem, which is awesome!
I just think this would rather be a new feature for D7, as we would need to elaborate quite a lot on it.

While I'm having a hard time reading the code, I just want to drop some comments:

I guess, the way you're doing this will turn out to be somewhat faster than what I'm trying to do.
Most of the characters in your list seem to be valid separators, though the list is not complete (see this valuable table). About some others I'm not so sure:
- NO BREAK SPACE (U+00A0) and NARROW NO-BREAK SPACE (U+202F) are both explicit non-breaking space chars. Do we really want to split here? Given the string "15 km" we would be splitting "15" from "km". I'd rather keep that together!
- About the MONGOLIAN VOWEL SEPARATOR (U+180E) I know nothing. But in case this is a valid non-space separator, is it the only one?
I wouldn't call the function truncate_to_words(), because the wordsafe feature depends on the wordsafe flag.

So if we can get this properly done for D6 and noone objects to adding this feature, it would certainly be great! Otherwise I'd postpone this to D7 and invite you to review my (smaller) proposal in #10.

Comment #12

gábor hojtsy

he/him

Hungarian

Hungary

commented 18 December 2007 at 09:50

I love the approach taken in #10. This is a nasty bug in Drupal for three years, as truncate_utf8() was used in place of substr(), and we are moving to the right direction. I also like the renaming of these functions, to get in line with Drupal standards, and let all contrib maintainers know that they were using the wrong function for their purpose. Drupal 6 is an "internationalization release", so it would be a shame to release without this old bug fixed.

That said, the code in #10 suffers from some small coding style errors, namely around brackets like } else {. Otherwise obviously needs more testing.

Comment #13

gábor hojtsy

he/him

Hungarian

Hungary

commented 18 December 2007 at 09:51

Title:

truncate_utf8() is used as if it was unicode-safe

» truncate_utf8() is used as a substring function

Actually, this one is a better title. truncate_utf8() is utf8 safe but is not a substring function for "exact" lengths.

Comment #14

chx commented 18 December 2007 at 11:37

Status	File	Size
new	x.php_.txt	2.58 KB

I got the list from http://www.fileformat.info/info/unicode/category/Zs/list.htm (and then line and paragraph separators). this quote tells me that 0x00A0, 0x202F and 0x180E are nonbreaking spaces. If we deal with Unicode, then we need to deal with Unicode, so a single 0x0020 won't cut it IMO. I am using the converter http://mikebabcock.ca/code/ here to convert Unicode code points to UTF-8.

Comment #15

chx commented 18 December 2007 at 11:49

Status	File	Size
new	x.php_.txt	2.01 KB

I can init this nice little array now easily as it contains only numbers and bools... enrolled the comments into the array.

Comment #16

Steven commented 18 December 2007 at 12:08

Why don't you use preg_match/u with \x{} instead of munching bytes with byzantine hex loops?

Comment #17

chx commented 18 December 2007 at 12:55

When you set the PCRE_UTF8 flag, the strings passed as patterns and subjects are checked for validity on entry to the relevant functions. If an invalid UTF-8 string is passed, an error return is given. In some situations, you may already know that your strings are valid, and therefore want to skip these checks in order to improve performance. If you set the PCRE_NO_UTF8_CHECK flag at compile time or at run time, PCRE assumes that the pattern or subject it is given (respectively) contains only valid UTF-8 codes. In this case, it does not diagnose an invalid UTF-8 string. If you pass an invalid UTF-8 string to PCRE when PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program may crash.

Presuming this flag is not set, we will get some sort of error from PCRE. My code skips characters it does not recognize -- it's not a full UTF8 validator, no, but it recovers gracefully. This might or might not be good. Also what kind of regexp will match until the last space but if the last space is not given then at most N chars...? Of course, we might want to cross #10 and PCRE... that might work. Question is, what do want to do with invalid UTF8 sequences? How much effort we want to take recognizing them? Should we just fail or skip parts that are invalid per UTF8...?

Comment #18

gábor hojtsy

he/him

Hungarian

Hungary

commented 19 December 2007 at 15:44

Just to note, I realized I went a bit too much overboard with marking the truncate_utf8() renaming a good idea so late in the cycle. We discussed this a bit after my comment above with chx, and agreed it is better kept with its old name. Otherwise there are a few open questions here still.

Comment #19

pancho

UTC+2 🇪🇺 EU

commented 19 December 2007 at 17:33

Status	File	Size
new	truncate_bytes_and_chars.patch	2.92 KB

Hmm... you are thinking about contrib using truncate_utf8() for truncating at character length? Think, what you said in #12 was exactly right:

I also like the renaming of these functions, to get in line with Drupal standards, and let all contrib maintainers know that they were using the wrong function for their purpose

It is okay that contrib modules get broken, as they have used the wrong function. Still this is a small edit for contrib developers and shouldn't be a problem if documented.

Included is a rerolled patch correcting the code style errors noted in #12, adding some spaces for clarity and rewording documentation for the $wordsafe flag. Note that the wrapper function is still included, so this may not be committed.

Comment #20

gábor hojtsy

he/him

Hungarian

Hungary

commented 19 December 2007 at 18:56

Small edit or not, we are past beta 4 and heading to RC1, so we are not supposed to change ANY API whatsoever. This is not an exception. We can add a better function here for sure, fix the use in core, and let contrib maintainers know about it, but we are not supposed to break anything.

Comment #21

pancho

UTC+2 🇪🇺 EU

commented 19 December 2007 at 19:41

Status	File	Size
new	truncate_bytes_and_chars.patch	2.66 KB

Okay, accepted. Rerolled the patch in #19.

The existing, but commonly misused function truncating a string to a number of bytes is renamed to drupal_truncate_bytes().

The new function truncating a string to a number of chars is now called truncate_utf8(). This should meet contrib developers' expectations.
In a next step this one should be renamed to drupal_truncate_chars() in D7. Then we can also discuss chx' more complete approach to wordsafe truncating.

Now I could safely remove the wrapper function. From my point of view, this is rtbc, unless you see some other open questions here.

Comment #22

chx commented 20 December 2007 at 01:33

Status:

Needs review

» Reviewed & tested by the community

This will be interesting for the contrib authors who sometimes have no idea about the difference in bytes and Unicode characters. Despite I like my patch better I can see it being D7 so this one is good to go (though I would have more liked a proper patch with -p but that's a minor tidbit, it's just easier to review).

Comment #23

pancho

UTC+2 🇪🇺 EU

commented 20 December 2007 at 02:42

chx (off-topic): I'd love to add the -p option to Eclipse's patch creation, but it seems to be impossible... or do you know someone who succeded with this?

Comment #24

gábor hojtsy

he/him

Hungarian

Hungary

commented 20 December 2007 at 08:47

Version:	6.x-dev	» 7.x-dev
Status:	Reviewed & tested by the community	» Needs review

Committed #21. Please document the change in the module updates page:

- truncate_utf8() is now UTF-8 characters based
- those who used it as byte based now should use drupal_truncate_bytes()

Left open for Drupal 7 with chx's patch.

Comment #25

chx commented 21 December 2007 at 18:11

http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php#54805

When the subject string contains invalid UTF-8 sequences / codepoints, it basically result in a "quiet death" for the preg_* functions, where nothing is matched but without indication that the string is invalid UTF-8

http://www.pcre.org/news.txt

Release 4.4 21-Aug-03
---------------------

This is mainly a bug-fix and tidying release. The only new feature is that PCRE checks UTF-8 strings for validity by default.

Reading http://www.php.net/ChangeLog-4.php#4.3.5

Version 4.3.5
26-Mar-2004
Upgraded PCRE library to version 4.5. (Andrei)

So yes, preg_match is quite useable for our purposes.

Comment #26

bjaspan commented 24 December 2007 at 15:09

Comment #27

pancho

UTC+2 🇪🇺 EU

commented 21 February 2008 at 01:56

Status:

Needs review

» Reviewed & tested by the community

Status	File	Size
new	rename_truncate_utf8.patch	9.19 KB

Now that the 7.x-dev branch is open, I enclose a patch that renames the new truncate_utf8() to drupal_truncate_chars() for consistancy, as it was agreed upon before (see Gábor's comment in #12).
All core functions that used to reference truncate_utf8() are changed to reflect this renaming.

This needs to be documented as an API change. Then, the next step to accomplish are chx' major improvements to a UTF8-enabled wordsafe truncating. So please set this back to CNW after committing.

Comment #28

pancho

UTC+2 🇪🇺 EU

commented 21 February 2008 at 02:00

Status:

Reviewed & tested by the community

» Needs review

RTBC is too bold, as this surely needs to be tested.
This is straightforward stuff though, so one more complete test should be enough to set this to RTBC.

Comment #29

pancho

UTC+2 🇪🇺 EU

commented 22 February 2008 at 04:50

Priority:

Critical

» Normal

Oh, and of course this is no more critical...

Comment #30

Susurrus commented 22 February 2008 at 17:56

Would this have anything to do with #179521: New function to trim strings?

Comment #31

Susurrus commented 29 February 2008 at 19:20

Status	File	Size
new	rename_truncate_utf8.patch	16.82 KB

I've tested it and it didn't apply cleanly, so I fixed that up and updated the documentation a little.

I think that the drupal_truncate_chars() should be changed so that any text can be appended to the end of a string. It would make it useful elsewhere in the code, especially where a space isn't desired in the ellipsis text. I changed the function so that it appends a ... instead of ..., so it's missing the starting space. This means it can replace much more code in Drupal. I think the order of the last two arguments should also be switched.

I tested this code in the book heirarchy, but the simpletest module fails for a lot of tests right now...

Comment #32

Freso commented 12 April 2008 at 20:26

Status:

Needs review

» Needs work

Thanks for the heads up Susurrus! The $dots argument does seem to make "my" issue non-existant, but this also means that there's some of the stuff in my patch over there that should probably go into here. You've also changed a bunch of strings in _locale_get_predefined_list() (includes/locale.inc), which shouldn't be part of this patch if they really need to be changed. Oh, and just tried to apply the patch. Lots and lots of failed hunks (actually, all the hunks fail).

I'm going to give this patch some love, I think, and marking my own child a duplicate (even if it is older... *sigh*). Some of the discussion from #179521: New function to trim strings should still be considered though, namely these:

Should _filter_url_trim() be removed entirely and functions calling that be changed to use drupal_trim directly?
How long should the username be allowed to be in theme_username? It originally made username longer than 20 characters be 15+3 (18) characters, which I've changed to 17+3.
But perhaps the implementation should change and the $length argument should be the full length - that is, including the $append. ($append here would be the $dots.

(Replace drupal_trim instances with drupal_truncate_chars in the above.)

Note that my drupal_trim made it possible to make a custom $append, so that you could have '...', ' ...', or " (this has been truncated as it was overly long, so now I'm appending this message to make it even longer!)", and this should perhaps be considered here as well? With a default of NULL or '', we should be able to safely append it in all cases. (If this is to be included, I'd also suggest reordering so that $wordsafe is the last argument.)

Comment #33

Freso commented 12 April 2008 at 20:57

Status	File	Size
new	rename_truncate_utf8.patch	11.13 KB

While working on this, I realised that drupal_truncate_chars already does take care of the $dots in the $len it is given - but just by simply subtracting the length of the $dots from $len! Doing the replacements as has been done in the previous patch, this would lead to a regression (I could probably dig up the issue if requested) in that links truncated to, say, 15 characters would have 14 character links (http://abc.de/) "truncated" into a full 15 characters (http://abc.d...)! This is not fixed in the attached patch, so this still needs work IMO. I'm thinking about how to go about this with the current code (it worked beautifully in drupal_trim).

Apart from this, this also cleaned up some of the Doxygen line-wrapping as well as incorporated some stuff from my drupal_trim patch. I'll return soon with a way to fix the regression mentioned above. Feel free to test this in the mean while. ;)

Comment #34

Freso commented 12 April 2008 at 21:17

Status:

Needs work

» Needs review

Ah, somehow I missed the first check in the function which should cancel the perceived regression. Setting to CNR. (I'm still not fully satisfied with the way $wordsafe and $dots work together, but that shouldn't hold this up. I'll post a new patch if I can make it more elegant though.)

Comment #35

Freso commented 12 April 2008 at 21:31

Status	File	Size
new	rename_truncate_utf8.patch	11.67 KB

Actually, this might be good. This should now enable the string "foo bar xyz" to be truncated ($len = 8) to "foo bar" ($wordsafe, no $dots), "foo ..." ($wordsafe, $dots, "foo b..." (no $wordsafe, $dots), or "foo bar " (no $wordsafe, no $dots). Please test and review!

Comment #36

Freso commented 12 April 2008 at 21:50

Oh, and this point is still up in the air: Should _filter_url_trim() be removed entirely and functions calling that be changed to use drupal_trim directly? (I think the others have more or less been nullified by my discoveries and patches.)

(And re: simpletesting, please go give #243335: Move simpletest to core a look (and a test (and a review)).)

Comment #37

catch

he/him

English

commented 14 April 2008 at 10:38

Did a visual review of the patch, amd tested it with a php page comparing generated output to Freso's expected output. Everything looks good.

IMO this is ready. I'd be in favour of a custom append as well, but no reason that shouldn't go in a follow up patch, worth looking at this old issue trying to use a proper ellipses instead of three dots, could maybe be done there: http://drupal.org/node/44987

Going to leave this open for one more pair of eyes to look at it before it goes RTBC, but no complaints from me at all.

Comment #38

Freso commented 14 April 2008 at 15:57

Here's a quick test. I haven't looked into where to place it yet, but if anyone has any brilliant idea, I'll happily turn this snippet into a proper patch against SimpleTest module's tests. :)

  /**
   * Ensure strings are being character truncated properly (drupal_truncate_chars())
   */
  function testCharTruncating() {
    // Strings to try truncating
    $ascii_string = 'foo bar xyz';
    $utf8_string = 'äâãå æêë öø';
    $lat_alphabet = 'A B C D E F G H I J K L M N O P Q R S T U W X Y Z';
    $cyr_alphabet = 'А Б В Г Д Е Ё Ж З И Й К Л М Н О П Р С Т У Ф Х Ц Ч Ъ Ы Ю Э Ю Я';

    // Checks against a plain ASCII string
    $this->assertEqual(drupal_truncate_chars($ascii_string, 8, TRUE, TRUE), 'foo ...', 'Expected truncating to return "foo ..." ($wordsafe on, $dots on)');
    $this->assertEqual(drupal_truncate_chars($ascii_string, 8, TRUE, FALSE), 'foo bar', 'Expected truncating to return "foo bar" ($wordsafe on, $dots off)');
    $this->assertEqual(drupal_truncate_chars($ascii_string, 8, FALSE, TRUE), 'foo b...', 'Expected truncating to return "foo b..." ($wordsafe off, $dots on)');
    $this->assertEqual(drupal_truncate_chars($ascii_string, 8, FALSE, FALSE), 'foo bar ', 'Expected truncating to return "foo bar " ($wordsafe off, $dots off)');
    
    // Checks against an UTF-8 string
    $this->assertEqual(drupal_truncate_chars($utf8_string, 10, TRUE, TRUE), 'äâãå ...', 'Expected truncating to return "äâãå ..." ($wordsafe on, $dots on)');
    $this->assertEqual(drupal_truncate_chars($utf8_string, 10, TRUE, FALSE), 'äâãå æêë', 'Expected truncating to return "äâãå æêë" ($wordsafe on, $dots off)');
    $this->assertEqual(drupal_truncate_chars($utf8_string, 10, FALSE, TRUE), 'äâãå æê...', 'Expected truncating to return "äâãå æê..." ($wordsafe off, $dots on)');
    $this->assertEqual(drupal_truncate_chars($utf8_string, 10, FALSE, FALSE), 'äâãå æêë ö', 'Expected truncating to return "äâãå æêë ö" ($wordsafe off, $dots off)');

    // Checks per issue 200185
    $this->assertEqual(drupal_truncate_chars($lat_alphabet, 29, FALSE, FALSE), 'A B C D E F G H I J K L M N O', 'Failed truncating Latin alphabet');
    $this->assertEqual(drupal_truncate_chars($cyr_alphabet, 29, FALSE, FALSE), 'А Б В Г Д Е Ё Ж З И Й К Л М О', 'Failed truncating Cyrillic alphabet');

  }

Comment #39

Freso commented 14 April 2008 at 20:39

Status	File	Size
new	rename_truncate_utf8.patch	11.42 KB

Patch was broken due to commit of #245115: Fix Drupal's awkward coding standards for the . operator, and needed a single update for the new coding style as well. It still needs someone to review and I'd still like feedback to my comment #36. :)

Comment #40

Freso commented 23 April 2008 at 09:42

Status	File	Size
new	rename_truncate_utf8.patch	11.42 KB

Patch had gotten some offset. Re-rolled. Also not sure what to do about the testing. Should the above test(s) be used, or should they be made into unit tests? I'm not really sure how to do the latter yet. I'll have to poke chx about it, I think.

Comment #41

dries commented 23 April 2008 at 18:42

Please include the tests in the patch, and add them to system.test or something.

Feel free to remove that other function, if it can be removed safely. Write tests for it if necessary.

Good job Freso.

Comment #42

Freso commented 23 April 2008 at 18:53

xmlrpc.inc seems to have xmlrpc.inc.test, so perhaps I should just make a unicode.inc.test?

/me goes to add tests to patch + rip out _filter_url_trim()

Comment #43

dries commented 23 April 2008 at 19:43

xmlrpc.inc.test is not currently called. It is something that needs to be fixed.

Comment #44

Freso commented 7 May 2008 at 08:36

Status	File	Size
new	unicode.test	3.03 KB

I'm still rather rough around the edges, but here's an attempt at a unicode.test file, testing this function. The test itself hasn't actually been tested yet, but I thought I'd put it up here so that it isn't lost and so that people might comment on it before I get back to it again.

Comment #45

Freso commented 7 May 2008 at 08:37

Status:

Needs review

» Needs work

Comment #46

Freso commented 12 June 2008 at 19:15

Since Dries said he wanted me to give this some more love, I'm bumping this so it's not hiding away on page 4 of my issue list...

Comment #47

Freso commented 28 July 2008 at 14:03

Status:

Needs work

» Needs review

Status	File	Size
new	rename_truncate_utf8.patch	12.78 KB

Call it coincidence or synchronicity, but page 4 is exactly where this is placed right now. (Just on the tip of going to page 5 actually, but still hanging on at the edge of page 4.) However, as truncate_utf8() is about to be used (as a substring function) in Pathauto, I thought I'd give this patch another kick in its behind. :)

The attached is a re-roll (including converting some recently added truncate_utf8() calls). I haven't gotten around removing _filter_url_trim(), but I don't see that as a necessity of having this patch go in, and will increase the complexity (albeit slightly) of it instead. I'll happily work on it, once drupal_truncate_chars() is born and truncate_utf8() is no more.

I haven't tested or run any tests against this patch, but I'd still love to hear suggestions on how to improve the test file from #44.

Comment #48

Anonymous (not verified) commented 11 November 2008 at 11:20

Status:

Needs review

» Needs work

The last submitted patch failed testing.

Comment #49

Freso commented 6 March 2009 at 19:02

Status:

Needs work

» Needs review

Status	File	Size
new	200185_rename_truncate_utf8-49.patch	12.9 KB

Aaand... a re-roll. If anyone would help me with the tests so this could go in, I'd appreciate it.

Comment #50

bjaspan commented 7 March 2009 at 13:08

I'm not up on the details of this patch yet. However, I notice the following code:

  $title = drupal_truncate_chars($title, $width, FALSE, TRUE);

This is poor DX. I do not know what FALSE and TRUE mean as arguments to drupal_truncate_chars. As a reviewer, for all I know this introduces a security hole. Even if I know what the two arguments are for, I'll never remember what order they are in or what TRUE and FALSE actually do. If those arguments are always or almost always provided, I suggest

  $title = drupal_truncate_chars($title, $width, TRUNCATE_WORDSAFE, TRUNCATE_ADD_DOTS);

with similar constants for the inverse operations (TRUNCATE_NOT_WORDSAFE, TRUNCATE_NO_DOTS or similar). If the arguments are often not provided (which, given that they have default values, I suspect might be case), I suggest an l()-style optional array:

  $title = drupal_truncate_chars($title, $width, array('wordsafe' => FALSE, 'dots' => TRUE));

Comment #51

Freso commented 7 March 2009 at 14:26

Status:

Needs review

» Needs work

@ bjaspan: Thanks for the excellent input! I'm going to go with the l()-style optional array, as that would also allow for further values (e.g. further customisation of the "dots", per #179521: New function to trim strings), without breaking other functions already using this one. I'm going to do some grocery shopping, but I'll get right on it when I get back! :)

Comment #52

Freso commented 7 March 2009 at 17:50

Status:

Needs work

» Needs review

Status	File	Size
new	200185_rename_truncate_utf8-52.patch	14.02 KB

Here's a patch incorporating l()-style $options, and even adds a parameter to alter what the "dots" should be (per #179521: New function to trim strings). Thank you for the wonderful suggestion, bjaspan! :D

Comment #53

7 March 2009 at 18:00

Status:

Needs review

» Needs work

The last submitted patch failed testing.

Comment #54

Freso commented 7 March 2009 at 18:13

Status:

Needs work

» Needs review

Status	File	Size
new	200185_rename_truncate_utf8-54.patch	14.02 KB

Oops.

Also: <3 the testing bot.

Comment #55

7 March 2009 at 18:35

Status:

Needs review

» Needs work

The last submitted patch failed testing.

Comment #57

Freso commented 3 May 2009 at 21:46

Status:

Needs work

» Needs review

Status	File	Size
new	200185_rename_truncate_utf8-56.patch	14.87 KB

Re-roll and putting it up against the Testbot to see if it still complains. It still needs to have some unit tests included though...

Comment #58

3 May 2009 at 21:55

Status:

Needs review

» Needs work

The last submitted patch failed testing.

Comment #60

jhodgdon

she/her

English

commented 2 April 2010 at 16:11

This is still an issue, and still needs to be fixed.

Related issues that are caused by misuse of this function:
#362609: node teaser generation breaks when using entities (D6)
#269911: Search result trimming should not fall inside HTML entities/tags (D7)
Probably others as well...

Further note: It should be documented that word breaking will NOT work in CJK languages (Chinese, etc.) since it relies on having a space between words, which is not actually true. See
http://drupal.org/node/269911#comment-2795478

My opinion is that the word-breaking feature should be removed, because of this.

Comment #61

jhodgdon

she/her

English

commented 11 April 2010 at 14:11

I've just filed a different issue to address the problem that the truncate_utf8 function is not good with locales:
#768040: truncate_utf8() only works for latin languages (and drupal_substr has a bug)

Comment #62

jhodgdon

she/her

English

commented 10 June 2010 at 15:38

Status:

Needs work

» Closed (duplicate)

OK....

So, on #768040: truncate_utf8() only works for latin languages (and drupal_substr has a bug), we are fixing up truncate_utf8() so that it can be used as a function to truncate strings of any character set to a particular number of characters (not bytes), possibly only on word boundaries, and then append t('...') afterwards if desired. So it will be set up as a great UI-facing substring function.

It looks like we now also have the drupal_truncate_bytes() function, which truncates UTF8 strings safely to a number of bytes. That must have been added in a different patch/issue.

So I think this issue has been resolved, and I'm going to just mark it as a duplicate at this point (duplicate of 768040 and the unknown issue that added the drupal_truncate_bytes() function).