truncate_utf8() is used as a substring function
| Project: | Drupal |
| Version: | 7.x-dev |
| Component: | base system |
| Category: | bug report |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | needs work |
When commenting any type of content, title of comment generates automatically. According to source code the title shold be 29 characters length. When using native characters, the length is shorter. Run a simple test:
write comment to node with the text: "A B C D E F G H I J K L M N O P Q R S T U W X Y Z"
the title should be "A B C D E F G H I J K L M N", it's ok
write comment to node with the text: "А Б В Г Д Е Ё Ж З И Й К Л М Н О П Р С Т У Ф Х Ц Ч Ъ Ы Ю Э Ю Я"
the title should be "А Б В Г Д Е Ё Ж З И Й К Л М",
but actual result is: "А Б В Г Д Е Ё Ж З"
Short titles are inconvinient for non-english speakers, the result is ofthen only one word in a title. I see problem in function truncate_utf8 in the file unicode.inc. I solved the problem by rewriting this function, the code is below. Please, cosider my variant:
function truncate_utf8($string, $len, $wordsafe = FALSE, $dots = FALSE) {
$string = drupal_substr($string, 0, $len - ($dots ? 4 : 0));
if ($wordsafe && ($last_space=mb_strrpos($string, ' ')))
$string = drupal_substr($string, 0, $last_space);
if ($dots)
$string .= ' ...';
return $string;
}
#1
I can reproduce this bug with other utf8 characters (here: German umlauts), though it will be most prominent in languages with a non latin alphabet. Should be fixed.
#2
I took a look at truncate_utf8 and found it to be completely broken (therefore I mark this critical)!
truncate_utf8 didn't even use drupal_substr() but tried to reinvent the wheel, and did so in a wrong way. Obviously this function has been incorrectly patched when introducing unicode support.
I also checked out aslw's replacement function, which unfortunately has three drawbacks:
Therefore I had to rewrite the function.
As a replacement for mb_strrpos() is IMHO unavoidable for this task (correct me if I'm wrong), I adapted an existing one (written by Niels Leenheer & Andy Matsubara).
Now, instead of adding this replacement in line, which wouldn't have made much sense, I added the new API function drupal_substr() to unicode.inc, so the functionality can be used throughout Drupal.
If someone could come up with an alternative that gets along with only one drupal_substr() call, that might be even better. Otherwise I guess this is okay in terms of performance.
I tested the new function both in the comment title and the node teaser use case and found it to work absolutely correct with and without non-latin chars. Still, as truncate_utf8 is being used in many places, some more testing would be awesome.
One thing we might discuss is whether the function name should be changed to something more generic like drupal_truncate(). True that this function is UTF8-safe, but it is just our standard truncate function. This would also be more in line with all the other UTF8-safe functions in unicode.inc.
#3
#4
As pointed out elsewhere, truncate_utf8() is not used how it was designed. It is designed to split off a string at a certain number of **bytes**, and it walks back from that number of bytes, when that position is in the middle of an UTF-8 sequence. It is not designed to split off a string at a certain number of **chars**.
See Steven's note at http://drupal.org/node/26688:
As far as I know, we only run on natively utf-8 supporting databases, so we might not need truncate_utf8() to do this. But in any case, all uses of truncate_utf8() should be examined and noted what it is used for at that place. Lots of places it is probably misused as a substring function, instead of a byte count based truncation function, some places might use it as a byte count based function, so any change in behavior will break them.
We seem to have these uses of it in core:
$ grep -r "truncate_utf8" *includes/locale.inc: array('data' => check_plain(truncate_utf8($value['source'], 150, FALSE, TRUE)) .'<br /><small>'. $value['location'] .'</small>'),
includes/unicode.inc:function truncate_utf8($string, $len, $wordsafe = FALSE, $dots = FALSE) {
includes/unicode.inc: * - We progressively cut-off a chunk with truncate_utf8(). This is to ensure
includes/unicode.inc: $chunk = truncate_utf8($string, $chunk_size);
modules/aggregator/aggregator.module: $title = preg_replace('/^(.*)[^\w;&].*?$/', "\\1", truncate_utf8($item['DESCRIPTION'], 40));
modules/book/book.module: $toc[$data['link']['mlid']] = $indent .' '. truncate_utf8($data['link']['title'], 30, TRUE, TRUE);
modules/comment/comment.admin.inc: $form['subject'][$comment->cid] = array('#value' => l($comment->subject, 'node/'. $comment->nid, array('title' => truncate_utf8($comment->comment, 128), 'fragment' => 'comment-'. $comment->cid)));
modules/comment/comment.module: $comment_values['subject'] = trim(truncate_utf8(decode_entities(strip_tags(check_markup($comment_values['comment'], $comment_values['format']))), 29, TRUE));
modules/dblog/dblog.admin.inc: l(truncate_utf8(_dblog_format_message($dblog), 56, TRUE, TRUE), 'admin/reports/event/'. $dblog->wid, array('html' => TRUE)),
modules/dblog/dblog.admin.inc: $rows[] = array($dblog->count, truncate_utf8(_dblog_format_message($dblog), 56, TRUE, TRUE));
modules/menu/menu.module: $title = $indent .' '. truncate_utf8($data['link']['title'], 30, TRUE, FALSE);
modules/node/node.module: $teaser = truncate_utf8($body, $size);
modules/search/search.module: $text = truncate_utf8($text, 50);
modules/search/search.module: return truncate_utf8($text, 256) .' ...';
modules/statistics/statistics.module: $title = truncate_utf8($title, $width, FALSE, TRUE);
I remember that there should be an issue, which was probably postponed to Drupal 7 which tried to do this same utf8-ification of truncate_utf8(), then suggested a new function to be added, so there might very well be previous analysis. Feel free to come up with creative search terms to find that issue, I was unable to.
#5
Better title :)
#6
truncate_utf8() was introduced by Steven in http://drupal.org/cvs?commit=911 This was on April 15, 2004, more then 3 years ago. Actually, more then a year later, Steven proposed and soon committed a general utf8 subsystem for Drupal in the above mentioned and quoted issue: http://drupal.org/node/26688
When you look at the original truncate_utf8() commit, you will see that it was introduced in place of substr() calls, because substr() was not utf8 safe. Now what about simply turning all substr()-itended uses of truncate_utf8() to drupal_substr() which was introduced one year after truncate_utf8(), but was never found wide use in Drupal?
Seems like truncate_utf8() should have been retired with the requirement of utf-8 databases, which happened between 4.6 and 4.7 and the introduction of the utf8 lib, which happened much earlier.
Looking at my grep above, it looks like that outside unicode.inc, where mime_header_encode() reuses truncate_utf8(), it is only used in core as a (sometimes word safe) drupal_substr(). So if something needs to be developed here, then it is a wordsafe drupal_substr() with a $dots option under a different name and the usage of that where appropriate.
#7
I researched out characters having Separator property and then coded the beginning of what Gabor asks. Ugly. I want PHP6.
#8
Updated the code because the array was storing numbers and we had characters.
#9
Gábor:
Thanks for your very useful comments. I checked out those old discussions with Steven and came up with a solution that doesn't introduce more unicode functions. Also, it uses the more expensive drupal_substr() only twice in the rather seldom case, that there is no space in the truncated string. Otherwise I resorted to a faster option using strrpos() and substr().
Surely, truncate_utf8() has been misused. But is has been misused at least from 2004 on, when the $wordsafe flag was overhastily introduced into truncate_utf8().
Enclosed is a new patch replacing the one in #2. As it only changes truncate_utf8(), it allows for nothing more and nothing less than testing my improved algorithm.
To avoid confusion, for the next step I follow up separately.
#10
Gábor again:
Ok, now let's proceed to the next problem.
I acknowledged that changing truncate_utf8() would break mime_header_encode(), which seems to be the only proper use case.
Therefore I removed both the inadequate $wordsafe and $dots flags from the original truncate_utf8() and renamed the function to drupal_truncate_bytes().
Then, I created the revised function (as in #9) as the new function drupal_truncate_chars().
(Because I didn't want to change all calls of the former truncate_utf8(), I enclosed a wrapper function for testing purposes)
Aside from the complete patch, I enclosed the added functions in a textfile, because the patch is hardly readable.
#11
chx:
This seems to be a more complete approach to the problem, which is awesome!
I just think this would rather be a new feature for D7, as we would need to elaborate quite a lot on it.
While I'm having a hard time reading the code, I just want to drop some comments:
So if we can get this properly done for D6 and noone objects to adding this feature, it would certainly be great! Otherwise I'd postpone this to D7 and invite you to review my (smaller) proposal in #10.
#12
I love the approach taken in #10. This is a nasty bug in Drupal for three years, as truncate_utf8() was used in place of substr(), and we are moving to the right direction. I also like the renaming of these functions, to get in line with Drupal standards, and let all contrib maintainers know that they were using the wrong function for their purpose. Drupal 6 is an "internationalization release", so it would be a shame to release without this old bug fixed.
That said, the code in #10 suffers from some small coding style errors, namely around brackets like
} else {. Otherwise obviously needs more testing.#13
Actually, this one is a better title. truncate_utf8() is utf8 safe but is not a substring function for "exact" lengths.
#14
I got the list from http://www.fileformat.info/info/unicode/category/Zs/list.htm (and then line and paragraph separators). this quote tells me that 0x00A0, 0x202F and 0x180E are nonbreaking spaces. If we deal with Unicode, then we need to deal with Unicode, so a single 0x0020 won't cut it IMO. I am using the converter http://mikebabcock.ca/code/ here to convert Unicode code points to UTF-8.
#15
I can init this nice little array now easily as it contains only numbers and bools... enrolled the comments into the array.
#16
Why don't you use preg_match/u with \x{} instead of munching bytes with byzantine hex loops?
#17
Presuming this flag is not set, we will get some sort of error from PCRE. My code skips characters it does not recognize -- it's not a full UTF8 validator, no, but it recovers gracefully. This might or might not be good. Also what kind of regexp will match until the last space but if the last space is not given then at most N chars...? Of course, we might want to cross #10 and PCRE... that might work. Question is, what do want to do with invalid UTF8 sequences? How much effort we want to take recognizing them? Should we just fail or skip parts that are invalid per UTF8...?
#18
Just to note, I realized I went a bit too much overboard with marking the truncate_utf8() renaming a good idea so late in the cycle. We discussed this a bit after my comment above with chx, and agreed it is better kept with its old name. Otherwise there are a few open questions here still.
#19
Hmm... you are thinking about contrib using truncate_utf8() for truncating at character length? Think, what you said in #12 was exactly right:
It is okay that contrib modules get broken, as they have used the wrong function. Still this is a small edit for contrib developers and shouldn't be a problem if documented.
Included is a rerolled patch correcting the code style errors noted in #12, adding some spaces for clarity and rewording documentation for the $wordsafe flag. Note that the wrapper function is still included, so this may not be committed.
#20
Small edit or not, we are past beta 4 and heading to RC1, so we are not supposed to change ANY API whatsoever. This is not an exception. We can add a better function here for sure, fix the use in core, and let contrib maintainers know about it, but we are not supposed to break anything.
#21
Okay, accepted. Rerolled the patch in #19.
The existing, but commonly misused function truncating a string to a number of bytes is renamed to drupal_truncate_bytes().
The new function truncating a string to a number of chars is now called truncate_utf8(). This should meet contrib developers' expectations.
In a next step this one should be renamed to drupal_truncate_chars() in D7. Then we can also discuss chx' more complete approach to wordsafe truncating.
Now I could safely remove the wrapper function. From my point of view, this is rtbc, unless you see some other open questions here.
#22
This will be interesting for the contrib authors who sometimes have no idea about the difference in bytes and Unicode characters. Despite I like my patch better I can see it being D7 so this one is good to go (though I would have more liked a proper patch with -p but that's a minor tidbit, it's just easier to review).
#23
chx (off-topic): I'd love to add the -p option to Eclipse's patch creation, but it seems to be impossible... or do you know someone who succeded with this?
#24
Committed #21. Please document the change in the module updates page:
- truncate_utf8() is now UTF-8 characters based
- those who used it as byte based now should use drupal_truncate_bytes()
Left open for Drupal 7 with chx's patch.
#25
http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php#54805
http://www.pcre.org/news.txt
Reading http://www.php.net/ChangeLog-4.php#4.3.5
So yes, preg_match is quite useable for our purposes.
#26
subscribe
#27
Now that the 7.x-dev branch is open, I enclose a patch that renames the new truncate_utf8() to drupal_truncate_chars() for consistancy, as it was agreed upon before (see Gábor's comment in #12).
All core functions that used to reference truncate_utf8() are changed to reflect this renaming.
This needs to be documented as an API change. Then, the next step to accomplish are chx' major improvements to a UTF8-enabled wordsafe truncating. So please set this back to CNW after committing.
#28
RTBC is too bold, as this surely needs to be tested.
This is straightforward stuff though, so one more complete test should be enough to set this to RTBC.
#29
Oh, and of course this is no more critical...
#30
Would this have anything to do with #179521: New function to trim strings?
#31
I've tested it and it didn't apply cleanly, so I fixed that up and updated the documentation a little.
I think that the
drupal_truncate_chars()should be changed so that any text can be appended to the end of a string. It would make it useful elsewhere in the code, especially where a space isn't desired in the ellipsis text. I changed the function so that it appends a...instead of..., so it's missing the starting space. This means it can replace much more code in Drupal. I think the order of the last two arguments should also be switched.I tested this code in the book heirarchy, but the simpletest module fails for a lot of tests right now...
#32
Thanks for the heads up Susurrus! The
$dotsargument does seem to make "my" issue non-existant, but this also means that there's some of the stuff in my patch over there that should probably go into here. You've also changed a bunch of strings in_locale_get_predefined_list()(includes/locale.inc), which shouldn't be part of this patch if they really need to be changed. Oh, and just tried to apply the patch. Lots and lots of failed hunks (actually, all the hunks fail).I'm going to give this patch some love, I think, and marking my own child a duplicate (even if it is older... *sigh*). Some of the discussion from #179521: New function to trim strings should still be considered though, namely these:
$appendhere would be the$dots.(Replace
drupal_triminstances withdrupal_truncate_charsin the above.)Note that my
drupal_trimmade it possible to make a custom$append, so that you could have'...',' ...', or" (this has been truncated as it was overly long, so now I'm appending this message to make it even longer!)", and this should perhaps be considered here as well? With a default ofNULLor'', we should be able to safely append it in all cases. (If this is to be included, I'd also suggest reordering so that$wordsafeis the last argument.)#33
While working on this, I realised that
drupal_truncate_charsalready does take care of the$dotsin the$lenit is given - but just by simply subtracting the length of the$dotsfrom$len! Doing the replacements as has been done in the previous patch, this would lead to a regression (I could probably dig up the issue if requested) in that links truncated to, say, 15 characters would have 14 character links (http://abc.de/) "truncated" into a full 15 characters (http://abc.d...)! This is not fixed in the attached patch, so this still needs work IMO. I'm thinking about how to go about this with the current code (it worked beautifully indrupal_trim).Apart from this, this also cleaned up some of the Doxygen line-wrapping as well as incorporated some stuff from my
drupal_trimpatch. I'll return soon with a way to fix the regression mentioned above. Feel free to test this in the mean while. ;)#34
Ah, somehow I missed the first check in the function which should cancel the perceived regression. Setting to CNR. (I'm still not fully satisfied with the way
$wordsafeand$dotswork together, but that shouldn't hold this up. I'll post a new patch if I can make it more elegant though.)#35
Actually, this might be good. This should now enable the string "foo bar xyz" to be truncated (
$len = 8) to "foo bar" ($wordsafe, no$dots), "foo ..." ($wordsafe,$dots, "foo b..." (no$wordsafe,$dots), or "foo bar " (no$wordsafe, no$dots). Please test and review!#36
Oh, and this point is still up in the air:
(And re: simpletesting, please go give #243335: Move simpletest to core a look (and a test (and a review)).)
#37
Did a visual review of the patch, amd tested it with a php page comparing generated output to Freso's expected output. Everything looks good.
IMO this is ready. I'd be in favour of a custom append as well, but no reason that shouldn't go in a follow up patch, worth looking at this old issue trying to use a proper ellipses instead of three dots, could maybe be done there: http://drupal.org/node/44987
Going to leave this open for one more pair of eyes to look at it before it goes RTBC, but no complaints from me at all.
#38
Here's a quick test. I haven't looked into where to place it yet, but if anyone has any brilliant idea, I'll happily turn this snippet into a proper patch against SimpleTest module's tests. :)
<?php
/**
* Ensure strings are being character truncated properly (drupal_truncate_chars())
*/
function testCharTruncating() {
// Strings to try truncating
$ascii_string = 'foo bar xyz';
$utf8_string = 'äâãå æêë öø';
$lat_alphabet = 'A B C D E F G H I J K L M N O P Q R S T U W X Y Z';
$cyr_alphabet = 'А Б В Г Д Е Ё Ж З И Й К Л М Н О П Р С Т У Ф Х Ц Ч Ъ Ы Ю Э Ю Я';
// Checks against a plain ASCII string
$this->assertEqual(drupal_truncate_chars($ascii_string, 8, TRUE, TRUE), 'foo ...', 'Expected truncating to return "foo ..." ($wordsafe on, $dots on)');
$this->assertEqual(drupal_truncate_chars($ascii_string, 8, TRUE, FALSE), 'foo bar', 'Expected truncating to return "foo bar" ($wordsafe on, $dots off)');
$this->assertEqual(drupal_truncate_chars($ascii_string, 8, FALSE, TRUE), 'foo b...', 'Expected truncating to return "foo b..." ($wordsafe off, $dots on)');
$this->assertEqual(drupal_truncate_chars($ascii_string, 8, FALSE, FALSE), 'foo bar ', 'Expected truncating to return "foo bar " ($wordsafe off, $dots off)');
// Checks against an UTF-8 string
$this->assertEqual(drupal_truncate_chars($utf8_string, 10, TRUE, TRUE), 'äâãå ...', 'Expected truncating to return "äâãå ..." ($wordsafe on, $dots on)');
$this->assertEqual(drupal_truncate_chars($utf8_string, 10, TRUE, FALSE), 'äâãå æêë', 'Expected truncating to return "äâãå æêë" ($wordsafe on, $dots off)');
$this->assertEqual(drupal_truncate_chars($utf8_string, 10, FALSE, TRUE), 'äâãå æê...', 'Expected truncating to return "äâãå æê..." ($wordsafe off, $dots on)');
$this->assertEqual(drupal_truncate_chars($utf8_string, 10, FALSE, FALSE), 'äâãå æêë ö', 'Expected truncating to return "äâãå æêë ö" ($wordsafe off, $dots off)');
// Checks per issue 200185
$this->assertEqual(drupal_truncate_chars($lat_alphabet, 29, FALSE, FALSE), 'A B C D E F G H I J K L M N O', 'Failed truncating Latin alphabet');
$this->assertEqual(drupal_truncate_chars($cyr_alphabet, 29, FALSE, FALSE), 'А Б В Г Д Е Ё Ж З И Й К Л М О', 'Failed truncating Cyrillic alphabet');
}
?>
#39
Patch was broken due to commit of #245115: Fix Drupal's awkward coding standards for the . operator, and needed a single update for the new coding style as well. It still needs someone to review and I'd still like feedback to my comment #36. :)
#40
Patch had gotten some offset. Re-rolled. Also not sure what to do about the testing. Should the above test(s) be used, or should they be made into unit tests? I'm not really sure how to do the latter yet. I'll have to poke chx about it, I think.
#41
Please include the tests in the patch, and add them to system.test or something.
Feel free to remove that other function, if it can be removed safely. Write tests for it if necessary.
Good job Freso.
#42
xmlrpc.inc seems to have xmlrpc.inc.test, so perhaps I should just make a unicode.inc.test?
/me goes to add tests to patch + rip out
_filter_url_trim()#43
xmlrpc.inc.test is not currently called. It is something that needs to be fixed.
#44
I'm still rather rough around the edges, but here's an attempt at a unicode.test file, testing this function. The test itself hasn't actually been tested yet, but I thought I'd put it up here so that it isn't lost and so that people might comment on it before I get back to it again.
#45
#46
Since Dries said he wanted me to give this some more love, I'm bumping this so it's not hiding away on page 4 of my issue list...
#47
Call it coincidence or synchronicity, but page 4 is exactly where this is placed right now. (Just on the tip of going to page 5 actually, but still hanging on at the edge of page 4.) However, as
truncate_utf8()is about to be used (as a substring function) in Pathauto, I thought I'd give this patch another kick in its behind. :)The attached is a re-roll (including converting some recently added
truncate_utf8()calls). I haven't gotten around removing_filter_url_trim(), but I don't see that as a necessity of having this patch go in, and will increase the complexity (albeit slightly) of it instead. I'll happily work on it, oncedrupal_truncate_chars()is born andtruncate_utf8()is no more.I haven't tested or run any tests against this patch, but I'd still love to hear suggestions on how to improve the test file from #44.
#48
The last submitted patch failed testing.
#49
Aaand... a re-roll. If anyone would help me with the tests so this could go in, I'd appreciate it.
#50
I'm not up on the details of this patch yet. However, I notice the following code:
<?php$title = drupal_truncate_chars($title, $width, FALSE, TRUE);
?>
This is poor DX. I do not know what FALSE and TRUE mean as arguments to drupal_truncate_chars. As a reviewer, for all I know this introduces a security hole. Even if I know what the two arguments are for, I'll never remember what order they are in or what TRUE and FALSE actually do. If those arguments are always or almost always provided, I suggest
<?php$title = drupal_truncate_chars($title, $width, TRUNCATE_WORDSAFE, TRUNCATE_ADD_DOTS);
?>
with similar constants for the inverse operations (TRUNCATE_NOT_WORDSAFE, TRUNCATE_NO_DOTS or similar). If the arguments are often not provided (which, given that they have default values, I suspect might be case), I suggest an l()-style optional array:
<?php$title = drupal_truncate_chars($title, $width, array('wordsafe' => FALSE, 'dots' => TRUE));
?>
#51
@ bjaspan: Thanks for the excellent input! I'm going to go with the
l()-style optional array, as that would also allow for further values (e.g. further customisation of the "dots", per #179521: New function to trim strings), without breaking other functions already using this one. I'm going to do some grocery shopping, but I'll get right on it when I get back! :)#52
Here's a patch incorporating
l()-style$options, and even adds a parameter to alter what the "dots" should be (per #179521: New function to trim strings). Thank you for the wonderful suggestion, bjaspan! :D#53
The last submitted patch failed testing.
#54
Oops.
Also: <3 the testing bot.
#55
The last submitted patch failed testing.
#57
Re-roll and putting it up against the Testbot to see if it still complains. It still needs to have some unit tests included though...
#58
The last submitted patch failed testing.