Shortening forename of authors containing UTF-8 special characters does not work
Stefan Freudenberg - April 14, 2009 - 10:53
| Project: | Bibliography Module |
| Version: | 6.x-1.x-dev |
| Component: | Code |
| Category: | task |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | closed |
| Issue tags: | utf-8 |
Description
The function theme_biblio_format_authors in biblio.theme.inc won't shorten UTF-8 encoded first names of authors properly if they contain non-ASCII characters like umlauts, accented letters, etc.
I have a project where all data in our database is UTF-8 encoded, so I am going to change all those Latin-1 codes to their UTF-8 equivalents. I will probably have to add more characters.
Do you have any thoughts on how I can share this with upstream?

#1
I found out that this is on the todo list, so I volunteer. Before starting I want to make sure not to duplicate effort. This is the first line of the function theme_biblio_page_number():
global $alnum, $alpha, $cntrl, $dash, $digit, $graph, $lower, $print, $punct, $space, $upper, $word, $patternModifiers; // defined in 'transtab_unicode_charset.inc.php' and 'transtab_latin1_charset.inc.php'
I cannot find those two files in biblio, and the global variables are not defined elsewhere. Is there any chance to get those two files?
#2
Hi Stefan,
I just checked in what I think will be a fix for this issue. As you may have guessed, I "borrowed" and adapted much of the style code from another package. In the process I inadvertently put the latin1 regex patterns in rather than the Unicode ones.
Unfortunately, some of these still aren't working, so I changed some of the code in theme_biblio_format_authors() to use drupal_substr() and str_replace() instead. If you can figure out why those regular expressions are still not working, that would be great, but I think the current workaround will suffice.
Ron.
#3
Hi Ron!
Using drupal_substr for shortening the forenames returns only the first initial. I don't know why the regular expressions did not work for you, because I did almost the same and it worked for me. I'll write some unit tests for the function. Would you give me the names that caused the function to fail?
Stefan
#4
Actually, it failed on ALL forenames (special characters or not). Is that not the case on your end?
#5
No. I replaced the character classes with the Unicode properties (you did it even more accurately than I) and I had no more problems with shortening forenames. Our database already has several thousand authors and I haven't encountered any errors yet. I am going to try your version from CVS.
#6
I have tested your version using the regular expressions instead of drupal_substr.
if (!empty($author['firstname'])) {
  if ($options['shortenGivenNames']) { // if we're supposed to abbreviate given names
    // within initials, reduce all full first names (-> defined by a starting uppercase character, followed by one or more lowercase characters)
    // to initials, i.e., only retain their first character
    $author['firstname'] = preg_replace("/([$upper])[$lower]+/$patternModifiers", "\\1", $author['firstname']);
    //$author['firstname'] = drupal_substr($author['firstname'], 0, 1);
  }
}
It works for me for authors with and without non-ascii characters in their forenames.
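For readers who want to try this outside the module, here is a minimal sketch of the same replacement. The values of $upper, $lower and $patternModifiers are assumptions (Unicode property classes plus the u modifier), since their real definitions are not shown in this issue, and shorten_given_names() is a hypothetical helper, not part of Biblio.

```php
<?php
// Assumed stand-ins for the globals used in theme_biblio_format_authors();
// the actual definitions are not shown in this thread.
$upper = '\p{Lu}';        // any Unicode uppercase letter
$lower = '\p{Ll}';        // any Unicode lowercase letter
$patternModifiers = 'u';  // treat pattern and subject as UTF-8

// Hypothetical helper, not part of the Biblio module.
function shorten_given_names($firstname, $upper, $lower, $patternModifiers) {
  // Reduce every capitalised word to its first character.
  return preg_replace("/([$upper])[$lower]+/$patternModifiers", '\\1', $firstname);
}

echo shorten_given_names('Jörg Émile', $upper, $lower, $patternModifiers), "\n";
```

On a PCRE build with Unicode property support this should print "J É". Because the pattern matches each capitalised word, multi-part forenames keep one initial per part.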
#7
Hmmm, I'm left scratching my head, because in theory those regular expressions should work (and do, as you have proven), but in practice on my setup they do not :-( Now I need to find out what it is about my system that is preventing them from working, because I'm sure someone else is going to encounter the same issue.
#8
The Unicode character properties are available since PHP versions 4.4.0 and 5.1.0: http://php.net/manual/en/regexp.reference.php#regexp.reference.unicode
It is also possible that your preg library is compiled without UTF-8 support.
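One quick way to check for this is to try compiling a pattern that needs both the u modifier and a Unicode property. The sketch below relies on preg_match() returning false when the pattern cannot be compiled; pcre_has_unicode_properties() is a hypothetical name, not a Drupal or Biblio function.

```php
<?php
// Hypothetical helper, not part of Biblio or Drupal.
function pcre_has_unicode_properties() {
  // preg_match() returns false (and raises a warning, suppressed here) if
  // the pattern cannot be compiled, e.g. when \p{..} or /u is unsupported.
  return @preg_match('/^\p{Lu}\p{Ll}$/u', 'Éé') === 1;
}

var_dump(pcre_has_unicode_properties());
```

If this prints bool(false), the regular-expression approach cannot work on that installation, regardless of the names being shortened.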
#9
I'm running CentOS 5.3, which bundles PHP 5.1.6. What version of PHP are you using?
#10
You were right. It turns out that RHEL, and therefore CentOS, don't build their PCRE libraries with Unicode properties support enabled! (https://bugzilla.redhat.com/show_bug.cgi?id=457064)
Rebuilding with Unicode support enabled solved the problem on my end too.
Talk about a waste of a day!
Ron.
#11
Automatically closed -- issue fixed for 2 weeks with no activity.