The function theme_biblio_format_authors in biblio.theme.inc won't shorten UTF-8 encoded first names of authors properly if they contain non-ASCII characters like umlauts, accented letters, etc.
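
To illustrate the failure mode, here is a minimal sketch. It assumes a byte-oriented, ASCII-only character class as a simplified stand-in for the module's Latin-1 patterns (the actual values of the module's pattern variables are not shown in this report), and compares it with Unicode property classes. The function names shorten_ascii() and shorten_unicode() are hypothetical, and the second one needs a PCRE build with Unicode property support.

```php
<?php
// Hypothetical helpers; the character classes are stand-ins for the
// module's real pattern variables, whose values are not shown here.

// Byte-oriented, ASCII-only classes: "é" is two bytes in UTF-8 and
// neither byte falls inside [a-z], so matching stops in mid-name.
function shorten_ascii($name) {
  return preg_replace('/([A-Z])[a-z]+/', '\\1', $name);
}

// Unicode property classes plus the "u" modifier treat the subject as UTF-8.
function shorten_unicode($name) {
  return preg_replace('/(\p{Lu})\p{Ll}+/u', '\\1', $name);
}

$name = "St\xC3\xA9phane"; // "Stéphane" written as explicit UTF-8 bytes

echo shorten_ascii($name), "\n";   // Séphane
echo shorten_unicode($name), "\n"; // S
```

The byte-oriented pattern removes only the "t" and leaves the rest of the name in place, which is the kind of improper shortening described above.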

I have a project where all data in our database is UTF-8 encoded, so I am going to change all those Latin-1 codes to their UTF-8 equivalents. I will probably have to add more characters.

Do you have any thoughts on how I can share this change upstream?

Comments

Stefan Freudenberg

Category: bug » task

I found out that this is on the todo list, so I volunteer. Before starting, I want to make sure I do not duplicate effort. This is the first line of the function theme_biblio_page_number():

global $alnum, $alpha, $cntrl, $dash, $digit, $graph, $lower, $print, $punct, $space, $upper, $word, $patternModifiers; // defined in 'transtab_unicode_charset.inc.php' and 'transtab_latin1_charset.inc.php'

I cannot find those two files in biblio and the global variables are not defined elsewhere. Is there any chance to get those two files?

rjerome

Status: Active » Fixed

Hi Stefan,

I just checked in what I think will be a fix for this issue. As you may have guessed, I "borrowed" and adapted much of the style code from another package. In the process I inadvertently put the latin1 regex patterns in rather than the Unicode ones.

Unfortunately, some of these still aren't working, so I changed some of the code in theme_biblio_format_authors() to use drupal_substr() and str_replace() instead. If you can figure out why those regular expressions are still not working, that would be great, but I think the current workaround will suffice.

Ron.

Stefan Freudenberg

Hi Ron!

Using drupal_substr() to shorten the forenames returns only the first initial. I don't know why the regular expressions did not work for you, because I did almost the same thing and it worked for me. I'll write some unit tests for the function. Could you give me the names that caused the function to fail?
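
The difference between the two approaches can be sketched as follows. This is an illustration under assumptions, not the module's code: mb_substr() stands in for drupal_substr() so the snippet runs outside Drupal (it needs the mbstring extension), the function names are hypothetical, and the Unicode property classes are assumed equivalents of the module's pattern variables.

```php
<?php
// Assumption: mb_substr() is used in place of drupal_substr(); both
// count characters rather than bytes.
function first_char_only($forenames) {
  return mb_substr($forenames, 0, 1, 'UTF-8');
}

// Reduce every capitalised word to its first letter, keeping separators.
function initials($forenames) {
  return preg_replace('/(\p{Lu})\p{Ll}+/u', '\\1', $forenames);
}

$name = "\xC3\x89mile Jean"; // "Émile Jean" written as explicit UTF-8 bytes

echo first_char_only($name), "\n"; // É
echo initials($name), "\n";        // É J
```

With more than one forename, the substring approach drops every initial after the first, while the regular expression keeps one initial per name.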

Stefan

rjerome

Actually, it failed on ALL forenames (special characters or not). Is that not the case on your end?

Stefan Freudenberg

No. I replaced the character classes with the Unicode properties (you did it even more accurately than I did) and I had no more problems shortening forenames. Our database already has several thousand authors and I haven't encountered any errors yet. I am going to try your version from CVS.

Stefan Freudenberg

I have tested your version using the regular expressions instead of drupal_substr.

if (!empty($author['firstname'])) {
  if ($options['shortenGivenNames']) { // if we're supposed to abbreviate given names
    // within initials, reduce all full first names (-> defined by a starting uppercase character, followed by one or more lowercase characters)
    // to initials, i.e., only retain their first character
    $author['firstname'] = preg_replace("/([$upper])[$lower]+/$patternModifiers", "\\1", $author['firstname']);
    //$author['firstname'] = drupal_substr($author['firstname'], 0, 1);
  }
}

It works for me for authors with and without non-ASCII characters in their forenames.

rjerome

Hmmm, I'm left scratching my head, because in theory those regular expressions should work (and do, as you have proven), but in practice they do not on my setup :-( Now I need to find out what it is about my system that is preventing them from working, because I'm sure someone else is going to encounter the same issue.

Stefan Freudenberg

Unicode character properties have been available since PHP versions 4.4.0 and 5.1.0: http://php.net/manual/en/regexp.reference.php#regexp.reference.unicode
It is also possible that your PCRE library was compiled without UTF-8 support.
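
One way to test for this from PHP is to try a pattern that needs both features and see whether it compiles. The helper name pcre_has_unicode_properties() is hypothetical; the sketch relies only on preg_match() returning FALSE when a pattern cannot be compiled.

```php
<?php
// Hypothetical helper: returns TRUE when preg_* can compile and run a
// pattern that combines the "u" modifier with a Unicode property class.
function pcre_has_unicode_properties() {
  // "\xC3\x84" is "Ä" in UTF-8. The @ suppresses the warning PHP raises
  // when the pattern cannot be compiled; preg_match() then returns FALSE
  // instead of 1.
  return @preg_match('/^\p{Lu}$/u', "\xC3\x84") === 1;
}

echo pcre_has_unicode_properties() ? "supported" : "not supported", "\n";
```

If this prints "not supported", patterns built on \p{...} classes cannot work on that installation, whatever the forename contains.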

rjerome

I'm running CentOS 5.3, which bundles PHP 5.1.6. What version of PHP are you using?

rjerome

You were right: it turns out that RHEL, and therefore CentOS, doesn't build its PCRE library with Unicode properties support enabled! (https://bugzilla.redhat.com/show_bug.cgi?id=457064)

Rebuilding with Unicode support enabled solved the problem on my end too.

Talk about a waste of a day!

Ron.

Status: Fixed » Closed (fixed)
Issue tags: -utf-8

Automatically closed -- issue fixed for 2 weeks with no activity.