theme_biblio_format_authors() sometimes fails to render the initials from an accented first name. How to reproduce:
- install biblio and biblio_pubmed
- import from pubmed the following ID: 20931971
- truncate the title, which is longer than 255 characters (separate issue, but this should not matter here)
- save the node
- note how the second author is Bonnefous Ćline instead of the expected Bonnefous C

The XML this data comes from is http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmo...

Comment   File                         Size       Author
#2        946362_biblio_encoding.png   48.21 KB   scor
#1        946362.png                   6.64 KB    rjerome

Comments

rjerome:

Hmm, I can't reproduce that (see image attached), what style are you using?

scor:

I'm using the out-of-the-box tabular style; browsing to the node full view or the listing shows the bug. See attached screenshot. I was able to reproduce it on another machine running a different config (localhost is Mac OS, the server is Debian).

rjerome:

When I asked about style I was referring to "AMA, APA, MLA", etc. It may be related to that choice.

scor:

It's just a vanilla installation of the latest Drupal 7 and the latest biblio-7.x-1.x-dev without any settings changed. Turns out the default style is CSE, but the problem is the same with AMA, APA too. The weird initial is introduced by theme_biblio_format_authors() at the line:

        $author['firstname'] = preg_replace("/([$upper])[$lower]+/$patternModifiers", '\\1', $author['firstname']);

I have no other modules running on this site.

rjerome:

Ahh, this is probably related to the PCRE library on your webserver; I ran into this before.

You will see in biblio_theme.inc at line 403 a function called _biblio_get_regex_patterns(), which tests the PCRE library and decides whether to use it or not.

You might have to change line 332 341 and add "Ć" to it.

Ron.
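
For readers without the module source at hand, here is a minimal sketch of how such a PCRE capability test could look. It is not the module's _biblio_get_regex_patterns(); the function name sketch_pcre_has_unicode_properties() and the ASCII fallback class are assumptions made for illustration only.

```php
<?php
// Hypothetical check, NOT the code in biblio_theme.inc.
function sketch_pcre_has_unicode_properties() {
  // preg_match() returns 1 on a match and FALSE when the pattern cannot be
  // compiled, for example if PCRE was built without Unicode property support.
  return @preg_match('/^\p{Lu}$/u', 'A') === 1;
}

// Pick a character class for lowercase letters depending on the result.
// The 'a-z' fallback is an assumed placeholder, not the module's list.
$lower = sketch_pcre_has_unicode_properties() ? '\p{Ll}' : 'a-z';
```

A hand-written fallback list is where a character such as "Ć" would have to be added manually, which is why the suggestion above only applies when the Unicode-property branch is not taken.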

scor:

Line 341 does not get executed; my localhost uses _biblio_get_utf8_regex().

rjerome:

This appears to be a PHP 5.3 issue. I just tried it on another machine running 5.3 and now I'm seeing the same behavior you see.

I'll have to dig a bit further...

scor:

The thing is, I run PHP 5.2.11 :)

Here is what my phpinfo says about PCRE:
pcre
PCRE (Perl Compatible Regular Expressions) Support   enabled
PCRE Library Version                                 7.9 2009-04-11

Directive              Local Value   Master Value
pcre.backtrack_limit   100000        100000
pcre.recursion_limit   100000        100000

rjerome:

OK, I tracked it down... It has more to do with the way the characters in the XML file were encoded than with PCRE itself. The é character was encoded as two Unicode code points, "U+0065 U+0301" (the letter e followed by a combining acute accent), as opposed to the single Unicode character "U+00E9". The "stand-alone" accent character was messing up the expression because it is neither an upper nor a lower case letter.
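
To make the two encodings concrete, here is a small sketch (the variable names are illustrative, not taken from the module) that builds both forms of é byte by byte and checks which Unicode property class the stand-alone accent falls into. It assumes a PCRE build with Unicode property support.

```php
<?php
// Two ways to store "é" in UTF-8.
$composed   = "\xC3\xA9";   // U+00E9, a single code point
$decomposed = "e\xCC\x81";  // U+0065 followed by U+0301 (combining acute accent)

$composed_hex   = bin2hex($composed);    // "c3a9"
$decomposed_hex = bin2hex($decomposed);  // "65cc81"

// The combining accent on its own is a mark (\p{M}), not a lowercase letter.
$accent          = "\xCC\x81";
$accent_is_lower = preg_match('/^\p{Ll}$/u', $accent);  // 0
$accent_is_mark  = preg_match('/^\p{M}$/u', $accent);   // 1

printf("%s vs %s\n", $composed_hex, $decomposed_hex);
```

Both strings look identical on screen, but a character class built only from \p{Lu} and \p{Ll} stops matching at the accent in the decomposed form.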

Bottom line, adding "\p{M}" to line 378 like this $lower = "\p{Ll}\p{M}"; fixes the problem.

See http://www.regular-expressions.info/unicode.html for all the gory details.

Ron.
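
The effect of the change can be sketched in isolation as follows. This is a simplified stand-in for the preg_replace() call quoted earlier in the thread, not the module's code: sketch_initials() is a hypothetical helper, the uppercase class is hard-coded to \p{Lu}, and the /u modifier is assumed.

```php
<?php
// Hypothetical helper: collapse each capitalised name part to its initial.
// $lower is the character class body used for "the rest of the name".
function sketch_initials($firstname, $lower) {
  return preg_replace('/(\p{Lu})[' . $lower . ']+/u', '\\1', $firstname);
}

// "Céline" with é stored as U+0065 U+0301 (decomposed form).
$name = "Ce\xCC\x81line";

// Lowercase letters only: the match stops at the combining accent, so only
// "Ce" is collapsed and the accent plus "line" are left behind.
$before = sketch_initials($name, '\p{Ll}');

// Lowercase letters and marks: the whole remainder of the name is consumed.
$after = sketch_initials($name, '\p{Ll}\p{M}');

printf("%s / %s\n", bin2hex($before), bin2hex($after));
```

With the first class, the leftover accent attaches to the "C" when rendered, which matches the "Ćline" seen in the bug report; with the second, only "C" remains.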

scor:

Great! A side effect of this is that importing another paper from the same person like http://www.ncbi.nlm.nih.gov/pubmed/18954984 leads to the creation of a duplicate contributor with the same name. I guess that would be a separate issue though - if only PHP had Unicode normalization, it would help a lot.
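
As a sketch of why the duplicate appears: the two spellings of the same name differ byte for byte, so a plain string comparison treats them as different contributors. Where the intl extension is installed (an assumption; it may well not have been available on the setups discussed here), its Normalizer class can bring both to the same form before comparing. The variable names are illustrative only.

```php
<?php
// The same first name in composed and decomposed form.
$composed   = "C\xC3\xA9line";    // é as U+00E9
$decomposed = "Ce\xCC\x81line";   // é as U+0065 U+0301

// A raw comparison sees two different strings.
$same_raw = ($composed === $decomposed);

// Assumption: the intl extension is loaded. Guarded so the sketch still runs
// without it; in that case $same_nfc stays NULL.
$same_nfc = NULL;
if (class_exists('Normalizer')) {
  $same_nfc = Normalizer::normalize($composed, Normalizer::FORM_C)
    === Normalizer::normalize($decomposed, Normalizer::FORM_C);
}
```

Without normalization at import time, merging the duplicate authors afterwards remains the practical workaround.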

rjerome:

Status: Active » Fixed

This is where the author "merge" function comes in handy, but strangely, I can't even import that last one you mention (18954984); it returns no data.

http://drupal.org/cvs?commit=438868

rjerome:

Scratch that, the third time I tried 18954984 it worked. Maybe a wonky network connection.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.