Posted by sciomako on October 15, 2007 at 6:24am
11 followers
| Project: | Bibliography Module |
| Version: | 5.x-1.x-dev |
| Component: | Code |
| Category: | bug report |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | closed (fixed) |
Issue Summary
Accented characters in BibTeX are not imported properly. For example, I have the entry below; notice that the author names contain accented characters. Biblio imports them as-is. Similarly, if I hand-modify the characters to what they should be, the export doesn't apply the proper character encoding.
@inproceedings{Scha03a,
abstract = {Despite the undisputed prominence of inheritance as the fundamental reuse mechanism in object-oriented programming languages, the main variants --- single inheritance, multiple inheritance, and mixin inheritance --- all suffer from conceptual and practical problems. In the first part of this paper, we identify and illustrate these problems. We then present traits, a simple compositional model for structuring object-oriented programs. A trait is essentially a group of pure methods that serves as a building block for classes and is a primitive unit of code reuse. In this model, classes are composed from a set of traits by specifying glue code that connects the traits together and accesses the necessary state. We demonstrate how traits overcome the problems arising from the different variants of inheritance, we discuss how traits can be implemented effectively, and we summarize our experience applying traits to refactor an existing class hierarchy.},
annote = {internationalconference topconference},
author = {Sch\"arli, Nathanael and Ducasse, St\'ephane and Nierstrasz, Oscar and Black, Andrew },
booktitle = {Proceedings of European Conference on Object-Oriented Programming (ECOOP'03)},
citeulike-article-id = {1574024},
doi = {10.1007/b11832},
keywords = {traits, stefpub, snf03, schaerli, scg-traits, scg-pub, jb03, bibtex-import, aspect},
month = {July},
pages = {248--274},
priority = {2},
publisher = {Springer Verlag},
series = {LNCS},
title = {Traits: Composable Units of Behavior},
url = {http://www.iam.unibe.ch/~scg/Archive/Papers/Scha03aTraits.pdf},
volume = {2743},
year = {2003}
}
Comments
#1
The BibTeX parsing is done by a bit of third-party code called Structures_BibTex (http://pear.php.net/pepr/pepr-proposal-show.php?id=386). Either you or I could raise this issue with the maintainer of that code.
Ron.
#2
Hi Ron,
I'm not familiar with Structures_BibTex. Would you mind raising the issue for me, please?
Thanks
--
John
#3
I think the parsing *must not* be modified for accented characters.
At display time (when PHP creates the web pages), sequences like \'e should be replaced by &eacute;
My 2 cents idea ...
Jean-Pierre Roux
#4
(subscribe)
This 'bug' keeps me from using this module on the website of my research department.
Concerning the comment in #3: I think the parsers should modify the characters on import.
There are several ways to input biblio data:
If you mix all these methods without decoding, your data will contain combinations of "é", "\'e" and "&eacute;", and how do you then determine how it is encoded? This is particularly bad because the ampersand & is a special character in both BibTeX encoding and XML encoding.
The proper way should be IMHO:
my 2 cents
#5
additional bibtex import problem:
authors are separated in bibtex by " and ", for example "John Foo and William Bar"
this is also imported as-is, but should be converted to semicolon separation, e.g. "John Foo; William Bar"
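For illustration only, here is a minimal Python sketch of the conversion described above. It is not Biblio's code (the module is written in PHP); the function name `split_bibtex_authors` is hypothetical, and the sketch assumes that an " and " inside braces belongs to a single name and must not be split.

```python
import re

def split_bibtex_authors(field):
    """Split a BibTeX author field on the word 'and' at brace depth 0."""
    authors, depth, start = [], 0, 0
    # Scan braces and whitespace-delimited "and" tokens from left to right.
    for match in re.finditer(r"[{}]|\s+and\s+", field):
        token = match.group()
        if token == "{":
            depth += 1
        elif token == "}":
            depth -= 1
        elif depth == 0:
            # An "and" outside any braces separates two names.
            authors.append(field[start:match.start()].strip())
            start = match.end()
    authors.append(field[start:].strip())
    return authors

print("; ".join(split_bibtex_authors("John Foo and William Bar")))
```

Joining the resulting list with "; " gives the semicolon-separated form requested in this comment.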
#6
Hmmm, I'm sure the "and"s used to be converted as that would stick out like a sore thumb. I'll check that one.
Ron.
#7
I just tested a BibTeX file with multiple authors separated by " and " and it seems to work fine, i.e. the "and"s are gone.
Could you post the offending BibTeX file so I can take a look at it?
Ron.
#8
Hi you all,
I think that the accents issue is shifting to a different question, and the accents bug is very important. I'm having the same trouble with accents, and since I have to import about 600 references into a site where the bibliography is an important part, I'm wondering how to import them so that everything is recorded properly.
The biblio module is just impressive and it would be sad not to be able to use it just because of that small detail ;-) my customer is just impressed with the set of features that its developer is providing...
Any idea? :-)
#9
I believe I have a working solution to the LaTeX character encoding issue and it should be available within a day or two.
Ron.
#10
Great Ron, does it mean that it will import accents properly?
BTW, how have people imported BibTeX with accents until now?
#11
Yep, the LaTeX character codes like {\"a} will be replaced with the proper UTF-8 character and on export the reverse will happen.
This is not always easy though since technically speaking, the correct format is {\"{a}}, but unfortunately there seem to be many variations on that theme, multiplied by many hundred possible characters :-(
With regard to what other people are doing, I can only speculate that they must be using UTF-8 formatted BibTeX files which, according to the LaTeX web site, have been available since 2004. Regrettably, not all software supports the format.
Ron.
#12
My customer is giving me something (short to test) like:
@article{IIIA-1986-1357,title = {Retórica dialéctica. La justificación de las teorías del debate},
author = {Ramon López de Mántaras and Jaume Agustí},
year = {1986},
journal = {Estudios},
number = {5},
pages = {49-59},
}
I've tried many changes like converting "ó" to "ó", or "{ó}", or "{\"o}", or "{\"{o}}". For example converting it to:
@article{IIIA-1986-1357,title = {Ret{ó}rica dialéctica. La justificación de las teor{í}as del debate},
author = {Ramon L{ó}pez de Mántaras; Jaume Agust{í}},
year = {1986},
journal = {Estudios},
number = {5},
pages = {49-59},
}
And then importing it. But the final result is:
@article { IIIA-1986-1357,title = {Ret{ó}rica dial?ctica. La justificaci?n de las teor{í}as del debate},
journal = {Estudios},
number = {5},
year = {1986},
pages = {49-59},
author = {Ramon L{ó}pez de M?ntaras and Jaume Agust{í}}
}
Sincerely, I don't know what else to do...
#13
The format your customer is giving you will work (no changes required) *PROVIDED* the file containing it is saved in UTF-8 format. If you look at the attached image you will see two entries: the first was created using a file with standard DOS formatting, the second entry is the same file just resaved in UTF-8 format, and voila, all the information is there!
Ron.
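As a rough illustration of the resave step, the following Python sketch writes a UTF-8 copy of a .bib file. The function name `resave_as_utf8` is hypothetical, and the fallback encoding `cp1252` is only an assumption: the thread does not say which encoding the "standard DOS" file actually used, so the fallback may need to be changed (for example to `cp850`).

```python
from pathlib import Path

def resave_as_utf8(src, dst, fallback_encoding="cp1252"):
    """Write a UTF-8 copy of src to dst; return the encoding used to read src."""
    raw = Path(src).read_bytes()
    try:
        # If the bytes are already valid UTF-8, keep them as they are.
        text = raw.decode("utf-8")
        used = "utf-8"
    except UnicodeDecodeError:
        # Otherwise assume a legacy single-byte encoding (a guess, see above).
        text = raw.decode(fallback_encoding)
        used = fallback_encoding
    Path(dst).write_bytes(text.encode("utf-8"))
    return used
```

Calling `resave_as_utf8("refs.bib", "refs-utf8.bib")` (both file names are placeholders) and importing the second file corresponds to the "resaved in UTF-8" case described above.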
#14
Wops!
Uhmmm, I get that file by downloading it in the browser from a URL where a PHP process generates the list.
Thanks a lot for your information, at least it gives me a clue. Thanks a lot :-)
#15
No problem,
By the way, I checked in the latex character code conversion code, so both the 5.x-dev and 6.x-dev versions will have that capability the next time the -devs get rebuilt.
Ron.
#16
Yesssss! It's true, and it was that simple: I was not saving the file with UTF-8 encoding... the only detail I was missing :-(
Thanks again Ron.
#17
Yes, I imported 902 records from a single file in a single import action (it took me 20 times longer to bulk-generate all the automated URLs with pathauto ;-)
The site is still in development but you can see it at: http://www08.iiia.csic.es/ca/publications.
#18
Looks good!
Speaking of pathauto, you might be interested in this post... http://drupal.org/node/89038#comment-869934
Ron.
#19
#20
Here's another for you to test:
@article{958751,
author = {D\'{a}niel Orincsay and Bal\'{a}zs Szviatovszki and G\'{e}za B\"{o}hm},
title = {Prompt partial path optimization in MPLS networks},
journal = {Comput. Netw.},
volume = {43},
number = {5},
year = {2003},
issn = {1389-1286},
pages = {557--572},
doi = {http://dx.doi.org/10.1016/S1389-1286(03)00290-1},
publisher = {Elsevier North-Holland, Inc.},
address = {New York, NY, USA},
}
#21
Hmmm, I'm assuming this didn't work? There seem to be many schools of thought on how to format special characters in BibTeX entries, and it looks like I'm going to have to cater to all of them...
Technically, correct me if I'm wrong, the "\" character should be preceded by a "{" and the modified character should be enclosed in braces, so I believe "B{\"{o}}hm" would be the correct format. Unfortunately, I have seen a number of variations like "B{\"o}hm" and now what you present, "B\"{o}hm".
With so many variations, it is challenging to match and replace the character sequences.
Ron.
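To make the matching problem concrete, here is a minimal Python sketch of one regular expression that accepts the variations mentioned in this thread ({\"{o}}, {\"o}, \"{o} and the bare \"o). It is not the code that was committed to Biblio; the names `COMBINING`, `ACCENT` and `latex_to_unicode` are hypothetical, and the sketch assumes only five accent commands applied to plain ASCII letters.

```python
import re
import unicodedata

# LaTeX accent command -> Unicode combining mark (assumed subset of five accents).
COMBINING = {'"': "\u0308", "'": "\u0301", "`": "\u0300", "^": "\u0302", "~": "\u0303"}

# Optional outer "{", backslash, accent, then "{x}" or "x".
# The conditional (?(1)\}) demands a closing "}" only when the outer "{" matched.
ACCENT = re.compile(r"""(\{)?\\(["'`^~])(?:\{([A-Za-z])\}|([A-Za-z]))(?(1)\})""")

def latex_to_unicode(text):
    """Replace LaTeX accent sequences with precomposed Unicode characters."""
    def repl(match):
        letter = match.group(3) or match.group(4)
        # Base letter + combining mark, composed into one code point where possible.
        return unicodedata.normalize("NFC", letter + COMBINING[match.group(2)])
    return ACCENT.sub(repl, text)
```

Composing the letter with a combining mark avoids a hand-written table of several hundred characters, at the cost of ignoring commands that are not simple accents (for example \ss or \o).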
#22
---it looks like I'm going to have to cater to all of them...
Hi Ron,
if you wish, send me the portions of code you have developed for the replacements you have handled (so I can learn how to make some of my own) and I can help you from now on with future variations.
Hope I can help with the burden.
Best,
Daniel
#23
That's right, unfortunately it didn't work. I got that tex from ACM Portal: http://portal.acm.org/citation.cfm?id=958751
What I ended up getting from the parse is:
name: D\'{a}niel Orincsay
last name: a
first name:
initial: D\'
Thanks.
Michael
#24
In this regard I don't think this is fixed yet. I think at least the form "\'e" should be handled, because it is the simplest and probably the most used.
I experimented a bit with the CVS versions of the Biblio module (DRUPAL-5 and DRUPAL-6--1, checkout on 9 june 2008) on the following bibtex file:
@article{test123141,title = {Foo \"i\"e\"a\"u\"o \'i\'e\'a\'u\'o \`i\`e\`a\`u\`o Bar Foo {\"i}{\"e}{\"a}{\"u}{\"o} {\'i}{\'e}{\'a}{\'u}{\'o} {\`i}{\`e}{\`a}{\`u}{\`o} Bar},
author = {Foo Bar and Baz Bal and Hello World},
journal = { Foo {\"{i}}{\"{e}}{\"{a}}{\"{u}}{\"{o}} {\'{i}}{\'{e}}{\'{a}}{\'{u}}{\'{o}} {\`{i}}{\`{e}}{\`{a}}{\`{u}}{\`{o}} Bar}
}
The results for Drupal 5 and 6 are shown in the attachments.
Only the form "{\'o}" is decoded to Unicode, and only on "a", "e", "o" and "u", not on "i". The other forms, "\'o" and "{\'{o}}", are not decoded.
Also note that the authors are not separated on the "and" in the Drupal 5 version; the Drupal 6 version does it right.
#25
I'm just in the process of fine tuning the regular expressions in order to account for as many permutations and combinations as possible.
That example entry will be very helpful in testing.
Ron
P.S. I fixed the author delimiter issue.
#26
OK, the latest -dev releases will parse the above entries++
Ron.
#27
#28
Automatically closed -- issue fixed for two weeks with no activity.
#29
Hi,
here is the result of a .bib export from our Drupal database:
@book { GUER-08b,
title = {Délivrance des aérosols et des traitements inhalés en ventilation assistée},
series = {La Ventilation Artificielle : de la Physiologie à la pratique},
year = {2008},
pages = {(in-press)},
publisher = {Elsevier, Publ.},
address = {Some City, Some Country},
keywords = {med, imagerie_morphologique_fonctionnelle, reponse_pulmonaire_agression},
author = {Guérin, C. and Fassier, T. and Bayle, F. and Lemasson, S. and Richard, J.C.}
}
I would expect the "é" character to be converted to \'{e} but this did not occur...
#30
I have not added the reverse translation to the code yet, I'll put it on the todo list.
Ron.
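For reference, the reverse translation requested in #29 could look roughly like the Python sketch below. It is not the patch that was later committed; the names `ACCENT_COMMANDS` and `unicode_to_latex` are hypothetical, and only five accents on ASCII letters are assumed. The output uses the \'{e} form that #29 expected.

```python
import unicodedata

# Unicode combining mark -> LaTeX accent command (assumed subset of five accents).
ACCENT_COMMANDS = {"\u0308": '"', "\u0301": "'", "\u0300": "`", "\u0302": "^", "\u0303": "~"}

def unicode_to_latex(text):
    """Rewrite accented letters such as 'é' as LaTeX sequences such as \\'{e}."""
    out = []
    # Compose first so each accented letter is a single code point.
    for ch in unicodedata.normalize("NFC", text):
        parts = unicodedata.normalize("NFD", ch)
        if (len(parts) == 2 and parts[0].isascii() and parts[0].isalpha()
                and parts[1] in ACCENT_COMMANDS):
            out.append("\\" + ACCENT_COMMANDS[parts[1]] + "{" + parts[0] + "}")
        else:
            # Anything outside the assumed subset is passed through unchanged.
            out.append(ch)
    return "".join(out)
```

Characters that do not decompose into a letter plus one of these marks (for example "ç") are left untouched by this sketch.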
#31
Here is a patch (against CVS HEAD) to convert back special characters to bibtex format when exporting. Not sure it does all the work...
Sebastien
#32
#33
Thanks,
Committed to CVS
Ron.
#34
Automatically closed -- issue fixed for 2 weeks with no activity.
#35
subscribing
#36