If a title contains italic tags, the title stops being read at the first italic tag.

See 10.1021/bi700336y for an example: parsing of the title stops just before "Xanthobacter".

Comments

rjerome’s picture

Technically, it's doing what it's supposed to, since from the XML parser's perspective, the data contained within the italics tag is a new piece of data. It would probably have been better for doi.org to HTML-encode any embedded HTML tags, or to put them in a CDATA section, before writing the XML output so that the parser would ignore them. I'll see what I can do to work around this.
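To illustrate the behaviour described above, here is a minimal Python sketch using the standard `xml.parsers.expat` module. It is not Biblio's PHP parser code: the sample record, the `read_title` helper and its `nested_aware` flag are invented for the illustration. A handler that treats every closing tag as the end of the current field loses everything from the first embedded tag onwards, while a handler that tracks nesting keeps the whole title.

```python
import xml.parsers.expat

# Invented sample; the real CrossRef record is not reproduced here.
SAMPLE = "<record><title>A study of <i>Xanthobacter</i> enzymes</title></record>"


def read_title(xml_text, nested_aware):
    """Return the text collected for <title> by an event-driven parser."""
    fields = {}
    stack = []  # names of the elements that are currently open

    def start(name, attrs):
        stack.append(name)

    def end(name):
        if nested_aware:
            stack.pop()
        else:
            # Naive handler: any closing tag ends the current field.
            stack.clear()

    def chars(data):
        if not stack:
            return  # text arriving outside any tracked field is dropped
        if nested_aware and "title" in stack:
            target = "title"  # credit nested text to the enclosing title
        else:
            target = stack[-1]
        fields[target] = fields.get(target, "") + data

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)
    return fields.get("title", "")


print(repr(read_title(SAMPLE, nested_aware=False)))
print(repr(read_title(SAMPLE, nested_aware=True)))
```

The first call returns only the text before `<i>`; the second returns the full title with the markup stripped.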

Ron.

mcookson’s picture

Status: Fixed » Active

I've found this problem too when using EndnoteX2 XML as it uses "style" tags to define italics. This is a significant problem for users with large datasets that contain scientific names. A similar problem seems to occur (as might be expected) when exporting datasets from biblio in tagged or XML format.

rjerome’s picture

@Michael: Are you saying that the EndNote file is NOT imported, or are you saying that the "style" is lost on import? If the latter, then it's a different issue.

Ron.

mcookson’s picture

Ron, I think we are on the same page. As you know EndnoteX2 XML files use "style" tags to define italics which, when parsed during a Biblio import, appear to truncate field data. For example, the title field of my EndnoteX2 XML reads as follows (I have pasted the complete EndnoteX2 XML output for this record below):

<title><style face="normal" font="default" size="100%">Insect vectors of </style><style face="italic" font="default" size="100%">Phytophthora</style><style face="normal" font="default" size="100%"> diseases of cocoa in Papua New Guinea</style></title>

but the title imports into Biblio as:
"Insect vectors of" ONLY (truncating "Phytophthora diseases of cocoa in Papua New Guinea").
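The same truncation can be reproduced outside Biblio. The sketch below uses Python's `xml.etree.ElementTree` on the title element quoted above; it only illustrates the symptom and is not a description of how Biblio's importer is written. Reading the first `style` run alone gives the truncated title, whereas joining all text nodes gives the complete one.

```python
import xml.etree.ElementTree as ET

TITLE_XML = (
    "<title>"
    '<style face="normal" font="default" size="100%">Insect vectors of </style>'
    '<style face="italic" font="default" size="100%">Phytophthora</style>'
    '<style face="normal" font="default" size="100%">'
    " diseases of cocoa in Papua New Guinea</style>"
    "</title>"
)

title = ET.fromstring(TITLE_XML)

# Only the first <style> run: this matches the truncated import.
first_run_only = title.find("style").text

# Every text node under <title>, in document order.
full_title = "".join(title.itertext())

print(repr(first_run_only))
print(full_title)
```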

On what may be an unrelated issue, I have noticed other truncation in this (and other) records. In this record the date field:

<date><style face="normal" font="default" size="100%">27 September 1999</style></date>

when imported is truncated to "27 September 199" dropping the final "9". It seems likely this truncation is caused by the date field being restricted to 16 characters (?) as I have seen the problem in other records with no italics.
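A quick check of the arithmetic, as a Python sketch: the date string is 17 characters long, so a field limited to 16 characters drops exactly the final "9". The 16-character width is the guess made above, not a value confirmed from the Biblio schema.

```python
DATE_FIELD_WIDTH = 16  # assumed width, taken from the guess in this comment

date_text = "27 September 1999"
stored = date_text[:DATE_FIELD_WIDTH]  # what a 16-character column would keep

print(len(date_text), repr(stored))
```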

Hope that clarifies the problem. Thanks,
Mike

source XML:
<record><database name="AgBib_2009-01-15-X2.enl" path="E:\NEW PROJECTS\PNGweb\content\lmg\agbib\AgBib_2009-01-15-X2.enl">AgBib_2009-01-15-X2.enl</database><source-app name="EndNote" version="12.0">EndNote</source-app><rec-number>8717</rec-number><foreign-keys><key app="EN" db-id="aprwpwwfxa0avsetvfyv2e21seze05swwe9x">8717</key></foreign-keys><ref-type name="Conference Paper">47</ref-type><contributors><authors><author><style face="normal" font="default" size="100%">Konam, J.</style></author><author><style face="normal" font="default" size="100%">Blaha, G.</style></author><author><style face="normal" font="default" size="100%">Guest, D.</style></author></authors></contributors><titles><title><style face="normal" font="default" size="100%">Insect vectors of </style><style face="italic" font="default" size="100%">Phytophthora</style><style face="normal" font="default" size="100%"> diseases of cocoa in Papua New Guinea</style></title><secondary-title><style face="normal" font="default" size="100%">12th Biennial Australasian Plant Pathology Society Conference</style></secondary-title></titles><pages><style face="normal" font="default" size="100%">262</style></pages><keywords><keyword><style face="normal" font="default" size="100%">PNG</style></keyword><keyword><style face="normal" font="default" size="100%">cocoa</style></keyword><keyword><style face="normal" font="default" size="100%">plant diseases</style></keyword><keyword><style face="normal" font="default" size="100%">fungal diseases</style></keyword><keyword><style face="normal" font="default" size="100%">Phytophthora</style></keyword><keyword><style face="normal" font="default" size="100%">insects</style></keyword><keyword><style face="normal" font="default" size="100%">vectors</style></keyword></keywords><dates><year><style face="normal" font="default" size="100%">1999</style></year><pub-dates><date><style face="normal" font="default" size="100%">27 September 
1999</style></date></pub-dates></dates><pub-location><style face="normal" font="default" size="100%">Canberra, Australia</style></pub-location><urls></urls></record>

rjerome’s picture

OK, I understand now. I'll work on the parser a bit to deal with those embedded style tags. As for the date field, you're exactly right: it's a database field width issue, which is easy to fix.

Ron.

rjerome’s picture

There was a minor error in the parser which was causing the loss of data in the title (or any field with embedded styles for that matter), but it's fixed now.

As a bonus, I also added a new feature which will retain and convert the EndNote font style information (bold, italic, underline, subscript and superscript) to HTML codes so your title will retain the italicized portion.
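A conversion of this kind could look roughly like the Python sketch below. It is an illustration only, not the code committed to Biblio: the `FACE_TO_TAG` mapping and the `styles_to_html` helper are assumptions, and it presumes that each `face` attribute holds a space-separated list of style names.

```python
import html
import xml.etree.ElementTree as ET

# Assumed mapping from EndNote face names to HTML tags.
FACE_TO_TAG = {
    "bold": "b",
    "italic": "i",
    "underline": "u",
    "subscript": "sub",
    "superscript": "sup",
}


def styles_to_html(field):
    """Flatten a field's <style> runs into one HTML-marked-up string."""
    out = [html.escape(field.text or "")]
    for run in field:
        text = html.escape("".join(run.itertext()))
        faces = run.get("face", "").split() if run.tag == "style" else []
        tags = [FACE_TO_TAG[f] for f in faces if f in FACE_TO_TAG]
        opening = "".join(f"<{t}>" for t in tags)
        closing = "".join(f"</{t}>" for t in reversed(tags))
        out.append(opening + text + closing)
        out.append(html.escape(run.tail or ""))
    return "".join(out)


title = ET.fromstring(
    "<title>"
    '<style face="normal" font="default" size="100%">Insect vectors of </style>'
    '<style face="italic" font="default" size="100%">Phytophthora</style>'
    '<style face="normal" font="default" size="100%">'
    " diseases of cocoa in Papua New Guinea</style>"
    "</title>"
)
print(styles_to_html(title))
```

Runs with `face="normal"` pass through as plain text, and only the recognised faces are wrapped in tags.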

rjerome’s picture

Status: Active » Fixed

@cowsandmilk: I've applied a similar fix to the CrossRef parser, so it will now handle embedded HTML font style tags.

Ron.

rjerome’s picture

FYI, it seems that having HTML tags in the title may not be the greatest idea, since the node module runs the title through check_plain and thus you will see the tags in the output ;-(
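The effect can be sketched in Python with `html.escape`, which is used here as a stand-in for Drupal's `check_plain` on the assumption that both simply escape HTML special characters. Once the title is escaped, the browser shows the tags literally instead of rendering italics.

```python
import html

title = "Insect vectors of <i>Phytophthora</i> diseases of cocoa in Papua New Guinea"

# Stand-in for check_plain: escape the HTML special characters.
escaped = html.escape(title)

print(escaped)
```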

mcookson’s picture

Status: Active » Fixed

Hmmm... where does that leave us? And will these font styles re-export to Endnote without problems? (I need to update Biblio!).

Going over recent import data I also noticed that text in italics WAS being parsed in the keywords field. The text was not retaining its italics (which from what you've said is presumably now fixed), but it was being recognised - and so included in keywords. The weird thing, though, was that while all the keywords for each record of EndnoteX2 XML were recognised, they were reversed (i.e. keywords: "Aboriginal, Australia, Camponotus inflatus, indigenous" would turn out "indigenous, Camponotus inflatus, Australia, Aboriginal").

Thanks again, Mike

rjerome’s picture

Hi Mike,

Currently the font styles will not be re-exported in the EndNote XML format. I'll take a look and see how much trouble it would be to make that happen.

WRT keywords, each keyword is stored individually in a database table (as opposed to one string per publication). When the node is displayed, they are loaded from the database and recombined in a somewhat random order (by keyword ID). It would be no problem to sort them alphabetically, which I guess is what you were expecting (and would make more sense).
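The difference between the two orderings can be shown with a small Python sketch. The keyword IDs below are invented purely to reproduce the reversed order reported above; they are not taken from any real Biblio database.

```python
# Hypothetical (keyword_id, word) rows; the IDs are made up for illustration.
rows = [
    (14, "indigenous"),
    (15, "Camponotus inflatus"),
    (16, "Australia"),
    (17, "Aboriginal"),
]

# Recombined by keyword ID: reproduces the reversed-looking order.
by_id = [word for _, word in sorted(rows)]

# Sorted alphabetically, ignoring case.
alphabetical = sorted((word for _, word in rows), key=str.casefold)

print(", ".join(by_id))
print(", ".join(alphabetical))
```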

Ron.

rjerome’s picture

I did a bit of testing with regard to re-exporting the font styles, and it turns out it's quite a PITA! It would work fine if only one style is applied: <i>blaa</i> turns into <style face="italic">blaa </style>. But if you have two font styles, like <i><b>blaa </b></i>, you still only have one EndNote style tag, <style face="italic bold">blaa </style>, which means you have to parse the whole string and determine if there are any combined font styles before you can create the EndNote style tag. Really more trouble than it's worth, in my opinion.
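For reference, the parsing step described above can be sketched in Python with the standard `html.parser` module. This is not Biblio code: the `TAG_TO_FACE` mapping, the `StyleRunBuilder` class and the `html_to_endnote_styles` helper are hypothetical, and the output omits the `font` and `size` attributes that EndNote writes. The idea is to track which style tags are open and emit one `style` element per run of text, with the open faces combined into a single attribute.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Assumed mapping from HTML tags to EndNote face names.
TAG_TO_FACE = {
    "i": "italic",
    "b": "bold",
    "u": "underline",
    "sub": "subscript",
    "sup": "superscript",
}


class StyleRunBuilder(HTMLParser):
    """Collect (face, text) runs from an HTML-marked-up string."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.active = []  # faces currently open, outermost first
        self.runs = []    # list of (face string, text)

    def handle_starttag(self, tag, attrs):
        face = TAG_TO_FACE.get(tag)
        if face:
            self.active.append(face)

    def handle_endtag(self, tag):
        face = TAG_TO_FACE.get(tag)
        if face in self.active:
            self.active.remove(face)

    def handle_data(self, data):
        face = " ".join(self.active) or "normal"
        if self.runs and self.runs[-1][0] == face:
            # Same styling as the previous run: extend it.
            self.runs[-1] = (face, self.runs[-1][1] + data)
        else:
            self.runs.append((face, data))


def html_to_endnote_styles(fragment):
    builder = StyleRunBuilder()
    builder.feed(fragment)
    builder.close()
    return "".join(
        f"<style face={quoteattr(face)}>{escape(text)}</style>"
        for face, text in builder.runs
    )


print(html_to_endnote_styles("<i><b>blaa </b></i>"))
print(html_to_endnote_styles("Insect vectors of <i>Phytophthora</i> diseases"))
```

Supporting italics alone would reduce this to a single tag-to-face pair, which is the simpler case.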

Ron.

mcookson’s picture

Ron,
Biblio users across the biological sciences will be excited to see their records import and export seamlessly without the loss of italics. Without italics many references to organisms (large or small) are inaccurate and require subsequent edits to fix (untenable for large datasets). Few users will require other embedded style tags as long as their italics can be preserved. If you can achieve this without too much trouble, that would be fantastic!
Cheers,
Mike

PS Keywords that run alphabetically would be a nice refinement.... :-)

rjerome’s picture

I think just italics would be doable without too much trouble, but as I said before, multiple styles combined would be a pain.

FYI, I committed the change to the keyword loading, so they are now sorted alphabetically.

Ron.

mcookson’s picture

Ron,
Have you added the italics functionality to Biblio imports/exports?
If so, I will update my version of Biblio to take advantage of it.
Thanks,
Mike

rjerome’s picture

Just the import; I haven't fixed the export yet.

Ron.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.