Exported RDF and TCS don't handle cyrillic characters
Ognyan Kulev - December 22, 2008 - 09:15
| Project: | Taxonomy import/export via XML |
| Version: | 6.x-1.2 |
| Component: | Code |
| Category: | bug report |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | postponed |
Description
Exported XML is OK, but exported RDF and TCS contain bad Cyrillic, and I'm sure it affects not only Cyrillic but anything outside the ISO-8859-1 character set. Samples:
XML (correct):
<term><tid>8</tid><vid>4</vid><name>Пловдив</name><description></description><weight>0</weight><depth>0</depth><parent>0</parent></term>
RDF:
<rdfs:Class rdf:ID="term-8">
<rdfs:label>Ðловдив</rdfs:label>
<rdfs:isDefinedBy rdf:resource="#vocabulary-4"/>
</rdfs:Class>
TCS:
<TaxonConcept id="term-8">
<Name scientific="true" ref="?">ÐлПвЎОв</Name>
<TaxonRelationships/>
</TaxonConcept>
#1
Gawd I hate encodings.
OK, I just spent a lot longer trying to figure this out than really should have been necessary.
On one hand, some of my text editors would not recognise that the file was UTF-8. I tried all sorts of tricks with server headers, but it seems an encoding cannot be attached to a file in a way that sticks once the file is downloaded.
Opening the dodgy files in XML editors - or Firefox directly - gives me the expected results.
So I'll try naming the exports as taxonomy_name.rdf.xml and taxonomy_name.tcs.xml . That at least gives everything enough info that it should believe the UTF-8 directive found in the XML header.
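For comparison, here is a minimal sketch of an export that carries its own UTF-8 declaration. It is in Python rather than the module's PHP and is not the module's code; the element names are copied from the XML sample in the report, and everything else is an assumption made for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Build one <term> element shaped like the XML sample in the report.
term = ET.Element("term")
ET.SubElement(term, "tid").text = "8"
ET.SubElement(term, "name").text = "Пловдив"

# Serialise to bytes and ask for an explicit XML declaration naming UTF-8.
buf = io.BytesIO()
ET.ElementTree(term).write(buf, encoding="utf-8", xml_declaration=True)
data = buf.getvalue()

# A parser that honours the declaration recovers the original text.
parsed_name = ET.fromstring(data).find("name").text

print(data.decode("utf-8"))
print(parsed_name)
```

The point of the sketch is only that the declaration and the actual bytes agree; whether a given text editor respects the declaration is a separate question.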
But still, the result I was getting was not identical to the one described here.
I keep getting
<name>–ü–ª–æ–≤–¥–∏–≤</name>
which is at least recognizable as 2-byte Unicode gone wrong.
Flicking the UTF-8 toggle in the file properties of my text editor fixes it. But it's not sticky.
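The "2-byte" pattern can be reproduced with a short sketch. It is in Python rather than PHP, and it assumes the editor fell back to a legacy single-byte encoding; Mac OS Roman and ISO-8859-1 are guesses, since the thread does not name the editor or its default.

```python
original = "Пловдив"

# In UTF-8 every Cyrillic letter here occupies two bytes.
utf8_bytes = original.encode("utf-8")

# Assumption: the viewer reads those bytes with a single-byte encoding,
# so each byte turns into one unrelated character.
as_mac_roman = utf8_bytes.decode("mac_roman")
as_latin1 = utf8_bytes.decode("latin-1")

print(len(original), len(utf8_bytes), len(as_mac_roman))
print(as_mac_roman)
print(repr(as_latin1))

# The bytes on disk were never damaged: reading them as UTF-8 again
# restores the text, which is what the editor toggle does.
repaired = as_mac_roman.encode("mac_roman").decode("utf-8")
print(repaired)
```

Seven letters become fourteen characters, with the same lead character repeated before every second one, which is the shape of the sample quoted above. If the file really is like this, the export is fine and only the viewer's guess at the encoding is wrong.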
YOUR RDF results look more like a result of entity encoding. Probably a result of passing the input through xmlentities() before saving. Which I THOUGHT was safest.
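To make the entity-encoding guess concrete, here is a rough Python sketch of what a byte-oriented escaper does to UTF-8 input when it assumes ISO-8859-1. The helper `escape_as_latin1` is hypothetical and is not the function the module calls; it only illustrates the suspected failure mode.

```python
from html.entities import codepoint2name

utf8_bytes = "Пловдив".encode("utf-8")


def escape_as_latin1(raw: bytes) -> str:
    """Escape each byte as if it were one ISO-8859-1 character."""
    out = []
    for byte in raw:
        name = codepoint2name.get(byte)
        if byte >= 0x80 and name:
            # Non-ASCII byte with a named Latin-1 entity, e.g. 0xD0 -> &ETH;
            out.append(f"&{name};")
        else:
            # ASCII bytes, and bytes without a named entity, pass through.
            out.append(chr(byte))
    return "".join(out)


escaped = escape_as_latin1(utf8_bytes)
print(escaped)
```

Every two-byte Cyrillic letter ends up as a pair of unrelated Latin-1 entities or control characters, so the result decodes to Latin-1 garbage no matter what encoding the XML header declares.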
I'm not sure why I'm NOT getting that result, but it's probably to do with versions, libraries and settings on my PHP5 platform.
FWIW, the older drupal-xml version does NOT use proper XML functions - it just builds trees by string concatenation. Which in this case avoids the problem. By being naive.
Bloody hell.
Ah well, I'll roll in an attempted fix to the D6-dev, but I feel I still haven't solved all problems.
Any suggestions on encoding in XML would be appreciated.
#2
Input from anyone who is more knowledgeable about character encodings and XML and things is welcome. I can't fix this from here.