Exported RDF and TCS don't handle Cyrillic characters

Ognyan Kulev - December 22, 2008 - 09:15
Project: Taxonomy import/export via XML
Version: 6.x-1.2
Component: Code
Category: bug report
Priority: normal
Assigned: Unassigned
Status: postponed
Description

Exported XML is OK, but exported RDF and TCS contain broken Cyrillic, and I'm sure it's not only about Cyrillic but anything outside the ISO-8859-1 character set. Samples:

XML (correct):
<term><tid>8</tid><vid>4</vid><name>Пловдив</name><description></description><weight>0</weight><depth>0</depth><parent>0</parent></term>

RDF:

<rdfs:Class rdf:ID="term-8">
<rdfs:label>&ETH;Ÿ&ETH;&raquo;&ETH;&frac34;&ETH;&sup2;&ETH;&acute;&ETH;&cedil;&ETH;&sup2;</rdfs:label>
<rdfs:isDefinedBy rdf:resource="#vocabulary-4"/>
</rdfs:Class>

TCS:

<TaxonConcept id="term-8">
<Name scientific="true" ref="?">ПлПвЎОв</Name>
<TaxonRelationships/>
</TaxonConcept>
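
One reading that fits the RDF sample: the UTF-8 bytes of the label were treated as single-byte ISO-8859-1 characters, and each one was replaced by its HTML named entity. The Python sketch below tests that hypothesis. It is not the module's code, and the function name is made up for illustration. Run on "Пловдив", it produces the same string as the rdfs:label above.

```python
from html.entities import codepoint2name


def entity_encode_as_latin1(text: str) -> str:
    """Hypothetical helper: entity-encode UTF-8 bytes as if they were Latin-1."""
    out = []
    for byte in text.encode("utf-8"):
        name = codepoint2name.get(byte)
        if name is not None:
            # Code points 0xA0-0xFF (and a few ASCII specials) have HTML names.
            out.append(f"&{name};")
        else:
            # No named entity: the byte passes through unchanged. A viewer
            # assuming Windows-1252 then shows 0x9F as "Ÿ".
            out.append(bytes([byte]).decode("cp1252", errors="replace"))
    return "".join(out)


print(entity_encode_as_latin1("Пловдив"))
```

Each Cyrillic letter here is two bytes in UTF-8, starting with 0xD0, which is why every pair in the sample begins with &ETH;.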

#1

dman - January 8, 2009 - 12:27

Gawd I hate encodings.

OK, I just spent a lot longer trying to figure this out than really should have been necessary.
On one hand, some of my text editors would not recognise that the file was UTF-8. I tried all sorts of tricks with Server headers, but it seems it cannot be applied to files in a way that sticks when files are downloaded.
Opening the dodgy files in XML editors - or Firefox directly - gives me the expected results.

So I'll try naming the exports as taxonomy_name.rdf.xml and taxonomy_name.tcs.xml. That at least gives everything enough info that it should believe the UTF-8 directive found in the XML header.
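
To illustrate why XML-aware tools get this right while plain text editors may not, here is a small Python sketch. The sample document is made up, reusing the term from the report. The parser takes the encoding from the XML declaration inside the file, so it needs neither a server header nor a particular file name.

```python
import xml.etree.ElementTree as ET

# Bytes as they would sit on disk: UTF-8 with an explicit XML declaration.
raw = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<term><name>Пловдив</name></term>"
).encode("utf-8")

# The parser reads the encoding from the declaration in the data itself.
root = ET.fromstring(raw)
name = root.findtext("name")
print(name)
```

A plain text editor does not have to honour that declaration, which would explain the difference between the XML editors and the others.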

But still, the result I was getting was not identical to the one described here.
I keep getting
<name>–ü–ª–æ–≤–¥–∏–≤</name>
- which is at least recognizable as 2-byte Unicode gone wrong.
Flicking the UTF-8 toggle in the file properties of my text editor fixes it. But it's not sticky.
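
The garbled name above looks like correct UTF-8 bytes being displayed as Mac OS Roman. That is a guess from the sample, since the editor in question is not named. A short Python sketch of the assumption, which also shows that the bytes themselves are undamaged:

```python
original = "Пловдив"

# Correct UTF-8 bytes, displayed by an editor that assumes Mac OS Roman.
misread = original.encode("utf-8").decode("mac_roman")
print(misread)

# Nothing was lost: undoing the wrong guess restores the original text.
recovered = misread.encode("mac_roman").decode("utf-8")
print(recovered)
```

If that is what is happening, the export is already valid UTF-8 and only the viewer's guess is wrong, which fits the toggle fixing it.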

YOUR RDF results look more like a result of entity encoding. Probably a result of passing the input through xmlentities() before saving. Which I THOUGHT was safest.
I'm not sure why I'm NOT getting that result, but it's probably to do with versions, libraries and settings on my PHP5 platform.

FWIW, the older drupal-xml version does NOT use proper XML functions - it just builds trees by string concatenation. Which in this case avoids the problem. By being naive.
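
A sketch of why the naive approach survives, written in Python rather than the module's PHP and using a made-up helper name. It escapes only the characters that are special in XML markup and leaves everything else as-is, so the non-ASCII text goes through untouched and is encoded as UTF-8 once, at write time.

```python
from xml.sax.saxutils import escape


def build_term(name: str) -> str:
    # Hypothetical helper: escape() only rewrites &, < and >.
    return f"<term><name>{escape(name)}</name></term>"


xml_text = '<?xml version="1.0" encoding="UTF-8"?>' + build_term("Пловдив & <co>")
payload = xml_text.encode("utf-8")  # bytes that would go into the export file
print(xml_text)
```

This assumes the input strings are already valid Unicode text. It says nothing about what the XML library in the newer version does internally.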

Bloody hell.
Ah well, I'll roll in an attempted fix to the D6-dev, but I feel I still haven't solved all problems.

Any suggestions on encoding in XML would be appreciated.

#2

dman - March 25, 2009 - 22:31
Status: active » postponed

Input from anyone who is more knowledgeable about character encodings and XML and things is welcome. I can't fix this from here.

 
 
