RDF import issue with duplicate terms
jjalocha - May 24, 2009 - 14:34
| Project: | Taxonomy import/export via XML |
| Version: | 6.x-1.3 |
| Component: | RDF/XML format |
| Category: | support request |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | closed |
Description
The resulting hierarchy tree is wrong when importing an RDF file with duplicate terms. This could be a bug, or I might be doing something wrong, but I have double-checked everything.
In a sample extract of a larger vocabulary, the following hierarchy is expected:
Arica y Parinacota
-Arica
--Arica
--Camarones
-Parinacota
--Putre
--General Lagos
But I get the following output after import:
Arica
-Camarones
Arica y Parinacota
-Parinacota
--General Lagos
--Putre
This is a sample data pair (see the attachment for the complete RDF file):
<rdfs:Class rdf:ID="15">
<rdfs:label>Arica y Parinacota</rdfs:label>
<rdfs:comment>XV. Región de Arica y Parinacota</rdfs:comment>
<rdfs:isDefinedBy rdf:resource="#vocabulary-chile"/>
</rdfs:Class>
<rdfs:Class rdf:ID="151">
<rdfs:label>Arica</rdfs:label>
<rdfs:comment>Provincia de Arica</rdfs:comment>
<rdfs:isDefinedBy rdf:resource="#vocabulary-chile"/>
<rdfs:subClassOf rdf:resource="#15">Arica y Parinacota</rdfs:subClassOf>
</rdfs:Class>
And these are the import settings:
Target vocabulary: [Create new]
Data Source: Upload File
Format of file: RDF
Recurse down the taxonomy tree: YES/NO (tested both)
Allow duplicate terms: YES

| Attachment | Size |
|---|---|
| rdf-extract.txt | 2.42 KB |

#1
Hm. That's a tricky one.
GREAT PROBLEM REPORT though... thanks for narrowing it down to the information I need. And a good example use-case.
OK, we've got two behaviours. The soft behaviour is to do a string match every time a term is found, to see if a record already exists. When one is found, the new information is merged over the top of it. In this case, that collapses the Provincia and the Comuna into one concept.
Not what we wanted.
The strict behaviour is to always create a new item, and not trust string matches. That's what is supposed to happen when you check 'Allow duplicate terms'.
Note that this behaviour will deliberately NOT match with earlier imports, so such an import should only be done once; repeating it creates the terms all over again.
The problem is that checkbox never worked for the RDF format - it was never coded in. I've now turned that on, and now I do get the result we would expect!
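
To illustrate the difference between the two behaviours, here is a minimal Python sketch. It is not the module's PHP code and does not try to reproduce its exact output; it only shows how matching on the label collapses same-named terms, while keying on the rdf:ID keeps them apart. The IDs 15101 and 15102 and the function name `import_terms` are made up for the example; only 15 and 151 come from the sample above.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

# Cut-down sample: IDs 15101 and 15102 are hypothetical.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:rdfs="{RDFS}">
  <rdfs:Class rdf:ID="15"><rdfs:label>Arica y Parinacota</rdfs:label></rdfs:Class>
  <rdfs:Class rdf:ID="151"><rdfs:label>Arica</rdfs:label>
    <rdfs:subClassOf rdf:resource="#15"/></rdfs:Class>
  <rdfs:Class rdf:ID="15101"><rdfs:label>Arica</rdfs:label>
    <rdfs:subClassOf rdf:resource="#151"/></rdfs:Class>
  <rdfs:Class rdf:ID="15102"><rdfs:label>Camarones</rdfs:label>
    <rdfs:subClassOf rdf:resource="#151"/></rdfs:Class>
</rdf:RDF>"""


def import_terms(xml_text, allow_duplicates):
    """Return {key: {"label": ..., "parent": ...}} for every imported term.

    Strict mode (allow_duplicates=True) keys each term by its rdf:ID.
    Soft mode keys by label, so same-named terms merge into one record.
    """
    root = ET.fromstring(xml_text)
    classes = root.findall(f"{{{RDFS}}}Class")

    key_of = {}   # rdf:ID -> key actually used for the term
    terms = {}
    for cls in classes:
        rdf_id = cls.get(f"{{{RDF}}}ID")
        label = cls.findtext(f"{{{RDFS}}}label")
        key = rdf_id if allow_duplicates else label
        key_of[rdf_id] = key
        terms.setdefault(key, {"label": label, "parent": None})

    # Second pass: resolve parents once every term is known.
    for cls in classes:
        key = key_of[cls.get(f"{{{RDF}}}ID")]
        parent = cls.find(f"{{{RDFS}}}subClassOf")
        if parent is None:
            continue
        parent_key = key_of[parent.get(f"{{{RDF}}}resource").lstrip("#")]
        if parent_key != key:  # a merged term must not become its own parent
            terms[key]["parent"] = parent_key
    return terms


print(len(import_terms(SAMPLE, allow_duplicates=False)))  # 3: both "Arica" terms merged
print(len(import_terms(SAMPLE, allow_duplicates=True)))   # 4: Provincia and Comuna kept apart
```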

Note that upon import, Drupal will create its own term IDs, and the previous ones in that RDF file will be lost. This is because Drupal does not allow us to use our own IDs, and we don't (yet) have a field for GUIDs on taxonomy terms.
However, you CAN try the 're-use IDs' option if it's a new site and there are no other vocabularies to conflict with yet.
Anyway, too much theory.
Please try the DRUPAL-6--2-DEV version soon, or cvs from DRUPAL-6--2 and see if that helps.
Recommended: delete all 6.x-1.x files entirely first. DRUPAL-6--2 is a new branch because it has a different folder structure.
#2
I am sorry that I couldn't test this earlier.
Now, on a clean install, with 6.x-2.x-dev from 2009-May-25, the complete RDF hierarchy loads perfectly!
Thank you very much for the quick help!
Tell me if you need any further testing. I'll be working with different document formats (XML, CSV, RDF) and will keep a close eye on any problems.
Cheers,
Jerzy
#3
It'll be helpful to have someone trying out the 6--2 version. I moved a few things around in it, and there may be bits I forgot.
If it's working for you, I can probably make that the recommended release.
Can you tell me where you got the input RDF from? Did you generate it yourself? Was the list in another format that may be useful to import from?
#4
Since I am not really familiar with Drupal development and CVS, I used the instructions in http://drupal.org/node/17918/cvs-instructions/DRUPAL-6--2 for getting the module.
This one worked fine on a clean install of Drupal 6.12 for both the XML and RDF versions of my vocabulary. The structure, ordering, and comments are preserved correctly.
I created the vocabulary in an OpenOffice spreadsheet, and extracted a "raw" XML from the ODS file with XSLT. From there, I derived the RDF and XML versions with other XSLT stylesheets. I can post all these tools, or PM them, if they are of any help. The current attachment contains the RAW, XML, and RDF versions.
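
As a rough idea of the last step (rows to RDF), here is a Python sketch rather than the XSLT stylesheets mentioned above, which are not shown here. The row layout (ID, label, comment, parent ID) and the function name `rows_to_rdf` are assumptions made for the example; the element names follow the sample data pair in the issue description.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
ET.register_namespace("rdf", RDF)
ET.register_namespace("rdfs", RDFS)

# Assumed spreadsheet layout: (ID, label, comment, parent ID or None).
ROWS = [
    ("15", "Arica y Parinacota", "XV. Región de Arica y Parinacota", None),
    ("151", "Arica", "Provincia de Arica", "15"),
]


def rows_to_rdf(rows, vocabulary="vocabulary-chile"):
    """Serialise rows as rdfs:Class elements, one per term."""
    root = ET.Element(f"{{{RDF}}}RDF")
    for term_id, label, comment, parent_id in rows:
        cls = ET.SubElement(root, f"{{{RDFS}}}Class", {f"{{{RDF}}}ID": term_id})
        ET.SubElement(cls, f"{{{RDFS}}}label").text = label
        ET.SubElement(cls, f"{{{RDFS}}}comment").text = comment
        ET.SubElement(cls, f"{{{RDFS}}}isDefinedBy",
                      {f"{{{RDF}}}resource": f"#{vocabulary}"})
        if parent_id is not None:
            # The parent is referenced by ID, never by label.
            ET.SubElement(cls, f"{{{RDFS}}}subClassOf",
                          {f"{{{RDF}}}resource": f"#{parent_id}"})
    return ET.tostring(root, encoding="unicode")


print(rows_to_rdf(ROWS))
```

Referencing the parent by ID keeps same-named terms at different levels distinguishable in the generated file.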
I am actually no longer interested in the CSV and RDF versions, because it seems that only XML allows me both to define the order of the terms and to attach comments. But I could still derive other file formats with XSLT if necessary.
The next thing I would like to do is work with synonyms. I want to define some major cities that are equivalents at comuna or provincia levels. I think it would be natural to define them in a separate vocabulary, and it seems like both XML and RDF allow me to link synonyms from other vocabularies.
#5
I was mainly interested in the process you went through to create the RDF. The specification is a bit loose, so I'm looking at what people need from it.
Going from ODS through XSL sounds like fun! Was the information/docs/samples helpful enough to get you started with that? What was missing?
RDF can indeed support comments - as a rdfs:comment (if that wasn't working, let me know)
But no, ordering was not part of the RDF spec that I could find and re-use. Do you know of any useful standards based namespace:attribute that could meaningfully match the Drupal-only 'weight' property? I really did not want the RDF syntax to include anything that wasn't an existing standard or recommendation.
I am not sure about synonyms across vocabularies. It's not well supported (in fact synonyms are totally useless) in Drupal core. You'll need some other module to support taxonomy_relations.
I can see why you'd want to put cities in a different vocab from regions ... but it may be simpler to just put them as child terms - that's what most folk end up doing. It's not semantically perfect, but ends up intuitive for everyone.
#6
> Going from ODS through XSL sounds like fun!
It was not as bad as expected. In fact, most stuff can simply be ignored, and the XSL stylesheet is very simple. I will definitely use this method again in similar situations.
> Was the information/docs/samples helpful enough to get you started with that? What was missing?
I really think that the documentation of this module is excellent. There is still some information that could be improved:
In the "Information on the formats readable by taxonomy_xml" document, the XML sample code doesn't show all element types. Something more complete would be:
(REMARK: You can see here that when I export as XML, I get no element for the 'Related terms' value. A bug?)
<vocabulary>
<vid>5</vid>
<name>Editorial sections</name>
<description>A hierarchical vocabulary.</description>
<help>Help text here.</help>
<relations>1</relations>
<hierarchy>0</hierarchy>
<multiple>1</multiple>
<required>0</required>
<tags>0</tags>
<module>taxonomy</module>
<weight>0</weight>
<nodes>blog,page,story</nodes>
<term>
<tid>83</tid>
<vid>5</vid>
<name>Analysis</name>
<description>Examines the connections between known facts.</description>
<weight>0</weight>
<depth>0</depth>
<parent>0</parent>
<synonyms>Study</synonyms>
</term>
<term>
...
</term>
</vocabulary>
For someone who is creating an XML vocabulary from scratch, it would be interesting to have a description of (only) the elements that are interpreted when 'Create new' target vocabulary is chosen instead of 'Determined by source file'. I.e., probably the following, without the IGNORED ones:
> RDF can indeed support comments - as a rdfs:comment (if that wasn't working, let me know)
In the examples I've been working with, the comments work perfectly.
> Do you know of any useful standards based namespace:attribute that could meaningfully match the Drupal-only 'weight' property?
You probably will have to look at the RDF containers:
Sadly, I have been unable to find an example use-case of RDF containers together with rdfs:Class. Instead, all examples make use of rdf:li, and look like http://www.w3.org/TR/2004/REC-rdf-primer-20040210/#rss:
<items><rdf:Seq>
<rdf:li rdf:resource="http://www.w3.org/News/2002#item164"/>
<rdf:li rdf:resource="http://www.w3.org/News/2002#item168"/>
<rdf:li rdf:resource="http://www.w3.org/News/2002#item167"/>
</rdf:Seq>
</items>
Note that the use of an element like <items> here seems to be crucial:
I think we need the help of an RDF guru here!
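
For what it's worth, reading the order back out of such a container is simple. The following Python sketch takes the position of each rdf:li inside the rdf:Seq as a weight. That mapping to the Drupal 'weight' property is only an assumption for illustration, not something the module is known to do, and the function name `weights_from_seq` is made up.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The rdf:Seq example from the RDF Primer, with the rdf namespace declared
# on the wrapper element so that the fragment parses on its own.
SAMPLE = f"""<items xmlns:rdf="{RDF}"><rdf:Seq>
<rdf:li rdf:resource="http://www.w3.org/News/2002#item164"/>
<rdf:li rdf:resource="http://www.w3.org/News/2002#item168"/>
<rdf:li rdf:resource="http://www.w3.org/News/2002#item167"/>
</rdf:Seq>
</items>"""


def weights_from_seq(xml_text):
    """Map each rdf:li resource to its position in the rdf:Seq (0, 1, 2, ...)."""
    root = ET.fromstring(xml_text)
    seq = root.find(f"{{{RDF}}}Seq")
    weights = {}
    for position, li in enumerate(seq.findall(f"{{{RDF}}}li")):
        weights[li.get(f"{{{RDF}}}resource")] = position
    return weights


for resource, weight in weights_from_seq(SAMPLE).items():
    print(weight, resource)
```

This still leaves open how to tie such a sequence to the rdfs:Class entries of a vocabulary.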
#7
Automatically closed -- issue fixed for 2 weeks with no activity.