Announcing intention for a large rewrite of taxonomy import/export.

dman - August 25, 2008 - 08:24
Project:Taxonomy import/export via XML
Version:6.x-1.x-dev
Component:Code
Category:feature request
Priority:normal
Assigned:dman
Status:closed
Description

A while ago I (locally) extended taxonomy_xml to support CSV and RDF imports of distributed vocabularies from a number of sources. I hadn't got everything stable enough to release at the time, but I've now revisited it and found that most of my patches could be re-applied without trouble.

With the http://sprint.eol.org/ project on the horizon, I thought it would be constructive to get this to a useful state so that we have better ways of swapping taxonomies.
I'll attach two of my research documents outlining my motivations and methods.

I'm proposing a new 5.x-2 branch and a similar 6.x-2 one too. What I've done is splinter the original XML parser phase into its own .inc file, alongside a CSV parser and an RDF parser, each of which does the same thing... differently. So it's a large code change.
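
As a rough illustration of that layout (a sketch in Python rather than the module's PHP, not the module's actual code): each format gets its own parser, and every parser returns the same term structure so the import step does not care where the data came from. The function names, the 'term,parent' CSV layout and the <term> XML element are all assumptions made up for this example.

```python
import csv
import io
import xml.etree.ElementTree as ET


def parse_csv(text):
    """Assumed CSV layout: one 'term,parent' pair per row; parent may be empty."""
    terms = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        name = row[0].strip()
        parent = row[1].strip() if len(row) > 1 else ""
        terms.append({"name": name, "parent": parent or None})
    return terms


def parse_xml(text):
    """Assumed XML layout: <term name="..." parent="..."/> elements."""
    root = ET.fromstring(text)
    return [
        {"name": el.get("name"), "parent": el.get("parent")}
        for el in root.iter("term")
    ]


# One entry per format; each parser yields the same list-of-dicts shape.
PARSERS = {"csv": parse_csv, "xml": parse_xml}


def import_terms(text, fmt):
    """Pick the parser for the chosen format and return its terms."""
    if fmt not in PARSERS:
        raise ValueError(f"unsupported format: {fmt}")
    return PARSERS[fmt](text)
```

Adding another format under this scheme would mean writing one more parser function and registering it, without touching the code that saves the terms.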

This is a notice of intent to push this module forward ... eventually towards being able to support an online taxonomy server for distributed vocabs, although that may be an optional extension.
If anyone has objections or inspiration, please weigh in with your input this week.

I've referred to Wordnet/RDF + Web Ontology Language (OWL) for the target dialect of XML used in this export schema.

Words and Terms come from, and are uniquely identified by, the existing wordnet vocabulary, and their relationships are described using the RDF Schema 'ParentOf' and 'ChildOf' terms etc.

This modification of the taxonomy_xml.module is intended for two uses.

  1. To assist in migrating taxonomies between cloned sites, e.g. dev and live copies of essentially the same site. To this end, some effort has been put into maintaining vocabulary IDs and term IDs, because once they get out of sync, cloning and replication are almost a lost cause.
  2. To become a foundation for a Taxonomy Interchange initiative [Taxonomy Server]. That makes it, I guess, somewhat similar to all those other 'taxonomy warehouses', but we intend to publish these shared taxonomies for import/export in a way that allows Drupal sites (or other related technologies) to share this data.

Sources of Taxonomies
The following sites provide downloadable taxonomies, thesauri or glossaries that are at least partly compatible with this import tool.

Patches and updates will be forthcoming soon...

Attachment            Size
formats.html_.txt     5.89 KB
theory.html_.txt      8.55 KB

#1

dman - September 3, 2008 - 16:49
Status:active» needs review

OK, following the resounding interest in this project, I've gone and upgraded the whole thing to
Taxonomy XML 5-2

I've got a D6 version too (based on the current D6 release) that provides the same features. I'll branch and roll that soon.
I'll try deploying it on a few test machines before taking it from dev to 5-2-1 release.

#2

dman - September 6, 2008 - 15:51

All the above features have now been rolled into the 6--1-dev also.
I intended to create a whole 6--2-dev branch for it ... but did something wrong and just updated the 6--1. OK, so it goes.

Get a (currently working OK for me) snapshot at http://drupal.org/node/304827 Hereby known as taxonomy_xml:DRUPAL-6--1-1

Get the latest dev at http://drupal.org/node/245263

#3

pkej - October 28, 2008 - 13:33

OK, please explain to me what the relation is to wordnet!

I'm very interested as I am about to create an online dictionary for the Sami language and hope to include linking to the English translations, and thus create a 1:1 wordnet style layout.

#4

pkej - October 29, 2008 - 12:16

I just got my hands on a 5.x wordnet module developed by angeliti, who now works for MS. He never released it. I have to set up a 5.x site for this to test out the php and see how it works.

Do you already have code for reading the wordnet data files?

Best regards,
Paul

#5

pkej - October 29, 2008 - 12:46

I've tested your module against the different xml/rdf files in the links you supplied, but nothing gets imported. Am I missing something?

#6

dman - October 29, 2008 - 21:29

The thing with wordnet is that it provides a system and syntax for describing relationships between words:
synonym, hypernym - broader term, hyponym - narrower term. The import supports that terminology, because those are things that make up a Drupal vocabulary.
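
To make that mapping concrete, here is a small sketch (Python, purely illustrative, not how the module is written). It assumes a broader term becomes a parent, a narrower term becomes a child, and a synonym is stored on the term itself; the Term class and apply_relation function are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """Hypothetical stand-in for a vocabulary term."""
    name: str
    parents: list = field(default_factory=list)
    synonyms: list = field(default_factory=list)


def apply_relation(terms, subject, relation, obj):
    """Record one wordnet-style statement about `subject` in the `terms` dict."""
    term = terms.setdefault(subject, Term(subject))
    if relation == "synonym":
        term.synonyms.append(obj)
    elif relation == "hypernym":
        # obj is the broader term, so it becomes a parent of subject.
        term.parents.append(obj)
    elif relation == "hyponym":
        # obj is the narrower term, so subject becomes a parent of obj.
        terms.setdefault(obj, Term(obj)).parents.append(subject)
    else:
        raise ValueError(f"unknown relation: {relation}")
    return terms
```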

2 years ago it looked like wordnet was going to be a useful central resource.
At one point, there was even an xml/rdf web service that provided wordnet lookups, which was great for spidering trees of concepts. The import was able to parse input from that. However that service is now broken or something (I lost it, there are only these listed, and the official web service is hopeless).

I'm not sure that Drupal taxonomy is the right way to be building a translation database. I don't know anything about internationalization of terms, but it may be possible.

I'll see if there is something wrong with the current dev version. I've been changing some things in it to allow LARGE imports, so I may have broken the simple ones.

#7

pkej - October 30, 2008 - 09:53

I did find this from the wordnet page: http://sourceforge.net/projects/texai, the interesting thing is that they have made an RDF which is a mashup of wordnet and wiktionary. Their downloads are here: http://sourceforge.net/project/showfiles.php?group_id=176781 and they have a blog at their site with more info.

As for translation, I tried the latest versions for 6.x and they are worse than the 5.x. I might have made some mistakes, but I delivered a site last year using 5.x and translation, where I actually got things to work. I'm getting a bit disillusioned with the translation effort.

I should of course contribute.

Anyway, check out your import, I tried some of the RDFs from w3c without any success. BTW, I downloaded from your link in this thread.

#8

dman - October 30, 2008 - 11:05
Version:5.x-1.x-dev» 6.x-1.x-dev

I just reviewed all the examples in the current Drupal6-dev distro and they all worked mostly as expected. There were a few tidy-ups with the hierarchies, but all the examples are good to go.
You do have to choose the right format - 'CSV' for CSV files etc, but they were all doing what they should on a clean D6 install. Can you describe what you tried and what happened?

I'm not sure which W3C downloads you've been expecting to define vocabularies for. Only a few (very few) RDF files are actually intended to define vocabularies. Most other RDF files are just making statements about things.
You need inputs that contain statements similar to the examples, using rdfs:Class and rdfs:subClassOf statements. Anything else is simply saying something different. I used a subset of the w3c demo WineOntology that defines grape varietals as a model for taxonomy tagging, and am always looking for useful schema examples that have been published for real-world vocabularies. I just added a new one today from the International Press Telecommunications Council.
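
For anyone unsure what such an input looks like, here is a minimal sketch in Python (not the module's own PHP parser). The sample document and its class names are made up for illustration; only the rdf: and rdfs: namespace URIs are the standard ones. It reads each rdfs:Class and, where present, its rdfs:subClassOf target, which is the parent/child information a vocabulary import needs.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

# Made-up sample: two classes, one declared as a subclass of the other.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdfs:Class rdf:ID="Grape"/>
  <rdfs:Class rdf:ID="WineGrape">
    <rdfs:subClassOf rdf:resource="#Grape"/>
  </rdfs:Class>
</rdf:RDF>
"""


def class_hierarchy(rdf_xml):
    """Map each rdfs:Class name to its parent class name (or None)."""
    root = ET.fromstring(rdf_xml)
    parents = {}
    for cls in root.iter(f"{{{RDFS}}}Class"):
        name = cls.get(f"{{{RDF}}}ID") or cls.get(f"{{{RDF}}}about")
        sub = cls.find(f"{{{RDFS}}}subClassOf")
        target = sub.get(f"{{{RDF}}}resource") if sub is not None else None
        # Local references look like '#Grape'; drop the leading '#'.
        parents[name] = target.lstrip("#") if target else None
    return parents


hierarchy = class_hierarchy(SAMPLE)
```

An RDF file with no rdfs:Class statements would produce an empty mapping here, which matches the "nothing gets imported" symptom for files that only make statements about other things.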

The link to texai says it's a chatbot? Um? It does appear to have a 16MB-zipped file containing a version of wordnet, but that's not a web service that can be crawled. Still, the syntax may be compatible, if you are able to slice out the bits you want.

#9

dman - August 30, 2009 - 11:21
Status:needs review» closed

issue queue cleanup. Old news

 
 
