Posted by liquidcms on February 22, 2009 at 10:14pm
| Project: | Taxonomy import/export via XML |
| Version: | 6.x-1.x-dev |
| Component: | User interface |
| Category: | support request |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | closed (fixed) |
Issue Summary
I am trying to import MeSH into a Drupal vocab. An example XML node looks like:
<DescriptorRecord ...><!-- Descriptor -->
<DescriptorUI>D000005</DescriptorUI>
<DescriptorName><String>Abdomen</String></DescriptorName>
<Annotation> region &amp; abdominal organs...
</Annotation>
<ConceptList>
<Concept PreferredConceptYN="Y"><!-- Concept -->
<ConceptUI>M0000005</ConceptUI>
<ConceptName><String>Abdomen</String></ConceptName>
<ScopeNote> That portion of the body that lies
between the thorax and the pelvis.</ScopeNote>
<TermList>
<Term ... PrintFlagYN="Y" ... ><!-- Term -->
<TermUI>T000012</TermUI>
<String>Abdomen</String><!-- String = the term itself -->
<DateCreated>
<Year>1999</Year>
<Month>01</Month>
<Day>01</Day>
</DateCreated>
</Term>
<Term IsPermutedTermYN="Y" LexicalTag="NON">
<TermUI>T000012</TermUI>
<String>Abdomens</String>
</Term>
</TermList>
</Concept>
</ConceptList>
</DescriptorRecord>
i only want the <String> fields to be converted into tax terms, so in the above example i would have:
Abdomen
- Abdomen
-- Abdomen
-- Abdomens
does this module have a method (sort of like the csv cck import module does) to select what should be imported from the xml?
Comments
#1
I actually reviewed the MeSH vocab when designing the process (among many others), and partially support the naming scheme it uses among the allowed synonyms.
But analyzing arbitrary XML structure (being able to read and understand any made-up XML dialect) is out of scope, and pretty much always will be. XML is potentially so much more complex than CSV that the logic to read and understand it needs many more layers than a simple 1:1 CSV->CCK field map. At some point you need to understand both the subject data and some XPath to be able to extract the data.
Options are:
- express the taxonomy in an RDF-compatible way (this can be done with some XSL, and would be helpful to the MeSH project)
- write a MeSH-only format parser (this is actually not too hard - see the TCS-format example.)
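As a rough illustration of the second option, a MeSH-only parser mostly comes down to a few fixed paths. The sketch below is in Python rather than the module's PHP, uses only the standard library, and works on a trimmed copy of the record from the issue summary. The function name `descriptor_outline` is made up for this example; it is not part of the module.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the record from the issue summary; the attributes elided
# with "..." there are simply left out so that the sample parses.
SAMPLE = """
<DescriptorRecord>
  <DescriptorUI>D000005</DescriptorUI>
  <DescriptorName><String>Abdomen</String></DescriptorName>
  <ConceptList>
    <Concept PreferredConceptYN="Y">
      <ConceptUI>M0000005</ConceptUI>
      <ConceptName><String>Abdomen</String></ConceptName>
      <TermList>
        <Term PrintFlagYN="Y">
          <TermUI>T000012</TermUI>
          <String>Abdomen</String>
        </Term>
        <Term IsPermutedTermYN="Y" LexicalTag="NON">
          <TermUI>T000012</TermUI>
          <String>Abdomens</String>
        </Term>
      </TermList>
    </Concept>
  </ConceptList>
</DescriptorRecord>
"""


def descriptor_outline(record):
    """Return (depth, label) pairs: the descriptor, its concepts, their terms."""
    outline = [(0, record.findtext("DescriptorName/String"))]
    for concept in record.findall("ConceptList/Concept"):
        outline.append((1, concept.findtext("ConceptName/String")))
        for term in concept.findall("TermList/Term"):
            outline.append((2, term.findtext("String")))
    return outline


outline = descriptor_outline(ET.fromstring(SAMPLE))
for depth, label in outline:
    # Same "-" prefix notation as the issue summary.
    print(("-" * depth + " " + label).strip())
```

The paths are hard-coded on purpose: that is what makes it a single-format parser instead of a general XML mapper.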
#2
thanks.. i had figured i would look at doing my own parser - will check out the TCS example
actually i've also been thinking that an even better way is to simply NOT import, but to do an ajax/curl app that reads their browser tree direct from their site and adds freetags as the user selects items.
thanks for the tips.
#3
Yeah, I looked at their tree, and can sort of imagine scraping it - but it seems a shame when they already have a reasonably mature XML to start with.
If only it was provided in bits - so we could consume it as a service, not a huge lump.
All they need is a link on the html page to the XML version of that data.
The TCS example is a VERY similar format, in that it is also modelled on (what it calls) the TaxonConcept + TaxonTerm model, which MeSH has also evolved into.
Note that Drupal core taxonomy does NOT natively support that level of complexity. In Drupal the term label represents the term concept directly. But we can map it close enough for most cases using synonyms.
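To make that mapping concrete, here is a rough sketch, in Python for brevity, of flattening one concept and its terms into a single label plus a synonym list. The dict shape and the name `concept_to_term` are made up for illustration; they are not the module's or Drupal's actual data structures.

```python
def concept_to_term(concept_name, term_strings):
    """Collapse a concept and its terms into one label plus synonyms.

    The concept name becomes the term label; every other distinct string
    is kept as a synonym, in its original order.
    """
    synonyms = []
    for text in term_strings:
        if text != concept_name and text not in synonyms:
            synonyms.append(text)
    return {"name": concept_name, "synonyms": synonyms}


term = concept_to_term("Abdomen", ["Abdomen", "Abdomens"])
print(term)
```

The distinction between the concept and its preferred term is lost here, which is the "close enough" part of the trade-off.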
For such a huge dataset, it does feel like overkill to import the whole thing (I don't know your use case) and maybe treating their browser as an API could be beneficial.
What they really need to do of course is provide a genuine API we could talk to. I'd be able to learn something from that.
#4
yes, a real api would be the way to go - it is possible the project i am on may have the pull to discuss that with them.
for now, i just need to figure out how they map their xml to the tree structure i see with their browser.. i have xmlreader code to convert to php array.. but i have missed where parent/child relationships are defined.. but, i'll get it..
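For a file that size, a streaming read along the same lines as the xmlreader approach could look like the sketch below. It is Python rather than PHP, and `iter_descriptors` and the dict keys are made-up names for this example. It assumes the records sit directly under a wrapper element (called `DescriptorRecordSet` here, also an assumption), and it does not address the parent/child question.

```python
import io
import xml.etree.ElementTree as ET

TERM_PATH = "ConceptList/Concept/TermList/Term/String"


def iter_descriptors(source):
    """Yield one small dict per DescriptorRecord while streaming the input."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "DescriptorRecord":
            continue
        yield {
            "ui": elem.findtext("DescriptorUI"),
            "name": elem.findtext("DescriptorName/String"),
            "terms": [s.text for s in elem.iterfind(TERM_PATH)],
        }
        # Drop the finished record's children; only an empty shell
        # stays attached to the root element.
        elem.clear()


# Tiny in-memory stand-in for the real file. The second record is a
# made-up placeholder, not real MeSH data.
SAMPLE = b"""<DescriptorRecordSet>
  <DescriptorRecord>
    <DescriptorUI>D000005</DescriptorUI>
    <DescriptorName><String>Abdomen</String></DescriptorName>
    <ConceptList><Concept><TermList>
      <Term><String>Abdomen</String></Term>
      <Term><String>Abdomens</String></Term>
    </TermList></Concept></ConceptList>
  </DescriptorRecord>
  <DescriptorRecord>
    <DescriptorUI>X000000</DescriptorUI>
    <DescriptorName><String>Placeholder</String></DescriptorName>
  </DescriptorRecord>
</DescriptorRecordSet>"""

records = list(iter_descriptors(io.BytesIO(SAMPLE)))
print(records)
```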
#5
Doesn't need to be too much of a service/API, just a consistent URL pattern that can return an XML snippet based on the ConceptUI given as an arg. (almost described in about_services.txt)
The guys at Encyclopedia of Life were able to whip
http://services.eol.org/lifedesk/service.php?function=details_tcs&id=188...
together for me in no time, once I explained what would be useful... It's not exactly documented, but it's self-explanatory.
An explanation (of sorts) of how to define relationships, which I wrote earlier today, is here:
http://drupal.org/node/381412#comment-1283560
The trick is that Drupal needs term IDs to set up relationships. My steps work around that - as best I can for the jobs I've needed to do so far. We need to use either the string name or the external GUID when establishing our relationships.
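The general shape of that workaround can be sketched in Python with invented names (`import_terms`, the in-memory `store`, and the `guid`/`parent_guid` keys are all hypothetical, not the module's API): create every term first and remember which local ID each external GUID received, then make a second pass to wire up parents by looking the GUIDs up.

```python
def import_terms(records):
    """Two-pass import: create terms, then link parents via external GUIDs."""
    store = {}        # local term ID -> term data (stand-in for the database)
    tid_by_guid = {}  # external GUID -> local term ID

    # Pass 1: create every term so that each one has a local ID.
    for tid, rec in enumerate(records, start=1):
        store[tid] = {"name": rec["name"], "parent": None}
        tid_by_guid[rec["guid"]] = tid

    # Pass 2: parents can now be resolved, whatever order the records came in.
    for rec in records:
        parent_guid = rec.get("parent_guid")
        if parent_guid in tid_by_guid:
            store[tid_by_guid[rec["guid"]]]["parent"] = tid_by_guid[parent_guid]
    return store


# Made-up records; the child deliberately comes before its parent.
records = [
    {"guid": "ex:2", "name": "Child term", "parent_guid": "ex:1"},
    {"guid": "ex:1", "name": "Parent term"},
]
store = import_terms(records)
print(store)
```

Keying the lookup on the term name instead of the GUID works the same way, as long as names are unique within the vocabulary.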
#6
FTR, see this write-up
Peter (liquidcms) contacted me directly and offered sponsorship to create a custom solution. We worked through it together and got a result (although it took a couple of steps).
I forgot to re-release a write-up at the time, but here it is. Peter was happy for me to feed it back to the community.
Thanks Peter!
#7
Automatically closed -- issue fixed for 2 weeks with no activity.