HOWTO: Import the MeSH taxonomy database. Or a subset.

dman - May 12, 2009 - 23:44
Project: Taxonomy import/export via XML
Version: 6.x-1.3
Component: Documentation
Category: task
Priority: normal
Assigned: Unassigned
Status: active
Description

MeSH (Medical Subject Headings, from the National Library of Medicine) is one of the cooler public taxonomies out there. It includes a collection of well-structured taxonomies such as anatomy and Information Science disciplines, plus a pretty good Geographic Locations tree.

A while ago Peter Lindstrom approached me for assistance in getting taxonomy_xml to read it in. We worked through it together, and Peter sponsored me to create a custom solution. He said he was happy for the results to go back into open source here, so here is a small write-up (albeit a few months late).

The job ended up taking several steps, because the database dump supplied was insanely huge (260MB of XML in one file). There was no way I could get a single PHP web request to touch it. It was also in a custom XML format.

I'll attach the full readme that describes the process we went through, but in short it involved:

  • Scanning the huge file with a PHP parser from the command line
  • Saving out bite-sized chunks into smaller files (essentially using the filesystem as a lookup table)
  • Activating a small PHP script that acts as a lookup service: request a term id, get an XML response
  • Setting the taxonomy_xml batch daemon going against that service, which sequentially asks for one term at a time and feeds it into the Drupal database
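
For anyone curious what the first two steps look like in practice, here is a minimal Python sketch of the same idea. It is not the attached PHP splitter. The element names DescriptorRecord and DescriptorUI are assumptions about the dump format, and the output layout (one file per term id) is only one way to use the filesystem as a lookup table.

```python
import os
import xml.etree.ElementTree as ET


def split_records(source_path, out_dir,
                  record_tag="DescriptorRecord", id_tag="DescriptorUI"):
    """Stream a large XML file and write each record to its own small file.

    record_tag and id_tag are assumed element names, not taken from the
    attached scripts; adjust them to match the dump you actually have.
    Returns the number of records written.
    """
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    # iterparse hands over elements as they are completed, so the whole
    # document never has to sit in memory at once.
    context = ET.iterparse(source_path, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            term_id = elem.findtext(id_tag)
            if term_id:
                # The file name is the lookup key: one term id, one file.
                target = os.path.join(out_dir, term_id + ".xml")
                ET.ElementTree(elem).write(target, encoding="utf-8")
                count += 1
            # Drop finished records from the root so memory use stays flat.
            root.clear()
    return count
```

Calling split_records("desc2009", "chunks") would then leave one small XML file per term in a (hypothetical) chunks folder. Records without an id are skipped rather than guessed at.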

Due to licensing (and sheer size) I cannot redistribute the MeSH taxonomy myself. Instead I provide you with the tool to import it yourself.
Here is that tool:
(some assembly required)

OK, here's how it works.

To recap, we will:
- split the big XML file into bite-sized pieces
- set up a service (a web PHP script) that will supply those pieces when asked
- have the service also annotate its output with the term relationship information that's currently missing from the source

- use Drupal with taxonomy_xml to request the top-level item(s) of the list
- each response will contain embedded links to further service requests
- Drupal goes into batch mode and crawls those further items one by one

This means that lots of bite-sized chunks can get processed without blowing the time limits etc.
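
The relationship annotation mentioned in the recap can be pictured with a small Python sketch. It assumes the hierarchy is derived from MeSH tree numbers (dotted codes where each extra segment is one level deeper); whether the attached MeSH_server.php works exactly this way is not stated here, and the term ids and tree numbers below are made up.

```python
def parent_tree_number(tree_number):
    """Return the parent tree number, or None for a top-level node."""
    head, dot, _ = tree_number.rpartition(".")
    return head if dot else None


def build_children_index(tree_numbers_by_term):
    """Map each tree number to the term ids filed directly beneath it.

    tree_numbers_by_term: dict of term id -> list of tree numbers
    (a term can sit in more than one place in the tree).
    """
    children = {}
    for term_id, tree_numbers in tree_numbers_by_term.items():
        for number in tree_numbers:
            parent = parent_tree_number(number)
            if parent is not None:
                children.setdefault(parent, []).append(term_id)
    return children


# Made-up sample data, for illustration only.
SAMPLE = {
    "T1": ["A01"],
    "T2": ["A01.236"],
    "T3": ["A01.236.249", "B02.100"],
}
INDEX = build_children_index(SAMPLE)
```

With an index like this, a lookup service can answer "which terms are children of A01?" and embed a link for each child in its response, which is what lets the importer crawl from one term to the next.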

To begin:

1.
Unpack/move the provided files into somewhere local and web-accessible, e.g.
http://localhost/MeSH/
such that
http://localhost/MeSH/MeSH_server.php is accessible (and runs PHP5 OK)

2.
Retrieve the MeSH dump file desc2009.gz and unpack it in that directory, producing the file desc2009

3.
Use the command line to execute the splitter script split_mesh_into_entries-reader.php
It can take arguments, but is preset to run on "desc2009", so from that directory, just
  php split_mesh_into_entries-reader.php
should start it off. (Ensure PHP is in your PATH, or enter the full path to php.exe on your system as required.)

This may take a while.

4.
Verify that you are getting 2 folders filling up with smallish files, and take a break.
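
If you would rather check this from a script than by eyeballing a file browser, something like the following Python sketch reports how many files each subfolder holds and their average size. The folder names are not given here, so it simply looks at every subfolder of the directory you point it at.

```python
from pathlib import Path


def summarize_subfolders(base_dir):
    """Return {folder name: (file count, average size in bytes)} per subfolder."""
    summary = {}
    for folder in sorted(Path(base_dir).iterdir()):
        if not folder.is_dir():
            continue  # ignore loose files such as the dump itself
        sizes = [f.stat().st_size for f in folder.iterdir() if f.is_file()]
        average = sum(sizes) / len(sizes) if sizes else 0
        summary[folder.name] = (len(sizes), average)
    return summary
```

Running print(summarize_subfolders(".")) from the MeSH directory a few times should show the counts climbing while the splitter is still working.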

5.
Later (or in the meantime), visit http://localhost/MeSH/MeSH_server.php and play with the options.

6.
Install the mesh_format.inc file into your taxonomy_xml.module directory alongside the others. It should be automatically detected.

7.
To trigger the import, visit /admin/content/taxonomy/import and select 'Web URL' as the data source.
(There's not much difference between a Web URL and a Web Service yet. Services are more for XML-RPC jobs. This server is just a URL plus args.)
Ensure the format is 'MESH'. (No safe auto-detection yet.)

8.
Choose a starter URL to feed the importer.
To import only one branch (highly recommended), choose the first option (no arg) from MeSH_server.php and submit to get a list of top-level branches.
Copy the URL and put it into taxonomy_xml.

Note that the progress bar is EXTREMELY wrong. It will tell you how fast it's going, but it has no idea how far it has to go.
Every item processed may throw another pile of jobs onto the queue, which in turn may do the same.
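
To see why the estimate misbehaves, here is a toy Python model of the crawl. It is a sketch only: fetch() stands in for one request to the lookup service, handle() stands in for saving a term, and the little tree is made up. It is not how taxonomy_xml's batch code is actually written.

```python
from collections import deque


def crawl(start_ids, fetch, handle):
    """Process terms one at a time, queueing any child ids each response names.

    fetch(term_id) -> (record, child_ids) stands in for one HTTP request to
    the lookup service; handle(record) stands in for saving a term.
    Returns a list of (done, queued) snapshots, one per processed term.
    """
    queue = deque(start_ids)
    seen = set(start_ids)
    done = 0
    snapshots = []
    while queue:
        term_id = queue.popleft()
        record, child_ids = fetch(term_id)
        handle(record)
        done += 1
        for child in child_ids:
            if child not in seen:  # a term may be reachable by several paths
                seen.add(child)
                queue.append(child)
        snapshots.append((done, len(queue)))
    return snapshots


# Made-up tree for illustration: one top-level branch that fans out.
TREE = {"root": ["a"], "a": ["a1", "a2", "a3"], "a1": [], "a2": [], "a3": []}
handled = []
snapshots = crawl(["root"], lambda term_id: (term_id, TREE[term_id]), handled.append)
for done, queued in snapshots:
    print(f"{done} done, {queued} queued, naive estimate {done / (done + queued):.0%}")
```

In this toy run the naive "done out of known jobs" figure goes from 50% down to 40% after the second term, because that term queued three more. The real import does the same thing on a much larger scale.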

(one day I may get it together to provide my own taxonomy server which can do this for you...)

Attachment                      Size
MeSH_importer-20090227.zip      11.92 KB
 
 
