MeSH (Medical Subject Headings, from the National Library of Medicine) is one of the cooler public taxonomies out there. It includes a collection of well-structured taxonomies such as anatomy and Information Science disciplines, and a pretty good Geographic Locations tree as well.

A while ago Peter Lindstrom approached me for assistance in getting taxonomy_xml to read it in. We worked through it together, and Peter sponsored me to create a custom solution. He said he was happy for the results to go back into open source here, so here is a small write-up. (albeit a few months late)

The job ended up taking several steps, because the database dump supplied was insanely huge (260MB of XML in one file): no way one PHP web request could touch it. It was also a custom XML format.

I'll attach the full readme that describes the process we went through, but in short it involved:

  • Scanning the huge file with a php parser from the commandline
  • Saving out bite-sized chunks into smaller files (essentially using the filesystem as a lookup table)
  • Activating a small php script that acted as a lookup service - request term id, get XML response
  • Setting the taxonomy_xml batch daemon going against that service, which would sequentially ask for one term at a time and feed it into the Drupal database
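The splitter step works roughly like this. A minimal sketch in Python (the real script is PHP; `DescriptorRecord`/`DescriptorUI` follow the general shape of the MeSH dump, but treat the details as illustrative):

```python
import os
import xml.etree.ElementTree as ET

def split_records(xml_path, out_dir, record_tag="DescriptorRecord",
                  ui_tag="DescriptorUI"):
    """Stream a huge XML dump and write each record out as its own
    small file, keyed by its unique ID -- i.e. use the filesystem as
    the lookup table, so no single process ever needs the whole
    document in memory."""
    os.makedirs(out_dir, exist_ok=True)
    # iterparse streams the file; clearing each finished element
    # keeps memory usage flat no matter how big the dump is.
    for event, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == record_tag:
            ui = elem.findtext(ui_tag)
            with open(os.path.join(out_dir, ui + ".xml"), "wb") as f:
                f.write(ET.tostring(elem))
            elem.clear()
```

The point of the streaming parser is that the 260MB file never gets loaded whole; each record is flushed to disk as soon as its closing tag is seen.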

Due to licensing (and sheer size) I cannot redistribute the MeSH taxonomy myself. Instead I provide you with the tool to import it yourself.
Here is that tool:
(some assembly required)

OK, here's how it works.

To recap, we will :
- split the big XML file into bite-sized pieces
- set up a service (a web php script) that will supply those pieces when asked
- the service will also annotate its output with the term relationship information that's currently missing from the source.

- Use Drupal with taxonomy_xml to request the top level item(s) of the list
- each response will contain embedded links to further service requests
- Drupal goes into batch mode and crawls those further items one-by-one

This means that lots of bite-sized chunks can get processed without blowing the time limits etc.
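A hypothetical sketch of the service half (the real MeSH_server.php is PHP; this Python stand-in mirrors only the `?lookup=DescriptorUI&id=...` URL shape, everything else is invented):

```python
import os

def lookup(term_id, chunk_dir, children):
    """Minimal lookup service: fetch one pre-split chunk by ID and
    annotate it with links to further requests, which is what lets the
    importer crawl the tree one bite-sized piece at a time.
    `children` (id -> list of child ids) stands in for the
    relationship data recovered from the tree numbers."""
    with open(os.path.join(chunk_dir, term_id + ".xml")) as f:
        body = f.read()
    links = "".join(
        '<child href="MeSH_server.php?lookup=DescriptorUI&amp;id=%s"/>' % c
        for c in children.get(term_id, []))
    return "<response>%s%s</response>" % (body, links)
```

Each response carries the links the batch daemon needs for its next requests, so the crawl drives itself.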

To begin:

1.
Unpack/move the provided files into somewhere local and web-accessible, eg
http://localhost/MeSH/
such that 
http://localhost/MeSH/MeSH_server.php is accessible (and runs PHP5 OK)

2.
Retrieve the MeSH dump file desc2009.gz and unpack it in that directory, producing the file desc2009

3. 
Use the commandline to execute the splitter script split_mesh_into_entries-reader.php
It can take arguments, but is preset to run on "desc2009", so from that directory, just
  php split_mesh_into_entries-reader.php
should start it off. (Ensure PHP is in your PATH, or enter the full path to php.exe on your system as required)

This may take a while.

4. 
Verify that you are getting 2 folders filling up with smallish files, and take a break.

5.
Later (or in the meantime), visit http://localhost/MeSH/MeSH_server.php and play with the options.

6.
Install the mesh_format.inc file into your taxonomy_xml.module directory alongside the others. It should be automatically detected.

7.
To trigger the import, visit /admin/content/taxonomy/import and select 'Web URL' as the data source.
(There's not much difference between a Web URL and a Web Service yet. Services are more for XML/RPC jobs. This server is just an URL+args)
Ensure the format is 'MESH'. (No safe auto-detection yet)

8. 
Choose a starter URL to feed the importer
To import only one branch (highly recommended), choose the first option (no arg) from MeSH_server.php and submit to get a list of top-level branches.
Copy the URL and put it into taxonomy_xml.

Note that the progress bar is EXTREMELY wrong. It will tell you how fast it's going, but it has no idea how far it has to go.
Every item processed may throw another pile of jobs onto the queue. Which in turn may do the same.

(one day I may get it together to provide my own taxonomy server which can do this for you...)


Comments

powzon’s picture

I've been looking for some means to make a proof-of-concept demonstration of MeSH for my organization. I'm hoping this will help. Haven't tried it out yet, but will say how it goes.
Thank you.

dman’s picture

Sorry that it takes 6 (quite technical) steps instead of just two. It's mostly due to the huge size of the dataset - something that can't be processed in any one PHP session.
I'm working (in the background) (very slowly) on providing this as a web service that taxonomy_xml can just query and import, but got distracted trying to upgrade to D7 instead. It may happen before the end of the year!

izmeez’s picture

subscribing

powzon’s picture

I'm not afraid of multiple, technical steps. It went smoothly up to step 7. I only did step 5 after split_mesh_into_entries-reader.php had finished and after step 6.

In trying out MeSH_server.php, I'm pretty sure that, at first, clicking on one top-level descriptor link provided by 'Top Level List (All root items)' returned a paragraph with the descriptor's items in simple black text.
When I first selected 'XML - All root items' I didn't give it the chance to finish. After letting it finish and return the XML tree, each descriptor link selected from 'Top Level List (All root items)' returned an XML tree for each descriptor, instead of the simple text returned previously. This is in Firefox, which gives the message,

"This XML file does not appear to have any style information associated with it. The document tree is shown below."

I get the same xml tree format from searching by UI or tree number.

At step 7, I wondered whether to check the 'advanced' options in /taxonomy/import/, and decided yes, clicked the Import button, and got

-------------------------------
* Retrieved Submitted URL http://ekwas.com/MeSH/MeSH_server.php?lookup=Top-HTML&id=&op=Lookup. Now starting an import process.
* Vocabulary : No terms added.

* Failed to parse file: not well-formed (invalid token) at line 1.
* warning: max() [function.max]: Array must contain at least one element in /home/abdl4/public_html/ekwas.com/sites/all/modules/taxonomy_xml/xml_format.inc on line 172.
* warning: max() [function.max]: Array must contain at least one element in /home/abdl4/public_html/ekwas.com/sites/all/modules/taxonomy_xml/xml_format.inc on line 172.
* Failed to import any new terms. This may be due to syntax or formattings errors in the import file.
-------------------------------

I've stopped for now, and will try to sort out the error message tomorrow. It's been gratifying so far, thanks again, if you can tell what's the error at this point, it would be much appreciated.

By the way, I just setup and started learning Drupal four days ago, because of your module, which I found while searching for such things.

Cheers.

dman’s picture

Close but...
The URL http://localhost/MeSH/MeSH_server.php?lookup=Top-HTML&id=&op=Lookup
returns just UI HTML. That's for your convenience only, but HTML is not a starting point for an import. It just LISTS the starting points for import.

Copy one of the URLs on that page, eg
http://localhost/MeSH/MeSH_server.php?lookup=TreeNumber&id=A
to retrieve the XML that IS a format that the importer can import from.

Believe me, you do not want to import everything from the top in one go, it'll take a while.
If you want, it's there at http://localhost/MeSH/MeSH_server.php?lookup=Top-XML&id=&op=Lookup
Don't do that.

If your browser doesn't want to render the XML, that doesn't make a difference. The taxonomy_xml import process reads the XML just fine. The browser interface/selector is only to help you find the URL for the XML you are looking for.

powzon’s picture

OK, just tried it correctly, with one, top-level node URL, and got almost the same message, minus the item, "* Failed to parse file: not well-formed (invalid token) at line 1." included in the previous error message.

-------------------------------
* Retrieved Submitted URL http://ekwas.com/MeSH/MeSH_server.php?lookup=TreeNumber&id=A. Now starting an import process.
* Vocabulary : No terms added.

* warning: max() [function.max]: Array must contain at least one element in /home/abdl4/public_html/ekwas.com/sites/all/modules/taxonomy_xml/xml_format.inc on line 172.
* warning: max() [function.max]: Array must contain at least one element in /home/abdl4/public_html/ekwas.com/sites/all/modules/taxonomy_xml/xml_format.inc on line 172.
* Failed to import any new terms. This may be due to syntax or formattings errors in the import file.
-------------------------------

And the URL behind 'function.max', http://ekwas.com/admin/content/taxonomy/function.max, doesn't seem to exist. Maybe I don't know enough about Drupal yet?

dman’s picture

xml_format.inc on line 172.
... this indicates you forgot to switch the 'format' to 'MESH' (step 7).

powzon’s picture

How 'bout that? Because 'XML' was selected by default, I wasn't meticulous with that step...
It's working now; thank you!

What exactly does the note, (May require further recursion) indicate?

dman’s picture

The note indicates that a spidering process has started, but we don't at this point know where it will stop.
With atomic requests, we don't know if it will be 20 or 2000 more steps to go.
The note is warning you that the message saying "5 or 50 steps" is a lie, because any of those 50 steps *may* trigger another 50 lookups etc.
It's all to do with breaking down the huge data file into bite-sized chunks. PHP can't process 300MB of XML in one go, it can't even imagine how many elements there are in the future.
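The effect is easy to show in miniature (illustrative Python; the names are invented):

```python
from collections import deque

def crawl(start, children):
    """Why the progress estimate lies: each processed term may enqueue
    more work, so the total is unknowable until the queue drains."""
    queue, done = deque([start]), []
    while queue:
        term = queue.popleft()
        done.append(term)
        queue.extend(children.get(term, []))  # the queue grows mid-flight
    return done

# What looks like a 1-item job at the start turns into 6 items of work:
tree = {"A": ["A1", "A2"], "A1": ["A1.1", "A1.2"], "A2": ["A2.1"]}
```

At the moment "A" is processed, the progress bar only knows about 3 items; the other 3 are discovered later.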

powzon’s picture

Things were going fine, doing each category at a time, until C17.800.838.775.424, which returned this error message,

-------------------------------
* Completed a batch round #462. 1 items processed. Larva Migrans, Visceral

* warning: file_get_contents(http://ekwas.com/MeSH/MeSH_server.php?lookup=DescriptorUI&id=D006409) [function.file-get-contents]: failed to open stream: Connection timed out in /home/abdl4/public_html/ekwas.com/sites/all/modules/taxonomy_xml/taxonomy_xml.module on line 1058.
* warning: DOMDocument::loadXML() [domdocument.loadxml]: Empty string supplied as input in /home/abdl4/public_html/ekwas.com/sites/all/modules/taxonomy_xml/mesh_format.inc on line 51.
* user warning: Failed to parse in xml source. [] in /home/abdl4/public_html/ekwas.com/sites/all/modules/taxonomy_xml/mesh_format.inc on line 52.
-------------------------------

I entered the URL for C17 into taxonomy import, and it completed successfully.

Larva Migrans has three places in the 2009 MeSH,

Tree Number C03.335.508.523
Tree Number C03.858.424
Tree Number C17.800.838.775.424

But in the Drupal result list after the import, it's listed only twice,

# Larva Migrans, Visceral # 6896 is a child of Larva Migrans # 6245 (source
# Larva Migrans, Visceral # 6896 is a child of Toxocariasis # 6859 (source

It's the same for a few other terms I sampled: they appear in the drupal result list one count less than in MeSH.

I assume it's with, for instance, Larva Migrans having two entries in C03 and one in C17?

dman’s picture

What your message says is Connection timed out, which means it's probably nothing more than a temporary network or service overload. That URL responds now, so your hiccup there looks like a one-off processing anomaly.
If a process breaks or is halted, restarting it from any branch is safe; it will overwrite where possible, merging in other values.

But if you've found gaps, then maybe there is something happening there.
To get technical:

   <TreeNumber>C03.335.508.523</TreeNumber> 
   <TreeNumber>C03.858.424</TreeNumber> 
   <TreeNumber>C17.800.838.775.424</TreeNumber> 

D007815 Larva Migrans [C03.335.508.523, C03.858.424, C17.800.838.775.424]
Child of "Skin Diseases, Parasitic" C17.800.838.775 (D012876)
Child of "Skin Diseases, Parasitic" C03.858 (D012876)
Child of "Nematode infections" C03.335.508

:-)))

All good!
There is no problem mon ami!
This term appears three times in the tree, yet has only two parent terms! This is correct!
Because - one of its parents appears twice.
There are three paths to burrow DOWN to a term - which is what MeSH stores and displays as alternative term identifiers.
Yet Drupal is a little more conservative than that, and stores only the parent-child relationships. Thus there are only two directions up from there.

What are the parents of "Larva Migrans"?
Either:
- "Skin Diseases, Parasitic" or
- "Nematode infections"

What are the parents of "Skin Diseases, Parasitic"?
Either:
- "Parasitic Diseases" or
- "Skin Diseases, Infectious"

What are the parent paths of "Larva Migrans" ?
... : Nematode infections : Larva Migrans
... : Parasitic Diseases : Skin Diseases, Parasitic : Larva Migrans
... : Skin Diseases, Infectious: Skin Diseases, Parasitic : Larva Migrans

The code has merged and re-written the instantiated paths used by MeSH into the relational database used by Drupal. Which is what it was supposed to do, although I can't recall testing exactly this situation!
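The three-positions-but-two-parents arithmetic can be checked mechanically. A sketch (Python; `parents_of` is an invented helper, the tree numbers are the ones quoted above):

```python
# Tree numbers (what MeSH stores) vs. parent links (what Drupal
# stores). Splitting a tree number at its last dot gives the parent
# position; de-duplicating the resulting parent *terms* gives the
# relational view.
tree_numbers = {
    "C03.335.508.523": "Larva Migrans",
    "C03.858.424": "Larva Migrans",
    "C17.800.838.775.424": "Larva Migrans",
    "C03.858": "Skin Diseases, Parasitic",
    "C17.800.838.775": "Skin Diseases, Parasitic",
    "C03.335.508": "Nematode infections",
}

def parents_of(name):
    """Distinct parent terms of `name`, derived from its tree numbers."""
    return sorted({tree_numbers[tn.rsplit(".", 1)[0]]
                   for tn, n in tree_numbers.items()
                   if n == name and tn.rsplit(".", 1)[0] in tree_numbers})
```

Three tree numbers point at Larva Migrans, but de-duplicating the derived parent terms leaves exactly two, which is what Drupal stores.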

And thus, life is good.

Except for the poor fellow with infectious skin worms.

powzon’s picture

great, thanks!

Is it possible to cron the import?

I want to run Chemicals and Drugs but don't want to stay logged in.

dman’s picture

Not as such, no.
I'm looking at a commandline or queue solution for D7, but that's unlikely to happen soon.
It may be possible under drush already, as drush supports the batch process internally, yet I don't know how to formulate an action which would kick it off. Some magic could be constructed to get drush to run drupal_form_submit, but there are no helper functions to do that for taxonomy_xml specifically yet. It's a good idea, but MeSH is on the very high end of test cases :-)

powzon’s picture

Oh man, it really choked on Chemicals and Drugs taken from the top--good for people but bad for taxonomy systems.

Tried twice and got,

"Processing all queued import requests.
Batch Taxonomy Import has encountered an error.
Please continue to the error page"

both crashes in the D03 range.

This is the error message,

-------------------------------
Fatal error: Call to undefined function dpm() in /home/abdl4/public_html/ekwas.com/sites/all/modules/taxonomy_xml/taxonomy_xml.module on line 1289
-------------------------------

Was going to do each [D] branch by hand but tried increasing max_allowed_packets for the database all the way to 1GB, using db_tweaks module, and it seems to be going along OK.

It crashed at the beginning of D12.
Set max_allowed_packet = 1024MB in db_tweaks
It got a bit farther in D12 but crashed.
Added mysqli.reconnect = On to php.ini
It got a still bit farther in D12 then crashed.
Added max_allowed_packet = 1024MB to php.ini just in case db_tweaks didn't take.
A bit farther still in D12...

Finished Chemicals and Drugs by doing each branch individually.

dman’s picture

If it's hitting a dpm() error, that's just debug information left in there accidentally.
You can either delete that line from
taxonomy_xml_batch_import_finished()

1285	    // An error occurred.
1286	    // $operations contains the operations that remained unprocessed.
1287	    $error_operation = reset($operations);
1288	    $message = 'An error occurred while processing '. $error_operation[0] .' with arguments :'. print_r($error_operation[1], TRUE);
1289	    dpm(array("Batch error" => array($success, $results, $operations)));

to make it shut up (though something clearly failed), or install the devel module to find out just what happened there.

It may be the max_allowed_packet issue, though that problem is why we break the steps down into batches of 50 per round. You can change the batch size value define('TAXONOMY_XML_MAX_BATCH_SIZE', 50); but that's probably not going to help: smaller is sure to slow things down, and bigger relies on a grunty server.
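The batching idea itself is just list chunking (illustrative sketch, not the module's code):

```python
def batches(jobs, size=50):
    """Chunk a job list into rounds of at most `size` items, the same
    idea as TAXONOMY_XML_MAX_BATCH_SIZE: each round stays small enough
    to fit inside one web request's time and memory limits."""
    return [jobs[i:i + size] for i in range(0, len(jobs), size)]
```

The tradeoff: smaller rounds mean more request overhead per term; bigger rounds mean each request must survive on the server's limits.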

I'll set a local version running and see if it melts my machine. Not sure what exactly it is, unless there happens to be one really broad category - that has several hundred items at one branch of the tree. That could hurt a bit, as it would be a huge file to process in one step. Just guessing here.
But you got it to run by triggering the branches one-by-one? Good for you!

dman’s picture

Heck, 18934 terms, 215 of them instantiated at the top level. (top + immediate children)
That's a lot of work for a PHP process. I got a (30 second) timeout after the first scan - which means it created placeholders for everything, but failed to keep them all in memory at once while attempting to relink them.

Phases are :
retrieve,
scan file,
create items & child placeholders,
add child items to queue (don't process children),
link parents with child placeholders
... then start looking at the queue to fill in the child details recursively
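The phases above can be sketched like this (illustrative Python stand-in; the real code works on Drupal term objects, and the dict shapes here are invented):

```python
def import_chunk(records, db, queue):
    """Sketch of the phases above: create items and child placeholders,
    queue the children for later rounds, then link parents to the
    placeholders. `db` maps term id -> {"name": ..., "parents": [...]}
    (a stand-in for the Drupal term tables)."""
    # Phase 1: create items (filling in placeholders made by earlier
    # rounds) plus bare placeholders for their children.
    for rec in records:
        entry = db.setdefault(rec["id"], {"name": None, "parents": []})
        entry["name"] = rec["name"]
        for child in rec["children"]:
            db.setdefault(child, {"name": None, "parents": []})  # placeholder
            queue.append(child)  # details get fetched in a later round
    # Phase 2: relink - attach each child placeholder to its parent.
    for rec in records:
        for child in rec["children"]:
            if rec["id"] not in db[child]["parents"]:
                db[child]["parents"].append(rec["id"])
```

It's phase 2 that hurts with 200+ top-level items: every placeholder has to be held at once while the links are made.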

At the point where it's relinking, it's holding pretty much everything about everything in memory, including (I think) multiple copies of the DOM - including all the unused MeSH elements. There may be some stuff that can be trimmed from there.

I can't see how to trim it from the top of the process - cannot reduce that 250-at-once number. But the internals can always be tuned better. arg.

Frank Ralf’s picture

I don't know whether this will remedy the issue at hand but it might be useful to look at how others solve that PHP timeout problem with large data imports, namely MySQLDumper:

The problem …
A PHP script has a maximum execution time that is usually set to 30 seconds on most server installations. A script running longer than this limit will simply stop working. This behavior makes backing up large databases impossible.

MySQLDumper fills a gap …
MySQLDumper uses a proprietary technique to avoid this problem. It only reads and saves a certain amount of data, then calls itself recursively via JavaScript and remembers how far in the backup process it was. The script then resumes backing up from that point.

http://www.mysqldumper.net

hth
Frank

dman’s picture

Um Frank, Drupal has the Batch API that does this. You may have seen it in the progress bar when you first installed Drupal.
Taxonomy_xml makes deep use of the batch process already, breaking the tens of thousands of tasks down into dozens of manageable chunks. It tries pretty damn hard to do this.

It just looks like one of the chunks cannot currently be broken down much further - without some restructuring.

Frank Ralf’s picture

@dman
Thanks for the clarification! That was just a remark from a bystander ;-)

oldmoonlake’s picture

This is a remarkable effort. Just wondering how the performance will be for MySQL to look up anything in this taxonomy. Subscribed.

globallyunique’s picture

Can you provide any insight into how long it will take to load the entire taxonomy? Loading diseases has taken more than 5 hours and isn't finished yet.

dman’s picture

Um, I have encountered a few 3-hour runs, but 5 seems a bit of a concern.
I'm pretty sure looping is taken care of ... Yeah, the MeSH hierarchy had no loops last I looked.
Not really sure.

You can visit the site in another tab, and check on the size of the vocabulary. If it's not actually growing, something may be stuck.
Also, try looking at the DB table directly. Check the number of rows, then check again in a minute.
After that, it may just come down to your processor. The import process is not efficient, as it's a run-once job, and was tuned for robustness over large batches, not speed.

globallyunique’s picture

The term_data table is growing, e.g., 16262 to 16338 in a minute or so. I'm loading desc2010 (the 2010 version of MeSH). Loading D - Chemicals and Drugs [19425 items] has taken more than 9 hours. I'm running on a reasonably powerful unix server, but I don't know about other load on the server. If there were a lot of feeds running through cron, hitting different taxonomies, would that cause this kind of slow-down?

Is there any table(s) or pages I can look at to determine how far through 'D - Chemicals and Drugs' the load has progressed? Even an approximate measure of progress would be great, e.g., is there a way to determine how many terms there are in the 2010 version vs. how many I currently have in the db?

Amy_M’s picture

Category: task » support

Hi there,

I've actually used this before and had everything working fine... On a new site however I'm getting the following error spitting back at me:

* warning: DOMDocument::loadXML() [domdocument.loadxml]: Extra content at the end of the document in Entity, line: 2 in ../trunk/sites/all/modules/taxonomy_xml/mesh_format.inc on line 51.
* user warning: Failed to parse in xml source. [] in ../trunk/sites/all/modules/taxonomy_xml/mesh_format.inc on line 52.
* Failed to import any new terms. This may be due to syntax or formattings errors in the import file.

I've checked that I'm doing all steps (including 7) and am not sure what I'm missing here. Any help would be really appreciated.

dman’s picture

First guess is content/character encoding in the source somewhere. If you say it's working in one place but not another, that means it's something obscure like locale or unicode.
Can't quite say.

Amy_M’s picture

I think it might be to do with the way the data was parsed, if I navigate to http://example.localhost/MeSH/mesh_server.php?lookup=TreeNumber&id=A

I see:

Notice: Undefined property: stdClass::$parents in ../MeSH/MeSH-parser.inc on line 57

Warning: Cannot modify header information - headers already sent by (output started at ../MeSH/MeSH-parser.inc:57) in ../MeSH/MeSH_server.php on line 100

dman’s picture

a notice?
Oh ok.
So the issue is that your other server has error logging set to max, and prints them to the screen.
So requests to the internal mini-MeSH server complain a little bit before returning the XML.
Fix that by
- turning down the logging verbosity to not be E_ALL.
or
- fix the error, it's just an undefined index which is normal, but lazy PHP.
Quick way is to just change

  foreach ((array) $term->parents as $parentid => $parentname) {

to

  foreach ((array) @$term->parents as $parentid => $parentname) {
Amy_M’s picture

Worked like a charm, thanks!

I'm still getting that original error when trying to do the actual import. I checked the original site and it's no longer working there either or on a fresh installation of drupal with the original files etc. Can't quite figure out what I've done to mess this up or where I should start troubleshooting.

dman’s picture

The difference is (probably) mostly the error logging level. This script wasn't developed with E_ALL messages on, it would seem.
So right now, you are not seeing that error when you look at the request URL, but the import still complains about XML validity?

Amy_M’s picture

Actually I finally got it working for all but the first tree (anatomy...); however, the hierarchy has not imported properly and I have roughly 4 pages of top-level terms. Time to brew some more coffee :)

dman’s picture

I've found a live place to serve a version of the service from, so anyone who wants to try can now just start with steps 6,7,8 in the above README instructions.

Visit http://mesh.taxonomy.standards.net.nz/MeSH_server.php?lookup=Top-HTML&id...
and COPY the links from there (don't click them or you will get raw XML) into the 'remote URL' data source for taxonomy_import. Remember to select MESH as the import format.

Start small when testing. Big clades take big time. Each single term is an extra web request (though they do get cached, so the next run is faster).
For me, ~600 terms took ~20 minutes.
There is some inefficiency in the import algorithm due to multiple parenting in the hierarchy: the same item (and its children) gets imported as many times as it appears in the tree, so there is an amount of duplication, but no loops that I've found so far.

The http://mesh.taxonomy.standards.net.nz/ site is serving the 2011 version of the MeSH dataset from the U.S. National Library of Medicine, retrieved 2011-11-27.
It is appropriate to review the MeSH® Memorandum of Understanding before re-using this service. It's mostly-free, so please respect them.
Modifications I've made have been exclusively to break the 283M dataset up into web-service-accessible chunks, retaining ALL the data they put into their custom XML format and annotating it with some structural helpers via RDF tags (rdfs:subClassOf, wn:hyponym - I can't recall why I used mixed schemas there).

Errors of structure, and maybe formatting may be mine. Errors of fact, typos or classification I take no responsibility for. If you've got a problem with Tibet being classified as part of China or Texas being classified as part of the USA, or whether homeopathy is a medical discipline, take it up with someone else. Bear in mind that the regional designations used in MeSH geographic regions are for the purposes of climate, historical, medical factors and disease zones, not politics, Mkay?

thijsboeree’s picture

edit: Problably fixed!! I had to create a vocabulary first...

Hi,

I'm having difficulties importing the file... It went smoothly until step 7...
In Drupal in the Taxonomy import part i put the following URL:

http://localhost/MeSH_importer/MeSH_server.php?lookup=Top-HTML&id=&op=Lo...
for Data Source i use Web URL...
format of file i use Mesh...
Then i push import and i get a bunch of errors...?

    * Retrieved Submitted URL http://localhost/MeSH_importer/MeSH_server.php?lookup=Top-HTML&id=&op=Lookup. Now starting an import process.
    * Created vocabulary to put these terms into. You probably want to go edit it now.

    * warning: Parameter 1 to taxonomy_save_vocabulary() expected to be a reference, value given in /home/thijs/project/drupal-6.20/includes/module.inc on line 462.
    * warning: DOMDocument::loadXML() [domdocument.loadxml]: EntityRef: expecting ';' in Entity, line: 1 in /home/thijs/project/drupal-6.20/sites/all/modules/taxonomy_xml/mesh_format.inc on line 51.
    * warning: DOMDocument::loadXML() [domdocument.loadxml]: EntityRef: expecting ';' in Entity, line: 1 in /home/thijs/project/drupal-6.20/sites/all/modules/taxonomy_xml/mesh_format.inc on line 51.
    * warning: DOMDocument::loadXML() [domdocument.loadxml]: EntityRef: expecting ';' in Entity, line: 1 in /home/thijs/project/drupal-6.20/sites/all/modules/taxonomy_xml/mesh_format.inc on line 51.
    * warning: DOMDocument::loadXML() [domdocument.loadxml]: EntityRef: expecting ';' in Entity, line: 1 in /home/thijs/project/drupal-6.20/sites/all/modules/taxonomy_xml/mesh_format.inc on line 51.
    * warning: DOMDocument::loadXML() [domdocument.loadxml]: EntityRef: expecting ';' in Entity, line: 1 in /home/thijs/project/drupal-6.20/sites/all/modules/taxonomy_xml/mesh_format.inc on line 51.
    * warning: DOMDocument::loadXML() [domdocument.loadxml]: EntityRef: expecting ';' in Entity, line: 1 in /home/thijs/project/drupal-6.20/sites/all/modules/taxonomy_xml/mesh_format.inc on line 51.
    * warning: DOMDocument::loadXML() [domdocument.loadxml]: EntityRef: expecting ';' in Entity, line: 1 in /home/thijs/project/drupal-6.20/sites/all/modules/taxonomy_xml/mesh_format.inc on line 51.
    * warning: DOMDocument::loadXML() [domdocument.loadxml]: EntityRef: expecting ';' in Entity, line: 1 in /home/thijs/project/drupal-6.20/sites/all/modules/taxonomy_xml/mesh_format.inc on line 51.
    * warning: DOMDocument::loadXML() [domdocument.loadxml]: EntityRef: expecting ';' in Entity, line: 1 in /home/thijs/project/drupal-6.20/sites/all/modules/taxonomy_xml/mesh_format.inc on line 51.

I also tried to use the following URL: http://localhost/MeSH_importer/MeSH_server.php?lookup=TreeNumber&id=A. Then it takes a big while to import, and after that I get a message that a lot of taxonomies were imported, but none of them is written in the database...? It also won't show on the vocabulary list? I get this error:

# warning: Parameter 1 to taxonomy_save_vocabulary() expected to be a reference, value given in /home/thijs/project/drupal-6.20/includes/module.inc on line 462.
# warning: Parameter 1 to taxonomy_save_vocabulary() expected to be a reference, value given in /home/thijs/project/drupal-6.20/includes/module.inc on line 462.
# It seems like we failed to create and retrieve a term called Body Regions
# It seems like we failed to create and retrieve a term called Viral Structures
# It seems like we failed to create and retrieve a term called Abdomen
# It seems like we failed to create and retrieve a term called Back
# It seems like we failed to create and retrieve a term called Breast
# It seems like we failed to create and retrieve a term called Extremities

Regards,
Thijs

thijsboeree’s picture

Hi,

Is there also a possibility to see the tree numbers...? I can't see them in the database?

best regards,
Thijs

dman’s picture

1. Don't use the top-HTML page as an import starting point. It's a utility page, not a data source.
Plus, you really should probably be putting the different sections in their own vocabulary to have a hope of managing them logically later. But your choice.

2. But if they failed to work when making a new vocabulary even with the proper starting point, that could be a bug to look at.

3. The tree numbers are not saved when following this process. Sorry, they could be useful.
Drupal taxonomy has no place to save that extra data. But if you are using taxonomy_enhancer, it should be made possible. Makes sense. But it does not currently happen.

thijsboeree’s picture

Hi,

I already got a lot of taxonomy imported... but i am stuck at:
http://localhost/MeSH_importer/MeSH_server.php?lookup=TreeNumber&id=F

I also tried to only select one of the children and then do the import, but I keep getting the following error...? Do I need to make my swap file bigger? (Ubuntu 10.10 (Maverick))

An error occurred. /drupal-6.20/batch?id=31&op=do { "status": true, "percentage": 6, "message": " Round \x3cem\x3e1287\x3c/em\x3e. Processed 2 out of 29. (May require further recursion)\x3cbr/\x3eImported from http://localhost/MeSH_importer/MeSH_server.php?lookup=DescriptorUI\x26id=D004992\x3cbr/\x3eResult: Ethics, Medical" }<br /> <b>Fatal error</b>: Allowed memory size of 134217728 bytes exhausted (tried to allocate 26869486 bytes) in <b>/home/thijs/project/drupal-6.20/includes/database.mysql-common.inc</b> on line <b>41</b><br /> 

Regards,
Thijs

dman’s picture

Yeah, um.
Some broad taxonomies use memory during the import. What can I say?
Turn down the debug when really running; my log messages take more data than the actual job.

thijsboeree’s picture

Moved from my MacBook (1GB) to my iMac (4GB)… so I thought maybe more luck… but…:

I've noticed that the F section (F - Psychiatry and Psychology [1078 items],
F01 Behavior and Behavior Mechanisms) gets all mixed up…? A until E works fine, all correct hierarchies, but when I import the F - Psychiatry… the whole hierarchy is gone…? Isn't there someone with the right MySQL data tables…? In a zip format…? That would be nice!!
Oh and BTW i use the address: http://mesh.taxonomy.standards.net.nz/MeSH_server.php?lookup=Top-HTML&id...
Regards!
Thijs

dman’s picture

Supplying a MySQL table can't work, as taxonomy term IDs and vocab IDs are different on different Drupal installs. The autoincrement and things can't cleanly be dumped onto an existing site. So we do what we can with the data as it is found in the original source. I don't want to get into transforming their content too much; I've tried to keep all the original data in what I redeploy, but publishing a Drupal db dump would cause a lot of data loss in the transformation.
I'm not sure why one tree would be giving you trouble...?

thijsboeree’s picture

It's weird that it's the F tree every time...? And if I look at it, it isn't any different from the other trees? Is there a bug in the script? Are there more people who have the same error?

I get this error:
Fatal error: Call to undefined function dpm() in /home/thijs/project/drupal-6.20/sites/all/modules/taxonomy_xml/taxonomy_xml.module on line 1289 Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 29209371 bytes) in /home/thijs/project/drupal-6.20/includes/database.mysqli.inc on line 330

Regards,
Thijs

dman’s picture

The dpm() is a left-over debug call that shouldn't really be in there.
However, it looks like it's only getting triggered to alert you of a problem that wouldn't normally be encountered, and we don't know how to handle it. As you don't have the logging tool (devel.module) enabled, I can't tell you exactly what the problem really is. But I can tell you that the system really is trying to tell you about something unexpected happening :-/

The code at that point is just trying to dump the diagnostics.
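If you want to keep the stray debug call from fataling when devel.module is disabled, one common pattern is to guard it. This is a hedged sketch, not the actual taxonomy_xml code; the message text and the variable passed in are placeholders:

```php
// dpm() is defined by devel.module; guard it so the import
// degrades gracefully instead of hitting a fatal error.
if (function_exists('dpm')) {
  dpm($term, 'Unexpected term data during MeSH import');
}
else {
  // Fall back to core logging. watchdog() is always available.
  watchdog('taxonomy_xml', 'Unexpected term data during MeSH import: @data',
    array('@data' => print_r($term, TRUE)), WATCHDOG_WARNING);
}
```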

thijsboeree’s picture

Hi again!!

I am really far now!! Until I discovered that some terms are used twice within the same tree... and then get overwritten... Can you tell me how to configure taxonomy_enhancer to get an extra field filled with tree IDs? Or is there a way to tell the script that it should not overwrite MeSH names?

Edit! I came up with the idea of concatenating the MeSH description and the tree ID, and separating them later with my own script. Can you tell me where I can do this in your script?

Regards!
Thijs

dman’s picture

It is legal in many of the trees for terms to be repeated. Multiple parentage is OK, and that is supported as long as taxonomy_enhancer is retaining the GUID.
See #11 above.
It's also possible to have the same string with different IDs (though not recommended). This also is supported through GUID indexing instead of string-indexing.

There could be a hook that assists with absorbing extra info into taxonomy_enhancer from the deduced term data... but it hasn't been built. Putting it into the description might be OK. I do that with postcodes when importing geography lists.

The place you can play is mesh_format.inc.
In that file, in the same place as:

    // Try to find a description. Use the ScopeNote of the preferred concept. Seems to be the most useful.
    $notes = $xp->query("${prefix}ConceptList/${prefix}Concept[@PreferredConceptYN='Y']/${prefix}ScopeNote", $concept);
    if (! empty($notes)) {
      foreach ($notes as $note) {
        $term->description = trim($note->nodeValue);
      }
    }

You could enter something that:
- finds the extra data you want from the XML (using an $xp->query as above)
- adds it to the $term object in the same way that taxonomy_enhancer does, as $term->field_treeid[0]['#value'] = $treeid; .. or something.

See this example from taxonomy_xml_set_term_guid():

    $term->field_guid[0]['#value'] = $guid;
    $term->fields['field_guid'][0]['value'] = $guid;
    // taxonomy_enhancer swaps between these two versions of the same data when reading and writing
    // Do both, as te is unstable.
thijsboeree’s picture

Hi,

Thanks for the explanation... only I do not understand the GUID...?
Do you have a script where the taxonomy_enhancer INSERT is included? And the name of a taxonomy_enhancer field?

I would also like to know where I can concatenate the 'DescriptorName' and the 'TreeNumber' in mesh_format.inc; it should be somewhere where the DB INSERT takes place?

Regards
Thijs

dman’s picture

Don't bother with the DB.
- Add a field to your term definition using taxonomy_enhancer.
- Insert a value into that field on terms as above, by setting $term->fields['field_your_field'][0]['value'].
- Later in the process, when the term is finished being built, taxonomy_term_save() (or something) will get taxonomy_enhancer to save the data against that term.

thijsboeree’s picture

OK!
I added a field under my vocabulary (mesh) with the name tree_id. Now, do I need to put $term->fields['field_tree_id'][0]['value'] somewhere in taxonomy_xml.module, or in mesh_format.inc?

Sorry to bother you all the time... I just want to get this right...

Anyhow, thank you so much for everything so far!

I already looked at Drupal 7; it's missing the synonym_taxonomy table, but you can make your own fields there... but that's for later!!

Regards!
Thijs

dman’s picture

See #42.
Put your line in mesh_format.inc, under where the description gets set, at around line 149.
But you'll need to extract $tree_id from the XML first.
And deal with the possibility that a descriptor record has multiple tree IDs - so your text field should allow multiple values.

Input is: lookup=DescriptorUI&id=D003148

<DescriptorRecord xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:wn="http://uche.ogbuji.net/tech/rdf/wordnet/" DescriptorClass="1" >
  <DescriptorUI>D003148</DescriptorUI>
  <DescriptorName>
   <String>Communism</String>
  </DescriptorName>
<!--
...
-->
  <TreeNumberList>
   <TreeNumber>I01.696.232</TreeNumber>
  </TreeNumberList>
<!--
...
-->
 <rdfs:subClassOf rdf:resource="http://mesh.taxonomy.standards.net.nz/MeSH_server.php?lookup=DescriptorUI&amp;id=D011056">Political Systems</rdfs:subClassOf>
 </DescriptorRecord>

So to retrieve the (multiple) TreeNumbers ... Insert this into mesh_format.inc:taxonomy_xml_mesh_parse():150


    $treenumbers = $xp->query("${prefix}TreeNumberList/${prefix}TreeNumber", $concept);
    if (! empty($treenumbers)) {
      foreach ($treenumbers as $treenumber) {
        $term->fields['field_treenumber'][]['value'] = trim($treenumber->nodeValue);
      }
    }

- I named my field 'treenumber', which I guess is what you must mean by tree_id. It's the one that looks like 'I01.696.232', which states a term's (multiple) location(s) in the overall tree.
If you want the unique ID - which is something else - look for the XML value of DescriptorUI. Or... well, the actual schema used by MeSH is much more complex than can be fully expressed by a Drupal 6 taxonomy (descriptors, concepts, terms and labels are all slightly different, and there is sometimes a many-to-many relationship between them) - what I'm giving you is a usable, somewhat flattened approximation of what MeSH is trying to express.

But anyway - this code will be a starting point to let you
-examine the real XML,
-and extract the bits you want,
-and save them as extended fields.
What you do past there depends on your understanding of MeSH concepts and which parts are important to you.

funkju’s picture

Version: 6.x-1.3 » 7.x-1.x-dev

I want to revive this thread for the purposes of the 2013 Mesh with Drupal 7.

The 7.x branch of taxonomy_xml comes with a mesh_format.inc file that seems to work pretty well. And the original splitter script and MeSH_server script seem to be working well with the 2013 version of MeSH.

I did have to make a couple edits to get it going this well:

1) The taxonomy_xml_format variable was not being set, and was defaulting to "xml" in taxonomy_xml.process.inc.

2) The foreach loop on line 240 of mesh_format.inc (from the 7.x branch of taxonomy_xml) overwrites the $term variable. I changed that variable name in the loop.

The problem I have is that the terms are created flat: is that what it is supposed to do, or should it be making a hierarchical structure?
Also, the amount of time it takes to work through a branch is under 3 minutes: maybe this indicates it isn't picking up all the children?
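The variable-shadowing bug described in (2) is a classic PHP pitfall: reusing the outer variable name as the foreach loop variable silently clobbers it. A minimal illustration with hypothetical names (not the actual mesh_format.inc code):

```php
<?php
// BAD: the loop variable clobbers the $term being built.
$term = new stdClass();
$term->name = 'Communism';
$related_records = array('rec1', 'rec2');
foreach ($related_records as $term) {  // $term is overwritten here
  // ... process the related record ...
}
// $term now holds 'rec2', not the original term object.

// FIX: give the loop its own variable name.
$term = new stdClass();
$term->name = 'Communism';
foreach ($related_records as $related) {
  // ... process $related; $term stays intact ...
}
```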

dman’s picture

In 6 it did, and it should still be able to create the full tree, not a flat list.
The probable weak point is that to build a tree while parsing each term atomically (not holding the whole tree in memory), as we have to with MeSH, we need support for each term to have a canonical URI or GUID.

That should be happening, but it may be the point where mesh_format.inc is misbehaving, as it may not have been fully tested since some changes were made in that department.

giorgio79’s picture

Also found this http://drupal.org/project/mesh

Re

Sorry that it takes 6 (quite technical) steps instead of just two. It's mostly due to the huge size of the dataset - something that can't be done in any one PHP session.

PHP's XMLReader is meant to handle extremely large XML files (think GBs), as it is a stream-based reader, unlike SimpleXML (http://stackoverflow.com/questions/1835177/how-to-use-xmlreader-in-php). I handled some myself on a $5 shared host :D
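A minimal sketch of that streaming approach, applied to the splitter step described earlier (the file name and output layout are hypothetical; the real splitter script's logic is more involved). XMLReader pulls one `<DescriptorRecord>` at a time, so memory use stays flat regardless of the dump's size:

```php
<?php
// Stream a huge MeSH dump one DescriptorRecord at a time.
// Only the current record is expanded into a small DOM tree,
// so memory use stays roughly constant even for a multi-GB file.
$reader = new XMLReader();
$reader->open('desc2013.xml');  // hypothetical dump file name

while ($reader->read()) {
  if ($reader->nodeType == XMLReader::ELEMENT
      && $reader->name == 'DescriptorRecord') {
    // Expand just this record into its own DOMDocument so we
    // can query it (and serialize it) in isolation.
    $doc = new DOMDocument();
    $node = $doc->importNode($reader->expand(), TRUE);
    $doc->appendChild($node);

    $ui = $doc->getElementsByTagName('DescriptorUI')->item(0)->nodeValue;
    // Use the filesystem as the lookup table: one chunk per ID.
    file_put_contents("chunks/$ui.xml", $doc->saveXML($node));
  }
}
$reader->close();
```

This is essentially the "filesystem as a lookup table" idea from the write-up at the top of this thread: the lookup service can then answer `lookup=DescriptorUI&id=D003148` by simply reading `chunks/D003148.xml`.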

Parsers have been popping up lately, e.g. https://github.com/bio2rdf/bio2rdf-scripts/blob/master/mesh/mesh_parser.php