Closed (fixed)
Project:
Taxonomy import/export via XML
Version:
6.x-1.x-dev
Component:
Code
Priority:
Normal
Category:
Bug report
Assigned:
Unassigned
Reporter:
Created:
22 Feb 2009 at 21:44 UTC
Updated:
12 May 2009 at 23:50 UTC
I am trying to import the MeSH (Medical Subject Headings - http://www.nlm.nih.gov/mesh/) structure using this module, so far without much luck. The downloaded file is 270M and I have the php.ini memory limit set to 600M (not too realistic for some people, but no problem on a local devel box), and I am still hitting a memory error.
Is there not a way to read the file and create terms without loading the entire thing? Although I suspect something worse than that is being done here, since I have enough memory for the entire file and it is still failing.
Fatal error: Allowed memory size of 629145600 bytes exhausted (tried to allocate 283341231 bytes) in C:\Inetpub\websites\chec\includes\unicode.inc on line 132
Comments
Comment #1
dman commented
That file size is utterly huge.
It's not just a matter of having enough PHP memory to look at the file. You are asking it to analyze the whole thing and to build and link data structures representing each line in the file, and that is sure to take several times as much memory as the input file size.
However - there are solutions.
Clearly, we need to divide the input up into reasonable chunks. Depending on your source data, it should be possible to edit it into discrete units, (say 100 terms) and import them individually. Even cross-referenced terms should be able to be remapped across imports - if the string labels are unique.
You need to try that.
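As a rough illustration of the chunking idea, here is a minimal Python sketch (not part of this module) that streams a large XML file and hands back groups of 100 records without loading the whole document. The function name `iter_chunks` is made up for this example, and it assumes every term is a direct child of the root element; the tag name `DescriptorRecord` in the usage note is also an assumption about the source file.

```python
import io
import xml.etree.ElementTree as ET


def iter_chunks(source, record_tag, chunk_size=100):
    """Yield lists of up to chunk_size serialized records from an XML stream.

    Assumes each record is a direct child of the root element.
    """
    chunk = []
    # iterparse reads the input incrementally instead of building the full tree.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            chunk.append(ET.tostring(elem, encoding="unicode").strip())
            # Discard the records parsed so far so memory use stays flat.
            root.clear()
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
    if chunk:
        yield chunk  # final, possibly shorter, chunk
```

Each chunk could then be wrapped in the original root element and saved as its own small file, for example by looping over `iter_chunks(open("mesh.xml", "rb"), "DescriptorRecord")`, and the resulting files imported one at a time.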
For really large vocabs (this was built for size - several million organisms) there is also the ability to recursively crawl a taxonomy server/service and request term definitions by URL. Each fetched term will in turn queue the terms it references for later requests. This may be more useful for a deep vocab than for a wide one such as MeSH.
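The crawl-and-queue approach can be sketched roughly as below. This is an illustrative Python outline only, not the module's actual code: `crawl_terms`, the `fetch_term` callable, and the `"narrower"` key are all hypothetical names standing in for whatever the remote service really returns.

```python
from collections import deque


def crawl_terms(start_url, fetch_term, max_terms=1000):
    """Fetch a term, then queue the terms it references, breadth first.

    fetch_term is a hypothetical callable that takes a URL and returns a
    dict describing one term, with child URLs under the "narrower" key.
    """
    queue = deque([start_url])
    seen = {start_url}  # avoid requesting the same term twice
    terms = []
    while queue and len(terms) < max_terms:
        url = queue.popleft()
        term = fetch_term(url)
        terms.append(term)
        for child_url in term.get("narrower", []):
            if child_url not in seen:
                seen.add(child_url)
                queue.append(child_url)
    return terms
```

Only one term definition is held per request, so memory use depends on the number of terms kept rather than on the size of any single source file. The `max_terms` cap is there so a test run against a big service stops early.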
Comment #2
liquidcms commented
Not sure how the URL call would work in your code, but I might look at it.
I tried it but it fails to get anything. I suspect that is because you need to provide your user details to get to the page (not an actual login, so no cookie; you just fill in a form and then get to the download page, which has the link to the XML page).
Of course, I have the file locally now, so I could easily add it to a local site and call that. Hmm, although my guess is you simply do the same thing as if it's a file, which is to read and process the entire thing at once?
Comment #3
liquidcms commented
Perhaps to support "crawling", the URL needs to point to a top-level XML file which in turn has nodes that link to other XML pages, as opposed to just one large XML page?
Comment #4
dman commented
Yes, there's no advantage to crawling if the result is just a huge blob!
What I've done before was indeed to read one file that then referenced others (e.g. narrowerTerm) by way of either a URL (RDF) or a unique ID (which was translated into a further URL request using a service-specific pattern).
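To make the URL-or-ID distinction concrete, here is a small Python sketch of how a reference might be turned into a request URL. The function name `resolve_reference` and the example pattern are invented for illustration; a real service would define its own pattern.

```python
from urllib.parse import quote


def resolve_reference(ref, url_pattern):
    """Return a fetchable URL for a term reference.

    ref is either a full URL (as in RDF) or a bare unique ID.
    url_pattern is a hypothetical service-specific template with an {id} slot.
    """
    if ref.startswith(("http://", "https://")):
        return ref  # already a URL, use it unchanged
    # Bare ID: escape it and substitute it into the service's pattern.
    return url_pattern.format(id=quote(ref, safe=""))
```

For instance, with the made-up pattern `"http://example.org/term/{id}.xml"`, a bare ID is expanded into a full request URL, while a reference that is already a URL passes through untouched.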
This worked well for animal taxons, but it didn't have or use a big global list at the top, like the one I think you are looking at.
Either way, it depends on a service that we can consume. Without that, this can't help much.
You should be able to test the TCS remote service using the dev version; I think I left some preset values in there.
Comment #5
dman commented
Huge datasets can be handled if the terms are passed back in chunks (or one by one) from a service that interfaces with the huge source.
We were able to import the 60,000 terms from MeSH like that. Although it did take a while...
Comment #7
dman commented
#460920: HOWTO: Import the MeSH taxonomy database. Or a subset.