needs to have better memory handling [#380414]

I am trying to import the MeSH (Medical Subject Headings, http://www.nlm.nih.gov/mesh/) structure using this, so far without much luck. The downloaded file is 270 MB and I have the php.ini memory limit set to 600M (not realistic for some people, but no problem on a local dev box), and I am still hitting a memory error.

Is there no way to read the file and create terms without loading the entire thing? Although I suspect something worse than that is happening here, since I have enough memory for the entire file and it still fails.

Fatal error: Allowed memory size of 629145600 bytes exhausted (tried to allocate 283341231 bytes) in C:\Inetpub\websites\chec\includes\unicode.inc on line 132

Comments

Comment #1

dman commented 22 February 2009 at 22:43

That file size is utterly huge.
It's not just a matter of having enough PHP memory to look at the file - you are asking it to analyze the whole thing - build and link data structures representing each line in the file - and that is sure to take several times as much memory as the input file size.

However - there are solutions.

Clearly, we need to divide the input up into reasonable chunks. Depending on your source data, it should be possible to edit it into discrete units (say 100 terms each) and import them individually. Even cross-referenced terms should be able to be remapped across imports, as long as the string labels are unique.
You need to try that.
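
As a rough illustration of the chunking idea (not this module's code, and in Python rather than PHP), the sketch below streams an XML file and hands back terms in batches of 100 without ever holding the whole document in memory. The element names `DescriptorRecord`, `DescriptorUI` and `DescriptorName/String` are assumptions about the MeSH descriptor layout, and `iter_term_batches` is a made-up helper name; check both against the real file.

```python
import io
import xml.etree.ElementTree as ET


def iter_term_batches(source, record_tag="DescriptorRecord", batch_size=100):
    """Yield lists of (id, label) tuples, at most batch_size per list.

    source is a filename or a binary file object. The tag names are
    assumptions about the input layout; records are expected to be
    direct children of the root element.
    """
    batch = []
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            term_id = elem.findtext("DescriptorUI")
            label = elem.findtext("DescriptorName/String")
            batch.append((term_id, label))
            root.clear()  # drop the finished record so memory use stays flat
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # final, possibly short, batch
        yield batch


# Tiny made-up sample standing in for the 270 MB download.
sample = (
    b"<DescriptorRecordSet>"
    b"<DescriptorRecord><DescriptorUI>D1</DescriptorUI>"
    b"<DescriptorName><String>Alpha</String></DescriptorName></DescriptorRecord>"
    b"<DescriptorRecord><DescriptorUI>D2</DescriptorUI>"
    b"<DescriptorName><String>Beta</String></DescriptorName></DescriptorRecord>"
    b"<DescriptorRecord><DescriptorUI>D3</DescriptorUI>"
    b"<DescriptorName><String>Gamma</String></DescriptorName></DescriptorRecord>"
    b"</DescriptorRecordSet>"
)
batches = list(iter_term_batches(io.BytesIO(sample), batch_size=2))
```

Each batch could then be imported on its own. In PHP the same streaming approach would most likely use a pull parser such as XMLReader instead of loading the file as a single string.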

For really large vocabs (this was built for size - several million organisms) there is also the ability to recursively crawl a taxonomy server/service and request term definitions by URL. Each response in turn queues further terms to fetch later. This may be more useful for a deep vocab than for a wide one such as MeSH.

Comment #2

liquidcms commented 23 February 2009 at 00:48

Not sure how the URL call would work in your code, but I might look at it.

I tried it but it fails to get anything. I suspect that's because you need to provide your user details to get to the page (not an actual login, so no cookie; you just fill in a form and then get to a download page that links to the XML file).

Of course, I have the file locally now, so I could easily put it on a local site and call that. Hmm, although my guess is you simply do the same thing as if it's a file, which is read and process the entire thing at once?

Comment #3

liquidcms commented 23 February 2009 at 00:50

Perhaps to support "crawling", the URL needs to point to a top-level XML file which in turn has nodes that link to other XML pages, as opposed to just one large XML page?

Comment #4

dman commented 23 February 2009 at 01:27

Yes, there's no advantage to crawling if the result is just a huge blob!

What I've done before was indeed to read one file that then referenced others (e.g. narrowerTerm) by way of either a URL (RDF) or a unique ID (which was translated into a further URL request using a service-specific pattern).
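
To make the crawl idea concrete, here is a minimal sketch in Python (illustrative only, not the module's implementation). It fetches one term at a time and queues the ids listed under "narrower" for later. `fetch_term`, the dict shape it returns, and the ids in `FAKE_SERVICE` are all hypothetical; a real version would wrap an HTTP request to the taxonomy service.

```python
from collections import deque


def crawl_terms(root_id, fetch_term):
    """Breadth-first crawl: fetch one term at a time and queue its children.

    fetch_term(term_id) is a hypothetical callable returning a dict with a
    "label" and a list of "narrower" ids.
    """
    queue = deque([root_id])
    seen = {root_id}
    while queue:
        term_id = queue.popleft()
        term = fetch_term(term_id)
        yield term_id, term["label"]
        for child_id in term.get("narrower", []):
            if child_id not in seen:  # a term may be reachable from several parents
                seen.add(child_id)
                queue.append(child_id)


# Stand-in for a remote service, keyed by made-up ids.
FAKE_SERVICE = {
    "A": {"label": "Animals", "narrower": ["B", "C"]},
    "B": {"label": "Vertebrates", "narrower": ["D"]},
    "C": {"label": "Flying animals", "narrower": ["D"]},
    "D": {"label": "Birds", "narrower": []},
}

imported = list(crawl_terms("A", FAKE_SERVICE.__getitem__))
```

Only one term definition is in memory at a time, and the `seen` set keeps a term with two parents from being requested twice.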

This worked well for animal taxa, but didn't have or use a big global list at the top, like I think you are looking at.
Either way, it depends on a service that we can consume. Without that, this can't help much.

You should be able to test the TCS remote service using the dev version, I think I left some preset values in there.

Comment #5

dman commented 3 March 2009 at 22:09

Status: Active » Fixed

Huge datasets can be handled if the terms are passed back in chunks (or one by one) from a service that interfaces with the huge source.
We were able to import the 60,000 terms from MeSH like that. Although it did take a while...

Comment #6

17 March 2009 at 22:10

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

Comment #7

dman commented 12 May 2009 at 23:50

#460920: HOWTO: Import the MeSH taxonomy database. Or a subset.
