Rethink batch record fetching
| Project: | Millennium Integration |
| Version: | 6.x-2.x-dev |
| Component: | Code |
| Category: | task |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | fixed |
sesuncedu in IRC #code4lib gave me another tactic: use the book cart.
Doing some preliminary testing, this might work:
1) Determine the list of item ids to process.
2) Do a request to add all items to the cart at the same time. Calling this URL works for single items:
/search////?save=i[item number]
e.g. /search////?save=i213655
3) Get the bib-item relationship of all items here:
/search/?/++export/1,-1,-1,B/export
4) Get the MARC for all items with a post request:
POST /search~S63*spi/?.i213655/++export/1%2C-1%2C-1%2CB/export/
Post data: email_addx=&email_subj=&ex_device=43&ex_format=50
5) Clear the cart by calling:
/search?/X/X/1,-1,-1,B/browse?clear_saves=1
Obviously, all of the above needs cookies to work.
Extra bonus: post findings here: http://wiki.code4lib.org/index.php/Innovative_Documentation
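To make the five steps above easier to follow, here is a minimal sketch that only builds the list of requests and sends nothing. The function name example_bookcart_request_plan() is hypothetical; the paths are copied from the steps above, and the MARC export path (including its site-specific "~S63*spi" and ".i213655" segments) is passed in as a parameter because it will probably differ per catalog.

```php
<?php
/**
 * Builds the ordered list of requests for the book-cart workflow.
 * Hypothetical helper for illustration only; it performs no HTTP calls.
 *
 * @param array $item_recnums Item numbers, e.g. array('i213655').
 * @param string $marc_export_path Path for step 4; the default is the example
 *   URL quoted in the post and is an assumption for any other catalog.
 */
function example_bookcart_request_plan(array $item_recnums, $marc_export_path = '/search~S63*spi/?.i213655/++export/1%2C-1%2C-1%2CB/export/') {
  $plan = array();
  // Step 2: one "save" request per item (the single-item form from the post).
  foreach ($item_recnums as $recnum) {
    $plan[] = array('method' => 'GET', 'path' => '/search////?save=' . rawurlencode($recnum));
  }
  // Step 3: list the cart to learn the bib-item relationships.
  $plan[] = array('method' => 'GET', 'path' => '/search/?/++export/1,-1,-1,B/export');
  // Step 4: export the MARC records with the form data quoted in the post.
  $plan[] = array(
    'method' => 'POST',
    'path' => $marc_export_path,
    'body' => 'email_addx=&email_subj=&ex_device=43&ex_format=50',
  );
  // Step 5: clear the cart.
  $plan[] = array('method' => 'GET', 'path' => '/search?/X/X/1,-1,-1,B/browse?clear_saves=1');
  return $plan;
}
```

Every request in the plan would have to be sent with the same session cookie. With PHP's cURL extension, that usually means setting CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE to the same file on each handle.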

#1
The way to mass-post items to the cart:
As seen in: http://webpac.nevada.edu/search/X?SEARCH=test
POST: /search~S1?/Xtest&searchscope=1&SORT=D/Xtest&searchscope=1&SORT=D&SUBKEY=test/1%2C7175%2C7175%2CE/2browse
Post data: jumpref=Xtest&save=b3229629&save=b3226913&save=b3248961&save_func=save_marked
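The post data above repeats the "save" key once per record. PHP's http_build_query() would instead emit bracketed keys (save[0]=..., save[1]=...), so a sketch of building that body by hand might look like this. The function name is hypothetical, and whether the catalog accepts other field orderings is untested.

```php
<?php
/**
 * Builds a form body with one repeated "save" key per record number.
 * Hypothetical helper; field names are taken from the post data above.
 *
 * @param string $jumpref Value for the jumpref field, e.g. 'Xtest'.
 * @param array $recnums Record numbers to save, e.g. array('b3229629').
 */
function example_bookcart_save_post_body($jumpref, array $recnums) {
  $pairs = array('jumpref=' . urlencode($jumpref));
  foreach ($recnums as $recnum) {
    // Repeat the plain key "save" instead of "save[]".
    $pairs[] = 'save=' . urlencode($recnum);
  }
  $pairs[] = 'save_func=save_marked';
  return implode('&', $pairs);
}

// Reproduces the post data shown above.
echo example_bookcart_save_post_body('Xtest', array('b3229629', 'b3226913', 'b3248961'));
```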
#2
Looks very promising!
Could the cart be used not only to get better performance for mass harvesting, but also to let users do selective harvesting easily: just do a normal Millennium WebPAC search, put the results in a cart, and import into Drupal?
I realise this might not work for the current implementation, because the i-numbers would need to be known beforehand -- but if we could use b-identifiers as an alternative to i-identifiers as an option, maybe then?
#3
This standalone code works somewhat: needs testing.
#4
Missed upload =)
#5
Thinking of committing this patch. Reviewers welcome =)
#6
Forgot to describe what this does:
The new fetch method is through:
/**
 * Gets a sequential number of records obeying PHP's max_execution_time setting.
 * @param array $item_recnums Array of item numbers to fetch.
 */
function millennium_mass_fetch($item_recnums) {
[...]
which receives an array of item numbers to fetch. This function in turn chunks the array into groups of at most 25, which then get crawled using the WebOPAC's "Book Cart" functionality via this function:
/**
 * Gets item information (bib number & MARC record) using III's book cart.
 * Receives an array of item numbers and returns an array of found item data, including bib number and MARC, keyed by item number, and an unkeyed array of items not found.
 * @param array $item_recnums An unkeyed array of item numbers = array('i100000', 'i100002', ...)
 */
function millennium_fetch_records_via_bookcart($item_recnums) {
[...]
This new function can fetch up to 25 items' information (item number->bib number relationship and MARC record) with only 4-5 requests (where, before, we required as many as 2 requests PER ITEM).
In theory, this should dramatically lower the potential server load caused by the module.
On benchmarking, I was able to get around 3-4x the throughput of the earlier method (pipelining the requests for records).
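As a rough illustration of the chunking and the request arithmetic described in this comment (not the committed code, whose body is elided above), here is a sketch with hypothetical function names. It assumes the worst case of 5 requests per chunk and 2 requests per item for the earlier method.

```php
<?php
/**
 * Splits item numbers into groups of at most $chunk_size.
 * Hypothetical helper; array_chunk() does the actual work.
 */
function example_chunk_recnums(array $item_recnums, $chunk_size = 25) {
  return array_chunk($item_recnums, $chunk_size);
}

/**
 * Estimates request counts for the book-cart method versus the per-item method.
 * The figures 5 and 2 are the upper bounds quoted in the comment above.
 */
function example_request_estimate($item_count, $chunk_size = 25, $requests_per_chunk = 5) {
  return array(
    'bookcart' => (int) ceil($item_count / $chunk_size) * $requests_per_chunk,
    'per_item' => $item_count * 2,
  );
}

// 60 items become three chunks: 25, 25 and 10.
$chunks = example_chunk_recnums(range(1, 60));
// 100 items: 20 book-cart requests versus 200 per-item requests.
$estimate = example_request_estimate(100);
```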
#7
Committing this patch.
A not-very-scientific benchmark says this runs about 77% faster (in records/sec), and makes about 1/10th the number of requests to the server.
#8
Missed corresponding code in performance report. Committed.
#9
Found out I can add up to 25 individual records to the cart at a time, but I can get back up to 50 =) This seems to increase throughput by about 30% =) Yay!
Committing this patch.
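One way to picture the 25-in / 50-out finding is as nested batching: fill the cart with add requests of up to 25 records each, then export once per 50 records. This is only a sketch under that reading; the function name and array keys are hypothetical and the committed patch may organize it differently.

```php
<?php
/**
 * Plans cart batches: each export batch holds up to $export_size records,
 * filled by add requests of up to $add_size records each.
 * Hypothetical helper for illustration only.
 */
function example_plan_cart_batches(array $item_recnums, $add_size = 25, $export_size = 50) {
  $batches = array();
  foreach (array_chunk($item_recnums, $export_size) as $export_batch) {
    $batches[] = array(
      // Sub-groups to add to the cart, one request per sub-group.
      'add_requests' => array_chunk($export_batch, $add_size),
      // Number of records expected back from the single export.
      'export_count' => count($export_batch),
    );
  }
  return $batches;
}
```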
#10
Another round of improvements:
* Change some of the URLs that had the search string "test" in them; removed it since it's unnecessary and slows down fetching.
* Use cache_set() to store the session cookie for 5 minutes.
* Make use of the holdings tables available in the MARC export from the cart; why not, since they're already there? =) If the holdings table turns out to be incomplete (indicated by a button below it), an extra request is made to fetch the rest.
This round improves things by around a factor of 2.
Committing this patch.
#11
Have found some issues with the bookcart import...
For reference, these are some testing sites and valid item/bib ranges to try:
NYPL: http://catnyp.nypl.org
i17820000-i17820500, b18134000-b18134500
AADL: http://irma.aadl.org
i1320500-i1321000, b1321000-b1321500
MELCAT: http://elibrary.mel.org/
i13400000-i13400500, b10340000-b10340500
ITESM: http://millennium.itesm.mx/
i100000-i100500, b1179000-b1179500
CONSULS: http://www.consuls.org
b1580000-b1580500
#12
Have fixed for all the above sites except for NYPL... see attached patch.
This patch includes a testing function, millennium_mass_fetch_test(), which can be called from inside any PHP-format node in order to run some tests. It requires the Devel module to show results. I will probably change this in the future; for now it's just something to let me make sure I don't break things. (I ought to learn to do tests The Drupal Way =))
#13
Whoops: hold the presses. That patch includes a lot of code from #576784: An option to use b-identifiers as an alternative... Do not review, since this will be fixed over at that issue. =)
#14
Committed a fix.
The add-to-bookcart requests are now GETs instead of POSTs, as that was found to be more compatible during testing; also fixed some regexps and other details.
Working correctly with the tested OPACs (see millennium_mass_fetch_test()).
#15
Automatically closed -- issue fixed for 2 weeks with no activity.
#16
Reopening because of a problem with millennium_fetch_records_via_bookcart(): sometimes back-to-back duplicate titles would trigger a false positive for "missing record". The approach in the patch reverses the checking so that it starts from the MARC records and then checks the records listed in the bookcart to relate each MARC record to its record number.
The problem was diagnosed from some crawled records which, although imported, had record numbers that did not match those in the WebOPAC (they were offset by N).
#17
Committed