Rethink batch record fetching

janusman - September 11, 2009 - 17:04
Project: Millennium Integration
Version: 6.x-2.x-dev
Component: Code
Category: task
Priority: normal
Assigned: Unassigned
Status: fixed
Description

sesuncedu in IRC #code4lib gave me another tactic: use the book cart.

From some preliminary testing, this might work:

1) Determine the list of item ids to process.

2) Make a request to add all items to the cart at the same time. Calling this URL works for single items:
/search////?save=i[item number]
e.g. /search////?save=i213655

3) Get the bib-item relationship of all items here:
/search/?/++export/1,-1,-1,B/export

4) Get the MARC for all items with a post request:
POST /search~S63*spi/?.i213655/++export/1%2C-1%2C-1%2CB/export/
Post data: email_addx=&email_subj=&ex_device=43&ex_format=50

5) Clear the cart by calling:
/search?/X/X/1,-1,-1,B/browse?clear_saves=1

Obviously, all of the above needs cookies to work.
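
The steps above can be sketched as a request plan for a single item. This is only an illustration, not the module's code: the URL paths are copied from the steps, while bookcart_request_plan(), its parameters, and the host opac.example.org are hypothetical names. The scope suffix (such as ~S63*spi) is site-specific, so it is passed in rather than hard-coded. Nothing is actually sent.

```php
<?php
/**
 * Sketch: builds the requests for one book-cart round trip (steps 2 to 5).
 * Hypothetical helper; it only assembles URLs and sends nothing.
 */
function bookcart_request_plan($base_url, $item_recnum, $scope = '') {
  $base = rtrim($base_url, '/');
  return array(
    // Step 2: add the item to the cart.
    array(
      'method' => 'GET',
      'url' => $base . '/search////?save=' . $item_recnum,
      'data' => '',
    ),
    // Step 3: list the cart to get the bib-item relationships.
    array(
      'method' => 'GET',
      'url' => $base . '/search/?/++export/1,-1,-1,B/export',
      'data' => '',
    ),
    // Step 4: export the MARC records with a POST.
    array(
      'method' => 'POST',
      'url' => $base . '/search' . $scope . '/?.' . $item_recnum . '/++export/1%2C-1%2C-1%2CB/export/',
      'data' => 'email_addx=&email_subj=&ex_device=43&ex_format=50',
    ),
    // Step 5: clear the cart.
    array(
      'method' => 'GET',
      'url' => $base . '/search?/X/X/1,-1,-1,B/browse?clear_saves=1',
      'data' => '',
    ),
  );
}
```

All four requests would have to share one cookie jar for the cart to persist between them (with PHP's cURL extension, for example, through the CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE options).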

Extra bonus: post findings here: http://wiki.code4lib.org/index.php/Innovative_Documentation

#1

janusman - September 11, 2009 - 17:18
Title: Rethink batch record fethcing » Rethink batch record fetching

The way to mass-post items to the cart:

As seen in: http://webpac.nevada.edu/search/X?SEARCH=test

POST: /search~S1?/Xtest&searchscope=1&SORT=D/Xtest&searchscope=1&SORT=D&SUBKEY=test/1%2C7175%2C7175%2CE/2browse
Post data: jumpref=Xtest&save=b3229629&save=b3226913&save=b3248961&save_func=save_marked
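
A minimal sketch of how that post data could be assembled. The function name bookcart_save_post_body() is hypothetical. The body is built by hand because the form repeats the plain key "save" once per record, whereas PHP's http_build_query() would emit bracketed keys (save[0], save[1], ...) for an array value.

```php
<?php
/**
 * Sketch: builds a form body with one "save" field per record number.
 * Hypothetical helper, not part of the module.
 */
function bookcart_save_post_body(array $recnums, $jumpref) {
  $pairs = array('jumpref=' . rawurlencode($jumpref));
  foreach ($recnums as $recnum) {
    // Repeat the same key for every record, as the WebOPAC form does.
    $pairs[] = 'save=' . rawurlencode($recnum);
  }
  $pairs[] = 'save_func=save_marked';
  return implode('&', $pairs);
}
```

Called with array('b3229629', 'b3226913', 'b3248961') and 'Xtest', this returns the same post data string shown above.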

#2

tituomin - September 14, 2009 - 09:47

Looks very promising!

Could the cart be used not only to get better performance for mass harvesting, but also to enable users to do selective harvesting easily: just do a normal millennium webpac search, put the results in a cart, and import into drupal?

I realise this might not work for the current implementation, because the i-numbers would need to be known beforehand -- but if we could use b-identifiers as an alternative to i-identifiers as an option, maybe then?

#3

janusman - October 12, 2009 - 23:08

This standalone code works, somewhat; it still needs testing.

#4

janusman - October 12, 2009 - 23:08

Missed upload =)

Attachment Size
recordtest_php.txt 4.79 KB

#5

janusman - October 13, 2009 - 22:13
Status: active » needs review

Thinking of committing this patch. Reviewers welcome =)

Attachment Size
millennium-574912-5.patch 23.21 KB

#6

janusman - October 13, 2009 - 22:20

Forgot to describe what this does:

The new fetch method is through:

/**
* Gets a sequential number of records obeying PHP's max_execution_time setting.
* @param array $item_recnums Array of item numbers to fetch.
*/
function millennium_mass_fetch($item_recnums) {
[...]

which receives an array of item numbers to fetch. This function in turn splits the array into chunks of at most 25 items, which then get crawled through the WebOPAC's "Book Cart" functionality using this function:
/**
* Gets item information (bib number & MARC record) using the III's book cart.
* Receives an array of item numbers and returns an array of found item data (bib number and MARC record) keyed by item number, plus an unkeyed array of items that were not found.
* @param array $item_recnums An unkeyed array of item numbers = array('i100000', 'i100002', ...)
*/
function millennium_fetch_records_via_bookcart($item_recnums) {
[...]

This new function can fetch up to 25 items' information (item number->bib number relationship and MARC record) with only 4-5 requests (where, before, we required as many as 2 requests PER ITEM).
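
The chunking described above could look roughly like the sketch below. It is not the code from the patch: mass_fetch_sketch(), $fetch_chunk (a stand-in callback for millennium_fetch_records_via_bookcart()) and $budget_seconds are hypothetical, and the time budget is just one assumed way of staying under max_execution_time.

```php
<?php
/**
 * Sketch: fetches items in chunks and stops once a time budget is used up.
 * $fetch_chunk must return an array keyed by item number.
 */
function mass_fetch_sketch(array $item_recnums, $fetch_chunk, $chunk_size = 25, $budget_seconds = 20.0) {
  $started = microtime(TRUE);
  $results = array();
  $pending = array();
  foreach (array_chunk($item_recnums, $chunk_size) as $chunk) {
    if ($pending || microtime(TRUE) - $started >= $budget_seconds) {
      // Out of time: leave this chunk and all later ones for the next run.
      $pending = array_merge($pending, $chunk);
      continue;
    }
    $results = array_merge($results, call_user_func($fetch_chunk, $chunk));
  }
  return array('results' => $results, 'pending' => $pending);
}
```

The budget could be derived from ini_get('max_execution_time'), keeping in mind that a value of 0 there means no limit.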

In theory, this should dramatically lower the potential server load caused by the module.

On benchmarking, I was able to get around 3-4x the throughput of the earlier method (pipelining the requests for records).

#7

janusman - October 15, 2009 - 19:31
Status: needs review » fixed

Committing this patch.

A not-very-scientific benchmark says this runs about 77% faster (in records/sec), and makes about 1/10th the number of requests to the server.
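
The 1/10th figure is consistent with the request counts given in #6 (up to 2 requests per item before, 4-5 requests per chunk of 25 now). A quick back-of-the-envelope check, with hypothetical function names and the upper bounds from #6 as assumed inputs:

```php
<?php
/**
 * Sketch: worst-case request counts for the two fetch methods.
 */
function requests_old_method($n_items, $requests_per_item = 2) {
  return $n_items * $requests_per_item;
}

function requests_bookcart_method($n_items, $chunk_size = 25, $requests_per_chunk = 5) {
  // One fixed set of requests per chunk, however many items it holds.
  return (int) ceil($n_items / $chunk_size) * $requests_per_chunk;
}
```

For 1000 items this gives 2000 requests with the old method against 200 through the book cart.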

Attachment Size
millennium-574912-7.patch 23.86 KB

#8

janusman - October 15, 2009 - 19:36

Missed corresponding code in performance report. Committed.

Attachment Size
millennium-574912-8.patch 26.42 KB

#9

janusman - October 16, 2009 - 17:07

Found out I can add up to 25 individual records to the cart at a time, but I can get back up to 50 =) This seems to increase throughput by about 30% =) Yay!
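
One way to picture the 25-in, 50-out pattern is as two levels of batching: records are added to the cart in groups of 25, and the cart is exported once every 50. The sketch below only plans the batches; bookcart_batches() and its return shape are hypothetical, not taken from the patch.

```php
<?php
/**
 * Sketch: groups record numbers into export batches, each made up of
 * smaller "add to cart" groups.
 */
function bookcart_batches(array $recnums, $add_size = 25, $export_size = 50) {
  $batches = array();
  foreach (array_chunk($recnums, $export_size) as $export_group) {
    $batches[] = array(
      // Each inner array is one add-to-cart request.
      'adds' => array_chunk($export_group, $add_size),
      // Number of records expected back from the single export.
      'export_count' => count($export_group),
    );
  }
  return $batches;
}
```

With 120 records this yields three batches: two with two add requests each (50 records exported per batch) and a final one with a single add request for the remaining 20.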

Committing this patch.

Attachment Size
millennium-574912-9.patch 2.18 KB

#10

janusman - October 16, 2009 - 20:32

Another round of improvements:
* Changed some of the URLs that had the search string "test" in them; removed it, since it's unnecessary and slows down fetching.
* Use cache_set() to store the session cookie for 5 minutes.
* Make use of the holdings tables available in the MARC export from the cart; why not, since they're already there? =) If a holdings table turns out to be incomplete (indicated by a button below it), the module makes an extra request.

This round improves things by around a factor of 2.
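
The cookie caching can be illustrated with a standalone sketch. The in-memory class below is only a stand-in for Drupal's cache so the example runs on its own; CookieCacheSketch, session_cookie_sketch(), the $login callback and the cache ID format are all hypothetical. In Drupal 6 the equivalent store call would be along the lines of cache_set($cid, $cookie, 'cache', time() + 300).

```php
<?php
/**
 * Sketch: tiny in-memory cache with expiry, standing in for Drupal's cache.
 */
class CookieCacheSketch {
  private $store = array();

  public function set($cid, $data, $expire) {
    $this->store[$cid] = array('data' => $data, 'expire' => $expire);
  }

  public function get($cid, $now) {
    if (!isset($this->store[$cid]) || $this->store[$cid]['expire'] <= $now) {
      return FALSE;
    }
    return $this->store[$cid]['data'];
  }
}

/**
 * Returns a cached session cookie, opening a new session only when the
 * cached one is missing or older than 5 minutes.
 */
function session_cookie_sketch(CookieCacheSketch $cache, $opac_url, $login, $now) {
  $cid = 'millennium_cookie:' . md5($opac_url);
  $cookie = $cache->get($cid, $now);
  if ($cookie === FALSE) {
    // $login is a hypothetical callback that requests a fresh session.
    $cookie = call_user_func($login, $opac_url);
    $cache->set($cid, $cookie, $now + 300);
  }
  return $cookie;
}
```

Reusing the cookie this way saves the session-opening request on every fetch made within the 5-minute window.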

Committing this patch.

Attachment Size
millennium-574912-10.patch 4.7 KB

#11

janusman - October 27, 2009 - 17:20
Status: fixed » needs work

Have found some issues with the bookcart import...
For reference, these are some testing sites and valid item/bib ranges to try:

NYPL: http://catnyp.nypl.org
i17820000-i17820500, b18134000-b18134500

AADL: http://irma.aadl.org
i1320500-i1321000, b1321000-b1321500

MELCAT: http://elibrary.mel.org/
i13400000-i13400500, b10340000-b10340500

ITESM: http://millennium.itesm.mx/
i100000-i100500, b1179000-b1179500

CONSULS: http://www.consuls.org
b1580000-b1580500
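
For testing against these ranges, a small helper can expand a range string into the unkeyed array of record numbers that the fetch functions take. This is a sketch with a hypothetical name, expand_recnum_range(), and it assumes the plain prefix-plus-number form used in the list above.

```php
<?php
/**
 * Sketch: expands a range such as 'i100000-i100500' into
 * array('i100000', 'i100001', ..., 'i100500').
 * Returns an empty array if the string is not a valid range.
 */
function expand_recnum_range($range) {
  // Both ends must carry the same one-letter prefix (i, b, ...).
  if (!preg_match('/^([a-z])(\d+)-\1(\d+)$/', $range, $m)) {
    return array();
  }
  $recnums = array();
  for ($n = (int) $m[2]; $n <= (int) $m[3]; $n++) {
    $recnums[] = $m[1] . $n;
  }
  return $recnums;
}
```

Each range listed above covers 501 record numbers, since both ends are included.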

#12

janusman - October 27, 2009 - 23:02
Status: needs work » needs review

Have fixed this for all of the above sites except NYPL; see attached patch.

This patch includes a testing function, millennium_mass_fetch_test(), which can be called from inside any PHP-format node in order to run some tests. It requires the devel module to show results. I will probably change this in the future; for now it's just something to let me make sure I don't break things. (I ought to learn to do tests The Drupal Way =))

Attachment Size
millennium-574912-12.patch 13.95 KB

#13

janusman - October 27, 2009 - 23:05
Status: needs review » postponed

WOOPS: hold the presses. That patch includes a lot of code from #576784: An option to use b-identifiers as an alternative... Do not review, since this will be fixed over at that issue. =)

#14

janusman - October 30, 2009 - 21:48
Status: postponed » fixed

Committed a fix.

The add-to-bookcart requests are now GETs instead of POSTs, as that was found to be more compatible during testing; also fixed some regexps and other details.

Working correctly with tested OPACs (see millennium_mass_fetch_test())

#15

System Message - November 13, 2009 - 21:50
Status: fixed » closed

Automatically closed -- issue fixed for 2 weeks with no activity.

#16

janusman - December 2, 2009 - 20:20
Status: closed » needs review

Reopening because of a problem with millennium_fetch_records_via_bookcart(): sometimes back-to-back duplicate titles would trigger a false positive for "missing record". The approach in the patch reverses the check, so that it starts from the MARC records and then looks at the records listed in the bookcart to relate each MARC record to its record number.

The problem was diagnosed from some crawled records which, although imported, had record numbers that did not match those in the WebOPAC (they were offset by N).
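
To illustrate the reversed check, here is a sketch that walks the MARC records first and relates each one to the next unused cart entry. It is not the code from the patch. The function name, the input shapes, and the choice to match on title are all assumptions made for the example (the thread only says that duplicate titles were involved); both lists are assumed to be in cart order.

```php
<?php
/**
 * Sketch: relates MARC records to record numbers, driven by the MARC list.
 * $cart_entries: list of array('recnum' => ..., 'title' => ...).
 * $marc_records: list of array('title' => ..., 'marc' => ...).
 */
function relate_marc_to_recnums(array $cart_entries, array $marc_records) {
  $found = array();
  $cursor = 0;
  foreach ($marc_records as $record) {
    // Look only at cart entries not yet consumed, so that two identical
    // back-to-back titles map to two different record numbers.
    for ($i = $cursor; $i < count($cart_entries); $i++) {
      if ($cart_entries[$i]['title'] === $record['title']) {
        $found[$cart_entries[$i]['recnum']] = $record['marc'];
        $cursor = $i + 1;
        break;
      }
    }
  }
  // Anything in the cart listing without a MARC record is reported missing.
  $not_found = array();
  foreach ($cart_entries as $entry) {
    if (!array_key_exists($entry['recnum'], $found)) {
      $not_found[] = $entry['recnum'];
    }
  }
  return array('found' => $found, 'not_found' => $not_found);
}
```

Because matching starts from the records that actually came back, a genuinely missing record only marks its own number as not found and does not shift the numbers of the records after it.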


Attachment Size
millennium-574912-16.patch 8.71 KB
2009-12-02_141847.png 4.81 KB

#17

janusman - December 2, 2009 - 20:21
Status: needs review » fixed

Committed

 
 
