Posted by janusman on April 25, 2008 at 4:35pm
| Project: | Millennium OPAC Integration |
| Version: | 6.x-2.x-dev |
| Component: | Miscellaneous |
| Category: | bug report |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | closed (fixed) |
Issue Summary
Code would have to schedule and prioritize imported nodes to check against Millennium. This would enable automatic actions like:
- Deleting/unpublishing nodes for records deleted in Millennium
- Gathering information like YTDCIRC, LYRCIRC, TOTCIRC (circulation) which might be exposed via views, in nodes, etc.
Code is ready for automatic re-import when the item's Millennium last-modified date is newer than the import date into Drupal.
Comments
#1
Tagging this for a 2.x release.
#2
janusman, how much work would this entail? we're interested in using the millennium module at MPOW, and there is a chance we'd consider throwing some developer time/money at this to get this feature put into a 2.x release.
#3
The short answer is I'm not *exactly* sure how much work this is... it *seems* doable, say in anywhere from 40 - 160 man-hours? (???)
And thanks, I'm open to outside help (work), or donations to get this done on my spare time =)
Let me go over how I think this feature would work:
I was thinking the module needs to determine on its own (or with help from the admins) how many records it can manage to re-crawl on cron calls, to try to match the refresh rate the admin wants.
For example, our University's Millennium system is (for some reason) rather slow on /record=XXXXX URLs (1-3 seconds), but other OPACs I have crawled are much faster. In our case we would need to decide what to recrawl, because if records take 2 seconds each to crawl, even continuously re-crawling everything would give us 46 days between each record's refreshes (we have 2 million items)... and of course I don't want to be banging Millennium constantly =)
(FYI we are currently happy with reimporting records from our OPAC by manually queuing them every 2 months or so, since we are only mirroring a small part of our OPAC using this module)
So I (we?) need to look at approaches like:
1) Abandon crawling, use some sort of MARC+lists import (bad because crawling "just works" and I'd hate to have to introduce complex requirements)
2) Figure out a way to decide what to crawl often vs. "not very often".
3) Figure out a way to speed up fetching records from millennium (perhaps a call to III might work for us?)
Since only (1) and (2) are *really* under our control (heh), I would go after (2).
A quick braindump of how this would work:
a) Records with more checkouts (or recent checkouts) get refreshed more often.
b) Records for some locations (admin-defined) get refreshed more often than others. Special colls. might get refreshed the least often, for example.
c) Records for some item types (admin-defined) get refreshed more often.
d) Set up a special Drupal URL (not /cron.php) that runs longer, to be called separately and in special days/times, say after midnight, weekends, etc. when the library isn't open and Millennium might be running with less load.
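Ideas (a) to (c) could be combined into a per-record refresh interval. Here is a rough sketch in Python (not the module's PHP), purely to illustrate the idea. Every name, threshold and multiplier in it (`BASE_INTERVAL_DAYS`, `LOCATION_FACTOR`, `ITEM_TYPE_FACTOR`, the 10-checkout cutoff) is a made-up assumption, not something the module has.

```python
from dataclasses import dataclass

# Hypothetical values for illustration only; none of these exist in the module.
BASE_INTERVAL_DAYS = 120
POPULAR_CHECKOUTS = 10  # assumed cutoff for "many checkouts"

# Admin-defined multipliers: below 1.0 means "refresh more often".
LOCATION_FACTOR = {"main": 0.5, "special_collections": 2.0}
ITEM_TYPE_FACTOR = {"dvd": 0.5, "book": 1.0}


@dataclass
class Record:
    item_number: int
    total_checkouts: int  # e.g. the TOTCIRC value
    location: str
    item_type: str


def refresh_interval_days(rec: Record) -> float:
    """Return how many days may pass before this record is re-crawled."""
    interval = float(BASE_INTERVAL_DAYS)
    if rec.total_checkouts >= POPULAR_CHECKOUTS:  # idea (a)
        interval *= 0.5
    interval *= LOCATION_FACTOR.get(rec.location, 1.0)    # idea (b)
    interval *= ITEM_TYPE_FACTOR.get(rec.item_type, 1.0)  # idea (c)
    return interval
```

With these made-up factors, a popular DVD in the main location would be refreshed far more often than a never-borrowed book in special collections.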
I'm thinking the module should let the admin know how often it thinks it can refresh (re-crawl) each record on its own (based on its own crawl performance, current PHP timeout and cron frequency), and offer ways to let the admin improve on that (for example, set the PHP timeout to be longer, make cron run more often, or let the admin tweak the (b) and (c) variables I mentioned in my braindump =)
Any ideas? =)
#4
For now, the module's settings include an "APROXIMATE maximum item record number" field, which loosely defines the last record to fetch automatically before the crawl starts over.
The module will try to keep on scanning the database until all records it tries to import during a cron run return an "item not found" message AND the current attempted record is the estimated quantity of records + 5%.
For example, if the starting record is "10000", the ending record is "10100", and the number of items to attempt per cron run is 10, then the crawl will restart from item #10000 when it reaches 10105 AND items 10095-10105 (10 items) were not found.
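The restart rule described above can be sketched roughly as follows. This is Python for illustration, not the module's actual PHP code. The function name `should_restart` and its parameters are hypothetical, and the reading of "estimated quantity of records" as (ending record minus starting record) is an assumption taken from the worked example.

```python
from typing import Sequence


def should_restart(start: int, approx_max: int, current: int,
                   batch_found: Sequence[bool]) -> bool:
    """Decide whether the crawl should wrap around to `start`.

    batch_found holds one flag per record attempted in this cron run:
    True if the record existed, False for "item not found".
    """
    estimated_count = approx_max - start
    # 5% slack past the estimated size, using integer arithmetic.
    threshold = start + (estimated_count * 105) // 100
    reached_end = current >= threshold
    nothing_found = not any(batch_found)
    # Both conditions must hold, as stated in the description.
    return reached_end and nothing_found


# The example from the text: start 10000, end 10100, 10 items per run.
print(should_restart(10000, 10100, 10105, [False] * 10))
```

Note that if even one record in the batch is found, the crawl keeps going past the configured maximum.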
This, along with the addition of pipelining (#355611: Pipeline (simultaneous) record fetching) might be enough for most cases.
Care to try this?
#5
Doing some quick numbers, I'm looking at records being automatically refreshed every 112 days in a 2,000,000-item OPAC.
| Average crawl time per record (secs) | 1.61 |
| Time between cron calls (secs) | 1,800 |
| PHP timeout (secs) | 600 |
| Number of records in system | 2,000,000 |
| Max # of records crawled per Cron call | 372 |
| Max # of records crawled per day | 17,856 |
| Max # of records crawled per week | 124,992 |
| Max # of records crawled per month | 3,749,760 |
| Time between record refreshes (days) | 112.01 |
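For anyone wanting to plug in their own numbers, the arithmetic behind these figures can be sketched like this (Python, for illustration only; `crawl_estimate` and its parameter names are hypothetical). It assumes each cron call crawls for the whole PHP timeout. The 30-day figure it computes is simply the daily figure times 30, which does not match the monthly figure quoted above; that one appears to be the weekly figure multiplied by 30.

```python
def crawl_estimate(avg_secs_per_record: float, cron_interval_secs: int,
                   php_timeout_secs: int, total_records: int) -> dict:
    """Estimate crawl throughput and the time needed to refresh every record."""
    # Whole records that fit into one cron call before PHP times out.
    per_cron = int(php_timeout_secs / avg_secs_per_record)
    cron_calls_per_day = 86400 // cron_interval_secs
    per_day = per_cron * cron_calls_per_day
    return {
        "per_cron": per_cron,
        "per_day": per_day,
        "per_week": per_day * 7,
        "per_30_days": per_day * 30,
        "days_between_refreshes": round(total_records / per_day, 2),
    }


estimate = crawl_estimate(1.61, 1800, 600, 2_000_000)
for name, value in estimate.items():
    print(name, value)
```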
I am thinking now that a percentage of each cron call should always target "higher priority" item # ranges, again, where "higher priority" would be one or a combination of:
#6
Oh, BTW... a quick test of other III OPACs returns MUCH better numbers; our system *must* have an unknown problem since fetching a record takes 1-2 seconds where it takes .1-.5 seconds on other OPACs.
#7
Sorry, but I have the maximum item record set to 1630000. The crawl is now at 1832484 and doesn't stop increasing.
I thought the module would restart the crawl when it finished...
Thanks in advance
Xarbot
#8
Concerning #3: I agree, there is good reason not to abandon crawling. However, using marc+lists could have benefits for some use cases, so having it as an additional feature could be nice. I could look into this.
The initial importing of a large amount of records doesn't seem possible with crawling. So, it would be nice if we could initialize the Millennium Integration+Drupal database via an initial marc dump. But then we would need an additional list mapping the marc records to b and i-numbers.
Another thing is, there could be a Millennium list created automatically for example every night, listing record numbers that have changed, or new records. Of course it's a fairly complex requirement, and at least for newly added records crawling would be quite sufficient.
The idea to not use Drupal's cron is good.
#9
@xarbot: Can you please check your site's admin/reports/millennium page to see what's the last item number successfully crawled? Remember that the module is crawling by ITEM number, which will be significantly higher than your highest bibliographic record number.
The module will keep going after the latest item number you have configured, if it keeps finding records. This lets you "set it and forget it". When it sees that there are no records for about 5% past your total collection size, then it does reset back to the starting item number.
If you think this is not working for you at all, I'd appreciate you giving me your OPAC's URL so I can try to see what's going on...
#10
@tituomin:
I was coming to the conclusion that I need 3 different cron "crawling" mechanisms:
1) One where you let the module auto-discover what your OPAC has, which is the current way the module behaves.
2) Just recrawl/refresh already-imported records.
This way, you can then choose (2), and choose to manually import items given a list of item numbers (which can also be done using admin/settings/millennium/queue_add ) when you like. The rest of the time the module will automatically take care of those items you have already imported.
For a large initial import, perhaps it could also use the Batch API instead of cron runs. However, there is another issue open for that: #464068: Batch API integration
#11
EDIT: it's 5 million records a month, not 36 million :P Still, whoa. =)
Due to #574912: Rethink batch record fetching (and perhaps some upgrades to our WebOpac) my test numbers are about 10x better! =)
Now, I could look at crawling 5 million records a month. Not that I need to. Whoa.
| Average crawl time per record (secs) | 0.17 |
| Time between cron calls (secs) | 1,800 |
| PHP timeout (secs) | 600 |
| Number of records in system | 2,000,000 |
| Max # of records crawled per Cron call | 3,600 |
| Max # of records crawled per day | 172,800 |
| Max # of records crawled per week | 1,209,600 |
| Max # of records crawled per month | 4,838,400 |
| Time between record refreshes (days) | 11.57 |
I still like adding the option from #10, a radio button that switches from "crawl all" to "just recrawl already-imported items".
#12
Nice work! =)
#13
Well, uhm, closing this out as it's "good enough" IMO:
1) You can choose auto-crawl and it will start over automatically from the set "beginning" when no more records are found.
2) You can also manually import all previously imported items in /admin/content/millennium.
In both options, already-imported records are automatically updated only once they are stale ("stale" right now means 30 days or more have passed since the last update). (Except if you check "Force update" in a manual import.)
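As a rough illustration of that staleness rule (Python, not the module's PHP; `needs_update` and `STALE_AFTER` are hypothetical names), and assuming "Force update" simply bypasses the staleness check, which is my reading and not confirmed above:

```python
from datetime import datetime, timedelta

# 30 days is the threshold mentioned above; the constant name is made up.
STALE_AFTER = timedelta(days=30)


def needs_update(last_update: datetime, now: datetime, force: bool = False) -> bool:
    """Return True if an already-imported record should be re-fetched."""
    # Assumption: a forced manual import ignores staleness entirely.
    return force or (now - last_update) >= STALE_AFTER
```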
I confirm the above numbers are valid for my univ's WebOPAC. Maybe I could have some benchmarks from others' =)
#14
Automatically closed -- issue fixed for 2 weeks with no activity.