I could volunteer to implement the following feature:
Let's create a new option in the module settings so that the user can choose between bibliographic record ids and item record ids as the primary record id used by the system.
I do realize there are some future benefits for storing the i-ids, which have been mentioned elsewhere:
* access to circulation statistics require i-ids
* the greater the i-id, the more recent the item
* and others?
However, there would also be gains from using b-ids:
* we could use the millennium rss-feeds to get new items (the feeds contain the b-id permalinks)
* we could use the webpac cart to do selective harvesting (again, the cart contains only b-ids in this use case)
* harvesting would be quicker
Because there are benefits from both approaches, why not give the choice to the user?
I realize this could turn out to be a pervasive change, so let's discuss!
By the way, could it be possible to use b-ids for web harvesting, and batch import b-to-i-mappings from lists created from Millennium directly? This would be a future safeguard, if the user would later realize the need for i-numbers.
| Comment | File | Size | Author |
|---|---|---|---|
| #25 | millennium-576784-25.patch | 102.24 KB | janusman |
| #24 | millennium-576784-24.patch | 97.63 KB | janusman |
| #24 | millennium-576784-19-to-24-meta.patch | 30.16 KB | janusman |
| #20 | millennium-576784-10-to-19-meta-patch.patch | 31.51 KB | janusman |
| #19 | millennium-576784-19.patch | 87.18 KB | janusman |
Comments
Comment #1
janusman commentedI *think* you are incorrect in stating that the cart requires b-numbers; I have done some experiments using item numbers and it seems to work.
I *am* open to adding this feature, it would be great if we could mantain the option of using item numbers OR bib numbers, and not have it be a global option; perhaps some items were imported with i's, other with b's ...
Comment #2
tituomin commentedVery good point, I agree that it's probably best to not make it a global option! I will start to study the code, and come up with a more detailed proposal.
About the cart: I saw your other post about using the cart, and I'm talking about a different scenario: you could do manual searches in the webpac for some search terms, and save the search results in the cart. At this point, if you parse the cart html, at least according to my tests, you have gained the b-id's, not the item ids. So I was talking about using the webpac and the cart to manually choose records for harvesting...
Comment #3
tituomin commentedBy the way, it seems currently only one i-id per node is saved, although there are many items for a manifestation. Should there be an additional table listing all the i-ids?
Comment #4
janusman commentedRe: #3: After a large crawl there *should* be several items for some bib items; at least that's the result for our OPAC crawl =)
Comment #5
tituomin commentedI'm sorry, I've tried to examine the code and the database, but I just can't find several item-ids.. I might be missing something but here's what I see.
First, the schema:
Now, this implies that there is only one item_recnum per table row. So there would need to be several rows for the same bibliographic record with the same bib_recnum -- but this is not what I see. I have different bib_recnums for different rows. (As far as I understand the code, it always fetches an existing row from the database and updates that). This way of storing the data would also duplicate all the bib-data for all items..
The item-id's could also be lurking inside biblio_data as an array, but the item_recnums there are strings!
Also. looking at the code, it seems that it always creates a new nodeobject from the marc data, but just saves it with an existing nid if there was one.. So it doesn't look like it would cumulate the different item ids.
So I'm puzzled! Could you explain? =)
Comment #6
tituomin commentedAlso, for example the code that fetches an item-number for a node does this:
It doesn't fetch an array from the db, thereby assuming there is only one result..
Comment #7
janusman commentedRe: #5; you are correct. The module was only storing the node-bibnum-itemnum relationship when nodes were created, but not updated.
Will commit this patch for millennium.module along with other fixes/tweaks.
Comment #8
janusman commentedJust wanted to jot down a few notes...
Back to the issue... I would like to give users the option of being able to switch back and forth between one and the other mechanism.
So... thinking this:
The logic to determine when a node should be updated is not smart nor logical; I had an extra XRECORD call (that I recently removed) to get the bib record last update date, to determine if a node should be refreshed or not. This might have to be put back in (meaning a performance penalty of about 50%) or switch to an admin-selected expiry time (like, always refresh after 30 days).. or if crawling using items, decide that a new item for that bib means "refresh the node".
Comment #9
janusman commentedFirst part of the patch, nowhere near complete. LOTS of "TODOs" and dpm()s in the code (won't work if you don't have devel.module enabled)
Currently works:
Preview: http://example.com/millennium/preview/b123456
Manual import using lists of b-numbers: /admin/settings/millennium/manual-import
The tools, marc view, node view, etc. on imported records.
This patch changes the database schema... I think the Drupal cache needs to be cleared in order for it to pick up. I don't think I will come up with code to migrate from the earlier version to this (hey, this is DEV) =)
To test you have to start over importing some new records in /admin/settings/millennium/manual-import
Comment #10
janusman commentedThis is a BIG patch, reworking Millennium to be friendly to b-numbers AS WELL AS i-numbers.
The gist is this:
* Crawling and batch importing accepts either item or bib record numbers.
How the node-bib-item relationship is handled:
* There is a 1:1 relationship between nodes and bib records (in the new table millennium_node_bib) for a single base_url (meaning importing same bib # for another base url will create another node)
* There is a 1:[0-n] relationship between nodes and items (meaning there can be 0 item records for a certain node). During import or crawl, specifying an item for import creates its bib and corresponding node first (if it doesn't exist), then the item gets linked via the table millennium_bib_item.
Handling of not-found items during crawl or batch import:
* trying a bib number that is not found in the WebOPAC will delete the node and the bib and item entries (if any).
* trying an item number that is not found in the WebOPAC will delete that item only; and if it was the only item left in millennium_bib_item it will also delete the node.
During crawl of batch import, items are marked "stale" after 30 days, and refresh whenever an import for that record is attempted after that time. If they are not stale, they are not refreshed (except if you specified so in a the batch import form).
Other changes:
* The batch import form is now in admin/content/millennium (we're adding content, so it seemed better than in admin/settings/X)
* Added a URL to directly create nodes using REST:
/millennium/import/[bib or item rec num]
e.g.
http://example.com/millennium/import/b123456
will create a new node based on that record. Or, if there is a node for that record already, it refreshed the bib. data in Drupal and redirects the user to that node. I want this as a bookmarklet (opening another issue for that) =)
TODO:
* The items data itself is still not of much use; this (as before) is just in place to allow future per-item data import; circulation stats, location data, etc. (presumably using XRECORD calls).
* Testing! Not tested at all: automatic delete when item or bib records not being found.
Comment #11
janusman commentedI failed to mention this requires uninstalling and re-installing the module, as new tables are needed. Again, no upgrade path is planned.
Set up the module normally, then you could quickly batch-import some items at
http://example.com/admin/content/millennium
Comment #12
janusman commentedFound a problem with the bookcart "finding" bib records even though they are not visible or do not exist when visiting the direct url (record=b123456).
The bookcart lists the item that doesn't exist with the title as the previous one in the list (that does exist), however some items are missing (like the year and the material icon)... so perhaps this can be asverted with some code.
Comment #13
tituomin commentedLooking very good, I will start testing this patch down here! =)
Comment #14
tituomin commentedTested, works:
Quick note: I first tried to use bnumbers without the final checksum character, didn't work. The checksums won't be there if I get the bnums from an rss feed or other web source. Is the id-number length in Millennium actually fixed to 1 character + 7 digits + 1 digit/character, like it seems to be? In this case, could we just check that the internal bnum is indeed 8 characters long, and cut the last one out if it's 9 characters long? This way the user could use either version, without the 8 character version being truncated to 7 characters..
Continuing with the tests...
Comment #15
tituomin commentedTested, works:
Comment #16
tituomin commentedOk, this might be overly cautious, but here goes:
It might be the only item left because the item-based crawling hasn't yet reached the other items. A bit theoretical, I guess. But wouldn't it be safest to delete the item only if the bib-id is not found in the WebOPAC? Of course, in this scenario, the crawling would eventually reach the other items, the node would be re-created, but all possible associations to the node (comments, links from the web, etc.) would be lost since the nid would change... And we always know the bib-id if we know the item-id, but not vice versa, and item-id's can be more transient because physical items can be lost, broken &c.
Oh, by the way: could we use some sort of status flag instead of actually deleting the node in this case? Again, theoretically, it could be valuable information to know that we've actually seen this node before and noticed it has been removed from Millennium...
Anyway, at least according to my tests, your modifications seem to be working well. This is even better than what I had in mind when I first opened the issue.. But let's continue testing! =)
Comment #17
janusman commentedThanks for testing =) Keep the comments comming!
Ok, re: #16, we could unpublish the node when the last item for the node is not found (instead of deleting it). Finding a new item for that bib would then re-publish it.
However, then we could theoretically end up with lots of unpublished nodes, but I'm thinking we could just inform this somewhere on the reports page so the admin can take action if he wants (or, add an option that lets the admin delete those "orphaned" nodes). Thoughts?
Comment #18
janusman commentedre: #14: when giving a list of item numbers, I opted for the default format Millennium uses when exporting lists, which means it has a check digit (or X) at the end of the record number. However, I realize this is acually not consistent with the rest of the interface where we are asking just for the sequential number.
I guess I could add a checkbox above the list that says "ignore check digit (Use when exported record numbers Millennium contain check digits or 'X' at the end)."
Comment #19
janusman commentedThis new patch addresses the issues found in #12, #14 and #16.
Comment #20
janusman commentedTe ease review this is a meta-patch comparing the patch from #19 to the one in #10. =)
Comment #21
tituomin commentedYeah, it might be a good idea to provide an admin list of unpublished bibliographic record nodes. Actually, we could just use the Drupal admin content lists?
And although we might not always be aware of every item record number, maybe we could in some cases use the Millennium holdings table to see if there are actual items..
I think unpublishing is better in any case.
Comment #22
tituomin commentedNice! But this code...
... has a few problems (I got errors). I think the following regexps would be more correct:
You don't need the extra repetition operators after {}, and I think , is the correct operator inside the braces. Also, we only except at most 1 x.
Notice how I also switched the regexps around. I think your version with the x should match when ignore_checkdigit is true (we're expecting the x when we want to ignore it) ..
I must say the verification power of the regexps suffer if you don't fix the amount of digits exactly (eg. {7}). This is because the check digit becomes part of the actual sequence number. That's why
'/^[bi][0-9]{7}[\dx]?$/i'could be even better, but I'm not sure 7 is always the actual number of digits in Millennium.. That's why we could settle with just spotting an x where one shouldn't be.
One more note: the same kind of verification is happening in two places (cron / batch), I could try to refactor this a bit.
Comment #23
janusman commented#22: You're correct, just noticed this yesterday... guess I didn't actually test the listing functionality...
In fact, i've seen big record numbers (8 digits) like http://catalog.nypl.org/record=b18134823 ... it might be safe to allow from 5-10 digits.
I like the idea of having the regexps as constants...
Will issue a new patch soon... I think this is just about done, don't you think? =)
Comment #24
janusman commentedNew patch. This addresses some issues from previous comments, and also some problems found during import using the bookcart.
* admin/settings/millennium has an option as to what action to take when bibs are no longer found.
* the cart wasn't correctly handling some WebOPAC configurations where the bookcart is paged. Fixed here.
* Added missing t()s inside drupal_set_message() (except some that are still there just for debugging).
* Added the regexps for recognizing records as constants.
Sorry for mixing up so many things in a single patch (this making it hard to review) but maybe at least it can be tested =)
Ah! There's a new function to test harversing: millennium_mass_fetch_test() that can be called from any PHP-enabled page that will scan some known OPACs and show results. It requires devel.module, though.
Comment #25
janusman commentedCommitting this patch.
Tested with several OPACs, seems to be working fine =) Please report any problems you might see.
yay! =)