I could volunteer to implement the following feature:

Let's create a new option in the module settings so that the user can choose between bibliographic record ids and item record ids as the primary record id used by the system.

I do realize there are some future benefits for storing the i-ids, which have been mentioned elsewhere:
* access to circulation statistics require i-ids
* the greater the i-id, the more recent the item
* and others?

However, there would also be gains from using b-ids:

* we could use the millennium rss-feeds to get new items (the feeds contain the b-id permalinks)
* we could use the webpac cart to do selective harvesting (again, the cart contains only b-ids in this use case)
* harvesting would be quicker

Because there are benefits from both approaches, why not give the choice to the user?

I realize this could turn out to be a pervasive change, so let's discuss!

By the way, could it be possible to use b-ids for web harvesting, and batch import b-to-i-mappings from lists created from Millennium directly? This would be a future safeguard, if the user would later realize the need for i-numbers.

Comments

janusman’s picture

I *think* you are incorrect in stating that the cart requires b-numbers; I have done some experiments using item numbers and it seems to work.

I *am* open to adding this feature, it would be great if we could mantain the option of using item numbers OR bib numbers, and not have it be a global option; perhaps some items were imported with i's, other with b's ...

tituomin’s picture

Very good point, I agree that it's probably best to not make it a global option! I will start to study the code, and come up with a more detailed proposal.

About the cart: I saw your other post about using the cart, and I'm talking about a different scenario: you could do manual searches in the webpac for some search terms, and save the search results in the cart. At this point, if you parse the cart html, at least according to my tests, you have gained the b-id's, not the item ids. So I was talking about using the webpac and the cart to manually choose records for harvesting...

tituomin’s picture

By the way, it seems currently only one i-id per node is saved, although there are many items for a manifestation. Should there be an additional table listing all the i-ids?

janusman’s picture

Re: #3: After a large crawl there *should* be several items for some bib items; at least that's the result for our OPAC crawl =)

tituomin’s picture

I'm sorry, I've tried to examine the code and the database, but I just can't find several item-ids.. I might be missing something but here's what I see.

First, the schema:

+--------------+------------------+------+-----+---------+----------------+
| Field        | Type             | Null | Key | Default | Extra          |
+--------------+------------------+------+-----+---------+----------------+
| id           | int(10) unsigned | NO   | PRI | NULL    | auto_increment | 
| nid          | int(10) unsigned | NO   | MUL | NULL    |                | 
| item_recnum  | char(15)         | NO   |     | NULL    |                | 
| bib_recnum   | char(15)         | NO   | MUL | NULL    |                | 
| item_updated | datetime         | NO   |     | NULL    |                | 
| created      | datetime         | NO   |     | NULL    |                | 
| bib_updated  | datetime         | YES  |     | NULL    |                | 
| biblio_data  | text             | NO   |     | NULL    |                | 
| base_url     | char(80)         | NO   |     | NULL    |                | 
+--------------+------------------+------+-----+---------+----------------+

Now, this implies that there is only one item_recnum per table row. So there would need to be several rows for the same bibliographic record with the same bib_recnum -- but this is not what I see. I have different bib_recnums for different rows. (As far as I understand the code, it always fetches an existing row from the database and updates that). This way of storing the data would also duplicate all the bib-data for all items..

The item-id's could also be lurking inside biblio_data as an array, but the item_recnums there are strings!

Also. looking at the code, it seems that it always creates a new nodeobject from the marc data, but just saves it with an existing nid if there was one.. So it doesn't look like it would cumulate the different item ids.

So I'm puzzled! Could you explain? =)

tituomin’s picture

Also, for example the code that fetches an item-number for a node does this:

function millennium_item_recnum_from_node($nid) {
  static $cache;
  if (isset($cache[$nid])) {
    return $cache[$nid];
  }
  else {
    $recnum = db_result(db_query("SELECT item_recnum FROM {millennium_node_opacitem} WHERE nid=%d", $nid));
    if ($recnum) {
      $cache[$nid] = $recnum;
      return $recnum;
    }
    else {
      return FALSE;
    }
  }
}

It doesn't fetch an array from the db, thereby assuming there is only one result..

janusman’s picture

Re: #5; you are correct. The module was only storing the node-bibnum-itemnum relationship when nodes were created, but not updated.

Will commit this patch for millennium.module along with other fixes/tweaks.

@@ -1021,10 +1021,24 @@
   // Store node in MySQL
   node_save($node);
 
-  // Add relationship to millennium_node_opacitem table
-  $result = db_query("UPDATE {millennium_node_opacitem} SET bib_updated='now()', biblio_data='%s' WHERE bib_recnum='%s'",
-    serialize($node->millennium_biblio_data),
-    $bib_recnum);
+
+  // Add or update relationship to millennium_node_opacitem table
+  $result = db_fetch_object(db_query("SELECT * from {millennium_node_opacitem} WHERE item_recnum = '%s' AND bib_recnum= '%s' AND base_url = '%s'", $item_recnum, $bib_recnum, millennium_get_real_baseurl()));
+  if ($result->item_recnum) {
+    #drupal_set_message("millennium_node_update() updating millennium_node_opacitem");
+    $result = db_query("UPDATE {millennium_node_opacitem} SET bib_updated = NOW(), biblio_data='%s' WHERE bib_recnum='%s'",
+      serialize($node->millennium_biblio_data),
+      $bib_recnum);
+  } else {
+    // Add relationship to millennium_node_opacitem table
+    $result = db_query("INSERT INTO {millennium_node_opacitem} (nid, item_recnum, bib_recnum, item_updated, created, bib_updated, biblio_data, base_url) VALUES (%d, '%s', '%s', NOW(), NOW(), NOW(), '%s', '%s')",
+        $node->nid,
+        $item_recnum,
+        $bib_recnum,
+        serialize($node->millennium_biblio_data),
+        millennium_get_real_baseurl()
+    );
+  }
janusman’s picture

Just wanted to jot down a few notes...

Back to the issue... I would like to give users the option of being able to switch back and forth between one and the other mechanism.

So... thinking this:

  • millennium_import_update_item() could optionally take in either the bib number, item number or both.
  • When only the bib number is given, it only does logic to update/import based on that; the millennium_node_opacitem.item_recnum field would be left to null meaning the item number is unknown.
    • Note: If later we decide we do know the item number, and that item number does not exist, then null value is overwritten with the item number given to the function.
  • When only the item number is given, it does the logic to determine the bib number, establishes additional relationships in millennium_node_opacitem (or overwrites a "null" value for item_recnum if one exists for that bib #)

The logic to determine when a node should be updated is not smart nor logical; I had an extra XRECORD call (that I recently removed) to get the bib record last update date, to determine if a node should be refreshed or not. This might have to be put back in (meaning a performance penalty of about 50%) or switch to an admin-selected expiry time (like, always refresh after 30 days).. or if crawling using items, decide that a new item for that bib means "refresh the node".

janusman’s picture

StatusFileSize
new55.47 KB

First part of the patch, nowhere near complete. LOTS of "TODOs" and dpm()s in the code (won't work if you don't have devel.module enabled)

Currently works:

Preview: http://example.com/millennium/preview/b123456
Manual import using lists of b-numbers: /admin/settings/millennium/manual-import
The tools, marc view, node view, etc. on imported records.

This patch changes the database schema... I think the Drupal cache needs to be cleared in order for it to pick up. I don't think I will come up with code to migrate from the earlier version to this (hey, this is DEV) =)

To test you have to start over importing some new records in /admin/settings/millennium/manual-import

janusman’s picture

Status: Active » Needs review
StatusFileSize
new79.11 KB

This is a BIG patch, reworking Millennium to be friendly to b-numbers AS WELL AS i-numbers.

The gist is this:
* Crawling and batch importing accepts either item or bib record numbers.

How the node-bib-item relationship is handled:
* There is a 1:1 relationship between nodes and bib records (in the new table millennium_node_bib) for a single base_url (meaning importing same bib # for another base url will create another node)
* There is a 1:[0-n] relationship between nodes and items (meaning there can be 0 item records for a certain node). During import or crawl, specifying an item for import creates its bib and corresponding node first (if it doesn't exist), then the item gets linked via the table millennium_bib_item.

Handling of not-found items during crawl or batch import:
* trying a bib number that is not found in the WebOPAC will delete the node and the bib and item entries (if any).
* trying an item number that is not found in the WebOPAC will delete that item only; and if it was the only item left in millennium_bib_item it will also delete the node.

During crawl of batch import, items are marked "stale" after 30 days, and refresh whenever an import for that record is attempted after that time. If they are not stale, they are not refreshed (except if you specified so in a the batch import form).

Other changes:
* The batch import form is now in admin/content/millennium (we're adding content, so it seemed better than in admin/settings/X)
* Added a URL to directly create nodes using REST:
/millennium/import/[bib or item rec num]
e.g.
http://example.com/millennium/import/b123456

will create a new node based on that record. Or, if there is a node for that record already, it refreshed the bib. data in Drupal and redirects the user to that node. I want this as a bookmarklet (opening another issue for that) =)

TODO:
* The items data itself is still not of much use; this (as before) is just in place to allow future per-item data import; circulation stats, location data, etc. (presumably using XRECORD calls).
* Testing! Not tested at all: automatic delete when item or bib records not being found.

janusman’s picture

I failed to mention this requires uninstalling and re-installing the module, as new tables are needed. Again, no upgrade path is planned.

Set up the module normally, then you could quickly batch-import some items at
http://example.com/admin/content/millennium

janusman’s picture

Status: Needs review » Needs work

Found a problem with the bookcart "finding" bib records even though they are not visible or do not exist when visiting the direct url (record=b123456).

The bookcart lists the item that doesn't exist with the title as the previous one in the list (that does exist), however some items are missing (like the year and the material icon)... so perhaps this can be asverted with some code.

tituomin’s picture

Looking very good, I will start testing this patch down here! =)

tituomin’s picture

Tested, works:

  • Crawl import using bnumbers
  • Batch import using bnumbers
  • Batch import using inumbers

Quick note: I first tried to use bnumbers without the final checksum character, didn't work. The checksums won't be there if I get the bnums from an rss feed or other web source. Is the id-number length in Millennium actually fixed to 1 character + 7 digits + 1 digit/character, like it seems to be? In this case, could we just check that the internal bnum is indeed 8 characters long, and cut the last one out if it's 9 characters long? This way the user could use either version, without the 8 character version being truncated to 7 characters..

Continuing with the tests...

tituomin’s picture

Tested, works:

  • File import using both inumbers and bnumbers. The correct associations between the records seem to be saved in millennium_bib_item
tituomin’s picture

Ok, this might be overly cautious, but here goes:

* trying an item number that is not found in the WebOPAC will delete that item only; and if it was the only item left in millennium_bib_item it will also delete the node.

It might be the only item left because the item-based crawling hasn't yet reached the other items. A bit theoretical, I guess. But wouldn't it be safest to delete the item only if the bib-id is not found in the WebOPAC? Of course, in this scenario, the crawling would eventually reach the other items, the node would be re-created, but all possible associations to the node (comments, links from the web, etc.) would be lost since the nid would change... And we always know the bib-id if we know the item-id, but not vice versa, and item-id's can be more transient because physical items can be lost, broken &c.

Oh, by the way: could we use some sort of status flag instead of actually deleting the node in this case? Again, theoretically, it could be valuable information to know that we've actually seen this node before and noticed it has been removed from Millennium...

Anyway, at least according to my tests, your modifications seem to be working well. This is even better than what I had in mind when I first opened the issue.. But let's continue testing! =)

janusman’s picture

Thanks for testing =) Keep the comments comming!

Ok, re: #16, we could unpublish the node when the last item for the node is not found (instead of deleting it). Finding a new item for that bib would then re-publish it.

However, then we could theoretically end up with lots of unpublished nodes, but I'm thinking we could just inform this somewhere on the reports page so the admin can take action if he wants (or, add an option that lets the admin delete those "orphaned" nodes). Thoughts?

janusman’s picture

re: #14: when giving a list of item numbers, I opted for the default format Millennium uses when exporting lists, which means it has a check digit (or X) at the end of the record number. However, I realize this is acually not consistent with the rest of the interface where we are asking just for the sequential number.

I guess I could add a checkbox above the list that says "ignore check digit (Use when exported record numbers Millennium contain check digits or 'X' at the end)."

janusman’s picture

Status: Needs work » Needs review
StatusFileSize
new87.18 KB

This new patch addresses the issues found in #12, #14 and #16.

  • Fixed the bookcart import; items not found were still being added to the bookcart under wrong bib numbers. The current fix might not be perfect; it only detects if a title is repeated and ignores the records for those repeated titles. This is more of an III bug, and probably needs testing to see if it's reproducible on other systems.
  • Deleting nodes when items not found: it will still delete the node but only when this is the last item, AND if the corresponding bib record is also not found on the WebOPAC:
  • Batch import via the list (textarea) has a checkbox to automatically strip off the check digits.
janusman’s picture

StatusFileSize
new31.51 KB

Te ease review this is a meta-patch comparing the patch from #19 to the one in #10. =)

tituomin’s picture

However, then we could theoretically end up with lots of unpublished nodes, but I'm thinking we could just inform this somewhere on the reports page so the admin can take action if he wants (or, add an option that lets the admin delete those "orphaned" nodes). Thoughts?

Yeah, it might be a good idea to provide an admin list of unpublished bibliographic record nodes. Actually, we could just use the Drupal admin content lists?

And although we might not always be aware of every item record number, maybe we could in some cases use the Millennium holdings table to see if there are actual items..

I think unpublishing is better in any case.

tituomin’s picture

Batch import via the list (textarea) has a checkbox to automatically strip off the check digits.

Nice! But this code...

$regexp = $form_state["values"]["ignore_checkdigit"] ? '/^[bi][0-9]{4-8}+*$/i' : '/^[bi][0-9]{4-8}+x*$/i';

... has a few problems (I got errors). I think the following regexps would be more correct:

define('MILLENNIUM_RECORD_NUMBER_PREG_CHECKED', '/^[bi][0-9]{4,8}x?$/i');
define('MILLENNIUM_RECORD_NUMBER_PREG_UNCHECKED', '/^[bi][0-9]{4,7}$/i');
....
$regexp = $form_state["values"]["ignore_checkdigit"] ? MILLENNIUM_RECORD_NUMBER_PREG_CHECKED : MILLENNIUM_RECORD_NUMBER_PREG_UNCHECKED;

You don't need the extra repetition operators after {}, and I think , is the correct operator inside the braces. Also, we only except at most 1 x.

Notice how I also switched the regexps around. I think your version with the x should match when ignore_checkdigit is true (we're expecting the x when we want to ignore it) ..

I must say the verification power of the regexps suffer if you don't fix the amount of digits exactly (eg. {7}). This is because the check digit becomes part of the actual sequence number. That's why

'/^[bi][0-9]{7}[\dx]?$/i'

could be even better, but I'm not sure 7 is always the actual number of digits in Millennium.. That's why we could settle with just spotting an x where one shouldn't be.

One more note: the same kind of verification is happening in two places (cron / batch), I could try to refactor this a bit.

janusman’s picture

Status: Needs review » Needs work

#22: You're correct, just noticed this yesterday... guess I didn't actually test the listing functionality...

In fact, i've seen big record numbers (8 digits) like http://catalog.nypl.org/record=b18134823 ... it might be safe to allow from 5-10 digits.

I like the idea of having the regexps as constants...

Will issue a new patch soon... I think this is just about done, don't you think? =)

janusman’s picture

Status: Needs work » Needs review
StatusFileSize
new30.16 KB
new97.63 KB

New patch. This addresses some issues from previous comments, and also some problems found during import using the bookcart.

* admin/settings/millennium has an option as to what action to take when bibs are no longer found.
* the cart wasn't correctly handling some WebOPAC configurations where the bookcart is paged. Fixed here.
* Added missing t()s inside drupal_set_message() (except some that are still there just for debugging).
* Added the regexps for recognizing records as constants.

Sorry for mixing up so many things in a single patch (this making it hard to review) but maybe at least it can be tested =)

Ah! There's a new function to test harversing: millennium_mass_fetch_test() that can be called from any PHP-enabled page that will scan some known OPACs and show results. It requires devel.module, though.

janusman’s picture

Status: Needs review » Fixed
StatusFileSize
new102.24 KB

Committing this patch.

Tested with several OPACs, seems to be working fine =) Please report any problems you might see.

yay! =)

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.