The mass cart import failed for several items. The comparison logic seems a bit fragile, like the comments said. I caught two bugs in _millennium_only_letters: it didn't work for international letters (switched to mb_ereg_...) and the MARC indicators should be replaced with a space (at least here).

Comments

tituomin’s picture

StatusFileSize
new1.04 KB
janusman’s picture

StatusFileSize
new3.91 KB
+++ millennium.import.inc	26 Feb 2010 11:58:18 -0000
@@ -362,16 +362,15 @@ function millennium_fetch_records_via_bo
-  $new = drupal_strtolower(preg_replace("/[^a-z0-9]/i", "", $new));
+  $new = drupal_strtolower(mb_ereg_replace("/\W/i", "", $new));

Because it's difficult to really know what PCRE support admins will have compiled into their PHP instance (see http://drupal.org/node/315486#comment-1117887) I'm hesitant to place this code in. I think the correct way it to use what search.module does: use a constant that defines "non-word" characters instead of using \W.

So, I prefer this new patch... needs testing.

Powered by Dreditor.

janusman’s picture

StatusFileSize
new3.9 KB

I can't seem to find good subjects to test this with!

I *did* find ONE record that did import better with your patch than with no patch or my patch from #2 in b1112000-b1112999 @www.helmet.fi ...

I rewrote my patch to the one below and it seems to fix that one missing record.

I would welcome more record numbers where you found your patch solved the problem, to actually do a good comparison and testing =)

tituomin’s picture

Try

b1935403
--edit: removed records which actually work -- 

For me they all fail -- with your patch they work!

That one fails because the title is all non-ASCII.

tituomin’s picture

Aha! Curiously, because the first one fails, the rest fail too.. If you remove the first from the list, the rest are ok. I made a new issue & patch for this bug: #729016: Title mismatch causes whole block to be set as not found. As for this issue, I think your patch is ok! Actually, I realized you don't even need to replace the subfield indicators with spaces, since it's just an ad hoc test and makes the same transformation for both operands in the test..

janusman’s picture

StatusFileSize
new6.25 KB

Ok, after a LOT of testing, this is what I came up with:
1) Settled with mb_ereg_replace instead of preg_replace for _millennium_only_letters() ... as for some reason (in 2 different test environments) trying to use preg_replace in some instances choked with some MARC records containing som UTF-8 characters in high ranges; mb_ereg_replace() worked perfectly. I added (in another commit) a dependency on that function in order to use the module (hoping it won't be too much of a requisite)

2) I did some fine-grained testing to check if what we're getting from millennium_fetch_records_via_bookcart() can be trusted or not (see new function _millennium_confirm_found_not_found()) by checking, one by one, each found and not-found record. The result?

  • Under some circumstances, false positives can be returned (returning records that aren't really there, repeating an existing one because of the title-comparing algorithm). E.g.
    www.helmet.fi, crawl of 1113600 - 1113799
    Returns false positives for these 4 records
    b1113696-b1113699
    
  • Also, false negatives could be returned (not sure yet why! in some cases records that can be accessed using the /record=b123456 URL that are added to the book cart don't show up in the cart; but it seems rather random since they could show up if they're the single item added after a cart clear.) E.g:
    http://millennium.itesm.mx:2082/search?//1,-1,-1,B/browse?clear_saves=1
    http://millennium.itesm.mx:2082/search*eng/X//1,1,1/?save=b1059162&save=b1059226&save=b1101976&save=b1102574&save=b1117982&save=b1118850&save=b1129933&save=b1182445&save=b1182472&save=b954459&save=b970051&save=b1054640&save=b938229&save=b1007356&save=b1128739&save=b1128741&save=b1137779&save=b940894&save=b1243942&save=b1054530&save=b942316&save=b1055460&save=b854816&save=b873916&save=b887257
    http://millennium.itesm.mx:2082/search*eng/?/++export/1,-1,-1/export
    

    adds all those items to the book cart EXCEPT the last one (b887257); if the cart is cleared and only that one is added:

    http://millennium.itesm.mx:2082/search?//1,-1,-1,B/browse?clear_saves=1
    http://millennium.itesm.mx:2082/search*eng/X//1,1,1/?save=b887257
    http://millennium.itesm.mx:2082/search*eng/?/++export/1,-1,-1/export
    

    it seems to work.

SO, this patch makes things better, but it still needs to fix the false positives... it shouldn't be happening at all in the example above because we're supposed to have run out of MARC-bookcart title matches before we hit those positives.

janusman’s picture

By the previous comment, I didn't make clear that I think that trying to fix false negatives is a lost battle--it might depend totally on the "mood" of the WebOPAC =) That or I'm just not looking well enough (probably!)

BTW I think I have the false positives fixed; will post an updated patch soon.

janusman’s picture

StatusFileSize
new9.91 KB

I think I have nailed it =)

This patch:
* Reduces cookie lifetime to 5 minutes (seemed to cause problems in some OPACs)

Index: contributions/modules/millennium/millennium.module
--- contributions/modules/millennium/millennium.module Base (1.13.2.33.2.2.2.80)
+++ contributions/modules/millennium/millennium.module Locally Modified (Based On 1.13.2.33.2.2.2.80)
@@ -10,6 +10,7 @@
   define('MILLENNIUM_PERFORMANCE_HISTORY_SIZE', 20);
   define('MILLENNIUM_RECORD_PREG_CHECKDIGIT', '/^[bi][0-9]{5,10}x?$/i');
   define('MILLENNIUM_RECORD_PREG_NOCHECKDIGIT', '/^[bi][0-9]{5,10}$/i');
+  define('MILLENNIUM_SESSION_COOKIE_LIFETIME', 300);
[...]
@@ -2793,7 +2794,7 @@
   $base_url = millennium_get_real_baseurl($url);
   $cid = "millennium_http_request_" . $base_url;
 
-  #drupal_set_message("millennium_http_request(): start. URL =" . l($url, $url));
+  #drupal_set_message("millennium_http_request(): start. URL =" . l($url, $url, array('attributes' => array('target' => 'webopac'))));
[...]
@@ -2848,7 +2849,7 @@
 
     if ($cookie_set) {
       $session_cookie = implode("; ", $all);
-      cache_set($cid, $session_cookie, 'cache', time() + 600);
+      cache_set($cid, $session_cookie, 'cache', time() + MILLENNIUM_SESSION_COOKIE_LIFETIME);
[...]

* Reduces items-at-a-time to add to bookcart to 20, total bookcart size is max 40 before emptied

Index: contributions/modules/millennium/millennium.import.inc
--- contributions/modules/millennium/millennium.import.inc Base (1.1.2.19)
+++ contributions/modules/millennium/millennium.import.inc Locally Modified (Based On 1.1.2.19)
@@ -92,12 +92,12 @@
  */
 function millennium_mass_fetch($recnums, $base_url = false) {
   // This seems to be the limit for WebOPACs, do not change =)
-  $max_records_for_bookcart = 50; 
+  $max_records_for_bookcart = 40;
@@ -179,8 +179,15 @@
   $found_items = array();
   $not_found_items = array();
 
-  // Issue N requests to add items to the bookcart, 25 records at a time (apparent maximum)
-  $chunks = array_chunk($recnums, 25);
+  // Clear the cart
+  $path = "/search?//1,-1,-1,B/browse?clear_saves=1";
+  $result = millennium_http_request("{$base_url}{$path}");
+  if ($result->code != 200) {
+    return array('found' => array(), 'not_found' => $recnums);
+  }
+
+  // Issue N requests to add items to the bookcart, 25 records at a time is apparent maximum
+  $chunks = array_chunk($recnums, 20);
[...]

* Eliminates false positives:

@@ -343,6 +354,16 @@
     }
   }
 
+  // If any candidate found items didn't match to any MARC, clear them off the
+  // $found_items array
+  $recnum = $found_items_indexes[$found_items_index];
+  while ($recnum) {
+    $not_found_items[$recnum] = $recnum;
+    unset($found_items[$recnum]);
+    $found_items_index++;
+    $recnum = $found_items_indexes[$found_items_index];
+  }
+
[...]

Still using mb_ereg_replace() for millennium_only_letters():

@@ -368,10 +394,54 @@
 
 function _millennium_only_letters($string) {
   // Remove MARC indicators
-  $new = preg_replace("/\|[a-z]/", "", $string);
+  $new = mb_ereg_replace("\|[a-z]", "", $string);
   // Remove things like "[videorecording]" in titles
-  $new = preg_replace("/\[[^\]]+\]/", "", $new);
+  $new = mb_ereg_replace("\[[^\]]+\]*", "", $new);
+  // Remove things like "{232}" in titles
+  $new = mb_ereg_replace("{[0-9]+}*", "", $new);
   // Leave only letters and numbers
-  $new = drupal_strtolower(preg_replace("/[^a-z0-9]/i", "", $new));
+  $new = drupal_strtolower(mb_ereg_replace("\W", "", $new));
+  $new = str_replace(array(' ', "\n", "\r", '"'), '', $new);
   return $new;
 }
janusman’s picture

StatusFileSize
new9.92 KB

This patch just makes a small change from the previous one, changing

if (strpos($tmp->data, "No Such Record")) {

to

if (!strpos($tmp->data, '<a id="recordnum"')) {

(since non-english OPACs would not show their translation of "No such record" and the test would fail).

janusman’s picture

Status: Needs review » Fixed
StatusFileSize
new10.52 KB

Enough testing! Committing this to stabilize things.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.