Think about using 3rd party code for fetching/parsing records [#401364]

For a while I have been thinking about how some of my code actually replicates functionality available in other GPL'd projects like Shrew, SOPAC (the Locum part), Jangle, code from AADL and others like Scriblio, OhioLINK's libcat module for Drupal (non-GPL), etc.

The best option would probably be to use either Locum or Shrew; Locum will probably get the most use, as several libraries are implementing SOPAC.

Comments

Comment #1

anarchivist commented 8 April 2009 at 15:29

Alejandro, I'd vote for using Shrew instead of Locum. Locum relies on Sphinx for the back-end indexing and storage of the record data, which is a fundamental design difference between SOPAC and the Millennium integration module. Locum also makes some fairly strong presumptions about how you actually want your data indexed, while you have a lot more freedom using millennium.module. In addition, Shrew doesn't require XRECORDs to be enabled, and instead just requires MARC record view to be enabled.

As I mentioned in #code4lib, I've been writing a module to integrate Shrew with Drupal, which relies on the File_MARC PEAR library. The module has a submodule which provides a CCK field to reference an individual record in Webpac, which you can then use to construct links back to the catalog record or use something like CCK Computed Field to grab individual field values from the record to store in the Drupal database instead of harvesting the full record.

Ideally I'd like to integrate my module with yours; however, the goal of my module is not to harvest full records into individual Drupal nodes, but rather to relate Drupal nodes with individual catalog records.

Also, out of curiosity, why are you harvesting the information from the item records?

Comment #2

anarchivist commented 8 April 2009 at 15:50

Also, FWIW, OhioLINK's libcat module and an earlier version of Scriblio's III importer used Shrew as the mechanism to interact with Webpac.

Comment #3

janusman commented 13 April 2009 at 22:50

@anarchivist, thanks for the input. I'm still trying to piece together the lineage for these projects; good to know Shrew is the basic DNA for several.

I am crawling by item record # because it is then easy to grab their individual info:

* Import or not depending on item location
* Grab TOTCIRC, LYRCIRC and other individual info
* Easy to know which are the new items... those with the highest item record #
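
As a rough illustration of the three points above, here is a minimal Python sketch. The `ItemRecord` fields, the location codes and the function names are all hypothetical and are not taken from the module; the sketch only shows filtering by location and spotting new items by record number.

```python
from dataclasses import dataclass


@dataclass
class ItemRecord:
    """Hypothetical shape of one crawled item record."""
    number: int      # numeric part of the item record #
    location: str    # item location code
    totcirc: int     # TOTCIRC value
    lyrcirc: int     # LYRCIRC value


# Assumed set of location codes that should be imported.
IMPORT_LOCATIONS = {"main", "branch1"}


def should_import(item: ItemRecord) -> bool:
    """Import or not depending on item location."""
    return item.location in IMPORT_LOCATIONS


def new_items(items, last_seen: int):
    """Return items whose record # is above the highest one already crawled."""
    return [item for item in items if item.number > last_seen]
```

A crawler built this way would remember the highest record number it has seen and pass it as `last_seen` on the next run.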

Of course, the flip side is having to go over LOTS of items for the same bib record (e.g. in periodicals). But hey, the software is doing the dirty work =) I'm having much better response times now that I have pipelined fetching in place (read #251276: Periodic checking of imported records for more on that)... but perhaps not good enough vs. doing only bib records.

Comment #4

tituomin commented 29 April 2009 at 10:56

I like the basic decision to just import into Drupal nodes. It allows me to reuse third party Drupal modules and the basic Drupal functionality even for bibliographic nodes. For example, I can use Apache Solr for faceted search. This is great for quick, small demos and testing. BTW, another, bigger project also using Drupal is the Extensible Catalog project: http://www.extensiblecatalog.org/ . I was also wondering about the item records, thanks! =)

Comment #5

anarchivist commented 29 April 2009 at 19:18

@tituomin, yeah, that's true, but there are plenty of cases where *not* importing bibliographic data as nodes is preferable. In our case we have an OPAC with over 7 million bibliographic records and quite possibly 3 or 4 times that many item records. I'd like to see a general API available for relating, importing, etc. content from a Millennium instance, which means possibly making millennium.module an umbrella module for a number of modules (millennium_import, shrew, iii_record_field, etc.).

Comment #6

tituomin commented 1 June 2009 at 11:16

@anarchivist: In the larger picture I totally agree, it's always a bad idea to make simple copies instead of creating well-thought-out interfaces. However, our current OPAC doesn't support real interfaces at all, so to get concrete prototypes done, the only option currently for us is simple importing.

I'm curious, are there real benefits from using Shrew with Millennium, which only exposes records through their record number? For example, to index the data into Apache Solr, one would need all the bibliographic metadata anyway in a quickly accessible form.

Comment #7

anarchivist commented 1 June 2009 at 14:45

@tituomin - I see your point and I don't necessarily disagree with you, but I'm talking about different use cases. The modules I've written allow for the following:

* referencing/generating a link from a given record number
* programmatically retrieving either an entire record or a subset of the fields from that record
* using something like computed field or a custom module, saving individual fields to the database (as opposed to the entire record)

For instance, in our case we want to retrieve only certain fields for specific records, not every record in our catalog.
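
To make the use cases above concrete, here is a minimal Python sketch; it is not the code of the modules mentioned. The `/record=` URL path, the function names and the dict-based record shape are assumptions made for illustration only.

```python
def record_url(base_url: str, record_number: str) -> str:
    """Build a link back to the catalog for a record number such as 'b1234567'.

    The '/record=' path is an assumption about the catalog's URL scheme.
    """
    return base_url.rstrip("/") + "/record=" + record_number


def pick_fields(record: dict, wanted: list) -> dict:
    """Keep only the wanted MARC tags from an already parsed record.

    `record` is assumed to map a tag (e.g. '245') to a list of field values.
    Tags that are absent from the record are simply skipped.
    """
    return {tag: record[tag] for tag in wanted if tag in record}
```

With this shape, only the output of `pick_fields` would need to be saved to the database, rather than the entire record.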

Comment #8

janusman commented 15 October 2009 at 20:07

Priority: Normal » Minor

Now that the new crawler-using-bookcart is committed (with its corresponding 3-10x increase in throughput), I think this issue might not be relevant for a while, since I doubt other libraries are doing anything similar.

However, maybe reusing other code for batch loading might be better. I'm happy with what I have now, though =)

Comment #9

anarchivist commented 15 October 2009 at 23:18

@janusman - maybe making a pluggable crawler system for batchloads might work? of course, that's a larger scale project, but it might be a good idea.

Comment #10

janusman commented 15 October 2009 at 23:32

@anarchivist: See #464068: Batch API integration for a first (ugly) stab at Batch API loads.

Comment #11

janusman commented 24 October 2009 at 03:04

Status: Active » Postponed

Postponing; I've had a look at Shrew and @anarchivist's code; they do a great job of sticking to MARCXML standards and reusing existing code... but I think:

* Using the bookcart is the way to go, and they don't incorporate anything like that
* I don't really need a standard; I'm just parsing the "standard" MARC display, which is what Shrew et al. do =) Anyone who wants to generate MARCXML from my code is free to do so =)

However, I would perhaps want to look at _how_ other code simplifies MARC into Genre/Item type, how 6xx fields are parsed into "this is genre, that is a person, this is time", etc. to improve the code.
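
One possible way to do the 6xx split is sketched below in Python. It is not how Shrew or any of the projects above do it. The tag and subfield meanings follow MARC 21 (600 personal name, 648 chronological term, 650 topical term, 651 geographic name, 655 genre/form; subdivisions $v, $x, $y, $z), while the category names, the function name and the input shape are assumptions. Treating the $v form subdivision as "genre" is a deliberate simplification.

```python
# MARC 21 subject-heading tags mapped to a simplified category.
TAG_KIND = {
    "600": "person",
    "610": "organization",
    "611": "meeting",
    "630": "title",
    "648": "time",
    "650": "topic",
    "651": "place",
    "655": "genre",
}

# Subdivision subfield codes shared by the 6xx fields.
SUBFIELD_KIND = {"v": "genre", "x": "topic", "y": "time", "z": "place"}


def classify_subject(tag, subfields):
    """Split one 6xx field into (kind, value) pairs.

    `subfields` is assumed to be a list of (code, value) tuples in field order.
    Subfield $a takes its kind from the tag; $v/$x/$y/$z take theirs from the
    subdivision code. Other subfields are ignored in this sketch.
    """
    facets = []
    main_kind = TAG_KIND.get(tag)
    for code, value in subfields:
        if code == "a" and main_kind:
            facets.append((main_kind, value.strip()))
        elif code in SUBFIELD_KIND:
            facets.append((SUBFIELD_KIND[code], value.strip()))
    return facets
```

For example, a 650 field with $a "Libraries", $z "Mexico" and $y "20th century" would yield a topic, a place and a time facet, in that order.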
