Integrate with other external data sources (WorldCat, LibraryThing, Open Library) [#358765]

Comment	File	Size	Author
#16	millennium_enrich-358765-16.patch	8.11 KB	janusman
#14	bibmash.tar_.gz	12.28 KB	tituomin
#13	bibmash.tar_.gz	12.26 KB	tituomin

Comment #1

janusman commented 14 January 2009 at 15:32

Also see code from:
http://drupal.org/project/bookpost
which is based on:
http://wordpress.org/extend/plugins/openbook-book-data/

Comment #2

janusman commented 14 January 2009 at 16:35

This article (http://arxiv.org/ftp/arxiv/papers/0805/0805.2855.pdf) mentions other "external datasets" (useful for geographic tags?):

· GeoNames (http://geonames.org) and the CIA World Fact Book
(http://www4.wiwiss.fu-berlin.de/factbook/) for geographic headings.
· the RDF BookMashup (http://www4.wiwiss.fu-berlin.de/bizer/bookmashup/) for links
to items that prompted a LCSH concept to be created.
· dbpedia (http://dbpedia.org)
Furthermore, there are additional vocabularies at the Library of Congress such as the Library
of Congress Classification, Name Authority File, and LCCN Permalink Service which could be
made available as RDF. The authors are also involved in the conversion of the RAMEAU, a
controlled vocabulary that is very similar

Also check out http://inkdroid.org/journal/2008/01/04/following-your-nose-to-the-web-of...

Comment #3

janusman commented 14 January 2009 at 17:16

Would this help?

guessing publisher from ISBN prefix
http://worldcat.org/devnet/blog/2009/01/guessing_publisher_from_isbn_p.html

Comment #4

janusman commented 14 January 2009 at 17:18

See this:

Open Library embeddable Book Reader
http://openlibrary.org/dev/docs/bookreader

And more:
http://wiki.code4lib.org/index.php/OSBW_Existing_Software

Comment #5

janusman commented 27 January 2009 at 16:04

Issue tags:

+RDF

Perhaps this is all just a subset of linked data? In that case, any useful information about the item (current prices, ratings at Amazon or elsewhere), the authors (biographies, pictures, etc.), the publisher (homepage, addresses?), people who own/read it, libraries that hold it, etc. would all be targets for this issue. =)

Comment #6

janusman commented 18 March 2009 at 18:40

Another data source: the Insight web service from Random House.

To get direct links to cover images, TOCs and sample pages, see: http://www.randomhouse.biz/webservices/insight/spec.php#G

To embed a widget with a book, see: http://www.randomhouse.biz/webservices/insight/widget/userguide

Comment #7

janusman commented 19 March 2009 at 15:51

For now I went ahead and committed to 6-DEV an embedded Google Books widget. Probably needs some work though.

Comment #8

janusman commented 30 November 2009 at 22:13

I think this should be more modular. I plan to have additional modules tie into the drupal_alter() hooks, and to add other hooks (e.g. to alter the biblio data table, the holdings table, etc.), so that those modules would be in charge of adding more information.

This would mean I would move the code that adds the Library of Congress information and Google Book Search link and widget into other modules.

This would also open up the possibility of these new modules not depending on millennium.module at all, since Google Books and others just need an ISBN or other identifier to work, and are not tied to Millennium. For instance, these modules could work using information from the Biblio module, a CCK field deemed to hold an identifier of some sort, or millennium.module's stored biblio array.

We could then have modules at different steps:
* Enrichment during import. Say your III record is missing the number of pages; it could be fetched from another source and added to the record. Or say you want to import the Table of Contents from LOC (like we do now).
* Enrichment during viewing (adding online fulltext viewers, adding links, etc.), kind of like WebBridge (or whatever it's called these days?) does on III.
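
A minimal sketch of that two-step split, written in Python for brevity (the module itself is PHP). Every name here (`ENRICHERS`, `enricher`, `add_page_count`, `add_google_books_link`, `enrich`) is hypothetical and not part of millennium.module, and the page count and link pattern are placeholders. The only point is that an enricher registers for a stage and needs nothing more than an identifier such as an ISBN.

```python
from collections import defaultdict

# Hypothetical registry: stage name -> list of enricher callables.
ENRICHERS = defaultdict(list)


def enricher(stage):
    """Register a function to run at the 'import' or 'view' stage."""
    def register(func):
        ENRICHERS[stage].append(func)
        return func
    return register


@enricher("import")
def add_page_count(record):
    # Stand-in for a lookup against an external source; the value is made up.
    if "pages" not in record and record.get("isbn"):
        record["pages"] = 123
    return record


@enricher("view")
def add_google_books_link(record):
    # Only needs an ISBN, so it does not care where the record came from.
    # The URL pattern is illustrative.
    if record.get("isbn"):
        record.setdefault("links", []).append(
            "http://books.google.com/books?vid=ISBN" + record["isbn"]
        )
    return record


def enrich(record, stage):
    """Run every enricher registered for the given stage, in order."""
    for func in ENRICHERS[stage]:
        record = func(record)
    return record
```

An import would call `enrich(record, "import")` before saving, and a node view would call `enrich(record, "view")` before rendering.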

How does this sound?

Comment #9

tituomin commented 1 December 2009 at 16:41

I'm in the middle of coding a new component with this same idea.. I can send a preliminary version later. I'm currently using this for cover images, but will be using it also for Wikipedia, Google books, Millennium availability data, etc. (Anything, really.)

Features (currently implemented):

AJAX-based retrieval of external data (won't block Drupal page loading)
Can retrieve data for many nodes per one AJAX request
Only uses Drupal database level bootstrap (if database access needed). This is to avoid performance issues because there might be quite a lot of AJAX requests coming in.
PHP object-oriented design, allows plugging in new external data sources and data types using inheritance.
Supports a flexible model to deal with different types of external data differently.

Features (coming up):

Two-level caching support (memcached and database cache) (external requests take a long time, trying to avoid delays)
Support for efficient "threaded" http retrieval using curl multi: http://www.php.net/manual/en/function.curl-multi-init.php
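
As a rough illustration of the two-level caching idea, here is a sketch in Python (not the PHP of the component itself). Both levels are plain dicts standing in for memcached and a database cache table; the class name `TwoLevelCache`, the `fetch` callable and the `ttl` parameter are assumptions made for the example.

```python
import time


class TwoLevelCache:
    """Fast in-memory level in front of a slower persistent level.

    Both levels are plain dicts here; in a real setup they would be
    memcached and a database cache table (assumption).
    """

    def __init__(self, fetch, ttl=3600, clock=time.time):
        self.fetch = fetch   # callable(key) -> value; the slow external request
        self.ttl = ttl       # seconds before an entry is considered stale
        self.clock = clock   # injectable so expiry can be tested
        self.fast = {}       # level 1: key -> (expires_at, value)
        self.slow = {}       # level 2: key -> (expires_at, value)

    def get(self, key):
        now = self.clock()
        for level in (self.fast, self.slow):
            entry = level.get(key)
            if entry and entry[0] > now:
                # A hit in the slow level is promoted into the fast one.
                self.fast[key] = entry
                return entry[1]
        # Miss (or stale) in both levels: do the external request once.
        value = self.fetch(key)
        entry = (now + self.ttl, value)
        self.fast[key] = entry
        self.slow[key] = entry
        return value
```

The external request only happens when both levels miss or have expired, which is the delay this kind of caching is meant to avoid.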

I'll try to give a preview as soon as possible.

Comment #10

janusman commented 1 December 2009 at 17:26

Awesome!! Please share when you have something/anything =)

Comment #11

janusman commented 2 December 2009 at 21:59

Just another note:

LibraryThing API:
http://www.librarything.com/services/librarything.ck.getwork.php

I'm thinking this is *most* useful; I think a module could ask for the API key, and manage the upper limit of calls per day to keep in line with the terms of service.
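
A sketch of how such a per-day cap could be tracked, in Python rather than PHP. The class `DailyCallBudget` and its parameters are hypothetical, and the limit is whatever the site admin configures; no actual LibraryThing quota is assumed here.

```python
import datetime


class DailyCallBudget:
    """Counts API calls and refuses new ones once the daily cap is reached."""

    def __init__(self, api_key, max_calls_per_day, today=datetime.date.today):
        self.api_key = api_key            # entered by the site admin
        self.max_calls = max_calls_per_day
        self.today = today                # injectable so the rollover can be tested
        self.day = today()
        self.used = 0

    def try_acquire(self):
        """Return True and count the call if budget remains, else False."""
        current = self.today()
        if current != self.day:
            # New day: reset the counter.
            self.day = current
            self.used = 0
        if self.used >= self.max_calls:
            return False
        self.used += 1
        return True
```

Code that talks to the API would call `try_acquire()` first and skip (or queue) the request when it returns False.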

Comment #12

janusman commented 3 December 2009 at 00:21

See latest commit for the beginnings of (very basic, humble, horribly unscalable) configurable enrichment options for the module =)

http://drupal.org/cvs?commit=297112

BTW, yes, I killed, like, 4 kittens with that commit.

Comment #13

tituomin commented 3 December 2009 at 10:56

Status	File	Size
new	bibmash.tar_.gz	12.26 KB

Re: #10. Here is a preliminary version of the server-side logic. The design is probably not as good as it could be yet; I'll refactor it when I add the caching logic and other optimizations.

By the way, currently you should install this in a directory "bibmash" right under the Drupal root. (There is probably a better place.)
There is no client-side JavaScript code yet, but you can see the JSON output in your browser if you go to http://site/bibmash/bibmash.php?id=*valid_millennium_integration_node_id*. LastFM cover images are really the only concrete feature here. (I'm making a music site.)

Here are my basic design choices:

This is *not* a Drupal module. I know there are some disadvantages to this, but I expect there to be many AJAX requests, and loading the whole Drupal environment isn't very wise in this case.
It does however include the Drupal database layer to fetch information from Millennium Integration.

About the files:

bibmash_config.php: the configuration file, where you can set up the loading of the datatypes and data sources you want
bibmash.php: The main file, which handles the ajax request and parameters
datasources: Directory which contains the subclasses for the different data sources
datatypes: The same for different data types

The class model is this:

BibRecord: A bibliographic record, representing a work or manifestation
MetaData: A type of metadata attached to a record. Currently including: CatalogData (bibliographic metadata), and CoverImage (URLs for cover images). A metadata object knows how to present itself.
DataSource: A source for data, be it external (WebDataSource -> Amazon, LastFM) or internal (for example MillenniumIntegration)
DataFront: Front-end, initializes the object model
DataBack: Back-end, fetches the data (the logic will include caching later)
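
To make the class model easier to picture, here is a Python paraphrase. It is not the code from the attached tarball (which is PHP). The class names follow the list above; the method names (`present`, `lookup`, `fill`, `respond`) and the `StaticCoverSource` stand-in are hypothetical.

```python
import json


class MetaData:
    """A type of metadata attached to a record; knows how to present itself."""
    def present(self):
        raise NotImplementedError


class CoverImage(MetaData):
    def __init__(self, urls):
        self.urls = list(urls)

    def present(self):
        return {"type": "cover_image", "urls": self.urls}


class BibRecord:
    """A bibliographic record (work or manifestation)."""
    def __init__(self, record_id):
        self.record_id = record_id
        self.metadata = []


class DataSource:
    """Internal or external source of metadata."""
    def lookup(self, record):
        raise NotImplementedError


class StaticCoverSource(DataSource):
    """Hypothetical stand-in for a web source such as LastFM or Amazon."""
    def __init__(self, covers):
        self.covers = covers

    def lookup(self, record):
        url = self.covers.get(record.record_id)
        return [CoverImage([url])] if url else []


class DataBack:
    """Back-end: asks every configured source for data about a record."""
    def __init__(self, sources):
        self.sources = sources

    def fill(self, record):
        for source in self.sources:
            record.metadata.extend(source.lookup(record))
        return record


class DataFront:
    """Front-end: builds the object model and renders the JSON response."""
    def __init__(self, back):
        self.back = back

    def respond(self, record_ids):
        out = {}
        for rid in record_ids:
            record = self.back.fill(BibRecord(rid))
            out[str(rid)] = [m.present() for m in record.metadata]
        return json.dumps(out, sort_keys=True)
```

A new data source or metadata type is then just another subclass; nothing in `DataFront` or `DataBack` needs to change.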

I hope all this isn't overkill. I will try to keep it simple, and I think the object oriented model fits quite well for this use because there is a clear need for different types of data / different sources, and inheritance keeps it extensible.

Any comments? =)

Comment #14

tituomin commented 3 December 2009 at 12:59

Status	File	Size
new	bibmash.tar_.gz	12.28 KB

Sorry, I had some trouble with my automatic insertion of the license notice at the top of each source file. So *here* is a working version.

Comment #15

janusman commented 4 December 2009 at 19:04

For examples for cover images, see:

http://dobby.darienlibrary.org/websvn/listing.php?repname=locum&path=%2F...

http://github.com/eby/sopac2-contrib/tree/master/covercache/

http://cheerfulcurmudgeon.com/2008/08/11/caching-free-librarything-book-...

Comment #16

janusman commented 5 December 2009 at 00:13

Status	File	Size
new	millennium_enrich-358765-16.patch	8.11 KB

This patch (committed) adds a cover image "finder" to the "enrichment" module.

It doesn't download the full images; instead it grabs the first portion of each file from a variety of services and uses that fragment to determine whether the image is usable (by checking the image width).
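
The general technique can be sketched like this, in Python rather than the PHP of the patch, and not as a copy of what the patch does. It reads the width out of the first bytes of a PNG or GIF header; JPEG needs a marker scan and is left out. The function names, the `fetch_fragment` callable and the `min_width` threshold are assumptions for the example.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def image_width(fragment):
    """Return the pixel width parsed from the first bytes of a PNG or GIF, else None."""
    # PNG: 8-byte signature, 4-byte chunk length, "IHDR", then a big-endian width.
    if (fragment[:8] == PNG_SIGNATURE and fragment[12:16] == b"IHDR"
            and len(fragment) >= 20):
        return struct.unpack(">I", fragment[16:20])[0]
    # GIF: 6-byte signature, then a little-endian 16-bit width.
    if fragment[:6] in (b"GIF87a", b"GIF89a") and len(fragment) >= 8:
        return struct.unpack("<H", fragment[6:8])[0]
    return None


def first_usable_cover(candidates, fetch_fragment, min_width=10):
    """Return the first URL whose header fragment reports a wide-enough image.

    fetch_fragment(url) is supplied by the caller and should return the
    first few bytes of the file (for example via an HTTP range request).
    """
    for url in candidates:
        width = image_width(fetch_fragment(url))
        if width is not None and width >= min_width:
            return url
    return None
```

The width check is what rejects tiny placeholder images without downloading whole files.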

Probably next would be to have a simple caching mechanism, and some way for the user to configure this.

Comment #17

janusman commented 28 January 2010 at 00:03

I am not happy with the current enrichment setup though (it's dog-slow during manual imports). I think it would be best to separate it out into another module that can hook into modules like Millennium, Biblio, Views and CCK, and that can either run independently of Millennium imports or hook in and use pipelined requests to speed things up.

Note: look at Feeds and whether it supports aggregating data by taking in "keys" to search for and storing the results in "any" MySQL storage. (Guessing it *might*.)

Comment #18

tituomin commented 9 February 2010 at 12:51

My development has been stalled for a while, but I think it might be a good idea to move the fetching of external data to a separate AJAX-based query. The data gathered this way will accumulate over time and could also be persisted/cached server-side. So there wouldn't be any centralized batch-like process to fetch the enrichment data, but it would be fetched lazily when needed by a user.

Comment #19

janusman commented 12 February 2010 at 14:37

I've committed the beginnings of a new approach to fetch metadata, changing millennium_enrichment.module. See this commit: http://drupal.org/cvs?commit=324336

From there it might be simple (or simpler?) to make the new metadata classes fetch/show data via AJAX.

The rundown of that commit is this:
* Extra metadata, if stored locally, lives in its own table
* Harvesting can be done separate from millennium importing. (With the commit it can *only* be done separately for now)
* Millennium.module's hooks are implemented in Enrichment module so that the biblio array can be changed by Enrichment's metadata classes (say, you want to add links to OpenLibrary as if they existed in the MARC in the first place)

The enrichment/metadata classes still need some love, but I think it's a good start to have something extensible. I have been wondering whether it would be preferable for this to live as a separate module instead of inside Millennium, but for now I don't have the time to imagine all possible use cases and make it generic enough to be a standalone API module, or something that integrates with Feeds... I just want to ship with something useful =) Input welcome!

Comment #20

tituomin commented 15 February 2010 at 16:38

I took a look at Feeds; it certainly seems possible that we could reuse their existing infrastructure. We could implement a new Processor and new Sources.

About AJAX: only live real-time data such as availability info seems to absolutely require it. But the benefit from AJAX-fetching (by which I mean fetching external metadata for a node when the user is looking at the node) is that you don't have to mass fetch *a lot* of data if there are many records. So it's more agile in a way. You can add new data sources and get them running almost instantly. At least my site will have a huge amount of records. (Plus, you won't be bombarding the external data sources with too many requests..)

Of course, it would be best if you could have both: the ability to mass-fetch and to fetch-when-needed. Don't know how to achieve that easily.
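
One possible shape for having both, sketched in Python under assumptions: a single store where viewing a node fetches lazily, and a batch job fills in missing entries a few at a time through the same code path. The names `EnrichmentStore`, `get`, `prefetch` and `budget` are hypothetical, and the dict stands in for server-side persistence.

```python
class EnrichmentStore:
    """One store serving both fetch-when-needed and limited mass-fetching."""

    def __init__(self, fetch):
        self.fetch = fetch   # callable(node_id) -> metadata; the external request
        self.data = {}       # stands in for server-side persistence

    def get(self, node_id):
        """Fetch-when-needed: called while a user is viewing the node."""
        if node_id not in self.data:
            self.data[node_id] = self.fetch(node_id)
        return self.data[node_id]

    def prefetch(self, node_ids, budget):
        """Mass-fetch: fill at most `budget` missing entries per run."""
        fetched = 0
        for node_id in node_ids:
            if fetched >= budget:
                break
            if node_id not in self.data:
                self.get(node_id)
                fetched += 1
        return fetched
```

Because `prefetch` is capped per run, it could be called periodically without bombarding the external sources, while `get` keeps new data sources usable almost instantly.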

Comment #21

janusman commented 16 February 2010 at 00:10

I've been looking at Feeds and had a quick chat with alexb (the author). I think Feeds *could* support metadata harvesting, but might not handle anything that would allow aggregating data onto a node (e.g. adding the metadata for indexing or viewing).

So, I'm tempted to still follow the current path (release Millennium Integration 6.x-2.0) with some basic metadata harvest/display functionality like what I mentioned in #19 and is already committed... and maybe for a later release Millennium could be more of a discrete API to just communicate with Millennium and leave the node creation and metadata aggregation to Feeds, views, etc. IF those modules can scale for massive fetches like we're doing now (using the Bookcart, for instance).

Still, I have not had time recently to ponder things much =)

Comment #22

janusman commented 6 April 2010 at 18:04

Component:

Miscellaneous

» Enrichment

This has been committed and is a good start IMO. Just needs documentation and maybe a little more testing.

Comment #23

janusman commented 9 November 2010 at 22:00

Thinking of linked data (see http://dilettantes.code4lib.org/blog/2010/11/linked-marc-codes/): we can probably just theme parts of the record to "speak" RDFa.

Comment #24

Stomper commented 2 January 2013 at 14:08

How about ISBNDB, or maybe Bookfinder? ISBNDB has a relatively open API:

http://isbndb.com/data-intro.html
http://isbndb.com/docs/api/

I have a unique business case, and I am not sure whether this module supports it (for D7):

a) query a third-party book database (say ISBNDB) using the Drupal UI
b) display the search results using Drupal (on query, the site does not redirect to the third party's website)
c) once the user finds a matching book, the bibliographic data is used to create and populate a pre-formatted node (Rules and Feeds?)

For the Rules module: if a book matches, the book metadata is fed in and a node is created; the user is then redirected to the published node and can continue on.

All of this should take place within my Drupal site; at no time should the user be redirected externally. Is this possible?
