Integrate with other external data sources (WorldCat, LibraryThing, Open Library)

janusman - January 14, 2009 - 15:26
Project:Millennium Integration
Version:6.x-2.x-dev
Component:Miscellaneous
Category:feature request
Priority:normal
Assigned:Unassigned
Status:active
Issue tags:RDF
Description

Amazon should probably be ruled out because of their API licensing (uses should "mainly" aim to redirect users to Amazon).

As the author of this article mentions (http://journal.code4lib.org/articles/105), Open Biblio is probably the most open and the least worrisome legally; I don't know about its coverage, growth, dependability, etc.

#1

janusman - January 14, 2009 - 15:32

#2

janusman - January 14, 2009 - 16:35

This article http://arxiv.org/ftp/arxiv/papers/0805/0805.2855.pdf mentions other "external datasets" (useful for Geographic tags?)

· GeoNames (http://geonames.org) and the CIA World Fact Book
(http://www4.wiwiss.fu-berlin.de/factbook/) for geographic headings.
· the RDF BookMashup (http://www4.wiwiss.fu-berlin.de/bizer/bookmashup/) for links
to items that prompted a LCSH concept to be created.
· dbpedia (http://dbpedia.org)
Furthermore, there are additional vocabularies at the Library of Congress such as the Library
of Congress Classification, Name Authority File, and LCCN Permalink Service which could be
made available as RDF. The authors are also involved in the conversion of the RAMEAU, a
controlled vocabulary that is very similar

Also check out http://inkdroid.org/journal/2008/01/04/following-your-nose-to-the-web-of...

#3

janusman - January 14, 2009 - 17:16

Would this help?

guessing publisher from ISBN prefix
http://worldcat.org/devnet/blog/2009/01/guessing_publisher_from_isbn_p.html
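One way to read the idea above is as a longest-prefix lookup on the normalized ISBN. The sketch below is a minimal illustration only: the prefix table and the publisher names in it are placeholders invented for the example, not real registrant data, and the function names are hypothetical rather than taken from the linked post.

```python
# Placeholder table: these prefixes and names are invented for illustration.
PREFIX_TO_PUBLISHER = {
    "97800000": "Example Press",
    "978000": "Example Group",
}


def normalize_isbn(raw):
    """Keep only digits and a possible X check character."""
    return "".join(ch for ch in raw if ch.isdigit() or ch in "xX").upper()


def guess_publisher(isbn, table=PREFIX_TO_PUBLISHER):
    """Return the publisher for the longest matching prefix, or None."""
    digits = normalize_isbn(isbn)
    # Try the longest candidate prefix first so specific entries win.
    for length in range(len(digits), 0, -1):
        name = table.get(digits[:length])
        if name is not None:
            return name
    return None
```

A real table would have to come from an authoritative list of registrant ranges; the lookup itself does not depend on where the table comes from.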

#4

janusman - January 14, 2009 - 17:18

#5

janusman - January 27, 2009 - 16:04
Issue tags:-open library+RDF

Perhaps this is all just a subset of linked data? In that case, any useful information about the item (current prices, ratings at amazon or others), the authors (biographies, pictures, etc.), the publisher (homepage, addresses?), people who own/read it, libraries that hold it, etc... would all be targets for this issue. =)

#6

janusman - March 18, 2009 - 18:40

Another data source: the Insight web service from Random House.

To get direct links to cover images, TOCs and sample pages, see: http://www.randomhouse.biz/webservices/insight/spec.php#G

To embed a widget with a book, see: http://www.randomhouse.biz/webservices/insight/widget/userguide

#7

janusman - March 19, 2009 - 15:51

For now I went ahead and committed to 6-DEV an embedded Google Books widget. Probably needs some work though.

#8

janusman - November 30, 2009 - 22:13

I think that this should be something more modular; I plan on having additional modules that tie into the drupal_alter() hooks, add other hooks (e.g. alter the biblio data table, alter the holdings table, etc) so those other modules would be in charge of adding more information.

This would mean I would move the code that adds the Library of Congress information and Google Book Search link and widget into other modules.

This would also open up the possibility of these new modules not depending on millennium.module at all, since Google Books and others just need an ISBN or similar information to work and are not tied to Millennium. For instance, these modules could work with information from the Biblio module, a CCK field deemed to hold an identifier of some sort, or millennium.module's stored biblio array.

We could then have modules at different steps:
* Enrichment during import. Say your III record is missing the number of pages; it could be fetched from another source and added to the record. Or say you want to import the Table of Contents from LOC (like we do now).
* Enrichment during viewing (adding online fulltext viewers, adding links, etc.), kind of like WebBridge (or whatever it's called these days?) does on III.
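The two steps above could be sketched as a small registry in which independent enrichers sign up for a phase and only need a record with an identifier. This is a language-neutral illustration in Python, not Drupal hook code; the phase names, the `register`/`enrich` helpers and the `lookup_pages` stand-in are all assumptions made for the example.

```python
import copy
from collections import defaultdict

# Hypothetical phase names for the two steps described above.
IMPORT, VIEW = "import", "view"

_enrichers = defaultdict(list)


def register(phase):
    """Decorator that adds a function to the list of enrichers for a phase."""
    def decorator(func):
        _enrichers[phase].append(func)
        return func
    return decorator


def enrich(phase, record):
    """Run every enricher for the phase on a copy of the record."""
    result = copy.deepcopy(record)
    for func in _enrichers[phase]:
        func(result)
    return result


def lookup_pages(isbn):
    # Stand-in for a call to some other data source; the data is made up.
    return {"0000000000": 321}.get(isbn)


@register(IMPORT)
def fill_missing_pages(record):
    """Import-time enrichment: fill in a page count the record lacks."""
    if "pages" not in record and record.get("isbn"):
        pages = lookup_pages(record["isbn"])
        if pages is not None:
            record["pages"] = pages


@register(VIEW)
def add_viewer_link(record):
    """View-time enrichment: attach a link built from the identifier."""
    if record.get("isbn"):
        record.setdefault("links", []).append("viewer:" + record["isbn"])
```

Because each enricher only looks at keys such as `isbn`, the record could come from any source that can supply an identifier, which is the decoupling the comment argues for.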

How does this sound?

#9

tituomin - December 1, 2009 - 16:41

I'm in the middle of coding a new component with this same idea. I can send a preliminary version later. I'm currently using this for cover images, but will also be using it for Wikipedia, Google Books, Millennium availability data, etc. (Anything, really.)

Features (currently implemented):

  1. AJAX-based retrieval of external data (won't block Drupal page loading)
  2. Can retrieve data for many nodes per one AJAX request
  3. Only uses Drupal database level bootstrap (if database access needed). This is to avoid performance issues because there might be quite a lot of AJAX requests coming in.
  4. PHP object-oriented design, allows plugging in new external data sources and data types using inheritance.
    Supports a flexible model to deal with different types of external data differently.
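Feature 2 (data for many nodes per request) could look roughly like the sketch below: the client sends one request listing several node ids and gets back a single JSON object keyed by id. The `ids` parameter name, the cap of 50 and the `fetch_for_node` callback are assumptions for illustration, not part of the component described here.

```python
import json
from urllib.parse import parse_qs


def parse_node_ids(query_string, limit=50):
    """Extract unique numeric node ids from a query such as 'ids=3,7,12'."""
    raw = parse_qs(query_string).get("ids", [""])[0]
    ids = []
    for part in raw.split(","):
        part = part.strip()
        # Ignore anything that is not a plain ASCII number.
        if part.isascii() and part.isdigit() and int(part) not in ids:
            ids.append(int(part))
    return ids[:limit]


def handle_request(query_string, fetch_for_node):
    """Build one JSON response covering every requested node."""
    payload = {str(nid): fetch_for_node(nid)
               for nid in parse_node_ids(query_string)}
    return json.dumps(payload, sort_keys=True)
```

Capping the number of ids keeps one request from triggering an unbounded number of external lookups.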

Features (coming up):

  1. Two-level caching support (memcached and database cache) (external requests take a long time, trying to avoid delays)
  2. Support for efficient "threaded" http retrieval using curl multi: http://www.php.net/manual/en/function.curl-multi-init.php
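The two-level caching idea can be sketched as a fast store checked first, a slower persistent store checked second, and the external fetch only as a last resort. In this minimal sketch plain dictionaries stand in for memcached and the database cache, and the class name, TTL and `fetch` callback are illustrative assumptions.

```python
import time


class TwoLevelCache:
    """Check a fast store, then a slow store, then fall back to fetching."""

    def __init__(self, fast, slow, ttl=3600, clock=time.time):
        self.fast = fast    # stand-in for memcached
        self.slow = slow    # stand-in for a database cache table
        self.ttl = ttl
        self.clock = clock

    def get(self, key, fetch):
        now = self.clock()
        for level in (self.fast, self.slow):
            entry = level.get(key)
            if entry is not None and entry[0] > now:
                # Promote hits from the slow level so the next read is fast.
                self.fast[key] = entry
                return entry[1]
        # Miss or expired on both levels: do the slow external request.
        value = fetch(key)
        entry = (now + self.ttl, value)
        self.fast[key] = entry
        self.slow[key] = entry
        return value
```

Each entry stores its expiry time next to the value, so both levels expire together regardless of which one served the hit.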

I'll try to give a preview as soon as possible.

#10

janusman - December 1, 2009 - 17:26

Awesome!! Please share when you have something/anything =)

#11

janusman - December 2, 2009 - 21:59

Just another note:

LibraryThing API:
http://www.librarything.com/services/librarything.ck.getwork.php

I'm thinking this is *most* useful; I think a module could ask for the API key, and manage the upper limit of calls per day to keep in line with the terms of service.
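Managing an upper limit of calls per day could be as simple as a counter that resets when the date changes. The sketch below is only an illustration: the class name is hypothetical, and the actual limit would have to be taken from the service's terms, which are not reproduced here.

```python
import datetime


class DailyQuota:
    """Allow at most `limit` calls per calendar day."""

    def __init__(self, limit, today=datetime.date.today):
        self.limit = limit      # configured by the site admin (assumption)
        self._today = today     # injectable so the reset can be tested
        self._day = None
        self._used = 0

    def try_acquire(self):
        """Return True and count the call if quota remains, else False."""
        day = self._today()
        if day != self._day:
            # A new day has started: reset the counter.
            self._day, self._used = day, 0
        if self._used >= self.limit:
            return False
        self._used += 1
        return True
```

A caller would check `try_acquire()` before each API request and skip or defer the enrichment when it returns False. In a real module the counter would need to be persisted between requests.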

#12

janusman - December 3, 2009 - 00:21

See latest commit for the beginnings of (very basic, humble, horribly unscalable) configurable enrichment options for the module =)

http://drupal.org/cvs?commit=297112

BTW, yes, I killed, like, 4 kittens with that commit.

#13

tituomin - December 3, 2009 - 10:56

Re: #10. Here is a preliminary version of the server-side logic. The design is probably not as good as it could be yet; I'll refactor it when I add the caching logic and other optimizations.

By the way, currently you should install this in a directory "bibmash" right under the Drupal root. (There is probably a better place.)
There is no client-side JavaScript code yet, but you can see the JSON output in your browser if you go to http://site/bibmash/bibmash.php?id=*valid_millennium_integration_node_id*. LastFM cover images are really the only concrete feature here. (I'm making a music site.)

Here are my basic design choices:

  • This is *not* a Drupal module. I know there are some disadvantages to this, but I expect there to be many
    ajax requests, and loading the whole Drupal environment isn't very wise in this case.
  • It does however include the Drupal database layer to fetch information from Millennium Integration.

About the files:

bibmash_config.php
the configuration file, where you can set up the loading of the datatypes and data sources you want
bibmash.php
The main file, which handles the ajax request and parameters
datasources
Directory which contains the subclasses for the different data sources
datatypes
The same for different data types

The class model is this:

BibRecord
A bibliographic record, representing a work or manifestation
MetaData
A type of metadata attached to a record. Currently including: CatalogData (bibliographic metadata), and CoverImage
(urls for coverimages). A metadata object knows how to present itself.
DataSource
A source for data, be it external (WebDataSource -> Amazon, LastFM) or internal (for example MillenniumIntegration)
DataFront
Front-end, initializes the object model
DataBack
Back-end, fetches the data (the logic will include caching later)
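The class model above might be outlined as follows. This is a rough Python sketch of the relationships only, not the attached PHP code: the method names (`present`, `fetch`), the `StaticCoverSource` stand-in and the `collect` helper are assumptions, and DataFront/DataBack are collapsed into one function for brevity.

```python
from abc import ABC, abstractmethod


class BibRecord:
    """A bibliographic record with identifiers and attached metadata."""

    def __init__(self, record_id, identifiers=None):
        self.record_id = record_id
        self.identifiers = dict(identifiers or {})
        self.metadata = []


class MetaData(ABC):
    """A type of metadata that knows how to present itself."""

    @abstractmethod
    def present(self):
        """Return a JSON-serializable representation."""


class CoverImage(MetaData):
    def __init__(self, urls):
        self.urls = list(urls)

    def present(self):
        return {"type": "cover_image", "urls": self.urls}


class DataSource(ABC):
    """A source of metadata, external or internal."""

    @abstractmethod
    def fetch(self, record):
        """Return a list of MetaData objects for the record."""


class StaticCoverSource(DataSource):
    """Hypothetical stand-in for a web data source, backed by a dict."""

    def __init__(self, covers):
        self.covers = covers

    def fetch(self, record):
        url = self.covers.get(record.identifiers.get("isbn"))
        return [CoverImage([url])] if url else []


def collect(record, sources):
    """Ask every source for metadata and return the presented results."""
    for source in sources:
        record.metadata.extend(source.fetch(record))
    return [item.present() for item in record.metadata]
```

Adding a new source or data type then means adding one subclass, which is the extensibility argument made for the design.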

I hope all this isn't overkill. I will try to keep it simple, and I think the object oriented model fits quite well for this use because there is a clear need for different types of data / different sources, and inheritance keeps it extensible.

Any comments? =)

AttachmentSize
bibmash.tar_.gz 12.26 KB

#14

tituomin - December 3, 2009 - 12:59

Sorry, some trouble with my automatic insertion of license notice on top of each source file. So *here* is a working version.

AttachmentSize
bibmash.tar_.gz 12.28 KB

#16

janusman - December 5, 2009 - 00:13

This patch (committed) adds a cover image "finder" to the "enrichment" module.

It doesn't download the images; instead, it finds them by grabbing a portion of each file from a variety of services and using that fragment to determine whether the image is usable (by checking the image width).
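Checking the width from a fragment works because common image formats store their dimensions in the first few bytes. The sketch below is an assumption about how such a check could be done, not the code in the patch: it handles only PNG and GIF headers, and the `min_width` threshold is an invented default meant to reject tiny placeholder images.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def image_width(fragment):
    """Return the pixel width read from a PNG or GIF header, else None."""
    # PNG: 8-byte signature, 4-byte chunk length, b"IHDR", then width.
    if (fragment[:8] == PNG_SIGNATURE and len(fragment) >= 24
            and fragment[12:16] == b"IHDR"):
        return struct.unpack(">I", fragment[16:20])[0]
    # GIF: 6-byte header followed by a little-endian 16-bit width.
    if fragment[:6] in (b"GIF87a", b"GIF89a") and len(fragment) >= 10:
        return struct.unpack("<H", fragment[6:8])[0]
    return None


def is_usable_cover(fragment, min_width=20):
    """Treat unreadable or very narrow images as unusable."""
    width = image_width(fragment)
    return width is not None and width >= min_width
```

JPEG stores its dimensions further into the file and needs marker parsing, so it is left out of this sketch.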

Probably next would be to have a simple caching mechanism, and some way for the user to configure this.

AttachmentSize
millennium_enrich-358765-16.patch 8.11 KB
 
 
