OAI Harvester module - a developer's guide

Last modified: June 23, 2009 - 14:40

OAI Harvester module

Introduction

The OAI Harvester module collects metadata records from OAI-PMH data providers using version 2.0 of the OAI-PMH protocol. For more about the protocol, see http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm. The harvester only harvests: it does not store the records, because there are many possible ways to store such records in Drupal. Instead, it invokes hooks, through which it passes the data to the modules that implement them. The XC Drupal Toolkit contains one module that implements these hooks (the xc oai harvester bridge module), and it can serve as an example for writing your own module. The OAI Harvester module itself is independent of the other XC modules.

The OAI Harvester has two parts:
- the database structure and user interface, which help to harvest data
- the hooks, which help to store or index data coming from a repository

The concepts of the OAI harvester

Repository

A repository is an OAI-PMH data provider. It supports one or more XML formats (each identified by a metadata prefix), and may support one or more sets.

Schedule

A schedule is the process of harvesting one repository. The protocol supports only one format and one set per request, so harvesting multiple sets or formats must be split into multiple single processes.

A single process

A single process gets all records matching one set of initial parameters (one format, one set, and the from and until parameters). It first requests an initial URL built from the initial parameters. If there are more records than the number allowed in one OAI response, the repository sends a resumptionToken, with which we can continue the harvesting. The resumptionToken is a kind of session identifier, so we do not need to resend the initial parameters, only the current resumptionToken. The single process ends when the response contains no resumptionToken.
From the requester's perspective, the process is as follows:
a) issue an initial URL
b) check whether there is a resumptionToken in the response
c) if there is, issue a resumptionToken URL and go to b)
d) if there is not, finish
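
The steps a) to d) can be sketched as a loop. This is only an illustration, not the module's actual code: the function names are hypothetical, and $fetch is an assumed callable that takes a URL and returns the response body as a string. The sketch also treats an empty resumptionToken element as the end of the list, which is how OAI-PMH 2.0 marks the last response of a multi-part list.

<?php
/**
 * Returns the resumptionToken found in an OAI-PMH response, or NULL.
 * Hypothetical helper, not part of the OAI Harvester module.
 */
function example_extract_resumption_token($xml) {
  $doc = new DOMDocument();
  if (!$doc->loadXML($xml)) {
    return NULL;
  }
  $nodes = $doc->getElementsByTagName('resumptionToken');
  if ($nodes->length == 0) {
    return NULL;
  }
  $token = trim($nodes->item(0)->textContent);
  // An empty element means there is nothing left to request.
  return $token === '' ? NULL : $token;
}

/**
 * Runs one single process and returns every response body it received.
 * $fetch is an assumed callable: it takes a URL and returns a string.
 */
function example_run_single_process($fetch, $base_url, $initial_url) {
  $responses = array();
  $url = $initial_url;                                  // step a)
  while ($url !== NULL) {
    $xml = call_user_func($fetch, $url);
    $responses[] = $xml;
    $token = example_extract_resumption_token($xml);    // step b)
    if ($token === NULL) {
      $url = NULL;                                      // step d)
    }
    else {
      // step c): only the verb and the token are sent, no initial parameters.
      $url = $base_url . '?verb=ListRecords&resumptionToken=' . urlencode($token);
    }
  }
  return $responses;
}
?>

Passing the fetcher in as a parameter keeps the loop independent of any particular HTTP client, so it can be exercised with canned responses.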

An initial URL

An initial URL contains the base URL of the repository and one format, and may contain one set parameter. It may also contain from and until parameters to select a given date range. We use these when we harvest the same repository with the same parameters a second or later time, so that we harvest only the records changed since the previous harvest. If the repository supports deleted records, it provides information about the deletions as well.
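
As an illustration, an initial URL could be assembled as follows. The function name and the example.org base URL are hypothetical; the query arguments (verb, metadataPrefix, set, from, until) are those defined by OAI-PMH 2.0 for the ListRecords verb.

<?php
/**
 * Builds an initial ListRecords URL. Hypothetical helper for illustration.
 */
function example_build_initial_url($base_url, $metadata_prefix, $set = NULL, $from = NULL, $until = NULL) {
  // The verb and exactly one format are always present.
  $params = array(
    'verb' => 'ListRecords',
    'metadataPrefix' => $metadata_prefix,
  );
  // The set and the date range are optional.
  if ($set !== NULL) {
    $params['set'] = $set;
  }
  if ($from !== NULL) {
    $params['from'] = $from;
  }
  if ($until !== NULL) {
    $params['until'] = $until;
  }
  return $base_url . '?' . http_build_query($params, '', '&');
}

// First, full harvest of one format:
$full = example_build_initial_url('http://example.org/oai', 'oai_dc');
// Later, incremental harvest of one set, starting from a given date:
$incremental = example_build_initial_url('http://example.org/oai', 'oai_dc', 'math', '2009-06-01');
?>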

A resumptionToken URL

The resumptionToken URL is the type of URL we request from the repository from the second request onward. The resumptionToken is a kind of session identifier, so we do not need to resend the initial parameters. It identifies the next batch of records in the sequence.

Processing a single request

Processing a single OAI-PMH request means issuing an initial or a resumptionToken URL and then processing its response.

OAI harvester calls the following hooks:

1) hook_oaiharvester_process_record

Signature:

<?php
hook_oaiharvester_process_record($record)
?>

Purpose, time of event:
This hook is triggered inside the iteration over the harvested records, so it is
called on each record sequentially. If you want to do something with the record
(usually storing it or indexing it for search), implement this hook. The record is
a complex structure: an array created from the record's XML element in the
OAI-PMH response. The actual metadata part is a DOMElement.

Parameters:
$record The harvested record from an OAI-PMH response to a ListRecords request.
The record is a complex array with the following internal structure:
$record['header'] - the header part of the record
$record['header']['identifier'] - the record identifier
$record['header']['datestamp'] - the time of creation or of the last modification
$record['header']['setSpec'] - the identifier(s) of the set(s) to which the record belongs
$record['about'] - information about the record
$record['metadata'] - the metadata part of the record. It can be in one of several metadata formats (such as Dublin Core, MARCXML, EAD etc.)
$record['metadata']['namespaceURI'] - the namespace of the metadata format
$record['metadata']['childNode'] - the content of the metadata, as a DOMElement object
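
A minimal sketch of an implementation, assuming a hypothetical module named mymodule and a Dublin Core record, might look like this. The helper functions and the in-memory "store" are illustrative stand-ins; a real module would write to the database or to a search index instead.

<?php
/**
 * Extracts a few values from a harvested record. Hypothetical helper.
 */
function mymodule_summarize_record($record) {
  $summary = array(
    'identifier' => $record['header']['identifier'],
    'datestamp' => $record['header']['datestamp'],
    'namespace' => $record['metadata']['namespaceURI'],
    'titles' => array(),
  );
  // childNode is a DOMElement, so the usual DOM methods apply.
  $dc = 'http://purl.org/dc/elements/1.1/';
  $titles = $record['metadata']['childNode']->getElementsByTagNameNS($dc, 'title');
  foreach ($titles as $title) {
    $summary['titles'][] = trim($title->textContent);
  }
  return $summary;
}

/**
 * Stand-in for real storage: keeps summaries in memory and returns them.
 */
function mymodule_store($summary = NULL) {
  static $items = array();
  if ($summary !== NULL) {
    $items[] = $summary;
  }
  return $items;
}

/**
 * Implementation of hook_oaiharvester_process_record().
 */
function mymodule_oaiharvester_process_record($record) {
  mymodule_store(mymodule_summarize_record($record));
}
?>

Keeping the extraction in its own function makes it possible to try it on a hand-built $record array without running a harvest.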

2) hook_oaiharvester_request_processed

Signature:

<?php
hook_oaiharvester_request_processed()
?>

Purpose, time of event:
This hook is triggered after a single OAI request has been processed. Do not confuse it with
hook_oaiharvester_harvest_finished, which is triggered after all requests have been processed
for a given initial URL.

Parameters:
None currently.
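
One possible use, sketched here for a hypothetical module named mymodule, is counting the requests processed so far. The counter helper is an assumption made for illustration, and because it relies on a static variable it only lives for the current PHP request.

<?php
/**
 * Returns the number of processed requests, optionally incrementing it first.
 * Hypothetical helper, not part of the OAI Harvester module.
 */
function mymodule_request_count($increment = FALSE) {
  static $count = 0;
  if ($increment) {
    $count++;
  }
  return $count;
}

/**
 * Implementation of hook_oaiharvester_request_processed().
 */
function mymodule_oaiharvester_request_processed() {
  mymodule_request_count(TRUE);
}
?>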

3) hook_oaiharvester_harvest_finished

Signature:

<?php
hook_oaiharvester_harvest_finished($success, $results, $operations)
?>

Purpose, time of event:
Triggered after a schedule has finished (successfully or not). A schedule may contain
multiple initial URLs.

Parameters:
$success Boolean value designating the success of the harvesting
$results An array containing information about the process
$operations Currently unused parameter
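
A minimal sketch of an implementation for a hypothetical module named mymodule could log a one-line summary. Since the internal layout of $results is not described here, the sketch only counts its entries. error_log() is used so that the example runs outside Drupal; inside Drupal, watchdog() would be the more usual choice.

<?php
/**
 * Builds a one-line summary of a finished harvest. Hypothetical helper.
 */
function mymodule_harvest_summary($success, $results) {
  $status = $success ? 'finished successfully' : 'finished with errors';
  // Assumption: only the number of entries is used, not their content.
  $entries = is_array($results) ? count($results) : 0;
  return sprintf('OAI harvest %s (%d result entries).', $status, $entries);
}

/**
 * Implementation of hook_oaiharvester_harvest_finished().
 */
function mymodule_oaiharvester_harvest_finished($success, $results, $operations) {
  // $operations is currently unused, so it is ignored here.
  error_log(mymodule_harvest_summary($success, $results));
}
?>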

 
 
