Last updated December 14, 2010. Created by pkiraly on December 14, 2010.
The purpose of the eXtensible Catalog (XC) Drupal Toolkit is to provide a next-generation discovery interface for library records. The main communication protocol between the different XC components is the Open Archives Initiative Protocol for Metadata Harvesting, or OAI-PMH for short (http://www.openarchives.org/OAI/openarchivesprotocol.html). Because there is no other module for OAI-PMH harvesting, we created our own implementation, which can harvest not only XC records but other record types as well (see the manual page about the oaiharvester module).
Common tasks
In the setup guide we showed you how to set up a data provider and a harvesting schedule. To start using the module, the first step is to start a harvest.
Launch a scheduled harvest manually
- Go to Administer › eXtensible Catalog (XC) › Metadata Harvester › Scheduled harvests (admin/xc/harvester/schedule)
- Select a scheduled harvest from the list by clicking its name
- Click the Harvest menu item
Depending on several factors (such as the number of records and the speed of the data provider), the harvest can take a while. If the OAI-PMH data provider supports the completeListSize attribute of the resumptionToken element in the ListRecords XML response, the harvester tries to estimate the total and the remaining time. It gives you feedback about the number of records and the elapsed time. The schedule can work together with cron (http://drupal.org/getting-started/6/install/cron), so you can start a harvest automatically at a specified time.
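The time estimate described above can be sketched as follows. This is only an illustration in Python, not the module's actual code: the function name, its parameters and the sample response are assumptions, and only the resumptionToken element and its completeListSize attribute come from the OAI-PMH protocol. The sketch assumes the harvest continues at the same average speed per record.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def estimate_remaining_seconds(response_xml, records_done, elapsed_seconds):
    """Estimate the seconds left, or return None if no total is available.

    Hypothetical helper: assumes the average speed so far stays constant.
    """
    root = ET.fromstring(response_xml)
    token = root.find(f".//{OAI_NS}resumptionToken")
    if token is None:
        return None
    total = token.get("completeListSize")  # optional attribute
    if total is None or records_done <= 0:
        return None
    remaining = max(int(total) - records_done, 0)
    return elapsed_seconds * remaining / records_done


# Made-up response fragment, for illustration only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken completeListSize="1000" cursor="0">token-1</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

# 250 of 1000 records harvested in 50 seconds.
print(estimate_remaining_seconds(SAMPLE, records_done=250, elapsed_seconds=50.0))
```

If the data provider omits completeListSize, the function returns None, which corresponds to the case where only the record count and the elapsed time can be shown.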
Specific tasks for XC schema records
XC schema records come from another XC software component, the Metadata Services Toolkit. To speed up node creation, we split the process into several smaller pieces:
- Step 1. Harvesting XC records, and preparing importable CSV files.
- Step 2. Importing metadata into MySQL.
- Step 3. Indexing with Solr.
- Step 4. Creating nodes from metadata.
If the XC OAI harvester bridge module is enabled, Step 1 runs automatically with harvesting, as does Step 2. You can enable Step 3 in the schedule. For now you have to run Step 4 separately, but later it will be possible to enable it on the schedule page as well. The order of the last two steps is reversible. If you index with Solr first, the advantage is that your users can search before the nodes are created: when a search result list is built, the missing nodes are created on the fly. Meanwhile you can start Step 4 as a batch process. The advantage of starting with node creation is that when you start indexing, the database already contains all node identifiers, so the result list does not need the on-the-fly node creation, which has a little overhead, and searching is quicker.
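The on-the-fly node creation mentioned above can be pictured with a small sketch. It is written in Python for illustration only and is not the module's code: node_ids_for_results, node_index and create_node are hypothetical names, and a plain dictionary stands in for the database lookup.

```python
def node_ids_for_results(metadata_ids, node_index, create_node):
    """Return a node id for every search hit, creating missing nodes lazily.

    node_index maps metadata id -> node id (hypothetical stand-in for the
    database); create_node is any callable that builds a node and returns
    its id.
    """
    node_ids = []
    for metadata_id in metadata_ids:
        node_id = node_index.get(metadata_id)
        if node_id is None:
            # No node yet: create it now and remember its identifier.
            node_id = create_node(metadata_id)
            node_index[metadata_id] = node_id
        node_ids.append(node_id)
    return node_ids


created = []


def demo_create_node(metadata_id):
    """Pretend node factory that records which ids it was called for."""
    created.append(metadata_id)
    return 1000 + metadata_id


index = {1: 11}  # record 1 already has a node, records 2 and 3 do not
first_page = node_ids_for_results([1, 2, 3], index, demo_create_node)
second_visit = node_ids_for_results([1, 2, 3], index, demo_create_node)
print(first_page, created)
```

Only the first request pays the cost of creating nodes 2 and 3; the second request finds all identifiers already registered. This is the small overhead that disappears when the batch node creation has run before indexing.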
Indexing with Solr
If you are harvesting XC records, which have an FRBR-based structure, you need to restore the MARC-like structure in the Solr index for performance reasons. This means that the data of the Work and Expression records is merged into the corresponding Manifestation records, and only these merged records are stored in Solr: one Solr document for each manifestation.
- Go to Administer › eXtensible Catalog (XC) › Solr › Run one-stop indexer (admin/xc/solr/onestop)
This can take a while, depending on the size of the original database.
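The merging of Work and Expression data into Manifestation records can be illustrated with a simplified sketch. It is written in Python and does not reproduce the indexer's real logic: the record layout, the link names expression_id and work_id, and the field names are all assumptions, and each manifestation is assumed to link to at most one expression and one work.

```python
def build_solr_documents(works, expressions, manifestations):
    """Build one flat document per manifestation (hypothetical layout).

    works and expressions are dicts keyed by record id; manifestations is
    a list of records. Fields of the linked Work and Expression are copied
    into the manifestation's document as multi-valued fields.
    """
    docs = []
    for manifestation in manifestations:
        expression = expressions.get(manifestation.get("expression_id"), {})
        work = works.get(expression.get("work_id"), {})
        doc = {"id": manifestation["id"]}
        for record in (work, expression, manifestation):
            for field, value in record.get("fields", {}).items():
                doc.setdefault(field, []).append(value)
        docs.append(doc)
    return docs


# Made-up sample data: one work, one expression, two manifestations.
works = {"w1": {"fields": {"title": "Hamlet"}}}
expressions = {"e1": {"work_id": "w1", "fields": {"language": "eng"}}}
manifestations = [
    {"id": "m1", "expression_id": "e1", "fields": {"publisher": "Publisher A"}},
    {"id": "m2", "expression_id": "e1", "fields": {"publisher": "Publisher B"}},
]

docs = build_solr_documents(works, expressions, manifestations)
for doc in docs:
    print(doc)
```

Both resulting documents repeat the title and the language of the shared Work and Expression, so a search can be answered from a single document without joining records at query time.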
Batch node creation
Node creation is quite slow in Drupal, so we split it into smaller pieces. Step 4 contains the actual node creation. It iterates over all harvested and imported metadata records and creates a node for each record that has no associated node yet. Because node creation calls hooks, and we cannot know how long other modules' implementations of these hooks take, the process can be quick or slow. After a node is created, the module registers the node's identifier in the metadata table (namely xc_entity_properties). Later you will be able to start this batch process automatically by binding it to the schedule; for now you have to start it manually.
- Go to Administer › eXtensible Catalog (XC) › Metadata Storage Configuration and Utilities › Batch node creation (admin/xc/metadata/postharvest)
- Click the Start button
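The batch process started here follows the pattern sketched below. The sketch is in Python with an in-memory SQLite table and is not the module's implementation: the table demo_properties and its columns metadata_id and node_id are hypothetical stand-ins (the real xc_entity_properties schema is not shown on this page), and create_node is assumed to always return an identifier.

```python
import sqlite3


def create_missing_nodes(conn, create_node, batch_size=100):
    """Create a node for every record without one; return how many were made.

    Works in small batches so each pass stays short. Assumes create_node
    always returns an id, otherwise the loop would not terminate.
    """
    created = 0
    while True:
        rows = conn.execute(
            "SELECT metadata_id FROM demo_properties "
            "WHERE node_id IS NULL LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            return created
        for (metadata_id,) in rows:
            node_id = create_node(metadata_id)
            # Register the new node's identifier next to the metadata record.
            conn.execute(
                "UPDATE demo_properties SET node_id = ? WHERE metadata_id = ?",
                (node_id, metadata_id),
            )
            created += 1
        conn.commit()


def fake_create_node(metadata_id):
    """Pretend node factory used only for this demonstration."""
    return 1000 + metadata_id


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo_properties (metadata_id INTEGER PRIMARY KEY, node_id INTEGER)"
)
conn.executemany(
    "INSERT INTO demo_properties VALUES (?, ?)", [(1, 10), (2, None), (3, None)]
)

print(create_missing_nodes(conn, fake_create_node, batch_size=1))
```

Running the function a second time creates nothing, because every record now has a registered node identifier. Interrupting and restarting the batch is therefore harmless in this sketch.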