Last updated March 10, 2011. Created by pkiraly on March 9, 2011.
The harvesting process implements the Open Archives Initiative Protocol for Metadata Harvesting, or OAI-PMH for short (http://www.openarchives.org/OAI/openarchivesprotocol.html). The purpose of OAI-PMH is to transfer large numbers of records from one computer (the 'data repository') to another. The records are in XML format. A data repository can support several metadata formats, and it may have independently harvestable parts of its collection ('sets').
In the OAI Harvester module the site administrator has to register an existing repository's name and base URL. The base URL specifies the Internet host and port, and optionally a path, of an HTTP server acting as a repository. OAI-PMH supports six different information requests or 'verbs': 'Identify' describes the repository itself, 'GetRecord' requests one record by its identifier, 'ListIdentifiers' lists the available record identifiers, 'ListMetadataFormats' lists the available metadata formats or XML schemas, 'ListSets' lists the available sets or collections, and finally 'ListRecords' returns the complete metadata records. The type of request is denoted by the 'verb' parameter. The base URL is the request without any parameters. Since the base URL on its own usually does not return a valid response, check whether your base URL is valid by opening the [base URL]?verb=Identify URL in your browser. In the OAI Harvester module's Add repository form (Administer > eXtensible Catalog (XC) > Metadata Harvester > Repositories > Add repository, or admin/xc/harvester/repository/add), however, the site administrator has to enter the base URL, not the URL of a valid request. The module will then detect and save all the necessary information, such as the descriptive information about the data provider, the list of metadata formats and sets, the availability of the server, and so on.
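The relationship between the base URL and a full request can be sketched in a few lines of Python. This is only an illustration, not the module's code: the function name build_request_url and the example.org address are assumptions made up for this example.

```python
from urllib.parse import urlencode


def build_request_url(base_url, verb, **params):
    """Return an OAI-PMH request URL: the base URL plus a 'verb' parameter.

    Hypothetical helper for illustration; it is not part of the module.
    """
    query = urlencode({"verb": verb, **params})
    return f"{base_url}?{query}"


# Hypothetical base URL. The form expects only this part, without any verb.
base_url = "http://example.org/oai"

# The URL to open in a browser to check that the base URL is valid.
print(build_request_url(base_url, "Identify"))
```

Only the value of base_url goes into the Add repository form; the URL built by the helper is for manual checking.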
The next step is to create a harvester schedule. Its purpose is to set the properties of the harvest (what, when and how to harvest). The harvesting process can be launched in two ways: automatically, with the operating system's and Drupal's cron scheduler, or manually. The module contains an implementation of the cron 'hook', so if you set up cron for the Drupal site, the OAI Harvester module will check whether the current time matches the timing settings the administrator set on the harvester schedule page. If it matches, the module starts the harvest; otherwise nothing happens. As mentioned, you can always launch the harvester manually.
Creating or modifying a harvest schedule is a two-step process. On page one, the site administrator selects a data repository, the frequency of the harvest (hourly, daily, weekly), and the date range within which the harvest can be launched by the cron scheduler. On page two, the administrator can enter a name for the schedule and select the set and format (see the first paragraph of this documentation page) and an XML parsing mode (either the robust but slow DOM processing, or the quick regular-expression-based one).
The module can cache the data repository's raw XML responses. This helps if your internet connection is slow and you would like to change your indexing settings: you can rerun the process without accessing the internet. It is useful only in the testing phase, and it could be misleading on a production server, because the cache does not reflect changes on the OAI-PMH server. If you no longer want to use the cache, set it to no, and delete the cache directory by clicking the "clear cache" button at the top when you are outside of editing mode. The cached files are in the oaiharvester_http_cache subdirectory of Drupal's default file directory (sites/default/files/oaiharvester_http_cache).
Another useful feature is limiting the number of OAI-PMH requests. If you limit the requests, the harvester will not fetch everything, only the first few batches of records. The number of records per request is controlled by the data provider; we can only control the number of requests. Fetching all records can be time consuming, so if you only want to test the toolkit or your indexing settings, it is useful to start with a limited number of records. The default value of 0 means no limit. Use other values only for testing.
The OAI Harvester module provides hooks to add even more possibilities. Another module, the XC Harvester Bridge, which connects the OAI-PMH harvester to the rest of the XC modules, provides the following settings:
- Storage locations, or where to store harvested records. I'd like to remind you that the OAI Harvester module's only responsibility is the harvest; it provides the records to other modules. The Bridge module uses the Metadata module's metadata locations to store the records. If you have several such locations defined, you can disable some of them if you don't want to store records in every location.
- In order to make the records searchable and usable inside Drupal, we have to index them with Solr and create nodes. These tasks can be enabled or disabled and run in different orders, so the administrator has four options:
Option a. Create nodes, but not index with Solr.
Option b. Index with Solr, do not create nodes.
Option c. Index with Solr first, then create nodes.
Option d. Create nodes first, then index with Solr.
After indexing with Solr, a node will be created automatically on the fly (if it does not exist yet) when a user receives a search result page. But node creation has an overhead, and it slows down the search a bit. You can start the node creation batch at any time after the harvest has finished. If you want to publish your data at the earliest possible time, choose Option b. or Option c. If you want to achieve the quickest search, use Option d. Our suggestion is 'd.'
- Finally, the administrator can choose a MySQL-specific insertion method. To import the harvested records into the relational database, the Toolkit can use either CSV files and LOAD DATA INFILE commands, or the more traditional INSERT command. The first is much quicker, but it requires a global-level (rather than database-level) privilege in MySQL, which is not available in every situation. If you choose this, be very careful with the privilege settings. INSERT is a slower command, but it requires only database-level privileges, so it is more secure. If you have millions of records and time is an important factor, consider using LOAD DATA INFILE with care; if you don't have that many records, or the duration of the harvest is not important, use the INSERT command.
What happens during harvest?
For testing and checking purposes it is important to understand what happens during the harvest.
First the harvester gets an OAI response. It is in XML format, and the module has to parse it. Then it iterates over the records and calls a hook (see details in the developer's guide) to process each record. If there are more records on the server, it uses an OAI-PMH parameter ('resumptionToken') to request the next batch of records. At the beginning and end of each request, and at the beginning and end of the whole process, it calls other hooks to notify other modules.
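The request loop described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the module's PHP implementation: the names harvest, fetch and process_record are hypothetical, fetch stands in for the HTTP request, and process_record stands in for the record-processing hook. The max_requests argument mirrors the request limit setting, where 0 means no limit.

```python
import xml.etree.ElementTree as ET

# XML namespace of OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def harvest(fetch, metadata_prefix, process_record, max_requests=0):
    """Fetch ListRecords responses until the resumption token runs out.

    fetch(params) must return the response XML as a string (hypothetical
    stand-in for the HTTP call). process_record(record) is called for
    each <record> element. Returns the number of requests made.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    requests_made = 0
    while True:
        root = ET.fromstring(fetch(params))
        requests_made += 1
        for record in root.iter(OAI_NS + "record"):
            process_record(record)
        token = root.find(f".//{OAI_NS}resumptionToken")
        # A missing or empty token means the list is complete.
        if token is None or not (token.text or "").strip():
            break
        # Stop early when a request limit is set (0 means no limit).
        if max_requests and requests_made >= max_requests:
            break
        # A follow-up request carries only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
    return requests_made
```

Passing the fetching function in as an argument keeps the sketch testable without network access; a real harvester would also report OAI-PMH error responses, which this sketch ignores.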
In our case it is the XC Harvester Bridge module which implements these hooks. First it creates an internal object, called an XC Entity object, from the metadata record. It examines whether the record is a deleted one, and if it is, calls a Metadata module function (xc_remove()) to delete it. The XC Entity object stores all information about a metadata record: its types (metadata type and corresponding node type), the complete metadata part, and so on. This information is stored in distinct tables. When we load an XC Entity, the module reads all the necessary information from these tables, and when we modify or delete a piece of information, the tables remain consistent with our changes.
If this is not a deleted record, the module checks whether the record already exists and calls another XC Metadata function (_xc_build_store()) to build and store the record. Without going into much more detail here, the records are stored in two tables.
The first, more general one is xc_entity_properties, containing the metadata type and the node type, the OAI identifier and its generated/extracted numeric version, the metadata format, a reference to the source of the record (the schedule), the timestamps of creation and modification, and three flags showing whether the record is built, stored or deleted (which denote the operations run on the record).
The second one is xc_sql_metadata, which stores the record's metadata part in PHP serialized format. (The metadata part is a special element in the OAI-PMH XML response that holds the original metadata itself, without the additional payload of the standard.)
If the metadata schema contains hierarchical elements, like the FRBR-level records of the XC schema, the parent-child relations will be stored in the xc_entity_relationships table. Since at the time of harvest
If the record is a new one and the administrator chose the 'LOAD DATA INFILE' syntax, the records are first written into a comma-separated values (CSV) file, and they are inserted into the tables only after the last OAI request.
After fetching the last record from data provider
In the Start using document we mentioned that the harvesting schedule has multiple phases, of which only the first one is fetching and storing records. Now we have the "raw" records, but we cannot search, display, print or comment on them. We need to finish two more tasks: node creation and creating the Solr index.
Node creation
As you might know, the node is the core concept of Drupal: it is a piece of content. Drupal provides a handful of functions to handle nodes, such as adding comments, printing them, defining their internal structure and so on. The XC modules iterate over the rows of the xc_entity_properties table where the node_id field is 0 (the default value, which means that no node has been created for that record), call a Drupal function (node_save()) to create a node, then get the node's identifier and update the node_id field with it. We create a very basic node: we only set the type and the title (which for now equals the OAI identifier of the record; later we will modify the code to get the real title from the metadata). Because of Drupal's node concept, lots of other modules can respond to the node creation event with their own actions. In Drupal core only three tables change: the node, node_revision and node_comment_statistics tables. Because we specified the node type, and the node type is registered as belonging to the Metadata module, Drupal calls the Metadata module when the node is displayed or deleted, and this module adds the XC Entity object's properties to the node. So the Drupal node itself is a very thin object; during the display process it is filled with all of our metadata information. It is good to know that, for the XC schema, we create nodes only for manifestation-level records, since we do not directly display records of other levels.
Solr indexing
The XC schema records follow the FRBR data model, and hence they form a hierarchy. Hierarchical data structures can be handled perfectly well in relational databases, but not in Solr, which stores records containing key-value pairs and has no operation like SQL's JOIN command. For this reason, if we would like to search field content at distinct hierarchical levels, we have to "flatten" our data structure; for XC records this practically means reconstructing the MARC record in a way. The module iterates over all manifestations, finds each record's parents (the expression-level records) and the parents of those records (the work-level records), and merges the metadata parts of these records. In Solr we store different types of information:
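Before looking at the field types, the flattening step can be illustrated with a short Python sketch. It is only a model of the idea, not the module's code: the function name flatten, the record identifiers and the in-memory dictionaries (standing in for the stored metadata and the parent-child relations) are assumptions made up for this example.

```python
def flatten(manifestation_id, records, parents):
    """Merge a manifestation's metadata with that of all its ancestors.

    records maps a record id to {field name: list of values}.
    parents maps a record id to the list of its parent record ids.
    Both structures are hypothetical stand-ins for the database tables.
    """
    merged = {}
    queue = [manifestation_id]
    seen = set()
    while queue:
        record_id = queue.pop(0)
        if record_id in seen:
            continue
        seen.add(record_id)
        for field, values in records[record_id].items():
            merged.setdefault(field, []).extend(values)
        # Climb one level: manifestation -> expression -> work.
        queue.extend(parents.get(record_id, []))
    return merged


# Hypothetical sample data: one manifestation, one expression, one work.
records = {
    "m1": {"dcterms:title": ["A title"]},
    "e1": {"dcterms:language": ["en"]},
    "w1": {"dcterms:creator": ["An author"]},
}
parents = {"m1": ["e1"], "e1": ["w1"]}
flat = flatten("m1", records, parents)
```

The merged dictionary holds the fields of all three levels in one flat record, which is the shape Solr can index without a JOIN.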
- mandatory fields. These are: id (equal to the OAI identifier), node_id, node_type, metadata_id, metadata_type, source_id, and type (the same as metadata_type; yes, this is redundant). All these fields are present in every record. They are the same as the most important fields of the xc_entity_properties table, and they store the values of the base-level record (the manifestation), not those of its parents.
- selected fields of the metadata. We will cover later how, with the XC Index module, we can select the fields we would like to index from among all the fields of the schema. The general principle is that only those fields are worth indexing that we would like to search in. For example: would we like to search the title? Let's select it for indexing. Would we not like to search the publisher? Leave it out of the selected fields. Each selected schema field may become the origin of several Solr fields; e.g. if we would like to search for the first name of the author, but we would also like to use the author name as a phrase, we can index it both as "text" and as "phrase" (these are Solr field types).
- facets. We can create facets in several ways on the administrator interface, but what is important here is that facets are just normal Solr fields like the others. There are only two restrictions: 1) we have to store the field value, and 2) we have to use phrase indexing (this is true in the bulk of use cases, but not in all). We use the _fc suffix, which has these attributes defined in schema.xml. The module creates the values of these facet fields according to the rules the administrator set.
- generated fields. The module creates a field called text, which contains all values of all fields of the merged metadata. It is the basis of the general search. Since it contains all the information of the MARC-like record, this is the default search field: if the user does not specify a field, Solr will use this one. The other generated field is metadata, which is a serialized version of the MARC-like record. The module uses it when it displays search result lists or full records. It is a stored, but not searchable, field.
Naming convention for Solr fields
The Solr field names are automatically derived from the schema field names. Since the colon character (":") is reserved in Solr, it would be very inconvenient to leave colons in field names, so the module automatically replaces each colon with two underscore characters. Since it would also be very inconvenient to register every schema field in Solr's schema.xml, we make use of dynamic fields. This means that we add a suffix to the end of the field name, and Solr knows the field type from this suffix. These are the two rules the modules apply when they translate back and forth between schema fields and Solr fields. Examples: dcterms:title indexed as text becomes dcterms__title_t in Solr, and the same field indexed as phrase becomes dcterms__title_s.
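The two rules can be written down as a pair of small functions. This Python sketch only mirrors the rules and examples given above; the function names and the SUFFIXES table are assumptions for illustration, not the module's API.

```python
# Index type -> dynamic field suffix, taken from the examples in this page.
SUFFIXES = {"text": "_t", "phrase": "_s", "facet": "_fc"}


def to_solr_field(schema_field, index_type):
    """Translate a schema field name to a Solr dynamic field name."""
    return schema_field.replace(":", "__") + SUFFIXES[index_type]


def to_schema_field(solr_field):
    """Translate a Solr field name back to the schema field name.

    Assumes the schema field itself contains no double underscore.
    """
    name, _, _suffix = solr_field.rpartition("_")
    return name.replace("__", ":")


print(to_solr_field("dcterms:title", "text"))    # dcterms__title_t
print(to_solr_field("dcterms:title", "phrase"))  # dcterms__title_s
print(to_schema_field("dcterms__title_t"))       # dcterms:title
```

The reverse translation works because the type suffix is always the part after the last underscore.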