Technical details

Last updated on
30 April 2025
The big picture

Harvesting and indexing is an iterative process with multiple steps:

  1. Harvesting from a data provider. During the harvest each record is processed, and either stored directly in MySQL tables or written to CSV files.
  2. Importing the CSV files (if any), and normalizing the table that tracks which records have to be changed in the Solr index
  3. Creating nodes
  4. Creating/updating the Solr index

The administrator can modify the order of the last two steps. The steps run serially: for example, step #2 can be launched only after step #1 has finished.

1.a Harvesting

The OAI-PMH ListRecords verb returns an XML document similar to this:

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="...">
  <responseDate>2011-03-25T09:22:31Z</responseDate>
  <request verb="ListRecords" metadataPrefix="xc">
    http://128.151.244.136:8080/MetadataServicesToolkit/
    MARCToXCTransformation-Service/oaiRepository
  </request>
<ListRecords>
<record>
<header>
  <identifier>
    oai:mst.rochester.edu:MetadataServicesToolkit
    /MARCToXCTransformation/10000
  </identifier>
  <datestamp>2010-05-13T15:57:16Z</datestamp>
  <setSpec>DemoTS</setSpec>
</header>
<metadata>
  <xc:frbr
    xmlns:xc="http://www.extensiblecatalog.info/Elements"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:rdvocab="http://rdvocab.info/Elements"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:rdarole="http://rdvocab.info/roles">
    <xc:entity 
      type="work" 
      id="oai:mst.rochester.edu:MetadataServicesToolkit/
          MARCToXCTransformation/10000">
      <dcterms:subject xsi:type="dcterms:LCC">M2148.2</dcterms:subject>
      <dcterms:subject xsi:type="dcterms:LCC">M2011.C7375</dcterms:subject>
      <rdvocab:titleOfWork>Complete Proper of the Mass,</rdvocab:titleOfWork>
      <xc:subject xsi:type="dcterms:LCSH">Gregorian chants</xc:subject>
      <xc:subject xsi:type="dcterms:LCSH">Propers (Music)</xc:subject>
      <xc:relation>Catholic Church. Missal.</xc:relation>
    </xc:entity>
  </xc:frbr>
</metadata>
</record>
... more records here ...
</ListRecords>
</OAI-PMH>

We have to say a few words about the data model behind this XML, because it has multiple levels. The first level is the object itself, which in a library context is usually a book. The second level is the information about this book (who wrote it, what its title is, and so on): the catalog record. In library terminology this level is called "metadata", because it is data describing the original object, not the object itself. It is stored in the <metadata/> element of the <record/>. In OAI, and in the Drupal Toolkit, there is a third level of information: the basic information about the catalog record, the metadata of the metadata. It describes the type or format of the record (whether it is an XC schema record or Dublin Core), the time of modification, an identifier, and so on. The toolkit stores the metadata (the catalog record) in the xc_sql_metadata table, and it stores the metadata of the catalog record in the xc_entity_properties table.

This XML is parsed in one of two ways, depending on your settings:

  1. DOM-based parsing: the toolkit loads the XML into memory as one large DOM object. This method is very robust, but it requires a lot of memory and time.
  2. Regular expression-based parsing: the toolkit extracts the records as strings, and parses a record as a DOM object only when requested. This method requires less memory and time. But since it is based on regular expressions, it is less robust, so irregularly formatted XML could cause problems (if you run into such a problem, please report it and send us the problematic XML, to help us prepare for more use cases). We suggest using this method.

Malformed XML can cause (properly handled) problems in both cases; if that happens, consult the provider of the source document.
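The difference between the two strategies can be illustrated with a minimal sketch. It is not the toolkit's actual parser: the function names and the sample XML are hypothetical, and the regular expression assumes that <record> elements are never nested.

```php
<?php
// Illustrative sketch only; all function names here are hypothetical.

// Strategy 1: parse the whole response as one DOM tree, then walk the records.
function sketch_records_via_dom($xml) {
  $doc = new DOMDocument();
  $doc->loadXML($xml);
  $records = array();
  foreach ($doc->getElementsByTagName('record') as $record) {
    $records[] = $record; // every record is already a DOMElement
  }
  return $records;
}

// Strategy 2: cut out each <record>...</record> as a plain string.
function sketch_records_via_regex($xml) {
  preg_match_all('#<record\b.*?</record>#s', $xml, $matches);
  return $matches[0];
}

// A record string becomes a DOM object only when a caller asks for it.
function sketch_record_string_to_dom($record_xml) {
  $doc = new DOMDocument();
  $doc->loadXML($record_xml);
  return $doc->documentElement;
}

$xml = '<ListRecords>'
  . '<record><header><identifier>oai:example:1</identifier></header></record>'
  . '<record><header><identifier>oai:example:2</identifier></header></record>'
  . '</ListRecords>';

$dom_records = sketch_records_via_dom($xml);
$raw_records = sketch_records_via_regex($xml);

// Only the second record is parsed as DOM here; the first stays a string.
$second = sketch_record_string_to_dom($raw_records[1]);
$second_identifier = $second->getElementsByTagName('identifier')->item(0)->textContent;
```

The sketch also hints at why the second strategy is less robust: a record cut out as a string loses any namespace declarations made on its ancestor elements, so it parses cleanly on its own only if it declares the prefixes it uses.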

1.b Processing records

The module parses the records one by one, and calls hook_oaiharvester_process_record($record, $schedule_id) for each. Our implementation (the XC OAIHarvester Bridge module's xc_oaiharvester_bridge_oaiharvester_process_record) processes the record, inserts it into the database, identifies its relationships, creates a node (if needed), and finally indexes it in Solr. Let's see the details of these phases.

The $record parameter has the following structure:

$record = array(
  'header' => array(
    '@status' => '...', // the record's status, used when it is "deleted"
    'identifier' => '....', // OAI Identifier
    'datestamp' => '...', // datestamp of the record
    'setSpec' => '...', // the set specification, to which the record belongs
  ),
  'about' => '...', // extra metadata about the record, we do not use it
  'metadata' => array(
    'namespaceURI' => '...', // the namespace URI of the record type
    'childNode' => '...', // the metadata itself as PHP DOMElement
  )
);

For the childNode key, see the DOMElement page of the PHP documentation. The rest of the keys map directly onto the structure of the XML record.
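As an illustration of this mapping, the following sketch builds such an array from one <record> element. The helper name is hypothetical and the harvester's real code may differ; the 'about' key is left out for brevity.

```php
<?php
// Hypothetical helper, for illustration only.
function sketch_record_to_array(DOMElement $record) {
  $result = array('header' => array(), 'metadata' => array());

  $header = $record->getElementsByTagName('header')->item(0);
  // In OAI-PMH the status is an attribute of <header>, e.g. status="deleted".
  if ($header->hasAttribute('status')) {
    $result['header']['@status'] = $header->getAttribute('status');
  }
  foreach (array('identifier', 'datestamp', 'setSpec') as $name) {
    $node = $header->getElementsByTagName($name)->item(0);
    if ($node !== NULL) {
      $result['header'][$name] = trim($node->textContent);
    }
  }

  // The first element inside <metadata> is the catalog record itself.
  $metadata = $record->getElementsByTagName('metadata')->item(0);
  if ($metadata !== NULL) {
    foreach ($metadata->childNodes as $child) {
      if ($child instanceof DOMElement) {
        $result['metadata']['namespaceURI'] = $child->namespaceURI;
        $result['metadata']['childNode'] = $child;
        break;
      }
    }
  }
  return $result;
}

$doc = new DOMDocument();
$doc->loadXML(
  '<record><header><identifier>oai:example:10000</identifier>'
  . '<datestamp>2010-05-13T15:57:16Z</datestamp><setSpec>DemoTS</setSpec></header>'
  . '<metadata><xc:frbr xmlns:xc="http://www.extensiblecatalog.info/Elements"/></metadata>'
  . '</record>'
);
$record = sketch_record_to_array($doc->documentElement);
```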

The first thing the processing does is extract $identifier_int from the identifier of the record. This will be the internal numerical identifier (for more about identifier_int, see the introductory note for the xc_entity_relationships table).
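A minimal sketch of such an extraction, assuming that the numeric part is simply the run of digits at the end of the OAI identifier (the function name is hypothetical, and the toolkit's actual rule may differ):

```php
<?php
// Hypothetical helper: returns the trailing digits of an identifier as an
// integer, or NULL when the identifier does not end in digits.
function sketch_identifier_int($identifier) {
  if (preg_match('/(\d+)\s*$/', $identifier, $match)) {
    return (int) $match[1];
  }
  return NULL;
}

$identifier_int = sketch_identifier_int(
  'oai:mst.rochester.edu:MetadataServicesToolkit/MARCToXCTransformation/10000'
);
```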

If the @status attribute of the record is "deleted", the Drupal Toolkit may have already harvested and stored this record, in which case it has to delete it now. Because this is only a possibility, the Toolkit first checks whether the record exists; if it doesn't, the record is simply skipped. If the record belongs to a node, the Toolkit calls Drupal's node_delete() function. This triggers the node deletion hook, which the XC Metadata module implements, so it deletes not just the node but also the corresponding records in xc_entity_properties, xc_entity_relationships, and xc_sql_metadata, and the Solr document. If the record does not have a node (we create nodes only from manifestation records), the function itself deletes these records from the tables mentioned above. Since the Solr index and the schema records (stored in MySQL) don't have a 1:1 mapping, the identifiers of the deleted records are kept in a temporary register, from which Solr is updated in multiple steps.

Afterwards the identifiers are stored in the xc_oaiharvester_bridge_changes table. You can find the details of the following steps in the description of that table. After the deletion and this registration the function returns 1, and the processing of the next record starts.
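The decision flow for a deleted record can be sketched as follows. This is only an illustration: the arrays are hypothetical in-memory stand-ins for the tables, and the function returns a label for the path taken, where the real function returns 1.

```php
<?php
// Hypothetical in-memory stand-ins for the tables; not the module's code.
function sketch_process_deleted(array &$entities, array &$changes, $identifier) {
  if (!isset($entities[$identifier])) {
    return 'skipped'; // the record was never stored, so there is nothing to delete
  }
  // With a node, the deletion would go through node_delete() and its hook;
  // without one, the rows are removed directly.
  $path = $entities[$identifier]['node_id'] > 0 ? 'node_delete' : 'direct';
  unset($entities[$identifier]);
  $changes[] = $identifier; // remembered so that Solr can be updated later
  return $path;
}

$entities = array(
  'oai:example/10000' => array('metadata_type' => 'work', 'node_id' => 0),
  'oai:example/10002' => array('metadata_type' => 'manifestation', 'node_id' => 1),
);
$changes = array();
$results = array(
  sketch_process_deleted($entities, $changes, 'oai:example/10002'),
  sketch_process_deleted($entities, $changes, 'oai:example/10000'),
  sketch_process_deleted($entities, $changes, 'oai:example/99999'),
);
```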

If the status is missing or is not "deleted", the processing continues. First it checks whether the namespace is registered. Each schema module registers one namespace for its schema definition. The Drupal Toolkit provides two schema modules: one for the XC schema, and another for Dublin Core. It is possible to create additional, third-party schema modules. If the record's namespace does not match a registered schema, the function returns and the toolkit continues with the next record.

The module then creates a simple anonymous object (a stdClass object), which has the same fields as the xc_entity_properties records. If the harvest is not an initial one, it checks whether the record already exists (in an initial harvest, all records are new). If it is a new record, it creates an identifier (metadata_id).

The next step is to "build" the metadata structure, that is, to convert the DOM object (the <metadata/> element of the OAI-PMH response) into an associative array based on simple PHP data types (arrays, strings, numbers). For an XC schema record the <xc:entity/> part becomes an associative array, and this associative array becomes the metadata property of our previously created object. As we will see later, this metadata is stored in the xc_sql_metadata table. (PHP has a pair of functions, serialize() and unserialize(), which convert a complex data type into a simple string and back. We can store this string in a database field, and recreate the complex object from it when we want to display the catalog information.)
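A minimal sketch of such a build step, followed by the serialize()/unserialize() round trip. The function name is hypothetical; the '@' prefix for attributes and the '#value' key follow the stored examples in this document, but the real builder may handle more cases (nested elements, for instance).

```php
<?php
// Hypothetical builder: converts an <xc:entity> element into a plain array.
function sketch_build_metadata(DOMElement $entity) {
  $metadata = array();
  // Attributes of the entity itself become '@name' keys.
  foreach ($entity->attributes as $attribute) {
    $metadata['@' . $attribute->localName] = $attribute->nodeValue;
  }
  foreach ($entity->childNodes as $child) {
    if (!($child instanceof DOMElement)) {
      continue; // skip whitespace text nodes between the elements
    }
    $field = array();
    foreach ($child->attributes as $attribute) {
      $field['@' . $attribute->localName] = $attribute->nodeValue;
    }
    $field['#value'] = $child->textContent;
    // Repeatable fields are collected in a list under the element name.
    $metadata[$child->nodeName][] = $field;
  }
  return $metadata;
}

$doc = new DOMDocument();
$doc->loadXML(
  '<xc:entity xmlns:xc="http://www.extensiblecatalog.info/Elements"'
  . ' xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"'
  . ' xmlns:dcterms="http://purl.org/dc/terms/" type="work" id="oai:example:10000">'
  . '<dcterms:subject xsi:type="dcterms:LCC">M2148.2</dcterms:subject>'
  . '<xc:relation>Catholic Church. Missal.</xc:relation>'
  . '</xc:entity>'
);

$metadata = sketch_build_metadata($doc->documentElement);
$stored = serialize($metadata);    // a string that fits into a database field
$restored = unserialize($stored);  // the same array again
```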

The build process has another important function. If the schema records have a hierarchical structure, the links between them are stored in the xc_entity_relationships table. For XC schema records these relationships are the FRBR levels. For example, each manifestation record has one or more "uplinks" to its parent expression level records. These links are stored in the <xc:workExpressed />, <xc:expressionManifested />, and <xc:manifestationHeld /> XML elements. The values contain the OAI-PMH identifiers of records. Since an identifier usually follows the pattern [a string referring to the OAI data provider]/[numeric identifier], and looking up integer values in MySQL is quicker than looking up strings, we store only the numeric identifier. The xc_entity_relationships table contains these numeric identifiers as parent-child pairs.
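The extraction of the uplinks can be sketched like this. The function name and the 'parent'/'child' keys are hypothetical (the actual column names of xc_entity_relationships are not shown here), and the sketch assumes that the numeric identifier is the run of digits at the end of the link value.

```php
<?php
// Hypothetical helper: collects parent-child pairs from the link fields of
// one already built metadata array.
function sketch_extract_uplinks(array $metadata, $child_id) {
  $link_fields = array(
    'xc:workExpressed',
    'xc:expressionManifested',
    'xc:manifestationHeld',
  );
  $pairs = array();
  foreach ($link_fields as $field) {
    if (empty($metadata[$field])) {
      continue;
    }
    foreach ($metadata[$field] as $link) {
      // Keep only the numeric tail of the parent's OAI identifier.
      if (preg_match('/(\d+)\s*$/', $link['#value'], $match)) {
        $pairs[] = array('parent' => (int) $match[1], 'child' => $child_id);
      }
    }
  }
  return $pairs;
}

// A made-up manifestation (10002) pointing to its expression (10001).
$manifestation = array(
  '@type' => 'manifestation',
  'xc:expressionManifested' => array(
    array('#value' => 'oai:example/10001'),
  ),
);
$pairs = sketch_extract_uplinks($manifestation, 10002);
```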

The last part of the process is to store our objects into two tables:

  1. The first, more general one is xc_entity_properties, which contains the metadata type and the node type, the OAI identifier and its generated/extracted numeric version, the metadata format, a reference to the source of the record (the schedule), the timestamps of creation and modification, and three flags showing whether the record is built, stored, or deleted (which denote the operations run on the record).
    mysql> select * FROM xc_entity_properties LIMIT 0,1\G
    *************************** 1. row ***************************
       metadata_id: 1
     metadata_type: work
           node_id: 0
         node_type: xc_work
        identifier: oai:mst.rochester.edu:MetadataServicesToolkit/
                    MARCToXCTransformation/10000
            format: xc
         source_id: 3
           created: 1300468126
           updated: 1300468126
             locks: a:0:{}
         locations: a:0:{}
        properties: N;
             built: 1
            stored: 1
           deleted: 0
    identifier_int: 10000
    1 row in set (0.00 sec)
    
  2. The second is xc_sql_metadata which stores the object's metadata part (see above) in PHP serialized format.
    metadata_id: 1
    location_id: 1
       metadata: a:6:{s:5:"@type";s:4:"work";s:3:"@id";s:74:"oai:mst.rochester
                 .edu:MetadataServicesToolkit/MARCToXCTransformation/10000";s:
                 15:"dcterms:subject";a:2:{i:0;a:2:{s:5:"@type";s:11:"dcterms:L
                 CC";s:6:"#value";s:7:"M2148.2";}i:1;a:2:{s:5:"@type";s:11:"dct
                 erms:LCC";s:6:"#value";s:11:"M2011.C7375";}}s:19:"rdvocab:tit
                 leOfWork";a:1:{i:0;a:1:{s:6:"#value";s:28:"Complete Proper of
                  the Mass,";}}s:10:"xc:subject";a:2:{i:0;a:2:{s:5:"@type";s:1
                 2:"dcterms:LCSH";s:6:"#value";s:16:"Gregorian chants";}i:1;a:
                 2:{s:5:"@type";s:12:"dcterms:LCSH";s:6:"#value";s:15:"Propers
                  (Music)";}}s:11:"xc:relation";a:1:{i:0;a:1:{s:6:"#value";s:2
                 4:"Catholic Church. Missal.";}}}
      timestamp: 1300468126
    
    When needed, the toolkit restores the following data structure from this serialized string:

    array(
      '@type' => 'work',
      '@id' => 'oai:mst.rochester.edu:MetadataServicesToolkit/
                MARCToXCTransformation/10000',
      'dcterms:subject' => array(
         0 => array(
           '@type' => 'dcterms:LCC',
           '#value' => 'M2148.2'
         ),
         1 => array(
           '@type' => 'dcterms:LCC',
           '#value' => 'M2011.C7375'
         )
      ),
      'rdvocab:titleOfWork' => array(
         0 => array(
           '#value' => 'Complete Proper of the Mass,',
         )
      ),
      'xc:subject' => array(
         0 => array(
           '@type' => 'dcterms:LCSH',
           '#value' => 'Gregorian chants'
         ),
         1 => array(
           '@type' => 'dcterms:LCSH',
           '#value' => 'Propers (Music)'
         )
      ),
      'xc:relation' => array(
         0 => array(
           '#value' => 'Catholic Church. Missal.'
         )
      )
    )
    

As you might notice, we do not store the individual fields of the metadata in MySQL. That happens during Solr indexing.

If the record is a new one and the administrator chose the 'LOAD DATA INFILE' syntax, the records are first written to comma-separated values (CSV) files, and they are inserted into the tables only after the last OAI request (in step #2 of the whole process). This syntax is much quicker than the INSERT command, so it is highly recommended. During development we ran into the problem that on 32-bit operating systems PHP cannot create files larger than 2 GB, so we limited the file size to 1 GB. The files are created in the oaiharvester_sql_cache/[schedule identifier] directory under Drupal's files directory (the default location is sites/default/files). The filename convention is [table].[number].csv, e.g. xc_entity_properties.0.csv, xc_entity_properties.1.csv etc.
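The rotation of the CSV files can be sketched as follows. The class is hypothetical and is not the toolkit's implementation: it uses a tiny size limit and a temporary directory for demonstration, and its quoting style is an assumption that would have to match the options of the LOAD DATA INFILE statement that reads the files.

```php
<?php
// Hypothetical writer that starts a new [table].[number].csv file whenever
// the current one would grow beyond the size limit.
class SketchCsvCache {
  private $dir;
  private $table;
  private $limit;
  private $index = 0;
  private $bytes = 0;

  public function __construct($dir, $table, $limit) {
    $this->dir = $dir;
    $this->table = $table;
    $this->limit = $limit;
  }

  public function path($index) {
    return sprintf('%s/%s.%d.csv', $this->dir, $this->table, $index);
  }

  public function append(array $row) {
    $line = $this->line($row);
    if ($this->bytes > 0 && $this->bytes + strlen($line) > $this->limit) {
      $this->index++;   // continue in the next file
      $this->bytes = 0;
    }
    file_put_contents($this->path($this->index), $line, FILE_APPEND);
    $this->bytes += strlen($line);
  }

  private function line(array $row) {
    $cells = array();
    foreach ($row as $value) {
      $cells[] = '"' . str_replace('"', '""', (string) $value) . '"';
    }
    return implode(',', $cells) . "\n";
  }
}

// Demonstration with a 64 byte limit instead of 1 GB.
$dir = sys_get_temp_dir() . '/oaiharvester_sql_cache_' . uniqid();
mkdir($dir, 0777, TRUE);
$cache = new SketchCsvCache($dir, 'xc_entity_properties', 64);
foreach (array(1, 2, 3) as $metadata_id) {
  $cache->append(array($metadata_id, 'work', 'oai:example:10000'));
}
```

Each demonstration row takes 31 bytes, so the first two rows land in xc_entity_properties.0.csv and the third one opens xc_entity_properties.1.csv.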

If the record already exists (remember that the toolkit checked this previously), the toolkit updates the records (the entity, the metadata, and the relationships) and registers it in the xc_oaiharvester_bridge_changes table, which collects the records that have to be indexed by Solr (see the details in the table's documentation).

2. Importing CSV files, and normalizing the "changes" table

After all harvested records are processed, the toolkit immediately runs two tasks:

  1. If the administrator chose to use the 'LOAD DATA INFILE' syntax, the toolkit imports the comma-separated values files into MySQL.
  2. The toolkit normalizes the xc_oaiharvester_bridge_changes table so that it refers only to manifestation records, since the Solr index contains documents which reconstruct the MARC-like structure, merging upper level records (works, expressions) into manifestations.

In the Start using document we mentioned that the harvesting schedule has multiple phases, of which only the first one is fetching and storing records. Now we have the "raw" records, but we cannot search, display, print, or comment on them. We need to finish two more tasks: creating nodes and creating the Solr index.

The CSV files play no further role in the process, so they can be deleted. On the schedule page there is a tab called clear cache (admin/xc/harvester/schedule/1/clear_cache), where you can delete these files along with the cached XML files that contain the raw harvested records (if you chose to cache the OAI-PMH responses).

3. Node creation

As you might know, the node is the core concept of Drupal: it is a piece of content. Drupal provides a handful of functions to handle nodes, such as adding comments, printing them, defining their internal structure, and so on. The XC modules iterate over the records of the xc_entity_properties table where the node_id field is 0 (the default value, meaning that no node has been created for that record), and call Drupal's standard node creating function (node_save()) to create a node. node_save() assigns the new node's identifier (nid), and the toolkit updates the node_id field with this value.

The process creates a very basic node with a type and a title (which currently equals the OAI identifier of the record; later we will modify the code to use the real title from the metadata). Because of Drupal's node concept, many other modules can react to the node creation event.

In Drupal core only 3 tables change when a node is created: node, node_revisions, and node_comment_statistics. node_save() automatically creates the records in these tables.

Because we specified the node type, and the node type is registered as belonging to the Metadata module, Drupal calls the Metadata module when the node is displayed or deleted, and this module adds the XC Entity object's properties to the node. So the Drupal node itself is a very thin object, but during display it is filled with all of our metadata information. Note that for the XC schema we create nodes only for manifestation level records, since we do not display records of the other levels directly.
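The node creation loop can be sketched as follows. This is an illustration outside of Drupal: sketch_node_save() is a hypothetical stand-in for node_save() that only assigns an increasing nid, and the array stands in for the rows of xc_entity_properties.

```php
<?php
// Stand-in for Drupal's node_save(): it only assigns an increasing nid.
function sketch_node_save($node) {
  static $next_nid = 1;
  $node->nid = $next_nid++;
}

// Creates a thin node for every manifestation that does not have one yet.
function sketch_create_nodes(array &$entities) {
  $created = 0;
  foreach ($entities as &$entity) {
    if ($entity['node_id'] != 0 || $entity['metadata_type'] !== 'manifestation') {
      continue;
    }
    $node = new stdClass();
    $node->type = $entity['node_type'];
    $node->title = $entity['identifier']; // the title is the OAI identifier for now
    sketch_node_save($node);
    $entity['node_id'] = $node->nid;      // write the new nid back to the entity
    $created++;
  }
  unset($entity); // break the reference left over from the loop
  return $created;
}

$entities = array(
  array('metadata_type' => 'work', 'node_type' => 'xc_work',
    'identifier' => 'oai:example/10000', 'node_id' => 0),
  array('metadata_type' => 'manifestation', 'node_type' => 'xc_manifestation',
    'identifier' => 'oai:example/10002', 'node_id' => 0),
);
$created = sketch_create_nodes($entities);
```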

Let's see an example. A node record:

      nid: 1
      vid: 1
     type: xc_manifestation
 language: 
    title: oai:mst.rochester.edu:MetadataServicesToolkit
           /MARCToXCTransformation/10002
      uid: 0
   status: 1
  created: 1300468129
  changed: 1300468129
  comment: 0
  promote: 0
 moderate: 0
   sticky: 0
     tnid: 0
translate: 0
  • nid The primary identifier (xc_entity_properties.node_id)
  • vid The current version identifier (node_revisions.vid)
  • type The type (node_type.type and xc_entity_properties.node_type)
  • language The language (language.language)
  • title The title as plain text
  • uid The user identifier, who created it (users.uid)
  • status Boolean indicating whether the node is published
  • created Timestamp of creation
  • changed Timestamp of last modification
  • comment Whether comments are allowed. Possible values: 0 = no, 1 = read only, 2 = read/write
  • promote Boolean whether the node should be displayed on the front page
  • moderate Not currently used in core
  • sticky Boolean indicating whether the node should be displayed at top of lists
  • tnid The translated node's nid (if this is a translated node)
  • translate Boolean value indicating whether this translation page needs to be updated

A node_comment_statistics record:

                   nid: 1
last_comment_timestamp: 1300468129
     last_comment_name: NULL
      last_comment_uid: 0
         comment_count: 0
  • nid The primary identifier (node.nid and xc_entity_properties.node_id)
  • last_comment_timestamp Timestamp of the node's last comment (comments.timestamp)
  • last_comment_name The name of the author of the node's last comment (comments.name)
  • last_comment_uid The user identifier of the author of the node's last comment (comments.uid)
  • comment_count The total number of comments on this node

A node_revisions record:

      nid: 1
      vid: 1
      uid: 1
    title: oai:mst.rochester.edu:MetadataServicesToolkit/
           MARCToXCTransformation/10002
     body: 
   teaser: 
      log: 
timestamp: 1300468129
   format: 0
  • nid The primary identifier of the node (node.nid and xc_entity_properties.node_id)
  • vid The primary identifier of this version (node.vid)
  • uid The user identifier, who created it (users.uid)
  • title The title as plain text
  • body The body of this version
  • teaser The teaser of this version
  • log The log entry explaining the changes in this version
  • timestamp The timestamp of creation
  • format The input format used by this version's body

4. Solr indexing

The XC schema records follow the FRBR data model, and hence they form a hierarchy. Hierarchical data structures can be handled well in relational databases, but not in Solr, which stores records containing key-value pairs and has no operation like SQL's JOIN command. Because of this, if we would like to search field content at distinct hierarchical levels, we have to "flatten" our data structure; for an XC record we practically reconstruct the MARC record. The module iterates over all manifestations, finds each record's parents (the expression level records) and the parents of those records (the work level records), and merges the metadata parts of these records. In Solr, we store different types of information:

  1. mandatory fields. These are:
    • id (equals to the OAI identifier)
    • node_id
    • node_type
    • metadata_id
    • metadata_type
    • source_id
    • type (same as metadata_type; yes, this is redundant).

    All these fields are common to every record. They are the same as the most important fields of xc_entity_properties, and they store the values of the base level record (the manifestation), not those of its parents (work, expression). We do not store the identifiers of the parent records, although the administrator can make the link fields (xc:manifestationHeld, xc:workExpressed etc.) searchable.

  2. selected fields of the metadata. We will cover later that with the XC Index module we can select which of the schema's fields we would like to index. The general principle is that only those fields are worth indexing that we would like to search in. For example: would we like to search in the title (the dcterms:title field in the XC schema)? Let's select it for indexing! Would we rather not search by publisher? Leave it out of the selected fields. Each selected schema field may become the origin of more than one Solr field. For example, if we would like to search for the first name of the author, but we would also like to use the author name as a phrase, we can index it both as a "text" type field (which can search for distinct words inside the field) and as a "phrase" (which enables you to search the whole field as one phrase). These are Solr field types, which are registered in Solr's schema.xml file and in the Drupal Toolkit, and they refer to different indexing features. We have created some basic types (like sortable integer, sortable long number, boolean, date etc.). If you would like to add more (for example a field type which is able to analyse Japanese words), you have to modify Solr's schema.xml, and register the new type in the Drupal Toolkit (see the Solr field types form at Home > Administer > eXtensible Catalog (XC) > Solr Setup and Indexing > Solr field types (admin/xc/solr/field_type)).
  3. facets. We can create facets in several ways on the administrator interface, but what is important here is that facets are just normal Solr fields like the others. There are only two restrictions: 1) we have to store the field value, and 2) we have to use phrase indexing (this is true in the bulk of use cases, but not in all). We use the _fc suffix, which has these attributes defined in Solr's schema.xml, the file storing the index settings. The module creates the values of these facet fields according to the rules the administrator set.
  4. generated fields. The module creates a field called text, which contains all values of all fields of the merged metadata, concatenated as pure text without any XML tags. This field is unqualified, so inside it you cannot distinguish the value of the title from that of the author. It is the basis of the general search (when the user searches in everything). Since it contains all the information of the MARC-like record, this is the default search field: if the user does not specify a field, Solr uses this one. The other generated field is metadata, which is a serialized version of the MARC-like record. The module uses it when it displays search result lists or a full record. It is a stored but not searchable field, used only as the basis for display.
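The flattening and the two generated fields can be sketched like this. It is a simplified illustration, not the module's indexing code: the records and the parent map are made-up in-memory data, the function names are hypothetical, and the selected fields and facets are left out.

```php
<?php
// Made-up records keyed by numeric identifier: a work, an expression and a
// manifestation. The manifestation title is sample data.
$records = array(
  10000 => array(
    '@type' => 'work',
    'rdvocab:titleOfWork' => array(array('#value' => 'Complete Proper of the Mass,')),
    'xc:subject' => array(array('#value' => 'Gregorian chants')),
  ),
  10001 => array(
    '@type' => 'expression',
    'xc:workExpressed' => array(array('#value' => 'oai:example/10000')),
  ),
  10002 => array(
    '@type' => 'manifestation',
    'xc:expressionManifested' => array(array('#value' => 'oai:example/10001')),
    'dcterms:title' => array(array('#value' => 'Sample manifestation title')),
  ),
);
// child => parents, as it could be read from xc_entity_relationships
$parents = array(10002 => array(10001), 10001 => array(10000));

// Merges the fields of a record with those of all of its ancestors.
function sketch_flatten($id, array $records, array $parents) {
  $merged = array();
  $queue = array($id);
  while ($queue) {
    $current = array_shift($queue);
    foreach ($records[$current] as $field => $values) {
      if ($field[0] === '@') {
        continue; // record level attributes such as @type
      }
      foreach ($values as $value) {
        $merged[$field][] = $value['#value'];
      }
    }
    if (isset($parents[$current])) {
      $queue = array_merge($queue, $parents[$current]);
    }
  }
  return $merged;
}

// Concatenates every value of the merged record into one unqualified string.
function sketch_text_field(array $merged) {
  $parts = array();
  foreach ($merged as $values) {
    foreach ($values as $value) {
      $parts[] = $value;
    }
  }
  return implode(' ', $parts);
}

$merged = sketch_flatten(10002, $records, $parents);
$document = array(
  'id' => 'oai:example/10002',
  'metadata_type' => 'manifestation',
  'type' => 'manifestation',
  'text' => sketch_text_field($merged), // default search field
  'metadata' => serialize($merged),     // stored only, used for display
);
```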

Naming convention for Solr fields

The Solr field names are automatically derived from the schema field names. Since the colon character (":") is reserved in Solr, it would be very inconvenient to leave colons in field names, so the module automatically replaces each colon with two underscore characters. Since it would also be very inconvenient to register every schema field in Solr's schema.xml, we make use of dynamic fields. This means that we add a suffix to the end of the field name, and Solr knows the field type from this suffix. These are the two rules the modules apply when they translate back and forth between schema and Solr fields. Examples: dcterms:title indexed as text becomes dcterms__title_t in Solr, and the same field indexed as phrase becomes dcterms__title_s.
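The two rules can be sketched as a pair of functions. The function names are hypothetical, and the reverse direction assumes that every Solr field name ends in a suffix made of an underscore and lowercase letters (such as _t, _s or _fc).

```php
<?php
// Schema field name plus type suffix => Solr field name.
function sketch_schema_to_solr($schema_field, $suffix) {
  return str_replace(':', '__', $schema_field) . '_' . $suffix;
}

// Solr field name => schema field name.
function sketch_solr_to_schema($solr_field) {
  $base = preg_replace('/_[a-z]+$/', '', $solr_field); // drop the type suffix
  return str_replace('__', ':', $base);
}

$as_text = sketch_schema_to_solr('dcterms:title', 't');
$as_phrase = sketch_schema_to_solr('dcterms:title', 's');
$schema_field = sketch_solr_to_schema('dcterms__title_t');
```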
