Clarify metadata flow at import time

tituomin - November 9, 2009 - 14:23
Project: Millennium Integration
Version: 6.x-2.x-dev
Component: Code
Category: task
Priority: normal
Assigned: Unassigned
Status: needs work
Description

Here's a really short description of the module: it maps MARC metadata into Millennium nodes and taxonomies, and saves additional metadata along with the original MARC record in its own database table.

I think it is currently somewhat unclear where a specific piece of data belongs and where it should be handled. Maybe this has to do with the historical Biblio dependency. Let me explain. There are (at least) two places in the code where MARC metadata is mapped into other structures:

millennium_marc_to_nodeobject maps chosen MARC fields into a PHP associative array (hash map), which is then serialized and saved into the database.

millennium_add_taxonomy_to_node maps chosen MARC fields into Drupal taxonomies.

Additionally, the original MARC record is saved.

The problem is: what's the correct place to find a specific piece of metadata? For example, to find out the language, you could (1) parse it from the MARC record, (2) look it up in the metadata array, or (3) check the vocabulary.

I think some of this confusion leads to repetition in the code: for example, some of the same MARC parsing takes place in both of the functions mentioned above. Wouldn't it be better if only millennium_marc_to_nodeobject were responsible for parsing the MARC, and millennium_add_taxonomy_to_node only used the internal "biblio" hash map? I'm basically talking about this: http://en.wikipedia.org/wiki/Don't_repeat_yourself

So, we could clarify the data flow to make it linear: MARC -> PHP array -> taxonomies. We could also disallow using the MARC record anywhere other than millennium_marc_to_nodeobject.
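
To make the proposed flow concrete, here is a rough sketch. It is written in Python rather than the module's PHP, purely as an illustration. The record layout, the sample values, and the names marc_to_biblio and biblio_to_taxonomy are all hypothetical and are not the module's actual functions or data. The point is only that the first step is the single place that knows MARC tags, and the taxonomy step reads nothing but the array.

```python
# Hypothetical, simplified MARC record: field tag -> list of subfield dicts.
# Tags used: 041 $a language code, 100 $a author, 245 $a title, 650 $a subject.
marc_record = {
    "041": [{"a": "fin"}],
    "100": [{"a": "Example, Author"}],
    "245": [{"a": "Example title"}],
    "650": [{"a": "Egypt"}, {"a": "Historical fiction"}],
}


def marc_to_biblio(marc):
    """The only step that knows about MARC tags and subfield codes."""

    def first(tag, code):
        # Return the first occurrence of the subfield, or None if absent.
        for field in marc.get(tag, []):
            if code in field:
                return field[code]
        return None

    return {
        "language": first("041", "a"),
        "author": first("100", "a"),
        "title": first("245", "a"),
        "subjects": [f["a"] for f in marc.get("650", []) if "a" in f],
    }


def biblio_to_taxonomy(biblio):
    """Derives vocabulary terms from the biblio array alone; never sees MARC."""
    terms = {"Language": [], "Subject": list(biblio["subjects"])}
    if biblio["language"]:
        terms["Language"].append(biblio["language"])
    return terms


# Linear flow: MARC -> array -> taxonomies.
biblio = marc_to_biblio(marc_record)
taxonomy = biblio_to_taxonomy(biblio)
```

With this split, a change in how a MARC field is parsed touches only the first function, and the taxonomy step cannot drift out of sync with it.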

*Or* we could keep the MARC record as the authoritative metadata source, and parse it anew every time it is needed (at node load time, etc.).

Personally, I prefer to keep the biblio PHP array as the authoritative source, because it's easy to manipulate an array with clear, mnemonic names as keys instead of having to remember all the different MARC field numbers and subfields every time. To me, MARC is an implementation detail that should be hidden behind a user-friendly "API", in this case an easy-to-use array.
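
A tiny illustration of the difference, again in Python rather than PHP and with made-up sample data and key names (the module's real array keys may differ):

```python
# Hypothetical sample data for one record, in both representations.
marc_record = {"041": [{"a": "fin"}], "245": [{"a": "Example title"}]}
biblio = {"language": "fin", "title": "Example title"}

# Reading from MARC: the caller has to know that 041 $a holds the language code.
language_from_marc = marc_record["041"][0]["a"]

# Reading from the array: the key name says what the value is.
language_from_biblio = biblio["language"]
```

Both lookups give the same value, but only the first one requires the caller to remember a MARC tag and subfield code.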

But maybe moving to CCK will make things even more complicated. Anyway, I think the answer to this issue should at least be documented somewhere to reduce confusion.

#1

janusman - November 30, 2009 - 23:57

I agree, there should first be parsing into the biblio array, then that could be used by millennium_add_taxonomy_to_node() and other modules if they want it.

We are now storing the serialized biblio array too, so that looks like the authoritative source.

 
 
