Hi:
In the library department at the University of Catalonia, we have extended the code quite a bit.

The main concept is a little different from the original. In particular, the main features are:

- It is developed for the Drupal 6.x core. Even though this page says 5.x, that is incorrect; it is for 6.x.

- Ability to select more than one repository to import from, with different settings for each.

- Ability to assign a different content type to each repository.

- Ability to map incoming OAI fields to different taxonomy vocabularies, as well as filter out terms that should be excluded from the import.

- Ability to select the sets to import from each repository.

That's it; hopefully this is of interest and could even be regarded as a beta for this module.

Thank you

Comments

On Drupal 6.14, I get a "Headers already sent" error in /includes/common.inc.

Attachment status: new, size: 24.56 KB

Try this module. It is the one I have running on my production site (6.10).

Some features are disabled, such as selecting sets or fetching cover images.

Best regards

Xarbot

Whoa, for some reason this didn't show up in my issues list. I'm really sorry this has been neglected for so long, and I'll try to get some time to look at this later this week.

Ok, no problem :)

Xarbot

Attachment status: new, size: 35.65 KB

I'm betting this module is abandoned.

I took what @xarbot posted in #2 above and improved it somewhat... of course it's only beta (or maybe alpha?) quality, and the code is sprinkled with TODO comments. However, it seems to work =)

Here's a ZIP of that work in progress; maybe someone will find it useful.

Not quite abandoned, just been busy. Looked over the code, here are some comments on the code:

- I want to start by complimenting all of the additions, they're fantastic.
- Minor nitpick: I would rather not see markup in the code (lines 350 and 358), but it's not a big deal.
- Based on some other experiences importing possibly large repositories, I would really suggest using drupal_queue with a hook_cron_queue_info() implementation pointing to a worker function that does the actual node_save() calls. This avoids a memory leak in node_save() on larger imports, and is a cleaner approach than the Batch API (which depends on a browser and JavaScript).
- The big one: this version implements a deletion policy. This is a large enough change in behavior that it needs explicit warnings. Here is why it is such a tricky issue.

Most people haven't heard of OAI-PMH. Those who have are librarians, who are very concerned about deletion policies for repositories, and for portals which mirror those repos. If you're just using Drupal as a read-only front end to a single repository, or a collection of them, this deletion policy makes perfect sense; but if you start harvesting multiple repositories and re-serving them via OAI-PMH, things get more complex. The OAI-PMH spec indicates that there are several types of repositories. Your policy on deletions fits the common case: the repository has tombstone records and the Drupal site is read only, not re-serving the OAI content. However, some repositories do not have tombstone records (we check for this in oai-ident, so this behavior should be known), and in that case deleted records will not show up at all, leading to an out-of-sync repo. Lastly, if someone is re-serving OAI content via one of the few Drupal modules which will do this, this deletion policy doesn't allow for tombstones either, which can also be viewed as a problem. There are a few other issues that can come up as well (such as people deleting records inside Drupal), which is why I initially ignored deleted records altogether.
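The tombstone check mentioned above maps directly to the OAI-PMH Identify response, whose `<deletedRecord>` element is defined by the spec as "no", "transient", or "persistent". Here is a minimal sketch of that check in PHP; the function name is hypothetical and error handling is kept to a minimum.

```php
<?php
// Sketch: read a repository's deletion support from its Identify response.
// The OAI-PMH spec defines <deletedRecord> as "no", "transient", or "persistent".
function oai_import_deletion_support($base_url) {
  $xml = simplexml_load_file($base_url . '?verb=Identify');
  if ($xml === FALSE) {
    return FALSE;
  }
  // Register the OAI-PMH namespace before running the XPath query.
  $xml->registerXPathNamespace('oai', 'http://www.openarchives.org/OAI/2.0/');
  $nodes = $xml->xpath('//oai:Identify/oai:deletedRecord');
  // "no" means the repository keeps no tombstones at all, so deletions
  // can never be detected through harvesting alone.
  return $nodes ? (string) $nodes[0] : FALSE;
}
```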

Now, your solution here is a simple one and I have no problem with it as a starting point; however, you must document this behavior so there is no ambiguity about what happens in each of these cases.
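For reference, the drupal_queue approach suggested above might look something like this sketch, assuming the D6 backport exposes the D7-style hook_cron_queue_info(). Function, queue, and content type names here are hypothetical.

```php
<?php
// Sketch: declare a cron queue so workers, not the harvest request,
// perform the node_save() calls. Names are hypothetical.
function oai_import_cron_queue_info() {
  $queues['oai_import_records'] = array(
    'worker callback' => 'oai_import_save_record',
    'time' => 60, // Seconds cron may spend on this queue per run.
  );
  return $queues;
}

// During harvesting, enqueue each record instead of saving it directly:
//   $queue = DrupalQueue::get('oai_import_records');
//   $queue->createItem($record);

// Worker: turn one harvested record into a node.
function oai_import_save_record($record) {
  $node = new stdClass();
  $node->type = 'oai_record'; // Hypothetical content type.
  $node->title = $record['title'];
  node_save($node);
}
```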

Hope this helps a bit.

Joe

Thanks for your comments! And most of the credit should go to @xarbot for kickstarting it =)

Some questions:
> "I would really suggest using drupal_queue"

I'm assuming you mean the D6 backport of the D7 queue at http://drupal.org/project/drupal_queue . I'll look into it. For now, is it possible to just let the admin set a hard limit on the number of nodes created per run? (e.g., the way search.module, apachesolr.module, etc. let one throttle the number of nodes indexed per cron run)
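The throttle described here could be sketched roughly as below, modeled on search.module's "items to index per cron" pattern. The variable name and the helper functions are assumptions, not existing code.

```php
<?php
// Sketch: cap the number of nodes created per cron run.
// Variable and helper names are hypothetical.
function oai_import_cron() {
  $limit = variable_get('oai_import_nodes_per_cron', 50);
  $created = 0;
  foreach (oai_import_pending_records() as $record) { // Hypothetical helper.
    if ($created >= $limit) {
      break; // Remaining records are picked up on the next cron run.
    }
    oai_import_save_record($record); // Hypothetical: wraps node_save().
    $created++;
  }
}
```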

> "this version implements a deletion policy" ... "there need to be warnings about this kind of change in behavior"

Could this also be handled in the UI? Add a checkbox that lets the admin choose what to do when deleted records are found in the repository. (Maybe the options could be Ignore, Unpublish node, and Delete node.)
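Those three options could be sketched as a simple dispatch like the following; the setting name and its values are assumptions for illustration.

```php
<?php
// Sketch: admin-selectable deletion policy for harvested records.
// The variable name and option values are hypothetical.
function oai_import_handle_deleted($nid) {
  switch (variable_get('oai_import_deletion_policy', 'ignore')) {
    case 'ignore':
      // Leave the node untouched; the local copy may drift out of sync.
      break;

    case 'unpublish':
      // Keep the node as a local "tombstone" but hide it from visitors.
      $node = node_load($nid);
      $node->status = 0;
      node_save($node);
      break;

    case 'delete':
      // Mirror the repository exactly; no tombstone survives in Drupal.
      node_delete($nid);
      break;
  }
}
```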

If you agree, and once these changes are done, what do you think if a 6.x branch is created with that code, so that we can get to work using a single point of reference and deal with actual patches? =)

Thanks!

> For now, is it possible to just let the admin set a hard limit on the number of nodes created per run?

This is fine.

> Could this also be handled in the UI?

Also fine. I mainly wanted to highlight that this isn't a matter of coding up an existing solution; instead we have to figure out what that solution is, and then implement it.

As for creating a branch and a release, you're welcome to do that before these things are done. This wasn't supposed to be stuff you have to finish before you can do what you want, just some notes/stuff to keep in mind :)

Also, I'll try to check in here regularly, but if you need something urgently, or want me to look at something specifically my email is a much better option. Don't hesitate to drop me a line!

Joe

Haven't posted in a bit... thought I'd drop a line.

I am investigating whether the proper place for OAI-PMH harvesting is a separate module, or a plug-in for the Feeds module. From my testing it seems possible to have this functionality in Feeds, which would benefit greatly from all the other Feeds add-ons, allowing us to map to CCK fields (core fields and things like Embedded Media Field) in addition to taxonomy.

So I'll leave my humble development as a separate module for a bit, and perhaps permanently if I decide Feeds is the way to go. If so, it'd be great to have your (the current author's) blessing to reuse this code.

Yeah, go for it. Just keep the above points in mind for whatever docs you include with a Feeds plugin.

Thanks. Probably will have some news soon =)

subscribing

Category: bug » task
Attachment status: new, size: 3.62 KB

Well, I've been trying the #5 version against the DOAJ OAI endpoint ( http://www.doaj.org/ ).
I made some changes to keep all the categories, changed references from "creator" to "publisher", and added some validation to ensure the start date is set.

PS: Can we change the version of this issue to 6.x?

Attachment status: new, size: 5.1 KB

Another patch against #5. This one includes my previous patch and fixes the date used to save the last record date: it was using the publishing date from the metadata instead of the OAI header timestamp, which I think is the one that should be used.

Attachment status: new, size: 8.75 KB

I keep working on this but may need some help to do it right: deletion, taxonomy, and some other stuff.
New patch; this one adds the option to import into Biblio nodes instead of OAI nodes.
It includes all the previous changes and goes against the version posted in comment #5.

Attachment status: new, size: 8.86 KB

Try this one; I added validation to check whether the Biblio module is enabled.

Attachment status: new, size: 9.65 KB

OK, last patch for the day. This one adds an option to import all the records from the earliest date in the repository up to the current date.
It may take a long time to import everything; DOAJ has 5,600 records, so to use this option set a long max_execution_time in php.ini.
Be advised that if you import them into Biblio and want to delete them later, you may have to delete them all by hand from the Biblio catalog; I'm still working on deletion and updates. I'm using one of the Biblio fields to save the repository name so it can be mapped later from OAI to Biblio. This patch doesn't generate only Biblio or OAI nodes: it always saves the OAI node, and if the Biblio save option is checked it also saves the Biblio nodes.
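The dual-save behavior described above could look roughly like this sketch. The content type machine names and the particular Biblio field used to stash the repository name are assumptions, not the actual patch code.

```php
<?php
// Sketch: every harvested record becomes an OAI node, and optionally a
// Biblio node as well. Type names and the Biblio field are hypothetical.
function oai_import_save_nodes($record, $save_biblio) {
  // Always save the OAI node.
  $node = new stdClass();
  $node->type = 'oai_record';
  $node->title = $record['title'];
  node_save($node);

  if ($save_biblio && module_exists('biblio')) {
    $biblio = new stdClass();
    $biblio->type = 'biblio';
    $biblio->title = $record['title'];
    // Stash the repository name in a Biblio field so records can be traced
    // back (and bulk-deleted) later; the exact field is an assumption.
    $biblio->biblio_custom1 = $record['repository'];
    node_save($biblio);
  }
}
```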

Just thought I'd post here... I've progressed on making a Feeds-based OAI fetcher and parser, and started a new project for it here: http://drupal.org/project/feeds_oai_pmh . Testers welcome!

subscribe