Avoiding duplicates when importing BibTeX files

Project: Bibliography Module
Version: 6.x-1.1
Component: Code
Category: feature request
Priority: normal
Assigned: Unassigned
Status: active

Issue Summary

We have a large .bib file containing all our references. After we update the .bib file (i.e., add entries), we upload it to our Drupal site. On upload, every entry is simply added again, which leaves us with lots of duplicates. Would it be possible to add an option, "add all records" or "add new records", since the BibTeX keys are supposed to be unique? Or, and in our case this would be the best solution, to drop all existing records with the same key before inserting the new records? This would allow us to update records in the .bib file (e.g., fix typos). We simply can't remove all old records before uploading, because then we would also lose all the links we've made with "similar authors".

Best regards

Stefan

Comments

#1

This is a known problem, and I've done a little work towards a solution, but it's not complete yet. The issue with trying to synchronize is that, if there are differences, which ones do you keep and which ones do you discard? Currently, an MD5 hash value is produced for each record using the title, date and authors. When a new record is saved, the hash values are compared and, if they are the same, an entry is made in the biblio_duplicates table indicating which records are potential duplicates. My intent is to create some kind of "diff" view so that you can bring up the two records side by side and decide which changes to keep (i.e., keep all right or left values, merge left to right or merge right to left), but this part isn't done yet.
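As a rough illustration of the hash-based check described above (this is not the module's actual code), the sketch below builds an MD5 hash from the title, date and authors of each record and reports pairs whose hashes match. The function names `example_record_hash` and `example_find_duplicates`, the array layout, and the normalization (trimming and lowercasing) are all assumptions made for the example.

```php
<?php
// Hypothetical helper: hash the title, date and authors of one record.
// Trimming and lowercasing are assumptions, not necessarily what Biblio does.
function example_record_hash(array $record) {
  $parts = array(
    strtolower(trim($record['title'])),
    trim($record['date']),
    strtolower(implode(';', array_map('trim', $record['authors']))),
  );
  return md5(implode('|', $parts));
}

// Hypothetical helper: return array(existing_id, new_id) pairs that share a hash.
function example_find_duplicates(array $existing, array $incoming) {
  // Map each existing hash to the ID of the record that produced it.
  $seen = array();
  foreach ($existing as $id => $record) {
    $seen[example_record_hash($record)] = $id;
  }
  $duplicates = array();
  foreach ($incoming as $id => $record) {
    $hash = example_record_hash($record);
    if (isset($seen[$hash])) {
      $duplicates[] = array($seen[$hash], $id);
    }
  }
  return $duplicates;
}
```

Matching hashes only flag potential duplicates; deciding which version to keep is still a separate step.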

I could craft the solution you propose, which depends on the BibTeX key, but there is still some danger there: any changes you make on the Drupal side will be lost, and the keys are really only as unique as you (or the software you're using) make them. Also, I guess you didn't really mean "drop" the existing record, but maybe "update" it, since if we dropped it, the new record would have a new nodeID, which could cause problems with indexes like Google or with internal links pointing to the old (now non-existent) nodeID.
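To illustrate the difference between "drop" and "update", here is a minimal sketch. It assumes records are held in a plain array indexed by node ID and that each record carries a `citekey` field; `example_upsert_by_citekey` is a hypothetical name, not a Biblio function. A record with a known key keeps its node ID and only its fields are overwritten, so the caveat about losing edits made on the Drupal side still applies.

```php
<?php
// Hypothetical sketch: $nodes maps node ID => record array with a 'citekey'.
// Returns the updated list; existing node IDs are preserved on update.
function example_upsert_by_citekey(array $nodes, array $imported) {
  // Index existing records by citekey for quick lookup.
  $by_key = array();
  foreach ($nodes as $nid => $record) {
    $by_key[$record['citekey']] = $nid;
  }
  $next_nid = $nodes ? max(array_keys($nodes)) + 1 : 1;
  foreach ($imported as $record) {
    $key = $record['citekey'];
    if (isset($by_key[$key])) {
      // Same key: overwrite the fields but keep the old node ID.
      $nodes[$by_key[$key]] = $record;
    }
    else {
      // Unknown key: add as a new record with a fresh ID.
      $nodes[$next_nid] = $record;
      $by_key[$key] = $next_nid;
      $next_nid++;
    }
  }
  return $nodes;
}
```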

Ron.

#2

I would also love to see this happen, even if it simply ignores entries that perfectly match references already in the database. Not perfect, but would avoid most problems and keep the _duplicates table much smaller in the meantime until a more complete solution is found.

#3

Subscribing,
This feature would be perfect for my current project.

#4

The company I work for has a large list of bibliographic references to papers it has produced, stored in a database, and would like access to them on the Drupal site I am building for them using the biblio module. My plan is to use the import function of the biblio module and import a .bib file containing the BibTeX for all these references. Ideally I would like this all to happen without the user having to do anything.

I have a custom module for my site that accesses the database, retrieves the information, organizes it and creates a .bib file with the correct BibTeX. It does this on the cron run to keep the file up to date with any new papers. This is the file I will import. I'm waiting (patiently) for a solution to this duplicates issue before I start really making use of this. While I'm waiting, I thought I'd work on getting the import of this file managed programmatically by my custom module on the cron run. Could you possibly lend me a hand?

Is there a main function that manages the importing of files that I could call from my module, passing for example the file name and file type? Do you think this would be feasible? I've just tried importing a file and there is a status bar to monitor the progress. My users don't need to see that, since they didn't invoke the update, so is there any way to have this run in the background? Is there anything else I've overlooked? For example, would permissions be an issue if the job was invoked whilst a user without the "import from file" permission was logged in?

Thanks for any help you could offer in advance. Looking forward to taking full advantage of this great module on my site!

#5

Well, you're in luck, Paul!

I have completely reworked and modularized the import/export features of Biblio in the soon-to-be-released 6.x-2.x version. There is now a separate module for each file format, and as entries are imported from a file, an MD5 checksum is created for the complete entry and stored in a database table. This will allow you to completely ignore old entries in a file which have been previously imported. I have also created an import/export API which you will be able to use in your helper module to do the work.

Basically you will be able to do something like this...

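<?php
// Path to the BibTeX file to import.
$file = "/path/to/some/file.bib";

// Ask the biblio_bibtex module to run its import hook on the file.
$node_ids = module_invoke('biblio_bibtex', 'biblio_import', $file);
?>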
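The checksum idea mentioned above can be pictured with the following sketch. It is only an approximation under stated assumptions: the raw text of each entry is hashed with `md5()`, and the checksums from earlier imports are passed in as an array rather than read from a database table. `example_filter_new_entries` is a hypothetical name and is not part of the Biblio API.

```php
<?php
// Hypothetical sketch: $entries is a list of raw entry strings; $seen holds
// md5 checksums from earlier imports. It is passed by reference so that the
// checksums of newly accepted entries are remembered for the next run.
function example_filter_new_entries(array $entries, array &$seen) {
  $new = array();
  foreach ($entries as $entry) {
    $checksum = md5($entry);
    if (in_array($checksum, $seen, TRUE)) {
      // An identical entry was imported before: ignore it.
      continue;
    }
    $seen[] = $checksum;
    $new[] = $entry;
  }
  return $new;
}
```

Because the checksum covers the whole entry, any change to it, even a corrected typo, yields a different checksum, so the changed entry is treated as new.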

Ron.

#6

Ron, that sounds like exactly what I need, fantastic! I'm eagerly awaiting 6.x-2.x and I'll let you know how I get on with my efforts.

Thanks very much for all the great work.

- Paul

#7

Hi rjerome,

What advice can you give if we want imports of records that already exist to update those records? Obviously, because the checksum is applied to the whole record, biblio thinks the record is different and creates a new one, even if, say, only a typo in the title has been corrected. I would have thought there would be a primary key (i.e., the rec-number) that it would go off of?

Is this functionality in the works, or should we try to implement it ourselves? It's critical to us using this module. Thanks!

P.S. I'm already currently using the latest 2.x-dev version.

#8

There is no question that this is a tough one. Some formats have well-defined, unique identifiers, but BibTeX does not. If your BibTeX records have "citekeys" already attached, this could make it easier, but basically there needs to be some unique, non-changing identifier in the record.

Maybe if I added a configuration option to the bibtex module which allows the admin to choose which field(s) to use as a key, this would help?
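A configurable key of this kind might look roughly like the sketch below. Nothing here exists in the module: `example_record_key`, the field names and the `|` separator are assumptions chosen for illustration. The idea is simply to join the values of the admin-selected fields into one lookup key, and to refuse to build a key when a selected field is missing.

```php
<?php
// Hypothetical sketch: build a lookup key from admin-selected fields.
// Returns NULL when no fields are selected or when any selected field is
// missing or empty, since such a record cannot be matched reliably.
function example_record_key(array $record, array $key_fields) {
  if (empty($key_fields)) {
    return NULL;
  }
  $values = array();
  foreach ($key_fields as $field) {
    if (!isset($record[$field]) || trim((string) $record[$field]) === '') {
      return NULL;
    }
    $values[] = trim((string) $record[$field]);
  }
  return implode('|', $values);
}
```

Records for which no key can be built would have to fall back to some other treatment, for example being imported as new.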

Ron.

#9

Hey Ron, thanks for the prompt reply. I'm doing these imports with endnote8 (xml), which has unique rec-numbers. Does that help?

#10

Yes, that should work. I'm not sure why I didn't use that field in the first place, but I'll add it as a key.

#11

Hey rjerome, have you committed that change? I'm trying to do this with the latest -dev release, but can't see any changes to the way it currently works.

#12

No, but thanks for reminding me. The last few months have been crazy busy.

You should also be aware that "record numbers" in EndNote are "Library" specific, so entries from two different libraries on the same EndNote installation could have the same rec-number.
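One way around that, sketched here purely as an assumption, is to key each record on the library together with the rec-number rather than on the rec-number alone. `example_endnote_key` is a hypothetical helper, and how the library identifier is obtained during import is left open.

```php
<?php
// Hypothetical sketch: combine a library identifier with the rec-number so
// that equal rec-numbers from different libraries yield different keys.
function example_endnote_key($library, $rec_number) {
  return $library . ':' . (int) $rec_number;
}
```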

Ron.
