Avoiding duplicates when importing BibTeX files

sbecuwe - March 20, 2009 - 10:23
Project: Bibliography Module
Version: 6.x-1.1
Component: Code
Category: feature request
Priority: normal
Assigned: Unassigned
Status: active
Description

We have a large .bib file containing all our references. After we update the .bib file (i.e., add entries), we upload it to our Drupal site. On upload, every entry is simply added again, which leaves us with lots of duplicates. Would it be possible to add an option such as "add all records" or "add new records", since the BibTeX keys are supposed to be unique? Or, and in our case the best solution, to drop all existing records with the same key before inserting the new records? This would allow us to update records in the .bib file (e.g., fix typos). We simply can't remove all old records before uploading, because then we would also lose all the links we've made with "similar authors".

Best regards

Stefan

#1

rjerome - March 20, 2009 - 13:53

This is a known problem, and I've done a little work towards a solution, but it's not complete yet. The issue with trying to synchronize is: if there are differences, which ones do you keep and which do you discard? Currently, an MD5 hash is produced for every record from its title, date and authors. When a new record is saved, the hash values are compared, and if they match, an entry is made in the biblio_duplicates table marking the records as potential duplicates. My intent is to create some kind of "diff" view so that you can bring up the two records side by side and decide which changes to keep (i.e. keep all right or left values, merge left to right, or merge right to left), but this part isn't done yet.
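To make the hash comparison concrete, here is a minimal sketch in Python (not the module's actual PHP code). The function names, the record layout, and the whitespace/case normalization are all assumptions made for illustration; only the idea of an MD5 hash over title, date and authors comes from the description above.

```python
import hashlib


def record_hash(title, year, authors):
    """MD5 over title, date and authors.

    Lower-casing and collapsing whitespace is an assumption of this sketch,
    not something the module is known to do.
    """
    parts = [title, str(year)] + list(authors)
    normalized = "|".join(" ".join(p.lower().split()) for p in parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def find_potential_duplicates(existing, incoming):
    """Return the IDs of existing records whose hash matches the incoming one.

    existing: dict mapping a (hypothetical) node ID to (title, year, authors)
    incoming: a (title, year, authors) tuple for the record being saved
    """
    target = record_hash(*incoming)
    return [nid for nid, rec in existing.items() if record_hash(*rec) == target]
```

In this sketch a match only flags a potential duplicate; deciding which version to keep is still left to a person, as with the planned "diff" view.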

I could craft the solution you propose, which depends on the BibTeX key, but there is still some danger there: any changes you make on the Drupal side will be lost, and the keys are really only as unique as you (or the software you're using) make them. Also, I guess you didn't really mean "drop" the existing record, but rather "update" it, since if we dropped it, the new record would get a new nodeID, which could cause problems with indexes like Google or internal links pointing to the old (now non-existent) nodeID.
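A key-based "update rather than drop" import could look roughly like the following Python sketch. Everything here is hypothetical (the `import_by_citekey` name, the in-memory `store`, the `nid` field); it only illustrates that an existing record keeps its node ID while its fields are overwritten, which is also exactly where Drupal-side edits would be lost.

```python
def import_by_citekey(store, entries):
    """Add new entries and update existing ones in place, keyed by BibTeX key.

    store:   dict mapping citekey -> {"nid": int, "fields": dict} (hypothetical)
    entries: iterable of (citekey, fields) pairs parsed from the .bib file
    Returns a summary of which keys were added, updated, or left unchanged.
    """
    summary = {"added": [], "updated": [], "unchanged": []}
    next_nid = max((rec["nid"] for rec in store.values()), default=0) + 1
    for citekey, fields in entries:
        rec = store.get(citekey)
        if rec is None:
            # Unknown key: create a new record with a fresh node ID.
            store[citekey] = {"nid": next_nid, "fields": dict(fields)}
            next_nid += 1
            summary["added"].append(citekey)
        elif rec["fields"] != fields:
            # Known key, different content: overwrite the fields, keep the nid.
            # Any edits made only on the site side are lost at this point.
            rec["fields"] = dict(fields)
            summary["updated"].append(citekey)
        else:
            # Exact match: nothing to do, so no duplicate is created.
            summary["unchanged"].append(citekey)
    return summary
```

Because the node ID never changes for an existing key, external indexes and internal links would keep pointing at a valid record; the trade-off is that the .bib file is treated as the single source of truth.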

Ron.

#2

mattgilbert - June 23, 2009 - 14:53

I would also love to see this happen, even if it simply ignores entries that perfectly match references already in the database. Not perfect, but it would avoid most problems and keep the _duplicates table much smaller in the meantime, until a more complete solution is found.

 
 
