Avoiding duplicates when importing BibTeX files [#408200]

We have a large .bib file containing all our references. After we update the .bib file (i.e. add entries), we upload it to our Drupal site. When we upload the file, the entries are just added, with lots of duplicates as a result... Wouldn't it be possible to add an option: "add all records" or "add new records", since the BibTeX keys are supposed to be unique? Or, and in our case the best solution, to drop all existing records with the same key before inserting the new records. This would allow us to update records in the .bib file (e.g., fix typos). We simply can't remove all old records before uploading, because in that case we also lose all the links we've made with "similar authors"...

Best regards

Stefan

Comments

Comment #1

rjerome commented 20 March 2009 at 13:53

This is a known problem, and I've done a little work towards a solution, but it's not complete yet. The issue with trying to synchronize is: if there are differences, which ones do you keep and which ones do you discard? Currently, an MD5 hash value is produced for each record using the title, date and authors. When a new record is saved, the hash values are compared and, if they are the same, an entry is made in the biblio_duplicates table indicating which records are potential duplicates. My intent is to create some kind of "diff" view so that you can bring up the two records side by side and decide which changes to keep (i.e. keep all right or left values, merge left to right or merge right to left), but this part isn't done yet.

I could craft the solution you propose, which depends on the BibTeX key, but there is still some danger there: any changes you make on the Drupal side will be lost, and the keys are really only as unique as you (or the software you're using) make them. Also, I guess you didn't really mean "drop" the existing record, but maybe "update" the existing record, since if we dropped it, the new record would have a new nodeID, which could cause problems with indexes like Google or internal links pointing to the old (now non-existent) nodeID.

Ron.
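
For readers following along, the hash comparison described in this comment can be sketched roughly as follows. This is only an illustration, not Biblio's actual code: the function names, the record array layout, and the whitespace/case normalization are all assumptions made for the example.

```php
<?php
// Hypothetical helpers for illustration; these are NOT Biblio functions.

/**
 * Builds an MD5 hash from a record's title, date and authors.
 * Normalizing case and whitespace is an assumption of this sketch.
 */
function example_record_hash(array $record) {
  $normalize = function ($text) {
    return strtolower(trim(preg_replace('/\s+/', ' ', $text)));
  };
  $parts = array_merge(
    array($normalize($record['title']), $normalize($record['date'])),
    array_map($normalize, $record['authors'])
  );
  return md5(implode('|', $parts));
}

/**
 * Returns pairs of (existing node id, incoming index) whose hashes match,
 * i.e. the candidates one would record as potential duplicates.
 */
function example_find_duplicates(array $existing, array $incoming) {
  $seen = array();
  foreach ($existing as $nid => $record) {
    $seen[example_record_hash($record)] = $nid;
  }
  $pairs = array();
  foreach ($incoming as $index => $record) {
    $hash = example_record_hash($record);
    if (isset($seen[$hash])) {
      $pairs[] = array('existing' => $seen[$hash], 'incoming' => $index);
    }
  }
  return $pairs;
}
```

Note that a match here only flags a potential duplicate; deciding which version of the record to keep is still the open question raised above.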

Comment #2

mattgilbert commented 23 June 2009 at 14:53

I would also love to see this happen, even if it simply ignores entries that perfectly match references already in the database. Not perfect, but would avoid most problems and keep the _duplicates table much smaller in the meantime until a more complete solution is found.

Comment #3

anasynth commented 8 December 2009 at 16:24

Subscribing,
This feature would be perfect for my current project.

Comment #4

anasynth commented 16 December 2009 at 17:10

The company I work for have a large list of bibliographic references to papers they have produced stored in a database, which they would like access to on the Drupal site I am working on for them using the biblio module. My plan is to use the import function of the biblio module and import a .bib file containing the BibTeX for all these references. Ideally I would like this all to happen without the user having to do anything.

I have a custom module for my site that accesses the database, retrieves the information, organizes it and creates a .bib file with the correct BibTeX. It does this on the cron run to keep the file up to date with any new papers. This is the file I will import. I'm waiting (patiently) for a solution to this duplicates issue before I start really making use of this. While I'm waiting I thought I'd work on getting the importing of this file to be managed programmatically by my custom module on the cron run. Could you possibly lend me a hand?

Is there a main function that manages the importing of files that I could call from my module, passing for example the file name and file type? Do you think this would be feasible? I've just tried importing a file and there is a status bar to monitor the progress. My users don't need to see that since they didn't invoke the update, so is there any way to have this run in the background? Is there anything else I've overlooked? For example, would permissions be an issue if the job was invoked whilst a user was logged in without the "import from file" permission?

Thanks for any help you could offer in advance. Looking forward to taking full advantage of this great module on my site!

Comment #5

rjerome commented 16 December 2009 at 19:45

Well, you're in luck, Paul!

I have completely reworked and modularized the import/export features of Biblio in the soon-to-be-released 6.x-2.x version. There is now a separate module for each file format, and as entries are imported from a file, an MD5 checksum is created for the complete entry and stored in a database table. This will allow you to completely ignore old entries in a file which have been previously imported. I have also created an import/export API which you will be able to use in your helper module to do the work.

Basically you will be able to do something like this...

  $file = "/path/to/some/file.bib";
  $node_ids = module_invoke('biblio_bibtex', 'biblio_import', $file);

Ron.
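
To make the checksum idea in this comment concrete, here is a minimal standalone sketch. It assumes the entries have already been split into raw strings and keeps the known checksums in a plain array rather than a database table; the function name is hypothetical and is not part of the Biblio API.

```php
<?php
// Illustrative only: example_filter_new_entries() is a hypothetical name.

/**
 * Returns the entries whose whole-entry MD5 checksum has not been seen
 * before, and records the new checksums in $known so that a later run
 * skips them.
 */
function example_filter_new_entries(array $entries, array &$known) {
  $new = array();
  foreach ($entries as $entry) {
    $checksum = md5($entry);
    if (!isset($known[$checksum])) {
      $known[$checksum] = TRUE;
      $new[] = $entry;
    }
  }
  return $new;
}

$known = array();
// First import: the entry is new.
$first = example_filter_new_entries(array('@article{doe2009, title={Teh title}}'), $known);
// Re-importing the same text yields nothing new.
$second = example_filter_new_entries(array('@article{doe2009, title={Teh title}}'), $known);
// Any change to the entry text produces a different checksum.
$third = example_filter_new_entries(array('@article{doe2009, title={The title}}'), $known);
```

Because the checksum covers the complete entry, an edited entry is treated as a new one rather than as an update to the old one.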

Comment #6

anasynth commented 11 January 2010 at 09:10

Ron, that sounds like exactly what I need, fantastic! I'm eagerly awaiting 6.x-2.x and I'll let you know how I get on with my efforts.

Thanks very much for all the great work.

- Paul

Comment #7

alexkb commented 7 January 2011 at 07:41

Hi rjerome,

What advice can you give if we want imports of records that already exist to update the existing records? Obviously, because the checksum is applied to the whole record, biblio thinks it's different and creates a new record, even if, say, only a typo in the title has been corrected. I would have thought there would be a primary key (i.e. the rec-number) that it would go off of?

Is this functionality in the works, or should we try to implement it ourselves? It's critical to us using this module. Thanks!

P.S. I'm already using the latest 2.x-dev version.

Comment #8

rjerome commented 7 January 2011 at 14:14

There is no question that this is a tough one. Some formats have well-defined, unique identifiers, but BibTeX does not. If your BibTeX records have "citekeys" already attached, this could make it easier, but basically there needs to be some unique, non-changing identifier in the record.

Maybe if I added a configuration option to the bibtex module which allows the admin to choose which field(s) to use as a key, this would help?

Ron.
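
A rough sketch of what key-based updating could look like, assuming each record carries a stable key such as a citekey. Everything here is hypothetical: the function name, the `citekey` field name, and the in-memory array standing in for the node table. It updates matching records in place so the node ID is preserved, as discussed in comment #1.

```php
<?php
// Hypothetical sketch; not Biblio code. $store maps node id => record.

/**
 * Updates records whose key matches an existing one (keeping the node id)
 * and appends the rest under fresh ids. Records without a key are always
 * appended, since there is nothing to match them on.
 */
function example_upsert_by_key(array $store, array $incoming, $key_field = 'citekey') {
  // Index existing records by their key.
  $index = array();
  foreach ($store as $nid => $record) {
    if (isset($record[$key_field])) {
      $index[$record[$key_field]] = $nid;
    }
  }
  $next_nid = $store ? max(array_keys($store)) + 1 : 1;
  foreach ($incoming as $record) {
    $key = isset($record[$key_field]) ? $record[$key_field] : NULL;
    if ($key !== NULL && isset($index[$key])) {
      // Same key: overwrite in place, node id unchanged.
      $store[$index[$key]] = $record;
    }
    else {
      $store[$next_nid] = $record;
      if ($key !== NULL) {
        $index[$key] = $next_nid;
      }
      $next_nid++;
    }
  }
  return $store;
}
```

Overwriting in place discards any edits made on the Drupal side, and the approach is only as reliable as the uniqueness of the chosen key.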

Comment #9

alexkb commented 10 January 2011 at 02:00

Hey Ron, thanks for the prompt reply. I'm doing these imports with EndNote 8 (XML), which is what has unique rec-numbers. Does that help?

Comment #10

rjerome commented 10 January 2011 at 20:37

Yes, that should work. I'm not sure why I didn't use that field in the first place, but I'll add it as a key.

Comment #11

alexkb commented 2 March 2011 at 08:01

Hey rjerome, have you committed that change? I'm trying to do this with the latest -dev release, but can't see any changes to the way it currently works.

Comment #12

rjerome commented 2 March 2011 at 14:48

No, but thanks for reminding me. The last few months have been crazy busy.

You should also be aware that "record numbers" in EndNote are "Library" specific, so entries from two different libraries on the same EndNote installation could have the same rec-number.

Ron.

Comment #13

liam morland commented 21 December 2018 at 19:09

Status:	Active	» Closed (outdated)

This version is no longer maintained. If this issue is still relevant to the Drupal 7 version, please re-open and provide details.
