PubMed to BibTeX via the XMLReader Library

dismalhiker - June 16, 2007 - 00:57
Project:Bibliography Module
Version:6.x-1.x-dev
Component:User interface
Category:feature request
Priority:normal
Assigned:Unassigned
Status:active
Issue tags:Science Collaboration
Description

Biblio is a great module. Many people in biomedical research use the NIH PubMed database, which exports reference lists in MEDLINE, PubMed XML, ASN.1 and several other formats. Currently PubMed does not export reference lists to RIS or any of the other formats Biblio can import.

It would be a tremendous advantage to have Biblio able to directly import a list of references exported from PubMed (www.pubmed.gov).

Thanks,
dismalhiker

#1

rjerome - June 18, 2007 - 21:11

I see a number of XML DTDs here: http://eutils.ncbi.nlm.nih.gov/entrez/query/DTD/index.html. Does one of these match your XML files? If not, could you supply an example XML file?

Ron.

#2

pht3k - October 2, 2007 - 14:23

subscribing
interesting addon

#3

libsys - October 2, 2007 - 15:20
Title:Possible to add MEDLINE format for importing files?» PubMed to Zotero to BibTex to Biblio

Interesting idea....FWIW, there is a way to get PubMed records into Biblio:

Install the Firefox extension "Zotero" (http://www.zotero.org/). Import records from your PubMed searches into Zotero and export them in the BibTeX or RIS format. This file can then be imported into Biblio.

You may very well already be aware of this kind of process, and while Zotero adds the ability to aggregate records from multiple databases, importing directly from the MEDLINE format would obviously be easier in your particular use case. I'm simply posting this for the sake of anyone who needs to import records from PubMed right now.

#4

libsys - October 2, 2007 - 16:06
Title:PubMed to Zotero to BibTex to Biblio» PubMed to BibTeX via the XMLReader Library

Here is a fairly crude script I put together for a prototype to move PubMed XML into BibTeX for import into Biblio.

<?php
// Crude prototype: convert a PubMed XML export to BibTeX for import into Biblio.

$filename = '/path/to/xml/file/pubmed.xml'; // Set the file name

$reader = new XMLReader();
if (!$reader->open($filename)) {
  print "can't open file";
  exit;
}

// Fields of the record currently being read.
$key = $title = $journal = $year = $volume = $number = $pages = '';
$note = $language = $isbn = $Abstract = '';
$lastname = $forename = '';
$author_list = $mesh = $keywordlist = '';
$pubdate = 0;

$name = '';    // Element currently open
$value = '';   // Text of that element
$article = ''; // Accumulated BibTeX output

while ($reader->read()) {

  if ($reader->nodeType == XMLReader::ELEMENT) {
    $name = $reader->name;
    // Flag the start of PubDate here so it also works when the XML is not indented.
    if ($name == 'PubDate') { $pubdate = 1; }
  }

  if (in_array($reader->nodeType, array(XMLReader::TEXT, XMLReader::CDATA, XMLReader::WHITESPACE, XMLReader::SIGNIFICANT_WHITESPACE)) && $name != '') {
    $value = $reader->value;
  }

  if ($reader->value != '') {
    // Keep the first PMID of a record; later ones can belong to linked records.
    if ($name == 'PMID' && $key == '') { $key = $value; }
    if ($name == 'ArticleTitle') { $title = $value; }
    if ($name == 'Title') { $journal = $value; }
    if ($name == 'Year' && $pubdate == 1) { $year = $value; $pubdate = 0; }
    if ($name == 'Volume') { $volume = $value; }
    if ($name == 'Issue') { $number = $value; }
    if ($name == 'MedlinePgn') { $pages = $value; }
    if ($name == 'Affiliation') { $note = 'Affiliation: ' . $value; }
    if ($name == 'Language') { $language = $value; }

    // Yes, I see that I'm mapping isbn to issn
    if ($name == 'ISSN') { $isbn = $value; }
    if ($name == 'AbstractText') { $Abstract = $value; }

    // Remember, we have multiple mesh terms and authors
    if ($name == 'DescriptorName') { $mesh .= $value . '; '; }
    if ($name == 'Keyword') { $keywordlist .= $value . '; '; }
    if ($name == 'LastName') { $lastname = $value; }
    if ($name == 'ForeName') { $forename = $value; }

    if ($lastname != '' && $forename != '') {
      // BibTeX separates author names with " and ".
      $author_list .= $lastname . ', ' . $forename . ' and ';
      $lastname = '';
      $forename = '';
    }
  }

  if ($reader->nodeType == XMLReader::END_ELEMENT) {
    $name = '';
    $value = '';
  }

  // When we reach the end of a record, we grab all the values and make a new BibTeX entry.
  if ($reader->nodeType == XMLReader::END_ELEMENT && $reader->name == 'PubmedArticle') {

    $authors = substr($author_list, 0, -5);  // Drop the trailing " and "
    $keywords = substr($mesh, 0, -2);        // Drop the trailing "; "

    /*
     * We create a long string of BibTeX records in memory.
     * A more memory-conscious approach would be to
     * progressively write these records to a text file.
     */
    $article .= '@article{' . $key . ',' . "\n";
    $article .= 'author = {' . $authors . '},' . "\n";
    $article .= 'title = {' . $title . '},' . "\n";
    $article .= 'year = {' . $year . '},' . "\n";
    $article .= 'journal = {' . $journal . '},' . "\n";
    $article .= 'volume = {' . $volume . '},' . "\n";
    if ($number != '') { $article .= 'number = {' . $number . '},' . "\n"; }
    $article .= 'pages = {' . $pages . '},' . "\n";
    if ($note != '') { $article .= 'note = {' . $note . ' Additional Keywords: ' . rtrim($keywordlist, '; ') . '},' . "\n"; }
    $article .= 'keywords = {' . $keywords . '},' . "\n";
    $article .= 'isbn = {' . $isbn . '},' . "\n";
    $article .= 'language = {' . $language . '},' . "\n";
    $article .= 'abstract = {' . $Abstract . '}' . "\n";
    $article .= "}\n\n";

    // Reset the per-record fields so nothing leaks into the next entry.
    $key = $title = $journal = $year = $volume = $number = $pages = '';
    $note = $language = $isbn = $Abstract = '';
    $lastname = $forename = '';
    $author_list = $mesh = $keywordlist = '';
    $pubdate = 0;
  }
}

$bibtex_filename = '/path/to/bibtex/file/pubmed.bib';
if (!$handle = fopen($bibtex_filename, 'a')) {
  printf("Cannot open file (%s)", $bibtex_filename);
  exit;
}

if (fwrite($handle, $article) === FALSE) {
  printf("Cannot write to (%s)", $bibtex_filename);
  exit;
}

fclose($handle);
?>

More on the PHP XMLReader extension:

"The XMLReader extension is available in PECL as of PHP 5.0.0 and is included and enabled as of PHP 5.1.0 by default. It can be enabled by adding the argument --enable-xmlreader (or --with-xmlreader before 5.1.0) to your configure line. The libxml extension is required." http://us.php.net/xmlreader

An IBM tutorial on the subject: http://www-128.ibm.com/developerworks/library/x-pullparsingphp.html
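
The comment in the script above mentions that writing records out progressively would be easier on memory than building one long string. A minimal sketch of that idea follows. The helper names `bibtex_format_entry()` and `bibtex_write_entry()` are hypothetical and not part of Biblio, and `php://memory` only stands in for the real .bib file:

```php
<?php
// Hypothetical helper (not part of Biblio): format one record as a BibTeX
// entry. $fields maps BibTeX field names to values; empty values are skipped.
function bibtex_format_entry($key, array $fields) {
  $lines = array();
  foreach ($fields as $field => $value) {
    if ($value !== '') {
      $lines[] = '  ' . $field . ' = {' . $value . '}';
    }
  }
  return '@article{' . $key . ",\n" . implode(",\n", $lines) . "\n}\n\n";
}

// Hypothetical helper: append one entry to an already-open handle instead of
// growing a single string in memory. Returns FALSE if the write fails.
function bibtex_write_entry($handle, $key, array $fields) {
  return fwrite($handle, bibtex_format_entry($key, $fields)) !== FALSE;
}

// Usage: in the script above this call would sit where a PubmedArticle ends.
$handle = fopen('php://memory', 'w+');
bibtex_write_entry($handle, '12345', array(
  'title' => 'Example title',
  'year' => '2007',
  'number' => '',
));
rewind($handle);
$written = stream_get_contents($handle);
fclose($handle);
print $written;
```

With this shape the output file would be opened once before the read loop and closed after it, so only one record is held in memory at a time.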

#5

rjerome - October 2, 2007 - 16:24

Hey Chad, thanks for that. I had been avoiding the XMLReader Library due to its dependence on PHP 5.x, but maybe what I could do is create a separate "helper" module which you would only activate if you are running PHP 5.
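
As a rough sketch of how such a helper could guard itself, a runtime check along these lines might be enough. The function name `pubmed_xmlreader_available()` is hypothetical, and the 5.1.0 threshold is taken from the PHP manual note quoted in #4:

```php
<?php
// Hypothetical helper: report whether an XMLReader-based import can be offered.
// XMLReader is bundled and enabled by default as of PHP 5.1.0, but it can
// still be compiled out, so the class is checked as well as the version.
function pubmed_xmlreader_available() {
  return version_compare(PHP_VERSION, '5.1.0', '>=') && class_exists('XMLReader');
}

if (pubmed_xmlreader_available()) {
  print "XMLReader is available; the PubMed import helper can be enabled.\n";
}
else {
  print "XMLReader is missing; the PubMed import helper should stay disabled.\n";
}
```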

Ron

#6

libsys - October 2, 2007 - 16:33

No problem, I'll see if I can squeeze in some time to put a helper module together. That's probably the better route anyway - Biblio is pretty packed with features right now.

Re: PHP5, yeah, but now that 4 is EOLed, it'll be a little easier to move in this direction down the road (Yay!!). There was some discussion at Drupalcon this year as to what PHP 5 will bring to Drupal: http://barcelona2007.drupalcon.org/node/350 and http://barcelona2007.drupalcon.org/node/532.

#7

pht3k - October 3, 2007 - 02:45

wow you are on fire!!!
thanks for the zotero extension. i wasn't aware of it. will check it out soon.
cya
pht3k

#8

pht3k - October 3, 2007 - 02:50

argh too bad zotero doesn't work with the latest version of flock :/
will have to move to firefox
i used to love the social tools included within flock
i suppose i can search a bit and build up something similar with ff
will have to wait for some free time...
or maybe some new zotero version will be ok with the latest flock version.
crossed fingers
cya

#9

presleyd - August 22, 2008 - 16:06
Version:5.x-1.4» 5.x-1.16

Bumping this to express interest in PubMed imports.

#10

agaric - April 17, 2009 - 19:39

The Biblio Reference module of the Science Collaboration Framework does this. It is for 6.x but the relevant code for pubmed import should be usable in Drupal 5, PHP 5.2 required though.

#11

rjerome - April 17, 2009 - 20:15

Hi Ben,

Mind if I lift some of that PubMed code and put it in the Biblio module?

Ron.

#12

agaric - April 18, 2009 - 14:27

Hi Ron,

Of course! We're still defining the mappings as new edge cases come up (though we have tried to capture the applicable parts of the PubMed DTD), so we should keep working on these together. Also, note that it requires PHP 5.2.

benjamin, Agaric Design Collective

#13

rjerome - April 18, 2009 - 18:01

I noticed you were using SimpleXML. While it is a bit simpler, I have been avoiding it for a couple of reasons.

1) some of my EndNote XML import files are quite large (10-15 MB) and SimpleXML does not handle large files very well, since it has to build the entire XML tree in memory before you can access any of it.

2) there are still a surprising number of PHP4 installations out there (probably because RHEL 4 is still supported and it ships with PHP4)

I'll probably rewrite the parser in an Expat style.

Cheers,

Ron.

#14

agaric - April 24, 2009 - 15:21
Version:5.x-1.16» 6.x-1.x-dev

Bumping issue to 6.x because I think that's where the new development for Biblio is taking place.

#15

robertDouglass - April 24, 2009 - 15:22
Version:6.x-1.x-dev» 5.x-1.16

rjerome: what type of XML handling are you proposing in lieu of SimpleXML? Just curious. Also, I'm going to bump this to D6 because it seems that all development is happening there. For D6, the PHP version is moot since D6 requires PHP 5.2.

Quoting #13: "2) there are still a surprising number of PHP4 installations out there (probably because RHEL 4 is still supported and it ships with PHP4)"

#16

robertDouglass - April 24, 2009 - 15:23
Version:5.x-1.16» 6.x-1.x-dev

#17

rjerome - April 24, 2009 - 15:39

I was under the impression that the "PHP 5.x" requirement wasn't until D7... According to the INSTALL.txt included with D6.10...

Drupal requires a web server, PHP 4 (4.3.5 or greater) or PHP 5
(http://www.php.net/) and either MySQL (http://www.mysql.com/) or PostgreSQL
(http://www.postgresql.org/). The Apache web server and MySQL database are...

#18

rjerome - April 24, 2009 - 15:47

@Robert: I forgot to respond to your query regarding XML parsers in my last post. As mentioned in #13, I have been using Expat-style parsers for all my XML work. They work well for large files since you can stream data through them in chunks (which is exactly what I am doing).
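
For readers unfamiliar with the approach, here is a minimal sketch of chunked Expat-style parsing using PHP's XML Parser functions. It is not the Biblio parser: the class `PubmedTitleCollector`, the function `pubmed_stream_titles()`, and the choice of collecting only `ArticleTitle` are assumptions made for illustration.

```php
<?php
// Illustrative handler object: collects the text of every <ArticleTitle>.
class PubmedTitleCollector {
  var $titles = array();
  var $inTitle = FALSE;
  var $buffer = '';

  function startElement($parser, $name, $attrs) {
    if ($name == 'ArticleTitle') {
      $this->inTitle = TRUE;
      $this->buffer = '';
    }
  }

  function endElement($parser, $name) {
    if ($name == 'ArticleTitle') {
      $this->titles[] = $this->buffer;
      $this->inTitle = FALSE;
    }
  }

  // May be called several times per element, so the text is accumulated.
  function characterData($parser, $data) {
    if ($this->inTitle) {
      $this->buffer .= $data;
    }
  }
}

// Hypothetical helper: feed an open stream to the parser in small chunks,
// so the whole document never has to sit in memory at once.
function pubmed_stream_titles($handle, $chunk_size = 4096) {
  $collector = new PubmedTitleCollector();
  $parser = xml_parser_create();
  // Keep element names as written instead of upper-casing them.
  xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
  xml_set_element_handler($parser, array($collector, 'startElement'), array($collector, 'endElement'));
  xml_set_character_data_handler($parser, array($collector, 'characterData'));

  while (!feof($handle)) {
    $chunk = fread($handle, $chunk_size);
    // The third argument tells the parser whether this is the final chunk.
    if (!xml_parse($parser, $chunk, feof($handle))) {
      return FALSE; // Not well-formed XML.
    }
  }
  return $collector->titles;
}
```

A caller would open the export with `fopen()` and pass the handle in; the chunk size only affects how much is read per call, not the result.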

@Ben: Could you provide me with the DTD you were using for PubMed? There seems to be a dizzying array of XML specs coming out of there.

Cheers,

Ron.

#19

agaric - April 26, 2009 - 06:28

As for the PubMed Document Type Definition, I think the official one is down right now - http://www.ncbi.nlm.nih.gov/entrez/query/static/PubMed.dtd

This mapping was useful: http://www.blackwellpublishing.com/xml/dtds/4-0/help/pubmedmap.htm

I will check with my colleague Stefan Freudenberg of Agaric to confirm these resources.

benjamin, Agaric Design Collective

#20

ACNBiet - May 11, 2009 - 21:58

Hi all,

http://www.ncbi.nlm.nih.gov/pubmed/12112222?report=xml&format=text will give you an xml file where 12112222 is the PubMedID. This xml file also has an

<ArticleId IdType="doi">doi....</ArticleId>

Maybe fetching this doi here and then feeding it to the already existing doi-import function will be the fastest way to develop a PubMed import feature.

Andreas

EDIT:
Just found out that the DOI is not always included: http://www.crossref.org/08downloads/pmc_briefing_june2008.pdf
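
A minimal sketch of the DOI lookup described in this post, assuming the record XML has already been fetched into a string. The function name `pubmed_extract_doi()` is hypothetical, the XPath assumes the usual `PubmedData/ArticleIdList` layout, and the DOI in the usage lines is invented. Since the DOI can be missing, the function returns NULL in that case so the caller can fall back to another import path:

```php
<?php
// Hypothetical helper: return the DOI from one PubMed XML record, or NULL
// when the input is empty, not well-formed, or carries no DOI.
function pubmed_extract_doi($xml) {
  if (trim($xml) === '') {
    return NULL;
  }
  $doc = new DOMDocument();
  if (!@$doc->loadXML($xml)) {
    return NULL;
  }
  $xpath = new DOMXPath($doc);
  // Only look at the record's own ID list, not IDs of cited references.
  $nodes = $xpath->query('//PubmedData/ArticleIdList/ArticleId[@IdType="doi"]');
  if ($nodes->length == 0) {
    return NULL;
  }
  return trim($nodes->item(0)->textContent);
}

// Usage with a made-up record; the DOI below is a placeholder, not real data.
$sample = '<PubmedArticle><PubmedData><ArticleIdList>'
  . '<ArticleId IdType="pubmed">123</ArticleId>'
  . '<ArticleId IdType="doi">10.1000/example</ArticleId>'
  . '</ArticleIdList></PubmedData></PubmedArticle>';
$doi = pubmed_extract_doi($sample);
print ($doi === NULL) ? "No DOI in this record\n" : "DOI: $doi\n";
```

Building a DOM tree is fine here because a single record is small; for large multi-record exports a streaming parser would be the safer choice.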

#21

presleyd - June 16, 2009 - 21:15

This looks promising:

http://drupal.org/project/entrez

#22

Hagenah - November 6, 2009 - 11:34

Please keep in mind: both entrez and DOI feeding depend on online access from the Drupal installation to PubMed or the DOI server. If your Drupal is behind a firewall, this complicates the issue, as the firewall may block access to outside servers.

This may be solved in the future by integrating proxy support into Drupal core, but for now import via a file would help some people. This could be any of the export formats PubMed offers for download to a file. I think XML is the most promising of all. Some discussion on firewalls and Drupal may be found here: http://drupal.org/node/18390.

I am using Zotero as a workaround that works very well, but it depends on Firefox, which is not standard software in some companies. Such companies may restrict open source software on their computers.

#23

rjerome - November 6, 2009 - 13:51

I actually have most of the pubmed import stuff written, I just haven't had a chance to integrate it yet. Look for it in the coming months...

#24

Hagenah - November 6, 2009 - 22:51

Very good, I will be waiting to see it.

I integrated the code from #4 (libsys) into biblio.export.import.inc and it worked! Great thing.

 
 
