The RIS format explicitly restricts the allowable character set for most fields to ANSI characters 32-255. The exceptions are the Reference ID field, which may contain only 0-9 or A-Z, and the author, keywords, and periodical name fields, which additionally may not include an asterisk. Beyond these explicit restrictions, I judge that CR, LF, and tab characters are also allowed. See, for example:
http://www.refman.com/support/risformat_fields_02.asp
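
To make these rules concrete, here is a minimal sketch of a check for a single field value. The function name and the $kind labels are hypothetical (they are not part of biblio or of the RIS docs), and allowing CR, LF, and tab reflects my own judgment as noted above.

```php
<?php
// Hypothetical helper, not part of biblio: checks one field value against the
// character rules summarized above. $kind is an illustrative label:
//   'id'          - Reference ID field (0-9 and A-Z only)
//   'no_asterisk' - author, keywords, and periodical name fields
//   'general'     - every other field
function ris_value_is_allowed($value, $kind = 'general') {
  if ($kind === 'id') {
    return preg_match('/^[0-9A-Z]*$/D', $value) === 1;
  }
  // Bytes 32-255 plus CR, LF, and tab (the last three are an assumption).
  if (preg_match('/[^\r\n\t\x20-\xFF]/', $value) === 1) {
    return FALSE;
  }
  // Asterisks are additionally forbidden in these fields.
  if ($kind === 'no_asterisk' && strpos($value, '*') !== FALSE) {
    return FALSE;
  }
  return TRUE;
}
```

For instance, ris_value_is_allowed("Title\x15") would be rejected because of the NAK byte, while a Latin-1 or UTF-8 accented title passes because every byte falls in the 32-255 range.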

Unfortunately, Reference Manager 9-12 does not follow these restrictions when it creates an RIS formatted export file. RM12 includes a number of special, hidden characters that result from any bold, italic, or underline formatting applied via the Reference Manager interface to characters in a field. In the case of special fields that can contain hyperlinks, there are almost invariably some additional hidden characters that cannot be edited or viewed via the Reference Manager interface. My analysis of a sample RIS formatted file shows that at least the following characters may be included in an RIS file exported from Reference Manager 12: NAK, STX, EOT, ETB, DLE, and DC1, along with various extraneous space characters inserted before and after these characters.
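
For anyone who wants to run the same kind of analysis on their own export, the following is a rough sketch of a diagnostic that counts each out-of-range byte in a chunk of RIS text. The function name is hypothetical and this is not the tool I used, just one way such a scan could be done.

```php
<?php
// Hypothetical diagnostic, not part of biblio: counts every byte that falls
// outside CR, LF, tab, and the 32-255 range, keyed by its hex code.
function ris_count_control_bytes($text) {
  $counts = array();
  if (preg_match_all('/[^\r\n\t\x20-\xFF]/', $text, $matches)) {
    foreach ($matches[0] as $byte) {
      $key = sprintf('0x%02X', ord($byte));
      $counts[$key] = isset($counts[$key]) ? $counts[$key] + 1 : 1;
    }
  }
  ksort($counts);
  return $counts;
}
```

In the output, the characters listed above would appear as 0x02 (STX), 0x04 (EOT), 0x10 (DLE), 0x11 (DC1), 0x15 (NAK), and 0x17 (ETB). The text could be read with file_get_contents() on the exported file.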

After such a file is imported into biblio, these characters show up as unreadable characters (displayed as question marks or similar placeholders), so it would be best to remove them from RIS files during import. Here is a simple preg_replace that can be added to biblio_ris.module to do this.

Selected lines from _biblio_ris_import_string in biblio_ris.module for 6.x-2.0-rc2, lines 206-215:

    if ($line_len > 3) {
      $start = strpos($line, '  -'); // There could be some unprintables at the beginning of the line so fine the location of the %
      if ($start !== FALSE) {
        $tag = drupal_substr($line, $start -2, 2);
        $data = trim(drupal_substr($line, $start +3));
      }
      else {
        $data = $line;
      }
    }

Replacement code to remove all characters except for carriage return, line feed, tab, or ANSI/ASCII character codes 32-255:

    if ($line_len > 3) {
      $start = strpos($line, '  -'); // There could be some unprintables at the beginning of the line so find the location of the '  -' separator
      if ($start !== FALSE) {
        $tag = drupal_substr($line, $start -2, 2);
        // Remove any character other than: carriage return, line feed, tab, or ANSI/ASCII character codes 32-255
        $data = preg_replace('/[^\r\n\t\x20-\xFF]/', '', drupal_substr($line, $start +3));
        // Trim after stripping, so that spaces left next to the removed characters are dropped too
        $data = trim($data);
      }
      else {
        // Remove any character other than: carriage return, line feed, tab, or ANSI/ASCII character codes 32-255
        // (clean $line directly; cleaning $data and then assigning $line would discard the result)
        $data = preg_replace('/[^\r\n\t\x20-\xFF]/', '', $line);
      }
    }
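
To try the pattern outside of Drupal, here is a small standalone sketch of the same idea. The function name is hypothetical, plain substr() stands in for drupal_substr(), and the sample line is made up for illustration rather than taken from a real export.

```php
<?php
// Standalone sketch, not the module code: splits one RIS line into tag and
// data, then strips disallowed bytes from the data.
function ris_split_and_clean($line) {
  $tag = NULL;
  $start = strpos($line, '  -');
  if ($start !== FALSE && $start >= 2) {
    $tag = substr($line, $start - 2, 2);
    $data = substr($line, $start + 3);
  }
  else {
    $data = $line;
  }
  // Keep only CR, LF, tab, and bytes 32-255, then trim leftover whitespace.
  $data = trim(preg_replace('/[^\r\n\t\x20-\xFF]/', '', $data));
  return array($tag, $data);
}

// Made-up sample line with NAK, STX, and EOT bytes mixed in.
list($tag, $data) = ris_split_and_clean("\x15T1  - \x02 Gremlins \x15in RIS\x04\r\n");
// $tag is 'T1' and $data is 'Gremlins in RIS'
```

Because the pattern has no /u modifier it works byte by byte, so multibyte UTF-8 sequences (whose bytes are all in the 128-255 range) pass through untouched.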

This issue is actually a revisiting of the "Clean up gremlins" issue that was raised for 6.x-1.x here:
RIS Import - Possible Customizations - Clean up gremlins, convert PM to PubMed URL, insert JA (abbreviated Journal title). The solution above is almost identical to the one presented there, except that the allowed character set has now been expanded to include the full set of characters allowed by the RIS format docs.

Phil.

Comments

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.