Enforce RIS permitted character set during RIS import - Remove non-printing characters and other "gremlins" [#1780414]

The RIS format explicitly restricts the allowable character set for most fields to ANSI characters 32-255. The exceptions are the Reference ID field, which may contain only 0-9 or A-Z, and the author, keywords, and periodical name fields, which additionally may not include an asterisk. Beyond these explicit restrictions, I judge that CR, LF, and tab characters are also allowed. See, for example:
http://www.refman.com/support/risformat_fields_02.asp
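
To make the per-field rules above concrete, here is a minimal sketch of a validity check. The function name ris_field_is_valid is hypothetical (it is not part of biblio), and the list of tags treated as author, keyword, and periodical name fields is my assumption; check it against the RIS format docs before relying on it.

```php
<?php
// Hypothetical helper, not part of biblio: checks one RIS field value
// against the character restrictions described above.
function ris_field_is_valid($tag, $value) {
  // Assumption: these tags cover the author, keywords, and periodical
  // name fields. Verify the list against the RIS format documentation.
  $no_asterisk_tags = array('AU', 'A1', 'A2', 'A3', 'KW', 'JO', 'JF', 'JA');

  // Reference ID: only 0-9 or A-Z.
  if ($tag === 'ID') {
    return preg_match('/^[0-9A-Z]+$/D', $value) === 1;
  }
  // All other fields: CR, LF, tab, or character codes 32-255 only.
  if (preg_match('/[^\r\n\t\x20-\xFF]/', $value)) {
    return FALSE;
  }
  // Author, keywords, and periodical name fields: no asterisk.
  if (in_array($tag, $no_asterisk_tags) && strpos($value, '*') !== FALSE) {
    return FALSE;
  }
  return TRUE;
}

var_dump(ris_field_is_valid('AU', 'Smith*, J.'));
```

A check like this only reports violations. The change proposed in this issue removes the offending characters instead of rejecting the field.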

Unfortunately, Reference Manager 9-12 does not follow these restrictions when it creates an RIS-formatted export file. RM12 includes a number of special, hidden characters that result from any bold, italic, or underline formatting applied via the Reference Manager interface to characters in a field. Special fields that can contain hyperlinks almost invariably contain additional hidden characters that are not editable or viewable via the Reference Manager interface. My analysis of a sample RIS file shows that at least the following characters may be included in an RIS file exported from Reference Manager 12: NAK, STX, EOT, ETB, DLE, and DC1, along with various extraneous space characters inserted before and after these characters.
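
For anyone who wants to repeat this kind of analysis on their own export file, a rough sketch follows. The function name ris_find_control_bytes and the sample string are made up for illustration; only count_chars() and in_array() are standard PHP.

```php
<?php
// Hypothetical helper, not part of biblio: report which control bytes
// (other than tab, LF, and CR) occur in a string. The result is keyed by
// byte value, with the number of occurrences as the value.
function ris_find_control_bytes($text) {
  $found = array();
  // count_chars() in mode 1 returns only the byte values that occur.
  foreach (count_chars($text, 1) as $byte => $count) {
    if ($byte < 32 && !in_array($byte, array(9, 10, 13))) {
      $found[$byte] = $count;
    }
  }
  return $found;
}

// Made-up sample line with NAK (byte 21) wrapped around a word.
$sample = "TI  - A \x15bold\x15 title\r\n";
print_r(ris_find_control_bytes($sample));
```

To scan a whole export, pass the result of file_get_contents() on the RIS file to the function.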

After such a file is imported into biblio, these characters show up as unreadable characters, rendered as question marks or similar placeholders, so it would be best to remove them from RIS files during import. Here is a simple preg_replace that can be added to biblio_ris.module to perform this function.

Selected lines from _biblio_ris_import_string in biblio_ris.module for 6.x-2.0-rc2, lines 206-215:

    if ($line_len > 3) {
      $start = strpos($line, '  -'); // There could be some unprintables at the beginning of the line so fine the location of the %
      if ($start !== FALSE) {
        $tag = drupal_substr($line, $start -2, 2);
        $data = trim(drupal_substr($line, $start +3));
      }
      else {
        $data = $line;
      }
    }

Replacement code to remove all characters except for carriage return, line feed, tab, or ANSI/ASCII character codes 32-255:

    if ($line_len > 3) {
      $start = strpos($line, '  -'); // There could be some unprintables at the beginning of the line so find the location of the %
      if ($start !== FALSE) {
        $tag = drupal_substr($line, $start -2, 2);
        $data = trim(drupal_substr($line, $start +3));
        // Remove any character other than: carriage return, line feed, tab, or ANSI/ASCII character codes 32-255
        $data = preg_replace('/[^\r\n\t\x20-\xFF]/', '', $data);
      }
      else {
        // No tag found: clean the whole line, removing any character other than carriage return, line feed, tab, or ANSI/ASCII character codes 32-255
        $data = preg_replace('/[^\r\n\t\x20-\xFF]/', '', $line);
      }
    }
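
The same filter can be tried outside of biblio with a small standalone sketch. The wrapper name ris_strip_gremlins and the sample string are hypothetical; the pattern is the one used in the replacement code above.

```php
<?php
// Hypothetical wrapper, not part of biblio: remove every byte except
// carriage return, line feed, tab, and character codes 32-255.
function ris_strip_gremlins($text) {
  return preg_replace('/[^\r\n\t\x20-\xFF]/', '', $text);
}

// Made-up sample containing NAK (\x15), STX (\x02), and EOT (\x04).
$dirty = "TI  - A \x15bold\x15 title with \x02hidden\x04 bytes";
echo ris_strip_gremlins($dirty), "\n";
```

Because the pattern has no /u modifier it works byte by byte, so the bytes of multi-byte UTF-8 sequences (all in the 128-255 range) pass through unchanged. DEL (127) is also kept, since it falls inside the 32-255 range.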

This issue is actually a revisiting of the "Clean up gremlins" issue that was raised for 6.x-1.x here:
RIS Import - Possible Customizations - Clean up gremlins, convert PM to PubMed URL, insert JA (abbreviated Journal title). The solution above is almost identical to the one presented there, except that the allowed character set has now been expanded to include the full set of characters allowed by the RIS format docs.

Phil.

Comments

Comment #1

rjerome commented 21 September 2012 at 20:52

Status: Active » Fixed

I've pushed this... (6 & 7)

http://drupalcode.org/project/biblio.git/commit/fba5efa
http://drupalcode.org/project/biblio.git/commit/fc5879d

Comment #2

5 October 2012 at 21:01

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.
