Encoding of imported files [#373868]

I import files that contain special characters (mainly accents) because they include French items. Most of these files are encoded in ANSI (ISO-8859-1), but UTF-8 is expected. So, if a title is "Utilisation de la méthode de balance harmonique", what remains after import is "Utilisation de la m": everything from the first accented character onward is dropped. If I convert my source file to UTF-8 before I submit it, there is no problem.
I wrote a few lines that read the uploaded file line by line, detecting the encoding of each line and converting it to UTF-8. Each line is written to a temporary file whose content replaces the uploaded file once all lines have been read.

- file modified: biblio.import.export.inc
- line: 173 (after the "}" which closes the "foreach (array_keys($form_state['values']) as $key) {" loop)

// ESY (E. Sarrouy, 2009/02/11)
// Takes care of encoding
$handle = fopen($import_file->filepath,"r");
if($handle) {
    $temp = tmpfile();
    while (!feof($handle)) {
        $string = fgets($handle);
        $string = drupal_convert_to_utf8($string,mb_detect_encoding($string, 'UTF-8, ISO-8859-1', true));
        fwrite($temp,$string);
    }
    fclose($handle);
    fseek($temp,0);
    $stream= stream_get_contents($temp);
    fclose($temp);
    $handle = fopen($import_file->filepath,"w");
    fwrite($handle,$stream);
    fclose($handle);
}

It seems to work, but it could be a problem for very large files, since the whole converted content is read back into memory before being written out.
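
For what it's worth, the memory concern can probably be avoided by streaming the temporary file back instead of loading it into a string. The following is only a sketch in plain PHP, not the module's code: rewrite_file_as_utf8() is a hypothetical helper name, the fallback source encoding is assumed to be ISO-8859-1, and mb_check_encoding()/mb_convert_encoding() stand in for the mb_detect_encoding()/drupal_convert_to_utf8() pair used above.

```php
<?php
// Hypothetical helper (not part of Biblio or Drupal): rewrites a file in place
// as UTF-8 without ever holding the whole file in memory.
function rewrite_file_as_utf8($path, $source_encoding = 'ISO-8859-1') {
    $in = fopen($path, 'rb');
    if (!$in) {
        return false;
    }
    $temp = tmpfile();
    if (!$temp) {
        fclose($in);
        return false;
    }
    // fgets() returns false at end of file, which ends the loop cleanly.
    while (($line = fgets($in)) !== false) {
        // Lines that are already valid UTF-8 are left alone; anything else is
        // assumed to be in $source_encoding and converted.
        if (!mb_check_encoding($line, 'UTF-8')) {
            $line = mb_convert_encoding($line, 'UTF-8', $source_encoding);
        }
        fwrite($temp, $line);
    }
    fclose($in);

    // Stream the converted content back over the original file.
    rewind($temp);
    $out = fopen($path, 'wb');
    if (!$out) {
        fclose($temp);
        return false;
    }
    stream_copy_to_stream($temp, $out);
    fclose($out);
    fclose($temp); // tmpfile() handles are deleted automatically when closed.
    return true;
}
```

With this shape, the call in the import code would be something like rewrite_file_as_utf8($import_file->filepath), and only one line at a time is held in memory.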

Comments

Comment #1

rjerome commented 13 February 2009 at 14:26

I wonder whether reading the file in chunks, as shown below, would cause problems. Also, as shown below, simply overwriting the original file with the temp file saves rereading the temp file. I also wonder whether the temp file should be opened with "wb".

I haven't tested this (could you?), but I propose these changes because, believe it or not, I have seen 10M import files.

$handle = fopen($import_file->filepath,"r");
if($handle) {
    $tmpfname = tempnam(file_directory_temp(), "BIB");
    $temp = fopen( $tmpfname, "w");
    while (!feof($handle)) {
        $string = fgets($handle, 4096);
        $string = drupal_convert_to_utf8($string, mb_detect_encoding($string, 'UTF-8, ISO-8859-1', true));
        fwrite($temp, $string);
    }
    fclose($handle);
    fclose($temp);
    file_copy($tmpfname,$import_file->filepath, FILE_EXISTS_REPLACE )
}
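
On the question of whether chunked reads could cause problems: one case that may be worth checking is a line longer than the fgets() limit, where a multi-byte UTF-8 character can be cut in two. The sketch below is only an illustration in plain PHP (it uses mb_check_encoding() and mb_convert_encoding() rather than the Drupal helper, and an in-memory stream instead of an uploaded file). It shows that the first chunk then stops being valid UTF-8 and would be treated as ISO-8859-1.

```php
<?php
// A single line whose 'é' (bytes C3 A9) straddles the 4095-byte read limit.
$line = str_repeat('a', 4094) . "\xC3\xA9\n";

$stream = fopen('php://memory', 'w+b');
fwrite($stream, $line);
rewind($stream);

// fgets($handle, 4096) returns at most 4095 bytes, so the character is split.
$first  = fgets($stream, 4096);
$second = fgets($stream, 4096);
fclose($stream);

// The first chunk ends with a dangling C3 byte and is no longer valid UTF-8.
$first_is_utf8 = mb_check_encoding($first, 'UTF-8');

// Falling back to ISO-8859-1 for each chunk converts the two halves separately.
$converted = mb_convert_encoding($first, 'UTF-8', 'ISO-8859-1')
           . mb_convert_encoding($second, 'UTF-8', 'ISO-8859-1');

var_dump($first_is_utf8);        // whether the first chunk still passes as UTF-8
var_dump($converted === $line);  // whether the round trip preserved the line
```

Lines shorter than the limit are not affected, so this only matters for files with very long lines; calling fgets() without a length avoids the split at the cost of reading whole lines.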

Comment #2

esarrouy commented 19 February 2009 at 17:10

It seems to work! Thank you.
Will you integrate this in the next release?

Comment #3

esarrouy commented 19 February 2009 at 17:13

Sorry, I forgot to mention, just in case: there is a missing semicolon at the end of the last line ;)
file_copy($tmpfname,$import_file->filepath, FILE_EXISTS_REPLACE ); // <-- right here

Comment #4

rjerome commented 19 February 2009 at 21:16

Yep. I guess the only other question I had was: what happens if the file is encoded with something other than ISO-8859-1?

Comment #5

esarrouy commented 20 February 2009 at 10:52

Maybe you could make the 'UTF-8, ISO-8859-1' list a little longer, but it won't cover ALL possible encodings anyway. The PHP function mb_detect_order() won't do the job any better than an explicit list.
One way to help people with encoding problems could be a mixed solution: tell them Bibliography can import UTF-8 and ISO-8859-1 files (and a few other encodings) and ask them to re-encode files whose encoding is not supported, OR add a text field to the admin settings page that lets them specify a list of encodings to use in place of the default one.
Hope this helps!
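
To make the second option a bit more concrete, here is a rough sketch of a conversion helper that takes the encoding list as a parameter, so the list could come from a setting instead of being hard-coded. This is an assumption-laden illustration in plain PHP: example_to_utf8() is a hypothetical name, and it uses mb_check_encoding() to take the first encoding in the list that fits, rather than mb_detect_encoding() as in the code above.

```php
<?php
// Hypothetical helper: converts $string to UTF-8 using the first encoding in
// $encodings for which the bytes are valid. Returns false when none fits, so
// the caller can ask the user to re-encode the file.
function example_to_utf8($string, array $encodings = array('UTF-8', 'ISO-8859-1')) {
    foreach ($encodings as $encoding) {
        if (mb_check_encoding($string, $encoding)) {
            // Already UTF-8: nothing to convert.
            if ($encoding === 'UTF-8') {
                return $string;
            }
            return mb_convert_encoding($string, 'UTF-8', $encoding);
        }
    }
    return false;
}
```

The order of the list matters: any byte sequence is valid ISO-8859-1, so single-byte encodings like it should come last, after the stricter ones such as UTF-8.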