I import files that contain special characters (mainly accents) because they include French entries. Most of these files are encoded in ANSI (ISO-8859-1), but UTF-8 is expected. So if a title is "Utilisation de la méthode de balance harmonique", what remains after import is "Utilisation de la m": everything from the first accented character onward is dropped. If I force conversion of my source file to UTF-8 before I submit it, there is no problem.
I wrote a few lines that read the uploaded file line by line, detecting the encoding of each line and converting it to UTF-8. Each line is written to a temporary file, whose content replaces the uploaded file once all lines have been read.
- file modified: biblio.import.export.inc
- line: 173 (after the "}" that closes the "foreach (array_keys($form_state['values']) as $key) {" loop)
// ESY (E. Sarrouy, 2009/02/11)
// Takes care of encoding
$handle = fopen($import_file->filepath,"r");
if($handle) {
  $temp = tmpfile();
  while (!feof($handle)) {
    $string = fgets($handle);
    $string = drupal_convert_to_utf8($string,mb_detect_encoding($string, 'UTF-8, ISO-8859-1', true));
    fwrite($temp,$string);
  }
  fclose($handle);
  fseek($temp,0);
  $stream= stream_get_contents($temp);
  fclose($temp);
  $handle = fopen($import_file->filepath,"w");
  fwrite($handle,$stream);
  fclose($handle);
}
It seems to work, but it could be a problem for very large files, since the whole converted content is read back into memory before being written out.
Comments
Comment #1
rjerome commented
I wonder if reading the file in chunks, as shown below, would cause problems? Also, as shown below, simply overwriting the original file with the temp file saves rereading the temp file. I also wonder whether the temp file should be opened with "wb"?
I haven't tested this (could you?), but I propose these changes because, believe it or not, I have seen 10M import files.
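The following is not the code proposed in this comment; it is a minimal sketch of the same idea (stream the upload into a temporary file opened "wb", then copy that file over the original) written in plain PHP so it can run outside Drupal. The function name biblio_sketch_file_to_utf8 is hypothetical, and mb_convert_encoding() and copy() stand in for Drupal's drupal_convert_to_utf8() and file_copy().

```php
<?php
// Hypothetical helper, not part of the Biblio module.
// Converts a file to UTF-8 in place, one line at a time.
function biblio_sketch_file_to_utf8($path, $encodings = array('UTF-8', 'ISO-8859-1')) {
  $in = fopen($path, 'rb');
  if (!$in) {
    return FALSE;
  }
  $tmpfname = tempnam(sys_get_temp_dir(), 'bib');
  $out = fopen($tmpfname, 'wb');
  if (!$out) {
    fclose($in);
    return FALSE;
  }
  // fgets() returns FALSE at end of file, which ends the loop.
  while (($line = fgets($in)) !== FALSE) {
    // Strict detection: the first listed encoding in which the line is valid.
    $from = mb_detect_encoding($line, $encodings, TRUE);
    if ($from !== FALSE && $from !== 'UTF-8') {
      $line = mb_convert_encoding($line, 'UTF-8', $from);
    }
    fwrite($out, $line);
  }
  fclose($in);
  fclose($out);
  // Overwrite the original with the converted copy; nothing is re-read into memory.
  $ok = copy($tmpfname, $path);
  unlink($tmpfname);
  return $ok;
}
```

Only one line is held in memory at a time, so file size should matter little. The sketch reads lines rather than fixed-size chunks because a chunk boundary can fall in the middle of a multi-byte UTF-8 sequence, which would make the detection misfire on that chunk; a newline byte never occurs inside such a sequence.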
Comment #2
esarrouy commented
It seems to work! Thank you.
Will you integrate this in the next release?
Comment #3
esarrouy commented
Sorry, I forgot to mention, just in case: there is a missing semicolon at the end of the last line ;)
file_copy($tmpfname,$import_file->filepath, FILE_EXISTS_REPLACE ); // <-- right here
Comment #4
rjerome commented
Yep. I guess the only other question I had was: what happens if the file is encoded with something other than ISO-8859-1?
Comment #5
esarrouy commented
Maybe you could make the 'UTF-8, ISO-8859-1' list a little bit longer, but it won't cover ALL possible encodings anyway. The PHP function mb_detect_order won't do the job better than a list.
A way to help people with encoding problems could be a mixed solution: either tell them that Bibliography can import UTF-8 and ISO-8859-1 files (and a few other encodings) and that they must re-encode their file if its encoding is not supported, OR add a text field to the admin settings page that lets them specify a list of encodings to use in place of the default one.
Hope this helps!
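As an illustration of the mixed solution described above, here is a minimal sketch in plain PHP. The function name biblio_sketch_to_utf8 and the idea of passing the list as a comma-separated string (as an admin text field might store it) are assumptions; no such setting exists in Biblio as far as this thread shows.

```php
<?php
// Hypothetical helper, not part of the Biblio module.
// $encoding_list is a comma-separated string, e.g. as typed into a settings field.
function biblio_sketch_to_utf8($text, $encoding_list = 'UTF-8, ISO-8859-1') {
  // Split the list, trim whitespace and drop empty entries.
  $candidates = array_values(array_filter(array_map('trim', explode(',', $encoding_list)), 'strlen'));
  if (empty($candidates)) {
    return FALSE;
  }
  // Strict detection: the first listed encoding in which $text is valid.
  $from = mb_detect_encoding($text, $candidates, TRUE);
  if ($from === FALSE) {
    // No listed encoding fits: the caller can ask the user to re-encode the file.
    return FALSE;
  }
  return ($from === 'UTF-8') ? $text : mb_convert_encoding($text, 'UTF-8', $from);
}
```

Note that every byte sequence is valid ISO-8859-1, so as long as that encoding is in the list the function never reports a failure; text in another single-byte encoding would be converted without error but with wrong characters. The order of the list therefore matters, and the "please re-encode" path is only reached when no such catch-all encoding is listed.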
Comment #6
liam morland
This version is no longer maintained. If this issue is still relevant to the Drupal 7 version, please re-open and provide details.