Make XML parsers return UTF-8 data

Project: Drupal core
Component: base system
Category: task
Priority: critical
Assigned: Steven
Status: closed (fixed)

Issue Summary

Read http://drupal.org/node/view/2036:

The problem was that, by default, PHP's XML parser returns the parsed data in the encoding of the input document. For German this can be ISO-8859-1; see http://rss.orf.at/science.xml. The fix is to force the output encoding to UTF-8 using xml_parser_set_option().
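A minimal sketch of that fix, not the actual patch to import.module: the helper name example_parse_text_as_utf8() and the sample document are made up for illustration, and the closure syntax assumes a much newer PHP than the one Drupal targeted at the time. The parser calls themselves (xml_parser_create(), xml_parser_set_option() with XML_OPTION_TARGET_ENCODING, xml_set_character_data_handler(), xml_parse()) are standard PHP functions.

```php
<?php
// Hypothetical helper: parse $xml and return all character data as UTF-8,
// or false if the document cannot be parsed.
function example_parse_text_as_utf8($xml) {
  $text = '';
  $parser = xml_parser_create();
  // Ask the parser to hand data to the handlers in UTF-8,
  // whatever encoding the input document uses.
  xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');
  xml_set_character_data_handler($parser, function ($parser, $data) use (&$text) {
    $text .= $data;
  });
  $ok = xml_parse($parser, $xml, true);
  return $ok ? $text : false;
}

// Sample input in ISO-8859-1: 0xFC is u-umlaut, 0xDF is sharp s.
$latin1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><title>Gr\xFC\xDFe</title>";
$utf8 = example_parse_text_as_utf8($latin1);
```

With the target encoding set, $utf8 should hold the same word as UTF-8 bytes rather than the original Latin-1 bytes.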

I have only fixed the parser used in the import module, but we should test (and fix) the other XML parsers used within Drupal.

$ grep -r xml_parser_create\( *
includes/xmlrpc.inc:  $parser = xml_parser_create($xmlrpc_defencoding);
includes/xmlrpcs.inc:  $parser = xml_parser_create($xmlrpc_defencoding);
modules/import.module:    $xml_parser = xml_parser_create();
modules/jabber.module:    $xml_parser = xml_parser_create();

In particular, could someone test posting German (ISO-8859-1) and/or Japanese (UTF-8) characters using the Blogger API? I don't have such a tool myself.

Comments

#1

Assigned to: Anonymous » Kjartan

I'll look into this.

#2

Importing news (via import.module) in Russian encodings (KOI8-R, CP-1251, etc.) still doesn't work properly.

#3

Assigned to: Kjartan » Steven

This one has been sitting around for a while as critical, but the original bug is fixed:

Since 4.3.2, Drupal correctly extracts the encoding from the input XML and specifies the output as UTF-8; the XML parser handles the conversion. The only problem is that PHP's XML parser supports only US-ASCII (7-bit), ISO-8859-1 (Latin-1) and UTF-8 (Unicode). Any other encoding (such as the above-mentioned Cyrillic KOI8-R or CP-1251) will cause parsing to fail.
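To make the limitation concrete, here is a small sketch of a check against those three encodings. The function name example_xml_encoding_supported() is hypothetical and nothing like it exists in Drupal; the list simply restates the encodings named above.

```php
<?php
// Hypothetical helper: is $encoding one of the three encodings that
// PHP's XML parser is said to support?
function example_xml_encoding_supported($encoding) {
  $supported = array('US-ASCII', 'ISO-8859-1', 'UTF-8');
  // Encoding names are compared case-insensitively.
  return in_array(strtoupper(trim($encoding)), $supported, true);
}

$checks = array(
  'utf-8'      => example_xml_encoding_supported('utf-8'),
  'ISO-8859-1' => example_xml_encoding_supported('ISO-8859-1'),
  'koi8-r'     => example_xml_encoding_supported('koi8-r'),
);
```

A feed declaring koi8-r fails this check, which matches the report in #2.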

In the future with PHP5, a different XML library will be used (libxml2 instead of expat), which can support many more encodings when iconv (a generic conversion library) is compiled in. Until then, the only thing you can do is find (or write) a custom converter from the unsupported encoding into UTF-8.

The relevant part in import.module is around line 325:

<?php
  // extract the XML file's encoding (the XML parser in PHP4 doesn't do this by itself):
  $encoding = 'utf-8';
  if (ereg('...', $data, $match)) {
    $encoding = $match[1];
  }

  // parse the data:
  $xml_parser = xml_parser_create($encoding);
?>
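Because the pattern had to be cut from the snippet above, here is a rough, self-contained sketch of the same idea. The regular expression is a guess and not the one used in import.module, the function name example_detect_xml_encoding() is hypothetical, and preg_match() stands in for ereg(), which newer PHP versions no longer provide.

```php
<?php
// Hypothetical helper: read the encoding from the XML declaration,
// falling back to $default when none is declared.
function example_detect_xml_encoding($data, $default = 'utf-8') {
  // Assumed pattern (not the original): looks for encoding="..." or
  // encoding='...' inside a leading XML declaration.
  $pattern = '/^\s*<\?xml\s[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._\-]*)["\']/';
  if (preg_match($pattern, $data, $match)) {
    return $match[1];
  }
  return $default;
}

$declared = example_detect_xml_encoding('<?xml version="1.0" encoding="KOI8-R"?><rss/>');
$fallback = example_detect_xml_encoding('<rss/>');
```

Here $declared should be the declared name and $fallback the default.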

Suppose you have a koi2utf() function that converts from KOI8-R to UTF-8. You would then need to add something like this before the parser is created:

<?php
  // convert unsupported KOI8-R input to UTF-8 before parsing:
  if (strtolower($encoding) == "koi8-r") {
    $encoding = 'utf-8';
    $data = koi2utf($data);
  }
?>

Perhaps someone could write a generic interface between iconv and Drupal to handle unsupported encodings. In any case, this would be platform-specific and would require iconv to be installed. Something like this is definitely outside the scope of Drupal core, but would be welcome in contrib.
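A very rough sketch of what such glue could look like, assuming the iconv extension is loaded. The function name example_convert_to_utf8() is made up, and this is not existing Drupal or contrib code; iconv() and function_exists() are the only real calls.

```php
<?php
// Hypothetical glue function: convert $data from $encoding to UTF-8.
// Returns false when iconv is unavailable or the conversion fails.
function example_convert_to_utf8($data, $encoding) {
  if (!function_exists('iconv')) {
    return false;
  }
  // iconv() returns false on failure; @ suppresses the notice it raises.
  return @iconv($encoding, 'UTF-8', $data);
}

// The Russian word "da" in KOI8-R is the two bytes 0xC4 0xC1.
$converted = example_convert_to_utf8("\xC4\xC1", 'KOI8-R');
```

In the import.module snippet, a call like this would take the place of koi2utf(): if the result is not false, use it as $data and set $encoding to 'utf-8' before creating the parser.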

Marking the bug as fixed.

(Note: I had to edit the code because it triggered the anti-XSS filter.)

#4