HTML Entities Not Defined

Paul Gregory - January 8, 2009 - 13:17
Project: XLIFF Tools
Version: 6.x-1.0-beta1
Component: Code
Category: support request
Priority: normal
Assigned: Unassigned
Status: active
Description

I am not an XML expert, so please bear with me here... We are testing XLIFF Tools for exporting a large site for translation. The page content is mixed HTML, and many pages use special characters such as ™, © and ® (entered as HTML entities, e.g. &trade;), amongst others. When exporting these pages to XLIFF documents, I got the following error, reported by the Devel module:

DOMDocument::loadXML() [function.DOMDocument-loadXML]: Entity 'trade' not defined in Entity, line: 1

This was followed by 'Cannot Modify Headers' errors, so no XLIFF document was produced for download.

I read on the PHP site that, to use special entities with DOMDocument::loadXML(), I would have to specify an external DTD that defines them. I modified xliff.module around line 128:

$html = new DOMDocument();
$html->resolveExternals = true;
$html->loadXML('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html><head><title>'. check_plain($node->title) .'</title></head><body>' . $node->body . '</body></html>');

This stops the error and produces the XLIFF document for download, but on inspection the document contains no trace of the trademark symbol or the other characters (not even a numeric character reference). It appears these characters are stripped when xml2xliff.xsl is applied. Is this intended behavior, and is there anything we can do about it?

It would be a painful process to add the symbols back manually across hundreds of pages, so we would appreciate any insight you can offer.

Regards,
Paul

#1

kompis - March 27, 2009 - 15:02

I ran into the same issue (but with version 5.x-1.0). As a quick fix, you can change $html->loadXML(' to $html->loadHTML(' in xliff.module.
loadHTML() is forgiving about malformed tags, but you run the risk of HTML entities not being encoded correctly. To complete this 'hack', a routine needs to be added to the code that scans the HTML for HTML entities and transforms them into raw characters; for example, "&euro;" becomes a raw "€".
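The entity-scanning routine described above could be sketched roughly as follows. This is only an illustration, not code from XLIFF Tools: the function name xliff_example_decode_named_entities() is hypothetical, and it assumes the content is UTF-8. It converts named HTML entities to raw characters but leaves the five entities that XML itself predefines (amp, lt, gt, quot, apos) escaped, so the markup stays well-formed. Numeric references such as &#8364; are not touched, since they are already valid XML.

```php
<?php
// Hypothetical helper (not part of xliff.module): turn named HTML entities
// into raw UTF-8 characters, keeping the XML-predefined ones escaped.
function xliff_example_decode_named_entities(string $html): string {
  // These five must stay escaped, otherwise the markup itself would break.
  $keep = ['amp', 'lt', 'gt', 'quot', 'apos'];

  $result = preg_replace_callback(
    '/&([A-Za-z][A-Za-z0-9]*);/',
    function (array $m) use ($keep): string {
      if (in_array($m[1], $keep, true)) {
        return $m[0];
      }
      // html_entity_decode() returns unknown entity names unchanged.
      return html_entity_decode($m[0], ENT_QUOTES | ENT_HTML401, 'UTF-8');
    },
    $html
  );

  // preg_replace_callback() returns null on a regex error; fall back to the input.
  return $result ?? $html;
}

$out = xliff_example_decode_named_entities('<p>Acme&trade; &amp; Co: &euro;5 &lt; &pound;5</p>');
echo $out, "\n";
```

The decoded string could then be passed to loadXML() or loadHTML() in place of the raw node body.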

The better solution, of course, as you suggested, is to ensure that the data is valid, strict XML and to declare the entities in the head of the generated XML document. I will submit a proper patch for this if I get the chance.
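Until a proper patch exists, here is a minimal sketch of what declaring the entities in the head of the document might look like. It is an assumption-laden illustration, not the module's code: the function name xliff_example_load_with_entities() and the entity list are hypothetical, and htmlspecialchars() stands in for Drupal's check_plain() so the snippet runs on its own. The entities are declared in an internal DTD subset, so no external DTD has to be fetched. The document is parsed with LIBXML_NOENT so that references are expanded into characters rather than left as entity reference nodes, which a later XSLT step may drop (a guess at why the symbols disappeared, not confirmed in this thread).

```php
<?php
// Hypothetical helper: wrap an XHTML fragment in a document whose internal
// DTD subset declares the named entities the content is expected to use.
function xliff_example_load_with_entities(string $title, string $body): DOMDocument {
  // Entity name => Unicode code point. Illustrative list; extend as needed.
  $entities = [
    'trade' => 8482,
    'copy'  => 169,
    'reg'   => 174,
    'euro'  => 8364,
    'nbsp'  => 160,
  ];

  $declarations = '';
  foreach ($entities as $name => $codepoint) {
    $declarations .= '<!ENTITY ' . $name . ' "&#' . $codepoint . ';">';
  }

  $xml = '<!DOCTYPE html [' . $declarations . ']>'
    . '<html><head><title>' . htmlspecialchars($title, ENT_QUOTES, 'UTF-8') . '</title></head>'
    . '<body>' . $body . '</body></html>';

  $doc = new DOMDocument();
  // LIBXML_NOENT replaces entity references with their declared text.
  $doc->loadXML($xml, LIBXML_NOENT);
  return $doc;
}

$doc = xliff_example_load_with_entities('Demo', '<p>Acme&trade; &copy; 2009</p>');
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```

Because LIBXML_NOENT also allows external entities to be expanded, this approach is best limited to content from trusted editors.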
