HTMLTidy and special chars

By andres@baravalle.it on 21 May 2008 at 00:44 UTC

Hi,
I am trying - once again - to integrate HTMLTidy into my Drupal 5 site.

I tried different approaches - and the last one seemed simple enough (even though it is not the best one in terms of performance).

In my theme folder for the web site, in node.tpl.php, I have used this:

	if (function_exists("tidy_repair_string")) {
		$xhtml = tidy_repair_string(trim($content), $GLOBALS["conf"]["tidy_config"]);

		if (!empty($xhtml)) {
			// $content = $xhtml . "<!-- (X)HTML sanitized -->";
			$content = $xhtml;
		}
	}

	print $content;

and in my settings.php file I have the following declaration:

$conf["tidy_config"] = array(
	"alt-text" => "",
	"break-before-br" => false,
	"drop-proprietary-attributes" => true,
	"indent-spaces" => "2",
	"hide-endtags" => true,
	"indent" => "auto",
	"output-xhtml" => true,
	"show-body-only" => true,
	"tidy-mark" => false,
	"wrap" => false,
	"numeric-entities" => false,
	"word-2000" => false,
	"quote-nbsp" => true, 
	"input-encoding" => "raw", 
	"output-encoding" => "utf8"
);

The problem is - it works only in a basic way, as there are a lot of problems both with entities and with non-ASCII chars.

If you have a look at this page, for example: http://baravalle.it/citazioni/Stefano%20Benni you can see what I mean. It's full of squares instead of "—" signs, and just commenting out the HTMLTidy call gives me back my "—" signs.

Any suggestions? I have been looking at it for a few hours without success.

Changing input-encoding to utf8 doesn't help - already tried that,

Andres

Comments

vm commented 21 May 2008 at 01:01

Are you using the HTMLTidy module? http://drupal.org/project/htmltidy

_____________________________________________________________________
My posts & comments are usually dripping with sarcasm.
If you ask nicely I'll give you a towel : )

nope

andres@baravalle.it commented 21 May 2008 at 01:20

I used the module some time ago, when I was still using a previous version of Drupal. I had updated and tweaked the module (when I tried it, it was using a binary call to the tidy executable) - but it wasn't working well enough (that's why I didn't submit anything back).

Today, after looking at some posts in the forums, I thought about giving it a second go, but in a simpler way, with the code I posted. And it doesn't work yet...

Andres

Looking at your html output,

cog.rusty commented 21 May 2008 at 01:48

Looking at your html output, I notice that the em dash has been replaced by a pair of characters, hex 14-20. (20 is a space of course).

The Unicode hex for em dash is 20-14 (U+2014). Not sure if this helps.

Could it be a little-endian/big-endian issue? It shouldn't be, since UTF-8 defines sequences of single bytes, but that's what it looks like.

Hex editor?

andres@baravalle.it commented 21 May 2008 at 02:13

Hi,
what tool/procedure did you use to see it? From Firefox, I cannot get the hex code (or I don't know how to do it).

Thanks for the help,

Andres

I downloaded the html output

cog.rusty commented 21 May 2008 at 02:50

I downloaded the html output as text and opened it with a text editor which could show it in hex (Ultraedit). But you can easily find an open source hex editor or viewer with google (for example http://www.tech-faq.com/hex-editor.shtml).

Then I compared what I saw with this: http://www.fileformat.info/info/unicode/char/2014/index.htm

----- Update

Looking at that page again, I noticed that the em dash is hex 2014 only in UTF-16, which in a little-endian representation becomes 1420, as expected. But in UTF-8 it is different. It is e2 80 94 (you can see it by putting that sequence in the text with a hex editor). So, the output is not really UTF-8.
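
The byte sequences mentioned above can be reproduced with a few lines of PHP. This is only a minimal sketch: the variable names are made up for illustration, and pack() is used for the UTF-16 forms so that no extension (mbstring or iconv) is assumed.

```php
<?php
// U+2014 (em dash) written out as raw bytes in three encodings.
$utf8    = "\xE2\x80\x94";     // UTF-8: three bytes
$utf16le = pack('v', 0x2014);  // UTF-16 little-endian: low byte first
$utf16be = pack('n', 0x2014);  // UTF-16 big-endian: high byte first

// Cross-check: decoding the numeric entity as UTF-8 gives the same three bytes.
$fromEntity = html_entity_decode('&#x2014;', ENT_QUOTES, 'UTF-8');

echo bin2hex($utf8), "\n";     // e28094
echo bin2hex($utf16le), "\n";  // 1420
echo bin2hex($utf16be), "\n";  // 2014
var_dump($fromEntity === $utf8);
```

Seeing 14 20 in a hex dump of a page that declares UTF-8 therefore suggests that the character went through a 16-bit representation somewhere along the way.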

Interesting

andres@baravalle.it commented 21 May 2008 at 02:50

I followed the procedure you suggested - with some interesting results.

Another page in the site, for example (http://baravalle.it/ecommerceland) is full of �s. It's not a 1 to 1 match - different accented letters are replaced with �s.

�, in the hex editor, appears as ef bf bd, which I understand should be Unicode U+FFFD, the replacement character.

Which, I think, is saying that something went wrong...

But still not sure what, or why, and how to correct it.

Andres
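
The same check can be done without a hex editor. The sketch below uses two hypothetical helper functions (the names are made up here and are not part of Drupal or tidy): one tests whether a string is well-formed UTF-8, the other counts the ef bf bd sequences in it.

```php
<?php
// Hypothetical helpers for illustration only.

// True if $s is well-formed UTF-8; PCRE's /u modifier rejects invalid input.
function is_valid_utf8($s) {
    return preg_match('//u', $s) === 1;
}

// Number of U+FFFD replacement characters (bytes ef bf bd) in $s.
function count_replacement_chars($s) {
    return substr_count($s, "\xEF\xBF\xBD");
}

$good    = "Ecommerceland \xC3\xA8 un progetto";      // è encoded as UTF-8
$latin1  = "Ecommerceland \xE8 un progetto";          // è as one ISO-8859-1 byte
$damaged = "Ecommerceland \xEF\xBF\xBD un progetto";  // è already replaced

var_dump(is_valid_utf8($good));               // bool(true)
var_dump(is_valid_utf8($latin1));             // bool(false)
var_dump(is_valid_utf8($damaged));            // bool(true)
var_dump(count_replacement_chars($damaged));  // int(1)
```

Note that a string containing U+FFFD is itself valid UTF-8, so a validity check alone will not flag the damaged output. The original character is already gone by that point, so the fix has to happen before the conversion that produced it.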

Was none of these problems

cog.rusty commented 21 May 2008 at 03:06

Were none of these problems present without tidy?

nope...

andres@baravalle.it commented 21 May 2008 at 03:12

If I comment out the tidy lines, everything goes back to normal.

Andres

Maybe some php function

cog.rusty commented 21 May 2008 at 03:30

Maybe some PHP function is unaware of Unicode?

Does the third argument ("utf8") like in this example make any difference?
http://php.net/manual/en/function.tidy-repair-string.php#66066

nope...

andres@baravalle.it commented 21 May 2008 at 03:45

But I think I might have some idea.

If I test this code:

<p>
Ecommerceland &egrave; un&nbsp;progetto (using HTML entity).
</p>
<p>
Ecommerceland è un progetto (NOT using HTML entity).
</p>

The first one, using entities, appears incorrectly (both the è and the non-breaking space). The second one appears correctly.

Looks like, for some reason, my entities are not translated correctly.

Andres
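
One way to read this test, sketched below with made-up variable names: the entity form is pure ASCII, so it contains no multi-byte input for an encoding setting to misread. That would point at the step where the entities are expanded into characters. This is an interpretation of the symptom, not something confirmed in the thread.

```php
<?php
$entityForm  = "Ecommerceland &egrave; un&nbsp;progetto";
$literalForm = "Ecommerceland \xC3\xA8 un progetto";  // è typed directly, UTF-8

// A string is pure ASCII if every byte is in the range 00-7F.
$entityIsAscii  = preg_match('/^[\x00-\x7F]*$/', $entityForm) === 1;
$literalIsAscii = preg_match('/^[\x00-\x7F]*$/', $literalForm) === 1;

var_dump($entityIsAscii);   // bool(true): nothing to mis-decode on input
var_dump($literalIsAscii);  // bool(false): contains the bytes c3 a8
```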

done!

andres@baravalle.it commented 21 May 2008 at 04:18

Not sure if it's the most intelligent approach. Well, it almost surely isn't - but it works.

I added html_entity_decode and now it works.

	if (function_exists("tidy_repair_string")) {
		$xhtml = tidy_repair_string(trim(html_entity_decode($content, ENT_NOQUOTES, "UTF-8")), $GLOBALS["conf"]["tidy_config"]);

		if (!empty($xhtml)) {
			$content = $xhtml;
		}
	}

	print $content;

Not exactly sure why - I would have thought that tidy would have been able to deal with it,

Andres
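
For reuse outside node.tpl.php, the workaround above could be wrapped in a function. This is only a sketch: the name tidy_clean_content is hypothetical, the config array passed at the bottom is a shortened example, and the result depends on whether the tidy extension is installed.

```php
<?php
// Hypothetical wrapper around the workaround; not part of Drupal or tidy.
function tidy_clean_content($content, array $tidy_config) {
    // Without the tidy extension, hand the content back untouched.
    if (!function_exists('tidy_repair_string')) {
        return $content;
    }
    // Turn entities such as &egrave; into UTF-8 characters before tidy sees them.
    $decoded = html_entity_decode($content, ENT_NOQUOTES, 'UTF-8');
    $xhtml = tidy_repair_string(trim($decoded), $tidy_config);
    // Fall back to the original markup if tidy returns nothing usable.
    return empty($xhtml) ? $content : $xhtml;
}

// Caveat: ENT_NOQUOTES only protects quotes. &lt;, &gt; and &amp; are still
// decoded, so escaped markup in the content turns into real markup.
$caveat = html_entity_decode('&lt;b&gt; &amp; &quot;', ENT_NOQUOTES, 'UTF-8');
echo $caveat, "\n";  // <b> & &quot;

$out = tidy_clean_content('<p>Ecommerceland &egrave; un&nbsp;progetto</p>', array(
    'show-body-only' => true,
    'output-xhtml'   => true,
));
var_dump(is_string($out) && $out !== '');
```

Because of that caveat, content that deliberately shows escaped HTML (code samples, for instance) may need to be excluded or handled separately.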

HTMLTidy alternative htmLawed

alpha2zee commented 2 July 2008 at 03:35

You might be interested in looking at the htmLawed module. It enables the use of the htmLawed filter, a simple stand-alone alternative to the HTMLTidy application.
