Hi,
I am trying, once again, to integrate HTMLTidy into my Drupal 5 site.

I tried different approaches, and the last one seemed simple enough (even if it is not the best one as far as performance is concerned).

In my theme folder for the web site, in node.tpl.php, I have used this:

	// Only run tidy when the PHP tidy extension is available.
	if (function_exists("tidy_repair_string")) {
		$xhtml = tidy_repair_string(trim($content), $GLOBALS["conf"]["tidy_config"]);

		// Keep the original content if tidy returned nothing.
		if (!empty($xhtml)) {
			// $content = $xhtml . "<!-- (X)HTML sanitized -->";
			$content = $xhtml;
		}
	}

	print $content;

and in my settings.php file I have the following declaration:

$conf["tidy_config"] = array(
	"alt-text" => "",
	"break-before-br" => false,
	"drop-proprietary-attributes" => true,
	"indent-spaces" => "2",
	"hide-endtags" => true,
	"indent" => "auto",
	"output-xhtml" => true,
	"show-body-only" => true,
	"tidy-mark" => false,
	"wrap" => false,
	"numeric-entities" => false,
	"word-2000" => false,
	"quote-nbsp" => true, 
	"input-encoding" => "raw", 
	"output-encoding" => "utf8"
);

The problem is that it works only in a basic way: there are a lot of problems with both entities and non-ASCII characters.

For example, if you look at this page, http://baravalle.it/citazioni/Stefano%20Benni, you can see what I mean. It's full of squares instead of "—" signs, and just commenting out the HTMLTidy call gives me back my "—" signs.

Any suggestions? I have been looking at it for a few hours without success.

Changing input-encoding to utf8 doesn't help; I already tried that.

Andres

Comments

vm’s picture

Are you using the HTMLTidy module? http://drupal.org/project/htmltidy

andres@baravalle.it’s picture

I used the module some time ago, when I was still on an earlier version of Drupal. I updated and tweaked the module (at the time it made a binary call to the tidy executable), but it wasn't working well enough, which is why I didn't submit anything back.

Today, after looking at some posts in the forums, I thought about giving it a second go, but in a simpler way, with the code I posted. And it doesn't work yet...

Andres

cog.rusty’s picture

Looking at your html output, I notice that the em dash has been replaced by a pair of characters, hex 14-20. (20 is a space of course).

The Unicode hex for em dash is 20-14 (U+2014). Not sure if this helps.

Could it be a little-endian/big-endian issue? It shouldn't be, since UTF-8 defines sequences of single bytes, but that's what it looks like.

andres@baravalle.it’s picture

Hi,
what tool/procedure did you use to see it? From Firefox, I cannot get the hex code (or I don't know how to do it).

Thanks for the help,

Andres

cog.rusty’s picture

I downloaded the html output as text and opened it with a text editor which could show it in hex (Ultraedit). But you can easily find an open source hex editor or viewer with google (for example http://www.tech-faq.com/hex-editor.shtml).

Then I compared what I saw with this: http://www.fileformat.info/info/unicode/char/2014/index.htm

----- Update

Looking at that page again, I noticed that the em dash is hex 2014 only in UTF-16, which in a little-endian representation becomes 14 20, as observed. But in UTF-8 it is different: it is e2 80 94 (you can see it by putting that sequence in the text with a hex editor). So the output is not really UTF-8.
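
For anyone who wants to check these byte sequences without a hex editor, here is a minimal sketch. It assumes the mbstring extension is available; the variable names are only for illustration.

```php
<?php
// Sketch only: assumes the mbstring extension is loaded.
$emDash = "\xE2\x80\x94"; // U+2014 EM DASH written as its three UTF-8 bytes

// Hex dump of the same character in three encodings.
$utf8Hex    = bin2hex($emDash);
$utf16leHex = bin2hex(mb_convert_encoding($emDash, 'UTF-16LE', 'UTF-8'));
$utf16beHex = bin2hex(mb_convert_encoding($emDash, 'UTF-16BE', 'UTF-8'));

echo "UTF-8:    $utf8Hex\n";    // e28094
echo "UTF-16LE: $utf16leHex\n"; // 1420
echo "UTF-16BE: $utf16beHex\n"; // 2014
```

The 14 20 pair seen in the page output matches the UTF-16LE line, not the UTF-8 one.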

andres@baravalle.it’s picture

I followed the procedure you suggested - with some interesting results.

Another page in the site, for example (http://baravalle.it/ecommerceland) is full of �s. It's not a 1 to 1 match - different accented letters are replaced with �s.

In the hex editor, � appears as ef bf bd, which I understand is the UTF-8 encoding of U+FFFD, the Unicode replacement character.

Which, I think, is saying that something went wrong...

But still not sure what, or why, and how to correct it.

Andres

cog.rusty’s picture

Were none of these problems present without tidy?

andres@baravalle.it’s picture

If I comment out the tidy lines, everything goes back to normal.

Andres

cog.rusty’s picture

Maybe some PHP function is unaware of Unicode?

Does passing the third argument ("utf8"), as in this example, make any difference?
http://php.net/manual/en/function.tidy-repair-string.php#66066
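
A minimal sketch of what passing that third argument could look like. The helper name tidy_utf8_sketch and the shortened config array are hypothetical, chosen only for illustration; the sketch assumes the PHP tidy extension is loaded and otherwise returns the input unchanged.

```php
<?php
// Sketch only: tidy_utf8_sketch is a hypothetical helper, not part of Drupal or tidy.
function tidy_utf8_sketch($html, array $config) {
	// Without the tidy extension, hand the markup back untouched.
	if (!function_exists('tidy_repair_string')) {
		return $html;
	}
	// Third argument: the encoding tidy should use for the input and output documents.
	$repaired = tidy_repair_string(trim($html), $config, 'utf8');
	// Fall back to the original markup if tidy returned nothing.
	return !empty($repaired) ? $repaired : $html;
}

// Shortened config for illustration; the full array from settings.php could be passed instead.
$config = array('output-xhtml' => true, 'show-body-only' => true, 'wrap' => false);
$result = tidy_utf8_sketch("<p>Ecommerceland \xC3\xA8 un progetto</p>", $config);
echo $result;
```

In node.tpl.php the same call would take $content and $GLOBALS["conf"]["tidy_config"] in place of the sample values.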

andres@baravalle.it’s picture

I think I might have an idea.

If I test this code:

<p>
Ecommerceland &egrave; un&nbsp;progetto (using HTML entity).
</p>
<p>
Ecommerceland è un progetto (NOT using HTML entity).
</p>

The first one, using entities, is rendered incorrectly (both the &egrave; and the &nbsp;). The second one is rendered correctly.

It looks like, for some reason, my entities are not being translated correctly.

Andres

andres@baravalle.it’s picture

Not sure if it's the most intelligent approach. Well, it quite surely isn't, but it works.

I added html_entity_decode and now it works.

	if (function_exists("tidy_repair_string")) {
		// Decode HTML entities to UTF-8 characters before handing the markup to tidy.
		$decoded = html_entity_decode($content, ENT_NOQUOTES, "UTF-8");
		$xhtml = tidy_repair_string(trim($decoded), $GLOBALS["conf"]["tidy_config"]);

		if (!empty($xhtml)) {
			$content = $xhtml;
		}
	}

	print $content;

I'm not exactly sure why; I would have thought that tidy would be able to deal with it.

Andres
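
To make the effect of the html_entity_decode step visible, here is a small sketch that decodes a line like the test markup above and prints the result. The sample string and variable names are assumptions for illustration only.

```php
<?php
// Sketch only: shows what html_entity_decode() produces before tidy sees the markup.
$withEntities = 'Ecommerceland &egrave; un&nbsp;progetto, 1 &lt; 2';
$decoded = html_entity_decode($withEntities, ENT_NOQUOTES, 'UTF-8');

// &egrave; becomes the UTF-8 bytes c3 a8 and &nbsp; becomes c2 a0.
// ENT_NOQUOTES leaves &quot; and &#039; alone, but &lt;, &gt; and &amp; are still decoded.
$expected = "Ecommerceland \xC3\xA8 un\xC2\xA0progetto, 1 < 2";

echo $decoded, "\n";
echo ($decoded === $expected) ? "decoded as expected\n" : "unexpected result\n";
```

One thing to watch with this workaround: because escaped markup characters such as &lt; are decoded as well, text that was meant to display a literal "<" reaches tidy as a real "<" and may be treated as the start of a tag.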

alpha2zee’s picture

You might be interested in looking at the htmLawed module. It enables the use of the htmLawed filter, a simple stand-alone alternative to the HTMLTidy application.