Hi!
I'm working on a multi-lingual website where I need to import a large number of pages written in German, Portuguese, Spanish and French.

For the last couple of days I've been trying to fine-tune Tidy's settings so that it stops converting all Latin characters to HTML entities. The entities make the text unreadable in edit mode (display is fine), but whatever I try, Tidy either converts everything or outputs errors.

Is there any advice you can give me?

Thanks!
Ricardo

Comments

dman’s picture

The only way I found to safely process non-ASCII characters using XML was to ensure everything was UTF-8 and numeric-entity-encoded first.
I know that causes some (much) horribleness in the resulting code, although it did get me results that worked.
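
To make the "UTF-8 and numeric-entity-encoded first" step concrete, here is a minimal sketch in Python (not the module's actual PHP code). The function name `to_numeric_entities` and its `source_encoding` parameter are made up for illustration, and the sketch assumes you already know the encoding of the incoming bytes.

```python
def to_numeric_entities(raw: bytes, source_encoding: str = "utf-8") -> str:
    """Decode raw bytes, then turn every non-ASCII character into a numeric reference.

    `source_encoding` is an assumption: the caller has to know (or guess)
    how the imported page was encoded.
    """
    text = raw.decode(source_encoding)
    # "xmlcharrefreplace" writes characters that do not fit in ASCII as &#NNN;
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# A German word that arrived as Latin-1 bytes
encoded = to_numeric_entities("Größe".encode("latin-1"), "latin-1")
print(encoded)
```

The result is pure ASCII, which any XML parser accepts regardless of character set, but it is also exactly the entity soup that makes edit mode hard to read.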

PHP5's XML support does not handle non-UTF-8 input very well, if at all. I haven't been able to find the magic that will allow it to parse unknown character sets. That's XML magic. Although I have some XML mojo, character sets and content-encoding are hard for me.

I wish I knew the answer. In D6, I'll be revisiting it to possibly use DTDs better, so the known entities don't have to be escaped. Not sure if that will be the answer for other languages though.

rbl’s picture

Thanks dman!

Actually I've tried that (UTF-8 and numeric-entity-encoding first) but it didn't work. It just keeps the encoding (which is unreadable, as I described before) or double-encodes it.
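
For anyone hitting the same symptoms, one possible workaround is to turn the numeric references back into real characters before the text reaches the edit form. The sketch below is in Python and is only an illustration of the idea, not something the module does; `decode_non_ascii_refs` is a hypothetical helper. It assumes decimal references only and treats `&amp;#246;` as a double-encoded `&#246;`.

```python
import re

# Matches &#246; as well as double-encoded forms such as &amp;#246;
_NUMERIC_REF = re.compile(r"&(?:amp;)*#(\d+);")


def decode_non_ascii_refs(text: str) -> str:
    """Replace decimal references to non-ASCII characters with the characters themselves."""

    def replace(match: re.Match) -> str:
        code_point = int(match.group(1))
        # Leave ASCII references (for example &#60; for "<") alone so markup is not altered.
        if 127 < code_point <= 0x10FFFF:
            return chr(code_point)
        return match.group(0)

    return _NUMERIC_REF.sub(replace, text)


print(decode_non_ascii_refs("Gr&#246;&#223;e"))
print(decode_non_ascii_refs("Gr&amp;#246;&amp;#223;e"))
```

Named entities such as `&lt;` and `&amp;` are deliberately left untouched, since decoding those would change the markup rather than just the readability. Hexadecimal references are not handled here.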

The only thing I didn't try was "turning off" Tidy. Is this possible?

Ricardo

dman’s picture

Tidy is sorta off in the first instance: it tries to parse the input without tidying it, but if there are ANY XML problems, it runs the input through the mincer and tries again.
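
The "parse first, tidy only on failure" behaviour described above can be sketched roughly like this. It is written in Python rather than the module's PHP, and `repair` is a hypothetical stand-in for Tidy that only escapes bare ampersands; the real cleanup does far more.

```python
import re
import xml.etree.ElementTree as ET


def repair(markup: str) -> str:
    """Hypothetical stand-in for Tidy: escape ampersands that do not start an entity."""
    return re.sub(r"&(?!#?\w+;)", "&amp;", markup)


def parse_with_fallback(markup: str, repair_fn=repair):
    """Try a strict XML parse; only if that fails, repair the input and try once more.

    Returns (root_element, was_repaired).
    """
    try:
        return ET.fromstring(markup), False
    except ET.ParseError:
        # Second attempt; a ParseError here propagates to the caller.
        return ET.fromstring(repair_fn(markup)), True


root, repaired = parse_with_fallback("<p>Fish & chips</p>")
print(root.text, repaired)
```

The point of the design is that well-formed input is never touched by the cleanup step, so any entity conversion only happens to pages that failed the first parse.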

dman’s picture

Status: Active » Closed (fixed)

Cleaning up issue queue by closing stuff from the Drupal-5 branch and over a year old.