
Problems working with non-English languages

Project:Import HTML
Version:5.x-2.x-dev
Component:Code
Category:support request
Priority:normal
Assigned:Unassigned
Status:closed (fixed)

Issue Summary

Hi!
I'm working on a multi-lingual website where I need to import a large number of pages written in German, Portuguese, Spanish and French.

For the last couple of days I've been trying to fine-tune Tidy's settings to stop it from converting all accented Latin characters to HTML entities, because the entities make the text unreadable (in edit mode only; display is fine). So far it either converts everything or outputs errors.

Is there any advice you can give me?

Thanks!
Ricardo

Comments

#1

The only way I found to safely process non-ASCII characters using XML was to ensure everything was UTF-8 and numeric-entity-encoded first.
I know that causes some (much) horribleness in the resulting code - although it did get me results that worked.
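
For anyone following along, here is a rough sketch of that "UTF-8 plus numeric entities first" step. It is written in Python purely for illustration and is not the module's PHP code; the function names and the Latin-1 source encoding are assumptions, not something taken from Import HTML.

```python
import xml.etree.ElementTree as ET


def to_unicode(raw: bytes, source_encoding: str = "latin-1") -> str:
    """Decode raw bytes to text. The source encoding is an assumption here;
    in practice it has to be known or detected beforehand."""
    return raw.decode(source_encoding)


def to_numeric_entities(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference,
    leaving pure-ASCII markup that an XML parser can read safely."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# Hypothetical input: a Latin-1 encoded fragment with accented characters.
raw = "<p>Grüße, São Paulo, año, français</p>".encode("latin-1")

encoded = to_numeric_entities(to_unicode(raw))
# encoded is now ASCII only, e.g. "Gr&#252;&#223;e" instead of "Grüße".

# The XML parser resolves the numeric references back to real characters.
parsed_text = ET.fromstring(encoded).text
```

The encoded string is what looks horrible in an edit form; the characters only come back once something parses or decodes the references again.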

PHP5 XML does not handle non-UTF-8 input very well, if at all. I haven't been able to find the magic that will allow it to parse unknown character sets. That's XML magic. Although I have some XML mojo, character sets and content-encoding are hard for me.

I wish I knew the answer. In D6, I'll be revisiting it to possibly use DTDs better, so the known entities don't have to be escaped. Not sure if that will be the answer for other languages though.

#2

Thanks dman!

Actually, I've tried that (UTF-8 and numeric-entity-encoding first) but it didn't work. It either keeps the entity encoding (which is unreadable, as I described before) or double-encodes it.

The only thing I didn't try was "turning off" Tidy. Is this possible?

Ricardo

#3

Tidy is sorta off on the first pass - it tries to parse the input without tidying it, but if there are ANY XML problems, it runs it through the mincer and tries again.
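
The parse-first, clean-up-only-on-failure flow described above could be sketched roughly like this. Again this is Python for illustration only, not the module's code; `cleanup` is a hypothetical stand-in for Tidy that does nothing more than escape bare ampersands.

```python
import re
import xml.etree.ElementTree as ET


def cleanup(markup: str) -> str:
    """Hypothetical stand-in for Tidy: escapes ampersands that are not
    already part of an entity or character reference."""
    return re.sub(r"&(?!#?\w+;)", "&amp;", markup)


def parse_with_fallback(markup: str):
    """Try to parse the input untouched; only if the parser reports a
    problem, clean the input and parse again.
    Returns (root_element, was_cleaned)."""
    try:
        return ET.fromstring(markup), False
    except ET.ParseError:
        return ET.fromstring(cleanup(markup)), True


# Well-formed input goes straight through; input with a bare "&" is
# cleaned first and then parsed.
root_ok, cleaned_ok = parse_with_fallback("<p>Café</p>")
root_bad, cleaned_bad = parse_with_fallback("<p>Café & bar</p>")
```

The point of the design is that clean input is never rewritten, so any entity conversion only happens to documents that failed the first parse.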

#4

Status:active » closed (fixed)

Cleaning up the issue queue by closing issues from the Drupal 5 branch that are over a year old.
