Problems working with non-English languages
rbl - May 3, 2009 - 21:46
| Project: | Import HTML |
| Version: | 5.x-2.x-dev |
| Component: | Code |
| Category: | support request |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | active |
Description
Hi!
I'm working on a multi-lingual website where I need to import a large number of pages written in German, Portuguese, Spanish and French.
For the last couple of days I've been trying to fine-tune Tidy's settings to stop it converting all Latin characters to HTML entities, because that renders the text unreadable (in edit mode only; display is fine). But it either converts everything or outputs errors.
Is there any advice you can give me?
Thanks!
Ricardo

#1
The only way I found to safely process non-ASCII characters using XML was to ensure everything was UTF8 and numeric-entity-encoded first.
I know that causes some (much) horribleness in the resulting code - although it did get me results that worked.
PHP5 XML does not handle non-UTF8 very well, if at all. I haven't been able to find the magic that will allow it to parse unknown character sets. That's XML magic. Although I have some XML mojo, character sets and content-encoding are hard for me.
I wish I knew the answer. In D6, I'll be revisiting it to possibly use DTDs better, so the known entities don't have to be escaped. Not sure if that will be the answer for other languages though.
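For anyone following along, here is a rough sketch of the "UTF-8 plus numeric entities first" idea. It is written in Python purely for illustration and is not the module's actual code. The function names and the ISO-8859-1 default are assumptions; swap in whatever encoding your source pages really use. XML only predefines amp, lt, gt, quot and apos, so those are left alone and every other named entity or non-ASCII character becomes a numeric reference.

```python
import re
from html.entities import name2codepoint

# Entities XML predefines; these must stay as they are.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

def named_to_numeric(text):
    """Turn HTML named entities (&eacute;) into numeric ones (&#233;)."""
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML entities and unknown names alone
        return "&#%d;" % name2codepoint[name]
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, text)

def to_numeric_entities(raw, source_encoding="iso-8859-1"):
    """Decode bytes from an assumed source encoding and return pure-ASCII
    markup with every non-ASCII character as a numeric reference."""
    text = raw.decode(source_encoding)
    text = named_to_numeric(text)
    # xmlcharrefreplace writes unencodable characters as &#NNN;
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")
```

Because existing numeric references are never touched, running the result through the function a second time should leave it unchanged, which is one way to check for double encoding.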
#2
Thanks dman!
Actually I've tried that (UTF-8 and numeric-entity-encoding first) but it didn't work. It either keeps the entity encoding (which is unreadable, as I described before) or double-encodes it.
The only thing I didn't try was "turning off" Tidy. Is this possible?
Ricardo
#3
Tidy is sorta off in the first case - it tries to parse the input without tidying it, but if there are ANY XML problems, it runs it through the mincer and tries again.
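The parse-first, tidy-only-on-failure flow looks roughly like the sketch below. Again this is Python for illustration only, not the module's PHP. `parse_with_fallback` and the `cleanup` callable are hypothetical names, with `cleanup` standing in for whatever Tidy call is configured.

```python
import xml.etree.ElementTree as ET

def parse_with_fallback(markup, cleanup):
    """Try a strict XML parse first; run the cleanup step only if it fails.

    Returns (root_element, cleaned), where `cleaned` tells you whether
    the fallback was needed. `cleanup` is a hypothetical stand-in for Tidy.
    """
    try:
        return ET.fromstring(markup), False
    except ET.ParseError:
        # Any XML problem at all (such as an undefined entity like &eacute;)
        # sends the input through the cleanup step before a second attempt.
        return ET.fromstring(cleanup(markup)), True
```

So a page that is already well-formed XML should come through untouched, while a single undefined named entity is enough to trigger the cleanup pass.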