Posted by rbl on May 3, 2009 at 9:46pm
| Project: | Import HTML |
| Version: | 5.x-2.x-dev |
| Component: | Code |
| Category: | support request |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | closed (fixed) |
Issue Summary
Hi!
I'm working on a multi-lingual website where I need to import a large number of pages written in German, Portuguese, Spanish and French.
For the last couple of days I've been trying to fine-tune Tidy's settings so that it stops converting all Latin characters to HTML entities. The entities make the text unreadable (in edit mode only; display is fine), but Tidy either converts everything or outputs errors.
Is there any advice you can give me?
Thanks!
Ricardo
Comments
#1
The only way I found to safely process non-ASCII characters using XML was to ensure everything was UTF-8 and numeric-entity-encoded first.
I know that causes some (much) horribleness in the resulting code - although it did get me results that worked.
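A minimal sketch of the "numeric-entity-encode first" idea, written in Python rather than the module's PHP and not taken from Import HTML itself. The helper name `to_numeric_entities` and the sample string are assumptions for illustration only. It shows why the approach is safe for an XML parser (the markup becomes pure ASCII) and also why the source turns unreadable in edit mode.

```python
import xml.etree.ElementTree as ET


def to_numeric_entities(text: str) -> str:
    """Replace every non-ASCII character with a decimal numeric entity."""
    # 'xmlcharrefreplace' emits &#NNN; for anything the ASCII codec cannot encode.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


sample = "Größe, coração, año, français"   # hypothetical sample text
encoded = to_numeric_entities(sample)
print(encoded)
# Gr&#246;&#223;e, cora&#231;&#227;o, a&#241;o, fran&#231;ais

# The encoded markup is plain ASCII, so an XML parser accepts it
# and hands back the original characters.
element = ET.fromstring("<p>" + encoded + "</p>")
print(element.text == sample)
```

Numeric character references need no DTD, which is why they survive parsing where named entities such as `&eacute;` would not.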
PHP 5's XML parser does not handle non-UTF-8 input very well, if at all. I haven't been able to find the magic that will allow it to parse unknown character sets. That's XML magic. Although I have some XML mojo, character sets and content-encoding are hard for me.
I wish I knew the answer. In D6, I'll be revisiting it to possibly use DTDs better, so the known entities don't have to be escaped. Not sure if that will be the answer for other languages though.
#2
Thanks dman!
Actually I've tried that (UTF-8 and numeric-entity-encoding first) but it didn't work. It either keeps the entity encoding (which is unreadable, as I described before) or double-encodes it.
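The two symptoms described here can be reproduced in a few lines. This is only an illustration in Python, assuming the double encoding comes from escaping text that already contains entities; the thread does not confirm where it actually happens in the module. `to_numeric_entities` is a hypothetical helper, not part of Import HTML or Tidy.

```python
import html


def to_numeric_entities(text: str) -> str:
    """Replace every non-ASCII character with a decimal numeric entity."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


once = to_numeric_entities("coração")
print(once)    # cora&#231;&#227;o

# Escaping text that is already entity-encoded turns each & into &amp;,
# which is the double encoding.
twice = html.escape(once, quote=False)
print(twice)   # cora&amp;#231;&amp;#227;o

# Decoding entities once after processing restores readable text.
readable = html.unescape(once)
print(readable)  # coração

# A double-encoded string needs two decoding passes to get back.
print(html.unescape(html.unescape(twice)) == "coração")
```

If the imported body could be entity-decoded once after parsing and before saving, the edit form would show the original characters; whether the module exposes a hook for that is not stated in this thread.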
The only thing I didn't try was "turning off" Tidy. Is this possible?
Ricardo
#3
Tidy is sorta off in the first case - it tries to parse the input without tidying it, but if there are ANY XML problems, it runs it through the mincer and tries again.
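The "parse first, clean up only on failure" flow described above can be sketched roughly as follows. This is Python, not the module's PHP, and `crude_cleanup` is a hypothetical stand-in for Tidy that only escapes bare ampersands; real Tidy does far more.

```python
import re
import xml.etree.ElementTree as ET


def crude_cleanup(markup: str) -> str:
    """Hypothetical stand-in for Tidy: escape bare ampersands only."""
    # Leave anything that already looks like an entity (&name; or &#123;) alone.
    return re.sub(r"&(?!#?\w+;)", "&amp;", markup)


def parse_with_fallback(markup: str):
    """Return (element, cleaned); cleaned is True if the fallback ran."""
    try:
        # First attempt: parse the input untouched.
        return ET.fromstring(markup), False
    except ET.ParseError:
        # Any XML problem at all: clean up and try again.
        return ET.fromstring(crude_cleanup(markup)), True


good, good_cleaned = parse_with_fallback("<p>Sehr gut</p>")
bad, bad_cleaned = parse_with_fallback("<p>Fish & chips</p>")
print(good_cleaned, bad_cleaned)
print(bad.text)
```

One consequence of this design is that well-formed input passes through unchanged, so keeping the source valid XML is the closest thing to switching the cleanup step off.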
#4
Cleaning up issue queue by closing stuff from the Drupal-5 branch and over a year old.