UTF-8 encoding, image file extensions
| Project: | Import HTML |
| Version: | 5.x-1.x-dev |
| Component: | Code |
| Category: | bug report |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | active |
Hi,
Import HTML 5.x-1.x-dev seems to have introduced a few new issues. From what I noticed:
* The encoding of non-English characters is wrong. The input files are encoded in UTF-8 (checked with the "isutf8" utility, version 1.1 on Ubuntu "Intrepid Ibex") and declare a proper charset=utf-8 encoding in the header. Import HTML 5.x-1.2 handles the same set of files correctly on the same D5 installation. The characters that become corrupted include "ä", "β", and "»".
* Very strangely, imported images break because the file extension (e.g. ".JPG") is being stripped from the source code. I also checked this against Import HTML 5.x-1.2, which imports the same files differently.
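For anyone trying to reproduce the first problem, here is a minimal Python sketch of what such corruption typically looks like. It assumes the cause is UTF-8 bytes being read with a single-byte charset (Latin-1 here), which is only a guess and has not been confirmed for Import HTML. The helper names misdecode and is_utf8 are made up for this illustration; is_utf8 only approximates what the "isutf8" utility checks.

```python
# Sketch only: Latin-1 as the wrong charset is an assumption, not a
# confirmed diagnosis of the Import HTML behaviour.
samples = ["ä", "β", "»"]


def misdecode(text: str, wrong: str = "latin-1") -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong charset."""
    return text.encode("utf-8").decode(wrong)


def is_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


for ch in samples:
    # Show each original character next to its corrupted form.
    print(repr(ch), "->", repr(misdecode(ch)))
```

If the corrupted output in the imported nodes matches this pattern (two characters where one was expected), a wrong charset assumption somewhere in the pipeline is a likely candidate.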
Sorry for not being able to provide a patch (went back to Import HTML 5.x-1.2).
Greetings, asb

#1
Damn, yes I also noticed the character problems.
I experimented a few times to try to get it consistent, but couldn't find the right escape sequences that
- validated in XML-UTF8
- didn't immediately convert proper escaped numeric entities into character data (which immediately broke the next process) and
- didn't multi-escape & symbols and produce even more weirdness.
That is on the 'currently broken' list.
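To make the multi-escaping requirement concrete, here is a rough Python sketch (not the module's PHP code) of one way to escape only bare & symbols while leaving existing named or numeric entities alone. The pattern and the function name escape_bare_ampersands are assumptions for illustration.

```python
import re

# Matches "&" only when it does NOT already start an entity such as
# &amp;, &#228; or &#xE4; (hypothetical pattern for this sketch).
BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_bare_ampersands(text: str) -> str:
    """Escape stray ampersands; existing entities are left untouched."""
    return BARE_AMP.sub("&amp;", text)
```

Because already-escaped entities are skipped, running the function twice gives the same result as running it once, which is the property the escaping step above is missing.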
I'm not sure what was going right in the previous version, but I can't see how it was OK :-/
I friggin hate non-ASCII characters.
There is an option that intentionally strips suffixes - although usually it's off, and it should run (if checked in the settings) AFTER the images have been rewritten. Its intention is to rewrite "/about/us.htm" into "/about/us"
I guess it could be catching the resource files too. It shouldn't.
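As a sketch of the intended behaviour (in Python rather than the module's PHP, and with made-up names), the suffix stripping could be limited to page extensions so that resource files such as images keep theirs. PAGE_SUFFIXES and strip_page_suffix are hypothetical, and which extensions count as pages is an assumption.

```python
import posixpath

# Assumed set of page extensions; resource files are anything else.
PAGE_SUFFIXES = {".htm", ".html"}


def strip_page_suffix(path: str) -> str:
    """Drop the suffix from page links only, ignoring case."""
    root, ext = posixpath.splitext(path)
    if ext.lower() in PAGE_SUFFIXES:
        return root
    # Images and other resources are returned unchanged.
    return path
```

This ignores query strings and fragments, which a real rewrite step would have to handle separately.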
#2
Hi Dan,
> I friggin hate non-ASCII characters.
Sorry to hear this ;-) UTF is pretty cool - if it works. But it definitely is administrator's hell during migration, even for us folks used to full "latin" charsets.
However, what do you think about utilizing an external helper application like Ulrich Drepper's "iconv" (iconv --from-code ... --to-code ...)? I'll try to talk with our XSL guru about how he handles this.
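Just to illustrate the kind of transcoding step I mean, here it is sketched in Python instead of calling iconv itself. The function name transcode is made up, and the source charset has to be known in advance, exactly as with iconv's --from-code.

```python
def transcode(data: bytes, from_code: str, to_code: str = "utf-8") -> bytes:
    """Re-encode bytes from one charset to another, similar in spirit to
    iconv --from-code=FROM --to-code=TO."""
    return data.decode(from_code).encode(to_code)


# Example: a Latin-1 encoded "ä" becomes its two-byte UTF-8 form.
latin1_bytes = "ä".encode("latin-1")
utf8_bytes = transcode(latin1_bytes, "latin-1")
```

This only helps if some stage really emits a non-UTF-8 encoding; it will not repair text that has already been double-encoded.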
Greetings, -asb