Hi Dan,

After a few days, here is what I did and what I got:

I modified the function init_xsl() so that the check also passes if DOMXML is installed:

if (!extension_loaded('xsl') && !extension_loaded('xslt') && !extension_loaded('domxml')) {

Well, after that I tried a quick demo import from Google, and it failed. I have attached the HTML page I got back.

Thank you and tell me if you want me to do some more testing.

Attached file: Import the selected files _ drupal.htm (10.69 KB, uploaded by petasques)

Comments

dman commented:

> if (!extension_loaded('xsl') && !extension_loaded('xslt') && !extension_loaded('domxml')) {

Hm, now that's not really much help.
As mentioned before, if xslt is available, it's bundled with domxml (and likewise for the PHP 5 dom/xsl pair).
I need both, but the xsl and xslt extensions are the more optional ones, so those are what I check for.

In your example, if extension_loaded('xslt') is true then extension_loaded('domxml') must also be true, due to PHP build dependencies. And I need xslt as well.

It's conceivable, though unlikely since they usually come together, that you had domxml enabled but NOT xslt. That combination wouldn't work, and would (correctly) raise a warning.
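To make the "I need both" point concrete, here is a minimal sketch of a check that only passes when a complete DOM-plus-XSLT pair is loaded. It is not the module's actual init_xsl(); the function name xml_toolchain_available() and the injectable $is_loaded parameter are hypothetical, and the pairing of 'dom' with 'xsl' (PHP 5) is an assumption based on the extension names discussed above.

```php
<?php
// Hypothetical helper, NOT the module's real init_xsl().
// $is_loaded defaults to PHP's extension_loaded(); it is a parameter
// only so the logic can be exercised without changing the PHP build.
function xml_toolchain_available($is_loaded = 'extension_loaded') {
  // Assumed PHP 5 pairing: dom + xsl.
  $php5_pair = call_user_func($is_loaded, 'dom')
    && call_user_func($is_loaded, 'xsl');
  // Assumed PHP 4 pairing: domxml + xslt.
  $php4_pair = call_user_func($is_loaded, 'domxml')
    && call_user_func($is_loaded, 'xslt');
  // domxml on its own is not enough, so it never passes by itself.
  return $php5_pair || $php4_pair;
}
```

With this shape, a server that has domxml but no xslt still fails the check, which is the case where a warning is the correct outcome.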

Does that make sense?

So the extension detection is not our problem, I guess.

> Well, after that I tried a quick demo import from Google, and it failed. I have attached the HTML page I got back.

Anyway, this all shows that either the HTML was crap (as the message said) or HTMLTidy wasn't working or wasn't configured right (which is possible).
I've actually recently modified the module library in CVS to give up on bad HTML, since the rest of the errors aren't much help once we've established that we are getting garbage in. They WERE useful while I was tuning the HTMLTidy options to catch things like character entities and script blocks, which were sorta-XHTML but sorta-not-XML.

ANYWAY.

I had a poke at Google's pages, and they really, really don't validate (ever noticed that before?).
Even running tidy by hand fails to comprehend the source well enough to format it.

If we can identify just what the real killer for HTMLTidy is here, I may be able to massage it out in the options, but in the meantime the error message was correct, and we gotta blame the source. Bad Google!
Dodgy attributes and entities should not be fatal IMO; that's what Tidy is supposed to FIX! Maybe I've got it set too strict.

.dan.

dman commented:

Status: Active » Closed (fixed)

Not a killer. You can tweak the htmltidy.conf if it helps, but we were unable to find a real solution for truly broken input.
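For anyone who wants to try that tweak, a more lenient htmltidy.conf might look something like the fragment below. The option names are standard HTML Tidy settings, but the values are only a guess at a forgiving starting point. They are not the file shipped with the module and are not known to rescue the Google pages.

```
# Illustrative values only, not the module's shipped configuration.
# Emit XHTML so the result can be fed to an XML parser.
output-xhtml: yes
# Write entities as numeric references, which plain XML understands.
numeric-entities: yes
# Strip vendor-specific attributes rather than passing them through.
drop-proprietary-attributes: yes
# Produce output even when Tidy reports errors (the result may still be invalid).
force-output: yes
# Keep the generator meta tag out of the output.
tidy-mark: no
```

Note that force-output only makes Tidy emit something; if the input is truly broken, the result can still fail to parse as XML.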