I am trying to import a large HTML document that contains a lot of errors. $xmldoc->loadxml($xmlsource) in parse_in_xml_file() returns false even with Tidy enabled, and I cannot find a way to get error messages that explain why. Are there any HTML debuggers that can tell me why loadxml() fails?

The HTML appears to have been created by MS Word. It is pretty bad.

I have used the HTML Validator extension in Firefox in an effort to find the cause. It gives me many warnings but no fatal errors.

I was able to copy and paste the source HTML into CKEditor and create a node from it, but I cannot do this for every file that has this problem. Too many files fail to import. I need clear error messages from loadxml().

Comments

dman

Debugging that sort of crud can be a bitch; that's why there is the "Keep Temp Files" option.

When processing, files are copied temporarily into a temp directory. These are usually deleted immediately after tidying and parsing, but if you want to trace problems, enable this option and check the files/import directory.

When that is on, you'll find traces of the in-between stages saved in your files directory, specifically what the document looks like after 'tidy'ing and before the DOM XML load, which is where the pain happens.

You can also turn the debugging messages way up loud (don't do more than a couple of pages at a time with the volume up) and you'll see all sorts of data about where the process got to when it broke.

There are some MS Word docs that even 'tidy' can't recover, though that should give you a visible error blaming the code.

If there is something that got through 'tidy' but not the XML parser (entirely possible, though rare now), you need to find what that is. In the past, 'tidy' let through duplicate attributes (e.g. width="10" width="11"), which broke the XML, but I was able to locate that issue and use a flag within 'tidy' that squashes it. There may be other issues like that.
I've encountered some MS-HTML that looked like 'processing instructions' to XML, but I can't recall it exactly. It may have come from weird comment escaping.
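
For hunting down what the XML parser is choking on, one option is to ask libxml itself. This is only a rough sketch, not how the module does it: explain_loadxml_failure() is a hypothetical helper name, and the sample string just reuses the duplicate-attribute case mentioned above. It relies on the standard PHP functions libxml_use_internal_errors() and libxml_get_errors(), which collect the parser's complaints, with line and column numbers, instead of letting loadXML() fail silently.

```php
<?php
// Hypothetical helper (not part of the import module): try to load a string
// as XML and return the parser's own error messages if it fails.
function explain_loadxml_failure($xmlsource) {
  // Collect libxml errors internally instead of emitting PHP warnings.
  $previous = libxml_use_internal_errors(true);
  libxml_clear_errors();

  $xmldoc = new DOMDocument();
  $ok = $xmldoc->loadXML($xmlsource);

  $messages = array();
  foreach (libxml_get_errors() as $error) {
    // Each LibXMLError carries the position and a human-readable message.
    $messages[] = sprintf(
      'line %d, column %d: %s',
      $error->line,
      $error->column,
      trim($error->message)
    );
  }

  // Restore the previous error handling state.
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  return array('ok' => ($ok !== false), 'messages' => $messages);
}

// Example input: the duplicate-attribute case that 'tidy' used to let through.
$result = explain_loadxml_failure('<table><tr><td width="10" width="11">x</td></tr></table>');
foreach ($result['messages'] as $message) {
  echo $message, "\n";
}
```

Running something like this against the post-'tidy' temp file should point at the line and column where the XML parser gave up, which narrows the search a lot.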

Oh, you need to use -dev for the advanced debugging options.

frank ralf

I don't know whether this will help, but I got a similar error message after activating "extension=php_domxml.dll" in my php.ini (XAMPP 1.7.1 on Windows 2000).

dman

Status: Active » Closed (cannot reproduce)

Clearing the old 6.x issues from the issue queue for a cleanup.