When importing an HTML page I get the following error: "This node has no body". The debug output shows:

Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: attributes construct error in Entity, line: 49 in (...)\modules\import_html\coders_php_library\xml-transform.inc on line 161

By inspecting import_html.module (v 1.53 patched as of http://drupal.org/node/137205 ) we can see in line 2061:

  // collapse dir-up "../" paths. To tricky for XSL. Hope it doesn't break anything
  $xmldoc = preg_replace("|/[^\.][^/]*/\.\./|","/",$rewritten);

Replacing "../" this way breaks XHTML by eating closing tags (/>), because [^/]* can run across quotes, whitespace and tag boundaries, e.g.:

	<param name="regcode"
		value="qkiffnz8qtixfwt7derfrd5ar+xuxxfdrhev-jppdkdggnkm" />
	<param name="reglink" value="../../showcase.htm" />

produces

	<param name="regcode" value="qkiffnz8qtixfwt7derfrd5ar+xuxxfdrhev-jppdkdggnkm"/showcase.htm"/>

so loadXML fails. I modified it to:

  $rewritten = preg_replace('%"(\.\./)+%','"/',$rewritten);

which fits my needs but is not complete at all.
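
To make the failure easier to reproduce, here is a minimal standalone sketch that runs both patterns quoted above against the sample markup. The variable names ($html, $greedy, $workaround) are made up for illustration and are not taken from import_html.module.

```php
<?php
// Sample markup from the report (two <param> elements).
$html = "\t<param name=\"regcode\"\n"
      . "\t\tvalue=\"qkiffnz8qtixfwt7derfrd5ar+xuxxfdrhev-jppdkdggnkm\" />\n"
      . "\t<param name=\"reglink\" value=\"../../showcase.htm\" />\n";

// Original pattern: the match starts at the "/" of the first "/>" and
// [^/]* then runs across the newline and the next tag up to "/../".
$greedy = preg_replace("|/[^\.][^/]*/\.\./|", "/", $html);

// Workaround from the report: collapse a run of "../" that directly
// follows a double quote into a single "/".
$workaround = preg_replace('%"(\.\./)+%', '"/', $html);

echo $greedy, "\n", $workaround;
```

With the original pattern the second <param> tag is swallowed into the first one. The workaround keeps both tags intact, but it only touches "../" runs right after a double quote, so single-quoted attributes and "../" in the middle of a path are left alone, and the relative link becomes a root-relative one.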

According to the author's note, this seemed too tricky to do in XSLT, so a regex substitution was tried instead.

Comment  File                            Size     Author
#2       love2learn2simplehtml.xsl_.txt  4.86 KB  dman

Comments

dman

Title: "This node has no body" error when importing HTML » "This node has no body" error when importing HTML ( ../ path rewriting error)
Status: Active » Needs work

That's true, and was indeed where things were falling apart.
Given a test case, I've tried making that pattern less greedy with the adjustment below:

<   $rewritten = preg_replace("|/[^\.][^/]*/\.\./|","/",$rewritten);
---
>   $rewritten = preg_replace('|/[^\.][^/\s"\'>]*/\.\./|','/',$rewritten);

It's still not complete (especially as it's page-wide, not limited to path attributes), but holds up a little better as it shouldn't run out of control as much.
I've committed this patch immediately, but will leave the issue as 'needs more work' as there are likely one or two more test cases to deal with.
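
As a rough check of what the tightened pattern does and does not do, here is a small standalone sketch. The <param> sample is shortened from the report, and the href and the paragraph text are hypothetical test inputs, not cases taken from the module.

```php
<?php
// Tightened pattern from the patch: the directory segment may no longer
// contain whitespace, quotes or ">".
$pattern = '|/[^\.][^/\s"\'>]*/\.\./|';

// Shortened version of the reported markup (hypothetical value "abc").
$param = "<param name=\"regcode\" value=\"abc\" />\n"
       . "<param name=\"reglink\" value=\"../../showcase.htm\" />";

// Hypothetical inputs: a path attribute and plain body text.
$link = '<a href="/docs/images/../index.htm">index</a>';
$text = '<p>cd /tmp/../etc</p>';

$param_out = preg_replace($pattern, '/', $param); // nothing to collapse
$link_out  = preg_replace($pattern, '/', $link);  // "images/../" collapsed
$text_out  = preg_replace($pattern, '/', $text);  // body text rewritten too

echo $param_out, "\n", $link_out, "\n", $text_out, "\n";
```

The closing "/>" is no longer a valid starting point for a runaway match, so the <param> markup passes through unchanged. The last case shows the page-wide caveat: the substitution also rewrites path-like strings in ordinary text, not only in attributes.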

dman

Attached file: new, 4.86 KB

For fun, I'm attaching a conversion document that illustrates how your one HTML source page can be split into multiple nodes on import. ... in theory.

It looks like there's still a little testing to be done on how the new names get generated, but this should be an OK test case.

Also, there's no automatic call to taxonomy_get_term_by_name() or whatever will be needed to fully absorb the taxonomy terms, but that should be easy sometime soon.

XSL isn't easy, so I'd better give an example of how the multiple documents in one file are expected to work.

dman

Status: Needs work » Fixed

Not totally resolved. This particular example is fixed, but the remaining ../ rewriting work is tracked over in
http://drupal.org/node/137162

Anonymous

Status: Fixed » Closed (fixed)