"This node has no body" error when importing HTML ( ../ path rewriting error) [#138174]

When importing an HTML page I get the following error: "This node has no body". The debug output shows:

Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: attributes construct error in Entity, line: 49 in (...)\modules\import_html\coders_php_library\xml-transform.inc on line 161

By inspecting import_html.module (v 1.53 patched as of http://drupal.org/node/137205 ) we can see in line 2061:

  // collapse dir-up "../" paths. To tricky for XSL. Hope it doesn't break anything
  $xmldoc = preg_replace("|/[^\.][^/]*/\.\./|","/",$rewritten);

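This pattern is too greedy: its [^/]* part can run from the "/" of a self-closing tag ( />) across the following markup to the next "/../", so the replacement eats the closing tag and breaks the XHTML, e.g.: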

	<param name="regcode"
		value="qkiffnz8qtixfwt7derfrd5ar+xuxxfdrhev-jppdkdggnkm" />
	<param name="reglink" value="../../showcase.htm" />

produces

	<param name="regcode" value="qkiffnz8qtixfwt7derfrd5ar+xuxxfdrhev-jppdkdggnkm"/showcase.htm"/>

so loadXML fails. I modified it to:

  $rewritten = preg_replace('%"(\.\./)+%','"/',$rewritten);

which fits my needs but is far from a complete fix.
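
The failure can be reproduced outside Drupal. The following is a minimal sketch, assuming PHP 5 or later with PCRE. The sample markup is a shortened, hypothetical variant of the param pair above, not the original page. It applies the module's pattern and then the workaround to the same input:

```php
<?php
// Shortened, hypothetical sample: a self-closing tag followed by a tag
// whose attribute value starts with "../../".
$html = '<param name="a" value="x" /><param name="b" value="../../showcase.htm" />';

// Pattern from import_html.module: "/", one non-dot character, any run of
// non-slash characters, then "/../". The run may span tags and quotes.
$broken = preg_replace('|/[^\.][^/]*/\.\./|', '/', $html);

// Workaround from this report: replace a run of "../" that directly
// follows a double quote with a single "/".
$workaround = preg_replace('%"(\.\./)+%', '"/', $html);

echo $broken, "\n";
// <param name="a" value="x" /showcase.htm" />
echo $workaround, "\n";
// <param name="a" value="x" /><param name="b" value="/showcase.htm" />
```

Note that the workaround turns the relative link into a root-relative one (/showcase.htm), which changes where it points unless the page sits at the site root, and it only handles double-quoted values.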

According to the author's note in the code comment, this is too tricky to do in XSL, which is why it was done with a regular-expression substitution.

Comment	File	Size	Author
#2	love2learn2simplehtml.xsl_.txt	4.86 KB	dman

Comments

Comment #1

dman commented 24 April 2007 at 13:21

Title:	"This node has no body" error when importing HTML	» "This node has no body" error when importing HTML ( ../ path rewriting error)
Status:	Active	» Needs work

That's true, and was indeed where things were falling apart.
Given a test case, I've tried making that pattern less greedy with the adjustment described in this issue:

<   $rewritten = preg_replace("|/[^\.][^/]*/\.\./|","/",$rewritten);
---
>   $rewritten = preg_replace('|/[^\.][^/\s"\'>]*/\.\./|','/',$rewritten);

It's still not complete (especially as it's page-wide, not limited to path attributes), but holds up a little better as it shouldn't run out of control as much.
I've committed this patch immediately, but will leave the issue as 'needs more work' as there are likely one or two more test cases to deal with.
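
To see what the narrower pattern changes, here is a minimal sketch, assuming PHP 5.3 or later (for the closure). It runs the committed pattern against a shortened, hypothetical version of the failing markup and against an ordinary link. The second half shows one possible way to restrict collapsing to path attributes. The helper name example_collapse_dir_up_in_attributes() and the choice of href and src are assumptions for illustration, not code from import_html.

```php
<?php
// Pattern as committed above: the middle run now also stops at
// whitespace, quotes and ">".
$pattern = '|/[^\.][^/\s"\'>]*/\.\./|';

// Hypothetical sample inputs.
$params = '<param name="a" value="x" /><param name="b" value="../../showcase.htm" />';
$link   = '<a href="/docs/old/../new.htm">new</a>';

$params_out = preg_replace($pattern, '/', $params); // left unchanged
$link_out   = preg_replace($pattern, '/', $link);   // href becomes /docs/new.htm

// Hypothetical helper (not part of import_html): collapse "dir/../" only
// inside double-quoted href and src attribute values.
function example_collapse_dir_up_in_attributes($html) {
  return preg_replace_callback(
    '/\b(href|src)="([^"]*)"/i',
    function ($m) {
      $path = $m[2];
      // Remove one "segment/../" at a time until nothing changes, so
      // nested cases such as a/b/../../c collapse fully. Segments that
      // start with a dot (including "..") are never removed.
      do {
        $before = $path;
        $path = preg_replace('|[^/.][^/]*/\.\./|', '', $path, 1);
      } while ($path !== $before);
      return $m[1] . '="' . $path . '"';
    },
    $html
  );
}

$img_out = example_collapse_dir_up_in_attributes('<img src="img/sub/../../logo.png" />');
// $img_out is '<img src="logo.png" />'
```

The helper only looks at double-quoted href and src values, so single-quoted attributes and other path-carrying attributes (such as the value of a param element) pass through unchanged.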

Comment #2

dman commented 24 April 2007 at 15:34

Status	File	Size
new	love2learn2simplehtml.xsl_.txt	4.86 KB

For fun, I'm attaching a conversion document that illustrates how your one HTML source page can be split into multiple nodes on import. ... in theory.

It looks like there's still a little testing to be done on how the new names get generated, but this should be an OK test case.

Also, there's no automatic call to taxonomy_get_term_by_name() or whatever will be needed to fully absorb the taxonomy terms, but that should be easy sometime soon.

XSL isn't easy, so I'd better give an example of how the multiple documents in one file are expected to work.

Comment #3

dman commented 1 May 2007 at 08:47

Status:	Needs work	» Fixed

Not totally resolved. This particular example is fixed, but work on the ../ rewriting continues over in
http://drupal.org/node/137162

Comment #4

(not verified) commented 15 May 2007 at 08:54

Status:	Fixed	» Closed (fixed)