Hi. First of all, thank you dman for this awesome module.

I successfully imported a few pages, but when I went to import the whole site, hundreds of pages were left out with this error:

DOMDocument::loadXML() [function.DOMDocument-loadXML]: Attribute new redefined in Entity, line: 273 in C:\Program Files\EasyPHP 2.0b1\www\Transform html to database\drupal-5.18\modules\import_html\coders_php_library\xml-transform.inc on line 99

A less common error was this: Invalid argument supplied for foreach() in C:\Program Files\EasyPHP 2.0b1\www\Transform html to database\drupal-5.18\modules\import_html\import_html.module on line 1402

I would be grateful if you can help me.

Comment  File            Size      Author
#6       dontgiveup.txt  25.19 KB  sigmundfreud
#2       tough.txt       25.19 KB  sigmundfreud

Comments

dman commented:

Title: Badly stuck! ERROR : DOMDocument::loadXML() [<a href='function.DOMDocument-loadXML'>function.DOMDocument-loadXM » Badly stuck! ERROR : DOMDocument::loadXML() Attribute new redefined in Entity

Curious.
Are you able to attach one of the problem pages?
I'm not sure about "Attribute new redefined in Entity". I've never seen it before. Do the source docs use some custom DOCTYPE?

sigmundfreud commented:

Status: new, file size: 25.19 KB

THANKS for the reply.

I have attached the html code of the sample page. Not sure what more to attach.
You can see the page at - http://faithfreedom.org/Articles/sina41003p4.htm

I have one more question. I don't know XSLT, so to import those pages (where the only common pattern is a table with a width of more than 400) I kept adding lines like these to html2simplehtml.xsl:

<xsl:when test="descendant::*[@width='520' or @width='530' or @width='560' or @width='590'] ">
		<xsl:comment>Imported From the element called width500</xsl:comment>

How do we modify this to select only tables, not all elements? The HTML page has a font element (width="520") which is again causing an error. And how do we select only the first table with a width of more than 400?

Thanks a lot dman.

dman commented:

Scary HTML there.
I see

<span style="font-size: 12.0pt; mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA">

... which is fine. Also a variant looking like:

<span style="FONT-FAMILY: 'Times New Roman'; FONT-SIZE: 12pt; mso-fareast-font-family: 'Times New Roman'; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA">

It's strange to see this here and different from the first version, but it's good enough.
But also :

<span style="font-size:12.0pt;
font-family:&quot;Times New Roman&quot;;mso-fareast-font-family:&quot;Times New Roman&quot;;
mso-ansi-language:EN-US;mso-fareast-language:EN-US;mso-bidi-language:AR-SA" class="postbody">

Which is just peculiar. Potentially legal XHTML (maybe). But freaky that all three syntaxes are in the same file.

... So that's pretty messed-up input.
Anyway, we hope that html tidy can sort this out for us...

When I tried to pass that page through tidy, I got the following messages. Didn't you?

    * user warning: HTMLTidy failed to parse the input at all! It's probably very problematic HTML. I ran /usr/bin/tidy -q -config /Library/WebServer/Documents/drupal6/sites/devel/modules/modified/import_html/coders_php_library/xhtml_tidy.conf "sites/devel.drupal6.gadget/files/imported/tough_0.txt.htm" and htmltidy returned: 2

      line 172 column 27 - Error: <u1:p> is not recognized!
      line 179 column 46 - Error: <u1:p> is not recognized!
      line 256 column 44 - Error: <st1:place> is not recognized!

      in /Library/WebServer/Documents/drupal6/sites/devel/modules/modified/import_html/coders_php_library/tidy-functions.inc on line 189.
    * user warning: Failed to parse contents of 'sites/devel.drupal6.gadget/files/imported/tough_0.txt.htm' in /Library/WebServer/Documents/drupal6/sites/devel/modules/modified/import_html/coders_php_library/xml-transform.inc on line 99.
    * Failed to get any results from the attempted analysis of tough_0.txt.htm. The source file path was probably unavailable or incorrect.

... No. That input is badly invalid, and hard to validate because it is using namespaces incorrectly, which is worse than not using namespaces at all!
XHTML parsers and HTML Tidy can handle old HTML that didn't know the rules. This code is pretending to follow the new rules, but doing so badly. XML does NOT permit that! :-{
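
To make that concrete, here is a minimal sketch of how a strict XML parser reacts to this kind of markup. It uses Python's standard xml.etree.ElementTree purely for illustration (import_html itself goes through PHP's DOMDocument::loadXML()), and the fragments and the urn:example:office namespace are invented for the demo. A prefix that was never declared, like u1:, is fatal, and so is an attribute that appears twice on one element, which is what the "Attribute new redefined" message appears to describe.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return (True, "") if the fragment parses as XML, else (False, parser message)."""
    try:
        ET.fromstring(fragment)
        return True, ""
    except ET.ParseError as err:
        return False, str(err)


# Made-up fragments, loosely modelled on the markup discussed in this thread.
samples = {
    "plain element": "<p>hello</p>",
    "undeclared prefix": "<div><u1:p>hello</u1:p></div>",
    "declared prefix": '<div xmlns:u1="urn:example:office"><u1:p>hello</u1:p></div>',
    "repeated attribute": '<td new="1" new="2">hello</td>',
}

for label, fragment in samples.items():
    ok, message = is_well_formed(fragment)
    print(label, "->", "parses" if ok else "rejected: " + message)
```

The prefixed element is only accepted once its prefix is bound to a namespace, which the Office-generated pages never do.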

The only way forward is for you to run some sort of code tidy-up using search & replace or something - and repair that before we even ask html tidy to parse it. This is outside of what import_html can do at the moment, although maybe we can add a pre-process string tweak to support this sort of problem.
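
As a rough sketch of that kind of search & replace pass (Python, standard library only; this is not part of import_html, and the function name strip_prefixed_tags is hypothetical), the idea is to delete every tag whose name carries a prefix, such as <u1:p> or <st1:place>, while keeping the text inside it. It assumes attribute values never contain a literal ">" and it does nothing about repeated attributes, so treat it as a starting point only.

```python
import re

# Opening, closing and self-closing tags with a prefixed name,
# e.g. <u1:p>, </st1:place>, <o:p/>.
# Assumption: attribute values do not contain '<' or '>'.
PREFIXED_TAG = re.compile(r"</?[A-Za-z][\w.-]*:[\w.-]+(?:\s[^<>]*)?/?>")


def strip_prefixed_tags(html):
    """Remove prefixed tags but keep the text they wrap."""
    return PREFIXED_TAG.sub("", html)


# Invented sample, loosely modelled on the tags tidy complained about.
sample = '<p>Born in <st1:place w:st="on">Tehran</st1:place>.<u1:p></u1:p></p>'
print(strip_prefixed_tags(sample))
```

Ordinary tags are left alone because their names contain no colon; a URL inside an href does not count, since the pattern only looks at the tag name itself. Run something like this over a copy of the source files first, then hand the result to html tidy.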

Or we can all just gang up together and hunt down and kill the developers of Microsoft FrontPage...

sigmundfreud commented:

Dear Dman, this is what I get when I try to import that page:


    * warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: Attribute new redefined in Entity, line: 291 in C:\Program Files\EasyPHP 2.0b1\www\Transform html to database\drupal-5.18\modules\import_html\coders_php_library\xml-transform.inc on line 99.
    * warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: Attribute new redefined in Entity, line: 305 in C:\Program Files\EasyPHP 2.0b1\www\Transform html to database\drupal-5.18\modules\import_html\coders_php_library\xml-transform.inc on line 99.
    * warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: Attribute new redefined in Entity, line: 310 in C:\Program Files\EasyPHP 2.0b1\www\Transform html to database\drupal-5.18\modules\import_html\coders_php_library\xml-transform.inc on line 99.
    * user warning: Failed to parse in xml source. [files/imported/files/bond/sina41003p4.htm] in C:\Program Files\EasyPHP 2.0b1\www\Transform html to database\drupal-5.18\modules\import_html\coders_php_library\xml-transform.inc on line 100.
    * Failed to initialize or parse XMLdoc input
    * Failed to process file 'files/bond/sina41003p4.htm'

Interestingly, there is a tool, http://www.webmaster-toolkit.com/frontpage-code-cleaner.shtml,
which cleans FrontPage code, but it's only an online service, so I can't download it. I am searching for more such tools and will report as soon as I find them.

BTW, if that PHP script can do this, why can't HTMLTidy? It's more popular, isn't it?

sigmundfreud commented:

Hey Dman!

The script behind the FrontPage Code Cleaner mentioned above is available here: http://www.tufat.com/gallery.php?script=216&index=6

Can we use that script to automate the process? Shall I buy the script?

sigmundfreud commented:

Status: new, file size: 25.19 KB

Alas! Even the cleaned code doesn't work! This is the error I get when I try to import:

# warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: Attribute new redefined in Entity, line: 291 in C:\Program Files\EasyPHP 2.0b1\www\Transform html to database\drupal-5.18\modules\import_html\coders_php_library\xml-transform.inc on line 99.
# warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: Attribute new redefined in Entity, line: 305 in C:\Program Files\EasyPHP 2.0b1\www\Transform html to database\drupal-5.18\modules\import_html\coders_php_library\xml-transform.inc on line 99.
# warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: Attribute new redefined in Entity, line: 310 in C:\Program Files\EasyPHP 2.0b1\www\Transform html to database\drupal-5.18\modules\import_html\coders_php_library\xml-transform.inc on line 99.
# user warning: Failed to parse in xml source. [files/imported/files/bond/sina41003p4.htm] in C:\Program Files\EasyPHP 2.0b1\www\Transform html to database\drupal-5.18\modules\import_html\coders_php_library\xml-transform.inc on line 100.
# Failed to initialize or parse XMLdoc input
# Failed to process file 'files/bond/sina41003p4.htm

I am attaching the cleaned page, in case you want to test.

Thank you.

ilessing commented:

When I ran the file 'dontgiveup.txt' through tidy I got three significant errors:

line 172 column 27 - Error: <u1:p> is not recognized!
line 179 column 44 - Error: <u1:p> is not recognized!
line 256 column 44 - Error: <st1:place> is not recognized!

In my experience, when one uses MS Word and does a Save As HTML, you get similarly invalid HTML.

I hope this helps.

sigmundfreud commented:

Hey guys, thank you so much. Strangely, when I try to import the same file in the Drupal 6 dev version, it gets through htmltidy! It gets properly trimmed according to the XSL template!

But one problem remains. According to my XSL template, I want the contents of the table with width='400' to be selected, but some HTML pages are trimmed according to the template and then do not display any data. Can anyone please take a look at the files? What should I use in the XSL template to select only table elements?

I am currently using this XSL:



<xsl:when test="descendant::*[@width='400'] ">
<xsl:comment>Imported From the element called width400</xsl:comment>


This is the code I am trying, but it is not working:



<xsl:when test="descendant::table::*[@width='400'] ">
<xsl:comment>Imported From the element called width400</xsl:comment>


Thanks guys for the amazing support!

dman commented:

Your second use of :: is confusing to me there.

There is some inconsistency in how XSL selectors behave when the XML is using namespaces. I've worked around it by using the prefix xhtml:
Something like:

<xsl:when test="//xhtml:table[@width='400']">

This sometimes works better. I'd like to eliminate the need for that and just use the un-namespaced tag names, but I'm not sure why I need it sometimes.
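
A likely explanation, offered as a sketch rather than a diagnosis of import_html: in XPath 1.0 an unprefixed name only matches elements in no namespace, so once tidy emits XHTML with xmlns="http://www.w3.org/1999/xhtml", a plain table test matches nothing and the prefixed form is required. The snippet below shows the same behaviour with Python's standard xml.etree.ElementTree. The sample document and the helper first_wide_table are invented for the demo, and the helper also covers the earlier question of picking only the first table wider than 400.

```python
import xml.etree.ElementTree as ET

XHTML_NS = {"xhtml": "http://www.w3.org/1999/xhtml"}

# Invented page: a font element with a width, plus three tables.
DOC = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <font width="520">not a table</font>
    <table width="150"><tr><td>navigation</td></tr></table>
    <table width="520"><tr><td>article body</td></tr></table>
    <table width="560"><tr><td>footer</td></tr></table>
  </body>
</html>"""


def first_wide_table(root, min_width=400):
    """Return the first xhtml:table whose numeric width exceeds min_width, or None."""
    for table in root.iterfind(".//xhtml:table", XHTML_NS):
        width = table.get("width", "")
        if width.isdigit() and int(width) > min_width:
            return table
    return None


root = ET.fromstring(DOC)
print(len(root.findall(".//table")))                  # unprefixed name: no matches
print(len(root.findall(".//xhtml:table", XHTML_NS)))  # prefixed name: all three tables
print(first_wide_table(root).get("width"))            # first table wider than 400
```

In the stylesheet itself, the rough equivalent would be a test such as (//xhtml:table[@width &gt; 400])[1], assuming the xhtml prefix is declared on the xsl:stylesheet element with that same namespace URI.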

dman commented:

Status: Active » Fixed

The real problem here is massively broken input, which may be fixable on a case-by-case basis. I'll refer to a possible development at #718794: "Run command on HTML file before tidy & import" and close this support request as answered.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.