Hi. First of all, thank you dman for this awesome module.
I successfully imported a few pages, but when I went to import the whole site, hundreds of pages were left out with this error:
DOMDocument::loadXML() [function.DOMDocument-loadXML]: Attribute new redefined in Entity, line: 273 in C:\Program Files\EasyPHP 2.0b1\www\Transform html to database\drupal-5.18\modules\import_html\coders_php_library\xml-transform.inc on line 99
A less common error was this - Invalid argument supplied for foreach() in C:\Program Files\EasyPHP 2.0b1\www\Transform html to database\drupal-5.18\modules\import_html\import_html.module on line 1402
I would be grateful if you can help me.
| Comment | File | Size | Author |
|---|---|---|---|
| #6 | dontgiveup.txt | 25.19 KB | sigmundfreud |
| #2 | tough.txt | 25.19 KB | sigmundfreud |
Comments
Comment #1
dman commented
Curious.
Are you able to attach one of the problem pages?
I'm not sure about "Attribute new redefined in Entity". Never seen it before. Do the source docs use some custom DOCTYPE?
Comment #2
sigmundfreud commented
THANKS for the reply.
I have attached the html code of the sample page. Not sure what more to attach.
You can see the page at - http://faithfreedom.org/Articles/sina41003p4.htm
I have one more question. I don't know XSLT, so to import those pages (where the only common pattern is a table with width more than 400) I kept adding lines like these to html2simplehtml.xsl -
How do we modify this to select only "table" elements rather than all elements? The HTML page also has a font element (width="520"), which is again causing an error. And how do we select only the first table with width more than 400?
Thanks a lot dman.
Comment #3
dman commented
Scary HTML there.
I see
... which is fine. Also a variant looking like:
It's strange to see this here and different from the first version, but it's good enough.
But also :
Which is just peculiar. Potentially legal XHTML (maybe). But freaky that all three syntaxes are in the same file.
... So that's pretty messed-up input.
Anyway, we hope that html tidy can sort this out for us...
When I tried to pass that page through tidy, I got the following messages. Didn't you?
... No. That input is badly invalid, and hard to validate because it is using namespaces incorrectly - this is worse than not using namespaces at all!
XHTML parsers and html tidy can handle old HTML that didn't know the rules. This code is pretending to follow the new rules, but doing so badly. XML does NOT permit that! :-{
The only way forward is for you to run some sort of code tidy-up using search & replace or something - and repair that before we even ask html tidy to parse it. This is outside of what import_html can do at the moment, although maybe we can add a pre-process string tweak to support this sort of problem.
Or we can all just gang up together and hunt down and kill the developers of Microsoft FrontPage...
Comment #4
sigmundfreud commented
Dear Dman, this is what I get when I try to import that page:
Interestingly, there is a tool at http://www.webmaster-toolkit.com/frontpage-code-cleaner.shtml
which cleans FrontPage code, but it's only an online service, so I can't download it. I am searching for more such tools and will report as soon as I find them.
BTW, if that PHP script can do this, why can't HTML Tidy? It's more popular, isn't it?
Comment #5
sigmundfreud commented
Hey Dman!
The script behind the FrontPage Code Cleaner mentioned above is available here - http://www.tufat.com/gallery.php?script=216&index=6
Can we use that script to automate the process? Shall I buy the script?
Comment #6
sigmundfreud commented
Alas! Even the cleaned code doesn't work! This is the error I get when I try to import
I am attaching the cleaned page, if you want to test.
Thank you.
Comment #7
ilessing commented
When I ran the file 'dontgiveup.txt' through tidy I got three significant errors:
line 172 column 27 - Error: <u1:p> is not recognized!
line 179 column 44 - Error: <u1:p> is not recognized!
line 256 column 44 - Error: <st1:place> is not recognized!
In my experience, when you use MS Word and do a Save As HTML, you get similarly invalid HTML.
I hope this helps.
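The three tidy errors above all come from namespace-prefixed Office tags, which is the kind of thing the search & replace pre-pass suggested in Comment #3 could remove. What follows is only a rough sketch in Python, not part of import_html: the function name `strip_prefixed_tags` and the sample markup are made up for illustration, and it assumes that dropping the prefixed tags while keeping the text they wrap is acceptable. It does not address the "Attribute new redefined" error.

```python
import re

# Matches opening, closing and self-closing tags whose name carries a
# namespace prefix, e.g. <o:p>, </u1:p>, <st1:place w:st="on">.
# Ordinary tags such as <p> or <a href="http://..."> are not matched,
# because their tag name contains no colon.
PREFIXED_TAG = re.compile(r"</?[A-Za-z][\w.-]*:[\w.-]+(?:\s[^<>]*)?/?>")


def strip_prefixed_tags(html: str) -> str:
    """Drop namespace-prefixed tags but keep the text they wrap."""
    return PREFIXED_TAG.sub("", html)


# Hypothetical sample resembling FrontPage / Word output.
sample = '<p>Born in <st1:place w:st="on">Tehran</st1:place>.<u1:p></u1:p></p>'
cleaned = strip_prefixed_tags(sample)
print(cleaned)
```

A regex pass like this is crude (it knows nothing about comments or CDATA), so it is best treated as a clean-up step to run before tidy, with the result checked on a few pages by hand.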
Comment #8
sigmundfreud commented
Hey guys, thank you so much. Strangely, when I try to import the same file in the Drupal 6 dev version, it gets through htmltidy! It gets properly trimmed according to the XSL template!
But one problem remains. According to my XSL template I want the contents of the table with width='400' to be selected, but some HTML pages are trimmed according to the template and then do not display any data. Can anyone please take a look at the files? What should I use in the XSL template to select only "table" elements?
I am currently using this XSL
The code I am trying is not working.....
Thanks guys for the amazing support!
Comment #9
dman commented
Your second use of :: is confusing to me there.
There is some inconsistency with XSL and XSL selectors when the XML is using namespaces. I've worked around it by using the prefix xhtml:
something like:
this sometimes works better. I'd like to eliminate the need for that and just use the un-namespaced tagnames, but I'm not sure why I need it sometimes.
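To illustrate why the prefix matters, and how "only the first table wider than 400" can be expressed, here is a small sketch. It is in Python rather than XSLT so it can be run on its own, and the sample markup is invented. In XPath 1.0 the equivalent selector would be `(//xhtml:table[@width > 400])[1]`, assuming the stylesheet binds the `xhtml` prefix to `http://www.w3.org/1999/xhtml`. Python's built-in ElementTree does not support numeric comparisons in its XPath subset, so the width test is done in a loop instead.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace mapping; the prefix name itself is arbitrary.
NS = {"xhtml": "http://www.w3.org/1999/xhtml"}

# Invented sample: a <font width="520"> plus several tables.
doc = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<font width="520">not a table</font>'
    '<table width="300"><tr><td>narrow</td></tr></table>'
    '<table width="520"><tr><td>main</td></tr></table>'
    '<table width="600"><tr><td>later</td></tr></table>'
    '</body></html>'
)


def first_wide_table(root, min_width=400):
    """Return the first <table> in document order with width > min_width."""
    for table in root.iterfind(".//xhtml:table", NS):
        width = table.get("width", "")
        # Skip missing or non-numeric widths such as "100%".
        if width.isdigit() and int(width) > min_width:
            return table
    return None


table = first_wide_table(doc)
print(table.get("width"))

# Without the prefix nothing matches, because every element in the
# document lives in the XHTML namespace.
print(doc.find(".//table"))
```

Matching on the element name `table` is what keeps the `font` element with width="520" out of the result, and returning at the first hit is what limits the selection to one table.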
Comment #10
dman commented
As the real problem here is massively broken input, which may be fixable on a case-by-case basis, I'll refer to a possible development at #718794: Run command on HTML file before tidy & import, and close this support request as answered.