Hello,
I am trying to import a few pages. After running into some errors, I tried with index.htm only, and here is the result.
* user warning: HTMLTidy failed to parse the input at all! It's probably very problematic HTML. A working version of tidy IS at /usr/bin/tidy isn't it? I ran /usr/bin/tidy -q -config /var/www/intranet/sites/default/modules/import_html/coders_php_library/xhtml_tidy.conf "files/imported/index.htm" and it returned: 2 in /var/www/intranet/sites/default/modules/import_html/coders_php_library/tidy-functions.inc on line 156.
* user warning: Failed to read contents of 'files/imported/index.htm' in /var/www/intranet/sites/default/modules/import_html/coders_php_library/xml-transform.inc on line 79.
* Failed to initialize or parse XMLdoc input
* Failed to process file '/index.htm'
The HTMLTidy command line seems to be wrong. It seems that the input path is relative to my intranet, not absolute.
I tried the command directly in the shell.
Here are my two attempts:
intranet:~# /usr/bin/tidy -q -config /var/www/intranet/sites/default/modules/import_html/coders_php_library/xhtml_tidy.conf "files/imported/index.htm"
Error: Can't open "files/imported/index.htm"
intranet:~# /usr/bin/tidy -q -config /var/www/intranet/sites/default/modules/import_html/coders_php_library/xhtml_tidy.conf "/var/www/intranet/files/imported/index.htm"
line 53 column 251 - Error: <o:p> is not recognized!
line 57 column 45 - Error: <o:p> is not recognized!
line 60 column 304 - Error: <o:p> is not recognized!
line 61 column 317 - Error: <o:p> is not recognized!
line 62 column 328 - Error: <o:p> is not recognized!
line 63 column 66 - Error: <o:p> is not recognized!
intranet:~#
Any help will be appreciated!
Thanks
Comments
Comment #1
dman commented
Yes, tidy runs relative to your Drupal web root, so that command should have behaved as expected.
If you started that test by
you would then be operating in the same context as the PHP (index.php) does, and you'd find the file fine.
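To illustrate the point about working directories, here is a minimal sketch. It uses a throwaway temporary directory as a stand-in for the web root (an assumption made so the example runs anywhere) and shows that the same relative path is missing from one directory and found from another.

```shell
# A temporary directory stands in for the Drupal web root (hypothetical;
# in the report above the web root appears to be /var/www/intranet).
webroot="$(mktemp -d)"
mkdir -p "$webroot/files/imported"
echo '<html></html>' > "$webroot/files/imported/index.htm"

# From an unrelated directory the relative path does not resolve.
cd /
if [ -r files/imported/index.htm ]; then outside=found; else outside=missing; fi

# From the web root the very same relative path resolves.
cd "$webroot"
if [ -r files/imported/index.htm ]; then inside=found; else inside=missing; fi

echo "outside web root: $outside"
echo "inside web root:  $inside"
```

By the same logic, changing to the web root before running the tidy command with the relative "files/imported/index.htm" argument should let tidy open the file, just as the absolute path did in the second attempt.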
The real complaint is, as in the first warning (the others are just cascades from that) that the input file is a problem. The HTMLTidy process cannot handle <o:p> at all.
There is a chance that you can tweak the htmltidy.conf file to tell it that o:p is in fact a valid tag (it's not) or there may be other ways around it depending on your tidy version.
Where on earth did that fake markup come from? Is it at least defined using an XML namespace? If it does have a namespace xmlns:o="http://some.proprietory.dtd/" then HTMLtidy is at fault for not recognising it and dealing with it. Perhaps setting input-xml=true in the conf file may help.
If your source doc does NOT have a namespace declaration, then it's just broken, and you should probably run a string search and replace on all your files before trying to import them.
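One way to do the suggested search and replace is with sed. This is only a sketch: the helper name strip_office_tags and the sample line are made up for illustration, and the pattern removes only the <o:p> opening and closing tags themselves, keeping any text between them.

```shell
# Hypothetical helper: delete <o:p>, </o:p> and <o:p ...> tags from stdin.
# Note the pattern is loose and would also match any tag starting with "o:p".
strip_office_tags() {
  sed -E 's#</?o:p[^>]*>##g'
}

# Made-up sample line in the style of Word-generated HTML.
sample='<p class="MsoNormal">Hello<o:p></o:p></p>'
cleaned="$(printf '%s\n' "$sample" | strip_office_tags)"
echo "$cleaned"
```

To clean real files, the same expression could be run over copies of the documents (for example by redirecting each file through the function into a new file), so the originals stay available if the result is not what you expected.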
Comment #2
dman commented
I see that o:p is from MS Office pseudo-html. I suspected as much, but didn't want to make wild accusations.
Anyway, htmltidy promises to 'strip Word formatting' if you put the appropriate setting in your htmltidy.conf. Try that.
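As a hedged sketch of what that setting might look like: HTML Tidy has options named word-2000, bare and drop-proprietary-attributes that are aimed at Word-generated markup. Whether they are enough to get past the <o:p> errors depends on your tidy version, so treat the fragment below as something to try, not a confirmed fix. Judging by the warning in the original report, the configuration file the module actually reads is xhtml_tidy.conf in the coders_php_library directory.

```
# Possible additions to the tidy configuration file (untested assumption)
word-2000: yes
bare: yes
drop-proprietary-attributes: yes
```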
Comment #3
Jean-Christophe commented
Thanks for your quick answer!
My HTML comes from a "word2html" converter. I believe this is not the best solution, but I have a large number of Word documents I would like to import as nodes.
Best regards,
JC
Comment #4
dman commented