I have finally gotten to the point where I can begin to test importing. Problem: I get as far as the message
XML source sites/l2ltest/files/imported/math/math.htm successfully loaded as an XML doc.
in the demo mode, but then it stalls out and nothing further happens. The debug messages are on. I'm not seeing any errors.

Then I tried entering the website I want to import from and it doesn't find any pages.

Comment  File      Size   Author
#3       math.txt  14 KB  leeksoup

Comments

leeksoup’s picture

Never mind about #2. It is a remote site - not supported, right?
But it is still stalling out on import of a single file.

dman’s picture

Can you supply an example of the page source it's stalling on?
I can't guess too well yet, but there may be something in the XHTML parsing or the tidy process that's just too confusing.
In demo mode, no saves are actually made yet - it should be presenting you with a pre-filled node edit form just showing you what fields have been extracted. You could press 'save' from there.

This error point indicates something probably went wrong at line 1697, in _import_html_process_html_page(). At that point the error handler has been temporarily disabled, so a death there is not being handled. (The reason is hacky: I only want to run 'tidy' if the input is invalid, but can't tell if it's invalid without trying to parse it, which throws errors.)

comment out:

#    set_error_handler('stfu');
    $xmldoc = parse_in_xml_file($path, FALSE);
#    restore_error_handler();

to see the errors.
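
For anyone on PHP 5.1 or later, the parser's complaints can also be collected without swapping the error handler in and out. The following is only a sketch under that assumption and is not what import_html does: the helper name try_parse_xml() is made up for illustration, and it relies on the DOM extension being loaded.

```php
<?php
// Hypothetical helper, not part of import_html: parse a string as XML and
// return the libxml errors instead of letting them surface as PHP warnings.
function try_parse_xml($source) {
  $previous = libxml_use_internal_errors(TRUE);  // collect errors quietly
  $doc = new DOMDocument();
  $ok = $doc->loadXML($source);
  $messages = array();
  foreach (libxml_get_errors() as $error) {
    $messages[] = trim($error->message) . ' (line ' . $error->line . ')';
  }
  libxml_clear_errors();
  libxml_use_internal_errors($previous);  // restore the caller's setting
  return array($ok ? $doc : NULL, $messages);
}

// Deliberately mismatched tags, similar in spirit to the error reported later.
list($doc, $errors) = try_parse_xml('<td><a href="x"></td>');
```

Here $doc comes back as NULL and $errors holds the tag-mismatch messages, which could serve as the cue to run tidy and try again.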

Is there anything quirky in the input - does it even validate already? If not, that should be OK, unless it's broken in a weird way.

Can you post details about the XML and XSLT versions from your phpinfo() ?

leeksoup’s picture

Status  File      Size
new     math.txt  14 KB

Sorry for the slow reply -- I've been out of town.

I'm attaching the page source after it runs through my perl script to tag it. (I changed the extension to txt so I could upload it here.) It is supposed to generate several review type nodes, where the review is a cck type node I have defined. The original html file is online at http://love2learn.net/math/math.htm

Commenting out those code lines didn't change the behavior at all. I am getting messages indicating it is loading OK, i.e. I can see it running through tidy, then it says it is loaded, tidied, stripped of php tags. Then toward the end

Using PHP5 DOM extension to process XML
...
XML source sites/l2ltest/files/imported/math/math.htm successfully loaded as an XML doc.
: in parse_in_xml_file(), line 108 xml-transform.inc
: in _import_html_process_html_page(), line 1712 import_html.module
: in _import_html_import_files(), line 1514 import_html.module
: in import_html_demo_form_submit(), line 916 import_html.module
: in drupal_submit_form(), line 428 form.inc
: in drupal_process_form(), line 258 form.inc
: in drupal_get_form(), line 80 form.inc
0.31s elapsed. (430 total)

The line numbers may be off from your code by a few lines because I added in some code to import_html.module to call my perl script before running tidy. (I would have run it on the cleaned up file but I can't get it to work at all on local files.)

phpinfo() says
xsl
XSL enabled
libxslt Version 1.1.8
libxslt compiled against libxml Version 2.6.11
EXSLT enabled
libexslt Version 1.1.8

(I am using XAMPP for Linux 1.6.)

I haven't yet tuned the import filter -- I wanted to see what I would get with it as-is and go from there.

dman’s picture

Hm. Yep, I'm seeing errors in parsing that file - errors I wouldn't have thought could have made it in there.
Man, that input file is messy (!): multiple titles, wacky nesting, unquoted attributes, incomplete tables and even really odd closing tags.
Good news is that htmltidy (bless it) seems to be able to decipher it all for us!
However, somewhere between "PARSED XML files/imported/var/www/math.htm . XHTML"
and "The source after URL rewriting . XHTML (string)"
<a href="../index.html"><img align="top" border="0" src="../love2learnimage.jpg" width="397" height="85"/></a>
seems to turn into
<a href="../../files/imported/var/love2learnimage.jpg" width="397" height="85"/></a>
... which is so wrong, and should never happen.

I get a different flavour of error from the one you list, but there is certainly a problem there.
Opening and ending tag mismatch: td line 11 and a in Entity, line: 11

It would seem to be the fault of rewrite_href_and_src.xsl, BUT it is more likely linked to the recently reported (fix as-yet uncommitted) problems with collapsing dir-up "../" paths - which I acknowledge as an unfixed issue.

As noted by patricio.keilty
Replacing "../" breaks XHTML eating closing tags ( />)
and it looks like that's what's bitten you - as you have ../ paths next to a singleton element.
His fix, however, is not quite right for your (or the real) problem, which appears to be a bit too much greediness in the pattern (which was signposted as dodgy in the code :-} ).

My current attempted fix would be: replace the line under the warning (circa 2336)

  // collapse dir-up "../" paths. To tricky for XSL. Hope it doesn't break anything
diff -r1.51.2.7 import_html.module
2337c2337
<   $rewritten = preg_replace("|/[^\.][^/]*/\.\./|","/",$rewritten);
---
>   $rewritten = preg_replace('|/[^\.][^/\s"\'>]*/\.\./|','/',$rewritten);

That makes your sample page get through to the next stage for me.
- try that in your version - I'll check this in also. It's still not perfect, but it is a step forward.
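
To see why the wider character class matters, the two patterns can be run side by side. This is a sketch only: the input string is a made-up intermediate (a parent-relative href sitting right before a src that contains a collapsible "www/../" segment), chosen because it reproduces the symptom described above. It is not taken from the module's actual debug output.

```php
<?php
// Made-up input, assumed for illustration only.
$html = '<a href="../../index.html"><img src="../../files/imported/var/www/../love2learnimage.jpg" width="397" height="85"/></a>';

// Old pattern: [^/]* happily runs across quotes, spaces and tag boundaries,
// so the match starts inside the href and ends inside the src.
$old = preg_replace('|/[^\.][^/]*/\.\./|', '/', $html);

// Patched pattern: the segment may no longer contain whitespace, quotes or ">".
$new = preg_replace('|/[^\.][^/\s"\'>]*/\.\./|', '/', $html);

echo $old, "\n";
echo $new, "\n";
```

With this input the old pattern swallows everything from "/index.html" up to the src path, leaving an anchor whose href points at the image and whose img tag has lost its opening. The patched pattern only collapses "www/../".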

.dan.

leeksoup’s picture

Aaargh!
It still makes no difference. It simply hangs at the point I described before and I see no further output. I have debug level set at 8, so I don't know what else to try to figure out what's happening. Any suggestions? Is it running out of some resource? (But I'm not getting any error messages.)

The message indicates that the XML is successfully loaded in the parse_in_xml_file routine. After that message, it is supposed to return, and the very next line of code (#1708) is another debug message which is supposed to show a box with the new-and-improved code. It never does that.

Thanks for the xsl import filter! Now if I can just get it to go through, ....

Yes, I understand about the taxonomy. I'd be willing to try to code that up if I can get anything to actually go through. The taxonomy api looks pretty decent. Content Taxonomy is a no-go, btw. It just doesn't work and issues aren't being addressed as of now.

BTW what are you trying to do with the preg_replace? Does PHP support Perl's '?' modifier to make quantifiers non-greedy (i.e. minimal)? Perl-type regexps I would be happy to assist with.
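
On the non-greedy question: PHP's preg_* functions use PCRE, which accepts Perl's lazy quantifiers (`*?`, `+?`) and also has a `U` modifier that makes quantifiers lazy by default. The sketch below uses made-up input strings. Its second half suggests that laziness alone would not rescue the dir-up pattern, because `[^/]*` can only ever stop at the next slash.

```php
<?php
$text = '<b>one</b> and <b>two</b>';

preg_match('|<b>(.*)</b>|', $text, $greedy);    // runs to the last </b>
preg_match('|<b>(.*?)</b>|', $text, $lazy);     // stops at the first </b>
preg_match('|<b>(.*)</b>|U', $text, $flipped);  // U modifier: lazy by default

// For the dir-up pattern, laziness changes nothing: the class excludes "/",
// so the greedy and lazy forms are forced to match exactly the same text.
$html = '<a href="../../index.html"><img src="../../www/../logo.jpg"/></a>';
$a = preg_replace('|/[^\.][^/]*/\.\./|', '/', $html);
$b = preg_replace('|/[^\.][^/]*?/\.\./|', '/', $html);
```

Both $a and $b end up with the same mangled markup, so the fix has to come from restricting which characters a path segment may contain.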

leeksoup’s picture

The problem seems to be in the xml_tostring($xmldoc) calls. If I comment those out, the code proceeds past that point. Why would that be failing? I tried running it on your project page, i.e. drupal.org/projects/import_html, importing to a vanilla "page" type node with the basic import template, and I get the same problems with that, so I don't think it is my html source that's causing the failure.

A suggestion for the preg_replace:

$rewritten = preg_replace('|/\w[\w\.]+/\.\./|','/',$rewritten);

This is a pretty tight pattern so may not catch all cases of collapsible directories, but should not cause failures like the one you got.

leeksoup’s picture

Status: Active » Closed (fixed)

I figured it out!!!
It stalls when I use PHP 5. I switched back to PHP4 and now it goes through!

I still have some issues to figure out, but at least now I'm getting somewhere. I'll open a new issue for those. This one can be closed.

Thanks for your help!

dman’s picture

I'm still looking for the perfect pattern. These examples each do the job, but haven't been stress tested.
I really need a fix there that resolves multiple levels up correctly. None so far have even tried that.
Any suggestions?

leeksoup’s picture

Here's one possibility:

while (preg_match('|/\w[\w\.]*/\.\./|',$rewritten)) {
    $rewritten = preg_replace('|/\w[\w\.]*/\.\./|','/',$rewritten);
}

Of course, you'd want to expand the character class to include any other allowed chars for dirnames.
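
The loop above could be folded into a small function that keeps collapsing until nothing changes, which is what handles several levels of "../" in a row. This is a sketch, not a tested patch: the name collapse_dir_ups() is hypothetical, and the character class is widened only by a hyphen as an example of "other allowed chars".

```php
<?php
// Hypothetical helper: repeatedly collapse "dir/../" into "/" until the
// text stops changing. Segments must start with a word character or hyphen,
// so leading "../" runs and markup characters are left alone.
function collapse_dir_ups($text) {
  $pattern = '|/[\w\-][\w\.\-]*/\.\./|';
  do {
    $before = $text;
    $text = preg_replace($pattern, '/', $before);
  } while ($text !== $before);
  return $text;
}

$single = collapse_dir_ups('a/b/c/../../d.jpg');
$markup = collapse_dir_ups('<a href="../index.html"><img src="../../files/www/../logo.jpg"/></a>');
```

Each pass removes one "dir/../" pair per path, so 'a/b/c/../../d.jpg' becomes 'a/b/../d.jpg' on the first pass and 'a/d.jpg' on the second. Comparing the string before and after each pass avoids running the same pattern twice per iteration, as the preg_match/preg_replace pairing does.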