This issue is regarding the Import HTML module's demo test function,
admin/build/import_html/demo

Tested with (Acquia) Drupal 6.10.

If a relative url is used rather than an absolute URL, the demo truncates the first character of the URL in the results. for example:

If the test page to be imported is located at:

http://www.supergreatdomain.com/test/page.html

and the url entered in the demo page's "URL to import" is merely:

test/page.html

Then the module only sees "est/page.html" (note the missing "t") and all resulting URLs in the HTML (and on the result page under "URL path settings") are missing that first character.

If the absolute URL is used:

http://www.supergreatdomain.com/test/page.html

It all comes out fine.

(My first issue report... corrections to my procedure are welcomed.)

Comments

dman’s picture

Status: Active » Closed (cannot reproduce)

Closing old issues,
Pretty sure this can't be reproduced any more. I've never seen it.

frank ralf’s picture

Title: demo page truncates first character of relative URL » Demo page truncates first character of relative URL
Version: 6.x-1.x-dev » 7.x-1.0
Issue summary: View changes
Status: Closed (cannot reproduce) » Active

I'm reopening this issue because I have the same problem with version 7.x-1.0. The first character of a relative file path gets truncated on demo page:

Source file s/ites/default/files/garage/tasks/changingtheoil.html is not a normal file.

I think (due to file suffix '') that 's/ites/default/files/garage/tasks/changingtheoil.html' is not a html page I can process.

Frank

frank ralf’s picture

I've had a look at the code and my prime suspect is the ensure_trailing_slash() function in file-routines.inc. Looks like there's a quantifier missing.

/**
 * Adds a slash to a directory path if it wasn't already set
 * Unless input was an empty string.
 *
 * @parameter string $path
 * @returns string path.
 */
function ensure_trailing_slash($path) {
  return preg_replace( '|(.)/?$|', '$1/', $path );
}
dman’s picture

Yeah, good spotting. Looks like that. Wonder how that lasted so long.

dman’s picture

but then again - no, it's not that bad. just does it in an unexpected way. Hm.

frank ralf’s picture

Yes, on second thought the code looks right as the regex is anchored to the end of the string. I think I will have a closer look at the whole path-related code. There's quite some splitting, replacing and re-assembling going on ;-)

frank ralf’s picture

Just another observation from the extended error message:

Fetching content from local 's/ites/default/files/garage/tasks/changingtheoil.html/' now. Saving temp file locally as sites/default/files/imported/ites/default/files/garage/tasks/changingtheoil.html/index.html

The module doesn't seem to recognize the URL as pointing to a file in the first place but to a directory instead (see the added trailing slash to the URL and the addition of index.html to the temp file path). This is also indicated by the empty file suffix in the error message in #2.

My first idea was that the regex in the above mentioned ensure_trailing_slash() function does some unwanted backtracking but I cannot see from the code why it should do that. I'm investigating further ;-)

frank ralf’s picture

Looks like my first suspicion was correct, the culprit is indeed the ensure_trailing_slash() function. You can verify this by adding a marker character (#) after the inserted slash like so:

function ensure_trailing_slash($path) {
  return preg_replace( '|(.)/?$|', '$1/#', $path );
}

This will produce an error message like this one:

Source file s/#ites/default/files/garage/tasks/changingtheoil.html is not a normal file.

However, testing the regex doesn't show any error but when removing the end-of-line anchor ($) I can reproduce the shown behavior. So preg_replace() seems to ignore the anchor.

Will investigate further in that direction and hope to provide a patch soon.

Frank

EDIT:

frank ralf’s picture

Here's another observation. The first character seems to be stripped from the path as can be seen from the following error message.

Importing 'ites/default/files/garage/tasks/changingtheoil.html'	

I think (due to file suffix '') that 's/#ites/default/files/garage/tasks/changingtheoil.html' is not a html page I can process.

It looks as if the path is split, then the ensure_trailing_slash() function adds a slash to the first part ($rel_path) correctly, and then the whole path is re-assembled again. Therefore I'm now suspecting something wrong with the following code from import_html_ui.inc:

...
// Divide the path into two halves (avoiding glue_url())
// to provide a dummy context and a rel_path
$split_at = strrpos($source_path, $url_parts['path']) + 1;
$form_state['values']['source_siteroot'] = $source_siteroot = substr($source_path, 0, $split_at);
$rel_path = substr($source_path, $split_at);
...

This also still leaves the problem why the file extension isn't recognized in the first place.

Frank

frank ralf’s picture

Getting closer, I think. The module uses parse_url() for parsing the input path (import_html_ui.inc, line 2125):

...
$url_parts = parse_url($source_path);
...

But according to http://php.net/manual/en/function.parse-url.php "This function doesn't work with relative URLs."

There might be a workaround by converting the relative path to an absolute one (see http://stackoverflow.com/questions/4444475/transfrom-relative-path-into-...) or to use realpath() instead.

dman’s picture

Thanks for the follow-ups.
If it's *just* the demo page functionality and just the internal links when using the demo, I can easily imagine how that would have slipped through.
That page has only been used for checking the field-mappings and general shape of the page. Once upon a time (D5) it was possible to use the 'save' button and import a page one-by-one, but I stopped using that feature due to ... problems getting the context right when just hitting one page at a time :-}
Image support etc for those was particularly problematic, and had to use different rewrite rules that a real import would in order to pretend to work.

So yeah, some examples would have been sorta-OK when the remote site used absolute URLS. But a remote site with relative URLs ... would have been tricky to sort out. Not impossible or anything, just probably not tested too hard there.

frank ralf’s picture

Thanks for the feedback.

This only seems to affect the demo page functionality and only when using a relative path. AFAICS there are at least two workarounds for this issue:

  1. Use an absolute path with the demo page feature.
  2. Import only a single page with the normal import feature.

Therefore I tend to regard this issue as "Won't fix". Perhaps a hint in the UI to use only absolute paths with the demo feature would be helpful. What do you think?

Kind regards,
Frank

dman’s picture

Well, now that you've narrowed it all the way down to there, I don't think it's a "Won't fix"!
If it's reliably replicable, and just to do with the path re-joining, then a fix is therefore possible.
I did try to note that the demo was just that - a demo (of the scraping, not the re-linking) - and may not be suitable as a base for ongoing imports and getting links exactly right. BUT I would have liked if that were possible.
It's just that trying to calculate whether a local relative link should assume that its target is also being imported (true in bulk, false in demo) was tricky to solve.

I'm sure that a solution for the string truncation issue is now in sight.