In my nodes my internal links are in the format /articletype/articlename.

When I copy such links around, ckeditor changes the link format to ../../../../articletype/articlename (it does the same with image links).

Linkchecker then fails to validate that link and reports it as broken:

http://../../articletype/articlename 0 php_network_getaddresses: getaddrinfo failed: Name or service not known

I could blame ckeditor, but even after ckeditor's manipulation the link works when clicked. So really I'd like Linkchecker to figure out the correct path and accept the link as valid. Is that possible?

P.S. I'm using pathauto.


Comments

hass’s picture

Status: Active » Postponed (maintainer needs more info)

Strange... I've implemented support for this type of path, and there are also tests. We need to figure out why it's not working. These are the cases the tests cover:

<a href="../foo1/bar1">../foo1/bar1</a>
<a href="./foo2/bar2">./foo2/bar2</a>
<a href="../foo3/../foo4/foo5">../foo3/../foo4/foo5</a>
<a href="./foo4/../foo5/foo6">./foo4/../foo5/foo6</a>
<a href="./foo4/./foo5/foo6">./foo4/./foo5/foo6</a>

I guess something could have failed with parse_url() in _linkchecker_absolute_content_path(). There are a couple of possibilities...

One is: how do you call your cron.php? I cannot find the case, but someone had the same issue in the past, and I suggested that he set $base_url in settings.php. The problem was that he accessed his cron.php via SSH from different hosts, which caused relative links to get the hostname of his remote host. Maybe this is the same issue here.

Have you set $base_url to your hostname?

dhope’s picture

Thanks for your response, hass.

cron.php gets called periodically by a cron job on my webserver (set up via my hoster's user interface).

As you suggested, I have now set $base_url in my settings.php.

I've subsequently edited an article which shows up on the report as having two broken image links. They still appear on the report as broken (if I "correct" them by removing the leading ../../../ they are immediately removed from the report).

I do not use the https protocol. Drupal is installed in the root directory of my webspace (but not of the server, which is shared).

Will see whether the next scheduled cron run changes anything.

dhope’s picture

The problem persists. Meanwhile I have inserted watchdog() calls into _linkchecker_absolute_content_path().

If this helps, I get the following output for the variables (node 10214 is the one with the problematic links):

$url: http://{server}/node/10214
$absolute_url: http://{server}/node/10214
$absolute_content_url: http://{server}/node/

hass’s picture

After more thinking, this cannot be a cron issue... The variables are looking good... mysterious issue... If the variables are correct, the values in the linkchecker tables cannot be wrong... Can you take a look into the linkchecker_links table and check whether the full URL is correctly saved with the hostname, or broken in the way you posted first?

hass’s picture

Is it really correct that the URL of your node is /node/10214 ? Do you know why ckeditor adds a path that goes 4 directory levels up, if your content is only 2 directory levels deep?!?

dhope’s picture

Sure. If I query for the lids that come up for that node, I get the following three "broken" links (the link shown in the broken links report matches what's in the database in all cases):

Link 1 (image): http://../sites/default/files/images/Botanikus/Vitis_vinifera9.jpg
Link 2 (image): http://sites/default/files/images/Botanikus/Vitis_vinifera9.jpg
Link 3 (link to other node): http://{articletype}/{article}

The server name is missing in all cases. Link 1 resembles the cases I have seen prior to setting $base_url. Links 2 and 3 are missing the '../' - I'm not sure I saw that before.

The HTML in node edit mode shows the links as

../../../sites/default/files/images/Botanikus/Vitis_vinifera9.jpg
../../sites/default/files/images/Botanikus/Vitis_vinifera9.jpg
../../{articletype}/{article}

All of them show up like that in the HTML output and correctly resolve to

http://{server}/sites/default/files/images/Botanikus/Vitis_vinifera9.jpg or
http://{server}/{articletype}/{article}

when I hover over the link with the mouse, or when the link is clicked. Which is kind of strange, at least for the first one, come to think of it.

EDIT: "Is it really correct that the URL of your node is /node/10214 ?"

Yes, and the URL alias is /{articletype}/{article}. Images however are deeper down in the structure, see above.

EDIT: "Do you know why ckeditor adds a path that goes 4 directory levels up, if your content is only 2 directory levels deep?!?"

No, absolutely not! That would be another path to pursue. Absent a PHP program that periodically scans and removes all leading '../' from my links :/

hass’s picture

Category: support » bug
Status: Postponed (maintainer needs more info) » Active

OK, there are two bugs.

The first and *main* issue is in ckeditor, and the second one is in linkchecker. It would be best if you also open a case in ckeditor to get the far too many directory jumps fixed. It's absolutely incorrect that ckeditor adds so many /../ to the paths. The node/1234/edit directory is only two directory levels deep, and an alias "articletype/articlename" only one directory. No idea why ckeditor makes a directory jump of 4 directories. If ckeditor behaved correctly, it would work as expected!

A few examples that are corrected by your browser:
http://drupal.org/node/../handbook/ -> http://drupal.org/handbook/
http://drupal.org/node/../../handbook/ -> http://drupal.org/handbook/
http://drupal.org/node/../../../handbook/ -> http://drupal.org/handbook/
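
For illustration only, the browser behavior in these examples can be reproduced with a standard URL resolver. This is a minimal sketch in Python rather than the module's PHP; it assumes the page lives at http://drupal.org/node/ and relies on Python 3.5+ `urljoin`, which follows RFC 3986 and silently drops ".." segments that would climb above the root:

```python
from urllib.parse import urljoin

# Assumed base URL: the page on which the relative links appear.
base = "http://drupal.org/node/"

# One, two and three directory jumps all end up at the same place,
# because excess ".." segments above the root are discarded.
for ref in ("../handbook/", "../../handbook/", "../../../handbook/"):
    print(ref, "->", urljoin(base, ref))
```

All three references resolve to the same URL, which is why the link still works when clicked even though the path contains too many jumps.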

Linkchecker should not remove the hostname - that is certainly wrong behavior and needs to be fixed here.

hass’s picture

This type of urls should also cause issues in _drupal_build_css_path() in common.inc.

hass’s picture

As per Wikipedia there seems to be an algorithm described at http://tools.ietf.org/html/rfc3986, 5.2.4. Remove Dot Segments:

Removing dot-segments. The segments “..” and “.” are usually removed from a URL according to the algorithm described in RFC 3986 (or a similar algorithm).
Example: http://www.example.com/../a/b/../c/./d.html → http://www.example.com/a/c/d.html

Hopefully we do not need to re-invent the wheel... No idea if someone has shared PHP code for this, or if there is a PHP function for it... If you have an idea, we will be able to fix this.
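
As a rough sketch of what RFC 3986, section 5.2.4 describes (written in Python for brevity, not the PHP code discussed in this issue, and with a function name chosen here only for illustration), the dot-segment removal loop could look like this:

```python
def remove_dot_segments(path: str) -> str:
    """Remove "." and ".." segments from a URL path (RFC 3986, 5.2.4)."""
    output = []  # completed path segments, each keeping its leading "/"
    while path:
        # Step A: drop a leading "../" or "./".
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        # Step B: replace a leading "/./" or a lone "/." with "/".
        elif path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        # Step C: replace a leading "/../" or a lone "/.." with "/"
        # and remove the last segment already moved to the output.
        elif path.startswith("/../"):
            path = path[3:]
            if output:
                output.pop()
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        # Step D: a lone "." or ".." is discarded.
        elif path in (".", ".."):
            path = ""
        # Step E: move the first segment (up to the next "/") to the output.
        else:
            start = 1 if path.startswith("/") else 0
            end = path.find("/", start)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)


print(remove_dot_segments("/../a/b/../c/./d.html"))
print(remove_dot_segments("/../../articletype/articlename"))
```

The function only handles the path component; scheme, hostname and query string would have to be split off first and re-attached afterwards.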

hass’s picture

Version: 6.x-2.4 » 6.x-2.x-dev

Only as a log note: I tried to keep the hostname intact and ended up with a call like drupal_http_request('http://example.com/../../articletype/articlename'), but this fails with "Bad Request", status code 400. Either we normalize the path before checking the link, or it won't work.

hass’s picture

Title: Link checking for paths beginning with ../ fails » Invalid URI's not normalized / dot-segments not removed

dhope’s picture

Thanks a lot hass.

Well, for my site simply removing or ignoring all leading ../ sequences would work (I don't have any ../ in the middle of my URLs). However, it might break other sites that use those sequences deliberately. I also wonder whether the search engines cope with this.

So I agree it's better to fix it at the source. As suggested, I have opened an issue here: http://drupal.org/node/833416

hass’s picture

Found something at http://drupal.org/node/227232#comment-822372

dhope’s picture

Thank you hass. I hope the length of that thread is not a harbinger for the complexity of a fix...

hass’s picture

I've tested the function in that thread and it seems to work as we need it, but core developers chx and Dries called it "extremely complex"... Here are two patches with the same result, based on the patches found in case #227232, comment #9 (v1) and #11 (v2).

If we had a library for URI normalization it would be much better, as this patch only fixes your specific issue and does not fix all normalization issues. I do not really like adding this code to the module, as it is more or less code that should be in core.

hass’s picture

Status: Active » Needs review

Status: Active » Needs work

The last submitted patch, linkchecker_uri_normalization_remove_dot_segments_v2.patch, failed testing.

hass’s picture

Status: Needs work » Needs review
FileSize: 4.47 KB

hass’s picture

Status: Needs review » Needs work

The last submitted patch, linkchecker_uri_normalization_remove_dot_segments_v2.patch, failed testing.

Ela’s picture

subscribing..

hass’s picture

Priority: Normal » Minor

As we'd like to identify bad links, I have no plans to fix this bug soon. I know browsers implement some workarounds, but this is an analysis tool for identifying bad links, and these links are certainly bad and should be reviewed.

Lowering the priority to minor.

hass’s picture

Version: 6.x-2.x-dev » 7.x-1.x-dev
Status: Needs work » Fixed

Added a small workaround to prevent hostname removal by the normalization regex. This patch intentionally does not implement all RFC rules that browsers implement for URI normalization. Therefore the module will still show over-dot-segmented links as invalid/broken links, so they can be fixed.
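
To make "over-dot-segmented" concrete: a link counts as such when its ".." segments climb above the site root from the page it appears on. The following is a small Python sketch of that check, not the code from the commits below; the function name and the example.com URLs are placeholders, and it assumes the reference is a plain relative path without scheme, query string or fragment:

```python
from urllib.parse import urlsplit


def climbs_above_root(base_url: str, ref: str) -> bool:
    """Return True if resolving ref against base_url steps above the root."""
    # Directory depth of the base: drop the last segment (the document
    # itself) and ignore empty segments.
    base_dirs = [s for s in urlsplit(base_url).path.split("/")[:-1] if s]
    depth = len(base_dirs)
    for segment in ref.split("/"):
        if segment == "..":
            depth -= 1
            if depth < 0:
                return True  # one ".." too many
        elif segment not in ("", "."):
            depth += 1
    return False


# /node/10214 is one directory deep, so a single "../" is fine,
# while "../../" already points above the root.
print(climbs_above_root("http://example.com/node/10214", "../articletype/article"))
print(climbs_above_root("http://example.com/node/10214", "../../articletype/article"))
```

Links for which such a check returns True are the ones a browser silently repairs, and the ones that remain reported as broken so that they can be corrected in the content.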

D6: http://drupalcode.org/project/linkchecker.git/commit/a7d8dd1
D7: http://drupalcode.org/project/linkchecker.git/commit/b046cb9

hass’s picture

Tagging

Status: Fixed » Closed (fixed)
Issue tags: -URI normalization, -dot-segments

Automatically closed -- issue fixed for 2 weeks with no activity.