Invalid URI's not normalized / dot-segments not removed [#832388]

Comment	File	Size	Author
#21	linkchecker_uri_normalization_remove_dot_segments_v2.patch	2.74 KB	hass
#20	linkchecker_uri_normalization_remove_dot_segments_v1.patch	4.47 KB	hass
#16	linkchecker_uri_normalization_remove_dot_segments_v1.patch	4.47 KB	hass
#16	linkchecker_uri_normalization_remove_dot_segments_v2.patch	2.74 KB	hass

Comment #1

hass CreditAttribution: hass commented 20 June 2010 at 12:46

Status:	Active	» Postponed (maintainer needs more info)

Strange... I've implemented support for this type of path, and there are also tests. We need to figure out why it's not working...

<a href="../foo1/bar1">../foo1/bar1</a>
<a href="./foo2/bar2">./foo2/bar2</a>
<a href="../foo3/../foo4/foo5">../foo3/../foo4/foo5</a>
<a href="./foo4/../foo5/foo6">./foo4/../foo5/foo6</a>
<a href="./foo4/./foo5/foo6">./foo4/./foo5/foo6</a>
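
For reference, the five test links above can be resolved against a base URL using RFC 3986 reference resolution. The sketch below uses Python's urllib.parse.urljoin purely as an illustration; it is not the module's PHP code, and the base URL http://example.com/node/1 is an assumption chosen for the example.

```python
from urllib.parse import urljoin

# Assumed URL of the page that contains the links (hypothetical).
BASE = "http://example.com/node/1"

# The relative hrefs from the test markup above.
HREFS = [
    "../foo1/bar1",
    "./foo2/bar2",
    "../foo3/../foo4/foo5",
    "./foo4/../foo5/foo6",
    "./foo4/./foo5/foo6",
]


def resolve_all(base, hrefs):
    """Resolve each relative href against base; urljoin drops the dot-segments."""
    return {href: urljoin(base, href) for href in hrefs}


if __name__ == "__main__":
    for href, absolute in resolve_all(BASE, HREFS).items():
        print(f"{href} -> {absolute}")
```

With this base, "../foo1/bar1" resolves to http://example.com/foo1/bar1 and "./foo2/bar2" resolves to http://example.com/node/foo2/bar2; the hostname is preserved in every case.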

I guess something could have failed with parse_url() in _linkchecker_absolute_content_path(). There are a couple of possibilities...

One is: how do you call your cron.php? I cannot find the case, but someone had the same issue in the past, and I suggested that he set $base_url in settings.php. The problem was that he accessed his cron.php via SSH from different hosts, which caused relative links to get the hostname of his remote host. Maybe this is the same issue here.

Have you set $base_url to your hostname?

Comment #2

hass CreditAttribution: hass commented 20 June 2010 at 12:48

Maybe a duplicate of #563464: Internal links reported as broken incorrectly on SSL only site.

Comment #3

dhope CreditAttribution: dhope commented 20 June 2010 at 14:58

Thanks for your response, hass.

cron.php gets called periodically by a cron job on my web server (set up via my hoster's user interface).

As you suggested I have now set $base_url in my settings.php.

I've subsequently edited an article which shows up on the report as having two broken image links. They still appear on the report as broken (if I "correct" them by removing the leading ../../../ they are immediately removed).

I do not use the https protocol. I have Drupal installed in the root directory of my webspace (but not of the server, which is shared).

Will see whether the next scheduled cron run changes anything.

Comment #4

dhope CreditAttribution: dhope commented 20 June 2010 at 16:02

The problem persists. Meanwhile I have inserted watchdog() calls into _linkchecker_absolute_content_path().

If this helps, I get the following output for the variables (node 10214 is the one with the problematic links):

$url: http://{server}/node/10214
$absolute_url: http://{server}/node/10214
$absolute_content_url: http://{server}/node/
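
The relationship between the logged values can be sketched as follows: the content base is the page URL with its last path segment dropped. This is an illustration in Python, not the module's PHP; content_base is a hypothetical helper, and example.com stands in for {server}.

```python
from urllib.parse import urlsplit, urlunsplit


def content_base(url):
    """Return the 'directory' of a URL: everything up to the last '/' in the path."""
    parts = urlsplit(url)
    # Drop the final path segment and keep a trailing slash.
    directory = parts.path.rsplit("/", 1)[0] + "/"
    # Query and fragment are discarded; scheme and host are kept untouched.
    return urlunsplit((parts.scheme, parts.netloc, directory, "", ""))


if __name__ == "__main__":
    print(content_base("http://example.com/node/10214"))
```

For http://example.com/node/10214 this yields http://example.com/node/, which matches the shape of the $absolute_content_url value logged above.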

Comment #5

hass CreditAttribution: hass commented 20 June 2010 at 17:53

After more thinking, this cannot be a cron issue. The variables look good... mysterious issue. If the variables are correct, the values in the linkchecker tables cannot be wrong. Can you take a look at the linkchecker_links table to see whether the full URL is saved correctly with the hostname, or broken in the way you first posted?

Comment #6

hass CreditAttribution: hass commented 20 June 2010 at 22:51

Is it really correct that the URL of your node is /node/10214 ? Do you know why ckeditor adds a path that goes 4 directory levels up, if your content is only 2 directory levels deep?!?

Comment #7

dhope CreditAttribution: dhope commented 20 June 2010 at 23:55

Sure. If I look up the lids that come up for that node, I get the following three "broken" links (in all cases the link shown in the broken links report matches what is in the database):

Link 1 (image): http://../sites/default/files/images/Botanikus/Vitis_vinifera9.jpg
Link 2 (image): http://sites/default/files/images/Botanikus/Vitis_vinifera9.jpg
Link 3 (link to other node): http://{articletype}/{article}

The server name is missing in all cases. Link 1 resembles the cases I saw before setting $base_url. Links 2 and 3 are missing the '../'; I'm not sure I saw that before.

The HTML in node edit mode shows the links as

../../../sites/default/files/images/Botanikus/Vitis_vinifera9.jpg
../../sites/default/files/images/Botanikus/Vitis_vinifera9.jpg
../../{articletype}/{article}

All of them show up like that in the HTML output and correctly resolve to

http://{server}/sites/default/files/images/Botanikus/Vitis_vinifera9.jpg or
http://{server}/{articletype}/{article}

when I hover over the link with the mouse, or when the link is clicked. Which is kind of strange, at least for the first one, come to think of it.

EDIT: "Is it really correct that the URL of your node is /node/10214 ?"

Yes, and the URL alias is /{articletype}/{article}. Images however are deeper down in the structure, see above.

EDIT: "Do you know why ckeditor adds a path that goes 4 directory levels up, if your content is only 2 directory levels deep?!?"

No, absolutely not! That would be another path to pursue. Absent a PHP program that periodically scans and removes all leading '../' from my links :/

Comment #8

hass CreditAttribution: hass commented 21 June 2010 at 06:41

Category:	support	» bug
Status:	Postponed (maintainer needs more info)	» Active

OK, there are two bugs.

The first and *main* issue is in ckeditor, and the second one is in linkchecker. It would be best if you also opened a case in the ckeditor queue to get the excessive directory jumps fixed. It is absolutely incorrect that ckeditor adds so many /../ to the paths. The node/1234/edit path is only two directory levels deep, and an alias like "articletype/articlename" only one. No idea why ckeditor makes a jump of 4 directories. If ckeditor behaved correctly, it would work as expected!

A few examples that are corrected by your browser:
http://drupal.org/node/../handbook/ -> http://drupal.org/handbook/
http://drupal.org/node/../../handbook/ -> http://drupal.org/handbook/
http://drupal.org/node/../../../handbook/ -> http://drupal.org/handbook/

Linkchecker should not remove the hostname. That is certainly wrong behavior and needs to be fixed here.
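
One way a hostname can get lost is when a "segment/../" collapsing pattern is applied to the whole URL instead of only to the path. The sketch below is a guess at that failure mode, written in Python for illustration; the pattern and the naive_collapse helper are hypothetical and are not taken from the module's code.

```python
import re

# Hypothetical naive pattern: any non-slash run followed by "/../".
# It does not know where the host ends and the path begins.
NAIVE_PARENT = re.compile(r"[^/]+/\.\./")


def naive_collapse(url):
    """Repeatedly remove the leftmost 'segment/../' until nothing changes."""
    previous = None
    while previous != url:
        previous = url
        url = NAIVE_PARENT.sub("", url, count=1)
    return url


if __name__ == "__main__":
    for levels in ("../", "../../", "../../../"):
        url = "http://example.com/node/" + levels + "sites/default/files/a.jpg"
        print(url, "->", naive_collapse(url))
```

With one "../" the result is still correct, but with two the host itself is treated as a path segment and removed, which gives URLs of the same shape as the broken links reported in comment #7 (http://sites/... and http://../sites/...).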

Comment #9

hass CreditAttribution: hass commented 21 June 2010 at 06:43

This type of URL should also cause issues in _drupal_build_css_path() in common.inc.

Comment #10

hass CreditAttribution: hass commented 21 June 2010 at 07:30

As per Wikipedia there seems to be an algorithm described at http://tools.ietf.org/html/rfc3986, 5.2.4. Remove Dot Segments:

Removing dot-segments. The segments “..” and “.” are usually removed from a URL according to the algorithm described in RFC 3986 (or a similar algorithm).
Example: http://www.example.com/../a/b/../c/./d.html → http://www.example.com/a/c/d.html
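
The algorithm from RFC 3986, section 5.2.4 can be sketched directly from its step descriptions. The version below is in Python for illustration only (the module itself is PHP); normalize_url is a hypothetical wrapper that applies the algorithm to the path component and leaves scheme and host alone.

```python
from urllib.parse import urlsplit


def remove_dot_segments(path):
    """Remove '.' and '..' segments from a path, following RFC 3986, 5.2.4."""
    output = []
    while path:
        if path.startswith("../"):        # step A
            path = path[3:]
        elif path.startswith("./"):       # step A
            path = path[2:]
        elif path.startswith("/./"):      # step B
            path = path[2:]
        elif path == "/.":                # step B
            path = "/"
        elif path.startswith("/../"):     # step C: drop the last output segment
            path = path[3:]
            if output:
                output.pop()
        elif path == "/..":               # step C
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):         # step D
            path = ""
        else:                             # step E: move the first segment to output
            next_slash = path.find("/", 1)
            if next_slash == -1:
                output.append(path)
                path = ""
            else:
                output.append(path[:next_slash])
                path = path[next_slash:]
    return "".join(output)


def normalize_url(url):
    """Apply remove_dot_segments to the path only; the host is never touched."""
    parts = urlsplit(url)
    return parts._replace(path=remove_dot_segments(parts.path)).geturl()


if __name__ == "__main__":
    print(normalize_url("http://www.example.com/../a/b/../c/./d.html"))
```

Because only the path is rewritten, surplus ".." segments at the root are discarded rather than eating into the hostname, which is the browser behavior shown in the drupal.org/handbook examples in comment #8.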

Hopefully we do not need to reinvent the wheel. I have no idea whether someone has shared PHP code for this, or whether there is a PHP function for it. If you have an idea, we will be able to fix this.

Comment #11

hass CreditAttribution: hass commented 21 June 2010 at 07:22

Version:	6.x-2.4	» 6.x-2.x-dev

Only as a log note: I tried to keep the hostname intact and ended up with a call like drupal_http_request('http://example.com/../../articletype/articlename'), but this fails with "Bad Request", status code 400. Either we normalize the path before checking the link, or it won't work.
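
A simple guard before sending the request is to test whether the URL's path still contains dot-segments. The sketch below is in Python for illustration; has_dot_segments is a hypothetical helper, not an existing Drupal or linkchecker function.

```python
from urllib.parse import urlsplit


def has_dot_segments(url):
    """Return True if the URL's path still contains a '.' or '..' segment."""
    segments = urlsplit(url).path.split("/")
    return any(segment in (".", "..") for segment in segments)


if __name__ == "__main__":
    url = "http://example.com/../../articletype/articlename"
    if has_dot_segments(url):
        print("not normalized, do not request as-is:", url)
```

Only whole segments are compared, so dots inside a file name such as "archive..bak" do not trigger the check.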

Comment #12

hass CreditAttribution: hass commented 21 June 2010 at 08:44

Title:	Link checking for paths beginning with ../ fails	» Invalid URI's not normalized / dot-segments not removed

Comment #13

dhope CreditAttribution: dhope commented 21 June 2010 at 10:24

Thanks a lot hass.

Well, for my site simply removing or ignoring all leading ../ sequences would work (I don't have any ../ in the middle of my URLs). However, it might break other sites that use those sequences deliberately. Also, I wonder whether search engines cope with this.

So I agree it's better to fix it at the source. As suggested, I have opened an issue here: http://drupal.org/node/833416
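
To illustrate why blindly stripping leading "../" sequences could break other sites, the Python sketch below compares the resolved target of a deliberate "../" link before and after stripping. strip_leading_parent_refs is a hypothetical helper, and the URLs are made up for the example.

```python
from urllib.parse import urljoin


def strip_leading_parent_refs(href):
    """Drop every leading '../' from a relative href (the naive fix)."""
    while href.startswith("../"):
        href = href[3:]
    return href


if __name__ == "__main__":
    base = "http://example.com/a/b/page"  # assumed page URL
    href = "../foo"                       # deliberate link to the parent directory
    print(urljoin(base, href))
    print(urljoin(base, strip_leading_parent_refs(href)))
```

Here the original link resolves to http://example.com/a/foo, while the stripped one resolves to http://example.com/a/b/foo, so the naive fix changes the target of a valid link.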

Comment #14

hass CreditAttribution: hass commented 26 June 2010 at 11:28

Found something at http://drupal.org/node/227232#comment-822372

Comment #15

dhope CreditAttribution: dhope commented 27 June 2010 at 08:04

Found something at http://drupal.org/node/227232#comment-822372

Thank you hass. I hope the length of that thread is not a harbinger for the complexity of a fix...

Comment #16

hass CreditAttribution: hass commented 27 June 2010 at 10:29

Status:	Needs review	» Active

File	Size
linkchecker_uri_normalization_remove_dot_segments_v2.patch	2.74 KB
linkchecker_uri_normalization_remove_dot_segments_v1.patch	4.47 KB

I've tested the function in that thread and it seems to work as we need it, but core developers chx and Dries called it "extremely complex". Here are two patches that give the same result, based on the patches found in issue #227232, comment #9 (v1) and comment #11 (v2).

If we had a library for URI normalization, it would be much better, as this patch only fixes your specific issue and does not fix all normalization issues. I do not really like adding this code to the module, as it is more or less code that should be in core.

Comment #17

hass CreditAttribution: hass commented 27 June 2010 at 10:28

Status:	Active	» Needs review

Comment #18

27 June 2010 at 10:40

Status:	Active	» Needs work

The last submitted patch, linkchecker_uri_normalization_remove_dot_segments_v2.patch, failed testing.

Comment #19

hass CreditAttribution: hass commented 27 June 2010 at 12:04

One more: http://labs.apache.org/webarch/uri/rev-2002/uri_test.pl (see remove_dot_segments)

Comment #20

hass CreditAttribution: hass commented 27 June 2010 at 20:18

Status:	Needs work	» Needs review

File	Size
linkchecker_uri_normalization_remove_dot_segments_v1.patch	4.47 KB

Comment #21

hass CreditAttribution: hass commented 27 June 2010 at 20:18

File	Size
linkchecker_uri_normalization_remove_dot_segments_v2.patch	2.74 KB

Comment #22

27 June 2010 at 20:30

Status:	Needs review	» Needs work

The last submitted patch, linkchecker_uri_normalization_remove_dot_segments_v2.patch, failed testing.

Comment #23

Ela CreditAttribution: Ela commented 18 November 2010 at 02:56

subscribing..

Comment #24

hass CreditAttribution: hass commented 29 December 2011 at 21:47

Priority:	Normal	» Minor

As we'd like to identify bad links, I have no plans to fix this bug soon. I know browsers implement some workarounds, but this is an analysis tool for identifying bad links; these links are certainly bad and should be reviewed.

Lowering priority to minor.

Comment #25

hass CreditAttribution: hass commented 8 January 2012 at 14:46

Version:	6.x-2.x-dev	» 7.x-1.x-dev
Status:	Needs work	» Fixed

Added a small workaround to prevent hostname removal by the normalization regex. This patch intentionally does not implement all RFC rules that browsers implement for URI normalization. Therefore the module will still show over-dot-segmented links as invalid/broken links, so they can be fixed.
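
What "over-dot-segmented" means can be sketched as a depth check: a relative link is suspicious when it has more ".." segments than the page's path has directories. The Python below is an illustration under that assumption only; climbs_above_root is a hypothetical helper and does not reproduce the code in the commits, and it assumes the href is a plain relative path without query or fragment.

```python
from urllib.parse import urlsplit


def climbs_above_root(base_url, href):
    """Return True if href uses more '..' segments than base_url has directories."""
    # Directories of the base path: drop the last segment (the page itself).
    base_dirs = [s for s in urlsplit(base_url).path.split("/")[:-1] if s]
    depth = len(base_dirs)
    for segment in href.split("/"):
        if segment == "..":
            depth -= 1
            if depth < 0:
                return True  # tried to go above the site root
        elif segment not in ("", "."):
            depth += 1
    return False


if __name__ == "__main__":
    base = "http://example.com/node/10214"  # example.com stands in for {server}
    for href in ("../sites/default/files/a.jpg", "../../../sites/default/files/a.jpg"):
        print(href, "->", climbs_above_root(base, href))
```

A browser would silently clamp such a link at the root, whereas a checker that flags it gives the author a chance to correct the markup.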

D6: http://drupalcode.org/project/linkchecker.git/commit/a7d8dd1
D7: http://drupalcode.org/project/linkchecker.git/commit/b046cb9