In my nodes, internal links are in the format /articletype/articlename.
When I copy such links around, ckeditor changes the link format to ../../../../articletype/articlename (it does the same with image links).
Linkchecker then fails to validate that link and reports it as broken:
http://../../articletype/articlename 0 php_network_getaddresses: getaddrinfo failed: Name or service not known
I could blame ckeditor; however, even after ckeditor's manipulation the link works when clicked, so really I'd like Linkchecker to figure out the correct path and accept the link as valid.
P.S. I'm using pathauto.
Comments
Comment #1
hass commented
Strange... I've implemented support for this type of path, and there are tests for it too. We need to figure out why it's not working.
I guess something could have failed with parse_url() in _linkchecker_absolute_content_path(). There are a couple of possibilities...
One is: how do you call your cron.php? I cannot find the case, but someone had the same issue in the past, and I suggested that he set $base_url in settings.php. The problem was that he accessed his cron.php via SSH from different hosts, and this caused relative links to get the hostname of his remote host. Maybe this is the same issue here.
Have you set $base_url to your hostname?
Comment #2
hass commented
Maybe a duplicate of #563464: Internal links reported as broken incorrectly on SSL only site.
Comment #3
dhope commented
Thanks for your response, hass.
cron.php gets called periodically by a cronjob for my webserver (set up via my hoster's user interface).
As you suggested, I have now set $base_url in my settings.php.
I've subsequently edited an article which shows up on the report as having two broken image links. They still appear on the report as broken (if I "correct" them by removing the leading ../../../ they are immediately removed).
I do not use the https protocol. I have Drupal installed in the root directory of my webspace (but not of the server, which is shared).
Will see whether the next scheduled cron run changes anything.
Comment #4
dhope commented
The problem persists. Meanwhile I have inserted watchdog() calls into _linkchecker_absolute_content_path().
If this helps, I get the following output for the variables (node 10214 is the one with the problematic links):
$url: http://{server}/node/10214
$absolute_url: http://{server}/node/10214
$absolute_content_url: http://{server}/node/
Comment #5
hass commented
After more thinking, this cannot be a cron issue. The variables look good... mysterious issue. If the variables are correct, the entries in the linkchecker tables should not be wrong. Can you take a look at the linkchecker_links table and check whether the full URL is saved correctly with the hostname, or broken in the way you posted first?
Comment #6
hass commented
Is it really correct that the URL of your node is /node/10214 ? Do you know why ckeditor adds a path that goes 4 directory levels up, if your content is only 2 directory levels deep?!?
Comment #7
dhope commented
Sure. If I look up the lids that come up for that node, I get the following three "broken" links (in all cases the link shown in the broken links report matches what's in the database):
Link 1 (image): http://../sites/default/files/images/Botanikus/Vitis_vinifera9.jpg
Link 2 (image): http://sites/default/files/images/Botanikus/Vitis_vinifera9.jpg
Link 3 (link to other node): http://{articletype}/{article}
The server name is missing in all cases. Link 1 resembles the cases I saw before setting $base_url. Links 2 and 3 are missing the '../'; I'm not sure I saw that before.
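For illustration, a minimal Python sketch (not Linkchecker's PHP code) of how a generic URL parser reads the stored values. Once the server name is missing, whatever follows http:// is taken as the host, which would be consistent with the getaddrinfo failure in the report. The helper name host_of is an assumption made for this example.

```python
from urllib.parse import urlsplit

def host_of(url: str) -> str:
    """Return the network-location part, i.e. what a resolver would look up."""
    return urlsplit(url).netloc

# Stored values as reported in this thread.
stored = [
    "http://../../articletype/articlename",
    "http://../sites/default/files/images/Botanikus/Vitis_vinifera9.jpg",
    "http://sites/default/files/images/Botanikus/Vitis_vinifera9.jpg",
]

for url in stored:
    # The first path segment ends up in the host position.
    print(repr(host_of(url)), "<-", url)
```

In each case the "host" is either ".." or the first directory name, so the name lookup cannot succeed.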
The HTML in node edit mode shows the links as
../../../sites/default/files/images/Botanikus/Vitis_vinifera9.jpg
../../sites/default/files/images/Botanikus/Vitis_vinifera9.jpg
../../{articletype}/{article}
All of them show up like that in the HTML output and correctly resolve to
http://{server}/sites/default/files/images/Botanikus/Vitis_vinifera9.jpg or
http://{server}/{articletype}/{article}
when I hover with the mouse over the link, or when the link is clicked. Which is kind of strange, at least for the first one, come to think of it.
EDIT: "Is it really correct that the URL of your node is /node/10214 ?"
Yes, and the URL alias is /{articletype}/{article}. Images however are deeper down in the structure, see above.
EDIT: "Do you know why ckeditor adds a path that goes 4 directory levels up, if your content is only 2 directory levels deep?!?"
No, absolutely not! That would be another path to pursue. Failing that, I would need a PHP program that periodically scans my links and removes all leading '../' :/
Comment #8
hass commented
OK, there are two bugs.
The first and *main* issue is in ckeditor, and a second one is in linkchecker. It would be best if you also open a case for ckeditor to get the excessive directory jumps fixed. It's absolutely incorrect that ckeditor adds so many /../ into the paths. The node/1234/edit directory is only two directory levels deep, and an alias "articletype/articlename" only one directory. No idea why ckeditor makes a jump of 4 directories. If ckeditor behaved correctly, it would work as expected!
A few examples that are corrected by your browser:
http://drupal.org/node/../handbook/ -> http://drupal.org/handbook/
http://drupal.org/node/../../handbook/ -> http://drupal.org/handbook/
http://drupal.org/node/../../../handbook/ -> http://drupal.org/handbook/
Linkchecker should not remove the hostname; that is certainly wrong behavior and needs to be fixed here.
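As a rough check of the browser examples above, here is a small Python sketch. Python's urllib follows RFC 3986 for reference resolution; this is neither ckeditor nor Linkchecker code. The host example.com and the helper name resolve are assumptions for illustration.

```python
from urllib.parse import urljoin

def resolve(base: str, link: str) -> str:
    """Resolve a relative link against the page URL, RFC 3986 style."""
    return urljoin(base, link)

# Hypothetical host; the node path is the one mentioned in this thread.
base = "http://example.com/node/10214"

for link in (
    "../articletype/articlename",
    "../../articletype/articlename",
    "../../../../articletype/articlename",
):
    # Surplus "../" segments that would climb above the root are dropped.
    print(link, "->", resolve(base, link))
```

All three links resolve to the same URL under the site root, which matches the observation that the over-dotted links still work when clicked.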
Comment #9
hass commented
This type of URL should also cause issues in _drupal_build_css_path() in common.inc.
Comment #10
hass commented
According to Wikipedia, there is an algorithm described at http://tools.ietf.org/html/rfc3986, section 5.2.4, Remove Dot Segments.
Hopefully we do not need to reinvent the wheel. I have no idea whether someone has shared PHP code for this, or whether there is a PHP function for it. If you have an idea, we can fix this.
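For reference, the steps of RFC 3986 section 5.2.4 can be sketched roughly as follows. This is written in Python purely for illustration and is not the code that went into Linkchecker. The function names remove_dot_segments and normalize_url are my own, and a PHP port would still be needed for the module.

```python
from urllib.parse import urlsplit, urlunsplit

def remove_dot_segments(path: str) -> str:
    """Sketch of RFC 3986 section 5.2.4 (steps A to E) on a path string."""
    output = []  # path segments, each kept with its leading "/" if it had one
    while path:
        if path.startswith("../"):        # A: drop a leading "../"
            path = path[3:]
        elif path.startswith("./"):       # A: drop a leading "./"
            path = path[2:]
        elif path.startswith("/./"):      # B: "/./" becomes "/"
            path = path[2:]
        elif path == "/.":                # B: a trailing "/." becomes "/"
            path = "/"
        elif path.startswith("/../"):     # C: go up one level
            path = path[3:]
            if output:
                output.pop()
        elif path == "/..":               # C: a trailing "/.." goes up one level
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):         # D: a lone dot segment disappears
            path = ""
        else:                             # E: move the first segment to the output
            end = path.find("/", 1)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)

def normalize_url(url: str) -> str:
    """Normalize only the path; scheme, host, query and fragment are kept."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=remove_dot_segments(parts.path)))

print(normalize_url("http://example.com/../../articletype/articlename"))
```

Because only the path component is rewritten, the hostname stays in place, and dot segments that would climb above the root are discarded.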
Comment #11
hass commented
Only as a log note: I tried to keep the hostname intact and ended up with a request like drupal_http_request('http://example.com/../../articletype/articlename'), but this fails with "Bad Request", status code 400. Either we normalize the path before checking the link, or it won't work.
Comment #12
hass commented
Comment #13
dhope commented
Thanks a lot, hass.
Well, for my site simply removing or ignoring all leading ../ sequences would work (I don't have any ../ in the middle of my URLs); however, it might break other sites that use those sequences deliberately. Also, I wonder whether search engines cope with this.
So I agree it's better to fix it at the source. As suggested, I have opened an issue here: http://drupal.org/node/833416
Comment #14
hass commented
Found something at http://drupal.org/node/227232#comment-822372
Comment #15
dhope commented
"Found something at http://drupal.org/node/227232#comment-822372"
Thank you, hass. I hope the length of that thread is not a harbinger of the complexity of a fix...
Comment #16
hass commented
I've tested the function in that thread and it seems to work as we need it, but core developers chx and Dries called it "extremely complex"... Here are two patches with the same result, based on the patches found in case #227232, comment #9 (v1) and #11 (v2).
If we had a library for URI normalization it would be much better, as this patch only fixes your specific issue and does not fix all normalization issues. I do not really like adding this code to the module, as it is more or less code that should be in core.
Comment #17
hass commented
Comment #19
hass commented
One more: http://labs.apache.org/webarch/uri/rev-2002/uri_test.pl, remove_dot_segments
Comment #20
hass commented
Comment #21
hass commented
Comment #23
Ela commented
subscribing..
Comment #24
hass commented
As we'd like to identify bad links, I have no plans to fix this bug soon. I know browsers implement some workarounds, but this is an analysis tool for identifying bad links; these links are bad for sure and should get a review.
Lowering priority to minor.
Comment #25
hass commented
Added a small workaround to prevent hostname removal by the normalization regex. This patch intentionally does not implement all RFC rules that browsers implement for URI normalization. Therefore the module will still show over-dot-segmented links as invalid/broken links, so they can be fixed.
D6: http://drupalcode.org/project/linkchecker.git/commit/a7d8dd1
D7: http://drupalcode.org/project/linkchecker.git/commit/b046cb9
Comment #26
hass commented
Tagging