Internal links reported as broken incorrectly on SSL only site [#563464]

On an SSL-only site, internal links such as those generated by url('node/1') are reported as broken. Linkchecker appears to be testing them without SSL: for example, a link that appears on the site as <a href="node/1"> is reported as a broken link to http://www.example.com/node/1, when it should be tested and found at https://www.example.com/node/1

I've only seen this with relative links; absolute links to SSL sites, whether internal or external, behave correctly.
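
To illustrate the report above: a relative href carries no scheme of its own, so the URL that ends up being checked depends entirely on the base it is resolved against. The following is only a sketch using Python's standard `urljoin` and the example.com host from the description; it is not linkchecker's code.

```python
from urllib.parse import urljoin

href = "node/1"  # relative link as it appears in the page source

# Resolved against an http:// base, the link points at the non-SSL site,
# which an SSL-only server does not serve.
wrong = urljoin("http://www.example.com/", href)

# Resolved against the https:// base the site is actually served from.
right = urljoin("https://www.example.com/", href)

print(wrong)
print(right)
```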

Comment	File	Size	Author
#9	linkchecker_563464-D6.patch	1.1 KB	hass
#5	linkchecker.module.diff	704 bytes	corporatebastard

Comments

Comment #1

hass commented 30 August 2009 at 10:18

Category:	bug	» support
Status:	Active	» Postponed (maintainer needs more info)

How have you built your SSL-only site? Are any special SSL modules installed?

Have you set $base_url in your settings.php to the https://example.com URL? If it is set, url() may be able to prefix the URLs correctly (untested).

Comment #2

hass commented 30 August 2009 at 10:57

Title:	Internal links reported as broken incorrectly on SSL only site	» internal links reported as broken incorrectly on ssl site

Tested, WFM.

Comment #3

hass commented 30 August 2009 at 10:57

Title:	internal links reported as broken incorrectly on ssl site	» Internal links reported as broken incorrectly on SSL only site
Status:	Postponed (maintainer needs more info)	» Fixed

Comment #4

corporatebastard commented 30 August 2009 at 14:30

Title:	internal links reported as broken incorrectly on ssl site	» Internal links reported as broken incorrectly on SSL only site

The site in question is under development here: https://secure.webmast.co.uk/amp/dev

It's not just links made with url(); it affects any relative link. Take a look at the bottom of this page:
https://secure.webmast.co.uk/amp/dev/content/subversion where there are two links, one to contact ("get in touch") made with url('contact'); and one to home made with <a href="/amp/dev">home</a>. Both are reported as broken although they work fine.

$base_url is not set. I'm not clear exactly what setting $base_url does, but it appears not to be necessary and is described as optional.

There are no special SSL modules.

Comment #5

corporatebastard commented 30 August 2009 at 17:24

Status:	Fixed	» Needs review

Status	File	Size
new	linkchecker.module.diff	704 bytes

Found the problem, attached proposed solution, please review.

Comment #6

hass commented 30 August 2009 at 18:39

Are you able to verify if anchors and URL parameters like "#foo" and "?foo=bar" and "./foo/bar" and "../foo/bar" as href values are extracted correctly? I need to know if they are prefixed with "https://" or "http://".
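
As a rough way to see what correct extraction of these four href forms should look like, the sketch below resolves each of them against an HTTPS page URL with Python's standard `urljoin`. The page URL is a made-up example and the code is not linkchecker's extraction logic; it only shows that every resolved URL inherits the `https://` scheme of the page it was found on.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical page on an SSL-only site where the links were found.
page_url = "https://www.example.com/content/page"

# The four href forms asked about in this comment.
hrefs = ["#foo", "?foo=bar", "./foo/bar", "../foo/bar"]

# Map each href to the fully qualified URL a checker would have to test.
resolved = {href: urljoin(page_url, href) for href in hrefs}

for href, url in resolved.items():
    print(f"{href!r:14} -> {url}")

# Every result keeps the scheme of the page it was resolved against.
schemes = {urlsplit(url).scheme for url in resolved.values()}
```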

Comment #7

corporatebastard commented 30 August 2009 at 20:24

They all get extracted OK. On the SSL-only site they get prefixed with https://, and I've also checked with a non-SSL site, where they get prefixed with http://

A bit unrelated but links to named anchors pass if the page exists, so a link like <a href="valid_page#non-existing-anchor"> will not be spotted as broken.

A couple of other situations which I've not tested but perhaps worth mentioning:

Mixed sites, like shopping sites where the products are in non-ssl pages but the checkout pages are under ssl.
Accessing the site via ssh tunnels, which is how I tend to login as admin on non-ssl sites. I think if cron.php is run that way relative links will end up getting prefixed with something like http://localhost:8080/sitename.

Comment #8

hass commented 30 August 2009 at 22:50

It's correct that relative links are saved wrongly if you access the site with a wrong URL. There is simply NO way to solve this problem for sure, unless you set $base_url and make sure that all your links are accessible with this domain and protocol. Code-wise your patch looks correct, and I thought that someone would come up with this issue some day... I just did not expect it so soon. I have often thought about using $base_url and stripping $base_path out to get the $base_url without base_path, but I haven't done it because I assumed a site is always accessible via http:// if https:// is available. Otherwise SSL cert downloads will have major issues. The problem with $_SERVER["HTTPS"] == 'on' could be that this variable may not exist behind load balancers if the certs are installed on the load balancer itself.
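
The two ideas in this paragraph can be sketched roughly as follows. This is an illustration in Python, not the attached patch or the committed fix: `server` stands in for PHP's $_SERVER, the function names are hypothetical, the order of precedence is an assumption, and the `HTTP_X_FORWARDED_PROTO` fallback assumes a load balancer that sets an `X-Forwarded-Proto` header, which not every setup does.

```python
from urllib.parse import urlsplit


def detect_scheme(server, base_url=None):
    """Guess which scheme relative links should be prefixed with."""
    # An explicitly configured base URL is the most reliable source.
    if base_url:
        return urlsplit(base_url).scheme
    # Set by the web server when it terminates SSL itself.
    if server.get("HTTPS", "").lower() == "on":
        return "https"
    # Assumption: an SSL-terminating load balancer forwards the original scheme.
    if server.get("HTTP_X_FORWARDED_PROTO", "").lower() == "https":
        return "https"
    return "http"


def site_root(base_url, base_path):
    """Strip the base path from the base URL, leaving scheme and host."""
    suffix = base_path.rstrip("/")
    if suffix and base_url.endswith(suffix):
        return base_url[: -len(suffix)]
    return base_url


print(detect_scheme({"HTTPS": "on"}))
print(site_root("https://secure.webmast.co.uk/amp/dev", "/amp/dev/"))
```

Without a configured base URL, the result still depends on how the request that triggered the check reached the server, which is the limitation described above.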

Relative links are not the Drupal way! You can add "node/1" links to your content, but make sure you are running http://drupal.org/project/pathologic or http://drupal.org/project/Pathfilter to make them absolute to the site root or fully qualified. Both of these modules depend on the way Drupal works, and linkchecker is also not able to guess whether you access cron via an "unofficial" link. linkchecker tries its best... but it cannot be correct in all possible ways. This is why I suggest setting up the base_url and never using relative links. It's also better as you will not have duplicate content on different domains... there is also a rewrite rule in .htaccess that should be enabled... better safe than sorry.

An anchor cannot be verified with linkchecker at all, as linkchecker only checks HTTP status codes and not content. This can only be verified if the module downloads the page and parses it for the anchor. It sounds like a possible feature and I would take a look if you provide a patch... I'm not yet sure how much additional complexity this would add to the module, but it's possible and a good idea. In such a case the HTTP status code may be 200 OK while the anchor is no longer available on the page. Let's open a new feature request if you'd like to work on this feature.
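
A minimal sketch of what "parse the page for the anchor" could involve, written in Python with the standard `html.parser` module. It is not part of linkchecker; the class and function names are hypothetical, and it assumes the page body has already been downloaded. It treats any `id` attribute or `<a name="...">` as a valid anchor target.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect every id attribute and every <a name="..."> value."""

    def __init__(self):
        super().__init__()
        self.anchors = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (name == "id" or (tag == "a" and name == "name")):
                self.anchors.add(value)


def anchor_exists(html, fragment):
    """Return True if the fragment names an anchor present in the HTML."""
    collector = AnchorCollector()
    collector.feed(html)
    return fragment in collector.anchors


# Sample page body; a real check would use the downloaded response.
page = '<h2 id="install">Install</h2><a name="usage"></a>'
print(anchor_exists(page, "install"))
print(anchor_exists(page, "non-existing-anchor"))
```

The extra cost is that the checker has to fetch and parse the full response body instead of looking only at the status code.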

I also expect issues with domain based language detection and relative links as I have never tested this variant.

Comment #9

hass commented 1 September 2009 at 18:19

Version:	6.x-2.3	» 6.x-2.x-dev
Category:	support	» bug

Status	File	Size
new	linkchecker_563464-D6.patch	1.1 KB

I've tried to reuse core variables with the attached patch. Are you able to test this patch, please?

Comment #10

hass commented 10 September 2009 at 21:00

Status:	Needs review	» Fixed

Patch above was wrong. New one committed.

D5: http://drupal.org/cvs?commit=261384
D6: http://drupal.org/cvs?commit=261380

Comment #11

corporatebastard commented 12 September 2009 at 18:25

Sorry it's taken so long to get back to you. Yes, that patch is working well for me, and it's a better solution than the one I proposed. I made a weird links page on the SSL-only site for testing; it's doing the right thing with respect to the SSL issue. I also tested on a non-SSL site, and it's working there too. Many thanks for looking into this issue and for taking the time to explain your thinking.

The following are just some ideas which may or may not be interesting...

Your comments in #8 got me thinking about my strange reluctance to make use of $base_url, which started a couple of years ago when I was first learning Drupal. I wanted to be able to do things like move a site from a local development environment to a remote server, complete with database, or from a staging subfolder to the main domain. My experiments showed that if I set RewriteBase in .htaccess correctly for the different locations I could happily move sites around and they would continue working as if nothing had happened. If I set $base_url then the database got scattered with references to it, so that when the site was moved everything fell apart. I never figured out what $base_url actually did, how it was used within Drupal or how modules used it; perhaps it's time to dig a bit deeper.

One of the main reasons I can get away with moving sites around is by making liberal use of url() or l(). I use them within modules but also in content I write using the PHP input filter, like on the weird links testing page (ordinary users aren't allowed to do that, but it isn't a problem as I don't move their content around in the same way). The point being that the absolute URLs produced by those functions are generated dynamically, so they work regardless of which weird method I use to connect to the site and carry on working when the site is moved or is accessed via SSL or not. That led me to the idea of linkchecker storing the links in the database the way they are found, and then doing the conversion to fully qualified URLs dynamically in cron.php.

Another thought: the linkchecker module has a special need that most other modules don't face (at least none I've ever used): within cron it needs to be able to access the site it's being run on as if it were being accessed by an external entity. I did a little looking around for other modules which might have the same need; so far I have only found absolute src (which perhaps should be called "fqdn src").

Perhaps all this thought is just to avoid doing the obvious and setting $base_url !

Comment #12

hass commented 13 September 2009 at 00:24

Thank you very much for your feedback. I have no good idea why setting up the $base_url should clutter the DB... absolutely no idea!

It sounds like you should take a look at http://drupal.org/project/pathologic. It allows you to use links like "node/1234" in your content and looks up the aliases for you with an output filter. This way no link gets broken, even if you develop in a subdirectory and have a production site without a subdirectory. It also allows you to change the alias of this node without breaking any link in your site that points to this node. Very helpful!

Comment #13

27 September 2009 at 00:30

Status:	Fixed	» Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.
