Closed (fixed)
Project:
Link checker
Version:
6.x-2.x-dev
Component:
Code
Priority:
Normal
Category:
Bug report
Assigned:
Unassigned
Reporter:
Created:
30 Aug 2009 at 02:17 UTC
Updated:
27 Sep 2009 at 00:30 UTC
Comments
Comment #1
hass commented
How have you built your SSL-only site? Are any special SSL modules installed?
Have you set $base_url in your settings.php to the https://example.com URL? If it is set, url() may be able to prefix the URLs correctly. (untested)
Comment #2
hass commented
Tested, WFM.
Comment #3
hass commented
Comment #4
corporatebastard commented
The site in question is under development here: https://secure.webmast.co.uk/amp/dev
It's not just links made with url(); it affects any relative link. Take a look at the bottom of this page: https://secure.webmast.co.uk/amp/dev/content/subversion where there are two links, one to contact ("get in touch") made with url('contact'); and one to home made with <a href="/amp/dev">home</a>. Both are reported as broken although they work fine.
$base_url is not set. I'm not clear exactly what setting $base_url does, but it appears not to be necessary and is described as optional.
There are no special SSL modules.
Comment #5
corporatebastard commented
Found the problem, attached proposed solution, please review.
Comment #6
hass commented
Are you able to verify whether anchors, URL parameters and relative paths used as href values, such as "#foo", "?foo=bar", "./foo/bar" and "../foo/bar", are extracted correctly? I need to know if they are prefixed with "https://" or "http://".
Comment #7
corporatebastard commented
They all get extracted OK. On the SSL-only site they get prefixed with https://, and I've also checked with a non-SSL site, where they get prefixed with http://
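As a rough illustration of what "prefixed with https://" means for each of the href forms asked about in #6, this Python sketch resolves them against a page URL with the standard library's urllib.parse.urljoin. It is not linkchecker's code (the module is PHP), and the example.com host and page path are stand-ins, not the real site.

```python
from urllib.parse import urljoin

# Hypothetical page on an SSL-only site installed in a subdirectory.
PAGE_URL = "https://example.com/amp/dev/content/subversion"

# The href forms asked about in comment #6, plus a root-relative link.
HREFS = ["#foo", "?foo=bar", "./foo/bar", "../foo/bar", "/amp/dev"]


def resolve_all(page_url, hrefs):
    """Return a dict mapping each href to its fully qualified URL."""
    return {href: urljoin(page_url, href) for href in hrefs}


if __name__ == "__main__":
    for href, absolute in resolve_all(PAGE_URL, HREFS).items():
        print(f"{href!r:14} -> {absolute}")
```

Every result inherits the scheme of the page URL, so the same hrefs come out as http:// when the page itself is fetched over http://.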
A bit unrelated, but links to named anchors pass if the page exists, so a link like <a href="valid_page#non-existing-anchor"> will not be spotted as broken.
A couple of other situations which I've not tested but are perhaps worth mentioning:
http://localhost:8080/sitename.
Comment #8
hass commented
It's correct that relative links are saved incorrectly if you use a wrong URL. There is simply NO way to solve this problem for sure, except to set $base_url and make sure that all your links are accessible with this domain and protocol. Code-wise your patch looks correct, and I thought that someone would come up with this issue someday... I just did not expect it so soon. I often thought about using $base_url and stripping $base_path out to get the $base_url without the base path, but I haven't done it as I thought a site is always accessible via http:// if https:// is available. Otherwise SSL cert downloads will have major issues. The problem with $_SERVER["HTTPS"] == 'on' could be that this variable may not exist behind load balancers if the certs are installed on the load balancer itself.
Relative links are not the Drupal way! You can add "node/1" links to your content, but make sure you are running http://drupal.org/project/pathologic or http://drupal.org/project/Pathfilter to make them absolute to the site root or fully qualified. Both of these modules depend on the way Drupal works, and linkchecker is also not able to guess if you access cron via an "unofficial" link. linkchecker tries its best... but it cannot be correct in all possible ways. This is why I suggest setting up the base_url and never using relative links. It's also better as you will not have duplicate content on different domains... there is also a rewrite rule in .htaccess that should be enabled... better safe than sorry.
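To make the two ideas from this comment concrete (stripping $base_path out of $base_url, and the HTTPS server variable going missing behind a load balancer), here is a minimal Python sketch. It is only an illustration, not the committed patch. It assumes $base_path carries a trailing slash (e.g. "/amp/dev/") and $base_url does not, and the X-Forwarded-Proto fallback is my own assumption about how an SSL-terminating proxy might report the original scheme.

```python
from urllib.parse import urlsplit


def site_root(base_url, base_path):
    """Strip base_path from the path part of base_url.

    Example: ("https://example.com/amp/dev", "/amp/dev/") gives
    "https://example.com".
    """
    parts = urlsplit(base_url)
    path = parts.path.rstrip("/")
    prefix = base_path.rstrip("/")
    if prefix and path.endswith(prefix):
        path = path[: -len(prefix)]
    return f"{parts.scheme}://{parts.netloc}{path}"


def request_scheme(server):
    """Guess the scheme from CGI-style server variables (a plain dict here)."""
    if server.get("HTTPS", "").lower() == "on":
        return "https"
    # Assumption: an SSL-terminating proxy forwards the original scheme in
    # X-Forwarded-Proto. Not every load balancer sets this header, and it
    # should only be trusted when the request really came through that proxy.
    if server.get("HTTP_X_FORWARDED_PROTO", "").lower() == "https":
        return "https"
    return "http"
```

With an explicitly configured $base_url the scheme never has to be guessed, which is why setting it remains the more reliable option.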
An anchor cannot be verified with linkchecker at all, as linkchecker only checks HTTP status codes and not content. This can only be verified if the module downloads the page and parses it for the anchor. It sounds like a possible feature and I would take a look if you provide a patch... I'm not yet sure how much additional complexity this would add to the module, but it's possible and a good idea. In such a case the HTTP status code may be 200 OK while the anchor is no longer available on the page. Let's open a new feature request if you'd like to work on this feature.
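For a sense of what "download the page and parse it for the anchor" could involve, here is a small Python sketch using only the standard library. It is not part of linkchecker. It takes the HTML as a string (fetching is left out), and the names AnchorCollector and anchor_exists are made up for this example.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect every id attribute and every <a name="..."> value."""

    def __init__(self):
        super().__init__()
        self.anchors = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (name == "id" or (tag == "a" and name == "name")):
                self.anchors.add(value)


def anchor_exists(html, fragment):
    """Return True if the page defines a target for #fragment."""
    collector = AnchorCollector()
    collector.feed(html)
    return fragment in collector.anchors
```

A checker would call something like anchor_exists(body, "non-existing-anchor") only after the request has already returned 200 OK, so the extra cost is one full page download per link that carries a fragment.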
I also expect issues with domain based language detection and relative links as I have never tested this variant.
Comment #9
hass commented
I've tried to reuse core variables with the attached patch. Are you able to test this patch, please?
Comment #10
hass commented
Patch above was wrong. New one committed.
D5: http://drupal.org/cvs?commit=261384
D6: http://drupal.org/cvs?commit=261380
Comment #11
corporatebastard commented
Sorry it's taken so long to get back to you. Yes, that patch is working well for me and it's a better solution than the one I proposed. I made a weird links page on the SSL-only site for testing; it's doing the right thing with respect to the SSL issue. I also tested on a non-SSL site, and it's working there too. Many thanks for looking into this issue and for taking the time to explain your thinking.
The following are just some ideas which may or may not be interesting...
Your comments in #8 got me thinking about my strange reluctance to make use of $base_url which started a couple of years ago when first learning drupal. I wanted to be able to do things like move a site from a local development environment to a remote server, complete with database, or from a staging subfolder to the main domain. My experiments showed that if I set RewriteBase in .htaccess correctly for the different locations I could happily move sites around and they would continue working as if nothing had happened. If I set $base_url then the database got scattered with references to it so that when the site was moved everything fell apart. I never figured out what $base_url actually did, how it was used within drupal or how modules used it, perhaps it's time to dig a bit deeper.
One of the main reasons I can get away with moving sites around is by making liberal use of url() or l(). I use them within modules but also in content I write using the php input filter, like on the weird links testing page (ordinary users aren't allowed to do that but it isn't a problem as I don't move their content around in the same way). The point being that the absolute urls produced by those functions are generated dynamically so work regardless of which weird method I use to connect to the site and carry on working when the site is moved or is accessed via SSL or not. That led me to the idea of linkchecker storing the links in the database the way they are found, then do the conversion to fully qualified urls dynamically in cron.php.
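A rough Python sketch of that last idea, storing the link as found and only building the fully qualified URL when the check runs. This is just to show the shape of it, not how linkchecker stores anything. The function name and the idea of passing in the site base are invented for the example, and the site base is assumed to end with a slash.

```python
from urllib.parse import urljoin


def resolve_at_check_time(site_base, page_path, stored_href):
    """Turn a link stored exactly as found into a fully qualified URL.

    site_base   -- how the site is reached right now, with trailing slash
    page_path   -- Drupal path of the page the link was found on
    stored_href -- the href value as it appeared in the content
    """
    page_url = urljoin(site_base, page_path)
    return urljoin(page_url, stored_href)


if __name__ == "__main__":
    # The same stored record resolves differently depending on the site base.
    for base in ("https://example.com/amp/dev/", "http://localhost:8080/sitename/"):
        print(resolve_at_check_time(base, "content/subversion", "../contact"))
```

Root-relative hrefs such as "/amp/dev" would still break when the site moves to another subfolder, so this only helps for links that are relative to the page or produced by url().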
Another thought: the linkchecker module has a special need that most other modules don't face (at least none I've ever used): within cron it needs to be able to access the site it's being run on as if it were being accessed by an external entity. I did a little looking around for other modules which might have the same need; so far I have only found absolute src (which perhaps should be called "fqdn src").
Perhaps all this thought is just to avoid doing the obvious and setting $base_url!
Comment #12
hass commented
Thank you very much for your feedback. I have no good idea why setting up the $base_url should clutter the DB... absolutely no idea!
It sounds like you should take a look at http://drupal.org/project/pathologic. It allows you to use links like "node/1234" in your content and looks up the aliases for you with an output filter. This way no link gets broken, even if you develop in a subdirectory and run the production site without one. It also allows you to change the alias of this node without breaking any link on your site that points to this node. Very helpful!