Problems with using relative path names
| Project: | Drupal |
| Version: | x.y.z |
| Component: | base system |
| Category: | bug report |
| Priority: | critical |
| Assigned: | chx |
| Status: | closed |
Looking at my site's logs, there seem to be several problems that are caused by Drupal's use of relative path names.
If Drupal causes all the site's urls to be absolute, then none of this would be an issue.
A. Search Engine Crawlers
Getting lots of 404s on things like: linux/index.html/robots.txt
Where 'linux' is an alias to a taxonomy, and 'index.html' is an alias to a node within that taxonomy.
Another example is unnecessary recursion. I see 404s on things like: /linux/index.html/linux/index.html
Where 'linux' is a path alias for a taxonomy term, and 'index.html' is an alias to the main node within it.
This does not seem to happen when Google crawls my sites, but Yahoo's Slurp suffers from this problem, and keeps recursing. MSNBot also suffers from this.
Another crawler/harvester called Blinkx/DFS-Fetch keeps adding the .css file to the relative path, getting a 404 on things like: /linux/themes/xtemplate/pushbutton/logo.gif
And Fast Search Engine also attempts to access: /linux/contact/tracker/tracker/user/password
The same goes for grub.org, another crawler.
B. Google Cache / Archive Way Back Machine
Pages in Google cache and archive.org Way Back Machine suffer from a similar problem: the .css files cannot be found, and hence rendering of the pages is not correct.
Examples:
Compare this: http://www.drupal.org/node/4647
To this: http://www.google.ca/search?q=cache:www.drupal.org/node/view/4647
Notice the following:
- How there is no formatting at all, because of the lack of a .css file
- The httpd log on Drupal will show errors for: linux/themes/pushbutton/style.css and linux/misc/drupal.css
Also see: http://web.archive.org/web/20031016184902/http://www.drupal.org/
C. Proxy Caches:
When someone is browsing my site from behind a proxy cache, the web site is hit with a rapid succession of requests, and many of them are just for bogus pages.
Examples:
2004/11/17 - 17:47 404 error: linux/user/1 not found.
2004/11/17 - 17:47 404 error: linux/feedback not found.
2004/11/17 - 17:47 404 error: linux/tracker not found.
2004/11/17 - 17:47 404 error: linux/sitemap not found.
2004/11/17 - 17:47 404 error: linux/search not found.
2004/11/17 - 17:47 404 error: linux/misc not found.
2004/11/17 - 17:47 404 error: linux/programming not found.
2004/11/17 - 17:47 404 error: linux/programming not found.
2004/11/17 - 17:47 404 error: linux/linux not found.
2004/11/17 - 17:47 404 error: linux/technology not found.
2004/11/17 - 17:47 404 error: linux/writings not found.
2004/11/17 - 17:47 404 error: linux/family not found.
And also:
2004/11/17 - 07:23 404 error: history/user/1 not found.
2004/11/17 - 07:23 404 error: history/tracker not found.
2004/11/17 - 07:23 404 error: history/feedback not found.
2004/11/17 - 07:23 404 error: history/sitemap not found.
2004/11/17 - 07:23 404 error: history/search not found.
2004/11/17 - 07:23 404 error: history/misc not found.
2004/11/17 - 07:23 404 error: history/technology not found.
2004/11/17 - 07:23 404 error: history/science not found.
2004/11/17 - 07:22 404 error: history/history not found.
2004/11/17 - 07:22 404 error: history/writings not found.
2004/11/17 - 07:22 404 error: history/family not found.
As you can tell, history and linux are aliases to taxonomy terms, and so are misc, technology, writings, family, etc. The user agent is appending the taxonomy term alias to the URL and forming a new URL.
D. Regular Browsing:
There is even at least one extreme case where the following URL was accessed (the result was 404 of course)
/book/view/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/t
hemes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtempl
ate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/logo.gif
It seems it was a normal user, because the user agent is: "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)"
Proposed Solution:
As a proposed solution, all URLs in Drupal can be made into absolute path names. This can be done by the following:
- The variable $base_url in the conf.php file is broken down into two components:
- $base_host (the 'http://whatever-host.example.com' part WITHOUT the trailing slash)
- $base_path (the '/path-to-drupal' part, WITH the leading slash. If this is the DocumentRoot, then it is just a '/' character)
- $base_url is now $base_host concatenated with $base_path
- A simple filter can be written to precede every href="path" with the $base_path variable, so it becomes "/path"
- This option can be turned on and off for a site. The default is to have it off so current behavior is maintained.
- A similar scheme applies for style sheets as well.
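The steps above could be sketched roughly as follows. This is only an illustration in Python, not Drupal code and not a patch: the names split_base_url and prefix_hrefs are made up for the example, the enabled flag stands in for the proposed on/off option, and a regular expression is a crude stand-in for a real output filter.

```python
import re
from urllib.parse import urlsplit


def split_base_url(base_url):
    """Split a base URL into (base_host, base_path) as the proposal describes."""
    parts = urlsplit(base_url)
    # base_host: scheme and host, WITHOUT a trailing slash
    base_host = f"{parts.scheme}://{parts.netloc}"
    # base_path: WITH a leading slash; just "/" when installed in the DocumentRoot
    base_path = parts.path.rstrip("/") or "/"
    return base_host, base_path


# Hypothetical filter: matches href="..." attributes in generated HTML
HREF = re.compile(r'href="([^"]*)"')
# Targets that must be left alone: scheme-qualified, root-relative, or in-page anchors
SKIP = re.compile(r'^(?:[a-z][a-z0-9+.-]*:|/|#)', re.IGNORECASE)


def prefix_hrefs(html, base_path, enabled=False):
    """Prepend base_path to relative hrefs; off by default to keep current behaviour."""
    if not enabled:
        return html
    prefix = base_path.rstrip("/") + "/"

    def fix(match):
        target = match.group(1)
        if not target or SKIP.match(target):
            return match.group(0)
        return f'href="{prefix}{target}"'

    return HREF.sub(fix, html)


if __name__ == "__main__":
    host, path = split_base_url("http://whatever-host.example.com/path-to-drupal")
    print(host, path)
    print(prefix_hrefs('<a href="node/1">x</a>', path, enabled=True))
```

With the option left off, the HTML passes through untouched, so existing sites would see no change.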
So, did I miss something obvious? Am I seriously off the mark?
Your thoughts!

#1
I am getting similar 404 errors, mainly from rss feed link that looks like /blog/blog/feed and many manual links that are relative to drupal root.
It was not a problem before Drupal 4.5, so I think there might not be a need to change all URIs to absolute. I can't see where the problem is coming from though.
#2
I am pretty sure that these problems were happening for at least the past 10 months (ever since I moved to Drupal in January 2004).
The main issue here is that crawlers and other user agents get confused by the relative path names.
Using absolute paths will definitely solve this. However, is this the only solution?
I am looking for a discussion of this.
#3
No absolute paths please. Having the path start with '/' solves all the mentioned problems, and is not absolute, it is relative to the domain. Sadly some crawlers and even the Google Cache do not obey the base href. I have reported this cache problem in April to Google, and they promised they will keep it in mind... Hehe...
What we need is to have the printed relative path values relative to the domain name, and not relative to the Drupal installation path.
Note that this issue will appear on the drupal devel mailing list if someone finally provides a patch we can talk about :)
#4
Goba is right. We need paths relative to the domain name to fix this 'problem'.
#5
Sorry for not making myself clear.
When I said absolute, I meant that they start with just a /. I did NOT mean that they start with http://host.example.com. That would be a very bad idea.
In any case, what do people think about the proposed solution (breaking down $base_url into two parts)?
Also, does this address the style sheets as well, or is more needed?
#6
I have implemented what Goba suggested.
#7
Maybe this one is faster?
#8
Man! You are fast!
I tried the second version. It works fine for things that are not inside the node body, I mean they have a / in front of them, as we want it to be.
Three comments/issues:
- If there is a URL that is already "/" representing the home page, it gets set to "//". Perhaps it should check for that case?
- URLs in nodes that do not start with / do not get changed to have a / prepended to them. Do we need a filter for this?
- Do we need to do something for the style sheets in the page header? I mean the "misc/drupal.css" and "themes/themename/style.css"?
Thanks
#9
Hi chx
Here is a fix for the case where you have a url that is just "/".
In your patch, instead of:
<?php
$base = $parts['path'] . '/';
?>
Replace that by:
<?php
$base = ( $path == '/' ? $base : $parts['path'] . '/' );
?>
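To make the "//" case concrete, here is a small stand-alone sketch in Python (not the patch itself). The function name domain_relative is made up, and it assumes the home-page link "/" should simply stay a single slash rather than get a prefix.

```python
from urllib.parse import urlsplit


def domain_relative(path, base_url):
    """Return a link relative to the domain root, without doubling the slash."""
    # Path part of the base URL: "" for DocumentRoot, "/drupal" for a subdirectory
    install_path = urlsplit(base_url).path.rstrip("/")
    if path == "/":
        # Assumption: a bare "/" is already root-relative, so add no prefix
        return "/"
    return install_path + "/" + path


if __name__ == "__main__":
    print(domain_relative("/", "http://example.com"))
    print(domain_relative("node/1", "http://example.com"))
    print(domain_relative("node/1", "http://example.com/drupal"))
```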
#10
Did this patch make it into CVS yet?
If there are any objections to it, can someone please explain what they are?
Thanks
#11
Shouldn't your changes be included in the patch?
Also, it's better to cache $base rather than $parts.
Lastly, if this patch makes it to HEAD, we should probably remove some 'base url' cruft from the themes.
#12
Here is the patch including my fix.
I am asking chx to comment on caching $base instead of $parts.
Will this make it faster?
#13
Hm.
$base = ( $path == '/' ? $base : $parts['path'] . '/' );
this depends on path which is a parameter. Thus I fail to see how we could cache $base. I'd correct this code however:
$base = ( $path == '/' ? '' : $parts['path'] . '/' );
'cos I think $base is not defined before, but this is not a problem, PHP will be happy to replace NULL with NULL...
Maybe instead of all parts, only $parts['path'] is enough to be cached, yes, but the performance and memory usage difference -- I guess -- would not be noticeable...
#14
OK.
I put in chx's suggested change.
This patch can go in CVS then, to rid us of the problems with paths not beginning with slash.
This is not an ultimate solution still. We need to address the problem with .css files. Although the header contains a <base href="http://example.com" /> tag, it does not seem that major search engines and archiving sites obey it anyway.
#15
Your coding style needs work. Also, I'm not going to commit this unless the themes get fixed up: we'd end up with invalid URLs all over the place. Lastly, I wonder how portable the themes will be when Drupal is run from within a subdirectory.
#16
Well, my patch worked from a subdirectory very well; in fact, I have not tested it from the root dir. And I think that it adheres to coding standards. So I resubmit it with the root path fix. However, my Drupal work is focused on i18n these days, and I was never into theming so it won't be me who fixes those.
#17
I have tested the previous patch (including my fix) with drupal installed in the DocumentRoot of the server.
So, in effect, it is tested with both Drupal in / and Drupal in a subdirectory.
This change stops crawlers and browsers from getting confused.
While it is true that there is no fix for the .css files in the HTML head section yet, this fix deals with a major part of the problem, and rids us of a major pain. Check your web server's logs some time to see what I mean.
Someone who is familiar with the themes can contribute a patch later.
This patch and the future fix for themes are not mutually exclusive, so let it go in CVS.
#18
Please commit this into Drupal core, this fix is badly needed.
#19
Well, as no one has stepped in to fix this problem, I have tried to fix the themes also. themes.inc, xtemplate.engine and the bluemarine template are patched besides common.inc.
Of course, more templates could follow, but first I'd like to see your opinions.
#20
I don't think that removing <base> from the themes is a good idea. Using $parts['path'] before the files should be encouraged though, which would fix the google cache problem, and would still keep the HTML size low. It would also help those who save the file to find the originating site more easily, since clicking on a non-pagelocal link would lead to the online version.
#21
Definitely -1 on removing the <base> tag or using absolute or root-relative URLs. This tag has been around for ages, and it is the only way to make clean URLs work without bloating in the code. FYI, "base" is (first?) mentioned in Berners-Lee's HTML 1.0 draft. That's June 1993.
As the amount of clean URL-using sites grows, the crawlers will have to be updated. Perhaps we could prevent crawlers from going too insane by 404ing for URLs with more than say 10 components? That would prevent the really crappy ones from hammering your site.
I'm all for making the <base> tag easier to handle for the user (say, by including a filter to allow simple anchor links to work as most people expect them to), but we should keep Drupal-generated URLs clean and completely relative.
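The "404 for URLs with more than say 10 components" idea could look something like this. It is a rough Python sketch only: the name too_deep and the default limit are made up for illustration, and where such a check would hook into Drupal is left open.

```python
from urllib.parse import urlsplit

MAX_COMPONENTS = 10  # the "say 10" suggested above; purely illustrative


def too_deep(request_uri, limit=MAX_COMPONENTS):
    """True if the request path has more than `limit` non-empty components."""
    path = urlsplit(request_uri).path
    components = [part for part in path.split("/") if part]
    return len(components) > limit


if __name__ == "__main__":
    runaway = "/book/view/" + "themes/xtemplate/pushbutton/" * 20 + "logo.gif"
    for uri in ("/linux/index.html", runaway):
        status = 404 if too_deep(uri) else 200
        print(status, uri[:60])
```

A legitimate deep alias would also be rejected by this, so the limit would have to be chosen per site.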
#22
The problem with css is this: The @import argument does not start with a /.
This is simple to fix.
We keep the "base" as it is today, but add the new variable: $base before it.
So for a site where Drupal is installed in the DocumentRoot, all that will change is that /misc/drupal.css and /themes/themename/style.css will be preceded by a slash. For sites that use another path, that path will be prepended to the css file name.
How about that?
#23
What exactly is the problem with the @import? As far as I know:
- url() in stylesheets is interpreted relative to the base of the stylesheet, not the source document.
- However, if the styles are inside an HTML document, through a style tag or style attribute, then the stylesheet's location is the same as the HTML document.
- Thus, the stylesheet's base is the same as the base of the HTML document (which can be altered through the <base> tag).
I just don't see why it is necessary. As far as I know, the only browser that has had problems resolving CSS urls properly was Netscape 4, which does not support @import at all, and which Drupal does not support either, because of its CSS usage.
#24
The problem for stylesheets is as follows. I think it mainly affects crawlers and Google's cache.
Say you have an installation of Drupal in DocumentRoot. You then use url aliases, and put slashes in them.
For example, you use news/general/2004-12-15.html for a node.
That node still has misc/drupal.css and themes/pushbutton/style.css in the head section of the document. Crawlers get fooled by that and try to look for /news/general/misc/drupal.css and /news/general/themes/pushbutton/style.css, which don't exist.
So, just prepending the new $base variable (in chx's patch) before the stylesheet @import argument would fix this issue. Assuming you are in DocumentRoot, then /misc and /themes would be used instead of just misc and themes.
It would still be compliant with standards, be relative to the web site, and not ambiguous to anyone, be they crawler or browser.
I hope it is clearer now.
I think chx can change the patch to use the $base instead of $base_url everywhere, so as to avoid the host/domain name in the urls.
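To show the three cases side by side, here is a short Python sketch using the standard library's urljoin, which follows the usual relative-URL resolution rules. The host example.com and the helper name resolve are assumptions for the example; a client that ignores the base tag is modelled by resolving against the page URL instead.

```python
from urllib.parse import urljoin

PAGE_URL = "http://example.com/news/general/2004-12-15.html"
BASE_HREF = "http://example.com/"  # what the <base> tag would declare


def resolve(ref, page_url=PAGE_URL, base_href=None):
    """Resolve ref against base_href if the client honours it, else the page URL."""
    return urljoin(base_href or page_url, ref)


if __name__ == "__main__":
    # Client that honours <base href>: finds the stylesheet
    print(resolve("misc/drupal.css", base_href=BASE_HREF))
    # Client that ignores <base href>: looks under /news/general/
    print(resolve("misc/drupal.css"))
    # Leading slash: same answer whether or not <base href> is honoured
    print(resolve("/misc/drupal.css"))
    print(resolve("/misc/drupal.css", base_href=BASE_HREF))
```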
#25
But typical crawlers don't even pay attention to stylesheets, hence it wouldn't have much use for them. I just don't see why we should adjust to rare cases of buggy software. Reading out a base URL from an HTML document is dead easy, and on top of that it doesn't add more complexity as without the base tag, the document's URL is already an implicit base which has to be parsed anyway.
I did not like it when we altered the <link> tag to accommodate buggy RSS readers and I certainly don't like it now, as this is even rarer. In both cases, it is not Drupal which is at fault.
#26
Steven
While I agree with most of what you said, the 404s show up in the logs enough to be a bother.
Perhaps the original design of Drupal did not foresee that people would use url aliases to mimic directory/file hierarchies. Whether this was intended or not, it is the way many use Drupal today.
It does not matter where the bug is (Drupal or the external world), as long as we can stop it ourselves, by adjusting our end of it.
The fix is simple enough and does not break standards (if implemented as described with a leading / before the css).
#27
It does not break standards, but it does bloat the code in an ugly way. Why not send an e-mail to the owners of the crawlers and tell them to implement a standard that is nearly 10 years old (RFC 1808)?
Note that Google Cache now seems to correctly interpret base URLs and even adds a <base> tag of its own.
By the way, this problem has nothing to do with people using URL aliases or not, as for a browser the regular nested paths that Drupal uses (e.g. "node/1") are no different from aliases mimicking files (e.g. "foo/bar.html").
#28
Steven, part of the problem is that Google cache does add a base href even if there is a base href in the document. E.g. it adds a <BASE HREF="http://drupal.org/node/13733"> on the cached Plone comparison page. Since HTML does not allow more than one base tag to be present, it is up to the browsers to use the first or the last base value, or any of the base values on the page for that matter, as the used base. So even pages displayed from the google cache will be buggy if a full relative path to the domain root is not specified, due to this problem.
#29
This one does not use the whole base_url, only the path part of it. HTML bloat is kept to a minimum.
#30
Please please can this be done?
It's a good idea in itself, but if using fully-qualified paths means we can get rid of the BASE HREF, then page anchors will work without having the overhead of a filter. That'd be a huge bonus for those creating larger nodes, or who just want to be able to put a "skip navigation" link in their theme without having to abandon Xtemplate or PHPtemplate.
#31
Well, speaking of skip navigation links, phptemplate and xtemplate should expose the REQUEST_URI to the templates, so when a link to an anchor on the same page is needed, the link can be formatted with the complete request URI in mind.
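A quick Python sketch of why the request URI matters here. The helper same_page_anchor is hypothetical (it is not something phptemplate or xtemplate provides), and the URLs are examples only.

```python
from urllib.parse import urljoin


def resolve_anchor_with_base(base_href, anchor):
    """What a client honouring <base href> does with a bare '#anchor' link."""
    return urljoin(base_href, "#" + anchor)


def same_page_anchor(request_uri, anchor):
    """Hypothetical template helper: build the link from the request URI."""
    return request_uri + "#" + anchor


if __name__ == "__main__":
    base = "http://example.com/"
    # Bare anchor: resolved against the base, so it leaves the current page
    print(resolve_anchor_with_base(base, "content"))
    # Anchor built from the request URI: stays on the page being viewed
    link = same_page_anchor("/node/13733", "content")
    print(link, "->", urljoin(base, link))
```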
#32
Should, but don't :(
If BASE HREF isn't removed, surely it wouldn't be a big job to implement this tweak?
#33
This patch is badly needed. The lack of a leading / in many paths is causing lots of problems.
#34
Can this patch be applied for 4.6? It is really badly needed.
#35
I don't see why this is badly needed. We generate perfectly valid URLs which are supposed to be short and crispy. This patch has some advantages though, yet it is unclear which patch to go with.
#36
The second patch is better.
#37
Forgive me for saying so, but since the way Drupal is generating hyperlinks is completely valid, why are you suggesting Drupal should move away from an accepted standard when the problem lies with the search engines?
At the very least, this needs to be optional -- which it appears to be -- I hate the 404s too, but I hate to hear that a change in Drupal is needed to fix a Google problem.
#38
I have to agree with the last comment from grohk.
#39
I also agree with the two previous Drupaleers, but I wouldn't mind enabling a 'quirks mode' via my conf file to stop the flood of 404 messages.
#40
I really can't fathom why some of us cannot deal with the realities out there in the world.
These problems are not because Drupal is broken. It is because crawlers are. We cannot just bury our collective heads in the sand and say that we are standards compliant and forget about what is out there.
As an analogy, people who design themes or write CSS have to deal with the ugliness of Microsoft Internet Explorer and its intentional going against standards. You cannot tell a client or your boss that you are not modifying a theme that works perfectly on Konqueror and Firefox because MS IE is broken.
Similarly, we cannot ignore that crawlers from major search engine companies are broken or confused, and keep recursing through sites using Drupal, causing countless errors in the logs. We cannot tell our users to ask Google and Yahoo et al to fix their software.
Remember that we are not breaking any standards by implementing this patch. All we are doing is putting the entire path out (from the first / down) and thus eliminating ambiguity for everyone.
Sorry if I am a bit blunt in this post, but I am tired of what may be seen as isolationist thinking.
I do not mind if this is implemented in an advanced mode or via a settings.php thing. All I care about is getting it fixed somehow.
#41
Circular log errors reported here too http://drupal.org/node/9499
#42
I agree with KBAHEY. Burying our heads in the sand and saying "it ain't our problem" is not going to fix the issue. I despise companies that break standards - and I applaud Drupal for working hard to keep within those confines. But the reality is money grubbing, lazy ass programmers exist the world over, and the consequence is things like MSIE breaking everything wantonly and intentionally, Google, Yahoo, et al implementing poor bot code, etc...
I believe this desperately needs to get fixed. Ever since I started hand writing HTML code in 1992 I have always ensured that my URL paths are absolute to the base html document root (eg, preceded with "/" and the full path). It avoids confusion, problems, or issues. It seems odd that the debate over this would rage as it has in this thread.
...and I don't get the "bloat" discussion. How is this bloating things? Are we talking a few dozen extra characters? I hope I'm missing something more obvious and insidious than that!?
I've been a loyal Drupaler for ages now, and I love it. But this new problem is causing me a lot of grief, I see frequent munging of the URLs, and it worries me; particularly when I see that there are end users getting 404s. They don't give a rats ba-tu-tey that Drupal is "standards compliant" ... they just know they got an error when they supposedly did exactly what they should have, click on a URL. That reflects poorly on the site owner and ultimately on the software itself.
Please reconsider this issue, and let a patch go into core to fix it. It doesn't make sense to let it rage on as an issue that is causing lots of people obvious grief. I'm betting it's a bigger issue than most admins think - most don't spend the anal-retentive time that I and others do grubbing through our logs, trying to ensure a "perfect" surfing experience for our end-users...
#43
This is truly not a problem with Drupal, but it may be reasonable to change Drupal's behavior anyway.
The base href tag has been in the W3C standards since 1997. Failing to observe this tag isn't about being slow on the uptake (as with MSIE and CSS2). It's about deliberately breaking existing compatibility.
Has anyone contacted Yahoo, MSN, etc. and told them of this problem? If and when they fix their crawlers, we need to be able to turn off this kludge to discourage other more marginal crawlers to observe the standards.
#44
That should be "encourage other more marginal crawlers to observe the standards."
#45
Here are examples from drupal.org itself:
As you can see, if the paths started with a slash, none of this would have happened.
warning page not found 06/05/2005 - 10:36 drupal-sites/themes/bluebeach/style.css not found.
warning page not found 06/05/2005 - 10:36 drupal-sites/themes/bluebeach/print.css not found.
warning page not found 06/05/2005 - 10:36 drupal-sites/misc/drupal.css not found.
warning page not found 06/05/2005 - 10:33 about/tracker not found.
warning page not found 06/05/2005 - 10:33 about/support not found.
warning page not found 06/05/2005 - 10:33 about/project not found.
warning page not found 06/05/2005 - 10:33 about/services not found.
warning page not found 06/05/2005 - 10:33 about/handbook not found.
warning page not found 06/05/2005 - 10:33 about/features not found.
warning page not found 06/05/2005 - 10:33 about/forum not found.
warning page not found 06/05/2005 - 10:33 about/drupal-sites not found.
warning page not found 06/05/2005 - 10:33 about/druplicon not found.
warning page not found 06/05/2005 - 10:33 about/download not found.
warning page not found 06/05/2005 - 10:33 about/donate not found.
warning page not found 06/05/2005 - 10:33 about/documentation-writers-guide not found.
warning page not found 06/05/2005 - 10:33 about/contributors-guide not found.
warning page not found 06/05/2005 - 10:33 about/cvs not found.
warning page not found 06/05/2005 - 10:33 about/contact not found.
warning page not found 06/05/2005 - 10:33 about/contribute not found.
warning page not found 06/05/2005 - 10:33 about/aggregator not found.
warning page not found 06/05/2005 - 10:33 about/cases not found.
warning page not found 06/05/2005 - 10:33 about/about not found.
#46
Just because some of us disagree with this solution to the perceived problem does not mean we are not fond of reality, it just means we have a different way of seeing this issue. If we cannot use accepted standards in Drupal, then what good is it to adhere to them in the first place?
Does this patch really affect the experience of end users of Drupal? Unless I am missing something, normal people never see these errors. Google is a search engine, it is not a user. In my experience, users remain unaware of this "problem". But changing Drupal to adhere to preferences of a broken crawler is not going to encourage anyone to fix their poorly implemented software either.
As someone who appreciates the elegance of Drupal and uses it just as much as anyone, all this is fine with me -- as long as it is optional. But I don't think appeasement is the answer to "fixing" this problem, because with this option enabled there is no impetus for the crawler programmers to fix anything.
And for the record, Google has been caching my pages correctly with CSS for months now and it has not entered into a loop either.
#47
By using paths beginning with slashes, we are not breaking any standards that I know of.
The clutter in the logs is very annoying, and makes it harder for the site admin to find the info he needs. It also consumes bandwidth.
The user here is the site admin, not the end user.
By the same token, we can ignore MS IE's broken CSS handling and a bunch of other things, and claim that they should fix themselves. Meanwhile 80% of users are facing these issues.
That is not the way to look at things. If we can implement something that does not break standards but spares many of us the grief that this causes, then why not?
A solution that allows this to be turned on or off, via an option or a settings.php flag would make everyone happy.
#48
The patch apparently hasn't found much favour, setting to "active".
I suggest to get hold of the IP ranges the broken crawlers use and block them in the .htaccess file we ship with Drupal.
Long live open web standards!
#49
It is really sad that most of us do not see a problem here, or brush it off as someone else's problem.
Standards are only valid if everyone follows them. The reality is that some do not, and depending on the market presence and strength of those in violation of the standard, they range from insignificant to something that has to be dealt with.
If web designers ignored Microsoft Internet Explorer, with its blatant disregard for standards, intentional or otherwise, they would be out of business. 80% of people are still using MS IE. This is exactly the same issue.
To see the magnitude of the problem, run the following SQL against your site:
> select hostname from watchdog where type = 'httpd' and message like '%.css%';
> select hostname, count(*) cnt from watchdog where type = 'httpd' and message like '%.css%' group by hostname order by cnt asc;
The first shows 5383 rows, the latter shows 1886 rows.
As I said we will not be breaking any standards by qualifying our URLs and making them unambiguous to everyone, starting with the /.
#50
kbahey, have you contacted any of these crawlers to tell them their software is broken?
#51
No. I haven't.
There are 1866 unique IPs over a period of 4 months. Even if we assume that these are in subnets, and say 10 per subnet, this means I have to contact 186 separate organizations/individuals, which is a great effort. Even if I assume that there is skew, and that there are 20 organizations/individuals, it is still a great effort, and how many of those will respond, let alone fix their crawlers?
The question is: what is within our control and influence and what is not. This is like writing CSS for MS IE and for other standard conforming browsers. Or like defensive driving in an area where there are many rogue drivers. You cannot say that you are conforming to css standards and hell with the rest of the world, and you will not deal with them at all. Nor can you say that you are within your lane, at the set speed and keeping your distance, and will cross a green light while a drunk person is crossing the intersection.
We had to deal with comment spam (Google's nofollow, various modules to deal with it, or turning off anonymous comments, and moderating them), and referer spam (hide the statistics pages from view, or disabling statistics altogether). Didn't we? It is a rough world out there, and if the others are unethical or criminals or just don't play by the rules, we still have to deal with them. How is this any different?
Seeing this as a purely external issue and not dealing with it in the software we control is unrealistic. Remember that the fix does not break any standards. We will still be standard compliant with it, so the standards slant of it is not convincing.
Sorry, I just see it this way, and none of the counter-arguments so far is convincing to me.
#52
This is not at all like MSIE. IE is a majority browser, so designing sites that don't work with it loses users. This, on the other hand, is a small minority of crawlers making a nuisance of themselves.
The drunk driver analogy is a little bit closer, but this isn't a life or death situation. You've convinced me that you want this feature, but you haven't convinced me that I want this feature. If this is committed, *please* make it optional!
#53
This issue interferes with BlogLines' ability to autodetect feeds. Several desktop clients have the same problem. While I agree that ideally this should be an optional change, losing 10% of your RSS readership is serious stuff. Thank you for the interim hide-saving, chx.
#54
chx
Was the patch updated for 4.6? I have applied only the common.inc.
But since I am now using phptemplate based themes (pushbutton, and soon a custom one), can you please give some instructions on what to change (e.g. putting path_to_theme() in some places?)
Thanks in advance.
#55
I've tested this with 4.6 -- one of the hunks didn't want to apply to common.inc; I think it was a line count off-by-one change or something, though -- I made the change by hand and it's working great.
This patch hasn't found widespread acceptance... is it breaking things for some people?
This feels like a better method of handling links than using BASE HREF, as far as I can tell. The way I see it is: Google is pretty good at indexing web pages. Not perfect, but I'm willing to believe they understand HTML better than I do. BASE HREF isn't a new part of the standard. At all. If Google doesn't support the way Drupal is trying to use BASE HREF, my money is that Google is right, and Drupal is using the tag in an unintended way.
But anyhow, it's happy under 4.6 for me, so far.
#56
I just read through this thread because I have been bitten by the problem: I need URLs relative to the server document root, for several reasons.
First, my server lives in a LAN behind a firewall. Since it runs several web apps, Drupal is installed in a subdirectory of my document root. It is accessible from the outside solely via https, and from inside the LAN under a different name, via https as well as normal http (the latter is necessary to get cron.php working correctly, isn't it?)
Secondly, I run a clone of this site as a testing platform on my laptop, which obviously has its own URL, and I want to be able to sync it with as little hassle as possible. IMHO this is very important - especially if you set up sites using many modules, you will need to test your setup somewhere else before going onto the production system.
Thirdly, the current behaviour prevents me from using http(s)://localhost/drupal46 or IP addresses to access the site.
Fourthly, I or some remote user will have trouble mirroring the site or parts of it into static pages (that's again the crawler problem).
Such a situation has been discussed elsewhere in this forum multiple times, and the commonly accepted proposal was to use a relative path as the base URL, in my setting '/drupal46' (if I remember correctly, it is even in Drupal's manual, isn't it?), although this will obviously break the HTML standard. Now comes the real problem: relative paths in <base> are interpreted as this hack intends by most relevant browsers (Mozilla et al., Opera, MSIE 5.x, Safari, even exotic ones like lynx and links), but a few adhere strictly to the standard in this respect: besides a few other exotic ones like w3m and amaya, MSIE 6.x also belongs to this group. This means that my site is not accessible to 2/3 of the world. Really ugly.
I am aware of using a multisite setup as a workaround, but seriously, why use a workaround which costs me administrative effort if a clean and standards-conforming solution is in sight, namely avoiding the <base> tag in favour of paths relative to the document root of my server or virtual host?
So I would strongly vote for modifying Drupal to drop the <base> tag. Although it has been part of HTML since before HTML 1.0 was defined (always as an optional tag, BTW), it brings more trouble than help.
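To make the difference concrete, here is a minimal sketch in Python (not Drupal code) of how the same relative link resolves for two kinds of client. The host, the page URL and the link are made up, the base href is written with a trailing slash, and the "strict" client is modelled, as an assumption, as one that simply ignores a non-absolute base href.

```python
from urllib.parse import urljoin

PAGE = "https://example.org/drupal46/node/12"  # hypothetical page URL
BASE_HREF = "/drupal46/"                       # relative base, as in the workaround
LINK = "admin/settings"                        # hypothetical relative link


def lenient(page, base_href, link):
    """Client that accepts a relative base href, resolving it against the page."""
    return urljoin(urljoin(page, base_href), link)


def strict(page, base_href, link):
    """Client that ignores a non-absolute base href (assumed behaviour)."""
    return urljoin(page, link)


print(lenient(PAGE, BASE_HREF, LINK))               # lands on the intended page
print(strict(PAGE, BASE_HREF, LINK))                # lands under /drupal46/node/
print(strict(PAGE, BASE_HREF, "/drupal46/" + LINK)) # root-relative: no ambiguity
```

A link that already starts at the document root resolves to the same place for both clients, which is the point of dropping the tag.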
best regards
Michael
#57
What's the status of this issue?
#58
Now that we have patch (code needs work), this is a perfect issue to put into that status. I would vote for the change provided by the latest patch in this issue, although I have not tested the patch myself. At least at weblabor.hu and drupal.hu, we run with a custom url() function which just does what this patch is about to do (but since these Drupal setups are in the root folder, we just prepend a slash to all path values).
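The idea of such a helper can be sketched roughly as follows. This is an illustration in Python, not the actual url() code from those sites; the function name root_relative and its base_path parameter are hypothetical.

```python
from urllib.parse import urlsplit


def root_relative(path, base_path="/"):
    """Turn an internal path into a URL relative to the document root.

    Absolute URLs, paths that already start with '/', and bare fragments
    are returned unchanged.
    """
    if urlsplit(path).scheme or path.startswith(("/", "#")):
        return path
    return base_path.rstrip("/") + "/" + path


print(root_relative("node/12"))                # site installed in the root folder
print(root_relative("node/12", "/drupal46/"))  # site installed in a subdirectory
print(root_relative("http://drupal.org/"))     # external link left alone
```

For a site in the root folder this reduces to prepending a slash; the base_path argument only matters for subdirectory installs.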
The patch needs to be updated to latest CVS, and as Dries said he is willing to fix this problem, it should be committed after a review.
#59
Reworked.
#60
same patch for version 4.6.3 distributed files P L E A S E ... :-)
#61
What if you let this issue get solved for CVS HEAD (4.7), before requesting a patch for 4.6.x? (Restoring important status values).
#62
I believe the issue discussed in this thread is related to this bug.
What can I do to help out?
#63
I would like to bring attention to this long standing issue once more.
What are the objections to this patch? It fixes things and has no side effects, and we are still conforming to the standards.
Can we get this in for 4.7?
#64
Give it a review. http://drupal.org/patch/review
#65
I tested this with today's HEAD.
It works.
I hope this gets committed so we do not have the 404 headaches.
#66
Updating patch to go with current HEAD.
Please include this in 4.7.
#67
+1 this is essential. RSS feeds with image references were generating thousands of 404's a day on my site. Total lifesaver.
#68
Updating for current HEAD. The previous patch does apply, but with offsets. This one should have no warnings.
I hope we get some more +1s on this so it is in 4.7.
#69
+1 As far as how this affects links in RSS feeds as interpreted by aggregators, it's more and more like the analogy of designing for IE, but really closer to the idea of preparing an email newsletter so it looks right and behaves right in 8-10 different email clients. Well, it's not so bad, because you're just dealing with one thing - dead links (as opposed to many different HTML quirks) - and a knowledgeable user can figure out how to get to your dead link manually... but they shouldn't have to, and many won't. And more aggregator solutions are only going to pop up - how many will comply with standards? So, another analogy - defensive driving - this is like 'defensive programming for the web.'
Here are just some more notes on the RSS-specific issue (thanks for the redirect to this issue, kbahey):
http://drupal.org/node/35610
#70
This patch outputs $base_url_path literally in several places, which is a bad idea (hypertolerant base url + 404 page = xss). http://drupal.org/node/28984
Of course, Drupal, surprise surprise, correctly outputs xml:base in its RSS feeds.
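The escaping concern can be illustrated with a minimal Python sketch. It assumes the base path may be derived from request data and can therefore contain hostile characters; html.escape stands in for whatever escaping Drupal applies, and stylesheet_link is a hypothetical helper, not a function from the patch.

```python
from html import escape


def stylesheet_link(base_path, href="misc/drupal.css"):
    """Build a <link> tag, escaping the URL before it reaches the attribute."""
    url = base_path + href
    return '<link rel="stylesheet" href="%s" />' % escape(url, quote=True)


hostile = '/"><script>alert(1)</script>/'  # hypothetical attacker-supplied path
print(stylesheet_link("/drupal46/"))
print(stylesheet_link(hostile))            # markup characters come out as entities
```

Escaping only keeps the value inside the attribute; it does not by itself reject dangerous URL schemes.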
#71
Steven
I don't really see your point: we discussed the standards compliance aspect of this patch months ago.
Whether Drupal is conforming to standards or not is a moot point: PEOPLE ARE SUFFERING FROM THIS.
The standards compliance is not broken by this patch, but it alleviates a severe pain that many of us are seeing.
As for the security part, please explain more how this makes XSS possible?
#72
Of course, several RSS readers, surprise, surprise don't care about xml:base.
Of course, there are a lot of spiders, surprise, surprise, without base href parsing implemented.
We also suffered from this problem on several sites, and custom fixed it by always prefixing the URL with / in url(). It was easy for us, since we are using domain names and clean URLs, so linking from the site root was quite straightforward. The world needs this patch.
#73
If this involves getting rid of the base href tag:
YES PLEASE!
#74
No thanks, it is not our job to cater for other people's borken software.
#75
We have discussed this above.
Although technically this is a problem elsewhere, no one has control over which crawlers visit their site, nor how frequently they do. This is not just one external broken software, but a widespread way of doing things.
So, like comment spam, referral spam, and the whole slew of external problems, one has to take defensive measures against things that fill your logs, eat up resources, ...etc.
#76
I agree 100% with what kbahey has been saying.
the patch still applies (with line number skew) against current CVS. Here's a new version calculated against current CVS head.
I don't know if there's some kind of process that searches for the string "+1" in the comments, but if there is:
+1
#77
Patch needs reviews: ie. actual people who try the actual patch and report findings. You can help greatly with reviewing the patch, so we can mark it ready to be committed.
#78
I tried the patch and it appears to work
#79
I looked into one of the many errors I get in my logs which get caused by this issue, to see if it was a robot or a person. It seems like it was probably a person.
"Mozilla/4.0(compatible; MSIE 5.0; Windows 98; DigExt)"
I hope this goes into the 4.7 branch.
#80
this may not be on the 4.7 radar because it was marked as a support request
#81
+1
What is the value in saving the code from "bloating" by a few lines if it means my logs bloat by twice as much every minute? Not only do this many errors make the logs unusable as a tool, but I have to hide them from my admin clients so that they don't lose confidence in me and Drupal. After all, how do I prove that all of these are bad robots and not valid users?
kbahey has made an exceptional argument for this patch: "It fixes things and has no side effects, and we are still conforming to the standards." At this point any remaining objections either ignore these facts, are indifferent to the problems this patch solves, or take an overly zealous political stance against the idea of taking defensive measures that shouldn't be necessary in a perfect world.
#82
One valid argument among all the zealotry: It should be inspected that the URLs we change are check_url 'd properly. I do not have the time today.
#83
For those who suffer from this problem, but cannot apply patches, I am attaching pre-patched versions for Drupal 4.6.5 of the three files affected: theme.inc, common.inc, theme.inc.
Just extract those files in the includes directory.
(As a bonus you get a PHPSESSID suppression patch included as well).
#84
I've tried applying this patch but I get
patch page.tpl.php
#85
sorry, the last post didn't come through; this is what I get when trying to apply the patch
patch -p1 page.tpl.php <base-kill_0.patch
patching file page.tpl.php
Hunk #1 FAILED at 13.
Hunk #2 FAILED at 44.
2 out of 2 hunks FAILED -- saving rejects to file page.tpl.php.rej
can't find file to patch at input line 31
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------
|Index: themes/engines/phptemplate/phptemplate.engine
|===================================================================
|--- themes/engines/phptemplate/phptemplate.engine (revision 5734)
|+++ themes/engines/phptemplate/phptemplate.engine (working copy)
#86
Make sure that you download the patch from #76, then do:
$ patch -p0 < dev/base-kill_0.patch
patching file themes/bluemarine/page.tpl.php
patching file themes/engines/phptemplate/phptemplate.engine
patching file themes/pushbutton/page.tpl.php
patching file includes/bootstrap.inc
patching file includes/common.inc
patching file includes/theme.inc
It should work fine. The key is -p0
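Why the strip level matters can be shown with a small Python sketch of what -pN does to the file names in the patch headers. strip_components is a hypothetical helper that imitates the option for relative paths; it is not part of patch itself.

```python
def strip_components(path, n):
    """Imitate patch -pN: drop the first n slash-separated components."""
    return "/".join(path.split("/")[n:])


header = "themes/engines/phptemplate/phptemplate.engine"
for n in (0, 1):
    print("-p%d -> %s" % (n, strip_components(header, n)))
```

Run from the Drupal root, only the -p0 result names a file that exists there; with -p1 the leading themes/ is dropped and the file cannot be found.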
#87
yep, the patch is #76, but still no luck with patch -p0
I've also updated to the latest version 4.6 and installed your (tar patch), kbahey
patch -p0 page.tpl.php <base-kill_0.patch
patching file page.tpl.php
Hunk #1 FAILED at 13.
Hunk #2 FAILED at 44.
2 out of 2 hunks FAILED -- saving rejects to file page.tpl.php.rej
can't find file to patch at input line 31
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------
|Index: themes/engines/phptemplate/phptemplate.engine
|===================================================================
|--- themes/engines/phptemplate/phptemplate.engine (revision 5734)
|+++ themes/engines/phptemplate/phptemplate.engine (working copy)
--------------------------
#88
the patch in #76 is against the 4.7 branch.
If you're using 4.6, make sure you have 4.6.5 and then use http://drupal.org/node/13148#comment-61527
#89
ok no problems, yep 4.6.5 is the one it's been updated to with that (tar patch) from kbahey
But even with that patch, I still get thousands of these feeds errors
i.e
error php 2005-12-22 06:10 Unknown column 'feed' in 'where clause' quer Anonymous details
error php 2005-12-22 06:10 Unknown column 'feed' in 'where clause' quer Anonymous details
error php 2005-12-22 06:10 Duplicate entry '%s' for key 1 query: INSERT INTO cach Anonymous details
error php 2005-12-22 06:10 Unknown column 'feed' in 'where clause' quer Anonymous details
Anonymous details
Lots of these
Type php
Date Thursday, December 22, 2005 - 06:10
User Anonymous
Location /taxonomy_menu/1/46/taxonomy/term/63/all/taxonomy/term/feed/all/taxonomy/term/feed/all/taxonomy/term/feed/all/taxonomy/term/feed
Message Unknown column 'feed' in 'where clause' query: SELECT DISTINCT(n.nid), n.sticky, n.title, n.created FROM node n INNER JOIN term_node tn ON n.nid = tn.nid WHERE tn.tid IN (feed) AND n.status = 1 ORDER BY n.sticky DESC, n.created DESC LIMIT 0, 3 in /home/football/public_html/includes/database.mysql.inc on line 66.
Severity error
Type php
Date Thursday, December 22, 2005 - 06:10
User Anonymous
Location /taxonomy_menu/1/46/taxonomy/term/63/all/taxonomy/term/feed/all/taxonomy/term/feed/all/taxonomy/term/feed/all/taxonomy/term/feed
Message Unknown column 'feed' in 'where clause' query: SELECT COUNT(DISTINCT(n.nid)) FROM node n INNER JOIN term_node tn ON n.nid = tn.nid WHERE tn.tid IN (feed) AND n.status = 1 in /home/football/public_html/includes/database.mysql.inc on line 66.
Severity error
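The ever-growing Location values are consistent with a client that ignores the base href and keeps re-resolving the same relative link against whatever URL it fetched last. A minimal Python sketch of that assumed behaviour, with a made-up host and paths (it does not reproduce the exact Location above):

```python
from urllib.parse import urljoin


def follow(start, link, hops):
    """Re-resolve the same link against each fetched URL, ignoring any base href."""
    urls = [start]
    for _ in range(hops):
        urls.append(urljoin(urls[-1], link))
    return urls


START = "http://example.com/taxonomy/term/63/all"  # hypothetical page

# Relative link: the path grows with every hop.
for url in follow(START, "taxonomy/term/63/all/feed", 3):
    print(url)

# Root-relative link: every hop lands on the same URL.
for url in follow(START, "/taxonomy/term/63/all/feed", 3):
    print(url)
```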
#90
Is there a patch against the 4.6.5 version of the phptemplate.engine? Thanks to whoever might provide it. I have the patched files from the earlier 4.6.5 post. There seems to be some suggestion that this problem has been causing errors with the HTML area module's image plugin as well. Hopefully this update will fix that as well.
#91
I would also like to add my +1 to see this in 4.7.
It's compliant and solves several problems (it might also solve the selection text bug in IE)...
Are there any advantages on not applying the patch?
#92
+1
4.6.5 patch above fixes the following problem as well:
using href=#tagname is broken in the current 4.6, as it's not linked against the current page URL but against the baseurl, so the tags don't work correctly. This is fixed, by the patch.
This should be in 4.6.6, as that is a critical brokenness (I'm doing HTML cut and paste from an existing html site, and discovering that Drupal didn't do the right thing out of the box was a shock. Thanks to this thread, it's clear that this patch IS needed, for direct functionality, regardless of whoever else is not doing the right thing)
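The fragment behaviour is easy to check with standard URL resolution. A short Python illustration with made-up URLs, where the first call mimics a page carrying a base href and the second a page without one:

```python
from urllib.parse import urljoin

BASE = "http://example.com/"         # hypothetical site base URL
PAGE = "http://example.com/node/12"  # hypothetical page containing the anchor

print(urljoin(BASE, "#tagname"))  # with a base href: jumps to the front page
print(urljoin(PAGE, "#tagname"))  # without it: stays on the current page
```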
#93
Just to point out how truly _broken_ the way Drupal deals with relative urls is:
we now have a module devoted to _fixing_ it.
http://drupal.org/project/rellinkfilter
And worse, if you install the input filter to fix this, caching is then disabled.
#94
rerolled for HEAD. I tested a bunch, uploaded custom logo, and all looks OK.
i am a pragmatist. this patch removes the 404s from our disks and our eyes.
#95
Note that my rellinkfilter module, all 1292 bytes of it, was designed to let me import existing HTML files that expect the document root to be the current "directory". I have five years of nearly daily static weblog pages that I want to import (hence, I'll need the patch that speeds up URL mapping). Drupal uses the Drupal root directory as the document root, and I'm sure thousands of pages depend on this. That's why if this feature is added, it has to be optional, defaulting to the current behavior, or you'll likely break lots of pages.
I turned off the filter cache because it was the easiest way to make sure the page got recomputed after the last preview, when viewed in its final place. Otherwise, people would get confused. You could actually leave the cache on, as long as you don't preview the last change to a page that uses this filter, and as long as you don't intend to map any of the pages to multiple different "directories", which I doubt many people do.
Relative links are good. They allow downloaded web sites to work properly when loaded from disk. And they allow restructuring of the site directory hierarchy without changing lots of text.
But I haven't tried the patch or looked at the details of what it does, so my opinion is currently uninformed. If it solves my problem, and will become part of the core, I'll definitely prefer it to an additional module. I'll look at it soon.
#96
moshe's patch doesn't work for me at all. All links stay as they are (e.g. href = admin/whatever ) without the global base_url_path at the beginning of them. using cvs
#97
I think the problem with the patch is that $base_url_path is never actually set. So although it's global, it's = ''. Where was this meant to be set? Perhaps a certain critical file wasn't included in the patch?
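For reference, one plausible way to derive such a value is to take the path component of the configured base URL. The following is a Python sketch under that assumption, not the code the patch was meant to contain; base_path_from is a hypothetical name.

```python
from urllib.parse import urlsplit


def base_path_from(base_url):
    """Return the path part of base_url, always ending in a slash."""
    return urlsplit(base_url).path.rstrip("/") + "/"


print(base_path_from("http://example.com"))           # site in the root folder
print(base_path_from("http://example.com/drupal46"))  # site in a subdirectory
```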