Looking at my site's logs, there seem to be several problems that are caused by Drupal's use of relative path names.

If Drupal causes all the site's urls to be absolute, then none of this would be an issue.

A. Search Engine Crawlers

Getting lots of 404s on things like: linux/index.html/robots.txt

Where 'linux' is an alias to a taxonomy, and 'index.html' is an alias to a node within that taxonomy.

Another example, is recursing unnecessarily. I see 404s on things like: /linux/index.html/linux/index.html

Where 'linux' is a path alias for a taxonomy term, and 'index.html' is an alias to the main node within it.

This does not seem to happen when Google crawls my sites, but Yahoo's Slurp suffers from this problem, and keeps recursing. MSNBot also suffers from this.

Another crawler/harvester called Blinkx/DFS-Fetch keeps adding the .css file to the relative path, getting a 404 on things like: /linux/themes/xtemplate/pushbutton/logo.gif

And Fast Search Engine also attempts to access: /linux/contact/tracker/tracker/user/password

The same goes for grub.org, another crawler.

B. Google Cache / Archive Way Back Machine

Pages in Google cache and archive.org Way Back Machine suffer form a similar problem: the .css files cannot be found, and hence rendering of the pages is not correct.

Examples:

Compare this: http://www.drupal.org/node/4647

To this: http://www.google.ca/search?q=cache:www.drupal.org/node/view/4647

Notice the following:

  1. How there is no formatting at all, because of the lack of a .css file
  2. The httpd log on Drupal will show errors for: linux/themes/pushbutton/style.css and linux/misc/drupal.css

Also see: http://web.archive.org/web/20031016184902/http://www.drupal.org/

C. Proxy Caches:

When someone is browsing my site from behind a proxy cache, the web site is hit with a rapid succession of requests, and many of it is just for bogus pages.

Examples:

2004/11/17 - 17:47 404 error: linux/user/1 not found.
2004/11/17 - 17:47 404 error: linux/feedback not found.
2004/11/17 - 17:47 404 error: linux/tracker not found.
2004/11/17 - 17:47 404 error: linux/sitemap not found.
2004/11/17 - 17:47 404 error: linux/search not found.
2004/11/17 - 17:47 404 error: linux/misc not found.
2004/11/17 - 17:47 404 error: linux/programming not found.
2004/11/17 - 17:47 404 error: linux/programming not found.
2004/11/17 - 17:47 404 error: linux/linux not found.
2004/11/17 - 17:47 404 error: linux/technology not found.
2004/11/17 - 17:47 404 error: linux/writings not found.
2004/11/17 - 17:47 404 error: linux/family not found.

And also:

2004/11/17 - 07:23 404 error: history/user/1 not found.
2004/11/17 - 07:23 404 error: history/tracker not found.
2004/11/17 - 07:23 404 error: history/feedback not found.
2004/11/17 - 07:23 404 error: history/sitemap not found.
2004/11/17 - 07:23 404 error: history/search not found.
2004/11/17 - 07:23 404 error: history/misc not found.
2004/11/17 - 07:23 404 error: history/technology not found.
2004/11/17 - 07:23 404 error: history/science not found.
2004/11/17 - 07:22 404 error: history/history not found.
2004/11/17 - 07:22 404 error: history/writings not found.
2004/11/17 - 07:22 404 error: history/family not found.

As you can tell, history and linux are aliases to taxonomy terms, and so is misc, technology, writings, family, ...etc. The user agent is appending the taxonomy term alias to the url and forming a new URL.

D. Regular Browsing:

There is even at least one extreme case where the following URL was accessed (the result was 404 of course)

/book/view/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/logo.gif

It seems it was a normal user, because the user agent is: "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)"

Proposed Solution:

As a proposed solution, all URLs in Drupal can be made into absolute path names. This can be done by the following:

  1. The variable $base_url in the conf.php file is broken down into two components:
    • $base_host (the 'http://whatever-host.example.com' part WITHOUT the trailing slash)
    • $base_path (the '/path-to-drupal' part, WITH the leading slash. If this is the DocumentRoot, then it is just a '/' character)
  2. $base_url is now $base_host concatenated with $base_path
  3. A simple filter can be written to preceed every href="path" with the $base_path variable, so it becomes "/path"
  4. This option can be turned on and off for a site. The default is to have it off so current behavior is maintained.
  5. A similar scheme applies for style sheets as well.

So, did I miss something obvious? Am I seriously off the mark?

Your thoughts!

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

chrisada’s picture

I am getting similar 404 errors, mainly from rss feed link that looks like /blog/blog/feed and many manual links that are relative to drupal root.

It was not a problem before Drupal 4.5, so I think there might not be a need to change all URIs to absolute. I can't see where the problem is coming from though.

kbahey’s picture

Category: bug » feature

I am pretty sure that these problems were happening for at least the past 10 months (ever since I moved to Drupal in January 2004).

The main issue here is that crawlers and other user agents get confused by the relative path names.

Using absolute paths will definitely solve this. However, is this the only solution?

I am looking for a discussion of this.

Gábor Hojtsy’s picture

No absolute paths please. Having the path start with '/' solves all the mentioned problems, and is not absolute, it is relative to the domain. Sadly some crawlers and even the Google Cache does not obey to the base href. I have reported this cache problem in April to Google, and they promised they will keep it in mind... Hehe...

What we need is to have the printed relative path values relative to the domain name, and not relative to the Drupal installation path.

Note that this issue will appear on the drupal devel mailing list if someone finally provides a patch we can talk about :)

Dries’s picture

Goba is right. We need paths relative to the domain name to fix this 'problem'.

kbahey’s picture

Sorry for not making my self clear.

When I said absolute, I meant that they start with just a /. I did NOT mean that they start with http://host.example.com. That would be a very bad idea.

In any case, what do people think about the proposed solution (breaking down $base_url into two parts?)

Also, does this address the style sheets as well, or more is needed?

chx’s picture

FileSize
471 bytes

I have implemented what Goba suggested.

chx’s picture

FileSize
825 bytes

Maybe this one is faster?

kbahey’s picture

Man! You are fast!

I tried the second version. It works fine for things that are not inside the node body, I mean they have a / in front of them, as we want it to be.

Two comments/issues:

- If there is a URL that is already "/" representing the home page, it gets set to "//". Perhaps it should check for that case?

- URLs in nodes that do not start with / do not get changed to have a / prepended to them. Do we need a filter for this?

- Do we need to do something for the style sheets in the page header? I mean the "misc/drupal.css" and "themes/themename/style.css"?

Thanks

kbahey’s picture

Hi chx

Here is a fix for the case where you have a url that is just "/".

In your patch, instead of:

$base = $parts['path'] . '/' ;

Replace that by:

$base = ( $path == '/' ? $base : $parts['path'] . '/' );

kbahey’s picture

Did this patch make it into CVS yet?

If there are any objections to it, can someone please explain what they are?

Thanks

Dries’s picture

Shouldn't your changes be included in the patch?

Also, it's better to cache $base rather than $parts.

Lastly, it this patch makes it to HEAD, we should probably remove some 'base url' cruft from the themes.

kbahey’s picture

FileSize
1 KB

Here is the patch including my fix.

I am asking chx to comment on caching $base instead of $parts.

Will this make it faster?

chx’s picture

Hm. $base = ( $path == '/' ? $base : $parts['path'] . '/' ); this depends on path which is a parameter. Thus I fail to see how could we cache $base. I'd correct this code however $base = ( $path == '/' ? '' : $parts['path'] . '/' ); 'cos I think $base is not defined before, but this is not a problem, PHP will be happy to replace NULL with NULL...

Maybe instead of all parts, only $parts['path'] is enough to be cached, yes, but the performance and memory usage difference -- I guess -- would not be noticable...

kbahey’s picture

FileSize
1 KB

OK.

I put in chx suggested change.

This patch can go in CVS then, to rid us of the problems with paths not beginning with slash.

This is not an ultimate solution still. We need to address the problem with .css files. Although the header contains a:

<base href="http://example.com" />

it does not seem that major search engines and archiving sites obey it anyway.

Dries’s picture

Your coding style needs work. Also, I'm not going to commit this unless the themes get fixed up: we'd end up with invalid URLs all over the place. Lastly, I wonder how portable the themes will be when Drupal is run from within a subdirectory.

chx’s picture

FileSize
849 bytes

Well, my patch worked from a subdirectory very well, as fact, I have not tested it from the root dir. And I think that it adheres to coding standards. So I resubmit it with the root path fix. However, my Drupal work is focused on i18n these days, and I was never into themeing so it won't be me who fixes those.

kbahey’s picture

I have tested the previous patch (including my fix) with drupal installed in the DocumentRoot of the server.

So, in effect, it is tested with both Drupal in / and Drupal in a subdirectory.

This change fixes the problem for the crawlers and other browsers from getting confused.

While it is true that there is no fix for the .css files in the HTML head section yet, this fix deals with a major part of the problem, and rids us of a major pain. Check your web server's logs some time to see what I mean.

Someone who is familiar with the themes can contribute a patch later.

This patch and the future fix for themes are not mutually exclusive, so let it go in CVS.

Gábor Hojtsy’s picture

Category: feature » bug

Please commit this into Drupal core, this fix is badly needed.

chx’s picture

FileSize
4.34 KB

Well as noone have stepped in to fix this problem, I have tried to fix the themes also. themes.inc , xtemplate.engine and the bluemarine template is patched besides common.inc.

Of course, more templates could follow, but first I'd like to see your opinions.

Gábor Hojtsy’s picture

I don't think that removing <base> from the themes is a good idea, using $parts['path'] should be encouraged though before the files, which would fix the google cache problem, and would still keep the HTML size low. It would also help those, who save the file to find the originating site easier, since clicking on a non-pagelocal link would lead to the online version.

Steven’s picture

Definitely -1 on removing the <base&gt tag or using absolute or root-relative URLs. This tag has been around for ages, and it is the only way to make clean URLs work without bloating in the code. FYI, "base" is (first?) mentioned in Berners-Lee's HTML 1.0 draft. That's June 1993.

As the amount of clean URL-using sites grows, the crawlers will have to be updated. Perhaps we could prevent crawlers from going too insane by 404ing for URLs with more than say 10 components? That would prevent the really crappy ones from hammering your site.

I'm all for making the <base> tag easier to handle for the user (say, by including a filter to allow simple anchor links to work as most people expect them to), but we should keep Drupal-generated URLs clean and completely relative.

kbahey’s picture

The problem with css is this: The @import argument does not start with a /.

This is simple to fix.

We keep the "base" as it is today, but add the new variable: $base before it.

So for a site where Drupal is installed in the DocumentRoot, all that will change is that /misc/drupal.css and /themes/themename/style.css will be preceded by a slash. For sites that use another path, that path will be prepended to the css file name.

How about that?

Steven’s picture

What exactly is the problem with the @import? As far as I know:

- url() in stylesheets is interpreted relative to the base of the stylesheet, not the source document.
- However, if the styles are inside an HTML document, through a style tag or style attribute, then the stylesheet's location is the same as the HTML document.
- Thus, the stylesheet's base is the same as the base of the HTML document (which can be altered through the <base> tag).

I just don't see why it is necessary. As far as I know, the only browser that has had problems resolving CSS urls properly was Netscape 4, which does not support @import at all, and which Drupal does not support either, because of its CSS usage.

kbahey’s picture

The problem for stylesheets is as follows. I think it mainly affect crawlers and Google's cache.

Say you have an installtion of Drupal in DocumentRoot. You then use url aliases, and put slashes in them.

For example, you use news/general/2004-12-15.html for a node.

That node still has misc/drupal.css and themes/pushbutton/style.css in the head section if the document. Crawlers get fooled by that and try to look for /news/general/misc/drupal.css and /news/general/themes/pushbutton/style.css, which don't exist.

So, just prepending the new $base variable (in chx's patch) before the stylesheet @import argument would fix this issue. Assuming you are in DocumentRoot, then /misc and /themes would be used instead of just misc and themes.

It would still be compliant with standards, be relative to the web site, and no ambiguous to anyone, be they crawler or browser.

I hope it is clearer now.

I think chx can change the patch to use the $base instead of $base_url everywhere, so as to avoid the host/domain name in the urls.

Steven’s picture

But typical crawlers don't even pay attention to stylesheets, hence it wouldn't have much use for them. I just don't see why we should adjust to rare cases of buggy software. Reading out a base URL from an HTML document is dead easy, and on top of that it doesn't add more complexity as without the base tag, the document's URL is already an implicit base which has to be parsed anyway.

I did not like it when we altered the <link&gt tag to accomodate buggy RSS readers and I certainly don't like it now, as this is even rarer. In both cases, it is not Drupal which is at fault.

kbahey’s picture

Steven

While I agree with most of what you said, the 404s show up in the logs enough to be a bother.

Perhaps the original design of Drupal did not forsee that people will use url aliases to mimic directory/file hierarchies. Whether this was intended or not, it is the way many use Drupal today.

It does not matter where the bug is (Drupal or the external world), as long as we can stop it ourselves, by adjusting our end of it.

The fix is simple enough and does not break standards (if implemented as described with a leading / before the css).

Steven’s picture

It does not break standards, but it does bloat the code in an ugly way. Why not send an e-mail to the owners of the crawlers and tell them to implement a standard that is nearly 10 years old (RFC 1808)?

Note that Google Cache now seems to correctly interpret base URLs and even adds a <base> tag of its own.

By the way, this problem has nothing to do with people using URL aliases or not, as for a browser the regular nested paths that Drupal uses (e.g. "node/1" is no different from aliases mimicking files "foo/bar.html").

Gábor Hojtsy’s picture

Steven, part of the problem is that Google cache does add a base href even if there is a base href in the document. Eg adds a <BASE HREF="http://drupal.org/node/13733"> on the plone comparision page cached. Now that since HTML does not allow more than one base tag to be present, it is up to the browsers, to use the first or the last base value, or any of the base values on the page for that matter as the used base. So even pages displayed from the google cache will be buggy if a full relative path to the domain root is not specified, due to this problem.

chx’s picture

FileSize
4.75 KB

This one does not use the whole base_url only the path part of it. HTML bloat is kept at minimal.

clairem’s picture

Please please can this be done?

It's a good idea in itself, but if using fully-qualified paths means we can get rid of the BASE HREF, then page anchors will work without having the overhead of a filter. That's be a huge bonus for those creating larger nodes, or who just want to be able to put a "skip navigation" link in their theme without having to abandon Xtemplate or PHPtemplate

Gábor Hojtsy’s picture

Well, speaking of skip navigation links, phptemplate and xtemplate should expose the REQUEST_URI to the templates, so when a link to an anchor on the same page is needed, the link can be formatted with the complete request URI in mind.

clairem’s picture

hptemplate and xtemplate should expose the REQUEST_URI to the templates

Should, but don't :(

If BASE HREF isn't removed, surely it wouldn't be a big job to implement this tweak?

kbahey’s picture

This patch is badly needed. The lack of a leading / in many paths is causing lots of problems.

kbahey’s picture

Can this patch be applied for 4.6? it is really badly needed.

Dries’s picture

I don't see why this is badly needed. We generate perfectly valid URLs which are supposed to be short and crispy. This patch has some advantages though, yet it is unclear which patch to go with.

chx’s picture

The second patch is better.

grohk’s picture

Forgive me for saying so, but since the way Drupal is generating hyperlinks is completely valid, why are you suggesting Drupal should move away from an accepted standard when the problem lies with the search engines?

At the very least, this needs to be optional -- which it appears to be -- I hate the 404s too, but I hate to hear that a change in Drupal is needed to fix a Google problem.

jhriggs’s picture

I have to agree with the last comment from grohk.

matt westgate’s picture

I also agree with the two previous Drupaleers, but I wouldn't mind enabling a 'quirks mode' via my conf file to stop the flood of 404 messages.

kbahey’s picture

I really can't fathom why some of us cannot deal with with the realities out there in the world.

These problems are not because Drupal is broken. It is because crawlers are. We cannot just bury our collective heads in the sand and say that we are standards compliant and forget about what is out there.

As an analogy, people who design themes or write CSS have to deal with the ugliness of Microsoft Internet Explorer and its intentional going against standards. You cannot tell a client or your boss that you are not modifying a theme that works perfectly on Konqueror and Firefox because MS IE is broken.

Similarly, we cannot ignore that crawlers from major search engine companies are broken or confused, and keep recursing through site using Drupal causing countless errors in the logs. We cannot tell our users to ask Google and Yahoo et al to fix their software.

Remember that we are not breaking any standards by implementing this patch. All we are doing is putting the entire path out (from the first / down) and thus eliminating ambiguity for everyone.

Sorry if I am a bit blunt in this post, but I am tired of what may be seen as isolationist thinking.

I do not mind if this is implemented in an advanced mode or via a settings.php thing. All I care about is getting it fixed somehow.

kbahey’s picture

Circular log errors reported here too http://drupal.org/node/9499

shane’s picture

I agree with KBAHEY. Burrying our head in the sand and saying "it ain't our problem" is not going to fix the issue. I despise companies that break standards - and I applaud Drupal for working hard to keep within those confines. But the reality is money grubbing, lazy ass programmers exist the world over, and the consequence is things like MSIE breaking everything wantonly and intentionally, Google, Yahoo, et al implementing poor bot code, etc...

I believe this desperately needs to get fixed. Ever since I started hand writing HTML code in 1992 I have always insured that my URL paths are absolute to the base html document root (eg, preceeded with "/" and the full path). It avoids confusion, problems, or issues. It seems odd that the debate over this would rage as it has in this thread.

...and I don't get the "bloat" discussion. How is this bloating things? Are we talking a few dozen extra characters? I hope I'm missing something more obvious and insidious than that!?

I've been a loyal Drupaler for ages now, and I love it. But this new problem is causing me a lot of grief, I see frequent munging of the URLs, and it worries me; particularly when I see that there are end users getting 404s. They don't give a rats ba-tu-tey that Drupal is "standards compliant" ... they just know they got an error when they supposedely did exactly what they should have, click on a URL. That reflects poorly on the site owner and ultimately on the software itself.

Please reconsider this issue, and let a patch go into core to fix it. It doesn't make sense to let it rage on as an issue that is causing lots of people obvious grief. I'm betting it's a bigger issue than most admins think - most don't spend the anal-retentitive time that I and others do grubbing through our logs, trying to insure a "perfect" surfing experience for our end-users...

ezheidtmann’s picture

This is truly not a problem with Drupal, but it may be reasonable to change Drupal's behavior anyway.

The base href tag has been in the W3C standards since 1997. Failing to observe this tag isn't about being slow on the uptake (as with MSIE and CSS2). It's about deliberately breaking existing compatibility.

Has anyone contacted Yahoo, MSN, etc. and told them of this problem? If and when they fix their crawlers, we need to be able to turn off this kludge to discourage other more marginal crawlers to observe the standards.

ezheidtmann’s picture

That should be "encourage other more marginal crawlers to observe the standards."

kbahey’s picture

Here are examples from drupal.org itself:

As you can see, if the paths started with a slash, none of this would have happened.

warning page not found 06/05/2005 - 10:36 drupal-sites/themes/bluebeach/style.css not found.
warning page not found 06/05/2005 - 10:36 drupal-sites/themes/bluebeach/print.css not found.
warning page not found 06/05/2005 - 10:36 drupal-sites/misc/drupal.css not found.
warning page not found 06/05/2005 - 10:33 about/tracker not found.
warning page not found 06/05/2005 - 10:33 about/support not found.
warning page not found 06/05/2005 - 10:33 about/project not found.
warning page not found 06/05/2005 - 10:33 about/services not found.
warning page not found 06/05/2005 - 10:33 about/handbook not found.
warning page not found 06/05/2005 - 10:33 about/features not found.
warning page not found 06/05/2005 - 10:33 about/forum not found.
warning page not found 06/05/2005 - 10:33 about/drupal-sites not found.
warning page not found 06/05/2005 - 10:33 about/druplicon not found.
warning page not found 06/05/2005 - 10:33 about/download not found.
warning page not found 06/05/2005 - 10:33 about/donate not found.
warning page not found 06/05/2005 - 10:33 about/documentation-writers-guide not found.
warning page not found 06/05/2005 - 10:33 about/contributors-guide not found.
warning page not found 06/05/2005 - 10:33 about/cvs not found.
warning page not found 06/05/2005 - 10:33 about/contact not found.
warning page not found 06/05/2005 - 10:33 about/contribute not found.
warning page not found 06/05/2005 - 10:33 about/aggregator not found.
warning page not found 06/05/2005 - 10:33 about/cases not found.
warning page not found 06/05/2005 - 10:33 about/about not found.

grohk’s picture

Just because some of us disagree with this solution to the perceived problem does not mean we are not fond of reality, it just mean we have a different way of seeing this issue. If we cannot use accepted standards in Drupal, then what are good is it to adhere to them in the first place?

Does this patch really affect the experience of end users of Drupal? Unless I am missing something, normal people never see these errors. Google is a search engine, it is not a user. In my experience, users remain unaware of this "problem". But changing Drupal to adhere to preferences of a broken crawler is not going to encourage anyone to fix their poorly implemented software either.

As someone who appreciates the elegance of Drupal and uses it just as much as anyone, all this is fine with me -- as long as it is optional. But I don't think appeasement is the answer to "fixing" this problem, because with this option enabled there is no impetus for the crawler programmers to fix anything.

And for the record, Google has been caching my pages correctly with CSS for months now and it has not entered into a loop either.

kbahey’s picture

If we cannot use accepted standards in Drupal, then what are good is it to adhere to them in the first place?

By using paths beginning with slashes, we are not breaking any standards that I know of.

Does this patch really affect the experience of end users of Drupal? Unless I am missing something, normal people never see these errors.

The clutter in the logs is very annoying, and makes makes it harder for the site admin to find the info he needs. It also consumes bandwidth.

Google is a search engine, it is not a user.

The user here is the site admin, not the end user.

But changing Drupal to adhere to preferences of a broken crawler is not going to encourage anyone to fix their poorly implemented software either ... But I don't think appeasement is the answer to "fixing" this problem, because with this option enabled there is no impetus for the crawler programmers to fix anything.

By the same token, we can ignore MS IE's broken CSS handling and a bunch of other things, and claim that they should fix themselves. Meanwhile 80% of users are facing these issues.

That is not the way to look at things. If we can implement something that does not break standards but avoid many of us the grief that this causes, then why not?

A solution that allows this to be turned on or off, via an option or a settings.php flag would make everyone happy.

killes@www.drop.org’s picture

The patch apparently hasn't found much favour, setting to "active".

I suggest to get hold of the IP ranges the broken crawlers use and block them in the .htaccess file we ship with Drupal.

Long live open web standards!

kbahey’s picture

It is really sad that most of us do not see a problem here, or brush it off as someone else's problem.

Standards are only valid if everyone follows them. The reality is that some do not, and depending on the market presence and strengths of those in violation of the standard, they are insignificant to something that is to be dealt with.

If web designers ignored Microsoft Internet Expolrer, with its blatant aloofness to standard, unintentional or otherwise, they would be out of business. 80% of people are still using MS IE. This is exactly the same issue.
To see the magnitude of the problem, run the following SQL against your site:

> select hostname from watchdog where type = 'httpd' and message like '%.css%';

> select hostname, count(*) cnt from watchdog where type = 'httpd' and message like '%.css%' group by hostname order by cnt asc;

The first shows 5383 rows, the latter shows 1886 rows.

As I said we will not be breaking any standards by qualifying our URLs and making them unambiguous to everyone, starting with the /.

ezheidtmann’s picture

kbahey, have you contacted any of these crawlers to tell them their software is broken?

kbahey’s picture

No. I haven't.

There are 1866 unique IPs over the a period of 4 months. Even if we assume that these are in subnets, and say 10 per subnet, this means I have to contact 186 separate organizations/individuals, which is such a great effort. Even if I assume that there is skew, and that there are 20 organizations/individuals, it is still a great effort, and how many of those will respond, let alone fix their crawlers.

The question is: what is within our control and influence and what is not. This is like writing CSS for MS IE and for other standard conforming browsers. Or like defensive driving in an area where there are many rogue drivers. You cannot say that you are conforming to css standards and hell with the rest of the world, and you will not deal with them at all. Nor can you say that you are within your lane, at the set speed and keeping your distance, and will cross a green light while a drunk person is crossing the intersection.

We had to deal with comment spam (Google's nofollow, various modules to deal with it, or turning off anonymous comments, and moderating them), and referer spam (hide the statistics pages from view, or disabling statistics altogether). Didn't we? It is a rough world out there, and if the others are unethical or criminals or just don't play by the rules, we still have to deal with them. How is this any different?

Seeing this as a purely external issue and not dealing with it in the software we control is unrealistic. Remember that the fix does not break any standards. We will still be standard compliant with it, so the standards slant of it is not convincing.

Sorry, I just see it this way, and none of the counter arguments so far is convincing to me so far.

ezheidtmann’s picture

This is not at all like MSIE. IE is a majority browser, so designing sites that don't work with it loses users. This, on the other hand, is a small minority of crawlers making a nuisance of themselves.

The drunk driver analogy is a little bit closer, but this isn't a life or death situation. You've convinced me that you want this feature, but you haven't convinced me that I want this feature. If this is committed, *please* make it optional!

jayCampbell’s picture

This issue interferes with BlogLines' ability to autodetect feeds. Several desktop clients have the same problem. While I agree that ideally this should be an optional change, losing 10% of your RSS readership is serious stuff. Thank you for the interim hide-saving, chx.

kbahey’s picture

chx

Was the patch updated for 4.6? I have applied only the common.inc.

But since I am now using phptemplate based themes (pushbutton, and soon a custom one), can you please give some instructions on what to change (e.g. putting path_to_theme() in some places?)

Thanks in advance.

njvack’s picture

I've tested this with 4.6 -- one of the hunks didn't want to apply to common.inc; I think it was a line count off-by-one change or something, though -- I made the change by hand and it's working great.

This patch hasn't found widespread acceptance... is it breaking things for some people?

This feels like a better method of handling links than using BASE HREF, as far as I can feel. The way I see it is: Google is pretty good at indexing web pages. Not perfect, but I'm willing to believe they understand HTML better than I do. BASE HREF isn't a new part of the standard. At all. If Google doesn't support the way Drupal is trying to use BASE HREF, my money is that Google is right, and Drupal is using the tag in an unintended way.

But anyhow, it's happy under 4.6 for me, so far.

beate_r’s picture

Version: » 4.6.0

I just read through this thread because i have been beaten by the problem: I am in need for URLs relative to the server document root for different reasons.

First, my server lives in a LAN behind a firewall. Since it runs several web apps, drupal in installed in a subdirectory of my document root. It is accessible from the outside solely via https and from inside the LAN under a different name, via https as well as using normal http (the latter is necessary to get cron.php working correctly, isn't it?)

Secondly, run a clone of this site as a testing platform on my laptop, which obviously has its own URL and to be able to sync it with as little hassle as possible. IMHO this is very important - especially if You set up sites using many modules You will need to test Your setup somewhere else before going onto the production system.

Thirdly, it prevents me from using http(s)://localhost/drupal46 or IP addresses to access the site.

Fourthly i or some remote user will get hassle to mirror the site or parts of it into static pages (that's again the crawler problem).

Such a situation has been discussed elsewhere in this forum multiple times, and the commonly accepted proposal was to use a relative path as a base URL, in my setting '/drupal46' (if i remember correctly, it is even in drupal's manual, isn't it?) --- although this will obviosly break the HTML-standard. Now comes the real problem: relative paths in are interpreted as intended by this hack by most relevant browsers: Mozilla et al., Opera, MSIE 5.x, Safari, even by exots like lynx and links, but there are a few ones that adhere strictly to standards in this respect, beside a few other exots like w3m and amaya MSIE 6.x also belongs to this group. Which means that my site is not accessible by 2/3 of the world. Really ugly.

I am aware of using a multisite setup as a workaround, but seriously, why use a workaround which costs me administrative effort if a clean and standard conforming solution is at sight, namely avoiding the use of the tag in favour of using paths relative to the document root of my server or virtual host.

So i would strongly vote for modifying drupal to drop using the tag. Although being part of HTML even before HTML 1.0 has been defined (always as an optional tag, BTW) it brings in more trouble than it helps.

best regards

Michael

Uwe Hermann’s picture

What's the status of this issue?

Gábor Hojtsy’s picture

Status: Active » Needs work

Now that we have patch (code needs work), this is a perfect issue to put into that status. I would vote for the change provided by the latest patch in this issue, although I have not tested the patch myself. At least at weblabor.hu and drupal.hu, we run with a custom url() function which just does what this patch is about to do (but since these Drupal setups are in the root folder, we just prepend a slash to all path values).

The patch needs to be updated to latest CVS, and as Dries said he is willing to fix this problem, it should be committed after a review.

chx’s picture

Version: 4.6.0 » x.y.z
Status: Needs work » Needs review
FileSize
8.25 KB

Reworked.

bergermann’s picture

Version: x.y.z » 4.6.3
Category: bug » support
Status: Needs review » Active

same patch for version 4.6.3 distributed files P L E A S E ... :-)

Gábor Hojtsy’s picture

Version: 4.6.3 »
Status: Active » Needs review

What if you let this issue get solved for CVS HEAD (4.7), before requesting a patch for 4.6.x? (Restoring important status values).

beeboo’s picture

I believe the issue discussed in this thread is related to this bug.

What can I do to help out?

kbahey’s picture

I would like to bring attention to this long standing issue once more.

What are the objections to this patch? It fixes things and has no side effects, and we are still conforming to the standards.

Can we get this in for 4.7?

chx’s picture

kbahey’s picture

Status: Needs review » Reviewed & tested by the community
FileSize
8.19 KB

I tested this with today's HEAD.

It works.

I hope this gets commited so we do not have the 404 headaches.

kbahey’s picture

FileSize
8.2 KB

Updating patch to go with current HEAD.

Please include this in 4.7.

eaton’s picture

+1 this is essential. RSS feeds with image references were generating thousands of 404's a day on my site. Total lifesaver.

kbahey’s picture

FileSize
8.41 KB

Updating for current HEAD. The previous patch does apply, but with offsets. This one should have no warnings.

I hope we get some more +1s on this so it is in 4.7.

Ian Ward’s picture

+1 As far as how this affects links in RSS feeds as interpreted by aggregators, it's more and more like the analogy for designing for IE, but really closer to the idea of preparing an email newsletter so it looks right and behaves right in 8-10 different email cilents. Well, it's not so bad, because you're just dealing with one thing - dead links (as opposed to many different HTML quirks) - and a knowledgable user can figure out how to get to your dead link manually...but they shouldn't have to, and many won't. And only more aggregator solutions are going to popup - how many will comply to standards? So, another analogy - defensive driving - this is like 'defensive programming for the web.'

Here are just some more notes on the RSS-specific issue (thanks for the redirect to this issue, kbahey):

http://drupal.org/node/35610

Steven’s picture

Status: Reviewed & tested by the community » Needs work

This patch outputs $base_url_path literally in several places, which is a bad idea (hypertolerant base url + 404 page = xss). http://drupal.org/node/28984

Of course, Drupal, surprise surprise, correctly outputs xml:base in its RSS feeds.

kbahey’s picture

Steven

I don't really see your point: we discussed the standards compliance aspect of this patch months ago.

Whether Drupal is conforming to standards or not is a moot point: PEOPLE ARE SUFFERING FROM THIS.

The standards compliance is not broken by this patch, but it alleviates a severe pain that many of us are seeing.

As for the security part, please explain more how this makes XSS possible?

Gábor Hojtsy’s picture

Of course, several RSS readers, surprise, surprise don't care about xml:base.
Of course, there are a lot of spiders, surprise, surprise, without base href parsing implemented.

We also suffered from this problem on several sites, and custom fixed it by always prefixing the URL with / in url(). It was easy for us, since we are using domain names, and clean URLs, so linking from the site root was quite straigtforward. The world needs this patch.

Wesley Tanaka’s picture

If this involves getting rid of the base href tag:

YES PLEASE!

killes@www.drop.org’s picture

Status: Needs work » Closed (won't fix)

No thanks, it is not our job to cater for other people's borken software.

kbahey’s picture

Status: Closed (won't fix) » Active

We have discussed this above.

Although technically this is a problem elsewhere, no one has control over which crawlers visit their site, nor how frequently they do. This is not just one external broken software, but a widespread way of doing things.

So, like comment spam, referral spam, and the whole slew of external problems, one has to take defensive measures against things that fill your logs, eat up resources, ...etc.

Wesley Tanaka’s picture

FileSize
7.37 KB

I agree 100% with what kbahey has been saying.

the patch still applies (with line number skew) against current CVS. Here's a new version calculated against current CVS head.

I don't know if there's some kind of process that searches for the string "+1" in the comments, but if there is:

+1

Gábor Hojtsy’s picture

Status: Active » Needs review

Patch needs reviews: ie. actual people who try the actual patch and report findings. You can help greatly with reviewing the patch, so we can mark it ready to be committed.

Wesley Tanaka’s picture

I tried the patch and it appears to work

Wesley Tanaka’s picture

I looked into one of the many errors I get in my logs which get caused by this issue, to see if it was a robot or a person. It seems like it was probably a person.

"Mozilla/4.0(compatible; MSIE 5.0; Windows 98; DigExt)"

I hope this goes into the 4.7 branch.

Wesley Tanaka’s picture

Category: support » bug

this may not be on the 4.7 radar because it was marked as a support request

Zach Harkey’s picture

+1

What is the value in saving the code from "bloating" by a few lines if it means my logs bloat by twice as much every minute? Not only do this many errors make the logs unusable as a tool, but I have to hide them from my admin clients so that they don't lose confidence in me and Drupal. After all, how do I prove that all of these are bad robots and not valid users?

kbahey has made an exceptional argument for this patch, "It fixes things and has no side effects, and we are still conforming to the standards." At this point any remaining objections are either ignore these facts; indifferent to the problems this patch solves; or taking an overly zealous political stance against the idea of taking defensive measures that shouldn't be necessary in a perfect world.

chx’s picture

One valid argument among all the zealotry: It should be inspected that the URLs we change are check_url 'd properly. I do not have the time today.

kbahey’s picture

For those who suffer from this problem, but cannot apply patches, I am attaching pre-patched versions for Drupal 4.6.5 of the three files affected: theme.inc, common.inc, theme.inc.

Just extract those files in the includes directory.

(As a bonus you get a PHPSESSID suppression patch included as well).

insomoz’s picture

I've tried applying this patch but I get

patch page.tpl.php patching file page.tpl.php
Hunk #1 FAILED at 13.
Hunk #2 FAILED at 44.
2 out of 2 hunks FAILED -- saving rejects to file page.tpl.php.rej
can't find file to patch at input line 31
Perhaps you should have used the -p or --strip option?
The text leading up to this was:
--------------------------
|Index: themes/engines/phptemplate/phptemplate.engine
|===================================================================
|--- themes/engines/phptemplate/phptemplate.engine (revision 5734)
|+++ themes/engines/phptemplate/phptemplate.engine (working copy)

patching file ../engines/phptemplate/phptemplate.engine
Hunk #1 FAILED at 13.
Hunk #2 FAILED at 44.
2 out of 2 hunks FAILED -- saving rejects to file ../engines/phptemplate/phptemplate.engine.rej
can't find file to patch at input line 31
Perhaps you should have used the -p or --strip option?
The text leading up to this was:
--------------------------
|Index: themes/engines/phptemplate/phptemplate.engine
|===================================================================
|--- themes/engines/phptemplate/phptemplate.engine (revision 5734)
|+++ themes/engines/phptemplate/phptemplate.engine (working copy)

and so on.
Cant seem to install this patch

insomoz’s picture

sorry the last thread didnt come up, this is what I get when trying to apply patch
patch -p1 page.tpl.php <base-kill_0.patch

patching file page.tpl.php
Hunk #1 FAILED at 13.
Hunk #2 FAILED at 44.
2 out of 2 hunks FAILED -- saving rejects to file page.tpl.php.rej
can't find file to patch at input line 31
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------
|Index: themes/engines/phptemplate/phptemplate.engine
|===================================================================
|--- themes/engines/phptemplate/phptemplate.engine (revision 5734)
|+++ themes/engines/phptemplate/phptemplate.engine (working copy)

kbahey’s picture

Make sure that you download the patch from #76, then do:

$ patch -p0 < dev/base-kill_0.patch
patching file themes/bluemarine/page.tpl.php
patching file themes/engines/phptemplate/phptemplate.engine
patching file themes/pushbutton/page.tpl.php
patching file includes/bootstrap.inc
patching file includes/common.inc
patching file includes/theme.inc

It should work fine. The key is -p0

insomoz’s picture

yep the patch is #76, but still no luck with patch -p0
I've also updated to the latest version 4.6 and installed your (tar patch) khabey

patch -p0 page.tpl.php <base-kill_0.patch
patching file page.tpl.php
Hunk #1 FAILED at 13.
Hunk #2 FAILED at 44.
2 out of 2 hunks FAILED -- saving rejects to file page.tpl.php.rej
can't find file to patch at input line 31
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------
|Index: themes/engines/phptemplate/phptemplate.engine
|===================================================================
|--- themes/engines/phptemplate/phptemplate.engine (revision 5734)
|+++ themes/engines/phptemplate/phptemplate.engine (working copy)
--------------------------

Wesley Tanaka’s picture

the patch in #76 is against the 4.7 branch.

If you're using 4.6, make sure you have 4.6.5 and then use http://drupal.org/node/13148#comment-61527

insomoz’s picture

ok no problems, yep 4.65 is the one its been updated to with that (tar patch) from kbahey

But even with that patch, I still get thousands of these feeds errors

i.e

error php 2005-12-22 06:10 Unknown column 'feed' in 'where clause' quer Anonymous details
error php 2005-12-22 06:10 Unknown column 'feed' in 'where clause' quer Anonymous details
error php 2005-12-22 06:10 Duplicate entry '%s' for key 1 query: INSERT INTO cach Anonymous details
error php 2005-12-22 06:10 Unknown column 'feed' in 'where clause' quer Anonymous details
Anonymous details

Lots of these

Type php
Date Thursday, December 22, 2005 - 06:10
User Anonymous
Location /taxonomy_menu/1/46/taxonomy/term/63/all/taxonomy/term/feed/all/taxonomy/term/feed/all/taxonomy/term/feed/all/taxonomy/term/feed
Message Unknown column 'feed' in 'where clause' query: SELECT DISTINCT(n.nid), n.sticky, n.title, n.created FROM node n INNER JOIN term_node tn ON n.nid = tn.nid WHERE tn.tid IN (feed) AND n.status = 1 ORDER BY n.sticky DESC, n.created DESC LIMIT 0, 3 in /home/football/public_html/includes/database.mysql.inc on line 66.
Severity error

Type php
Date Thursday, December 22, 2005 - 06:10
User Anonymous
Location /taxonomy_menu/1/46/taxonomy/term/63/all/taxonomy/term/feed/all/taxonomy/term/feed/all/taxonomy/term/feed/all/taxonomy/term/feed
Message Unknown column 'feed' in 'where clause' query: SELECT COUNT(DISTINCT(n.nid)) FROM node n INNER JOIN term_node tn ON n.nid = tn.nid WHERE tn.tid IN (feed) AND n.status = 1 in /home/football/public_html/includes/database.mysql.inc on line 66.
Severity error

bomarmonk’s picture

Is there a patch against the 4.6.5 version of the phptemplate.engine? Thanks to whoever might provide it. I have the patched files from the earlier 4.6.5 post. There seems to be some suggestion that this problem has been causing errors with the HTML area module's image plugin as well. Hopefully this update will fix that as well.

markus_petrux’s picture

I would also like to add my +1 to see this in 4.7.

It's compliant and solves several problems (it might also solve the selection text bug in IE)...

Are there any advantages on not applying the patch?

sethcohn’s picture

+1

4.6.5 patch above fixes the following problem as well:

using href=#tagname is broken in the current 4.6, as it's not linked against the current page URL but against the baseurl, so the tags don't work correctly. This is fixed, by the patch.

This should be in 4.6.6, as that is a critical brokenness (I'm doing HTML cut and paste from an existing html site, and discovering that Drupal didn't do the right thing out of the box was a shock. Thanks to this thread, it's clear that this patch IS needed, for direct functionality, regardless of whoever else is not doing the right thing)

sethcohn’s picture

Just to point out how truly _broken_ the way Drupal deals with relative urls is:

we now have a module devoted to _fixing_ it.

http://drupal.org/project/rellinkfilter

And worse, if you install the input filter to fix this, caching is then disabled.

moshe weitzman’s picture

Assigned: Unassigned » moshe weitzman
Status: Needs review » Reviewed & tested by the community
FileSize
7.95 KB

rerolled for HEAD. I tested a bunch, uploaded custom logo, and all looks OK.

i am a pragmatist. this patch removes the 404s from our disks and our eyes.

Bill St. Clair’s picture

Note that my rellinkfilter module, all 1292 bytes of it, was designed to let me import existing HTML files that expect the document root to be the current "directory". I have five years of nearly daily static weblog pages that I want to import (hence, I'll need the patch that speeds up URL mapping). Drupal uses the Drupal root directory as the document root, and I'm sure thousands of pages depend on this. That's why if this feature is added, it has to be optional, defaulting to the current behavior, or you'll likely break lots of pages.

I turned off the filter cache because it was the easiest way to make sure the page got recomputed after the last preview, when viewed in its final place. Otherwise, prople would get confused. You could actually leave the cache on, as long as you don't preview the last change to a page that uses this filter, and as long as you don't intend to map any of the pages to multiple different "directories", which I doubt many people do.

Relative links are good. They allow downloaded web sites to work properly when loaded from disk. And they allow restructuring of the site directory hierarchy without changing lots of text.

But I haven't tried the patch or looked at the details of what it does, so my opinion is currently uninformed. If it solves my problem, and will become part of the core, I'll definitely prefer it to an additional module. I'll look at it soon.

jakeg’s picture

Status: Reviewed & tested by the community » Needs work

moshe's patch doesn't work for me at all. All links stay as they are (e.g. href = admin/whatever ) without the global base_url_path at the beginning of them. using cvs

jakeg’s picture

I think the problem with the patch is that $base_url_path is never actually set. So although its global, its = ''. Where was this meant to be set? Perhaps a certain critical file wasn't included in the patch?

Morbus Iff’s picture

I think I'm dumb. How does this stuff work with non-mod-rewrite sites, where ?q= is required on links?

chx’s picture

Assigned: moshe weitzman » chx
Priority: Normal » Critical
Status: Needs work » Reviewed & tested by the community
FileSize
7.38 KB

Morbus, it works because the patch changes links to <a href="/head/?q=node/10"> in my install.

m3avrck’s picture

Patch does work well here!

There is a related issue (but not directly) with paths being used incorrectly in watchdog errors, I have a patch that is ready: http://drupal.org/node/45064 ... not sure if there is any overlap but wanted to put it out there.

Morbus Iff’s picture

Before I begin:

  • The log entries don't bother me whatsoever.
  • I've never seen a person-controlled UA fail on it.
  • Thus, I'm ambivalent about BASE and its existence.
  • "Fix their code!" - I don't trust people to do this soon enough.
  • "Fix our code!" - Easier, and I can disable it if I don't like it.

With that said, the existing arguments are just going around in circles, so I'm gonna add my own comments and thoughts on it, as there are (in my head) two Big Person's who are against it (Steven and killes) and one for it (Dries).

  • Stop talking about standards. Both are right: BASE is part of the standard and so is the use of root-relatives. The argument cancels itself out, and there's no benefit whatsoever in throwing it around.
  • The IE comparison is weak. IE is an active, user- controlled process that has a very large operating base, where visually detrimental issues can cause a big problem. Robots spidering your site, on the other hand, number much smaller, and their visual failures on BASE (cache, archive.org, etc.) only affect a small portion of the robot-site-using populace. That populace is also more intelligent (how many "normal" people do you know use cache: or archive.org?) and is less concerned about *visual* information (which cache and archive rarely stores anyways) but the actual *data* available (which, if they found one URL in the cache or archive, they can find another just as easily, regardless of a broken link).

There are a few additional points I'd like to address:

  • Fragments are easier. When I'm writing a long document, as it is probably obvious I do a lot, I often use URL fragments - the magical thingy that lets me link directly to a paragraph. If the paragraph is on another page, I use "page.html#section". If the fragment is in the existing document, I use "#section". With the BASE tag, this currently fails, and I always have to manually write "node/34#section" - this drives me batty. I tend to think that those who know about fragments expect "#section" to work.
  • Mirroring utilities fail regardless. Consider wget, a utility that mirrors a website - I use it quite often to make archives of my favorite sites (I don't trust people) or to provide CD-browsable versions of my own site. With the BASE, the saved files of wget require an Internet connection before they'll be CSS'd. Likewise, all links (unless manually reentered) link off to the remote source. Without BASE, on the other hand, the site immediately fails visually (with or without a connection), as do all links. (because they're attempting to serve from /drupal or /). I'm not sure which is the best of two evils: if I'm mirroring the site because I suspect it'll disappear, then I don't want BASE, because I don't want any remote loading. If I'm mirroring the site for CD storage, I'd want everything to refer to the CD files only (again, no BASE, and varying support for "/" as the "root of the CD").
  • Rich Bowen, the author of a large number of Apache books, and "I trust his opinion on anything related to Apache" had this to say about root-relative: ... that breaks through a proxy. [Consider] ProxyPass /drupal http://internal.server/. OK, now that page contains a link to /foo. That won't get proxied correctly, because we're only proxying /drupal. You can fix this with [mod_proxy_html] module, but that's not going to be installed most places[, which'll] rewrite links within pages to make them work in that scenario. I explained the way Drupal plans to implement by learning the path from a base_url, such that domain.com/drupal would have a relative root of /drupal/. He replies: Then that would work from the remote, but not from [http://internal.server].
  • We're calling this "root relative", but it's a grey area, being "absolute" to the document root of the web server. There's, generally, a huge debate about how traditional relative links (no starting "http://" or "/") are always better to use over "absolute" links (of any kind, be it http://, or /). I tend to agree that relative links (no /) are always better. Unfortunately, due to generic use of "absolute" and "relative", and little "root relative", it's difficult to find an authoratative source that addresses specifics.

I did test the patch. The patch does work. Let's pro and con. Ones marked not specifically "CON" and "PRO" aren't really awesome statements, but do lean in one direction or the other:

  • CON: Changing behavior we shouldn't have to change.
  • ---: /lots/of/subdirs/ could cause bandwidth bloat.
  • ---: Grey area on whether root-relative is really absolute.
  • ???: Will change the behavior of cache: and archive.org.
  • +++: It makes mirroring utilities slightly better.
  • PRO: Reduces bum log file entries caused by bad bots.
  • PRO: Fragment hrefs will actually work the way they should.

The following cancel each other out and can be ignored, I think:

  • CON: Potentially negative affect to those who use proxy servers.
  • PRO: Someone using a proxy server can probably install the fix.

Finally, an aside. Realize that there is nothing "authorative" about this at all, and it is purely an opinion by Tim Berners-Lee, in an environment that certainly wasn't intended to stand up to decision-making processes. From 2002 in a W3C Python application, he comments (all [sic]): Your regular relative URI algorithm -- never trust anyone else's ;-) This one checks that it uses a root-realtive one where that is all they share. Now uses root-relative where no path is shared. This is a matter of taste but tends to give more resilience IMHO -- and shorter paths.

After going over all of this this morning, and the PRO/CON breakdown, I support a commit. +1.

Morbus Iff’s picture

In shopping this document around to various folks I respect the opinions of, one person suggested that, if Drupal were using BASE correctly in the first place, the fragment issue above would never actually be an issue. He refers to the following entry:

So it seems you can't just use any base URI, but only the original URI of the content. Final proof comes from a discussion on the W3C URI mailing list. Here Roy T. Fielding, the author of RFC 3986, says: “[...] a person is deliberately abusing the base URI by assigning it an unrelated URI for the purpose of creating an artificial shorthand notation for external references.” Good to know!

xml:base vs. RFC 3986 Grudge Match, which links to the above, adds: In effect, the contract that RFC 3986 lays down is that the base URI provided to an application should be considered to be the URI of that context. Translated, this means that the BASE href for http://drupal.org/node/24 should, in fact, be the same URL, not the "artificial shorthand" of http://www.drupal.org.

With that said, a hearty +1 to the patch.

Dries’s picture

Good research, Morbus!

Can we think of better names for $base_url and $base_url_path ? The difference is not 100% clear, IMO, and from the looks, $base_url is not the proper name either.

Morbus Iff’s picture

Dries: I'd like to hold off a few more days on anything commit-like - I've thought of something that may solve some of the CONs in the above analysis. It also hasn't been talked about in the previous discussion. Simply: ../. Instead of discovering what the base_url_path is, we simply count the number of arg()'s we've received, then add that many "../" to the HREF. So, domain.com/node/45 would have links that point to ../../node/34. Advantages? a) All mirroring utilities work magically. b) It's still a "traditional" relative URL (no starting slash). c) Those dots are a lot cheaper, bandwidth-wise, than /lots/of/subdirs.

Also, base_url_path should be named just base_path, IMO.

Thoughts? Still rolling it around in my head.

Wesley Tanaka’s picture

I've seen problems using "..", where some UAs (lynx? I forget. I've seen more than one) will not assign any special meaning to the dots.

So you'd end up with "http://www.example.com/drupal/node/../themes/../node/1234" being requested by some agents.

using "/foo/bar" style URLs (with the leading slash) has been the most robust that I've been able to find in my personal experience.

chx’s picture

FileSize
7.3 KB

Well then, let's drop the ../ idea, there were other problems with it -- it would lead to very hard to understand HTML source.

I renamed base_url_path to base_path. No other change. base_url comes from settings.php I have not changed it.

So, it's now researched, tested and looks like supported...

Morbus Iff’s picture

Works for me then - I haven't used Lynx enough to notice ../ related errors.
I think I have seen those sorts of crazy URLs you've posted in my other travels.

Morbus Iff’s picture

Style comments on the patch:

* Concat dots are all wrong. Sometimes they're .'misc/drupal.css'., sometimes now . '/'.
* I still can't stand the newlining of the HTML in bluemarine/page.tpl.php.
* PHPTemplate engine is still using base_url_path internally.

Dries’s picture

Status: Reviewed & tested by the community » Needs work

I tried to incorporate Morbus' suggestions and committed this change to HEAD!

1. Please double-check if I didn't miss any of the suggestions.
2. Please document/communicate this change to the developers.

Dries’s picture

I think the favicon doesn't have the right path.

Does the xml:base in the RSS feeds need updating?

markus_petrux’s picture

path_to_theme and theme_stylesheet_import may also need review?

markus_petrux’s picture

...and chameleon.theme?

A question: should all paths start with '/' now?

markus_petrux’s picture

...and drupal_get_path ?

Morbus Iff’s picture

path_to_theme: no, this is getting a relative file path, not necessarily a URL.
drupal_get_path: no, this is getting a relative file path, not necessarily a URL.
theme_stylesheet_import: no - this relies on theme_add_style which uses the $base_path.

favicon.ico: yes, broken.
chameleon: yes, broken.

xml:base: Current use may or may not be correct. Few readers actually support it; I'd support its complete removal.

All URL paths should start with $base_path. All file paths should be left alone.

drewish’s picture

The patch in it's current form has borked my theme for any path other than /. If you don't change path_to_theme() to return a leading / you're going to have update every core and contrib theme. This is a big change to make this late in the game.

IMHO this patch should be rolled back for 4.7. Put it in as soon as the feature freeze ends.

kbahey’s picture

Is it just a matter of changing path_to_theme() to return a path that starts with $base_path ?

If so, then let us put it in. This patch is badly needed.

For me, I have already applied an earlier version of the patch to my 4.6.5 install (from the days when it used $base_url_path), and know that I have to modify my themes.

Morbus Iff’s picture

drewish: we already force every single module author to rewrite their modules for 4.7; it is well within a major release's power to force every theme developer to do the same and it's just a matter of documenting it on the update page (which Dries has already asserted needs to be done). Yes, the patch missed chameleon.theme (as I've confirmed), and I'll probably end up working on those remaining fixes next week. Note that already, we have a "potentionally affects every theme" issue in the way that primary and secondary links are handled. Forcing every theme to use $base_path, as the module has done for phptemplate and bluemarine.tpl.php, is a-ok, and is the recommended way to update your theme, not updating path_to_theme() which needs to be generic enough to refer to both URL path and file paths (most internal uses of path_to_theme are talking about file paths, not URL paths, as I too mentioned above).

drewish’s picture

Morbus Iff, sorry for the tone. I just found this at a bad time but that's my fault for trying to use HEAD on a production site. I realize my theme should have been using $base_url anyway.

Anyway, your point about rewritting for the menus it well taken. I guess I just wish this had happend before we declared a feature free and started rollingout betas.

drewish’s picture

that should have read "feature freeze".

markus_petrux’s picture

Morbus, maybe it could be added a couple of new functions: url_to_theme() and drupal_get_url() ?

contrib modules and themes need to be adapted for 4.7, many changes are also breaking contrib stuff already branched for 4.7, a problem if authors are not following the latest critical changes in core...

Morbus Iff’s picture

Status: Needs work » Reviewed & tested by the community
FileSize
3.53 KB

Attached is a patch to fix favicon and chameleon.

Morbus Iff’s picture

Dries, I've added documentation in Converting 4.6 themes to 4.7.

Dries’s picture

Status: Reviewed & tested by the community » Fixed

Committed. Thanks for the documentation updates too.

m3avrck’s picture

After talking with Morbus and chx on IRC, blog and search modules might be incorrectly using check_url() and should be further looked into. Same with fixing theme_image() to use url() instead of check_url(). No time for a patch but posting this for the record.

Morbus Iff’s picture

Status: Fixed » Needs review
FileSize
8.26 KB

Alright, I took a look at every check_url in core. They're fine. So is theme_image.

I did, however, change:

global $base_path;
... $base_path ...

to just:

  base_path();

I just don't like the fact that we're gonna sprinkle around "global $base_path" everywhere, and using base_path() is a lot easier, IMO. Note that people using "global $base_path" already won't have any problems - that'll still work just fine. I've also left phptemplate's exporting of $base_path alone because we do similar things for other functions (like 'secondary_links' => menu_secondary_links()).

m3avrck’s picture

Code looks good (can't test it right now though). I do like using base_path() better than the global variable, great tweak!

Bèr Kessels’s picture

Category: bug » task
Priority: Critical » Normal

code looks good. IT works as advertised.

However, this is certainly not a critical bug.

Morbus Iff’s picture

Status: Needs review » Reviewed & tested by the community
Dries’s picture

Status: Reviewed & tested by the community » Fixed

Committed to HEAD. Thanks.

Ian Ward’s picture

What are the major issues with getting this same/a similar patch to work w/ 4.6.x?

M.J. Taylor’s picture

http://drupal.org/node/4213 says to come here to bitch that Page Anchors don't work out of the box. . .

http://drupal.org/node/13080 was started November 17, 2004 (maybe the impetus for this bug track?) and addresses the same issue. See my comments therein.

Anyway, page anchors not working are pretty critical in my book. Okay, from my standpoint having a user jump to a completely wrong page off an internal link is extremely critical and by the looks of it it's been an open issue for 15 months? Until it's fixed or I can get some sort of working patch, we _can't_ migrate to Drupal. . .

Anyone have a "joe user" understandable, and documented, solution that works?

Sorry, I'm frustrated, it's seeming every time I turn around in Drupal, I'm having to track down some bug fix patch that isn't document and requires even more tweeks just to get to work. And why did this get closed as "fixed," when it's not? Chx, could you at least re-open it so us "stupid users" don't feel completely abandoned?

Regards,

M.J. Taylor
Publisher
from Reason to Freedom, Weekly libertarian magazine.

moshe weitzman’s picture

M.J. - this is fixed in 4.7. Note the Version field on this issue ... It is too big a change for 4.6 and arguably not a bug anyway.

Morbus Iff’s picture

As part of my research, I'll argue that it *is*, in fact, a bug, but certainly not one that should be backported to 4.6 (it is neither critical, nor a security fix). But, yes, as moshe says, it is fixed in Drupal 4.7 (and should be available to you in the next beta/releasecandidate build of Drupal).

M.J. Taylor’s picture

Dear moshe weitzman, Morbus Iff, and the Drupal development community,

Not to be picky, but the version showing at the top of this page for me is "cvs," [CVS (Concurrent Versions System) is a program that lets a code developer save and retrieve different development versions of source code. It also lets a team of developers share control of different versions of files in a common ...] which means there is no specific version this bug has been assigned to. Ergo, we (e.g. the users) could reasonable expect this to be fixed in all branches if it has been closed as "fixed." I'll also argue in concurrence with Morbus Iff that yes, this is a bug. (A very standard, long standing, expected web page behavior from a long established coding method does not produce the expected behavior. So, how can you even start to claim it's not a bug with the software producing the unexpected behavior?) I'll also argue against Morbus Iff, that yes it is critical, as possibly half the FAQs out there use page anchors to jump from a table of contents to section headings lower on the same page. Plus the thousands or millions of uses of page anchors by people like our site and any academia who has to put a source reference footer on their research.

Irregardless of all that, if you (the Drupal development community) aren't going to fix it in anything prior to 4.7, fine. Not much we the users can do to force you to fix something.

As, I now have to shelve two separate site conversions to Drupal over this; and have wasted several hundred dollars worth of custom module development that now can't be release to drupal.org because it's now 'junk' code, I respectfully ask, when does the Drupal community expect 4.7 to be "live?" (Yes I did search drupal.org, and no, there isn't any easily findable expected final release date for 4.7.)

Thank you for your time,
MJ

Morbus Iff’s picture

"Irregardless" isn't a word. Please continue this discussion with me at morbus@disobey.com. This is not the place.

kbahey’s picture

M.J. Taylor,

CVS in this context means the current HEAD, which means the upcoming release (in this case 4.7).

When something is marked fixed, it often means the upcoming version only, not even the current one.

Porting patches to current older versions (e.g. 4.6) is a lot of work. It is sometimes done, but on an individual basis, for example a developer needs it for himself, or someone pays for it do be done for them.

Other than that, do not expect anything to go into currrent/older versions.

As with open source projects, there is no fixed time set for the release. It will be ready when the developers decide that most of the bugs in the beta versions have been fixed.

Some sites are already live on pre-4.7 version.

Morbus Iff’s picture

Some sites are already live on pre-4.7 version. Such as mine.

chx’s picture

While this goes pretty off topic, I think we should mention _why_ we do not backport big changes like this: it would break a lot of things out there. I know that the XML-RPC library change in a minor release also broke a few sites but I still side my decision to do that for the sake of security. I helped the two contrib modules and every custom module writer who contacted me to do the migration (for the contribs, sxip and location, I actually have written the code). I knew that only a small fraction of Drupal sites has custom XML-RPC code. Now, this change is not related to security and would effect every custom theme. Would _you_ stand up and convert all of them in no time?

Crell’s picture

Not to be picky, but the version showing at the top of this page for me is "cvs," [CVS (Concurrent Versions System) is a program that lets a code developer save and retrieve different development versions of source code. It also lets a team of developers share control of different versions of files in a common ...] which means there is no specific version this bug has been assigned to. Ergo, we (e.g. the users) could reasonable expect this to be fixed in all branches if it has been closed as "fixed."

Terminology issue. "Fixed in CVS" in open source circles generally means "fixed in CVS HEAD, and will therefore be included in the next release, whenever that is". It does not mean "fixed in every version that exists". Should the fix also be backported to 4.5? 4.4? That would be silly. 4.6.x is a frozen tree. Think Debian Stable. :-) The version it's assigned to is CVS HEAD, vis, "the current latest cutting edge version that doesn't have a number yet".

I won't dispute that it's a bug, and I'm not going to get into the critical or not question. (It's also a matter of effort. If it's a huge PITA to backport to 4.6, then it's a waste when 4.7 is already in late beta.)

If you desperately need this fix in 4.6, all of the code to do so is here and available for you or anyone else to backport. Just because chx or Moribus or I aren't going to take the time to do so doesn't mean that no one can. There is a thread for backporting some path.module speed improvements to 4.6. If you have so much code depending on this being fixed in 4.6, try doing a backport. You can even release it as a patch for others who are interested, who, I'm sure, would be very greatful.

Meanwhile, perhaps we should consider renaming the "cvs" version tag to "HEAD"?

drewish’s picture

+1 for renaming it to HEAD. where would the correct place to open that issue be? the project module or drupal.org maintenance?

M.J. Taylor’s picture

Kbahey, thank you for the explanation of Drupal's use of CVS. From a user level walking in and having to understand what you are talking about to solve a problem, I'll second/third the suggestion to rename CVS to HEAD.

Chx, thank you for the private email and the reply above about the problems with backporting items with this level of PITAedness and more specifically the potential for breaking a version release as a whole. As I'm basically a super admin in regards to Drupal, and I don't have the bandwidth to convert any other modules that I would invariably break, I'll pass on the backporting. Nor am I willing to spend money for it to be done, as it would be redundant and of little value to anyone. We will release the channel support modifications we've had done to the AdSense module after we fix and test it for 4.7. [Nor did I feel your post was off topic as you legitimately explained a root issue causing my frustration related to this thread. (And having seen this same issue several places, I'll go post follow ups to point here for others with the same frustration.)]

Pre-4.7 sounds like an option, but it's not in our Fantastico default install and at best I'd get my butt chewed for running production sites on beta software. . .

I also realize that having to deal with us users is problematic at best, so thank you for your patients . . .

Best regards,

M.J. Taylor
Publisher
from Reason to Freedom

Completely off topic, Morbus you're a purist ;) (Granted my kind given my profession)
Irregardless:
Non-Standard, Regardless. [It is a form that many people mistakenly believe to be a correct usage in formal style but that in fact has no legitimate antecedents in either standard or nonstandard varieties. (The word was likely coined from a blend of irrespective and regardless.) Perhaps this is why critics have sometimes insisted that there is "no such word" as irregardless, a charge they would not think of leveling at a bona fide nonstandard word such as ain't, which has an ancient genealogy. The American Heritage® Dictionary of the English Language, Third Edition copyright © 1992 by Houghton Mifflin Company]

kbahey’s picture

Off Topic since I don't have J.M. Taylor's email: Please open an issue for Adsense. I am the author of that module, and it has been ported to 4.7. If there are issues, let me know.

I hope this is the last off topic post in this thread. It is already a long and ancient one.

TDobes’s picture

Status: Fixed » Active

Am I missing something or is there no upgrade path for CONTENT?

If internal links (in user-created content) depended upon the base href tag, they are now broken.
Example: Suppose (on a site using clean URL's), we have a page, node/500 with the following content:
<img src="files/test.jpg" /> <a href="node/2">test link</a>
The image will be broken because it now points to node/files/test.jpg... similarly, the link now points to node/node/2.
The situation gets even more complicated when you consider url aliases.

Proposed solution: Perhaps we could create an upgrade script which would iterate over each node and add the base_path before each link/image that does not already begin with "/" or "*://". I know this won't cover every possible problem, but it should fix a number of them. (Should we handle non-core input formats somehow? bbcode? wiki? and I'm thinking we should just skip PHP nodes.)

grohk’s picture

TDobes is right, for me upgrading to beta5 (when it is released) is going to break a ton of links. I discussed this with Morbus a few weeks ago on IRC and hoped to find a good upgrade path, but I have yet to come to a satisfactory conclusion. And input formats are a problem since alot of our content uses markdown in addition to HTML.

Any ideas?

kbahey’s picture

This is not only links in content, it also affects links that are created by WYSIWYG editors, for example TinyMCE, who strips the $base_url from links to the same site.

moshe weitzman’s picture

you don't have to look far - drupal.org is going to suffer with lots of broken links too. maybe we can handle these with a 404 handler?

Bèr Kessels’s picture

http://drupal.org/node/51612 is related.

Here is what I do (and why I started that issue): Just add the BASE tag in your theme. I feel legacy.module is a) too bloated for this b) not solving all the issues because it only affects content, and not all sorts of other places with links.

readding base tag? Yes, in my theme. Choice is good, we now have the choice to add it ourselves! patch at 51612 makes it very very simple to do so.

Wesley Tanaka’s picture

This is not only links in content, it also affects links that are created by WYSIWYG editors, for example TinyMCE, who strips the $base_url from links to the same site.

http://drupal.org/node/45795 should have fixed those problems.

kbahey’s picture

wtanaka, yes, you are right this seem to fix it, but it only affect new content. I already have a lot of content in my site that was entered using older TinyMCE versions.

Ber: I already have in my themes because this is why I raised this issue to begin with. Crawlers were causing lots fo 404s. I will probably continue to use this going forward.

Here is the PHP code needed in page.tpl.php:

 <base href="<?php print $base_url . $base_url_path ?>" />
matt westgate’s picture

Status: Active » Needs review
FileSize
1.6 KB

Here's an attempt to programatically update node content links. This could be expanded for blocks and comments as well.

Morbus Iff’s picture

This update won't work - it's using "/" when it should be using base_path().

matt westgate’s picture

FileSize
1.7 KB

Thanks Morbus. Here's a new patch with base_path() added like it should be and some other cleanups after running it through about 1,300 nodes.

TDobes’s picture

Matt: Thanks for working on an upgrade script... it seems to have some problems, though:
* serious issues with quotes -- for example, if I feed it <a href="my_url">link</a> it modifies it to <a href=/"my_url">link</a>
* ignores links containing slashes "files/test" should become "/files/test", but instead it is untouched
* I can't seem to get <img> tags to update properly... it doesn't work at all if the src value is unquoted; if it's quoted, I run into the first problem identified above
* sometimes trailing slashes get added which don't belong at all: <img src="files/test.gif" /> becomes <img src=/"files/test.gif/" />
* all update functions need to be followed by a "return array();" -- otherwise, we'll end up with an error under PHP5. (this problem also afflicts system_update_175 in core)

Regular expressions hurt my head, so I'm guessing you'll have better luck fixing these problems than me. If not, I'll try to puzzle my way through it in a day or two.

matt westgate’s picture

FileSize
1.78 KB

There was a typo in the script preventing it from working. Sorry, I took this code from a larger script I wrote while upgrading a clients site and their custom module content. Also fixed the array return issue.

TDobes’s picture

Still problems:
* <a href="files/test">stuff</a> becomes <a href=///"files/test///">stuff</a>
* <a href="http://test.com/">test</a> becomes <a href=///"http://test.com////">test</a> (we shouldn't be touching this case at all)
* <a href="/whatever/stuff">test</a> becomes <a href=///"/whatever/stuff///">test</a> (we shouldn't be touching this case either)

my little mini-test-suite looks like this:

print _url_fix('
<a href=files/test>stuff</a>
<a href="files/test">stuff</a>
<a href="http://test.com/">test</a>
<a href=http://test.com/>test</a>
<a href="/whatever/stuff">test</a>
<a href=/whatever/stuff>test</a>
<img src="files/test.gif" />
<img src=files/test.gif />
<img src="/files/test.gif" />
<img src=/files/test.gif />
<img src="http://test.com/files/test.gif" />
<img src=http://test.com/files/test.gif />
');
chx’s picture

Somehow you need to make this script like update 172, because if you have, say 50000 nodes, it'll run for quite long.

Dries’s picture

Maybe I should postpone updating drupal.org until this script/update is ready?
I'm tempted to mark this critical; breaking one's data isn't funny. ;)

matt westgate’s picture

FileSize
1.49 KB

Cleaned up the patch and tested with TDobes test case above. This script does not try to convert URLs not wrapped in double or single quotes. If we want to do that, we'll need to pull in someone with more regexp skills as the ones in this update maxxed out what I can do.

TDobes’s picture

I'm far from a regexp wizard, but I'll stare at the PHP docs a bit tomorrow and see if I can come up with anything. If anyone else would like to give it a go, feel free -- this is a major problem for sites attempting the upgrade to 4.7.

Also: Bug 51382 was a duplicate.

kbahey’s picture

Fixing the content seems like the best solution.

However, as a workaround, wouldn't the trick mentioned here this comment be enough?

matt westgate’s picture

Adding the <base> tag back works for most browsers, but IE hates it and won't load any resource which uses base_path(), such as misc/drupa.css and styles.css.

kbahey’s picture

Matt,

I am a bit confused.

I am using today's HEAD, and find that it has this in it:

<style type="text/css" media="all">@import "/misc/drupal.css";</style>

and

<style type="text/css" media="all">@import "/themes/bluemarine/style.css";</style>

Notice how the import has a slash , which is the base_path() in this case?

So adding :
$blah = '<base href="' . base_path() . '">';
Should take care of the case for content.

I use something very similar to that on http://baheyeldin.com (an older version, but uses a , and MS IE has no problem that I can see.

Here is another snag: converting content to have base_path() in it makes the site not portable.

For example, a site may start as a subdirectory (e.g. example.com/cooking).

Or, if the site is in testing first and you are adding content to it with links before going live ...

Later, you decide to move the site to its own domain: all your links are broken now, and you need to do something (perhaps adapt that update statement to a one time script/module).

markus_petrux’s picture

One thing I remember doesn't work in IE6 with base href. It is the ability to select text. Maybe it depends on other factors (I was unable to find reports of his "bug"), but that was fixed when base href was removed.

I agree that it would be nice to have an option to "fix" content that was written with base href in mind, so I believe the way to go is make this upgrade path better, as reliable as possible, maybe even optional. But, please, do not add base href again. If you do, make it optional.

Kobus’s picture

Adding the quickfix suggested here: http://drupal.org/node/13148#comment-77844 works with Opera, but not IE.

And it is not only content, it is all URLs on my site that breaks in IE (including admin links, etc.) and all links in my theme that gets broken. Without the fix, it is only the images and hardcoded tags in my theme.

This has changed recently... I think in Beta 5? What caused this? Can it be reverted?

Kobus

Dries’s picture

mysql> select count(vid) from node_revisions;
+------------+
| count(vid) |
+------------+
|      54085 |
+------------+

On drupal.org we have 54k revisions. Will this update work for large databases? I'm afraid not. I think we'll need something like system_update_172 where we use #finished to communicate whether the update was complete. If not, the update function will be called again to complete the remaining work.

Dries’s picture

Priority: Normal » Critical
Status: Needs review » Needs work

I'm marking this critical. Breaking people's data is not an option.

killes@www.drop.org’s picture

IMNSHO the only way to not break people's data would be to revert the patch. Everything else will fail in one way or another for somebody.

moshe weitzman’s picture

@killes - there is a differene between breaking many nodes on many sites and breaking a few nodes on a few sites. lets be pragmatic about this. the patches posted here already are not perfect but they are good enough already so that we should not discuss rolling back the patch.

Dries’s picture

I don't see why we shouldn't be able to get these regex' right.

The most pressing tasks are:

  1. Making sure it works for large sites.
  2. Making sure that comments don't have broken links either.
matt westgate’s picture

FileSize
3.59 KB

This new patch updates both nodes and comments and accounts for system timeouts via the spirit of update 172.

matt westgate’s picture

FileSize
3.59 KB

Fixing a typo in the function description. I just couldn't let it go.

Dries’s picture

I tried running it on scratch.drupal.org but it didn't run. The progress bar jumps to 100% in less than a second. The update.php script reported "The update process did not complete. All errors have been logged. You may need to check the watchdog table manually." but no errors are shown on ?q=admin. Drats.

After fiddling with the code for 30 minutes, I found out that the error messages is 'unspecified error' when Javascript is enabled. I disabled Javascript and I have the same problem; no error shown or reported.

After 10 more minutes I found this line in update.php: ini_set('display_errors', FALSE). Disabling that code shows a PHP error but I couldn't read it because of the page reloading too fast (still no JS). The error isn't in the watchdog either. After 5 more attempts, I manage to read enough of the error message to figure out that _update_177_url_fix() is being redeclared. I "un-nested" that function and this seems to have fixed the problem.

The database script has been busy upgrading the database for the past 10 minutes and it is still crunching ...

Dries’s picture

The upgrade finished. Not sure it completed 100% (no clear confirmation) but I think it worked.

Looks like the update script has rewritten:
<a href="contact" rel="nofollow">test contact</a>
to:
<a href="/contact" rel="nofollow">test /contact</a>

That is, while it fixed the link, it added a slash to the title ....

Dries’s picture

I used 'linkchecker' to scan for broken links before and after running the upgrade script. I used the following command line:
linkchecker -r 2 -s -P 5 http://scratch.drupal.org

Recursively scans all internal links up to 2 levels deep. It doesn't follow external links. Each run, linkcheker found and checked 6527 links.

Before the upgrade, there were 80 links that caused a 404 (not found) and 11 links that caused a 403 (forbidden).
I used 'linkchecker' to scan for broken links before and after running the upgrade script. I used the following command line:
linkchecker -r 2 -s -P 5 http://scratch.drupal.org

Recursively scans all internal links up to 2 levels deep. It doesn't follow external links. Each run, linkcheker found and checked 6527 links.

Before the upgrade, there were 80 links that caused a 404 (not found) and 11 links that caused a 403 (forbidden).

After the upgrade, there were 69 links that caused a 404 (not found) and 11 links that caused a 403 (forbidden).

A number of links got fixed, but at the same time new bugs where introduced. On the http://scratch.drupal.org/contribute/ page, we have (a number of broken links (just look at the source):

 <li><a href="//node/316">Contributor's guide</a></li>
 <li><a href="///project/issues?projects=3060&states=8,13,14">Patch queue</a></li>
 <li><a href="///project/issues?projects=3060&states=1&categories=bug">List of open bugs</a></li>
 <li><a href="///project/issues?projects=3060&states=1&categories=task">List of tasks</a></li>
 <li><a href="http:///cvs.drupal.org/">CVS repositories</a></li>
 <li><a href="///mailing-lists">Developer mailing lists</a></li>

Consider adding the links on the contribute-page to your test script/setup.

Quite a few 404s are due to a theme bug (not related to this issue):

URL        themes/bluebeach/print.css
Parent URL http://scratch.drupal.org/taxonomy/term/45, line 11, col 344
Real URL   http://scratch.drupal.org/taxonomy/term/themes/bluebeach/print.css
Result     Error: 404 Not Found
matt westgate’s picture

FileSize
3.84 KB

I added the contrib links to the test case and made the update more selective when base_path() should be appended. Previously I was using str_replace() which isn't accurate enough since it doesn't understand the context of "being inside an HTML tag or not". I'm now using preg_replace to find these insertion points.

Dries’s picture

The nested/inner fix url function also gave problems; it needs to be taken out of the other function. See my bug report above.

matt westgate’s picture

FileSize
3.78 KB

Okey doke. Inner function kicked to the outside like a kid ready for college.

Dries’s picture

I re-synced scratch.drupal.org (takes quite a bit of time), applied the patch and ran the updater only to find out it didn't work.

I had to change the bottom of the update function to:

    if ($_SESSION['system_update_178_comment'] == $_SESSION['system_update_178_comment_max'] &&
        $_SESSION['system_update_178_node'] == $_SESSION['system_update_178_node_max']) {
      unset($_SESSION['system_update_178_comment']);
      unset($_SESSION['system_update_178_comment_max']);
      unset($_SESSION['system_update_178_node']);
      unset($_SESSION['system_update_178_node_max']);
      return array();
    }
    else {
      return array('#finished' => FALSE);
    }

to make it work. I suspect something is still out of whack with the way how you use '#finished'. Your implementation isn't particularly useful because of the && (and) it will still return 0 or 1, and not the actual progress. Maybe you need to return min(computation 1, computation 2).
Anyway, that's just polish. Let's get this working first.

Sure you added the contribute links to your test case?

After the upgrade, there are still some broken links:

<li><a href="project/issues?projects=3060&amp;states=8,13,14">Patch queue</a></li>
<li><a href="project/issues?projects=3060&amp;states=1&amp;categories=bug">List of open bugs</a></li>
<li><a href="project/issues?projects=3060&amp;states=1&amp;categories=task">List of tasks</a></li>

Links without amps or other special characters seems to work fine.

Ian Ward’s picture

This relates to this thread, but more along the lines of documentation, unless what i'm seeing is a bug. My question is, now that base_path is worked out, what does this mean for themes? I'm testing beta5 using the bluemarine template which has no base set in the head on page.tpl, and images with relative path names like files/image.png show up on teaser views, but not on their full node views. On full node views, the source code from the browser tells me it is trying to read from files/image.png, but Firefox WebDev tool tells me there is a broken image and it is trying to read from http://www.example.com/node/files/image.png. If i url alias the page like images/testing/nodename, it reads from http://www.example.com/node/images/testing/image.png.

matt westgate’s picture

FileSize
3.3 KB

Project URLs weren't working because the '?' in the paths were being interpretated as regexp meta-characters. Nothing a little preg_quote magic won't fix though. Thanks to Steven and Earl for the help.

Dries’s picture

Category: task » bug

Synced scratch.drupal.org. Seems to work. Thanks Matt!

Can be committed according to me, but maybe we want another test report?

Dries’s picture

Status: Needs work » Needs review
Dries’s picture

Status: Needs review » Fixed

Committed to HEAD. Thanks.

Steven’s picture

Status: Fixed » Needs work

The update definitely needs more work:

  • It blindly assumes all content is in HTML format and that all instances of 'href' are regular links. There are plenty of cases where this isn't so. The site could be using a non-HTML format or be using <code> tags to post literal code which outputs anchor tags. Or the node might be in PHP code, printing links somewhere. IMO we should check for the presence of the HTML filter in that node/comments' format and only convert it if the filter is found.
  • There is no ORDER BY clause on the multi-part updates, so it assumes the nodes/comments are all returned in order. I made the same mistake in the comment thread update. It usually works without, but sometimes it doesn't, and the update will be applied multiples times and out of order.
  • The 20 revisions limit seems rather unnecessary and weird. It is essentially the same problem that the node search indexer had for using timestamps. The proper solution is to do an additional sort, for example ORDER BY nid ASC, vid ASC and keep track of both the last nid and vid. Then you change your condition to nid > %d OR (nid == %d and vid > %d).
markus_petrux’s picture

A couple of suggestions:

1) Change _update_178_url_fix() logic like this:

function _update_178_url_fix(&$text) {
  ...
  IF $text has been changed THEN
    return TRUE;
  ELSE
    return FALSE;
}

Comments:
1.a) Pass the content by reference.
1.b) Perform the UPDATE only if it returns TRUE.

It could be something like:

  if (_update_178_url_fix($comment->comment)) {
    db_query("UPDATE {comments} SET comment = '%s' WHERE cid = %d", $comment->comment, $comment->cid);
  }

2) there might be no URLs, so if the preg_match_all returns FALSE, you get a PHP notice in the foreach loop, so...

  foreach ($urlpatterns as $type => $pattern) {
    preg_match_all($pattern, $text, $matches);
    foreach($matches[1] as $url) {

could be like this:

  foreach ($urlpatterns as $type => $pattern) {
    if (preg_match_all($pattern, $text, $matches)) {
      foreach($matches[1] as $url) {
Steven’s picture

Actually I just realized, why not just do the node_revisions update by vid? There is no need to involve the nid.

chx’s picture

Status: Needs work » Closed (fixed)

let's open another issue for this. the mere size of this thread makes it impossible to work with.

We are starting from http://drupal.org/node/13148#comment-80194 .

dkruglyak’s picture

Title: Problems with using relative path names » Consolidate Patches: Problems with using relative path names
Version: » 4.6.5
Status: Closed (fixed) » Needs work

It appears that there are two critical patches making significant changes to path / alias processing. They should be consolidated for 4.6.5 / 4.6.6, since they touch the same files and I am not sure how to do that.

I created a new issue here: http://drupal.org/node/56972

TDobes’s picture

Title: Consolidate Patches: Problems with using relative path names » Problems with using relative path names
Version: 4.6.5 » x.y.z
Status: Needs work » Closed (fixed)

dkruglyak: Please don't re-open issues which are only marginally related to your request. Creating a new forum topic is a much better plan for a request like yours.

WeRockYourWeb.com’s picture

Just wondering if this patch has been committed or not? I am running the latest Drupal CVS and don't see the "base href."

Thanks,
Alex

xand’s picture

Is there a real solution for this? Is this in 4.7.1?