By default drupal returns a list of recent nodes when any URL such as node/asdf is requested. This URL is not valid and does not point to any content, and should return a 404 error. This has been a bit problem on our site as we switched to drupal, as search engines have been crawling the site and requesting odd url's such as node/node/123/node/event/node/123. I've been getting thousands of these requests which fill up the cache table (causing severe performance problems) as well as pollute the site's search engine optimization with useless URL's.
I made a simple modification to node.module that basically calls drupal_not_found() instead of node_page_default() for such requests. You can still access /node to get a list of recent nodes, but any odd urls such as node/asdf will return a 404, as they should. It's a simple modification and there may be better ways to do it, but it works for me.
| Comment | File | Size | Author |
|---|---|---|---|
| #16 | patch_77_refreshed | 1.58 KB | mathieu |
| #12 | patch_77_for_4_7.txt | 2.28 KB | RayZ |
| #10 | patch_77 | 2.01 KB | moshe weitzman |
| node.patch_1.txt | 406 bytes | Anonymous (not verified) |
Comments
Comment #1
mathieu commented+1 I had troubles with this on a few clients' sites. I tested the patch on cvs : works fine.
Comment #2
gerhard killesreiter commentedhere on drupal.org I already get a 404 if I try to accesss node/foobar.
Comment #3
Tobias Maier commentedI'm using the current drupal 4.7 dev version and can verify the problem on my sites.
try: http://nord.jf-muenchen.de/node/dasf
Comment #4
Anonymous (not verified) commentedAs a follow up, it seems part of my problem with lots of weird node/* requests was due to drupal 4.6 using a base href for relative URL's, and 4.7 has fixed this.
BUT.... we still get requests for node/ stuff without the base href, though not nearly as many, and I think regardless it's a bad idea to basically allow wildcard url's to return a page of content rather than a 404.
Comment #5
gerhard killesreiter commentedhttp://drupal.org/node/dasf
Are you guys running php5?
Comment #6
Tobias Maier commentedphp 4.3.3
Comment #7
Anonymous (not verified) commentedPHP version shouldn't have anything to do with it.
Here's my personal site: http://www.gnarfle.com/node/asdf
stock 4.7.2 install. Look at the code in node.module, function node_page(). It checks to to see if the first arg is numeric. If not, it assumes it's an op. If it doesn't match any known op (view/edit/add), it calls node_page_default() which lists recent content.
I don't know why drupal.org returns a 404, unless they've modified it already, even the current cvs has the same code which will exhibit this behavior.
Comment #8
RayZ commentedI can confirm that I have the same issue. Running the latest 4.7 cvs of drupal. Haven't tried the patch.
Comment #9
moshe weitzman commentedi think drupal.org must have hacked core to get that 404.
as for the supplied patch, it can be much better. use the menu system to your advantage. that way, this can work all over drupal, instead of just node page.
introduce a new element in hook_menu called 'allow arguments'. when not specified, assume TRUE so match current behavior. if false, the menu system will automatically return NOT FOUND if arguments would be auto passed to the callback function. hope that makes sense to everyone.
then we set this element to FALSE for node callback, user callback, and more.
Comment #10
moshe weitzman commentedhere is a small patch for menu system and node_menu() that implements what i just suggested. there are at least 2 big benefits to throwing a 404:
- don't calculate an expensive /node page for a obviously wrong paths
- don't cache the page for these bad paths. the current behavior leads to huge cache table which makes deletions expensive and selecting expensive and so on. see http://drupal.org/node/73604 for the horror stories.
Comment #11
Anonymous (not verified) commentedAh yes, that is better, that would solve me having to implement the same sort of logic in my custom modules too :)
Unfortunately, drupal still caches 404 pages, so until that cache table split patch is incorperated, all this really gains you is a smaller cache item as the 404 page is probably smaller than the list of recent nodes. Luckily I've found out about the base href glitch in 4.6 causing all of those odd URL requests by search engines, so that's taken the big hit off our server as I've backported the changes until I can upgrade to 4.7.
Comment #12
RayZ commentedMoshe's patch in #10 does not apply cleanly to a 4.7 install, so I applied parts of it manually (re-rolled patch for 4.7 attached). Unfortunately, it does *not* appear to solve the problem for me. Did I make a mistake in re-rolling the patch?
Setting this to 4.7, since I believe the bug is there too.
Comment #13
moshe weitzman commentedpatches are always generated against HEAD, if they want to be reviewed for inclusion in core. you must clear your menu cache after applying this patch.
i quickly read your patch and didn't see a glaring problem. i suspect you have old menu cache.
Comment #14
RayZ commentedI suspected caching as well and had cleared my cache using the devel.module's clear cache feature. Does that not clear the menu cache?
Comment #15
moshe weitzman commentedthat should be sufficient. lets get confirmation from someone else that this patch does or does not work.
Comment #16
mathieu commentedMoshe's patch (#10, above) still applies (offset 150 lines on node.module) and works. #12 doesn't apply on head.
This patch works, though only if clean_urls are enabled... My guess is that there's a way to make it work withouth clean_urls, but then, I'm no expert on this. If this is not possible, IMHO, this is RTBC.
I attach a refreshed patch, so it applies cleanly again.
Comment #17
RayZ commentedWhile the symptoms sounded the same, it appears the source of my problem was unrelated. Turns out it was caused by having the Views module set up with some page views with no URL's assigned.
So don't hold off the RTBC for me.
Comment #18
moshe weitzman commentedi refreshed the patch again and confirm that it works regardless of clean_url setting. this is just a refresh of the one described in #10
Comment #19
drummI'm not sure of this being something useful. What would the documentation to go on http://api.drupal.org/api/HEAD/function/hook_menu look like? Where else in Drupal core would this be useful?
Comment #20
drummComment #21
moshe weitzman commentedwell, drupal.org has hacked its core to achieve this. they found it useful. this is about minimizing resources. it is not good that one can DOS a drupal site by requesting random node/ urls. this patch helps a little with that problem. More could be done. See http://drupal.org/node/73604 for another good idea that works well with this one.
The docs would be like
'allow arguments' - default TRUE. if set to FALSE, this callback will forcibly return a 404 Not Found page if any arguments are sent on the request. For an example, see node_menu(). It accepts node/ as a valid request only if is numeric.
Other uses taxonomy/term//. There are other ones but I think node/ is our biggest win. Those of you who have crazy crawlers on your site can probably find more candidates for this feature.
I hope I have answered the questions, thus I set back to RTBC.
Comment #22
dries commentedWhile it works, I agree with Neil that it is a little bit weird.
I'm thinking this should be the default behavior and that we should not introduce a special "allow arguments" attribute for it. Ideally, people should always define their callback's arguments. That would help enforce that people write portable page that can be moved around using the menu module. Right now, too many page break when you change their URL path through the menu system. Maybe we could add type checking too.
I think I'm favor of a if()-test in node_default_page().
Comment #23
drummLooks like this needs more review. Since it is adding a new menu argument, it is a feature request.
I'm neutral on this patch. I'd like to see a review from the menu system maintainer, Richard Archer, or someone else familiar with it.
Comment #24
moshe weitzman commentedothers are welcome to persist with this patch. no longer a priority for me.
Comment #25
matt_paz commentedThis condition exists in 5.X too ...? This is bad for SEO as well, correct?
Comment #26
drummThis appears to be fixed.
Comment #27
xsean commentedany solution for version 6 as well?