By default drupal returns a list of recent nodes when any URL such as node/asdf is requested. This URL is not valid and does not point to any content, and should return a 404 error. This has been a bit problem on our site as we switched to drupal, as search engines have been crawling the site and requesting odd url's such as node/node/123/node/event/node/123. I've been getting thousands of these requests which fill up the cache table (causing severe performance problems) as well as pollute the site's search engine optimization with useless URL's.

I made a simple modification to node.module that basically calls drupal_not_found() instead of node_page_default() for such requests. You can still access /node to get a list of recent nodes, but any odd urls such as node/asdf will return a 404, as they should. It's a simple modification and there may be better ways to do it, but it works for me.

CommentFileSizeAuthor
#16 patch_77_refreshed1.58 KBmathieu
#12 patch_77_for_4_7.txt2.28 KBRayZ
#10 patch_772.01 KBmoshe weitzman
node.patch_1.txt406 bytesAnonymous (not verified)

Comments

mathieu’s picture

+1 I had troubles with this on a few clients' sites. I tested the patch on cvs : works fine.

gerhard killesreiter’s picture

here on drupal.org I already get a 404 if I try to accesss node/foobar.

Tobias Maier’s picture

I'm using the current drupal 4.7 dev version and can verify the problem on my sites.
try: http://nord.jf-muenchen.de/node/dasf

Anonymous’s picture

As a follow up, it seems part of my problem with lots of weird node/* requests was due to drupal 4.6 using a base href for relative URL's, and 4.7 has fixed this.

BUT.... we still get requests for node/ stuff without the base href, though not nearly as many, and I think regardless it's a bad idea to basically allow wildcard url's to return a page of content rather than a 404.

gerhard killesreiter’s picture

http://drupal.org/node/dasf

Are you guys running php5?

Tobias Maier’s picture

php 4.3.3

Anonymous’s picture

PHP version shouldn't have anything to do with it.

Here's my personal site: http://www.gnarfle.com/node/asdf

stock 4.7.2 install. Look at the code in node.module, function node_page(). It checks to to see if the first arg is numeric. If not, it assumes it's an op. If it doesn't match any known op (view/edit/add), it calls node_page_default() which lists recent content.

I don't know why drupal.org returns a 404, unless they've modified it already, even the current cvs has the same code which will exhibit this behavior.

RayZ’s picture

I can confirm that I have the same issue. Running the latest 4.7 cvs of drupal. Haven't tried the patch.

moshe weitzman’s picture

Version: 4.7.2 » x.y.z
Status: Needs review » Needs work

i think drupal.org must have hacked core to get that 404.

as for the supplied patch, it can be much better. use the menu system to your advantage. that way, this can work all over drupal, instead of just node page.

introduce a new element in hook_menu called 'allow arguments'. when not specified, assume TRUE so match current behavior. if false, the menu system will automatically return NOT FOUND if arguments would be auto passed to the callback function. hope that makes sense to everyone.

then we set this element to FALSE for node callback, user callback, and more.

moshe weitzman’s picture

Assigned: Unassigned » moshe weitzman
Status: Needs work » Needs review
StatusFileSize
new2.01 KB

here is a small patch for menu system and node_menu() that implements what i just suggested. there are at least 2 big benefits to throwing a 404:
- don't calculate an expensive /node page for a obviously wrong paths
- don't cache the page for these bad paths. the current behavior leads to huge cache table which makes deletions expensive and selecting expensive and so on. see http://drupal.org/node/73604 for the horror stories.

Anonymous’s picture

Ah yes, that is better, that would solve me having to implement the same sort of logic in my custom modules too :)

Unfortunately, drupal still caches 404 pages, so until that cache table split patch is incorperated, all this really gains you is a smaller cache item as the 404 page is probably smaller than the list of recent nodes. Luckily I've found out about the base href glitch in 4.6 causing all of those odd URL requests by search engines, so that's taken the big hit off our server as I've backported the changes until I can upgrade to 4.7.

RayZ’s picture

Version: x.y.z » 4.7.3
Status: Needs review » Needs work
StatusFileSize
new2.28 KB

Moshe's patch in #10 does not apply cleanly to a 4.7 install, so I applied parts of it manually (re-rolled patch for 4.7 attached). Unfortunately, it does *not* appear to solve the problem for me. Did I make a mistake in re-rolling the patch?

Setting this to 4.7, since I believe the bug is there too.

moshe weitzman’s picture

Version: 4.7.3 » x.y.z
Status: Needs work » Needs review

patches are always generated against HEAD, if they want to be reviewed for inclusion in core. you must clear your menu cache after applying this patch.

i quickly read your patch and didn't see a glaring problem. i suspect you have old menu cache.

RayZ’s picture

I suspected caching as well and had cleared my cache using the devel.module's clear cache feature. Does that not clear the menu cache?

moshe weitzman’s picture

that should be sufficient. lets get confirmation from someone else that this patch does or does not work.

mathieu’s picture

StatusFileSize
new1.58 KB

Moshe's patch (#10, above) still applies (offset 150 lines on node.module) and works. #12 doesn't apply on head.

This patch works, though only if clean_urls are enabled... My guess is that there's a way to make it work withouth clean_urls, but then, I'm no expert on this. If this is not possible, IMHO, this is RTBC.

I attach a refreshed patch, so it applies cleanly again.

RayZ’s picture

While the symptoms sounded the same, it appears the source of my problem was unrelated. Turns out it was caused by having the Views module set up with some page views with no URL's assigned.

So don't hold off the RTBC for me.

moshe weitzman’s picture

Status: Needs review » Reviewed & tested by the community

i refreshed the patch again and confirm that it works regardless of clean_url setting. this is just a refresh of the one described in #10

drumm’s picture

Status: Reviewed & tested by the community » Needs review

I'm not sure of this being something useful. What would the documentation to go on http://api.drupal.org/api/HEAD/function/hook_menu look like? Where else in Drupal core would this be useful?

drumm’s picture

Title: node module returns content listing for invalid URL's » node module returns content listing for invalid URLs
moshe weitzman’s picture

Status: Needs review » Reviewed & tested by the community

well, drupal.org has hacked its core to achieve this. they found it useful. this is about minimizing resources. it is not good that one can DOS a drupal site by requesting random node/ urls. this patch helps a little with that problem. More could be done. See http://drupal.org/node/73604 for another good idea that works well with this one.

The docs would be like
'allow arguments' - default TRUE. if set to FALSE, this callback will forcibly return a 404 Not Found page if any arguments are sent on the request. For an example, see node_menu(). It accepts node/ as a valid request only if is numeric.

Other uses taxonomy/term//. There are other ones but I think node/ is our biggest win. Those of you who have crazy crawlers on your site can probably find more candidates for this feature.

I hope I have answered the questions, thus I set back to RTBC.

dries’s picture

While it works, I agree with Neil that it is a little bit weird.

I'm thinking this should be the default behavior and that we should not introduce a special "allow arguments" attribute for it. Ideally, people should always define their callback's arguments. That would help enforce that people write portable page that can be moved around using the menu module. Right now, too many page break when you change their URL path through the menu system. Maybe we could add type checking too.

I think I'm favor of a if()-test in node_default_page().

drumm’s picture

Category: bug » feature
Status: Reviewed & tested by the community » Needs review

Looks like this needs more review. Since it is adding a new menu argument, it is a feature request.

I'm neutral on this patch. I'd like to see a review from the menu system maintainer, Richard Archer, or someone else familiar with it.

moshe weitzman’s picture

Assigned: moshe weitzman » Unassigned

others are welcome to persist with this patch. no longer a priority for me.

matt_paz’s picture

Version: x.y.z » 5.1

This condition exists in 5.X too ...? This is bad for SEO as well, correct?

drumm’s picture

Version: 5.1 » 6.x-dev
Status: Needs review » Closed (fixed)

This appears to be fixed.

xsean’s picture

any solution for version 6 as well?