In short: robots.txt contains

Disallow: /sites/

which, in turn, prevents Google (among other search engines) from indexing, for example, the images directory under the files directory (e.g. sites/default/files/images). This has the unwanted effect of excluding images on Drupal sites from search results.

One of the problems with that patch is that it doesn't account for multisite environments, or any environment where the site directory is not "default".

See #364781: seo implications / google images for a related issue.

Comments

steff2009’s picture

Hi, thank you for your suggestion.

I had also noticed that none of my images were indexed in Google, so I added the following line to my robots.txt:

Allow: /sites/default/files/images

The robots.txt checker in Google Webmaster Tools now reports the directory as allowed.

I have only one concern: what is Drupal's reason for disallowing the /sites directory? Maybe security? For now I have left "Disallow: /sites/" in my robots.txt, so that only the images directory is crawled.
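
For reference, the combination now in my robots.txt looks like this (a minimal sketch, assuming a single default site; note that "Allow", and the longest-match precedence that lets it override the broader Disallow, are extensions honored by Google rather than part of the original robots.txt standard):

User-agent: *
Disallow: /sites/
Allow: /sites/default/files/images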

What's your opinion?

Thanks!

z.stolar’s picture

Well, I've seen people mention "Allow" rules, but I was under the impression that "Allow" is not part of the original robots.txt standard.
I'll add it and wait for results.

steff2009’s picture

Hi. Actually, Google Webmaster Tools mentions more than once that a recommended robots.txt file should "allow all". However, given that Drupal stores lots of information that is not relevant to users' searches, it is advisable to disallow some directories and files.

You can test the behavior of your robots.txt in Google as follows:

1. Go to Google > Webmaster tools > Site configuration > Crawler access > Test robots.txt.
2. Copy and paste the text of your robots.txt file into the related field (what you already see is the robots.txt currently on your website; if you have recently updated it, this might not be the latest version yet).
3. Specify which directory (for example, the one containing images) you would like to be indexed.
4. Click Test.

The test results will show whether that specific directory is allowed, and Google should then crawl it soon (by the way, how often does Google crawl your site?).

Cheers.

z.stolar’s picture

I tried it - Google does honor the "Allow" rule.
Adding Allow: /sites/default/files/images achieves the desired result (for Google, at least!).

I'll submit a new patch.

z.stolar’s picture

Status: Active » Needs review
FileSize
447 bytes

Attached.

Dries’s picture

I don't think this is sufficient because on my site, images go under sites/buytaert.net/files/images. I recommend we remove the sites rule.

z.stolar’s picture

FileSize
443 bytes

Here's a new patch.

Status: Needs review » Needs work

The last submitted patch failed testing.

z.stolar’s picture

Status: Needs work » Needs review
FileSize
444 bytes

I don't understand why it failed. In any case, I checked out the latest CVS version again and recreated the patch.

z.stolar’s picture

@Dries: just a reminder... :-)

yhager’s picture

Status: Needs review » Reviewed & tested by the community

+1 for removing the /sites directory. Not sure what it was doing there in the first place.

Ozeuss’s picture

+1 for this patch.

c960657’s picture

Not sure if this requires a separate bug report, but I think we should also remove /modules, /themes, and /misc from robots.txt. There is no reason why a robot should not fetch files there, in particular stylesheets and images.

Note that robots.txt is used not only by Googlebot et al. but also by various other web spiders, e.g. software that crawls a website to make an offline copy.

For instance, on the Wayback Machine, note how the archived version of CNN.com contains stylesheets etc., while the Drupal.org and Ubuntu.com versions do not:
http://web.archive.org/web/20071127060255rn_1/www.cnn.com/
http://web.archive.org/web/20071125103616/http://drupal.org/
http://web.archive.org/web/20080212114445/http://www.ubuntu.com/

On Ubuntu.com, note the four images below “Ubuntu Editions” on the right. Only the last one is displayed in the archive, because it is saved in /files, while the others reside in /themes/.

WRT performance, serving static files is very cheap compared to serving pages generated by Drupal. I assume that search engine spiders like Googlebot are clever enough not to fetch your JS and CSS files as frequently as they fetch the front page and other volatile parts of your site.

samj’s picture

It would be interesting to hear the justification for inclusion in the first place, as excluding path segments via robots.txt is a fairly drastic action, especially as a default. The unintended consequences, e.g. for archiving, are hard to grasp until they manifest themselves as problems for users. Sure, you could exclude e.g. admin interfaces, but these are authenticated anyway; on the other hand, excluding them can clean up search results.

In summary, I think there should be fairly solid justification for adding anything to robots.txt.

FlemmingLeer’s picture

Issue tags: +google, +images, +robots.txt

I recommend that /sites/ be kept in robots.txt,

and that you add the following:

User-agent: Googlebot-Image
Disallow:
Allow: /*

This will let all images be crawled and indexed no matter where they are.
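
(Note: if I understand Google's matching correctly, Googlebot-Image obeys only the most specific user-agent group that matches it, so this block replaces the generic User-agent: * rules for the image crawler rather than supplementing them; the empty Disallow plus Allow: /* then opens everything to it.)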

(Some of us still use HTMLarea under the /HTMLarea/ folder... :/)

webchick’s picture

Status: Reviewed & tested by the community » Needs work

Committed to HEAD, since this fixes the bug in the original report.

However, we might need to do some follow-up work here, per c960657 in #13. Since lots of interested parties are already subscribed to this issue, I'm marking this one back down to "needs work".

z.stolar’s picture

There should probably be a good reason to include something in robots.txt, so simply listing all of Drupal's directories there is not the best approach. The file should only keep files and web pages out of the index where that makes the site better indexed overall (as @samj says: there is no reason to index the button icons of FCKeditor & Co.).
However, samj's solution isn't ideal, since it prevents CSS files from being cached, and other file types from being indexed (PDF etc.). If I understand it correctly, no module.info or INSTALL.txt would get indexed anyway, since there are no pointers to those files from a working site; one would have to actively look for them.

Perhaps the thing to do is to remove as much as possible from robots.txt, while allowing modules to add entries to the file, so that rich-text editors, for example, get a chance to say: "Don't index my button icons or background images".
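
Something along these lines, say (a rough sketch only: the hook name follows the pattern used by the contrib RobotsTxt module, and the module name and paths are hypothetical):

// Implemented by a hypothetical rich-text editor module. Each returned
// string would become one line appended to the generated robots.txt.
function mywysiwyg_robotstxt() {
  return array(
    'Disallow: /sites/all/modules/mywysiwyg/icons/',
    'Disallow: /sites/all/modules/mywysiwyg/backgrounds/',
  );
}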

Can any SEO expert say what the effect would be of having background and UI images from themes and modules indexed by search engines? Will it work in Drupal's favor, will it damage a site's overall ranking, or will it have no effect at all?

c960657’s picture

Status: Needs work » Needs review
FileSize
811 bytes

This patch also removes misc/, modules/, and themes/ as suggested in #13.

Can any SEO expert say what the effect would be of having background and UI images from themes and modules indexed by search engines? Will it work in Drupal's favor, will it damage a site's overall ranking, or will it have no effect at all?

I doubt it will have any effect at all. Also, note that robots.txt does not target search engines exclusively but applies to all non-human agents fetching content from your server.

cburschka’s picture

Status: Needs review » Reviewed & tested by the community

This is a very good idea - PHP scripts are already protected by .htaccess, so no search engine can index them anyway.

cosmicdreams’s picture

Status: Reviewed & tested by the community » Needs review

There doesn't seem to be much discussion about this change. In my opinion, this patch isn't helpful, since none of these directories "should" contain images that I would want indexed. The thought I keep coming back to, when considering having the misc, modules, and themes directories indexed, is a giant haystack of "garbage" images that I wouldn't want to find in an image search. Imagine if a default theme had an image named drupal.jpg and you did an image search for "drupal": you'd get spammed by all those dummy images.

So, in short, I think the patch that was applied in #16 is sufficient and the patch in #18 should not be committed. To open the discussion a bit more, I'll drop the status down to "needs review".

JohnForsythe’s picture

This issue needs attention. None of my uploaded product images were getting indexed, and I had no idea why. Fortunately, I took another look at my robots.txt file and was surprised to see that everything in /sites/ is blocked by default. I just wrote an article to let other people know about the problem:

http://blamcast.net/articles/drupal-seo-mistake

There's no reason to block /sites/ by default. #13 also makes some good points about /modules/ and the others, but /sites/ is especially critical as the default location for uploaded images.

rszrama’s picture

Just as a heads up, John - this has been fixed in D7; what you want is a backport of the change to D6. It looks like the reason this is still unresolved is getting the proper rules into D7, so perhaps you should open a separate issue to backport the initial fix to D6. :?

adrianmak’s picture

subscribe

geerlingguy’s picture

Should we open a new issue for D6, or shall we mark this as "patch (to be ported)", since there was a patch fixing the original/title issue for this thread?

I think the /sites/ rule should be removed altogether.

DamienMcKenna’s picture

An alternative solution - change where the files are stored.

In almost every site I build you'll find the following in the settings.php file:

$conf['file_directory_path'] = 'files';

For some sites, e.g. true multisite installs, I use variations, e.g.

$conf['file_directory_path'] = 'files/public';
// or
$conf['file_directory_path'] = 'files/intranet';
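
The upshot - and why this sidesteps the bug - is that with uploads stored under /files rather than /sites/<site>/files, the default Disallow: /sites/ rule never touches them. The tradeoff, as noted below, is diverging from the sites/example.com/files layout that Drupal's documentation encourages.
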
Scott Reynolds’s picture

An alternative solution - change where the files are stored.

This solution does not work in all cases. For example, I have a bunch of important images that I track outside the files directory in source control, including badges and site icons.

geerlingguy’s picture

Drupal, up to version 5 or 6 (I can't remember which), used the 'files' directory for file storage. With 5 or 6, Drupal switched to using the 'sites' directory instead, which could've introduced this bug originally. The sites folder is supposed to be there for ease of maintenance: a site owner can wipe and re-upload/upgrade the rest of the files, but as long as the sites folder remains, all the site's files, settings, modules, etc. are preserved.

Of course, on a few sites I run we have the files in /files just to keep file paths more sane (and sometimes we have a symbolic link set up to use /files anyway). But Drupal's documentation encourages people to put their files in sites/example.com/files, so our robots.txt shouldn't restrict crawler access there.

:)

jooplaan’s picture

#18: robots-2.patch queued for re-testing.

wik’s picture

subscribing

idflood’s picture

subscribing

pcoucke’s picture

I've added the following at the bottom of robots.txt; it uses wildcards, which are supported by Google:

# Allow images to be crawled
Allow: /sites/*.jpg
Allow: /sites/*.png

You can test this in Google Webmaster Tools under Site configuration > Crawler access.

This solution still blocks the /sites/ directory but allows files with the .jpg and .png extensions.
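
If your site serves other image formats, the same technique extends naturally; a sketch (the extension list is just an assumption about what a given site hosts, and the trailing $, which anchors the match to the end of the URL, is another Google extension like *):

Allow: /sites/*.jpeg$
Allow: /sites/*.gif$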

fabianderijk’s picture

subscribing

robertDouglass’s picture

Priority: Normal » Major
Status: Needs review » Needs work

#31 looks like a good approach, but wouldn't it be more efficient to make a blacklist using that technique? What we really want to avoid being indexed, at all costs, is *.php. *.info, *.module, *.theme, and *.inc also belong on the list, and *.css too.

As this has a huge effect on SEO, I'm bumping it to major. http://blamcast.net/articles/drupal-seo-mistake
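
Concretely, such a blacklist might look like the following (a sketch of the idea using Google's wildcard syntax, not a tested rule set; note that blocking *.css would run counter to the archiving concerns raised in #13):

Disallow: /*.php$
Disallow: /*.info$
Disallow: /*.module$
Disallow: /*.theme$
Disallow: /*.inc$
Disallow: /*.css$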

Damien Tournoud’s picture

Version: 7.x-dev » 6.x-dev
Status: Needs work » Patch (to be ported)

Let's consider 7.x fixed.

There is no real point in indexing anything in modules/, misc/, and themes/.

Back to D6 to consider a backport.

z.stolar’s picture

@Robert: the files you mentioned shouldn't normally be indexed anyway, since search engines don't crawl directories; they crawl websites. Unless a php/module/info/etc. file is directly linked from a web page, there is no risk of it being indexed at all.

droplet’s picture

Version: 6.x-dev » 7.x-dev
Status: Patch (to be ported) » Needs work

Disallow: /contact/

I think we should remove this too.
Is there any reason to block the contact form, other than preventing spam? We may have important information on the contact page for users.

robertDouglass’s picture

Version: 7.x-dev » 6.x-dev
Priority: Major » Normal
Status: Needs work » Patch (to be ported)

@droplet: I suggest you open a new issue for /contact/ and let this one remain for sites and D6

Here's what we've got in D7 for anyone who wants to review: http://drupalcode.org/viewvc/drupal/drupal/robots.txt?view=markup

grendzy’s picture

Status: Patch (to be ported) » Reviewed & tested by the community

#9 is the patch that webchick committed to HEAD, and the same patch applies cleanly to 6.x.

fenstrat’s picture

RTBC x 2 for #9. Applies cleanly to 6.x.

butler360’s picture

Subscribing.

Anonymous’s picture

subscribing

BeatnikDude’s picture

(Òvó) - subscribing

mxmilkiib’s picture

sub

sunward’s picture

I manually changed the file and will make a note not to replace it on the next update.

I do see a parse error from Google:

Line 21: Crawl-delay: 10 - Rule ignored by Googlebot

Now to get the directory re-indexed.

mikl’s picture

I hope this will be included in 6.20. This seems to be a major problem for Drupal sites.

droplet’s picture

@sunward,
Bing and Yahoo support Crawl-delay.

Cyberwolf’s picture

Subscribing.

FlowerOS’s picture

Please do it already. I get tired of manually removing it on every release, on every site.

Gábor Hojtsy’s picture

Status: Reviewed & tested by the community » Fixed

Thanks all! #9 now committed to Drupal 6 too. Will be included with the next release.

Mike Dodd’s picture

I am not 100% sure about allowing robots complete access to /sites/. Access to the user-uploaded files area is fine, but this also means that all of the images contained within all of your modules will be indexed - hundreds of extra images. Granted, these will link to your site, and that can be seen as an SEO boost, but from a user-experience standpoint do you really want all of these graphics included as well? Would it not make more sense to allow access to /sites/default/files (or wherever your files are located), or to restrict access to /sites/all/ (or whatever directories you have there that don't contain the user-uploaded files)? See the sketch below.

I don't really have an answer, but I am slightly reluctant to have all of my modules' images indexed.

I realize that it may be helpful for some people, and only the images directly referenced from a page will be indexed - perhaps only a dozen or so. Still, I just wanted to flag this as a potential issue.
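
As a sketch of that middle ground (assuming uploads live in sites/default/files, and relying on Google-style longest-match precedence for Allow):

Disallow: /sites/
Allow: /sites/default/files/

Or, to keep /sites/ open while hiding the module and theme assets shared across sites:

Disallow: /sites/all/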

JohnForsythe’s picture

Glad this is fixed, thanks everyone.

teezee’s picture

A side note maybe, but why is /contact/ in robots.txt?

# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /node/add/
Disallow: /search/

I mean, the contact module is core, but on most sites something like a contact section (contact the press department, contact the webmasters) with important address information will not be indexed if you have aliases like /contact/press or /contact/webmaster.

Is it really up to Drupal core to enforce /contact/ as a disallowed path just because core contains a module that uses /contact?

geerlingguy’s picture

@teezee - Please open up a new issue (one might already exist, too) for that.

Fabianx’s picture

Subscribing

hedac’s picture

I prefer to add Allow rules for the specific folders I want indexed, such as imagecache, instead of removing the Disallow and letting Google index everything under /sites/, which I don't want.

geerlingguy’s picture

Well, this patch takes care of the main issue. If you have files under /sites/ that you'd like to hide, you can modify your robots.txt file to exclude them... this patch simply sets a new sensible default, which will apply to thousands, if not millions, of websites.

sunward’s picture

Title: Allow crawling of sites/default/files by search engines, don't disallow it in robots.txt » time

I changed the file and went to Google Webmaster Tools to resubmit the sitemap and try to get the site crawled again.

The site has been crawled by Googlebot and the new robots.txt has been loaded, but the images have still not been indexed yet. So this will take time to affect websites.

rszrama’s picture

Title: time » Allow crawling of sites/default/files by search engines, don't disallow it in robots.txt

hedac’s picture

Here is an interesting article about robots.txt and Drupal. It mentions some other errors in the default robots.txt:
http://tips.webdesign10.com/robots-txt-and-drupal

betancourt’s picture

Will this be solved in Drupal 7?

I am just wondering whether it is completely safe to allow /sites to be crawled - are there any downsides or risks?

Is there any official response/fix from Drupal to this issue?

Many Thanks

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

fgm’s picture

@betancourt: see Ryan's confirmation in #22: this has been solved for D7.

ANDRZEJ SOSNOWSKI’s picture

PLEASE SUBSCRIBE!!!

VisualFox’s picture

This is the robots.txt I am using for Drupal 6. It works for me. It's a little annoying that I have to copy it back every time I update Drupal core, but I guess that's what shell scripts are for...

This version uses a lot of tips and fixes I found in the various threads about this issue around the Drupal website.

http://www.visualfox.me/blog/drupal-6x-robottxt

Enjoy!

anusornwebsite’s picture

Title: Allow crawling of sites/default/files by search engines, don't disallow it in robots.txt » Problem 404 (Not found) and Duplicate title tags in Google Webmaster Tools
Version: 6.x-dev » 7.0-rc4
Assigned: Unassigned » anusornwebsite
Priority: Normal » Major

- I use an alias (sitename/oldalias/x) instead of the URL path (sitename/node/x), and I changed the alias from (sitename/oldalias/x) to (sitename/newalias/x). Google does not find my page in its search results or in Webmaster Tools. I want to solve this problem.

- I have duplicate title tags, for example:
newalias/1 - this is the one I want to use, but not the ones below
newalias/1?language=en - Duplicate title tags
newalias/1?language=th - Duplicate title tags
oldalias/1 - 404 (Not found)
oldalias/1?language=en - 404 (Not found)
oldalias/1?language=th - 404 (Not found)
node/1 - Duplicate title tags
node/1?language=en - Duplicate title tags
node/1?language=th - Duplicate title tags

If I go to robots.txt and write this:
Disallow: /oldalias/
Disallow: /node/
will the URL paths I don't want disappear?

grendzy’s picture

Title: Problem 404 (Not found) and Duplicate title tags in Google Webmaster Tools » Allow crawling of sites/default/files by search engines, don't disallow it in robots.txt

This queue is for issues in the Drupal core code. Please visit http://drupal.org/support to see what your support options are if you need more assistance.

anusornwebsite’s picture

Thank you for helping me. I think the alias issue involves the URL alias module, and robots.txt is created by Drupal. I'm sorry for the misunderstanding. Thanks again.