robots.txt -- what to actually NOT block?
dugawug - October 18, 2007 - 21:48
So our Drupal site is now launched and all is well, but what I'm wondering about robots.txt now is how to configure it.
I found this thread that seems to have some great advice well worth heeding: http://drupalzilla.com/robots-txt
But what I mainly wonder, and it's partly due to my newbieness, is what directories, particularly of a drupal site, do we definitely NOT want blocked? And what about images? So far, I don't see a single directory (except maybe images) that I would actually not want blocked?
Any experienced soul to help me out here?

=-=
There really isn't anything else you need to do to robots.txt that the developers haven't already thought of. IMHO.
User-agent: *
Crawl-delay: 10
# Directories
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /scripts/
Disallow: /updates/
Disallow: /profiles/
already blocks all folders a bot shouldn't be in.
# Files
Disallow: /xmlrpc.php
Disallow: /cron.php
Disallow: /update.php
Disallow: /install.php
Disallow: /INSTALL.txt
Disallow: /INSTALL.mysql.txt
Disallow: /INSTALL.pgsql.txt
Disallow: /CHANGELOG.txt
Disallow: /MAINTAINERS.txt
Disallow: /LICENSE.txt
Disallow: /UPGRADE.txt
already blocks all files outside any folders that a bot shouldn't index
# Paths (clean URLs)
Disallow: /admin/
Disallow: /aggregator/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=aggregator/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
already handles all paths, clean and not clean, that wouldn't need to be indexed.
The only thing that one may want to take into consideration is when a path is aliased. robots.txt rules match the URL a bot actually requests, so if a path's default form is already included here, its alias may need to be added to the disallow list as well.
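To make the aliasing point concrete, here is a minimal sketch using Python's standard urllib.robotparser module. The alias /get-in-touch, the host example.com, and the bot name ExampleBot are made up for illustration; the sketch simply assumes a site where the contact form is also reachable under that alias.

```python
from urllib.robotparser import RobotFileParser


def build_parser(rules):
    """Parse robots.txt text held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser


# One rule taken from the default file quoted above.
DEFAULT_RULES = """\
User-agent: *
Disallow: /contact/
"""

# Same rule plus a hypothetical alias of the contact page.
RULES_WITH_ALIAS = DEFAULT_RULES + "Disallow: /get-in-touch\n"

default_parser = build_parser(DEFAULT_RULES)
alias_parser = build_parser(RULES_WITH_ALIAS)

BOT = "ExampleBot"  # hypothetical crawler name
ALIAS_URL = "http://example.com/get-in-touch"

# The default path is blocked, but the alias is a different URL prefix.
print(default_parser.can_fetch(BOT, "http://example.com/contact/"))
print(default_parser.can_fetch(BOT, ALIAS_URL))
# Once the alias is listed too, it is blocked as well.
print(alias_parser.can_fetch(BOT, ALIAS_URL))
```

With the default rule alone the alias stays crawlable; it is only blocked once it has its own Disallow line.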
_____________________________________________________________________
Confucius says:
"Those who seek drupal answers should use drupal search!" : )
ok, so...
if everything is configured for you, then i wonder why that thread seemed to say otherwise so much?
in any case, i'll compare it later (when i'm not at work) ;)
so back to my main point here, is there anything in particular i definitely want to ~not~ block? anything that will hurt my site if i do block?
we have all sorts of document, image, drupal, and module folders and so far i don't know of anything i do want accessible to robots.
=-=
content, unless you don't care about being indexed by search engines.
Keep this little note in mind as well. Only good bots follow the robots.txt rules. Bad bots ignore robots.txt. There are far more bad bots than good bots spidering the internet that just eat up bandwidth.
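One rough way to answer the "what must not be blocked" question is to list the URLs you want search engines to reach and test them against your rules. The following is only a sketch, assuming Python's standard urllib.robotparser module; the host example.com, the bot name, and the sample URLs are placeholders, and only a handful of the default rules are included. It says nothing about bots that ignore robots.txt.

```python
from urllib.robotparser import RobotFileParser

# A few rules copied from the default file quoted earlier in the thread.
RULES = """\
User-agent: *
Disallow: /modules/
Disallow: /sites/
Disallow: /admin/
Disallow: /search/
Disallow: /user/login/
"""

# Placeholder URLs that we would want indexed (content pages).
MUST_BE_CRAWLABLE = [
    "http://example.com/",
    "http://example.com/node/1",
    "http://example.com/about-us",
]


def blocked_urls(rules, urls, bot="ExampleBot"):
    """Return the subset of urls that the rules forbid for the given bot."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return [url for url in urls if not parser.can_fetch(bot, url)]


# An empty list means no content URL is hidden from compliant bots.
print(blocked_urls(RULES, MUST_BE_CRAWLABLE))
# Sanity check: an admin page should be reported as blocked.
print(blocked_urls(RULES, ["http://example.com/admin/settings"]))
```

If the first list ever comes back non-empty, one of the Disallow lines is hiding content you wanted indexed.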
_____________________________________________________________________
ok thanks, that i did know, but since Drupal is database-driven, is there a reason to not hide every directory/file in robots.txt? i guess that's my question here. obviously my own custom php content files shouldn't be hidden, but other than that...?
=-=
all folders in a default drupal install are already disallowed. actually, a few of the entries there are Drupal 4.7.x specific and can be removed.
ie: database is no longer a directory in Drupal 5.x, nor is updates
the only directory that isn't in the folders list is the files directory, which is protected by an .htaccess file. that won't allow you to index the folder as an anon user, which can be seen by trying to access yoursite.com/files. you can still disallow it in your robots.txt file too if you so choose, without ill effects.
keep in mind that if the sites folder is disallowed, all folders in sites are also disallowed.
thus if sites is disallowed, sites/default is also disallowed
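a quick way to see that prefix behaviour for yourself, sketched with Python's standard urllib.robotparser module. the host example.com, the bot name and the sample paths are just placeholders, not anything from a real site:

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /sites/",
])


def allowed(path, bot="ExampleBot"):
    """True if a compliant bot may fetch the given path on example.com."""
    return parser.can_fetch(bot, "http://example.com" + path)


# Everything under /sites/ shares the disallowed prefix; /node/1 does not.
for path in ["/sites/default/", "/sites/default/settings.php", "/node/1"]:
    print(path, allowed(path))
```

only the last path should come back as allowed, since a Disallow line covers every URL that starts with it.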