Allow admin to specify the default robots.txt file and add additional override checkbox

Dave Reid - October 30, 2009 - 22:45
Project:RobotsTxt
Version:6.x-1.x-dev
Component:Code
Category:feature request
Priority:normal
Assigned:Unassigned
Status:needs review
Description

In my case I have a multi-site install that I want to enable robotstxt on all of them, but for some of them I want to be able to specify a different default robots.txt file than the one included with the module so I don't have to configure it on each site. So I'm proposing changing the admin settings to the following:

1. Textfield with path to a robots.txt file to use as the base (default to sites/all/modules/robotstxt/robots.txt).
2. Textarea with additions to include to the result robots.txt. If the user wants to change everything about the robots.txt result, they would leave the file textfield blank and only use this field.

#1

Dave Reid - October 30, 2009 - 22:51
Status:active» needs review

Patch attached for review implementing this feature with an upgrade path for existing users as well.

So basically, we get the same functionality as before, but make it more configurable. I can go into a few of my site's settings.php and add:

<?php
$conf
= array(
 
'robotstxt_custom' => 'Disallow: /images',
);
?>

AttachmentSize
619404-robotstxt-D6.patch 4.81 KB

#2

Dave Reid - October 30, 2009 - 22:55

And the reason I've removed the root '/robots.txt' as the default file is because we expect that file to not exist to use this module, so it doesn't make sense to use it as the default. Plus people who have the file re-named to something like 'robots.txt.removed' can use the following in their site's settings.php:

<?php
$conf
= array(
 
'robotstxt_file' => 'robots.txt.removed',
);
?>

#3

asak - October 31, 2009 - 15:13

thanks Dave, works great. didn't check the upgrade, only fresh install.

#4

Dave Reid - October 31, 2009 - 17:19

Ok this one is working now. Last patch wouldn't allow you to save an empty file path and didn't properly get the variable for the default value. I tested this one thoroughly.

AttachmentSize
619404-robotstxt-D6.patch 4.85 KB

#5

hass - November 3, 2009 - 00:18
Status:needs review» needs work

robotstxt_get_content() is not an API function and is missing _ prefix. Aside I remember - I've said Dave some time ago - I do not like to read the disk for the robots.txt file after the installation has completed. I do not like to commit this as it overcomplicates this simple module without a benefit except ~100 new support requests.

I may not fully understand the need for this as every site could also have it's own custom sites/example.com/modules/robotstxt/robots.txt and otherwise there is a way to override the var via settings.php...

#6

Dave Reid - November 4, 2009 - 07:02

I fail to see how this would create 100 new support requests for you. The problem with your solution is that I'd have to keep a separate version of robots.txt in each subsite. I don't need that. Also I've said that setting something on install is not a Drupal way of doing things. I'll even do some benchmarks.

#7

Dave Reid - November 7, 2009 - 07:19
Status:needs work» needs review

Revised patch that does some filemtime checking to increase performance to a teeny bit that the actual current code in CVS. This even has a very smart update function. Basically it checks to see if the user hasn't made any changes to the robotstxt content (by checking if each line of sites/all/modules/robotstxt/robots.txt can be found in variable_get('robotstxt', '')), then it can safely use the default file and does a smart job of setting the custom lines. Otherwise, it puts the whole contents of the robotstxt variable into the 'custom lines' variable and leaves the default file blank, keeping the same functionality as before.

For testing:
- Went to admin/settings/robots.txt, clicked 'Reset to defaults', and then pressed 'Save configuration'
- The hook_robotstxt module invoke disabled (help narrow down exactly the difference between the old and new versions)
- Caching disabled
- Tested with ab -n 500 -c 1 http://mysql.drupal6dev.local/robots.txt

WITHOUT PATCH:

Requests per second:    5.35 [#/sec] (mean)
Time per request:       186.971 [ms] (mean)
Time per request:       186.971 [ms] (mean, across all concurrent requests)
Transfer rate:          11.71 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:   181  187   5.9    186     253
Waiting:      181  187   5.9    186     253
Total:        181  187   5.9    186     253

WITH PATCH:

Requests per second:    5.37 [#/sec] (mean)
Time per request:       186.136 [ms] (mean)
Time per request:       186.136 [ms] (mean, across all concurrent requests)
Transfer rate:          11.45 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:   181  186   4.9    184     213
Waiting:      181  186   4.9    184     213
Total:        181  186   4.9    184     213

I ran these tests three times each, so I posted the media result of each 'run'. And yes, with the patch, this is actually faster than the current code.

AttachmentSize
619404-robotstxt-D6.patch 6.84 KB

#8

Dave Reid - November 7, 2009 - 07:22

Missed a variable_del() in robotstxt_uninstall().

AttachmentSize
619404-robotstxt-D6.patch 6.88 KB

#9

Dave Reid - November 7, 2009 - 07:23

And in response to robotstxt_get_content() being a private function, I don't agree. This seems like a public function that any module could call if they wanted/needed to. There's nothing that screams 'NEVER CALL THIS' in the code. As a compromise, I can rename the function in a revised patch if you want.

#10

Dave Reid - November 9, 2009 - 22:57

So, is this good to proceed? Yea? Nay?

#11

hass - November 11, 2009 - 01:01

Need to find some time to look into...

 
 

Drupal is a registered trademark of Dries Buytaert.