In my case I have a multi-site install on which I want to enable robotstxt for all sites, but for some of them I want to be able to specify a different default robots.txt file than the one included with the module, so I don't have to configure it on each site. So I'm proposing changing the admin settings to the following:

1. Textfield with the path to a robots.txt file to use as the base (defaulting to sites/all/modules/robotstxt/robots.txt).
2. Textarea with additions to append to the resulting robots.txt. If the user wants to change everything about the robots.txt result, they would leave the file textfield blank and use only this field (see the form sketch after this list).
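
A rough sketch of what that settings form could look like, using the Drupal 6 Form API (the function name and descriptions here are assumptions; the robotstxt_file and robotstxt_custom variable names are the ones that appear later in this thread):

// Sketch only: settings form for the proposed two-field configuration.
function robotstxt_admin_settings() {
  $form['robotstxt_file'] = array(
    '#type' => 'textfield',
    '#title' => t('Default robots.txt file'),
    '#default_value' => variable_get('robotstxt_file', drupal_get_path('module', 'robotstxt') . '/robots.txt'),
    '#description' => t('Path to a robots.txt file to use as the base. Leave blank to build the result from the custom additions only.'),
  );
  $form['robotstxt_custom'] = array(
    '#type' => 'textarea',
    '#title' => t('Custom additions'),
    '#default_value' => variable_get('robotstxt_custom', ''),
    '#description' => t('Lines to append to the resulting robots.txt.'),
  );
  return system_settings_form($form);
}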


Comments

Dave Reid’s picture

Status: Active » Needs review
FileSize
4.81 KB

Patch attached for review implementing this feature with an upgrade path for existing users as well.

So basically, we get the same functionality as before, but make it more configurable. I can go into a few of my sites' settings.php files and add:

$conf['robotstxt_custom'] = 'Disallow: /images';
Dave Reid’s picture

And the reason I've removed the root '/robots.txt' as the default file is that we expect that file not to exist in order to use this module, so it doesn't make sense to use it as the default. Plus, people who have the file renamed to something like 'robots.txt.removed' can use the following in their site's settings.php:

$conf['robotstxt_file'] = 'robots.txt.removed';
asak’s picture

Thanks Dave, works great. I didn't check the upgrade path, only a fresh install.

Dave Reid’s picture

FileSize
4.85 KB

OK, this one is working now. The last patch wouldn't let you save an empty file path and didn't properly get the variable for the default value. I tested this one thoroughly.

hass’s picture

Status: Needs review » Needs work

robotstxt_get_content() is not an API function and is missing the _ prefix. Aside from that, as I recall (I told Dave some time ago), I do not like reading the robots.txt file from disk after installation has completed. I do not want to commit this, as it overcomplicates this simple module with no benefit except ~100 new support requests.

I may not fully understand the need for this, as every site could also have its own custom sites/example.com/modules/robotstxt/robots.txt, and otherwise there is a way to override the variable via settings.php...

Dave Reid’s picture

I fail to see how this would create ~100 new support requests for you. The problem with your solution is that I'd have to keep a separate copy of robots.txt in each subsite, which I don't need. I'd also argue that setting content on install is not the Drupal way of doing things. I'll even do some benchmarks.

Dave Reid’s picture

Status: Needs work » Needs review
FileSize
6.84 KB

Revised patch that does some filemtime checking to increase performance a teeny bit over the actual current code in CVS. This even has a very smart update function. Basically, it checks whether the user has left the default robotstxt content intact (by checking that each line of sites/all/modules/robotstxt/robots.txt can be found in variable_get('robotstxt', '')); if so, it can safely use the default file and does a smart job of setting the custom lines. Otherwise, it puts the whole contents of the robotstxt variable into the 'custom lines' variable and leaves the default file blank, keeping the same functionality as before.
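
Roughly, that update path could look like the following (a sketch only, not the actual patch; the update function number and the exact line-matching details are assumptions):

// Sketch of the described upgrade path for existing users.
function robotstxt_update_6101() {
  $file = drupal_get_path('module', 'robotstxt') . '/robots.txt';
  $default_lines = array_filter(array_map('trim', explode("\n", file_get_contents($file))));
  $current = variable_get('robotstxt', '');
  // If every line of the shipped default still appears in the stored
  // content, the user has only added lines: keep the default file and
  // store just the extra lines as custom additions.
  $intact = TRUE;
  foreach ($default_lines as $line) {
    if (strpos($current, $line) === FALSE) {
      $intact = FALSE;
      break;
    }
  }
  if ($intact) {
    $custom = array();
    foreach (explode("\n", $current) as $line) {
      if (!in_array(trim($line), $default_lines)) {
        $custom[] = $line;
      }
    }
    variable_set('robotstxt_file', $file);
    variable_set('robotstxt_custom', implode("\n", $custom));
  }
  else {
    // The default content was modified: preserve everything as custom
    // lines and leave the file path blank, keeping behavior identical.
    variable_set('robotstxt_file', '');
    variable_set('robotstxt_custom', $current);
  }
  variable_del('robotstxt');
  return array();
}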

For testing:
- Went to admin/settings/robots.txt, clicked 'Reset to defaults', and then pressed 'Save configuration'
- Disabled the hook_robotstxt module invocation (to help narrow down exactly the difference between the old and new versions)
- Disabled caching
- Tested with ab -n 500 -c 1 http://mysql.drupal6dev.local/robots.txt

WITHOUT PATCH:

Requests per second:    5.35 [#/sec] (mean)
Time per request:       186.971 [ms] (mean)
Time per request:       186.971 [ms] (mean, across all concurrent requests)
Transfer rate:          11.71 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:   181  187   5.9    186     253
Waiting:      181  187   5.9    186     253
Total:        181  187   5.9    186     253

WITH PATCH:

Requests per second:    5.37 [#/sec] (mean)
Time per request:       186.136 [ms] (mean)
Time per request:       186.136 [ms] (mean, across all concurrent requests)
Transfer rate:          11.45 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:   181  186   4.9    184     213
Waiting:      181  186   4.9    184     213
Total:        181  186   4.9    184     213

I ran these tests three times each, so I posted the median result of each 'run'. And yes, with the patch, this is actually faster than the current code.

Dave Reid’s picture

FileSize
6.88 KB

Missed a variable_del() in robotstxt_uninstall().

Dave Reid’s picture

And in response to robotstxt_get_content() being a private function, I don't agree. This seems like a public function that any module could call if they wanted/needed to. There's nothing that screams 'NEVER CALL THIS' in the code. As a compromise, I can rename the function in a revised patch if you want.

Dave Reid’s picture

So, is this good to proceed? Yea? Nay?

hass’s picture

Need to find some time to look into it...

Dave Reid’s picture

@hass Do you need more reviews? Help maintaining the module? Let me know and I'm more than willing to help.

hass’s picture

No thanks.

Dave Reid’s picture

Fair enough.

perarnet’s picture

I need this change too; it's hell to have to install, enable, and configure this module on every site when you're running 20+ sites on a single Drupal installation. It would be good to know if this can be implemented in the module, so I don't need to run a forked version.

hass’s picture

You are still able to have a default file without the patch.

Dave Reid’s picture

Not really, since it sets the file content on install.

@hass: I don't like to pester anyone, but I'd just like a response on whether or not you are going to consider/implement this, since this is still set to 'needs review'.

hass’s picture

Perarnet complained about something missing, but he has not understood how the module install works. He said he needs to enable the module and then configure it on all sites by hand afterwards. This is wrong! He can add a pre-configuration in the module folder, enable the module, and the config becomes his pre-configuration. His issue is thereby non-existent.

Dave - you are asking for a change that may go in if there are reviews, but there seems to be no interest, or we would have some reviews and comments here about test results.

Dave Reid’s picture

@hass: I appreciate the update. We'll just let it lie awaiting reviews. I apologize for pestering for an update; I hate it when that happens in my own modules' issue queues.

AlexisWilke’s picture

Dave,

I see that you are much more persistent than I am; I just dropped the ball on my end. Hass does not seem to understand maintenance of a multi-site install, and 20 websites isn't very many...

I noticed today that most of my sites forbid access to user/<uid>, but Google will still hit those paths because they are not in the default robots.txt (the reason being that you may allow anonymous users to access them, obviously). Note: Google knows about them because of Google Analytics, which should probably be turned off on those paths. But that does not change the need for a correct robots.txt file.

That means many extra entries in the database log that do not need to be there, and false advertising to search engines.

That being said, I reviewed your code changes. Doing so, I noticed we could implement a hook in our own modules. Cool!
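
For example, a minimal implementation in a custom module might look like this (a sketch, assuming the hook is named hook_robotstxt() and returns an array of robots.txt lines, as in the patch; 'mymodule' is a hypothetical module name):

/**
 * Implements hook_robotstxt().
 */
function mymodule_robotstxt() {
  // Each returned string becomes one line in the generated robots.txt.
  return array(
    'Disallow: /user',
  );
}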

Of course, I'd have one more request... a way to remove entries from the default robots.txt file... (he! he!) not too important though.

One thing though:

+
+  $file = variable_get('robotstxt_file', drupal_get_path('module', 'robotstxt') . '/robots.txt');
+  if ($file && variable_get('robotstxt_file_modified', 0) == filemtime($file)) {
+    $content[] = variable_get('robotstxt_file_contents', '');
+  }
+  else {
+    $content[] = $file_contents = file_get_contents($file);
+    variable_set('robotstxt_file_contents', $file_contents);
+    variable_set('robotstxt_file_modified', filemtime($file));
+  }
+

The $file variable is used in the else branch even when it is empty. Either change the condition to an elseif () or do this (better optimized):

+
+  $file = variable_get('robotstxt_file', drupal_get_path('module', 'robotstxt') . '/robots.txt');
+  if ($file) {
+    if (variable_get('robotstxt_file_modified', 0) == filemtime($file)) {
+      $content[] = variable_get('robotstxt_file_contents', '');
+    }
+    else {
+      $content[] = $file_contents = file_get_contents($file);
+      variable_set('robotstxt_file_contents', $file_contents);
+      variable_set('robotstxt_file_modified', filemtime($file));
+    }
+  }
+

The really cool thing with this is that you can upload a robots.txt file to your files folder and point this module at that file. This means any user can do it without having to change anything in the modules folder (you still need to remove core's robots.txt file, though).
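
For instance, a settings.php override pointing at such a file might look like this (the path is just an example; robotstxt_file is the variable introduced by the patch):

// Assumed example path; use wherever your uploaded file actually lives.
$conf['robotstxt_file'] = 'sites/default/files/robots.txt';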

Thank you.
Alexis Wilke

P.S. One more thing: the new approach prevents you from seeing the content of the generated robots.txt from the admin screen. It would be nice to add one more widget showing the content as it looks at that point, so that on save you can check that it looks as you intended.

hass’s picture

Version: 6.x-1.x-dev » 7.x-1.x-dev
Status: Needs review » Needs work
hass’s picture

Title: Allow admin to specify the default robots.txt file and add additional override checkbox » Allow admin to specify the default robots.txt file
Issue summary: View changes
Status: Needs work » Fixed

  • hass committed f6f06c8 on 8.x-1.x
    Issue #619404 Allow admin to specify the default robots.txt file
    

  • hass committed dde932a on
    Issue #619404 Allow admin to specify the default robots.txt file
    
hass’s picture

Title: Allow admin to specify the default robots.txt file » Allow admin to specify the default.robots.txt file

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.