Add an option to sanitize email addresses and passwords from user table post sql-sync. [#861822]

Not sure if this is too "featurey" for Drush core, but it's something that Drupal.org needs as well for its DB exports, and comes up on most client projects.

If you take a copy of the production database into dev/staging, you might want to test some e-mail related functionality without spamming actual users. You can set Devel module to log emails, but that doesn't actually test that SMTP is working. So it'd be great to be able to run a command to reset all e-mail addresses to something safe, like replacing all domains with @$base_url or similar.

Better still is this trick I learned from Fen at CivicActions, where you can give it a gmail address, and it can encode all e-mail addresses as $gmail_account . '+' . $account->name . '@gmail.com';. This way, any mail sent from your server could be captured in a dummy account, and you could clearly see which users were affected by a bulk mailing, for example.

Comment	File	Size	Author
#51	sanitize.sh_.txt	1.93 KB	fen
#43	sanitize-bootstrap-2.patch	26.54 KB	greg.1.anderson
#38	sanitize-bootstrap.patch	7.15 KB	greg.1.anderson
#33	sanitize-4.patch	21.75 KB	greg.1.anderson
#31	sanitize-3.patch	20.93 KB	greg.1.anderson
#29	sanitize-2.patch	20.47 KB	greg.1.anderson
#24	sanitize.patch	16.57 KB	greg.1.anderson
#8	sanitize.drush_.inc_.txt	1.67 KB	greg.1.anderson

Comments

Comment #1

moshe weitzman commented 22 July 2010 at 18:08

Title:	Add command for sanitizing e-mail addresses	» Document how to sanitize a DB dump
Assigned:	Unassigned	» moshe weitzman
Category:	feature	» support

We discussed this while designing sql-sync but postponed the feature at that time. First mentioned at the epic issue #460924-10: Remote site aliases for Drush (Proposal and implementation). Especially see point 3 in #98 where Greg suggests doing a sql-sync to a backup DB and manipulating there. I think this request comes down to documentation. We need to document some best practices with drush since they are non obvious. I've got a plan - stay tuned!

Comment #2

greg.1.anderson commented 22 July 2010 at 18:15

Related to this, I wanted to factor sql-sync into sql-import and sql-dump (create the former, and move the remote-db features from sql-sync into sql-dump). Then, sql-sync just does setup and fires off an sql-dump followed by an sql-import. This would make it easier to do extensions via the drush hook mechanism.

Comment #3

greg.1.anderson commented 22 July 2010 at 23:59

Here is a start:

Postgres:
drush @gklive sqlq "select 'myaddr+' || replace(mail, '@', '_') || '@gmail.com' from users where uid <> 0;"

Mysql:
drush @gklive sqlq "select concat('myaddr+', replace(mail, '@', '_'), '@gmail.com') from users where uid <> 0;"

This could be changed to an 'update' statement with a little more work, and from there, it would be easy enough to make this a post-sql-sync hook, even without the refactor in #2.

I'll continue to post here, as by coincidence, I'm doing a similar operation on one of my sites right now.

Comment #4

moshe weitzman commented 23 July 2010 at 01:29

Splitting those two parts (#2) into two commands makes sense, but I slightly worried we will introduce problems like what 'update' command has. update uses --backend to isolate its second step (updatedb) and many people have their php configured oddly and get error on updb step. lets consider this if we find ourselves needing to issue the sql-import command with --backend.

hopefully i have been coherent here.

Comment #5

greg.1.anderson commented 23 July 2010 at 02:53

Yes, I understand completely. When I was experimenting with this before, I just called drush_invoke (or one of the calls in the command-dispatch chain) directly and avoided that problem nicely.

Comment #6

greg.1.anderson commented 23 July 2010 at 14:53

Here is some better SQL.

Postgres:
update users set mail = 'myaddr+' || replace(mail, '@', '_') || '@testdomain.org' where uid <> 0 and strpos(mail, '@testdomain.org') = 0;

Mysql:
update users set mail = concat('myaddr+', replace(mail, '@', '_'), '@testdomain.org') where uid <> 0 and instr(mail, '@testdomain.org') = 0;

Note that we skip users who are already using email addresses @testdomain.org. This has the dual effect of making this transform stable -- you can run it multiple times without corrupting / complicating the test email -- and also it allows you to set up test users in the main user table that receive the test emails at their own addresses rather than the unified '+' address, if you want.

The thing I was thinking about here is that if you have >100 users, you wouldn't really want to send hundreds or thousands of messages to gmail just to test notifications. This sort of environment really demands that you have a test SMTP server on your local domain.

Even with a local SMTP server, though, it might be nicer to test notifications with 10 users rather than thousands of users, so I was considering also doing a drop from users where the uid > 10 or something like that. It would also be good if this dropped cross-references to users that were dropped (e.g., also drop all nodes where the uid > 10), just to avoid finding 'bugs' that caused by chopped sql tables. Updating the schema to cascade delete would be another way to do this. The problem here is that there may be a lot of tables with uid references, depending on what modules are installed.

I started turning greenknowe.org into a local-community social networking site about two weeks ago, and so far have << 100 users, so I'm not going to worry about the 'drop' issue for a while. I'll provide a post-sql-sync hook to do the above mail transformation in ~< a week, though.

Note also that none of this addresses the security issues that Moshe in #1 (if you don't want to let your devs to even be able to see your user's email addresses), but it does address the test needs to not spam your users.

Comment #7

greg.1.anderson commented 23 July 2010 at 22:35

Here is an example post-sync hook for sanitizing email addresses. Drop the sanitize.drush.inc file into your .drush folder, and then call it like this:

drush sql-sync @gklive @gkdev --sanitize --test-domain=greenknowe.org --test-addr=myaddr

sql-sync will behave as it currently does unless the --sanitize option is used. Note that you can also add the sanitize option to your site alias, like so:

$aliases['gkdev'] = array (
  'root' => '/srv/www/dev.greenknowe.org',
  'uri' => 'http://greenknowe.org',
  'sanitize' => true,
  'test-domain' => 'greenknowe.org',
  'test-addr' => 'myaddr'
);

If you do this, then your email addresses will be sanitized every time that alias is used as the destination target of an sql-sync operation.

Should work with both postgres and mysql, but I have not tried mysql.

Comment #8

greg.1.anderson commented 23 July 2010 at 22:37

Status	File	Size
new	sanitize.drush_.inc_.txt	1.67 KB

Here's the file; rename to sanitize.drush.inc and put in $HOME/.drush.

Comment #9

moshe weitzman commented 23 July 2010 at 23:57

Nice.

Any interest in changing all passwords as part of sanitize or a separate option? You can do it quickly in SQL with MD5('password'). I think it is OK for all passwords to be the same. In Drupal 7, this technique yields unusable passwords (by default) since we use php to generate very secure (and slow to generate) passwords. See http://drupalcode.org/viewvc/drupal/drupal/scripts/password-hash.sh?view... for an example. I think D7 should just do the same md5('password') - its OK in this case that passwords are not usable. One could make them usable with a swappable password.inc if thats really desired.

Comment #10

moshe weitzman commented 23 July 2010 at 23:58

I do think sanitize is core-worthy once we scramble passwords.

Comment #11

greg.1.anderson commented 24 July 2010 at 00:54

Sure, I'll throw in a password scramble and see if it works on D6/D7 postgres/mysql; easy to code, just need some time to test.

Comment #12

greg.1.anderson commented 24 July 2010 at 13:41

Before this goes into core, I want to implement seatbelts.

When the 'sanitize' option is specified, sql-sync will write an array of sanitized table names into 'drush-sanitized' in the variables table
sql-sync will support an option 'no-sanitized' that will force an abort if 'drush-sanitized' exists
sql-dump will support an option 'skip-sanitized' (and 'structure-sanitized'? or always 'structure'?) that will add the tables from 'drush-sanitized' to the skip list, if it exists

A new hook, called from drush_sql_post_sql_sync, that allows drush modules to sanitize tables and return an array of tables touched might also be a nice addition.

Comment #13

moshe weitzman commented 27 July 2010 at 13:11

Assigned:	moshe weitzman	» greg.1.anderson
Category:	support	» feature
Status:	Active	» Needs review

Those seatbelts feel a bit over the top for me. And without the seatbelts, the new hook doesn't provide much value over the post_sql_sync that drush_invoke offers.

Comment #14

greg.1.anderson commented 27 July 2010 at 14:34

True. I'll test and commit as-is for now.

Comment #15

greg.1.anderson commented 27 July 2010 at 14:37

Oh, one advantage of including a new hook is that you could have a single drush_confirm that controlled all of the sanitizations in a single switch.

I suppose if I put the built-in --sanitize (with confirm) in the main function, and left post_sql_sync for the modules, that would have the same effect. Each module would just need to test --sanitize by convention.

Comment #16

moshe weitzman commented 27 July 2010 at 16:46

Does sanitize really need a confirm? We are operating on a copy of the data.

Comment #17

greg.1.anderson commented 27 July 2010 at 17:22

Yes, I think that's a valid point. I'll just take out the confirm, and move sanitize to drush_sql_post_sql_sync, to serve as a template for other sanitize operations in other modules.

Comment #18

moshe weitzman commented 7 August 2010 at 13:20

Status:

Needs review

» Needs work

Comment #19

greg.1.anderson commented 7 August 2010 at 14:58

The thing that's bugging me right now is that in order to sanitize passwords per #9, I need to bootstrap drupal, and sql-sync doesn't do that. Note also that the db to be sanitized could be remote, although I imagine in the typical case it would be local.

There are three ways I could do it.

Make an sql-sanitize command that bootstraps drupal, and do all operations there rather than in post-sync hooks.
Modify drush sql-query to fix up password-modifying queries based on the Drupal version (just as it currently unifies 'show tables' between psql and mysql).
Call drush status on the target drupal instance and cache the result.

The thing I realized recently is that confirmation pre-sanitize is actually very important, because users are going to want that sense of security that they've done their persistent settings correctly if they put sanitize options into their aliases. A confirmation will show that they are not about to sanitize their live site. However, the confirm should come up front, as part of the existing sql-sync confirm (just as the existing confirm tells you if you have excluded any tables from the sync).

Therefore I think that #3 is best.

We cache the drupal version during the validate sql sync hook.
Any drush module that wants to add a sanitize step will check flags during the pre sql sync hook, and use APIs provided to record messages that will be added to the confirmation.
The confirmation messages are displayed in sql-sync propper.
The sanitization step itself will happen in the post sql hook. If a custom sanitize step only needs to execute some SQL, it can be queued up in step #2, and no extra custom code will be needed to run it. Custom sanitizers that need to do more than that would need both a pre and post hook (but I can't think of any practical examples of that right now).

Just thinking aloud, but feel free to chime in if you have any comments.

Comment #20

moshe weitzman commented 7 August 2010 at 19:12

I'll have to re-read this again to fully grok it. Sounds like it got more complicated than I had hoped. I'm not sure confirmation merits it. We could recommend --simulate to folks who really want to see whats gonna happen.

I realize now that you don't need to bootstrap to sanitize passwords.


$pass = drush_get_option('pass', 'admin');
if (drush_drupal_major_version() >= 7) {
  // Borrowed from scripts/password-hash.sh
  include_once DRUPAL_ROOT . '/includes/password.inc';
  include_once DRUPAL_ROOT . '/includes/bootstrap.inc';
  $hash = user_hash_password($pass):
}
else {
  $hash = md5();
}
$sql = UPDATE users SET pass = $hash

That last update statement needs to be a db_update() in D7. We could try to wrap UPDATE and INSERT statements for all drupal versions like we did for drush_db_select()

Comment #21

greg.1.anderson commented 7 August 2010 at 20:53

The description sounds complicated, but simple sanitizations like the one above won't be too complicated in code.

Does drush_drupal_major_version work without bootstraping drupal?

Comment #22

moshe weitzman commented 7 August 2010 at 23:59

Yes. All it needs is DRUSH_DRUPAL_ROOT. It includes system.module or bootstrap.inc (D7) as needed.

Comment #23

greg.1.anderson commented 8 August 2010 at 03:42

Note to self: there are more sanitization ideas in #636340: Provide generally available, regular exports of already public data from the drupal.org database.

Comment #24

greg.1.anderson commented 9 August 2010 at 06:35

Status:

Needs work

» Needs review

Status	File	Size
new	sanitize.patch	16.57 KB

Here is the patch, per #19. Needs a little tweaking before commit, but ready for review. Not too many extra lines to get the confirmation messages -- really worth it, in my opinion, but let me know if you have comments or concerns.

Not sure how many, if any, of the ops from the issue mentioned in #23 should be considered for drush core.

Comment #25

moshe weitzman commented 9 August 2010 at 20:46

Looking very nice. Lets make sure to document the new options and add doxygen for new functions.

+++ commands/sql/sql.drush.inc	9 Aug 2010 06:29:54 -0000
@@ -385,11 +385,13 @@ function _drush_sql_query($query, $db_sp
+    if (drush_get_context('DRUSH_SIMULATE') || drush_get_context('DRUSH_VERBOSE')) {
+      drush_print('sql-query: ' . $query);
+    }

looks like debug code

+++ commands/sql/sql.drush.inc	9 Aug 2010 06:29:54 -0000
@@ -557,3 +559,26 @@ function _drush_sql_get_invalid_url_msg(
+  ¶

trailing whitespace. happens several other times.

+++ commands/sql/sync.sql.inc	9 Aug 2010 06:29:54 -0000
@@ -165,44 +205,53 @@ function _drush_sql_sync($source, $desti
+      if ($messages !== FALSE) {

a bit verbose. if ($messages) will do

Comment #26

greg.1.anderson commented 9 August 2010 at 21:06

regarding the if (drush_get_context('DRUSH_SIMULATE') || drush_get_context('DRUSH_VERBOSE')), the thing I was noticing was that in simulate mode, sqlq outputs the operation being executed, which contains the name of the temporary file that contains the query, but it does not show the query itself. I wanted to make sure that the query was displayed in simulated mode. Doing it during verbose operation seemed like a good idea at the time; I'll take that part out.

I'll submit a new patch with the other changes + more comments shortly.

Two more possible additions: the current version never sanitizes the admin account; an option could be added to include admin in the updates. Also, the current version transforms the email address, so you can tell, for debugging purposes, who the email was supposed to go to. Perhaps another option could obscure the email address, removing it from the database.

Comment #27

moshe weitzman commented 10 August 2010 at 00:56

I do think we should sanitize admin account and i do think that wiping email addresses (like we do for passwords) is even more important than transforming. I consider transforming emails to be specialized enough that it ought to be in drush_extras or elsewhere.

Comment #28

greg.1.anderson commented 10 August 2010 at 01:46

Sounds good.

Comment #29

greg.1.anderson commented 10 August 2010 at 05:46

Status	File	Size
new	sanitize-2.patch	20.47 KB

Here's a cleaned-up patch with all of the suggestions from #25 and #27.

Comment #30

moshe weitzman commented 10 August 2010 at 12:19

Looking good. We might want to mention the sanitize API in the command help.

+++ commands/sql/sql.drush.inc	10 Aug 2010 05:44:44 -0000
@@ -115,6 +115,10 @@ function sql_drush_command() {
+      '--sanitize' => 'Obscure email addresses and reset passwords in the user table post-sync.',
+      '--sanitize-password' => 'The password to assign to all accounts in the sanitization operation, or "no" to keep passwords unchanged.',
+      '--test-email' => 'The username for test email addresses in the sanitization operation.  Defaults to "user".',
+      '--test-domain' =>  'The domain for test email addresses in the sanitization operation.  Defaults to "localhost".',

lets indent the lines below --sanitize

lets state the default value for each option.

the meaning of 'test' in the last two options is not clear from reading the help. anyway, i think those will be moved to contrib is favor of resetting email addresses.

+++ commands/sql/sql.drush.inc	10 Aug 2010 05:44:44 -0000
@@ -385,11 +389,16 @@ function _drush_sql_query($query, $db_sp
+    // In --simulate mode, drush_op will show the call to mysql or psql,
+    // but the sql query itself is stored in a temp file and not displayed.
+    // We will therefore show the query explicitly in the interest of full disclosure.

I've since learned about the -e option of mysql command so we could actually stop using a tmp file for sql-query. thats for another issue though.

+++ commands/sql/sync.sql.inc	10 Aug 2010 05:44:44 -0000
@@ -50,6 +25,82 @@ function _drush_sql_sync($source, $desti
+ * sanitization operations via the register post-sync op function;

lets actually use the function name here: drush_sql_register_post_sync_op(). that way it hyperlinks in the api docs.

Comment #31

greg.1.anderson commented 10 August 2010 at 19:47

Title:

Document how to sanitize a DB dump

» Add an option to sanitize email addresses and passwords from user table post sql-sync.

Status	File	Size
new	sanitize-3.patch	20.93 KB

Here's an updated patch. --test-email and --test-domain are now both gone in favor of --sanitize-email, which takes the form --sanitize-email="user+%uid@localhost". Valid substitutions are %uid, %login and %mail. Note that we shouldn't just move the email option to contrib; it's very important to be able to test email notification operations on a dev site, so I think it should be possible to set the address in core. The new code allows emails to be transliterated or obscured with a minimum of extra options and only a handful of lines of code.

I think we should stick with using temporary files with mysql and psql in the sqlq command. It's true that we could try to switch to -e, but then we would need to worry about escaping rules, which may or may not be platform dependent... I really don't want to have to support "sqlq bug with postgres under Windows", and temp files are the cleanest way to stay clear of this possibility.

Other changes made as suggested.

Comment #32

moshe weitzman commented 11 August 2010 at 00:55

Status:

Needs review

» Reviewed & tested by the community

I think the code would be a bit more readable if we refactor password and email sanitizers into their own functions.

Minor code formatting stuff below. Feel free to commit after those are resolved. I like this new functionality a lot. Maybe Civicactions can finally drop their pushdb stuff :)

+++ commands/sql/sql.drush.inc	10 Aug 2010 19:40:42 -0000
@@ -557,3 +565,58 @@ function _drush_sql_get_invalid_url_msg(
+ * Builds a confirmation message for all of the registers
+ * post-sync operations.

a bit awkward text here

+++ commands/sql/sql.drush.inc	10 Aug 2010 19:40:42 -0000
@@ -557,3 +565,58 @@ function _drush_sql_get_invalid_url_msg(
+ *   All post-sync operation messages concatinated together.

typo: concatenated

+++ commands/sql/sync.sql.inc	10 Aug 2010 19:40:42 -0000
@@ -50,6 +25,92 @@ function _drush_sql_sync($source, $desti
+  $message_list = array();
+  ¶
+  // Test to see if 'sanitize' option was specified.

leading whitespace. a few more of these.

+++ commands/sql/sync.sql.inc	10 Aug 2010 19:40:42 -0000
@@ -50,6 +25,92 @@ function _drush_sql_sync($source, $desti
+	// We need a different sanitization query for Postgres and Mysql
+	$db_driver = $destination_settings['databases']['default']['default']['driver'];
+	if ($db_driver == 'pgsql') {
+          $email_map = array('%uid' => "' || uid || '", '%mail' => "' || replace(mail, '@', '_') || '", '%login' => "' || replace(login, ' ', '_') || '");
+	  $newmail =  "'" . str_replace(array_keys($email_map), array_values($email_map), $newemail) . "'";
+	}
+	else {
+          $email_map = array('%uid' => "', uid, '", '%mail' => "', replace(mail, '@', '_'), '", '%login' => "', replace(login, ' ', '_'), '");
+	  $newmail =  "concat('" . str_replace(array_keys($email_map), array_values($email_map), $newemail) . "')";
+	}

looks like we have tabs instead of spaces here. dreditor is reporting a problem

Comment #33

greg.1.anderson commented 11 August 2010 at 04:49

Status:

Reviewed & tested by the community

» Fixed

Status	File	Size
new	sanitize-4.patch	21.75 KB

I did not want to split the email and password sanitizers into separate functions. Two reasons for this. First, separate functions would mean that there would be two sql queries needed, rather than one; that's minor. The more significant issue is that we only have one pre-sql-sync hook available to us in the sql command file; I could have drush_sql_pre_sql_sync call two separate functions that are not drush hook functions, but I don't know if that would fully satisfy the purpose of simplifying the example.

Instead, I decided to put a really simple example hook in drush.api.php. I think this will make it easy for savvy users to start making their own custom sanitization functions. I think that pushdb could now be replaced with sql-sync + a pre-sync hook function.

The code committed is attached here; we can keep discussing this if you'd like to see a different solution to the simplification issue.

Comment #34

greg.1.anderson commented 11 August 2010 at 14:40

Status:

Fixed

» Active

I'm re-opening because I realize that sanitize is missing a critical feature. It should be possible for modules such as project-issues to add drush hooks to define sanitization options to clean up the tables they manage. If this were done consistently, a drush project could be easily sanitized without (a) every module user understanding the schema of the module and (b) customizing the sanitize scripts frequently or testing 'if table x exists then...'.

There are four ways this could happen:

1a. sql-sync could call another function sql-sanitize that bootstrapped to the source site. This could still be done in such a way that you got the prompt prior to the sync by calling with --no to prevent it from sanitizing.

1b. sql-sync could call another function sql-sanitize-preview that bootstrapped and reported back what needed to be done. Maybe safer than 1a, which calls 'sanitize' on the site that you do not want to sanitize...

2a. sql-sync could clear the commandfile cache and force a bootstrap to whichever site of src or dest was local. (Would probably work; need to investigate.)

2b. sql-sync could find the drupal root and site folder of src or dest, whichever was local, and force-add all of the *.drush.inc files it could find to the commandfile cache.

1a and 1b are the cleanest, but they create new commands that will appear in drush help.

2b might be a little cleaner operation-wise (but more "hackey" code-wise), but then sanitize would not always work right when the destination is local (most common case) unless the user was careful to rsync the files before the database after upgrading modules whose schema changed.

I think I favor 1b or 2b, but have not decided. Comments on approach welcome.

Comment #35

moshe weitzman commented 11 August 2010 at 15:36

Not sure I follow. These modules could use the API you just created. drush_project_issue_pre_sql_sync() could choose to add a new query when --sanitize is true or it could check for a new --sanitize-project-issue option.

I think all thats needed is a way for project_issue to document its new option. We need a hook_command_alter on command definitions. That would also be useful for injecting personal command aliases.

Comment #36

moshe weitzman commented 11 August 2010 at 16:16

Oh, I get it now. project_issue isn't even loaded. I feel like bootstrap is committing patricide this week.

Yet another option is to require admins to use --config to load the project_issue.sanitize.inc or ask admins to move/symlink that file to ~/drush or /etc/drush. They can't copy the file there since that leads to duplicate function error.

Comment #37

greg.1.anderson commented 11 August 2010 at 16:37

As far as usability is concerned, I would favor a convention where Drupal modules that add sanitization steps key off of --sanitize, not --sanitize-project-issue, so that you automatically get all of the recommended sanitizations for all of your enabled modules. Moving / symlinking files isn't as good, usability-wise, because it's not automatic enough, and also will affect sites that don't have that module enabled.

I'll try coding up #34.2b and see how it comes out.

I also agree that it would be good to allow commandfiles to alter the command definition of other commands for just this sort of purpose.

Comment #38

greg.1.anderson commented 12 August 2010 at 12:25

Status:

Active

» Needs review

Status	File	Size
new	sanitize-bootstrap.patch	7.15 KB

This patch fixes the sql-sync bootstrap problem so that site modules can define pre-sql-sync hooks in their drush commandfiles.

The first thing that this patch does is define drush_hook_init, which is called in drush_invoke prior to the creation of the commandfile list. This is done so that the hook init function can add items to the commandfile list, e.g. through bootstrapping.

n.b. drush already had a hook for commandfile initialization, but that was different, as it was per commandfile, and this hook is per command.

n.n.b. Prior to the introduction of drush_hook_init, drush_bootstrap_max never really worked right. It would in fact bootstrap the site, but it was always called to late for any site thus bootstrapped to participate (via drush commandfiles in included modules) in the drush hook mechanism. This means that without drush_hook_init, drush help could never have a meaningful command-alter hook. If this patch is accepted, I'll go back and move drush_bootstrap_max into the init function for all commands that use it.

The next thing that this patch does is add a drush option 'enabled-commandfiles-only' that, if set, will subvert the loading of commandfiles during the DRUSH_BOOTSTRAP_DRUPAL_SITE phase so that the more reliable process in DRUSH_BOOTSTRAP_DRUPAL_FULL can be used instead.

Finally, a few lines of code were added to call bootstrap_max on a particular site alias record.

With these changes, I created a test file sites/all/modules/project_issue/project_issue.drush.inc:

<?php

function drush_project_issue_pre_sql_sync($source = NULL, $destination = NULL) {
  if (drush_get_option(array('sanitize', 'destination-sanitize'), FALSE)) {
    drush_sql_register_post_sync_op('sanitize-project-issue', 
      dt('Reset email addresses in project issue configuration'), 
      "UPDATE project_issue_projects SET mail_digest = 'user@localhost', mail_copy = 'user@localhost';");
  }
}

With this file in place, the project issue sanitization operation was added iff the project issue module is enabled on the local site.

n.b. it might be prudent to call sql-sync twice if the source site is remote, and modules were recently enabled or disabled. Since drush uses the local site to control sanitization rather than the source site, under some circumstances a sanitize option might be missed. Using a remote command to control sanitization (e.g. #34.2a) would be quite a bit more complicated, so overall I think I am quite happy with the attached solution.

Comment #39

moshe weitzman commented 12 August 2010 at 18:45

We're at the point where we need a diagram of the drush bootstrap. This is getting complicated enough that perhaps a phone call is in order.

I like the skipping of commandfiles from disabled modules. Can we generalize that further? It seems like we should skip anytime we know that we are destined for FULL_BOOTSTRAP (e.g. cron, cache-clear, ...)

n.n.b. Prior to the introduction of drush_hook_init, drush_bootstrap_max never really worked right. It would in fact bootstrap the site, but it was always called to late for any site thus bootstrapped to participate (via drush commandfiles in included modules) in the drush hook mechanism. This means that without drush_hook_init, drush help could never have a meaningful command-alter hook. If this patch is accepted, I'll go back and move drush_bootstrap_max into the init function for all commands that use it.

A different way to fix this is to change drush_invoke() sequence a bit. It assumes that the list of commandfiles is static. If we reran $list = drush_commandfile_list(); after each $variation, we would pickup newly discovered command files. So if a 'pre' hook did a bootstrap level change to FULL, the new commandfiles would be recognized in later $variations. Not sure if this is any better.

If we go with the new hook, it needs to be documented in API file.

+++ drush.api.php	12 Aug 2010 12:07:49 -0000
@@ -166,9 +166,9 @@ function hook_drush_pm_adjust_download_d
+      dt('Reset passwords and email addresses in user table'), ¶

this line has trailing whitespace. happens several times (mostly in blank lines). dreditor is very good at highlighting these. just getting back into using it.

+++ includes/command.inc	12 Aug 2010 12:07:50 -0000
@@ -287,6 +287,18 @@ function drush_invoke($command) {
+  // If a command needs to bootstrap, it should do so in _init
+  // and not later; otherwise, the commandfile list might not
+  // be complete.

I need to research more, but I think commands like updb and Aegir commands would object to this statement. They want to control the bootstrap process and here we essentially tell them that all their work has to be done in this new hook? Note that my suggestion above makes this recommendation unneeded.

+++ commands/sql/sync.sql.inc	12 Aug 2010 12:07:49 -0000
@@ -24,13 +24,12 @@ function drush_sql_sync_validate($source
-  $major_version = drush_drupal_major_version_of_site(array($source_settings, $destination_settings));

drush_drupal_major_version_of_site() is now unused.

Comment #40

greg.1.anderson commented 12 August 2010 at 20:31

I considered the drush_invoke alternative first, but rejected it because the current algorithm just makes a long list of function names, and it's not currently clear where one section stops and the next begins. Based on your comments above, though, I think it's better to go this route anyway. It can be solved just by breaking the current list into a list of lists, or something like that.

I'll re-roll the patch using this technique.

I like the skipping of commandfiles from disabled modules. Can we generalize that further? It seems like we should skip anytime we know that we are destined for FULL_BOOTSTRAP (e.g. cron, cache-clear, ...)

With the current drush bootstrap, we keep stepping through the bootstrap levels until we find a command that matches the bootstrap level, so this is currently impossible. Now that commands can't have spaces, I could rewrite this too; then we would know in advance how far we were bootstrapping in advance, because we could find the command right away. Might be time to just rewrite the whole stack..

In "the new world", then, we would find the command first and call bootstrap_max to the level desired. If a command wanted to bootstrap further, it could just call bootstrap_max again.

Comment #41

greg.1.anderson commented 13 August 2010 at 15:21

Status:

Needs review

» Needs work

Comment #42

moshe weitzman commented 13 August 2010 at 15:44

I think that the drush_invoke() loop can be refactored a little such that we add a helper function that accepts a commandfile list. The contents of that commandfile will get refetched after each invoke function gets called. that way newly discovered commandfiles are in the game.

Comment #43

greg.1.anderson commented 15 August 2010 at 07:03

Status:

Needs work

» Needs review

Status	File	Size
new	sanitize-bootstrap-2.patch	26.54 KB

Here is the "rewrite" of the drush bootstrap process -- although the required changes were substantially fewer than I initially thought. Note that this patch also fixes #698102: Remove backwards compatibilty for spaces in commands.

I started off this version of the patch by removing the new command init hook, but I ended up putting it back in. It is still valid to bootstrap in drush_hook_command() if you really want to, but if you do that, you'll be denying any of the commandfiles loaded by the bootstrap from participating in the "_validate" and "_pre" stages of the drush command dispatch. If you bootstrap in drush_command_init(), then the new commandfiles can define policies that affect the current command in drush_hook_command_validate(), etc. So, in short, init is the preferred place to bootstrap, but Aegir may continue to bootstrap in command() with no ill effect (and in fact it will work better than it used to).

I did not go back and make 'help', etc. bootstrap in _init(). I did add _init to drush.api.php.

The "shortcut" commandfile searching done in DRUSH_BOOTSTRAP_SITE is now only done for commandfiles in the global context (/etc/drush, $HOME/.drush, etc.) that explicitly request a SITE, CONFIGURATION or DATABASE bootstrap. Global context commands that request a FULL or LOGIN bootstrap, or any commands found in an enabled module of a bootstrapped site, will always get at least a FULL bootstrap, regardless of what they request. Drush will now do a FULL bootstrap before considering any site-wide drush commandfiles, so only commandfiles in enabled modules will ever be seen.

Note that I added a new function drush_bootstrap_to_phase that looks almost exactly like drush_bootstrap_max, except the former returns the result of the last bootstrap operation, and the later returns the highest phase reached. I needed the former result code, but could not be sure that some contrib command might depend on the result code of bootstrap_max. There was no elegant way to factor these two functions, and since they are short, I opted against an inelegant factoring.

Comment #44

moshe weitzman commented 15 August 2010 at 13:25

Awesome work.

We can delete the 'spaces in commands' text in example.drushrc.php

any commands found in an enabled module of a bootstrapped site, will always get at least a FULL bootstrap, regardless of what they request. Drush will now do a FULL bootstrap before considering any site-wide drush commandfiles, so only commandfiles in enabled modules will ever be seen.

This is an inevitable byproduct of this patch. Kind of a bummer that commandfiles in modules have to be slow. I wonder how we can document the fact that the 'bootstrap' element is ignored in module shipped commandfiles. Maybe add a code comment to the bootstrap element in 'make-me-a-sandwich' command.

+++ drush.php	15 Aug 2010 06:44:52 -0000
@@ -65,15 +65,14 @@ function drush_verify_cli() {
+  drush_set_context('DRUSH_BOOTSTRAP_PHASE', -1);

can we use a constant instead of -1?

+++ includes/command.inc	15 Aug 2010 06:44:53 -0000
@@ -280,20 +252,45 @@ function drush_invoke($command) {
+  // If a command needs to bootstrap, it is advisable
+  // to do so in _init; otherwise, new commandfiles
+  // will miss out on participating in any stage that
+  // has passed or started at the time it was discovered.

Isn't this advice unnecessary now that we re-fetch the commandfile list on every variation?

Comment #45

moshe weitzman commented 15 August 2010 at 17:10

OK, that advice is still valid for command callbacks. Its not totally valid for a pre hook for example but we can just leave the text as is.

Comment #46

moshe weitzman commented 16 August 2010 at 11:53

Status:

Needs review

» Reviewed & tested by the community

Feel free to consider my feedback above and then commit. Thanks.

Comment #47

greg.1.anderson commented 16 August 2010 at 14:27

Status:

Reviewed & tested by the community

» Fixed

Committed with constant for not-yet-bootstrapped case.

Comment #48

webchick

she/they

English

Vancouver 🇨🇦

commented 16 August 2010 at 15:40

My goodness. I didn't think my wee feature request would result in a re-write of the Drush bootstrap. :) Thanks SO much for your work on this, Greg! Can't wait to try it out!

Comment #49

greg.1.anderson commented 16 August 2010 at 16:45

@webchick: You're welcome. The next thing I'm planning on doing is submitting drush pre-sync hooks to the projects containing tables sanitized in #636340: Provide generally available, regular exports of already public data from the drupal.org database, for example, like drush_project_issue_pre_sql_sync(). I'd appreciate your help in evangelising this convention to module writers who cache sensitive information; maybe you could write a short blog entry about sql-sync --sanitize? I'll write something up in the drupal handbook. It's already in drush.api.php; I should find a place on drush.ws for it too.

Comment #50

31 August 2010 at 14:40

Status:

Fixed

» Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

Comment #51

fen commented 13 September 2010 at 20:29

Component:

Code

» PM (dl, en, up ...)

Status	File	Size
new	sanitize.sh_.txt	1.93 KB

I'm sorry that I'm coming late to the party (Owen just showed me this thread despite the fact that I may have helped inspire it): I have a couple questions that I'm not sure I see resolved in this:

1) Can there be a list of people/email addresses for which sanitize doesn't work? In the script I use, I have a list of emails to keep untouched (see attached shell script).

2) Can the separator between user and uid be defined or is it set as '+'? Some systems use '-' as the separator...

3) I changed my sanitize method to include a HASH of the original address, so that the final emails results in ${MAIL_PREFIX}SHA1(mail)@$MAIL_SUFFIX

This latter one makes it easy to also run the same function on a CiviCRM database and ensure that two email addresses that are the same in Drupal and CiviCRM before the sanitization are also the same after the sanitization which is essential for proper operation. With CiviCRM, I suppose I can get the $uid from the civicrm_uf_match table if the synchronize process has run, but this is not guaranteed.

I didn't re-open this as I have to admit that I have not yet read through all the comments/patches nor tried the code, so perhaps all my concerns have already been dealt with.

Comment #52

moshe weitzman commented 13 September 2010 at 20:32

Well, at least you admit your sloth :)

The sanitize routine is pluggable so you can easily achieve all the customizations you describe.

Comment #53

greg.1.anderson commented 13 September 2010 at 20:52

Hm, yes, it is pluggable to easily allow adding more sanitization options, but if you wanted to change the existing sanitization options, it's a bit more complicated.

If you have specific requests on how this should work, please open a new issue. I was considering a helper function for sanitize that would build email address modification SQL, for use by any sanitize op that needed to clean up an email address. If we put enough configuration options on that routine, it would probably meet your needs.

Los Angeles

commented 23 September 2010 at 21:22

This is something I was dreaming up for Drush to help with PCI compliance.

My vote +1000

Comment #57

greg.1.anderson commented 13 November 2010 at 05:15

For the benefit of those who may still be following this issue, #898818: drush_bootstrap_max_to_local_sitealias breaks first-time sql-sync operations changed the way sanitize hooks work. You should now put your sanitize hooks in HOOK_drush_sql_sync_sanitize instead of drush_HOOK_pre_sql_sync. The implementation is otherwise very similar. See drush.api.php.

Comment #58

amontero

Barcelona

commented 12 October 2012 at 17:01

Sanitize addition/followup:
#1808516: Remove cron_key when performing sanitized dumps