First, thank you for adding sql-sync, it has been an unbelievable time-saver!
I am attempting to write and debug hook_drush_sql_sync_sanitize() functions for our custom modules, but currently there is no way to force a sanitization without running the sql-sync which is turning out to be a huge pain because the sql-sync takes ~3 hours to run with our large database. It is nearly impossible for me to debug this thing when I have to wait a few hours after every change to see if it worked.
Additionally, I foresee many instances under which one would have a desire to sanitize the data separate from the drush sql-sync command.
You've written the code to run the sanitize, can you give us the option to run it separate of the sql-sync command?
| Comment | File | Size | Author |
|---|---|---|---|
| #15 | 1112142-no-prefix.patch | 644 bytes | djdevin |
| #2 | sanitize-command.patch | 2.13 KB | msonnabaum |
Comments
Comment #1
greg.1.anderson commentedI must admit that I was, and still am, rather ambivalent about the ability to run sanitize without sql-sync. I don't know that adding this capability would actually help your particular use case much, because a run of sanitize on an already-sanitized db is a different thing than a run of sanitize on an unsanitized db.
I think that you would do better to make yourself a small empty Drupal site, and then copy some records from your large db, or generate some random sample records. Then use your small site to test sanitize, targeting another test db. You know on each run that you'll be starting from the same unsanitized data, which is a plus. I can't imagine that you would actually be more happy re-running sql-sanitize on a huge already sanitized database; it's not as good of a test case, and it's got to take some time to run sanitize on the big db even without the sql-sync op thrown in.
I am also really concerned that if 'sanitize' operated on any database that it might cause someone a lot of regret if they accidentally ran it on the wrong database. sql-sync will not destroy just any database; you have to name the target site as a parameter to the sql-sync command. If there was an sql-sanitize command, then
cd /my/production/docroot && drush sql-sanitize -ywould be a very bad thing to do. I know, it's not helpful to go too far in protecting the user from themselves, but in this case I don't think that the usefulness of the proposed feature is great enough to justify the added risk.Comment #2
msonnabaum commentedI regularly find myself wanting this command, so here's a first stab at it.
I think it's a good illustration of the fact that we should probably have it abstracted from sql-sync either way.
Comment #3
moshe weitzman commentedI'm on the fence about danger versus utility here. I suspect enough people will want it that we should provide it.
The confirm message should be more explicit about the database it is operating on. See the confirm for sql-sync for an example.
Comment #4
greg.1.anderson commentedI suppose I can go along with this, reluctantly, as long as we make it a hidden command and give it a really big warning message.
I still cannot think of any good use case that does not involve running sql-sync first. The best I can come up with is an instance where you forget to run sanitize and later want to sanitize a dev site. In that instance, though, I still think you'd be better off making yourself a workflow that automatically sanitizes your prod -> dev syncs, and not trust yourself to hand-sanitize at the right time. I don't know, I guess people trust themselves not to sync dev -> prod, so maybe we can trust them with a manual sanitize command too.
Maybe sanitize should write a marker to the database (variables table?), and sql-sync should print an even scarier warning when you sync from a source db that is already sanitized. I suppose this feature is completely unrelated to having a separate sanitize command, though.
Comment #5
msonnabaum commentedHere's your scenario: I have drush sql-dump making backups on the hour on my server and I often need to just pull one of those dumps down and import it locally. It may not be the latest dump, so sql-sync doesn't help me. Also, if I need to pull down a db from prod, I don't necessarily want to ever sql-sync with prod, so I prefer to just pull down one of the backups. Call me paranoid, but it just doesn't buy me all that much in that scenario.
In general, I don't think we should get too hung up on defensible workflows. sql-sync is an awesome command, but I honestly never used it until Moshe showed it to me on a project. I think it's pretty natural for people to be scared off by the more "full service" drush commands, so I don't see any harm in having a few more simple commands.
Really, we probably should have implemented sql-sanitize on its own from the beginning whether we exposed it or not and just had sql-sync use it. It certainly wouldn't hurt us to decouple some functionality from our giant-commands.
Comment #6
greg.1.anderson commentedYou actually can use sql-sync in your scenario above:
This is probably more convenient than your current workflow in terms of number of characters to type, but I admit it is probably easier to remember the individual steps of rsync, import, then sanitize.
I agree that it is a good idea to break commands up into their component parts; just as pm-update = pm-updatecode + updatedb, it would be better architecturally if sql-sync = sql-dump + sql-import + sql-sanitize. I have wanted to do the sql-dump / sql-import factor for a while, rewriting sql-dump to accept a site alias parameter, handling the remote dump + rsync step if the site is remote, and also implement sql-import the same way, handling an rsync + remote import via ssh if necessary.
I would still prefer to see sanitize be a hidden command, but I agree with the premise of factorization in #5.
Comment #7
moshe weitzman commentedI'd be OK with a hidden command.
We should add an option to `help` command such like --show-hidden or somesuch. We'll tuck some less prominent commands there. Seems very analogous to git command families: plumbing versus porcelain. (I love that bathroom imagery).
Comment #8
msonnabaum commentedI have no objections to a hidden command. I just want to stop maintaining my patch :)
I agree that we should work towards the git model of plumbing vs porcelain commands. It'll give the power users what they need while also hiding potentially site-destroying commands from new folks.
Now that we're in agreement that sql-sync needs to be broken up, I could see the argument for just doing the simplest possible refactor now rather than committing my duct-tape version above. I can go either way on that. Maybe we just start by implementing sql-import and sql-sanitize first and then start moving bits of sql-sync out into those?
Comment #9
greg.1.anderson commentedYeah, I'm willing, I just need to get some other work out of the way first -- various Windows issues (blocked on #1058874: drush_backend_invoke buffers output till end of command. Show progress.), etc.
Comment #10
moshe weitzman commentedI'm fine with Mark committing the patch here (add the HIDDEN) and then we'll strive to get all that refactoring done soon. Meanwhile, mark can stop maintaining a local patch :)
Comment #11
msonnabaum commentedCommitted as a hidden command in 5.x.
Marking for backport since I think its pretty safe.
Comment #12
msonnabaum commentedNow in 4.x.
Comment #14
gregglesAre there any examples of mymodule_drush_sql_sync_sanitize hook in contrib?
I see sql_drush_sql_sync_sanitize but it looks pretty complex. I guess my ideal is that the function could just return sql statements, maybe in an array keyed with weights on them so they can run in different orders.
Comment #15
djdevinDoesn't work with prefixed databases, patch attached. Didn't think the problem was big enough to warrant another issue.
Comment #16
greg.1.anderson commented#15 looks good.
@greggles: See drush.api.php; in particular, hook_drush_sql_sync_sanitize. If your sanitization operation runs invariantly (whenever --sanitize is specified for sql-sync, or whenever sql-sanitize is executed), then this is all you need. If you add any options to control how the sanitization works, then you will also need to declare them via
hook_drush_help_alter, which is documented in the same file.The sanitize ops do not support weights; they run in arbitrary order. If you need to order them, it would be a reasonable feature request to add them.
It was always sort of my vision that contrib modules that maintained sensitive information in the tables they add would also define a sanitize hook. If you spin me around three times and push me in the right direction (i.e., let me know where you want it), I'll write some documentation for it. There must be a place in the module developer's guide that we could drop it.
Comment #17
moshe weitzman commentedcommitted to 5 and 4. back to fixed.
Comment #18
greg.1.anderson commentedBacked out #15 from master and 7.x-4.x; it does not work.
Comment #19
djdevinWhy doesn't it? Seems to be the only thing that does work when running on a prefixed DB.
Comment #20
greg.1.anderson commentedThis does not work on master due to #1219850: sqlq removes all curly braces from sql, including those that are in the text values.. I rolled it out of master because it was causing the sql-sync functional tests to fail. I rolled it out of 7.x-4.x because we must correctly support new features in master before we backport them to drush-4.x.
We can put this back in after the above-mentioned issue is resolved.
Comment #21
osopolarAs already suggested by Moshe Weitzman in #3:
Current contains to less information:
At least it should contain the database name like
Do you really want to sanitize the current database $db_name?
Comment #22
moshe weitzman commentedComment #23
moshe weitzman commentedThis was committed to master and 4.x.