This is a long ticket with many comments. A summary is provided below:
(Note: a secondary problem was discovered during this ticket but has been moved to #1576588: MySQL GRANT test can return an incorrect IP address for webserver if it has more than one IP on same LAN. )
Problem
Problem 1. for a different IP), if the 'master' in the pack is not the Aegir Hostmaster server. In such a situation, it looks like the master webserver context is never even loaded - just @server_master, the 'cluster' context, and the slave context.
Proposed resolution
This was fixed on 14th May by ensuring the array of 'master' webservers always gets returned in grant_host_list() so that they get the GRANTs as well as the slaves.
Remaining tasks
None.
Original report by Brian Altenhofel
My situation:
Two Nginx servers set up as a "Web Pack" in Aegir behind separate dedicated Varnish caching/loadbalancing servers. Behind the webservers, I have two MySQL servers in Master-Master (because user tables are replicated, only one needs to be pointed to, of course).
The issue that I am running into is that, upon creating a site, only the "slave" in the Web Pack gets USAGE/PRIVILEGES granted on the target database server. I have not tested this with more than two webservers in the pack.
Right now, it's not too big of an issue for me because I only have a couple of sites running on this setup, and I can wait for the creation or migration of a site to complete and then add the appropriate grants manually.
My week is pretty crammed, but I can probably work on a patch and/or further testing after this week if needed to resolve this issue.
| Comment | File | Size | Author |
|---|---|---|---|
| #21 | 1555398.patch | 479 bytes | mig5 |
| #18 | aegir_pack.png | 41.2 KB | mig5 |
Comments
Comment #1
Anonymous (not verified) commentedI am seeing a similar issue, except in my case, the master and 1 slave get the GRANT, but the other slave does not.
Comment #2
Anonymous (not verified) commentedActually, to be more precise:
The web pack is:
Master (192.168.180.166)
Slave 1 (192.168.184.44)
Slave 2 (192.168.184.174)
The database (floating IP of the multi-master mysql) is:
192.168.185.128
mysql> select Host from user where User = 'mig5net';
+-----------------+
| Host |
+-----------------+
| 192.168.180.166 |
| 192.168.184.44 |
| 192.168.185.128 |
+-----------------+
3 rows in set (0.00 sec)
So it's missing Slave 2, but included the floating IP of the multi-master mysql along with master + slave 1 from web pack.
Comment #3
Anonymous (not verified) commentedaaaaaand just confirmed in the drush log (was a Migrate task):
Granting privileges to mig5net@192.168.185.128 on mig5net
Granting privileges to mig5net@192.168.184.44 on mig5net
Granting privileges to mig5net@192.168.180.166 on mig5net
So somehow, in provision/http/pack/pack_service.inc, this function is not returning all items of the slave_web_servers array:
Comment #4
Anonymous (not verified) commentedOK, I've debugged a bit further.
The above function grant_server_list() is actually ok, according to my drush_log :
This function's result ends up getting fed into db/db.drush.inc in order to build an array of hosts to GRANT privileges to:
Sticking a drush_log in here, to get the result of $this->context->service('http')->grant_server_list(), things are still looking good:
But later in create_site_database(), grant_host_list() array shows the three hosts - the IP of @server_phobosintmig5net has been lost.
So I guess it's the interaction between grant_host() and how it affects the return array of grant_host_list(), but *why* is beyond me right now.
Comment #5
Anonymous (not verified) commentedHosts are @server_phobosintmig5net,@server_deimosintmig5net
Load alias @server_phobosintmig5net
Loading nginx driver for the http service
Load alias @server_deimosintmig5net
Loading nginx driver for the http service
Server is 192.168.185.128
Server is 192.168.184.44
Server is 192.168.180.166
I'm wondering if the second loading of the second slave server context, overrides the former, for the 'nginx' service.
Comment #6
Anonymous (not verified) commentedOK, new theory:
By the time we are loading the array of hosts in grant_server_list(), we are not looking up the grant_server_list() of the 'pack' service type of http, but of the 'http' service, and the http service has been replaced by the individual slave server's context (hence 'Loading nginx driver for the http service')
So we are never actually returning the array of slave hosts from the pack service - only ever returning the remote_host of the single web server, along with the db server and the hostmaster server.
As you can tell, I'm not getting anywhere :)
Comment #7
Anonymous (not verified) commentedHah, just realised what's going on here..
Granting privileges to test3mig5net@192.168.185.128 on test3mig5net
Granting privileges to test3mig5net@192.168.184.44 on test3mig5net
Granting privileges to test3mig5net@192.168.180.166 on test3mig5net
The 192.168.185.128 is the floating IP of my MySQL-MMM cluster. I mistakenly thought Aegir was just adding the IP of the database server (which is the floating IP) - but what is actually happening here, is whichever *writer* database server (the current active writer in the cluster), is making the 'mysql -U intentnlly invalid' test using the floating IP instead of the IP of its main interface.
So aegir is catching the IP from the preg_match here:
...and it just so happens that the IP returned by the MySQL exception is that of the floating IP instead of the main interface.
So this is not Aegir's fault!
Only problem is I don't know how to fix it, but somehow we need to make the mysql test connection come from *not* the floating IP, so that the correct IP is returned.
Or more so, Aegir should perhaps not rely on the returning IP from the database server, but rely on the valid resolution of the server object's 'remote_host' name. Though I can see that could open up a pandora's box of 'bug reports' from users who haven't guaranteed that their web servers are resolvable to an IP from the point of view of their database servers.
Comment #8
Anonymous (not verified) commentedDisappointingly, it is not possible to force --bind-address in the mysql command, unless you are using MySQL Cluster NDB.
On the other hand, if a webserver has multiple IP addresses, shouldn't we be ignoring what the response was from the mysql 'access denied' message and just add grants that iterate over the ip_addresses attribute of the server context?
Comment #9
Anonymous (not verified) commentedFor what it's worth, I've worked around this for now by moving my two slave webservers off to new servers, so that they aren't the same machines as the mysql pair, and thus neither web server ever has the MySQL floating IP on it, so the attempted mysql connections always return the correct IP address.
Not a great workaround - on the other hand, this is probably better HA design anyway :)
To the original poster: not sure if your mysql multi-master pair were on the same machines as the web servers (your master & slave, or two slaves) - if so, moving the webservers (or database servers) off to separate machines, or in some way guaranteeing that your webservers never connect to mysql with an IP other than that which they would normally use, should work around the problem for you too.
Comment #10
Anonymous (not verified) commentedNope. My setup is like this:
(floating public ip in front between Varnish servers)
Nginx 1 and Nginx 2 are in a Web Pack together, with Nginx 1 as the Pack master and Nginx 2 as the Pack slave. What happens is the Aegir master and Nginx 2 get grants on MySQL, while Nginx 1 gets left out in the cold. For example, the grants look sorta like:
'testvmdohcom'@'aegirmaster.vmdoh.com'
'testvmdohcom'@'nginx2.vmdoh.com'
Name resolution shouldn't be an issue, as each server has its /etc/hosts file maintained by Puppet and they check in on a regular basis (and my deployment script has them check in immediately as soon as they are provisioned and tells Puppet about the new host).
I'm sorry I haven't had a chance to do any more testing recently. I'm staying pretty hooked up right now with a lot of deadlines running together.
I did spend an hour looking through the code a little bit to try to figure out what is going on, and I'm not afraid to admit that it is a little over my head considering I just got started in OO fairly recently.
ETA: Both Nginx servers DO get the proper configurations from Aegir - its just that the MySQL servers don't have a grant created for Nginx 1.
Comment #11
Anonymous (not verified) commentedAre these all separate servers?
It's odd because that's basically my setup now, except minus the Varnish layer.
Next time can you put something like this in provision/db/mysql/mysql_service.inc, around line 104 just above the first preg_match in function(grant_host) and just after $server->shell_exec($command):
dlm(drush_shell_exec_output());
Then try and install a site on your cluster and watch the install log. You should see the deliberate MySQL GRANT statements that are intended to fail, to catch the IP addresses of each host:
The three IPs above should, in your case, be 10.0.0.3, 10.0.0.4 and 10.0.0.7
Let me know how you go!
Comment #12
Anonymous (not verified) commentedYep, all separate servers. I went ahead and did the above to check (needing a break from fixing an export from another CMS that I am migrating a site from...)
There should also be a whiskey-rs01.vmdoh.com in there. whiskey-rs02 is the slave in the pack, and 10.179.6.236 is the Aegir master.
Comment #13
Anonymous (not verified) commentedHmm ok, and just a couple lines above in your task output log, there's nothing like this?
Load alias @server_whiskeyrs01vmdohcom
Loading nginx driver for the http service
?
Sounds like your slave is completely ignored here. Are you running on the latest Aegir?
Comment #14
Anonymous (not verified) commentedComment #15
Anonymous (not verified) commentedRunning Aegir 1.8
What strikes me as odd is that it is actually the server declared as the master that is being ignored.
Here is my drush alias for the pack.
The xxx's are a public IP address, but I just noticed something - that IP address doesn't match any that I have... must have been a leftover from a previous attempt. So I just changed the IP addresses in the cluster to match those two servers? Is that the proper procedure?
Guess not, because here is my new output from the install:
Comment #16
Anonymous (not verified) commentedAh ok. In my setup, the Aegir (hostmaster) server is the master, and my two nginx servers are both slaves. That said, my 'master' (Aegir) still gets its GRANT assigned.
What's the drush alias look like of your whiskey-rs01.vmdoh.com machine?
And it seems this IP 10.179.6.236 is different to that of the Aegir server?
I am starting to think that Pack module doesn't work properly unless the Aegir hostmaster server is assigned as 'master', and then your two other webservers as slaves. Seems similar story to here: http://community.aegirproject.org/node/37/talk. Having looked at the Pack service itself, we actually explicitly only send the array of 'slave' servers as the list of servers that should have database access (and I think the master gets assigned separately if it's the Aegir hostmaster server, as the second item in this merging array):
Confusing! Sorry, I am only learning this module now too :) I hope anarcat can chime in here as he might see what the problem is more obviously.
Edit: and the IP of the 'cluster' should be that of your floating IP - or in this case the IP in front of your two Varnish machines I think.
Comment #17
Anonymous (not verified) commentedAnd the 10.179.6.236 is the correct IP for the Aegir server.
As for the talk page... what I'm getting from it is that whiskey-rs01 and whiskey-rs02 should both be slaves, and then a third machine should be the "master"... or just a third node? Anyone got a drawing?
In my setup, the configs are getting pushed correctly to whiskey-rs01 and whiskey-rs02, and according to the log it rsyncs the platform to whiskey-rs01 (which is not a problem for me as /var/aegir/platforms are mounted with glusterfs across both machines).
And as for the IP, I wasn't sure if it needed to be public public or "public" in the sense of which way traffic enters the server.
This will be my last post for the night, as it's almost 1AM here and I really need to get started on work again by 6AM. It's very true that programmers are just tools to turn caffeine into code.
Thanks for all your help!
Comment #18
Anonymous (not verified) commentedOK, my setup is:
1. Server node of type 'pack'. IP of the pack is the *public* floating IP that loadbalances between my two backend nginx servers. This IP can actually be anything, it doesn't matter I don't think.
2. In that pack node, the 'master' server is the Hostmaster Aegir server node (happens to be private IP). This IP gets a GRANT so aegir can drop/create databases for the sites
3. In that pack node, the two 'slave' servers are my 2 nginx server nodes (private IP). These IPs get a GRANT each so the websites can connect to the databases to serve the sites.
4. My 'database server node' is actually the floating IP of the multi-master mysql cluster, as it's this IP that needs to be set in the settings.php for the sites. So I don't have both database servers in Aegir, just one node with the floating IP's address. (it's up to mysql replication to ensure the data is the same between each db server)
In short - the only thing that matters about IP addresses is:
a) That the hostmaster server can SSH to the slaves on that IP in order to rsync the data
b) Those IPs get used as the GRANT statements - so that IP needs to be able to make SQL connections to your database server. e.g firewalls etc, if you want to use public IPs, need to allow all this connectivity. In my case I keep it all on the private LAN - external public connections are handled at the loadbalancer on public IP, and sent inbound to the private IPs of the nginx servers.
c) It seems crucial to ensure the webserver and database servers are actually *different* servers. Not just different IPs. Because the 'mysql' command might use the wrong local IP to connect to the database server, e.g the IP that is *not* used by the webserver/drupal for making connections to the database, and that will break your GRANT if that webserver has to start making connections to that database server.
In this setup, the 'Master' server is kind of like a 'stealth' or 'hidden' server in DNS-speak. It is the master of the data, but it's not part of the loadbalanced cluster, it really just provides the data as an NFS server and rsyncing out config, but the sites are never served by this server (though it of course could be, if I added it to the loadbalancer)
Very bad diagram attached.
Comment #19
anarcat commentedThanks for reporting this issue... There's a lot of comments and feedback here, which is a bit hard to parse, can we get a summary as per https://drupal.org/node/1155816?
Comment #20
Anonymous (not verified) commentedI've added a summary above. Note that there are probably infact two separate issues here. I unfortunately hijacked Brian's ticket thinking I was reproducing, whereas in fact I accidentally had stumbled on the edge case explained as 'Problem 2'. We probably need two separate tickets but not sure it's too late or not.
Comment #20.0
Anonymous (not verified) commentedissue summary
Comment #20.1
Anonymous (not verified) commentedupdate to summary
Comment #21
Anonymous (not verified) commentedBrian,
Can you please try this patch and see if it fixes your case?
All I am doing is loading the array of 'master_web_servers' in the return of 'grant_server_list()' instead of just the 'slave_web_servers'.
After applying, try install, migrate or clone a site.
I've not properly tested this patch since our environments differ, so please take all necessary precautions :) However in my tests, I do see my 'master' server appear twice in the array (because it is also my Aegir hostmaster server - which is then filtered down to one thanks to the existing array_unique() in db/db.drush.inc ). - which in my case is what I expected.
If you retain your existing dlm() in db/mysql/mysql.service.inc, you should end up seeing something like:
Array ( [0] => ERROR 1045 (28000): Access denied for user 'intntnllyInvalid'@'whiskey-rs01.vmdoh.com' (using password: NO) )
Array ( [0] => ERROR 1045 (28000): Access denied for user 'intntnllyInvalid'@'whiskey-rs02.vmdoh.com' (using password: NO) )
Array ( [0] => ERROR 1045 (28000): Access denied for user 'intntnllyInvalid'@'10.179.6.236' (using password: NO) )
Let me know how you go.
Comment #22
Anonymous (not verified) commentedWorks perfectly for me. I double checked directly on the database to ensure that the grants were actually there. This was tested using an identical setup as I described above (Puppet is a beautiful thing).
Thanks!
Comment #22.0
Anonymous (not verified) commentedupdate summary
Comment #22.1
Anonymous (not verified) commentedremove second problem and update first with the fact that it was fixed.
Comment #23
Anonymous (not verified) commentedThanks for testing, Brian!
I've committed this fix to 6.x-1.x and 6.x-2.x.
I'm actually going to close this ticket and open a new one, since I inadvertently hijacked it with a separate corner case.
New ticket is #1576588: MySQL GRANT test can return an incorrect IP address for webserver if it has more than one IP on same LAN
Comment #25
doka commentedThe patch works fine if you have the mysql server installed on the slaves. But if it is not the case, the "Dummy connection failed to fail" error is again there, for each slave in the pack having no mysql server installed, and the site install is broken.
I reopen this bug, since the proposed use case for web_pack is that the slave server should not have DB, it's intended place is on the master server.
Comment #26
Anonymous (not verified) commentedThis is curious, I set this up on a fresh 1.9 system last week and didn't reproduce it.
The 'Dummy connection failed to fail' message is sent as the result of output fetched from your chosen database server. e.g it can't just come 'from' the slaves when they don't have mysql installed on it - the error is coming from a database server *somewhere*.
Can we get a copy of the full task log?
Also, is there any chance you're running Ubuntu 12.04 on your database server?
Comment #27
doka commentedAegir is running on Ubuntu 12.04, but MySQL is configured with
mysql_secure_installation, and also the bind-address is commented out in /etc/mysql/my.cnfMy log:
I would expect not the "localhost" in
mysql -u intntnllyInvalid -h 'localhost' -P '3306' -e "SELECT VERSION\(\)"since this command is on the slave, having no database. But if I manually execute
mysql -u intntnllyInvalid -h '<MASTER_IP>' -P '3306' -e "SELECT VERSION\(\)"then the expected error code is there:
ERROR 1045 (28000): Access denied for user 'intntnllyInvalid'@'ws105.dmz'Edit: if the site is installed on the (Aegir) master server, then it works fine, even on Ubuntu 12.04, the issue is only here if I want to install a site on the pack.
Comment #28
doka commentedforgot to set the status...
Comment #29
Anonymous (not verified) commentedYou are correct in that the connection should not be to 'localhost'. I wonder if your database server is defined as '@server_localhost' in your site's drush alias (context). Can you look at your alias file in ~/.drush and see what the value of 'db_server' is? If it's '@server_localhost' then that's how this could occur.
Comment #30
doka commentedAt the masters .drush, in the server_localhost.alias.drushrc.php, I have the following line, beside db_port and db_service_type:
'master_db' => 'mysql://dbadmin:dbpwd@localhost',Do not see any line for db_server, however.
The ~/.drush is only at the master server, do not see at the slave. Not sure is it OK or not.
Comment #31
Anonymous (not verified) commentedmaster_db is different, that's for Aegir's own use.
You need a db_server attribute in your site alias, or else it will default to @server_localhost, which is what you're seeing.
How did you create the alias? If you created it manually with 'drush provision-save', you need to pass --db-server='@server_yourdbserver' as an argument.
Comment #32
doka commentedFirst I crearted it in Aegir, not manually, which resulted in the "Dummy connection ..." error. After that, during my trouble shooting, I had some manual attemts, as well. I'll do a fresh Aegir install, and report back the generated aliases.
Comment #33
doka commentedSo now I have a clean setup: an Aegir master with database, and one slave, without database server. Everything is generated over the Aegir frontend, and I have the site alias as follows:
And the error code, when I'm creating the www.example.com, is:
Comment #34
Anonymous (not verified) commentedAgain, though - if you have *any* remote webserver, you can't declare the database server as 'localhost' - as the remote webserver will treat 'localhost' as itself. Make sense?
In short: if you are using remote webservers, you need some form of 'remote database server' that all the remote webservers can connect to. The remote webservers cannot connect to the Aegir server's database server because by default it is a 'localhost-only' database server.
Comment #35
doka commentedOK, understood.
I was thinking that the database server for the remote webservers should be the Aegir server's database server. Now it's clear, there must be a separate database server. I'll make some hint in the documentation, as well.
Comment #35.0
doka commentedtypo