If you try to create an SSL site with a new certificate and you're out of IPs in the frontend, the install still goes ahead (bug #1, in the frontend) and then provision tries to install the site. It creates the vhost and complains there are no IP left, and aborts, but doesn't remove the vhost file, so the site effectively takes over some IP anyways.
Maybe the latter happened only on retry of the install task. Stupid me I also trashed the site node and lost the logs, but I believe this should be reproduceable with this:
1. setup an SSL server with 2 IPs
2. install a site with a new certificate, this takes up IP #1
3. install a site with a new certificate, this takes up IP #2
4. install a site with a new certificate, this should fail but *still* take up IP #2. retry the task if that doesn't work.
The symptom is that site created at step 3 uses certificate created at step 4, which obviously fails certification.
Comments
Comment #1
j0nathan CreditAttribution: j0nathan commentedSubscribing.
Comment #2
anarcat CreditAttribution: anarcat commentedThe problem is worst than that - if two sites are installed simultaneously, they can both steal the same IP from the server.
Comment #3
anarcat CreditAttribution: anarcat commentedOkay, I think I understand the problem better here.
The issue is that the IPs are allocated in the backend, and fairly deep - it's done in the config_data() hook.
(That's one hook that I'm sure our documentation team will have a lot of fun documenting, but that basically gets called when the Context (in this case the site) gets initialized in provision. Oh, and the actual function doesn't get detected by the API module, it's in http.ssl.inc:45, provisionService_http_ssl::config_data - just to complicate matters here.)
So the first issue is that it's unclear exactly *when* that code gets called, so it's obviously hard to rollback from there.
The other problem is that the way the IP is taken is non-atomic - we glob for the pattern (using get_certificate_ip()) and then if it's "okay", we touch a file (in get_ip_certificate(), to avoid any confusion ;). Those two functions should at least be merged so that the lock file would be created in exclusive mode (which fails if the file exists).
So in other words, that's very early (or at least we don't clearly know *where*) in the code sequence so rollbacking is hard, and there's absolutely no locking to ensure that only one site takes an IP.
I would suggest to re-engineer this whole thing and move the IP allocation upwards to the frontend, and make the backend just *obey*: "use certificate X, on IP Y and STFU". This is a typical case of the backend trying to be too smart for its own good without any way of giving proper feedback to the user early enough - once the site is being installed, it's too late for safety checks.
Comment #4
anarcat CreditAttribution: anarcat commentedSo short term solution:
1. merge get_certificate_ip() and get_ip_certificate() in one atomic function
2. fix rollback
That will still make those sites break pretty badly and unexplicably, but it's what this issue really is about.
Long term solution: move IP allocation to the frontend and make the backend just take that.
Comment #5
anarcat CreditAttribution: anarcat commentedHere's a backtrace to diagnose the rollback. This happens in the "do" hook (provision_install, between pre and post).
Comment #6
anarcat CreditAttribution: anarcat commentedThe actual issue described here has been fixed in ab78897d7ec881bb75f39fe4e915b9b1d4119217 - it was just a rollback hook missing in http.
The other issue - well, it's hairy, so I've opened #1126638: two sites can take the same server IP for SSL and #1126640: move the SSL IP allocation to the frontend.