With respect, I'd like to ask how these events keep making it through your CAB without some form of mitigation being put in place. If your servers can't handle the upcoming load, why are you not doing *something* about it (scale up, add connection limits, etc.) rather than just letting them crash?
If your runbooks are incomplete (e.g. Gnosis plus the following CG, where clearly someone forgot to make the required changes or didn't test them properly), make sure at least one set of the "two eyes" is up to the task of ensuring the runbooks are up to standard and haven't missed major steps. In the event of an outage, inform your customers (us) of what went wrong via an Incident Report or RCA. It's surprising how effective an open and honest IR/RCA can be in generating understanding from your customers. This normally leads to a desire to improve internally: once your customers have shown understanding, engineers generally feel more engaged and want to ensure they don't disappoint them again. It's simple human nature at play.
It's clear from the improvements found in the recent updates, and from your support desk, that you have great people on board. However, it's equally clear from all the major outages of late that your internal processes are severely lacking. Hopefully this is the last straw before implementing some much-needed improvements.