Frontier - It's time to implement a functional Change Control process.

StartledPancake · Jan 14, 2019

With respect, I'd like to ask how these events are continually making it through your CAB without some form of mitigation being put in place. If your servers can't handle the upcoming load, why are you not doing *something* about it (scale up, add connection limits, etc.) rather than just letting them crash?

If your runbooks are incomplete, e.g. Gnosis plus the following CG, where clearly someone forgot to make the required changes (or didn't test them properly), make sure at least one set of the "two eyes" is up to the task of ensuring the runbooks are up to standard and haven't missed major steps. In the event of an outage, inform your customers (us) of what went wrong, via an Incident Report or RCA. It's surprising how effective an open and honest IR/RCA can be in generating understanding from your customers. This then normally leads to a desire to improve internally, as once your customers have shown understanding, engineers generally feel more engaged to ensure they don't disappoint them again. It's simple human nature in play.

It's clear from the improvements found in the recent updates, and from your support desk, that you have great people on board. However, it's equally clear from all the major outages of late that your internal processes are severely lacking. Hopefully this is the last straw before implementing some much needed improvements.
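
[Editor's sketch] The "add connection limits" mitigation mentioned in this post could look roughly like the following. It is a minimal, single-threaded illustration of admission control and not a description of how Frontier's servers work; the `ConnectionLimiter` class, its method names, and the limit of 3 are all hypothetical.

```python
class ConnectionLimiter:
    """Admit at most `max_active` sessions and turn the rest away.

    Hypothetical helper: rejecting excess logins early is one way to
    degrade gracefully instead of letting an overloaded server crash.
    """

    def __init__(self, max_active: int) -> None:
        if max_active < 1:
            raise ValueError("max_active must be at least 1")
        self.max_active = max_active
        self.active = 0
        self.rejected = 0

    def try_admit(self) -> bool:
        """Return True if a new session was admitted, False if it was refused."""
        if self.active >= self.max_active:
            self.rejected += 1
            return False
        self.active += 1
        return True

    def release(self) -> None:
        """Free a slot when a session ends."""
        if self.active > 0:
            self.active -= 1


limiter = ConnectionLimiter(max_active=3)
results = [limiter.try_admit() for _ in range(5)]
print(results)           # [True, True, True, False, False]
print(limiter.rejected)  # 2
```

A refused client would typically be told to retry later, which is a worse experience than playing but arguably a better one than a crash for everybody.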

DragonIV · Jan 14, 2019

StartledPancake said:
Hopefully this is the last straw before implementing some much needed improvements.

With respect, don't hold your breath. It's been an issue, for this game at least, since the beginning.

Vedmo · Jan 14, 2019

Too many people tried to do the same thing and it broke the game, self evident and good enough for me.

IndigoWyrd · Jan 14, 2019

Hmm.. can’t quite grasp if this is a case of Didn’t Know the Servers are Amazon’s, not Frontier’s, or Doesn’t Understand The Elite Network Model, or just likes to throw around TLAs to sound cool.

Factabulous · Jan 14, 2019

Lol - OP sounds v.trad, we're all devops baby!

sollisb · Jan 14, 2019

StartledPancake said:
With respect, I'd like to ask how these events are continually making it through your CAB without some form of mitigation being put in place. If your servers can't handle the upcoming load, why are you not doing *something* about it (scale up, add connection limits, etc.) rather than just letting them crash?

If your runbooks are incomplete, e.g. Gnosis plus the following CG, where clearly someone forgot to make the required changes (or didn't test them properly), make sure at least one set of the "two eyes" is up to the task of ensuring the runbooks are up to standard and haven't missed major steps. In the event of an outage, inform your customers (us) of what went wrong, via an Incident Report or RCA. It's surprising how effective an open and honest IR/RCA can be in generating understanding from your customers. This then normally leads to a desire to improve internally, as once your customers have shown understanding, engineers generally feel more engaged to ensure they don't disappoint them again. It's simple human nature in play.

It's clear from the improvements found in the recent updates, and from your support desk, that you have great people on board. However, it's equally clear from all the major outages of late that your internal processes are severely lacking. Hopefully this is the last straw before implementing some much needed improvements.

This is the difference between large scale corporate systems and small-time game developers.

thistle · Jan 14, 2019

These threads take longer than the time it took Frontier to fix it.

CaptainCaboose · Jan 14, 2019

^^ This

The servers were back up less than 15 minutes after they crashed for the EU launch, and aside from 1 more crash, I was able to continue playing for the remainder of the evening without issues, even through the US launch.

During the US launch, servers didn't fail at all, we made the jump and then all the way out to Waypoint 1 without a single crash. Couple long load times when dropping into a system that was heavily player populated, but that was a non-issue for me.

Compared to last year, this launch was 500% smoother. We all knew we were going to crash the servers because we intentionally attempted to do so (forcing wings of players into an instance that was already full, overloading the instance by a factor of 10, then all jumping at the same instant to a new instance which we overloaded).

Taking all that into account, FD did a remarkable job bringing everything back online as fast as they did and I have no complaints.

StartledPancake · Jan 14, 2019

Vedmo said:
Too many people tried to do the same thing and it broke the game, self evident and good enough for me.

I do know that and it makes absolutely no difference to your Change Control process that your servers are remotely hosted. Unless you can magically explain otherwise? Maybe you think that AWS also set up a CAB for you?

sollisb · Jan 14, 2019

CaptainCaboose said:
^^ This

The servers were back up less than 15 minutes after they crashed for the EU launch

No, they were not.

IndigoWyrd · Jan 14, 2019

CaptainCaboose said:
^^ This

The servers were back up less than 15 minutes after they crashed for the EU launch, and aside from 1 more crash, I was able to continue playing for the remainder of the evening without issues, even through the US launch.

During the US launch, servers didn't fail at all, we made the jump and then all the way out to Waypoint 1 without a single crash. Couple long load times when dropping into a system that was heavily player populated, but that was a non-issue for me.

Compared to last year, this launch was 500% smoother. We all knew we were going to crash the servers because we intentionally attempted to do so (forcing wings of players into an instance that was already full, overloading the instance by a factor of 10, then all jumping at the same instant to a new instance which we overloaded).

Taking all that into account, FD did a remarkable job bringing everything back online as fast as they did and I have no complaints.

Unless fully endorsed and authorized by Frontier and possibly Amazon, you might not want to own up to “knowing this would crash the servers”. One sharp corporate lawyer later you could be collectively charged with CFAA violations.

Ian Doncaster · Jan 14, 2019

StartledPancake said:
I do know that and it makes absolutely no difference to your Change Control process that your servers are remotely hosted. Unless you can magically explain otherwise? Maybe you think that AWS also set up a CAB for you?

Sure, since we're talking about real-world testing of a massively concurrent application, rather than whatever neat packaged example your ITIL textbook had:

1) When it comes to load testing, getting a test environment that sensibly resembles the live environment is really difficult and generally highly expensive. I'm doing some development at the moment where the live system is an extremely expensive fileserver stack. Funnily enough, I can't get the budget approved for an identical test environment, so I'm doing the tests on a much smaller simulated stack. For most testing that's "equivalent enough" - but I can't do any useful load testing on it, because the license doesn't support anywhere near a live-like amount of data and users. Maybe it'll be okay, or maybe we'll get some load issues when it goes live. Very much depends on how many of the potential users actually use it heavily in practice.
2) When it comes to load testing of Elite Dangerous, getting the budget together for a live-sized test environment *and* ten thousand geographically distributed clients (it's an instancing problem - you can't use clients all in a neat lab or server farm for a realistic test) is probably a bit much. You can try to extrapolate from smaller tests but that risks missing exponential factors.
3) What does Change Control have to do with this anyway? There's no change taking place in the sense of an "alteration to a configuration item" simply due to the existence of Distant Worlds. There may well have been changes in that sense to spin up more servers on the Sunday night - and I'll note here that if something like that needs to go anywhere near CAB rather than having a Standard Change ready for it, someone loves their committee meetings way too much - but maintaining day-to-day performance of a live service is a Service Operations matter, not a Service Transition matter.
4) The player base would not appreciate having to have its internally-organised events pre-approved by CAB, and rejectable if it's felt that a prerequisite development to the instancing and server support needs to be done first. Sometimes you have to support a service with what you currently have, not with what you could develop in a few years. And the size of DWE I think has surprised everyone - it was about 4,000 signups before Frontier started publicising it ... now it's about 12,000. Any planning for the prior size may have needed some very short-notice changes. (And sure, ECAB, you can do those ... but the quality of your Service Management process can't add more hours to the day or developers to your project)
5) Incident Management - the correct process for last night's scenario - is about getting service back up and running as quickly as possible. They did. As a result of analysis under the Problem Management process, there may later be changes raised and discussed at CAB - or it may be decided that incidents caused by multi-thousand player mass jumps are sufficiently rare that it can just be logged as a Known Error and other work prioritised. That might be a CAB decision, but it's more likely to belong in Service Strategy and Service Design.
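
[Editor's sketch] Point 2 above warns that extrapolating from smaller tests can miss non-linear factors. A rough way to see why: the thread does not say how Elite's instancing cost actually scales, so the example below simply assumes, for illustration, that every client in a group keeps a link to every other client. The function names and the 100 and 10,000 client counts are hypothetical.

```python
def pairwise_links(clients: int) -> int:
    """Links needed if every client talks to every other client (assumed model)."""
    return clients * (clients - 1) // 2


def linear_extrapolation(small_clients: int, small_cost: float, target_clients: int) -> float:
    """Naively scale a small test result in proportion to the client count."""
    return small_cost * target_clients / small_clients


small, target = 100, 10_000

measured = pairwise_links(small)                             # cost seen in the small test
predicted = linear_extrapolation(small, measured, target)    # naive forecast
actual = pairwise_links(target)                              # cost under the assumed model

print(measured)            # 4950
print(predicted)           # 495000.0
print(actual)              # 49995000
print(actual / predicted)  # 101.0
```

Under this assumed quadratic model, a forecast built from a 100-client test underestimates the 10,000-client load by roughly a factor of 100, which is the kind of gap a lab-sized test would not reveal.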

Hanerib · Jan 14, 2019

I honestly think it's a conspiracy so that young people would go out more and be productive. Video games were so much better 15 years ago, now I struggle to find anything to spend some of my free time with. Had to try one of those new conflict zones today, but there's no way my fancy new Chieftain can handle a wing of four elite NPCs. But then, I haven't got anything else installed either, because I keep deleting everything after 30 minutes of trying. And Stardew Valley makes me feel depressed. Is this some sort of an age crisis? Is 30 middle aged? Oh god...

Rhaedas · Jan 14, 2019

sollisb said:
No, they were not.

True, I was back mining in 10 mins.

Vedmo · Jan 14, 2019

StartledPancake said:
I do know that and it makes absolutely no difference to your Change Control process that your servers are remotely hosted. Unless you can magically explain otherwise? Maybe you think that AWS also set up a CAB for you?

My apologies for not being clear. I have no clue what a Change Control process, AWS, or CAB are, hence my viewpoint that too much stuff happened and the game broke, and why that's a sufficient explanation for me.

Andovar · Jan 14, 2019

Change control process... CAB meetings... ugh.
I come to these forums during work hours to forget about work, not be reminded of it!
...killin me.

Gypsy12 · Jan 14, 2019

Doing my CCNA3... sorry, but good work by FD.

[VR] M4st0d0n · Jan 14, 2019

thistle said:
These threads take longer than the time it took Frontier to fix it.

There's no fix. Everyone crashed. Servers rebooted. Architecture still can't support 10000 peeps jumping together...

So, yeah, obviously.

Gypsy12 · Jan 14, 2019

But, I got my suspicions they are running IPv4 as a home network. It does not carry over to IPv6 well. The sheer horror of translating IPv6 keeps a lot of companies in the dark these days. You have to use tunnelling. Which is a real bone of contention. And then, Amazon? as a tech firm? I worked for them. Forget automated drones and all that. Since they got stroppy with Google, their maps are 4 years old. Still have a vision of drones flying through offices that don't even EXIST! As an ex delivery guy. OH NO...

Dillon Fallon · Jan 14, 2019

Whilst I'm happy to criticise some things in the game (bugs being one and if I hear anything about an Alliance Centurion... Seriously, please don't), this seems to be like asking the government to stick an extra 2 lanes on a motorway for a couple of days.

I'm not sure what changes could have been made to the system that wouldn't be considered prohibitively expensive given what later demand will settle back to. How long did the issue last for? How often do these events happen?