Frontier - Its time to implement a functional Change Control process.

[VR] M4st0d0n · Jan 14, 2019

Deejayeff145 said:
Whilst I'm happy to criticise some things in the game (bugs being one and if I hear anything about an Alliance Centurion... Seriously, please don't), this seems to be like asking the government to stick an extra 2 lanes on a motorway for a couple of days.

I'm not sure what changes could have been made to the system that wouldn't be considered prohibitively expensive given what later demand will settle back to. How long did the issue last for? How often do these events happen?

They could have spend the extra two cents just to not look stupid for once. Event was planned since long.

Dillon Fallon · Jan 14, 2019

M4st0d0n said:
They could have spend the extra two cents just to not look stupid for once. Event was planned since long.

Doesn't look good on them, I'll agree there. Although, if (and it is a big 'if') the rest of the event runs smoothly and everyone enjoys hopping around the galaxy for half a year, how much of this will be remembered by those who took part?

Maybe that's the risk they're willing to take. This might be what you have to compromise on when you're Frontier and not Blizzard.

Flimley · Jan 14, 2019

I am Flimley

sollisb · Jan 14, 2019

Rhaedas said:
True, I was back mining in 10 mins.

Servers went belly up at approx 8:04pm, some 25 minutes later (I checked my post) I offered (in jest) to drive around to the Amazon Data site (2 miles from me) and reboot the servers...

Get's keep it real.. I can post the logs if you wish to dispute.

EDIT: Servers went down approx 8:03pm and cam back Approx 20:44. I was trying constantly to get back on.

robbyp · Jan 14, 2019

Flimley, I have always noticed you need to keep on reminding yourself of your name, due to your signature. Now it's confirmed!

Flimley · Jan 14, 2019

robbyp said:
Flimley, I have always noticed you need to keep on reminding yourself of your name, due to your signature. Now it's confirmed!

Always best to double check these things.

Flimley

Arklight · Jan 14, 2019

Andovar said:
Change control process... CAB meetings... ugh.
I come to these forums during work hours to forget about work, not be reminded of it!
...killin me.

AMEN!

I don't need to hear about AWS and Change Control here. LOL Bad touch! Bad touch!

Rhaedas · Jan 14, 2019

sollisb said:
Servers went belly up at approx 8:04pm, some 25 minutes later (I checked my post) I offered (in jest) to drive around to the Amazon Data site (2 miles from me) and reboot the servers...

Get's keep it real.. I can post the logs if you wish to dispute.

EDIT: Servers went down approx 8:03pm and cam back Approx 20:44. I was trying constantly to get back on.

Sorry about your experience. Sure felt like ten mins to me. I have no idea which log to look at to be sure, not that it really matters in the big picture. Outages are rarely all or nothing, maybe I just got lucky.

sollisb · Jan 14, 2019

Rhaedas said:
Sorry about your experience. Sure felt like ten mins to me. I have no idea which log to look at to be sure, not that it really matters in the big picture. Outages are rarely all or nothing, maybe I just got lucky.

You can check your Journal logs

Rhaedas · Jan 15, 2019

You could be correct. I have a gap in time between right before 8PM and 8:42 for the next startup. Weird thing is that I don't see before that shutdown any headers for the cracked asteroid that I was at both before and after. Perhaps that doesn't get written right away?

Anyway, it didn't feel like a huge gap to me at the time. I wish Reddit had hard timestamps on posts, because I posted something right after it and then posted about being chased away by pirates from my new login right after that.

It wasn't the worst downtime I've ever had in the game.

Deunan · Jan 15, 2019

Rhaedas said:
True, I was back mining in 10 mins.

I couldn't get a stable connection for almost an hour, which is consistent with what FD stated about when server stability was restored. I'm more concerned that it happened at all, although not particularly concerned in general. It does seem that whatever server architecture their using can't handle a huge number of log ins at once as was the case also when the last update went live.

StartledPancake · Jan 16, 2019

Ian Doncaster said:
Sure, since we're talking about real-world testing of a massively concurrent application, rather than whatever neat packaged example your ITIL textbook had:

1) When it comes to load testing, getting a test environment that sensibly resembles the live environment is really difficult and generally highly expensive. I'm doing some development at the moment where the live system is an extremely expensive fileserver stack. Funnily enough, I can't get the budget approved for an identical test environment, so I'm doing the tests on a much smaller simulated stack. For most testing that's "equivalent enough" - but I can't do any useful load testing on it, because the license doesn't support anywhere near a live-like amount of data and users. Maybe it'll be okay, or maybe we'll get some load issues when it goes live. Very much depends on how many of the potential users actually use it heavily in practice.
2) When it comes to load testing of Elite Dangerous, getting the budget together for a live-sized test environment *and* ten thousand geographically distributed clients (it's an instancing problem - you can't use clients all in a neat lab or server farm for a realistic test) is probably a bit much. You can try to extrapolate from smaller tests but that risks missing exponential factors.
3) What does Change Control have to do with this anyway? There's no change taking place in the sense of an "alteration to a configuration item" simply due to the existence of Distant Worlds. There may well have been changes in that sense to spin up more servers on the Sunday night - and I'll note here that if something like that needs to go anywhere near CAB rather than having a Standard Change ready for it, someone loves their committee meetings way too much - but maintaining day-to-day performance of a live service is a Service Operations matter, not a Service Transition matter.
4) The player base would not appreciate having to have its internally-organised events pre-approved by CAB, and rejectable if it's felt that a prerequisite development to the instancing and server support needs to be done first. Sometimes you have to support a service with what you currently have, not with what you could develop in a few years. And the size of DWE I think has surprised everyone - it was about 4,000 signups before Frontier started publicising it ... now it's about 12,000. Any planning for the prior size may have needed some very short-notice changes. (And sure, ECAB, you can do those ... but the quality of your Service Management process can't add more hours to the day or developers to your project)
5) Incident Management - the correct process for last night's scenario - is about getting service back up and running as quickly as possible. They did. There may be as a result of analysis under the Problem Management process later be changes raised and discussed at CAB - or it may be decided that incidents caused by multi-thousand player mass jumps are sufficiently rare that it can just be logged as a Known Error and other work prioritised. That might be a CAB decision, but it's more likely to belong in Service Strategy and Service Design.

An interesting post and you bring up many good points, I suspect we agree more than you think we do.

1 and 2) As in this case FD have years of load data to mine, load testing *may well* not be required if there are sufficiently few changes or predictable to the code. As you've stated testing environments are very rarely sized to production, instead you extrapolate from the test environment based on the ratio of load to capacity and build a picture of your expected results in full production. Spinning up the extra infra for these tests is really the problem in cloud or virtualized environments (as long as you can run your tests in off hours), as you say its generating the client load that's tricky. However this is not a new or an unsolved problem, its handled by thousands of companies every day, it's just the software to do it is often expensive and takes considerable expertise to operate.

3) That depends on how you run your CAB. I and at least some others would look at a change as "A change in configuration or a significant change in the operation of the CI". My quote, but I firmly believe you should use a process to do good, not to avoid potential problems by hiding behind procedures or dogma. A non standard event of 12K users jumping at once would fit the bill here, it led to a crash of the farm and the event changed the operational state of the CI.

4-5) Not really sure of your point here is here, crashing under load seems to be a persistent *Problem* for FD and working with (not shutting down) DW2 was covered in my post, which you seem to have not noticed. I very much hope they aren't just closing the incident and each time stating it was a one off. You seem to be forgetting DW2 is planning mass jumps for the coming MONTHS.

I'll say you are very lenient on Frontier overall and I'm not sure that's healthy for them in the end. Their internally processes are clearly failing repeatedly (ObsidianAnt agrees) and need to be fixed or they risk bleeding off significant numbers of players (who are paying customers, not sure why gamers are so reluctant to see themselves as as such). The DW2 jump wasn't so disappointing because most people were kicked out for an hour or so, but because it was all so predictable. Plenty of failing organisations (check FD financial results before typing

have been saved by customers demanding they pull their finger out and fix structural problems.

Ill turn this around

- what are you suggestions to address this matter, or do you believe nothing needs to be done?

StartledPancake · Jan 16, 2019

Sorry, probably should have tagged - NSFAWWIIT

Ian Doncaster · Jan 17, 2019

StartledPancake said:
An interesting post and you bring up many good points, I suspect we agree more than you think we do.

Well, I think we probably both agree on the value of good service management, and no organisation will ever be in the position where there's nothing to improve.

StartledPancake said:
3) That depends on how you run your CAB. I and at least some others would look at a change as "A change in configuration or a significant change in the operation of the CI".

That's fair - personally I think this sort of load issue should be considered primarily at a bigger-picture level than CAB in the Design/Strategy stages - keep CAB focused on planned changes rather than external events. (The two meetings will likely have similar attendee lists, but probably not identical)

StartledPancake said:
Ill turn this around - what are you suggestions to address this matter, or do you believe nothing needs to be done?

While I've done process analysis and problem management, I wouldn't want to try doing it remotely on a company where I don't know any of the internal workings or technological details.

I don't think it's so much that I'm intrinsically lenient, as that the evidence I see suggests that Frontier do understand the problem, are moving in the right direction, and do learn from their previous failures [1]. But with a complex problem (or indeed Problem) there may be multiple causes needing to be dealt with before the user-facing symptoms are all sorted.

It is worth remembering that DWE2 is the largest ever expedition by a factor of about 10x on paper and probably around 4-5x in terms of signups who actually set off. That makes it ~100x bigger on paper / 40-50x bigger in practice than any *recent* expedition. And back in 2.0, the original 100-ship mass jump on DWE1 nearly broke the servers and got the organisers a "very nice! please don't do it again" from Frontier.

By mid 2.1, we were regularly having 50-80 meetups on planets as part of expeditions, and finishing them off with mass jumps, and we didn't break any servers at all (and nor did we ask Frontier for special extra servers in advance - it just worked!). There were major instancing stability improvements in 2.1 and 2.2, especially for the expedition case (i.e. no NPCs)

DWE2, though - that was over 1,000 ships mass-jumping (how much over, who knows?). That's at least 10x bigger than the previous mass jump record (and probably higher, even split over three platforms and timezones) and I suspect the "instancing problem" has at least some O(n^2) complexity [2]. That's exactly the sort of situation where doing load testing with a combination of test environments and extrapolation from previous performance is going to be highly unreliable and basically it comes down to someone making an educated guess.

Mass jumps are a particularly tough case, too - 1000 people drifting in to a system over a few hours and getting instanced up is one thing. 1000 people all making simultaneous instancing requests to the same supercruise location? That's going to be much bigger ... but how much bigger?

If DWE2 had been similar in size to DWE1, and broken the servers, that would have been very bad and I'd be a lot more critical. But even here there was evidence of improvements on previously:
- the Gnosis and the 3.3 launch day had serious instancing problems without even the sharp spike of a mass jump. Until the mass jump, instancing was basically working for the DWE2 launch, despite comparable numbers of players in-system to the Gnosis, and probably more of them simultaneously online.
- there were serious server issues following 3.3 launch (especially the new comms server freezes for system/squadron chats) which seemed to have been (correctly) a big priority for Frontier to fix before DWE2 launch.
- the US/Oceania mass-jumps didn't have the same issues. They may have been a bit smaller, but still, Frontier seem to have used the data from the Europe launch to feed their scaling decisions pretty quickly.

My prediction is that now Frontier have good data on the load caused by really big mass jumps they should be able to handle all future scheduled DWE2 mass jumps without wholesale service disruption. Whereas if you're right and their processes are inadequate to manage this sort of event, this will probably be a regular occurrence throughout the expedition.

I guess we should have a good idea of which it is in a few weeks?

[1] These are not things I would always have said. It was a continuous process, of course, but I went from feeling that Frontier didn't really get what "multiplayer Elite" implied when they released it (no-one else did either, of course) ... to having a good understanding of it around 2.2ish ... to actually being able to implement some of that understanding in the Beyond releases. If you look back at my posts from years back I was a lot more critical then and I think that's because they did more things to be critical *of* back then.

[2] Which probably means there's a size of mass jump that is literally impossible for Frontier to support. We're not there yet, but don't get your hopes up for it working smoothly if DWE3 is ten times bigger again