Elite:Dangerous - Systems Architecture

SirShootalot · Sep 30, 2014

Hi,

As a senior systems engineer / solutions architect I'm curious about the architecture behind the whole ED platform, how servers are connected globally, how many there are, how they've been spec'ed, physical or virtual, what OS, how spread they are, what kind of storage is used, backend DB technology, what load balancing measures have been put in place, etc.

Is this fully classified information? Any high level public facing documentation available?

Not sure if this is the only server/cluster for this part of the world but the IP/VIP I'm connected to from the UK is hosted on Amazon EC2. Are you using Amazon globally? Infrastructure as a Service model?

I realise this might be boring talk for most but I find it extremely interesting.

Cheers

Herne · Sep 30, 2014

I was under the impression that it uses peer to peer technology for the instancing.

Central server for the background economy simulation. That's as far as my understanding goes though, and even here i may be mistaken

RickDangerous · Sep 30, 2014

As another systems engineer I endorse this question

Would be great to know the answers, though they would be subject to change etc.

Iain M Norman · Sep 30, 2014

This is just what I've learned over time, from various sources, things FD have publically said and poking around.

It's quite a hybrid system.

Much of it is on AWS yes.

There are multiple game servers, which is the IPs you see in the client, these deal with setting up P2P islands and matching and the like. These are virtual machines and more are automatically instanced when more are needed.

Some of these game servers are dedicated to players from different regions of the world, although I think at the moment they are all in Dublin, but that could of course change easily I suspect.

In the network log the game servers have a name like WIN-UJO7F3XL804.

There are also the persistent world servers, there's only one endpoint for the API, but of course there could be load balanced clusters behind that.

SirShootalot · Sep 30, 2014

Iain M Norman said:
Some of these game servers are dedicated to players from different regions of the world, although I think at the moment they are all in Dublin, but that could of course change easily I suspect.

To my knowledge AWS only host from Ireland yeah when it comes to Europe.
I imagined there'd be some proxies in the Sates / Asia and SE Asia too though...

Not exactly sure how connectivity works though with regards to choosing Solo/Close Group or OP...My friend and I connected via CG, got on the same server, then left to go OP and there seems to be some session persistence going on since we found each other in OP too...For all I know that is just the load balanced VIP that everyone connects to though. Hard to tell.

fergal · Sep 30, 2014

The address I login to is:
55.195.145.163
Which is in Arizona

Seonid · Sep 30, 2014

fergal said:
The address I login to is:
55.195.145.163
Which is in Arizona

The registered address of that net-block is in AZ, doesn't mean that what the IP address ends up on is also there.

Stoo · Sep 30, 2014

Seonid said:
The registered address of that net-block is in AZ, doesn't mean that what the IP address ends up on is also there.

Correct. My office IP is registered in Germany, but last time I looked we were in Wiltshire

SirShootalot · Sep 30, 2014

Stoo said:
Correct. My office IP is registered in Germany, but last time I looked we were in Wiltshire

This awkward Truman Show moment...

VR Mr Teatime · Sep 30, 2014

I would expect FD to be using AWS in a PaaS sense with their own application/server code providing "clustering" for the persistent elements of the galaxy globally, P2P session handling and the elements they have had to move to CS (Stations). There is likely a central patch propagation and admin server somewhere behind those servers, which may or may not be in AWS (possibly two for resilience, if they are on premise).

For the database portion for markets etc. probably MySQL although might be MS SQL (iirc Eve uses the latter).

Specs is largely irrelevant these days, as you can throw as many "cores" and virtual RAM at each server as needed, but from observation it looks like there's roughly 1000 (possibly more now) players per server or presented IP in any case.

Player to player interactions are P2P using a bubble technique (there's a couple of guys at least on here that I know that know about this stuff in great detail).

SirShootalot · Sep 30, 2014

Mr Teatime said:
I would expect FD to be using AWS in a PaaS sense with their own application/server code providing "clustering" for the persistent elements of the galaxy globally, P2P session handling and the elements they have had to move to CS (Stations).

OK that would make sense.

Mr Teatime said:
There is likely a central patch propagation and admin server somewhere behind those servers, which may or may not be in AWS (possibly two for resilience, if they are on premise).

I'd expect 2 or even maybe 3 isolated environments (Dev, Test and Staging) for that, most likely on AWS too.

Mr Teatime said:
For the database portion for markets etc. probably MySQL although might be MS SQL (iirc Eve uses the latter).

OK. At the end of the day DB perf and availability mainly boils down to a good DB architecture and a good choice of storage. Both MySQL and MSSQL are really good IMO. I've always preferred the former (coming from a Linux background) but I have to say MS have come up with some pretty cool features with their latest versions of MSSQL Server 2k12.

Mr Teatime said:
Specs is largely irrelevant these days, as you can throw as many "cores" and virtual RAM at each server as needed

Agreed, but a minimum of capacity planning is required from the start.

Mr Teatime said:
but from observation it looks like there's roughly 1000 (possibly more now) players per server or presented IP in any case.

How did you observe that?

Mr Teatime said:
Player to player interactions are P2P using a bubble technique (there's a couple of guys at least on here that I know that know about this stuff in great detail).

I'm more of a systems than networking type of guys and this P2P malarkey has always baffled me. I'd love to hear more about it. My assumption though (correct me if I'm wrong) was that the more people connected via P2P the better the performance of the network. Based on that assumption, what do you make of the constant networking issues in Open Play, especially when a lot of players are online?

fergal · Sep 30, 2014

Seonid said:
The registered address of that net-block is in AZ, doesn't mean that what the IP address ends up on is also there.

Yeah, tracert doesn't really help me.
Its definitely not in Europe though, I'd say North America somewhere.

Anyone else get that address?

zamalek · Sep 30, 2014

What we know:

Actual instances are serverless.
Actual instances are P2P.
The world is not seamless (you can't travel in SC from one star to another).

Given that I would go with "megaserver meets P2P." You have many servers that retain important state for each instance of a system. I'd assume the total throughput required would be low, as the authoritative peer would be doing most of the work. However, due to the C10K problem you might find they simply have to provision more than one server per instance. Either way, "islands" are likely not much more than data access layers through to the backend DB.

Most of your client state is held in the memory of the clients, likely the authoritative peer (unless some sort of quorum algorithm is used between peers to mitigate cheating). Each individual client is responsible for reporting kills/deaths it sees. The first peer in a system would likely receive its initial state from the island (although it may simply generate the random state itself).

When a player moves to a system (i.e. connect to an island), they likely hit a central (likely single) server that selects which island to send them to. Basically a software NLB with more smarts than "random" or "round-robin."

These servers likely hold the majority of the information in memory and likely rarely directly interact with the database. In the event that they crash (hey, it's the cloud) it's probably not the end of the world if the players' positions are slightly out of date. Immediate DB access likely on occurs on "progress-related transactions" such as purchasing a ship. Thus, the performance (and hence the product) of the database is probably not a big factor.

In terms of the data itself, it's likely a combination of databases. A DB such as SQL is likely used for "row-like" data (credits, ship loadout, etc.), but stuff like island selection would run better against a graph/KV database and not a SQL database (friend, friend-of-friend queries etc.).

That's how I would lay it out, at least. The absolute core of this system is P2P and megaserver, though - the rest if supplementary fluff that can be solved in a million different ways.

bassman · Sep 30, 2014

SirShootalot said:
I'm more of a systems than networking type of guys and this P2P malarkey has always baffled me. I'd love to hear more about it. My assumption though (correct me if I'm wrong) was that the more people connected via P2P the better the performance of the network.

For data sharing type apps then more P2P peers increases performance and reliability via more redundancy. In a multiplayer game each peer represents unique state that all other peers need so the opposite is true - the probability of a consistency loss is a quadratic function of network size.

SirShootalot · Sep 30, 2014

zamalek said:
I'd assume the total throughput required would be low, as the authoritative peer would be doing most of the work. However, due to the C10K problem you might find they simply have to provision more than one server per instance.

Is that C10K problem a limitation in amount of concurrent sessions? Is that on a per instance basis?

zamalek said:
Immediate DB access likely on occurs on "progress-related transactions" such as purchasing a ship. Thus, the performance (and hence the product) of the database is probably not a big factor.

As in only saving changes since the last transaction? i.e. last docking/jump/drop/etc?

zamalek said:
In terms of the data itself, it's likely a combination of databases. A DB such as SQL is likely used for "row-like" data (credits, ship loadout, etc.), but stuff like island selection would run better against a graph/KV database and not a SQL database (friend, friend-of-friend queries etc.).

Are you into Software development by any chance?

What's the logical separation between Solo / Close Group and Open Play? Anyone know? (sorry relatively new to ED...)

bassman · Sep 30, 2014

zamalek said:
as the authoritative peer would be doing most of the work.

There is no single authoritative peer in an instance, I would post a link to one of the dev posts confirming this but having already done so half a dozen times on various thread I can no longer be bothered but you can find if you search for posts by Sandro Sammarco.

SirShootalot · Sep 30, 2014

bassman said:
In a multiplayer game each peer represents unique state that all other peers need so the opposite is true - the probability of a consistency loss is a quadratic function of network size.

Ok, still trying to digest that last bit

Quadra *cough* what..?? I'll have to google that one I'm afraid.

As for the rest of your sentence why choosing P2P as the method of choice if "the opposite is true". Will it organically scale in the long run?

Mazhurg · Sep 30, 2014

SirShootalot said:
Ok, still trying to digest that last bit
Quadra *cough* what..?? I'll have to google that one I'm afraid.

Something like that:

zamalek · Sep 30, 2014

SirShootalot said:
Is that C10K problem a limitation in amount of concurrent sessions? Is that on a per instance basis?

Yeah C10K indicates how many clients you can have to a server or process. It's actually a class of problems, they are now thinking all the way up to 1 million. Either way, both the architecture of the OS and process are important - I think Windows easily gets beyond C10K, but if you e.g. have a thread-per-client your process won't.

SirShootalot said:
As in only saving changes since the last transaction? i.e. last docking/jump/drop/etc?

Yeah, or at periodic intervals. The main point is that some compromise could be made to keep stuff fast and volatile (in RAM) and only periodically hit the DB.

SirShootalot said:
Are you into Software development by any chance?

Yep.

SirShootalot said:
What's the logical separation between Solo / Close Group and Open Play? Anyone know? (sorry relatively new to ED...)

Gameplay: Solo is a private universe just for you, grouped is a private universe for you and your friends, open play is accessible to everyone. The market and so forth is still shared across them (and you still influence it).

I think you just get your own island, or an island accessible only to specific players (ones that are in your group). This is one reason why I believe that islands are actually incredibly lightweight processes because they are presumably cheap to spin up.

bassman said:
There is no single authoritative peer in an instance

I'd actually like to read more about that! Concurrency is one of my favorite problems. *Gets searching.*

SirShootalot said:
Ok, still trying to digest that last bit
Quadra *cough* what..?? I'll have to google that one I'm afraid.

Consistency loss = disagreement between systems that are supposed to agree on some specific data being the same (but don't due to something like packet loss). This is where stuff like quorum comes in.

Quadratic = a type of exponential rate.

So 3 machines have an exponentially larger chance of incorrectly disagreeing on something over 2 machines, and so on.

SirShootalot said:
As for the rest of your sentence why choosing P2P as the method of choice if "the opposite is true". Will it organically scale in the long run?

I think the choice was made to avoid subscription fees. Probably little more than that, really.

Iain M Norman · Sep 30, 2014

SirShootalot said:
To my knowledge AWS only host from Ireland yeah when it comes to Europe.
I imagined there'd be some proxies in the Sates / Asia and SE Asia too though...

Not exactly sure how connectivity works though with regards to choosing Solo/Close Group or OP...My friend and I connected via CG, got on the same server, then left to go OP and there seems to be some session persistence going on since we found each other in OP too...For all I know that is just the load balanced VIP that everyone connects to though. Hard to tell.

The friend matching is supposed to work across game servers, so you shouldn't *have* to be on the same IP when you login in. Not sure how it works whether it finds you and moves you around, or whether whilst making p2p islands the game servers in question talk to each other. No idea.

Howard would know, he might interject if he's listening.

That matching hasn't been working quite well enough for some people, hopefully that'll change tonight.