Star Citizen Discussion Thread v12


Bearded-CIG said:
Hello! For those that don't know me: I'm one of the server admin and observability engineers at CIG.

The 'what' isn't something we have an answer for, yet. But I do have a bug in for the fact that it's happening and it's being investigated.

What I do know is that the shards are getting into some kind of bad state and when that happens, it stops processing the join queue for them. We have some graphs that help us identify them after they get stuck in this state. When they do, we manually isolate them and have to stow them and see if they recover. If they recover, great! If not, we replace them with a working shard while permanently removing the bad one. We've also gathered some debug info such as a Linux coredump, server logs, etc. that have been added to the bug to aid with the investigation so we don't have to worry about removing the bad shard causing any needed debug info to go missing.

Of course the next logical question that comes after this is: Why are people put into a join queue for a shard that is broken? Why does a person have to manually isolate these shards and replace them with a working one in order for people to play the game? The answer for that is: We have to know what is wrong before we can programmatically detect it and avoid matching people into it (notice earlier I said that our graphs are currently able to detect after the shard gets in this state but not when). We also need to make sure that this detection is observable to a human engineer. Observability is important because in the event that we do programmatically isolate an affected shard, but that isolation isn't observable, that makes it possible for multiple shards to be broken without us knowing it which, in turn, would make us run out of servers for players to join. Realistically, that scenario shouldn't ever happen because our matchmaker already has logic built into it that shows us when a shard is isolated from matchmaking, so the observability work is already done and any new isolation logic just has to be hooked into it. So really, we just need to be able to figure out why they're getting into this bad state before we can isolate them.
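(Illustration only, not from the post and not CIG's actual tooling: a minimal Python sketch of the idea described above, where any automatic isolation goes through the same place the matchmaker already reports isolated shards, so a human can always see what was pulled and why. Every name here, including `Shard`, `Matchmaker`, `queue_stalled` and the 120-second threshold, is a hypothetical assumption. The check only looks at the symptom, a join queue that has stopped advancing, so like the graphs mentioned in the post it can only fire after the shard is already stuck.)

```python
from dataclasses import dataclass, field


@dataclass
class Shard:
    name: str
    queue_depth: int = 0          # players currently waiting to join (assumed metric)
    last_dequeue_ts: float = 0.0  # last time the join queue advanced (assumed metric)


@dataclass
class Matchmaker:
    shards: dict[str, Shard] = field(default_factory=dict)
    isolated: dict[str, str] = field(default_factory=dict)  # shard name -> reason

    def add(self, shard: Shard) -> None:
        self.shards[shard.name] = shard

    def isolate(self, name: str, reason: str) -> None:
        # Recording the reason keeps every isolation visible to an engineer.
        self.isolated[name] = reason

    def eligible(self) -> list[str]:
        # Shards that players may still be matched into.
        return sorted(n for n in self.shards if n not in self.isolated)


def queue_stalled(shard: Shard, now: float, stall_after: float = 120.0) -> bool:
    # Hypothetical symptom check: players are waiting, but nothing has been
    # dequeued for longer than the threshold.
    return shard.queue_depth > 0 and (now - shard.last_dequeue_ts) > stall_after


def sweep(mm: Matchmaker, now: float, stall_after: float = 120.0) -> list[str]:
    # Isolate every eligible shard whose join queue looks stalled.
    newly_isolated = []
    for name in mm.eligible():
        if queue_stalled(mm.shards[name], now, stall_after):
            mm.isolate(name, "join queue stalled")
            newly_isolated.append(name)
    return newly_isolated
```

(Because `sweep` only ever calls `isolate`, the `isolated` mapping stays the single list of what has been removed from matchmaking, which is the "hooked into it" part of the paragraph above.)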

Since we aren't yet able to observe exactly when the issue happens and can only currently identify the issue after it's affected players, it's possible for a bunch of shards to break and then only become obvious that they're having issues when the matchmaker tries to put players in them. If only a few shards are affected, it doesn't take all that long for us to help the environment recover but the more there are that have been affected, the longer it will take for things to work again.

Another question that I could see someone wondering is: Why can this issue only be detected after it happens rather than when it happens? This is pretty normal to have happen in video games. This is the third MMO that I've worked on and I don't even remember how many games I've worked on. While developers do try their best to predict the future and make logging events to help investigate issues that arise, there is always inevitably an issue that arises which needs additional debug information to be added in order to find out how to fix it.

So how is the investigation going? Well, we don't see this issue happen on the Public Test Universe (at least, we haven't yet) so it may only be an issue that we see happen at higher player counts. That's going to make the process of iterative fixes slower because we need to be safer with what kinds of changes we make on the public environment. New debugging info has been added to aid in the investigation of this issue, but it has not been hotfixed into the public environment. That hasn't been ruled out but we have to gauge the risk for that carefully before doing so. If we decide to hotfix the extra debug into the public environment, that will let us have a faster turnaround for knowing if we need more debug, or if there are fixes we can make based on the results of that debug info. If not, then we'll have to wait for the debug code to go through the PTU (which it currently is) and make its way to the public environment before we can continue with the investigation.


He really could have stopped at:

'What I do know is that the shards are getting into some kind of bad state...'
 
Since we aren't yet able to observe exactly when the issue happens and can only currently identify the issue after it's affected players, it's possible for a bunch of shards to break and then only become obvious that they're having issues when the matchmaker tries to put players in them.

[Image attachment: 1748982714008.png]
 
It's cool, he's stabilised the definition of 'year of stability' now:

Think of it like a car. Stability is how likely it is to flip over and crash. Performance is how well it can accelerate, corner without flipping over and crashing, etc. Usability is how well the driver can interact with the car's various features.

The top level aim of 2025 is to ensure a stationary car doesn't crash ;)

Stability can be a bit of an abstract concept without taking the time to define it, as the definition can change from one person to the next. The internal definition that we use for what is considered a stability issue is pretty narrow and refers specifically to crashes and disconnections. I'm the one that established both the internal definition for it, as well as the data model that we use for how we measure server and client stability in reference to crashes and disconnections. That data model is then used to aid producers in decision making for if we should be prioritizing fixing crashes and disconnections, or if developers should be focusing on gameplay issues such as what you describe in your question.

Think of it like a car. Stability is how likely it is to flip over and crash. Performance is how well it can accelerate, corner without flipping over and crashing, etc. Usability is how well the driver can interact with the car's various features.

So are things more stable in the year of stability? Yes, but that's not really what you were asking about so it doesn't help if I focus on that.

You're concerned about the frustrating experiences that are occurring due to gameplay bugs and performance issues. I can't really comment on the prioritization of the gameplay issues, as that's not the area that I'm in charge of. Performance I can comment on a bit since the team I'm on helps out with that process. Fixes for performance issues are iterative. It takes a lot of slow and steady work over time. We do performance captures for every single PTU build and public environment multiple times to provide our developers with the information they need to investigate and implement performance fixes. However, fixing performance issues is much like peeling back layers of an onion. When one performance issue is fixed, the next worst one usually rises up to take its place and limits the efficacy of the fix that went into place. Sometimes getting big performance gains requires new tech or complete rewrites of old systems and there's definitely some of that still going on. But even those new systems are going to have their own iterative performance work that has to happen.
 
What is the player supposed to have to do to get the eggs?

Serious answer?

Hope lots of keycards don't despawn...

(And that the invisible medbay wars don't attrit them psychologically ;))

They can request their Idris chef to do them 'procedurally' ;)
 
The sub has tilted grumpy again.

The main themes are:

CIG somehow forgetting to follow through on their $$ damage control...

Source: https://www.reddit.com/r/starcitizen/comments/1l2x3zh/kind_reminder_to_cig_to_not_forget_adding_bomb/


Grindy content made worse:

Hey CIG, we're more than your "free game testers". The scrip changes are awful.

And PvP ruining everything. (They can't bring themselves to say 'P2W' on the Idris armadas camping every location though ;))
 