Welcome to some more Star Citizen. There has been more talk about server issues and 30k disconnections, with CIG’s Clive Johnson giving us a load of info on PU server stability and more.
When will the servers be stable? At the end of beta.
Why not before? Because we need to finish making the rest of the game first.
When a game is being worked on as a closed alpha the focus is on feature and content development. Stability and bug fixing take a back seat and only issues that would hinder further development are addressed. This may sound like an unprofessional way of doing things but the idea behind it is to try out ideas as quickly and cheaply as possible. That allows the developers to find out which parts of the game’s design work and which need revisiting. There is no point spending time bug fixing a feature that may change or even be completely pulled from the game at any time. Development will continue with the game in this semi-broken state at least until all features and content have been locked down. The game then enters the beta phase of development where bug fixing, optimisation, balance and polish are at the forefront. Ideally no feature work happens during beta but there are almost always some last-minute changes pushed in.
SC of course is open development, so while the focus in alpha is still on trying out different ideas, we need the game to be stable and functional enough for backers to test it and give their feedback. The key word there is “enough” which of course does not mean perfect. It is important that we strike the right balance between bug fixing and further development: too much bug fixing and development slows, too little and we don’t get enough feedback or the bugs hinder further development.
Has CIG got the right balance between bug fixing and development?
The problem with determining whether a build is stable “enough” is that we can only look at how stability affects the playerbase as a whole, i.e. the average. There will therefore be some lucky backers who experience far fewer crashes or other problems than average while there will be some poor souls for whom the build appears a bug-ridden crash fest. Ask the lucky players if we have the balance right and they might say, no the game is stable enough and we need to focus more on expanding the game. Ask the unlucky ones and they might still say no but want us to stop working on new features until all the current bugs are fixed. Very few people are going to say yes.
As a rule of thumb, before releasing a patch to Live, we try to make sure it is at least as stable as the previous Live release. Some patches may be more or less stable for particular play styles than previous ones but, overall, stability should get better from patch to patch. Of course sometimes things don’t work out how we’d like and average stability will end up not as good as it was on the previous version.
Why aren’t we fixing the server crashes causing 30000 disconnection errors?
We are. It only seems like we aren’t because, regardless of the cause, all server crashes result in clients getting the same 30000 disconnect. This disconnect happens because once the server has crashed the clients suddenly stop receiving network traffic from it. They then wait for 30 seconds to see if traffic will resume (in case the server was stuck on a temporary stall or there was a short network outage) before giving up, returning to the front end menus and showing the disconnection error. During these 30 seconds clients will see doors fail to open as well as AI, terminals and other entities become non-responsive. Backers sometimes mistake these symptoms for a sign that the server is about to crash, and you might see in-game chat saying a server crash is incoming, but the truth is that the server is already dead. It is an ex-server. It has ceased to be. If we hadn’t nailed it to its perch it would be pushing up the daisies. (In-game chat only continues to work because that is handled by a different server.)
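The client-side wait described above can be sketched roughly as follows. This is a minimal illustration, not CIG's code: the `ConnectionWatchdog` class and its method names are hypothetical, and only the 30-second wait and the 30000 error code come from the text.

```python
from typing import Optional

DISCONNECT_TIMEOUT_S = 30.0  # wait described in the text
DISCONNECT_ERROR = 30000     # error code shown to the player


class ConnectionWatchdog:
    """Hypothetical client-side timer that tracks silence from the server."""

    def __init__(self, timeout_s: float = DISCONNECT_TIMEOUT_S) -> None:
        self.timeout_s = timeout_s
        self.last_packet_at: Optional[float] = None

    def on_packet(self, now: float) -> None:
        # Any traffic from the server resets the timer.
        self.last_packet_at = now

    def check(self, now: float) -> Optional[int]:
        # Returns the error code once the server has been silent for the
        # full timeout, otherwise None (keep waiting for traffic to resume).
        if self.last_packet_at is None:
            return None
        if now - self.last_packet_at >= self.timeout_s:
            return DISCONNECT_ERROR
        return None
```

In this sketch the server may already be dead at the moment of the last packet; the client only finds out when `check` finally returns the error code, which matches the window in which doors and terminals stop responding.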
When a new patch is being prepared on PTU, new builds are available for download almost daily. Once DevOps in ATX has pushed the new build up to the servers and made it available for download they then monitor the build for the first few hours, often working late to do so, looking for anything to indicate a problem that needs dealing with immediately. For the next few hours people play the game, uploading their crash reports, submitting to the Issue Council, responding to feedback forums, etc. Server crashes are all automatically recorded to a database.
When the EU studios wake up, Technical QA look through the uploaded client crashes and recorded server crashes and make an initial assessment of which are the worst offenders, based on how often they happen and how soon after joining a game. Server crashes almost always go to the top of the pile, purely because they affect more people than individual client crashes. Jiras get created and passed on to Production.
Production do three things here: first they send the crash Jiras to the Leads for triage, second they confirm priorities and which crashes QA should try to reproduce or otherwise assist with, third they flag any particularly bad crashes with Directors for priority calls in case additional people need to be reassigned to try and ensure a speedy resolution.
Meanwhile the Leads triage the crashes making sure they go to the right Programmers on the right teams. Then the Programmers investigate the bugs, often working with QA to find as much info on the bug as possible. Most of the time Programmers can commit a fix the same day but sometimes it might take a day or two longer. In rare cases it can take a couple of weeks to track down the problem and come up with a fix. In very rare cases the bug is a symptom of some deeper flaw that will require restructuring some system to work a different way, can’t be done in time or without significant risk for the current patch, and needs to be added to a backlog to be scheduled for a future release.
As ATX comes online Community and DevOps publish their reports on the previous build from information gathered over the past day. Production kick a build with all the latest fixes and meet with QA, Community and DevOps to make an assessment on whether the new build is likely to be better than the last or whether additional fixes are needed first. Production pass their recommendation on to the Executives who make a go/no-go decision on trying to push the new build to PTU that day. If yes, ATX QA and DevOps start working their way through a pre-release checklist that takes several hours to complete.
When LA comes online EU Programmers may hand over any issues that were specifically for LA teams or that EU teams were working on but are unresolved and would benefit from continued investigation after EU has finished for the day. When ATX have completed the pre-release checklist, and if the build has passed, the cycle starts again.
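The initial assessment Technical QA make in this cycle could be sketched as a simple sort. This is an assumption-laden illustration: the `CrashGroup` fields and the exact ordering rule are hypothetical, chosen only to mirror the three criteria mentioned in the text (server crashes first, then how often a crash happens, then how soon after joining it happens).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CrashGroup:
    """Hypothetical summary of one crash signature from the crash database."""
    signature: str
    is_server: bool
    occurrences: int
    median_seconds_after_join: float


def triage_order(groups: List[CrashGroup]) -> List[CrashGroup]:
    # Sort key (assumed, not CIG's actual rule):
    #   1. server crashes before client crashes,
    #   2. more frequent crashes first,
    #   3. crashes that hit sooner after joining first.
    return sorted(
        groups,
        key=lambda g: (not g.is_server, -g.occurrences, g.median_seconds_after_join),
    )


reports = [
    CrashGroup("client_render", False, 500, 120.0),
    CrashGroup("server_physics", True, 40, 3600.0),
    CrashGroup("server_network", True, 40, 600.0),
]
ordered = [g.signature for g in triage_order(reports)]
```

Here both server crashes rank above the far more frequent client crash, and the tie between them is broken by how quickly they hit after joining.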
If we are fixing the crashes why do 30000 disconnections keep happening?
Between every quarterly release we change a lot of code. Some of it is completely new and some of it is merely modification of existing code. Each change we make has a chance that it may contain bugs. We’re only human and all make mistakes from time to time so each quarter there is the potential for having added a lot of new bugs. There are processes in place to reduce the chances of that happening but some always slip through. Once a bug is discovered it needs fixing. Sometimes a fix doesn’t work. Sometimes it only fixes the crash in some cases but not all. Sometimes the fix itself has a bug in it that can cause other problems.
One of the things we see quite a lot is that once a frequent crash is fixed one or more other crashes will start appearing more often. That happens because the crash that was just fixed was blocking the other crashes from occurring as much as they otherwise would have. As mentioned above there are also crashes that can’t be fixed immediately and need to wait until there is more time to fix them properly or until some other planned work is completed. Eventually though the majority of the most frequent crashes get fixed.
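A toy model can make the masking effect concrete. It is a deliberately simplified sketch under stated assumptions: each bug is assumed to crash the server after a fixed number of hours of uptime, and the server restarts immediately after every crash. Real crashes are not this regular, and the function name and numbers are illustrative only.

```python
from typing import Dict


def count_crashes(trigger_hours: Dict[str, float], total_hours: float) -> Dict[str, int]:
    """Count crashes per bug over total_hours of server time.

    Assumption: each bug fires after a fixed uptime, and the server restarts
    right after a crash, so only the bug with the shortest trigger time is
    ever reached. Every slower bug is hidden behind it.
    """
    counts = {name: 0 for name in trigger_hours}
    if not trigger_hours:
        return counts
    first = min(trigger_hours, key=trigger_hours.get)
    counts[first] = int(total_hours // trigger_hours[first])
    return counts


before_fix = count_crashes({"A": 1.0, "B": 5.0}, total_hours=100.0)
after_fix = count_crashes({"B": 5.0}, total_hours=100.0)
```

With bug A present, the server never stays up long enough to reach bug B, so B is not seen at all. Once A is fixed, B suddenly shows up 20 times in the same 100 hours even though nothing about B changed.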
What we are then left with are the really rare crashes, the ones that only occur once every month and that we don’t yet have enough information to reproduce or fix. One of these rare bugs isn’t going to make much difference on its own but a hundred such bugs would be enough for at least three server crashes a day.
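The "three server crashes a day" figure follows from simple arithmetic. The sketch below assumes a 30-day month and that the rare bugs occur independently, each about once a month; the function name is hypothetical.

```python
def expected_crashes_per_day(
    bug_count: int,
    crashes_per_bug_per_month: float = 1.0,
    days_per_month: float = 30.0,  # assumption: a 30-day month
) -> float:
    # Total crashes per month across all bugs, spread evenly over the month.
    return bug_count * crashes_per_bug_per_month / days_per_month


rate = expected_crashes_per_day(100)  # 100 / 30, a little over 3 per day
```

So a hundred bugs that each strike only once a month still add up to slightly more than three crashes every day.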
If we can’t make the servers stable why don’t we provide some kind of recovery?
It has been suggested that providing some kind of cargo insurance could prevent players losing large sums of aUEC when their server crashes mid cargo run. I believe this has been considered but the potential for it to be abused as an exploit is clear. Until that problem is solved cargo insurance is unlikely to appear in-game.
Another suggestion is to add some kind of server crash recovery. The idea here is that when a server crashes, all the clients would be kicked back to the menus with a 30000 as they are now but would then be given the option to join a newly spun-up server that has restored the state of the original from persistence. This is actually something we’re hoping to do but it requires more work to be done on SOCS and full persistence before it can happen so is still a long way off.
There have also been other suggestions, such as clients or servers saving out the game’s state in local files, but these either aren’t secure or would be temporary solutions, and the work needed to implement and maintain them would be better spent on the proper solution instead.
For now the best option is for us to continue to fix crashes as we find them and hope that servers are stable enough for most players to be able to test the game.