Star Citizen Alpha 3.18 Post Mortem – PES Almost Broke Everything

Hello & welcome to some more Star Citizen. The deep dives analysing what went right and wrong with the 3.18 & 3.19 patches are now out, and we are going to be delving into CIG’s account of the Alpha 3.18 post mortem: what they learnt, the issues they had with PES and other features, and what they are going to do differently in the future. This video will also contain a lot of my own thoughts, having lived through that broken patch…

Alpha 3.18’s Launch

As many of our players experienced, we launched Alpha 3.18 on March 10 of this year, and alongside a bevy of high-profile new features like Salvage, the flyable Vulture, the surprise Scorpius Antares, new rivers, and new missions, the biggest headliner of the patch was our delivery of Persistent Entity Streaming (PES), which completely rewrote how we save state in the game and ushered in the beginning of the promise of a truly persistent universe where a player’s actions could remain in the game for others to interact with and create a lived-in environment where you could literally leave your mark in the PU. And, just as importantly, PES was the necessary final stepping stone to delivering Static Server Meshing.

Many who follow Star Citizen appreciated how consequential and important our delivery of Alpha 3.18 was for us and the game itself. Many teams at CIG spent most of the second half of last year finishing out Persistent Entity Streaming, as we deployed it to our Public Test Universe servers for testing in December 2022, and then spent the next 3 months attempting to harden and improve it for launching to our Live Star Citizen service as quickly as possible.

However, when the moment of truth came and Alpha 3.18 shipped to Live, the shock to the system was beyond what we had projected. While it’s true that we expected PES to create a rough experience initially as we ironed out the issues that could only expose themselves at massive scale (and we warned as much), to say that we were surprised by the depth of chaos from the PES launch would be an understatement.

The anticipation for PES and 3.18 was nothing short of unprecedented for us. When the patch first dropped on March 10, we experienced our highest peaks ever for logins per minute and per hour, and had our highest number of attempted logins in a day for the first few days of launch. We say “attempted logins” because as you all know, the service was so overwhelmed by the traffic and teething pains of PES that many players could not get into the game, as various issues stalled users throughout the login funnel. Some were stuck in queues, some couldn’t get their characters to load, some were stuck at an infinite loading screen. As you can read in the account of PES below from Benoit Beauséjour, our Chief Technical Officer, we underestimated the multiplicative forces of going to Live and now creating and persisting every entity players created through their actions, which put a load on our service that was beyond our initial forecasts. And it took weeks if not months to expose, diagnose, create fixes, test them, and deploy them to restore the service, all while the game was still running on Live.

We learned a lot from the launch of PES, and while we are still recovering, and regret the compromised service in the first days and weeks of the rollout, it has definitely taught us a valuable lesson: to prioritize and preserve the integrity of the service more than we had in the past. That’s why as we begin to roll out the Replication Layer split and crash recovery – two things now enabled by PES – we will do so gradually, and as we begin to deliver Server Meshing, we will create dedicated testing channels to harden those new technologies further and implement standards and thresholds before we “graduate” them to PTU and then Live. You’ll start to see the ramifications of that later in the year and hear more from us later about our new approach to deploying potentially disruptive and game-changing new tech to the game service, but it comes down to us truly committing to preserving the experience for the hundreds of thousands, if not millions, who now play Star Citizen as a live service game, albeit an alpha still in development.

Persistent Entity Streaming

What Went Well

The development of Persistent Entity Streaming (PES) involved a diverse strike team of programmers with specialized skills from multiple areas across the Core Tech group and Turbulent. This collaboration was crucial in successfully building this complex system. The strike team followed aligned sprints and goals facilitated by senior engineers and producers that were supported by regular meetings. This resulted in effective communication and minimized miscommunication or technical misunderstandings.

The strike team maintained a high bar for code quality, ensuring that new code underwent thorough design, discussion, and multiple reviews before being integrated into the mainline codebase.

The initial deployment to the Public Test Universe (PTU) and testing with the PTU community went well, setting a positive foundation for further improvements. However, this led to issues (discussed below).

Finally, PES’ system architecture and API, which are based on durable queues, proved they can recover from the worst kind of problems safely and will always tend towards recovery.
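To make the durable-queue idea concrete, here is a minimal sketch in Python. It is not CIG’s implementation: the `DurableQueue` class, the JSON-lines log file, the ack marker, and the entity names are all hypothetical. What it illustrates is the general pattern: a change is written to durable storage before it is applied, so a consumer that crashes can resume from the last acknowledged message and still converge on the correct state.

```python
import json
import os
import tempfile


class DurableQueue:
    """Hypothetical append-only queue backed by a JSON-lines file."""

    def __init__(self, path):
        self.path = path
        self.ack_path = path + ".ack"

    def publish(self, message):
        # Append and fsync before returning, so the message survives a crash.
        with open(self.path, "a", encoding="utf-8") as log:
            log.write(json.dumps(message) + "\n")
            log.flush()
            os.fsync(log.fileno())

    def acked(self):
        # Number of messages the consumer has confirmed as applied.
        if not os.path.exists(self.ack_path):
            return 0
        with open(self.ack_path, encoding="utf-8") as marker:
            return int(marker.read() or 0)

    def ack(self, count):
        # Write a temp file and rename it, so readers never see a partial marker.
        tmp_path = self.ack_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as marker:
            marker.write(str(count))
        os.replace(tmp_path, self.ack_path)

    def pending(self):
        # Everything after the ack marker still has to be applied.
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as log:
            messages = [json.loads(line) for line in log if line.strip()]
        return messages[self.acked():]


def apply_pending(queue, store, crash_after=None):
    """Apply pending messages to `store`; optionally simulate a crash."""
    done = queue.acked()
    for index, message in enumerate(queue.pending()):
        if crash_after is not None and index == crash_after:
            raise RuntimeError("simulated consumer crash")
        store[message["entity"]] = message["position"]  # last write wins
        done += 1
        queue.ack(done)  # acknowledge only after the write has been applied
    return store


def demo():
    store = {}  # stands in for the database the consumer writes to
    with tempfile.TemporaryDirectory() as folder:
        queue = DurableQueue(os.path.join(folder, "entities.log"))
        queue.publish({"entity": "cup_01", "position": [1, 2, 3]})
        queue.publish({"entity": "ship_07", "position": [9, 9, 9]})
        queue.publish({"entity": "cup_01", "position": [4, 5, 6]})
        try:
            apply_pending(queue, store, crash_after=1)  # dies after one message
        except RuntimeError:
            pass
        apply_pending(queue, store)  # restart: resumes from the ack marker
    return store
```

Because the sketch acknowledges a message only after applying it, a crash between the two steps means the message is applied again on restart. That is safe here because each write simply overwrites the entity’s position, so repeating it changes nothing.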

What Didn’t Go Well

The research-and-development aspects of PES posed challenges, requiring the engineers to invent ways around unforeseen problems. Due to the foundational nature of PES, integrating it into the Star Citizen game code resulted in significant changes that disrupted the game at a very low level, and some game teams were unprepared for the integration effort required to bring the game systems back to parity or to convert and leverage the new persistence layer for existing features.

Issues with the changes introduced by PES only became apparent during large-scale use and under heavy player load, which caused delays in identifying and resolving the problems. Features not thought to use persistence at all (like tram systems, spawn queues, and others) were also affected by even trivial delays.

We also underestimated the multiplication factor between the PTU and Live operations; the group had estimated a 10x increase in backend activity but were faced with a 20x+ increase in requests, stream message sizes, and overall activity, which caused service outages across the board during the initial launch.
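As a rough illustration of why the gap between a 10x and a 20x multiplier matters, the sketch below sizes capacity from a PTU baseline. Only the 10x and 20x factors come from the account above; the baseline of 1,000 requests per second and the 50% safety headroom are made-up numbers for the example, and `utilisation` is a hypothetical helper.

```python
def utilisation(ptu_rps, planned_factor, actual_factor, headroom=0.5):
    """Return actual load as a fraction of provisioned capacity.

    Capacity is assumed to be provisioned for the planned load plus a
    safety headroom (0.5 means 50% extra). Values above 1.0 mean overload.
    """
    capacity = ptu_rps * planned_factor * (1 + headroom)
    actual_load = ptu_rps * actual_factor
    return actual_load / capacity


# Planned for 10x and got 10x: about two thirds of capacity is used.
on_plan = utilisation(ptu_rps=1000, planned_factor=10, actual_factor=10)

# Planned for 10x and got 20x: load is about a third above capacity.
overloaded = utilisation(ptu_rps=1000, planned_factor=10, actual_factor=20)
```

Under these assumed numbers, even a generous 50% headroom on top of the 10x estimate is not enough to absorb a 20x load.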

Regarding vehicles, PES heavily modified the way they are entitled and created in-game. This gives a better user experience (where you can choose where a ship ends up being created) and also significantly reduces the space taken up in the inventory/global database by ships that are never used.

Major issues were also discovered at scale with a third-party database engine that PES leverages for its functionality. These issues gave birth to very unstable request/response cycles as well as heavy queuing. These issues also caused ripple effects where one database server entering a deadlock condition would cause the entire shard cluster (instead of a single shard) to stop processing requests for a period. This was a major cause of the instability throughout Alpha 3.18.x until the team had identified and programmed a workaround to alleviate the effects. Additionally, multiple locking problems at scale were discovered in the global database system (same engine) that would cause a periodic stop of all requests to the inventory systems. The team had to investigate and report to our vendors to determine workarounds and ultimately fixes that would prevent the database engine from locking.

In the engine, several shards reached previously unknown hard limits of the maximum number of allocated entities, forcing the teams to seed/create new shards and cycle them out, diluting the effect of persistence on those shards.
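One generic way to avoid being surprised by a hard entity cap is to track allocations against it and flag a shard for cycling before the cap is hit. The sketch below is an assumption-laden illustration, not a description of the engine: the `EntityBudget` class, the 80% warning ratio, and the status strings are all hypothetical.

```python
class EntityBudget:
    """Hypothetical tracker for a shard's entity allocations."""

    def __init__(self, hard_limit, warn_ratio=0.8):
        self.hard_limit = hard_limit
        self.warn_at = int(hard_limit * warn_ratio)  # soft threshold
        self.allocated = 0

    def allocate(self, count=1):
        # Refuse the allocation rather than run past the hard limit.
        if self.allocated + count > self.hard_limit:
            raise MemoryError("entity hard limit reached")
        self.allocated += count

    def release(self, count=1):
        self.allocated = max(0, self.allocated - count)

    def status(self):
        # "cycle-soon" signals that a fresh shard should be seeded in time.
        return "cycle-soon" if self.allocated >= self.warn_at else "ok"
```

With a limit of 10 and the default ratio, `status()` reports `"ok"` up to 7 allocated entities and `"cycle-soon"` from 8 onwards, which leaves time to react before `allocate` starts failing.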

Several bugs were uncovered (in those unstable times) with error handling in parts of the login flow that bricked some accounts in different ways related to character creation. Server-crash handling was discovered to take a much longer amount of time due to a new process that kicks in during the post-mortem analysis. This affected the shard post-mortem and delayed players getting stowed back to the global inventory, which could result in a player character being “stuck” in a shard.

What We’ll Do Better/Future Plans

Going forward, we’ll finalize and use the new Cloud Test Launcher to adequately stress test the game shards at scale. This tool will simulate player behaviors and allow QA and the engineers to connect multiple modified game clients to the shard. By utilizing cloud computing resources, effective stress testing can be conducted, which will help identify and address issues relating to heavily loaded servers before moving to Live.
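The idea of simulating many players against one shard can be sketched in a few lines of Python with `asyncio`. This is purely illustrative and says nothing about how the Cloud Test Launcher actually works: `FakeShard`, its concurrency cap, `simulated_player`, and the action names are all hypothetical stand-ins.

```python
import asyncio


class FakeShard:
    """Stand-in for a game shard: rejects requests beyond a concurrency cap."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.accepted = 0
        self.rejected = 0

    async def handle(self, action):
        if self.in_flight >= self.max_in_flight:
            self.rejected += 1
            return False
        self.in_flight += 1
        try:
            await asyncio.sleep(0)  # yield to other clients, as real I/O would
            self.accepted += 1
            return True
        finally:
            self.in_flight -= 1


async def simulated_player(shard, actions):
    # Each simulated client plays through a fixed script of actions.
    for action in actions:
        await shard.handle(action)


async def stress(shard, players):
    """Run `players` simulated clients at once; return the rejection rate."""
    script = ["move", "loot", "store", "spawn_ship", "logout"]
    await asyncio.gather(*(simulated_player(shard, script) for _ in range(players)))
    total = shard.accepted + shard.rejected
    return shard.rejected / total if total else 0.0


light = asyncio.run(stress(FakeShard(max_in_flight=10), players=5))
heavy = asyncio.run(stress(FakeShard(max_in_flight=10), players=20))
```

Here `light` stays at zero because five clients never exceed the cap, while `heavy` is above zero. Ramping up the client count until rejections appear is the kind of signal a stress test aims to surface before a release reaches Live.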

The team responsible for PES has now moved on to Static Server Meshing and is embracing a transformation approach to the new project. Unlike PES, this foundational technology can be integrated into the codebase gradually, avoiding a disruptive “Big Bang” approach. Parts of the Server Meshing tech are already available to the game team for testing compatibility with their game features. Combined with the Cloud Test Launcher, this approach aims to facilitate a smoother integration process for Static Server Meshing.

By implementing these measures, we aim to enhance our testing capabilities and mitigate integration challenges, ensuring smoother delivery of foundational technologies while minimizing disruption to the game.

Rivers

What Went Well

The inclusion of rivers marked a significant milestone in our quest to create more realistic and immersive planets. We were quite happy with the improvements to river canyons we were able to achieve between Alpha 3.18 and 3.19 due to improvements in our asset pipeline. And the support from the Planet Tech team to address technical issues during this process was remarkable.

What Didn’t Go Well

The procedural river placement tool was not in as good a state as we had hoped when we started using it. As a consequence, a considerable amount of manual effort was required to meticulously place and verify the resulting rivers to ensure their quality. This limitation also reduced the number of rivers we were able to generate.

What We’ll Do Better/Future Plans

The numerous issues that were successfully identified and addressed during this initial run of rivers have already made a significant impact toward ensuring a smoother experience for next time. Although there is still considerable work ahead before we can consistently create planetary landscapes with rivers that look and feel like the real deal, we have made substantial progress and are now much closer to achieving our goal than ever before.

Sand Caves

What Went Well

We were very happy with the results of this initial push to develop an improved pipeline to produce individual rooms for all cave archetypes and to also define the visual identity of our sand caves. That we were able to release a first set of smaller cave systems out of that prototyping phase thanks to the concerted effort from multiple departments was the icing on the cake for us.

What Didn’t Go Well

With neither the tools for procedurally assembling locations nor automatically placing them on planets ready for use, we had to build and place every cave manually, which was the primary constraint on the number of caves we could place on planets in the Stanton system.

Unfortunately, these caves had to initially be released without missions, making them into locations the player actively needed to seek out to experience.

What We’ll Do Better/Future Plans

We are currently in the final stages of refining the new visuals for rock caves, which will serve as the next archetype. We are looking forward to utilizing the Location Tool to construct a wider variety of cave systems.

Additionally, we will be working towards support for bigger connections, rooms, and entrances, which is a key requirement before we can replace the old caves.

Time Trials

What Went Well

The new racing content and time trial modes were well received by the racing community, helped by the Content teams who produced many more tracks than we could have hoped for.

In the backend, the analytics we added were fantastic and allowed us to make very in-depth analyses of each track, which helped determine where they should go on the difficulty ranking and what the target times should be.

What Didn’t Go Well

Poor server performance meant that a sophisticated new system of checkpoint tracking had to be created, though the markers still do not update as responsively as we would like.

Analytics also show that relatively few players actually unlock the second track.

PTV Racetrack

What Went Well

It was created very quickly; we went into it with the idea that it was a simple location with a short timeframe and minimal impact on other teams. However, we achieved a lot more in three weeks than we were initially expecting, with a good modular kit for kart-style racetracks, and the addition of good dressing, theming, and lighting. It was really good to see how the community, especially the racing community, was excited when the track was initially shown to the public. We have since seen organized races on the track. We also got code support for upgrades to the respawning vehicle entity so if people were to crash, break, or abandon the Greycat PTV, it would respawn back at the starting area of the track. We can also set the values of things like time to respawn.
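The respawn behaviour described above can be modelled roughly as follows. This is a sketch under stated assumptions rather than the game’s code: the `RespawningVehicle` class, its field names, and the 30-second default are hypothetical, and only the ideas of returning to the start area and having a tunable respawn time come from the text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class RespawningVehicle:
    """Hypothetical model of a track vehicle that returns to its start."""

    spawn_point: Vec3
    respawn_delay: float = 30.0  # seconds; a made-up default
    position: Optional[Vec3] = None
    destroyed: bool = False
    occupied: bool = False
    idle_since: Optional[float] = None  # when it was last wrecked or left

    def __post_init__(self):
        if self.position is None:
            self.position = self.spawn_point

    def driver_enter(self):
        self.occupied = True
        self.idle_since = None

    def driver_exit(self, now):
        self.occupied = False
        self.idle_since = now

    def mark_destroyed(self, now):
        self.destroyed = True
        self.occupied = False
        self.idle_since = now

    def update(self, now):
        """Respawn at the start once the vehicle has been idle long enough."""
        if self.idle_since is None or self.occupied:
            return False
        if now - self.idle_since < self.respawn_delay:
            return False
        self.position = self.spawn_point
        self.destroyed = False
        self.idle_since = None
        return True
```

Keeping the delay as a plain field mirrors the point about being able to set values like time to respawn: a designer could tune it per track without touching the logic.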

What Didn’t Go Well

Despite finishing the track before Alpha 3.17 was released, it had yet to have a QA pass and be bug-fixed, so we decided to hold it back until Alpha 3.18. Little did we know Alpha 3.18 was going to be delayed so much, so the track, even though complete, made it into the public’s hands a long time after we had hoped.

What We’ll Do Better/Future Plans

We will certainly develop more modular tracks in the future (and have another in the works), but it is on the back burner for the moment. We will try and support other similar-sized ground vehicles like the Greycat STV in the future as well (initial tests have been positive). We will also work with the Mission team to look into adding a racetrack-style mission to the tracks, which will allow the tracking of race times, checkpoints, and laps, and enable the mission to give rewards to players.

Security Post Kareah

What Went Well

We could never have imagined the level of support we received from the Art team, which really rejuvenated the location.

The player-triggered sandbox activity was well received, and analytics showed that hacking CrimeStats at Kareah is still very popular. This had been a concern for us, as we were taking a risk by removing the other hacking locations.

What Didn’t Go Well

The mission still has rough edges that need ironing out, which is in progress. We also identified a need for additional analytics on sandbox activities to better understand player participation.

What We’ll Do Better/Future Plans

We’ll continue to iterate on the sandbox activity and location based on the feedback we’ve received and add further analytics to better understand participation.

Jumptown

What Went Well

The changes to the location were very well received, player participation was consistently very high, and the support we received from Art was well beyond what we expected.

What Didn’t Go Well

The implementation of PES led to performance issues around the location after a lot of ships were destroyed. We also wanted to redrop the locations with RASTAR. However, we were unable to at the time due to it breaking the shops.

What We’ll Do Better/Future Plans

For the next run of Global Events, we’re planning to redrop the locations on different planets to give different gameplay, for example in thick atmospheres, on higher-gravity planets, and in forests.

Infiltrate Missions – Orison

What Went Well

The new FPS environments were well-received and a refreshing change after only having underground facilities for years. The ability to assault the locations on foot or in ships was great too.

What Didn’t Go Well

We had to turn the missions off for Alpha 3.19 because we were aiming to release Siege of Orison, but we were not able to achieve this or the new platform clusters (where we were to relocate them) in time.

What We’ll Do Better/Future Plans

We have relocated the missions to the new platform clusters and will be releasing them when possible.

Prison Activities

What Went Well

The Prison Escape mission is surprisingly well-played and offers a new way for players to clear their CrimeStats. Inside the prison, loot on the AI and new selling terminals were well received; players felt the new AI made the prison feel more alive and it gave them another way to earn merits.

What Didn’t Go Well

The Ursa Rover continues to spawn underground, selling items at the prison kiosk still isn’t reliable, and excessive AI are being spawned due to a spawn closet issue.

What We’ll Do Better/Future Plans

For the next release, we’ll fix any bugs we can, including the Ursa spawning issue.

Drake Vulture

What Went Well

Adding the long-awaited “starter” of the Salvage career alongside its gameplay loop was a great milestone for the Vehicle team. While the vehicle was started some time ago, we had held on to it to ensure it released strongly with the gameplay loop rather than without, and this allowed the team to squeeze in some more features to the ship to make sure it hit all the current standards.

What Didn’t Go Well

A few complaints about traversing the ship came from the gameplay mechanics, and were partly a product of the Salvage mechanic evolving over time to require more manual input than initially expected during the vehicle’s concept phase in 2018.

What We’ll Do Better/Future Plans

Releasing vehicles alongside their gameplay loops rather than earlier in the project (see Starfarer and Reclaimer) is something we’ve been striving to do in recent times, and we’ll continue aiming to do this.

RSI Scorpius Antares

What Went Well

The Antares was designed alongside the base Scorpius as an optional variant to put into production in the future, with the tail section of the ship outlined as the part that could be geometry-swapped. However, during development, it was clear the needs of the EMP and quantum drive required slightly more power than planned and the team reacted well to adjust both the base and Antares to allow the component layout to suit both.

What Didn’t Go Well

There were a few technical issues we weren’t able to solve, which limited the second player’s control over pilot features and ruled out a more enhanced MFD setup.

What We’ll Do Better/Future Plans

With Master Modes and new MFDs coming in the future, we should see the copilot get more gameplay features rather than being half passenger, half button-presser.

Salvage & Cargo

What Went Well

We were able to support both Feature teams’ introduction of two key features to the PU, with Salvage requiring a lot of time to be spent on the art assets and Cargo requiring a pass across all ships by Design.

What Didn’t Go Well

Unfortunately, the scope of the work for Salvage was drastically underestimated, as we thought the existing UV2 damage system all ships used would be suitable out of the box. However, we very quickly realized we’d have to do an entire pass on every ship to up the quality, as salvaging has you looking at the visuals much more closely than the damage system ever did.

In addition, the gameplay mechanic was built around the idea that you’d be able to 100% scrape the entire hull. However, this wasn’t a consideration in the UV2 damage setup, so some areas were inaccessible, causing frustration to early testers who couldn’t “100%” a ship.

What We’ll Do Better/Future Plans

We’re now more closely integrated with the teams working on big features like this so issues can be found and investigated before development properly starts, rather than being looped in once the prototype has been completed.

Hull Scraping

What Went Well

The long-awaited first iteration of Salvage gameplay finally arrived with Alpha 3.18, which enables players to scrape off hull material and either trade it or use it for field repairs. The core gameplay loop was generally well received and provided a great contrast to other activities.

We also expanded the harvestable system with ship-wrecks and salvageable metal pieces, and introduced the first miniature version of Crafting by allowing players to create a few select items using RMC.

Releasing Hull Scraping alongside the Drake Vulture meant that the ship could come out with a proper gameplay system, and the Aegis Reclaimer finally has appropriate gameplay available to it.

What Didn’t Go Well

A lot of features and systems Hull Scraping was relying on were still in active development when we were building the core gameplay system. This meant the feature was handed off to the EUPU team with a fairly compressed timeline for release.

We addressed the way players would find salvageable objects in the universe way too late in the process, and the balance work for salvageable-object distribution was not properly mapped out.

Not all vehicles could be upgraded to the new Damage Map, meaning some vehicles still won’t work correctly with Hull Scraping.

What We’ll Do Better/Future Plans

We’ve changed our approach to how early we get other teams involved, meaning that downstream teams get involved as early as the prototyping stage. We’ve also introduced additional milestones where downstream and content teams can review and approve the progress we’ve made before we move on to the next stage.