Eliminating Stutter with Asynchronous Shader Implementation!

Background

Shader compilation stutter is nothing new to most emulator users, especially on RPCS3. However, it is worth clearing up some misconceptions about how RPCS3 shaders work. I’ll quickly go over the history of shader compilation on RPCS3 to explain why the stutter appeared and why some people believe RPCS3 did not have shaders before.

Shader complexity and custom vertex fetch

In early 2017, I set out to remove the very expensive vertex preprocessing step from the CPU side of RPCS3. This meant implementing all the custom vertex types and vertex reading techniques in the vertex shader and giving it only a raw view of the memory that the PS3 hardware would see. This greatly improved RPCS3 performance, by more than tenfold in some applications, and it is what made it possible to play real commercial games at playable framerates without needing a HEDT system. However, the new fetch technique increased the size of the vertex shader and added a complex function to extract vertex data from the memory block. Extra operations, including bitshifts and masking, were also needed to decode the vertex layout block. As a result, the graphics drivers take a very long time to link the programs, even without optimizations, likely due to the use of vector indexing, switch blocks and loops with dynamic exits. The code runs very fast, but the linking step is very slow. A shader cache system already existed, and running an area for the first time caused slight microstuttering that some users did not notice; it’s this stutter that got much worse. The solution was to preload the shaders so that they do not need to be compiled the next time. This led to the infamous “Loading Pipeline Object…” screen and the “Compiling shaders…” notification.
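To give a feel for the kind of work the vertex fetch function does, here is a minimal sketch written in C++ instead of shader code. It is not RPCS3's actual implementation: the packed layout word, the `AttribLayout` struct and the `decode_layout` and `fetch_attribute` functions are all hypothetical, and only two component types are handled. It shows the three ingredients mentioned above: shifts and masks to decode the layout, a switch on the component type, and a loop that can exit early.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical packed layout word (NOT the real RSX encoding):
//   bits  0-3  : component type (0 = float32, 1 = unorm8)
//   bits  4-7  : component count (1 to 4)
//   bits  8-15 : stride in bytes between consecutive vertices
//   bits 16-31 : byte offset of the attribute inside the memory block
struct AttribLayout {
    std::uint32_t type;
    std::uint32_t count;
    std::uint32_t stride;
    std::uint32_t offset;
};

// Unpack the layout word with shifts and masks.
inline AttribLayout decode_layout(std::uint32_t word) {
    AttribLayout l;
    l.type   = word & 0xFu;
    l.count  = (word >> 4) & 0xFu;
    l.stride = (word >> 8) & 0xFFu;
    l.offset = (word >> 16) & 0xFFFFu;
    return l;
}

// Read one attribute of one vertex straight from a raw memory block.
// Missing components keep the defaults (0, 0, 0, 1).
inline std::array<float, 4> fetch_attribute(const std::vector<std::uint8_t>& memory,
                                            std::uint32_t layout_word,
                                            std::uint32_t vertex_index) {
    const AttribLayout l = decode_layout(layout_word);
    std::array<float, 4> out = {{0.0f, 0.0f, 0.0f, 1.0f}};
    const std::size_t base =
        l.offset + static_cast<std::size_t>(l.stride) * vertex_index;

    for (std::uint32_t i = 0; i < l.count && i < 4; ++i) {
        switch (l.type) {
        case 0: { // 32-bit float, read in host byte order for simplicity
            const std::size_t at = base + i * sizeof(float);
            if (at + sizeof(float) > memory.size()) return out; // out of range
            float f;
            std::memcpy(&f, memory.data() + at, sizeof(float));
            out[i] = f;
            break;
        }
        case 1: { // 8-bit unsigned normalized, mapped to [0, 1]
            const std::size_t at = base + i;
            if (at >= memory.size()) return out; // out of range
            out[i] = memory[at] / 255.0f;
            break;
        }
        default:
            return out; // unknown type: stop early with defaults
        }
    }
    return out;
}
```

Even in this toy form, the function mixes data-dependent branching, dynamic indexing and early exits, which is the sort of code that is cheap to run but can be expensive for a driver to compile and link.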

Challenges

There are several challenges to tackling RSX shaders. First, the RSX is not a unified architecture like most programmers are used to today. It has separate vertex and fragment pipelines, each with its own ISA. Both are also very limited, and larger or more complex programs can result in very messy binaries. One of the largest problems is that the bytecode itself does not contain all the information required to run the program; extra state is configured via registers as draw calls are passed in. A good example is the TEX instruction, which does not differentiate between texture types. Instead, a texture configuration register allows the same program to read 1D, 2D, 3D, or CUBE textures as well as their shadow comparison variants. This means the generated shader can only be known once the texture register has been set up. There are other cases like this, which means the game must set up the program environment before the program itself can be compiled.
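One way to picture the consequence of this is a program cache whose key contains both the bytecode and the register state that affects code generation. The sketch below is only an illustration under assumptions: `ProgramKey`, `ProgramKeyHash`, `ProgramCache` and the two-unit texture array are hypothetical names and sizes, and the "generated shader" is reduced to a list of GLSL sampler declarations.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

enum class TexDim : std::uint8_t { Tex1D, Tex2D, Tex3D, Cube };

// Hypothetical cache key: the same bytecode with a different texture
// configuration must map to a different generated shader.
struct ProgramKey {
    std::vector<std::uint32_t> bytecode;
    std::array<TexDim, 2> tex_dims; // two texture units, for brevity

    bool operator==(const ProgramKey& o) const {
        return bytecode == o.bytecode && tex_dims == o.tex_dims;
    }
};

struct ProgramKeyHash {
    std::size_t operator()(const ProgramKey& k) const {
        // FNV-1a-style mixing applied per word instead of per byte.
        std::uint64_t h = 14695981039346656037ull;
        for (std::uint32_t w : k.bytecode) { h ^= w; h *= 1099511628211ull; }
        for (TexDim d : k.tex_dims) {
            h ^= static_cast<std::uint64_t>(d);
            h *= 1099511628211ull;
        }
        return static_cast<std::size_t>(h);
    }
};

class ProgramCache {
public:
    // Returns the generated source for this key, generating it on first use.
    const std::string& get(const ProgramKey& key) {
        auto it = cache_.find(key);
        if (it != cache_.end()) return it->second;
        ++generated_;
        return cache_.emplace(key, generate(key)).first->second;
    }

    // Number of unique shaders generated so far.
    std::size_t generated() const { return generated_; }

private:
    // Stand-in for the real decompiler: only emits sampler declarations,
    // which is the part that depends on the texture configuration.
    static std::string generate(const ProgramKey& key) {
        static const char* const names[] = {
            "sampler1D", "sampler2D", "sampler3D", "samplerCube"};
        std::string src;
        for (std::size_t i = 0; i < key.tex_dims.size(); ++i) {
            src += "uniform ";
            src += names[static_cast<std::size_t>(key.tex_dims[i])];
            src += " tex" + std::to_string(i) + ";\n";
        }
        return src;
    }

    std::unordered_map<ProgramKey, std::string, ProgramKeyHash> cache_;
    std::size_t generated_ = 0;
};
```

Requesting the same bytecode twice with identical texture state hits the cache, while changing a single unit from 2D to CUBE produces a second, distinct shader. This is why the key cannot be built until the game has written the relevant registers.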

Origins

Anyone who has looked at the RPCS3 program generation pipeline might have noticed that program decompilers are separated into a program class and a decompiler thread class. This is because asynchronous decompilation was technically in RPCS3 from the beginning, but it likely never worked correctly and ended up being effectively synchronous (see this archive for the original RPCS3 code from 2011 to 2013). The goal was to rework the pipeline so that programs are decoded and recompiled asynchronously. Before any of this would be possible, the decoder had to be very efficient to lower CPU load; otherwise, endlessly decompiling shaders would cause permanent flickering and graphics glitches. This is why I first put more effort into cleaning up the program decompilers, minimizing the number of unique generated shaders by reusing as much emitted code as possible.

Implementation

Once the vertex program analyser was improved to a point where most titles could be run with under 1k generated shaders, it was time to finally decouple the decompilation, recompilation and linking steps from the main renderer thread. This was pretty straightforward to set up for Vulkan, which is inherently built with multithreading in mind, and a minor hassle on OpenGL, where multithreading is more difficult to set up due to context spaces. I expected severe graphics glitching to persist until all programs were prepared, but even with a bare-bones implementation I realized that most of the needed shaders would actually be properly compiled after just a few seconds. This made the implementation usable in its incomplete stage, even without the assistance of approximation shaders or interpreters. This was very good news and allowed the feature to be enabled by default, as most users prefer missing graphics for a second or two over the emulator freezing completely for a minute or so until all the shaders are compiled. This is particularly effective since not all compiled shaders even affect items that are visible to the user.
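The general pattern of moving compilation off the renderer thread can be sketched as follows. This is a simplified illustration, not RPCS3's code: the `AsyncCompiler` class, its `request` method and the sleep that stands in for decompiling and linking are all assumptions. The renderer asks for a program by key; if it is not ready yet, the request is queued for a worker thread and the caller is told to skip the draw.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <mutex>
#include <thread>
#include <unordered_map>

class AsyncCompiler {
public:
    AsyncCompiler() : worker_([this] { run(); }) {}

    ~AsyncCompiler() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_all();
        worker_.join();
    }

    // Called from the renderer thread. Returns true if the program is ready.
    // Otherwise the key is queued (once) and false is returned, so the
    // caller can skip the draw instead of blocking.
    bool request(std::uint64_t key) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = state_.find(key);
        if (it == state_.end()) {
            state_[key] = State::Pending;
            queue_.push_back(key);
            cv_.notify_one();
            return false;
        }
        return it->second == State::Ready;
    }

private:
    enum class State { Pending, Ready };

    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            cv_.wait(lock, [this] { return stop_ || !queue_.empty(); });
            if (stop_) return;
            const std::uint64_t key = queue_.front();
            queue_.pop_front();
            lock.unlock();
            compile(key); // slow work happens without holding the lock
            lock.lock();
            state_[key] = State::Ready;
        }
    }

    // Stand-in for decompile + recompile + link.
    static void compile(std::uint64_t) {
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::deque<std::uint64_t> queue_;
    std::unordered_map<std::uint64_t, State> state_;
    bool stop_ = false;
    std::thread worker_; // declared last so it starts after the other members
};
```

In a render loop, a draw call would check `request(key)` every frame and simply not draw while it returns false. That is where the briefly missing graphics come from, and it is also why the frame rate stays smooth while the worker catches up.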

Results

The results are surprising even to me. The speed at which graphical glitches resolve is very impressive on capable machines. You can see a quick comparison below:

Loading times and stutter are both reduced significantly when no pre-compiled cache is present.

Conclusion

While the asynchronous compiler greatly improves quality of life when using the emulator, it is far from perfect. Lower-end CPUs might lose some performance when the decompiler is active and even hitch and stutter slightly from time to time. This is not as bad as before, but it is still a problem. There is also the issue of the missing graphics, which could be improved with approximation interpreters that loosely provide a placeholder program to assist in rendering.
While much work remains to be done, hopefully these improvements will help many users and fans in running applications on the emulator.

Regards,
kd-11,
Lead Graphics developer for RPCS3
