Welcome to September’s progress report! Firstly, we would like to apologize for the delay. Our progress reports are written by volunteer writers, and sadly most of them were unavailable to contribute this month. However, there is a silver lining here. The additional time we had gave us a unique opportunity to convert this month’s progress report into a technical exposition hybrid!
We’ll be featuring a deep dive into the inner workings of the texture cache in RPCS3, and how it was improved thanks to the contributions of ruipin and Nekotekina. We will also uncover the wide variety of improvements that kd-11 made and showcase some massive improvements to various AAA titles. Without further ado, let’s jump straight into September’s irregular progress report!
In addition to the following report, further details about Nekotekina’s and kd-11’s work during September and about their upcoming contributions can be found in their weekly reports on Patreon. The month’s Patreon reports were:
In the compatibility database statistics, we can see all the numbers moving further in the right direction. The Ingame category has breached the 1300 games barrier while Playable continues to slowly rise due to the amount of time it takes to make a playable compatibility report. Intro also saw a decent reduction as a result of recent improvements and lots of testers making compatibility reports. For more details, take a look at the compatibility history page to see which games in particular had their statuses changed during the month.
On Git statistics, there have been 10,178 lines of code added and 6,751 removed through 53 pull requests by 12 authors.
Major RPCS3 Improvements
A number of significant PRs were merged during September. But in order to properly convey what has changed with the following two, we decided to go into a bit more detail on how texture caching works in RPCS3 first. This also means that we had to get pretty technical at some points, and while this is great news for the enthusiasts out there, it might not be as welcome for the casual reader. We did try our best to find the sweet spot in between, so we hope that it’ll end up being a great read for both parties regardless.
If you are not sure that this is for you though, feel free to click here and skip this part, then maybe return to it later after gaining some confidence.
For those that are still reading however, let’s dig in!
A “Quick” Primer on Texture Caching in RPCS3
How does data flow around in the PS3?
Architecture differences are sometimes a pain to handle from an emulation standpoint, and PC vs. PS3 is no exception. Contrary to PCs, the PS3’s Cell (CPU) and RSX (GPU) can directly read and write to each other’s memory – and they do so quite regularly, meaning we needed to take care of emulating this communication to a certain extent. While we don’t recreate the latency and bandwidth limitations currently, for reasons that would be a bit outside of the scope of this post, we do use memory mirroring and read/write locks to emulate the rest.
Rendering a frame & texture cache basics
Let’s say you wanted to render a new frame, for which you needed a new texture uploaded onto the host’s (your PC’s) VRAM. To achieve this, you thought, you’d just copy it there from the guest’s (the emulated PS3’s) VRAM, render the frame, and move onto the next one starting afresh. There’d be only one “small” issue with this idea: it’d be painfully slow. But then you say, okay, let’s create a cache for our textures beforehand – and this is where our texture cache and memory mirroring come into play. The goal is to have these textures’ memory regions in the guest VRAM mirrored onto the host VRAM, and have them synchronized as needed. However, due to the ability of both the Cell threads and the RSX thread to read and/or write to these cached memory regions concurrently, we need to be smart about when we actually do these synchronizations.
Let’s render that frame again! We need another texture for it, so we look through the list of sections we currently have in the host VRAM (i.e. cached in the texture cache). If we find the texture we need, we check if our host VRAM copy is marked as in-sync, and if so, we reuse it. If we know that our host GPU (as per the usage context of the texture) is going to modify the mirrored counterparts of said cached region (e.g. by blitting), we also set a “No-Access” lock on it. This will help us know if the original needs to be re-synchronized (so, copy from host VRAM back to guest VRAM) before touching it, as memory protection will be triggered on access. We are happy, your GPU is happy, and the frame is rendered nice and quick.
But what if this wasn’t the case and we failed on step 0 (i.e. the texture we want hasn’t been cached yet)? Well, we simply go ahead, apply a “Read-Only” lock on its region (so we can detect any modification attempts by Cell threads), mark it as in-sync, and upload it to the host VRAM. If we know that our host GPU is going to modify said mirror, then we replace the lock with a “No-Access” one. We can now render our frame, though we did suffer a minor performance penalty in this case, as this texture was not in the cache and needed to be uploaded.
We churn through several frames like this, but then let’s say, one of the Cell threads wants to read a texture from the guest VRAM (not uncommon). If the wanted memory region was protected by a “Read-Only” lock, our thread will carry on just fine. However, if the “No-Access” lock was set on it instead, the texture cache gets notified. Remember, this means that we’ve cached the given region once, and then let the GPU modify it, therefore, it became out-of-sync (the host VRAM being currently ahead). This is where we re-synchronize. We mark it as being out-of-sync, remove the lock, and then do a copy from host VRAM to guest VRAM to re-synchronize them. This is referred to as a “hard fault”, and is an extremely common thing (especially with WCB enabled), hence we use our predictor algorithm to prevent it.
Imagine however if the Cell simply decided to modify a texture, without reading it first (for example, if it wished to replace the texture we wanted with another one before letting us use it). Whether a “Read-Only” or a “No-Access” lock is applied on that region, the texture cache is notified of such a write attempt. If the host VRAM copy is out-of-sync (we have a “No-Access” lock on the region), we repeat the re-synchronization process mentioned above. Otherwise, we manually mark the host VRAM copy as being out-of-sync, as it will just get overwritten by the given Cell thread, and remove its lock to let it. This will make the texture get re-uploaded to the host VRAM on its next use. Having done either of these, we let the Cell continue.
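To make the lock transitions above easier to follow, here is a minimal sketch of them as a small state machine. It is a toy model written for this post, not RPCS3's actual code: the names `Lock`, `CachedSection`, `use_for_draw`, `on_cell_read` and `on_cell_write` are all hypothetical, and real page protection is replaced by plain method calls.

```python
from dataclasses import dataclass
from enum import Enum


class Lock(Enum):
    NONE = "none"            # guest pages unprotected
    READ_ONLY = "read-only"  # Cell reads pass, Cell writes notify the cache
    NO_ACCESS = "no-access"  # Cell reads and writes both notify the cache


@dataclass
class CachedSection:
    """Hypothetical bookkeeping for one mirrored texture region."""
    lock: Lock = Lock.NONE
    in_sync: bool = False
    uploads: int = 0   # guest VRAM -> host VRAM copies
    flushes: int = 0   # host VRAM -> guest VRAM copies

    def use_for_draw(self, gpu_writes: bool) -> None:
        """The RSX thread needs this texture for a frame."""
        if not self.in_sync:
            # Not cached (or stale): lock, mark in-sync, upload.
            self.lock = Lock.READ_ONLY
            self.in_sync = True
            self.uploads += 1
        if gpu_writes:
            # The host GPU will modify the mirror, so reads must fault too.
            self.lock = Lock.NO_ACCESS

    def on_cell_read(self) -> None:
        if self.lock is Lock.NO_ACCESS:
            self._resynchronize()   # the "hard fault" path

    def on_cell_write(self) -> None:
        if self.lock is Lock.NO_ACCESS:
            self._resynchronize()
        elif self.lock is Lock.READ_ONLY:
            # The guest copy is about to change: just drop our mirror.
            self.in_sync = False
            self.lock = Lock.NONE

    def _resynchronize(self) -> None:
        self.in_sync = False
        self.lock = Lock.NONE
        self.flushes += 1
```

In this model a Cell read of a "Read-Only" section costs nothing, a Cell read of a "No-Access" section costs one flush, and a Cell write to a "Read-Only" section costs one re-upload on the texture's next use.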
With this logic implemented, we had all our bases covered, or so we thought. The texture cache cached the textures it needed to, handled all the cross-memory communication emulation, and kept our mirrors in sync. And so, we were happy. Or were we?
Flat Virtual Supermemory (#5188)
The most attentive of you might have noticed one strange detail in the explanation above – if we remove the lock before completing the flush during re-synchronization, couldn’t that lead to unpredictable consequences? Indeed, since when we remove the “No-Access” protection, other threads can easily interfere and read garbage out of the now-unprotected memory region. Due to the exact timings required, this was quite uncommon and thus really hard to debug, or even to reproduce deliberately. The only symptom was usually a minor and random texture corruption after hours of ingame time spent. Thankfully however, our devs did implement a solution for this, and it is what we call “supermemory”.
Supermemory is a view of the guest memory which is always unprotected and can only be accessed by the emulator. This means that we can leave all the out-of-sync, mirrored memory regions “No-Access” locked, so while game threads won’t be able to access them prematurely, we will still be able to complete the re-synchronization process in the background. This is done via asking the GPU driver to flush the out-of-sync (ahead of its original) mirror from the host VRAM into a select, contiguous region of host RAM, then copying that data over to the original region in guest memory via this unprotected view. Or at least, this was the way it had been done originally, as the supermemory was a randomly allocated, non-contiguous virtual region/view. We thought this intermediate buffering step wouldn’t be an issue, but as it turned out, we were (a bit) wrong.
During the development of his texture cache refactor PR, ruipin stumbled upon a rather “strange” behaviour in the menus of The Last of Us: after some of his changes, the game started crashing inside the texture cache code. During his investigation, apart from finding out what was causing the crashes, he also noticed how the game flushed over 100 MB into and out of said intermediate buffer, sometimes over 4 times per frame. This is significant even by today’s computing standards, so upon sharing his findings with others, he decided to test a different approach: a contiguously allocated supermemory to replace the need for the intermediate buffering. The change in performance he saw was staggering:
After contacting Nekotekina about using this allocation strategy instead, a complete fix was put in place soon after. As opposed to relying on an intermediate buffer, the GPU driver can now flush into the supermemory directly. This saves on both memory, as well as lots of time-consuming, piecemeal copying, causing a noticeable performance improvement in games with lots of flushes per frame, especially if high resolution rendering and WCB are also at play.
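The idea of a second, always-accessible view of the same memory can be illustrated with two mappings of one backing file. This is only an analogy using Python's standard `mmap` module: the names `guest_view`, `super_view` and `flush_to_guest` are made up for the example, a read-only mapping stands in for a protected guest region, and the emulator itself relies on OS-level page protection rather than anything shown here.

```python
import mmap
import tempfile

PAGE = mmap.PAGESIZE
SIZE = 4 * PAGE

# One backing store, standing in for guest memory.
backing = tempfile.TemporaryFile()
backing.truncate(SIZE)

# View 1: what "game threads" see. Read-only here, as a stand-in for a
# region that is still locked while it is out-of-sync.
guest_view = mmap.mmap(backing.fileno(), SIZE, access=mmap.ACCESS_READ)

# View 2: the emulator-only view, always writable, same underlying bytes.
super_view = mmap.mmap(backing.fileno(), SIZE, access=mmap.ACCESS_WRITE)


def flush_to_guest(offset: int, data: bytes) -> None:
    """Write flushed texture data straight through the unprotected view,
    without lifting the restriction on guest_view and without staging the
    data in a separate intermediate buffer first."""
    super_view[offset:offset + len(data)] = data


flush_to_guest(PAGE, b"texels from host VRAM")
```

After the call, `guest_view[PAGE:PAGE + 6]` reads back `b"texels"`, while an attempt to write through `guest_view` still raises `TypeError`. The data arrived even though the restricted view was never opened up.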
Texture Cache Improvements & Refactor (#5115)
Apart from the extreme refactoring efforts ruipin put into this PR, totaling a whopping 5,201 lines changed, the other main feat it accomplished was enabling chained invalidation without the significant performance loss it would otherwise entail.
To find out what any of these mean however, we first need to explore the inner mechanics of the texture cache ever so slightly further!
Texture overlaps, locking and chained invalidation
As described in the lengthy explanation above, locks are used to detect Cell-originating read/write attempts within the guest VRAM. This is what enables us to re-synchronize the mirrored memory regions that Cell is trying to access. Among the details we have glossed over however, there is one in particular that we have to bring up before moving on: these locks can only be applied on a memory page basis.
So what does this mean for us? Well, memory pages on the x86 architecture are typically 4K in size or bigger, meaning we can only operate in 4K units or higher. Textures however are usually not 4K sized, nor are they 4K aligned, meaning they can often span across pages, or be contained among many others within one.
Handling the protection of pages a texture spans fully across is simple – we can just lock them as normal. However, care must be taken when handling mixed pages. As textures are likely to start and/or end midway through a page, the other part of those pages might contain any data, or even other textures, which is even more problematic.
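As a small worked example of the page arithmetic involved, the sketch below splits the pages a texture touches into fully covered pages and mixed pages. The function name `classify_pages` and the 4 KiB default are assumptions made for this illustration.

```python
PAGE_SIZE = 4096  # typical x86 page size, assumed for this example


def classify_pages(start: int, size: int, page_size: int = PAGE_SIZE):
    """Return (full, mixed): indices of pages the texture covers entirely,
    and indices of pages it only covers partially."""
    end = start + size                      # exclusive end address
    first = start // page_size
    last = (end - 1) // page_size
    full, mixed = [], []
    for page in range(first, last + 1):
        page_start = page * page_size
        page_end = page_start + page_size
        if start <= page_start and page_end <= end:
            full.append(page)
        else:
            mixed.append(page)
    return full, mixed
```

For a texture at address 0x1800 that is 0x2000 bytes long, this returns `([2], [1, 3])`: page 2 belongs to the texture alone, while pages 1 and 3 may also hold other data.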
This brings us to the heart of the issue, the locking of these special pages. In the original implementation of the texture cache, if the former texture had to be unlocked for flushing, it would need to “chain” and unlock the other one as well (since they share that page). If you remember how the locks worked, this also meant marking that texture as out-of-sync, and flushing it if necessary. Given this, the chain would continue further and further, up until a page that does not contain another texture.
This is what we call “chained invalidation”, and it was a really common occurrence at the time, with “invalidation chains” sometimes spanning as long as 50+ textures. This resulted in pointless flushing and re-uploading of those textures on their next use (especially since most of these flushes would be hard faults), which was a huge drain on performance.
The mode that uses this full version of chained invalidation is called full-range protection.
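A rough way to picture how such a chain forms is to treat "shares a page with" as a link between textures and follow the links outward from the texture being unlocked. The sketch below does exactly that; `pages_of` and `invalidation_chain` are hypothetical helpers, and the model ignores everything except page overlap.

```python
from collections import deque


def pages_of(start: int, size: int, page_size: int = 4096) -> set:
    """Set of page indices touched by [start, start + size)."""
    return set(range(start // page_size, (start + size - 1) // page_size + 1))


def invalidation_chain(textures: dict, first: str, page_size: int = 4096) -> list:
    """textures maps a name to (start, size). Returns every texture that
    gets unlocked when `first` is unlocked under full-range protection,
    in the order the chain reaches them."""
    pages = {name: pages_of(s, z, page_size) for name, (s, z) in textures.items()}
    chain, seen, queue = [first], {first}, deque([first])
    while queue:
        current = queue.popleft()
        for name in textures:
            # A shared page drags the neighbouring texture into the chain.
            if name not in seen and pages[name] & pages[current]:
                seen.add(name)
                chain.append(name)
                queue.append(name)
    return chain


textures = {
    "A": (0x0000, 0x1800),  # pages 0-1
    "B": (0x1800, 0x1000),  # pages 1-2, shares page 1 with A
    "C": (0x2800, 0x0800),  # page 2, shares page 2 with B
    "D": (0x4000, 0x1000),  # page 4, isolated
}
```

With these four made-up textures, unlocking "A" also unlocks "B" and then "C", even though "A" and "C" share no page directly; only "D" is left alone.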
Full-range protection mode vs. conservative protection mode
To fight this phenomenon, kd-11 implemented another mode called conservative protection mode. This mode protects only those pages that are fully “covered” by a given texture. The only exception to this is when a texture doesn’t span across any pages fully, in which case we protect the page that contains the majority of the texture.
This means that those problematic mixed pages in the situation described previously would remain unprotected. This poses another issue however: Since locking tells us whether the original copy within the guest VRAM has changed or not, not protecting the head/tail of them prevents us from knowing this with certainty. So, to not lose this ability, he implemented a trick where the first and last 4 bytes of these pages would be rewritten with a magic number. This way, when the texture cache wanted to reuse either of the textures and those magic numbers were not present anymore, it meant we needed to recache that texture.
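The two halves of conservative mode, choosing which pages to protect and detecting changes in the unprotected head and tail, can be sketched as follows. Everything here is illustrative: the function names, the sentinel value 0xDEADBEEF, and the choice to place it at the first and last 4 bytes of the texture's own range are assumptions, and the sketch does not address how the overwritten bytes would be preserved.

```python
import struct

MAGIC = struct.pack("<I", 0xDEADBEEF)  # hypothetical sentinel value


def conservative_pages(start: int, size: int, page_size: int = 4096) -> list:
    """Pages to protect: the fully covered ones, or, if there are none,
    the single page that holds most of the texture."""
    end = start + size
    first_full = -(-start // page_size)   # round start up to a page index
    last_full = end // page_size          # exclusive
    full = list(range(first_full, last_full))
    if full:
        return full

    def overlap(page: int) -> int:
        return min(end, (page + 1) * page_size) - max(start, page * page_size)

    touched = range(start // page_size, (end - 1) // page_size + 1)
    return [max(touched, key=overlap)]


def stamp(memory: bytearray, start: int, size: int) -> None:
    """Write the sentinel at both ends of the texture's range."""
    memory[start:start + 4] = MAGIC
    memory[start + size - 4:start + size] = MAGIC


def is_intact(memory: bytearray, start: int, size: int) -> bool:
    """False means something wrote over an unprotected end: recache."""
    return (memory[start:start + 4] == MAGIC
            and memory[start + size - 4:start + size] == MAGIC)
```

A texture at 0x1800 of length 0x2000 gets only page 2 protected, whereas a small texture at 0x0F00 of length 0x400, which covers no page fully, gets page 1 because three quarters of it live there.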
While this solution worked surprisingly well for a long time, it wasn’t perfect, and sometimes texture corruptions would arise. This is where ruipin came in, modifying and re-enabling the full-range protection mode in such a way that would give us all the benefits, and none of the downsides.
Chaining with exclusions
The way he changed the original concept was that these problematic pages will no longer chain and get unprotected, as long as at least one of the textures they contain is present in the cache and is in-sync (in which case that texture is “excluded” from the chain). As a result, we are able to keep the performance edge of the conservative mode, since the pointless flushes are no more. This process is known as chaining with exclusions.
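One simplified reading of this rule is sketched below: when a texture is unlocked, each of its pages is released only if no other cached, in-sync texture lives on that page. The function `pages_to_unprotect` and the tuple layout are hypothetical and leave out the lock-type handling of the real implementation.

```python
def pages_to_unprotect(target: str, sections: dict, page_size: int = 4096) -> list:
    """sections maps a name to (start, size, in_sync). Returns the pages
    of `target` that may actually be unprotected."""

    def pages(name: str) -> set:
        start, size, _ = sections[name]
        return set(range(start // page_size,
                         (start + size - 1) // page_size + 1))

    released = []
    for page in sorted(pages(target)):
        # Any other in-sync texture on this page is "excluded" from the
        # chain: the page stays protected and that texture stays valid.
        excluded = [name for name, (_, _, in_sync) in sections.items()
                    if name != target and in_sync and page in pages(name)]
        if not excluded:
            released.append(page)
    return released
```

If texture "A" spans pages 0 and 1 and an in-sync texture "B" starts on page 1, unlocking "A" releases only page 0; if "B" were already out-of-sync, both pages would be released.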
The only issue left was then handling the case of the first texture being “No-Access” protected, and the other being “Read-Only” protected. If the first texture’s protection was applied later, this meant having the head of the second as “No-Access”, and the rest of it as “Read-Only”. This was obviously wrong, so a fallback technique was implemented that takes care of restoring the correct locks on such shared pages:
As simple as these ideas might sound in the face of everything discussed, implementing them is what actually took up about half of the lines changed. With the refactors and other optimizations in place, this solution now works as fast as the conservative protection mode, while also ensuring the absence of texture corruptions and other related issues.
ARL Opcode Fixes (#5125)
Moving on to kd-11’s changes, let’s start with the most apparently impactful one first. As a bit of an exposition, address registers are special integer registers in the GPU used for indexing operations. RPCS3 previously handled ARL opcodes incorrectly, as the index encoding was being used both for indices and destinations. This resulted in games breaking in spectacular ways.
kd-11 reverse engineered the ARL opcodes used and corrected the destination encoding, fixing a plethora of graphical issues affecting many titles on the way. To get a better idea of just how much of a visual impact this fix made, check out the following video:
Vertex data streaming fixes (#5047)
This other PR of his has fixed vertex data streaming by restructuring the passing of vertex data via rsx registers, and by unifying how register data is interpreted.
Depending on how some of the other registers are configured, the ordering of memory can be manipulated to have reversed byte order. The new code now handles all transformations at the point it touches a register, which helps keep the stream contents correct if the configuration registers are changed immediately afterwards.
This is different from the older approach of trying to figure out which layout order was used during the drawing itself, as that messed up if the configuration was changed multiple times before the drawing began.
These changes have also fixed wrong font colors, missing darkened backgrounds in pause menus and broken fade-in and fade-out transitions in Insomniac titles (such as Ratchet & Clank 1, 2, 3, QfB, Resistance 1, 3) and in Deadlocked/Gladiator.
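The difference between transforming vertex data when a register is written and working out the byte order at draw time can be shown with a toy command stream. The command format, the function names and the single "swap" flag are all invented for this sketch and do not mirror the real RSX registers.

```python
def bswap32(value: int) -> int:
    """Reverse the byte order of a 32-bit word."""
    return int.from_bytes(value.to_bytes(4, "big"), "little")


def transform_on_write(commands: list) -> list:
    """New approach: apply the byte order that is active at the moment
    each data word is written."""
    swap, stream = False, []
    for kind, value in commands:
        if kind == "config":
            swap = value
        else:
            stream.append(bswap32(value) if swap else value)
    return stream


def transform_at_draw(commands: list) -> list:
    """Older approach: store raw words, then apply whatever configuration
    happens to be active when the draw begins."""
    swap, raw = False, []
    for kind, value in commands:
        if kind == "config":
            swap = value
        else:
            raw.append(value)
    return [bswap32(v) if swap else v for v in raw]


commands = [
    ("config", True), ("data", 0x11223344),
    ("config", False), ("data", 0xAABBCCDD),
]
```

For this stream, `transform_on_write` swaps only the first word, while `transform_at_draw` sees the final configuration and leaves the first word unswapped, which is the kind of mix-up that occurs when the configuration changes before the draw.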
Custom Shader Loading Screen Background (#4942)
Lastly, we have more of an aesthetic improvement on our plate. In late July, our github tracker received a feature request by user vsub, who brought up an interesting idea: what if the PIC1.PNG (or ICON0.PNG) file, available among all games’ files, was used as a background of the shader loading screen?
Not only did our regular contributors like this idea, so did many of our users. This quickly led to a PR being opened by kd-11, and after having it temporarily put on hold for two months, we finally had it merged to master. In addition to the original idea, kd-11 also added a blurring and darkening filter that users can freely fiddle around with, as well as the option to turn the whole feature off.
Also, with the modification of said files, setting any other custom picture as a background is also possible. There were other changes/additional options proposed as well, but they are left to other interested developers for the time being. If you are one of those interested people, feel free to contribute!
For the longest time, Afro Samurai was a difficult title to emulate correctly, showing unusually low performance and multiple graphical issues. After intense debugging, kd-11 finally stumbled upon the reason: to display objects with its stylish black outlines, the game renders all geometry twice! That’s right, in order to display an object’s outline, a slightly wider black replica would be rendered behind the actual object. This has now been handled correctly, making this title fully playable even on low-end hardware!
Insomniac Games titles – Ratchet and Clank & Resistance
Four more R&C titles are now ingame thanks to the fixes provided by our regular contributor eladash: Tools of Destruction, Into The Nexus, All 4 One, and Full Frontal Assault! In addition to this, the vertex data-streaming changes by kd-11 have fixed various menu and HUD issues, broken fade-in/out transitions and flickering in Quest for Booty and Tools of Destruction.
R&C games weren’t the only ones to benefit from these PRs, as the studio’s other series, the Resistance, has also seen its entries improved:
The Last of Us
Thanks to the great debugging efforts of eladash and the fix from kd-11, The Last of Us can now finally go ingame without a save file, thus, officially counting as an ‘Ingame’ title. Not only that, but Neko’s flat supermemory implementation also improved the game’s performance significantly. While the game is still a long shot away from being actually playable, our developers continue to make great strides in improving its performance and stability.
Uncharted 2: Among Thieves
One of the most popular games on the PS3, Uncharted 2: Among Thieves, has reached the ingame status this month, as one of our users managed to get past the first chapter:
Uncharted 3: Drake’s Deception
Just like The Last of Us, Uncharted 3: Drake’s Deception can now go ingame without requiring a save file! And thanks to an additional fix from hcorion (#5089), the PSN version of the game can now achieve the same!
While the PSN versions were already ingame in January, the graphics were still not quite right, so to speak. Since then, not only were there massive improvements in this department, but the disc versions have also managed to reach ingame!
White Knight Chronicles 2
Moving onto a more obscure title, after eladash fixed a particular edge case in our PPU code (#5119), White Knight Chronicles 2 went from Loadable to Ingame!
Thanks to scribam’s and kd-11’s efforts, almost all issues plaguing this title have been fixed, allowing it to now be Playable!
There have been numerous other pull requests merged during the month that just couldn’t make it to the Major Improvements section. This is why we collected here a list of all, and attached a brief overview to each. Make sure to check out the links provided for them if you are interested, as their Github pages usually uncover further details as well as the code changes themselves. To see this whole list right on Github, click here.
5071 – Removed some unused macros; removed the semaphore_lock and writer_lock classes and started using std::lock_guard instead; changed the semaphore<> interface to Lockable; vastly cleaned up atomic_t<> and named bit-set bs_t<> occurrences; and other misc. cleanups;
5082 – Moved rotate / cntlz / cnttz helpers to Utilities/asm.h; fixed missing parameters and wrong attributes in various places; other cleanups and replaced calls to appropriate std lib counterparts;
5103 – Added option “HLE lwmutex” which improved upon the previous lwmutex/lwcond implementation, as well as added atomic_op/fetch_op overloads with template arguments and removed variadic arguments in atomic_op in preference of capturing lambdas;
5111 – Rewrote the VFS to fix various issues, such as escaping mounted locations with double dots, etc.;
5136 – Added misc. logging-related fixes;
5147 – Added some logging-related optimizations, cleanups and other misc. log fixes;
5173 – Cleaned up the shared_mutex and named_thread classes, implemented lf_queue<> and lf_value<>, fixed cellVdecClose and other misc. stuff;
5188 – Made virtual super-memory “flat”, multiplying the performance of games that relied heavily on the inefficient rsx::weak_ptr implementation; see the further explanation in the Major Improvements section above.
5076 – Fixed a buffer overflow in the FP decompiler and added an assert to prevent this issue reappearing during cropping. Also fixed rare crashes when a shader used more than 24 temporary registers. With these fixes, The Last of Us can now reach ingame without needing a savefile;
5086 – Though (very likely) not hardware accurate, this PR improved the fog behavior in RPCS3 to more closely mimic the expected behavior. Fixed visual regressions from PR 5038 in titles like Super Street Fighter IV: Arcade Edition, Catherine, Diablo 3 and many others;
5099 – Fixed minor texture cache issues encountered by ruipin during the validation of the texture cache routines;
5146 – Fixed issues primarily regarding the blit engine including crashes, graphical issues and incorrect viewport sizes in many games. Additionally, took care of a bug that caused flickering when using OpenGL. These improvements addressed graphical issues in Demon’s Souls, Sonic Generations, Worms Crazy Golf and likely many others;
5162 – Integrated research done by Jarves on how some PS3 system executables handle the display output. While not relevant for games, it is a step in the right direction in handling display flips properly;
5163 – Improved the texture cache predictor by disabling specific predictions if a high enough number of them turned out to be incorrect. This is necessary because a high failure rate can result in crashing. Fixes crashing issues in Ni No Kuni;
5176 – Fixed regressions from the above PR and separated the UI flips from emu flips.
5110 – Removed a hack where draw calls are ignored if the RSX shader address register contains invalid data, specifically 0. This was a workaround for a bug that flushed all rsx registers data when they should not be reset within display flips (which was fixed by PR 5048) and the fact that there is no default register state;
5119 – Fixed atomic loops that use a mixture of a 64-bit reservation load instruction and a 32-bit reservation store instruction. This was considered impossible and thus never looked at until it was found being used by the game Blade Kitten;
5137 – Ignored the highest bit of the RSX’s texture offset registers. This undocumented ‘feature’ was found being used in quite a few Insomniac titles: each time that bit was set and not ignored, an exception was raised as the pointed memory was not mapped. After a bit of testing on a real PS3, ignoring the bit was found to be the correct behavior, and the change fixed Insomniac titles such as Resistance 3 and R&C: Tools of Destruction. This pull request also removed a hack where texture fetches were considered disabled if the parameters were invalid;
5140 – Fixed a regression where the SPU index used in MMIO was higher than the size of the SPU group;
5166 – Fixed vertex count calculation when there are no available arrays to fetch the specified indices – previously this raised a division-by-zero exception;
5174 – Replaced the method used to allocate FIFO commands on the memory, with one that can not cause collisions with memory used in the capture. This fixed the remaining issues and greatly simplified the code;
5181 – Fixed access violations where data injections to the main RSX memory were being used, and fixed initial register states used while replaying. As a result, a lot more games can now be fully ‘captured’ and ‘replayed’ on the emulator. You can read about this debugging tool in May’s progress report;
5185 – Fixed a few inaccuracies where the opcodes of the RSX FIFO command were interpreted incorrectly.
5089 – Improved the stub of sceNpUtilBandwidthTestGetStatus(), making the PSN version of Uncharted 3 go ingame!
5050 – Refactored our Travis build into separate scripts, and moved us to Docker;
5127 – Fixed misc. issues with travis/appimage, fixed a Mac-related build regression, added back xcblib removal and had some minor cleanups done;
5152 – Fixed Travis timing out while cloning git submodules.
4958 – Improved ‘Basic’ keyboard handler. Games that natively support keyboard and mouse should now all be possible to be controlled as intended;
5150 – Fixed an issue with Emu.Restart() where the Play button found in the menu bar would only work for the first run.
5032 – Refactored our CMake build process by creating separate libraries for different components and dependencies;
5145 – Fixed the LLVM recompiler options being disabled in the UI and the Qt resource files not being properly included in the build;
5149 – Fixed an issue where RPCS3 wouldn’t build with -DWITHOUT_LLVM=ON on Linux.
5094 – Added virtual destructors for base types sampled_image_descriptor_base, GSFrameBase, mem_allocator_base and swapchain_base to prevent crashing when using Vulkan on MacOS;
5093 – Fixed cubic view construction;
5092 – Added checks to test if BGRA8 images can be used for blitting, so it wouldn’t trip up upload_image_simple();
5109 – Fixed get_precomputed_render_passes incorrectly checking if the listed color formats even support being rendered to;
5120 – Follow-up to the above PR, fixed a bug where support for VK_FORMAT_UNDEFINED was mistakenly asked from the Vulkan driver.
5085 – Added stub for sys_usbd_event_port_send() in sys_usbd;
5087 – Replaced workarounds for size checking and clamping with their appropriate C++17 std lib equivalents and removed redundant code;
5106 – Added a virtual destructor for default_vertex_cache and made minor improvements to the LLVM recompiler code;
5112 – Improved emulation accuracy of the lvebx, lvehx and lvewx instructions in the PPU decoders;
5117 – Minor Zcull optimisations.
5075 – Fixed a regression with cellMic caused by PR 4467, where cellMicEnd would get stuck while waiting for the cellMic thread to finish, which incorrectly never did. This fixed a hang with Resistance: Fall of Mankind after the menu;
5102 – Added the missing parameter into the check_dev_num() call in cellCameraOpenEx.
If you like in-depth technical reports, early access to information, or you simply want to contribute, consider becoming a patron! All donations are greatly appreciated. RPCS3 now has two full-time coders that greatly benefit from the continued support of over 800 generous patrons.
The illustrations of the “primer on the texture cache” sections were done by ellie.
The report was written by nitrohigito, elad335, HerrHulaHoop, JMC, KoDa and Asinine.