RPCS3 Inside Look: A Deep-Dive into Hardware and Performance Scaling!

Hey everyone! Today we’re going to talk about something a little different from what you’re used to in the progress reports. We’ll go in depth on how certain hardware and software configurations can significantly affect your performance in RPCS3.

Several factors can keep RPCS3 from performing as well as it should, and memory speed is one of them. In our case, RPCS3 stresses memory performance in two main ways:

  1. Cell emulation: SPU access to main memory goes through DMA, which is a beastly exercise to emulate all on its own.
  2. RSX emulation.
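To give a feel for what "goes through DMA" means, here is a minimal Python sketch of an SPU pulling data from main memory into its 256 KB local store and pushing results back out. The names `check_dma` and `SpuDma` are hypothetical, and the size and alignment checks are a simplified reading of the documented MFC rules (transfers of 1, 2, 4 or 8 bytes, or a multiple of 16 bytes up to 16 KB). Real emulation also has to deal with tag groups, DMA lists, ordering and atomic commands, all of which this sketch leaves out.

```python
# Simplified model of SPU <-> main memory DMA (MFC GET/PUT).
# Hypothetical names; not RPCS3's implementation. Tag groups, DMA lists,
# fences/barriers and atomic commands are deliberately left out.

LOCAL_STORE_SIZE = 256 * 1024  # each SPU has 256 KB of local store
MAX_DMA_SIZE = 16 * 1024       # a single transfer moves at most 16 KB


def check_dma(ls_addr: int, ea: int, size: int) -> None:
    """Raise ValueError if a transfer breaks the basic size/alignment rules."""
    if size in (1, 2, 4, 8):
        # Small transfers: naturally aligned, same offset within a 16-byte quadword.
        if ls_addr % size or ea % size or (ls_addr & 0xF) != (ea & 0xF):
            raise ValueError("misaligned small DMA")
    elif size % 16 == 0 and 0 < size <= MAX_DMA_SIZE:
        # Larger transfers: multiples of 16 bytes, 16-byte aligned on both sides.
        if ls_addr % 16 or ea % 16:
            raise ValueError("misaligned DMA")
    else:
        raise ValueError(f"invalid DMA size: {size}")


class SpuDma:
    def __init__(self, main_memory: bytearray):
        self.ls = bytearray(LOCAL_STORE_SIZE)  # this SPU's local store
        self.mem = main_memory                 # shared main memory
        self.bytes_moved = 0                   # running total, for profiling

    def _validate(self, ls_addr: int, ea: int, size: int) -> None:
        check_dma(ls_addr, ea, size)
        # Bounds checks also stop bytearray slice assignment from resizing buffers.
        if ls_addr < 0 or ls_addr + size > len(self.ls):
            raise ValueError("local store address out of range")
        if ea < 0 or ea + size > len(self.mem):
            raise ValueError("effective address out of range")

    def get(self, ls_addr: int, ea: int, size: int) -> None:
        """MFC GET: copy main memory into local store."""
        self._validate(ls_addr, ea, size)
        self.ls[ls_addr:ls_addr + size] = self.mem[ea:ea + size]
        self.bytes_moved += size

    def put(self, ls_addr: int, ea: int, size: int) -> None:
        """MFC PUT: copy local store out to main memory."""
        self._validate(ls_addr, ea, size)
        self.mem[ea:ea + size] = self.ls[ls_addr:ls_addr + size]
        self.bytes_moved += size
```

Every one of these copies is a trip through host memory, which is why slow RAM shows up directly in SPU-heavy games.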

RSX memory operations fall into two major categories: Upload and Download. Upload operations transfer textures, shaders, and shader data (vertex buffers and other register configuration tables) from the host CPU to the host GPU. The GPU driver usually optimizes this process to run asynchronously and with heavy use of batching. It is bandwidth-heavy, since the data sets are rather large and everything has to travel over PCI-E. We do a lot to hide this cost, and for the most part it works well, but if your memory is too slow or you are stuck on an older PCI-E revision, the transfer lag can have a huge performance impact, especially when a GPU sync is required.
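A quick back-of-the-envelope calculation shows why the PCI-E revision matters. The sketch below uses the theoretical per-direction bandwidth of a x16 link for each revision. The helper name `transfer_ms` is hypothetical, and treating the effective rate as the slower of the link and system memory (`ram_gbps`) is a simplifying assumption. Latency, driver overhead and real-world protocol efficiency are ignored, so actual transfers will be slower than these best-case figures.

```python
# Theoretical per-direction bandwidth of a PCI-E x16 link, in GB/s (1 GB = 10**9 bytes).
PCIE_X16_GBPS = {
    "1.x": 4.0,    # 2.5 GT/s per lane, 8b/10b encoding
    "2.0": 8.0,    # 5 GT/s per lane, 8b/10b encoding
    "3.0": 15.75,  # 8 GT/s per lane, 128b/130b encoding (rounded)
    "4.0": 31.5,   # 16 GT/s per lane, 128b/130b encoding (rounded)
}


def transfer_ms(num_bytes, revision, lanes=16, ram_gbps=None):
    """Best-case time in milliseconds to move num_bytes across the bus.

    Simplification: the effective rate is the slower of the PCI-E link and
    system memory (ram_gbps, if given); latency and driver overhead are ignored.
    """
    gbps = PCIE_X16_GBPS[revision] * lanes / 16
    if ram_gbps is not None:
        gbps = min(gbps, ram_gbps)
    return num_bytes / (gbps * 1e9) * 1000.0


if __name__ == "__main__":
    surface = 1280 * 720 * 4  # one 720p RGBA8 surface, in bytes
    for rev in ("1.x", "2.0", "3.0", "4.0"):
        print(f"PCI-E {rev} x16: {transfer_ms(surface, rev):.3f} ms")
```

For a single 720p RGBA8 surface (about 3.7 MB) this works out to roughly 0.92 ms on PCI-E 1.x x16 versus roughly 0.23 ms on 3.0 x16, before any stalls. A game can move many such buffers per frame, and the gap widens further on x8 or x4 links.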

Download operations, on the other hand, transfer textures and arbitrary data from the host GPU back to the host CPU. These have very serious implications for performance because we can’t really hide the memory latency of the transfer. Most of the time Cell accesses the memory in question without warning, which means we have to stop everything until the GPU has processed the information we need, and then read all that data back over PCI-E while our CPU thread is blocked. This is exactly why the ‘buffer options’ are disabled by default: most games might trample on older GPU-resident data without ever needing to read it back, in which case we can simply pretend nothing existed for that memory block and avoid the hard stop. It also means running RPCS3 with your GPU usage maxed out, or close to it, is not advisable, since your GPU will not be able to respond quickly enough to these random synchronization requests.

There is still a lot of optimization that could be done in this area, however. A very good predictor that can guess with high accuracy whether the CPU will access a memory block soon could start queueing up the GPU instructions before the access happens.
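The trade-off behind the buffer options can be modeled in a few lines. This is a hypothetical Python toy, not RPCS3 code: with readback disabled, GPU-resident data is simply dropped when the emulated CPU touches that block; with it enabled, an unpredicted read costs a hard stall (`stall_ms` is an arbitrary placeholder, not a measured figure). For the predictor it borrows a per-block 2-bit saturating counter from branch prediction, an assumed stand-in for whatever a real predictor would use: blocks that keep getting read back have their readback queued as soon as the GPU writes them, while blocks that get trampled without being read are demoted.

```python
from dataclasses import dataclass


@dataclass
class GpuBlock:
    data: bytes                     # latest contents, resident only on the GPU
    readback_started: bool = False  # True if the copy was queued ahead of time


class ReadbackModel:
    """Toy model of CPU reads hitting memory the GPU last wrote (hypothetical)."""

    def __init__(self, readback_enabled: bool, stall_ms: float = 2.0):
        self.readback_enabled = readback_enabled
        self.stall_ms = stall_ms  # placeholder cost of one hard GPU sync
        self.gpu_blocks = {}      # addr -> GpuBlock not yet seen by the CPU
        self.cpu_memory = {}      # addr -> bytes visible to the emulated CPU
        self.counters = {}        # addr -> 2-bit saturating counter (0..3)
        self.total_stall_ms = 0.0

    def gpu_write(self, addr: int, data: bytes) -> None:
        if addr in self.gpu_blocks:
            # Previous GPU data was trampled without a CPU read: demote the block.
            self.counters[addr] = max(0, self.counters.get(addr, 0) - 1)
        block = GpuBlock(data)
        # Counter in the upper half predicts "will be read soon":
        # queue the readback now, before the CPU asks for it.
        if self.readback_enabled and self.counters.get(addr, 0) >= 2:
            block.readback_started = True
        self.gpu_blocks[addr] = block

    def cpu_read(self, addr: int) -> bytes:
        block = self.gpu_blocks.pop(addr, None)
        if block is None or not self.readback_enabled:
            # No GPU data, or readback disabled: pretend the GPU copy never existed.
            return self.cpu_memory.get(addr, b"")
        if not block.readback_started:
            # Hard stop: wait for the GPU, then copy the data back over the bus.
            self.total_stall_ms += self.stall_ms
        self.cpu_memory[addr] = block.data
        self.counters[addr] = min(3, self.counters.get(addr, 0) + 1)
        return block.data


if __name__ == "__main__":
    model = ReadbackModel(readback_enabled=True)
    for frame in range(5):
        model.gpu_write(0x100, bytes([frame]))
        model.cpu_read(0x100)
    print(f"stalled for {model.total_stall_ms} ms")
```

In the demo, five frames each write a block on the GPU and read it from the CPU. The first two reads stall and the remaining three are queued early, so the model reports 4.0 ms of stalls with the default `stall_ms=2.0`. With `readback_enabled=False` there are no stalls at all, but the CPU never sees the GPU's data.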
