Introducing RPCS3 for arm64

As of today, we are officially announcing arm64 architecture support for RPCS3! This major feature has been heavily worked on over the last few months, and allows RPCS3 to natively run on Linux, macOS and Windows arm64 devices!

Index

1. Background
2. Supporting arm64 on our CPU LLVM Recompilers
3. The 16K memory page-size problem
4. Pull requests
5. Challenging the limits of PlayStation 3 emulation
6. Platform showcases
7. Available emulator downloads
8. A note about other mobile platforms
9. Special thanks to the developers


Background

Work on arm64 support started in 2021, after the Apple M1 series launched, when we began exploring running RPCS3 on arm64. Thanks to the efforts of RPCS3 core developer Nekotekina, we got a basic build compiled for arm64 in December 2021, although it wasn’t able to run anything. After some cleanup from kd-11 to get RSX working on arm64 and more enhancements from Nekotekina, we finally got samples running and rendering by January 2022. Only the PPU/SPU interpreters worked correctly, resulting in very poor performance, but it was the first step towards getting RPCS3 running correctly on future arm64 devices. Initial efforts were limited to Linux support, as it is widely available and accessible to developers without hardware lock-in. A few months later, contributor Jeff Guo submitted a series of patches bringing the x64 macOS branch up to speed. Some major strides for macOS were made at this stage, such as understanding how the new JIT mechanism works. An attempt was made to get LLVM working, but by late 2023 we still only had working arm64 builds without LLVM support.


Supporting arm64 on our CPU LLVM Recompilers

RPCS3 graphics developer kd-11 had been working behind the scenes on improving arm64 support on Asahi Linux since 2021. Initially, there was no access to graphics drivers so performance did not matter, but things kept improving on that front over the years to the point where Asahi could actually be a daily driver. The arm64 LLVM integration work was revisited in late 2023. The initial target was Linux, as any issues could be tracked down to source level for debugging. kd-11 quickly uncovered that the reason LLVM did not work with the RPCS3 JIT was how the emulator handles guest code. Since RPCS3 was initially written with x86 architectures in mind, some architectural decisions hinged on x86-specific behavior. One of them was that every few blocks, the guest would exit back to the host to handle state checks as well as unwind the stack. RPCS3 uses tail-call-optimized frames for all the JIT blocks, but x86 consumes stack space during any call, and that space needs to be reclaimed through a ret chain or by reloading the stack pointer. This is different from arm64, which uses a dedicated register for “linking” calls. A ret chain does not exist with stackless frames on arm64. When the JIT code executes the first return, the CPU will enter an infinite loop jumping to its current position and hang, because the link register is not updated with the previous position.
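To make the failure mode concrete, here is a minimal sketch of a toy machine with a link register. Everything in it is made up for illustration (the instruction names `bl`, `ret`, `save_lr`, `restore_lr` and the `run` helper); it is not RPCS3 or LLVM code, only a model of why a frameless block that makes a call ends up returning to itself.

```python
def run(program, max_steps=100):
    """Execute a toy program; return 'halted' or 'hung'."""
    pc, lr, stack = 0, None, []
    for _ in range(max_steps):
        op, *args = program[pc]
        if op == "halt":
            return "halted"
        if op == "bl":              # branch with link: LR = return address
            lr, pc = pc + 1, args[0]
        elif op == "ret":           # return: jump to whatever LR holds
            if lr == pc:            # LR points at this very ret: stuck forever
                return "hung"
            pc = lr
        elif op == "save_lr":       # what a real stack frame would do
            stack.append(lr)
            pc += 1
        elif op == "restore_lr":
            lr = stack.pop()
            pc += 1
    return "hung"

# Frameless block: the inner call overwrites LR, so the block's own
# ret jumps back to itself instead of to the gateway.
frameless = [
    ("bl", 2),      # 0: gateway enters the block
    ("halt",),      # 1: gateway resumes here on a correct return
    ("bl", 4),      # 2: block calls a helper, LR becomes 3
    ("ret",),       # 3: block returns to LR == 3, i.e. to itself
    ("ret",),       # 4: helper returns to the block
]

# Same block with LR saved around the call.
framed = [
    ("bl", 2),
    ("halt",),
    ("save_lr",),
    ("bl", 6),
    ("restore_lr",),
    ("ret",),
    ("ret",),
]

print(run(frameless), run(framed))
```

On x86 the equivalent frameless code works because `call` pushes the return address to the stack rather than overwriting a register, which is the x86-specific behavior the original design leaned on.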

Naturally, there are plenty of ways around this problem, as with all things in computer science. We explored some of these approaches in early 2024, mainly to achieve a proof of concept. The first one was easy: provide a minimal stack for every function call and mimic x86 behavior. We keep the frames themselves stackless but track the previous call address and reload it to unwind. Unfortunately, this doesn’t work reliably, since the basic blocks assume all stack space is scratch and will clobber the return address every once in a while, leading to random crashes. Several other variations of this were attempted, including reclaiming some registers for a shadow call stack pointer. They all work, but have different upsides and downsides. For example, some of them required modifying LLVM to accommodate RPCS3’s JIT architecture, which wasn’t ideal as it required us to maintain a fork of LLVM. Another quick solution is to just replace all returns with a far jump to some host gateway that fixes up the register state so that we can exit. This also proved challenging, since our JIT blocks can loop between blocks and recurse. The performance hit became too high for such a strategy to work.

By late February 2024, kd-11 had some samples running with a customized LLVM fork, and made further improvements over March that resulted in homebrew running on arm64. After a brief hiatus, kd-11 made a breakthrough, and some commercial titles were able to run on arm64 with PPU LLVM by late June 2024! The proof-of-concept stage was now over. It was time to pick a winning strategy and ship the product.

In the end, kd-11 settled on an IR transformer that analyzes the JIT IR generated for x86-64 and makes tweaks to accommodate arm64’s requirements. The idea is simple – generate IR once for a reference architecture (in this case amd64) and then transform it to target other architectures as required. LLVM makes this kind of work possible since everything is represented as an IR before compilation. LLVM basic blocks have some constraints with regards to flow control, and since our JIT blocks are always callable from a gateway, there is an ABI of sorts in place with regards to which registers will hold what value – e.g. the guest thread context always lives in the first register passed to the block. This setup was extended to provide extra information at compile time that the transformer can piggyback on to correctly identify the location of objects. We don’t need to store the entire call chain, for example; we just need to know the location of the closest host gateway. This approach worked really well, and by late July most commercial games were booting and running fairly well, but there were more problems that required addressing.


The 16K memory page-size problem

By default, x86-based operating systems provide a page granularity of 4KiB. The PS3 also natively supports 4KiB granularity and expects the operating system to align memory allocations accordingly. However, most arm64 platforms today come with a minimum page granularity of 16KiB. This usually means better memory performance, as modern devices use more memory with larger objects.
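The size mismatch is easy to see with a little arithmetic. The sketch below assumes, for illustration only, that every guest allocation ends up backed by whole host pages; `align_up` and `host_footprint` are hypothetical helper names, not RPCS3 functions.

```python
GUEST_PAGE = 4 * 1024    # page size the PS3 software expects
HOST_PAGE = 16 * 1024    # minimum page size on many arm64 hosts

def align_up(size, page):
    """Round size up to the next multiple of page."""
    return (size + page - 1) // page * page

def host_footprint(guest_size):
    """Return (bytes the guest asked for, bytes the host must back)."""
    guest_bytes = align_up(guest_size, GUEST_PAGE)
    host_bytes = align_up(guest_bytes, HOST_PAGE)
    return guest_bytes, host_bytes

# A 20 KiB guest allocation (five 4 KiB pages) needs two 16 KiB host pages.
print(host_footprint(20 * 1024))
```

Under this model a single 4KiB guest page always costs a full 16KiB host page, and any protection change on the host side necessarily covers four guest pages at once.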

Support for 16K pages was added to the emulator in 2021 and had been fairly well tested with the interpreters before this work. Most games don’t notice that the page sizes are 4x larger than what they requested, and this made the porting efforts much easier than we expected on that front. However, when we finally got LLVM running, it became clear that the performance impact can be quite massive. Emulators like RPCS3 use dirty page tracking to handle cache evictions for textures. While modern games have large textures that take up a lot of space, the PS3 is a different beast. Not only are many textures tiny, but all user-facing GPU objects are often represented as textures. This includes texel buffers, uniform buffers, shader storage buffers, etc. The sizes can vary wildly here, and evicting a single 16K page can cost us a lot in re-uploading resources every draw call. Fortunately, not many games seem to be affected by this, but there are some unfortunate outliers with very poor performance on 16K platforms.
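A simplified model shows why larger pages hurt here. The code below is not RPCS3’s texture cache; `invalidated` and the texture layout are invented for illustration, and the only assumption is that a write marks the whole containing page dirty and every cached object touching that page must be re-uploaded.

```python
def invalidated(textures, write_addr, page_size):
    """Names of cached objects that share a page with the written address."""
    dirty_page = write_addr // page_size
    hit = []
    for name, addr, size in textures:
        first = addr // page_size
        last = (addr + size - 1) // page_size
        if first <= dirty_page <= last:
            hit.append(name)
    return hit

# Four small 4 KiB objects laid out back to back (hypothetical layout).
textures = [
    ("tex0", 0x0000, 0x1000),
    ("tex1", 0x1000, 0x1000),
    ("tex2", 0x2000, 0x1000),
    ("tex3", 0x3000, 0x1000),
]

# The game writes somewhere inside tex1.
print(invalidated(textures, 0x1800, 4 * 1024))    # 4K pages: only tex1
print(invalidated(textures, 0x1800, 16 * 1024))   # 16K pages: all four
```

With 4K pages only the object that was actually written is evicted, while with 16K pages its three neighbours are evicted as well, even though their contents did not change.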


Pull requests

Early work

#11315 – Initial Linux aarch64 port (PPU only)
This is the original work from Nekotekina that brought basic ARM support to RPCS3, in early 2022. Only PPU was initially supported and LLVM was not supported at all. This work is mentioned here for historical purposes, as it is the pillar of the current ARM port. This wasn’t the only pull request, of course, and there were many follow-ups polishing compatibility on ARM devices.

  • Update asmjit dependency (aarch64 branch)
  • Disable USE_DISCORD_RPC by default
  • Dump some JIT objects in the rpcs3 cache directory
  • Add SIGILL handler for all platforms
  • Fix resetting zeroing denormals in thread pool
  • Refactor most v128:: utils into global gv_** functions
  • Refactor PPU interpreter (incomplete), remove “precise”
    • Instruction specialisations with multiple accuracy flags
    • Adjust calling convention for speed
    • Removed precise/fast setting, replaced with static
    • Started refactoring interpreters for building at runtime JIT (due to poor compiler optimisations)
    • Expose some accuracy settings (SAT, NJ, VNAN, FPCC)
    • Add exec_bytes PPU thread variable (akin to cycle count)
  • PPU LLVM: fix VCTUXS+VCTSXS instruction NaN results
  • SPU interpreter: remove “precise” for now (extremely non-portable)
    • As with PPU, settings changed to static/dynamic for interpreters.
    • Precise options will be implemented later
  • Fix termination after fatal error dialog

LLVM bring-up pull requests

#15182 – Minor arm64 improvements
This was the first arm64 pull request of 2024, merged on February 11th, and allowed Arkedo games to run fine on arm64 with PPU Interpreter and SPU Recompiler LLVM. Almost all other games still did not work properly. SPU LLVM would hang very early after booting up the games and audio did not work as a result.

#15869 – gl: Fixes for wayland (asahi linux, aarch64)
Fast forward through several months of internal work: kd-11 started wrapping up and prepared to open a series of pull requests to upstream arm64 support, with a second pull request opened on August 1st. This smaller one allowed OpenGL to start working on Asahi Linux with the arm64 Linux builds of RPCS3.

#15904 – aarch64/cpu: Add LLVM support
This PR introduced the initial draft of the LLVM transforms that allowed JIT to work on arm64. The work was developed and tested entirely on Asahi Linux.

#15915 – rsx: Fix fragment constants decoding for non-x86 platforms
Some RSX functionality was broken when running on non-x86 CPUs and needed touching up. Since this was high performance code, it had been rewritten in SSE and AVX for x86 and the cross-platform fallback was untested.

#15925 – aarch64/llvm: Handle processing of leaf nodes
This PR improved the LLVM transformer to handle “leaf nodes”. A leaf node is a compute-only basic block that does not make any external tail calls. These types of blocks will attempt to return to the caller after executing and were not correctly handled in the original work. They are surprisingly rare, as most normal blocks will end up calling a syscall or library function at some point, which normally breaks the traversal chain by starting a gateway exit.

#15962 – aarch64/llvm: Improve compatibility
This PR reworked the transformer a fair bit. Initially, some optimizations were done by eliminating tail calls and replacing them with static branches. This worked very well and performance was good, but LLVM could sometimes clobber its own registers even when clobber information was provided. We restored tail calls to some capacity here, and a micro-assembler was introduced to write proper inline assembly blocks for injecting into the IR with proper dependency resolution.

#15971 – aarch64 fixes
One challenge we faced with x64 support was decoding the reason for a segfault. A table was written that basically had instruction patterns for matching whenever a segfault is received so that we can determine in a cross-platform manner why the fault was reported. This information was very incomplete for arm64 and many instructions were being miscategorized.

#15974 – Rework aarch64 signal handling
Replaced the instruction decoding tables from the previous PR with a proper parsing of the machine context passed in to Unix signal handlers. This is a lot more reliable as we can check the CPU register state at the time of the fault and know for sure why a fault occurred.
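As an illustration of what reading the fault reason from register state can look like, the sketch below decodes an arm64 Exception Syndrome Register (ESR) value. The bit layout follows the Arm architecture (exception class in bits 31:26, the “Write not Read” flag in bit 6 for data aborts), but the assumption that ESR is the field of interest, and the `classify_fault` helper itself, are ours for illustration and are not taken from the PR.

```python
EC_DATA_ABORT_LOWER_EL = 0x24   # data abort taken from a lower exception level
EC_DATA_ABORT_SAME_EL = 0x25    # data abort taken without a change in level
WNR_BIT = 1 << 6                # "Write not Read" flag, valid for data aborts

def classify_fault(esr):
    """Classify a fault as 'read', 'write' or 'other' from an ESR value."""
    exception_class = (esr >> 26) & 0x3F
    if exception_class not in (EC_DATA_ABORT_LOWER_EL, EC_DATA_ABORT_SAME_EL):
        return "other"
    return "write" if esr & WNR_BIT else "read"

print(classify_fault(0x9200004F))   # data abort with WnR set
print(classify_fault(0x92000007))   # data abort with WnR clear
print(classify_fault(0x82000007))   # instruction abort, not a data access
```

Compared to pattern-matching the faulting instruction, this needs no knowledge of the instruction set at all: the CPU has already recorded whether the access was a load or a store.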

#15981 – aarch64: CPU branding info and misc improvements
This PR introduced some work to help identify arm64 processors. Unlike x86 where you can use CPUID, no textual representation of the CPU name and details exists at all and we have to infer the details from some register values.
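For a sense of what inferring details from register values involves, here is a minimal sketch that decodes the arm64 Main ID Register (MIDR_EL1). The field layout is defined by the Arm architecture; that this is the register used, the tiny lookup tables, and the `decode_midr` and `describe` helpers are assumptions made for illustration, not the code from the PR.

```python
# Small, deliberately incomplete lookup tables (illustrative only).
IMPLEMENTERS = {0x41: "Arm", 0x51: "Qualcomm", 0x61: "Apple"}
PARTS = {(0x41, 0xD0B): "Cortex-A76"}

def decode_midr(midr):
    """Split a MIDR_EL1 value into its architectural fields."""
    return {
        "implementer": (midr >> 24) & 0xFF,
        "variant": (midr >> 20) & 0xF,
        "architecture": (midr >> 16) & 0xF,
        "part": (midr >> 4) & 0xFFF,
        "revision": midr & 0xF,
    }

def describe(midr):
    """Build a human-readable name, falling back to raw numbers."""
    f = decode_midr(midr)
    vendor = IMPLEMENTERS.get(f["implementer"], f"implementer 0x{f['implementer']:02x}")
    part = PARTS.get((f["implementer"], f["part"]), f"part 0x{f['part']:03x}")
    return f"{vendor} {part} r{f['variant']}p{f['revision']}"

# Example value carrying the Cortex-A76 part number.
print(describe(0x414FD0B1))
```

The lookup tables are the weak point of this approach: every new core needs a new entry, which is why unknown parts fall back to printing the raw numbers.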

#15987 – aarch64: Support for apple exceptions
This PR started the porting of the arm64 work to macOS. One advantage of starting with Linux is that the changes are compatible with other unix-like OSes including macOS. On the first build, things just worked, but exceptions needed to be reworked since the machine context is stored in a different format than Linux.

#15992 – macOS: Implement remaining portions for native arm64
Some minor alterations were needed to get the JIT running correctly on macOS. This PR implemented reserving the x18 register on macOS as it is always reserved for Rosetta2 use and is not preserved during syscalls. That solved random crashes where the x18 register would magically change to 0 in the middle of execution and crash. Some extra code was also added to help read the CPU name, number of cores, etc.

#16011 – aarch64/gl: Misc fixes
Maintenance PR fixing some compiler warnings and a bug in our OpenGL backend that caused shadows to break in some titles.

#16022 – aarch64: Support calloc patch blocks
With most of the base functionality done, it was time to start supporting extra features such as game patches. This PR fixed a crash that occurred when a “calloc” style patch instruction was used. These instructions force-inject leaves into the code and add extra jumps instead of returns. This was causing the transformer to generate invalid sequences that led to crashes.

#16035 – aarch64: Fix compilation for windows-on-arm (msys2)
This was the initial Windows-on-ARM support PR. RPCS3 was successfully compiled for Windows arm64 using msys2 and clang. The Visual Studio portions of RPCS3 are not compatible with Microsoft’s ARM64 API, as it is missing too many intrinsics that we need.

#16058 – arm64: Fix remaining issues for Windows on ARM
This PR finally got us working Windows-on-ARM builds that could successfully run demos. Not much testing was done due to the lack of test hardware. While Windows on ARM has been around for over a decade (previously as Windows RT), these devices are largely ignored by the masses and it is difficult to find any testers who run compatible hardware. We expect this to become less of an issue as the new Snapdragon X laptops and tablets become more common.

#16070 – arm64: macOS CI
These changes from nastys introduce the automatically compiled macOS builds for arm64, with auto-updater support.

#16148 – arm64: Linux CI
These changes from kd-11 introduce the automatically compiled Linux builds for arm64.


Challenging the limits of PlayStation 3 emulation

By now you might be thinking: if there’s a Linux arm64 build of RPCS3, which requires an armv8.2-a CPU, at least 8GB of RAM, and an OpenGL 4.3 or Vulkan capable GPU and drivers, what’s stopping it from running on a Raspberry Pi 5? How far can we challenge the limits of emulating the console known for being the most resource demanding to emulate, even 18 years after its release?

To put this to the test and support arm64 development, AniLeo acquired a Raspberry Pi 5 device. This low-cost arm64 device costs around £68/€93/$85, plus £4/€6/$5 for an Active Cooler.

The device was set up with Arch Linux ARM to make use of all the latest packages, so that RPCS3 could be compiled on the device itself. It was also overclocked to squeeze out more performance: the CPU to 2900MHz (+400MHz) and the GPU to 1060MHz (+100MHz).

#15978 – vk: Support v3dv, allow creating device without textureCompressionBC
We started by trying to run games on the default Vulkan renderer, but this failed as the Mesa v3dv driver for Broadcom GPUs is missing full textureCompressionBC support due to hardware limitations. However, not all was lost, since the BC1-BC3 formats are supported, and that’s all that RPCS3 needs.
We then added a workaround to RPCS3 to allow the Vulkan renderer to boot even when the v3dv driver does not report this feature as supported. We could now boot games through Vulkan.

Unfortunately, the Vulkan renderer would only work for a short time before hanging the entire system and requiring the device to be power cycled. This meant that we had to shift to testing with the OpenGL renderer. Fortunately, the results were great, with Mesa’s v3d OpenGL driver for Broadcom GPUs being able to render all tested games correctly with no visual bugs.

However, initial results from testing dozens of games were not very promising: performance was very low overall and no game seemed to run well, even simple ones. After some investigation, we realised that the Broadcom VideoCore VII GPU in the Raspberry Pi 5 is not only unbelievably weak, but also several times weaker than the PlayStation 3’s own GPU, the RSX. This means that the Raspberry Pi 5 is not capable of rendering these games at 720p.

#15977 – config: Set minimum allowed resolution scale to 25%
After testing with different rendering resolutions, we saw a great performance boost as the rendering resolution was lowered, since the bottleneck shifted away from the GPU and back to the CPU.
Unfortunately, even 360p rendering of these 3D games proved too much for this GPU. We then decided to settle on rendering games at the PlayStation Portable screen’s resolution, 272p, by setting the resolution scale to 38%.
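The effect of the resolution scale on GPU load can be worked out directly. The sketch below assumes a 1280x720 base resolution and simple truncating integer scaling; the exact rounding RPCS3 applies is not stated here, so `scaled` is a hypothetical helper and the results are approximate.

```python
BASE_WIDTH, BASE_HEIGHT = 1280, 720   # assumed 720p base resolution

def scaled(width, height, percent):
    """Scale both dimensions by a percentage, truncating to whole pixels."""
    return width * percent // 100, height * percent // 100

def pixel_share(percent):
    """Fraction of the base pixel count rendered at the given scale."""
    w, h = scaled(BASE_WIDTH, BASE_HEIGHT, percent)
    return (w * h) / (BASE_WIDTH * BASE_HEIGHT)

for percent in (100, 50, 38, 25):
    w, h = scaled(BASE_WIDTH, BASE_HEIGHT, percent)
    print(f"{percent:3d}% -> {w}x{h}, {pixel_share(percent):.0%} of the pixels")
```

Because both dimensions shrink, the pixel count falls with the square of the scale: at 38% the GPU shades only about 14% of the pixels it would at 720p, which is why the bottleneck moves back to the CPU.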

Here we can see God of War 1 tested at both 720p (PS3) and 273p (PSP) resolutions. Running the game at its PS3 resolution of 720p, we can see it struggles to render at around 10 FPS. With PSP resolution, however, it runs at a smooth 30 FPS. With this new piece of data, we were now seeing several games running at Playable performance, which are showcased in the next section.


Platform showcases

macOS arm64 on Apple M1 (macOS Sequoia)

RPCS3 x64 build running under Apple’s Rosetta2 x64 -> arm64 translation layer on the left;
RPCS3 arm64 build running natively on Apple Silicon on the right, showing a huge performance improvement!

Linux arm64 on Apple M1 (Asahi Linux)

Unsurprisingly, RPCS3 runs very well on the M1 when using Asahi Linux. While the Vulkan driver for Asahi is still not ready, we can make use of OpenGL which can render games with decent performance. Unfortunately OpenGL is still not as performant as Vulkan so games still run faster under macOS when using the MoltenVK layer.

Linux arm64 on Raspberry Pi 5 (Arch Linux)

Although it might come as a surprise to many, several PS3 games are Playable on Raspberry Pi 5, albeit at a lower resolution. While a lot of Playable games will not perform well on Raspberry Pi 5 due to its hardware performance limitations, we can still push it to its limit to run several games at 30 FPS, rendered at PSP resolution.

Windows arm64 on Windows 10 for ARM

This was the last platform to be supported, for one main reason: kd-11 only had access to macOS and Linux on arm64, and no one else in our team owns any kind of arm64 device that can run Windows. In order to be able to develop this platform, an arm64 virtual machine was temporarily acquired, as physical devices that can run Windows on ARM are expensive.

At first, support was attempted through the MSVC compiler, but due to the many issues encountered, it was deemed that this was only going to be possible by using msys2 and compiling RPCS3 with clang.

A lot of debugging and two pull requests later, we had RPCS3 running on arm64. Due to the lack of testing hardware, only samples were tested. Homebrew games should work without issue, but commercial titles are likely to run into problems due to Windows-on-ARM having a hard requirement for ASLR which doesn’t play too nicely with some aspects of RPCS3’s JIT engine.


Available emulator downloads

Without further ado: you can now download Linux arm64 and macOS arm64 binaries from the Download page on our website. These two versions now join our existing Windows x64, Linux x64, macOS x64 and FreeBSD x64 builds.

Windows arm64 binaries are not distributed at this point. This is due to the low availability of hardware to develop and test with, as well as missing automatic compilation and deployment through GitHub CI. Users on Windows must compile their own binaries locally until we start distributing them.

FreeBSD arm64 binaries are not available. FreeBSD support and the FreeBSD Ports x64 builds are maintained by FreeBSD contributors. We have not worked on or tested any support for this platform, as we do not use it ourselves, but we invite any contributors to add FreeBSD arm64 support if they wish to do so. The work done on the other platforms should make this endeavour a much easier task now.


A note about other mobile platforms

Adding arm64 architectural support is a key step to ensure long term preservation of the PlayStation 3 console, as arm64 CPUs make their way into the conventional desktop and laptop market.

We are aware it comes with its own disadvantages, such as having to deal with toxic users who have harassed other emulator developers in the past, like the brilliant developers behind AetherSX2, or with redistribution of modified builds of an emulator, often violating their FOSS license, while claiming credit for work they didn’t do themselves. This has already happened to other emulator projects that were redistributed in these malicious ways.

We are also aware of the increase of scam applications that portray themselves as a PlayStation 3 emulator while being malware, or that redistribute RPCS3 under a different name and claim it is a new emulator. Some of them are even vetted in Google’s Play Store. Unfortunately Google haven’t acted on the reports we have made, and these scam apps continue to be exposed to users.

We are thus reminding once again that there are no other PS3 emulators that can run PS3 games. Any footage you see is either completely fake, or recorded on a real console, or through RPCS3 itself and distributed under the claim it’s another emulator with a different name.

For these reasons, we are disallowing Android and iOS discussion in our communities. We have no intention of porting RPCS3 to these platforms at this time, so no discussion on these topics is needed. Unfortunately, this blanket decision is ultimately needed due to the high toxicity of several individuals in the Android community who refuse to take no for an answer, some of whom have already been banned from our communities.


Special thanks to the developers

We would like to acknowledge the specific contributions to the development of this big feature. As a reminder, RPCS3 as a whole is made possible thanks to the contributions of many current and past developers and contributors at Team RPCS3 since 2011.

kd-11
– Lead developer for Linux, macOS and Windows arm64 support
– Lead graphics developer for OpenGL and Vulkan support on arm64
– Linux arm64 automated build deployment CI

Nekotekina
– Initial Linux arm64 support, laying the groundwork for the full development of this feature

nastys
– macOS arm64 automated build deployment CI
– macOS arm64 build testing

AniLeo
– Raspberry Pi 5 development support for Linux arm64
– Raspberry Pi 5 / Linux arm64 extensive game testing
– Website back-end development for the arm64 builds

DAGINATSUKO
– Website front-end development for the arm64 builds
– Design of the promotional arm64, Raspberry Pi 5 and Apple Silicon banners

Megamouse
– Bug fixes related to the macOS arm64 build

schm1dtmac
– Bug fixes related to the macOS arm64 build

elad335
– CPU emulation performance optimisations for lower end hardware

This blog post was written by kd-11 and AniLeo.