Gameboy Emulation - Performance Improvement Ideas

We have a lot of gameboy emulation projects, and it would be wise to collect the wisdom of all our brainstorming. It seems that most projects are using peanut-gb as a base, and branching off that to add support for Playdate.

Here are the gameboy emulation projects I am aware of (all based on peanut-gb):

It seems these projects all hit a performance wall around 30-40 fps (the gameboy's true refresh rate is 59.7 Hz).

Discussion on the bottleneck seems to fall into three camps.

  • Interpreter, i.e. CPU emulation. Compilers are notoriously bad at interpreter loops, so it's believable, but on the other hand, the Playdate's CPU is around 150x faster than the gameboy's. Perhaps cache misses are to blame?
  • Rendering (emulated). The 160x144 LCD buffer is updated one scanline at a time. After every gameboy CPU instruction, check the cycle count and determine whether it's time to render a scanline (or process an interrupt). That's a lot of overhead! Scanlines themselves don't seem to be terribly expensive, though.
  • Rendering (Playdate LCD). Any lines that have changed should be redrawn: expand the 2-bit gameboy pixels to 2x1 or 2x2 1-bit Playdate pixels (sketch below) and mark those LCD rows as updated. We know this is slow -- a full screen refresh is 50 Hz max. @rpdev claims this only accounts for 4 of 23 ms, i.e. around 17% of processor time. The state of the art here is to emulate 2 frames at once and only draw the second one, which at least makes full-speed emulation plausible.
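
For illustration, here's a minimal sketch of the 2x1 expansion. Assumptions of mine, not from any of the projects above: the Playdate framebuffer is 1 bit per pixel with set bits reading as white, the GB scanline is stored as one shade value (0-3) per byte, and the dither patterns and names are hypothetical.

    #include <stdint.h>

    /* Each GB shade becomes a fixed 2-pixel pattern (shade 0 = lightest). */
    static const uint8_t shade_to_2x1[4] = {
        0x3, /* both Playdate pixels white */
        0x2, /* one white, one black       */
        0x1, /* the opposite phase         */
        0x0, /* both pixels black          */
    };

    /* Pack four GB pixels (values 0-3) into one 8-pixel Playdate byte. */
    static uint8_t expand4(const uint8_t *gb_px)
    {
        return (uint8_t)((shade_to_2x1[gb_px[0]] << 6) |
                         (shade_to_2x1[gb_px[1]] << 4) |
                         (shade_to_2x1[gb_px[2]] << 2) |
                          shade_to_2x1[gb_px[3]]);
    }

For the 2x2 layout, the same expanded byte would simply be written to two consecutive framebuffer rows.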

It's hard to profile these, but I invite you to try. As far as I can tell, @dustin blames rendering, @timhei blames cpu emulation.

Ideas to improve CPU emulation

Ideas to improve Rendering

  • cache background planes. I tried this, but it slowed emulation by around 10%.
  • cache sprites
  • interlacing

Other Ideas

  • compile with armclang (doesn't seem to help)

Please contribute if you have ideas, or if you'd like to try your hand at profiling!

6 Likes

Regarding JIT ("just in time" compilation), the idea is to translate the gameboy SM83 instructions into ARMv7-M assembly, so that the Playdate CPU runs the code natively -- blazingly fast by gameboy standards. What's more, we can statically optimize out most flag updates.
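
As a rough illustration of the flag optimization (this is a hypothetical sketch of mine, not code from the repo below): scan ahead within a basic block, and any flag that every following instruction overwrites before reading never needs to be computed.

    #include <stdint.h>

    typedef struct {
        uint8_t len;     /* instruction length in bytes                 */
        uint8_t reads;   /* bitmask of flags (Z,N,H,C) the opcode reads */
        uint8_t writes;  /* bitmask of flags the opcode overwrites      */
    } FlagUse;

    /* Returns the flags that must actually be computed for the instruction
     * at rom[pc]. CB-prefixed opcodes are omitted for brevity. */
    static uint8_t live_flags(const FlagUse *table, const uint8_t *rom,
                              uint16_t pc, uint16_t block_end)
    {
        uint8_t live = 0, dead = 0;
        uint16_t i = pc + table[rom[pc]].len;    /* start at the next opcode */
        while (i < block_end) {
            FlagUse u = table[rom[i]];
            live |= (uint8_t)(u.reads & ~dead);  /* read before overwrite    */
            dead |= u.writes;
            if ((live | dead) == 0x0F) break;    /* all four flags decided   */
            i += u.len;
        }
        return (uint8_t)(live | (0x0F & ~dead)); /* undecided => assume live */
    }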

I have a working demo of SM83-to-arm translation, but it needs more test cases to catch all the bugs. GitHub - nstbayless/jit-playdate-gameboy: JIT for running gameboy instructions on playdate

Current progress: 100% of instructions implemented! But hardly any testing done at all.

It's really tough to test correctness on hardware because of a lack of debugging ability, so I am also emulating arm on my laptop to debug this: How to Emulate Playdate (Arm) with QEMU

5 Likes

Thinking out loud, could it be useful/fun corner-cutting to strive for just 30fps, BUT only render half the frames, so play speed remains correct, at the expense of smoothness?

(Other work would be done for ALL frames, even the hidden ones, so I don't know how much processing this would save. But I know I've enjoyed certain games in the past that never even hit 30! For some titles it matters more than others.)

That's basically what some implementations already do. There's a disagreement about semantics, though (is that 30 or 60 fps?). The actual implementation also differs -- it can be handled as two gameboy frames in one Playdate update, or as one gameboy frame per update with only the odd-numbered frames flushed to the display. In the former case, Playdate reports 30 fps maximum; in the latter, 60 (and each update presumably takes a different length of time, depending on parity).
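
For concreteness, a minimal sketch of the "two gameboy frames per update" variant. The update-callback shape is the standard Playdate C API one; run_gb_frame is a hypothetical helper, not from any existing implementation.

    #include "pd_api.h"

    void run_gb_frame(int draw);   /* hypothetical: emulates one full GB frame */

    static int update(void *userdata)
    {
        PlaydateAPI *pd = userdata;
        (void)pd;
        run_gb_frame(0);   /* simulate only; skip all pixel conversion */
        run_gb_frame(1);   /* simulate and convert pixels for display  */
        return 1;          /* 1 = ask the OS to redraw the screen      */
    }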

1 Like

Just came across this thread -- very impressive work! Sadly I can't test, but I'm watching closely. Well done so far!

Very exciting!

Just to clarify: efficiency around memory usage is everything on Playdate. The amount of work most emulators do per pixel just doesn't seem to fly on device. This is why I've been thinking about rendering and caching all backgrounds and sprites ahead of time, and only updating them when we see new sprite or bg data loaded. This may break a few games but would work for most.
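
One cheap way to detect "new sprite or bg data loaded" would be a dirty flag per tile, set on VRAM writes -- a sketch of mine, not Gamekid's actual code:

    #include <stdint.h>

    static uint8_t vram[0x2000];
    static uint8_t tile_dirty[384];          /* 384 tiles x 16 bytes at $8000-$97FF */

    static void vram_write(uint16_t addr, uint8_t value)
    {
        uint16_t off = addr & 0x1FFF;
        if (vram[off] != value) {
            vram[off] = value;
            if (off < 384 * 16)
                tile_dirty[off >> 4] = 1;    /* 16 bytes per tile */
        }
        /* the cached pre-rendered tile is rebuilt lazily when next drawn */
    }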

That said! I never considered building JIT based emulation! Honestly don't think I have the skills to do so. I hope this works! I will happily take Gamekid offline and point players your way if you get this working.

I had considered writing a little Mac/Windows app that would actually recompile GB roms as PDX files to remove emulation entirely, but... that's as close as I got to thinking about something like this.

Best of luck!

Also! Not really sure what other emulators are doing, but I switched from my own emulation layer to PeanutGB early on, as I just didn't feel it necessary to add yet another GB emulator code base into the world when Peanut was pretty good and being actively worked on with goals that aligned with mine.

I do regret it sometimes however as I am a lot less likely to try new things in the emulator.

But hey, I didn't have to write the audio emulation... so that's nice? :stuck_out_tongue:

1 Like

So, to really progress on the JIT transpiling idea, some good test cases are going to be needed. The problem is that there are some bugs in the transpiler I wrote, but they're very hard to find just looking at the code. To iron them out, at the very least every instruction should be tested. Ideally, every instruction with every flag.

There are some test ROMs online, but at the moment I don't have a way to launch an arbitrary ROM to debug it. I'm using QEMU to debug the transpiler on PC, which lets me step through the ARM code instruction-by-instruction to find bugs. But I don't yet have the ability to run this via the Playdate Simulator, so I can't use the existing ROM loading interface -- and indeed I don't have an actual gameboy emulator connected to this, only an 'emulator' (transpiler) for gameboy CPU instructions.

Probably the simplest way to proceed would be to modify the test ROMs so that they don't require an entire gameboy emulation context to run, and then embed them right into the C source for the test suite of the transpiler.
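
Something like this, say -- a tiny hand-assembled SM83 fragment plus its expected result, compiled straight into the test suite (the array name and expectation scheme are made up for illustration):

    /* LD A,0x42 ; LD B,0x01 ; ADD A,B ; HALT */
    static const unsigned char test_add[] = {
        0x3E, 0x42,
        0x06, 0x01,
        0x80,
        0x76,
    };
    static const unsigned char test_add_expected_a = 0x43;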

It's a bit of work to do that, though, so I haven't been motivated.

4 Likes

I have also been working on Game Boy emulation for the device.

Here are my observations:

  • you need to fight with the compiler so that it emits a branch table instead of an offset table for a switch case; the default switch lowering absolutely destroys the data cache, even if you only switch on the upper quarter ($00-$3F), the lower quarter ($C0-$FF), and the ALU ($8x-$Bx) instructions
  • better yet, don't have (big) switch cases at all in instruction decoding; instruction cache locality (and speed with it) greatly improves if you select each instruction using bit tests instead (see the decoding sketch after this list)
  • have a fast dithering algo in the PPU to reduce the bit depth from 2 bits to 1 bit
  • work directly with 1-bit data in the PPU, so you can do writes as read-modify-write (or, if you're clever enough, modify-write) pairs of bytes
  • render whole scanlines rather than individual pixels; function calls and stack ops destroy performance by the mere fact that you need to emulate roughly a million instructions per second, so even a single extra instruction per iteration easily costs you 1 MHz -- it really adds up
  • DO NOT DO INTERLACING in the current display driver version (see this thread); do a 30 Hz screen update instead, flipping between even and odd frames every second or half-second -- full screen updates are twice as fast in the worst case
  • use memory region pointer caching (example) -- recalculating a ROM or WRAM or VRAM bank pointer is slow, while a cached_pointer[address & 0x1FFF] lookup is much, MUCH faster, purely because of how often this code gets called (hot path); see the pointer-caching sketch after this list
  • do not implement VRAM and OAM locking, there are not enough CPU cycles
  • use inlined counters everywhere (example) - a load-decrement-store-compare is much faster than calling a function repeatedly and doing an early return due to a missing prerequisite
  • when doing APU rendering, do not recalculate the sample, cache the sample and use that instead until the counter expires (example)
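
To illustrate the bit-test decoding bullet above -- a sketch of mine with hypothetical ld_r_r / alu_a_r handlers, not SonoSooS's actual code:

    #include <stdint.h>

    void ld_r_r(int dst, int src);   /* hypothetical handlers */
    void alu_a_r(int op, int src);

    static void dispatch(uint8_t opcode)
    {
        if ((opcode & 0xC0) == 0x40) {        /* $40-$7F: LD r,r' ($76 is HALT) */
            ld_r_r((opcode >> 3) & 7, opcode & 7);
        } else if ((opcode & 0xC0) == 0x80) { /* $80-$BF: ALU A,r               */
            alu_a_r((opcode >> 3) & 7, opcode & 7);
        } else {
            /* remaining quarters ($00-$3F, $C0-$FF) via further bit tests */
        }
    }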
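
And a sketch of the pointer-caching bullet (names are mine, not from the linked example): recompute the region pointer only when the mapping changes, so the hot path is a single mask and load.

    #include <stdint.h>
    #include <stddef.h>

    static const uint8_t *rom_bank_ptr;  /* -> start of the switchable $4000-$7FFF bank */
    static uint8_t *wram_ptr;            /* -> start of WRAM ($C000-$DFFF)              */

    /* Cold path: called only on MBC bank-switch writes. */
    static void remap_rom_bank(const uint8_t *rom, unsigned bank)
    {
        rom_bank_ptr = rom + (size_t)bank * 0x4000;
    }

    /* Hot path: no bank arithmetic, just mask and load. */
    static inline uint8_t read_banked_rom(uint16_t address)
    {
        return rom_bank_ptr[address & 0x3FFF];
    }

    static inline uint8_t read_wram(uint16_t address)
    {
        return wram_ptr[address & 0x1FFF];   /* the 8 KiB window from the post */
    }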

Oh also, while the Playdate frontend of my emulator won't be available until the start of 2024 (I can't share it yet, so nobody can look at its magic yet, sorry!), I can tell you that you should load the ROM from the .pdx instead of binary-including it into the executable. I have fought with this code a lot, and this version works:

    // Open the ROM, searching both the game's .pdx and its Data folder.
    SDFile* file = pd->file->open("game.gb", kFileRead | kFileReadData);
    if(!file)
    {
        pd->system->logToConsole("Failed to load: %s\n", pd->file->geterr());
        goto error;
    }
    
    // Read the entire ROM into the preallocated BLOB_ROM buffer.
    int readlen = pd->file->read(file, &BLOB_ROM[0], sizeof(BLOB_ROM));
    pd->file->close(file);

Notice the emphasis on

kFileRead | kFileReadData

kFileRead alone reads only from within the game's .pdx, while kFileReadData reads from the game's Data folder; OR-ing them makes open() check both places.

As for test ROM order, I recommend cpu-instrs individual test ROMs, as they don't require an MBC implementation (pretty sure they still do MBC writes, but you can ignore them).
Here is what order I use:

  • 06
  • 04
  • 10
  • 09
  • 08
  • 11
  • 05
  • 03
  • 07
  • 01
  • 02
4 Likes

Thanks for your input!

You inspired me to take a stab at getting cpu-instrs working in the JIT-based emulator. I now have it going all the way into a basic test case and returning. I'm not running Blargg's tests yet, but I am running the preamble to them. Already I found and fixed a couple of bugs.

I am starting to feel doubtful of the JIT, largely because I'm unsure whether the overhead of dispatching to a JITted code section will be offset by that section's faster execution. It depends mostly on how long a typical section is; in other words, on how common jumps are...

In the back of my mind, I think it ought to be possible to do a conventional interpreter written in pure asm.

@SonoSooS , I'm not sure if you had this idea too, but you can lazily evaluate the n and h flags by simply storing the last arithmetic inputs (operand 1; operand 2; carry) and only computing n and h when they are needed. n and h are very rarely read; they are inputs to only two instructions (daa; push af).

I already implement lazy flag resolving, and it did give a minor performance boost :smiley:

For example my SUB instruction has this as the very first line: https://github.com/SonoSooS/PicoGB/blob/fd01c294529dcfcb0259201d364c338d105b6a42/microcode.c#L1413
Which calls this function, which sets the two operands and the type of the flags that need to be calculated: https://github.com/SonoSooS/PicoGB/blob/fd01c294529dcfcb0259201d364c338d105b6a42/microcode.c#L468-L472
Then once needed for PUSH AF or DAA, I resolve the flags: https://github.com/SonoSooS/PicoGB/blob/fd01c294529dcfcb0259201d364c338d105b6a42/microcode.c#L496

N is inherent to the instruction; it's always a constant value (except for POP AF). H (half-carry) is the only flag which needs extra computation to resolve. Carry and Zero are so easy to calculate, and so often used, that it makes sense to compute them inline.
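
A minimal sketch of what resolving H lazily can look like (my own names and layout, not PicoGB's): record the ALU operands when the instruction executes, and derive the half-carry only when DAA or PUSH AF actually asks for it.

    #include <stdint.h>

    static uint8_t lazy_op1, lazy_op2, lazy_cin;
    static uint8_t lazy_sub;    /* 1 if the last ALU op was a subtraction */

    /* Called by ADD/ADC/SUB/SBC instead of computing H eagerly. */
    static void alu_record(uint8_t a, uint8_t b, uint8_t cin, uint8_t is_sub)
    {
        lazy_op1 = a; lazy_op2 = b; lazy_cin = cin; lazy_sub = is_sub;
    }

    /* Called only from DAA / PUSH AF, when H is actually observable. */
    static uint8_t resolve_h(void)
    {
        unsigned lo = lazy_sub
            ? (unsigned)(lazy_op1 & 0xF) - (lazy_op2 & 0xF) - lazy_cin
            : (unsigned)(lazy_op1 & 0xF) + (lazy_op2 & 0xF) + lazy_cin;
        return (lo & 0x10) ? 1 : 0;   /* carry/borrow out of bit 3 */
    }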

Completely thinking out loud here, and this is possibly extremely obvious, but is the bottleneck here because of having to do an expensive division operation on the Playdate? Instead of keeping a cycle count and dividing it by some weird number, it could be a lot faster to just keep a count of cycles remaining until the next scanline update and check whether it's <= 0.

Yeah, this is basically what's done. When the cycle count exceeds the scanline-end constant, it subtracts scanline-end and runs the scanline routine. No division necessary.
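
In sketch form (the constant is the GB's real 456 T-cycles per scanline; the function names are hypothetical, and peanut-gb's actual bookkeeping differs in detail):

    #define CYCLES_PER_SCANLINE 456     /* GB T-cycles per scanline */

    int run_one_instruction(void);      /* hypothetical: returns cycles taken */
    void render_scanline(void);         /* hypothetical */

    static int cycle_count;

    static void step(void)
    {
        cycle_count += run_one_instruction();
        if (cycle_count >= CYCLES_PER_SCANLINE) {
            cycle_count -= CYCLES_PER_SCANLINE;  /* subtract, never divide */
            render_scanline();
        }
    }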