I have also been working on Game Boy emulation for the device.
Here are my observations:
- you need to fight the compiler so that it emits a branch table instead of an offset table for a switch statement, as the default switch codegen absolutely destroys the data cache, even if you only switch on the upper quarter ($00-$3F), the lower quarter ($C0-$FF), and the ALU ($8x-$Bx) instructions
- better yet, don't have (big) switch statements in instruction decoding at all; instruction cache locality somehow improves greatly (and so does speed) if you select each instruction using bit tests
- have a fast dithering algo for the PPU to reduce the bit depth from 2-bit to 1-bit
- work directly with 1bit in the PPU, so you can do writes as read-modify-write (or if you're clever enough then modify-write) pairs of bytes
- render a whole scanline at a time instead of pixel by pixel; function calls and stack ops destroy performance simply because you need to emulate roughly 1 million instructions per second, so even a single extra instruction on the hot path easily costs you 1MHz of performance, and it really does add up
- DO NOT DO INTERLACING in the current display driver version (see this thread); do a 30Hz screen update instead, flipping between even/odd frames every second or half a second. Full screen updates are twice as fast in the worst case
- use memory region pointer caching (example) - recalculating a ROM or WRAM or VRAM bank pointer is slow, doing a cached_pointer[address & 0x1FFF] is much, MUCH faster, simply because of how often this code is called (hot path)
- do not implement VRAM and OAM locking, there are not enough CPU cycles
- use inlined counters everywhere (example) - a load-decrement-store-compare is much faster than calling a function repeatedly and doing an early return due to a missing prerequisite
- when doing APU rendering, do not recalculate the sample, cache the sample and use that instead until the counter expires (example)
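To make the pointer caching and inlined counter points above a bit more concrete, here is a minimal sketch in C. It is not the code behind the "(example)" links; every name in it (`region`, `map_rom_bank`, `mem_read`, `step_timer`, the 4-bank ROM size, the 256-cycle period) is a made-up assumption for illustration. The idea is that the bank pointer is recomputed only when the bank changes, and the counter only calls out when it actually expires.

```c
#include <assert.h>
#include <stdint.h>

/* All names below are hypothetical; sizes are for a tiny 4-bank ROM. */
#define ROM_BANKS 4u

static uint8_t rom[ROM_BANKS * 0x4000u];  /* 16 KiB per ROM bank */
static uint8_t vram[0x2000];
static uint8_t eram[0x2000];              /* cartridge RAM */
static uint8_t wram[0x2000];

/* One cached pointer per 8 KiB slice of 0x0000-0xDFFF. */
static uint8_t *region[7];

/* Slow path: runs only when the game switches ROM banks. */
static void map_rom_bank(unsigned bank)
{
    assert(bank < ROM_BANKS);
    region[2] = &rom[bank * 0x4000u];            /* 0x4000-0x5FFF */
    region[3] = &rom[bank * 0x4000u + 0x2000u];  /* 0x6000-0x7FFF */
}

static void map_reset(void)
{
    region[0] = &rom[0x0000];   /* fixed bank 0, 0x0000-0x1FFF */
    region[1] = &rom[0x2000];   /* fixed bank 0, 0x2000-0x3FFF */
    map_rom_bank(1);
    region[4] = vram;           /* 0x8000-0x9FFF */
    region[5] = eram;           /* 0xA000-0xBFFF */
    region[6] = wram;           /* 0xC000-0xDFFF */
}

/* Hot path: a shift, a mask and two loads. Only valid below 0xE000 in
   this sketch; echo RAM, OAM, I/O and HRAM would need their own path. */
static inline uint8_t mem_read(uint16_t addr)
{
    return region[addr >> 13][addr & 0x1FFF];
}

/* Inlined counter: the per-step cost is a subtract and a compare. */
#define TIMER_PERIOD 256        /* cycles per tick, illustrative value */
static int timer_left = TIMER_PERIOD;
static unsigned timer_ticks;

static void timer_expired(void) /* slow path, called rarely */
{
    timer_ticks++;
}

static inline void step_timer(int cycles)
{
    timer_left -= cycles;
    if (timer_left <= 0) {
        timer_left += TIMER_PERIOD;
        timer_expired();
    }
}
```

In a main loop you would call `map_reset()` once, `map_rom_bank()` from the MBC write handler, and `mem_read()` / `step_timer()` on every emulated instruction. The APU sample caching point follows the same pattern: keep the last computed sample in a variable and recompute it only in the "expired" branch.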
Oh also, while the Playdate frontend of my emulator won't be available until the start of 2024 (I can't share it yet, so nobody can look at its magic yet, sorry!), I can tell you that you should load the ROM from the .pdx instead of embedding it into the binary with a binary include. I have fought with this code a lot, and this one works:
    // open the ROM; note the combined mode flags
    SDFile* file = pd->file->open("game.gb", kFileRead | kFileReadData);
    if(!file)
    {
        pd->system->logToConsole("Fail to load: %s\n", pd->file->geterr());
        goto error;
    }
    // read at most sizeof(BLOB_ROM) bytes into the ROM buffer
    int readlen = pd->file->read(file, &BLOB_ROM[0], sizeof(BLOB_ROM));
    pd->file->close(file);
Notice the emphasis on the mode flags:
    kFileRead | kFileReadData
As for test ROM order, I recommend the individual cpu_instrs test ROMs, as they don't require an MBC implementation (pretty sure they still do MBC writes, but you can ignore them).
Here is the order I use:
- 06
- 04
- 10
- 09
- 08
- 11
- 05
- 03
- 07
- 01
- 02