I have noticed that I get a really low FPS without frameskip, while skipping just a single frame gives me a lot more FPS, still at 100% CPU usage. That means my program is more than capable of running faster on this hardware, but display updates grind it basically to a halt.
What's even weirder is that display updates seem to count as CPU usage, even though right now my code cannot run while the display is being updated.
Considering the current hardware of the Playdate device, it should be entirely possible to send the scanlines over asynchronously using DMA, with a few caveats.
Those who are willing to tread this far probably already have their own optimized rendering routines (like I do), and should be able to modify their code so that it renders slightly differently to accommodate the DMA-compatible framebuffer format.
The idea is to request scanlines from the system in such a way that each one already has the SHARP memory LCD command format in its header, but what we get back is the pointer to the pixels and their size (as redundancy). We repeat this until we have enough scanlines to work with. These pointers can then be stored in an array, so reaching a specific Y coordinate is just an array index instead of a multiplication.
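A minimal sketch of the row-pointer idea, to make it concrete. Everything here is hypothetical: `sys_request_scanline()` does not exist in the SDK and is simulated with a static array, and the `Scanline` layout (2 header bytes, 50 pixel bytes, 2 trailer bytes) and the MSB-first bit order are assumptions for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

#define ROWS            240
#define ROW_PIXEL_BYTES 50   /* 400 pixels at 1 bit per pixel */

/* Hypothetical DMA-ready scanline; header and trailer belong to the system. */
typedef struct {
    uint8_t header[2];               /* assumed: LCD command + line address */
    uint8_t pixels[ROW_PIXEL_BYTES];
    uint8_t trailer[2];              /* assumed: dummy bits after the data */
} Scanline;

static Scanline storage[ROWS];       /* stands in for system-owned memory */
static uint8_t *rows[ROWS];          /* one pixel pointer per Y coordinate */

/* Hypothetical system call: hands out the pixel pointer and size of line y. */
static int sys_request_scanline(int y, uint8_t **pixels, size_t *size)
{
    if (y < 0 || y >= ROWS || pixels == NULL || size == NULL)
        return -1;
    *pixels = storage[y].pixels;
    *size = sizeof storage[y].pixels;
    return 0;
}

/* Request every scanline once and remember where its pixels live. */
static int build_row_table(void)
{
    for (int y = 0; y < ROWS; y++) {
        size_t size;
        if (sys_request_scanline(y, &rows[y], &size) != 0)
            return -1;
        if (size != ROW_PIXEL_BYTES)  /* the size is the redundancy check */
            return -1;
    }
    return 0;
}

/* Plot a pixel with an array index instead of y * stride. */
static void set_pixel(int x, int y, int on)
{
    uint8_t mask = (uint8_t)(0x80u >> (x & 7));  /* assumed MSB-first */
    if (on)
        rows[y][x >> 3] |= mask;
    else
        rows[y][x >> 3] &= (uint8_t)~mask;
}
```

A renderer would call `build_row_table()` once at startup and only touch `rows[y]` afterwards, so it never needs to know how large the header and trailer around the pixels are.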
This scanline data could actually be part of an internal struct that also contains some metadata for sequencing, status info about DMA progress, and some programmer-controllable fields, such as a callback pointer plus userdata, if the programmer needs them (for synchronization or VSync purposes, for example).
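One possible shape for such a struct, as a sketch only. All names (`ScanlineBuffer`, `ScanlineStatus`, `scanline_init`) and the field sizes are made up for illustration; none of this is an existing SDK type.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_PIXEL_BYTES 50  /* 400 pixels at 1 bit per pixel */

/* Hypothetical DMA progress states. */
typedef enum { SL_IDLE, SL_QUEUED, SL_ACTIVE, SL_DONE } ScanlineStatus;

typedef struct ScanlineBuffer ScanlineBuffer;
typedef void (*ScanlineCallback)(ScanlineBuffer *buf, void *userdata);

struct ScanlineBuffer {
    /* System-owned: sequencing and DMA progress. */
    ScanlineBuffer *next;
    volatile ScanlineStatus status;

    /* Programmer-controllable: optional completion hook. */
    ScanlineCallback on_done;
    void *userdata;

    /* The bytes the DMA would send; header/trailer sizes are assumptions. */
    uint8_t header[2];
    uint8_t pixels[LINE_PIXEL_BYTES];
    uint8_t trailer[2];
};

/* Reset the bookkeeping fields and attach an optional callback. */
static void scanline_init(ScanlineBuffer *buf, ScanlineCallback cb, void *userdata)
{
    buf->next = NULL;
    buf->status = SL_IDLE;
    buf->on_done = cb;
    buf->userdata = userdata;
}
```

The program would only ever be handed `pixels`; the rest stays on the system side, apart from the callback fields.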
Tracking progress could be done really easily with a single pointer in system memory: a pointer to the currently "active" scanline buffer, the one currently being copied to the display. Once the DMA-done interrupt fires, this field is checked, and if the active buffer has a next pointer, the handler stores the current pointer, replaces the system pointer with the next pointer, starts the DMA (so the callback doesn't waste even more time), and then calls the callback on the stored pointer.
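The interrupt logic described above could be sketched roughly like this. It is a host-side simulation, not firmware: `dma_start()` only counts calls, and `Buf`, `active`, `dma_done_isr()` and `count_done()` are hypothetical names. Clearing `next` and still calling the callback when there is no next buffer are my assumptions about the unspecified cases.

```c
#include <stddef.h>

typedef struct Buf Buf;
struct Buf {
    Buf *next;                         /* next buffer to send, or NULL */
    int done;                          /* set once the transfer finished */
    void (*on_done)(Buf *, void *);    /* optional completion callback */
    void *userdata;
};

static Buf *volatile active;           /* buffer currently being sent */
static int dma_starts;                 /* simulation: counts DMA kicks */

/* Stand-in for programming the real DMA controller. */
static void dma_start(Buf *b)
{
    (void)b;
    dma_starts++;
}

/* What the DMA-done interrupt handler could do. */
static void dma_done_isr(void)
{
    Buf *finished = active;            /* store the current pointer */
    if (finished == NULL)
        return;

    Buf *next = finished->next;
    finished->next = NULL;             /* assumption: unlink so it can be reused */
    finished->done = 1;

    active = next;                     /* replace the system pointer */
    if (next != NULL)
        dma_start(next);               /* start the DMA before the callback runs */

    if (finished->on_done != NULL)
        finished->on_done(finished, finished->userdata);
}

/* Example callback: counts completed buffers through userdata. */
static void count_done(Buf *b, void *userdata)
{
    (void)b;
    ++*(int *)userdata;
}
```

Starting the next transfer before invoking the callback means a slow callback only delays the program, not the display.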
As for sequencing, a single next pointer in the struct is enough to form a singly-linked list. It can be made foolproof by refusing to append a scanline buffer to the DMA list if its next pointer is already set (to prevent circular references from deadlocking the state machine), or if it's already queued. The queued state could be part of a flag field, which could also store the status: newly queued, DMA still running, or DMA finished.
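A sketch of the guarded append, assuming a head/tail queue and a status field. All names and error codes are hypothetical. A real implementation would also need to mask the DMA interrupt around the list update, which is left out here.

```c
#include <stddef.h>

typedef enum { BUF_IDLE = 0, BUF_QUEUED, BUF_ACTIVE, BUF_DONE } BufStatus;

typedef struct Buf {
    struct Buf *next;   /* link in the DMA list, NULL when unlinked */
    BufStatus status;   /* doubles as the "already queued" flag */
} Buf;

typedef struct {
    Buf *head;
    Buf *tail;
} DmaQueue;

/* Hypothetical return codes. */
enum { Q_OK = 0, Q_ERR_LINKED = -1, Q_ERR_BUSY = -2 };

/* Append b to the list, refusing anything that could create a cycle. */
static int dma_queue_append(DmaQueue *q, Buf *b)
{
    if (b->next != NULL)
        return Q_ERR_LINKED;          /* already points at another buffer */
    if (b->status == BUF_QUEUED || b->status == BUF_ACTIVE)
        return Q_ERR_BUSY;            /* already in the list or being sent */

    b->status = BUF_QUEUED;
    if (q->tail != NULL)
        q->tail->next = b;
    else
        q->head = b;
    q->tail = b;
    return Q_OK;
}
```

Both checks are needed: the last buffer in the list has a NULL next pointer, so only the status flag stops it from being appended a second time.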
As for VSync, it would still work the same way the update() callback currently works, that is, by creating a fake VSync signal with a timer of some sort, since the display has no concept of synchronization because it doesn't need active scanning.
A parameter worth considering for two functions, the one that adds to the queue and the one that queries the status, is a blocking flag: when set, the call enters low-power mode until the DMA has completed, and returns that status via the function's return value. If the blocking flag is not set, the call either fails or just returns the status without blocking.
This approach would not be compatible with the Menu button though, so a unique error value could be returned to signal to the code that the Menu wants to open. The code should then stop trying to queue or query DMA and return from the update handler as fast as possible.
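A sketch of how the blocking flag and the Menu error value could fit together in the status query. This is a simulation under stated assumptions: `FakeSystem` stands in for real system state, `wait_for_interrupt()` stands in for the low-power wait and simply pretends the DMA finishes after three wake-ups, and all names and values are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical return values; the negative one is the unique Menu error. */
typedef enum {
    DMA_STATUS_MENU = -1,   /* Menu wants to open: stop and return quickly */
    DMA_STATUS_BUSY = 0,    /* transfer still running */
    DMA_STATUS_DONE = 1     /* transfer completed */
} DmaStatus;

/* Simulated system state, normally changed by interrupts. */
typedef struct {
    volatile bool dma_done;
    volatile bool menu_requested;
    int sleeps;             /* simulation: how often we slept */
} FakeSystem;

/* Stand-in for sleeping until the next interrupt. */
static void wait_for_interrupt(FakeSystem *sys)
{
    sys->sleeps++;
    if (sys->sleeps >= 3)
        sys->dma_done = true;   /* pretend the DMA-done interrupt fired */
}

/* Query DMA status; with blocking set, sleep until done or Menu. */
static DmaStatus dma_query(FakeSystem *sys, bool blocking)
{
    for (;;) {
        if (sys->menu_requested)
            return DMA_STATUS_MENU;
        if (sys->dma_done)
            return DMA_STATUS_DONE;
        if (!blocking)
            return DMA_STATUS_BUSY;
        wait_for_interrupt(sys);
    }
}

/* How an update handler might react: false means bail out immediately. */
static bool update_step(FakeSystem *sys)
{
    if (dma_query(sys, true) == DMA_STATUS_MENU)
        return false;           /* no more queueing or querying this frame */
    /* ... render and queue the next scanlines here ... */
    return true;
}
```

Checking the Menu request before the done flag means a blocking call can never keep the Menu waiting behind a long transfer.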
Hopefully supporting this advanced feature doesn't need much re-engineering of the display code.
One glaring problem I do see is that, since this method completely bypasses the system framebuffer, pausing would show something completely different on the screen. I think this should be the programmer's responsibility though.