Scanline DMA support with optional callback

I've noticed that I get a really low FPS without frameskip. With a single frame of frameskip I get a lot more FPS, but still at 100% CPU usage. That suggests my program is more than capable of running faster on the hardware, but display updates grind it to basically a halt.

What's even weirder is that display updates seem to count as CPU usage, even though my code currently can't run while the display is being updated.

Considering the Playdate's current hardware, it should be more than capable of sending the scanlines asynchronously using DMA, with a few caveats.

Those who are willing to tread this far probably already have their own optimized rendering routines (like I do), and should be capable of modifying their code so that it renders slightly differently to accommodate the DMA-compatible framebuffer format.
The idea is to request scanlines from the system in such a way that each one already has the SHARP memory LCD command format in its header, but what we get back is the pointer to the pixels and their size (as redundancy). We repeat this until we have enough scanlines to work with. It's then possible to store these pointers in an array and simply index the array, instead of doing a multiplication, to get to a specific Y coordinate.
This scanline data could be actually part of an internal struct, where the struct contains some metadata for sequencing, and status info about DMA progress, and also some programmer-controllable variables, like callback pointer + userdata, if needed by the programmer (for synchronization or VSync purpose, for example).
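
To make the proposed layout concrete, here is a minimal C sketch of what such a scanline struct and row-pointer table might look like. Everything in it is hypothetical: `ScanlineBuf`, `scanline_request`, `build_row_table`, and `set_pixel` are illustrative names, not part of the Playdate SDK, and the 1-based line address and the MSB-first pixel order are assumptions rather than confirmed panel details.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ROWS      240   /* display height in pixels */
#define ROW_BYTES 50    /* 400 pixels at 1 bit per pixel */

/* Hypothetical DMA-ready scanline. The bytes from mode_cmd to trailer are
 * contiguous, so they could be clocked out over SPI as one transfer. */
typedef struct ScanlineBuf {
    struct ScanlineBuf *next;                         /* sequencing */
    void (*on_done)(struct ScanlineBuf *buf, void *userdata);
    void   *userdata;
    uint8_t flags;                 /* queued / in flight / done */
    uint8_t mode_cmd;              /* stamped by the system at queue time */
    uint8_t line_addr;             /* stamped by the system */
    uint8_t pixels[ROW_BYTES];     /* the only part the game writes */
    uint8_t trailer;               /* dummy byte after the pixel data */
} ScanlineBuf;

/* Stand-in for the proposed system call: prepares the header and hands back
 * a pointer to the pixel area plus its size (the "redundancy" above). */
void scanline_request(ScanlineBuf *buf, int y, uint8_t **pixels, size_t *size) {
    assert(buf != NULL && y >= 0 && y < ROWS);
    buf->line_addr = (uint8_t)(y + 1);   /* assumption: 1-based addresses */
    *pixels = buf->pixels;
    *size = sizeof buf->pixels;
}

static ScanlineBuf g_lines[ROWS];
static uint8_t *g_row[ROWS];   /* row table: index by Y, no multiplication */

int build_row_table(void) {
    for (int y = 0; y < ROWS; y++) {
        size_t size = 0;
        scanline_request(&g_lines[y], y, &g_row[y], &size);
        if (size != ROW_BYTES) return -1;   /* renderer assumes 50 bytes */
    }
    return 0;
}

/* Assumption: the most significant bit is the leftmost pixel. */
void set_pixel(int x, int y, int on) {
    uint8_t mask = (uint8_t)(0x80u >> (x & 7));
    if (on) g_row[y][x >> 3] |= mask;
    else    g_row[y][x >> 3] &= (uint8_t)~mask;
}
```

A renderer would call `build_row_table()` once at startup and then draw through `g_row[y]`, never touching the header or trailer bytes.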

This could be done really easily with a single pointer in system memory: a pointer to the currently "active" scanline buffer, the one being copied to the display. Once the DMA-done interrupt fires, this field is checked. If the active buffer has a next pointer, the handler stores the current pointer, replaces the system pointer with the next pointer, starts the DMA (so the callback doesn't waste even more time), and calls the callback on the stored pointer.
As for sequencing, this could be done with a single pointer in the struct, forming a singly-linked list. It can be made foolproof by refusing to append a scanline buffer to the DMA list if its next pointer is already set (to prevent circular references from deadlocking the state machine), or if it's already queued. The queued state could be part of a flag field, which could also store the status: newly queued, DMA still running, or finished.
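
The queue and interrupt logic described above can be sketched roughly as follows. This is a host-side simulation under stated assumptions, not firmware: `DmaLine` is a trimmed-down stand-in for the scanline struct, `dma_start` only flips a state flag instead of programming a real SPI DMA channel, and all names are made up for illustration. On real hardware, `dma_enqueue` would also need to mask the DMA interrupt while it touches the list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SL_IDLE = 0, SL_QUEUED, SL_ACTIVE, SL_DONE };

typedef struct DmaLine {
    struct DmaLine *next;      /* singly-linked list for sequencing */
    void (*on_done)(struct DmaLine *line, void *userdata);
    void   *userdata;
    uint8_t state;             /* one of the SL_* values */
} DmaLine;

static DmaLine *g_active;      /* the single "system pointer" */
static DmaLine *g_tail;        /* last queued buffer, for O(1) append */

/* Stand-in for kicking off the hardware transfer. */
static void dma_start(DmaLine *line) { line->state = SL_ACTIVE; }

/* Returns 0 on success, -1 if the buffer is refused. */
int dma_enqueue(DmaLine *line) {
    assert(line != NULL);
    if (line->next != NULL) return -1;   /* would risk a circular list */
    if (line->state == SL_QUEUED || line->state == SL_ACTIVE) return -1;
    line->state = SL_QUEUED;
    if (g_active == NULL) {              /* queue empty: start right away */
        g_active = g_tail = line;
        dma_start(line);
    } else {
        g_tail->next = line;
        g_tail = line;
    }
    return 0;
}

/* What the DMA-done interrupt handler would do. */
void dma_done_irq(void) {
    DmaLine *finished = g_active;
    if (finished == NULL) return;
    g_active = finished->next;
    if (g_active != NULL) dma_start(g_active);  /* restart before callback */
    else g_tail = NULL;
    finished->next = NULL;               /* buffer may be queued again */
    finished->state = SL_DONE;
    if (finished->on_done) finished->on_done(finished, finished->userdata);
}

/* Example callback: counts finished lines through userdata. */
void count_done(DmaLine *line, void *userdata) {
    (void)line;
    ++*(int *)userdata;
}
```

Note that the tail of the list has a null next pointer while it is still queued, which is why the state flag is checked in addition to the next pointer.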

As for VSync, it would still work the same way the update() callback currently works: create a fake VSync signal via a timer of some sort, since the display has no concept of synchronization because it doesn't need active scanning.

One parameter worth considering for two functions, the one that adds to the queue and the one that queries the status, is a blocking flag. When set, the call enters low-power mode until the DMA has completed, and reports that status via the function's return value. If the blocking flag is not set, the call either fails or just returns the status without blocking.
This approach would not be compatible with the Menu button, though, so a unique error value could be returned to signal to the code that the Menu wants to open. The code should then stop trying to queue or query DMA and return from the update handler as fast as possible.
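
As a rough illustration of the blocking flag and the Menu error value, here is a small simulated version of the status query. The names `DmaStatus`, `dma_query`, and `wait_for_interrupt` are hypothetical, and the "hardware" is just two variables; on the device the wait would presumably be a real sleep-until-interrupt instruction.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    DMA_DONE,        /* nothing left in flight */
    DMA_BUSY,        /* non-blocking call, transfer still running */
    DMA_ERR_MENU     /* Menu wants to open: stop and return from update */
} DmaStatus;

/* Simulated system state (hypothetical). */
static int  g_lines_in_flight;
static bool g_menu_requested;

/* Stand-in for low-power sleep until the next interrupt. Here each wake-up
 * simply pretends that one scanline has finished transferring. */
static void wait_for_interrupt(void) {
    if (g_lines_in_flight > 0) g_lines_in_flight--;
}

DmaStatus dma_query(bool block) {
    for (;;) {
        if (g_menu_requested) return DMA_ERR_MENU;  /* checked first */
        if (g_lines_in_flight == 0) return DMA_DONE;
        if (!block) return DMA_BUSY;
        wait_for_interrupt();
    }
}

/* How an update handler might react; returns 0 when it bailed out early. */
int update_sketch(void) {
    if (dma_query(true) == DMA_ERR_MENU) return 0;
    /* ... render and queue the next frame's scanlines here ... */
    return 1;
}
```

Checking the Menu request before anything else means a blocked caller is released on the next wake-up instead of sleeping until the whole transfer ends.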

Hopefully this doesn't need much re-engineering of the display code to support this advanced feature.
One glaring problem I do see is that, since this method completely bypasses the system framebuffer, pausing would show something completely different on the screen. I think this should be the programmer's responsibility, though.


Here's a reference for how it currently works (as of when Dave posted it)

Yeah, I have already researched this topic quite extensively, and I have also discovered this thread in the process.

My problem is less of an "I don't have enough control over the framebuffer", and more of an "I want my code to be able to run asynchronously with the scanlines being DMA'd". Reading the thread made me think that this was really hard or something, so instead I came up with an idea which, I think, should be easy to implement in the OS, and which would fix the performance issues along the way.

I'm fine with sacrificing some RAM, and putting more rigorous safety checks into my code, if it means that the framebuffer could be DMA'd with the least amount of preprocessing (as in, basically writing just the scanline number and the VCOM flicker value into the command bytes before the scanline data), while also letting the CPU grind away at its calculations while the command buffer (command + pixels + terminator) is being transferred asynchronously via SPI DMA.
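
The "least amount of preprocessing" step could look something like the sketch below: the pixels stay where the game rendered them, and only the command bytes are stamped right before the transfer. The `CommandBuf` layout, the `stamp_command` name, the two-byte terminator, and especially the `CMD_WRITE` / `CMD_VCOM` bit values are placeholders for illustration; the real values depend on the panel datasheet and the SPI bit order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ROW_BYTES 50          /* 400 pixels at 1 bit per pixel */

/* Placeholder bit positions, NOT taken from a datasheet. */
#define CMD_WRITE 0x01u
#define CMD_VCOM  0x02u

/* command + pixels + terminator, contiguous so one DMA transfer covers it.
 * All members are bytes, so the compiler inserts no padding. */
typedef struct {
    uint8_t command;              /* write mode plus current VCOM bit */
    uint8_t line;                 /* scanline number */
    uint8_t pixels[ROW_BYTES];    /* rendered in place by the game */
    uint8_t terminator[2];        /* assumed trailing dummy bytes */
} CommandBuf;

/* Stamps the header and terminator, leaves the pixels untouched, and
 * returns the number of bytes the DMA transfer would have to send. */
size_t stamp_command(CommandBuf *buf, unsigned line_number, int vcom_high) {
    assert(buf != NULL);
    buf->command = (uint8_t)(CMD_WRITE | (vcom_high ? CMD_VCOM : 0u));
    buf->line = (uint8_t)line_number;
    buf->terminator[0] = 0;
    buf->terminator[1] = 0;
    return sizeof *buf;
}
```

Per scanline, the CPU cost is four byte writes, and the rest of the time is free for game code while the bytes go out over SPI.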


Gotta say, I'm not very excited about the idea of putting a lot of work into adding a feature that a tiny fraction of game devs would use to get an even smaller fraction of CPU gain. :confused: But this is a good reminder to go back to that circular buffer code and figure out why it was glitching; there was a small but not insignificant boost there.
