I got the following message on Mastodon and am posting a reply here for anyone who might be interested.
I'm coding an equivalent to the "DrawBitmap" for fun and I can't even get close to the SDK's method in performance on the PD (SDK is still ~60% faster).
Which makes me wonder what I'm still missing. The gist of my function is that I take 2x 32-bit ints from my source bitmap, shift and bitwise-OR them into a single 32-bit word, and write that to the canvas bitmap on each loop iteration.
Would be happy about a hint
I don't know what their implementation looks like but I was really curious how close I could get. I grabbed the function from the Playdate source that shifts and copies 32-bit image data into the destination buffer, wrapped it in a loop, and made a custom bitmap drawing function. (I'm skipping over a lot of debugging here, of course..) I put that in a demo to stress test it and compare against pd->graphics->drawBitmap(), and..
It's a lot faster, like 4x! I wasn't expecting that, though looking at the sampler traces it makes sense. The native drawing function does a lot of stuff that this one doesn't have to: clipping, masked drawing, stencils. But it suggests that there's some performance we could gain by adding a shortcut path when those features aren't needed. I'll keep that in mind if it ever looks like drawing speed is a problem.
Press the A button to switch between the custom drawing function and the native one. In both cases it's limited by the refresh rate, so the FPS display doesn't change, but if you open the device info window in the simulator you'll see that the custom function takes around 20% of the CPU while the native one takes 80%.
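For anyone following along, here is a rough sketch of the shift-and-OR step described in the quoted question. It is not the SDK's code and not either poster's implementation. It assumes 1 bit per pixel with the leftmost pixel in the most significant bit of each byte, and the names load_be32, store_be32 and blit_row_shifted are made up for illustration. It does an opaque copy of one row only: no mask, no clipping, no flipping.

```c
#include <stddef.h>
#include <stdint.h>

/* Read 4 bytes as one big-endian word so bit 31 is the leftmost pixel.
   Byte-wise access also sidesteps alignment and endianness issues. */
static uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

static void store_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Copy `words` 32-pixel source words into dst, moved right by `shift`
   pixels (0..31). When shift != 0 the row spills into one extra word,
   so dst must have room for words + 1 words. Destination pixels to the
   left and right of the copied span are preserved. */
static void blit_row_shifted(uint8_t *dst, const uint8_t *src,
                             size_t words, unsigned shift)
{
    if (shift == 0) {
        for (size_t i = 0; i < words; i++)
            store_be32(dst + 4 * i, load_be32(src + 4 * i));
        return;
    }

    uint32_t right_mask = 0xFFFFFFFFu >> shift;   /* low 32-shift bits */

    /* `carry` holds the top `shift` bits of the next output word. It starts
       as the existing destination pixels left of the copied span. */
    uint32_t carry = load_be32(dst) & ~right_mask;

    for (size_t i = 0; i < words; i++) {
        uint32_t cur = load_be32(src + 4 * i);
        store_be32(dst + 4 * i, carry | (cur >> shift));
        carry = cur << (32 - shift);              /* leftover pixels */
    }

    /* Trailing word: leftover source pixels plus the old destination. */
    uint32_t tail = load_be32(dst + 4 * words) & right_mask;
    store_be32(dst + 4 * words, carry | tail);
}
```

On the device you would probably replace the byte-wise loads with aligned word loads plus a byte swap, which is where most of the speed comes from; the byte-wise version is only there to keep the sketch portable.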
I deleted the message before I saw your post because I had somehow failed to enable compiler optimizations the whole time, so my original question isn't relevant anymore. I never checked the whole console output during compilation and I thought that -O2 was enabled by default on the hardware. So by compiling with -O0 I was tilting at windmills here - ugh. With that out of the way, my method is 2-4x faster with -O2, too, although it's also doing flipping, masking and clipping.
Maybe we'll see a performance bump for the SDK drawBitmap in the future. I suppose it's one of the most used methods in Playdate games, so many of them would instantly get a performance boost from faster blitting.
I'll try to incorporate my sprite drawing into your provided source code in the next few days.
I really want to test the version that supports masking and clipping. I tried to edit Dave's version for a playful test in one of my own games but could not get it to work, so I'll wait for your version to have a play with it. Or is the current version already on GitHub somewhere?
It’s a really good idea. 90% of my game’s graphics make use of polygon drawing and bitmap drawing, and when it comes to bitmaps, the only things I ever use are (sometimes) masks, and clipping, never any of the fancy rotation/blurred/mirror which are too slow to be of any real use for me. So I would love an alternative barebones function for bitmap drawing in the SDK that would give me better performance and gets rid of the superfluous. I can’t imagine that most developers wouldn’t want that as well.
I merged my bitmap drawing into the file provided by Dave so there are now 3 modes (SDK, Dave, my variant). Source code and a hardware .pdx with -O2 are in the zip.
EDIT: Doesn't work if bytes per bitmap row is not a multiple of 4 (wrong assumption). Proton Drive link
My sprite drawing supports:
sprite drawing modes
draw only a region of the bitmap
destination and/or source bitmap with and without mask buffer
The only thing it's missing compared to the SDK one is the stencil buffer (I think?), but I don't really know the ins and outs of drawBitmap. The method currently calculates all the stuff above just like drawBitmap. The only major branch right now is for x-flipped vs non-x-flipped bitmap drawing.
I tried to document my code where I saw fit.
Running the code as is gives me around 7 FPS for SDK, 12 for my variant and 42 for Dave's. Dave's method is pretty much the ceiling for performance for bitmaps in the most basic stripped down case I think (unless you also start to unroll it). When running some tests the performance difference got bigger the smaller the drawn bitmaps became. I called drawBitmap along with pushContext, setDrawMode, setDrawOffset, setScreenClipRect, popContext so it drew the same thing as my own method.
Update: We just noticed after some testing that my method currently wrongly assumes that all SDK bitmap rows are aligned to 32 bits (like the framebuffer), which is apparently not true. So in the case where rowbytes is not a multiple of 4 it won't work as is. And it just so happened that the width of Dave's example sprite IS aligned to 32 bits, so I didn't notice while putting the main.c together.
I originally programmed the bitmap drawing to work with my hand-rolled engine and texture format, where I made sure to align all rows accordingly, so I didn't stumble upon that issue earlier.
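To illustrate the rowbytes issue, here is a small sketch of how one might detect rows that are not a multiple of 4 bytes and repack them into a padded buffer before using a word-based blitter. This is only one possible workaround, not what the posted code does. All function names are made up for illustration, and rowbytes is assumed to be whatever the SDK reports for the bitmap.

```c
#include <stdint.h>
#include <string.h>

/* Bytes needed for `width` 1-bit pixels, rounded up to whole bytes. */
static int min_rowbytes(int width)
{
    return (width + 7) / 8;
}

/* Same, but rounded up to a multiple of 4 bytes, which is the layout a
   32-bit word-based blitter expects. */
static int padded_rowbytes(int width)
{
    return ((width + 31) / 32) * 4;
}

/* Nonzero if every row starts on a 4-byte boundary relative to row 0. */
static int rows_are_word_aligned(int rowbytes)
{
    return (rowbytes & 3) == 0;
}

/* Copy a bitmap into a buffer whose rows are `dst_rowbytes` wide
   (dst_rowbytes >= src_rowbytes), zero-filling the padding bytes. */
static void repack_rows(uint8_t *dst, int dst_rowbytes,
                        const uint8_t *src, int src_rowbytes, int height)
{
    for (int y = 0; y < height; y++) {
        uint8_t *d = dst + (size_t)y * (size_t)dst_rowbytes;
        memcpy(d, src + (size_t)y * (size_t)src_rowbytes, (size_t)src_rowbytes);
        memset(d + src_rowbytes, 0, (size_t)(dst_rowbytes - src_rowbytes));
    }
}
```

A 400-pixel-wide row comes out at 52 bytes either way, which matches the framebuffer being word-aligned, while a 20-pixel-wide sprite needs only 3 bytes per row and would be padded to 4. Repacking costs memory and a one-time copy; the alternative is to handle the last partial word of each row with byte loads inside the blitter.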