I've been investigating the floating point performance of the Playdate, and here are my results. As a disclaimer, there's a little bit of uncertainty in my results due to various things: compiler optimizations, CPU dual issue timing, etc. I've tried to be as correct as possible with my findings, but there will likely be a small amount of error, and I've made assumptions about how the CPU operates since ARM keep some of those details close to their chest. Secondly, this guide is intended for intermediates and veterans alike, so some things may already be obvious to the latter. Finally, you can't really talk about instruction performance without getting into assembly, so that will happen a bit; just note that all floating point instructions start with the letter 'v'.
Before I dive into the floating point subject matter, I might just point out that the CPU is dual issue, meaning that given the right combination of two instructions in a row, the CPU can issue (meaning: start processing) both of them at the same time! This is pretty awesome, and it comes up a surprising amount, just as long as the two instructions don't depend on one another, i.e. the second instruction cannot use the result of the one before it; if it does, the CPU simply won't issue them together. Dual issue on most CPUs usually means something quite limited, like being able to issue a memory transaction (a load or a store) at the same time as an integer math operation (in the Arithmetic Logic Unit, aka ALU). The Playdate's ARM CPU, however, actually has TWO asymmetric ALU pipes: each pipe can do a specific subset of ALU instructions that the other can't, but I believe they have quite a bit of overlap, and this means you have a good chance of being able to dual issue two basic math instructions in a single cycle! Better yet, it has a third pipe for multiply (mul) and multiply-accumulate (mla), so you have an incredibly high chance of being able to issue a mul or mla at the same time as an add, letting you do a multiply and two adds on integers every single cycle! That's crazy powerful. I wouldn't be surprised if someone who made a game with fixed point would see some pretty amazing performance.
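If you're unfamiliar with fixed point, here's a minimal sketch of what Q16.16 fixed-point math can look like in C (the fix16 type and function names are just made up for illustration), built on exactly the kind of integer mul/add instructions described above:
#include <stdint.h>

typedef int32_t fix16; // Q16.16: 16 integer bits, 16 fractional bits

static inline fix16 fix16_from_float(float f) { return (fix16)(f * 65536.0f); }

static inline fix16 fix16_mul(fix16 a, fix16 b)
{
    // widen to 64 bits so the intermediate product doesn't overflow, then shift back down
    return (fix16)(((int64_t)a * b) >> 16);
}

static inline fix16 fix16_mla(fix16 acc, fix16 a, fix16 b)
{
    // multiply-accumulate in Q16.16
    return acc + fix16_mul(a, b);
}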
The floating point capabilities, on the other hand, have their own pipe; some documentation says it's actually two pipes, but I'm unconvinced. This is bad in that you cannot dual issue floating point instructions with other floating point instructions, with minor exceptions involving memory transactions (vldr/vstr) and register transfers (vmov). But it does still mean you can dual issue general purpose instructions with floating point instructions extremely easily. In fact, in my tests I was able to dual issue integer add/mul/mla instructions with their floating point equivalents vadd/vmul/vfma quite freely and see some crazy high performance: 285 million ops per second, despite having a 168MHz CPU. So the first takeaway is to make sure your compiler has more than one critical path of code to schedule at a time, which can mean marking your basic math functions as 'inline' when you write code in C/C++. In extreme cases it may mean literally doing more than one thing at a time in your code in order to reach peak throughput.
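As a concrete (and admittedly contrived) C sketch of doing more than one thing at a time, here's a loop updating a hypothetical particle struct, where the integer lifetime work is completely independent of the float position work, leaving the compiler free to schedule the two chains so they dual issue:
#include <stdint.h>

typedef struct { float x, vx; int32_t life; } particle;

static inline void updateParticles(particle* p, int count, float dt)
{
    for (int i = 0; i < count; i++)
    {
        p[i].x += p[i].vx * dt; // floating point work
        p[i].life -= 1;         // integer work, independent of the float chain above
    }
}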
The floating point unit has a whopping total of 32x float32 registers! Despite the Playdate not having the float64 extension, you can still address the float32 registers in pairs in memory transactions, meaning you can load/store two float32 registers with a single instruction. Since there are so many float32 registers, you should make sure your code fully utilizes them and keeps all your hot data in registers rather than in memory/on the stack. Again, this means marking basic math functions as 'inline'. It's also smart to periodically investigate the assembly your compiler is generating to make sure it's not needlessly dumping/reloading float registers to the stack in your important functions.
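For example, small inline vector helpers along these lines (the vec3 type and names here are mine, not the SDK's) give the compiler a good chance of keeping a whole vector in float registers across a chain of operations instead of bouncing it through the stack:
typedef struct { float x, y, z; } vec3;

static inline vec3 vec3_add(vec3 a, vec3 b)    { return (vec3){ a.x + b.x, a.y + b.y, a.z + b.z }; }
static inline vec3 vec3_scale(vec3 a, float s) { return (vec3){ a.x * s, a.y * s, a.z * s }; }
static inline float vec3_dot(vec3 a, vec3 b)   { return a.x * b.x + a.y * b.y + a.z * b.z; }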
So, talking more about the assembly your compiler generates: the Windows PlaydateSDK is (mildly humorously) trying to compile C code with both -O2 and -O3 in Release mode, since cmake defaults to -O3 and Playdate are providing -O2 manually. Some people have seen better performance with -O2; I suspect this is because some of the loop optimizations in -O3 can be detrimental on the Playdate with its tiny caches and slow RAM. There is however -Ofast, which is a superset of -O3 and includes one of the most important optimization flags I think everyone should use regardless of which optimization profile they choose: -ffast-math. This flag enables some pretty powerful floating point math optimizations while deliberately "breaking" some of the needlessly strict IEEE 754 rules, and I call them needless because for 99% of games and realtime applications they're unimportant. Note that -ffast-math is NOT included in -O2 or -O3, so you'll need to supply it yourself. For example, in the Playdate SDK the file C_API/Examples/3D library/mini3d/3dmath.h contains the function _lib3d_sqrtf(), which curiously wraps the vsqrt (square root) instruction; I suspect this is attempting to avoid some of the needless overhead which -ffast-math would have kindly removed, had it been used by default with the SDK. Note: if your math relies on NaNs or infinities, it won't play well with these optimizations and will likely break; you're much better off avoiding NaNs and infinities in the first place.
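If your project uses the SDK's CMake setup, adding the flag can be as simple as something like this (the 'mygame' target name is hypothetical, substitute your own):
target_compile_options(mygame PRIVATE -ffast-math)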
The floating point unit has pretty high throughput, but it has its hazards too. Most float instructions can issue (begin processing) in a single cycle, but their latency (the number of cycles until the result is ready) is usually 3. So if you try to use the result of a float instruction immediately in the next instruction, the CPU will usually stall for 2 cycles until it's ready. Those stalls could be better spent issuing other instructions, again emphasizing the importance of giving the compiler more than one critical path of code to schedule: inline those basic functions! Not only do vadd and vmul issue in a single cycle, but the two multiply accumulate instructions (vfma, vmla) also issue in a single cycle! My testing saw 160+ MFlops from vadd/vmul/vfma instructions on the 168MHz CPU. But the gotcha with vfma is that it has a latency of 5 cycles, and vmla has a latency of 6 cycles. So to see peak MFlops, you need to interleave your math instructions to make use of those stalls; without that, your 160 MFlops for vfma can crash down to 32 MFlops. There are also some quirks with these multiply accumulate instructions: for example, despite vfma having a latency of 5 cycles, it can actually absorb 2 of those cycles on the accumulator operand, but only if that operand comes from another vfma instruction.
vfma.f32 s0, x, x
; stall
; stall
; stall
; stall
vadd.f32 x, s0, s0 ; any float instructions will stall for 4 cycles when trying to immediately use the result of a vfma
vfma.f32 s0, x, x
; stall
; stall
vfma.f32 s0, x, x ; however vfma can absorb 2 of those cycles but only in the accumulator operand
Another pretty brutal gotcha is that vfma will actually block any subsequent, completely unrelated, vadd instruction after it for 3 cycles.
vfma.f32 x, x, x
; stall
; stall
; stall
vadd.f32 x, x, x ; vfma has reserved the adding logic, preventing vadd from starting for 3 cycles, despite there being no register dependency
In my testing, I came up with this scenario:
vadd.f32 s0, x, x
vfma.f32 s0, x, x
vadd.f32 s1, x, x
vfma.f32 s1, x, x
vadd.f32 s2, x, x
vfma.f32 s2, x, x
vadd.f32 s3, x, x
vfma.f32 s3, x, x
This looks pretty innocent, but it's actually a double whammy of stalling: these 8 instructions take a whopping 25 cycles to run! However, with just a little bit of reordering, while keeping the result exactly the same, we can reduce that back down to 8 cycles, like so:
vadd.f32 s0, x, x
vadd.f32 s1, x, x
vadd.f32 s2, x, x
vadd.f32 s3, x, x
vfma.f32 s0, x, x
vfma.f32 s1, x, x
vfma.f32 s2, x, x
vfma.f32 s3, x, x
The reason the former is so brutal is that the vfma instructions are stalling for 2 cycles waiting for the vadd to complete so they can use its result, while every vadd (except the first) is stalling for 3 cycles waiting for the vfma to stop hogging the add logic so it can run.
Let this be a warning to anyone wanting to write assembly code in order to try and beat the compiler: you've got some pretty tricky hazards to keep in mind!
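On the C side, the usual way to hand the compiler several independent chains is to use multiple accumulators. Here's a minimal sketch (the function is mine, and it assumes count is a multiple of 4 for brevity) of a dot product split across four accumulators, so each multiply-accumulate has other work available to hide its latency behind:
static inline float dot4(const float* a, const float* b, int count)
{
    // four independent accumulator chains, so consecutive multiply-accumulates
    // don't have to wait on each other's results
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < count; i += 4)
    {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}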
Most veteran programmers are well aware of how slow divide and square root can be, especially on older or power-reduced architectures, and the Playdate is no exception. Square root (vsqrt) and divide (vdiv) have a throughput of 14 cycles and 16 cycles respectively, meaning they will completely halt everything else the CPU is doing until they're complete. Rough! Like the rest of the floating point instructions, they also incur 2 additional cycles until their results are ready, so their latencies are 16 cycles for vsqrt and 18 cycles for vdiv. If you're dividing two or more things by the same denominator (eg: a = a / x, b = b / x), it's smarter to calculate the reciprocal of the denominator (rcpX = 1 / x) and then multiply by it (a = a * rcpX, b = b * rcpX). The result will have some very minor precision differences, but they shouldn't be perceivable, and -ffast-math will likely do this optimization automatically for you anyway. Likewise, there are some scenarios where you can avoid sqrt entirely: if you're comparing two floats like a < sqrt(b), replace it with a * a < b (assuming a is non-negative) and trade the expensive sqrt for a super-cheap multiply. For example, it's common to compare the length of a vector against something. Do you really need the length? Or can you get away with comparing the length squared and adjusting your algorithm to accommodate? a < length(b) can become a * a < lengthSq(b), completely removing the square root.
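Here's what both tricks look like in C (the helper names are just for illustration): one divide turned into a reciprocal plus cheap multiplies, and a radius check done against squared distance so no sqrt is needed:
static inline void divideAllBy(float* a, float* b, float* c, float x)
{
    float rcpX = 1.0f / x;              // one vdiv...
    *a *= rcpX; *b *= rcpX; *c *= rcpX; // ...followed by cheap multiplies
}

static inline int withinRadius(float dx, float dy, float radius)
{
    // compare squared distance against squared radius, no vsqrt required
    // (radius is assumed non-negative here)
    return (dx * dx + dy * dy) < (radius * radius);
}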
A blast from the past is the infamous "fast inverse square root" ( Fast inverse square root - Wikipedia ), which was very helpful on older architectures but isn't much use on modern ones. The Playdate, on the other hand, can take full advantage of this trick! Some architectures have a pretty nasty penalty for transferring data between integer and float registers, but the Playdate does not (apart from the typical 3 cycle latency). Calculating a reciprocal square root (rsqrt) with vdiv and vsqrt has a throughput of 14+16=30 cycles and a latency of 16+18=34 cycles, meaning it hogs your CPU for 30 cycles and has your result ready after 34 cycles. Using this trick instead has a throughput of 8 cycles and a latency of 18 cycles, which is actually reduced to 16 cycles thanks to our friend dual issue. That's a humongous improvement with plenty of room for other work and absorbing stalls, but remember this is an approximation. As long as you're not trying to normalize tiny or humongous vectors with it, it should work just fine with the supplied single iteration of Newton-Raphson. It's also pretty inexpensive to extend rsqrt into sqrt or rcp: sqrt(x) = x * rsqrt(x), rcp(x) = rsqrt(x) * rsqrt(x). However, their timings vs the hardware instructions aren't that appealing: they comfortably beat hardware in throughput (allowing more instructions to execute in the same amount of time), but their latency is the same or slightly worse, so they're probably not worth looking into except for the determined programmer.
As mentioned in my previous performance guide ( Discord ), the CPU has hardware support for converting between float32 and float16 with the instructions vcvtt and vcvtb. They both issue in a single cycle with a 3 cycle latency; however, converting from float32 to float16 has an additional hazard to be aware of. You cannot pack two float32 values into a single register as two float16 values in back to back instructions without a 2 cycle stall. This is likely because the hardware only partially writes to half of the destination register while leaving the other half untouched, and doing that in back to back instructions means the second instruction has to wait for the first to complete. If anyone from ARM ends up reading this (unlikely!), can we have a single instruction to pack-convert two float32 registers into a single destination register? The inverse would be handy too. Problem solved!
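If you want to play with this from C, here's a minimal sketch, assuming your toolchain exposes the __fp16 storage type (on GCC for ARM you may need -mfp16-format=ieee), which converts a buffer of float32 down to float16 for storage:
static inline void packToHalf(const float* src, __fp16* dst, int count)
{
    for (int i = 0; i < count; i++)
        dst[i] = (__fp16)src[i]; // float32 -> float16 conversion (vcvtb/vcvtt under the hood)
}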
A quick mention of denormals. Most architectures have serious performance penalties when using denormals in floating point instructions, so much so that they have configurations to treat denormal inputs as zero and to flush denormal results to zero, allowing floating point to run at full speed again. The Playdate CPU surprisingly eats denormals for breakfast and suffers no performance penalty, whilst still supplying a flush-to-zero configuration. That configuration doesn't really bring any performance benefits here unfortunately, except for vsqrt and vdiv, which appear to have an early-exit when given a denormal.
Summary and takeaways:
- Most float instructions, including Add, Multiply, and Multiply Accumulate, are super fast and issue in a single cycle.
- However most float instructions have a latency of 3 cycles, Fused Multiply Accumulate (vfma) is 5 cycles, and Multiply Accumulate (vmla) is 6 cycles.
- There are 32 float registers, which can be addressed as pairs in load/store instructions.
- The CPU can dual issue certain instructions. It can especially dual issue float instructions with non-float instructions.
- Keeping your code running at peak throughput is very important, and a little tricky at times because of the latency of some instructions.
- Make sure your code is absorbing these potential latency stalls by interleaving it with other work, and give the compiler's scheduler more than one critical path to work with at a time.
- Absorb stalls and keep data in registers by inlining your basic math operations, but be aware this will likely make your code bigger, and /may/ impact your performance negatively since it may use more of the tiny instruction cache. Try it and investigate/test.
- Use the compiler optimization parameter -ffast-math, it's basically free performance for games, and not automatically included in -O2 or -O3.
- Divide and square root are quite slow, avoid them when you can. Transform multiple divides with the same denominator into reciprocal multiplications, replace some comparisons with their squared form to cancel out the sqrt, or make sure -ffast-math is doing these automatically for you.
- The fast inverse square root trick ( Fast inverse square root - Wikipedia ) is incredibly fast on Playdate, I've included a drop-in implementation you can use at the end of the post.
- Denormals are surprisingly fully performant.
- Check in on the assembly your compiler is producing! ARM is probably one of the easiest assembly languages to pick up (looking at you, PowerPC). You're definitely at the mercy of your compiler sometimes, but these days compilers are pretty amazing and clever.
- Be aware of some really nasty hazards when trying to write assembly yourself. Register dependency hazards, and pipeline hazards, amongst others.
- Most importantly, don't take any of this as a gold standard, try every suggestion yourself and see if it helps. If it doesn't help, look into it and understand why, perhaps something else is blocking it from working?
fastfloat.h:
#pragma once
#include <stdint.h>

// Fast inverse square root (approximate), with one Newton-Raphson refinement.
// Declared 'static inline' so it can safely live in a header without linker
// errors when the compiler chooses not to inline (e.g. at -O0).
static inline float fastRsqrt(float x)
{
    union {
        float f;
        uint32_t u;
    } f32u32;
    f32u32.f = x;
    f32u32.u = 0x5f3759df - (f32u32.u >> 1);              // initial guess via the bit trick
    f32u32.f *= 1.5f - (x * 0.5f * f32u32.f * f32u32.f);  // one Newton-Raphson iteration
    return f32u32.f;
}

// Approximate reciprocal, divide, and square root built on top of fastRsqrt.
static inline float fastRcp(float x) { x = fastRsqrt(x); return x * x; }
static inline float fastDiv(float x, float y) { return x * fastRcp(y); }
static inline float fastSqrt(float x) { return x * fastRsqrt(x); }
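And a quick usage example, normalizing a 2D vector with it (the vec2 type is just for illustration):
#include "fastfloat.h"

typedef struct { float x, y; } vec2;

static inline vec2 vec2_normalize(vec2 v)
{
    // one fastRsqrt plus two multiplies, instead of a vsqrt followed by two vdivs
    float r = fastRsqrt(v.x * v.x + v.y * v.y);
    return (vec2){ v.x * r, v.y * r };
}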