Fast Inverse SQRT in C API 3D Library in the SDK

shortrounddev · March 2, 2022, 7:34pm

I see a lot of devs try to avoid sqrt and related instructions because of a kind of superstition about their execution speed. On basically all modern CPUs there's a SQRT instruction (including Cortex M7). Just thought y'all should know that it is actually pretty fast to compute sqrts (and inverse sqrts are just 1/sqrt). I would recommend just calling the ASM instruction if you're using the C API instead of going with the old FISR from John Carmack inside the 3dmath example package

Whitebrim · March 2, 2022, 10:16pm

I didn't develop games on C, nor I'm familiar with C, but I remember there was fast inverse sqrt in Quake:

Maybe this will be helpful, maybe not

dave · March 3, 2022, 2:32am

Oh, good call! I took fisr() out of all the internal code after I tested and found it wasn't much faster than sqrt(), and had really bad precision in some cases. I forgot about the 3D demo, though.

dave · March 3, 2022, 2:33am

found the issue comment:

doing 1 million PDVector_normalize() calls in a loop takes around 290 ms with the single-iteration fisr, and that goes to 352 ms if I add a second iteration. Using sqrtf() takes 495 ms.

shortrounddev · March 3, 2022, 2:53am

You're saying the math.h function sqrtf is 495ms? I wonder if the standard library in the compiler (not sure if this is the playdate arm compiler or your local compiler for the simulator) is using a software implementation. I know glibc on x86_64 machines will compile a simple asm() call:

github.com

lattera/glibc/blob/895ef79e04a953cac1493863bcae29ad85657ee1/sysdeps/x86_64/fpu/e_sqrtf.c#L23


      
             Lesser General Public License for more details.
          
             You should have received a copy of the GNU Lesser General Public
             License along with the GNU C Library; if not, see
             <http://www.gnu.org/licenses/>.  */
          
          #include <math_private.h>
          
          #undef __ieee754_sqrtf
          float
          __ieee754_sqrtf (float x)
          {
            float res;
          
            asm ("sqrtss %1, %0" : "=x" (res) : "xm" (x));
          
            return res;
          }
          strong_alias (__ieee754_sqrtf, __sqrtf_finite)

but also has slower software implementations:

github.com

lattera/glibc/blob/master/sysdeps/ieee754/dbl-64/e_sqrt.c#L49


      
          #include "root.tbl"
          #include <math-barriers.h>
          #include <math_private.h>
          
          /*********************************************************************/
          /* An ultimate sqrt routine. Given an IEEE double machine number x   */
          /* it computes the correctly rounded (to nearest) value of square    */
          /* root of x.                                                        */
          /*********************************************************************/
          double
          __ieee754_sqrt (double x)
          {
            static const double
              rt0 = 9.99999999859990725855365213134618E-01,
              rt1 = 4.99999999495955425917856814202739E-01,
              rt2 = 3.75017500867345182581453026130850E-01,
              rt3 = 3.12523626554518656309172508769531E-01;
            static const double big = 134217728.0;
            double y, t, del, res, res1, hy, z, zz, p, hx, tx, ty, s;
            mynumber a, c = { { 0, 0 } };
            int4 k;

shortrounddev · March 3, 2022, 3:01am

I'm not super familiar with ARM (more of a PC guy) but it looks like the __sqrt intrinsic was created to guarantee a single instruction call to the hardware: Documentation – Arm Developer ?

You can probably make more sense of the documentation that I can

shortrounddev · March 3, 2022, 3:28am

At this point I'm just going down the rabbit hole... The libc docs packaged with gcc-arm-none-eabi are for redhat's newlib libc. The source for newlib has the ARM __ieee754_sqrtf implementation as:

#if (__ARM_FP & 0x4) && !defined(__SOFTFP__)
#include <math.h>

float
__ieee754_sqrtf(float x)
{
	float result;
#if __ARM_ARCH >= 6
	asm ("vsqrt.f32 %0, %1" : "=w" (result) : "w" (x));
#else
	/* VFP9 Erratum 760019, see GCC sources "gcc/config/arm/vfp.md" */
	asm ("vsqrt.f32 %0, %1" : "=&w" (result) : "w" (x));
#endif
	return result;
}

#else
#include "../../math/ef_sqrt.c"
#endif

So it SHOULD be using the VFP. This should be the fastest way to run it on ARM hardware

dave · March 3, 2022, 4:08am

I don't remember exactly how I tested it when I wrote that, but now I see PDVector_normalize() is calling sqrtf in newlib instead of using vsqrt, and that's not a lightweight function. I have no doubt that single vsqrt instruction would be much faster.

I've run across a few other places like this where I've found slow code that can be replaced with a single instruction but the compiler doesn't help us out there. (In the case of sqrt there's the matherr mechanism it has to support, so that's not too surprising..) It would be nice to collect all of that in one place, but I'm not sure where it would go in the SDK. Maybe that lives outside somewhere.

NaOH · March 6, 2022, 10:58pm

I may be out of my depth -- but isn't the division part also slow? fisr saves on a division.

sgeos · March 8, 2022, 5:33am

ASM programmers who optimize literally count cycles. The cycle count for each instruction is documented.

Another thing to consider- is the compiler putting instructions around the inline ASM call to unnecessarily guarantee things like register preservation? This is why I preferred to write and assemble my own assembly routines instead of using inline ASM.

Compilers can operate on bulk code far more efficiently than human programmers, but they have no insight into the problem and need to conform to the language standard. Unnecessary instructions can theoretically be optimized out, but the actual output depends on the state of the art.

EDIT: In the case of a sqrt function, you just want to be able to jump to the ASM instruction and return the result in a way that conforms to the C ABI. From memory (without double checking), that should just require using the correct ABI conformant registers in the instruction call, followed by whatever branch instruction is the ASM equivalent of "return".

EDIT 2: The Cortex-M7 has a relatively complex pipeline, but the following link gives some information on instruction cycle times. Evidently documentation for the Cortex-M4 can also be useful. vdiv.f is 16 to 18 cycles, where vsqrt.f is 14 to 16 cycles. FWIW, the Playdate makefiles appear to support .s files.