How is native draw line so performant?

Voley · October 21, 2024, 1:36pm

I'm writing a game to test some drawing and performance.
I have written my own routines to draw lines using DDA and Bresenham's algorithm in C.
They perform OK and are pretty fast.
But when I started profiling, I noticed that the native graphics->drawLine is at least 5 times faster.

Wonder where that is coming from? Some ASM magic?

freds72 · October 21, 2024, 2:05pm

surprising - can you post code?
the sdk routine is actually fairly slow as it generates ‘thick lines’ (good looking tho!).

Voley · October 21, 2024, 2:13pm

Here is the function.
On my PC, in the simulator, I can fill the full screen with lines about 100 times per update at 30 fps; any more and the fps drops.

With the built-in routine I can do 400+ fills before the fps drops below 30.

Of course I'm aware that the situation can differ on the actual device; I'll check once I have one.

void bresenhamDrawLine(uint8_t* frameBuffer, int x0, int y0, int x1, int y1) {
    int dx =  abs (x1 - x0), sx = x0 < x1 ? 1 : -1;
    int dy = -abs (y1 - y0), sy = y0 < y1 ? 1 : -1;
    float err = dx + dy, e2; /* error value e_xy */
    int rowStride = 52;

    for (;;){
        int byteIndex = (y0 * rowStride) + (x0 / 8);
        int bitMask = 0x80 >> (x0 % 8);
        frameBuffer[byteIndex] &= ~bitMask;

        if (x0 == x1 && y0 == y1) break;
        e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; } /* e_xy+e_x > 0 */
        if (e2 <= dx) { err += dx; y0 += sy; } /* e_xy+e_y < 0 */
    }
}

freds72 · October 21, 2024, 2:19pm

ah ok - yeah don’t use the simulator for any performance benchmarks!

code looks good and I'm fairly sure it will do just fine on PD.

note: make sure abs is not casting to double btw
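
To illustrate the abs point: in C, abs from <stdlib.h> takes and returns int, while fabs from <math.h> works on double. The posted function also keeps err and e2 in float, which forces int/float conversions on every step. Below is a minimal sketch of an integer-only setup; LineSetup and lineSetup are hypothetical names used only for illustration, not part of the SDK or the original code.

```c
#include <stdlib.h>  /* abs(int) is declared here; fabs(double) is in <math.h> */

/* Hypothetical helper (illustration only): integer-only Bresenham setup. */
typedef struct {
    int dx, dy;   /* dx >= 0, dy <= 0, as in the posted function */
    int sx, sy;   /* step direction on each axis: +1 or -1 */
    int err;      /* error term kept as int instead of float */
} LineSetup;

static LineSetup lineSetup(int x0, int y0, int x1, int y1) {
    LineSetup s;
    s.dx =  abs(x1 - x0);
    s.dy = -abs(y1 - y0);
    s.sx = x0 < x1 ? 1 : -1;
    s.sy = y0 < y1 ? 1 : -1;
    s.err = s.dx + s.dy;
    return s;
}
```

Since every value here is an int, e2 = 2 * err can stay an int as well, so the inner loop never touches floating point. Whether that changes anything measurable should be checked on the device.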

risrr · October 21, 2024, 5:13pm

Minor things the compiler will probably do for you, but if you want to be sure (as division is slow):

Instead of x0 / 8 do x0 >> 3
Instead of x0 % 8 do x0 & 0x7
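
A minimal sketch of those two substitutions applied to the pixel write from the posted function. clearPixel and ROW_STRIDE are hypothetical names for illustration; the stride of 52 is taken from the posted code. The shift and mask only match / and % for non-negative x, which is assumed here.

```c
#include <stdint.h>

enum { ROW_STRIDE = 52 };  /* bytes per row, as in the posted function */

/* Hypothetical helper (illustration only): clear one bit in a 1-bit
   framebuffer, which is what the posted loop does for each pixel.
   Assumes x >= 0 and y >= 0 and that (x, y) is inside the buffer. */
static inline void clearPixel(uint8_t *frameBuffer, int x, int y) {
    int byteIndex = y * ROW_STRIDE + (x >> 3);      /* x >> 3 == x / 8 for x >= 0 */
    uint8_t bitMask = (uint8_t)(0x80 >> (x & 0x7)); /* x & 0x7 == x % 8 for x >= 0 */
    frameBuffer[byteIndex] &= (uint8_t)~bitMask;
}
```

For signed int the compiler cannot always make this substitution by itself, because / and % round toward zero for negative values, so writing the shift and mask explicitly (or using unsigned coordinates) removes the doubt.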

Also potentially minor (if your perf is fine on the device then don't worry about it), but in hot loops like this you may want to avoid pipeline flushes due to branch misprediction:

int stepX = e2 >= dy;  // 1 or 0
int stepY = e2 <= dx;  // 1 or 0

err += (dy * stepX) + (dx * stepY);
x0 += sx * stepX;
y0 += sy * stepY;

Now the stepping logic has no branches (only the loop-exit check remains) and the processor can just do pure math blasting.
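
For reference, here is a sketch of the whole routine with the branch-free stepping folded in. It assumes an int error term instead of float, the same 52-byte row stride as the posted function, and coordinates that are already inside the buffer. branchlessDrawLine, stepX and stepY are names chosen for this sketch, not anything from the SDK.

```c
#include <stdint.h>
#include <stdlib.h>

/* Sketch only: Bresenham with branch-free stepping. No clipping is done,
   so both endpoints must lie inside the framebuffer. */
void branchlessDrawLine(uint8_t *frameBuffer, int x0, int y0, int x1, int y1) {
    const int rowStride = 52;                 /* bytes per row, from the posted code */
    int dx =  abs(x1 - x0), sx = x0 < x1 ? 1 : -1;
    int dy = -abs(y1 - y0), sy = y0 < y1 ? 1 : -1;
    int err = dx + dy;                        /* integer error term */

    for (;;) {
        /* Clear the bit for (x0, y0): black pixel in a 1-bit buffer. */
        frameBuffer[y0 * rowStride + (x0 >> 3)] &= (uint8_t)~(0x80 >> (x0 & 0x7));

        if (x0 == x1 && y0 == y1) break;      /* the one remaining branch */

        int e2 = 2 * err;
        int stepX = e2 >= dy;                 /* 1 if we step in x, else 0 */
        int stepY = e2 <= dx;                 /* 1 if we step in y, else 0 */

        err += dy * stepX + dx * stepY;
        x0  += sx * stepX;
        y0  += sy * stepY;
    }
}
```

Both comparisons use the same e2, exactly as the two if statements in the original do, so the pixels visited are the same. Whether the extra multiplies beat two well-predicted branches is something to measure on the device rather than assume.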