Benchmarks & Optimisations

Is there a C version of this benchmark? I seem to remember some discussion about porting it to the C API, but I'm not sure it was ever done.

Has anyone done a Rev A vs Rev B comparison using this benchmark?


Comparing to Benchmarks & Optimisations - #27 by matt

I'm getting much better results on rev A hardware, firmware 2.1.1.

Any explanation for this?
Has the testing time per function been increased maybe?

These are the timing values I see in the script I used, which come down to 100 ms per test.

local FPS = 50
local frameMS = 5000/FPS
playdate.display.setRefreshRate(FPS)
#,	 BENCH,	CALL
nil	 14854
drawLine - Diagonal	   387
drawLine - Horizontal	  1846
drawLine - Vertical	   454
drawLine - Random Diagonal	   687
drawLine - fillRect	  4537
drawLine - drawRect	   935
math.random	  6736
math.random - local	  6398
math.sin	  5814
math.sin - random	  2848
math.cos	  5818
math.cos - random	  2598
math.floor - local	  6259
image:sample	  1680
drawText - local	   444
drawTextInRect	    55
drawRect	   454
fillRect	   849
drawCircleAtPoint	   221
fillCircleAtPoint	   280
drawCircleInRect	   379
fillCircleInRect	   479
sprite:moveTo - static	  1206
sprite:moveTo - random	   795
sprite:setImage	  1750
sprite:setCenter - static	  1477
sprite:setCenter - toggle	  1353
sprite:setCenter - random	  1125
sprite:setZIndex	  3054
image:draw	  1451
image:draw - locked	   782
image:draw - locked local	   746
image:draw - pushcontext local	   816

Not sure it's related to these tests, but Dave said the new hardware (Rev B units) is more sensitive to cache misses. That has caused performance issues before with drawBlurred and with rotating bitmaps 90 degrees: Performance of graphics.image:drawBlurred method on Rev B hardware - #3 by dave

Edit: I just ran the benchmark on my Rev A unit, compiling the source code from the first post (Lua) with pdc.exe from the latest SDK version, and these were my results:

They seem more in line with the values posted for other firmware versions, although some seem lower now, whereas yours actually ran more calls?

DEVICE (5 RUN AVE)
#,	 BENCH,	CALL
01,	  1590,	nil
02,	    73,	drawLine - Diagonal
03,	   301,	drawLine - Horizontal
04,	    85,	drawLine - Vertical
05,	   124,	drawLine - Random Diagonal
06,	   584,	drawLine - fillRect
07,	   161,	drawLine - drawRect
08,	   791,	math.random
09,	   749,	math.random - local
10,	   655,	math.sin
11,	   428,	math.sin - random
12,	   822,	math.cos
13,	   547,	math.cos - random
14,	  1345,	math.floor - local
15,	   258,	image:sample
16,	    84,	drawText - local
17,	    11,	drawTextInRect
18,	    85,	drawRect
19,	   152,	fillRect
20,	    41,	drawCircleAtPoint
21,	    52,	fillCircleAtPoint
22,	    69,	drawCircleInRect
23,	    85,	fillCircleInRect
24,	   181,	sprite:moveTo - static
25,	   130,	sprite:moveTo - random
26,	   260,	sprite:setImage
27,	   222,	sprite:setCenter - static
28,	   221,	sprite:setCenter - toggle
29,	   175,	sprite:setCenter - random
30,	   399,	sprite:setZIndex
31,	   220,	image:draw
32,	   137,	image:draw - locked
33,	   131,	image:draw - locked local
34,	   128,	image:draw - pushcontext local
END

Edit 2: I see a difference in the timings. The original source I just downloaded shows this:
[image: original timing code]
while yours uses 5000 / FPS.

So your benchmark runs 5 times longer per test and can make more calls in each one because the time window is different. In theory, your results posted above need to be divided by 5 to be comparable.
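
The scaling described here can be checked with a few lines of Lua. This is only a sketch: `normalizeCalls` is a hypothetical helper, and the 20 ms reference window assumes the original script uses 1000 / FPS at 50 FPS, which is inferred from the "5 times longer" observation rather than read from the source.

```lua
-- Hypothetical helper: rescale a call count measured over one test window
-- so it can be compared with counts measured over a different window.
local function normalizeCalls(calls, windowMs, referenceMs)
    return calls * referenceMs / windowMs
end

local FPS = 50
local daveWindowMs = 5000 / FPS      -- 100 ms per test, as in the script above
local originalWindowMs = 1000 / FPS  -- 20 ms per test (assumption, see lead-in)

-- "nil" benchmark from the first listing: 14854 calls in 100 ms.
local scaled = normalizeCalls(14854, daveWindowMs, originalWindowMs)
print(string.format("%.1f", scaled))  -- 14854 / 5, i.e. 2970.8
```

Even after rescaling, the two versions may not line up exactly, since they may differ in more than the length of the test window.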

Edit 3: it seems you downloaded Dave's version, whereas the earlier numbers were based on the original, which is what I ran above.
For reference, here are my Rev A numbers with Dave's version as well (I think Nino's are Rev A too):

echo off
time and date set
#,	 BENCH,	CALL
nil	 14831
drawLine - Diagonal	   390
drawLine - Horizontal	  1833
drawLine - Vertical	   449
drawLine - Random Diagonal	   683
drawLine - fillRect	  4611
drawLine - drawRect	   954
math.random	  6028
math.random - local	  7295
math.sin	  7293
math.sin - random	  3169
math.cos	  7167
math.cos - random	  3049
math.floor - local	  8664
image:sample	  1948
drawText - local	   455
drawTextInRect	    55
drawRect	   456
fillRect	   848
drawCircleAtPoint	   219
fillCircleAtPoint	   262
drawCircleInRect	   361
fillCircleInRect	   467
sprite:moveTo - static	  1259
sprite:moveTo - random	   782
sprite:setImage	  1802
sprite:setCenter - static	  1750
sprite:setCenter - toggle	  1444
sprite:setCenter - random	  1212
sprite:setZIndex	  2879
image:draw	  1381
image:draw - locked	   745
image:draw - locked local	   752
image:draw - pushcontext local	   759
END


@joyrider3774 asked if I could run these benchmarks on my Rev B Playdate, and I'm happy to, especially as I'm also trying to figure out Rev A vs Rev B performance differences in my music player app. I'll post about that in a separate thread soon.

Results from original version and Dave's version
Original (with 5 second delay, crank enabled)
DEVICE (5 RUN AVE)
#,	 BENCH,	CALL
01,	  1299,	nil
02,	    70,	drawLine - Diagonal
03,	   337,	drawLine - Horizontal
04,	    83,	drawLine - Vertical
05,	   118,	drawLine - Random Diagonal
06,	   783,	drawLine - fillRect
07,	   177,	drawLine - drawRect
08,	   849,	math.random
09,	   798,	math.random - local
10,	   906,	math.sin
11,	   470,	math.sin - random
12,	   892,	math.cos
13,	   461,	math.cos - random
14,	   860,	math.floor - local
15,	   449,	image:sample
16,	   110,	drawText - local
17,	    14,	drawTextInRect
18,	    89,	drawRect
19,	   159,	fillRect
20,	    46,	drawCircleAtPoint
21,	    56,	fillCircleAtPoint
22,	    73,	drawCircleInRect
23,	   101,	fillCircleInRect
24,	   375,	sprite:moveTo - static
25,	   215,	sprite:moveTo - random
26,	   489,	sprite:setImage
27,	   457,	sprite:setCenter - static
28,	   470,	sprite:setCenter - toggle
29,	   309,	sprite:setCenter - random
30,	   686,	sprite:setZIndex
31,	   354,	image:draw
32,	   184,	image:draw - locked
33,	   179,	image:draw - locked local
34,	   185,	image:draw - pushcontext local
END
Dave version (with 5 second delay, crank enabled)
#,	 BENCH,	CALL
nil	  7730
drawLine - Diagonal	   368
drawLine - Horizontal	  1791
drawLine - Vertical	   437
drawLine - Random Diagonal	   619
drawLine - fillRect	  4492
drawLine - drawRect	   945
math.random	  4838
math.random - local	  5080
math.sin	  5478
math.sin - random	  2594
math.cos	  5606
math.cos - random	  2604
math.floor - local	  5659
image:sample	  2525
drawText - local	   577
drawTextInRect	    74
drawRect	   451
fillRect	   773
drawCircleAtPoint	   238
fillCircleAtPoint	   297
drawCircleInRect	   418
fillCircleInRect	   550
sprite:moveTo - static	  1938
sprite:moveTo - random	  1149
sprite:setImage	  2177
sprite:setCenter - static	  2456
sprite:setCenter - toggle	  2421
sprite:setCenter - random	  1688
sprite:setZIndex	  3774
image:draw	  1952
image:draw - locked	   962
image:draw - locked local	   957
image:draw - pushcontext local	   963
END

While testing my app, I noticed that connecting my Playdate to the macOS simulator and clicking 'Control Device with Simulator' improved performance by a significant amount (iirc around 15%, enough to make Opus audio playback real time). The responsible serial command is disablecrank. I added a 5 second delay before starting the benchmark, so I could disable the crank via the Simulator option.
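
A delayed start of this kind could look something like the following. This is a sketch of the idea, not the code used for these results; `makeStartGate` is a hypothetical helper, and the time is passed in as a plain number so the snippet also runs in a desktop Lua interpreter.

```lua
-- Hypothetical start gate: returns a function that reports whether the
-- delay has elapsed, given the current time in milliseconds.
local function makeStartGate(delayMs)
    local firstSeenMs = nil
    return function(nowMs)
        if firstSeenMs == nil then
            firstSeenMs = nowMs  -- remember when we were first asked
        end
        return (nowMs - firstSeenMs) >= delayMs
    end
end

local ready = makeStartGate(5000)

-- On device this would be called from playdate.update() with
-- playdate.getCurrentTimeMilliseconds() as the argument.
print(ready(1200))  -- false: the 5 second window starts here
print(ready(4000))  -- false: 2800 ms elapsed
print(ready(6200))  -- true: 5000 ms elapsed
```

Until the gate returns true, the update function would simply return without running any benchmark, which leaves time to toggle the crank from the Simulator.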

Results from both versions with crank disabled
Original (with 5 second delay, crank disabled)
DEVICE (5 RUN AVE)
#,	 BENCH,	CALL
01,	  1298,	nil
02,	    78,	drawLine - Diagonal
03,	   357,	drawLine - Horizontal
04,	    92,	drawLine - Vertical
05,	   129,	drawLine - Random Diagonal
06,	   835,	drawLine - fillRect
07,	   180,	drawLine - drawRect
08,	   893,	math.random
09,	   880,	math.random - local
10,	   969,	math.sin
11,	   496,	math.sin - random
12,	   973,	math.cos
13,	   513,	math.cos - random
14,	   946,	math.floor - local
15,	   462,	image:sample
16,	   127,	drawText - local
17,	    15,	drawTextInRect
18,	    96,	drawRect
19,	   172,	fillRect
20,	    51,	drawCircleAtPoint
21,	    62,	fillCircleAtPoint
22,	    87,	drawCircleInRect
23,	   112,	fillCircleInRect
24,	   384,	sprite:moveTo - static
25,	   238,	sprite:moveTo - random
26,	   500,	sprite:setImage
27,	   468,	sprite:setCenter - static
28,	   494,	sprite:setCenter - toggle
29,	   353,	sprite:setCenter - random
30,	   684,	sprite:setZIndex
31,	   384,	image:draw
32,	   190,	image:draw - locked
33,	   197,	image:draw - locked local
34,	   201,	image:draw - pushcontext local
END
Dave version (with 5 second delay, crank disabled)
#,	 BENCH,	CALL
nil	  7836
drawLine - Diagonal	   401
drawLine - Horizontal	  1826
drawLine - Vertical	   477
drawLine - Random Diagonal	   666
drawLine - fillRect	  4958
drawLine - drawRect	  1017
math.random	  5109
math.random - local	  5574
math.sin	  6054
math.sin - random	  2645
math.cos	  6038
math.cos - random	  2639
math.floor - local	  6284
image:sample	  2757
drawText - local	   664
drawTextInRect	    79
drawRect	   493
fillRect	   895
drawCircleAtPoint	   261
fillCircleAtPoint	   324
drawCircleInRect	   451
fillCircleInRect	   591
sprite:moveTo - static	  2090
sprite:moveTo - random	  1260
sprite:setImage	  2691
sprite:setCenter - static	  2654
sprite:setCenter - toggle	  2634
sprite:setCenter - random	  1845
sprite:setZIndex	  3824
image:draw	  2112
image:draw - locked	  1038
image:draw - locked local	  1058
image:draw - pushcontext local	  1055
END

@dave, do you happen to know why the option is making such a difference in performance?


I've put the Rev B results with and without the crank in an Excel sheet and compared the values to my Rev A results. It's indeed weird that disabling the crank on a Rev B seems to have an impact.
In the original version, Rev B overall also seems faster than Rev A, except on some math stuff.

Original version (https://1drv.ms/x/s!AmOKrDXp7rpjlOZdcHmWJTxAuSoBYw?e=vOa1Zk)

Dave's version ( https://1drv.ms/x/s!AmOKrDXp7rpjlOZbB1yiegFswN3yXQ?e=4OtrfA )
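
For readers without access to the spreadsheets, the same kind of comparison can be sketched in Lua. The three rows below are taken from the Rev B original-version listings above (crank enabled vs crank disabled). `percentChange` is a hypothetical helper, and this is not necessarily how the sheets compute their columns.

```lua
-- Hypothetical helper: relative change from one measurement to another, in percent.
local function percentChange(before, after)
    return (after - before) / before * 100
end

-- Call counts from the Rev B "Original" listings (crank enabled / disabled).
local rows = {
    { name = "drawLine - Diagonal", enabled = 70,  disabled = 78  },
    { name = "fillRect",            enabled = 159, disabled = 172 },
    { name = "sprite:setZIndex",    enabled = 686, disabled = 684 },
}

for _, row in ipairs(rows) do
    local change = percentChange(row.enabled, row.disabled)
    print(string.format("%-22s %+6.1f%%", row.name, change))
end
```

A positive value means more calls completed in the same time window with the crank disabled.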


Holy crap, you're right. For some reason on the rev B board the crank sampling is taking way longer than it does on rev A. :anguished: I'm looking into this now, not sure whether we can get a fix in for 2.2, but if not it'll be soon after.


Here's an interesting Lua thing I discovered whilst looking into constant folding and propagation during compilation, which is when a compiler replaces constant expressions with their values if it sees that they won't change at runtime (there are exceptions). I don't pretend to know how this applies to the Lua bytecode compilation of the Playdate SDK, or to running that bytecode in the interpreter. Regardless...

So, let's say in your game you do some maths and stuff across a bunch of variables, some of which are based on constants defined earlier in the program flow. I do this all the time.

I also have values written out as maths, like gfx.drawText("hello", 20+10+2-6, 120+10), arrived at as I adjust the position of things on screen (so I can remove the most recent addition/subtraction to get back to the previous value), and which I never get around to shortening. Good to know they get optimised out during compilation!

Anyway, consider these two functionally identical (almost exactly identical) snippets:

-- math
local a = 30
local b = 9 - (a / 5)
local c
local d

c = b * 4
if (c > 10) then
    c = c - 10
end
d = 60 / a
-- a = 30, c = 2, res = 4
local function fMathOrderA()
    res = c * (60 / a)
    return "math orderA"
end
local function fMathOrderB()
    res = (60 / a) * c
    return "math orderB"
end
local function fMathOrderC()
    res = d * c
    return "math orderC"
end
  • Device (rev A): fMathOrderB is ~10% slower
  • Device (rev A): fMathOrderC is ~2.5% slower

Lesson: the order of execution of your maths seems to matter more than expected

Could it be summarised as "put your constants first (towards the left)"? Not sure.

Or could all this just be caching symptoms?


I tested the above code with

which = 0

function playdate.update()
	
	local start = playdate.getCurrentTimeMilliseconds()
	
	if which == 0 then
		for i=1,100000 do fMathOrderA() end
	elseif which == 1 then
		for i=1,100000 do fMathOrderB() end
	else
		for i=1,100000 do fMathOrderC() end
	end
	
	print("test "..which.." elapsed: "..(playdate.getCurrentTimeMilliseconds()-start).." ms")
	
	which = (which+1)%3
end

and I get C 17% faster than A (makes sense) and B 5% faster (:person_shrugging:). On an H7 unit that's 10% and 4%, respectively.

Here's what I get from echo "res = c * (60 / a)" | luac -l -p -:

main <stdin:0,0> (10 instructions at 0x600000d0c080)
0+ params, 3 slots, 1 upvalue, 0 locals, 3 constants, 0 functions
	1	[1]	VARARGPREP	0
	2	[1]	GETTABUP 	0 0 1	; _ENV "c"
	3	[1]	GETTABUP 	1 0 2	; _ENV "a"
	4	[1]	LOADI    	2 60
	5	[1]	DIV      	1 2 1
	6	[1]	MMBIN    	2 1 11	; __div
	7	[1]	MUL      	0 0 1
	8	[1]	MMBIN    	0 1 8	; __mul
	9	[1]	SETTABUP 	0 0 0	; _ENV "res"
	10	[1]	RETURN   	0 1 1	; 0 out

and here's echo "res = (60 / a) * c" | luac -l -p -:

main <stdin:0,0> (10 instructions at 0x600003194080)
0+ params, 2 slots, 1 upvalue, 0 locals, 3 constants, 0 functions
	1	[1]	VARARGPREP	0
	2	[1]	GETTABUP 	0 0 1	; _ENV "a"
	3	[1]	LOADI    	1 60
	4	[1]	DIV      	0 1 0
	5	[1]	MMBIN    	1 0 11	; __div
	6	[1]	GETTABUP 	1 0 2	; _ENV "c"
	7	[1]	MUL      	0 0 1
	8	[1]	MMBIN    	0 1 8	; __mul
	9	[1]	SETTABUP 	0 0 0	; _ENV "res"
	10	[1]	RETURN   	0 1 1	; 0 out

The code is pretty much the same, but the first one does use one more stack slot. Maybe that's the difference? Not sure what's going on in your test, though..
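
Since the two sets of measurements disagree, one way to reduce noise from test order or caching is to interleave the three functions over several rounds and keep the fastest round for each. The following is a rough desktop sketch, not the harness used by either poster: `bestTimes` is a hypothetical helper, `os.clock` stands in for `playdate.getCurrentTimeMilliseconds()`, and desktop timings will not match the device.

```lua
-- Same values as the snippet earlier in the thread (a = 30, c = 2, d = 2).
local a = 30
local c = 2
local d = 60 / a

-- res is left global, as in the original snippet.
local function fMathOrderA() res = c * (60 / a) end
local function fMathOrderB() res = (60 / a) * c end
local function fMathOrderC() res = d * c end

-- Hypothetical harness: time each function over several interleaved rounds
-- and keep the fastest round, which tends to be the least disturbed one.
local function bestTimes(tests, rounds, iterations)
    local best = {}
    for _ = 1, rounds do
        for name, fn in pairs(tests) do
            local start = os.clock()
            for _ = 1, iterations do fn() end
            local elapsed = os.clock() - start
            if best[name] == nil or elapsed < best[name] then
                best[name] = elapsed
            end
        end
    end
    return best
end

local results = bestTimes(
    { A = fMathOrderA, B = fMathOrderB, C = fMathOrderC }, 5, 100000)
for name, seconds in pairs(results) do
    print(string.format("%s best: %.4f s", name, seconds))
end
```

If the ranking of A, B and C changes between runs of a harness like this, the differences are more likely measurement noise than an effect of operand order.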


Thanks Dave, particularly for the decompilation dumps.

At this point, me neither :crazy_face: