Here we go: bench-nogc.csv.zip (2.4 KB)
This is running the above version of bench.pdx on 1.11.1, 1.12.3, and 1.13.4 using five different hash seeds, averaging five runs for each configuration (in case there's still some property that affects the results and changes between runs). What I was looking for here was whether we see the same change in performance between versions for any fixed hash seed, and that's pretty clearly not the case. Here's a plot of the results from a few functions:
If the squiggles all followed the same shape, we could just use one fixed seed when running the benchmark and be confident that the performance deltas we measured would be the same no matter what seed we used. But no such luck: the only way to get an accurate measure of performance is to average over a bunch of separate runs with different hash seeds.
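For the record, here's a rough sketch of that per-seed comparison in pandas. The column names (sdk_version, seed, test, time_ms) are made up for illustration; the actual layout of bench-nogc.csv may differ:

```python
# Sketch: check whether per-seed deltas between SDK versions agree.
# Column names here are hypothetical; adjust to the real CSV schema.
import pandas as pd

df = pd.read_csv("bench-nogc.csv")

# Mean time per (version, seed, test), averaged over the five runs.
means = df.groupby(["sdk_version", "seed", "test"])["time_ms"].mean()

# For each seed and test, the relative delta from 1.11.1 to 1.13.4.
wide = means.unstack("sdk_version")
delta = (wide["1.13.4"] - wide["1.11.1"]) / wide["1.11.1"]

# If one fixed seed were enough, the per-seed deltas for each test
# would be nearly identical; a large spread says otherwise.
spread = delta.groupby(level="test").std()
print(spread.sort_values(ascending=False))
```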
Okay, one last chart. Here are the deltas between versions, averaging the results for each test on each SDK version. It's a rough approximation of the more accurate test I described in the last paragraph.
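Something like this is roughly how that averaged delta table can be computed, using the same made-up column names as above:

```python
# Sketch: averaged per-test deltas between SDK versions.
# Same hypothetical schema as before: sdk_version, seed, test, time_ms.
import pandas as pd

df = pd.read_csv("bench-nogc.csv")

# Collapse seeds and runs: one mean time per (test, version).
table = df.pivot_table(index="test", columns="sdk_version",
                       values="time_ms", aggfunc="mean")

# Percent change relative to the previous version.
for old, new in [("1.11.1", "1.12.3"), ("1.12.3", "1.13.4")]:
    table[f"{old} -> {new}"] = 100 * (table[new] - table[old]) / table[old]

print(table.round(2))
```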
The one thing that stands out to me is that image drawing clearly took a sizeable hit after 1.12.3, so I need to go back, figure out what happened, and see if it's fixable. I think that might be where I did some refactoring to support pattern stencils, but I'd swear I profiled that change and didn't see a significant difference.
Anyway, we're looking at adding this kind of check to the automated tests to alert us when a number gets out of whack; we just need to make sure it's giving us accurate information.
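As a sketch, that check could look something like the following; the baseline file, column names, and 10% threshold are all placeholders, not the actual test setup:

```python
# Sketch: a regression alert for automated tests. Compares the averaged
# result for each test against a stored baseline and flags anything
# that slows down past a threshold. Schema and threshold are
# hypothetical.
import pandas as pd

THRESHOLD = 0.10  # flag tests that get more than 10% slower

baseline = pd.read_csv("baseline.csv").set_index("test")["time_ms"]
current = (pd.read_csv("bench-nogc.csv")
           .groupby("test")["time_ms"].mean())

regressions = (current - baseline) / baseline
for test, change in regressions.items():
    if change > THRESHOLD:
        print(f"ALERT: {test} is {change:.0%} slower than baseline")
```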