C Developers - Data Structure/Containers library?

I'm very used to modern languages that have built in data structures and containers like vectors, lists, hashmaps, queues, stacks, etc... I'm curious to know what other developers are using while developing in the C API. Are you creating your own library of structures or is there a recommended open source library. Any suggestions or advice is much appreciated!

Since the Playdate has relatively limited memory and performance, you should consider avoiding dynamic allocation entirely and using static arrays instead. If you know your game will never have more than 256 particles, for example, you can declare static Particle s_particles[256]; and include some way of knowing whether a particle is alive or not. Here are some ways you can handle that:

  • a bool alive inside Particle
  • X and Y being != 0 (or some other field that should always be set for "alive" entries)
  • a separate array of the same size (possibly a bitfield instead, to be much more compact) indicating alive/dead. This is likely not worth it (because it is complicated and may not actually be faster), but should be mentioned. An example of this would be:
Object myObjects[256];
// A #define (rather than a const variable) so it can size a static array
#define numObjectsPerAliveEntry (8 * sizeof(unsigned int))
unsigned int myObjectsAlive[256 / numObjectsPerAliveEntry];
for (int i = 0; i < (int)(sizeof(myObjects) / sizeof(myObjects[0])); ++i)
{
  // Shift the word right so object i's bit lands in bit 0, then test it
  if ((myObjectsAlive[i / numObjectsPerAliveEntry] >> (i % numObjectsPerAliveEntry)) & 0x1)
  {
    // Object at myObjects[i] is alive
  }
}
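The simplest option (a bool alive inside the struct) pairs naturally with a spawn/kill pair of functions that never touch the allocator. A minimal sketch, with hypothetical Particle fields and names:

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_PARTICLES 256

/* Hypothetical particle type, for illustration only. */
typedef struct {
    float x, y;
    float vx, vy;
    bool alive;
} Particle;

static Particle s_particles[MAX_PARTICLES];

/* Find a dead slot and activate it; returns NULL if the pool is full. */
static Particle *particle_spawn(float x, float y)
{
    for (size_t i = 0; i < MAX_PARTICLES; ++i) {
        if (!s_particles[i].alive) {
            /* Compound literal zeroes the remaining fields (vx, vy). */
            s_particles[i] = (Particle){ .x = x, .y = y, .alive = true };
            return &s_particles[i];
        }
    }
    return NULL;
}

/* "Freeing" is just clearing the flag -- no allocator involved. */
static void particle_kill(Particle *p)
{
    p->alive = false;
}
```

The linear scan in particle_spawn is O(n), which is fine at 256 entries; a free-list would make it O(1) if the pool gets large.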

If you still want to dynamically allocate, I recommend the STB data structure header for this, stb_ds.h. It contains a dynamic array, a string-keyed hash table, and an arbitrary-key hash table. These three data structures should be sufficient for most cases. Anything else, I would recommend coding from scratch.
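Usage looks roughly like this (a sketch based on the documentation inside the stb_ds.h header itself; check your copy of the header, as the macro set has evolved across versions):

```c
/* Define this in exactly one .c file before including the header. */
#define STB_DS_IMPLEMENTATION
#include "stb_ds.h"

void example(void)
{
    /* Dynamic array: starts as NULL, grows on arrput. */
    int *scores = NULL;
    arrput(scores, 10);
    arrput(scores, 25);
    /* arrlen(scores) is now 2 */

    /* Hash map with arbitrary (here: int) keys. The struct must have
       a member named `key`; the rest is the value. */
    struct { int key; float value; } *speeds = NULL;
    hmput(speeds, 7, 1.5f);
    float s = hmget(speeds, 7);  /* looks up the value for key 7 */
    (void)s;

    arrfree(scores);
    hmfree(speeds);
}
```

For string keys there is a parallel shput/shget family; the pointer itself is the container, so there is no separate handle type to manage.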

3 Likes

I am using Java and Typescript at work so, like you, I am more used to having built-in data structures. Since I like to write everything myself (a bad habit), I had fun writing custom implementations of ArrayList and HashMap for C.

There is a lot of dynamic allocation going on and, for now, it works pretty well.

My main problem is that it took a lot of time to rewrite/fix things before actually writing my game :sweat_smile:.

you should consider never dynamically allocating and use static arrays instead.

Agreed. Allocating/deallocating will not only hurt performance but also cause memory fragmentation. It should be avoided at all costs.

Even in Lua code, creating/destroying objects is not a good idea. Data pools are the way to go.

1 Like

I haven't used it on the Playdate yet, but I made a rough ECS for C a while back. It'll take some work to adapt it to the Playdate, but it's where I would start. Here it is: GitHub - DavidMedin/TextMMO: The server for a text based rpg game. Maybe inspired by Sokpop..

What's an ECS if I may ask?

Stands for entity component system

It's a common technique used in game engines, with the goal of organizing data in a way that is friendly to CPU caches: Entity component system - Wikipedia

A popular C/C++ ECS right now is flecs, if you're curious to see some code: GitHub - SanderMertens/flecs: A fast entity component system (ECS) for C & C++

1 Like

Stands for entity component system

I knew it couldn't be what I thought it was... :joy:

It's a common technique used in game engines with the goal of organizing data in a way that is friendly to CPU caches

Yeah, we used to do stuff like this for the Vector Units on PS2. Didn't have a name for it back then though...

1 Like

Speaking of being friendly to CPU caches, does anyone have details on the Playdate CPU cache? The specs say there is an 8K L1 cache, but at the bottom of the main page it says 32K. Anyone know which is right?

The ARM Cortex-M7 technical reference manual says "The data cache is four-way set-associative, the instruction cache is two-way set-associative. Both caches use a line-length of 32-bytes."

Is the Playdate CPU configured with both an instruction cache and a data cache?

It should be possible to devise some experiments to figure this stuff out (for example), maybe I'll play around and see.

2 Likes

Just remembered, iFixit did a video opening up a Playdate (The Playdate Teardown-A Look Inside This Retro Gaming Handheld - YouTube), which has this image in the associated write-up

After googling the chip number, this seems to be it: https://www.st.com/resource/en/datasheet/stm32f745ie.pdf . The doc specifies the typical 50/50 split: a 4 KB instruction cache and a 4 KB data cache. @yggy

@dave Feel free to correct me if I misidentified anything!

edit: added more specific link to data sheet

4 Likes

Oh, nice find!

4K data cache, that is SMALL! I still want to run some tests to confirm, and to assess timing differences between cached and uncached data access. Stay tuned. :slight_smile:

Definitely, looking forward to your results, would love to have a table similar to the famous Jeff Dean/Peter Norvig one

Latency Comparison Numbers (~2012)
----------------------------------
L1 cache reference                           0.5 ns
Branch mispredict                            5   ns
L2 cache reference                           7   ns                      14x L1 cache
Mutex lock/unlock                           25   ns
Main memory reference                      100   ns                      20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy             3,000   ns        3 us
Send 1K bytes over 1 Gbps network       10,000   ns       10 us
Read 4K randomly from SSD*             150,000   ns      150 us          ~1GB/sec SSD
Read 1 MB sequentially from memory     250,000   ns      250 us
Round trip within same datacenter      500,000   ns      500 us
Read 1 MB sequentially from SSD*     1,000,000   ns    1,000 us    1 ms  ~1GB/sec SSD, 4X memory
Disk seek                           10,000,000   ns   10,000 us   10 ms  20x datacenter roundtrip
Read 1 MB sequentially from disk    20,000,000   ns   20,000 us   20 ms  80x memory, 20X SSD
Send packet CA->Netherlands->CA    150,000,000   ns  150,000 us  150 ms

Notes
-----
1 ns = 10^-9 seconds
1 us = 10^-6 seconds = 1,000 ns
1 ms = 10^-3 seconds = 1,000 us = 1,000,000 ns

Credit
------
By Jeff Dean:               http://research.google.com/people/jeff/
Originally by Peter Norvig: http://norvig.com/21-days.html#answers
1 Like

Starting to get some results ...

All timings performed with resetElapsedTime/getElapsedTime, which should be microsecond accurate.

Test 1 - cache lines are 32 bytes
Here's a demonstration of the cache line size. The test is to iterate through a large amount of memory, treated as 4-byte ints, striding through memory with a variety of step sizes. Step sizes tested are 4 bytes (read every int), 8 bytes (read every second int), on up to 256 bytes. Results show that whether you read every int (4 byte stride) or every 8th int (32 byte stride), it really doesn't make a difference, because the bulk of the time is spent loading the 32-byte cache line. Beyond 32 bytes the time starts to go down, because you are touching fewer cache lines. In this chart, the total time is arbitrary, based on how many loops through how much memory I run. I should devise another test to show the relative cost of accessing cached vs. uncached memory.

Test 2 - on startup, there is ~16M of available, unfragmented RAM
I tried to allocate (playdate->system->realloc) the largest block of memory I could, starting at 17M and dropping by 4K until I got a successful allocation. I then kept dropping by 4K and trying to allocate another block of memory until I got to zero. The first successful allocation was 16,617,472 bytes, which is exactly 156K short of 16M, and I was unable to allocate a second block of more than 4K, indicating memory is unfragmented.

Test 3 - data cache is 4K
I allocated buffers ranging in size from 32 bytes up to 64K and then iterated repeatedly through each buffer, stepping by the cache line size and reading an integer, repeating this 256 x 1024 times (the precise number isn't important, I just wanted to do a lot of work to get good timings). As seen in this chart, the loop runs fast, and takes the same amount of time, until we exceed the 4K cache size, after which performance rapidly drops as new data must be loaded into the cache. Cached data can be read about 30 times faster than uncached. In my test a cached read takes about 45 ns, while an uncached read takes about 1400 ns, but there's some other overhead in the loop, so that's not entirely accurate.
image
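The repeated-pass loop from Test 3 can be sketched like this (again a hypothetical helper, not the author's exact code):

```c
#include <stddef.h>

/* Make `passes` sequential passes over a buffer, reading one byte per
   cache line. While the buffer fits in the data cache, every pass after
   the first hits cache; once the buffer exceeds the cache size, lines
   get evicted between passes and must be refetched from RAM. */
static unsigned walk_buffer(const unsigned char *buf, size_t bytes,
                            size_t line_bytes, int passes)
{
    unsigned sum = 0;
    for (int p = 0; p < passes; ++p)
        for (size_t i = 0; i < bytes; i += line_bytes)
            sum += buf[i];
    return sum;
}
```

Timing this for buffer sizes from 32 bytes to 64K (with a fixed total number of reads) is what produces the knee at the cache size in the chart.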

Test 4 - instruction cache is ... perfect?
I tried to create a test that would exercise the instruction cache in a similar fashion, but the results so far are inconclusive. My method was to generate 1K of no-op assembly instructions (mov r0,r0), and repeat this in 12 separate functions. The idea was that each time I call one of these functions I would fill up 1K of instruction cache, and I expected that if I called up to four such funtions they would run quickly, and beyond four functions performance would degrade as the instruction cache keeps reloading. My test makes the same large number of total function calls in each test, differing only in how many of the different functions are called. However, here are the results - no matter how much code I execute (from 1K to 12K) the time taken is virtually identical. It's as if instruction caching has no effect on performance, which I don't understand. I have confirmed that the inline assembly is being emitted, and I see timings get faster if I reduce the size of the inline assembly block, so the code is being generated, not optimized away, and executed. I'm confused - anyone have any theories on this one?
EDIT: I suspect that what's happening is that linear code is optimally prefetched into the cache no matter how large the code being run is. If the code has lots of branches that confuse the branch prediction logic, then stalls and slowdowns may occur. More testing required!

image

Code is here if you're curious. Hopefully I'll have some time to refine and improve, get more detailed timings, add documentation to the repository, and figure out what's going on with the instruction cache.

6 Likes

just a theory

It may be worth trying to vary the size of the individual functions, rather than the number of functions called, to 4K and beyond. At 1K each, it may be that the icache refill policy can effectively prefetch/stream the 1K blocks in time for it not to affect performance, so it wouldn't matter how many of them there are.

1 Like

+1 for the STB single-header libraries.

@superfunc yes, good point, I was thinking about that afterwards. It may well be that strictly linear code can be prefetched 100% accurately with no stalls. Rather than functions of different sizes, I think maybe I need to confuse the branch prediction logic, so that prefetch is inaccurate. I'll see if I can come up with another approach.

The good news here is that if your code is linear and predictable then the instruction cache will keep it running at top speed, no need to worry about how big it is. :smile:

1 Like

So I've gone further with the instruction cache testing, but I still don't fully understand what I'm seeing.

I use inline assembly to create functions of known size. I thought it best to generate lots of small functions, because large functions without branches will allow the branch prediction logic to work flawlessly and preload the instruction cache very effectively.

My current test scenario is based on a table of 256 function pointers, each pointing to a function that is 64 bytes long, or two cache lines. In total, the 256 functions occupy 256 x 64 = 16K of memory, which is four times the 4K instruction cache size shown in the data sheet from the ifixit teardown.

My testing strategy is to repeatedly run functions that add up to a known amount of memory, and vary the amount of memory covered to assess timing when everything fits into the cache and when it doesn't. So for example, to test 2K of instruction memory I need to run 2048 / 64 = 32 of the functions, and so my code would be:

int n = 32;
for (int calls = 0; calls < 100000; calls++)
{
    functable[calls%n]();
}

I do 100,000 calls to ensure it takes long enough to be able to get consistent timings. Obviously the loop logic is also being run, but that should only consume a couple of cache lines so it shouldn't throw the results off too much.

I repeat the above test for n running from 1 to 256, thus testing 64 bytes up to 16K of instructions, and time how long it takes to make 100,000 function calls - so the same total amount of code is being run in each case. Here are the results:
image

I'm puzzled by a few things:

  1. Why is there an early spike in time taken, up to and a little beyond the 1K mark?
  2. Why does performance start to drop off at the 8K mark, instead of the 4K mark like I expected from the 4K cache size?
  3. Why isn't there a greater drop in performance? Performance is more than half as good when apparently missing the cache. From messing around with data caches I expected the cache load time to be a more significant hit than that.

All my functions are laid out linearly in memory, so I wondered if the CPU was prefetching subsequent functions, so I tried calling the functions in random order. I used insertion sort to randomize the first n entries in the function table before starting the timing loop. The results were very similar, though surprisingly the early spike in time taken - though still present - was lower than in the linear order case.
image
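For randomizing the first n entries of the function table, a Fisher-Yates shuffle is the standard approach (shown here as an alternative to the insertion-sort trick described above; testfunc_t and the function names are hypothetical):

```c
#include <stdlib.h>

typedef void (*testfunc_t)(void);

/* Fisher-Yates shuffle of the first n entries of a function-pointer
   table. rand() is fine here: we only need an arbitrary order, and
   the shuffle itself runs outside the timed region. */
static void shuffle_functable(testfunc_t *table, int n)
{
    for (int i = n - 1; i > 0; --i) {
        int j = rand() % (i + 1);       /* pick from the unshuffled prefix */
        testfunc_t tmp = table[i];
        table[i] = table[j];
        table[j] = tmp;
    }
}
```

Since the shuffle only permutes pointers, the functions themselves stay at their original (linear) addresses, so this changes the call order without changing the code layout in memory.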

In summary, I think my procedure is fairly sound, but I'm puzzled by the results and would appreciate any insights anyone has to share. Code is up to date on GitHub if anyone wants to try it out.

Hmm, I just noticed something interesting ...

The iFixit teardown, from which we got the image with the CPU part number on it, was posted on August 26, 2021. Then 10 weeks or so later, on November 11, 2021, I received an email "Playdate Owner's Update #1" which mentions ...

Here’s one way we’re working to beat the chip shortage: we’ve just finished a revision of Playdate’s main board (for units made later next year) so that we can use a similar, but more widely available, CPU.

So it sounds like that data sheet we've been scrutinizing is in fact for the original CPU, not the "more widely available" replacement. Guess it's time for another teardown!

If Panic people see this, could we please have the CPU model #s for the range being used?

2 Likes

Yes please, hardware details from Panic would be much appreciated! :smile:

Confirmation of the specs would also be nice - is the CPU 168 MHz or 216? Is the "8 KB L1 Cache" split into 4K data and 4K instruction?

Thanks!!