Illegal instruction crash in system memory?

jder · March 22, 2025, 11:48pm

I think I am running into an illegal instruction crash in system memory in my native code (Rust) project. Here is the crash log from a Rev B Playdate running 2.6.2:

--- crash at 2025/03/22 20:45:36---
build:57176cb0-2.6.2-release.177516-buildbot
   r0:00008001    r1:20008688     r2:38800088    r3: 00000000
  r12:00000009    lr:24022627     pc:24022626   psr: 410e0000
 cfsr:00010000  hfsr:40000000  mmfar:00000000  bfar: 00000000
rcccsr:00000000
heap allocated: 2999808
Lua totalbytes=0 GCdebt=0 GCestimate=0 stacksize=0

You can see the PC and LR are both in the 0x2XXXXXXX range which this thread describes as "system memory" -- I'm assuming that means code, stack, and heap? I believe the cfsr means "illegal instruction" based on my reading of this page.
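For reference, the fault registers in the log can be decoded by hand. The sketch below assumes the standard ARMv7-M layout, where the UsageFault bits occupy the upper half of CFSR and HFSR bit 30 is FORCED (a configurable fault escalated to a HardFault). The function names are made up for illustration and are not part of the Playdate SDK.

```rust
/// UsageFault status bits as they appear in the combined CFSR register
/// (on ARMv7-M the UFSR occupies bits 16..=31 of CFSR).
const UFSR_FLAGS: [(u32, &str); 6] = [
    (1 << 16, "UNDEFINSTR"), // undefined instruction
    (1 << 17, "INVSTATE"),   // invalid execution state (e.g. Thumb bit clear)
    (1 << 18, "INVPC"),      // invalid PC load on exception return
    (1 << 19, "NOCP"),       // coprocessor not available
    (1 << 24, "UNALIGNED"),  // unaligned access trap
    (1 << 25, "DIVBYZERO"),  // divide-by-zero trap
];

/// HFSR bit 30: the HardFault was escalated from another fault.
const HFSR_FORCED: u32 = 1 << 30;

/// Hypothetical helper: names of the UsageFault bits set in `cfsr`.
pub fn usage_faults(cfsr: u32) -> Vec<&'static str> {
    let mut names = Vec::new();
    for &(mask, name) in UFSR_FLAGS.iter() {
        if (cfsr & mask) != 0 {
            names.push(name);
        }
    }
    names
}

/// Hypothetical helper: true if the HardFault was a forced escalation.
pub fn hard_fault_forced(hfsr: u32) -> bool {
    (hfsr & HFSR_FORCED) != 0
}
```

With the values from the log, `usage_faults(0x0001_0000)` yields only `UNDEFINSTR` and `hard_fault_forced(0x4000_0000)` is true, which matches the "illegal instruction" reading.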

When I try to use the symbols.db in my (Mac 2.6.2) SDK, I don't find any functions for either of those addresses. I do see some functions in the 0x2XXXXXXX range, but conspicuously nothing in a fairly large gap between 0x2401C774 and 0x24050000, which is where both of these fall.

sqlite> SELECT * FROM functions WHERE 0x24022626 BETWEEN low and low + size;
sqlite> SELECT * FROM functions WHERE 0x24022627 BETWEEN low and low + size;
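The same lookup can be sketched outside of sqlite. This is only an illustration of the range test, assuming each row carries a start address (`low`) and a byte length (`size`) as in the queries above; the `name` field and the helper names are hypothetical. It also clears bit 0 of the address first, since on Cortex-M a return address in LR has the Thumb bit set (which is why LR here is PC + 1).

```rust
/// One row of a function table. Only `low` and `size` appear in the
/// queries above; `name` is a hypothetical column for illustration.
pub struct Function {
    pub name: &'static str,
    pub low: u32,
    pub size: u32,
}

/// Clear the Thumb state bit so LR-style addresses match code addresses.
pub fn strip_thumb_bit(addr: u32) -> u32 {
    addr & !1
}

/// Find the function whose half-open range [low, low + size) holds `addr`.
/// Note that SQL's BETWEEN is inclusive at both ends, so it would also
/// match the first byte past the end of a function.
pub fn find_function(table: &[Function], addr: u32) -> Option<&Function> {
    let addr = strip_thumb_bit(addr);
    table
        .iter()
        .find(|f| addr >= f.low && addr - f.low < f.size)
}
```

Any address that falls in a gap between rows, as both PC and LR do here, comes back as `None`.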

Based on what is happening at the time this occurs, I suspect this is related to streaming audio off disk somehow. Whatever is happening seems to work OK in the simulator (but not sure that means much given this is all native code).

It also happens on my Playdate Rev B running 2.7.0-beta7 and I get the same result trying to look up the resulting symbols in that SDK's symbols.db.

Anything else I could do to try and diagnose? Happy to share a crashing build if that would help.

Thanks!
Jesse

scratchminer · March 23, 2025, 3:12am

My guess as to what's happening is that Dave or another Panic engineer set a breakpoint for testing on the debug devices.
Those breakpoints are in functions which crash the device, but usually there's a guard to prevent the breakpoint from activating if the device isn't one of Panic's debugging devices.

dave · March 25, 2025, 3:19am

Above 0x24000000 is the userspace code that has all of the game API. We give you names for those functions because it can be useful for figuring out what's going on in your game. Below that is kernel code, low-level stuff that you really shouldn't need to worry about--hopefully. In this case that address is in the Memfault handler, which stores the crash info so that it can send it up to their server. As SM said, it's hitting an instruction that stops execution when a debugger is attached, but I don't know why it would be showing up in the crash log here.

This is one of those things I could lose days on sorting out then not have anything useful to show for it. I'd rather try and figure out where the actual crash is happening. Is the build at Beampunk by breakfeast, mrfractal showing this crash? Is there a reliable way to trigger it?

jder · March 25, 2025, 1:30pm

Thanks for the detailed response! I just made a new build here which reliably crashes for me when choosing "Play" from the main menu (i.e., press A twice). I included the elf -- let me know if there's anything else I can supply that would be helpful in tracking it down! (I'm also in the Playdate Squad discord as "Jesse Rusak" if that's easier.)

I should also say that for a long time we've had strange audio-related behavior which we think has something to do with how we're handling memory around synths or other audio objects. This was one case I tracked down but I am pretty sure there are others. The strangest is one where all our synths suddenly become much louder than expected -- sometimes when we first start playing them, sometimes later, sometimes when locking and unlocking the Playdate (which is especially strange since we have no code which runs for the lock/unlock events). This behavior appears or goes away when making changes to unrelated code, so I suspect some kind of memory or race issue which is sensitive to code layout or fine-grained timing. I am kinda hoping this might be the same underlying issue.

dave · March 29, 2025, 8:15pm

The crash is an assertion failure trying to lock the malloc mutex, in a weird place I've never seen. It looked like the data in the mutex struct was getting corrupted so I set a watchpoint on an address in there that was changing and it triggered at 0x900bc092. To look that up in the pdex.elf we have to drop the leading 9--the pdex file is compiled to base address 0x0 then relocated to either 0x90000000 or 0x60000000 depending on which version of the hardware you've got.

(gdb) info line *0xbc092
Line 1078 of "/Users/jder/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.200/src/private/de.rs"
   starts at address 0xbc092 <<serde::__private::de::FlatMapDeserializer<E> as serde::de::Deserializer>::deserialize_struct+6426>
   and ends at 0xbc096 <<serde::__private::de::FlatMapDeserializer<E> as serde::de::Deserializer>::deserialize_struct+6430>.
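The address translation described above (drop the load base to get the address inside pdex.elf) can be written out as a small helper. This is a sketch under assumptions: the two base addresses come from the post, while the `MAX_IMAGE` bound and the function name are invented here to keep the two ranges from overlapping.

```rust
/// Load bases mentioned above; which one applies depends on the hardware.
pub const LOAD_BASES: [u32; 2] = [0x9000_0000, 0x6000_0000];

/// Assumed upper bound on the relocated image size (not from the thread);
/// it only serves to reject addresses that belong to neither region.
const MAX_IMAGE: u32 = 0x1000_0000;

/// Map a runtime address in relocated game code back to the address
/// used inside pdex.elf, which is linked at base 0x0.
pub fn elf_address(runtime_addr: u32) -> Option<u32> {
    for &base in LOAD_BASES.iter() {
        if runtime_addr >= base && runtime_addr - base < MAX_IMAGE {
            return Some(runtime_addr - base);
        }
    }
    None // not inside either relocated region (e.g. a 0x2XXXXXXX address)
}
```

For the watchpoint hit above, `elf_address(0x900b_c092)` gives `Some(0xbc092)`, the value passed to `info line`.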

Yeah, a deserializer definitely tracks with memory corruption. According to the FreeRTOS game task handle, its stack starts at 0x20007420 and ends at 0x20009c18.

(gdb) p *gameTaskHandle
[...]
  pxStack = 0x20007420 <ucHeap+29728>,
  pxEndOfStack = 0x20009c18 <ucHeap+39960>,
[...]

The mutex is just below that:

(gdb) p s_malloc_mutex 
$26 = (SemaphoreHandle_t) 0x200073c8 <ucHeap+29640>

and where's our current stack pointer?

(gdb) p $sp
$27 = (void *) 0x20007288 <ucHeap+29320>

Classic stack overflow (which was my guess as soon as I saw Rust was involved). And before you ask, no there's really no way to make the stack bigger. The only way around this is to use heap instead of stack.
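The arithmetic behind that conclusion, spelled out as a sketch using the addresses printed by gdb above. The stack grows downward, so a stack pointer below `pxStack` is out of bounds. The struct and function names are illustrative only, and the stack pointer at the moment gdb stopped is not necessarily the deepest point reached.

```rust
/// Bounds of a descending task stack, taken from the gdb output above.
pub struct TaskStack {
    pub lowest: u32, // pxStack: lowest valid stack address
    pub end: u32,    // pxEndOfStack: top of the stack region
}

/// The game task's stack as printed by `p *gameTaskHandle`.
pub const GAME_TASK: TaskStack = TaskStack {
    lowest: 0x2000_7420,
    end: 0x2000_9c18,
};

/// Total stack size in bytes.
pub fn stack_size(stack: &TaskStack) -> u32 {
    stack.end - stack.lowest
}

/// How many bytes `sp` has run past the bottom of the stack, if any.
pub fn overflow_bytes(stack: &TaskStack, sp: u32) -> Option<u32> {
    if sp < stack.lowest {
        Some(stack.lowest - sp)
    } else {
        None
    }
}

/// True if `addr` lies in the region the overflow has spilled into.
pub fn in_overflow_region(stack: &TaskStack, sp: u32, addr: u32) -> bool {
    addr >= sp && addr < stack.lowest
}
```

With `sp = 0x20007288` the overflow is 0x198 (408) bytes past the bottom of a roughly 10 KB stack, and the mutex at 0x200073c8 sits inside the overrun region, which is consistent with its data being corrupted.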

jder · March 30, 2025, 6:30pm

Thanks, Dave! This let me find the offending code and reduce the stack usage there. I also noticed once when this crashed I got a "more info" option which said gameTask had overflowed its stack -- not sure if that's new or just only happens sometimes. In any case, appreciate you helping me track it down!
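For anyone hitting the same thing, one general way to follow the "use heap instead of stack" advice in Rust is to move large temporaries out of the stack frame. The sketch below is not the actual fix used in this project; the buffer size and function names are invented, and it uses `std` for brevity where a `no_std` Playdate build would use the `alloc` crate's `Vec` instead.

```rust
/// Illustrative scratch size: larger than the whole ~10 KB game task stack.
const SCRATCH_LEN: usize = 16 * 1024;

/// Shared work: copy the input into the scratch buffer and sum the bytes.
fn fill_and_sum(scratch: &mut [u8], input: &[u8]) -> u32 {
    let n = input.len().min(scratch.len());
    scratch[..n].copy_from_slice(&input[..n]);
    scratch[..n].iter().map(|&b| u32::from(b)).sum()
}

/// Stack version: the 16 KB array lives in this function's stack frame,
/// which would not fit in a 10 KB task stack.
pub fn sum_with_stack_scratch(input: &[u8]) -> u32 {
    let mut scratch = [0u8; SCRATCH_LEN];
    fill_and_sum(&mut scratch, input)
}

/// Heap version: only the Vec's pointer, length, and capacity sit on the
/// stack; the 16 KB of storage is allocated on the heap.
pub fn sum_with_heap_scratch(input: &[u8]) -> u32 {
    // `vec!` allocates the buffer directly on the heap. By contrast,
    // `Box::new([0u8; SCRATCH_LEN])` may build the array on the stack
    // first and then move it, depending on optimization.
    let mut scratch = vec![0u8; SCRATCH_LEN];
    fill_and_sum(&mut scratch, input)
}
```

Both functions return the same result; only where the scratch buffer lives differs. Deeply nested generic code such as derived deserializers can also build large frames on its own, so boxing large fields or enum variants is another option worth trying.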

dave · March 31, 2025, 3:50pm

Great! Glad to hear you were able to find a workaround. The stack overflow error is only triggered when the stack pointer is out of bounds when the scheduler switches tasks, so it's hit or miss. I wonder why the CPU doesn't have a "throw an exception if $sp drops below X" feature; it seems like that would be really simple.
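A toy model of why that check is hit or miss: if the stack pointer is only inspected at context switches, an overflow that unwinds before the next switch goes unreported. This is a simplified simulation with invented types and sample data, not how FreeRTOS implements the check.

```rust
/// One observation of the task's stack pointer. `context_switch` marks
/// the moments where the scheduler would inspect it.
pub struct Sample {
    pub sp: u32,
    pub context_switch: bool,
}

/// What a switch-time check reports: an overflow is seen only if the
/// stack pointer is below the limit at the moment of a context switch.
pub fn detected_at_switch(stack_lowest: u32, trace: &[Sample]) -> bool {
    trace.iter().any(|s| s.context_switch && s.sp < stack_lowest)
}

/// Ground truth: the stack pointer went below the limit at any time.
pub fn ever_overflowed(stack_lowest: u32, trace: &[Sample]) -> bool {
    trace.iter().any(|s| s.sp < stack_lowest)
}
```

A trace that dips below the limit between two switches makes `ever_overflowed` true while `detected_at_switch` stays false, which would explain why the "gameTask had overflowed its stack" message appeared only some of the time.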