Interpreting crash logs

DanB91 · October 30, 2022, 3:17pm

Hi guys, I am writing my game in Zig on macOS, which I know is unsupported, but I’ve been pretty okay at getting it running on hardware until now. Unfortunately it is crashing on startup and I am not sure why. It works fine on the simulator.

The crash log shows this:

--- crash at 2022/10/30 14:24:56---
build:59185ded27c7-1.12.3-release.140884-buildbot
r0:20026098 r1:20026098 r2:00000083 r3: 000036a0
r12:006d3000 lr:08022d75 pc:0803947c psr: 01000000
cfsr:00000092 hfsr:00000000 mmfar:1ffffff8 bfar: 1ffffff8
rcccsr:00000000
heap allocated: 389696
Lua totalbytes=0 GCdebt=0 GCestimate=0 stacksize=0

Looking at the ARM docs, it seems this is a memory access issue, specifically at address 0x1FFF_FFF8. I’ve been changing some code around and moving from debug to release mode, and it always fails at the same address (0x1FFF_FFF8).
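For pulling these values out of a crash log on the host side, a minimal Python sketch follows. The register names are the ones printed in the log above; the helper name `parse_crash_registers` and the regular expression are assumptions made for illustration and are not part of the Playdate SDK.

```python
import re

# Crash log text copied from the post above.
CRASH_LOG = """\
--- crash at 2022/10/30 14:24:56---
build:59185ded27c7-1.12.3-release.140884-buildbot
r0:20026098 r1:20026098 r2:00000083 r3: 000036a0
r12:006d3000 lr:08022d75 pc:0803947c psr: 01000000
cfsr:00000092 hfsr:00000000 mmfar:1ffffff8 bfar: 1ffffff8
rcccsr:00000000
heap allocated: 389696
Lua totalbytes=0 GCdebt=0 GCestimate=0 stacksize=0
"""

# Only the register fields are matched; "build:" and "heap allocated:" are skipped.
REGISTER_RE = re.compile(
    r"\b(r\d+|lr|pc|psr|cfsr|hfsr|mmfar|bfar|rcccsr):\s*([0-9a-fA-F]{8})\b"
)


def parse_crash_registers(text):
    """Return {register name: integer value} for every register field in the log."""
    return {name: int(value, 16) for name, value in REGISTER_RE.findall(text)}


registers = parse_crash_registers(CRASH_LOG)
for name, value in registers.items():
    print(f"{name:>6} = 0x{value:08X}")
```

The resulting dictionary makes it easy to compare `mmfar` or `cfsr` across several crash logs.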

Does anyone know if this typically indicates a stack overflow, or does it indicate something else?

Thank you!

scratchminer · October 30, 2022, 7:32pm

Seems like you've got a stack overflow, since a memory management fault (meaning the Playdate OS doesn't allow you to access that address) occurred after something was pushed onto the stack.

I obtained this from the number labeled "cfsr", for Configurable Fault Status Register, and this page.

My guess (since the stack is supposed to extend only down to 0x20000000) is that there's a runaway chain of stack pushes that never ends.
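As a rough illustration of that reading, here is a small Python sketch that decodes the low byte of the CFSR value from the crash log. The bit names follow the ARMv7-M MemManage Fault Status Register layout; the `decode_mmfsr` helper is hypothetical, and the 0x20000000 stack floor is taken from the guess above rather than from any official memory map.

```python
# MemManage Fault Status Register: the low byte of CFSR on ARMv7-M.
MMFSR_BITS = {
    0: "IACCVIOL",   # instruction access violation
    1: "DACCVIOL",   # data access violation
    3: "MUNSTKERR",  # fault while unstacking on exception return
    4: "MSTKERR",    # fault while stacking on exception entry
    5: "MLSPERR",    # fault during lazy floating-point state preservation
    7: "MMARVALID",  # MMFAR holds the faulting address
}

STACK_FLOOR = 0x20000000  # assumption: lowest valid stack address, per the post above


def decode_mmfsr(cfsr):
    """Return the names of the MemManage fault bits set in a CFSR value."""
    mmfsr = cfsr & 0xFF
    return [name for bit, name in sorted(MMFSR_BITS.items()) if mmfsr & (1 << bit)]


cfsr = 0x00000092   # from the crash log
mmfar = 0x1FFFFFF8  # from the crash log

flags = decode_mmfsr(cfsr)
print(flags)
if "MMARVALID" in flags:
    print(f"fault address is {STACK_FLOOR - mmfar} bytes below 0x{STACK_FLOOR:08X}")
```

For this log the set bits are a data access violation, a stacking error, and a valid MMFAR, with the fault address sitting 8 bytes below the assumed stack floor.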

DanB91 · November 9, 2022, 5:08am

Thank you. It's good to know the stack goes to 0x20000000.

But I'm still trying to tackle this problem. I was able to print out the stack pointer address and painstakingly looked through the assembly, and I could not find any indication of a stack overflow, in the sense of the stack pointer being decremented by a crazy amount. So it might be some weird memory access somewhere.

But the code I was looking at generated a lot of assembly, so I tried rearranging the code to shrink the amount of assembly. I logged each line to the console along with the SP register and the value at [sp + #240] used in the attached assembly for the __aeabi_memcpy routine. The console output looks like this:

Hit line 255 SP: 0x200099B8 [SP+240]: 0x6005F550
Hit line 260 SP: 0x200099B8 [SP+240]: 0x6005F550
Hit line 265 SP: 0x200099B8 [SP+240]: 0x6005F550
Hit line 270 SP: 0x200099B8 [SP+240]: 0x6005F550

I have a logToConsole statement on line 275, which is not printing out, so I know it is crashing between lines 270 and 275. HOWEVER, a key thing here is that it sometimes crashes earlier or even later!! So something strange is afoot here. But on most runs it gets up to line 270.

Here is the disassembly of lines 270 to 275. The instruction at address 0x60036364 is the call to playdate->system->logToConsole(), so I’m pretty sure this instruction is not called. It seems to be an undefined instruction error. This is the crash log:

build:59185ded27c7-1.12.3-release.140884-buildbot
   r0:00000000    r1:00000000     r2:00030be5    r3: 00000000
  r12:006d3000    lr:080307eb     pc:0803053c   psr: 61000000
 cfsr:00010000  hfsr:40000000  mmfar:00000000  bfar: 00000000
rcccsr:00000000
heap allocated: 4068864
Lua totalbytes=0 GCdebt=0 GCestimate=0 stacksize=0

I cannot see how this is happening, but maybe there is an instruction in this listing that isn’t supported by the Playdate? Am I reading the crash log wrong? Or does the fact that it sometimes crashes earlier or later make this all a red herring?
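One way to double-check the "undefined instruction" reading is to decode the fault status values from this second log. The Python sketch below assumes the ARMv7-M layout, where the UsageFault flags occupy CFSR bits 16 and up, and HFSR has its own few flags; the helper name `set_flags` is made up for this example.

```python
# UsageFault Status Register: the upper half of CFSR on ARMv7-M.
UFSR_BITS = {
    16: "UNDEFINSTR",  # undefined instruction
    17: "INVSTATE",    # invalid execution state (for example EPSR.T clear)
    18: "INVPC",       # invalid PC load on exception return
    19: "NOCP",        # coprocessor access not permitted
    24: "UNALIGNED",   # unaligned access trap
    25: "DIVBYZERO",   # divide-by-zero trap
}

# HardFault Status Register.
HFSR_BITS = {
    1: "VECTTBL",    # fault on vector table read
    30: "FORCED",    # a configurable fault was escalated to HardFault
    31: "DEBUGEVT",  # debug event
}


def set_flags(value, table):
    """Return the names of the bits from `table` that are set in `value`."""
    return [name for bit, name in sorted(table.items()) if value & (1 << bit)]


cfsr = 0x00010000  # from the second crash log
hfsr = 0x40000000  # from the second crash log

usage_faults = set_flags(cfsr, UFSR_BITS)
hard_faults = set_flags(hfsr, HFSR_BITS)
print("UFSR:", usage_faults)
print("HFSR:", hard_faults)
```

With these values the only flags set are UNDEFINSTR and FORCED, so the log itself does report an undefined instruction that was escalated to a hard fault.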

Thank you!

timhei · November 9, 2022, 2:54pm

You might have better luck tagging @dave for this one.

dave · November 9, 2022, 4:25pm

One important thing to note: the pd->system->logToConsole() call isn't synchronous; it passes the text to the USB driver, which then sends it to the hardware in an interrupt callback. If the device crashes between the logToConsole() call and the interrupt, the message won't get sent.

Something suspicious in that crash log is that the Thumb bit isn't set in the PSR register. The Arm Developer documentation says:

The Cortex-M7 processor only supports execution of instructions in Thumb state. The following can clear the T bit to 0:

Instructions BLX, BX, LDR pc, [], and POP{PC}.

Restoration from the stacked xPSR value on an exception return.

Bit[0] of the vector value on an exception entry or reset.

Attempting to execute instructions when the T bit is 0 results in a fault or lockup.

That jibes with the rest of the data in the crash, which doesn't make much sense otherwise--the docs say the CFSR indicates an unknown instruction error at the $pc address pushed to the stack, but that address is in the firmware code, definitely not an invalid instruction.

But it doesn't explain why the Thumb bit got cleared. Maybe a side effect of that stack overflow? As I recall, the project to support Rust on Playdate had a lot of trouble with stack overflows. That ldr r0, [sp, #240] suggests to me that you're passing stack-allocated structs instead of heap pointers. Is that something you have any control over, or is Zig doing that?

If you want to send a build I'd be happy to test it out. Also, if you send me your serial number I can check if Memfault has any logs with more info.

DanB91 · November 10, 2022, 3:50am

Thank you!

Ah so that makes sense why I can get different output to the console on different runs.
Oh so the PSR entry in the crash log also includes the EPSR as well? Good to know!
So that’d be interesting if a stack overflow would cause the T bit to be cleared, but strange how I am getting an undefined instruction in the last crash, but a stack overflow in the first crash report.
So this function is populating a large struct called “global_state”, which is actually allocated on the heap. All the literals used seem to be allocated in rodata memory and not on the stack. See what r1 is populated with starting at 0x60036344 in the attached assembly (it's populated with an rodata address).
I am looking through the assembly and I don’t see anything over a few hundred bytes ever allocated on the stack, at least in this area. Plus, I’ve been printing out the SP register in this function and the value of the SP is at a steady 0x200099B8, which is way below the top of the stack, so I’m really confused at what could be causing a stack overflow here.
Yes that would be awesome if you could look at it! Thanks so much! Just sent you a message.

dave · November 10, 2022, 11:10pm

Okay, I traced the crash down to an endless recursion in __aeabi_memset:

=> 0x6005bc30 <__aeabi_memset>:	cmp	r1, #0
   0x6005bc32 <__aeabi_memset+2>:	it	eq
   0x6005bc34 <__aeabi_memset+4>:	bxeq	lr
   0x6005bc36 <__aeabi_memset+6>:	push	{r7, lr}
   0x6005bc38 <__aeabi_memset+8>:	mov	r7, sp
   0x6005bc3a <__aeabi_memset+10>:	bl	0x6005bc30 <__aeabi_memset>
   0x6005bc3e <__aeabi_memset+14>:	pop	{r7, pc}

Googling for the function turns up the aeabi_memset.c source code [glibc/sysdeps/arm/aeabi_memset.c] on Codebrowser, where it just calls plain memset(), so I wonder if that's what's happening here and then memset is aliased to __aeabi_memset. Similar issue here: Compiler generates recursive memclr - #3 by Phaiax - help - The Rust Programming Language Forum.
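To see how this recursion lines up with the first crash log, here is a small arithmetic sketch in Python. Each pass through `push {r7, lr}` moves SP down by 8 bytes. The starting SP below is the value logged earlier in the thread from a different function, so the call count is only illustrative, and the 0x20000000 stack floor is the figure quoted earlier rather than a documented limit.

```python
STACK_FLOOR = 0x20000000  # assumption: lowest valid stack address quoted in the thread
FRAME_BYTES = 8           # push {r7, lr} stores two 32-bit registers
START_SP = 0x200099B8     # assumption: SP logged earlier, not measured inside memset


def simulate_recursion(sp, floor=STACK_FLOOR, frame=FRAME_BYTES):
    """Count the pushes that still fit on the stack.

    Returns (number of successful pushes, first address written below the floor).
    """
    calls = 0
    while sp - frame >= floor:
        sp -= frame
        calls += 1
    return calls, sp - frame


calls, fault_address = simulate_recursion(START_SP)
print(f"{calls} nested calls fit; next push writes at 0x{fault_address:08X}")
```

Any 8-byte-aligned starting SP gives the same first out-of-range address, 0x1FFFFFF8, which is consistent with the mmfar value in the first crash log.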

Tangentially, did you know the F7 chip we're using has a bug where you can't single step through code in the debugger? You have to set a breakpoint everywhere you want to stop, then clear it, set the next one, and continue, over and over again. That's real fun, good typing practice.

DanB91 · November 11, 2022, 5:53am

Oh wow thanks so much for finding this! I’ve been banging my head against the wall for the past 2 weeks lol. It’s very interesting I never hit this bug before, but maybe I just got lucky or there is something that changed that I’m not seeing in Zig’s recent commits.

I figured out I can work around this by exporting custom versions of __aeabi_memset (making sure to swap the 'c' and 'n' args!!) and __aeabi_memclr that disable optimizations by using volatile pointers. My game is up and running now! Not an ideal solution, and I’ll need to ask around in the Zig community about this.

But my major question: how did you use a debugger here? Do you have to use JTAG or SWD, or can it be done over USB? This’d prove invaluable even if I have to delete a breakpoint, add a breakpoint and continue (even though, yea, that does suck). But, yea, if I have to open up the Playdate and solder wires to the board, I'm probably not gonna do that.

Thanks so much again!

dave · November 11, 2022, 6:31pm

I'm using a dev unit with an ST-Link connected via SWD, but production devices don't support that. I don't know if there's any way to catch this without being able to step through in a debugger. We might be able to check for stack overflow in the fault handler and give a different error message, but in this case knowing it's a stack overflow didn't really help because the culprit is buried.

I'll keep this in the back of my head and see if any ideas come up about how better to deal with this. Luckily it's a very rare case.

DanB91 · November 12, 2022, 2:18am

Ah understood. Yea, I suppose not many people are trying to get a brand new language working on the Playdate :).

Thanks again for the help and thanks especially for being supportive of this endeavor! Will definitely keep an eye out if you come up with something to help with debugging. I’m sure this won’t be the last time I encounter a strange issue like this lol