Tracking and killing a cc65 corruption bug

It’s been a while (seven months, apparently) since I started using the #pragma code-name ("LC") feature of cc65 to get a few extra kilobytes of RAM for my retrocomputing programs, and ever since, I had noticed random crashes from time to time. They were hugely annoying as they were clearly symptomatic of silent corruption: changing the link order, a few variables, or a few lines of code was enough to make the crashes come and go with no valid explanation.
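For context, here is a minimal sketch of how a function can be placed in the LC segment, assuming the push/pop form of the pragma described in the cc65 documentation. The function name is hypothetical, and the #ifdef guard is an addition so that the same file still builds with a host compiler.

```c
/* Hypothetical example: put one function in the "LC" segment so the
 * linker places it in the Apple II language card instead of main RAM. */
#ifdef __CC65__
#pragma code-name (push, "LC")
#endif

int add_in_lc(int a, int b)
{
    return a + b;
}

/* Restore the previous code segment for whatever follows. */
#ifdef __CC65__
#pragma code-name (pop)
#endif
```

Everything between the push and the pop ends up in LC; code outside that range stays in the default segment.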

Between these two screenshots, all I changed was a static char buf[512] to a buf = malloc(512). Anybody who has spent enough time doing C knows this means “You just un/hid the problem!”

My problem was that I had no idea what was going wrong with my code, and debugging on the apple2 platform is much harder than on a modern Linux x86_64 distribution. So I did the easy thing: made my code compile on both platforms, #ifdef’d the bare minimum, and ran my code through the very powerful tools I know: gdb, valgrind, and Coverity.
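A minimal sketch of what that dual-platform approach can look like, assuming the only platform-specific part is keyboard input. __CC65__ is the macro cc65 predefines; wait_key and built_with_cc65 are hypothetical helpers, not taken from my actual code.

```c
#include <stdio.h>

#ifdef __CC65__
#include <conio.h>      /* cc65 console I/O, provides cgetc() */
#endif

/* Hypothetical helper: block until a key is available and return it. */
int wait_key(void)
{
#ifdef __CC65__
    return cgetc();     /* Apple II build: read one key from the keyboard */
#else
    return getchar();   /* host build: read from stdin, so gdb/valgrind can run it */
#endif
}

/* Returns 1 when compiled by cc65, 0 when compiled by a host compiler. */
int built_with_cc65(void)
{
#ifdef __CC65__
    return 1;
#else
    return 0;
#endif
}
```

Keeping the #ifdef blocks inside a handful of small helpers means the rest of the program is identical on both platforms, which is the part the host tools actually get to check.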

None of them found anything. So for a few months, I had to live with unreliable builds, progressing nicely but being reminded from time to time: “you can’t trust your code”.

Yesterday it happened again. Only this time I was doing an optimisation pass and I was not very much in the zone. My changes were supposed to get me more than half a kilobyte of extra RAM, I didn’t want to backtrack, and I was on holiday. So I set out to figure out what happened and fix it.

So where to start? I started by adding the relevant CFLAGS to have debug symbols available from cc65, somewhere, in some form:

CFLAGS=--debug-info -Wl --dbgfile,program.dbg -Ln program.lbl

Then I added a cgetc() at the very start of my main() function, to be able to start debugging right after the program loaded and before it had a chance to corrupt anything. I launched MAME with the -debug flag, enabled trace (trace log.asm), and found myself with 120,000 lines of:

Raw assembly log of the Apple II’s execution stream

This was, obviously, getting me nowhere, so after looking for existing solutions that could use the .dbg/.lbl files generated by cc65, and finding nothing, I reluctantly set out to write a helper tool to display Apple II / cc65 debug information from MAME’s execution log. It helped:
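The core of such a tool is turning label-file lines into an address-to-name table. Below is a minimal sketch of that one step, not the actual tool. It assumes the .lbl file uses VICE-style lines of the form "al 008C42 .mliparam"; parse_label_line is a hypothetical name.

```c
#include <stdio.h>
#include <string.h>

/* Parse one line of a VICE-style label file, e.g. "al 008C42 .mliparam".
 * Assumed format: "al", a hex address, then the label with a leading dot.
 * Returns 1 and fills *addr and name on success, 0 otherwise. */
int parse_label_line(const char *line, unsigned long *addr,
                     char *name, size_t name_size)
{
    unsigned long value;
    char label[64];

    if (sscanf(line, "al %lx .%63s", &value, label) != 2)
        return 0;               /* not a label line */
    if (strlen(label) >= name_size)
        return 0;               /* caller's buffer is too small */

    strcpy(name, label);
    *addr = value;
    return 1;
}
```

With the table built, each traced address can be looked up and printed next to its instruction, which is what turns a wall of hexadecimal into something readable.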

The same part of the run, this time with some info

This was only the first step. The second one was to trace a working build, trace a broken build, and figure out where they differ. To simplify matters, I wanted to do this without changing code offsets. Luckily, I found a way to break the build by making a buffer [1] instead of [512]. And now I was progressing!

The correct execution on the left, the wrong one on the right.

Here we are! After only about 3800 instructions, the codepath differs. On the left at $d447 we can see we’re loading the value $0a into the Y register, then jumping to $7021 (snprintf). On the right, we load $01, then transfer A to X??? This is absolutely not what I wrote, nor what cc65 compiled.
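Finding that first differing instruction does not have to be done by eye. Here is a minimal sketch, assuming both traces have already been read into arrays of lines; first_divergence is a hypothetical helper. Comparing lines verbatim only works because the code offsets are identical in both builds.

```c
#include <stddef.h>
#include <string.h>

/* Return the index of the first line at which two traces differ.
 * If one trace is a prefix of the other, return the shorter length. */
size_t first_divergence(const char *const *good, size_t good_len,
                        const char *const *bad, size_t bad_len)
{
    size_t n = good_len < bad_len ? good_len : bad_len;
    size_t i;

    for (i = 0; i < n; ++i) {
        if (strcmp(good[i], bad[i]) != 0)
            return i;           /* first instruction that is not the same */
    }
    return n;                   /* no difference in the common part */
}
```

The returned index is the place to start reading both logs side by side.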

Back to MAME for another run, this time setting a watchpoint on memory cell $d448 (containing the operand to the ldy at $d447). But it already contained $01 when my program was started, waiting at the very beginning of main() for a character input, long before any of my own bugs could have fucked anything up. And more precisely, in the surrounding area, only 3 bytes were wrong, $d448 to $d44b.

So I traced again, this time from before my program was ready. And behold, as soon as $d448 got a value, it was $01, the wrong value. Which meant it was wrongly set up during loading.

There, I had to delve into the magic of #pragma code-name ("LC"). How it works is that the code is physically written towards the end of the binary, in the ONCE segment containing init code and LC code. During init, the LC code is moved to the language card, and then that area of memory is given back to BSS storage (static variables).

This is done in cc65’s libsrc/apple2/crt0.s, in the highlighted section below:

After finding a disassembly of “BLTU2”, the ROM procedure called for that move, I couldn’t find any obvious bug there, and the offsets seemed right, so the problem had to be coming from somewhere else in cc65 itself. I computed the source address of what ends up at $d448. Given the BLTU call logic, it should be __ONCE_LOAD__+__ONCE_SIZE__+$d448-__LC_START__: $8c42.
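That arithmetic, written out as a small sketch. The function name is hypothetical, and so are the example values in the usage note: the only figures taken from this investigation are the destination $d448 and the result $8c42.

```c
/* Address, before the block transfer, of the byte that will end up at
 * `dest` in the language card. The LC code is stored right after the
 * ONCE segment, so its load address is once_load + once_size. */
unsigned lc_source_address(unsigned once_load, unsigned once_size,
                           unsigned lc_start, unsigned dest)
{
    unsigned lc_load = once_load + once_size;   /* where LC code sits on load */
    unsigned offset  = dest - lc_start;         /* position inside the LC code */

    return lc_load + offset;
}
```

For instance, assuming __LC_START__ is $D400 (to my knowledge the default in cc65’s apple2 linker configuration, but an assumption here), the offset is $48, so the LC code would have been loaded at $8c42 - $48 = $8bfa. With made-up values that add up to that, lc_source_address(0x8B00, 0xFA, 0xD400, 0xD448) gives 0x8C42.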

Let’s get back to MAME and put a watchpoint on $8c42 very early, shall we?

Aren’t we progressing now! It starts right, gets $0a written to it, then $01!?

Let’s get back to our commented trace run and search for $8c42:

There we are! cwd_init calls callmli, and callmli sets the X register to $01, then stores it to the address $8c42, which somehow has the .mliparam label, although we have determined for sure there is code there and it shouldn’t be written to!

Back to the cc65 codebase, investigating cwd_init… Oh, it’s a constructor… Remember the part where I investigated the “Block Transfer Up” setup? Constructors are called right before it.

And this is it. initlib calls cwd_init, which calls callmli, which stores a register in the BSS area before the code destined for the Language Card had a chance to be moved away from the BSS area. And now it’s corrupted very slightly.
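The ordering problem can be reproduced in a toy model. This is a simulation under made-up assumptions, not cc65’s startup code: memory is a 64-byte array, the offsets and code bytes are arbitrary, and the “constructor” is a single store into a BSS variable that overlaps the area where the LC code is still waiting to be moved.

```c
#include <string.h>

enum {
    MEM_SIZE = 64,
    LC_LOAD  = 16,  /* where the LC code sits right after loading */
    LC_DEST  = 48,  /* stand-in for the language card */
    LC_SIZE  = 8,
    BSS_VAR  = 18   /* BSS variable overlapping the LC load area */
};

/* Simulate startup. Returns 1 if the LC code arrives intact, 0 otherwise. */
int startup(int constructors_first)
{
    static const unsigned char code[LC_SIZE] =
        { 0xA0, 0x0A, 0x4C, 0x21, 0x70, 0x60, 0xEA, 0xEA };
    unsigned char mem[MEM_SIZE];

    memset(mem, 0, sizeof mem);
    memcpy(mem + LC_LOAD, code, LC_SIZE);           /* program is loaded */

    if (constructors_first)
        mem[BSS_VAR] = 0x01;                        /* constructor runs too early */

    memcpy(mem + LC_DEST, mem + LC_LOAD, LC_SIZE);  /* the block transfer */

    if (!constructors_first)
        mem[BSS_VAR] = 0x01;                        /* now the area really is BSS */

    return memcmp(mem + LC_DEST, code, LC_SIZE) == 0;
}
```

In this model startup(1) returns 0, because one byte of the code is overwritten before it is copied, while startup(0) returns 1. The store itself is harmless; only its position relative to the move matters.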

I moved the jsr initlib call to after the block transfer, and poof! The corruption disappeared, with, initially, a -2/+2 pull request.

Of course, this was not the correct fix, as it had side effects, described in this comment on the pull request. But after a discussion on the best way to fix it, and some guidance from Oliver Schmidt, I had a nice fix (that only adds 9 bytes on disk and 9 in RAM) ready, and it got merged!

In the end, this was one of the gnarliest debugging sessions I’ve had in years (once we found a kernel bug at work after a week of pair-debugging, only to discover it was already fixed and we just had to upgrade): it took me two full days plus evenings, so I’d say about 30 hours including the writing of the annotation tool. Even if I tried to explain things concisely here, and it may seem evident or at least clear, I was really almost overwhelmed at multiple points by all the temporary information I had to store in my brain / notepad, all the hexadecimal, all the offsets, all the little brain farts.

But I’m glad I did it, because I learned a lot of things, I fixed a really sneaky bug, and I wouldn’t have dared to open a GitHub issue titled “cc65 corrupts data, check out my code, set up my complicated emulation stack and figure it out for me please”.