Optimizing a 6502 image decoder, from 70 minutes to 1 minute

When I set out to write a program that would allow me to do basic digital photography on the Apple II, I decided I would do it with the Quicktake cameras. They seemed the obvious choice: they were Apple cameras, and they connect to the computer over a serial port.

The scope crept a bit after I managed to decode Quicktake 100 photos: I wanted it to be able to decode Quicktake 150 and Quicktake 200 pictures too. This threw me into image processing much more than I initially intended. This article explains how I got the Quicktake 150 decoder to a reasonable-ish speed on a 6502 at 1MHz.

The Quicktake 150 format is proprietary and undocumented. Free software decoders do exist, though, in the dcraw project, and this was my source for the initial Apple II decoder. Sadly, it is written in C, almost entirely undocumented, and (to me!) extremely hard to read and understand. The compression is based on Huffman coding with variable-length codes (which means bit-shifting), and the image construction involves a lot of 16-bit math. None of this is good on a 6502.

But first I had to rework the original algorithm to work with bands of 20 pixels, for memory reasons. Once I had a functional decoder, it ran perfectly, but it took… seventy minutes to decode a single picture.

A Quicktake 150 picture, fully decoded
The same picture, dithered for display on a monochrome Apple II

Of course, I didn't get there that fast. The first implementation was released two years ago, in November 2023. Getting to where I am now took, I think, five or six deep dives, each one eating one or two weeks' worth of late evenings and full weekends dedicated to progressing, wading through hundreds or thousands of debug printf()s, gdb'ing, comparing variables and offsets, etc.

Follow me through the algorithmic iterations that brought that decoding time to under one minute. My implementation is now full assembly, but the commits I will link to here are to the general decoding algorithm, which is easier to read in C.

I have noticed that hand-optimizing assembly yields good results, but optimizing the algorithm itself usually leads to much more impressive speed gains: doing too many things faster is not as good as doing the minimum faster. And that Quicktake 150 decoder sure did useless things, especially in my case, where I don't care about color and end up with a 256×192 image!

I have made a specific repository to track these algorithmic changes. It started here (already a little bit deobfuscated compared to dcraw), at 301 million x86_64 instructions.

Dropping color

I didn't have to decode color at all, as I was going to drop it anyway. I added a flag to decode only the green pixels out of the Bayer matrix, and drop the rest. 264M instructions.
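
A minimal sketch of the idea, with hypothetical helpers (decode_next_value(), store_green()) and an assumed Bayer layout where green sits on sites whose row and column share parity:

```c
/* Sketch only: decode_next_value() and store_green() are hypothetical,
   and which sites are green depends on the actual Bayer pattern.
   Every photosite still has a code in the stream, so it must be
   consumed; the value is simply dropped for red and blue sites. */
static void decode_row_green_only(unsigned row, unsigned width)
{
    for (unsigned col = 0; col < width; col++) {
        unsigned v = decode_next_value();
        if (((row ^ col) & 1) == 0)   /* assumed green position */
            store_green(row, col, v);
    }
}
```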

Understanding the buffers

I then set out to understand the use of the various temporary buffers: the more buffers, the more intermediary steps, and the more copying and looping. I wanted to drop as many of them as possible. The first step towards that was unrolling some small nested loops that worked on y ∈ [1,2], x ∈ [col+1,col]. 238M instructions.
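
For illustration, the unrolling looks like this, with process() standing in for the real loop body:

```c
/* Before: a tiny 2x2 nested loop over y in [1,2] and x in [col+1,col] */
static void inner_loop(int col)
{
    for (int y = 1; y <= 2; y++)
        for (int x = col + 1; x >= col; x--)
            process(y, x);
}

/* After: unrolled, so each of the four iterations can be specialized
   instead of paying loop and indexing overhead */
static void inner_unrolled(int col)
{
    process(1, col + 1);
    process(1, col);
    process(2, col + 1);
    process(2, col);
}
```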

I figured I still had extra processing I didn’t need, removed it, dropped a buffer (and dropped the #ifdef COLOR conditional to make things clearer). 193M instructions.

Our image looks like this now

At that point, my implementation still output only green pixels, in a Bayer matrix, to a 640×480 buffer, and then interpolated the missing ones. This was useless, so I dropped it entirely. 29M instructions.

Without interpolating, we can clearly see we only have half the pixels.

I still had half the pixels black in the destination buffer, so I dropped them earlier rather than later, by outputting a 320×240 image with only the relevant pixels. 25M instructions.

The 8-bit grayscale buffer at that point

At this point I was able to figure out that, out of the three buf_m buffers, only buf_m[1] was used to construct the picture, and that buf_m[2] was only used to feed back into buf_m[0] at the start of a row. I could therefore construct the image from the buf_m[1] values on the fly instead of looping on that buffer a second time, and then drop buf_m[1] entirely too. This allowed me to rename the last remaining buffer for more clarity. 22M instructions.
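
Schematically, the resulting dataflow looks like this; every name here is illustrative, not the real identifier:

```c
/* Illustrative only: decode_value() and emit_pixel() are hypothetical,
   and buffer declarations are omitted. The point is the loop's shape. */
for (unsigned row = 0; row < band_height; row++) {
    /* what was buf_m[2] only existed to seed the row's first values */
    memcpy(cur_row, prev_row, row_bytes);

    for (unsigned col = 0; col < width; col++) {
        unsigned v = decode_value(cur_row, col);
        /* before: v went into buf_m[1], and a second loop turned
           buf_m[1] into pixels; now the pixel is emitted directly */
        emit_pixel(row, col, v);
    }
}
```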

Optimizing divisions

That was about it for the buffers. The rework of the code, at that point, made clear that every final pixel value was computed by dividing a 16-bit value, derived from the image data, by a common factor that changed only once every two rows. The result of that division was then clamped to [0-255]. This allowed me to precompute a division table every two rows, storing the final result, pre-clamped, in a simple array, at the cost of a bit of non-visible precision loss. On x86_64, still 22M instructions, but on the 6502 this was a huge gain, turning 153,600 divisions into fewer than 2,000.
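
A minimal sketch of such a table, with an assumed size and quantization step (the real granularity and names differ):

```c
#include <stdint.h>
#include <string.h>

#define QUANT_SHIFT    5                        /* assumption: 32-wide buckets */
#define DIV_TABLE_SIZE (65536 >> QUANT_SHIFT)   /* covers the 16-bit range */

static uint8_t div_table[DIV_TABLE_SIZE];

/* Rebuilt only when the factor changes (at most once every two rows).
   Quantizing the 16-bit input into buckets is where the small,
   non-visible precision loss comes from. Once the quotient reaches
   255, the rest of the table is a clamp, not a division. */
static void rebuild_div_table(uint16_t factor)   /* factor must be non-zero */
{
    unsigned i = 0;
    for (; i < DIV_TABLE_SIZE; i++) {
        uint32_t q = ((uint32_t)i << QUANT_SHIFT) / factor;
        if (q > 255)
            break;
        div_table[i] = (uint8_t)q;
    }
    memset(div_table + i, 255, DIV_TABLE_SIZE - i);  /* pre-clamped tail */
}

/* Per pixel: one lookup replaces a 16-bit division plus a clamp. */
static inline uint8_t scale_pixel(uint16_t value)
{
    return div_table[value >> QUANT_SHIFT];
}
```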

Output index

So far, the output buffer was written using the usual buffer[y*WIDTH+x] access method, which is really slow on a processor with no multiplication support. I changed that to a much simpler line-by-line indexing. (Even on x86_64, the change is notable: 20M instructions.)
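
The change looks roughly like this, with compute_pixel() as a stand-in for the real per-pixel work:

```c
#include <stdint.h>

#define WIDTH  320
#define HEIGHT 240

extern uint8_t compute_pixel(unsigned y, unsigned x);  /* stand-in */

/* Before: one multiply per pixel, very costly without hardware multiply */
static void fill_indexed(uint8_t *buffer)
{
    for (unsigned y = 0; y < HEIGHT; y++)
        for (unsigned x = 0; x < WIDTH; x++)
            buffer[y * WIDTH + x] = compute_pixel(y, x);
}

/* After: a running row pointer, advanced once per line */
static void fill_linear(uint8_t *buffer)
{
    uint8_t *row = buffer;
    for (unsigned y = 0; y < HEIGHT; y++) {
        for (unsigned x = 0; x < WIDTH; x++)
            row[x] = compute_pixel(y, x);
        row += WIDTH;   /* one 16-bit add per row replaces the multiply */
    }
}
```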

Huffman decoding

The algorithm initialized full tables so that a Huffman code could be obtained by just looking at the bitbuffer: for code 10001, for example, all codes from 10001000 to 10001111 were matched to the correct value, and the bitbuffer then shifted <<5. This seems good at first, but not on the 6502, as it requires a 16-bit bitbuffer to make sure a full byte is always available to look at. I reworked that to get bits one at a time. This made the x86_64 implementation slower, but the 6502 one 20 seconds faster, spending 9 seconds shifting bits instead of 29. It also allowed me to pack the tables more tightly, freeing up some memory for the cache.
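
A generic bit-at-a-time decoder, for illustration; get_bit() and the packed node layout are assumptions, not the decoder's actual structures:

```c
#include <stdint.h>

extern uint8_t get_bit(void);   /* hypothetical: next bit of the stream */

/* Packed binary tree: non-negative entries are node indices, negative
   ones hold a leaf value as its one's complement (layout assumed). */
typedef struct { int16_t child[2]; } huff_node;

static uint8_t huff_decode(const huff_node *tree)
{
    int16_t n = 0;                      /* start at the root */
    do {
        n = tree[n].child[get_bit()];   /* walk one bit at a time */
    } while (n >= 0);
    return (uint8_t)~n;                 /* leaf reached: recover the value */
}
```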

Assembly

This algorithm still performs very poorly when compiled by cc65, but it is far easier to manually translate into optimized 6502 assembly. There are also a lot of ad hoc optimisations, for example:

  • The division factor for final pixel values for a pair of rows is 48 more than 50% of the time, on every image I tested. So the 6502 implementation has two division lookup tables: one for 48 that is never recomputed, and one for another factor, recomputed if needed at the start of a pair of rows.
  • The row initialization multiplies all 320 next_line values by a factor, which is 255 more than half the time. In that case, instead of computing a = a*255, the assembly version computes (a*256)-a, that is (a<<8)-a, which is much faster (see the sketch after this list).
  • There is a whole lot of <<4 going on in the main loop, which is lookup-table based in the assembly implementation. A <<4 result is wider than 8 bits, so two tables are needed (also sketched below), but it is still worth the memory usage.
  • Half the Huffman codes read are discarded (they are used for blue and red pixels), so “discarder” functions are used in that case, only shifting the bitbuffer without fetching the value.
  • Buffer accesses (to next_line and the output buffer) are done through self-modifying code rather than zero-page pointers, which means keeping track of, and patching, about 54 address labels on each page cross. This is ugly as hell and costs about 50k cycles of patching per image, but spares 9M cycles overall.
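
Here are the two shift tricks, transcribed back into C for readability; the function and table names are mine, not the decoder's:

```c
#include <stdint.h>

/* a * 255 computed as (a << 8) - a: on the 6502 the << 8 costs nothing,
   since the value simply becomes the high byte of the 16-bit result. */
static inline uint16_t mul255(uint8_t a)
{
    return (uint16_t)(((uint16_t)a << 8) - a);
}

/* A <<4 of an 8-bit value spans 12 bits, so the assembly keeps two
   256-byte tables, one per result byte, instead of shifting four times. */
static uint8_t shl4_lo[256], shl4_hi[256];

static void init_shl4(void)
{
    for (unsigned i = 0; i < 256; i++) {
        shl4_lo[i] = (uint8_t)(i << 4);  /* low byte of i << 4  */
        shl4_hi[i] = (uint8_t)(i >> 4);  /* high byte of i << 4 */
    }
}
```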

The final code

I have pointed to commits to my “test” repository so far, but if you’re interested in the actual 6502 implementation, you can find it in my repository: the decoder, and the bitbuffer.

Bonus: first and current implementation videos

The current implementation
The first implementation (are you patient?)