In the first part of this article, I focused on the algorithm itself, removing parts I wouldn’t use (like color), making it simpler, less loopy, etc. This yielded a tenfold improvement on a modern architecture, but it would still be very slow if simply translated to assembly by a compiler like cc65.
In this part, I’ll focus on things one can do to get “speed” out of a 1 MHz 8-bit processor. It will include a number of “tricks”, some of them largely accepted as “good code” – like predictive branching or lookup tables – others now regarded as “ugly”, “dangerous”, “bad practice” – like self-modifying code, global variables, absent bounds-checking, etc. Taking shortcuts was very often necessary 30 years ago, and caring about contracts, scopes, separation and so on was… less of a thing.
As a reminder, here is the reference C algorithm.
Aligning your buffers
Let’s get that out of the way right from the start: you don’t want the extra page-crossing penalty. Align the buffers (stuff the holes with something that fits if memory is an issue) and be done with it. Bonus: patching addresses will only require patching the page byte.
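With ca65 (the cc65 suite’s assembler), this is just a directive. A minimal sketch, assuming a dedicated segment (the segment name is made up, and the segment must be declared page-aligned in the linker configuration):
.segment "ALIGNED"
.align 256
next_line_low: .res 256 ; each table starts on a page boundary
next_line_high: .res 256 ; exactly one page, so this one stays aligned too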
Sticking to 8 bits
Anything involving 16 bits will be at least twice as slow, probably three times as slow. When presented with an algorithm using int types, verify every variable. Is it really an int? Does it fit in 8 bits? Anything that fits in 8 bits should be 8 bits. Sometimes it can look like a uint8 could overflow, but is it by more than one bit? r = (a + b) >> 1, where a and b are 8-bit values, might look like it would overflow, but thanks to the carry bit, it does not (see the sketch below). In this decoder, there were a lot of variables that were not 8-bit initially, but ended up being so, and this helped a lot. The bitbuffer was 32 bits initially. I got it down to 16 bits easily, and it did seem like a good idea to have a second byte ready so the first one would always be complete and allow for direct matching… But shifting 16 bits was much more expensive than shifting an 8-bit bitbuffer one bit at a time.
Avoiding stack function parameters
On the 6502, a lot of small things become functions. For example,
val = (val_from_last[last] * factor) >> 4;
would be compiled by cc65, approximately, as:
ldx last ; load index
lda val_from_last,x ; load table value
jsr pusha0 ; push as parameter for multiplication
lda factor ; load multiplier
jsr tosmula0 ; multiply (returns 16-bits in A/X)
jsr shrax4 ; shift right 4 (returns in A/X)
This is good in the sense that it is small, but it is slow. It also performs a 16-bit by 16-bit multiplication, which is much slower than we need here, as both operands are 8-bit. I wrote a multiplication function that takes its parameters from the A and X registers, and that code became:
ldy last
ldx val_from_last,y
lda factor
jsr mult8x8r16
jsr shrax4
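For reference, such a multiplication routine can be the classic shift-and-add loop; here is a minimal sketch with that calling convention (not the project’s actual code; mtmp and mres are made-up zero-page temporaries):
mult8x8r16:
sta mtmp ; multiplicand
stx mtmp+1 ; multiplier
lda #0 ; A holds the high byte of the running result
ldy #8 ; 8 bits to process
@loop: lsr mtmp+1 ; next multiplier bit falls into the carry
bcc @skip ; bit clear: nothing to add
clc
adc mtmp ; bit set: add the multiplicand to the high byte
@skip: ror a ; shift the 16-bit result right by one...
ror mres ; ...the bit falling out of A lands in the low byte
dey
bne @loop
tax ; high byte in X,
lda mres ; low byte in A
rts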
Inlining operations
A jsr shrax4 comes with a 12-cycle penalty (6 cycles for the jsr, 6 for the rts), so it gets inlined:
.macro SHRAX4
stx tmp1 ; 3 cycles
lsr tmp1 ; 5 - total 8
ror a ; 2 - total 10
lsr tmp1 ; 15
ror a ; 17
lsr tmp1 ; 22
ror a ; 24
lsr tmp1 ; 29
ror a ; 31
ldx tmp1 ; 34
.endmacro
jsr mult8x8r16
SHRAX4 ; 34 cycles
sta result ; 37
stx result+1 ; 40
This still takes a large number of cycles, especially as we have to do >> 4 a large number of times (4 times per pair of columns for each line, which represents 7 seconds of CPU time), so…
Lookup tables
Gaining speed very often comes at the expense of code size, but luckily, in this project, memory is not really an issue (the I/O cache is 14 kB, which is ludicrous). As usual in this case, building lookup tables is the way to go, when possible. It is sometimes not possible when working on 16-bit integers (because a single lookup table indexed by a 16-bit value would amount to all of the main memory an Apple II has), but in the case of shifting right by 4, it is very possible: given a 16-bit number ABCDEFGH IJKLMNOP, where each letter represents one bit, shifting it right by 4 turns it into 0000ABCD EFGHIJKL. It can be done with two 256-byte lookup tables, one for shifting right by 4, one for shifting left. What we want to do is:
low_byte = (high_byte<<4 | low_byte>>4), high_byte = high_byte>>4
Building these tables is done only once at startup, in a very simple way:
ldy #0 ; Our index (0-255)
: tya ; Transfer to A,
asl ; shift it
asl
asl
asl
sta _shiftl4,y ; Store the result
iny
bne :-
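The right-shift table is built the same way, as a mirror of the loop above:
ldy #0 ; Our index (0-255)
: tya ; Transfer to A,
lsr ; shift it right four times
lsr
lsr
lsr
sta _shiftr4,y ; Store the result
iny
bne :-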
Afterwards, we can simply shift the result of our multiplication this way:
jsr mult8x8r16 ; Returns with high byte in X, low in A
tay ; 2 - Low byte to Y,
lda _shiftr4,y ; 6 - Shift its high nibble,
ora _shiftl4,x ; 10 - OR it with the high byte low nibble,
sta result ; 13 - Store low byte
lda _shiftr4,x ; 17 - Shift the high byte
sta result+1 ; 20 - And done!
We just halved the time needed to shift numbers by 4, left or right. As this is done 150,000 times, this translates to a 3-second improvement.
More avoidance of stack parameters
In the previous example, avoiding stack use was easy, as we were multiplying two 8-bit numbers fitting in two registers. How to do the same when our function needs two 16-bit integers? There are only three registers. In this case, the cc65 compiler way is to push the first 16-bit int on the software stack, load the second one in A/X, and call the function, which will store A/X in zero page and pop the first parameter from the stack. This is very slow, so instead of doing:
lda ...
ldx ...
jsr pushax
lda ...
ldx ...
jsr mult16x16r16 ; entry point that stores AX and pops stack
We will do:
lda ...
sta tmp1
ldx ...
stx tmp1+1
lda ...
ldx ...
jsr mult16x16r16_direct ; entry point after the stack pop
This has a second advantage in our algorithm, where there is a place doing 320 multiplications with one of the parameters always identical: the identical one can be stored only once, before the loop.
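For reference, a sketch of how such a dual entry point can be laid out (not the project’s actual routine; the first operand lives in tmp1/tmp1+1 as in the caller above, and ptr1 is assumed to be free for the second one):
mult16x16r16:
sta ptr1 ; park the second operand (arrived in A/X)
stx ptr1+1
jsr popax ; pop the first operand from the cc65 software stack
sta tmp1
stx tmp1+1
lda ptr1 ; put the second operand back in A/X
ldx ptr1+1 ; and fall through
mult16x16r16_direct: ; the caller already stored operand 1 in tmp1/tmp1+1
sta ptr1 ; operand 2 comes in A/X
stx ptr1+1
; ... 16x16 multiply of tmp1 by ptr1, result in A/X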
Shifting smartly
In a similar manner, there is a place in the code where val = factor << 7. This could be inlined as seven asl/rol pairs, costing 49 cycles, or we can notice that, as factor is 8-bit, seven bits of the low byte end up in the high byte, the high bit of the low byte is the original low bit, and the rest is zeroes:
; factor<<7: 00000000 ABCDEFGH becomes 0ABCDEFG H0000000
lsr a ; 2 - shift low byte >>1, (puts low bit to carry),
tax ; 4 - move what remains to high byte (0ABCDEFG)
lda #0 ; 6 - reinit low byte to zero,
ror a ; 8 - put initial low bit back to high bit of low byte

Self-patching code
In this algorithm, there are a number of things that are set every couple of rows and used multiple times per pixel. To keep the same example, factor is set at the start of a pair of rows and then used four times in the 320-column loop. That means setting it is done 120 times, and using it 153,600 times. Instead of reloading it, we will just patch our code so that:
lda factor
ldx val
jsr mult8x8r16
becomes:
factor1:lda #$FF
ldx val
jsr mult8x8r16
This does not save a lot, but it adds up. The same trick can be used to spare zero-page variables when you have to put a temporary result away to reuse it later. In that case, no speed is gained (sta zp/lda zp is 3+3 cycles, sta abs/lda #imm is 4+2), but fewer ZP variables are needed, and ZP variables are scarce, so this is useful (for example, lines 539/548).
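For the factor example above, the patching itself is a single store into the operand byte of the immediate load, done at the start of each pair of rows:
lda factor ; the new value for this pair of rows
sta factor1+1 ; overwrite the #$FF operand of the lda at factor1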

Branch prediction
This is equivalent to the LIKELY()/UNLIKELY() macros you can find in kernel code. As the programmer, you know better than the compiler whether a condition is likely to be true or not, and you can adapt your branching to spare one cycle (a branch taken is 1 more cycle than one not taken on 6502) on the most common case.
For example, when iterating over more than 256 items in an array, you will have to increment the high byte of your pointer when reaching 256, so you read $AA00, $AA01, … $AAFF, $AB00, $AB01, etc. One usual way to do that is:
loop:
iny
bne :+
inc buffer+1
: lda (buffer),y
; ... do things
jmp loop
Here, the bne will take 3 cycles 255 times and 2 cycles the 256th time (767 cycles). Instead,
incr_buffer_high:
inc buffer+1
jmp continue_loop
loop:
iny
beq incr_buffer_high
continue_loop:
lda (buffer),y
; ... do things
jmp loop
will take 2*255 + 3 cycles, plus 3 cycles for the extra jump back, about 33% less. This requires finding places out of the main code path to put these “out-of-band” helpers but, luckily, there often are some between a jmp and an entry point label. In this project, this is also very useful for the bitbuffer, which shifts bits one at a time about 500,000 times and only needs to refill every 8 bits, and which reads bytes about 60,000 times and only needs to increment the cache pointer page 234 times.
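For the bitbuffer, one possible shape of this is a counter-based sketch (the real decoder may track the refill differently; getbyte, bitbuf and bitcount are made-up names, and bitcount starts at 0 so the very first call refills):
refill_bitbuf:
jsr getbyte ; fetch the next byte from the I/O cache
sta bitbuf
lda #7 ; 7 bits will remain after the one consumed right away
sta bitcount
jmp got_bit
get_bit:
dec bitcount ; rare case, every 8th call: refill out of the main path
bmi refill_bitbuf
got_bit:
asl bitbuf ; common case: the next bit falls into the carry
; ... use the carry bit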

Splitting 16-bit buffers
Buffers of 16-bit numbers are a real pain in the ass, especially when they are more than 128 items long: to access buffer[index], you have to double the index, increment the page if the index is above 127, increment the index, get the high byte, decrement the index, get the low byte:
lda index
asl
bcc :+
inc buffer+1
: tay
iny
lda (buffer),y
tax
dey
lda (buffer),y
Splitting them into two arrays of low bytes / high bytes is much easier (and makes page crossing an issue only for arrays bigger than 256 items):
ldy index
ldx buffer_high,y
lda buffer_low,y

Hardcoding pointers
The main loop accesses the next_line[] buffer approximately 32 times per pair of columns, so it’s referenced at about 64 places (for low and high value bytes). The “keep code small” way of handling that is to use indirect addressing, lda (zp_ptr),y, but that has drawbacks. First, we access that buffer at indexes col, col+1, col+2 and col+3, so this would require four zero-page pointers, offset from each other by one – using one ZP pointer and offsetting Y would make clean page crossing impossible, and even if it were possible, shifting the Y index back and forth would be complicated and time-consuming. Second, indirect addressing takes 5 cycles instead of 4 for lda $nnnn,y, meaning a penalty of (120*320*64*2) = 5M cycles. Instead, all of those accesses are hardcoded with a label in front, and the page bytes require patching twice per row. It is ugly (see set_buf_pages), but it takes only 30k cycles, translating to a large gain.
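A sketch of what that patching looks like (ref0, ref1, … are assumed labels placed on each hardcoded access, and next_line_page is a made-up variable holding the current page):
lda next_line_page ; page (high) byte of the buffer for this pair of rows
sta ref0+2 ; the third byte of "ref0: lda $nn00,x" is the page byte
sta ref1+2
sta ref2+2
; ... one store per hardcoded reference (about 64 of them)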
Looking at the data
I delved into “what kind of data do I read and decode” late in the project, as I didn’t think it would help a lot at first. After all, the operations have to be done, don’t they? But…

Sometimes, the data makes you feel lucky. For example, there is a buffer-value adjustment taking place at every row start, and it requires 320 multiplications to scale the next_line buffer:
val = ...; /* some 16-bit integer */
for (i = 0; i < DATABUF_SIZE-1; i++) {
    next_line[i] = (val * next_line[i]) / 256;
}
When I printed that val [1], I noticed that it was 255 more often than any other value – actually, it is 255 more than 70% of the time over all my test pictures. This means that, in this case, I could rewrite the previous multiplication line:
(255 * next_line[i]) / 256
= (next_line[i]*256 - next_line[i]) / 256
= next_line[i] - next_line[i]/256
= next_line[i] - (next_line[i]>>8)
So the previous for loop could be special-cased:
if (val == 255) {
    for (i = 0; i < DATABUF_SIZE-1; i++) {
        next_line[i] -= next_line[i]>>8;
    }
} else {
    for (i = 0; i < DATABUF_SIZE-1; i++) {
        next_line[i] = (val * next_line[i]) / 256;
    }
}
Translated into assembly, this is much simpler than a multiplication (20-30 cycles instead of 150):
ldx i
lda next_line_low,x ; Low byte of next_line[x] in A,
sec
sbc next_line_high,x ; Subtract high byte from low :
; -(next_line[i]>>8)
bcs :+
dec next_line_high,x ; Decrement high byte if needed
: sta next_line_low,x
Approximating
One of the worst things in this algorithm is that final pixel values are computed from 16-bit numbers, divided by our dear factor, and then clamped between 0 and 255 (the 16-bit numbers are signed, to make things worse, so clamping is needed at both ends). For a 320×240 image, this means 153,600 divisions (~45 seconds) with a precise division method. I already explained that in the previous article, as it translates to a large net gain even on a modern architecture, but I’ll detail it a bit more here.
So we’re going to approximate here too, and replace the division method with the inverse multiplication/shifting method, getting division time down to 30 seconds.
But this is far from enough: as the division factor only changes every few rows, this calls for a division lookup table. For it to fit in memory, I only store the results for “round hexadecimal numbers”: 0x80/n, 0x180/n, 0x280/n, … 0xFF80/n. Yes, this approximates the approximation, and the resulting image, of course, is different, and some pixels have their value offset by… 1, which is quite invisible.

The result of the clamping is also stored in that table, so 1) no check is needed at the time of finalizing a pixel and 2) as soon as the division’s result high byte is not zero when building the table, you can stop dividing. The table is built like this:
lda #0
sta wordcnt ; persist counter, division
; will use all registers
: ldx wordcnt ; load dividend high byte
lda #$80 ; and its low byte ($80, middle of the range)
fact: ldy #$FF ; load divisor (patched in)
jsr approx_div16x8 ; divide AX by Y
ldy wordcnt
bmi overflows_neg ; stop if the dividend is negative
; (its high byte has the high bit set)
sta dyndiv,y ; Store the low byte of the result
txa ; Check if high byte is 0,
bne overflows ; Stop if not: result > 255
inc wordcnt ; Increment the counter,
bne :- ; Go do the next division
jmp done
overflows: ; Fill the overflows with 255...
lda #$FF
: sta dyndiv,y
iny
bmi overflows_neg ; As long as they're positive,
bne :-
overflows_neg:
lda #$00 ; Then fill the underflows with 0
: sta dyndiv,y
iny
bne :-
done:
This takes about one second per picture, because now we’re only dividing ~5,000 times per image. Once the table is built, storing a pixel after its 16-bit value is computed, which used to be this:
; ... compute 16-bits value
ldy factor ; load factor
jsr approx_div16x8 ; divide
cpx #0 ; is result high byte 0 ?
beq store ; yes, we can store the result
bmi underflow ; is it negative ?
overflow:
lda #$FF ; no, so store 255
jmp store
underflow:
lda #$00 ; yes, so store 0
store:
ldy column ; Reload column, trashed by division
sta output_buffer,y ; Store value!
and took about 200 cycles… becomes this:
; ... compute 16-bits value
lda dyndiv,x
sta buffer,y
Which takes 8 cycles.
Looking at the data again
Another round of printf()s showed that the division factor is 48 more than 60% of the time. I devoted 256 extra bytes to a static div48 table, initialized once at startup and never changing, so that I only have to rebuild the current row’s division table when the factor is not 48 (and not the same as before). Dividing now takes 300 ms per picture, as fewer than 2,000 divisions take place.
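One way to wire this up, reusing the self-patching trick from earlier, could look like the sketch below (prev_factor, divref and build_dyndiv are made-up names: divref would label the lda dyndiv,x of the pixel store, and build_dyndiv would be the table-building code above wrapped as a routine):
set_div_table:
lda factor
cmp #48 ; the most common divisor...
beq use48 ; ...gets the static, precomputed table
cmp prev_factor ; same divisor as the last rebuild?
beq usedyn ; yes: dyndiv is still valid
sta prev_factor
sta fact+1 ; patch the new divisor into the table builder
jsr build_dyndiv ; and rebuild dyndiv
usedyn: lda #>dyndiv ; point the pixel store at dyndiv...
jmp setref
use48: lda #>div48 ; ...or at the static div48 table
setref: sta divref+2 ; patch the page byte of the lda dyndiv,x
rts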
Looking at the data again
I was unhappy with the Huffman decoder algorithm. It reads bits, one at a time, until it matches a valid Huffman code in the current Huffman tree, and returns the associated value. Some of the Huffman trees (which, weirdly, are static and not dependent on the picture data) are quite simple, others less so. A few trees have only two leaves (0 and 1), for which there is only one bit to read. A few have four (00, 01, 10, 11), two bits to read. A few have simple codes that are easy to match without counting the bits read (00, 01, 10, 110, 1110, etc.). But others have codes starting with a zero that force me to count the bits in order to do the matching (00, 010, 011, 100, 101: reading ‘10’ is not the same as reading ‘010’; ‘10’ is not a valid code, but ‘010’ ends up at the same offset in the code/value array). There are some places in the decoding where data is fetched from a given tree (always the same), others where it could fetch from multiple different trees.
The general decoder algorithm had to take all these cases into account. It also required patching in the tree to use before each read, every time (two patches: one for the code/value match array, one for the code/numbits array). The small trees still took a full page. Etc. I modified all of this, used a jump table instead of a subroutine, and made basically one decoder algorithm per tree, at the expense of readability. This took 7 more seconds out, and allowed me to stuff 6 pages of tables into a single one, sparing memory to allocate to the I/O buffer (meaning fewer reads of more data, which is also faster when reading from a floppy with seek times close to a second).
Finally, Huffman decoding matches a code to a value, which is necessary when we’re decoding data (in _decode_row) because we will use it, but useless when skipping data we don’t use (in _consume_extra). So, the decoders got specialized again, only detecting valid codes when skipping data, without returning the corresponding value.
Not checking things
In computing, it is now well-known that functions should have contracts, and should explicitly and cleanly fail when their contract is not respected. For example, checkUser(User a) should start with if (a == NULL) before anything else, handle that case cleanly, and that should be in the unit tests, too. Callers should check return values and act accordingly.

Here, we take our memory locations and values and we do the thing we’re supposed to do. If the data is invalid, at best we’ll get a weird image out, at worst we’ll crash hard. For example, dcraw’s algorithm checked that column was still > 0 when doing repeats:
for (rep = 0; rep < 8 && rep < nreps && col > 0; rep++) {
    col -= 2;
    ...
}
This is useless. Images are correctly encoded so that repeats don’t underflow col.

In conclusion
I think that’s about it on this algorithm, and I won’t be able to do much better. I may still have a bit of room left in the control Huffman decoders, but I don’t think I have any more big wins awaiting. Nevertheless, I’m kind of proud of the level of optimization I achieved on this algorithm, which was never intended to run on a potato chip incapable of multiplying in hardware, and not even able to shift more than one bit at a time. I hope the dive was interesting, and if you have ideas, I’m all ears!
[1] Stats the easy way: add printf("val %d\n", val) to the C code, then run > out.txt; grep ^val out.txt | sort | uniq -c | sort -n