In the first part of this article, I focused on the algorithm itself, removing parts I wouldn’t use (like color), making it simpler, less loopy, etc. This yielded a tenfold improvement on a modern architecture, but it would still be very slow if simply translated to assembly by a compiler like cc65.
In this part, I’ll focus on things one can do to get “speed” out of a 1 MHz 8-bit processor. It will include a number of “tricks”, some of them largely accepted as “good code” – like predictive branching or lookup tables – others now regarded as “ugly”, “dangerous”, “bad practice” – like self-modifying code, global variables, absent bounds-checking, etc. Taking shortcuts was very often necessary 30 years ago, and caring about contracts, scopes, separation and so on was… less of a thing.
As a reminder, here is the reference C algorithm.
Aligning your buffers
Let’s get that out of the way right from the start: you don’t want the extra page-crossing penalty. Align the buffers (stuff the holes with something that fits if memory is an issue) and be done with it. Bonus: patching addresses will only require patching the page byte.
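With ca65 (the cc65 suite’s assembler), this is just a directive. A minimal sketch, assuming a dedicated segment (the segment name is made up, and the segment must be declared page-aligned in the linker configuration):
.segment "ALIGNED"
.align 256
next_line_low: .res 256 ; each table starts on a page boundary
next_line_high: .res 256 ; exactly one page, so this one stays aligned too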
Sticking to 8 bits
Anything involving 16 bits will be at least twice as slow, probably three times as slow. When presented with an algorithm using int types, verify every variable. Is it really an int? Does it fit in 8 bits? Anything that fits in 8 bits should be 8 bits. Sometimes it can look like a uint8 could overflow, but is it by more than one bit? r = (a + b) >> 1, where a and b are 8-bit values, might look like it would overflow, but thanks to the carry bit, it does not (see the sketch below). In this decoder, there were a lot of variables that were not 8-bit initially, but ended up being so, and this helped a lot. The bitbuffer was 32 bits initially. I got it down to 16 bits easily, and it did seem like a good idea to have a second byte ready so the first one would always be complete and allow for direct matching… But shifting 16 bits was much more expensive than shifting an 8-bit bitbuffer one bit at a time.
Avoiding stack function parameters
On the 6502, a lot of small things become functions. For example,
val = (val_from_last[last] * factor) >> 4;
would be compiled by cc65, approximately, as:
ldx last ; load index
lda val_from_last,x ; load table value
jsr pusha0 ; push as parameter for multiplication
lda factor ; load multiplier
jsr tosmula0 ; multiply (returns 16-bits in A/X)
jsr shrax4 ; shift right 4 (returns in A/X)
This is good in the sense that it is small, but it is slow. It also performs a 16-bit by 16-bit multiplication, which is much slower than we need here, as both operands are 8-bit. I wrote a multiplication function that takes its parameters from the A and X registers, and that code became:
ldy last
ldx val_from_last,y
lda factor
jsr mult8x8r16
jsr shrax4
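For reference, such a multiplication routine can be the classic shift-and-add loop; here is a minimal sketch with that calling convention (not the project’s actual code; mtmp and mres are made-up zero-page temporaries):
mult8x8r16:
sta mtmp ; multiplicand
stx mtmp+1 ; multiplier
lda #0 ; A holds the high byte of the running result
ldy #8 ; 8 bits to process
@loop: lsr mtmp+1 ; next multiplier bit falls into the carry
bcc @skip ; bit clear: nothing to add
clc
adc mtmp ; bit set: add the multiplicand to the high byte
@skip: ror a ; shift the 16-bit result right by one...
ror mres ; ...the bit falling out of A lands in the low byte
dey
bne @loop
tax ; high byte in X,
lda mres ; low byte in A
rts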
Inlining operations
A jsr shrax4 comes with a 12-cycle penalty (6 cycles for the jsr, 6 for the rts), so it gets inlined:
.macro SHRAX4
stx tmp1 ; 3 cycles
lsr tmp1 ; 5 - total 8
ror a ; 2 - total 10
lsr tmp1 ; 15
ror a ; 17
lsr tmp1 ; 22
ror a ; 24
lsr tmp1 ; 29
ror a ; 31
ldx tmp1 ; 34
.endmacro
jsr mult8x8r16
SHRAX4 ; 34 cycles
sta result ; 37
stx result+1 ; 40
This still takes a large number of cycles, especially as we have to do >> 4 a large number of times (4 times per pair of columns for each line, which represents 7 seconds of CPU time), so…
Lookup tables
Gaining speed very often comes at the expense of code size, but luckily, in this project, memory is not really an issue (the I/O cache is 14 kB, which is ludicrous). As usual in this case, building lookup tables is the way to go, when possible. It is sometimes not possible when working on 16-bit integers (because a single lookup table indexed by a 16-bit value would amount to all of the main memory an Apple II has), but in the case of shifting right by 4, it is very possible: given a 16-bit number ABCDEFGH IJKLMNOP, where each letter represents one bit, shifting it right by 4 turns it into 0000ABCD EFGHIJKL. It can be done with two 256-byte lookup tables, one for shifting right by 4, one for shifting left. What we want to do is:
low_byte = (high_byte<<4 | low_byte>>4), high_byte = high_byte>>4
Building these tables is done only once at startup, in a very simple way:
ldy #0 ; Our index (0-255)
: tya ; Transfer to A,
asl ; shift it
asl
asl
asl
sta _shiftl4,y ; Store the result
iny
bne :-
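The right-shift table is built the same way, as a mirror of the loop above:
ldy #0 ; Our index (0-255)
: tya ; Transfer to A,
lsr ; shift it right four times
lsr
lsr
lsr
sta _shiftr4,y ; Store the result
iny
bne :-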
Afterwards, we can simply shift the result of our multiplication this way:
jsr mult8x8r16 ; Returns with high byte in X, low in A
tay ; 2 - Low byte to Y,
lda _shiftr4,y ; 6 - Shift its high nibble,
ora _shiftl4,x ; 10 - OR it with the high byte low nibble,
sta result ; 13 - Store low byte
lda _shiftr4,x ; 17 - Shift the high byte
sta result+1 ; 20 - And done!
We just halved the time needed to shift numbers by 4, left or right. As this is done 150,000 times, this translates to a 3-second improvement.
More avoidance of stack parameters
In the previous example, avoiding stack use was easy, as we were multiplying two 8-bit numbers fitting in two registers. How to do the same when our function needs two 16-bit integers? There are only three registers. In this case, the cc65 compiler way is to push the first 16-bit int on the software stack, load the second one in A/X, and call the function, which will store A/X in zero page and pop the first parameter from the stack. This is very slow, so instead of doing:
lda ...
ldx ...
jsr pushax
lda ...
ldx ...
jsr mult16x16r16 ; entry point that stores AX and pops stack
We will do:
lda ...
sta tmp1
ldx ...
stx tmp1+1
lda ...
ldx ...
jsr mult16x16r16_direct ; entry point after the stack pop
This has a second advantage in our algorithm, where there is a place doing 320 multiplications with one of the parameters always identical: the identical one can be stored only once, before the loop.
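For reference, a sketch of how such a dual entry point can be laid out (not the project’s actual routine; the first operand lives in tmp1/tmp1+1 as in the caller above, and ptr1 is assumed to be free for the second one):
mult16x16r16:
sta ptr1 ; park the second operand (arrived in A/X)
stx ptr1+1
jsr popax ; pop the first operand from the cc65 software stack
sta tmp1
stx tmp1+1
lda ptr1 ; put the second operand back in A/X
ldx ptr1+1 ; and fall through
mult16x16r16_direct: ; the caller already stored operand 1 in tmp1/tmp1+1
sta ptr1 ; operand 2 comes in A/X
stx ptr1+1
; ... 16x16 multiply of tmp1 by ptr1, result in A/X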
Shifting smartly
In a similar manner, there is a place in the code where val = factor << 7. This could be inlined as seven asl/rol pairs, costing 49 cycles, or we can notice that, as factor is 8-bit, seven bits of the low byte end up in the high byte, the high bit of the low byte is the original low bit, and the rest is zeroes:
; factor<<7: 00000000 ABCDEFGH becomes 0ABCDEFG H0000000
lsr a ; 2 - shift low byte >>1, (puts low bit to carry),
tax ; 4 - move what remains to high byte (0ABCDEFG)
lda #0 ; 6 - reinit low byte to zero,
ror a ; 8 - put initial low bit back to high bit of low byte

Self-patching code
In this algorithm, there are a number of things that are set every couple of rows and used multiple times per pixel. To keep the same example, factor is set at the start of a pair of rows and then used four times in the 320-column loop. That means setting it is done 120 times, and using it 153,600 times. Instead of reloading it, we will just patch our code so that:
lda factor
ldx val
jsr mult8x8r16
becomes:
factor1:lda #$FF
ldx val
jsr mult8x8r16
This does not save a lot, but it adds up. The same trick can be used to spare zero-page variables when you have to put a temporary result away to reuse it later. In that case, no speed is gained (sta zp/lda zp is 3+3 cycles, sta abs/lda #imm is 4+2), but fewer ZP variables are needed, and ZP variables are scarce, so this is useful (for example, lines 539/548).
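For the factor example above, the patching itself is a single store into the operand byte of the immediate load, done at the start of each pair of rows:
lda factor ; the new value for this pair of rows
sta factor1+1 ; overwrite the #$FF operand of the lda at factor1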

Branch prediction
This is equivalent to the LIKELY()/UNLIKELY() macros you can find in kernel code. As the programmer, you know better than the compiler whether a condition is likely to be true or not, and you can adapt your branching to spare one cycle (a branch taken is 1 more cycle than one not taken on 6502) on the most common case.
For example, when iterating over more than 256 items in an array, you will have to increment the high byte of your pointer when reaching 256, so you read $AA00, $AA01, … $AAFF, $AB00, $AB01, etc. One usual way to do that is:
loop:
iny
bne :+
inc buffer+1
: lda (buffer),y
; ... do things
jmp loop
Here, the bne will take 3 cycles 255 times and 2 cycles the 256th time (767 cycles). Instead,
incr_buffer_high:
inc buffer+1
jmp continue_loop
loop:
iny
beq incr_buffer_high
continue_loop:
lda (buffer),y
; ... do things
jmp loop
will take 2*255 + 3 cycles, plus 3 cycles for the extra jump back, about 33% less. This requires finding places out of the main code path to put these “out-of-band” helpers but, luckily, there often are some between a jmp and an entry point label. In this project, this is also very useful for the bitbuffer, which shifts bits one at a time about 500,000 times and only needs to refill every 8 bits, and which reads bytes about 60,000 times and only needs to increment the cache pointer page 234 times.
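For the bitbuffer, one possible shape of this is a counter-based sketch (the real decoder may track the refill differently; getbyte, bitbuf and bitcount are made-up names, and bitcount starts at 0 so the very first call refills):
refill_bitbuf:
jsr getbyte ; fetch the next byte from the I/O cache
sta bitbuf
lda #7 ; 7 bits will remain after the one consumed right away
sta bitcount
jmp got_bit
get_bit:
dec bitcount ; rare case, every 8th call: refill out of the main path
bmi refill_bitbuf
got_bit:
asl bitbuf ; common case: the next bit falls into the carry
; ... use the carry bit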

Splitting 16-bit buffers
Buffers of 16-bit numbers are a real pain in the ass, especially when they are more than 128 items long: to access buffer[index], you have to double the index, increment the page if the index is above 127, increment the index, get the high byte, decrement the index, get the low byte:
lda index
asl
bcc :+
inc buffer+1
: tay
iny
lda (buffer),y
tax
dey
lda (buffer),y
Splitting them into two arrays of low bytes / high bytes is much easier (and makes page crossing an issue only for arrays bigger than 256 items):
ldy index
ldx buffer_high,y
lda buffer_low,y

Hardcoding pointers
The main loop accesses the next_line[] buffer approximately 32 times per pair of columns, so it’s referenced at about 64 places (for low and high value bytes). The “keep code small” way of handling that is to use indirect addressing, lda (zp_ptr),y, but that has drawbacks. First, we access that buffer at indexes col, col+1, col+2 and col+3, so this would require four zero-page pointers, offset from each other by one – using one ZP pointer and offsetting Y would make clean page crossing impossible, and even if it were possible, shifting the Y index back and forth would be complicated and time-consuming. Second, indirect addressing takes 5 cycles instead of 4 for lda $nnnn,y, meaning a penalty of (120*320*64*2) = 5M cycles. Instead, all of those accesses are hardcoded with a label in front, and the page bytes require patching twice per row. It is ugly (see set_buf_pages), but it takes only 30k cycles, translating to a large gain.
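A sketch of what that patching looks like (ref0, ref1, … are assumed labels placed on each hardcoded access, and next_line_page is a made-up variable holding the current page):
lda next_line_page ; page (high) byte of the buffer for this pair of rows
sta ref0+2 ; the third byte of "ref0: lda $nn00,x" is the page byte
sta ref1+2
sta ref2+2
; ... one store per hardcoded reference (about 64 of them)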
Looking at the data
I delved into “what kind of data do I read and decode” late in the project, as I didn’t think it would help a lot at first. After all, the operations have to be done, don’t they? But…

Sometimes, the data makes you feel lucky. For example, there is a buffer-value adjustment taking place at every row start, and it requires 320 multiplications to scale the next_line buffer:
val = ...; /* some 16-bit integer */
for (i = 0; i < DATABUF_SIZE-1; i++) {
    next_line[i] = (val * next_line[i]) / 256;
}
When I printed that val [1], I noticed that it was 255 more often than any other value – actually, it is 255 more than 70% of the time over all my test pictures. This means that, in this case, I could rewrite the previous multiplication line:
(255 * next_line[i]) / 256
= (next_line[i]*256 - next_line[i]) / 256
= next_line[i] - next_line[i]/256
= next_line[i] - (next_line[i]>>8)
So the previous for loop could be special-cased:
if (val == 255) {
    for (i = 0; i < DATABUF_SIZE-1; i++) {
        next_line[i] -= next_line[i]>>8;
    }
} else {
    for (i = 0; i < DATABUF_SIZE-1; i++) {
        next_line[i] = (val * next_line[i]) / 256;
    }
}
Translated into assembly, this is much simpler than a multiplication (20-30 cycles instead of 150):
ldx i
lda next_line_low,x ; Low byte of next_line[x] in A,
sec
sbc next_line_high,x ; Subtract high byte from low :
; -(next_line[i]>>8)
bcs :+
dec next_line_high,x ; Decrement high byte if needed
: sta next_line_low,x
Approximating
One of the worst things in this algorithm is that final pixel values are computed from 16-bit numbers, divided by our dear factor, and then clamped between 0 and 255 (the 16-bit numbers are signed, to make things worse, so clamping is needed at both ends). For a 320×240 image, this means 153,600 divisions (~45 seconds) with a precise division method. I already explained that in the previous article, as it translates to a large net gain even on a modern architecture, but I’ll detail it a bit more here.
So we’re going to approximate here too, and replace the division method with the inverse multiplication/shifting method, getting division time down to 30 seconds.
But this is far from enough: as the division factor only changes every few rows, this calls for a division lookup table. For it to fit in memory, I only store the results for “round hexadecimal numbers”: 0x80/n, 0x180/n, 0x280/n, … 0xFF80/n. Yes, this approximates the approximation, and the resulting image, of course, is different, and some pixels have their value offset by… 1, which is quite invisible.

The result of the clamping is also stored in that table, so 1) no check is needed at the time of finalizing a pixel and 2) as soon as the division’s result high byte is not zero when building the table, you can stop dividing. The table is built like this:
lda #0
sta wordcnt ; persist counter, division
; will use all registers
: ldx wordcnt ; load dividend high byte
lda #$80 ; and its low byte ($80, middle of the range)
fact: ldy #$FF ; load divisor (patched in)
jsr approx_div16x8 ; divide AX by Y
ldy wordcnt
bmi overflows_neg ; stop if the dividend is negative
; (its high byte has the high bit set)
sta dyndiv,y ; Store the low byte of the result
txa ; Check if high byte is 0,
bne overflows ; Stop if not: result > 255
inc wordcnt ; Increment the counter,
bne :- ; Go do the next division
jmp done
overflows: ; Fill the overflows with 255...
lda #$FF
: sta dyndiv,y
iny
bmi overflows_neg ; As long as they're positive,
bne :-
overflows_neg:
lda #$00 ; Then fill the underflows with 0
: sta dyndiv,y
iny
bne :-
done:
This takes about one second per picture, because now we’re only dividing ~5,000 times per image. Once the table is built, storing a pixel after its 16-bit value is computed, which used to be this:
; ... compute 16-bits value
ldy factor ; load factor
jsr approx_div16x8 ; divide
cpx #0 ; is result high byte 0 ?
beq store ; yes, we can store the result
bmi underflow ; is it negative ?
overflow:
lda #$FF ; no, so store 255
jmp store
underflow:
lda #$00 ; yes, so store 0
store:
ldy column ; Reload column, trashed by division
sta output_buffer,y ; Store value!
and took about 200 cycles… becomes this:
; ... compute 16-bits value
lda dyndiv,x
sta buffer,y
Which takes 8 cycles.
Looking at the data again
Another round of printf()s showed that the division factor is 48 more than 60% of the time. I devoted 256 extra bytes to a static div48 table, initialized once at startup and never changing, so that I only have to rebuild the current row’s division table when the factor is not 48 (and not the same as before). Dividing now takes 300 ms per picture, as fewer than 2,000 divisions take place.
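One way to wire this up, reusing the self-patching trick from earlier, could look like the sketch below (prev_factor, divref and build_dyndiv are made-up names: divref would label the lda dyndiv,x of the pixel store, and build_dyndiv would be the table-building code above wrapped as a routine):
set_div_table:
lda factor
cmp #48 ; the most common divisor...
beq use48 ; ...gets the static, precomputed table
cmp prev_factor ; same divisor as the last rebuild?
beq usedyn ; yes: dyndiv is still valid
sta prev_factor
sta fact+1 ; patch the new divisor into the table builder
jsr build_dyndiv ; and rebuild dyndiv
usedyn: lda #>dyndiv ; point the pixel store at dyndiv...
jmp setref
use48: lda #>div48 ; ...or at the static div48 table
setref: sta divref+2 ; patch the page byte of the lda dyndiv,x
rts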
Looking at the data again
I was unhappy with the Huffman decoder algorithm. It reads bits, one at a time, until it matches a valid Huffman code in the current Huffman tree, and returns the associated value. Some of the Huffman trees (which, weirdly, are static and not dependent on the picture data) are quite simple, others less so. A few trees have only two leaves (0 and 1), for which there is only one bit to read. A few have four (00, 01, 10, 11), two bits to read. A few have simple codes that are easy to match without counting the bits read (00, 01, 10, 110, 1110, etc.). But others have codes starting with a zero that force me to count the bits in order to do the matching (00, 010, 011, 100, 101: reading ‘10’ is not the same as reading ‘010’; ‘10’ is not a valid code, but ‘010’ ends up at the same offset in the code/value array). There are some places in the decoding where data is fetched from a given tree (always the same), others where it could fetch from multiple different trees.
The general decoder algorithm had to take all these cases into account. It also required patching in the tree to use before each read, every time (two patches: one for the code/value match array, one for the code/numbits array). The small trees still took a full page. Etc. I modified all of this, used a jump table instead of a subroutine, and made basically one decoder algorithm per tree, at the expense of readability. This took 7 more seconds out, and allowed me to stuff 6 pages of tables into a single one, sparing memory to allocate to the I/O buffer (meaning fewer reads of more data, which is also faster when reading from a floppy with seek times close to a second).
Finally, Huffman decoding matches a code to a value, which is necessary when we’re decoding data (in _decode_row) because we will use it, but useless when skipping data we don’t use (in _consume_extra). So, the decoders got specialized again, only detecting valid codes when skipping data, without returning the corresponding value.
Not checking things
In computing, it is now well-known that functions should have contracts, and should explicitly and cleanly fail when their contract is not respected. For example, checkUser(User a) should start with if (a == NULL) before anything else, handle that case cleanly, and that should be in the unit tests, too. Callers should check return values and act accordingly.

Here, we take our memory locations and values and we do the thing we’re supposed to do. If the data is invalid, at best we’ll get a weird image out, at worst we’ll crash hard. For example, dcraw’s algorithm checked that column was still > 0 when doing repeats:
for (rep = 0; rep < 8 && rep < nreps && col > 0; rep++) {
    col -= 2;
    ...
}
This is useless. Images are correctly encoded so that repeats don’t underflow col.

In conclusion
I think that’s about it on this algorithm, and I won’t be able to do much better. I may still have a bit of room left in the control Huffman decoders, but I don’t think I have any more big wins awaiting. Nevertheless, I’m kind of proud of the level of optimization I achieved on this algorithm, which was never intended to run on a potato chip incapable of multiplying in hardware, and not even able to shift more than one bit at a time. I hope the dive was interesting, and if you have ideas, I’m all ears!
[1] Stats the easy way: add printf("val %d\n", val) to the C code, then run > out.txt; grep ^val out.txt | sort | uniq -c | sort -n