Speed-coding for the 6502 – a simple example

Usually clocked at 1MHz, with no hardware support for multiplication or division, and limited support for bit shifting, it is often important to take a step back from an algorithm in order to make it do the same thing, hundreds of times faster.

Note: this article doesn’t describe any technological breakthrough or extremely clever new way of doing things. It’s just a real-life example using a little part of an algorithm I made, and where at first, I stopped at step 2 instead of going all the way to the last step.

In this example, we want to scale down a 256×192 bitmap down to 192×144, which means the scaling factor is 3/4.

We’re going to have to apply this factor to every one of the 192 lines, and for each line, for every one of the 256 pixels composing the line. This means we’ll have to apply our factor (192*256)+192 times, 49344 times in total.

We can also note that the numbers we want to scale are 8 bits, but at first, it looks like we’ll need to do 16 bits math, as 256*3 is greater than 256.

The first, naive way is to do scaled_value = value*3/4:

lda value       ; load the value
ldx #3          ; load the multiplicator
jsr mult_a_x    ; jump to the multiplication function
                ; which returns a 16-bit number in A and X
ldy #4          ; load the divisor
jsr div_ax_y    ; jump to the division function
sta scaled_value; we have our result!

Given that multiplication and division is made via software, and each one takes about 200 cycles (approximately), our image scaling will require about 20 million cycles, aka 20 seconds at 1MHz!

So let’s try to avoid multiplication and division. Notice our formula can also be written as scaled_value = (value*2 + value)/4, and as we know, multiplying by 2 is shifting left, and dividing by 4 is shifting right twice: scaled_value = (value<<1 + value) >> 2, which in assembly, translates to:

 lda value      ; load the value
 ldx #$00       ; init the high byte of the multiplication to zero
 stx tmp1_zp    ; store it in a variable, as the 6502 can't bit-
                ; shift the X or Y registers

 asl a          ; shift left, which sets the carry if the result
                ; is > 255.
 rol tmp1_zp    ; bring the carry into the high byte
                ; We have done *2

 adc value      ; Add the value to the temporary result
 bcc :+         ; Did we overflow again?
 inc tmp1_zp    ; if so, we need to increment the high byte.
:
                ; We now have value*3 in A and tmp1_zp. Let's 
                ; divide by four.

lsr tmp1_zp     ; Shift high byte right, leaving low bit in carry
ror a           ; Shift low byte taking carry in high bit
lsr tmp1_zp     ; and a second time.
ror a

We have made the same calculation using “only” 35 cycles at best/39 at worst. Taking an average of 37 cycles, we can now scale our image in a bit less than two seconds, a ten-fold improvement over the first algorithm. But we can do better!

Our formula can also be rewritten as scaled_value = value*1.5/2, that is, scaled_value = (value>>1 + value)>>1. It looks like it could lead to loss of precision, but it doesn’t. That will simplify greatly the previous algorithm, as we will be able to do 9-bits math (the carry being the 9th bit):

lda value      ; load our value
lsr a          ; divide it by 2 (sets the carry if low bit=1)
clc            ; clear the carry, we don't care about
               ; what the low bit was
adc value      ; add our value to its half (sets the carry if >255)
ror a          ; divide by two, bringing the carry to the high bit.

We’re now at a nice 12 cycles per operation! And we’re able to scale our image in only 0.6 seconds.

Should we stop there? We also know that “there is no faster computation than the one you already know the result of”, which, is 6502 assembly, translates to “provide lookup tables to spare computing to the computer”. It often comes trading size for speed: in this case, we’ll need a table of 256 values containing pre-calculated results of X*3/4.

I didn’t really want to dedicate a full memory page to that little part of a much larger program where memory constraints exist… But, this program already has a few buffers that are completely unused, and trashable, during this scaling operation. So, instead of providing a dedicated, calculated at build time table, I will just generate it before starting scaling, applying the formula only once for every possible value in the 0-255 range:

 ldy #$00       ; start at 0

:
 tya            ; transfer to A
 ....           ; the previous 12-cycles *1.5/2 algorithm goes here
 sta table,y    ; store the result
 iny
 bne :-         ; and loop 256 times.

Our table is now built at runtime, which has a cost of less than 6000 cycles, and scaling down a pixel is now just a matter of lda table,y: 4 cycles. We just went from needing 10 seconds to scale an image, to 0.2 seconds.