@DATABASE HowToCode7
@NODE MAIN "HowToCode: Optimising"
@toc howtocode:howtocode/MAIN

                        Optimising your Code
                        --------------------

Everyone wants their code to run as fast as possible, so here
are some speed-up tricks for you:

  1 @{" 68000 Optimisation " link 68000}
  2 @{" 68020 Optimisation " link howtocode:text/680x0issues/chrisgreen}
  3 @{" Blitter Speed Optimisation " link howtocode:text/blitter/blitterspeed}
  4 @{" General Speed-Up Notes " link general}
@endnode
@node 68000 "68000 Optimisation"

68000 optimisation
------------------

Written by Irmen de Jong, march '93. (E-mail: ijdjong@cs.vu.nl)
Some notes added by CJ

NOTE! Not all these optimisations can be automatically applied. Make
sure they will not affect other areas in your code!

-----------------------------------------------------------------------------
Original    Possible optimisation   Examples/notes
-----------------------------------------------------------------------------
STANDARD WELL-KNOWN optimisATIONS
RULE: use Quick-type/Short branch! Use INLINE subroutines if they are small!
-----------------------------------------------------------------------------

BRA/BSR xx     BRA.s/BSR.s xx       if xx is close to PC

MOVE.X #0      CLR.X/MOVEQ/SUBA.X   move.l #0,count -> clr.l count
   move.l #0,d0    -> moveq #0,d0
   move.l #0,a0    -> sub.l a0,a0

CLR.L Dx       MOVEQ #0,Dx

CMP #0         TST

MOVE.L #nn,dx  MOVEQ #nn,dx         possible if -128<=nn<=127

ADD.X #nn      ADDQ.X #nn           possible if 1<=nn<=8
SUB.X #nn      SUBQ.X #nn           same...

JMP/JSR xx     BRA/BSR xx           possible if xx is close to PC

JSR xx;RTS     JMP xx               save a RTS
BSR xx;RTS     BRA xx               same...
                                    (assuming routine doesn't rely
                                    on anything in the stack)

LSL/ASL #1/2,xx ADD xx,xx [ADD xx,xx] lsl #2,d0 -> 2 times add d0,d0

MULU #yy,xx where yy is a power of 2, 2..256
   LSL/ASL #1-8,xx      mulu #2,d0 -> asl #1,d0 -> add d0,d0
   BEWARE: STATUS FLAGS ARE "WRONG"

DIVU #yy,xx where yy is a power of 2, 2..256
   LSR/ASR #.. SWAP  divu #16,d0 -> lsr #4,d0
            BEWARE: STATUS FLAGS ARE "WRONG",
            AND HIGHWORD IS NOT THE REMAINDER.

ADDRESS-RELATED OPTIMISATIONS
RULE: use short adressing/quick adds!
----------------------------------------------------------------------------

MOVEA.L #nn    MOVEA.W #nn    Movea is "sign-extending" thus
                              possible if 0<=nn<=$7fff

ADDA.X #nn     LEA nn()       adda.l #800,a0 -> lea 800(a0),a0
                              possible if -$8000<=nn<=$7fff

LEA nn()       ADDQ.W #nn     lea 6(a0),a0 -> addq.w #6,a0
                              possible if 1<=nn<=8

$0000nnnn.l    $nnnn.w        move.l   4,a6 -> move.l 4.w,a6
                              possible if 0<=nnnn<=$7fff
                              (nnnn is SIGN EXTENDED to LONG!)

MOVE.L #xx,Ay  LEA xx,Ay      try xx(PC) with the LEA

MOVE.L Ax,Ay;
ADD #nnnn,Ay   LEA nnnn(Ax),Ay   copy&add in one

OFFSET-RELATED OPTIMISATIONS
RULE: use PC-relative addressing or basereg addressing!
      put your code&data in ONE segment if possible!
----------------------------------------------------------------------------
MOVE.X nnnn    MOVE.X nnnn(pc)      lea copper,a0 -> lea copper(pc),a0..
LEA nnnn       LEA nnnn(pc)      ...possible if nnnn is close to PC

(Ax,Dx.l)      (Ax,Dx.w)            possible if 0<=Dx<=$7fff

If PC-relative doesn't work, use Ax as a pointer to your data block.
Use indirect addressing to get to your data: move.l Data1-Base(Ax),Dx etc.

TRICKY OPTIMISATIONS
----------------------------------------------------------------------------
BSET #xx,yy    ORI.W #2^xx,yy       0<=xx<=15
BCLR #xx,yy    ANDI.W #~(2^xx),yy      "
BCHG #xx,yy    EORI.W #2^xx,yy         "
BTST #xx,yy    ANDI.W #2^xx,yy         "
               Best improvement if yy=a data reg.
               BEWARE: STATUS FLAGS ARE "WRONG".

SILLY OPTIMISATIONS (FOR OPTIMISING COMPILER OUTPUTS ETC)
----------------------------------------------------------------------------
MOVEM (one reg.)  MOVE.l         movem  d0,-(sp) -> move.l d0,-(sp)

MOVE xx,-(sp)     PEA xx         possible if xx=(Ax) or constant.

0(Ax)             (Ax)

MULU/MULS #0      CLR.L          moveq #0,Dx with data-registers.

MULU #1,xx        SWAP CLR SWAP  high word is cleared with mulu #1
MULS #1,xx        SWAP CLR SWAP EXT.L  see MULU, and sign exteded.
                                    BEWARE: STATUS FLAGS ARE "WRONG"

LOOP OPTIMISATION.
----------------------------------------------------------------------------
Example: imagine you want to eor 4096 bytes beginning at (a0).
Solution one:

      move.w   #4096-1,d7
.1    eori.b   d0,(a0)+
      dbra     d7,.1

Consider the loop from above. 4096 times a eor.b and a dbra takes time.
What do you think about this:

      move.w   #4096/4-1,d7
.1    eor.l    d0,(a0)+    ; d0 contains byte repeated 4 times
      dbra     d7,.1

Eors 4096 bytes too! But only needs 1024 eor.l/dbras.
Yeah, I hear you smart guys cry: what about 1024 eor.l without any loop?!
Right, that IS the fastest solution, but is VERY memory consuming (2 Kb).
Instead, join a loop and a few eor.l:

      move     #4096/4/4-1,d7
.1    eor.l    d0,(a0)+
      eor.l    d0,(a0)+
      eor.l    d0,(a0)+
      eor.l    d0,(a0)+
      dbra     d7,.1

This is faster than the loop before. I think about 8 or 16 eor.l's is just
fine, depending on the size of the mem to be handled (and the wanted
speed!). Also, mind the cache on 68020+ processors, the loop code must be
small enough to fit in it for highest speeds.
Try to do as much as possible within one loop (but considering the text
above) instead of a few loops after each other.

MEMORY CLEARING/FILLING.
----------------------------------------------------------------------------
A common problem is how to clear or fill some memory in a short time.
If it is CHIP-MEMORY, use the blitter (only D-channel, see below). In this
case you can still do other things with your 680x0 while the blitter is busy
erasing. If it is FAST-MEMORY, you can use the method from above, with
clr.l instead of eor.l, but there is a much faster way:

   move.l   sp,TempSp
   lea      MemEnd,sp
   moveq    #0,d0
;...for all 7 data regs...
   moveq    #0,d7
   move.l   d0,a0
;...for 6 address regs...
   move.l   d0,a6

After this, ONE instruction can clear 60 bytes of memory (15*4):

   movem.l  d0-d7/a0-a6,-(sp)    ;wham!

Now, repeat this instruction as often as required to erase the memory.
(memsize/60 times). You may need an additional movem.l to erase the last
few bytes. Get sp(=a7) back at the end with (guess..):

   move.l   TempSp,sp

If you are low on mem, put a few movem.l in a loop. But, now you need a
loop-counter register, so you'll only clear 56 bytes in one movem.l.

In the case of CHIP memory, you can use both the blitter and the processor
simultaneously to clear much CHIP mem in a VERY short time...
It takes some experimentation to find the best sizes to clear with the
blitter and with the processor.

BUT, ALWAYS USE A @{" WaitBlit() " link howtocode:text/blitter/waitblit} AFTER CLEARING SIMULTANEOUSLY, even if you
think you know that the blitter is finished before your processor is.

@endnode


@node general "General Speed-Up Notes"

- When optimising programs first try to find the time-critical parts (inner
  loops, interrupt code, often called procedures etc.) In most cases 10%
  of the code is responsible for 90% of the execution time. Don't waste time
  doing needless optimising on startup and exit code when it's only
  called once!

- Often it is better not to set BLTPRI in DMACON (#10 in $dff09a) as this
  can keep your processor from calculating things while the blitter is busy.

- Use as many registers as possible! Store values in registers rather
  than in memory, it's much faster!

- DON'T put your parameters on the stack before calling a routine! Instead,
  put them in registers!

- If you have enough memory, try to remove as many MULU/S and DIVU/S
  as possible by pre-calculating a multiplication or division table, and
  reading values from it, or rewrite multiply/divide code with simpler
  instructions if possible (eg ADD, LSR, etc.)

@endnode
