Optimizing your Code
--------------------

Everyone wants their code to run as fast as possible, so here are some
speed-up tricks for you. (There are also some 68020 and A1200 specific
speed-up tricks listed in 680x0issues.txt.)

68000 Optimization
------------------
Written by Irmen de Jong, March '93. (E-mail: ijdjong@cs.vu.nl)
Some notes added by CJ

-----------------------------------------------------------------------------
Original          Possible optimization   Examples/notes
-----------------------------------------------------------------------------
STANDARD WELL-KNOWN OPTIMIZATIONS
RULE: use Quick-type/Short branch! Use INLINE subroutines if they are small!
-----------------------------------------------------------------------------
BRA/BSR xx        BRA.s/BSR.s xx          if xx is close to PC
MOVE.X #0         CLR.X/MOVEQ/SUBA.X      move.l #0,count -> clr.l count
                                          move.l #0,d0 -> moveq #0,d0
                                          move.l #0,a0 -> sub.l a0,a0
CLR.L Dx          MOVEQ #0,Dx             -
CMP #0            TST                     -
MOVE.L #nn,dx     MOVEQ #nn,dx            possible if -128<=nn<=127
ADD.X #nn         ADDQ.X #nn              possible if 1<=nn<=8
SUB.X #nn         SUBQ.X #nn              same...
JMP/JSR xx        BRA/BSR xx              possible if xx is close to PC
                                          *and in same section!*
                                          (what's the use of JMP/JSR nn(PC)?)
JSR xx;RTS        JMP xx                  save a RTS
BSR xx;RTS        BRA xx                  same... (assuming routine doesn't
                                          rely on anything in the stack)
LSL/ASL #1/2,xx   ADD xx,xx [ADD xx,xx]   lsl #2,d0 -> 2 times add d0,d0
MULU #yy,xx       LSL/ASL #1-8,xx         mulu #2,d0 -> asl #1,d0 -> add d0,d0
 where yy is a power of 2, 2..256         BEWARE: STATUS FLAGS ARE "WRONG"
DIVU #yy,xx       LSR/ASR #.. SWAP        divu #16,d0 -> lsr #4,d0
 where yy is a power of 2, 2..256         BEWARE: STATUS FLAGS ARE "WRONG",
                                          AND HIGHWORD IS NOT THE REMAINDER.

ADDRESS-RELATED OPTIMIZATIONS
RULE: use short addressing/quick adds!
----------------------------------------------------------------------------
MOVEA.L #nn       MOVEA.W #nn             Movea is "sign-extending" thus
                                          possible if 0<=nn<=$7fff
ADDA.X #nn        LEA nn(Ax),Ax           adda.l #800,a0 -> lea 800(a0),a0
                                          possible if -$8000<=nn<=$7fff
LEA nn(Ax),Ax     ADDQ.W #nn              lea 6(a0),a0 -> addq.w #6,a0
                                          possible if 1<=nn<=8
$0000nnnn.l       $nnnn.w                 move.l 4,a6 -> move.l 4.w,a6
                                          possible if 0<=nnnn<=$7fff
                                          (nnnn is SIGN EXTENDED to LONG!)
MOVE.L #xx,Ay     LEA xx,Ay               try xx(PC) with the LEA
MOVE.L Ax,Ay;     LEA nnnn(Ax),Ay         copy&add in one
 ADD #nnnn,Ay

OFFSET-RELATED OPTIMIZATIONS
RULE: use PC-relative addressing or basereg addressing!
      put your code&data in ONE segment if possible!
----------------------------------------------------------------------------
MOVE.X nnnn       MOVE.X nnnn(pc)         lea copper,a0 -> lea copper(pc),a0..
LEA nnnn          LEA nnnn(pc)            ...possible if nnnn is close to PC
(Ax,Dx.l)         (Ax,Dx.w)               possible if 0<=Dx<=$7fff

If PC-relative doesn't work, use Ax as a pointer to your data block.
Use indirect addressing to get to your data: move.l Data1-Base(Ax),Dx etc.

TRICKY OPTIMIZATIONS
----------------------------------------------------------------------------
BSET #xx,yy       ORI.W #2^xx,yy          0<=xx<=15
BCLR #xx,yy       ANDI.W #~(2^xx),yy      "
BCHG #xx,yy       EORI.W #2^xx,yy         "
BTST #xx,yy       ANDI.W #2^xx,yy         " (ANDI alters yy, BTST doesn't)
Best improvement if yy=a data reg. BEWARE: STATUS FLAGS ARE "WRONG".

SILLY OPTIMIZATIONS (FOR OPTIMIZING COMPILER OUTPUTS ETC)
RULE: make the routines in assembly yourself!
----------------------------------------------------------------------------
MOVEM (one reg.)  MOVE.l                  movem d0,-(sp) -> move.l d0,-(sp)
MOVE xx,-(sp)     PEA xx                  possible if xx=(Ax) or constant.
0(Ax)             (Ax)                    -
MULU/MULS #0      CLR.L                   moveq #0,Dx with data-registers.
MULU #1,xx        SWAP CLR SWAP           high word is cleared with mulu #1
MULS #1,xx        SWAP CLR SWAP EXT.L     see MULU, and sign extended.
BEWARE: STATUS FLAGS ARE "WRONG"

LOOP OPTIMIZATION.
----------------------------------------------------------------------------
Example: imagine you want to eor 4096 bytes beginning at (a0).
Solution one:

        move    #4096-1,d7
.1      eor.b   d0,(a0)+
        dbra    d7,.1

Consider the loop from above. 4096 times a eor.b and a dbra takes time.
What do you think about this:

        move    #4096/4-1,d7
.1      eor.l   d0,(a0)+
        dbra    d7,.1

Eors 4096 bytes too! But only needs 1024 eor.l/dbras.
Yeah, I hear you smart guys cry: what about 1024 eor.l without any loop?!
Right, that IS the fastest solution, but is VERY memory consuming (2 Kb).
Instead, join a loop and a few eor.l:

        move    #4096/4/4-1,d7
.1      eor.l   d0,(a0)+
        eor.l   d0,(a0)+
        eor.l   d0,(a0)+
        eor.l   d0,(a0)+
        dbra    d7,.1

This is faster than the loop before. I think about 8 or 16 eor.l's is just
fine, depending on the size of the mem to be handled (and the wanted speed!).
Also, mind the cache on 68020+ processors, the loop code must be small
enough to fit in it for highest speeds.
Try to do as much as possible within one loop (but considering the text
above) instead of a few loops after each other.

MEMORY CLEARING/FILLING.
----------------------------------------------------------------------------
A common problem is how to clear or fill some mem in a short time.
If it is CHIP-MEMORY, use the blitter (only D-channel, see below). In this
case you can still do other things with yer 680x0 while blittie-boy is busy
erasing. If it is FAST-MEMORY, you can use the method from above, with
clr.l instead of eor.l, but there is a much faster way:

        move.l  sp,TempSp
        lea     MemEnd,sp
        moveq   #0,d0
        ...and so on for all 8 data regs (d0-d7)...
        moveq   #0,d7
        move.l  d0,a0
        ...and so on for all 7 address regs (a0-a6)...
        move.l  d0,a6

After this, ONE instruction can clear 60 bytes of memory (15*4):

        movem.l d0-d7/a0-a6,-(sp)       ;wham!

Now, repeat this instruction as often as required to erase the memory
(memsize/60 times). You may need an additional movem.l to erase the last
few bytes.
Get sp(=a7) back at the end with (guess..):

        move.l  TempSp,sp

If you are low on mem, put a few movem.l in a loop. But, now you need a
loop-counter register, so you'll only clear 56 bytes in one movem.l.

In the case of CHIP memory, you can use both the blitter and the processor
simultaneously to clear much CHIP mem in a VERY short time...
It takes some experimentation to find the best sizes to clear with the
blitter and with the processor.
BUT, ALWAYS USE A WAITBLIT AFTER CLEARING SIMULTANEOUSLY, even if you know
that the blitter is finished before your processor is (mind 680x0's)

BLITTER SPEEDS. (from the Hardware Reference Manual)
----------------------------------------------------------------------------
Some general notes about blitter speeds. These numbers are for an OCS/ECS
blitter only, in 16-bit chip ram (who knows the AGA blitter speed???)

                  n * H * W
    time taken = -----------      time is in microseconds.
                    7.09          H=blitheight, W=blitwidth(#words), n=cycles
               (7.15 for NTSC)

n=4+....depends on # DMA-channels used
    A:       +0 (this one is free!)
    B:       +2
    C or D:  +0
    C and D: +2
In line-mode, every pixel takes 8 cycles.

So, use A,D,A&D for the fastest operation.
Use A&C for 2-source operations (e.g. collision check or so).

NOTES (FURTHER NOTES MAY BE ADDED IN FUTURE...)
----------------------------------------------------------------------------
- 68020+ processors are particularly fast at using longwords. Byte access
  is some sort of brake on the memory access. Use at least words.
- 68010 has a loop-cache, it caches 3 word loops like

loop    move.l  (a0)+,(a1)+
        dbra    d7,loop

- When optimizing BIG programs (for instance, compiler outputs...) first
  try to find the time-critical parts (inner loops, often called procs etc.)
  In most cases 10% of the code is responsible for 90% of the execution time.

  I see people using OldOpenLibrary() because it needs one less register
  set up.. I mean, what's the point? Are people really going to notice if
  your demo takes two clock cycles less before starting?
  :-)

- Often it is better not to set BLTPRI in DMACON (bit 10 in $dff096) as this
  can keep your processor from calculating things while the blitter is busy.
- Use as many registers as possible! I.e. store values in registers rather
  than in memory, this gives one hell of a performance boost. (NOTE: just
  this is the power of RISC machines. Very much register access instead of
  memory access. Fill these 16 registers!)
- Related to the last one: unlike many compilers, DON'T put your parameters
  on stack before calling a sub! Instead, put them in well defined registers!
- In case you have enough memory, try to remove as many MULU/S and DIVU/S as
  possible by pre-calculating a multiplication or division table, and
  reading values from it, rather than each time MULU #10 or so.

  * Beware on A1200 though, read Chris's section on 68020 optimization.

More 680x0 Optimisations
------------------------

The 68020-40 (bd.w,an) addressmode can be optimized to x(an).
Saves 1 word and some cycles.

|------------------------|--------------------|
| Addressmode            | Optimizing         |
|------------------------|--------------------|
|------------------------|--------------------|
| move.l (1000.w,an),dn  | move.l 1000(an),dn |
|------------------------|--------------------|

The 68020-40 (bd.w,pc) addressmode can be optimized to bd.w(pc).
Saves 1 word and some cycles.

|------------------------|--------------------|
| Addressmode            | Optimizing         |
|------------------------|--------------------|
|------------------------|--------------------|
| move.l (1000.w,pc),dn  | move.l 1000(pc),dn |
|------------------------|--------------------|

The 68020-40 (bd.w) addressmode can be optimized to bd.w.
Saves 1 word and some cycles.

|------------------------|--------------------|
| Addressmode            | Optimizing         |
|------------------------|--------------------|
|------------------------|--------------------|
| move.l (bd.w),dn       | move.l bd.w,dn     |
|------------------------|--------------------|

The 68020-40 (bd.l) addressmode can be optimized to bd.l.
Saves 1 word and some cycles.

|------------------------|--------------------|
| Addressmode            | Optimizing         |
|------------------------|--------------------|
|------------------------|--------------------|
| move.l (bd.l),dn       | move.l bd.l,dn     |
|------------------------|--------------------|

The 68020-40 addressmode (an) can be optimized to the 68000 addressmode (an).
(an) can be interpreted as a sub type of the address mode (bd.w,an,xn) and
this is a 68020-40 addressmode. But (an) is a well known 68000 addressmode,
so you should ALWAYS turn optimizing on.

|------------------------|--------------------|
| Addressmode            | Optimizing         |
|------------------------|--------------------|
|------------------------|--------------------|
| move.l (an),dn         | move.l (an),dn     |
|------------------------|--------------------|

The 68020-40 addressmode (pc) can be optimized to the 68000 addressmode (pc).
(pc) can be interpreted as a sub type of the address mode (bd.w,pc,xn) and
this is a 68020-40 addressmode. But (pc) is a well known 68000 addressmode,
so you should ALWAYS turn optimizing on.

|------------------------|--------------------|
| Addressmode            | Optimizing         |
|------------------------|--------------------|
|------------------------|--------------------|
| move.l (pc),dn         | move.l (pc),dn     |
|------------------------|--------------------|

|---------------|----------------|---------------------|
| Addressmode   | Optimizing     | Note                |
|---------------|----------------|---------------------|
|---------------|----------------|---------------------|
| x.l,EA        | x.w,EA         | $ffff8000<=x<=$7fff |
|---------------|----------------|---------------------|
| EA,x.l        | EA,x.w         | $ffff8000<=x<=$7fff |
|---------------|----------------|---------------------|

|---------------|----------------|---------------------|
| Addressmode   | Optimizing     | Note                |
|---------------|----------------|---------------------|
|---------------|----------------|---------------------|
| x(an),EA      | (an),EA        | x=0                 |
|---------------|----------------|---------------------|
| EA,x(an)      | EA,(an)        | x=0                 |
|---------------|----------------|---------------------|

|---------------|----------------|---------------------|
| Addressmode   | Optimizing     | Note                |
|---------------|----------------|---------------------|
|---------------|----------------|---------------------|
| label,EA      | label(pc),EA   | $ffff8000<=dx<=$7fff|
|---------------|----------------|---------------------|

A4 Smalldata mode

|---------------|----------------|---------------------|
| Addressmode   | Optimizing     | Note                |
|---------------|----------------|---------------------|
|---------------|----------------|---------------------|
| label,EA      | x(a4),EA       | $ffff8000<=x<=$7fff |
|---------------|----------------|---------------------|
| EA,label      | EA,x(a4)       | $ffff8000<=x<=$7fff |
|---------------|----------------|---------------------|

Move Optimizing

|---------------|----------------|-------------------------|
| Addressmode   | Optimizing     | Note                    |
|---------------|----------------|-------------------------|
|---------------|----------------|-------------------------|
| move.l #x,dn  | moveq #x,dn    | $ffffff80<=x<=$7f       |
|---------------|----------------|-------------------------|
| move.? #0,an  | suba.l an,an   | ? = w or l              |
|---------------|----------------|-------------------------|
| move.l #x,dn  | moveq #y,dn    | $10000<=x<=$7f0000      |
|               | swap dn        |                         |
|---------------|----------------|-------------------------|
| move.l #x,dn  | moveq #y,dn    | $ff80ffff<=x<=$fffeffff |
|               | swap dn        |                         |
|---------------|----------------|-------------------------|
| move.l #x,dn  | moveq #y,dn    | $81<=x<=$ff             |
|               | neg.b dn       |                         |
|---------------|----------------|-------------------------|
| move.l #x,dn  | moveq #y,dn    | $ff81<=x<=$ffff         |
|               | neg.w dn       |                         |
|---------------|----------------|-------------------------|
| move.l #x,dn  | moveq #y,dn    | $ffff0001<=x<=$ffff0080 |
|               | neg.w dn       |                         |
|---------------|----------------|-------------------------|
| move.? #0,EA  | clr.? EA       | ? = w or l. See         |
|               |                | Trashreg optimizing     |
|---------------|----------------|-------------------------|
| move.b #$ff,EA| st EA          |                         |
|---------------|----------------|-------------------------|

Clr Optimizing

|---------------|----------------|-------------------------|
| Addressmode   | Optimizing     | Note                    |
|---------------|----------------|-------------------------|
|---------------|----------------|-------------------------|
| clr.l dn      | moveq #0,dn    |                         |
|---------------|----------------|-------------------------|

Add Optimizing

|---------------|----------------|-------------------------|
| Addressmode   | Optimizing     | Note                    |
|---------------|----------------|-------------------------|
|---------------|----------------|-------------------------|
| add.? #x,EA   | addq.? #x,EA   | 1<=x<=8                 |
|---------------|----------------|-------------------------|
| add.? #x,EA   | subq.? #-x,EA  | -8<=x<=-1               |
|---------------|----------------|-------------------------|
| add.? #x,an   | lea.l x(an),an | $ffff8000<=x<=$7fff     |
|---------------|----------------|-------------------------|

Sub Optimizing

|---------------|----------------|-------------------------|
| Addressmode   | Optimizing     | Note                    |
|---------------|----------------|-------------------------|
|---------------|----------------|-------------------------|
| sub.? #x,EA   | subq.? #x,EA   | 1<=x<=8                 |
|---------------|----------------|-------------------------|
| sub.? #x,EA   | addq.? #-x,EA  | -8<=x<=-1               |
|---------------|----------------|-------------------------|
| sub.? #x,an   |lea.l -x(an),an | $ffff8000<=x<=$7fff     |
|---------------|----------------|-------------------------|

Lea Optimizing

|---------------|----------------|-------------------------|
| Addressmode   | Optimizing     | Note                    |
|---------------|----------------|-------------------------|
|---------------|----------------|-------------------------|
| lea x(an),an  | addq.w #x,an   | 1<=x<=8                 |
|---------------|----------------|-------------------------|
| lea x(an),an  | subq.w #-x,an  | -8<=x<=-1               |
|---------------|----------------|-------------------------|

Cmp Optimizing

|---------------|----------------|-------------------------|
| Addressmode   | Optimizing     | Note                    |
|---------------|----------------|-------------------------|
|---------------|----------------|-------------------------|
| cmp.? #0,EA   | tst.? EA       |                         |
|---------------|----------------|-------------------------|

Bcc Optimizing

|---------------|----------------|-------------------------|
| Addressmode   | Optimizing     | Note                    |
|---------------|----------------|-------------------------|
|---------------|----------------|-------------------------|
| Bcc.l label   | Bcc.w label    | -$8000<=label<=$7fff    |
|---------------|----------------|-------------------------|
| Bcc.l label   | Bcc.s label    | -$80<=label<=$7f        |
|---------------|----------------|-------------------------|
| Bcc.w label   | Bcc.s label    | -$80<=label<=$7f        |
|---------------|----------------|-------------------------|

Jsr Optimizing

|---------------|----------------|-------------------------|
| Addressmode   | Optimizing     | Note                    |
|---------------|----------------|-------------------------|
|---------------|----------------|-------------------------|
| jsr label     | bsr.w label    | -$8000<=label<=$7fff    |
|---------------|----------------|-------------------------|
| jsr label     | bsr.s label    | -$80<=label<=$7f        |
|---------------|----------------|-------------------------|

Jmp Optimizing

|---------------|----------------|-------------------------|
| Addressmode   | Optimizing     | Note                    |
|---------------|----------------|-------------------------|
|---------------|----------------|-------------------------|
| jmp label     | bra.w label    | -$8000<=label<=$7fff    |
|---------------|----------------|-------------------------|
| jmp label     | bra.s label    | -$80<=label<=$7f        |
|---------------|----------------|-------------------------|

Asl Optimizing

|---------------|----------------|-------------------------|
| Addressmode   | Optimizing     | Note                    |
|---------------|----------------|-------------------------|
|---------------|----------------|-------------------------|
| asl.? #1,dn   | add.? dn,dn    |                         |
|---------------|----------------|-------------------------|

Mulu Optimizing

|---------------|----------------|-------------------------|
| Addressmode   | Optimizing     | Note                    |
|---------------|----------------|-------------------------|
|---------------|----------------|-------------------------|
| mulu.w #x,dn  | swap dn        | x=2^y                   |
|               | clr.w dn       | y=y1+y2                 |
|               | swap dn        | y=1, add.l dn,dn        |
|               | lsl.l #y1,dn   |                         |
|               | lsl.l #y2,dn   |                         |
|---------------|----------------|-------------------------|
| mulu.l #x,dn  | lsl.l #y1,dn   | x=2^y                   |
|               | lsl.l #y2,dn   | y=y1+y2                 |
|               |                | y >= 16                 |
|               |                | swap dn ,y-16           |
|---------------|----------------|-------------------------|

Muls Optimizing

|---------------|----------------|-------------------------|
| Addressmode   | Optimizing     | Note                    |
|---------------|----------------|-------------------------|
|---------------|----------------|-------------------------|
| muls.w #x,dn  | swap dn        | x=2^y                   |
|               | clr.w dn       | y=y1+y2                 |
|               | swap dn        | y=1, add.l dn,dn        |
|               | asl.l #y1,dn   |                         |
|               | asl.l #y2,dn   |                         |
|---------------|----------------|-------------------------|
| muls.l #x,dn  | asl.l #y1,dn   | x=2^y                   |
|               | asl.l #y2,dn   | y=y1+y2                 |
|               |                | y >= 16                 |
|               |                | swap dn ,y-16           |
|---------------|----------------|-------------------------|

Register Optimizing
-------------------

|---------------|--------------------|-------------------------|
| Addressmode   | Optimizing         | Note                    |
|---------------|--------------------|-------------------------|
|---------------|--------------------|-------------------------|
|move.? EA,label| lea.l label(pc),an | -$8000<=label<=$7fff    |
|               | move.? EA,(an)     |                         |
|---------------|--------------------|-------------------------|
| tst.? label   | lea.l label(pc),an | -$8000<=label<=$7fff    |
|               | tst.? (an)         |                         |
|---------------|--------------------|-------------------------|
| not.? label   | lea.l label(pc),an | -$8000<=label<=$7fff    |
|               | not.? (an)         |                         |
|---------------|--------------------|-------------------------|
| neg.? label   | lea.l label(pc),an | -$8000<=label<=$7fff    |
|               | neg.? (an)         |                         |
|---------------|--------------------|-------------------------|
| negx.? label  | lea.l label(pc),an | -$8000<=label<=$7fff    |
|               | negx.? (an)        |                         |
|---------------|--------------------|-------------------------|
| nbcd label    | lea.l label(pc),an | -$8000<=label<=$7fff    |
|               | nbcd (an)          |                         |
|---------------|--------------------|-------------------------|
| scc label     | lea.l label(pc),an | -$8000<=label<=$7fff    |
|               | scc (an)           |                         |
|---------------|--------------------|-------------------------|

|---------------|--------------------|-------------------------|
| Addressmode   | Optimizing         | Note                    |
|---------------|--------------------|-------------------------|
|---------------|--------------------|-------------------------|
| move.l #x,EA  | moveq #x,dn        | $ffffff80<=x<=$7f       |
|               | move.l dn,EA       |                         |
|---------------|--------------------|-------------------------|
| ori.l #x,EA   | moveq #x,dn        | $ffffff80<=x<=$7f       |
|               | or.l dn,EA         |                         |
|---------------|--------------------|-------------------------|
| eori.l #x,EA  | moveq #x,dn        | $ffffff80<=x<=$7f       |
|               | eor.l dn,EA        |                         |
|---------------|--------------------|-------------------------|
| andi.l #x,EA  | moveq #x,dn        | $ffffff80<=x<=$7f       |
|               | and.l dn,EA        |                         |
|---------------|--------------------|-------------------------|
| addi.l #x,EA  | moveq #x,dn        | $ffffff80<=x<=$7f       |
|               | add.l dn,EA        |                         |
|---------------|--------------------|-------------------------|
| subi.l #x,EA  | moveq #x,dn        | $ffffff80<=x<=$7f       |
|               | sub.l dn,EA        |                         |
|---------------|--------------------|-------------------------|
| cmpi.l #x,EA  | moveq #x,dn        | $ffffff80<=x<=$7f       |
|               | cmp.l EA,dn        | flags are for x-EA, so  |
|               |                    | swap lt/gt conditions   |
|---------------|--------------------|-------------------------|
| move.? #0,EA  | moveq #0,dn        | Time optimizing         |
|               | move.? dn,EA       | must be on              |
|---------------|--------------------|-------------------------|
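
To make the register ("trashreg") and move optimizing tables a bit more
concrete, here is a small sketch in the same style as the listings earlier
on. It is only an illustration: d0/d1/a0 are assumed to be free scratch
registers at this point, and the names Example and Flags are made up for
this example. As with the other tricks, the status flags after the
replacement are not always those of the original instruction.

```asm
; Sketch only. Assumes d0, d1 and a0 are free to trash here.
; Example and Flags are hypothetical labels, not from the tables above.

Example:
        lea     Flags(pc),a0    ; PC-relative, Flags must be within 32K
        moveq   #$7f,d1         ; constant fits moveq (-128..127)
        and.l   d1,(a0)         ; instead of andi.l #$7f,Flags

        moveq   #$12,d0         ; d0 = $00000012
        swap    d0              ; d0 = $00120000
                                ; instead of move.l #$120000,d0

        moveq   #$10,d0         ; d0 = $00000010
        neg.w   d0              ; d0 = $0000fff0 (high word untouched)
                                ; instead of move.l #$fff0,d0
        rts

Flags:  dc.l    $12345678       ; some data in the same section
```

The moveq/swap and moveq/neg.w pairs only pay off when the wanted constant
falls in one of the ranges listed under Move Optimizing, so check the range
first (an optimizing assembler does this for you).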