P-mode Size Optimization

Introduction

Here are a few tricks and code snippets which will help make your p-mode code that little bit more compact. No doubt you've choosen p-mode because you want to use vast amounts of memory and/or lots of nice 32-bit instructions. At first this may seem like a brilliant idea, no need for lots of segment juggling or segment overrides.. but now you've got the problem of trying to squeeze all those 32-bit immediate values and 32-bit addresses into your nice, new 4-Kb intro...

To FLAT or not to FLAT?

Okay, you've probably all heard of "FLAT-mode" (or "FLAT memory-model") many times before and no doubt have used it for lots of things, but is FLAT mode the best solution for small intros?

Sure, all memory references start from linear-address 00000000 hex, so there is no need to perform the various calculations to convert 32-bit FLAT memory pointers into 32-bit relative offsets for your CODE and DATA segments. E.g.:

```     DATA segment base address = 01234567 hex
FLAT memory address       = 89ABCDEF hex

Relative offset           = FLAT memory address - Data segment base ```

The simple conversion from FLAT --> 32-bit offset for the above example gives:

```     Relative offset           = 89ABCDEF hex  -  01234567 hex

= 88888888 hex ```

But this only needs to be done for every reference outside your main CODE/DATA/STACK segment (such as allocated or system memory).

Hmmm.. Sounds like more hard work..

Wot, no relocation info?

The main advantage of using a segmented memory-model (where the CODE, DATA, STACK segments which do NOT start at 00000000 hex) is that you don't need to store all the relocation information. The relocation information is needed for FLAT mode to translate each 32-bit address into it's actual memory load address. This relocation is needed because your program can not know where it will be loaded in memory (the Operating-System will simply allocate the required amount of memory, SOMEWHERE in memory depending on your PC's configuration and amount of memory already used).

```address.     opcode                  mnemonics
ÄÄÄÄÄÄÄÄÄ    ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ    ÄÄÄÄÄÄÄÄÄ
000010000    A1 xx xx xx xx          mov     eax, [xxxxxxxx]
000010005    90                      nop
000010006    0F B6 38                movzx   edi, byte ptr [eax]
000010009    29 3E xx xx xx xx       sub     [xxxxxxxx], edi
000010010    90                      nop ```

In FLAT mode the above instructions the 32-bit addresses (xxxxxxxx) would need to be relocated at load time by either the Operating-System, or your own loader. You need to relocate the dword at 00010001 hex and the dword at 00001000B hex. In order to do this some extra data needs to be stored specifying which addresses in your program needs to be relocated (adjusted) by the loader.

For a segmented memory model you don't need to store this relocation info, and you also don't need to include the code to perform the relocation operation. The only relocation needed is to set the 32-bit Base address in each segment's GDT or LDT descriptor.

Pushing the immediates

There is another advantage from using the segmented-mode. You can arrange your most frequently used variables at [DS:000000xx] hex which means you can use 8-bit displacements instead of 32-bit addresses for some operations.

`             BF xx xx xx xx          lea     edi, xxxxxxxx `

Try this:

```             6A xx                   push    offset 000000xx
5F                      pop     edi ```

This saves a grand total of 2 bytes. I know, it's not much, but the above PUSH/POP trick can be used for another purpose; loading a small immediate values into a 32-bit register.

```             B8 FF FF FF FE          mov     eax, -2         (5 bytes)

6A FE                   push    -2
58                      pop     eax             (3 bytes) ```

To load the value -1 (FFFFFFFF hex) you can choose one of the following methods:

```5 bytes      B8 FF FF FF FF          mov     eax, -1

3 bytes      2B C0                   sub     eax, eax        (flags)
48                      dec     eax             (flags)

3 bytes      83 C8 FF                or      eax, -1         (flags)

3 bytes      6A FF                   push    -1
58                      pop     eax             (no flags) ```

You can also use the 67 hex address size prefix to help reduce your code size.

```5 bytes      A1 xx xx xx xx          mov     eax, [xxxxxxxx]

4 bytes      67 A1 xx xx             mov     eax, [xxxx] ```

Okay, now lets look at the other registers and those instructions which are 1 byte bigger than an EAX version.

```6 bytes      8B 35 xx 00 00 00       mov     esi, [000000xx]

5 bytes      6A xx                   push    offset 000000xx
5E                      pop     esi
8B 36                   mov     esi, [esi] ```

As you can see, it has only saved 1 byte. Combined with other local memory operations the [ESI] base could help save more bytes (i.e. MOV ECX,[ESI+4]). Of course if you want to read or write AL/eAX then the above PUSH xx method could save another byte.

```4 bytes      6A xx                   push    offset 000000xx
5E                      pop     esi

Combined with a number of XCHG EAX,reg32 operations this is a short way to load lots of registers from memory.

```18 bytes     8B 0D xx xx xx xx       mov     ecx, [000000xx]
8B 15 xx xx xx xx       mov     edx, [000000xx]
8B 1D xx xx xx xx       mov     ebx, [000000xx] ```

Smaller:

```9 bytes      6A xx                   push    offset 000000xx
5E                      pop     esi
91                      xchg    eax, ecx
92                      xchg    eax, edx
93                      xchg    eax, ebx ```

If you arrange your variables and tables/buffers carefully enough the you could probably use the final [DS:ESI] to point to the start of a buffer/table.

Indexing memory using pointer & count

Suppose we have a 2 DWORDs in memory, the first is 'Tbase' which holds the start (base) address of a table, and 'Telement' which is the element number within the table we want to access. The 'Telement' must be scaled up by 4 to obtain the desired DWORD within the table.

```16 bytes     A1 xx xx xx xx          mov     eax, [Telement]
C1 E0 02                shl     eax, 2
03 05 xx xx xx xx       add     eax, [Tbase]
8B 00                   mov     eax, [eax] ```

```14 bytes     A1 xx xx xx xx          mov     eax, [Telement]
8B 1D xx xx xx xx       mov     ebx, [Tbase]
8B 04 83                mov     eax, [eax*4+ebx] ```

Smaller:

``` 12 bytes    8E xx xx xx xx          lea     esi, Telement
91                      xchg    eax, ecx  ; ECX=Telement
96                      xchg    eax, esi  ; ESI=Tbase
F3 AD                   rep lodsd         ; ESI+ECX*4 :)

Or if in a PUSH xx range:

```10 bytes     6A xx                   push    000000xx
5E                      pop     esi
91                      xchg    eax, ecx  ; ECX=Telement
96                      xchg    eax, esi  ; ESI=Tbase

This requires the Telement and Tbase to be stored in memory in that order and obviously messes up ESI and ECX registers. :(

Hacking the stack

I would bet that during your many coding sessions there have been times when you have simply run out of registers, especially when trying to combine multiple data-streams. Normally once the [ES:EDI] register pair have been used you are forced to use a "MOV [reg32],src + ADD reg32, xx" combination. Well, there is an ancient trick which was popular on 8-bit machines like the old ZX Speccy and C64, that of using the stack to build up a table.

On my favourite cpu, the 68000, there were 2 modes and 2 stack-pointers which were mapped into the A7 address-register. For normal use, the 68000 was in USER-mode so A7 = USP (user stack-pointer), now if an interrupt occurred the cpu switched to SUPERVISOR-mode so A7 = SSP (Supervisor stack pointer). Once the interrupt handler(s) had finished the cpu switched back to USER-mode.

Yeah, very interesting TAD, but so what?

Because only the SUPERVISOR mode stack (SSP) was used to store the SR and PC registers during an interrupt or OS 'TRAP #x' call, you could be sure that data BELOW the stack-pointer was not corrupted. This is NOT the case with the 80x86, even with Intel's fancy task switching, gates and exceptions you can NOT expect data below (e)SP to remain unchanged.

Try this for yourself:

```             .model tiny             ; Assemble as a .COM proggy !!
.code
org 256
go:
mov     ax, 55AAh
push    ax
mov     bp, sp
pop     ax
again:
cmp     ax, [bp]
jz      short again
ret
end     go ```

The above code pushes 55AA hex onto the stack, pops it and waits until [SP-2] has changed. (Yes, it does change after the first interrupt.)

This means you can only build a stable table in REVERSE order (from the end to the start). Anything above the (e)SP is safe. So you use PUSH to build up a large table in a far smaller way than using MOV eAX + STOSx.

```9 bytes      8A C3                   mov     al, bl
AA                      stosb
B0 0D                   mov     al, 0Dh
AA                      stosb
B0 0A                   mov     al, 0Ah
AA                      stosb ```

Stack method:

```8 bytes      6A 0A                   push    0Ah
6A 0D                   push    0Dh
4C                      dec     esp
88 1C 24                mov     [esp], bl ```

The advantages are clear; PUSH xx can be used to replace MOV AL,x + STOSB pairs, ESP always points to the start of the buffer (unlike EDI when DF=0) and the AL register is freed up.

Of course, it won't work inside sub-routine/procedures and you need to clean up the stack once you have finished with the data... unless...

You don't have to stick with the default stack buffer, you can change ESP as much as you wish, just remember to allow some extra workspace beyond the start of the destination buffer.

Here is a stoopid example:

```             lea     edi, source             ; some input

mov     [saveESP], esp          ; save the ESP
lea     esp, buffer+400         ; end of buffer
sub     edi, esp                ; \
sub     edi, 4                  ; / adjust EDI offset
mov     ecx, 100                ; 100 dwords
again:
mov     eax, [esp+edi]          ; get something..
not     eax
push    eax                     ; store it in reverse !
loop    again
mov     esp, [saveESP]          ; restore ESP

.DATA?
saveESP      dd      ?

dd      1024 dup (?)    ; extra 4K of ESP workspace
buffer       dd      100 dup (?)     ; the actual buffer ```

Simple (and small) memory-alloc

This is a translation of my 7 byte Real/v86-mode memory allocate routine which I presented in Wilby 4 (the nice 4Kb diskmag). The idea is simple, use the stack-pointer to allocate memory blocks.

```memalloc:
memclear:
push    00h
loop    memclear        ; clear & alloc (ECX) DWORDs
jmp     eax             ; return back to caller ```

You pass the number of DWORDs in (ECX) and it returns the base address of the 'allocated' memory in the (ESP) register.

```             mov     ecx, (4096/4)   ; allocate 4Kb
call    memalloc
lea     edi, [esp]      ; [DS:EDI] -->; 4096 byte buffer

mov     ecx, (50000/4)  ; allocate 50000 bytes
call    memalloc
lea     esi, [esp]      ; [DS:ESI] -->; 50000 byte buffer ```

(1) Remember to give your program a BIG stack with enough memory for all your requirements.

(2) You can only de-alloc memory in REVERSE order!! (In the above example this means ADD ESP,50000 then ADD ESP,4096.)

Cleaning up the stack

If you are using the stack to pass/return parameters then you normally have to clean up the ESP register after each call.

```             push    xx
push    xx
call    MyProc

push    xx
push    xx
call    MyProc

But you don't have to. You can clean up after a number of calls or even reload the ESP from within your main loop.

```             push    xx
push    xx
call    MyProc
push    xx
push    xx
call    Myproc
add     esp, 8+8                ; saves 3 bytes ```

If some of the parameters are identical for each call then you reuse some of the parameters for a number of calls.

```             push    xx
push    yy
call    MyProc
pop     eax                     ; throw away parameter yy
push    zz                      ; replace it with zz
call    Myproc

In the above we call 'Myproc' with parameters xx, yy and then with xx,zz.

PUSHA and POPA procedures

With the 80386+ [ESP] addressing modes together with PUSH xx, PUSHAD and POPDA, passing parameters on the stack becomes very easy, and also very small in terms of code size. These days complex, highly reusable procedures are widely used in small demos/intros. The sheer speed of the current 'standard' CPU makes unrolling loops or using macros to build huge graphical/mathematical almost pointless. Instead of those old, bloated binary offerings, more flexible procedures are desired which often need a large number of parameters. In order to cope with them, high-level/compiler stack based techniques can be used.

Let's use an 'DrawRectangle' for a simple example...

```             mov     al, 7           ; color
mov     ebx, 50         ; x
mov     ecx, 100        ; y
mov     edx, 128        ; width
mov     esi, 90         ; height
call    DrawRectangle
...
DrawRectangle:
pushad                  ; this allows us to use ALL the
; registers within this procedure..

mov     edi, [esp+18h]  ; 'ECX' from PUSHA structure
imul    edi, 320
add     edi, [esp+10h]  ; + 'EBX'       (x)
mov     edx, [esp+04h]  ; 'ESI'         (height)
@@1:
mov     ecx, [esp+14h]  ; 'ECX'         (width)
rep     stosb
sub     edi, [esp+14h]  ; \ next destination line
dec     edx
jnz     @@1

The above (pointless) procedure demonstrates how to use the PUSHAD and POPAD instructions in order to preserve registers and also how to access these stacked input parameters. Just in case you're not familiar with PUSHAD, this is what the stack looks like after a PUSHAD instruction:

```ESP -->      EDI             [ESP+00h]
ESI             [ESP+04h]
EBP             [ESP+08h]
ESP             [ESP+0Ch]       ** value ignored by POPAD **
EBX             [ESP+10h]
EDX             [ESP+14h]
ECX             [ESP+18h]
EAX             [ESP+1Ch]
...             [ESP+20h] ```

The beauty of PUSHAD and POPAD is that they are only 1-byte opcodes, so are ideal for other optimization tricks. Of course you don't have to use registers to pass parameters, you can push values directly onto the stack. This means that you can use the 2-byte PUSH xx instruction, and it allows parameters to be passed from ANY register or memory-location. In short this means far less register-juggling!!

```             push    7               ; color 7
push    50              ; x1
push    [ycoord]        ; y1
push    300             ; x2
call    HorzLine
...
HorzLine:
mov     edi, 320
imul    edi, [esp+32+4+04h]
mov     ecx, [esp+32+4+00h]
sub     ecx, [esp+32+4+08h]
mov     al, [esp+32+4+0Ch]
rep     movsb
ret     16              ; adjust stack (4x PUSHes) ```

Local work-space.

Now suppose in your procedure you need to allocate some space for local variables/workspace. The obvious solution would be something like:

```MyProc:
83 EC 14     sub     esp, 20         ; make room for 20 bytes

... do stuff ...

83 C4 14     add     esp, 20         ; trash local 20 bytes...
ret ```

In the above case we only need 20 bytes of work-space and on average it's usually less than 32 bytes, so here is a little trick:

```MyProc:
60           pushad                  ; make room for 32 bytes

... do stuff ...

61           popad                   ; trash local 32 bytes...
ret ```

As you can see, we have saved 4 bytes by using the PUSHAD to decrease ESP by 32 bytes instead of using the normal SUB ESP,xx and ADD ESP,xx instructions. This can lead on to another nice trick...

Suppose we pass parameters in registers (like in an earlier example), those parameters have been auto-duplicated on the stack by the 2nd PUSHAD!! This means we could screw around with those duplicated values without having to restore them..

```             mov     al, 0
mov     ecx, 64000
mov     edi, 000A0000h
call    Stoopid
...
Stoopid:
pushad                  ; make a 2nd-copy (local vars.)
@@1:
stosb
dec     dword ptr [esp+18h]
jnz     short @@1
popad                   ; trash 32-byte work-space vars.
popad                   ; restore all registers ```

I know its a completely stoopid example, but hopefully you can see how useful this '2nd-copy PUSHAD' method could be... It removes the need to copy passed register values into local variables. Of course the above code is best done using: REP STOSB, but if your procedure has no free registers left then perhaps the '2nd-copy PUSHAD' trick might be worthwhile. ;)

Looping the loop.

Ah, the coder's favourite item, the loop.

```             B9 000000xx             mov     ecx, 000000xx
again:
..                      ..
E2 xx                   loop    again           (7 bytes) ```

Better:

```             6A xx                   push    xx
59                      pop     ecx
again:
..                      ..
E2 xx                   loop    again           (5 bytes) ```

If you're running short of registers, why not use the nice [ESP+nn] addressing mode..

```             6A xx                   push    xx
again:
..                      ..
FF 0C 24                dec dword ptr [esp]
75 xx                   jnz     again           (7 bytes)
58                      pop     eax             (clean up) ```

As a bonus, you can load any register with 00000000 hex during the final stack clean-up operation (the "pop eax").

Toggling 16-bit values

If you are using 16-bit values and want a quick way to toggle between two different values, then try using the BSWAP instruction.

```             BB xx xx xx xx          mov ebx, xxxxxxxx
swap:        66 81 F3 xx xx          xor bx, xxxx            (10 bytes)

BB xx xx xx xx          mov ebx, 12345678h
swap:        C1 CB 10                ror ebx, 16             (8 bytes)

BB xx xx xx xx          mov ebx, 34125678h      ** NOTE **
swap:        0F CB                   bswap ebx               (7 bytes) ```

The advantage of BSWAP is that it doesn't change the flags and is 1 byte shorter. But be careful of the byte ordering!! The bytes in the top word MUST be reversed for the BSWAP to work (3412 hex instead of 1234 hex!!).

If the toggle value is small enough for an 8-bit value then you can combine the above with the PUSH+POP load method.

```             6A xx                   push    000000xx
69                      pop     ecx
swap:        0F C9                   bswap   ecx             (5 bytes) ```

This means you can toggle between 00 and some small values 00xx. Combined with a prefix override this 'might' be useful for the string instructions using the CX register.

You can also toggle between FFFF hex and FF80...FFFF hex by pushing the values -128 to -1.

Final thought.

The tricks presented above 'might' make your code even bigger. It depends if you use some kinda executable compression. In which case you may want to unroll small loops and try to forget about what you have just read. ;) To help your compressor, you might have to un-optimized your code to make it more 'phraze-match-friendly'.

But if your code is so tightly optimized by hand that no compressor can make your code smaller (yes, this can happen) then perhaps this article could help you chew a few bytes off your production to reach the magic X byte limit.

Happy counting...