[Home]  [Edit this page]  [Recent Changes]  [Special Pages]  [Help
x86ASMFAQ_Optimize
How do I optimize my assembly?

Optimization In Assembly

A Short Preamble

Before moving any further, let us ask: do I need optimizing? It's a very interesting and complicated question. If I, for instance, make a program because someone asked me to, the only thing I will care is that it must work. Anything else, however interesting, is beyond my scope, for it takes a good deal of time to perform.

Things are different when I (and maybe, you too) program for the sake of satisfaction. Or I make a nice piece of free software. In the world of free software, it doesn't only count that program does its job; it's also appreciated when the way program works is both Right and Good Thing. When making a project in pure assembly, making it Good is a must; if I wanted my program to spend lots of time wasting my CPU and memory, I would use, for example, Visual C++. Or Delphi. But never would I use assembly.

There is one more benefit of optimizing: there is little things that are equal in mental training. Even if you are writing a life game with a goal that it must be the fastest/smallest life game in the world, you do things, which may be indifferent for the world, but very useful for your mind. And there will come a day when you will join the Wizards... :o)

I don't pretend that my way of thinking is similar to yours; I simply suppose it to be sound. And I hope I'm not alone...

Well... If you are still reading this, you probably have concluded that optimization is absolutely necessary in your case. This means we may move forward. The very first thing you should remember is that there is no algorithm to do optimization; if there was one, it would be implemented, and this whole page could be thrown out. But it's not, because optimization is a form of art, not routine; advices here are no more than obvious recommendations. The best results are achieved after a period of practice, and it will remain a mystery how the Good Thing was seen, hidden in lots of garbage, as a mystery is how our consciousness works, and how poets write poems, and how we see Beautiful in our lifes... But the short preamble gets growing long, and it's high time to start from ...

... The Middle Of The Story

Yes, from the middle. The beginning is a very trivial and usual for programs in any language. It's selecting an algorithm that is optimal in itself. Choosing "quick sorting" instead of bubble sorting can be a fine example. And, by the way, you can easily find bunches of algorithm in appropriate resources of The Programmer's Heaven Web site (I think you know where it is!), in D. E. Knuth's book The Art of Computer Programming and in many other places devoted to this issue. For the rest of our discussion, I will presume that algorithm is OK.

The next step in butchering down our program is finding whatever takes more place it ought to, and whatever is unnecessarily done. That gives us major improvements in speed, as well as in size.

But before doing anything, if your program works, back it up to a safe location in order that you may revert to it if anything goes wrong. If your program, however, does not work, please fix it first, however dirty; for a very upleasant thing is to optimize bugs.

Even if the implementation is very dirty, it can be unnoticed if there are no loops. So, next advices are given in order to improve loops.

Here's a good Pentium optimization reference:
http://www.programmersheaven.com/search/download.asp?FileID=6344

Evaluating Constants Outside Loops

It's obvious, however often unseen, that in some code where
@1: xor ax, ax
    ; [ the loop body; ax remains unchanged]
    loop @1
it would be more efficient to place xor ax, ax before our loop starts. As constants are unchanged throughout the loop, it's irrational to eveluate them over and over again. We should be careful and make sure that "constants" will be actually be constants inside our loop.

Such evaluations can be scattered everywhere, including the middle or even the end of loop body, depending on your favorite programming habits. Make sure to find them all.

Backward Looping

All of us probably got used to those FOR loops where some control variable is increased every pass. In most languages, there is also a way to make that control variable decrease, but it's seldom used, because it's strange for us to go or count backwards. However, in assembly, backward FOR loops are more efficient. Consider the following example:
	mov eax, 0	; our control variable
loop_start:
	;[... The loop body]
	inc eax
	cmp eax, OurFinalValue
	je  loop_end
	jmp loop_start
loop_end:
	...


The last four commands are Achilles' heel. Why? Because if we make our loop execute backwards, we reduce the loop tail to two instructions. We should remember that DEC instruction sets flags almost like CMP, and SUB does this exactly like CMP. So, our loop will look much better like that:
	mov eax, OurFinalValue ; Now it's OurBeginningValue
loop_start:
	; [... The loop body]
	dec eax 
	jnz loop_start
	...
And if we can use ECX/CX as our counter, we can often reduce these two instructions into one LOOP instruction. However, sometimes DEC/JNZ can be more efficient.

Expanding a Loop

It's a good technique when our loop's body is small enough and we know exact number of iterations. We will see below why is it efficient, but for now we should remember that straightforward linear execution is faster than one with jumps and calls. So
	mov ecx, Constant
a_very_small_loop_start:
	; ... A very small loop
	loop a_very_small_loop_start


can be turned into
	
	; ... A very small loop
	; ... A very small loop
	; ...
	; ... (Constant (see above) times)
We can also sub-expand our loops when we know the number of iterations is twofold, threefold etc: we just expand the body inside the loop necessary amount of times and then execute it X/2 or X/3 etc. times.

The last method involves some dealing with low-level optimization, when we count cycles and sizes of opcodes, and swap commands in order to speed them up; it requires knowing some CPU specific aspects but can improve performance greatly.

The Art of Sorcery, or The Low Level Optimization

Now we are sure that we gained maximum of every optimization made before; high time to leave for "newbies forever", but time of interest for those who wish to become Real Programmers (the next obvious step is learning the bare metal programming).

Now it's time to separate aspects of optimizing for size and for speed. However, these techniques rarely interfere (we'll see where), and can be used together to produce real artwork.

Some things need to be performed at once and without second thought: the instruction
	mov register, 0 
must be substituted with
	xor register, register;
instructions like
	cmp register, 0
must be replaced with
	test register, register ;Note that using OR instead can cause a
                                ;pipeline slowdown.
These are so well-known tricks that they are performed in the dirtiest implementations. As you can guess, they will decrease the size of instruction by the size of register they operate on.

The next thing is to find anything like
	or eax, xxx ; ... where x is a byte value
in order to replace it with far more efficient
	or al, xxx 
which is exactly the same, with 3 bytes cut down.

You may have already figured out that it also applies to XORs. Also, this rule may apply to ANDs if xxx (see above) has all of it's bits 8 ... 31 (or 15) set to 1.

Note, however, that it's not efficient to substitute
	or eax, 1024
with
	or ax, 1024.
That's because in 32-bit mode (I hope you program for it) switching back to 16-bit registers and/or addresses requires additional prefix byte and additional cycle to execute. So, overusing 32-bit arguments in 16-bit code and vice versa may lead to a major slowdown.

Use al/ax/eax for the "hottest" variables (that is, variables on which most of the calculations are performed) and in every possible place. The accumulator is the fastest registers, and moreover, some instructions have optimized format for dealing with al/ax/eax.

Before going into a loop, put the most used variables into registers. That accelerates a lot (and reduces evaluations in size as well). After the loop, you can put modified values back into memory.

Keep in mind that using ecx as a counter may be efficient because loops can be made with LOOP instructions. Loops become smaller, however not always faster (for faster implementation, when using PPro or higher processor, it's recommended to use DEC/J(N)Z pair instead).

Use string instructions! They are excellent in reducing size.

Remember: each prefix takes one byte more and uses one cycle more. Be careful.

Don't use ENTER/LEAVE; they are damn slow. Moreover, assembly offers you much more interesting ways to pass arguments.

In Pentium processors, there are two pipes: U-pipe and V-pipe. They are designed to execute instructions simultaneously. But there are some stumbling blocks: if some command writes to a register, and the very next command reads that register, they surely must be serialized. Moreover, there are instructions that can be executed only in U-pipe, and some of them are unable to execute simultaneously with any other instruction. Read a reference on Intel processors for more information.

(Shortly, I'll put here the list of such opcodes for the laziest ones)

In Pentium Pro, things go even farther. Commands are divided into microoperations, which can be executed simultaneously and allow some commands to be executed at one time. Intel's reference must contain exact values for each command; remember, if you can reorganize your program to have triads of operations made the way that first contains 4 microoperations, and two next 2 per instruction, you'll get them three work in just one CPU cycle!

(Maybe the above paragraph sounds too obscure. I promise to return and improve it very shortly)

>> The things to follow very shortly...
>>
>> TODO: (maybe) Separate clearly size-reducing and speed-increasing 
>>       tricks.
>> TODO: Some examples for every trick explained.
>> TODO: FPU issues.
>> TODO: branch predictions.
>> TODO: Win32 issues.
>> TODO: the list of opcodes and in which pipe they can execute (Pentium).


If you don't want to wait for all this 4 days or so, make your own improvements!

What tools are available to help me?



last edited (January 24, 2007) by Ptr_082004, Number of views: 4892, Current Rev: 8 (Diff)

[Edit this page]  [Page history]  [What links here]  [Discuss this topic]  [Printer Friendly]  

Members

Username:

Password:


Register
Forgot Password?




Programmers Heaven - for .NET, Java, C/C++ and WEB Developers!
© 1996-2008 Community Networks Ltd. All rights reserved. Reproduction in whole or in part, in any form or medium without express written permission is prohibited. Violators of this policy may be subject to legal action. Please read Terms Of Use and Privacy Statement for more information. Development by Tore Nestenius at .NET Consultant - Synchron Data.