[Home]
[Edit this page]
[Recent Changes]
[Special Pages]
[Help]
x86ASMFAQ_Optimize
How do I optimize my assembly?
Optimization In Assembly
A Short Preamble
Before moving any further, let us ask: do I need optimizing? It's a very interesting and complicated question. If I, for instance, make a program because someone asked me to, the only thing I will care is that it must work. Anything else, however interesting, is beyond my scope, for it takes a good deal of time to perform.
Things are different when I (and maybe, you too) program for the sake of satisfaction. Or I make a nice piece of free software. In the world of free software, it doesn't only count that program does its job; it's also appreciated when the way program works is both Right and Good Thing. When making a project in pure assembly, making it Good is a must; if I wanted my program to spend lots of time wasting my CPU and memory, I would use, for example, Visual C++. Or Delphi. But never would I use assembly.
There is one more benefit of optimizing: there is little things that are equal in mental training. Even if you are writing a life game with a goal that it must be the fastest/smallest life game in the world, you do things, which may be indifferent for the world, but very useful for your mind. And there will come a day when you will join the Wizards... :o)
I don't pretend that my way of thinking is similar to yours; I simply suppose it to be sound. And I hope I'm not alone...
Well... If you are still reading this, you probably have concluded that optimization is absolutely necessary in your case. This means we may move forward. The very first thing you should remember is that there is no algorithm to do optimization; if there was one, it would be implemented, and this whole page could be thrown out. But it's not, because optimization is a form of art, not routine; advices here are no more than obvious recommendations. The best results are achieved after a period of practice, and it will remain a mystery how the Good Thing was seen, hidden in lots of garbage, as a mystery is how our consciousness works, and how poets write poems, and how we see Beautiful in our lifes... But the short preamble gets growing long, and it's high time to start from ...
... The Middle Of The Story
Yes, from the middle. The beginning is a very trivial and usual for programs in any language. It's selecting an algorithm that is optimal in itself. Choosing "quick sorting" instead of bubble sorting can be a fine example. And, by the way, you can easily find bunches of algorithm in appropriate resources of The Programmer's Heaven Web site (I think you know where it is!), in D. E. Knuth's book The Art of Computer Programming and in many other places devoted to this issue. For the rest of our discussion, I will presume that algorithm is OK.
The next step in butchering down our program is finding whatever takes more place it ought to, and whatever is unnecessarily done. That gives us major improvements in speed, as well as in size.
But before doing anything, if your program works, back it up to a safe location in order that you may revert to it if anything goes wrong. If your program, however, does not work, please fix it first, however dirty; for a very upleasant thing is to optimize bugs.
Even if the implementation is very dirty, it can be unnoticed if there are no loops. So, next advices are given in order to improve loops.
Here's a good Pentium optimization reference:
http://www.programmersheaven.com/search/download.asp?FileID=6344
Evaluating Constants Outside Loops
It's obvious, however often unseen, that in some code where
Such evaluations can be scattered everywhere, including the middle or even the end of loop body, depending on your favorite programming habits. Make sure to find them all.
Backward Looping
All of us probably got used to those FOR loops where some control variable is increased every pass. In most languages, there is also a way to make that control variable decrease, but it's seldom used, because it's strange for us to go or count backwards. However, in assembly, backward FOR loops are more efficient. Consider the following example:
The last four commands are Achilles' heel. Why? Because if we make our loop execute backwards, we reduce the loop tail to two instructions. We should remember that DEC instruction sets flags almost like CMP, and SUB does this exactly like CMP. So, our loop will look much better like that:
Expanding a Loop
It's a good technique when our loop's body is small enough and we know exact number of iterations. We will see below why is it efficient, but for now we should remember that straightforward linear execution is faster than one with jumps and calls. So
can be turned into
The last method involves some dealing with low-level optimization, when we count cycles and sizes of opcodes, and swap commands in order to speed them up; it requires knowing some CPU specific aspects but can improve performance greatly.
The Art of Sorcery, or The Low Level Optimization
Now we are sure that we gained maximum of every optimization made before; high time to leave for "newbies forever", but time of interest for those who wish to become Real Programmers (the next obvious step is learning the bare metal programming).
Now it's time to separate aspects of optimizing for size and for speed. However, these techniques rarely interfere (we'll see where), and can be used together to produce real artwork.
Some things need to be performed at once and without second thought: the instruction
The next thing is to find anything like
You may have already figured out that it also applies to XORs. Also, this rule may apply to ANDs if xxx (see above) has all of it's bits 8 ... 31 (or 15) set to 1.
Note, however, that it's not efficient to substitute
Use al/ax/eax for the "hottest" variables (that is, variables on which most of the calculations are performed) and in every possible place. The accumulator is the fastest registers, and moreover, some instructions have optimized format for dealing with al/ax/eax.
Before going into a loop, put the most used variables into registers. That accelerates a lot (and reduces evaluations in size as well). After the loop, you can put modified values back into memory.
Keep in mind that using ecx as a counter may be efficient because loops can be made with LOOP instructions. Loops become smaller, however not always faster (for faster implementation, when using PPro or higher processor, it's recommended to use DEC/J(N)Z pair instead).
Use string instructions! They are excellent in reducing size.
Remember: each prefix takes one byte more and uses one cycle more. Be careful.
Don't use ENTER/LEAVE; they are damn slow. Moreover, assembly offers you much more interesting ways to pass arguments.
In Pentium processors, there are two pipes: U-pipe and V-pipe. They are designed to execute instructions simultaneously. But there are some stumbling blocks: if some command writes to a register, and the very next command reads that register, they surely must be serialized. Moreover, there are instructions that can be executed only in U-pipe, and some of them are unable to execute simultaneously with any other instruction. Read a reference on Intel processors for more information.
(Shortly, I'll put here the list of such opcodes for the laziest ones)
In Pentium Pro, things go even farther. Commands are divided into microoperations, which can be executed simultaneously and allow some commands to be executed at one time. Intel's reference must contain exact values for each command; remember, if you can reorganize your program to have triads of operations made the way that first contains 4 microoperations, and two next 2 per instruction, you'll get them three work in just one CPU cycle!
(Maybe the above paragraph sounds too obscure. I promise to return and improve it very shortly)
If you don't want to wait for all this 4 days or so, make your own improvements!
What tools are available to help me?
[Edit this page] [Page history] [What links here] [Discuss this topic] [Printer Friendly]
x86ASMFAQ_Optimize
How do I optimize my assembly?
Optimization In Assembly
A Short Preamble
Before moving any further, let us ask: do I need optimizing? It's a very interesting and complicated question. If I, for instance, make a program because someone asked me to, the only thing I will care is that it must work. Anything else, however interesting, is beyond my scope, for it takes a good deal of time to perform.
Things are different when I (and maybe, you too) program for the sake of satisfaction. Or I make a nice piece of free software. In the world of free software, it doesn't only count that program does its job; it's also appreciated when the way program works is both Right and Good Thing. When making a project in pure assembly, making it Good is a must; if I wanted my program to spend lots of time wasting my CPU and memory, I would use, for example, Visual C++. Or Delphi. But never would I use assembly.
There is one more benefit of optimizing: there is little things that are equal in mental training. Even if you are writing a life game with a goal that it must be the fastest/smallest life game in the world, you do things, which may be indifferent for the world, but very useful for your mind. And there will come a day when you will join the Wizards... :o)
I don't pretend that my way of thinking is similar to yours; I simply suppose it to be sound. And I hope I'm not alone...
Well... If you are still reading this, you probably have concluded that optimization is absolutely necessary in your case. This means we may move forward. The very first thing you should remember is that there is no algorithm to do optimization; if there was one, it would be implemented, and this whole page could be thrown out. But it's not, because optimization is a form of art, not routine; advices here are no more than obvious recommendations. The best results are achieved after a period of practice, and it will remain a mystery how the Good Thing was seen, hidden in lots of garbage, as a mystery is how our consciousness works, and how poets write poems, and how we see Beautiful in our lifes... But the short preamble gets growing long, and it's high time to start from ...
... The Middle Of The Story
Yes, from the middle. The beginning is a very trivial and usual for programs in any language. It's selecting an algorithm that is optimal in itself. Choosing "quick sorting" instead of bubble sorting can be a fine example. And, by the way, you can easily find bunches of algorithm in appropriate resources of The Programmer's Heaven Web site (I think you know where it is!), in D. E. Knuth's book The Art of Computer Programming and in many other places devoted to this issue. For the rest of our discussion, I will presume that algorithm is OK.
The next step in butchering down our program is finding whatever takes more place it ought to, and whatever is unnecessarily done. That gives us major improvements in speed, as well as in size.
But before doing anything, if your program works, back it up to a safe location in order that you may revert to it if anything goes wrong. If your program, however, does not work, please fix it first, however dirty; for a very upleasant thing is to optimize bugs.
Even if the implementation is very dirty, it can be unnoticed if there are no loops. So, next advices are given in order to improve loops.
Here's a good Pentium optimization reference:
http://www.programmersheaven.com/search/download.asp?FileID=6344
Evaluating Constants Outside Loops
It's obvious, however often unseen, that in some code where
@1: xor ax, ax
; [ the loop body; ax remains unchanged]
loop @1
it would be more efficient to place xor ax, ax before our loop starts. As
constants are unchanged throughout the loop, it's irrational to eveluate
them over and over again. We should be careful and make sure that
"constants" will be actually be constants inside our loop.Such evaluations can be scattered everywhere, including the middle or even the end of loop body, depending on your favorite programming habits. Make sure to find them all.
Backward Looping
All of us probably got used to those FOR loops where some control variable is increased every pass. In most languages, there is also a way to make that control variable decrease, but it's seldom used, because it's strange for us to go or count backwards. However, in assembly, backward FOR loops are more efficient. Consider the following example:
mov eax, 0 ; our control variable loop_start: ;[... The loop body] inc eax cmp eax, OurFinalValue je loop_end jmp loop_start loop_end: ...
The last four commands are Achilles' heel. Why? Because if we make our loop execute backwards, we reduce the loop tail to two instructions. We should remember that DEC instruction sets flags almost like CMP, and SUB does this exactly like CMP. So, our loop will look much better like that:
mov eax, OurFinalValue ; Now it's OurBeginningValue loop_start: ; [... The loop body] dec eax jnz loop_start ...And if we can use ECX/CX as our counter, we can often reduce these two instructions into one LOOP instruction. However, sometimes DEC/JNZ can be more efficient.
Expanding a Loop
It's a good technique when our loop's body is small enough and we know exact number of iterations. We will see below why is it efficient, but for now we should remember that straightforward linear execution is faster than one with jumps and calls. So
mov ecx, Constant a_very_small_loop_start: ; ... A very small loop loop a_very_small_loop_start
can be turned into
; ... A very small loop ; ... A very small loop ; ... ; ... (Constant (see above) times)We can also sub-expand our loops when we know the number of iterations is twofold, threefold etc: we just expand the body inside the loop necessary amount of times and then execute it X/2 or X/3 etc. times.
The last method involves some dealing with low-level optimization, when we count cycles and sizes of opcodes, and swap commands in order to speed them up; it requires knowing some CPU specific aspects but can improve performance greatly.
The Art of Sorcery, or The Low Level Optimization
Now we are sure that we gained maximum of every optimization made before; high time to leave for "newbies forever", but time of interest for those who wish to become Real Programmers (the next obvious step is learning the bare metal programming).
Now it's time to separate aspects of optimizing for size and for speed. However, these techniques rarely interfere (we'll see where), and can be used together to produce real artwork.
Some things need to be performed at once and without second thought: the instruction
mov register, 0must be substituted with
xor register, register;instructions like
cmp register, 0must be replaced with
test register, register ;Note that using OR instead can cause a
;pipeline slowdown.
These are so well-known tricks that they are performed in the dirtiest
implementations. As you can guess, they will decrease the size of
instruction by the size of register they operate on.The next thing is to find anything like
or eax, xxx ; ... where x is a byte valuein order to replace it with far more efficient
or al, xxxwhich is exactly the same, with 3 bytes cut down.
You may have already figured out that it also applies to XORs. Also, this rule may apply to ANDs if xxx (see above) has all of it's bits 8 ... 31 (or 15) set to 1.
Note, however, that it's not efficient to substitute
or eax, 1024with
or ax, 1024.That's because in 32-bit mode (I hope you program for it) switching back to 16-bit registers and/or addresses requires additional prefix byte and additional cycle to execute. So, overusing 32-bit arguments in 16-bit code and vice versa may lead to a major slowdown.
Use al/ax/eax for the "hottest" variables (that is, variables on which most of the calculations are performed) and in every possible place. The accumulator is the fastest registers, and moreover, some instructions have optimized format for dealing with al/ax/eax.
Before going into a loop, put the most used variables into registers. That accelerates a lot (and reduces evaluations in size as well). After the loop, you can put modified values back into memory.
Keep in mind that using ecx as a counter may be efficient because loops can be made with LOOP instructions. Loops become smaller, however not always faster (for faster implementation, when using PPro or higher processor, it's recommended to use DEC/J(N)Z pair instead).
Use string instructions! They are excellent in reducing size.
Remember: each prefix takes one byte more and uses one cycle more. Be careful.
Don't use ENTER/LEAVE; they are damn slow. Moreover, assembly offers you much more interesting ways to pass arguments.
In Pentium processors, there are two pipes: U-pipe and V-pipe. They are designed to execute instructions simultaneously. But there are some stumbling blocks: if some command writes to a register, and the very next command reads that register, they surely must be serialized. Moreover, there are instructions that can be executed only in U-pipe, and some of them are unable to execute simultaneously with any other instruction. Read a reference on Intel processors for more information.
(Shortly, I'll put here the list of such opcodes for the laziest ones)
In Pentium Pro, things go even farther. Commands are divided into microoperations, which can be executed simultaneously and allow some commands to be executed at one time. Intel's reference must contain exact values for each command; remember, if you can reorganize your program to have triads of operations made the way that first contains 4 microoperations, and two next 2 per instruction, you'll get them three work in just one CPU cycle!
(Maybe the above paragraph sounds too obscure. I promise to return and improve it very shortly)
>> The things to follow very shortly... >> >> TODO: (maybe) Separate clearly size-reducing and speed-increasing >> tricks. >> TODO: Some examples for every trick explained. >> TODO: FPU issues. >> TODO: branch predictions. >> TODO: Win32 issues. >> TODO: the list of opcodes and in which pipe they can execute (Pentium).
If you don't want to wait for all this 4 days or so, make your own improvements!
What tools are available to help me?
[Edit this page] [Page history] [What links here] [Discuss this topic] [Printer Friendly]
