Compiler optimization
|
Compiler optimization techniques are optimization techniques that have been programmed into a compiler. These techniques are automatically applied by the compiler whenever they are appropriate. Because programmers no longer need to manually apply these techniques, programmers are free to write source code in a straightforward manner, expressing their intentions clearly. Then the compiler can choose the most efficient way to handle the implementation details.
Contrary to what the term might imply, optimization techniques (whether manual or automatic) rarely result in executables that are perfectly "optimal" by any measure, only executables that are much improved compared to direct translation of the programmer's original source.
Contents |
|
Types of optimizations
Techniques in optimization can be broken up along various scopes which affect anywhere from a single statement to an entire program. Generally locally scoped techniques are easier to implement than global ones but result in lesser gains. Some examples of scopes include:
- Peephole optimizations: Usually performed late in the compilation process, peephole optimizations examine at most a few instructions, transforming instructions into other less expensive ones, such as turning a multiplication of x by two into an addition of x with itself.
- Local optimizations: These operate on a single basic block, a piece of straight-line code. Basic blocks have many useful properties, particularly order of execution of instructions, that makes it possible to do many simple optimizations on them with naïve algorithms.
- Loop optimizations: These act on a number of basic blocks which make up a loop, such as a for loop in high-level code. Loop optimizations are important because many programs spend a large percent of their time inside loops. An important example is loop-invariant code motion.
- Global or intraprocedural optimization: This type of optimization acts on the complete control flow graph of a single procedure; the term global is somewhat misleading. These optimizations must deal with how control and data are modified or combined as control passes over flow edges of the control flow graph.
- Interprocedural or whole-program optimization: The most powerful of all in some ways, these optimize interactions between procedures. A simple example is taking a call to a function with constant parameters, copying the code of that function, and substituting the constant parameters throughout. These type of analyses are often very limited in order to deal with the complexity of an object as a large as a whole program.
In addition to scoped optimizations there are two further general categories of optimization:
- programming language-independent vs. language-dependent: Most high-level languages share common programming constructs and abstractions--decision (if, switch, case), looping (for, while, repeat .. until, do .. while), encapsulation (structures, objects). Thus similar optimization techniques can be used across languages. However certain language features make some kinds of optimizations possible and/or difficult. For instance, the existence of pointers in C and C++ makes certain optimizations of array accesses difficult. Conversely, in some languages functions are not permitted to have "side effects". Therefore, if repeated calls to the same function with the same arguments are made, the compiler can immediately infer that results need only be computed once and the result referred to repeatedly.
- machine independent vs. machine dependent: Many optimizations that operate on abstract programming concepts (loops, objects, structures) are independent of the machine targeted by the compiler. But many of the most effective optimizations are those that best exploit special features of the target platform.
The following is an instance of a local machine dependent optimization. To set a register to 0, the obvious way is to use the constant 0 in an instruction that sets a register value to a constant. A less obvious way is to XOR a register with itself. It is up to the compiler to know which instruction variant to use. On many RISC machines, both instructions would be equally appropriate, since they would both be the same length and take the same time. On many other microprocessors such as the Intel x86 family, it turns out that the XOR variant is shorter and probably faster, as there will be no need to decode an immediate operand, nor use the internal "immediate operand register".
Factors affecting optimization
- The machine itself
Many of the choices about which optimizations can and should be done depend on the characteristics of the target machine. It is sometimes possible to parameterize some of these machine dependent factors, so that a single piece of compiler code can be used to optimize different machines just by altering the machine description parameters. GCC is a compiler which exemplifies this approach.
- The architecture of the target CPU
- Number of CPU registers: To a certain extent, the more registers, the easier it is to optimize for performance. Local variables can be allocated in the registers and not on the stack. Temporary/intermediate results can be left in registers without writing to and reading back from memory.
- RISC vs. CISC: CISC instruction sets often have variable instruction lengths, often have a larger number of possible instructions that can be used, and each instruction could take differing amounts of time. RISC instruction sets attempt to limit the variability in each of these: instruction sets are usually constant length, with few exceptions, there are usually fewer combinations of registers and memory operations, and the instruction issue rate (the number of instructions completed per time period, usually an integer multiple of the clock cycle) is usually constant in cases where memory latency is not a factor. There may be several ways of carrying out a certain task, with CISC usually offering more alternatives than RISC. Compilers have to know the relative costs among the various instructions and choose the best instruction sequence (see instruction selection).
- Pipelines: A pipeline is essentially an ALU broken up into an assembly line. It allows use of parts of the ALU for different instructions by breaking up the execution of instructions into various stages: instruction decode, address decode, memory fetch, register fetch, compute, register store, etc. One instruction could be in the register store stage, while another could be in the register fetch stage. Pipeline conflicts occur when an instruction in one stage of the pipeline depends on the result of another instruction ahead of it in the pipeline but not yet completed. Pipeline conflicts can lead to pipeline stalls: where the CPU wastes cycles waiting for a conflict to resolve.
- Compilers can schedule, or reorder, instructions so that pipeline stalls occur less frequently.
- number of functional units: Some CPUs have several ALUs and FPUs. This allows them to execute multiple instructions simultaneously. There may be restrictions on which instructions can pair with which other instructions ("pairing" is the simultaneous execution of two or more instructions), and which functional unit can execute which instruction. They also have issues similar to pipeline conflicts.
- Here again, instructions have to be scheduled so that the various functional units are fully fed with instructions to execute.
- The architecture of the machine
- Cache size (256KB, 1MB) and type (fully associative, 4-way associative): Techniques like inline expansion and loop unrolling may increase the size of the generated code and reduce code locality. The program may slow down drastically if an oft-run piece of code (like inner loops in various algorithms) suddenly cannot fit in the cache. Also, caches which are not fully associative have higher chances of cache collisions even in an unfilled cache.
- Cache/Memory transfer rates: These give the compiler an indication of the penalty for cache misses. This is used mainly in specialized applications.
Intended use of the generated code
- Debugging
- When a programmer is still writing an application, they will recompile and test often, and so compilation must be fast. This is one reason most optimizations are avoided while debugging. Also, debugging code is usually stepped through in a symbolic debugger, and optimizing transformations, particularly those that reorder code, can make it difficult to identify the output code with the line numbers in the source code from which the code was generated. This can confuse the debugging tools and the programmer using them.
- General purpose use
- Prepackaged software is very often expected to be executed on a variety of machines and CPUs that may share the same instruction set, but have different timing, cache or memory characteristics. So, the code may not be tuned to any particular CPU, or may be tuned to work well on the most popular CPU and work reasonably on other CPUs.
- Special purpose use
- If the software is compiled to be used on one or a few very similar machines, with known characteristics, then the compiler can heavily tune the generated code to those machines alone. Important special cases include code designed for parallel and vector processors, for which we have parallelizing compilers.
Optimization techniques
Common themes
To a large extent, optimization techniques have the following themes, which sometime conflict.
- Avoid redundancy
- If something has already been computed, it's generally better to store it and reuse it later, instead of recomputing it.
- Less code
- There is less work for the CPU, cache, and memory. So, likely to be faster.
- Straight line code, fewer jumps
- Less complicated code. Jumps interfere with the prefetching of instructions, thus slowing down code.
- Code locality
- Pieces of code executed close together in time should be placed close together in memory, which increases spatial locality of reference.
- Extract more information from code
- The more information the compiler has, the better it can optimize.
- Avoid memory accesses
- Accessing memory, particularly if there is a cache miss, is much more expensive than accessing registers.
Optimization techniques
Loop optimizations
Some optimization techniques primarily designed to operate on loops include:
- induction variable analysis
- Roughly, if a variable in a loop is a simple function of the index variable, such as
j := 4*i+1
, it can be updated appropriately each time the loop variable is changed. This is a strength reduction, and also may allow the index variable's definitions to become dead code. This information is also useful for bounds-checking elimination and dependence analysis, among other things.
- loop fission or loop distribution
- Loop fission attempts to break a loop into multiple loops over the same index range but each taking only a part of the loop's body. This can improve locality of reference, both of the data being accessed in the loop and the code in the loop's body.
- loop fusion or loop combining
- Another technique which attempts to reduce loop overhead. When two adjacent loops would iterate the same number of times (whether or not that number is known at compile time), their bodies can be combined as long as they make no reference to each other's data.
- loop inversion is this a correct name for this technique??
- This technique changes a standard while loop into a do/while (a.k.a. repeat/until) loop wrapped in an if conditional, reducing the number of jumps by two, for cases when the loop is executed. Doing so duplicates the condition check (increasing the size of the code) but is more efficient because jumps usually cause a pipeline stall. Additionally, if the initial condition is known at compile-time and is known to be side-effect -free, the if guard can be skipped.
- loop interchange
- These optimizations exchange inner loops with outer loops. When the loop variables index into an array, such a transformation can improve locality of reference, depending on the array's layout.
- loop-invariant code motion
- If a quantity is computed inside a loop during every iteration, and its value is the same for each iteration, it can vastly improve efficiency to hoist it outside the loop and compute its value just once before the loop begins. This is particularly important with the address-calculation expressions generated by loops over arrays. For correct implementation, this technique must be used with loop inversion, because not all code is safe to be hoisted outside the loop.
- loop nest optimization
- Some pervasive algorithms such as matrix multiplication have very poor cache behavior and excessive memory accesses. Loop nest optimization increases the number of cache hits by performing the operation over small blocks and by using a loop interchange.
- loop reversal
- Loop reversal reverses the order in which values are assigned to the index variable. This is a subtle optimization which can help eliminate dependencies and thus enable other optimizations.
- loop unrolling
- Duplicates the body of the loop multiple times, in order to decrease the number of times the loop condition is tested and the number of jumps, which hurt performance by impairing the instruction pipeline. Completely unrolling a loop eliminates all overhead, but requires that the number of iterations be known at compile time.
- loop splitting
- Loop splitting attempts to simplify a loop or eliminate dependencies by breaking it into multiple loops which have the same bodies but iterate over different contiguous portions of the index range. A useful special case is loop peeling, which can simplify a loop with a problematic first iteration by performing that iteration separately before entering the loop.
- loop unswitching
- Unswitching moves a conditional inside a loop outside of it by duplicating the loop's body, and placing a version of it inside each of the if and else clauses of the conditional.
Data-flow optimizations
Data flow optimizations primarily depend on how certain properties of data are propagated by control edges in the control flow graph. Some of these include:
- common subexpression elimination
- In the expression "(a+b)-(a+b)/4", "common subexpression" refers to the duplicated "(a+b)". Compilers implementing this technique realize that "(a+b)" won't change, and as such, only calculate its value once.
- constant folding and propagation
- replacing expressions consisting of constants (e.g. "3 + 5") with their final value ("8") at compile time, rather than doing the calculation in run-time. Used in most modern languages.
SSA-based optimizations
These optimizations are intended to be done after transforming the program into a special form called static single assignment (see SSA form), in which every variable is assigned in only one place. Although some function without SSA, they are most effective with SSA. Many optimizations listed in other sections also benefit with no special changes, such as register allocation.
global value numbering: GVN eliminates redundancy by constructing a value graph of the program, and then determining which values are computed by equivalent expressions. GVN is able to identify some redundancy that common subexpression elimination cannot, and vice versa.
sparse conditional constant propagation: Effectively equivalent to iteratively performing constant propagation, constant folding, and dead code elimination until there is no change, but is much more efficient. This optimization symbolically executes the program, simultaneously propagating constant values and eliminating portions of the control flow graph that this makes unreachable.
Back end optimizations
register allocation: The most frequently used variables should be kept in processor registers for fastest access. To find which variables to put in registers an interference-graph is created. Each variable is a vertex and when two variables are used at the same time (have an intersecting liverange) they have an edge between them. This graph is colored using for example Chaitin's algorithm using the same number of colors as there are registers. If the coloring fails one variable is "spilled" to memory and the coloring is retried.
instruction selection: Most architectures, particularly CISC architectures and those with many addressing modes, offer several different ways of performing a particular operation, using entirely different sequences of instructions. The job of the instruction selector is to do a good job overall of choosing which instructions to implement which operators in the low-level intermediate representation with. For example, on many processors in the 68000 family, complex addressing modes can be used in statements like "lea 25(a1,d5*4), a0", allowing a single instruction to perform a significant amount of arithmetic with less storage.
instruction scheduling : Instruction scheduling is an important optimization for modern pipelined processors, which avoids stalls or bubbles in the pipeline by clustering instructions with no dependencies together, while being careful to preserve the original semantics.
rematerialization : Rematerialization recalculates a value instead of loading it from memory, preventing a memory access. This is performed in tandem with register allocation to avoid spills.
Functional language optimizations
Although many of these also apply to non-functional languages, they either originate in, are most easily implemented in, or are particularly critical in functional languages such as Lisp and ML.
- removing recursion
- Recursion is often expensive, as a function call consumes stack space and involves some overhead related to parameter passing and flushing the instruction cache. Tail recursive algorithms can be converted to iteration, which does not have call overhead and uses a constant amount of stack space, through a process called tail recursion elimination.
Other optimizations
Please help separate and categorize these further and create detailed pages for them, especially the more complex ones, or link to one where one exists.
- bounds-checking elimination
- Several modern languages, most notably Java, enforce bounds-checking of all array accesses. This is a severe performance bottleneck on certain applications such as scientific code. Bounds-checking elimination allows the compiler to safely remove bounds-checking in many situations where it can determine that the index must fall within valid bounds, for example if it is a simple loop variable.
- Branch offset optimization (machine independent)
- Choose the shortest branch displacement that reaches target
- code-block reordering
- Code-block reordering alters the order of the basic blocks in a program in order to reduce conditional branches and improve locality of reference.
- dead code elimination
- Removes instructions that will not affect the behaviour of the program, for example definitions which have no uses, called dead code. This reduces code size and eliminates unnecessary computation.
- factoring out of invariants
- If an expression is carried out both when a condition is met and otherwise, it can be written just once outside of the conditional statement. Similarly, if certain types of expressions (e.g. the assignment of a constant into a variable) appear inside a loop, they can be moved out of it because their effect will be the same no matter if they're executed many times or just once. Also known as total redundancy elimination. A more powerful optimization is partial redundancy elimination (PRE).
- inline expansion
- When some code invokes a procedure, it is possible to directly insert the body of the procedure inside the calling code rather than transferring control to it. This saves the overhead related to procedure calls, as well as providing great opportunity for many different parameter-specific optimizations, but comes at the cost of space; the procedure body is duplicated each time the procedure is called inline. Generally, inlining is useful in performance-critical code that makes a large number of calls to small procedures.
- jump threading
- In this pass, conditional jumps in the code that branch to identical or inverse tests are detected, and can be "threaded" through a second conditional test.
- strength reduction
- A general term encompassing optimizations that replace complex or difficult or expensive operations with simpler ones. Induction variable analysis can accomplish this with variables inside a loop which depend on the loop variable. Several peephole optimizations also fall into this category, such as replacing division by a constant with multiplication by its reciprocal, converting multiplies into a series of bit-shifts and adds, and replacing large instructions with equivalent smaller ones that load more quickly.
- reduction of cache collisions
- (e.g. by disrupting alignment within a page)
- Stack height reduction
- Rearrange expression tree to minimize resources needed for expression evaluation.
- test reordering
- If we have two tests that are the condition for something, we can first deal with the simpler tests (e.g. comparing a variable to something) and only then with the complex tests (e.g. those that require a function call). This technique complements lazy evaluation, but can be used only when the tests are not dependent on one another. Short-circuiting semantics can make this difficult.
- unreachable code elimination
- Unreachable code refers to code which can never be executed. Unreachable code elimination prevents the compiler from emitting instructions for such code, decreasing code size.
Problems with optimization
Early in the history of compilers, compiler optimizations were not as good as hand-written ones. As compiler technologies have improved, good compilers can often generate better code than human programmers and good post pass optimizers can improve highly hand-optimized code even further. For the RISC CPU architecture, and even more so for VLIW hardware, compiler optimization is the key for obtaining efficient code, because the RISC instruction set is so compact that it is hard for a human to manually schedule or combine small instructions to get efficient results. Indeed, these architectures were designed to rely on compiler writers for adequate performance.
However, optimizing compilers are by no means perfect. There is no way that a compiler can guarantee that, for all program source code, the fastest (or smallest) possible equivalent compiled program is output; such a compiler is fundamentally impossible because it would solve the halting problem. Additionally, there are a number of other more practical issues with optimizing compiler technology:
- Usually, an optimizing compiler only performs low-level, localized changes to small sets of operations. In other words, high-level inefficiency in the source program (such as an inefficient algorithm) remains unchanged.
- Modern third-party compilers usually have to support several objectives. In so doing, these compilers are a 'jack of all trades' yet master of none.
- A compiler typically only deals with a small part of an entire program at a time, at most a module at a time and usually only a procedure; the result is that it is unable to consider at least some important contextual information.
Work to improve optimization technology continues. One approach is the use of so-called "post pass" optimizers. These tools take the executable output by an "optimizing" compiler and optimize it even further. As opposed to compilers which optimize intermediate representations of programs, post pass optimizers work on the assembly language level. Post pass compilers are also limited by this, however, in that much of the information found in the original source code is lost.
As processor performance continues to improve at a rapid pace while memory bandwidth improves more slowly, optimizations that reduce memory bandwidth (even at the cost of making the processor execute "extra" instructions) will become more useful. Examples of this mentioned above include loop nest optimization and rematerialization.
External links
- redhat.com | GCC Optimization (http://www.redhat.com/software/gnupro/technical/gnupro_gcc.html) - A summary of the optimization methods used by the popular Open Source compiler from the Free Software Foundation -- GCC.
- NULLSTONE Optimization Categories (http://www.nullstone.com/htmls/category.htm) - major compiler optimization categories.