Superoptimization

Feasibility Study by Embecosm Limited, supported by Innovate UK

Superoptimization is the process of finding the optimal instruction sequence for a given section of code. This contrasts with the traditional process of compilation, which optimizes code but does not ensure optimality.

This report explores the different types of superoptimizer, and how they could be implemented using modern technologies.

This work was led by James Pallister, Research Engineer at Embecosm and supported by Innovate UK under their Technology Inspired Feasibility Studies initiative.

For further information about this work, please contact Dr Jeremy Bennett, Chief Executive, jeremy.bennett@embecosm.com.

1 Introduction

Superoptimization is a technique for producing perfectly optimal code, in place of the code currently generated by compilers. This is typically done via a brute-force search of every possible instruction sequence, checking whether each performs the desired computation and accepting a sequence only if it is optimal.

The problem is clearly very difficult, exploding in size as the length of the instruction sequence increases. However, superoptimization is gaining traction as a method of optimizing programs. In this feasibility study we hope to find which techniques are extensible enough to bring out of the academic community and into commercial use. This means that the superoptimizer needs to be able to handle all the corner cases that may arise in production code. If it turns out that superoptimization is not currently feasible, by the end of the project we will have a research roadmap describing what needs to be done next.

Superoptimization was first conceived by Henry Massalin [1], who conducted a brute-force search through the arithmetic instructions of the Motorola 68000. The example from the original paper is shown below.

[Figure: the original source function (left) and its compiled code (right).]

The function on the left is compiled, giving the piece of code on the right, with multiple branches and comparisons. This could perhaps be reduced by an expert assembly programmer, but was generally accepted as close to optimal. When the superoptimizer was run on this program, a much shorter version was found: four instructions long, with no branching.

On the left is the instruction sequence found by the superoptimizer (for the Motorola 68000), with the definitions of the instructions on the far right. In the middle are three cases, computed step by step to illustrate how the code sequence works. How the instructions compute their result is not obvious: the sequence exploits the carry flag ('x') in combination with arithmetic instructions.
[Figure: the superoptimized instruction sequence (left), three worked example cases (middle), and the instruction definitions (right).]

Analysis

2 Types of superoptimizer

The most obvious choice for a superoptimizer is brute-force enumeration, simply searching all possible sequences in cost order until a correct sequence is found. However, there are other ways of performing the search, and these are described below.

2.1 Brute force

The most common type of superoptimizer is a brute-force optimizer, which enumerates instruction sequences in turn, testing each until a correct sequence is found. The order of this search is important, since to guarantee optimality the sequences must be searched in cost order.

The search order is simple for some metrics, such as code size, where the cost in many cases corresponds to the number of instructions in the sequence. A superoptimizer would enumerate all sequences of one instruction, followed by all sequences of two instructions, and so on.

The disadvantage of this approach is that the search space scales exponentially with the length of the instruction sequence, making long sequences intractable. The size of the instruction set also greatly affects the superoptimization potential; this is discussed below in section 3.1, Instruction set choice. The exponential growth can be mitigated by several techniques that remove redundant and incorrect instruction sequences.
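To make this concrete, the following sketch enumerates sequences in cost order over a toy single-accumulator machine. The five-instruction set, 8-bit word and target function are illustrative assumptions, not taken from any particular superoptimizer; because the domain is only 256 values, exhaustive testing doubles as a correctness proof.

from itertools import product

MASK = 0xFF  # toy 8-bit machine

# Each toy instruction maps the accumulator to a new value.
INSTRUCTIONS = {
    'inc': lambda a: (a + 1) & MASK,
    'dec': lambda a: (a - 1) & MASK,
    'neg': lambda a: (-a) & MASK,
    'not': lambda a: (~a) & MASK,
    'shl': lambda a: (a << 1) & MASK,
}

def target(x):
    return ((~x) << 1) & MASK  # example goal function

def run(seq, x):
    for name in seq:
        x = INSTRUCTIONS[name](x)
    return x

def superoptimize(max_len):
    # Cost order: all length-1 sequences, then length-2, and so on,
    # so the first correct sequence found is also the shortest.
    for n in range(1, max_len + 1):
        for seq in product(INSTRUCTIONS, repeat=n):
            if all(run(seq, x) == target(x) for x in range(256)):
                return seq
    return None

print(superoptimize(3))  # finds the two-instruction sequence ('not', 'shl')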

2.2 Machine learning

Recently, machine learning has been applied to superoptimization to traverse the search space efficiently, allowing the search to quickly identify efficient instruction sequences [2]. With this approach it is difficult to guarantee optimality of the generated sequences, although they are typically very close to optimal, and good results are obtained in a shorter time than with any other method.
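As a rough illustration of a guided search, the sketch below performs a cost-driven random walk over sequences for the toy machine in the previous sketch; the mutation operators, cost weights and acceptance rule are illustrative assumptions, not the specific method of [2].

import random

def cost(seq, tests, target, run):
    # Heavily penalise wrong outputs, then prefer shorter sequences.
    wrong = sum(run(seq, x) != target(x) for x in tests)
    return wrong * 1000 + len(seq)

def mutate(seq, instructions):
    seq = list(seq)
    choice = random.random()
    if choice < 0.4 and seq:    # replace a random instruction
        seq[random.randrange(len(seq))] = random.choice(instructions)
    elif choice < 0.7 and seq:  # delete a random instruction
        del seq[random.randrange(len(seq))]
    else:                       # insert a random instruction
        seq.insert(random.randrange(len(seq) + 1), random.choice(instructions))
    return seq

def stochastic_search(instructions, tests, target, run, steps=100000):
    current, best = [], []
    for _ in range(steps):
        candidate = mutate(current, instructions)
        delta = cost(candidate, tests, target, run) - cost(current, tests, target, run)
        # Accept improvements, and occasionally a worse candidate so the
        # search can escape local minima (a Metropolis-style rule).
        if delta <= 0 or random.random() < 0.01:
            current = candidate
        if cost(current, tests, target, run) < cost(best, tests, target, run):
            best = current
    return best

# e.g. stochastic_search(list(INSTRUCTIONS), range(256), target, run)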

2.3 Constraint solving

The constraint-solving approach to superoptimization uses a solver (typically an SMT solver) with a set of equations to find an efficient solution. The instructions are encoded as constraints the solver can interpret, and the target function is then represented in this form. The SMT solver typically converts these constraints into boolean equations (a SAT problem), which are given to a SAT solver. The SAT solver then tries to determine whether the equations have a solution. If a solution is found, it describes an instruction sequence that performs the target operation. An off-the-shelf solver can therefore produce a solution [3].

SMT solvers are typically very efficient, containing many heuristics to speed up and guide the search. This approach allows the superoptimizer's performance to improve as the underlying solver technology develops.

However, SMT solvers typically solve a decision problem, rather than an optimization problem, meaning that the problem domain does not translate directly into SMT constraints – an iterative approach must be taken. This involves asking the solver to prove there is no instruction sequence of length N that solves the problem. The SMT solver will either return a counterexample (the instruction sequence) or report that the problem cannot be solved given the constraints [4]. The parameter N can then be increased until the shortest sequence is found.
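The sketch below illustrates this iterative loop using the Z3 SMT solver's Python bindings. The four-operation instruction set, the single-accumulator encoding with one constant operand per instruction, and the target function are all simplifying assumptions for illustration.

from z3 import BitVec, Int, ForAll, Solver, If, sat

W = 8  # illustrative word width

def apply_op(op, a, b):
    # Opcodes 0..3 select one of four ALU operations.
    return If(op == 0, a + b,
           If(op == 1, a - b,
           If(op == 2, a & b,
                       a | b)))

def target(x):
    return (x + 5) & 7  # example target function

def synthesise(n):
    # Ask the solver for an n-instruction program computing target(x)
    # for every input x; returns a model, or None if none exists.
    s = Solver()
    x = BitVec('x', W)
    ops = [Int('op%d' % i) for i in range(n)]
    consts = [BitVec('c%d' % i, W) for i in range(n)]
    acc = x  # the accumulator initially holds the input
    for i in range(n):
        s.add(ops[i] >= 0, ops[i] <= 3)
        acc = apply_op(ops[i], acc, consts[i])
    s.add(ForAll([x], acc == target(x)))  # must hold for all inputs
    return s.model() if s.check() == sat else None

# Increase N until a sequence is found; the first found is the shortest.
n = 1
while synthesise(n) is None:
    n += 1
print('shortest sequence has', n, 'instructions')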

3 Potential efficiency

Currently superoptimization is limited to short sequences of mostly ALU-based code, with some techniques able to handle memory operations and a few able to handle branches. No superoptimizer has attempted to create arbitrary control flow.

Superoptimization often finds optimizations where multiple representations of a number can be exploited. For example, if a value can be operated on both as a bit pattern and as an arithmetic quantity, interesting optimizations can often be found. This should also be possible with floating point numbers, although no superoptimizer has attempted it yet.

One thing a superoptimizer can reliably find is branch-free replacements for short sequences of code, allowing (usually costly) branches to be removed [5]. Many of the brute-force superoptimizers focus on this. However, it is still difficult for superoptimizers to produce code containing branches, and more difficult still if loops are considered.
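A classic example of the kind of replacement meant here is a branch-free minimum, sketched below in Python; on a real target the comparison would produce a flag or mask directly, with no branch at all.

def branchy_min(a, b):
    if a < b:   # compiles to a compare and a conditional branch
        return a
    return b

def branchfree_min(a, b):
    mask = -(a < b)              # all ones if a < b, all zeros otherwise
    return b ^ ((a ^ b) & mask)  # selects a or b without branching

assert all(branchy_min(a, b) == branchfree_min(a, b)
           for a in range(-8, 8) for b in range(-8, 8))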

3.1 Instruction set choice

The instruction set chosen as the target for superoptimization has a large effect on the efficacy of the search – both the length of the resulting solution and the time taken to find it. There are a number of trade-offs to consider when choosing the instruction set, or the subset of the instruction set, to superoptimize.

The number of instructions in the instruction set is the predominant factor in whether the superoptimizer will find a solution, since the search space grows with the number of candidate instructions at each position. However, if each instruction performs only a 'small' amount of computation, then the required sequence will need to be longer. This means that the subset of instructions chosen should be as small as possible, while consisting of 'complex' operations.

The notion of operation 'complexity' is necessarily vague, since there is no good metric to describe the amount of computation an operation performs.

The following graph shows the superoptimization results for several architectures using the GNU superoptimizer (GSO). This includes GSO's inbuilt superoptimization target functions (134 functions), as well as the superoptimization benchmarks given in the following section (25 functions). Each stacked bar shows the number of target functions that could be superoptimized, with the resulting number of instructions shown in colour.

The black line indicates the average length of the found solution for that architecture.

[Graph: number of target functions superoptimized per architecture, stacked by resulting sequence length, with the average length marked.]

Overall this graph gives an idea of the complexity of the instruction set (or the subset implemented in GSO) for each architecture. AVR has longer instruction sequences on average, since it is the only 8-bit architecture and by far the simplest.

An important trend to notice in the graph is that architectures on the left (lower average sequence length) typically have 3-operand instructions, whereas many on the right are 2-operand (implied destination). By not modifying either source register, an operation leaves its sources available for reuse later in the sequence: a 3-operand add r0, r1, r2 preserves both r1 and r2, whereas a 2-operand add r0, r1 overwrites r0. If the instruction set only supports 2-operand addressing, the superoptimizer must find another way of preserving the source register, typically by inserting an additional move instruction, thus increasing the sequence length.

The average sequence length is not a measure of how good the instruction set is – it results from design choices made when creating the instruction set. The choice of instructions in the AVR instruction set means that the processor has a smaller silicon area than the other processors.

The importance of moving register contents around means that instruction sets with a conditional move perform well. The Alpha architecture and PA-RISC (hppa) both have some support for conditional instructions or conditional moves.

The following graph shows a breakdown of the superoptimizer benchmarks for each architecture. Only the 25 superoptimizer benchmarks are displayed, since these are expected to be a better indicator of overall performance than the full set (GSO's inbuilt functions are heavily biased towards if-conversion type optimizations). This highlights the length of the individual result for each superoptimizer trial.

[Graph: per-benchmark result lengths for each architecture.]

As with the previous graph, there is a general increase from left to right in the number of instructions required to solve the benchmarks. These graphs suggest that the subset of AVR instructions implemented is the least complex, whereas the instructions from the Alpha instruction set are the most complex.

4 Difficulties

Some constructs and instructions are challenging for a superoptimizer to deal with. These are listed below.

 

Memory operations

Memory operations can be challenging, since memory represents a large amount of additional state that must be tracked. This is handled in [6] by constraining the memory that can be accessed to 256 bytes, using only the last byte of the memory address. This quick test has aliasing problems in certain cases, which must be checked later with more extensive verification.
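A sketch of this quick test is shown below, assuming a simple byte-addressed model; note how two distinct addresses deliberately alias in the reduced memory, which is why a later, fuller verification pass is needed.

class TinyMemory:
    # Model memory as 256 bytes, indexed by the low byte of each address.
    def __init__(self):
        self.data = [0] * 256

    def load(self, address):
        return self.data[address & 0xFF]

    def store(self, address, value):
        self.data[address & 0xFF] = value & 0xFF

mem = TinyMemory()
mem.store(0x12345678, 0xAB)
assert mem.load(0x00000078) == 0xAB  # distinct addresses alias in this model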

 

Memory is also problematic in the constraint-solving approach, due to the large number of constraints it produces – a lookup table must be created with enough entries to store each unique access, and each load performed is then compared against the entries of the table.
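A sketch of this encoding, using Z3's Python bindings, is shown below: each load becomes a chain of comparisons against every store made so far, so the number of constraints grows with the number of memory accesses. The names and single-byte values are illustrative assumptions.

from z3 import BitVec, If

def symbolic_load(stores, addr, initial):
    # 'stores' is a list of (address, value) pairs in program order;
    # 'initial' models the unknown contents of untouched memory.
    value = initial
    for a, v in stores:
        value = If(addr == a, v, value)  # a later matching store wins
    return value

a0, v0 = BitVec('a0', 32), BitVec('v0', 8)
loaded = symbolic_load([(a0, v0)], BitVec('la', 32), BitVec('m0', 8))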

 

Volatility must also be considered when superoptimizing memory accesses, since volatile accesses may not be removed or reordered.

Control flow: branches

Branches greatly widen the scope of what the superoptimizer can generate. It is relatively simple to verify whether a section of loop-free code is equivalent to another section of code with a different branching structure (i.e. allowing branching input to the superoptimizer). Generating branching output is more challenging, due to the large number of possibilities.

Control flow: loops

Loops are extremely difficult to superoptimize, particularly due to the problem of verifying loop equivalence. Superoptimizing across loop boundaries was shown to be possible in [7]; however, this kept the same loop structure.

Large numbers of register combinations

Many register combinations are redundant, since sequences that differ only in their choice of registers compute the same function. For example:

mov r0, r1
add r1, r2, r2

mov r3, r2
add r2, r1, r1

These two sequences are functionally equivalent, apart from the registers in which the operands start and the results end.
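One common mitigation is to search only sequences in a canonical form, renaming registers in order of first use so that sequences differing only in register names coincide. The sketch below, using an assumed three-address tuple form, maps the two sequences above to the same canonical result.

def canonicalise(seq):
    names = {}
    out = []
    for op, *regs in seq:
        renamed = []
        for r in regs:
            if r not in names:
                names[r] = 'r%d' % len(names)  # next unused canonical name
            renamed.append(names[r])
        out.append((op, *renamed))
    return out

a = [('mov', 'r0', 'r1'), ('add', 'r1', 'r2', 'r2')]
b = [('mov', 'r3', 'r2'), ('add', 'r2', 'r1', 'r1')]
assert canonicalise(a) == canonicalise(b)  # identical canonical form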

Floating point

Floating point is challenging to superoptimize because it is particularly difficult to verify that the output sequence is correct. The typical method of using an SMT solver to prove an instruction sequence's equivalence does not work for floating point, as no floating point theory is available.

Precoloured registers

Instructions which imply a certain source or destination register can be problematic in certain circumstances, particularly for brute-force superoptimizers. The canonical-form solution to the large number of register combinations (given above) has problems coping with this scenario, since the renaming cannot be applied to these fixed registers. This is less of a problem when converting existing sequences to canonical form; the difficulties arise when trying to iteratively generate all sequences of registers in canonical form. More research is needed to efficiently generate sequences containing precoloured registers.

 

Choice of code to superoptimize

The choice of code to superoptimize has a large impact on the effectiveness of the solution, and the time taken to find it. A simple way to choose which areas of code should be superoptimized is to profile the program and find which regions of code are hot. However, some of these regions will not be suitable for superoptimization – this section discusses such cases.

5 Memory accesses

If the code frequently accesses memory then it is likely to be difficult for a superoptimizer to improve. Global memory accesses are often required, because they fetch the data the code is operating on. Local accesses frequently cannot be removed either, since they have dependencies across loop edges, or the computation of the value is a significant distance from the target region.

The following dataflow DAG shows a basic block that may not be effective to superoptimize. The green arrows indicate the required ordering of memory operations. In this analysis the arrows are marked conservatively – the volatility is unknown, so the order must be preserved.

[Figure: dataflow DAG of a memory-heavy basic block, with memory-ordering edges marked in green.]

If an arithmetic-intensive portion of the basic block can be extracted from the rest of the basic block, then just this section could be considered for superoptimization. This has the advantage of cutting out the computation that offers few possible gains, and focusing on what superoptimizers are typically best at.

The basic block below has a highlighted region which could be fed through a superoptimizer, avoiding the chain of memory accesses on the right.