Superoptimization

Feasibility Study by Embecosm Limited, supported by Innovate UK

Superoptimization is the process of finding the optimal instruction sequence for a given section of code. This contrasts with the traditional process of compilation, which optimizes code but does not guarantee optimality.

This report explores the different types of superoptimizer, and how they could be implemented in modern technologies.

This work was led by James Pallister, Research Engineer at Embecosm and supported by Innovate UK under their Technology Inspired Feasibility Studies initiative.

For further information about this work, please contact Dr Jeremy Bennett, Chief Executive, jeremy.bennett@embecosm.com.

Superoptimization aims to produce perfectly optimal code, in place of the code currently generated by compilers. This is typically done via a brute-force search of every possible instruction sequence, checking whether each performs the desired actions and accepting a sequence only if it is the optimal one.

The problem is clearly very difficult, exploding in size as the length of the instruction sequence increases. However, superoptimization is gaining traction as a method of optimizing programs. In this feasibility study we hope to find which techniques are extensible enough to be brought out of the academic community and used in a commercial setting. This means that the superoptimizer needs to be able to handle all the corner cases that may arise in production code. If it turns out that superoptimization is not currently feasible, by the end of the project we will have a research roadmap describing what needs to be done next.

Superoptimization was first conceived by Henry Massalin [1], who conducted a brute-force search through the arithmetic instructions of the Motorola 68000. The example given in the original paper is given below.
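Massalin's best-known result is a short, branch-free signum sequence for the 68000. As an illustration of the kind of transformation involved (a Python sketch, not the original 68000 listing):

```python
def sign_naive(x):
    # Straightforward, branchy definition of signum.
    if x < 0:
        return -1
    if x > 0:
        return 1
    return 0

def sign_branchless(x):
    # Branch-free formulation: each comparison yields 0 or 1, so the
    # subtraction produces -1, 0 or 1 directly, with no conditional
    # control flow.
    return (x > 0) - (x < 0)
```

The superoptimizer's achievement was finding a machine-level equivalent of the second form automatically, shorter than anything a compiler of the time produced.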

The most obvious choice for a superoptimizer is a brute force enumeration, simply searching all possible sequences in cost order until a correct sequence is found. However, there are other ways of performing the search, and these are also described below.

The most common type of superoptimizer is the brute-force variety, which enumerates each instruction sequence in turn, testing each until a correct sequence is found. The order of this search is important: to guarantee optimality, the sequences must be searched in cost order.

The search order is simple for some metrics, such as code size, where in many cases cost corresponds to the number of instructions in the sequence. A superoptimizer would enumerate all sequences of one instruction, then two instructions, and so on.
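Cost-ordered enumeration can be sketched as follows. The four-instruction accumulator machine, the target function and the test inputs are all illustrative assumptions, not taken from any real superoptimizer:

```python
from itertools import product

# Hypothetical single-accumulator instruction set; a real
# superoptimizer enumerates machine instructions with registers
# and immediate operands.
INSTRUCTIONS = {
    "neg": lambda a: -a,
    "inc": lambda a: a + 1,
    "dec": lambda a: a - 1,
    "dbl": lambda a: a * 2,
}

def run(seq, x):
    # Execute a sequence of instruction names on accumulator value x.
    for name in seq:
        x = INSTRUCTIONS[name](x)
    return x

def superoptimize(target, tests, max_len=4):
    # Enumerate candidate sequences in cost (length) order, so the
    # first sequence agreeing with the target on every test input is
    # also the shortest such sequence.
    for length in range(1, max_len + 1):
        for seq in product(INSTRUCTIONS, repeat=length):
            if all(run(seq, x) == target(x) for x in tests):
                return seq
    return None
```

For example, `superoptimize(lambda x: 2 * x + 2, range(-4, 5))` returns `("inc", "dbl")`, i.e. 2 * (x + 1), after no single instruction matches. Note that agreement on the test inputs is not a proof of equivalence; a production superoptimizer must still verify any candidate over all inputs.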

The disadvantage of this approach is that the search space scales exponentially with the length of the instruction sequence, making long sequences intractable. The size of the instruction set also greatly affects the superoptimization potential. This is discussed below in 3.1 Potential efficiency: Instruction set choice. The exponential growth can be mitigated by several techniques that remove redundant and incorrect instruction sequences.

Recently, machine learning has been applied to superoptimization to traverse the search space efficiently, allowing the search to quickly identify efficient instruction sequences [2]. With this approach it is difficult to guarantee optimality of the generated sequences, although they are typically very close to optimal and the method gives good results in a shorter time frame than any other.
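A much-simplified sketch of this style of search, in the spirit of the stochastic superoptimization of [2] but using plain greedy acceptance rather than a Metropolis rule, over the same hypothetical four-instruction accumulator machine as above:

```python
import random

# Hypothetical single-accumulator instruction set (an illustrative
# assumption, not a real machine).
OPS = {
    "neg": lambda a: -a,
    "inc": lambda a: a + 1,
    "dec": lambda a: a - 1,
    "dbl": lambda a: a * 2,
}

def run(seq, x):
    for name in seq:
        x = OPS[name](x)
    return x

def cost(seq, target, tests):
    # Number of test inputs on which the candidate disagrees.
    return sum(run(seq, x) != target(x) for x in tests)

def stochastic_search(target, tests, length=2, iters=50000, seed=1):
    rng = random.Random(seed)
    names = list(OPS)
    current = tuple(rng.choice(names) for _ in range(length))
    best = cost(current, target, tests)
    for _ in range(iters):
        if best == 0:
            return current
        # Mutate one random position to a random instruction.
        candidate = list(current)
        candidate[rng.randrange(length)] = rng.choice(names)
        candidate = tuple(candidate)
        c = cost(candidate, target, tests)
        # Greedy acceptance; [2] instead sometimes accepts worsening
        # moves, which helps escape local minima on harder problems.
        if c <= best:
            current, best = candidate, c
    return current if best == 0 else None
```

Because the cost function only counts failing tests, the search follows a gradient towards correctness rather than enumerating every sequence, which is what makes this approach fast but unable to certify optimality.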

The constraint-solving approach to superoptimization uses a solver (typically an SMT solver) with a set of equations to find an efficient solution. The instructions are encoded into constraints which the solver can interpret, and the target function is then represented in this form. The SMT solver typically converts these constraints into boolean equations (a SAT problem), which are given to a SAT solver, which tries to determine whether a solution exists. If a solution is found, it describes an instruction sequence that performs the target operation, so an off-the-shelf solver can produce the desired sequence directly [3].

However, SMT solvers typically solve a decision problem rather than an optimization problem, meaning that the problem domain does not translate directly into SMT constraints; an iterative approach must be taken. This involves asking the solver to “prove there is no instruction sequence of size N that solves the problem”. The SMT solver will either return a counterexample (the instruction sequence) or report that the problem cannot be solved given the constraints [4]. The parameter N can then be increased until the shortest sequence is found.
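The iterative pattern can be sketched as follows, with an exhaustive search over a hypothetical toy instruction set standing in for the solver query (a real implementation would encode the instructions as bit-vector constraints and hand them to an SMT solver such as Z3, which returns a model as the witness):

```python
from itertools import product

# Hypothetical single-accumulator instruction set (illustrative only).
OPS = {
    "neg": lambda a: -a,
    "inc": lambda a: a + 1,
    "dec": lambda a: a - 1,
    "dbl": lambda a: a * 2,
}

def run(seq, x):
    for name in seq:
        x = OPS[name](x)
    return x

def witness_of_length(n, target, tests):
    # Stand-in for the decision query "does a length-n program
    # equivalent to the target exist?"; like an SMT model, it hands
    # back a witness when the answer is yes.
    for seq in product(OPS, repeat=n):
        if all(run(seq, x) == target(x) for x in tests):
            return seq
    return None

def shortest_program(target, tests, max_len=5):
    # Iterative deepening: increase N until the decision procedure
    # returns a witness; the first witness found is the shortest.
    for n in range(1, max_len + 1):
        seq = witness_of_length(n, target, tests)
        if seq is not None:
            return seq
    return None
```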

Currently, superoptimization is limited to short sequences of mostly ALU-based code, with some techniques able to handle memory operations and a few able to handle branches. No superoptimizer has attempted to create arbitrary control flow.

Superoptimization often finds optimizations where multiple different representations of a number can be exploited. For example, if a number can be operated on both as a bit pattern and as an arithmetic value, interesting optimizations can often be found. This is also possible with floating-point numbers, although no superoptimizer has attempted it yet.
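Two classic illustrations of this bit/arithmetic duality, standard two's-complement identities shown here as Python sketches:

```python
def lowest_set_bit(x):
    # Mixes the arithmetic view (two's-complement negation) with the
    # bit view (AND): -x agrees with x only at the lowest set bit,
    # so the AND isolates it.
    return x & -x

def is_power_of_two(x):
    # Subtracting 1 flips the lowest set bit and everything below it,
    # so the AND is zero exactly when x had a single set bit.
    return x > 0 and (x & (x - 1)) == 0
```

Optimizations of this kind are hard for a rule-based compiler to derive but fall out naturally from an exhaustive search over instruction sequences.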

One thing a superoptimizer can successfully find is branch-free replacements for short sequences of code, allowing (usually) costly branches to be removed [5]. Many of the brute-force superoptimizers focus on this. However, it is still difficult for superoptimizers to produce code with branches in it, and more difficult still if loops are considered.
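A well-known example of such a branch-free replacement is the XOR-and-mask minimum. In this Python sketch the comparison stands in for the machine-level comparison or flag instruction that would produce the mask:

```python
def min_branchless(a, b):
    # (a < b) is 0 or 1; negating it yields an all-zeros or all-ones
    # mask, which selects either 0 or (a ^ b) to XOR into b. No
    # conditional branch is needed.
    return b ^ ((a ^ b) & -(a < b))

def max_branchless(a, b):
    # The same mask applied to a selects the other operand.
    return a ^ ((a ^ b) & -(a < b))
```

On machines with expensive branch mispredictions, sequences like these can be markedly faster than the branchy equivalent.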

The black line indicates the average length of the found solution for that architecture.

Some constructs and instructions are challenging for a superoptimizer to deal with. These are listed below.

Memory operations

Memory operations can be challenging, since memory represents a large additional state that must be kept track of. This is handled in [6] by constraining the amount of memory that can be accessed to 256 bytes, using only the last byte of the memory address. This quick test has aliasing problems in certain cases, which must be checked later with more extensive verification. Volatility must also be considered when superoptimizing memory accesses.
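The 256-byte trick described in [6] can be sketched as follows (a hypothetical model, not the actual implementation). Memory is folded onto 256 cells by keeping only the low byte of each address, which makes the quick test cheap but introduces exactly the aliasing mentioned above:

```python
class TinyMemory:
    """Fast-test memory model in the style of [6]: only the low byte
    of each address is used, folding all of memory into 256 cells.
    Addresses differing only above the low byte alias here, so a
    candidate that passes this quick test still needs full
    verification afterwards."""

    def __init__(self):
        self.cells = [0] * 256

    def load(self, addr):
        # Only the low 8 bits of the address select a cell.
        return self.cells[addr & 0xFF]

    def store(self, addr, value):
        self.cells[addr & 0xFF] = value & 0xFF
```

For example, `store(0x1234, 99)` followed by `load(0x5634)` returns 99 even though the full addresses differ, which is the aliasing case that later verification must catch.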

Control flow: branches

Branches greatly widen the scope of what the superoptimizer can generate. It is relatively simple to verify whether a section of loop-free code is equivalent to another section of code with a different branching structure (i.e. branching input to the superoptimizer is allowed). Generating branching output is more challenging due to the large number of possibilities.

Control flow: loops

Loops are extremely difficult to superoptimize, particularly because of the problem of verifying loop equivalence. Superoptimizing across loop boundaries was shown to be possible in [7]; however, this kept the same loop structure.

Large numbers of register combinations

Many register combinations are redundant when some of the registers are orthogonal. For example, two sequences may be functionally equivalent apart from the registers in which the results start and end; reducing sequences to a canonical form of register naming removes this redundancy.
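Reduction to a canonical form can be sketched as renaming registers in order of first use, so that any two sequences differing only by a consistent register permutation collapse to the same form (a hypothetical representation in which each instruction is an opcode plus register names):

```python
def canonicalize(seq):
    # Rename registers in first-use order. Sequences that differ only
    # by a consistent permutation of register names map to the same
    # canonical sequence, so the search need only consider one of them.
    mapping = {}
    out = []
    for op, *regs in seq:
        renamed = []
        for r in regs:
            if r not in mapping:
                mapping[r] = "r%d" % len(mapping)
            renamed.append(mapping[r])
        out.append((op, *renamed))
    return out
```

With this, `[("add", "r5", "r2"), ("neg", "r5")]` and `[("add", "r1", "r7"), ("neg", "r1")]` canonicalize to the same sequence, so only one of them need be enumerated.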

Floating point

Floating point is challenging to superoptimize because it is particularly difficult to verify that the output sequence is correct. The typical method of using an SMT solver to prove an instruction sequence's equivalence does not work for floating point, since no floating-point theory is available.

Precoloured registers

Instructions which imply certain source or destination registers can be problematic in some circumstances, particularly for brute-force superoptimizers. The canonical-form solution to the large number of register combinations (given above) has trouble coping with this scenario, since the reduction cannot be applied to these registers. This is less of a problem when converting existing sequences to canonical form; the difficulties arise when trying to iteratively generate all sequences of registers in canonical form. More research is needed to generate precoloured registers efficiently.

Choice of code to superoptimize

The choice of code to superoptimize has a large impact on the effectiveness of the solution, and the time taken to find it. A simple way to choose which areas of code should be superoptimized is to profile, and find which regions of code are hot. However, some of these will not be suitable for superoptimization – this section discusses some of these cases.

If the code frequently accesses memory then it is likely to be difficult for a superoptimizer to improve. Global memory accesses are often required, because they fetch the data the code is operating on. Local accesses frequently cannot be removed either, since they have dependencies across loop edges, or the computation of the value is a significant distance from the target region.

The following dataflow DAG shows a basic block that may not be effective to superoptimize. The green arrows indicate the sequence of memory operations. In this analysis the arrows are marked conservatively – the volatility is unknown so the order should be preserved.

If an arithmetic-intensive portion of the basic block can be extracted from the rest of the basic block, then just this section can be considered for superoptimization. This has the advantage of cutting out the computation that offers few possible gains, and focuses on what superoptimizers typically do best.

The basic block below has a highlighted region which could be fed through a superoptimizer, avoiding the chain of memory accesses on the right.