# Comparing Compilers: A Statistical Perspective

Here, we discuss an interesting problem: analyzing the respective performance of two compilers. Explicitly, we have two (or more) different compilers that we wish to compare, and we’re trying to quantify which compiler is “best” (for some value of best) over a set of programs, for some measure of performance (like runtime or code size). Historically, there have been two approaches to this. The first is a qualitative analysis, in which the performance of the compilers is compared across a range of selected, representative exemplar programs. The second is random testing, in which the two compilers are compared across a large number of randomly generated programs.

One way to compare the compilers is a qualitative analysis, in which the performance on the suite of (chosen or randomly generated) representative programs is taken directly as the comparison between the two compilers. With a large set of representative programs, we might choose to report a summary statistic (like mean performance) to aid interpretation, but this has no intrinsic statistical meaning beyond being a convenient summary. The implicit assumption is that performance on these representative programs is reflective of performance on the wider set of possible programs as a whole. However, it is not possible to quantify how reasonable this assumption really is, nor is it really possible for a finite sample of programs to be completely informative about the set of all possible programs. If we wish to go further than this, and make a quantitative analysis that lets us draw inferences about the whole population of possible programs from a small sample, we need to leverage the machinery of statistical analysis.

Statistically, both of our cases (selected and random programs) are interesting problems because, although the comparison seems simple, they fall badly foul of the central assumptions that underlie almost all statistical tests. We clearly violate the assumption that our programs are independent, random samples from the target population. What do we mean by random samples in this context? We mean that they are independent and identically distributed (i.i.d.): each of our samples is selected independently of the others, and from the same source. This is clearly untrue. In the case of selected programs, our selection is neither random nor independent – engineers have explicitly selected the programs to be fairly different from one another and representative of common coding paradigms. In the case of “randomly” generated programs, limitations in the techniques available to generate them mean that the sampled programs may be random and independent, but only from a fundamentally limited population of programs.

Now, some violations of the i.i.d. assumption are acceptable. Indeed, interesting statistical problems are rarely designed in such a way that all our samples are completely independent of one another, and quite a lot of statistics is about ways to quantify and account for structured non-independence (e.g. time series and hierarchical models). However, in our case, there is no quantifiable structure to the dependence between our observations that we can encode into a statistical model. This is a problem because without independent, random samples from the target population, and with no way to account for the non-randomness, there is no mathematical framework for making inferences about the population the sample was drawn from. That is not to say that one can’t generalize results from a non-random sample to the population, but the generalization must be argued empirically, rather than being formalized mathematically. To put it another way, if we’ve not selected the programs we are evaluating randomly, any statistical tools we might apply can only give us information about our sample; we must argue the generalization of these results to the wider population just as we have until now.

With this in mind, what can we practically do to compare our two compilers? One very clear option is to use resampling methods like the permutation test and bootstrapping. The permutation test, for example, gives us the probability of observing a difference in performance at least as large as the one we measured if the two compilers were in fact equivalent, and bootstrapping can be used to estimate confidence intervals for any statistics we might calculate. The limitation of these methods is that they are (in theory) only valid over the sample they are run on, rather than the population as a whole. However, all of our existing methods already have to assume that our sample is representative of the population, so these resampling methods at least give us a way of quantifying uncertainty within that sample.
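As a concrete sketch of both techniques (the per-program runtimes below are entirely made up for illustration), a paired permutation test and a percentile bootstrap over per-program runtime differences might look like this:

```python
import random
import statistics

# Hypothetical paired runtimes (seconds) for the same 8 programs under
# compilers A and B -- illustrative numbers, not real measurements.
runs_a = [1.20, 0.95, 2.10, 1.45, 0.80, 1.60, 1.10, 0.70]
runs_b = [1.05, 0.90, 2.30, 1.30, 0.85, 1.40, 1.00, 0.75]

def paired_permutation_test(a, b, n_perm=10_000, seed=0):
    """Two-sided paired permutation test on the mean difference.
    Under the null hypothesis the A/B labels for each program are
    exchangeable, so we randomly flip the sign of each per-program
    difference and see how often the permuted mean difference is at
    least as extreme as the observed one."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(statistics.mean(diffs))
    hits = 0
    for _ in range(n_perm):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.mean(flipped)) >= observed:
            hits += 1
    return hits / n_perm

def bootstrap_ci(a, b, n_boot=10_000, seed=0, alpha=0.05):
    """Percentile bootstrap confidence interval for the mean paired
    difference: resample whole programs (pairs) with replacement."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    means = []
    for _ in range(n_boot):
        sample = [rng.choice(diffs) for _ in diffs]
        means.append(statistics.mean(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

p = paired_permutation_test(runs_a, runs_b)
lo, hi = bootstrap_ci(runs_a, runs_b)
print(f"permutation p-value: {p:.3f}, 95% CI for mean diff: [{lo:.3f}, {hi:.3f}]")
```

Note that pairing matters here: because both compilers run the same programs, we permute labels (and resample) per program, rather than pooling all measurements as an unpaired test would.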

How could we improve this situation? From a purely statistical point of view, most of our difficulties arise from problems with sampling, and there are several ways the problem of sampling could potentially be simplified with a suitable statistical model. Firstly, quite a lot of the problems with randomly generating programs center around the perceived requirement for the random programs to steer away from certain problem cases. Some of these restrictions might be loosened by accounting for the problem behaviors inside the statistical model, instead of trying to ensure they never occur in the first place. A simple example is generated programs that don’t execute in a reasonable time: these are generally excluded outright, but one can envisage a statistical model that treats programs taking longer than a pre-specified time to execute as their own category. Another way we might improve the situation is to take advantage of any rigid structure in the dependence between related programs. Obviously, solving this problem fully would nullify the need for statistics, but if the set of programs can be broken down into (potentially hierarchical) subsets, more advanced sampling strategies like stratified sampling might become usable, which may ease the burden of random sampling, though at the cost of complicating any statistical work. Finally, and most simply, if we are satisfied with assessing our compilers over a restricted set of programs, this alone can significantly improve our situation.
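To illustrate the stratified idea concretely (the strata, stratum sizes, and program names below are entirely hypothetical), one might partition the generated programs by some structural feature and allocate a fixed sampling budget proportionally across the strata:

```python
import random

# Hypothetical population of generated programs, stratified by a
# structural feature such as loop nesting depth -- all illustrative.
population = {
    "loop_free":   ["p%03d" % i for i in range(500)],
    "single_loop": ["q%03d" % i for i in range(300)],
    "nested_loop": ["r%03d" % i for i in range(200)],
}

def stratified_sample(strata, n_total, seed=0):
    """Proportionally allocate a total sampling budget across strata,
    then draw uniformly at random (without replacement) within each
    stratum. Per-stratum estimates can later be combined with the
    same proportional weights."""
    rng = random.Random(seed)
    total = sum(len(v) for v in strata.values())
    sample = {}
    for name, programs in strata.items():
        k = round(n_total * len(programs) / total)
        sample[name] = rng.sample(programs, k)
    return sample

sample = stratified_sample(population, n_total=50)
for name, progs in sample.items():
    print(name, len(progs))  # 25, 15, and 10 programs respectively
```

The benefit is that every stratum is guaranteed representation in the benchmark set, at the cost that any downstream analysis must weight results by stratum rather than treating the sample as one homogeneous pool.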

To summarize, comparing the performance of different compilers is problematic because our sample of either randomly generated or pre-selected programs does not conform to the requirements of classical statistics. An alternative, so long as one is confident that the sample of programs is indeed representative of the population of programs as a whole, is the use of resampling methods. If we wish to overcome this problem and apply more powerful statistical methods, a significant improvement to the way we randomly generate programs will be needed; we have proposed several ways that statistical models might address the problems in random code generation.