I’ve spent the last two days at EuroLLVM in Saarbrücken. 260 engineers from 40 companies and including 70 students, talking compilers. There were over 40 talks, plus posters and a student competition in the program. This post takes a look at some of my highlights.
Day 1 Keynote: LLVM for HPC
The particular issue he and his colleagues are trying to address is making application performance portable; whether they run on many-core processors or on GPUs, applications need to scale from thousands to tens of thousands of cores — and handle a wide range of memory and communication architectures. Solutions need to work for the Exascale systems we’ll have in 4 years time and to achieve this we need to work on three levels:
- application code and libraries
- libraries abstracting parallelism and memory
- the compiler tool chain
Parallelism has traditionally relied on OpenMP in the application code. Memory layout is often hard coded into the implementation making such applications hard to port when the memory architecture changes. Projects like Kokkos, RAJTA try to abstract these into libraries allowing the application to be moved more easily. Another approach to improving performance has been to use the Argobots lightweight threading library instead of pthreads. This can increase performance scalability by an order of magnitude.
Hal himself is working on extending IR to be able to capture parallelism, allowing decisions about optimizing parallel code to be made much later—after register allocation.
Code Size Optimization
This talk was by Sjoerd Meijer on behalf of his team at ARM. In the last year they have contributed approximately 200 patches to LLVM to improve code size, particularly for the smaller M-series ARM microcontrollers. Typically 64K of ROM and 8K of RAM. Code has lots of magic constants, filling in structures and bit twiddling. There work came in four categories:
- turning off optimizations
- tuning optimizations
- improving handling of constants
- improving bit twiddling.
We have done a lot of work in this area on other architectures, and it was useful to see ARM hitting the same areas. I was perhaps surprised to see how little benefit ARM achieved. 1% on CSiBE (embedded benchmark for size) and an average of 3% on real applications. Perhaps what was more telling was that improving the ARM proprietary linker gave on average a further 8% benefit — it seems that this linker previously lacked the equivalent of -gc-sections.
The ARM team want to do more. So far they have not looked at LTO or superoptimization, both of which can make a big difference to code size. Certainly based on experiences with LLVM on other architectures, there should be a lot more to get from the compiler.
Sample Based Profiling
I learned a great deal about DWARF debug information from this talk by Gred Bedwell of Sony. His talk was about sample based profiling. This is relatively non-invasive, so it is well suited to real-time systems, like the games platforms that Sony produce. But it relies on accurate DWARF information to map code locations to source code line numbers.
The problem comes when one code location corresponds to multiple source code line numbers. At present DWARF has no way to represent this, but it is very common in optimized code. Any place where common code has been combined, or code has been hoisted, will yield this. It is the source of much confusion when debugging optimized code.
Their approach is to take advantage of clause 6.2.2 in the DWARF-4 spec, which states that for line numbers “…The compiler may emit the value 0 in cases where an instruction cannot be attributed to any source line”. Now this is perhaps a rather fine interpretation, but it turns out that using line number 0 whenever a code location corresponds to multiple source lines tends to give expected behavior. So as a heuristic it is pretty useful.
Something I intend to experiment with, not for sample based profiling, but to improve the debug experience with optimized code generated by LLVM.
LLVM for SystemZ
Ulrich Weigand of IBM regularly updates us with progress on LLVM for SystemZ. His talks are valuable because they give us a long term view on LLVM for one particular architecture. When his project started LLVM 3.3 generated code 20% slower than GCC. With his latest 4.0 compiler, performance is within 1%. Indeed briefly performance matched, but by the time he spoke, GCC had improved by 1%.
Which underlines the importance of GCC and LLVM to each other. Having competition is a good way to ensure teams strive to excel.
Day 2 Keynote: Weak Memory Concurrency
On the whole I tend not to follow the formalism of models of memory concurrency. I have found talks on the subject too technically difficult, and following the proofs too much of a challenge.
However, Viktor Vafeiadis of the Max Planck Institute turned out to be a very gifted communicator, able to explain about concurrent memory models, and their strengths and weaknesses. I have some belief I can now reason about correctness of code transformations as they affect memory access. I wouldn’t necessarily get the right answer, but I do now understand what I am trying to achieve.
The first half of the talk laid the foundations for revealing the limitations of the C11 concurrency model. But finding a problem is only half the job. Viktor also gave his own proposal for an operational semantics model of memory access, which is mathematically rigorous, accessible for programmers and feasible in hardware.
I always look forward to talks by students; in my experience some of the brightest minds around, not yet cluttered by established theory and practice. These talks were also short — enough to pique interest — but not too long if the subject turned out to be interesting. Three of the talks stood out for me.
Thierno Barry‘s talk, Automated Combination of Tolerance and Control Flow Integrity Countermeasures against Multiple Fault Attacks, looked at how to defend against fault injection by duplication of instructions. The particular challenge is when instructions are not idempotent, so must be transformed rather than simply duplicated. He used LLVM to try to maximize the use of idempotent instructions, and then where instructions had to be transformed, how to make that transformation as efficient as problem. My concern was that the approach works well enough for an orthogonal 3-address RISC architecture, but on more complex 2-address architectures the number of idempotent instructions will be much smaller, so transformation will dominate with a consequent serious impact on code size and performance.
Lawrence Esswood’s talk, ELF GOT Problems? CFI Can Help, addressed the same area, but this time focusing on control-flow integrity, particularly when using dynamically loaded libraries. If you are using DLLs, then you are already sacrificing a lot of performance by indirecting all calls. Adding Lawrence’s mechanism of a table of valid addresses and hints was efficiently implemented, adding just 4% to execution time. It is probably about as good as you can achieve with a purely software defense.
The talk which got my vote was Andres Noetzli on LifeJacket: Verifying Precise Floating-Point Optimizations in LLVM. Another speaker capable of communicating theoretical ideas clearly. In this case to validate optimizations on floating point. The point in this case being that treatment of NaN, infinity and signed zeros needs to all be preserved. A good example of why I come to conferences. Someone who can clearly introduce me to a field with which I was previously unfamiliar.
Rather like the student competition, except that there is no competition, the speakers are not necessarily students and the talks are even shorter at just 10 minutes. And like the student competition, you are sometimes exposed to something unexpected and interesting.
For me the talk of most interest was Phillip Power of Sony on the Debug Information Visual Analyzer. If you are interested in a good debug experience with compiled code, you end up looking at a lot of DWARF. And just as you wouldn’t debug C code by looking at the assembler (hopefully), so debugging DWARF should not involve reading the low level encoding. The concept is simple and in particular makes it easy to compare DWARF output when you change the compiler. The only disappointment is that the code is currently only available as a Windows binary. Sony want to upstream this, so I look forward to seeing the source soon. Otherwise I might just feel inspired to write my own version!
An Autotuning framework for LLVM
Yishen Chen could not get a visa to attend, so his supervisor, Vikram Adve of UIUC presented his work on LLVMTuner. This is a tool to search for high performance code in a search space of program variants. There have been many other approaches to this: DSP optimization, loop tiling, polyhedral loop optimization, compiler pass ordering, superoptimization and algorithmic choice. All are characterized by only working on smallish programs.
LLVMTuner is a framework to autotune larger programs (so far up to 3MLOC). A simple work flow of evaluation-search-compile. LLVMTuner works on IR, uses profiling to identify hot loops and then autotunes the hot loops. The focus is on natural loops at the top level. Selected loops and any functions the call are pulled out into a separate module. A simple heuristic is used to select loops that are big enough to be significant, but not too big to not be tractable.
The first optimization tried was pass sequence tuning (which used opentuner), tested on SPEC 2006 and Clang itself. For SPEC an average 6.5% speedup compared to -O3 -flto, with the best improvement being 37% and tuning times of up to 5 hours. For Clang a 6.7% improvement after 14 hours of autotuning.
Other optimization strategies were not successful (stack layout, code layout), but there are more optimizations to be tried. In many respects this work shares a heritage with MILEPOST/MAGEEC/TSERO and frameworks like Dividiti’s CK. All in all a very impressive piece of work by an undergraduate.
Unison: Combined Instruction Scheduling and Register Allocation
I’ve been following the work of Roberto Castañeda Lozano and his colleagues at SICS and KTH for some time. A big problem with LLVM is the separation of instruction selection, instruction scheduling and register allocation. This works fine for orthogonal RISC architectures, but is a nightmare for irregular CISC architectures with few registers. In such cases the choice of instruction affects the register allocation and vice-versa.
The Unison algorithm unifies instruction scheduling and register allocation, using a constraint solver to find what in many cases is provably the optimal instruction scheduling and register allocation. This is not an easy problem — both instruction scheduling and register allocation are NP complete problems. So there is a cost to running Unison: constraint solvers take a lot of compute. However, it can be used as a final release compilation, or perhaps more usefully to identify compiler improvements or better heuristics for optimizations.
Ultimately I’d like to see Unison incorporate the third NP Complete problem of instruction selection. It already has some of this, in that it can consider instruction variants, which in many practical situations may suffice. But for now this is a great start and good news that their work is now open source.
The final closing session was where the student prize winners, voted for by all attendees, were announced. Third for Andres Noetzli with his talk on floating point verification (my choice), second place for Lawrence Esswood with his talk on call flow integrity for shared libraries, and first place for Sam Ainsworth with his talk on software prefetch.
Hearing from true experts in their field that are focused on very specific areas of compiler technology, it can sometimes prove a challenge to fully understand talks outside your own present areas of focus. However, this year’s EuroLLVM was gifted with some very talented presenters, and I was able to catch up on a great deal of groundbreaking work and engage with some completely new ideas. And that, after all, is why scientists and engineers travel around the world to meet up at such events.