I recently attended SC16 in Salt Lake City. Here are some of my personal highlights.
Energy Efficient Supercomputing (E2SC) Workshop
Power utilization and over-provisioning
I was particularly looking forward to E2SC. The session began with a keynote from Barry Rountree motivating open questions and challenges in this area. The talk discussed how existing HPC systems do not make full use of their potential power and performance. For example, the Vulcan Blue Gene/Q typically only uses around 60% of its available power. The HPC community is interested in “doing as much science as possible within the available power budget” and maximizing performance and system throughput rather than simply saving power.
Currently, HPC systems are procured such that the power budget is reached when all nodes run at full power; this is known as worst-case provisioning. The proposed solution to the power underutilization described above is to “over-provision hardware”: that is, to buy more nodes than a worst-case provisioned system would have, but run some or all of them at lower power.
There is then an optimization problem: choose a configuration of nodes, cores per node and processor power cap that maximizes power utilization and performance for a given program. On an example benchmark, by using more nodes, fewer cores per node and a slightly lower power cap, the study found a configuration 2x faster than the worst-case provisioned system with only a small increase in average power. Since energy is average power multiplied by runtime, this is also an almost 2x improvement in energy efficiency.
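To make the shape of this optimization problem concrete, here is a minimal sketch of an exhaustive search over nodes, cores per node and power cap under a fixed power budget. The power and runtime models, all constants, and the function names are my own illustrative assumptions, not the models used in the work presented.

```python
from itertools import product

BASE_WATTS = 40.0       # assumed idle power per node (hypothetical)
WATTS_PER_CORE = 10.0   # assumed uncapped power per active core (hypothetical)

def node_power(cores, cap_watts):
    """Power one node draws: its uncapped demand, limited by the cap."""
    return min(BASE_WATTS + WATTS_PER_CORE * cores, cap_watts)

def runtime(nodes, cores, cap_watts, work=1.0e6):
    """Toy runtime model that assumes perfect scaling across nodes."""
    per_core_watts = (node_power(cores, cap_watts) - BASE_WATTS) / cores
    # Assumes core power grows with the cube of frequency.
    speed = (per_core_watts / WATTS_PER_CORE) ** (1.0 / 3.0)
    # Assumes cores on a node contend for memory bandwidth.
    effective_cores = cores / (1.0 + 0.04 * (cores - 1))
    return work / (nodes * effective_cores * speed)

def search(budget_watts):
    """Try every configuration and keep the fastest one within the budget."""
    best = None
    configs = product(range(16, 129), (4, 8, 12, 16), range(100, 201, 10))
    for nodes, cores, cap in configs:
        total_watts = nodes * node_power(cores, cap)
        if total_watts > budget_watts:
            continue  # this configuration would exceed the power budget
        seconds = runtime(nodes, cores, cap)
        if best is None or seconds < best[0]:
            best = (seconds, nodes, cores, cap, total_watts)
    return best

# Worst-case provisioning: 64 nodes, all 16 cores, no effective cap.
BUDGET = 64 * 200.0
worst_case_seconds = runtime(64, 16, 200)
best = search(BUDGET)
print("worst-case provisioned runtime:", worst_case_seconds)
print("best (seconds, nodes, cores, cap, watts):", best)
```

Because the worst-case configuration is itself part of the search space, the search can never do worse than it. Whether it does better depends entirely on the assumed models.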
Tapasya Patki spoke further on the over-provisioning work, with research into a model that estimates the total gain to be made by over-provisioning in terms of cost and performance. In particular, the work looked at how an over-provisioned system could be obtained without increasing the hardware budget by using previous-generation chips. This works because the price increase from one chip generation to the next is usually larger than the performance increase, so the same budget buys more previous-generation nodes. The model suggests that over-provisioning works best on applications that scale well.
Chip-to-chip variation and power caps
The keynote also discussed variations in chip-to-chip efficiency for the same Intel processor. With unbounded power, runtimes were fairly consistent but power varied by around 7 W. Under a power cap, the power variation was translated into a runtime variation, leaving power fairly consistent. These variations pose challenges for load balancing and reproducibility. Further analysis showed the effect power caps have on specific types of applications. Unsurprisingly, lowering the frequency on a CPU-bound application gave a bigger slowdown than on a memory-bound application.
CPU vs Platform (non-CPU) energy consumption
David Pruitt spoke about Mobile System Features Potentially Relevant to HPC and how architectural features affect energy and time. The talk distinguished between CPU and platform (non-CPU) energy consumption. Where platform energy dominates CPU energy, the best policy is to “race-to-idle”, i.e. run at top frequency and finish the execution as soon as possible. On the other hand, if CPU energy dominates platform energy, the best policy is an intermediate frequency, such that the CPU neither uses too much power going fast nor wastes too much time going slow.
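The trade-off can be illustrated with a small model. This is only a sketch under my own assumptions, not the model from the talk: CPU power grows with the cube of frequency, runtime is inversely proportional to frequency, platform power is constant while the job runs, and nothing is consumed once the job finishes. The constants and function names are hypothetical.

```python
def energy(freq_ghz, platform_watts, k=2.0, work=100.0):
    """Energy in joules to finish a fixed amount of work at a given frequency."""
    cpu_watts = k * freq_ghz ** 3   # assumed cubic CPU power model
    seconds = work / freq_ghz       # assumed runtime proportional to 1/frequency
    return (platform_watts + cpu_watts) * seconds

def best_frequency(platform_watts, freqs):
    """Return the frequency on the grid that minimizes total energy."""
    return min(freqs, key=lambda f: energy(f, platform_watts))

# Candidate frequencies from 1.0 GHz to 3.0 GHz in 0.1 GHz steps.
FREQS = [step / 10 for step in range(10, 31)]

# Platform power dominates: the energy-optimal choice is the top frequency.
platform_heavy = best_frequency(200.0, FREQS)

# CPU power dominates: the optimum lies strictly inside the range.
cpu_heavy = best_frequency(10.0, FREQS)

print("platform-dominated optimum (GHz):", platform_heavy)
print("CPU-dominated optimum (GHz):", cpu_heavy)
```

Under these assumptions the continuous optimum is the cube root of `platform_watts / (2 * k)`, so the more the platform draws relative to the CPU, the further the best frequency moves towards the maximum.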
Other interesting talks
The Hartree Centre presented two papers. John Baker showed how energy aware scheduling policies led to energy savings of 12% on BlueWonder. Milos Puzovic presented work on Quantifying Energy Use in a Dense Shared Memory HPC Node.
Bilge Acun spoke on pre-emptive fan control, which starts cooling before the temperature of a core rises in order to avoid performance throttling caused by temperature peaks.
Some of the above topics were also discussed in the Power-Aware High Performance Computing tutorial. In addition, a system called Adagio was presented which, during runtime, seeks to identify processors whose tasks are not on the critical path and can therefore have their frequencies reduced.
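The idea of slowing down work that is off the critical path can be sketched as follows. This is not Adagio's actual algorithm, which makes its decisions at runtime. It is a simplified static illustration that assumes each rank's compute time before a synchronization point is known in advance and that runtime is inversely proportional to frequency. The function name and frequency limits are hypothetical.

```python
def slack_frequencies(compute_seconds, f_max_ghz=2.6, f_min_ghz=1.2):
    """Pick the lowest frequency per rank that does not delay the slowest rank.

    compute_seconds holds each rank's time at f_max_ghz before a barrier.
    """
    critical = max(compute_seconds)  # the slowest rank defines the critical path
    freqs = []
    for t in compute_seconds:
        # Stretch this rank's work to fill its slack, assuming time ~ 1/frequency.
        f = f_max_ghz * t / critical
        # Never go below the lowest frequency the hardware is assumed to support.
        freqs.append(max(f_min_ghz, f))
    return freqs

times = [10.0, 8.0, 5.0, 2.0]
freqs = slack_frequencies(times)
# Time each rank would take at its reduced frequency.
stretched = [t * 2.6 / f for t, f in zip(times, freqs)]
print(freqs)
print(stretched)
```

The rank on the critical path keeps the top frequency, and every other rank finishes no later than it does.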
Later in the week I attended the Green500 session. The Green500 ranks the Top500 supercomputers in terms of energy efficiency. The US Department of Energy has set a target to build a 20 MW exascale supercomputer by 2020. Interestingly, the Green500 summary states that an exascale system with the same level of efficiency as the current greenest supercomputer would draw 105.7 MW.
The session highlighted some of the issues encountered in collecting accurate measurements for the Green500 list. One of the major issues is whether participants measure the whole benchmark run or only part of it, since power consumption drops towards the end of the execution and this drop has grown in recent years.
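The gap between those two power figures can be turned into efficiency numbers with a little arithmetic. The sketch below assumes that "exascale" means a sustained 10^18 floating-point operations per second. The variable and function names are mine.

```python
EXAFLOP = 1e18  # assumed definition of exascale: 10^18 FLOPS

def gflops_per_watt(flops, watts):
    """Energy efficiency in GFLOPS per watt."""
    return flops / watts / 1e9

# Efficiency implied by an exascale system drawing 105.7 MW.
implied = gflops_per_watt(EXAFLOP, 105.7e6)

# Efficiency needed to reach exascale within 20 MW.
required = gflops_per_watt(EXAFLOP, 20e6)

# Factor by which efficiency must still improve.
gap = required / implied

print(f"implied by 105.7 MW: {implied:.2f} GFLOPS/W")
print(f"required for 20 MW:  {required:.2f} GFLOPS/W")
print(f"improvement needed:  {gap:.2f}x")
```

In other words, the 20 MW target corresponds to about 50 GFLOPS/W, a little over five times the roughly 9.5 GFLOPS/W implied by the 105.7 MW figure.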
As well as attending talks I visited the Emerging Technologies room where I heard about CU2CL, a Clang Tool which provides a source-to-source translation from CUDA to OpenCL, thus allowing a wider range of devices to be targeted. In some cases converting to OpenCL and changing platform leads to significant performance increases.
It is possible to target FPGAs with CU2CL; however, such devices tend to take a performance hit as they favour pipeline parallelism rather than data parallelism. This relates to another talk at the conference, in which Hamid Reza Zohouri demonstrated how implementing FPGA-specific optimisations on OpenCL kernels achieved FPGA performance comparable to an Intel Xeon and better power efficiency than an NVIDIA K20. A remaining challenge is to make the compiler automatically optimize such programs for FPGAs.
Also in the Emerging Technologies room I learned about the QSigma Reconfigurable Compile-Time Superscalar Computer. This is a VLIW architecture in which superscalar execution and multi-threading are controlled at compile time, rather than in hardware as is traditional. This eliminates the need for superscalar and multi-threading hardware units and is intended to reduce energy consumption and free up space on the chip.
On the final day I opted for the workshop on Numerical Reproducibility at Exascale (NRE2016). The scope ranged from reproducibility in hardware and software, to reproducibility in scientific papers. Michael Wolfe asked what the compiler can do to aid reproducibility and discussed how it might also break reproducibility. For example, the fused multiply-add operation gives different results depending on the order in which it is executed and different compilers produce different orderings.
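One way such differences arise is that a fused multiply-add rounds once, whereas a separate multiply and add round twice, so the result depends on whether the compiler fuses the operations. The sketch below is my own illustration, not an example from the talk. It emulates the fused form with exact rational arithmetic from Python's `fractions` module rather than calling a hardware instruction, and the input values are chosen specifically to expose the difference.

```python
from fractions import Fraction

def unfused(a, b, c):
    """Multiply then add: the product is rounded before the addition."""
    return a * b + c

def fused(a, b, c):
    """Emulated fused multiply-add: compute exactly, then round once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)

# Inputs chosen so that the exact product, 1 - 2**-54, is not representable
# as a double and rounds to 1.0 when the multiply is done on its own.
a = 1.0 + 2.0 ** -27
b = 1.0 - 2.0 ** -27
c = -1.0

print("unfused:", unfused(a, b, c))
print("fused:  ", fused(a, b, c))
```

The two functions compute the same mathematical expression, yet the unfused version loses the small term entirely while the fused version keeps it.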
Ideas for how the compiler might help reproducibility included: running parallel loops in various orders to check the result is the same, automatic comparison between two compilations to see if one is a fair implementation of the other, and the ability to flag critical parts of code that must be reproducible. I look forward to reading the full paper on this topic when it is available.
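The first of these ideas, running a loop in various orders and checking that the result is unchanged, can be sketched for a simple sum reduction. This is a hypothetical illustration of the check, not a description of any compiler feature. It uses an explicit accumulation loop rather than Python's built-in `sum`, because recent Python versions use compensated summation there and that would hide the effect.

```python
from itertools import islice, permutations

def accumulate(values):
    """Plain left-to-right floating-point sum."""
    total = 0.0
    for v in values:
        total += v
    return total

def distinct_results(values, max_orders=1000):
    """Sum the values in up to max_orders different orders and collect the results."""
    orders = islice(permutations(values), max_orders)
    return {accumulate(order) for order in orders}

# Small integers sum exactly, so every order gives the same answer.
stable = distinct_results([1.0, 2.0, 3.0, 4.0])

# Mixing very large and small magnitudes makes the sum order-dependent.
unstable = distinct_results([1e16, 1.0, -1e16, 1.0])

print("stable:", stable)
print("unstable:", unstable)
```

A reduction whose result set contains more than one value is a candidate for flagging as not reproducible under reordering.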
The following are a few posters that caught my attention during the conference:
- Performance and Energy Aware Workload Partitioning on Heterogeneous Platforms
- Quantization for Energy Efficient Convolutional Neural Networks
- Discovering Energy Resource Usage Patterns on Scientific Clusters
All in all this was a successful trip with lots of thought-provoking content.