I recently attended ISC 2017 in Frankfurt. Here are my personal highlights from the main conference, as well as the tutorial and workshop days.
Top500 and Green500 Rankings
In the Top500 there was little change to the top ten other than the upgraded Piz Daint system moving from eighth to third place. In contrast, the Green500 saw nine new appearances in its top ten, including the new number one, TSUBAME3.0, with an energy efficiency of 14.1 GFlops/W. An exascale system with the same efficiency would draw 70.9 MW, which is around 3.5 times higher than the US Department of Energy’s 20 MW goal. This is the first time this extrapolation has fallen below 100 MW.
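The extrapolation is simple arithmetic: divide a 1 EFlop/s target by the achieved efficiency. A quick sketch using the figures above:

```python
# Extrapolate the power draw of an exascale (10^18 Flop/s) system
# from a measured Green500 efficiency figure.

def exascale_power_mw(gflops_per_watt):
    """Power in MW needed to sustain 1 EFlop/s at the given efficiency."""
    exaflop = 1e18                                # target: 10^18 Flop/s
    watts = exaflop / (gflops_per_watt * 1e9)     # P = Flop/s / (Flop/J)
    return watts / 1e6                            # W -> MW

print(round(exascale_power_mw(14.1), 1))  # TSUBAME3.0's 14.1 GFlops/W -> 70.9
```

Inverting the formula, the 20 MW goal corresponds to a required efficiency of 50 GFlops/W.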
Entrants can choose to submit two measurements to the lists: one power-optimized and one performance-optimized. Third-placed Piz Daint made a power-optimized submission that achieved a 28% reduction in power at the cost of a 13% drop in performance.
There is a clear trend in technology used by the greenest ten: they all use Intel Xeon processors and nine also use NVIDIA Tesla P100 GPUs. Furthermore, the greenest 16 systems are all heterogeneous. The greenest homogeneous system at 6.1 GFlops/W is the Sunway TaihuLight — number one in the Top500.
The latest HPCG rankings were announced in the HPCG BOF session. HPL and HPCG are considered ‘bookends’ on performance, with HPL being very optimistic and HPCG being more realistic for many scientific applications. The latest rankings consist of 110 entries, including 46 from the Top500 — the HPCG performance for each of these 46 is less than 6% of their corresponding HPL result.
Metrics for Energy-Aware Software Optimization
Stephen Roberts presented a study that proposes and evaluates combined energy and time metrics for software optimization. In multi-objective optimization it is often desirable to combine objectives into a single metric on which to minimize or maximize.
The talk showed how Energy Delay Product (EDP) metrics that originated in hardware design have several undesirable properties. For example, the cost of a 1-second or 1-joule increase is not fixed, and combining the EDPs of multiple functions gives unintuitive results. EDP is also biased towards speeding up already-fast programs and reducing energy on already-efficient programs.
Six desirable properties for ideal metrics were identified and two new metrics proposed: Energy Delay Sum (EDS) and Energy Delay Distance (EDD). The EDD is a directed metric which makes it more suitable for guiding optimization efforts than EDS. These new metrics can be expressed in tangible units (e.g. dollars) based on user-defined weights on the relative cost of time and energy.
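To make the "cost of a 1-second increase is not fixed" point concrete, here is the standard hardware-design formulation of EDP and its marginal costs (the precise EDS and EDD definitions are in the paper, so only EDP is formalized here):

```latex
\mathrm{EDP}_w = E \cdot T^{\,w}, \qquad w \ge 1 \quad (\text{plain EDP: } w = 1)

\frac{\partial\, \mathrm{EDP}_1}{\partial T} = E,
\qquad
\frac{\partial\, \mathrm{EDP}_1}{\partial E} = T
```

An extra second therefore costs $E$ and an extra joule costs $T$: the marginal cost of each objective depends on the current value of the other, which is the non-fixed cost (and the resulting bias) described above.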
Suitability of Intel and ARM processors for HPC
Vojtech Nikl’s talk on the suitability of Intel and ARM processors for HPC compared Intel Xeon (Haswell) and ARM big.LITTLE (Cortex-A7/A15) clusters in terms of performance and energy efficiency. In general, the more powerful Intel processor achieved the best performance on each benchmark, but in contrast the ARM processors were more energy-efficient in many cases. The ARM processors were most efficient when the CPU frequency was set close to the DRAM frequency.
Analysis of Core- and Chip-level Architectural Features in Four Generations of Intel Server Processors
Part of this talk by Johannes Hofmann discussed how uncore frequency impacts performance and energy efficiency. In Intel processors the uncore consists of elements that are not in the cores, such as the L3 cache and memory controllers. Just as cores have a clock frequency, so does the uncore. Starting from Haswell, the uncore frequency can be configured independently of the core frequency.
Overriding the processor’s automatic Uncore Frequency Scaling (UFS) can improve energy efficiency. On the HPCG benchmark, UFS chose the maximum uncore frequency of 2.8 GHz, but performance gains actually diminished after 2.0 GHz as main memory became the bottleneck. Running at 2.0 GHz was 26% more efficient than 2.8 GHz. Another experiment on HPL showed that setting the uncore frequency too high can lower performance, as the uncore and cores compete for chip power.
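On these chips the uncore limits live in a model-specific register (MSR 0x620, MSR_UNCORE_RATIO_LIMIT), whose layout I take here from Linux's intel_uncore_frequency driver: bits 0–6 hold the maximum and bits 8–14 the minimum ratio, in 100 MHz units. Actually writing the MSR needs root and a tool such as msr-tools or likwid; as an illustrative sketch, here is just the pure encoding step:

```python
# Encode desired uncore min/max frequencies (GHz) into the value written
# to MSR 0x620 (MSR_UNCORE_RATIO_LIMIT) on Intel server parts.
# Assumed layout (per Linux's intel_uncore_frequency driver):
# bits 0-6 = max ratio, bits 8-14 = min ratio, one ratio step = 100 MHz.

UNCORE_RATIO_UNIT_GHZ = 0.1  # 100 MHz per ratio step

def encode_uncore_limits(min_ghz, max_ghz):
    min_ratio = round(min_ghz / UNCORE_RATIO_UNIT_GHZ)
    max_ratio = round(max_ghz / UNCORE_RATIO_UNIT_GHZ)
    if not (0 <= min_ratio <= 0x7F and 0 <= max_ratio <= 0x7F):
        raise ValueError("ratio outside 7-bit field")
    return (min_ratio << 8) | max_ratio

# Pinning the uncore to 2.0 GHz, as in the HPCG experiment above:
print(hex(encode_uncore_limits(2.0, 2.0)))  # 0x1414
```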
The forum was made up of a series of lightning talks followed by a poster session. Here are two talks that caught my attention.
Improving Energy Efficiency through Vectorization – Thomas Jakobs demonstrated how vectorization impacts the energy and power consumption of matrix multiplication on an Intel processor. Vectorization reduced energy consumption at all frequencies, but the relationship with power consumption was more complex. Interestingly, the fused multiply-add instruction reduced energy consumption but increased power consumption.
Analysis and Modeling of the Energy Consumption of DVFS Processors for Parallel Scientific Computing – this work presented by Matthias Stachowski aims to reduce energy consumption without sacrificing runtime. As part of the project, three new metrics were introduced to capture energy/power and time: Energy Per Speedup (EPS), Power Increase factor (PI) and Relative Power Increase (RPI).
Since the power optimization tutorial was canceled at short notice, I selected the tutorials on programming and optimizing for KNL instead. Topics included optimizing OpenMP performance, refactoring loops to enable auto-vectorization — particularly desirable on the KNL in order to exploit the two 512-bit vector processing units in each core — and memory modes. The KNL has two types of main memory (high-capacity DDR4 and high-bandwidth MCDRAM) and the way in which they are operated can dramatically affect performance. The tutorials were helpful both in terms of general and KNL-specific strategies for optimizing code.
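The memory-mode decision often comes down to whether the working set fits in MCDRAM. A small illustrative helper (the 16 GB MCDRAM capacity is KNL's published figure; the decision rule itself is my simplification of the tutorial advice, not code from the tutorials):

```python
# Rough placement heuristic for KNL flat mode: bandwidth-bound arrays
# that fit in the 16 GB MCDRAM belong there; everything else stays in
# high-capacity DDR4. In practice this maps to numactl or memkind's
# hbw_malloc; here we only model the decision.

MCDRAM_BYTES = 16 * 2**30  # KNL ships with 16 GB of on-package MCDRAM

def placement(working_set_bytes, bandwidth_bound=True):
    """Suggest a memory pool for an allocation in flat mode."""
    if bandwidth_bound and working_set_bytes <= MCDRAM_BYTES:
        return "MCDRAM"
    return "DDR4"

print(placement(4 * 2**30))                         # fits -> MCDRAM
print(placement(64 * 2**30))                        # too big -> DDR4
print(placement(4 * 2**30, bandwidth_bound=False))  # latency-bound -> DDR4
```

In cache mode this choice is made by the hardware instead, with MCDRAM acting as a direct-mapped cache in front of DDR4.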
Energy Aware HPC Workshop
Full program and slides: http://www.ena-hpc.org/program.html
Topics included power modeling and measurement, energy-efficient hardware features, power-constrained computing, dynamic auto-tuning, and the impact of compiler choice on energy. Here are a few highlights:
Robert Schöne presented an overview of the READEX project, which aims to automatically tune system parameters such as core frequency and c-states to increase the energy efficiency of HPC applications. In particular the approach aims to tune parameters to optimize for changing resource requirements at different points in the application (e.g. clocking down a core that is waiting to synchronize).
In Kashif Nizam Khan’s talk on Analyzing the Power Consumption Behavior of a Large Scale Data Center, I was surprised to hear that unsuccessful jobs accounted for 43% of CPU time (although it was noted that in some cases timed-out jobs may still provide useful information). This is a case where reporting energy consumption to the user could encourage them to consider the impact of the jobs they run.
Armin Jäger compared the execution time and energy consumption of HPCG when compiled with GCC and ICC. The study compiled HPCG with -O0 to -O3 and ran at different frequencies on four Intel Ivy Bridge Xeon processors. GCC produced significantly more energy-efficient code than ICC (up to 55% more efficient), but in terms of runtime the two were much closer and there was no clear winner overall. More research is needed to explain why GCC gave more energy-efficient code.
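Energy comparisons like this one typically read the RAPL counters, e.g. via Linux's powercap sysfs interface (`/sys/class/powercap/intel-rapl:0/energy_uj`). The counter wraps to zero at `max_energy_range_uj`, so the delta needs wraparound handling. A sketch of that step — the sysfs paths follow the powercap interface, but the surrounding measurement harness is assumed; only the pure delta function is shown in full:

```python
# Joules consumed between two readings of a RAPL energy counter
# (microjoules, wrapping to 0 at max_range_uj). Reading the files
# requires a Linux machine with an Intel RAPL-capable CPU.

RAPL_ENERGY = "/sys/class/powercap/intel-rapl:0/energy_uj"
RAPL_RANGE = "/sys/class/powercap/intel-rapl:0/max_energy_range_uj"

def read_uj(path):
    with open(path) as f:
        return int(f.read())

def energy_joules(before_uj, after_uj, max_range_uj):
    """Wraparound-corrected energy delta in joules."""
    delta = after_uj - before_uj
    if delta < 0:               # counter wrapped between the readings
        delta += max_range_uj
    return delta / 1e6

# Typical use, bracketing a benchmark run:
#   e0 = read_uj(RAPL_ENERGY); run_benchmark(); e1 = read_uj(RAPL_ENERGY)
#   joules = energy_joules(e0, e1, read_uj(RAPL_RANGE))
```

Note that RAPL reports package energy, so background activity on the socket is included in the measurement.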
This was another successful trip to ISC and it was great to see energy featuring more on the agenda this year.