Power and Thermal Management Runtimes for HPC Applications in the Era of Exascale Computing
KEYWORDS: HPC, Power Capping, Thermal and Power Management, MPI, Runtime, NAS, Quantum ESPRESSO, ILP
Daniele Cesarini‡, Andrea Bartolini‡, Carlo Cavazzoni†, Luca Benini‡*
‡DEI - University of Bologna, †SCAI - CINECA, *IIS - ETH Zurich
Power Wall · Thermal Heterogeneity · Power Capping · Thermal Control · Energy Efficiency

3) POWER CAP EXPLORATION
5) ENERGY SAVING - COUNTDOWN
[1] Bergman et al. "Exascale computing study: Technology challenges in achieving exascale systems." DARPA IPTO, Tech. Rep. 15 (2008).
[2] Beneventi et al. "Cooling-aware node-level task allocation for next-generation green HPC systems." HPCS 2016, IEEE.
Exascale computing in 2021: at today's efficiency an exascale machine would draw ~70 MW, while a feasible exascale power budget is ≤ 20 MW [1], i.e. 50 GFLOP/W is required!
The fourth one, Tianhe-2 (ex 1st), consumes 17.8 MW for "only" 33.2 PetaFLOPs, but the power consumption of the cooling system matters too: up to 24 MW in total.
Intel Haswell E5-2699 v3 (18 cores): up to 24°C temperature difference on die; more than 7°C thermal heterogeneity under the same workload [2] (HPC CPU working range: 20–70°C).
[Figure: per-core power (W) vs. core voltage (V) at 1.2, 1.5, 1.8, 2.1, and 2.4 GHz]
Dynamic Thermal Management (DTM)
Data Center Power Consumption: …but supercomputer workloads rarely cause worst-case power consumption…
HW mechanisms (Intel RAPL): low overhead, fine granularity (milliseconds), but no workload/application awareness.
SW mechanisms: high overhead, coarse granularity (seconds), but workload/application awareness.
• Off-line allocation and scheduling approaches [5].
• Mechanisms to distribute a slice of the total system power budget to each computing element [3].
• Predictive models to estimate the power consumption [4].
[Figure: DVFS configurations (1.4–2.4 GHz) selected from a temperature prediction]
[Figure: system power (W) over time under a power cap]
Adagio [7] predicts application tasks and slows down the frequency of processors that are waiting in synchronization primitives.
Power Capping: I study and evaluate the benefits of relaxing the power-capping constraint in HPC compute nodes, and how scientific applications can take advantage of large time windows for power bounding [3].
RAPL vs Frequency Cap — Execution Time Comparison

AVG Freq   FF Cap     RAPL Cap   RAPL time    FF time      F vs R
1.5 GHz    95.56 W    94.81 W    328.16 sec   311.16 sec   5.10%
1.8 GHz    111.86 W   110.63 W   274.16 sec   274.11 sec   0.11%
2.1 GHz    122.87 W   120.71 W   254.59 sec   247.60 sec   2.75%
2.4 GHz    134.44 W   131.32 W   239.65 sec   231.19 sec   3.53%

AVG Frequency Comparison (low to high frequency)

FF         RAPL       F vs R
1499 MHz   1766 MHz   -15.11%
1797 MHz   2144 MHz   -16.22%
2094 MHz   2323 MHz   -9.86%
2392 MHz   2476 MHz   -3.37%

FF (Fixed Frequency): power consumption at the fixed frequency — the DVFS power cap is respected.
RAPL (HW Power Control): power limited by RAPL.
https://github.com/EEESlab/countdown
Low CPI, high SIMD, low memory bandwidth → high performance (high FLOPS).
High CPI, low SIMD, high memory bandwidth → memory bound!
RAPL: most of the time at high frequency, but with low performance!
Thermal Control: I have proposed a novel technique for optimal thermal control and job mapping in HPC systems that maximizes performance and energy efficiency while ensuring a safe working temperature [10,11,12,13].
Energy Saving: I developed a methodology and a tool for identifying and automatically reducing the power consumption of HPC computing elements during communication. It is not based on prediction mechanisms or learning approaches [14,15], but implements a very effective reactive mechanism.
Priority process at high frequency!
[Figure: OTC overview — the Thermal-aware task Mapper and Controller (TMC) acts at scheduling points (a) and (b), while the Energy-Aware MPI Wrapper (EAW) lowers core frequencies during MPI phases; per-core frequencies are selected so that core temperatures stay below Tmax over 1 s, 10 s, 100 s, and steady-state (SS) windows; FSP and ISP denote the two policy periods]
Block diagram of the experimental set-up: a compute node runs Quantum ESPRESSO (QE) on the MPI runtime; workload (WL) traces feed the policy and the solver.
Frequency Allocator – Average Results
[Figure: average allocated frequency (1200–2400 MHz) for FSP–ISP settings of 1 s, 10 s, 100 s, and SS, shown for the highest-priority core and the global average; SS: steady state]
Baseline: Static Thermal Optimization
Cumulative Overheads
[Figure: time overhead (0–12%) per FSP–ISP configuration (1 s, 10 s, 100 s, SS); per-configuration gains and overheads are reported in the table]
FSP – ISP proactive policy performance

FSP – ISP           AVG Gain   Overhead FSP   Overhead ISP   Performance Gain
1 sec – 1 sec       4.97%      0.02%          9.56%          -4.61%
10 sec – 10 sec     4.50%      0.02%          0.58%          3.90%
100 sec – 100 sec   3.65%      0.98%          0.09%          2.58%
SS – 1 sec          7.46%      0.49%          10.86%         -3.89%
SS – 10 sec         7.06%      0.49%          0.72%          5.85%
SS – 100 sec        3.65%      0.49%          0.10%          3.06%
COUNTDOWN implements an asynchronous mechanism based on a callback/timer that reduces the core's frequency after 500 µs spent in MPI primitives.
NAS Parallel Benchmarks: a small set of programs designed to help evaluate the performance of parallel supercomputers.
Computing resources: 1024 cores (29 nodes)
AVG overhead: <5% — Energy/power saving: 6%–50%
Quantum ESPRESSO PWscf:
• QE-PWscf-EU: Expert User
• QE-PWscf-NEU: Non-Expert User
Computing resources: 3456 cores (96 nodes)
Overhead: <6% — Energy/power saving: 22%–43%
[Figure: example application trace — four processes P1–P4 with compute bursts of 1–6 s separated by barriers B1–B7 (Bn = barrier constraints); over the 15 s window the aggregate time is 44 s of APP and 16 s of MPI]
[Figure: COUNTDOWN core logic — when a process enters an MPI primitive, the wrapper registers a timed callback; if the primitive outlasts the callback delay, the callback sets a low P-state (min frequency), and on return to the application the P-state is reset (max frequency); when the primitive ends before the delay, the pending callback is disabled and no DVFS change occurs. The MPI library itself is unmodified.]
Integer Linear Programming Model
Thermal Control Runtime
Focus on: exploration of power-capping strategies used in real HPC system nodes.
Characterization of Intel RAPL vs. a fixed-frequency allocation on an HPC application to power-constrain a compute node.
4) HPC OPTIMAL THERMAL CONTROL (OTC)
Main(){
    // Initialize MPI
    MPI_Init()
    // Get the number of procs
    MPI_Comm_size(size)
    // Get the rank
    MPI_Comm_rank(rank)
    // Print a hello world
    printf("Hello world from rank: %rank%, size: %size%")
    // Finalize MPI
    MPI_Finalize()
}
MPI_$CALL_NAME$(){
    Prologue()
    PMPI_$CALL_NAME$()
    Epilogue()
}

Prologue(){
    Profile()
    Event(START)
}

Epilogue(){
    Event(END)
    Profile()
}
// PMPI Interface
PMPI_Init() {…}
PMPI_Comm_size() {…}
PMPI_Comm_rank() {…}
PMPI_Finalize() {…}

// MPI Interface
MPI_Init() {…}
MPI_Comm_size() {…}
MPI_Comm_rank() {…}
MPI_Finalize() {…}
[Figure: app.x is dynamically linked against libcntd.so and libmpi.so; libcntd.so intercepts the MPI interface and forwards each call to libmpi.so through the PMPI interface]
We need almost 3.5x more energy efficiency!
[3] Cesarini et al. "Benefits in Relaxing the Power Capping Constraint." Proceedings of the 1st Workshop on AutotuniNg and aDaptivity AppRoaches for Energy efficient HPC Systems. ACM, 2017.
[10] Cesarini et al. "Prediction horizon vs. efficiency of optimal dynamic thermal control policies in HPC nodes." VLSI-SoC'17.
[11] Cesarini et al. "Energy Saving and Thermal Management Opportunities in a Workload-Aware MPI Runtime for a Scientific HPC Computing Node." Parallel Computing is Everywhere 32 (2018): 277.
[12] Bartolini A., Diversi R., Cesarini D., and Beneventi F. "Self-Aware Thermal Management for High-Performance Computing Processors." D&T'18.
[13] Cesarini et al. "Modeling and Evaluation of Application-Aware Dynamic Thermal Control in HPC Nodes." IFIP/Springer, 2018.
[3] Eastep et al. "Global extensible open power manager: a vehicle for HPC community collaboration toward co-designed energy management solutions." SC'16.
[4] Borghesi et al. "MS3: a Mediterranean-Stile Job Scheduler for Supercomputers - do less when it's too hot!" HPCS'15.
[5] Rudi et al. "Optimum: Thermal-aware task allocation for heterogeneous many-core devices." HPCS'14.
[6] Hanumaiah et al. "Performance optimal online DVFS and task migration techniques for thermally constrained multi-core processors." TCAD'11.
[7] Rountree et al. "Adagio: making DVS practical for complex HPC applications." ISC'09.
[8] Lim et al. "Adaptive, transparent frequency and voltage scaling of communication phases in MPI programs." SC'06.
[9] Kerbyson et al. "Energy templates: Exploiting application information to save energy." CLUSTER'11.
Top500
1) OVERVIEW
2) STATE OF THE ART
HPC systems have large time windows to respect the power constraints (seconds, not milliseconds!)
Several works have been presented in this research field, but most of them concern embedded systems!
• On-line optimization policies [6]
Learning approaches can lead to significant misprediction errors in irregular applications! [9]
Lim et al. [8] present an MPI runtime that dynamically reduces CPU performance during communication phases of MPI programs, using a learning approach to identify those regions.
Methodology — characterization of the thermal and power model of an HPC system:
Thermal characterization: we extracted the thermal properties of a computing node.
Power characterization: we characterized the power consumption in different situations.
Study of the workload model: a scientific quantum-chemistry parallel application (Quantum ESPRESSO); we characterize the sensitivity of each application task to frequency variation.
HPC Optimal Thermal Control (OTC): a DTM ILP formulation for proactive thermal management: limiting the future temperature of all cores using per-core DVFS mechanisms; maximizing application performance; slowing down the cores' frequency during communication phases; providing knobs to match the computation-to-communication ratio with the frequency selection (priority).
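One plausible shape of such an ILP is sketched below; the symbols and the linear thermal model are illustrative reconstructions, not the exact formulation of [10,13]. Per-core frequencies f_c (chosen among the discrete DVFS levels, typically via binary selector variables to keep the model linear) maximize priority-weighted performance while the predicted temperatures stay below Tmax over the prediction horizon H:

```latex
\begin{aligned}
\max_{f_1,\dots,f_N}\quad & \sum_{c=1}^{N} w_c\, f_c
      && \text{($w_c$: priority of core $c$)}\\
\text{s.t.}\quad & \hat{T}_c(t+k) = \sum_{j=1}^{N} a_{cj}\, T_j(t+k-1) + b_c\, P_c(f_c)
      && \text{(linear thermal prediction)}\\
& \hat{T}_c(t+k) \le T_{\max}
      && \forall c \in \{1,\dots,N\},\ \forall k \in \{1,\dots,H\}\\
& f_c \in \{F_{\min},\dots,F_{\max}\}
      && \text{(discrete DVFS levels)}
\end{aligned}
```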
Evaluation: the potential impact of OTC in terms of performance, QoS, overhead, maximum temperature, and prediction horizon.
Our results show that: a long time horizon for job pinning and medium time horizons for the online DVFS selection yield up to 6% performance gain (including overheads); APP/MPI phases longer than 500 µs manifest sensitivity to DVFS changes.
[Figure: MPI phases vs. application phases — lowering the frequency during application phases causes performance overhead; staying at high frequency during MPI phases wastes energy; no DVFS changes are applied to short phases]
[14] Cesarini et al. "COUNTDOWN: A Run-time Library for Application-agnostic Energy Saving in MPI Communication Primitives." ANDARE'18.
[15] Cesarini et al. "COUNTDOWN: a Run-time Library for Performance-Neutral Energy Saving in MPI Applications." https://arxiv.org/abs/1806.07258, 2018.
[16] Cesarini et al. "COUNTDOWN Slack: a Run-time Library to Reduce Energy Footprint in Large-scale MPI Applications." https://arxiv.org/abs/1909.12684, 2019.
COUNTDOWN is able to profile HPC applications with negligible overhead.
COUNTDOWN is MPI-library neutral and can be used with different vendor implementations (Intel MPI, OpenMPI, MPICH, etc.).
COUNTDOWN is able to drastically reduce the power consumption of scientific applications, by up to 50%, with negligible overhead (below 6%).
Capabilities of COUNTDOWN
Overhead analysis in a follow-up work [16]
Lesson learned:
• RAPL incurs a penalty of 2.87% on average, and up to 5.10%, w.r.t. a fixed-frequency allocation,
• even though RAPL shows a higher average frequency over the entire application time!