Power and Thermal Management Runtimes for HPC Applications in the Era of Exascale Computing
KEYWORDS: HPC, Power Capping, Thermal and Power Management, MPI, Runtime, NAS, Quantum ESPRESSO, ILP
Daniele Cesarini‡, Andrea Bartolini‡, Carlo Cavazzoni†, Luca Benini‡*
‡DEI - University of Bologna, †SCAI - CINECA, *IIS - ETH Zurich
Power Wall · Thermal Heterogeneity · Power Capping · Thermal Control · Energy Efficiency

3) POWER CAP EXPLORATION
5) ENERGY SAVING - COUNTDOWN
[1] Bergman et al. "Exascale computing study: Technology challenges in achieving exascale systems." DARPA IPTO, Tech. Rep. 15 (2008).
[2] Beneventi et al. "Cooling-aware node-level task allocation for next-generation green HPC systems." HPCS 2016, IEEE.
Exascale computing in 2021: at today's efficiency an exascale machine would draw ~70 MW, while a feasible exascale power budget is ≤ 20 MW [1], i.e. 50 GFLOP/W is required!
The fourth one, Tianhe-2 (ex 1st), consumes 17.8 MW for "only" 33.2 PetaFLOPs, but the power consumption of the cooling system matters too: up to 24 MW in total.
Intel Haswell E5-2699 v3 (18 cores): up to 24°C temperature difference on die; more than 7°C thermal heterogeneity under the same workload [2] (HPC CPU working range: 20–70°C).
[Figure: per-core power (W) vs. core voltage (V) at 1.2, 1.5, 1.8, 2.1, and 2.4 GHz]
Dynamic Thermal Management (DTM)
Data Center Power Consumption: …but supercomputer workloads rarely cause worst-case power consumption…
HW mechanisms (Intel RAPL): low overhead, fine granularity (milliseconds), but no workload/application awareness.
SW mechanisms: high overhead, coarse granularity (seconds), but workload/application awareness.
• Off-line allocation and scheduling approaches [5].
• Mechanisms to distribute a slice of the total system power budget to each computing element [3].
• Predictive models to estimate the power consumption [4].
[Figure: DVFS configurations (1.4–2.4 GHz) selected from a temperature prediction]
[Figure: system power (W) over time under a power cap]
Adagio [7] predicts application tasks and slows down the frequency of processors that are waiting in synchronization primitives.
Power Capping: I study and evaluate the benefits of relaxing the power-capping constraint in HPC compute nodes, and how scientific applications can take advantage of large time windows for power bounding [3].
RAPL vs Frequency Cap — Execution Time Comparison

AVG Freq   FF Cap     RAPL Cap   RAPL time    FF time      F vs R
1.5 GHz    95.56 W    94.81 W    328.16 sec   311.16 sec   5.10%
1.8 GHz    111.86 W   110.63 W   274.16 sec   274.11 sec   0.11%
2.1 GHz    122.87 W   120.71 W   254.59 sec   247.60 sec   2.75%
2.4 GHz    134.44 W   131.32 W   239.65 sec   231.19 sec   3.53%

AVG Frequency Comparison (low to high frequency)

FF         RAPL       F vs R
1499 MHz   1766 MHz   -15.11%
1797 MHz   2144 MHz   -16.22%
2094 MHz   2323 MHz   -9.86%
2392 MHz   2476 MHz   -3.37%

FF (Fixed Frequency): power consumption at the fixed frequency — the DVFS power cap is respected.
RAPL (HW Power Control): power limited by RAPL.
https://github.com/EEESlab/countdown
Low CPI, high SIMD, low memory bandwidth → high performance (high FLOPS).
High CPI, low SIMD, high memory bandwidth → memory bound!
RAPL: most of the time at high frequency, but with low performance!
Thermal Control: I have proposed a novel technique for optimal thermal control and job mapping in HPC systems that maximizes performance and energy efficiency while ensuring a safe working temperature [10,11,12,13].
Energy Saving: I developed a methodology and a tool for identifying and automatically reducing the power consumption of HPC computing elements during communication. It is not based on prediction mechanisms or learning approaches [14,15], but implements a very effective reactive mechanism.
Priority process at high frequency!
[Figure: OTC overview — the Thermal-aware task Mapper and Controller (TMC) acts at scheduling points (a) and (b), while the Energy-Aware MPI Wrapper (EAW) lowers core frequencies during MPI phases; per-core frequencies are selected so that core temperatures stay below Tmax over 1 s, 10 s, 100 s, and steady-state (SS) windows; FSP and ISP denote the two policy periods]
Block diagram of the experimental set-up: a compute node runs Quantum ESPRESSO (QE) on the MPI runtime; workload (WL) traces feed the policy and the solver.
Frequency Allocator – Average Results
[Figure: average allocated frequency (1200–2400 MHz) for FSP–ISP settings of 1 s, 10 s, 100 s, and SS, shown for the highest-priority core and the global average; SS: steady state]
Baseline: Static Thermal Optimization
Cumulative Overheads
[Figure: time overhead (0–12%) per FSP–ISP configuration (1 s, 10 s, 100 s, SS); per-configuration gains and overheads are reported in the table]
FSP – ISP proactive policy performance

FSP – ISP           AVG Gain   Overhead FSP   Overhead ISP   Performance Gain
1 sec – 1 sec       4.97%      0.02%          9.56%          -4.61%
10 sec – 10 sec     4.50%      0.02%          0.58%          3.90%
100 sec – 100 sec   3.65%      0.98%          0.09%          2.58%
SS – 1 sec          7.46%      0.49%          10.86%         -3.89%
SS – 10 sec         7.06%      0.49%          0.72%          5.85%
SS – 100 sec        3.65%      0.49%          0.10%          3.06%
COUNTDOWN implements an asynchronous mechanism based on a callback/timer that reduces the core's frequency after 500 µs spent in MPI primitives.
NAS Parallel Benchmarks: a small set of programs designed to help evaluate the performance of parallel supercomputers.
Computing resources: 1024 cores (29 nodes)
AVG overhead: <5% — Energy/power saving: 6%–50%
Quantum ESPRESSO PWscf:
• QE-PWscf-EU: Expert User
• QE-PWscf-NEU: Non-Expert User
Computing resources: 3456 cores (96 nodes)
Overhead: <6% — Energy/power saving: 22%–43%
[Figure: example application trace — four processes P1–P4 with compute bursts of 1–6 s separated by barriers B1–B7 (Bn = barrier constraints); over the 15 s window the aggregate time is 44 s of APP and 16 s of MPI]
[Figure: COUNTDOWN core logic — when a process enters an MPI primitive, the wrapper registers a timed callback; if the primitive outlasts the callback delay, the callback sets a low P-state (min frequency), and on return to the application the P-state is reset (max frequency); when the primitive ends before the delay, the pending callback is disabled and no DVFS change occurs. The MPI library itself is unmodified.]
Integer Linear Programming Model
Thermal Control Runtime
Focus on: exploration of power-capping strategies used in real HPC system nodes.
Characterization of Intel RAPL vs. a fixed-frequency allocation on an HPC application to power-constrain a compute node.
4) HPC OPTIMAL THERMAL CONTROL (OTC)
Main(){
    // Initialize MPI
    MPI_Init()
    // Get the number of procs
    MPI_Comm_size(size)
    // Get the rank
    MPI_Comm_rank(rank)
    // Print a hello world
    printf("Hello world from rank: %rank%, size: %size%")
    // Finalize MPI
    MPI_Finalize()
}
MPI_$CALL_NAME$(){
    Prologue()
    PMPI_$CALL_NAME$()
    Epilogue()
}

Prologue(){
    Profile()
    Event(START)
}

Epilogue(){
    Event(END)
    Profile()
}
// PMPI Interface
PMPI_Init() {…}
PMPI_Comm_size() {…}
PMPI_Comm_rank() {…}
PMPI_Finalize() {…}

// MPI Interface
MPI_Init() {…}
MPI_Comm_size() {…}
MPI_Comm_rank() {…}
MPI_Finalize() {…}
[Figure: app.x is dynamically linked against libcntd.so and libmpi.so; libcntd.so intercepts the MPI interface and forwards each call to libmpi.so through the PMPI interface]
We need almost 3.5x more energy efficiency!
[3] Cesarini et al. "Benefits in Relaxing the Power Capping Constraint." Proceedings of the 1st Workshop on AutotuniNg and aDaptivity AppRoaches for Energy efficient HPC Systems. ACM, 2017.
[10] Cesarini et al. "Prediction horizon vs. efficiency of optimal dynamic thermal control policies in HPC nodes." VLSI-SoC'17.
[11] Cesarini et al. "Energy Saving and Thermal Management Opportunities in a Workload-Aware MPI Runtime for a Scientific HPC Computing Node." Parallel Computing is Everywhere 32 (2018): 277.
[12] Bartolini A., Diversi R., Cesarini D., and Beneventi F. "Self-Aware Thermal Management for High-Performance Computing Processors." D&T'18.
[13] Cesarini et al. "Modeling and Evaluation of Application-Aware Dynamic Thermal Control in HPC Nodes." IFIP/Springer, 2018.
[3] Eastep et al. "Global extensible open power manager: a vehicle for HPC community collaboration toward co-designed energy management solutions." SC'16.
[4] Borghesi et al. "MS3: a Mediterranean-Stile Job Scheduler for Supercomputers - do less when it's too hot!" HPCS'15.
[5] Rudi et al. "Optimum: Thermal-aware task allocation for heterogeneous many-core devices." HPCS'14.
[6] Hanumaiah et al. "Performance optimal online DVFS and task migration techniques for thermally constrained multi-core processors." TCAD'11.
[7] Rountree et al. "Adagio: making DVS practical for complex HPC applications." ISC'09.
[8] Lim et al. "Adaptive, transparent frequency and voltage scaling of communication phases in MPI programs." SC'06.
[9] Kerbyson et al. "Energy templates: Exploiting application information to save energy." CLUSTER'11.
Top500
1) OVERVIEW
2) STATE OF THE ART
HPC systems have large time windows to respect the power constraints (seconds, not milliseconds!)
Several works have been presented in this research field, but most of them concern embedded systems!
• On-line optimization policies [6]
Learning approaches can lead to significant misprediction errors in irregular applications! [9]
Lim et al. [8] present an MPI runtime that dynamically reduces CPU performance during communication phases of MPI programs, using a learning approach to identify those regions.
Methodology — characterization of the thermal and power model of an HPC system:
Thermal characterization: we extracted the thermal properties of a computing node.
Power characterization: we characterized the power consumption in different situations.
Study of the workload model: a scientific quantum-chemistry parallel application (Quantum ESPRESSO); we characterize the sensitivity of each application task to frequency variation.
HPC Optimal Thermal Control (OTC): a DTM ILP formulation for proactive thermal management: limiting the future temperature of all cores using per-core DVFS mechanisms; maximizing application performance; slowing down the cores' frequency during communication phases; providing knobs to match the computation-to-communication ratio with the frequency selection (priority).
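One plausible shape of such an ILP is sketched below; the symbols and the linear thermal model are illustrative reconstructions, not the exact formulation of [10,13]. Per-core frequencies f_c (chosen among the discrete DVFS levels, typically via binary selector variables to keep the model linear) maximize priority-weighted performance while the predicted temperatures stay below Tmax over the prediction horizon H:

```latex
\begin{aligned}
\max_{f_1,\dots,f_N}\quad & \sum_{c=1}^{N} w_c\, f_c
      && \text{($w_c$: priority of core $c$)}\\
\text{s.t.}\quad & \hat{T}_c(t+k) = \sum_{j=1}^{N} a_{cj}\, T_j(t+k-1) + b_c\, P_c(f_c)
      && \text{(linear thermal prediction)}\\
& \hat{T}_c(t+k) \le T_{\max}
      && \forall c \in \{1,\dots,N\},\ \forall k \in \{1,\dots,H\}\\
& f_c \in \{F_{\min},\dots,F_{\max}\}
      && \text{(discrete DVFS levels)}
\end{aligned}
```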
Evaluation: the potential impact of OTC in terms of performance, QoS, overhead, maximum temperature, and prediction horizon.
Our results show that: a long time horizon for job pinning and medium time horizons for the online DVFS selection yield up to 6% performance gain (including overheads); APP/MPI phases longer than 500 µs manifest sensitivity to DVFS changes.
[Figure: MPI phases vs. application phases — lowering the frequency during application phases causes performance overhead; staying at high frequency during MPI phases wastes energy; no DVFS changes are applied to short phases]
[14] Cesarini et al. "COUNTDOWN: A Run-time Library for Application-agnostic Energy Saving in MPI Communication Primitives." ANDARE'18.
[15] Cesarini et al. "COUNTDOWN: a Run-time Library for Performance-Neutral Energy Saving in MPI Applications." https://arxiv.org/abs/1806.07258, 2018.
[16] Cesarini et al. "COUNTDOWN Slack: a Run-time Library to Reduce Energy Footprint in Large-scale MPI Applications." https://arxiv.org/abs/1909.12684, 2019.
COUNTDOWN is able to profile HPC applications with negligible overhead.
COUNTDOWN is MPI-library neutral and can be used with different vendor implementations (Intel MPI, OpenMPI, MPICH, etc.).
COUNTDOWN is able to drastically reduce the power consumption of scientific applications, by up to 50%, with negligible overhead (below 6%).
Capabilities of COUNTDOWN
Overhead analysis in a follow-up work [16]
Lesson learned:
• RAPL incurs a penalty of 2.87% on average, and up to 5.10%, w.r.t. a fixed-frequency allocation,
• even though RAPL shows a higher average frequency over the entire application time!