
HPX Applications and Performance Adaptation

Alice Koniges¹, Jayashree Ajay Candadai², Hartmut Kaiser³, Kevin Huck⁴, Jeremy Kemp⁵, Thomas Heller⁶, Matthew Anderson², Andrew Lumsdaine², Adrian Serio³, Michael Wolf⁷, Bryce Adelstein Lelbach¹, Ron Brightwell⁷, Thomas Sterling²
¹Berkeley Lab, ²Indiana University, ³Louisiana State University, ⁴University of Oregon, ⁵University of Houston, ⁶Friedrich-Alexander University, ⁷Sandia National Laboratories

The HPX runtime system is a critical component of the DOE XPRESS (eXascale PRogramming Environment and System Software) project and other projects worldwide. We are exploring a set of innovations in execution models, programming models and methods, runtime and operating system software, adaptive scheduling and resource management algorithms, and instrumentation techniques to achieve unprecedented efficiency, scalability, and programmability in the context of billion-way parallelism. A number of applications have been implemented to drive system development and quantitative evaluation of the HPX system implementation details and operational efficiencies and scalabilities.

Performance Adaptation, Legacy Applications, and Summary

Summary

Exascale programming models and runtime systems are at a critical juncture in development. Systems based on lightweight tasks and data dependences are an excellent approach for extracting parallelism and achieving performance. HPX is emerging as an important new path with support from the US Department of Energy, the National Science Foundation, the Bavarian Research Foundation, and the European Horizon 2020 Programme. Application performance of HPX codes on very recent architectures, including current and prototypical next-generation Cray-Intel machines, is very good. For some applications, the performance using HPX is significantly better than standard MPI + OpenMP implementations. Performance adaptation using APEX provides significant energy savings with no performance change. We have shown that legacy applications using OpenMP can run effectively under the HPX runtime system.

Figures: Concurrency views of LULESH (HPX-5), as observed and adapted by APEX. The left image is an unmodified execution, while the right image is a runtime power-capped (220W per-node) execution, with equal execution times. (8000 subdomains, 64³ elements per subdomain, on 8016 cores of Edison: 334 nodes, 24 cores per node.) Concurrency throttling by APEX resulted in 12.3% energy savings with no performance change.

APEX: Performance Adaptation

HPX runtime implementations are integrated with APEX (Autonomic Performance Environment for Exascale), a feedback/control library for performance measurement and runtime adaptation. APEX Introspection observes the application, runtime, OS, and hardware to maintain the APEX state, while the Policy Engine enforces policy rules to adapt, constrain, or otherwise modify application behavior.
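
As a concrete illustration of this feedback/control loop, the minimal C++ sketch below registers a periodic power-cap policy that reads the runtime state and adjusts a worker-thread cap. All names here (RuntimeState, make_power_cap_policy, and so on) are hypothetical placeholders, not the actual APEX API.

// Illustrative sketch of an APEX-style introspection / policy-engine loop.
// All names are hypothetical placeholders, not the actual APEX API.
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

struct RuntimeState {                    // analogous to the "APEX state"
    std::atomic<double> power_w{180.0};  // last sampled node power (watts)
    std::atomic<int>    thread_cap{24};  // current worker-thread cap
};

using Policy = std::function<void(RuntimeState&)>;

// Periodic policy: throttle concurrency when measured power exceeds the cap.
Policy make_power_cap_policy(double cap_watts) {
    return [cap_watts](RuntimeState& s) {
        if (s.power_w.load() > cap_watts && s.thread_cap.load() > 1)
            s.thread_cap.fetch_sub(1);       // tighten the thread cap
        else if (s.power_w.load() < 0.9 * cap_watts)
            s.thread_cap.fetch_add(1);       // relax it again when under budget
    };
}

int main() {
    RuntimeState state;
    std::vector<Policy> policies{make_power_cap_policy(220.0)};

    // Policy engine: evaluate registered policies on periodic events.
    for (int sample = 0; sample < 10; ++sample) {
        state.power_w.store(200.0 + 5.0 * sample);   // stand-in for introspection
        for (auto const& p : policies) p(state);
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    return 0;
}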

[Figure: APEX architecture. APEX Introspection gathers synchronous (triggered) and asynchronous (periodic) events from the application, HPX, and the RCR Toolkit to maintain the APEX State, which the APEX Policy Engine acts upon.]

[Figures: Concurrency and node power versus sample period for the two runs, broken down by LULESH action (advanceDomain, SBN1/SBN3/PosVel/MonoQ sends and results, initDomain, finiDomain, probe/progress/join/rendezvous-get handlers, other), with the APEX thread cap and measured power overlaid.]

Applications and Characteristic Behavior

–  LULESH (Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics): see the LULESH Deep Dive section.

–  Mini-ghost: A mini-app for exploring boundary exchange strategies using stencil computations in scientific parallel computing. Implemented by decomposing the spatial domain, inducing a "halo exchange" of process-owned boundary data.

–  N-Body Code: An event-driven, constraint-based execution model using the Barnes-Hut algorithm, in which particles are grouped by a hierarchy of cube structures via a recursive algorithm. It uses an adaptive octree data structure to compute the center of mass and the force on each cube, giving O(N log N) computational complexity, and makes use of LibGeoDecomp, an auto-parallelizing library.

–  PIC: Two 3D particle-in-cell (PIC) codes, GTC and PICSAR. The gyrokinetic toroidal code (GTC) was developed to study turbulent transport in magnetic confinement fusion plasmas. It models the interactions between fields and particles by solving the 5D gyro-averaged kinetic equation coupled to the Poisson equation. PICSAR is a mini-app with the key functionalities of PIC accelerator codes, including a Maxwell solver using an arbitrary-order finite-difference scheme (staggered/centered), a particle pusher using the Boris algorithm (sketched after this list), and energy-conserving field gathering with high-order particle shape factors.

–  miniTri: A newly developed triangle-enumeration-based data analytics mini-app. miniTri mimics the computational requirements of an important set of data science applications that are not well represented by traditional graph search benchmarks such as Graph500. An asynchronous HPX-based approach enables our linear-algebra-based implementation of miniTri to be significantly more memory efficient, allowing us to process much larger graphs.

–  CMA (Climate Mini-App): see the CMA Deep Dive section.

–  Kernels: Various computational kernels, such as matrix transpose and fast multipole algorithms, used to explore features of HPX and to compare with other approaches.
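
The Boris push referenced in the PIC item above is a standard velocity update that brackets a magnetic rotation with two electric half-kicks. The sketch below is a minimal single-particle version; the field interpolation, units, and deposition/gather steps of GTC and PICSAR are omitted.

// Minimal Boris particle push (one particle).
// Illustrative only; real PIC codes interpolate E and B from the grid first.
#include <array>
#include <cstdio>

using Vec3 = std::array<double, 3>;

Vec3 cross(Vec3 const& a, Vec3 const& b) {
    return { a[1]*b[2] - a[2]*b[1],
             a[2]*b[0] - a[0]*b[2],
             a[0]*b[1] - a[1]*b[0] };
}

// Advance velocity v by one step dt under fields E, B (charge q, mass m).
Vec3 boris_push(Vec3 v, Vec3 const& E, Vec3 const& B,
                double q, double m, double dt) {
    double const h = q * dt / (2.0 * m);
    // 1) first half electric kick
    for (int i = 0; i < 3; ++i) v[i] += h * E[i];
    // 2) magnetic rotation
    Vec3 t = { h * B[0], h * B[1], h * B[2] };
    double t2 = t[0]*t[0] + t[1]*t[1] + t[2]*t[2];
    Vec3 s = { 2*t[0]/(1+t2), 2*t[1]/(1+t2), 2*t[2]/(1+t2) };
    Vec3 vp = v;
    Vec3 vxt = cross(v, t);
    for (int i = 0; i < 3; ++i) vp[i] += vxt[i];
    Vec3 vpxs = cross(vp, s);
    for (int i = 0; i < 3; ++i) v[i] += vpxs[i];
    // 3) second half electric kick
    for (int i = 0; i < 3; ++i) v[i] += h * E[i];
    return v;
}

int main() {
    Vec3 v = boris_push({1.0, 0.0, 0.0}, {0.0, 0.0, 0.0},
                        {0.0, 0.0, 1.0}, -1.0, 1.0, 0.1);
    std::printf("v = (%g, %g, %g)\n", v[0], v[1], v[2]);
}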

 

LULESH (Deep Dive)

LULESH solves the Sedov blast wave problem. In three dimensions the problem is spherically symmetric, and the code solves it in a parallelepiped region. In the figure, symmetric boundary conditions are imposed on the colored faces such that the normal components of the velocities are always zero; free boundary conditions are imposed on the remaining boundaries.

The LULESH algorithm is implemented as a hexahedral mesh-based code with two centerings. Element centering stores thermodynamic variables such as energy and pressure. Nodal centering stores kinematic values such as positions and velocities. The simulation is run via time integration using a Lagrange leapfrog algorithm. There are three main computational phases within each time step: advance node quantities, advance element quantities, and calculate time constraints. There are three communication patterns, all regular, static, and uniform: face-adjacent, 26-neighbor, and 13-neighbor communications.
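
A minimal sketch of the time-step structure just described follows; the phase bodies are stand-ins, and the domain decomposition and halo exchanges of LULESH are omitted.

// Skeleton of a LULESH-style Lagrange leapfrog time step; the phase bodies
// are placeholders for the real hydrodynamics kernels and halo exchanges.
#include <vector>

struct Domain {
    std::vector<double> node_pos, node_vel;    // nodal centering
    std::vector<double> elem_energy, elem_p;   // element centering
    double dt = 1e-4, time = 0.0, t_end = 1e-2;
};

void advance_node_quantities(Domain&)     { /* positions, velocities */ }
void advance_element_quantities(Domain&)  { /* energy, pressure, volume */ }
double calc_time_constraints(Domain const& d) { return d.dt; /* CFL-style limits */ }

int main() {
    Domain d;
    while (d.time < d.t_end) {
        advance_node_quantities(d);       // phase 1: advance node quantities
        advance_element_quantities(d);    // phase 2: advance element quantities
        d.dt = calc_time_constraints(d);  // phase 3: calculate time constraints
        d.time += d.dt;
    }
    return 0;
}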

[1] LULESH is available from: https://codesign.llnl.gov/lulesh.php [2] Karlin I, Bhatele A, Keasler J, Chamberlain BL, Cohen J, DeVito Z, et al. Exploring Traditional and Emerging Parallel Programming Models using a Proxy Application. In: Proc. of the 27th IEEE International Parallel and Distributed Processing Symposium (IPDPS); 2013.

CMA (Deep Dive)

The climate mini-app (CMA) models the performance profile of an atmospheric "dynamic core" (dycore) for non-hydrostatic flows. The code uses a conservative finite-volume discretization on an adaptively refined cubed-sphere grid. An implicit-explicit (IMEX) time integrator combines a vertical implicit operator (which is FLOP-bound) with a horizontal explicit operator (which is bandwidth-bound). There are three major sources of load imbalance in the code:
•  The number of iterations required to perform the nonlinear vertical solves varies across the grid.
•  Exchanges across the boundaries of the six panels of the cubed sphere are more computationally expensive than intra-panel exchanges.
•  The adaptively refined mesh adds and removes refined regions as the simulation evolves.

Figures: Left image shows an example of an adaptively refined cubed-sphere grid used in climate codes. Right image shows vorticity dynamics for a climate test problem with AMR.

The mini-app is implemented using the Chombo adaptive mesh refinement (AMR) framework and has both MPI+OpenMP and HPX backends. It is being used to explore performance on many-core architectures (e.g., Xeon Phi) and to explore the benefits of using HPX for finite-volume AMR codes to combat dynamic load imbalance.
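
A minimal sketch of the IMEX split just described: each step applies the bandwidth-bound horizontal explicit operator, then performs the FLOP-bound implicit solve column by column. The operators, iteration counts, and data layout are placeholders, not the CMA/Chombo implementation.

// Illustrative IMEX (implicit-explicit) step over a column-structured field.
// H_explicit and V_implicit_solve are stand-ins, not the CMA/Chombo operators.
#include <vector>

using Column = std::vector<double>;
using Field  = std::vector<Column>;          // one vertical column per horizontal cell

// Horizontal explicit operator: cheap per point, bandwidth-bound.
void H_explicit(Field& u, double dt) {
    for (auto& col : u)
        for (auto& v : col) v += dt * 0.0;   // placeholder flux update
}

// Vertical implicit solve: per-column nonlinear iteration, FLOP-bound.
// The iteration count varies by column, one of the load-imbalance sources above.
int V_implicit_solve(Column& col, double /*dt*/) {
    int iters = 0;
    for (double residual = 1.0; residual > 1e-10 && iters < 50; ++iters)
        residual *= 0.1;                     // placeholder Newton-style iteration
    (void)col;
    return iters;
}

void imex_step(Field& u, double dt) {
    H_explicit(u, dt);                               // explicit horizontal update
    for (auto& col : u) V_implicit_solve(col, dt);   // implicit vertical solves
}

int main() {
    Field u(16, Column(64, 1.0));            // 16 columns of 64 cells
    for (int step = 0; step < 10; ++step) imex_step(u, 1.0e-2);
    return 0;
}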


Funding by DOE Office of Science through grants: DE-SC0008714, DE-SC0008809, DE-AC04-94AL85000, DE-SC0008596, DE-SC0008638, and DE-AC02-05CH11231, by National Science Foundation through grants: CNS-1117470, AST-1240655, CCF-1160602, IIS-1447831, and ACI-1339782, and by Friedrich-Alexander-University Erlangen-Nuremberg through grant: H2020-EU.1.2.2. 671603.

Results Showing Benefits of HPX

[Figure panels: N-Body using LibGeoDecomp; MiniGhost Weak Scaling; CMA Strong Scaling on KNC (Babbage).]

GTCX Communication Reduction

Reduction of communication in GTCX (GTC with HPX added) compared to the original GTC (upper image) is shown.

 

Top: HPX-5 weak-scaling LULESH performance on a 256-core cluster. Bottom: HPX-5 weak-scaling LULESH performance on Edison up to 14000 cores. ISIR and PWC are HPX-5 network back-ends. Lower values are better; scaling toward 27k+ cores is under development.

[Figures: LULESH weak scaling. Edison: time per iteration (s) versus cores (axis to 28000) for MPI, HPX-PWC, and HPX-ISIR. 256-core cluster: time per cycle (s) versus cores (SMP, 8, 27, 64, 125, 216) for MPI and HPX-PWC.]

LULESH Weak Scaling

 

NERSC's Edison is a Cray XC30 using the Aries interconnect and Intel Xeon processors, with a peak performance of more than 2 petaflops.

NERSC's Babbage machine uses the Intel Xeon Phi™ coprocessor (codenamed "Knights Corner"), which combines many Intel CPU cores onto a single chip. Knights Corner is available in multiple configurations, delivering up to 61 cores, 244 threads, and 1.2 teraflops of performance.

Matrix Transpose Kernel

 

[Figure: Matrix transpose results on Edison, higher is better (from "HPX by example", Thomas Heller, FAU, 25.10.2014).]
The HPX-3 blocked implementation is faster on NERSC's production Edison system.

[Figure: Matrix transpose results on Babbage, higher is better.]
HPX-3 is dramatically better than both MPI and OpenMP on Babbage.
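
For context, "blocked" here means transposing tile by tile so the working set stays in cache; the loops below are a minimal serial sketch of that blocking. The HPX-3 benchmark additionally distributes and overlaps the per-block work with HPX tasks, which is not shown.

// Minimal cache-blocked transpose, B = A^T, for an N x N row-major matrix.
// Serial sketch only; the HPX-3 benchmark runs the per-block work as HPX tasks.
#include <algorithm>
#include <cstddef>
#include <vector>

void blocked_transpose(std::vector<double> const& A, std::vector<double>& B,
                       std::size_t N, std::size_t block = 64) {
    for (std::size_t ib = 0; ib < N; ib += block)
        for (std::size_t jb = 0; jb < N; jb += block)
            // transpose one tile at a time so both tiles stay cache-resident
            for (std::size_t i = ib; i < std::min(ib + block, N); ++i)
                for (std::size_t j = jb; j < std::min(jb + block, N); ++j)
                    B[j * N + i] = A[i * N + j];
}

int main() {
    std::size_t const N = 512;
    std::vector<double> A(N * N, 1.0), B(N * N, 0.0);
    blocked_transpose(A, B, N);
    return 0;
}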

Background Info

High Performance ParalleX (HPX)

* The HPX runtime system reifies the ParalleX execution model to support large-scale irregular applications:
  - Localities
  - Active Global Address Space (AGAS)
  - ParalleX Processes
  - Complexes (ParalleX Threads and Thread Management)
  - Parcel Transport and Parcel Management
  - Local Control Objects (LCOs)
* Sits between the application and OS
* Portable interface: C++11/14 (HPX-3 only), XPI
  - Comprehensive suite of parallel C++ algorithms (HPX-3 only)
* Automatic distributed garbage collection in AGAS (HPX-3 only)
* Flexible set of execution and scheduling policies
* Performance counter framework (HPX-3 only)
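
To give a flavor of the HPX-3 C++ interface listed above, the sketch below spawns lightweight HPX threads with hpx::async and composes their futures. Header paths and the hpx_main convenience wrapper differ between HPX-3 releases, so treat the includes as illustrative.

// Minimal HPX-3 style example: lightweight tasks and futures.
// Header layout differs between HPX releases; these paths are illustrative.
#include <hpx/hpx_main.hpp>      // lets plain main() run inside the HPX runtime
#include <hpx/include/async.hpp>
#include <hpx/include/lcos.hpp>
#include <cstdint>
#include <iostream>

std::uint64_t fib(std::uint64_t n) {
    if (n < 2) return n;
    // spawn one branch as an HPX lightweight thread, compute the other inline
    hpx::future<std::uint64_t> lhs = hpx::async(fib, n - 1);
    std::uint64_t rhs = fib(n - 2);
    return lhs.get() + rhs;      // .get() suspends this HPX thread, not an OS thread
}

int main() {
    std::cout << "fib(20) = " << fib(20) << "\n";
    return 0;
}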

[Figure: The SLOWER performance-model factors addressed by the ParalleX execution model: Starvation, Latency, Overhead, Waiting for contention, Energy, Reliability (see [3]).]

[Figure: Software stack. The application and domain libraries sit on HPX, programmed through XPI and governed by the ParalleX execution model, which in turn runs on the OS/network of the supercomputer.]

HPX avoids the use of locks and/or barriers in parallel computation through the use of LCOs, which are lightweight synchronization objects used by threads as a control mechanism. Reads and writes on LCOs are globally atomic and require no other synchronization mechanism.
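
A hedged sketch of the LCO idea in HPX-3 terms: futures act as lightweight, write-once synchronization objects, and a continuation fires only once all of its inputs are satisfied, so no lock or barrier appears. The same caveat about illustrative header paths applies here.

// Futures as lightweight synchronization objects (LCO-style), with no locks
// or barriers. Header paths are illustrative and vary across HPX releases.
#include <hpx/hpx_main.hpp>
#include <hpx/include/async.hpp>
#include <hpx/include/lcos.hpp>
#include <iostream>
#include <vector>

int main() {
    // Launch independent tasks; each future is an event that fires exactly once.
    std::vector<hpx::future<int>> parts;
    for (int i = 0; i < 4; ++i)
        parts.push_back(hpx::async([i] { return i * i; }));

    // when_all is satisfied only when every input event has fired;
    // the continuation then runs without any explicit barrier.
    hpx::future<int> total =
        hpx::when_all(std::move(parts))
            .then([](hpx::future<std::vector<hpx::future<int>>> f) {
                int sum = 0;
                for (auto& p : f.get()) sum += p.get();
                return sum;
            });

    std::cout << "sum of squares = " << total.get() << "\n";
    return 0;
}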

HPX has optimized transports built on top of Photon (HPX-5 only) and other communication libraries:
* Two-sided Isend/Irecv transport (ISIR)
* Pre-posts irecvs to reduce probe overhead
* One-sided Put-With-Command/Completion (PWC)
* Local/remote notifications for RDMA operations
* RDMA communication optimizations using Photon: turn large puts into gets, buffer coalescing

[3] Thomas Sterling, Daniel Kogler, Matthew Anderson, and Maciej Brodowicz. SLOWER: A performance model for Exascale computing. Supercomputing Frontiers and Innovations, 1:42–57, September 2014.
[4] Hartmut Kaiser, Thomas Heller, Bryce Adelstein-Lelbach, Adrian Serio, and Dietmar Fey. HPX – A Task Based Programming Model in a Global Address Space. In: PGAS 2014: The 8th International Conference on Partitioned Global Address Space Programming Models; 2014.

[Figure: HPX runtime components: global address space, parcel transport, LCOs, threads, ISIR and PWC network back-ends over the OS/network, and performance counters (HPX-3 only) spanning the stack.]

Memory Models and Transport

[Figure: The global address space maps to memory through a translation cache and AGAS; PGAS/AGAS sits on the parcel/GAS transport over the network.]

HPX supports three memory models: Symmetric Multi-Processing (SMP, no global memory), a Partitioned Global Address Space (PGAS), and an Active Global Address Space (AGAS).

HPX Architecture

[Figure: HPX architecture. The application layer (LULESH, LibPXGL, N-Body, ADCIRC, LibGeoDecomp, FMM) runs over the global address space (PGAS/AGAS), parcels, processes, LCOs, and the scheduler with its worker threads; the network layer (ISIR, and PWC for HPX-5 only) sits on the operating system, hardware cores, and network. For addressing, the source directly converts a virtual address into a node/address pair and sends a message to the destination.]

Note: BlueGene/Q, NVIDIA and AMD GPUs, Windows, and Android are currently supported only by HPX-3.

ParalleX Execution Model

HPX-5 is the High Performance ParalleX runtime library from Indiana University; its interface and C99 library implementation are guided by the ParalleX execution model (http://hpx.crest.iu.edu). HPX-3 is the C++11/14 implementation of the ParalleX execution model from Louisiana State University (http://stellar-group.org/libraries/hpx/).

* Lightweight multi-threading
  - Divides work into smaller tasks
  - Increases concurrency
* Message-driven computation
  - Move work to data
  - Keeps work local, stops blocking
* Constraint-based synchronization
  - Declarative criteria for work
  - Event driven
  - Eliminates global barriers
* Data-directed execution
  - Merger of flow control and data structure
* Shared name space
  - Global address space
  - Simplifies random gathers

Legacy Application Support

OMPTX is an HPX implementation of the Intel OpenMP runtime, enabling existing OpenMP applications to execute with HPX.

Within a node, performance achieved for OpenMP programs using OMPTX is comparable to using the Intel OpenMP Runtime. For example, below we show speedup for a blocking LU decomposition benchmark using the optimal block size for each implementation. These results were obtained with a dual-socket Ivy Bridge processor at LSU.
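
For context, the kind of OpenMP worksharing that OMPTX maps onto HPX lightweight threads looks like the LU sketch below. It is an illustrative, unblocked, no-pivoting variant; the benchmark in the figure uses a blocked factorization with a tuned block size.

// Illustrative OpenMP LU factorization (in-place, right-looking, no pivoting).
// Not the benchmark behind the figure; it only shows the kind of OpenMP
// worksharing that OMPTX maps onto HPX lightweight threads.
#include <cstddef>
#include <vector>

void lu_nopivot(std::vector<double>& A, std::size_t n) {
    for (std::size_t k = 0; k < n; ++k) {
        // each row of the trailing submatrix updates independently
        #pragma omp parallel for
        for (std::size_t i = k + 1; i < n; ++i) {
            A[i * n + k] /= A[k * n + k];
            for (std::size_t j = k + 1; j < n; ++j)
                A[i * n + j] -= A[i * n + k] * A[k * n + j];
        }
    }
}

int main() {
    std::size_t const n = 512;
    std::vector<double> A(n * n, 1.0);
    for (std::size_t i = 0; i < n; ++i) A[i * n + i] += n;  // keep it well-conditioned
    lu_nopivot(A, n);
    return 0;
}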

[Figure: LU decomposition speedup (matrix size 8192x8192) versus number of threads (2, 4, 5, 8, 10, 16, 20, 40) for Intel OpenMP (BS=128), OMPTX (BS=128), and HPX (BS=256).]