Automatic Parameterisation of Parallel Linear Algebra Routines Domingo Giménez Javier Cuenca José González University of Murcia SPAIN Algèbre Linéaire et Arithmétique: Calcul Numérique, Symbolique et Paralèle Rabat, Maroc. 28-31 Mai 2001


Automatic Parameterisation of Parallel Linear Algebra Routines

Domingo Giménez, Javier Cuenca, José González

University of Murcia, SPAIN

Algèbre Linéaire et Arithmétique: Calcul Numérique, Symbolique et Paralèle

Rabat, Morocco. 28-31 May 2001

Outline

Current Situation of Linear Algebra Parallel Routines (LAPRs)

Objective

Approach I: Analytical Model of the LAPRs

  Application: Jacobi Method on Origin 2000

Approach II: Exhaustive Executions

  Application: Gauss elimination on networks of processors

Validation with the LU factorization

Conclusions

Future Works

Current Situation of Linear Algebra Parallel Routines (LAPRs)

Linear Algebra: highly optimizable operations

Optimizations are Platform Specific

Traditional method: Hand-Optimization for each platform

[Figure: execution time (seconds, 0 to 180) versus problem size (512 to 3072) for the Untuned and Tuned routines]

Problems of traditional method

Time-consuming

Incompatible with Hardware Evolution

Incompatible with changes in the system (architecture and basic libraries)

Unsuitable for dynamic systems

Misuse by non-expert users

Current approaches

ATLAS, FLAME, I-LIB

Analyse platform characteristics in detail

Sequential code

Empirical results of the LAPR + Automation

High Installation Time


Develop a methodology for obtaining Automatically Tuned Software

Execution Environment

Auto-tuning Software

Our objective

Methodology

Routines Parameterised: System parameters, Algorithmic parameters

System parameters obtained at installation time:

  Analytical model of the routine and simple installation routines to obtain the system parameters

  A reduced number of executions at installation time

Algorithmic parameters obtained at running time:

  From the analytical model with the system parameters obtained in the installation process

  From the file with information generated in the installation process


System parameters obtained at installation time

Analytical model of the routine and simple installation routines to obtain the system parameters

Algorithmic parameters obtained at running time

From the analytical model with the system parameters obtained in the installation process

Analytical modelling

Analytical Model

The behaviour of the algorithm on the platform is defined

Texec = f (SPs, n, APs)

SPs = f(n, APs): System Parameters

APs: Algorithmic Parameters

n: Problem Size

Analytical Model

System Parameters (SPs), which determine the LAPRs Performance:

  Hardware Platform: Physical Characteristics, Current Conditions

  Basic libraries

Two Kinds of SPs:

  Communication System Parameters (CSPs)

  Arithmetic System Parameters (ASPs)

How to estimate each SP?

  1. Obtain the kernel of performance cost of LAPR

  2. Make an Estimation Routine from this kernel

Analytical Model

Arithmetic System Parameters (ASPs):

  tc: arithmetic cost

  but using BLAS: k1, k2 and k3 (costs of arithmetic operations of the different BLAS levels)

The Estimation Routine is built from the Computation Kernel of the LAPR:

  Similar storage scheme

  Similar quantity of data
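As an illustration of an estimation routine for an arithmetic parameter, the sketch below times a matrix-matrix product on blocks of a given size and divides by the operation count. It assumes that k3 can be read as the time per floating-point operation; the pure-Python `matmul` merely stands in for DGEMM, and the names `matmul` and `estimate_k3` are hypothetical.

```python
import time

def matmul(a, b):
    """Naive product of two square matrices stored as lists of rows."""
    columns = list(zip(*b))  # columns of b, for row-by-column dot products
    return [[sum(x * y for x, y in zip(row, col)) for col in columns]
            for row in a]

def estimate_k3(block_size, repeats=3):
    """Estimate seconds per flop of a block_size x block_size product."""
    a = [[1.0] * block_size for _ in range(block_size)]
    b = [[2.0] * block_size for _ in range(block_size)]
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        matmul(a, b)
        best = min(best, time.perf_counter() - start)
    flops = 2 * block_size ** 3  # one multiply and one add per inner step
    return best / flops

for size in (16, 32, 64):
    print(size, estimate_k3(size))
```

Taking the minimum over a few repetitions is one simple way to reduce timing noise; a real installation routine would call the BLAS kernel actually used by the LAPR, with the same storage scheme and a similar quantity of data.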

Analytical Model

Communication System Parameters (CSPs):

  ts: start-up time

  tw: word-sending time

The Estimation Routine is built from the Communication Kernel of the LAPR:

  Similar kind of communication

  Similar quantity of data
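One possible way to obtain ts and tw from measurements is a straight-line fit, assuming the common linear cost model in which sending m words takes ts + m·tw. The sketch below uses synthetic timings generated from the values ts = 20 µs and tw = 0.1 µs quoted in the application example; the function name `fit_ts_tw` is hypothetical, and a real routine would time the communication pattern of the LAPR itself.

```python
def fit_ts_tw(sizes, times):
    """Least-squares fit of t(m) = ts + tw * m; returns (ts, tw)."""
    count = len(sizes)
    mean_m = sum(sizes) / count
    mean_t = sum(times) / count
    sxx = sum((m - mean_m) ** 2 for m in sizes)
    sxy = sum((m - mean_m) * (t - mean_t) for m, t in zip(sizes, times))
    tw = sxy / sxx
    ts = mean_t - tw * mean_m
    return ts, tw

# Synthetic timings in microseconds (assumed model, not measured data).
sizes = [1000, 2000, 4000, 8000]
times = [20 + 0.1 * m for m in sizes]
ts, tw = fit_ts_tw(sizes, times)
print(f"ts = {ts:.2f} us, tw = {tw:.3f} us/word")
```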

Analytical Model

Algorithmic Parameters (APs)

Values chosen in each execution:

  b: block size

  p: number of processors

  r × c: logical topology, grid configuration (logical 2D mesh)
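For a given number of processors p, the candidate grid configurations are the factorisations p = r × c. A minimal sketch (the function name `grid_configurations` is hypothetical):

```python
def grid_configurations(p):
    """Return all (r, c) pairs with r * c == p, i.e. the logical 2D meshes."""
    return [(r, p // r) for r in range(1, p + 1) if p % r == 0]

print(grid_configurations(8))  # [(1, 8), (2, 4), (4, 2), (8, 1)]
```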

The Methodology. Step by step:

Pre-installing (manual):

  1. Make the Analytical Model: Texec = f (SPs, n, APs)

  2. Write the Estimation Routines for the SPs

Installing on a Platform (automatic):

  3. Estimate the SPs using the Estimation Routines of step 2

  4. Write a Configuration File, or include the information in the LAPR: for each n, the APs that minimize Texec

Execution:

  The user executes the LAPR for a size n

  The LAPR obtains the optimal APs
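The selection of the APs can be sketched as a search over the candidate values for the combination with the smallest modelled time. In the sketch below the cost function `toy_model` is a made-up placeholder and not the model of the talk; only the system parameter values are taken from the Origin 2000 estimates quoted in the application example. The names `SPS`, `toy_model` and `best_aps` are hypothetical.

```python
from itertools import product

# System parameters as an installation step might record them (microseconds).
SPS = {"k3": {32: 0.005, 64: 0.004, 128: 0.003}, "ts": 20.0, "tw": 0.1}

def toy_model(n, b, r, c, sps):
    """Placeholder cost model (NOT the talk's formula): compute + messages."""
    p = r * c
    arithmetic = sps["k3"][b] * 2 * n ** 3 / p
    steps = n / b
    communication = steps * (2 * sps["ts"] + sps["tw"] * b * (n / r + n / c))
    return arithmetic + communication

def best_aps(n, p, sps, model=toy_model):
    """Return the (b, r, c) that minimises the modelled time for size n."""
    grids = [(r, p // r) for r in range(1, p + 1) if p % r == 0]
    candidates = ((b, r, c) for b, (r, c) in product(sps["k3"], grids))
    return min(candidates, key=lambda ap: model(n, ap[0], ap[1], ap[2], sps))

for n in (512, 1024, 2048):
    print(n, best_aps(n, 8, SPS))
```

The same search can run once per problem size at installation time, with the results written to a configuration file, or at run time inside the routine.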

Application Example

LAPR: One-sided Block Jacobi Method to solve the Symmetric Eigenvalue Problem

  Message-passing with MPI

  Logical Ring & Logical 2D-Mesh

Platform: SGI Origin 2000

Application Example. Algorithm Scheme

[Figure: distribution of the blocks of the matrices B, W and D over the logical mesh of processors (labelled 00 to 21), with block size b and dimensions n/r and n]

Application Example: Pre-installing.

1. Make the Analytical Model: Texec = f (SPs, n, APs)

$T_{exec} = t_{ari} + t_{VC} + t_{HC}$

[Equations: expressions for the arithmetic time $t_{ari}$ (in terms of k1, k3, n, b, p, r, c) and for the communication times $t_{VC}$ and $t_{HC}$ (in terms of ts, tw, n, b, r, c); not recoverable from the transcript]

Application Example: Pre-installing.

2. Write the Estimation Routines for the SPs

  k3: matrix-matrix multiplication with DGEMM

  k1: Givens Rotation to 2 vectors with DROT

  ts, tw: communications along the 2 directions of the 2D-mesh

Application Example: Installing

3. Estimate the SPs using the Estimation Routines

  k1: 0.01 µs

  k3: 0.005 µs (b = 32), 0.004 µs (b = 64), 0.003 µs (b = 128)

  ts: 20 µs

  tw: 0.1 µs

Application Example: Executing

Comparison of execution times using different sets of Execution Parameters (4 processors)

[Figure: execution time (0 to 300) versus problem size (512 to 3072) for: Untuned, Tuned with MCAP, Tuned with MVAP, Optimal Execution Time]

Application Example: Executing

Comparison of execution times using different sets of Execution Parameters (8 processors)

[Figure: execution time (0 to 200) versus problem size (512 to 3072) for: Untuned, Tuned with MCAP, Tuned with MVAP, Optimal Execution Time]

Application Example: Executing

LAPR: One-sided Block Jacobi Method

Algorithmic Parameters: block size, mesh topology

Platform: SGI Origin 2000 with message-passing

System Parameters: arithmetic costs, communication costs

Satisfactory Reduction of the Execution Time: from 25% higher than the optimal to only 2%



System parameters obtained at installation time

Installation routines making a reduced number of executions at installation time

Algorithmic parameters obtained at running time

From the file with information generated in the installation process

Exhaustive Execution

Exhaustive Execution

The behaviour of the algorithm on the platform is defined (as in Analytical Modelling)

Texec = f (SPs, n, APs)

SPs = f(n, APs): System Parameters

APs: Algorithmic Parameters

n: Problem Size

Exhaustive Execution

Identify Algorithmic Parameters (APs) (as in Analytical Modelling)

Values chosen in each execution:

  b: block size

  p: number of processors

  r × c: logical topology, grid configuration (logical 2D mesh)

The Methodology. Step by step:

Pre-installing (manual):

  1. Determine the APs

  2. Decide heuristics to reduce execution time in the installation process

Installing on a Platform (automatic):

  3. The manager decides the problem sizes to be analysed

  4. Execute and write a Configuration File, or include the information in the LAPR: for each n, the APs that minimize Texec

Execution:

  The user executes the LAPR for a size n

  The LAPR obtains the optimal APs
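The installation and execution steps can be sketched as follows: time every candidate set of APs for each problem size chosen by the manager, store the fastest in a configuration file, and read that file back at run time. The JSON layout, the file name and the function names (`install`, `load_aps`, `blocked_sum`) are assumptions made for illustration; `blocked_sum` is only a stand-in for a real LAPR.

```python
import json
import os
import tempfile
import time

def install(routine, sizes, candidates, path):
    """Time every candidate AP set for each size and save the fastest."""
    table = {}
    for n in sizes:
        timings = []
        for aps in candidates:
            start = time.perf_counter()
            routine(n, **aps)
            timings.append((time.perf_counter() - start, aps))
        _, best_aps = min(timings, key=lambda item: item[0])
        table[str(n)] = best_aps
    with open(path, "w") as handle:
        json.dump(table, handle, indent=2)
    return table

def load_aps(path, n):
    """Read the configuration file and return the APs stored for size n."""
    with open(path) as handle:
        return json.load(handle)[str(n)]

def blocked_sum(n, b):
    """Stand-in for a LAPR: sums n*n values in chunks of b (hypothetical)."""
    data = list(range(n * n))
    return sum(sum(data[i:i + b]) for i in range(0, len(data), b))

config = os.path.join(tempfile.gettempdir(), "lapr_config.json")
install(blocked_sum, [64, 128], [{"b": 8}, {"b": 32}], config)
print(load_aps(config, 128))
```

A heuristic that stops the installation after a time budget, or prunes candidates that are clearly slower, would fit inside the inner loop of `install`.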

Application Example

LAPR: Gaussian elimination

  Message-passing with MPI

  Logical Ring, rowwise block-cyclic striped partitioning

Platform: networks of processors (heterogeneous system)

Application Example: Pre-installing.

1. Determine the APs: logical ring, rowwise block-cyclic striped partitioning

  p: number of processors

  b: block size for the data distribution

  different block sizes in heterogeneous systems

[Figure: rows of the matrix distributed cyclically in blocks b0 b1 b2 b0 b1 b2 b0 b1 b2 b0]
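The rowwise block-cyclic distribution with a different block size per processor can be sketched as a mapping from a row index to its owner. This assumes that processor i receives b_i consecutive rows in each cycle, in ring order; the function name `row_owner` is hypothetical.

```python
def row_owner(row, block_sizes):
    """Processor owning `row` when blocks b0, b1, ... repeat cyclically."""
    offset = row % sum(block_sizes)
    for proc, size in enumerate(block_sizes):
        if offset < size:
            return proc
        offset -= size
    raise ValueError("block sizes must be positive")

owners = [row_owner(i, [2, 1, 3]) for i in range(12)]
print(owners)  # [0, 0, 1, 2, 2, 2, 0, 0, 1, 2, 2, 2]
```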

Application Example: Pre-installing.

2. Decide heuristics to reduce execution time in the installation process

  Execution time varies in a continuous way with the problem size and the APs

  Consider the system as homogeneous

  Installation can finish:

    When Analytical and Experimental predictions coincide

    When a certain time has been spent on the installation

Application Example: Installing

Homogeneous Systems:

  3. The manager decides the problem sizes

  4. Execute and write a Configuration File, or include the information in the LAPR: for each n, the APs that minimize Texec

Heterogeneous Systems:

  3. The manager decides the problem sizes

  4. Execute:

    write a Configuration File with, for each n, the APs that minimize Texec

    write a Speed File, with the relative speeds of the processors in the system

Application Example: Installation Routines

RI-THE: Obtains p and b from the formula.

RI-HOM: Obtains p and b through a reduced number of executions.

RI-HET:

  1. As RI-HOM.

  2. Obtains $b_i$ for each processor from the relative speeds $s_j$:

$b_i = \dfrac{p \, b \, s_i}{\sum_{j=1}^{p} s_j}$
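A minimal sketch of the second step of RI-HET, assuming that the block size of processor i is proportional to its relative speed s_i and that the average block size equals the homogeneous value b, that is, b_i = p·b·s_i / Σ s_j. Rounding to an integer with a minimum of 1 is an added assumption, and the name `heterogeneous_blocks` is hypothetical.

```python
def heterogeneous_blocks(b, speeds):
    """Scale the homogeneous block size b by each processor's relative speed."""
    p = len(speeds)
    total = sum(speeds)
    return [max(1, round(p * b * s / total)) for s in speeds]

print(heterogeneous_blocks(32, [1.0, 1.0, 2.0]))  # [24, 24, 48]
```

With equal speeds the routine returns b for every processor, so the homogeneous case is recovered.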


Three different configurations:

PLA_HOM: 5 SUN Ultra-1

PLA_HYB: 5 SUN Ultra-1

1 SUN Ultra-5

PLA_HET: 1 SUN Ultra-1

1 SUN Ultra-5

1 SUN Ultra-1 (manages the file system)

Application Example: Systems

Application Example: Executing

Experimental results in PLA-HOM:

Quotient between the execution time with the parameters from the Installation Routine and the optimum execution time

[Figure: quotient (0 to 2) versus problem size (500 to 3000) for RI-THEO, RI-HOMO and RI-HETE]

Application Example: Executing

Experimental results in PLA-HYB:

Quotient between the execution time with the parameters from the Installation Routine and the optimum execution time

[Figure: quotient (0 to 2) versus problem size (500 to 3000) for RI-THEO, RI-HOMO and RI-HETE]

Application Example: Executing

Experimental results in PLA-HET:

Quotient between the execution time with the parameters from the Installation Routine and the optimum execution time

[Figure: quotient (0 to 2) versus problem size (500 to 3000) for RI-THEO, RI-HOMO and RI-HETE]

Comparison

Two techniques for automatic tuning of Parallel Linear Algebra Routines:

1. Analytical Modelling

  For predictable systems (homogeneous, static, ...) like Origin 2000

2. Exhaustive Execution

  For less predictable systems (heterogeneous, dynamic, ...) like networks of workstations

Transparent to the user

Execution close to the optimum


Validation with the LU factorization

To validate the methodology it is necessary to experiment with:

More routines: block LU factorization

More systems:

  Architectures: IBM SP2 and Origin 2000

  Libraries: reference BLAS, machine BLAS, ATLAS

Sequential LU

Analytical Model: Texec = f (SPs, n, APs)

$t_{ari} = \frac{2}{3} k_3 n^3 + k_3 b n^2 + \frac{1}{3} k_2 b^2 n$

SPs: cost of arithmetic operations of different levels: k1, k2, k3

APs: block size b

[Figure: matrix partitioned into blocks of size b, labelled LU, ES, ES and UM]
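To illustrate how such a model drives the choice of the block size, the sketch below assumes an arithmetic cost of the form (2/3)·k3·n³ + k3·b·n² + (1/3)·k2·b²·n and a k3 that improves as b grows. The values in `K3` and `K2` are hypothetical and were chosen only to show the trade-off; `lu_time` and `best_block` are hypothetical names.

```python
# Hypothetical system parameters (microseconds per operation).
K3 = {32: 0.005, 64: 0.004, 128: 0.003}  # level 3 cost for each block size
K2 = 0.02                                # level 2 cost

def lu_time(n, b, k3, k2):
    """Modelled arithmetic time, under the cost form assumed in the lead-in."""
    return (2 / 3) * k3 * n ** 3 + k3 * b * n ** 2 + (1 / 3) * k2 * b ** 2 * n

def best_block(n, k3_table=K3, k2=K2):
    """Block size with the smallest modelled time for problem size n."""
    return min(k3_table, key=lambda b: lu_time(n, b, k3_table[b], k2))

for n in (128, 512):
    print(n, best_block(n))
```

With these assumed values the small problem prefers the smallest block and the larger problem prefers the largest, which is why the block size is treated as a parameter chosen for each n.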

Sequential LU. Comparison in IBM SP2

Quotient between different execution times and the optimum execution time

[Figure: quotient (0 to 1.4) versus problem size (512 to 2560) for: modelled, weighted, LAPACK]

Sequential LU. Model execution time/optimum execution time

Quotient between the execution time with the parameters provided by the model and the optimum execution time, with different basic libraries, on SUN 1

[Figure: quotient (0 to 1.4) versus problem size (256 to 1536) for: ref. BLAS, mac. BLAS, ATLAS]

Parallel LU

Analytical Model: Texec = f (SPs, n, APs)

$t_{ari} = \frac{2}{3} k_3 \frac{n^3}{p} + k_3 \frac{r + c}{r c} b n^2 + \frac{1}{3} k_2 b^2 n$

SPs: cost of arithmetic operations: k1, k2, k3

  cost of communications: ts, tw

APs: block size b,

  number of processors p,

  grid configuration r × c

[Figure: distribution of the blocks, of size b, over a 2 × 3 grid of processors]

00 01 02 00 01 02

10 11 12 10 11 12

00 01 02 00 01 02

10 11 12 10 11 12

00 01 02 00 01 02

10 11 12 10 11 12
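The block layout on this slide corresponds to a 2D block-cyclic distribution, which can be sketched as a mapping from a matrix element to the grid coordinates of its owner. The function name `block_owner` is hypothetical.

```python
def block_owner(i, j, b, r, c):
    """Grid coordinates of the processor owning matrix element (i, j)."""
    return (i // b) % r, (j // b) % c

# Reproduce the 6 x 6 block layout of the slide for a 2 x 3 grid (b = 1
# here, so each index is a block index).
for bi in range(6):
    print(" ".join("%d%d" % block_owner(bi, bj, 1, 2, 3) for bj in range(6)))
```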

Parallel LU. Comparison in IBM SP2

Quotient between the execution time with the parameters provided by the model and the optimum execution time. In the sequential case, and in parallel with 4 and 8 processors.

[Figure: quotient (0 to 2.5) versus problem size (512 to 3584) for: SEQ, PAR4, PAR8]

Parallel LU. Comparison in Origin 2000

Quotient between the execution time with the parameters provided by the model and the optimum execution time. In the sequential case, and in parallel with 4 and 8 processors.

[Figure: quotient (0 to 1.4) versus problem size (512 to 3584) for: SEQ, PAR4, PAR8]

Parallel LU. Conclusions

The modelling of the algorithm provides satisfactory results in different systems:

  Origin 2000, IBM SP2

  reference BLAS, machine BLAS, ATLAS

The prediction is worse in some cases:

  When the number of processors increases

  In multicomputers where communications are more important (IBM SP2)

For these cases: Exhaustive Executions

Parallel LU. Exhaustive Execution

If the manager installs the routine for sizes 512, 1536, 2560, and executions are performed for sizes 1024, 2048, 3072, the execution time is well predicted

The same policy can be used in the installation of other software:

Quotient between the execution time with the parameters provided by the installation process and the optimum execution time. With ScaLAPACK, on IBM SP2

[Figure: quotient (0 to 1.6) versus problem size (1024, 2048, 3072) for 4 processors and 8 processors]
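When the routine is executed for a size that was not installed, one simple policy is to reuse the APs stored for the closest installed size. The sketch below assumes that policy, with ties resolved towards the larger installed size; the talk does not state which rule was used, and the table contents and the name `nearest_installed` are hypothetical.

```python
def nearest_installed(n, table):
    """Return the APs stored for the installed size closest to n."""
    size = min(table, key=lambda installed: (abs(installed - n), -installed))
    return table[size]

# Hypothetical AP table produced at installation for sizes 512, 1536, 2560.
installed = {
    512: {"b": 32, "p": 4},
    1536: {"b": 64, "p": 8},
    2560: {"b": 128, "p": 8},
}

for n in (1024, 2048, 3072):
    print(n, nearest_installed(n, installed))
```

Interpolating between the two neighbouring installed sizes would be an alternative when the APs vary smoothly with n.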

Conclusions

Parameterisation of Parallel Linear Algebra Routines enables development of Automatically Tuned Software

Two techniques can be used: Analytical Modelling, Exhaustive Executions, or a combination of both

Experiments performed in different systems and with different routines


We try to develop a methodology valid for a wide range of systems, and to include it in the design of linear algebra libraries:

it is necessary to analyse the methodology in more systems and with more routines

Architecture of an Automatically Tuned Linear Algebra Library

At the moment we are analysing routines individually, but it could be preferable to analyse algorithmic schemes

Future Works

Architecture of an Automatically Tuned Linear Algebra Library

[Diagram, built up over six slides. Elements: Installation routines, Basic routines declaration, Basic routines library, Installation file, SP file, AP file, Library. Processes: Installation, Compilation. Roles: designer, manager.]