
Automatic Parameterisation of Parallel Linear Algebra Routines

Domingo Giménez Javier Cuenca José González

University of Murcia, Spain

Algèbre Linéaire et Arithmétique: Calcul Numérique, Symbolique et Paralèle. Rabat, Maroc, 28-31 Mai 2001

Outline

Current Situation of Linear Algebra Parallel Routines (LAPRs)

Objective

Approach I: Analytical Model of the LAPRs

Application: Jacobi Method on Origin 2000

Approach II: Exhaustive Executions

Application: Gauss elimination on networks of processors

Validation with the LU factorization

Conclusions

Future Works

Linear Algebra: highly optimizable operations

Optimizations are Platform Specific Traditional method: Hand-Optimization for each platform

Current Situation of Linear Algebra Parallel Routines (LAPRs)

[Figure: execution time in seconds (0 to 180) against problem size (512 to 3072) for an untuned and a hand-tuned version of the routine]

Time-consuming

Incompatible with hardware evolution

Incompatible with changes in the system (architecture and basic libraries)

Unsuitable for dynamic systems

Misuse by non-expert users

Problems of traditional method

ATLAS, FLAME, I-LIB

Analyse platform characteristics in detail

Sequential code

Empirical results of the LAPR + automation

High installation time

Current approaches

Develop a methodology for obtaining Automatically Tuned Software

Execution Environment

Auto-tuning Software

Our objective

Parameterised routines: system parameters and algorithmic parameters

System parameters obtained at installation time, either:

from an analytical model of the routine plus simple installation routines that measure the system parameters, or

from a reduced number of executions at installation time

Algorithmic parameters obtained at running time, either:

from the analytical model with the system parameters obtained in the installation process, or

from the file with information generated in the installation process

Methodology

System parameters obtained at installation time:

analytical model of the routine and simple installation routines to obtain the system parameters

Algorithmic parameters obtained at running time:

from the analytical model with the system parameters obtained in the installation process

Analytical modelling

The behaviour of the algorithm on the platform is defined by

Texec = f(SPs, n, APs), with SPs = f(n, APs)

where SPs are the system parameters, APs the algorithmic parameters and n the problem size.

Analytical Model
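The selection of APs by minimising a modelled Texec can be sketched as follows. This is a hypothetical illustration, not the slide's model: the cost function and the SP values (k3, ts, tw) are illustrative stand-ins, and the search simply enumerates candidate processor counts, grid shapes and block sizes.

```python
# Illustrative sketch: pick the APs (p, r x c grid, block size b) that
# minimise a modelled Texec = f(SPs, n, APs). Cost function is a toy model.

def texec(n, p, r, c, b, k3=0.004e-6, ts=20e-6, tw=0.1e-6):
    """Toy model: O(n^3/p) arithmetic plus per-block mesh communications."""
    arith = k3 * 2 * n**3 / (3 * p)
    comm = (n / b) * (2 * ts + (n * b / r + n * b / c) * tw)
    return arith + comm

def best_aps(n, max_p=8, blocks=(32, 64, 128)):
    """Exhaustive search over p, grid r x c (with r*c = p) and block size b."""
    candidates = []
    for p in range(1, max_p + 1):
        for r in range(1, p + 1):
            if p % r:
                continue
            c = p // r
            for b in blocks:
                candidates.append((texec(n, p, r, c, b), p, r, c, b))
    return min(candidates)

t, p, r, c, b = best_aps(2048)
print(f"n=2048 -> p={p}, grid {r}x{c}, b={b}")
```

At run time only the cheap model evaluations are repeated, so the search cost is negligible compared with the routine itself.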

System Parameters (SPs) capture:

the physical characteristics of the hardware platform

the current conditions of the system

the basic libraries

How to estimate each SP?

1º Obtain the kernel that dominates the performance cost of the LAPR

2º Build an estimation routine from this kernel

Two Kinds of SPs:

Communication System Parameters (CSPs)

Arithmetic System Parameters (ASPs)

Analytical Model

LAPRs Performance

Arithmetic System Parameters (ASPs): tc, the arithmetic cost per basic operation; when BLAS is used, one cost per level: k1, k2 and k3.

The estimation routine runs the computation kernel of the LAPR with a similar storage scheme and a similar quantity of data.

Analytical Model
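An ASP estimation routine of this kind can be sketched as follows; this is a hypothetical illustration in which numpy's matrix product stands in for the platform BLAS (DGEMM), and k3 is derived as time per floating-point operation.

```python
# Illustrative ASP estimation: time a matrix-matrix product (the DGEMM
# kernel the LAPR uses) and derive k3 as seconds per flop.
# numpy's matmul stands in here for the platform BLAS.
import time
import numpy as np

def estimate_k3(n=256, reps=3):
    """Return seconds per flop for an n x n matrix product (2*n^3 flops)."""
    a = np.random.rand(n, n)
    c = np.random.rand(n, n)
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        a @ c                                  # the timed kernel
        best = min(best, time.perf_counter() - t0)
    return best / (2 * n**3)

k3 = estimate_k3()
print(f"k3 ~ {k3:.2e} s/flop")
```

Taking the best of several repetitions reduces the effect of transient system load, which matters because the SPs are meant to reflect the conditions the LAPR will actually run under.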

Communication System Parameters (CSPs): ts, the start-up time, and tw, the word-sending time.

The estimation routine runs the communication kernel of the LAPR with a similar kind of communication and a similar quantity of data.

Analytical Model
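Once ping-pong style timings are available for several message sizes, ts and tw can be extracted with a least-squares line fit, since the usual communication model is time = ts + size * tw. The fitting step below is a hypothetical sketch (the timings are synthetic, built from the ts and tw values estimated later in this example).

```python
# Illustrative CSP extraction: fit measured message times (seconds)
# against message sizes (words) to the model time = ts + size * tw.
import numpy as np

def fit_ts_tw(sizes, times):
    """Least-squares fit; returns (ts, tw) as intercept and slope."""
    A = np.vstack([np.ones_like(sizes, dtype=float), sizes]).T
    (ts, tw), *_ = np.linalg.lstsq(A, times, rcond=None)
    return ts, tw

# Synthetic timings for ts = 20 us, tw = 0.1 us (the values the
# installation process estimates on the Origin 2000 in this example)
sizes = np.array([1e3, 1e4, 1e5, 1e6])
times = 20e-6 + 0.1e-6 * sizes
ts, tw = fit_ts_tw(sizes, times)
print(f"ts ~ {ts*1e6:.1f} us, tw ~ {tw*1e6:.3f} us")
```

In a real installation the `times` array would come from MPI ping-pong measurements between the processors that will execute the LAPR.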

Algorithmic Parameters (APs): the values chosen in each execution:

b, the block size

p, the number of processors

r × c, the logical topology (grid configuration of the logical 2D mesh)

Analytical Model

Pre-installing (manual):

1º Make the Analytical Model: Texec = f (SPs, n, APs)

2º Write the Estimation Routines for the SPs

Installing on a Platform (automatic):

3º Estimate the SPs using the Estimation Routines of step 2

4º Write a Configuration File, or include the information in the LAPR: for each n, the APs that minimize Texec

Execution:

The user executes LAPR for a size n:

LAPR obtains optimal APs

The Methodology. Step by step:

LAPR: One-sided Block Jacobi Method to solve the Symmetric Eigenvalue Problem.

Message-passing with MPI; logical ring and logical 2D mesh

Platform: SGI Origin 2000

Application Example

Application Example. Algorithm Scheme

[Diagram: block distribution of the matrices B, W and D over the logical mesh; each processor (i, j) holds blocks of size b within panels of n/r rows of the n × n matrices]

Application Example: Pre-installing.

1º Make the Analytical Model: Texec = f(SPs, n, APs)

Texec = tari + tVC + tHC

with tari the arithmetic time (terms in k1 and k3, dominated by the O(n³/p) matrix products on the p = r × c mesh), tVC the vertical communications (terms in ts and tw depending on n, b and r) and tHC the horizontal communications (terms in ts and tw depending on n and b).

Application Example: Pre-installing.

2º Write the Estimation Routines for the SPs

k3: matrix-matrix multiplication with DGEMM

k1: Givens rotation applied to 2 vectors with DROT

ts, tw: communications along the 2 directions of the 2D mesh

Application Example: Installing

3º Estimate the SPs using the Estimation Routines

k1 ≈ 0.01 µs

k3 ≈ 0.005 µs (b = 32), 0.004 µs (b = 64), 0.003 µs (b = 128)

ts ≈ 20 µs

tw ≈ 0.1 µs

Comparison of execution times using different sets of Execution Parameters (4 processors)

Application Example: Executing

[Figure: execution time (0 to 300) against problem size (512 to 3072) for Untuned, Tuned with MCAP, Tuned with MVAP and the Optimal Execution Time]

Comparison of execution times using different sets of Execution Parameters (8 processors)

Application Example: Executing

[Figure: execution time (0 to 200) against problem size (512 to 3072) for Untuned, Tuned with MCAP, Tuned with MVAP and the Optimal Execution Time]

LAPR: One-sided Block Jacobi Method

Algorithmic parameters: block size, mesh topology

Platform: SGI Origin 2000 with message-passing

System parameters: arithmetic costs, communication costs

Satisfactory reduction of the execution time: from 25% above the optimal to only 2%

Application Example: Executing

Outline

Current Situation of Linear Algebra Parallel Routines (LAPRs)

Objective

Approach I: Analytical Model of the LAPRs

Application: Jacobi Method on Origin 2000

Approach II: Exhaustive Executions

Application: Gauss elimination on networks of processors

Validation with the LU factorization

Conclusions

Future Works

System parameters obtained at installation time:

installation routines making a reduced number of executions at installation time

Algorithmic parameters obtained at running time:

from the file with information generated in the installation process

Exhaustive Execution

The behaviour of the algorithm on the platform is defined (as in analytical modelling) by

Texec = f(SPs, n, APs), with SPs = f(n, APs)

where SPs are the system parameters, APs the algorithmic parameters and n the problem size.

Exhaustive Execution

Identify the Algorithmic Parameters (APs) (as in analytical modelling): the values chosen in each execution:

b, the block size

p, the number of processors

r × c, the logical topology (grid configuration of the logical 2D mesh)

Exhaustive Execution

Pre-installing (manual):

1º Determine the APs

2º Decide heuristics to reduce execution time in the installation process

Installing on a Platform (automatic):

3º The manager decides the problem sizes to be analysed

4º Execute and write a Configuration File, or include the information in the LAPR: for each n, the APs that minimize Texec

Execution:

The user executes LAPR for a size n:

LAPR obtains optimal APs

The Methodology. Step by step:
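The installation and run-time steps above can be sketched as follows. This is a hypothetical illustration: `run_lapr` is a stand-in for a real timed execution of the routine, and the nearest-size lookup is one simple way to reuse the installed information for sizes that were not measured.

```python
# Illustrative sketch of Approach II: at installation time, execute the
# routine for the manager-chosen sizes over all AP candidates and record
# the winners; at run time, reuse the APs of the nearest installed size.
import json

def run_lapr(n, p, b):
    """Stand-in for a timed execution of the LAPR (toy cost)."""
    return n**3 / p + n * b + 1e4 * p / b

def install(sizes, ps=(1, 2, 4, 8), bs=(16, 32, 64, 128), path="aps.json"):
    cfg = {}
    for n in sizes:
        t, p, b = min((run_lapr(n, p, b), p, b) for p in ps for b in bs)
        cfg[n] = {"p": p, "b": b}
    with open(path, "w") as f:         # the Configuration File
        json.dump(cfg, f)

def runtime_aps(n, path="aps.json"):
    with open(path) as f:
        cfg = json.load(f)
    nearest = min(cfg, key=lambda m: abs(int(m) - n))
    return cfg[nearest]

install([512, 1536, 2560])
print(runtime_aps(1024))
```

The heuristics of step 2º would prune the candidate grid and stop the loop early; the version above runs the full exhaustive search for clarity.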

LAPR: Gaussian elimination.

Message-passing with MPI Logical Ring,

rowwise block-cyclic striped partitioning

Platform: networks of processors (heterogeneous system)

Application Example

Application Example: Pre-installing.

1º Determine the APs (logical ring, rowwise block-cyclic striped partitioning):

p, the number of processors

b, the block size for the data distribution (different block sizes in heterogeneous systems)

[Diagram: cyclic distribution of blocks of sizes b0, b1, b2 over the ring of processors]

Application Example: Pre-installing.

2º Decide heuristics to reduce execution time in the installation process:

the execution time varies in a continuous way with the problem size and the APs

the system can be considered as homogeneous

the installation can finish when the analytical and experimental predictions coincide, or when a certain time has been spent on the installation

Homogeneous Systems:

3º The manager decides the problem sizes

4º Execute and write a Configuration File, or include the information in the LAPR: for each n, the APs that minimize Texec

Heterogeneous Systems:

3º The manager decides the problem sizes

4º Execute and:

write a Configuration File with, for each n, the APs that minimize Texec

write a Speed File with the relative speeds of the processors in the system

Application Example: Installing

RI-THE: obtains p and b from the formula.

RI-HOM: obtains p and b through a reduced number of executions.

RI-HET: 1º as RI-HOM; 2º obtains a block size bi for each processor, in proportion to its relative speed si:

bi = p · b · si / (s1 + … + sp)

Application Example: Installation Routines
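The second step of RI-HET can be sketched as follows; this is a hypothetical illustration of splitting the p·b rows of each block-cycle among processors in proportion to their relative speeds, with a simple rounding scheme (an assumption, not specified on the slide) so the sizes still sum to p·b.

```python
# Illustrative RI-HET step 2: per-processor block sizes
# bi = p * b * si / sum_j(sj), rounded while preserving the total.

def het_block_sizes(speeds, b):
    p = len(speeds)
    total = p * b
    raw = [total * s / sum(speeds) for s in speeds]
    sizes = [int(x) for x in raw]
    # hand the rounding remainder to the largest fractional parts
    for i in sorted(range(p), key=lambda i: raw[i] - sizes[i], reverse=True):
        if sum(sizes) == total:
            break
        sizes[i] += 1
    return sizes

print(het_block_sizes([1.0, 1.0, 2.0], b=32))
```

A processor twice as fast as its neighbours thus receives twice as many rows per cycle, which is what balances the per-step work in the rowwise block-cyclic distribution.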

Three different configurations:

PLA_HOM: 5 SUN Ultra-1

PLA_HYB: 5 SUN Ultra-1

1 SUN Ultra-5

PLA_HET: 1 SUN Ultra-1

1 SUN Ultra-5

1 SUN Ultra-1 (manages the file system)

Application Example: Systems

Experimental results in PLA-HOM:

Quotient between the execution time with the parameters from the Installation Routine and the optimum execution time

[Figure: quotient (0 to 2) against problem size (500 to 3000) for RI-THEO, RI-HOMO and RI-HETE]

Application Example: Executing

Experimental results in PLA-HYB:

Quotient between the execution time with the parameters from the Installation Routine and the optimum execution time

[Figure: quotient (0 to 2) against problem size (500 to 3000) for RI-THEO, RI-HOMO and RI-HETE]

Application Example: Executing

[Figure: quotient (0 to 2) against problem size (500 to 3000) for RI-THEO, RI-HOMO and RI-HETE]

Experimental results in PLA-HET:

Quotient between the execution time with the parameters from the Installation Routine and the optimum execution time

Application Example: Executing

Two techniques for automatic tuning of Parallel Linear Algebra Routines:

1. Analytical Modelling: for predictable systems (homogeneous, static, ...) like the Origin 2000

2. Exhaustive Execution: for less predictable systems (heterogeneous, dynamic, ...) like networks of workstations

Both are transparent to the user, with execution close to the optimum.

Comparison

Outline

Current Situation of Linear Algebra Parallel Routines (LAPRs)

Objective

Approach I: Analytical Model of the LAPRs

Application: Jacobi Method on Origin 2000

Approach II: Exhaustive Executions

Application: Gauss elimination on networks of processors

Validation with the LU factorization

Conclusions

Future Works

To validate the methodology it is necessary to experiment with:

More routines: block LU factorization

More systems:

Architectures: IBM SP2 and Origin 2000

Libraries: reference BLAS, machine BLAS, ATLAS

Validation with the LU factorization

Sequential LU

tari = a term of order 2n³/3 in k3, plus lower-order terms in k2 and k1 that depend on the block size b

Analytical Model: Texec = f(SPs, n, APs)

SPs: cost of arithmetic operations of different levels: k1, k2, k3

APs: block size b

[Diagram: blocked LU factorization with sub-blocks labelled LU, ES and UM, and block size b]
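With the three arithmetic SPs measured, choosing the block size b amounts to minimising the modelled time over the candidate sizes. The sketch below is illustrative: the per-block k3 values reuse the measurements shown in the earlier installation slide, while the overhead term in K1 is an assumed stand-in for the model's lower-order terms, not the slide's formula.

```python
# Illustrative block-size selection for the sequential blocked LU.
# K3 per block size: measured values from the installation slide;
# K1 overhead term: an assumption standing in for the lower-order terms.
K3 = {32: 0.005e-6, 64: 0.004e-6, 128: 0.003e-6, 256: 0.003e-6}  # s/flop
K1 = 0.01e-6

def lu_time(n, b):
    """Modelled time: 2n^3/3 flops at rate K3[b], plus block overhead."""
    return K3[b] * 2 * n**3 / 3 + K1 * n * b**2 / 2

def best_block(n):
    """Return the candidate block size that minimises the modelled time."""
    return min(K3, key=lambda b: lu_time(n, b))
```

The trade-off the model captures is that larger blocks make BLAS faster per flop but cost more in overhead, so the best b grows with the problem size.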

Quotient between different execution times and the optimum execution time

Sequential LU. Comparison in IBM SP2

[Figure: quotient (0 to 1.4) against problem size (512 to 2560) for the modelled, weighted and LAPACK block sizes]

Quotient between the execution time with the parameters provided by the model and the optimum execution time, with different basic libraries, in SUN 1

Sequential LU. Model execution time/optimum execution time

[Figure: quotient (0 to 1.4) against problem size (256 to 1536) for reference BLAS, machine BLAS and ATLAS]

Parallel LU

tari = a term of order 2n³/(3p) in k3, plus lower-order terms in k2 and k1 that depend on the block size b and the grid configuration r × c

Analytical Model: Texec= f (SPs,n,APs)

SPs: cost of arithmetic operations: k1, k2, k3

cost of communications: ts, tw

APs: block size b,

number of processors p,

grid configuration r × c

[Diagram: 2D block-cyclic distribution of blocks of size b over a 2 × 3 grid of processors (labelled 00 to 12)]

Quotient between the execution time with the parameters provided by the model and the optimum execution time, in the sequential case and in parallel with 4 and 8 processors

Parallel LU. Comparison in IBM SP2

[Figure: quotient (0 to 2.5) against problem size (512 to 3584) for SEQ, PAR4 and PAR8]

Quotient between the execution time with the parameters provided by the model and the optimum execution time, in the sequential case and in parallel with 4 and 8 processors

Parallel LU. Comparison in Origin 2000

[Figure: quotient (0 to 1.4) against problem size (512 to 3584) for SEQ, PAR4 and PAR8]

The modelling of the algorithm provides satisfactory results in different systems (Origin 2000, IBM SP2) and with different basic libraries (reference BLAS, machine BLAS, ATLAS). The prediction is worse in some cases:

when the number of processors increases

in multicomputers where communications are more important (IBM SP2)

Exhaustive Executions

Parallel LU. Conclusions

If the manager installs the routine for sizes 512, 1536 and 2560, and executions are performed for sizes 1024, 2048 and 3072, the execution time is still well predicted.

The same policy can be used in the installation of other software:

Quotient between the execution time with the parameters provided by the installation process and the optimum execution time, with ScaLAPACK, in IBM SP2

Parallel LU. Exhaustive Execution

[Figure: quotient (0 to 1.6) for problem sizes 1024, 2048 and 3072, with 4 and 8 processors]

Parameterisation of Parallel Linear Algebra Routines enables development of Automatically Tuned Software

Two techniques can be used: Analytical Modelling, Exhaustive Executions, or a combination of both

Experiments performed in different systems and with different routines

Conclusions

We try to develop a methodology valid for a wide range of systems, and to include it in the design of linear algebra libraries:

it is necessary to analyse the methodology in more systems and with more routines

Architecture of an Automatically Tuned Linear Algebra Library

At the moment we are analysing routines individually, but it could be preferable to analyse algorithmic schemes

Future Works

Architecture of an Automatically Tuned Linear Algebra Library

[Diagram: the designer provides the Installation routines and the Basic routines library; the manager adds the Basic routines declaration and runs the Installation, producing the Installation file and the SP and AP files; a Compilation step then builds the Library]

