
Automatic Parameterisation of Parallel Linear Algebra Routines

Domingo Giménez Javier Cuenca José González

University of Murcia, Spain

Algèbre Linéaire et Arithmétique: Calcul Numérique, Symbolique et Paralèle. Rabat, Maroc, 28-31 Mai 2001

Outline

Current Situation of Linear Algebra Parallel Routines (LAPRs)

Objective

Approach I: Analytical Model of the LAPRs

Application: Jacobi Method on Origin 2000

Approach II: Exhaustive Executions

Application: Gauss elimination on networks of processors

Validation with the LU factorization

Conclusions

Future Works

Linear Algebra: highly optimizable operations

Optimizations are Platform Specific Traditional method: Hand-Optimization for each platform

Current Situation of Linear Algebra Parallel Routines (LAPRs)

[Figure: execution time in seconds (0 to 180) against problem size (512 to 3072) for an untuned and a hand-tuned version of the routine]

Time-consuming

Incompatible with hardware evolution

Incompatible with changes in the system (architecture and basic libraries)

Unsuitable for dynamic systems

Misuse by non-expert users

Problems of traditional method

ATLAS, FLAME, I-LIB

Analyse platform characteristics in detail

Sequential code

Empirical results of the LAPR + automation

High installation time

Current approaches

Develop a methodology for obtaining Automatically Tuned Software

Execution Environment

Auto-tuning Software

Our objective

Parameterised routines: system parameters and algorithmic parameters

System parameters obtained at installation time, either:

from an analytical model of the routine plus simple installation routines that measure the system parameters, or

from a reduced number of executions at installation time

Algorithmic parameters obtained at running time, either:

from the analytical model with the system parameters obtained in the installation process, or

from the file with information generated in the installation process

Methodology

System parameters obtained at installation time:

analytical model of the routine and simple installation routines to obtain the system parameters

Algorithmic parameters obtained at running time:

from the analytical model with the system parameters obtained in the installation process

Analytical modelling

The behaviour of the algorithm on the platform is defined by

Texec = f(SPs, n, APs), with SPs = f(n, APs)

where SPs are the system parameters, APs the algorithmic parameters and n the problem size.

Analytical Model
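The selection of APs by minimising a modelled Texec can be sketched as follows. This is a hypothetical illustration, not the slide's model: the cost function and the SP values (k3, ts, tw) are illustrative stand-ins, and the search simply enumerates candidate processor counts, grid shapes and block sizes.

```python
# Illustrative sketch: pick the APs (p, r x c grid, block size b) that
# minimise a modelled Texec = f(SPs, n, APs). Cost function is a toy model.

def texec(n, p, r, c, b, k3=0.004e-6, ts=20e-6, tw=0.1e-6):
    """Toy model: O(n^3/p) arithmetic plus per-block mesh communications."""
    arith = k3 * 2 * n**3 / (3 * p)
    comm = (n / b) * (2 * ts + (n * b / r + n * b / c) * tw)
    return arith + comm

def best_aps(n, max_p=8, blocks=(32, 64, 128)):
    """Exhaustive search over p, grid r x c (with r*c = p) and block size b."""
    candidates = []
    for p in range(1, max_p + 1):
        for r in range(1, p + 1):
            if p % r:
                continue
            c = p // r
            for b in blocks:
                candidates.append((texec(n, p, r, c, b), p, r, c, b))
    return min(candidates)

t, p, r, c, b = best_aps(2048)
print(f"n=2048 -> p={p}, grid {r}x{c}, b={b}")
```

At run time only the cheap model evaluations are repeated, so the search cost is negligible compared with the routine itself.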

System Parameters (SPs) capture:

the physical characteristics of the hardware platform

the current conditions of the system

the basic libraries

How to estimate each SP?

1º Obtain the kernel that dominates the performance cost of the LAPR

2º Build an estimation routine from this kernel

Two Kinds of SPs:

Communication System Parameters (CSPs)

Arithmetic System Parameters (ASPs)

Analytical Model

LAPRs Performance

Arithmetic System Parameters (ASPs): tc, the arithmetic cost per basic operation; when BLAS is used, one cost per level: k1, k2 and k3.

The estimation routine runs the computation kernel of the LAPR with a similar storage scheme and a similar quantity of data.

Analytical Model
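An ASP estimation routine of this kind can be sketched as follows; this is a hypothetical illustration in which numpy's matrix product stands in for the platform BLAS (DGEMM), and k3 is derived as time per floating-point operation.

```python
# Illustrative ASP estimation: time a matrix-matrix product (the DGEMM
# kernel the LAPR uses) and derive k3 as seconds per flop.
# numpy's matmul stands in here for the platform BLAS.
import time
import numpy as np

def estimate_k3(n=256, reps=3):
    """Return seconds per flop for an n x n matrix product (2*n^3 flops)."""
    a = np.random.rand(n, n)
    c = np.random.rand(n, n)
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        a @ c                                  # the timed kernel
        best = min(best, time.perf_counter() - t0)
    return best / (2 * n**3)

k3 = estimate_k3()
print(f"k3 ~ {k3:.2e} s/flop")
```

Taking the best of several repetitions reduces the effect of transient system load, which matters because the SPs are meant to reflect the conditions the LAPR will actually run under.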

Communication System Parameters (CSPs): ts, the start-up time, and tw, the word-sending time.

The estimation routine runs the communication kernel of the LAPR with a similar kind of communication and a similar quantity of data.

Analytical Model
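Once ping-pong style timings are available for several message sizes, ts and tw can be extracted with a least-squares line fit, since the usual communication model is time = ts + size * tw. The fitting step below is a hypothetical sketch (the timings are synthetic, built from the ts and tw values estimated later in this example).

```python
# Illustrative CSP extraction: fit measured message times (seconds)
# against message sizes (words) to the model time = ts + size * tw.
import numpy as np

def fit_ts_tw(sizes, times):
    """Least-squares fit; returns (ts, tw) as intercept and slope."""
    A = np.vstack([np.ones_like(sizes, dtype=float), sizes]).T
    (ts, tw), *_ = np.linalg.lstsq(A, times, rcond=None)
    return ts, tw

# Synthetic timings for ts = 20 us, tw = 0.1 us (the values the
# installation process estimates on the Origin 2000 in this example)
sizes = np.array([1e3, 1e4, 1e5, 1e6])
times = 20e-6 + 0.1e-6 * sizes
ts, tw = fit_ts_tw(sizes, times)
print(f"ts ~ {ts*1e6:.1f} us, tw ~ {tw*1e6:.3f} us")
```

In a real installation the `times` array would come from MPI ping-pong measurements between the processors that will execute the LAPR.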

Algorithmic Parameters (APs): the values chosen in each execution:

b, the block size

p, the number of processors

r × c, the logical topology (grid configuration of the logical 2D mesh)

Analytical Model

Pre-installing (manual):

1º Make the Analytical Model: Texec = f (SPs, n, APs)

2º Write the Estimation Routines for the SPs

Installing on a Platform (automatic):

3º Estimate the SPs using the Estimation Routines of step 2

4º Write a Configuration File, or include the information in the LAPR: for each n, the APs that minimize Texec

Execution:

The user executes LAPR for a size n:

LAPR obtains optimal APs

The Methodology. Step by step:

LAPR: One-sided Block Jacobi Method to solve the Symmetric Eigenvalue Problem.

Message-passing with MPI; logical ring and logical 2D mesh

Platform: SGI Origin 2000

Application Example

Application Example. Algorithm Scheme

[Diagram: block distribution of the matrices B, W and D over the logical mesh; each processor (i, j) holds blocks of size b within panels of n/r rows of the n × n matrices]

Application Example: Pre-installing.

1º Make the Analytical Model: Texec = f(SPs, n, APs)

Texec = tari + tVC + tHC

with tari the arithmetic time (terms in k1 and k3, dominated by the O(n³/p) matrix products on the p = r × c mesh), tVC the vertical communications (terms in ts and tw depending on n, b and r) and tHC the horizontal communications (terms in ts and tw depending on n and b).

Application Example: Pre-installing.

2º Write the Estimation Routines for the SPs

k3: matrix-matrix multiplication with DGEMM

k1: Givens rotation applied to 2 vectors with DROT

ts, tw: communications along the 2 directions of the 2D mesh

Application Example: Installing

3º Estimate the SPs using the Estimation Routines

k1 ≈ 0.01 µs

k3 ≈ 0.005 µs (b = 32), 0.004 µs (b = 64), 0.003 µs (b = 128)

ts ≈ 20 µs

tw ≈ 0.1 µs

Comparison of execution times using different sets of Execution Parameters (4 processors)

Application Example: Executing

[Figure: execution time (0 to 300) against problem size (512 to 3072) for Untuned, Tuned with MCAP, Tuned with MVAP and the Optimal Execution Time]

Comparison of execution times using different sets of Execution Parameters (8 processors)

Application Example: Executing

[Figure: execution time (0 to 200) against problem size (512 to 3072) for Untuned, Tuned with MCAP, Tuned with MVAP and the Optimal Execution Time]

LAPR: One-sided Block Jacobi Method

Algorithmic parameters: block size, mesh topology

Platform: SGI Origin 2000 with message-passing

System parameters: arithmetic costs, communication costs

Satisfactory reduction of the execution time: from 25% above the optimal to only 2%

Application Example: Executing

Outline

Current Situation of Linear Algebra Parallel Routines (LAPRs)

Objective

Approach I: Analytical Model of the LAPRs

Application: Jacobi Method on Origin 2000

Approach II: Exhaustive Executions

Application: Gauss elimination on networks of processors

Validation with the LU factorization

Conclusions

Future Works

System parameters obtained at installation time:

installation routines making a reduced number of executions at installation time

Algorithmic parameters obtained at running time:

from the file with information generated in the installation process

Exhaustive Execution

The behaviour of the algorithm on the platform is defined (as in analytical modelling) by

Texec = f(SPs, n, APs), with SPs = f(n, APs)

where SPs are the system parameters, APs the algorithmic parameters and n the problem size.

Exhaustive Execution

Identify the Algorithmic Parameters (APs) (as in analytical modelling): the values chosen in each execution:

b, the block size

p, the number of processors

r × c, the logical topology (grid configuration of the logical 2D mesh)

Exhaustive Execution

Pre-installing (manual):

1º Determine the APs

2º Decide heuristics to reduce execution time in the installation process

Installing on a Platform (automatic):

3º The manager decides the problem sizes to be analysed

4º Execute and write a Configuration File, or include the information in the LAPR: for each n, the APs that minimize Texec

Execution:

The user executes LAPR for a size n:

LAPR obtains optimal APs

The Methodology. Step by step:
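The installation and run-time steps above can be sketched as follows. This is a hypothetical illustration: `run_lapr` is a stand-in for a real timed execution of the routine, and the nearest-size lookup is one simple way to reuse the installed information for sizes that were not measured.

```python
# Illustrative sketch of Approach II: at installation time, execute the
# routine for the manager-chosen sizes over all AP candidates and record
# the winners; at run time, reuse the APs of the nearest installed size.
import json

def run_lapr(n, p, b):
    """Stand-in for a timed execution of the LAPR (toy cost)."""
    return n**3 / p + n * b + 1e4 * p / b

def install(sizes, ps=(1, 2, 4, 8), bs=(16, 32, 64, 128), path="aps.json"):
    cfg = {}
    for n in sizes:
        t, p, b = min((run_lapr(n, p, b), p, b) for p in ps for b in bs)
        cfg[n] = {"p": p, "b": b}
    with open(path, "w") as f:         # the Configuration File
        json.dump(cfg, f)

def runtime_aps(n, path="aps.json"):
    with open(path) as f:
        cfg = json.load(f)
    nearest = min(cfg, key=lambda m: abs(int(m) - n))
    return cfg[nearest]

install([512, 1536, 2560])
print(runtime_aps(1024))
```

The heuristics of step 2º would prune the candidate grid and stop the loop early; the version above runs the full exhaustive search for clarity.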

LAPR: Gaussian elimination.

Message-passing with MPI Logical Ring,

rowwise block-cyclic striped partitioning

Platform: networks of processors (heterogeneous system)

Application Example

Application Example: Pre-installing.

1º Determine the APs (logical ring, rowwise block-cyclic striped partitioning):

p, the number of processors

b, the block size for the data distribution (different block sizes in heterogeneous systems)

[Diagram: cyclic distribution of blocks of sizes b0, b1, b2 over the ring of processors]

Application Example: Pre-installing.

2º Decide heuristics to reduce execution time in the installation process:

the execution time varies in a continuous way with the problem size and the APs

the system can be considered as homogeneous

the installation can finish when the analytical and experimental predictions coincide, or when a certain time has been spent on the installation

Homogeneous Systems:

3º The manager decides the problem sizes

4º Execute and write a Configuration File, or include the information in the LAPR: for each n, the APs that minimize Texec

Heterogeneous Systems:

3º The manager decides the problem sizes

4º Execute and:

write a Configuration File with, for each n, the APs that minimize Texec

write a Speed File with the relative speeds of the processors in the system

Application Example: Installing

RI-THE: obtains p and b from the formula.

RI-HOM: obtains p and b through a reduced number of executions.

RI-HET: 1º as RI-HOM; 2º obtains a block size bi for each processor, in proportion to its relative speed si:

bi = p · b · si / (s1 + … + sp)

Application Example: Installation Routines
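The second step of RI-HET can be sketched as follows; this is a hypothetical illustration of splitting the p·b rows of each block-cycle among processors in proportion to their relative speeds, with a simple rounding scheme (an assumption, not specified on the slide) so the sizes still sum to p·b.

```python
# Illustrative RI-HET step 2: per-processor block sizes
# bi = p * b * si / sum_j(sj), rounded while preserving the total.

def het_block_sizes(speeds, b):
    p = len(speeds)
    total = p * b
    raw = [total * s / sum(speeds) for s in speeds]
    sizes = [int(x) for x in raw]
    # hand the rounding remainder to the largest fractional parts
    for i in sorted(range(p), key=lambda i: raw[i] - sizes[i], reverse=True):
        if sum(sizes) == total:
            break
        sizes[i] += 1
    return sizes

print(het_block_sizes([1.0, 1.0, 2.0], b=32))
```

A processor twice as fast as its neighbours thus receives twice as many rows per cycle, which is what balances the per-step work in the rowwise block-cyclic distribution.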

Three different configurations:

PLA_HOM: 5 SUN Ultra-1

PLA_HYB: 5 SUN Ultra-1

1 SUN Ultra-5

PLA_HET: 1 SUN Ultra-1

1 SUN Ultra-5

1 SUN Ultra-1 (manages the file system)

Application Example: Systems

Experimental results in PLA-HOM:

Quotient between the execution time with the parameters from the Installation Routine and the optimum execution time

[Figure: quotient (0 to 2) against problem size (500 to 3000) for RI-THEO, RI-HOMO and RI-HETE]

Application Example: Executing

Experimental results in PLA-HYB:

Quotient between the execution time with the parameters from the Installation Routine and the optimum execution time

[Figure: quotient (0 to 2) against problem size (500 to 3000) for RI-THEO, RI-HOMO and RI-HETE]

Application Example: Executing

[Figure: quotient (0 to 2) against problem size (500 to 3000) for RI-THEO, RI-HOMO and RI-HETE]

Experimental results in PLA-HET:

Quotient between the execution time with the parameters from the Installation Routine and the optimum execution time

Application Example: Executing

Two techniques for automatic tuning of Parallel Linear Algebra Routines:

1. Analytical Modelling: for predictable systems (homogeneous, static, ...) like the Origin 2000

2. Exhaustive Execution: for less predictable systems (heterogeneous, dynamic, ...) like networks of workstations

Both are transparent to the user, with execution close to the optimum.

Comparison

Outline

Current Situation of Linear Algebra Parallel Routines (LAPRs)

Objective

Approach I: Analytical Model of the LAPRs

Application: Jacobi Method on Origin 2000

Approach II: Exhaustive Executions

Application: Gauss elimination on networks of processors

Validation with the LU factorization

Conclusions

Future Works

To validate the methodology it is necessary to experiment with:

More routines: block LU factorization

More systems:

Architectures: IBM SP2 and Origin 2000

Libraries: reference BLAS, machine BLAS, ATLAS

Validation with the LU factorization

Sequential LU

tari = a term of order 2n³/3 in k3, plus lower-order terms in k2 and k1 that depend on the block size b

Analytical Model: Texec = f(SPs, n, APs)

SPs: cost of arithmetic operations of different levels: k1, k2, k3

APs: block size b

[Diagram: blocked LU factorization with sub-blocks labelled LU, ES and UM, and block size b]
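With the three arithmetic SPs measured, choosing the block size b amounts to minimising the modelled time over the candidate sizes. The sketch below is illustrative: the per-block k3 values reuse the measurements shown in the earlier installation slide, while the overhead term in K1 is an assumed stand-in for the model's lower-order terms, not the slide's formula.

```python
# Illustrative block-size selection for the sequential blocked LU.
# K3 per block size: measured values from the installation slide;
# K1 overhead term: an assumption standing in for the lower-order terms.
K3 = {32: 0.005e-6, 64: 0.004e-6, 128: 0.003e-6, 256: 0.003e-6}  # s/flop
K1 = 0.01e-6

def lu_time(n, b):
    """Modelled time: 2n^3/3 flops at rate K3[b], plus block overhead."""
    return K3[b] * 2 * n**3 / 3 + K1 * n * b**2 / 2

def best_block(n):
    """Return the candidate block size that minimises the modelled time."""
    return min(K3, key=lambda b: lu_time(n, b))
```

The trade-off the model captures is that larger blocks make BLAS faster per flop but cost more in overhead, so the best b grows with the problem size.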

Quotient between different execution times and the optimum execution time

Sequential LU. Comparison in IBM SP2

[Figure: quotient (0 to 1.4) against problem size (512 to 2560) for the modelled, weighted and LAPACK block sizes]

Quotient between the execution time with the parameters provided by the model and the optimum execution time, with different basic libraries, in SUN 1

Sequential LU. Model execution time/optimum execution time

[Figure: quotient (0 to 1.4) against problem size (256 to 1536) for reference BLAS, machine BLAS and ATLAS]

Parallel LU

tari = a term of order 2n³/(3p) in k3, plus lower-order terms in k2 and k1 that depend on the block size b and the grid configuration r × c

Analytical Model: Texec= f (SPs,n,APs)

SPs: cost of arithmetic operations: k1, k2, k3

cost of communications: ts, tw

APs: block size b,

number of processors p,

grid configuration r × c

[Diagram: 2D block-cyclic distribution of blocks of size b over a 2 × 3 grid of processors (labelled 00 to 12)]

Quotient between the execution time with the parameters provided by the model and the optimum execution time, in the sequential case and in parallel with 4 and 8 processors

Parallel LU. Comparison in IBM SP2

[Figure: quotient (0 to 2.5) against problem size (512 to 3584) for SEQ, PAR4 and PAR8]

Quotient between the execution time with the parameters provided by the model and the optimum execution time, in the sequential case and in parallel with 4 and 8 processors

Parallel LU. Comparison in Origin 2000

[Figure: quotient (0 to 1.4) against problem size (512 to 3584) for SEQ, PAR4 and PAR8]

The modelling of the algorithm provides satisfactory results in different systems (Origin 2000, IBM SP2) and with different basic libraries (reference BLAS, machine BLAS, ATLAS). The prediction is worse in some cases:

when the number of processors increases

in multicomputers where communications are more important (IBM SP2)

Exhaustive Executions

Parallel LU. Conclusions

If the manager installs the routine for sizes 512, 1536 and 2560, and executions are performed for sizes 1024, 2048 and 3072, the execution time is still well predicted.

The same policy can be used in the installation of other software:

Quotient between the execution time with the parameters provided by the installation process and the optimum execution time, with ScaLAPACK, in IBM SP2

Parallel LU. Exhaustive Execution

[Figure: quotient (0 to 1.6) for problem sizes 1024, 2048 and 3072, with 4 and 8 processors]

Parameterisation of Parallel Linear Algebra Routines enables development of Automatically Tuned Software

Two techniques can be used: Analytical Modelling, Exhaustive Executions, or a combination of both

Experiments performed in different systems and with different routines

Conclusions

We try to develop a methodology valid for a wide range of systems, and to include it in the design of linear algebra libraries:

it is necessary to analyse the methodology in more systems and with more routines

Architecture of an Automatically Tuned Linear Algebra Library

At the moment we are analysing routines individually, but it could be preferable to analyse algorithmic schemes

Future Works

Architecture of an Automatically Tuned Linear Algebra Library

[Diagram: the designer provides the Installation routines and the Basic routines library; the manager adds the Basic routines declaration and runs the Installation, producing the Installation file and the SP and AP files; a Compilation step then builds the Library]

