TRANSCRIPT
Use of computing resources at LIP
Journée recherche informatique et calcul, LIP, September 2015
- LIP: V. Lefèvre, F. Vivien
- Crunch: C. Alias, J-Y. L'Excellent, C. Pernet, M. Rao, N. Revol
- G5K: D. Balouek, H. Coulon
- Other platforms: M. Assuncao, R. Guillier
Worst-case searches
Vincent LEFÈVRE
AriC, INRIA Grenoble – Rhône-Alpes / LIP, ENS-Lyon
Journée calcul du LIP, 2015-09-29
[lip-calcul2015.tex 82789 2015-09-28 10:43:11Z vinc17/zira]
Correct rounding of mathematical functions: worst cases
Given: the binary64 (double-precision) floating-point system, and a mathematical function f of one argument (exp, log, sin, cos, etc.).
Goal: find the hardest-to-round cases, i.e. arguments x such that f(x) is as close as possible to a discontinuity point of the rounding function (for rounding to nearest, the midpoint of two consecutive machine numbers).
[Figure: consecutive machine numbers x_k, x_{k+1}, x_{k+2}, x_{k+3} on an axis; the computed value y' approximates the exact value y: should it be rounded to x_{k+1} or x_{k+2}?]
Example: tanh(7533044902264516/2^55) = 1.101...1000 (53 bits), followed by 159 zeros, then 0100..., times 2^(-3).
Algorithms: approximations by degree-1 polynomials, sub-linear test.
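The real search uses degree-1 polynomial approximations and a sub-linear test over roughly 2^60 inputs; purely as a toy illustration, a brute-force scan can be sketched in a hypothetical reduced-precision format (the significand width P and the target function exp are arbitrary choices, not the actual setup):

```python
import math

# Toy worst-case search. Assumption: a reduced P-bit significand instead of
# binary64's 53 bits, so that an exhaustive scan stays cheap.
P = 12

def badness(y):
    """How close y's significand is to a rounding breakpoint (the midpoint
    between two consecutive P-bit numbers): 0 would be exactly a midpoint,
    i.e. a hardest-to-round case for rounding to nearest."""
    m, _ = math.frexp(y)        # y = m * 2**e, with 0.5 <= m < 1
    s = m * 2**P                # significand scaled to [2**(P-1), 2**P)
    return abs(s - math.floor(s) - 0.5)

# Scan every P-bit input x in [1, 2) and keep the hardest case for exp.
worst = min((badness(math.exp(1.0 + k / 2 ** (P - 1))), 1.0 + k / 2 ** (P - 1))
            for k in range(2 ** (P - 1)))
```

The point of the sub-linear test in the actual work is precisely to avoid evaluating f at each of the 2^60 inputs as this sketch does.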
Vincent LEFÈVRE (INRIA / LIP, ENS-Lyon) Recherches de pires cas Journée calcul du LIP, 2015-09-29 2 / 4
Correct rounding of mathematical functions: worst cases [2]
Problem: many inputs (~2^60), hence a lot of computation.
Parallelization by splitting the domain into intervals.
Software used:
- Perl: client-server system, process management, C code generation.
- Maple + intpakX: multiple-precision interval arithmetic to compute the inputs of the C code (fixed-point coefficients of "large-degree" polynomials, from which the degree-1 polynomials are obtained). To be replaced by GNU MPFR.
- GCC and GMP (mpn layer).
- Results stored in files, read over NFS.
Computing resources:
- MI-LIP machines: jobs launched via SGE (low-priority queue).
- Possibly other machines.
Note on SGE: problem if SIGUSR1 → SIGSTOP → SIGUSR2 → SIGKILL. The SIGUSR2 is never received, because the process is still stopped when it is killed!
Multiplication by a constant: worst cases too
Multiplication by an integer constant: generate code made of additions, subtractions and shifts (×2^k), as short as possible.
Various algorithms exist.
Mine: based on common patterns in the base-2 representation of the constant (in fact, after Booth recoding, i.e. with signed digits -1, 0 and 1).
Implemented in Perl (a research prototype whose original main use was not heavy computation).
For comparison with other algorithms: search for the smallest constant whose generated code has size (#add + #sub) ≥ q.
Originally not parallelized (this was not the main problem).
Made multithreaded for this search (a simple modification).
Run on the machine pomme (48 cores), which was otherwise completely idle, from August 13 to September 8, 2015.
→ About 23 days of (wall-clock) computation for constants up to 37 bits.
→ Minimal constants for values of q up to 13.
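The actual algorithm exploits common patterns across the recoded constant; as background only, here is a sketch of the Booth-style signed-digit recoding itself (non-adjacent form) and of the naive shift-and-add cost that pattern-based methods improve on:

```python
def naf(n):
    """Non-adjacent form (Booth-style signed-digit recoding) of n >= 0,
    least-significant digit first, digits in {-1, 0, 1}."""
    digits = []
    while n > 0:
        if n & 1:
            d = 2 - (n & 3)        # -1 if n % 4 == 3, else +1
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits

def shift_add_cost(n):
    """Adds/subtracts used by a naive shift-and-add over the NAF:
    one operation per nonzero digit beyond the first."""
    return sum(1 for d in naf(n) if d) - 1

# Example: 7 = 8 - 1 has NAF digits [-1, 0, 0, 1], so one subtraction
# (x << 3) - x instead of the two additions needed by plain binary 111.
```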
Use of computing resources by ROMA, at LIP and elsewhere
Frédéric Vivien
September 29, 2015
LIP machines
Compute servers
- Experiments without execution-time measurements; sequential codes
Compute farm
- Parameter-sweep simulations; large numbers of jobs; sequential programs; sometimes long runtimes (≥ 24 h)
Crunch*
- Development; debugging; first tests and first performance evaluations; MPI + OpenMP
Machines outside LIP
Idriss
- MPI + OpenMP code
- Cluster with 2048 usable cores: experiments and performance evaluation on real data; more easily accessible than PSMN (impossible to get 5-minute runs on many cores without waiting a week)
- Turing cluster: larger, to handle larger data sets
Cluster, Knoxville, TN
- Distributed-memory MPI + PaRSEC code: PaRSEC was already installed
Compute server, Bordeaux
- Shared-memory StarPU code: StarPU was already installed
Scientific context: sparse systems of equations
Example:
    [ 3  2  0 ] [x1]   [5]
    [ 0  2 -5 ] [x2] = [1]
    [ 2  0  3 ] [x3]   [0]
or Ax = b
Direct method: A = LU (or LL^T or LDL^T), then solve Ly = b, Ux = y
Large matrices → heavy computations → HPC resources
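The direct method above can be illustrated on the tiny example system; this is only a dense, pivoting-free Doolittle sketch, nothing like what a real sparse solver such as MUMPS does (sparsity, pivoting, multifrontal parallelism):

```python
def lu_solve(A, b):
    """LU factorization without pivoting, then forward substitution
    (Ly = b) and backward substitution (Ux = y)."""
    n = len(A)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    U = [row[:] for row in A]
    for k in range(n):                       # eliminate column k
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]
    y = [0.0] * n
    for i in range(n):                       # Ly = b
        y[i] = b[i] - sum(L[i][j] * y[j] for j in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):             # Ux = y
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x

A = [[3.0, 2.0, 0.0], [0.0, 2.0, -5.0], [2.0, 0.0, 3.0]]
b = [5.0, 1.0, 0.0]
x = lu_solve(A, b)
```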
MUMPS (MUltifrontal Massively Parallel Solver),
http://mumps-solver.org
MUMPS 5.0.1, Jul. 2015, CeCILL-C license, 250,000 lines of Fortran/C
INP Toulouse, Inria, CERFACS, Université de Bordeaux, ENS Lyon, CNRS
Objectives: scheduling for memory/performance, mapping of irregular data structures, multithreading, numerical features and accuracy, low-rank compression to reduce complexity, performance of the solve phase, sparsity of right-hand sides
Consortium of MUMPS users (http://mumps-consortium.org):
EDF, Michelin, Altair, LSTC, Siemens, ESI Group, Total
Usage of computing resources
crunch1, crunch2, grunch
Validation of research: numerical behaviour, complexity, performance
Performance study and optimization: sequential/MPI/multithreaded, memory/time/accuracy
Test cases from our users: reproduce and study "interesting" algorithmic behaviour
Debugging: cases requiring significant memory
Other resources
Development of new features and first tests: laptops
Non-regression tests: LIP servers, IRIT servers (Toulouse)
Experimentation, performance study: CALMIP (Toulouse), GENCI/IDRIS, PSMN, hidalgo
Software used
Fortran/C compilers: ifort, icc, gfortran, gcc
Parallelism: MPI (Open MPI, Intel MPI, ...), OpenMP
Libraries: BLAS (MKL, ...), ScaLAPACK, METIS, SCOTCH
Performance analysis: traces, "factvis", htop, itac, vtune
Debugging: mpirun -np 20 xterm -e 'gdb ./executable', valgrind
Other: numactl, hwloc, libnuma, ...
Needs
Experimentation on recent servers, administered close to us, with recent versions of compilers, libraries and software
Training and help on the usage of performance analysis tools (vtune, likwid, papi, ...)
Running relatively short jobs on relatively large amounts of resources, with exclusive node access
2 or 3 machines with exotic systems? (FreeBSD, IBM)
High-performance computer algebra: use case of the LIP resources
Journée Informatique et Calcul
Clément Pernet
LIP, Univ. Grenoble Alpes
September 29, 2015
C. Pernet High Performance Computer Algebra 29 Septembre 2015 1 / 4
Use case of the LIP computing resources: computer algebra
Computer algebra:
- compute exactly (over Q, Z, F_{p^e})
- Crypto applications: integer factorization, breaking discrete logs.
- Experimental maths: testing conjectures, databases of modular forms.
Exact linear algebra: the kernel for high performance
LinBox, fflas-ffpack: linear algebra libraries
- Dense, Sparse, Blackbox
- solve, rank, det, charpoly, Smith form, etc.
[Diagram: software stack with GAP (among others) on top of LinBox, which builds on NTL and FFLAS-FFPACK, which build on BLAS/LAPACK]
ANR HPAC: developing parallel linear algebra libraries
Ziad Sultan's PhD thesis: parallel exact Gaussian elimination
- New block recursive algorithm
- Task-based parallelization (with data-flow dependencies)
- Parallel runtime as a plugin: a DSL to interface OpenMP, Cilk, TBB.
[Figure: Gflop/s vs. matrix dimension (up to 30000), parallel PLUQ over double on full-rank matrices on 32 cores; curves: FFLAS-FFPACK<double> explicit sync, FFLAS-FFPACK<double> dataflow sync, Intel MKL dgetrf, PLASMA-Quark dgetrf tiled storage (k=212), PLASMA-Quark dgetrf (k=212)]
[Figure: Gflop/s vs. matrix dimension (up to 30000), parallel PLUQ over Z/131071Z on full-rank matrices on 32 cores; curves: Tiled Rec explicit sync, Tiled Rec dataflow sync, Tiled Iter explicit sync, Tiled Iter dataflow sync]
ANR HPAC: attacking crypto challenges
In collaboration with LIP6-PolSys, LIRMM-ECO, LJK-CASYS:
Breaking discrete logarithm in medium characteristic
- Find x given c^x ∈ F_{p^k}, where p ≈ 52 bits.
- Recent algorithmic advances in theory [Joux et al.]
- A few low-hanging fruits in practice
Algorithm in brief:
1. Collect relations
2. Find a non-zero vector in the null-space of a huge sparse matrix:
   1. millions of sparse matrix-vector products
   2. matrix minpoly (dense)
3. Recover x
Will book crunch2 for 2-3 months to compute the matrix-vector products
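The dominant cost booked on crunch2 is step 2.1, repeated sparse matrix-vector products over F_p. A minimal sketch of one such product, with a hypothetical tiny matrix and modulus (the real matrix is huge and p is a prime of about 52 bits):

```python
p = 131071  # hypothetical small prime modulus, for illustration only

# Each row stored as a list of (column, value) pairs: only nonzeros kept.
rows = [[(0, 3), (2, 5)],
        [(1, 7)],
        [(0, 2), (1, 1), (2, 4)]]

def spmv(rows, v, p):
    """One sparse matrix-vector product modulo p."""
    return [sum(val * v[col] for col, val in row) % p for row in rows]

v = [1, 2, 3]
w = spmv(rows, v, p)   # [3*1 + 5*3, 7*2, 2*1 + 1*2 + 4*3] mod p
```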
An aperiodic set of 11 Wang tiles
Emmanuel Jeandel (1), Michaël Rao (2)
(1) LORIA, Nancy; (2) MC2, LIP, Lyon
E. Jeandel, M. Rao: An aperiodic set of 11 Wang tiles
Using computers in combinatorics
Find "mathematical objects" with certain properties
Prove theorems that require a large case analysis
Examples:
Combinatorics on words: find a construction of an infinite word avoiding some patterns (search for a morphism), or show that none exists (exhaustively)
Graphs: proof of the non-existence of a planar graph with a certain property (e.g. the four-color theorem)
Tilings: Wang tiles
Tilings by Wang tiles
A Wang tile is a square tile with colors on its edges
Given a tile set, we try to tile the plane with copies of the tiles, such that two adjacent edges have the same color
We look for tile sets that tile the plane, but only aperiodically
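The actual Jeandel-Rao search relies on far more sophisticated techniques; purely as an illustration of the periodic side of the question, a naive backtracking check for a k-by-k toroidal tiling (which induces a periodic tiling of the plane) might look like:

```python
def tiles_kxk_periodically(tiles, k):
    """Brute-force check: does the tile set admit a k-by-k tiling of the
    torus (hence a periodic tiling of the plane)?  Each tile is a
    (top, right, bottom, left) tuple of edge colors."""
    grid = [None] * (k * k)

    def fill(pos):
        if pos == k * k:
            return True
        i, j = divmod(pos, k)
        for t in tiles:
            grid[pos] = t
            ok = True
            if j > 0:                       # left neighbor already placed
                ok = ok and grid[pos - 1][1] == t[3]
            if i > 0:                       # top neighbor already placed
                ok = ok and grid[pos - k][2] == t[0]
            if j == k - 1:                  # wrap around to column 0
                ok = ok and t[1] == grid[i * k][3]
            if i == k - 1:                  # wrap around to row 0
                ok = ok and t[2] == grid[j][0]
            if ok and fill(pos + 1):
                return True
            grid[pos] = None
        return False

    return fill(0)
```

An aperiodicity proof must show that no such k exists for any k, which is exactly why these computations are so heavy.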
Aperiodic Wang tilings: history
Berger: 20426 tiles in 1966 (later brought down to 104)
Knuth: 92 tiles in 1968
Robinson: 56 tiles in 1971
Ammann: 16 tiles in 1971
Grünbaum: 24 tiles in 1987
Kari and Culik: 14, then 13 tiles in 1996
Jeandel and Rao: 11 tiles, the smallest possible (2014-2015)
Aperiodic Wang tilings: results
Theorem: every set of n Wang tiles, n ≤ 10, is finite or periodic
About 4 days on the LIP cluster
A single tile set caused trouble, but it was proved finite after a more thorough (human + computer) study
Theorem: there exists an aperiodic set of 11 Wang tiles
About 1 year on the LIP cluster + a few months on PSMN
Hundreds of hard tile sets require longer computation (lots of RAM: crunchs/grunch)
Currently: still 34 problematic tile sets
But: two of them have been proved aperiodic
Conjecture: 2 other sets are aperiodic
[Figure: the 11 Wang tiles of the aperiodic set, with their edge colors (digits 0 to 3)]
Multiplication of Interval Matrices on Multicores
Philippe Théveny and Nathalie Revol, ENS de Lyon and INRIA, AriC team
LIP, ENS de Lyon, Université de Lyon
Journée Calcul
September 29, 2015
Philippe Theveny and Nathalie Revol Multiplication of Interval Matrices on Multicores
How to Implement Parallel Interval Matrix Multiplication?
At the intersection of:
- Certified computing
- Interval numerical linear algebra: matrix-matrix multiplication
- Parallel computing: on a multi-core processor
Interval Matrices
Representation by endpoints:
    [ [1,3]    [0,4]    [-1,5] ]
    [ [-4,0]   [2,4]    [0,6]  ]
    [ [-5,1]   [-1,1]   [3,7]  ]
Midpoint-radius representation:
    [ <2,1>    <2,2>    <3,3> ]
    [ <-2,2>   <3,1>    <1,4> ]
    [ <-3,3>   <0,1>    <5,2> ]
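Converting between the two representations is straightforward in exact arithmetic; this sketch ignores rounding, whereas a floating-point implementation must round the radius upward to preserve the enclosure:

```python
def to_midrad(lo, hi):
    """Endpoints [lo, hi] -> midpoint-radius <m, r>."""
    m = (lo + hi) / 2
    return m, hi - m

def to_endpoints(m, r):
    """Midpoint-radius <m, r> -> endpoints [m - r, m + r]."""
    return m - r, m + r
```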
C ⊃ A × B using the endpoint representation
    C_ij = [0, 0]
    for (i = 0; i < n; i++)
      for (j = 0; j < n; j++)
        for (k = 0; k < n; k++)
          C_ij += A_ik * B_kj, i.e.
            lower bound: Clo_ij = RD(Clo_ij + min(RD(Alo_ik*Blo_kj), RD(Alo_ik*Bhi_kj), RD(Ahi_ik*Blo_kj), RD(Ahi_ik*Bhi_kj)))
            upper bound: Chi_ij = RU(Chi_ij + max(RU(Alo_ik*Blo_kj), RU(Alo_ik*Bhi_kj), RU(Ahi_ik*Blo_kj), RU(Ahi_ik*Bhi_kj)))
- accumulation with min and max: cannot use optimized BLAS routines
- changing the rounding mode is slow
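The min/max over the four cross products can be shown in exact arithmetic, with Python's `Fraction` standing in for the directed roundings RD/RU that this sketch deliberately leaves out:

```python
from fractions import Fraction

def interval_mul(x, y):
    """Endpoint product [x] * [y]: the result interval is the min and max
    over the four products of endpoints.  Exact rationals replace the
    slide's downward/upward roundings."""
    products = [Fraction(a) * Fraction(b) for a in x for b in y]
    return [min(products), max(products)]
```

The min/max is exactly what prevents delegating the inner loop to an optimized floating-point BLAS.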
C ⊃ A × B using the midpoint-radius representation
    <Cm, Cr> = < RN(Am × Bm),
                 RU(Ar × (|Bm| + Br) + |Am| × (Br + (n+2)/2 ulp(|Bm|)) + realmin) >
- needs 3 floating-point matrix products, but
- cannot use optimized BLAS routines because of the directed rounding mode
- solution: implement a custom interval BLAS kernel with blocking, loop unrolling, vectorization, and multithreading
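Stripped of the floating-point correction terms ((n+2)/2 ulp|Bm| and realmin, which only compensate for rounding), the midpoint-radius product of two scalar intervals reduces to the following exact-arithmetic sketch:

```python
def mr_mul_scalar(am, ar, bm, br):
    """Midpoint-radius product <am, ar> * <bm, br> in exact arithmetic:
    Cm = am*bm and Cr = ar*(|bm| + br) + |am|*br, which over-estimates
    the tight radius |am|*br + |bm|*ar + ar*br and so keeps enclosure."""
    cm = am * bm
    cr = ar * (abs(bm) + br) + abs(am) * br
    return cm, cr

# e.g. <2,1> * <3,1>: the endpoint product [1,3] * [2,4] = [2,12]
# is contained in <6, 6> = [0, 12].
```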
Interval MM on one core
Compiler: ICC seems to respect the rounding mode (with -fp-model strict); GCC does not when optimizations are enabled.
Manual vectorization and manual unrolling.
Block size: 32, to make good use of the L1 cache.
Performance on crunch
[Figure: two plots. Left: time ratio interval/MKL vs. matrix size (up to 5000), ranging roughly from 1 to 4. Right: parallel efficiency vs. number of threads (1 to 32) for MKL and for the interval kernel.]
Journée Recherche Informatique et Calcul du LIP
Daniel Balouek-Thomert, Avalon team
September 29, 2015
Context
Growing needs of computational resources. Increasing numbers of:
- servers
- datacenters
Energy consumption is a main concern!
Energy efficiency through scheduling. Application domain: cloud computing. Issues:
- Management of large numbers of virtual machines
- Optimal placement at any time
- Dealing with fluctuating workloads
- Respecting quality of service
Large Scale Experiments
Management of a virtualized infrastructure. Complexity:
- Multi-criteria scheduling
- Prediction
Requirements:
- Trade-offs between performance and energy savings
- Satisfying all actors (administrators, end users, ...)
Validation:
- Real-life testbeds
- Production traces
Dimensioning Experiments over Grid'5000
Why are we using Grid'5000?
- Large number of physical nodes (hardware heterogeneity)
- Customizable operating systems in administrator mode
- Network isolation (avoiding perturbation from other users)
- Energy monitoring capabilities (node sensors)
Recent experiments over Grid'5000:
- Large-scale validation: scheduling of 1,000,000+ jobs from production traces
- Industrial transfer: implementation of a federation of small datacenters for a French sovereign cloud computing service
- Overcoming technical locks: easy and reproducible start of 10,000+ virtual machines
Lessons are learned in practical large-scale experiments
Retrieving the energy consumption of a Grid'5000 job
A key point: separating experimental purposes from metrology concerns
Experiment scenario:
- Finding and booking available nodes
- Configuring nodes and installing dependencies
- Running an experiment
Monitoring is performed seamlessly for the end user
Figure: Grid'5000 live visualisation tool
- Data available on demand through a web API
- Post-experiment analysis of monitoring logs
- Live visualisation over the last {minute, hour, day}
Tools
Grid'5000 dedicated tools:
OAR: finding and booking available nodes
Kadeploy: cloning, configuring (post-installation) and managing cluster nodes
Execo: prototyping experiments on distributed systems
Kavlan: level-2 network isolation from other experiments
VM5K: easy and reproducible deployment of virtual machines
Platform-independent tools:
DIET: middleware for high-performance computing in distributed environments
Kwapi: driver-based monitoring software
OpenStack: controls pools of compute, storage and network resources
Daniel Balouek-Thomert (PhD), Eddy Caron (MCF), Laurent Lefevre (CR)
Journée Recherche Informatique et Calcul du LIP: From Grid'5000 to Curie
Hélène Coullon
Post-doc INRIA, Avalon team
September 29, 2015
H. Coullon (INRIA) Calcul LIP 29th September 2015 1 / 8
Table of contents
1 Scientific Context
2 Performance Metrics
3 From Grid'5000 to Curie, Why?
4 In practice? Tools and libraries
Scientific Context
HPC in 2015:
- Programming models: BSP, PGAS, APGAS, OpenMP, OpenCL, ...
- Execution models: MPI, pthreads, CUDA, task scheduling, ...
- Hardware: clusters, multi-cores, GPGPUs, many-cores, ...
[Diagram: the stack spans an axis from SIMPLICITY at the top to PERFORMANCE at the bottom]
What about MAINTAINABILITY, PRODUCTIVITY, PORTABILITY?
Scientific Context
Component-Based Software Engineering (CBSE):
- Code reuse (productivity)
- Maintainability
- Portability
Toward CBSE in HPC!
Avalon team: Christian Perez (DR), Vincent Lanore (PhD), Jérôme Richard (PhD) and Hélène Coullon (postdoc)
Performance Metrics
Considering a parallel scientific application P, how do we evaluate performance?
Flop/s: number of floating-point operations per second; comparison with some references.
Strong speedup: S(n) = T_1(reference) / T_n(P), where
- T_i is the execution time using i cores (processors)
- the reference is a time-reference application (ideally the best sequential time)
On homogeneous machines, aim to be close to the linear curve y = x.
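The two quantities above can be computed directly from timings; the timings in the usage comment are hypothetical, for illustration only:

```python
def strong_speedup(t_ref, t_n):
    """S(n) = T_1(reference) / T_n(P)."""
    return t_ref / t_n

def efficiency(t_ref, t_n, n):
    """Speedup normalized by core count; 1.0 is the linear ideal
    on a homogeneous machine."""
    return strong_speedup(t_ref, t_n) / n

# e.g. a 100 s sequential reference that finishes in 25 s on 8 cores:
# speedup 4, efficiency 0.5 (half the linear ideal)
```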
From Grid'5000 to Curie, Why?
Scalability on clusters! (homogeneous architectures)
Flop/s:
- G5K taurus cluster (Lyon site) peak: 2.3 TFlop/s
- Curie fat nodes peak: 105 TFlop/s
Speedup:
- G5K Paravance cluster (Rennes site), number of cores: 1152
- Curie thin nodes, number of cores: 80,640
In practice? Tools and libraries
Good tools of G5K:
- Execo g5k
- Kadeploy
Not on Curie! Slurm instead of OAR.
Libraries that I use:
- Already available: MPI, OpenMP, pthread, ...
- Locally and manually installed:
  - HPC libraries: MadMPI (thread-safe MPI), StarPU
  - CBSE libraries: L2C and its dependencies
Thank You !
ARM and NetFPGA Platform
● Context: investigation and design of energy-efficient routing algorithms for core networks
● Requirements: infrastructure for prototyping and evaluating both hardware-assisted data planes and software solutions
– Raspberry Pis for generating background traffic
– NetFPGA cards for building switch prototypes
Technical Details
● 8 Raspberry Pi 2 Model B
● File system:
– SD cards with a full Raspbian install
– File system overlaid with NFS volumes (aufs)
● Environment deployment "à la Kadeploy 3": kaenv3, kadeploy3, etc.
● 5 NetFPGA cards with 4x10 Gbps Ethernet
– http://netfpga.org/site/#/systems/3netfpga-10g/details/
E-Biothon: a platform for BioInformatics
● Context:
– Large amounts of data produced by sequencers
– Requires large amounts of computing
– Requires efficient software
● Goals:
– Help researchers build tomorrow's algorithms
– Give easy access to existing applications
Architecture
BlueGene/P (hosted at IDRIS):
* 56 TFlops peak performance
* 4 racks of 1024 CPUs with 4 cores each => 16,384 cores
* 200 TB of storage
Applications available: PhyML (phylogeny), NAMD, LAMMPS, GROMACS (molecular dynamics)
Usage Stats
[Charts: users' nationalities (counts from 0 to 30): South Africa, Germany, England, Argentina, Australia, Austria, Belgium, Brazil, Canada, Chile, China, Denmark, Egypt, Spain, France, Greece, Guatemala, Hungary, India, Iran, Ireland, Italy, Japan, Lithuania, Mexico, Poland, Portugal, Romania, Russia, Czech Republic, Slovenia, Switzerland, Sweden, Thailand, USA, Uruguay; plus a chart of laboratory origin]
User Results
● Biodiversiton
– Alain Franc (INRA, UMR BioGeCo / Inria, Pleiade team)
– Disseq software (distance comparison of sequences from an environmental sample)
● Insyght
– Jean-François Gibrat (INRA, UR1404, Unité Mathématiques et Informatique Appliquées du Génome à l'Environnement)
– Insyght platform (http://genome.jouy.inra.fr/Insyght/): comparison of the genomic organization of homologous genes in different bacterial genomes
● COMMA
– Yasaman Karami (Unité de Biologie Computationnelle et Quantitative, UMR 7238 CNRS-UPMC)
– Communication Mapping (COMMA) identifies the dynamical architecture of proteins from all-atom molecular dynamics (MD) simulations in explicit solvent
● CXCR4
– Bruck Taddese, Marie Chabbert (Laboratoire BNMI, UMR CNRS 6214 / INSERM U1083, Faculté de médecine d'Angers)
– Detailed analysis of the dynamic properties of the CXC chemokine receptor 4 (CXCR4)