TRANSCRIPT
Use of computing resources at LIP
Journée recherche informatique et calcul, LIP, September 2015
- LIP: V. Lefèvre, F. Vivien
- Crunch: C. Alias, J-Y. L'Excellent, C. Pernet, M. Rao, N. Revol
- G5K: D. Balouek, H. Coulon
- Other platforms: M. Assuncao, R. Guillier
Worst-case searches
Vincent LEFÈVRE
AriC, INRIA Grenoble – Rhône-Alpes / LIP, ENS-Lyon
Journée calcul du LIP, 2015-09-29
[lip-calcul2015.tex 82789 2015-09-28 10:43:11Z vinc17/zira]
Correct rounding of mathematical functions: worst cases
Given: the binary64 (double-precision) floating-point system, and a mathematical function f of one argument (exp, log, sin, cos, etc.).
Goal: find the hardest-to-round cases, i.e. arguments x such that f(x) is as close as possible to a discontinuity point of the rounding function (for rounding to nearest, the midpoint of two consecutive machine numbers).
[Figure: consecutive machine numbers x_k, x_{k+1}, x_{k+2}, x_{k+3} on an axis; the computed value y' approximates the exact value y: should it be rounded to x_{k+1} or x_{k+2}?]
Example: tanh(7533044902264516/2^55) = 1.101...1000 (53 bits), followed by 159 zeros, then 0100..., times 2^(-3).
Algorithms: approximations by degree-1 polynomials, sub-linear test.
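The real search uses degree-1 polynomial approximations and a sub-linear test over roughly 2^60 inputs; purely as a toy illustration, a brute-force scan can be sketched in a hypothetical reduced-precision format (the significand width P and the target function exp are arbitrary choices, not the actual setup):

```python
import math

# Toy worst-case search. Assumption: a reduced P-bit significand instead of
# binary64's 53 bits, so that an exhaustive scan stays cheap.
P = 12

def badness(y):
    """How close y's significand is to a rounding breakpoint (the midpoint
    between two consecutive P-bit numbers): 0 would be exactly a midpoint,
    i.e. a hardest-to-round case for rounding to nearest."""
    m, _ = math.frexp(y)        # y = m * 2**e, with 0.5 <= m < 1
    s = m * 2**P                # significand scaled to [2**(P-1), 2**P)
    return abs(s - math.floor(s) - 0.5)

# Scan every P-bit input x in [1, 2) and keep the hardest case for exp.
worst = min((badness(math.exp(1.0 + k / 2 ** (P - 1))), 1.0 + k / 2 ** (P - 1))
            for k in range(2 ** (P - 1)))
```

The point of the sub-linear test in the actual work is precisely to avoid evaluating f at each of the 2^60 inputs as this sketch does.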
Vincent LEFÈVRE (INRIA / LIP, ENS-Lyon) Recherches de pires cas Journée calcul du LIP, 2015-09-29 2 / 4
Correct rounding of mathematical functions: worst cases [2]
Problem: many inputs (~2^60), hence a lot of computation.
Parallelization by splitting the domain into intervals.
Software used:
- Perl: client-server system, process management, C code generation.
- Maple + intpakX: multiple-precision interval arithmetic to compute the inputs of the C code (fixed-point coefficients of "large-degree" polynomials, from which the degree-1 polynomials are obtained). To be replaced by GNU MPFR.
- GCC and GMP (mpn layer).
- Results stored in files, read over NFS.
Computing resources:
- MI-LIP machines: jobs launched via SGE (low-priority queue).
- Possibly other machines.
Note on SGE: problem if SIGUSR1 → SIGSTOP → SIGUSR2 → SIGKILL. The SIGUSR2 is never received, because the process is still stopped when it is killed!
Multiplication by a constant: worst cases too
Multiplication by an integer constant: generate code made of additions, subtractions and shifts (×2^k), as short as possible.
Various algorithms exist.
Mine: based on common patterns in the base-2 representation of the constant (in fact, after Booth recoding, i.e. with signed digits -1, 0 and 1).
Implemented in Perl (a research prototype whose original main use was not heavy computation).
For comparison with other algorithms: search for the smallest constant whose generated code has size (#add + #sub) ≥ q.
Originally not parallelized (this was not the main problem).
Made multithreaded for this search (a simple modification).
Run on the machine pomme (48 cores), which was otherwise completely idle, from August 13 to September 8, 2015.
→ About 23 days of (wall-clock) computation for constants up to 37 bits.
→ Minimal constants for values of q up to 13.
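The actual algorithm exploits common patterns across the recoded constant; as background only, here is a sketch of the Booth-style signed-digit recoding itself (non-adjacent form) and of the naive shift-and-add cost that pattern-based methods improve on:

```python
def naf(n):
    """Non-adjacent form (Booth-style signed-digit recoding) of n >= 0,
    least-significant digit first, digits in {-1, 0, 1}."""
    digits = []
    while n > 0:
        if n & 1:
            d = 2 - (n & 3)        # -1 if n % 4 == 3, else +1
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits

def shift_add_cost(n):
    """Adds/subtracts used by a naive shift-and-add over the NAF:
    one operation per nonzero digit beyond the first."""
    return sum(1 for d in naf(n) if d) - 1

# Example: 7 = 8 - 1 has NAF digits [-1, 0, 0, 1], so one subtraction
# (x << 3) - x instead of the two additions needed by plain binary 111.
```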
Use of computing resources by ROMA, at LIP and elsewhere
Frédéric Vivien
September 29, 2015
LIP machines
Compute servers
- Experiments without execution-time measurements; sequential codes
Compute farm
- Parameter-sweep simulations; large numbers of jobs; sequential programs; sometimes long runtimes (≥ 24 h)
Crunch*
- Development; debugging; first tests and first performance evaluations; MPI + OpenMP
Machines outside LIP
Idriss
- MPI + OpenMP code
- Cluster with 2048 usable cores: experiments and performance evaluation on real data; more easily accessible than PSMN (impossible to get 5-minute runs on many cores without waiting a week)
- Turing cluster: larger, to handle larger data sets
Cluster, Knoxville, TN
- Distributed-memory MPI + PaRSEC code: PaRSEC was already installed
Compute server, Bordeaux
- Shared-memory StarPU code: StarPU was already installed
Scientific context: sparse systems of equations
Example:
    [ 3  2  0 ] [x1]   [5]
    [ 0  2 -5 ] [x2] = [1]
    [ 2  0  3 ] [x3]   [0]
or Ax = b
Direct method: A = LU (or LL^T or LDL^T), then solve Ly = b, Ux = y
Large matrices → heavy computations → HPC resources
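The direct method above can be illustrated on the tiny example system; this is only a dense, pivoting-free Doolittle sketch, nothing like what a real sparse solver such as MUMPS does (sparsity, pivoting, multifrontal parallelism):

```python
def lu_solve(A, b):
    """LU factorization without pivoting, then forward substitution
    (Ly = b) and backward substitution (Ux = y)."""
    n = len(A)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    U = [row[:] for row in A]
    for k in range(n):                       # eliminate column k
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]
    y = [0.0] * n
    for i in range(n):                       # Ly = b
        y[i] = b[i] - sum(L[i][j] * y[j] for j in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):             # Ux = y
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x

A = [[3.0, 2.0, 0.0], [0.0, 2.0, -5.0], [2.0, 0.0, 3.0]]
b = [5.0, 1.0, 0.0]
x = lu_solve(A, b)
```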
MUMPS (MUltifrontal Massively Parallel Solver),
http://mumps-solver.org
MUMPS 5.0.1, Jul. 2015, CeCILL-C license, 250,000 lines of Fortran/C
INP Toulouse, Inria, CERFACS, Université de Bordeaux, ENS Lyon, CNRS
Objectives: scheduling for memory/performance, mapping of irregular data structures, multithreading, numerical features and accuracy, low-rank compression to reduce complexity, performance of the solve phase, sparsity of right-hand sides
Consortium of MUMPS users (http://mumps-consortium.org):
EDF, Michelin, Altair, LSTC, Siemens, ESI Group, Total
Usage of computing resources
crunch1, crunch2, grunch
Validation of research: numerical behaviour, complexity, performance
Performance study and optimization: sequential/MPI/multithreaded, memory/time/accuracy
Test cases from our users: reproduce and study "interesting" algorithmic behaviour
Debugging: cases requiring significant memory
Other resources
Development of new features and first tests: laptops
Non-regression tests: LIP servers, IRIT servers (Toulouse)
Experimentation, performance study: CALMIP (Toulouse), GENCI/IDRIS, PSMN, hidalgo
Software used
Fortran/C compilers: ifort, icc, gfortran, gcc
Parallelism: MPI (Open MPI, Intel MPI, ...), OpenMP
Libraries: BLAS (MKL, ...), ScaLAPACK, METIS, SCOTCH
Performance analysis: traces, "factvis", htop, itac, vtune
Debugging: mpirun -np 20 xterm -e 'gdb ./executable', valgrind
Other: numactl, hwloc, libnuma, ...
Needs
Experimentation on recent servers, administered close to us, with recent versions of compilers, libraries and software
Training and help on the usage of performance analysis tools (vtune, likwid, papi, ...)
Running relatively short jobs on relatively large amounts of resources, with exclusive node access
2 or 3 machines with exotic systems? (FreeBSD, IBM)
High-performance computer algebra: use case of the LIP resources
Journée Informatique et Calcul
Clément Pernet
LIP, Univ. Grenoble Alpes
September 29, 2015
C. Pernet High Performance Computer Algebra 29 Septembre 2015 1 / 4
Use case of the LIP computing resources: computer algebra
Computer algebra:
- compute exactly (over Q, Z, F_{p^e})
- Crypto applications: integer factorization, breaking discrete logs.
- Experimental maths: testing conjectures, databases of modular forms.
Exact linear algebra: the kernel for high performance
LinBox, fflas-ffpack: linear algebra libraries
- Dense, Sparse, Blackbox
- solve, rank, det, charpoly, Smith form, etc.
[Diagram: software stack with GAP (among others) on top of LinBox, which builds on NTL and FFLAS-FFPACK, which build on BLAS/LAPACK]
ANR HPAC: developing parallel linear algebra libraries
Ziad Sultan's PhD thesis: parallel exact Gaussian elimination
- New block recursive algorithm
- Task-based parallelization (with data-flow dependencies)
- Parallel runtime as a plugin: a DSL to interface OpenMP, Cilk, TBB.
[Figure: Gflop/s vs. matrix dimension (up to 30000), parallel PLUQ over double on full-rank matrices on 32 cores; curves: FFLAS-FFPACK<double> explicit sync, FFLAS-FFPACK<double> dataflow sync, Intel MKL dgetrf, PLASMA-Quark dgetrf tiled storage (k=212), PLASMA-Quark dgetrf (k=212)]
[Figure: Gflop/s vs. matrix dimension (up to 30000), parallel PLUQ over Z/131071Z on full-rank matrices on 32 cores; curves: Tiled Rec explicit sync, Tiled Rec dataflow sync, Tiled Iter explicit sync, Tiled Iter dataflow sync]
ANR HPAC: attacking crypto challenges
In collaboration with LIP6-PolSys, LIRMM-ECO, LJK-CASYS:
Breaking discrete logarithm in medium characteristic
- Find x given c^x ∈ F_{p^k}, where p ≈ 52 bits.
- Recent algorithmic advances in theory [Joux et al.]
- A few low-hanging fruits in practice
Algorithm in brief:
1. Collect relations
2. Find a non-zero vector in the null-space of a huge sparse matrix:
   1. millions of sparse matrix-vector products
   2. matrix minpoly (dense)
3. Recover x
Will book crunch2 for 2-3 months to compute the matrix-vector products
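The dominant cost booked on crunch2 is step 2.1, repeated sparse matrix-vector products over F_p. A minimal sketch of one such product, with a hypothetical tiny matrix and modulus (the real matrix is huge and p is a prime of about 52 bits):

```python
p = 131071  # hypothetical small prime modulus, for illustration only

# Each row stored as a list of (column, value) pairs: only nonzeros kept.
rows = [[(0, 3), (2, 5)],
        [(1, 7)],
        [(0, 2), (1, 1), (2, 4)]]

def spmv(rows, v, p):
    """One sparse matrix-vector product modulo p."""
    return [sum(val * v[col] for col, val in row) % p for row in rows]

v = [1, 2, 3]
w = spmv(rows, v, p)   # [3*1 + 5*3, 7*2, 2*1 + 1*2 + 4*3] mod p
```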
An aperiodic set of 11 Wang tiles
Emmanuel Jeandel (1), Michaël Rao (2)
(1) LORIA, Nancy; (2) MC2, LIP, Lyon
E. Jeandel, M. Rao: An aperiodic set of 11 Wang tiles
Using computers in combinatorics
Find "mathematical objects" with certain properties
Prove theorems that require a large case analysis
Examples:
Combinatorics on words: find a construction of an infinite word avoiding some patterns (search for a morphism), or show that none exists (exhaustively)
Graphs: proof of the non-existence of a planar graph with a certain property (e.g. the four-color theorem)
Tilings: Wang tiles
Tilings by Wang tiles
A Wang tile is a square tile with colors on its edges
Given a tile set, we try to tile the plane with copies of the tiles, such that two adjacent edges have the same color
We look for tile sets that tile the plane, but only aperiodically
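The actual Jeandel-Rao search relies on far more sophisticated techniques; purely as an illustration of the periodic side of the question, a naive backtracking check for a k-by-k toroidal tiling (which induces a periodic tiling of the plane) might look like:

```python
def tiles_kxk_periodically(tiles, k):
    """Brute-force check: does the tile set admit a k-by-k tiling of the
    torus (hence a periodic tiling of the plane)?  Each tile is a
    (top, right, bottom, left) tuple of edge colors."""
    grid = [None] * (k * k)

    def fill(pos):
        if pos == k * k:
            return True
        i, j = divmod(pos, k)
        for t in tiles:
            grid[pos] = t
            ok = True
            if j > 0:                       # left neighbor already placed
                ok = ok and grid[pos - 1][1] == t[3]
            if i > 0:                       # top neighbor already placed
                ok = ok and grid[pos - k][2] == t[0]
            if j == k - 1:                  # wrap around to column 0
                ok = ok and t[1] == grid[i * k][3]
            if i == k - 1:                  # wrap around to row 0
                ok = ok and t[2] == grid[j][0]
            if ok and fill(pos + 1):
                return True
            grid[pos] = None
        return False

    return fill(0)
```

An aperiodicity proof must show that no such k exists for any k, which is exactly why these computations are so heavy.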
Aperiodic Wang tilings: history
Berger: 20426 tiles in 1966 (later brought down to 104)
Knuth: 92 tiles in 1968
Robinson: 56 tiles in 1971
Ammann: 16 tiles in 1971
Grünbaum: 24 tiles in 1987
Kari and Culik: 14, then 13 tiles in 1996
Jeandel and Rao: 11 tiles, the smallest possible (2014-2015)
Aperiodic Wang tilings: results
Theorem: every set of n Wang tiles, n ≤ 10, is finite or periodic
About 4 days on the LIP cluster
A single tile set caused trouble, but it was proved finite after a more thorough (human + computer) study
Theorem: there exists an aperiodic set of 11 Wang tiles
About 1 year on the LIP cluster + a few months on PSMN
Hundreds of hard tile sets require longer computation (lots of RAM: crunchs/grunch)
Currently: still 34 problematic tile sets
But: two of them have been proved aperiodic
Conjecture: 2 other sets are aperiodic
[Figure: the 11 Wang tiles of the aperiodic set, with their edge colors (digits 0 to 3)]
Multiplication of Interval Matrices on Multicores
Philippe Théveny and Nathalie Revol, ENS de Lyon and INRIA, AriC team
LIP, ENS de Lyon, Université de Lyon
Journée Calcul
September 29, 2015
Philippe Theveny and Nathalie Revol Multiplication of Interval Matrices on Multicores
How to Implement Parallel Interval Matrix Multiplication?
At the intersection of:
- Certified computing
- Interval numerical linear algebra: matrix-matrix multiplication
- Parallel computing: on a multi-core processor
Interval Matrices
Representation by endpoints:
    [ [1,3]    [0,4]    [-1,5] ]
    [ [-4,0]   [2,4]    [0,6]  ]
    [ [-5,1]   [-1,1]   [3,7]  ]
Midpoint-radius representation:
    [ <2,1>    <2,2>    <3,3> ]
    [ <-2,2>   <3,1>    <1,4> ]
    [ <-3,3>   <0,1>    <5,2> ]
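Converting between the two representations is straightforward in exact arithmetic; this sketch ignores rounding, whereas a floating-point implementation must round the radius upward to preserve the enclosure:

```python
def to_midrad(lo, hi):
    """Endpoints [lo, hi] -> midpoint-radius <m, r>."""
    m = (lo + hi) / 2
    return m, hi - m

def to_endpoints(m, r):
    """Midpoint-radius <m, r> -> endpoints [m - r, m + r]."""
    return m - r, m + r
```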
C ⊃ A × B using the endpoint representation
    C_ij = [0, 0]
    for (i = 0; i < n; i++)
      for (j = 0; j < n; j++)
        for (k = 0; k < n; k++)
          C_ij += A_ik * B_kj, i.e.
            lower bound: Clo_ij = RD(Clo_ij + min(RD(Alo_ik*Blo_kj), RD(Alo_ik*Bhi_kj), RD(Ahi_ik*Blo_kj), RD(Ahi_ik*Bhi_kj)))
            upper bound: Chi_ij = RU(Chi_ij + max(RU(Alo_ik*Blo_kj), RU(Alo_ik*Bhi_kj), RU(Ahi_ik*Blo_kj), RU(Ahi_ik*Bhi_kj)))
- accumulation with min and max: cannot use optimized BLAS routines
- changing the rounding mode is slow
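The min/max over the four cross products can be shown in exact arithmetic, with Python's `Fraction` standing in for the directed roundings RD/RU that this sketch deliberately leaves out:

```python
from fractions import Fraction

def interval_mul(x, y):
    """Endpoint product [x] * [y]: the result interval is the min and max
    over the four products of endpoints.  Exact rationals replace the
    slide's downward/upward roundings."""
    products = [Fraction(a) * Fraction(b) for a in x for b in y]
    return [min(products), max(products)]
```

The min/max is exactly what prevents delegating the inner loop to an optimized floating-point BLAS.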
C ⊃ A × B using the midpoint-radius representation
    <Cm, Cr> = < RN(Am × Bm),
                 RU(Ar × (|Bm| + Br) + |Am| × (Br + (n+2)/2 ulp(|Bm|)) + realmin) >
- needs 3 floating-point matrix products, but
- cannot use optimized BLAS routines because of the directed rounding mode
- solution: implement a custom interval BLAS kernel with blocking, loop unrolling, vectorization, and multithreading
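Stripped of the floating-point correction terms ((n+2)/2 ulp|Bm| and realmin, which only compensate for rounding), the midpoint-radius product of two scalar intervals reduces to the following exact-arithmetic sketch:

```python
def mr_mul_scalar(am, ar, bm, br):
    """Midpoint-radius product <am, ar> * <bm, br> in exact arithmetic:
    Cm = am*bm and Cr = ar*(|bm| + br) + |am|*br, which over-estimates
    the tight radius |am|*br + |bm|*ar + ar*br and so keeps enclosure."""
    cm = am * bm
    cr = ar * (abs(bm) + br) + abs(am) * br
    return cm, cr

# e.g. <2,1> * <3,1>: the endpoint product [1,3] * [2,4] = [2,12]
# is contained in <6, 6> = [0, 12].
```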
Interval MM on one core
Compiler: ICC seems to respect the rounding mode (with -fp-model strict); GCC does not when optimizations are enabled.
Manual vectorization and manual unrolling.
Block size: 32, to make good use of the L1 cache.
Performance on crunch
[Figure: two plots. Left: time ratio interval/MKL vs. matrix size (up to 5000), ranging roughly from 1 to 4. Right: parallel efficiency vs. number of threads (1 to 32) for MKL and for the interval kernel.]
Journée Recherche Informatique et Calcul du LIP
Daniel Balouek-Thomert, Avalon team
September 29, 2015
Context
Growing needs of computational resources. Increasing numbers of:
- servers
- datacenters
Energy consumption is a main concern!
Energy efficiency through scheduling. Application domain: cloud computing. Issues:
- Management of large numbers of virtual machines
- Optimal placement at any time
- Dealing with fluctuating workloads
- Respecting quality of service
Large Scale Experiments
Management of a virtualized infrastructure. Complexity:
- Multi-criteria scheduling
- Prediction
Requirements:
- Trade-offs between performance and energy savings
- Satisfying all actors (administrators, end users, ...)
Validation:
- Real-life testbeds
- Production traces
Dimensioning Experiments over Grid'5000
Why are we using Grid'5000?
- Large number of physical nodes (hardware heterogeneity)
- Customizable operating systems in administrator mode
- Network isolation (avoiding perturbation from other users)
- Energy monitoring capabilities (node sensors)
Recent experiments over Grid'5000:
- Large-scale validation: scheduling of 1,000,000+ jobs from production traces
- Industrial transfer: implementation of a federation of small datacenters for a French sovereign cloud computing service
- Overcoming technical locks: easy and reproducible start of 10,000+ virtual machines
Lessons are learned in practical large-scale experiments
Retrieving the energy consumption of a Grid'5000 job
A key point: separating experimental purposes from metrology concerns
Experiment scenario:
- Finding and booking available nodes
- Configuring nodes and installing dependencies
- Running an experiment
Monitoring is performed seamlessly for the end user
Figure: Grid'5000 live visualisation tool
- Data available on demand through a web API
- Post-experiment analysis of monitoring logs
- Live visualisation over the last {minute, hour, day}
Tools
Grid'5000 dedicated tools:
OAR: finding and booking available nodes
Kadeploy: cloning, configuring (post-installation) and managing cluster nodes
Execo: prototyping experiments on distributed systems
Kavlan: level-2 network isolation from other experiments
VM5K: easy and reproducible deployment of virtual machines
Platform-independent tools:
DIET: middleware for high-performance computing in distributed environments
Kwapi: driver-based monitoring software
OpenStack: controls pools of compute, storage and network resources
Daniel Balouek-Thomert (PhD), Eddy Caron (MCF), Laurent Lefevre (CR)
Journée Recherche Informatique et Calcul du LIP: From Grid'5000 to Curie
Hélène Coullon
Post-doc INRIA, Avalon team
September 29, 2015
H. Coullon (INRIA) Calcul LIP 29th September 2015 1 / 8
Table of contents
1 Scientific Context
2 Performance Metrics
3 From Grid'5000 to Curie, Why?
4 In practice? Tools and libraries
Scientific Context
HPC in 2015:
- Programming models: BSP, PGAS, APGAS, OpenMP, OpenCL, ...
- Execution models: MPI, pthreads, CUDA, task scheduling, ...
- Hardware: clusters, multi-cores, GPGPUs, many-cores, ...
[Diagram: the stack spans an axis from SIMPLICITY at the top to PERFORMANCE at the bottom]
What about MAINTAINABILITY, PRODUCTIVITY, PORTABILITY?
Scientific Context
Component-Based Software Engineering (CBSE):
- Code reuse (productivity)
- Maintainability
- Portability
Toward CBSE in HPC!
Avalon team: Christian Perez (DR), Vincent Lanore (PhD), Jérôme Richard (PhD) and Hélène Coullon (postdoc)
Performance Metrics
Considering a parallel scientific application P, how do we evaluate performance?
Flop/s: number of floating-point operations per second; comparison with some references.
Strong speedup: S(n) = T_1(reference) / T_n(P), where
- T_i is the execution time using i cores (processors)
- the reference is a time-reference application (ideally the best sequential time)
On homogeneous machines, aim to be close to the linear curve y = x.
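The two quantities above can be computed directly from timings; the timings in the usage comment are hypothetical, for illustration only:

```python
def strong_speedup(t_ref, t_n):
    """S(n) = T_1(reference) / T_n(P)."""
    return t_ref / t_n

def efficiency(t_ref, t_n, n):
    """Speedup normalized by core count; 1.0 is the linear ideal
    on a homogeneous machine."""
    return strong_speedup(t_ref, t_n) / n

# e.g. a 100 s sequential reference that finishes in 25 s on 8 cores:
# speedup 4, efficiency 0.5 (half the linear ideal)
```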
From Grid'5000 to Curie, Why?
Scalability on clusters! (homogeneous architectures)
Flop/s:
- G5K taurus cluster (Lyon site) peak: 2.3 TFlop/s
- Curie fat nodes peak: 105 TFlop/s
Speedup:
- G5K Paravance cluster (Rennes site), number of cores: 1152
- Curie thin nodes, number of cores: 80,640
In practice? Tools and libraries
Good tools of G5K:
- Execo g5k
- Kadeploy
Not on Curie! Slurm instead of OAR.
Libraries that I use:
- Already available: MPI, OpenMP, pthread, ...
- Locally and manually installed:
  - HPC libraries: MadMPI (thread-safe MPI), StarPU
  - CBSE libraries: L2C and its dependencies
Thank You !
ARM and NetFPGA Platform
● Context: investigation and design of energy-efficient routing algorithms for core networks
● Requirements: infrastructure for prototyping and evaluating both hardware-assisted data planes and software solutions
– Raspberry Pis for generating background traffic
– NetFPGA cards for building switch prototypes
Technical Details
● 8 Raspberry Pi 2 Model B
● File system:
– SD cards with a full Raspbian install
– File system overlaid with NFS volumes (aufs)
● Environment deployment "à la Kadeploy 3": kaenv3, kadeploy3, etc.
● 5 NetFPGA cards with 4x10 Gbps Ethernet
– http://netfpga.org/site/#/systems/3netfpga-10g/details/
E-Biothon: a platform for BioInformatics
● Context:
– Large amounts of data produced by sequencers
– Requires large amounts of computing
– Requires efficient software
● Goals:
– Help researchers build tomorrow's algorithms
– Give easy access to existing applications
Architecture
BlueGene/P (hosted at IDRIS):
* 56 TFlops peak performance
* 4 racks of 1024 CPUs with 4 cores each => 16,384 cores
* 200 TB of storage
Applications available: PhyML (phylogeny), NAMD, LAMMPS, GROMACS (molecular dynamics)
Usage Stats
[Charts: users' nationalities (counts from 0 to 30): South Africa, Germany, England, Argentina, Australia, Austria, Belgium, Brazil, Canada, Chile, China, Denmark, Egypt, Spain, France, Greece, Guatemala, Hungary, India, Iran, Ireland, Italy, Japan, Lithuania, Mexico, Poland, Portugal, Romania, Russia, Czech Republic, Slovenia, Switzerland, Sweden, Thailand, USA, Uruguay; plus a chart of laboratory origin]
User Results
● Biodiversiton
– Alain Franc (INRA, UMR BioGeCo / Inria, Pleiade team)
– Disseq software (distance comparison of sequences from an environmental sample)
● Insyght
– Jean-François Gibrat (INRA, UR1404, Unité Mathématiques et Informatique Appliquées du Génome à l'Environnement)
– Insyght platform (http://genome.jouy.inra.fr/Insyght/): comparison of the genomic organization of homologous genes in different bacterial genomes
● COMMA
– Yasaman Karami (Unité de Biologie Computationnelle et Quantitative, UMR 7238 CNRS-UPMC)
– Communication Mapping (COMMA) identifies the dynamical architecture of proteins from all-atom molecular dynamics (MD) simulations in explicit solvent
● CXCR4
– Bruck Taddese, Marie Chabbert (Laboratoire BNMI, UMR CNRS 6214 / INSERM U1083, Faculté de médecine d'Angers)
– Detailed analysis of the dynamic properties of the CXC chemokine receptor 4 (CXCR4)