
P.Bakowski 1

GPU Architecture - CUDA Programming

P. Bakowski

Evolution of parallel architectures

We can distinguish three generations of massively parallel architectures (scientific computing):

(1) Supercomputers with special processors for vector computing (Single Instruction Multiple Data - SIMD).
The Cray-1 (1976) contained 200,000 integrated circuits and could perform 100 million floating-point operations per second (100 MFLOPS).

Price: $5 - $8.8 million

Units sold: 85

Evolution of parallel architectures

(2) Supercomputers built from standard microprocessors adapted to massive multiprocessing, operating as Multiple Instruction Multiple Data (MIMD) machines.

Example: IBM Roadrunner: PowerXCell 8i CPUs, 6480 dual-core AMD Opteron CPUs, Linux

Power consumption: 2.35 MW
Footprint: 296 racks, 560 m²
Memory: 103.6 TiB
Performance: 1.042 petaflops
Price: USD $125M

GPGPU and ML on embedded GPUs

(2014) Nvidia Tegra K1: Cortex-A15 (4) + Kepler GPU (192 cores)

(2016) Nvidia Tegra X1: Cortex-A57 (4) + Maxwell GPU (2*128 cores)

(2017) Nvidia Tegra X2: Denver (2) + Cortex-A57 (4) + Pascal GPU (256 cores)

(2018) Nvidia Nano: Cortex-A57 (4) + Maxwell GPU (128 cores)

(2019) Nvidia Xavier: Carmel (8) + Volta GPU (512 cores) + TPU (64 cores)

$100, ~15 W

$800, ~30 W

GPU-based processing (Kepler)

Tegra K1: a Kepler-class GPU with 192 processing cores, 2 signal processors, video processing units for high-definition (2K) video encoding and decoding, an audio processing unit, and a set of data, video and audio interfaces.

GPU-based processing

Tegra X1: a Maxwell graphics chip with 2 SMMs, whose 256 CUDA cores process data in FP16, FP32 and FP64. Its compute power is 1024 GFLOPS, about 1 TFLOPS. The processor can handle a 4K video stream at 60 fps.
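
The 1024 GFLOPS figure can be reproduced with a back-of-the-envelope estimate. The sketch below rests on assumptions that are not stated on the slide: a GPU clock of about 1 GHz, one fused multiply-add (counted as 2 floating-point operations) per core per cycle, and two FP16 values packed into each FP32 lane.

```latex
\[
256~\text{cores} \times 2~\tfrac{\text{FLOP}}{\text{cycle}}
\times 2~(\text{packed FP16}) \times 1~\text{GHz}
= 1024~\text{GFLOPS (FP16)}
\]
\[
256~\text{cores} \times 2~\tfrac{\text{FLOP}}{\text{cycle}} \times 1~\text{GHz}
= 512~\text{GFLOPS (FP32)}
\]
```

Under these assumptions the 1 TFLOPS peak applies to FP16 only; FP32 throughput is half of it.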

Tegra K1/X1: streaming multiprocessor

The streaming multiprocessor (SMX):

32/48/128/192 cores per SMX

each core contains an FP unit and an INT unit

GPGPU programming with CUDA (or OpenCL)

Xavier: Autonomous Machines Processor

8 custom ARM v8 cores

512 GPU cores
Throughput: 22.6 DL TOPS (8-bit), 2.8 CUDA TFLOPS (FP16), 1.4 CUDA TFLOPS (FP32)

Deep Learning Accelerator
Throughput: 11.4 TOPS (int8), 5.7 TFLOPS (FP16)

9 billion transistors: $2 billion R&D and 8,000 engineering years.

NVIDIA - Tegra and CUDA

CUDA: a software architecture on Tegra hardware
CUDA "language": an extension of C

NVIDIA and CUDA

The CUDA Toolkit contains:

- compiler: nvcc
- libraries: FFT and BLAS
- profiler
- debugger: gdb for GPU
- runtime driver for CUDA, included in the NVIDIA drivers
- programming guide
- SDK for CUDA developers
- source code (examples) and documentation

CUDA: compilation phases

CUDA C code is compiled with nvcc, a driver script that invokes other programs: cudacc, g++, cl, etc.

nvcc generates:

- the CPU code, which is compiled together with the other parts of the application written in pure C, and
- the PTX object code for the GPU.

Executables containing CUDA code require:

- the CUDA runtime library (cudart), and
- the CUDA core library
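
To make the two compilation paths concrete, here is a minimal sketch of a single source file containing both host and device code, with typical nvcc invocations in the leading comment. The file name hello.cu and the kernel name fill are illustrative assumptions, not taken from the slides.

```cuda
// hello.cu : smallest program exercising both compilation paths.
//
// Build an executable (host code goes to the system C++ compiler,
// device code is translated to PTX and embedded in the binary):
//   nvcc -o hello hello.cu
// Emit only the PTX generated for the GPU (writes hello.ptx):
//   nvcc -ptx hello.cu
#include <cstdio>
#include <cuda_runtime.h>

// Device code: nvcc translates this function to PTX.
__global__ void fill(int *out)
{
    out[threadIdx.x] = threadIdx.x * 2;
}

// Host code: handed to the host compiler (g++, cl, ...).
int main()
{
    const int n = 8;
    int h[n];
    int *d = NULL;

    cudaMalloc((void **)&d, n * sizeof(int));
    fill<<<1, n>>>(d);                                   // one block of n threads
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);

    for (int i = 0; i < n; i++) printf("%d ", h[i]);
    printf("\n");
    return 0;
}
```

The resulting executable is linked against the CUDA runtime library, which is why cudart must be available wherever the program runs.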

CUDA: programming model

The C code is mapped onto many threads.

Threads are organized into blocks.

A set of blocks, together with their threads, forms a grid.

Example: a two-dimensional grid of 6 blocks (3 columns, 2 rows), each block being three-dimensional with 4 * 4 * 4 threads.
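
The grid described above can be expressed with the dim3 type. The following sketch launches exactly that configuration and counts the threads with an atomic counter; the kernel name countThreads is a hypothetical choice made for this illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread increments a shared global counter exactly once.
__global__ void countThreads(unsigned int *counter)
{
    atomicAdd(counter, 1u);
}

int main()
{
    dim3 grid(3, 2);       // 3 columns x 2 rows = 6 blocks
    dim3 block(4, 4, 4);   // 4 * 4 * 4 = 64 threads per block

    unsigned int h_counter = 0;
    unsigned int *d_counter = NULL;
    cudaMalloc((void **)&d_counter, sizeof(unsigned int));
    cudaMemcpy(d_counter, &h_counter, sizeof(unsigned int),
               cudaMemcpyHostToDevice);

    countThreads<<<grid, block>>>(d_counter);

    cudaMemcpy(&h_counter, d_counter, sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_counter);

    // Total number of threads = blocks in the grid * threads per block.
    unsigned int expected = grid.x * grid.y * grid.z
                          * block.x * block.y * block.z;
    printf("threads launched: %u (expected %u)\n", h_counter, expected);
    return 0;
}
```

Unspecified dim3 components default to 1, so dim3 grid(3, 2) describes a 3 x 2 x 1 grid.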

CUDA: memories

Global memory: accessible to all SMXs and to the CPU

Shared memory: accessible to the threads running in the same block

Constant and texture memories: accessible to all threads in read-only mode

Per thread: local memory and a set of registers
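
As an illustration of where each memory space appears in source code, the sketch below scales a vector by a constant and reverses it within each block. The names scale, tile and scaleAndReverse, and the block size of 64, are assumptions chosen for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS 64                    // hypothetical block size

__constant__ float scale;             // constant memory: read-only in kernels

// in and out point to global memory, allocated with cudaMalloc.
__global__ void scaleAndReverse(const float *in, float *out)
{
    __shared__ float tile[THREADS];   // shared memory: one copy per block
    int t = threadIdx.x;              // t and base live in registers
    int base = blockIdx.x * blockDim.x;

    tile[t] = in[base + t] * scale;
    __syncthreads();                  // wait until the whole tile is written
    out[base + t] = tile[blockDim.x - 1 - t];
}

int main()
{
    const int n = 2 * THREADS;
    const size_t bytes = n * sizeof(float);
    float h_in[n], h_out[n];
    for (int i = 0; i < n; i++) h_in[i] = (float)i;

    float h_scale = 2.0f;
    cudaMemcpyToSymbol(scale, &h_scale, sizeof(float));

    float *d_in = NULL, *d_out = NULL;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    scaleAndReverse<<<n / THREADS, THREADS>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    // Check against the same computation done on the CPU.
    int errors = 0;
    for (int b = 0; b < n / THREADS; b++)
        for (int t = 0; t < THREADS; t++)
            if (h_out[b * THREADS + t] !=
                2.0f * h_in[b * THREADS + THREADS - 1 - t]) errors++;
    printf("mismatches: %d\n", errors);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```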

CUDA: basic programming

CUDA programs contain:
- pure C code, executed on the CPU
- extended C code, executed on the GPU

In this context there are three types of functions:

__host__ executed only on the CPU (the qualifier is optional)

__global__ executed on the GPU, called from the CPU

__device__ executed on the GPU, called from the GPU

A function marked with the __global__ qualifier is also called a kernel.
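
A minimal sketch showing the three qualifiers side by side. The function names square, squareAll and printVector are hypothetical and chosen only for this illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// __device__: runs on the GPU and can only be called from GPU code.
__device__ float square(float x)
{
    return x * x;
}

// __global__: a kernel. Runs on the GPU, launched from the CPU.
__global__ void squareAll(float *v)
{
    v[threadIdx.x] = square(v[threadIdx.x]);
}

// __host__: runs on the CPU. This is the default, so the qualifier
// could be omitted.
__host__ void printVector(const float *v, int n)
{
    for (int i = 0; i < n; i++) printf("%.1f ", v[i]);
    printf("\n");
}

int main()
{
    const int n = 4;
    float h_v[n] = {1.0f, 2.0f, 3.0f, 4.0f};
    float *d_v = NULL;

    cudaMalloc((void **)&d_v, n * sizeof(float));
    cudaMemcpy(d_v, h_v, n * sizeof(float), cudaMemcpyHostToDevice);
    squareAll<<<1, n>>>(d_v);            // one block of n threads
    cudaMemcpy(h_v, d_v, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_v);

    printVector(h_v, n);
    return 0;
}
```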

The call of a global function is organized around the set of threads and blocks to activate.

This is specified by a launch expression of the form:

    kernel<<<blocks, threads>>>(arguments)

The simplest example is:

    kernel<<<1,10>>>(arguments);

Another is:

    kernel<<<2,5>>>(arguments);

Both launches start 10 threads: one block of 10 threads in the first case, two blocks of 5 threads in the second.
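
To see how the two configurations distribute the same 10 threads, the sketch below records the block index and thread index seen by each thread and prints them on the host. The kernel name whoAmI and the helper run are hypothetical; run assumes that blocks * threads equals 10.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread stores its block index and its thread index at its
// global position.
__global__ void whoAmI(int *block, int *thread)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    block[i] = blockIdx.x;
    thread[i] = threadIdx.x;
}

// Launch with the given configuration; blocks * threads must be 10.
static void run(int blocks, int threads)
{
    const int n = 10;
    int h_block[n], h_thread[n];
    int *d_block = NULL, *d_thread = NULL;

    cudaMalloc((void **)&d_block, n * sizeof(int));
    cudaMalloc((void **)&d_thread, n * sizeof(int));

    whoAmI<<<blocks, threads>>>(d_block, d_thread);

    cudaMemcpy(h_block, d_block, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_thread, d_thread, n * sizeof(int), cudaMemcpyDeviceToHost);

    // Print one (blockIdx.x, threadIdx.x) pair per global index.
    printf("<<<%d,%d>>>:", blocks, threads);
    for (int i = 0; i < n; i++) printf(" (%d,%d)", h_block[i], h_thread[i]);
    printf("\n");

    cudaFree(d_block);
    cudaFree(d_thread);
}

int main()
{
    run(1, 10);   // one block of ten threads
    run(2, 5);    // two blocks of five threads
    return 0;
}
```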

CUDA: kernel structure

The automatic (built-in) variables are: threadIdx, blockIdx, blockDim, gridDim

For a one-dimensional organization (x dimension only!): threadIdx.x, blockIdx.x, blockDim.x, gridDim.x

    // GPU kernel for AddVect.Float.cu
    __global__ void addVect(float* in1, float* in2, float* out)
    {
        // global index of this thread across all blocks
        int i = threadIdx.x + blockIdx.x*blockDim.x;
        out[i] = in1[i] + in2[i];
    }
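
The kernel above assumes that exactly one thread is launched per vector element. When the vector length is not a multiple of the block size, a common variant passes the length and guards the access. The sketch below is such a variant, not the author's code; the name addVectChecked, the length 1000 and the block size 256 are assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void addVectChecked(const float *in1, const float *in2,
                               float *out, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)                        // threads past the end do nothing
        out[i] = in1[i] + in2[i];
}

int main()
{
    const int n = 1000;               // hypothetical, not a multiple of 256
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;   // rounded up
    const size_t bytes = n * sizeof(float);

    static float h_a[n], h_b[n], h_c[n];
    for (int i = 0; i < n; i++) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    float *d_a = NULL, *d_b = NULL, *d_c = NULL;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    addVectChecked<<<blocks, threads>>>(d_a, d_b, d_c, n);
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    // i + 2i = 3i is exactly representable in float for these values.
    int errors = 0;
    for (int i = 0; i < n; i++)
        if (h_c[i] != 3.0f * i) errors++;
    printf("blocks: %d, mismatches: %d\n", blocks, errors);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}
```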

CUDA: example - CPU side

    #include <stdio.h>

    int main()
    {
        int i = 0;
        float v1[] = {1,2,3,4,5,6,7,8,9,10};
        float v2[] = {1.0,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9};
        const int memsize = sizeof(v1);             // size in bytes
        const int vsize = memsize/sizeof(float);    // number of elements
        float res[vsize];

        // allocate the three vectors in GPU global memory
        float* Cv1;  cudaMalloc((void **)&Cv1, memsize);
        float* Cv2;  cudaMalloc((void **)&Cv2, memsize);
        float* Cres; cudaMalloc((void **)&Cres, memsize);

        // copy the input vectors from the CPU to the GPU
        cudaMemcpy(Cv1, v1, memsize, cudaMemcpyHostToDevice);
        cudaMemcpy(Cv2, v2, memsize, cudaMemcpyHostToDevice);

        addVect<<<2, vsize/2>>>(Cv1, Cv2, Cres);    // 2 blocks

        // copy the result back; this call waits for the kernel to finish
        cudaMemcpy(res, Cres, memsize, cudaMemcpyDeviceToHost);

        printf("res= { ");
        for(i=0;i<vsize;i++){printf("%2.2f ",res[i]);}
        printf("}\n");

        // release the GPU memory
        cudaFree(Cv1); cudaFree(Cv2); cudaFree(Cres);
        return 0;
    }
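
The host code above ignores the status returned by each CUDA call. A possible refinement, sketched here, wraps every call in a checking macro. CHECK is a hypothetical helper written for this example and is not part of the CUDA API; cudaGetLastError and cudaGetErrorString are standard runtime functions.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// CHECK is a hypothetical helper, not part of the CUDA API.
#define CHECK(call)                                          \
    do {                                                     \
        cudaError_t err_ = (call);                           \
        if (err_ != cudaSuccess) {                           \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",     \
                    __FILE__, __LINE__,                      \
                    cudaGetErrorString(err_));               \
            exit(EXIT_FAILURE);                              \
        }                                                    \
    } while (0)

__global__ void addVect(float *in1, float *in2, float *out)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    out[i] = in1[i] + in2[i];
}

int main()
{
    float v1[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    float v2[] = {1.0f, 1.1f, 1.2f, 1.3f, 1.4f, 1.5f, 1.6f, 1.7f, 1.8f, 1.9f};
    const int memsize = sizeof(v1);
    const int vsize = memsize / sizeof(float);
    float res[vsize];
    float *Cv1 = NULL, *Cv2 = NULL, *Cres = NULL;

    CHECK(cudaMalloc((void **)&Cv1, memsize));
    CHECK(cudaMalloc((void **)&Cv2, memsize));
    CHECK(cudaMalloc((void **)&Cres, memsize));
    CHECK(cudaMemcpy(Cv1, v1, memsize, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(Cv2, v2, memsize, cudaMemcpyHostToDevice));

    addVect<<<2, vsize / 2>>>(Cv1, Cv2, Cres);
    CHECK(cudaGetLastError());   // reports an invalid launch configuration

    // The blocking copy also surfaces errors raised while the kernel ran.
    CHECK(cudaMemcpy(res, Cres, memsize, cudaMemcpyDeviceToHost));

    printf("res= { ");
    for (int i = 0; i < vsize; i++) printf("%2.2f ", res[i]);
    printf("}\n");

    CHECK(cudaFree(Cv1));
    CHECK(cudaFree(Cv2));
    CHECK(cudaFree(Cres));
    return 0;
}
```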

CUDA: Tegra K1 - Zero Copy

    // Set flag to enable zero copy access
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Host arrays
    float* h_in = NULL;
    float* h_out = NULL;

    // Allocate host memory using CUDA allocation calls
    cudaHostAlloc((void **)&h_in, sizeIn, cudaHostAllocMapped);
    cudaHostAlloc((void **)&h_out, sizeOut, cudaHostAllocMapped);

    // Device arrays
    float *d_out, *d_in;

    // Get device pointer from host memory. No allocation or memcpy
    cudaHostGetDevicePointer((void **)&d_in, (void *)h_in, 0);
    cudaHostGetDevicePointer((void **)&d_out, (void *)h_out, 0);

    // Launch the GPU kernel
    kernel<<<blocks, threads>>>(d_out, d_in);

    // Wait for the kernel to finish before reading h_out on the host
    cudaDeviceSynchronize();

    // No need to copy d_out back
    // Continue processing on host using h_out
    ..
    }
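
The fragment above omits the kernel and the surrounding function. The sketch below fills those in under stated assumptions: the kernel twice, the size of 256 elements and the block size of 128 are hypothetical, and the program first checks the canMapHostMemory device property before relying on mapped memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void twice(float *out, const float *in, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) out[i] = 2.0f * in[i];
}

int main()
{
    const int n = 256;                        // hypothetical size
    const size_t bytes = n * sizeof(float);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    if (!prop.canMapHostMemory) {
        printf("device 0 cannot map host memory\n");
        return 1;
    }

    // Enable mapping of host allocations into the device address space.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Host buffers that the GPU can access directly.
    float *h_in = NULL, *h_out = NULL;
    cudaHostAlloc((void **)&h_in, bytes, cudaHostAllocMapped);
    cudaHostAlloc((void **)&h_out, bytes, cudaHostAllocMapped);
    for (int i = 0; i < n; i++) h_in[i] = (float)i;

    // Device-side aliases of the same memory: no cudaMalloc, no cudaMemcpy.
    float *d_in = NULL, *d_out = NULL;
    cudaHostGetDevicePointer((void **)&d_in, (void *)h_in, 0);
    cudaHostGetDevicePointer((void **)&d_out, (void *)h_out, 0);

    twice<<<(n + 127) / 128, 128>>>(d_out, d_in, n);
    cudaDeviceSynchronize();                  // h_out is valid after this

    int errors = 0;
    for (int i = 0; i < n; i++)
        if (h_out[i] != 2.0f * h_in[i]) errors++;
    printf("mismatches: %d\n", errors);

    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return 0;
}
```

On a Tegra device the CPU and GPU share the same physical DRAM, which is what makes avoiding the copies attractive there.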

CUDA: device analysis

    struct cudaDeviceProp {
        char   name[256];
        size_t totalGlobalMem;           // possible value: 2 GB
        size_t sharedMemPerBlock;        // possible value: 128 KB
        int    regsPerBlock;             // possible value: 64 K
        int    warpSize;                 // possible value: 32
        size_t memPitch;
        int    maxThreadsPerBlock;       // possible value: 1024
        int    maxThreadsDim[3];
        int    maxGridSize[3];
        size_t totalConstMem;
        int    major;                    // possible value: 1, 2 or 3
        int    minor;                    // possible value: 1, 2, 3
        int    clockRate;                // reported in kHz; possible value: 1.2 GHz
        size_t textureAlignment;
        int    deviceOverlap;
        int    multiProcessorCount;      // possible value: 1, 2, 4
        int    kernelExecTimeoutEnabled;
    }

CUDA: device analysis

    // DeviceStat.cu
    ..
    #include <stdio.h>
    #include <cuda.h>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp dP;   // dP short for deviceProperties
        int device = 0;
        cudaGetDeviceProperties(&dP, device);

        // size_t fields are printed with %zu
        printf("Name:%s\n", dP.name);
        printf("Memory total:%zu MB\n", dP.totalGlobalMem/(1024*1024));
        printf("Shared memory per block:%zu in B\n", dP.sharedMemPerBlock);
        printf("MaxThreads block:%d \n", dP.maxThreadsPerBlock);
        printf("warpSize:%d \n", dP.warpSize);
        printf("major:%d \n", dP.major);
        printf("minor:%d \n", dP.minor);
        printf("number of SM:%d \n", dP.multiProcessorCount);
        // clockRate is reported in kHz
        printf("Clock frequency:%1.3f in GHz \n", dP.clockRate/1000000.0);
        return 0;
    }

CUDA: device analysis

For the Tegra K1 we obtain:

    ubuntu@tegra-ubuntu:~/cuda$ ./devicestat
    name: GK20A
    totalGlobalMem: 1746 in MB
    shared memory per block: 48 in KBytes
    max threads per block: 1024
    warpSize: 32
    major: 3
    minor: 2
    multi processor count: 1
    clock rate: 0.852 in GHz

CUDA: matrix multiplication

    // CPU reference: c = a * b for DIM x DIM matrices stored row by row.
    // c must be zero-initialized by the caller.
    void CPU_matrix_mul(float* a, float* b, float* c)
    {
        for(int i=0;i<DIM;i++)
            for(int j=0;j<DIM;j++)
                for(int k=0;k<DIM;k++)
                    c[j+i*DIM] += a[k+i*DIM]*b[j+k*DIM];
    }

CUDA: matrix multiplication

    #define WIDTH 512 // corresponds to DIM; passed as the Width argument

    __global__ void
    matrix_mul(float* dev_A, float* dev_B, float* dev_C, int Width)
    {
        // 2D thread ID within the whole grid (column tx, row ty)
        int tx = blockIdx.x*blockDim.x + threadIdx.x;
        int ty = blockIdx.y*blockDim.y + threadIdx.y;
        float Pvalue = 0;
        for(int k=0;k<Width;++k)
        {
            float Ael = dev_A[ty*Width + k];
            float Bel = dev_B[k*Width + tx];
            Pvalue += Ael*Bel;
        }
        dev_C[ty*Width + tx] = Pvalue;
    }

Each of the 512*512 elements of the result is computed by a separate thread, which accumulates 512 products.
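
A block holds at most 1024 threads on the devices discussed here, so a 512*512 result needs a two-dimensional grid of blocks. The slides do not show the launch configuration; the sketch below is one plausible version, using a smaller matrix (WIDTH 64) and 16 x 16 blocks as assumptions, and it checks the GPU result against a CPU loop.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define WIDTH 64   // hypothetical matrix dimension (the slides use 512)
#define TILE  16   // hypothetical block edge: 16 x 16 = 256 threads

__global__ void matrixMulGrid(const float *A, const float *B, float *C,
                              int width)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= width || col >= width) return;   // outside the matrix

    float sum = 0.0f;
    for (int k = 0; k < width; ++k)
        sum += A[row * width + k] * B[k * width + col];
    C[row * width + col] = sum;
}

int main()
{
    const int n = WIDTH * WIDTH;
    const size_t bytes = n * sizeof(float);
    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    // Small integer values keep every sum exactly representable.
    for (int i = 0; i < n; i++) { hA[i] = (float)(i % 7); hB[i] = (float)(i % 5); }

    float *dA = NULL, *dB = NULL, *dC = NULL;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((WIDTH + TILE - 1) / TILE, (WIDTH + TILE - 1) / TILE);
    matrixMulGrid<<<grid, block>>>(dA, dB, dC, WIDTH);
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    // CPU reference check.
    int errors = 0;
    for (int r = 0; r < WIDTH; r++)
        for (int c = 0; c < WIDTH; c++) {
            float ref = 0.0f;
            for (int k = 0; k < WIDTH; k++)
                ref += hA[r * WIDTH + k] * hB[k * WIDTH + c];
            if (ref != hC[r * WIDTH + c]) errors++;
        }
    printf("grid %ux%u, mismatches: %d\n", grid.x, grid.y, errors);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```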

CUDA: performance evaluation

    float eT;   // short for elapsedTime, in milliseconds
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    // here we call the GPU kernel (or CPU function)

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   // wait until the stop event has been reached
    cudaEventElapsedTime(&eT, start, stop);

The result of this evaluation can be displayed as follows:

    printf("GPU.time:%d*%d:%3.2fms\n",Width,Width,eT);

GPU time => 12.35 ms

CPU time => 2639.35 ms

What is the speed-up factor?
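
Taking the two timings above at face value, the speed-up factor is the ratio of the CPU time to the GPU time:

```latex
\[
S = \frac{T_{\text{CPU}}}{T_{\text{GPU}}}
  = \frac{2639.35~\text{ms}}{12.35~\text{ms}} \approx 213.7
\]
```

So the GPU version runs roughly 214 times faster than the CPU version for this measurement.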

CUDA: shared memory and synchronization

Shared memory is available only to the threads running in the same block.

Example: the scalar product (dot product) of two vectors. Consecutive elements of the two vectors are multiplied and the products are accumulated into a single value, the scalar product.

    // N (vector length) and threadsPerBlock (a power of two) are
    // compile-time constants defined elsewhere.
    __global__ void dot(float *a, float *b, float *c)
    {
        __shared__ float cache[threadsPerBlock];
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        int cIdx = threadIdx.x;   // cIdx - short for cacheIndex
        float temp = 0;

        // each thread accumulates its share of the products
        while (tid < N) {
            temp += a[tid] * b[tid];
            // move on by the total number of threads in the grid
            tid += blockDim.x * gridDim.x;
        }
        cache[cIdx] = temp;

        __syncthreads();

        // reduction in several steps: each step halves the number
        // of active threads
        int i = blockDim.x/2;   // vector reduction
        while (i != 0)
        {
            if (cIdx < i) cache[cIdx] += cache[cIdx+i];
            __syncthreads();
            i /= 2;
        }

        // final product of this block
        if (cIdx == 0) c[blockIdx.x] = cache[0];
    }
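
The kernel leaves one partial sum per block in c, so the host still has to add these up. The slides do not show this part; the sketch below is one way to complete the example. The values of N, threadsPerBlock and blocksPerGrid are assumptions, and the kernel is repeated so that the program is self-contained.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

const int N = 4096;               // hypothetical vector length
const int threadsPerBlock = 256;  // must be a power of two for the reduction
const int blocksPerGrid = 8;      // hypothetical; fewer than N / threadsPerBlock,
                                  // so each thread handles several elements

__global__ void dot(const float *a, const float *b, float *c)
{
    __shared__ float cache[threadsPerBlock];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int cIdx = threadIdx.x;
    float temp = 0;

    while (tid < N) {                       // per-thread accumulation
        temp += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }
    cache[cIdx] = temp;
    __syncthreads();

    for (int i = blockDim.x / 2; i != 0; i /= 2) {   // tree reduction
        if (cIdx < i) cache[cIdx] += cache[cIdx + i];
        __syncthreads();
    }
    if (cIdx == 0) c[blockIdx.x] = cache[0];         // one value per block
}

int main()
{
    static float h_a[N], h_b[N];
    float h_c[blocksPerGrid];
    for (int i = 0; i < N; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a = NULL, *d_b = NULL, *d_c = NULL;
    cudaMalloc((void **)&d_a, N * sizeof(float));
    cudaMalloc((void **)&d_b, N * sizeof(float));
    cudaMalloc((void **)&d_c, blocksPerGrid * sizeof(float));
    cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, N * sizeof(float), cudaMemcpyHostToDevice);

    dot<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c);
    cudaMemcpy(h_c, d_c, blocksPerGrid * sizeof(float),
               cudaMemcpyDeviceToHost);

    // Final step on the CPU: add the per-block partial sums.
    float result = 0;
    for (int i = 0; i < blocksPerGrid; i++) result += h_c[i];
    printf("dot = %.1f (expected %.1f)\n", result, 2.0f * N);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```

Summing a handful of partial results on the CPU is usually cheaper than launching a second kernel for so few values.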

CUDA and graphics APIs

CUDA programs can use the graphics functions provided by graphics APIs (OpenCV, OpenGL).

These functions provide the image processing and image generation operations needed for rasterization and shading, that is, for rendering images to the screen.

We use only a few OpenCV and OpenGL operations: reading/writing images from/to files (OpenCV) and displaying images directly from the GPU's video memory (OpenGL).

CUDA and OpenCV

    // NegImage.CV.cu
    #include <opencv/highgui.h>
    #define uchar unsigned char
    #define DtoH cudaMemcpyDeviceToHost
    #define HtoD cudaMemcpyHostToDevice

    __global__ void negimage(uchar *array)
    {
        int i = threadIdx.x + blockIdx.x*blockDim.x;
        array[i] = 255 - array[i];   // byte complement
    }

    int main()
    {
        // 192 * 4800 = 640 * 480 * 3 (VGA image, 3 bytes per pixel)
        int no = 192 * 4800;         // number of elements
        int nb = no*sizeof(uchar);
        IplImage *img = 0;           // image in CV
        uchar *data;                 // space for the bitmap
        uchar *d_a = 0;              // pointer - global memory

        img = cvLoadImage("ClipVGA.jpg", 1);   // loading and decompressing
        data = (uchar *) img->imageData;
        cudaMalloc((void **) &d_a, nb);

        int bs = 192;                // block size
        int gs = no/bs;              // grid size - number of blocks (x)
        cudaMemcpy(d_a, data, nb, HtoD);
        negimage<<<gs,bs>>>(d_a);    // kernel call

        cudaMemcpy(data, d_a, nb, DtoH);
        cvNamedWindow("Win1", CV_WINDOW_AUTOSIZE);

        cvShowImage("Win1", img);
        cvWaitKey(0);
        cudaFree(d_a);
    }

CUDA and OpenGL

Mapping the CUDA buffer onto the OpenGL framebuffer:

class GPUAnimBitmap and its functions display_and_exit(), anim_and_exit()

    // Variant 1: display a single frame
    int main( int argc, char **argv ) {
        GPUAnimBitmap bitmap( DIMX, DIMY, NULL );
        bitmap.display_and_exit(
            (void(*)(uchar4*,void*))generate_frame, NULL );
    }

    // Variant 2: animation; the int argument of generate_frame
    // is the clock tick
    int main( void ) {
        GPUAnimBitmap bitmap( DIMX, DIMY, NULL );
        bitmap.anim_and_exit(
            (void(*)(uchar4*,void*,int))generate_frame, NULL );
    }

Summary

Evolution of massive multiprocessing (many-core)

GPUs: discrete and integrated (embedded)

NVIDIA Tegra K1 architecture

NVIDIA and CUDA

CUDA processing and memory model

A few simple examples

CUDA with OpenCV and OpenGL
