Adaptability: data vary, resources vary, application

Speaker notes (2 slides):
Context: data and machine vary.
Adaptability: repeat the diagram, then add an « adaptability » bubble: combine several algorithms, collect measurements.
Algorithmic project: during the execution of an algorithm, the execution context varies:
- at the level of the data being manipulated
- at the level of the execution architecture
To adapt, a hierarchical approach:
- a first distribution that takes the global architecture into account: global volume of computation, inter-cluster heterogeneity
- adaptation of the application to the local machine: pre-calibration, dynamic balancing
Need for adaptation in applications => change of degree of parallelism, change of algorithm:
- removal/addition of clusters or large nodes (heavyweight)
- overload/underload within a cluster or SMP node
=> a generic, integrated view of an adaptable application.
Make a drawing, with 2 slides or an animation, so that the point gets across.
What varies: resources, data.
What scheduling can do: a priori planning (scheduling); load-balancing (redistribution).
What the algorithm can do: pre-parameterization based on known characteristics; reaction to a heterogeneous, dynamic environment.
AHA: a global view of an adaptive algorithm, built as a combination of several possible algorithms that compute the same result with different performance.
Adaptation is needed to improve performance.

MiniSymposium: Adaptive Algorithms for Scientific Computing
9h45 Adaptive algorithms - Theory and applications. Collective work, AHA Team. Jean-Louis Roch, INRIA-CNRS Grenoble, France
10h15 Hybrids in exact linear algebra. Dave Saunders, U. Delaware, USA
10h45 Adaptive programming with hierarchical multiprocessor tasks. Thomas Rauber, U. Bayreuth, Germany
11h15 Cache-oblivious algorithms. Michael Bender, Stony Brook U., USA

Why adaptive algorithms?
Resource availability is versatile; data vary.
Inputs: measurements on the resources, measurements on the data.
Adaptations:
- Algorithm choice: sequential / parallel(s); approximate / exact; in memory / out of core
- Scheduling: planning (scheduling) by computation volume / heterogeneity; redistribution (load-balancing)
- Calibration: pre-parameterization; block size / cache; choice of instructions; priority management
Objective of AHA: an integrated view of adaptation.
Algorithmic approach: a self-adaptive combination of algorithms whose global behavior is justified from a theoretical point of view.

Parallel algorithms with adaptive grain
Example: prefix computation. MOAIS project (www-id.imag.fr/MOAIS), ID-IMAG laboratory (CNRS-INRIA INPG-UJF)

How to adapt the application?
By minimizing communications, e.g. amortizing synchronizations in the simulation [Beaumont, Daoudi, Maillard, Manneback, Roch - PMAA 2004]: adaptive granularity
By controlling latency (interactivity constraints): FlowVR [Allard, Menier, Raffin]: overhead
By managing node failures and resilience [checkpoint/restart] [checkers]: FlowCert [Jafar, Krings, Leprevost; Roch, Varrette]
By adapting granularity:
- malleable tasks [Trystram, Mounié]
- dataflow cactus-stack: Athapascan/Kaapi [Gautier]
- recursive parallelism by « work-stealing » [Blumofe-Leiserson 98, Cilk, Athapascan, ...] [Bender-Rabin 2002]
- self-adaptive grain algorithms, dynamic extraction of parallelism [Daoudi, Gautier, Revire, Roch - J. TSI 2005] [Roch, Traore, Bernard - …]

Parallel algorithms with adaptive grain: some examples
Scheduling of fine-grain parallel programs: work-stealing
Adaptive-grain algorithms: principle of a dynamic « cascade »; example of the iterated product
Sequential - parallel coupling: example of the prefix computation

How to choose the granularity?
[Figure: dataflow task graph with tasks H(a), O(b,7), F(2,a), G(a,b), H(b)]
High potential degree of parallelism.
In « practice »: coarse granularity. Splitting into p = #resources. Drawback: the architecture is heterogeneous and dynamic ($\Pi_i(t)$: speed of processor i at time t).
In « theory »: fine granularity. Maximal parallelism. Drawback: overhead of task management.
How to choose/adapt the granularity?

Greedy scheduling
« Work »: sequential time, $W_1$ = #operations.
« Depth »: parallel time on $\infty$ resources, $W_\infty$ = #ops on a critical path.
Homogeneous case [Graham 69]: greedy scheduling (no processor is idle while a task is ready):
$T_p \le W_1/p + (1 - 1/p)\,W_\infty$, hence $T_p < W_1/p + W_\infty$.
Heterogeneous case [Jaffe 80]: maximum utilization schedule: if i < p tasks are ready, assign them to the i fastest processors.
High utilization schedule [Bender 02], with parameter B: if i < p tasks are ready, the fastest idle processor is at most B times faster than the slowest busy processor:
$T_p < W_1/(p\,\Pi_{ave}) + B\,W_\infty/\Pi_{ave}$.
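To make the homogeneous bound concrete, here is a minimal sketch that evaluates the right-hand side of Graham's inequality for given work, depth and processor count. The function name `greedy_bound` is an assumption made for this illustration; it is not part of Kaapi or Cilk.

```cpp
#include <cassert>

// Upper bound on the completion time of any greedy schedule on p identical
// processors [Graham 69]:  T_p <= W1/p + (1 - 1/p) * Winf.
// w1   : total work, in unit operations
// winf : depth, i.e. number of operations on a critical path
// (hypothetical helper, for illustration only)
double greedy_bound(double w1, double winf, int p) {
    assert(p >= 1 && winf >= 0.0 && w1 >= winf);
    return w1 / p + (1.0 - 1.0 / p) * winf;
}
```

For example, with $W_1 = 1000$, $W_\infty = 10$ and p = 4, the bound is 250 + 7.5 = 257.5 time units: as long as the depth is small compared to $W_1/p$, a greedy schedule is close to the ideal $W_1/p$.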

Work stealing
Distributed randomized implementation of greedy scheduling.
Each processor locally manages the tasks it creates.
When idle, a processor steals the oldest ready task from a remote, non-idle processor (randomly chosen).
Implementation: local stack = deque [Cilk, Kaapi]. Local parallelism is implemented by a sequential function call. The local sequential execution must be correct => restrictions: series-parallel (Cilk), reference order (Kaapi).
On heterogeneous processors, a slight modification: when an idle processor steals from a busy processor that is B times slower, it preempts the victim's task.
Benefits:
=> with good probability, #successful steals < $p\,W_\infty$: few task migrations [Blumofe 98, Narlikar 01, Bender 02, Revire-Roch 03, ....]
=> suited to heterogeneous architectures [Bender-Rabin 02]: $T_p < W_1/(p\,\Pi_{ave}) + O(W_\infty/\Pi_{ave})$ with good probability
=> How to have $W_\infty$ small and $W_1$ = #ops of the sequential algorithm?
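The deque discipline described above can be sketched as follows. This is a simplified illustration, assuming a mutex-protected `std::deque`; the class name `StealDeque` and its methods are hypothetical, and real runtimes such as Cilk (THE protocol) or Kaapi (compare-and-swap) avoid taking a lock on the owner's fast path.

```cpp
#include <cassert>
#include <deque>
#include <mutex>
#include <utility>

// Hypothetical per-processor task queue (illustration only).
// The owner pushes and pops at the back (newest task first, like a call stack);
// a thief steals at the front (oldest task, usually the largest piece of work).
template <typename Task>
class StealDeque {
public:
    void push(Task t) {
        std::lock_guard<std::mutex> guard(mutex_);
        tasks_.push_back(std::move(t));
    }
    // Owner side: take the most recently created task; false if empty.
    bool pop(Task& out) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (tasks_.empty()) return false;
        out = std::move(tasks_.back());
        tasks_.pop_back();
        return true;
    }
    // Thief side: take the oldest ready task; false if empty.
    bool steal(Task& out) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (tasks_.empty()) return false;
        out = std::move(tasks_.front());
        tasks_.pop_front();
        return true;
    }
private:
    std::mutex mutex_;
    std::deque<Task> tasks_;
};
```

With tasks 1, 2, 3 pushed in that order, a thief obtains task 1 while the owner continues with task 3, so the owner's execution order stays that of the sequential program.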

Best case: the parallel algorithm is efficient
$W_\infty$ is small and $W_1 = W_{seq}$: the parallel algorithm is also an optimal sequential one.
Examples: parallel divide-and-conquer (D&C) algorithms.
Implementation: work-first principle, i.e. no overhead when tasks are executed locally.
Examples: Cilk: THE protocol; Kaapi: compare&swap only.

Experimentation: knary benchmark
Distributed architecture: iCluster, Athapascan
#procs   Speed-up
8        7.83
16       15.6
32       30.9
64       59.2
100      90.1
SMP architecture: Origin 3800 (32 procs), Cilk / Athapascan
Ts = 2397 s, T1 = 2435

But usually, when $W_\infty$ is small, $W_1 \gg W_{seq}$
Solution: mix a sequential and a parallel algorithm.
Basic technique: run the parallel algorithm down to a certain « grain », then switch to the sequential one. Problem: $T_\infty$ increases, and so do the number of migrations … and the inefficiency ;o(
Work-preserving speed-up [Bini-Pan 94] = cascading technique [Jaja92]: careful interplay of both algorithms to build one with both $T_\infty$ small and $T_1 = O(T_s)$:
- divide the sequential algorithm into blocks
- compute each block with the (non-optimal) parallel algorithm
Drawback: sequential at coarse grain and parallel at fine grain ;o(
Adaptive grain: the dual approach: parallelism is extracted from any sequential task.

How to obtain an efficient fine-grain algorithm?
Hypothesis for the efficiency of work-stealing: the parallel algorithm is « work-optimal » and $T_\infty$ is very small (recursive parallelism).
Problem: fine-grain (small $T_\infty$) parallel algorithms may involve a large overhead with respect to an efficient sequential algorithm:
- overhead due to parallelism creation and synchronization
- but also arithmetic overhead

Self-adaptive grain algorithms
Recursive computations with local sequential computation.
Special case: recursive extraction of parallelism when a resource becomes idle, but local execution of a sequential algorithm.
Hypothesis: two algorithms:
- 1 sequential: SeqCompute
- 1 parallel: LastPartComputation
=> at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm.
Examples:
- iterated product [Vernizzi]
- gzip / compression [Kerfali]
- MPEG-4 / H264 [Bernard ….]
- prefix computation [Traore]

Self-adaptive grain algorithm
Principle: save the parallelism overhead by favoring a sequential algorithm => use the parallel algorithm only when a processor becomes idle, by extracting parallelism from a sequential computation.
Hypothesis: two algorithms:
- 1 sequential: SeqCompute
- 1 parallel: LastPartComputation
=> at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm.
Examples:
- iterated product [Vernizzi]
- gzip / compression [Kerfali]
- MPEG-4 / H264 [Bernard ….]
- prefix computation [Traore]
[Figure: Extract_par splits a running SeqCompute into a shorter SeqCompute and a LastPartComputation, which itself starts a new SeqCompute]

Indeed, parallelism often costs
E.g. prefix computation: $P_1 = a_0 * a_1$, $P_2 = a_0 * a_1 * a_2$, …, $P_n = a_0 * a_1 * … * a_n$
Sequential algorithm: P[0] = a[0]; for (i = 1; i <= n; i++) P[i] = P[i - 1] * a[i]; $T_1 = n$
Parallel algorithm:
[Figure: adjacent inputs a0, a1, …, an are multiplied pairwise; a recursive Prefix(n/2) on the products yields P1, P3, …, Pn; one more multiplication per remaining index yields P2, P4, …, Pn-1]
$T_\infty = 2 \log n$ but $T_1 = 2n$
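The operation counts quoted above can be checked with a small sketch. It assumes integer addition as the associative operation (instead of the generic `*` of the slide) to keep the values small, and the function names `prefix_seq` and `prefix_par` are hypothetical. The recursive version is written sequentially here; its three loops are the steps that would run in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential prefix: exactly a.size() - 1 operations.
std::vector<long> prefix_seq(const std::vector<long>& a, long& ops) {
    std::vector<long> p(a.size());
    if (a.empty()) return p;
    p[0] = a[0];
    for (std::size_t i = 1; i < a.size(); ++i) { p[i] = p[i - 1] + a[i]; ++ops; }
    return p;
}

// Recursive pairing scheme: logarithmic depth, but about twice the work.
std::vector<long> prefix_par(const std::vector<long>& a, long& ops) {
    const std::size_t n = a.size();
    if (n <= 1) return a;
    // Step 1: combine adjacent pairs (independent operations).
    std::vector<long> pairs(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) { pairs[i] = a[2 * i] + a[2 * i + 1]; ++ops; }
    // Step 2: recursive prefix on the n/2 pair results.
    const std::vector<long> q = prefix_par(pairs, ops);
    // Step 3: q[i] is the prefix at odd index 2i+1; each even index
    // needs one extra operation (again independent of each other).
    std::vector<long> p(n);
    p[0] = a[0];
    for (std::size_t i = 0; i < n / 2; ++i) p[2 * i + 1] = q[i];
    for (std::size_t j = 2; j < n; j += 2) { p[j] = p[j - 1] + a[j]; ++ops; }
    return p;
}
```

On 8 inputs the sequential loop performs 7 operations and the recursive scheme 11; in general the recursive count approaches 2n, which is the arithmetic overhead the slide refers to.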

Adaptive prefix computation
Any (parallel) algorithm with depth $T_\infty = d$ performs at least $2n - d$ operations.
Lower bound on p identical processors: $2n/(p+1)$. Block algorithm + pipeline [Nicolau 2000].
Adaptive scheme:
- one process performs the sequential computation
- p-1 processes perform a parallel « segmented » prefix computation
$T_p < 2n/((p+1)\,\Pi_{ave}) + O(\log n/\Pi_{ave})$

Adaptive Prefix versus optimal on identical processors

Adaptive prefix with variable speeds
- Lower bound: decreasing the parallel time => #ops increases, > $2n\,(1 - 1/p)$
- Adaptive grain algorithm with provable performance: dynamic cascading of two algorithms (sequential/parallel) [TSI 2005]
- Theorem: $T = 2n/(p^*+1) + O(\log n)$, i.e. ~ optimal on processors with average speed $p^*$ [soon 2006]
[Figure: execution times of Parallel and Adaptive, with and without external load]
Single-user context: Adaptive is equivalent to:
- sequential on 1 proc
- optimal parallel-2 proc. on 2 processors
- …
- optimal parallel-8 proc. on 8 processors
Multi-user context: Adaptive is the fastest, with a 15% benefit over a static grain algorithm.

The race: sequential / parallel fixed / adaptive prefix
[Figure: competing runs, from top to bottom: Adaptive 8 proc., Parallel 8 proc., Parallel 7 proc., Parallel 6 proc., Parallel 5 proc., Parallel 4 proc., Parallel 3 proc., Parallel 2 proc., Sequential]

Conclusion
Adaptive algorithm with provable performance -> also confirmed by first experiments.
Still to experiment:
- on SMP at fine grain [floating point prefix sum] (memory, pinning work-stealers on CPUs)
- on distributed heterogeneous architectures
The scheme (and its complexity analysis) appears to be general:
- apply the technique to other problems [AHA]

Implementation of work-stealing
Hypothesis: a sequential schedule is valid + non-preemptive execution of ready tasks.
[Figure: processor P runs f1() { …. fork f2 ; … } with f1 on its stack; at « fork f2 », the idle processor P' steals f2]
Benefit: « static » fine grain, but dynamic control.
Drawback: possible overhead of the parallel algorithm [e.g. prefix computation].

Generic self-adaptive grain algorithm

Illustration: f(i), i=1..100 (step 1)
SeqComp(w) on CPU=A: f(1). LastPart(w): W=2..100.

Illustration: f(i), i=1..100 (step 2)
SeqComp(w) on CPU=A: f(1);f(2). LastPart(w): W=3..100.

Illustration: f(i), i=1..100 (step 3)
SeqComp(w) on CPU=A: f(1);f(2). LastPart(w) on CPU=B: W=3..100.

Illustration: f(i), i=1..100 (step 4)
SeqComp(w) on CPU=A: f(1);f(2). LastPart(w) on CPU=B creates SeqComp(w’) and LastPart(w’).

Illustration: f(i), i=1..100 (step 5)
SeqComp(w) on CPU=A: f(1);f(2). LastPart(w): W=3..51. SeqComp(w’) and LastPart(w’) hold the rest.

Illustration: f(i), i=1..100 (step 6)
SeqComp(w) on CPU=A: f(1);f(2). LastPart(w): W=3..51. SeqComp(w’) on CPU=B: f(52). LastPart(w’) holds the rest.
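The steps of this illustration can be reproduced with a small work descriptor shared between the sequential side and the extraction side. This is a sketch under stated assumptions: the class `WorkDesc`, its inclusive index range and the rule "steal the last half of what remains" are hypothetical choices made to match the numbers of the illustration (3..51 kept, 52..100 stolen), and its return convention (true on success) is this sketch's own. It is not the Kaapi descriptor.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical descriptor of the remaining iterations [first, last] (inclusive).
class WorkDesc {
public:
    WorkDesc(int first, int last) : first_(first), last_(last) {}

    // SeqComp side: take the next single iteration; false when nothing is left.
    bool extractSeq(int& i) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (first_ > last_) return false;
        i = first_++;
        return true;
    }

    // LastPart side, called by an idle processor: take the last half of the
    // remaining iterations; false when there is too little work to split.
    bool extractPar(int& stolenFirst, int& stolenLast) {
        std::lock_guard<std::mutex> guard(mutex_);
        const int remaining = last_ - first_ + 1;
        if (remaining < 2) return false;
        const int keep = (remaining + 1) / 2;  // the victim keeps the first half
        stolenFirst = first_ + keep;
        stolenLast = last_;
        last_ = stolenFirst - 1;
        return true;
    }

private:
    std::mutex mutex_;
    int first_;
    int last_;
};
```

Starting from 1..100, after CPU A has taken f(1) and f(2), a call to `extractPar` by CPU B returns 52..100 and leaves 3..51 to A, so B's first sequential iteration is f(52). No task is created while no processor is idle, which is where the saving over a fixed-grain parallel algorithm comes from.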

Adaptivity
Kaapi: reification, interaction with the environment (addition of resources), … (interaction).
But also: impact on algorithms / scheduling.
Example: work-stealing based algorithms:
- recursive parallel computations
- local sequential computation
Special case: recursive extraction of parallelism when a resource becomes idle, but local execution of a sequential algorithm.
Example: prefix computation:
- Sequential: n operations
- Parallel on p identical resources: at least $2n\,p/(p+1)$ operations
- Adaptive with work-stealing: coupling of a sequential and a parallel partial-prefix computation; may benefit from an unbounded number of resources
Performance on p processors of variable speeds: $2n/(p+1) + O(\log n)$

Adaptive algorithms
Recursive computations with local sequential computation.
Special case: recursive extraction of parallelism when a resource becomes idle, but local execution of a sequential algorithm.
Example: prefix computation:
- Sequential: n operations
- Parallel on p identical resources: at least $2n\,p/(p+1)$ operations
- Adaptive with work-stealing: coupling of a sequential and a parallel partial-prefix computation; may benefit from an unbounded number of resources
Performance on p processors of variable speeds: $2n/(p+1) + O(\log n)$

E.g. triangular system solving
Solve $A x = b$, with $A$ lower triangular of dimension n.
Sequential algorithm: $T_1 = n^2/2$; $T_\infty = n$ (fine grain)
1/ $x_1 = b_1 / a_{11}$
2/ For k = 2..n: $b_k = b_k - a_{k1}\,x_1$
The system of dimension n is thus reduced to a system of dimension n-1.
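The two steps above, applied repeatedly, give the usual column-oriented forward substitution. The following is a minimal sketch, assuming a dense lower-triangular matrix stored as a vector of rows and a nonzero diagonal; the function name `solve_lower` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Solve A x = b for a lower-triangular A with nonzero diagonal.
// Each outer iteration computes one unknown, then removes its contribution
// from the remaining right-hand side: the system shrinks by one dimension.
std::vector<double> solve_lower(const std::vector<std::vector<double>>& A,
                                std::vector<double> b) {
    const std::size_t n = b.size();
    std::vector<double> x(n);
    for (std::size_t j = 0; j < n; ++j) {
        assert(A[j][j] != 0.0);
        x[j] = b[j] / A[j][j];                 // step 1/: solve for x_j
        for (std::size_t k = j + 1; k < n; ++k)
            b[k] -= A[k][j] * x[j];            // step 2/: update b_k (independent updates)
    }
    return x;
}
```

The inner loop performs n-1, n-2, …, 1 updates, about $n^2/2$ in total, and its iterations are independent of each other; only the outer loop is inherently sequential, which gives the depth n quoted above.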

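E.g. triangular system solving (continued)
Solve $A x = b$.
Sequential algorithm: $T_1 = n^2/2$; $T_\infty = n$ (fine grain)
Using parallel matrix inversion: $T_1 = n^3$; $T_\infty = \log^2 n$ (fine grain)
With $A = \begin{pmatrix} A_{11} & 0 \\ A_{21} & A_{22} \end{pmatrix}$: $A^{-1} = \begin{pmatrix} A_{11}^{-1} & 0 \\ S & A_{22}^{-1} \end{pmatrix}$, $S = -A_{22}^{-1} A_{21} A_{11}^{-1}$, and $x = A^{-1} b$
Self-adaptive granularity algorithm: $T_1 = n^2$; $T_\infty = n \log n$
[Figure: the system $A x = b$ split into a part of size h handled by the self-adaptive sequential algorithm and a part of size m handled by self-adaptive matrix inversion; ExtractPar; choice of h = m; self-adaptive scalar product]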

Parallel algorithms with adaptive grain: some examples
Scheduling of fine-grain parallel programs: work-stealing and efficiency
Adaptive-grain algorithms: principle of a dynamic « cascade »
Examples:
- iterated product, prefix computation
- gzip compression
- inversion of triangular systems
- 3D vision / oct-tree computation

Iterated product: sequential, parallel, adaptive [Davide Vernizzi]
Sequential:
- Input: array of n values
- Output: C/C++ code: for (i=0; i<n; i++) res += atoi(x[i]);
Parallel algorithm: recursive block computation (binary tree with merge)
- Block size = pagesize
- Kaapi code: Athapascan API
Experiment: parallel <=> adaptive

Variant: sum of pages
Input: a set of n pages. Each page is an array of values.
Output: one page in which each element is the sum of the elements with the same index in the input pages.
C/C++ code: for (i=0; i<n; i++) for (j=0; j<pageSize; j++) res[j] += f(pages[i][j]);
Experiment:
- the parallel algorithm costs about twice as much as the sequential algorithm
- the adaptive algorithm has an efficiency close to 1

Demonstration on ensibull
Script:
demo]$ more go-tout.sh
#!/bin/sh
./spg /tmp/data &
./ppg /tmp/data 1 --a1 -thread.poolsize 3 &
./apg /tmp/data 1 --a1 -thread.poolsize 3 &
Result:
demo]$ ./go-tout.sh
Page size: 4096
Memory allocated
0:In main: th = 1, parallel
0:
0: res = e+07
0: time = s
ADAPTATIF (3 procs)
0: Threads created: 54
0: res = e+07
0: time = s
PARALLELE (3 procs)
0: #fork = 7497
:
: res = e+07
: time = s
SEQUENTIEL (1 proc)

Where does the difference come from? … The program sources
Source code for the sum of pages:
- parallel: binary tree
- adaptive, by coupling: sequential + Fork<LastPartComp>; LastPartComp: (recursive) generation of 3 tasks

Parallel algorithm
struct Iterated {
  void operator() (a1::Shared_w<Page> res, int start, int stop) {
    if ( (stop-start) < 2) {
      // If max num of pages is reached, sequential algorithm
      Page resLocal (pageSize);
      IteratedSeq(start, resLocal);
      res.write(resLocal);
    } else {
      // If max num of pages is not reached
      int half = (start+stop)/2;
      a1::Shared<Page> res1;                   // First thread result
      a1::Shared<Page> res2;                   // Second thread result
      a1::Fork<Iterated> () (res1, start, half);  // First thread
      a1::Fork<Iterated> () (res2, half, stop);   // Second thread
      a1::Fork<Merge> () (res, res1, res2);       // Merging results...
    }
  }
};

Adaptive parallelization
Block computation on inputs split into k blocks: 1 block = pagesize.
Independent execution of the k tasks.
Merging of the results.

Adaptive algorithm (1/3)
Hypothesis: non-preemptive scheduling, of work-stealing type.
Adaptive sequential coupling:
void Adaptative (a1::Shared_w<Page> *resLocal, DescWork dw) {
  // cout << "Adaptative" << endl;
  a1::Shared <Page> resLPC;
  a1::Fork<LPC>() (resLPC, dw);      // extraction side: run only if a processor is idle
  Page resSeq (pageSize);
  AdaptSeq (dw, &resSeq);            // sequential side: run locally
  a1::Fork <Merge> () (resLPC, *resLocal, resSeq);
}

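Sequential side:
void AdaptSeq (DescWork dw, Page *resSeq) {
  DescLocalWork w;
  Page resLoc (pageSize);
  double k;
  // Loop condition kept as in the original: iterate while extractSeq returns 0
  while (!dw.desc->extractSeq(&w)) {
    // Accumulate the extracted page into the local result
    for (int i=0; i<pageSize; i++ ) {
      k = resLoc.get (i) + (double) buff[w*pageSize+i];
      resLoc.put(i, k);
    }
    *resSeq=resLoc;
  }
}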

Extraction side = parallel algorithm:
struct LPC {
  void operator () (a1::Shared_w<Page> resLPC, DescWork dw) {
    DescWork dw2;
    dw2.Allocate();
    dw2.desc->l.initialize();
    if (dw.desc->extractPar(&dw2)) {
      // Task 1: adaptive computation on the extracted part
      a1::Shared<Page> res2;
      a1::Fork<AdaptativeMain>() (res2, dw2.desc->i, dw2.desc->j);
      // Task 2: further extraction on what remains
      a1::Shared<Page> resLPCold;
      a1::Fork<LPC>() (resLPCold, dw);
      // Task 3: merge of both results
      a1::Fork<MergeLPC>() (resLPCold, res2, resLPC);
    }
  }
};

Adaptive parallelization
A single computation task is started for all the inputs.
The remaining work is split only when a processor becomes idle.
Fewer tasks, fewer merges.

Example 2: parallelization of gzip
Widely used (web) and costly, although of linear complexity.
Source code: 10000 lines of C, complex data structures.
Principle: LZ77 + Huffman tree.
Why gzip? A P-complete problem, but a practical parallelization is possible.
Drawback: every (known) parallelization incurs an overhead -> loss of compression ratio.

Parallel compression
How to parallelize gzip?
[Figure: input file -> on-the-fly compression algorithm -> compressed file; input file -> static partition into blocks -> parallel compression -> compressed blocks; parallelization => dynamic partition into blocks]
« Easy » parallelization, 100% compatible with gzip/gunzip.
Problems: loss of compression ratio, grain depends on the machine, overhead.

Adaptive-grain parallelization of gzip
[Figure: input file -> SeqComp (on-the-fly compression) -> output compressed file; LastPartComputation -> dynamic partition into blocks -> parallel compression -> output compressed blocks -> cat]

Overhead in compressed file size
Columns: input file; gzip; adaptive on 2 procs; 8 procs; 16 procs
0.86 MB: 272573, 275692, 280660
5.2 MB: 1.023 MB, 1.027 MB, 1.05 MB, 1.08 MB
9.4 MB: 6.60 MB, 6.62 MB, 6.73 MB, 6.79 MB
10 MB: 1.12 MB, 1.13 MB, 1.14 MB, 1.17 MB
Gain in time:
5.2 MB: 3.35 s, 0.96 s, 0.55 s
9.4 MB: 7.67 s, 6.73 s, 6.79 s
10 MB: 1.71 s, 0.88 s

Performance: Pentium 4x200 MHz
