Design flow for a reconfigurable platform
Raphaël David, Daniel Chillet, Sébastien Pillement, Olivier Sentieys
ENSSAT / LASTI - nom@enssat.fr
IRISA / INRIA - sentieys@irisa.fr
Troisième Colloque CAO, Paris, 16 May 2002

Thank you, Mr. Chairman. Good afternoon, everybody. I am working on the design of a dynamically reconfigurable architecture that aims to support the constraints of third-generation telecommunications. The purpose of this talk is to present this architecture and to discuss its performance and its fit with our application domain.
Outline
Introduction
A dynamically reconfigurable embedded architecture: DART
Development methodology
Conclusions and perspectives

This talk is divided into four parts. First, I will present our application domain in order to extract the constraints associated with it. We will then discuss the architectural impact of these constraints and the case for developing a new architecture. The presentation of this architecture, called DART, is the topic of the second part of this talk. Once it has been presented, we will discuss the performance of DART and its fit with our application domain. Finally, we will talk about the work in progress and the perspectives of this study.
3G processing chain
Constraints: high performance; low power consumption; flexibility (applications, services); multi-granularity (arithmetic processing, logic processing)
Processing: data, audio, video
Source coding: V34, V8, H225, H245, ...; EFR, AMR, CELP, RPE-LTP, ...; MPEGx, H26x, ...
Channel coding: Viterbi, turbo coding, Reed Solomon, ...
Access: TDMA, FDMA, W-CDMA, ...
Modulation: PSK, MSK, ASK, QAM, ...
High performance, low power consumption: 24 MOPS/mW @ 12 GOPS

I will now present a typical third-generation processing chain, seen from the transmit side. Processing of video, audio or computer data is followed by source coding, which minimizes the amount of data to transmit; by channel coding, which improves the robustness of the signal; and by modulation, which adapts the signal to the transmission channel and multiplexes the users. This is where W-CDMA comes in: it replaces the traditional time or frequency multiplexing with code multiplexing, which performs much better but is also much more computationally complex.

In addition to the high performance inherent in W-CDMA and multimedia processing, this processing chain brings new constraints that are much more unusual for the hardware design world. In particular, it is widely agreed that the success of UMTS depends on a standard far more flexible than GSM or IS-95. At the source coding level, for example, we will want to code a speech signal according to the CELP standard while remaining able to comply with the GSM standard through RPE-LTP coding. We also know that the standards and the services offered to the user are likely to evolve over time. This flexibility will show up as occasional modifications of the processing to be implemented; we call this software-level flexibility.

The processing chain also exhibits a far more problematic variety of processing, however. A terminal will have to perform, one after another, very complex and very different kinds of processing, in terms of computation pattern, computation granularity and type of data handled. We will move, for example, from an MPEG coder processing 8-bit data to channel coding that manipulates single bits. This variability is much more problematic because, to perform well, the architecture must adapt dynamically to these changes. Here we speak of hardware-level flexibility. Finally, the mobility constraints of 3G systems translate into low-power architectures.
DART: general presentation
Autonomous architecture
2 reconfiguration granularities: functional (DPR), gate (FPGA)
Dynamic reconfiguration
Low power consumption
Distribution of resources: computation, interconnections, control, storage

Our architecture is part of a project associating the University of Brest and STMicroelectronics, funded by the French ministry of industry and research. It is a fully autonomous architecture. To support the various computation granularities found in a third-generation processing chain, we use two kinds of operators: DPRs for arithmetic processing and FPGAs for bit-level processing. All these resources, which will be presented later, are dynamically reconfigurable, and because the architecture is meant to be embedded, it has been developed with energy awareness. Finally, DART has been designed to have a programming model that is as simple as possible, in order to ease the design of the development tools. To that end, we have organized the architecture into a hierarchy that covers the computation, interconnection, control and storage resources.
Cluster architecture
[Figure: cluster block diagram with DPR1 to DPR6, segmented network, data memory, controller, DMA controller, configuration memory and FPGA]

The clusters are the processing primitives at the system level, and each of them has the architecture shown on this slide. To support several processing granularities, a cluster contains two kinds of operators: one FPGA core and several DPRs. The DPRs can be connected to one another through a segmented network for massively parallel processing, or disconnected to work independently on different threads. They will be presented in more detail on the next slide. All these resources access the same data memory space, and there is also a configuration memory dedicated to the FPGA. The FPGA is configured serially by the DMA controller.

The cluster controller manages the configurations of the DPRs, which are applied dynamically through configuration instructions. Its architecture is similar to that of any programmable processor. However, it sequences configurations rather than instructions, so it does not have to access an instruction memory at each cycle. Memory reads take place only when a reconfiguration occurs and are therefore very occasional. This allows a very significant energy saving.
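The saving from sequencing configurations rather than instructions can be illustrated by counting control-memory reads. The sketch below is a toy model, not the DART controller: the schedule format, the configuration names and both function names are assumptions introduced for illustration. It compares a processor that fetches one instruction per cycle with a controller that reads its memory only when the configuration changes.

```python
def instruction_fetches(schedule):
    """Conventional sequencing: one instruction memory read per cycle."""
    return sum(cycles for _config, cycles in schedule)


def configuration_fetches(schedule):
    """Configuration sequencing: one memory read per configuration change."""
    fetches = 0
    previous = None
    for config, _cycles in schedule:
        if config != previous:
            fetches += 1
            previous = config
    return fetches


# Hypothetical schedule: (configuration name, number of cycles it is held).
schedule = [("mac_filter", 1000), ("square_diff", 500), ("mac_filter", 1000)]

print(instruction_fetches(schedule))    # 2500
print(configuration_fetches(schedule))  # 3
```

The longer a configuration is held, the larger the gap between the two counts, which is why this scheme suits loop kernels.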
DPR architecture
[Figure: DPR block diagram with global buses, loop management, address generators AG1 to AG4, Data mem1 to Data mem4, multi-bus network, registers reg1 and reg2, and functional units MUL1, ALU1, MUL2, ALU2]

I will now describe the DPRs, which are the arithmetic processing primitives of DART. They are organized around functional units and memories interconnected by a very powerful multi-bus network. Every DPR has 4 dynamically reconfigurable functional units, namely 2 multipliers and 2 ALUs, which support Sub-Word Processing. For data storage, we use four local memories.

In addition to these 4 memories, you can see on this slide two registers that are particularly useful for dataflow-oriented applications, where the different functional units work on the same data flow but on samples delayed from one iteration to the next. With these registers we can share a lot of data in that case, which significantly decreases the number of data memory accesses and hence the energy consumed.

All these resources are connected by a multi-bus network that allows every resource in the DPR to be connected to every other. A memory can, for example, supply data to the four functional units simultaneously. The right part of this slide also shows connections to global buses, which link several DPRs for massively parallel processing.
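The data sharing obtained with delay registers can be sketched by counting data memory reads for a small filter computed two ways. This is an illustrative model only: the function names, the use of a software queue to stand in for the registers, and the filter itself are assumptions, not a description of the DPR hardware.

```python
from collections import deque


def fir_direct(x, coeffs):
    """Every tap reads its sample from memory: len(coeffs) reads per output."""
    reads = 0
    out = []
    for n in range(len(coeffs) - 1, len(x)):
        acc = 0
        for k, c in enumerate(coeffs):
            acc += c * x[n - k]  # one data memory read per tap
            reads += 1
        out.append(acc)
    return out, reads


def fir_delay_chain(x, coeffs):
    """Every sample is read once, then passed along a chain of registers."""
    reads = 0
    chain = deque(maxlen=len(coeffs))  # newest sample at index 0
    out = []
    for sample in x:
        chain.appendleft(sample)  # the only data memory read for this sample
        reads += 1
        if len(chain) == len(coeffs):
            out.append(sum(c * s for c, s in zip(coeffs, chain)))
    return out, reads


x = list(range(10))
coeffs = [1, 2, 3]
print(fir_direct(x, coeffs)[1])       # 24 reads
print(fir_delay_chain(x, coeffs)[1])  # 10 reads
```

Both versions produce the same outputs; only the number of memory reads differs, and the gap grows with the number of taps that share the delayed samples.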
HW reconfiguration versus SW reconfiguration
Hardware reconfiguration to optimize the datapath:
Config. 1: y(n) += x(n)*c(n) (X, +, Mem1, Mem2)
Reconfiguration: 4 cycles
Config. 2: y(n) = (x(n) - x(n-1))² (-, X, Mem1, Mem3)
Software reconfiguration to modify the datapath at each cycle:
Config. 1: S = A+B (+, Mem1, Mem2)
Reconfiguration: 1 cycle
Config. 2: S = C*D (X, Mem4, Mem1)

This slide illustrates these two kinds of reconfiguration. For regular processing such as loop kernels, where the same computation pattern is used for a long period of time, we use a hardware reconfiguration to optimize the datapath for that pattern. For example, we adopt this datapath to implement a filter based on multiply-accumulate operations and hold this configuration for the duration of the filtering. Then, after a reconfiguration step that may take 4 cycles, we adopt a new datapath to compute, for example, the square of the difference between x(n) and x(n-1).

For irregular processing, on the other hand, where the computation pattern changes very often, we reconfigure the DPRs at each cycle by limiting their flexibility. In that case the datapath performs only read-modify-write operations: the data are read, the operation is carried out, and the result is stored in memory. The configuration thus concerns only the functionality of the operators and the memories, but it is applied at each cycle. For example, we add the data A and B stored in memories 1 and 2, then at the next cycle multiply the data C and D stored in memories 1 and 4.

With these two kinds of reconfiguration we are able to execute every kind of processing while still optimizing the datapath for the critical, regular parts.
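The two modes can be sketched in software as follows. This is a behavioral illustration under stated assumptions, not a model of the DPR: the function names, the one-result-per-cycle cost model, the memory layout and the choice of memory 3 as destination are all hypothetical. Only the two kernels, the 4-cycle reconfiguration step and the A+B then C*D sequence come from the slide.

```python
import operator


def mac_filter(x, c):
    """Hardware mode, Config. 1: y(n) += x(n) * c(n)."""
    y = 0
    for xn, cn in zip(x, c):
        y += xn * cn
    return y


def squared_difference(x):
    """Hardware mode, Config. 2: y(n) = (x(n) - x(n-1))**2."""
    return [(x[n] - x[n - 1]) ** 2 for n in range(1, len(x))]


def hardware_mode_cycles(kernel_lengths, reconfig_cycles=4):
    """Assumed cost model: one cycle per iteration, plus one
    reconfiguration step between consecutive kernels."""
    return sum(kernel_lengths) + reconfig_cycles * (len(kernel_lengths) - 1)


OPS = {"add": operator.add, "mul": operator.mul}


def run_software_mode(program, memories):
    """Software mode: each step is one read-modify-write operation."""
    for op, (mem_a, addr_a), (mem_b, addr_b), (mem_d, addr_d) in program:
        a = memories[mem_a][addr_a]              # read
        b = memories[mem_b][addr_b]
        memories[mem_d][addr_d] = OPS[op](a, b)  # modify, then write
    return memories


# A and C in memory 1, B in memory 2, D in memory 4; results go to memory 3.
memories = {1: [3, 5], 2: [4], 3: [0, 0], 4: [6]}
program = [
    ("add", (1, 0), (2, 0), (3, 0)),  # S = A + B
    ("mul", (1, 1), (4, 0), (3, 1)),  # S = C * D
]
print(run_software_mode(program, memories)[3])  # [7, 30]
print(hardware_mode_cycles([64, 32]))           # 100
```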
Implementation results
Key UMTS applications: W-CDMA (complex despreading); video processing (2-D DCT); audio processing (autocorrelation)
Few instruction reads
Massive data sharing
[Table values: 149, 85, 16.4, 40.8]
[Table values: 57600, 43, 5040, 57600]

To verify the fit between DART and third-generation telecommunications, we have also implemented some key UMTS applications: a complex despreading to illustrate W-CDMA, a 2-dimensional DCT for video processing, and, for speech coding, an autocorrelation working on 240 samples. This table confirms that DART is very efficient while consuming very little energy on these three applications, which are representative of our application domain.

This energy efficiency is notably due to data sharing and to savings in memory accesses. In particular, we can note that only forty-three instructions are sufficient to control the autocorrelation computation, whereas more than 57000 instruction reads and decodes would have been needed on a conventional DSP. Moreover, since there are far fewer instruction memory accesses, the instruction memory can be much smaller, and hence each instruction access consumes less energy. Even if the data memory savings are less impressive, we can also note that the use of a delay chain divides the number of data memory accesses by 12 for the autocorrelation example, compared with a traditional solution, which gives a very significant energy saving.

Finally, this table shows the benefit of Sub-Word Processing, since it drastically increases the performance and the energy efficiency of DART for the 2-dimensional DCT when handling 8-bit data.
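The idea behind Sub-Word Processing can be sketched as one wide operator working on several narrow values at once. The example below assumes a 16-bit word split into two 8-bit lanes; that width, the lane count and all function names are assumptions chosen for illustration, since the talk does not give the operator format.

```python
LANE_HIGH_BITS = 0x8080  # top bit of each 8-bit lane
LANE_LOW_BITS = 0x7F7F   # lower 7 bits of each 8-bit lane


def pack(hi, lo):
    """Pack two 8-bit values into one 16-bit word."""
    return ((hi & 0xFF) << 8) | (lo & 0xFF)


def unpack(word):
    """Split a 16-bit word into its two 8-bit lanes."""
    return (word >> 8) & 0xFF, word & 0xFF


def swp_add8(a, b):
    """Add two packed words lane by lane, modulo 256,
    without letting a carry cross from one lane into the other."""
    low = (a & LANE_LOW_BITS) + (b & LANE_LOW_BITS)  # carries stop at bit 7
    return low ^ ((a ^ b) & LANE_HIGH_BITS)          # restore each top bit


a = pack(200, 100)
b = pack(100, 200)
print(unpack(swp_add8(a, b)))  # (44, 44)
```

A single call to `swp_add8` performs two 8-bit additions, which is the source of the gain on 8-bit data such as pixels.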
The development flow
Three types of processing must be distinguished: irregular code; data manipulations; regular computations
Irregular code and data manipulations are translated into executable binary code via classical compilation passes from the CALIFE retargetable compilation environment:
Generation of SW instructions: cDART
Generation of data manipulation instructions: ACG
Regular processing is transformed into HW reconfigurations via an extension of the behavioral synthesis tool BSS:
Generation of HW configurations: gDART

To exploit the computation power of DART, an efficient development flow is key to the success of the architecture. Hence, we have designed a development framework based on the joint use of a front-end that transforms and optimizes C code, a retargetable compiler and a behavioral synthesis tool.

The heart of our methodology is a SUIF front-end, which has to distinguish regular processing from irregular processing and from data manipulations. These three kinds of processing are handled differently. Irregular processing and data manipulations are translated into executable binary code through classical compilation passes taken from a retargetable compilation framework developed at IRISA in France. Regular processing, on the other hand, is translated into hardware reconfigurations by a module in charge of scheduling the data flow graph of the loop kernels so as to cope with the limitations of DART. This module is an extension of a behavioral synthesis tool developed at our laboratory, called BSS, for Breizh Synthesis System. Finally, all the executable code generated in this way can be simulated with the SystemC simulator of DART.
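The routing of the three kinds of processing to the three back-ends can be summarized as a lookup. The sketch below assumes the classification has already been made by the front-end; the `Kind` enumeration, the `dispatch` function and the region names are hypothetical and are not part of SUIF, CALIFE or BSS.

```python
from enum import Enum


class Kind(Enum):
    REGULAR = "regular computation"
    IRREGULAR = "irregular code"
    DATA = "data manipulation"


# Back-end used for each kind of processing, as listed on the slide.
BACKEND = {
    Kind.REGULAR: "gDART",
    Kind.IRREGULAR: "cDART",
    Kind.DATA: "ACG",
}

# What each back-end generates, as listed on the slide.
OUTPUT = {
    "gDART": "HW configurations",
    "cDART": "SW instructions",
    "ACG": "data manipulation instructions",
}


def dispatch(regions):
    """Map each (name, kind) region to its back-end and generated output."""
    return [(name, BACKEND[kind], OUTPUT[BACKEND[kind]]) for name, kind in regions]


# Hypothetical code regions of an application.
regions = [
    ("filter_loop", Kind.REGULAR),
    ("mode_selection", Kind.IRREGULAR),
    ("buffer_addressing", Kind.DATA),
]
for row in dispatch(regions):
    print(row)
```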
[Figure: development flow. C code enters the SUIF front-end (profiling, partial loop unrolling, DPR allocation). An ARMOR description of DART is processed by ARMORC. cDART (compilation; parser assembler -> SW config), gDART (loop kernel scheduling; parser DFG -> HW config) and ACG (data access extraction, compilation; parser assembler -> AG codes) feed SCDART (RTL simulation; performance analysis: consumption, number of cycles, resource usage, ...).]

The development flow shown on this slide allows the user to describe the application in C. This high-level description is first translated into a Control and Data Flow Graph (CDFG), on which automatic transformations such as loop unrolling are applied in order to optimize the execution time. These optimizations are performed in the SUIF front-end, and this module also has to distinguish regular processing, such as loop kernels, from irregular processing and from data manipulations. Then, according to the type of processing, one of the following three tools is used to obtain executable binary code.

For irregular processing, implemented on a subset of DART, we use cDART, a compiler obtained with the CALIFE framework. This framework generates an optimized compiler from an ARMOR description of an architecture, which captures the instruction semantics as well as the instruction-level parallelism constraints.

For regular processing we use gDART, a module in charge of scheduling the data flow graph of the loop kernels so as to cope with the limitations of DART.

Data manipulations are translated into address generation instructions by ACG, which is also built around classical compilation passes taken from CALIFE.

Finally, all the executable code generated in this way can be simulated with the SystemC simulator of DART, and the user obtains information about the energy consumption and the performance of the implementation.
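Partial loop unrolling, one of the front-end transformations named on the slide, can be illustrated on a multiply-accumulate loop. The sketch is written in Python for consistency with the other examples, although the flow itself takes C as input. The unroll factor of 2 is an assumption, chosen here because each DPR has two multipliers and two ALUs; the function names are hypothetical.

```python
def dot_rolled(x, c):
    """Original loop: one multiply-accumulate per iteration."""
    acc = 0
    for i in range(len(x)):
        acc += x[i] * c[i]
    return acc


def dot_unrolled_by_2(x, c):
    """Partially unrolled loop: two independent multiply-accumulates
    per iteration, followed by a cleanup loop for a leftover element."""
    acc0 = 0
    acc1 = 0
    even_length = len(x) - len(x) % 2
    for i in range(0, even_length, 2):
        acc0 += x[i] * c[i]          # could map to MUL1 and ALU1
        acc1 += x[i + 1] * c[i + 1]  # could map to MUL2 and ALU2
    for i in range(even_length, len(x)):
        acc0 += x[i] * c[i]
    return acc0 + acc1


x = [1, 2, 3, 4, 5]
c = [5, 4, 3, 2, 1]
print(dot_rolled(x, c), dot_unrolled_by_2(x, c))  # 35 35
```

The two accumulators do not depend on each other, which exposes the parallelism that a scheduler can then assign to separate functional units.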
Summary
DART supports the main constraints of third-generation telecommunications (T3G): variety of computation granularities; variety of computation patterns; variety of data sizes; concurrent execution of tasks; low energy consumption; flexibility
RTL modeling of DART in SystemC: definition of a bit-true and cycle-true simulator; energy estimation at the RT level

To summarize, DART tries to answer every constraint associated with third-generation telecommunications through one of the concepts previously mentioned. In particular:
The various computation granularities are supported by our two kinds of operators: the DPRs and the FPGA core.
To optimize the architecture at each change of computation pattern, DART is dynamically reconfigurable and its functional units are connected through a very powerful multi-bus network.
All these functional units also support Sub-Word Processing.
DART is able to execute different tasks concurrently on its clusters.
Finally, different techniques have been implemented to minimize the energy consumption of DART, such as resource distribution, the use of delay chains to increase data sharing, clock gating and voltage scaling.

We have designed a bit-true and cycle-true simulator of DART by describing the architecture in SystemC. In addition to providing a way to validate DART, this simulator allows us to estimate the energy consumed by the execution of an application. To obtain good relative accuracy, DART has been modeled at the Register Transfer Level, and each operator has been characterized by an average energy consumption per access using the Synopsys design tools.
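An estimate based on an average energy per access amounts to a weighted sum of access counts. The sketch below shows that calculation only; the operator names, the access counts and the energy values (in arbitrary units) are invented for illustration and are not DART characterization data.

```python
def estimate_energy(access_counts, energy_per_access):
    """Total energy = sum over operators of (accesses * average energy per access)."""
    return sum(
        count * energy_per_access[operator_name]
        for operator_name, count in access_counts.items()
    )


# Hypothetical average energy per access, in arbitrary units.
energy_per_access = {"multiplier": 8, "alu": 3, "data_memory": 10, "register": 1}

# Hypothetical access counts reported by a cycle-true simulation.
access_counts = {"multiplier": 240, "alu": 240, "data_memory": 100, "register": 480}

print(estimate_energy(access_counts, energy_per_access))  # 4120
```

With such a model, replacing data memory accesses by register accesses lowers the estimate directly, which matches the motivation given for the delay chain.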
Work in progress: software aspects
Validation of SCDART
Finalization of gDART: partial loop unrolling
State of real-time OSs for reconfigurable architectures

At the software level, current work concerns the definition of a SystemC simulator of the architecture and of the associated energy consumption estimators. At the same time, a study of the SUIF machines is under way to determine how the front-end of a SUIF compiler can be integrated into our development chain. Finally, we are trying to reuse the laboratory's experience in high-level synthesis to define a tool that lets us synthesize regular processing onto the DPRs under resource constraints. This development phase is very close to what has been done at the laboratory in the context of ASIP synthesis (J-G Cousin).
Work in progress: hardware aspects
Hardware implementation of the DPRs: address generators and memories; placement/routing
Study of the FPGA: architecture; integration into the cluster

At the hardware level, the DPRs are currently being synthesized. After the operators and the interconnection network, the address generation and the memories are now under study. Unfortunately, it is for the moment very difficult for us to synthesize these memories, because we do not have an operational design kit. Still at the hardware level, and in order to complete the definition of the cluster architecture, the FPGA is being studied to determine its architecture and the way it will be integrated into the cluster.
Work in progress: validation of results
Implementation of a video coder
Study of DART in the context of network applications
Perspectives
System View
IP View

So far we have mainly worked on the design of the clusters, and it is now time to study how they can be managed and used. Two integration approaches are possible. The first is the one I presented previously. For that, we have to study the system view of the architecture. In particular, we must define a task controller that will assign the different tasks to the clusters under resource availability and urgency constraints.

Another way to use DART is to consider it as an IP that can be integrated into an architecture like the one shown on this slide. In particular, within the framework of a project with STMicroelectronics, we are working on the integration of a DART cluster inside the ST200 processor. This architecture is a clustered VLIW DSP, and the aim of the project is to replace an ST200 cluster with a DART cluster and to see the impact on performance.

These perspectives conclude my talk. Thank you for your attention. Do you have any questions?