Introduction to Multi-Core Architectures. Smail Niar, Master 1 ISECOM, Université de Valenciennes
A fast-growing market
"We are dedicating all of our future product development to multicore designs. … This is a sea change in computing" (Paul Otellini, President, Intel)
"By 2015, Intel may deliver processors with tens or even hundreds of individual cores" (Intel Developer)
"The expectation is that the number of cores per chip will roughly double every two years while processor clock speeds will remain relatively flat."
A law running "parallel" to Moore's law!
Power consumption
Scaling clock speed (business as usual) will not work.
(Figure: power density in W/cm² vs. year, 1970-2010, for Intel processors from the 4004, 8008, 8080, 8085, 8086, 286, 386 and 486 up to the Pentium and P6; the trend line passes the power density of a hot plate and heads toward that of a nuclear reactor, a rocket nozzle, and the Sun's surface. Source: Patrick Gelsinger, Intel)
Parallelism saves power
Exploit explicit parallelism to reduce power (C = capacitance, V = voltage, F = frequency):
Baseline: Power = C × V² × F, Performance = Cores × F
Using additional cores increases density (more transistors = more capacitance). Doubling the cores doubles performance at doubled power: Power = 2C × V² × F, Performance = 2 × Cores × F.
Alternatively, double the cores but halve the frequency, which also allows the voltage to be halved: Power = 2C × (V/2)² × (F/2) = (C × V² × F)/4, Performance = 2 × Cores × (F/2) = Cores × F. Same performance at ¼ the power.
Additional benefit: small, simple cores give more predictable performance.
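Spelled out as a short derivation (a sketch that counts dynamic power only and assumes voltage can be scaled down together with frequency, the usual dynamic-voltage-scaling argument; it is not on the original slide):

\[
P = C V^2 F, \qquad \mathrm{Perf} = \mathrm{Cores} \times F
\]
\[
P_{2\,\mathrm{cores}} = (2C)\,V^2 F = 2P, \qquad \mathrm{Perf} = 2\,\mathrm{Cores} \times F
\]
\[
P_{2\,\mathrm{cores},\,V/2,\,F/2} = (2C)\left(\tfrac{V}{2}\right)^{2}\tfrac{F}{2} = \tfrac{C V^2 F}{4} = \tfrac{P}{4}, \qquad \mathrm{Perf} = 2\,\mathrm{Cores} \times \tfrac{F}{2} = \mathrm{Cores} \times F
\]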
Multi-cores: a solution for reducing power consumption
Capacitance = number of cores
Power = capacitance × voltage² × frequency
Performance = 0.9 × #cores × frequency
Advantages of multiprocessor and multi-core architectures
What is a multi-core architecture?
A collection of processing elements able to communicate and cooperate in order to solve large problems quickly.
Questions:
How many processors (cores)?
How and when to cooperate?
How and when to communicate?
How efficiently?
Which programming language to use?
…
2/ Classification of multiprocessor architectures
(Diagram: taxonomy of parallel architectures)
SISD: 1 instruction stream and 1 data stream (uniprocessor)
SIMD: 1 instruction stream over several data streams: 1) vector processors (e.g. Cray), 2) array processors
MISD: several instruction streams over one data stream (??)
MIMD: several instruction streams and several data streams: shared memory (uniform access: UMA, symmetric multiprocessor SMP; non-uniform access: NUMA) or distributed memory (clusters)
Computer architecture classifications
Processor organizations:
SISD (Single Instruction, Single Data stream): uniprocessor
SIMD (Single Instruction, Multiple Data stream): vector processor, array processor
MISD (Multiple Instruction, Single Data stream)
MIMD (Multiple Instruction, Multiple Data stream): shared memory (tightly coupled): UMA (SMP), NUMA, ccNUMA; multicomputer (loosely coupled)
Classification
SISD: 1 instruction stream and 1 data stream: uniprocessor; ILP (Instruction-Level Parallelism): pipelining, superscalar, VLIW.
SIMD: 1 instruction stream over several data streams: vector processors (e.g. Cray), array processors (grids of processors).
MISD: several instruction streams over 1 data stream.
MIMD: several instruction streams and several data streams: shared memory (symmetric multiprocessor SMP, NUMA), distributed memory (clusters).
Intra-processor parallelism (SISD)
(Diagram: instruction-level parallelism, ILP. A non-pipelined processor completes 1 instruction every several cycles; a pipelined processor reaches 1 instruction per cycle; superscalar and VLIW processors execute several instructions per cycle. A second axis goes from 1 thread to several threads.)
Single Instruction, Multiple Data stream (SIMD)
A single instruction at a time.
Several processing elements (PE: Processing Element).
The program is executed instruction by instruction: an instruction is executed only once all preceding instructions have completed.
Each PE has a local memory (local data).
Each PE executes the instruction on its own data.
2 types: vector processors and array processors.
SIMD: array processors
Single Instruction Multiple Data (SIMD)
A single instruction acts on multiple pieces of data at once.
Common application: graphics.
Performs short arithmetic operations (also called packed arithmetic), for example adding eight 8-bit elements.
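As a concrete illustration, here is a minimal C sketch (not from the slides) using x86 SSE2 intrinsics; the 128-bit registers hold sixteen 8-bit lanes rather than the eight mentioned above, but the principle is identical: one instruction performs all the additions.

    #include <emmintrin.h>  /* SSE2 packed-integer intrinsics */
    #include <stdio.h>

    int main(void) {
        /* Two vectors of sixteen 8-bit elements each. */
        __m128i a = _mm_set1_epi8(10);
        __m128i b = _mm_set1_epi8(3);

        /* One instruction (PADDB) adds all sixteen lanes at once. */
        __m128i c = _mm_add_epi8(a, b);

        unsigned char out[16];
        _mm_storeu_si128((__m128i *)out, c);
        printf("lane 0 = %d\n", out[0]);  /* prints 13 */
        return 0;
    }

Compile on any x86-64 machine (SSE2 is part of the base x86-64 ISA), e.g. with gcc -O2.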
Multiprocessors Multiple processors (cores) with a method of communication between them Types: Homogeneous: multiple cores with shared main memory Heterogeneous: separate cores for different tasks (for example, DSP and CPU in cell phone) Clusters: each core has own memory system
Multiple Instruction, Multiple Data stream (MIMD)
A set of processors (GPP: general-purpose processors).
Each one can execute any instruction it needs.
They simultaneously execute different instruction sequences on different sets of data.
SMPs, clusters and NUMA systems.
SMP: Symmetric Multi-Processor
Tightly coupled MIMD.
The processors share memory (shared-memory processors) and communicate via that shared memory (the memory acts as a mailbox between the processors).
Symmetric Multiprocessor (SMP): the processors share a single memory, accessed over a shared bus; the access time to a given area of memory is approximately the same for each processor (UMA: Uniform Memory Access).
Symmetric Multiprocessor Organization
Time-shared bus: disadvantage
Performance is limited by the bus cycle time.
Each processor should therefore have a local cache to reduce the number of bus accesses.
This leads to cache coherence problems, solved in hardware (see later).
Designs with private L2 caches
(Diagram: two dual-core designs. In both, CORE 0 and CORE 1 each have a private L1 cache and a private L2 cache; in the first, the L2 caches connect directly to memory, while in the second a shared L3 cache sits between the L2 caches and memory.)
Both L1 and L2 private. Examples: AMD Opteron, AMD Athlon, Intel Pentium D.
A design with an L3 cache. Example: Intel Itanium 2.
Multicore organization alternatives
No shared cache: ARM11 MPCore, AMD Opteron.
Shared cache: Intel Core Duo, Intel Core i7.
Intel Core i7 block diagram (QuickPath Interconnect)
Processor package (4 cores).
Per core: registers, instruction fetch, MMU (address translation); L1 d-cache: 32 KB, 8-way; L1 i-cache: 32 KB, 8-way; L1 d-TLB: 64 entries, 4-way; L1 i-TLB: 128 entries, 4-way; L2 unified cache: 256 KB, 8-way; L2 unified TLB: 512 entries, 4-way.
Shared by all cores: L3 unified cache: 8 MB, 16-way; QuickPath interconnect to the other cores and the I/O bridge: 4 links @ 25.6 GB/s, 102.4 GB/s total; DDR3 memory controller to main memory: 3 × 64 bit @ 10.66 GB/s, 32 GB/s total.
The cache coherence problem!
Example: initially processor 0 and processor 1 both hold X = 4 in their caches, consistent with shared memory: the copies of X are coherent. Processor 0 then writes X ← 5 into its own cache: processor 0 now sees X = 5 while processor 1 still sees X = 4: the copies of X are incoherent.
When a processor modifies a cached data element that is also located in another processor's cache, a protocol must be used to maintain data coherency.
What is cache coherence?
(Diagram: processors Proc. 1 … Proc. n, each with a private cache, above a shared main memory. All caches initially hold X = 4. When one processor writes X = 2, either the other copies are invalidated (invalidation) or the new value is broadcast to them (duplication/update); otherwise the private caches and main memory disagree on the value of X.)
SMP architecture (CC-NUMA)
Cache coherence problem: multiple copies of the same data in different caches can result in an inconsistent view of memory.
The write-back* (WB) policy can lead to inconsistency.
Write-through* (WT) can also give problems unless the caches monitor memory traffic.
Reminder:
* WT: when a processor writes into its cache, it immediately writes to memory as well.
* WB: when a processor writes into its cache, it does not write to memory; the update happens only when the block is evicted from the cache, i.e. when its place is reclaimed by another block.
Cache Coherency
Write invalidate
On a write, all copies of the modified block in the other processors' caches are invalidated. The writing processor has exclusive access to the data; the other processors no longer know the data. When another processor needs the data again, it is the processor holding it in the Exclusive state that delivers it.
Each cache block is given a state: Invalid, Shared (read only), or Exclusive (read and write): the ESI protocol (see next diagram).
ESI was improved into MESI (adding a Modified state), see below. Used in the Pentium II and PowerPC.
The ESI protocol (state-transition diagram: Exclusive, Shared, Invalid)
MESI closely resembles ESI: the E state is split in two.
Modified: the block resides exclusively in this cache (and only this one), and its content differs from that of shared memory.
Exclusive: same as Modified, except that shared memory also holds the latest value of the block.
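To make the state machine concrete, here is a minimal C sketch of one cache line's four MESI states reacting to local accesses and snooped bus events (an illustrative simplification, not the slides' exact protocol; a real controller also transfers data and writes back dirty lines during these transitions):

    #include <stdio.h>
    #include <stdbool.h>

    /* The four MESI states of one cache line. */
    typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_t;

    /* Local read: a miss loads the line as Exclusive if no other cache
       holds a copy, Shared otherwise; hits leave the state unchanged. */
    mesi_t local_read(mesi_t s, bool others_have_copy) {
        if (s == INVALID)
            return others_have_copy ? SHARED : EXCLUSIVE;
        return s;
    }

    /* Local write: always ends in Modified. From Shared or Invalid the
       cache must first broadcast an invalidation; from Exclusive the
       upgrade is silent (no bus traffic). */
    mesi_t local_write(mesi_t s) {
        (void)s;
        return MODIFIED;
    }

    /* Another core reads the line (snooped bus read): a Modified copy
       is written back to memory first, then it is held Shared. */
    mesi_t snoop_read(mesi_t s) {
        return (s == MODIFIED || s == EXCLUSIVE) ? SHARED : s;
    }

    /* Another core writes the line (snooped invalidation): our copy is
       invalidated (a Modified copy is flushed before being discarded). */
    mesi_t snoop_write(mesi_t s) {
        (void)s;
        return INVALID;
    }

    int main(void) {
        mesi_t s = INVALID;
        s = local_read(s, false);  /* I -> E: we are the only holder      */
        s = local_write(s);        /* E -> M: silent upgrade              */
        s = snoop_read(s);         /* M -> S: dirty data written back     */
        s = snoop_write(s);        /* S -> I: another core took ownership */
        printf("final state: %s\n", s == INVALID ? "INVALID" : "?");
        return 0;
    }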
Write update
When a processor modifies a data item known to the other processors (shared), it asks them to update it: it broadcasts the address of the block together with its new value to everyone.
In general, the copy in memory is updated at the same time.
Two states are associated with each block: Valid (shared) or Invalid.
The Zynq Processing System
Simplified Block Diagram of the Application Processing Unit (APU) Source: The Zynq Book
Snoop Control Unit (SCU)
Undertakes the interfacing between the cores and the L1 and L2 caches, ensuring cache coherency.
Initiates and controls accesses to the L2 cache, arbitrating between the two cores where necessary.
Manages transactions between the PS and the PL via the Accelerator Coherency Port (ACP).
The SCU communicates with each of the Cortex-A9 processors.
Snoop Control Unit (SCU)
The SCU supports MESI snooping.
It implements duplicated 4-way associative tag RAMs listing the coherent cache lines held in the L1 data caches, so it can check whether data is in an L1 cache at speed and without interrupting the processors; accesses are filtered and forwarded only to the processor actually sharing the data.
It can copy clean data from one cache to another and move dirty data between cores directly, with no need for main-memory accesses and without the latency of a write-back through a shared state.
SIMD (Single Instruction Multiple Data) processing in the NEON Media Processing Engine (MPE). Source: The Zynq Book
Programmable Logic (PL) CLBs and IOBs Source: The Zynq Book
Programmable Logic (PL) BRAMs and DSP units Source: The Zynq Book
Animation https://www.cs.tcd.ie/Jeremy.Jones/vivio/caches/ALL%20protocols.htm
MIMD with distributed shared memory
Distributed-shared-memory (DSM) parallel architectures: the memory is shared but physically scattered across the network (NUMA: Non-Uniform Memory Access).
Example: the Stanford DASH, with 16 nodes each containing 4 MIPS processors, connected by a 2-dimensional grid.
Single address space, non-uniform memory access; each processor has a cache, and cache directories are used.
(Diagram: nodes P1 … Pn, each with a cache ($) and a local memory (Mem), connected by an interconnection network: distributed-memory NUMA.)
Interconnection between processors
2 types of interconnection networks:
Static networks: grid (mesh), hypercube.
Dynamic networks: shared bus, shared cache, crossbar, MIN (Multistage Interconnection Network).
Network architectures
Interconnection networks
Network topologies: arrangements of processors, switches, and links: bus, ring, N-cube (N = 3), 2-D mesh, fully connected.
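For the N-cube, node addresses are N-bit numbers and two nodes are linked exactly when their addresses differ in one bit, so computing a node's neighbours is a one-liner. A small illustrative C sketch (not from the slides; the function name print_neighbours is ours):

    #include <stdio.h>

    /* In an N-cube, flipping each of the N address bits of a node
       yields its N neighbours. */
    void print_neighbours(unsigned node, unsigned n) {
        for (unsigned bit = 0; bit < n; bit++)
            printf("node %u <-> node %u\n", node, node ^ (1u << bit));
    }

    int main(void) {
        /* 3-cube (N = 3, 8 nodes): node 5 = 101b has
           neighbours 4 = 100b, 7 = 111b, and 1 = 001b. */
        print_neighbours(5, 3);
        return 0;
    }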
NoC Router
Basic switching techniques
Circuit switching: a real or virtual circuit establishes a direct connection between source and destination.
Packet switching: each packet of a message is routed independently; the destination address has to be provided with each packet.
Store-and-forward packet switching: the entire packet is stored and then forwarded at each switch.
Cut-through packet switching: the flits of a packet are pipelined through the network; the packet is not completely buffered in each switch.
Virtual cut-through packet switching: the entire packet is stored in a switch only when the header flit is blocked due to congestion.
Wormhole switching: cut-through switching in which all flits are blocked on the spot when the header flit is blocked.
Interconnection networks inside multi-cores
A single shared cache, two designs:
Multiported cache: a costly solution in time, energy, and area.
Multibanked cache: the processors (Proc. 1 … Proc. n) reach the banks of the shared L1 cache through a crossbar; a conflict occurs when several processors target the same bank.
(Diagram: in both cases main memory sits behind the shared cache.)
Multithreaded (SMT) processor architectures
Multithreaded processors reduce idle time by executing another thread whenever the current thread is blocked.
Threads (lightweight processes): a sequence of instructions.
The threads appear to run in parallel.
A thread != a process: each process has its own virtual memory, whereas the lightweight threads belonging to the same parent process share its virtual memory (this avoids …).
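A minimal C sketch of that sharing (illustrative, not from the slides): two POSIX threads of one process update the same global variable, something two separate processes could not do without explicit shared-memory machinery. Compile with -pthread.

    #include <pthread.h>
    #include <stdio.h>

    /* Both threads see the same global, because they share the
       process's virtual memory. Shared data needs synchronisation. */
    long counter = 0;
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void *work(void *arg) {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, work, NULL);
        pthread_create(&t2, NULL, work, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);  /* always 200000 */
        return 0;
    }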
Multithreaded architectures…
The program is divided into threads to be executed simultaneously: thread-level parallelism (TLP).
On a data dependency the CPU does not stall: a new thread is selected instead (thread switch).
Multithreaded architectures…
Process: a program running on a computer.
Multiple processes can run at once: e.g., surfing the Web, playing music, writing a paper.
Thread: part of a program.
Each process has multiple threads: e.g., a word processor may have threads for typing, spell checking, printing.
Multithreaded architectures…
One thread runs at once.
When one thread stalls (for example, waiting for memory):
The architectural state of that thread is stored.
The architectural state of a waiting thread is loaded into the processor, and it runs.
This is called context switching.
To the user it appears as if all threads were running simultaneously.
Multithreaded architectures…
Multiple copies of the architectural state.
Multiple threads active at once:
When one thread stalls, another runs immediately.
If one thread cannot keep all execution units busy, another thread can use them.
This does not increase the instruction-level parallelism (ILP) of a single thread, but it increases throughput.
Intel calls this "hyperthreading".
The multithreading concept (from Theo Ungerer)
(Diagram: issue slots vs. time in processor cycles, comparing a superscalar or VLIW processor with SMT on a superscalar.)
Comparison between multithreaded approaches
(Diagram: multithreading on multiprocessors (MPA) vs. a single-chip multiprocessor (CMP: single Chip Multi Processors), each built from processing elements PE0 … PE3.)
Threading on a 4-way superscalar processor: example
(Diagram: issue slots vs. time for coarse MT, fine MT, and SMT, running threads A, B, C, D.)
Coarse MT takes 32 cycles to complete (assuming an optimistic one-cycle start-up time).
Fine MT takes 25 cycles to complete.
SMT takes 14 cycles to complete.
Multithreading example
Where are parallel threads found for multithreaded architectures?
In loops: several iterations are evaluated simultaneously (see the sketch below).
In procedures (or functions): the same procedure, or different procedures, are evaluated in parallel.
In speculative paths: executing the code after a loop, after a call, after an IF, ….
Multithreaded processors exploit all these sources of potential parallelism.
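A hedged C sketch of the first source, loop-level parallelism (illustrative; the names partial_sum and chunk_t and the 4-thread split are our assumptions, not from the slides): the iteration space of a reduction loop is cut into contiguous chunks, one POSIX thread runs each chunk, and the partial sums are combined serially. Compile with -pthread.

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000
    #define NTHREADS 4   /* N is divisible by NTHREADS */

    double a[N];

    typedef struct { int lo, hi; double sum; } chunk_t;

    /* Each thread sums its own contiguous slice of the array. */
    void *partial_sum(void *arg) {
        chunk_t *c = (chunk_t *)arg;
        c->sum = 0.0;
        for (int i = c->lo; i < c->hi; i++)
            c->sum += a[i];
        return NULL;
    }

    int main(void) {
        for (int i = 0; i < N; i++) a[i] = 1.0;

        pthread_t tid[NTHREADS];
        chunk_t chunks[NTHREADS];
        for (int t = 0; t < NTHREADS; t++) {
            chunks[t].lo = t * (N / NTHREADS);
            chunks[t].hi = (t + 1) * (N / NTHREADS);
            pthread_create(&tid[t], NULL, partial_sum, &chunks[t]);
        }
        double total = 0.0;
        for (int t = 0; t < NTHREADS; t++) {
            pthread_join(tid[t], NULL);
            total += chunks[t].sum;  /* combine partial results serially */
        }
        printf("total = %f\n", total);  /* 1000000.000000 */
        return 0;
    }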
Hyper-Threading
Intel's Hyper-Threading technology: 1 physical processor handles work as 2 logical processors.
Developed to address the fact that processor resources are generally under-utilized by applications.
2 logical processors on 1 physical processor.
Intel Hyper-Threading technology
Hyper-Threading in the Intel Core i7
Without Hyper-Threading: 1 thread per core.
With Hyper-Threading: 2 threads per core.
SMT vs. CMP
SMT increases the utilization of key resources within a core.
CMP allows multiple cores to share resources such as the L2 cache and off-chip bandwidth.
Intel Dunnington: 6-core die (source: H. Corporaal, ACA)
AMD Hydra: 8 cores, 45 nm; L2: 1 MByte per core; L3: 6 MByte shared (source: H. Corporaal, ACA)
Intel 80-processor die (source: H. Corporaal, ACA)
Source: Marc Brown, Vice President, VxWorks Product Strategy and Marketing, Wind River. http://techonline.com/article/pdf/showPDF.jhtml?id=2229002841