Introduction to Multi-Core Architectures


Introduction to Multi-Core Architectures. Smail Niar, Master 1 ISECOM, Université de Valenciennes.

A rapidly growing market. “We are dedicating all of our future product development to multicore designs. … This is a sea change in computing” (Paul Otellini, President, Intel). “By 2015, Intel may deliver processors with tens or even hundreds of individual cores” (Intel Developer). “The expectation is that the number of cores per chip will roughly double every two years while processor clock speeds will remain relatively flat.” A law that runs “parallel” to Moore's law!

Power consumption. Scaling clock speed (business as usual) will not work. [Chart: power density (W/cm²) versus year, 1970–2010, for processors from the 4004, 8008, 8080, 8085, 8086, 286, 386, 486 through the Pentium and P6; extrapolated densities approach those of a hot plate, a nuclear reactor, a rocket nozzle and the sun's surface. Source: Patrick Gelsinger, Intel.]

Parallelism Saves Power. Exploit explicit parallelism to reduce power, with Power = Capacitance × Voltage² × Frequency:
One core: Power = C · V² · F, Performance = Cores · F.
Two cores at full frequency: Power = 2C · V² · F, Performance = 2 · Cores · F.
Two cores at half frequency and half voltage: Power = 2C · (V²/4) · (F/2) = (C · V² · F)/4, Performance = 2 · Cores · (F/2) = Cores · F.
Using additional cores increases density (more transistors = more capacitance). We can increase cores (2×) and performance (2×), or increase cores (2×) but decrease frequency (1/2): the same performance at ¼ the power. Additional benefit: small, simple cores give more predictable performance.
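A compact way to see the ¼-power result, restating the slide's formulas under the assumption (implicit in the slide) that supply voltage can be scaled down linearly with frequency:

```latex
% Dynamic power model from the slide: P = C V^2 F
% Two cores at half frequency, with voltage scaled down proportionally (V -> V/2):
\begin{align*}
P_{2\,\text{cores}} &= 2C \cdot \left(\tfrac{V}{2}\right)^{2} \cdot \tfrac{F}{2}
                     = \tfrac{1}{4}\, C V^{2} F = \tfrac{1}{4}\, P_{1\,\text{core}} \\
\text{Perf}_{2\,\text{cores}} &= 2 \cdot \text{Cores} \cdot \tfrac{F}{2}
                     = \text{Cores} \cdot F = \text{Perf}_{1\,\text{core}}
\end{align*}
```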

Multi-cores: a solution for reducing power consumption. Capacitance = number of cores. Power = capacitance × voltage² × frequency; Performance = 0.9 × #cores × frequency. [Chart comparing single-core and multi-core operating points; the scaling factors 1, 1/2, 1.62 and 2.88 appear in the original figure.]

Advantages of multiprocessor and multi-core architectures

What is a multi-core architecture? A collection of processing elements able to communicate and cooperate in order to solve large problems quickly. Questions: How many processors (cores)? How and when do they cooperate? How and when do they communicate? With what efficiency? Which programming language should be used? …

2) Classification of multiprocessor architectures. Parallel architectures:
SISD: 1 instruction stream and 1 data stream (uniprocessor).
SIMD: 1 instruction stream over multiple data items: (1) vector processors (e.g. Cray), (2) array processors.
MISD: multiple instruction streams on a single data item (??).
MIMD: multiple instruction streams and multiple data streams: shared memory (uniform access UMA: symmetric multiprocessor SMP; non-uniform access NUMA) or distributed memory (clusters).

Computer architecture classifications. Processor organizations:
Single Instruction, Single Data stream (SISD): uniprocessor.
Single Instruction, Multiple Data stream (SIMD): vector processor, array processor.
Multiple Instruction, Single Data stream (MISD).
Multiple Instruction, Multiple Data stream (MIMD): shared memory (tightly coupled): UMA (SMP), NUMA, ccNUMA; multicomputer (loosely coupled).

Classification. SISD: 1 instruction stream and 1 data stream; uniprocessor; ILP (Instruction-Level Parallelism): pipeline, superscalar, VLIW. SIMD: 1 instruction stream over multiple data items; vector processors (e.g. Cray), array processors (grid of processors). MISD: multiple instruction streams on 1 data item. MIMD: multiple instruction streams and multiple data streams; shared memory (symmetric multiprocessor SMP, NUMA) or distributed memory (clusters).

Intra-processor parallelism (SISD): instruction-level parallelism (ILP). [Diagram: non-pipelined processor (1 instruction every several cycles), pipelined processor (1 instruction per cycle), superscalar and VLIW processors (several instructions per cycle), and scaling from 1 thread to several threads.]

Single Instruction, Multiple Data Stream (SIMD). A single instruction at a time, executed by several processing elements (PEs). The program is executed instruction by instruction: an instruction is executed only once all the previous instructions have completed. Each PE has a local memory (local data), and each PE executes the instruction on its own data. Two types: vector processors and array processors.

SIMD array processors

Single Instruction Multiple Data (SIMD). A single instruction acts on multiple pieces of data at once. Common application: graphics. Perform short arithmetic operations (also called packed arithmetic), for example adding eight 8-bit elements.
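As a hedged illustration of packed arithmetic (not from the slides): the sketch below uses x86 SSE2 intrinsics, so one instruction adds sixteen 8-bit elements; the intrinsic names and the 16-lane width are tied to that assumed instruction set, but the idea matches the slide's eight-element example.

```c
/* Minimal SIMD sketch (assumes an x86 CPU with SSE2 and a compiler providing <emmintrin.h>). */
#include <emmintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint8_t a[16], b[16], c[16];
    for (int i = 0; i < 16; i++) { a[i] = (uint8_t)i; b[i] = 10; }

    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* load 16 bytes */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vc = _mm_add_epi8(va, vb);                 /* 16 byte-wise additions in one instruction */
    _mm_storeu_si128((__m128i *)c, vc);

    for (int i = 0; i < 16; i++) printf("%d ", c[i]);  /* prints 10 11 12 ... 25 */
    printf("\n");
    return 0;
}
```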

Multiprocessors. Multiple processors (cores) with a method of communication between them. Types: homogeneous (multiple identical cores with shared main memory); heterogeneous (separate cores for different tasks, for example the DSP and CPU in a cell phone); clusters (each core has its own memory system).

Multiple Instruction, Multiple Data Stream (MIMD). A set of processors (GPPs: general-purpose processors), each able to execute any instruction. They simultaneously execute different instruction sequences on different sets of data. Examples: SMPs, clusters and NUMA systems.

SMP: Symmetric Multi-Processor. Tightly coupled MIMD. Processors share memory (shared-memory processors) and communicate via that shared memory (memory acts as a mailbox between processors). Symmetric Multiprocessor (SMP): processors share a single memory, accessed over a shared bus; the memory access time to a given area of memory is approximately the same for each processor (UMA: Uniform Memory Access).
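To make the "memory as a mailbox" idea concrete, here is a hedged sketch (not from the slides) in which two POSIX threads on a shared-memory machine communicate through an ordinary variable protected by a mutex; compile with -pthread.

```c
/* Shared-memory communication sketch (assumes POSIX threads). */
#include <pthread.h>
#include <stdio.h>

static int mailbox = 0;                       /* shared memory acts as the "mailbox" */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *producer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    mailbox = 42;                             /* one processor writes... */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    pthread_join(t, NULL);                    /* wait for the writer */
    pthread_mutex_lock(&lock);
    printf("read from shared memory: %d\n", mailbox);   /* ...another reads it back */
    pthread_mutex_unlock(&lock);
    return 0;
}
```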

Symmetric Multiprocessor Organization

Time-Shared Bus: Disadvantages. Performance is limited by the bus cycle time. Each processor should therefore have a local cache to reduce the number of bus accesses, but this leads to cache-coherence problems, which are solved in hardware (see later).

Designs with private L2 caches. [Diagram: two cores, each with private L1 and L2 caches. In one variant the private L2s connect directly to memory (examples: AMD Opteron, AMD Athlon, Intel Pentium D); in a design with L3 caches, a shared L3 sits between the private L2s and memory (example: Intel Itanium 2).]

Multicore organization alternatives. No shared cache: ARM11 MPCore, AMD Opteron. Shared cache: Intel Core Duo, Intel Core i7.

Intel Core i7 Block Diagram Video: QuickPath Interconnect

Processor package: 4 cores. Per core: registers, instruction fetch, MMU (address translation); L1 d-cache 32 KB, 8-way; L1 i-cache 32 KB, 8-way; L1 d-TLB 64 entries, 4-way; L1 i-TLB 128 entries, 4-way; L2 unified cache 256 KB, 8-way; L2 unified TLB 512 entries, 4-way. Shared by all cores: L3 unified cache 8 MB, 16-way; DDR3 memory controller, 3 × 64 bit @ 10.66 GB/s (32 GB/s total) to main memory. QuickPath interconnect to the other cores and to the I/O bridge: 4 links @ 25.6 GB/s (102.4 GB/s total).

The cache-coherence problem! Example: processors 0 and 1 both hold a copy of X = 4 in their caches, so the copies of X are coherent. Processor 0 then writes X = 5 into its cache: processor 0 now sees X = 5 while processor 1 still sees X = 4, so the copies of X are incoherent. When a processor modifies a cached data element that is also located in another processor's cache, a protocol must be used to maintain data coherency.
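Below is a toy software model (not from the slides; all names are invented) that mimics the situation above: each "processor" keeps a private copy of X, standing in for its cache, and a write by processor 0 leaves processor 1's copy stale until it is explicitly refreshed, which is what a coherence protocol automates in hardware.

```c
/* Toy model of the coherence problem above (illustration only). */
#include <stdio.h>

struct cache { int x; };                 /* each processor's private cached copy of X */

int main(void) {
    int memory_x = 4;                    /* shared memory holds X = 4 */
    struct cache p0 = { memory_x };      /* both processors load X: copies are coherent */
    struct cache p1 = { memory_x };

    p0.x = 5;                            /* processor 0 writes X = 5 in its own cache */
    printf("P0 sees X = %d, P1 sees X = %d (incoherent)\n", p0.x, p1.x);

    p1.x = p0.x;                         /* a coherence protocol would update or invalidate P1's copy */
    printf("after update: P0 sees X = %d, P1 sees X = %d (coherent)\n", p0.x, p1.x);
    return 0;
}
```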

What is cache coherence? [Diagram: n processors, each with a private cache, sharing main memory. All caches initially hold X = 4. When one processor writes X = 2, either the other copies are invalidated (invalidation) or the new value X = 2 is propagated into the other caches (duplication).]

Architecture SMP (CC-NUMA)

Cache Coherence. Problem: multiple copies of the same data in different caches can result in an inconsistent view of memory. A write-back (WB) policy can lead to inconsistency; write-through (WT) can also cause problems unless caches monitor memory traffic. Reminder: WT: when a processor writes into its cache, it immediately writes to memory as well. WB: when a processor writes into its cache, it does not write to memory; memory is updated only when the block is evicted from the cache, i.e. when its place is taken by another block.

Cache Coherency

Write with invalidation (Write Invalidate). On a write, every copy of the modified block held in the other processors' caches is invalidated. The writing processor gains exclusive access to the data; the other processors no longer have a valid copy. When another processor needs the data again, the processor holding it in the exclusive state supplies it. Each cache block is given a state: Invalid, Shared (read only) or Exclusive (read and write): the ESI protocol (see the following diagram). ESI was later improved into MESI (adding a Modified state), see below. Used in the Pentium II and PowerPC.

The ESI protocol. [State-transition diagram between the Exclusive (E), Shared (S) and Invalid (I) states.]

MESI closely resembles ESI: the E state is split in two. Modified: the block resides exclusively in this cache, and its contents differ from those of shared memory. Exclusive: the same, except that shared memory also holds the latest value of the block.
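A hedged sketch (not from the course) of how a MESI controller's transitions can be written down; the event names and the transition table are simplifications, and a real controller also issues bus messages, write-backs and data transfers.

```c
/* Simplified MESI transition sketch (illustration only). */
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;
typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } event_t;

static mesi_t next_state(mesi_t s, event_t e) {
    switch (e) {
    case LOCAL_READ:   return (s == INVALID) ? SHARED : s;       /* miss fills as Shared (Exclusive if no other sharer) */
    case LOCAL_WRITE:  return MODIFIED;                          /* write gains exclusive ownership, others invalidated */
    case REMOTE_READ:  return (s == INVALID) ? INVALID : SHARED; /* another core reads: downgrade to Shared */
    case REMOTE_WRITE: return INVALID;                           /* another core writes: invalidate our copy */
    }
    return s;
}

int main(void) {
    mesi_t s = INVALID;
    s = next_state(s, LOCAL_READ);    /* I -> S */
    s = next_state(s, LOCAL_WRITE);   /* S -> M */
    s = next_state(s, REMOTE_READ);   /* M -> S (after supplying the data) */
    s = next_state(s, REMOTE_WRITE);  /* S -> I */
    printf("final state: %d (INVALID)\n", s);
    return 0;
}
```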

Write with update (Write Update). When a processor modifies a data item known by the others (shared), it asks them to update it: it broadcasts the address of the block, together with its new value, to everyone. In general, the copy in memory is updated at the same time. Two states are associated with each block: valid (shared) or invalid.

The Zynq Processing System

Simplified Block Diagram of the Application Processing Unit (APU) Source: The Zynq Book

Snoop Control Unit (SCU). Undertakes the interfacing between the cores and the L1 and L2 caches, ensuring cache coherency. Initiates and controls access to the L2 cache, arbitrating between the two cores where necessary. Manages transactions between the PS and the PL via the Accelerator Coherency Port (ACP). The SCU communicates with each of the Cortex-A9 processors.

Snoop Control Unit (SCU). The SCU supports MESI snooping. It implements duplicated 4-way associative tag RAMs listing the coherent cache lines held in the L1 data caches, so it can check whether data is in an L1 cache quickly and without interrupting the processors, filtering accesses so that only the processor sharing the data is involved. It can copy clean data from one cache to another, avoiding main-memory accesses, and move dirty data between cores without a shared state and without the latency of a write-back.

SIMD (Single Instruction Multiple Data) processing in the NEON Media Processing Engine (MPE). Source: The Zynq Book.

Programmable Logic (PL) CLBs and IOBs Source: The Zynq Book

Programmable Logic (PL) BRAMs and DSP units Source: The Zynq Book

Animation https://www.cs.tcd.ie/Jeremy.Jones/vivio/caches/ALL%20protocols.htm

MIMD with distributed shared memory (DSM). Parallel architectures with distributed shared memory: the memory is shared but physically scattered across the network (NUMA: Non-Uniform Memory Access). Example: the Stanford DASH, with 16 nodes of 4 MIPS processors each, connected by a 2-dimensional grid. Single address space, non-uniform memory access; each processor has a cache, and cache directories are used. [Diagram: n processor+cache nodes, each with local memory, connected by an interconnection network (distributed-memory NUMA).]

Interconnection between processors. Two types of interconnection networks. Static networks: grid (mesh), hypercube. Dynamic networks: shared bus, shared cache, crossbar, MIN (Multistage Interconnection Network).

Network architectures

Interconnection networks. Network topologies: arrangements of processors, switches and links. Examples: bus, ring, N-cube (N = 3), 2D mesh, fully connected. (Morgan Kaufmann, Chapter 7: Multicores, Multiprocessors, and Clusters.)

NoC Router

Basic switching techniques. Circuit switching: a real or virtual circuit establishes a direct connection between source and destination. Packet switching: each packet of a message is routed independently; the destination address has to be provided with each packet. Store-and-forward packet switching: the entire packet is stored and then forwarded at each switch. Cut-through packet switching: the flits of a packet are pipelined through the network; the packet is not completely buffered in each switch. Virtual cut-through packet switching: the entire packet is stored in a switch only when the header flit is blocked due to congestion. Wormhole switching: cut-through switching in which all flits are blocked on the spot when the header flit is blocked.
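For a first-order comparison of the two main families (a standard textbook latency model, not taken from the slides), let L be the packet length, L_h the header flit size, b the link bandwidth and H the number of hops:

```latex
% First-order latency estimates, ignoring routing and contention delays:
\begin{align*}
T_{\text{store-and-forward}} &\approx H \cdot \frac{L}{b} \\
T_{\text{cut-through / wormhole}} &\approx H \cdot \frac{L_h}{b} + \frac{L}{b}
\end{align*}
% Cut-through latency grows with H only through the small header term,
% which is why it pipelines long packets much better than store-and-forward.
```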

Interconnection networks inside multi-cores. [Diagram: two options for a single shared cache. (1) A multiported shared L1 cache accessed directly by processors 1..n: conflicts, and a costly solution in time, energy and area. (2) A multibanked shared L1 cache reached through a crossbar connecting processors 1..n to the banks, with the shared cache in front of main memory.]

Multi-threaded (SMT) architectures. Multi-threaded processors reduce idle time by executing another thread when the current thread is blocked. Threads (lightweight processes): a sequence of instructions. Threads appear to run in parallel. A thread is not a process: each process has its own virtual memory, whereas the threads belonging to the same parent process share its virtual memory (avoids …).

Multi-threaded architectures (continued). The program is divided into threads to be executed simultaneously: thread-level parallelism (TLP). On data dependencies the CPU does not stall; a new thread is selected instead (thread switch).

Multi-threaded architectures (continued). Process: a program running on a computer. Multiple processes can run at once: e.g. surfing the Web, playing music, writing a paper. Thread: part of a program. Each process can have multiple threads: e.g. a word processor may have threads for typing, spell checking and printing.
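A small POSIX sketch (my own illustration, not from the course) contrasting the two: a child process created with fork gets its own copy of the variable, while a thread in the same process writes to the shared one. Compile with -pthread on Linux.

```c
/* Processes vs. threads sketch (assumes a POSIX system; compile with -pthread). */
#include <pthread.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static int value = 0;                        /* lives in the process's virtual memory */

static void *thread_body(void *arg) {
    (void)arg;
    value = 2;                               /* threads share the address space: visible to main */
    return NULL;
}

int main(void) {
    pid_t pid = fork();                      /* child gets a copy of the address space */
    if (pid == 0) { value = 1; _exit(0); }   /* modifies only the child's copy */
    waitpid(pid, NULL, 0);
    printf("after child process: value = %d\n", value);   /* still 0 */

    pthread_t t;
    pthread_create(&t, NULL, thread_body, NULL);
    pthread_join(t, NULL);
    printf("after thread:        value = %d\n", value);   /* now 2 */
    return 0;
}
```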

Multi-threaded architectures (continued). One thread runs at a time. When one thread stalls (for example, waiting for memory): the architectural state of that thread is stored, the architectural state of a waiting thread is loaded into the processor, and that thread runs. This is called context switching. To the user it appears as if all threads were running simultaneously.

Multi-threaded architectures (continued). With multiple copies of the architectural state, multiple threads can be active at once: when one thread stalls, another runs immediately, and if one thread cannot keep all execution units busy, another thread can use them. This does not increase the instruction-level parallelism (ILP) of a single thread, but it increases throughput. Intel calls this "hyperthreading".

The multithreading concept (from Theo Ungerer). [Diagram: issue slots over time (processor cycles) for a superscalar or VLIW processor versus multithreading on a superscalar (SMT).]

Comparison between multithreaded approaches. [Diagram: multithreading on multiprocessors (MPA) versus a single-chip multiprocessor (CMP), each shown with processing elements PE0–PE3.]

Threading on a 4-way superscalar processor: example. [Diagram: issue slots over time for threads A–D under coarse-grained MT, fine-grained MT and SMT.] In this example, coarse-grained MT takes 32 cycles to complete (assuming an optimistic one-cycle thread start-up time), fine-grained MT takes 25 cycles, and SMT takes 14 cycles.

Multithreading example. (Morgan Kaufmann, Chapter 7: Multicores, Multiprocessors, and Clusters.)

Where are parallel threads found for multithreaded architectures? In loops: several iterations are evaluated simultaneously. In procedures (or functions): the same procedure, or different procedures, are evaluated in parallel. In speculative paths: executing the code after a loop, after a call, after an IF, … Multithreaded processors exploit all these sources of potential parallelism.
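As a hedged illustration of the loop case (OpenMP is my choice of tool here, not one the course prescribes), the sketch below lets several iterations of a loop run simultaneously on different threads; compile with gcc -fopenmp.

```c
/* Loop-level thread parallelism sketch (assumes an OpenMP-capable compiler). */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* Several iterations are evaluated simultaneously, one chunk per thread. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i] * b[i];

    printf("dot product = %.1f, threads available = %d\n", sum, omp_get_max_threads());
    return 0;
}
```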

Hyper-Threading. Intel's Hyper-Threading technology: one physical processor manages data as two logical processors. Developed to address the fact that processor resources are generally under-used by applications: two logical processors on one physical processor.

Intel Hyper-Threading technology

Hyperthreading in the Intel Core i7. Without hyperthreading: 1 thread per core. With hyperthreading: 2 threads per core.

SMT vs. CMP. SMT increases the utilization of key resources within a core. CMP allows multiple cores to share resources such as the L2 cache and off-chip bandwidth.

Intel Dunnington, 6 cores. Source: ACA, H. Corporaal.

AMD Hydra, 8 cores, 45 nm. L2: 1 MByte per core; L3: 6 MByte, shared. Source: ACA, H. Corporaal.

Intel 80-processor die. Source: ACA, H. Corporaal.

http://techonline.com/article/pdf/showPDF.jhtml?id=2229002841 Marc Brown, Vice President, VxWorks Product Strategy and Marketing, Wind River.