Beyond Lambda (Au-delà du lambda), Benjamin Guinebertière

12/11/2017
Benjamin Guinebertière, Technical Evangelist, Microsoft France, @benjguin
Vincent Heuschling, Founder and Architect, Affini-Tech, @vhe74

Introduction

Agenda
- Lambda architecture
- Questioning the lambda architecture
- Streaming challenges
- Introduction to boontadata
- boontadata demo
- Contribute!
- Contribution demo (dev test labs)

The lambda architecture and why it is being questioned

Big Data / ML: typical architecture

[Diagram labels: IOT, Social, Web, logs, Enterprise, …; Data Lake; Big Data Engines; Machine Learning; Relational Databases; NoSQL Databases; API; Web; Mobile; BI / DataViz. Arrows show the direction in which data flows (as opposed to the direction of RPC calls).]

The data lake stores all the data that can be used, in nearly raw format (it may be compressed, for instance). Keeping it nearly raw means everything can be stored; there is no need to throw any data away.

Data can easily be used in low-latency scenarios (interactive mobile / web use cases) when it is stored in databases. There are two main database families.

Relational databases have the following advantages:
- They allow any kind of query.
- Data is consistent because it is stored once.
- They all use SQL as a language (different dialects may exist).
They have the following drawbacks:
- Scaling is through scale up (a more powerful machine).
- The data schema must be designed with specific rules (cf. https://en.wikipedia.org/wiki/Database_normalization).

NoSQL databases:
- Data can be stored in a format that corresponds to the way objects exist in application code.
- Scaling is through scale out (more machines).
- Consistency between different copies of the same piece of data needs to be managed by the application (e.g. a customer name can be in orders, sales, logistics, …).
- Each NoSQL engine has its own API. NB: most NoSQL engines also tend to offer SQL as a query language, with restrictions such as no joins between tables (so that scale out can happen).

The data lake is implemented on distributed storage: HDFS by default in Hadoop, Kudu by Cloudera, MapR-FS by MapR, Azure Blob Storage, Azure Data Lake Store. To query and transform the data stored in the data lake you need a distributed engine (Big Data Engine). You focus on the code (SQL in Hive, Pig Latin in Pig, Java, Scala, Python, …) and the framework distributes it to worker nodes that may or may not be collocated with the data. Typical big data engines include Hadoop, Spark, Azure Data Lake Analytics.

Keeping raw data in the data lake allows the following scenario: you are missing a field in your database. You can add this field to the database schema AND fill its value from past data that you kept in the data lake, even before you knew you would need that field.

Databases may also be a major source for DataViz (data visualization) and Business Intelligence (BI). BI may use OLAP / multidimensional engines (like SQL Server Analysis Services) or in-memory column storage (like PowerPivot, SQL Data Warehouse / SQL DB columnstore index, Spark SQL cache based on Parquet, Hive + Tez + ORC, …).

The most interesting use cases are often implemented with machine learning. This is a way to program through examples. For instance, you show the machine pictures with cats and pictures without cats. After a while the machine knows whether a picture contains a cat or not. Another, simpler example is to predict the value of column N based on the values of the N-1 other columns. The main phases of machine learning are:
- Learn from the labelled dataset (you know the Nth column value, or whether a picture has a cat).
- Predict on past data for which you have labels, and evaluate the performance of the model. Ex: the model is right 85% of the time about whether a picture has a cat.
- Predict in production for unlabelled data.

You may need all those phases to be backed by an API so that you can update the model based on new learning data; you also need to predict through an API. The prediction API may be a hybrid between human prediction and machine learning prediction. One can use a workflow that tries to predict through machine learning; if that is not possible, the machine escalates to a human and adds the labelled data to its learning dataset. After some time, escalation may become unnecessary. This is a way to have the machine learn from humans.

A global API adds business rules on top of database data and the machine learning API. This API is called by web apps, mobile apps or other channels (phone servers, …). The API may be internal, or public so that external developers can build their own apps on it.

The data lake gets its data from multiple sources: IOT sensors, social networks, web, logs, enterprise data, … One additional source is the application itself, which generates additional data. To have all data available in the data lake, it may be worth copying that data into the data lake too.
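The hybrid prediction workflow described in the notes (predict through machine learning, escalate to a human when the model cannot answer, then add the human's label to the learning dataset) can be sketched as follows. This is only an illustration under stated assumptions: the `HybridPredictor` class, the 0.8 confidence threshold and the toy vote-counting "model" are hypothetical stand-ins, not part of any real library.

```python
from collections import Counter, defaultdict


class HybridPredictor:
    """Toy hybrid predictor: answers itself when confident, else asks a human."""

    def __init__(self, ask_human, threshold=0.8):
        self.ask_human = ask_human          # callable: features -> label (the human)
        self.threshold = threshold          # hypothetical confidence threshold
        self.votes = defaultdict(Counter)   # feature -> counts of labels seen with it
        self.escalations = 0

    def learn(self, features, label):
        """Add one labelled example to the learning dataset."""
        for feature in features:
            self.votes[feature][label] += 1

    def _model_predict(self, features):
        """Return (label, confidence) from the label votes of the known features."""
        total = Counter()
        for feature in features:
            total.update(self.votes[feature])
        if not total:
            return None, 0.0
        label, count = total.most_common(1)[0]
        return label, count / sum(total.values())

    def predict(self, features):
        label, confidence = self._model_predict(features)
        if confidence >= self.threshold:
            return label
        # Not confident enough: escalate to a human and learn from the answer.
        self.escalations += 1
        label = self.ask_human(features)
        self.learn(features, label)
        return label


def human(features):
    # Stand-in for a person labelling the picture.
    return "cat" if "whiskers" in features else "no cat"


predictor = HybridPredictor(ask_human=human)
first = predictor.predict(["whiskers", "tail"])    # unknown features: escalated
second = predictor.predict(["whiskers", "tail"])   # answered by the model
```

After the first escalation the same input is answered without the human, which is the "escalation may become unnecessary" effect mentioned in the notes.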

Ingestion
Data pushed to a broker: a streaming-style approach, the "hot path".
Data pushed to storage: an approach compatible with most existing systems, often batch-style, the "cold path".

Processing
Several processing instances can work on the same broker. Use cases: indexing for dashboards, transformation, curation, queueing before storage, …
Several processing instances can work on the unstructured storage. Use cases: ad hoc queries on unprepared data, transformation, …

Preparation
The broker can be seen as a "log" that several solutions can process. Use cases: ETL, continuous processing to reduce observation latency.
One type of processing can be to push the data to storage.
Putting the data in unstructured storage keeps all of it (no schema), so it can feed SQL and NoSQL databases whose schema can evolve by adding fields that are also populated from past data.
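The schema-evolution scenario above (add a field to the database later and fill it from past raw data) can be illustrated with a small sketch. The event layout, the field names and the dictionary standing in for a database table are all hypothetical.

```python
import json

# Raw events as they would sit in unstructured storage (one JSON document per line).
RAW_EVENTS = [
    '{"order_id": 1, "amount": 20.0, "country": "FR", "device": "mobile"}',
    '{"order_id": 2, "amount": 35.5, "country": "BE", "device": "web"}',
]


def load(raw_lines, fields):
    """Project raw events onto the database schema (only the listed fields)."""
    table = {}
    for line in raw_lines:
        event = json.loads(line)
        table[event["order_id"]] = {field: event.get(field) for field in fields}
    return table


def backfill(table, raw_lines, new_field):
    """Add new_field to every existing row, reading its value from the raw data."""
    for line in raw_lines:
        event = json.loads(line)
        row = table.get(event["order_id"])
        if row is not None:
            row[new_field] = event.get(new_field)
    return table


orders = load(RAW_EVENTS, fields=["amount", "country"])     # initial schema
orders = backfill(orders, RAW_EVENTS, new_field="device")   # the schema evolves later
```

The backfill only works because the raw events were kept whole; had only `amount` and `country` been stored, the past values of `device` would be lost.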

Serving
Serving through SQL or NoSQL databases is optimized here for observation latency, but may have other constraints.
Serving through SQL or NoSQL databases, or directly on the raw data, allows a larger number of scenarios, at the cost of slightly higher latency.

Lambda architecture
[Diagram labels. Events. Hot path: near-real-time DB (SQL/NoSQL). Cold path: storage, batch, T, storage / DB (SQL/NoSQL). Both paths: query & merge.]
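The "query & merge" step of the lambda architecture can be sketched as below. This is a minimal illustration that assumes the aggregate is additive (per-key counts) and that the hot-path view only contains events not yet covered by the last batch run; the view contents are made up.

```python
from collections import Counter


def merge_views(batch_view, realtime_view):
    """Merge per-key counts from the cold path and the hot path."""
    merged = Counter(batch_view)
    merged.update(realtime_view)   # Counter.update adds counts key by key
    return dict(merged)


# Cold path: counts computed by the last batch run over stored events.
batch_view = {"A": 120, "B": 45}
# Hot path: counts for events that arrived after that batch run.
realtime_view = {"A": 3, "C": 1}

result = merge_views(batch_view, realtime_view)
```

Having to maintain two code paths and a merge rule that stays correct for every aggregate is part of what the critiques of the lambda architecture point at.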

Big Data vs Business Intelligence
Schema on read => I can write anything
Schema on write => I can read quickly
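The contrast between the two approaches can be made concrete with a short sketch. The in-memory list and dictionary stand in for a data lake and a database table; the schema and record layout are hypothetical.

```python
import json

# Schema on read: anything can be written; structure is applied when reading.
lake = []


def lake_write(raw_line):
    lake.append(raw_line)   # no validation at write time


def lake_read(field):
    """Parse every raw line at query time and keep the values of one field."""
    values = []
    for line in lake:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue   # unusable data is only discovered when reading
        if isinstance(record, dict) and field in record:
            values.append(record[field])
    return values


# Schema on write: rows are validated once, so reads are simple lookups.
SCHEMA = {"id": int, "temperature": float}
table = {}


def table_write(row):
    valid = set(row) == set(SCHEMA) and all(
        isinstance(row[name], kind) for name, kind in SCHEMA.items()
    )
    if not valid:
        raise ValueError(f"row does not match schema: {row}")
    table[row["id"]] = row


lake_write('{"id": 1, "temperature": 21.5}')
lake_write("not even json")                    # accepted: the lake takes anything
table_write({"id": 1, "temperature": 21.5})    # accepted: matches the schema
```

The lake pays the parsing and filtering cost on every read, while the table pays it once at write time.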

Fundamental components of Big Data
- broker
- data lake
- processing engines
- databases
- dataViz

broker: Azure Event Hub
- Event producers: > 1M producers, > 1 GB/sec aggregate throughput
- Partitions: events are routed by a hash of the PartitionKey; up to 32 partitions via portal, more on request
- Consumer group(s): direct receivers, or Event Processor Host (IEventProcessor)
- AMQP 1.0, credit-based flow control, client-side cursors, offset by Id or Timestamp
- Throughput Units: 1 ≤ TUs ≤ partition count; 1 TU = 1 MB/s writes, 2 MB/s reads
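Two points from this slide lend themselves to a small worked example: routing an event to a partition from its partition key, and sizing throughput units from the per-TU rates. The sketch below is illustrative only. The SHA-256-modulo hash is an assumption chosen for the example (the slide does not say which hash Event Hubs uses), and both function names are hypothetical.

```python
import hashlib
import math

WRITE_MB_PER_TU = 1   # from the slide: 1 MB/s writes per throughput unit
READ_MB_PER_TU = 2    # from the slide: 2 MB/s reads per throughput unit


def partition_for(partition_key, partition_count):
    """Map a partition key to a partition with a stable hash (illustrative only)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


def throughput_units_needed(write_mb_s, read_mb_s, partition_count):
    """Smallest TU count covering both rates, within 1 <= TUs <= partition count."""
    needed = max(
        1,
        math.ceil(write_mb_s / WRITE_MB_PER_TU),
        math.ceil(read_mb_s / READ_MB_PER_TU),
    )
    if needed > partition_count:
        raise ValueError("not enough partitions for this throughput")
    return needed


partition = partition_for("device-42", 32)     # same key, same partition every time
units = throughput_units_needed(5, 8, 32)      # 5 MB/s writes dominate: 5 TUs
```

Because the mapping depends only on the key, all events from one producer key land in the same partition, which preserves their relative order within that partition.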

data lake
- HDFS
- MapR FS
- Amazon S3
- Google File System
- Azure Blob Storage
- Azure Data Lake Store
Data co-locality, cluster to cluster

processing engines

Databases
SQL: relational engine allowing any type of query
Between SQL and NoSQL: NoSQL query languages (CQL, …); Hive, Presto, Drill, SparkSQL, …
NoSQL: scale out; columns, documents, key/values, graphs; distributed; fast writes; fast reads

dataviz

aka.ms/hellodata

Questioning lambda
http://bigdatahebdo.azurewebsites.net/episodes/2016/05/16/EP23_Kafka_a_Devoxxfr/
https://engineering.linkedin.com/blog/2016/06/stream-processing-hard-problems-part-1-killing-lambda
http://imply.io/post/2016/07/05/exactly-once-streaming-ingestion.html
https://www.mapr.com/ebooks/intro-to-apache-flink/chapter-6-batch-is-a-special-case-of-streaming.html
…

Non-lambda architecture
[Diagram labels. Events, events log. Hot path: near-real-time DB (SQL/NoSQL), query. Cold path: batch, T.]

streaming and its challenges

Example

Events:
Time      Object A  Object B
10:00:00  100       2000
10:00:03            1800
10:00:04  85
10:00:07  40        2500
10:00:08  100
10:00:09            3000

Aggregations:
Time window  Object A (sum)  Object B (mean)
10:00:05     185             1900
10:00:10     140             2750
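The aggregations in this example can be reproduced with a 5-second tumbling window, each window being labelled by its end time. The sketch below is one possible way to compute them in plain Python, not the code used in the talk. One value is an assumption: 100 for object A at 10:00:08, inferred from the window sum of 140 and from the event A,m7,100 on the following slides.

```python
from datetime import datetime, timedelta
from statistics import mean

WINDOW = timedelta(seconds=5)

# (event time, object, value), taken from the example.
EVENTS = [
    ("10:00:00", "A", 100), ("10:00:00", "B", 2000),
    ("10:00:03", "B", 1800), ("10:00:04", "A", 85),
    ("10:00:07", "A", 40), ("10:00:07", "B", 2500),
    ("10:00:08", "A", 100),   # inferred value, see the note above
    ("10:00:09", "B", 3000),
]


def window_end(ts):
    """Return the end of the 5-second tumbling window containing ts, as HH:MM:SS."""
    t = datetime.strptime(ts, "%H:%M:%S")
    midnight = t.replace(hour=0, minute=0, second=0)
    index = (t - midnight) // WINDOW   # number of full windows before ts
    return (midnight + (index + 1) * WINDOW).strftime("%H:%M:%S")


def aggregate(events):
    """Sum object A and average object B per window."""
    values = {}
    for ts, obj, value in events:
        values.setdefault(window_end(ts), {"A": [], "B": []})[obj].append(value)
    return {
        window: {"A_sum": sum(v["A"]), "B_mean": mean(v["B"])}
        for window, v in sorted(values.items())
    }


result = aggregate(EVENTS)
```

With all events present exactly once and assigned by event time, `result` matches the aggregation table; the next slides show what breaks that assumption in a distributed system.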

a distributed system
[Diagram labels: object A, object B, broker, processing engine]

ideal case
[Timeline from 10:00:00 to 10:00:12, one tick per second. Events shown as object, message id, value:]
B,m2,2000  B,m3,1800  B,m6,2500  B,m8,3000
A,m1,100  A,m7,100

duplicated events
[Timeline from 10:00:00 to 10:00:12, one tick per second. Events shown as object, message id, value:]
B,m2,2000  B,m2,2000  B,m3,1800  B,m6,2500
A,m5,40
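One common way to cope with duplicates such as the repeated B,m2,2000 is to deduplicate on the message id before aggregating. The sketch below assumes that message ids (m2, m3, …) are unique per event, and it keeps every id seen in memory, which a real engine would have to bound.

```python
def deduplicate(events):
    """Yield each event once, identified by its message id."""
    seen = set()
    for obj, message_id, value in events:
        if message_id in seen:
            continue   # a redelivery of an event that was already processed
        seen.add(message_id)
        yield obj, message_id, value


# Events as received on this slide: m2 is delivered twice.
received = [
    ("B", "m2", 2000), ("B", "m2", 2000), ("B", "m3", 1800),
    ("B", "m6", 2500), ("A", "m5", 40),
]
unique = list(deduplicate(received))
```

Without this step the mean of object B in the first window would be computed over three values instead of two.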

out-of-order events
[Timeline from 10:00:00 to 10:00:12, one tick per second. Events shown as object, message id, value:]
B,m2,2000  B,m3,1800  B,m6,2500  B,m8,3000
A,m1,100  A,m7,100  A,m4,85  A,m5,40
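Out-of-order arrival matters because a window can be assigned either from the time an event arrives (processing time) or from the time it was produced (event time). The sketch below contrasts the two for object A. The event times and values come from the example; the arrival times are invented for illustration, with m4 and m5 arriving after m7 as on this slide.

```python
from collections import defaultdict

# (arrival second, event second, object, value); seconds are offsets from 10:00:00.
# Arrival times are made up: m4 and m5 arrive after m7, as on the slide.
ARRIVALS = [
    (0, 0, "A", 100),    # m1
    (1, 0, "B", 2000),   # m2
    (3, 3, "B", 1800),   # m3
    (7, 7, "B", 2500),   # m6
    (8, 8, "A", 100),    # m7
    (8, 4, "A", 85),     # m4, delayed across the window boundary
    (9, 7, "A", 40),     # m5
    (9, 9, "B", 3000),   # m8
]


def sum_a_per_window(events, time_index, size=5):
    """Sum object A per tumbling window, using the chosen time column."""
    sums = defaultdict(int)
    for event in events:
        if event[2] == "A":
            sums[event[time_index] // size] += event[3]
    return dict(sums)


by_arrival = sum_a_per_window(ARRIVALS, time_index=0)      # processing time
by_event_time = sum_a_per_window(ARRIVALS, time_index=1)   # event time
```

Only the event-time assignment reproduces the expected sums of 185 and 140; with arrival time, m4 is counted in the wrong window.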

late events
[Timeline from 10:00:00 to 10:00:12, one tick per second. Events shown as object, message id, value:]
B,m2,2000  B,m3,1800  B,m6,2500  B,m8,3000
A,m7,100
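Late events raise the question of how long a window should stay open. A widely used answer, and the one Apache Flink builds on, is a watermark: a window is finalized once the watermark passes its end, and anything arriving afterwards for that window is treated as late. The sketch below is a simplified, single-process illustration. The `WindowedSum` class, the 2-second allowed lateness and the sample values are hypothetical, and the watermark is simply the largest event time seen minus the allowed lateness.

```python
class WindowedSum:
    """Tumbling-window sum that closes windows using a watermark."""

    def __init__(self, size=5, allowed_lateness=2):
        self.size = size
        self.allowed_lateness = allowed_lateness   # hypothetical tolerance, in seconds
        self.open = {}      # window start -> running sum
        self.closed = {}    # window start -> final sum
        self.late = []      # events that arrived after their window closed
        self.watermark = float("-inf")

    def add(self, event_time, value):
        start = event_time - event_time % self.size
        if start in self.closed or start + self.size <= self.watermark:
            self.late.append((event_time, value))
            return
        self.open[start] = self.open.get(start, 0) + value
        # The watermark trails the largest event time seen so far.
        self.watermark = max(self.watermark, event_time - self.allowed_lateness)
        for s in [s for s in self.open if s + self.size <= self.watermark]:
            self.closed[s] = self.open.pop(s)


w = WindowedSum()
for event_time, value in [(0, 100), (4, 85), (6, 40)]:
    w.add(event_time, value)   # watermark reaches 4: window [0, 5) is still open
w.add(3, 10)     # out of order but within the allowed lateness: still counted
w.add(8, 100)    # watermark reaches 6: window [0, 5) closes
w.add(2, 7)      # too late: its window is already closed
```

A larger allowed lateness catches more stragglers but delays every result, which is the latency versus completeness trade-off behind this slide.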

Apache Flink from https://cwiki.apache.org/confluence/display/FLINK/Time+and+Order+in+Streams

boontadata

Storm, Flink, Samza, Spark Streaming, …

boontadata
[Diagram labels: IOT simulator, Kafka broker, streaming engine #1, streaming engine #2, streaming engine #..., Cassandra noSQL database, compare]

http://boontadata.io

boontadata

Contribute!
boontadata-streams
- Adding frameworks: Kafka Streams, Beam, Apex, Storm, …
- Improving the code: indexes in Cassandra, Flink reading Kafka 0.10 natively, …
boontadata-vstsbuild
- automation with Visual Studio Team Services
boontadata-azurepaas
- the same scenarios in Azure PaaS

Environment for contributing to boontadata-streams

Conclusion

Contribute! http://boontadata.io

Benjamin Guinebertière, Technical Evangelist, Microsoft France
Azure, data insights, machine learning
@benjguin | http://3-4.fr