Presentation topic: "Cassandra: an in-memory write engine". Transcript of the presentation:

1 Cassandra: an in-memory write engine

2 Outline
• Main characteristics
• The ring
• LSM-tree
• Data model
• Bulk data insertion
• Querying the data
• A new paradigm: in-memory

3 Glossary
• A Cassandra cluster is called a ring: it runs in peer-to-peer mode, and any node of the ring can handle any client request (no master-slave relationship).
• A ring node contacted by a client acts as coordinator and can read or write the data of a table (or column family) spread over several nodes (shared-nothing architecture).
• Each table has its data replicated n times across the nodes of the cluster.
• Cassandra optimizes writes through an in-memory table called the memtable. Disk writes happen asynchronously into an SSTable (Sorted String Table).
• Peer-to-peer ("P2P") is a network model close to the client-server model, except that every client is also a server.

4 Characteristics
• Open-source solution from the Apache Foundation, initially developed by Facebook
• DataStax distribution (Community + Enterprise)
• Written in Java
• Column-oriented DBMS => key-value (value = a set of columns)
• Distributed system in peer-to-peer mode
• Cassandra = one JVM instance per node
• An ideal site for ramping up on Cassandra
• Cassandra = a ring where every node can perform all processing; no master-slave architecture

5 Characteristics: Cassandra 2.0
• CQL, the query language of the database, an SQL-like layer => prefer the cqlsh client over the column-oriented cassandra-cli
• Client drivers available: Java, C#, Python (a minimal Java example follows below)
• No locking on concurrent updates => if several clients modify the same columns concurrently, only the most recent modifications are kept.

6 A Cassandra table

7 Characteristics
• Atomicity is guaranteed at the row level for a transaction => inserting and updating columns of a single row is handled as one operation
• Isolation is guaranteed at the row level
• Durability is guaranteed through a commit log
• Read & write consistency: "The consistency level specifies the number of replicas on which the write must succeed before returning an acknowledgment to the client application. The consistency level specifies how many replicas must respond to a read request before returning data to the client application." The consistency level is chosen per request, as sketched below.
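A sketch of choosing the consistency level from the Java driver, reusing the session from the earlier sketch; the statement text is illustrative. In cqlsh the equivalent is the CONSISTENCY command.

import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

// Ask for a QUORUM of replicas to acknowledge before the call returns
// (2 out of 3 with a replication factor of 3).
Statement stmt = new SimpleStatement(
        "INSERT INTO test_insert_1 (nb, bool, string) VALUES (42, true, 'hello')");
stmt.setConsistencyLevel(ConsistencyLevel.QUORUM);
session.execute(stmt);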

8 The Cassandra ring
• Peer-to-peer system where every node can handle a client request (no master/slave relationship).
• Table data is hashed, compressed and distributed across the nodes in partitions.
• Each partition is replicated on different nodes.
• Several rings are possible.
A toy sketch of hash-based placement follows below.
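To illustrate how a partition key is mapped to a node, here is a toy consistent-hashing ring. It is only a sketch of the idea: Cassandra actually uses its own partitioner (Murmur3Partitioner by default) and virtual nodes, and the hash function below is just Java's hashCode.

import java.util.SortedMap;
import java.util.TreeMap;

// Toy token ring: each node owns the arc of the ring that ends at its token.
public class ToyRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addNode(String name) {
        ring.put(name.hashCode(), name);             // token derived from a hash
    }

    String nodeFor(String partitionKey) {
        int token = partitionKey.hashCode();         // hash the partition key
        SortedMap<Integer, String> tail = ring.tailMap(token);
        // first node whose token is >= the key's token, wrapping around the ring
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    public static void main(String[] args) {
        ToyRing r = new ToyRing();
        r.addNode("node1"); r.addNode("node2"); r.addNode("node3");
        System.out.println("key 1535 -> " + r.nodeFor("1535"));
    }
}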

9 Ring: writes
The coordinator sends a write request to all replicas that own the row being written. As long as all replica nodes are up and available, they will get the write regardless of the consistency level specified by the client. The write consistency level determines how many replica nodes must respond with a success acknowledgment in order for the write to be considered successful. Success means that the data was written to the commit log and the memtable as described in About writes.
For example, in a single data center 10 node cluster with a replication factor of 3, an incoming write will go to all 3 nodes that own the requested row. If the write consistency level specified by the client is ONE, the first node to complete the write responds back to the coordinator, which then proxies the success message back to the client. A consistency level of ONE means that it is possible that 2 of the 3 replicas could miss the write if they happened to be down at the time the request was made. If a replica misses a write, Cassandra will make the row consistent later using one of its built-in repair mechanisms: hinted handoff, read repair, or anti-entropy node repair.
In short: the coordinator forwards the write to all replicas of that row and responds back to the client once it receives a write acknowledgment from the number of nodes specified by the consistency level. When a node writes and responds, that means it has written to the commit log and put the mutation into a memtable.

10 Ring: reads
There are two types of read requests that a coordinator can send to a replica:
• A direct read request
• A background read repair request
The number of replicas contacted by a direct read request is determined by the consistency level specified by the client. Background read repair requests are sent to any additional replicas that did not receive a direct request. Read repair requests ensure that the requested row is made consistent on all replicas.
Thus, the coordinator first contacts the replicas specified by the consistency level. The coordinator sends these requests to the replicas that are currently responding the fastest. The nodes contacted respond with the requested data; if multiple nodes are contacted, the rows from each replica are compared in memory to see if they are consistent. If they are not, then the replica that has the most recent data (based on the timestamp) is used by the coordinator to forward the result back to the client.
To ensure that all replicas have the most recent version of frequently-read data, the coordinator also contacts and compares the data from all the remaining replicas that own the row in the background. If the replicas are inconsistent, the coordinator issues writes to the out-of-date replicas to update the row to the most recent values. This process is known as read repair. Read repair can be configured per table (using read_repair_chance), and is enabled by default.
For example, in a cluster with a replication factor of 3, and a read consistency level of QUORUM, 2 of the 3 replicas for the given row are contacted to fulfill the read request. Supposing the contacted replicas had different versions of the row, the replica with the most recent version would return the requested data. In the background, the third replica is checked for consistency with the first two, and if needed, the most recent replica issues a write to the out-of-date replicas.

11 LSM-tree
Source: "SSTable is a simple abstraction to efficiently store large numbers of key-value pairs while optimizing for high throughput, sequential read/write workloads."
• On-disk SSTable indexes are always loaded into memory
• All writes go directly to the MemTable index
• Reads check the MemTable first and then the SSTable indexes
• Periodically, the MemTable is flushed to disk as an SSTable
• Periodically, on-disk SSTables are "collapsed together"
P. O'Neil:

12 LSM-tree
• A structure optimized for writing data, outperforming an indexed SQL table on large volumes (GB, TB).
• Main idea: write in memory into a key-value table, then write to disk asynchronously and sequentially.
• A disk write is immutable => a merge-sort algorithm merges the SSTables belonging to the same table.
Once the SSTable is on disk, it is immutable, hence updates and deletes can't touch the data. Instead, a more recent value is simply stored in the MemTable in case of update, and a "tombstone" record is appended for deletes. Because we check the indexes in sequence, future reads will find the updated or the tombstone record without ever reaching the older values! Finally, having hundreds of on-disk SSTables is also not a great idea, hence periodically we will run a process to merge the on-disk SSTables, at which time the update and delete records will overwrite and remove the older data. Google's BigTable, Hadoop's HBase, and Cassandra amongst others are all using a variant or a direct copy of this very nice architecture.
IO (HDD & SSD):
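To make the write-then-flush idea concrete, here is a toy LSM sketch in Java. It is only an illustration of the principle, not Cassandra's implementation: there is no commit log, no on-disk format and no compaction, and tombstones are simply modelled as null values.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.SortedMap;
import java.util.TreeMap;

// Writes go to an in-memory sorted map (the "memtable"); when it grows past a
// threshold it is frozen into an immutable sorted "SSTable". Reads check the
// memtable first, then the SSTables from newest to oldest, so the most recent
// value (or tombstone) always wins.
public class ToyLsm {
    private static final int FLUSH_THRESHOLD = 4;
    private TreeMap<String, String> memtable = new TreeMap<>();
    private final Deque<SortedMap<String, String>> sstables = new ArrayDeque<>();

    void put(String key, String value) {              // value == null acts as a tombstone
        memtable.put(key, value);
        if (memtable.size() >= FLUSH_THRESHOLD) {
            sstables.addFirst(memtable);               // "flush": freeze the memtable
            memtable = new TreeMap<>();
        }
    }

    String get(String key) {
        if (memtable.containsKey(key)) return memtable.get(key);
        for (SortedMap<String, String> sst : sstables)     // newest first
            if (sst.containsKey(key)) return sst.get(key);
        return null;                                    // absent (or deleted)
    }
}

Periodic compaction would correspond to merge-sorting several of these frozen maps into one, dropping tombstones and older versions along the way.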

13 Writing to a table
Cassandra processes data at several stages on the write path, starting with the immediate logging of a write and ending in compaction:
• Logging data in the commit log
• Writing data to the memtable
• Flushing data from the memtable
• Storing data on disk in SSTables
• Compaction

Logging writes and memtable storage
When a write occurs, Cassandra stores the data in a structure in memory, the memtable, and also appends writes to the commit log on disk, providing configurable durability. The commit log receives every write made to a Cassandra node, and these durable writes survive permanently even after hardware failure. The memtable is a write-back cache of data partitions that Cassandra looks up by key. The more a table is used, the larger its memtable needs to be. Cassandra can dynamically allocate the right amount of memory for the memtable or you can manage the amount of memory being utilized yourself. The memtable, unlike a write-through cache, stores writes until reaching a limit, and then is flushed.

Flushing data from the memtable
When memtable contents exceed a configurable threshold, the memtable data, which includes indexes, is put in a queue to be flushed to disk. You can configure the length of the queue by changing memtable_flush_queue_size in the cassandra.yaml. If the data to be flushed exceeds the queue size, Cassandra blocks writes. You can manually flush data from the memtable using the nodetool flush command. Typically, before restarting nodes, flushing the memtable is recommended to reduce commit log replay time. To flush the data, Cassandra sorts memtables by partition key and then writes the data to disk sequentially. The process is extremely fast because it involves only a commitlog append and the sequential write.

Storing data on disk in SSTables
The memtable data is flushed to SSTables on disk using sequential I/O. Data in the commit log is purged after its corresponding data in the memtable is flushed to the SSTable. Memtables and SSTables are maintained per table. SSTables are immutable, not written to again after the memtable is flushed. Consequently, a partition is typically stored across multiple SSTable files. For each SSTable, Cassandra creates these in-memory structures:
• Partition index: a list of primary keys and the start position of rows in the data file.
• Partition summary: a subset of the partition index. By default 1 primary key out of every 128 is sampled.

Compaction

14 Reading from a table: memory/disk; partition key cache
A row-level cache is also available in addition. First, Cassandra checks the Bloom filter. Each SSTable has a Bloom filter associated with it that checks the probability of having any data for the requested partition key in the SSTable before doing any disk I/O. If the probability is good, Cassandra checks the partition key cache and takes one of these courses of action:
• If an index entry is found in the cache:
  • Cassandra goes to the compression offset map to find the compressed block having the data.
  • Fetches the compressed data on disk and returns the result set.
• If an index entry is not found in the cache:
  • Cassandra searches the partition summary to determine the approximate location on disk of the index entry.
  • Next, to fetch the index entry, Cassandra hits the disk for the first time, performing a single seek and a sequential read of columns (a range read) in the SSTable if the columns are contiguous.
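To illustrate the "probability is good" test above, here is a toy Bloom filter: a compact, probabilistic "might contain" structure of the kind Cassandra consults per SSTable before touching the disk. False positives are possible, false negatives are not. The bit-array size and hash functions below are arbitrary choices for the sketch, not Cassandra's.

import java.util.BitSet;

public class ToyBloomFilter {
    private static final int SIZE = 1 << 16;
    private final BitSet bits = new BitSet(SIZE);

    private int[] positions(String partitionKey) {
        int h1 = partitionKey.hashCode();
        int h2 = h1 * 31 + 17;                        // cheap second hash, illustration only
        return new int[] { Math.abs(h1 % SIZE), Math.abs(h2 % SIZE) };
    }

    void add(String partitionKey) {                   // called when the SSTable is written
        for (int p : positions(partitionKey)) bits.set(p);
    }

    boolean mightContain(String partitionKey) {       // called on the read path
        for (int p : positions(partitionKey))
            if (!bits.get(p)) return false;           // definitely absent: skip this SSTable
        return true;                                  // maybe present: go read the index
    }
}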

15 Data model
• A set of tables that are independent of one another (no joins in NoSQL)
• A single index: the partition key
• Tables can be clustered: composite key
• Secondary indexes can be added
Tip: a good rule of thumb is one column family per query, since you optimize column families for read performance (see the sketch below). Source:
This nicely illustrates NoSQL-style "modeling":
- In SQL, you start from the company's data repositories, then you model, and then the processing fetches the data according to its needs;
- In NoSQL, you define a need in agile mode, then you create the data structures suited to it (an object holding all the information, no join).
On one side the processing adapts to the data; on the other, the data adapts to the processing.
Secondary index: Cassandra's secondary indexes are not distributed like normal tables. They are implemented as local indexes. Each node stores an index of only the data that it stores. To perform the country index lookup, every node is queried, looks up the 'UK' partition and then looks up each user_accounts partition found.
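A sketch of what "one table per query" looks like in practice, run through the Java driver session from earlier. The table name and columns are hypothetical, not from the presentation; the point is that the primary key (partition key + clustering column) is designed for the read you need, with no join and no secondary index.

// "All purchases of a user, most recent first" is answered by a single partition scan.
session.execute("CREATE TABLE IF NOT EXISTS purchases_by_user ("
        + " user_id bigint,"
        + " purchase_date timestamp,"
        + " item text,"
        + " PRIMARY KEY (user_id, purchase_date)"        // partition key + clustering column
        + ") WITH CLUSTERING ORDER BY (purchase_date DESC)");
session.execute("SELECT item, purchase_date FROM purchases_by_user WHERE user_id = 42");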

16 Data types
• Usual types: int, double, varchar, boolean, timestamp, blob
• Collections: set, list, map
• Other types: counter, inet, uuid, timeuuid
Data modeling example: A set stores a group of elements that are returned in sorted order when queried. A column of type set consists of unordered unique values. When the order of elements matters, which may not be the natural order dictated by the type of the elements, use a list. Also, use a list when you need to store the same value multiple times. List values are returned according to their index value in the list, whereas set values are returned in alphabetical order, assuming the values are text. As its name implies, a map maps one thing to another. A map is a name and a pair of typed values. Using the map type, you can store timestamp-related information in user profiles. Each element of the map is internally stored as one Cassandra column that you can modify, replace, delete, and query. Each element can have an individual time-to-live and expire when the TTL ends. A counter is a special kind of column used to store a number that incrementally counts the occurrences of a particular event or process. For example, you might use a counter column to count the number of times a page is viewed.
List of types:
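In the Java driver, CQL collections map onto plain Java collections (list<varchar> -> List<String>, map<timestamp,text> -> Map<Date,String>). A small sketch against the test_insert_1 table, reusing the session from earlier; the values are illustrative.

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.PreparedStatement;
import java.util.Arrays;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

// Prepare once, bind many times; collections are bound like any other value.
PreparedStatement ps = session.prepare(
        "INSERT INTO test_insert_1 (nb, bool, list, map, string) VALUES (?, ?, ?, ?, ?)");
Map<Date, String> history = new HashMap<>();
history.put(new Date(), "t1");
BoundStatement bound = ps.bind(12002L, true, Arrays.asList("azerty", "qwerty"), history, "12002");
session.execute(bound);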

17 Test cluster
• 3-node cluster, commodity hardware
• i5-3470 (4 CPU), 32 GB RAM, 4 TB HDD (7200 RPM)
• 100 Mb/s network card, Ubuntu LTS
Network card: 100 Mb/s => 1 Gb/s.

18 Installation
Prerequisites: sudo, Oracle JRE 7, Internet access => apt-get
Quick installation (< 1 day if the ports are open)
Documentation:
Cassandra: a Java machine (JVM)

19 Bulk data insertion
Method 1: the COPY command (CQL), importing CSV files. Example: copy T from '/home/user/file' with delimiter = '|'
Method 2: the sstableloader tool. Generate an SSTable from a CSV file via a Java program (to be written), then use the tool to load the generated SSTable into Cassandra.
No off-the-shelf tool for inserting semi-structured data => a Java tool was written.
COPY: IMPORTANT NOTE: COPY FROM is intended for importing small datasets (a few million rows or less) into Cassandra. For importing larger datasets, use the Cassandra SSTable Bulk Loader.
Cassandra Bulk Loader:

20 SsTableLoad
SsTableLoad <node_address> <nb_iter> <nb_insert> <table_name> <min_key>
The tool connects to one node of the ring and runs n iterations against a table with a predefined schema. Each iteration performs a bulk insert of m rows. The first inserted row uses min_key as its key, and the key is then incremented by 1 for every new insert. Cassandra: upsert principle.
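The presentation does not include the source of SsTableLoad; the following is a hypothetical reconstruction of what such a loader could look like with the DataStax Java driver (prepared statements in a loop, 1 MB blob per row), not the author's actual code and not the sstableloader utility itself. The keyspace and column names follow the test_insert schema shown on the next slide.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import java.nio.ByteBuffer;

// Usage: java Loader <node_address> <nb_iter> <nb_insert> <table_name> <min_key>
public class Loader {
    public static void main(String[] args) {
        String node = args[0];
        int nbIter = Integer.parseInt(args[1]);
        int nbInsert = Integer.parseInt(args[2]);
        String table = args[3];
        long key = Long.parseLong(args[4]);

        Cluster cluster = Cluster.builder().addContactPoint(node).build();
        Session session = cluster.connect("demodb");
        PreparedStatement ps = session.prepare(
                "INSERT INTO " + table + " (nb, string, val) VALUES (?, ?, ?)");
        ByteBuffer blob = ByteBuffer.allocate(1024 * 1024);     // 1 MB payload per row

        for (int i = 0; i < nbIter; i++) {
            for (int j = 0; j < nbInsert; j++) {
                // Upsert semantics: re-running with the same keys simply overwrites the rows.
                session.execute(ps.bind(key, Long.toString(key), blob.duplicate()));
                key++;
            }
        }
        cluster.close();
    }
}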

21 SsTableLoad
CREATE TABLE test_insert ( string varchar, nb bigint, bool boolean, list list<varchar>, map map<timestamp,text>, val blob, PRIMARY KEY (nb));
alter table test_insert with gc_grace_seconds = 30;
A 1 MB BLOB is inserted per row. The ALTER TABLE (gc_grace_seconds = 30) => deleted rows are purged permanently after 30 seconds.

22 Java exceptions
Java exceptions encountered during the development phase:
Exception in thread "main" com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: / (com.datastax.driver.core.exceptions.DriverException: Timeout during read), / (com.datastax.driver.core.TransportException: [/ ] Error writing), / (com.datastax.driver.core.exceptions.DriverException: Timeout during read))
In /etc/cassandra/cassandra.yaml: read_request_timeout_in_ms: 5000 => raised to 1 minute; write_request_timeout_in_ms: 2000 => raised to 24 seconds.
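The two parameters above are server-side timeouts. When the driver itself reports "Timeout during read", the client-side socket timeout of the DataStax Java driver can also be raised. A sketch assuming driver 2.x; the value and contact point are illustrative.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.SocketOptions;

// Give the client more time before it gives up on a response from the coordinator.
SocketOptions socketOptions = new SocketOptions().setReadTimeoutMillis(60000);
Cluster cluster = Cluster.builder()
        .addContactPoint("192.168.1.10")
        .withSocketOptions(socketOptions)
        .build();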

23 Java exceptions
Exception in thread "main" com.datastax.driver.core.exceptions.InvalidQueryException: Request is too big: length exceeds maximum allowed length => nb_insert = 250
java.lang.OutOfMemoryError: Java heap space => increase the heap size in /etc/cassandra/cassandra-env.sh: 8 GB => 12 GB
The Java driver's maturity still needs to improve.

24 Scalability
One process loads 64 GB in 14m25s, i.e. roughly 1 GB every 14 s.
Two processes load 64 GB in 8m47s, i.e. roughly 1 GB every 8 s. Saturation when launching 3 processes, one per node.
Test: 64 GB loaded with one, two and three nodes.
One node: time java SsTableLoad test_insert_6 1 (real 14m24.715s, user 7m57.607s, sys 0m44.056s) => 1 GB in 14 s
Two nodes: time java SsTableLoad test_insert_7 1 (real 8m29.722s, user 4m29.329s, sys 0m25.591s); time java SsTableLoad test_insert_ (real 8m47.181s, user 4m38.459s, sys 0m26.548s) => 1 GB in 8 s
Three nodes: time java SsTableLoad test_insert_8 1 (real 7m49.742s, user 3m21.242s, sys 0m18.058s); time java SsTableLoad test_insert_ (real 7m38.393s, user 3m8.004s, sys 0m18.161s); time java SsTableLoad test_insert_ (real 7m32.199s, user 3m9.014s, sys 0m17.900s) => 1 GB in 7.3 s
The configuration saturates. Network: 100 Mb/s => 1 Gb/s.

25 Network activity
CPU: OK, RAM: OK
Disk I/O: around 30% with three processes => n SSTables are created asynchronously
iftop
Scalability test: peak x2 => change the network card
Next lead to explore: contention at the JVM level (example tool: VisualVM)

26 Studying the queries
Goal: understand the internal behaviour of a few queries => TRACING ON in cqlsh
Queries studied (CRUD): INSERT, UPDATE, SELECT, COUNT(*), full scan, secondary index use, ORDER BY, DELETE
Explain:

27 Description of the tables
CREATE TABLE test_insert_x ( nb bigint, bool boolean, "list" list<text>, "map" map<timestamp, text>, string text, val blob, PRIMARY KEY (nb) ) WITH bloom_filter_fp_chance= AND caching='ALL' AND comment='' AND dclocal_read_repair_chance= AND gc_grace_seconds=30 AND index_interval=128 AND read_repair_chance= AND replicate_on_write='true' AND populate_io_cache_on_flush='false' AND default_time_to_live=0 AND speculative_retry='99.0PERCENTILE' AND memtable_flush_period_in_ms=0 AND compaction={'class': 'SizeTieredCompactionStrategy'} AND compression={'sstable_compression': 'LZ4Compressor'}; CREATE TABLE test_select_x ( nb bigint, string text, bool boolean, "list" list<text>, "map" map<timestamp, text>, val blob, PRIMARY KEY (nb, string) ) WITH bloom_filter_fp_chance= AND caching='KEYS_ONLY' AND comment='' AND dclocal_read_repair_chance= AND gc_grace_seconds=30 AND index_interval=128 AND read_repair_chance= AND replicate_on_write='true' AND populate_io_cache_on_flush='false' AND default_time_to_live=0 AND speculative_retry='99.0PERCENTILE' AND memtable_flush_period_in_ms=0 AND compaction={'class': 'SizeTieredCompactionStrategy'} AND compression={'sstable_compression': 'LZ4Compressor'}; CREATE KEYSPACE demodb WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': '2' }; bloom_filter_fp_chance (Default: 0.01 for SizeTieredCompactionStrategy, 0.1 for LeveledCompactionStrategy) Desired false-positive probability for SSTable Bloom filters. When data is requested, the Bloom filter checks if the requested row exists before doing any disk I/O. Valid values are 0 to 1.0. A setting of 0 means that the unmodified (effectively the largest possible) Bloom filter is enabled. Setting the Bloom Filter at 1.0 disables it. The higher the setting, the less memory Cassandra uses. The maximum recommended setting is 0.1, as anything above this value yields diminishing returns. For detailed information, see Tuning Bloom filters. caching (Default: keys_only) Optimizes the use of cache memory without manual tuning. Set caching to one of the following values: • all • keys_only • rows_only • none Cassandra weights the cached data by size and access frequency. Use this parameter to specify a key or row cache instead of a table cache, as in earlier versions. dclocal_read_repair_chance (Default: 0.0) Specifies the probability of read repairs being invoked over all replicas in the current data center. Contrast read_repair_chance. gc_grace (Default: [10 days]) Specifies the time to wait before garbage collecting tombstones (deletion markers). The default value allows a great deal of time for consistency to be achieved prior to deletion. In many deployments this interval can be reduced, and in a single-node cluster it can be safely set to zero. index_interval (Default: 128) Controls the sampling of entries from the primary row index. The interval corresponds to the number of index entries that are skipped between taking each sample. By default Cassandra samples one row key out of every 128. The larger the interval, the smaller and less effective the sampling. The larger the sampling, the more effective the index, but with increased memory usage. Generally, the best trade off between memory usage and performance is a value between 128 and 512 in combination with a large table key cache. However, if you have small rows (many to an OS page), you may want to increase the sample size, which often lowers memory usage without an impact on performance. 
For large rows, decreasing the sample size may improve read performance. read_repair_chance (Default: 0.1 or 1) Specifies the probability with which read repairs should be invoked on non-quorum reads. The value must be between 0 and 1. For tables created in versions of Cassandra before 1.0, it defaults to 1. For tables created in versions of Cassandra 1.0 and higher, it defaults to 0.1. However, for Cassandra 1.0, the default is 1.0 if you use CLI or any Thrift client, such as Hector or pycassa. replicate_on_write (Default: true) Applies only to counter tables. When set to true, replicates writes to all affected replicas regardless of the consistency level specified by the client for a write request. For counter tables, this should always be set to true. populate_io_cache_on_flush (Default: false) Populates the page cache on memtable flush and compaction. Enable only when all data on the node fits within memory. Use for fast reading of SSTables from IO cache (memory). default_time_to_live (Default: 0) The default expiration time in seconds for a table. Used in MapReduce/Hive scenarios when you have no control of TTL. speculative_retry (Default: NONE) Overrides normal read timeout when read_repair_chance is not 1.0, sending another request to read. Choices are: • ALWAYS: Retry reads of all replicas. • Xpercentile: Retry reads based on the effect on throughput and latency. • Yms: Retry reads after specified milliseconds. • NONE: Do not retry reads. memtable_flush_period_in_ms (Default: 0) Forces flushing of the memtable after the specified time in milliseconds elapses. compaction_strategy (Default: SizeTieredCompactionStrategy) Sets the compaction strategy for the table. The available strategies are: • SizeTieredCompactionStrategy: The default compaction strategy and the only compaction strategy available in releases earlier than Cassandra 1.0. This strategy triggers a minor compaction whenever there are a number of similar sized SSTables on disk (as configured by min_compaction_threshold). Using this strategy causes bursts in I/O activity while a compaction is in process, followed by longer and longer lulls in compaction activity as SSTable files grow larger in size. These I/O bursts can negatively effect read-heavy workloads, but typically do not impact write performance. Watching disk capacity is also important when using this strategy, as compactions can temporarily double the size of SSTables for a table while a compaction is in progress. • LeveledCompactionStrategy: The leveled compaction strategy creates SSTables of a fixed, relatively small size (5 MB by default) that are grouped into levels. Within each level, SSTables are guaranteed to be non-overlapping. Each level (L0, L1, L2 and so on) is 10 times as large as the previous. Disk I/O is more uniform and predictable as SSTables are continuously being compacted into progressively larger levels. At each level, row keys are merged into non-overlapping SSTables. This can improve performance for reads, because Cassandra can determine which SSTables in each level to check for the existence of row key data. This compaction strategy is modeled after Google's leveldb implementation. For more information, see the articles When to Use Leveled Compaction and Leveled Compaction in Apache Cassandra. sstable_compression (Default: SnappyCompressor) The compression algorithm to use. Valid values are LZ4Compressor available in Cassandra and later), SnappyCompressor, and DeflateCompressor. Use an empty string ('') to disable compression. 
Choosing the right compressor depends on your requirements for space savings over read performance. LZ4 is fastest to decompress, followed by Snappy, then by Deflate. Compression effectiveness is inversely correlated with decompression speed. The extra compression from Deflate or Snappy is not enough to make up for the decreased performance for general-purpose workloads, but for archival data they may be worth considering. Developers can also implement custom compression classes using the org.apache.cassandra.io.compress.ICompressor interface. Specify the full class name as a "string constant".

28 Insert
insert into test_insert_1 (nb,bool,list,map,string) values (12001, true, ['azerty', 'qwerty'], { ' :00' : 't1'}, '12000');
Source_elapsed: micros. The query executes; val = null (the blob column was not provided).

29 Count(*)
select count(*) from test_insert_1 limit 20000;
 count
-------
 10002
(1 rows)
Trace file: > lines
Compaction of table test_insert_1: nodetool compact demodb test_insert_1; nodetool compactionhistory; nodetool status => cluster state

30 Utilisation de l’index de la clé primaire
select nb, list, string from test_insert_1 where nb = 1535 ; cqlsh:demodb> consistency; Current consistency level is ONE.

31 Utilisation d’un index secondaire
CREATE INDEX test_insert_1_string_idx ON test_insert_1 (string); select nb, list, string from test_insert_1 where string = 'VvmEQQwkPEtypCrmBRrKUbhpXXxtfe'; cqlsh:demodb> select nb, list, string from test_insert_1 where string = 'VvmEQQwkPEtypCrmBRrKUbhpXXxtfe'; Bad Request: No indexed columns present in by-columns clause with Equal operator

32 Order by
select nb,string, bool,list,map from test_select_1 where nb = 1221 order by string
cqlsh:demodb> select nb,string, bool,list,map from test_select_1 where nb = 1221 order by nb;
Bad Request: Order by is currently only supported on the clustered columns of the PRIMARY KEY, got nb
nodetool cfstats demodb.test_select_1 => approximate figures

33 Update
update test_select_1 set bool = true where nb = 1221 and string = 'LfazkllbGORcyHSwmiZgLVWcmbaWHL' ;
Upsert principle => update = insert. In an UPDATE statement, all updates within the same partition key are applied atomically and in isolation. Use the IF keyword followed by a condition to be met for the update to succeed. Using an IF condition incurs a performance hit associated with using Paxos internally to support linearizable consistency (see the example below).
cqlsh:demodb> update test_select_1 set string = 'UPDATE' where nb = 1221;
Bad Request: PRIMARY KEY part string found in SET part
cqlsh:demodb> update test_select_1 set bool = true where nb = 1221;
Bad Request: Missing mandatory PRIMARY KEY part string
Composite key: IN is not allowed in the WHERE clause.
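A sketch of a conditional update (lightweight transaction) on the same row, executed through the Java driver session from earlier; the condition and values are illustrative, not from the presentation.

// The update is applied only if the condition currently holds (Paxos under the hood);
// the returned row contains an [applied] column telling whether it succeeded.
session.execute("UPDATE test_select_1 SET bool = false "
        + "WHERE nb = 1221 AND string = 'LfazkllbGORcyHSwmiZgLVWcmbaWHL' "
        + "IF bool = true");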

34 Delete delete from test_select_1 where nb = 840;

35 Querying the data
The SELECT grammar is very limited, there are few indexes, and reads hit the disk => how can this huge amount of collected data be exploited better?
One solution: the DataStax Enterprise edition
• Batch side (MapReduce, Hive, Apache Mahout)
• Full-text search engine: Solr (comparable to Elasticsearch)
• Addition of an in-memory layer
SELECT grammar:
Hive: run HiveQL queries on Cassandra data => an SQL-like layer (joins, aggregate computations, ...)
Pig: explore very large data sets. Pig is based on a high-level language, Pig Latin, used to create MapReduce-style programs.
Apache Mahout: machine learning => read and learn from the data to predict the future, for example which purchases a customer will make on the web. Apache Mahout is an Apache Foundation project aimed at providing implementations of machine learning and data mining algorithms.
In-memory option:
Migration tool: Sqoop

36 In-memory
Paradigm shift: disk & RAM => RAM & CPU cache
Solution 1: couple Cassandra with an in-memory engine (Spark, Shark, MLlib, ...)
Solution 2: couple Cassandra with an in-memory column-oriented database (SAP HANA, HP Vertica, Amazon Redshift, ...) => target: BI
Spark, Shark:
Company: Ooyala, video on demand

