valtech - big data & nosql : au-delà du nouveau buzz

Big Data et NoSQL

au-delà du nouveau buzz

Définition

Méthodologie

Architecture!

Cas d’utilisation

Définition

Définition

«Les technologies Big Data correspondent à une nouvelle génération de technologies et d’architectures, conçues pour retirer une valeur économique de gigantesques volumes de données hétéroclites, en les capturant, en les explorant et/ou en les analysant en un temps record »

4

Sommaire

Ò  Présentation Ò  Cas d’utilisation Ò  Architecture Ò  Cas Pratique Ò  Conclusion Ò  Références Ò  Annexes

5

Méthodologie

Big Data / Méthodologie

La mise en place d’une démarche Big Data est toujours composée de trois étapes :

Ò  Collecter, stocker les données : Partie infrastructure.

Ò  Analyser, corréler, agréger les données : Partie analytique.

Ò  Exploiter, afficher l’analyse Big Data : Comment exploiter les données et les analyses ?

Architecture

Architecture Big Data

Audio, Vidéo, Image

Docs, Texte, XML

Web logs, Clicks,

Social, Graphs,

RSS,

Capteurs, Graphs,

RSS,

Spatial, GPS Autres

Base de données Orientée colonne

NoSQL

Distributed File

System

Map Reduce

Base de données SQL

SQL

Analytiques , Business Intelligent

CO

LLECTER

LES D

ON

NEES

STOC

KA

GE &

OR

GA

NISATIO

N

EXTRA

CTIO

N

AN

ALYSER

&

VISU

ALISER

BUSINESS

Technologie Big Data

Audio, Vidéo, Image

Docs, Texte, XML

Web logs, Clicks,

Social, Graphs,

RSS,

Capteurs, Graphs,

RSS,

Spatial, GPS Autres

HBase, Big Table, Cassandra,

Voldemort, …

HDFS, GFS, S3, …

Oracle, DB2, MySQL, …

SQL

CO

LLECTER

LES D

ON

NEES

STOC

KA

GE &

OR

GA

NISATIO

N

EXTRA

CTIO

N

AN

ALYSER

&

VISU

ALISER

BUSINESS

Architecture Big Data Open Source pour l’entreprise

Big Data full solution

Cas d’utilisation

Cas d’utilisation BigQuery

Cas d’utilisation Hadoop

Ò  Facebook uses Hadoop to store copies of internal log and dimension data sources and as a source for reporting/analytics and machine learning. There are two clusters, a 1100-machine cluster with 8800 cores and about 12 PB raw storage and a 300-machine cluster with 2400 cores and about 3 PB raw storage.

Ò  Yahoo! deploys more than 100,000 CPUs in > 40,000 computers running Hadoop. The biggest cluster has 4500 nodes (2*4cpu boxes w 4*1TB disk & 16GB RAM). This is used to support research for Ad Systems and Web Search and to do scaling tests to support development of Hadoop on larger clusters

Ò  eBay uses a 532 nodes cluster (8 * 532 cores, 5.3PB), Java MapReduce, Pig, Hive and HBase

Ò  Twitter uses Hadoop to store and process tweets, log files, and other data generated across Twitter. They use Cloudera’s CDH2 distribution of Hadoop. They use both Scala and Java to access Hadoop’s MapReduce APIs as well as Pig, Avro, Hive, and Cassandra.

Gartner talk

« D'ici 2015, 4,4 millions d'emplois informatiques seront créés dans le monde pour soutenir le Big Data, dont 1,9 millions aux Etat-Unis », a déclaré Peter Sondergaard, senior vice-président et responsable mondial de la recherche chez Gartner.

Wanted « Sophisticated Statistical Analysis »

100 000 to 500 000 $

Présentation

Architecture & API

Cas d’utilisations

Etude de cas

Sécurité

Ecosystème Hadoop

Présentation

Présentation

“HBase is the Hadoop database. Think of it as a distributed scalable Big Data store”

http://hbase.apache.org/

19

Présentation

“Project's goal is the hosting of very large tables – billions of rows X millions of columns – atop clusters of commodity hardware”


20

Présentation

“HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's BigTable”


21

La trilogie Google

Ò The Google File System http://research.google.com/archive/gfs.html

Ò MapReduce: Simplified Data Processing on Large Cluster http://research.google.com/archive/mapreduce.html

Ò Bigtable: A Distributed Storage System for Structured Data http://research.google.com/archive/bigtable.html

22

Systèmes d’exploitation / Plateforme

23

Installation / démarrage / arrêt

$ mkdir bigdata $ cd bigdata $ wget http://apache.claz.org/hbase/hbase-0.92.1/hbase-0.92.1.tar.gz … $ tar xvfz hbase-0.92.1.tar.gz … $ export HBASE_HOME=`pwd`/hbase-0.92.1 $ $HBASE_HOME/bin/start-hbase.sh … $ $HBASE_HOME/bin/stop-hbase.sh …

24

HBase Shell / exemple de session $ $HBASE_HOME/hbase shell HBase Shell; enter 'help<RETURN>' for list of supported commands. Type "exit<RETURN>" to leave the HBase Shell Version 0.92.1, r1298924, Fri Mar 9 16:58:34 UTC 2012 hbase(main):001:0> list TABLE 0 row(s) in 0.5510 seconds hbase(main):002:0> create 'mytable', 'cf' 0 row(s) in 1.1730 seconds hbase(main):003:0> list TABLE mytable 1 row(s) in 0.0080 seconds hbase(main):004:0> put 'mytable', 'first', 'cf:message', 'hello HBase' 0 row(s) in 0.2070 seconds hbase(main):005:0> get 'mytable', 'first' COLUMN CELL cf:message timestamp=1323483954406, value=hello HBase 1 row(s) in 0.0250 seconds

25

HBase Shell / Commandes

Ò  Générales Ò status, version

Ò  DDL Ò alter, alter_async, alter_status, create, describe, disable,

disable_all, drop, drop_all, enable, enable_all, exists, is_disabled, is_enabled, list, show_filters

Ò  DML Ò count, delete, deleteall, get, get_counter, incr, put, scan, truncate

Ò  Outils Ò  assign, balance_switch, balancer, close_region, compact, flush,

hlog_roll, major_compact, move, split, unassign, zk_dump Ò  Replication

Ò  add_peer, disable_peer, enable_peer, list_peers, remove_peer, start_replication, stop_replication

Ò  Securité Ò  grant, revoke, user_permission

26

Architecture & API

Architecture logique

28

Table Design

29

Exemple de table

30

Ò Coordonnées des données [rowkey, column family, column qualifier, timestamp] → value [fr.wikipedia.org/wiki/NoSQL, links, count, 1234567890] → 24

HBase Java API

Ò HBaseAdmin, HTableDescriptor, HColumnDescriptor HTableDescriptor desc = new HTableDescriptor("TableName"); HColumnDescriptor cf = new HColumnDescriptor("Family".getBytes()); desc.addFamily(contentFamily); Configuration conf = HBaseConfiguration.create(); HBaseAdmin admin = new HBaseAdmin(conf); admin.createTable(desc);

31

HBase Java API

Ò HTablePool & HTableInterface HTablePool pool = new HTablePool(); HTableInterface table = pool.getTable("TableName");

32

HBase Java API

Ò Put byte[] cellValue = … Put p = new Put("RowKey".getBytes()); p.add("Family".getBytes(), "Qualifier".getBytes(), cellValue); table.put(p);

33

Chemin d’écriture

34

HBase Java API

Ò Get Get g = new Get("RowKey".getBytes()); g.addColumn("Family".getBytes(), "Qualifier".getBytes()); Result r = table.get(g);

Ò Result byte[] cellValue = r.getValue("Family".getBytes(), "Qualifier".getBytes());

35

HBase Java API

Ò Scan Scan scan = new Scan(); scan.setStartRow("StartRow".getBytes()); scan.setStopRow("StopRow".getBytes()); scan.addColumn("Family".getBytes(), "Qualifier".getBytes()); ResultScanner rs = table.getScanner(scan); for (Result r : rs) { // … }

36

Chemin de lecture

37

Sommaire


38



39


40

Vous ?

Etude de cas

Moteur de recherche

42

Table Wikipedia.fr

43

Création de la table Wikipedia

HTableDescriptor desc = new HTableDescriptor("wikipedia");

HColumnDescriptor contentFamily = new

HColumnDescriptor("content".getBytes()); contentFamily.setMaxVersions(1); desc.addFamily(contentFamily); HColumnDescriptor linksFamily = new

HColumnDescriptor("links".getBytes()); linksFamily.setMaxVersions(1); desc.addFamily(linksFamily); Configuration conf = HBaseConfiguration.create(); HBaseAdmin admin = new HBaseAdmin(conf); admin.createTable(desc); 44

HBase API : insertion de données par les Crawlers

Put p = new Put(Util.toBytes(url)); p.add("content".getBytes(), "text".getBytes(), htmlParseData.getText().getBytes()); List<WebURL> links = htmlParseData.getOutgoingUrls(); int count = 0; for (WebURL outGoingURL : links) { p.add("links".getBytes(), Bytes.toBytes(count++), outGoingURL.getURL().getBytes()); } p.add("links".getBytes(), "count".getBytes(), Bytes.toBytes(count)); try { table.put(p); } catch (IOException e) { e.printStackTrace(); } 45

Table de l’index inversé

46

MapReduce

47

MapReduce Index inversé – algorithme

method map(url, text) for all word ∈ text do emit(word, url)

method reduce(word, urls) count ← 0 for all url ∈ urls do put(word, "links", count, url) count ← count + 1 put(word, "links", "count", count)

48

MapReduce Index inversé – Configuration

public static void main(String[] args) throws Exception { Configuration conf = HBaseConfiguration.create(); Job job = new Job(conf, "Reverse Index Job"); job.setJarByClass(InvertedIndex.class);

Scan scan = new Scan(); scan.setCaching(500); scan.addColumn("content".getBytes(), "text".getBytes()); TableMapReduceUtil.initTableMapperJob("wikipedia", scan, Map.class, Text.class, Text.class, job); TableMapReduceUtil.initTableReducerJob("wikipedia_ri", Reduce.class, job);

job.setNumReduceTasks(1); System.exit(job.waitForCompletion(true) ? 0 : 1);

} 49

MapReduce Index inversé – Map

public static class Map extends TableMapper<Text, Text> { private static final Pattern PATTERN = Pattern.compile(SENTENCE_SPLITTER_REGEX); private Text key = new Text(); private Text value = new Text();

@Override protected void map(ImmutableBytesWritable rowkey, Result result, Context context) { byte[] b = result.getValue("content".getBytes(), "text".getBytes()); String text = Bytes.toString(b); String[] words = PATTERN.split(text); value.set(result.getRow()); for (String word : words) { key.set(word.getBytes()); try { context.write(key, value); } catch (Exception e) { e.printStackTrace(); } } } }

50

MapReduce Index inversé – Reduce

public static class Reduce extends TableReducer<Text, Text, ImmutableBytesWritable> { @Override protected void reduce(Text rowkey, Iterable<Text> values, Contex context) { Put p = new Put(rowkey.getBytes()); int count = 0; for (Text link : values) { p.add("links".getBytes(), Bytes.toBytes(count++), link.getBytes()); } p.add("links".getBytes(), "count".getBytes(), Bytes.toBytes(count)); try { context.write(new ImmutableBytesWritable(rowkey.getBytes()), p); } catch (Exception e) { e.printStackTrace(); } } } 51

Question

Comment trier par ordre d’importance les résultats d’une recherche ?

52

Comptage des liens

53

Comptage pondéré

54

Définition récursive

55

Un peu d’algèbre linéaire

56

PageRank

57

Moteur de recherche

58

Digression 1

59

Google@Stanford, Larry Page et Sergey Brin – 1998 Ò 2-proc Pentium II 300mhz, 512mb, five 9gb drives Ò 2-proc Pentium II 300mhz, 512mb, four 9gb drives Ò 4-proc PPC 604 333mhz, 512mb, eight 9gb drives Ò 2-proc UltraSparc II 200mhz, 256mb, three 9gb drives, six

4gb drives Ò Disk expansion, eight 9gb drives Ò Disk expansion, ten 9gb drives

CPU 2933 MHz (sur 10 CPUs)

RAM 1792 MB

HD 366 GB

Digression 2

60

Google 1999 – Premier serveur de production

Une formule sans maths J

Données + science + perspicacité = valeur

61

Sécurité

Sécurité

Ò Kerberos Ò Access Control List

63

Sommaire


64

Ecosystème Hadoop

Ecosystème Hadoop

Ò Hadoop Ò Pig Ò Hive Ò Mahout Ò Whirr

65

Thank you

valtech - big data & nosql : au-delà du nouveau buzz

Technology