scalable density clustering for spark - crim.ca · 3 apache • popular distributed in-memory...

of 26 /26
WWW.CRIM.CA Principal partenaire financier Scalable Density Clustering for Spark THOMAS TRIPLET, PH.D., ENG. MARCH 9 TH 2016

Author: others

Post on 06-Sep-2019

2 views

Category:

Documents


0 download

Embed Size (px)

TRANSCRIPT

  • W W W . C R I M . C A

    Principal partenaire financier

    Scalable Density Clustering for Spark

    THOMAS TRIPLET, PH.D., ENG.

    MARCH 9TH 2016

  • 2

    TECHNOLOGIES BIG-DATA

    • Hadoop Core – HDFS: Système de fichiers distribué – YARN: Gestion des ressources CPU et planification – MapReduce: Traitement en lot (batch) des données à grande échelle

    • Écosystème Hadoop – NoSQL: HBase, Cassandra, Accumulo, etc… – SQL: Hive, Stinger (Hortonworks), Impala (Cloudera), Presto (FB), Tajo, Drill (MapR) – Transfert: Sqoop, Flume – Calcul/ML: Spark, Storm, Giraph, Mahout – Scripts: Pig, Cascading – Administration: Hue, ZooKeeper, Knox – Recherche: Solr, ElasticSearch

  • 3

    APACHE

    • Popular distributed in-memory computing framework • 10-100x faster than Hadoop MapReduce and low latency • Linear horizontal scalability • Fault tolerant (RDDs) • Applications range from long-running batch jobs to stream processing • High-level Scala, Java, Python and R APIs

  • 4

    AGENDA

    • Clustering algorithms (unsupervised learning) – Distance-based (k-means) – Density-based (DBSCAN)

    • PatchWork – Algorithm – Results – Performance

    • Conclusion

    • Future Work

  • 5

    • Class labels are known and pre-defined

    • Training and testing datasets are (manually) labeled with same classes

    • Goal is to learn function/rule that can classify new data points

    • Examples: SVMs, Neural nets, Bayesian classifiers, Decision trees…

    • Class labels of the data are unknown

    • Group/cluster similar data points without prior knowledge

    • Goal is to discover structure or pattern in the data

    • Examples: k-means, EM, DBScan, HCA…

    INTRODUCTION: MACHINE LEARNING

    Supervised Learning Unsupervised Learning (clustering)

  • 5

    • Class labels are known and pre-defined

    • Training and testing datasets are (manually) labeled with same classes

    • Goal is to learn function/rule that can classify new data points

    • Examples: SVMs, Neural nets, Bayesian classifiers, Decision trees…

    • Class labels of the data are unknown

    • Group/cluster similar data points without prior knowledge

    • Goal is to discover structure or pattern in the data

    • Examples: k-means, EM, DBScan, HCA…

    INTRODUCTION: MACHINE LEARNING

    Supervised Learning Unsupervised Learning (clustering)

    PatchWork ➔

  • 6

    • Popular algorithm: k-means 
(implemented in MLLib)

    • Relies on distance function between data points

    • Easy to implement • Linear complexity (big-data) • Easy to distribute

    • Discovers spherical clusters of similar sizes only

    • Sensitive to noise and local optima

    • Prior knowledge of k.

    • Popular algorithm: DBScan
(not in MLLib)

    • Relies on the density of data points in feature space

    • Natural protection against noise and outliers

    • Discovers clusters of arbitrary shape and size

    • No prior knowledge of k

    • Discovers clusters of similar densities only

    • Quadratic complexity: not scalable

    INTRODUCTION: CLUSTERING

    Distance-based Density-based

  • 6

    • Popular algorithm: k-means 
(implemented in MLLib)

    • Relies on distance function between data points

    • Easy to implement • Linear complexity (big-data) • Easy to distribute

    • Discovers spherical clusters of similar sizes only

    • Sensitive to noise and local optima

    • Prior knowledge of k.

    • Popular algorithm: DBScan
(not in MLLib)

    • Relies on the density of data points in feature space

    • Natural protection against noise and outliers

    • Discovers clusters of arbitrary shape and size

    • No prior knowledge of k

    • Discovers clusters of similar densities only

    • Quadratic complexity: not scalable

    INTRODUCTION: CLUSTERING

    Distance-based Density-based

    PatchWork ➔

  • 7

    PATCHWORK ALGORITHM

    2 main steps:

    1. createCells( dataPoints ) à cells à RDD[(string, int)]

    2. createClusters( cells) à clusters

  • 8

    STEP 1: CELL CREATION

  • 9

    STEP 1: CELL CREATION

  • 10

    STEP 1: CELL CREATION

    4( -1,2 ; )

    4( -1,3 ; )

    4( -2,2 ; )

    1( -3,4 ; )

    4( 2,3 ; )

    3( 2,4 ; )

    3( 3,3 ; )

    3( 3,4 ; )

  • 11

    STEP 1: CELL CREATION

    1( -1,2 ; )

    1( -2,2 ; )

    1( -1,2 ; )

    1( -1,2 ; )

    .

    .

    .

    1( 3,4 ; )

    1( 3,4 ; )

    1( 3,4 ; )

    )

    )

    )

    )

    )

    )

    )

    )

    4( -1,2 ;

    4( -1,3 ;

    4( -2,2 ;

    1( -3,4 ;

    4( 2,3 ;

    3( 2,4 ;

    3( 3,3 ;

    3( 3,4 ;

    setOfCells = dataPoints.map(Pà(cellID(P),1)) .reduceByKey(_ + _)

  • 12

    STEP 2: CLUSTER CREATION

  • 13

    EXPERIMENTAL SETUP

    • 6 servers, each with: – Intel Xeon E5-2650 8 cores @2.6GHz – 192GB memory – 30TB storage

    • Cloudera CDH 5.4.0 • Apache Spark 1.3

  • 14

    DATASETSAggregation Compound

    Jain Spiral

  • 15

    RESULTS (JAIN DATASET)K-means

    DBScan PatchWork

  • 16

    RESULTS (SPIRAL DATASET)K-means

    DBScan PatchWork

  • 17

    RESULTS (AGGREGATION DATASET)K-means

    DBScan PatchWork

  • 18

    RESULTS (COMPOUND DATASET)K-means

    DBScan PatchWork

  • 19

    PERFORMANCER

    unni

    ng T

    ime

    (sec

    onds

    )

    1

    10

    100

    1,000

    10,000

    100,000

    Millions of data points10,000.0 100,000.0 1,000,000.0 10,000,000.0 100,000,000.0 1,000,000,000.0 10,000,000,000.0

    DBSCAN PatchWork MLLib k-means ||

  • 20

    PERFORMANCE: SCALABILITYN

    orm

    aliz

    ed e

    xecu

    tion-

    time

    0

    0.25

    0.5

    0.75

    1

    Number of servers1 2 3 4 5

    MLLib k-means|| PatchWork

  • 21

    CONCLUSION

  • 22

    FUTURE WORK

    • Tests against new clustering algorithms available in Spark 1.6

    • Better distribution of step 2

    • Indexing for region query using R-trees

    • Streaming version

  • Q & A

    Contact: [email protected]

    Availability: https://github.com/crim-ca/patchwork (MIT Licence)

    Reference: Frank Gouineau, Tom Landry, Thomas Triplet (2016) PatchWork, a Scalable Density-Grid Clustering Algorithm. In Proc. 31th ACM Symposium On Applied Computing, Data-Mining track

  • WWW.CRIM.CA

    Suivez-nous Dialoguez avec nous Suivez-nous #CRIM_ca wwwCRIMca

    Tous droits réservés © 2016 CRIM. 405, avenue Ogilvy, bureau 101, Montréal (Québec) H3N 1M3/514 840-1234/1 877 840-2746

    Thomas Triplet, Ph.D., Eng. 
[email protected]

    Principal partenaire financierLe CRIM est un centre de recherche appliquée en TI qui développe, en mode collaboratif avec ses clients et partenaires, des technologies innovatrices et du savoir-faire de pointe, et les transfère aux entreprises et aux organismes québécois afin de les rendre plus productifs et plus compétitifs localement et mondialement. Le CRIM dispose de quatre équipes de recherche en TI de calibre mondial. Le CRIM œuvre principalement dans les domaines des interactions et interfaces personne-système, de l’analytique avancée et des architectures et technologies avancées de développement et tests. Détenteur d’une certification ISO 9001:2008, son action s’inscrit dans les politiques et stratégies pilotées par le ministère de l'Économie, de l'Innovation et des Exportations (MEIE), son principal partenaire financier.