scalable density clustering for spark - crim.ca · 3 apache • popular distributed in-memory...

W W W . C R I M . C A

Principal partenaire financier

Scalable Density Clustering for Spark

THOMAS TRIPLET, PH.D., ENG.

MARCH 9TH 2016

2

TECHNOLOGIES BIG-DATA

• Hadoop Core – HDFS: Système de fichiers distribué – YARN: Gestion des ressources CPU et planification – MapReduce: Traitement en lot (batch) des données à grande échelle

• Écosystème Hadoop – NoSQL: HBase, Cassandra, Accumulo, etc… – SQL: Hive, Stinger (Hortonworks), Impala (Cloudera), Presto (FB), Tajo, Drill (MapR) – Transfert: Sqoop, Flume – Calcul/ML: Spark, Storm, Giraph, Mahout – Scripts: Pig, Cascading – Administration: Hue, ZooKeeper, Knox – Recherche: Solr, ElasticSearch

3

APACHE

• Popular distributed in-memory computing framework • 10-100x faster than Hadoop MapReduce and low latency • Linear horizontal scalability • Fault tolerant (RDDs) • Applications range from long-running batch jobs to stream processing • High-level Scala, Java, Python and R APIs

4

AGENDA

• Clustering algorithms (unsupervised learning) – Distance-based (k-means) – Density-based (DBSCAN)

• PatchWork – Algorithm – Results – Performance

• Conclusion

• Future Work

5

• Class labels are known and pre-defined

• Training and testing datasets are (manually) labeled with same classes

• Goal is to learn function/rule that can classify new data points

• Examples: SVMs, Neural nets, Bayesian classifiers, Decision trees…

• Class labels of the data are unknown

• Group/cluster similar data points without prior knowledge

• Goal is to discover structure or pattern in the data

• Examples: k-means, EM, DBScan, HCA…

INTRODUCTION: MACHINE LEARNING

Supervised Learning Unsupervised Learning (clustering)

5

• Class labels are known and pre-defined

• Training and testing datasets are (manually) labeled with same classes

• Goal is to learn function/rule that can classify new data points

• Examples: SVMs, Neural nets, Bayesian classifiers, Decision trees…

• Class labels of the data are unknown

• Group/cluster similar data points without prior knowledge

• Goal is to discover structure or pattern in the data

• Examples: k-means, EM, DBScan, HCA…

INTRODUCTION: MACHINE LEARNING

Supervised Learning Unsupervised Learning (clustering)

PatchWork ➔

6

• Popular algorithm: k-means (implemented in MLLib)

• Relies on distance function between data points

• Easy to implement

• Linear complexity (big-data)

• Easy to distribute

• Discovers spherical clusters of similar sizes only

• Sensitive to noise and local optima

• Prior knowledge of k.

• Popular algorithm: DBScan(not in MLLib)

• Relies on the density of data points in feature space

• Natural protection against noise and outliers

• Discovers clusters of arbitrary shape and size

• No prior knowledge of k

• Discovers clusters of similar densities only

• Quadratic complexity: not scalable

INTRODUCTION: CLUSTERING

Distance-based Density-based

6

• Popular algorithm: k-means (implemented in MLLib)

• Relies on distance function between data points

• Easy to implement

• Linear complexity (big-data)

• Easy to distribute

• Discovers spherical clusters of similar sizes only

• Sensitive to noise and local optima

• Prior knowledge of k.

• Popular algorithm: DBScan(not in MLLib)

• Relies on the density of data points in feature space

• Natural protection against noise and outliers

• Discovers clusters of arbitrary shape and size

• No prior knowledge of k

• Discovers clusters of similar densities only

• Quadratic complexity: not scalable

INTRODUCTION: CLUSTERING

Distance-based Density-based

PatchWork ➔

7

PATCHWORK ALGORITHM

2 main steps:

1. createCells( dataPoints ) à cells à RDD[(string, int)]

2. createClusters( cells) à clusters

8

STEP 1: CELL CREATION

9


10


4( -1,2 ; )

4( -1,3 ; )

4( -2,2 ; )

1( -3,4 ; )

4( 2,3 ; )

3( 2,4 ; )

3( 3,3 ; )

3( 3,4 ; )

11


1( -1,2 ; )

1( -2,2 ; )

1( -1,2 ; )

1( -1,2 ; )

.

.

.

1( 3,4 ; )

1( 3,4 ; )

1( 3,4 ; )

)

)

)

)

)

)

)

)

4( -1,2 ;

4( -1,3 ;

4( -2,2 ;

1( -3,4 ;

4( 2,3 ;

3( 2,4 ;

3( 3,3 ;

3( 3,4 ;

setOfCells = dataPoints.map(Pà(cellID(P),1)) .reduceByKey(_ + _)

12

STEP 2: CLUSTER CREATION

•

13

EXPERIMENTAL SETUP

• 6 servers, each with: – Intel Xeon E5-2650 8 cores @2.6GHz – 192GB memory – 30TB storage

• Cloudera CDH 5.4.0 • Apache Spark 1.3

14

DATASETSAggregation Compound

Jain Spiral

15

RESULTS (JAIN DATASET)K-means

DBScan PatchWork

16

RESULTS (SPIRAL DATASET)K-means

DBScan PatchWork

17

RESULTS (AGGREGATION DATASET)K-means

DBScan PatchWork

18

RESULTS (COMPOUND DATASET)K-means

DBScan PatchWork

19

PERFORMANCER

unni

ng T

ime

(sec

onds

)

1

10

100

1,000

10,000

100,000

Millions of data points10,000.0 100,000.0 1,000,000.0 10,000,000.0 100,000,000.0 1,000,000,000.0 10,000,000,000.0

DBSCAN PatchWork MLLib k-means ||

20

PERFORMANCE: SCALABILITYN

orm

aliz

ed e

xecu

tion-

time

0

0.25

0.5

0.75

1

Number of servers1 2 3 4 5

MLLib k-means|| PatchWork

21

CONCLUSION

22

FUTURE WORK

• Tests against new clustering algorithms available in Spark 1.6

• Better distribution of step 2

• Indexing for region query using R-trees

• Streaming version

Q & A

Contact: [email protected]

Availability: https://github.com/crim-ca/patchwork (MIT Licence)

Reference: Frank Gouineau, Tom Landry, Thomas Triplet (2016) PatchWork, a Scalable Density-Grid Clustering Algorithm. In Proc. 31th ACM Symposium On Applied Computing, Data-Mining track

WWW.CRIM.CA

Suivez-nous Dialoguez avec nous

Suivez-nous #CRIM_ca wwwCRIMca

Tous droits réservés © 2016 CRIM. 405, avenue Ogilvy, bureau 101, Montréal (Québec) H3N 1M3/514 840-1234/1 877 840-2746

Thomas Triplet, Ph.D., Eng. [email protected]

Principal partenaire financierLe CRIM est un centre de recherche appliquée en TI qui développe, en mode collaboratif avec ses clients et partenaires, des technologies innovatrices et du savoir-faire de pointe, et les transfère aux entreprises et aux organismes québécois afin de les rendre plus productifs et plus compétitifs localement et mondialement. Le CRIM dispose de quatre équipes de recherche en TI de calibre mondial. Le CRIM œuvre principalement dans les domaines des interactions et interfaces personne-système, de l’analytique avancée et des architectures et technologies avancées de développement et tests. Détenteur d’une certification ISO 9001:2008, son action s’inscrit dans les politiques et stratégies pilotées par le ministère de l'Économie, de l'Innovation et des Exportations (MEIE), son principal partenaire financier.

scalable density clustering for spark - crim.ca · 3 apache • popular distributed in-memory...

Documents