data mining intoutsi/dm1.sose19/lectures/1... · 2019-04-10 · data mining –data science –big...

69
Data Mining I Summer semester 2019 Lecture 1: Introduction Lectures: Prof. Dr. Eirini Ntoutsi TAs: Tai Le Quy, Vasileios Iosifidis, Maximilian Idahl, Shaheer Asghar, Wazed Ali Fakultät für Elektrotechnik und Informatik Institut für Verteilte Systeme AG Intelligente Systeme - Data Mining group

Upload: others

Post on 01-Apr-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Data Mining I

Summer semester 2019

Lecture 1: Introduction

Lectures: Prof. Dr. Eirini Ntoutsi

TAs: Tai Le Quy, Vasileios Iosifidis, Maximilian Idahl, Shaheer Asghar, Wazed Ali

Fakultät für Elektrotechnik und InformatikInstitut für Verteilte Systeme

AG Intelligente Systeme - Data Mining group

Page 2: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

About me

03/2016 - present Associate Professor Faculty of Electrical Engineering & Computer Science Leibniz University Hannover L3S Research Center (since May 2016)

02/2012 - 02/2016 Post-doctoral researcher & lecturer Institute for Informatics, LMU Munich, Germany

02/2010 - 01/2012 Alexander von Humboldt postdoc fellowInstitute for Informatics, LMU Munich, Germany

2009: Data Mining Expert National Hellenic Organization (OTE), Athens, Greece

04/2007 – 02/2009 Co-Founder and AI expert

NeeMo Startup, Greece

09/2003 – 09/2008 PhD in Data MiningUniversity of Piraeus, Athens, Greece

09/2001 – 09/2003 MSc, Computer Science/ Text MiningPolytechnic School, University of Patras, Greece

09/1996 – 09/2001 Diploma, Computer Engineering and Informatics/ AI GamesPolytechnic School, University of Patras, Greece

2Learning from streaming data

Current focus areas:• Data Stream Mining/ Adaptive Machine Learning• Responsible AI: Fairness-Aware Machine Learning

Page 3: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Outline

■ Why to study Data Mining?

■ Why we need Data Mining?

■ What is the KDD (Knowledge Discovery in Databases) process?

■ Main data mining tasks

■ Course logististics

■ Things you should know from this lecture

■ Homework/ Tutorial

3Data Mining I @SS19: Introduction

Page 4: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Why to study Data Mining/Machine Learning – famous quotes*

■ “A breakthrough in machine learning would be worth ten Microsofts” (Bill Gates, Chairman, Microsoft)

■ “Machine learning is the next Internet” (Tony Tether, Director, DARPA)

■ “Machine learning is the hot new thing” (John Hennessy, President, Stanford)

■ “Web rankings today are mostly a matter of machine learning” (Prabhakar Raghavan, Dir. Research, Yahoo)

■ “Machine learning is going to result in a real revolution” (Greg Papadopoulos, Former CTO, Sun)

■ “Machine learning today is one of the hottest aspects of computer science” (Steve Ballmer, CEO, Microsoft)

4

*Source: Pedro Domingos http://courses.cs.washington.edu/courses/cse446/15sp/slides/intro.pdf

Data Mining I @SS19: Introduction

Disclaimer: I use the terms data mining and machine learning (sometimes also Artificial Intelligence (AI) interchangeably here and through the lecture.We will discuss the similarities/differences later. In both cases, we talk to learning from data.

Page 5: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Data Mining – Data Science – Big Data – Machine Learning – Deep Learning Analytics …

■ New fancy words for knowledge discovery from data

❑ Data mining, machine learning have been focusing on knowledge discovery from data for decades

❑ Well defined set of tasks and solutions

■ Big data and analytics are more business terms and ill-defined

■ The same holds today for AI

5

“Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.“

Source: Dan Ariely, Duke University

Data Mining I @SS19: Introduction

Page 6: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Ever increasing interest … the ``rebranding’’ effect

6

Source: Google trends, query on 9.4.2019

Data Mining I @SS19: Introduction

Page 7: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Why to study Data Mining - Data Scientist: The sexiest job of 21st century

7

“If “sexy” means having rare qualities that are much in demand, data scientists are alreadythere. They are difficult and expensive to hire and, given the very competitive market for theirservices, difficult to retain. There simply aren’t a lot of people with their combination ofscientific background and computational and analytical skills.”

Source: Harvard Business Review. Data Scientist: The Sexiest Job of the 21st Century. October 2012 link

Data Mining I @SS19: Introduction

Source: https://www.slideshare.net/IBMBDA/myths-and-mathemagical-superpowers-of-data-scientists

Page 8: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

A good conjuncture for ML/DM/DS (data-driven learning)

8Data Mining I @SS19: Introduction

Data deluge Machine Learningadvances

Computer power Enthusiasm

Page 9: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

World-wide competition on Artificial Intelligence (AI)

■ From FORSCHUNGSGIPFEL 2019 - Künstliche Intelligenz – Innovationstreiber einer neuen Generation, 19/3/2019, Berlin

■ Cédric Villani’s talk on the geopolitics of AIThere are 3 fierce competitions

■ Competition for human talent

■ Competition for infrastructure

■ Competition for data

■ More info on

❑ Track #fogipf19 in Twitter

❑ Check videos online: http://www.forschungsgipfel.de/2019/videos

9Data Mining I @SS19: Introduction

Page 10: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Outline

■ Why to study Data Mining?

■ Why we need Data Mining?

■ What is the KDD (Knowledge Discovery in Databases) process?

■ Main data mining tasks

■ Course logististics

■ Things you should know from this lecture

■ Homework/ Tutorial

11Data Mining I @SS19: Introduction

Page 11: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Why we need Data Mining

■ Huge amounts of data are collected nowadays from different application domains

■ “We are drowning in information but starving for knowledge” John Naibett link

■ The amount and the complexity of the collected data does not allow for manual analysis.

Telecommunication

Astronomy

Banks Biology

Internet

Supermarkets

12

IoT

Data Mining I @SS19: Introduction

Page 12: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Examples of data sources: The Internet

■ Internet users

13

Web 2.0: A world of opinionsUser generated content

Data Mining I @SS19: Introduction

Source: http://www.internetlivestats.com/internet-users/

Page 13: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Examples of data sources: Internet of things

■ The Internet of Things (IoT) is the network of physical objects or "things" embedded with electronics, software, sensors, and network connectivity, which enables these objects to collect and exchange data.

Source: https://en.wikipedia.org/wiki/Internet_of_Things

14

Image source:http://tinyurl.com/prtfqxf

Source: http://blogs.cisco.com/diversity/the-internet-of-things-infographic

During 2008, the number of things connected to the internet surpassed the number of people on earth… By 2020 there will be 50 billion … vs 7.3 billion people (2015).

These things are everything, smartphones, tablets, refrigerators …. cattle.

Data Mining I @SS19: Introduction

Page 14: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Examples of data sources: data intensive science

15

Slide from:http://research.microsoft.com/en-us/um/people/gray/talks/nrc-cstb_escience.ppt

“Increasingly, scientific breakthroughs willbe powered by advanced computingcapabilities that help researchersmanipulate and explore massive datasets.”

-The Fourth Paradigm – Microsoft

Examples of e-science applications:• Earth and environment• Health and wellbeing

− E.g., The Human Genome Project (HGP)

• Citizen science• Scholarly communication• Basic science

− E.g., CERN

Data Mining I @SS19: Introduction

Page 15: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Examples of data sources: Manufacturing

■ Andrew Ng Says Factories Are AI’s Next Frontier

Source: https://www.technologyreview.com/s/609770/andrew-ng-says-factories-are-ais-next-frontier/

16

Image source: https://images.readwrite.com/wp-content/uploads/2018/03/AAEAAQAAAAAAAAueAAAAJDY1NmFl

N2NhLWExZTUtNDRhNy1iMWQ5LTViZGM3NTFlODczYQ.jpg

Companies are making major investments in AI and industrial analytics to help drive their digital transformation

Data Mining I @SS19: Introduction

Image source: https://cdn-sv1.deepsense.ai/wp-content/uploads/2018/04/Spot-the-flaw-Visual-quality-control-in-

manufacturing-1140x337.jpg

Page 16: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Examples of data sources: We … the data subjects

■ Wherever we go, we are "datafied".

■ Smartphones are tracking our locations.

■ We leave a data trail in our web browsing.

■ Interaction in social networks.

■ Privacy is an important issue … not covered though in this lecture → privacy aware data mining

❑ Check the EU General Data Protection Regulation (https://eugdpr.org/)

■ e.g., https://www.whitecase.com/publications/article/chapter-5-key-definitions-unlocking-eu-general-data-protection-regulation

17Data Mining I @SS19: Introduction

Page 17: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

From data to knowledge of different types

18Data Mining I @SS19: Introduction

Data Methods Knowledge

Call records

Movie ratings

Telescope images

Outlier Detection Detect fraud cases

Collaborative filtering Recommend movies to users

ClassificationIs it an «early», «intermediate» or «late formation» star?

News articles ClusteringWhat are the topics people discuss about in the news today?

Page 18: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Short break (5’) – Get to know us better

■ What is your field of study?

❑ Informatik? Informationstechnik? Elektrotechnik?

■ What is your interest in the data mining field?

■ Are there people from Physics, Medicine, Engineering Sciences in the audience?

19Data Mining I @SS19: Introduction

Page 19: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Outline

■ Why to study Data Mining?

■ Why we need Data Mining?

■ What is the KDD (Knowledge Discovery in Databases) process?

■ Main data mining tasks

■ Course logististics

■ Things you should know from this lecture

■ Homework/ Tutorial

20Data Mining I @SS19: Introduction

Page 20: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

What is KDD

Knowledge Discovery in Databases (KDD) is the nontrivial process of identifying valid, novel, potentially

useful, and ultimately understandable patterns in data.

[Fayyad, Piatetsky-Shapiro, and Smyth 1996]

Remarks:

● valid: the discovered patterns should also hold for new, previously unseen problem instances.

● novel: at least to the system and preferably to the user

● potentially useful: they should lead to some benefit to the user or task

● ultimately understandable: the end user should be able to interpret the patterns either immediately or after some post-processing

21Data Mining I @SS19: Introduction

Clarification: The term databases does not refer exclusively to relational databases storing structured data … it can be any data storage and also structured, semi-structured, non-structured data

Page 21: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

The KDD process and the Data Mining step

22

Patterns

Knowledge

[Fayyad, Piatetsky-Shapiro & Smyth, 1996]

Transformed data

Target data

Preprocessed data

Sele

ctio

n:

•Se

lect

a r

elev

ant

dat

aset

or

focu

s o

n a

su

bse

t o

f a

dat

aset

•Fi

le /

DB

/

Pre

pro

cess

ing/

Cle

anin

g:•

Inte

grat

ion

of

dat

a fr

om

d

iffe

ren

t d

ata

sou

rces

•N

ois

e re

mo

val

•M

issi

ng

valu

es

Tran

sfo

rmat

ion

:•

Sele

ct u

sefu

l fea

ture

s•

Feat

ure

tra

nsf

orm

atio

n/

dis

cret

izat

ion

•D

imen

sio

nal

ity

red

uct

ion

Dat

a M

inin

g:•

Sear

ch f

or

pat

tern

s o

f in

tere

st

Eval

uat

ion

:•

Eval

uat

e p

atte

rns

bas

ed o

n

inte

rest

ingn

ess

mea

sure

s•

Stat

isti

cal v

alid

atio

n o

f th

e M

od

els

•V

isu

aliz

atio

n•

Des

crip

tive

Sta

tist

ics

Data

Data Mining I @SS19: Introduction

Page 22: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

A modern version: The Data Science process

23Data Mining I @SS19: Introduction

Page 23: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

The interdisciplinary nature of KDD 1/2

24

KDD

Machine Learning

Databases

Statistics

Data visualization

Pattern recognition

Algorithms Other disciplines

Data Mining I @SS19: Introduction

Page 24: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

The interdisciplinary nature of KDD 2/2

25

Statistics Machine Learning

Databases

KDD

Model based inferenceFocus on numerical

data

Theory + methodsFocus on small datasets

Scalability to large data setsNew data types (web data, micro-arrays, social data ...)

Integration with commercial databases[Chen, Han & Yu 1996]

[Berthold & Hand 1999] [Mitchell 1997]

Data Mining I @SS19: Introduction

Page 25: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

How do machines learn?

■ ML “gives computers the ability to learn without being explicitly programmed” (Arthur Samuel, 1959)

■ We don’t codify the solution. We don’t even know it!

■ Data is the key & the learning algorithm

26Data Mining I @SS19: Introduction

Algorithms

Models

Models

(semi)Automatic

decision making

Data

How can we build computer programs that automatically improve with experience?

Tom Mitchell, Machine Learning book

Page 26: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

More formally: How do machines learn?

■ A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Tom Mitchell, Machine Learning 1997.

■ Example: A backgammon learning problem

❑ Task T: playing backgammon

❑ Performance measure P: % of games won against opponents

❑ Training experience E: playing practice games against itself

■ Example: Exam performance

❑ Task T: predict whether a student will pass the final DM exam or not

❑ Experience E: historical records of students that took the DM exam

❑ Performance measure P: % of correctly identified students

27Data Mining I @SS19: Introduction

Page 27: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

(Machine) Learning from experience/feedback 1/2

■ “Experience comes in terms of data (the so called, instances or examples) from the specific problem/ application”

■ Datasets consists of instances (also known as examples or objects)

❑ e.g., in a university database: students, professors, courses, grades,…

❑ e.g., in a library database: books, users, loans, publishers, ….

❑ e.g., in a movie database: movies, actors, director,…

■ Instances are described through features (also known as attributes or variables)

❑ E.g. a course is described in terms of a title, description, lecturer, teaching frequency etc.

❑ An easy to visualize example: if our data are in a database table, the rows are the instances and the columns are the features.

28Data Mining I @SS19: Introduction

Page 28: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

(Machine) Learning from experience/feedback 2/2

■ Except for the instance description, we might also have feedback on those instances from some “teacher”/”expert“

❑ E.g., whether a student passed the exam

■ The direct feedback is known as label, i.e., each instance is associated with a label labeleddataset

■ But we might have no feedback at all unlabeled dataset

■ There might be also indirect feedback

29Data Mining I @SS19: Introduction

Unlabeled datasetLabeled dataset

Lecture 2 is devoted on getting to know our data!!!

Page 29: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Short break (5’) – Modeling students data for the exam performance task

■ Recall our learning example

■ Example: Exam performance

❑ Task T: predict whether a student will pass the final DM exam or not

❑ Experience E: historical records of students that took the DM exam

❑ Performance measure P: % of correctly identified students

■ If students are the learning instances, what sort of features could I use to describe each of them?

■ What could be the feedback (direct, indirect) for the learning model (if any)?

30Data Mining I @SS19: Introduction

Page 30: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Outline

■ Why to study Data Mining?

■ Why we need Data Mining?

■ What is the KDD (Knowledge Discovery in Databases) process?

■ Main data mining tasks

■ Course logististics

■ Things you should know from this lecture

■ Homework/ Tutorial

31Data Mining I @SS19: Introduction

Page 31: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Different learning tasks

Based on the feedback we have on the data, we can distinguish between:

■ Direct-feedback instances

❑ the correct response /label is provided for each instance by the “teacher”

❑ e.g., good or bad product

■ No-feedback instances

❑ no evaluation/label of the instance is provided, since there is no “teacher“

❑ e.g., no information on whether a product is good or bad, just the description of the product/instance

■ Indirect-feedback instances

❑ less feedback is given, since not the proper action, but only an evaluation of the chosen action is given by the teacher

32Data Mining I @SS19: Introduction

Supervised learning

Reinforcement learning

Unsupervised learning

Page 32: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Different learning tasks: Supervised learning

■ Supervised learning/ Predictive:

❑ A description of the instances and their class labels is available (training set)

❑ The goal is to learn a mapping from the instances to the class labels, i.e., given a future unseen instance to predict its class label

■ Typical examples covered in this lecture:

❑ Classification

❑ Outlier detection

❑ Regression

33Data Mining I @SS19: Introduction

Page 33: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Classification: an example

■ The goal is to learn a mapping from the “height, width space” to the class space (nails, screw,paper clips)

■ For the new objects, the result of the classification if one of the class labels {nails, screw,paper clips}

34Data Mining I @SS19: Introduction

Screw

Nails

Paper clips

Hei

ght

[cm

]

Width[cm]

instance width height class

1 2,6 4,5 Screw

2 3,7 7,3 Nails

3 4,1 6,5 Paper Clips

4 8,5 8,1 Screw

5 9,5 5,5 Nails

… … … …

New objectNew object

Page 34: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Classification applications 1/2

■ Application: Fraud Detection

❑ Goal: Predict fraudulent cases in credit card transactions.

❑ Approach:

■ Use credit card transactions and the information on its account-holder as attributes.

❑ When does a customer buy, what does he buy, how often he pays on time, etc

■ Label past transactions as fraud or fair transactions. This forms the class attribute.

■ Learn a model for the class of the transactions.

■ Use this model to detect fraud by observing credit card transactions on an account.

35Data Mining I @SS19: Introduction

Page 35: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Classification applications 2/2

■ Application: Churn prediction in telco

❑ Goal: Predict whether a customer is likely to be lost to a competitor

❑ Approach:

■ Use detailed record of transactions with each of the past and present customers, to find attributes.

❑ How often the customer calls, where he calls, what time-of-the day he calls most, his financial status, marital status, etc.

■ Label the customers as loyal or disloyal (class attribute).

■ Find a model for customer loyalty

■ Use this model to predict churn and organize possible retain strategies.

36Data Mining I @SS19: Introduction

Page 36: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Example: Google News

37Data Mining I @SS19: Introduction

Page 37: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

A huge variety of classification algorithms

38Data Mining I @SS19: Introduction

Decision trees k nearest neighbours

Support vector machines

Neural networks Bayesian classifiers

Ensembles

Page 38: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Supervised learning: Regression

■ Similar to classification, but the feature-result to be learned is continuous rather than discrete.

■ Goal: Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.

39Data Mining I @SS19: Introduction

Given this data, a friend has a house 750 square feet - how much can they be expected to get?

Source: Andrew Ng ML course, Coursera

Page 39: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Application: Precision farming

■ Create a production curve depending on multiple parameters like soil characteristics, weather, used fertilizers.

■ Only the appropriate amount of fertilizers given the environmental settings (soil, weather) will result in maximum yield.

■ Controlling the effects of over-fertilization on the environment is also important

40

Water capacity

Soil parametersWeatherFertilizers

Fertilizers

productionproduction

curve

Data Mining I @SS19: Introduction

Page 40: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Different learning tasks: Unsupervised learning

■ Unsupervised learning/ Descriptive:

❑ Only a description of the instances is available

❑ No feedback/labels are available

❑ The goal is to discover groups of similar instances

■ Typical subtasks covered in this lecture:

❑ clustering

❑ association rules mining

❑ outlier detection

41Data Mining I @SS19: Introduction

Page 41: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Clustering: an example

■ Each point described in terms of its height and width

■ No information on the actual classes (nails, paper clips) is available to the clustering algorithm.

42

Cluster 1Cluster 2

Hei

ght

[cm

]

Width[cm]

Data Mining I @SS19: Introduction

instance width height

1 2,6 4,5

2 3,7 7,3

3 4,1 6,5

4 8,5 8,1

5 9,5 5,5

… … …

Page 42: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Clustering applications 1/2

Application: Market Segmentation

■ Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.

■ Approach:

❑ Collect different attributes of customers based on their geographical and lifestyle related information.

■ E.g., age, income, education, family status, ….

❑ Find clusters of similar customers.

❑ Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.

43Data Mining I @SS19: Introduction

Page 43: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Clustering applications 2/2

Application: Document clustering

■ Find groups of documents (topics) that are similar to each other based on the important terms appearing in them.

■ Approach:

❑ Identify important terms in each document.

❑ Form a similarity measure between documents.

❑ Cluster based on the similarity measure.

■ Gain:

❑ Help the end user to navigate in the collection of documents (based on the extracted clusters).

❑ Utilize the clusters to relate a new document or search term to clustered documents.

■ Check for example, Google News.

44Data Mining I @SS19: Introduction

Page 44: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Example: Google News

45Data Mining I @SS19: Introduction

Page 45: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

A huge variety of clustering algorithms

46Data Mining I @SS19: Introduction

Partitioning methods (k-Means)

Grid-based methods (CLIQUE)

Model-based methods (DBSCAN)

Hierarchical methods

Constraint-based methods

Model-based methods(EM)

Page 46: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Unsupervised learning: Association rules mining

■ Task: Find all rules in the database, in the following form:

If x, y, z are contained in a set M, then t is also contained in M with a probability of at least X%.

47

a,b,c,d,eb,c,da,b,c,da,b,c,d,ea,c,e,fd,c,e,fa,b,c,d,f

In 5 out of 7 cases (~71 %)b,c,d appear together

a,b,c,d,eb,c,da,b,c,da,b,c,d,ea,c,e,fd,c,e,fa,b,c,d,f

In 5 out of 5 cases (100 %) it holds that:If b,c then d also exists.

Data Mining I @SS19: Introduction

• a= milk• b=cheese• c =wine

• d= pasta• e= yogurt• f = apples

Page 47: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Application: Market basket analysis

■ Result:

❑ Frequently purchased items together may be better to be positioned close to each other: E.g. since diapers are often purchased together with beers => Place beer in the way from diapers to the checkout

❑ Generate recommendations for customers with similar baskets:=> e.g. Customers that bought „Star Wars“, might be also interested in „The lord of the rings “.

48

Shopping basket

DataWarehouse

Possible generalizations:• Paprika-Chips Snacks • Enrichment of customer data

Association rules

Data Mining I @SS19: Introduction

Page 48: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Unsupervised|Supervised Learning: Outlier detection

■ Outlier detection is defined as identification of non-typical data

■ Outliers might indicate

❑ possible abuse of credit cards, mobile phones

❑ data errors

❑ device failures

49Data Mining I @SS19: Introduction

Page 49: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Application

■ Analysis of the SAT.1-Ran-Soccer-Database (Season 1998/99)

❑ 375 players

❑ Primary attributes: Name, #games, #goals, playing position (goalkeeper, defense, midfield, offense),

❑ Derived attribute: Goals per game

❑ Outlier analysis (playing position, #games, #goals)

■ Result: Top 5 outliers

50

Rank Name # games #goals position Explanation

1 Michael Preetz 34 23 Offense Top scorer overall

2 Michael Schjönberg 15 6 Defense Top scoring defense player

3 Hans-Jörg Butt 34 7 Goalkeeper Goalkeeper with the most goals

4 Ulf Kirsten 31 19 Offense 2nd scorer overall

5 Giovanne Elber 21 13 Offense High #goals/per game

Data Mining I @SS19: Introduction

Note: “Outliers” is not necessarily a negative term.

Page 50: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Short break (5’) – Learning from the student data

■ Recall our learning example

■ Example: Exam performance

❑ Task T: predict whether a student will pass the final DM exam or not

❑ Experience E: historical records of students that took the DM exam

❑ Performance measure P: % of correctly identified students

■ If students are the learning instances, what sort of features could I use to describe each of them?

■ What could be the feedback/label for the learning model (if any)?

■ What could be a supervised learning task here?

❑ For classification? For prediction?

■ What could be an unsupervised learning task ?

❑ For clustering, frequent itemsets mining?

■ What could be an outlier detection problem here?

51Data Mining I @SS19: Introduction

Page 51: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Outline

■ Why to study Data Mining?

■ Why we need Data Mining?

■ What is the KDD (Knowledge Discovery in Databases) process?

■ Main data mining tasks

■ Course logististics

■ Things you should know from this lecture

■ Homework/ Tutorial

52Data Mining I @SS19: Introduction

Page 52: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Course logistics 1/3

■ Class schedule

❑ Lectures: Wednesdays, 12:15 - 13:45, Multimedia-Hörsaal (3703 - 023), Appelstraße 4.

❑ Tutorials: Monday: 10:00 - 11:30 , Monday: 13:30 - 15:00, Tuesday: 11:45 - 13:15, Tuesday: 13:30 - 15:00, Room 235, Gebaeude 3703, Appelstraße 4.

■ StudIP as a common information sharing place

❑ Up to date announcements and material

❑ Use the forum for your questions. They might benefit everyone!

■ Exam:

❑ Written exam, 90’

■ You are allowed to bring a hand-written A4 with formulas etc (No need to memorize) – Each student should have her own A4 (copied are not allowed)

❑ The exam will be based on the material discussed in the class plus the tutorials.

■ Exam date: Monday 28.8.2019, 08:30-11:00, Rooms: F 102, F 303

53Data Mining I @SS19: Introduction

Page 53: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Overview of the lectures (current planning)

1. Introduction

2. Getting to know our data

3. Association Rules Mining

4. Clustering

6. Classification

7. Outlier Detection

54Data Mining I @SS19: Introduction

Page 54: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Course logistics 2/3

■ Projects

❑ Focus on the complete KDD pipeline for two different learning tasks:

■ Classification: 15/5/2019 & Clustering: 26/6/2019

■ Groups of 2 (Please form the teams by yourselves)

■ Goal: how to run a data mining case study? From data preprocessing to transformation, learning algorithm, evaluation and presentation of the results. Both analysis and presentation part are important.

■ We will use Kaggle for result submission (but you have to submit the report separately)

■ We will have a poster session at the end where each team present its results

■ Bonus schema

❑ Pass both projects: you switch to the next best grade

■ e.g., from 1.7→1,3

❑ Each member ``inherits’’ the grade of the group

❑ Extra bonus for those that score best in Kaggle (system) & those with the best poster (voting)

55Data Mining I @SS19: Introduction

Page 55: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Course logistics 3/3

56Data Mining I @SS19: Introduction

■ Teaching Assistants

❑ Vasileios Iosifidis

■ Room 240, 2nd floor, Appelstraße 4

[email protected]

❑ Tai Le Quy

■ Room 010, Ground floor, Appelstraße 4

[email protected]

❑ Maximilian Idahl

■ -

❑ Wazed Ali

[email protected]

❑ Shaheer Asghar

[email protected]

■ Lecturer

❑ Prof. Dr. Eirini Ntoutsi

■ Room 203, 2nd floor, Appelstraße 4

Contact via email:[email protected]

Please use [DM1] in the subject

Page 56: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Tutorials: Organization

■ 4 tutorial groups

❑ Times:

■ Monday, 10:00 – 11:30 and 13:30 – 15:00

■ Tuesday, 11:45 – 13:15 and 13:30 – 15:00

❑ Room: 235, Appelstraße 4

❑ Registration for groups in Stud.IP, unlocked today at 20:00

57Data Mining I @SS19: Introduction

Page 57: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Tutorials: Why to attend

Why should you attend the tutorials?

Solving theoretical and algorithmic parts is an excellent preparation for the exam

Working on the implementation part is useful for the bonus projects

Each tutorial will consist of

1. a theoretical part (e.g., properties of an algorithm)

2. an algorithmic part (e.g., applying an algorithm on a particular dataset)

3. an implementation part (e.g., how to run a data mining analysis in Python)

58SoSe19: DM I - Tutorial

Page 58: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Tutorials: structure

1 worksheet per week

Announced after the lecture, so you have the chance to prepare beforehand

Solutions will be available via Stud.IP

Theoretical and algorithmic parts will be mainly presented at the blackboard (with help from you)

Implementation part in form of jupyter notebooks (interactive python)

Use Stud.IP for any tutorial related question.

This is the fastest way to get an answer

Your question might be relevant to other students

Posing and answering questions is a great way to learn

Or send an e-mail to [email protected]

59SoSe19: DM I - Tutorial

Page 59: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Tutorials: For next week (1st tutorial)

Take a look at python and jupyter notebooks

Installation:

Anaconda python distribution includes most packages needed for this course

Step-by-step installation guide for Windows/Mac/Linux: https://docs.anaconda.com/anaconda/

Quick start guide for jupyter notebooks: https://jupyter.readthedocs.io/en/latest/content-quickstart.html

Alternative:

Jupyter notebook environment Google Colab: https://colab.research.google.com/notebooks/welcome.ipynb

Free, requires no setup, runs entirely in the cloud

60SoSe19: DM I - Tutorial

Page 60: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Tutorials: Python resources

There are lots of great python tutorials

Recommended: http://scipy-lectures.org/intro/language/python_language.html

Official: https://docs.python.org/3/tutorial/

If you prefer video tutorials:

https://pythonprogramming.net/python-fundamental-tutorials/ or

https://youtu.be/YYXdXT2l-Gg

And on jupyter notebooks

https://medium.com/codingthesmartway-com-blog/getting-started-with-jupyter-notebook-for-python-4e7082bd5d46 or

http://opentechschool.github.io/python-data-intro/core/notebook.html or

https://youtu.be/HW29067qVWk

61SoSe19: DM I - Tutorial

Page 61: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Tutorials 1-2 plan

62SoSe19: DM I - Tutorial

Tutorial 1 + 2 will include introductions to

Python basics

Arrays in NumPy

Data manipulation with Pandas

Visualization using matplotlib

Data mining and analysis tools in scikit-learn

Goal: Running a data analysis process in python, from data selection to pattern evaluation

Useful for the projects

Learning-by-doing

Page 62: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Textbook and recommended readings

■ Textbook:

❑ Tan P.-N., Steinbach M., Kumar V., Introduction to Data Mining, Addison-Wesley, 2014

❑ New edition is expected in May 2019

■ Recommended readings

❑ Meira and Zaki, Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press, 2014

❑ Mitchell T. M., Machine Learning, McGraw-Hill, 1997

❑ Han J., Kamber M., Pei J., Data Mining: Concepts and Techniques, 3rd ed., Morgan Kaufmann, 2011

❑ C. Aggarwal, Data Mining the textbook, 2015

❑ Witten I. H., Frank E., Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann Publishers, 2016.

63Data Mining I @SS19: Introduction

Page 63: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Online resources

■ Machine Learning class by Andrew Ng, Stanford

❑ http://ml-class.org/

■ Tom Mitchel’s lectures on youtube

❑ www.youtube.com/playlist?list=PLAJ0alZrN8rD63LD0FkzKFiFgkOmEtltQ

■ Kdnuggets: Data Mining and Analytics resources

❑ http://www.kdnuggets.com/

64Data Mining I @SS19: Introduction

Page 64: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Tools

■ Several options for either commercial or free/ open source tools

❑ Check an up to date list at: http://www.kdnuggets.com/software/suites.html

■ Commercial tools offered by major vendors

❑ e.g., IBM, Microsoft, Oracle …

■ Free/ open source tools

65

Weka

Elki

R

SciPy + NumPy

OrangeRapid Miner (free, commercial versions)

Data Mining I @SS19: Introduction

Page 65: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Outline

■ Why to study Data Mining?

■ Why we need Data Mining?

■ What is the KDD (Knowledge Discovery in Databases) process?

■ Main data mining tasks

■ Course logististics

■ Things you should know from this lecture

■ Homework/ Tutorial

66Data Mining I @SS19: Introduction

Page 66: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Things you should know from this lecture

■ KDD definition

■ KDD process

■ DM step

■ Supervised vs Unsupervised learning

■ Main DM tasks

❑ Clustering

❑ Classification

❑ Regression

❑ Association rules mining

❑ Outlier detection

67Data Mining I @SS19: Introduction

Page 67: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Outline

■ Why to study Data Mining?

■ Why we need Data Mining?

■ What is the KDD (Knowledge Discovery in Databases) process?

■ Main data mining tasks

■ Course logististics

■ Things you should know from this lecture

■ Homework/ Tutorial

68Data Mining I @SS19: Introduction

Page 68: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Homework/ Tutorial

■ Homework: Think of some real world applications that you find suitable for Data Mining.

❑ Why?

❑ What type of patterns would you look for?

❑ Would you approach it as a supervised or unsupervised learning task?

■ Readings:

❑ Tan P.-N., Steinbach M., Kumar V book, Chapter 1.

❑ U. Fayyad, et al. (1995), “From Knowledge Discovery to Data Mining: An Overview” Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press.

69Data Mining I @SS19: Introduction

Page 69: Data Mining Intoutsi/DM1.SoSe19/lectures/1... · 2019-04-10 · Data Mining –Data Science –Big Data –Machine Learning –Deep Learning Analytics … New fancy words for knowledge

Acknowledgement

■ The slides are based on

❑ KDD I lecture at LMU Munich (Johannes Aßfalg, Christian Böhm, Karsten Borgwardt, Martin Ester, EshrefJanuzaj, Karin Kailing, Peer Kröger, Eirini Ntoutsi, Jörg Sander, Matthias Schubert, Arthur Zimek, Andreas Züfle)

❑ Introduction to Data Mining book slides at http://www-users.cs.umn.edu/~kumar/dmbook/

❑ Pedro Domingos Machine Lecture course slides at the University of Washington

❑ Machine Learning book by T. Mitchel slides at http://www.cs.cmu.edu/~tom/mlbook-chapter-slides.html

❑ Thank you to all TAs contributing to their improvement, namely Vasileios Iosifidis, Damianos Melidis, Tai Le Quy, Han Tran

70Data Mining I @SS19: Introduction