Pattern Mining Lec 2


  • 8/3/2019 Pattern Mining Lec 2

    1/30

    Introduction to Data Mining

    Saeed Salem

    Department of Computer Science

    North Dakota State University
    cs.ndsu.edu/~salem

    These slides are based on slides by Jiawei Han and Micheline Kamber and Mohammed Zaki

  • Slide 2/30

    Mining Frequent Patterns, Association and Correlations

    Basic concepts and a road map

    Efficient and scalable frequent itemset mining methods

    Mining various kinds of association rules

    From association mining to correlation analysis

    Constraint-based association mining

    Summary

  • Slide 3/30

    What Is Frequent Pattern Analysis?

    Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set

    First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining

    Motivation: Finding inherent regularities in data

    What products were often purchased together? Beer and diapers?!

    What are the subsequent purchases after buying a PC?

    What kinds of DNA are sensitive to this new drug?

    Can we automatically classify web documents?

    Applications

    Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.

  • Slide 4/30

    Why Is Freq. Pattern Mining Important?

    Discloses an intrinsic and important property of data sets

    Forms the foundation for many essential data mining tasks

    Association, correlation, and causality analysis

    Sequential, structural (e.g., sub-graph) patterns

    Pattern analysis in spatiotemporal, multimedia, time-series, and stream data

    Classification: associative classification

    Cluster analysis: frequent pattern-based clustering

    Data warehousing: iceberg cube and cube-gradient

    Semantic data compression: fascicles

    Broad applications

  • Slide 5/30

    Basic Concepts: Frequent Patterns and Association Rules

    Itemset X = {x1, …, xk}

    Find all the rules X → Y with minimum support and confidence

    support, s: probability that a transaction contains X ∪ Y

    confidence, c: conditional probability that a transaction having X also contains Y

    Let supmin = 50%, confmin = 50%

    Freq. Pat.: {A:3, B:3, D:4, E:3, AD:3}

    Association rules:

    A → D (60%, 100%)

    D → A (60%, 75%)

    (Venn diagram: customer buys beer, customer buys diaper, customer buys both)

    Transaction-id  Items bought
    10              A, B, D
    20              A, C, D
    30              A, D, E
    40              B, E, F
    50              B, C, D, E, F
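As a concrete check of these definitions, here is a minimal Python sketch computing the two rules above from the slide's transaction table (the helper names `support` and `confidence` are mine, not from the slides):

```python
db = {
    10: {"A", "B", "D"},
    20: {"A", "C", "D"},
    30: {"A", "D", "E"},
    40: {"B", "E", "F"},
    50: {"B", "C", "D", "E", "F"},
}

def support(itemset):
    """s(X): fraction of transactions containing every item of X."""
    items = set(itemset)
    return sum(items <= t for t in db.values()) / len(db)

def confidence(x, y):
    """c(X -> Y) = s(X u Y) / s(X)."""
    return support(set(x) | set(y)) / support(x)

# A -> D: support 60%, confidence 100%; D -> A: support 60%, confidence 75%
print(support({"A", "D"}), confidence({"A"}, {"D"}), confidence({"D"}, {"A"}))
```

With supmin = confmin = 50%, both rules clear the thresholds, matching the slide.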

  • Slide 6/30

    Closed Patterns and Max-Patterns

    A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!

    Solution: Mine closed patterns and max-patterns instead

    An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT'99)

    An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)

    Closed pattern is a lossless compression of freq. patterns

    Reducing the # of patterns and rules

  • Slide 7/30

    Closed Patterns and Max-Patterns

    Exercise. DB = {<a1, …, a100>, <a1, …, a50>}, Min_sup = 1.

    What is the set of closed itemsets?

    What is the set of max-patterns?

    What is the set of all frequent patterns?

    !!

  • Slide 8/30

    Closed Patterns and Max-Patterns

    Exercise. DB = {<a1, …, a100>, <a1, …, a50>}, Min_sup = 1.

    What is the set of closed itemsets?

    <a1, …, a100>: 1, <a1, …, a50>: 2

    What is the set of max-patterns?

    <a1, …, a100>: 1

    What is the set of all frequent patterns?

    All 2^100 − 1 nonempty subsets of {a1, …, a100}!!
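The closed/max definitions can be checked brute-force on the small database from the earlier Basic Concepts slide (the exercise's 100-item DB is far too large to enumerate); `sup`, `closed`, and `maximal` are assumed helper names, not from the slides:

```python
from itertools import combinations

db = [{"A", "B", "D"}, {"A", "C", "D"}, {"A", "D", "E"},
      {"B", "E", "F"}, {"B", "C", "D", "E", "F"}]
minsup = 3  # absolute support count

items = sorted(set().union(*db))

def sup(x):
    """Number of transactions containing every item of x."""
    return sum(set(x) <= t for t in db)

frequent = [frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k) if sup(c) >= minsup]

# Closed: frequent, and no proper superset has the same support.
closed = [x for x in frequent
          if not any(x < y and sup(y) == sup(x) for y in frequent)]
# Max: frequent, and no proper superset is frequent at all.
maximal = [x for x in frequent if not any(x < y for y in frequent)]
```

This recovers the slide-5 result: frequent = {A, B, D, E, AD}; closed = {B, D, E, AD} (A is absorbed by AD, which has the same support); max = {B, E, AD}.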

  • Slide 9/30

    Mining Frequent Patterns, Association and Correlations

    Basic concepts and a road map

    Efficient and scalable frequent itemset mining methods

    Mining various kinds of association rules

    From association mining to correlation analysis

    Constraint-based association mining

    Summary

  • Slide 10/30

    Brute Force Approach for Mining Frequent Itemsets

    BRUTEFORCE(D, I, minsup):
    1  foreach X ⊆ I do
    2      sup(X) ← COMPUTESUPPORT(X, D)
    3      if sup(X) ≥ minsup then
    4          print X, sup(X)
    5      end
    6  end

    Problem: lots of wasteful computation, since many of the candidates may NOT be frequent.
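A direct Python rendering of the BRUTEFORCE pseudocode above, run on a small example DB (the function and variable names are mine):

```python
from itertools import combinations

def brute_force(D, I, minsup):
    """Enumerate every non-empty X subset of I; keep those with sup(X) >= minsup."""
    results = {}
    for k in range(1, len(I) + 1):
        for X in combinations(sorted(I), k):
            sup = sum(set(X) <= t for t in D)  # COMPUTESUPPORT(X, D)
            if sup >= minsup:
                results[frozenset(X)] = sup
    return results

D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
found = brute_force(D, {"A", "B", "C", "D", "E"}, minsup=2)
```

The wasteful work is visible here: all 2^5 − 1 = 31 candidate subsets are examined, yet only 9 turn out to be frequent.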

  • Slide 11/30

    Scalable Methods for Mining Frequent Patterns

    The downward closure property of frequent patterns: any subset of a frequent itemset must be frequent

    If {beer, diaper, nuts} is frequent, so is {beer, diaper}

    i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}

    Scalable mining methods: Three major approaches

    Apriori (Agrawal & Srikant @ VLDB'94)

    Freq. pattern growth (FPgrowth: Han, Pei & Yin @ SIGMOD'00)

    Vertical data format approach (CHARM: Zaki & Hsiao @ SDM'02)

  • Slide 12/30

    Apriori: A Candidate Generation-and-Test Approach

    Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested! (Agrawal & Srikant @ VLDB'94; Mannila, et al. @ KDD'94)

    Method: Initially, scan DB once to get frequent 1-itemsets

    Generate length-(k+1) candidate itemsets from length-k frequent itemsets

    Test the candidates against DB

    Terminate when no frequent or candidate set can be generated

  • Slide 13/30

    The Apriori Algorithm: An Example

    Supmin = 2

    Database TDB (1st scan):
    Tid  Items
    10   A, C, D
    20   B, C, E
    30   A, B, C, E
    40   B, E

    C1:
    Itemset  sup
    {A}      2
    {B}      3
    {C}      3
    {D}      1
    {E}      3

    L1:
    Itemset  sup
    {A}      2
    {B}      3
    {C}      3
    {E}      3

    C2 (generated from L1):
    {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

    C2 (2nd scan, with counts):
    Itemset  sup
    {A, B}   1
    {A, C}   2
    {A, E}   1
    {B, C}   2
    {B, E}   3
    {C, E}   2

    L2:
    Itemset  sup
    {A, C}   2
    {B, C}   2
    {B, E}   3
    {C, E}   2

    C3: {B, C, E}

    L3 (3rd scan):
    Itemset    sup
    {B, C, E}  2

  • Slide 14/30

    The Apriori Algorithm

    Pseudo-code:
    Ck: candidate itemset of size k
    Lk: frequent itemset of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t
        Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk;
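The pseudocode can be sketched in compact Python; run on the TDB from the previous slide it reproduces L3 = {B, C, E} with support 2. The function name and internal structure are an illustrative assumption, not the authors' code:

```python
from itertools import combinations

def apriori(db, minsup):
    db = [frozenset(t) for t in db]
    # L1: count single items in one scan.
    counts = {}
    for t in db:
        for i in t:
            counts[frozenset([i])] = counts.get(frozenset([i]), 0) + 1
    L = {x: c for x, c in counts.items() if c >= minsup}
    all_frequent = dict(L)
    k = 1
    while L:
        # Generate C(k+1): self-join Lk, then prune any candidate
        # with an infrequent k-subset (the Apriori principle).
        prev = list(L)
        C = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                cand = prev[i] | prev[j]
                if len(cand) == k + 1 and all(
                        frozenset(s) in L for s in combinations(cand, k)):
                    C.add(cand)
        # One DB scan: count candidates contained in each transaction.
        counts = {c: sum(c <= t for t in db) for c in C}
        L = {x: c for x, c in counts.items() if c >= minsup}
        all_frequent.update(L)
        k += 1
    return all_frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
freq = apriori(tdb, minsup=2)
```

On this database it finds the 9 frequent itemsets from the worked example, ending with {B, C, E}: 2.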

  • Slide 15/30

    Important Details of Apriori

    How to generate candidates?

    Step 1: self-joining Lk

    Step 2: pruning

    How to count supports of candidates?

    Example of candidate generation: L3 = {abc, abd, acd, ace, bcd}

    Self-joining: L3 * L3

    abcd from abc and abd

    acde from acd and ace

    Pruning:

    acde is removed because ade is not in L3

    C4 = {abcd}

  • Slide 16/30

    How to Generate Candidates?

    Suppose the items in Lk-1 are listed in an order

    Step 1: self-joining Lk-1

    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
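The SQL-style join above can be sketched in Python; each itemset is kept as a sorted tuple so the join condition (equal on all but the last item, p's last item < q's last item) applies directly. `gen_candidates` is an assumed name:

```python
from itertools import combinations

def gen_candidates(Lk):
    Lk = sorted(Lk)
    Lset = set(Lk)
    k = len(Lk[0])          # all itemsets in Lk have the same size k
    Ck1 = []
    for i, p in enumerate(Lk):
        for q in Lk[i + 1:]:
            # Self-join: equal on the first k-1 items, p's last < q's last.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                cand = p + (q[-1],)
                # Prune: every k-subset of the candidate must be in Lk.
                if all(s in Lset for s in combinations(cand, k)):
                    Ck1.append(cand)
    return Ck1

L3 = [tuple("abc"), tuple("abd"), tuple("acd"), tuple("ace"), tuple("bcd")]
C4 = gen_candidates(L3)  # acde is pruned (ade not in L3), leaving abcd
```

This reproduces the C4 = {abcd} result from the previous slide.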

  • Slide 17/30

    How to Count Supports of Candidates?

    Why is counting supports of candidates a problem?

    The total number of candidates can be very huge

    One transaction may contain many candidates

  • Slide 18/30

    An Example

    Minsup=3Brute Force: 25 -1 =31candidates

    Apriori: 21 candidates

    out of which 19 arefrequent.


  • Slide 20/30

    Challenges of Frequent Pattern Mining

    Challenges

    Multiple scans of transaction database

    Huge number of candidates

    Tedious workload of support counting for candidates

    Improving Apriori: general ideas

    Reduce passes of transaction database scans

    Shrink number of candidates

    Facilitate support counting of candidates

  • Slide 21/30

    Bottleneck of Frequent-pattern Mining

    Multiple database scans are costly

    Mining long patterns needs many passes of scanning and generates lots of candidates

    To find frequent itemset i1 i2 … i100:

    # of scans: 100

    # of candidates: (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27 × 10^30!

    Bottleneck: candidate-generation-and-test

    Can we avoid candidate generation?

  • Slide 22/30

    CHARM: Mining by Exploring Vertical Data Format

    Vertical format: t(AB) = {T11, T25, …}

    tid-list: list of trans.-ids containing an itemset

    Deriving closed patterns based on vertical intersections:

    t(X) = t(Y): X and Y always happen together

    t(X) ⊆ t(Y): transaction having X always has Y

    Using diffset to accelerate mining:

    Only keep track of differences of tids

    t(X) = {T1, T2, T3}, t(XY) = {T1, T3}

    Diffset(XY, X) = {T2}

    Eclat/MaxEclat (Zaki et al. @ KDD'97), VIPER (P. Shenoy et al. @ SIGMOD'00), CHARM (Zaki & Hsiao @ SDM'02)
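A small sketch (helper names are mine) of the vertical representation on the slide-5 database: build per-item tid-lists, intersect them to count support, and take a diffset:

```python
db = {10: {"A", "B", "D"}, 20: {"A", "C", "D"}, 30: {"A", "D", "E"},
      40: {"B", "E", "F"}, 50: {"B", "C", "D", "E", "F"}}

# Build t(item): the set of transaction ids containing each item.
tidlists = {}
for tid, items in db.items():
    for item in items:
        tidlists.setdefault(item, set()).add(tid)

def t(itemset):
    """tid-list of an itemset: intersect the tid-lists of its items."""
    result = None
    for item in itemset:
        result = set(tidlists[item]) if result is None else result & tidlists[item]
    return result

# support(X) is just len(t(X)); here t(A) is a subset of t(D), so every
# transaction with A also has D (the rule A -> D with 100% confidence).
diffset = t({"B"}) - t({"B", "E"})  # Diffset(BE, B)
```

Storing only diffsets instead of full tid-lists is what CHARM exploits: the difference sets are typically much smaller than the intersections.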

  • Slide 23/30

    CHARM: Mining by Exploring Vertical Data Format

  • Slide 24/30

    Mining Frequent Patterns, Association and Correlations

    Basic concepts and a road map

    Efficient and scalable frequent itemset mining methods

    Mining various kinds of association rules

    From association mining to correlation analysis

    Constraint-based association mining

    Summary

  • Slide 25/30

    Mining Other Interesting Patterns

    Flexible support constraints (Wang et al. @ VLDB'02)

    Some items (e.g., diamond) may occur rarely but are valuable

    Customized supmin specification and application

    Top-K closed frequent patterns (Han, et al. @ ICDM'02)

    Hard to specify supmin, but top-k with length_min is more desirable

    Dynamically raise supmin in FP-tree construction and mining, and select the most promising path to mine

  • Slide 26/30

    Mining Other Interesting Patterns

    Mining discriminatory itemsets: can be used as features for classifying transactions.

    For an itemset X, delta score Δ(X) = abs(S_D(X) − S_D′(X)), where S_D(X) and S_D′(X) are the supports of X in datasets D and D′.

    The higher the delta score, the more discriminatory the pattern X is.

    Dataset D:
    Transaction-id  Items
    10              A, B, D
    20              A, C, D
    30              A, D, E
    40              B, E, F
    50              B, C, D, E, F

    Dataset D′:
    Transaction-id  Items
    10              A, B, C, D, F
    20              A, C, D, F
    30              A, C, D, E, F
    40              B, C, E, F
    50              B, C, D, E, F
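A sketch of the delta score on the two tables above (`rel_support` and `delta` are assumed helper names; relative support is used):

```python
D1 = [{"A", "B", "D"}, {"A", "C", "D"}, {"A", "D", "E"},
      {"B", "E", "F"}, {"B", "C", "D", "E", "F"}]
D2 = [{"A", "B", "C", "D", "F"}, {"A", "C", "D", "F"},
      {"A", "C", "D", "E", "F"}, {"B", "C", "E", "F"},
      {"B", "C", "D", "E", "F"}]

def rel_support(X, D):
    """Fraction of transactions in D containing X."""
    X = set(X)
    return sum(X <= t for t in D) / len(D)

def delta(X):
    """Delta score: absolute support difference between the two datasets."""
    return abs(rel_support(X, D1) - rel_support(X, D2))

# {C} occurs in 2/5 of D1 but in all of D2, so it is highly discriminatory;
# {A} occurs in 3/5 of both datasets, so it carries no discriminative signal.
scores = {item: delta({item}) for item in "ABCDEF"}
```

A classifier could therefore use {C} (or {F}, which behaves the same way here) as a feature separating the two transaction sets, while {A} is useless for that purpose.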

  • Slide 27/30

    Which Measures Should Be Used?

    Lift and χ2 are not good measures for correlations in large transactional DBs

    all-conf or coherence could be good measures (Omiecinski @ TKDE'03)

    Both all-conf and coherence have the downward closure property

    Efficient algorithms can be derived for mining (Lee et al. @ ICDM'03sub)

  • Slide 28/30

    Frequent-Pattern Mining: Summary

    Frequent pattern mining: an important task in data mining

    Scalable frequent pattern mining methods

    Apriori (Candidate generation & test)

    Projection-based (FPgrowth, CLOSET+, ...)

    Vertical format approach (CHARM, ...)

    Mining a variety of rules and interesting patterns

    Constraint-based mining

    Mining sequential and structured patterns

    Extensions and applications

  • Slide 29/30

    Frequent-Pattern Mining: Research Problems

    Mining fault-tolerant frequent, sequential and structured patterns

    Patterns allow limited faults (insertion, deletion, mutation)

    Mining truly interesting patterns

    Surprising, novel, concise, …

    Application exploration

    E.g., DNA sequence analysis and bio-pattern classification

    Invisible data mining


  • Slide 30/30

    Application: Mining Version Histories to Guide Software Changes (Zimmermann et al., 2005)

    Suggest and predict likely changes. Suppose a programmer has just made a change. What else does she have to change?

    Prevent errors due to incomplete changes. If a programmer wants to commit changes but has missed a related change, issue a warning.