

…@liris.cnrs.fr - http://liris.cnrs.fr/...

Laboratoire d'InfoRmatique en Image et Systèmes d'information (LIRIS), UMR 5205 CNRS/INSA de Lyon/Université Claude Bernard Lyon 1/Université Lumière Lyon 2/Ecole Centrale de Lyon

Université Claude Bernard Lyon 1, bâtiment Nautibus, 43, boulevard du 11 novembre 1918, F-69622 Villeurbanne cedex

http://liris.cnrs.fr

DDDM'08, Pisa - 15/12/2008

Parameter Tuning for Differential Mining of String Patterns

J. Besson, C. Rigotti, I. Mitasiunaite and J.-F. Boulicaut

Tuning extraction parameters

Local pattern mining: itemsets, closed itemsets, episodes, sequential patterns, substrings

… under constraints (monotonic, anti-monotonic, or neither; pattern shapes; occurrence properties; measures …)

Constraints can select and focus the extraction … but where to look in the parameter space?

Often easy with a single threshold … but what about multiple constraints and multiple thresholds?

Two different kinds of tuning

1) exploratory stage: find promising areas in the parameter space

2) fine-grained tuning: a kind of greedy strategy, by small local explorations of the parameter space

Tools?

The best tool ever used in the exploratory stage to find promising parameter settings in local pattern mining? …

Tools

GREP + Word Count

method: manual mix

count the extracted patterns

choose points in the parameter space

random walk

try a local greedy strategy

keep in mind known properties of the constraints (when applicable) and domain knowledge
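The "GREP + Word Count" step amounts to counting the lines of an extractor's output that match some expression. A minimal Python sketch of that idea follows. The output format (one "pattern support" pair per line) and the function name `count_matching_patterns` are assumptions made for illustration; the slides do not specify them.

```python
import re

def count_matching_patterns(lines, regex):
    """Count the output lines that match regex, like `grep regex | wc -l`."""
    matcher = re.compile(regex)
    return sum(1 for line in lines if matcher.search(line))

# Hypothetical extractor output: one "pattern support" pair per line.
output = ["ACGT 12", "ACGA 7", "TTGA 3", "ACCC 9"]

# How many extracted patterns start with "AC"?
print(count_matching_patterns(output, r"^AC"))
```

Repeating such a count for a few hand-picked parameter settings is the manual exploration the slide describes.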

Tools …

when there are several parameters and several thresholds, e.g., a minimal support on one dataset and a maximal support on another dataset …

perform more exhaustive exploration of pattern space

draw curves depicting the extraction landscape
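As a rough illustration of what an extraction landscape contains, the sketch below tabulates the number of patterns kept for every (min, max) threshold pair. It assumes the supports of each candidate pattern in the two datasets are already known; the `supports` list is made-up toy data and the function name `landscape` is hypothetical.

```python
def landscape(supports, min_values, max_values):
    """Map each (min, max) threshold pair to the number of patterns kept.

    supports: one (support_in_R1, support_in_R2) pair per candidate pattern.
    A pattern is kept when support_in_R1 >= min and support_in_R2 <= max.
    """
    return {
        (mn, mx): sum(1 for s1, s2 in supports if s1 >= mn and s2 <= mx)
        for mn in min_values
        for mx in max_values
    }

# Toy supports for six candidate patterns (illustrative values only).
supports = [(3, 1), (2, 0), (1, 0), (1, 1), (1, 1), (1, 0)]
grid = landscape(supports, [1, 2, 3], [0, 1])

# One row per min threshold, one column per max threshold.
for mn in [1, 2, 3]:
    print(mn, [grid[(mn, mx)] for mx in [0, 1]])
```

Plotting the rows of `grid` as curves gives one possible view of the landscape; computing the supports themselves is the expensive part.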

Tools / landscape

Examples

[Figure not available in this transcript]

Obtaining extraction landscapes

use a script: can need a lot of resources to execute; too much time needed to explore a large parameter space (several parameters)

use a global model of the presence of the local patterns to estimate the number of patterns

reuse/adapt a model: not many exist

develop a new global model: each kind of pattern and each conjunction of constraints can be a research problem in itself

incorporate domain knowledge? A global analytical model is even more complex to exhibit …

What about sampling the pattern space?

sounds too naive, or likely to need complicated frameworks

how to sample?

size of the sample?

number of patterns in the sample that satisfy the constraints?

using domain knowledge?

how to estimate the value for the whole pattern space?

What about simple choices?

sampling: with replacement, among the patterns that satisfy the syntactic constraints (of the conjunction of constraints)

number of patterns in the sample that satisfy the constraints: compute, for each pattern in the sample, the probability that it satisfies the constraints (incorporating domain knowledge); this gives the approximate number of patterns in the sample that satisfy the constraints

sample size: grow the sample until the percentage of patterns satisfying the constraints converges

estimate of the number of patterns in the pattern space that satisfy the constraints: apply this percentage to the patterns that satisfy the syntactic constraints

Whole process

1) build an initial sample of Psynt (the patterns that satisfy the syntactic constraints)

2) compute an estimate of E(N), the expected number of patterns satisfying the constraints, from the sample

3) add more patterns to the sample

4) compute an estimate of E(N) from the sample

5) if the estimate changes a lot, go to 3)
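The steps above can be sketched as a short loop. This is only an illustration under stated assumptions: Psynt is taken to be all strings of a fixed size over the alphabet, `proba` is a user-supplied function returning the probability that a pattern satisfies the constraints, and the batch size, the 5% tolerance and the round limit are placeholder values. The name `estimate_count` is hypothetical.

```python
import random

def estimate_count(proba, alphabet, size, batch=200, tol=0.05,
                   max_rounds=100, seed=0):
    """Estimate E(N) by growing a uniform sample (with replacement) of Psynt.

    Assumes Psynt is the set of all strings of length `size` over `alphabet`.
    """
    rng = random.Random(seed)
    space_size = len(alphabet) ** size      # |Psynt| under that assumption
    total, n = 0.0, 0
    previous = None
    for _round in range(max_rounds):
        # Steps 1) and 3): draw a batch of patterns and add it to the sample.
        for _draw in range(batch):
            pattern = "".join(rng.choice(alphabet) for _ in range(size))
            total += proba(pattern)
            n += 1
        # Steps 2) and 4): ratio in the sample, scaled to the size of Psynt.
        estimate = total / n * space_size
        # Step 5): stop once the estimate no longer changes much.
        if previous is not None and abs(estimate - previous) <= tol * max(previous, 1e-12):
            return estimate
        previous = estimate
    return previous

# Toy probability (illustrative only): patterns starting with "A" always
# satisfy the constraints, all others never do.
def toy_proba(pattern):
    return 1.0 if pattern.startswith("A") else 0.0

print(estimate_count(toy_proba, "ACGT", 6, batch=1000))
```

For this toy probability the true value is 4^5 = 1024, so the printed estimate should land near that figure.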

Using it in frequent substring mining

Two datasets: R1 and R2 (two sets of strings)

Constraints: having size Z; appearing at least min times in R1; appearing no more than max times in R2

Consider exact and approximate matching
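For very small inputs, the three constraints can be checked by brute force, which gives a reference to compare estimates against. The sketch below covers exact matching only and assumes that "appearing k times" means "occurring in k strings of the dataset"; the function names and the toy datasets are hypothetical.

```python
from itertools import product

def support(pattern, strings):
    """Number of strings that contain pattern at least once (exact matching)."""
    return sum(1 for s in strings if pattern in s)

def extract(r1, r2, alphabet, size, min_sup, max_sup):
    """Enumerate every string of the given size and keep those occurring in
    at least min_sup strings of r1 and at most max_sup strings of r2."""
    return [
        p
        for p in map("".join, product(alphabet, repeat=size))
        if support(p, r1) >= min_sup and support(p, r2) <= max_sup
    ]

# Toy datasets (illustrative only).
r1 = ["ACGT", "ACGA", "TTAC"]
r2 = ["GGAC", "TTTT"]
print(extract(r1, r2, "ACGT", 2, 2, 0))
```

The enumeration grows exponentially with the pattern size, which is the cost the sampling approach tries to avoid.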

Pattern space and domain knowledge

strings over an alphabet of 4 or 8 symbols

domain knowledge as three models of symbol distribution:

Me - independent symbols with equal frequencies

Md - independent symbols with different frequencies

Mm - first-order Markov model

for a given pattern p, and Me or Md or Mm, we have the probability that there exists at least one occurrence of p in a string

from the binomial distribution we have the probability that p satisfies the min and max support constraints
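A possible way to turn these two steps into numbers for the Md model is sketched below. It rests on simplifying assumptions that are not taken from the slides: the probability of at least one occurrence is approximated by treating start positions as independent (overlaps between occurrences are ignored), R1 and R2 are treated as independent, and support is counted in strings. All function names are hypothetical.

```python
from math import comb, prod

def occurrence_proba(pattern, freqs, length):
    """Approximate probability that pattern occurs at least once in a random
    string of the given length (independent symbols, positions treated as
    independent, which is an approximation)."""
    q = prod(freqs[c] for c in pattern)      # match at one fixed position
    positions = length - len(pattern) + 1
    return 1.0 - (1.0 - q) ** positions

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def proba_satisfies(pattern, freqs, n1, len1, n2, len2, min_sup, max_sup):
    """Probability that pattern occurs in >= min_sup of the n1 strings of R1
    and in <= max_sup of the n2 strings of R2 (R1 and R2 assumed independent)."""
    p1 = occurrence_proba(pattern, freqs, len1)
    p2 = occurrence_proba(pattern, freqs, len2)
    at_least_min = 1.0 - binom_cdf(min_sup - 1, n1, p1)
    at_most_max = binom_cdf(max_sup, n2, p2)
    return at_least_min * at_most_max

# Symbol frequencies in the style of Md; the symbol names are an assumption.
freqs = {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3}
print(proba_satisfies("ACGT", freqs, 100, 1000, 100, 1000, 10, 90))
```

An exact occurrence probability would have to account for self-overlapping patterns, and Mm would need the Markov transition probabilities instead of the product of frequencies.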

Example / random data

4 symbols, Md (0.4, 0.1, 0.2, 0.3), 100 strings of length 1000 in R1 and R2, exact matching
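Random data of this kind could be generated along the following lines. The symbol names A, C, G, T, the seeds and the function name `random_strings` are assumptions; the slide only gives the four frequencies, the number of strings and their length.

```python
import random

def random_strings(freqs, count, length, seed=0):
    """Draw `count` strings of `length` independent symbols with the given
    frequencies (an Md-style model)."""
    rng = random.Random(seed)
    symbols = list(freqs)
    weights = [freqs[s] for s in symbols]
    return [
        "".join(rng.choices(symbols, weights=weights, k=length))
        for _ in range(count)
    ]

# Frequencies from the slide; the symbol names are hypothetical.
md = {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3}
r1 = random_strings(md, 100, 1000, seed=1)
r2 = random_strings(md, 100, 1000, seed=2)
print(len(r1), len(r1[0]))
```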

Example / random data

4 symbols, Mm, 100 strings of length 1000 in R1 and R2, exact and approximate matching
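For the Mm case, a first-order Markov generator might look like the sketch below. The slide gives no transition probabilities, so the initial distribution and the transition table here are invented purely for illustration, as is the function name `markov_string`.

```python
import random

def markov_string(initial, transitions, length, rng):
    """Generate one string from a first-order Markov model.

    initial: symbol -> probability of being the first symbol.
    transitions: symbol -> (next symbol -> probability).
    """
    symbols = list(initial)
    current = rng.choices(symbols, weights=[initial[s] for s in symbols])[0]
    out = [current]
    for _ in range(length - 1):
        row = transitions[current]
        current = rng.choices(symbols, weights=[row[s] for s in symbols])[0]
        out.append(current)
    return "".join(out)

# Hypothetical two-symbol model, kept tiny so the table is easy to read.
initial = {"A": 0.5, "B": 0.5}
transitions = {
    "A": {"A": 0.9, "B": 0.1},
    "B": {"A": 0.3, "B": 0.7},
}
rng = random.Random(0)
strings = [markov_string(initial, transitions, 1000, rng) for _ in range(100)]
print(len(strings), len(strings[0]))
```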

Example / gene promoter sequences

4 symbols A, C, G, T - Md, strings of 4000 symbols, 29 in R1 and 21 in R2 - approximate matching


Example / gene promoter seq.

Estimate vs. extraction

Conclusion

Drawing the extraction landscape for parameter tuning, in local pattern extraction, using pattern space sampling …

… seems possible …

… at least in some cases

… using a simple framework

… incorporating domain knowledge (to some extent; there are many works on the probability that a given pattern satisfies constraints)

simpler than building a global analytical model

faster than running real extractions

… sufficient in the exploratory stage?

… companion software?

Example / random data

8 symbols, Me, 100 strings of length 30000 in R1 and R2, approximate matching

Problems - Sampling / estimate

kind of sampling (with replacement?)

specific sampling (a kind of stratified sampling) for some constraints?

kinds of patterns?

quality of the estimates … occurrences of different patterns are not independent

Problems - Other parameters added

size of the starting set

convergence criterion? 5%?

size of the additional subsets

… not so hard to tune?

Number of patterns

conjunction of constraints C

patterns in the pattern space PS

for each pattern p, let the variable Xp = 1 if p satisfies C, and Xp = 0 otherwise

N = number of patterns that satisfy C = sum of Xp over PS

E(N) = sum of E(Xp) over PS (by linearity of expectation)

E(Xp) = probability that p satisfies C

Psynt = patterns in PS that satisfy the syntactic constraints in C

E(N) = sum of E(Xp) over Psynt (since E(Xp) = 0 for any p outside Psynt)

Number of patterns

compute NS = sum of E(Xp) over a sample of Psynt

compute the ratio NR = NS / sample size

use NR * size of Psynt as an estimate of E(N)
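The estimator on this slide can be written compactly. The symbols S (the sample) and the hat notation for the estimate are introduced here for illustration and do not appear in the slides; the last line additionally assumes that S is drawn uniformly, with replacement, from Psynt.

```latex
\begin{align*}
N_S &= \sum_{p \in S} E(X_p), \qquad N_R = \frac{N_S}{|S|}, \\
\hat{E}(N) &= N_R \cdot |P_{synt}|, \\
\mathbb{E}_S\big[\hat{E}(N)\big]
  &= |P_{synt}| \cdot \frac{1}{|P_{synt}|} \sum_{p \in P_{synt}} E(X_p)
   = E(N).
\end{align*}
```

Under that uniform-sampling assumption the estimate is unbiased; how quickly it concentrates around E(N) depends on how much E(Xp) varies across Psynt.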


Example / gene promoter seq.

Estimate vs. extraction


Example / gene promoter seq.

Estimate vs. extraction

Often repeat exploratory stage

redo the exploratory stage after important changes such as:

data selection (e.g., part of sequences)

encoding (e.g., mapping on event types)

discretization (e.g., threshold of binarization)