
Bayesian Quadrature: Model-based Approximate Integration

David Duvenaud, University of Cambridge

December 8, 2012

The Quadrature Problem

- We want to estimate an integral

  $$Z = \int f(x)\, p(x)\, dx$$

- Most computational problems in inference correspond to integrals:
  - Expectations
  - Marginal distributions
  - Integrating out nuisance parameters
  - Normalization constants
  - Model comparison

[Figure: an example function f(x) and input density p(x).]
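As a concrete instance of the last two items (not spelled out on the slide), the normalization constant used for model comparison, the model evidence, has exactly this form, with the likelihood playing the role of f and the prior playing the role of p:

$$Z = p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta)\, p(\theta)\, d\theta, \qquad f(\theta) = p(\mathcal{D} \mid \theta).$$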

Sampling Methods

- Monte Carlo methods: sample from p(x) and take the empirical mean:

  $$\hat{Z} = \frac{1}{N} \sum_{i=1}^{N} f(x_i)$$

- Possibly sub-optimal for two reasons:
  - Random bunching up of samples
  - Often, nearby function values will be similar
- Model-based and quasi-Monte Carlo methods spread out samples to achieve faster convergence.

[Figure: f(x), the input density p(x), and sampled locations.]
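To make the estimator concrete, here is a minimal plain Monte Carlo sketch. The integrand f and the standard-normal choice of p(x) are illustrative assumptions, not taken from the talk; this particular f has a known expectation to compare against.

```python
# Minimal plain Monte Carlo sketch: Z_hat = (1/N) * sum_i f(x_i), x_i ~ p(x).
# The integrand and density below are illustrative choices, not from the talk.
import numpy as np

def f(x):
    return np.sin(x) ** 2          # E[sin^2(X)] = (1 - e^{-2}) / 2 for X ~ N(0, 1)

rng = np.random.default_rng(0)
N = 10_000
xs = rng.standard_normal(N)        # samples from p(x) = N(0, 1)
Z_hat = f(xs).mean()               # empirical-mean estimator

Z_true = 0.5 * (1.0 - np.exp(-2.0))
print(f"Z_hat = {Z_hat:.4f}  (true Z = {Z_true:.4f})")
```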

Model-based Integration

- Place a prior on f, for example a GP.
- The posterior over f implies a posterior over Z.
- We'll call the approach that uses a GP prior Bayesian Quadrature.

[Figure: f(x) and p(x) with the GP mean, GP mean ± 2 SD, the sampled evaluations, and the induced distribution p(Z).]
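The step from a posterior over f to a posterior over Z is standard GP algebra, though the slide leaves it implicit: because integration against p(x) is a linear operation, the Gaussian-process posterior on f pushes forward to a Gaussian posterior on Z, whose mean and variance come from integrating the GP posterior mean m(x) and covariance c(x, x'):

$$Z \mid f(\mathbf{x}_s) \;\sim\; \mathcal{N}\!\left( \int m(x)\, p(x)\, dx,\;\; \iint c(x, x')\, p(x)\, p(x')\, dx\, dx' \right)$$

Substituting the usual GP posterior expressions for m and c gives the estimator on the next slide.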

Bayesian Quadrature Estimator

- The posterior over Z has a mean that is linear in the observed function values f(x_s):

  $$\mathbb{E}_{\mathrm{GP}}[Z \mid f(\mathbf{x}_s)] = \mathbf{z}^{\top} K^{-1} f(\mathbf{x}_s), \qquad z_n = \int k(x, x_n)\, p(x)\, dx$$

- It is natural to choose samples to minimize the posterior variance of Z:

  $$\mathbb{V}[Z \mid f(\mathbf{x}_s)] = \iint k(x, x')\, p(x)\, p(x')\, dx\, dx' \;-\; \mathbf{z}^{\top} K^{-1} \mathbf{z}$$

- The variance doesn't depend on the function values at all!
- Choosing samples sequentially to minimize this variance is Sequential Bayesian Quadrature (SBQ).
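A minimal sketch of the estimator, assuming a squared-exponential kernel and a standard-normal p(x), a pairing for which z_n has a closed form. The integrand, lengthscale, and evaluation locations are illustrative and not taken from the talk; in particular the points are fixed here rather than chosen sequentially.

```python
# Bayesian Quadrature sketch with an SE kernel and p(x) = N(0, sp2),
# for which z_n = sf2 * sqrt(ell^2/(ell^2+sp2)) * exp(-x_n^2/(2*(ell^2+sp2))).
import numpy as np

ell, sf2, sp2 = 0.7, 1.0, 1.0            # lengthscale, signal variance, var of p(x)

def k(a, b):
    return sf2 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

def z(xs):
    # Closed-form kernel integrals z_n = \int k(x, x_n) p(x) dx.
    return sf2 * np.sqrt(ell**2 / (ell**2 + sp2)) * np.exp(-0.5 * xs**2 / (ell**2 + sp2))

def f(x):
    return np.sin(x) ** 2                # illustrative integrand, as before

xs = np.linspace(-3.0, 3.0, 8)           # evaluation locations (fixed, not SBQ-chosen)
K = k(xs, xs) + 1e-9 * np.eye(len(xs))   # jitter for numerical stability
w = np.linalg.solve(K, z(xs))            # BQ weights K^{-1} z

Z_mean = w @ f(xs)                                   # z^T K^{-1} f(x_s)
prior_var = sf2 * np.sqrt(ell**2 / (ell**2 + 2 * sp2))   # \iint k p p dx dx'
Z_var = prior_var - z(xs) @ w                        # minus z^T K^{-1} z
print(f"Z ~ {Z_mean:.4f} +/- {np.sqrt(max(Z_var, 0.0)):.4f}")
```

SBQ would instead pick each new location to minimize the variance expression above before evaluating f there.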

Things you can do with Bayesian Quadrature

- Can incorporate knowledge of the function, such as symmetries:

  $$f(x, y) = f(y, x) \;\Leftrightarrow\; k_s(x, y, x', y') = k(x, y, x', y') + k(x, y', x', y) + k(x', y, x, y') + k(x', y', x, y)$$

- Can condition on gradient observations.
- The posterior variance is a natural convergence diagnostic.

[Figure: the estimate of Z with posterior error bars as a function of the number of samples (100 to 200).]
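A small sketch of one standard way to build a kernel encoding the symmetry f(x, y) = f(y, x): sum a base kernel over the argument swaps, so that the GP prior forces f(x, y) and f(y, x) to agree exactly. The RBF base kernel and the exact indexing of the swaps are assumptions here; the slide's four-term formula follows the same pattern but may use a different argument convention.

```python
# Symmetrized kernel sketch: k_s sums an assumed RBF base kernel over
# swaps of each input pair, so (a, b) and (b, a) become perfectly correlated.
import numpy as np

def k_base(p, q, ell=1.0):
    # RBF kernel on 2-D inputs p = (x, y), q = (x', y').
    return np.exp(-0.5 * np.sum((np.asarray(p) - np.asarray(q)) ** 2) / ell**2)

def k_sym(p, q):
    (x, y), (xp, yp) = p, q
    return (k_base((x, y), (xp, yp)) + k_base((x, y), (yp, xp))
            + k_base((y, x), (xp, yp)) + k_base((y, x), (yp, xp)))

# Under k_sym, the correlation between f(a, b) and f(b, a) is exactly 1,
# so observing one of them pins down the other.
a, b = 0.3, -1.2
print(k_sym((a, b), (b, a)) / k_sym((a, b), (a, b)))   # -> 1.0 (up to float error)
```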

More things you can do with Bayesian Quadrature

- Can compute the likelihood of the GP and learn the kernel.
- Can compute marginals with error bars, in two ways:
  - Simply from the GP posterior, or
  - By recomputing $f_\theta(x)$ for different $\theta$ with the same $x$.
- Much nicer than histograms!

[Figure: Z as a function of the hyperparameter σ_x (0.2 to 1.2), with error bars, the true σ_x, and the marginal likelihood overlaid.]

Rates of Convergence

What is the rate of convergence of SBQ when its assumptions are true?

[Figures: expected variance / MMD, empirical rates inside the RKHS, empirical rates outside the RKHS, and a bound on the Bayesian error.]

GPs vs Log-GPs for Inference

[Figure: a likelihood ℓ(x) (left) and log ℓ(x) (right), each showing the true function, the evaluations, and the GP posterior; the left panel also shows the Log-GP posterior obtained by modelling log ℓ(x) with a GP.]

GPs vs Log-GPs

[Figure: the same comparison on a second likelihood: ℓ(x) (left) and log ℓ(x) (right), with the true function, evaluations, GP posterior, and Log-GP posterior.]

Integrating under Log-GPs

[Figure: ℓ(x) and log ℓ(x) with the true function, evaluations, GP posterior, and Log-GP posterior.]

[Figure: approximating the Log-GP for integration: the mean of the GP, an approximate Log-GP, and a set of inducing points, shown on both ℓ(x) and log ℓ(x).]
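A small sketch (not the talk's method) of the modelling difference the figures illustrate: fitting a zero-mean GP to ℓ(x) directly versus fitting it to log ℓ(x) and exponentiating the posterior mean. The likelihood, kernel, and observation points are illustrative assumptions; because the exponentiated GP no longer integrates against p(x) in closed form, the sketch falls back to a dense grid, whereas the talk's figures instead build an approximate Log-GP from the GP mean and a set of inducing points.

```python
# Sketch only: compare a GP fit to l(x) with exp(GP fit to log l(x)).
# The likelihood, density, kernel, and grid integration are illustrative
# assumptions, not the talk's inducing-point scheme.
import numpy as np

def rbf(a, b, ell=0.8):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

def gp_posterior_mean(x_star, x, y):
    # Zero-mean, noiseless GP regression posterior mean.
    K = rbf(x, x) + 1e-9 * np.eye(len(x))
    return rbf(x_star, x) @ np.linalg.solve(K, y)

lik = lambda x: 100.0 * np.exp(-2.0 * x**2)                  # positive "likelihood"
dens = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)  # N(0, 1) input density

x_obs = np.linspace(-2.5, 2.5, 7)
grid = np.linspace(-4.0, 4.0, 2001)

mean_gp = gp_posterior_mean(grid, x_obs, lik(x_obs))                      # may dip below zero
mean_log_gp = np.exp(gp_posterior_mean(grid, x_obs, np.log(lik(x_obs)))) # always positive

dx = grid[1] - grid[0]
print("Z from GP on l(x):    ", np.sum(mean_gp * dens(grid)) * dx)
print("Z from GP on log l(x):", np.sum(mean_log_gp * dens(grid)) * dx)
```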

Conclusions

- Model-based integration allows active learning about integrals, can require fewer samples than MCMC, and allows us to check our assumptions.
- BQ has nice convergence properties if its assumptions are correct.
- For inference, a GP is not an especially appropriate model, but other models are intractable.

Limitations and Future Directions

- Right now, BQ really only works in low dimensions (< 10), when the function is fairly smooth, and is only worth using when computing f(x) is expensive.
- How to extend to high dimensions? Gradient observations are helpful, but a D-dimensional gradient is D separate observations.
- It seems unlikely that we'll find another tractable nonparametric distribution like GPs. Should we accept that we'll need a second round of approximate integration on a surrogate model?
- How much overhead is worthwhile? Bounded-rationality work seems relevant.

Thanks!
