
Bayesian Quadrature: Model-based Approximate Integration

David Duvenaud, University of Cambridge

December 8, 2012


The Quadrature Problem

- We want to estimate an integral

  Z = \int f(x)\, p(x)\, dx

- Most computational problems in inference correspond to integrals:
  - Expectations
  - Marginal distributions
  - Integrating out nuisance parameters
  - Normalization constants
  - Model comparison

[Figure: an integrand f(x) and an input density p(x), plotted against x]

Sampling Methods

- Monte Carlo methods: sample from p(x) and take the empirical mean (see the sketch below):

  \hat{Z} = \frac{1}{N} \sum_{i=1}^{N} f(x_i)

- Possibly sub-optimal for two reasons:
  - Random bunching up of samples
  - Often, nearby function values will be similar
- Model-based and quasi-Monte Carlo methods spread out samples to achieve faster convergence.

[Figure: the integrand f(x), the input density p(x), and sample locations, plotted against x]
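To make the estimator concrete, here is a minimal Monte Carlo sketch in Python. The integrand f(x) = exp(-x^2) and the density p(x) = N(0, 1) are toy choices for illustration, not from the talk.

```python
# Plain Monte Carlo estimate of Z = E_p[f(x)] (toy problem; true Z = 1/sqrt(3)).
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.exp(-x**2)                # toy integrand

N = 10_000
x = rng.standard_normal(N)              # samples from p(x) = N(0, 1)
Z_hat = f(x).mean()                     # (1/N) * sum_i f(x_i)
std_err = f(x).std() / np.sqrt(N)       # O(N^{-1/2}) Monte Carlo error
print(f"Z ≈ {Z_hat:.4f} ± {std_err:.4f}")
```

The O(N^{-1/2}) error of this estimator is exactly what model-based and quasi-Monte Carlo methods try to beat by spreading samples out.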

Model-based Integration

- Place a prior on f, for example a GP.
- The posterior over f then implies a posterior over Z (sketched in code below).

[Figure: the GP mean, GP mean ± 2 SD, samples, the integrand f(x), the density p(x), and the implied posterior p(Z)]

- We'll call quadrature using a GP prior Bayesian Quadrature.
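A minimal sketch of how the posterior over f induces a posterior over Z: draw GP posterior samples of f on a grid and integrate each against p(x), giving one sample of Z per sampled function. The kernel, grid, and toy data below are hypothetical choices.

```python
# Posterior over Z induced by a GP posterior over f (all choices are toy).
import numpy as np

rng = np.random.default_rng(0)
l = 0.5
k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :])**2 / l**2)

xs = np.linspace(-2, 2, 7)                         # observed inputs
fs = np.exp(-xs**2)                                # observed values f(xs)
grid = np.linspace(-4, 4, 200)

# Standard GP posterior over f(grid) given the observations
Kxx = k(xs, xs) + 1e-6 * np.eye(len(xs))
Kgx = k(grid, xs)
mean = Kgx @ np.linalg.solve(Kxx, fs)
cov = k(grid, grid) - Kgx @ np.linalg.solve(Kxx, Kgx.T)

p = np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi)      # p(x) = N(0, 1)
dx = grid[1] - grid[0]
f_samps = rng.multivariate_normal(mean, cov + 1e-6 * np.eye(len(grid)),
                                  size=500, check_valid="ignore")
Z_samps = (f_samps * p).sum(axis=1) * dx           # one Z per sampled f
print(f"E[Z] ≈ {Z_samps.mean():.4f}, sd(Z) ≈ {Z_samps.std():.4f}")
```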

Bayesian Quadrature Estimator

- The posterior over Z has a mean that is linear in the observed values f(x_s):

  E_{GP}[Z \mid f(x_s)] = z^\top K^{-1} f(x_s), \qquad \text{where } z_n = \int k(x, x_n)\, p(x)\, dx

- It is natural to choose sample locations to minimize the posterior variance of Z:

  V[Z \mid f(x_s)] = \int\!\!\int k(x, x')\, p(x)\, p(x')\, dx\, dx' - z^\top K^{-1} z

- This variance doesn't depend on the function values at all!
- Choosing samples sequentially to minimize the variance gives Sequential Bayesian Quadrature (sketched below).
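Both formulas are easy to implement in one dimension. The sketch below assumes an RBF kernel k(x, x') = exp(-(x - x')^2 / 2l^2) and a Gaussian density p(x) = N(mu, sigma^2), for which z_n and the prior double integral have the closed forms used in the code; the greedy loop at the end is a bare-bones Sequential BQ. All numerical choices are illustrative, not from the talk.

```python
# 1-D Bayesian quadrature with closed-form z_n (RBF kernel, Gaussian p).
import numpy as np

l, mu, sigma = 0.5, 0.0, 1.0

def kernel(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / l**2)

def z_vec(xs):
    # z_n = ∫ k(x, x_n) p(x) dx
    return l / np.sqrt(l**2 + sigma**2) * np.exp(
        -0.5 * (xs - mu)**2 / (l**2 + sigma**2))

PRIOR_VAR = l / np.sqrt(l**2 + 2 * sigma**2)       # ∫∫ k(x,x') p(x) p(x') dx dx'

def bq_posterior(xs, fs=None, jitter=1e-8):
    K = kernel(xs, xs) + jitter * np.eye(len(xs))
    z = z_vec(xs)
    var = PRIOR_VAR - z @ np.linalg.solve(K, z)    # V[Z | f(xs)], no f needed
    mean = None if fs is None else z @ np.linalg.solve(K, fs)
    return mean, var

# Sequential BQ: greedily add the candidate that minimizes posterior variance.
candidates = np.linspace(-3, 3, 61)
design = np.array([0.0])
for _ in range(7):
    scores = [bq_posterior(np.append(design, c))[1] for c in candidates]
    design = np.append(design, candidates[int(np.argmin(scores))])

mean, var = bq_posterior(design, np.exp(-design**2))   # same toy f as before
print(f"Z ≈ {mean:.4f}, posterior sd ≈ {np.sqrt(max(var, 0.0)):.4f}")
```

Note that the design loop never evaluates f: as the slide says, the posterior variance depends only on the sample locations.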

Things you can do with Bayesian Quadrature

- Can incorporate knowledge of the function, such as symmetries (sketched below):

  f(x, y) = f(y, x) \iff k_s(x, y, x', y') = k(x, y, x', y') + k(x, y, y', x') + k(y, x, x', y') + k(y, x, y', x')

- Can condition on gradients.
- The posterior variance is a natural convergence diagnostic.

[Figure: the estimate of Z with error bars, shrinking as the number of samples grows from 100 to 200]
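A sketch of the symmetrization above: averaging any base kernel over the swap (x, y) ↔ (y, x) produces a GP whose samples satisfy f(x, y) = f(y, x) exactly. The helper and the product-RBF base kernel are hypothetical.

```python
# Symmetrized kernel: GP samples under k_s are exactly exchangeable in (x, y).
import numpy as np

def symmetrize(base_k):
    def k_s(x, y, xp, yp):
        return (base_k(x, y, xp, yp) + base_k(x, y, yp, xp)
              + base_k(y, x, xp, yp) + base_k(y, x, yp, xp))
    return k_s

rbf = lambda a, b: np.exp(-0.5 * (a - b)**2)
k_s = symmetrize(lambda x, y, xp, yp: rbf(x, xp) * rbf(y, yp))
print(k_s(1.0, 2.0, 0.5, 0.3), k_s(2.0, 1.0, 0.5, 0.3))  # equal by construction
```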

More things you can do with Bayesian Quadrature

- Can compute the likelihood under the GP, and so learn the kernel (sketched below).
- Can compute marginals with error bars, in two ways:
  - Simply from the GP posterior
  - Or by recomputing f_θ(x) for different θ with the same x

[Figure: the marginal likelihood as a function of σ_x, with error bars; the true σ_x is marked]

- Much nicer than histograms!
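Here, "learning the kernel" can be read as maximizing the GP's likelihood of the observed values over kernel hyperparameters. A minimal grid-search sketch, with a hypothetical RBF kernel and toy data:

```python
# Kernel learning by maximizing the GP log marginal likelihood over lengthscale.
import numpy as np

def log_marginal_likelihood(xs, fs, l, noise=1e-6):
    K = np.exp(-0.5 * (xs[:, None] - xs[None, :])**2 / l**2)
    K += noise * np.eye(len(xs))
    _, logdet = np.linalg.slogdet(K)
    alpha = np.linalg.solve(K, fs)
    return -0.5 * fs @ alpha - 0.5 * logdet - 0.5 * len(xs) * np.log(2 * np.pi)

xs = np.linspace(-2, 2, 9)
fs = np.exp(-xs**2)                                   # toy observations
lengthscales = np.linspace(0.1, 2.0, 50)
best = max(lengthscales, key=lambda l: log_marginal_likelihood(xs, fs, l))
print(f"lengthscale with highest marginal likelihood: {best:.2f}")
```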

Rates of Convergence

What is the rate of convergence of SBQ when its assumptions are true?

[Figures: expected variance / MMD; empirical rates in the RKHS; empirical rates out of the RKHS; a bound on the Bayesian error]

GPs vs Log-GPs for Inference

[Figure, left: the true likelihood ℓ(x) with evaluations, a GP posterior fit to ℓ(x), and a Log-GP posterior; right: the true log-likelihood log ℓ(x) with evaluations and a GP posterior fit in log space.]

A GP fit directly to ℓ(x) places posterior mass on negative values; fitting the GP to log ℓ(x) instead respects the positivity of likelihoods.

GPs vs Log-GPs

[Figure, left: the true function ℓ(x) with evaluations, a GP posterior, and a Log-GP posterior; right: the true log-function log ℓ(x), ranging from 0 down to about −200, with evaluations and a GP posterior.]

Integrating under Log-GPs

[Figure, left: the true function ℓ(x) with evaluations, the mean of a GP fit to ℓ(x), the Log-GP posterior, and an approximate Log-GP supported on inducing points; right: the true log-function log ℓ(x) with evaluations, the GP posterior in log space, and the inducing points.]

The integral of the exponent of a GP has no closed form, so the Log-GP must itself be approximated (here via inducing points) before integrating; a sampling-based sketch of the same idea follows.
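This sketch is a stand-in for the inducing-point approximation in the figure, whose details the transcript does not give: fit a GP to log ℓ(x), draw posterior samples of the log-function, exponentiate, and integrate each against p(x) on a grid. All specifics are hypothetical.

```python
# Approximate integration under a log-GP surrogate: Z = ∫ exp(g(x)) p(x) dx,
# with g drawn from a GP posterior fit to log-likelihood evaluations (toy).
import numpy as np

rng = np.random.default_rng(0)
l = 0.8
k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :])**2 / l**2)

xs = np.linspace(-2, 2, 7)
log_ls = -xs**2                                    # toy log-likelihood values

grid = np.linspace(-4, 4, 200)
Kxx = k(xs, xs) + 1e-6 * np.eye(len(xs))
Kgx = k(grid, xs)
mean = Kgx @ np.linalg.solve(Kxx, log_ls)          # GP posterior in log space
cov = k(grid, grid) - Kgx @ np.linalg.solve(Kxx, Kgx.T)

p = np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi)      # p(x) = N(0, 1)
dx = grid[1] - grid[0]
g = rng.multivariate_normal(mean, cov + 1e-6 * np.eye(len(grid)),
                            size=500, check_valid="ignore")
Z = (np.exp(g) * p).sum(axis=1) * dx               # exp keeps ℓ(x) positive
print(f"E[Z] ≈ {Z.mean():.4f}, sd(Z) ≈ {Z.std():.4f}")
```

Exponentiating the GP guarantees positivity, but the integral of exp(g) is itself intractable, which is exactly why a second round of approximation is needed (sampling here, inducing points on the slide).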

Conclusions

- Model-based integration allows active learning about integrals, can require fewer samples than MCMC, and lets us check our assumptions.
- BQ has nice convergence properties if its assumptions are correct.
- For inference, a GP is not an especially appropriate model, but other models are intractable.

Limitations and Future Directions

- Right now, BQ really only works in low dimensions (< 10), when the function is fairly smooth, and it is only worth using when computing f(x) is expensive.
- How to extend to high dimensions? Gradient observations are helpful, but a D-dimensional gradient counts as D separate observations.
- It seems unlikely that we'll find another tractable nonparametric distribution like the GP. Should we accept that we'll need a second round of approximate integration on a surrogate model?
- How much overhead is worthwhile? Work on bounded rationality seems relevant.

Thanks!