
Data Analysis Recipes: Using Markov Chain Monte Carlo*

David W. Hogg and Daniel Foreman-Mackey

Published 2018 May 11. © 2018. The American Astronomical Society.

Citation: David W. Hogg and Daniel Foreman-Mackey 2018 ApJS 236 11. DOI: 10.3847/1538-4365/aab76e


Abstract

Markov Chain Monte Carlo (MCMC) methods for sampling probability density functions (combined with abundant computational resources) have transformed the sciences, especially in performing probabilistic inferences, or fitting models to data. In this primarily pedagogical contribution, we give a brief overview of the most basic MCMC method and some practical advice for the use of MCMC in real inference problems. We give advice on method choice, tuning for performance, methods for initialization, tests of convergence, troubleshooting, and use of the chain output to produce or report parameter estimates with associated uncertainties. We argue that autocorrelation time is the most important test for convergence, as it directly connects to the uncertainty on the sampling estimate of any quantity of interest. We emphasize that sampling is a method for doing integrals; this guides our thinking about how MCMC output is best used.


Original content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. When Do You Need MCMC?

Markov Chain Monte Carlo (MCMC) methods are methods for sampling probability distribution functions or probability density functions (pdfs). These pdfs may be either probability mass functions on a discrete space or probability densities on a continuous space, though we will concentrate on the latter in this article. MCMC methods do not require that you have a full analytic description of the properly normalized pdf for sampling to proceed; they only require that you be able to compute ratios of the pdf at pairs of locations. This makes MCMC methods ideal for sampling posterior pdfs in probabilistic inferences.

In a probabilistic inference, the posterior pdf $p(\theta | D)$, or pdf for the parameters $\theta $ given the data $D$, is constructed from the likelihood $p(D| \theta )$, or pdf for the data given the parameters, and the prior pdf $p(\theta )$ for the parameters by what is often known as the "Bayes rule,"

$p(\theta \,|\, D)=\frac{1}{Z}\,p(D\,|\,\theta )\,p(\theta ).$ (1)

In these contexts, the constant Z, sometimes written as $p(D)$, is known by the names "evidence," "marginal likelihood," "Bayes integral," and "prior predictive probability" and is usually extremely hard to calculate.7 That is, you often know the function $p(\theta | D)$ up to a constant factor; you can compute ratios of the pdf at pairs of points, but not the precise value at any individual point.

In addition to this normalization-insensitive property of MCMC, in its simplest forms it can be run without computing any derivatives or integrals of the function, and (as we will show below in Section 3) in its simplest forms it is extremely easy to implement. For all these reasons, MCMC is ideal for sampling posterior pdfs in the real situations in which scientists find themselves.

Say you are in this situation: You have a huge blob of data $D$ (think of this as a vector or list or heterogeneous but ordered collection of observations). You also have a model sophisticated enough—a probabilistic, generative model, if you will8 —that, given a setting of a huge blob of parameters (again, think of this as a vector or list or heterogeneous but ordered collection of values) $\theta $, you can compute a pdf for data (or likelihood9 ) $p(D| \theta )$. Furthermore, say also that you can write down some kind of informative or vague prior pdf $p(\theta )$ for the parameter blob $\theta $. If all these things are true, then—even if you cannot compute anything else—in principle a trivial-to-implement MCMC can give you a fair sampling of the posterior pdf. That is, you can run MCMC (for a very long time—see Section 5 for how long), and you will be left with a set of K parameter-blob settings ${\theta }_{k}$ such that the full set $\{{\theta }_{k}\}{}_{k=1}^{K}$ constitutes a fair sampling from the posterior pdf $p(\theta | D)$. We will give some sense of what "fair" means in this context below.

All that said, and adhering to the traditions of the Data Analysis Recipes project,10 we are compelled to note at the outset that MCMC is in fact overused. Because MCMC provably (under assumptions,11 some of which will be discussed) samples the full posterior pdf in all of parameter space, many investigators use MCMC because (they believe) it will sample all of the parameter space ($\theta $-space). That is, they are using MCMC because they want to search the parameter space for good models. This is not a good reason to use MCMC! Another bad use case is the following: Because MCMC samples the parameter space representatively, it spends most of its time near very good models, models that (within the confines of the prior pdf) do a good job of explaining the data. For this reason, many investigators are using MCMC because it effectively optimizes the posterior pdf or, for certain choices of prior pdf, optimizes the likelihood.12 This is another bad reason!

Both of these reasons for using MCMC—that it is a parameter space search algorithm, and that it is a simple-to-code effective optimizer—are not good reasons. MCMC is a sampler. If you are trying to find the optimum of the likelihood or the posterior pdf, you should use an optimizer, not a sampler. If you want to make sure you search all of the parameter space, you should use a search algorithm, not a sampler. MCMC is good at one thing, and one thing only: sampling ill-normalized (or otherwise hard to sample) pdfs.

In what follows, we are going to provide a kind of "user manual" or advice document or folklore capture regarding the use of MCMC for data analysis. This will not be a detailed description of multiple MCMC methods (indeed, we will only explain one method in detail), and it will not be about the mathematical properties or structure of the method or methods.13 Think of this article as a getting-started document for a new user of MCMC methods. It concentrates on conceptual points and questions we get asked frequently by users, focusing on how an MCMC project is diagnosed and tested, and how its outputs are used responsibly. Most of what we say will be applicable to any MCMC method; in that sense the discussion is general. And all in the language—more or less—of an astrophysicist.

The first couple of sections will describe what a sampling is and how the simplest MCMC method, the Metropolis–Hastings algorithm, can provide one. The next few sections will provide ideas about how to initialize, tune, and operate MCMC methods for good performance. The last few sections will provide advice for making decisions among the myriad MCMC method implementations, how to implement a good likelihood function and prior pdf function for inference, and how to troubleshoot standard kinds of problems that arise in operating MCMC methods on real problems. We will leave parenthetical and philosophical matters to the footnotes.

2. What Is a Sampling?

Heuristically, a sampling $\{{\theta }_{k}\}{}_{k=1}^{K}$ from some pdf $p(\theta )$ is a set of K values ${\theta }_{k}$ that are draws from the pdf. Heuristically, if an enormous (large K) sampling could be displayed in a finely binned $\theta $-space histogram, the histogram would look—up to a total normalization—just like the original function $p(\theta )$.

We do not know all the relevant mathematics (measure theory), but for our purposes here a pdf is any single-valued (scalar) function that is non-negative everywhere (the entire domain of $\theta $) and obeys a normalization condition

$0\leq p(\theta )\quad \mbox{for all}\ \theta ,$ (2)

$1=\int p(\theta )\,{\rm{d}}\theta ,$ (3)

where, implicitly, the integral is over the full domain of $\theta $. Importantly, although $p(\theta )$ is single valued, $\theta $ can be a multielement vector or list or blob; it can be arbitrarily large and complicated. In data-analysis contexts, $\theta $ will often be the full blob of free parameters in the model. Implicitly, the integral in Equation (3) is high-dimensional; it has as many dimensions as there are elements or entries or components of $\theta $. Also, if there are elements or entries or components of $\theta $ that are discrete (that is, take on only integer values or equivalent), then along those dimensions the integral becomes a discrete sum. The latter is a detail to which we return below (briefly, in Section 9).

Given this pdf $p(\theta )$, we can define expectation values ${E}_{p(\theta )}[\theta ]$ for $\theta $ or for any quantity that can be expressed as a function $g(\theta )$ of $\theta $:

${E}_{p(\theta )}[\theta ]=\int \theta \,p(\theta )\,{\rm{d}}\theta ,$ (4)

${E}_{p(\theta )}[g(\theta )]=\int g(\theta )\,p(\theta )\,{\rm{d}}\theta ,$ (5)

where again the integrals are implicitly definite integrals over the entire domain of $\theta $ (all the parts of $\theta $ space in which $p(\theta )$ is finite) and the integrals are multidimensional if $\theta $ is multidimensional. These expectation values are the mean values of $\theta $ and $g(\theta )$ under the pdf. A good sampling—and really this is the definition of a good sampling—makes the sampling approximation to these integrals accurate. With a good sampling $\{{\theta }_{k}\}{}_{k=1}^{K}$ the integrals get replaced with sums over samples ${\theta }_{k}$:

${E}_{p(\theta )}[\theta ]\approx \frac{1}{K}\sum _{k=1}^{K}{\theta }_{k},$ (6)

${E}_{p(\theta )}[g(\theta )]\approx \frac{1}{K}\sum _{k=1}^{K}g({\theta }_{k}).$ (7)

That is, a sampling is good when any expectation value of interest is accurately computed via the sampling approximation. The word "accurately" here translates into some kinds of theorems about limiting behavior; the general idea is that the sampling approximation becomes exact as K goes to infinity. The size K of the sampling you need in practice will depend on the expectations you want to compute and the accuracies you need.

As we noted above (Section 1), in the context of MCMC, we are often using some badly normalized function $f(\theta )$. This function is just the pdf $p(\theta )$ multiplied by some unknown and hard-to-compute scalar. In this case, for our purposes, the conditions on $f(\theta )$ are that it be non-negative everywhere and have finite integral Z:

$0\leq f(\theta )\quad \mbox{for all}\ \theta ,$ (8)

$Z=\int f(\theta )\,{\rm{d}}\theta .$ (9)

And recall that we do not actually know the value of Z, but we do know that it is finite.

When the sampling $\{{\theta }_{k}\}{}_{k=1}^{K}$ is of one of these badly normalized functions $f(\theta )$—as it usually will be—the sampling-approximation expectation values are the expectation values under the properly normalized corresponding pdf, even though you might never learn that normalization. When we run MCMC sampling on $f(\theta ),$ a function that differs from a pdf $p(\theta )$ by some unknown normalization constant, then the sampling permits computation of the following kinds of quantities:

${E}_{p(\theta )}[\theta ]=\frac{\int \theta \,f(\theta )\,{\rm{d}}\theta }{\int f(\theta )\,{\rm{d}}\theta },$ (10)

${E}_{p(\theta )}[g(\theta )]=\frac{\int g(\theta )\,f(\theta )\,{\rm{d}}\theta }{\int f(\theta )\,{\rm{d}}\theta }.$ (11)

That is, the sampling can be constructed (as we will show below in Section 3) from evaluations of $f(\theta )$ directly, and it permits you to compute expectation values without ever requiring you to integrate either the numerator integral or the denominator integral, both of which are generally intractable.14

The "correctness" of a sampling is defined (above in Section 2) in terms of its use in performing integrals or approximate computation of integrals. In a deep sense, the only thing a sampling is good for is computing integrals. There are many uses of the MCMC sampling, some of them good and some of them bad. Most of the good or sensible uses will somehow involve integration.

For example, one magical property of a sampling (in a D-dimensional space) is that a histogram (a D-dimensional histogram) of the samples (divided, if you like, by the number of samples in each bin and the bin width15 ) looks very much like the pdf from which the samples were drawn. This is a way to "reconstruct" the pdf from the sampling: make a histogram from the samples. Even in this case, the sampling is being used to do integrals; the (possibly odd) idea is that the approximate or effective value of the pdf in each bin of the histogram is an average over the bin. That average is obtained by performing an integral.
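As a minimal sketch of this use (in Python; the stand-in samples and all names are our own choices, not part of any prescribed recipe), note that matplotlib's density=True flag performs exactly the divisions described above:

```python
import numpy as np
import matplotlib.pyplot as plt

samples = np.random.default_rng(42).normal(size=100_000)  # stand-in for MCMC output

# density=True divides each bin count by (K times the bin width), so the
# histogram is a bin-averaged estimate of the normalized pdf p(theta)
plt.hist(samples, bins=100, density=True, histtype="step", color="k")
plt.xlabel("theta")
plt.ylabel("approximate p(theta)")
plt.show()
```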

Integrals are also involved in finding the mean, median, and any quantiles of a pdf. They are not involved in finding the mode of a pdf. For this reason (and others), in what follows, when we talk about what to report about the outcome of your MCMC sampling, we will advise in favor of mean, median, and quantiles, and we will advise against mode.

Finally—and perhaps most importantly—a magical property of a sampling (in a D-dimensional space) is that if some of your D dimensions are totally uninteresting (nuisance parameters, if you will) and some of your D dimensions are of great interest, the sampling in the full D-space is trivially converted into a sampling in the subspace of interest: you just drop from each ${\theta }_{k}$ vector (or blob or list) the dimensions of no interest! That is, the projection of the sampling to the subspace of interest produces a sampling of the marginalized pdf, marginalizing (or projecting) out the nuisance parameters.16 That is extremely important for inference, where there are always parameters with very different levels of importance to the scientific conclusions. This point generalizes from a subspace to any function of the parameters; we will return to this again below (Section 8).
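In code, this projection really is just dropping columns. A sketch, assuming the chain is stored as a (K, D) array (the array and the index choices are purely illustrative):

```python
import numpy as np

chain = np.random.default_rng(1).normal(size=(10_000, 4))  # stand-in: K samples, D = 4

# A fair sampling of the marginal pdf for dimensions 0 and 2 alone:
# keep those columns and drop the nuisance dimensions.
marginal = chain[:, [0, 2]]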

Although the discussion in this article is general, the most common use of MCMC sampling (for us, anyway) is in probabilistic inference. For this reason, we will often refer to the function $f(\theta )$ colloquially as "the posterior pdf"17 even though it is implicitly ill normalized and might not be a posterior pdf in any sense. We will also occasionally assume—just because it is true in inference—that the function $f(\theta )$ is the product of two functions, one called "the prior pdf" and one called "the likelihood." Again, this usage is colloquial and is only strictly correct in inference contexts with proper inputs. That the function $f(\theta )$ can be thought of as a product of a prior pdf and a likelihood is only necessary for what follows in the context of advanced sampling techniques like tempering or nested sampling, both mentioned briefly below (Section 10).

Problem 1. Look up (or choose) definitions for the mean, variance, skewness, and kurtosis of a distribution. Also look up or compute the analytic values of these four statistics for a top-hat (uniform) distribution. Write a computer program that uses some standard package (such as numpy18 ) to generate K random numbers x from a uniform distribution in the interval 0 < x < 1. Now use those K numbers to compute a sampling estimate of the mean, variance, skewness, and kurtosis (four estimates; look up definitions as needed). Make four plots of these four estimates as a function of 1/K or perhaps ${\mathrm{log}}_{2}K$, for $K={4}^{n}$ for n = 1 up to n = 10 (that is, K = 4, K = 16, and so on up to K = 1,048,576). Overplot the analytic answers. What can you conclude?
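A possible starting point for this problem (a sketch only; for the unit uniform the analytic values are mean 1/2, variance 1/12, skewness 0, and kurtosis 9/5 in the non-excess convention):

```python
import numpy as np

rng = np.random.default_rng(2718)
for n in range(1, 11):
    K = 4 ** n
    x = rng.uniform(0.0, 1.0, size=K)
    mean = x.mean()
    var = x.var()
    skew = np.mean((x - mean) ** 3) / var ** 1.5
    kurt = np.mean((x - mean) ** 4) / var ** 2  # non-excess convention
    print(K, mean, var, skew, kurt)
```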

3. Metropolis–Hastings MCMC

The simplest algorithm for MCMC is the Metropolis–Hastings algorithm (M–H MCMC).19 It is so simple that we recommend that any reader of this document who has not previously implemented the algorithm take a break at the end of this section and implement it forthwith, in a short piece of computer code, in the context of some simple problems.20

The M–H MCMC algorithm requires two inputs. The first is a handle to the function $f(\theta )$ that is the function to be sampled, such that the algorithm can evaluate $f(\theta )$ for any value of the parameters $\theta $. In data-analysis contexts, this function would be the prior $p(\theta )$ times the likelihood $p(D| \theta )$ evaluated at the observed data $D$. The second input is a handle to a proposal pdf function $q(\theta ^{\prime} | \theta )$ that can deliver samples, such that the algorithm can draw a new position $\theta ^{\prime} $ in the parameter space given an "old" position $\theta $. This second function must meet a symmetry requirement (detailed balance) we discuss further below. It permits us to random walk around the parameter space in a fair way.

The algorithm21 is the following: We have generated some set of samples, the most recent of which is ${\theta }_{k}$. To generate the next sample ${\theta }_{k+1}$, do the following:

  • 1.  
    Draw a proposal $\theta ^{\prime} $ from the proposal pdf $q(\theta ^{\prime} | {\theta }_{k})$.
  • 2.  
    Draw a random number 0 < r < 1 from the uniform distribution.
  • 3.  
    If $f(\theta ^{\prime} )/f({\theta }_{k})\gt r$, then ${\theta }_{k+1}\leftarrow \theta ^{\prime} ;$ otherwise, ${\theta }_{k+1}\,\leftarrow \,{\theta }_{k}$.

That is, at each step, either a new proposed position in the parameter space gets accepted into the list of samples, or else the previous sample in the parameter space gets repeated. The algorithm can be iterated a large number K times to produce K samples.
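For concreteness, here is a minimal sketch of the algorithm in Python, with an isotropic Gaussian proposal (the names mh_sample, f, and step_size, and the proposal choice itself, are our illustrative choices, not prescribed by the algorithm):

```python
import numpy as np

def mh_sample(f, theta0, n_steps, step_size=1.0, rng=None):
    # f: the (possibly badly normalized) density function f(theta)
    # theta0: starting position in parameter space, shape (D,)
    # step_size: root-variance of the Gaussian proposal (a tuning knob; Section 6)
    rng = np.random.default_rng() if rng is None else rng
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    f_theta = f(theta)
    chain = np.empty((n_steps, theta.size))
    for k in range(n_steps):
        # step 1: draw a proposal theta' from q(theta' | theta_k)
        theta_prime = theta + step_size * rng.standard_normal(theta.size)
        f_prime = f(theta_prime)
        # steps 2-3: accept if f(theta') / f(theta_k) beats a uniform deviate
        # (written multiplicatively to avoid dividing by a zero of f)
        if f_prime > rng.uniform() * f_theta:
            theta, f_theta = theta_prime, f_prime
        chain[k] = theta  # on rejection, the previous sample is repeated
    return chain
```

Calling this with, say, f = lambda x: np.exp(-0.25 * (x - 2.0) ** 2) is one way to start on Problem 2 below.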

Why does this algorithm work? The answer is not absolutely trivial,22 but there are two components to the argument: The first is that the Markov process delivers a unique stationary distribution. The second is that the stationary distribution is proportional to the density function $f(\theta )$.

This algorithm—and indeed any MCMC algorithm—produces a biased random walk through parameter space. It is a random walk for the same reason that it is "Markov": the step it makes to position ${\theta }_{k+1}$ depends only on the state of the sampler at position ${\theta }_{k}$ (and no previous state). It is a biased random walk, biased by the acceptance algorithm involving the ratios of function values; this acceptance rule biases the random walk such that the amount of time spent in the neighborhood of location $\theta $ is proportional to $f(\theta )$. Because of this local (Markov) property, the nearby samples are not independent; the algorithm only produces fair samples in the limit of arbitrary run time, and two samples are only independent when they are sufficiently separated in the chain (more on this below in Section 5).

The principal user-settable knob in M–H MCMC is the proposal pdf $q(\theta ^{\prime} | \theta )$. A typical choice is a multivariate Gaussian distribution for $\theta ^{\prime} $ centered on $\theta $ with some simple (diagonal, perhaps) variance tensor. We will discuss the choice and tuning of this proposal distribution below (Sections 6 and 9).

Importantly—for the algorithm given above to work correctly—the proposal pdf must satisfy a "detailed-balance" condition;23 it must have the property that

$q(\theta ^{\prime} \,|\, \theta )=q(\theta \,|\, \theta ^{\prime} ),$ (12)

that is, it must be just as easy to go one way in the parameter space as the other. You can break this property, if you like, and then adjust the acceptance condition accordingly (which is truly the Metropolis–Hastings algorithm), but we do not recommend this except under the supervision of a trained professional.24 The reason is this: it is one thing to draw samples from $q(x^{\prime} | x)$; it is another thing to correctly write down $q(x^{\prime} | x)$. If you violate detailed balance in $q(x^{\prime} | x)$, then you have to be able to both draw from and write down $q(x^{\prime} | x)$, and you are subject to a set of new possible bugs for your code. If the detailed-balance condition frightens you (as it frightens us), then just stick with pdfs that are symmetric in $\theta ^{\prime} $ and $\theta $, like the Gaussian, or a centered uniform distribution, and forget about it.

In what follows, when we discuss tuning of the proposal pdf, we will often cycle through the parameter-blob components and propose changes in only one dimension or element or component at a time. That is, at step $k+1$ you might only propose a one-dimensional move in the ith dimension, not in all D dimensions of the parameter space. That will not change anything significant; but if you are implementing this right now, you might want to do that. It will help us tune the proposal pdf (Section 6) and diagnose problems (Section 9).

One note to make here is that you must propose in the same coordinates (parameterization or transformation of parameters) as that in which your priors are specified, or else multiply in a (possibly nasty) Jacobian. That is, if your prior is "flat in $\theta $" but you find it easier to run your sampler by proposing steps in $\mathrm{ln}\theta $, if you do not modify your acceptance probability by the Jacobian, your real prior will not be flat in $\theta $. These thinkos can get subtle; the best way to avoid them is to write your code and your likelihood function and your prior pdf function and your proposal distribution all in precisely the same parameterization. We will return to this point below in our comments on testing (Section 9). The point is that if you do change coordinates between the statement of your priors and the specific variables you step in, you have to introduce Jacobian factors.
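As a sketch of the Jacobian bookkeeping (the bounds a and b and all names here are hypothetical): if the prior is flat in $\theta $ on $(a,b)$ but the sampler steps in $\phi =\mathrm{ln}\theta $, the density in $\phi $ picks up a factor $| {\rm{d}}\theta /{\rm{d}}\phi | ={e}^{\phi }$, that is, an additive $+\phi $ in the log:

```python
import numpy as np

def ln_f_in_phi(phi, a=1.0, b=10.0):
    # hypothetical bounds a, b; the prior is flat in theta, but we step in phi = ln(theta)
    theta = np.exp(phi)
    if not (a < theta < b):
        return -np.inf
    # flat ln-prior in theta (a constant, here 0) plus the ln-Jacobian (+phi);
    # omitting the +phi term would silently impose a prior flat in ln(theta)
    return 0.0 + phi
```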

As we discuss briefly below (Section 10), there are many MCMC methods more advanced than M–H. However, they all share characteristics with M–H (initialization, tuning, judging convergence), such that it is very valuable to understand M–H well before using anything more advanced. Furthermore, a scientist new to MCMC benefits enormously from building, tuning, and using her or his own MCMC software. One piece of advice we give then, to the new user of MCMC, is to code up, tune, and use a M–H MCMC sampler for a scientific project. A huge amount is learned in doing this, and it is very often the case that the home-built M–H MCMC does everything needed; you often do not need any more advanced tool. Only after you have concluded that your home-built M–H MCMC sampler is not suited to your project (or that you cannot develop it into a properly versatile software package) should you download and start to use any professionally developed alternative. That is, even if—in the end—you want to leave the sampling code to the experts and run an industrial-strength code, it is still valuable to build your own, given the simplicity of the algorithm, and given the intuition you gain by doing it (at least once) yourself.25

Because many problems involve a huge amount of dynamic range in the density function $f(\theta )$, and we like to avoid underflows and overflows (in, say, ratios computed for accept–reject steps), it is often advisable to work in the (natural) logarithm of the density rather than the density. When working in logarithmic density, the accept–reject step would change from a comparison of a random deviate with a ratio of probabilities to comparison of the log of a random deviate with a difference of log probabilities. In other words, the accept–reject step becomes

  • 1.  
    If $\mathrm{ln}f(\theta ^{\prime} )-\mathrm{ln}f({\theta }_{k})\gt \mathrm{ln}r$, then ${\theta }_{k+1}\leftarrow \theta ^{\prime} ;$ otherwise, ${\theta }_{k+1}\leftarrow {\theta }_{k}$.

This protects you from underflow but exposes you to some (negative) infinities, if you end up taking the logarithm of a zero. It is important to write your code to be infinity-safe.26
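Here is the Section 3 sketch rewritten in log space (again with our own names; note that the walker must be initialized where $f$ is nonzero, or the first comparison involves $\mathrm{ln}0$ on both sides and is meaningless):

```python
import numpy as np

def mh_sample_ln(ln_f, theta0, n_steps, step_size=1.0, rng=None):
    # same Metropolis sampler as before, but working in ln f to avoid underflow;
    # ln_f may return -np.inf outside the prior support
    rng = np.random.default_rng() if rng is None else rng
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    ln_f_theta = ln_f(theta)
    chain = np.empty((n_steps, theta.size))
    for k in range(n_steps):
        theta_prime = theta + step_size * rng.standard_normal(theta.size)
        ln_f_prime = ln_f(theta_prime)
        # a difference of log densities against the log of a uniform deviate;
        # a -inf proposal can never win this comparison, so zeros of f are safe
        if ln_f_prime - ln_f_theta > np.log(rng.uniform()):
            theta, ln_f_theta = theta_prime, ln_f_prime
        chain[k] = theta
    return chain
```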

Problem 2. In your scientific programming language of choice, write a very simple M–H MCMC sampler. Sample in a single parameter x and give the sampler as its density function p(x) a Gaussian density with mean 2 and variance 2. (Note that variance is the square of the standard deviation.) Make the proposal distribution $q(x^{\prime} | x)$ a Gaussian pdf for x' with mean x and variance 1. Initialize the sampler with x = 0 and run the sampler for more than $10^4$ steps. Plot the results as a histogram, with the true density overplotted sensibly. The resulting plot should look something like Figure 1.

Figure 1. Solution to Problem 2: MCMC samples from a Gaussian (black) and the true distribution (blue).

Problem 3. Redo Problem 2, but now with an input density that is uniform on 3 < x < 7 and zero everywhere else. The plot should look like Figure 2. What change did you have to make to the initialization, and why?

Figure 2. Solution to Problem 3: MCMC samples from a uniform distribution (black) and the true distribution (blue).

Problem 4. Redo Problem 2, but now with an input density that is a function of two variables (x, y). For the density function use two different functions. (a) The first density function is a covariant two-dimensional Gaussian density with variance tensor

$V=\left[\begin{array}{cc}2.0 & 1.2\\ 1.2 & 2.0\end{array}\right].$ (13)

(b) The second density function is a rectangular top-hat function that is uniform on the joint constraint 3 < x < 7 and 1 < y < 9 and zero everywhere else. For the proposal distribution $q(x^{\prime} ,y^{\prime} | x,y)$ use a two-dimensional Gaussian density with mean at [x, y] and variance tensor set to the two-dimensional identity matrix. Plot the two one-dimensional histograms and also a two-dimensional scatterplot for each sampling. Figure 3 shows the expected results for the Gaussian. Make a similar plot for the top hat.

Figure 3. Solution to Problem 4(a): MCMC samples from a two-dimensional Gaussian (scatterplot) and the one-dimensional marginalized distributions (histograms).

Figure 4. Solution to Problem 9: the black histogram shows the MCMC samples, and the blue curve is the analytic target density.

Problem 5. Redo Problem 4(a), but with different values for the variance of the proposal distribution $q(x^{\prime} | x)$. What happens when you go to very extreme values (e.g., $10^{-1}$ or $10^{2}$)?

Problem 6. Why, in all of the previous problems, did we give the proposal distributions $q(x^{\prime} | x)$ a mean of x? What would be bad if we hadn't done that? Redo Problem 4(a) with a proposal $q(x^{\prime} | x)$ with a stupidly shifted mean of $x+2$ and see what happens. Bonus points: modify the acceptance–rejection criterion to deal with the messed-up $q(x^{\prime} | x)$ and show that everything works once again.

4. Likelihoods and Priors

MCMC is used to obtain samples ${\theta }_{k}$ from a pdf $p(\theta )$, given a badly normalized all-positive function $f(\theta )$ that is different from the pdf by an unknown factor Z. In the context of data analysis, MCMC is usually being used to obtain samples ${\theta }_{k}$ that are parameter values $\theta $ for a probabilistic model of some data. That is, MCMC is sampling a pdf for the parameters.

Ideally, if you are using MCMC for inference, your code should input to the MCMC function or routine a probability function that is a product of a prior times a likelihood. As we noted above—in terms of implementation—it is usually advisable to work in the logarithm of the density function, so the function input to the MCMC code would be called something like ln_f(). This function ln_f() internally would compute and return the sum of a log-prior pdf ln_prior() and a log-likelihood function ln_likelihood().

If you are using "flat" (improper) priors, the ln_prior() function can just return a zero no matter what the parameters. If it is flat with bounds (that is, proper), the ln_prior() should check the bounds and return -Inf values when the parameter vector is out of bounds. The pseudo-code for the ln_f() function should look something like this:

```python
def ln_f(theta):                     # Equation (14), as pseudo-code
    value = ln_prior(theta)
    if value > -np.inf:              # only evaluate the likelihood in-bounds
        value = value + ln_likelihood(theta)
    return value
```

This pseudo-code ensures that you only compute the likelihood when the parameters are within the prior bounds; it assumes implicitly that the prior pdf is easier to compute than the likelihood. If you find yourself in the opposite regime, adjust accordingly. Of course, the above pseudo-code presumes that when you perform the accept–reject step of the MCMC method, the programming language properly handles the -Inf values.27

MCMC cannot sample a likelihood (which is a probability for the data given parameters).28 Despite this, in many cases, data analysts believe they are sampling the likelihood. This is because (we presume) they have put a likelihood function (or log-likelihood function) in as the input to the MCMC code, where the probability function (or log-probability function) should go. Then is it the likelihood that is being sampled? No, not really; it is a posterior probability that is directly proportional to the likelihood function. That is, it is a posterior probability for some implicit (and improper) "flat" priors.

It should be outside of the scope of this document to note here that it is a good idea to have proper priors.29 Proper priors obey the integral constraint (9), with Z finite. It is not a requirement that priors be proper for the posterior to be proper, so many investigators and projects violate this rule. This is outside the current scope, except for the following point: it is a great functional test of your sampler and your data-analysis setup to take the likelihood function to a very small power (much less than 1) or multiply the log-likelihood by a very small number (much less than 1) and check that the sampler samples, properly and correctly, the prior pdf. This test is only possible if the prior pdf is proper.
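As a sketch of this functional test, reusing the ln_prior() and ln_likelihood() names from the pseudo-code above (the value of eps is an arbitrary illustrative choice):

```python
import numpy as np

eps = 1e-8  # take the likelihood to a tiny power: L**eps, i.e., eps * ln L

def ln_f_test(theta):
    lnp = ln_prior(theta)
    if not np.isfinite(lnp):
        return -np.inf
    return lnp + eps * ln_likelihood(theta)

# Run the sampler on ln_f_test; histograms of the chain, dimension by
# dimension, should reproduce the (proper) prior pdf.
```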

Problem 7. Run your M–H MCMC sampler from Problem 2, but now with a density function that is precisely unity everywhere (that is, at any input value of x it returns unity)—that is, an improper function (as discussed in Section 4). Run it for longer and longer and plot the chain value x as a function of time step. What happens?

Problem 8. For a real-world inference problem, read enough of Hogg et al. (2010a) to understand and execute Exercise 6 in that document.

Problem 9. Modify the sampler you wrote in Problem 2 to take steps not in x but in $\mathrm{ln}x$. That is, replace the Gaussian proposal distribution $q(x^{\prime} | x)$ with a Gaussian distribution in $\mathrm{ln}x$, $q(\mathrm{ln}x^{\prime} \,|\, \mathrm{ln}x)$, but make no other changes. By doing this, you are no longer sampling the Gaussian p(x) that you were in Problem 2. What about your answers changes? What distribution are you sampling now? Compute the analytic function that you have sampled from—this will no longer be the same p(x)—and overplot it on your histogram. The solution to this problem is shown in Figure 4.

5. Autocorrelation and Convergence

A key question for an MCMC operator—the key question in some sense—is how long to run to be sure of having reliable results. It is disappointing and annoying to many that there is no extremely simple and reliable answer to this question.30

The reason that there is no simple answer is that you cannot really ever know that you have sampled the full posterior pdf, and the reason for that is that if you could know that, you would also be able to solve the famously difficult discrete optimization problem.31 Qualitatively, you can consider the scenario where you have two modes—two separated regions of substantial total probability—in the posterior pdf that are both important but separated by a large region of low probability. If you are using a simple sampler, you could sample one of these modes very well, but the sampler will take effectively infinite time to find the other mode. In most real problems, the situation is much worse, with an unknown number of modes, and knowing that you have a complete and representative sampling is effectively impossible.

In this context, it is simply not fair to require an MCMC run to have fully and completely sampled the posterior pdf, at least not in any provable sense. The question of whether a sampling is definitely converged to a representative sampling of the posterior pdf is actually outside the domain of science, because the domain of science is the domain of questions answerable by scientists. That is, unless you have some bounds on the support and smoothness of the posterior pdf, you can never know that you have correctly sampled the posterior pdf, despite many statements in the literature to the contrary.32 The upshot of this is that we do not and cannot require any kind of absolute convergence; indeed, it would be impossible even to test for it.

In this pragmatic situation—the real world, as it is called—we have to rely on heuristics. Heuristically, you have sampled long enough when you can see that the (or each) walker has traversed the high-probability parts of the parameter space many times in the length of the chain. Or, equivalently, you have sampled long enough when the first half of the chain shows very much the same posterior pdf morphology as the second half of the chain, or indeed as any substantial subset. Or, relatedly, you have sampled long enough when different walkers initialized differently and run with different random number seeds create posterior inferences that are substantially the same.

The above heuristics can be made more precise in terms of the amount of deviation one expects between the means and variances, say, of two disjoint subsets of the chain. The premier tool for making this heuristic precise, however, is to look at the "integrated autocorrelation time" of the chain. In general, when one has a sequence $({\theta }_{1},{\theta }_{2},{\theta }_{3},\ldots )$ generated by a Markov process in the $\theta $ space, nearby points in the sequence will be similar, but sufficiently distant points will not "know about" one another; the autocorrelation function measures this.

More formally, as discussed in Section 2, the goal of MCMC sampling is to compute integrals of the form given in Equation (5) using the Monte Carlo approximation in Equation (7). The Monte Carlo error introduced by this approximation is proportional to $\sqrt{{\tau }_{\mathrm{int}}/N}$, where ${\tau }_{\mathrm{int}}$ is the integrated autocorrelation time and N is the total number of samples of $p(\theta )$. In other words, ${\tau }_{\mathrm{int}}$ is the number of steps required for the chain to produce an independent sample.

A sampler with a smaller integrated autocorrelation time is better; you have to do fewer $f(\theta )$ calls per independent sample, and you have to run less time to get accurate sampling-based integral estimates. A sampler that takes an independent sample every time would have an autocorrelation time of unity, which is the best possible value; this optimal sampling is only possible for problems where the sampling is analytic (for example, if the posterior is perfectly Gaussian with known mean and variance). In general, the best sampler will be different for different problems, and we will (in Section 10) tune the samplers we have to do the best on the problems we have; this tuning will also be problem specific. There are many heuristic bases on which different samplers might be compared or tuned, but fundamentally it is lower autocorrelation time that separates good samplers from bad ones and is the ultimate basis on which we compare performance.33

We will not go into details about how to estimate τint, but it is notoriously difficult: it is a two-point statistic, and two-point statistics are much harder to estimate than one-point statistics.34 If you are not a hard-core user of MCMC, and if all you want is heuristic indicators of convergence, then what you should take from this section is that the autocorrelation time is involved in variance estimates, but that it is hard to estimate. If you want to think about autocorrelation estimation, more detailed references can be found elsewhere.35
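For readers who do want to experiment (see also Problem 11 below), here is one standard FFT-based sketch of the empirical autocorrelation function; the zero-padding prevents periodic wraparound:

```python
import numpy as np

def autocorr(x):
    # empirical, normalized autocorrelation function of a one-dimensional chain
    x = np.asarray(x, dtype=float)
    n = len(x)
    f = np.fft.rfft(x - x.mean(), n=2 * n)        # zero-padded transform
    acf = np.fft.irfft(f * np.conjugate(f))[:n]   # inverse of the power spectrum
    return acf / acf[0]                           # normalize so acf[0] == 1
```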

Besides estimating the integrated autocorrelation time, another simple and sensible test of convergence is the Gelman–Rubin diagnostic,36 which compares the variance (in one parameter, or your most important parameter, or all parameters) within a chain to the variance across chains. This requires running multiple chains, looking at the empirical variance of the parameter away from its mean within each individual chain, and comparing it to the variance in the mean of that parameter across chains, inflated to be a mean per-sample variance. What Gelman and Rubin do with these variances specifically is sensible, but the important question of convergence is whether, as the chains get longer, these two variances asymptotically reach stable values, and whether those two values agree. The Gelman–Rubin diagnostic is related to the autocorrelation time, in that it will only deliver success when the chains are much longer than an autocorrelation time.
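A sketch of one common textbook form of this diagnostic for a single parameter, assuming M independent chains of equal length N stacked into an (M, N) array (real implementations differ in details):

```python
import numpy as np

def gelman_rubin(chains):
    m, n = chains.shape                       # m chains, each of length n
    W = chains.var(axis=1, ddof=1).mean()     # mean within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)   # between-chain variance
    var_hat = (n - 1) / n * W + B / n         # pooled variance estimate
    return np.sqrt(var_hat / W)               # values near 1 suggest convergence
```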

Finally, one last point about convergence: since all convergence tests are fundamentally heuristic, it is useful to just make the heuristic visualization of the samples ${\theta }_{k}$ as a function of k, in the order they were generated. The chain is likely to be converged only if the random-walk process crossed the domain of $\theta $ fully many times during the MCMC run. Often some parameter directions are much worse than others; it is worth looking at the chain in all parameter directions.

Problem 10. Redo Problem 2, but now look at convergence: Plot the x chain as a function of time step. Also split the chain into four contiguous segments (the first, second, third, and fourth quarters of the chain). In each of these four, compute the empirical mean and empirical variance of x. What do you conclude about convergence from these heuristics?

Problem 11. Write a piece of code that computes the empirical autocorrelation function. You will probably want to speed this computation up by using a fast Fourier transform.37 Run this on the chain you obtained from Problem 2. Plot the autocorrelation function you find at short lags (Δ < 100). This plot should resemble Figure 5.

Figure 5. Solution to Problem 11: the autocorrelation function of the chain from Problem 2.

Problem 12. Write a piece of code that estimates the integrated autocorrelation time for a chain of samples using an estimate of the autocorrelation function and a given "window" size M (see Sokal 1997). Plot the estimated τ as a function of M for several contiguous segments of the chain and overplot the same function based on the full chain. What can you conclude from this plot? Implement an iterative procedure for automatically choosing M.38 Overplot this estimate on the plot of τ(M), and the result should look like Figure 6.

Figure 6. Solution to Problem 12: estimates of the integrated autocorrelation time of different segments of the MCMC chain from Problem 2 (black lines) and for the full chain (orange line) as a function of window size M. The "optimized" value computed using an iterative procedure is overplotted as a dashed line, and its value is listed in the title.

6. Tuning

Most MCMC methods make use of something like a "proposal distribution," which determines what kinds of steps the walker can take as it random-walks through the parameter space. This is the function we called $q(\theta ^{\prime} | \theta )$ in Section 3. The user generally has a lot of control over what this proposal distribution might be and how to choose the parameters. For example, in a D-dimensional parameter space, a zero-mean but anisotropic Gaussian (normal distribution) is often used to draw offsets for the walker to take from point to point. Even within this choice (which is of one function among many possibilities), there are $D(D+1)/2$ parameters in the D × D symmetric, positive-definite covariance matrix to set "by hand." How to choose this function and set these parameters?

The key idea here is that if the proposal distribution is too narrow—it proposes steps too small—almost all steps will be accepted (recall the acceptance–rejection step from Section 3), but it will take a long time to move anywhere because of timidity. If the proposal distribution is too wide—it proposes steps too large—the moves will cover parameter space easily, but almost no steps will be accepted; it will tend to jump to much lower probability regions. There is a Goldilocks step size (proposal distribution root variance) that is "just right." In one dimension this might be easy to find, but, as we say, in large numbers of dimensions, there is a lot of freedom in choosing the parameters of the distribution.

Figure 7. Solutions to Problems 13 and 14: two methods of tuning the M–H proposal parameters. Left: acceptance fraction as a function of proposal scale for the distribution from Problem 4(a). Right: integrated autocorrelation time for each parameter (indicated by the different colors) as a function of proposal scale parameter.

In terms of long-term computational efficiency, the only scalar that makes sense to optimize, when choosing proposal distribution (or, loosely speaking, step size), is the autocorrelation time, described above (Section 5). The optimal step size is the step size that makes for the shortest autocorrelation time. This is easy to state but hard to use in practice; being a second-order statistic of the MCMC chain, the autocorrelation time is a hard thing to measure without a lot of data, so it is hard to quickly estimate it and adjust, much less put it into some kind of optimization loop. When tuning, we usually use proxies for this.

The simplest heuristic proxy statistic for tuning is the acceptance fraction. If you are accepting almost all proposed steps, your step sizes are too small on average. If you are accepting almost none, your step sizes are too large. The Goldilocks value is between a half and about a quarter, with an argument floating around that it should be 0.234 for best performance in high-dimensional problems39 (though you could never tune it precisely enough to warrant that third digit of accuracy). In the burn-in phase (discussed in Section 7) of an MCMC run, it makes sense to track the acceptance ratio and adjust the proposal distribution variance as you get acceptance ratios that are far from the Goldilocks ratio. This process can be automated easily; such automation is part of many projects that use MCMC.
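A crude sketch of such automation (entirely our own heuristic; run_short_chain is a hypothetical helper returning a short chain and its acceptance count, and all thresholds and factors are illustrative):

```python
step_size = 1.0
for _ in range(20):  # a handful of short tuning rounds, all discarded as burn-in
    chain, n_accept = run_short_chain(step_size, n_steps=500)  # hypothetical helper
    acceptance = n_accept / 500.0
    if acceptance < 0.2:
        step_size *= 0.7   # too bold: most proposals rejected, shrink the steps
    elif acceptance > 0.5:
        step_size *= 1.4   # too timid: nearly everything accepted, grow the steps
```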

Figure 8. Solution to Problem 20: MCMC samples from the Rosenbrock density (Equation (15)), with the mean shown with blue lines and the 0.16, 0.5, and 0.84 quantiles shown as vertical dashed lines.

A very common—and very useful—kind of proposal distribution is one that cycles through parameters, taking a random step in just one parameter at a time.40 This kind of proposal distribution can be valuable, in part because it reduces the (almost impossible) D-dimensional tuning problem to D one-dimensional problems: in this form of proposal distribution, it is possible to track a separate acceptance fraction for every parameter. Code can be built that uses the burn-in phase to tune all D proposal variances such that each of them, individually, obtains acceptance at the same Goldilocks ratio.

One note to make here—because it is relevant to tuning—is that tuning can only take place during the burn-in phase (that is, some part of your chain you will discard later); you cannot tune while you run your final MCMC run. Why not? Because tuning the proposal distribution based on the past history of the chain violates the "Markov" property that each step depends only on the state at the previous step. Violating the Markov property can be very bad: when you violate it, you lose all the provable properties of MCMC on which all our righteous power is based.

Another proxy for autocorrelation time useful for tuning is the Expected Squared Jump Distance.41 This is the mean squared distance the walker moves, per step. It is maximized when the acceptance ratio is reasonable and the step size is large; it is large when the exploration of the space is fast. This proxy is easy to measure and use for tuning, and it is more directly related to autocorrelation time than the acceptance ratio. We recommend using it for tuning, though we have never used it ourselves. Like the acceptance fraction, it can also be used to tune in the case that the proposal distribution loops over parameters. Once again, the user gets (in this case) D one-dimensional tunings.
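Measuring this proxy is a one-liner; a sketch, with a stand-in (K, D) chain array (rejected steps contribute zero jump, so the acceptance behavior is folded in automatically):

```python
import numpy as np

chain = np.random.default_rng(3).normal(size=(50_000, 2))  # stand-in MCMC output

# mean squared Euclidean distance moved per step of the chain
esjd = np.mean(np.sum(np.diff(chain, axis=0) ** 2, axis=1))
```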

When sampling gets very slow, there are various tricks and tips to work through. The huge bag of possible tricks is so large that it goes way beyond the scope of this introductory article. In our experience, it is valuable to make friends with at least one statistician, one applied mathematician, and one computer scientist. Between them, they ought to span the relevant literature.

Problem 13. Run the MCMC sampling of Problem 4 with the covariant Gaussian density. Give the proposal density $q(x^{\prime} | x)$ a diagonal variance tensor that is Q times the two-dimensional identity matrix. Assess the acceptance fraction as a function of Q. Find (very roughly) the value of Q that gives an acceptance fraction of about 0.25. Do not try to optimize precisely; just evaluate the acceptance fraction on a logarithmic grid of Q with values of Q separated by factors of 2. The solution to this problem is shown in the left panel of Figure 7.

Problem 14. Redo Problem 13, but instead of trying to reach a certain acceptance fraction, try to minimize the autocorrelation time. You will need one of the autocorrelation-time estimators you might have built in a previous problem. (This, by the way, is the right thing to do, but often expensive.) What do you get as the best value of Q in this case? Again, just evaluate on a coarse logarithmic grid. The solution to this problem is shown in the right panel of Figure 7.

Problem 15. In Problem 13 you varied only the parameter Q, but really there are three free parameters (two variances and a covariance). If the problem was D-dimensional, how many tuning parameters would there be, in principle?

Problem 16. The Rosenbrock density is used as a demonstration case for many samplers (see, e.g., Goodman & Weare 2010). Test your sampler on this density:

$f({\theta }_{1},{\theta }_{2})\propto \exp \left(-\frac{100\,{({\theta }_{2}-{\theta }_{1}^{2})}^{2}+{(1-{\theta }_{1})}^{2}}{20}\right).$ (15)

Tune the Gaussian proposal distribution in your M–H MCMC sampler to sample this density efficiently. What autocorrelation time do you get? Compare to what emcee42 gets.

7. Initialization and Burn-in

Just as most (though not all) MCMC methods require a choice of proposal distribution, most (though not all) require a choice about initialization: the walker needs (or walkers need) to be started somewhere in the parameter space. In many use cases for MCMC, the investigator wants to use MCMC to obtain uncertainty information about or propagate uncertainty into parameter estimates. In these cases, it makes sense to initialize the walker or walkers at sensible parameter estimates, found by optimizing a likelihood or a posterior pdf in advance of sampling. In extremely high dimensions (large numbers of parameters), this can be a bad idea, since the optimal parameters are not necessarily near typical posterior samples,43 but in low dimensions (few to tens) this is often sensible.

Ideally, you will initialize the walker not at a completely irrelevant point, nor at the optimum of the posterior pdf, but at a typical or pretty good place in the posterior pdf. That would minimize burn-in! But, in general, being at the optimum or anywhere near a good place is usually better than a mindless initialization.

As to burn-in: if you have started your sampler in a nontypical place—or if you are concerned that you might have started your sampler in a nontypical place—then you should discard the beginning of your MCMC run before you do your inferences. This discarded part is called the "burn-in." Some practitioners insist that no burn-in is needed:44 so long as the initial point is a conceivable sample, you are fine! Ensemble methods (like emcee,45 and mentioned below in Section 10) require a burn-in phase (and its discard), because the whole point is that the ensemble must grow (or shrink) to fill the posterior pdf volume, and you cannot initialize the ensemble sensibly if you do not know that volume a priori.

If you suspect, or if it is even possible, that your problem is badly multimodal, then you will have to start at multiple points in parameter space and compare the resulting chains. By "badly multimodal" we mean that there are peaks in the posterior pdf that are connected by low-enough valleys that it is very unlikely or takes a long time for a walker to traverse from one peak to the next. If this might be an issue, then the best diagnosis is to start many chains in parallel with different initializations and check that they lead to identical (or very similar) posterior inferences (same pdf location and shape and statistics).

Of course, if you find out the worst—if you find out that different initializations lead to different posterior inferences—you are in trouble: there is no trivial way to combine together the samplings you get of different modes! You can either wait a very long time so that you see a single walker in a single MCMC run traverse from mode to mode enough times to make a representative sampling (dozens or hundreds of times, ideally), or you can use a high-end sampler (like nested sampling; see Section 10) that is designed for such problems. In principle, there are methods that involve splitting the space into two spaces, sampling them separately, and then combining the chains according to their relative evidence afterward. That's a research project beyond the scope of this article.

Of course, there are some kinds of multimodalities that might not be a problem. For example, you can have two solutions to a problem that differ only in the labeling of the components—say, the Gaussians making up a mixture of Gaussians—that make up the model. One solution has the K Gaussian components in one order, and the other in another order, but they are identical Gaussians in all other respects. Because these two solutions make the same predictions for any data and only differ in their irrelevant, latent, internal "naming" of components, they are not really different solutions. This suggests that when you ask whether the outputs of two MCMC chains are consistent with one another, you should do so in the realm of the parameters you care about, not irrelevant parameters that do not have an impact on any present or future data.

But it is worth remembering that—just as MCMC is not a good optimizer (generically, samples will not lie close to the maximum of the posterior pdf)—it is also not a good search algorithm. There is no sense in which (standard) MCMC methods are efficient at searching, or engineered to search, all of parameter space. If you really need to check all of parameter space, you should put MCMC aside and do an exhaustive search (which will in general take an enormous amount of time).

Problem 17. Redo Problem 2, but with different starting positions. What happens as you make the starting position extremely far from the origin? What is the scaling—as you move the initialization further away, how much longer does it take for the sampler to reach reasonable density?

Problem 18. Check the scaling you found in Problem 17 with a higher-dimensional Gaussian (try, for example, a 10d Gaussian). The same or worse?

Problem 19. Import (or write) an optimizer, and upgrade the code you wrote for Problem 17 to begin by optimizing $\mathrm{ln}p(x)$ and only then start the MCMC sampler from that optimum. Use a sensible optimizer. Compare the scaling you found in Problem 17 to the same scaling for the optimizer. To make this test fair, do not use the awesome math you know about Gaussians to help you here; pretend that p(x) is an unknown function with unknown derivatives.

8. Results, Error Bars, and Figures

For a committed probabilistic scientist—frequentist or Bayesian—there is no single "answer"; there are only pdfs. In a paper or in an email or in a conversation we might say what "the answer is," but even if we say it with an error bar, we have departed the probabilistic program. There are many principled ways to depart the program; in another forum, we hope to say more about the economic explanation of how we can make hard decisions in the context of probabilistic reasoning.46 But without going into that economic model, it is fair to say that a substantial problem with being a committed probabilist is that when we publish, we are not permitted (not now, at least) to publish a probability distribution over publications. We have to publish one single, deterministic text; we have to make a decision about what to write in the title, the abstract, the tables, figures, and results section. Given an MCMC sampling of the posterior pdf, what do we report as our results?

One thing we can keep in mind to guide us in this is what we said above (in Section 2), which is that samplings are good for doing integrals. We should endeavor to use as our "results" outputs from the MCMC that are based on integrals computed with the sampling. This includes expectations, medians, quantiles, one-dimensional histograms, and multidimensional histograms. This does not include the "best" sample or a mode or optimum. The latter things are not necessarily illegitimate outputs of the MCMC, but they do not make best use of the fact that the sampling is a tool for integration, and we do not recommend them.

Imagine that you are in the simplest case: you have a model with a small number of parameters (say, three-ish), and only one of them is of great interest. What are your options for the "measurement" of this parameter? The only simple integral-based options are the posterior mean or posterior median value for the parameter.

For example, imagine that by MCMC you have generated K samples ${\theta }_{k}$ of a parameter vector $\theta $. Now you have a scalar function $g(\theta )$ that takes the parameter vector and returns a scalar value,47 which could be as simple as a single component of the parameter vector (one parameter from the list), or something more complex. The mean-of-sampling value $\langle g\rangle $ is just

$\langle g\rangle =\frac{1}{K}\sum _{k=1}^{K}g({\theta }_{k}),$ (16)

and the median-of-sampling value is just the [K/2]th value of $g({\theta }_{k})$ when the $g({\theta }_{k})$ have been ordered (sorted) from lowest to highest. These are both produced by integrals; the first is the value returned by the expectation integral estimate; the second is the value past which half of the integrated pdf lies.

What are your best options for the "uncertainty" or "limits" on this parameter? The only simple integral-based options are either variances or posterior quantiles. We usually use quantiles. That is, the "1σ" error bar can be taken to be the half-size of the central (or smallest) interval of the parameter that contains 68% of the posterior samples. In our own work, we usually create a 68% interval or region by excluding the top and bottom 16% of the posterior samples. Some more aggressive investigators find the smallest interval that contains 68% (or 95%) of the samples. Both are legitimate, and they are nearly identical for distributions that are nearly symmetric around the mean. The "2σ" error bar would be the same but for the 95% interval (excluding the top and bottom 2.5%). Again, these limits can be estimated48 by ordering the $g({\theta }_{k})$ values and finding the $\theta $ values such that some fraction (0.68 or 0.95) of the samples lie between them. Again, to form these limits, $g(\theta )$ must be a scalar function (so sorting is well defined). Because of the danger of having them confused with frequentist uncertainties, Bayesians usually call these regions "credible" rather than "confidence" regions.49

One amusing thing is that if you choose the 68% interval sensibly but use as the "measurement" the posterior mean, you can get pathological situations (from large skewness) in which the posterior mean measurement is actually outside the 1σ confidence interval. For this reason (and others), we usually recommend as a default behavior—in the one-dimensional case—to choose the median of sampling as the measurement value, the 16% quantile as the lower 1σ error bar, and the 84% quantile as the upper 1σ error bar. This has pathologies in higher dimensions (as we are about to see), but it is pretty safe for one-dimensional answers.
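In code, this default report is a few lines. A sketch, where g holds the K sampled values of the scalar of interest (a skewed stand-in array here, to make the asymmetric error bars visible):

```python
import numpy as np

g = np.random.default_rng(4).gamma(2.0, size=10_000)  # stand-in: a skewed posterior

lo, med, hi = np.percentile(g, [16, 50, 84])
measurement = med                          # the reported value
err_minus, err_plus = med - lo, hi - med   # report as med (+err_plus, -err_minus)
```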

The conservative scientist would show not just 68% error bars but also 95% ones (to help readers visualize the skew of the posterior pdf). The very conservative scientist would also show a histogram of the posterior samples, with the relevant quantiles (2.5%, 16%, 50%, 84%, and 97.5%) indicated.

As we mentioned above, scientists often like to report the "best sample," that is, the sample with the highest posterior pdf or $f(\theta )$ value. This is not usually a good idea. For one thing, MCMC is a sampling algorithm, not an optimization algorithm: there are no guarantees that it will find the optimum of the posterior pdf in reasonable time.50 Second, the closeness of the best sample to the posterior mode is a strong function of sample size, and with very bad scaling.51 Third, if you just want an optimum, use an optimizer! That is, we strongly advise against reporting, as "the" measurement or "the result," the parameter value or values for the best sample. Giving some posterior samples is a good idea (we return to this below).

If you feel drawn to give the best sample, instead give the parameter values found by optimizing the posterior pdf, starting at the best sample as an initialization (and sacrifice your probabilism52 ). This optimal-posterior parameter value is called the "maximum a posteriori" or "MAP" value for the parameters: it is like a maximum-likelihood value but regularized by the prior. It is an estimator with some good (and some bad) properties, but it is not the point of MCMC and therefore outside the scope of this article.
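If you do take this route, a hedged sketch of the polish step might look like the following; ln_f, chain, and ln_f_chain are hypothetical names for your log-posterior function, your (K, D) sample array, and the stored log-posterior values at the samples.

```python
import numpy as np
from scipy.optimize import minimize

def map_estimate(ln_f, chain, ln_f_chain):
    """Polish the best sample into a MAP value with an off-the-shelf optimizer."""
    theta0 = chain[np.argmax(ln_f_chain)]            # best sample as initialization
    result = minimize(lambda theta: -ln_f(theta), theta0)
    return result.x                                  # the MAP parameter vector
```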

The parameter space $\theta $ is usually multidimensional, and usually it is more than one of those parameters that is of interest. In this case, the mean or median of sampling can be produced for each dimension. Because the sampling projected onto any dimension is a marginalization of the posterior pdf, any such one-dimensional mean or median is the mean or median of the marginalized posterior pdf.

There is one oddity to note here, as it catches many investigators by surprise: even if the model is good and the posterior pdf is unimodal and well behaved, the median (or even mean) of the posterior pdf for each of the parameters, when taken together as a "best-fit" parameter vector θ*, will not itself necessarily be a good fit to the data! If there is substantial "curvature" to the pdf in the parameter space, the mean or median of sampling does not necessarily lie at a high-probability location of parameter space. This may seem counterintuitive, but it is easy to see in the case of a "banana-shaped" posterior pdf. This all relates to the fact that MCMC is not an optimizer but a sampler. It is important, when reporting the output of an MCMC run by giving means or medians of the posterior pdf, to remind the reader or user of that output that it does not necessarily represent (collectively) a good fit to the data; these outputs only give information that is useful one parameter at a time.

In principle, the only output from the sampling that safely gives both probabilistic information about the result of the inference and good-fitting models is a few—randomly chosen—example samples. We strongly recommend this;53 in particular, this is a better thing to do than to give the "best sample," which is not even guaranteed to be near the bulk of the posterior pdf or the samples therefrom.

If you do return posterior samples to your audience, how many should you return? There are only heuristic answers to this, but a guideline is that if you only care about a single (one-dimensional) parameter, you do not need many samples; a dozen suffices to give you a reasonable posterior pdf mean and variance.54 If you care about D parameters or dimensions of the parameter space, in general the need for samples grows exponentially (or perhaps even factorially!) with the number of dimensions. So if you have a 10-dimensional space, expect to be publishing your samples in an electronic table on the web!

If you do publish sufficient numbers of posterior samples, you might find users who want to make serious use of them.55

Of course, figures are always better than tables, at least in the printed form of scientific publications. There are several figures that are useful and should be made every time you run MCMC; we will describe a few here.

  • 1. Trace plots.—The first set of plots are the "trace plots"—the parameter values as a function of step number. These plots can be used to select burn-in lengths, indicate problems with the model or sampler, and qualitatively judge convergence. This being said, it is important to remember that good-looking trace plots do not guarantee converged sampling and, even if the traces look okay, the convergence diagnostics discussed in Section 5 might identify problems.
  • 2. Posterior predictive plots.—Another useful plot is the predictive distribution for the model in data space. For this plot, you take a few random samples from your chain, plot the prediction that each sample makes for the data, and overplot the observed data. This plot gives a qualitative sense of how well the model fits the data, and it can identify problems with sampling or convergence.
  • 3. Corner plots or scatterplot matrices.—In a D-dimensional parameter space, we recommend plotting all D one-dimensional sample histograms and all D-choose-2 two-dimensional histograms (scatterplots) to show the low-level covariances and nonlinearities. If you are clever, all these plots can be arranged into a lovely and informative triangle56 (a minimal sketch follows this list). These two-dimensional plots do not in any sense contain all the information in the sampling, but they are remarkable for locating expected and unexpected parameter relationships and often invaluable for suggesting reparameterizations and transformations that simplify your problem.
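As a sketch of the triangle (corner) plot, the following uses the corner package on a hypothetical (K, D) array of samples; a real call would use your own chain and parameter labels.

```python
import numpy as np
import corner

K, D = 10000, 3
chain = np.random.randn(K, D)            # stand-in for your own MCMC samples

figure = corner.corner(
    chain,
    labels=[f"theta_{d}" for d in range(D)],
    quantiles=[0.16, 0.5, 0.84],         # the quantiles recommended above
    show_titles=True,                    # median and quantile titles per panel
)
figure.savefig("corner.png")
```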

Problem 20. Execute Problem 16, or—if you are lazy—just install and use emcee to do the hard work. Now plot the x and y histograms of a 10,000-point sampling of this distribution (you might have to sample more than 10,000 and thin the chain), and also plot the two-dimensional scatterplot of x, y samples. Overplot on all three plots an indicator of the means and medians of the samples along the x and y directions. Overplot on all three plots the (above-)recommended quantiles of the samples. Comment on the results. A partial solution to this problem is shown in Figure 8.

9. Troubleshooting and Advice

When issues arise with MCMC sampling, they are sometimes difficult to diagnose: is it the MCMC code, the priors, the likelihood function, the initialization, the tuning, or something else?

Functional testing.—In addition to a full set of unit tests for every part of your code (yeah right57 ), we recommend building some end-to-end functional tests that permit you to perform MCMC runs with output that must meet certain expectations. The simplest kind of functional test is to sample a known distribution and check that the sampling has the moments you expect (to within tolerances). For example, you can sample a D-dimensional Gaussian distribution with known nontrivial covariance and check that the empirical covariance comes out as expected. This is an easy test to fail, so a success builds confidence in any MCMC code.
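For concreteness, here is a sketch of such a functional test using emcee; the covariance, walker count, chain length, and tolerance are arbitrary choices of ours.

```python
import numpy as np
import emcee

cov = np.array([[2.0, 1.2],
                [1.2, 1.0]])                    # known, nontrivial covariance
icov = np.linalg.inv(cov)

def ln_f(theta):
    return -0.5 * theta @ icov @ theta          # zero-mean Gaussian, unnormalized

nwalkers, ndim = 32, 2
sampler = emcee.EnsembleSampler(nwalkers, ndim, ln_f)
sampler.run_mcmc(1e-2 * np.random.randn(nwalkers, ndim), 5000)
samples = sampler.get_chain(discard=1000, flat=True)   # drop burn-in, flatten

# The empirical covariance should match the known one to within tolerance.
assert np.allclose(np.cov(samples.T), cov, atol=0.1)
```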

Another important functional test is to sample the prior pdf. This tests the sampler, tests that your priors really are what you think they are (and there are so many ways for them not to be58 ), and tests that your prior pdfs are proper (functions that can be normalized). This test is most effective as a true end-to-end test of your model if you structure your code such that the log-probability function ln_f input to the MCMC scheme is set up as a sum of a log-likelihood function ln_likelihood and a log-prior function ln_prior. The sampler can be switched to sampling the prior by a simple replacement of the ln_likelihood function with a trivial function that always returns zero.

MCMC should run well in this flat-likelihood case (priors are usually easy to sample), and the output sample histograms should look like what you expect given the prior pdfs you coded. If the walkers run off to positive or negative infinity and nothing seems to converge, your priors are probably not proper. If the priors do not look as you expect, you have either a bug in your code or else a thinko about how your priors ought to look. Either way, diagnosis is in order.
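A sketch of this structure, with a toy Gaussian likelihood and a proper flat prior standing in for your own functions:

```python
import numpy as np

def ln_prior(theta):
    if np.any(np.abs(theta) > 10.0):     # a proper flat prior on [-10, 10]^D
        return -np.inf
    return 0.0

def ln_likelihood(theta):
    return -0.5 * np.sum(theta ** 2)     # stand-in log-likelihood

def make_ln_f(ln_likelihood, ln_prior):
    def ln_f(theta):
        lp = ln_prior(theta)
        if not np.isfinite(lp):
            return -np.inf               # disallowed region; skip the likelihood
        return lp + ln_likelihood(theta)
    return ln_f

ln_f = make_ln_f(ln_likelihood, ln_prior)            # the real target
ln_f_prior = make_ln_f(lambda theta: 0.0, ln_prior)  # the prior-sampling test
```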

Likelihood issues.—Sometimes there can be problems with the likelihood function itself. For example, when you call the likelihood function again with the same parameters, do you get exactly the same answer? If you do not, your likelihood function effectively depends on some random numbers; maybe it includes within it an integral performed by the Monte Carlo method? In general, you want to write your likelihood function such that, even if it includes a numerical or stochastic integral, it produces precisely the same value when it returns to the same point in parameter space. One way to make this happen is to choose the random numbers (those used for any internal integration inside the likelihood function) in advance and reuse them on every likelihood call. This provides the same integral according to the same approximation, but it both speeds up the code (no regeneration of random numbers) and makes the likelihood function single valued. Even better is to replace internal Monte Carlo integrals with deterministic numerical (or, better yet, analytic) integrals.
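A sketch of the frozen-random-numbers trick, with a made-up one-dimensional latent integral standing in for whatever your likelihood integrates over:

```python
import numpy as np

rng = np.random.default_rng(42)
U = rng.normal(size=1000)     # integration points drawn once, reused forever

def ln_likelihood(theta):
    # Monte Carlo estimate of an internal integral over a latent variable,
    # using the *same* points U on every call, so the function is single valued.
    integrand = np.exp(-0.5 * (U - theta[0]) ** 2)
    return np.log(np.mean(integrand))
```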

If things are giving trouble and you suspect the likelihood function, another important test is to visualize slices through parameter space. When you call the likelihood function on a grid in each parameter (with the others fixed, say, to reasonable values), you should see a likelihood value that is a smooth function of each parameter, or at least as smooth as you expect. Things like numerical integration, truncated expansions, or adaptive approximations inside likelihood functions can make the function noisy or jagged at small scales in parameter space. These problems hurt optimizers more than samplers (although Hamiltonian methods depend strongly on good derivatives; see Section 10), but they usually point to code issues. In general, plotting the likelihood function as a function of parameters along slices in parameter space often reveals bugs or thinkos.
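A sketch of the slice visualization; ln_f and theta_fixed are placeholders for your own function and a reasonable parameter vector:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_slice(ln_f, theta_fixed, dim, lo, hi, num=200):
    """Plot ln_f along a grid in one parameter, holding the others fixed."""
    grid = np.linspace(lo, hi, num)
    values = []
    for x in grid:
        theta = np.array(theta_fixed, dtype=float)
        theta[dim] = x
        values.append(ln_f(theta))
    plt.plot(grid, values)                # should be as smooth as you expect
    plt.xlabel(f"theta[{dim}]")
    plt.ylabel("ln f")
    plt.savefig(f"slice_{dim}.png")
```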

Low acceptance fraction.—If you find that you are getting low acceptance fraction (in standard M–H MCMC or other varieties for which there is such a concept), you can simply reduce the sizes of the steps you consider to increase the fraction that are accepted. This means decreasing the variance of the proposal distribution, or whatever parameter or parameters control the width of that distribution. If reducing the variance of the proposal distribution does not increase the acceptance fraction, then you have a bug. You must find that bug by auditing your code or else returning to the functional tests listed above.

Similarly for high acceptance fraction: if you are finding a high acceptance fraction, then increasing the variance of the proposal distribution must decrease the acceptance fraction. If it does not, then you have a bug.

Sticky chains.—If you are concerned about acceptance fraction or convergence, it makes sense to plot parameter values as a function of "step number" or iteration number. That is, plot the ordered chain in each parameter dimension. These plots will have long horizontal patches if the chain is getting stuck; a converged chain will show the walker traversing the parameter space in every dimension fully many times over the length of the run. Not a few times; many times. If you are using an ensemble sampler,59 make plots showing all M walkers, each a different color. A converged run for an ensemble sampler will show every single walker traversing the parameter space in every dimension many times.

Initialization dependence.—If you initialize your MCMC runs in different places in parameter space, do you get the same (or very similar) final parameter samplings, in mean and variance? If you do not, then your posterior pdf is badly multimodal; trivial MCMC methods probably will not fix your problem. Not to put too fine a point on it: if your results are initialization dependent, then your MCMC runs are not converged.

You have to either go to a method that tries to explore the posterior more liberally, such as nested sampling or simulated tempering, or else split your "model" up into several submodels, each of which contains different posterior modes. In general, this problem is hard to fix; as we mentioned above, it is provably impossible to explore all of parameter space if the posterior pdf is complex in morphology. We do not have very useful advice here, except to spend time learning about the multiple modes in the posterior pdf and what each "means" or corresponds to, and to do your best to express in your results the multiple optima. One of the miracles of MCMC is that if you can fairly sample a complex posterior pdf, the fraction of samples in each mode tends to the relative total integrated posterior probability inside each mode, as the sampling converges.

Ensemble samplers react to multiple modes by obtaining a very low acceptance ratio,60 because the walkers in the different modes are not easily able to "help one another" move efficiently. Again, if the acceptance ratio is low and there appear to be multiple modes, it is likely that the sampling will also be initialization dependent, and it is very unlikely that your sampling is going to be converged.

Bad initialization.—If you initialize your sampler in a disallowed region of parameter space—that is, a part of the space with $f(\theta )=0$ (so your ln_f() function returns -Inf)—it might have a very hard time random-walking out of that location. At initialization time, it is important to test that the initialization is permitted. For ensemble methods, it is important to test that all the walkers in the ensemble are initialized to permitted values.

Parameterization problems.—It helps to think carefully about how to parameterize. For example, if there is an angle ϕ in your problem, and your prior requires it to be between 0 and 2π, and the posterior mode is very near 0, a little bit of probability leaking below 0 plus the prior cutoffs plus mod-2π symmetry makes an intrinsically unimodal problem multimodal! In these cases, we usually advise reparameterization from (say) an amplitude A and angle ϕ to two vector components $x\equiv A\,\cos \phi $ and $y\equiv A\,\sin \phi $. Of course, this reparameterization significantly changes the form of the prior pdf (the transformation brings in a Jacobian); it has to be adjusted with care.

To be explicit, imagine that you have parameters A and ϕ and priors p(A) and $p(\phi )$. Now imagine that instead (and very sensibly) you want to sample in parameters $x\equiv A\,\cos (\phi )$ and $y\equiv A\,\sin (\phi )$. What is the proper prior to apply to x, y? It is not just p(A)p(ϕ) with $A=\sqrt{{x}^{2}+{y}^{2}}$ and $\phi =\arctan (y,x)$! That's close, though! There is a correction factor, which is the absolute value of the determinant of the Jacobian matrix $\| d(A,\phi )/d(x,y)\| $. That is, it is the determinant of the full derivative of one set of parameters with respect to the other. Keep your units straight!61 In this particular case, the Jacobian is equal to 1/A, such that

Equation (17)

$$p(x,y)=\frac{1}{A}\,p(A)\,p(\phi ),$$

where $A=\sqrt{{x}^{2}+{y}^{2}}$ and $\phi =\arctan (y,x)$.62
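In code, Equation (17) amounts to adding a log-Jacobian term; ln_p_A and ln_p_phi are placeholders for your original log-priors:

```python
import numpy as np

def ln_prior_xy(x, y, ln_p_A, ln_p_phi):
    A = np.sqrt(x ** 2 + y ** 2)
    phi = np.arctan2(y, x)
    return ln_p_A(A) + ln_p_phi(phi) - np.log(A)   # the Jacobian factor 1/A
```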

For another example, sometimes there is almost an exact degeneracy between two parameters, say a and b. Then it makes sense to switch to parameters a + b and a − b, or some other linear combinations where correlations or near degeneracies are likely to be reduced. Again, such transformations require also prior transformations, which must be made with care. There are some affine-invariant samplers63 that are invariant to such transformations, but if you are not using such a sampler, it is worth getting out ahead of these problems. They can be found by performing an exploratory sampling, making a triangle plot as described above, and looking for narrow diagonal lines in the two-dimensional scatterplots.

For another example, sometimes some parameters are continuous, and some are integer (or discrete). In this case, it is sufficient, in the context of M–H MCMC, to make custom proposal distributions for the integer parameters that make sure only integers are proposed (provided that care is taken to make sure that detailed balance is preserved). An example is to have a proposal that has some probability of incrementing the parameter up and an equal (that is, the same) probability of incrementing it down. Some samplers implicitly or explicitly assume that all parameters are continuous;64 these should be avoided when some parameters are integer. If you really want to use such samplers when some parameters are discrete, you have to put a round() or int() operation into the likelihood and prior functions. This makes for a stepwise-constant posterior pdf. It is a hack, but it works in most circumstances.
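A sketch of such a proposal for a single integer parameter; because stepping up and down are equally probable, the proposal is symmetric and the standard Metropolis acceptance rule applies unchanged:

```python
import numpy as np

rng = np.random.default_rng()

def propose_integer(n):
    """Symmetric integer proposal: +1 or -1 with equal probability."""
    return n + rng.choice([-1, +1])
```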

For yet another example, there might be constraints such that all acceptable models lie on a nontrivial, nonlinear subspace of your parameter space. Here almost no arbitrary steps in parameter space could lead to an acceptance, since if you move arbitrarily, you are very unlikely to hit the subspace! Your best move here is to reparameterize so there are only models on that subspace. We are not claiming that that reparameterization is always easy to find; the mathematics of Lagrange multipliers can be useful here.

Model checking.—All models are wrong, but some are useful. That's not a quote from Box, but it is close.65 The point (for our purposes) is that if you check your model hard enough—that is, you take enough data—you will certainly rule it out.66 "Model checking" is an enormous subject that goes way beyond the scope of this article, for which reason we will not say much about it here except to make two points:

The first—and most important—point about model checking is that you always learn a huge amount by checking the model. What this entails is looking at the distribution of residuals or the quality of the predictions of the model when compared to the data. There are methods (like the chi-squared statistic and the Bayesian evidence integral) that check the absolute quality of the model in a scalar way. These are usually uninformative, because (as we noted) all models are wrong, and eventually the data will be good enough to return a bad statistic here.67

Much more informative are methods (like plotting residuals of the data away from the model in the space of the data) that look at what parts of the data the model "explains well" and what parts of the data the model "explains badly." These kinds of experiments usually reveal inadequacies of the model in a rich way, or confirm that certain assumptions are bad. Cross-validation68 is a framework for constructing valuable tests of this form, as is posterior predictive checking.69 Indeed, making posterior predictions of new data, and testing the model by taking completely new data and comparing to those predictions, is the strongest test of any model. Each sample from the MCMC chain provides a prediction for future data; the distribution over these sample predictions is the posterior prediction for new data. Importantly, you want to find ways to compare your model to your data in the space of the data. Do not just look at the parameter space of your model. Look at the data. It is the data that really exist, especially when all models are wrong!

There are model-checking methods (like varying the prior or assumptions in the likelihood function and rerunning MCMC) that test the sensitivity of results to assumptions. Not only are these very valuable for testing the model, but they are also essential for writing any paper about your results: the reader always wants to know the sensitivity of results to assumptions.70 Once again, it is important to find ways to display the results of such tests in the space of the data. All we will say here about these kinds of rich model tests is that you should do them and report the results.

The second—and less important—point about model checking is that the ur-Bayesian thing to be doing is computing a Bayesian evidence integral. However, it is usually inadvisable: The integral is expensive to compute. Most MCMC methods are cleverly designed to avoid computing this integral (as we note above in Section 1), so it usually adds expense to the project. At the end of the day, it returns only a single, scalar number, with no guidance about how to use it (and indeed there are only heuristics to reach for). If your model is wrong, the richer tests of visualization of residuals or sensitivity to assumptions are much more likely to lead to insight about what changes to make or what new directions to explore.71

10. More Sophisticated Sampling Methods

As we have discussed above (Section 3), MCMC methods are proven to be correct "in the limit," that is, when an infinite number of samples have been taken. When the function $f(\theta )$ has challenging properties, the results can approach this limit very slowly (or not at all72 ). For this reason, in the finite duration of a scientific project, most users of MCMC have one of two (related) problems: either (1) it is taking too long—it is taking too many steps or executing too many calls of the function $f(\theta )$ to get independent samples—or (2) it is not exploring the full parameter space, i.e., there are local optima (or bad local geometry around those optima) "trapping" the algorithm.

There is no general, problem-independent solution for either of these problems. Indeed, there are many different strategies that work well or badly for different functions $f(\theta )$. In this section, we will try to guide the reader toward methods that might help in different circumstances. We will not give a full description of all the MCMC algorithms out there, but just provide some field notes and pointers to references. New MCMC methods are being developed all the time, so a reader with a serious problem would do well to consult the applied mathematics or statistics literature.73 The different methods to which we refer here are (in general) very different in their difficulty of implementation. We will not help with that either, except to say that there are more and more open-source or distributed packages every year. We will try to name some current implementations in the notes, but any list compiled today will be incomplete tomorrow.74 The spaces of MCMC methods and MCMC method implementations are both growing rapidly, in part because MCMC has become such a core technology in the empirical sciences.

All we are going to do here is give a cursory mention to some methods we know about, with some keywords that can be used to search more deeply. There is no sense in which this list is comprehensive; these are just methods (or method classes) that are likely to be useful to projects similar to projects we ourselves have executed.

Ensembles.—As we discussed (Section 6), one of the critical challenges in the use of MCMC is tuning the proposal pdf $q(\theta ^{\prime} | \theta )$. In problems in which all of the parameters (all of the components or entries of $\theta $) are somehow "equivalent," or the parameters can be seen as components of a vector in a vector space, there are ensemble methods that make use of not just one random-walker but instead many walkers to automatically generate a properly tuned proposal distribution. These methods make use of the distribution of a set of independent walkers in the parameter space ($\theta $-space) to gauge the typical step sizes and directions at which new proposals should be made, obviating or greatly reducing the tuning requirements. This tuning improvement comes at the cost of a burn-in phase at the beginning of any run—a phase that lasts a few autocorrelation times—in which the ensemble expands or shrinks to fill the posterior volume. We are coauthors on one of these methods, emcee,75 which is popular in the physical sciences, but there are others.76 It is worthy of note that ensemble methods have particular peculiarities about initialization and convergence diagnostics that are worth understanding before use.77 Because ensemble methods are updating independent MCMC chains, they are often easy to parallelize to make use of multiple cores.

Gibbs.—Neither M–H MCMC nor most ensemble methods scale well to very large numbers of parameters.78 Even dozens of parameters can be enough to drive autocorrelation times extremely large. With larger numbers of parameters, new methods need to be considered. Gibbs sampling is worth considering when the number of parameters is large, and the different parameters have different "scopes" in the problem.

In many problems that have large numbers of parameters, the parameters—the components or entries of the blob $\theta $—are not equivalent at all. Some are global parameters, which touch or have an effect on every subpart of the data blob $D$, while others are local parameters, which only touch a small subset of the data. Or, in another use case, some parameters are linear parameters that—at fixed values of the other parameters—will have a Gaussian likelihood or posterior pdf. In these problems, there are some directions in parameter space that are hard to move in (they require complete recalculation of the full nonlinear function $f(\theta )$) and other directions that are very easy to move in (they require only partial recalculations or the sampling could even be analytic and exact). For problems with these structures, Gibbs samplers are ideal, and there are good implementations.79

A trivial kind of Gibbs sampling was mentioned above (Section 3) in the discussion of M–H MCMC: even when running vanilla M–H MCMC, it is often useful to update only one parameter at a time, that is, do only axis-aligned moves in the $\theta $-space, cycling through axes as you go from step to step. This helps with tuning and diagnosis, but it also permits a good implementation to capitalize on simpler calculations for the simpler parameter directions.80
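A sketch of one sweep of this axis-aligned scheme (sometimes called "Metropolis-within-Gibbs"); ln_f is your log-target and step holds per-axis proposal widths, both placeholders:

```python
import numpy as np

rng = np.random.default_rng()

def axis_aligned_sweep(ln_f, theta, step):
    """One M-H update per parameter, cycling through the axes."""
    theta = np.array(theta, dtype=float)
    lnf0 = ln_f(theta)
    for d in range(theta.size):
        proposal = theta.copy()
        proposal[d] += step[d] * rng.normal()      # axis-aligned Gaussian move
        lnf1 = ln_f(proposal)
        if np.log(rng.uniform()) < lnf1 - lnf0:    # standard accept-reject
            theta, lnf0 = proposal, lnf1
    return theta
```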

Gibbs sampling can also be very good for exploiting multiprocessing: global parameter updates might require a serial recalculation of the full likelihood, but local parameter updates can be done in parallel, with each local parameter working on its local bit of data all at the same time. This kind of structure is common in hierarchical inference, where local parameters touch only individual objects in some population, say, and global parameters are parameters of that population as a whole. Parallelization is built into some Gibbs implementations.81

Hamiltonian.—At the time of writing, the premier approaches for problems with large numbers of parameters are Hamiltonian methods.82 These methods perform very well at large dimensions but require (for speed) analytic derivatives of the function $f(\theta )$ with respect to the parameters.

In many cases of interest, the function $f(\theta )$ can be analytically differentiated, such that it is possible to quickly execute a gradient ${df}/d\theta $ evaluation. Indeed, in the modern world of computing, there are even auto-differentiation systems that automatically deliver code producing the gradient of any function you can write.83 There is a class of MCMC methods that make use of this gradient information to effectively inform the proposal distribution and thus speed up sampling.

The reason these methods are "Hamiltonian" is that they use a kind of Hamiltonian dynamics to improve the sampling: they augment the position of the walker in parameter space with a fictitious momentum in parameter space and use the derivatives in the accept–reject step. The momentum helps the sampler move through the space on a more efficient variant of the random walk. The ergodicity property of (relevant) dynamical systems ensures good sampling (in the infinite-time limit, of course); the connection to dynamics is that the negative logarithm of $f(\theta )$ becomes the potential for the dynamical system. At the present day, these samplers are essentially the only good options when the number of parameters gets very large.
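To make the mechanics concrete, here is a bare-bones sketch of a single Hamiltonian update (leapfrog integration plus a Metropolis correction on the total energy); the step size and path length are untuned placeholders, and for real work you would reach for a mature implementation such as Stan.

```python
import numpy as np

rng = np.random.default_rng()

def hmc_step(ln_f, grad_ln_f, theta, eps=0.1, n_steps=20):
    """One Hamiltonian update; -ln f(theta) plays the role of the potential."""
    p = rng.normal(size=theta.size)                 # fictitious momentum
    q, r = theta.copy(), p.copy()
    r += 0.5 * eps * grad_ln_f(q)                   # half kick
    for _ in range(n_steps - 1):
        q += eps * r                                # drift
        r += eps * grad_ln_f(q)                     # full kick
    q += eps * r
    r += 0.5 * eps * grad_ln_f(q)                   # final half kick
    # Accept or reject on the change in total energy H = -ln f + |p|^2 / 2.
    dH = (ln_f(q) - 0.5 * r @ r) - (ln_f(theta) - 0.5 * p @ p)
    return q if np.log(rng.uniform()) < dH else theta
```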

Importance.—Sometimes it is possible to sample efficiently from a distribution that is close to the target function $f(\theta )$ you care about, even when the target $f(\theta )$ itself is hard to sample efficiently with MCMC. An example is a problem that has a simple (Gaussian, say) likelihood function times a nontrivial prior: it is easy to sample from a Gaussian likelihood times a Gaussian prior, even without MCMC, but it might be hard to do the same sampling under a more complex prior. Importance sampling is a direct sampling method (not really an MCMC method at all): you take a sampling from the approximate function (which by assumption was easy) and then reweight the sampling (or do a subsequent rejection of samples) using the ratio of the true function to the approximate function as a weight or probability. We use this quite a bit in our hierarchical inferences,84 as well as in problems where the dimensionality is low.85

Two of the significant technical details for importance sampling are the following: the approximate function must have the same (or greater) support as the true function, and the method degrades in efficiency as the two functions diverge. If the ratio of the true function to the approximate function is to be used as a probability for rejection of samples, then the true function must be greater than or equal to the approximate function everywhere!
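A sketch of the reweighting variant, with a one-dimensional Gaussian as the easy approximate density q; ln_f must accept a vector of sample positions here, and all the names are ours:

```python
import numpy as np

rng = np.random.default_rng()

def importance_mean(ln_f, mu, sigma, K=100000):
    """Estimate the mean under the (unnormalized) target f by reweighting
    K draws from a Gaussian approximation q with weights f / q."""
    x = mu + sigma * rng.normal(size=K)             # draws from q
    ln_q = -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))
    ln_w = ln_f(x) - ln_q                           # log importance weights
    w = np.exp(ln_w - ln_w.max())                   # stabilize before exponentiating
    return np.sum(w * x) / np.sum(w)                # self-normalized estimate
```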

Tempering.—A huge problem for MCMC samplers—which make local moves based on local positioning in the parameter space—is that there can be multiple modes in the function $f(\theta )$, separated by valleys of zero or near-zero density. A vanilla MCMC method cannot easily cross such valleys with local moves. The idea behind simulated tempering and related methods is to "smooth out" the function $f(\theta )$ or reduce its dynamic range, to make the peaks less tall and the valleys less deep, to permit the MCMC random walk to cross the valleys. The standard method is to introduce a "temperature" variable and, while sampling, take the likelihood function to the power of the inverse temperature (or, if working in the logarithm, multiply the log-likelihood by the inverse temperature). When the temperature is high, this reduces the influence of the likelihood function relative to the prior pdf and (assuming that the prior pdf is easy to sample) makes movement in the parameter space easier. As with Hamiltonian methods, the sense in which this system makes use of a "temperature" is that the logarithm of $f(\theta )$ can be related to a potential, in a physical analogy. At each step, when the temperature has moved away from unity, the distribution being sampled is not exactly $f(\theta )$, but the results can be transformed into a fair sampling.
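The tempered target is a one-line change if your code is structured as a log-prior plus log-likelihood (as recommended in Section 9); here beta is the inverse temperature, and ln_likelihood and ln_prior are placeholders for your own functions:

```python
def make_tempered_ln_f(ln_likelihood, ln_prior, beta):
    """beta = 1 recovers the true posterior; beta < 1 flattens the likelihood,
    making it easier for the random walk to cross low-density valleys."""
    def ln_f(theta):
        return ln_prior(theta) + beta * ln_likelihood(theta)
    return ln_f
```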

Nested.—Similar to tempering-like methods are a class of methods called "nested" samplers,86 which also smoothly change the target distribution away from $f(\theta )$. In the nested case, instead of increasing the temperature, the samplings are of a censored version of the prior, censored by the value of the likelihood function. When the likelihood censoring is strong, only the most high-likelihood parts of the prior get sampled; when the likelihood censoring is weak, almost all of the prior gets sampled. The idea behind nested sampling—like tempering—is that it is designed to be a good method for exploring the full parameter space or searching for all of the modes of the posterior pdf. It is also designed such that it produces not just a sampling but also an estimate of the integral Z of the likelihood times the prior (the Bayes integral, or fully marginalized likelihood).

Reversible jump.—Finally, there are some extreme problems in which it is not just hard to sample in the parameter space, but the parameter space itself is not of fixed dimension.

For example, in the case in which you are modeling an astronomical image with a collection of stars and you do not know how many stars to use, the number of stars itself is a parameter, so the number of parameters is itself a function of the parameters.87 In these cases you can only use certain kinds of samplers (though M–H MCMC is one you can choose), and, in addition, you need to make special kinds of proposals that permit the system to jump from one parameter space to another that has different dimensionality. In these cases, the concept of detailed balance becomes nontrivial, but there are criteria for creating reversible-jump proposals between the parameter spaces. These methods are best built and operated under the supervision of a trained professional.

We would like to thank Alex Barnett (Flatiron), Jo Bovy (Toronto), Brendon Brewer (Auckland), Will Farr (Flatiron), Marla Geha (Yale), Jonathan Goodman (NYU), Fengji Hou (NYU), Dustin Lang (Toronto), Phil Marshall (KIPAC), and Iain Murray (Edinburgh) for valuable advice and comments over the years and Alessandro Gentilini for a very close read. This project benefited from various research grants, especially NSF grants AST-0908357, IIS-1124794, and AST-1517237; NASA grant NNX12AI50G; and the Moore-Sloan Data Science Environment at NYU. This work was performed in part under contract with the Jet Propulsion Laboratory (JPL) funded by NASA through the Sagan Fellowship Program executed by the NASA Exoplanet Science Institute.

Footnotes

  • Copyright 2013, 2014, 2015, 2016, 2017 the authors (OMG this took us a long time). This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.

  • The factor Z is often difficult to compute, because the likelihood (or the prior) can have extremely complex structure, with multiple arbitrarily compact modes, arbitrarily positioned in the (presumably high-dimensional) parameter space $\theta $. Elsewhere, we discuss the computation of this object (Hou et al. 2014), and so have many others before us. We also have (unpublished) philosophical arguments against calculating this Z if you can possibly avoid it, but these are outside the scope of this article. The point is that MCMC methods will not require that you know Z.

  • Briefly, a "model" for us is a likelihood function (a pdf for the data given model parameters) and a prior pdf over the parameters. Because, under this definition, the model can always generate (by sampling, say) parameters and parameters can generate (again by sampling, say) data, the model is effectively (or actually, if you are a true subjective Bayesian) a probability distribution (pdf) over all possible data.

  • Technically, $p(D| \theta )$ is only properly a likelihood function when we are thinking of the data $D$ as being fixed and the parameters $\theta $ as being permitted to vary.

  • 10 

    Every entry in the Data Analysis Recipes series begins with a rant in which we argue that most uses of the methods in question are not appropriate!

  • 11 

    The assumptions include things like the algorithm is run "long enough," where this phrase is undefined (since the convergence requirements are set by precision requirements on particular integrals), and the density that is being sampled has some connectedness properties: there are not distant islands of finite density separated by regions of zero (or exceedingly low) density.

  • 12 

Of course, a committed Bayesian would argue that any time you are optimizing a posterior pdf or optimizing a likelihood, you should be sampling a posterior pdf. That is, for some, when someone "wants" to optimize, it is actually useful that they choose the "wrong" tool, because that "wrong" tool gives them back something far more useful than the output of any optimizer! However, it is true that when the data are extremely informative (that is, the posterior is very narrow), there is not much difference between optimizing the posterior and sampling it, nor between optimizing the likelihood and optimizing the posterior. That is, all inference methods sort of converge as the data become very informative. That can be an important point when you are working on a problem where the posterior is very narrow.

  • 13 

    For a description of various different sampling methods, given from the point of view of an astrophysicist, see the review by Sharma (2017). We say a bit about different MCMC methods in Section 10, but our discussion is limited relative to Sharma's review. For even more depth, there are relevant book-length treatments of MCMC; our favorite is Brooks et al. (2011). There are also good books on Bayesian data analysis from a computing perspective more generally that include discussion of MCMC as a part of that; see, for example, Kruschke (2015), Andreon & Weaver (2015), and Hilbe et al. (2017).

  • 14 

    As we will see, the intractability comes from various directions, but one is that the dimension of $\theta $ gets large in most realistic situations. Another is that the support of $p(\theta )$ tends to be, in normal inference situations, far smaller than the full domain of $\theta $. That is, the pdf is at or very close to zero over all but some tiny and very hard-to-find part of the space.

  • 15 

    We are referring here to the point that if you want a histogram of samples to look just like the posterior pdf from which the samples are drawn, the histogram (thought of as a step function) must integrate to unity.

  • 16 

    This point about marginalization is, once again, a use of the sampling to perform an integral; the marginalized pdf is obtained from the full pdf by an integration. The sampling performs this integration automatically.

  • 17 

    We will also sometimes refer to expectations under $f(\theta )$ as posterior means, to medians as median-of-posterior values, and so on.

  • 18 

    There is not a full citation for this package, but there is van der Walt et al. (2011).

  • 19 

    There are many claims that this algorithm is incorrectly named, with claims that it should be credited to Enrico Fermi or Stan Ulam. We do not have any opinions on this matter, but we encourage the reader to follow this up. The original paper is Metropolis et al. (1953), and there are a few sketchy historical notes in Geyer (2011).

  • 20 

    We request this in a trivial case below in Problem 2 and the following problems in this section. We make the same request in a less trivial context in our model-fitting screed (Hogg et al. 2010a, Problems 6 and 7), where we ask the reader to implement MCMC for a useful mixture model.

  • 21 

    Technically this is the Metropolis algorithm rather than the Metropolis–Hastings algorithm. The difference is that in the Metropolis–Hastings algorithm, we permit the proposal distribution to disobey detailed balance but correct the accept–reject step to recover detailed balance.

  • 22 

    There is a very mathematical discussion in Geyer (2011) that we do not fully understand. There is a more heuristic answer on Wikipedia that we are happier with!

  • 23 

    This detailed-balance condition in Equation (12) requires implicitly that the function $q(\theta ^{\prime} | \theta )$ is also properly normalized, that is, that it integrates (over $\theta ^{\prime} $) to unity for any setting of $\theta $. That is an extremely technical point, but it stands as a reminder that you do not want to mess with detailed balance casually!

  • 24 

    Actually, our own favorite method, emcee, has a proposal pdf that does violate detailed balance in this way and has a compensated acceptance probability. Read Foreman-Mackey et al. (2013) for more details.

  • 25 

    We also make this point in a previous piece (Hogg et al. 2010a); the ambitious reader will find in that piece a solid example project for using MCMC in a real context.

  • 26 

    Most code will be infinity-safe if written in the form in this paragraph. However, there can be issues if both $f(\theta ^{\prime} )$ and $f({\theta }_{k})$ are negative infinity; in (present-day) Python this will return an NaN rather than an infinity or zero. That's a problem and is a case that should get caught (probably way before the accept–reject step!).

  • 27 

    If you are working in a language that does not have an Inf or does not evaluate comparisons correctly when the Inf appears, you might have to write some case code and have your ln_f function return a value and some kind of flag that indicates "zero probability" or $f(\theta )=0$.

  • 28 

    Well, technically, MCMC can be used to sample a likelihood function, but the samples would be samples of possible data, not possible parameters. The purist might say that even this is not sampling a likelihood function, because you should only call it a "likelihood function" in contexts in which you are treating the data as fixed (at the true data values) and the parameters as variable. In this context, the likelihood function is not a pdf for anything, so it cannot be sampled.

  • 29 

    This point—that you should be using proper priors—is just basic Bayesian good practice, often violated but never for good reasons.

  • 30 

    There is very good coverage of most of the points in this section, and more, in a set of lecture notes by Patrick Lam, available at http://patricklam.org/teaching/convergence_print.pdf.

  • 31 

    Famously, there is "no free lunch" (Wolpert & Macready 1997): you cannot find the global optimum of a general optimization problem without exhaustive search of all possibilities. This is very closely related—somehow—to the difficulty of sampling.

  • 32 

    For example, you can (almost) never say that your chains are definitely converged, or that you have the posterior pdf correct to some given level of accuracy. The reason is related to the above-mentioned "no free lunch" theorem of discrete optimization (see footnote 31): in most problems there is no way you could have searched your parameter space finely enough to know this. There are some exceptions of course. In one kind of exception, your problem is convex or Gaussian, and you can analytically show that there is exactly one mode in the density and where it is. Of course, in these cases you rarely need MCMC to solve your problems! In another kind of exception, you can say something about the finite width or shape of the modes of the likelihood function (and prior pdf). For example, sometimes the Cramér–Rao bound tells you that modes of the density must be smoother than some finite amount. Then an exhaustive search of parameter space followed by MCMC can in principle have provable properties. But, as we say, these situations are rare.

  • 33 

    We sometimes see MCMC methods compared according to burn-in times, which, for one thing, depends strongly on initialization and, for another, depends strongly on tuning and dynamics. Similarly, we often see comparisons in terms of acceptance ratio. In principle, the only question is how precise are one's inferences given a certain amount of computation. This is set by the autocorrelation time and (pretty much) nothing else, for reasonably converged chains. In the real world, of course, one must add to the CPU time the investigator time spent tuning the method (and thinking about initialization), but there is never (to our knowledge) a condition in which burn-in time is the dominant consideration in choosing an MCMC method.

  • 34 

    See all of cosmology!

  • 35 

    A canonical reference is a set of lecture notes by Alan Sokal (Sokal 1997). Another good reference is a blog post by one of us (D.F.M.) that can be found at http://dfm.io/posts/autocorr.

  • 36 

    Gelman & Rubin (1992).

  • 37 

    The calculation of the autocorrelation function can be seen as a convolution, and it can, therefore, be computed using the fast Fourier transform in ${ \mathcal O }(N\,\mathrm{log}N)$ operations instead of ${ \mathcal O }({N}^{2})$ for a naive implementation.

  • 38 

    The recipe given on page 16 of Sokal's notes (Sokal 1997) might be helpful. Note that the definition of τ that we adopt is twice the value used by Sokal.

  • 39 

    The argument and assumptions underlying the 0.234 fraction (and higher acceptance fractions at lower numbers of dimensions) are laid out in Gelman et al. (1996b).

  • 40 

    This has a lot to do with Gibbs sampling, which is discussed in Section 10.

  • 41 

    See Pasarica & Gelman (2010).

  • 42 

    Available at http://dfm.io/emcee.

  • 43 

    When the dimensionality of the parameter space gets very large, almost all samples will be very far from the optimum of the posterior pdf. One way to think about this is to think of a random Gaussian draw in D dimensions: a typical draw will be $\sqrt{D}$ standard deviations Euclidean distance from the maximum of the Gaussian. That's a long way when D = 100 or 1000! Another way to think about it is that there are a lot of ways to move away from the point you care about, but very few ways to move close to it. The upshot is that even with an enormous sampling, if you are in large dimensions, not a single sample will be very near the optimum.

  • 44 

    Geyer (2011) takes this view.

  • 45 

    See Foreman-Mackey et al. (2013).

  • 46 

    The idea is that you can make principled decisions by optimizing the expected utility under the posterior, given the data. This is a great idea! Of course, we also have very deep, fundamental reasons that you cannot know your utility precisely. The big issue is that your only sensible utility involves an integral out to the "long term," and the long term (by definition) includes outcomes that are outside your present-day quantitative model.

  • 47 

    For the median-of-sampling value, $g(\theta )$ needs to return a scalar, but technically, for the mean of sampling, $g(\theta )$ can be a vector or something high-dimensional.

  • 48 

    One amusing thing about samplings, which are so beloved of us probabilistic (Bayesian) reasoners, is that anything we do with a sampling, like estimate an integral or a quantile, is just an old-school frequentist estimator—a frequentist estimator of a Bayesian quantity, to be sure, but a frequentist estimator nonetheless.

  • 49 

We do not like this terminology, because there is nothing about the words "confidence" and "credible" that properly distinguishes between these very different things. Confidence regions say what regions of parameter space make data like the observed data; credible regions say what regions of parameter space are probable given the data. The latter require priors for their construction and are the intervals produced by MCMC.

  • 50 

    The sampler samples the pdf fairly. The tallest peak in the posterior is not guaranteed to be—and in general is not—also the peak with the greatest posterior "mass." That is, the samples in a fair sampling will not necessarily come predominantly from the tallest peak in the posterior pdf. They will come from the peak with the greatest total integral of the likelihood over the prior. These issues might sound "academic," but when the number of parameters gets large (greater than, say, 10), many normal intuitions one might have (if any) about what is reasonably likely in a pdf are regularly violated. See also footnote 43.

  • 51 

    If you have a K-point sampling in a D-dimensional parameter space, there is some best sample ${\theta }_{k}^{(\mathrm{best})}$. Now imagine that you want to find a point ${\theta }^{(\mathrm{better})}$ that is 10 times closer to the true optimum. In the limit that K is large you have to take (on average) a factor of 10D more samples! Any reasonable optimizer is far, far better than this (in terms of compute time).

  • 52 

    We choose not to count the number of papers out there that claim to be "Bayesian" but then deliver the optimum of a posterior PDF. That sacrifices not only probabilism but also all the useful things you get from delivering a posterior, like protection from overfitting and nonapproximate uncertainty propagation. You certainly do not need to be a Bayesian—you can do most of the same science as a frequentist—but if you are using MCMC, you probably are doing so because you want the benefits of Bayesianism.

  • 53 

    And we do it ourselves. For examples, you can look at Lang & Hogg (2012) and Foreman-Mackey et al. (2014).

  • 54 

    This point is made well by MacKay (2003).

  • 55 

    We have been pushing this approach in our own work; our hierarchical inferences (Hogg et al. 2010b; Foreman-Mackey et al. 2014) make use of importance samplings, starting at posterior samplings from interim prior pdfs. We have used this technology for computational convenience, but it is a framework for building subsequent analyses on the posterior samplings provided by other investigators. One key—if you want your posterior samplings to be useful—is that you must release (along with your posterior sampling) the value of the prior pdf at the locations of the samples. That is, you must "decorate" the samples (augment them) with prior pdf values or calls, or else release an executable version of your prior pdf, into which all your samples can be put as inputs. In principle it is enough to release a description of your prior pdf, but in practice it is rarely correct or specific enough for another investigator to duplicate exactly. So just augment your samples!

  • 56 

    Our favorite package these days is our own corner.py. Find it on the internet.

  • 57 

    If you are not unit testing, you are probably making some very big mistakes. It makes sense to put in unit tests for every part of your code, especially the parts of your code that are relied on for correctness in multiple locations, like your MCMC sampler.

  • 58 

    It is easy to think you have flat priors when in practice your sampler has flat priors in the logarithms of the parameters or inverses of the parameters. Also, it is easy to have improper priors when you think they are proper, because limits are not working, or you were wrong about the convergence of some integral. Sometimes the prior pdf in practice has different edges than you think because there is some censoring happening that you have not considered. And then, of course, it is possible for your sampler itself to be buggy.

  • 59 

    Such as emcee (Foreman-Mackey et al. 2013); one nice thing about the ensemble sampler is that it gives you great diagnostic information automatically.

  • 60 

    Although maybe this is not fundamental; see kombine at https://github.com/bfarr/kombine.

  • 61 

    See Hogg (2012) for advice on how to check and test these transformations using dimensional analysis.

  • 62 

    Exercise to the reader: do the same but with $x\equiv \sqrt{A}\,\cos (\phi )$ and $y\,\equiv \sqrt{A}\,\sin (\phi )$.

  • 63 

    Note how much we like to cite emcee (Foreman-Mackey et al. 2013)!

  • 64 

    Once again, Foreman-Mackey et al. (2013).

  • 65 

    The real quotation is a parenthetical sentence: "Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful" (Box & Draper 1987, p. 74). It is amazing how rarely this is ever quoted correctly; or maybe there is another source for this quotation?

  • 66 

    Physicists sometimes forget this and believe that various physical theories, such as quantum electrodynamics or the cold dark matter cosmological model, are strictly True with a capital T. First of all, there is no way to know that for sure, and second of all, even if they are true, they do not precisely explain any observation, which is also affected by various kinds of auxiliary effects and noises for which there is (and can be, for deep reasons, to be discussed in other documents in this series) no extremely precise model. So in detail, any model of any specific observation must be wrong, even in the (exceedingly unlikely) event that the fundamental model is correct.

  • 67 

    Sometimes Bayesians like to complain that the chi-squared statistic is a measure of the size of your data! That's true when your model is wrong (that is, always); see footnote 66.

  • 68 

Cross-validation is an incredibly valuable set of techniques that all astronomers should know. It is not clear what to cite here, but one possibility is Geisser (1993).

  • 69 

    The posterior predictive check is one of the Bayesian equivalents of a goodness-of-fit test (Gelman et al. 1996a).

  • 70 

    And be sure not to forget that there are (at least) as many assumptions encoded in your likelihood function as there are in your prior pdfs, probably far more. No data analysis is protected from subjectivity.

  • 71 

    There is a growing idea that Bayesian inference—and inference in general—is about much more than inferring parameter values: it is about understanding and elaborating models (e.g., Gabry et al. 2017).

  • 72 

You can get essentially infinite autocorrelation times if the function has isolated regions of finite density separated by large seas of zero or near-zero density. One disturbing thing about these situations is that any empirical measure of autocorrelation time will give a finite answer, but the density will not be properly sampled, ever. This is related to the fact that any simple empirical measure of the autocorrelation time will be an underestimate. In principle, you can find these badly multimodal cases with multiple restarts of the chain and (things like) the Gelman–Rubin diagnostic (Gelman & Rubin 1992).

  • 73 

    Your best interface to these literatures is often a colleague in another department, like applied mathematics, statistics, or computer science. Break down those walls!

  • 74 

    One checkpoint in this literature is the compilation volume by Brooks et al. (2011), but much has been developed and changed even since that was written. For less in-depth but more current—and more astrophysically relevant—coverage, see also the review by Sharma (2017).

  • 75 

    Foreman-Mackey et al. (2013).

  • 76 

    A few ensemble methods that are being used in astronomy: Differential Evolution MCMC (Ter Braak 2006; Nelson et al. 2014) and kombine (B. Farr & W. M. Farr 2018, in preparation).

  • 77 

    For example, we find empirically that it is better to initialize the ensemble in a ball that is smaller than the full posterior pdf width, not larger. For another example, the ensemble produces a set of chains that can be used to compute the Gelman–Rubin diagnostic (Gelman & Rubin 1992), but this is not a conservative test if the initialization was made in a small ball.

  • 78 

    See the excellent rant "Ensemble Methods Are Doomed to Fail in High Dimensions" at http://andrewgelman.com/2017/03/15/ensemble-methods-doomed-fail-high-dimensions/.

  • 79 

    For example, Stan (http://mc-stan.org).

  • 80 

    The point is that sometimes there are some parameters that only affect a very small number of data points, or parameters that change the model or the likelihood function and prior values in a very simple-to-compute way. If such simple parameters exist, it makes sense to write your likelihood and prior code to capitalize on these simplicities. This is not always trivial, because it may involve caching parts of the calculation and so on. Details would be beyond our scope.

  • 81 

    For example, JAGS (http://mcmc-jags.sourceforge.net/) automatically parallelizes Gibbs problems.

  • 82 

    For a review of the theory behind these methods, see Neal (2011).

  • 83 

    The idea is that for (almost) any function you can write, a robot can write the function that is the derivative of that function, using simple differentiation rules and the magic of the chain rule. Differentiation is a typographic operation on computer functions! The idea is not to take numerical (small difference) derivatives of the code; there is almost no way that finite-difference differentiation could help your sampling in the long run.

  • 84 

    For example, we use importance sampling to convert individual-planet exoplanet inferences (all performed with dumb priors) into a full-population consistent hierarchical model in Hogg et al. (2010b). See also footnote 55.

  • 85 

    See, e.g., Price-Whelan et al. (2017).

  • 86 

    For example, see Skilling (2006) and Brewer et al. (2011).

  • 87 

    Our attempt to sample in this very problem is Brewer et al. (2013).
