The Basics of Markov Chain Monte Carlo

Sometimes models are difficult to evaluate. In this two-part series we’ll take a look at how Markov Chain Monte Carlo (MCMC) can sometimes make things easier.

This article describes the key ideas behind MCMC and the next will focus on the practicalities of implementing it.

Introducing Markov Chain Monte Carlo

As with Monte Carlo simulation, MCMC is a simulation method that can be used to calculate difficult integrals.

While MCMC is more complicated than standard Monte Carlo, it is much more widely applicable. Therefore, understanding MCMC enables practitioners to simulate from a much larger set of models than would otherwise be the case.

Monte Carlo simulation works by generating many independent simulations, where the probability of an outcome occurring in a simulation is equal to its modelled probability. Statistics based on this large sample can then be used to estimate the statistics of the model.

For example, my model could relate to the outcome of rolling a die. In this case I could simulate many independent rolls of the die and use the average, median or any other statistic of these rolls to estimate the exact statistic of one roll.

The key to MCMC is the realisation that it doesn’t matter if the simulations are independent, so long as the long-run proportion of simulations with a given outcome is equal to the modelled probability of that outcome.

The charts below help illustrate what this means.


The charts summarise the result of simulating values from a Uniform distribution using Monte Carlo simulation (left) and MCMC simulation (middle). In addition, the right-hand chart displays the empirical CDF associated with the Monte Carlo sequence (red) and the MCMC sequence (blue).

It’s clear that the two methods produce sequences of simulations that look different. In the ‘Monte Carlo’ plot the dots are spread evenly. On the other hand, the dots in the ‘MCMC’ plot seem to follow a path.

Despite this, the right-hand plot shows that both methods generate sequences of simulations that are consistent with the model. The CDFs are very similar and approximately fall on a 45⁰ line, as you would expect of a Uniform distribution.

This example illustrates the basic premise of MCMC but it isn’t that exciting, especially because Monte Carlo simulation can be used instead. However, as you will see in the next article, sometimes MCMC provides a neat solution to problems that would otherwise seem difficult.

So, how do you construct a sequence of correlated simulations in which outcomes occur with the desired frequencies?

To do this, MCMC methods make use of a type of random walk called a “Markov Chain”.

Markov Chain Basics

You can visualise a random walk as a particle in space that randomly changes position over time.

A Markov Chain is a special type of random walk where the focus is on the position of the particle at the end of consecutive time intervals. So in the middle plot above, the dots represent the position of the particle (y-axis) at the end of consecutive time intervals (x-axis).

The key differentiator between Markov Chains is their tendencies to move to some positions more frequently than others. These tendencies are described by their “one-step transition probabilities”.

As the name suggests, the one-step transition probabilities represent the probability of moving to a position in one step, given the particle’s current position. For example, if the particle is at position “A” and the one-step transition probability from “A to B” is 20%, then 20% of the time the particle will subsequently move to “B”.

The one-step probabilities also happen to determine the Chain’s tendencies over a greater number of steps. This is because, given the latest position, the next step in a Markov Chain is taken without regard to the prior steps.

Continuing from the previous example, if we know that the transition probability from “B to C” is 30% say, then 6% of the time (= 20% * 30%) after starting at “A” the particle will move to “B” and then “C”.

In general, multiplying the transition probabilities associated with a sequence of steps will produce the probability of taking all of those steps.

While the probability associated with any one path is quite interesting, often we want to know the probability of ending in a position regardless of how we got there. Depending on the number of steps considered, these are referred to as the “n-step transition probabilities”.

As we don’t care how the chain gets to the ending position, the n-step transition probability is simply the sum of the probability associated with each path that starts and ends at the given positions.

For example, if we wanted to know the two-step transition probability from “A” to “B”, we would calculate the probability associated with each two-step path that ends at “B” (e.g.   A\rightarrow A\rightarrow B ,   A\rightarrow B\rightarrow B ,   A\rightarrow C\rightarrow B , …, etc) and then sum them.

Fortunately, rather than having to consider an impractically large sum, the n-step transition probabilities can be calculated recursively:

    \[  P^{n}(A\rightarrow B) = \sum\limits_{C \in All Positions}{P^{n-1}(A\rightarrow C)\cdot P^{1}(C\rightarrow B)} \]

In words, the n-step transition probability from “A” to “B”,   P^{n}(A\rightarrow B), is the sum over all positions of the probability of moving there from “A” in one less step,   P^{n-1}(A\rightarrow C), and then moving to “B” in the final step,   P^{1}(C\rightarrow B).

Now that we know this much, we can get back to the goal of MCMC. In particular, we want a Chain in which the long-run proportion of visits to each position is equal to the probability that our model assigns to them. Given this, let’s first consider how we might calculate the long-run proportions.

Long-Run Behaviour of Markov Chains

Because random walks are random, it’s not possible to determine the exact proportion of the time the chain will spend at each position. Nevertheless, it is possible to calculate the expected proportion of the time.

For instance, if the random walk is only one step long, the particle will start at a position, “A” say, and then visit a position at random. In this case, the expected number of visits to a position, “B” say, is given by the one-step transition probability,   P^{1}(A\rightarrow B). Also, as only one position is visited, this is also the expected proportion of the time that the particle spends there.

When the walk is two steps long, the particle will visit two positions at random. In this case, the one-step transition probability,   P^{1}(A\rightarrow B), gives the expected number of visits from the first step and the two-step transition probability,   P^{2}(A\rightarrow B), gives the expected number from the second step. As two positions are visited, the expected proportion of the time that the particle spends in a position is the sum of the one and two-step probabilities divided by two.

In general, the expected proportion of the time spent at a position, “B”, after starting at another position, “A”, and after “n” steps is:

    \[  \sum\limits_{s=1}^{n}{\frac{P^{s}(A\rightarrow B)}{n}}  \]

Although this only produces the expected proportions, the difference between the actual and expected proportions becomes small as the number of steps (“n”) increases. Thus, after a large number of steps we can be sure that they are approximately equal.

With these tools we can start to analyse the long-run behaviour of a particular Markov Chain. However, as the calculations involved are relatively onerous, using them for MCMC is not particularly practical. Instead, we will focus on those Markov Chains where the long-run behaviour is known up-front.

Limiting and Invariant Distributions

There is one class of Markov Chains for which the long-run proportions are easy to calculate. This is when the n-step transition probabilities,   P^{n}(A\rightarrow B), converge to some distribution,   \pi(B), as the number of steps increases. Let’s refer to this distribution as the “limiting distribution” of the Markov Chain.

In this case, as the expected proportion of the time spent in each position is a simple average of the converging n-step transition probabilities, the expected proportions also converge to the limiting distribution.

The trick to MCMC, therefore, is picking transition probabilities that have the desired distribution as the limiting distribution of the Markov Chain. If we are able to do this, we can be sure that a long random walk following these transition probabilities will produce an appropriate sequence of simulations.

In fact, there is another definition of the limiting distribution that is more useful for finding appropriate transition probabilities. Not only is the limiting distribution the distribution that the n-step transition probabilities converge to but it is also the distribution,   \pi, that satisfies the following equation:

    \[  \pi(B) = \sum\limits_{A \in All Positions}{\pi(A)\cdot P^{1}(A\rightarrow B)} \]

Hopefully you can see that the right-hand side of the equation is another formula for the distribution of the particle after an additional step. However, here the starting position is random and follows some distribution,   \pi.

This equation says that the distribution “  \pi” is the one such that if the starting distribution is equal to it, after an additional step the ending position will also be equal to it. Because the distribution is unchanged by a step, the limiting distribution is also referred to as the “invariant distribution”.

Consequently, to apply MCMC to a model we simply need to find transition probabilities such that the modelled distribution is equal to the invariant distribution. We can then be sure that a Markov Chain generated using these probabilities will visit each position with the desired long-run frequencies.

This is all there is to MCMC and it is the basis for the techniques discussed in the next article. However, before moving on, let’s consider why the limiting and invariant distributions are equal.

Convergence to the Invariant Distribution

To see that the limiting and invariant distributions are equal it’s useful to consider how the transition probabilities operate on the starting distribution when the n-step transition probabilities are calculated.

Let’s consider the one-step probabilities first. In this case the starting position is given and so we might say that the starting distribution assigns 100% of the probability mass to the starting position.

One step on, the particle will randomly occupy one of the other positions with probabilities equal to the one-step transition probabilities. Therefore, a given position (“B”) will receive a percentage of the original mass from the starting position (“A”), equal to the associated transition probability (  P^{1}(A\rightarrow B)} ).

When calculating the two-step probabilities an identical process occurs but this time the starting mass will not be in one position but will be spread among the positions according to the one-step probabilities. Thus, the process will result in each position retaining some of its mass, losing some of its mass to other positions, and receiving mass from other positions.

The animation below illustrates this process for a Markov Chain that has two positions, ‘A’ and ‘B’, and for which the 1-step transition probability from A to B is a half and from B to A is a third.

Here the Markov Chain starts at position “A” and so 100% of the starting probability mass is at position “A”. In each step, however, half of the mass from position “A” moves to position “B” and a third of the mass at “B” moves to “A”. The video shows how the distribution of mass changes over eight steps.

This operation typically results in some positions gaining and others losing mass. However, when the starting mass is spread between the positions according to the invariant distribution, we know that the starting and ending distributions are equal and so the movements of mass exactly offset one another.

Given this, it’s useful to view a particular starting distribution as assigning mass equal to the invariant distribution to each position and then an additional excess or deficit of mass to each position. The animation below hints at why this is the case.

The left-hand chart shows the transfer of mass between the two positions using the same one-step transition probabilities as before, the middle chart shows the invariant distribution for these transition probabilities, and the right-hand chart shows the transfer of the excess and deficit of mass.

As you can see, after each step the distribution in the left-hand plot is simply the sum of the distributions in the middle and right-hand plots.

In particular, the invariant distribution assigns 40% of the mass to position “A” and 60% to position “B”. Therefore, when 100% of the starting mass is at “A” the right-hand plot shows an excess of 60% for position “A” and an equal deficit in position “B”. In addition after a step, when the mass is split equally between the positions, the right-hand plot shows an excess of 10% for position “A” and an equal deficit for position “B”.

This representation shows us that the left-hand distribution will converge to the invariant distribution when the excess and deficit in the right-hand plot tends to zero. However, because excesses and deficits cancel, as long as mass is transferred between the two positions the excess and deficit will reduce.

Here I have focused on a simple case involving two positions, however the intuition gained from this example is more generally applicable. In particular, as long as excesses and deficits do not avoid one another, the distribution of the position of the particle will converge to the invariant distribution and so the limiting and invariant distributions will be equal.

Final Word

In this article we introduced Markov Chains and discussed how to understand their long-run behaviour based on their transition probabilities.

Although it was a relatively long article I had to gloss over a few areas. Therefore, if you are of a technical mind-set, you may find it useful to skim the Wikipedia article on Markov Chains.

In the next article we will move on to more practical matters. In particular, we will consider how to apply what was discussed here.