Gibbs Sampling by Example

In order to grasp a new technique it’s often useful to see how it is used in real-world examples. Here we will examine a particular implementation of a Gibbs sampler which pertains to a model we discussed in a previous series of articles.

Re-Cap

As was discussed in the previous article, the Gibbs sampling algorithm is a Markov Chain Monte Carlo (MCMC) technique that can be used to simulate outcomes from a multivariate distribution.

As with all MCMC techniques, simulations are produced by generating a realisation of a random walk and recording the position of the walk at the end of consecutive time intervals. MCMC techniques are designed such that, no matter where it starts, the positions produced by a long enough walk will resemble a sample from the target distribution.

The Gibbs sampler produces a random walk by updating each coordinate associated with the multivariate distribution sequentially. Once each coordinate has been updated the new values are recorded and another iteration occurs.

The key to the Gibbs sampler is how each coordinate is updated. In particular, updating a given variable involves simulating an outcome from its conditional distribution, given the latest values of all the other variables.

The animation below illustrates how this process unfolds in the case of a bivariate Normal distribution.


Here you can see that the random walk moves around the space by first taking a horizontal step and then a vertical step.

In the animation I’ve also plotted the positions of the random walk on top of some bivariate Normal contours. One can see that after a large number of steps the distribution of the dots resembles the target distribution.

In this article we will show how Gibbs sampling can be applied to another more specialised problem.

The Model

In the Incomplete Data series I introduced a model that could be used to describe the timing of individual claim payments made by an insurer.

From the data we made two observations regarding the timing of payments: payments occurred sporadically, and eventually they ceased.

These phenomena were assumed to be caused by the state of each claim. In particular, we assumed that the claim could be in one of three states: an open state in which no payment occurs; a closed state, also with no payment; and an open state in which a payment does occur. In addition, it was assumed that if a claim closed it remained closed.

The probability of a claim moving from an open state to the closed state (  p ) and the probability of an open claim giving rise to a payment (  q ) were both assumed to be constant and applicable to all claims.

To understand more about the claim payment process, therefore, it’s useful to investigate what can be learned about the unknown factors in our model based on the observed data. In this case, the unknowns are the transition probability parameters (  p and   q ) and if and when any of the claims closed.

As was discussed in the Incomplete Data series, this problem can be tackled by looking at the posterior distribution of the unknowns. Specifically, assuming a uniform prior on   p and   q , the posterior density takes the following form:

    \begin{equation*} \begin{split} P(p,q,\mathbf{c},& \mathbf{z^{open}}) \propto \\ & \prod\limits_{i=1}^{m}{(1-p)^{z^{open}_{i} + n_{i} + d_{i}} \cdot p^{c_{i}} \cdot q^{d_{i}} \cdot (1-q)^{z^{open}_{i}+n_{i}} } \end{split} \end{equation*}

Here, the event that a given claim is closed is represented by   c_{i} ,   z^{open}_{i} represents the number of trailing periods with no payments that each claim remained open for, and   p and   q are the parameters we discussed previously.

The observed data is represented by the number of claims (  m), the number of periods in which payments occurred (  d_{i} ), the number of periods prior to the last payment in which no payments occurred (  n_{i} ), and the number of observed periods after the last payment (  z_{i} ).

In this article we will use a Gibbs sampler to simulate from this distribution. However, as discussed in the previous article, to do this we need to know what each of the unknowns’ conditional distributions are.

The Conditional Distributions

The purpose of this article isn’t to dwell on how we arrive at the conditional distribution of each unknown. After all, in your problem the posterior and conditional distributions will undoubtedly be different. However, as they are integral to performing Gibbs sampling it’s worth getting a feel for how you might derive them.

In this case you can get most of the way to deriving the conditional distributions by “inspection”. This means considering the posterior density as a function of each unknown and seeing whether it looks like any of the standard distributions you already know (see Wikipedia for a list of standard distributions).

For example, as a function of (  p) the posterior looks like:

    \begin{equation*} P(p|q,\mathbf{c},\mathbf{z^{open}}) \propto (1-p)^{\sum\limits_{i=1}^{m}{(z^{open}_{i}+n_{i}+d_{i})}} \cdot p^{\sum\limits_{i=1}^{m}{c_{i}}} \end{equation*}

This density is of the same form as a Beta distribution.
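In fact, matching the exponents to the Beta kernel   p^{\alpha - 1}(1-p)^{\beta - 1} suggests the specific parameters, with the uniform priors contributing the “+1” terms; by the same argument, the conditional of   q also looks like a Beta:

    \[  p\,|\,q,\mathbf{c},\mathbf{z^{open}} \sim \text{Beta}\Big(\sum\limits_{i=1}^{m}{c_{i}} + 1,\; \sum\limits_{i=1}^{m}{(z^{open}_{i}+n_{i}+d_{i})} + 1\Big) \]

    \[  q\,|\,p,\mathbf{c},\mathbf{z^{open}} \sim \text{Beta}\Big(\sum\limits_{i=1}^{m}{d_{i}} + 1,\; \sum\limits_{i=1}^{m}{(z^{open}_{i}+n_{i})} + 1\Big) \]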

By a similar albeit more long-winded approach it’s possible to see that the conditional distributions of each of the unknowns take the form of the following standard distributions:

[Table: the conditional distribution of each unknown]

As you can see, the conditional distributions each take a standard form. The exception is   z^{open}_{i} , which is either a constant or follows a truncated Geometric distribution; nevertheless, it is still easy to simulate from.

With these conditional distributions, we have all that we need to implement a Gibbs sampler.

The Gibbs Sampler

At this point it may be obvious to you how we would set up a Gibbs sampler using the conditional distributions above. However, I expect others might feel uncomfortable translating what we’ve discussed into an actual implementation.

With this in mind, below is the Gibbs sampler relevant to this article written in “pseudo-code”:

[Pseudo-code: one iteration of the Gibbs sampler]

Hence, in the programming language of your choice, one iteration of the Gibbs sampler involves: simulating, for each claim, whether it is closed or not (step 7); simulating how many periods each claim remained open for (step 9); and simulating new values for the parameters   p and   q (steps 12 and 13).
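To make the parameter updates concrete, here is a minimal Python sketch of steps 12 and 13 only, assuming the latest claim-level values are held in NumPy arrays (the function and array names, and the example values, are my own; steps 7 and 9 are omitted because their exact conditionals are those given in the table above):

    import numpy as np

    rng = np.random.default_rng(seed=1)

    def draw_p_and_q(c, z_open, n, d):
        """Steps 12 and 13: draw p and q from their Beta conditionals.

        c and z_open hold the latest simulated closure indicators and
        trailing open periods; n and d hold the observed counts n_i and d_i.
        The "+ 1" terms come from the uniform priors on p and q.
        """
        p = rng.beta(np.sum(c) + 1, np.sum(z_open + n + d) + 1)
        q = rng.beta(np.sum(d) + 1, np.sum(z_open + n) + 1)
        return p, q

    # Hypothetical example with three claims:
    c = np.array([1, 0, 1])        # closure indicators from step 7
    z_open = np.array([0, 4, 2])   # trailing open periods from step 9
    n = np.array([3, 1, 2])        # no-payment periods before the last payment
    d = np.array([2, 5, 1])        # periods in which payments occurred
    print(draw_p_and_q(c, z_open, n, d))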

Final Word

In this article we discussed a specialised example of a Gibbs sampling algorithm. In particular, we examined how to implement a Gibbs sampler for a model that could be used to explain the timing of payments on insurance claims.

I hope that, having worked through a real-world example, applying Gibbs sampling to your own problems will seem less daunting.

If you have any questions on this article or how it pertains to your problem feel free to leave a comment below.

Markov Chain Monte Carlo: The Gibbs Sampler

This article is the second in a series in which we discuss Markov Chain Monte Carlo techniques.

In the previous article we focused on understanding what Markov Chain Monte Carlo is and why it works. Here we look at Gibbs sampling, a Markov Chain Monte Carlo technique that can be directly applied to many modelling problems.

Re-Cap

In the previous article we introduced Markov Chain Monte Carlo (MCMC) as a method for simulating outcomes from a model.

We saw that the key to this approach was in choosing a random walk that visited each outcome as frequently as the model suggested. By recording the position of the random walk at consecutive time intervals, we could generate a set of simulations that looked like a large sample from the model.

For example, a model might prescribe that two outcomes are possible, “A” and “B” say. In this case, if each outcome occurred with a probability of 50% then an appropriate random walk would transition between “A” and “B”, spending half of the time at each outcome.

Furthermore, by recording the positions of the walk we could get a large sample in which approximately half of the records would have outcome “A” and the other half outcome “B”.

Ultimately, we saw that a particular type of random walk was useful in this regard. In particular, we found that if the invariant distribution of a Markov Chain is equal to the modelled distribution, then (aside from some unusual exceptions) the chain will visit each position with the desired long-run frequency.

The objective of the previous article was to ensure that we were comfortable with this key result, as it is the reason why all MCMC methods work. However, alone this simply presents us with a new problem:

How do I find a Markov Chain with invariant distribution equal to my modelled distribution?

The purpose of this article is to describe a technique that can be used to answer this question.

Please note, understanding the previous article is important for understanding what is discussed here. As a result, you may find it helpful to re-read or skim the previous post.

Narrowing Down our Search

After the last article, identifying the Markov Chain appropriate to your model may have felt like finding the proverbial needle in the haystack.

There is a whole world of possible Markov Chains out there and naively picking transition probabilities until we find the ones with the right invariant distribution is likely to end in frustration.

To make this search easier, MCMC techniques usually consider only the subset of Markov Chains that have a particular property. This property is known as “detailed balance”.

Detailed balance narrows our focus to those transition probabilities that, when applied to the modelled distribution, result in no net transfer of mass between each pair of outcomes.

This property is only subtly different from the defining property of the invariant distribution. If you recall, the invariant distribution is the one for which the total net transfer of mass is zero. Detailed balance is similar but applies at a more granular level: it requires the net transfer to be zero between every pair of outcomes.

As the total net transfer of mass is simply the aggregate of the transfers between each pair of outcomes, it follows that if a distribution satisfies detailed balance it is also an invariant distribution.

Detailed balance is defined more formally by the following equation, which must hold for all outcomes “x” and “y”:

    \[  P(x)\cdot P(x\rightarrow y) = P(y)\cdot P(y\rightarrow x) \]

Here, the left-hand side represents the amount of mass that is transferred from outcome x to outcome y, while the right-hand side represents the amount transferred from y to x.
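For a quick numerical check, take a two-state example of my own choosing: suppose the modelled probabilities are P(A) = 0.4 and P(B) = 0.6, and the transition probabilities from “A to B” and from “B to A” are a half and a third respectively. Then both sides of the equation agree:

    \[  P(A)\cdot P(A\rightarrow B) = 0.4 \times \tfrac{1}{2} = 0.2 = 0.6 \times \tfrac{1}{3} = P(B)\cdot P(B\rightarrow A) \]

so exactly as much mass flows from “A” to “B” as flows back, and these transition probabilities leave the modelled distribution invariant.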

The Gibbs sampling algorithm is one MCMC technique that leverages detailed balance in order to produce a Markov Chain with the desired invariant distribution.

The Gibbs Sampling Algorithm

Often MCMC is used when we want to simulate from a multivariate distribution. In these cases, when the conditional distribution of each variable given the other variables is known, Gibbs sampling can be used.

The Gibbs sampling algorithm is best understood by considering how it produces the next step in a Markov Chain. The animation below shows how this is done when Gibbs sampling is used to simulate from a bivariate Normal distribution.


In the animation the leading black dot represents the latest position of the random walk. In this case the random walk starts at the coordinate (0,0). To produce the next position, Gibbs sampling involves picking a new value for the x-coordinate and, afterwards, a new value for the y-coordinate.

As was alluded to above, the x-coordinate is sampled from the conditional distribution of x given the latest value of y and vice versa for the y-coordinate.

After each coordinate has been updated, the position is recorded as the next step in the walk. In this case, the second position of the walk is at (-0.19,-0.41).

To give you some confidence that the simulations generated by Gibbs sampling are appropriate, I’ve plotted the simulations on top of some bivariate Normal contours. As you can see, the spread of the black dots resembles the shape of the contours. Furthermore, for the reasons discussed in the last article, the resemblance becomes closer as the number of simulations increases.
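To make this mechanism concrete, here is a minimal Python sketch of a Gibbs sampler for a bivariate Normal. I have assumed standard Normal margins and a correlation of 0.8 (the animation’s exact parameters aren’t stated); in that case the conditional distribution of each coordinate given the other is Normal with mean   \rho times the other coordinate and variance   1-\rho^{2} :

    import numpy as np

    rng = np.random.default_rng(seed=0)

    rho = 0.8                        # assumed correlation of the target distribution
    cond_sd = np.sqrt(1 - rho**2)    # standard deviation of each conditional
    n_steps = 10_000

    x, y = 0.0, 0.0                  # the walk starts at (0, 0), as in the animation
    samples = np.empty((n_steps, 2))

    for t in range(n_steps):
        # update the x-coordinate from X | Y = y ~ Normal(rho * y, 1 - rho^2)
        x = rng.normal(rho * y, cond_sd)
        # update the y-coordinate from Y | X = x ~ Normal(rho * x, 1 - rho^2)
        y = rng.normal(rho * x, cond_sd)
        # record the position once both coordinates have been updated
        samples[t] = (x, y)

    # for a long enough walk the sample correlation should be close to rho
    print(np.corrcoef(samples.T)[0, 1])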

This method of generating a Markov Chain may seem unfamiliar. In particular, you may be more familiar with having a set of transition probabilities that fully describe the probability of moving to each position given the current one.

In this case, however, we only know the transition probabilities corresponding to the sub-steps involved in updating each coordinate. Nevertheless, as the next step in the walk only depends on its current position, this process does generate a Markov Chain.

The key to the Gibbs sampler is that the updates to each coordinate satisfy detailed balance. To see this, let’s first consider the probability of moving from the coordinate (x, y) to (z, y), after an update to X:

    \[  P( (X = x, Y = y)\rightarrow (X = z, Y = y) ) = P(X = z|Y = y) \]

In words, the transition probability from (x, y) to (z, y), given an update to X, simply follows from the conditional distribution of X given Y. Furthermore, from the definition of conditional probability, this transition probability can be written as:

    \[  P( (x,y)\rightarrow (z,y) ) = \frac{P(z,y)}{P(y)}\]

Note that in the equation above and the ones below I write, for example, P(x, y) instead of P(X = x, Y = y). Both denote the same thing, the probability that X equals x and Y equals y; I’ve just shortened the notation to make it less clunky.

Inserting the modelled joint distribution and our final representation of the transition probabilities into both sides of the detailed balance equation we get:

    \[  \text{LHS} = P(x,y)\cdot P((x,y)\rightarrow (z,y)) = P(x,y)\cdot \frac{P(z,y)}{P(y)}\]

    \[  \text{RHS} = P(z,y)\cdot P((z,y)\rightarrow (x,y)) = P(z,y)\cdot \frac{P(x,y)}{P(y)}\]

Clearly both sides are equal, no matter what the value of x, y, or z.

While this shows that an update to the x-coordinate satisfies detailed balance, the same logic applies to the y-coordinate and, indeed, to any number of variables.

Furthermore, as each coordinate update leaves the modelled distribution unchanged, so does the full Gibbs step. In particular, a Gibbs step is the result of updating every coordinate in turn, and because no net mass is transferred during any individual update, none is transferred once all of the coordinates have been updated. The modelled distribution is therefore the invariant distribution of the resulting chain.

Thankfully, that’s all there is to Gibbs sampling.

Final Word

In this article we discussed the Gibbs sampling algorithm, a Markov Chain Monte Carlo technique.

We saw that Gibbs sampling was particularly useful when we wanted to simulate from a multivariate distribution. In particular, the algorithm produces a new simulation by updating each modelled variable in turn and recording the outcome after all of the variables are updated.

The reason why Gibbs sampling works is that the updating process defines a Markov Chain for which the modelled distribution is the invariant distribution.

In subsequent articles we will cover other MCMC techniques but if you have any questions on this article please leave a comment below.

The Basics of Markov Chain Monte Carlo

Sometimes models are difficult to evaluate. In this two-part series we’ll take a look at how Markov Chain Monte Carlo (MCMC) can sometimes make things easier.

This article describes the key ideas behind MCMC and the next will focus on the practicalities of implementing it.

Introducing Markov Chain Monte Carlo

As with Monte Carlo simulation, MCMC is a simulation method that can be used to calculate difficult integrals.

While MCMC is more complicated than standard Monte Carlo, it is much more widely applicable. Therefore, understanding MCMC enables practitioners to simulate from a much larger set of models than would otherwise be the case.

Monte Carlo simulation works by generating many independent simulations, where the probability of an outcome occurring in a simulation is equal to its modelled probability. Statistics based on this large sample can then be used to estimate the statistics of the model.

For example, my model could relate to the outcome of rolling a die. In this case I could simulate many independent rolls of the die and use the average, median or any other statistic of these rolls to estimate the corresponding statistic of a single roll.
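For instance, a minimal Monte Carlo version of the die example might look like the following (the exact mean of a single roll is 3.5):

    import numpy as np

    rng = np.random.default_rng(seed=42)

    # simulate many independent rolls of a fair six-sided die
    rolls = rng.integers(1, 7, size=100_000)

    # the sample mean estimates the exact mean of a single roll (3.5)
    print(rolls.mean())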

The key to MCMC is the realisation that it doesn’t matter if the simulations are independent, so long as the long-run proportion of simulations with a given outcome is equal to the modelled probability of that outcome.

The charts below help illustrate what this means.

[Charts: Monte Carlo simulations (left), MCMC simulations (middle) and their empirical CDFs (right)]

The charts summarise the result of simulating values from a Uniform distribution using Monte Carlo simulation (left) and MCMC simulation (middle). In addition, the right-hand chart displays the empirical CDF associated with the Monte Carlo sequence (red) and the MCMC sequence (blue).

It’s clear that the two methods produce sequences of simulations that look different. In the ‘Monte Carlo’ plot the dots are spread evenly. On the other hand, the dots in the ‘MCMC’ plot seem to follow a path.

Despite this, the right-hand plot shows that both methods generate sequences of simulations that are consistent with the model. The CDFs are very similar and approximately fall on a 45° line, as you would expect of a Uniform distribution.

This example illustrates the basic premise of MCMC but it isn’t that exciting, especially because Monte Carlo simulation can be used instead. However, as you will see in the next article, sometimes MCMC provides a neat solution to problems that would otherwise seem difficult.

So, how do you construct a sequence of correlated simulations in which outcomes occur with the desired frequencies?

To do this, MCMC methods make use of a type of random walk called a “Markov Chain”.

Markov Chain Basics

You can visualise a random walk as a particle in space that randomly changes position over time.

A Markov Chain is a special type of random walk where the focus is on the position of the particle at the end of consecutive time intervals. So in the middle plot above, the dots represent the position of the particle (y-axis) at the end of consecutive time intervals (x-axis).

The key differentiator between Markov Chains is their tendencies to move to some positions more frequently than others. These tendencies are described by their “one-step transition probabilities”.

As the name suggests, the one-step transition probabilities represent the probability of moving to a position in one step, given the particle’s current position. For example, if the particle is at position “A” and the one-step transition probability from “A to B” is 20%, then 20% of the time the particle will subsequently move to “B”.

The one-step probabilities also happen to determine the Chain’s tendencies over a greater number of steps. This is because, given the latest position, the next step in a Markov Chain is taken without regard to the prior steps.

Continuing from the previous example, if we know that the transition probability from “B to C” is 30% say, then 6% of the time (= 20% * 30%) after starting at “A” the particle will move to “B” and then “C”.

In general, multiplying the transition probabilities associated with a sequence of steps will produce the probability of taking all of those steps.

While the probability associated with any one path is quite interesting, often we want to know the probability of ending in a position regardless of how we got there. Depending on the number of steps considered, these are referred to as the “n-step transition probabilities”.

As we don’t care how the chain gets to the ending position, the n-step transition probability is simply the sum of the probabilities associated with every path that starts and ends at the given positions.

For example, if we wanted to know the two-step transition probability from “A” to “B”, we would calculate the probability associated with each two-step path that ends at “B” (e.g.   A\rightarrow A\rightarrow B ,   A\rightarrow B\rightarrow B ,   A\rightarrow C\rightarrow B , …, etc) and then sum them.

Fortunately, rather than having to consider an impractically large sum, the n-step transition probabilities can be calculated recursively:

    \[  P^{n}(A\rightarrow B) = \sum\limits_{C \,\in\, \text{All Positions}}{P^{n-1}(A\rightarrow C)\cdot P^{1}(C\rightarrow B)} \]

In words, the n-step transition probability from “A” to “B”,   P^{n}(A\rightarrow B), is the sum over all positions of the probability of moving there from “A” in one less step,   P^{n-1}(A\rightarrow C), and then moving to “B” in the final step,   P^{1}(C\rightarrow B).
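In code, this recursion amounts to repeated multiplication by the one-step transition matrix. Below is a small Python sketch using a made-up three-position chain (the matrix is purely illustrative):

    import numpy as np

    # one-step transition matrix for a hypothetical chain with three positions;
    # entry [i, j] is the probability of moving from position i to position j
    P1 = np.array([
        [0.5, 0.2, 0.3],
        [0.1, 0.6, 0.3],
        [0.4, 0.4, 0.2],
    ])

    def n_step(P1, n):
        """n-step transition probabilities via the recursion above."""
        Pn = P1.copy()
        for _ in range(n - 1):
            Pn = Pn @ P1   # P^n(A->B) = sum over C of P^(n-1)(A->C) * P^1(C->B)
        return Pn

    # two-step probability of moving from the first position to the second
    print(n_step(P1, 2)[0, 1])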

Now that we know this much, we can get back to the goal of MCMC. In particular, we want a Chain in which the long-run proportion of visits to each position is equal to the probability that our model assigns to them. Given this, let’s first consider how we might calculate the long-run proportions.

Long-Run Behaviour of Markov Chains

Because random walks are random, it’s not possible to determine the exact proportion of the time the chain will spend at each position. Nevertheless, it is possible to calculate the expected proportion of the time.

For instance, if the random walk is only one step long, the particle will start at a position, “A” say, and then visit a position at random. In this case, the expected number of visits to a position, “B” say, is given by the one-step transition probability,   P^{1}(A\rightarrow B). Also, as only one position is visited, this is also the expected proportion of the time that the particle spends there.

When the walk is two steps long, the particle will visit two positions at random. In this case, the one-step transition probability,   P^{1}(A\rightarrow B), gives the expected number of visits from the first step and the two-step transition probability,   P^{2}(A\rightarrow B), gives the expected number from the second step. As two positions are visited, the expected proportion of the time that the particle spends in a position is the sum of the one and two-step probabilities divided by two.

In general, the expected proportion of the time spent at a position, “B”, after starting at another position, “A”, and after “n” steps is:

    \[  \sum\limits_{s=1}^{n}{\frac{P^{s}(A\rightarrow B)}{n}}  \]

Although this only produces the expected proportions, the difference between the actual and expected proportions becomes small as the number of steps (“n”) increases. Thus, after a large number of steps we can be sure that they are approximately equal.
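In code, these expected proportions are just a running average of the n-step transition probabilities. A short sketch, reusing the hypothetical three-position chain from the previous snippet:

    import numpy as np

    P1 = np.array([
        [0.5, 0.2, 0.3],
        [0.1, 0.6, 0.3],
        [0.4, 0.4, 0.2],
    ])

    n = 1000
    Ps, total = P1.copy(), np.zeros_like(P1)
    for _ in range(n):
        total += Ps    # accumulate the s-step transition probabilities
        Ps = Ps @ P1

    # expected proportion of time spent at each position over n steps,
    # starting from the first position
    print(total[0] / n)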

With these tools we can start to analyse the long-run behaviour of a particular Markov Chain. However, as the calculations involved are relatively onerous, using them for MCMC is not particularly practical. Instead, we will focus on those Markov Chains where the long-run behaviour is known up-front.

Limiting and Invariant Distributions

There is one class of Markov Chains for which the long-run proportions are easy to calculate. This is when the n-step transition probabilities,   P^{n}(A\rightarrow B), converge to some distribution,   \pi(B), as the number of steps increases. Let’s refer to this distribution as the “limiting distribution” of the Markov Chain.

In this case, as the expected proportion of the time spent in each position is a simple average of the converging n-step transition probabilities, the expected proportions also converge to the limiting distribution.

The trick to MCMC, therefore, is picking transition probabilities that have the desired distribution as the limiting distribution of the Markov Chain. If we are able to do this, we can be sure that a long random walk following these transition probabilities will produce an appropriate sequence of simulations.

In fact, there is another definition of the limiting distribution that is more useful for finding appropriate transition probabilities. Not only is the limiting distribution the distribution that the n-step transition probabilities converge to but it is also the distribution,   \pi, that satisfies the following equation:

    \[  \pi(B) = \sum\limits_{A \,\in\, \text{All Positions}}{\pi(A)\cdot P^{1}(A\rightarrow B)} \]

Hopefully you can see that the right-hand side of the equation is another formula for the distribution of the particle after an additional step. However, here the starting position is random and follows some distribution,   \pi.

This equation says that the distribution “  \pi” is the one such that if the starting distribution is equal to it, after an additional step the ending position will also be equal to it. Because the distribution is unchanged by a step, the limiting distribution is also referred to as the “invariant distribution”.

Consequently, to apply MCMC to a model we simply need to find transition probabilities such that the modelled distribution is equal to the invariant distribution. We can then be sure that a Markov Chain generated using these probabilities will visit each position with the desired long-run frequencies.

This is all there is to MCMC and it is the basis for the techniques discussed in the next article. However, before moving on, let’s consider why the limiting and invariant distributions are equal.

Convergence to the Invariant Distribution

To see that the limiting and invariant distributions are equal it’s useful to consider how the transition probabilities operate on the starting distribution when the n-step transition probabilities are calculated.

Let’s consider the one-step probabilities first. In this case the starting position is given and so we might say that the starting distribution assigns 100% of the probability mass to the starting position.

One step on, the particle will randomly occupy one of the other positions with probabilities equal to the one-step transition probabilities. Therefore, a given position (“B”) will receive a percentage of the original mass from the starting position (“A”), equal to the associated transition probability (  P^{1}(A\rightarrow B) ).

When calculating the two-step probabilities an identical process occurs but this time the starting mass will not be in one position but will be spread among the positions according to the one-step probabilities. Thus, the process will result in each position retaining some of its mass, losing some of its mass to other positions, and receiving mass from other positions.

The animation below illustrates this process for a Markov Chain that has two positions, ‘A’ and ‘B’, and for which the 1-step transition probability from A to B is a half and from B to A is a third.


Here the Markov Chain starts at position “A” and so 100% of the starting probability mass is at position “A”. In each step, however, half of the mass from position “A” moves to position “B” and a third of the mass at “B” moves to “A”. The video shows how the distribution of mass changes over eight steps.
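A short Python sketch of this redistribution of mass is below; it applies the same one-step transition probabilities for eight steps, starting with all of the mass at “A”:

    import numpy as np

    # one-step transition matrix for the two-position chain in the animation:
    # from A, move to B with probability 1/2; from B, move to A with probability 1/3
    P1 = np.array([
        [1/2, 1/2],
        [1/3, 2/3],
    ])

    mass = np.array([1.0, 0.0])   # all of the starting mass is at position A

    for step in range(1, 9):
        mass = mass @ P1          # redistribute the mass according to the one-step probabilities
        print(step, mass)

After a handful of steps the printed distribution settles at (0.4, 0.6), which, as discussed below, is the invariant distribution of this chain.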

This operation typically results in some positions gaining and others losing mass. However, when the starting mass is spread between the positions according to the invariant distribution, we know that the starting and ending distributions are equal and so the movements of mass exactly offset one another.

Given this, it’s useful to view a particular starting distribution as assigning to each position a mass equal to the invariant distribution plus an additional excess or deficit. The animation below hints at why this view is useful.

The left-hand chart shows the transfer of mass between the two positions using the same one-step transition probabilities as before, the middle chart shows the invariant distribution for these transition probabilities, and the right-hand chart shows the transfer of the excess and deficit of mass.

As you can see, after each step the distribution in the left-hand plot is simply the sum of the distributions in the middle and right-hand plots.

In particular, the invariant distribution assigns 40% of the mass to position “A” and 60% to position “B”. Therefore, when 100% of the starting mass is at “A” the right-hand plot shows an excess of 60% for position “A” and an equal deficit in position “B”. In addition after a step, when the mass is split equally between the positions, the right-hand plot shows an excess of 10% for position “A” and an equal deficit for position “B”.

This representation shows us that the left-hand distribution will converge to the invariant distribution when the excess and deficit in the right-hand plot tend to zero. And because excesses and deficits cancel when they meet, as long as mass is transferred between the two positions the excess and deficit will shrink.

Here I have focused on a simple case involving two positions, however the intuition gained from this example is more generally applicable. In particular, as long as excesses and deficits do not avoid one another, the distribution of the position of the particle will converge to the invariant distribution and so the limiting and invariant distributions will be equal.

Final Word

In this article we introduced Markov Chains and discussed how to understand their long-run behaviour based on their transition probabilities.

Although it was a relatively long article I had to gloss over a few areas. Therefore, if you are of a technical mind-set, you may find it useful to skim the Wikipedia article on Markov Chains.

In the next article we will move on to more practical matters. In particular, we will consider how to apply what was discussed here.