## Gaussian Process regression

Among other things nonparametric Bayesian methods can simplify predictive modelling projects. Here I’ll give a brief introduction to one of these methods, Gaussian Process regression, and how it can help us make inferences about complicated trends in a quick and consistent way.

### Introduction

Just as the outcomes of a coin flip can be thought of as a random variable, so can a Gaussian Process. However, while a coin flip produces scalar-valued outcomes (tails = 0, heads = 1), stochastic processes generate functions as outcomes.

In the graph below you can see a number of univariate random functions. Each line is one function and represents one outcome of the Gaussian Process.

As you can see, the outcomes of a Gaussian Process are quite varied: the light blue line is pretty straight; the blue, purple and teal lines have one inflection point; and the gold, red, and green lines have two inflection points.

What gives the Gaussian Process its name is that any collection of function values (e.g. f(1), f(5.5), f(-2.2)) follows a multivariate Gaussian distribution. Furthermore, just as the Gaussian distribution is parameterised by a mean and (co)variance, a Gaussian process is parameterised by a mean function and covariance function.

The mean function represents the long-run average outcome of the random functions, while the covariance function determines how points of the function will diverge from the mean and whether the divergence will be of the same degree and direction at different points.

The plot above has been annotated with the form of the mean and covariance functions (m and k) used to generate the lines. In particular, the mean is the zero function and the covariance function decays from one, when the two points are equal, to zero as the points become more distant.

These choices result in smooth functions, centred around zero, where nearby points on the x-axis have similar values and distant points are relatively independent. Furthermore, with this covariance function almost any smooth function is possible and the propensity for certain types of functions to occur can be changed by varying the covariance function’s parameters.

A key goal of regression is to learn something about an unknown function underlying a set of noisy observations. For example, in simple linear regression we investigate the straight line functions that best map values of the covariates to noisy observations of the response.

In Bayesian analysis, simple linear regression proceeds by placing a prior on the parameters of the linear regression model (the intercept, slope and noise parameters). The posterior distribution of the parameters, conditional on the data, is then used to determine which straight lines could have plausibly generated the data.

As you saw in the plot above, a Gaussian Process can be used as a prior on a much larger set of functions. Furthermore Gaussian processes are convenient priors as, with Gaussian distributed noise, the posterior distribution is also a Gaussian process for which the mean and covariance functions are relatively easy to determine.

In the example below I show how these facts allow you to produce fast and convincing analyses of some weather data.

### Example

Here I’ve used a Gaussian Process regression model to make predictions about future temperatures.

Historical temperatures were sourced from the met office’s historical dataset for the Oxford weather station. A subset of the data and the Gaussian Process forecast are summarised in the chart below.

Historical month-end measurements are displayed as black points, with the last reading being at the end of August 2017. After this time the smooth black line represents the mean posterior predicted outcome. Dark and light blue shading has been added to delimit one and two standard deviation prediction intervals.

In this case I’ve again used the zero function for the mean but for the covariance I’ve used a periodic function.

This particular covariance function has three free parameters: a, the degree to which the underlying trend deviates from the mean function; b, the duration between points that are highly correlated; and c, the degree to which nearby points are independent or not. The independent and identically distributed Gaussian noise results in one further parameter.

Depending on the dataset in question, parameters can be inferred from the data (e.g. using MCMC techniques) or set based on prior beliefs about the form of the trend. Here I have set “b” equal to 365 days and I have used maximum likelihood estimation to choose values for the other parameters.

The key takeaway from this, however, is to note that the inferences made by using Gaussian Process regression are reasonable—the periodic pattern observable in the historical data is reflected in the modelled predictions—and could easily be applied to other datasets.

### Final Word

In this article I gave you a taste of nonparametric Bayesian methods by introducing Gaussian Process regression. In particular, I described the basic premise behind Gaussian Process regression and then showed you how it might be applied to a dataset with a periodic trend.

If you have any questions about Gaussian Process regression, please do not hesitate to leave a comment below. In future articles I’ll discuss other nonparametric Bayesian methods and will highlight some of the benefits and pitfalls associated with them.

## Gibbs Sampling by Example

In order to grasp a new technique it’s often useful to see how it is used in real-world examples. Here we will examine a particular implementation of a Gibbs sampler which pertains to a model we discussed in a previous series of articles.

### Re-Cap

As was discussed in the previous article, the Gibbs sampling algorithm is a Markov Chain Monte Carlo (MCMC) technique that can be used to simulate outcomes from a multivariate distribution.

As with all MCMC techniques, simulations are produced by generating a realisation of a random walk and recording the position of the walk at the end of consecutive time intervals. MCMC techniques are designed such that, no matter where it starts, the positions produced by a long enough walk will resemble a sample from the target distribution.

The Gibbs sampler produces a random walk by updating each coordinate associated with the multivariate distribution sequentially. Once each coordinate has been updated the new values are recorded and another iteration occurs.

The key to the Gibbs sampler is in how each of the coordinates are updated. In particular, updating a particular variable involves simulating an outcome from its conditional distribution, given the latest values of all the other variables.

The animation below illustrates how this process unfolds in the case of a bivariate Normal distribution.

Here you can see that the random walk moves around the space by first taking a horizontal step and then a vertical step.

In the animation I’ve also plotted the positions of the random walk on top of some bivariate Normal contours. One can see that after a large number of steps the distribution of the dots resembles the target distribution.

In this article we will show how Gibbs sampling can be applied to another more specialised problem.

### The Model

In the Incomplete Data series I introduced a model that could be used to describe the timing of individual claim payments made by an insurer.

From the data we made two observations regarding the timing of payments; payments occurred sporadically and eventually they ceased.

These phenomena were assumed to be caused by the state of each claim. In particular, we assumed that the claim could be in one of three states: an open state in which no payment occurs; a closed state, also with no payment; and an open state in which a payment does occur. In addition, it was assumed that if a claim closed it remained closed.

The probability of a claim moving from an open state to the closed state () and, given it was open, giving rise to a payment () were both assumed to be constant and applicable to all claims.

To understand more about the claim payment process, therefore, it’s useful to investigate what can be learned about the unknown factors in our model based on the observed data. In this case, the unknowns are the transition probability parameters ( and ) and if and when any of the claims closed.

As was discussed in the Incomplete Data series, this problem can be tackled by looking at the posterior distribution of the unknowns. Specifically, assuming a uniform prior on and , the posterior density takes the following form:

Here, the event that a given claim is closed is represented by , represents the number of trailing periods with no payments that each claim remained open for, and and are the parameters we discussed previously.

The observed data is represented by the number of claims (), the number of periods in which payments occurred (), the number of periods prior to the last payment in which no payments occurred (), and the number of observed periods after the last payment ().

In this article we will use a Gibbs sampler to simulate from this distribution. However, as discussed in the previous article, to do this we need to know what each of the unknowns’ conditional distributions are.

### The Conditional Distributions

The purpose of this article isn’t to dwell on how we arrive at the conditional distribution of each unknown. After all, in your problem the posterior and conditional distributions will undoubtedly be different. However, as they are integral to performing Gibbs sampling it’s worth getting a feel for how you might derive them.

In this case you can get most of the way to deriving the conditional distributions by “inspection”. This means considering the posterior density as a function of each unknown and seeing whether it looks like any of the standard distributions you already know (see Wikipedia for a list of standard distributions).

For example, as a function of () the posterior looks like:

This density is of the same form as a beta distribution.

By a similar albeit more long-winded approach it’s possible to see that the conditional distributions of each of the unknowns take the form of the following standard distributions:

As you can see the conditional distributions each take a standard form. The exception to this being –which is either a constant or follows a truncated Geomtric distribution–nevertheless it is still easy to simulate from.

With these conditional distributions, we have all that we need to implement a Gibbs sampler.

### The Gibbs Sampler

At this point it may be obvious to you how we would set up a Gibbs sampler using the conditional distributions above. However, I expect others might feel uncomfortable translating what we’ve discussed into an actual implementation.

With this in mind, below is the Gibbs sampler relevant to this article written in “pseudo-code”:

Hence, in the programming language of your choice, one iteration of the Gibbs sampler involves: for each claim simulating whether it is closed or not (step 7); simulating how many periods each claim has been open for (step 9); simulating new parameter values for and (steps 12 and 13).

### Final Word

In this article we discussed a specialised example of a Gibbs sampling algorithm. In particular, we examined how to implement a Gibbs sampler for a model that could be used to explain the timing of payments on insurance claims.

I hope that by working through a real-world example applying Gibbs sampling to your problems will seem less daunting.

## Markov Chain Monte Carlo: The Gibbs Sampler

In the previous article we focused on understanding what Markov Chain Monte Carlo is and why it works. Here we look at Gibbs sampling, a Markov Chain Monte Carlo technique that can be directly applied to many modelling problems.

### Re-Cap

In the previous article we introduced Markov Chain Monte Carlo (MCMC) as a method for simulating outcomes from a model.

We saw that the key to this approach was in choosing a random walk that visited each outcome as frequently as the model suggested. By recording the position of the random walk at consecutive time intervals, we could generate a set of simulations that looked like a large sample from the model.

For example, a model might prescribe that two outcomes are possible, “A” and “B” say. In this case, if each outcome occurred with a probability of 50% then an appropriate random walk would transition between “A” and “B”, spending half of the time at each outcome.

Furthermore, by recording the positions of the walk we could get a large sample in which approximately half of the records would have outcome “A” and the other half outcome “B”.

Ultimately, we saw that a particular type of random walk was useful in this regard. In particular, we found that if the invariant distribution of a Markov Chain is equal to the modelled distribution, it would (aside from some unusual exceptions) visit each position with the desired long-run frequency.

The objective of the previous article was to ensure that we were comfortable with this key result, as it is the reason why all MCMC methods work. However, alone this simply presents us with a new problem:

How do I find a Markov Chain with invariant distribution equal to my modelled distribution?

Please note, understanding the previous article is important for understanding what is discussed here. As a result, you may find it helpful to re-read or skim the previous post.

### Narrowing Down our Search

After the last article, identifying the Markov Chain appropriate to your model may have felt like finding the proverbial needle in the haystack.

There is a whole world of possible Markov Chains out there and naively picking transition probabilities until we find the ones with the right invariant distribution is likely to end in frustration.

To make this search easier, MCMC techniques usually consider only the subset of Markov Chains that have a particular property. This property is known as “detailed balance”.

Detailed balance narrows our focus to those transition probabilities that, when applied to the modelled distribution, result in no net transfer of mass between each pair of outcomes.

This property is only subtly different to how we describe the effect of the transition probabilities on the invariant distribution. If you recall, the invariant distribution is the one for which the net transfer of mass is zero. Hence, detailed balance is similar but applies at a more granular level.

Clearly, as the total net transfer of mass between every outcome is the aggregate of the transfers between each pair of outcomes, it’s easy to see that if a distribution satisfies detailed balance it is also an invariant distribution.

Detailed balance is more formally defined by the following equation. In particular, for all “x” and “y” the following equation must be true:

Here, the left-hand side represents the amount of mass that is transferred from outcome x to outcome y, while the right-hand side represents the amount transferred from y to x.

The Gibbs sampling algorithm is one MCMC technique that leverages detailed balance in order to produce a Markov Chain with the desired invariant distribution.

### The Gibbs Sampling Algorithm

Often MCMC is used when we want to simulate from a multivariate distribution. In these cases, when the conditional distribution of each variable given the other variables is known, Gibbs sampling can be used.

The Gibbs sampling algorithm is best understood by considering how it produces the next step in a Markov Chain. The animation below shows how this is done when Gibbs sampling is used to simulate from a bivariate Normal distribution.

In the animation the leading black dot represents the latest position of the random walk. In this case the random walk starts at the coordinate (0,0). To produce the next position, Gibbs sampling involves picking a new value for the x-coordinate and, afterwards, a new value for the y-coordinate.

As was alluded to above, the x-coordinate is sampled from the conditional distribution of x given the latest value of y and vice versa for the y-coordinate.

After each coordinate has been updated, the position is recorded as the next step in the walk. In this case, the second position of the walk is at (-0.19,-0.41).

To give you some confidence that the simulations generated by Gibbs sampling are appropriate, I’ve plotted the simulations on top of some bivariate Normal contours. As you can see, the spread of the black dots resembles the shape of the contours. Furthermore, for the reasons discussed in the last article, the resemblance becomes closer as the number of simulations increases.

This method of generating a Markov Chain may seem unfamiliar. In particular, you may be more familiar with having a set of transition probabilities that fully describe the probability of moving to each position given the current one.

In this case, however, we only know the transition probabilities corresponding to the sub-steps involved in updating each coordinate. Nevertheless, as the next step in the walk only depends on its current position, this process does generate a Markov Chain.

The key to the Gibbs sampler is that the updates to each coordinate satisfy detailed balance. To see this, let’s first consider the probability of moving from the coordinate (x, y) to (z, y), after an update to X:

In words, the transition probability from (x, y) to (z, y), given an update to X, simply follows from the conditional distribution of X given Y. Furthermore, from the definition of conditional probability, this transition probability can be written as:

Note that in the equation above and the ones below I write, for example, instead of . Both of these are supposed to denote the same thing, the probability that X is equal to x and Y is equal to y, however I’ve removed some bits to make the notation less clunky.

Inserting the modelled joint distribution and our final representation of the transition probabilities into both sides of the detailed balance equation we get:

Clearly both sides are equal, no matter what the value of x, y, or z.

While this shows that an update to the x-coordinate satisfies detailed balance, we can use this logic to show the same thing for the y-coordinate and/or given any number of variables.

Furthermore, as the update to each coordinate satisfies detailed balance, so does a Gibbs step. In particular, a Gibbs step is the result of updating every coordinate and because no mass is transferred while updating each coordinate, neither is any mass transferred once all of the coordinates are updated.

Thankfully, that’s all there is to Gibbs sampling.

### Final Word

In this article we discussed the Gibbs sampling algorithm, a Markov Chain Monte Carlo technique.

We saw that Gibbs sampling was particularly useful when we wanted to simulate from a multivariate distribution. In particular, the algorithm produces a new simulation by updating each modelled variable in turn and recording the outcome after all of the variables are updated.

The reason why Gibbs sampling works is that the updating process defines a Markov Chain for which the modelled distribution is the invariant distribution.

## The Basics of Markov Chain Monte Carlo

Sometimes models are difficult to evaluate. In this two-part series we’ll take a look at how Markov Chain Monte Carlo (MCMC) can sometimes make things easier.

This article describes the key ideas behind MCMC and the next will focus on the practicalities of implementing it.

### Introducing Markov Chain Monte Carlo

As with Monte Carlo simulation, MCMC is a simulation method that can be used to calculate difficult integrals.

While MCMC is more complicated than standard Monte Carlo, it is much more widely applicable. Therefore, understanding MCMC enables practitioners to simulate from a much larger set of models than would otherwise be the case.

Monte Carlo simulation works by generating many independent simulations, where the probability of an outcome occurring in a simulation is equal to its modelled probability. Statistics based on this large sample can then be used to estimate the statistics of the model.

For example, my model could relate to the outcome of rolling a die. In this case I could simulate many independent rolls of the die and use the average, median or any other statistic of these rolls to estimate the exact statistic of one roll.

The key to MCMC is the realisation that it doesn’t matter if the simulations are independent, so long as the long-run proportion of simulations with a given outcome is equal to the modelled probability of that outcome.

The charts below help illustrate what this means.

The charts summarise the result of simulating values from a Uniform distribution using Monte Carlo simulation (left) and MCMC simulation (middle). In addition, the right-hand chart displays the empirical CDF associated with the Monte Carlo sequence (red) and the MCMC sequence (blue).

It’s clear that the two methods produce sequences of simulations that look different. In the ‘Monte Carlo’ plot the dots are spread evenly. On the other hand, the dots in the ‘MCMC’ plot seem to follow a path.

Despite this, the right-hand plot shows that both methods generate sequences of simulations that are consistent with the model. The CDFs are very similar and approximately fall on a 45⁰ line, as you would expect of a Uniform distribution.

This example illustrates the basic premise of MCMC but it isn’t that exciting, especially because Monte Carlo simulation can be used instead. However, as you will see in the next article, sometimes MCMC provides a neat solution to problems that would otherwise seem difficult.

So, how do you construct a sequence of correlated simulations in which outcomes occur with the desired frequencies?

To do this, MCMC methods make use of a type of random walk called a “Markov Chain”.

### Markov Chain Basics

You can visualise a random walk as a particle in space that randomly changes position over time.

A Markov Chain is a special type of random walk where the focus is on the position of the particle at the end of consecutive time intervals. So in the middle plot above, the dots represent the position of the particle (y-axis) at the end of consecutive time intervals (x-axis).

The key differentiator between Markov Chains is their tendencies to move to some positions more frequently than others. These tendencies are described by their “one-step transition probabilities”.

As the name suggests, the one-step transition probabilities represent the probability of moving to a position in one step, given the particle’s current position. For example, if the particle is at position “A” and the one-step transition probability from “A to B” is 20%, then 20% of the time the particle will subsequently move to “B”.

The one-step probabilities also happen to determine the Chain’s tendencies over a greater number of steps. This is because, given the latest position, the next step in a Markov Chain is taken without regard to the prior steps.

Continuing from the previous example, if we know that the transition probability from “B to C” is 30% say, then 6% of the time (= 20% * 30%) after starting at “A” the particle will move to “B” and then “C”.

In general, multiplying the transition probabilities associated with a sequence of steps will produce the probability of taking all of those steps.

While the probability associated with any one path is quite interesting, often we want to know the probability of ending in a position regardless of how we got there. Depending on the number of steps considered, these are referred to as the “n-step transition probabilities”.

As we don’t care how the chain gets to the ending position, the n-step transition probability is simply the sum of the probability associated with each path that starts and ends at the given positions.

For example, if we wanted to know the two-step transition probability from “A” to “B”, we would calculate the probability associated with each two-step path that ends at “B” (e.g. , , , …, etc) and then sum them.

Fortunately, rather than having to consider an impractically large sum, the n-step transition probabilities can be calculated recursively:

In words, the n-step transition probability from “A” to “B”, , is the sum over all positions of the probability of moving there from “A” in one less step, , and then moving to “B” in the final step, .

Now that we know this much, we can get back to the goal of MCMC. In particular, we want a Chain in which the long-run proportion of visits to each position is equal to the probability that our model assigns to them. Given this, let’s first consider how we might calculate the long-run proportions.

### Long-Run Behaviour of Markov Chains

Because random walks are random, it’s not possible to determine the exact proportion of the time the chain will spend at each position. Nevertheless, it is possible to calculate the expected proportion of the time.

For instance, if the random walk is only one step long, the particle will start at a position, “A” say, and then visit a position at random. In this case, the expected number of visits to a position, “B” say, is given by the one-step transition probability, . Also, as only one position is visited, this is also the expected proportion of the time that the particle spends there.

When the walk is two steps long, the particle will visit two positions at random. In this case, the one-step transition probability, , gives the expected number of visits from the first step and the two-step transition probability, , gives the expected number from the second step. As two positions are visited, the expected proportion of the time that the particle spends in a position is the sum of the one and two-step probabilities divided by two.

In general, the expected proportion of the time spent at a position, “B”, after starting at another position, “A”, and after “n” steps is:

Although this only produces the expected proportions, the difference between the actual and expected proportions becomes small as the number of steps (“n”) increases. Thus, after a large number of steps we can be sure that they are approximately equal.

With these tools we can start to analyse the long-run behaviour of a particular Markov Chain. However, as the calculations involved are relatively onerous, using them for MCMC is not particularly practical. Instead, we will focus on those Markov Chains where the long-run behaviour is known up-front.

### Limiting and Invariant Distributions

There is one class of Markov Chains for which the long-run proportions are easy to calculate. This is when the n-step transition probabilities, , converge to some distribution, , as the number of steps increases. Let’s refer to this distribution as the “limiting distribution” of the Markov Chain.

In this case, as the expected proportion of the time spent in each position is a simple average of the converging n-step transition probabilities, the expected proportions also converge to the limiting distribution.

The trick to MCMC, therefore, is picking transition probabilities that have the desired distribution as the limiting distribution of the Markov Chain. If we are able to do this, we can be sure that a long random walk following these transition probabilities will produce an appropriate sequence of simulations.

In fact, there is another definition of the limiting distribution that is more useful for finding appropriate transition probabilities. Not only is the limiting distribution the distribution that the n-step transition probabilities converge to but it is also the distribution, , that satisfies the following equation:

Hopefully you can see that the right-hand side of the equation is another formula for the distribution of the particle after an additional step. However, here the starting position is random and follows some distribution, .

This equation says that the distribution “” is the one such that if the starting distribution is equal to it, after an additional step the ending position will also be equal to it. Because the distribution is unchanged by a step, the limiting distribution is also referred to as the “invariant distribution”.

Consequently, to apply MCMC to a model we simply need to find transition probabilities such that the modelled distribution is equal to the invariant distribution. We can then be sure that a Markov Chain generated using these probabilities will visit each position with the desired long-run frequencies.

This is all there is to MCMC and it is the basis for the techniques discussed in the next article. However, before moving on, let’s consider why the limiting and invariant distributions are equal.

### Convergence to the Invariant Distribution

To see that the limiting and invariant distributions are equal it’s useful to consider how the transition probabilities operate on the starting distribution when the n-step transition probabilities are calculated.

Let’s consider the one-step probabilities first. In this case the starting position is given and so we might say that the starting distribution assigns 100% of the probability mass to the starting position.

One step on, the particle will randomly occupy one of the other positions with probabilities equal to the one-step transition probabilities. Therefore, a given position (“B”) will receive a percentage of the original mass from the starting position (“A”), equal to the associated transition probability ().

When calculating the two-step probabilities an identical process occurs but this time the starting mass will not be in one position but will be spread among the positions according to the one-step probabilities. Thus, the process will result in each position retaining some of its mass, losing some of its mass to other positions, and receiving mass from other positions.

The animation below illustrates this process for a Markov Chain that has two positions, ‘A’ and ‘B’, and for which the 1-step transition probability from A to B is a half and from B to A is a third.

Here the Markov Chain starts at position “A” and so 100% of the starting probability mass is at position “A”. In each step, however, half of the mass from position “A” moves to position “B” and a third of the mass at “B” moves to “A”. The video shows how the distribution of mass changes over eight steps.

This operation typically results in some positions gaining and others losing mass. However, when the starting mass is spread between the positions according to the invariant distribution, we know that the starting and ending distributions are equal and so the movements of mass exactly offset one another.

Given this, it’s useful to view a particular starting distribution as assigning mass equal to the invariant distribution to each position and then an additional excess or deficit of mass to each position. The animation below hints at why this is the case.

The left-hand chart shows the transfer of mass between the two positions using the same one-step transition probabilities as before, the middle chart shows the invariant distribution for these transition probabilities, and the right-hand chart shows the transfer of the excess and deficit of mass.

As you can see, after each step the distribution in the left-hand plot is simply the sum of the distributions in the middle and right-hand plots.

In particular, the invariant distribution assigns 40% of the mass to position “A” and 60% to position “B”. Therefore, when 100% of the starting mass is at “A” the right-hand plot shows an excess of 60% for position “A” and an equal deficit in position “B”. In addition after a step, when the mass is split equally between the positions, the right-hand plot shows an excess of 10% for position “A” and an equal deficit for position “B”.

This representation shows us that the left-hand distribution will converge to the invariant distribution when the excess and deficit in the right-hand plot tends to zero. However, because excesses and deficits cancel, as long as mass is transferred between the two positions the excess and deficit will reduce.

Here I have focused on a simple case involving two positions, however the intuition gained from this example is more generally applicable. In particular, as long as excesses and deficits do not avoid one another, the distribution of the position of the particle will converge to the invariant distribution and so the limiting and invariant distributions will be equal.

### Final Word

In this article we introduced Markov Chains and discussed how to understand their long-run behaviour based on their transition probabilities.

Although it was a relatively long article I had to gloss over a few areas. Therefore, if you are of a technical mind-set, you may find it useful to skim the Wikipedia article on Markov Chains.

In the next article we will move on to more practical matters. In particular, we will consider how to apply what was discussed here.

## More Incomplete Data: Modelling Latent Factors

This article is the second in a series in which we discuss techniques that can be used to analyse latent factors. In the last article we discussed how we could analyse latent factors in a qualitative way. Here, we discuss one way in which we could make our qualitative approach amenable to statistical analysis.

The techniques we discuss are broadly applicable but here we will focus on the specific problem introduced in the last article. It may be obvious how the techniques can be applied to your problem but if not, we will discuss them more generally in subsequent posts.

## Re-cap

The problem introduced in the previous article involved predicting the future payments on a number of claims that had been reported to an insurer. This problem had already been considered in a previous series but we complicated the matter by acknowledging that each claim was likely to behave differently due to latent factors.

Having introduced the problem, we then discussed whether it was even possible to analyse latent factors. This was done in light of the fact that traditional analytical techniques are not particularly useful when the data does not contain the exact state of each claim, as is the case in our problem.

Based on our intuition, however, it seemed clear that some sort of analysis was possible. In particular, it seemed likely that claims with “peculiar” payment series were probably predisposed to behave in an odd way. Therefore, these claims probably had a different latent state to the other claims.

In this article we will look at how we can place our intuitive analysis on a more rigorous footing. The approach considered in this article involves extending a “simple” model. For this problem we will extend the state transition model described in the second article of the previous series.

## A simple model

The model we discussed in the previous series was based on the following assumptions:

• Between periods, each claim transitions between three states: open with a payment; open with no payment; and closed with no payment
• An open claim will close in the next period with probability ‘p’ or, given that it doesn’t, will give rise to a payment with probability ‘q’
• Once a claim closes it remains closed

These assumptions are represented in the diagram below.

Unfortunately, this model has limitations.

In particular, as all claims are subject to the same parameters, this model assumes that the only source of variation between claims is due to chance. In our case this is inappropriate, as additional variation will be caused by the fact that each claim has a different latent predisposition to close or to be associated with a payment in a period.

However, by extending our simple model we can resolve this issue.

## Extending the model

The simple model assumes that each claim is governed by the same parameters. Instead, to allow for the different predispositions of each claim, let’s assume that:

1. There is a larger population of claims from which our claims have been sampled.
2. For each claim there is a simple model that adequately describes the payment process.
3. Each claim has its own specific parameters and therefore, its own model.

The graphs below attempt to illustrate what these assumptions mean.

The left-hand graph depicts one possible population of claims. Each dot represents one claim and the coordinates for each dot give us the transition probabilities applicable to each claim. As you can see, each claim in the population is associated with a different set of parameters. In this case, the majority of claims have a closure probability of ~20% and a payment given open probability of ~50%.

On the right-hand side I’ve simply copied the graph from the previous article, which shows the historical payment series on each claim in our dataset.

The same claims are highlighted in the right-hand graph as in the previous article. In addition, I have highlighted two corresponding claims in the population plot. The colours identify which claim in the population the claims in our dataset correspond to and, most importantly, the parameters that govern the payment series on each of our claims.

Of course, the data on the left-hand side is entirely hypothetical. In reality, we don’t know how the dots in the left-hand graph are distributed. Even if we did, we wouldn’t be able to identify which claim in the left-hand graph each claim in our dataset relates to. However, the illustration does give us the machinery we need to formalise the intuitive analysis we performed in the previous article.

In particular, in the last article we said that:

“…because the latest payments on the red and blue claims are so much later than most of the other claims, it seems likely that there are latent factors that are causing these claims to remain open for longer than the others. In addition, as the blue claim has such a high rate of payment in each period compared to the red claim, it seems likely that there are latent factors causing the blue claim to have more payments.”

Based on the framework discussed here, this means that we believe the red and blue claims are more likely to be associated with a claim in the population with a relatively low closure probability. Furthermore, we believe that the blue claim has a relatively higher probability of payment compared to the red claim.

Therefore, while we can’t pin point the exact claim in the population that each of our claims corresponds to, we can identify regions in the population plot that are more likely to contain each of our claims.

The challenge we have now is to perform this analysis quickly and consistently for every claim. We will look into this in the next article.

## Final word

In this article we discussed how we could transform our intuitive approach to analysing latent factors into something that was amenable to statistical analysis. To do this we extended a simple model.

In the next article we will begin to discuss how we can analyse the data under our new model in an automated way.

Please feel free to leave any questions below regarding the article or your specific problem.

## More Incomplete Data: Latent Factors

Making predictions in a rigorous way is difficult. Part of the reason for this is that it is rare for all of the key factors contributing to events to be recorded—these factors are known as “latent” factors. In this series we will consider techniques that can be used to identify latent factors.

This article is devoted to introducing a problem afflicted by latent factors. The problem comes from revisiting the one discussed in the previous series. Then, in subsequent posts, we will look at techniques that one might use.

## Re-cap & motivation

In the previous series we were tasked with predicting the future series of payments on a number of claims that had been reported to an insurer. To help us with this we had received a dataset containing the timing and amount of past payments on these claims.

The starting point of our original analysis was based on the following features of the payment process:

1. An open claim may or may not give rise to a payment
2. Payment amounts vary
3. All claims eventually close, after which no further payments are made

1. Latent factors will cause some of the differences observed between claims

This observation is obvious when you consider, for example, that for different claims: the claimant is different; the lawyers and claim professionals involved are different; and the nature of the loss to the insured is different.

Incorporating this sort of information into our analysis may seem daunting. So instead of immediately jumping into the technical detail, let’s first convince ourselves that we can learn something about latent factors from the available data.

## Data and an intuitive analysis

The graph below contains the same type of data considered in the previous series. There are one-hundred lines plotted and each line represents the cumulative amount paid on an insurance claim following notification to the insurer. The cumulative amounts are shown at the end of quarterly time intervals.

I have highlighted two claims to illustrate how claim payment series can differ. In particular, payments are made frequently on some claims and not on others (e.g. the blue claim has had payments in 60% of periods to date while the red claim has only made payments in 20%). In addition, by far the majority of claims have no payments beyond the tenth quarter (the highlighted claims being the ones with the latest payments).

The question we are interested in answering is, “to what extent is the variation in the payment series driven by latent factors?”

Standard analytical techniques rely on us being able to identify all of the factors that result in systematic variation between claims. The combination of factors is called the “state” of the claim. Being able to identify the state allows us to see to what extent the payment series varies between claims with similar or different states.

For example, if we saw that a disproportionate number of claims with late payments were associated with claimants that had suffered injuries, we might conclude that some of the variation in the payment series was driven by this factor.

Unfortunately, there is no data on latent factors (by definition) and so standard techniques are not helpful. Nevertheless, given that there are latent factors, we can draw some conclusions about the state of the claim based on its effects on the observed data.

For example, because the latest payments on the red and blue claims are so much later than most of the other claims, it seems likely that there are latent factors that are causing these claims to remain open for longer than the others. In addition, as the blue claim has such a high rate of payment in each period compared to the red claim, it seems likely that there are latent factors causing the blue claim to have more payments.

In other words, it is enough to know the relative effect on a claim to know that there is something different about the state of it.

Hopefully, you find these conclusions relatively intuitive. If not, please leave a question in the comments below and I will get back to you.

Now that we can see it is possible to draw conclusions on the latent state of a claim, all that remains is to understand the techniques that can be used to automate this process. These techniques will be the subject of subsequent articles.

## Final word

In this article we introduced a problem involving latent factors. We then interrogated a dataset to prove to ourselves that we could learn something about latent factors from the available data.

In the next article we will begin to look at techniques that can be used to quantify the latent effects.

## Modelling in the case of Incomplete Data (3)

### Re-cap

In the first article of this series we discussed a problem involving incomplete data. The problem was to estimate the amount of future payments on all claims reported to an insurer given data on historical payments. Below is the plot of historical payments that we discussed.

As a first step towards constructing an estimate of future payments, we looked at a model that could be used to describe the payment process. The model consisted of two elements, a state transition element and a payment amount element.

Having built a model, in the second article we considered the problem of assessing the goodness of fit of different parameter values. We found that an appealing measure of goodness of fit was the likelihood. Furthermore, of relevance to our problem, we saw that we could calculate the likelihood even for problems involving incomplete data.

Here we will discuss how Bayesian methods leverage the likelihood in order to produce predictions. We will tackle this in two stages. First, we will look at the issue that is resolved by the Bayesian approach and how the likelihood contributes to the solution. Second, we will discuss why using the likelihood in this way makes sense from a more technical point of view. Throughout the article we will use our claim payment problem as a case study.

If you haven’t already, you may find reading the first two articles in this series helpful. In particular, as I will refer to them later, you will want to know that our model has three parameters: ‘p’ and ‘q’ parameterise the state transition model; ‘mu’ parameterises the payment amount model (we assume claim payments follow an Exponential distribution).

### The Problem & A Solution

The output of a Bayesian analysis is an exact predictive distribution of what we are interested in given what we know. In our case, this means the approach produces a probability distribution of future claim payments given what we have learned from historical payments.

As a first step to understanding the approach, it is useful to consider how we would estimate the probability of a future outcome if we knew what the “correct” values for the parameters in our model were.

For instance, how would we estimate the probability of future payments exceeding $10m if we knew the correct values for ‘p’, ‘q’ and ‘mu’? To answer this question we could use simulation. By using the “correct” parameter values, we could simulate future outcomes from our model and then estimate the required probability by calculating the proportion of simulations in which future payments exceeded$10m.

The uncertainty that this approach captures is the uncertainty in the statistical process. Let’s refer to this as “process uncertainty”.

Unfortunately, we don’t actually know the correct values for the parameters. Therefore, not only do we have to allow for process uncertainty but we also have to allow for the uncertainty regarding the parameters.

From a practical perspective, as with process uncertainty, Bayesian methods allow for “parameter uncertainty” by simulating different values for the parameters. The distribution we simulate from is known as the “posterior” distribution.

In other words, with our Bayesian hat on, we can augment our previous simulation approach (which involved fixed “correct” parameters) by simulating parameter values from the posterior before using these to simulate future outcomes from our model.

As you can probably tell, the trick to this approach is in constructing a posterior distribution that makes sense. It needs to capture our general ignorance about the parameter values as well as what we have learned from the data.

One relatively coherent approach to doing this is to use the likelihood.

Below is a plot of the likelihood for our claim transition model, based on the data shown in the plot above.

As we discussed in the previous article, the likelihood measures the goodness of fit of different parameter values. Those with a high value (the red region) represent values that are a relatively good fit and those with a low value (the blue region) are those with a poor fit.

With a leap of faith, we might wonder whether we could use the likelihood to represent the posterior. For instance, if the total likelihood in the red region is twenty times higher than that of the blue region, could we reasonably say that the probability of the parameter falling in the red region is twenty times that of it falling in the blue region?

Although some may wince at this derivation, it so happens that we can–more on this in the next section.

In particular, the normalised likelihood (the likelihood scaled so that it sums to one) gives us a posterior distribution of the parameters. With this we are able to generate predictions by using the simulation approach mentioned above.

The plot below represents the result of applying this approach to our problem.

The plot displays historical payments as well as predictions. Historical cumulative payments up to the present time (0) are represented by the black line. The median predicted cumulative payment as well as prediction intervals are shown for each future quarter for the next thirty five years (times 1 to 140).

From this we might say that we expect total future payments to be ~$10m. Furthermore, we could say that we think there is a 95% probability that total payments will fall between ~$8m and ~$12m. ### Why the Solution makes sense As you may have noticed, the critical stage in a Bayesian analysis is in constructing the posterior distribution. In the previous section I raced past our choice of posterior, here I’ll quickly discuss why the approach makes sense. In our case, the posterior distribution produces answers to questions like, “Given what you have seen, what do you believe the probability is that the true values for ‘p’, ‘q’, and ‘mu’ are equal to 0.1, 0.5, and$10k?”. In general, through probabilities, the posterior represents our beliefs regarding the true parameter values.

Using symbols, the posterior probability that the true values for the parameters () are equal to some value () given the data () is denoted .

This definition may remind you of the likelihood. Indeed, in the previous article we introduced the likelihood as the probability of the model generating data equal to the observed data–otherwise denoted as .

Because of the symmetry between the Likelihood and the Posterior, we can make use of Bayes theorem to write:

In words, the posterior is proportional to the product of the likelihood and some a priori probability for the parameters. The a priori probability is referred to as the “prior”.

With this formula we can see what assumption I had made in the previous section. I assumed that the posterior was proportional to the likelihood, hence I implicitly assumed that the prior was constant for all values of the parameters.

Please note that other prior assumptions are valid. However, we can be satisfied that the approach we arrived at has some technical merit.

For those who will apply Bayesian methods to their problems, it is important to recognise that any choice of prior is ultimately subjective. This is not necessarily a weakness but if you need to maintain a degree of objectivity, it is worth taking a look at articles concerning “Objective Bayes” and testing a variety of reasonable priors.

### Final Word

The Bayesian approach is very powerful. Given some basic building blocks it enables us to generate predictive distributions of future outcomes. With these distributions we can attempt to answer questions like the one presented in the first article of this series.

Before becoming completely smitten by Bayesian methods, however, it is important to recognise that this power does not come for free. In particular, employing Bayesian methods involves a subjective selection of the prior distribution.

Given that we now have a method for solving the problem posed in the first article, I will let this series sit for a while. However, there are many other approaches we have not discussed and in future I may revisit this series to discuss them.

If you have any questions or requests for future topics please leave them in the comments section.

## Modelling in the case of Incomplete Data (2)

Please note that this article is going to get technical and those without a numerical background may need to have Google on standby, a quiet room to sit in, and a strong coffee prepared. However, we all have to start somewhere and so I encourage you to stick with it, re-read and post questions in the comments section.

### Re-cap

In the previous article we discussed a problem involving incomplete data. In the problem we were tasked with estimating future claim payments given data on historical payments. Furthermore, as a first step towards producing an estimate, we came up with a model that described the occurrence and amount of payments.

We assumed that the occurrence of payments could be described by a state transition model. The transition diagram associated with the model is shown below.

Given this model, the reason the data is incomplete is that it doesn’t conclusively tell us which of the states a claim is in. In particular, our data only tells us whether a payment is made or not. Consequently, if a payment isn’t made in the most recent period a claim could either be in the “Open with no payment” state or the “Closed with no payment” state.

In addition to the data and the model, I hinted at an “intuitive” approach you could use to estimate parameters. In this approach pseudo-data is generated from the model based on different values for the parameters and the selected parameter values are those where the pseudo-data is most similar to the actual data.

To illustrate this approach, below I show the actual data (left) and the four sets of pseudo-data (right) from the last article. I have put a star next to the parameter values that I thought generated the most similar pseudo-data.

The problem with this approach is that computers do not have an intuitive notion of similarity. Instead, computers make decisions by comparing numbers. Therefore, in order to automate the process, we need a numerical measure of how good different parameter values are. In this article I present the likelihood function as this measure.

As it is boring to use the term “likelihood function” repeatedly, I will use “likelihood” for short. Also, measuring how good parameter values are is referred to as measuring the “goodness of fit” of the parameters. For the same reason I will use “goodness” as an abbreviation of “goodness of fit”.

### Why use the Likelihood & What is it?

Why use the likelihood? For instance, you may already know other measures of goodness and so using the likelihood may seem like reinventing the wheel.

Although the likelihood is only one of many measures, it is a good place to start as it represents a natural progression from our intuitive assessment of goodness. To show this, let’s first look at the intuitive approach again.

In the intuitive approach, goodness is assessed by judging how similar a plot of the pseudo-data is to a plot of the observed data.

Bringing rigour to this approach is difficult, as people will have different opinions on how similar two plots are. However, when the plots are identical we can be sure that everyone will consider them to be highly similar (I would hope!).

The likelihood follows from this, as it measures goodness by calculating the probability of the model generating pseudo-data that is identical to the observed data.

In other words, rather than judging by eye how similar one set of pseudo-data is to the actual data, with the likelihood we can calculate the proportion of the time that the pseudo-data will be identical to the actual data.

Now that we understand why we might use the likelihood, we are prepared to discuss how it is constructed. For this purpose we will take our state transition model as a case study. I will first walk through the relatively simple task of constructing the likelihood imagining we had complete data. Following this I move on to our actual problem involving incomplete data.

### The Likelihood in the case of Complete Data

In the case of our claim transition model, a complete dataset would include the state of each claim at each time after it has been reported. In order to visualise this it helps to use a tree diagram, as below (please click).

The diagram shows every possible combination of states a claim can be in after being reported, up until the fourth period. The states a claim can be in at each time are represented by nodes (circles). Furthermore, the nodes are arranged in rows as each row represents the time at which the claim is in the given state. Finally, the arrows between nodes represent how the state of the claim can change between times.

There are three different types of node to represent the three possible states of the claim. A claim can be closed (the node labelled with a “C”), it can be open at the end of a period when no payment occurs (“N”), or it can be open when a payment does occur (“P”). The state of the claim at reporting (“Time 0”) is represented by a red dot; other than being open, the state at reporting is not important in our model.

We can use the diagram to represent the data collected on a single claim after observing it for four periods. To illustrate this I have highlighted one “path” on the diagram.

For the highlighted claim, a payment was made in the first period and so it transitioned to the “P” node. No payments were made in the second and third periods, although the claim remained open, so it transitioned to the “N” node. Finally, in the last period it closed and so it transitioned to the “C” node. Although we don’t see the transitions beyond time 4, we know that a closed claim remains closed.

We can calculate the probability of a claim taking this path by multiplying together the probabilities of moving between each of the states, read from the state transition diagram. In particular, for the highlighted claim, the first transition has a probability of “(1-p)q”, the second transition a probability of “(1-p)(1-q)”, the third “(1-p)(1-q)”, and the fourth “p”. Consequently, the probability of a claim taking this path is “(1-p)q(1-p)(1-q)(1-p)(1-q)p”.

If we had only observed this one claim then this probability, considered as a function of “p” and “q”, is equal to the likelihood.

You can find a general formula for the complete likelihood by following this link.

### The Likelihood in the case of Incomplete Data

The only thing that differs between the cases of complete and incomplete data is the data that is recorded. The model is the same in both cases and each claim can still go down any of the paths. However, now our data only records whether a payment has been made or not. In particular, if no payment has been made, the claim could either be closed (“C”) or open (“N”).

As in the case of complete data, we can use the tree diagram to illustrate what incomplete data looks like. Let’s consider the paths relevant to one claim, the same claim we considered in the case of complete data.

As you can see, instead of highlighting one path, this time many paths are highlighted. This is because there are four possible paths the claim could have taken that would have resulted in a payment in the first period followed by three periods with no payments.

The probability of a claim taking a particular path is again the product of the probability of transitioning between each of the nodes. However, as the observed data would have been generated if any of these paths were taken, we must now sum the probabilities of going down each path.

This principle is true for any model. The likelihood in the presence of incomplete data is the sum of the complete data likelihood over all possible states of the unobserved data.

I won’t write out the incomplete likelihood in this case, as it is long. However, you can follow this link if you would like to see a general formula for the incomplete likelihood.

### A Brief Comparison of the Complete & Incomplete Likelihoods

Now that we can construct the incomplete likelihood, you may be wondering how it compares to the complete likelihood.

To make this comparison, let’s consider the incomplete likelihood generated by the claim highlighted in our tree diagram and each of the complete likelihoods generated by a claim taking one of the possible paths our claim could have taken. In particular, the claim could have closed in the second, third or fourth periods, or not at all. Hence, we can construct four complete likelihoods for comparison.

A plot of each of these is contained in the gallery below.

« 1 of 5 »

In these graphs, “q” and “p” are plotted on the horizontal and vertical axes. Furthermore, the coloured regions identify parameter values that have a similar likelihood. The likelihood is high for regions coloured in dark red and low for those in dark blue.

You should be able to see that the incomplete likelihood shares features with each of the complete likelihoods. In particular, the regions with a high incomplete likelihood are also high for at least one of the complete likelihoods.

This shouldn’t come as a surprise, as the incomplete likelihood is simply the sum of these complete likelihoods.

In addition, we can see that each complete likelihood has a larger dark blue region compared to the incomplete likelihood. Recalling that the likelihood is a measure of goodness of fit, this indicates that the the complete likelihood identifies a smaller set of good parameters compared to the incomplete likelihood (i.e. it is more precise about which parameters are good).

Again, this seems reasonable as, all else being equal, having more information on historical claims should enable us to be more precise about which parameter values are good and which are bad.

Although this is a contrived example, as insurers typically have  data on thousands of claims, hopefully it gives you some comfort that the incomplete likelihood behaves in an intuitive way.

### Final Word

In this article I introduced the likelihood as a measure of goodness of fit. In addition, I showed why the likelihood is a natural progression from our intuitive assessment of goodness.

After introducing it, we considered how to construct the likelihood in the case of complete and incomplete data. I then briefly compared the complete and incomplete likelihood and showed that they behave in an intuitive way.

In subsequent articles we will discuss different techniques that use the likelihood to estimate values for the parameters.

## Modelling in the case of Incomplete Data (1)

We are often presented with incomplete data. In the face of this, clients who would otherwise have asked for the input of modellers might not bother. Furthermore, when asked for insights, some modellers might not know where to start.

By demonstrating that a variety of approaches can be used, I hope that clients will ask for analyses and modellers will be more confident at providing them.

### The data and the problem

Insurance companies receive claims for money from policyholders. For example, if you are at fault of causing a car accident your insurer will cover the costs incurred by the other parties as a result of the accident.

Sometimes the total cost of these claims takes a long time to be determined and an insurer must pay incremental amounts until the total cost is established and fully paid. The chart below represents the type of data on individual claims that modellers have access to.

Each line in the chart represents one claim. It shows the cumulative amount paid to the insured since the claim was made. The “quarter” is the number of three month periods since the claim was first reported. I have highlighted two claims to illustrate how different claims can develop.

In the chart you can see that the cumulative amount paid on each claim increases over time in a step-like fashion. There are periods when no payments are made and the line remains flat or when payments are made and the line jumps up. Furthermore, at some point, when the total cost has been established and paid by the insurer, the line remains flat for all subsequent periods and the claim is referred to as “closed”.

You will also see that some lines are longer than others. This is because claims that were reported to the insurer many years ago will have been observed for longer.

As part of prudent management of claims, management will want to estimate how many more payments might be made on each claim and ultimately how much will be paid in the future.

The issue with this data is that it is impossible to say whether any of the claims have closed. For instance, take the case of the claim in the chart highlighted in red. At the end of the eleventh quarter the claim had not had any payments for the past two years. Based on this we might have concluded that it was closed and no further payments would be made. However, we would have been wrong.

This problem is of the type mentioned in the introduction to this article. In particular, the data does not tell us everything about the state of each claim and, most importantly, whether the claim is currently closed or not. We refer to the unknown but actual time of closure as a latent variable and the series of cash flows and timings of cash flows, which make up the observed data, as the observed variables.

The first step to handling incomplete data is to construct a model that describes the interaction between latent and observed variables.

### A model

In this context, any decent model of claim payments must allow for the following possibilities:

• A claim may have a series of periods with no payments followed by one or more payments
• At some point each claim will close
• The value of individual payments varies

The model I am going to use allows for these features but is otherwise as “simple” a model as I could come up with. In subsequent articles I might consider extensions to this model.

To cover the first two features, let’s assume that the occurrence of a payment and of closure follows the state transition model represented by the diagram below.

The diagram describes what state a claim can be in and how the state can change from one period to the next. It also displays the probability of moving between each of the states.

For example, if at the beginning of a period a claim is open and a payment did not occur in the previous period (the top-left state in the diagram) it can: stay in the same state with probability ‘(1-p)*(1-q)’; generate a payment and remain open (the top-right state) with probability ‘(1-p)*q’; or close with no payment (the bottom state) with probability ‘p’.

To cover the third feature, let’s assume that the value a payment takes gives you no information on the future state of the claim in subsequent periods. To start somewhere, we can assume that each payment follows an independent and identically distributed Exponential distribution.

With this model I have generated a possible dataset for each of four different values of the parameters. These are plotted below along with the associated parameters.

Obviously some parameter values produce data that is more similar to our actual data than others. In particular, to me, the graph in the bottom-right looks most similar to the real data. However, I would not recommend this approach for ultimately deciding on parameter values.

Some key issues with this approach are:

• It is time consuming
• It is highly subjective
• It is prone to error
• It is not particularly fun (not to be understated!)

Fortunately, there are many (fun!) statistical techniques that improve on this manual approach and we’ll begin to discuss these in the next article in the series.

### Final word

Frequently we are presented with data that is incomplete. In this article I have provided an example of such a dataset, which happens to be relevant to Insurance.

In the next article I will begin to describe different approaches that can be used to estimate parameter values for this model.

If you would like to suggest an alternative model, make a request for the next article, or have questions on any of the above, feel free to put something in the comments section.