Modelling in the case of Incomplete Data (3)

This article is the third in a series in which I discuss techniques that can be used when presented with incomplete data. The focus of this article is on the use of the likelihood for prediction. In particular, we will discuss the Bayesian approach to generating predictive distributions.


In the first article of this series we discussed a problem involving incomplete data. The problem was to estimate the amount of future payments on all claims reported to an insurer given data on historical payments. Below is the plot of historical payments that we discussed.


As a first step towards constructing an estimate of future payments, we looked at a model that could be used to describe the payment process. The model consisted of two elements, a state transition element and a payment amount element.

Having built a model, in the second article we considered the problem of assessing the goodness of fit of different parameter values. We found that an appealing measure of goodness of fit was the likelihood. Furthermore, of relevance to our problem, we saw that we could calculate the likelihood even for problems involving incomplete data.

Here we will discuss how Bayesian methods leverage the likelihood in order to produce predictions. We will tackle this in two stages. First, we will look at the issue that is resolved by the Bayesian approach and how the likelihood contributes to the solution. Second, we will discuss why using the likelihood in this way makes sense from a more technical point of view. Throughout the article we will use our claim payment problem as a case study.

Please note that this article is necessarily abbreviated. It would be more complete if we discussed the simulation methods I used. However, simulation is not the main theme of this article, hence the omission.

If you haven’t already, you may find reading the first two articles in this series helpful. In particular, as I will refer to them later, you will want to know that our model has three parameters: ‘p’ and ‘q’ parameterise the state transition model; ‘mu’ parameterises the payment amount model (we assume claim payments follow an Exponential distribution).

The Problem & A Solution

The output of a Bayesian analysis is an exact predictive distribution of what we are interested in given what we know. In our case, this means the approach produces a probability distribution of future claim payments given what we have learned from historical payments.

As a first step to understanding the approach, it is useful to consider how we would estimate the probability of a future outcome if we knew what the “correct” values for the parameters in our model were.

For instance, how would we estimate the probability of future payments exceeding $10m if we knew the correct values for ‘p’, ‘q’ and ‘mu’?

To answer this question we could use simulation.

By using the “correct” parameter values, we could simulate future outcomes from our model and then estimate the required probability by calculating the proportion of simulations in which future payments exceeded $10m.
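As a rough sketch of this idea in Python (not the code used to produce the results in this article), the steps could look like the following. The function names, the number of open claims, the horizon and the parameter values are all illustrative assumptions. In particular, the sketch assumes that every claim is known to be open at time 0 and that 'mu' is the mean of the Exponential distribution.

```python
import numpy as np

def simulate_future_payments(p, q, mu, n_open_claims, n_periods, rng):
    """Simulate total future payments on claims assumed to be open at time 0."""
    total = 0.0
    for _ in range(n_open_claims):
        for _ in range(n_periods):
            if rng.random() < p:
                break  # the claim closes and makes no further payments
            if rng.random() < q:
                total += rng.exponential(mu)  # payment with mean mu (assumed)
    return total

def prob_exceeds(threshold, p, q, mu, n_open_claims, n_periods, n_sims=500, seed=0):
    """Proportion of simulations in which total future payments exceed threshold."""
    rng = np.random.default_rng(seed)
    totals = [simulate_future_payments(p, q, mu, n_open_claims, n_periods, rng)
              for _ in range(n_sims)]
    return float(np.mean(np.array(totals) > threshold))

# Illustrative values only; these are not estimates from the article's data.
estimate = prob_exceeds(5_000_000, p=0.1, q=0.5, mu=10_000,
                        n_open_claims=100, n_periods=140)
```

The spread of the simulated totals around their average is the process uncertainty; the parameters themselves never change between simulations.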

The uncertainty that this approach captures is the uncertainty in the statistical process. Let’s refer to this as “process uncertainty”.

Unfortunately, we don’t actually know the correct values for the parameters. Therefore, not only do we have to allow for process uncertainty but we also have to allow for the uncertainty regarding the parameters.

From a practical perspective, as with process uncertainty, Bayesian methods allow for “parameter uncertainty” by simulating different values for the parameters. The distribution we simulate from is known as the “posterior” distribution.

In other words, with our Bayesian hat on, we can augment our previous simulation approach (which involved fixed “correct” parameters) by simulating parameter values from the posterior before using these to simulate future outcomes from our model.
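A minimal sketch of this two-stage simulation is given below. It is an illustration under stated assumptions, not the code behind this article: `posterior_draws` is a hypothetical list of (p, q, mu) values standing in for real draws from the posterior, every claim is assumed to be open at time 0, the horizon is taken to be unlimited, and 'mu' is taken to be the mean payment. The sketch also uses a shortcut: the number of periods a claim survives before closing is geometric, the number of payments in those periods is binomial, and a sum of Exponential payments is Gamma distributed.

```python
import numpy as np

def simulate_total(p, q, mu, n_open_claims, rng):
    """One simulated total of future payments for a fixed set of parameters."""
    # Periods each claim stays open before closing: P(K = k) = (1 - p)^k * p.
    periods_open = rng.geometric(p, size=n_open_claims) - 1
    # Each open period produces a payment with probability q.
    n_payments = rng.binomial(periods_open.sum(), q)
    # The sum of n Exponential(mean mu) payments is Gamma(shape n, scale mu).
    return float(rng.gamma(n_payments, mu)) if n_payments > 0 else 0.0

def posterior_predictive(posterior_draws, n_open_claims, seed=0):
    """Simulate one future outcome per posterior draw of (p, q, mu)."""
    rng = np.random.default_rng(seed)
    return np.array([simulate_total(p, q, mu, n_open_claims, rng)
                     for p, q, mu in posterior_draws])

# Hypothetical stand-ins for posterior draws; not values fitted to any data.
posterior_draws = [(0.10, 0.50, 10_000.0), (0.12, 0.45, 11_000.0), (0.08, 0.55, 9_500.0)]
totals = posterior_predictive(posterior_draws, n_open_claims=100)
```

Because each simulated outcome uses a different parameter draw, the resulting totals reflect both parameter uncertainty and process uncertainty.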

As you can probably tell, the trick to this approach is in constructing a posterior distribution that makes sense. It needs to capture our general ignorance about the parameter values as well as what we have learned from the data.

One relatively coherent approach to doing this is to use the likelihood.

Below is a plot of the likelihood for our claim transition model, based on the data shown in the plot above.


As we discussed in the previous article, the likelihood measures the goodness of fit of different parameter values. Those with a high value (the red region) represent values that are a relatively good fit and those with a low value (the blue region) are those with a poor fit.

With a leap of faith, we might wonder whether we could use the likelihood to represent the posterior. For instance, if the total likelihood in the red region is twenty times higher than that of the blue region, could we reasonably say that the probability of the parameter falling in the red region is twenty times that of it falling in the blue region?

Although some may wince at this derivation, it so happens that we can; more on this in the next section.

In particular, the normalised likelihood (the likelihood scaled so that it sums, or for continuous parameters integrates, to one) gives us a posterior distribution of the parameters. With this we are able to generate predictions by using the simulation approach mentioned above.
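To make the normalisation concrete, here is a small sketch that evaluates a likelihood on a grid of 'p' and 'q' values, scales it to sum to one, and draws parameter values from the result. As a stand-in likelihood it uses the single-claim example from the previous article, (1-p)q(1-p)(1-q)(1-p)(1-q)p; the grid and the number of draws are my own illustrative choices, and the likelihood behind the plot above is based on far more data.

```python
import numpy as np

# Grid of candidate parameter values (an illustrative choice of resolution).
p_grid = np.linspace(0.005, 0.995, 100)
q_grid = np.linspace(0.005, 0.995, 100)
P, Q = np.meshgrid(p_grid, q_grid, indexing="ij")

# Stand-in likelihood: one claim observed on the path P, N, N, C.
likelihood = (1 - P) ** 3 * Q * (1 - Q) ** 2 * P

# Normalise so that the grid values sum to one.
posterior = likelihood / likelihood.sum()

# Draw grid cells in proportion to their posterior probability.
rng = np.random.default_rng(0)
flat_index = rng.choice(posterior.size, size=1000, p=posterior.ravel())
i, j = np.unravel_index(flat_index, posterior.shape)
p_draws, q_draws = p_grid[i], q_grid[j]
```

A grid is only practical for a small number of parameters; it is used here because it keeps the link between the likelihood and the posterior easy to see.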

The plot below represents the result of applying this approach to our problem.


The plot displays historical payments as well as predictions. Historical cumulative payments up to the present time (0) are represented by the black line. The median predicted cumulative payment as well as prediction intervals are shown for each future quarter for the next thirty-five years (times 1 to 140).

From this we might say that we expect total future payments to be ~$10m. Furthermore, we could say that we think there is a 95% probability that total payments will fall between ~$8m and ~$12m.

Why the Solution makes sense

As you may have noticed, the critical stage in a Bayesian analysis is in constructing the posterior distribution. In the previous section I raced past our choice of posterior; here I’ll quickly discuss why the approach makes sense.

In our case, the posterior distribution produces answers to questions like, “Given what you have seen, what do you believe the probability is that the true values for ‘p’, ‘q’, and ‘mu’ are equal to 0.1, 0.5, and $10k?”. In general, through probabilities, the posterior represents our beliefs regarding the true parameter values.

Using symbols, the posterior probability that the true values for the parameters (\Lambda) are equal to some value (\lambda) given the data (D) is denoted P(\Lambda = \lambda | D).

This definition may remind you of the likelihood. Indeed, in the previous article we introduced the likelihood as the probability of the model generating data equal to the observed data, otherwise denoted P(D | \Lambda = \lambda).

Because of the symmetry between the likelihood and the posterior, we can make use of Bayes’ theorem to write:

     \begin{gather*} P(\Lambda = \lambda | D) \propto P(D|\Lambda = \lambda) \cdot P(\Lambda = \lambda)  \end{gather*}

In words, the posterior is proportional to the product of the likelihood and some a priori probability for the parameters. The a priori probability is referred to as the “prior”.

With this formula we can see the assumption I made in the previous section. I assumed that the posterior was proportional to the likelihood, hence I implicitly assumed that the prior was constant for all values of the parameters.
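
To spell out that step, here is a short sketch under two simplifying assumptions of mine: the parameters take values in a finite set of candidates \lambda', and the prior gives each candidate the same constant probability c. Writing out the normalising constant in the formula above, the constant c cancels:

     \begin{gather*} P(\Lambda = \lambda | D) = \frac{P(D|\Lambda = \lambda) \cdot c}{\sum_{\lambda'} P(D|\Lambda = \lambda') \cdot c} = \frac{P(D|\Lambda = \lambda)}{\sum_{\lambda'} P(D|\Lambda = \lambda')} \end{gather*}

The right-hand side is the likelihood scaled so that it sums to one, that is, the normalised likelihood used in the previous section.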

Please note that other prior assumptions are valid. However, we can be satisfied that the approach we arrived at has some technical merit.

For those who will apply Bayesian methods to their problems, it is important to recognise that any choice of prior is ultimately subjective. This is not necessarily a weakness but if you need to maintain a degree of objectivity, it is worth taking a look at articles concerning “Objective Bayes” and testing a variety of reasonable priors.

Final Word

The Bayesian approach is very powerful. Given some basic building blocks it enables us to generate predictive distributions of future outcomes. With these distributions we can attempt to answer questions like the one presented in the first article of this series.

Before becoming completely smitten by Bayesian methods, however, it is important to recognise that this power does not come for free. In particular, employing Bayesian methods involves a subjective selection of the prior distribution.

Given that we now have a method for solving the problem posed in the first article, I will let this series sit for a while. However, there are many other approaches we have not discussed and in future I may revisit this series to discuss them.

If you have any questions or requests for future topics please leave them in the comments section.

Modelling in the case of Incomplete Data (2)

This article is the second in a series in which I discuss techniques that can be used when presented with incomplete data. In this article I introduce the likelihood function as a way of measuring goodness of fit.

Please note that this article is going to get technical and those without a numerical background may need to have Google on standby, a quiet room to sit in, and a strong coffee prepared. However, we all have to start somewhere and so I encourage you to stick with it, re-read and post questions in the comments section.


In the previous article we discussed a problem involving incomplete data. In the problem we were tasked with estimating future claim payments given data on historical payments. Furthermore, as a first step towards producing an estimate, we came up with a model that described the occurrence and amount of payments.

We assumed that the occurrence of payments could be described by a state transition model. The transition diagram associated with the model is shown below.


Given this model, the reason the data is incomplete is that it doesn’t conclusively tell us which of the states a claim is in. In particular, our data only tells us whether a payment is made or not. Consequently, if a payment isn’t made in the most recent period a claim could either be in the “Open with no payment” state or the “Closed with no payment” state.

In addition to the data and the model, I hinted at an “intuitive” approach you could use to estimate parameters. In this approach pseudo-data is generated from the model based on different values for the parameters and the selected parameter values are those where the pseudo-data is most similar to the actual data.

To illustrate this approach, below I show the actual data (left) and the four sets of pseudo-data (right) from the last article. I have put a star next to the parameter values that I thought generated the most similar pseudo-data.


The problem with this approach is that computers do not have an intuitive notion of similarity. Instead, computers make decisions by comparing numbers. Therefore, in order to automate the process, we need a numerical measure of how good different parameter values are. In this article I present the likelihood function as this measure.

As it is boring to use the term “likelihood function” repeatedly, I will use “likelihood” for short. Also, measuring how good parameter values are is referred to as measuring the “goodness of fit” of the parameters. For the same reason I will use “goodness” as an abbreviation of “goodness of fit”.

Why use the Likelihood & What is it?

Why use the likelihood? After all, you may already know other measures of goodness, so using the likelihood may seem like reinventing the wheel.

Although the likelihood is only one of many measures, it is a good place to start as it represents a natural progression from our intuitive assessment of goodness. To show this, let’s first look at the intuitive approach again.

In the intuitive approach, goodness is assessed by judging how similar a plot of the pseudo-data is to a plot of the observed data.

Bringing rigour to this approach is difficult, as people will have different opinions on how similar two plots are. However, when the plots are identical we can be sure that everyone will consider them to be highly similar (I would hope!).

The likelihood follows from this, as it measures goodness by calculating the probability of the model generating pseudo-data that is identical to the observed data.

In other words, rather than judging by eye how similar one set of pseudo-data is to the actual data, with the likelihood we can calculate the proportion of the time that the pseudo-data will be identical to the actual data.
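The sketch below illustrates this "proportion of the time" idea by brute force for a single claim. It rests on assumptions of mine for illustration: the state of the claim is fully observed for four periods, each period is coded as "P" (open, payment made), "N" (open, no payment) or "C" (closed), and the observed sequence and parameter values are made up.

```python
import numpy as np

def simulate_path(p, q, n_periods, rng):
    """Simulate the sequence of states of one claim after it is reported."""
    path = []
    closed = False
    for _ in range(n_periods):
        if closed or rng.random() < p:
            closed = True          # once closed, a claim stays closed
            path.append("C")
        elif rng.random() < q:
            path.append("P")       # open, and a payment is made
        else:
            path.append("N")       # open, but no payment is made
    return "".join(path)

def match_proportion(observed, p, q, n_sims=20_000, seed=0):
    """Proportion of simulated paths that are identical to the observed path."""
    rng = np.random.default_rng(seed)
    hits = sum(simulate_path(p, q, len(observed), rng) == observed
               for _ in range(n_sims))
    return hits / n_sims

# Illustrative observed sequence and parameter values.
proportion = match_proportion("PNNC", p=0.25, q=1 / 3)
```

Repeating this for different values of 'p' and 'q' would show which values reproduce the observed sequence most often. In practice the probability is calculated directly rather than simulated.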

Hopefully the link between the likelihood and our intuitive assessment is apparent but please leave a comment if you are unsure.

Now that we understand why we might use the likelihood, we are prepared to discuss how it is constructed. For this purpose we will take our state transition model as a case study. I will first walk through the relatively simple task of constructing the likelihood imagining we had complete data. Following this I move on to our actual problem involving incomplete data.

The Likelihood in the case of Complete Data

In the case of our claim transition model, a complete dataset would include the state of each claim at each time after it has been reported. In order to visualise this it helps to use a tree diagram, as below.


The diagram shows every possible combination of states a claim can be in after being reported, up until the fourth period. The states a claim can be in at each time are represented by nodes (circles). Furthermore, the nodes are arranged in rows as each row represents the time at which the claim is in the given state. Finally, the arrows between nodes represent how the state of the claim can change between times.

There are three different types of node to represent the three possible states of the claim. A claim can be closed (the node labelled with a “C”), it can be open at the end of a period when no payment occurs (“N”), or it can be open when a payment does occur (“P”). The state of the claim at reporting (“Time 0”) is represented by a red dot; other than being open, the state at reporting is not important in our model.

We can use the diagram to represent the data collected on a single claim after observing it for four periods. To illustrate this I have highlighted one “path” on the diagram.

For the highlighted claim, a payment was made in the first period and so it transitioned to the “P” node. No payments were made in the second and third periods, although the claim remained open, so it transitioned to the “N” node. Finally, in the last period it closed and so it transitioned to the “C” node. Although we don’t see the transitions beyond time 4, we know that a closed claim remains closed.

We can calculate the probability of a claim taking this path by multiplying together the probabilities of moving between each of the states, read from the state transition diagram. In particular, for the highlighted claim, the first transition has a probability of “(1-p)q”, the second transition a probability of “(1-p)(1-q)”, the third “(1-p)(1-q)”, and the fourth “p”. Consequently, the probability of a claim taking this path is “(1-p)q(1-p)(1-q)(1-p)(1-q)p”.

If we had only observed this one claim then this probability, considered as a function of “p” and “q”, is equal to the likelihood.
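The calculation above can be sketched as a small Python function. This is an illustration rather than the general formula: the function names are my own, a path is written as a string such as "PNNC", and the transition probabilities out of the "P" and "N" states are taken to be the same, which is consistent with the probabilities quoted above.

```python
def transition_probability(current, nxt, p, q):
    """One-period transition probability between the states "N", "P" and "C"."""
    if current == "C":
        return 1.0 if nxt == "C" else 0.0  # a closed claim remains closed
    return {"C": p, "P": (1 - p) * q, "N": (1 - p) * (1 - q)}[nxt]

def complete_likelihood(path, p, q):
    """Probability of one claim following a fully observed path, e.g. "PNNC"."""
    likelihood = 1.0
    current = "N"  # at reporting the claim is open; "N" and "P" behave alike here
    for nxt in path:
        likelihood *= transition_probability(current, nxt, p, q)
        current = nxt
    return likelihood

# The highlighted claim: payment, no payment, no payment, closed.
value = complete_likelihood("PNNC", p=0.25, q=1 / 3)
```

Evaluating `complete_likelihood("PNNC", p, q)` over a range of 'p' and 'q' values traces out the likelihood for this one claim.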

You can find a general formula for the complete likelihood by following this link.

The Likelihood in the case of Incomplete Data

The only thing that differs between the cases of complete and incomplete data is the data that is recorded. The model is the same in both cases and each claim can still go down any of the paths. However, now our data only records whether a payment has been made or not. In particular, if no payment has been made, the claim could either be closed (“C”) or open (“N”).

As in the case of complete data, we can use the tree diagram to illustrate what incomplete data looks like. Let’s consider the paths relevant to one claim, the same claim we considered in the case of complete data.


As you can see, instead of highlighting one path, this time many paths are highlighted. This is because there are four possible paths the claim could have taken that would have resulted in a payment in the first period followed by three periods with no payments.

The probability of a claim taking a particular path is again the product of the probability of transitioning between each of the nodes. However, as the observed data would have been generated if any of these paths were taken, we must now sum the probabilities of going down each path.

This principle is true for any model. The likelihood in the presence of incomplete data is the sum of the complete data likelihood over all possible states of the unobserved data.
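As a hedged illustration of this sum for a single claim, the sketch below lists every path that is compatible with a record of payments and adds up the path probabilities. The function names and the coding of the data (1 for a period with a payment, 0 for a period without) are my own, and the transition probabilities out of "P" and "N" are assumed to be the same.

```python
from itertools import product

def path_probability(path, p, q):
    """Probability of a complete path of states "P", "N" and "C"."""
    probability = 1.0
    closed = False
    for state in path:
        if closed:
            probability *= 1.0 if state == "C" else 0.0  # closed claims cannot reopen
        elif state == "C":
            probability *= p
            closed = True
        elif state == "P":
            probability *= (1 - p) * q
        else:
            probability *= (1 - p) * (1 - q)
    return probability

def incomplete_likelihood(payments, p, q):
    """Sum the complete likelihood over all paths compatible with the payments."""
    # A payment pins the state down to "P"; no payment leaves "N" or "C" possible.
    options = [("P",) if paid else ("N", "C") for paid in payments]
    return sum(path_probability(path, p, q) for path in product(*options))

# Payment in the first period, then three periods without payments.
value = incomplete_likelihood([1, 0, 0, 0], p=0.25, q=0.5)
```

Paths in which a closed claim reopens are given a probability of zero, so only the four possible paths contribute. Listing paths like this becomes slow as the number of periods without payments grows, so it is only suitable as an illustration.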

I won’t write out the incomplete likelihood in this case, as it is long. However, you can follow this link if you would like to see a general formula for the incomplete likelihood.

A Brief Comparison of the Complete & Incomplete Likelihoods

Now that we can construct the incomplete likelihood, you may be wondering how it compares to the complete likelihood.

To make this comparison, let’s consider the incomplete likelihood generated by the claim highlighted in our tree diagram and each of the complete likelihoods generated by a claim taking one of the possible paths our claim could have taken. In particular, the claim could have closed in the second, third or fourth periods, or not at all. Hence, we can construct four complete likelihoods for comparison.

A plot of each of these is contained in the gallery below. 

In these graphs, “q” and “p” are plotted on the horizontal and vertical axes. Furthermore, the coloured regions identify parameter values that have a similar likelihood. The likelihood is high for regions coloured in dark red and low for those in dark blue.

You should be able to see that the incomplete likelihood shares features with each of the complete likelihoods. In particular, the regions with a high incomplete likelihood are also high for at least one of the complete likelihoods.

This shouldn’t come as a surprise, as the incomplete likelihood is simply the sum of these complete likelihoods.

In addition, we can see that each complete likelihood has a larger dark blue region compared to the incomplete likelihood. Recalling that the likelihood is a measure of goodness of fit, this indicates that the complete likelihood identifies a smaller set of good parameters compared to the incomplete likelihood (i.e. it is more precise about which parameters are good).

Again, this seems reasonable as, all else being equal, having more information on historical claims should enable us to be more precise about which parameter values are good and which are bad.

Although this is a contrived example, as insurers typically have data on thousands of claims, hopefully it gives you some comfort that the incomplete likelihood behaves in an intuitive way.

Final Word

In this article I introduced the likelihood as a measure of goodness of fit. In addition, I showed why the likelihood is a natural progression from our intuitive assessment of goodness.

After introducing it, we considered how to construct the likelihood in the case of complete and incomplete data. I then briefly compared the complete and incomplete likelihood and showed that they behave in an intuitive way.

In subsequent articles we will discuss different techniques that use the likelihood to estimate values for the parameters.

I hope you found this article useful and if you have any comments or questions please leave them in the comments section below.

Modelling in the case of Incomplete Data (1)

We are often presented with incomplete data. In the face of this, clients who would otherwise have asked for the input of modellers might not bother. Furthermore, when asked for insights, some modellers might not know where to start.

This article is the first in a series of articles in which I discuss and compare estimation techniques that can be used when presented with incomplete data. This article is devoted to introducing a modelling problem and subsequent articles will focus on estimation techniques.

By demonstrating that a variety of approaches can be used, I hope that clients will ask for analyses and modellers will be more confident at providing them.

The data and the problem

Insurance companies receive claims for money from policyholders. For example, if you are at fault in a car accident, your insurer will cover the costs incurred by the other parties as a result of the accident.

Sometimes the total cost of these claims takes a long time to be determined and an insurer must pay incremental amounts until the total cost is established and fully paid. The chart below represents the type of data on individual claims that modellers have access to.


Each line in the chart represents one claim. It shows the cumulative amount paid to the insured since the claim was made. The “quarter” is the number of three month periods since the claim was first reported. I have highlighted two claims to illustrate how different claims can develop.

In the chart you can see that the cumulative amount paid on each claim increases over time in a step-like fashion. There are periods when no payments are made and the line remains flat, and periods when payments are made and the line jumps up. Furthermore, at some point, when the total cost has been established and paid by the insurer, the line remains flat for all subsequent periods and the claim is referred to as “closed”.

You will also see that some lines are longer than others. This is because claims that were reported to the insurer many years ago will have been observed for longer.

As part of the prudent management of claims, the insurer will want to estimate how many more payments might be made on each claim and ultimately how much will be paid in the future.

The issue with this data is that it is impossible to say whether any of the claims have closed. For instance, take the case of the claim in the chart highlighted in red. At the end of the eleventh quarter the claim had not had any payments for the past two years. Based on this we might have concluded that it was closed and no further payments would be made. However, we would have been wrong.

This problem is of the type mentioned in the introduction to this article. In particular, the data does not tell us everything about the state of each claim and, most importantly, whether the claim is currently closed or not. We refer to the unknown but actual time of closure as a latent variable and the series of cash flows and timings of cash flows, which make up the observed data, as the observed variables.

The first step to handling incomplete data is to construct a model that describes the interaction between latent and observed variables.

A model

In this context, any decent model of claim payments must allow for the following possibilities:

  • A claim may have a series of periods with no payments followed by one or more payments
  • At some point each claim will close
  • The value of individual payments varies

The model I am going to use allows for these features but is otherwise as “simple” a model as I could come up with. In subsequent articles I might consider extensions to this model.

To cover the first two features, let’s assume that the occurrence of a payment and of closure follows the state transition model represented by the diagram below.


The diagram describes what state a claim can be in and how the state can change from one period to the next. It also displays the probability of moving between each of the states.

For example, if at the beginning of a period a claim is open and a payment did not occur in the previous period (the top-left state in the diagram) it can: stay in the same state with probability ‘(1-p)*(1-q)’; generate a payment and remain open (the top-right state) with probability ‘(1-p)*q’; or close with no payment (the bottom state) with probability ‘p’.
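For readers who prefer code to diagrams, the transitions can be written as a matrix. This is a sketch under assumptions of mine: the states are ordered as open with no payment ("N"), open with a payment ("P") and closed ("C"); the probabilities of leaving "P" are taken to be the same as those of leaving "N"; and a closed claim stays closed. The parameter values are illustrative.

```python
import numpy as np

def transition_matrix(p, q):
    """Rows are the current state, columns the next state, ordered N, P, C."""
    open_row = [(1 - p) * (1 - q), (1 - p) * q, p]
    return np.array([
        open_row,          # open, no payment in the previous period
        open_row,          # open, payment in the previous period (assumed the same)
        [0.0, 0.0, 1.0],   # closed claims stay closed
    ])

T = transition_matrix(p=0.1, q=0.5)

# Probabilities of being in each state four periods after starting in "N".
after_four = np.linalg.matrix_power(T, 4)[0]
```

Each row sums to one, and raising the matrix to a power gives the state probabilities several periods ahead.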

To cover the third feature, let’s assume that the value a payment takes gives you no information on the state of the claim in subsequent periods. To start somewhere, we can assume that payment amounts are independent and identically distributed draws from an Exponential distribution.

With this model I have generated a possible dataset for each of four different values of the parameters. These are plotted below along with the associated parameters.
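A dataset of this kind could be generated along the following lines. This is a minimal sketch rather than the code used for the plots: the function names and parameter values are illustrative, every claim is observed for the same number of quarters (unlike the real data), and the parameter of the Exponential distribution is taken to be its mean, called `mu` here.

```python
import numpy as np

def simulate_claim(p, q, mu, n_quarters, rng):
    """Cumulative payments on one claim at quarters 0, 1, ..., n_quarters."""
    cumulative = np.zeros(n_quarters + 1)
    is_open = True
    for t in range(1, n_quarters + 1):
        payment = 0.0
        if is_open:
            if rng.random() < p:
                is_open = False                # the claim closes with no payment
            elif rng.random() < q:
                payment = rng.exponential(mu)  # open, and a payment is made
        cumulative[t] = cumulative[t - 1] + payment
    return cumulative

def simulate_dataset(p, q, mu, n_claims, n_quarters, seed=0):
    """One row of cumulative payments per simulated claim."""
    rng = np.random.default_rng(seed)
    return np.array([simulate_claim(p, q, mu, n_quarters, rng)
                     for _ in range(n_claims)])

# Illustrative parameter values only.
pseudo_data = simulate_dataset(p=0.1, q=0.5, mu=10_000, n_claims=20, n_quarters=40)
```

Plotting each row against the quarter gives step-like lines of the kind shown in the charts.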


Obviously some parameter values produce data that is more similar to our actual data than others. In particular, to me, the graph in the bottom-right looks most similar to the real data. However, I would not recommend this approach for ultimately deciding on parameter values.

Some key issues with this approach are:

  • It is time consuming
  • It is highly subjective
  • It is prone to error
  • It is not particularly fun (not to be understated!)

Fortunately, there are many (fun!) statistical techniques that improve on this manual approach and we’ll begin to discuss these in the next article in the series.

Final word

Frequently we are presented with data that is incomplete. In this article I have provided an example of such a dataset, which happens to be relevant to Insurance.

In the next article I will begin to describe different approaches that can be used to estimate parameter values for this model.

If you would like to suggest an alternative model, make a request for the next article, or have questions on any of the above, feel free to put something in the comments section.