Modelling in the case of Incomplete Data (1)

We are often presented with incomplete data. In the face of this, clients who would otherwise have asked for the input of modellers might not bother. Furthermore, when asked for insights, some modellers might not know where to start.

This article is the first in a series in which I discuss and compare estimation techniques that can be used when presented with incomplete data. It introduces a modelling problem; subsequent articles will focus on the estimation techniques themselves.

By demonstrating that a variety of approaches can be used, I hope that clients will ask for analyses and modellers will be more confident in providing them.

The data and the problem

Insurance companies receive claims for money from policyholders. For example, if you are at fault in a car accident, your insurer will cover the costs incurred by the other parties as a result of the accident.

Sometimes the total cost of these claims takes a long time to be determined and an insurer must pay incremental amounts until the total cost is established and fully paid. The chart below represents the type of data on individual claims that modellers have access to.


Each line in the chart represents one claim. It shows the cumulative amount paid to the insured since the claim was made. The “quarter” is the number of three-month periods since the claim was first reported. I have highlighted two claims to illustrate how different claims can develop.

In the chart you can see that the cumulative amount paid on each claim increases over time in a step-like fashion. In periods when no payments are made the line remains flat, and in periods when payments are made the line jumps up. Furthermore, at some point, when the total cost has been established and paid by the insurer, the line remains flat for all subsequent periods and the claim is referred to as “closed”.

You will also see that some lines are longer than others. This is because claims that were reported to the insurer many years ago will have been observed for longer.

As part of prudent claims management, an insurer will want to estimate how many more payments might be made on each claim and, ultimately, how much will be paid in the future.

The issue with this data is that it is impossible to say whether any of the claims have closed. For instance, take the case of the claim in the chart highlighted in red. At the end of the eleventh quarter the claim had not had any payments for the past two years. Based on this we might have concluded that it was closed and no further payments would be made. However, we would have been wrong.

This problem is of the type mentioned in the introduction to this article. In particular, the data does not tell us everything about the state of each claim and, most importantly, whether the claim is currently closed or not. We refer to the unknown but actual time of closure as a latent variable and the series of cash flows and timings of cash flows, which make up the observed data, as the observed variables.
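The distinction between latent and observed variables can be sketched in a few lines of Python. Everything here is hypothetical and for illustration only: the `Claim` class, the `observe` function and the payment amounts are invented, not taken from the real data. The sketch shows two claims whose observed payments are identical after eleven quarters even though only one of them has actually closed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Claim:
    # Cumulative amount paid at the end of each quarter (the observed variable).
    cumulative_paid: List[float]
    # First quarter in which the claim is closed (the latent variable:
    # it is never recorded in the data available to the modeller).
    closure_quarter: int


def observe(claim: Claim, as_of_quarter: int) -> List[float]:
    """Return only what the data shows by the end of a given quarter."""
    return claim.cumulative_paid[:as_of_quarter]


# Two made-up claims with the same payments in quarters 1 to 11.
# Both are silent from quarter 4 to quarter 11 (two years).
closed_early = Claim([100.0] * 2 + [250.0] * 13, closure_quarter=4)
still_open = Claim([100.0] * 2 + [250.0] * 9 + [400.0] * 4, closure_quarter=13)

# At the end of quarter 11 the observed data cannot tell them apart.
same_so_far = observe(closed_early, 11) == observe(still_open, 11)
print(same_so_far)
```

The closure quarter is stored on the object only so that the example can be checked; in practice that field is exactly what is missing.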

The first step in handling incomplete data is to construct a model that describes the interaction between latent and observed variables.

A model

In this context, any decent model of claim payments must allow for the following possibilities:

  • A claim may have a series of periods with no payments followed by one or more payments
  • At some point each claim will close
  • The value of individual payments varies

The model I am going to use allows for these features but is otherwise as “simple” a model as I could come up with. In subsequent articles I might consider extensions to this model.

To cover the first two features, let’s assume that the occurrence of a payment and of closure follows the state transition model represented by the diagram below.


The diagram describes what state a claim can be in and how the state can change from one period to the next. It also displays the probability of moving between each of the states.

For example, if at the beginning of a period a claim is open and a payment did not occur in the previous period (the top-left state in the diagram) it can: stay in the same state with probability ‘(1-p)*(1-q)’; generate a payment and remain open (the top-right state) with probability ‘(1-p)*q’; or close with no payment (the bottom state) with probability ‘p’.
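As a quick check on the example above, the three probabilities out of the “open, no payment” state should sum to one for any valid p and q. A minimal Python sketch follows; the function name and the state labels are hypothetical, and the values p = 0.1 and q = 0.3 are arbitrary.

```python
def transitions_from_open_no_payment(p: float, q: float) -> dict:
    """One-period transition probabilities out of the 'open, no payment' state.

    p: probability that the claim closes in the period.
    q: probability of a payment, given that the claim stays open.
    """
    if not (0.0 <= p <= 1.0 and 0.0 <= q <= 1.0):
        raise ValueError("p and q must both lie between 0 and 1")
    return {
        "open_no_payment": (1 - p) * (1 - q),  # stay in the same state
        "open_payment": (1 - p) * q,           # payment made, claim stays open
        "closed": p,                           # claim closes with no payment
    }


probs = transitions_from_open_no_payment(p=0.1, q=0.3)
total = sum(probs.values())
print(probs, total)
```

Algebraically, (1-p)*(1-q) + (1-p)*q = (1-p), and adding p gives 1, so the three outcomes cover every possibility.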

To cover the third feature, let’s assume that the value a payment takes gives you no information on the state of the claim in subsequent periods. As a starting point, we can assume that payments are independent and identically distributed, each following an Exponential distribution.
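To make the model concrete, here is a minimal Python sketch that simulates the cumulative payments on a single claim. It rests on assumptions that go beyond the text: the same p and q are used whether or not a payment occurred in the previous period (the text only gives the probabilities out of the “open, no payment” state), every claim starts open, a closed claim never reopens, and `mean_payment` is a hypothetical name for the mean of the Exponential distribution.

```python
import random
from typing import List, Optional


def simulate_claim(p: float, q: float, mean_payment: float,
                   n_quarters: int, seed: Optional[int] = None) -> List[float]:
    """Simulate cumulative payments on one claim, quarter by quarter."""
    rng = random.Random(seed)
    cumulative = 0.0
    closed = False
    path = []
    for _ in range(n_quarters):
        if not closed:
            if rng.random() < p:
                # The claim closes this quarter with no payment.
                closed = True
            elif rng.random() < q:
                # The claim stays open and a payment is made.
                # expovariate takes the rate, i.e. 1 / mean.
                cumulative += rng.expovariate(1.0 / mean_payment)
        # Once closed, the cumulative amount stays flat.
        path.append(cumulative)
    return path


path = simulate_claim(p=0.1, q=0.3, mean_payment=1000.0, n_quarters=20, seed=1)
print(path)
```

Calling the function repeatedly with different seeds gives a set of step-like lines of the kind described earlier. Only the returned amounts would be visible to a modeller; the `closed` flag inside the loop plays the role of the latent variable.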

With this model I have generated a possible dataset for each of four different values of the parameters. These are plotted below along with the associated parameters.


Obviously some parameter values produce data that is more similar to our actual data than others. In particular, to me, the graph in the bottom-right looks most similar to the real data. However, I would not recommend choosing parameter values by eye in this way.

Some key issues with this manual approach are:

  • It is time consuming
  • It is highly subjective
  • It is prone to error
  • It is not particularly fun (not to be understated!)

Fortunately, there are many (fun!) statistical techniques that improve on this manual approach and we’ll begin to discuss these in the next article in the series.

Final word

Frequently we are presented with data that is incomplete. In this article I have provided an example of such a dataset, which happens to be relevant to insurance.

In the next article I will begin to describe different approaches that can be used to estimate parameter values for this model.

If you would like to suggest an alternative model, make a request for the next article, or have questions on any of the above, feel free to put something in the comments section.