More Incomplete Data: Modelling Latent Factors

This article is the second in a series in which we discuss techniques that can be used to analyse latent factors. In the last article we discussed how we could analyse latent factors in a qualitative way. Here, we discuss one way in which we could make our qualitative approach amenable to statistical analysis.

The techniques we discuss are broadly applicable, but here we will focus on the specific problem introduced in the last article. It may be obvious how the techniques apply to your own problem; if not, we will discuss them more generally in subsequent posts.


The problem introduced in the previous article involved predicting the future payments on a number of claims that had been reported to an insurer. This problem had already been considered in a previous series but we complicated the matter by acknowledging that each claim was likely to behave differently due to latent factors.

Having introduced the problem, we then discussed whether it was even possible to analyse latent factors. This was done in light of the fact that traditional analytical techniques are not particularly useful when the data does not contain the exact state of each claim, as is the case in our problem.

Based on our intuition, however, it seemed clear that some sort of analysis was possible. In particular, it seemed likely that claims with “peculiar” payment series were predisposed to behave in an odd way and, therefore, had a different latent state to the other claims.

In this article we will look at how we can place our intuitive analysis on a more rigorous footing. The approach considered in this article involves extending a “simple” model. For this problem we will extend the state transition model described in the second article of the previous series.

A simple model

The model we discussed in the previous series was based on the following assumptions:

  • Between periods, each claim transitions between three states: open with a payment; open with no payment; and closed with no payment
  • In each period, an open claim closes with probability ‘p’; if it remains open, it gives rise to a payment with probability ‘q’
  • Once a claim closes it remains closed
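
As a rough illustration, the three assumptions above can be simulated in a few lines. This is only a sketch: the function name, the cap on the number of periods and the choice to start every claim in an open state are my own assumptions, and the values of ‘p’ and ‘q’ in the example call are purely illustrative.

```python
import random


def simulate_claim(p, q, max_periods=20, rng=None):
    """Simulate one claim under the simple model.

    Returns a list with one entry per period in which the claim was open:
    1 if a payment was made in that period, 0 if not. The list ends when
    the claim closes or when max_periods (an assumed cap) is reached.
    """
    rng = rng or random.Random()
    payments = []
    for _ in range(max_periods):
        # The claim closes with probability p and then stays closed.
        if rng.random() < p:
            break
        # Given that the claim is still open, a payment occurs with probability q.
        payments.append(1 if rng.random() < q else 0)
    return payments


# Illustrative parameters only: 20% chance of closure, 50% chance of payment.
example_series = simulate_claim(0.2, 0.5, rng=random.Random(42))
print(example_series)
```

Running the function many times with the same ‘p’ and ‘q’ produces payment series that differ only by chance.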

These assumptions are represented in the diagram below.


Unfortunately, this model has limitations.

In particular, as all claims are subject to the same parameters, this model assumes that the only source of variation between claims is chance. In our case this is inappropriate, as additional variation will be caused by the fact that each claim has a different latent predisposition to close or to be associated with a payment in a period.

However, by extending our simple model we can resolve this issue.

Extending the model

The simple model assumes that each claim is governed by the same parameters. Instead, to allow for the different predispositions of each claim, let’s assume that:

  1. There is a larger population of claims from which our claims have been sampled.
  2. For each claim there is a simple model that adequately describes the payment process.
  3. Each claim has its own specific parameters and therefore, its own model.

The graphs below attempt to illustrate what these assumptions mean.


The left-hand graph depicts one possible population of claims. Each dot represents one claim, and the coordinates of each dot give the transition probabilities applicable to that claim. As you can see, each claim in the population is associated with a different set of parameters. In this case, the majority of claims have a closure probability of ~20% and a probability of payment, given that the claim remains open, of ~50%.
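
To make the idea of a population concrete, here is a minimal sketch of how such a set of dots could be generated. The use of Beta distributions, and the particular shape parameters, are assumptions I have chosen only so that the closure probabilities centre on roughly 20% and the payment probabilities on roughly 50%; nothing here is fitted to real data.

```python
import random


def sample_population(n_claims, seed=0):
    """Draw a hypothetical population of claims.

    Each claim gets its own pair (p, q): p is its closure probability and
    q is its probability of payment given that it remains open.
    The Beta distributions used here are an illustrative assumption.
    """
    rng = random.Random(seed)
    population = []
    for _ in range(n_claims):
        p = rng.betavariate(4, 16)   # mean 4 / (4 + 16) = 0.20
        q = rng.betavariate(10, 10)  # mean 10 / (10 + 10) = 0.50
        population.append((p, q))
    return population


population = sample_population(1000)
mean_p = sum(p for p, _ in population) / len(population)
mean_q = sum(q for _, q in population) / len(population)
print(round(mean_p, 2), round(mean_q, 2))
```

Each pair in the list plays the role of one dot in the population plot.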

On the right-hand side I’ve simply copied the graph from the previous article, which shows the historical payment series on each claim in our dataset.

The same claims are highlighted in the right-hand graph as in the previous article. In addition, I have highlighted the two corresponding claims in the population plot. The colours show which claim in the population each highlighted claim in our dataset corresponds to and, most importantly, which parameters govern its payment series.

Of course, the data on the left-hand side is entirely hypothetical. In reality, we don’t know how the dots in the left-hand graph are distributed. Even if we did, we wouldn’t be able to identify which claim in the left-hand graph each claim in our dataset relates to. However, the illustration does give us the machinery we need to formalise the intuitive analysis we performed in the previous article.

In particular, in the last article we said that:

“…because the latest payments on the red and blue claims are so much later than most of the other claims, it seems likely that there are latent factors that are causing these claims to remain open for longer than the others. In addition, as the blue claim has such a high rate of payment in each period compared to the red claim, it seems likely that there are latent factors causing the blue claim to have more payments.”

Based on the framework discussed here, this means that we believe the red and blue claims are more likely to correspond to claims in the population with a relatively low closure probability. Furthermore, we believe that the blue claim has a higher probability of payment than the red claim.

Therefore, while we can’t pinpoint the exact claim in the population that each of our claims corresponds to, we can identify regions in the population plot that are more likely to contain each of our claims.
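
One way to make “more likely” concrete is to ask how probable an observed payment series would be for a given pair of parameters. The sketch below is my own illustration and is not necessarily the approach taken in the next article. It assumes the series is recorded as one 0/1 payment indicator per period, that the claim was open when reported, and that after the last payment we cannot tell whether the claim has closed or is simply open with no payments. The function name and the example series are hypothetical.

```python
def series_likelihood(payments, p, q):
    """Probability of an observed 0/1 payment series under the simple model.

    p is the closure probability and q the probability of payment given open.
    Periods after the last payment may be either closed or open without a
    payment, so both possibilities are added together.
    """
    # Number of periods up to and including the last payment (0 if none).
    last = max((i + 1 for i, y in enumerate(payments) if y), default=0)
    n_paid = sum(payments[:last])
    # The claim must have stayed open in every period up to the last payment.
    open_part = (1 - p) ** last * q ** n_paid * (1 - q) ** (last - n_paid)

    stay_quiet = (1 - p) * (1 - q)  # open for one more period with no payment
    closed_by_now = 0.0
    still_open = 1.0
    for _ in range(len(payments) - last):
        closed_by_now += still_open * p  # closes in this period
        still_open *= stay_quiet         # remains open, no payment
    return open_part * (closed_by_now + still_open)


# Hypothetical series for a long-running claim, for illustration only.
series = [1, 1, 0, 1, 1, 1, 0, 1, 0, 0]
grid = [i / 20 for i in range(1, 20)]
best = max(
    ((p, q) for p in grid for q in grid),
    key=lambda pq: series_likelihood(series, pq[0], pq[1]),
)
print(best)
```

Evaluating the function over a grid of parameter values, as in the last few lines, shows which parts of the population plot are most compatible with a given claim's history.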

The challenge we have now is to perform this analysis quickly and consistently for every claim. We will look into this in the next article.

Final word

In this article we discussed how we could transform our intuitive approach to analysing latent factors into something that is amenable to statistical analysis. To do this we extended a simple model so that each claim has its own parameters, drawn from a wider population.

In the next article we will begin to discuss how we can analyse the data under our new model in an automated way.

Please feel free to leave any questions below regarding the article or your specific problem.