More Incomplete Data: Modelling Latent Factors

This article is the second in a series in which we discuss techniques that can be used to analyse latent factors. In the last article we discussed how we could analyse latent factors in a qualitative way. Here, we discuss one way in which we could make our qualitative approach amenable to statistical analysis.

The techniques we discuss are broadly applicable but here we will focus on the specific problem introduced in the last article. It may be obvious how the techniques can be applied to your problem but if not, we will discuss them more generally in subsequent posts.

Re-cap

The problem introduced in the previous article involved predicting the future payments on a number of claims that had been reported to an insurer. This problem had already been considered in a previous series but we complicated the matter by acknowledging that each claim was likely to behave differently due to latent factors.

Having introduced the problem, we then discussed whether it was even possible to analyse latent factors. This was done in light of the fact that traditional analytical techniques are not particularly useful when the data does not contain the exact state of each claim, as is the case in our problem.

Based on our intuition, however, it seemed clear that some sort of analysis was possible. In particular, it seemed likely that claims with “peculiar” payment series were probably predisposed to behave in an odd way. Therefore, these claims probably had a different latent state to the other claims.

In this article we will look at how we can place our intuitive analysis on a more rigorous footing. The approach considered in this article involves extending a “simple” model. For this problem we will extend the state transition model described in the second article of the previous series.

A simple model

The model we discussed in the previous series was based on the following assumptions:

  • Between periods, each claim transitions between three states: open with a payment; open with no payment; and closed with no payment
  • An open claim closes in the next period with probability ‘p’; if it stays open, it gives rise to a payment with probability ‘q’
  • Once a claim closes it remains closed

These assumptions are represented in the diagram below.

[Figure: state transition model]
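The assumptions above can also be written as a transition matrix. The sketch below is only illustrative: the function names, the state ordering, the example values of p and q, and the choice to start each claim as open with no payment are my assumptions rather than part of the original model description.

```python
import numpy as np

def transition_matrix(p, q):
    """Rows/columns: 0 = open with payment, 1 = open with no payment, 2 = closed."""
    # From either open state: close with probability p, otherwise pay with probability q.
    open_row = [(1 - p) * q, (1 - p) * (1 - q), p]
    # A closed claim stays closed.
    return np.array([open_row, open_row, [0.0, 0.0, 1.0]])

def simulate_claim(p, q, n_periods, rng):
    """Simulate the state of one claim in each period."""
    matrix = transition_matrix(p, q)
    state = 1  # assumed starting point: open, no payment yet
    states = []
    for _ in range(n_periods):
        state = int(rng.choice(3, p=matrix[state]))
        states.append(state)
    return states

rng = np.random.default_rng(seed=0)
print(transition_matrix(0.2, 0.5))
print(simulate_claim(0.2, 0.5, 12, rng))
```

Every claim simulated this way shares the same p and q, so any difference between two simulated claims comes from chance alone.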

Unfortunately, this model has limitations.

In particular, as all claims are subject to the same parameters, this model assumes that chance is the only source of variation between claims. In our case this is inappropriate, as additional variation will arise because each claim has a different latent predisposition to close or to give rise to a payment in a period.

However, by extending our simple model we can resolve this issue.

Extending the model

The simple model assumes that each claim is governed by the same parameters. Instead, to allow for the different predispositions of each claim, let’s assume that:

  1. There is a larger population of claims from which our claims have been sampled.
  2. For each claim there is a simple model that adequately describes the payment process.
  3. Each claim has its own specific parameters and therefore, its own model.
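To make these three assumptions concrete, here is a minimal simulation sketch. The Beta distributions used for the population are purely hypothetical (we do not know the true population), and the variable names, the number of claims and the number of periods are my own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n_claims, n_periods = 100, 20

# Assumption 1: a (hypothetical) population from which claims are sampled.
# The Beta shapes are illustrative only, centred near a 20% closure
# probability and a 50% payment probability.
# Assumption 3: each claim draws its own parameters.
p = rng.beta(4, 16, size=n_claims)   # per-claim closure probability
q = rng.beta(10, 10, size=n_claims)  # per-claim payment probability, given open

# Assumption 2: each claim follows the simple model with its own p and q.
closes = rng.random((n_claims, n_periods)) < p[:, None]
# A claim is open in a period if it has not closed in that period or earlier.
is_open = np.cumsum(closes, axis=1) == 0
pays = is_open & (rng.random((n_claims, n_periods)) < q[:, None])

print("periods open per claim:", is_open.sum(axis=1)[:5])
print("payments per claim:    ", pays.sum(axis=1)[:5])
```

Unlike the simple model, two claims simulated here can differ both by chance and because they were given different parameters.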

The graphs below attempt to illustrate what these assumptions mean.

[Figure: population plot (left) and historical payment series for each claim in our dataset (right)]

The left-hand graph depicts one possible population of claims. Each dot represents one claim, and the coordinates of each dot give the transition probabilities that apply to that claim. As you can see, each claim in the population is associated with a different set of parameters. In this case, the majority of claims have a closure probability of ~20% and a probability of payment, given that the claim remains open, of ~50%.

On the right-hand side I’ve simply copied the graph from the previous article, which shows the historical payment series on each claim in our dataset.

The same claims are highlighted in the right-hand graph as in the previous article. In addition, I have highlighted the two corresponding claims in the population plot. The colours show which claim in the population each highlighted claim in our dataset corresponds to and, most importantly, the parameters that govern its payment series.

Of course, the data on the left-hand side is entirely hypothetical. In reality, we don’t know how the dots in the left-hand graph are distributed. Even if we did, we wouldn’t be able to identify which claim in the left-hand graph each claim in our dataset relates to. However, the illustration does give us the machinery we need to formalise the intuitive analysis we performed in the previous article.

In particular, in the last article we said that:

“…because the latest payments on the red and blue claims are so much later than most of the other claims, it seems likely that there are latent factors that are causing these claims to remain open for longer than the others. In addition, as the blue claim has such a high rate of payment in each period compared to the red claim, it seems likely that there are latent factors causing the blue claim to have more payments.”

Based on the framework discussed here, this means that we believe the red and blue claims are more likely to correspond to claims in the population with a relatively low closure probability. Furthermore, we believe that the blue claim has a higher probability of payment than the red claim.

Therefore, while we can’t pinpoint the exact claim in the population that each of our claims corresponds to, we can identify regions in the population plot that are more likely to contain each of our claims.
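One way to picture these "more likely" regions is to ask how probable a claim's history would be under each candidate pair of parameters. The sketch below is a rough illustration under stated assumptions, not a finished method: it assumes both claims are still open, uses hypothetical histories (20 periods observed, with payments in 60% and 20% of them), and ignores how claims are distributed in the population.

```python
import numpy as np

def likelihood(p, q, n_open, n_paid):
    """Likelihood of a still-open claim: n_open periods survived, n_paid of them with a payment."""
    return (1 - p) ** n_open * q ** n_paid * (1 - q) ** (n_open - n_paid)

def most_likely(surface, p_grid, q_grid):
    """Return the (p, q) grid point at which the likelihood surface is highest."""
    i, j = np.unravel_index(np.argmax(surface), surface.shape)
    return p_grid[i, j], q_grid[i, j]

# Grid of candidate parameters covering the population plot.
values = np.linspace(0.01, 0.99, 99)
p_grid, q_grid = np.meshgrid(values, values, indexing="ij")

# Hypothetical histories for the two highlighted claims.
blue = likelihood(p_grid, q_grid, n_open=20, n_paid=12)
red = likelihood(p_grid, q_grid, n_open=20, n_paid=4)

print("blue:", most_likely(blue, p_grid, q_grid))
print("red: ", most_likely(red, p_grid, q_grid))
```

Both surfaces are highest where the closure probability is low, and the blue surface peaks at a higher payment probability than the red one, which matches the intuitive conclusions above.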

The challenge we have now is to perform this analysis quickly and consistently for every claim. We will look into this in the next article.

Final word

In this article we discussed how we could transform our intuitive approach to analysing latent factors into something that was amenable to statistical analysis. To do this we extended a simple model.

In the next article we will begin to discuss how we can analyse the data under our new model in an automated way.

Please feel free to leave any questions below regarding the article or your specific problem.

More Incomplete Data: Latent Factors

Making predictions in a rigorous way is difficult. Part of the reason is that it is rare for all of the key factors contributing to events to be recorded. The unrecorded factors are known as “latent” factors. In this series we will consider techniques that can be used to identify latent factors.

This article is devoted to introducing a problem afflicted by latent factors. The problem comes from revisiting the one discussed in the previous series. Then, in subsequent posts, we will look at techniques that one might use.

Re-cap & motivation

In the previous series we were tasked with predicting the future series of payments on a number of claims that had been reported to an insurer. To help us with this we had received a dataset containing the timing and amount of past payments on these claims.

The starting point of our original analysis was based on the following features of the payment process:

  1. An open claim may or may not give rise to a payment
  2. Payment amounts vary
  3. All claims eventually close, after which no further payments are made

For the purpose of this article, however, let’s also add:

  4. Latent factors will cause some of the differences observed between claims

This observation is obvious when you consider, for example, that for different claims: the claimant is different; the lawyers and claim professionals involved are different; and the nature of the loss to the insured is different.

Incorporating this sort of information into our analysis may seem daunting. So instead of immediately jumping into the technical detail, let’s first convince ourselves that we can learn something about latent factors from the available data.

Data and an intuitive analysis

The graph below contains the same type of data considered in the previous series. There are one-hundred lines plotted and each line represents the cumulative amount paid on an insurance claim following notification to the insurer. The cumulative amounts are shown at the end of quarterly time intervals.

[Figure: claim development, cumulative amount paid on each claim by quarter since notification]

I have highlighted two claims to illustrate how claim payment series can differ. In particular, payments are made frequently on some claims and not on others (e.g. the blue claim has had payments in 60% of periods to date while the red claim has had payments in only 20%). In addition, the vast majority of claims have no payments beyond the tenth quarter (the highlighted claims being the ones with the latest payments).
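The two summaries used here, the share of periods with a payment and the quarter of the latest payment, are easy to compute from a cumulative payment series. The snippet below is a small sketch: the cumulative amounts are made up for illustration and `payment_summary` is a hypothetical helper, not something taken from the dataset.

```python
import numpy as np

def payment_summary(cumulative):
    """Return (share of quarters with a payment, quarter of the latest payment)."""
    # Payments in each quarter are the increments of the cumulative series.
    increments = np.diff(cumulative, prepend=0.0)
    paid = increments > 0
    share = paid.mean()
    # Quarters are numbered from 1; None means no payment has been made yet.
    latest = int(np.flatnonzero(paid)[-1]) + 1 if paid.any() else None
    return share, latest

# Hypothetical cumulative paid amounts at the end of ten quarters for one claim.
cumulative = np.array([0.0, 100.0, 100.0, 250.0, 250.0, 250.0, 400.0, 400.0, 400.0, 400.0])

share, latest = payment_summary(cumulative)
print(f"payments in {share:.0%} of quarters, latest payment in quarter {latest}")
```

Applying a summary like this to every claim is a simple way to spot the claims whose payment series stand out from the rest.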

The question we are interested in answering is, “to what extent is the variation in the payment series driven by latent factors?”

Standard analytical techniques rely on us being able to identify all of the factors that result in systematic variation between claims. The combination of factors is called the “state” of the claim. Being able to identify the state allows us to see to what extent the payment series varies between claims with similar or different states.

For example, if we saw that a disproportionate number of claims with late payments were associated with claimants that had suffered injuries, we might conclude that some of the variation in the payment series was driven by this factor.

Unfortunately, there is no data on latent factors (by definition) and so standard techniques are not helpful. Nevertheless, given that there are latent factors, we can draw some conclusions about the state of the claim based on its effects on the observed data.

For example, because the latest payments on the red and blue claims are so much later than most of the other claims, it seems likely that there are latent factors that are causing these claims to remain open for longer than the others. In addition, as the blue claim has such a high rate of payment in each period compared to the red claim, it seems likely that there are latent factors causing the blue claim to have more payments.

In other words, seeing how a claim behaves relative to the others is enough to tell us that there is something different about its state.

Hopefully, you find these conclusions relatively intuitive. If not, please leave a question in the comments below and I will get back to you.

Now that we can see it is possible to draw conclusions on the latent state of a claim, all that remains is to understand the techniques that can be used to automate this process. These techniques will be the subject of subsequent articles.

Final word

In this article we introduced a problem involving latent factors. We then interrogated a dataset to prove to ourselves that we could learn something about latent factors from the available data.

In the next article we will begin to look at techniques that can be used to quantify the latent effects.