Making rigorous predictions is difficult. Part of the reason is that the key factors contributing to an event are rarely all recorded; the unrecorded ones are known as “latent” factors. In this series we will consider techniques that can be used to identify latent factors.
This article introduces a problem affected by latent factors, obtained by revisiting the problem discussed in the previous series. Subsequent posts will look at techniques that one might use to tackle it.
Recap & motivation
In the previous series we were tasked with predicting the future series of payments on a number of claims that had been reported to an insurer. To help us with this we had received a dataset containing the timing and amount of past payments on these claims.
Our original analysis started from the following features of the payment process:
- An open claim may or may not give rise to a payment
- Payment amounts vary
- All claims eventually close, after which no further payments are made
For the purpose of this article, however, let’s also add:
- Latent factors will cause some of the differences observed between claims
This observation is obvious when you consider, for example, that for different claims: the claimant is different; the lawyers and claim professionals involved are different; and the nature of the loss to the insured is different.
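To make the four features above concrete, here is a minimal simulation sketch. It is not the model from the previous series: the per-claim payment probability `pay_prob`, the closure probability `close_prob`, the lognormal payment amounts and all parameter ranges are assumptions I have chosen purely for illustration. The point is that the latent quantities are drawn for each claim but never appear in the output, which contains only cumulative payments.

```python
import random


def simulate_claim(pay_prob, close_prob, n_quarters=20, rng=None):
    """Return the cumulative amount paid at the end of each quarter for one claim."""
    rng = rng or random.Random()
    cumulative, total, is_open = [], 0.0, True
    for _ in range(n_quarters):
        if is_open:
            # An open claim may or may not give rise to a payment.
            if rng.random() < pay_prob:
                # Payment amounts vary (lognormal is an illustrative assumption).
                total += rng.lognormvariate(8.0, 1.0)
            # A claim may close in any quarter; after that nothing more is paid.
            if rng.random() < close_prob:
                is_open = False
        cumulative.append(total)
    return cumulative


def simulate_portfolio(n_claims=100, seed=0):
    """Simulate cumulative payment series for a portfolio of claims."""
    rng = random.Random(seed)
    claims = []
    for _ in range(n_claims):
        # Latent factors: drawn per claim but never written to the dataset.
        pay_prob = rng.uniform(0.1, 0.7)    # assumed range
        close_prob = rng.uniform(0.1, 0.5)  # assumed range
        claims.append(simulate_claim(pay_prob, close_prob, rng=rng))
    return claims


claims = simulate_portfolio()
print(len(claims), len(claims[0]))
```

Note that within a finite horizon a simulated claim may still be open at the final quarter; closure is only certain in the long run.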
Incorporating this sort of information into our analysis may seem daunting. So instead of immediately jumping into the technical detail, let’s first convince ourselves that we can learn something about latent factors from the available data.
Data and an intuitive analysis
The graph below contains the same type of data considered in the previous series. There are one hundred lines plotted and each line represents the cumulative amount paid on an insurance claim following notification to the insurer. The cumulative amounts are shown at the end of quarterly time intervals.
I have highlighted two claims to illustrate how claim payment series can differ. In particular, payments are made frequently on some claims and not on others (e.g. the blue claim has had payments in 60% of periods to date, while the red claim has had payments in only 20%). In addition, the vast majority of claims have no payments beyond the tenth quarter (the highlighted claims being the ones with the latest payments).
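The two statistics quoted above (the share of periods with a payment and the quarter of the latest payment) can be read straight off a cumulative series. The sketch below shows one way to do so; the function name `payment_summary` and the example figures are hypothetical, and the series is assumed to be a non-empty list with one cumulative value per quarter.

```python
def payment_summary(cumulative):
    """Summarise one claim's cumulative paid series (one value per quarter)."""
    cumulative = list(cumulative)
    # Incremental payment in each quarter; the first quarter starts from zero.
    previous = [0.0] + cumulative[:-1]
    increments = [curr - prev for prev, curr in zip(previous, cumulative)]
    # Quarters (numbered from 1) in which a payment was made.
    paid_quarters = [q for q, inc in enumerate(increments, start=1) if inc > 0]
    return {
        "payment_rate": len(paid_quarters) / len(cumulative),
        "last_payment_quarter": paid_quarters[-1] if paid_quarters else None,
    }


# Made-up series: payments in quarters 2 and 5 out of 5 quarters observed.
example = [0.0, 500.0, 500.0, 500.0, 1200.0]
print(payment_summary(example))
# {'payment_rate': 0.4, 'last_payment_quarter': 5}
```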
The question we are interested in answering is, “to what extent is the variation in the payment series driven by latent factors?”
Standard analytical techniques rely on us being able to identify all of the factors that result in systematic variation between claims. The combination of factors is called the “state” of the claim. Being able to identify the state allows us to see to what extent the payment series varies between claims with similar or different states.
For example, if we saw that a disproportionate number of claims with late payments were associated with claimants that had suffered injuries, we might conclude that some of the variation in the payment series was driven by this factor.
Unfortunately, by definition there is no data on latent factors, so standard techniques cannot be applied directly. Nevertheless, given that there are latent factors, we can draw some conclusions about the state of a claim based on its effects on the observed data.
For example, because the latest payments on the red and blue claims are so much later than those on most of the other claims, it seems likely that there are latent factors causing these claims to remain open for longer than the others. In addition, as the blue claim has had payments in a much higher proportion of periods than the red claim, it seems likely that there are latent factors causing the blue claim to have more payments.
In other words, seeing that a claim behaves differently relative to other claims is enough to tell us that there is something different about its state.
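The "relative effect" reasoning can be expressed as a simple comparison against the rest of the portfolio. The sketch below is only an illustration of that intuition, not one of the techniques covered later in the series: the helper `share_below` and the last-payment quarters are made up, and a high share suggests, rather than proves, that a claim's latent state is unusual.

```python
def share_below(value, others):
    """Fraction of the values in `others` that are strictly less than `value`."""
    return sum(1 for x in others if x < value) / len(others)


# Hypothetical last-payment quarters for ten claims (made-up numbers).
last_payment_quarters = [3, 4, 4, 5, 6, 6, 7, 8, 9, 16]

# The claim last paid in quarter 16 is later than 90% of the portfolio.
print(share_below(16, last_payment_quarters))  # 0.9
```

The same comparison could be applied to the share of periods with a payment, which is the statistic that separates the blue claim from the red one.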
Hopefully, you find these conclusions relatively intuitive. If not, please leave a question in the comments below and I will get back to you.
Now that we can see it is possible to draw conclusions on the latent state of a claim, all that remains is to understand the techniques that can be used to automate this process. These techniques will be the subject of subsequent articles.
In this article we introduced a problem involving latent factors. We then inspected a dataset to convince ourselves that we could learn something about latent factors from the available data.
In the next article we will begin to look at techniques that can be used to quantify the latent effects.