Gaussian Process regression

Among other things, nonparametric Bayesian methods can simplify predictive modelling projects. Here I’ll give a brief introduction to one of these methods, Gaussian Process regression, and show how it can help us make inferences about complicated trends in a quick and consistent way.

Introduction

Just as a coin flip can be thought of as a random variable, so can a Gaussian Process. However, while a coin flip produces scalar-valued outcomes (tails = 0, heads = 1), a stochastic process generates whole functions as outcomes.

In the graph below you can see a number of univariate random functions. Each line is one function and represents one outcome of the Gaussian Process.

[Figure: several univariate random functions drawn from a Gaussian Process, annotated with the mean function m and covariance function k used to generate them]

As you can see, the outcomes of a Gaussian Process are quite varied: the light blue line is pretty straight; the blue, purple and teal lines have one inflection point; and the gold, red, and green lines have two inflection points.

What gives the Gaussian Process its name is that any collection of function values (e.g. f(1), f(5.5), f(-2.2)) follows a multivariate Gaussian distribution. Furthermore, just as the Gaussian distribution is parameterised by a mean and (co)variance, a Gaussian process is parameterised by a mean function and covariance function.

The mean function represents the long-run average outcome of the random functions, while the covariance function determines how points of the function will diverge from the mean and whether the divergence will be of the same degree and direction at different points.

The plot above has been annotated with the form of the mean and covariance functions (m and k) used to generate the lines. In particular, the mean is the zero function and the covariance function decays from one, when the two points are equal, to zero as the points become more distant.

These choices result in smooth functions, centred around zero, where nearby points on the x-axis have similar values and distant points are relatively independent. Furthermore, with this covariance function almost any smooth function is possible and the propensity for certain types of functions to occur can be changed by varying the covariance function’s parameters.
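
As a rough sketch of how such sample functions can be drawn in practice, the Python snippet below evaluates a zero mean function and a squared-exponential covariance on a grid of inputs and samples from the resulting multivariate Gaussian. The kernel and its parameters here are illustrative assumptions, not necessarily the exact settings behind the plot above.

    import numpy as np

    def squared_exponential(x1, x2, variance=1.0, lengthscale=1.0):
        """Covariance that equals `variance` when x1 == x2 and decays
        towards zero as the points move apart (illustrative kernel)."""
        return variance * np.exp(-0.5 * ((x1 - x2) / lengthscale) ** 2)

    # Grid of inputs at which the random functions are evaluated.
    x = np.linspace(-5.0, 5.0, 200)

    # Zero mean function and the covariance matrix over all pairs of inputs.
    mean = np.zeros_like(x)
    cov = squared_exponential(x[:, None], x[None, :])

    # Each draw from this multivariate Gaussian is one sample function; the
    # small jitter keeps the covariance matrix numerically positive definite.
    rng = np.random.default_rng(1)
    samples = rng.multivariate_normal(mean, cov + 1e-9 * np.eye(x.size), size=7)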

A key goal of regression is to learn something about an unknown function underlying a set of noisy observations. For example, in simple linear regression we investigate the straight line functions that best map values of the covariates to noisy observations of the response.

In Bayesian analysis, simple linear regression proceeds by placing a prior on the parameters of the linear regression model (the intercept, slope and noise parameters). The posterior distribution of the parameters, conditional on the data, is then used to determine which straight lines could have plausibly generated the data.

As you saw in the plot above, a Gaussian Process can be used as a prior on a much larger set of functions. Furthermore, Gaussian Processes are convenient priors because, with Gaussian distributed noise, the posterior distribution is also a Gaussian Process whose mean and covariance functions are relatively easy to determine.
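
To make the last point concrete: with a zero mean function, noisy observations y at inputs X, noise variance \sigma^2 and covariance function K evaluated at the relevant pairs of inputs, the standard results give the posterior mean and covariance at new inputs X_* in closed form:

     \begin{gather*} m_{post}(X_*) = K(X_*, X)\left[K(X, X) + \sigma^2 I\right]^{-1} y \\ k_{post}(X_*, X_*) = K(X_*, X_*) - K(X_*, X)\left[K(X, X) + \sigma^2 I\right]^{-1} K(X, X_*) \end{gather*}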

In the example below I show how these facts allow you to produce fast and convincing analyses of some weather data.

Example

Here I’ve used a Gaussian Process regression model to make predictions about future temperatures.

Historical temperatures were sourced from the Met Office’s historical dataset for the Oxford weather station. A subset of the data and the Gaussian Process forecast are summarised in the chart below.

[Figure: historical month-end temperatures at the Oxford station (black points), the posterior mean forecast (black line), and one and two standard deviation prediction intervals (dark and light blue shading)]

Historical month-end measurements are displayed as black points, with the last reading being at the end of August 2017. After this time the smooth black line represents the mean posterior predicted outcome. Dark and light blue shading has been added to delimit one and two standard deviation prediction intervals.

In this case I’ve again used the zero function for the mean, but for the covariance I’ve used a periodic function.

This particular covariance function has three free parameters: a, the degree to which the underlying trend deviates from the mean function; b, the duration between points that are highly correlated; and c, the degree to which nearby points are independent or not. Assuming independent and identically distributed Gaussian observation noise adds one further parameter: the noise variance.
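
For concreteness, one common periodic covariance function that matches this description of a, b and c (though not necessarily the exact form I used) is

     \begin{gather*} k(t, t') = a^2 \exp\left( -\frac{2 \sin^2\left( \pi (t - t') / b \right)}{c^2} \right) \end{gather*}

where a scales the size of the departures from the mean, b is the period and c governs how quickly the correlation between nearby points decays.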

Depending on the dataset in question, parameters can be inferred from the data (e.g. using MCMC techniques) or set based on prior beliefs about the form of the trend. Here I have set “b” equal to 365 days and I have used maximum likelihood estimation to choose values for the other parameters.
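
To reproduce something along these lines, the sketch below uses scikit-learn’s Gaussian Process implementation with a periodic (exp-sine-squared) kernel, the period fixed at 365 days, and the remaining kernel and noise parameters chosen by maximising the (marginal) likelihood. The data are a synthetic placeholder standing in for the Met Office series, and this is not the exact code behind the chart above.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import ConstantKernel, ExpSineSquared, WhiteKernel

    # Placeholder data standing in for the Oxford station series:
    # days since the start of the record and month-end temperatures.
    rng = np.random.default_rng(0)
    days = np.arange(0.0, 3650.0, 30.0).reshape(-1, 1)
    temps = 10 + 7 * np.sin(2 * np.pi * days.ravel() / 365) + rng.normal(0, 1, days.shape[0])

    # Periodic kernel with the period fixed at 365 days, plus a white noise
    # term for the iid Gaussian observation noise; the amplitude, length scale
    # and noise level are fitted by maximum (marginal) likelihood.
    kernel = (ConstantKernel(1.0)
              * ExpSineSquared(length_scale=1.0, periodicity=365.0, periodicity_bounds="fixed")
              + WhiteKernel(noise_level=1.0))

    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(days, temps)

    # Posterior mean and standard deviation for the next two years, from
    # which one and two standard deviation prediction intervals follow.
    future = np.arange(3650.0, 4380.0, 30.0).reshape(-1, 1)
    mean, std = gp.predict(future, return_std=True)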

The key takeaway, however, is that the inferences made by Gaussian Process regression are reasonable (the periodic pattern observable in the historical data is reflected in the modelled predictions) and that the same approach could easily be applied to other datasets.

Final Word

In this article I gave you a taste of nonparametric Bayesian methods by introducing Gaussian Process regression. In particular, I described the basic premise behind Gaussian Process regression and then showed you how it might be applied to a dataset with a periodic trend.

If you have any questions about Gaussian Process regression, please do not hesitate to leave a comment below. In future articles I’ll discuss other nonparametric Bayesian methods and will highlight some of the benefits and pitfalls associated with them.

More Incomplete Data: Latent Factors

Making predictions in a rigorous way is difficult. Part of the reason for this is that it is rare for all of the key factors contributing to events to be recorded; the unrecorded factors are known as “latent” factors. In this series we will consider techniques that can be used to identify latent factors.

This article is devoted to introducing a problem afflicted by latent factors; it is a variation on the problem discussed in the previous series. Then, in subsequent posts, we will look at techniques one might use to identify and quantify the latent factors.

Re-cap & motivation

In the previous series we were tasked with predicting the future series of payments on a number of claims that had been reported to an insurer. To help us with this we had received a dataset containing the timing and amount of past payments on these claims.

Our original analysis started from the following features of the payment process:

  1. An open claim may or may not give rise to a payment
  2. Payment amounts vary
  3. All claims eventually close, after which no further payments are made

For the purpose of this article, however, let’s also add:

  4. Latent factors will cause some of the differences observed between claims

This observation is obvious when you consider, for example, that for different claims: the claimant is different; the lawyers and claim professionals involved are different; and the nature of the loss to the insured is different.

Incorporating this sort of information into our analysis may seem daunting. So instead of immediately jumping into the technical detail, let’s first convince ourselves that we can learn something about latent factors from the available data.

Data and an intuitive analysis

The graph below contains the same type of data considered in the previous series. There are one hundred lines plotted and each line represents the cumulative amount paid on an insurance claim following notification to the insurer. The cumulative amounts are shown at the end of quarterly time intervals.

[Figure: cumulative amounts paid by quarter since notification for one hundred claims, with two claims highlighted in red and blue]

I have highlighted two claims to illustrate how claim payment series can differ. In particular, payments are made frequently on some claims and not on others (e.g. the blue claim has had payments in 60% of periods to date while the red claim has only made payments in 20%). In addition, by far the majority of claims have no payments beyond the tenth quarter (the highlighted claims being the ones with the latest payments).
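
As a small illustration of how the payment-frequency figures above might be computed, the snippet below assumes the data are held as one row per claim per quarter; the column names are hypothetical.

    import pandas as pd

    # Hypothetical layout: one row per claim per quarter, with the
    # incremental amount paid in that quarter (zero if no payment).
    payments = pd.DataFrame({
        "claim_id": [1, 1, 1, 2, 2, 2],
        "quarter":  [1, 2, 3, 1, 2, 3],
        "paid":     [0.0, 250.0, 100.0, 0.0, 0.0, 500.0],
    })

    # Proportion of quarters to date in which each claim had a payment,
    # i.e. the 60% / 20% style figures quoted above.
    payment_rate = (payments.assign(has_payment=payments["paid"] > 0)
                            .groupby("claim_id")["has_payment"]
                            .mean())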

The question we are interested in answering is, “to what extent is the variation in the payment series driven by latent factors?”

Standard analytical techniques rely on us being able to identify all of the factors that result in systematic variation between claims. The combination of factors is called the “state” of the claim. Being able to identify the state allows us to see to what extent the payment series varies between claims with similar or different states.

For example, if we saw that a disproportionate number of claims with late payments were associated with claimants that had suffered injuries, we might conclude that some of the variation in the payment series was driven by this factor.

Unfortunately, there is no data on latent factors (by definition) and so standard techniques are not helpful. Nevertheless, given that there are latent factors, we can draw some conclusions about the state of a claim from the effects that state has on the observed data.

For example, because the latest payments on the red and blue claims are so much later than most of the other claims, it seems likely that there are latent factors that are causing these claims to remain open for longer than the others. In addition, as the blue claim has such a high rate of payment in each period compared to the red claim, it seems likely that there are latent factors causing the blue claim to have more payments.

In other words, observing a relative difference in a claim’s behaviour is enough to tell us that there is something different about its state.

Hopefully, you find these conclusions relatively intuitive. If not, please leave a question in the comments below and I will get back to you.

Now that we can see it is possible to draw conclusions on the latent state of a claim, all that remains is to understand the techniques that can be used to automate this process. These techniques will be the subject of subsequent articles.

Final word

In this article we introduced a problem involving latent factors. We then interrogated a dataset to prove to ourselves that we could learn something about latent factors from the available data.

In the next article we will begin to look at techniques that can be used to quantify the latent effects.

Modelling in the case of Incomplete Data (3)

This article is the third in a series in which I discuss techniques that can be used when presented with incomplete data. The focus of this article is on the use of the likelihood for prediction. In particular, we will discuss the Bayesian approach to generating predictive distributions.

Re-cap

In the first article of this series we discussed a problem involving incomplete data. The problem was to estimate the amount of future payments on all claims reported to an insurer given data on historical payments. Below is the plot of historical payments that we discussed.

[Figure: historical cumulative claim payments by quarter, as discussed in the first article]

As a first step towards constructing an estimate of future payments, we looked at a model that could be used to describe the payment process. The model consisted of two elements, a state transition element and a payment amount element.

Having built a model, in the second article we considered the problem of assessing the goodness of fit of different parameter values. We found that an appealing measure of goodness of fit was the likelihood. Furthermore, of relevance to our problem, we saw that we could calculate the likelihood even for problems involving incomplete data.

Here we will discuss how Bayesian methods leverage the likelihood in order to produce predictions. We will tackle this in two stages. First, we will look at the issue that is resolved by the Bayesian approach and how the likelihood contributes to the solution. Second, we will discuss why using the likelihood in this way makes sense from a more technical point of view. Throughout the article we will use our claim payment problem as a case study.

Please note that this article is necessarily abbreviated. It would be more complete if we discussed the simulation methods I used. However, simulation is not the main theme of this article, hence the omission.

If you haven’t already, you may find it helpful to read the first two articles in this series. In particular, since I will refer to them later, you will want to know that our model has three parameters: ‘p’ and ‘q’ parameterise the state transition model, while ‘mu’ parameterises the payment amount model (we assume claim payments follow an Exponential distribution).

The Problem & A Solution

The output of a Bayesian analysis is an exact predictive distribution of what we are interested in given what we know. In our case, this means the approach produces a probability distribution of future claim payments given what we have learned from historical payments.

As a first step to understanding the approach, it is useful to consider how we would estimate the probability of a future outcome if we knew what the “correct” values for the parameters in our model were.

For instance, how would we estimate the probability of future payments exceeding $10m if we knew the correct values for ‘p’, ‘q’ and ‘mu’?

To answer this question we could use simulation.

By using the “correct” parameter values, we could simulate future outcomes from our model and then estimate the required probability by calculating the proportion of simulations in which future payments exceeded $10m.
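
As a sketch of what this might look like in code, here is an illustrative reading of the model in which, each quarter, an open claim makes a payment with probability p and then closes with probability q, with payment amounts Exponential with mean mu. The exact transition structure of the original model may differ, and the parameter values below are hypothetical.

    import numpy as np

    rng = np.random.default_rng(42)

    def simulate_future_payments(p, q, mu, n_open_claims=100, n_quarters=140):
        """One simulated total of future payments across all open claims.
        Illustrative dynamics: each quarter an open claim makes a payment
        with probability p and then closes with probability q; payment
        amounts are Exponential with mean mu."""
        total = 0.0
        for _ in range(n_open_claims):
            for _ in range(n_quarters):
                if rng.random() < p:
                    total += rng.exponential(mu)
                if rng.random() < q:
                    break  # the claim closes and makes no further payments
        return total

    # With the "correct" parameter values (hypothetical numbers here), the
    # probability of future payments exceeding $10m is estimated by the
    # proportion of simulations above that threshold.
    sims = np.array([simulate_future_payments(p=0.3, q=0.1, mu=25_000)
                     for _ in range(1_000)])
    prob_exceeds_10m = (sims > 10_000_000).mean()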

The uncertainty that this approach captures is the uncertainty in the statistical process. Let’s refer to this as “process uncertainty”.

Unfortunately, we don’t actually know the correct values for the parameters. Therefore, not only do we have to allow for process uncertainty but we also have to allow for the uncertainty regarding the parameters.

From a practical perspective, as with process uncertainty, Bayesian methods allow for “parameter uncertainty” by simulating different values for the parameters. The distribution we simulate from is known as the “posterior” distribution.

In other words, with our Bayesian hat on, we can augment our previous simulation approach (which involved fixed “correct” parameters) by simulating parameter values from the posterior before using these to simulate future outcomes from our model.

As you can probably tell, the trick to this approach is in constructing a posterior distribution that makes sense. It needs to capture our general ignorance about the parameter values as well as what we have learned from the data.

One relatively coherent approach to doing this is to use the likelihood.

Below is a plot of the likelihood for our claim transition model, based on the data shown in the plot above.

[Figure: likelihood surface for the claim transition parameters, with high-likelihood regions in red and low-likelihood regions in blue]

As we discussed in the previous article, the likelihood measures the goodness of fit of different parameter values. Those with a high value (the red region) represent values that are a relatively good fit and those with a low value (the blue region) are those with a poor fit.

With a leap of faith, we might wonder whether we could use the likelihood to represent the posterior. For instance, if the total likelihood in the red region is twenty times higher than that of the blue region, could we reasonably say that the probability of the parameter falling in the red region is twenty times that of it falling in the blue region?

Although some may wince at this leap, it so happens that we can (more on this in the next section).

In particular, the normalised likelihood (the likelihood scaled so that it sums to one) gives us a posterior distribution of the parameters. With this we are able to generate predictions by using the simulation approach mentioned above.
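
In code, and continuing with the illustrative simulator sketched earlier, the recipe might look as follows. The log_likelihood function is a placeholder standing in for the model- and data-specific likelihood discussed in the previous article; the rest normalises the likelihood over a grid, simulates parameter values from it and pushes each draw through the payment model.

    import numpy as np

    rng = np.random.default_rng(7)

    def log_likelihood(p, q, mu):
        """Placeholder for the log-likelihood of the observed payment data
        under (p, q, mu), as derived in the previous article; the shape used
        here is purely illustrative so that the sketch runs end to end."""
        return -((p - 0.3) ** 2 + (q - 0.1) ** 2
                 + ((mu - 25_000) / 25_000) ** 2) / 0.02

    # Grid of candidate parameter values (ranges are illustrative).
    grid = np.array([(p, q, mu)
                     for p in np.linspace(0.05, 0.95, 20)
                     for q in np.linspace(0.05, 0.95, 20)
                     for mu in np.linspace(5_000, 50_000, 20)])

    # Normalised likelihood over the grid: the posterior under a flat prior.
    log_lik = np.array([log_likelihood(p, q, mu) for p, q, mu in grid])
    posterior = np.exp(log_lik - log_lik.max())
    posterior /= posterior.sum()

    # Simulate parameter values from the posterior (parameter uncertainty),
    # then simulate future payments for each draw (process uncertainty).
    draws = rng.choice(len(grid), size=1_000, p=posterior)
    sims = np.array([simulate_future_payments(*grid[i]) for i in draws])

    median, lower, upper = np.percentile(sims, [50, 2.5, 97.5])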

The plot below represents the result of applying this approach to our problem.

[Figure: fan chart of historical cumulative payments and the predictive distribution of future cumulative payments by quarter]

The plot displays historical payments as well as predictions. Historical cumulative payments up to the present time (0) are represented by the black line. The median predicted cumulative payment and prediction intervals are shown for each future quarter over the next thirty-five years (times 1 to 140).

From this we might say that we expect total future payments to be ~$10m. Furthermore, we could say that we think there is a 95% probability that total payments will fall between ~$8m and ~$12m.

Why the Solution makes sense

As you may have noticed, the critical stage in a Bayesian analysis is constructing the posterior distribution. In the previous section I raced past our choice of posterior; here I’ll quickly discuss why the approach makes sense.

In our case, the posterior distribution produces answers to questions like, “Given what you have seen, what do you believe the probability is that the true values for ‘p’, ‘q’, and ‘mu’ are equal to 0.1, 0.5, and $10k?”. In general, through probabilities, the posterior represents our beliefs regarding the true parameter values.

Using symbols, the posterior probability that the true values for the parameters ( \Lambda ) are equal to some value ( \lambda ) given the data ( D ) is denoted  P(\Lambda = \lambda | D) .

This definition may remind you of the likelihood. Indeed, in the previous article we introduced the likelihood as the probability of the model generating data equal to the observed data, denoted P(D | \Lambda = \lambda).

Given this relationship between the likelihood and the posterior, we can make use of Bayes’ theorem to write:

     \begin{gather*} P(\Lambda = \lambda | D) \propto P(D|\Lambda = \lambda) \cdot P(\Lambda = \lambda)  \end{gather*}

In words, the posterior is proportional to the product of the likelihood and some a priori probability for the parameters. The a priori probability is referred to as the “prior”.

With this formula we can see the assumption I made in the previous section: by taking the posterior to be proportional to the likelihood, I implicitly assumed that the prior was constant for all values of the parameters.

Please note that other prior assumptions are valid. However, we can be satisfied that the approach we arrived at has some technical merit.

For those who will apply Bayesian methods to their problems, it is important to recognise that any choice of prior is ultimately subjective. This is not necessarily a weakness but if you need to maintain a degree of objectivity, it is worth taking a look at articles concerning “Objective Bayes” and testing a variety of reasonable priors.

Final Word

The Bayesian approach is very powerful. Given some basic building blocks it enables us to generate predictive distributions of future outcomes. With these distributions we can attempt to answer questions like the one presented in the first article of this series.

Before becoming completely smitten by Bayesian methods, however, it is important to recognise that this power does not come for free. In particular, employing Bayesian methods involves a subjective selection of the prior distribution.

Given that we now have a method for solving the problem posed in the first article, I will let this series sit for a while. However, there are many other approaches we have not discussed and in future I may revisit this series to discuss them.

If you have any questions or requests for future topics please leave them in the comments section.