This article introduces a non-parametric, Bayesian regression method, Gaussian process regression, from derivation to visualisation with D3JS.
The Gaussian process (GP) is a very useful technique for regression, classification, optimisation, and more. It is a non-parametric Bayesian approach to modelling. This article builds up from the basic Gaussian distribution to its regression model, the Gaussian process regression (GPR) model. We will also implement this method to regress a 1-dimensional \(\sin\) function from some noisy sample data and visualise the result with d3js.
Refresh for another plot
We will first review some basics of Gaussian distributions, then introduce the GPR model and kernels, followed by the implementation of the model to regress a 1-dimensional \(\sin\) function. Finally, we will cover the derivation of the GPR model.
\[ \newcommand\cov{\text{Cov}} \newcommand\norm[1]{\lVert#1\rVert} \]
Let \(\mathbf{X} = (X_1, X_2, \dots, X_n)^T \sim \mathcal{N}(\mathbf{\mu}, \mathbf{\Sigma})\), where \(\mathbf{\mu}\) is the mean of \(\mathbf{X}\) and \(\mathbf{\Sigma}\) is the covariance matrix of \(\mathbf{X}\) such that
\[ \mathbf{\mathbf{\mu}} = \begin{pmatrix}\mathbf{\mu}_1 \\ \mathbf{\mu}_2 \\ \vdots \\ \mathbf{\mu}_n \end{pmatrix} \]
and
\[ \begin{align*} \mathbf{\Sigma} &= \mathbb{E}\left[ \left( \mathbf{X} - \mathbf{\mathbf{\mu}} \right) \left( \mathbf{X} - \mathbf{\mathbf{\mu}} \right)^T \right] \\[5pt] &= \begin{pmatrix} \cov(X_1, X_1) & \cov(X_1, X_2) & \cdots & \cov(X_1, X_n) \\ \cov(X_2, X_1) & \cov(X_2, X_2) & \cdots & \cov(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ \cov(X_n, X_1) & \cov(X_n, X_2) & \cdots & \cov(X_n, X_n) \end{pmatrix}. \end{align*} \]
The density function of \(\mathbf{X}\) is given by
\[ f(\mathbf{x}) = \frac{1}{(2 \pi)^{n/2} \det{(\mathbf{\Sigma})^{1/2}}} \exp\left( -\frac{1}{2} (\mathbf{x} - \mathbf{\mathbf{\mu}})^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \mathbf{\mathbf{\mu}}) \right). \tag{1}\label{eq:1} \]
Now suppose \(\mathbf{Z} = (\mathbf{X}, \mathbf{Y})^T\) is jointly normally distributed with
\[ Z = \begin{pmatrix} \mathbf{X} \\ \mathbf{Y} \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} \mathbf{\mathbf{\mu}}_\mathbf{X} \\ \mathbf{\mathbf{\mu}}_\mathbf{Y} \end{pmatrix}, \begin{pmatrix} \mathbf{\Sigma}_{\mathbf{X}\mathbf{X}} & \mathbf{\Sigma}_{\mathbf{X}\mathbf{Y}} \\ \mathbf{\Sigma}_{\mathbf{Y}\mathbf{X}} & \mathbf{\Sigma}_{\mathbf{Y}\mathbf{Y}} \end{pmatrix} \right) \]
Then the marginals satisfy \(\mathbf{X} \sim \mathcal{N}(\mathbf{\mu}_\mathbf{X}, \mathbf{\Sigma}_{\mathbf{X}\mathbf{X}})\) and \(\mathbf{Y} \sim \mathcal{N}(\mathbf{\mu}_\mathbf{Y}, \mathbf{\Sigma}_{\mathbf{Y}\mathbf{Y}})\), with \(\mathbf{\Sigma}_{\mathbf{Y}\mathbf{X}} = \mathbf{\Sigma}_{\mathbf{X}\mathbf{Y}}^T\). Moreover, the conditional distribution is
\[\mathbf{Y} | \mathbf{X} \sim \mathcal{N} \left( \mathbf{\mathbf{\mu}}_{\mathbf{Y}} + \mathbf{\Sigma}_{\mathbf{Y}\mathbf{X}} \mathbf{\Sigma}_{\mathbf{X}\mathbf{X}}^{-1} (\mathbf{X} - \mathbf{\mathbf{\mu}}_{\mathbf{X}}), \; \mathbf{\Sigma}_{\mathbf{Y}\mathbf{Y}} - \mathbf{\Sigma}_{\mathbf{Y}\mathbf{X}} \mathbf{\Sigma}_{\mathbf{X}\mathbf{X}}^{-1} \mathbf{\Sigma}_{\mathbf{X}\mathbf{Y}} \right)\]
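To make the conditional formula concrete, here is a minimal check with mathjs (the library used for the implementation later) on a two-dimensional Gaussian split into two scalar blocks; all numbers are made up purely for illustration.
// Minimal sketch (illustrative numbers): conditional mean and covariance of a
// bivariate Gaussian partitioned into two scalar blocks X and Y.
const muX = math.matrix([0]);
const muY = math.matrix([1]);
const SigmaXX = math.matrix([[2]]);
const SigmaYY = math.matrix([[1]]);
const SigmaYX = math.matrix([[0.5]]); // = SigmaXY^T
const SigmaXY = math.transpose(SigmaYX);
const xObserved = math.matrix([1.5]); // an observed value of X
// mu_{Y|X} = mu_Y + Sigma_YX Sigma_XX^{-1} (x - mu_X)
const condMu = math.add(
  muY,
  math.multiply(SigmaYX, math.inv(SigmaXX), math.subtract(xObserved, muX))
);
// Sigma_{Y|X} = Sigma_YY - Sigma_YX Sigma_XX^{-1} Sigma_XY
const condSigma = math.subtract(
  SigmaYY,
  math.multiply(SigmaYX, math.inv(SigmaXX), SigmaXY)
);
console.log(condMu.valueOf());    // [ 1.375 ]
console.log(condSigma.valueOf()); // [ [ 0.875 ] ]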
Gaussian Process Regression (GPR) is a non-parametric, Bayesian approach to inference. Unlike GLMs, where we explicitly fit a specific form (say \(y = mx + c\)) to a set of data, GPR relies on a similarity matrix (or kernel) within and between the training inputs (\(X\)) and testing inputs (\(X_*\)), along with the training outputs (\(\mathbf{y}\)), to infer the testing outputs (\(\mathbf{y}_*\)).
Let \(X\) be some observed inputs and \(\mathbf{y} = \mathbf{f}(X)\) the corresponding observed outputs of a function \(f\). Our goal is to obtain predictions \(\mathbf{f}_* = \mathbf{f}(X_*)\) at some testing points \(X_*\). The GPR model assumes the outputs of the function are jointly Gaussian distributed:
\[ \begin{pmatrix} \mathbf{y} \\ \mathbf{f}_* \end{pmatrix} \sim \mathcal{N} \left( \mathbf{0}, \begin{pmatrix} K + \sigma_n^2I & K_* \\ K_*^T & K_{**} \end{pmatrix} \right), \]
where \(K = k(X, X)\), \(K_* = k(X, X_*)\), \(K_{**} = k(X_*, X_*)\), \(k\) is a kernel function, and \(\sigma_n\) is the standard deviation of the noise affecting the observed values \(\mathbf{y}\). The posterior distribution is given by \[ \begin{align*} \mathbf{f}_* | X_*, X, \mathbf{y} &\sim \mathcal{N}(\mathbf{\mu}_*, \mathbf{\Sigma}_*) \\[5pt] \mathbf{\mu}_* &= K_*^T (K + \sigma_n^2 I)^{-1} \mathbf{y} \\[5pt] \mathbf{\Sigma}_* &= K_{**} - K_*^T (K + \sigma_n^2I)^{-1} K_* \end{align*} \]
This means that we only need to define a kernel function \(k\), which assigns a similarity value to each pair of input vectors \(\mathbf{x}_1\) and \(\mathbf{x}_2\); we can then infer the function values \(\mathbf{f}_*\) at any set of test inputs \(X_*\).
Below are three commonly used kernels. The squared exponential is the most widely used kernel; it is infinitely differentiable, and hence the resulting function \(f_*\) is very smooth. The rational quadratic kernel behaves similarly to the squared exponential kernel; it can be seen as a scale mixture of squared exponential kernels with different characteristic length-scales. Finally, the periodic kernel is used when the underlying function is periodic.
\[ \begin{align*} \text{Squared exponential:} \quad & & k(\mathbf{x}_1, \mathbf{x}_2) &= \sigma_p^2 \exp \left( -\frac{\norm{\mathbf{x}_1 - \mathbf{x}_2}^2}{2 \ell^2} \right) \\[5pt] \text{Rational quadratic:} \quad & & k(\mathbf{x}_1, \mathbf{x}_2) &= \sigma_p^2 \left( 1 + \frac{\norm{\mathbf{x}_1 - \mathbf{x}_2}^2}{2\alpha\ell^2} \right)^{-\alpha} \\[5pt] \text{Periodic:} \quad & & k(\mathbf{x}_1, \mathbf{x}_2) &= \sigma_p^2 \exp \left( -\frac{2}{\ell^2} \sin^2 \left( \frac{\pi}{p} \norm{\mathbf{x}_1 - \mathbf{x}_2} \right) \right) \end{align*} \]
where \(\sigma_p\) is the overall variance, \(\ell\) is the length scale, \(\alpha (>0)\) is the scale mixture, and \(p\) is the period of the function.
Notice that by the Schur product theorem, for any two positive definite matrices \(A\) and \(B\), the Hadamard product \(A \circ B\) is also a positive definite matrix. Hence, we can always combine two or more kernels, by multiplying them elementwise, to create a new kernel.
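For example, multiplying the squared exponential and periodic kernels elementwise gives a locally periodic kernel. The mathjs sketch below is only an illustration; the inputs and the hyperparameters kLen and kPeriod are made up and are not taken from the implementation later in this article.
// Minimal sketch: a "locally periodic" kernel as the Hadamard product of a
// squared exponential and a periodic kernel on a few scalar inputs.
const pts = [0, 1, 2, 3];
const kLen = 1;               // length scale (illustrative)
const kPeriod = 2 * math.pi;  // period (illustrative)
const se = (a, b) => math.exp(-((a - b) ** 2) / (2 * kLen ** 2));
const pe = (a, b) => math.exp(-2 * math.sin(math.pi * math.abs(a - b) / kPeriod) ** 2 / kLen ** 2);
const Kse = math.matrix(pts.map(a => pts.map(b => se(a, b))));
const Kpe = math.matrix(pts.map(a => pts.map(b => pe(a, b))));
// The elementwise (Hadamard) product of the two kernel matrices is again a valid kernel
const Kcombined = math.dotMultiply(Kse, Kpe);
console.log(Kcombined.size()); // [ 4, 4 ]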
In this section, we will implement the regression model on a \(\sin\) function with mathjs and visualise the result using d3js. Before implementing the model, we need a few helper functions.
rnorm and mvrnorm
We need to generate a few sample points from the \(\sin\) function with some noise. The following function will sample \(n\) numbers from a Gaussian distribution to serve the purpose.
function rnorm(n=1, mean=0, stdev=1) {
  let output = [];
  for (let i = 0; i < n; i++) {
    // Box-Muller transform: two uniform draws give one standard normal draw
    const u = 1 - math.random();
    const v = math.random();
    const z = math.sqrt(-2.0 * math.log(u)) * math.cos(2.0 * math.pi * v);
    output.push(z * stdev + mean);
  }
  return math.matrix(output);
}
rnorm(2);
// DenseMatrix {
// _data: [ 0.2639986366419256, -0.21067762371667006 ],
// _size: [ 2 ],
// _datatype: undefined
// }
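As a quick sanity check (not part of the original article), the sample mean and standard deviation of a large rnorm draw should sit close to the requested parameters:
// Illustrative check: 10000 draws with mean 2 and standard deviation 0.5
const sample = rnorm(10000, 2, 0.5);
console.log(math.mean(sample)); // ≈ 2
console.log(math.std(sample));  // ≈ 0.5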
We also want to generate a few testing lines based on the testing inputs \(X_*\). Each predicted \(y\) differs from the corresponding \(\mu_*\) by some amount governed by the covariance \(\mathbf{\Sigma}_*\), so we also need to sample from a multivariate Gaussian distribution. Recall that if \(\mathbf{x} \sim \mathcal{N}(\mathbf{\mu}_*, \mathbf{\Sigma}_*)\),
\[ \mathbf{y} = \mathbf{\Sigma}_*^{-1/2} (\mathbf{x} - \mathbf{\mu}_*) \sim \mathcal{N}(\mathbf{0}, I) \]
To calculate \(\mathbf{\Sigma}_*^{1/2}\) from \(\mathbf{\Sigma}_*\), the Cholesky decomposition is one of the most efficient approaches. It takes a positive definite matrix \(\mathbf{\Sigma}_*\) and returns a lower triangular matrix \(L\) such that \(\mathbf{\Sigma}_* = LL^T\). We can then let \(\mathbf{x} = \mathbf{\mu}_* + L \mathbf{y}\) because
\[ \newcommand\var{\text{Var}} \begin{align*} \mathbb{E}(\mathbf{x}) &= \mathbf{\mu}_* + L \mathbb{E}(\mathbf{y}) = \mathbf{\mu}_* \\[10pt] \mathbb{E}(\mathbf{x} \mathbf{x}^T) &= \mathbf{\mu}_* \mathbf{\mu}_*^T + \mathbf{\mu}_* \mathbb{E}(\mathbf{y})^T L^T + L \mathbb{E}(\mathbf{y}) \mathbf{\mu}_*^T + L\mathbb{E}(\mathbf{yy}^T)L^T \\[5pt] &= \mathbf{\mu}_* \mathbf{\mu}_*^T + \mathbf{O} + \mathbf{O} + LIL^T \\[5pt] &= \mathbf{\mu}_* \mathbf{\mu}_*^T + \mathbf{\Sigma}_*, \\[10pt] \var(\mathbf{x}) &= \mathbb{E}(\mathbf{x} \mathbf{x}^T) - \mathbf{\mu}_* \mathbf{\mu}_*^T = \mathbf{\Sigma}_*. \end{align*} \]
function cholesky(array) {
  let zeros = [...Array(array.length)].map((_) => Array(array.length).fill(0));
  let L = zeros.map((row, r, xL) =>
    row.map((v, c) => {
      let sum = row.reduce(
        (s, _, i) => (i < c ? s + xL[r][i] * xL[c][i] : s),
        0
      );
      return (xL[r][c] =
        c < r + 1
          ? r === c
            ? Math.sqrt(array[r][r] - sum)
            : (array[r][c] - sum) / xL[c][c]
          : v);
    })
  );
  return L;
}
var L = cholesky([[2,1], [1,1]]);
console.log(L);
// [
// [ 1.4142135623730951, 0 ],
// [ 0.7071067811865475, 0.7071067811865476 ]
// ]
math.multiply(math.matrix(L), math.transpose(math.matrix(L)))
// DenseMatrix {
// _data: [ [ 2.0000000000000004, 1 ], [ 1, 1 ] ],
// _size: [ 2, 2 ],
// _datatype: undefined
// }
Now we are ready to write a function to sample from a multivariate Gaussian.
function mvrnorm(n, mu, sigma2) {
  const sigma = math.matrix(cholesky(sigma2.valueOf()));
  let output = [];
  for (let i = 0; i < n; i++) {
    output.push(math.chain(sigma).multiply(rnorm(mu.size()[0])).add(mu).value);
  }
  return math.matrix(output);
}
mvrnorm(3, math.matrix([0,0]), math.matrix([[2,1],[1,1]]))
// DenseMatrix {
// _data: [
// [ 2.744885257918074, 1.2485497832758634 ],
// [ 1.6920978123191832, 2.4678945538837755 ],
// [ -0.5855845041253414, -0.41890730080651606 ]
// ],
// _size: [ 3, 2 ],
// _datatype: undefined
// }
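Similarly (again not part of the original article), the empirical covariance of many mvrnorm draws should be close to the requested covariance matrix:
// Illustrative check: compare the empirical covariance of 5000 draws with [[2, 1], [1, 1]]
const draws = mvrnorm(5000, math.matrix([0, 0]), math.matrix([[2, 1], [1, 1]])).valueOf();
const col = j => draws.map(row => row[j]);
const covHat = (a, b) => math.mean(math.dotMultiply(
  math.subtract(a, math.mean(a)),
  math.subtract(b, math.mean(b))
));
console.log([
  [covHat(col(0), col(0)), covHat(col(0), col(1))],
  [covHat(col(1), col(0)), covHat(col(1), col(1))]
]); // ≈ [ [ 2, 1 ], [ 1, 1 ] ]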
We will first implement the matrix of pairwise Euclidean distances \(\norm{\mathbf{x}_1 - \mathbf{x}_2}^p\) raised to a power \(p\), as it appears in most of the kernels. As we are dealing with 1-dimensional inputs, the function below takes a vector, calculates the absolute difference between each pair of elements, raises it to the power pow, and returns a matrix of distances.
function euclidMatrix(x, pow=1) {
  let output = math.zeros(x.size()[0], x.size()[0]);
  for (let i = 0; i < x.size()[0]; i++) {
    for (let j = 0; j < i; j++) {
      const val = math.abs(x.valueOf()[i] - x.valueOf()[j]) ** pow;
      output.set([i, j], val);
      output.set([j, i], val);
    }
  }
  return output;
}
euclidMatrix(math.matrix([1,1.5,2,3]), 2)
// DenseMatrix {
// _data: [
// [ 0, 0.25, 1, 4 ],
// [ 0.25, 0, 0.25, 2.25 ],
// [ 1, 0.25, 0, 1 ],
// [ 4, 2.25, 1, 0 ]
// ],
// _size: [ 4, 4 ],
// _datatype: undefined
// }
Before implementing the kernel functions, we also need a function for the noise matrix \(\sigma_n^2 I\). As we will use it for the whole covariance matrix of \(\mathbf{z}\), the function below takes two 1-dimensional vectors (\(X\) and \(X_*\)) and returns the square matrix \(\sigma_n^2I \oplus \mathbf{O}\), with a small jitter on the testing block to keep it positive definite.
function noiseMatrix(trainX, testX, noise, eps) {
  /*
  This function generates a diagonal noise matrix. The `eps` value makes sure
  the matrix is positive definite.
  */
  const trainN = trainX.size()[0];
  const testN = testX.size()[0];
  const sigmaI = math.diag(math.concat(
    math.ones(trainN).map(x => x * (noise ** 2 || eps)),
    math.ones(testN).map(x => x * eps)
  ));
  return sigmaI;
}
noiseMatrix(math.matrix([0,0]), math.matrix([0,0,0]), 1, 1e-8)
// DenseMatrix {
// _data: [
// [ 1, 0, 0, 0, 0 ],
// [ 0, 1, 0, 0, 0 ],
// [ 0, 0, 1e-8, 0, 0 ],
// [ 0, 0, 0, 1e-8, 0 ],
// [ 0, 0, 0, 0, 1e-8 ]
// ],
// _size: [ 5, 5 ],
// _datatype: undefined
// }
Now we are ready to implement the kernel functions. Below is the squared exponential kernel.
function eqkernel(trainX, testX, ell=1, pea=1, noise=0, eps=1e-8) {
  const sigmaI = noiseMatrix(trainX, testX, noise, eps);
  const distSq = euclidMatrix(math.concat(trainX, testX), 2);
  const K = math.add(
    distSq.map(x => math.exp(x / (-2 * (ell ** 2)))),
    sigmaI
  ).map(x => x * (pea ** 2));
  return K;
}
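As a quick illustration of eqkernel (with made-up inputs), the joint kernel matrix over the concatenated training and testing inputs is symmetric, and the training block of its diagonal carries the noise term:
// Two training points, one testing point, noise standard deviation 0.3
const Ktoy = eqkernel(math.matrix([0, 1]), math.matrix([2]), 1, 1, 0.3);
console.log(Ktoy.size());                                // [ 3, 3 ]
console.log(math.deepEqual(Ktoy, math.transpose(Ktoy))); // true
console.log(Ktoy.get([0, 0]));                           // ≈ 1 + 0.3^2 = 1.09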
And below shows the periodic kernel function.
function pdkernel(trainX, testX, period, ell=1, pea=1, noise=0, eps=1e-8) {
  const sigmaI = noiseMatrix(trainX, testX, noise, eps);
  const dist = euclidMatrix(math.concat(trainX, testX), 1);
  const K = math.add(
    dist.map(x => math.exp((math.sin(x * math.pi / period) ** 2) / (-2 * (ell ** 2)))),
    sigmaI
  ).map(x => x * (pea ** 2));
  return K;
}
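And a quick illustration of pdkernel (again with made-up inputs): two inputs exactly one period apart are treated as perfectly similar by the kernel, up to the small eps jitter on the diagonal:
// Training inputs 0 and 2*pi, one period apart, with period = 2*pi
const Kp = pdkernel(math.matrix([0, math.pi * 2]), math.matrix([1]), math.pi * 2);
console.log(Kp.get([0, 1])); // ≈ 1, essentially the same as the diagonal entry Kp.get([0, 0])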
Recall that we assumed that
\[ \begin{pmatrix} \mathbf{y} \\ \mathbf{f}_* \end{pmatrix} \sim \mathcal{N} \left( \mathbf{0}, \begin{pmatrix} K + \sigma_n^2I & K_* \\ K_*^T & K_{**} \end{pmatrix} \right), \]
and
\[ \begin{align*} \mathbf{f}_* | X_*, X, \mathbf{y} &\sim \mathcal{N}(\mathbf{\mu}_*, \mathbf{\Sigma}_*) \\[5pt] \mathbf{\mu}_* &= K_*^T (K + \sigma_n^2 I)^{-1} \mathbf{y} \\[5pt] \mathbf{\Sigma}_* &= K_{**} - K_*^T (K + \sigma_n^2I)^{-1} K_* \end{align*} \]
The function below handles the calculation of the posterior mean and variance, from the kernel matrix obtained above.
function posteriorCal(ytrain, K) {
  let output = new Object();
  const totalN = K.size()[0];
  const trainN = ytrain.size()[0];
  const trainRange = math.range(0, trainN);
  const testRange = math.range(trainN, totalN);
  output.fullK = K;
  output.trainK = math.subset(K, math.index(trainRange, trainRange));
  output.testK = math.subset(K, math.index(testRange, testRange));
  output.trainTestK = math.subset(K, math.index(trainRange, testRange));
  output.testTrainK = math.subset(K, math.index(testRange, trainRange));
  const KtrainInv = math.inv(output.trainK);
  // Posterior covariance: K_** - K_*^T (K + sigma_n^2 I)^{-1} K_*
  output.postK = math.subtract(
    output.testK,
    math.chain(output.testTrainK).multiply(KtrainInv).multiply(output.trainTestK).value
  );
  output.choleskyK = math.matrix(cholesky(output.postK.valueOf()));
  // Posterior mean: K_*^T (K + sigma_n^2 I)^{-1} y
  output.postMu = math.chain(output.testTrainK).multiply(KtrainInv).multiply(ytrain).value;
  // Bands at two posterior standard deviations around the posterior mean
  const s2 = math.diag(output.postK);
  output.seLower = math.chain(output.postMu).subtract(s2.map(x => 2 * (x ** (1/2)))).value;
  output.seUpper = math.chain(output.postMu).add(s2.map(x => 2 * (x ** (1/2)))).value;
  return output;
}
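Putting the helpers together, the sketch below (with made-up observations, not the data used later) feeds five noisy \(\sin\) observations and three testing inputs through eqkernel and posteriorCal:
// Five noisy observations of sin(x) and three testing inputs
const xobs = math.matrix([-2, -1, 0, 1, 2]);
const yobs = math.add(xobs.map(x => math.sin(x)), rnorm(5, 0, 0.1));
const xstar = math.matrix([-1.5, 0.5, 1.5]);
const post = posteriorCal(yobs, eqkernel(xobs, xstar, 1, 1, 0.1));
console.log(post.postMu.valueOf()); // three values, roughly tracking sin at the testing inputs
console.log(post.postK.size());     // [ 3, 3 ]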
Finally, we put the whole model into a class with a predict method that generates new functions from the posterior mean and variance. The toData method exports an array of JSON objects sorted by the \(x\)-coordinate (as this is a 1-dimensional implementation).
class Gpr {
  constructor(trainX, trainY, testX, kernel_type, noise=0, eps=1e-8, ell=1, pea=1, period=period) {
    this.trainX = math.matrix(trainX);
    this.trainY = math.matrix(trainY);
    this.testX = math.matrix(testX);
    this.kernel_type = kernel_type;
    this.noise = noise;
    this.eps = eps;
    this.ell = ell;
    this.period = period;
    this._ytest = [];
    if (kernel_type === ktype.SEXPONENTIAL) {
      this.posterior = posteriorCal(
        this.trainY, eqkernel(this.trainX, this.testX, ell, pea, noise, eps)
      );
    } else if (kernel_type === ktype.PERIODIC) {
      this.posterior = posteriorCal(
        this.trainY, pdkernel(this.trainX, this.testX, period, ell, pea, noise, eps)
      );
    }
  }
  get ytest() {
    return math.matrix(this._ytest);
  }
  predict() {
    // Draw one function from the posterior: f = mu_* + L z, with z ~ N(0, I)
    this._ytest.push(math.chain(this.posterior.choleskyK)
      .multiply(rnorm(this.testX.size()[0]))
      .add(this.posterior.postMu).value.valueOf());
    return math.matrix(this._ytest[this._ytest.length - 1]);
  }
  toData() {
    const trainData = [...Array(this.trainY.size()[0]).keys()].map(i => {
      return {
        "type": "train",
        "x": this.trainX.valueOf()[i],
        "ytrain": this.trainY.valueOf()[i]
      };
    });
    const testData = [...Array(this.testX.size()[0]).keys()].map(i => {
      let output = {
        "type": "test",
        "x": this.testX.valueOf()[i],
        "postMu": this.posterior.postMu.valueOf()[i],
        "seLower": this.posterior.seLower.valueOf()[i],
        "seUpper": this.posterior.seUpper.valueOf()[i],
      };
      for (let j = 0; j < this._ytest.length; j++) {
        output["ytest" + String(j + 1)] = this._ytest[j][i];
      }
      return output;
    });
    return trainData.concat(testData).sort(function(a, b) {return a.x - b.x});
  }
}
The toData method of the above class generates an array of objects with the following schema.
{
  "type": string, // specifies whether the point comes from training or testing
  "x": float,
  "y": float,
  "ytrain": float,
  "postMu": float,
  "seLower": float,
  "seUpper": float,
  "ytest": float // there may be multiple predicted y coordinates (ytest1, ytest2, ...)
}
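For instance, a single test entry with two sampled lines might look like the following (all values are illustrative):
{
  "type": "test",
  "x": 0.55,
  "y": 0.52,        // added later in main() as the underlying sin value
  "postMu": 0.49,
  "seLower": 0.31,
  "seUpper": 0.67,
  "ytest1": 0.45,
  "ytest2": 0.58
}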
The function below takes the above array as data and a cls argument, and plots the following pairs within the element of HTML class cls: (x, y), (x, ytrain), (x, ytest), (x, seLower) and (x, seUpper).
function plotSvg(cls, data) {
  let svg = d3.select("." + cls)
    .append("svg")
    .attr("width", boxWidth)
    .attr("height", boxHeight)
    .append("g")
    .attr("transform", "translate(" + boxWidth/2 + "," + boxHeight/2 + ")")
  // Shaded band between seLower and seUpper
  svg.append("g")
    .attr("class", "bg")
    .append("path")
    .datum(data.filter(function(d) { return d.type == "test" }))
    .attr("fill", seColor)
    .attr("stroke", "none")
    .attr("d", d3.area()
      .x(function(d) { return d.x; })
      .y0(function(d) { return d.seUpper; })
      .y1(function(d) { return d.seLower; })
    )
  // Ribbon for the underlying sin function
  svg.append("g")
    .attr("class", "ribbon")
    .append("path")
    .datum(data)
    .attr("fill", "none")
    .attr("stroke", ribbonColor)
    .attr("stroke-width", 6)
    .attr(
      "d",
      d3.line()
        .x(function(d) { return d.x; })
        .y(function(d) { return d.y; })
    );
  // Sampled posterior lines
  let tlines = svg.append("g").attr("class", "tlines")
  for (let i = 1; i <= predN; i++) {
    tlines.append("path")
      .datum(data.filter(function(d) { return d.type == "test"}))
      .attr("fill", "none")
      .attr("stroke", lineColor)
      .attr("stroke-width", 1.5)
      .attr("d", d3.line()
        .x(function(d) { return d.x; })
        .y(function(d) { return d["ytest" + String(i)]; }));
  }
  // Training points
  svg.append("g").attr("class", "tdots")
    .selectAll("dot")
    .data(data.filter(function(d) {return d.type == "train"}))
    .enter()
    .append("circle")
    .attr("cx", function(d) { return d.x; })
    .attr("cy", function(d) { return d.ytrain; })
    .attr("r", 5)
    .style("fill", dotColor)
}
main function
Finally, below shows the main function to plot the implementation.
// Set up
const boxWidth = 1100;
const boxHeight = 300;
const ribbonColor = "#3a3a9f44";
const dotColor = "#00990044";
const lineColor = "#ff669988";
const seColor = "#00000011";
const plotWidth = 22;
const xmin = - plotWidth / 2;
const xmax = plotWidth / 2;
const trainSize = 20;
const testSize = 200;
const eps = 1e-8;
const pea = 1;
const noise = 0.3;
const ell = 1;
const predN = 5;
const m = plotWidth / 2;
const r = m * 1.2;
const boxPlotRatio = boxWidth / plotWidth;
const period = math.pi * 2;
const sStd = 0.5;
// Kernel selector used by the Gpr class. Its definition is not shown in the
// original article, so arbitrary string labels are assumed here.
const ktype = { SEXPONENTIAL: "squared exponential", PERIODIC: "periodic" };
function main() {
  // Generate the sample training data
  const Xtrain = math.random([1, trainSize], xmin, xmax)[0];
  const ytrain = math.add(Xtrain.map(x => math.sin(x)), rnorm(trainSize, 0, sStd));
  const Xtest = Array.from(
    {length: testSize}, (_, i) => xmin + (plotWidth * i) / (testSize - 1)
  );
  // Fit the model
  const gpr = new Gpr(Xtrain, ytrain, Xtest, ktype.PERIODIC, noise, eps, ell, pea, period);
  // Generate `predN` lines from the posterior
  [...Array(predN)].forEach(i => {gpr.predict()});
  // Scale and reverse the y coordinates for d3js
  const data = gpr.toData().map(x => {
    x.y = math.sin(x.x) * boxPlotRatio * (-1);
    x.x = x.x * boxPlotRatio;
    if (x.type === "train") {
      x.ytrain = x.ytrain * boxPlotRatio * (-1);
    } else if (x.type === "test") {
      x.postMu = x.postMu * boxPlotRatio * (-1);
      x.seLower = x.seLower * boxPlotRatio * (-1);
      x.seUpper = x.seUpper * boxPlotRatio * (-1);
      for (let i = 1; i <= predN; i++) {
        x["ytest" + String(i)] = x["ytest" + String(i)] * boxPlotRatio * (-1);
      }
    }
    return x;
  });
  // Plot the graph
  plotSvg("sim-plot", data);
}
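The article does not show the surrounding page, so the snippet below only sketches the assumed setup: the mathjs and d3js bundles loaded as globals and an element with the sim-plot class present for plotSvg to select.
// Assumed page setup (not part of the original article): `math` and `d3` are the
// globals exposed by the mathjs and d3js script bundles, and the page contains an
// element such as <div class="sim-plot"></div>.
window.addEventListener("load", () => {
  main(); // draws the training dots, the sin ribbon, the predicted lines and the band
});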
The green dots below show the training points (or the observable samples). The blue line is the underlying \(\sin\) function. The pink lines are the predicted lines from the GPR.
Let \(\mathbf{X} \sim \mathcal{N}(\mathbf{\mu}, \mathbf{\Sigma})\) and \(\mathbf{Y} | \mathbf{X} \sim \mathcal{N}(A \mathbf{X} + \mathbf{b}, \mathbf{\Lambda})\). We aim to find the distribution for \(\mathbf{Y}\). Consider
\[ \begin{align*} f_{\mathbf{Y}}(\mathbf{y}) &= \int f_{\mathbf{Y} | \mathbf{X}} (\mathbf{y} | \mathbf{x}) \cdot f_{\mathbf{X}} (\mathbf{x})\, d\mathbf{x} \\[5pt] &\propto \int \exp \left( -\frac{1}{2} \left[(\mathbf{y} - A \mathbf{x} - \mathbf{b})^T \mathbf{\mathbf{\Lambda}}^{-1} (\mathbf{y} - A \mathbf{x} - \mathbf{b}) + (\mathbf{x} - \mathbf{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \mathbf{\mu}) \right] \right) \, d\mathbf{x} \end{align*} \]
The integrand is the joint density of \(\mathbf{Z} = (\mathbf{X}, \mathbf{Y})^T\). Notice that the sum of two positive definite quadratic forms is also a positive definite quadratic form, hence \(\mathbf{Z}\) is Gaussian distributed, which makes \(\mathbf{Y}\) Gaussian distributed as well. To identify the covariance matrix of \(\mathbf{Z}\), consider the second order terms,
\[ \begin{align*} & \mathbf{y}^T \mathbf{\Lambda}^{-1} \mathbf{y} - 2\mathbf{x}^T A^T \mathbf{\Lambda}^{-1} \mathbf{y} + \mathbf{x}^T A^T \mathbf{\Lambda}^{-1} A \mathbf{x} + \mathbf{x}^T \mathbf{\Sigma}^{-1} \mathbf{x} \\[5pt] & \quad = \mathbf{y}^T \mathbf{\Lambda}^{-1} \mathbf{y} - 2\mathbf{x}^T A^T \mathbf{\Lambda}^{-1} \mathbf{y} + \mathbf{x}^T (A^T \mathbf{\Lambda}^{-1} A + \mathbf{\Sigma}^{-1}) \mathbf{x} \\[5pt] & \quad = \begin{pmatrix} \mathbf{x} \\ \mathbf{y} \end{pmatrix}^T \begin{pmatrix} \mathbf{\Sigma}^{-1} + A^T \mathbf{\Lambda}^{-1} A & - A^T \mathbf{\Lambda}^{-1} \\ - \mathbf{\Lambda}^{-1} A & \mathbf{\Lambda}^{-1} \end{pmatrix} \begin{pmatrix} \mathbf{x} \\ \mathbf{y} \end{pmatrix} \\[5pt] & \quad = \mathbf{z}^T R \mathbf{z}, \end{align*} \]
where
\[ R = \begin{pmatrix} \mathbf{\Sigma}^{-1} + A^T \mathbf{\Lambda}^{-1} A & - A^T \mathbf{\Lambda}^{-1} \\ - \mathbf{\Lambda}^{-1} A & \mathbf{\Lambda}^{-1} \end{pmatrix} \qquad \text{and} \qquad R^{-1} = \begin{pmatrix} \mathbf{\Sigma} & \mathbf{\Sigma} A^T \\ A \mathbf{\Sigma} & \mathbf{\Lambda} + A \mathbf{\Sigma} A^T \end{pmatrix}. \]
From \(\eqref{eq:1}\), we know that \(R^{-1}\) is the covariance matrix of \(\mathbf{z}\). We now consider the linear terms to find the mean of \(\mathbf{z}\).
\[ - \mathbf{y}^T \mathbf{\Lambda}^{-1} \mathbf{b} + \mathbf{x}^T A^T \mathbf{\Lambda}^{-1} \mathbf{b} - \mathbf{x}^T \mathbf{\Sigma}^{-1} \mathbf{\mu} = - \begin{pmatrix} \mathbf{x} \\ \mathbf{y} \end{pmatrix}^T \begin{pmatrix} \mathbf{\Sigma}^{-1} \mathbf{\mu} - A^T \mathbf{\Lambda}^{-1} \mathbf{b} \\ \mathbf{\Lambda}^{-1} \mathbf{b} \end{pmatrix}. \]
By \(\eqref{eq:1}\) again, we know that
\[ -\mathbf{z}^T \Sigma_{\mathbf{z}}^{-1} \mathbf{\mu}_\mathbf{z} = - \begin{pmatrix} \mathbf{x} \\ \mathbf{y} \end{pmatrix}^T \begin{pmatrix} \mathbf{\Sigma}^{-1} \mathbf{\mu} - A^T \mathbf{\Lambda}^{-1} \mathbf{b} \\ \mathbf{\Lambda}^{-1} \mathbf{b} \end{pmatrix}. \]
Thus
\[ \begin{align*} \mathbf{\mu}_\mathbf{z} &= R^{-1} \begin{pmatrix} \mathbf{\Sigma}^{-1} \mathbf{\mu} - A^T \mathbf{\Lambda}^{-1} \mathbf{b} \\ \mathbf{\Lambda}^{-1} \mathbf{b} \end{pmatrix} = \begin{pmatrix} \mathbf{\mu} \\ A \mathbf{\mu} + \mathbf{b} \end{pmatrix} \\[5pt] \mathbf{\Sigma}_{\mathbf{z}} &= \begin{pmatrix} \mathbf{\Sigma} & \mathbf{\Sigma} A^T \\ A \mathbf{\Sigma} & \mathbf{\Lambda} + A \mathbf{\Sigma} A^T \end{pmatrix}, \end{align*} \]
and
\[ \begin{align*} \mathbf{\mu}_\mathbf{y} &= A \mathbf{\mu} + \mathbf{b} \\[5pt] \mathbf{\Sigma}_{\mathbf{y}} &= \mathbf{\Lambda} + A \mathbf{\Sigma}A^T. \end{align*} \]
Let \(X\) and \(\mathbf{y}\) be the observed training points from an underlying function \(f\) such that
\[\mathbf{y} = \mathbf{f}(X) + \varepsilon,\]
where \(\varepsilon \sim \mathcal{N}(0, \sigma_n^2 I)\). We are interested in estimating the underlying function \(f\). Consider
\[ \begin{align*} p(\mathbf{y} | X) &= \prod_{i = 1}^n p(y_i | \mathbf{x}_i) \\[5pt] &= \prod_{i = 1}^n p(\varepsilon_i) \\[5pt] &= \prod_{i = 1}^n \frac{1}{\sqrt{2\pi \sigma_n^2}} \exp \left( -\frac{(y_i - f(\mathbf{x}_i))^2}{2\sigma_n^2} \right) \\[5pt] &= \frac{1}{(2\pi\sigma_n^2)^{n/2}} \exp \left( - \frac{\norm{\mathbf{y} - \mathbf{f}(X)}^2}{2\sigma_n^2} \right). \end{align*} \]
That is, \(\mathbf{y} | X \sim \mathcal{N}(\mathbf{f}(X), \sigma_n^2I)\). Now let \(f(\mathbf{x}) = \phi(\mathbf{x})^T \mathbf{w}\), where \(\phi\) is some (e.g. polynomial) feature projection and \(\mathbf{w}\) are some weights. Then the previous result becomes
\[\mathbf{y} | X, \mathbf{w} \sim \mathcal{N}(\Phi(X)^T \mathbf{w}, \sigma_n^2 I).\]
Now, we are interested in finding
\[ \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{marginal likelihood}} \qquad \text{or} \qquad p(\mathbf{w} | X, \mathbf{y}) = \frac{p(\mathbf{y} | X, \mathbf{w}) \cdot p(\mathbf{w})}{p(\mathbf{y} | X)}, \]
where the marginal likelihood is a normalisation constant that is independent of the weights and is given by
\[p(\mathbf{y} | X) = \int p(\mathbf{y} | X, \mathbf{w}) p(\mathbf{w})\, d\mathbf{w}.\]
Assuming the prior distribution \(\mathbf{w} \sim \mathcal{N}(0, \mathbf{\Sigma}_p)\), we have
\[ \begin{align*} p(\mathbf{w} | X, \mathbf{y}) &\propto p(\mathbf{y} | X, \mathbf{w}) \cdot p(\mathbf{w}) \\[5pt] &\propto \exp \left( -\frac{1}{2\sigma_n^2} (\mathbf{y} - \Phi_X^T\mathbf{w})^T(\mathbf{y} - \Phi_X^T\mathbf{w}) - \frac{1}{2} \mathbf{w}^T \mathbf{\Sigma}_p^{-1} \mathbf{w} \right) \\[5pt] &\propto \exp \left( -\frac{1}{2} \left[ \sigma_n^{-2} \mathbf{y}^T \mathbf{y} - 2 \sigma_n^{-2} \mathbf{y}^T \Phi_X^T \mathbf{w} + \sigma_n^{-2} \mathbf{w}^T\Phi_X\Phi_X^T\mathbf{w} + \mathbf{w}^T \mathbf{\Sigma}_p^{-1} \mathbf{w} \right] \right) \\[5pt] &\propto \exp \left( -\frac{1}{2} \left[ \mathbf{w}^T (\sigma_n^{-2} \Phi_X \Phi_X^T + \mathbf{\Sigma}_p^{-1}) \mathbf{w} - 2 \sigma_n^{-2} \mathbf{y}^T \Phi_X^T \mathbf{w} \right] \right) \qquad (\mathbf{y}^T \mathbf{y} \text{ is a constant}) \\[5pt] &\propto \exp \left( -\frac{1}{2} \left[ \mathbf{w}^T A \mathbf{w} - 2 (\sigma_n^{-2} A^{-1} \Phi_X \mathbf{y})^T A \mathbf{w} \right] \right) \\[5pt] &\propto \exp \left( -\frac{1}{2} \left[\mathbf{w}^T A \mathbf{w} - 2 \mathbf{\bar{w}}^T A \mathbf{w} + \mathbf{\bar{w}}^T A \mathbf{\bar{w}} \right] \right) \\[5pt] &\propto \exp\left( -\frac{1}{2} (\mathbf{w} - \mathbf{\bar{w}})^T A (\mathbf{w} - \mathbf{\bar{w}}) \right), \end{align*} \]
where \(A = \sigma_n^{-2} \Phi_X \Phi_X^T + \mathbf{\Sigma}_p^{-1}\) and \(\mathbf{\bar{w}} = \sigma_n^{-2} A^{-1} \Phi_X \mathbf{y}\). Hence
\[\mathbf{w} | X, \mathbf{y} \sim \mathcal{N} (\mathbf{\bar{w}}, A^{-1}).\]
Now, let \(\mathbf{f}_* = \mathbf{f}(\Phi_{X_*})\), where \(X_*\) are some testing inputs. We have
\[ \begin{align*} p(\mathbf{f}_* | \Phi_{X_*}, X, \mathbf{y}) &= \int p(\mathbf{f}_* | \Phi_{X_*}, \mathbf{w}) p(\mathbf{w} | X, \mathbf{y}) \, d\mathbf{w}. \end{align*} \]
From the result of the previous section (a Gaussian prior pushed through a linear map is again Gaussian), we know that
\[\mathbf{f}_* | X_*, X, \mathbf{y} \sim \mathcal{N}(\sigma_n^{-2} \Phi_{X_*}^T A^{-1} \Phi_X \mathbf{y}, \Phi_{X_*}^T A^{-1} \Phi_{X_*}).\]
We omitted the \(\sigma_n^2 I\) term in the covariance as it is very small and omitting it simplifies the upcoming calculations. Now let \(K = \Phi_X^T \mathbf{\Sigma}_p \Phi_X\) and notice that
\[ \begin{align*} \sigma_n^{-2} \Phi_X (K + \sigma_n^{2}I) &= \sigma_n^{-2} \Phi_X (\Phi_X^T \mathbf{\Sigma}_p \Phi_X + \sigma_n^{2}I) \\[5pt] &= A\mathbf{\Sigma}_p \Phi_X \\[5pt] \sigma_n^{-2} A^{-1} \Phi_X (K + \sigma_n^2I) &= \mathbf{\Sigma}_p \Phi_X \\[5pt] \sigma_n^{-2} A^{-1} \Phi_X &= \mathbf{\Sigma}_p \Phi_X (K + \sigma_n^2 I)^{-1}. \tag{2}\label{eq:2} \end{align*} \]
Also, with the fact that
\[ (Z + UWV^T)^{-1} = Z^{-1} - Z^{-1} U(W^{-1} + V^T Z^{-1} U)^{-1} V^T Z^{-1}, \]
for any matrices \(Z, U, W, V\) of suitable sizes, we know that
\[ A^{-1} = (\mathbf{\Sigma}_p^{-1} + \Phi_X(\sigma_n^{-2}I)\Phi_X^T)^{-1} = \mathbf{\Sigma}_p - \mathbf{\Sigma}_p \Phi_X(\sigma_n^2 I + \Phi_X^T \mathbf{\Sigma}_p \Phi_X)^{-1} \Phi_X^T \mathbf{\Sigma}_p. \tag{3}\label{eq:3} \]
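As an aside, this identity can be spot-checked numerically with mathjs on small random matrices; the sizes below are arbitrary and the check is purely illustrative.
// Numeric spot check of (Z + U W V^T)^{-1} = Z^{-1} - Z^{-1} U (W^{-1} + V^T Z^{-1} U)^{-1} V^T Z^{-1}
const n = 3, k = 2;
const Z = math.add(math.identity(n), math.diag(math.random([n], 0.5, 1.5)));
const W = math.identity(k);
const U = math.matrix(math.random([n, k], -1, 1));
const V = math.matrix(math.random([n, k], -1, 1));
const lhs = math.inv(math.add(Z, math.multiply(U, W, math.transpose(V))));
const rhs = math.subtract(
  math.inv(Z),
  math.multiply(
    math.inv(Z), U,
    math.inv(math.add(math.inv(W), math.multiply(math.transpose(V), math.inv(Z), U))),
    math.transpose(V), math.inv(Z)
  )
);
console.log(math.max(math.abs(math.subtract(lhs, rhs)))); // ≈ 0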
Hence, combining \(\eqref{eq:2}\) and \(\eqref{eq:3}\), we have
\[ \begin{align*} \mathbf{f}_* | X_*, X, \mathbf{y} &\sim \mathcal{N}( \Phi_{X_*}^T \mathbf{\Sigma}_p \Phi_X (\Phi_X^T \mathbf{\Sigma}_p \Phi_X + \sigma_n^2 I)^{-1} \mathbf{y}, \\ & \qquad \quad \Phi_{X_*}^T \mathbf{\Sigma}_p \Phi_{X_*} - \Phi_{X_*}^T \mathbf{\Sigma}_p \Phi_X(\sigma_n^2 I + \Phi_X^T \mathbf{\Sigma}_p \Phi_X)^{-1} \Phi_X^T \mathbf{\Sigma}_p \Phi_{X_*}). \end{align*} \]
We shall define the kernel \(k: \mathbb{R}^p \times \mathbb{R}^p \rightarrow \mathbb{R}\) such that \(k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T \mathbf{\Sigma}_p \phi(\mathbf{x}')\). With these settings, we can write
\[ \begin{pmatrix} \mathbf{y} \\ \mathbf{f}_* \end{pmatrix} \sim \mathcal{N} \left( \mathbf{0}, \begin{pmatrix} k(X, X) + \sigma_n^2I & k(X, X_*) \\ k(X_*, X) & k(X_*, X_*) \end{pmatrix} \right). \]
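Applying the conditional Gaussian formula from the first section to this joint distribution (with the zero prior mean assumed throughout) recovers exactly the GPR posterior quoted earlier:
\[ \begin{align*} \mathbf{f}_* | X_*, X, \mathbf{y} &\sim \mathcal{N}(\mathbf{\mu}_*, \mathbf{\Sigma}_*) \\[5pt] \mathbf{\mu}_* &= k(X_*, X) \left( k(X, X) + \sigma_n^2 I \right)^{-1} \mathbf{y} = K_*^T (K + \sigma_n^2 I)^{-1} \mathbf{y} \\[5pt] \mathbf{\Sigma}_* &= k(X_*, X_*) - k(X_*, X) \left( k(X, X) + \sigma_n^2 I \right)^{-1} k(X, X_*) = K_{**} - K_*^T (K + \sigma_n^2 I)^{-1} K_*. \end{align*} \]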