Bayesian regression methods are very powerful, as they not only provide us with point estimates of regression parameters, but rather deliver an entire distribution over these parameters. This can be understood as not only learning one model, but an entire family of models and giving them different weights according to their likelihood of being correct. As this weight distribution depends on the observed data, Bayesian methods can give us an uncertainty quantification of our predictions representing what the model was able to learn from the data. The uncertainty measure could be e.g. the standard deviation of the predictions of all the models, something that point estimators will not provide by default. Knowing what the model doesn't know helps to make AI more explainable. To clarify the basic idea of Bayesian regression, we will stick to discussing Bayesian Linear Regression (BLR).

BLR is the Bayesian approach to linear regression analysis. We will start with an example to motivate the method. To make things clearer, we will then introduce a couple of non-Bayesian methods that the reader might already be familiar with and discuss how they relate to Bayesian regression.

In the following I assume that you have elementary knowledge of linear algebra and stochastics.

Let's get started!

#### A simple example

The task we want to solve is the following: Assume we are given a set $$D$$ of $$N_D$$ independently generated, noisy data points $$D := \{ (x_i, y_i)\}_{i=1}^{N_D}$$, where $$x_i \in \mathbb{R}^m$$ and $$y_i \in \mathbb{R}^n$$ (in our example $$m=n=1$$). We know that there is some relationship between the $$x_i$$ and the $$y_i$$, but we do not know what this relationship is. We will make a simplifying assumption, namely that the relationship is linear, but we also have some measurement noise on our hands. Formally, we can write this as

\begin{equation} \begin{aligned} y_i &= w_1 x_i + w_0 + \epsilon_i = \vec{w}^T \vec{x}_i + \epsilon_i \\ \epsilon_i &\sim \mathcal{N}(0, \sigma_\epsilon^2 ), \end{aligned} \label{eqModel} \end{equation}

where for convenience we define $$\vec{x}_i := (1, x_i)^T$$ and $$\vec{w} := (w_0, w_1)^T$$. $$w_0$$ and $$w_1$$ are the model parameters that we want to estimate. All $$\epsilon_i$$ are independently drawn from a normal distribution with standard deviation $$\sigma_\epsilon$$. We do not want to model the noise $$\epsilon_i$$, but rather use it as an excuse that our model will not represent the data perfectly.

Assume we already know the posterior distribution $$p(\vec{w}|D)$$ which encodes what we think the parameters $$\vec{w}$$ could be after observing the data $$D$$ (we will learn how to obtain it in a minute). For a given input $$\vec{x}$$ we get an entire distribution over the predictions $$y = \vec{w}^T \vec{x}$$ weighted by $$p(\vec{w}|D)$$. This distribution is called the predictive distribution.

We can visualize the predictive distribution by plotting the mean and the standard deviation of the prediction for a given $$x$$. The neat thing about BLR is that we can compute these moments analytically. For the mean prediction we get

\begin{equation} \begin{aligned} \mu_{y|\vec{x}} &= \int d\vec{w} \ p(\vec{w}|D) \ \vec{w}^T \vec{x} \\ &= \vec{\mu}_w \vec{x}, \end{aligned} \end{equation}

where we use the linearity of both expectation value and scalar product. For the variance we need to compute

\begin{equation} \begin{aligned} \sigma_{y|\vec{x}}^2 &= \int d\vec{w} \ p(\vec{w}|D) \ \left( \vec{w}^T \vec{x} - \vec{\mu}_w^T \vec{x} \right)^2 \\ &= \vec{x}^T \left[ \int d\vec{w} \ p(\vec{w}|D) \ \left( \vec{w} \vec{w}^T - \vec{\mu}_w \vec{\mu}_w^T \right) \right] \vec{x} \\ &= \vec{x}^T \Sigma_w \vec{x}, \end{aligned} \end{equation}

where $$\vec{\mu}_w$$ and $$\Sigma_w$$ are the mean vector and the covariance matrix of $$p(\vec{w}|D)$$.

Furthermore, we can also sample $$p(\vec{w}|D)$$ and plot the resulting models. As $$\Sigma_w$$ is positive-definite (we'll assume this to be true without proving it here, but it has something to do with the fact that it is the sum of a positive semi-definite matrix and an identity matrix), we can decompose it such that

\begin{equation} \Sigma_w = GG^T. \end{equation}

If we draw a vector $$\vec{z}$$ from a multi-variate standard Gaussian

\begin{equation} \begin{aligned} \vec{z} &= [z_1, z_2, ... ]^T \\ z_i &\sim \mathcal{N}(0, 1) \end{aligned} \end{equation}

we can convert it into a sample of the posterior using

\begin{equation} \vec{w} = G^T \vec{z} + \vec{\mu}_w \end{equation}.

Before observing data we simply draw from the prior distribution $$p(\vec{w}) = \mathcal{N}(\vec{0}, I)$$ (where $$I$$ is the identity matrix), which we choose to be a multi-variate standard Gaussian. Then we gradually introduce new data points, compute the posterior and plot the distribution and the sampled models.