by writing $x$ as $[1, x_1, x_2, \ldots, x_D]$, we can write $w$ as $[w_0, w_1, \ldots, w_D]$
simple linear regression
D=1
f(x;w)=ax+b
multiple linear regression
$x \in \mathbb{R}^D$, $D > 1$
multivariate linear regression
$x \in \mathbb{R}^D$, $D > 1$
$y \in \mathbb{R}^J$, $J > 1$: $p(y \mid x, W) = \prod_{j=1}^{J} \mathcal{N}(y_j \mid w_j^\top x, \sigma_j^2)$
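Because the likelihood factorizes over the outputs, the multi-output case reduces to $J$ independent regressions; a small illustrative sketch (synthetic data and variable names are my own, not from the notes):

```python
import numpy as np

# Multivariate (multi-output) linear regression: y in R^J.
# Since p(y | x, W) factorizes over the J outputs, each column of W can be
# estimated as an independent least-squares problem.
rng = np.random.default_rng(0)
N, D, J = 200, 4, 3
X = rng.normal(size=(N, D))
W_true = rng.normal(size=(D, J))
Y = X @ W_true + 0.1 * rng.normal(size=(N, J))

W_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)  # solves all J regressions at once
print(np.abs(W_hat - W_true).max())            # small estimation error
```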
if y cannot be well fitted by a linear function of x
apply a nonlinear transformation $\phi$ (a feature extractor) to x
as long as the parameters of $\phi$ are fixed, the model remains linear in the parameters: $p(y \mid x, \theta) = \mathcal{N}(y \mid w^\top \phi(x), \sigma^2)$
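A minimal sketch of this idea (illustrative, not from the notes): a fixed polynomial feature map keeps the model linear in $w$ even though the fit is nonlinear in $x$.

```python
import numpy as np

# Hypothetical example: fit y = sin(x) with a model linear in the parameters
# by using a fixed polynomial feature map phi(x) = [1, x, x^2, x^3].
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=50)
y = np.sin(x) + 0.1 * rng.normal(size=50)

Phi = np.vander(x, N=4, increasing=True)     # columns: 1, x, x^2, x^3
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # still ordinary least squares in w
print(w)
```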
Least squares estimation
minimizing the negative log likelihood (NLL): $\mathrm{NLL}(w, \sigma^2) = -\sum_{n=1}^{N} \log\left[\left(\frac{1}{2\pi\sigma^2}\right)^{1/2} \exp\left(-\frac{1}{2\sigma^2}(y_n - w^\top x_n)^2\right)\right] = \frac{1}{2\sigma^2}\sum_{n=1}^{N}(y_n - \hat{y}_n)^2 + \frac{N}{2}\log(2\pi\sigma^2)$
$\hat{y}_n = w^\top x_n$
The MLE is the point where: $\nabla_{w,\sigma} \mathrm{NLL}(w, \sigma^2) = 0$
minimizing the NLL with respect to $w$ is equivalent to minimizing the residual sum of squares (RSS): $\mathrm{RSS}(w) = \frac{1}{2}\sum_{n=1}^{N}(y_n - w^\top x_n)^2 = \frac{1}{2}\|Xw - y\|_2^2 = \frac{1}{2}(Xw - y)^\top(Xw - y)$
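Setting the gradient with respect to $\sigma^2$ to zero gives the standard MLE for the noise variance (stated here for completeness): $\hat{\sigma}^2_{\mathrm{mle}} = \frac{1}{N} \sum_{n=1}^{N} (y_n - \hat{w}_{\mathrm{mle}}^\top x_n)^2$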
OLS
$\nabla_w \mathrm{RSS}(w) = X^\top X w - X^\top y$
the normal equation (FOC): $X^\top X w = X^\top y$; the OLS solution: $\hat{w} = (X^\top X)^{-1} X^\top y$
the solution is unique (assuming $X$ has full column rank) since the Hessian is positive definite: $H(w) = \frac{\partial^2}{\partial w^2} \mathrm{RSS}(w) = X^\top X$
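A minimal numpy sketch of the OLS solution (illustrative; np.linalg.lstsq is used instead of forming $(X^\top X)^{-1}$ explicitly, which is better conditioned):

```python
import numpy as np

def ols_fit(X, y):
    """Ordinary least squares: minimizer of ||Xw - y||_2^2 (sketch).
    X: (N, D+1) design matrix with a leading column of ones (bias absorbed)."""
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w_hat

rng = np.random.default_rng(0)
N = 100
X = np.column_stack([np.ones(N), rng.normal(size=(N, 3))])
w_true = np.array([1.0, 2.0, -3.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)
print(ols_fit(X, y))  # close to w_true
```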
ridge regression: MAP estimation with a zero-mean Gaussian prior on the weights $p(w) = \mathcal{N}(w \mid 0, \tau^2 I)$
MAP estimate: $\hat{w}_{\mathrm{map}} = \arg\min_w \frac{1}{2\sigma^2}(y - Xw)^\top(y - Xw) + \frac{1}{2\tau^2} w^\top w = \arg\min_w \mathrm{RSS}(w) + \lambda \|w\|_2^2$
where $\lambda \triangleq \frac{\sigma^2}{\tau^2}$ is proportional to the strength of the prior, and $\|w\|_2^2 \triangleq \sum_{d=1}^{D} |w_d|^2 = w^\top w$
l2 regularization or weight decay
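A minimal sketch of the ridge solution $\hat{w} = (X^\top X + \lambda I)^{-1} X^\top y$ (illustrative, not from the notes; in practice the bias term is usually left unpenalized, e.g. by centering the data first):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression via the regularized normal equations (sketch).
    Note: this penalizes every coefficient; standard practice is to leave the
    bias unpenalized, e.g. by centering y and the columns of X beforehand."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
```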
Choosing the strength of the regularizer
the simple (but expensive) idea
try a finite number of distinct values
use cross validation to estimate their expected loss
a practical method
start with a highly constrained model (strong regularizer)
gradually relax the constraints (decrease the amount of regularization)
empirical Bayes approach: $\hat{\lambda} = \arg\max_\lambda \log p(\mathcal{D} \mid \lambda)$
gives essentially the same result as the CV estimate
can be done by fitting a single model
use gradient-based optimization instead of discrete search
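A sketch of the grid-search-plus-cross-validation idea for ridge (my own illustration; `cv_mse_ridge` and the grid are hypothetical names):

```python
import numpy as np

def cv_mse_ridge(X, y, lam, n_folds=5, seed=0):
    """Estimate expected squared error of ridge with strength lam by K-fold CV."""
    N, D = X.shape
    folds = np.array_split(np.random.default_rng(seed).permutation(N), n_folds)
    errs = []
    for val_idx in folds:
        tr_idx = np.setdiff1d(np.arange(N), val_idx)
        Xtr, ytr = X[tr_idx], y[tr_idx]
        # ridge fit on the training fold
        w = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(D), Xtr.T @ ytr)
        errs.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
    return float(np.mean(errs))

# Try a finite grid of candidate strengths and keep the one with lowest CV error:
# lambdas = np.logspace(-4, 2, 13)
# best_lam = min(lambdas, key=lambda lam: cv_mse_ridge(X, y, lam))
```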
Lasso regression
least absolute shrinkage and selection operator (LASSO): $\mathrm{PNLL}(w) = -\log p(\mathcal{D} \mid w) - \log p(w \mid \lambda) = \|Xw - y\|_2^2 + \lambda \|w\|_1$
$\ell_1$ regularization: MAP estimation with a Laplace prior $\mathrm{Lap}(w \mid \mu, b) \triangleq \frac{1}{2b} \exp\left(-\frac{|w - \mu|}{b}\right)$
other norms
in general: $\|w\|_q = \left(\sum_{d=1}^{D} |w_d|^q\right)^{1/q}$
q<1
even sparser solutions
the problem becomes non-convex
$q = 0$ ($\ell_0$ norm): $\|w\|_0 = \sum_{d=1}^{D} \mathbb{I}(|w_d| > 0)$
Why does l1 regularization yield sparse solutions?
lasso/ridge as the Lagrangian of a constrained optimization problem
lasso: $\min_w \mathrm{NLL}(w) + \lambda \|w\|_1$ is the Lagrangian of $\min_w \mathrm{NLL}(w)$ subject to $\|w\|_1 \leq B$
ridge: $\min_w \mathrm{NLL}(w) + \lambda \|w\|_2^2$ is the Lagrangian of $\min_w \mathrm{NLL}(w)$ subject to $\|w\|_2^2 \leq B$
Hard vs soft thresholding
Consider the partial derivatives of the lasso objective
adding the $\ell_1$ part: $\frac{\partial}{\partial w_d} \mathcal{L}(w) = (a_d w_d - c_d) + \lambda \frac{\partial}{\partial w_d}\|w\|_1 = \begin{cases} \{a_d w_d - c_d - \lambda\} & \text{if } w_d < 0 \\ [-c_d - \lambda, -c_d + \lambda] & \text{if } w_d = 0 \\ \{a_d w_d - c_d + \lambda\} & \text{if } w_d > 0 \end{cases}$ (for the objective above, $a_d = 2\sum_n x_{nd}^2$, and $c_d = 2\sum_n x_{nd}(y_n - w_{-d}^\top x_{n,-d})$ measures how correlated feature $d$ is with the residual obtained using the other features)
the solution
If $c_d < -\lambda$, so the feature is strongly negatively correlated with the residual, then the subgradient is zero at $\hat{w}_d = \frac{c_d + \lambda}{a_d} < 0$.
If $c_d \in [-\lambda, \lambda]$, so the feature is only weakly correlated with the residual, then the subgradient is zero at $\hat{w}_d = 0$.
If $c_d > \lambda$, so the feature is strongly positively correlated with the residual, then the subgradient is zero at $\hat{w}_d = \frac{c_d - \lambda}{a_d} > 0$. $\hat{w}_d(c_d) = \begin{cases} (c_d + \lambda)/a_d & \text{if } c_d < -\lambda \\ 0 & \text{if } c_d \in [-\lambda, \lambda] \\ (c_d - \lambda)/a_d & \text{if } c_d > \lambda \end{cases}$
We can write this as $\hat{w}_d = \mathrm{SoftThreshold}(c_d/a_d, \lambda/a_d)$
$\mathrm{SoftThreshold}(x, \delta) \triangleq \mathrm{sign}(x)\,(|x| - \delta)_+$
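A small numpy sketch of the soft-thresholding operator and the coordinate-descent update it induces for the lasso objective $\|Xw - y\|_2^2 + \lambda\|w\|_1$ (my own illustration; `lasso_coordinate_descent` is a hypothetical name):

```python
import numpy as np

def soft_threshold(x, delta):
    """SoftThreshold(x, delta) = sign(x) * max(|x| - delta, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - delta, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Coordinate descent for min_w ||Xw - y||_2^2 + lam * ||w||_1 (sketch)."""
    N, D = X.shape
    w = np.zeros(D)
    a = 2 * np.sum(X ** 2, axis=0)             # a_d = 2 * sum_n x_nd^2
    for _ in range(n_iters):
        for d in range(D):
            r = y - X @ w + X[:, d] * w[d]     # residual ignoring feature d
            c = 2 * X[:, d] @ r                # c_d: correlation with that residual
            w[d] = soft_threshold(c / a[d], lam / a[d])
    return w
```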
hard thresholding:
$\hat{w}_d = 0$ for $-\lambda \leq c_d \leq \lambda$
does not shrink the values of $w_d$ in the other cases
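For contrast, a hard-thresholding sketch (illustrative, not from the notes):

```python
import numpy as np

def hard_threshold(x, delta):
    """Zero out x when |x| <= delta; otherwise leave it unshrunk."""
    return np.where(np.abs(x) <= delta, 0.0, x)

# e.g. hard_threshold(c_d / a_d, lam / a_d) reproduces the rule above,
# whereas soft_threshold also subtracts lam / a_d from the surviving values.
```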
debiasing: the two-stage estimation process
run lasso to get the sparse estimate
run OLS using only the variables selected by lasso; the $\ell_1$ penalty shrinks the surviving coefficients toward zero, and the refit removes this bias
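A sketch of the two-stage procedure (illustrative; reuses the `lasso_coordinate_descent` sketch above):

```python
import numpy as np

def debiased_lasso(X, y, lam):
    """Two-stage sketch: lasso for support selection, then OLS on the support."""
    w_lasso = lasso_coordinate_descent(X, y, lam)   # sketch defined earlier
    support = np.flatnonzero(w_lasso)               # indices of selected variables
    w = np.zeros(X.shape[1])
    if support.size > 0:
        w[support], *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
    return w
```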
Regularization path
Plot the values $\hat{w}_d$ vs $\lambda$ (or vs the bound $B$) for each feature $d$.
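A sketch of computing and plotting the path on synthetic data (illustrative; reuses the `lasso_coordinate_descent` sketch above):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
w_true = np.array([5.0, -4.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ w_true + 0.5 * rng.normal(size=100)

lambdas = np.logspace(3, -1, 30)          # from strong to weak regularization
path = np.array([lasso_coordinate_descent(X, y, lam) for lam in lambdas])

plt.plot(lambdas, path)                   # one curve per feature d
plt.xscale("log")
plt.xlabel("lambda")
plt.ylabel("coefficient value")
plt.show()
```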
Group lasso
group sparsity
many parameters associated with a given variable
a vector of weights wd for variable d
If we want to exclude variable d, we have to force the whole subvector wd to go to zero
applications
Linear regression with categorical inputs: If the d’th variable is categorical with K possible levels, then it will be represented as a one-hot vector of length K, so to exclude variable d, we have to set the whole vector of incoming weights to 0.
Multinomial logistic regression: The d’th variable will be associated with C different weights, one per class, so to exclude variable d, we have to set the whole vector of outgoing weights to 0.
Neural networks: the k’th neuron will have multiple inputs, so if we want to “turn the neuron off”, we have to set all the incoming weights to zero. This allows us to use group sparsity to learn neural network structure.
Multi-task learning: each input feature is associated with C different weights, one per output task. If we want to use a feature for all of the tasks or none of the tasks, we should select weights at the group level.
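Group lasso typically replaces the $\ell_1$ penalty with $\sum_g \|w_g\|_2$ (a sum of unsquared group norms), which drives whole groups to zero at once. A minimal sketch of the corresponding proximal operator, block soft thresholding (illustrative, not from the notes):

```python
import numpy as np

def block_soft_threshold(w_g, delta):
    """Proximal operator of delta * ||w_g||_2: shrink the whole group toward zero,
    and set it exactly to zero when its norm is below delta."""
    norm = np.linalg.norm(w_g)
    if norm <= delta:
        return np.zeros_like(w_g)
    return (1.0 - delta / norm) * w_g
```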