L02 Linear Regression

Materials are adapted from Murphy, Kevin P., *Probabilistic Machine Learning: An Introduction*, MIT Press, 2022. This handout is for teaching only. DO NOT DISTRIBUTE.

Least squares linear regression

Terminology

Least squares estimation

OLS

The gradient of the residual sum of squares is

$$\nabla_{\mathbf{w}}RSS(\mathbf{w})=\mathbf{X}^T\mathbf{X}\mathbf{w}-\mathbf{X}^T\mathbf{y}$$

Setting the gradient to zero gives the normal equation (first-order condition):

$$\mathbf{X}^T\mathbf{X}\mathbf{w}=\mathbf{X}^T\mathbf{y}$$

Solving for $\mathbf{w}$ yields the OLS solution:

$$\hat{\mathbf{w}}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$

The solution is unique since the Hessian is positive definite (assuming $\mathbf{X}$ has full column rank):

$$\mathbf{H}(\mathbf{w})=\frac{\partial^2}{\partial\mathbf{w}^2}RSS(\mathbf{w})=\mathbf{X}^T\mathbf{X}$$
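
As a concrete illustration, here is a minimal NumPy sketch of OLS (all variable names illustrative). In practice one avoids explicitly inverting $\mathbf{X}^T\mathbf{X}$ and calls a least-squares solver instead:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N = 100 observations, D = 3 features, small noise.
N, D = 100, 3
X = rng.normal(size=(N, D))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)

# OLS via the normal equation: solve X^T X w = X^T y.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically preferable: a dedicated least-squares solver (SVD-based).
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_hat)    # close to w_true
print(w_lstsq)  # same solution, computed more stably
```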

Geometric interpretation of least squares

The least squares prediction $\hat{\boldsymbol{y}}$ is the orthogonal projection of $\boldsymbol{y}$ onto the span of the columns of $\mathbf{X}$:

$$\underset{\hat{\boldsymbol{y}} \in \operatorname{span}\left(\left\{\boldsymbol{x}_{:, 1}, \ldots, \boldsymbol{x}_{:, d}\right\}\right)}{\operatorname{argmin}}\|\boldsymbol{y}-\hat{\boldsymbol{y}}\|_{2} .$$

Weighted least squares
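
The standard weighted variant: give each observation $n$ a weight $\lambda_n > 0$ and collect them in $\boldsymbol{\Lambda}=\operatorname{diag}(\lambda_1,\ldots,\lambda_N)$. Minimizing the weighted residuals

$$\hat{\mathbf{w}} = \underset{\boldsymbol{w}}{\operatorname{argmin}}\,(\boldsymbol{y}-\mathbf{X}\boldsymbol{w})^{T}\boldsymbol{\Lambda}(\boldsymbol{y}-\mathbf{X}\boldsymbol{w}) = (\mathbf{X}^{T}\boldsymbol{\Lambda}\mathbf{X})^{-1}\mathbf{X}^{T}\boldsymbol{\Lambda}\boldsymbol{y}$$

recovers OLS when $\boldsymbol{\Lambda}=\mathbf{I}$.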

Measuring goodness of fit
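
Two standard summaries are the root mean squared error and the coefficient of determination (alongside visual checks such as residual plots):

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\left(y_n - \hat{y}_n\right)^2}, \qquad R^2 = 1 - \frac{\sum_n (y_n - \hat{y}_n)^2}{\sum_n (y_n - \bar{y})^2}$$

where $\bar{y}$ is the mean of the targets; $R^2=1$ indicates a perfect fit.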

Ridge regression
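
Ridge regression adds an $\ell_2$ penalty to the least squares objective; this shrinks the weights toward zero and keeps the problem well posed even when $\mathbf{X}^T\mathbf{X}$ is singular:

$$\mathcal{L}(\boldsymbol{w};\lambda) = \|\boldsymbol{y}-\mathbf{X}\boldsymbol{w}\|_2^2 + \lambda\|\boldsymbol{w}\|_2^2, \qquad \hat{\mathbf{w}}_{\text{ridge}} = (\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^T\boldsymbol{y}$$

A minimal NumPy sketch of the closed form (variable names illustrative; production solvers prefer Cholesky, QR, or SVD, and leave the offset term unpenalized):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: (X^T X + lam * I)^{-1} X^T y."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
```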

Choosing the strength of the regularizer
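
A common recipe is K-fold cross-validation over a grid of candidate $\lambda$ values, keeping the one with the lowest held-out error. A minimal sketch, reusing the illustrative ridge_fit helper above:

```python
import numpy as np

def cv_choose_lambda(X, y, lambdas, K=5, seed=0):
    """Return the lambda with the lowest K-fold cross-validated MSE."""
    N = X.shape[0]
    folds = np.random.default_rng(seed).permutation(N) % K  # fold id per row
    scores = []
    for lam in lambdas:
        errs = []
        for k in range(K):
            train, test = folds != k, folds == k
            w = ridge_fit(X[train], y[train], lam)
            errs.append(np.mean((y[test] - X[test] @ w) ** 2))
        scores.append(np.mean(errs))
    return lambdas[int(np.argmin(scores))]

# Example: search a logarithmic grid of strengths.
# best_lam = cv_choose_lambda(X, y, np.logspace(-4, 2, 20))
```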

Lasso regression
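
Lasso replaces the squared $\ell_2$ penalty with an $\ell_1$ penalty, which drives some coefficients exactly to zero and thus performs feature selection:

$$\mathcal{L}(\boldsymbol{w};\lambda) = \|\boldsymbol{y}-\mathbf{X}\boldsymbol{w}\|_2^2 + \lambda\|\boldsymbol{w}\|_1, \qquad \|\boldsymbol{w}\|_1=\sum_d |w_d|$$

Equivalently, one can minimize the RSS subject to the constraint $\|\boldsymbol{w}\|_1 \le B$; shrinking $B$ (i.e., growing $\lambda$) produces sparser solutions.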

Why does $\ell_1$ regularization yield sparse solutions?

Hard vs soft thresholding

Consider the partial derivatives of the lasso objective:
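
Following the coordinate-wise derivation in Murphy's text, the derivative of the RSS with respect to a single weight $w_d$ (others held fixed) is linear in $w_d$:

$$\frac{\partial}{\partial w_d}RSS(\boldsymbol{w}) = a_d w_d - c_d, \qquad a_d = 2\sum_n x_{nd}^2, \quad c_d = 2\sum_n x_{nd}\left(y_n - \boldsymbol{w}_{-d}^T \boldsymbol{x}_{n,-d}\right)$$

Here $c_d$ measures how correlated feature $d$ is with the residual left by the other features. Since $|w_d|$ is not differentiable at zero, adding the subgradient of $\lambda\|\boldsymbol{w}\|_1$ and setting it to zero gives the soft-thresholding update

$$\hat{w}_d = \operatorname{soft}\!\left(\frac{c_d}{a_d};\,\frac{\lambda}{a_d}\right), \qquad \operatorname{soft}(x;\delta) = \operatorname{sign}(x)\,(|x|-\delta)_{+}$$

so $\hat{w}_d$ is set exactly to zero whenever $|c_d| \le \lambda$ (soft thresholding), whereas subset selection zeroes a weight only when its unpenalized estimate is small (hard thresholding). A minimal coordinate descent sketch built on this update (illustrative, not an optimized solver):

```python
import numpy as np

def soft(x, delta):
    """Soft-thresholding operator: sign(x) * max(|x| - delta, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - delta, 0.0)

def lasso_cd(X, y, lam, n_iters=100):
    """Lasso via cyclic coordinate descent with soft-thresholding."""
    N, D = X.shape
    w = np.zeros(D)
    a = 2 * np.sum(X ** 2, axis=0)           # a_d for every feature
    for _ in range(n_iters):
        for d in range(D):
            r = y - X @ w + X[:, d] * w[d]   # residual excluding feature d
            c = 2 * X[:, d] @ r              # c_d
            w[d] = soft(c / a[d], lam / a[d])
    return w
```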

Regularization path

Plot the values $\hat{w}_d$ vs $\lambda$ (or vs the bound $B$) for each feature $d$.
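
One way to compute and plot the path, assuming scikit-learn and matplotlib are available (a sketch, not the handout's reference implementation):

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path

# X, y as in the earlier sketches; coefs has shape (n_features, n_alphas).
alphas, coefs, _ = lasso_path(X, y)

for d in range(coefs.shape[0]):
    plt.plot(alphas, coefs[d], label=f"feature {d}")
plt.xscale("log")
plt.gca().invert_xaxis()   # strongest regularization on the left
plt.xlabel(r"$\lambda$")
plt.ylabel(r"$\hat{w}_d$")
plt.legend()
plt.show()
```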

Group lasso
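
When the features come in known groups $g=1,\ldots,G$ (e.g., the dummy variables encoding one categorical feature), the group lasso penalizes the Euclidean norm of each group's coefficient block, zeroing out whole groups at once:

$$\mathcal{L}(\boldsymbol{w};\lambda) = \|\boldsymbol{y}-\mathbf{X}\boldsymbol{w}\|^2 + \lambda\sum_{g=1}^{G}\|\boldsymbol{w}_g\|_2$$

Note the group norms are not squared; squaring them would give ridge-like shrinkage without group sparsity.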

Elastic net (ridge and lasso combined)

$$\mathcal{L}\left(\boldsymbol{w}, \lambda_{1}, \lambda_{2}\right)=\|\boldsymbol{y}-\mathbf{X} \boldsymbol{w}\|^{2}+\lambda_{2}\|\boldsymbol{w}\|_{2}^{2}+\lambda_{1}\|\boldsymbol{w}\|_{1}$$
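
The $\ell_2$ term encourages correlated features to enter or leave the model together, while the $\ell_1$ term still yields sparsity. A usage sketch with scikit-learn; note that its ElasticNet re-parameterizes the two strengths as an overall alpha and an l1_ratio mixing weight, so the mapping to $(\lambda_1,\lambda_2)$ above is indirect:

```python
from sklearn.linear_model import ElasticNet

# alpha scales the total penalty; l1_ratio in [0, 1] mixes l1 vs. l2.
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(X, y)  # X, y as in the earlier sketches
print(model.coef_)
```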