L02 Regressions

We will cover:

  • Algorithms
    • Linear regression
    • Penalized regression
    • Nonlinear regression
    • Robust linear regression
  • Coding with Python
  • Financial applications

Least squares linear regression

Terminology

  • Linear regression model: $p(y \mid \mathbf{x}, \boldsymbol{\theta}) = \mathcal{N}(y \mid w_0 + \mathbf{w}^\top \mathbf{x}, \sigma^2)$

  • model parameters: $\boldsymbol{\theta} = (w_0, \mathbf{w}, \sigma^2)$
    • weights or regression coefficients: $\mathbf{w}$
    • offset or bias: $w_0$ ($b$)
    • by writing $\mathbf{x}$ as $[1, x_1, \dots, x_D]$, we can write $w_0 + \mathbf{w}^\top \mathbf{x}$ as $\mathbf{w}^\top \mathbf{x}$

  • simple linear regression: a single scalar input
  • multiple linear regression: multiple inputs, $\mathbf{x} \in \mathbb{R}^D$
  • multivariate linear regression: multiple outputs

  • if $y$ cannot be well fitted by a linear function of $\mathbf{x}$
    • apply a nonlinear transformation (feature extractor) $\boldsymbol{\phi}$ to $\mathbf{x}$: $p(y \mid \mathbf{x}, \boldsymbol{\theta}) = \mathcal{N}(y \mid \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}), \sigma^2)$
    • as long as the parameters of $\boldsymbol{\phi}$ are fixed, the model remains linear in the parameters

Least squares estimation

  • minimizing the negative log likelihood (NLL):
    $\mathrm{NLL}(\mathbf{w}, \sigma^2) = -\sum_{n=1}^N \log \mathcal{N}(y_n \mid \mathbf{w}^\top \mathbf{x}_n, \sigma^2) = \frac{1}{2\sigma^2}\sum_{n=1}^N (y_n - \mathbf{w}^\top \mathbf{x}_n)^2 + \frac{N}{2}\log(2\pi\sigma^2)$

    • The MLE is the point where $\nabla_{\mathbf{w}} \mathrm{NLL}(\mathbf{w}, \sigma^2) = \mathbf{0}$
  • up to constants that do not depend on $\mathbf{w}$, minimizing the NLL is equivalent to minimizing the residual sum of squares (RSS): $\mathrm{RSS}(\mathbf{w}) = \sum_{n=1}^N (y_n - \mathbf{w}^\top \mathbf{x}_n)^2$

OLS

the normal equation (FOC): $\mathbf{X}^\top \mathbf{X}\,\mathbf{w} = \mathbf{X}^\top \mathbf{y}$

the OLS solution: $\hat{\mathbf{w}}_{\text{ols}} = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{y}$

the solution is unique since the Hessian $\mathbf{X}^\top \mathbf{X}$ is positive definite (provided $\mathbf{X}$ has full column rank)
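
A minimal numerical sketch of the closed-form solution (synthetic data; the variable names and the use of np.linalg.solve on the normal equations are illustrative, not part of the lecture code):

import numpy as np

rng = np.random.default_rng(0)
N, D = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, D))])  # prepend a column of ones for the bias
w_true = np.array([1.0, 2.0, -3.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)

# Solve the normal equations (X^T X) w = X^T y directly
# (more stable than forming the explicit inverse)
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # close to w_true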

Statistical Properties of OLS (finite sample)

  • unbiasedness: $\mathbb{E}[\hat{\mathbf{w}} \mid \mathbf{X}] = \mathbf{w}$
  • variance: $\mathrm{Var}(\hat{\mathbf{w}} \mid \mathbf{X}) = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}$
  • Gauss-Markov: the OLS estimator is efficient in the class of linear unbiased estimators. That is, for any unbiased estimator $\tilde{\mathbf{w}}$ that is linear in $\mathbf{y}$, $\mathrm{Var}(\tilde{\mathbf{w}} \mid \mathbf{X}) \succeq \mathrm{Var}(\hat{\mathbf{w}} \mid \mathbf{X})$ in the matrix (positive semidefinite) sense.

Geometric interpretation of least squares

  • orthogonal projection: the fitted vector $\hat{\mathbf{y}} = \mathbf{X}\hat{\mathbf{w}}$ is the orthogonal projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$

  • projection matrix (hat matrix): $\mathbf{H} = \mathbf{X}(\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top$, so $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$

  • special case:

Weighted least squares

  • heteroskedastic regression: the noise variance depends on the observation, $p(y_n \mid \mathbf{x}_n, \boldsymbol{\theta}) = \mathcal{N}(y_n \mid \mathbf{w}^\top \mathbf{x}_n, \sigma_n^2)$

  • weighted linear regression: $p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\theta}) = \mathcal{N}(\mathbf{y} \mid \mathbf{X}\mathbf{w}, \boldsymbol{\Lambda}^{-1})$, where $\boldsymbol{\Lambda} = \mathrm{diag}(1/\sigma_n^2)$ is the precision matrix

  • MLE (weighted least squares estimate): $\hat{\mathbf{w}} = (\mathbf{X}^\top \boldsymbol{\Lambda}\mathbf{X})^{-1}\mathbf{X}^\top \boldsymbol{\Lambda}\mathbf{y}$
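
A small sketch of weighted least squares with statsmodels (synthetic heteroskedastic data; the noise model and the weights 1/sigma_n^2 are illustrative assumptions):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 300
x = rng.uniform(0, 10, n)
sigma = 0.5 + 0.3 * x                      # noise standard deviation grows with x
y = 1.0 + 2.0 * x + sigma * rng.normal(size=n)

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()
wls_fit = sm.WLS(y, X, weights=1.0 / sigma**2).fit()  # weights proportional to 1/variance
print(ols_fit.params, wls_fit.params)                 # WLS gives lower-variance estimates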

Measuring goodness of fit

    • T-total: $\mathrm{TSS} = \sum_n (y_n - \bar{y})^2$
    • E-explained: $\mathrm{ESS} = \sum_n (\hat{y}_n - \bar{y})^2$
    • R-residual: $\mathrm{RSS} = \sum_n (y_n - \hat{y}_n)^2$
    • with an intercept, $\mathrm{TSS} = \mathrm{ESS} + \mathrm{RSS}$, so $R^2 = \mathrm{ESS}/\mathrm{TSS} = 1 - \mathrm{RSS}/\mathrm{TSS}$
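
A short sketch verifying the sum-of-squares decomposition and $R^2$ on synthetic data (the data are an assumption; the check against statsmodels' rsquared is only for confirmation):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 3.0 + 1.5 * x + rng.normal(size=100)

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()
y_hat = fit.predict(X)

tss = np.sum((y - y.mean())**2)       # total sum of squares
ess = np.sum((y_hat - y.mean())**2)   # explained sum of squares
rss = np.sum((y - y_hat)**2)          # residual sum of squares
print(ess / tss, 1 - rss / tss, fit.rsquared)  # all three agree when an intercept is included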



Penalized (linear) regressions

Ridge regression

  • Ridge regression: MAP estimation with a zero-mean Gaussian prior on the weights, $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \tau^2 \mathbf{I})$

  • MAP estimate
    $\hat{\mathbf{w}}_{\text{map}} = \arg\min_{\mathbf{w}} \mathrm{RSS}(\mathbf{w}) + \lambda \lVert \mathbf{w} \rVert_2^2 = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^\top \mathbf{y}$

where $\lambda = \sigma^2/\tau^2$ is proportional to the strength of the prior, and $\lVert \mathbf{w} \rVert_2^2 = \sum_d w_d^2 = \mathbf{w}^\top \mathbf{w}$

  • this technique is called $\ell_2$ regularization or weight decay

Choosing the strength of the regularizer

  • the simple (but expensive) idea
    • try a finite number of distinct values
    • use cross-validation to estimate their expected loss (see the sketch after this list)
  • a practical method
    • start with a highly constrained model (strong regularizer)
    • gradually relax the constraints (decrease the amount of regularization)
  • empirical Bayes approach:
    • get the same result as the CV estimate
    • can be done by fitting a single model
    • use gradient-based optimization instead of discrete search
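
A sketch of the "finite grid + cross-validation" idea with scikit-learn (synthetic data; the grid of alphas and cv=5 are illustrative choices, and RidgeCV is shown only as a convenience wrapper for the same search):

import numpy as np
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 20))
w = np.zeros(20)
w[:5] = [1.0, -2.0, 0.5, 3.0, -1.5]
y = X @ w + rng.normal(size=200)

# Try a finite grid of regularization strengths and score each by cross-validation
alphas = np.logspace(-3, 3, 13)
cv_scores = [cross_val_score(Ridge(alpha=a), X, y, cv=5,
                             scoring='neg_mean_squared_error').mean()
             for a in alphas]
best_alpha = alphas[int(np.argmax(cv_scores))]

# RidgeCV performs an equivalent search (efficient leave-one-out CV by default)
ridge_cv = RidgeCV(alphas=alphas).fit(X, y)
print(best_alpha, ridge_cv.alpha_)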

Lasso regression

  • least absolute shrinkage and selection operator (LASSO)

  • $\ell_1$-regularization: MAP estimation with a zero-mean Laplace prior on the weights, $p(\mathbf{w}) = \prod_d \mathrm{Lap}(w_d \mid 0, 1/\lambda)$, giving the objective $\mathrm{NLL}(\mathbf{w}) + \lambda \lVert \mathbf{w} \rVert_1$

  • other norms
    • in general, $\ell_q$ "norms" with $q < 1$:
      • even sparser solutions
      • but the problem becomes non-convex
    • $\ell_0$ (norm): $\lVert \mathbf{w} \rVert_0 = \sum_d \mathbb{I}(w_d \neq 0)$, the number of nonzero entries

Why does $\ell_1$ regularization yield sparse solutions?

  • Lagrangian vs. constrained quadratic program formulations
    • lasso: $\min_{\mathbf{w}} \mathrm{NLL}(\mathbf{w}) + \lambda \lVert \mathbf{w} \rVert_1$, equivalently $\min_{\mathbf{w}} \mathrm{NLL}(\mathbf{w})$ subject to $\lVert \mathbf{w} \rVert_1 \le B$

    • ridge: $\min_{\mathbf{w}} \mathrm{NLL}(\mathbf{w}) + \lambda \lVert \mathbf{w} \rVert_2^2$, equivalently $\min_{\mathbf{w}} \mathrm{NLL}(\mathbf{w})$ subject to $\lVert \mathbf{w} \rVert_2^2 \le B$

  • the $\ell_1$ constraint set is a diamond with corners on the coordinate axes, so the constrained optimum often lies at a corner where some coordinates are exactly zero; the $\ell_2$ ball has no corners, so ridge shrinks coefficients but rarely zeroes them

Hard vs soft thresholding

Consider the partial derivatives of the lasso objective $\mathcal{L}(\mathbf{w}) = \mathrm{NLL}(\mathbf{w}) + \lambda \lVert \mathbf{w} \rVert_1$

  • the NLL part
    • FOC: $\frac{\partial}{\partial w_d}\mathrm{NLL}(\mathbf{w}) = a_d w_d - c_d$, where $a_d = \sum_n x_{nd}^2$ and $c_d = \sum_n x_{nd}\big(y_n - \mathbf{w}_{-d}^\top \mathbf{x}_{n,-d}\big)$ measures how correlated feature $d$ is with the residual obtained using the other features

  • the solution: $\hat{w}_d = c_d / a_d$
  • adding the $\lambda \lVert \mathbf{w} \rVert_1$ part: the objective is non-differentiable at $w_d = 0$, so we use the subgradient $\partial_{w_d}\mathcal{L} = (a_d w_d - c_d) + \lambda\,\partial \lvert w_d \rvert$

  • the solution
    • If $c_d < -\lambda$, so the feature is strongly negatively correlated with the residual, then the subgradient is zero at $\hat{w}_d = (c_d + \lambda)/a_d < 0$.
    • If $c_d \in [-\lambda, \lambda]$, so the feature is only weakly correlated with the residual, then the subgradient is zero at $\hat{w}_d = 0$.
    • If $c_d > \lambda$, so the feature is strongly positively correlated with the residual, then the subgradient is zero at $\hat{w}_d = (c_d - \lambda)/a_d > 0$.

  • We can write this as $\hat{w}_d = \mathrm{SoftThreshold}(c_d/a_d,\ \lambda/a_d)$, where $\mathrm{SoftThreshold}(x, \delta) = \mathrm{sign}(x)\,(\lvert x \rvert - \delta)_+$
  • hard thresholding:
    • sets $\hat{w}_d = 0$ for $\lvert c_d \rvert \le \lambda$
    • does not shrink the values of $\hat{w}_d$ for the other cases
  • debiasing: the two-stage estimation process (see the sketch after this list)
    • run lasso to get the sparse estimate (variable selection)
    • run OLS using only the variables selected by the lasso
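
A sketch of soft thresholding and the two-stage debiasing procedure with scikit-learn (synthetic sparse data; alpha=0.1 and the true support are illustrative assumptions, not values from the lecture):

import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def soft_threshold(x, delta):
    """SoftThreshold(x, delta) = sign(x) * max(|x| - delta, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - delta, 0.0)

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 50))
w_true = np.zeros(50)
w_true[[3, 17, 42]] = [2.0, -3.0, 1.5]
y = X @ w_true + 0.5 * rng.normal(size=200)

# Stage 1: lasso selects variables but shrinks their coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)
support = np.flatnonzero(lasso.coef_)

# Stage 2: debias by refitting OLS on the selected variables only
ols = LinearRegression().fit(X[:, support], y)
print(soft_threshold(np.array([-2.0, 0.3, 1.5]), 0.5))
print(support, ols.coef_)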

Regularization path

Plot the values $\hat{w}_d(\lambda)$ vs $\lambda$ (or vs the bound $B$) for each feature $d$.

Group lasso

  • group sparsity
    • many parameters may be associated with a given input variable
    • a vector of weights $\mathbf{w}_j$ for variable $j$
    • if we want to exclude variable $j$, we have to force the whole subvector $\mathbf{w}_j$ to go to zero (see the sketch after this list)
  • applications
    • Linear regression with categorical inputs: if the $j$'th variable is categorical with $K$ possible levels, it is represented as a one-hot vector of length $K$, so to exclude variable $j$ we have to set the whole vector of incoming weights to zero.
    • Multinomial logistic regression: the $j$'th variable will be associated with $C$ different weights, one per class, so to exclude variable $j$ we have to set the whole vector of outgoing weights to zero.
    • Neural networks: the $j$'th neuron will have multiple inputs, so if we want to "turn the neuron off", we have to set all of its incoming weights to zero. This allows us to use group sparsity to learn neural network structure.
    • Multi-task learning: each input feature is associated with $T$ different weights, one per output task. If we want to use a feature for all of the tasks or none of them, we should select weights at the group level.
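
Group lasso is not implemented in scikit-learn itself; the sketch below only shows the block (group) soft-thresholding operator that proximal-gradient solvers apply to each group's weight subvector, under the assumption that groups are given as index lists (all names here are illustrative):

import numpy as np

def group_soft_threshold(w, groups, delta):
    """Shrink each group's subvector toward zero; zero it out entirely
    when its Euclidean norm is below delta."""
    w = w.astype(float).copy()
    for idx in groups:
        norm = np.linalg.norm(w[idx])
        w[idx] = 0.0 if norm <= delta else (1.0 - delta / norm) * w[idx]
    return w

w = np.array([0.2, -0.1, 3.0, -2.0, 0.05])
groups = [[0, 1], [2, 3], [4]]
print(group_soft_threshold(w, groups, delta=0.5))
# small groups are set exactly to zero; the large group is only shrunk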

Elastic net (ridge and lasso combined)
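
A minimal scikit-learn sketch of combining the two penalties (synthetic data; note that sklearn parameterizes the mix through alpha and l1_ratio rather than two separate penalty strengths, and the grids below are illustrative):

import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 30))
w_true = np.zeros(30)
w_true[:4] = [1.5, -2.0, 0.0, 3.0]
y = X @ w_true + 0.5 * rng.normal(size=200)

# l1_ratio interpolates between ridge-like (near 0) and lasso-like (1) penalties;
# both it and the overall strength are chosen by cross-validation
enet = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8, 0.95], cv=5).fit(X, y)
print(enet.alpha_, enet.l1_ratio_, np.flatnonzero(enet.coef_))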

Nonlinear regression

Polynomial regression

  • nonlinear in the input $x$ (uses powers $x, x^2, \dots, x^d$ as features)
  • still linear in the parameters (the $\beta$'s)
  • can be fit exactly like multiple linear regression
  • the polynomial function imposes global structure on the estimated fit

Coding: Polynomial regression



  • imports
import numpy as np
import pandas as pd
import patsy as pt
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.api as sm
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model
import warnings
warnings.filterwarnings('ignore')
  • load wage dataset
wage_df = pd.read_csv('./data/Wage.csv')
wage_df = wage_df.drop(wage_df.columns[0], axis=1)
wage_df['education'] = wage_df['education'].map({'1. < HS Grad': 1.0, 
                                                 '2. HS Grad': 2.0, 
                                                 '3. Some College': 3.0,
                                                 '4. College Grad': 4.0,
                                                 '5. Advanced Degree': 5.0
                                                })
wage_df.head()
  • preview of the dataset

   year  age  maritl            race      education  region              jobclass        health          health_ins  logwage   wage
0  2006   18  1. Never Married  1. White  1.0        2. Middle Atlantic  1. Industrial   1. <=Good       2. No       4.318063  …
1  2004   24  1. Never Married  1. White  4.0        2. Middle Atlantic  2. Information  2. >=Very Good  2. No       4.255273  …
2  2003   45  2. Married        1. White  3.0        2. Middle Atlantic  1. Industrial   1. <=Good       1. Yes      4.875061  …
3  2003   43  2. Married        3. Asian  4.0        2. Middle Atlantic  2. Information  2. >=Very Good  1. Yes      5.041393  …
4  2005   50  4. Divorced       1. White  2.0        2. Middle Atlantic  2. Information  1. <=Good       1. Yes      4.318063  …
  • Regression & results

# Derive 4 degree polynomial features of age
degree = 4
f = ' + '.join(['np.power(age, {})'.format(i)
                for i in np.arange(1, degree+1)])
X = pt.dmatrix(f, wage_df)
y = np.asarray(wage_df['wage'])

# Fit linear model
model = sm.OLS(y, X).fit()
y_hat = model.predict(X)
model.summary()
  • Plots

# STATS
# ----------------------------------
# Covariance of coefficient estimates
mse = np.sum(np.square(y_hat - y)) / y.size
cov = mse * np.linalg.inv(X.T @ X)
# ...or alternatively this stat is provided by stats models:
#cov = model.cov_params()

# Calculate variance of f(x)
var_f = np.diagonal((X @ cov) @ X.T)
# Derive standard error of f(x) from variance
se       = np.sqrt(var_f)
conf_int = 2*se

# PLOT
# ----------------------------------
# Setup axes
fig, ax = plt.subplots(figsize=(10,10))

# Plot datapoints
sns.scatterplot(x='age', y='wage',
                color='tab:gray',
                alpha=0.2,
                ax=ax,
                data=pd.concat([wage_df['age'], wage_df['wage']], axis=1));

# Plot estimated f(x)
sns.lineplot(x=X[:, 1], y=y_hat, ax=ax, color='blue');

# Plot confidence intervals
sns.lineplot(x=X[:, 1], y=y_hat+conf_int, color='blue');
sns.lineplot(x=X[:, 1], y=y_hat-conf_int, color='blue');
# dash the confidence-interval lines
ax.lines[1].set_linestyle("--")
ax.lines[2].set_linestyle("--")

Step functions

  • no global structure
  • break the range of into bins

  • fit a different constant in each bin

  • unless there are natural breakpoints in the predictors, piecewise-constant functions can miss the action

Coding: Step function

### Step function
steps = 6

# Segment age into `steps` equal-width bins
cuts = pd.cut(wage_df['age'], steps)
X = np.asarray(pd.get_dummies(cuts)).astype(float)
y = np.asarray(wage_df['wage'])

# Fit linear regression model (one constant per bin, no separate intercept)
model = sm.OLS(y, X).fit()
y_hat = model.predict(X)


# PLOT
# ----------------------------------
# Setup axes
fig, ax = plt.subplots(figsize=(10,10))

# Plot datapoints
sns.scatterplot(x='age', y='wage',
                color='tab:gray',
                alpha=0.2,
                ax=ax,
                data=pd.concat([wage_df['age'], 
                wage_df['wage']], axis=1));

# Plot estimated f(x)
sns.lineplot(x=wage_df['age'], y=y_hat, ax=ax, color='blue')

Basis functions

  • Polynomial and piecewise-constant regression models are special cases of a basis function approach.
  • Basis functions: a family of functions or transformations that can be applied to a variable $X$: $b_1(X), b_2(X), \dots, b_K(X)$.
  • The model
    $y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \dots + \beta_K b_K(x_i) + \epsilon_i$

  • some examples of basis functions
    • polynomial functions
    • piecewise constant functions
    • wavelets
    • Fourier series
    • splines

Regression splines

Piecewise Polynomials

  • fitting separate low-degree polynomials over different regions of $X$

  • example: piecewise cubic polynomial with a single knot at a point $c$:
    $y_i = \begin{cases} \beta_{01} + \beta_{11} x_i + \beta_{21} x_i^2 + \beta_{31} x_i^3 + \epsilon_i & \text{if } x_i < c \\ \beta_{02} + \beta_{12} x_i + \beta_{22} x_i^2 + \beta_{32} x_i^3 + \epsilon_i & \text{if } x_i \ge c \end{cases}$

  • degrees of freedom: each cubic uses 4 parameters, so the example above uses 8 degrees of freedom

  • using more knots leads to a more flexible piecewise polynomial

Constraints and Splines

  • piecewise cubic: no constraint at the knots
  • continuous piecewise cubic: continuity of $f$ at each knot
  • cubic spline: continuity of $f$, $f'$, and $f''$ at each knot
  • degree-$d$ spline: a piecewise degree-$d$ polynomial, with continuity in derivatives up to degree $d-1$ at each knot

The Spline Basis Representation

  • regression spline (cubic spline with $K$ knots):
    $y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \dots + \beta_{K+3} b_{K+3}(x_i) + \epsilon_i$

  • the spline basis
    • polynomial basis: $x$, $x^2$, and $x^3$
    • one truncated power basis function per knot $\xi$: $h(x, \xi) = (x - \xi)_+^3$

  • splines can have high variance at the outer range of the predictors
  • natural spline
    • a regression spline with additional boundary constraints
    • the function is required to be linear at the boundary
    • natural splines generally produce more stable estimates at the boundaries

Choosing the Number and Locations of the Knots

  • locations of knots (given the number fixed)
    • place more knots where the function might vary most rapidly, and fewer knots where it seems more stable
    • in practice: place knots in a uniform fashion
      • specify the desired degrees of freedom
      • the software automatically places the knots at uniform quantiles of the data
  • number of knots
    • try out different numbers of knots
    • cross-validation

Comparison to Polynomial Regression

  • natural cubic spline with 15 degrees of freedom vs. degree-15 polynomial
  • natural cubic spline works better on boundaries
  • in general, natural cubic spline produces more stable estimates

Smoothing splines

An Overview of Smoothing Splines

  • fitting a curve $g(x)$: we want the RSS $\sum_{i=1}^n (y_i - g(x_i))^2$ to be small
  • but $g$ should also be a smooth function (WHY? & HOW?)
  • a smoothing spline minimizes the following objective
    $\sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int g''(t)^2\,dt$

  • the smoothing spline is a natural cubic spline with knots at $x_1, \dots, x_n$
    • piecewise cubic polynomial with knots at the unique values of $x_1, \dots, x_n$
    • continuous first and second derivatives at each knot
    • linear in the region outside of the extreme knots
    • it is a shrunken version of such a natural cubic spline, with $\lambda$ controlling the amount of shrinkage
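
A small smoothing-spline sketch using scipy's UnivariateSpline (synthetic data; note that scipy controls smoothness through a residual bound s rather than the penalty $\lambda$ above, so this is an analogous but not identical parameterization):

import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + 0.3 * rng.normal(size=200)

# Larger s allows a larger residual sum of squares, hence a smoother (more shrunken) fit
spl = UnivariateSpline(x, y, k=3, s=len(x) * 0.3**2)
x_grid = np.linspace(0, 10, 200)
y_smooth = spl(x_grid)
print(y_smooth[:5])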

Coding: Regression spline

# Putting confidence interval calcs into function for convenience.
def confidence_interval(X, y, y_hat):
    """Compute 5% confidence interval for linear regression"""
    # STATS
    # ----------------------------------    
    # Covariance of coefficient estimates
    mse = np.sum(np.square(y_hat - y)) / y.size
    cov = mse * np.linalg.inv(X.T @ X)
    # ...or alternatively this stat is provided by stats models:
    #cov = model.cov_params()
    
    # Calculate variance of f(x)
    var_f = np.diagonal((X @ cov) @ X.T)
    # Derive standard error of f(x) from variance
    se       = np.sqrt(var_f)
    conf_int = 2*se
    return conf_int

# Fit a cubic regression spline basis (df=7, including the intercept)

# Use patsy to generate entire matrix of basis functions
X = pt.dmatrix('bs(age, df=7, degree=3, include_intercept=True)', wage_df)
y = np.asarray(wage_df['wage'])

# Fit linear regression model
model = sm.OLS(y, X).fit()
y_hat = model.predict(X)
conf_int = confidence_interval(X, y, y_hat)

# PLOT
# ----------------------------------
# Setup axes
fig, ax = plt.subplots(figsize=(10,10))

# Plot datapoints
sns.scatterplot(x='age', y='wage',
                color='tab:gray',
                alpha=0.2,
                ax=ax,
                data=pd.concat([wage_df['age'], wage_df['wage']], axis=1));

# Plot estimated f(x)
sns.lineplot(x=wage_df['age'], y=y_hat, ax=ax, color='blue');

# Plot confidence intervals
sns.lineplot(x=wage_df['age'], y=y_hat+conf_int, color='blue');
sns.lineplot(x=wage_df['age'], y=y_hat-conf_int, color='blue');
# dash the confidence-interval lines
ax.lines[1].set_linestyle("--")
ax.lines[2].set_linestyle("--")

Coding: Natural spline

# Fit a natural spline with seven degrees of freedom

# Use patsy to generate entire matrix of basis functions
X = pt.dmatrix('cr(age, df=7)', wage_df)     
# REVISION NOTE: Something funky happens when df=6

y = np.asarray(wage_df['wage'])

# Fit linear regression model
model = sm.OLS(y, X).fit()
y_hat = model.predict(X)
conf_int = confidence_interval(X, y, y_hat)

# PLOT
# ----------------------------------
# Setup axes
fig, ax = plt.subplots(figsize=(10,10))

# Plot datapoints
sns.scatterplot(x='age', y='wage',
                color='tab:gray',
                alpha=0.2,
                ax=ax,
                data=pd.concat([wage_df['age'], wage_df['wage']], axis=1));

# Plot estimated f(x)
sns.lineplot(x=wage_df['age'], y=y_hat, ax=ax, color='blue');

# Plot confidence intervals
sns.lineplot(x=wage_df['age'], y=y_hat+conf_int, color='blue');
sns.lineplot(x=wage_df['age'], y=y_hat-conf_int, color='blue');
# dash the confidence-interval lines
ax.lines[1].set_linestyle("--")
ax.lines[2].set_linestyle("--")

Local regression

computing the fit at a target point using only the nearby training observations

Algorithm

Algorithm: Local Regression at $x_0$
1. Gather the fraction $s = k/n$ of training points whose $x_i$ are closest to $x_0$.
2. Assign a weight $K_{i0} = K(x_i, x_0)$ to each point in this neighborhood, so that the point furthest from $x_0$ has weight zero and the closest has the highest weight. All but these $k$ nearest neighbors get weight zero.
3. Fit a weighted least squares regression of the $y_i$ on the $x_i$ using the aforementioned weights, by finding $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize $\sum_{i=1}^n K_{i0}(y_i - \beta_0 - \beta_1 x_i)^2$.
4. The fitted value at $x_0$ is given by $\hat{f}(x_0) = \hat{\beta}_0 + \hat{\beta}_1 x_0$.

Local linear regression
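
A minimal local (linear) regression sketch using statsmodels' lowess, the same smoother used in the GAM code later in this lecture (synthetic data; frac=0.3 is an illustrative span):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 300)
y = np.sin(x) + 0.3 * rng.normal(size=300)

# frac is the span: the fraction of nearby points used for each local weighted fit
smoothed = sm.nonparametric.lowess(y, x, frac=0.3)  # returns (x, fitted) pairs sorted by x
print(smoothed[:5])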

Generalized additive models

GAMs for Regression Problems

  • the multiple linear regression model
    $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_i$

  • GAM: replace each linear component $\beta_j x_{ij}$ with a smooth nonlinear function $f_j(x_{ij})$
    $y_i = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + \dots + f_p(x_{ip}) + \epsilon_i$
An Example (using natural spline)

  • year and age are quantitative variables
  • education is qualitative with five levels: <HS, HS, <Coll, Coll, >Coll
  • model with natural splines:
    $\text{wage} = \beta_0 + f_1(\text{year}) + f_2(\text{age}) + f_3(\text{education}) + \epsilon$, where $f_1$ and $f_2$ are natural splines and $f_3$ is fit via dummy variables

An Example (using smoothing spline)

Pros and Cons of GAMs

  • Pros
    • GAMs automatically model non-linear relationships that standard linear regression will miss.

    • The non-linear fits can potentially make more accurate predictions for the response $Y$.

    • We can examine the effect of each $X_j$ on $Y$ individually while holding all of the other variables fixed.

    • The smoothness of the function $f_j$ for the variable $X_j$ can be summarized via degrees of freedom.

  • Cons: the model is restricted to be additive.

Coding: GAM


# Use patsy to generate entire matrix of basis functions
X = pt.dmatrix('cr(year, df=4)+cr(age, df=5) + education', wage_df)
y = np.asarray(wage_df['wage'])

# Fit linear regression model
model = sm.OLS(y, X).fit()
y_hat = model.predict(X)
conf_int = confidence_interval(X, y, y_hat)
# Plot estimated f(year)
sns.lineplot(x=wage_df['year'], y=y_hat)
# Plot estimated f(age)
sns.lineplot(x=wage_df['age'], y=y_hat);
  • Comparing GAM configurations with ANOVA
# Model 1
X = pt.dmatrix('cr(age, df=5) + education', wage_df)
y = np.asarray(wage_df['wage'])
model1 = sm.OLS(y, X).fit(disp=0)
# Model 2
X = pt.dmatrix('year+cr(age, df=5) + education', wage_df)
y = np.asarray(wage_df['wage'])
model2 = sm.OLS(y, X).fit(disp=0)
# Model 3
X = pt.dmatrix('cr(year, df=4)+cr(age, df=5) + education', wage_df)
y = np.asarray(wage_df['wage'])
model3 = sm.OLS(y, X).fit(disp=0)

# Compare models with ANOVA
display(sm.stats.anova_lm(model1, model2, model3))
   df_resid           ssr  df_diff       ss_diff          F  Pr(>F)
0    2994.0  3.750437e+06      0.0           NaN        NaN     NaN
1    2993.0  3.732809e+06      1.0  17627.473318  14.129318
2    2991.0  3.731516e+06      2.0   1293.696286   0.518482
display(model3.summary())
  • Local regression GAM

x = np.asarray(wage_df['age'])
y = np.asarray(wage_df['wage'])
# Create lowess feature for age
wage_df['age_lowess'] = sm.nonparametric.lowess(
    y, x, frac=.7, return_sorted=False)

# Fit linear regression model using the smoothed age feature
X = pt.dmatrix('cr(year, df=4) + age_lowess + education', wage_df)
y = np.asarray(wage_df['wage'])
model = sm.OLS(y, X).fit()
model.summary()

Robust linear regression

  • Gaussian noise assumption: $p(y \mid \mathbf{x}, \boldsymbol{\theta}) = \mathcal{N}(y \mid \mathbf{w}^\top \mathbf{x}, \sigma^2)$

    • OLS estimator = MLE
    • poor performance in the presence of outliers
  • Robust regression: replace the Gaussian distribution for the response
    variable with a distribution that has heavy tails

Likelihood    Prior        Posterior    Name
Gaussian      Uniform      Point        Least squares
Student-$t$   Uniform      Point        Robust regression
Laplace       Uniform      Point        Robust regression
Gaussian      Gaussian     Point        Ridge
Gaussian      Laplace      Point        Lasso
Gaussian      Gauss-Gamma  Gauss-Gamma  Bayesian linear regression

Laplace likelihood

  • $p(y \mid \mathbf{x}, \mathbf{w}, b) = \mathrm{Lap}(y \mid \mathbf{w}^\top \mathbf{x}, b) \propto \exp\!\big(-\tfrac{1}{b}\lvert y - \mathbf{w}^\top \mathbf{x} \rvert\big)$

Student-$t$ likelihood

  • $p(y \mid \mathbf{x}, \mathbf{w}, \sigma^2, \nu) = \mathcal{T}(y \mid \mathbf{w}^\top \mathbf{x}, \sigma^2, \nu)$
  • We can fit this model using SGD or EM

Huber loss

  • An alternative to minimizing the NLL using a Laplace or Student-$t$ likelihood is to use the Huber loss:
    $\ell_\delta(r) = \begin{cases} r^2/2 & \text{if } \lvert r \rvert \le \delta \\ \delta \lvert r \rvert - \delta^2/2 & \text{if } \lvert r \rvert > \delta \end{cases}$

  • It is equivalent to the $\ell_2$ loss for errors that are smaller than $\delta$, and equivalent to the $\ell_1$ loss for larger errors.

  • The Huber loss function is everywhere differentiable.

  • Consequently, optimizing the Huber loss is much faster than using the Laplace likelihood.

  • $\delta$ controls the degree of robustness (see the sketch below)

    • set by hand
    • or by cross-validation
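
A sketch contrasting OLS with a Huber-loss fit on data containing outliers (synthetic data; scikit-learn's HuberRegressor also estimates a scale parameter, so its epsilon plays a role analogous to, but not identical with, the $\delta$ above):

import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(8)
x = rng.uniform(0, 10, 200)
y = 1.0 + 2.0 * x + rng.normal(size=200)
y[:10] += 50.0                     # inject a few large outliers

X = x.reshape(-1, 1)
ols = LinearRegression().fit(X, y)
huber = HuberRegressor(epsilon=1.35).fit(X, y)
print(ols.coef_, huber.coef_)      # the Huber fit is much less affected by the outliers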

Applications of Regressions in Finance