L03 Classification

We will cover:

  • Algorithms
    • Logistic regression
    • Generative models for classification (LDA, QDA, naive Bayes)
    • GAM
    • Support Vector Machine
  • Coding with Python
  • Financial applications

Introduction

Examples of classification problems

  • A person arrives at the emergency room with a set of symptoms that could possibly be attributed to one of three medical conditions. Which of the three conditions does the individual have?
  • An online banking service must be able to determine whether or not a transaction being performed on the site is fraudulent, on the basis of the user’s IP address, past transaction history, and so forth.
  • On the basis of DNA sequence data for a number of patients with and without a given disease, a biologist would like to figure out which DNA mutations are deleterious (disease-causing) and which are not.

Regression is not appropriate for classification tasks

  • regression methods cannot accommodate a qualitative response with more than two classes

  • even for a binary response, regression methods do not provide meaningful estimates of the conditional class probabilities $\Pr(Y = k \mid X = x)$ (a linear fit can produce values outside $[0, 1]$)

The Default dataset

      default  student       balance       income
  1   No       No       729.5264952   44361.62507
  2   No       Yes      817.1804066    12106.1347
  3   No       No       1073.549164   31767.13895
  4   No       No       529.2506047   35704.49394
  5   No       No       785.6558829   38463.49588

Logistic regression

The logistic model

  • The probability of default: $p(X) = \Pr(\text{default} = \text{Yes} \mid \text{balance})$

  • Linear regression would model this as $p(X) = \beta_0 + \beta_1 X$, which can produce probabilities below 0 or above 1

  • the logistic function keeps the output in $(0, 1)$:
    $p(X) = \dfrac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$, equivalently $\log\left(\dfrac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X$ (the log odds, or logit)

Estimation and Predictions

  • the likelihood function (maximized to estimate $\beta_0$ and $\beta_1$):
    $\ell(\beta_0, \beta_1) = \prod_{i: y_i = 1} p(x_i) \prod_{i': y_{i'} = 0} \left(1 - p(x_{i'})\right)$

  • prediction: for an individual with a given balance $x$, plug the estimates into the logistic function,
    $\hat p(x) = \dfrac{e^{\hat\beta_0 + \hat\beta_1 x}}{1 + e^{\hat\beta_0 + \hat\beta_1 x}}$
    (e.g. a balance of \$1,000 gives $\hat p \approx 0.006$ under the fit below)

  • prediction: the same can be done for student status, using the dummy variable student[Yes]
    (the fit below gives $\hat p \approx 0.043$ for students and $\approx 0.029$ for non-students)

(these two fits are reproduced in the code sketch after the coefficient tables below)

                Coefficient  Std. error  z-statistic  p-value
  Intercept        −10.6513      0.3612        −29.5  <0.0001
  balance            0.0055      0.0002         24.9  <0.0001

                Coefficient  Std. error  z-statistic  p-value
  Intercept        −3.5041       0.0707       −49.55  <0.0001
  student[Yes]      0.4049       0.1150         3.52   0.0004
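The two fitted models above can be reproduced with the same ISLP/statsmodels tools used later in this lecture; a minimal sketch, assuming the Default data set is available via ISLP's load_data('Default'):

import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import ModelSpec as MS, summarize

Default = load_data('Default')
y = Default.default == 'Yes'                 # response coded as True/False

# default ~ balance
X = MS(['balance']).fit_transform(Default)
summarize(sm.GLM(y, X, family=sm.families.Binomial()).fit())

# default ~ student
X = MS(['student']).fit_transform(Default)
summarize(sm.GLM(y, X, family=sm.families.Binomial()).fit())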

Multiple logistic regression

  • the model of the log odds:
    $\log\left(\dfrac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$

  • the model of the probability:
    $p(X) = \dfrac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}$

                Coefficient  Std. error  z-statistic  p-value
  Intercept       −10.8690       0.4923       −22.08  <0.0001
  balance           0.0057       0.0002        24.74  <0.0001
  income            0.0030       0.0082         0.37   0.7115
  student[Yes]     −0.6468       0.2362        −2.74   0.0062
  • prediction: plug the fitted coefficients into
    $\hat p(X) = \dfrac{e^{\hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2 X_2 + \hat\beta_3 X_3}}{1 + e^{\hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2 X_2 + \hat\beta_3 X_3}}$
    (income measured in thousands of dollars; see the sketch after this list)
  • A student with a credit card balance of $1,500 and an income of $40,000 has an estimated probability of default of roughly 0.06

  • A non-student with the same balance and income has an estimated probability of default of roughly 0.10
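These two numbers can be checked by plugging the coefficients from the table above into the logistic function; a quick arithmetic sketch (income in thousands of dollars, as in the fitted model):

import numpy as np

def prob(balance, income, student):
    # linear predictor from the fitted multiple logistic regression
    eta = -10.8690 + 0.0057*balance + 0.0030*income - 0.6468*student
    return np.exp(eta) / (1 + np.exp(eta))

prob(1500, 40, student=1)   # ≈ 0.055 for a student
prob(1500, 40, student=0)   # ≈ 0.100 for a non-student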

Multinomial logistic regression

  • classify a response variable that has more than two classes ($K > 2$; a scikit-learn sketch follows this list)

  • the model

    • for $k = K$: the baseline class,
      $\Pr(Y = K \mid X = x) = \dfrac{1}{1 + \sum_{l=1}^{K-1} e^{\beta_{l0} + \beta_{l1} x_1 + \cdots + \beta_{lp} x_p}}$

  • for $k = 1, \ldots, K - 1$:
    $\Pr(Y = k \mid X = x) = \dfrac{e^{\beta_{k0} + \beta_{k1} x_1 + \cdots + \beta_{kp} x_p}}{1 + \sum_{l=1}^{K-1} e^{\beta_{l0} + \beta_{l1} x_1 + \cdots + \beta_{lp} x_p}}$

  • the log odds (for $k = 1, \ldots, K - 1$, relative to the baseline class $K$):
    $\log\left(\dfrac{\Pr(Y = k \mid X = x)}{\Pr(Y = K \mid X = x)}\right) = \beta_{k0} + \beta_{k1} x_1 + \cdots + \beta_{kp} x_p$
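A minimal scikit-learn sketch of a multinomial (softmax) logistic regression on made-up three-class data (the data and settings here are illustrative assumptions, not from the lecture):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 1.0, size=(50, 2)) for loc in (-2, 0, 2)])
y = np.repeat([0, 1, 2], 50)

clf = LogisticRegression(max_iter=1000)   # multinomial (softmax) fit for K > 2 with the default lbfgs solver
clf.fit(X, y)
print(clf.predict_proba(X[:3]))           # class probabilities for the first 3 observations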

Coding: Logistic Regression


  • imports
import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import (ModelSpec as MS, summarize)
from ISLP import confusion_table
from ISLP.models import contrast
from sklearn.discriminant_analysis import \
     (LinearDiscriminantAnalysis as LDA,
      QuadraticDiscriminantAnalysis as QDA)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
  • load data
Smarket = load_data('Smarket')
Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
0 2001 0.381 -0.192 -2.624 -1.055 5.010 1.19130 0.959 Up
1 2001 0.959 0.381 -0.192 -2.624 -1.055 1.29650 1.032 Up
2 2001 1.032 0.959 0.381 -0.192 -2.624 1.41120 -0.623 Down
3 2001 -0.623 1.032 0.959 0.381 -0.192 1.27600 0.614 Up
... ... ... ... ... ... ... ... ... ...
1249 2005 -0.298 0.130 -0.955 0.043 0.422 1.38254 -0.489 Down

  • fitting
allvars = Smarket.columns.drop(['Today', 'Direction', 'Year'])
design = MS(allvars)
X = design.fit_transform(Smarket)
y = Smarket.Direction == 'Up'
glm = sm.GLM(y,
             X,
             family=sm.families.Binomial())
results = glm.fit()
summarize(results)
              coef  std err       z  P>|z|
intercept  -0.1260    0.241  -0.523  0.601
Lag1       -0.0731    0.050  -1.457  0.145
Lag2       -0.0423    0.050  -0.845  0.398
Lag3        0.0111    0.050   0.222  0.824
Lag4        0.0094    0.050   0.187  0.851
Lag5        0.0103    0.050   0.208  0.835
Volume      0.1354    0.158   0.855  0.392
  • prediction and confusion matrix
probs = results.predict()
labels = np.array(['Down']*1250)
labels[probs>0.5] = 'Up'
confusion_table(labels, Smarket.Direction)
Truth Down Up
Predicted
Down 145 141
Up 457 507

Generative models for classification

  • The big idea of generative models for classification
    • model the distribution of the predictors $X$ separately in each of the response classes
    • use Bayes' theorem to flip these around into estimates for $\Pr(Y = k \mid X = x)$
  • Why do we need generative models for classification?
    • When there is substantial separation between the two classes, the parameter estimates for the logistic regression model are surprisingly unstable
    • If the distribution of the predictors is approximately normal in each of the classes and the sample size is small, then these approaches may be more accurate than logistic regression
    • The methods in this section extend naturally to the case of more than two response classes
  • Suppose the qualitative response variable $Y$ can take on $K \ge 2$ possible distinct and unordered values.
  • Let $\pi_k$ represent the overall or prior probability that a randomly chosen observation comes from the $k$th class.
  • Let $f_k(x) \equiv \Pr(X = x \mid Y = k)$ denote the density function of $X$ for an observation that comes from the $k$th class.
  • the posterior probability (Bayes' theorem):
    $\Pr(Y = k \mid X = x) = p_k(x) = \dfrac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$

  • estimation (see the numeric sketch after this list)
    • instead of directly computing the posterior probability $p_k(x)$, we can simply plug in estimates of $\pi_k$ and $f_k(x)$
    • $\hat\pi_k$: we simply compute the fraction of the training observations that belong to the $k$th class
    • $\hat f_k(x)$: much more challenging; LDA, QDA, and naive Bayes differ in how they estimate it
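A tiny numeric sketch of the plug-in idea for $K = 2$ classes and one predictor, with made-up priors and Gaussian class densities:

import numpy as np
from scipy.stats import norm

pi = np.array([0.7, 0.3])                 # estimated priors (class proportions)
mu, sigma = np.array([0.0, 2.0]), 1.0     # assumed class means and a shared sd

x = 1.2                                   # a new observation
f = norm.pdf(x, loc=mu, scale=sigma)      # estimated densities f_k(x)
posterior = pi * f / np.sum(pi * f)       # Bayes' theorem: Pr(Y = k | X = x)
print(posterior)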

Linear discriminant analysis (LDA) for p = 1

  • assumptions
    • only one predictor: $p = 1$
    • $f_k(x)$ is normal / Gaussian with a class-specific mean $\mu_k$ and a shared variance $\sigma^2$:
      $f_k(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\dfrac{1}{2\sigma^2}(x - \mu_k)^2\right)$

  • the posterior
    $p_k(x) = \dfrac{\pi_k \frac{1}{\sqrt{2\pi}\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(x - \mu_k)^2\right)}{\sum_{l=1}^{K} \pi_l \frac{1}{\sqrt{2\pi}\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(x - \mu_l)^2\right)}$

  • prediction
    • classify an observation to the class for which $p_k(x)$ is greatest
    • an equivalent rule: assign the observation to the class for which the discriminant score
      $\delta_k(x) = x \cdot \dfrac{\mu_k}{\sigma^2} - \dfrac{\mu_k^2}{2\sigma^2} + \log \pi_k$
      is largest

An Example

  • $K = 2$ and $\pi_1 = \pi_2$
  • the rule assigns an observation to class 1 if $2x(\mu_1 - \mu_2) > \mu_1^2 - \mu_2^2$, and to class 2 otherwise
  • The Bayes decision boundary: $x = \dfrac{\mu_1 + \mu_2}{2}$ (see the sketch below)
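A short sketch of the $p = 1$ discriminant scores with made-up class means (assumed values, for illustration only):

import numpy as np

mu = np.array([-1.25, 1.25])          # assumed class means
sigma2 = 1.0                          # shared variance
pi = np.array([0.5, 0.5])             # equal priors

def delta(x):
    # LDA discriminant scores: delta_k(x) = x*mu_k/sigma^2 - mu_k^2/(2*sigma^2) + log(pi_k)
    return x * mu / sigma2 - mu**2 / (2 * sigma2) + np.log(pi)

print(delta(0.4))                     # the argmax gives the predicted class
print((mu[0] + mu[1]) / 2)            # Bayes boundary x = (mu_1 + mu_2)/2 = 0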

LDA for p > 1

  • The multivariate Gaussian distribution $X \sim N(\mu, \Sigma)$
  • joint density
    $f(x) = \dfrac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\!\left(-\dfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$

  (Figure: two bivariate Gaussian densities — left panel: uncorrelated predictors with equal variances; right panel: correlated predictors with unequal variances.)

  • the observations in the $k$th class are drawn from a multivariate Gaussian distribution $N(\mu_k, \Sigma)$
    • $\mu_k$ is a class-specific mean vector
    • $\Sigma$ is a covariance matrix that is common to all classes
  • the Bayes classifier assigns an observation $x$ to the class for which
    $\delta_k(x) = x^\top \Sigma^{-1} \mu_k - \dfrac{1}{2}\mu_k^\top \Sigma^{-1} \mu_k + \log \pi_k$
    is largest
  • The Bayes decision boundary between classes $k$ and $l$ solves $\delta_k(x) = \delta_l(x)$

Coding: LDA
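The fits below reuse X_train, X_test, L_train, and L_test; a minimal sketch of the assumed setup, following the ISLP lab convention of training on years before 2005 and using Lag1 and Lag2 as predictors:

train = (Smarket.Year < 2005)                      # boolean mask for the training years
model = MS(['Lag1', 'Lag2']).fit(Smarket)
X = model.transform(Smarket)                        # design matrix with an intercept column
X_train, X_test = X.loc[train], X.loc[~train]
L_train, L_test = Smarket.Direction.loc[train], Smarket.Direction.loc[~train]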


  • fitting
lda = LDA(store_covariance=True)
X_train, X_test = [M.drop(columns=['intercept'])
                   for M in [X_train, X_test]]
lda.fit(X_train, L_train)
  • prediction
lda_pred = lda.predict(X_test)
confusion_table(lda_pred, L_test)
Truth Down Up
Predicted
Down 35 35
Up 76 106

Quadratic discriminant analysis (QDA)

  • each class has its own covariance matrix: $X \mid Y = k \sim N(\mu_k, \Sigma_k)$

  • the Bayes classifier assigns an observation $x$ to the class for which
    $\delta_k(x) = -\dfrac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) - \dfrac{1}{2}\log|\Sigma_k| + \log \pi_k$
    is largest (quadratic in $x$, hence the name)

Coding: QDA


  • fitting
qda = QDA(store_covariance=True)
qda.fit(X_train, L_train)
  • prediction
qda_pred = qda.predict(X_test)
confusion_table(qda_pred, L_test)
Truth Down Up
Predicted
Down 30 20
Up 81 121

Naive Bayes

  • Assumption: within the $k$th class, the $p$ predictors are independent:
    $f_k(x) = f_{k1}(x_1) \times f_{k2}(x_2) \times \cdots \times f_{kp}(x_p)$

  • the posterior probability
    $\Pr(Y = k \mid X = x) = \dfrac{\pi_k \, f_{k1}(x_1) \cdots f_{kp}(x_p)}{\sum_{l=1}^{K} \pi_l \, f_{l1}(x_1) \cdots f_{lp}(x_p)}$

Estimating the one-dimensional density functions $f_{kj}$ using training data (see the sketch after this list)

  • If $X_j$ is quantitative, then we can assume that $X_j \mid Y = k \sim N(\mu_{jk}, \sigma_{jk}^2)$
  • If $X_j$ is quantitative, we can instead use a non-parametric estimate for $f_{kj}$
    • making a histogram of the observations of the $j$th predictor within each class
    • or using a kernel density estimator (a smoothed histogram)
  • If $X_j$ is qualitative, count the proportion of training observations of the $j$th predictor at each level within each class
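A brief sketch of the three estimation strategies for a single predictor within one class (the data here are made up for illustration):

import numpy as np
from scipy.stats import norm, gaussian_kde

rng = np.random.default_rng(0)
xj = rng.normal(1.0, 2.0, size=200)        # observations of a quantitative X_j in class k

# (1) parametric: assume X_j | Y = k is Gaussian
mu_hat, sd_hat = xj.mean(), xj.std()
f_param = norm.pdf(0.5, mu_hat, sd_hat)

# (2) non-parametric: kernel density estimate evaluated at x = 0.5
f_kde = gaussian_kde(xj)(0.5)

# (3) qualitative predictor: class-conditional proportions of each level
zj = rng.integers(0, 3, size=200)          # a categorical predictor with 3 levels
f_cat = np.bincount(zj) / len(zj)

print(f_param, f_kde, f_cat)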

Coding: Naive Bayes


  • fitting
NB = GaussianNB()
NB.fit(X_train, L_train)
  • prediction
nb_labels = NB.predict(X_test)
confusion_table(nb_labels, L_test)
Truth Down Up
Predicted
Down 29 20
Up 82 121

Generalized additive models

Model the log odds as a generalized additive model, replacing each linear term with a smooth function $f_j$:

$\log\left(\dfrac{p(X)}{1 - p(X)}\right) = \beta_0 + f_1(X_1) + f_2(X_2) + \cdots + f_p(X_p)$
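scikit-learn does not ship a GAM class; one hedged way to approximate a logistic GAM is to expand each predictor in a spline basis and fit a penalized logistic regression on the expanded features, which is additive in smooth functions of each $X_j$ (a sketch under these assumptions, not the lecture's chosen tool):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = (np.sin(X[:, 0]) + 0.5 * X[:, 1]**2 + rng.normal(0, 0.5, 300) > 1).astype(int)

# spline basis per column + a linear (logistic) fit => additive in f_j(X_j)
gam_like = make_pipeline(SplineTransformer(n_knots=6, degree=3),
                         LogisticRegression(C=1.0, max_iter=1000))
gam_like.fit(X, y)
print(gam_like.predict_proba(X[:3]))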

Support vector machine

  • developed in the 1990s
  • perform well in a variety of settings
  • often considered one of the best "out of the box" classifiers.

Maximal Margin Classifier

Hyperplane

  • In a $p$-dimensional space: a flat affine subspace of dimension $p - 1$

    • In 2-d: a line
    • In 3-d: a plane
    • In $p$-d ($p > 3$): hard to visualize
  • The mathematical definition (for the $p$-d setting):
    $\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p = 0$

  • the 2-d example: $\beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0$

Classification Using a Separating Hyperplane

for a separating hyperplane,

$\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} > 0 \quad \text{if } y_i = 1,$

and

$\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} < 0 \quad \text{if } y_i = -1.$

Equivalently, a separating hyperplane has the property that

$y_i \left(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}\right) > 0$

for all $i = 1, \ldots, n$.

Separating Hyperplanes

  • we classify a test observation $x^*$ based on the sign of
    $f(x^*) = \beta_0 + \beta_1 x_1^* + \beta_2 x_2^* + \cdots + \beta_p x_p^*$

    • $f(x^*) > 0$: class 1
    • $f(x^*) < 0$: class −1
  • the magnitude of $f(x^*)$ tells us how far the observation lies from the hyperplane, and hence how confident we can be about the class assignment

The Maximal Margin Classifier

  • margin: the smallest (perpendicular) distance from the training observations to the separating hyperplane
  • maximal margin hyperplane (a.k.a. the optimal separating hyperplane): the separating hyperplane with the largest margin
  • Construction of the Maximal Margin Classifier: formulated as an optimization problem (stated below)
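For reference, the construction can be written as the standard optimization problem:

$$\underset{\beta_0, \beta_1, \ldots, \beta_p,\, M}{\text{maximize}}\; M \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 = 1, \qquad y_i\left(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}\right) \ge M \ \text{ for all } i = 1, \ldots, n.$$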

The Non-separable Case & Noisy Data

  • sometimes the data are non-separable (no separating hyperplane exists)
  • sometimes the maximal margin classifier is very sensitive to noisy data

Support Vector Classifiers

  • introduce slack variables $\epsilon_1, \ldots, \epsilon_n \ge 0$ that allow individual observations to violate the margin (the full optimization problem is stated below)
    • $\epsilon_i = 0$: the $i$-th obs is on the correct side of the margin
    • $\epsilon_i > 0$: the $i$-th obs is on the wrong side of the margin
    • $\epsilon_i > 1$: the $i$-th obs is on the wrong side of the hyperplane
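These slack variables enter the support vector classifier optimization together with a budget $C$:

$$\underset{\beta_0, \ldots, \beta_p,\, \epsilon_1, \ldots, \epsilon_n,\, M}{\text{maximize}}\; M \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 = 1, \qquad y_i\left(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}\right) \ge M(1 - \epsilon_i), \qquad \epsilon_i \ge 0, \quad \sum_{i=1}^{n} \epsilon_i \le C.$$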

Parameter C

  • $C$ is the budget for the amount that the margin can be violated by the $n$ observations
    • $C = 0$: no budget for violations of the margin (the maximal margin classifier, if the data are separable)
    • for $C > 0$: no more than $C$ observations can be on the wrong side of the hyperplane (each such observation has $\epsilon_i > 1$ and $\sum_i \epsilon_i \le C$)
    • as $C$ increases: the margin will widen and more violations are tolerated
  • $C$ controls the bias-variance trade-off
    • small $C$: low bias, high variance
    • big $C$: high bias, low variance
    • selected via cross-validation (CV)
  • note: the C argument of sklearn's SVC used in the coding sections is a cost on violations rather than a budget, so it plays the opposite role (large sklearn C means a narrow margin with few violations)
  • An observation: only observations that either lie on the margin or violate the margin (the support vectors) affect the hyperplane

Coding: SVC


  • imports
import numpy as np
from matplotlib.pyplot import subplots, cm
import sklearn.model_selection as skm
from ISLP import load_data, confusion_table
from sklearn.svm import SVC
from ISLP.svm import plot as plot_svm
from sklearn.metrics import RocCurveDisplay
roc_curve = RocCurveDisplay.from_estimator # shorthand
  • create data
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2))
y = np.array([-1]*25+[1]*25)
X[y==1] += 1
fig, ax = subplots(figsize=(8,8))
ax.scatter(X[:,0],
           X[:,1],
           c=y,
           cmap=cm.coolwarm)
  • fitting
svm_linear = SVC(C=10, kernel='linear')
svm_linear.fit(X, y)
  • plotting
fig, ax = subplots(figsize=(8,8))
plot_svm(X,
         y,
         svm_linear,
         ax=ax)

You may try other values of the parameter C.

  • hyperparameter tuning
kfold = skm.KFold(5, 
                  random_state=0,
                  shuffle=True)
grid = skm.GridSearchCV(svm_linear,
                        {'C':[0.001,0.01,0.1,1,5,10,100]},
                        refit=True,
                        cv=kfold,
                        scoring='accuracy')
grid.fit(X, y)
grid.best_params_

{'C': 1}

  • printing results
grid.cv_results_[('mean_test_score')]

array([0.46, 0.46, 0.72, 0.74, 0.74, 0.74, 0.74])

  • generating testing sample
X_test = rng.standard_normal((20, 2))
y_test = np.array([-1]*10+[1]*10)
X_test[y_test==1] += 1
  • predicting
best_ = grid.best_estimator_
y_test_hat = best_.predict(X_test)
confusion_table(y_test_hat, y_test)


Truth -1 1
Predicted
-1 8 4
1 2 6

Support Vector Machines

The support vector classifier cannot handle nonlinear class boundaries.

What can we do?

Nonlinear Classifiers Utilizing Polynomial Features

  • The original features
    $X_1, X_2, \ldots, X_p$

  • The polynomial features (e.g. quadratic)
    $X_1, X_1^2, X_2, X_2^2, \ldots, X_p, X_p^2$

  • The SVM via Optimization: fit a support vector classifier in the enlarged feature space; the resulting decision boundary is nonlinear in the original features

Kernel Functions

  • Definition: a function $\kappa(x, x')$ is a kernel function if and only if the kernel (Gram) matrix $K$ with entries $K_{ij} = \kappa(x_i, x_j)$ is symmetric positive semi-definite for any data $x_1, \ldots, x_n$ (see the numeric check below).
  • Some examples of kernel functions

    name                function
    Linear kernel       $\kappa(x, x') = x^\top x'$
    Polynomial kernel   $\kappa(x, x') = (1 + x^\top x')^d$
    Radial kernel       $\kappa(x, x') = \exp\!\left(-\gamma \lVert x - x' \rVert^2\right)$
    Gaussian kernel     $\kappa(x, x') = \exp\!\left(-\dfrac{\lVert x - x' \rVert^2}{2\sigma^2}\right)$
    Laplacian kernel    $\kappa(x, x') = \exp\!\left(-\dfrac{\lVert x - x' \rVert}{\sigma}\right)$
    Sigmoid kernel      $\kappa(x, x') = \tanh\!\left(\beta\, x^\top x' + \theta\right)$

Suppose $\kappa_1$ and $\kappa_2$ are kernel functions:

  • Any non-negative linear combination $a_1 \kappa_1 + a_2 \kappa_2$ (with $a_1, a_2 \ge 0$) is a kernel function

  • The direct (pointwise) product $\kappa(x, x') = \kappa_1(x, x')\,\kappa_2(x, x')$ is a kernel function

  • For any function $g(\cdot)$, $\kappa$ is a kernel function if $\kappa(x, x') = g(x)\,\kappa_1(x, x')\,g(x')$
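A small numeric sketch of the definition: build the Gram (kernel) matrix for the radial kernel on some random data and check that its eigenvalues are (numerically) nonnegative:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
gamma = 0.5

# radial (RBF) kernel matrix: K_ij = exp(-gamma * ||x_i - x_j||^2)
sq_dists = ((X[:, None, :] - X[None, :, :])**2).sum(axis=2)
K = np.exp(-gamma * sq_dists)

eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)   # all eigenvalues nonnegative (up to rounding) => PSD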

SVC vs. SVM

                             SVC                                                                    SVM
  inner products / kernels   inner products $\langle x, x_i \rangle$                                kernels $K(x, x_i)$
  functional form            $f(x) = \beta_0 + \sum_{i \in S} \alpha_i \langle x, x_i \rangle$      $f(x) = \beta_0 + \sum_{i \in S} \alpha_i K(x, x_i)$

(where $S$ denotes the set of support vectors)

Coding: SVM


  • create data
X = rng.standard_normal((200, 2))
X[:100] += 2
X[100:150] -= 2
y = np.array([1]*150+[2]*50)
  • plotting
fig, ax = subplots(figsize=(8,8))
ax.scatter(X[:,0],
           X[:,1],
           c=y,
           cmap=cm.coolwarm)
  • fitting
(X_train, 
 X_test,
 y_train,
 y_test) = skm.train_test_split(X,
                                y,
                                test_size=0.5,
                                random_state=0)
svm_rbf = SVC(kernel="rbf", gamma=1, C=1)
svm_rbf.fit(X_train, y_train)
  • plotting
fig, ax = subplots(figsize=(8,8))
plot_svm(X_train,
         y_train,
         svm_rbf,
         ax=ax)

You may try other values of the parameters C and gamma.

  • hyperparameter tuning
kfold = skm.KFold(5, 
                  random_state=0,
                  shuffle=True)
grid = skm.GridSearchCV(svm_rbf,
                        {'C':[0.1,1,10,100,1000],
                         'gamma':[0.5,1,2,3,4]},
                        refit=True,
                        cv=kfold,
                        scoring='accuracy');
grid.fit(X_train, y_train)
grid.best_params_

{'C': 1, 'gamma': 0.5}

  • training with the best hyperparameters
best_svm = grid.best_estimator_
fig, ax = subplots(figsize=(8,8))
plot_svm(X_train,
         y_train,
         best_svm,
         ax=ax)
y_hat_test = best_svm.predict(X_test)
confusion_table(y_hat_test, y_test)

SVMs with More than Two Classes

  • One-Versus-One (OVO) Classification

    • a.k.a. all-pairs
    • constructs an SVM for each of the $\binom{K}{2}$ pairs of classes
    • classify a test obs using each of the $\binom{K}{2}$ SVMs
    • assign the obs to the class to which it was most frequently assigned in these pairwise classifications
  • One-Versus-All (OVA) Classification

    • a.k.a. one-versus-rest
    • fit $K$ SVMs, each comparing one of the $K$ classes (coded as "+1") to the remaining $K - 1$ classes (coded as "-1")
    • let $\beta_{0k}, \beta_{1k}, \ldots, \beta_{pk}$ denote the parameters of the $k$th fit
    • assign a test observation $x^*$ to the class for which $\beta_{0k} + \beta_{1k} x_1^* + \cdots + \beta_{pk} x_p^*$ is largest

Coding: SVM with Multiple Classes


  • data and plotting
rng = np.random.default_rng(123)
X = np.vstack([X, rng.standard_normal((50, 2))])
y = np.hstack([y, [0]*50])
X[y==0,1] += 2
fig, ax = subplots(figsize=(8,8))
ax.scatter(X[:,0], X[:,1], c=y, cmap=cm.coolwarm)
  • fitting
svm_rbf_3 = SVC(kernel="rbf",
                C=10,
                gamma=1,
                decision_function_shape='ovo');
svm_rbf_3.fit(X, y)
fig, ax = subplots(figsize=(8,8))
plot_svm(X,
         y,
         svm_rbf_3,
         scatter_cmap=cm.tab10,
         ax=ax)

Relationship to Logistic Regression

  • The hinge loss + penalty form of the support-vector classifier optimization:

    • let $f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$
    • the optimization model
      $\min_{\beta_0, \beta_1, \ldots, \beta_p} \left\{ \sum_{i=1}^{n} \max\!\left[0,\; 1 - y_i f(x_i)\right] + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}$

    • the hinge loss is very similar to the "loss" in logistic regression (the negative log-likelihood); see the sketch after this list
  • SVM vs. Logistic Regression (LR)

    • When the classes are (nearly) separable, SVM does better than LR. So does LDA.
    • When they are not, LR (with a ridge penalty) and SVM are very similar.
    • If you wish to estimate probabilities, LR is the choice.
    • For nonlinear boundaries, kernel SVMs are popular. Kernels can be used with LR and LDA as well, but the computations are more expensive.
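A quick sketch comparing the hinge loss with the logistic (negative log-likelihood) loss as functions of the margin $y\,f(x)$, which is why the two classifiers often give similar results:

import numpy as np
from matplotlib.pyplot import subplots

m = np.linspace(-3, 3, 200)                 # the margin y * f(x)
hinge = np.maximum(0, 1 - m)                # SVM hinge loss
logistic = np.log(1 + np.exp(-m))           # logistic regression loss

fig, ax = subplots(figsize=(6, 4))
ax.plot(m, hinge, label='hinge (SVM)')
ax.plot(m, logistic, label='logistic (LR)')
ax.set_xlabel('y f(x)')
ax.legend()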

Financial applications