L04 Tree-Based Methods

We will learn

  • Algorithms
    • CART (regression, classification)
    • Ensemble learning (bagging, random forests, boosting)
  • Coding with Python

Classification and regression trees (CART)

  • CART / Decision Tree
    • recursively partition the input space
    • fit a local model in each resulting region of the input space

Model definition

  • regression trees: all inputs are real-valued.
    • The tree consists of a set of nested decision rules. At each node $i$, the feature dimension $d_i$ of the input vector is compared to a threshold value $t_i$, and the input is then passed down to the left or right branch, depending on whether it is above or below the threshold.
    • At the leaves of the tree, the model specifies the predicted output for any input that falls into that part of the input space.
  • An example:
    • the region of space associated with the first leaf: $R_1 = \{x : x_1 \le t_1, x_2 \le t_2\}$

  • the output for region 1 (mean response) can be estimated using

    $w_1 = \frac{\sum_{n=1}^{N} y_n \, \mathbb{I}(x_n \in R_1)}{\sum_{n=1}^{N} \mathbb{I}(x_n \in R_1)}$

  • Formally, a regression tree can be defined by

    $f(x; \theta) = \sum_{j=1}^{J} w_j \, \mathbb{I}(x \in R_j)$

  • $R_j$ is the region specified by the $j$'th leaf node, $w_j$ is the predicted output for that node, and $\theta = \{(R_j, w_j) : j = 1{:}J\}$
  • $J$ is the number of nodes.
  • The regions:
    • $R_1 = \{x : x_1 \le t_1, x_2 \le t_2\}$, $R_2 = \{x : x_1 \le t_1, x_2 > t_2\}$, etc.
  • For categorical inputs, we can define the splits based on comparing feature $d_i$ to each of the possible values for that feature, rather than comparing to a numeric threshold.
  • For classification problems, the leaves contain a distribution over the class labels, rather than just the mean response.
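
To make the piecewise-constant form concrete, here is a minimal Python sketch of evaluating $f(x; \theta) = \sum_j w_j \mathbb{I}(x \in R_j)$; the regions, thresholds, and leaf values below are made-up illustrative numbers, not taken from the notes.

import numpy as np

# Hypothetical tree: each leaf is (list of (feature, low, high) intervals, leaf output w_j)
regions = [
    ([(0, -np.inf, 0.5), (1, -np.inf, 1.0)], 2.0),   # R_1: x_1 <= 0.5 and x_2 <= 1.0
    ([(0, -np.inf, 0.5), (1, 1.0, np.inf)], -1.0),   # R_2: x_1 <= 0.5 and x_2 > 1.0
    ([(0, 0.5, np.inf)], 0.5),                       # R_3: x_1 > 0.5
]

def f(x):
    # exactly one indicator is 1 because the regions partition the input space
    for intervals, w in regions:
        if all(low < x[d] <= high for d, low, high in intervals):
            return w
    raise ValueError("x is not covered by any region")

print(f(np.array([0.2, 0.3])))   # falls in R_1, so the prediction is 2.0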

Model fitting

  • minimizing the following loss:

    $\mathcal{L}(\theta) = \sum_{n=1}^{N} \ell\big(y_n, f(x_n; \theta)\big) = \sum_{j=1}^{J} \sum_{x_n \in R_j} \ell(y_n, w_j)$
  • this loss is not differentiable, because of the discrete tree structure
  • finding the optimal partitioning of the data is NP-complete
  • the standard practice is to use a greedy procedure, in which we iteratively grow the tree one node at a time.
  • three popular implementations: CART, C4.5, and ID3.

The big idea of greedy algorithms

  • let $\mathcal{D}_i$ be the set of examples that reach node $i$.
  • If the $j$'th feature is a real-valued scalar
    • partition the data at node $i$ by comparing it to a threshold $t$
    • the set of possible thresholds $\mathcal{T}_j$ for feature $j$ can be obtained by sorting the unique values of $\{x_{nj}\}$
    • example: if feature 1 has the values $\{4.5, -12, 72, -12\}$, then we set $\mathcal{T}_1 = \{-12, 4.5, 72\}$. For each possible threshold $t \in \mathcal{T}_j$, we define the left and right splits, $\mathcal{D}_i^{L}(j,t) = \{(x_n, y_n) \in \mathcal{D}_i : x_{nj} \le t\}$ and $\mathcal{D}_i^{R}(j,t) = \{(x_n, y_n) \in \mathcal{D}_i : x_{nj} > t\}$.
  • If the $j$'th feature is categorical, with $K_j$ possible values
    • check if the feature is equal to each of those values or not.
    • this defines a set of $K_j$ possible binary splits: $\mathcal{D}_i^{L}(j,t) = \{(x_n, y_n) \in \mathcal{D}_i : x_{nj} = t\}$ and $\mathcal{D}_i^{R}(j,t) = \{(x_n, y_n) \in \mathcal{D}_i : x_{nj} \ne t\}$.
  • choose the best feature $j_i$ to split on, and the best value for that feature, $t_i$, as (see the sketch at the end of this subsection):

    $(j_i, t_i) = \arg\min_{j \in \{1, \dots, D\}} \min_{t \in \mathcal{T}_j} \frac{|\mathcal{D}_i^{L}(j,t)|}{|\mathcal{D}_i|} c\big(\mathcal{D}_i^{L}(j,t)\big) + \frac{|\mathcal{D}_i^{R}(j,t)|}{|\mathcal{D}_i|} c\big(\mathcal{D}_i^{R}(j,t)\big)$

  • $c(\mathcal{D}_i)$ is the cost function for node $i$.
  • For regression, we can use the mean squared error

    $c(\mathcal{D}_i) = \frac{1}{|\mathcal{D}_i|} \sum_{n \in \mathcal{D}_i} (y_n - \bar{y})^2$

where $\bar{y} = \frac{1}{|\mathcal{D}_i|} \sum_{n \in \mathcal{D}_i} y_n$ is the mean of the response variable for examples reaching node $i$.

  • For classification, we first compute the empirical distribution over class labels for this node:

    $\hat{\pi}_{ic} = \frac{1}{|\mathcal{D}_i|} \sum_{n \in \mathcal{D}_i} \mathbb{I}(y_n = c)$

Given this, we can then compute the Gini index

    $G_i = \sum_{c=1}^{C} \hat{\pi}_{ic} (1 - \hat{\pi}_{ic})$

This is the expected error rate. To see this, note that $\hat{\pi}_{ic}$ is the probability a random entry in the leaf belongs to class $c$, and $1 - \hat{\pi}_{ic}$ is the probability it would be misclassified.

  • Alternatively we can define cost as the entropy or deviance of the node:

    $H_i = \mathbb{H}(\hat{\pi}_i) = -\sum_{c=1}^{C} \hat{\pi}_{ic} \log \hat{\pi}_{ic}$
  • A node that is pure (i.e., only has examples of one class) will have 0 entropy.
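
The split search described above can be sketched in a few lines of NumPy. This is a simplified illustration (not the CART/C4.5/ID3 or sklearn implementation): it scores every candidate (feature, threshold) pair by the size-weighted cost of the two children, using MSE for regression or the Gini index for classification.

import numpy as np

def mse_cost(y):
    # regression cost: mean squared deviation from the node mean
    return np.mean((y - y.mean()) ** 2)

def gini_cost(y):
    # classification cost: Gini index sum_c pi_c (1 - pi_c) = 1 - sum_c pi_c^2
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y, cost=mse_cost):
    n, d = X.shape
    best = (None, None, np.inf)               # (feature j, threshold t, weighted cost)
    for j in range(d):
        for t in np.unique(X[:, j]):          # candidate thresholds: unique values of feature j
            left = X[:, j] <= t
            right = ~left
            if left.sum() == 0 or right.sum() == 0:
                continue
            c = (left.sum() * cost(y[left]) + right.sum() * cost(y[right])) / n
            if c < best[2]:
                best = (j, t, c)
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 1] > 0.2).astype(float) + 0.1 * rng.normal(size=200)
print(best_split(X, y))                       # picks feature 1 with a threshold near 0.2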

Regularization

  • the danger of overfitting: If we let the tree become deep enough, it can achieve 0 error on the training set (assuming no label noise), by partitioning the input space into sufficiently small regions where the output is constant.
  • two main approaches against overfitting:
    • The first is to stop the tree growing process according to some heuristic, such as having too few examples at a node, or reaching a maximum depth.
    • The second approach is to grow the tree to its maximum depth, where no more splits are possible, and then to prune it back, by merging split subtrees back into their parent.
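
Both strategies are available in sklearn's tree implementation, sketched below: early stopping via max_depth and min_samples_leaf, and grow-then-prune via cost-complexity pruning with the ccp_alpha parameter. The particular parameter values here are arbitrary.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Stop growing early with simple heuristics
shallow = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y)

# Grow fully, then prune back with cost-complexity pruning
path = DecisionTreeClassifier().cost_complexity_pruning_path(X, y)
pruned = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2]).fit(X, y)
print(shallow.get_depth(), pruned.get_depth())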

Pros and cons

  • advantages:
    • They are easy to interpret.
    • They can easily handle mixed discrete and continuous inputs.
    • They are insensitive to monotone transformations of the inputs (because the split points are based on ranking the data points), so there is no need to standardize the data.
    • They perform automatic variable selection.
    • They are relatively robust to outliers.
    • They are fast to fit, and scale well to large data sets.
    • They can handle missing input features.
  • disadvantages:
    • they do not predict as accurately as other kinds of model, due to the greedy nature of the tree construction algorithm.
    • trees are unstable: small changes to the input data can have large effects on the structure of the tree, due to the hierarchical nature of the tree-growing process, causing errors at the top to affect the rest of the tree.

Coding: Classification


from sklearn.datasets import load_iris
from sklearn import tree
iris = load_iris()
X, y = iris.data, iris.target
# Fit a decision tree classifier on the iris data
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, y)
# tree.plot_tree(clf)  # quick matplotlib rendering of the fitted tree
# Export the tree in Graphviz dot format for a nicer visualization
import graphviz
dot_data = tree.export_graphviz(
  clf, out_file=None,
  feature_names=iris.feature_names,
  class_names=iris.target_names,
  filled=True, rounded=True,
  special_characters=True)
graph = graphviz.Source(dot_data)
graph  # in a notebook, this displays the rendered tree

Coding: Regression

# Import the necessary modules and libraries
import matplotlib.pyplot as plt
import numpy as np
from sklearn.tree import DecisionTreeRegressor
# Create a random dataset
rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))
# Fit regression model
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=5)
regr_1.fit(X, y)
regr_2.fit(X, y)
# Predict
X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
y_1 = regr_1.predict(X_test)
y_2 = regr_2.predict(X_test)
# Plot the results
plt.figure()
plt.scatter(X, y, s=20, edgecolor="black", 
  c="darkorange", label="data")
plt.plot(X_test, y_1, color="cornflowerblue", 
  label="max_depth=2", linewidth=2)
plt.plot(X_test, y_2, color="yellowgreen", 
  label="max_depth=5", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()

Ensemble learning

  • Ensemble learning: reduce variance by averaging multiple (regression) models (or taking a majority vote for classifiers)

    $f(y|x) = \frac{1}{M} \sum_{m=1}^{M} f_m(y|x)$

  • $f_m$ is the $m$'th base model.
  • The ensemble will have similar bias to the base models, but lower variance, generally resulting in improved overall performance
  • For classifiers: take a majority vote of the outputs. (This is sometimes called a committee method.)
    • suppose each base model is a binary classifier with an accuracy of $\theta$, and suppose class 1 is the correct class.
    • Let $Y_m \in \{0, 1\}$ be the prediction of the $m$'th model
    • Let $S = \sum_{m=1}^{M} Y_m$ be the number of votes for class 1
    • We define the final predictor to be the majority vote, i.e., class 1 if $S > M/2$ and class 0 otherwise. The probability that the ensemble will pick class 1 is (see the numerical check below)

    $p = \Pr(S > M/2) = 1 - B(M/2, M, \theta)$

where $B(\cdot, M, \theta)$ is the cdf of the binomial distribution with $M$ trials and success probability $\theta$.
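
A quick numerical check of this argument, assuming the base classifiers' errors are independent (which real ensembles only approximate): with $M = 101$ models of accuracy $\theta = 0.55$, the majority vote is correct with probability $\Pr(S > M/2)$, computed here with scipy.

from scipy.stats import binom

theta, M = 0.55, 101                           # each base model barely better than chance
p_ensemble = 1 - binom.cdf(M // 2, M, theta)   # P(S > M/2) with S ~ Binomial(M, theta)
print(p_ensemble)                              # ~0.84, much higher than 0.55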

Stacking

  • Stacking (stacked generalization): combine the base models by using

    $f(y|x) = \sum_{m=1}^{M} w_m f_m(y|x)$

  • the combination weights $w_m$ used by stacking need to be trained on a separate (held-out) dataset, otherwise they would put all their mass on the best performing base model.
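
A minimal sketch using sklearn's StackingClassifier; the choice of base models and combiner here is arbitrary. Note that cv=5 makes the LogisticRegression combiner train on out-of-fold predictions, which is sklearn's way of meeting the "separate dataset" requirement above.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=3)),
                ("rf", RandomForestClassifier(n_estimators=100, random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5)  # the combiner is trained on out-of-fold predictions of the base models
print(cross_val_score(stack, X, y, cv=5).mean())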

Ensembling is not Bayes model averaging

  • An ensemble considers a larger hypothesis class of the form

    $p(y|x) = \sum_{m \in \mathcal{M}} w_m \, p(y|x, m)$

  • the BMA uses

    $p(y|x, \mathcal{D}) = \sum_{m \in \mathcal{M}} p(m|\mathcal{D}) \, p(y|x, m, \mathcal{D})$

  • The key difference
    • in the case of BMA, the weights $p(m|\mathcal{D})$ sum to one, and in the limit of infinite data, only a single model will be chosen (namely the MAP model).
    • By contrast, the ensemble weights $w_m$ are arbitrary, and don't collapse in this way to a single model.

Bagging

  • bagging ("bootstrap aggregating")
    • This is a simple form of ensemble learning in which we fit different base models to different randomly sampled versions of the data
    • this encourages the different models to make diverse predictions
    • The datasets are sampled with replacement (a technique known as bootstrap sampling)
    • a given example may appear multiple times, until we have a total of $N$ examples per model (where $N$ is the number of original data points).
  • The disadvantage of bootstrap

    • each base model only sees, on average, about 63% of the unique input examples (the probability a given example is never drawn in $N$ samples with replacement is $(1 - 1/N)^N \to e^{-1} \approx 0.37$); see the sketch after this list.

    • The remaining 37% of the training instances that are not used by a given base model are called out-of-bag (OOB) instances.

    • We can use the predicted performance of the base model on these oob instances as an estimate of test set performance.

    • This provides a useful alternative to cross validation.

  • The main advantage of bootstrap is that it prevents the ensemble from relying too much on any individual training example, which enhances robustness and generalization.

  • Bagging does not always improve performance. In particular, it relies on the base models being unstable estimators (such as decision trees), so that omitting some of the data significantly changes the resulting model fit.
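
Two quick checks of the claims above, as a sketch: the $1 - (1 - 1/N)^N \approx 63\%$ calculation, and sklearn's BaggingClassifier with oob_score=True, which reports accuracy on each model's out-of-bag instances. The dataset and hyperparameters are arbitrary.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

N = 10_000
print(1 - (1 - 1 / N) ** N)        # ~0.632: expected fraction of unique examples per bootstrap

X, y = load_breast_cancer(return_X_y=True)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200,
                        oob_score=True, random_state=0).fit(X, y)
print(bag.oob_score_)              # OOB estimate of generalization accuracy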

Random forests

Random forests: learning trees based on a randomly chosen subset of input variables (at each node of the tree), as well as a randomly chosen subset of data cases.

Empirically, random forests often work much better than bagged decision trees, because many input features are irrelevant and the random feature selection decorrelates the individual trees.
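
A minimal sklearn sketch; max_features="sqrt" gives the random subset of input variables considered at each split, which is what distinguishes a random forest from plain bagging of trees. The dataset choice is just for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=300, max_features="sqrt", random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())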

Boosting

  • Ensembles of trees, whether fit by bagging or the random forest algorithm, correspond to a model of the form

    $f(x; \theta) = \sum_{m=1}^{M} \beta_m F_m(x; \theta_m)$

  • $F_m$ is the $m$'th tree
  • $\beta_m$ is the corresponding weight, often set to $\beta_m = 1/M$.
  • additive model: generalize this by allowing the $F_m$ functions to be general function approximators, such as neural networks, not just trees.
  • We can think of this as a linear model with adaptive basis functions. The goal, as usual, is to minimize the empirical loss (with an optional regularizer):

    $\mathcal{L}(f) = \sum_{n=1}^{N} \ell\big(y_n, f(x_n)\big)$

  • Boosting is an algorithm for sequentially fitting additive models where each $F_m$ is a binary classifier that returns $F_m(x) \in \{-1, +1\}$.

    • first fit $F_1$ on the original data
    • weight the data samples by the errors made by $F_1$, so misclassified examples get more weight
    • fit $F_2$ to this weighted data set
    • keep repeating this process until we have fit the desired number $M$ of components.
  • as long as each $F_m$ has an accuracy that is better than chance (even on the weighted dataset), then the final ensemble of classifiers will have higher accuracy than any given component.

    • if $F_m$ is a weak learner (so its accuracy is only slightly better than 50%)
    • we can boost its performance using the above procedure so that the final $f$ becomes a strong learner.
  • boosting, bagging and RF
    • boosting reduces the bias of the strong learner, by fitting trees that depend on each other
    • bagging and RF reduce the variance by fitting independent trees
    • In many cases, boosting can work better

Forward stagewise additive modeling

forward stagewise additive modeling: sequentially optimize the objective for general (differentiable) loss functions

    $(\beta_m, \theta_m) = \arg\min_{\beta, \theta} \sum_{n=1}^{N} \ell\big(y_n, f_{m-1}(x_n) + \beta F(x_n; \theta)\big)$

We then set

    $f_m(x) = f_{m-1}(x) + \beta_m F(x; \theta_m)$

Quadratic loss and least squares boosting

  • squared error loss: $\ell(y, \hat{y}) = (y - \hat{y})^2$
  • the $n$'th term in the objective at step $m$ becomes

    $\ell\big(y_n, f_{m-1}(x_n) + \beta F(x_n; \theta)\big) = \big(r_{nm} - \beta F(x_n; \theta)\big)^2$

  • $r_{nm} = y_n - f_{m-1}(x_n)$ is the residual of the current model on the $n$'th observation.
  • We can minimize the above objective by simply setting $\beta = 1$, and fitting $F$ to the residual errors. This is called least squares boosting (see the sketch below).
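
A minimal sketch of least squares boosting with $\beta = 1$: each shallow regression tree is fit to the residuals of the current model, and the prediction is the sum over all trees. The synthetic data, tree depth, and number of rounds are arbitrary choices.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(200, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(200)

trees, f = [], np.zeros_like(y)
for m in range(50):
    r = y - f                                    # residuals r_{nm} of the current model
    tree = DecisionTreeRegressor(max_depth=2).fit(X, r)
    trees.append(tree)
    f += tree.predict(X)                         # f_m = f_{m-1} + F_m (beta = 1)

def predict(X_new):
    return sum(t.predict(X_new) for t in trees)

print(np.mean((y - f) ** 2))                     # training MSE shrinks as rounds are added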

Exponential loss and AdaBoost

  • binary classification, i.e., predicting $\tilde{y}_n \in \{-1, +1\}$.
  • assuming the weak learner computes

    $p(y = 1 | x) = \frac{e^{F(x)}}{e^{-F(x)} + e^{F(x)}} = \frac{1}{1 + e^{-2 F(x)}}$

so $F(x)$ returns half the log odds.

  • the negative log likelihood is given by

    $\ell(\tilde{y}, F(x)) = \log\big(1 + e^{-2 \tilde{y} F(x)}\big)$

  • we can also use other loss functions. In this section, we consider the exponential loss

    $\ell(\tilde{y}, F(x)) = \exp(-\tilde{y} F(x))$

  • this is a smooth upper bound on the 0-1 loss.
  • In the population setting (with infinite sample size), the optimal solution to the exponential loss is the same as for the log loss. To see this, we can just set the derivative of the expected loss (for each $x$) to zero:

    $\frac{\partial}{\partial F(x)} \mathbb{E}\big[e^{-\tilde{y} F(x)} \mid x\big] = -p(\tilde{y}=1|x)\, e^{-F(x)} + p(\tilde{y}=-1|x)\, e^{F(x)} = 0 \;\Rightarrow\; F(x) = \frac{1}{2} \log \frac{p(\tilde{y}=1|x)}{p(\tilde{y}=-1|x)}$
  • the exponential loss is easier to optimize in the boosting setting.

discrete AdaBoost

  • At step $m$ we have to minimize

    $L_m = \sum_{n=1}^{N} \exp\big(-\tilde{y}_n (f_{m-1}(x_n) + \beta F(x_n))\big) = \sum_{n=1}^{N} \omega_{n,m} \exp\big(-\beta \tilde{y}_n F(x_n)\big)$

where $\omega_{n,m} \triangleq \exp(-\tilde{y}_n f_{m-1}(x_n))$ is a weight applied to datacase $n$, and $\tilde{y}_n \in \{-1, +1\}$. We can rewrite this objective as follows:

    $L_m = e^{-\beta} \sum_{\tilde{y}_n = F(x_n)} \omega_{n,m} + e^{\beta} \sum_{\tilde{y}_n \ne F(x_n)} \omega_{n,m} = (e^{\beta} - e^{-\beta}) \sum_{n=1}^{N} \omega_{n,m} \mathbb{I}\big(\tilde{y}_n \ne F(x_n)\big) + e^{-\beta} \sum_{n=1}^{N} \omega_{n,m}$

  • the optimal function to add is

    $F_m = \arg\min_{F} \sum_{n=1}^{N} \omega_{n,m} \mathbb{I}\big(\tilde{y}_n \ne F(x_n)\big)$

This can be found by applying the weak learner to a weighted version of the dataset, with weights $\omega_{n,m}$.

  • Substituting $F_m$ into $L_m$ and solving for $\beta$ we find

    $\beta_m = \frac{1}{2} \log \frac{1 - \mathrm{err}_m}{\mathrm{err}_m}$

where

    $\mathrm{err}_m = \frac{\sum_{n=1}^{N} \omega_{n,m} \mathbb{I}\big(\tilde{y}_n \ne F_m(x_n)\big)}{\sum_{n=1}^{N} \omega_{n,m}}$

  • Therefore the overall update becomes

    $f_m(x) = f_{m-1}(x) + \beta_m F_m(x)$
  • the weights for the next iteration are updated as follows:

    $\omega_{n,m+1} = \omega_{n,m} \, e^{-\beta_m \tilde{y}_n F_m(x_n)}$

  • If $\tilde{y}_n = F_m(x_n)$, then $\tilde{y}_n F_m(x_n) = 1$, and if $\tilde{y}_n \ne F_m(x_n)$, then $\tilde{y}_n F_m(x_n) = -1$. Hence $-\tilde{y}_n F_m(x_n) = 2\,\mathbb{I}\big(\tilde{y}_n \ne F_m(x_n)\big) - 1$, so the update becomes

    $\omega_{n,m+1} = \omega_{n,m} \, e^{2 \beta_m \mathbb{I}(\tilde{y}_n \ne F_m(x_n))} e^{-\beta_m}$

  • Since the $e^{-\beta_m}$ factor is constant across all examples, it can be dropped. If we then define $\alpha_m = 2\beta_m$, the update becomes

    $\omega_{n,m+1} = \omega_{n,m} \, e^{\alpha_m \mathbb{I}(\tilde{y}_n \ne F_m(x_n))}$

Thus we see that we exponentially increase the weights of misclassified examples. The resulting algorithm is shown in Algorithm 8, and is known as AdaBoost.
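
A minimal sketch in the spirit of the derivation above (not necessarily identical to Algorithm 8 in the source text): depth-1 sklearn trees as weak learners, the weighted error $\mathrm{err}_m$, the coefficient $\beta_m = \frac{1}{2}\log\frac{1-\mathrm{err}_m}{\mathrm{err}_m}$, and exponential reweighting. The dataset and number of rounds are arbitrary.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y01 = load_breast_cancer(return_X_y=True)
y = 2 * y01 - 1                                  # labels in {-1, +1}

N, M = len(y), 100
w = np.ones(N) / N                               # initial uniform weights
stumps, betas = [], []
for m in range(M):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w * (pred != y)) / np.sum(w)    # weighted error err_m
    beta = 0.5 * np.log((1 - err) / err)         # beta_m
    w = w * np.exp(-beta * y * pred)             # exponentially up-weight mistakes
    w = w / np.sum(w)                            # normalization does not change the argmin
    stumps.append(stump)
    betas.append(beta)

f = sum(b * s.predict(X) for b, s in zip(betas, stumps))
print(np.mean(np.sign(f) == y))                  # training accuracy of the boosted ensemble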

LogitBoost

  • exponential loss puts a lot of weight on misclassified examples
  • This makes the method very sensitive to outliers (mislabeled examples).
  • In addition, $e^{-\tilde{y} f(x)}$ is not the logarithm of any pmf for binary variables $\tilde{y} \in \{-1, +1\}$; consequently we cannot recover probability estimates from $f(x)$.
  • A natural alternative is to use the log loss. This only punishes mistakes linearly, as is clear from Figure 18.7.
  • Furthermore, it means that we will be able to extract probabilities from the final learned function, using

    $p(y = 1 | x) = \frac{e^{f(x)}}{e^{-f(x)} + e^{f(x)}} = \frac{1}{1 + e^{-2 f(x)}}$

  • The goal is to minimize the expected log-loss, given by

    $L_m = \sum_{n=1}^{N} \log\Big(1 + e^{-2 \tilde{y}_n (f_{m-1}(x_n) + \beta F(x_n))}\Big)$

Gradient boosting

  • to derive a generic version, known as gradient boosting. To explain this, imagine solving $\hat{\mathbf{f}} = \arg\min_{\mathbf{f}} \mathcal{L}(\mathbf{f})$ by performing gradient descent in the space of functions. Since functions are infinite dimensional objects, we will represent them by their values on the training set, $\mathbf{f} = (f(x_1), \dots, f(x_N))$. At step $m$, let $\mathbf{g}_m$ be the gradient of $\mathcal{L}(\mathbf{f})$ evaluated at $\mathbf{f} = \mathbf{f}_{m-1}$:

    $g_{nm} = \left[\frac{\partial \ell(y_n, f(x_n))}{\partial f(x_n)}\right]_{\mathbf{f} = \mathbf{f}_{m-1}}$

  • make the update

    $\mathbf{f}_m = \mathbf{f}_{m-1} - \beta_m \mathbf{g}_m$

where $\beta_m$ is the step length, chosen by

    $\beta_m = \arg\min_{\beta} \mathcal{L}(\mathbf{f}_{m-1} - \beta \mathbf{g}_m)$

  • fit a weak learner to approximate the negative gradient signal. That is, we use the update

    $F_m = \arg\min_{F} \sum_{n=1}^{N} \big(-g_{nm} - F(x_n)\big)^2$

Gradient tree boosting

  • In practice, gradient boosting nearly always assumes that the weak learner is a regression tree, which is a model of the form

    $F_m(x) = \sum_{j=1}^{J_m} w_{jm} \, \mathbb{I}(x \in R_{jm})$

  • To use this in gradient boosting, we first find good regions $R_{jm}$ for tree $m$ using standard regression tree learning on the residuals; we then (re)solve for the weights of each leaf by solving

    $\hat{w}_{jm} = \arg\min_{w} \sum_{x_n \in R_{jm}} \ell\big(y_n, f_{m-1}(x_n) + w\big)$
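
A short sklearn sketch of gradient tree boosting: GradientBoostingRegressor fits each new tree to the negative gradient of the loss (which for squared error is just the residual), and shrinks each tree's contribution by learning_rate. Data and hyperparameters here are arbitrary.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(200, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(200)

gbr = GradientBoostingRegressor(n_estimators=200, max_depth=3,
                                learning_rate=0.1, loss="squared_error")
gbr.fit(X, y)
print(gbr.score(X, y))                     # R^2 on the training data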

XGBoost

  • XGBoost (https://github.com/dmlc/xgboost), which stands for "extreme gradient boosting", is a very efficient and widely used implementation of gradient boosted trees:
    • it adds a regularizer on the tree complexity
    • it uses a second order approximation of the loss instead of just a linear approximation
    • it samples features at internal nodes (as in random forests)
    • it uses various computer science methods (such as handling out-of-core computation for large datasets) to ensure scalability.
  • In more detail, XGBoost optimizes the following regularized objective

    $\mathcal{L}(f) = \sum_{n=1}^{N} \ell\big(y_n, f(x_n)\big) + \Omega(f)$

  • the regularizer

    $\Omega(f) = \gamma J + \frac{1}{2} \lambda \sum_{j=1}^{J} w_j^2$

where $J$ is the number of leaves, and $\gamma \ge 0$ and $\lambda \ge 0$ are regularization coefficients.
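
A minimal sketch of XGBoost's scikit-learn style interface (assuming the xgboost Python package is installed): reg_lambda and gamma correspond to the $\lambda$ and $\gamma$ terms of the regularizer above, while subsample and colsample_bynode give the row and feature subsampling mentioned earlier. The data and parameter values are arbitrary.

import numpy as np
from xgboost import XGBRegressor

rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(200, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(200)

model = XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.1,
                     reg_lambda=1.0, gamma=0.0,       # the regularizer terms above
                     subsample=0.8, colsample_bynode=0.8)
model.fit(X, y)
print(model.score(X, y))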