Trees, Forests, Bagging, and Boosting

Materials are adapted from "Murphy, Kevin P. Probabilistic Machine Learning: An Introduction. MIT Press, 2022." This handout is for teaching only. DO NOT DISTRIBUTE.

Classification and regression trees (CART)

Model definition

Model fitting

Regularization

Pros and cons

Ensemble learning

Stacking

Ensembling is not Bayes model averaging

Bagging

Random forests

Random forests: learn trees based on a randomly chosen subset of the input variables (at each node of the tree), as well as a randomly chosen subset of the data cases; this decorrelates the base learners.

Figure 18.5 shows that random forests work much better than bagged decision trees, because many input features are irrelevant.
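As a rough illustration of this effect, the sketch below (assuming scikit-learn is available; the dataset sizes and hyperparameters are arbitrary illustrative choices) compares bagged trees with a random forest on synthetic data in which most features are irrelevant. Setting max_features=None recovers plain bagging of trees, while max_features="sqrt" adds the per-node feature subsampling that defines a random forest.

```python
# Minimal sketch: bagged trees vs. a random forest when most features are irrelevant.
# (Hyperparameters and dataset sizes are illustrative assumptions, not from the text.)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# 5 informative features buried among 45 irrelevant ones.
X, y = make_classification(n_samples=1000, n_features=50, n_informative=5,
                           n_redundant=0, random_state=0)

# "Bagged trees": bootstrap resampling only; every feature is considered at each split.
bagged = RandomForestClassifier(n_estimators=200, max_features=None,
                                bootstrap=True, random_state=0)

# Random forest: additionally restrict each split to a random subset of the features.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                bootstrap=True, random_state=0)

for name, model in [("bagged trees", bagged), ("random forest", forest)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```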

Boosting

Forward stagewise additive modeling

Forward stagewise additive modeling sequentially optimizes the objective for general (differentiable) loss functions:
$$\left(\beta_{m}, \boldsymbol{\theta}_{m}\right)=\underset{\beta, \boldsymbol{\theta}}{\operatorname{argmin}} \sum_{i=1}^{N} \ell\left(y_{i}, f_{m-1}\left(\boldsymbol{x}_{i}\right)+\beta F\left(\boldsymbol{x}_{i} ; \boldsymbol{\theta}\right)\right)$$
We then set
$$f_{m}(\boldsymbol{x})=f_{m-1}(\boldsymbol{x})+\beta_{m} F\left(\boldsymbol{x} ; \boldsymbol{\theta}_{m}\right)=f_{m-1}(\boldsymbol{x})+\beta_{m} F_{m}(\boldsymbol{x})$$
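A minimal sketch of this loop for squared loss, using shallow regression trees as the weak learner $F$ (both the tree depth and the use of a small fixed $\beta$, i.e. shrinkage in place of the exact argmin over $\beta$, are simplifying assumptions):

```python
# Forward stagewise additive modeling with squared loss: each stage fits F_m to the
# current residual and adds beta * F_m to the running model f_{m-1}.
# (Fixed beta and max_depth=2 are illustrative assumptions.)
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def forward_stagewise(X, y, n_stages=100, beta=0.1):
    f0 = y.mean()                                 # f_0: constant baseline
    f = np.full(len(y), f0)
    learners = []
    for m in range(n_stages):
        residual = y - f                          # for squared loss, F_m should target the residual
        F_m = DecisionTreeRegressor(max_depth=2).fit(X, residual)
        f = f + beta * F_m.predict(X)             # f_m = f_{m-1} + beta * F_m
        learners.append(F_m)
    return f0, learners, beta

def predict(model, X):
    f0, learners, beta = model
    return f0 + beta * sum(F.predict(X) for F in learners)
```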

Quadratic loss and least squares boosting

Exponential loss and AdaBoost

discrete AdaBoost

At step $m$, the exponential-loss objective for the new weak learner $F$ and its coefficient $\beta$ is
$$L_{m}=\sum_{i=1}^{N} \exp\left(-\tilde{y}_{i}\left(f_{m-1}\left(\boldsymbol{x}_{i}\right)+\beta F\left(\boldsymbol{x}_{i}\right)\right)\right)=\sum_{i=1}^{N} \omega_{i, m} \exp\left(-\beta \tilde{y}_{i} F\left(\boldsymbol{x}_{i}\right)\right)$$
where $\omega_{i, m} \triangleq \exp\left(-\tilde{y}_{i} f_{m-1}\left(\boldsymbol{x}_{i}\right)\right)$ is a weight applied to data case $i$, and $\tilde{y}_{i} \in\{-1,+1\}$. We can rewrite this objective as follows:
$$\begin{aligned} L_{m} &=e^{-\beta} \sum_{\tilde{y}_{i}=F\left(\boldsymbol{x}_{i}\right)} \omega_{i, m}+e^{\beta} \sum_{\tilde{y}_{i} \neq F\left(\boldsymbol{x}_{i}\right)} \omega_{i, m} \\ &=\left(e^{\beta}-e^{-\beta}\right) \sum_{i=1}^{N} \omega_{i, m} \mathbb{I}\left(\tilde{y}_{i} \neq F\left(\boldsymbol{x}_{i}\right)\right)+e^{-\beta} \sum_{i=1}^{N} \omega_{i, m} \end{aligned}$$
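Minimizing $L_m$ over $\beta$ for a fixed weak learner gives the usual discrete AdaBoost update, sketched below with decision stumps as the weak learner (the stump choice and the number of rounds are illustrative assumptions; labels must be coded in $\{-1,+1\}$):

```python
# Minimal discrete AdaBoost sketch: reweight data cases by exp(-beta * y * F(x)),
# where beta_m = 0.5 * log((1 - err_m) / err_m) minimizes L_m for the fitted F_m.
# (Decision stumps as the weak learner are an illustrative assumption.)
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y_tilde, n_rounds=50):
    """y_tilde must take values in {-1, +1}."""
    N = len(y_tilde)
    w = np.full(N, 1.0 / N)                              # omega_{i,1}
    stumps, betas = [], []
    for m in range(n_rounds):
        F = DecisionTreeClassifier(max_depth=1).fit(X, y_tilde, sample_weight=w)
        pred = F.predict(X)
        err = np.sum(w * (pred != y_tilde)) / np.sum(w)  # weighted error rate
        beta = 0.5 * np.log((1 - err) / (err + 1e-12))
        w = w * np.exp(-beta * y_tilde * pred)           # omega_{i,m+1}, up to normalization
        w = w / w.sum()
        stumps.append(F)
        betas.append(beta)
    return stumps, betas

def adaboost_predict(stumps, betas, X):
    return np.sign(sum(b * F.predict(X) for F, b in zip(stumps, betas)))
```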

LogitBoost

Gradient boosting

If we apply this algorithm using squared loss, we recover L2Boosting, since $-g_{i m}=y_{i}-f_{m-1}\left(\boldsymbol{x}_{i}\right)$ is just the residual error. We can also apply this algorithm to other loss functions, such as absolute loss or Huber loss (Section 5.1.5.3), which is useful for robust regression problems.
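A minimal sketch of this generic loop, where the only loss-specific ingredient is the negative-gradient (pseudo-residual) function; the tree depth and learning rate below are illustrative assumptions:

```python
# Generic gradient boosting: each stage fits a regression tree to the negative
# gradient of the loss, evaluated at the current predictions f_{m-1}(x_i).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, neg_gradient, n_stages=100, lr=0.1):
    f = np.zeros(len(y))
    trees = []
    for m in range(n_stages):
        r = neg_gradient(y, f)                         # pseudo-residual -g_im
        tree = DecisionTreeRegressor(max_depth=3).fit(X, r)
        f = f + lr * tree.predict(X)
        trees.append(tree)
    return trees

# Squared loss recovers L2Boosting: the negative gradient is the residual.
squared_neg_grad = lambda y, f: y - f
# Absolute loss for robust regression: the negative gradient is the sign of the residual.
absolute_neg_grad = lambda y, f: np.sign(y - f)
```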

For classification, we can use log-loss. In this case, we get an algorithm known as BinomialBoost [BH07]. The advantage of this over LogitBoost is that it does not need to be able to do weighted fitting: it just applies any black-box regression model to the gradient vector. To apply this to multi-class classification, we can fit $C$ separate regression trees, using a pseudo-residual of the form
$$-g_{i c m}=\frac{\partial \ell\left(y_{i}, f_{1 m}\left(\boldsymbol{x}_{i}\right), \ldots, f_{C m}\left(\boldsymbol{x}_{i}\right)\right)}{\partial f_{c m}\left(\boldsymbol{x}_{i}\right)}=\mathbb{I}\left(y_{i}=c\right)-\pi_{i c}$$
Although the trees are fit separately, their predictions are combined via a softmax transform
$$p(y=c \mid \boldsymbol{x})=\frac{e^{f_{c}(\boldsymbol{x})}}{\sum_{c^{\prime}=1}^{C} e^{f_{c^{\prime}}(\boldsymbol{x})}}$$
When we have large datasets, we can use a stochastic variant in which we subsample (without replacement) a random fraction of the data to pass to the regression tree at each iteration. This is called stochastic gradient boosting [Fri99]. Not only is it faster, but it can also generalize better, because subsampling the data is a form of regularization.
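As a brief usage sketch (assuming scikit-learn is available), GradientBoostingClassifier follows this recipe for multiclass problems, fitting one regression tree per class at each stage and combining them with a softmax; setting subsample below 1.0 switches on the stochastic variant, which samples without replacement. The dataset and hyperparameters below are arbitrary illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.1,
    subsample=0.5,   # stochastic gradient boosting: half the data per stage, without replacement
    max_depth=2,
    random_state=0,
)
print(cross_val_score(clf, X, y, cv=5).mean())
```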

Gradient tree boosting

In practice, gradient boosting nearly always assumes that the weak learner is a regression tree, which is a model of the form
$$F_{m}(\boldsymbol{x})=\sum_{j=1}^{J_{m}} w_{j m} \mathbb{I}\left(\boldsymbol{x} \in R_{j m}\right)$$
where $w_{j m}$ is the predicted output for region $R_{j m}$. (In general, $w_{j m}$ could be a vector.) This combination is called gradient boosted regression trees, or gradient tree boosting. (A related version is known as MART, which stands for "multiple additive regression trees" [FM03].)
To use this in gradient boosting, we first find good regions $R_{j m}$ for tree $m$ using standard regression tree learning (see Section 18.1) on the residuals; we then (re)solve for the weights of each leaf by solving
$$\hat{w}_{j m}=\underset{w}{\operatorname{argmin}} \sum_{\boldsymbol{x}_{i} \in R_{j m}} \ell\left(y_{i}, f_{m-1}\left(\boldsymbol{x}_{i}\right)+w\right)$$
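A minimal sketch of this two-step procedure, taking absolute loss as the example (an assumption): the regions come from a regression tree fit to the residuals, and the per-leaf argmin of absolute loss is simply the median of the residuals in that leaf (using the mean instead would recover squared loss).

```python
# Find regions R_jm by fitting a tree to the residuals, then re-solve each leaf
# weight w_jm for the actual loss (here: absolute loss, whose argmin is the median).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_tree_with_resolved_leaves(X, y, f_prev, max_depth=3):
    residual = y - f_prev
    tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
    leaf_id = tree.apply(X)                      # which region R_jm each x_i falls into
    w = {j: np.median(residual[leaf_id == j])    # argmin_w sum_{i in R_jm} |y_i - (f_prev_i + w)|
         for j in np.unique(leaf_id)}
    return tree, w

def tree_predict(tree, w, X):
    return np.array([w[j] for j in tree.apply(X)])
```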

XGBoost

Using a second-order Taylor expansion of the loss around $f_{m-1}$, with $g_{i m}$ and $h_{i m}$ denoting the first and second derivatives of $\ell(y_i, \cdot)$ evaluated at $f_{m-1}(\boldsymbol{x}_i)$, the regularized objective can be approximated as follows, dropping terms that are independent of $F_{m}$:
$$\begin{aligned} \mathcal{L}_{m}(q, \boldsymbol{w}) & \approx \sum_{i=1}^{N}\left[g_{i m} F_{m}\left(\boldsymbol{x}_{i}\right)+\frac{1}{2} h_{i m} F_{m}^{2}\left(\boldsymbol{x}_{i}\right)\right]+\gamma J+\frac{1}{2} \lambda \sum_{j=1}^{J} w_{j}^{2} \\ &=\sum_{j=1}^{J}\left[\left(\sum_{i \in I_{j}} g_{i m}\right) w_{j}+\frac{1}{2}\left(\sum_{i \in I_{j}} h_{i m}+\lambda\right) w_{j}^{2}\right]+\gamma J \end{aligned}$$
where $I_{j}=\left\{i: q\left(\boldsymbol{x}_{i}\right)=j\right\}$ is the set of indices of data points assigned to the $j$'th leaf.
Let us define $G_{j m}=\sum_{i \in I_{j}} g_{i m}$ and $H_{j m}=\sum_{i \in I_{j}} h_{i m}$. Then the above simplifies to
$$\mathcal{L}_{m}(q, \boldsymbol{w})=\sum_{j=1}^{J}\left[G_{j m} w_{j}+\frac{1}{2}\left(H_{j m}+\lambda\right) w_{j}^{2}\right]+\gamma J$$
This is quadratic in each $w_{j}$, so the optimal weights are given by
$$w_{j}^{*}=-\frac{G_{j m}}{H_{j m}+\lambda}$$
The loss for evaluating different tree structures $q$ then becomes
$$\mathcal{L}_{m}\left(q, \boldsymbol{w}^{*}\right)=-\frac{1}{2} \sum_{j=1}^{J} \frac{G_{j m}^{2}}{H_{j m}+\lambda}+\gamma J$$
We can greedily optimize this using a recursive node splitting procedure, as in Section 18.1. Specifically, for a given leaf $j$, we consider splitting it into a left and right half, $I=I_{L} \cup I_{R}$. We can compute the gain (reduction in loss) of such a split as follows:
$$\text{gain}=\frac{1}{2}\left[\frac{G_{L}^{2}}{H_{L}+\lambda}+\frac{G_{R}^{2}}{H_{R}+\lambda}-\frac{\left(G_{L}+G_{R}\right)^{2}}{\left(H_{L}+H_{R}\right)+\lambda}\right]-\gamma$$
where $G_{L}=\sum_{i \in I_{L}} g_{i m}$, $G_{R}=\sum_{i \in I_{R}} g_{i m}$, $H_{L}=\sum_{i \in I_{L}} h_{i m}$, and $H_{R}=\sum_{i \in I_{R}} h_{i m}$. Thus we see that it is not worth splitting a node if the gain is negative (i.e., the first term is less than $\gamma$).
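A small numeric sketch of these formulas (the data, $\lambda$, and $\gamma$ values are arbitrary): for squared loss the derivatives are $g_i = f_{m-1}(\boldsymbol{x}_i) - y_i$ and $h_i = 1$, and the code below computes the optimal leaf weight and the gain of one candidate split directly from the sums $G$ and $H$.

```python
# Optimal leaf weight w*_j = -G_j / (H_j + lambda) and the split gain from the text.
import numpy as np

def leaf_weight(g, h, lam):
    return -g.sum() / (h.sum() + lam)

def split_gain(g, h, left_mask, lam, gamma):
    GL, HL = g[left_mask].sum(), h[left_mask].sum()
    GR, HR = g[~left_mask].sum(), h[~left_mask].sum()
    return 0.5 * (GL**2 / (HL + lam)
                  + GR**2 / (HR + lam)
                  - (GL + GR)**2 / (HL + HR + lam)) - gamma

# Squared-loss example: g_i = f_{m-1}(x_i) - y_i, h_i = 1.
rng = np.random.default_rng(0)
y = rng.normal(size=8)
f_prev = np.zeros(8)
g, h = f_prev - y, np.ones(8)
left = np.arange(8) < 4                          # one candidate split of the leaf
print(leaf_weight(g, h, lam=1.0))
print(split_gain(g, h, left, lam=1.0, gamma=0.1))
```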

Interpreting tree ensembles

Feature importance

Partial dependency plots