\subsubsection{Random Forest Regression}
\label{rf}

\cite{breiman1984} introduce the classification and regression tree
(CART) model, which is built around the idea that a single binary
decision tree maps learned combinations of intervals of the feature
columns to a label.
Thus, each sample in the training set is associated with one leaf node
that is reached by following the tree from its root and branching along
the arcs according to a learned splitting rule per intermediate node,
which compares the sample's realization of the feature specified by the
rule against the rule's learned threshold.
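
To make this mapping concrete, the following minimal sketch (assuming
scikit-learn; the toy data and parameter values are illustrative
placeholders, not from this paper) fits a single regression tree and
reads off the leaf node to which each training sample is routed.

\begin{verbatim}
# Minimal sketch of a single CART regression tree (scikit-learn);
# the toy data and parameter values are illustrative placeholders.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))               # toy feature matrix
y = X[:, 0] + 0.1 * rng.normal(size=500)    # toy labels

tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X, y)

leaf_ids = tree.apply(X)   # leaf node each sample is routed to
y_hat = tree.predict(X)    # prediction = mean label at that leaf
\end{verbatim}
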
While such models are computationally fast and offer a high degree of
interpretability, they tend to overfit strongly to the training set, as
the splitting rules do not restrict the relationship between the
features and the labels to any functional form (e.g., linear).
In the regression case, it is common to choose the splitting rule that
maximizes the variance reduction $I_V$ from a parent node $N$ to its
two children, $C1$ and $C2$.
\cite{breiman1984} formulate this as follows:
$$
I_V(N)
=
\frac{1}{|S_N|^2} \sum_{i \in S_N} \sum_{j \in S_N}
\frac{1}{2} (y_i - y_j)^2
- \left(
\frac{1}{|S_{C1}|^2} \sum_{i \in S_{C1}} \sum_{j \in S_{C1}}
\frac{1}{2} (y_i - y_j)^2
+
\frac{1}{|S_{C2}|^2} \sum_{i \in S_{C2}} \sum_{j \in S_{C2}}
\frac{1}{2} (y_i - y_j)^2
\right)
$$
$S_N$, $S_{C1}$, and $S_{C2}$ are the index sets of the samples in $N$,
$C1$, and $C2$.
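
To illustrate the formula, the following minimal sketch (not from the
paper; the toy labels and the candidate split are placeholders)
evaluates $I_V$ with NumPy and checks that each double sum collapses to
the biased sample variance of the labels in the node.

\begin{verbatim}
import numpy as np

def pairwise_variance(y):
    # 1/|S|^2 * sum_i sum_j 1/2 (y_i - y_j)^2, which equals the
    # biased sample variance of y
    diffs = y[:, None] - y[None, :]
    return 0.5 * np.mean(diffs ** 2)

def variance_reduction(y_parent, y_left, y_right):
    # I_V(N) for one candidate split of N into C1 and C2
    return pairwise_variance(y_parent) - (
        pairwise_variance(y_left) + pairwise_variance(y_right)
    )

y_parent = np.array([1.0, 2.0, 8.0, 9.0])   # toy labels at node N
assert np.isclose(pairwise_variance(y_parent), np.var(y_parent))
print(variance_reduction(y_parent, y_parent[:2], y_parent[2:]))  # 12.0
\end{verbatim}
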
\cite{ho1998} and then \cite{breiman2001} generalize this method by
combining many CART models into one forest of trees in which every
single tree is a randomized variant of the others.
Randomization is achieved at two steps in the training process:
First, each tree receives a distinct training set resampled with
replacement from the original training set, an idea also called
bootstrap aggregation.
Second, at each node, only a random subset of the features is
considered when growing the tree.
Trees can be fitted in parallel, which speeds up the training
significantly.
For prediction at the tree level, the average label of the training
samples at the reached leaf node is used.
Then, the individual values are combined into one value by averaging
again across the trees.
Due to the randomization, the trees are decorrelated, which offsets the
overfitting.
Another measure to counter overfitting is pruning the trees, either by
specifying the maximum depth of a tree or the minimum number of samples
at leaf nodes.
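
The following minimal sketch (assuming scikit-learn's
RandomForestRegressor; the toy data and parameter values are
illustrative placeholders, not the configuration used in this paper)
maps these ingredients to the corresponding arguments.

\begin{verbatim}
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                 # toy feature matrix
y = X[:, 0] + 0.1 * rng.normal(size=500)      # toy labels

forest = RandomForestRegressor(
    n_estimators=500,      # number of trees in the forest
    bootstrap=True,        # resample the training set with replacement
    max_features="sqrt",   # random feature subset tried at each node
    max_depth=10,          # pruning: maximum depth of each tree
    min_samples_leaf=5,    # pruning: minimum number of samples per leaf
    n_jobs=-1,             # fit the trees in parallel
    random_state=0,
)
forest.fit(X, y)
y_hat = forest.predict(X)  # averages the per-tree predictions
\end{verbatim}
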
The forecaster must tune the structure of the forest.
Parameters include the number of trees in the forest, the size of the
random subset of features, and the pruning criteria.
The parameters are optimized via grid search: We train one model for
each combination of parameter values from a pre-defined grid and select
the best one by CV.
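
A minimal sketch of such a grid search (assuming scikit-learn's
GridSearchCV; the grid values, the scoring rule, and the 5-fold CV are
placeholders, not the settings used in this paper) is given below.

\begin{verbatim}
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))               # toy feature matrix
y = X[:, 0] + 0.1 * rng.normal(size=500)    # toy labels

param_grid = {
    "n_estimators": [100, 500],             # number of trees
    "max_features": ["sqrt", 0.5],          # random feature subset size
    "max_depth": [5, 10, None],             # pruning criteria
    "min_samples_leaf": [1, 5, 20],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,                                   # select the best model by CV
    scoring="neg_mean_squared_error",
    n_jobs=-1,
)
search.fit(X, y)
best_forest = search.best_estimator_
\end{verbatim}
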
RFs are a convenient ML method for many datasets, as decision trees do
not make any assumptions about the functional form of the relationship
between features and labels.
\cite{herrera2010} use RFs to predict the hourly demand for water in an
urban context, an application similar to the one in this paper, and
find that RFs work well with time series data.
|