
Add Literature section

Alexander Hess 2020-10-04 23:00:15 +02:00
commit 3849e5fd3f
Signed by: alexander
GPG key ID: 344EA5AB10D868E0
15 changed files with 878 additions and 3 deletions


@ -0,0 +1,15 @@
\subsection{Demand Forecasting with Machine Learning Methods}
\label{ml_methods}
ML methods have been employed in all kinds of prediction tasks in recent
years.
In this section, we restrict ourselves to the models that performed well in
our study: Random Forest (\gls{rf}) and Support Vector Regression
(\gls{svr}).
RFs are generally well suited for datasets without a priori knowledge about
the patterns, while SVR is known to perform well on time series data, as
shown by \cite{hansen2006} in general and \cite{bao2004} specifically for
intermittent demand.
Gradient Boosting, another popular ML method, was consistently outperformed by
RFs, and artificial neural networks require far more data than our
industry partner has available.


@ -0,0 +1,53 @@
\subsubsection{Supervised Learning.}
\label{learning}
A conceptual difference between classical and ML methods is the format
for the model inputs.
In ML models, a time series $Y$ is interpreted as labeled data.
Labels are collected into a vector $\vec{y}$, while the corresponding
predictors are aligned in a $(T - n) \times n$ matrix $\mat{X}$:
$$
\vec{y}
=
\begin{pmatrix}
y_T \\
y_{T-1} \\
\vdots \\
y_{n+1}
\end{pmatrix}
\qquad\qquad
\mat{X}
=
\begin{bmatrix}
y_{T-1} & y_{T-2} & \dots  & y_{T-n}     \\
y_{T-2} & y_{T-3} & \dots  & y_{T-(n+1)} \\
\vdots  & \vdots  & \ddots & \vdots      \\
y_n     & y_{n-1} & \dots  & y_1
\end{bmatrix}
$$
The $m = T - n$ rows are referred to as samples and the $n$ columns as
features.
Each row in $\mat{X}$ is ``labeled'' by the corresponding entry in $\vec{y}$,
and ML models are trained to fit the rows to their labels.
Conceptually, we model a functional relationship $f$ between $\mat{X}$ and
$\vec{y}$ such that the difference between the predicted
$\vec{\hat{y}} = f(\mat{X})$ and the true $\vec{y}$ is minimized
according to some error measure $L(\vec{\hat{y}}, \vec{y})$, where $L$
summarizes the goodness of the fit into a scalar value (e.g., the
well-known mean squared error [MSE]; cf., Section \ref{mase}).
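As an illustration, the construction of $\vec{y}$ and $\mat{X}$ from a
univariate series may be sketched as follows in Python (the function and
variable names are ours and purely illustrative):
\begin{verbatim}
import numpy as np

def make_supervised(series, n):
    """Arrange a time series into labels y and a lag matrix X.

    Row k of X holds the n observations preceding its label y[k],
    ordered from most recent to oldest, as in the matrix above.
    """
    T = len(series)
    X, y = [], []
    for i in range(T - 1, n - 1, -1):    # 0-based positions of y_T, ..., y_{n+1}
        y.append(series[i])
        X.append(series[i - n:i][::-1])  # lags y_{t-1}, y_{t-2}, ..., y_{t-n}
    return np.array(X), np.array(y)
\end{verbatim}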
$\mat{X}$ and $\vec{y}$ exhibit the ordinal character of time series data:
Not only do the entries of $\mat{X}$ and $\vec{y}$ overlap, but the rows of
$\mat{X}$ are also shifted versions of each other.
That does not hold for ML applications in general (e.g., the classic
example of classifying emails as spam or not spam, where the features model
properties of individual emails), and most of the common error measures
presented in introductory texts on ML are only applicable in cases
without such a structure in $\mat{X}$ and $\vec{y}$.
$n$, the number of past time steps required to predict a $y_t$, is an
exogenous model parameter.
For prediction, the forecaster supplies the trained ML model with an input
vector in the same format as a row $\vec{x}_i$ in $\mat{X}$.
For example, to predict $y_{T+1}$, the model takes the vector
$(y_T, y_{T-1}, \dots, y_{T-n+1})$ as input.
That is in contrast to the classical methods, where we only supply the number
of time steps to be predicted as a scalar integer.
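Continuing the sketch above, the one-step-ahead prediction could look as
follows, where \texttt{model} stands for any trained regressor with a
scikit-learn-style interface (an assumption made only for illustration):
\begin{verbatim}
# Predict y_{T+1} from the n most recent observations,
# ordered from most recent to oldest like the rows of X.
x_next = np.asarray(series[-n:][::-1]).reshape(1, -1)  # shape (1, n)
y_next = model.predict(x_next)[0]
\end{verbatim}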

tex/2_lit/3_ml/3_cv.tex

@ -0,0 +1,38 @@
\subsubsection{Cross-Validation.}
\label{cv}
Because ML models are trained by minimizing a loss function $L$, the
resulting in-sample value of $L$ underestimates, by design, the true error
we see when predicting into the actual future.
To counter that, one popular and model-agnostic approach is cross-validation
(\gls{cv}), as summarized, for example, by \cite{hastie2013}.
CV is a resampling technique, which randomly splits the samples into a
training and a test set.
Trained on the former, an ML model makes forecasts on the latter.
Then, the value of $L$ calculated only on the test set gives a realistic and
unbiased estimate of the true forecasting error and may be used for one
of two distinct purposes:
First, it assesses the quality of a fit and provides an idea as to how the
model would perform in production when predicting into the actual future.
Second, the errors of models of either different methods or the same method
with different parameters may be compared with each other to select the
best model.
In order to first select the best model and then assess its quality, one must
apply two chained CVs:
The samples are divided into training, validation, and test sets, and all
models are trained on the training set and compared on the validation set.
Then, the winner is retrained on the union of the training and validation
sets and assessed on the test set.
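A minimal sketch of this chained procedure with scikit-learn follows; the
candidate models, split sizes, and the use of a simple hold-out split are
illustrative assumptions, and $\mat{X}$ and $\vec{y}$ are assumed to be
arranged as in Section \ref{learning}:
\begin{verbatim}
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Divide the samples into training, validation, and test sets.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25)

# Compare all candidate models on the validation set ...
candidates = [RandomForestRegressor(), SVR()]
errors = [
    mean_squared_error(y_val, model.fit(X_train, y_train).predict(X_val))
    for model in candidates
]
winner = candidates[errors.index(min(errors))]

# ... then retrain the winner on training + validation and assess it on the test set.
winner.fit(X_rest, y_rest)
test_error = mean_squared_error(y_test, winner.predict(X_test))
\end{verbatim}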
Regarding the splitting, there are various approaches, and we choose the
so-called $k$-fold CV, where the samples are randomly divided into $k$
folds of the same size.
Each fold is used as a test set once and the remaining $k-1$ folds become
the corresponding training set.
The resulting $k$ error measures are averaged.
A $k$-fold CV with $k=5$ or $k=10$ is a compromise between the two extreme
cases of having only one split and the so-called leave-one-out CV
where $k = m$: Computation is still relatively fast, and each sample is
part of several training sets, maximizing the learning from the data.
We adapt the $k$-fold CV to the ordinal structure in $\mat{X}$ and $\vec{y}$ in
Sub-section \ref{unified_cv}.
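As a sketch, such a $k$-fold CV with $k = 5$ could be implemented with
scikit-learn as follows; note that the random shuffling here ignores the
ordinal structure, which is exactly what Sub-section \ref{unified_cv}
addresses (the model class is an illustrative choice):
\begin{verbatim}
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

k_fold = KFold(n_splits=5, shuffle=True)
errors = []
for train_idx, test_idx in k_fold.split(X):
    model = RandomForestRegressor()
    model.fit(X[train_idx], y[train_idx])
    errors.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

cv_error = np.mean(errors)  # average of the k error measures
\end{verbatim}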

tex/2_lit/3_ml/4_rf.tex

@ -0,0 +1,66 @@
\subsubsection{Random Forest Regression.}
\label{rf}
\cite{breiman1984} introduce the classification and regression tree
(\gls{cart}) model that is built around the idea that a single binary
decision tree maps learned combinations of intervals of the feature
columns to a label.
Thus, each sample in the training set is associated with one leaf node,
which is reached by following the tree from its root and branching at
each intermediate node according to a learned splitting rule that
compares the sample's realization of the feature specified by the rule
to a learned split point.
While such models are computationally fast and offer a high degree of
interpretability, they tend to overfit strongly to the training set as
the splitting rules are not limited to any functional form (e.g., linear)
in the relationship between the features and the labels.
In the regression case, it is common to maximize the variance reduction $I_V$
from a parent node $N$ to its two children, $C1$ and $C2$, as the
splitting rule.
\cite{breiman1984} formulate this as follows:
$$
I_V(N)
=
\frac{1}{|S_N|^2} \sum_{i \in S_N} \sum_{j \in S_N}
\frac{1}{2} (y_i - y_j)^2
- \left(
\frac{1}{|S_{C1}|^2} \sum_{i \in S_{C1}} \sum_{j \in S_{C1}}
\frac{1}{2} (y_i - y_j)^2
+
\frac{1}{|S_{C2}|^2} \sum_{i \in S_{C2}} \sum_{j \in S_{C2}}
\frac{1}{2} (y_i - y_j)^2
\right)
$$
$S_N$, $S_{C1}$, and $S_{C2}$ are the index sets of the samples in $N$, $C1$,
and $C2$.
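For illustration, each double sum reduces to the (biased) variance of the
labels in the respective node, so that $I_V$ can be computed as in the
following sketch (the function names are ours):
\begin{verbatim}
import numpy as np

def node_term(labels):
    """The double sum over pairwise squared differences for one node;
    it equals the (biased) variance of the node's labels, np.var(labels)."""
    labels = np.asarray(labels, dtype=float)
    diffs = labels[:, None] - labels[None, :]
    return 0.5 * np.sum(diffs ** 2) / len(labels) ** 2

def variance_reduction(y_parent, y_child1, y_child2):
    """I_V(N) as formulated above."""
    return node_term(y_parent) - (node_term(y_child1) + node_term(y_child2))
\end{verbatim}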
\cite{ho1998} and then \cite{breiman2001} generalize this method by combining
many CART models into one forest of trees where every single tree is
a randomized variant of the others.
Randomization is achieved at two points in the training process:
First, each tree receives a distinct training set resampled with replacement
from the original training set, an idea also called bootstrap
aggregation.
Second, at each node, only a random subset of the features is considered
when searching for the best split.
Trees can be fitted in parallel, speeding up the training significantly.
For prediction at the tree level, the average label of all the samples at
the reached leaf node is used.
Then, the individual values are combined into one value by averaging again
across the trees.
Due to the randomization, the trees are decorrelated, which offsets the
overfitting.
Another measure to counter overfitting is pruning the tree, either by
specifying the maximum depth of a tree or the minimum number of samples
at leaf nodes.
The forecaster must tune the structure of the forest.
Parameters include the number of trees in the forest, the size of the random
subset of features, and the pruning criteria.
The parameters are optimized via grid search: We train many models with
parameters chosen from a pre-defined list of values and select the best
one by CV.
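A sketch of such a grid search with scikit-learn follows; the grid values
below are illustrative and not the ones used in our study:
\begin{verbatim}
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 500, 1000],    # number of trees in the forest
    "max_features": ["sqrt", 0.5, 1.0],  # size of the random feature subset
    "max_depth": [None, 5, 10],          # pruning: maximum depth of a tree
    "min_samples_leaf": [1, 5, 10],      # pruning: minimum samples at leaf nodes
}

search = GridSearchCV(
    RandomForestRegressor(),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)
best_rf = search.best_estimator_
\end{verbatim}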
RFs are a convenient ML method for any dataset as decision trees do not
make any assumptions about the relationship between features and labels.
\cite{herrera2010} use RFs to predict the hourly demand for water in an urban
context, an application similar to the one in this paper, and find that RFs
work well with time series data.

tex/2_lit/3_ml/5_svm.tex

@ -0,0 +1,60 @@
\subsubsection{Support Vector Regression.}
\label{svm}
\cite{vapnik1963} and \cite{vapnik1964} introduce the so-called support vector
machine (\gls{svm}) model, and \cite{vapnik2013} summarizes the research
conducted since then.
In its basic version, the SVM is a linear classifier modeling a binary
decision: it fits a hyperplane into the feature space of $\mat{X}$ so as to
maximize the margin around the hyperplane separating the two groups of
labels.
SVMs were popularized in the 1990s in the context of optical character
recognition, as shown in \cite{scholkopf1998}.
\cite{drucker1997} and \cite{stitson1999} adapt SVMs to the regression case,
and \cite{smola2004} provide a comprehensive introduction to this approach.
\cite{mueller1997} and \cite{mueller1999} focus on SVRs in the context of time
series data and find that they tend to outperform classical methods.
\cite{chen2006a} and \cite{chen2006b} apply SVRs to predict the hourly demand
for water in cities, an application similar to the UDP case.
In the SVR case, a linear function
$\hat{y}_i = f(\vec{x}_i) = \langle\vec{w},\vec{x}_i\rangle + b$
is fitted so that the actual labels $y_i$ have a deviation of at most
$\epsilon$ from their predictions $\hat{y}_i$ (cf., the constraints
below).
SVRs are commonly formulated as quadratic optimization problems as follows:
$$
\text{minimize }
\frac{1}{2} \norm{\vec{w}}^2 + C \sum_{i=1}^m (\xi_i + \xi_i^*)
\quad \text{subject to }
\begin{cases}
y_i - \langle \vec{w}, \vec{x}_i \rangle - b \leq \epsilon + \xi_i
\text{,} \\
\langle \vec{w}, \vec{x}_i \rangle + b - y_i \leq \epsilon + \xi_i^*
\end{cases}
$$
$\vec{w}$ is the vector of fitted weights in the row space of $\mat{X}$ (i.e.,
the feature space), $b$ is a scalar bias term, and $\langle\cdot,\cdot\rangle$
denotes the dot product.
By minimizing the norm of $\vec{w}$, the fitted function is kept flat and
thus not prone to strong overfitting.
To allow individual samples outside the otherwise hard $\epsilon$ bounds,
non-negative slack variables $\xi_i$ and $\xi_i^*$ are included.
A non-negative parameter $C$ regulates how many samples may violate the
$\epsilon$ bounds and by how much.
To model non-linear relationships, one could use a mapping $\Phi(\cdot)$ for
the $\vec{x}_i$ from the row space of $\mat{X}$ to some higher
dimensional space; however, as the optimization problem only depends on
the dot product $\langle\cdot,\cdot\rangle$ and not the actual entries of
$\vec{x}_i$, it suffices to use a kernel function $k$ such that
$k(\vec{x}_i,\vec{x}_j) = \langle\Phi(\vec{x}_i),\Phi(\vec{x}_j)\rangle$.
Such kernels must fulfill certain mathematical properties, and, besides
polynomial kernels, radial basis functions with
$k(\vec{x}_i,\vec{x}_j) = \exp(-\gamma \norm{\vec{x}_i - \vec{x}_j}^2)$ are
a popular choice, where $\gamma$ is a parameter controlling how the
distances between any two samples influence the final model.
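A sketch of fitting such an SVR with an RBF kernel in scikit-learn follows;
the parameter values are illustrative only and would be tuned via CV, and
the feature scaling is common practice for SVMs rather than part of the
formulation above:
\begin{verbatim}
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# C regulates the epsilon-bound violations, epsilon sets the width of the
# tube around the fitted function, and gamma scales the distances in the
# RBF kernel.
svr = make_pipeline(
    StandardScaler(),
    SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma="scale"),
)
svr.fit(X, y)
y_hat = svr.predict(X)
\end{verbatim}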
SVRs work well with sparse data in high-dimensional spaces, such as
intermittent demand data, as they minimize the risk of misclassification
or of predicting a value that is far off by maximizing the error
margin, as also noted by \cite{bao2004}.