
Add Literature section

Alexander Hess 2020-10-04 23:00:15 +02:00
commit 3849e5fd3f
Signed by: alexander
GPG key ID: 344EA5AB10D868E0
15 changed files with 878 additions and 3 deletions


@ -0,0 +1,15 @@
\subsection{Demand Forecasting with Machine Learning Methods}
\label{ml_methods}
ML methods have been employed in all kinds of prediction tasks in recent
years.
In this section, we restrict ourselves to the models that performed well in
our study: Random Forest (\gls{rf}) and Support Vector Regression
(\gls{svr}).
RFs are generally well suited for datasets without a priori knowledge about
the patterns, while SVR is known to perform well on time series data, as
shown by \cite{hansen2006} in general and \cite{bao2004} specifically for
intermittent demand.
Gradient Boosting, another popular ML method, was consistently outperformed by
RFs, and artificial neural networks require far more data than our
industry partner has available.


@ -0,0 +1,53 @@
\subsubsection{Supervised Learning.}
\label{learning}
A conceptual difference between classical and ML methods is the format
for the model inputs.
In ML models, a time series $Y$ is interpreted as labeled data.
Labels are collected into a vector $\vec{y}$, while the corresponding
predictors are aligned in a $(T - n) \times n$ matrix $\mat{X}$:
$$
\vec{y}
=
\begin{pmatrix}
y_T \\
y_{T-1} \\
\vdots \\
y_{n+1}
\end{pmatrix}
\qquad\qquad
\mat{X}
=
\begin{bmatrix}
y_{T-1} & y_{T-2} & \dots  & y_{T-n}     \\
y_{T-2} & y_{T-3} & \dots  & y_{T-(n+1)} \\
\vdots  & \vdots  & \ddots & \vdots      \\
y_n     & y_{n-1} & \dots  & y_1
\end{bmatrix}
$$
The $m = T - n$ rows are referred to as samples and the $n$ columns as
features.
Each row in $\mat{X}$ is ``labeled'' by the corresponding entry in $\vec{y}$,
and ML models are trained to fit the rows to their labels.
Conceptually, we model a functional relationship $f$ between $\mat{X}$ and
$\vec{y}$ such that the difference between the predicted
$\vec{\hat{y}} = f(\mat{X})$ and the true $\vec{y}$ is minimized
according to some error measure $L(\vec{\hat{y}}, \vec{y})$, where $L$
summarizes the goodness of the fit into a scalar value (e.g., the
well-known mean squared error [MSE]; cf., Section \ref{mase}).
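As an illustration, the construction of $\vec{y}$ and $\mat{X}$ from a
univariate series may be sketched as follows in Python (the function and
variable names are ours and purely illustrative):
\begin{verbatim}
import numpy as np

def make_supervised(series, n):
    """Arrange a time series into labels y and a lag matrix X.

    Row k of X holds the n observations preceding its label y[k],
    ordered from most recent to oldest, as in the matrix above.
    """
    T = len(series)
    X, y = [], []
    for i in range(T - 1, n - 1, -1):    # 0-based positions of y_T, ..., y_{n+1}
        y.append(series[i])
        X.append(series[i - n:i][::-1])  # lags y_{t-1}, y_{t-2}, ..., y_{t-n}
    return np.array(X), np.array(y)
\end{verbatim}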
$\mat{X}$ and $\vec{y}$ exhibit the ordinal character of time series data:
Not only do the entries of $\mat{X}$ and $\vec{y}$ overlap, but the rows of
$\mat{X}$ are also shifted versions of each other.
That does not hold for ML applications in general (e.g., the classic
example of classifying emails as spam or not spam, where the features model
properties of individual emails), and most of the common error measures
presented in introductory texts on ML are only applicable in cases
without such a structure in $\mat{X}$ and $\vec{y}$.
$n$, the number of past time steps required to predict a $y_t$, is an
exogenous model parameter.
For prediction, the forecaster supplies the trained ML model with an input
vector in the same format as a row $\vec{x}_i$ in $\mat{X}$.
For example, to predict $y_{T+1}$, the model takes the vector
$(y_T, y_{T-1}, \dots, y_{T-n+1})$ as input.
That is in contrast to the classical methods, where we only supply the number
of time steps to be predicted as a scalar integer.
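Continuing the sketch above, the one-step-ahead prediction could look as
follows, where \texttt{model} stands for any trained regressor with a
scikit-learn-style interface (an assumption made only for illustration):
\begin{verbatim}
# Predict y_{T+1} from the n most recent observations,
# ordered from most recent to oldest like the rows of X.
x_next = np.asarray(series[-n:][::-1]).reshape(1, -1)  # shape (1, n)
y_next = model.predict(x_next)[0]
\end{verbatim}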

tex/2_lit/3_ml/3_cv.tex

@ -0,0 +1,38 @@
\subsubsection{Cross-Validation.}
\label{cv}
Because ML models are trained by minimizing a loss function $L$, the
resulting in-sample value of $L$ underestimates, by design, the true error
we see when predicting into the actual future.
To counter that, one popular and model-agnostic approach is cross-validation
(\gls{cv}), as summarized, for example, by \cite{hastie2013}.
CV is a resampling technique, which randomly splits the samples into a
training and a test set.
Trained on the former, an ML model makes forecasts on the latter.
Then, the value of $L$ calculated only on the test set gives a realistic and
unbiased estimate of the true forecasting error and may be used for one
of two distinct purposes:
First, it assesses the quality of a fit and provides an idea as to how the
model would perform in production when predicting into the actual future.
Second, the errors of models of either different methods or the same method
with different parameters may be compared with each other to select the
best model.
In order to first select the best model and then assess its quality, one must
apply two chained CVs:
The samples are divided into training, validation, and test sets, and all
models are trained on the training set and compared on the validation set.
Then, the winner is retrained on the union of the training and validation
sets and assessed on the test set.
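A minimal sketch of this chained procedure with scikit-learn follows; the
candidate models, split sizes, and the use of a simple hold-out split are
illustrative assumptions, and $\mat{X}$ and $\vec{y}$ are assumed to be
arranged as in Section \ref{learning}:
\begin{verbatim}
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Divide the samples into training, validation, and test sets.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25)

# Compare all candidate models on the validation set ...
candidates = [RandomForestRegressor(), SVR()]
errors = [
    mean_squared_error(y_val, model.fit(X_train, y_train).predict(X_val))
    for model in candidates
]
winner = candidates[errors.index(min(errors))]

# ... then retrain the winner on training + validation and assess it on the test set.
winner.fit(X_rest, y_rest)
test_error = mean_squared_error(y_test, winner.predict(X_test))
\end{verbatim}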
Regarding the splitting, there are various approaches, and we choose the
so-called $k$-fold CV, where the samples are randomly divided into $k$
folds of the same size.
Each fold is used as a test set once and the remaining $k-1$ folds become
the corresponding training set.
The resulting $k$ error measures are averaged.
A $k$-fold CV with $k=5$ or $k=10$ is a compromise between the two extreme
cases of having only one split and the so-called leave-one-out CV
where $k = m$: Computation is still relatively fast, and each sample is
part of several training sets, maximizing the learning from the data.
We adapt the $k$-fold CV to the ordinal structure in $\mat{X}$ and $\vec{y}$ in
Sub-section \ref{unified_cv}.
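As a sketch, such a $k$-fold CV with $k = 5$ could be implemented with
scikit-learn as follows; note that the random shuffling here ignores the
ordinal structure, which is exactly what Sub-section \ref{unified_cv}
addresses (the model class is an illustrative choice):
\begin{verbatim}
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

k_fold = KFold(n_splits=5, shuffle=True)
errors = []
for train_idx, test_idx in k_fold.split(X):
    model = RandomForestRegressor()
    model.fit(X[train_idx], y[train_idx])
    errors.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

cv_error = np.mean(errors)  # average of the k error measures
\end{verbatim}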

tex/2_lit/3_ml/4_rf.tex

@ -0,0 +1,66 @@
\subsubsection{Random Forest Regression.}
\label{rf}
\cite{breiman1984} introduce the classification and regression tree
(\gls{cart}) model that is built around the idea that a single binary
decision tree maps learned combinations of intervals of the feature
columns to a label.
Thus, each sample in the training set is associated with one leaf node,
which is reached by following the tree from its root and branching at
each intermediate node according to a learned splitting rule that
compares the sample's realization of the feature specified by the rule
to a learned split point.
While such models are computationally fast and offer a high degree of
interpretability, they tend to overfit strongly to the training set as
the splitting rules are not limited to any functional form (e.g., linear)
in the relationship between the features and the labels.
In the regression case, it is common to maximize the variance reduction $I_V$
from a parent node $N$ to its two children, $C1$ and $C2$, as the
splitting rule.
\cite{breiman1984} formulate this as follows:
$$
I_V(N)
=
\frac{1}{|S_N|^2} \sum_{i \in S_N} \sum_{j \in S_N}
\frac{1}{2} (y_i - y_j)^2
- \left(
\frac{1}{|S_{C1}|^2} \sum_{i \in S_{C1}} \sum_{j \in S_{C1}}
\frac{1}{2} (y_i - y_j)^2
+
\frac{1}{|S_{C2}|^2} \sum_{i \in S_{C2}} \sum_{j \in S_{C2}}
\frac{1}{2} (y_i - y_j)^2
\right)
$$
$S_N$, $S_{C1}$, and $S_{C2}$ are the index sets of the samples in $N$, $C1$,
and $C2$.
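For illustration, each double sum reduces to the (biased) variance of the
labels in the respective node, so that $I_V$ can be computed as in the
following sketch (the function names are ours):
\begin{verbatim}
import numpy as np

def node_term(labels):
    """The double sum over pairwise squared differences for one node;
    it equals the (biased) variance of the node's labels, np.var(labels)."""
    labels = np.asarray(labels, dtype=float)
    diffs = labels[:, None] - labels[None, :]
    return 0.5 * np.sum(diffs ** 2) / len(labels) ** 2

def variance_reduction(y_parent, y_child1, y_child2):
    """I_V(N) as formulated above."""
    return node_term(y_parent) - (node_term(y_child1) + node_term(y_child2))
\end{verbatim}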
\cite{ho1998} and then \cite{breiman2001} generalize this method by combining
many CART models into one forest of trees where every single tree is
a randomized variant of the others.
Randomization is achieved at two points in the training process:
First, each tree receives a distinct training set resampled with replacement
from the original training set, an idea also called bootstrap
aggregation.
Second, at each node, only a random subset of the features is considered
when searching for the best split.
Trees can be fitted in parallel, speeding up the training significantly.
For prediction at the tree level, the average label of all the samples at
the reached leaf node is used.
Then, the individual values are combined into one value by averaging again
across the trees.
Due to the randomization, the trees are decorrelated, which offsets the
overfitting.
Another measure to counter overfitting is pruning the tree, either by
specifying the maximum depth of a tree or the minimum number of samples
at leaf nodes.
The forecaster must tune the structure of the forest.
Parameters include the number of trees in the forest, the size of the random
subset of features, and the pruning criteria.
The parameters are optimized via grid search: We train many models with
parameters chosen from a pre-defined list of values and select the best
one by CV.
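A sketch of such a grid search with scikit-learn follows; the grid values
below are illustrative and not the ones used in our study:
\begin{verbatim}
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 500, 1000],    # number of trees in the forest
    "max_features": ["sqrt", 0.5, 1.0],  # size of the random feature subset
    "max_depth": [None, 5, 10],          # pruning: maximum depth of a tree
    "min_samples_leaf": [1, 5, 10],      # pruning: minimum samples at leaf nodes
}

search = GridSearchCV(
    RandomForestRegressor(),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)
best_rf = search.best_estimator_
\end{verbatim}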
RFs are a convenient ML method for any dataset as decision trees do not
make any assumptions about the relationship between features and labels.
\cite{herrera2010} use RFs to predict the hourly demand for water in an urban
context, an application similar to the one in this paper, and find that RFs
work well with time series data.

tex/2_lit/3_ml/5_svm.tex

@ -0,0 +1,60 @@
\subsubsection{Support Vector Regression.}
\label{svm}
\cite{vapnik1963} and \cite{vapnik1964} introduce the so-called support vector
machine (\gls{svm}) model, and \cite{vapnik2013} summarizes the research
conducted since then.
In its basic version, the SVM is a linear classifier modeling a binary
decision: it fits a hyperplane into the feature space of $\mat{X}$ so as to
maximize the margin around the hyperplane separating the two groups of
labels.
SVMs were popularized in the 1990s in the context of optical character
recognition, as shown in \cite{scholkopf1998}.
\cite{drucker1997} and \cite{stitson1999} adapt SVMs to the regression case,
and \cite{smola2004} provide a comprehensive introduction to this approach.
\cite{mueller1997} and \cite{mueller1999} focus on SVRs in the context of time
series data and find that they tend to outperform classical methods.
\cite{chen2006a} and \cite{chen2006b} apply SVRs to predict the hourly demand
for water in cities, an application similar to the UDP case.
In the SVR case, a linear function
$\hat{y}_i = f(\vec{x}_i) = \langle\vec{w},\vec{x}_i\rangle + b$
is fitted so that the actual labels $y_i$ have a deviation of at most
$\epsilon$ from their predictions $\hat{y}_i$ (cf., the constraints
below).
SVRs are commonly formulated as quadratic optimization problems as follows:
$$
\text{minimize }
\frac{1}{2} \norm{\vec{w}}^2 + C \sum_{i=1}^m (\xi_i + \xi_i^*)
\quad \text{subject to }
\begin{cases}
y_i - \langle \vec{w}, \vec{x}_i \rangle - b \leq \epsilon + \xi_i
\text{,} \\
\langle \vec{w}, \vec{x}_i \rangle + b - y_i \leq \epsilon + \xi_i^*
\end{cases}
$$
$\vec{w}$ is the vector of fitted weights in the row space of $\mat{X}$ (i.e.,
the feature space), $b$ is a scalar bias term, and $\langle\cdot,\cdot\rangle$
denotes the dot product.
By minimizing the norm of $\vec{w}$, the fitted function is kept flat and
thus not prone to strong overfitting.
To allow individual samples outside the otherwise hard $\epsilon$ bounds,
non-negative slack variables $\xi_i$ and $\xi_i^*$ are included.
A non-negative parameter $C$ regulates how many samples may violate the
$\epsilon$ bounds and by how much.
To model non-linear relationships, one could use a mapping $\Phi(\cdot)$ for
the $\vec{x}_i$ from the row space of $\mat{X}$ to some higher
dimensional space; however, as the optimization problem only depends on
the dot product $\langle\cdot,\cdot\rangle$ and not the actual entries of
$\vec{x}_i$, it suffices to use a kernel function $k$ such that
$k(\vec{x}_i,\vec{x}_j) = \langle\Phi(\vec{x}_i),\Phi(\vec{x}_j)\rangle$.
Such kernels must fulfill certain mathematical properties, and, besides
polynomial kernels, radial basis functions with
$k(\vec{x}_i,\vec{x}_j) = \exp(-\gamma \norm{\vec{x}_i - \vec{x}_j}^2)$ are
a popular choice, where $\gamma$ is a parameter controlling how the
distances between any two samples influence the final model.
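A sketch of fitting such an SVR with an RBF kernel in scikit-learn follows;
the parameter values are illustrative only and would be tuned via CV, and
the feature scaling is common practice for SVMs rather than part of the
formulation above:
\begin{verbatim}
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# C regulates the epsilon-bound violations, epsilon sets the width of the
# tube around the fitted function, and gamma scales the distances in the
# RBF kernel.
svr = make_pipeline(
    StandardScaler(),
    SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma="scale"),
)
svr.fit(X, y)
y_hat = svr.predict(X)
\end{verbatim}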
SVRs work well with sparse data in high-dimensional spaces, such as
intermittent demand data, as they minimize the risk of misclassification
or of predicting a value that is far off by maximizing the error
margin, as also noted by \cite{bao2004}.