
Merge branch 'literature-section' into develop

Alexander Hess 2020-10-04 23:01:40 +02:00
commit 7c203cb87c
Signed by: alexander
GPG key ID: 344EA5AB10D868E0
16 changed files with 878 additions and 3 deletions

BIN paper.pdf (binary file not shown)

@@ -9,6 +9,15 @@
\input{tex/1_intro}
\input{tex/2_lit/1_intro}
\input{tex/2_lit/2_class/1_intro}
\input{tex/2_lit/2_class/2_ets}
\input{tex/2_lit/2_class/3_arima}
\input{tex/2_lit/2_class/4_stl}
\input{tex/2_lit/3_ml/1_intro}
\input{tex/2_lit/3_ml/2_learning}
\input{tex/2_lit/3_ml/3_cv}
\input{tex/2_lit/3_ml/4_rf}
\input{tex/2_lit/3_ml/5_svm}
\input{tex/3_mod/1_intro}
\input{tex/4_stu/1_intro}
\input{tex/5_con/1_intro}

@@ -1,2 +1,17 @@
\section{Literature Review}
\label{lit}
In this section, we review the specific forecasting methods that make up our
forecasting system.
We group them into classical statistical models and ML models.
The two groups differ mainly in how they represent the input data and how
accuracy is evaluated.
A time series is a finite and ordered sequence of equally spaced observations.
Thus, time is regarded as discrete and a time step as a short period.
Formally, a time series $Y$ is defined as $Y = \{y_t: t \in I\}$, or $y_t$ for
short, where $I$ is an index set of positive integers.
Besides its length $T = |Y|$, another property is the a priori fixed and
non-negative periodicity $k$ of a seasonal pattern in demand:
$k$ is the number of time steps after which a pattern repeats itself (e.g.,
$k=12$ for monthly sales data).
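To illustrate the notation, a minimal sketch in Python (the language and the
numbers are purely illustrative and not part of the formal definition):

    import numpy as np

    # A time series Y as an ordered array of equally spaced observations;
    # with monthly sales data, the seasonal pattern repeats after k = 12 steps.
    y = np.array([112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118,
                  115, 126, 141, 135, 125, 149, 170, 170, 158, 133, 114, 140,
                  145, 150, 178, 163, 172, 178, 199, 199, 184, 162, 146, 166])
    T = len(y)   # length T = |Y|
    k = 12       # a priori fixed periodicity of the seasonal pattern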

@@ -0,0 +1,13 @@
\subsection{Demand Forecasting with Classical Forecasting Methods}
\label{class_methods}
Forecasting became a formal discipline starting in the 1950s and has its
origins in the broader field of statistics.
\cite{hyndman2018} provide a thorough overview of the established concepts and
methods, and \cite{ord2017} cover business-related applications
such as demand forecasting.
These "classical" forecasting methods share the characteristic that they are
first trained on the entire $Y$.
Then, for prediction, the forecaster specifies the number of time steps for
which forecasts are to be generated.
That is different for ML models.

@@ -0,0 +1,78 @@
\subsubsection{Na\"{i}ve Methods, Moving Averages, and Exponential Smoothing.}
\label{ets}
Simple forecasting methods are often employed as a benchmark for more
sophisticated ones.
The so-called na\"{i}ve and seasonal na\"{i}ve methods forecast the next time
step in a time series, $y_{T+1}$, with the last observation, $y_T$,
and, if a seasonal pattern is present, with the observation $k$ steps
before, $y_{T+1-k}$.
As variants, both methods can be generalized to include drift terms in the
presence of a trend or changing seasonal amplitude.
If a time series exhibits no trend, a simple moving average (SMA) is a
generalization of the na\"{i}ve method that is more robust to outliers.
It is defined as follows: $\hat{y}_{T+1} = \frac{1}{h} \sum_{i=T-h+1}^{T} y_i$
where $h$ is the horizon over which the average is calculated.
If a time series exhibits a seasonal pattern, setting $h$ to a multiple of the
periodicity $k$ ensures that the seasonal pattern does not bias the forecast.
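As a minimal sketch of these benchmark methods (Python with numpy; the
function names and the language are our illustrative choices, not prescribed
by the methods themselves):

    import numpy as np

    def naive(y):
        """Forecast y_{T+1} with the last observation y_T."""
        return y[-1]

    def seasonal_naive(y, k):
        """Forecast y_{T+1} with the observation k steps earlier, y_{T+1-k}."""
        return y[-k]

    def sma(y, h):
        """Simple moving average of the last h observations."""
        return np.mean(y[-h:])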
Starting in the 1950s, another popular family of forecasting methods,
so-called exponential smoothing methods, was introduced by
\cite{brown1959}, \cite{holt1957}, and \cite{winters1960}.
The idea is that forecasts $\hat{y}_{T+1}$ are a weighted average of past
observations where the weights decay over time; in the case of the simple
exponential smoothing (SES) method we obtain:
$
\hat{y}_{T+1} = \alpha y_T + \alpha (1 - \alpha) y_{T-1}
+ \alpha (1 - \alpha)^2 y_{T-2}
+ \dots + \alpha (1 - \alpha)^{T-1} y_{1}
$
where $\alpha$ (with $0 \le \alpha \le 1$) is a smoothing parameter.
Exponential smoothing methods are often expressed in an alternative component
form that consists of a forecast equation and one or more smoothing
equations for unobservable components.
Below, we present a generalization of SES, the so-called Holt-Winters'
seasonal method, in an additive formulation.
$\ell_t$, $b_t$, and $s_t$ represent the unobservable level, trend, and
seasonal components inherent in $y_t$, and $\beta$ and $\gamma$ complement
$\alpha$ as smoothing parameters:
\begin{align*}
\hat{y}_{t+1} & = \ell_t + b_t + s_{t+1-k} \\
\ell_t & = \alpha(y_t - s_{t-k}) + (1 - \alpha)(\ell_{t-1} + b_{t-1}) \\
b_t & = \beta (\ell_{t} - \ell_{t-1}) + (1 - \beta) b_{t-1} \\
s_t & = \gamma (y_t - \ell_{t-1} - b_{t-1}) + (1-\gamma)s_{t-k}
\end{align*}
With $b_t$, $s_t$, $\beta$, and $\gamma$ removed, this formulation reduces to
SES.
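As a sketch of fitting the additive Holt-Winters' method, assuming Python
with statsmodels and the illustrative series y and periodicity k from above
(the library choice is ours; it estimates the smoothing parameters from the
data by default):

    from statsmodels.tsa.holtwinters import ExponentialSmoothing

    # Additive Holt-Winters: level, trend, and seasonal components with the
    # smoothing parameters alpha, beta, and gamma estimated from the data.
    hw = ExponentialSmoothing(y, trend="add", seasonal="add",
                              seasonal_periods=k).fit()
    y_hat = hw.forecast(1)  # one-step-ahead forecast for y_{T+1}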
Distinct variations exist: Besides the three components, \cite{gardner1985}
add a damped trend, \cite{pegels1969} provides multiplicative
formulations, and \cite{taylor2003} adds damping to the latter.
The accuracy measure commonly employed is the sum of squared errors between
the observations and their forecasts.
The Theta method, originally introduced by \cite{assimakopoulos2000}, can be
regarded as equivalent to SES with a drift term, as \cite{hyndman2003}
show.
We mention this method here only because \cite{bell2018} emphasize that it
performs well at Uber.
However, in our empirical study, we find that this is not true in general.
\cite{hyndman2002} introduce statistical processes, so-called innovations
state-space models, to generalize the methods in this sub-section.
They call this family of models ETS as they capture error, trend, and seasonal
terms.
Linear and additive ETS models have the following structure:
\begin{align*}
y_t & = \vec{w} \cdot \vec{x}_{t-1} + \epsilon_t \\
\vec{x}_t & = \mat{F} \vec{x}_{t-1} + \vec{g} \epsilon_t
\end{align*}
$y_t$ denotes the observations as before, while $\vec{x}_t$ is a state vector
of unobserved components.
$\epsilon_t$ is a white noise series, and the matrix $\mat{F}$ and the vectors
$\vec{g}$ and $\vec{w}$ contain a model's coefficients.
Like the models in the next sub-section, ETS models are commonly fitted
with maximum likelihood and evaluated against historical data using
information-theoretic criteria.
We refer to \cite{hyndman2008b} for a thorough summary.
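As a sketch, again assuming statsmodels (its ETSModel implements an
innovations state-space formulation; this is our illustration, not the
paper's own implementation):

    import pandas as pd
    from statsmodels.tsa.exponential_smoothing.ets import ETSModel

    # Linear, additive ETS model fitted via maximum likelihood; an
    # information-theoretic criterion (here the AIC) evaluates the fit.
    ets = ETSModel(pd.Series(y, dtype=float), error="add", trend="add",
                   seasonal="add", seasonal_periods=k).fit()
    print(ets.aic)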

@@ -0,0 +1,69 @@
\subsubsection{Autoregressive Integrated Moving Averages.}
\label{arima}
\cite{box1962}, \cite{box1968}, and further papers by the same authors in the
1960s introduce a type of model in which observations correlate with their
neighbors and refer to them as autoregressive integrated moving average
(ARIMA) models for stationary time series.
For a thorough overview, we refer to \cite{box2015} and \cite{brockwell2016}.
A time series $y_t$ is stationary if its moments are independent of the
point in time where it is observed.
A typical example is a white noise $\epsilon_t$ series.
Therefore, a trend or seasonality implies non-stationarity.
\cite{kwiatkowski1992} provide a test to check the null hypothesis of
stationary data.
To obtain a stationary time series, one chooses from several techniques:
First, to stabilize a changing variance (i.e., heteroscedasticity), one
applies a Box-Cox transformation (e.g., $\log$) as first suggested by
\cite{box1964}.
Second, to factor out a trend (or seasonal) pattern, one computes differences
of consecutive (or of lag $k$) observations or even differences thereof.
Third, it is also common to pre-process $y_t$ with one of the decomposition
methods mentioned in Sub-section \ref{stl} below and then train an ARIMA
model on the adjusted $y_t$.
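A sketch of these pre-processing steps, assuming Python with numpy, scipy,
and statsmodels and a strictly positive series y (the Box-Cox transform
requires positive data):

    import numpy as np
    from scipy.stats import boxcox
    from statsmodels.tsa.stattools import kpss

    stat, p_value, _, _ = kpss(y)    # H0: the series is stationary
    y_bc, lam = boxcox(y)            # variance-stabilizing Box-Cox transform
    y_diff = np.diff(y_bc)           # first differences factor out a trend
    y_sdiff = y_bc[k:] - y_bc[:-k]   # lag-k differences factor out seasonality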
In the autoregressive part, observations are modeled as linear combinations of
their predecessors.
Formally, an $AR(p)$ model is defined with a drift term $c$, coefficients
$\phi_i$ to be estimated (where $i$ is an index with $0 < i \leq p$), and
white noise $\epsilon_t$ like so:
$
AR(p): \ \
y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p}
+ \epsilon_t
$.
The moving average part models observations as regressing towards a
linear combination of past forecasting errors.
Formally, a $MA(q)$ model is defined with a drift term $c$, coefficients
$\theta_j$ to be estimated, and white noise terms $\epsilon_t$ (where $j$
is an index with $0 < j \leq q$) as follows:
$
MA(q): \ \
y_t = c + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2}
+ \dots + \theta_q \epsilon_{t-q}
$.
Finally, an $ARIMA(p,d,q)$ model unifies both parts and adds differencing,
where $d$ is the degree of differencing and the $'$ indicates differenced
values:
$
ARIMA(p,d,q): \ \
y'_t = c + \phi_1 y'_{t-1} + \dots + \phi_p y'_{t-p} + \theta_1 \epsilon_{t-1}
+ \dots + \theta_q \epsilon_{t-q} + \epsilon_{t}
$.
$ARIMA(p,d,q)$ models are commonly fitted with maximum likelihood estimation.
To find an optimal combination of the parameters $p$, $d$, and $q$, the
literature suggests calculating an information-theoretic criterion
(e.g., Akaike's Information Criterion) that evaluates the fit on
historical data.
\cite{hyndman2008a} provide a step-wise heuristic to choose $p$, $d$, and $q$
that also decides if a Box-Cox transformation is to be applied, and if so,
which one.
To obtain a one-step-ahead forecast, the above equation is reordered such
that $t$ is substituted with $T+1$.
For forecasts further into the future, the actual observations are
subsequently replaced by their forecasts.
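As a sketch, assuming statsmodels (the step-wise heuristic of
\cite{hyndman2008a} is implemented in R's forecast package; here the order is
fixed by hand for illustration):

    from statsmodels.tsa.arima.model import ARIMA

    # ARIMA(p, d, q) fitted via maximum likelihood; the AIC guides the
    # comparison of candidate combinations of p, d, and q.
    arima = ARIMA(y, order=(1, 1, 1)).fit()
    print(arima.aic)
    y_hat = arima.forecast(steps=7)  # multi-step forecasts reuse earlier
                                     # forecasts in place of observations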
Seasonal ARIMA variants exist; however, the high frequency $k$ in the kind of
demand a UDP faces typically renders them impractical as too many
coefficients must be estimated.

@@ -0,0 +1,62 @@
\subsubsection{Seasonal and Trend Decomposition using Loess.}
\label{stl}
A time series $y_t$ may exhibit different types of patterns; to fully capture
each of them, the series must be decomposed.
Then, each component is forecast with a distinct model.
Most commonly, the components are the trend $t_t$, seasonality $s_t$, and
remainder $r_t$.
They are themselves time series, where only $s_t$ exhibits a periodicity $k$.
A decomposition may be additive (i.e., $y_t = s_t + t_t + r_t$) or
multiplicative (i.e., $y_t = s_t * t_t * r_t$); the former assumes that
the effect of the seasonal component is independent of the overall level
of $y_t$ and vice versa.
The seasonal component is centered around $0$ in the additive case and around
$1$ in the multiplicative case such that its removal does not affect the
level of $y_t$.
Often, it is sufficient to only seasonally adjust the time series, and model
the trend and remainder together, for example, as $a_t = y_t - s_t$ in the
additive case.
Early approaches employed moving averages (cf., Sub-section \ref{ets}) to
calculate a trend component, and, after removing that from $y_t$, averaged
all observations of the same seasonal lag to obtain the seasonal
component.
The downsides of this are the subjectivity in choosing the window lengths for
the moving average and the seasonal averaging, the inability of the
seasonal component to vary its amplitude over time, and the lack of
outlier handling.
The X11 method developed at the U.S. Census Bureau and described in detail by
\cite{dagum2016} overcomes these disadvantages.
However, due to its background in economics, it is designed primarily for
quarterly or monthly data, and the change in amplitude over time cannot be
controlled.
Variants of this method are the SEATS decomposition by the Bank of Spain and
the newer X13-SEATS-ARIMA method by the U.S. Census Bureau.
Their main advantages stem from the fact that the models calibrate themselves
according to statistical criteria without manual work for a statistician
and that the fitting process is robust to outliers.
\cite{cleveland1990} introduce a seasonal and trend decomposition using a
repeated locally weighted regression, the so-called Loess procedure, to
smooth the trend and seasonal components; it can be viewed as a
generalization of the methods above and is denoted by the acronym
\gls{stl}.
In contrast to the X11, X13, and SEATS methods, the STL supports seasonalities
of any periodicity $k$, which must, however, be determined with additional
statistical tests or set with out-of-band knowledge by the forecaster
(e.g., hourly demand data implies $k = 24 \cdot 7 = 168$, assuming customer
behavior differs on each day of the week).
Moreover, the seasonal component's rate of change, represented by the $ns$
parameter and explained in detail with Figure \ref{f:stl} in Section
\ref{decomp}, must be set by the forecaster as well, while the trend's
smoothness may be controlled by setting a non-default window size.
Outliers are handled by assignment to the remainder such that they do not
affect the trend and seasonal components.
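A sketch using the STL implementation in statsmodels (y_hourly is a
hypothetical hourly demand series; the trend window is left at its default):

    from statsmodels.tsa.seasonal import STL

    # k = 24 * 7 = 168 for hourly data; "seasonal" controls how fast the
    # seasonal component may change (cf. the ns parameter), and robust=True
    # assigns outliers to the remainder.
    res = STL(y_hourly, period=168, seasonal=7, robust=True).fit()
    adjusted = y_hourly - res.seasonal   # seasonally adjusted series a_t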
In particular, the manual input needed to calibrate the STL explains why only
the X11, X13, and SEATS methods are widely used by practitioners.
However, the widespread adoption of concepts like cross-validation (cf.,
Sub-section \ref{cv}) in recent years enables the usage of an automated
grid search to optimize the parameters.
The STL's usage within a grid search is facilitated even further by its being
computationally cheaper than the other methods discussed.

@@ -0,0 +1,15 @@
\subsection{Demand Forecasting with Machine Learning Methods}
\label{ml_methods}
ML methods have been employed in all kinds of prediction tasks in recent
years.
In this section, we restrict ourselves to the models that performed well in
our study: Random Forest (\gls{rf}) and Support Vector Regression
(\gls{svr}).
RFs are in general well-suited for datasets without a priori knowledge about
the patterns, while SVR is known to perform well on time series data, as
shown by \cite{hansen2006} in general and \cite{bao2004} specifically for
intermittent demand.
Gradient Boosting, another popular ML method, was consistently outperformed by
RFs, and artificial neural networks require far more data than our
industry partner has available.

@@ -0,0 +1,53 @@
\subsubsection{Supervised Learning.}
\label{learning}
A conceptual difference between classical and ML methods is the format
of the model inputs.
In ML models, a time series $Y$ is interpreted as labeled data.
Labels are collected into a vector $\vec{y}$ while the corresponding
predictors are aligned in a $(T - n) \times n$ matrix $\mat{X}$:
$$
\vec{y}
=
\begin{pmatrix}
y_T \\
y_{T-1} \\
\vdots \\
y_{n+1}
\end{pmatrix}
~~~~~~~~~~
\mat{X}
=
\begin{bmatrix}
y_{T-1} & y_{T-2} & \dots & y_{T-n} \\
y_{T-2} & y_{T-3} & \dots & y_{T-(n+1)} \\
\vdots & \vdots & \ddots & \vdots \\
y_n & y_{n-1} & \dots & y_1
\end{bmatrix}
$$
The $m = T - n$ rows are referred to as samples and the $n$ columns as
features.
Each row in $\mat{X}$ is "labeled" by the corresponding entry in $\vec{y}$,
and ML models are trained to fit the rows to their labels.
Conceptually, we model a functional relationship $f$ between $\mat{X}$ and
$\vec{y}$ such that the difference between the predicted
$\vec{\hat{y}} = f(\mat{X})$ and the true $\vec{y}$ is minimized
according to some error measure $L(\vec{\hat{y}}, \vec{y})$, where $L$
summarizes the goodness of the fit into a scalar value (e.g., the
well-known mean squared error [MSE]; cf., Section \ref{mase}).
$\mat{X}$ and $\vec{y}$ show the ordinal character of time series data:
Not only do the entries of $\mat{X}$ and $\vec{y}$ overlap, but the rows of
$\mat{X}$ are also shifted versions of each other.
That does not hold for ML applications in general (e.g., the classical
example of predicting spam vs. no spam emails, where the features model
properties of individual emails), and most of the common error measures
presented in introductory texts on ML are only applicable in cases
without such a structure in $\mat{X}$ and $\vec{y}$.
$n$, the number of past time steps required to predict a $y_t$, is an
exogenous model parameter.
For prediction, the forecaster supplies the trained ML model with an input
vector in the same format as a row $\vec{x}_i$ of $\mat{X}$.
For example, to predict $y_{T+1}$, the model takes the vector
$(y_T, y_{T-1}, ..., y_{T-n+1})$ as input.
That is in contrast to the classical methods, where we only supply the number
of time steps to be predicted as a scalar integer.
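A sketch of this re-arrangement (Python with numpy; the names are ours, and
the rows are generated in ascending order of the label's time index, unlike
the display above):

    import numpy as np

    def make_supervised(y, n):
        """Arrange y_1, ..., y_T into a label vector and a (T-n) x n lag matrix."""
        T = len(y)
        X = np.array([y[t - n:t][::-1] for t in range(n, T)])  # rows of n lags
        labels = y[n:]                                          # y_{n+1}, ..., y_T
        return X, labels

    X, labels = make_supervised(y, n=12)
    x_new = y[-12:][::-1]   # input to predict y_{T+1}: (y_T, ..., y_{T-n+1})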

tex/2_lit/3_ml/3_cv.tex (new file)
@@ -0,0 +1,38 @@
\subsubsection{Cross-Validation.}
\label{cv}
Because ML models are trained by minimizing a loss function $L$, the
resulting value of $L$ by design underestimates the true error observed
when predicting into the actual future.
To counter that, one popular and model-agnostic approach is cross-validation
(\gls{cv}), as summarized, for example, by \cite{hastie2013}.
CV is a resampling technique that randomly splits the samples into a
training and a test set.
Trained on the former, an ML model makes forecasts on the latter.
Then, the value of $L$ calculated only on the test set gives a realistic and
unbiased estimate of the true forecasting error and may be used for one
of two distinct purposes:
First, it assesses the quality of a fit and provides an idea as to how the
model would perform in production when predicting into the actual future.
Second, the errors of models of either different methods or the same method
with different parameters may be compared with each other to select the
best model.
In order to first select the best model and then assess its quality, one must
apply two chained CVs:
The samples are divided into training, validation, and test sets, and all
models are trained on the training set and compared on the validation set.
Then, the winner is retrained on the union of the training and validation
sets and assessed on the test set.
Regarding the splitting, there are various approaches, and we choose the
so-called $k$-fold CV, where the samples are randomly divided into $k$
folds of the same size.
Each fold is used as a test set once and the remaining $k-1$ folds become
the corresponding training set.
The resulting $k$ error measures are averaged.
A $k$-fold CV with $k=5$ or $k=10$ is a compromise between the two extreme
cases of having only one split and the so-called leave-one-out CV
where $k = m$: Computation is still relatively fast and each sample is
part of several training sets maximizing the learning from the data.
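As a sketch, a 5-fold CV for an arbitrary scikit-learn regressor model (a
hypothetical, already instantiated estimator) with the MSE as $L$, using X
and labels from the sketch in the previous sub-section:

    from sklearn.model_selection import KFold
    from sklearn.metrics import mean_squared_error

    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    errors = []
    for train_idx, test_idx in kf.split(X):
        model.fit(X[train_idx], labels[train_idx])    # train on k-1 folds
        y_pred = model.predict(X[test_idx])           # predict the held-out fold
        errors.append(mean_squared_error(labels[test_idx], y_pred))
    cv_error = sum(errors) / len(errors)              # averaged error measure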
We adapt the $k$-fold CV to the ordinal structure in $\mat{X}$ and $\vec{y}$ in
Sub-section \ref{unified_cv}.

tex/2_lit/3_ml/4_rf.tex (new file)
@@ -0,0 +1,66 @@
\subsubsection{Random Forest Regression.}
\label{rf}
\cite{breiman1984} introduce the classification and regression tree
(\gls{cart}) model that is built around the idea that a single binary
decision tree maps learned combinations of intervals of the feature
columns to a label.
Thus, each sample in the training set is associated with one leaf node, which
is reached by following the tree from its root and branching at each
intermediate node according to a learned splitting rule that compares the
sample's realization of the feature specified by the rule against a
learned threshold.
While such models are computationally fast and offer a high degree of
interpretability, they tend to overfit strongly to the training set as
the splitting rules are not limited to any functional form (e.g., linear)
in the relationship between the features and the labels.
In the regression case, it is common to maximize the variance reduction $I_V$
from a parent node $N$ to its two children, $C1$ and $C2$, as the
splitting rule.
\cite{breiman1984} formulate this as follows:
$$
I_V(N)
=
\frac{1}{|S_N|^2} \sum_{i \in S_N} \sum_{j \in S_N}
\frac{1}{2} (y_i - y_j)^2
- \left(
\frac{1}{|S_{C1}|^2} \sum_{i \in S_{C1}} \sum_{j \in S_{C1}}
\frac{1}{2} (y_i - y_j)^2
+
\frac{1}{|S_{C2}|^2} \sum_{i \in S_{C2}} \sum_{j \in S_{C2}}
\frac{1}{2} (y_i - y_j)^2
\right)
$$
$S_N$, $S_{C1}$, and $S_{C2}$ are the index sets of the samples in $N$, $C1$,
and $C2$.
\cite{ho1998} and then \cite{breiman2001} generalize this method by combining
many CART models into one forest of trees where every single tree is
a randomized variant of the others.
Randomization is achieved at two steps in the training process:
First, each tree receives a distinct training set resampled with replacement
from the original training set, an idea also called bootstrap
aggregation.
Second, at each node, only a random subset of the features is considered when
growing the tree.
Trees can be fitted in parallel, which speeds up the training significantly.
For prediction at the tree level, the average label of all training samples at
the reached leaf node is used.
Then, the tree-level values are combined into one value by averaging again
across the trees.
Due to the randomization, the trees are decorrelated, which offsets the
overfitting.
Another measure to counter overfitting is pruning the tree, either by
specifying the maximum depth of a tree or the minimum number of samples
at leaf nodes.
The forecaster must tune the structure of the forest.
Parameters include the number of trees in the forest, the size of the random
subset of features, and the pruning criteria.
The parameters are optimized via grid search: We train many models with
parameters chosen from a pre-defined list of values and select the best
one by CV.
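A sketch of such a grid search with scikit-learn (the listed parameter values
are arbitrary examples, and X and labels are taken from the earlier sketch):

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    grid = {
        "n_estimators": [100, 500],        # number of trees in the forest
        "max_features": [0.3, 0.7, 1.0],   # size of the random feature subset
        "min_samples_leaf": [1, 5, 10],    # pruning criterion
    }
    search = GridSearchCV(RandomForestRegressor(random_state=0), grid,
                          scoring="neg_mean_squared_error", cv=5)
    search.fit(X, labels)
    best_rf = search.best_estimator_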
RFs are a convenient ML method for any dataset as decision trees do not
make any assumptions about the relationship between features and labels.
\cite{herrera2010} use RFs to predict the hourly demand for water in an urban
context, an application similar to the one in this paper, and find that
RFs work well with time series data.

tex/2_lit/3_ml/5_svm.tex (new file)
@@ -0,0 +1,60 @@
\subsubsection{Support Vector Regression.}
\label{svm}
\cite{vapnik1963} and \cite{vapnik1964} introduce the so-called support vector
machine (\gls{svm}) model, and \cite{vapnik2013} summarizes the research
conducted since then.
In its basic version, an SVM is a linear classifier modeling a binary
decision: it fits a hyperplane into the feature space of $\mat{X}$ so as to
maximize the margin around the hyperplane separating the two groups of
labels.
SVMs were popularized in the 1990s in the context of optical character
recognition, as shown in \cite{scholkopf1998}.
\cite{drucker1997} and \cite{stitson1999} adapt SVMs to the regression case,
and \cite{smola2004} provide a comprehensive introduction to this approach.
\cite{mueller1997} and \cite{mueller1999} focus on SVRs in the context of time
series data and find that they tend to outperform classical methods.
\cite{chen2006a} and \cite{chen2006b} apply SVRs to predict the hourly demand
for water in cities, an application similar to the UDP case.
In the SVR case, a linear function
$\hat{y}_i = f(\vec{x}_i) = \langle\vec{w},\vec{x}_i\rangle + b$
is fitted so that the actual labels $y_i$ have a deviation of at most
$\epsilon$ from their predictions $\hat{y}_i$ (cf., the constraints
below).
SVRs are commonly formulated as quadratic optimization problems as follows:
$$
\text{minimize }
\frac{1}{2} \norm{\vec{w}}^2 + C \sum_{i=1}^m (\xi_i + \xi_i^*)
\quad \text{subject to }
\begin{cases}
y_i - \langle \vec{w}, \vec{x}_i \rangle - b \leq \epsilon + \xi_i
\text{,} \\
\langle \vec{w}, \vec{x}_i \rangle + b - y_i \leq \epsilon + \xi_i^*
\end{cases}
$$
$\vec{w}$ is the vector of fitted weights in the row space of $\mat{X}$, $b$ is
a bias term, and $\langle\cdot,\cdot\rangle$
denotes the dot product.
By minimizing the norm of $\vec{w}$, the fitted function is kept flat and thus
not prone to strong overfitting.
To allow individual samples outside the otherwise hard $\epsilon$ bounds,
non-negative slack variables $\xi_i$ and $\xi_i^*$ are included.
A non-negative parameter $C$ regulates how many samples may violate the
$\epsilon$ bounds and by how much.
To model non-linear relationships, one could use a mapping $\Phi(\cdot)$ for
the $\vec{x}_i$ from the row space of $\mat{X}$ to some higher
dimensional space; however, as the optimization problem only depends on
the dot product $\langle\cdot,\cdot\rangle$ and not the actual entries of
$\vec{x}_i$, it suffices to use a kernel function $k$ such that
$k(\vec{x}_i,\vec{x}_j) = \langle\Phi(\vec{x}_i),\Phi(\vec{x}_j)\rangle$.
Such kernels must fulfill certain mathematical properties, and, besides
polynomial kernels, radial basis functions with
$k(\vec{x}_i,\vec{x}_j) = \exp(-\gamma \norm{\vec{x}_i - \vec{x}_j}^2)$ are
a popular candidate, where $\gamma$ is a parameter controlling how the
distance between any two samples influences the final model.
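A sketch with scikit-learn's SVR (the parameter values are arbitrary and
would be tuned via the grid search described above; X, labels, and x_new
come from the earlier sketches):

    from sklearn.svm import SVR

    # epsilon-insensitive SVR with an RBF kernel: C regulates the slack
    # beyond the epsilon bounds, gamma the width of the kernel.
    svr = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma="scale")
    svr.fit(X, labels)
    y_hat = svr.predict(x_new.reshape(1, -1))   # predicts y_{T+1}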
SVRs work well with sparse data in high-dimensional spaces, such as
intermittent demand data, as they minimize the risk of misclassification
or of predicting a value that is far off by maximizing the error
margin, as also noted by \cite{bao2004}.

@@ -1,2 +1,8 @@
\section{Model Formulation}
\label{mod}
% temporary placeholders
\label{decomp}
\label{f:stl}
\label{mase}
\label{unified_cv}

@@ -1,7 +1,25 @@
% Abbreviations for technical terms.
\newglossaryentry{cart}{
name=CART, description={Classification and Regression Trees}
}
\newglossaryentry{cv}{
name=CV, description={Cross Validation}
}
\newglossaryentry{ml}{
name=ML, description={Machine Learning}
}
\newglossaryentry{rf}{
name=RF, description={Random Forest}
}
\newglossaryentry{stl}{
name=STL, description={Seasonal and Trend Decomposition using Loess}
}
\newglossaryentry{svm}{
name=SVM, description={Support Vector Machine}
}
\newglossaryentry{svr}{
name=SVR, description={Support Vector Regression}
}
\newglossaryentry{udp}{
name=UDP, description={Urban Delivery Platform}
}

@@ -6,4 +6,9 @@
% Make opening quotes look different than closing quotes.
\usepackage[english=american]{csquotes}
\MakeOuterQuote{"}
% Define helper commands.
\usepackage{bm}
\newcommand{\mat}[1]{\bm{#1}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}

@@ -7,6 +7,25 @@ volume={129},
pages={263--286}
}
@article{assimakopoulos2000,
title={The theta model: a decomposition approach to forecasting},
author={Assimakopoulos, Vassilis and Nikolopoulos, Konstantinos},
year={2000},
journal={International Journal of Forecasting},
volume={16},
number={4},
pages={521--530}
}
@inproceedings{bao2004,
title={Forecasting intermittent demand by SVMs regression},
author={Bao, Yukun and Wang, Wen and Zhang, Jinlong},
year={2004},
booktitle={2004 IEEE International Conference on Systems, Man and Cybernetics},
volume={1},
pages={461--466}
}
@misc{bell2018,
title = {Forecasting at Uber: An Introduction},
author={Bell, Franziska and Smyl, Slawek},
@@ -15,6 +34,119 @@ howpublished = {\url{https://eng.uber.com/forecasting-introduction/}},
note = {Accessed: 2020-10-01}
}
@article{box1962,
title={Some statistical Aspects of adaptive Optimization and Control},
author={Box, George and Jenkins, Gwilym},
year={1962},
journal={Journal of the Royal Statistical Society. Series B (Methodological)},
volume={24},
number={2},
pages={297--343}
}
@article{box1964,
title={An Analysis of Transformations},
author={Box, George and Cox, David},
year={1964},
journal={Journal of the Royal Statistical Society. Series B (Methodological)},
volume={26},
number={2},
pages={211--252}
}
@article{box1968,
title={Some recent Advances in Forecasting and Control},
author={Box, George and Jenkins, Gwilym},
year={1968},
journal={Journal of the Royal Statistical Society.
Series C (Applied Statistics)},
volume={17},
number={2},
pages={91--109}
}
@book{box2015,
title={Time Series Analysis: Forecasting and Control},
author={Box, George and Jenkins, Gwilym and Reinsel, Gregory and Ljung, Greta},
series={Wiley Series in Probability and Statistics},
year={2015},
publisher={Wiley}
}
@book{breiman1984,
title={Classification and Regression Trees},
author={Breiman, Leo and Friedman, Jerome and Olshen, R.A.
and Stone, Charles},
year={1984},
publisher={Wadsworth}
}
@article{breiman2001,
title={Random Forests},
author={Breiman, Leo},
year={2001},
journal={Machine Learning},
volume={45},
number={1},
pages={5--32}
}
@book{brockwell2016,
title={Introduction to Time Series and Forecasting},
author={Brockwell, Peter and Davis, Richard},
series={Springer Texts in Statistics},
year={2016},
publisher={Springer}
}
@book{brown1959,
title={Statistical Forecasting for Inventory Control},
author={Brown, Robert},
year={1959},
publisher={McGraw-Hill}
}
@article{chen2006a,
title={Hourly Water Demand Forecast Model based on Bayesian Least Squares
Support Vector Machine},
author={Chen, Lei and Zhang, Tu-qiao},
year={2006},
journal={Journal of Tianjin University},
volume={39},
number={9},
pages={1037--1042}
}
@article{chen2006b,
title={Hourly Water Demand Forecast Model based on Least Squares Support
Vector Machine},
author={Chen, Lei and Zhang, Tu-qiao},
year={2006},
journal={Journal of Harbin Institute of Technology},
volume={38},
number={9},
pages={1528--1530}
}
@article{cleveland1990,
title={STL: A Seasonal-Trend Decomposition Procedure Based on Loess},
author={Cleveland, Robert and Cleveland, William and McRae, Jean
and Terpenning, Irma},
year={1990},
journal={Journal of Official Statistics},
volume={6},
number={1},
pages={3--73}
}
@book{dagum2016,
title={Seasonal Adjustment Methods and Real Time Trend-Cycle Estimation},
author={Dagum, Estela and Bianconcini, Silvia},
series={Statistics for Social and Behavioral Sciences},
year={2016},
publisher={Springer}
}
@article{de2006,
title={25 Years of Time Series Forecasting},
author={De Gooijer, Jan and Hyndman, Rob},
@@ -25,6 +157,16 @@ number={3},
pages={443--473}
}
@inproceedings{drucker1997,
title={Support Vector Regression Machines},
author={Drucker, Harris and Burges, Christopher and Kaufman, Linda
and Smola, Alex and Vapnik, Vladimir},
year={1997},
booktitle={Advances in Neural Information Processing Systems},
pages={155--161},
organization={Springer}
}
@article{ehmke2018,
title={Optimizing for total costs in vehicle routing in urban areas},
author={Ehmke, Jan Fabian and Campbell, Ann M and Thomas, Barrett W},
@@ -34,6 +176,45 @@ volume={116},
pages={242--265}
}
@article{gardner1985,
title={Forecasting Trends in Time Series},
author={Gardner, Everette and McKenzie, Ed},
year={1985},
journal={Management Science},
volume={31},
number={10},
pages={1237--1246}
}
@article{hansen2006,
title={Some Evidence on Forecasting Time-Series with Support Vector Machines},
author={Hansen, James and McDonald, James and Nelson, Ray},
year={2006},
journal={Journal of the Operational Research Society},
volume={57},
number={9},
pages={1053--1063}
}
@book{hastie2013,
title={The Elements of Statistical Learning: Data Mining, Inference,
and Prediction},
author={Hastie, Trevor and Tibshirani, Robert and Friedman, Jerome},
year={2013},
publisher={Springer}
}
@article{herrera2010,
title={Predictive Models for Forecasting Hourly Urban Water Demand},
author={Herrera, Manuel and Torgo, Lu{\'\i}s and Izquierdo, Joaqu{\'\i}n
and P{\'e}rez-Garc{\'\i}a, Rafael},
year={2010},
journal={Journal of Hydrology},
volume={387},
number={1-2},
pages={141--150}
}
@misc{hirschberg2016,
title = {McKinsey: The changing market for food delivery},
author={Hirschberg, Carsten and Rajko, Alexander and Schumacher, Thomas
@@ -44,6 +225,25 @@ howpublished = "\url{https://www.mckinsey.com/industries/high-tech/
note = {Accessed: 2020-10-01}
}
@article{ho1998,
title={The Random Subspace Method for Constructing Decision Forests},
author={Ho, Tin Kam},
year={1998},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
volume={20},
number={8},
pages={832--844}
}
@article{holt1957,
title={Forecasting Seasonals and Trends by Exponentially Weighted Moving
Averages},
author={Holt, Charles},
year={1957},
journal={ONR Memorandum},
volume={52}
}
@article{hou2018,
title={Ride-matching and routing optimisation: Models and a large
neighbourhood search heuristic},
@@ -54,6 +254,62 @@ volume={118},
pages={143--162}
}
@article{hyndman2002,
title={A State Space Framework for Automatic Forecasting using Exponential
Smoothing Methods},
author={Hyndman, Rob and Koehler, Anne and Snyder, Ralph and Grose, Simone},
year={2002},
journal={International Journal of Forecasting},
volume={18},
number={3},
pages={439--454}
}
@article{hyndman2003,
title={Unmasking the Theta method},
author={Hyndman, Rob and Billah, Baki},
year={2003},
journal={International Journal of Forecasting},
volume={19},
number={2},
pages={287--290}
}
@article{hyndman2008a,
title={Automatic Time Series Forecasting: The forecast package for R},
author={Hyndman, Rob and Khandakar, Yeasmin},
year={2008},
journal={Journal of Statistical Software},
volume={26},
number={3}
}
@book{hyndman2008b,
title={Forecasting with Exponential Smoothing: the State Space Approach},
author={Hyndman, Rob and Koehler, Anne and Ord, Keith and Snyder, Ralph},
year={2008},
publisher={Springer}
}
@book{hyndman2018,
title={Forecasting: Principles and Practice},
author={Hyndman, Rob and Athanasopoulos, George},
year={2018},
publisher={OTexts}
}
@article{kwiatkowski1992,
title={Testing the null hypothesis of stationarity against the alternative of a
unit root: How sure are we that economic time series have a unit root?},
author={Kwiatkowski, Denis and Phillips, Peter and Schmidt, Peter
and Shin, Yongcheol},
year={1992},
journal={Journal of Econometrics},
volume={54},
number={1-3},
pages={159--178}
}
@misc{laptev2017,
title = {Engineering Extreme Event Forecasting
at Uber with Recurrent Neural Networks},
@@ -74,6 +330,108 @@ volume={118},
pages={392--420}
}
@inproceedings{mueller1997,
title={Predicting Time Series with Support Vector Machines},
author={M{\"u}ller, Klaus-Robert and Smola, Alexander and R{\"a}tsch, Gunnar
and Sch{\"o}lkopf, Bernhard and Kohlmorgen, Jens and Vapnik, Vladimir},
year={1997},
booktitle={International Conference on Artificial Neural Networks},
pages={999--1004},
organization={Springer}
}
@article{mueller1999,
title={Using Support Vector Machines for Time Series Prediction},
author={M{\"u}ller, Klaus-Robert and Smola, Alexander and R{\"a}tsch, Gunnar
and Sch{\"o}lkopf, Bernhard and Kohlmorgen, Jens and Vapnik, Vladimir},
year={1999},
journal={Advances in Kernel Methods — Support Vector Learning},
pages={243--254},
publisher={MIT, Cambridge, MA, USA}
}
@book{ord2017,
title={Principles of Business Forecasting},
author={Ord, Keith and Fildes, Robert and Kourentzes, Nikos},
year={2017},
publisher={WESSEX Press}
}
@article{pegels1969,
title={Exponential Forecasting: Some new variations},
author={Pegels, C.},
year={1969},
journal={Management Science},
volume={15},
number={5},
pages={311--315}
}
@incollection{scholkopf1998,
title={Fast Approximation of Support Vector Kernel Expansions, and an
Interpretation of Clustering as Approximation in Feature Spaces},
author={Sch{\"o}lkopf, Bernhard and Knirsch, Phil and Smola, Alex
and Burges, Chris},
year={1998},
booktitle={Mustererkennung 1998},
publisher={Springer},
pages={125--132}
}
@article{smola2004,
title={A Tutorial on Support Vector Regression},
author={Smola, Alex and Sch{\"o}lkopf, Bernhard},
year={2004},
journal={Statistics and Computing},
volume={14},
number={3},
pages={199--222}
}
@article{stitson1999,
title={Support Vector Regression with ANOVA Decomposition Kernels},
author={Stitson, Mark and Gammerman, Alex and Vapnik, Vladimir
and Vovk, Volodya and Watkins, Chris and Weston, Jason},
year={1999},
journal={Advances in Kernel Methods — Support Vector Learning},
pages={285--292},
publisher={MIT, Cambridge, MA, USA}
}
@article{taylor2003,
title={Exponential Smoothing with a Damped Multiplicative Trend},
author={Taylor, James},
year={2003},
journal={International Journal of Forecasting},
volume={19},
number={4},
pages={715--725}
}
@article{vapnik1963,
title={Pattern Recognition using Generalized Portrait Method},
author={Vapnik, Vladimir and Lerner, A},
year={1963},
journal={Automation and Remote Control},
volume={24},
pages={774--780},
}
@article{vapnik1964,
title={A Note on one Class of Perceptrons},
author={Vapnik, Vladimir and Chervonenkis, A},
year={1964},
journal={Automation and Remote Control},
volume={25}
}
@book{vapnik2013,
title={The Nature of Statistical Learning Theory},
author={Vapnik, Vladimir},
year={2013},
publisher={Springer}
}
@article{wang2018,
title={Delivering meals for multiple suppliers: Exclusive or sharing
logistics service},
@@ -82,4 +440,14 @@ year={2018},
journal={Transportation Research Part E: Logistics and Transportation Review},
volume={118},
pages={496--512}
}
@article{winters1960,
title={Forecasting Sales by Exponentially Weighted Moving Averages},
author={Winters, Peter},
year={1960},
journal={Management Science},
volume={6},
number={3},
pages={324--342}
}