
Merge branch 'literature-section' into develop

Alexander Hess 2020-10-04 23:01:40 +02:00
commit 7c203cb87c
Signed by: alexander
GPG key ID: 344EA5AB10D868E0
16 changed files with 878 additions and 3 deletions

BIN paper.pdf (binary file not shown)

@@ -9,6 +9,15 @@
\input{tex/1_intro}
\input{tex/2_lit/1_intro}
\input{tex/2_lit/2_class/1_intro}
\input{tex/2_lit/2_class/2_ets}
\input{tex/2_lit/2_class/3_arima}
\input{tex/2_lit/2_class/4_stl}
\input{tex/2_lit/3_ml/1_intro}
\input{tex/2_lit/3_ml/2_learning}
\input{tex/2_lit/3_ml/3_cv}
\input{tex/2_lit/3_ml/4_rf}
\input{tex/2_lit/3_ml/5_svm}
\input{tex/3_mod/1_intro}
\input{tex/4_stu/1_intro}
\input{tex/5_con/1_intro}

@@ -1,2 +1,17 @@
\section{Literature Review}
\label{lit}
In this section, we review the specific forecasting methods that make up our
forecasting system.
We group them into classical statistical models and ML models.
The two groups differ mainly in how they represent the input data and how
accuracy is evaluated.
A time series is a finite and ordered sequence of equally spaced observations.
Thus, time is regarded as discrete and a time step as a short period.
Formally, a time series $Y$ is defined as $Y = \{y_t: t \in I\}$, or $y_t$ for
short, where $I$ is an index set of positive integers.
Besides its length $T = |Y|$, another property is the a priori fixed and
non-negative periodicity $k$ of a seasonal pattern in demand:
$k$ is the number of time steps after which a pattern repeats itself (e.g.,
$k=12$ for monthly sales data).
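To illustrate the notation, a minimal sketch in Python (the language and the
numbers are purely illustrative and not part of the formal definition):

    import numpy as np

    # A time series Y as an ordered array of equally spaced observations;
    # with monthly sales data, the seasonal pattern repeats after k = 12 steps.
    y = np.array([112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118,
                  115, 126, 141, 135, 125, 149, 170, 170, 158, 133, 114, 140,
                  145, 150, 178, 163, 172, 178, 199, 199, 184, 162, 146, 166])
    T = len(y)   # length T = |Y|
    k = 12       # a priori fixed periodicity of the seasonal pattern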

@@ -0,0 +1,13 @@
\subsection{Demand Forecasting with Classical Forecasting Methods}
\label{class_methods}
Forecasting became a formal discipline starting in the 1950s and has its
origins in the broader field of statistics.
\cite{hyndman2018} provide a thorough overview of the established concepts and
methods, and \cite{ord2017} cover business-related applications
such as demand forecasting.
These "classical" forecasting methods share the characteristic that they are
first trained on the entire $Y$.
Then, for prediction, the forecaster specifies the number of time steps for
which forecasts are to be generated.
That is different for ML models.

@@ -0,0 +1,78 @@
\subsubsection{Na\"{i}ve Methods, Moving Averages, and Exponential Smoothing.}
\label{ets}
Simple forecasting methods are often employed as a benchmark for more
sophisticated ones.
The so-called na\"{i}ve and seasonal na\"{i}ve methods forecast the next time
step in a time series, $y_{T+1}$, with the last observation, $y_T$,
and, if a seasonal pattern is present, with the observation $k$ steps
before, $y_{T+1-k}$.
As variants, both methods can be generalized to include drift terms in the
presence of a trend or changing seasonal amplitude.
If a time series exhibits no trend, a simple moving average (SMA) is a
generalization of the na\"{i}ve method that is more robust to outliers.
It is defined as follows: $\hat{y}_{T+1} = \frac{1}{h} \sum_{i=T-h+1}^{T} y_i$
where $h$ is the horizon over which the average is calculated.
If a time series exhibits a seasonal pattern, setting $h$ to a multiple of the
periodicity $k$ ensures that the seasonal pattern does not bias the forecast.
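As a minimal sketch of these benchmark methods (Python with numpy; the
function names and the language are our illustrative choices, not prescribed
by the methods themselves):

    import numpy as np

    def naive(y):
        """Forecast y_{T+1} with the last observation y_T."""
        return y[-1]

    def seasonal_naive(y, k):
        """Forecast y_{T+1} with the observation k steps earlier, y_{T+1-k}."""
        return y[-k]

    def sma(y, h):
        """Simple moving average of the last h observations."""
        return np.mean(y[-h:])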
Starting in the 1950s, another popular family of forecasting methods,
so-called exponential smoothing methods, was introduced by
\cite{brown1959}, \cite{holt1957}, and \cite{winters1960}.
The idea is that forecasts $\hat{y}_{T+1}$ are a weighted average of past
observations where the weights decay over time; in the case of the simple
exponential smoothing (SES) method we obtain:
$
\hat{y}_{T+1} = \alpha y_T + \alpha (1 - \alpha) y_{T-1}
+ \alpha (1 - \alpha)^2 y_{T-2}
+ \dots + \alpha (1 - \alpha)^{T-1} y_{1}
$
where $\alpha$ (with $0 \le \alpha \le 1$) is a smoothing parameter.
Exponential smoothing methods are often expressed in an alternative component
form that consists of a forecast equation and one or more smoothing
equations for unobservable components.
Below, we present a generalization of SES, the so-called Holt-Winters'
seasonal method, in an additive formulation.
$\ell_t$, $b_t$, and $s_t$ represent the unobservable level, trend, and
seasonal components inherent in $y_t$, and $\beta$ and $\gamma$ complement
$\alpha$ as smoothing parameters:
\begin{align*}
\hat{y}_{t+1} & = \ell_t + b_t + s_{t+1-k} \\
\ell_t & = \alpha(y_t - s_{t-k}) + (1 - \alpha)(\ell_{t-1} + b_{t-1}) \\
b_t & = \beta (\ell_{t} - \ell_{t-1}) + (1 - \beta) b_{t-1} \\
s_t & = \gamma (y_t - \ell_{t-1} - b_{t-1}) + (1-\gamma)s_{t-k}
\end{align*}
With $b_t$, $s_t$, $\beta$, and $\gamma$ removed, this formulation reduces to
SES.
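As a sketch of fitting the additive Holt-Winters' method, assuming Python
with statsmodels and the illustrative series y and periodicity k from above
(the library choice is ours; it estimates the smoothing parameters from the
data by default):

    from statsmodels.tsa.holtwinters import ExponentialSmoothing

    # Additive Holt-Winters: level, trend, and seasonal components with the
    # smoothing parameters alpha, beta, and gamma estimated from the data.
    hw = ExponentialSmoothing(y, trend="add", seasonal="add",
                              seasonal_periods=k).fit()
    y_hat = hw.forecast(1)  # one-step-ahead forecast for y_{T+1}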
Distinct variations exist: Besides the three components, \cite{gardner1985}
add a damped trend, \cite{pegels1969} provides multiplicative
formulations, and \cite{taylor2003} adds damping to the latter.
The accuracy measure commonly employed is the sum of squared errors between
the observations and their forecasts.
The Theta method, originally introduced by \cite{assimakopoulos2000}, can be
regarded as equivalent to SES with a drift term, as \cite{hyndman2003}
show.
We mention this method here only because \cite{bell2018} emphasize that it
performs well at Uber.
However, in our empirical study, we find that this is not true in general.
\cite{hyndman2002} introduce statistical processes, so-called innovations
state-space models, to generalize the methods in this sub-section.
They call this family of models ETS as they capture error, trend, and seasonal
terms.
Linear and additive ETS models have the following structure:
\begin{align*}
y_t & = \vec{w} \cdot \vec{x}_{t-1} + \epsilon_t \\
\vec{x}_t & = \mat{F} \vec{x}_{t-1} + \vec{g} \epsilon_t
\end{align*}
$y_t$ denotes the observations as before, while $\vec{x}_t$ is a state vector
of unobserved components.
$\epsilon_t$ is a white noise series, and the matrix $\mat{F}$ and the vectors
$\vec{g}$ and $\vec{w}$ contain a model's coefficients.
Like the models in the next sub-section, ETS models are commonly fitted
with maximum likelihood and evaluated against historical data using
information-theoretic criteria.
We refer to \cite{hyndman2008b} for a thorough summary.
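As a sketch, again assuming statsmodels (its ETSModel implements an
innovations state-space formulation; this is our illustration, not the
paper's own implementation):

    import pandas as pd
    from statsmodels.tsa.exponential_smoothing.ets import ETSModel

    # Linear, additive ETS model fitted via maximum likelihood; an
    # information-theoretic criterion (here the AIC) evaluates the fit.
    ets = ETSModel(pd.Series(y, dtype=float), error="add", trend="add",
                   seasonal="add", seasonal_periods=k).fit()
    print(ets.aic)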

@@ -0,0 +1,69 @@
\subsubsection{Autoregressive Integrated Moving Averages.}
\label{arima}
\cite{box1962}, \cite{box1968}, and further papers by the same authors in the
1960s introduce a type of model in which observations correlate with their
neighbors and refer to them as autoregressive integrated moving average
(ARIMA) models for stationary time series.
For a thorough overview, we refer to \cite{box2015} and \cite{brockwell2016}.
A time series $y_t$ is stationary if its moments are independent of the
point in time where it is observed.
A typical example is a white noise $\epsilon_t$ series.
Therefore, a trend or seasonality implies non-stationarity.
\cite{kwiatkowski1992} provide a test to check the null hypothesis of
stationary data.
To obtain a stationary time series, one chooses from several techniques:
First, to stabilize a changing variance (i.e., heteroscedasticity), one
applies a Box-Cox transformation (e.g., $\log$) as first suggested by
\cite{box1964}.
Second, to factor out a trend (or seasonal) pattern, one computes differences
of consecutive (or of lag $k$) observations or even differences thereof.
Third, it is also common to pre-process $y_t$ with one of the decomposition
methods mentioned in Sub-section \ref{stl} below and then train an ARIMA
model on the adjusted $y_t$.
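A sketch of these pre-processing steps, assuming Python with numpy, scipy,
and statsmodels and a strictly positive series y (the Box-Cox transform
requires positive data):

    import numpy as np
    from scipy.stats import boxcox
    from statsmodels.tsa.stattools import kpss

    stat, p_value, _, _ = kpss(y)    # H0: the series is stationary
    y_bc, lam = boxcox(y)            # variance-stabilizing Box-Cox transform
    y_diff = np.diff(y_bc)           # first differences factor out a trend
    y_sdiff = y_bc[k:] - y_bc[:-k]   # lag-k differences factor out seasonality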
In the autoregressive part, observations are modeled as linear combinations of
their predecessors.
Formally, an $AR(p)$ model is defined with a drift term $c$, coefficients
$\phi_i$ to be estimated (where $i$ is an index with $0 < i \leq p$), and
white noise $\epsilon_t$ like so:
$
AR(p): \ \
y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p}
+ \epsilon_t
$.
The moving average part models observations as regressing towards a
linear combination of past forecasting errors.
Formally, a $MA(q)$ model is defined with a drift term $c$, coefficients
$\theta_j$ to be estimated, and white noise terms $\epsilon_t$ (where $j$
is an index with $0 < j \leq q$) as follows:
$
MA(q): \ \
y_t = c + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2}
+ \dots + \theta_q \epsilon_{t-q}
$.
Finally, an $ARIMA(p,d,q)$ model unifies both parts and adds differencing,
where $d$ is the degree of differencing and the $'$ indicates differenced
values:
$
ARIMA(p,d,q): \ \
y'_t = c + \phi_1 y'_{t-1} + \dots + \phi_p y'_{t-p} + \theta_1 \epsilon_{t-1}
+ \dots + \theta_q \epsilon_{t-q} + \epsilon_{t}
$.
$ARIMA(p,d,q)$ models are commonly fitted with maximum likelihood estimation.
To find an optimal combination of the parameters $p$, $d$, and $q$, the
literature suggests calculating an information-theoretic criterion
(e.g., Akaike's Information Criterion) that evaluates the fit on
historical data.
\cite{hyndman2008a} provide a step-wise heuristic to choose $p$, $d$, and $q$
that also decides if a Box-Cox transformation is to be applied, and if so,
which one.
To obtain a one-step-ahead forecast, the above equation is reordered such
that $t$ is substituted with $T+1$.
For forecasts further into the future, the actual observations are
subsequently replaced by their forecasts.
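As a sketch, assuming statsmodels (the step-wise heuristic of
\cite{hyndman2008a} is implemented in R's forecast package; here the order is
fixed by hand for illustration):

    from statsmodels.tsa.arima.model import ARIMA

    # ARIMA(p, d, q) fitted via maximum likelihood; the AIC guides the
    # comparison of candidate combinations of p, d, and q.
    arima = ARIMA(y, order=(1, 1, 1)).fit()
    print(arima.aic)
    y_hat = arima.forecast(steps=7)  # multi-step forecasts reuse earlier
                                     # forecasts in place of observations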
Seasonal ARIMA variants exist; however, the high frequency $k$ in the kind of
demand a UDP faces typically renders them impractical as too many
coefficients must be estimated.

@@ -0,0 +1,62 @@
\subsubsection{Seasonal and Trend Decomposition using Loess.}
\label{stl}
A time series $y_t$ may exhibit different types of patterns; to fully capture
each of them, the series must be decomposed.
Then, each component is forecast with a distinct model.
Most commonly, the components are the trend $t_t$, seasonality $s_t$, and
remainder $r_t$.
They are themselves time series, where only $s_t$ exhibits a periodicity $k$.
A decomposition may be additive (i.e., $y_t = s_t + t_t + r_t$) or
multiplicative (i.e., $y_t = s_t * t_t * r_t$); the former assumes that
the effect of the seasonal component is independent of the overall level
of $y_t$ and vice versa.
The seasonal component is centered around $0$ in the additive case and around
$1$ in the multiplicative case such that its removal does not affect the
level of $y_t$.
Often, it is sufficient to only seasonally adjust the time series, and model
the trend and remainder together, for example, as $a_t = y_t - s_t$ in the
additive case.
Early approaches employed moving averages (cf., Sub-section \ref{ets}) to
calculate a trend component, and, after removing that from $y_t$, averaged
all observations of the same seasonal lag to obtain the seasonal
component.
The downsides of this are the subjectivity in choosing the window lengths for
the moving average and the seasonal averaging, the inability of the
seasonal component to vary its amplitude over time, and the lack of
outlier handling.
The X11 method developed at the U.S. Census Bureau and described in detail by
\cite{dagum2016} overcomes these disadvantages.
However, due to its background in economics, it is designed primarily for
quarterly or monthly data, and the change in amplitude over time cannot be
controlled.
Variants of this method are the SEATS decomposition by the Bank of Spain and
the newer X13-SEATS-ARIMA method by the U.S. Census Bureau.
Their main advantages stem from the fact that the models calibrate themselves
according to statistical criteria without manual work for a statistician
and that the fitting process is robust to outliers.
\cite{cleveland1990} introduce a seasonal and trend decomposition using a
repeated locally weighted regression, the so-called Loess procedure, to
smooth the trend and seasonal components; it can be viewed as a
generalization of the methods above and is denoted by the acronym
\gls{stl}.
In contrast to the X11, X13, and SEATS methods, the STL supports seasonalities
of any periodicity $k$, which must, however, be determined with additional
statistical tests or set with out-of-band knowledge by the forecaster
(e.g., hourly demand data implies $k = 24 \cdot 7 = 168$, assuming customer
behavior differs on each day of the week).
Moreover, the seasonal component's rate of change, represented by the $ns$
parameter and explained in detail with Figure \ref{f:stl} in Section
\ref{decomp}, must be set by the forecaster as well, while the trend's
smoothness may be controlled by setting a non-default window size.
Outliers are handled by assignment to the remainder such that they do not
affect the trend and seasonal components.
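A sketch using the STL implementation in statsmodels (y_hourly is a
hypothetical hourly demand series; the trend window is left at its default):

    from statsmodels.tsa.seasonal import STL

    # k = 24 * 7 = 168 for hourly data; "seasonal" controls how fast the
    # seasonal component may change (cf. the ns parameter), and robust=True
    # assigns outliers to the remainder.
    res = STL(y_hourly, period=168, seasonal=7, robust=True).fit()
    adjusted = y_hourly - res.seasonal   # seasonally adjusted series a_t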
In particular, the manual input needed to calibrate the STL explains why only
the X11, X13, and SEATS methods are widely used by practitioners.
However, the widespread adoption of concepts like cross-validation (cf.,
Sub-section \ref{cv}) in recent years enables the usage of an automated
grid search to optimize the parameters.
The STL's usage within a grid search is facilitated even further by its being
computationally cheaper than the other methods discussed.

@@ -0,0 +1,15 @@
\subsection{Demand Forecasting with Machine Learning Methods}
\label{ml_methods}
ML methods have been employed in all kinds of prediction tasks in recent
years.
In this section, we restrict ourselves to the models that performed well in
our study: Random Forest (\gls{rf}) and Support Vector Regression
(\gls{svr}).
RFs are in general well-suited for datasets without a priori knowledge about
the patterns, while SVR is known to perform well on time series data, as
shown by \cite{hansen2006} in general and \cite{bao2004} specifically for
intermittent demand.
Gradient Boosting, another popular ML method, was consistently outperformed by
RFs, and artificial neural networks require far more data than our
industry partner has available.

@@ -0,0 +1,53 @@
\subsubsection{Supervised Learning.}
\label{learning}
A conceptual difference between classical and ML methods is the format
of the model inputs.
In ML models, a time series $Y$ is interpreted as labeled data.
Labels are collected into a vector $\vec{y}$ while the corresponding
predictors are aligned in a $(T - n) \times n$ matrix $\mat{X}$:
$$
\vec{y}
=
\begin{pmatrix}
y_T \\
y_{T-1} \\
\vdots \\
y_{n+1}
\end{pmatrix}
~~~~~~~~~~
\mat{X}
=
\begin{bmatrix}
y_{T-1} & y_{T-2} & \dots & y_{T-n} \\
y_{T-2} & y_{T-3} & \dots & y_{T-(n+1)} \\
\vdots & \vdots & \ddots & \vdots \\
y_n & y_{n-1} & \dots & y_1
\end{bmatrix}
$$
The $m = T - n$ rows are referred to as samples and the $n$ columns as
features.
Each row in $\mat{X}$ is "labeled" by the corresponding entry in $\vec{y}$,
and ML models are trained to fit the rows to their labels.
Conceptually, we model a functional relationship $f$ between $\mat{X}$ and
$\vec{y}$ such that the difference between the predicted
$\vec{\hat{y}} = f(\mat{X})$ and the true $\vec{y}$ is minimized
according to some error measure $L(\vec{\hat{y}}, \vec{y})$, where $L$
summarizes the goodness of the fit into a scalar value (e.g., the
well-known mean squared error [MSE]; cf., Section \ref{mase}).
$\mat{X}$ and $\vec{y}$ show the ordinal character of time series data:
Not only do the entries of $\mat{X}$ and $\vec{y}$ overlap, but the rows of
$\mat{X}$ are also shifted versions of each other.
That does not hold for ML applications in general (e.g., the classical
example of predicting spam vs. no spam emails, where the features model
properties of individual emails), and most of the common error measures
presented in introductory texts on ML are only applicable in cases
without such a structure in $\mat{X}$ and $\vec{y}$.
$n$, the number of past time steps required to predict a $y_t$, is an
exogenous model parameter.
For prediction, the forecaster supplies the trained ML model with an input
vector in the same format as a row $\vec{x}_i$ of $\mat{X}$.
For example, to predict $y_{T+1}$, the model takes the vector
$(y_T, y_{T-1}, ..., y_{T-n+1})$ as input.
That is in contrast to the classical methods, where we only supply the number
of time steps to be predicted as a scalar integer.
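A sketch of this re-arrangement (Python with numpy; the names are ours, and
the rows are generated in ascending order of the label's time index, unlike
the display above):

    import numpy as np

    def make_supervised(y, n):
        """Arrange y_1, ..., y_T into a label vector and a (T-n) x n lag matrix."""
        T = len(y)
        X = np.array([y[t - n:t][::-1] for t in range(n, T)])  # rows of n lags
        labels = y[n:]                                          # y_{n+1}, ..., y_T
        return X, labels

    X, labels = make_supervised(y, n=12)
    x_new = y[-12:][::-1]   # input to predict y_{T+1}: (y_T, ..., y_{T-n+1})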

tex/2_lit/3_ml/3_cv.tex (new file)
@@ -0,0 +1,38 @@
\subsubsection{Cross-Validation.}
\label{cv}
Because ML models are trained by minimizing a loss function $L$, the
resulting value of $L$ by design underestimates the true error observed
when predicting into the actual future.
To counter that, one popular and model-agnostic approach is cross-validation
(\gls{cv}), as summarized, for example, by \cite{hastie2013}.
CV is a resampling technique that randomly splits the samples into a
training and a test set.
Trained on the former, an ML model makes forecasts on the latter.
Then, the value of $L$ calculated only on the test set gives a realistic and
unbiased estimate of the true forecasting error and may be used for one
of two distinct purposes:
First, it assesses the quality of a fit and provides an idea as to how the
model would perform in production when predicting into the actual future.
Second, the errors of models of either different methods or the same method
with different parameters may be compared with each other to select the
best model.
In order to first select the best model and then assess its quality, one must
apply two chained CVs:
The samples are divided into training, validation, and test sets, and all
models are trained on the training set and compared on the validation set.
Then, the winner is retrained on the union of the training and validation
sets and assessed on the test set.
Regarding the splitting, there are various approaches, and we choose the
so-called $k$-fold CV, where the samples are randomly divided into $k$
folds of the same size.
Each fold is used as a test set once and the remaining $k-1$ folds become
the corresponding training set.
The resulting $k$ error measures are averaged.
A $k$-fold CV with $k=5$ or $k=10$ is a compromise between the two extreme
cases of having only one split and the so-called leave-one-out CV
where $k = m$: Computation is still relatively fast and each sample is
part of several training sets maximizing the learning from the data.
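As a sketch, a 5-fold CV for an arbitrary scikit-learn regressor model (a
hypothetical, already instantiated estimator) with the MSE as $L$, using X
and labels from the sketch in the previous sub-section:

    from sklearn.model_selection import KFold
    from sklearn.metrics import mean_squared_error

    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    errors = []
    for train_idx, test_idx in kf.split(X):
        model.fit(X[train_idx], labels[train_idx])    # train on k-1 folds
        y_pred = model.predict(X[test_idx])           # predict the held-out fold
        errors.append(mean_squared_error(labels[test_idx], y_pred))
    cv_error = sum(errors) / len(errors)              # averaged error measure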
We adapt the $k$-fold CV to the ordinal structure in $\mat{X}$ and $\vec{y}$ in
Sub-section \ref{unified_cv}.

tex/2_lit/3_ml/4_rf.tex (new file)
@@ -0,0 +1,66 @@
\subsubsection{Random Forest Regression.}
\label{rf}
\cite{breiman1984} introduce the classification and regression tree
(\gls{cart}) model that is built around the idea that a single binary
decision tree maps learned combinations of intervals of the feature
columns to a label.
Thus, each sample in the training set is associated with one leaf node, which
is reached by following the tree from its root and branching at each
intermediate node according to a learned splitting rule that compares the
sample's realization of the feature specified by the rule against a
learned threshold.
While such models are computationally fast and offer a high degree of
interpretability, they tend to overfit strongly to the training set as
the splitting rules are not limited to any functional form (e.g., linear)
in the relationship between the features and the labels.
In the regression case, it is common to maximize the variance reduction $I_V$
from a parent node $N$ to its two children, $C1$ and $C2$, as the
splitting rule.
\cite{breiman1984} formulate this as follows:
$$
I_V(N)
=
\frac{1}{|S_N|^2} \sum_{i \in S_N} \sum_{j \in S_N}
\frac{1}{2} (y_i - y_j)^2
- \left(
\frac{1}{|S_{C1}|^2} \sum_{i \in S_{C1}} \sum_{j \in S_{C1}}
\frac{1}{2} (y_i - y_j)^2
+
\frac{1}{|S_{C2}|^2} \sum_{i \in S_{C2}} \sum_{j \in S_{C2}}
\frac{1}{2} (y_i - y_j)^2
\right)
$$
$S_N$, $S_{C1}$, and $S_{C2}$ are the index sets of the samples in $N$, $C1$,
and $C2$.
\cite{ho1998} and then \cite{breiman2001} generalize this method by combining
many CART models into one forest of trees where every single tree is
a randomized variant of the others.
Randomization is achieved at two steps in the training process:
First, each tree receives a distinct training set resampled with replacement
from the original training set, an idea also called bootstrap
aggregation.
Second, at each node, only a random subset of the features is considered when
growing the tree.
Trees can be fitted in parallel, which speeds up the training significantly.
For prediction at the tree level, the average label of all training samples at
the reached leaf node is used.
Then, the tree-level values are combined into one value by averaging again
across the trees.
Due to the randomization, the trees are decorrelated, which offsets the
overfitting.
Another measure to counter overfitting is pruning the tree, either by
specifying the maximum depth of a tree or the minimum number of samples
at leaf nodes.
The forecaster must tune the structure of the forest.
Parameters include the number of trees in the forest, the size of the random
subset of features, and the pruning criteria.
The parameters are optimized via grid search: We train many models with
parameters chosen from a pre-defined list of values and select the best
one by CV.
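A sketch of such a grid search with scikit-learn (the listed parameter values
are arbitrary examples, and X and labels are taken from the earlier sketch):

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    grid = {
        "n_estimators": [100, 500],        # number of trees in the forest
        "max_features": [0.3, 0.7, 1.0],   # size of the random feature subset
        "min_samples_leaf": [1, 5, 10],    # pruning criterion
    }
    search = GridSearchCV(RandomForestRegressor(random_state=0), grid,
                          scoring="neg_mean_squared_error", cv=5)
    search.fit(X, labels)
    best_rf = search.best_estimator_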
RFs are a convenient ML method for any dataset as decision trees do not
make any assumptions about the relationship between features and labels.
\cite{herrera2010} use RFs to predict the hourly demand for water in an urban
context, an application similar to the one in this paper, and find that
RFs work well with time series data.

tex/2_lit/3_ml/5_svm.tex (new file)
@@ -0,0 +1,60 @@
\subsubsection{Support Vector Regression.}
\label{svm}
\cite{vapnik1963} and \cite{vapnik1964} introduce the so-called support vector
machine (\gls{svm}) model, and \cite{vapnik2013} summarizes the research
conducted since then.
In its basic version, an SVM is a linear classifier modeling a binary
decision: it fits a hyperplane into the feature space of $\mat{X}$ so as to
maximize the margin around the hyperplane separating the two groups of
labels.
SVMs were popularized in the 1990s in the context of optical character
recognition, as shown in \cite{scholkopf1998}.
\cite{drucker1997} and \cite{stitson1999} adapt SVMs to the regression case,
and \cite{smola2004} provide a comprehensive introduction to this approach.
\cite{mueller1997} and \cite{mueller1999} focus on SVRs in the context of time
series data and find that they tend to outperform classical methods.
\cite{chen2006a} and \cite{chen2006b} apply SVRs to predict the hourly demand
for water in cities, an application similar to the UDP case.
In the SVR case, a linear function
$\hat{y}_i = f(\vec{x}_i) = \langle\vec{w},\vec{x}_i\rangle + b$
is fitted so that the actual labels $y_i$ have a deviation of at most
$\epsilon$ from their predictions $\hat{y}_i$ (cf., the constraints
below).
SVRs are commonly formulated as quadratic optimization problems as follows:
$$
\text{minimize }
\frac{1}{2} \norm{\vec{w}}^2 + C \sum_{i=1}^m (\xi_i + \xi_i^*)
\quad \text{subject to }
\begin{cases}
y_i - \langle \vec{w}, \vec{x}_i \rangle - b \leq \epsilon + \xi_i
\text{,} \\
\langle \vec{w}, \vec{x}_i \rangle + b - y_i \leq \epsilon + \xi_i^*
\end{cases}
$$
$\vec{w}$ is the vector of fitted weights in the row space of $\mat{X}$, $b$ is
a bias term, and $\langle\cdot,\cdot\rangle$
denotes the dot product.
By minimizing the norm of $\vec{w}$, the fitted function is kept flat and thus
not prone to strong overfitting.
To allow individual samples outside the otherwise hard $\epsilon$ bounds,
non-negative slack variables $\xi_i$ and $\xi_i^*$ are included.
A non-negative parameter $C$ regulates how many samples may violate the
$\epsilon$ bounds and by how much.
To model non-linear relationships, one could use a mapping $\Phi(\cdot)$ for
the $\vec{x}_i$ from the row space of $\mat{X}$ to some higher
dimensional space; however, as the optimization problem only depends on
the dot product $\langle\cdot,\cdot\rangle$ and not the actual entries of
$\vec{x}_i$, it suffices to use a kernel function $k$ such that
$k(\vec{x}_i,\vec{x}_j) = \langle\Phi(\vec{x}_i),\Phi(\vec{x}_j)\rangle$.
Such kernels must fulfill certain mathematical properties, and, besides
polynomial kernels, radial basis functions with
$k(\vec{x}_i,\vec{x}_j) = \exp(-\gamma \norm{\vec{x}_i - \vec{x}_j}^2)$ are
a popular candidate, where $\gamma$ is a parameter controlling how the
distance between any two samples influences the final model.
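A sketch with scikit-learn's SVR (the parameter values are arbitrary and
would be tuned via the grid search described above; X, labels, and x_new
come from the earlier sketches):

    from sklearn.svm import SVR

    # epsilon-insensitive SVR with an RBF kernel: C regulates the slack
    # beyond the epsilon bounds, gamma the width of the kernel.
    svr = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma="scale")
    svr.fit(X, labels)
    y_hat = svr.predict(x_new.reshape(1, -1))   # predicts y_{T+1}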
SVRs work well with sparse data in high-dimensional spaces, such as
intermittent demand data, as they minimize the risk of misclassification
or of predicting a value that is far off by maximizing the error
margin, as also noted by \cite{bao2004}.

@@ -1,2 +1,8 @@
\section{Model Formulation}
\label{mod}
% temporary placeholders
\label{decomp}
\label{f:stl}
\label{mase}
\label{unified_cv}

@@ -1,7 +1,25 @@
% Abbreviations for technical terms.
\newglossaryentry{cart}{
name=CART, description={Classification and Regression Trees}
}
\newglossaryentry{cv}{
name=CV, description={Cross Validation}
}
\newglossaryentry{ml}{
name=ML, description={Machine Learning}
}
\newglossaryentry{rf}{
name=RF, description={Random Forest}
}
\newglossaryentry{stl}{
name=STL, description={Seasonal and Trend Decomposition using Loess}
}
\newglossaryentry{svm}{
name=SVM, description={Support Vector Machine}
}
\newglossaryentry{svr}{
name=SVR, description={Support Vector Regression}
}
\newglossaryentry{udp}{
name=UDP, description={Urban Delivery Platform}
}

@@ -6,4 +6,9 @@
% Make opening quotes look different than closing quotes.
\usepackage[english=american]{csquotes}
\MakeOuterQuote{"}
% Define helper commands.
\usepackage{bm}
\newcommand{\mat}[1]{\bm{#1}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}

@@ -7,6 +7,25 @@ volume={129},
pages={263--286}
}
@article{assimakopoulos2000,
title={The theta model: a decomposition approach to forecasting},
author={Assimakopoulos, Vassilis and Nikolopoulos, Konstantinos},
year={2000},
journal={International Journal of Forecasting},
volume={16},
number={4},
pages={521--530}
}
@inproceedings{bao2004,
title={Forecasting intermittent demand by SVMs regression},
author={Bao, Yukun and Wang, Wen and Zhang, Jinlong},
year={2004},
booktitle={2004 IEEE International Conference on Systems, Man and Cybernetics},
volume={1},
pages={461--466}
}
@misc{bell2018,
title = {Forecasting at Uber: An Introduction},
author={Bell, Franziska and Smyl, Slawek},
@@ -15,6 +34,119 @@ howpublished = {\url{https://eng.uber.com/forecasting-introduction/}},
note = {Accessed: 2020-10-01}
}
@article{box1962,
title={Some statistical Aspects of adaptive Optimization and Control},
author={Box, George and Jenkins, Gwilym},
year={1962},
journal={Journal of the Royal Statistical Society. Series B (Methodological)},
volume={24},
number={2},
pages={297--343}
}
@article{box1964,
title={An Analysis of Transformations},
author={Box, George and Cox, David},
year={1964},
journal={Journal of the Royal Statistical Society. Series B (Methodological)},
volume={26},
number={2},
pages={211--252}
}
@article{box1968,
title={Some recent Advances in Forecasting and Control},
author={Box, George and Jenkins, Gwilym},
year={1968},
journal={Journal of the Royal Statistical Society.
Series C (Applied Statistics)},
volume={17},
number={2},
pages={91--109}
}
@book{box2015,
title={Time Series Analysis: Forecasting and Control},
author={Box, George and Jenkins, Gwilym and Reinsel, Gregory and Ljung, Greta},
series={Wiley Series in Probability and Statistics},
year={2015},
publisher={Wiley}
}
@book{breiman1984,
title={Classification and Regression Trees},
author={Breiman, Leo and Friedman, Jerome and Olshen, R.A.
and Stone, Charles},
year={1984},
publisher={Wadsworth}
}
@article{breiman2001,
title={Random Forests},
author={Breiman, Leo},
year={2001},
journal={Machine Learning},
volume={45},
number={1},
pages={5--32}
}
@book{brockwell2016,
title={Introduction to Time Series and Forecasting},
author={Brockwell, Peter and Davis, Richard},
series={Springer Texts in Statistics},
year={2016},
publisher={Springer}
}
@book{brown1959,
title={Statistical Forecasting for Inventory Control},
author={Brown, Robert},
year={1959},
publisher={McGraw-Hill}
}
@article{chen2006a,
title={Hourly Water Demand Forecast Model based on Bayesian Least Squares
Support Vector Machine},
author={Chen, Lei and Zhang, Tu-qiao},
year={2006},
journal={Journal of Tianjin University},
volume={39},
number={9},
pages={1037--1042}
}
@article{chen2006b,
title={Hourly Water Demand Forecast Model based on Least Squares Support
Vector Machine},
author={Chen, Lei and Zhang, Tu-qiao},
year={2006},
journal={Journal of Harbin Institute of Technology},
volume={38},
number={9},
pages={1528--1530}
}
@article{cleveland1990,
title={STL: A Seasonal-Trend Decomposition Procedure Based on Loess},
author={Cleveland, Robert and Cleveland, William and McRae, Jean
and Terpenning, Irma},
year={1990},
journal={Journal of Official Statistics},
volume={6},
number={1},
pages={3--73}
}
@book{dagum2016,
title={Seasonal Adjustment Methods and Real Time Trend-Cycle Estimation},
author={Dagum, Estela and Bianconcini, Silvia},
series={Statistics for Social and Behavioral Sciences},
year={2016},
publisher={Springer}
}
@article{de2006,
title={25 Years of Time Series Forecasting},
author={De Gooijer, Jan and Hyndman, Rob},
@@ -25,6 +157,16 @@ number={3},
pages={443--473}
}
@inproceedings{drucker1997,
title={Support Vector Regression Machines},
author={Drucker, Harris and Burges, Christopher and Kaufman, Linda
and Smola, Alex and Vapnik, Vladimir},
year={1997},
booktitle={Advances in Neural Information Processing Systems},
pages={155--161},
organization={Springer}
}
@article{ehmke2018,
title={Optimizing for total costs in vehicle routing in urban areas},
author={Ehmke, Jan Fabian and Campbell, Ann M and Thomas, Barrett W},
@@ -34,6 +176,45 @@ volume={116},
pages={242--265}
}
@article{gardner1985,
title={Forecasting Trends in Time Series},
author={Gardner, Everette and McKenzie, Ed},
year={1985},
journal={Management Science},
volume={31},
number={10},
pages={1237--1246}
}
@article{hansen2006,
title={Some Evidence on Forecasting Time-Series with Support Vector Machines},
author={Hansen, James and McDonald, James and Nelson, Ray},
year={2006},
journal={Journal of the Operational Research Society},
volume={57},
number={9},
pages={1053--1063}
}
@book{hastie2013,
title={The Elements of Statistical Learning: Data Mining, Inference,
and Prediction},
author={Hastie, Trevor and Tibshirani, Robert and Friedman, Jerome},
year={2013},
publisher={Springer}
}
@article{herrera2010,
title={Predictive Models for Forecasting Hourly Urban Water Demand},
author={Herrera, Manuel and Torgo, Lu{\'\i}s and Izquierdo, Joaqu{\'\i}n
and P{\'e}rez-Garc{\'\i}a, Rafael},
year={2010},
journal={Journal of Hydrology},
volume={387},
number={1-2},
pages={141--150}
}
@misc{hirschberg2016,
title = {McKinsey: The changing market for food delivery},
author={Hirschberg, Carsten and Rajko, Alexander and Schumacher, Thomas
@@ -44,6 +225,25 @@ howpublished = "\url{https://www.mckinsey.com/industries/high-tech/
note = {Accessed: 2020-10-01}
}
@article{ho1998,
title={The Random Subspace Method for Constructing Decision Forests},
author={Ho, Tin Kam},
year={1998},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
volume={20},
number={8},
pages={832--844}
}
@article{holt1957,
title={Forecasting Seasonals and Trends by Exponentially Weighted Moving
Averages},
author={Holt, Charles},
year={1957},
journal={ONR Memorandum},
volume={52}
}
@article{hou2018,
title={Ride-matching and routing optimisation: Models and a large
neighbourhood search heuristic},
@@ -54,6 +254,62 @@ volume={118},
pages={143--162}
}
@article{hyndman2002,
title={A State Space Framework for Automatic Forecasting using Exponential
Smoothing Methods},
author={Hyndman, Rob and Koehler, Anne and Snyder, Ralph and Grose, Simone},
year={2002},
journal={International Journal of Forecasting},
volume={18},
number={3},
pages={439--454}
}
@article{hyndman2003,
title={Unmasking the Theta method},
author={Hyndman, Rob and Billah, Baki},
year={2003},
journal={International Journal of Forecasting},
volume={19},
number={2},
pages={287--290}
}
@article{hyndman2008a,
title={Automatic Time Series Forecasting: The forecast package for R},
author={Hyndman, Rob and Khandakar, Yeasmin},
year={2008},
journal={Journal of Statistical Software},
volume={26},
number={3}
}
@book{hyndman2008b,
title={Forecasting with Exponential Smoothing: the State Space Approach},
author={Hyndman, Rob and Koehler, Anne and Ord, Keith and Snyder, Ralph},
year={2008},
publisher={Springer}
}
@book{hyndman2018,
title={Forecasting: Principles and Practice},
author={Hyndman, Rob and Athanasopoulos, George},
year={2018},
publisher={OTexts}
}
@article{kwiatkowski1992,
title={Testing the null hypothesis of stationarity against the alternative of a
unit root: How sure are we that economic time series have a unit root?},
author={Kwiatkowski, Denis and Phillips, Peter and Schmidt, Peter
and Shin, Yongcheol},
year={1992},
journal={Journal of Econometrics},
volume={54},
number={1-3},
pages={159--178}
}
@misc{laptev2017,
title = {Engineering Extreme Event Forecasting
at Uber with Recurrent Neural Networks},
@@ -74,6 +330,108 @@ volume={118},
pages={392--420}
}
@inproceedings{mueller1997,
title={Predicting Time Series with Support Vector Machines},
author={M{\"u}ller, Klaus-Robert and Smola, Alexander and R{\"a}tsch, Gunnar
and Sch{\"o}lkopf, Bernhard and Kohlmorgen, Jens and Vapnik, Vladimir},
year={1997},
booktitle={International Conference on Artificial Neural Networks},
pages={999--1004},
organization={Springer}
}
@article{mueller1999,
title={Using Support Vector Machines for Time Series Prediction},
author={M{\"u}ller, Klaus-Robert and Smola, Alexander and R{\"a}tsch, Gunnar
and Sch{\"o}lkopf, Bernhard and Kohlmorgen, Jens and Vapnik, Vladimir},
year={1999},
journal={Advances in Kernel Methods — Support Vector Learning},
pages={243--254},
publisher={MIT, Cambridge, MA, USA}
}
@book{ord2017,
title={Principles of Business Forecasting},
author={Ord, Keith and Fildes, Robert and Kourentzes, Nikos},
year={2017},
publisher={WESSEX Press}
}
@article{pegels1969,
title={Exponential Forecasting: Some new variations},
author={Pegels, C.},
year={1969},
journal={Management Science},
volume={15},
number={5},
pages={311--315}
}
@incollection{scholkopf1998,
title={Fast Approximation of Support Vector Kernel Expansions, and an
Interpretation of Clustering as Approximation in Feature Spaces},
author={Sch{\"o}lkopf, Bernhard and Knirsch, Phil and Smola, Alex
and Burges, Chris},
year={1998},
booktitle={Mustererkennung 1998},
publisher={Springer},
pages={125--132}
}
@article{smola2004,
title={A Tutorial on Support Vector Regression},
author={Smola, Alex and Sch{\"o}lkopf, Bernhard},
year={2004},
journal={Statistics and Computing},
volume={14},
number={3},
pages={199--222}
}
@article{stitson1999,
title={Support Vector Regression with ANOVA Decomposition Kernels},
author={Stitson, Mark and Gammerman, Alex and Vapnik, Vladimir
and Vovk, Volodya and Watkins, Chris and Weston, Jason},
year={1999},
journal={Advances in Kernel Methods — Support Vector Learning},
pages={285--292},
publisher={MIT, Cambridge, MA, USA}
}
@article{taylor2003,
title={Exponential Smoothing with a Damped Multiplicative Trend},
author={Taylor, James},
year={2003},
journal={International Journal of Forecasting},
volume={19},
number={4},
pages={715--725}
}
@article{vapnik1963,
title={Pattern Recognition using Generalized Portrait Method},
author={Vapnik, Vladimir and Lerner, A},
year={1963},
journal={Automation and Remote Control},
volume={24},
pages={774--780},
}
@article{vapnik1964,
title={A Note on one Class of Perceptrons},
author={Vapnik, Vladimir and Chervonenkis, A},
year={1964},
journal={Automation and Remote Control},
volume={25}
}
@book{vapnik2013,
title={The Nature of Statistical Learning Theory},
author={Vapnik, Vladimir},
year={2013},
publisher={Springer}
}
@article{wang2018,
title={Delivering meals for multiple suppliers: Exclusive or sharing
logistics service},
@@ -82,4 +440,14 @@ year={2018},
journal={Transportation Research Part E: Logistics and Transportation Review},
volume={118},
pages={496--512}
}
@article{winters1960,
title={Forecasting Sales by Exponentially Weighted Moving Averages},
author={Winters, Peter},
year={1960},
journal={Management Science},
volume={6},
number={3},
pages={324--342}
}