
Add Literature section

Alexander Hess 2020-10-04 23:00:15 +02:00
commit 3849e5fd3f
Signed by: alexander
GPG key ID: 344EA5AB10D868E0
15 changed files with 878 additions and 3 deletions


@ -0,0 +1,13 @@
\subsection{Demand Forecasting with Classical Forecasting Methods}
\label{class_methods}
Forecasting became a formal discipline starting in the 1950s and has its
origins in the broader field of statistics.
\cite{hyndman2018} provide a thorough overview of the concepts and methods
established, and \cite{ord2017} indicate business-related applications
such as demand forecasting.
These "classical" forecasting methods share the characteristic that they are
trained over the entire $Y$ first.
Then, for prediction, the forecaster specifies the number of time steps for
which he wants to generate forecasts.
That is different for ML models.


@ -0,0 +1,78 @@
\subsubsection{Na\"{i}ve Methods, Moving Averages, and Exponential Smoothing.}
\label{ets}
Simple forecasting methods are often employed as a benchmark for more
sophisticated ones.
The so-called na\"{i}ve and seasonal na\"{i}ve methods forecast the next time
step in a time series, $y_{T+1}$, with the last observation, $y_T$,
and, if a seasonal pattern is present, with the observation $k$ steps
before, $y_{T+1-k}$.
As variants, both methods can be generalized to include drift terms in the
presence of a trend or changing seasonal amplitude.
If a time series exhibits no trend, a simple moving average (SMA) is a
generalization of the na\"{i}ve method that is more robust to outliers.
It is defined as follows: $\hat{y}_{T+1} = \frac{1}{h} \sum_{i=T-h+1}^{T} y_i$
where $h$ is the horizon over which the average is calculated.
If a time series exhibits a seasonal pattern, setting $h$ to a multiple of the
periodicity $k$ suffices to keep the forecast unbiased by the seasonal
pattern.
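As a brief illustration with hypothetical numbers, consider daily demand
observations whose last four values are $10$, $14$, $9$, and $11$, with a
weekly periodicity ($k = 7$):
the na\"{i}ve method forecasts $\hat{y}_{T+1} = y_T = 11$, the seasonal
na\"{i}ve method forecasts $\hat{y}_{T+1} = y_{T+1-7}$ (i.e., the demand
observed on the same weekday one week earlier), and an SMA with $h = 4$
forecasts $\hat{y}_{T+1} = \frac{1}{4} (10 + 14 + 9 + 11) = 11$.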
Starting in the 1950s, another popular family of forecasting methods,
so-called exponential smoothing methods, was introduced by
\cite{brown1959}, \cite{holt1957}, and \cite{winters1960}.
The idea is that forecasts $\hat{y}_{T+1}$ are a weighted average of past
observations where the weights decay exponentially over time; in the case of
the simple exponential smoothing (SES) method, we obtain:
$
\hat{y}_{T+1} = \alpha y_T + \alpha (1 - \alpha) y_{T-1}
+ \alpha (1 - \alpha)^2 y_{T-2}
+ \dots + \alpha (1 - \alpha)^{T-1} y_{1}
$
where $\alpha$ (with $0 \le \alpha \le 1$) is a smoothing parameter.
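Disregarding the initialization of the very first forecast, the same weighted
average can be expressed recursively as
$
\hat{y}_{T+1} = \alpha y_T + (1 - \alpha) \hat{y}_T
$,
which shows that $\alpha$ trades off the most recent observation against the
previous forecast: values close to $1$ let forecasts react strongly to the
latest observation, while values close to $0$ average over a long history.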
Exponential smoothing methods are often expressed in an alternative component
form that consists of a forecast equation and one or more smoothing
equations for unobservable components.
Below, we present a generalization of SES, the so-called Holt-Winters'
seasonal method, in an additive formulation.
$\ell_t$, $b_t$, and $s_t$ represent the unobservable level, trend, and
seasonal components inherent in $y_t$, and $\beta$ and $\gamma$ complement
$\alpha$ as smoothing parameters:
\begin{align*}
\hat{y}_{t+1} & = \ell_t + b_t + s_{t+1-k} \\
\ell_t & = \alpha(y_t - s_{t-k}) + (1 - \alpha)(\ell_{t-1} + b_{t-1}) \\
b_t & = \beta (\ell_{t} - \ell_{t-1}) + (1 - \beta) b_{t-1} \\
s_t & = \gamma (y_t - \ell_{t-1} - b_{t-1}) + (1-\gamma)s_{t-k}
\end{align*}
With $b_t$, $s_t$, $\beta$, and $\gamma$ removed, this formulation reduces to
SES.
Distinct variations exist: Besides the three components, \cite{gardner1985}
add a damped trend, \cite{pegels1969} provides multiplicative
formulations, and \cite{taylor2003} adds damping to the latter.
The accuracy measure commonly employed is the sum of squared errors between
the observations and their forecasts.
The Theta method, originally introduced by \cite{assimakopoulos2000}, is shown
by \cite{hyndman2003} to be equivalent to SES with a drift term.
We mention this method here only because \cite{bell2018} emphasize that it
performs well at Uber.
However, in our empirical study, we find that this is not true in general.
\cite{hyndman2002} introduce statistical processes, so-called innovations
state-space models, to generalize the methods in this sub-section.
They call this family of models ETS as they capture error, trend, and seasonal
terms.
Linear and additive ETS models have a structure like so:
\begin{align*}
y_t & = \vec{w} \cdot \vec{x}_{t-1} + \epsilon_t \\
\vec{x}_t & = \mat{F} \vec{x}_{t-1} + \vec{g} \epsilon_t
\end{align*}
$y_t$ denotes the observations as before, while $\vec{x}_t$ is a state vector
of unobserved components.
$\epsilon_t$ is a white noise series and the matrix $\mat{F}$ and the vectors
$\vec{g}$ and $\vec{w}$ contain a model's coefficients.
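As an illustration, SES corresponds to the special case in which the state
vector holds only the level, $\vec{x}_t = (\ell_t)$, with $\vec{w} = (1)$,
$\mat{F} = (1)$, and $\vec{g} = (\alpha)$, so that the two equations reduce to
$y_t = \ell_{t-1} + \epsilon_t$ and $\ell_t = \ell_{t-1} + \alpha \epsilon_t$.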
Like the models in the next sub-section, ETS models are commonly fitted
with maximum likelihood and evaluated against historical data using
information-theoretic criteria.
We refer to \cite{hyndman2008b} for a thorough summary.


@ -0,0 +1,69 @@
\subsubsection{Autoregressive Integrated Moving Averages.}
\label{arima}
\cite{box1962}, \cite{box1968}, and further papers by the same authors in the
1960s introduce so-called autoregressive integrated moving average (ARIMA)
models, in which the observations of a stationary time series are modeled as
correlating with their neighboring observations.
For a thorough overview, we refer to \cite{box2015} and \cite{brockwell2016}.
A time series $y_t$ is stationary if its moments are independent of the
point in time where it is observed.
A typical example is a white noise $\epsilon_t$ series.
Therefore, a trend or seasonality implies non-stationarity.
\cite{kwiatkowski1992} provide a test to check the null hypothesis of
stationary data.
To obtain a stationary time series, one chooses from several techniques:
First, to stabilize a changing variance (i.e., heteroscedasticity), one
applies a Box-Cox transformation (e.g., $\log$) as first suggested by
\cite{box1964}.
Second, to factor out a trend (or seasonal) pattern, one computes differences
of consecutive (or of lag $k$) observations, or even differences thereof.
Third, it is also common to pre-process $y_t$ with one of the decomposition
methods mentioned in Sub-section \ref{stl} below and to train the ARIMA
model on the adjusted $y_t$.
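Formally, first-order and seasonal differencing are given by
$y'_t = y_t - y_{t-1}$ and $y'_t = y_t - y_{t-k}$, respectively, and a
second-order difference is the difference of the differences,
$y''_t = y'_t - y'_{t-1} = y_t - 2 y_{t-1} + y_{t-2}$.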
In the autoregressive part, observations are modeled as linear combinations of
its predecessors.
Formally, an $AR(p)$ model is defined with a drift term $c$, coefficients
$\phi_i$ to be estimated (where $i$ is an index with $0 < i \leq p$), and
white noise $\epsilon_t$ like so:
$
AR(p): \ \
y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p}
+ \epsilon_t
$.
The moving average part models observations as a linear combination of past
forecasting errors.
Formally, a $MA(q)$ model is defined with a drift term $c$, coefficients
$\theta_j$ to be estimated, and white noise terms $\epsilon_t$ (where $j$
is an index with $0 < j \leq q$) as follows:
$
MA(q): \ \
y_t = c + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2}
+ \dots + \theta_q \epsilon_{t-q}
$.
Finally, an $ARIMA(p,d,q)$ model unifies both parts and adds differencing
where $d$ is the degree of differences and the $'$ indicates differenced
values:
$
ARIMA(p,d,q): \ \
y'_t = c + \phi_1 y'_{t-1} + \dots + \phi_p y'_{t-p} + \theta_1 \epsilon_{t-1}
+ \dots + \theta_q \epsilon_{t-q} + \epsilon_{t}
$.
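As a concrete example, an $ARIMA(1,1,1)$ model differences the series once,
$y'_t = y_t - y_{t-1}$, and then models
$
y'_t = c + \phi_1 y'_{t-1} + \theta_1 \epsilon_{t-1} + \epsilon_t
$.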
$ARIMA(p,d,q)$ models are commonly fitted with maximum likelihood estimation.
To find an optimal combination of the parameters $p$, $d$, and $q$, the
literature suggests calculating an information theoretical criterion
(e.g., Akaike's Information Criterion) that evaluates the fit on
historical data.
\cite{hyndman2008a} provide a step-wise heuristic to choose $p$, $d$, and $q$,
that also decides if a Box-Cox transformation is to be applied, and if so,
which one.
To obtain a one-step-ahead forecast, the above equation is rewritten with $t$
substituted by $T+1$ and the unobservable error $\epsilon_{T+1}$ set to its
expected value of $0$.
For forecasts further into the future, the actual observations are
subsequently replaced by their forecasts.
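For a simple $AR(1)$ model, for instance, $\hat{y}_{T+1} = c + \phi_1 y_T$
still uses the last observation, whereas the two-step-ahead forecast
$\hat{y}_{T+2} = c + \phi_1 \hat{y}_{T+1}$ already relies on the previous
forecast.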
Seasonal ARIMA variants exist; however, the high frequency $k$ in the kind of
demand a UDP faces typically renders them impractical as too many
coefficients must be estimated.


@ -0,0 +1,62 @@
\subsubsection{Seasonal and Trend Decomposition using Loess.}
\label{stl}
A time series $y_t$ may exhibit different types of patterns; to fully capture
each of them, the series must be decomposed.
Then, each component is forecast with a distinct model.
Most commonly, the components are the trend $t_t$, seasonality $s_t$, and
remainder $r_t$.
They are themselves time series, where only $s_t$ exhibits a periodicity $k$.
A decomposition may be additive (i.e., $y_t = s_t + t_t + r_t$) or
multiplicative (i.e., $y_t = s_t \cdot t_t \cdot r_t$); the former assumes
that the effect of the seasonal component is independent of the overall level
of $y_t$ and vice versa.
The seasonal component is centered around $0$ in the additive and around $1$
in the multiplicative case such that its removal does not affect the level
of $y_t$.
Often, it is sufficient to only seasonally adjust the time series, and model
the trend and remainder together, for example, as $a_t = y_t - s_t$ in the
additive case.
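A forecast of the original series then re-adds the seasonal component, which
is typically projected forward with a seasonal na\"{i}ve forecast; in the
additive case and for a horizon $h \le k$, this yields
$\hat{y}_{T+h} = \hat{a}_{T+h} + s_{T+h-k}$.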
Early approaches employed moving averages (cf., Sub-section \ref{ets}) to
calculate a trend component, and, after removing that from $y_t$, averaged
all observations of the same seasonal lag to obtain the seasonal
component.
The downsides of this are the subjectivity in choosing the window lengths for
the moving average and the seasonal averaging, the inability of the
seasonal component to vary its amplitude over time, and the lack of
outlier handling.
The X11 method developed at the U.S. Census Bureau and described in detail by
\cite{dagum2016} overcomes these disadvantages.
However, due to its background in economics, it is designed primarily for
quarterly or monthly data, and the change in amplitude over time cannot be
controlled.
Variants of this method are the SEATS decomposition by the Bank of Spain and
the newer X-13ARIMA-SEATS method by the U.S. Census Bureau.
Their main advantages stem from the fact that the models calibrate themselves
according to statistical criteria without manual work for a statistician
and that the fitting process is robust to outliers.
\cite{cleveland1990} introduce a seasonal and trend decomposition using a
repeated locally weighted regression (the so-called Loess procedure) to
smooth the trend and seasonal components; it can be viewed as a
generalization of the methods above and is denoted by the acronym
\gls{stl}.
In contrast to the X11, X13, and SEATS methods, the STL supports seasonalities
of any lag $k$ that must, however, be determined with additional
statistical tests or set with out-of-band knowledge by the forecaster
(e.g., hourly demand data implies $k = 24 \cdot 7 = 168$, assuming customer
behavior differs on each day of the week).
Moreover, the seasonal component's rate of change, represented by the $ns$
parameter and explained in detail with Figure \ref{f:stl} in Section
\ref{decomp}, must be set by the forecaster as well, while the trend's
smoothness may be controlled via setting a non-default window size.
Outliers are handled by assignment to the remainder such that they do not
affect the trend and seasonal components.
In particular, the manual input needed to calibrate the STL explains why only
the X11, X13, and SEATS methods are widely used by practitioners.
However, the widespread adoption of concepts like cross-validation (cf.,
Sub-section \ref{cv}) in recent years enables the usage of an automated
grid search to optimize the parameters.
The STL's usage within a grid search is facilitated even further by the fact
that it is computationally cheaper than the other methods discussed.