
Add Literature section

Alexander Hess 2020-10-04 23:00:15 +02:00
commit 3849e5fd3f
Signed by: alexander
GPG key ID: 344EA5AB10D868E0
15 changed files with 878 additions and 3 deletions


@ -0,0 +1,13 @@
\subsection{Demand Forecasting with Classical Forecasting Methods}
\label{class_methods}
Forecasting became a formal discipline starting in the 1950s and has its
origins in the broader field of statistics.
\cite{hyndman2018} provide a thorough overview of the concepts and methods
established, and \cite{ord2017} indicate business-related applications
such as demand forecasting.
These "classical" forecasting methods share the characteristic that they are
trained over the entire $Y$ first.
Then, for prediction, the forecaster specifies the number of time steps for
which he wants to generate forecasts.
That is different for ML models.


@ -0,0 +1,78 @@
\subsubsection{Na\"{i}ve Methods, Moving Averages, and Exponential Smoothing.}
\label{ets}
Simple forecasting methods are often employed as a benchmark for more
sophisticated ones.
The so-called na\"{i}ve and seasonal na\"{i}ve methods forecast the next time
step in a time series, $y_{T+1}$, with the last observation, $y_T$,
and, if a seasonal pattern is present, with the observation $k$ steps
before, $y_{T+1-k}$.
As variants, both methods can be generalized to include drift terms in the
presence of a trend or changing seasonal amplitude.
If a time series exhibits no trend, a simple moving average (SMA) is a
generalization of the na\"{i}ve method that is more robust to outliers.
It is defined as follows: $\hat{y}_{T+1} = \frac{1}{h} \sum_{i=T-h+1}^{T} y_i$
where $h$ is the horizon over which the average is calculated.
If a time series exhibits a seasonal pattern, setting $h$ to a multiple of the
periodicity $k$ suffices to keep the forecast unbiased by the seasonal
pattern.
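As a brief illustration with hypothetical numbers, consider daily demand
observations whose last four values are $10$, $14$, $9$, and $11$, with a
weekly periodicity ($k = 7$):
the na\"{i}ve method forecasts $\hat{y}_{T+1} = y_T = 11$, the seasonal
na\"{i}ve method forecasts $\hat{y}_{T+1} = y_{T+1-7}$ (i.e., the demand
observed on the same weekday one week earlier), and an SMA with $h = 4$
forecasts $\hat{y}_{T+1} = \frac{1}{4} (10 + 14 + 9 + 11) = 11$.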
Starting in the 1950s, another popular family of forecasting methods,
so-called exponential smoothing methods, was introduced by
\cite{brown1959}, \cite{holt1957}, and \cite{winters1960}.
The idea is that forecasts $\hat{y}_{T+1}$ are a weighted average of past
observations where the weights decay exponentially over time; in the case of
the simple exponential smoothing (SES) method, we obtain:
$
\hat{y}_{T+1} = \alpha y_T + \alpha (1 - \alpha) y_{T-1}
+ \alpha (1 - \alpha)^2 y_{T-2}
+ \dots + \alpha (1 - \alpha)^{T-1} y_{1}
$
where $\alpha$ (with $0 \le \alpha \le 1$) is a smoothing parameter.
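Disregarding the initialization of the very first forecast, the same weighted
average can be expressed recursively as
$
\hat{y}_{T+1} = \alpha y_T + (1 - \alpha) \hat{y}_T
$,
which shows that $\alpha$ trades off the most recent observation against the
previous forecast: values close to $1$ let forecasts react strongly to the
latest observation, while values close to $0$ average over a long history.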
Exponential smoothing methods are often expressed in an alternative component
form that consists of a forecast equation and one or more smoothing
equations for unobservable components.
Below, we present a generalization of SES, the so-called Holt-Winters'
seasonal method, in an additive formulation.
$\ell_t$, $b_t$, and $s_t$ represent the unobservable level, trend, and
seasonal components inherent in $y_t$, and $\beta$ and $\gamma$ complement
$\alpha$ as smoothing parameters:
\begin{align*}
\hat{y}_{t+1} & = \ell_t + b_t + s_{t+1-k} \\
\ell_t & = \alpha(y_t - s_{t-k}) + (1 - \alpha)(\ell_{t-1} + b_{t-1}) \\
b_t & = \beta (\ell_{t} - \ell_{t-1}) + (1 - \beta) b_{t-1} \\
s_t & = \gamma (y_t - \ell_{t-1} - b_{t-1}) + (1-\gamma)s_{t-k}
\end{align*}
With $b_t$, $s_t$, $\beta$, and $\gamma$ removed, this formulation reduces to
SES.
Distinct variations exist: Besides the three components, \cite{gardner1985}
add a damped trend, \cite{pegels1969} provides multiplicative
formulations, and \cite{taylor2003} adds damping to the latter.
The accuracy measure commonly employed is the sum of squared errors between
the observations and their forecasts.
The Theta method, originally introduced by \cite{assimakopoulos2000}, is shown
by \cite{hyndman2003} to be equivalent to SES with a drift term.
We mention this method here only because \cite{bell2018} emphasize that it
performs well at Uber.
However, in our empirical study, we find that this is not true in general.
\cite{hyndman2002} introduce statistical processes, so-called innovations
state-space models, to generalize the methods in this sub-section.
They call this family of models ETS as they capture error, trend, and seasonal
terms.
Linear and additive ETS models have a structure like so:
\begin{align*}
y_t & = \vec{w} \cdot \vec{x}_{t-1} + \epsilon_t \\
\vec{x}_t & = \mat{F} \vec{x}_{t-1} + \vec{g} \epsilon_t
\end{align*}
$y_t$ denotes the observations as before, while $\vec{x}_t$ is a state vector
of unobserved components.
$\epsilon_t$ is a white noise series and the matrix $\mat{F}$ and the vectors
$\vec{g}$ and $\vec{w}$ contain a model's coefficients.
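As an illustration, SES corresponds to the special case in which the state
vector holds only the level, $\vec{x}_t = (\ell_t)$, with $\vec{w} = (1)$,
$\mat{F} = (1)$, and $\vec{g} = (\alpha)$, so that the two equations reduce to
$y_t = \ell_{t-1} + \epsilon_t$ and $\ell_t = \ell_{t-1} + \alpha \epsilon_t$.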
Like the models in the next sub-section, ETS models are commonly fitted
with maximum likelihood and evaluated against historical data using
information-theoretic criteria.
We refer to \cite{hyndman2008b} for a thorough summary.


@ -0,0 +1,69 @@
\subsubsection{Autoregressive Integrated Moving Averages.}
\label{arima}
\cite{box1962}, \cite{box1968}, and further papers by the same authors in the
1960s introduce so-called autoregressive integrated moving average (ARIMA)
models, in which the observations of a stationary time series are modeled as
correlating with their neighboring observations.
For a thorough overview, we refer to \cite{box2015} and \cite{brockwell2016}.
A time series $y_t$ is stationary if its moments are independent of the
point in time where it is observed.
A typical example is a white noise $\epsilon_t$ series.
Therefore, a trend or seasonality implies non-stationarity.
\cite{kwiatkowski1992} provide a test to check the null hypothesis of
stationary data.
To obtain a stationary time series, one chooses from several techniques:
First, to stabilize a changing variance (i.e., heteroscedasticity), one
applies a Box-Cox transformation (e.g., $\log$) as first suggested by
\cite{box1964}.
Second, to factor out a trend (or seasonal) pattern, one computes differences
of consecutive (or of lag $k$) observations, or even differences thereof.
Third, it is also common to pre-process $y_t$ with one of the decomposition
methods mentioned in Sub-section \ref{stl} below and to train the ARIMA
model on the adjusted $y_t$.
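Formally, first-order and seasonal differencing are given by
$y'_t = y_t - y_{t-1}$ and $y'_t = y_t - y_{t-k}$, respectively, and a
second-order difference is the difference of the differences,
$y''_t = y'_t - y'_{t-1} = y_t - 2 y_{t-1} + y_{t-2}$.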
In the autoregressive part, observations are modeled as linear combinations of
its predecessors.
Formally, an $AR(p)$ model is defined with a drift term $c$, coefficients
$\phi_i$ to be estimated (where $i$ is an index with $0 < i \leq p$), and
white noise $\epsilon_t$ like so:
$
AR(p): \ \
y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p}
+ \epsilon_t
$.
The moving average part models observations as a linear combination of past
forecasting errors.
Formally, a $MA(q)$ model is defined with a drift term $c$, coefficients
$\theta_j$ to be estimated, and white noise terms $\epsilon_t$ (where $j$
is an index with $0 < j \leq q$) as follows:
$
MA(q): \ \
y_t = c + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2}
+ \dots + \theta_q \epsilon_{t-q}
$.
Finally, an $ARIMA(p,d,q)$ model unifies both parts and adds differencing
where $d$ is the degree of differences and the $'$ indicates differenced
values:
$
ARIMA(p,d,q): \ \
y'_t = c + \phi_1 y'_{t-1} + \dots + \phi_p y'_{t-p} + \theta_1 \epsilon_{t-1}
+ \dots + \theta_q \epsilon_{t-q} + \epsilon_{t}
$.
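As a concrete example, an $ARIMA(1,1,1)$ model differences the series once,
$y'_t = y_t - y_{t-1}$, and then models
$
y'_t = c + \phi_1 y'_{t-1} + \theta_1 \epsilon_{t-1} + \epsilon_t
$.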
$ARIMA(p,d,q)$ models are commonly fitted with maximum likelihood estimation.
To find an optimal combination of the parameters $p$, $d$, and $q$, the
literature suggests calculating an information theoretical criterion
(e.g., Akaike's Information Criterion) that evaluates the fit on
historical data.
\cite{hyndman2008a} provide a step-wise heuristic to choose $p$, $d$, and $q$,
that also decides if a Box-Cox transformation is to be applied, and if so,
which one.
To obtain a one-step-ahead forecast, the above equation is rewritten with $t$
substituted by $T+1$ and the unobservable error $\epsilon_{T+1}$ set to its
expected value of $0$.
For forecasts further into the future, the actual observations are
subsequently replaced by their forecasts.
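For a simple $AR(1)$ model, for instance, $\hat{y}_{T+1} = c + \phi_1 y_T$
still uses the last observation, whereas the two-step-ahead forecast
$\hat{y}_{T+2} = c + \phi_1 \hat{y}_{T+1}$ already relies on the previous
forecast.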
Seasonal ARIMA variants exist; however, the high frequency $k$ in the kind of
demand a UDP faces typically renders them impractical as too many
coefficients must be estimated.


@ -0,0 +1,62 @@
\subsubsection{Seasonal and Trend Decomposition using Loess.}
\label{stl}
A time series $y_t$ may exhibit different types of patterns; to fully capture
each of them, the series must be decomposed.
Then, each component is forecast with a distinct model.
Most commonly, the components are the trend $t_t$, seasonality $s_t$, and
remainder $r_t$.
They are themselves time series, where only $s_t$ exhibits a periodicity $k$.
A decomposition may be additive (i.e., $y_t = s_t + t_t + r_t$) or
multiplicative (i.e., $y_t = s_t \cdot t_t \cdot r_t$); the former assumes
that the effect of the seasonal component is independent of the overall level
of $y_t$ and vice versa.
The seasonal component is centered around $0$ in the additive and around $1$
in the multiplicative case such that its removal does not affect the level
of $y_t$.
Often, it is sufficient to only seasonally adjust the time series, and model
the trend and remainder together, for example, as $a_t = y_t - s_t$ in the
additive case.
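A forecast of the original series then re-adds the seasonal component, which
is typically projected forward with a seasonal na\"{i}ve forecast; in the
additive case and for a horizon $h \le k$, this yields
$\hat{y}_{T+h} = \hat{a}_{T+h} + s_{T+h-k}$.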
Early approaches employed moving averages (cf., Sub-section \ref{ets}) to
calculate a trend component, and, after removing that from $y_t$, averaged
all observations of the same seasonal lag to obtain the seasonal
component.
The downsides of this are the subjectivity in choosing the window lengths for
the moving average and the seasonal averaging, the inability of the
seasonal component to vary its amplitude over time, and the lack of
outlier handling.
The X11 method developed at the U.S. Census Bureau and described in detail by
\cite{dagum2016} overcomes these disadvantages.
However, due to its background in economics, it is designed primarily for
quarterly or monthly data, and the change in amplitude over time cannot be
controlled.
Variants of this method are the SEATS decomposition by the Bank of Spain and
the newer X-13ARIMA-SEATS method by the U.S. Census Bureau.
Their main advantages stem from the fact that the models calibrate themselves
according to statistical criteria without manual work for a statistician
and that the fitting process is robust to outliers.
\cite{cleveland1990} introduce a seasonal and trend decomposition using a
repeated locally weighted regression (the so-called Loess procedure) to
smooth the trend and seasonal components; it can be viewed as a
generalization of the methods above and is denoted by the acronym
\gls{stl}.
In contrast to the X11, X13, and SEATS methods, the STL supports seasonalities
of any lag $k$ that must, however, be determined with additional
statistical tests or set with out-of-band knowledge by the forecaster
(e.g., hourly demand data implies $k = 24 \cdot 7 = 168$, assuming customer
behavior differs on each day of the week).
Moreover, the seasonal component's rate of change, represented by the $ns$
parameter and explained in detail with Figure \ref{f:stl} in Section
\ref{decomp}, must be set by the forecaster as well, while the trend's
smoothness may be controlled via setting a non-default window size.
Outliers are handled by assignment to the remainder such that they do not
affect the trend and seasonal components.
In particular, the manual input needed to calibrate the STL explains why only
the X11, X13, and SEATS methods are widely used by practitioners.
However, the widespread adoption of concepts like cross-validation (cf.,
Sub-section \ref{cv}) in recent years enables the usage of an automated
grid search to optimize the parameters.
The STL's usage within a grid search is facilitated even further by the fact
that it is computationally cheaper than the other methods discussed.