Add Model section
This commit is contained in:
parent
7c203cb87c
commit
91bd4ba083
25 changed files with 1354 additions and 6 deletions

@@ -1,8 +1,6 @@
\section{Model Formulation}
\label{mod}

% temporary placeholders
\label{decomp}
\label{f:stl}
\label{mase}
\label{unified_cv}

In this section, we describe how the platform's raw data are pre-processed
into model inputs and how the forecasting models are built and benchmarked
against each other.

28 tex/3_mod/2_overall.tex Normal file
@@ -0,0 +1,28 @@
\subsection{Overall Approach}
\label{approach_approach}

On a conceptual level, there are three distinct aspects of the model
development process.
First, a pre-processing step transforms the platform's tabular order data into
either time series in Sub-section \ref{grid} or feature matrices in
Sub-section \ref{ml_models}.
Second, a benchmark methodology is developed in Sub-section \ref{unified_cv}
that compares all models on the same scale, in particular, classical
models with ML ones.
Concretely, the CV approach is adapted to the peculiar requirements of
sub-daily and ordinal time series data so as to maximize the predictive
power of all models into the future while keeping them comparable.
Third, the forecasting models are described with respect to their assumptions
and training requirements.
Four classification dimensions are introduced:
\begin{enumerate}
\item \textbf{Timeliness of the Information}:
    whole-day-ahead vs. real-time forecasts
\item \textbf{Time Series Decomposition}: raw vs. decomposed
\item \textbf{Algorithm Type}: ``classical'' statistics vs. ML
\item \textbf{Data Sources}: pure vs. enhanced (i.e., with external data)
\end{enumerate}
Not all of the $2^4 = 16$ possible combinations are implemented; instead, the
models are varied along these dimensions to show different effects and
answer the research questions.

95 tex/3_mod/3_grid.tex Normal file
@@ -0,0 +1,95 @@
\subsection{Gridification, Time Tables, and Time Series Generation}
\label{grid}

The platform's tabular order data are sliced with respect to both location and
time and then aggregated into time series in which each observation is
the number of orders in an area for a time step/interval.
Figure \ref{f:grid} shows how the orders' delivery locations are each
matched to a square-shaped cell, referred to as a pixel, on a grid
covering the entire service area within a city.
This gridification step is also applied to the pickup locations separately.
The grid's lower-left corner is chosen at random.
\cite{winkenbach2015} apply the same gridification idea and slice an urban
area to model a location-routing problem, and \cite{singleton2017} portray
it as a standard method in the field of urban analytics.
With increasing pixel sizes, the time series exhibit more order aggregation
with a possibly stronger demand pattern.
On the other hand, the larger the pixels, the less valuable the generated
forecasts become as, for example, a courier sent to a pixel
preemptively then faces a longer average distance to a restaurant in the
pixel.
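
In essence, gridification amounts to integer-dividing projected coordinates
by the pixel's side length. As a minimal sketch (assuming pandas and
hypothetical \texttt{x}/\texttt{y} columns holding projected coordinates in
meters, measured from the grid's randomly chosen lower-left corner):
\begin{verbatim}
import pandas as pd

def gridify(orders, side_length=1_000.0):
    """Assign each order to a square pixel with the given side length."""
    orders = orders.copy()
    # Integer-divide the coordinates to obtain the pixel's column and
    # row indices on the grid; 1,000 m yields 1 km^2 pixels.
    orders["pixel_x"] = (orders["x"] // side_length).astype(int)
    orders["pixel_y"] = (orders["y"] // side_length).astype(int)
    return orders
\end{verbatim}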

\begin{center}
\captionof{figure}{Gridification for delivery locations in Paris with a pixel
    size of $1~\text{km}^2$}
\label{f:grid}
\includegraphics[width=.8\linewidth]{static/gridification_for_paris_gray.png}
\end{center}

After gridification, the ad-hoc orders within a pixel are aggregated by their
placement timestamps into sub-daily time steps of pre-defined lengths
to obtain a time table as exemplified in Figure \ref{f:timetable} with
one-hour intervals.
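
The aggregation is a counting query; a sketch with pandas (assuming a
hypothetical \texttt{placed\_at} timestamp column and the pixel indices
from the gridification step):
\begin{verbatim}
def make_timetable(orders, freq="60min"):
    """Count orders per pixel and time step; pivot into a time table."""
    step = orders["placed_at"].dt.floor(freq)
    counts = (
        orders.groupby(["pixel_x", "pixel_y", step])
        .size()
        .rename("n_orders")
        .reset_index()
    )
    counts["time_of_day"] = counts["placed_at"].dt.time
    counts["day"] = counts["placed_at"].dt.date
    # Rows: times of day; columns: days (the 2D time table view).
    return counts.pivot_table(
        index=["pixel_x", "pixel_y", "time_of_day"],
        columns="day",
        values="n_orders",
        fill_value=0,  # time steps without any order count as 0
    )
\end{verbatim}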

\begin{center}
\captionof{figure}{Aggregation into a time table with hourly time steps}
\label{f:timetable}
\begin{tabular}{|c||*{9}{c|}}
\hline
\backslashbox{Time}{Day} & \makebox[2em]{\ldots}
    & \makebox[3em]{Mon} & \makebox[3em]{Tue}
    & \makebox[3em]{Wed} & \makebox[3em]{Thu}
    & \makebox[3em]{Fri} & \makebox[3em]{Sat}
    & \makebox[3em]{Sun} & \makebox[2em]{\ldots} \\
\hline
\hline
11:00 & \ldots & $y_{11,Mon}$ & $y_{11,Tue}$ & $y_{11,Wed}$ & $y_{11,Thu}$
    & $y_{11,Fri}$ & $y_{11,Sat}$ & $y_{11,Sun}$ & \ldots \\
\hline
12:00 & \ldots & $y_{12,Mon}$ & $y_{12,Tue}$ & $y_{12,Wed}$ & $y_{12,Thu}$
    & $y_{12,Fri}$ & $y_{12,Sat}$ & $y_{12,Sun}$ & \ldots \\
\hline
\ldots & \ldots & \ldots & \ldots & \ldots
    & \ldots & \ldots & \ldots & \ldots & \ldots \\
\hline
20:00 & \ldots & $y_{20,Mon}$ & $y_{20,Tue}$ & $y_{20,Wed}$ & $y_{20,Thu}$
    & $y_{20,Fri}$ & $y_{20,Sat}$ & $y_{20,Sun}$ & \ldots \\
\hline
21:00 & \ldots & $y_{21,Mon}$ & $y_{21,Tue}$ & $y_{21,Wed}$ & $y_{21,Thu}$
    & $y_{21,Fri}$ & $y_{21,Sat}$ & $y_{21,Sun}$ & \ldots \\
\hline
\ldots & \ldots & \ldots & \ldots & \ldots
    & \ldots & \ldots & \ldots & \ldots & \ldots \\
\hline
\end{tabular}
\end{center}
\

Consequently, each $y_{t,d}$ in Figure \ref{f:timetable} is the number of
all orders within the pixel for the time of day $t$ and day of week
$d$ ($y_t$ and $y_{t,d}$ denote the same observation; the latter merely
acknowledges the two-dimensional view).
The same trade-off as with gridification applies:
The shorter the interval, the weaker the demand pattern to be expected in
the time series due to less aggregation, while longer intervals lead to
less usable forecasts.
We refer to time steps by their start time, and their number per day, $H$,
is constant.
Given a time table as in Figure \ref{f:timetable}, there are two ways to
generate a time series by slicing:
\begin{enumerate}
\item \textbf{Horizontal View}:
    Take only the order counts for a given time of the day
\item \textbf{Vertical View}:
    Take all order counts and remove the double-seasonal pattern induced
    by the weekday and time of the day with decomposition
\end{enumerate}
Distinct time series are retrieved by iterating through the time tables either
horizontally or vertically in increments of a single time step.
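
In code, the two views are plain slices of a pixel's time table; a sketch
based on the pivoted table above (one pixel, rows indexed by time of day,
columns by day):
\begin{verbatim}
def horizontal_series(timetable, time_of_day):
    # One observation per day, always for the same time of day.
    return timetable.loc[time_of_day]

def vertical_series(timetable):
    # Iterate the time table column by column (day-major order),
    # yielding H observations per day.
    return timetable.T.stack()
\end{verbatim}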
Another property of a generated time series is its length, which, following
the next sub-section, can be interpreted as the length of the training set
used in production plus the test day.
In summary, a distinct time series is generated from the tabular order data
based on a configuration of parameters for the dimensions pixel size,
number of daily time steps $H$, shape (horizontal vs. vertical), length,
and the time step to be predicted.

86 tex/3_mod/4_cv.tex Normal file
@@ -0,0 +1,86 @@
\subsection{Unified Cross-Validation and Training, Validation, and Test Sets}
\label{unified_cv}

The standard $k$-fold CV, which assumes no structure in the individual
features of the samples, as shown in $\mat{X}$ above, is adapted to the
ordinal character of time series data:
A model must be evaluated on observations that occurred strictly after the
ones used for training as, otherwise, the model would know about the future.
Furthermore, some models predict only one or a few time steps before
being retrained, while others predict an entire day without retraining
(cf., Sub-section \ref{ml_models}).
Consequently, we must use a unified time interval in which all forecasts are
made first before the entire interval is evaluated.
As whole days are the longest prediction interval for models without
retraining, we choose that as the unified time interval.
In summary, our CV methodology yields a distinct best model per pixel and day
to be forecast.
Whole days are also practical for managers who commonly monitor, for example,
the routing and thus the forecasting performance on a day-to-day basis.
Our methodology assumes that the models are trained at least once per day.
As we create operational forecasts into the near future in this paper,
retraining all models with the latest available data is a logical step.

\begin{center}
\captionof{figure}{Training, validation, and test sets
    during cross validation}
\label{f:cv}
\includegraphics[width=.8\linewidth]{static/cross_validation_gray.png}
\end{center}

The training, validation, and test sets are defined as follows.
To exemplify the logic, we refer to Figure \ref{f:cv}, which shows the
calendar setup (i.e., weekdays on the x-axis) for three days $T_1$, $T_2$,
and $T_3$ (shown in dark gray) for which we generate forecasts.
Each of these days is, by definition, a test day, and the test set comprises
all time series, horizontal or vertical, whose last observation lies on
that day.
With an assumed training horizon of three weeks, the 21 days before each of
the test days constitute the corresponding training sets (shown in lighter
gray on the same rows as $T_1$, $T_2$, and $T_3$).
There are two kinds of validation sets, depending on the decision to be made.
First, if a forecasting method needs parameter tuning, the original training
set is divided into as many equally long series as validation days are
needed to find stable parameters.
The example shows three validation days per test day, named $V_n$ (shown
in darker gray below each test day).
The $21 - 3 = 18$ preceding days constitute the training set corresponding to
a validation day.
To obtain the overall validation error, the three errors are averaged.
We call these \textit{inner} validation sets because they must be repeated
each day to re-tune the parameters and because the involved time series
are true subsets of the original series.
Second, to find the best method per day and pixel, the same averaging logic
is applied on the outer level.
For example, if we used two validation days to find the best method for $T_3$,
we would average the errors of $T_1$ and $T_2$ for each method and select
the winner; then, $T_1$ and $T_2$ constitute an \textit{outer} validation
set.
Whereas the number of inner validation days is method-specific and must be
chosen before generating any test day forecasts in the first place, the
number of outer validation days may be varied after the fact and is
determined empirically as we show in Section \ref{stu}.
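
The nesting reduces to plain date arithmetic; the following standalone
sketch (with hypothetical argument names) enumerates the inner splits for
one test day under the setup from Figure \ref{f:cv}:
\begin{verbatim}
import datetime as dt

def inner_splits(test_day, train_days=21, val_days=3):
    """List (training window, validation day) pairs for one test day."""
    inner_train = train_days - val_days  # 21 - 3 = 18 days each
    splits = []
    for i in range(1, val_days + 1):
        val_day = test_day - dt.timedelta(days=i)
        train_start = val_day - dt.timedelta(days=inner_train)
        train_end = val_day - dt.timedelta(days=1)
        splits.append(((train_start, train_end), val_day))
    return splits

# The outer training set is simply the 21 days before the test day;
# outer validation reuses the errors of previous test days.
\end{verbatim}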

Our unified CV approach is also optimized for large-scale production settings,
for example, at companies like Uber.
As \cite{bell2018} note, there is a trade-off as to when each of the
inner time series in the example begins.
While the forecasting accuracy likely increases with more training days,
supporting inner series with increasing lengths, cutting the series
to the same length allows caching the forecasts and errors.
In the example, $V_3$, $V_5$, and $V_7$, as well as $V_6$ and $V_8$, are
identical despite belonging to different inner validation sets.
Caching is also possible on the outer level when searching for an optimal
number of validation days for model selection.
We achieved cache hit ratios of up to 80\% in our implementation in the
empirical study, saving the corresponding computational resources.
Lastly, we assert that our suggested CV, because it is unified around whole
test days and uses fixed-size time series, is also suitable for creating
consistent learning curves and, thus, for answering \textbf{Q3} on the
relationship between forecast accuracy and the amount of historic data:
We simply increase the length of the outer training set while holding the
test day fixed.
Thus, independent of a method's need for parameter tuning, all methods have
the same demand history available for each test day forecast.

87 tex/3_mod/5_mase.tex Normal file
@@ -0,0 +1,87 @@
\subsection{Accuracy Measures}
\label{mase}

Choosing an error measure for both model selection and evaluation is not
straightforward when working with intermittent demand, as shown, for
example, by \cite{syntetos2005}, and one should understand the trade-offs
between measures.
\cite{hyndman2006} provide a study of measures with real-life data taken from
the popular M3-competition and find that most standard measures degenerate
under many scenarios.
They also provide a classification scheme, for which we summarize the main
points as they apply to the UDP case:
\begin{enumerate}
\item \textbf{Scale-dependent Errors}:
The error is reported in the same unit as the raw data.
Two popular examples are the root mean square error (RMSE) and mean absolute
error (MAE).
They may be used for model selection and evaluation within a pixel, and are
intuitively interpretable; however, they may not be used to compare the
errors of, for example, a low-demand pixel (e.g., at the UDP's service
boundary) with those of a high-demand pixel (e.g., downtown).
\item \textbf{Percentage Errors}:
The error is derived from the percentage errors of individual forecasts per
time step, and is also intuitively interpretable.
A popular example is the mean absolute percentage error (MAPE), which is the
primary measure in most forecasting competitions.
Whereas such errors could be applied both within and across pixels, they
cannot be calculated reliably for intermittent demand:
If only one time step exhibits no demand, a division by zero occurs.
This often happens even in high-demand pixels due to the slicing.
\item \textbf{Relative Errors}:
A workaround is to calculate a scale-dependent error for the test day and
divide it by the same measure calculated with forecasts of a simple
benchmark method (e.g., the na\"{i}ve method).
An example could be
$\text{RelMAE} = \text{MAE} / \text{MAE}_\text{bm}$.
Nevertheless, even simple methods create (near-)perfect forecasts at times,
and then $\text{MAE}_\text{bm}$ becomes (close to) $0$.
These numerical instabilities occurred so often in our studies that we argue
against using such measures.
\item \textbf{Scaled Errors}:
\cite{hyndman2006} contribute this category and introduce the mean absolute
scaled error (\gls{mase}).
It is defined as the MAE from the actual forecasting method on the test day
(i.e., ``out-of-sample'') divided by the MAE from the (seasonal) na\"{i}ve
method on the entire training set (i.e., ``in-sample'').
A MASE of $1$ indicates that a forecasting method has the same accuracy
on the test day as the (seasonal) na\"{i}ve method applied on a longer
horizon, and lower values imply higher accuracy.
Within a pixel, its results are identical to the ones obtained with the MAE.
Also, we acknowledge recent publications, for example, \cite{prestwich2014} or
\cite{kim2016}, showing other ways of tackling the difficulties mentioned.
However, only the MASE provided numerically stable results for all
forecasts in our study.
\end{enumerate}
Consequently, we use the MASE with a seasonal na\"{i}ve benchmark as the
primary measure in this paper.
With the previously introduced notation, it is defined as follows:
$$
\text{MASE}
:=
\frac{\text{MAE}_{\text{out-of-sample}}}{\text{MAE}_{\text{in-sample}}}
=
\frac{\text{MAE}_{\text{forecasts}}}{\text{MAE}_{\text{training}}}
=
\frac{\frac{1}{H} \sum_{h=1}^H |y_{T+h} - \hat{y}_{T+h}|}
     {\frac{1}{T-k} \sum_{t=k+1}^T |y_{t} - y_{t-k}|}
$$
The denominator can only become $0$ if the seasonal na\"{i}ve benchmark makes
a perfect forecast on each day in the training set except the first seven
days, which never happened in our case study involving hundreds of
thousands of individual model trainings.
Further, as per the discussion in the subsequent Sub-section \ref{decomp}, we
also calculate peak-MASEs, for which we leave out the time steps of
non-peak times from the calculations.
For this analysis, we define all time steps that occur at lunch (i.e., noon to
2 pm) and dinner time (i.e., 6 pm to 8 pm) as peak.
As time steps in non-peak times typically average no or very low order counts,
a UDP may choose not to actively forecast these at all and instead be
interested in the accuracies of forecasting methods during peaks only.
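
The definition transcribes directly into code; a sketch with NumPy, where
the peak variant masks the test-day time steps according to the lunch and
dinner windows defined above (one possible reading of leaving out the
non-peak time steps):
\begin{verbatim}
import numpy as np

def mase(y_train, y_test, y_pred, k=7):
    """Test-day MAE scaled by the in-sample seasonal naive MAE."""
    y_train, y_test, y_pred = map(np.asarray, (y_train, y_test, y_pred))
    mae_out = np.mean(np.abs(y_test - y_pred))
    # Seasonal naive benchmark on the training set: y_t vs. y_{t-k}.
    mae_in = np.mean(np.abs(y_train[k:] - y_train[:-k]))
    return mae_out / mae_in

def peak_mase(y_train, y_test, y_pred, hours, k=7):
    """Peak-MASE: keep only lunch (noon-2 pm) and dinner (6-8 pm)."""
    hours = np.asarray(hours)  # hour of day per test-day time step
    peak = ((hours >= 12) & (hours < 14)) | ((hours >= 18) & (hours < 20))
    y_test, y_pred = np.asarray(y_test)[peak], np.asarray(y_pred)[peak]
    return mase(y_train, y_test, y_pred, k)
\end{verbatim}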

We conjecture that percentage error measures may be usable for UDPs facing a
higher overall demand with no intra-day down-times in between but have to
leave that to a future study.
Yet, even with high and steady demand, divide-by-zero errors are likely to
occur.

76 tex/3_mod/6_decomp.tex Normal file
@@ -0,0 +1,76 @@
\subsection{Time Series Decomposition}
\label{decomp}

Concerning the time table in Figure \ref{f:timetable}, a seasonal demand
pattern is inherent to both horizontal and vertical time series.
First, the weekday influences whether people eat out or order in, with our
partner receiving more orders on Thursday through Saturday than on the
other four days.
This pattern is part of both types of time series.
Second, on any given day, demand peaks occur around lunch and dinner times.
This pattern only concerns vertical series.
Statistical analyses show that horizontally sliced time series indeed exhibit
a periodicity of $k=7$, and vertically sliced series only yield a seasonal
component with a regular pattern if the periodicity is set to the product
of the number of weekdays and the number of daily time steps, indicating a
distinct intra-day pattern per weekday.

Figure \ref{f:stl} shows three exemplary STL decompositions for a
$1~\text{km}^2$ pixel and a vertical time series with 60-minute time steps
(on the x-axis) covering four weeks:
With the noisy raw data $y_t$ on the left, the seasonal and trend components,
$s_t$ and $t_t$, are depicted in light and dark gray for increasing $ns$
parameters.
The plots include (seasonal) na\"{i}ve forecasts for the subsequent test day
as dotted lines.
The remainder components $r_t$ are not shown for conciseness.
The periodicity is set to $k = 7 \cdot 12 = 84$ as our industry partner has
$12$ opening hours per day.

\begin{center}
\captionof{figure}{STL decompositions for a medium-demand pixel with hourly
    time steps and periodicity $k=84$}
\label{f:stl}
\includegraphics[width=.95\linewidth]{static/stl_gray.png}
\end{center}

As described in Sub-section \ref{stl}, with $k$ being implied by the
application, at the very least, the length of the seasonal smoothing
window, represented by the $ns$ parameter, must be calibrated by the
forecaster:
It controls how many past observations go into each smoothed $s_t$.
Many practitioners, however, skip this step and set $ns$ to a large number,
for example, $999$, which is then referred to as ``periodic.''
For the other parameters, it is common to use the default values as
specified in \cite{cleveland1990}.
The goal is to find a decomposition with a regular pattern in $s_t$.
In Figure \ref{f:stl}, this is not true for $ns=7$, where, for
example, the four largest bars corresponding to the same time of day a
week apart cannot be connected by an approximately straight line.
On the contrary, a regular pattern exists in the most extreme way for
$ns=999$, where the same four largest bars are of the same height.
This observation holds for each time step of the day.
For $ns=11$, $s_t$ exhibits a regular pattern whose bars adapt over time:
The pattern is regular as bars corresponding to the same time of day can be
connected by approximately straight lines, and it is adaptive as these
lines are not horizontal.
The trade-off between small and large values for $ns$ can thus be interpreted
as allowing the average demand during peak times to change over time:
If demand is intermittent at non-peak times, it is reasonable to expect the
bars to change over time as only the relative differences between peak and
non-peak times impact the bars' heights, with the seasonal component being
centered around $0$.
One way to confirm the goodness of a decomposition statistically is to verify
that $r_t$ can be modeled as a typical error process, such as white noise
$\epsilon_t$.

However, we suggest an alternative way of calibrating the STL method in an
automated fashion based on our unified CV approach.
As hinted at in Figure \ref{f:stl}, we interpret an STL decomposition as a
forecasting method on its own by just adding the (seasonal) na\"{i}ve
forecasts for $s_t$ and $t_t$ and predicting $0$ for $r_t$.
Then, the $ns$ parameter is tuned just like a parameter for an ML model.
To the best of our knowledge, this has not been proposed before.
Conceptually, forecasting with the STL method can be viewed as a na\"{i}ve
method with built-in smoothing, and it outperformed all other
benchmark methods in all cases.
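
A sketch of this idea with statsmodels' STL implementation (which follows
\cite{cleveland1990}; \texttt{seasonal} is its name for $ns$, and the
horizon of $12$ steps reflects our partner's opening hours):
\begin{verbatim}
import numpy as np
from statsmodels.tsa.seasonal import STL

def stl_forecast(y, period=84, ns=11, horizon=12):
    """Use an STL decomposition itself as a whole-day-ahead forecaster."""
    res = STL(y, period=period, seasonal=ns).fit()
    # Seasonal naive forecast of s_t: repeat the last full cycle.
    s_hat = np.asarray(res.seasonal)[-period:][:horizon]
    # Naive forecast of t_t: carry the last trend value forward.
    t_hat = np.full(horizon, np.asarray(res.trend)[-1])
    return s_hat + t_hat  # the remainder r_t is predicted as 0

# ns is then tuned like any ML hyper-parameter, e.g., by grid search
# over odd values within the unified CV from above.
\end{verbatim}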

20 tex/3_mod/7_models/1_intro.tex Normal file
@@ -0,0 +1,20 @@
\subsection{Forecasting Models}
\label{models}

This sub-section describes the concrete models in our study.
Figure \ref{f:inputs} shows how we classify them into four families with
regard to the type of the time series, horizontal or vertical, and the
moment at which a model is trained:
Solid lines indicate that the corresponding time steps lie before the
training, and dotted lines show the time horizon predicted by a model.
For conciseness, we only show the forecasts for one test day.
The setup is the same for each inner validation day.

\

\begin{center}
\captionof{figure}{Classification of the models by input type and training
    moment}
\label{f:inputs}
\includegraphics[width=.95\linewidth]{static/model_inputs_gray.png}
\end{center}

42 tex/3_mod/7_models/2_hori.tex Normal file
@@ -0,0 +1,42 @@
\subsubsection{Horizontal and Whole-day-ahead Forecasts.}
\label{hori}

The upper-left in Figure \ref{f:inputs} illustrates the simplest way to
generate forecasts for a test day before it has started:
For each time of the day, the corresponding horizontal slice becomes the input
for a model.
With whole days being the unified time interval, each model is trained $H$
times, each training providing a one-step-ahead forecast.
While it is possible to select models of different types per time step,
that did not improve the accuracy in the empirical study.
As the models in this family do not include the test day's demand data in
their training sets, we see them as benchmarks to answer \textbf{Q4},
checking whether a UDP can take advantage of real-time information.
The models in this family are as follows; we use prefixes, such as ``h''
here, when methods are applied in other families as well:
\begin{enumerate}
\item \textit{\gls{naive}}:
    Observation from the same time step one week prior
\item \textit{\gls{trivial}}:
    Predict $0$ for all time steps
\item \textit{\gls{hcroston}}:
    Intermittent demand method introduced by \cite{croston1972}
\item \textit{\gls{hholt}},
    \textit{\gls{hhwinters}},
    \textit{\gls{hses}},
    \textit{\gls{hsma}}, and
    \textit{\gls{htheta}}:
    Exponential smoothing without calibration
\item \textit{\gls{hets}}:
    ETS calibrated as described by \cite{hyndman2008b}
\item \textit{\gls{harima}}:
    ARIMA calibrated as described by \cite{hyndman2008a}
\end{enumerate}
\textit{naive} and \textit{trivial} provide an absolute benchmark for the
actual forecasting methods.
\textit{hcroston} is often mentioned in the context of intermittent demand;
however, the method did not perform well at all.
Besides \textit{hhwinters}, which always fits a seasonal component, the
calibration heuristics behind \textit{hets} and \textit{harima} may do so
as well.
With $k=7$, an STL decomposition is unnecessary here.
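
To make the per-time-step training concrete, a sketch of the
whole-day-ahead loop with statsmodels' simple exponential smoothing
standing in for any member of the family:
\begin{verbatim}
import numpy as np
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

def horizontal_day_ahead(timetable, train_days=21):
    """One-step-ahead forecast per time of day, i.e., H trainings.

    timetable: 2D array with H rows (times of day) and one column
    per day, as in the time table figure.
    """
    forecasts = []
    for horizontal_slice in np.asarray(timetable, dtype=float):
        fit = SimpleExpSmoothing(horizontal_slice[-train_days:]).fit()
        forecasts.append(fit.forecast(1)[0])
    return np.array(forecasts)  # one value per test-day time step
\end{verbatim}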

39 tex/3_mod/7_models/3_vert.tex Normal file
@@ -0,0 +1,39 @@
\subsubsection{Vertical and Whole-day-ahead Forecasts without Retraining.}
\label{vert}

The upper-right in Figure \ref{f:inputs} shows an alternative way to
generate forecasts for a test day before it has started:
First, a seasonally-adjusted time series $a_t$ is obtained from a vertical
time series by STL decomposition.
Then, the actual forecasting model, trained on $a_t$, makes an $H$-step-ahead
prediction.
Lastly, we add the $H$ seasonal na\"{i}ve forecasts for the seasonal component
$s_t$ to these predictions to obtain the actual predictions for the test day.
Thus, only one training is required per model type, and no real-time data
are used.
By decomposing the raw time series, all long-term patterns are assumed to be
in the seasonal component $s_t$, and $a_t$ only contains the level with
a potential trend and auto-correlations.
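
The three steps can be sketched as follows, reusing the STL decomposition
from above with SES as a stand-in for the family's models:
\begin{verbatim}
import numpy as np
from statsmodels.tsa.seasonal import STL
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

def vertical_day_ahead(y, period=84, ns=11, horizon=12):
    """Decompose, forecast a_t H steps ahead, re-add seasonality."""
    res = STL(y, period=period, seasonal=ns).fit()
    # Seasonally-adjusted series: level/trend plus auto-correlations.
    a = np.asarray(res.trend) + np.asarray(res.resid)
    a_hat = np.asarray(SimpleExpSmoothing(a).fit().forecast(horizon))
    # Seasonal naive forecast of s_t for the test day's H time steps.
    s_hat = np.asarray(res.seasonal)[-period:][:horizon]
    return a_hat + s_hat
\end{verbatim}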
The models in this family are:
\begin{enumerate}
\item \textit{\gls{fnaive}},
    \textit{\gls{pnaive}}:
    Sum of STL's trend and seasonal components' na\"{i}ve forecasts
\item \textit{\gls{vholt}},
    \textit{\gls{vses}}, and
    \textit{\gls{vtheta}}:
    Exponential smoothing without calibration and seasonal fit
\item \textit{\gls{vets}}:
    ETS calibrated as described by \cite{hyndman2008b}
\item \textit{\gls{varima}}:
    ARIMA calibrated as described by \cite{hyndman2008a}
\end{enumerate}
As mentioned in Sub-section \ref{decomp}, we include the sum of the
(seasonal) na\"{i}ve forecasts of the STL's trend and seasonal components
as forecasts on their own:
For \textit{fnaive}, we tune the ``flexible'' $ns$ parameter, and for
\textit{pnaive}, we set it to a ``periodic'' value.
Thus, we implicitly assume that there is no signal in the remainder $r_t$, and
predict $0$ for it.
\textit{fnaive} and \textit{pnaive} are two more simple benchmarks.

22 tex/3_mod/7_models/4_rt.tex Normal file
@@ -0,0 +1,22 @@
\subsubsection{Vertical and Real-time Forecasts with Retraining.}
\label{rt}

The lower-left in Figure \ref{f:inputs} shows how models trained on vertical
time series are extended with real-time order data as they become available
during a test day:
Instead of obtaining an $H$-step-ahead forecast, we retrain a model after
every time step and only predict one step.
The rest of the procedure is as in the previous sub-section, and the models
are:
\begin{enumerate}
\item \textit{\gls{rtholt}},
    \textit{\gls{rtses}}, and
    \textit{\gls{rttheta}}:
    Exponential smoothing without calibration and seasonal fit
\item \textit{\gls{rtets}}:
    ETS calibrated as described by \cite{hyndman2008b}
\item \textit{\gls{rtarima}}:
    ARIMA calibrated as described by \cite{hyndman2008a}
\end{enumerate}
Retraining \textit{fnaive} and \textit{pnaive} did not increase accuracy, and
thus we left them out.
A downside of this family is the significant increase in computing costs.
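
The retraining loop can be sketched as follows (the observed counts are
revealed step by step and folded back into the history after seasonal
adjustment; argument names are hypothetical):
\begin{verbatim}
import numpy as np
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

def realtime_day(a_history, observed, s_hat):
    """Retrain after every time step; predict only one step ahead.

    a_history: seasonally-adjusted training series.
    observed:  the test day's actual counts, revealed step by step.
    s_hat:     seasonal naive forecasts of the seasonal component.
    """
    history = list(np.asarray(a_history, dtype=float))
    predictions = []
    for y_actual, s in zip(observed, s_hat):
        fit = SimpleExpSmoothing(np.asarray(history)).fit()
        predictions.append(fit.forecast(1)[0] + s)
        # Approximate the seasonal adjustment of the new actual with
        # the seasonal component's naive forecast.
        history.append(y_actual - s)
    return np.array(predictions)
\end{verbatim}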

54 tex/3_mod/7_models/5_ml.tex Normal file
@@ -0,0 +1,54 @@
\subsubsection{Vertical and Real-time Forecasts without Retraining.}
\label{ml_models}

The lower-right in Figure \ref{f:inputs} shows how ML models take
real-time order data into account without retraining.
Based on the seasonally-adjusted time series $a_t$, we employ the feature
matrix and label vector representations from Sub-section \ref{learning}
and set $n$ to the number of daily time steps, $H$, to cover all potential
auto-correlations.
The ML models are trained once before a test day starts.
For training, the matrix and vector are populated such that the label $y_T$
corresponds to the last time step of the day before the test day, $a_T$.
As the splitting during CV is done with whole days, the \gls{ml} models are
trained with training sets consisting of samples from all times of a day
in an equal manner.
Thus, the ML models learn to predict each time of the day.
For prediction on a test day, the $H$ observations preceding the time
step to be forecast are used as the input vector after seasonal
adjustment.
As a result, real-time data are included.
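
A sketch of the lagged feature matrix with scikit-learn's
\texttt{RandomForestRegressor} standing in for the family's models
(synthetic data for self-containment):
\begin{verbatim}
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def lagged_matrix(a, n_lags):
    """One row per sample: the n_lags observations before each label."""
    a = np.asarray(a, dtype=float)
    X = np.array([a[i:i + n_lags] for i in range(len(a) - n_lags)])
    y = a[n_lags:]
    return X, y

H = 12                                   # daily time steps
rng = np.random.default_rng(42)
a_train = rng.normal(size=21 * H)        # seasonally-adjusted series
X, y = lagged_matrix(a_train, n_lags=H)  # n = H covers a day of lags

model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# On the test day, the H observations preceding the next time step
# (seasonally adjusted, hence including real-time data) are the input.
x_now = a_train[-H:].reshape(1, -1)
a_hat = model.predict(x_now)[0]
\end{verbatim}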
The models in this family are:
\begin{enumerate}
\item \textit{\gls{vrfr}}: RF trained on the matrix as described
\item \textit{\gls{vsvr}}: SVR trained on the matrix as described
\end{enumerate}
We tried other ML models such as gradient boosting machines but found
only RFs and SVRs to perform well in our study.
In the case of gradient boosting machines, this is to be expected as they are
known not to perform well in the presence of high noise (as is natural
with low count data), as shown, for example, by \cite{ma2018} or
\cite{mason2000}.
Also, deep learning methods are not applicable as the feature matrices only
consist of several hundred to thousands of rows (cf., Sub-section
\ref{params}).
In \ref{tabular_ml_models}, we provide an alternative feature matrix
representation that exploits the two-dimensional structure of time tables
without decomposing the time series.
In \ref{enhanced_feats}, we show how feature matrices are extended
to include predictors other than historical order data.
However, to answer \textbf{Q5} already here, none of the external data sources
improves the results in our study.
Due to the high number of time series in our study, we must use an automated
approach to analyzing individual time series to investigate why no
external sources improve the forecasts.
\cite{barbour2014} provide a spectral density estimation approach, based on
the Shannon entropy, that measures the signal-to-noise ratio in a
database with a number normalized between $0$ and $1$, where lower values
indicate a higher signal-to-noise ratio.
We then look at averages of the estimates on a daily level per pixel and
find that including any of the external data sources from
\ref{enhanced_feats} always leads to significantly lower signal-to-noise
ratios.
Thus, we conclude that, at least for the demand faced by our industry partner,
the historical data contain all of the signal.
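
In essence, such an estimator is a normalized Shannon entropy of the power
spectral density; a sketch with SciPy's periodogram (our exact
implementation may differ in the density estimation):
\begin{verbatim}
import numpy as np
from scipy.signal import periodogram

def spectral_entropy(y):
    """Normalized Shannon entropy of the power spectral density.

    Values near 1 mean power is spread evenly across frequencies
    (noise); lower values mean a concentrated spectrum (signal).
    """
    _, psd = periodogram(np.asarray(y, dtype=float))
    psd = psd[psd > 0]
    p = psd / psd.sum()  # normalize the density into a distribution
    return float(-(p * np.log(p)).sum() / np.log(len(p)))
\end{verbatim}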