\subsection{Unified Cross-Validation and Training, Validation, and Test Sets}
\label{unified_cv}
The standard $k$-fold CV, which assumes no structure in the individual
features of the samples, as shown in $\mat{X}$ above, is adapted to the
ordinal character of time series data:
A model must be evaluated on observations that occurred strictly after the
ones used for training because, otherwise, the model would already know about
the future.
Furthermore, some models predict only one or a few time steps before being
retrained, while others predict an entire day without retraining
(cf., Sub-section \ref{ml_models}).
Consequently, we must use a unified time interval within which all forecasts
are made before the entire interval is evaluated.
As whole days are the longest prediction interval for models without
retraining, we choose that as the unified time interval.
In summary, our CV methodology yields a distinct best model per pixel and day
to be forecast.
Whole days are also practical for managers who commonly monitor, for example,
the routing and thus the forecasting performance on a day-to-day basis.
Our methodology assumes that the models are trained at least once per day.
As we create operational forecasts into the near future in this paper,
retraining all models with the latest available data is a logical step.
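For concreteness, the resulting day-based splitting can be sketched as
follows; this is a minimal Python illustration with placeholder dates and
identifier names of our choosing, not the implementation used in the
empirical study.
\begin{verbatim}
import datetime as dt

def daily_splits(first_test_day, last_test_day, train_days=21):
    """Yield (training window, test day) pairs for day-based CV.

    Every training day lies strictly before its test day, so no
    information about the future leaks into model fitting.
    """
    day = first_test_day
    while day <= last_test_day:
        train_start = day - dt.timedelta(days=train_days)
        train_end = day - dt.timedelta(days=1)
        yield (train_start, train_end), day
        day += dt.timedelta(days=1)

# Example with three consecutive test days (placeholder dates):
for window, test_day in daily_splits(dt.date(2021, 3, 1),
                                     dt.date(2021, 3, 3)):
    print(window, "->", test_day)
\end{verbatim}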
\begin{center}
\captionof{figure}{Training, validation, and test sets
during cross validation}
\label{f:cv}
\includegraphics[width=.8\linewidth]{static/cross_validation_gray.png}
\end{center}
The training, validation, and test sets are defined as follows.
To exemplify the logic, we refer to Figure \ref{f:cv} that shows the calendar
setup (i.e., weekdays on the x-axis) for three days $T_1$, $T_2$, and
$T_3$ (shown in dark gray) for which we generate forecasts.
Each of these days is, by definition, a test day, and the test set comprises
all time series, horizontal or vertical, whose last observation lies on
that day.
With an assumed training horizon of three weeks, the 21 days before each of
the test days constitute the corresponding training sets (shown in lighter
gray on the same rows as $T_1$, $T_2$, and $T_3$).
There are two kinds of validation sets, depending on the decision to be made.
First, if a forecasting method needs parameter tuning, the original training
set is divided into as many equally long series as there are validation days
needed to find stable parameters.
The example shows three validation days per test day named $V_n$ (shown
in darker gray below each test day).
The $21 - 3 = 18$ preceding days constitute the training set corresponding to
a validation day.
To obtain the overall validation error, the three errors are averaged.
We call these \textit{inner} validation sets because they must be repeated
each day to re-tune the parameters and because the involved time series
are true subsets of the original series.
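To make the inner splits concrete, the following minimal Python sketch
derives them for a single test day; it assumes the 21-day horizon and three
validation days from the example and, as suggested by Figure \ref{f:cv},
places the validation days immediately before the test day.
It is an illustration, not our implementation.
\begin{verbatim}
import datetime as dt

def inner_splits(test_day, train_days=21, n_val=3):
    """Return (inner training window, validation day) pairs.

    The validation days are assumed to be the n_val days right
    before the test day; each gets the preceding
    train_days - n_val (= 18) days as its inner training set.
    The overall inner validation error is the average over the
    n_val validation days.
    """
    splits = []
    for offset in range(n_val, 0, -1):
        val_day = test_day - dt.timedelta(days=offset)
        train_end = val_day - dt.timedelta(days=1)
        train_start = val_day - dt.timedelta(days=train_days - n_val)
        splits.append(((train_start, train_end), val_day))
    return splits
\end{verbatim}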
Second, to find the best method per day and pixel, the same averaging logic
is applied on the outer level.
For example, if we used two validation days to find the best method for $T_3$,
we would average the errors of $T_1$ and $T_2$ for each method and select
the winner; then, $T_1$ and $T_2$ constitute an \textit{outer} validation
set.
Whereas the number of inner validation days is method-specific and must be
chosen before generating any test day forecasts in the first place, the
number of outer validation days may be varied after the fact and is
determined empirically as we show in Section \ref{stu}.
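The outer selection can be sketched analogously; the function below assumes
a lookup table of per-day errors per method, with all names chosen purely
for illustration.
\begin{verbatim}
def select_best_method(methods, errors, outer_val_days):
    """Pick the method with the lowest average error over the outer
    validation days (e.g., over T1 and T2 when T3 is the test day).

    errors[(method, day)] holds the error of `method` on `day`.
    """
    def avg_error(method):
        total = sum(errors[(method, day)] for day in outer_val_days)
        return total / len(outer_val_days)

    return min(methods, key=avg_error)
\end{verbatim}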
Our unified CV approach is also optimized for large-scale production settings,
for example, at companies like Uber.
As \cite{bell2018} note, there is a trade-off as to when each of the
inner time series in the example begins.
While the forecasting accuracy likely increases with more training days,
which would favor inner series of increasing length, cutting all series
to the same length allows caching the forecasts and errors.
In the example, $V_3$, $V_5$, and $V_7$, as well as $V_6$ and $V_8$ are
identical despite belonging to different inner validation sets.
Caching is also possible on the outer level when searching for an optimal
number of validation days for model selection.
We achieved cache hit ratios of up to 80\% in our implementation for the
empirical study, thereby saving a corresponding share of the computational
resources.
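The caching itself requires no more than a lookup keyed by method, pixel,
and the start and end of the fixed-length series, as the following sketch
illustrates; \texttt{fit\_and\_predict} is a hypothetical placeholder rather
than a function from our code base.
\begin{verbatim}
from functools import lru_cache

def fit_and_predict(method, pixel_id, series_start, series_end):
    ...  # placeholder: fit `method` on the series and forecast it

@lru_cache(maxsize=None)
def cached_forecast(method, pixel_id, series_start, series_end):
    """Compute each distinct fixed-length series only once.

    Identical inner series, such as V3, V5, and V7 in the example,
    map to the same key and therefore hit the cache instead of
    being refit.
    """
    return fit_and_predict(method, pixel_id, series_start, series_end)
\end{verbatim}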
Lastly, we assert that our suggested CV, because it is unified around whole
test days and uses fixed-size time series, is also suitable for creating
consistent learning curves and, thus, for answering \textbf{Q3} on the
relationship between forecast accuracy and the amount of historic data:
We simply increase the length of the outer training set while holding the
test day fixed.
Thus, independent of a method's need for parameter tuning, all methods have
the same demand history available for each test day forecast.
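In terms of the sketches above, a learning curve then amounts to re-running
the outer split with a growing training horizon while the test day stays
fixed; the horizons below are illustrative only.
\begin{verbatim}
import datetime as dt

def learning_curve_windows(test_day, horizons=(7, 14, 21, 28)):
    """Yield one training window per horizon, each ending on the
    day before the fixed test day."""
    for train_days in horizons:
        train_start = test_day - dt.timedelta(days=train_days)
        train_end = test_day - dt.timedelta(days=1)
        yield train_days, (train_start, train_end)
\end{verbatim}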