\subsection{Unified Cross-Validation and Training, Validation, and Test Sets}
\label{unified_cv}
The standard $k$-fold CV, which assumes no structure in the individual
features of the samples, as shown in $\mat{X}$ above, is adapted to the
ordinal character of time series data:
A model must be evaluated on observations that occurred strictly after the
ones used for training as, otherwise, the model would have knowledge of
the future.
Furthermore, some models predict only one or a few time steps before being
retrained, while others predict an entire day without retraining
(cf. Sub-section \ref{ml_models}).
Consequently, we must use a unified time interval wherein all forecasts are
made before the entire interval is evaluated.
As whole days are the longest prediction interval for models without
retraining, we choose them as the unified time interval.
In summary, our CV methodology yields a distinct best model per pixel and
day to be forecast.
Whole days are also practical for managers, who commonly monitor, for
example, the routing and thus the forecasting performance on a day-to-day
basis.
Our methodology assumes that the models are trained at least once per day.
As we create operational forecasts into the near future in this paper,
retraining all models with the latest available data is a logical step.
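
For concreteness, the whole-day splitting rule can be sketched as follows;
this is a minimal Python illustration rather than our implementation, and
names such as \texttt{daily\_splits} are placeholders:

\begin{verbatim}
import datetime as dt

def daily_splits(first_test_day, last_test_day, train_days=21):
    """Yield (training window, test day) pairs for whole-day evaluation.

    The training window contains only days that lie strictly before the
    test day, so the model never sees the future during training.
    """
    day = first_test_day
    while day <= last_test_day:
        train_window = [day - dt.timedelta(days=offset)
                        for offset in range(train_days, 0, -1)]
        yield train_window, day
        day += dt.timedelta(days=1)

# Example: three consecutive test days, each trained on the 21 prior days.
for train_window, test_day in daily_splits(dt.date(2024, 3, 4),
                                           dt.date(2024, 3, 6)):
    assert max(train_window) < test_day
\end{verbatim}
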
\begin{center}
    \captionof{figure}{Training, validation, and test sets
                       during cross-validation}
    \label{f:cv}
    \includegraphics[width=.8\linewidth]{static/cross_validation_gray.png}
\end{center}
The training, validation, and test sets are defined as follows.
To exemplify the logic, we refer to Figure \ref{f:cv}, which shows the
calendar setup (i.e., weekdays on the x-axis) for three days $T_1$, $T_2$,
and $T_3$ (shown in dark gray) for which we generate forecasts.
Each of these days is, by definition, a test day, and the test set comprises
all time series, horizontal or vertical, whose last observation lies on
that day.
With an assumed training horizon of three weeks, the 21 days before each of
the test days constitute the corresponding training sets (shown in lighter
gray on the same rows as $T_1$, $T_2$, and $T_3$).
There are two kinds of validation sets, depending on the decision to be made.
First, if a forecasting method needs parameter tuning, the original training
set is divided into as many equally long series as validation days are
needed to find stable parameters.
The example shows three validation days per test day, named $V_n$ (shown
in darker gray below each test day).
The $21 - 3 = 18$ preceding days constitute the training set corresponding
to a validation day.
To obtain the overall validation error, the three errors are averaged.
We call these \textit{inner} validation sets because they must be repeated
each day to re-tune the parameters and because the involved time series
are true subsets of the original series.
Second, to find the best method per day and pixel, the same averaging logic
is applied on the outer level.
For example, if we used two validation days to find the best method for
$T_3$, we would average the errors of $T_1$ and $T_2$ for each method and
select the winner; then, $T_1$ and $T_2$ constitute an \textit{outer}
validation set.
Whereas the number of inner validation days is method-specific and must be
chosen before generating any test day forecasts, the number of outer
validation days may be varied after the fact and is determined empirically,
as we show in Section \ref{stu}.
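
The two-level averaging just described can be summarized in a short
sketch; the error values and method names below are purely illustrative,
and the helper functions (e.g., \texttt{select\_best\_method}) are our
placeholders, not the paper's implementation:

\begin{verbatim}
from statistics import mean

def inner_validation_error(errors_by_validation_day):
    """Average one method's errors over its inner validation days, e.g.,
    three errors, each obtained after fitting on the 18 preceding days."""
    return mean(errors_by_validation_day)

def select_best_method(errors_by_method):
    """Outer level: pick the method with the lowest average error over the
    outer validation days (earlier test days such as T_1 and T_2)."""
    return min(errors_by_method, key=lambda m: mean(errors_by_method[m]))

# Illustrative outer validation errors of two methods on T_1 and T_2.
outer_errors = {"method_a": [0.42, 0.47], "method_b": [0.40, 0.44]}
best = select_best_method(outer_errors)  # -> "method_b"
\end{verbatim}
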
Our unified CV approach is also optimized for large-scale production
settings, for example, at companies like Uber.
As \cite{bell2018} note, there is a trade-off as to when each of the
inner time series in the example begins.
While forecasting accuracy likely increases with more training days, which
would favor inner series of increasing length, cutting the series to the
same length allows caching the forecasts and errors.
In the example, $V_3$, $V_5$, and $V_7$, as well as $V_6$ and $V_8$, are
identical despite belonging to different inner validation sets.
Caching is also possible on the outer level when searching for an optimal
number of validation days for model selection.
We achieved cache hit ratios of up to 80\% in our implementation for the
empirical study, thereby saving computational resources by the same amount.
Lastly, we assert that our suggested CV, because it is unified around whole
test days and uses fixed-size time series, is also suitable for creating
consistent learning curves and, thus, for answering \textbf{Q3} on the
relationship between forecast accuracy and the amount of historic data:
We simply increase the length of the outer training set while holding the
test day fixed.
Thus, independent of a method's need for parameter tuning, all methods have
the same demand history available for each test day forecast.
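
To make the caching argument concrete: with fixed-length series, a forecast
is fully determined by the method, the pixel, and the calendar position of
the training window, so it can be memoized across inner and outer validation
sets. The following sketch assumes such a key layout and is not our actual
implementation:

\begin{verbatim}
forecast_cache = {}  # maps a cache key to a previously computed forecast

def cached_forecast(method, pixel, train_window, fit_and_predict):
    """Reuse a forecast if the same fixed-length series was already
    processed (e.g., V_3, V_5, and V_7 in Figure f:cv share one key);
    otherwise compute it once, store it, and return it."""
    key = (method, pixel, train_window[0], train_window[-1])
    if key not in forecast_cache:
        forecast_cache[key] = fit_and_predict(method, pixel, train_window)
    return forecast_cache[key]
\end{verbatim}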
|