\subsection{Unified Cross-Validation and Training, Validation, and Test Sets}
\label{unified_cv}

The standard $k$-fold CV, which assumes no structure in the individual features of the samples, as shown in $\mat{X}$ above, is adapted to the ordinal character of time series data: a model must be evaluated on observations that occurred strictly after the ones used for training because, otherwise, the model would know about the future. Furthermore, some models predict only a single or a few time steps before being retrained, while others predict an entire day without retraining (cf. Sub-section \ref{ml_models}). Consequently, we must use a unified time interval within which all forecasts are made before the entire interval is evaluated. As whole days are the longest prediction interval for models without retraining, we choose whole days as the unified time interval. In summary, our CV methodology yields a distinct best model per pixel and day to be forecast. Whole days are also practical for managers, who commonly monitor, for example, the routing and thus the forecasting performance on a day-to-day basis. Our methodology assumes that the models are trained at least once per day. As we create operational forecasts into the near future in this paper, retraining all models with the latest available data is a logical step.

\begin{center}
\captionof{figure}{Training, validation, and test sets during cross-validation}
\label{f:cv}
\includegraphics[width=.8\linewidth]{static/cross_validation_gray.png}
\end{center}

The training, validation, and test sets are defined as follows. To exemplify the logic, we refer to Figure \ref{f:cv}, which shows the calendar setup (i.e., weekdays on the x-axis) for three days $T_1$, $T_2$, and $T_3$ (shown in dark gray) for which we generate forecasts. Each of these days is, by definition, a test day, and the test set comprises all time series, horizontal or vertical, whose last observation lies on that day. With an assumed training horizon of three weeks, the 21 days before each test day constitute the corresponding training set (shown in lighter gray on the same rows as $T_1$, $T_2$, and $T_3$).

There are two kinds of validation sets, depending on the decision to be made. First, if a forecasting method needs parameter tuning, the original training set is divided into as many equally long series as validation days are needed to find stable parameters. The example shows three validation days per test day, named $V_n$ (shown in darker gray below each test day). The $21 - 3 = 18$ days preceding a validation day constitute its training set. To obtain the overall validation error, the three errors are averaged. We call these \textit{inner} validation sets because they must be repeated each day to re-tune the parameters and because the involved time series are true subsets of the original series. Second, to find the best method per day and pixel, the same averaging logic is applied on the outer level. For example, if we used two validation days to find the best method for $T_3$, we would average the errors of $T_1$ and $T_2$ for each method and select the winner; then, $T_1$ and $T_2$ constitute an \textit{outer} validation set. Whereas the number of inner validation days is method-specific and must be chosen before any test day forecasts are generated, the number of outer validation days may be varied after the fact and is determined empirically, as we show in Section \ref{stu}.
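To make the fold construction concrete, the following Python sketch builds the outer and inner splits for one test day. It is a minimal sketch assuming the 21-day training horizon and the three inner validation days from the example; the helper names (\texttt{outer\_split}, \texttt{inner\_splits}) are illustrative and not part of a published implementation.

\begin{verbatim}
from datetime import date, timedelta

TRAIN_DAYS = 21      # outer training horizon: three weeks, as in the example
INNER_VAL_DAYS = 3   # method-specific number of inner validation days

def outer_split(test_day, train_days=TRAIN_DAYS):
    """Return (training_days, test_day) for one outer fold.

    The training set comprises the `train_days` calendar days that
    immediately precede the test day.
    """
    training_days = [test_day - timedelta(days=d)
                     for d in range(train_days, 0, -1)]
    return training_days, test_day

def inner_splits(test_day, train_days=TRAIN_DAYS, val_days=INNER_VAL_DAYS):
    """Split the outer training horizon into (training, validation) folds.

    Each of the `val_days` days directly before the test day serves once
    as a validation day; the `train_days - val_days` days preceding it
    form the corresponding inner training set, so that all inner series
    have the same length.
    """
    folds = []
    for offset in range(val_days, 0, -1):
        val_day = test_day - timedelta(days=offset)
        inner_training_days = [val_day - timedelta(days=d)
                               for d in range(train_days - val_days, 0, -1)]
        folds.append((inner_training_days, val_day))
    return folds

# Example: folds for one test day; a method's errors on the three
# validation days are averaged per candidate parameterization.
outer_train, test_day = outer_split(date(2021, 3, 17))
for inner_train, val_day in inner_splits(test_day):
    print(val_day, "trained on", inner_train[0], "to", inner_train[-1])
\end{verbatim}

Tuning a method then amounts to evaluating each candidate parameterization on the three inner folds and keeping the one with the lowest average error, before the forecast for the test day itself is produced on the full 21-day training set.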
Our unified CV approach is also optimized for large-scale production settings, for example, at companies like Uber. As \cite{bell2018} note, there is a trade-off as to when each of the inner time series in the example begins: while forecasting accuracy likely increases with more training days, which would imply inner series of growing length, cutting all series to the same length allows caching the forecasts and errors. In the example, $V_3$, $V_5$, and $V_7$, as well as $V_6$ and $V_8$, are identical despite belonging to different inner validation sets. Caching is also possible on the outer level when searching for an optimal number of validation days for model selection. We achieved cache hit ratios of up to 80\% in our implementation in the empirical study, saving computational resources by the same proportion. Lastly, because our suggested CV is unified around whole test days and uses fixed-length time series, it is also suitable for creating consistent learning curves and, thus, for answering \textbf{Q3} on the relationship between forecast accuracy and the amount of historic data: we simply increase the length of the outer training set while holding the test day fixed. Thus, independent of a method's need for parameter tuning, all methods have the same demand history available for each test day forecast.
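The caching argument can be made concrete with a small memoization layer. The following is a minimal sketch assuming that a forecast is identified by the method, the pixel, and the (hashed) fixed-length training series; the names \texttt{ForecastCache} and \texttt{get\_or\_compute} are hypothetical and not taken from our implementation.

\begin{verbatim}
import hashlib

import numpy as np

def series_key(series):
    """Hash a fixed-length demand series so identical series share one key."""
    return hashlib.sha256(np.ascontiguousarray(series).tobytes()).hexdigest()

class ForecastCache:
    """Memoize forecasts per (method, pixel, training series).

    Because all inner training series are cut to the same length, identical
    keys recur across overlapping validation sets and across test days, so
    each forecast has to be computed only once.
    """

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, method_name, pixel, train_series, compute_forecast):
        key = (method_name, pixel, series_key(train_series))
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = compute_forecast(train_series)
        return self._store[key]

    @property
    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
\end{verbatim}

A forecast computed for, say, $V_5$ is then served from the cache when the identical series reappears as $V_7$ in another inner validation set, which is what drives the hit ratios reported above.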