\subsection{Accuracy Measures} \label{mase}

Choosing an error measure for both model selection and evaluation is not straightforward when working with intermittent demand, as shown, for example, by \cite{syntetos2005}, and one should understand the trade-offs between measures. \cite{hyndman2006} study error measures on real-life data taken from the popular M3-competition and find that most standard measures degenerate under many scenarios. They also provide a classification scheme, whose main points we summarize as they apply to the UDP case:
\begin{enumerate}
  \item \textbf{Scale-dependent Errors}: The error is reported in the same unit as the raw data. Two popular examples are the root mean square error (RMSE) and the mean absolute error (MAE). They may be used for model selection and evaluation within a pixel and are intuitively interpretable; however, they may not be used to compare errors of, for example, a low-demand pixel (e.g., at the UDP's service boundary) with those of a high-demand pixel (e.g., downtown).
  \item \textbf{Percentage Errors}: The error is derived from the percentage errors of individual forecasts per time step and is also intuitively interpretable. A popular example is the mean absolute percentage error (MAPE), which is the primary measure in most forecasting competitions. Whereas such errors could be applied both within and across pixels, they cannot be calculated reliably for intermittent demand: if even a single time step exhibits no demand, a divide-by-zero error results. Due to the slicing, this occurs often, even in high-demand pixels.
  \item \textbf{Relative Errors}: A workaround is to calculate a scale-dependent error for the test day and divide it by the same measure calculated with forecasts of a simple benchmark method (e.g., the na\"{i}ve method); an example is $\text{RelMAE} = \text{MAE} / \text{MAE}_\text{bm}$. Nevertheless, even simple methods occasionally produce (near-)perfect forecasts, in which case $\text{MAE}_\text{bm}$ becomes (close to) $0$. These numerical instabilities occurred so often in our studies that we argue against using such measures (we illustrate both failure modes, for MAPE and RelMAE, in a short sketch below).
  \item \textbf{Scaled Errors}: \cite{hyndman2006} contribute this category and introduce the mean absolute scaled error (MASE). It is defined as the MAE of the actual forecasting method on the test day (i.e., ``out-of-sample'') divided by the MAE of the (seasonal) na\"{i}ve method on the entire training set (i.e., ``in-sample''). A MASE of $1$ indicates that a forecasting method is as accurate on the test day as the (seasonal) na\"{i}ve method applied over the longer training horizon, and lower values imply higher accuracy. Within a pixel, it yields the same ranking of methods as the MAE, as the denominator is constant for a given training set. We also acknowledge recent publications, for example, \cite{prestwich2014} or \cite{kim2016}, showing other ways of tackling the difficulties mentioned; however, only the MASE provided numerically stable results for all forecasts in our study.
\end{enumerate}
Consequently, we use the MASE with a seasonal na\"{i}ve benchmark as the primary measure in this paper.
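To make the degeneracies above concrete, the following minimal sketch (in Python; the toy series, forecasts, and benchmark values are made up solely for illustration and are not taken from our case study) reproduces the two failure modes: the MAPE breaks down as soon as a single time step has zero demand, and the RelMAE blows up whenever the benchmark happens to forecast (near-)perfectly.
\begin{verbatim}
# Minimal sketch: toy intermittent series illustrating why MAPE and
# RelMAE degenerate; all numbers below are made up for illustration.
import numpy as np

y_true = np.array([0, 2, 0, 0, 3, 1, 0, 4], dtype=float)  # actual demand
y_hat  = np.array([1, 2, 0, 1, 2, 1, 0, 3], dtype=float)  # some method's forecast
y_bm   = np.array([0, 2, 0, 0, 3, 1, 0, 4], dtype=float)  # benchmark; here perfect

mae = np.mean(np.abs(y_true - y_hat))                      # well-defined: 0.5

# MAPE: division by zero wherever y_true == 0
with np.errstate(divide="ignore", invalid="ignore"):
    mape = np.mean(np.abs((y_true - y_hat) / y_true))      # -> nan/inf

# RelMAE: the benchmark MAE is 0 here, so the ratio blows up
mae_bm = np.mean(np.abs(y_true - y_bm))
rel_mae = mae / mae_bm if mae_bm > 0 else np.inf           # -> inf

print(mae, mape, rel_mae)
\end{verbatim}
On this toy series, the MAE remains well-defined, whereas the MAPE evaluates to an undefined value and the RelMAE to infinity; the MASE introduced above avoids both issues.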
With the previously introduced notation, the MASE is defined as follows:
$$
\text{MASE} := \frac{\text{MAE}_{\text{out-of-sample}}}{\text{MAE}_{\text{in-sample}}}
             = \frac{\text{MAE}_{\text{forecasts}}}{\text{MAE}_{\text{training}}}
             = \frac{\frac{1}{H} \sum_{h=1}^H |y_{T+h} - \hat{y}_{T+h}|}
                    {\frac{1}{T-k} \sum_{t=k+1}^T |y_{t} - y_{t-k}|}
$$
The denominator can only become $0$ if the seasonal na\"{i}ve benchmark makes a perfect forecast on every day in the training set except the first seven days, which never happened in our case study involving hundreds of thousands of individual model trainings. Further, as per the discussion in the subsequent Section \ref{decomp}, we also calculate peak-MASEs, for which we leave out non-peak time steps from the calculations. For this analysis, we define all time steps that occur at lunch (i.e., noon to 2 pm) and dinner time (i.e., 6 pm to 8 pm) as peak (see the sketch at the end of this subsection). As non-peak time steps typically exhibit no or very low order counts, a UDP may choose not to forecast them actively at all and instead be interested in the accuracy of forecasting methods during peaks only. We conjecture that percentage error measures may become usable for UDPs facing a higher overall demand with no intra-day down-times in between, but have to leave that to a future study. Yet, even with high and steady demand, divide-by-zero errors are still likely to occur.
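As a closing illustration, the following minimal sketch computes the MASE exactly as defined above, together with a peak-MASE restricted to lunch and dinner time steps. The hourly slicing, the Poisson toy data, the constant forecast, and all names are assumptions made purely for this example and do not reproduce our implementation.
\begin{verbatim}
# Minimal sketch of the MASE defined above; data and names are illustrative.
import numpy as np

def mase(y_train, y_test, y_hat, k):
    """MAE of the H test-day forecasts divided by the in-sample MAE of
    the seasonal naive method with lag k on the training series."""
    mae_forecasts = np.mean(np.abs(y_test - y_hat))              # numerator
    mae_training  = np.mean(np.abs(y_train[k:] - y_train[:-k]))  # denominator
    return mae_forecasts / mae_training

# Example: hourly time steps, two weeks of training data, weekly seasonality
rng = np.random.default_rng(0)
steps_per_day, k = 24, 7 * 24       # seasonal lag: same hour one week earlier
y_train = rng.poisson(1.0, size=14 * steps_per_day).astype(float)
y_test  = rng.poisson(1.0, size=steps_per_day).astype(float)
y_hat   = np.full(steps_per_day, y_train[-steps_per_day:].mean())

print(mase(y_train, y_test, y_hat, k))

# Peak-MASE (one possible reading): keep only lunch (noon-2pm) and dinner
# (6pm-8pm) time steps of the test day before computing the numerator.
hours = np.arange(steps_per_day)
peak = ((hours >= 12) & (hours < 14)) | ((hours >= 18) & (hours < 20))
print(mase(y_train, y_test[peak], y_hat[peak], k))
\end{verbatim}
Whether the in-sample denominator should also be restricted to peak time steps is a design choice; the sketch keeps the full training series so that the scaling factor remains the same across the peak and non-peak evaluations.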