\subsection{Accuracy Measures}
\label{mase}
Choosing an error measure for both model selection and evaluation is not
straightforward when working with intermittent demand, as shown, for
example, by \cite{syntetos2005}, and one should understand the trade-offs
between measures.
\cite{hyndman2006} study error measures on real-life data from the popular
M3-competition and find that most standard measures degenerate under many
scenarios.
They also provide a classification scheme, whose main points we summarize as
they apply to the UDP case:
\begin{enumerate}
\item \textbf{Scale-dependent Errors}:
The error is reported in the same unit as the raw data.
Two popular examples are the root mean square error (RMSE) and mean absolute
error (MAE).
They may be used for model selection and evaluation within a pixel and are
intuitively interpretable; however, they may not be used to compare the
errors of, for example, a low-demand pixel (e.g., at the UDP's service
boundary) with those of a high-demand pixel (e.g., downtown).
\item \textbf{Percentage Errors}:
The error is derived from the percentage errors of individual forecasts per
time step, and is also intuitively interpretable.
A popular example is the mean absolute percentage error (MAPE), which is the
primary measure in most forecasting competitions.
Whereas such errors could be applied both within and across pixels, they
cannot be calculated reliably for intermittent demand:
if even a single time step exhibits no demand, a divide-by-zero error results
(see the sketch following this list).
Due to the slicing, this occurs frequently even in high-demand pixels.
\item \textbf{Relative Errors}:
A workaround is to calculate a scale-dependent error for the test day and
divide it by the same measure calculated with forecasts of a simple
benchmark method (e.g., na\"{i}ve method).
An example could be
$\text{RelMAE} = \text{MAE} / \text{MAE}_\text{bm}$.
Nevertheless, even simple benchmark methods occasionally produce
(near-)perfect forecasts, in which case $\text{MAE}_\text{bm}$ becomes
(close to) $0$, as also illustrated in the sketch following this list.
These numerical instabilities occurred so often in our studies that we argue
against using such measures.
\item \textbf{Scaled Errors}:
\cite{hyndman2006} contribute this category and introduce the mean absolute
scaled error (MASE).
It is defined as the MAE from the actual forecasting method on the test day
(i.e., ``out-of-sample'') divided by the MAE from the (seasonal) na\"{i}ve
method on the entire training set (i.e., ``in-sample'').
A MASE of $1$ indicates that a forecasting method has the same accuracy
on the test day as the (seasonal) na\"{i}ve method applied on a longer
horizon, and lower values imply higher accuracy.
Within a pixel, model selection with the MASE yields the same ranking as with
the MAE, as the scaling denominator is identical for all methods.
We also acknowledge recent publications, for example, \cite{prestwich2014}
and \cite{kim2016}, that show other ways of tackling the difficulties
mentioned above.
However, only the MASE provided numerically stable results for all forecasts
in our study.
\end{enumerate}
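To make these failure modes concrete, the following sketch (not part of our
pipeline; it assumes plain NumPy and a hypothetical toy series) shows how the
MAPE degenerates as soon as a single time step has no demand, and how the
RelMAE degenerates once the benchmark's MAE approaches $0$:
\begin{verbatim}
# Sketch with hypothetical toy data; assumes only NumPy.
import numpy as np

actuals   = np.array([0, 3, 0, 5, 2, 0, 4])  # intermittent demand
forecasts = np.array([1, 2, 1, 4, 2, 1, 3])
benchmark = np.array([0, 3, 0, 5, 2, 0, 4])  # happens to be perfect

mae = np.mean(np.abs(actuals - forecasts))

# MAPE divides by the actuals: every zero-demand time step
# contributes an infinite term.
with np.errstate(divide="ignore"):
    mape = np.mean(np.abs((actuals - forecasts) / actuals))  # inf

# RelMAE divides by the benchmark's MAE: a (near-)perfect
# benchmark yields a (near-)zero denominator.
mae_bm = np.mean(np.abs(actuals - benchmark))  # 0.0
rel_mae = mae / mae_bm if mae_bm > 0 else float("inf")
\end{verbatim}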
Consequently, we use the MASE with a seasonal na\"{i}ve benchmark as the
primary measure in this paper.
With the previously introduced notation, it is defined as follows:
$$
\text{MASE}
:=
\frac{\text{MAE}_{\text{out-of-sample}}}{\text{MAE}_{\text{in-sample}}}
=
\frac{\text{MAE}_{\text{forecasts}}}{\text{MAE}_{\text{training}}}
=
\frac{\frac{1}{H} \sum_{h=1}^H |y_{T+h} - \hat{y}_{T+h}|}
{\frac{1}{T-k} \sum_{t=k+1}^T |y_{t} - y_{t-k}|}
$$
The denominator can only become $0$ if the seasonal na\"{i}ve benchmark
forecasts every time step in the training set perfectly, except for the first
seven days, which have no seasonal predecessor; this never happened in our
case study involving hundreds of thousands of individual model trainings.
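For clarity, a minimal sketch of this calculation (assuming NumPy arrays
\texttt{y\_train} of length $T$, \texttt{y\_test} and \texttt{y\_hat} of
length $H$, and the seasonal lag \texttt{k}; the names are ours and merely
illustrative) could look as follows:
\begin{verbatim}
# Minimal MASE sketch following the definition above; assumes NumPy.
# y_train has length T, y_test and y_hat have length H, and k is the
# seasonal lag (e.g., the number of time steps per week).
import numpy as np

def mase(y_train, y_test, y_hat, k):
    # Out-of-sample MAE of the forecasting method on the test day.
    mae_out = np.mean(np.abs(y_test - y_hat))
    # In-sample MAE of the seasonal naive method on the training set;
    # the first k time steps have no seasonal predecessor.
    mae_in = np.mean(np.abs(y_train[k:] - y_train[:-k]))
    return mae_out / mae_in
\end{verbatim}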
Further, as discussed in the subsequent Section \ref{decomp}, we also
calculate peak-MASEs, for which we exclude all non-peak time steps from the
calculation.
For this analysis, we define all time steps during lunch (i.e., noon to
2 pm) and dinner (i.e., 6 pm to 8 pm) as peak times.
As non-peak time steps typically exhibit no or very low order counts, a UDP
may choose not to forecast them at all and instead be interested in the
accuracies of forecasting methods during peaks only.
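One possible way to restrict the calculation to peak time steps is sketched
below (assuming a pandas Series indexed by timestamps; the helper's name is
ours and merely illustrative); the peak-MASE is then the MASE from above
calculated on the filtered actuals and forecasts only:
\begin{verbatim}
# Sketch of a peak-time filter; assumes a pandas Series indexed by
# timestamps. Only lunch (noon - 2 pm) and dinner (6 pm - 8 pm) time
# steps are kept.
import pandas as pd

def keep_peaks(series: pd.Series) -> pd.Series:
    hours = series.index.hour
    is_lunch = (hours >= 12) & (hours < 14)
    is_dinner = (hours >= 18) & (hours < 20)
    return series[is_lunch | is_dinner]
\end{verbatim}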
We conjecture that percentage error measures may become usable for UDPs
facing higher overall demand without intra-day down-times, but we must leave
that to a future study.
Yet, even with high and steady demand, divide-by-zero errors are likely to
occur.