\subsection{Results by Model Families}
\label{fams}
\begin{center}
\captionof{table}{Ranking of benchmark and horizontal models
($1~\text{km}^2$ pixel size, 60-minute time steps):
the table shows the ranks for cases with $2.5 < ADD < 25$
(and $25 < ADD < \infty$ in parentheses if they differ)}
\label{t:hori}
\begin{tabular}{|c|ccc|cccccccc|}
\hline
\multirow{2}{*}{\rotatebox{90}{\thead{\scriptsize{Training weeks}}}}
& \multicolumn{3}{c|}{\thead{Benchmarks}}
& \multicolumn{8}{c|}{\thead{Horizontal (whole-day-ahead)}} \\
\cline{2-12}
~ & \textit{naive} & \textit{fnaive} & \textit{pnaive}
& \textit{harima} & \textit{hcroston} & \textit{hets} & \textit{hholt}
& \textit{hhwinters} & \textit{hses} & \textit{hsma} & \textit{htheta} \\
\hline \hline
3 & 11 & 7 (2) & 8 (5) & 5 (7) & 4 & 3
& 9 (10) & 10 (9) & 2 (6) & 1 & 6 (8) \\
4 & 11 & 7 (2) & 8 (3) & 5 (6) & 4 (5) & 3 (1)
& 9 (10) & 10 (9) & 2 (7) & 1 (4) & 6 (8) \\
5 & 11 & 7 (2) & 8 (4) & 5 (3) & 4 (9) & 3 (1)
& 9 (10) & 10 (5) & 2 (8) & 1 (6) & 6 (7) \\
6 & 11 & 8 (5) & 9 (6) & 5 (4) & 4 (7) & 2 (1)
& 10 & 7 (2) & 3 (8) & 1 (9) & 6 (3) \\
7 & 11 & 8 (5) & 10 (6) & 5 (4) & 4 (7) & 2 (1)
& 9 (10) & 7 (2) & 3 (8) & 1 (9) & 6 (3) \\
8 & 11 & 9 (5) & 10 (6) & 5 (4) & 4 (7) & 2 (1)
& 8 (10) & 7 (2) & 3 (8) & 1 (9) & 6 (3) \\
\hline
\end{tabular}
\end{center}
\
Besides the overall results, we provide an in-depth comparison of the models
within each family.
Instead of reporting the MASE per model, we rank the models while holding the
training horizon fixed, which makes the comparison easier.
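To make this concrete, the following sketch shows one way such per-horizon
ranks could be derived from the raw MASE values with \texttt{pandas}; the
column names (\texttt{model}, \texttt{training\_weeks}, \texttt{mase}) are
illustrative assumptions only.
\begin{verbatim}
import pandas as pd

def rank_models(mase: pd.DataFrame) -> pd.DataFrame:
    """Rank models within each training horizon by their average MASE.

    `mase` is assumed to hold one row per (model, training_weeks, pixel)
    with that pixel's MASE value in the column "mase".
    """
    return (
        mase.groupby(["training_weeks", "model"])["mase"]
        .mean()                           # average MASE per model ...
        .groupby(level="training_weeks")  # ... within a fixed horizon
        .rank(method="min")               # rank 1 = lowest average MASE
        .unstack("model")                 # horizons as rows, models as columns
    )
\end{verbatim}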
Table~\ref{t:hori} presents the models trained on horizontal time series.
In addition to \textit{naive}, we already include \textit{fnaive} and
\textit{pnaive} here as more competitive benchmarks.
The tables in this section report two rankings simultaneously:
The first number is the rank obtained when the low- and medium-demand clusters
are lumped together, which yields almost the same rankings as analyzing these
two clusters individually.
The ranks for high-demand pixels only are shown in parentheses whenever they
differ.
A first insight is that \textit{fnaive} is the best benchmark in all
scenarios:
Decomposing flexibly by tuning the $ns$ parameter is worth the computational
cost.
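For illustration, the following sketch outlines one way such a
decomposition-based naive forecast can be set up with STL from
\texttt{statsmodels}, where the seasonal smoothing window plays the role of
the tunable $ns$ parameter; this is a simplified sketch under these
assumptions, not necessarily the exact \textit{fnaive} specification.
\begin{verbatim}
import numpy as np
from statsmodels.tsa.seasonal import STL

def fnaive_forecast(series, period=7, ns=7, horizon=1):
    """Decomposition-based naive forecast of a seasonal demand series.

    `series` is a pandas Series, `period` the seasonal cycle length
    (e.g., 7 for a weekly pattern), and `ns` the (odd) seasonal
    smoothing window handed to STL.
    """
    decomposition = STL(series, period=period, seasonal=ns).fit()
    adjusted = series - decomposition.seasonal           # de-seasonalized
    last_cycle = decomposition.seasonal.iloc[-period:]   # latest cycle
    seasonal_future = np.resize(last_cycle, horizon)     # repeat forward
    return adjusted.iloc[-1] + seasonal_future           # level + seasonality
\end{verbatim}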
Further, if one can only employ a small number of non-na\"{i}ve methods,
\textit{hets} is the best compromise and works well across all demand levels.
For high demand, it is the best model independent of the training horizon.
With low or medium demand, \textit{hsma} is the clear overall winner; yet,
with high demand, models with a seasonal fit (i.e., \textit{harima},
\textit{hets}, and \textit{hhwinters}) are more accurate, in particular for
longer training horizons.
This is because the weekday demand patterns become more pronounced with
higher overall demand.
\begin{center}
\captionof{table}{Ranking of classical models on vertical time series
($1~\text{km}^2$ pixel size, 60-minute time steps):
the table shows the ranks for cases with $2.5 < ADD < 25$
(and $25 < ADD < \infty$ in parentheses if they differ)}
\label{t:vert}
\begin{tabular}{|c|cc|ccccc|ccccc|}
\hline
\multirow{2}{*}{\rotatebox{90}{\thead{\scriptsize{Training weeks}}}}
& \multicolumn{2}{c|}{\thead{Benchmarks}}
& \multicolumn{5}{c|}{\thead{Vertical (whole-day-ahead)}}
& \multicolumn{5}{c|}{\thead{Vertical (real-time)}} \\
\cline{2-13}
~ & \textit{hets} & \textit{hsma} & \textit{varima} & \textit{vets}
& \textit{vholt} & \textit{vses} & \textit{vtheta} & \textit{rtarima}
& \textit{rtets} & \textit{rtholt} & \textit{rtses} & \textit{rttheta} \\
\hline \hline
3 & 2 (10) & 1 (7) & 6 (4) & 8 (6) & 10 (9)
& 7 (5) & 11 (12) & 4 (1) & 5 (3) & 9 (8) & 3 (2) & 12 (11) \\
4 & 2 (8) & 1 (10) & 6 (4) & 8 (6) & 10 (9)
& 7 (5) & 12 (11) & 3 (1) & 5 (3) & 9 (7) & 4 (2) & 11 (12) \\
5 & 2 (3) & 1 (10) & 7 (5) & 8 (7) & 10 (9)
& 6 & 11 & 4 (1) & 5 (4) & 9 (8) & 3 (2) & 12 \\
6 & 2 (1) & 1 (10) & 6 (5) & 8 (7) & 10 (9)
& 7 (6) & 11 (12) & 3 (2) & 5 (4) & 9 (8) & 4 (3) & 12 (11) \\
7 & 2 (1) & 1 (10) & 8 (5) & 7 & 10 (9)
& 6 & 11 (12) & 5 (2) & 4 & 9 (8) & 3 & 12 (11) \\
8 & 2 (1) & 1 (9) & 8 (5) & 7 (6) & 10 (8)
& 6 & 12 (10) & 5 (2) & 4 & 9 (7) & 3 & 11 \\
\hline
\end{tabular}
\end{center}
\
Table~\ref{t:vert} extends the previous analysis to classical models trained
on vertical time series.
Now, the winners from before, \textit{hets} and \textit{hsma}, serve as
benchmarks.
Whereas no improvements over these benchmarks are obtained for low and medium
demand, \textit{rtarima} and \textit{rtses} are the most accurate models for
high demand and short training horizons.
For six or more training weeks, \textit{hets} is still optimal.
Independent of retraining and the demand level, the models' relative
performances are consistent:
The \textit{*arima} and \textit{*ses} models are best, followed by
\textit{*ets}, \textit{*holt}, and \textit{*theta}.
Thus, models that capture auto-correlations and short-term forecasting
errors, as expressed by moving-average terms, and that are not distracted by
trend terms are optimal for vertical series.
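One way to see this is the update rule of simple exponential smoothing with
smoothing parameter $\alpha$: each new forecast corrects the previous one by
a fraction of the latest one-step-ahead error,
\[
\hat{y}_{t+1 \mid t}
  = \alpha \, y_t + (1 - \alpha) \, \hat{y}_{t \mid t-1}
  = \hat{y}_{t \mid t-1} + \alpha \, e_t,
\qquad
e_t = y_t - \hat{y}_{t \mid t-1},
\]
which is precisely the error-correction mechanism of the moving-average term
in an ARIMA$(0,1,1)$ model; neither formulation includes an explicit trend
component.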
Finally, Table~\ref{t:ml} compares the two ML-based models against the
best-performing classical models and answers \textbf{Q2}:
Again, no improvements can be obtained for low and medium demand; however,
with high demand, \textit{vrfr} has the edge over \textit{rtarima} for
training horizons up to six weeks.
We conjecture that \textit{vrfr} fits auto-correlations better than
\textit{varima} and is not distracted by short-term noise, as
\textit{rtarima} may be due to its retraining.
With seven or eight training weeks, \textit{hets} remains the overall winner.
Interestingly, \textit{vsvr} is more accurate than \textit{vrfr} for low and
medium demand.
We assume that \textit{vrfr} performs well only in the presence of strong
auto-correlations, which are absent with low and medium demand.
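To illustrate the kind of model behind this conjecture, the sketch below
fits a random forest on lagged observations of a vertical series (i.e., the
demand of one pixel at a fixed time of day on consecutive days); the lag
construction and hyperparameters are illustrative assumptions, not the exact
\textit{vrfr} configuration.
\begin{verbatim}
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def vertical_rf_forecast(series, n_lags=7, n_estimators=100, seed=42):
    """One-step-ahead forecast from the last `n_lags` observations."""
    y = np.asarray(series, dtype=float)
    # Each row of X holds the n_lags values preceding its target in y.
    X = np.column_stack(
        [y[i:len(y) - n_lags + i] for i in range(n_lags)]
    )
    model = RandomForestRegressor(
        n_estimators=n_estimators, random_state=seed
    )
    model.fit(X, y[n_lags:])
    return model.predict(y[-n_lags:].reshape(1, -1))[0]
\end{verbatim}
A model of this form can pick up non-linear dependencies on the lagged values
without fitting an explicit trend, which matches the behavior conjectured
above.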
\begin{center}
\captionof{table}{Ranking of ML models on vertical time series
($1~\text{km}^2$ pixel size, 60-minute time steps):
the table shows the ranks for cases with $2.5 < ADD < 25$
(and $25 < ADD < \infty$ in parentheses if they differ)}
\label{t:ml}
\begin{tabular}{|c|cccc|cc|}
\hline
\multirow{2}{*}{\rotatebox{90}{\thead{\scriptsize{Training weeks}}}}
& \multicolumn{4}{c|}{\thead{Benchmarks}}
& \multicolumn{2}{c|}{\thead{ML}} \\
\cline{2-7}
~ & \textit{fnaive} & \textit{hets} & \textit{hsma}
& \textit{rtarima} & \textit{vrfr} & \textit{vsvr} \\
\hline \hline
3 & 6 & 2 (5) & 1 (4) & 3 (1) & 5 (2) & 4 (3) \\
4 & 6 (5) & 2 (4) & 1 (6) & 3 (2) & 5 (1) & 4 (3) \\
5 & 6 (5) & 2 (4) & 1 (6) & 4 (2) & 5 (1) & 3 \\
6 & 6 (5) & 2 & 1 (6) & 4 & 5 (1) & 3 \\
7 & 6 (5) & 2 (1) & 1 (6) & 4 & 5 (2) & 3 \\
8 & 6 (5) & 2 (1) & 1 (6) & 4 & 5 (2) & 3 \\
\hline
\end{tabular}
\end{center}
\
Analogously, we created tables like Tables~\ref{t:hori} to~\ref{t:ml} for
the forecasts with time steps of 90 and 120 minutes and found that the
relative rankings do not change significantly.
The same holds true for the rankings with different pixel sizes.
For conciseness, we do not include these additional tables in this article.
In summary, the relative performances of the model families are rather
stable in this case study.