\subsubsection{Supervised Learning}
\label{learning}
A conceptual difference between classical and ML methods is the format
for the model inputs.
In ML models, a time series $Y$ is interpreted as labeled data.
Labels are collected into a vector $\vec{y}$ while the corresponding
predictors are aligned in a $(T - n) \times n$ matrix $\mat{X}$:
$$
\vec{y}
=
\begin{pmatrix}
y_T \\
y_{T-1} \\
\vdots \\
y_{n+1}
\end{pmatrix}
\qquad
\mat{X}
=
\begin{bmatrix}
y_{T-1} & y_{T-2} & \dots & y_{T-n} \\
y_{T-2} & y_{T-3} & \dots & y_{T-(n+1)} \\
\vdots & \vdots & \ddots & \vdots \\
y_n & y_{n-1} & \dots & y_1
\end{bmatrix}
$$
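This arrangement can be sketched in a few lines of Python; the helper function \texttt{make\_supervised} is a hypothetical name introduced here purely for illustration:

```python
import numpy as np

def make_supervised(series, n):
    """Arrange a time series y_1, ..., y_T (given as a 0-indexed
    sequence) into a label vector of length T - n and a
    (T - n) x n lag matrix, ordered from the newest label y_T
    down to y_{n+1}, matching the definition in the text."""
    T = len(series)
    # Labels: y_T, y_{T-1}, ..., y_{n+1}.
    y = np.array([series[t] for t in range(T - 1, n - 1, -1)])
    # Each row holds the n lags y_{t-1}, ..., y_{t-n} of its label y_t.
    X = np.array([[series[t - j] for j in range(1, n + 1)]
                  for t in range(T - 1, n - 1, -1)])
    return y, X
```

For the toy series $1, 2, 3, 4, 5$ with $n = 2$, this yields $\vec{y} = (5, 4, 3)$ and the rows $(4, 3)$, $(3, 2)$, $(2, 1)$, exhibiting the shifted structure discussed below.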
The $m = T - n$ rows are referred to as samples and the $n$ columns as
features.
Each row in $\mat{X}$ is "labeled" by the corresponding entry in $\vec{y}$,
and ML models are trained to fit the rows to their labels.
Conceptually, we model a functional relationship $f$ between $\mat{X}$ and
$\vec{y}$ such that the difference between the predicted
$\vec{\hat{y}} = f(\mat{X})$ and the true $\vec{y}$ is minimized
according to some error measure $L(\vec{\hat{y}}, \vec{y})$, where $L$
summarizes the goodness of the fit into a scalar value (e.g., the
well-known mean squared error [MSE]; cf. Section \ref{mase}).
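As a minimal sketch, the MSE variant of $L$ reduces the elementwise prediction errors to a single scalar (the function name below is ours):

```python
import numpy as np

def mse(y_hat, y):
    """Mean squared error: summarizes how well the predictions
    y_hat fit the true labels y as one scalar value."""
    y_hat, y = np.asarray(y_hat, float), np.asarray(y, float)
    return float(np.mean((y_hat - y) ** 2))
```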
$\mat{X}$ and $\vec{y}$ reveal the ordinal character of time series data:
not only do the entries of $\mat{X}$ and $\vec{y}$ overlap, but the rows of
$\mat{X}$ are shifted versions of each other.
That does not hold for ML applications in general (e.g., the classical
example of classifying emails as spam vs. no spam, where the features model
properties of individual emails), and many of the common error measures
presented in introductory texts on ML are only applicable in cases
without such a structure in $\mat{X}$ and $\vec{y}$.
$n$, the number of past time steps required to predict a $y_t$, is an
exogenous model parameter.
For prediction, the forecaster supplies the trained ML model with an input
vector in the same format as a row $\vec{x}_i$ of $\mat{X}$.
For example, to predict $y_{T+1}$, the model takes the vector
$(y_T, y_{T-1}, \dots, y_{T-n+1})$ as input.
That is in contrast to the classical methods, where we only supply the number
of time steps to be predicted as a scalar integer.
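To make this workflow concrete, the sketch below fits a linear autoregression by ordinary least squares on $(\mat{X}, \vec{y})$ and then predicts $y_{T+1}$ from the $n$ most recent observations. The linear model and the toy series are merely illustrative stand-ins, not part of the methods discussed in this thesis:

```python
import numpy as np

# Illustrative only: a linear model stands in for an arbitrary ML method.
series = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # y_1, ..., y_T
n, T = 2, len(series)

# Label vector (y_T, ..., y_{n+1}) and lag matrix, as defined above.
y = series[::-1][: T - n]
X = np.array([[series[t - j] for j in range(1, n + 1)]
              for t in range(T - 1, n - 1, -1)])

# "Train" f by ordinary least squares: y is approximated by X @ coef.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predict y_{T+1} from the input vector (y_T, y_{T-1}, ..., y_{T-n+1}).
x_new = series[::-1][:n]
y_next = float(x_new @ coef)  # the series 1, 2, ..., 6 continues with 7
```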