\subsubsection{Supervised Learning}
\label{learning}
A conceptual difference between classical and ML methods is the format
for the model inputs.
In ML models, a time series $Y$ is interpreted as labeled data.
Labels are collected into a vector $\vec{y}$ while the corresponding
predictors are aligned in a $(T - n) \times n$ matrix $\mat{X}$:
$$
\vec{y}
=
\begin{pmatrix}
y_T \\
y_{T-1} \\
\vdots \\
y_{n+1}
\end{pmatrix}
\qquad
\mat{X}
=
\begin{bmatrix}
y_{T-1} & y_{T-2} & \dots & y_{T-n} \\
y_{T-2} & y_{T-3} & \dots & y_{T-(n+1)} \\
\vdots & \vdots & \ddots & \vdots \\
y_n & y_{n-1} & \dots & y_1
\end{bmatrix}
$$
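This arrangement can be sketched in a few lines of Python; the helper function \texttt{make\_supervised} is a hypothetical name introduced here purely for illustration:

```python
import numpy as np

def make_supervised(series, n):
    """Arrange a time series y_1, ..., y_T (given as a 0-indexed
    sequence) into a label vector of length T - n and a
    (T - n) x n lag matrix, ordered from the newest label y_T
    down to y_{n+1}, matching the definition in the text."""
    T = len(series)
    # Labels: y_T, y_{T-1}, ..., y_{n+1}.
    y = np.array([series[t] for t in range(T - 1, n - 1, -1)])
    # Each row holds the n lags y_{t-1}, ..., y_{t-n} of its label y_t.
    X = np.array([[series[t - j] for j in range(1, n + 1)]
                  for t in range(T - 1, n - 1, -1)])
    return y, X
```

For the toy series $1, 2, 3, 4, 5$ with $n = 2$, this yields $\vec{y} = (5, 4, 3)$ and the rows $(4, 3)$, $(3, 2)$, $(2, 1)$, exhibiting the shifted structure discussed below.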
The $m = T - n$ rows are referred to as samples and the $n$ columns as
features.
Each row in $\mat{X}$ is "labeled" by the corresponding entry in $\vec{y}$,
and ML models are trained to fit the rows to their labels.
Conceptually, we model a functional relationship $f$ between $\mat{X}$ and
$\vec{y}$ such that the difference between the predicted
$\vec{\hat{y}} = f(\mat{X})$ and the true $\vec{y}$ is minimized
according to some error measure $L(\vec{\hat{y}}, \vec{y})$, where $L$
summarizes the goodness of the fit into a scalar value (e.g., the
well-known mean squared error [MSE]; cf. Section \ref{mase}).
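As a minimal sketch, the MSE variant of $L$ reduces the elementwise prediction errors to a single scalar (the function name below is ours):

```python
import numpy as np

def mse(y_hat, y):
    """Mean squared error: summarizes how well the predictions
    y_hat fit the true labels y as one scalar value."""
    y_hat, y = np.asarray(y_hat, float), np.asarray(y, float)
    return float(np.mean((y_hat - y) ** 2))
```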
$\mat{X}$ and $\vec{y}$ reveal the ordinal character of time series data:
not only do the entries of $\mat{X}$ and $\vec{y}$ overlap, but the rows of
$\mat{X}$ are shifted versions of each other.
That does not hold for ML applications in general (e.g., the classical
example of classifying emails as spam vs. no spam, where the features model
properties of individual emails), and many of the common error measures
presented in introductory texts on ML are only applicable in cases
without such a structure in $\mat{X}$ and $\vec{y}$.
$n$, the number of past time steps required to predict a $y_t$, is an
exogenous model parameter.
For prediction, the forecaster supplies the trained ML model with an input
vector in the same format as a row $\vec{x}_i$ of $\mat{X}$.
For example, to predict $y_{T+1}$, the model takes the vector
$(y_T, y_{T-1}, \dots, y_{T-n+1})$ as input.
That is in contrast to the classical methods, where we only supply the number
of time steps to be predicted as a scalar integer.
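To make this workflow concrete, the sketch below fits a linear autoregression by ordinary least squares on $(\mat{X}, \vec{y})$ and then predicts $y_{T+1}$ from the $n$ most recent observations. The linear model and the toy series are merely illustrative stand-ins, not part of the methods discussed in this thesis:

```python
import numpy as np

# Illustrative only: a linear model stands in for an arbitrary ML method.
series = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # y_1, ..., y_T
n, T = 2, len(series)

# Label vector (y_T, ..., y_{n+1}) and lag matrix, as defined above.
y = series[::-1][: T - n]
X = np.array([[series[t - j] for j in range(1, n + 1)]
              for t in range(T - 1, n - 1, -1)])

# "Train" f by ordinary least squares: y is approximated by X @ coef.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predict y_{T+1} from the input vector (y_T, y_{T-1}, ..., y_{T-n+1}).
x_new = series[::-1][:n]
y_next = float(x_new @ coef)  # the series 1, 2, ..., 6 continues with 7
```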