\subsubsection{Support Vector Regression} \label{svm} \cite{vapnik1963} and \cite{vapnik1964} introduce the so-called support vector machine (SVM) model, and \cite{vapnik2013} summarizes the research conducted since then. In its basic form, the SVM is a linear binary classifier that fits a hyperplane into the feature space of $\mat{X}$ so as to maximize the margin around the hyperplane separating the two groups of labels. SVMs were popularized in the 1990s in the context of optical character recognition, as shown in \cite{scholkopf1998}. \cite{drucker1997} and \cite{stitson1999} adapt SVMs to the regression case, and \cite{smola2004} provide a comprehensive introduction to the technique. \cite{mueller1997} and \cite{mueller1999} focus on SVRs in the context of time series data and find that they tend to outperform classical methods. \cite{chen2006a} and \cite{chen2006b} apply SVRs to predict the hourly demand for water in cities, an application similar to the UDP case.

In the SVR case, a linear function $\hat{y}_i = f(\vec{x}_i) = \langle\vec{w},\vec{x}_i\rangle + b$ is fitted so that the actual labels $y_i$ deviate by at most $\epsilon$ from their predictions $\hat{y}_i$ (cf. the first two constraints below). SVRs are commonly formulated as quadratic optimization problems over the $m$ training samples: $$ \text{minimize } \frac{1}{2} \norm{\vec{w}}^2 + C \sum_{i=1}^m (\xi_i + \xi_i^*) \quad \text{subject to } \begin{cases} y_i - \langle \vec{w}, \vec{x}_i \rangle - b \leq \epsilon + \xi_i \text{,} \\ \langle \vec{w}, \vec{x}_i \rangle + b - y_i \leq \epsilon + \xi_i^* \text{,} \\ \xi_i, \xi_i^* \geq 0 \end{cases} $$ $\vec{w}$ are the fitted weights in the row space of $\mat{X}$, $b$ is a scalar bias term, and $\langle\cdot,\cdot\rangle$ denotes the dot product. Minimizing the norm of $\vec{w}$ keeps the fitted function flat and thus less prone to overfitting. To allow individual samples to lie outside the otherwise hard $\epsilon$ bounds, non-negative slack variables $\xi_i$ and $\xi_i^*$ are included. The non-negative parameter $C$ regulates how many samples may violate the $\epsilon$ bounds and by how much.

To model non-linear relationships, one could map the $\vec{x}_i$ from the row space of $\mat{X}$ into some higher-dimensional space via a function $\Phi(\cdot)$; however, as the optimization problem depends only on the dot product $\langle\cdot,\cdot\rangle$ and not on the actual entries of $\vec{x}_i$, it suffices to use a kernel function $k$ such that $k(\vec{x}_i,\vec{x}_j) = \langle\Phi(\vec{x}_i),\Phi(\vec{x}_j)\rangle$. Such kernels must fulfill certain mathematical properties (in particular, they must be positive semi-definite), and, besides polynomial kernels, radial basis functions with $k(\vec{x}_i,\vec{x}_j) = \exp(-\gamma \norm{\vec{x}_i - \vec{x}_j}^2)$ are a popular choice, where the parameter $\gamma$ controls how strongly the distance between any two samples influences the final model. SVRs work well with sparse data in high-dimensional spaces, such as intermittent demand data, as maximizing the error margin reduces the risk of misclassification or of predicting a value that is significantly far off, as also noted by \cite{bao2004}.
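To make the roles of $C$, $\epsilon$, and $\gamma$ concrete, the following minimal sketch fits an $\epsilon$-SVR with an RBF kernel using scikit-learn; the synthetic data and all parameter values are purely illustrative, not the configuration used in this work.

\begin{verbatim}
import numpy as np
from sklearn.svm import SVR

# Illustrative synthetic data: 200 samples with 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

# C bounds the penalty on the slack variables, epsilon sets the
# width of the insensitive tube, and gamma controls how quickly
# the RBF kernel decays with the distance between samples.
model = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=0.5)
model.fit(X, y)
y_hat = model.predict(X)
\end{verbatim}

In practice, $C$, $\epsilon$, and $\gamma$ are hyperparameters that would be tuned, for example via cross-validated grid search.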
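The kernel trick can be illustrated in the same setting: the sketch below (again with illustrative data and an assumed value for $\gamma$) evaluates the RBF kernel $k(\vec{x}_i,\vec{x}_j) = \exp(-\gamma \norm{\vec{x}_i - \vec{x}_j}^2)$ explicitly for all pairs of samples and hands the resulting Gram matrix to the solver, which therefore never materializes the mapped features $\Phi(\vec{x}_i)$.

\begin{verbatim}
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                  # illustrative data
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=100)

GAMMA = 0.5  # assumed kernel width, for illustration only

def rbf_kernel(A, B):
    # k(a, b) = exp(-gamma * ||a - b||^2) for every pair of rows.
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-GAMMA * sq_dists)

# Passing a callable makes the solver work purely on the pairwise
# kernel evaluations rbf_kernel(X, X).
model = SVR(kernel=rbf_kernel, C=1.0, epsilon=0.1)
model.fit(X, y)
\end{verbatim}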