\subsubsection{Autoregressive Integrated Moving Averages} \label{arima}

\cite{box1962}, \cite{box1968}, and further papers by the same authors in the 1960s introduce a class of models for stationary time series in which observations correlate with their neighbors, and refer to them as autoregressive integrated moving average (ARIMA) models. For a thorough overview, we refer to \cite{box2015} and \cite{brockwell2016}. A time series $y_t$ is stationary if its moments are independent of the point in time at which it is observed. A typical example is a white noise series $\epsilon_t$. Conversely, a trend or seasonality implies non-stationarity. \cite{kwiatkowski1992} provide a test of the null hypothesis that the data are stationary.

To obtain a stationary time series, one chooses from several techniques. First, to stabilize a changing variance (i.e., heteroscedasticity), one applies a Box-Cox transformation (e.g., $\log$), as first suggested by \cite{box1964}. Second, to factor out a trend (or a seasonal pattern), one computes differences of consecutive observations (or of observations at lag $k$), or even differences thereof. Third, it is also common to pre-process $y_t$ with one of the decomposition methods described in Sub-section \ref{stl} below and to train an ARIMA model on the adjusted $y_t$.

In the autoregressive part, observations are modeled as linear combinations of their predecessors. Formally, an $AR(p)$ model is defined with a drift term $c$, coefficients $\phi_i$ to be estimated (where $i$ is an index with $0 < i \leq p$), and white noise $\epsilon_t$:
\[
AR(p): \quad y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \epsilon_t.
\]
The moving average part models observations as linear combinations of past forecast errors. Formally, an $MA(q)$ model is defined with a drift term $c$, coefficients $\theta_j$ to be estimated (where $j$ is an index with $0 < j \leq q$), and white noise terms $\epsilon_t$:
\[
MA(q): \quad y_t = c + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \dots + \theta_q \epsilon_{t-q}.
\]
Finally, an $ARIMA(p,d,q)$ model unifies both parts and adds differencing, where $d$ is the degree of differencing and the $'$ indicates differenced values:
\[
ARIMA(p,d,q): \quad y'_t = c + \phi_1 y'_{t-1} + \dots + \phi_p y'_{t-p} + \theta_1 \epsilon_{t-1} + \dots + \theta_q \epsilon_{t-q} + \epsilon_t.
\]

$ARIMA(p,d,q)$ models are commonly fitted with maximum likelihood estimation. To find an optimal combination of the parameters $p$, $d$, and $q$, the literature suggests minimizing an information-theoretic criterion (e.g., Akaike's Information Criterion) that evaluates the fit on historical data. \cite{hyndman2008a} provide a step-wise heuristic to choose $p$, $d$, and $q$ that also decides whether a Box-Cox transformation is to be applied and, if so, which one.

To obtain a one-step-ahead forecast, the above equation is rewritten with $t$ substituted by $T+1$, where $T$ denotes the time of the last observation. For forecasts further into the future, the actual observations are subsequently replaced by their forecasts. Seasonal ARIMA variants exist; however, the high frequency $k$ of the demand a UDP faces typically renders them impractical, as too many coefficients would have to be estimated.
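To make the forecast recursion concrete, consider an $ARIMA(1,1,1)$ model. The unknown future error $\epsilon_{T+1}$ is replaced by its expected value of zero, and the differencing is undone by adding back the last observation:
\[
\hat{y}'_{T+1} = c + \phi_1 y'_T + \theta_1 \epsilon_T, \qquad \hat{y}_{T+1} = y_T + \hat{y}'_{T+1}.
\]
For the two-step-ahead forecast, $y'_{T+1}$ is in turn replaced by $\hat{y}'_{T+1}$ and $\epsilon_{T+1}$ by zero.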
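For illustration, the following sketch shows how the stationarity check and the differencing step could be carried out in Python with the \texttt{kpss} test from the \texttt{statsmodels} library; the helper name \texttt{difference\_until\_stationary} and its default settings are our own assumptions, not part of the cited works.

\begin{verbatim}
import numpy as np
from statsmodels.tsa.stattools import kpss

def difference_until_stationary(y, alpha=0.05, max_d=2):
    # Difference y until the KPSS test no longer rejects its null
    # hypothesis of stationarity, or until max_d is reached.
    d = 0
    while d < max_d:
        _, p_value, *_ = kpss(y, regression="c", nlags="auto")
        if p_value > alpha:   # stationarity not rejected
            break
        y = np.diff(y)        # y'_t = y_t - y_{t-1}
        d += 1
    return y, d
\end{verbatim}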
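Similarly, a brute-force stand-in for the step-wise heuristic of \cite{hyndman2008a} can be sketched as a grid search that fits each candidate order by maximum likelihood and keeps the fit with the lowest Akaike Information Criterion; the function name \texttt{select\_arima} and the search bounds are again illustrative assumptions.

\begin{verbatim}
from itertools import product
from statsmodels.tsa.arima.model import ARIMA

def select_arima(y, d, max_p=3, max_q=3):
    # Fit every (p, d, q) combination by maximum likelihood and
    # keep the result with the lowest AIC.
    best = None
    for p, q in product(range(max_p + 1), range(max_q + 1)):
        res = ARIMA(y, order=(p, d, q)).fit()
        if best is None or res.aic < best.aic:
            best = res
    return best
\end{verbatim}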
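A hypothetical end-to-end usage, with a simulated random walk standing in for actual demand data, then reads:

\begin{verbatim}
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=200))  # random walk; stationary after d=1
_, d = difference_until_stationary(y)
res = select_arima(y, d=d)           # differencing is handled internally
print(res.forecast(steps=5))         # recursive multi-step forecasts
\end{verbatim}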