diff --git a/paper.pdf b/paper.pdf
index 2ffb022..7167219 100644
Binary files a/paper.pdf and b/paper.pdf differ
diff --git a/paper.tex b/paper.tex
index d445c71..04a42f1 100644
--- a/paper.tex
+++ b/paper.tex
@@ -9,6 +9,15 @@
 \input{tex/1_intro}
 \input{tex/2_lit/1_intro}
+\input{tex/2_lit/2_class/1_intro}
+\input{tex/2_lit/2_class/2_ets}
+\input{tex/2_lit/2_class/3_arima}
+\input{tex/2_lit/2_class/4_stl}
+\input{tex/2_lit/3_ml/1_intro}
+\input{tex/2_lit/3_ml/2_learning}
+\input{tex/2_lit/3_ml/3_cv}
+\input{tex/2_lit/3_ml/4_rf}
+\input{tex/2_lit/3_ml/5_svm}
 \input{tex/3_mod/1_intro}
 \input{tex/4_stu/1_intro}
 \input{tex/5_con/1_intro}
diff --git a/tex/2_lit/1_intro.tex b/tex/2_lit/1_intro.tex
index e875a28..f28e145 100644
--- a/tex/2_lit/1_intro.tex
+++ b/tex/2_lit/1_intro.tex
@@ -1,2 +1,17 @@
 \section{Literature Review}
-\label{lit}
\ No newline at end of file
+\label{lit}
+
+In this section, we review the specific forecasting methods that make up our
+    forecasting system.
+We group them into classical statistical and ML models.
+The two groups differ mainly in how they represent the input data and how
+    accuracy is evaluated.
+
+A time series is a finite and ordered sequence of equally spaced observations.
+Thus, time is regarded as discrete and a time step as a short period.
+Formally, a time series $Y$ is defined as $Y = \{y_t: t \in I\}$, or $y_t$ for
+    short, where $I$ is an index set of positive integers.
+Besides its length $T = |Y|$, another property is the a priori fixed and
+    non-negative periodicity $k$ of a seasonal pattern in demand:
+$k$ is the number of time steps after which a pattern repeats itself (e.g.,
+    $k=12$ for monthly sales data).
diff --git a/tex/2_lit/2_class/1_intro.tex b/tex/2_lit/2_class/1_intro.tex
new file mode 100644
index 0000000..e296160
--- /dev/null
+++ b/tex/2_lit/2_class/1_intro.tex
@@ -0,0 +1,13 @@
+\subsection{Demand Forecasting with Classical Forecasting Methods}
+\label{class_methods}
+
+Forecasting became a formal discipline starting in the 1950s and has its
+    origins in the broader field of statistics.
+\cite{hyndman2018} provide a thorough overview of the concepts and methods
+    established, and \cite{ord2017} describe business-related applications
+    such as demand forecasting.
+These "classical" forecasting methods share the characteristic that they are
+    trained on the entire $Y$ first.
+Then, for prediction, the forecaster specifies the number of time steps for
+    which forecasts are to be generated.
+This is different for ML models.
diff --git a/tex/2_lit/2_class/2_ets.tex b/tex/2_lit/2_class/2_ets.tex
new file mode 100644
index 0000000..6db9781
--- /dev/null
+++ b/tex/2_lit/2_class/2_ets.tex
@@ -0,0 +1,78 @@
+\subsubsection{Na\"{i}ve Methods, Moving Averages, and Exponential Smoothing.}
+\label{ets}
+
+Simple forecasting methods are often employed as a benchmark for more
+    sophisticated ones.
+The so-called na\"{i}ve and seasonal na\"{i}ve methods forecast the next time
+    step in a time series, $y_{T+1}$, with the last observation, $y_T$,
+    and, if a seasonal pattern is present, with the observation $k$ steps
+    before, $y_{T+1-k}$.
+As variants, both methods can be generalized to include drift terms in the
+    presence of a trend or changing seasonal amplitude.
+
+If a time series exhibits no trend, a simple moving average (SMA) is a
+    generalization of the na\"{i}ve method that is more robust to outliers.
+It is defined as follows: $\hat{y}_{T+1} = \frac{1}{h} \sum_{i=T-h+1}^{T} y_i$
+    where $h$ is the length of the window over which the average is
+    calculated.
+If a time series exhibits a seasonal pattern, setting $h$ to a multiple of the
+    periodicity $k$ ensures that the forecast is not biased by the seasonal
+    pattern.
+
+Starting in the 1950s, another popular family of forecasting methods,
+    so-called exponential smoothing methods, was introduced by
+    \cite{brown1959}, \cite{holt1957}, and \cite{winters1960}.
+The idea is that forecasts $\hat{y}_{T+1}$ are a weighted average of past
+    observations where the weights decay over time; in the case of the simple
+    exponential smoothing (SES) method we obtain:
+$
+\hat{y}_{T+1} = \alpha y_T + \alpha (1 - \alpha) y_{T-1}
+                + \alpha (1 - \alpha)^2 y_{T-2}
+                + \dots + \alpha (1 - \alpha)^{T-1} y_{1}
+$
+where $\alpha$ (with $0 \le \alpha \le 1$) is a smoothing parameter.
+
+Exponential smoothing methods are often expressed in an alternative component
+    form that consists of a forecast equation and one or more smoothing
+    equations for unobservable components.
+Below, we present a generalization of SES, the so-called Holt-Winters'
+    seasonal method, in an additive formulation.
+$\ell_t$, $b_t$, and $s_t$ represent the unobservable level, trend, and
+    seasonal components inherent in $y_t$, and $\beta$ and $\gamma$ complement
+    $\alpha$ as smoothing parameters:
+\begin{align*}
+\hat{y}_{t+1} & = \ell_t + b_t + s_{t+1-k} \\
+\ell_t & = \alpha(y_t - s_{t-k}) + (1 - \alpha)(\ell_{t-1} + b_{t-1}) \\
+b_t & = \beta (\ell_{t} - \ell_{t-1}) + (1 - \beta) b_{t-1} \\
+s_t & = \gamma (y_t - \ell_{t-1} - b_{t-1}) + (1-\gamma)s_{t-k}
+\end{align*}
+With $b_t$, $s_t$, $\beta$, and $\gamma$ removed, this formulation reduces to
+    SES.
+Distinct variations exist: Besides the three components, \cite{gardner1985}
+    add a damping term for the trend, \cite{pegels1969} provides
+    multiplicative formulations, and \cite{taylor2003} adds damping to the
+    latter.
+The accuracy measure commonly employed is the sum of squared errors between
+    the observations and their forecasts.
+
+The Theta method, originally introduced by \cite{assimakopoulos2000}, is
+    shown by \cite{hyndman2003} to be equivalent to SES with a drift term.
+We mention this method here only because \cite{bell2018} emphasize that it
+    performs well at Uber.
+However, in our empirical study, we find that this is not true in general.
+
+\cite{hyndman2002} introduce statistical processes, so-called innovations
+    state-space models, to generalize the methods in this sub-section.
+They call this family of models ETS as they capture error, trend, and seasonal
+    terms.
+Linear and additive ETS models have the following structure:
+\begin{align*}
+y_t & = \vec{w} \cdot \vec{x}_{t-1} + \epsilon_t \\
+\vec{x}_t & = \mat{F} \vec{x}_{t-1} + \vec{g} \epsilon_t
+\end{align*}
+$y_t$ denotes the observations as before, while $\vec{x}_t$ is a state vector
+    of unobserved components.
+$\epsilon_t$ is a white noise series, and the matrix $\mat{F}$ and the vectors
+    $\vec{g}$ and $\vec{w}$ contain a model's coefficients.
+Like the models in the next sub-section, ETS models are commonly fitted
+    with maximum likelihood and evaluated with information criteria
+    against historical data.
+We refer to \cite{hyndman2008b} for a thorough summary.
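+
+As a minimal illustration, the SES forecast can be computed with the level
+    recursion from the component form above (i.e., with $b_t$ and $s_t$
+    removed); in the following Python sketch, the function name,
+    initialization, and data are illustrative only:
+\begin{verbatim}
+import numpy as np
+
+def ses_forecast(y, alpha):
+    # Level recursion of SES: the level is a weighted average of the
+    # newest observation and the previous level.
+    level = y[0]
+    for obs in y[1:]:
+        level = alpha * obs + (1 - alpha) * level
+    return level  # forecast for y_{T+1}
+
+y = np.array([12.0, 10.0, 14.0, 13.0, 15.0, 16.0])  # toy demand series
+print(ses_forecast(y, alpha=0.3))
+\end{verbatim}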
diff --git a/tex/2_lit/2_class/3_arima.tex b/tex/2_lit/2_class/3_arima.tex
new file mode 100644
index 0000000..d55ffd8
--- /dev/null
+++ b/tex/2_lit/2_class/3_arima.tex
@@ -0,0 +1,69 @@
+\subsubsection{Autoregressive Integrated Moving Averages.}
+\label{arima}
+
+\cite{box1962}, \cite{box1968}, and further papers by the same authors in the
+    1960s introduce a class of models for stationary time series in which
+    observations correlate with their neighbors; they refer to them as
+    autoregressive integrated moving average (ARIMA) models.
+For a thorough overview, we refer to \cite{box2015} and \cite{brockwell2016}.
+
+A time series $y_t$ is stationary if its moments are independent of the
+    point in time at which it is observed.
+A typical example is a white noise $\epsilon_t$ series.
+Consequently, a trend or seasonality implies non-stationarity.
+\cite{kwiatkowski1992} provide a test to check the null hypothesis of
+    stationary data.
+To obtain a stationary time series, one chooses from several techniques:
+First, to stabilize a changing variance (i.e., heteroscedasticity), one
+    applies a Box-Cox transformation (e.g., $\log$) as first suggested by
+    \cite{box1964}.
+Second, to factor out a trend (or seasonal) pattern, one computes differences
+    of consecutive (or of lag $k$) observations or even differences thereof.
+Third, it is also common to pre-process $y_t$ with one of the decomposition
+    methods mentioned in Sub-section \ref{stl} below and to train an ARIMA
+    model on the adjusted $y_t$.
+
+In the autoregressive part, observations are modeled as linear combinations
+    of their predecessors.
+Formally, an $AR(p)$ model is defined with a drift term $c$, coefficients
+    $\phi_i$ to be estimated (where $i$ is an index with $0 < i \leq p$), and
+    white noise $\epsilon_t$ as follows:
+$
+AR(p): \ \
+y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p}
+      + \epsilon_t
+$.
+The moving average part models each observation as a linear combination of
+    past forecast errors.
+Formally, an $MA(q)$ model is defined with a drift term $c$, coefficients
+    $\theta_j$ to be estimated, and white noise terms $\epsilon_t$ (where $j$
+    is an index with $0 < j \leq q$) as follows:
+$
+MA(q): \ \
+y_t = c + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2}
+      + \dots + \theta_q \epsilon_{t-q}
+$.
+Finally, an $ARIMA(p,d,q)$ model unifies both parts and adds differencing,
+    where $d$ is the degree of differencing and the $'$ indicates differenced
+    values:
+$
+ARIMA(p,d,q): \ \
+y'_t = c + \phi_1 y'_{t-1} + \dots + \phi_p y'_{t-p} + \theta_1 \epsilon_{t-1}
+       + \dots + \theta_q \epsilon_{t-q} + \epsilon_{t}
+$.
+
+$ARIMA(p,d,q)$ models are commonly fitted with maximum likelihood estimation.
+To find an optimal combination of the parameters $p$, $d$, and $q$, the
+    literature suggests calculating an information criterion
+    (e.g., Akaike's Information Criterion) that evaluates the fit on
+    historical data.
+\cite{hyndman2008a} provide a step-wise heuristic to choose $p$, $d$, and $q$
+    that also decides whether a Box-Cox transformation should be applied,
+    and if so, which one.
+To obtain a one-step-ahead forecast, the above equation is rearranged such
+    that $t$ is substituted with $T+1$.
+For forecasts further into the future, the actual observations are
+    subsequently replaced by their forecasts.
+Seasonal ARIMA variants exist; however, the high periodicity $k$ in the kind
+    of demand a UDP faces typically renders them impractical as too many
+    coefficients must be estimated.
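+
+As a minimal illustration of this fitting and forecasting workflow (not the
+    heuristic of \cite{hyndman2008a}), the following Python sketch assumes
+    the statsmodels library is available; the series and the order $(p,d,q)$
+    are chosen arbitrarily:
+\begin{verbatim}
+import numpy as np
+from statsmodels.tsa.arima.model import ARIMA
+
+# Illustrative demand series; in practice y_t would be the observed data.
+y = np.random.default_rng(0).poisson(lam=5.0, size=200).astype(float)
+
+# Fit an ARIMA(1,1,1) by maximum likelihood; d=1 means first differences.
+result = ARIMA(y, order=(1, 1, 1)).fit()
+print(result.aic)                # information criterion used for model choice
+print(result.forecast(steps=3))  # three-step-ahead forecasts
+\end{verbatim}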
diff --git a/tex/2_lit/2_class/4_stl.tex b/tex/2_lit/2_class/4_stl.tex
new file mode 100644
index 0000000..127123c
--- /dev/null
+++ b/tex/2_lit/2_class/4_stl.tex
@@ -0,0 +1,62 @@
+\subsubsection{Seasonal and Trend Decomposition using Loess.}
+\label{stl}
+
+A time series $y_t$ may exhibit different types of patterns; to fully capture
+    each of them, the series must be decomposed.
+Then, each component is forecast with a distinct model.
+Most commonly, the components are the trend $t_t$, seasonality $s_t$, and
+    remainder $r_t$.
+They are themselves time series, where only $s_t$ exhibits a periodicity $k$.
+A decomposition may be additive (i.e., $y_t = s_t + t_t + r_t$) or
+    multiplicative (i.e., $y_t = s_t \cdot t_t \cdot r_t$); the former assumes
+    that the effect of the seasonal component is independent of the overall
+    level of $y_t$ and vice versa.
+The seasonal component is centered around $0$ in the additive and around $1$
+    in the multiplicative case such that its removal does not affect the
+    level of $y_t$.
+Often, it is sufficient to only seasonally adjust the time series, and model
+    the trend and remainder together, for example, as $a_t = y_t - s_t$ in the
+    additive case.
+
+Early approaches employed moving averages (cf., Sub-section \ref{ets}) to
+    calculate a trend component, and, after removing that from $y_t$, averaged
+    all observations of the same seasonal lag to obtain the seasonal
+    component.
+The downsides of this are the subjectivity in choosing the window lengths for
+    the moving average and the seasonal averaging, the inability of the
+    seasonal component to vary its amplitude over time, and the lack of
+    outlier handling.
+
+The X11 method developed at the U.S. Census Bureau and described in detail by
+    \cite{dagum2016} overcomes these disadvantages.
+However, due to its background in economics, it is designed primarily for
+    quarterly or monthly data, and the change in amplitude over time cannot be
+    controlled.
+Variants of this method are the SEATS decomposition by the Bank of Spain and
+    the newer X13-SEATS-ARIMA method by the U.S. Census Bureau.
+Their main advantages stem from the fact that the models calibrate themselves
+    according to statistical criteria without manual work for a statistician
+    and that the fitting process is robust to outliers.
+
+\cite{cleveland1990} introduce a seasonal and trend decomposition using a
+    repeated locally weighted regression (the so-called Loess procedure) to
+    smooth the trend and seasonal components; it can be viewed as a
+    generalization of the methods above and is denoted by the acronym
+    \gls{stl}.
+In contrast to the X11, X13, and SEATS methods, the STL supports seasonalities
+    of any periodicity $k$, which must, however, be determined with additional
+    statistical tests or set with out-of-band knowledge by the forecaster
+    (e.g., hourly demand data implies $k = 24 \cdot 7 = 168$ assuming customer
+    behavior differs on each day of the week).
+Moreover, the seasonal component's rate of change, represented by the $ns$
+    parameter and explained in detail with Figure \ref{f:stl} in Section
+    \ref{decomp}, must be set by the forecaster as well, while the trend's
+    smoothness may be controlled via setting a non-default window size.
+Outliers are handled by assignment to the remainder such that they do not
+    affect the trend and seasonal components.
+In particular, the manual input needed to calibrate the STL explains why only
+    the X11, X13, and SEATS methods are widely used by practitioners.
+However, the widespread adoption of concepts like cross-validation (cf.,
+    Sub-section \ref{cv}) in recent years enables using an automated
+    grid search to optimize these parameters.
+Such a grid search is further facilitated by the STL being computationally
+    cheaper than the other methods discussed.
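+
+A minimal sketch of such a decomposition, assuming the STL implementation of
+    the statsmodels library; the synthetic hourly series and the chosen
+    windows are illustrative only:
+\begin{verbatim}
+import numpy as np
+from statsmodels.tsa.seasonal import STL
+
+rng = np.random.default_rng(1)
+hours = np.arange(24 * 7 * 8)      # eight weeks of hourly time steps
+y = 10 + 3 * np.sin(2 * np.pi * hours / 168) + rng.normal(size=hours.size)
+
+# Weekly periodicity k = 168; `seasonal` is the Loess window that
+# controls how fast the seasonal component may change (the ns parameter).
+res = STL(y, period=168, seasonal=7).fit()
+adjusted = y - res.seasonal        # a_t = y_t - s_t in the additive case
+\end{verbatim}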
diff --git a/tex/2_lit/3_ml/1_intro.tex b/tex/2_lit/3_ml/1_intro.tex
new file mode 100644
index 0000000..f04f137
--- /dev/null
+++ b/tex/2_lit/3_ml/1_intro.tex
@@ -0,0 +1,15 @@
+\subsection{Demand Forecasting with Machine Learning Methods}
+\label{ml_methods}
+
+ML methods have been employed in all kinds of prediction tasks in recent
+    years.
+In this section, we restrict ourselves to the models that performed well in
+    our study: Random Forest (\gls{rf}) and Support Vector Regression
+    (\gls{svr}).
+RFs are in general well-suited for datasets without a priori knowledge about
+    the patterns, while SVR is known to perform well on time series data, as
+    shown by \cite{hansen2006} in general and \cite{bao2004} specifically for
+    intermittent demand.
+Gradient Boosting, another popular ML method, was consistently outperformed by
+    RFs in our study, and artificial neural networks require far more data
+    than our industry partner has available.
diff --git a/tex/2_lit/3_ml/2_learning.tex b/tex/2_lit/3_ml/2_learning.tex
new file mode 100644
index 0000000..86d157a
--- /dev/null
+++ b/tex/2_lit/3_ml/2_learning.tex
@@ -0,0 +1,53 @@
+\subsubsection{Supervised Learning.}
+\label{learning}
+
+A conceptual difference between classical and ML methods is the format of
+    the model inputs.
+In ML models, a time series $Y$ is interpreted as labeled data.
+Labels are collected into a vector $\vec{y}$, while the corresponding
+    predictors are aligned in a $(T - n) \times n$ matrix $\mat{X}$:
+$$
+\vec{y}
+=
+\begin{pmatrix}
+    y_T \\
+    y_{T-1} \\
+    \vdots \\
+    y_{n+1}
+\end{pmatrix}
+~~~~~~~~~~
+\mat{X}
+=
+\begin{bmatrix}
+    y_{T-1} & y_{T-2} & \dots & y_{T-n} \\
+    y_{T-2} & y_{T-3} & \dots & y_{T-(n+1)} \\
+    \vdots & \vdots & \ddots & \vdots \\
+    y_n & y_{n-1} & \dots & y_1
+\end{bmatrix}
+$$
+The $m = T - n$ rows are referred to as samples and the $n$ columns as
+    features.
+Each row in $\mat{X}$ is "labeled" by the corresponding entry in $\vec{y}$,
+    and ML models are trained to fit the rows to their labels.
+Conceptually, we model a functional relationship $f$ between $\mat{X}$ and
+    $\vec{y}$ such that the difference between the predicted
+    $\vec{\hat{y}} = f(\mat{X})$ and the true $\vec{y}$ is minimized
+    according to some error measure $L(\vec{\hat{y}}, \vec{y})$, where $L$
+    summarizes the goodness of the fit into a scalar value (e.g., the
+    well-known mean squared error [MSE]; cf., Section \ref{mase}).
+$\mat{X}$ and $\vec{y}$ reveal the ordinal character of time series data:
+    Not only do the entries of $\mat{X}$ and $\vec{y}$ overlap, but the rows
+    of $\mat{X}$ are also shifted versions of each other.
+That does not hold for ML applications in general (e.g., the classical
+    example of predicting spam vs. non-spam emails, where the features model
+    properties of individual emails), and most of the common error measures
+    presented in introductory texts on ML are only applicable in cases
+    without such a structure in $\mat{X}$ and $\vec{y}$.
+$n$, the number of past time steps required to predict a $y_t$, is an
+    exogenous model parameter.
+For prediction, the forecaster supplies the trained ML model with an input
+    vector in the same format as a row $\vec{x}_i$ in $\mat{X}$.
+For example, to predict $y_{T+1}$, the model takes the vector
+    $(y_T, y_{T-1}, \dots, y_{T-n+1})$ as input.
+That is in contrast to the classical methods, where we only supply the number
+    of time steps to be predicted as a scalar integer.
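+
+A minimal Python sketch of this arrangement (the helper name and the toy
+    series are illustrative; nothing beyond NumPy is assumed):
+\begin{verbatim}
+import numpy as np
+
+def make_lag_matrix(y, n):
+    # Pair each label y_t with its n predecessors (y_{t-1}, ..., y_{t-n})
+    # as the feature row, newest sample first, as in the matrix above.
+    X, labels = [], []
+    for t in range(len(y) - 1, n - 1, -1):   # 0-based index of the label
+        labels.append(y[t])
+        X.append([y[t - lag] for lag in range(1, n + 1)])
+    return np.array(X), np.array(labels)
+
+y = np.arange(1.0, 11.0)              # y_1, ..., y_10 as a toy series
+X, y_vec = make_lag_matrix(y, n=3)    # X: shape (7, 3), y_vec: length 7
+\end{verbatim}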
diff --git a/tex/2_lit/3_ml/3_cv.tex b/tex/2_lit/3_ml/3_cv.tex
new file mode 100644
index 0000000..1d5186b
--- /dev/null
+++ b/tex/2_lit/3_ml/3_cv.tex
@@ -0,0 +1,38 @@
+\subsubsection{Cross-Validation.}
+\label{cv}
+
+Because ML models are trained by minimizing a loss function $L$, the
+    resulting value of $L$ by design underestimates the true error we see
+    when predicting into the actual future.
+To counter that, one popular and model-agnostic approach is cross-validation
+    (\gls{cv}), as summarized, for example, by \cite{hastie2013}.
+CV is a resampling technique that randomly splits the samples into a
+    training and a test set.
+Trained on the former, an ML model makes forecasts on the latter.
+Then, the value of $L$ calculated only on the test set gives a realistic and
+    unbiased estimate of the true forecasting error, and may be used for one
+    of two distinct purposes:
+First, it assesses the quality of a fit and provides an idea as to how the
+    model would perform in production when predicting into the actual future.
+Second, the errors of models based on different methods, or on the same
+    method with different parameters, may be compared with each other to
+    select the best model.
+In order to first select the best model and then assess its quality, one must
+    apply two chained CVs:
+The samples are divided into training, validation, and test sets, and all
+    models are trained on the training set and compared on the validation set.
+Then, the winner is retrained on the union of the training and validation
+    sets and assessed on the test set.
+
+Regarding the splitting, there are various approaches, and we choose the
+    so-called $k$-fold CV, where the samples are randomly divided into $k$
+    folds of the same size.
+Each fold is used as a test set once, and the remaining $k-1$ folds become
+    the corresponding training set.
+The resulting $k$ error measures are averaged.
+A $k$-fold CV with $k=5$ or $k=10$ is a compromise between the two extreme
+    cases of having only one split and the so-called leave-one-out CV
+    where $k = m$: Computation is still relatively fast, and each sample is
+    part of several training sets, which maximizes the learning from the data.
+We adapt the $k$-fold CV to the ordinal structure in $\mat{X}$ and $\vec{y}$
+    in Sub-section \ref{unified_cv}.
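+
+For reference, a minimal sketch of the plain $k$-fold CV with $k=5$,
+    assuming the scikit-learn library; the estimator and data are
+    placeholders, and the random shuffling shown here ignores the ordinal
+    structure that Sub-section \ref{unified_cv} accounts for:
+\begin{verbatim}
+import numpy as np
+from sklearn.linear_model import LinearRegression
+from sklearn.model_selection import KFold, cross_val_score
+
+rng = np.random.default_rng(2)
+X = rng.normal(size=(200, 10))   # placeholder feature matrix
+y = rng.normal(size=200)         # placeholder labels
+
+# Each of the 5 folds serves as the test set once; the five (negated)
+# MSE values are averaged into a single error estimate.
+cv = KFold(n_splits=5, shuffle=True, random_state=0)
+scores = cross_val_score(LinearRegression(), X, y, cv=cv,
+                         scoring="neg_mean_squared_error")
+print(-scores.mean())
+\end{verbatim}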
diff --git a/tex/2_lit/3_ml/4_rf.tex b/tex/2_lit/3_ml/4_rf.tex
new file mode 100644
index 0000000..784d2a7
--- /dev/null
+++ b/tex/2_lit/3_ml/4_rf.tex
@@ -0,0 +1,66 @@
+\subsubsection{Random Forest Regression.}
+\label{rf}
+
+\cite{breiman1984} introduce the classification and regression tree
+    (\gls{cart}) model that is built around the idea that a single binary
+    decision tree maps learned combinations of intervals of the feature
+    columns to a label.
+Thus, each sample in the training set is associated with one leaf node,
+    which is reached by following the tree from its root and branching at
+    each intermediate node according to a learned splitting rule that
+    compares the sample's realization of the feature specified by the rule
+    to a learned split value.
+While such models are computationally fast and offer a high degree of
+    interpretability, they tend to overfit strongly to the training set as
+    the splitting rules are not limited to any functional form (e.g., linear)
+    in the relationship between the features and the labels.
+In the regression case, it is common to maximize the variance reduction $I_V$
+    from a parent node $N$ to its two children, $C1$ and $C2$, as the
+    splitting rule.
+\cite{breiman1984} formulate this as follows:
+$$
+I_V(N)
+=
+\frac{1}{|S_N|^2} \sum_{i \in S_N} \sum_{j \in S_N}
+    \frac{1}{2} (y_i - y_j)^2
+- \left(
+    \frac{1}{|S_{C1}|^2} \sum_{i \in S_{C1}} \sum_{j \in S_{C1}}
+        \frac{1}{2} (y_i - y_j)^2
+    +
+    \frac{1}{|S_{C2}|^2} \sum_{i \in S_{C2}} \sum_{j \in S_{C2}}
+        \frac{1}{2} (y_i - y_j)^2
+\right)
+$$
+$S_N$, $S_{C1}$, and $S_{C2}$ are the index sets of the samples in $N$, $C1$,
+    and $C2$.
+
+\cite{ho1998} and then \cite{breiman2001} generalize this method by combining
+    many CART models into one forest of trees where every single tree is
+    a randomized variant of the others.
+Randomization is achieved in two steps of the training process:
+First, each tree receives a distinct training set resampled with replacement
+    from the original training set, an idea also called bootstrap
+    aggregation.
+Second, at each node a random subset of the features is used to grow the tree.
+Trees can be fitted in parallel, which speeds up the training significantly.
+For prediction at the tree level, the average label of the training samples
+    at the leaf node reached is used.
+Then, the individual values are combined into one value by averaging again
+    across the trees.
+Due to the randomization, the trees are decorrelated, which offsets the
+    overfitting.
+Another measure to counter overfitting is pruning the trees, either by
+    specifying the maximum depth of a tree or the minimum number of samples
+    at leaf nodes.
+
+The forecaster must tune the structure of the forest.
+Parameters include the number of trees in the forest, the size of the random
+    subset of features, and the pruning criteria.
+The parameters are optimized via grid search: We train many models with
+    parameters chosen from a pre-defined list of values and select the best
+    one by CV.
+RFs are a convenient ML method for any dataset as decision trees do not
+    make any assumptions about the relationship between features and labels.
+\cite{herrera2010} use RFs to predict the hourly demand for water in an urban
+    context, an application similar to the one in this paper, and find that
+    RFs work well with time series data.
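+
+A minimal sketch of the grid search described above, assuming the
+    scikit-learn library; the parameter grid and the placeholder data are
+    illustrative only:
+\begin{verbatim}
+import numpy as np
+from sklearn.ensemble import RandomForestRegressor
+from sklearn.model_selection import GridSearchCV
+
+rng = np.random.default_rng(3)
+X = rng.normal(size=(200, 10))   # e.g., the lag matrix from above
+y = rng.normal(size=200)
+
+# Structural parameters of the forest: number of trees, size of the
+# random feature subset per split, and a pruning criterion.
+grid = {"n_estimators": [100, 500],
+        "max_features": ["sqrt", 1.0],
+        "min_samples_leaf": [1, 5, 10]}
+search = GridSearchCV(RandomForestRegressor(random_state=0), grid,
+                      cv=5, scoring="neg_mean_squared_error")
+search.fit(X, y)
+print(search.best_params_)
+\end{verbatim}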
diff --git a/tex/2_lit/3_ml/5_svm.tex b/tex/2_lit/3_ml/5_svm.tex
new file mode 100644
index 0000000..1c12af5
--- /dev/null
+++ b/tex/2_lit/3_ml/5_svm.tex
@@ -0,0 +1,60 @@
+\subsubsection{Support Vector Regression.}
+\label{svm}
+
+\cite{vapnik1963} and \cite{vapnik1964} introduce the so-called support vector
+    machine (\gls{svm}) model, and \cite{vapnik2013} summarizes the research
+    conducted since then.
+In their basic version, SVMs are linear classifiers modeling a binary
+    decision: they fit a hyperplane into the feature space of $\mat{X}$ so as
+    to maximize the margin around the hyperplane separating the two groups
+    of labels.
+SVMs were popularized in the 1990s in the context of optical character
+    recognition, as shown in \cite{scholkopf1998}.
+
+\cite{drucker1997} and \cite{stitson1999} adapt SVMs to the regression case,
+    and \cite{smola2004} provide a comprehensive introduction to it.
+\cite{mueller1997} and \cite{mueller1999} focus on SVRs in the context of time
+    series data and find that they tend to outperform classical methods.
+\cite{chen2006a} and \cite{chen2006b} apply SVRs to predict the hourly demand
+    for water in cities, an application similar to the UDP case.
+
+In the SVR case, a linear function
+    $\hat{y}_i = f(\vec{x}_i) = \langle\vec{w},\vec{x}_i\rangle + b$
+    is fitted so that the actual labels $y_i$ have a deviation of at most
+    $\epsilon$ from their predictions $\hat{y}_i$ (cf., the constraints
+    below).
+SVRs are commonly formulated as quadratic optimization problems as follows:
+$$
+\text{minimize }
+\frac{1}{2} \norm{\vec{w}}^2 + C \sum_{i=1}^m (\xi_i + \xi_i^*)
+\quad \text{subject to }
+\begin{cases}
+y_i - \langle \vec{w}, \vec{x}_i \rangle - b \leq \epsilon + \xi_i
+\text{,} \\
+\langle \vec{w}, \vec{x}_i \rangle + b - y_i \leq \epsilon + \xi_i^*
+\end{cases}
+$$
+$\vec{w}$ are the fitted weights in the row space of $\mat{X}$, $b$ is a
+    scalar bias term, and $\langle\cdot,\cdot\rangle$ denotes the dot
+    product.
+By minimizing the norm of $\vec{w}$, the fitted function is kept flat and
+    less prone to overfitting.
+To allow individual samples outside the otherwise hard $\epsilon$ bounds,
+    non-negative slack variables $\xi_i$ and $\xi_i^*$ are included.
+A non-negative parameter $C$ regulates how many samples may violate the
+    $\epsilon$ bounds and by how much.
+To model non-linear relationships, one could use a mapping $\Phi(\cdot)$ for
+    the $\vec{x}_i$ from the row space of $\mat{X}$ to some higher
+    dimensional space; however, as the optimization problem only depends on
+    the dot product $\langle\cdot,\cdot\rangle$ and not the actual entries of
+    $\vec{x}_i$, it suffices to use a kernel function $k$ such that
+    $k(\vec{x}_i,\vec{x}_j) = \langle\Phi(\vec{x}_i),\Phi(\vec{x}_j)\rangle$.
+Such kernels must fulfill certain mathematical properties, and, besides
+    polynomial kernels, radial basis functions with
+    $k(\vec{x}_i,\vec{x}_j) = \exp(-\gamma \norm{\vec{x}_i - \vec{x}_j}^2)$
+    are a popular choice, where $\gamma$ is a parameter controlling how the
+    distance between any two samples influences the final model.
+SVRs work well with sparse data in high-dimensional spaces, such as
+    intermittent demand data, as, by maximizing the error margin, they
+    minimize the risk of misclassification or of predicting a value that is
+    far off, as also noted by \cite{bao2004}.
diff --git a/tex/3_mod/1_intro.tex b/tex/3_mod/1_intro.tex
index bf72f24..fdc9207 100644
--- a/tex/3_mod/1_intro.tex
+++ b/tex/3_mod/1_intro.tex
@@ -1,2 +1,8 @@
 \section{Model Formulation}
-\label{mod}
\ No newline at end of file
+\label{mod}
+
+% temporary placeholders
+\label{decomp}
+\label{f:stl}
+\label{mase}
+\label{unified_cv}
\ No newline at end of file
diff --git a/tex/glossary.tex b/tex/glossary.tex
index b34e1b9..6c685f7 100644
--- a/tex/glossary.tex
+++ b/tex/glossary.tex
@@ -1,7 +1,25 @@
 % Abbreviations for technical terms.
+\newglossaryentry{cart}{ + name=CART, description={Classification and Regression Trees} +} +\newglossaryentry{cv}{ + name=CV, description={Cross Validation} +} \newglossaryentry{ml}{ name=ML, description={Machine Learning} } +\newglossaryentry{rf}{ + name=RF, description={Random Forest} +} +\newglossaryentry{stl}{ + name=STL, description={Seasonal and Trend Decomposition using Loess} +} +\newglossaryentry{svm}{ + name=SVM, description={Support Vector Machine} +} +\newglossaryentry{svr}{ + name=SVR, description={Support Vector Regression} +} \newglossaryentry{udp}{ name=UDP, description={Urban Delivery Platform} } diff --git a/tex/preamble.tex b/tex/preamble.tex index bc34c42..85ddd10 100644 --- a/tex/preamble.tex +++ b/tex/preamble.tex @@ -6,4 +6,9 @@ % Make opening quotes look different than closing quotes. \usepackage[english=american]{csquotes} -\MakeOuterQuote{"} \ No newline at end of file +\MakeOuterQuote{"} + +% Define helper commands. +\usepackage{bm} +\newcommand{\mat}[1]{\bm{#1}} +\newcommand{\norm}[1]{\left\lVert#1\right\rVert} \ No newline at end of file diff --git a/tex/references.bib b/tex/references.bib index be2c815..9d330ff 100644 --- a/tex/references.bib +++ b/tex/references.bib @@ -7,6 +7,25 @@ volume={129}, pages={263--286} } +@article{assimakopoulos2000, +title={The theta model: a decomposition approach to forecasting}, +author={Assimakopoulos, Vassilis and Nikolopoulos, Konstantinos}, +year={2000}, +journal={International Journal of Forecasting}, +volume={16}, +number={4}, +pages={521--530} +} + +@inproceedings{bao2004, +title={Forecasting intermittent demand by SVMs regression}, +author={Bao, Yukun and Wang, Wen and Zhang, Jinlong}, +year={2004}, +booktitle={2004 IEEE International Conference on Systems, Man and Cybernetics}, +volume={1}, +pages={461--466} +} + @misc{bell2018, title = {Forecasting at Uber: An Introduction}, author={Bell, Franziska and Smyl, Slawek}, @@ -15,6 +34,119 @@ howpublished = {\url{https://eng.uber.com/forecasting-introduction/}}, note = {Accessed: 2020-10-01} } +@article{box1962, +title={Some statistical Aspects of adaptive Optimization and Control}, +author={Box, George and Jenkins, Gwilym}, +year={1962}, +journal={Journal of the Royal Statistical Society. Series B (Methodological)}, +volume={24}, +number={2}, +pages={297--343} +} + +@article{box1964, +title={An Analysis of Transformations}, +author={Box, George and Cox, David}, +year={1964}, +journal={Journal of the Royal Statistical Society. Series B (Methodological)}, +volume={26}, +number={2}, +pages={211--252} +} + +@article{box1968, +title={Some recent Advances in Forecasting and Control}, +author={Box, George and Jenkins, Gwilym}, +year={1968}, +journal={Journal of the Royal Statistical Society. + Series C (Applied Statistics)}, +volume={17}, +number={2}, +pages={91--109} +} + +@book{box2015, +title={Time Series Analysis: Forecasting and Control}, +author={Box, George and Jenkins, Gwilym and Reinsel, Gregory and Ljung, Greta}, +series={Wiley Series in Probability and Statistics}, +year={2015}, +publisher={Wiley} +} + +@book{breiman1984, +title={Classification and Regression Trees}, +author={Breiman, Leo and Friedman, Jerome and Olshen, R.A. 
+ and Stone, Charles}, +year={1984}, +publisher={Wadsworth} +} + +@article{breiman2001, +title={Random Forests}, +author={Breiman, Leo}, +year={2001}, +journal={Machine Learning}, +volume={45}, +number={1}, +pages={5--32} +} + +@book{brockwell2016, +title={Introduction to Time Series and Forecasting}, +author={Brockwell, Peter and Davis, Richard}, +series={Springer Texts in Statistics}, +year={2016}, +publisher={Springer} +} + +@book{brown1959, +title={Statistical Forecasting for Inventory Control}, +author={Brown, Robert}, +year={1959}, +publisher={McGraw/Hill} +} + +@article{chen2006a, +title={Hourly Water Demand Forecast Model based on Bayesian Least Squares + Support Vector Machine}, +author={Chen, Lei and Zhang, Tu-qiao}, +year={2006}, +journal={Journal of Tianjin University}, +volume={39}, +number={9}, +pages={1037--1042} +} + +@article{chen2006b, +title={Hourly Water Demand Forecast Model based on Least Squares Support + Vector Machine}, +author={Chen, Lei and Zhang, Tu-qiao}, +year={2006}, +journal={Journal of Harbin Institute of Technology}, +volume={38}, +number={9}, +pages={1528--1530} +} + +@article{cleveland1990, +title={STL: A Seasonal-Trend Decomposition Procedure Based on Loess}, +author={Cleveland, Robert and Cleveland, Williiam and McRae, Jean + and Terpenning, Irma}, +year={1990}, +journal={Journal of Official Statistics}, +volume={6}, +number={1}, +pages={3--73} +} + +@book{dagum2016, +title={Seasonal Adjustment Methods and Real Time Trend-Cycle Estimation}, +author={Dagum, Estela and Bianconcini, Silvia}, +series={Statistics for Social and Behavioral Sciences}, +year={2016}, +publisher={Springer} +} + @article{de2006, title={25 Years of Time Series Forecasting}, author={De Gooijer, Jan and Hyndman, Rob}, @@ -25,6 +157,16 @@ number={3}, pages={443--473} } +@inproceedings{drucker1997, +title={Support Vector Regression Machines}, +author={Drucker, Harris and Burges, Christopher and Kaufman, Linda + and Smola, Alex and Vapnik, Vladimir}, +year={1997}, +booktitle={Advances in Neural Information Processing Systems}, +pages={155--161}, +organization={Springer} +} + @article{ehmke2018, title={Optimizing for total costs in vehicle routing in urban areas}, author={Ehmke, Jan Fabian and Campbell, Ann M and Thomas, Barrett W}, @@ -34,6 +176,45 @@ volume={116}, pages={242--265} } +@article{gardner1985, +title={Forecasting Trends in Time Series}, +author={Gardner, Everette and McKenzie, Ed}, +year={1985}, +journal={Management Science}, +volume={31}, +number={10}, +pages={1237--1246} +} + +@article{hansen2006, +title={Some Evidence on Forecasting Time-Series with Support Vector Machines}, +author={Hansen, James and McDonald, James and Nelson, Ray}, +year={2006}, +journal={Journal of the Operational Research Society}, +volume={57}, +number={9}, +pages={1053--1063} +} + +@book{hastie2013, +title={The Elements of Statistical Learning: Data Mining, Inference, + and Prediction}, +author={Hastie, Trevor and Tibshirani, Robert and Friedman, Jerome}, +year={2013}, +publisher={Springer} +} + +@article{herrera2010, +title={Predictive Models for Forecasting Hourly Urban Water Demand}, +author={Herrera, Manuel and Torgo, Lu{\'\i}s and Izquierdo, Joaqu{\'\i}n + and P{\'e}rez-Garc{\'\i}a, Rafael}, +year={2010}, +journal={Journal of Hydrology}, +volume={387}, +number={1-2}, +pages={141--150} +} + @misc{hirschberg2016, title = {McKinsey: The changing market for food delivery}, author={Hirschberg, Carsten and Rajko, Alexander and Schumacher, Thomas @@ -44,6 +225,25 @@ howpublished = 
"\url{https://www.mckinsey.com/industries/high-tech/ note = {Accessed: 2020-10-01} } +@article{ho1998, +title={The Random Subspace Method for Constructing Decision Forests}, +author={Ho, Tin Kam}, +year={1998}, +journal={IEEE Transactions on Pattern Analysis and Machine Intelligence}, +volume={20}, +number={8}, +pages={832--844} +} + +@article{holt1957, +title={Forecasting Seasonals and Trends by Exponentially Weighted Moving + Averages}, +author={Holt, Charles}, +year={1957}, +journal={ONR Memorandum}, +volume={52} +} + @article{hou2018, title={Ride-matching and routing optimisation: Models and a large neighbourhood search heuristic}, @@ -54,6 +254,62 @@ volume={118}, pages={143--162} } +@article{hyndman2002, +title={A State Space Framework for Automatic Forecasting using Exponential + Smoothing Methods}, +author={Hyndman, Rob and Koehler, Anne and Snyder, Ralph and Grose, Simone}, +year={2002}, +journal={International Journal of Forecasting}, +volume={18}, +number={3}, +pages={439--454} +} + +@article{hyndman2003, +title={Unmasking the Theta method}, +author={Hyndman, Rob and Billah, Baki}, +year={2003}, +journal={International Journal of Forecasting}, +volume={19}, +number={2}, +pages={287--290} +} + +@article{hyndman2008a, +title={Automatic Time Series Forecasting: The forecast package for R}, +author={Hyndman, Rob and Khandakar, Yeasmin}, +year={2008}, +journal={Journal of Statistical Software}, +volume={26}, +number={3} +} + +@book{hyndman2008b, +title={Forecasting with Exponential Smoothing: the State Space Approach}, +author={Hyndman, Rob and Koehler, Anne and Ord, Keith and Snyder, Ralph}, +year={2008}, +publisher={Springer} +} + +@book{hyndman2018, +title={Forecasting: Principles and Practice}, +author={Hyndman, Rob and Athanasopoulos, George}, +year={2018}, +publisher={OTexts} +} + +@article{kwiatkowski1992, +title={Testing the null hypothesis of stationarity against the alternative of a + unit root: How sure are we that economic time series have a unit root?}, +author={Kwiatkowski, Denis and Phillips, Peter and Schmidt, Peter + and Shin, Yongcheol}, +year={1992}, +journal={Journal of Econometrics}, +volume={54}, +number={1-3}, +pages={159--178} +} + @misc{laptev2017, title = {Engineering Extreme Event Forecasting at Uber with Recurrent Neural Networks}, @@ -74,6 +330,108 @@ volume={118}, pages={392--420} } +@inproceedings{mueller1997, +title={Predicting Time Series with Support Vector Machines}, +author={M{\"u}ller, Klaus-Robert and Smola, Alexander and R{\"a}tsch, Gunnar + and Sch{\"o}lkopf, Bernhard and Kohlmorgen, Jens and Vapnik, Vladimir}, +year={1997}, +booktitle={International Conference on Artificial Neural Networks}, +pages={999--1004}, +organization={Springer} +} + +@article{mueller1999, +title={Using Support Vector Machines for Time Series Prediction}, +author={M{\"u}ller, Klaus-Robert and Smola, Alexander and R{\"a}tsch, Gunnar + and Sch{\"o}lkopf, Bernhard and Kohlmorgen, Jens and Vapnik, Vladimir}, +year={1999}, +journal={Advances in Kernel Methods — Support Vector Learning}, +pages={243--254}, +publisher={MIT, Cambridge, MA, USA} +} + +@book{ord2017, +title={Principles of Business Forecasting}, +author={Ord, Keith and Fildes, Robert and Kourentzes, Nikos}, +year={2017}, +publisher={WESSEX Press} +} + +@article{pegels1969, +title={Exponential Forecasting: Some new variations}, +author={Pegels, C.}, +year={1969}, +journal={Management Science}, +volume={15}, +number={5}, +pages={311--315} +} + +@incollection{scholkopf1998, +title={Fast Approximation of Support 
Vector Kernel Expansions, and an + Interpretation of Clustering as Approximation in Feature Spaces}, +author={Sch{\"o}lkopf, Bernhard and Knirsch, Phil and Smola, Alex + and Burges, Chris}, +year={1998}, +booktitle={Mustererkennung 1998}, +publisher={Springer}, +pages={125--132} +} + +@article{smola2004, +title={A Tutorial on Support Vector Regression}, +author={Smola, Alex and Sch{\"o}lkopf, Bernhard}, +year={2004}, +journal={Statistics and Computing}, +volume={14}, +number={3}, +pages={199--222} +} + +@article{stitson1999, +title={Support Vector Regression with ANOVA Decomposition Kernels}, +author={Stitson, Mark and Gammerman, Alex and Vapnik, Vladimir + and Vovk, Volodya and Watkins, Chris and Weston, Jason}, +year={1999}, +journal={Advances in Kernel Methods — Support Vector Learning}, +pages={285--292}, +publisher={MIT, Cambridge, MA, USA} +} + +@article{taylor2003, +title={Exponential Smoothing with a Damped Multiplicative Trend}, +author={Taylor, James}, +year={2003}, +journal={International Journal of Forecasting}, +volume={19}, +number={4}, +pages={715--725} +} + +@article{vapnik1963, +title={Pattern Recognition using Generalized Portrait Method}, +author={Vapnik, Vladimir and Lerner, A}, +year={1963}, +journal={Automation and Remote Control}, +volume={24}, +pages={774--780}, +} + +@article{vapnik1964, +title={A Note on one Class of Perceptrons}, +author={Vapnik, Vladimir and Chervonenkis, A}, +year={1964}, +journal={Automation and Remote Control}, +volume={25} +} + +@book{vapnik2013, +title={The Nature of Statistical Learning Theory}, +author={Vapnik, Vladimir}, +year={2013}, +publisher={Springer} +} + @article{wang2018, title={Delivering meals for multiple suppliers: Exclusive or sharing logistics service}, @@ -82,4 +440,14 @@ year={2018}, journal={Transportation Research Part E: Logistics and Transportation Review}, volume={118}, pages={496--512} +} + +@article{winters1960, +title={Forecasting Sales by Exponentially Weighted Moving Averages}, +author={Winters, Peter}, +year={1960}, +journal={Management Science}, +volume={6}, +number={3}, +pages={324--342} } \ No newline at end of file