"# Chapter 2: A first Example - Classifying Flowers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The purpose of this notebook is to look at a first example of a typical data science application, namely **statistical learning**, which is often referred to by its more well-known name **machine learning**. To do so, we look at a very popular example involving the classification of flowers. Albeit simplistic and almost boring in its kind, the example is a rather good one to look at from a beginner's point of view as it does not involve too many decision variables. That makes understanding technicalities and visualizing the data set a lot easier."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What is Machine Learning"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's at first review a couple of generic definitions to get started.\n",
"\n",
"Machine learning is the process of **extracting knowledge from data** in an automated fashion.\n",
"\n",
"Typical use cases regard making predictions on new and unseen data or simply understanding a given dataset better by finding patterns.\n",
"\n",
"Central to machine learning is the idea of **automating** the **decision making** from data **without** the user specifying **explicit rules** how these decisions should be made.\n",
"\n",
"That is in direct opposition to what we learned in the \"Expressing Logic\" section in Chapter 0, where we learned how to implement decision criterions \"by hand\" with the `if` statement."
"- **Supervised** (focus of the example in this notebook): Each entry in the dataset comes with a **label**. Examples are a list of emails where spam mail is already marked as such or a sample of handwritten digits. The goal is to use the historic data to make predictions.\n",
"\n",
"- **Unsupervised**: There is no desired output associated with a data entry. In a sense, one can think of unsupervised learning as a means of discovering labels from the data itself. A popular example is the clustering of customer data.\n",
"\n",
"- **Reinforcement**: Conceptually, this can be seen as \"learning by doing\". Some kind of **reward function** tells how good a predicted outcome is. A rather recent and extremely popular example for his approach is the Alpha Go machine."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Types of Supervised Learning"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Algorithms from the supervised learning category are often broken down further into classification and regression:"
"- In **classification** tasks, the labels are *discrete*, such as \"spam\" or \"no spam\" for emails. Often, labels are nominal (e.g., colors of something), or ordinal (e.g., T-shirt sizes in S, M, or L).\n",
"- In **regression**, the labels are *continuous*. For example, given a person's age, education, and position, infer his/her salary."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example: Iris Flower Classification"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the example, we are given measurments regarding the size of various parts of the so-called Iris flower kind. A concrete flower always belongs to one of three distinct special Iris classes. This example application is about classifying a given flower into one of the three classes by only looking at the measurements."
"The `sklearn` library provides several sample datasets, among which is also the Iris dataset.\n",
"\n",
"In a tabular visualization, the dataset could be portrayed somewhat like this:\n",
"\n",
"<img src=\"static/iris.png\" width=\"50%\">\n",
"\n",
"However, the data object imported from `sklearn` is organized slightly different. In particular, the so-called **features** are separated from the **labels**."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import load_iris"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"iris = load_iris()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using Python's `dir()` function we can inspect the data object, i.e. find out what **attributes** it has."
"`iris.data` provides us with a `numpy.ndarray`, where the first dimension equals the number of observed flowers (i.e., the **instances**) and the second dimension lists the various features of a flower."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[5.1, 3.5, 1.4, 0.2],\n",
" [4.9, 3. , 1.4, 0.2],\n",
" [4.7, 3.2, 1.3, 0.2],\n",
" [4.6, 3.1, 1.5, 0.2],\n",
" [5. , 3.6, 1.4, 0.2],\n",
" [5.4, 3.9, 1.7, 0.4],\n",
" [4.6, 3.4, 1.4, 0.3],\n",
" [5. , 3.4, 1.5, 0.2],\n",
" [4.4, 2.9, 1.4, 0.2],\n",
" [4.9, 3.1, 1.5, 0.1],\n",
" [5.4, 3.7, 1.5, 0.2],\n",
" [4.8, 3.4, 1.6, 0.2],\n",
" [4.8, 3. , 1.4, 0.1],\n",
" [4.3, 3. , 1.1, 0.1],\n",
" [5.8, 4. , 1.2, 0.2],\n",
" [5.7, 4.4, 1.5, 0.4],\n",
" [5.4, 3.9, 1.3, 0.4],\n",
" [5.1, 3.5, 1.4, 0.3],\n",
" [5.7, 3.8, 1.7, 0.3],\n",
" [5.1, 3.8, 1.5, 0.3],\n",
" [5.4, 3.4, 1.7, 0.2],\n",
" [5.1, 3.7, 1.5, 0.4],\n",
" [4.6, 3.6, 1. , 0.2],\n",
" [5.1, 3.3, 1.7, 0.5],\n",
" [4.8, 3.4, 1.9, 0.2],\n",
" [5. , 3. , 1.6, 0.2],\n",
" [5. , 3.4, 1.6, 0.4],\n",
" [5.2, 3.5, 1.5, 0.2],\n",
" [5.2, 3.4, 1.4, 0.2],\n",
" [4.7, 3.2, 1.6, 0.2],\n",
" [4.8, 3.1, 1.6, 0.2],\n",
" [5.4, 3.4, 1.5, 0.4],\n",
" [5.2, 4.1, 1.5, 0.1],\n",
" [5.5, 4.2, 1.4, 0.2],\n",
" [4.9, 3.1, 1.5, 0.2],\n",
" [5. , 3.2, 1.2, 0.2],\n",
" [5.5, 3.5, 1.3, 0.2],\n",
" [4.9, 3.6, 1.4, 0.1],\n",
" [4.4, 3. , 1.3, 0.2],\n",
" [5.1, 3.4, 1.5, 0.2],\n",
" [5. , 3.5, 1.3, 0.3],\n",
" [4.5, 2.3, 1.3, 0.3],\n",
" [4.4, 3.2, 1.3, 0.2],\n",
" [5. , 3.5, 1.6, 0.6],\n",
" [5.1, 3.8, 1.9, 0.4],\n",
" [4.8, 3. , 1.4, 0.3],\n",
" [5.1, 3.8, 1.6, 0.2],\n",
" [4.6, 3.2, 1.4, 0.2],\n",
" [5.3, 3.7, 1.5, 0.2],\n",
" [5. , 3.3, 1.4, 0.2],\n",
" [7. , 3.2, 4.7, 1.4],\n",
" [6.4, 3.2, 4.5, 1.5],\n",
" [6.9, 3.1, 4.9, 1.5],\n",
" [5.5, 2.3, 4. , 1.3],\n",
" [6.5, 2.8, 4.6, 1.5],\n",
" [5.7, 2.8, 4.5, 1.3],\n",
" [6.3, 3.3, 4.7, 1.6],\n",
" [4.9, 2.4, 3.3, 1. ],\n",
" [6.6, 2.9, 4.6, 1.3],\n",
" [5.2, 2.7, 3.9, 1.4],\n",
" [5. , 2. , 3.5, 1. ],\n",
" [5.9, 3. , 4.2, 1.5],\n",
" [6. , 2.2, 4. , 1. ],\n",
" [6.1, 2.9, 4.7, 1.4],\n",
" [5.6, 2.9, 3.6, 1.3],\n",
" [6.7, 3.1, 4.4, 1.4],\n",
" [5.6, 3. , 4.5, 1.5],\n",
" [5.8, 2.7, 4.1, 1. ],\n",
" [6.2, 2.2, 4.5, 1.5],\n",
" [5.6, 2.5, 3.9, 1.1],\n",
" [5.9, 3.2, 4.8, 1.8],\n",
" [6.1, 2.8, 4. , 1.3],\n",
" [6.3, 2.5, 4.9, 1.5],\n",
" [6.1, 2.8, 4.7, 1.2],\n",
" [6.4, 2.9, 4.3, 1.3],\n",
" [6.6, 3. , 4.4, 1.4],\n",
" [6.8, 2.8, 4.8, 1.4],\n",
" [6.7, 3. , 5. , 1.7],\n",
" [6. , 2.9, 4.5, 1.5],\n",
" [5.7, 2.6, 3.5, 1. ],\n",
" [5.5, 2.4, 3.8, 1.1],\n",
" [5.5, 2.4, 3.7, 1. ],\n",
" [5.8, 2.7, 3.9, 1.2],\n",
" [6. , 2.7, 5.1, 1.6],\n",
" [5.4, 3. , 4.5, 1.5],\n",
" [6. , 3.4, 4.5, 1.6],\n",
" [6.7, 3.1, 4.7, 1.5],\n",
" [6.3, 2.3, 4.4, 1.3],\n",
" [5.6, 3. , 4.1, 1.3],\n",
" [5.5, 2.5, 4. , 1.3],\n",
" [5.5, 2.6, 4.4, 1.2],\n",
" [6.1, 3. , 4.6, 1.4],\n",
" [5.8, 2.6, 4. , 1.2],\n",
" [5. , 2.3, 3.3, 1. ],\n",
" [5.6, 2.7, 4.2, 1.3],\n",
" [5.7, 3. , 4.2, 1.2],\n",
" [5.7, 2.9, 4.2, 1.3],\n",
" [6.2, 2.9, 4.3, 1.3],\n",
" [5.1, 2.5, 3. , 1.1],\n",
" [5.7, 2.8, 4.1, 1.3],\n",
" [6.3, 3.3, 6. , 2.5],\n",
" [5.8, 2.7, 5.1, 1.9],\n",
" [7.1, 3. , 5.9, 2.1],\n",
" [6.3, 2.9, 5.6, 1.8],\n",
" [6.5, 3. , 5.8, 2.2],\n",
" [7.6, 3. , 6.6, 2.1],\n",
" [4.9, 2.5, 4.5, 1.7],\n",
" [7.3, 2.9, 6.3, 1.8],\n",
" [6.7, 2.5, 5.8, 1.8],\n",
" [7.2, 3.6, 6.1, 2.5],\n",
" [6.5, 3.2, 5.1, 2. ],\n",
" [6.4, 2.7, 5.3, 1.9],\n",
" [6.8, 3. , 5.5, 2.1],\n",
" [5.7, 2.5, 5. , 2. ],\n",
" [5.8, 2.8, 5.1, 2.4],\n",
" [6.4, 3.2, 5.3, 2.3],\n",
" [6.5, 3. , 5.5, 1.8],\n",
" [7.7, 3.8, 6.7, 2.2],\n",
" [7.7, 2.6, 6.9, 2.3],\n",
" [6. , 2.2, 5. , 1.5],\n",
" [6.9, 3.2, 5.7, 2.3],\n",
" [5.6, 2.8, 4.9, 2. ],\n",
" [7.7, 2.8, 6.7, 2. ],\n",
" [6.3, 2.7, 4.9, 1.8],\n",
" [6.7, 3.3, 5.7, 2.1],\n",
" [7.2, 3.2, 6. , 1.8],\n",
" [6.2, 2.8, 4.8, 1.8],\n",
" [6.1, 3. , 4.9, 1.8],\n",
" [6.4, 2.8, 5.6, 2.1],\n",
" [7.2, 3. , 5.8, 1.6],\n",
" [7.4, 2.8, 6.1, 1.9],\n",
" [7.9, 3.8, 6.4, 2. ],\n",
" [6.4, 2.8, 5.6, 2.2],\n",
" [6.3, 2.8, 5.1, 1.5],\n",
" [6.1, 2.6, 5.6, 1.4],\n",
" [7.7, 3. , 6.1, 2.3],\n",
" [6.3, 3.4, 5.6, 2.4],\n",
" [6.4, 3.1, 5.5, 1.8],\n",
" [6. , 3. , 4.8, 1.8],\n",
" [6.9, 3.1, 5.4, 2.1],\n",
" [6.7, 3.1, 5.6, 2.4],\n",
" [6.9, 3.1, 5.1, 2.3],\n",
" [5.8, 2.7, 5.1, 1.9],\n",
" [6.8, 3.2, 5.9, 2.3],\n",
" [6.7, 3.3, 5.7, 2.5],\n",
" [6.7, 3. , 5.2, 2.3],\n",
" [6.3, 2.5, 5. , 1.9],\n",
" [6.5, 3. , 5.2, 2. ],\n",
" [6.2, 3.4, 5.4, 2.3],\n",
" [5.9, 3. , 5.1, 1.8]])"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"iris.data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To find out what the four features are, we can list them:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['sepal length (cm)',\n",
" 'sepal width (cm)',\n",
" 'petal length (cm)',\n",
" 'petal width (cm)']"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"iris.feature_names"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similarly, we can also print the flowers' labels (a.k.a. targets):"
"Since the data is four-dimensional, we cannot visualize all features together. Instead, we can plot the distribution of the flower classes by a single feature using histograms."
"The goal of a supervised machine learning model is to make predictions on *new* (i.e., previously unseen) data.\n",
"\n",
"In a real-world application, we are not interested in marking an already labeled email as spam or not. Instead, we want to make the user's life easier by automatically classifying new incoming mail.\n",
"\n",
"In order to get an idea of how good a model **generalizes**, a best practice is to *split* the available data into a **training** and a **test** set. Only the former is used to train the model. Then, predictions are made on the test data and the predictions can be compared with the actual labels.\n",
"`sklearn` provides a function that not only randomizes the split but also ensures that the resulting label distribution is proportionate to the overall distribution, a concept called **stratification**."
"### A simple Classification Model: k-Nearest Neighbors"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To predict the label for any observation, just determine the k \"nearest\" observations in the training set (e.g., by Euclidean distance) and use a simple majority vote."
]
},
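{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of the idea (not how `sklearn` implements it internally), a minimal sketch in plain `numpy` could look like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal k-NN sketch: Euclidean distances plus a simple majority vote\n",
"from collections import Counter\n",
"\n",
"import numpy as np\n",
"\n",
"def knn_predict(X_train, y_train, x_new, k=5):\n",
"    # Distance of the new observation to every training observation\n",
"    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))\n",
"    # Indices of the k closest training observations\n",
"    nearest = np.argsort(distances)[:k]\n",
"    # Majority vote over their labels\n",
"    return Counter(y_train[nearest]).most_common(1)[0][0]"
]
},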
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"./static/knn.png\" width=\"60%\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training and Predicting with the Iris data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`sklearn` provides a uniform interface for all its classification models. They all have a `.fit()` and a `.predict()` method that abstract away the actual machine learning algorithm."
"It is important to mention that we can also \"predict\" the training set. Somehow surprisingly, the model does not get the training set 100% correct."
"In practice, the number of neighbors must be chosen before the model is trained. Therefore, it is possible to \"optimize\" it. This process is referred to as **hyper-parameter tuning**. For the Iris dataset this does not make much of a difference."
"Depending on the programming language one chooses, the following books are recommended:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- [Python Machine Learning](https://www.amazon.de/Python-Machine-Learning-scikit-learn-TensorFlow/dp/1787125939/ref=sr_1_1?__mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&keywords=python+machine+learning&qid=1575545025&sr=8-1) by Sebastian Raschka\n",
"\n",
"<img src=\"static/python_ml_book.png\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- [An Introduction to Statistical Learning](http://faculty.marshall.usc.edu/gareth-james/ISL/)\n",