intro-to-data-science/03_classification/00_content.ipynb

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Chapter 2: A First Example - Classifying Flowers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The purpose of this notebook is to look at a first example of a typical data science application, namely **statistical learning**, which is better known as **machine learning**. To do so, we look at a very popular example: the classification of flowers. Although the example is simplistic, it is a good one for beginners because it involves only a few variables. That makes it a lot easier to understand the technicalities and to visualize the dataset."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What is Machine Learning?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's first review a couple of general definitions to get started.\n",
"\n",
"Machine learning is the process of **extracting knowledge from data** in an automated fashion.\n",
"\n",
"Typical use cases are making predictions on new, unseen data or simply understanding a given dataset better by finding patterns in it.\n",
"\n",
"Central to machine learning is the idea of **automating** the **decision making** from data **without** the user specifying **explicit rules** for how these decisions should be made.\n",
"\n",
"That is in direct contrast to what we learned in the \"Expressing Logic\" section in Chapter 0, where we implemented decision criteria \"by hand\" with the `if` statement."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"./static/what_is_machine_learning.png\" width=\"60%\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Example Applications"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"static/examples.png\" width=\"60%\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Types of Machine Learning"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Concrete machine learning algorithms are commonly classified into three broad categories, which may overlap:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"static/3_types_of_machine_learning.png\" width=\"60%\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- **Supervised** (focus of the example in this notebook): Each entry in the dataset comes with a **label**. Examples are a list of emails where spam is already marked as such or a sample of handwritten digits where each image is annotated with the digit it shows. The goal is to use the historical data to make predictions.\n",
"\n",
"- **Unsupervised**: There is no desired output associated with a data entry. In a sense, one can think of unsupervised learning as a means of discovering labels from the data itself. A popular example is the clustering of customer data.\n",
"\n",
"- **Reinforcement**: Conceptually, this can be seen as \"learning by doing\". Some kind of **reward function** tells how good a predicted outcome is. A fairly recent and extremely popular example of this approach is AlphaGo."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Types of Supervised Learning"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Algorithms from the supervised learning category are often broken down further into classification and regression:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"static/classification_vs_regression.png\" width=\"60%\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- In **classification** tasks, the labels are *discrete*, such as \"spam\" or \"no spam\" for emails. Often, labels are nominal (e.g., colors) or ordinal (e.g., T-shirt sizes S, M, or L).\n",
"- In **regression** tasks, the labels are *continuous*. For example, given a person's age, education, and position, predict that person's salary."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example: Iris Flower Classification"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this example, we are given measurements of the size of various parts of Iris flowers. Each flower belongs to exactly one of three distinct Iris classes (i.e., species). The task is to classify a given flower into one of the three classes by looking only at the measurements."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"static/iris_data.png\" width=\"60%\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Importing the Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `sklearn` library provides several sample datasets, among them the Iris dataset.\n",
"\n",
"In a tabular visualization, the dataset could be portrayed somewhat like this:\n",
"\n",
"<img src=\"static/iris.png\" width=\"50%\">\n",
"\n",
"However, the data object imported from `sklearn` is organized slightly differently. In particular, the so-called **features** are separated from the **labels**."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import load_iris"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"iris = load_iris()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using Python's `dir()` function, we can inspect the data object, i.e., find out what **attributes** it has."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['DESCR',\n",
" 'data',\n",
" 'data_module',\n",
" 'feature_names',\n",
" 'filename',\n",
" 'frame',\n",
" 'target',\n",
" 'target_names']"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dir(iris)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`iris.data` provides us with a `numpy.ndarray`, where the first dimension equals the number of observed flowers (i.e., the **instances**) and the second dimension lists the various features of a flower."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[5.1, 3.5, 1.4, 0.2],\n",
" [4.9, 3. , 1.4, 0.2],\n",
" [4.7, 3.2, 1.3, 0.2],\n",
" [4.6, 3.1, 1.5, 0.2],\n",
" [5. , 3.6, 1.4, 0.2],\n",
" [5.4, 3.9, 1.7, 0.4],\n",
" [4.6, 3.4, 1.4, 0.3],\n",
" [5. , 3.4, 1.5, 0.2],\n",
" [4.4, 2.9, 1.4, 0.2],\n",
" [4.9, 3.1, 1.5, 0.1],\n",
" [5.4, 3.7, 1.5, 0.2],\n",
" [4.8, 3.4, 1.6, 0.2],\n",
" [4.8, 3. , 1.4, 0.1],\n",
" [4.3, 3. , 1.1, 0.1],\n",
" [5.8, 4. , 1.2, 0.2],\n",
" [5.7, 4.4, 1.5, 0.4],\n",
" [5.4, 3.9, 1.3, 0.4],\n",
" [5.1, 3.5, 1.4, 0.3],\n",
" [5.7, 3.8, 1.7, 0.3],\n",
" [5.1, 3.8, 1.5, 0.3],\n",
" [5.4, 3.4, 1.7, 0.2],\n",
" [5.1, 3.7, 1.5, 0.4],\n",
" [4.6, 3.6, 1. , 0.2],\n",
" [5.1, 3.3, 1.7, 0.5],\n",
" [4.8, 3.4, 1.9, 0.2],\n",
" [5. , 3. , 1.6, 0.2],\n",
" [5. , 3.4, 1.6, 0.4],\n",
" [5.2, 3.5, 1.5, 0.2],\n",
" [5.2, 3.4, 1.4, 0.2],\n",
" [4.7, 3.2, 1.6, 0.2],\n",
" [4.8, 3.1, 1.6, 0.2],\n",
" [5.4, 3.4, 1.5, 0.4],\n",
" [5.2, 4.1, 1.5, 0.1],\n",
" [5.5, 4.2, 1.4, 0.2],\n",
" [4.9, 3.1, 1.5, 0.2],\n",
" [5. , 3.2, 1.2, 0.2],\n",
" [5.5, 3.5, 1.3, 0.2],\n",
" [4.9, 3.6, 1.4, 0.1],\n",
" [4.4, 3. , 1.3, 0.2],\n",
" [5.1, 3.4, 1.5, 0.2],\n",
" [5. , 3.5, 1.3, 0.3],\n",
" [4.5, 2.3, 1.3, 0.3],\n",
" [4.4, 3.2, 1.3, 0.2],\n",
" [5. , 3.5, 1.6, 0.6],\n",
" [5.1, 3.8, 1.9, 0.4],\n",
" [4.8, 3. , 1.4, 0.3],\n",
" [5.1, 3.8, 1.6, 0.2],\n",
" [4.6, 3.2, 1.4, 0.2],\n",
" [5.3, 3.7, 1.5, 0.2],\n",
" [5. , 3.3, 1.4, 0.2],\n",
" [7. , 3.2, 4.7, 1.4],\n",
" [6.4, 3.2, 4.5, 1.5],\n",
" [6.9, 3.1, 4.9, 1.5],\n",
" [5.5, 2.3, 4. , 1.3],\n",
" [6.5, 2.8, 4.6, 1.5],\n",
" [5.7, 2.8, 4.5, 1.3],\n",
" [6.3, 3.3, 4.7, 1.6],\n",
" [4.9, 2.4, 3.3, 1. ],\n",
" [6.6, 2.9, 4.6, 1.3],\n",
" [5.2, 2.7, 3.9, 1.4],\n",
" [5. , 2. , 3.5, 1. ],\n",
" [5.9, 3. , 4.2, 1.5],\n",
" [6. , 2.2, 4. , 1. ],\n",
" [6.1, 2.9, 4.7, 1.4],\n",
" [5.6, 2.9, 3.6, 1.3],\n",
" [6.7, 3.1, 4.4, 1.4],\n",
" [5.6, 3. , 4.5, 1.5],\n",
" [5.8, 2.7, 4.1, 1. ],\n",
" [6.2, 2.2, 4.5, 1.5],\n",
" [5.6, 2.5, 3.9, 1.1],\n",
" [5.9, 3.2, 4.8, 1.8],\n",
" [6.1, 2.8, 4. , 1.3],\n",
" [6.3, 2.5, 4.9, 1.5],\n",
" [6.1, 2.8, 4.7, 1.2],\n",
" [6.4, 2.9, 4.3, 1.3],\n",
" [6.6, 3. , 4.4, 1.4],\n",
" [6.8, 2.8, 4.8, 1.4],\n",
" [6.7, 3. , 5. , 1.7],\n",
" [6. , 2.9, 4.5, 1.5],\n",
" [5.7, 2.6, 3.5, 1. ],\n",
" [5.5, 2.4, 3.8, 1.1],\n",
" [5.5, 2.4, 3.7, 1. ],\n",
" [5.8, 2.7, 3.9, 1.2],\n",
" [6. , 2.7, 5.1, 1.6],\n",
" [5.4, 3. , 4.5, 1.5],\n",
" [6. , 3.4, 4.5, 1.6],\n",
" [6.7, 3.1, 4.7, 1.5],\n",
" [6.3, 2.3, 4.4, 1.3],\n",
" [5.6, 3. , 4.1, 1.3],\n",
" [5.5, 2.5, 4. , 1.3],\n",
" [5.5, 2.6, 4.4, 1.2],\n",
" [6.1, 3. , 4.6, 1.4],\n",
" [5.8, 2.6, 4. , 1.2],\n",
" [5. , 2.3, 3.3, 1. ],\n",
" [5.6, 2.7, 4.2, 1.3],\n",
" [5.7, 3. , 4.2, 1.2],\n",
" [5.7, 2.9, 4.2, 1.3],\n",
" [6.2, 2.9, 4.3, 1.3],\n",
" [5.1, 2.5, 3. , 1.1],\n",
" [5.7, 2.8, 4.1, 1.3],\n",
" [6.3, 3.3, 6. , 2.5],\n",
" [5.8, 2.7, 5.1, 1.9],\n",
" [7.1, 3. , 5.9, 2.1],\n",
" [6.3, 2.9, 5.6, 1.8],\n",
" [6.5, 3. , 5.8, 2.2],\n",
" [7.6, 3. , 6.6, 2.1],\n",
" [4.9, 2.5, 4.5, 1.7],\n",
" [7.3, 2.9, 6.3, 1.8],\n",
" [6.7, 2.5, 5.8, 1.8],\n",
" [7.2, 3.6, 6.1, 2.5],\n",
" [6.5, 3.2, 5.1, 2. ],\n",
" [6.4, 2.7, 5.3, 1.9],\n",
" [6.8, 3. , 5.5, 2.1],\n",
" [5.7, 2.5, 5. , 2. ],\n",
" [5.8, 2.8, 5.1, 2.4],\n",
" [6.4, 3.2, 5.3, 2.3],\n",
" [6.5, 3. , 5.5, 1.8],\n",
" [7.7, 3.8, 6.7, 2.2],\n",
" [7.7, 2.6, 6.9, 2.3],\n",
" [6. , 2.2, 5. , 1.5],\n",
" [6.9, 3.2, 5.7, 2.3],\n",
" [5.6, 2.8, 4.9, 2. ],\n",
" [7.7, 2.8, 6.7, 2. ],\n",
" [6.3, 2.7, 4.9, 1.8],\n",
" [6.7, 3.3, 5.7, 2.1],\n",
" [7.2, 3.2, 6. , 1.8],\n",
" [6.2, 2.8, 4.8, 1.8],\n",
" [6.1, 3. , 4.9, 1.8],\n",
" [6.4, 2.8, 5.6, 2.1],\n",
" [7.2, 3. , 5.8, 1.6],\n",
" [7.4, 2.8, 6.1, 1.9],\n",
" [7.9, 3.8, 6.4, 2. ],\n",
" [6.4, 2.8, 5.6, 2.2],\n",
" [6.3, 2.8, 5.1, 1.5],\n",
" [6.1, 2.6, 5.6, 1.4],\n",
" [7.7, 3. , 6.1, 2.3],\n",
" [6.3, 3.4, 5.6, 2.4],\n",
" [6.4, 3.1, 5.5, 1.8],\n",
" [6. , 3. , 4.8, 1.8],\n",
" [6.9, 3.1, 5.4, 2.1],\n",
" [6.7, 3.1, 5.6, 2.4],\n",
" [6.9, 3.1, 5.1, 2.3],\n",
" [5.8, 2.7, 5.1, 1.9],\n",
" [6.8, 3.2, 5.9, 2.3],\n",
" [6.7, 3.3, 5.7, 2.5],\n",
" [6.7, 3. , 5.2, 2.3],\n",
" [6.3, 2.5, 5. , 1.9],\n",
" [6.5, 3. , 5.2, 2. ],\n",
" [6.2, 3.4, 5.4, 2.3],\n",
" [5.9, 3. , 5.1, 1.8]])"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"iris.data"
]
},
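{
"cell_type": "markdown",
"metadata": {},
"source": [
"Printing the whole array is rarely necessary. As a small sketch (not part of the original walkthrough), the array's `.shape` attribute confirms the dimensions described above: one row per flower and one column per feature."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import load_iris\n",
"\n",
"iris = load_iris()\n",
"\n",
"# (number of instances, number of features)\n",
"print(iris.data.shape)\n",
"\n",
"# one label per instance\n",
"print(iris.target.shape)"
]
},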
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To find out what the four features are, we can list them:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['sepal length (cm)',\n",
" 'sepal width (cm)',\n",
" 'petal length (cm)',\n",
" 'petal width (cm)']"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"iris.feature_names"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similarly, we can also print the flowers' labels (a.k.a. targets):"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
" 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
" 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,\n",
" 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,\n",
" 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"iris.target"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The three flower classes are encoded with integers. Let's show the corresponding names:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['setosa', 'versicolor', 'virginica'], dtype='<U10')"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"iris.target_names"
]
},
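{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the integer labels are valid indices into `iris.target_names`, they can be translated into class names with NumPy's array indexing. The following sketch is an illustrative addition; the variable name `label_names` is chosen here and does not appear elsewhere in the notebook. It also counts how many flowers belong to each class."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.datasets import load_iris\n",
"\n",
"iris = load_iris()\n",
"\n",
"# Indexing the array of names with the integer labels yields one name per flower.\n",
"label_names = iris.target_names[iris.target]\n",
"print(label_names[:3])\n",
"\n",
"# np.bincount() counts how often each integer label occurs.\n",
"for name, count in zip(iris.target_names, np.bincount(iris.target)):\n",
"    print(name, count)"
]
},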
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Simple Visualizations"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since the data is four-dimensional, we cannot visualize all features together in a single plot. Instead, we can use histograms to plot the distribution of a single feature for each flower class."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"feature_index = 2\n",
"colors = [\"blue\", \"red\", \"green\"]\n",
"\n",
"for label, color in zip(range(len(iris.target_names)), colors):\n",
" plt.hist(\n",
" iris.data[iris.target == label, feature_index],\n",
" label=iris.target_names[label],\n",
" color=color,\n",
" )\n",
"\n",
"plt.xlabel(iris.feature_names[feature_index])\n",
"plt.legend(loc=\"upper right\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Also, we can draw scatter plots of two features."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"first_feature_index = 1\n",
"second_feature_index = 0\n",
"\n",
"colors = [\"blue\", \"red\", \"green\"]\n",
"\n",
"for label, color in zip(range(len(iris.target_names)), colors):\n",
" plt.scatter(\n",
" iris.data[iris.target == label, first_feature_index],\n",
" iris.data[iris.target == label, second_feature_index],\n",
" label=iris.target_names[label],\n",
" c=color,\n",
" )\n",
"\n",
"plt.xlabel(iris.feature_names[first_feature_index])\n",
"plt.ylabel(iris.feature_names[second_feature_index])\n",
"plt.legend(loc=\"upper left\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the higher-level library `pandas`, one can easily create a so-called **scatterplot matrix**."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 800x800 with 16 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)\n",
"\n",
"pd.plotting.scatter_matrix(iris_df, figsize=(8, 8));"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Concept of Generalization"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The goal of a supervised machine learning model is to make predictions on *new* (i.e., previously unseen) data.\n",
"\n",
"In a real-world application, we are not interested in classifying an already labeled email as spam or not. Instead, we want to make the user's life easier by automatically classifying new incoming mail.\n",
"\n",
"To get an idea of how well a model **generalizes**, a best practice is to *split* the available data into a **training** and a **test** set. Only the former is used to train the model. Then, predictions are made on the test data and compared with the actual labels.\n",
"\n",
"Common splits are 75/25 or 60/40."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"./static/generalization.png\" width=\"60%\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Train/Test Split for the Iris data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is common practice to refer to the feature matrix as `X` and the vector of labels as `y`."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"X, y = iris.data, iris.target"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A naive splitting approach could be to use array slicing."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"X_train, X_test, y_train, y_test = X[0:100, :], X[100:150, :], y[0:100], y[100:150]"
]
},
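{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, because the Iris dataset is sorted by label, this would lead to unbalanced label distributions. Here, the test set is made up only of flowers of the same type (label 2), and the training set contains no flowers of that type at all."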
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,\n",
" 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,\n",
" 2, 2, 2, 2, 2, 2])"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_test"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0, 0, 50])"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.bincount(y_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`sklearn` provides the `train_test_split()` function, which not only randomizes the split but, when given the labels via the `stratify` argument, also ensures that the resulting label distribution is proportionate to the overall distribution, a concept called **stratification**."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([2, 1, 2, 1, 2, 2, 1, 1, 0, 2, 0, 0, 2, 2, 0, 2, 1, 0, 0, 0, 1, 0,\n",
" 1, 2, 2, 1, 1, 1, 1, 0, 2, 2, 1, 0, 2, 0, 0, 0, 0, 1, 1, 0, 2, 2,\n",
" 1])"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train, X_test, y_train, y_test = train_test_split(\n",
" X, y, train_size=0.7, test_size=0.3, random_state=42, stratify=y\n",
")\n",
"\n",
"y_test"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([15, 15, 15])"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.bincount(y_test)"
]
},
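{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what the `stratify` argument contributes, one can compare the label counts of a test set created without it to one created with it. This is a sketch added for illustration; the names `y_test_plain` and `y_test_strat` are chosen here. Without `stratify`, the counts depend on the random shuffle and need not be equal across the classes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.datasets import load_iris\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"X, y = load_iris(return_X_y=True)\n",
"\n",
"# Random split without stratification\n",
"_, _, _, y_test_plain = train_test_split(X, y, test_size=0.3, random_state=42)\n",
"\n",
"# Random split that preserves the overall label proportions\n",
"_, _, _, y_test_strat = train_test_split(\n",
"    X, y, test_size=0.3, random_state=42, stratify=y\n",
")\n",
"\n",
"print(np.bincount(y_test_plain))\n",
"print(np.bincount(y_test_strat))"
]
},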
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### A simple Classification Model: k-Nearest Neighbors"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To predict the label for a new observation, determine the k \"nearest\" observations in the training set (e.g., by Euclidean distance) and take a simple majority vote among their labels."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"./static/knn.png\" width=\"60%\">"
]
},
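{
"cell_type": "markdown",
"metadata": {},
"source": [
"The majority vote described above can be sketched in a few lines of `numpy`. This is only an illustration under simplifying assumptions and not how `sklearn` implements the algorithm internally; the function `knn_predict()` and the toy data `X_toy` and `y_toy` are hypothetical names made up for this example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"import numpy as np\n",
"\n",
"\n",
"def knn_predict(X_known, y_known, x_new, k=5):\n",
"    # Euclidean distance from x_new to every known observation\n",
"    distances = np.sqrt(np.sum((X_known - x_new) ** 2, axis=1))\n",
"    # Indices of the k observations with the smallest distances\n",
"    nearest = np.argsort(distances)[:k]\n",
"    # The label that occurs most often among these k neighbors wins\n",
"    votes = Counter(y_known[nearest].tolist())\n",
"    return votes.most_common(1)[0][0]\n",
"\n",
"\n",
"# Hypothetical toy data: two well separated clusters labeled 0 and 1\n",
"X_toy = np.array(\n",
"    [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]\n",
")\n",
"y_toy = np.array([0, 0, 0, 1, 1, 1])\n",
"\n",
"print(knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3))\n",
"print(knn_predict(X_toy, y_toy, np.array([5.1, 5.0]), k=3))"
]
},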
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training and Predicting with the Iris data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`sklearn` provides a uniform interface for all its classification models. They all have a `.fit()` and a `.predict()` method that abstract away the actual machine learning algorithm."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.neighbors import KNeighborsClassifier"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"knn = KNeighborsClassifier(n_neighbors=5)\n",
"\n",
"knn.fit(X_train, y_train)\n",
"\n",
"y_pred = knn.predict(X_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us list the labels predicted for the test set ..."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([2, 1, 2, 1, 2, 2, 1, 1, 0, 2, 0, 0, 2, 2, 0, 2, 1, 0, 0, 0, 1, 0,\n",
" 1, 2, 2, 1, 1, 1, 1, 0, 2, 2, 1, 0, 2, 0, 0, 0, 0, 1, 1, 0, 1, 2,\n",
" 1])"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_pred"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"... and compare them with the actual labels."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([2, 1, 2, 1, 2, 2, 1, 1, 0, 2, 0, 0, 2, 2, 0, 2, 1, 0, 0, 0, 1, 0,\n",
" 1, 2, 2, 1, 1, 1, 1, 0, 2, 2, 1, 0, 2, 0, 0, 0, 0, 1, 1, 0, 2, 2,\n",
" 1])"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_test"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`numpy` shows us the indices where the predictions are wrong."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(array([42]),)"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.where(y_pred != y_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, we can calculate the fraction of correctly predicted flowers."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"np.float64(0.9777777777777777)"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.sum(y_pred == y_test) / len(y_test)"
]
},
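{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same fraction is commonly called the **accuracy**, and `sklearn` offers two shortcuts for it: the `accuracy_score()` function and the `.score()` method of every classifier. The following sketch reloads the data and repeats the split so that it runs on its own; the names `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr`, `y_te`, and `knn_demo` are chosen only for this example, and it is assumed that `X` and `y` above come from `load_iris()`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import load_iris\n",
"from sklearn.metrics import accuracy_score\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"\n",
"# Recreate the stratified split and the model so that this cell is self-contained\n",
"X_demo, y_demo = load_iris(return_X_y=True)\n",
"X_tr, X_te, y_tr, y_te = train_test_split(\n",
"    X_demo, y_demo, train_size=0.7, test_size=0.3, random_state=42, stratify=y_demo\n",
")\n",
"knn_demo = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)\n",
"\n",
"# Both calls return the fraction of correctly predicted labels\n",
"acc_function = accuracy_score(y_te, knn_demo.predict(X_te))\n",
"acc_method = knn_demo.score(X_te, y_te)\n",
"\n",
"print(acc_function, acc_method)"
]
},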
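{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also \"predict\" the training set itself. Somewhat surprisingly, the model does not get the training set 100% correct: with `n_neighbors=5`, a flower near the border of its class can be outvoted by its other four neighbors even though it is its own nearest neighbor."
]
},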
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"np.float64(0.9714285714285714)"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_train_pred = knn.predict(X_train)\n",
"\n",
"np.sum(y_train_pred == y_train) / len(y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A visualization of the test set reveals that the misclassified flower lies right \"at the borderline\" between two neighboring clusters of flower classes."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"first_feature_index = 3\n",
"second_feature_index = 0\n",
"\n",
"correct_idx = np.where(y_pred == y_test)[0]\n",
"incorrect_idx = np.where(y_pred != y_test)[0]\n",
"\n",
"colors = [\"darkblue\", \"darkgreen\", \"gray\"]\n",
"\n",
"for n, color in enumerate(colors):\n",
"    idx = np.where(y_test == n)[0]\n",
"    plt.scatter(\n",
"        X_test[idx, first_feature_index],\n",
"        X_test[idx, second_feature_index],\n",
"        color=color,\n",
"        label=iris.target_names[n],\n",
"    )\n",
"\n",
"plt.scatter(\n",
"    X_test[incorrect_idx, first_feature_index],\n",
"    X_test[incorrect_idx, second_feature_index],\n",
"    color=\"darkred\",\n",
"    label=\"misclassified\",\n",
")\n",
"\n",
"# Feature 3 is the petal width and feature 0 is the sepal length\n",
"plt.xlabel(\"petal width [cm]\")\n",
"plt.ylabel(\"sepal length [cm]\")\n",
"plt.legend(loc=\"best\")\n",
"plt.title(\"Iris Classification results\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The number of neighbors is not learned from the data but must be chosen before the model is trained. Such a setting is called a hyper-parameter, and the search for a good value is referred to as **hyper-parameter tuning**. For the Iris dataset, the choice does not make much of a difference: the accuracy on the test set stays between roughly 91% and 98% for all values from 1 to 30."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 0.9333333333333333\n",
"2 0.9111111111111111\n",
"3 0.9555555555555556\n",
"4 0.9555555555555556\n",
"5 0.9777777777777777\n",
"6 0.9333333333333333\n",
"7 0.9555555555555556\n",
"8 0.9333333333333333\n",
"9 0.9555555555555556\n",
"10 0.9555555555555556\n",
"11 0.9333333333333333\n",
"12 0.9333333333333333\n",
"13 0.9333333333333333\n",
"14 0.9333333333333333\n",
"15 0.9555555555555556\n",
"16 0.9555555555555556\n",
"17 0.9555555555555556\n",
"18 0.9555555555555556\n",
"19 0.9555555555555556\n",
"20 0.9333333333333333\n",
"21 0.9555555555555556\n",
"22 0.9333333333333333\n",
"23 0.9555555555555556\n",
"24 0.9333333333333333\n",
"25 0.9333333333333333\n",
"26 0.9555555555555556\n",
"27 0.9333333333333333\n",
"28 0.9111111111111111\n",
"29 0.9111111111111111\n",
"30 0.9111111111111111\n"
]
}
],
"source": [
"for i in range(1, 31):\n",
"    knn = KNeighborsClassifier(n_neighbors=i)\n",
"    knn.fit(X_train, y_train)\n",
"    y_pred = knn.predict(X_test)\n",
"    correct = np.sum(y_pred == y_test) / len(y_test)\n",
"    print(i, correct)"
]
},
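{
"cell_type": "markdown",
"metadata": {},
"source": [
"The loop above compares the candidates on the test set, which tends to make the best score look a bit too optimistic. A common alternative, sketched here as one possible approach and not as the only correct one, is to choose the number of neighbors with **cross-validation** on the training set and to evaluate the test set only once at the end. The names `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr`, `y_te`, `cv_scores`, `best_k`, and `final_knn` are chosen only for this example, and it is assumed that `X` and `y` above come from `load_iris()`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import load_iris\n",
"from sklearn.model_selection import cross_val_score, train_test_split\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"\n",
"# Recreate the stratified split so that this cell is self-contained\n",
"X_demo, y_demo = load_iris(return_X_y=True)\n",
"X_tr, X_te, y_tr, y_te = train_test_split(\n",
"    X_demo, y_demo, train_size=0.7, test_size=0.3, random_state=42, stratify=y_demo\n",
")\n",
"\n",
"# Mean 5-fold cross-validation accuracy on the training set for each candidate\n",
"cv_scores = {}\n",
"for k in range(1, 31):\n",
"    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=5)\n",
"    cv_scores[k] = scores.mean()\n",
"\n",
"# Candidate with the highest mean accuracy (the smallest k wins a tie)\n",
"best_k = max(cv_scores, key=cv_scores.get)\n",
"\n",
"# Only now evaluate the test set, once, with the chosen number of neighbors\n",
"final_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)\n",
"print(best_k, final_knn.score(X_te, y_te))"
]
},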
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Further Resources on Machine Learning"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Depending on the programming language one chooses, the following books are recommended:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- [Python Machine Learning](https://www.amazon.de/Python-Machine-Learning-scikit-learn-TensorFlow/dp/1787125939/ref=sr_1_1?__mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&keywords=python+machine+learning&qid=1575545025&sr=8-1) by Sebastian Raschka\n",
"\n",
"<img src=\"static/python_ml_book.png\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- [An Introduction to Statistical Learning](http://faculty.marshall.usc.edu/gareth-james/ISL/)\n",
"\n",
"<img src=\"static/r_ml_book.png\">"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "intro-to-data-science",
"language": "python",
"name": "intro-to-data-science"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.4"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}