{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Workshop: Machine Learning for Beginners" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What Is Machine Learning?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Machine learning is the process of **extracting knowledge from data** in an automated fashion.\n", "\n", "Typical use cases are making predictions on new, unseen data or simply understanding a given dataset better by finding patterns in it.\n", "\n", "Central to machine learning is the idea of **automating** **decision making** from data **without** the user specifying **explicit rules** for how these decisions should be made." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Examples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Types of Machine Learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- **Supervised** (focus of this workshop): Each entry in the dataset comes with a \"label\". Examples are a list of emails where spam mail is already marked as such, or a sample of handwritten digits annotated with the digit each one shows. The goal is to use the historical data to make predictions.\n", "\n", "- **Unsupervised**: There is no desired output associated with a data entry. In a sense, one can think of unsupervised learning as a means of discovering labels from the data itself. A popular example is the clustering of customer data.\n", "\n", "- **Reinforcement**: Conceptually, this can be seen as \"learning by doing\". A \"reward function\" tells the learner how good an outcome is. For example, game-playing programs such as AlphaZero (chess, Go) are trained with this approach." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Types of Supervised Learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- In **classification** tasks, the labels are *discrete*, such as \"spam\" or \"no spam\" for emails. Labels can be nominal (e.g., colors) or ordinal (e.g., T-shirt sizes S, M, or L).\n", "- In **regression**, the labels are *continuous*. For example, given a person's age, education, and position, infer their salary." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Case Study: Iris Flower Classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Python for Scientific Computing: A Brief Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Python's standard library offers only basic mathematical tools. However, over time, many open-source libraries have emerged that are useful for building machine learning applications.\n", "\n", "Among the popular ones are [numpy](https://numpy.org/) (numerical computations, linear algebra), [pandas](https://pandas.pydata.org/) (data processing), [matplotlib](https://matplotlib.org/) (visualizations), and [scikit-learn](https://scikit-learn.org/stable/index.html) (machine learning algorithms).\n", "\n", "First, import the libraries:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following line tells this Jupyter notebook to render the visualizations inside the notebook and not in a new window. It is an IPython \"magic\" command, not part of the Python language itself." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Standard Python can do basic arithmetic operations ..." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = 1\n", "b = 2\n", "\n", "c = a + b\n", "\n", "c" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "... and provides some simple **data structures**, such as a list of values." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1, 2, 3, 4]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "values = [a, b, c, 4]\n", "\n", "values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Numpy provides a data structure called an **n-dimensional array**. This may sound fancy at first, but with only 1 or 2 dimensions, an array simply represents a vector or a matrix. Array operations are much faster than equivalent Python loops because they are implemented in the [C language](https://en.wikipedia.org/wiki/C_%28programming_language%29)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To create an array, we use the [array()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html#numpy-array) function from the imported `np` module and provide it with a `list` of values." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1, 2, 3])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "v1 = np.array([1, 2, 3])\n", "\n", "v1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A vector can be multiplied by a scalar." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([3, 6, 9])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "v2 = v1 * 3\n", "\n", "v2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To create a matrix, pass a list of lists instead, where each inner list is one row." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[1, 2, 3],\n", " [4, 5, 6]])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m1 = np.array([\n", "    [1, 2, 3],\n", "    [4, 5, 6],\n", "])\n", "\n", "m1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can use numpy to multiply a matrix by a vector to obtain a new vector ..." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([14, 32])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "v3 = np.dot(m1, v1)\n", "\n", "v3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "... or simply transpose the matrix." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[1, 4],\n", " [2, 5],\n", " [3, 6]])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m1.T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The rules of linear algebra still apply: it makes a difference whether a vector is multiplied by a matrix from the left or from the right. The following operation fails because the vector has 3 entries but the matrix has only 2 rows." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "ename": "ValueError", "evalue": "shapes (3,) and (2,3) not aligned: 3 (dim 0) != 2 (dim 0)", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdot\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mv1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mm1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m<__array_function__ internals>\u001b[0m in \u001b[0;36mdot\u001b[0;34m(*args, **kwargs)\u001b[0m\n", "\u001b[0;31mValueError\u001b[0m: shapes (3,) and (2,3) not aligned: 3 (dim 0) != 2 (dim 0)" ] } ], "source": [ "np.dot(v1, m1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to retrieve only a slice (= subset) of an array's data, we can \"index\" into it. For example, the first row of the matrix is ..." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1, 2, 3])" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m1[0, :]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "... while the second column is:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2, 5])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m1[:, 1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To access the bottom element of the rightmost column, two indices can be used." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "6" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m1[1, 2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Numpy also provides various other functions and constants, such as sine or pi. To illustrate the concept of **vectorization** (applying a function to a whole array at once instead of looping over its elements), let us calculate the sine curve over a range of values." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([-9.42477796, -9.23437841, -9.04397885, -8.8535793 , -8.66317974,\n", " -8.47278019, -8.28238063, -8.09198108, -7.90158152, -7.71118197,\n", " -7.52078241, -7.33038286, -7.1399833 , -6.94958375, -6.75918419,\n", " -6.56878464, -6.37838508, -6.18798553, -5.99758598, -5.80718642,\n", " -5.61678687, -5.42638731, -5.23598776, -5.0455882 , -4.85518865,\n", " -4.66478909, -4.47438954, -4.28398998, -4.09359043, -3.90319087,\n", " -3.71279132, -3.52239176, -3.33199221, -3.14159265, -2.9511931 ,\n", " -2.76079354, -2.57039399, -2.37999443, -2.18959488, -1.99919533,\n", " -1.80879577, -1.61839622, -1.42799666, -1.23759711, -1.04719755,\n", " -0.856798 , -0.66639844, -0.47599889, -0.28559933, -0.09519978,\n", " 0.09519978, 0.28559933, 0.47599889, 0.66639844, 0.856798 ,\n", " 1.04719755, 1.23759711, 1.42799666, 1.61839622, 1.80879577,\n", " 1.99919533, 2.18959488, 2.37999443, 2.57039399, 2.76079354,\n", " 2.9511931 , 3.14159265, 3.33199221, 3.52239176, 3.71279132,\n", " 3.90319087, 4.09359043, 4.28398998, 4.47438954, 4.66478909,\n", " 4.85518865, 5.0455882 , 5.23598776, 5.42638731, 5.61678687,\n", " 5.80718642, 5.99758598, 6.18798553, 6.37838508, 6.56878464,\n", " 6.75918419, 6.94958375, 7.1399833 , 7.33038286, 7.52078241,\n", " 7.71118197, 7.90158152, 8.09198108, 8.28238063, 8.47278019,\n", " 8.66317974, 8.8535793 , 9.04397885, 9.23437841, 9.42477796])" ] }, "execution_count": 14, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "x = np.linspace(-3*np.pi, 3*np.pi, 100)\n", "\n", "x" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([-3.67394040e-16, -1.89251244e-01, -3.71662456e-01, -5.40640817e-01,\n", " -6.90079011e-01, -8.14575952e-01, -9.09631995e-01, -9.71811568e-01,\n", " -9.98867339e-01, -9.89821442e-01, -9.45000819e-01, -8.66025404e-01,\n", " -7.55749574e-01, -6.18158986e-01, -4.58226522e-01, -2.81732557e-01,\n", " -9.50560433e-02, 9.50560433e-02, 2.81732557e-01, 4.58226522e-01,\n", " 6.18158986e-01, 7.55749574e-01, 8.66025404e-01, 9.45000819e-01,\n", " 9.89821442e-01, 9.98867339e-01, 9.71811568e-01, 9.09631995e-01,\n", " 8.14575952e-01, 6.90079011e-01, 5.40640817e-01, 3.71662456e-01,\n", " 1.89251244e-01, -1.22464680e-16, -1.89251244e-01, -3.71662456e-01,\n", " -5.40640817e-01, -6.90079011e-01, -8.14575952e-01, -9.09631995e-01,\n", " -9.71811568e-01, -9.98867339e-01, -9.89821442e-01, -9.45000819e-01,\n", " -8.66025404e-01, -7.55749574e-01, -6.18158986e-01, -4.58226522e-01,\n", " -2.81732557e-01, -9.50560433e-02, 9.50560433e-02, 2.81732557e-01,\n", " 4.58226522e-01, 6.18158986e-01, 7.55749574e-01, 8.66025404e-01,\n", " 9.45000819e-01, 9.89821442e-01, 9.98867339e-01, 9.71811568e-01,\n", " 9.09631995e-01, 8.14575952e-01, 6.90079011e-01, 5.40640817e-01,\n", " 3.71662456e-01, 1.89251244e-01, 1.22464680e-16, -1.89251244e-01,\n", " -3.71662456e-01, -5.40640817e-01, -6.90079011e-01, -8.14575952e-01,\n", " -9.09631995e-01, -9.71811568e-01, -9.98867339e-01, -9.89821442e-01,\n", " -9.45000819e-01, -8.66025404e-01, -7.55749574e-01, -6.18158986e-01,\n", " -4.58226522e-01, -2.81732557e-01, -9.50560433e-02, 9.50560433e-02,\n", " 2.81732557e-01, 4.58226522e-01, 6.18158986e-01, 7.55749574e-01,\n", " 8.66025404e-01, 9.45000819e-01, 9.89821442e-01, 9.98867339e-01,\n", " 9.71811568e-01, 9.09631995e-01, 8.14575952e-01, 6.90079011e-01,\n", " 5.40640817e-01, 3.71662456e-01, 1.89251244e-01, 
3.67394040e-16])" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y = np.sin(x)\n", "\n", "y" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With matplotlib's [plot()](https://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.plot) function we can visualize the sine curve." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.plot(x, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us quickly generate some random data and draw a scatter plot." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAD4CAYAAADiry33AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8li6FKAAAfH0lEQVR4nO3df5Ac9Xnn8fejZW0v2PGKoBC0IIuKOXG2KdBli5DC5wJsIyzjoCPOBYpzQYxPjs+us10+xeLOFRJyVyilihPfcWei2JxxHcHEQWw4QyyrDFUYn3+tWHECgw5i41gjgkRgMZh1vJKe+2N7xOyoe7Z7uqf72z2fV9XW7vTMzjwz0/30t5/vt79t7o6IiDTXsqoDEBGRwVKiFxFpOCV6EZGGU6IXEWk4JXoRkYY7ruoA4px00km+evXqqsMQEamNXbt2PevuK+LuCzLRr169munp6arDEBGpDTP7UdJ9Kt2IiDTckonezE4zs/vN7Ptm9qiZfTRafqKZ7TSzJ6LfyxP+/+roMU+Y2dVFvwEREektTYv+EPAJd38TcB7wYTN7E7AZ+Lq7nwF8Pbq9iJmdCFwP/BpwLnB90g5BREQGY8lE7+5Pu/tD0d8vAo8BE8BlwK3Rw24FNsT8+zpgp7s/5+7PAzuBS4oIXERE0slUozez1cBa4DvAye7+dHTXPwAnx/zLBPDjjtv7omVxz73RzKbNbPrgwYNZwhIRkR5Sj7oxs9cCdwIfc/efmNnR+9zdzSzX7Gjuvg3YBjA5OamZ1qQ2pmZabN2xl/2zc6wcH2PTujVsWBvbnhGpRKoWvZmNspDkb3P37dHiZ8zslOj+U4ADMf/aAk7ruH1qtEykEaZmWly3fQ+t2TkcaM3Ocd32PUzNaDWXcKQZdWPA54HH3P3THXfdDbRH0VwN/E3Mv+8ALjaz5VEn7MXRMpFG2LpjL3Pzhxctm5s/zNYdeyuKSORYaVr05wPvAy4ys93Rz3pgC/BOM3sCeEd0GzObNLPPAbj7c8AfAd+Lfm6Ilok0wv7ZuUzLRaqwZI3e3R8ELOHut8c8fhr4QMftW4Bb+g1QJGQrx8doxST1leNjFUQjEk9nxorksGndGsZGRxYtGxsdYdO6NRVFJHKsIOe6EamL9ugajbqRkCnRi+S0Ye2EErsETaUbEZGGU6IXEWk4JXoRkYZTohcRaTglehGRhlOiFxFpOCV6EZGGU6IXEWk4JXoRkYZTohcRaTglehGRhlOiFxFpOCV6EZGGU6IXEWk4JXoRkYZTohcRabglLzxiZrcAlwIH3P0t0bI7gPa10saBWXc/J+Z/nwJeBA4Dh9x9sqC4RUQkpTRXmPoCcBPwxfYCd//t9t9m9ifACz3+/0J3f7bfAEVEJJ8lE727P2Bmq+PuMzMD/jVwUbFhiYhIUfLW6P8l8Iy7P5FwvwNfM7NdZrax1xOZ2UYzmzaz6YMHD+YMS0RE2vIm+iuB23vc/1Z3/xfAu4APm9nbkh7o7tvcfdLdJ1esWJEzLBERaes7
0ZvZccDlwB1Jj3H3VvT7AHAXcG6/ryciIv3J06J/B/C4u++Lu9PMTjCz17X/Bi4GHsnxeiIi0oclE72Z3Q58C1hjZvvM7NrorivoKtuY2Uozuze6eTLwoJk9DHwXuMfdv1pc6CIikkaaUTdXJiy/JmbZfmB99PcPgLNzxiciIjnpzFgRkYZTohcRaTglehGRhkszBYKI5DA102Lrjr3sn51j5fgYm9atYcPaiarDkiGiRC8yQFMzLa7bvoe5+cMAtGbnuG77HgAl+4h2hIOn0o3IAG3dsfdokm+bmz/M1h17K4ooLO0dYWt2DueVHeHUTKvq0BpFiV5kgPbPzmVaPmy0IyyHEr3IAK0cH8u0fNhoR1gOJXqRAdq0bg1joyOLlo2NjrBp3ZqE/xgu2hGWQ4leZIA2rJ3gxsvPYmJ8DAMmxse48fKz1NkY0Y6wHBp1IzJgG9ZOKLEnaH8uGnUzWEr0IlIp7QgHT6UbEZGGU6IXEWk4JXoRkYZTohcRaTglehGRhlOiFxFpuDTXjL3FzA6Y2SMdy/7AzFpmtjv6WZ/wv5eY2V4ze9LMNhcZuIiIpJOmRf8F4JKY5X/q7udEP/d232lmI8B/B94FvAm40szelCdYERHJbslE7+4PAM/18dznAk+6+w/c/efAl4DL+ngeERHJIU+N/iNm9n+j0s7ymPsngB933N4XLYtlZhvNbNrMpg8ePJgjLBER6dRvov8s8CvAOcDTwJ/kDcTdt7n7pLtPrlixIu/TiYhIpK9E7+7PuPthdz8C/AULZZpuLeC0jtunRstERKREfSV6Mzul4+a/Ah6Jedj3gDPM7HQzexVwBXB3P68nIiL9W3L2SjO7HbgAOMnM9gHXAxeY2TmAA08BH4weuxL4nLuvd/dDZvYRYAcwAtzi7o8O5F2IiEgic/eqYzjG5OSkT09PVx2GiEhtmNkud5+Mu09nxoqINJwSvYhIwynRi4g0nBK9iEjDKdGLiDScEr2ISMMp0YuINJwSvYhIwynRi4g0nBK9iEjDKdGLiDTckpOaiYiEYGqmxdYde9k/O8fK8TE2rVvDhrWJ1zKSDkr0IhK8qZkW123fw9z8YQBas3Nct30PgJJ9Ckr0Ijmppdm/tJ/d1h17jyb5trn5w2zdsVefdQpK9CI5qKXZvyyf3f7ZudjnSFoui6kzViSHXi1N6S3LZ7dyfCz2OZKWy2JK9CI5qKXZvyyf3aZ1axgbHVm0bGx0hE3r1gwktqZRohfJQS3N/mX57DasneDGy89iYnwMAybGx7jx8rNUHktJNXqRHDatW7OozgxqaaaV9bPbsHZCib1PaS4OfgtwKXDA3d8SLdsKvAf4OfB3wO+4+2zM/z4FvAgcBg4lXc9QpK7aiUejbrLTZ1eeJS8ObmZvA14CvtiR6C8G7nP3Q2b2xwDu/smY/30KmHT3Z7MEpYuDi4hkk+vi4O7+APBc17Kvufuh6Oa3gVNzRykiIgNRRI3+/cAdCfc58DUzc+DP3X1b0pOY2UZgI8CqVasKCEskDDqhSqqWK9Gb2X8CDgG3JTzkre7eMrNfAnaa2ePREcIxop3ANlgo3eSJSyQUOqFKQtD38Eozu4aFTtqrPKHQ7+6t6PcB4C7g3H5fT6SOdEKVhKCvRG9mlwC/B/yGu7+c8JgTzOx17b+Bi4FH+g1UpI50QpWEYMlEb2a3A98C1pjZPjO7FrgJeB0L5ZjdZnZz9NiVZnZv9K8nAw+a2cPAd4F73P2rA3kXIoHSCVUSgiVr9O5+Zczizyc8dj+wPvr7B8DZuaITqTmdUCUh0JmxIl2KHCWjk4IkBEr0Ih0GMUpGp+5L1TSpmUgHjZKRJlKiF+mgUTLSREr0Ih00SkaaSIlepIMucCFNpM5YkQ4aJSNNpEQv0kWjZKRpVLoREWk4tehFJJamV24OJfqAaMOSUGh65WZR6SYQ7Q2rNTuH88qGNTXTqjo0GUI6caxZ1KIPRK8NSy0oKVvdTxzT0fFiatEH
ou4bljRLnU8c09HxsZToA1HnDUsWksv5W+7j9M33cP6W+2qfVOp84pjKTsdS6SYQmre8PrrLAheeuYI7d7Ua1XFZ5xPHdHR8LCX6QNR5wxomcaNRbvv239N90eQm9K/U9cSxleNjtGKS+jAfHSvRB6SuG9YwiSsLdCf5tmFuQVZJR8fHSlWjN7NbzOyAmT3SsexEM9tpZk9Ev5cn/O/V0WOeMLOriwpcpApZkvcwtyCrtGHtBDdefhYT42MYMDE+xo2XnzXUjai0LfovsHBB8C92LNsMfN3dt5jZ5uj2Jzv/ycxOBK4HJllo+Owys7vd/fm8gYtUIaksYCxu2Q97C7JqOjpeLFWL3t0fAJ7rWnwZcGv0963Ahph/XQfsdPfnouS+E7ikz1iF5o3uqJuk0ShXnbcqiBak1g+Jk6dGf7K7Px39/Q/AyTGPmQB+3HF7X7TsGGa2EdgIsGrVqhxhNZdOS69eyJ3madcPnUw0fArpjHV3N7OkPqm0z7EN2AYwOTmZ67maSmfPhiHUskCa9UONheGU54SpZ8zsFIDo94GYx7SA0zpunxotkz5ofLD0kmb90MlEwylPor8baI+iuRr4m5jH7AAuNrPl0aici6Nl0gedPSu9pFk/1FgYTmmHV94OfAtYY2b7zOxaYAvwTjN7AnhHdBszmzSzzwG4+3PAHwHfi35uiJZJH+p8Wnpd1aFzsx1ja3YO67qve/1QY2E4mXt45fDJyUmfnp6uOowgqSOtPN31bFhInCGNyY6LsT3Uc3xsFDOYfXn+6LoCBP+epD9mtsvdJ2PvU6JvribvFMp4b+1WcreJ8TG+ufmiQl+rX0kxjo+N8k+HjsQmdEgeNdTkdabpeiV6TYHQUE0eXVHWe6uynp024SbFMjs3f8yydqfrNzdfFPtcTV5nhp0SfUM1bShmZ+JbZsbhriPRQby3qibHypJwk2JM0msnVdQ6o6OC8Gg++oZq0uiK7gtJdCf5tqLfW1Wd31mGQCbFuPz40djn7rWTKmKdSbrox6em9gTfqd1katE3VJ2mal2qBRiX+OIU/d6qOgs2S8JNihHiO1177aSKWGeSdlKdUzmrJFQ+JfqGqstUrWnKFGlalIN6b1WcBZuUcF8/Ft9K7xVjlp1UEetM0nfVxPn660SJvqFCnpOlU5q6cFLiGzHjiHuw761fm9atYdOXH2b+yOL0+NOfH2JqppX6fWbdSRWxzmTpM6hjGbGulOgbLNQ5WTqlKVMktTSrGPtdRkfjhrUT/OH/fpTnX148cmb+sA+8FZx3nYn7rrqncG5LUxJSx24xlOilUmnqwqEcnZQ5/HC2K8m3VdkKTpN0476r7mvqQrqSkIZ7FkeJXiqVti4cwtFJmUNWQ+tMz5J0476ryTecmHlHnfR5f+KvHo59XUmmRC+VCqW1nkaZQ1ZD60zPu5PrZ0ed9LkedlfLPiMleqlcCK31NMpsZZe5A0xTkqnivIxeHbsatZONEr1ISmW3ssvYAaYtySQl3fGEE7OKEPd5d9KonfR0ZqxIShvWTnDj5WcFcW3YoqQ9C3fTujWMjnRPggwv/ezQwM5ybX/eI3bs60KYJ/+FSi16GTp5hux1t7Lbc8GH3r+QJG1JZsPaCf7g7kePmSxt/shgh3y2nzek/oo6UqKXoVLkkL0mDP/L0u/wQsyMmDD4EkqdOuxDpUQvQ6XIIZJNmCE0S79DlUM+69Jh369BnximRC9DJW2pItSRKEXL0lq+8MwViyYnA5VQilDGkaESvQyVNK3SvCNR6tZJmKa1PDXT4s5drUVJ3oDf/NVmt7TLUMaRYd+jbsxsjZnt7vj5iZl9rOsxF5jZCx2P+f38IUtTVHHh7TRzzGcZiTIsF2uP+0wcuP/xg9UE1CBlHBn23aJ3973AOQBmNgK0gLtiHvoNd7+039eRZqqqIzNNqSLLSJSlnqspmlCmClUZR4ZFlW7eDvydu/+ooOeThquyI3OpUkWWDa/pnYRtTSlThaiM
E/GKSvRXALcn3PfrZvYwsB/4D+7+aNyDzGwjsBFg1apVBYUloQq5hZhnw5uaaS0ab778+FGuf8+ba7Ez6NUBHdrcO01SxpGhecL1N1M/gdmrWEjib3b3Z7ru+wXgiLu/ZGbrgc+4+xlLPefk5KRPT0/nikuyKXve7/O33BfbQpwYH+Obmy+qfB7yfl5/aqYVe8GQ0RFj63vPDjrZd5fS4Ng5/7s/kwvPXMH9jx9sfNmqLsxsl7tPxt5XQKK/DPiwu1+c4rFPAZPu/myvxynRl+tTU3tih80N8vT+XokF4s+EDH26gaSdF7yyAwvVUjvebml2DFKuXom+iLluriShbGNmv2y2MFGFmZ0bvd4/FvCaUpCpmdYxSR7iR5kUqde8MWlHvQxCnpFAvcpOIZSkeslaSqvyO5LsctXozewE4J3ABzuW/S6Au98MvBf4kJkdAuaAKzzvIYQUauuOvbGXeYNyTm2Pa/0Nqn6/VDkmzUigXs/Ra1rdsjsts5aesna2FvEdVV2eGya5Er27/xT4xa5lN3f8fRNwU57XkMHqtWFWNaKiqBEenYnk9WOj/PTnh5g/vLBbi0viS40EWmpHkHRR79ERK7TTsogdVresna15v6O6zhNU152TpikeQp3liWUJU8AaVDaioogTkdqJpDU7hwOzc/NHk3xbd6lhqVbqUuWKDWsn2PpbZzM+9soc7cuPH+27IzaujNT9vtoJsrPE1E9ZJesUzHm/ozqWftJ89qHSFAhDprsldTimkmbAVeetqqylUsRws7hEEqc1O8f5W+5j07o1S7ZS05QrihpXn9TiffVxy5Y8/6DfskqW2PN+R2UNr+3VAs/aOq/zJHZK9EMmKQGOmHHEPZjD0bwJM0vCaCfR3/zVCe7c1UosX5R50lBSUklztaWy4szzHZURY6/yEJC5dBTyuR9LUelmyCStlEfc+eGWd/PNzRdVnuT70V3myHqJu7n5w9z/+MGe5Ysy57bJmjw6E2RIc/AkjWIqI8ZeLfB+SkdJO6E6nB2sFv2QaeKp7HEtt9FlxuiILarLjy4zXvua43j+5eQLaPRqpZY5t03S97T8+FF+Nn+kZ6dpKHPwpOlwHWSM/bTAe91X57ODleiHTJ1X1iRxrbP5I8742CgnvPq4YxJJ0slBaXZ2/ZYrstaDk76n69/zZmDpBBnCHDxL1bQHHeNSjZqs60AoO9B+KNEPmTqvrEmSWmEvzM2z+/pjT9gue2fXz1DCpb6nfr+vMocHVl3TXup77mcdCGEH2o/GJPq6jm+tQmgra97vLms5quydXb+jNYr+nsoeu151mTDN9zwsOSP3XDeDkHWuG827UV9FfHehf/+rN98Tu9yAH255d2lxZJ3PJq/Qv5emGfRcN5Wr48kXsqCI7y7ryT5lmpppEX9KWvkd4GWXUkL+XoZNI0o3VdcCpX9FfXehlaPakuYSquLM4ypKKaF+L8OmEYm+6lqg9C/E767I/p6kHZZT/pwug+iELrpvTH1tg9GI0k1IJ4hINqF9d0XPZ5K0w5ooYEeWdUrlokspRX9WdZ5LJnSNaNE3ccjgsAjtuyt6PpNBDeXsdwRNkaWUoj+rOs8lEyeko5NGJHpQLbDOQvrukkotrdk5Tt98T+YNdlA7shCSYtF9Y03qawttGubGJHoZHoNsKfW6eEhnOQHSb7CD2JGFkBSL7l8Jsb+mXyHsiDs1okYvw2PQddy4PoNuIQzdDWGCraL7V0Lrr8kjhB1xJyV6qZVBnzPR3WGZpOpyQghJsejO3SaNuw9hR9wpd+nGzJ4CXgQOA4e6z8yKLg7+GWA98DJwjbs/lPd1ZTiV0VLqLLXkmQBtkELpxC66LBVSf00/2mXF1uwcBovOoajy6KSoGv2F7v5swn3vAs6Ifn4N+Gz0WySzsuu4Ic/2WfekWKQQRrh0d8A6HE32E0Mw6uYy4Iu+MKnOt81s3MxOcfenS3jtoRHCil6GshNvKC1nSRbKCJe4smI7yQ9i
LqEsikj0DnzNzBz4c3ff1nX/BPDjjtv7omVK9AUJZUUvQxWJt04t50Hs8ENvRIQywiW0DthORST6t7p7y8x+CdhpZo+7+wNZn8TMNgIbAVatWlVAWMMjlBW9LHVKvGUaxA6/Do2IUBJsyMNDc4+6cfdW9PsAcBdwbtdDWsBpHbdPjZZ1P882d59098kVK1bkDWuohLKiS7UGMSKpDjPDhjLCJYSRUElytejN7ARgmbu/GP19MXBD18PuBj5iZl9ioRP2hWGrzw/60DeklkToh/lN1s8Of6nvqw6NiFA6zEPuz8lbujkZuGthBCXHAX/p7l81s98FcPebgXtZGFr5JAvDK38n52vWShmHvqGs6J+a2sNt3/77o0PKQjzMb7KsO/w062ZIjYgkISXYUMuKjbjCVMjKuqpP1S3pqZkWH79jd+zc61nfa9XvpShlv4+sV3RKs27qKlH10esKU5rrZsDKOvStuiWRdIENyPZe69D5l0YV7yNryzbNuhlSa1n6p0Q/YHU49C1Cr2Se5b2GPIIoSwu93/cR9xrt50vzull2+GnXzbIaEU05kguR5roZsJB74ouUlMyzXjIv1M6/rJOp9dsx2v0am/76YTZ9+eGBTOIW0rqpi44MlhL9gDVpoqZe4pKGAVedtyrTew1lqFy3rMMM+3kfca8xf9iZP7K4KFbU8MaQ1s06DOOsM5VuSlB1/bwMRdVyQxlB1C1rC72f95HlqKWoI5xQ1s1Qj+SaQoleClNE0gi18y9rX0vW9zE102KZGYdTjoLLeoQTev17WPqyqqLhlVK50JMQDHaYYdxzt42OGDiLyjdZXzfu+UeXGa99zXHMvjwfxGeuYZz5aXilBKsuwykHeaQRV58GGDFj63vPzv26sbX/I87zL88DYXzmoR7JNYVa9FKpsk4oC9npm++JPQfBgB9ueffAnr/bMH3mTdSrRa9RN1IpdcINfqRR2ucZps982CjRS6VCHU5ZpkGNZ5+aaR09Yup1/du2YfrMh40SvVQqpJN2qjKI8eydJyDBK5e1AxgfG13o5O0wbJ/5sFFnrFRKnXALih7PvtRl7eow0kmKo0QvlRv0STvDmNSW6vsI5UQpKYcSvTRa1cM3q9rJ6AQk6aQavTRalXOoVDlRl/o+pJNa9FJLaVvKVQ7frHLKZfV9SCcleqmdLOWYKksYVZ8joDq8tKl0I7WTpRxTZQlD5whIKPpO9GZ2mpndb2bfN7NHzeyjMY+5wMxeMLPd0c/v5wtXmqB9Is/pm+/h/C33Za5ZZ2kpVznnuurkEoo8pZtDwCfc/SEzex2wy8x2uvv3ux73DXe/NMfrSIMUMQqmnymDqyhhqE4uoeg70bv708DT0d8vmtljwATQnehFjiqigzLUi5PEUZ1cQlBIjd7MVgNrge/E3P3rZvawmf2tmb25x3NsNLNpM5s+ePBgEWFJgIrooAzpEngidZB71I2ZvRa4E/iYu/+k6+6HgDe4+0tmth6YAs6Iex533wZsg4VpivPGJWEqahSMWsoi6eVq0ZvZKAtJ/jZ33959v7v/xN1fiv6+Fxg1s5PyvKbUmzooBydvJ7c0V98tejMz4PPAY+7+6YTH/DLwjLu7mZ3Lwo7lH/t9Tak/dVAORtVTPUjY8pRuzgfeB+wxs93Rsv8IrAJw95uB9wIfMrNDwBxwhYd4SSsplcouxavyLFwJX55RNw9C7+sZuPtNwE39voaIpFP1WbgSNk2BINIAmq2yGnWZAltTIIg0gDq5y1fl7KRZKdGLNIDOLShflVNgZ6XSjUhDqJO7XHXqF1GLXkSkD3WanVSJXkSkD3XqF1HpRkSkD3U6+U+JXkSkT3XpF1HpRkSk4ZToRUQaToleRKThlOhFRBpOiV5EpOEsxFmDzewg8KMCn/Ik4NkCn29Q6hBnHWIExVmkOsQI9YhzkDG+wd1XxN0RZKIvmplNu/tk1XEspQ5x1iFGUJxFqkOMUI84q4pRpRsRkYZTohcRabhh
SfTbqg4gpTrEWYcYQXEWqQ4xQj3irCTGoajRi4gMs2Fp0YuIDC0lehGRhmtcojez15jZd83sYTN71Mz+sOv+/2pmL1UVXxRDbIy24L+Y2f8zs8fM7N8HGufbzewhM9ttZg+a2RurjDOKacTMZszsK9Ht083sO2b2pJndYWavqjpGiI3zNjPba2aPmNktZjYaWowdyyvfdjrFfJZBbT9tMXGWvv00LtED/wRc5O5nA+cAl5jZeQBmNgksrzK4SFKM1wCnAWe6+z8HvlRdiEBynJ8FrnL3c4C/BD5VYYxtHwUe67j9x8CfuvsbgeeBayuJ6ljdcd4GnAmcBYwBH6giqC7dMYa07XTqjvMawtp+2rrjLH37aVyi9wXtVsdo9ONmNgJsBX6vsuAiSTECHwJucPcj0eMOVBQi0esnxenAL0TLXw/sryC8o8zsVODdwOei2wZcBPx19JBbgQ3VRPeK7jgB3P3e6HN24LvAqVXFB/ExhrTttMXFSWDbDyTGWfr207hED0cPlXYDB4Cd7v4d4CPA3e7+dLXRLUiI8VeA3zazaTP7WzM7o9ooE+P8AHCvme0D3gdsqTJG4M9YSEJHotu/CMy6+6Ho9j4ghKtDdMd5VFSyeR/w1bKD6hIXY1DbTiQuzuC2H+LjLH37aWSid/fD0WHRqcC5ZvY24LeA/1ZtZK+IifEtwKuBn0WnSP8FcEuVMUJinB8H1rv7qcD/BD5dVXxmdilwwN13VRVDGini/B/AA+7+jRLDWiQuRjNbSWDbTo/PMqjtp0ecpW8/jb6UoLvPmtn9wIXAG4EnF47qOd7Mnozqt5XqiPESFlqe26O77mJhJQhCR5zvAs6OWvYAd1BtK/R84DfMbD3wGhYOiT8DjJvZcVGr/lSgVWGMEBOnmf0vd/83ZnY9sAL4YKURxn+Wj7LQVxPSthP7WRLe9hMX5z0s9CGUu/24e6N+WNhgxqO/x4BvAJd2PealEGNk4RDu/dHyC4DvBRrns8A/i5ZfC9xZ9ffe8Zl9Jfr7y8AV0d83A/+u6vgS4vwA8H+AsarjSoqxa3ml284Sn2VQ209cnCw0rkvffprYoj8FuDXqQFoG/JW7f2WJ/ylbbIxm9iBwm5l9HHiJ6kdgJMX5b4E7zewICyNa3l9lkAk+CXzJzP4zMAN8vuJ4ktzMwpTc34pazNvd/YZqQ6qtLYS1/RzD3Q9Vsf1oCgQRkYZrZGesiIi8QoleRKThlOhFRBpOiV5EpOGU6EVEGk6JXkSk4ZToRUQa7v8DjTgNROpRV4YAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "x = np.random.normal(42, 3, 100)\n", "y = np.random.gamma(7, 1, 100)\n", "\n", "plt.scatter(x, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Case Study (continued): Importing the Iris data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The sklearn library provides several sample datasets, among which is also the Iris dataset.\n", "\n", "As a table, the dataset would look like:\n", "\n", "\n", "However, the data object imported from sklearn is organized slightly different. In particular, the so-called **features** are seperated from the **labels**." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_iris" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "iris = load_iris()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using Python's **dir()** function we can inspect the data object, i.e. find out what **attributes** it has." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['DESCR', 'data', 'feature_names', 'filename', 'target', 'target_names']" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dir(iris)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "iris.data provides us with a Numpy array, where the first dimension equals the number of observed flowers (**instances**) and the second dimension lists the various features of a flower." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[5.1, 3.5, 1.4, 0.2],\n", " [4.9, 3. , 1.4, 0.2],\n", " [4.7, 3.2, 1.3, 0.2],\n", " [4.6, 3.1, 1.5, 0.2],\n", " [5. , 3.6, 1.4, 0.2],\n", " [5.4, 3.9, 1.7, 0.4],\n", " [4.6, 3.4, 1.4, 0.3],\n", " [5. 
, 3.4, 1.5, 0.2],\n", " [4.4, 2.9, 1.4, 0.2],\n", " [4.9, 3.1, 1.5, 0.1],\n", " [5.4, 3.7, 1.5, 0.2],\n", " [4.8, 3.4, 1.6, 0.2],\n", " [4.8, 3. , 1.4, 0.1],\n", " [4.3, 3. , 1.1, 0.1],\n", " [5.8, 4. , 1.2, 0.2],\n", " [5.7, 4.4, 1.5, 0.4],\n", " [5.4, 3.9, 1.3, 0.4],\n", " [5.1, 3.5, 1.4, 0.3],\n", " [5.7, 3.8, 1.7, 0.3],\n", " [5.1, 3.8, 1.5, 0.3],\n", " [5.4, 3.4, 1.7, 0.2],\n", " [5.1, 3.7, 1.5, 0.4],\n", " [4.6, 3.6, 1. , 0.2],\n", " [5.1, 3.3, 1.7, 0.5],\n", " [4.8, 3.4, 1.9, 0.2],\n", " [5. , 3. , 1.6, 0.2],\n", " [5. , 3.4, 1.6, 0.4],\n", " [5.2, 3.5, 1.5, 0.2],\n", " [5.2, 3.4, 1.4, 0.2],\n", " [4.7, 3.2, 1.6, 0.2],\n", " [4.8, 3.1, 1.6, 0.2],\n", " [5.4, 3.4, 1.5, 0.4],\n", " [5.2, 4.1, 1.5, 0.1],\n", " [5.5, 4.2, 1.4, 0.2],\n", " [4.9, 3.1, 1.5, 0.2],\n", " [5. , 3.2, 1.2, 0.2],\n", " [5.5, 3.5, 1.3, 0.2],\n", " [4.9, 3.6, 1.4, 0.1],\n", " [4.4, 3. , 1.3, 0.2],\n", " [5.1, 3.4, 1.5, 0.2],\n", " [5. , 3.5, 1.3, 0.3],\n", " [4.5, 2.3, 1.3, 0.3],\n", " [4.4, 3.2, 1.3, 0.2],\n", " [5. , 3.5, 1.6, 0.6],\n", " [5.1, 3.8, 1.9, 0.4],\n", " [4.8, 3. , 1.4, 0.3],\n", " [5.1, 3.8, 1.6, 0.2],\n", " [4.6, 3.2, 1.4, 0.2],\n", " [5.3, 3.7, 1.5, 0.2],\n", " [5. , 3.3, 1.4, 0.2],\n", " [7. , 3.2, 4.7, 1.4],\n", " [6.4, 3.2, 4.5, 1.5],\n", " [6.9, 3.1, 4.9, 1.5],\n", " [5.5, 2.3, 4. , 1.3],\n", " [6.5, 2.8, 4.6, 1.5],\n", " [5.7, 2.8, 4.5, 1.3],\n", " [6.3, 3.3, 4.7, 1.6],\n", " [4.9, 2.4, 3.3, 1. ],\n", " [6.6, 2.9, 4.6, 1.3],\n", " [5.2, 2.7, 3.9, 1.4],\n", " [5. , 2. , 3.5, 1. ],\n", " [5.9, 3. , 4.2, 1.5],\n", " [6. , 2.2, 4. , 1. ],\n", " [6.1, 2.9, 4.7, 1.4],\n", " [5.6, 2.9, 3.6, 1.3],\n", " [6.7, 3.1, 4.4, 1.4],\n", " [5.6, 3. , 4.5, 1.5],\n", " [5.8, 2.7, 4.1, 1. ],\n", " [6.2, 2.2, 4.5, 1.5],\n", " [5.6, 2.5, 3.9, 1.1],\n", " [5.9, 3.2, 4.8, 1.8],\n", " [6.1, 2.8, 4. , 1.3],\n", " [6.3, 2.5, 4.9, 1.5],\n", " [6.1, 2.8, 4.7, 1.2],\n", " [6.4, 2.9, 4.3, 1.3],\n", " [6.6, 3. , 4.4, 1.4],\n", " [6.8, 2.8, 4.8, 1.4],\n", " [6.7, 3. , 5. , 1.7],\n", " [6. 
, 2.9, 4.5, 1.5],\n", " [5.7, 2.6, 3.5, 1. ],\n", " [5.5, 2.4, 3.8, 1.1],\n", " [5.5, 2.4, 3.7, 1. ],\n", " [5.8, 2.7, 3.9, 1.2],\n", " [6. , 2.7, 5.1, 1.6],\n", " [5.4, 3. , 4.5, 1.5],\n", " [6. , 3.4, 4.5, 1.6],\n", " [6.7, 3.1, 4.7, 1.5],\n", " [6.3, 2.3, 4.4, 1.3],\n", " [5.6, 3. , 4.1, 1.3],\n", " [5.5, 2.5, 4. , 1.3],\n", " [5.5, 2.6, 4.4, 1.2],\n", " [6.1, 3. , 4.6, 1.4],\n", " [5.8, 2.6, 4. , 1.2],\n", " [5. , 2.3, 3.3, 1. ],\n", " [5.6, 2.7, 4.2, 1.3],\n", " [5.7, 3. , 4.2, 1.2],\n", " [5.7, 2.9, 4.2, 1.3],\n", " [6.2, 2.9, 4.3, 1.3],\n", " [5.1, 2.5, 3. , 1.1],\n", " [5.7, 2.8, 4.1, 1.3],\n", " [6.3, 3.3, 6. , 2.5],\n", " [5.8, 2.7, 5.1, 1.9],\n", " [7.1, 3. , 5.9, 2.1],\n", " [6.3, 2.9, 5.6, 1.8],\n", " [6.5, 3. , 5.8, 2.2],\n", " [7.6, 3. , 6.6, 2.1],\n", " [4.9, 2.5, 4.5, 1.7],\n", " [7.3, 2.9, 6.3, 1.8],\n", " [6.7, 2.5, 5.8, 1.8],\n", " [7.2, 3.6, 6.1, 2.5],\n", " [6.5, 3.2, 5.1, 2. ],\n", " [6.4, 2.7, 5.3, 1.9],\n", " [6.8, 3. , 5.5, 2.1],\n", " [5.7, 2.5, 5. , 2. ],\n", " [5.8, 2.8, 5.1, 2.4],\n", " [6.4, 3.2, 5.3, 2.3],\n", " [6.5, 3. , 5.5, 1.8],\n", " [7.7, 3.8, 6.7, 2.2],\n", " [7.7, 2.6, 6.9, 2.3],\n", " [6. , 2.2, 5. , 1.5],\n", " [6.9, 3.2, 5.7, 2.3],\n", " [5.6, 2.8, 4.9, 2. ],\n", " [7.7, 2.8, 6.7, 2. ],\n", " [6.3, 2.7, 4.9, 1.8],\n", " [6.7, 3.3, 5.7, 2.1],\n", " [7.2, 3.2, 6. , 1.8],\n", " [6.2, 2.8, 4.8, 1.8],\n", " [6.1, 3. , 4.9, 1.8],\n", " [6.4, 2.8, 5.6, 2.1],\n", " [7.2, 3. , 5.8, 1.6],\n", " [7.4, 2.8, 6.1, 1.9],\n", " [7.9, 3.8, 6.4, 2. ],\n", " [6.4, 2.8, 5.6, 2.2],\n", " [6.3, 2.8, 5.1, 1.5],\n", " [6.1, 2.6, 5.6, 1.4],\n", " [7.7, 3. , 6.1, 2.3],\n", " [6.3, 3.4, 5.6, 2.4],\n", " [6.4, 3.1, 5.5, 1.8],\n", " [6. , 3. , 4.8, 1.8],\n", " [6.9, 3.1, 5.4, 2.1],\n", " [6.7, 3.1, 5.6, 2.4],\n", " [6.9, 3.1, 5.1, 2.3],\n", " [5.8, 2.7, 5.1, 1.9],\n", " [6.8, 3.2, 5.9, 2.3],\n", " [6.7, 3.3, 5.7, 2.5],\n", " [6.7, 3. , 5.2, 2.3],\n", " [6.3, 2.5, 5. , 1.9],\n", " [6.5, 3. , 5.2, 2. ],\n", " [6.2, 3.4, 5.4, 2.3],\n", " [5.9, 3. 
, 5.1, 1.8]])" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To find out what the four features are, we can list them:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['sepal length (cm)',\n", " 'sepal width (cm)',\n", " 'petal length (cm)',\n", " 'petal width (cm)']" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.feature_names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly, we can also print the flowers' labels (a.k.a. targets):" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,\n", " 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,\n", " 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.target" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The three flower classes are encoded with integers. 
Let's show the corresponding names:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['setosa', 'versicolor', 'virginica'], dtype='<U10')" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.target_names" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "feature_index = 2\n", "colors = ['blue', 'red', 'green']\n", "\n", "for label, color in zip(range(len(iris.target_names)), colors):\n", " plt.hist(iris.data[iris.target==label, feature_index], \n", " label=iris.target_names[label],\n", " color=color)\n", "\n", "plt.xlabel(iris.feature_names[feature_index])\n", "plt.legend(loc='upper right')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also, we can draw scatter plots of two features." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "first_feature_index = 1\n", "second_feature_index = 0\n", "\n", "colors = ['blue', 'red', 'green']\n", "\n", "for label, color in zip(range(len(iris.target_names)), colors):\n", " plt.scatter(iris.data[iris.target==label, first_feature_index], \n", " iris.data[iris.target==label, second_feature_index],\n", " label=iris.target_names[label],\n", " c=color)\n", "\n", "plt.xlabel(iris.feature_names[first_feature_index])\n", "plt.ylabel(iris.feature_names[second_feature_index])\n", "plt.legend(loc='upper left')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the higher-level library pandas, one can easily create a so-called **scatterplot matrix**." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)\n", "\n", "pd.plotting.scatter_matrix(iris_df, figsize=(8, 8));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Concept of Generalization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal of a supervised machine learning model is to make predictions on new (i.e., previously unseen) data.\n", "\n", "In a real-world application, we are not interested in marking an already labeled email as spam or not. Instead, we want to make the user's life easier by automatically classifying new incoming mail.\n", "\n", "To get an idea of how well a model generalizes, a best practice is to split the available data into a training set and a test set. Only the former is used to train the model. Predictions are then made on the test data and compared with the actual labels.\n", "\n", "Common splits are 75/25 or 60/40." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Case Study (continued): Train/Test Split for the Iris data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is common practice to refer to the feature matrix as X and the vector of labels as y." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "X, y = iris.data, iris.target" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A naive splitting approach could be to use array slicing." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = X[0:100, :], X[100:150, :], y[0:100], y[100:150]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, this would lead to unbalanced label distributions. 
For example, the test set would only be made up of flowers of the same type." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,\n", " 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,\n", " 2, 2, 2, 2, 2, 2])" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_test" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 0, 0, 50])" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.bincount(y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "sklearn provides a function that not only randomizes the split but also ensures that the resulting label distribution is proportionate to the overall distribution (called **stratification**)." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1, 0, 1, 2, 2, 2, 2, 1, 0, 1, 0, 0, 2, 0, 0, 1, 0, 1, 2, 2, 0, 2,\n", " 1, 2, 0, 0, 2, 2, 1, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 2, 0, 0, 2, 2,\n", " 1])" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, stratify=y)\n", "\n", "y_test" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([15, 15, 15])" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.bincount(y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A simple Classification Model: k-Nearest Neighbors" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "To predict the label for any observation, just determine the k \"nearest\" observations in the training set (e.g., by Euclidean distance) and use a simple majority vote." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Case Study (continued): Train and Predict the Iris data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "sklearn provides a uniform interface for all its classification models. They all have a **fit()** and a **predict()** method that abstract away the actual machine learning algorithm." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsClassifier" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "knn = KNeighborsClassifier(n_neighbors=5)\n", "\n", "knn.fit(X_train, y_train)\n", "\n", "y_pred = knn.predict(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us list the labels predicted for the test set ..." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1, 0, 1, 1, 2, 2, 2, 1, 0, 1, 0, 0, 2, 0, 0, 1, 0, 1, 2, 2, 0, 2,\n", " 1, 1, 0, 0, 1, 2, 1, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 2, 0, 0, 2, 2,\n", " 1])" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_pred" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "... and compare them with the actual labels." 
] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1, 0, 1, 2, 2, 2, 2, 1, 0, 1, 0, 0, 2, 0, 0, 1, 0, 1, 2, 2, 0, 2,\n", " 1, 2, 0, 0, 2, 2, 1, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 2, 0, 0, 2, 2,\n", " 1])" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_test" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "NumPy can show us the indices where the predictions are wrong." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(array([ 3, 23, 26]),)" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.where(y_pred != y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, we can calculate the fraction of correctly predicted flowers." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9333333333333333" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.sum(y_pred == y_test) / len(y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also \"predict\" the training set. Perhaps surprisingly, the model does not get even the training set 100% correct: with five neighbors, a flower's own label can be outvoted by the labels of its neighbors." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9523809523809523" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_train_pred = knn.predict(X_train)\n", "\n", "np.sum(y_train_pred == y_train) / len(y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A visualization reveals that the misclassified flowers are right \"at the borderline\" between two neighboring clusters of flower classes." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "first_feature_index = 3\n", "second_feature_index = 0\n", "\n", "correct_idx = np.where(y_pred == y_test)[0]\n", "incorrect_idx = np.where(y_pred != y_test)[0]\n", "\n", "colors = [\"darkblue\", \"darkgreen\", \"gray\"]\n", "\n", "for n, color in enumerate(colors):\n", " idx = np.where(y_test == n)[0]\n", " plt.scatter(X_test[idx, first_feature_index], X_test[idx, second_feature_index], color=color,\n", " label=iris.target_names[n])\n", "\n", "plt.scatter(X_test[incorrect_idx, first_feature_index], X_test[incorrect_idx, second_feature_index],\n", " color=\"darkred\", label='misclassified')\n", "\n", "# Take the axis labels from the plotted feature indices so they always match the data.\n", "plt.xlabel(iris.feature_names[first_feature_index])\n", "plt.ylabel(iris.feature_names[second_feature_index])\n", "plt.legend(loc='best')\n", "plt.title(\"Iris Classification results\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The number of neighbors is not learned from the data; it has to be chosen before the model is trained. Trying out different values to find a good one is referred to as **hyper-parameter tuning**. For the Iris dataset, the choice does not make much of a difference." 
] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 0.9777777777777777\n", "2 0.9333333333333333\n", "3 0.9555555555555556\n", "4 0.8888888888888888\n", "5 0.9333333333333333\n", "6 0.9333333333333333\n", "7 0.9333333333333333\n", "8 0.9333333333333333\n", "9 0.9555555555555556\n", "10 0.9333333333333333\n", "11 0.9555555555555556\n", "12 0.9333333333333333\n", "13 0.9333333333333333\n", "14 0.9333333333333333\n", "15 0.9333333333333333\n", "16 0.9333333333333333\n", "17 0.9333333333333333\n", "18 0.9333333333333333\n", "19 0.9333333333333333\n", "20 0.9333333333333333\n", "21 0.9333333333333333\n", "22 0.9333333333333333\n", "23 0.9111111111111111\n", "24 0.9333333333333333\n", "25 0.9111111111111111\n", "26 0.9333333333333333\n", "27 0.9111111111111111\n", "28 0.9111111111111111\n", "29 0.9333333333333333\n", "30 0.9333333333333333\n" ] } ], "source": [ "for i in range(1, 31):\n", " knn = KNeighborsClassifier(n_neighbors=i)\n", " knn.fit(X_train, y_train)\n", " y_pred = knn.predict(X_test)\n", " correct = np.sum(y_pred == y_test) / len(y_test)\n", " print(i, correct)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## WHU's Python Course in the BSc program" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- free [online book](https://github.com/webartifex/intro-to-python) by the author of this workshop" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Literature on Machine Learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Depending on the programming language one chooses, the following books are recommended:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- [Python Machine Learning](https://www.amazon.de/Python-Machine-Learning-scikit-learn-TensorFlow/dp/1787125939/ref=sr_1_1?__mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&keywords=python+machine+learning&qid=1575545025&sr=8-1) by Sebastian Raschka\n", "\n", 
"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- [An Introduction to Statistical Learning](http://faculty.marshall.usc.edu/gareth-james/ISL/)\n", "\n", "" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }