# A hands-on Machine Learning Introduction in Python with scikit-learn

## What is Machine Learning


Machine learning is the process of **extracting knowledge from data** automatically.

The goals usually include making predictions on new, unseen data or simply understanding given data better by finding patterns.

Central to machine learning is the concept of **automating decision making** from data **without the user specifying explicit rules** for how these decisions should be made.

<img src="raw/what_is_machine_learning.png" width="100%">

## Examples

<img src="raw/examples.png" width="100%">

## 3 Types of Machine Learning

<img src="raw/3_types_of_machine_learning.png" width="100%">

- **Supervised** (focus of this notebook): Each entry in the dataset comes with a "label". Examples are a list of emails where spam mail is already marked as such or a sample of handwritten digits. The goal is to use the historic data to make predictions.

- **Unsupervised**: There is no desired output associated with a data entry. In a sense, one can think of unsupervised learning as a means of discovering labels from the data itself. A popular example is the clustering of customer data.

- **Reinforcement**: Conceptually, this can be seen as "learning by doing". Some kind of "reward function" tells how good a predicted outcome is. For example, programs that play board games such as chess or Go can be trained with this approach.

## 2 Types of Supervised Learning

<img src="raw/classification_vs_regression.png" width="100%">

- **In classification, the label is discrete**, such as "spam" or "no spam" for emails.
Furthermore, labels are nominal (e.g., colors of something), not ordinal (e.g., T-shirt sizes in S, M, or L).


- **In regression, the labels are continuous**. For example, given a person's age, education, and position, infer his/her salary.

## Case study: Iris flower classification

<img src="raw/iris_data.png" width="100%">

## Python for scientific computing

Python itself does not come with any scientific algorithms implemented in it. However, over time, many open-source libraries have emerged that are useful for building machine learning applications.

Among the popular ones are numpy (numerical computations, linear algebra), pandas (data processing), matplotlib (visualisations), and scikit-learn (machine learning algorithms).

First, import the libraries:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

The following line is needed so that this Jupyter notebook renders the visualizations inside the notebook and not in a new window. It is a Jupyter "magic" command and not part of the Python language itself.

In [None]:
%matplotlib inline

Standard Python can do basic arithmetic operations ...

In [None]:
a = 1
b = 2
c = a + b
c

... and provides some simple **data structures**, such as a list of values.

In [None]:
l = [a, b, c, 4]
l

Numpy provides a data structure called an **n-dimensional array**. This may sound fancy at first, but when used with only 1 or 2 dimensions, it basically represents vectors and matrices. Arrays allow for much faster computations because operations on them run in optimized, compiled code instead of Python loops.

To create an array, use the **array()** function from the imported **np** module and provide it with a list of values.

In [None]:
v1 = np.array([1, 2, 3])
v1

A vector can be multiplied with a scalar.

In [None]:
v2 = v1 * 3
v2

To create a matrix, just use a list of (row) lists of values instead.

In [None]:
m1 = np.array([
    [1, 2, 3],
    [4, 5, 6],
])
m1

Now we can use numpy to multiply a matrix with a vector to obtain a new vector ...

In [None]:
v3 = np.dot(m1, v1)
v3

... or simply transpose it.

In [None]:
m1.T

The rules from maths still apply, and it makes a difference whether a vector is multiplied by a matrix from the left or from the right. The following operation will fail because the vector has three entries while the matrix has only two rows.

In [None]:
np.dot(v1, m1)
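To see why the order matters, it helps to look at the shapes involved. The following cell is a minimal sketch that re-creates `m1` and `v1` from above; the names `v4` and `failed` are introduced here for illustration only.

```python
import numpy as np

m1 = np.array([[1, 2, 3],
               [4, 5, 6]])      # shape (2, 3)
v1 = np.array([1, 2, 3])        # shape (3,)

# (2, 3) times (3,) works because the inner dimensions agree (3 == 3).
# Each entry is the dot product of one row of m1 with v1:
# [1*1 + 2*2 + 3*3, 4*1 + 5*2 + 6*3]
v3 = np.dot(m1, v1)
print(v3)

# (3,) times (2, 3) fails because the inner dimensions differ (3 != 2).
try:
    np.dot(v1, m1)
    failed = False
except ValueError as err:
    failed = True
    print("np.dot(v1, m1) failed:", err)

# After transposing the matrix, the shapes are (3,) and (3, 2), which works.
v4 = np.dot(v1, m1.T)
print(v4)
```

Catching the `ValueError` is only done here so that the cell runs to the end; in the notebook cell above, the error is shown directly.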

In order to retrieve only a slice (= subset) of an array's data, we can "index" into it. For example, the first row of the matrix is ...

In [None]:
m1[0, :]

... while the second column is:

In [None]:
m1[:, 1]

To access the bottom element of the rightmost column, two indices can be used.

In [None]:
m1[1, 2]
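Besides integer indices and slices, numpy arrays can also be indexed with a boolean mask. This is the mechanism used later in this notebook to select all flowers of one class. A minimal sketch, where `row_labels` is a made-up array that assigns one label to each row of `m1`:

```python
import numpy as np

m1 = np.array([[1, 2, 3],
               [4, 5, 6]])

# Hypothetical labels, one per row of m1 (not part of the original data).
row_labels = np.array([0, 1])

# Comparing an array with a value yields an array of booleans.
mask = row_labels == 1
print(mask)

# The mask selects the rows where it is True; ":" keeps all columns.
selected_rows = m1[mask, :]
print(selected_rows)

# A mask can be combined with an integer index for the columns.
selected_values = m1[mask, 2]
print(selected_values)
```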

Numpy also provides various other functions and constants, such as sine or pi. To illustrate the concept of **vectorization** (applying a function to all elements of an array at once), let us calculate the sine curve over a range of values.

In [None]:
x = np.linspace(-3*np.pi, 3*np.pi, 100)
x

In [None]:
y = np.sin(x)
y
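To make clear what vectorization saves us, the following sketch computes the same values once with an explicit Python loop and once with a single numpy call. The names `y_loop` and `y_vectorized` are chosen here for illustration.

```python
import math
import numpy as np

x = np.linspace(-3 * np.pi, 3 * np.pi, 100)

# Without vectorization: call math.sin() once per element in a Python loop.
y_loop = np.array([math.sin(value) for value in x])

# With vectorization: one call processes the whole array.
y_vectorized = np.sin(x)

# Both approaches give the same numbers (up to floating point rounding).
print(np.allclose(y_loop, y_vectorized))
```

The vectorized version is shorter and, for large arrays, usually much faster because the loop runs inside numpy's compiled code.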

With matplotlib's **plot()** function we can visualize the sine curve.

In [None]:
plt.plot(x, y)

Let us quickly generate some random data and draw a scatter plot.

In [None]:
x = np.random.normal(42, 3, 100)
y = np.random.gamma(7, 1, 100)
plt.scatter(x, y)

## Case study: importing the Iris data

The sklearn library provides several sample datasets, among which is also the Iris dataset.

As a table, the dataset would look like:
<img src="raw/iris.png" width="100%">

However, the data object imported from sklearn is organized slightly differently. In particular, the so-called **features** are separated from the **labels**.

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()

Using Python's **dir()** function we can inspect the data object, i.e. find out what **attributes** it has.

In [None]:
dir(iris)

iris.data provides us with a Numpy array, where the first dimension equals the number of observed flowers (**instances**) and the second dimension lists the various features of a flower.

In [None]:
iris.data

To find out what the four features are, we can list them:

In [None]:
iris.feature_names

Similarly, we can also print the flowers' labels (a.k.a. targets):

In [None]:
iris.target

The three flower classes are encoded with integers. Let's show the corresponding names:

In [None]:
iris.target_names
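Since the labels are just integers, they can be used directly as indices into the array of names. The sketch below decodes every label and counts the flowers per class; the variable names `label_names`, `classes`, and `counts` are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Indexing the array of names with the integer labels "decodes" each label.
label_names = iris.target_names[iris.target]
print(label_names[:5])

# Count how many flowers belong to each class.
classes, counts = np.unique(iris.target, return_counts=True)
for cls, count in zip(classes, counts):
    print(cls, iris.target_names[cls], count)
```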

## Case study: Simple visualizations

Since the data is four-dimensional, we cannot visualize all features in a single plot. Instead, we can plot the distribution of the flower classes by a single feature using histograms.

In [None]:
feature_index = 2
colors = ['blue', 'red', 'green']

for label, color in zip(range(len(iris.target_names)), colors):
    plt.hist(iris.data[iris.target==label, feature_index], 
             label=iris.target_names[label],
             color=color)

plt.xlabel(iris.feature_names[feature_index])
plt.legend(loc='upper right')
plt.show()

Also, we can draw scatter plots of two features.

In [None]:
first_feature_index = 1
second_feature_index = 0

colors = ['blue', 'red', 'green']

for label, color in zip(range(len(iris.target_names)), colors):
    plt.scatter(iris.data[iris.target==label, first_feature_index], 
                iris.data[iris.target==label, second_feature_index],
                label=iris.target_names[label],
                c=color)

plt.xlabel(iris.feature_names[first_feature_index])
plt.ylabel(iris.feature_names[second_feature_index])
plt.legend(loc='upper left')
plt.show()

Using the higher level library pandas, one can easily create a so-called **scatterplot matrix**.

In [None]:
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
pd.plotting.scatter_matrix(iris_df, figsize=(8, 8));

## Concept of Generalization

The goal of a supervised machine learning model is to make predictions on new (i.e., previously unseen) data.

In a real-world application, we are not interested in marking an already labeled email as spam or not. Instead, we want to make the user's life easier by automatically classifying new incoming mail.

In order to get an idea of how well a model generalizes, a best practice is to split the available data into a training and a test set. Only the former is used to train the model. Then predictions are made on the test data and compared with the actual labels.

Common splits are 75/25 or 60/40.
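The idea of a random 75/25 split can be sketched with numpy alone by shuffling the row indices. This is only an illustration of the concept: the sample count of 150 and the seed are assumptions made for this example.

```python
import numpy as np

n_samples = 150                       # assumed size of the dataset
rng = np.random.default_rng(seed=0)   # arbitrary seed, only for reproducibility

# Shuffle the row indices so that the split does not depend on the row order.
indices = rng.permutation(n_samples)

# The first 75% of the shuffled indices form the training set, the rest the test set.
n_train = int(0.75 * n_samples)
train_idx = indices[:n_train]
test_idx = indices[n_train:]

print(len(train_idx), len(test_idx))
```

With a feature matrix `X` and a label vector `y`, the two sets would then be `X[train_idx]`, `y[train_idx]` and `X[test_idx]`, `y[test_idx]`.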

<img src="raw/generalization.png" width="100%">

## Case study: Train/Test split for the Iris data

It is common practice to refer to the feature matrix as X and the vector of labels as y.

In [None]:
X, y = iris.data, iris.target

A naive splitting approach could be to use array slicing.

In [None]:
X_train, X_test, y_train, y_test = X[0:100, :], X[100:150, :], y[0:100], y[100:150]

However, this would lead to unbalanced label distributions. For example, the test set would only be made up of flowers of the same type.

In [None]:
y_test

In [None]:
np.bincount(y_test)

sklearn provides a function that not only randomizes the split but also ensures that the resulting label distribution is proportionate to the overall distribution (called **stratification**).

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, stratify=y)
y_test

In [None]:
np.bincount(y_test)
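To check what stratification does, we can compare the class shares in the full data with those in the two parts of the split. The following self-contained sketch uses its own variable names (`X_tr`, `y_te`, ...) and adds `random_state=0`, which is an assumption made here only to get a reproducible split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# random_state is not used in the notebook; it only fixes the shuffling here.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=0)

# Share of each class in the full data, the training set, and the test set.
overall_share = np.bincount(y) / len(y)
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)

print(overall_share)
print(train_share)
print(test_share)
```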

## A simple classification model: k-Nearest Neighbors

To predict the label for any observation, just determine the k "nearest" observations in the training set (e.g., by Euclidean distance) and use a simple majority vote.
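The description above can be sketched in a few lines of numpy. This is an illustrative implementation, not the one sklearn uses; the function name `knn_predict` and the toy data are made up for this example.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    # Euclidean distance from x_new to every training observation.
    distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Count the labels among the neighbors and return the most frequent one.
    # In case of a tie, np.argmax picks the smallest label.
    votes = np.bincount(y_train[nearest])
    return int(np.argmax(votes))

# Toy data: two well separated groups in two dimensions.
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([5.1, 5.0]), k=3)
print(pred_a, pred_b)
```

Note that there is no real "training" step: the model simply stores the training data and does all the work at prediction time.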

<img src="raw/knn.png" width="100%">

## Case study: train and predict the Iris data

sklearn provides a uniform interface for all its classification models. They all have a **fit()** and a **predict()** method that abstract away the actual machine learning algorithm.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)

knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)
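Because of this uniform interface, swapping in a different classifier only changes the line that creates the model. The sketch below is self-contained and uses its own variable names; the two additional models and the `random_state` and `max_iter` values are choices made for this illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=0)

# Three different algorithms behind the same fit()/predict() interface.
models = {
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}

predictions = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)                     # learn from the training data
    predictions[name] = model.predict(X_te)   # predict labels for the test data
    print(name, predictions[name][:10])
```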

Let us list the labels predicted for the test set ...

In [None]:
y_pred

... and compare them with the actual labels.

In [None]:
y_test

Numpy can show us the indices where the predictions are wrong.

In [None]:
np.where(y_pred != y_test)

Alternatively, we can calculate the fraction of correctly predicted flowers.

In [None]:
np.sum(y_pred == y_test) / len(y_test)
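This fraction is commonly called the **accuracy**, and sklearn offers shortcuts for it. The following self-contained sketch (with its own variable names and an assumed `random_state=0`) shows that the manual calculation, `np.mean()`, `accuracy_score()`, and the model's `score()` method all yield the same number.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=0)

model = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
y_hat = model.predict(X_te)

# Four ways to compute the fraction of correct predictions.
acc_manual = np.sum(y_hat == y_te) / len(y_te)
acc_mean = np.mean(y_hat == y_te)          # the mean of booleans is the share of True
acc_metric = accuracy_score(y_te, y_hat)   # helper function from sklearn.metrics
acc_score = model.score(X_te, y_te)        # predicts and scores in one step

print(acc_manual, acc_mean, acc_metric, acc_score)
```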

It is important to mention that we can also "predict" the training set. Surprisingly, the model does not get the training set 100% correct.

In [None]:
y_train_pred = knn.predict(X_train)
np.sum(y_train_pred == y_train) / len(y_train)
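The reason is the majority vote: a training observation counts as only one of its own k neighbors and can be outvoted by the others. A small sketch with made-up one-dimensional data illustrates this. With k=1 every (distinct) training point is its own nearest neighbor, while with k=3 the isolated point at 2 is outvoted.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data (made up): the point at 2 has label 1 but sits between points of label 0.
X_toy = np.array([[0], [1], [2], [3], [10], [11], [12]])
y_toy = np.array([0, 0, 1, 0, 1, 1, 1])

# k=1: each training point is its own nearest neighbor.
knn_1 = KNeighborsClassifier(n_neighbors=1).fit(X_toy, y_toy)
acc_k1 = knn_1.score(X_toy, y_toy)

# k=3: the point at 2 is outvoted by its neighbors at 1 and 3.
knn_3 = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)
acc_k3 = knn_3.score(X_toy, y_toy)

print(acc_k1, acc_k3)
print(knn_3.predict([[2]]))
```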

A visualization reveals that the misclassified flowers are right "at the borderline" between two neighboring clusters of flower classes.

In [None]:
first_feature_index = 3
second_feature_index = 0

correct_idx = np.where(y_pred == y_test)[0]
incorrect_idx = np.where(y_pred != y_test)[0]

colors = ["darkblue", "darkgreen", "gray"]

for n, color in enumerate(colors):
    idx = np.where(y_test == n)[0]
    plt.scatter(X_test[idx, first_feature_index], X_test[idx, second_feature_index], color=color,
                label=iris.target_names[n])

plt.scatter(X_test[incorrect_idx, first_feature_index], X_test[incorrect_idx, second_feature_index],
            color="darkred", label='misclassified')

plt.xlabel(iris.feature_names[first_feature_index])
plt.ylabel(iris.feature_names[second_feature_index])
plt.legend(loc='best')
plt.title("Iris Classification results")
plt.show()

In practice, the number of neighbors has to be chosen before the model is trained. Therefore, it is possible to "optimize" it. This process is referred to as **hyper-parameter** tuning. For the Iris dataset this does not make much of a difference.

In [None]:
for i in range(1, 31):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    correct = np.sum(y_pred == y_test) / len(y_test)
    print(i, correct)
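One caveat of the loop above is that picking the best k by looking at the test set makes the test score an optimistic estimate. A common alternative is to choose k using only the training data, for example with cross-validation, and to use the test set once at the end. The following is a minimal, self-contained sketch of that idea; the 5 folds and `random_state=0` are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=0)

# For every candidate k, average the accuracy over 5 splits of the training data.
cv_scores = {}
for k in range(1, 31):
    model = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(model, X_tr, y_tr, cv=5).mean()

# Pick the k with the best cross-validated accuracy.
best_k = max(cv_scores, key=cv_scores.get)

# Train once more on the full training set and evaluate on the untouched test set.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
test_accuracy = final_model.score(X_te, y_te)

print(best_k, cv_scores[best_k], test_accuracy)
```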
 

## Literature

Depending on the programming language one chooses, the following books are recommended.

- Python

<img src="raw/python_general.png">

<img src="raw/python_ml.png">

- R

<img src="raw/r.png">