**Note**: Click on "*Kernel*" > "*Restart Kernel and Clear All Outputs*" in [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/) *before* reading this notebook to reset its output. If you cannot run this file on your machine, you may want to open it [in the cloud <img height="12" style="display: inline-block" src="../static/link/to_mb.png">](https://mybinder.org/v2/gh/webartifex/intro-to-data-science/main?urlpath=lab/tree/01_scientific_stack/02_content_pandas.ipynb).

# Chapter 1: Python's Scientific Stack (Part 2)

For practitioners, the [numpy <img height="12" style="display: inline-block" src="../static/link/to_np.png">](https://numpy.org/) library may feel a bit too "technical" or too close to "real programming" and they may prefer something that looks and feels more like Excel. That is where the [pandas <img height="12" style="display: inline-block" src="../static/link/to_pd.png">](https://pandas.pydata.org/) library comes in.

Let's first `pip` install and then `import` it.

In [1]:
%pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd

## Excel-like Data with Pandas

In the same folder as this notebook there is a file named "*orders.csv*" that holds the order data of an urban meal delivery platform operating in Bordeaux, France. Open it with a double-click and take a look at its contents right here in JupyterLab!

[pandas <img height="12" style="display: inline-block" src="../static/link/to_pd.png">](https://pandas.pydata.org/) provides a [pd.read_csv() <img height="12" style="display: inline-block" src="../static/link/to_pd.png">](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv) function that, as the name suggests, can open and read in CSV data. For Excel files, there is also a [pd.read_excel() <img height="12" style="display: inline-block" src="../static/link/to_pd.png">](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html) function but the CSV format is probably more widespread in use.

Let's read in the "*orders.csv*" file with [pd.read_csv() <img height="12" style="display: inline-block" src="../static/link/to_pd.png">](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv), specifying the "order_id" column as the **index**. Here, the index is a column with *unique* values that allow the identification of each row in a dataset. If we don't specify an index column, [pandas <img height="12" style="display: inline-block" src="../static/link/to_pd.png">](https://pandas.pydata.org/) creates a surrogate index as a sequence of integers 0, 1, 2, and so on.

In [3]:
df = pd.read_csv("orders.csv", index_col="order_id")
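The effect of `index_col` can be seen without the real file. The following is a minimal sketch that reads a tiny CSV from memory; the three rows and the names `csv_text`, `default_index`, and `custom_index` are made up for illustration and only mimic a few columns of "*orders.csv*".

```python
import io

import pandas as pd

# Hypothetical three-row CSV standing in for "orders.csv".
csv_text = """order_id,restaurant,total
192594,Max A Table,2050
192644,Max A Table,2450
192658,Taj Mahal,2550
"""

# Without index_col, pandas labels the rows 0, 1, 2, ...
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col, the "order_id" values label the rows instead.
custom_index = pd.read_csv(io.StringIO(csv_text), index_col="order_id")

print(list(default_index.index))  # [0, 1, 2]
print(list(custom_index.index))   # [192594, 192644, 192658]
```

In the second case, "order_id" is no longer a regular column but the row labels we later use with `.loc`.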

`df` models a table-like data structure, comparable to one tab in an Excel file. [pandas <img height="12" style="display: inline-block" src="../static/link/to_pd.png">](https://pandas.pydata.org/) and JupyterLab are designed to work well together: The `df` object shows a preview of the dataset below the code cell. The rows are the **records** in the dataset and the columns take the role of the **attributes** each record has. Each column comes with a **domain** of allowable values.

In [4]:
df

order_id,placed_at,restaurant_id,restaurant,o_street,o_zip,o_city,o_latitude,o_longitude,customer_id,d_street,d_zip,d_city,d_latitude,d_longitude,total,courier_id,pickup_at,delivery_at,cancelled
192594,2016-07-18 12:23:13,1204,Max A Table,36 Rue Cornac,33000,Bordeaux,44.851402,-0.575870,10298,Rue Rolland 14,33000,Bordeaux,44.842592,-0.580521,2050,1423.0,2016-07-18 12:38:08,2016-07-18 12:48:22,0
192644,2016-07-18 12:48:55,1204,Max A Table,36 Rue Cornac,33000,Bordeaux,44.851402,-0.575870,6037,Rue Rolland 14,33000,Bordeaux,44.842592,-0.580521,2450,1426.0,2016-07-18 13:03:08,2016-07-18 13:12:01,0
192658,2016-07-18 13:00:13,1205,Taj Mahal,24 Rue Du Parlement Sainte-Catherine,33000,Bordeaux,44.840405,-0.573940,73830,Rue Batailley 12,33000,Bordeaux,44.838504,-0.591961,2550,1423.0,2016-07-18 13:19:04,2016-07-18 13:29:03,0
193242,2016-07-18 20:39:54,1208,Chez Ambre And Michel,1 Rue Matignon,33000,Bordeaux,44.850258,-0.586204,10298,Rue Rolland 14,33000,Bordeaux,44.842592,-0.580521,1550,1420.0,2016-07-18 20:55:52,2016-07-18 21:05:28,0
192719,2016-07-18 13:52:04,1206,La Maison Du Glacier,1 Place Saint Pierre,33000,Bordeaux,44.839706,-0.570672,6037,Rue Rolland 14,33000,Bordeaux,44.842592,-0.580521,2450,1426.0,2016-07-18 14:01:23,2016-07-18 14:08:36,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
212021,2016-07-30 22:29:52,1249,Pitaya Sainte Catherine,275 Rue Sainte Catherine,33000,Bordeaux,44.831692,-0.573207,80400,Boulevard President Franklin Roosevelt 15,33400,Bordeaux,44.820591,-0.582048,2250,1410.0,2016-07-30 22:50:16,2016-07-30 23:02:54,0
211501,2016-07-30 20:44:50,1204,Max A Table,36 Rue Cornac,33000,Bordeaux,44.851402,-0.575870,80163,Rue Marsan 22,33300,Bordeaux,44.856133,-0.576172,1250,1415.0,2016-07-30 21:02:32,2016-07-30 21:06:19,0
211508,2016-07-30 20:45:55,1254,Funky Burger,5 Rue Du Loup,33000,Bordeaux,44.838081,-0.572281,80168,Rue Des Sablieres 42,33800,Bordeaux,44.825488,-0.575264,1680,1461.0,2016-07-30 21:13:31,2016-07-30 21:19:45,0
211510,2016-07-30 20:46:05,1219,La Tagliatella,14 Rue Guiraude,33000,Bordeaux,44.839388,-0.574781,80169,Rue Pasteur 35,33200,Bordeaux,44.845053,-0.601157,4085,1411.0,2016-07-30 21:11:00,2016-07-30 21:23:24,0


The data type behind `df` is called a [pd.DataFrame <img height="12" style="display: inline-block" src="../static/link/to_pd.png">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame). `DataFrame`s are built around [numpy <img height="12" style="display: inline-block" src="../static/link/to_np.png">](https://numpy.org/)'s `ndarray`s and provide an interface optimized for **interactive usage** (i.e., a data scientist exploring a dataset step by step).

In [5]:
type(df)

pandas.core.frame.DataFrame

`DataFrame`s come with many methods.

For example, [.head() <img height="12" style="display: inline-block" src="../static/link/to_pd.png">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html#pandas.DataFrame.head) and [.tail() <img height="12" style="display: inline-block" src="../static/link/to_pd.png">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html#pandas.DataFrame.tail) show the first and last `n` rows, defaulting to `5`.

In [6]:
df.head()

order_id,placed_at,restaurant_id,restaurant,o_street,o_zip,o_city,o_latitude,o_longitude,customer_id,d_street,d_zip,d_city,d_latitude,d_longitude,total,courier_id,pickup_at,delivery_at,cancelled
192594,2016-07-18 12:23:13,1204,Max A Table,36 Rue Cornac,33000,Bordeaux,44.851402,-0.57587,10298,Rue Rolland 14,33000,Bordeaux,44.842592,-0.580521,2050,1423.0,2016-07-18 12:38:08,2016-07-18 12:48:22,0
192644,2016-07-18 12:48:55,1204,Max A Table,36 Rue Cornac,33000,Bordeaux,44.851402,-0.57587,6037,Rue Rolland 14,33000,Bordeaux,44.842592,-0.580521,2450,1426.0,2016-07-18 13:03:08,2016-07-18 13:12:01,0
192658,2016-07-18 13:00:13,1205,Taj Mahal,24 Rue Du Parlement Sainte-Catherine,33000,Bordeaux,44.840405,-0.57394,73830,Rue Batailley 12,33000,Bordeaux,44.838504,-0.591961,2550,1423.0,2016-07-18 13:19:04,2016-07-18 13:29:03,0
193242,2016-07-18 20:39:54,1208,Chez Ambre And Michel,1 Rue Matignon,33000,Bordeaux,44.850258,-0.586204,10298,Rue Rolland 14,33000,Bordeaux,44.842592,-0.580521,1550,1420.0,2016-07-18 20:55:52,2016-07-18 21:05:28,0
192719,2016-07-18 13:52:04,1206,La Maison Du Glacier,1 Place Saint Pierre,33000,Bordeaux,44.839706,-0.570672,6037,Rue Rolland 14,33000,Bordeaux,44.842592,-0.580521,2450,1426.0,2016-07-18 14:01:23,2016-07-18 14:08:36,0


In [7]:
df.tail(2)

order_id,placed_at,restaurant_id,restaurant,o_street,o_zip,o_city,o_latitude,o_longitude,customer_id,d_street,d_zip,d_city,d_latitude,d_longitude,total,courier_id,pickup_at,delivery_at,cancelled
211510,2016-07-30 20:46:05,1219,La Tagliatella,14 Rue Guiraude,33000,Bordeaux,44.839388,-0.574781,80169,Rue Pasteur 35,33200,Bordeaux,44.845053,-0.601157,4085,1411.0,2016-07-30 21:11:00,2016-07-30 21:23:24,0
211519,2016-07-30 20:46:55,1254,Funky Burger,5 Rue Du Loup,33000,Bordeaux,44.838081,-0.572281,80172,Rue Monadey 28,33800,Bordeaux,44.828816,-0.570789,2050,1817.0,2016-07-30 21:05:46,2016-07-30 21:14:07,0


[.info() <img height="12" style="display: inline-block" src="../static/link/to_pd.png">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html#pandas.DataFrame.info) shows an overview of the columns. In particular, it shows how many cells are filled in per column (i.e., are "non-null") and what **data type** (i.e., "dtype") *all* values in a column have. "int64" and "float64" imply that there are only `int` and `float` values in a column (each taking up 64 bits, or 1s and 0s, in memory). "object" is [pandas <img height="12" style="display: inline-block" src="../static/link/to_pd.png">](https://pandas.pydata.org/)' way of telling us it could not deduce any data type more specific than textual data. For the columns holding timestamps (e.g., "placed_at") we will convert the values further below.

Looking at the output, we see that some columns hold the data of **origin**-**destination** pairs, corresponding to restaurants and customers. Other columns store data following the dispatch and delivery process of couriers picking up and delivering meals at various points in time.

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 694 entries, 192594 to 211519
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   placed_at      694 non-null    object 
 1   restaurant_id  694 non-null    int64  
 2   restaurant     694 non-null    object 
 3   o_street       694 non-null    object 
 4   o_zip          694 non-null    int64  
 5   o_city         694 non-null    object 
 6   o_latitude     694 non-null    float64
 7   o_longitude    694 non-null    float64
 8   customer_id    694 non-null    int64  
 9   d_street       694 non-null    object 
 10  d_zip          694 non-null    int64  
 11  d_city         694 non-null    object 
 12  d_latitude     694 non-null    float64
 13  d_longitude    694 non-null    float64
 14  total          694 non-null    int64  
 15  courier_id     690 non-null    float64
 16  pickup_at      665 non-null    object 
 17  delivery_at    663 non-null    object 
 18  ca

[.describe() <img height="12" style="display: inline-block" src="../static/link/to_pd.png">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html#pandas.DataFrame.describe) shows statistics on all numerical columns in a `DataFrame`.

For the example orders, such statistics may not be meaningful for all numerical columns as some of them merely hold IDs or zip codes.

In [9]:
df.describe()

,restaurant_id,o_zip,o_latitude,o_longitude,customer_id,d_zip,d_latitude,d_longitude,total,courier_id,cancelled
count,694.0,694.0,694.0,694.0,694.0,694.0,694.0,694.0,694.0,690.0,694.0
mean,1228.479827,33075.216138,44.839258,-0.575759,74751.126801,33191.613833,44.838623,-0.57604,2294.636888,1484.755072,0.044669
std,18.001091,207.971435,0.007471,0.00692,14604.304963,307.378697,0.011545,0.010799,1060.695748,154.58621,0.206724
min,1204.0,33000.0,44.81818,-0.5994,2377.0,33000.0,44.809813,-0.606892,350.0,1403.0,0.0
25%,1212.0,33000.0,44.83691,-0.579345,76648.5,33000.0,44.829981,-0.581612,1500.0,1415.0,0.0
50%,1224.0,33000.0,44.838287,-0.57394,78146.0,33000.0,44.838364,-0.575056,1969.5,1424.0,0.0
75%,1244.0,33000.0,44.841721,-0.572281,79331.5,33300.0,44.846696,-0.569601,2750.0,1462.0,0.0
max,1267.0,33800.0,44.855438,-0.550576,80401.0,33800.0,44.877693,-0.537952,8370.0,2013.0,1.0
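Since statistics on IDs or zip codes carry little meaning, one option is to select the meaningful columns first and call `.describe()` on the result. The sketch below assumes a made-up four-row `sample` with "total" in cents, as in the orders data.

```python
import pandas as pd

# Hypothetical mini-dataset; "total" is in cents as in the orders data.
sample = pd.DataFrame(
    {
        "restaurant_id": [1204, 1204, 1205, 1208],
        "total": [2050, 2450, 2550, 1550],
    }
)

# Restrict the statistics to the column where they are meaningful.
stats = sample[["total"]].describe()

print(stats)
```

The result is itself a `DataFrame`, so a single statistic can be looked up with `stats.loc["mean", "total"]`.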


## Indexing & Slicing

`DataFrame`s support being indexed or sliced, both in the row and column dimensions.

To obtain all data in a single column, we index into the `DataFrame` with the column's name.

For example, `restaurant_col` holds only the restaurant names. Its index still consists of the "order_id"s.

In [10]:
restaurant_col = df["restaurant"]

restaurant_col.head()

order_id
192594              Max A Table
192644              Max A Table
192658                Taj Mahal
193242    Chez Ambre And Michel
192719     La Maison Du Glacier
Name: restaurant, dtype: object

The data type of a single column is [pd.Series <img height="12" style="display: inline-block" src="../static/link/to_pd.png">](https://pandas.pydata.org/docs/reference/api/pandas.Series.html#pandas.Series), which is very similar to a `DataFrame` with only one column. `Series` objects work like built-in `list`s with added functionality, most notably an index that labels each value.

In [11]:
type(restaurant_col)

pandas.core.series.Series
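To make the comparison with a `list` concrete, here is a small sketch built from two of the restaurant names above. The names `names`, `series`, and `upper` are chosen for illustration only.

```python
import pandas as pd

names = ["Max A Table", "Taj Mahal"]

# The same values as a Series, labeled by (made-up) order IDs.
series = pd.Series(names, index=[192594, 192658], name="restaurant")

# A list is addressed by position, a Series by its index labels.
print(names[1])            # Taj Mahal
print(series.loc[192658])  # Taj Mahal

# Series methods work on all values at once and keep the index.
upper = series.str.upper()
print(list(upper))         # ['MAX A TABLE', 'TAJ MAHAL']
```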

If we index with a `list` of column names, the result is itself another `DataFrame`. That operation is like slicing out a smaller matrix from a larger one as we saw with `ndarray`s before.

For example, let's pull out all location data of the orders' origins (i.e., restaurants).

In [12]:
origins = df[["o_street", "o_zip", "o_city", "o_latitude", "o_longitude"]]

origins.head()

order_id,o_street,o_zip,o_city,o_latitude,o_longitude
192594,36 Rue Cornac,33000,Bordeaux,44.851402,-0.57587
192644,36 Rue Cornac,33000,Bordeaux,44.851402,-0.57587
192658,24 Rue Du Parlement Sainte-Catherine,33000,Bordeaux,44.840405,-0.57394
193242,1 Rue Matignon,33000,Bordeaux,44.850258,-0.586204
192719,1 Place Saint Pierre,33000,Bordeaux,44.839706,-0.570672


To access individual rows, we index not into a `DataFrame` directly but into its [.loc <img height="12" style="display: inline-block" src="../static/link/to_pd.png">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc) object (which also exists for `Series`).

Here, `200800` is an "order_id" number. The result is a `Series` object where the original `DataFrame`'s columns become the index.

In [13]:
df.loc[200800]

placed_at        2016-07-24 19:30:52
restaurant_id                   1204
restaurant               Max A Table
o_street               36 Rue Cornac
o_zip                          33000
o_city                      Bordeaux
o_latitude                 44.851402
o_longitude                 -0.57587
customer_id                    76187
d_street            Rue Judaique 213
d_zip                          33000
d_city                      Bordeaux
d_latitude                 44.840829
d_longitude                -0.595445
total                           2250
courier_id                    1468.0
pickup_at        2016-07-24 19:50:52
delivery_at      2016-07-24 19:58:16
cancelled                          0
Name: 200800, dtype: object

We can also index into the `restaurant_col` and `origins` objects from above. As `restaurant_col` is a `Series`, we get back a scalar value.

In [14]:
restaurant_col.loc[200800]

'Max A Table'

In [15]:
origins.loc[200800]

o_street       36 Rue Cornac
o_zip                  33000
o_city              Bordeaux
o_latitude         44.851402
o_longitude         -0.57587
Name: 200800, dtype: object

Slicing also works with [.loc <img height="12" style="display: inline-block" src="../static/link/to_pd.png">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc). In contrast to Python's built-in slicing, the upper bound is included in the slice!

In [16]:
df.loc[200300:200800]

order_id,placed_at,restaurant_id,restaurant,o_street,o_zip,o_city,o_latitude,o_longitude,customer_id,d_street,d_zip,d_city,d_latitude,d_longitude,total,courier_id,pickup_at,delivery_at,cancelled
200300,2016-07-24 13:46:15,1207,Le Jardin Pekinois,9 Rue Des Freres Bonie,33000,Bordeaux,44.837078,-0.579572,76030,Rue Villeneuve 1,33000,Bordeaux,44.839927,-0.580012,3820,1426.0,2016-07-24 14:12:45,2016-07-24 14:16:59,0
200305,2016-07-24 13:49:25,1207,Le Jardin Pekinois,9 Rue Des Freres Bonie,33000,Bordeaux,44.837078,-0.579572,76033,Rue Du Ha 54,33000,Bordeaux,44.835898,-0.577941,1689,1405.0,2016-07-24 14:12:04,2016-07-24 14:15:54,0
200800,2016-07-24 19:30:52,1204,Max A Table,36 Rue Cornac,33000,Bordeaux,44.851402,-0.57587,76187,Rue Judaique 213,33000,Bordeaux,44.840829,-0.595445,2250,1468.0,2016-07-24 19:50:52,2016-07-24 19:58:16,0


In [17]:
restaurant_col.loc[200300:200800]

order_id
200300    Le Jardin Pekinois
200305    Le Jardin Pekinois
200800           Max A Table
Name: restaurant, dtype: object

In [18]:
origins.loc[200300:200800]

order_id,o_street,o_zip,o_city,o_latitude,o_longitude
200300,9 Rue Des Freres Bonie,33000,Bordeaux,44.837078,-0.579572
200305,9 Rue Des Freres Bonie,33000,Bordeaux,44.837078,-0.579572
200800,36 Rue Cornac,33000,Bordeaux,44.851402,-0.57587
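The difference between label-based slicing with `.loc` and Python's position-based slicing is easy to verify on a small example. The sketch below uses three order IDs from above plus a fourth, hypothetical one (`200900` with a made-up total).

```python
import pandas as pd

order_ids = [200300, 200305, 200800, 200900]  # the last ID is hypothetical
totals = pd.Series([3820, 1689, 2250, 1990], index=order_ids)

# Label-based slicing with .loc includes the upper bound 200800.
label_slice = totals.loc[200300:200800]
print(list(label_slice.index))  # [200300, 200305, 200800]

# Python's built-in slicing excludes the upper bound (position 2).
position_slice = order_ids[0:2]
print(position_slice)           # [200300, 200305]
```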


[.loc <img height="12" style="display: inline-block" src="../static/link/to_pd.png">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc) also allows us to index and slice in both dimensions simultaneously. The first index or slice goes along the row dimension while the second index or slice selects the columns.

In [19]:
df.loc[
    200300:200800,
    ["o_street", "o_zip", "o_city", "o_latitude", "o_longitude"]
]

order_id,o_street,o_zip,o_city,o_latitude,o_longitude
200300,9 Rue Des Freres Bonie,33000,Bordeaux,44.837078,-0.579572
200305,9 Rue Des Freres Bonie,33000,Bordeaux,44.837078,-0.579572
200800,36 Rue Cornac,33000,Bordeaux,44.851402,-0.57587


## Type Casting

As [.info() <img height="12" style="display: inline-block" src="../static/link/to_pd.png">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html#pandas.DataFrame.info) already revealed above, the timestamp columns could only be parsed as generic objects (i.e., textual data). Also, the "cancelled" column, which holds only the values `0` and `1` encoding `False` and `True`, does not have a `bool` data type.

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 694 entries, 192594 to 211519
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   placed_at      694 non-null    object 
 1   restaurant_id  694 non-null    int64  
 2   restaurant     694 non-null    object 
 3   o_street       694 non-null    object 
 4   o_zip          694 non-null    int64  
 5   o_city         694 non-null    object 
 6   o_latitude     694 non-null    float64
 7   o_longitude    694 non-null    float64
 8   customer_id    694 non-null    int64  
 9   d_street       694 non-null    object 
 10  d_zip          694 non-null    int64  
 11  d_city         694 non-null    object 
 12  d_latitude     694 non-null    float64
 13  d_longitude    694 non-null    float64
 14  total          694 non-null    int64  
 15  courier_id     690 non-null    float64
 16  pickup_at      665 non-null    object 
 17  delivery_at    663 non-null    object 
 18  ca

The [pd.to_datetime() <img height="12" style="display: inline-block" src="../static/link/to_pd.png">](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html#pandas.to_datetime) function **casts** the timestamp columns correctly.

In [21]:
pd.to_datetime(df["placed_at"])

order_id
192594   2016-07-18 12:23:13
192644   2016-07-18 12:48:55
192658   2016-07-18 13:00:13
193242   2016-07-18 20:39:54
192719   2016-07-18 13:52:04
                 ...        
212021   2016-07-30 22:29:52
211501   2016-07-30 20:44:50
211508   2016-07-30 20:45:55
211510   2016-07-30 20:46:05
211519   2016-07-30 20:46:55
Name: placed_at, Length: 694, dtype: datetime64[ns]

Let's overwrite the original "placed_at" column with one that has the correct data type.

In [22]:
df["placed_at"] = pd.to_datetime(df["placed_at"])

The [.astype() <img height="12" style="display: inline-block" src="../static/link/to_pd.png">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html#pandas.DataFrame.astype) method generalizes this idea and allows us to cast several columns in a `DataFrame` at once. It takes a `dict`ionary mapping column names to data types as its input. Besides references to actual data types (e.g., `bool`), it also understands [pandas <img height="12" style="display: inline-block" src="../static/link/to_pd.png">](https://pandas.pydata.org/)-specific data types provided as text.

In [23]:
df = df.astype({
    "pickup_at": "datetime64[ns]",
    "delivery_at": "datetime64[ns]",
    "cancelled": bool,
})
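What happens to empty cells and to the `0`/`1` flags during casting can be checked on a tiny example. The sketch below assumes a made-up two-row `sample` whose second row has no pickup time; it uses `pd.to_datetime()` for the timestamps and `.astype()` for the flag.

```python
import pandas as pd

# Hypothetical data: the second order was never picked up.
sample = pd.DataFrame(
    {
        "pickup_at": ["2016-07-18 12:38:08", None],
        "cancelled": [0, 1],
    }
)

# Missing timestamps become NaT ("not a time") instead of raising an error.
sample["pickup_at"] = pd.to_datetime(sample["pickup_at"])

# 0 and 1 are cast to False and True.
sample = sample.astype({"cancelled": bool})

print(sample["pickup_at"].isna().tolist())  # [False, True]
print(sample["cancelled"].tolist())         # [False, True]
```

That missing values survive the cast explains why the "non-null" counts of the timestamp columns stay below the total number of rows.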

Now, all columns in `df` have more applicable data types.

In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 694 entries, 192594 to 211519
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   placed_at      694 non-null    datetime64[ns]
 1   restaurant_id  694 non-null    int64         
 2   restaurant     694 non-null    object        
 3   o_street       694 non-null    object        
 4   o_zip          694 non-null    int64         
 5   o_city         694 non-null    object        
 6   o_latitude     694 non-null    float64       
 7   o_longitude    694 non-null    float64       
 8   customer_id    694 non-null    int64         
 9   d_street       694 non-null    object        
 10  d_zip          694 non-null    int64         
 11  d_city         694 non-null    object        
 12  d_latitude     694 non-null    float64       
 13  d_longitude    694 non-null    float64       
 14  total          694 non-null    int64         
 15  courier_id     

## Filtering

A common operation when working with `DataFrame`s is to filter for rows fulfilling certain conditions. That is implemented with so-called **boolean filters** in [pandas <img height="12" style="display: inline-block" src="../static/link/to_pd.png">](https://pandas.pydata.org/), which are simply `DataFrame` or `Series` objects holding only `True` or `False` values.

One way to obtain such objects is to use relational operators with columns.

`max_a_table` holds `True` values for all orders at the restaurant with the ID `1204`.

In [25]:
max_a_table = df["restaurant_id"] == 1204

max_a_table

order_id
192594     True
192644     True
192658    False
193242    False
192719    False
          ...  
212021    False
211501     True
211508    False
211510    False
211519    False
Name: restaurant_id, Length: 694, dtype: bool

Next, let's use a boolean filter to index into `df`. That gives us back a new `DataFrame` with all orders belonging to the restaurant "Max A Table".

In [26]:
df.loc[df["restaurant_id"] == 1204].head()

order_id,placed_at,restaurant_id,restaurant,o_street,o_zip,o_city,o_latitude,o_longitude,customer_id,d_street,d_zip,d_city,d_latitude,d_longitude,total,courier_id,pickup_at,delivery_at,cancelled
192594,2016-07-18 12:23:13,1204,Max A Table,36 Rue Cornac,33000,Bordeaux,44.851402,-0.57587,10298,Rue Rolland 14,33000,Bordeaux,44.842592,-0.580521,2050,1423.0,2016-07-18 12:38:08,2016-07-18 12:48:22,False
192644,2016-07-18 12:48:55,1204,Max A Table,36 Rue Cornac,33000,Bordeaux,44.851402,-0.57587,6037,Rue Rolland 14,33000,Bordeaux,44.842592,-0.580521,2450,1426.0,2016-07-18 13:03:08,2016-07-18 13:12:01,False
194335,2016-07-19 20:35:21,1204,Max A Table,36 Rue Cornac,33000,Bordeaux,44.851402,-0.57587,74268,Place Canteloup 12,33800,Bordeaux,44.833834,-0.565674,3100,1420.0,2016-07-19 20:51:16,2016-07-19 21:01:08,False
196615,2016-07-21 19:50:15,1204,Max A Table,36 Rue Cornac,33000,Bordeaux,44.851402,-0.57587,74901,Rue Marcelin Jourdan 55,33200,Bordeaux,44.85036,-0.597361,2050,1418.0,2016-07-21 20:12:29,2016-07-21 20:25:57,False
196839,2016-07-21 20:27:22,1204,Max A Table,36 Rue Cornac,33000,Bordeaux,44.851402,-0.57587,74966,Rue Sainte-Catherine 137,33000,Bordeaux,44.836516,-0.573983,3750,1472.0,2016-07-21 20:41:42,2016-07-21 21:14:41,False


Instead of an explicit condition, we can also use a reference to a boolean filter created above.

In [27]:
df.loc[max_a_table].head()

order_id,placed_at,restaurant_id,restaurant,o_street,o_zip,o_city,o_latitude,o_longitude,customer_id,d_street,d_zip,d_city,d_latitude,d_longitude,total,courier_id,pickup_at,delivery_at,cancelled
192594,2016-07-18 12:23:13,1204,Max A Table,36 Rue Cornac,33000,Bordeaux,44.851402,-0.57587,10298,Rue Rolland 14,33000,Bordeaux,44.842592,-0.580521,2050,1423.0,2016-07-18 12:38:08,2016-07-18 12:48:22,False
192644,2016-07-18 12:48:55,1204,Max A Table,36 Rue Cornac,33000,Bordeaux,44.851402,-0.57587,6037,Rue Rolland 14,33000,Bordeaux,44.842592,-0.580521,2450,1426.0,2016-07-18 13:03:08,2016-07-18 13:12:01,False
194335,2016-07-19 20:35:21,1204,Max A Table,36 Rue Cornac,33000,Bordeaux,44.851402,-0.57587,74268,Place Canteloup 12,33800,Bordeaux,44.833834,-0.565674,3100,1420.0,2016-07-19 20:51:16,2016-07-19 21:01:08,False
196615,2016-07-21 19:50:15,1204,Max A Table,36 Rue Cornac,33000,Bordeaux,44.851402,-0.57587,74901,Rue Marcelin Jourdan 55,33200,Bordeaux,44.85036,-0.597361,2050,1418.0,2016-07-21 20:12:29,2016-07-21 20:25:57,False
196839,2016-07-21 20:27:22,1204,Max A Table,36 Rue Cornac,33000,Bordeaux,44.851402,-0.57587,74966,Rue Sainte-Catherine 137,33000,Bordeaux,44.836516,-0.573983,3750,1472.0,2016-07-21 20:41:42,2016-07-21 21:14:41,False


Combining the filter with a `list` of columns allows us to further narrow down the `DataFrame`.

For example, the preview below shows us the first five customers "Max A Table" had in the target period.

In [28]:
df.loc[
    max_a_table,
    ["customer_id", "d_street", "d_zip", "d_city", "d_latitude", "d_longitude"]
].head()

order_id,customer_id,d_street,d_zip,d_city,d_latitude,d_longitude
192594,10298,Rue Rolland 14,33000,Bordeaux,44.842592,-0.580521
192644,6037,Rue Rolland 14,33000,Bordeaux,44.842592,-0.580521
194335,74268,Place Canteloup 12,33800,Bordeaux,44.833834,-0.565674
196615,74901,Rue Marcelin Jourdan 55,33200,Bordeaux,44.85036,-0.597361
196839,74966,Rue Sainte-Catherine 137,33000,Bordeaux,44.836516,-0.573983


Boolean filters can be created in an arbitrary fashion by combining several conditions with `&` and `|`, which model the logical AND and OR operators.

The example lists the first five customers of "Max A Table" in a target area provided as latitude-longitude coordinates.

In [29]:
df.loc[
    (
        max_a_table
        &
        (
            (df["d_latitude"] > 44.85)
            |
            (df["d_longitude"] < -0.59)
        )        
    ),
    ["customer_id", "d_street", "d_zip", "d_city", "d_latitude", "d_longitude"]
].head()

order_id,customer_id,d_street,d_zip,d_city,d_latitude,d_longitude
196615,74901,Rue Marcelin Jourdan 55,33200,Bordeaux,44.85036,-0.597361
200800,76187,Rue Judaique 213,33000,Bordeaux,44.840829,-0.595445
200893,76218,Rue Notre Dame 21,33000,Bordeaux,44.85026,-0.572377
202788,76786,Rue De Leybardie 27,33300,Bordeaux,44.86136,-0.565057
202563,76730,Rue Lombard 47,33300,Bordeaux,44.858661,-0.563095
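The same combination of conditions can be traced row by row on a small, made-up `sample`. Note that each comparison needs its own parentheses when written inline because `&` and `|` bind more tightly than `==`, `>`, and `<` in Python. The filter names `is_1204`, `north`, and `west` are chosen for illustration.

```python
import pandas as pd

# Hypothetical orders, labeled 1 to 4.
sample = pd.DataFrame(
    {
        "restaurant_id": [1204, 1204, 1205, 1204],
        "d_latitude": [44.86, 44.84, 44.86, 44.84],
        "d_longitude": [-0.57, -0.57, -0.60, -0.60],
    },
    index=[1, 2, 3, 4],
)

is_1204 = sample["restaurant_id"] == 1204
north = sample["d_latitude"] > 44.85
west = sample["d_longitude"] < -0.59

# Orders of restaurant 1204 that go either north OR west.
selected = sample.loc[is_1204 & (north | west)]

print(list(selected.index))  # [1, 4]
```

Order 2 belongs to the restaurant but lies outside the area, and order 3 lies inside the area but belongs to another restaurant, so both are dropped.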


[.isin() <img height="12" style="display: inline-block" src="../static/link/to_pd.png">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html#pandas.DataFrame.isin) is another useful method: It allows us to provide a `list` of values that we are filtering for in a column.

In [30]:
df.loc[
    (
        max_a_table
        &
        df["customer_id"].isin([6037, 79900, 80095])
    ),
    ["placed_at", "customer_id", "d_street", "d_zip", "d_city", "total"]
].head()

order_id,placed_at,customer_id,d_street,d_zip,d_city,total
192644,2016-07-18 12:48:55,6037,Rue Rolland 14,33000,Bordeaux,2450
210945,2016-07-30 19:30:39,79900,Rue Du Couvent 16,33000,Bordeaux,1650
211363,2016-07-30 20:27:45,80095,Rue De La Porte Saint-Jean 8,33000,Bordeaux,2400


The `~` operator negates a condition. So, in the cell below we see all orders at "Max A Table" except the ones from the indicated customers.

In [31]:
df.loc[
    (
        max_a_table
        &
        ~df["customer_id"].isin([6037, 79900, 80095])
    ),
    ["placed_at", "customer_id", "d_street", "d_zip", "d_city", "total"]
].head()

order_id,placed_at,customer_id,d_street,d_zip,d_city,total
192594,2016-07-18 12:23:13,10298,Rue Rolland 14,33000,Bordeaux,2050
194335,2016-07-19 20:35:21,74268,Place Canteloup 12,33800,Bordeaux,3100
196615,2016-07-21 19:50:15,74901,Rue Marcelin Jourdan 55,33200,Bordeaux,2050
196839,2016-07-21 20:27:22,74966,Rue Sainte-Catherine 137,33000,Bordeaux,3750
198631,2016-07-22 21:29:40,75047,Rue Boudet 29,33000,Bordeaux,2650
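A filter and its negation split the data into two parts that do not overlap. The following sketch shows this with four customer IDs taken from the outputs above; the index labels 1 to 4 and the variable names are made up.

```python
import pandas as pd

customer_ids = pd.Series([6037, 10298, 79900, 74268], index=[1, 2, 3, 4])

# True wherever the customer ID is one of the listed values.
wanted = customer_ids.isin([6037, 79900])

matching = customer_ids.loc[wanted]
remaining = customer_ids.loc[~wanted]  # ~ flips True and False

print(matching.tolist())   # [6037, 79900]
print(remaining.tolist())  # [10298, 74268]
```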


## DataFrame Methods

Now that we have learned the basics of selecting the data we want from a `DataFrame`, let's look at a couple of methods that allow us to extract information from a `DataFrame`, in particular, to run some **descriptive statistics**.

[.unique() <img height="12" style="display: inline-block" src="../static/link/to_pd.png">](https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html#pandas.Series.unique) is a simple `Series` method returning an `ndarray` in which every distinct value of the `Series` appears exactly once.

Here, we get an overview of how many restaurants there are in Bordeaux in the target time horizon.

In [32]:
df["restaurant_id"].unique()

array([1204, 1205, 1208, 1206, 1209, 1207, 1211, 1213, 1214, 1212, 1216,
       1215, 1217, 1218, 1219, 1220, 1221, 1223, 1222, 1224, 1225, 1229,
       1226, 1227, 1230, 1231, 1232, 1233, 1234, 1235, 1236, 1237, 1239,
       1241, 1242, 1243, 1245, 1244, 1246, 1247, 1249, 1254, 1250, 1256,
       1258, 1259, 1260, 1263, 1264, 1266, 1265, 1267])

In [33]:
len(df["restaurant_id"].unique())

52
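As an alternative to wrapping `.unique()` in `len()`, a `Series` also offers the `.nunique()` method, which returns the number of distinct values directly. A minimal sketch with a few made-up entries:

```python
import pandas as pd

restaurant_ids = pd.Series([1204, 1205, 1204, 1208, 1205])

# Both expressions count the distinct values.
print(len(restaurant_ids.unique()))  # 3
print(restaurant_ids.nunique())      # 3
```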

To obtain an `ndarray` of all customer IDs of "Max A Table", we write the following.

In [34]:
df.loc[
    max_a_table,
    "customer_id"
].unique()

array([10298,  6037, 74268, 74901, 74966, 75047, 76187, 76218, 76442,
       76396, 76421, 76786, 76822, 76730, 76871, 75687, 77409, 77386,
       77355, 77556, 78129, 78353, 78608, 78621, 78958, 79119, 79153,
       76838, 79234, 79486, 79576, 79563, 79653, 79900, 79912, 80026,
       80204, 80095, 80163])

[.value_counts() <img height="12" style="display: inline-block" src="../static/link/to_pd.png">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html#pandas.DataFrame.value_counts) is similar to [.unique() <img height="12" style="display: inline-block" src="../static/link/to_pd.png">](https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html#pandas.Series.unique). It returns a `Series` that maps each distinct value in a column to the number of times it occurs, sorted by these counts in descending order.

We use it to list the `10` most popular restaurants and customers in the dataset.

In [35]:
df["restaurant_id"].value_counts().head(10)

1254    78
1207    47
1204    39
1217    37
1212    32
1244    25
1225    25
1249    23
1242    19
1221    18
Name: restaurant_id, dtype: int64

In [36]:
df["customer_id"].value_counts().head(10)

73919    14
10298    12
6037      8
77048     5
4210      4
74426     4
9304      3
76838     3
75905     3
74791     3
Name: customer_id, dtype: int64
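In the outputs above, the left column is the index holding the distinct values, and the right column holds the counts. The sketch below illustrates that structure with six made-up entries.

```python
import pandas as pd

restaurant_ids = pd.Series([1254, 1207, 1254, 1204, 1254, 1207])

counts = restaurant_ids.value_counts()

# The index holds the distinct values, the values hold the counts.
print(counts.index[0])   # 1254, the most frequent ID
print(counts.loc[1254])  # 3
print(counts.tolist())   # [3, 2, 1]
```

Because the result is a `Series`, `.head(10)` and `.loc` work on it just as on any other column.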

[.sum() <img height="12" style="display: inline-block" src="../static/link/to_pd.png">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html#pandas.DataFrame.sum), [.min() <img height="12" style="display: inline-block" src="../static/link/to_pd.png">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.min.html#pandas.DataFrame.min), [.max() <img height="12" style="display: inline-block" src="../static/link/to_pd.png">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.max.html#pandas.DataFrame.max), [.mean() <img height="12" style="display: inline-block" src="../static/link/to_pd.png">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html#pandas.DataFrame.mean), and [.round() <img height="12" style="display: inline-block" src="../static/link/to_pd.png">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.round.html#pandas.DataFrame.round) are self-explanatory.

We use them to analyze the overall spending in Bordeaux and at "Max A Table".

In [37]:
df["total"].sum() / 100  # Convert to Euro

15924.78

In [38]:
df.loc[
    max_a_table,
    "total"
].sum() / 100

885.0

In [39]:
df["total"].min() / 100

3.5

In [40]:
df["total"].max() / 100

83.7

In [41]:
df.loc[
    max_a_table,
    "total"
].min() / 100

12.5

In [42]:
df.loc[
    max_a_table,
    "total"
].max() / 100

60.0

In [43]:
df["total"].mean() / 100

22.94636887608069

In [44]:
df["total"].mean().round() / 100

22.95

In [45]:
df.loc[
    max_a_table,
    "total"
].mean().round() / 100

22.69
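The cells above first round the mean to whole cents and then convert to Euro. An equivalent route is to convert first and round to two decimals afterwards. The sketch below assumes three made-up totals in cents and uses Python's built-in `round()` for the scalar and the `.round()` method for a whole `Series`.

```python
import pandas as pd

totals = pd.Series([2050, 2450, 2551])  # hypothetical totals in cents

# Convert the mean to Euro, then round to two decimals.
mean_euro = round(totals.mean() / 100, 2)
print(mean_euro)  # 23.5

# .round() also works element-wise on a whole Series.
euros = (totals / 100).round(2)
print(euros.tolist())  # [20.5, 24.5, 25.51]
```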