intro-to-data-science/01_scientific_stack/02_content_pandas.ipynb
Alexander Hess 51a5dcc8ee
Run black on all the notebooks
- we use black's default settings
- some cells are NOT kept in black's format to:
  - increase readability
  - or show Python's flexibility with regard to style
2024-07-15 12:12:51 +02:00

3456 lines
120 KiB
Text
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note**: Click on \"*Kernel*\" > \"*Restart Kernel and Clear All Outputs*\" in [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/) *before* reading this notebook to reset its output. If you cannot run this file on your machine, you may want to open it [in the cloud <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_mb.png\">](https://mybinder.org/v2/gh/webartifex/intro-to-data-science/main?urlpath=lab/tree/01_scientific_stack/02_content_pandas.ipynb)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Chapter 1: Python's Scientific Stack (Part 2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For practitioners, the [numpy <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_np.png\">](https://numpy.org/) library may feel a bit too \"technical\" or too close to \"real programming\" and they may prefer something that looks and feels more like Excel. That is where the [pandas <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/) library comes in.\n",
"\n",
"Let's first `pip` install and then `import` it."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: pandas in /home/instructor/Repositories/intro-to-data-science/.venv/lib/python3.12/site-packages (2.2.2)\n",
"Requirement already satisfied: numpy>=1.26.0 in /home/instructor/Repositories/intro-to-data-science/.venv/lib/python3.12/site-packages (from pandas) (2.0.0)\n",
"Requirement already satisfied: python-dateutil>=2.8.2 in /home/instructor/Repositories/intro-to-data-science/.venv/lib/python3.12/site-packages (from pandas) (2.9.0.post0)\n",
"Requirement already satisfied: pytz>=2020.1 in /home/instructor/Repositories/intro-to-data-science/.venv/lib/python3.12/site-packages (from pandas) (2024.1)\n",
"Requirement already satisfied: tzdata>=2022.7 in /home/instructor/Repositories/intro-to-data-science/.venv/lib/python3.12/site-packages (from pandas) (2024.1)\n",
"Requirement already satisfied: six>=1.5 in /home/instructor/Repositories/intro-to-data-science/.venv/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)\n"
]
}
],
"source": [
"!pip install pandas"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Excel-like Data with Pandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the same folder as this notebook there is a file named \"*orders.csv*\" that holds the order data of an urban meal delivery platform operating in Bordeaux, France. Open in with a double-click and take a look at its contents right here in JupyterLab!\n",
"\n",
"[pandas <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/) provides a [pd.read_csv() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv) function that, as the name suggests, can open and read in CSV data. For Excel files, there is also a [pd.read_excel() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html) function but the CSV format is probably more widespread in use.\n",
"\n",
"Let's read in the \"*orders.csv*\" file with [pd.read_csv() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv) specifying the \"order_id\" column as the **index**. Here, index is a column with *unique* values that allow the identification of each row in a dataset. If we don't specify an index column, [pandas <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/) creates a surrogate index as a sequence of integers 1, 2, 3, and so on."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv(\"orders.csv\", index_col=\"order_id\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`df` models a table-like data structure, comparable to one tab in an Excel file. [pandas <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/) and JupyterLab are designed to work well together: The `df` object shows a preview of the dataset below the code cell. The rows are the **records** in the dataset and the columns take the role of the **attributes** each record has. Each column comes with a **domain** of allowable values."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>placed_at</th>\n",
" <th>restaurant_id</th>\n",
" <th>restaurant</th>\n",
" <th>o_street</th>\n",
" <th>o_zip</th>\n",
" <th>o_city</th>\n",
" <th>o_latitude</th>\n",
" <th>o_longitude</th>\n",
" <th>customer_id</th>\n",
" <th>d_street</th>\n",
" <th>d_zip</th>\n",
" <th>d_city</th>\n",
" <th>d_latitude</th>\n",
" <th>d_longitude</th>\n",
" <th>total</th>\n",
" <th>courier_id</th>\n",
" <th>pickup_at</th>\n",
" <th>delivery_at</th>\n",
" <th>cancelled</th>\n",
" </tr>\n",
" <tr>\n",
" <th>order_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>192594</th>\n",
" <td>2016-07-18 12:23:13</td>\n",
" <td>1204</td>\n",
" <td>Max A Table</td>\n",
" <td>36 Rue Cornac</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.851402</td>\n",
" <td>-0.575870</td>\n",
" <td>10298</td>\n",
" <td>Rue Rolland 14</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.842592</td>\n",
" <td>-0.580521</td>\n",
" <td>2050</td>\n",
" <td>1423.0</td>\n",
" <td>2016-07-18 12:38:08</td>\n",
" <td>2016-07-18 12:48:22</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>192644</th>\n",
" <td>2016-07-18 12:48:55</td>\n",
" <td>1204</td>\n",
" <td>Max A Table</td>\n",
" <td>36 Rue Cornac</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.851402</td>\n",
" <td>-0.575870</td>\n",
" <td>6037</td>\n",
" <td>Rue Rolland 14</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.842592</td>\n",
" <td>-0.580521</td>\n",
" <td>2450</td>\n",
" <td>1426.0</td>\n",
" <td>2016-07-18 13:03:08</td>\n",
" <td>2016-07-18 13:12:01</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>192658</th>\n",
" <td>2016-07-18 13:00:13</td>\n",
" <td>1205</td>\n",
" <td>Taj Mahal</td>\n",
" <td>24 Rue Du Parlement Sainte-Catherine</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.840405</td>\n",
" <td>-0.573940</td>\n",
" <td>73830</td>\n",
" <td>Rue Batailley 12</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.838504</td>\n",
" <td>-0.591961</td>\n",
" <td>2550</td>\n",
" <td>1423.0</td>\n",
" <td>2016-07-18 13:19:04</td>\n",
" <td>2016-07-18 13:29:03</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>193242</th>\n",
" <td>2016-07-18 20:39:54</td>\n",
" <td>1208</td>\n",
" <td>Chez Ambre And Michel</td>\n",
" <td>1 Rue Matignon</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.850258</td>\n",
" <td>-0.586204</td>\n",
" <td>10298</td>\n",
" <td>Rue Rolland 14</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.842592</td>\n",
" <td>-0.580521</td>\n",
" <td>1550</td>\n",
" <td>1420.0</td>\n",
" <td>2016-07-18 20:55:52</td>\n",
" <td>2016-07-18 21:05:28</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>192719</th>\n",
" <td>2016-07-18 13:52:04</td>\n",
" <td>1206</td>\n",
" <td>La Maison Du Glacier</td>\n",
" <td>1 Place Saint Pierre</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.839706</td>\n",
" <td>-0.570672</td>\n",
" <td>6037</td>\n",
" <td>Rue Rolland 14</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.842592</td>\n",
" <td>-0.580521</td>\n",
" <td>2450</td>\n",
" <td>1426.0</td>\n",
" <td>2016-07-18 14:01:23</td>\n",
" <td>2016-07-18 14:08:36</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>212021</th>\n",
" <td>2016-07-30 22:29:52</td>\n",
" <td>1249</td>\n",
" <td>Pitaya Sainte Catherine</td>\n",
" <td>275 Rue Sainte Catherine</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.831692</td>\n",
" <td>-0.573207</td>\n",
" <td>80400</td>\n",
" <td>Boulevard President Franklin Roosevelt 15</td>\n",
" <td>33400</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.820591</td>\n",
" <td>-0.582048</td>\n",
" <td>2250</td>\n",
" <td>1410.0</td>\n",
" <td>2016-07-30 22:50:16</td>\n",
" <td>2016-07-30 23:02:54</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>211501</th>\n",
" <td>2016-07-30 20:44:50</td>\n",
" <td>1204</td>\n",
" <td>Max A Table</td>\n",
" <td>36 Rue Cornac</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.851402</td>\n",
" <td>-0.575870</td>\n",
" <td>80163</td>\n",
" <td>Rue Marsan 22</td>\n",
" <td>33300</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.856133</td>\n",
" <td>-0.576172</td>\n",
" <td>1250</td>\n",
" <td>1415.0</td>\n",
" <td>2016-07-30 21:02:32</td>\n",
" <td>2016-07-30 21:06:19</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>211508</th>\n",
" <td>2016-07-30 20:45:55</td>\n",
" <td>1254</td>\n",
" <td>Funky Burger</td>\n",
" <td>5 Rue Du Loup</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.838081</td>\n",
" <td>-0.572281</td>\n",
" <td>80168</td>\n",
" <td>Rue Des Sablieres 42</td>\n",
" <td>33800</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.825488</td>\n",
" <td>-0.575264</td>\n",
" <td>1680</td>\n",
" <td>1461.0</td>\n",
" <td>2016-07-30 21:13:31</td>\n",
" <td>2016-07-30 21:19:45</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>211510</th>\n",
" <td>2016-07-30 20:46:05</td>\n",
" <td>1219</td>\n",
" <td>La Tagliatella</td>\n",
" <td>14 Rue Guiraude</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.839388</td>\n",
" <td>-0.574781</td>\n",
" <td>80169</td>\n",
" <td>Rue Pasteur 35</td>\n",
" <td>33200</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.845053</td>\n",
" <td>-0.601157</td>\n",
" <td>4085</td>\n",
" <td>1411.0</td>\n",
" <td>2016-07-30 21:11:00</td>\n",
" <td>2016-07-30 21:23:24</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>211519</th>\n",
" <td>2016-07-30 20:46:55</td>\n",
" <td>1254</td>\n",
" <td>Funky Burger</td>\n",
" <td>5 Rue Du Loup</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.838081</td>\n",
" <td>-0.572281</td>\n",
" <td>80172</td>\n",
" <td>Rue Monadey 28</td>\n",
" <td>33800</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.828816</td>\n",
" <td>-0.570789</td>\n",
" <td>2050</td>\n",
" <td>1817.0</td>\n",
" <td>2016-07-30 21:05:46</td>\n",
" <td>2016-07-30 21:14:07</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>694 rows × 19 columns</p>\n",
"</div>"
],
"text/plain": [
" placed_at restaurant_id restaurant \\\n",
"order_id \n",
"192594 2016-07-18 12:23:13 1204 Max A Table \n",
"192644 2016-07-18 12:48:55 1204 Max A Table \n",
"192658 2016-07-18 13:00:13 1205 Taj Mahal \n",
"193242 2016-07-18 20:39:54 1208 Chez Ambre And Michel \n",
"192719 2016-07-18 13:52:04 1206 La Maison Du Glacier \n",
"... ... ... ... \n",
"212021 2016-07-30 22:29:52 1249 Pitaya Sainte Catherine \n",
"211501 2016-07-30 20:44:50 1204 Max A Table \n",
"211508 2016-07-30 20:45:55 1254 Funky Burger \n",
"211510 2016-07-30 20:46:05 1219 La Tagliatella \n",
"211519 2016-07-30 20:46:55 1254 Funky Burger \n",
"\n",
" o_street o_zip o_city o_latitude \\\n",
"order_id \n",
"192594 36 Rue Cornac 33000 Bordeaux 44.851402 \n",
"192644 36 Rue Cornac 33000 Bordeaux 44.851402 \n",
"192658 24 Rue Du Parlement Sainte-Catherine 33000 Bordeaux 44.840405 \n",
"193242 1 Rue Matignon 33000 Bordeaux 44.850258 \n",
"192719 1 Place Saint Pierre 33000 Bordeaux 44.839706 \n",
"... ... ... ... ... \n",
"212021 275 Rue Sainte Catherine 33000 Bordeaux 44.831692 \n",
"211501 36 Rue Cornac 33000 Bordeaux 44.851402 \n",
"211508 5 Rue Du Loup 33000 Bordeaux 44.838081 \n",
"211510 14 Rue Guiraude 33000 Bordeaux 44.839388 \n",
"211519 5 Rue Du Loup 33000 Bordeaux 44.838081 \n",
"\n",
" o_longitude customer_id d_street \\\n",
"order_id \n",
"192594 -0.575870 10298 Rue Rolland 14 \n",
"192644 -0.575870 6037 Rue Rolland 14 \n",
"192658 -0.573940 73830 Rue Batailley 12 \n",
"193242 -0.586204 10298 Rue Rolland 14 \n",
"192719 -0.570672 6037 Rue Rolland 14 \n",
"... ... ... ... \n",
"212021 -0.573207 80400 Boulevard President Franklin Roosevelt 15 \n",
"211501 -0.575870 80163 Rue Marsan 22 \n",
"211508 -0.572281 80168 Rue Des Sablieres 42 \n",
"211510 -0.574781 80169 Rue Pasteur 35 \n",
"211519 -0.572281 80172 Rue Monadey 28 \n",
"\n",
" d_zip d_city d_latitude d_longitude total courier_id \\\n",
"order_id \n",
"192594 33000 Bordeaux 44.842592 -0.580521 2050 1423.0 \n",
"192644 33000 Bordeaux 44.842592 -0.580521 2450 1426.0 \n",
"192658 33000 Bordeaux 44.838504 -0.591961 2550 1423.0 \n",
"193242 33000 Bordeaux 44.842592 -0.580521 1550 1420.0 \n",
"192719 33000 Bordeaux 44.842592 -0.580521 2450 1426.0 \n",
"... ... ... ... ... ... ... \n",
"212021 33400 Bordeaux 44.820591 -0.582048 2250 1410.0 \n",
"211501 33300 Bordeaux 44.856133 -0.576172 1250 1415.0 \n",
"211508 33800 Bordeaux 44.825488 -0.575264 1680 1461.0 \n",
"211510 33200 Bordeaux 44.845053 -0.601157 4085 1411.0 \n",
"211519 33800 Bordeaux 44.828816 -0.570789 2050 1817.0 \n",
"\n",
" pickup_at delivery_at cancelled \n",
"order_id \n",
"192594 2016-07-18 12:38:08 2016-07-18 12:48:22 0 \n",
"192644 2016-07-18 13:03:08 2016-07-18 13:12:01 0 \n",
"192658 2016-07-18 13:19:04 2016-07-18 13:29:03 0 \n",
"193242 2016-07-18 20:55:52 2016-07-18 21:05:28 0 \n",
"192719 2016-07-18 14:01:23 2016-07-18 14:08:36 0 \n",
"... ... ... ... \n",
"212021 2016-07-30 22:50:16 2016-07-30 23:02:54 0 \n",
"211501 2016-07-30 21:02:32 2016-07-30 21:06:19 0 \n",
"211508 2016-07-30 21:13:31 2016-07-30 21:19:45 0 \n",
"211510 2016-07-30 21:11:00 2016-07-30 21:23:24 0 \n",
"211519 2016-07-30 21:05:46 2016-07-30 21:14:07 0 \n",
"\n",
"[694 rows x 19 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data type behind `df` is called a [pd.DataFrame <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame). `DataFrame`s are built around [numpy <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_np.png\">](https://numpy.org/)'s `ndarray`s providing an interface optimized for **interactive usage** (i.e., a data scientist exploring a dataset step by step)."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.frame.DataFrame"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`DataFrame`s come with many methdods.\n",
"\n",
"For example, [.head() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html#pandas.DataFrame.head) and [.tail() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html#pandas.DataFrame.tail) show the first and last `n` rows, defaulting to `5`."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>placed_at</th>\n",
" <th>restaurant_id</th>\n",
" <th>restaurant</th>\n",
" <th>o_street</th>\n",
" <th>o_zip</th>\n",
" <th>o_city</th>\n",
" <th>o_latitude</th>\n",
" <th>o_longitude</th>\n",
" <th>customer_id</th>\n",
" <th>d_street</th>\n",
" <th>d_zip</th>\n",
" <th>d_city</th>\n",
" <th>d_latitude</th>\n",
" <th>d_longitude</th>\n",
" <th>total</th>\n",
" <th>courier_id</th>\n",
" <th>pickup_at</th>\n",
" <th>delivery_at</th>\n",
" <th>cancelled</th>\n",
" </tr>\n",
" <tr>\n",
" <th>order_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>192594</th>\n",
" <td>2016-07-18 12:23:13</td>\n",
" <td>1204</td>\n",
" <td>Max A Table</td>\n",
" <td>36 Rue Cornac</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.851402</td>\n",
" <td>-0.575870</td>\n",
" <td>10298</td>\n",
" <td>Rue Rolland 14</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.842592</td>\n",
" <td>-0.580521</td>\n",
" <td>2050</td>\n",
" <td>1423.0</td>\n",
" <td>2016-07-18 12:38:08</td>\n",
" <td>2016-07-18 12:48:22</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>192644</th>\n",
" <td>2016-07-18 12:48:55</td>\n",
" <td>1204</td>\n",
" <td>Max A Table</td>\n",
" <td>36 Rue Cornac</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.851402</td>\n",
" <td>-0.575870</td>\n",
" <td>6037</td>\n",
" <td>Rue Rolland 14</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.842592</td>\n",
" <td>-0.580521</td>\n",
" <td>2450</td>\n",
" <td>1426.0</td>\n",
" <td>2016-07-18 13:03:08</td>\n",
" <td>2016-07-18 13:12:01</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>192658</th>\n",
" <td>2016-07-18 13:00:13</td>\n",
" <td>1205</td>\n",
" <td>Taj Mahal</td>\n",
" <td>24 Rue Du Parlement Sainte-Catherine</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.840405</td>\n",
" <td>-0.573940</td>\n",
" <td>73830</td>\n",
" <td>Rue Batailley 12</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.838504</td>\n",
" <td>-0.591961</td>\n",
" <td>2550</td>\n",
" <td>1423.0</td>\n",
" <td>2016-07-18 13:19:04</td>\n",
" <td>2016-07-18 13:29:03</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>193242</th>\n",
" <td>2016-07-18 20:39:54</td>\n",
" <td>1208</td>\n",
" <td>Chez Ambre And Michel</td>\n",
" <td>1 Rue Matignon</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.850258</td>\n",
" <td>-0.586204</td>\n",
" <td>10298</td>\n",
" <td>Rue Rolland 14</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.842592</td>\n",
" <td>-0.580521</td>\n",
" <td>1550</td>\n",
" <td>1420.0</td>\n",
" <td>2016-07-18 20:55:52</td>\n",
" <td>2016-07-18 21:05:28</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>192719</th>\n",
" <td>2016-07-18 13:52:04</td>\n",
" <td>1206</td>\n",
" <td>La Maison Du Glacier</td>\n",
" <td>1 Place Saint Pierre</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.839706</td>\n",
" <td>-0.570672</td>\n",
" <td>6037</td>\n",
" <td>Rue Rolland 14</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.842592</td>\n",
" <td>-0.580521</td>\n",
" <td>2450</td>\n",
" <td>1426.0</td>\n",
" <td>2016-07-18 14:01:23</td>\n",
" <td>2016-07-18 14:08:36</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" placed_at restaurant_id restaurant \\\n",
"order_id \n",
"192594 2016-07-18 12:23:13 1204 Max A Table \n",
"192644 2016-07-18 12:48:55 1204 Max A Table \n",
"192658 2016-07-18 13:00:13 1205 Taj Mahal \n",
"193242 2016-07-18 20:39:54 1208 Chez Ambre And Michel \n",
"192719 2016-07-18 13:52:04 1206 La Maison Du Glacier \n",
"\n",
" o_street o_zip o_city o_latitude \\\n",
"order_id \n",
"192594 36 Rue Cornac 33000 Bordeaux 44.851402 \n",
"192644 36 Rue Cornac 33000 Bordeaux 44.851402 \n",
"192658 24 Rue Du Parlement Sainte-Catherine 33000 Bordeaux 44.840405 \n",
"193242 1 Rue Matignon 33000 Bordeaux 44.850258 \n",
"192719 1 Place Saint Pierre 33000 Bordeaux 44.839706 \n",
"\n",
" o_longitude customer_id d_street d_zip d_city \\\n",
"order_id \n",
"192594 -0.575870 10298 Rue Rolland 14 33000 Bordeaux \n",
"192644 -0.575870 6037 Rue Rolland 14 33000 Bordeaux \n",
"192658 -0.573940 73830 Rue Batailley 12 33000 Bordeaux \n",
"193242 -0.586204 10298 Rue Rolland 14 33000 Bordeaux \n",
"192719 -0.570672 6037 Rue Rolland 14 33000 Bordeaux \n",
"\n",
" d_latitude d_longitude total courier_id pickup_at \\\n",
"order_id \n",
"192594 44.842592 -0.580521 2050 1423.0 2016-07-18 12:38:08 \n",
"192644 44.842592 -0.580521 2450 1426.0 2016-07-18 13:03:08 \n",
"192658 44.838504 -0.591961 2550 1423.0 2016-07-18 13:19:04 \n",
"193242 44.842592 -0.580521 1550 1420.0 2016-07-18 20:55:52 \n",
"192719 44.842592 -0.580521 2450 1426.0 2016-07-18 14:01:23 \n",
"\n",
" delivery_at cancelled \n",
"order_id \n",
"192594 2016-07-18 12:48:22 0 \n",
"192644 2016-07-18 13:12:01 0 \n",
"192658 2016-07-18 13:29:03 0 \n",
"193242 2016-07-18 21:05:28 0 \n",
"192719 2016-07-18 14:08:36 0 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>placed_at</th>\n",
" <th>restaurant_id</th>\n",
" <th>restaurant</th>\n",
" <th>o_street</th>\n",
" <th>o_zip</th>\n",
" <th>o_city</th>\n",
" <th>o_latitude</th>\n",
" <th>o_longitude</th>\n",
" <th>customer_id</th>\n",
" <th>d_street</th>\n",
" <th>d_zip</th>\n",
" <th>d_city</th>\n",
" <th>d_latitude</th>\n",
" <th>d_longitude</th>\n",
" <th>total</th>\n",
" <th>courier_id</th>\n",
" <th>pickup_at</th>\n",
" <th>delivery_at</th>\n",
" <th>cancelled</th>\n",
" </tr>\n",
" <tr>\n",
" <th>order_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>211510</th>\n",
" <td>2016-07-30 20:46:05</td>\n",
" <td>1219</td>\n",
" <td>La Tagliatella</td>\n",
" <td>14 Rue Guiraude</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.839388</td>\n",
" <td>-0.574781</td>\n",
" <td>80169</td>\n",
" <td>Rue Pasteur 35</td>\n",
" <td>33200</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.845053</td>\n",
" <td>-0.601157</td>\n",
" <td>4085</td>\n",
" <td>1411.0</td>\n",
" <td>2016-07-30 21:11:00</td>\n",
" <td>2016-07-30 21:23:24</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>211519</th>\n",
" <td>2016-07-30 20:46:55</td>\n",
" <td>1254</td>\n",
" <td>Funky Burger</td>\n",
" <td>5 Rue Du Loup</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.838081</td>\n",
" <td>-0.572281</td>\n",
" <td>80172</td>\n",
" <td>Rue Monadey 28</td>\n",
" <td>33800</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.828816</td>\n",
" <td>-0.570789</td>\n",
" <td>2050</td>\n",
" <td>1817.0</td>\n",
" <td>2016-07-30 21:05:46</td>\n",
" <td>2016-07-30 21:14:07</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" placed_at restaurant_id restaurant o_street \\\n",
"order_id \n",
"211510 2016-07-30 20:46:05 1219 La Tagliatella 14 Rue Guiraude \n",
"211519 2016-07-30 20:46:55 1254 Funky Burger 5 Rue Du Loup \n",
"\n",
" o_zip o_city o_latitude o_longitude customer_id \\\n",
"order_id \n",
"211510 33000 Bordeaux 44.839388 -0.574781 80169 \n",
"211519 33000 Bordeaux 44.838081 -0.572281 80172 \n",
"\n",
" d_street d_zip d_city d_latitude d_longitude total \\\n",
"order_id \n",
"211510 Rue Pasteur 35 33200 Bordeaux 44.845053 -0.601157 4085 \n",
"211519 Rue Monadey 28 33800 Bordeaux 44.828816 -0.570789 2050 \n",
"\n",
" courier_id pickup_at delivery_at cancelled \n",
"order_id \n",
"211510 1411.0 2016-07-30 21:11:00 2016-07-30 21:23:24 0 \n",
"211519 1817.0 2016-07-30 21:05:46 2016-07-30 21:14:07 0 "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.tail(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[.info() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html#pandas.DataFrame.info) shows on overview of the columns. In particular, it shows how many cells are filled in in a column (i.e., are \"non-null\") and what **data type** (i.e., \"dtype\") *all* values in a column have. \"int64\" and \"float64\" imply that there are only `int` and `float` values in a column (taking up to 64 bits or 1s and 0s in memory). \"object\" is [pandas <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/)' way of telling us it could not deduce any data type more specific than textual data. For the columns holding timestamps (e.g., \"placed_at\") we will convert the values further below.\n",
"\n",
"Looking at the output, we see that some columns hold the data of **origin**-**destination** pairs, corresponding to restaurants and customers. Other columns store data following the dispatch and delivery process of couriers picking up and delivering meals at various points in time."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Index: 694 entries, 192594 to 211519\n",
"Data columns (total 19 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 placed_at 694 non-null object \n",
" 1 restaurant_id 694 non-null int64 \n",
" 2 restaurant 694 non-null object \n",
" 3 o_street 694 non-null object \n",
" 4 o_zip 694 non-null int64 \n",
" 5 o_city 694 non-null object \n",
" 6 o_latitude 694 non-null float64\n",
" 7 o_longitude 694 non-null float64\n",
" 8 customer_id 694 non-null int64 \n",
" 9 d_street 694 non-null object \n",
" 10 d_zip 694 non-null int64 \n",
" 11 d_city 694 non-null object \n",
" 12 d_latitude 694 non-null float64\n",
" 13 d_longitude 694 non-null float64\n",
" 14 total 694 non-null int64 \n",
" 15 courier_id 690 non-null float64\n",
" 16 pickup_at 665 non-null object \n",
" 17 delivery_at 663 non-null object \n",
" 18 cancelled 694 non-null int64 \n",
"dtypes: float64(5), int64(6), object(8)\n",
"memory usage: 108.4+ KB\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[.describe() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html#pandas.DataFrame.describe) shows statistics on all numerical columns in a `DataFrame`.\n",
"\n",
"For the example orders, such statistics may not be meaningful for all numerical columns as some of them merely hold IDs or zip codes."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>restaurant_id</th>\n",
" <th>o_zip</th>\n",
" <th>o_latitude</th>\n",
" <th>o_longitude</th>\n",
" <th>customer_id</th>\n",
" <th>d_zip</th>\n",
" <th>d_latitude</th>\n",
" <th>d_longitude</th>\n",
" <th>total</th>\n",
" <th>courier_id</th>\n",
" <th>cancelled</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>694.000000</td>\n",
" <td>694.000000</td>\n",
" <td>694.000000</td>\n",
" <td>694.000000</td>\n",
" <td>694.000000</td>\n",
" <td>694.000000</td>\n",
" <td>694.000000</td>\n",
" <td>694.000000</td>\n",
" <td>694.000000</td>\n",
" <td>690.000000</td>\n",
" <td>694.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>1228.479827</td>\n",
" <td>33075.216138</td>\n",
" <td>44.839258</td>\n",
" <td>-0.575759</td>\n",
" <td>74751.126801</td>\n",
" <td>33191.613833</td>\n",
" <td>44.838623</td>\n",
" <td>-0.576040</td>\n",
" <td>2294.636888</td>\n",
" <td>1484.755072</td>\n",
" <td>0.044669</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>18.001091</td>\n",
" <td>207.971435</td>\n",
" <td>0.007471</td>\n",
" <td>0.006920</td>\n",
" <td>14604.304963</td>\n",
" <td>307.378697</td>\n",
" <td>0.011545</td>\n",
" <td>0.010799</td>\n",
" <td>1060.695748</td>\n",
" <td>154.586210</td>\n",
" <td>0.206724</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>1204.000000</td>\n",
" <td>33000.000000</td>\n",
" <td>44.818180</td>\n",
" <td>-0.599400</td>\n",
" <td>2377.000000</td>\n",
" <td>33000.000000</td>\n",
" <td>44.809813</td>\n",
" <td>-0.606892</td>\n",
" <td>350.000000</td>\n",
" <td>1403.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>1212.000000</td>\n",
" <td>33000.000000</td>\n",
" <td>44.836910</td>\n",
" <td>-0.579345</td>\n",
" <td>76648.500000</td>\n",
" <td>33000.000000</td>\n",
" <td>44.829981</td>\n",
" <td>-0.581612</td>\n",
" <td>1500.000000</td>\n",
" <td>1415.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>1224.000000</td>\n",
" <td>33000.000000</td>\n",
" <td>44.838287</td>\n",
" <td>-0.573940</td>\n",
" <td>78146.000000</td>\n",
" <td>33000.000000</td>\n",
" <td>44.838364</td>\n",
" <td>-0.575056</td>\n",
" <td>1969.500000</td>\n",
" <td>1424.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>1244.000000</td>\n",
" <td>33000.000000</td>\n",
" <td>44.841721</td>\n",
" <td>-0.572281</td>\n",
" <td>79331.500000</td>\n",
" <td>33300.000000</td>\n",
" <td>44.846696</td>\n",
" <td>-0.569601</td>\n",
" <td>2750.000000</td>\n",
" <td>1462.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>1267.000000</td>\n",
" <td>33800.000000</td>\n",
" <td>44.855438</td>\n",
" <td>-0.550576</td>\n",
" <td>80401.000000</td>\n",
" <td>33800.000000</td>\n",
" <td>44.877693</td>\n",
" <td>-0.537952</td>\n",
" <td>8370.000000</td>\n",
" <td>2013.000000</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" restaurant_id o_zip o_latitude o_longitude customer_id \\\n",
"count 694.000000 694.000000 694.000000 694.000000 694.000000 \n",
"mean 1228.479827 33075.216138 44.839258 -0.575759 74751.126801 \n",
"std 18.001091 207.971435 0.007471 0.006920 14604.304963 \n",
"min 1204.000000 33000.000000 44.818180 -0.599400 2377.000000 \n",
"25% 1212.000000 33000.000000 44.836910 -0.579345 76648.500000 \n",
"50% 1224.000000 33000.000000 44.838287 -0.573940 78146.000000 \n",
"75% 1244.000000 33000.000000 44.841721 -0.572281 79331.500000 \n",
"max 1267.000000 33800.000000 44.855438 -0.550576 80401.000000 \n",
"\n",
" d_zip d_latitude d_longitude total courier_id \\\n",
"count 694.000000 694.000000 694.000000 694.000000 690.000000 \n",
"mean 33191.613833 44.838623 -0.576040 2294.636888 1484.755072 \n",
"std 307.378697 0.011545 0.010799 1060.695748 154.586210 \n",
"min 33000.000000 44.809813 -0.606892 350.000000 1403.000000 \n",
"25% 33000.000000 44.829981 -0.581612 1500.000000 1415.000000 \n",
"50% 33000.000000 44.838364 -0.575056 1969.500000 1424.000000 \n",
"75% 33300.000000 44.846696 -0.569601 2750.000000 1462.000000 \n",
"max 33800.000000 44.877693 -0.537952 8370.000000 2013.000000 \n",
"\n",
" cancelled \n",
"count 694.000000 \n",
"mean 0.044669 \n",
"std 0.206724 \n",
"min 0.000000 \n",
"25% 0.000000 \n",
"50% 0.000000 \n",
"75% 0.000000 \n",
"max 1.000000 "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Indexing & Slicing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`DataFrame`s support being indexed or sliced, both in the row and column dimensions.\n",
"\n",
"To obtain all data in a single column, we index into the `DataFrame` with the column's name.\n",
"\n",
"For example, `restaurant_col` provides a list of only the restaurant names. Its index are still the \"order_id\"s."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"order_id\n",
"192594 Max A Table\n",
"192644 Max A Table\n",
"192658 Taj Mahal\n",
"193242 Chez Ambre And Michel\n",
"192719 La Maison Du Glacier\n",
"Name: restaurant, dtype: object"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"restaurant_col = df[\"restaurant\"]\n",
"\n",
"restaurant_col.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data type of a single column is [pd.Series <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.Series.html#pandas.Series), which is very similar to a `DataFrame` with only one column. `Series` objects work like built-in `list`s with added functionalities."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.series.Series"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(restaurant_col)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we index with a `list` of column names, the result is itself another `DataFrame`. That operation is like slicing out a smaller matrix from a larger one as we saw with `ndarray`s before.\n",
"\n",
"For example, let's pull out all location data of the orders' origins (i.e., restaurants)."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>o_street</th>\n",
" <th>o_zip</th>\n",
" <th>o_city</th>\n",
" <th>o_latitude</th>\n",
" <th>o_longitude</th>\n",
" </tr>\n",
" <tr>\n",
" <th>order_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>192594</th>\n",
" <td>36 Rue Cornac</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.851402</td>\n",
" <td>-0.575870</td>\n",
" </tr>\n",
" <tr>\n",
" <th>192644</th>\n",
" <td>36 Rue Cornac</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.851402</td>\n",
" <td>-0.575870</td>\n",
" </tr>\n",
" <tr>\n",
" <th>192658</th>\n",
" <td>24 Rue Du Parlement Sainte-Catherine</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.840405</td>\n",
" <td>-0.573940</td>\n",
" </tr>\n",
" <tr>\n",
" <th>193242</th>\n",
" <td>1 Rue Matignon</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.850258</td>\n",
" <td>-0.586204</td>\n",
" </tr>\n",
" <tr>\n",
" <th>192719</th>\n",
" <td>1 Place Saint Pierre</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.839706</td>\n",
" <td>-0.570672</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" o_street o_zip o_city o_latitude \\\n",
"order_id \n",
"192594 36 Rue Cornac 33000 Bordeaux 44.851402 \n",
"192644 36 Rue Cornac 33000 Bordeaux 44.851402 \n",
"192658 24 Rue Du Parlement Sainte-Catherine 33000 Bordeaux 44.840405 \n",
"193242 1 Rue Matignon 33000 Bordeaux 44.850258 \n",
"192719 1 Place Saint Pierre 33000 Bordeaux 44.839706 \n",
"\n",
" o_longitude \n",
"order_id \n",
"192594 -0.575870 \n",
"192644 -0.575870 \n",
"192658 -0.573940 \n",
"193242 -0.586204 \n",
"192719 -0.570672 "
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"origins = df[[\"o_street\", \"o_zip\", \"o_city\", \"o_latitude\", \"o_longitude\"]]\n",
"\n",
"origins.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To access individual rows, we index not into a `DataFrame` directly but into its [.loc <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc) object (which also exists for `Series`).\n",
"\n",
"Here, `200800` is an \"order_id\" number. The result is a `Series` object where the original `DataFrame`'s columns become the index."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"placed_at 2016-07-24 19:30:52\n",
"restaurant_id 1204\n",
"restaurant Max A Table\n",
"o_street 36 Rue Cornac\n",
"o_zip 33000\n",
"o_city Bordeaux\n",
"o_latitude 44.851402\n",
"o_longitude -0.57587\n",
"customer_id 76187\n",
"d_street Rue Judaique 213\n",
"d_zip 33000\n",
"d_city Bordeaux\n",
"d_latitude 44.840829\n",
"d_longitude -0.595445\n",
"total 2250\n",
"courier_id 1468.0\n",
"pickup_at 2016-07-24 19:50:52\n",
"delivery_at 2016-07-24 19:58:16\n",
"cancelled 0\n",
"Name: 200800, dtype: object"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[200800]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also index into the `restaurant_col` and `origins` objects from above. As `restaurant_col` is a `Series`, we get back a scalar value."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Max A Table'"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"restaurant_col.loc[200800]"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"o_street 36 Rue Cornac\n",
"o_zip 33000\n",
"o_city Bordeaux\n",
"o_latitude 44.851402\n",
"o_longitude -0.57587\n",
"Name: 200800, dtype: object"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"origins.loc[200800]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Slicing also works with [.loc <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc). A tiny difference to Python's built-in slicing, the upper bound is included in the slice as well!"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>placed_at</th>\n",
" <th>restaurant_id</th>\n",
" <th>restaurant</th>\n",
" <th>o_street</th>\n",
" <th>o_zip</th>\n",
" <th>o_city</th>\n",
" <th>o_latitude</th>\n",
" <th>o_longitude</th>\n",
" <th>customer_id</th>\n",
" <th>d_street</th>\n",
" <th>d_zip</th>\n",
" <th>d_city</th>\n",
" <th>d_latitude</th>\n",
" <th>d_longitude</th>\n",
" <th>total</th>\n",
" <th>courier_id</th>\n",
" <th>pickup_at</th>\n",
" <th>delivery_at</th>\n",
" <th>cancelled</th>\n",
" </tr>\n",
" <tr>\n",
" <th>order_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>200300</th>\n",
" <td>2016-07-24 13:46:15</td>\n",
" <td>1207</td>\n",
" <td>Le Jardin Pekinois</td>\n",
" <td>9 Rue Des Freres Bonie</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.837078</td>\n",
" <td>-0.579572</td>\n",
" <td>76030</td>\n",
" <td>Rue Villeneuve 1</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.839927</td>\n",
" <td>-0.580012</td>\n",
" <td>3820</td>\n",
" <td>1426.0</td>\n",
" <td>2016-07-24 14:12:45</td>\n",
" <td>2016-07-24 14:16:59</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>200305</th>\n",
" <td>2016-07-24 13:49:25</td>\n",
" <td>1207</td>\n",
" <td>Le Jardin Pekinois</td>\n",
" <td>9 Rue Des Freres Bonie</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.837078</td>\n",
" <td>-0.579572</td>\n",
" <td>76033</td>\n",
" <td>Rue Du Ha 54</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.835898</td>\n",
" <td>-0.577941</td>\n",
" <td>1689</td>\n",
" <td>1405.0</td>\n",
" <td>2016-07-24 14:12:04</td>\n",
" <td>2016-07-24 14:15:54</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>200800</th>\n",
" <td>2016-07-24 19:30:52</td>\n",
" <td>1204</td>\n",
" <td>Max A Table</td>\n",
" <td>36 Rue Cornac</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.851402</td>\n",
" <td>-0.575870</td>\n",
" <td>76187</td>\n",
" <td>Rue Judaique 213</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.840829</td>\n",
" <td>-0.595445</td>\n",
" <td>2250</td>\n",
" <td>1468.0</td>\n",
" <td>2016-07-24 19:50:52</td>\n",
" <td>2016-07-24 19:58:16</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" placed_at restaurant_id restaurant \\\n",
"order_id \n",
"200300 2016-07-24 13:46:15 1207 Le Jardin Pekinois \n",
"200305 2016-07-24 13:49:25 1207 Le Jardin Pekinois \n",
"200800 2016-07-24 19:30:52 1204 Max A Table \n",
"\n",
" o_street o_zip o_city o_latitude o_longitude \\\n",
"order_id \n",
"200300 9 Rue Des Freres Bonie 33000 Bordeaux 44.837078 -0.579572 \n",
"200305 9 Rue Des Freres Bonie 33000 Bordeaux 44.837078 -0.579572 \n",
"200800 36 Rue Cornac 33000 Bordeaux 44.851402 -0.575870 \n",
"\n",
" customer_id d_street d_zip d_city d_latitude \\\n",
"order_id \n",
"200300 76030 Rue Villeneuve 1 33000 Bordeaux 44.839927 \n",
"200305 76033 Rue Du Ha 54 33000 Bordeaux 44.835898 \n",
"200800 76187 Rue Judaique 213 33000 Bordeaux 44.840829 \n",
"\n",
" d_longitude total courier_id pickup_at \\\n",
"order_id \n",
"200300 -0.580012 3820 1426.0 2016-07-24 14:12:45 \n",
"200305 -0.577941 1689 1405.0 2016-07-24 14:12:04 \n",
"200800 -0.595445 2250 1468.0 2016-07-24 19:50:52 \n",
"\n",
" delivery_at cancelled \n",
"order_id \n",
"200300 2016-07-24 14:16:59 0 \n",
"200305 2016-07-24 14:15:54 0 \n",
"200800 2016-07-24 19:58:16 0 "
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[200300:200800]"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"order_id\n",
"200300 Le Jardin Pekinois\n",
"200305 Le Jardin Pekinois\n",
"200800 Max A Table\n",
"Name: restaurant, dtype: object"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"restaurant_col.loc[200300:200800]"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>o_street</th>\n",
" <th>o_zip</th>\n",
" <th>o_city</th>\n",
" <th>o_latitude</th>\n",
" <th>o_longitude</th>\n",
" </tr>\n",
" <tr>\n",
" <th>order_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>200300</th>\n",
" <td>9 Rue Des Freres Bonie</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.837078</td>\n",
" <td>-0.579572</td>\n",
" </tr>\n",
" <tr>\n",
" <th>200305</th>\n",
" <td>9 Rue Des Freres Bonie</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.837078</td>\n",
" <td>-0.579572</td>\n",
" </tr>\n",
" <tr>\n",
" <th>200800</th>\n",
" <td>36 Rue Cornac</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.851402</td>\n",
" <td>-0.575870</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" o_street o_zip o_city o_latitude o_longitude\n",
"order_id \n",
"200300 9 Rue Des Freres Bonie 33000 Bordeaux 44.837078 -0.579572\n",
"200305 9 Rue Des Freres Bonie 33000 Bordeaux 44.837078 -0.579572\n",
"200800 36 Rue Cornac 33000 Bordeaux 44.851402 -0.575870"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"origins.loc[200300:200800]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[.loc <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc) also allows us to index and slice in both dimensions simultaneously. The first index or slice goes along the row dimension while the second index or slice selects the columns."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>o_street</th>\n",
" <th>o_zip</th>\n",
" <th>o_city</th>\n",
" <th>o_latitude</th>\n",
" <th>o_longitude</th>\n",
" </tr>\n",
" <tr>\n",
" <th>order_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>200300</th>\n",
" <td>9 Rue Des Freres Bonie</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.837078</td>\n",
" <td>-0.579572</td>\n",
" </tr>\n",
" <tr>\n",
" <th>200305</th>\n",
" <td>9 Rue Des Freres Bonie</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.837078</td>\n",
" <td>-0.579572</td>\n",
" </tr>\n",
" <tr>\n",
" <th>200800</th>\n",
" <td>36 Rue Cornac</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.851402</td>\n",
" <td>-0.575870</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" o_street o_zip o_city o_latitude o_longitude\n",
"order_id \n",
"200300 9 Rue Des Freres Bonie 33000 Bordeaux 44.837078 -0.579572\n",
"200305 9 Rue Des Freres Bonie 33000 Bordeaux 44.837078 -0.579572\n",
"200800 36 Rue Cornac 33000 Bordeaux 44.851402 -0.575870"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[200300:200800, [\"o_street\", \"o_zip\", \"o_city\", \"o_latitude\", \"o_longitude\"]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Type Casting"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As [.info() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html#pandas.DataFrame.info) already revealed above, the timestamp columns could only be parsed as generic objects (i.e., textual data). Also, the \"cancelled\" column which holds only `True` or `False` values does not have a `bool` data type."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Index: 694 entries, 192594 to 211519\n",
"Data columns (total 19 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 placed_at 694 non-null object \n",
" 1 restaurant_id 694 non-null int64 \n",
" 2 restaurant 694 non-null object \n",
" 3 o_street 694 non-null object \n",
" 4 o_zip 694 non-null int64 \n",
" 5 o_city 694 non-null object \n",
" 6 o_latitude 694 non-null float64\n",
" 7 o_longitude 694 non-null float64\n",
" 8 customer_id 694 non-null int64 \n",
" 9 d_street 694 non-null object \n",
" 10 d_zip 694 non-null int64 \n",
" 11 d_city 694 non-null object \n",
" 12 d_latitude 694 non-null float64\n",
" 13 d_longitude 694 non-null float64\n",
" 14 total 694 non-null int64 \n",
" 15 courier_id 690 non-null float64\n",
" 16 pickup_at 665 non-null object \n",
" 17 delivery_at 663 non-null object \n",
" 18 cancelled 694 non-null int64 \n",
"dtypes: float64(5), int64(6), object(8)\n",
"memory usage: 124.6+ KB\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The [pd.to_datetime() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html#pandas.to_datetime) function **casts** the timestamp columns correctly."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"order_id\n",
"192594 2016-07-18 12:23:13\n",
"192644 2016-07-18 12:48:55\n",
"192658 2016-07-18 13:00:13\n",
"193242 2016-07-18 20:39:54\n",
"192719 2016-07-18 13:52:04\n",
" ... \n",
"212021 2016-07-30 22:29:52\n",
"211501 2016-07-30 20:44:50\n",
"211508 2016-07-30 20:45:55\n",
"211510 2016-07-30 20:46:05\n",
"211519 2016-07-30 20:46:55\n",
"Name: placed_at, Length: 694, dtype: datetime64[ns]"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.to_datetime(df[\"placed_at\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's overwrite the original \"placed_at\" column with one that has the correct data type."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"df[\"placed_at\"] = pd.to_datetime(df[\"placed_at\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The [.astype() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html#pandas.DataFrame.astype) method generalizes this idea and allows us to cast several columns in a `DataFrame`. It takes a `dict`ionary mapping column names to data types as its input. Instead of references to actual data types (e.g., `bool`), it also understands [pandas <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/)-specific data types provides as text."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"df = df.astype(\n",
" {\n",
" \"pickup_at\": \"datetime64[ns]\",\n",
" \"delivery_at\": \"datetime64[ns]\",\n",
" \"cancelled\": bool,\n",
" }\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, all columns in `df` have more applicable data types."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Index: 694 entries, 192594 to 211519\n",
"Data columns (total 19 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 placed_at 694 non-null datetime64[ns]\n",
" 1 restaurant_id 694 non-null int64 \n",
" 2 restaurant 694 non-null object \n",
" 3 o_street 694 non-null object \n",
" 4 o_zip 694 non-null int64 \n",
" 5 o_city 694 non-null object \n",
" 6 o_latitude 694 non-null float64 \n",
" 7 o_longitude 694 non-null float64 \n",
" 8 customer_id 694 non-null int64 \n",
" 9 d_street 694 non-null object \n",
" 10 d_zip 694 non-null int64 \n",
" 11 d_city 694 non-null object \n",
" 12 d_latitude 694 non-null float64 \n",
" 13 d_longitude 694 non-null float64 \n",
" 14 total 694 non-null int64 \n",
" 15 courier_id 690 non-null float64 \n",
" 16 pickup_at 665 non-null datetime64[ns]\n",
" 17 delivery_at 663 non-null datetime64[ns]\n",
" 18 cancelled 694 non-null bool \n",
"dtypes: bool(1), datetime64[ns](3), float64(5), int64(5), object(5)\n",
"memory usage: 119.9+ KB\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Filtering"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A common operation when working with `DataFrame`s is to filter for rows fulfilling certain conditions. That is implemented by so-called **boolean filters** in [pandas <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/), which is simply a `DataFrame` or `Series` holding only `True` or `False` values.\n",
"\n",
"One way to obtain such objects is to use relational operators with columns.\n",
"\n",
"`max_a_table` holds `True` values for all orders at the restaurant with the ID `1204`."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"order_id\n",
"192594 True\n",
"192644 True\n",
"192658 False\n",
"193242 False\n",
"192719 False\n",
" ... \n",
"212021 False\n",
"211501 True\n",
"211508 False\n",
"211510 False\n",
"211519 False\n",
"Name: restaurant_id, Length: 694, dtype: bool"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"max_a_table = df[\"restaurant_id\"] == 1204\n",
"\n",
"max_a_table"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, let's use a boolean filter to index into `df`. That gives us back a new `DataFame` with all orders belonging to the restaurant \"Max A Table\"."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>placed_at</th>\n",
" <th>restaurant_id</th>\n",
" <th>restaurant</th>\n",
" <th>o_street</th>\n",
" <th>o_zip</th>\n",
" <th>o_city</th>\n",
" <th>o_latitude</th>\n",
" <th>o_longitude</th>\n",
" <th>customer_id</th>\n",
" <th>d_street</th>\n",
" <th>d_zip</th>\n",
" <th>d_city</th>\n",
" <th>d_latitude</th>\n",
" <th>d_longitude</th>\n",
" <th>total</th>\n",
" <th>courier_id</th>\n",
" <th>pickup_at</th>\n",
" <th>delivery_at</th>\n",
" <th>cancelled</th>\n",
" </tr>\n",
" <tr>\n",
" <th>order_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>192594</th>\n",
" <td>2016-07-18 12:23:13</td>\n",
" <td>1204</td>\n",
" <td>Max A Table</td>\n",
" <td>36 Rue Cornac</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.851402</td>\n",
" <td>-0.57587</td>\n",
" <td>10298</td>\n",
" <td>Rue Rolland 14</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.842592</td>\n",
" <td>-0.580521</td>\n",
" <td>2050</td>\n",
" <td>1423.0</td>\n",
" <td>2016-07-18 12:38:08</td>\n",
" <td>2016-07-18 12:48:22</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>192644</th>\n",
" <td>2016-07-18 12:48:55</td>\n",
" <td>1204</td>\n",
" <td>Max A Table</td>\n",
" <td>36 Rue Cornac</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.851402</td>\n",
" <td>-0.57587</td>\n",
" <td>6037</td>\n",
" <td>Rue Rolland 14</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.842592</td>\n",
" <td>-0.580521</td>\n",
" <td>2450</td>\n",
" <td>1426.0</td>\n",
" <td>2016-07-18 13:03:08</td>\n",
" <td>2016-07-18 13:12:01</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>194335</th>\n",
" <td>2016-07-19 20:35:21</td>\n",
" <td>1204</td>\n",
" <td>Max A Table</td>\n",
" <td>36 Rue Cornac</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.851402</td>\n",
" <td>-0.57587</td>\n",
" <td>74268</td>\n",
" <td>Place Canteloup 12</td>\n",
" <td>33800</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.833834</td>\n",
" <td>-0.565674</td>\n",
" <td>3100</td>\n",
" <td>1420.0</td>\n",
" <td>2016-07-19 20:51:16</td>\n",
" <td>2016-07-19 21:01:08</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>196615</th>\n",
" <td>2016-07-21 19:50:15</td>\n",
" <td>1204</td>\n",
" <td>Max A Table</td>\n",
" <td>36 Rue Cornac</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.851402</td>\n",
" <td>-0.57587</td>\n",
" <td>74901</td>\n",
" <td>Rue Marcelin Jourdan 55</td>\n",
" <td>33200</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.850360</td>\n",
" <td>-0.597361</td>\n",
" <td>2050</td>\n",
" <td>1418.0</td>\n",
" <td>2016-07-21 20:12:29</td>\n",
" <td>2016-07-21 20:25:57</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>196839</th>\n",
" <td>2016-07-21 20:27:22</td>\n",
" <td>1204</td>\n",
" <td>Max A Table</td>\n",
" <td>36 Rue Cornac</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.851402</td>\n",
" <td>-0.57587</td>\n",
" <td>74966</td>\n",
" <td>Rue Sainte-Catherine 137</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.836516</td>\n",
" <td>-0.573983</td>\n",
" <td>3750</td>\n",
" <td>1472.0</td>\n",
" <td>2016-07-21 20:41:42</td>\n",
" <td>2016-07-21 21:14:41</td>\n",
" <td>False</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" placed_at restaurant_id restaurant o_street \\\n",
"order_id \n",
"192594 2016-07-18 12:23:13 1204 Max A Table 36 Rue Cornac \n",
"192644 2016-07-18 12:48:55 1204 Max A Table 36 Rue Cornac \n",
"194335 2016-07-19 20:35:21 1204 Max A Table 36 Rue Cornac \n",
"196615 2016-07-21 19:50:15 1204 Max A Table 36 Rue Cornac \n",
"196839 2016-07-21 20:27:22 1204 Max A Table 36 Rue Cornac \n",
"\n",
" o_zip o_city o_latitude o_longitude customer_id \\\n",
"order_id \n",
"192594 33000 Bordeaux 44.851402 -0.57587 10298 \n",
"192644 33000 Bordeaux 44.851402 -0.57587 6037 \n",
"194335 33000 Bordeaux 44.851402 -0.57587 74268 \n",
"196615 33000 Bordeaux 44.851402 -0.57587 74901 \n",
"196839 33000 Bordeaux 44.851402 -0.57587 74966 \n",
"\n",
" d_street d_zip d_city d_latitude d_longitude \\\n",
"order_id \n",
"192594 Rue Rolland 14 33000 Bordeaux 44.842592 -0.580521 \n",
"192644 Rue Rolland 14 33000 Bordeaux 44.842592 -0.580521 \n",
"194335 Place Canteloup 12 33800 Bordeaux 44.833834 -0.565674 \n",
"196615 Rue Marcelin Jourdan 55 33200 Bordeaux 44.850360 -0.597361 \n",
"196839 Rue Sainte-Catherine 137 33000 Bordeaux 44.836516 -0.573983 \n",
"\n",
" total courier_id pickup_at delivery_at cancelled \n",
"order_id \n",
"192594 2050 1423.0 2016-07-18 12:38:08 2016-07-18 12:48:22 False \n",
"192644 2450 1426.0 2016-07-18 13:03:08 2016-07-18 13:12:01 False \n",
"194335 3100 1420.0 2016-07-19 20:51:16 2016-07-19 21:01:08 False \n",
"196615 2050 1418.0 2016-07-21 20:12:29 2016-07-21 20:25:57 False \n",
"196839 3750 1472.0 2016-07-21 20:41:42 2016-07-21 21:14:41 False "
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[df[\"restaurant_id\"] == 1204].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Instead of an explicit condition, we can also use a reference to a boolean filter created above."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>placed_at</th>\n",
" <th>restaurant_id</th>\n",
" <th>restaurant</th>\n",
" <th>o_street</th>\n",
" <th>o_zip</th>\n",
" <th>o_city</th>\n",
" <th>o_latitude</th>\n",
" <th>o_longitude</th>\n",
" <th>customer_id</th>\n",
" <th>d_street</th>\n",
" <th>d_zip</th>\n",
" <th>d_city</th>\n",
" <th>d_latitude</th>\n",
" <th>d_longitude</th>\n",
" <th>total</th>\n",
" <th>courier_id</th>\n",
" <th>pickup_at</th>\n",
" <th>delivery_at</th>\n",
" <th>cancelled</th>\n",
" </tr>\n",
" <tr>\n",
" <th>order_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>192594</th>\n",
" <td>2016-07-18 12:23:13</td>\n",
" <td>1204</td>\n",
" <td>Max A Table</td>\n",
" <td>36 Rue Cornac</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.851402</td>\n",
" <td>-0.57587</td>\n",
" <td>10298</td>\n",
" <td>Rue Rolland 14</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.842592</td>\n",
" <td>-0.580521</td>\n",
" <td>2050</td>\n",
" <td>1423.0</td>\n",
" <td>2016-07-18 12:38:08</td>\n",
" <td>2016-07-18 12:48:22</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>192644</th>\n",
" <td>2016-07-18 12:48:55</td>\n",
" <td>1204</td>\n",
" <td>Max A Table</td>\n",
" <td>36 Rue Cornac</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.851402</td>\n",
" <td>-0.57587</td>\n",
" <td>6037</td>\n",
" <td>Rue Rolland 14</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.842592</td>\n",
" <td>-0.580521</td>\n",
" <td>2450</td>\n",
" <td>1426.0</td>\n",
" <td>2016-07-18 13:03:08</td>\n",
" <td>2016-07-18 13:12:01</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>194335</th>\n",
" <td>2016-07-19 20:35:21</td>\n",
" <td>1204</td>\n",
" <td>Max A Table</td>\n",
" <td>36 Rue Cornac</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.851402</td>\n",
" <td>-0.57587</td>\n",
" <td>74268</td>\n",
" <td>Place Canteloup 12</td>\n",
" <td>33800</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.833834</td>\n",
" <td>-0.565674</td>\n",
" <td>3100</td>\n",
" <td>1420.0</td>\n",
" <td>2016-07-19 20:51:16</td>\n",
" <td>2016-07-19 21:01:08</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>196615</th>\n",
" <td>2016-07-21 19:50:15</td>\n",
" <td>1204</td>\n",
" <td>Max A Table</td>\n",
" <td>36 Rue Cornac</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.851402</td>\n",
" <td>-0.57587</td>\n",
" <td>74901</td>\n",
" <td>Rue Marcelin Jourdan 55</td>\n",
" <td>33200</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.850360</td>\n",
" <td>-0.597361</td>\n",
" <td>2050</td>\n",
" <td>1418.0</td>\n",
" <td>2016-07-21 20:12:29</td>\n",
" <td>2016-07-21 20:25:57</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>196839</th>\n",
" <td>2016-07-21 20:27:22</td>\n",
" <td>1204</td>\n",
" <td>Max A Table</td>\n",
" <td>36 Rue Cornac</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.851402</td>\n",
" <td>-0.57587</td>\n",
" <td>74966</td>\n",
" <td>Rue Sainte-Catherine 137</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.836516</td>\n",
" <td>-0.573983</td>\n",
" <td>3750</td>\n",
" <td>1472.0</td>\n",
" <td>2016-07-21 20:41:42</td>\n",
" <td>2016-07-21 21:14:41</td>\n",
" <td>False</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" placed_at restaurant_id restaurant o_street \\\n",
"order_id \n",
"192594 2016-07-18 12:23:13 1204 Max A Table 36 Rue Cornac \n",
"192644 2016-07-18 12:48:55 1204 Max A Table 36 Rue Cornac \n",
"194335 2016-07-19 20:35:21 1204 Max A Table 36 Rue Cornac \n",
"196615 2016-07-21 19:50:15 1204 Max A Table 36 Rue Cornac \n",
"196839 2016-07-21 20:27:22 1204 Max A Table 36 Rue Cornac \n",
"\n",
" o_zip o_city o_latitude o_longitude customer_id \\\n",
"order_id \n",
"192594 33000 Bordeaux 44.851402 -0.57587 10298 \n",
"192644 33000 Bordeaux 44.851402 -0.57587 6037 \n",
"194335 33000 Bordeaux 44.851402 -0.57587 74268 \n",
"196615 33000 Bordeaux 44.851402 -0.57587 74901 \n",
"196839 33000 Bordeaux 44.851402 -0.57587 74966 \n",
"\n",
" d_street d_zip d_city d_latitude d_longitude \\\n",
"order_id \n",
"192594 Rue Rolland 14 33000 Bordeaux 44.842592 -0.580521 \n",
"192644 Rue Rolland 14 33000 Bordeaux 44.842592 -0.580521 \n",
"194335 Place Canteloup 12 33800 Bordeaux 44.833834 -0.565674 \n",
"196615 Rue Marcelin Jourdan 55 33200 Bordeaux 44.850360 -0.597361 \n",
"196839 Rue Sainte-Catherine 137 33000 Bordeaux 44.836516 -0.573983 \n",
"\n",
" total courier_id pickup_at delivery_at cancelled \n",
"order_id \n",
"192594 2050 1423.0 2016-07-18 12:38:08 2016-07-18 12:48:22 False \n",
"192644 2450 1426.0 2016-07-18 13:03:08 2016-07-18 13:12:01 False \n",
"194335 3100 1420.0 2016-07-19 20:51:16 2016-07-19 21:01:08 False \n",
"196615 2050 1418.0 2016-07-21 20:12:29 2016-07-21 20:25:57 False \n",
"196839 3750 1472.0 2016-07-21 20:41:42 2016-07-21 21:14:41 False "
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[max_a_table].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Combining the filter with a `list` of columns allows us to further narrow down the `DataFrame`.\n",
"\n",
"For example, the preview below shows us the first five customers \"Max A Table\" had in the target period."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>customer_id</th>\n",
" <th>d_street</th>\n",
" <th>d_zip</th>\n",
" <th>d_city</th>\n",
" <th>d_latitude</th>\n",
" <th>d_longitude</th>\n",
" </tr>\n",
" <tr>\n",
" <th>order_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>192594</th>\n",
" <td>10298</td>\n",
" <td>Rue Rolland 14</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.842592</td>\n",
" <td>-0.580521</td>\n",
" </tr>\n",
" <tr>\n",
" <th>192644</th>\n",
" <td>6037</td>\n",
" <td>Rue Rolland 14</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.842592</td>\n",
" <td>-0.580521</td>\n",
" </tr>\n",
" <tr>\n",
" <th>194335</th>\n",
" <td>74268</td>\n",
" <td>Place Canteloup 12</td>\n",
" <td>33800</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.833834</td>\n",
" <td>-0.565674</td>\n",
" </tr>\n",
" <tr>\n",
" <th>196615</th>\n",
" <td>74901</td>\n",
" <td>Rue Marcelin Jourdan 55</td>\n",
" <td>33200</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.850360</td>\n",
" <td>-0.597361</td>\n",
" </tr>\n",
" <tr>\n",
" <th>196839</th>\n",
" <td>74966</td>\n",
" <td>Rue Sainte-Catherine 137</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.836516</td>\n",
" <td>-0.573983</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" customer_id d_street d_zip d_city d_latitude \\\n",
"order_id \n",
"192594 10298 Rue Rolland 14 33000 Bordeaux 44.842592 \n",
"192644 6037 Rue Rolland 14 33000 Bordeaux 44.842592 \n",
"194335 74268 Place Canteloup 12 33800 Bordeaux 44.833834 \n",
"196615 74901 Rue Marcelin Jourdan 55 33200 Bordeaux 44.850360 \n",
"196839 74966 Rue Sainte-Catherine 137 33000 Bordeaux 44.836516 \n",
"\n",
" d_longitude \n",
"order_id \n",
"192594 -0.580521 \n",
"192644 -0.580521 \n",
"194335 -0.565674 \n",
"196615 -0.597361 \n",
"196839 -0.573983 "
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[\n",
" max_a_table,\n",
" [\"customer_id\", \"d_street\", \"d_zip\", \"d_city\", \"d_latitude\", \"d_longitude\"],\n",
"].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Boolean filters can be created in an arbitray fashion by combining several conditions with `&` and `|` modeling logical AND and OR operators.\n",
"\n",
"The example lists the first five customers of \"Max A Table\" in a target area provided as latitude-longitude coordinates."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>customer_id</th>\n",
" <th>d_street</th>\n",
" <th>d_zip</th>\n",
" <th>d_city</th>\n",
" <th>d_latitude</th>\n",
" <th>d_longitude</th>\n",
" </tr>\n",
" <tr>\n",
" <th>order_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>196615</th>\n",
" <td>74901</td>\n",
" <td>Rue Marcelin Jourdan 55</td>\n",
" <td>33200</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.850360</td>\n",
" <td>-0.597361</td>\n",
" </tr>\n",
" <tr>\n",
" <th>200800</th>\n",
" <td>76187</td>\n",
" <td>Rue Judaique 213</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.840829</td>\n",
" <td>-0.595445</td>\n",
" </tr>\n",
" <tr>\n",
" <th>200893</th>\n",
" <td>76218</td>\n",
" <td>Rue Notre Dame 21</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.850260</td>\n",
" <td>-0.572377</td>\n",
" </tr>\n",
" <tr>\n",
" <th>202788</th>\n",
" <td>76786</td>\n",
" <td>Rue De Leybardie 27</td>\n",
" <td>33300</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.861360</td>\n",
" <td>-0.565057</td>\n",
" </tr>\n",
" <tr>\n",
" <th>202563</th>\n",
" <td>76730</td>\n",
" <td>Rue Lombard 47</td>\n",
" <td>33300</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.858661</td>\n",
" <td>-0.563095</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" customer_id d_street d_zip d_city d_latitude \\\n",
"order_id \n",
"196615 74901 Rue Marcelin Jourdan 55 33200 Bordeaux 44.850360 \n",
"200800 76187 Rue Judaique 213 33000 Bordeaux 44.840829 \n",
"200893 76218 Rue Notre Dame 21 33000 Bordeaux 44.850260 \n",
"202788 76786 Rue De Leybardie 27 33300 Bordeaux 44.861360 \n",
"202563 76730 Rue Lombard 47 33300 Bordeaux 44.858661 \n",
"\n",
" d_longitude \n",
"order_id \n",
"196615 -0.597361 \n",
"200800 -0.595445 \n",
"200893 -0.572377 \n",
"202788 -0.565057 \n",
"202563 -0.563095 "
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[\n",
" (\n",
" max_a_table\n",
" &\n",
" (\n",
" (df[\"d_latitude\"] > 44.85) | (df[\"d_longitude\"] < -0.59)\n",
" )\n",
" ),\n",
" [\"customer_id\", \"d_street\", \"d_zip\", \"d_city\", \"d_latitude\", \"d_longitude\"],\n",
"].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[.isin() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html#pandas.DataFrame.isin) is another useful method: It allows us to provide a `list` of values that we are filtering for in a column."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>placed_at</th>\n",
" <th>customer_id</th>\n",
" <th>d_street</th>\n",
" <th>d_zip</th>\n",
" <th>d_city</th>\n",
" <th>total</th>\n",
" </tr>\n",
" <tr>\n",
" <th>order_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>192644</th>\n",
" <td>2016-07-18 12:48:55</td>\n",
" <td>6037</td>\n",
" <td>Rue Rolland 14</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>2450</td>\n",
" </tr>\n",
" <tr>\n",
" <th>210945</th>\n",
" <td>2016-07-30 19:30:39</td>\n",
" <td>79900</td>\n",
" <td>Rue Du Couvent 16</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>1650</td>\n",
" </tr>\n",
" <tr>\n",
" <th>211363</th>\n",
" <td>2016-07-30 20:27:45</td>\n",
" <td>80095</td>\n",
" <td>Rue De La Porte Saint-Jean 8</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>2400</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" placed_at customer_id d_street \\\n",
"order_id \n",
"192644 2016-07-18 12:48:55 6037 Rue Rolland 14 \n",
"210945 2016-07-30 19:30:39 79900 Rue Du Couvent 16 \n",
"211363 2016-07-30 20:27:45 80095 Rue De La Porte Saint-Jean 8 \n",
"\n",
" d_zip d_city total \n",
"order_id \n",
"192644 33000 Bordeaux 2450 \n",
"210945 33000 Bordeaux 1650 \n",
"211363 33000 Bordeaux 2400 "
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[\n",
" (max_a_table & df[\"customer_id\"].isin([6037, 79900, 80095])),\n",
" [\"placed_at\", \"customer_id\", \"d_street\", \"d_zip\", \"d_city\", \"total\"],\n",
"].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `~` operator negates a condition. So, in the cell below we see all orders at \"Max A Table\" except the ones from the indicated customers."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>placed_at</th>\n",
" <th>customer_id</th>\n",
" <th>d_street</th>\n",
" <th>d_zip</th>\n",
" <th>d_city</th>\n",
" <th>total</th>\n",
" </tr>\n",
" <tr>\n",
" <th>order_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>192594</th>\n",
" <td>2016-07-18 12:23:13</td>\n",
" <td>10298</td>\n",
" <td>Rue Rolland 14</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>2050</td>\n",
" </tr>\n",
" <tr>\n",
" <th>194335</th>\n",
" <td>2016-07-19 20:35:21</td>\n",
" <td>74268</td>\n",
" <td>Place Canteloup 12</td>\n",
" <td>33800</td>\n",
" <td>Bordeaux</td>\n",
" <td>3100</td>\n",
" </tr>\n",
" <tr>\n",
" <th>196615</th>\n",
" <td>2016-07-21 19:50:15</td>\n",
" <td>74901</td>\n",
" <td>Rue Marcelin Jourdan 55</td>\n",
" <td>33200</td>\n",
" <td>Bordeaux</td>\n",
" <td>2050</td>\n",
" </tr>\n",
" <tr>\n",
" <th>196839</th>\n",
" <td>2016-07-21 20:27:22</td>\n",
" <td>74966</td>\n",
" <td>Rue Sainte-Catherine 137</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>3750</td>\n",
" </tr>\n",
" <tr>\n",
" <th>198631</th>\n",
" <td>2016-07-22 21:29:40</td>\n",
" <td>75047</td>\n",
" <td>Rue Boudet 29</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>2650</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" placed_at customer_id d_street d_zip \\\n",
"order_id \n",
"192594 2016-07-18 12:23:13 10298 Rue Rolland 14 33000 \n",
"194335 2016-07-19 20:35:21 74268 Place Canteloup 12 33800 \n",
"196615 2016-07-21 19:50:15 74901 Rue Marcelin Jourdan 55 33200 \n",
"196839 2016-07-21 20:27:22 74966 Rue Sainte-Catherine 137 33000 \n",
"198631 2016-07-22 21:29:40 75047 Rue Boudet 29 33000 \n",
"\n",
" d_city total \n",
"order_id \n",
"192594 Bordeaux 2050 \n",
"194335 Bordeaux 3100 \n",
"196615 Bordeaux 2050 \n",
"196839 Bordeaux 3750 \n",
"198631 Bordeaux 2650 "
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[\n",
" (max_a_table & ~df[\"customer_id\"].isin([6037, 79900, 80095])),\n",
" [\"placed_at\", \"customer_id\", \"d_street\", \"d_zip\", \"d_city\", \"total\"],\n",
"].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## DataFrame Methods"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have learned the basics of selecting the data we want from a `DataFrame`, let's look at a couple of methods that allow us to obtain some infos out of a `DataFrame`, in particular, to run some **descriptive statistics**.\n",
"\n",
"[.unique() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html#pandas.Series.unique) is a simple `Series` method returning an `ndarray` with all values that are in the `Series` once.\n",
"\n",
"Here, we get an overview of how many restaurants there are in Bordeaux in the target time horizon."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1204, 1205, 1208, 1206, 1209, 1207, 1211, 1213, 1214, 1212, 1216,\n",
" 1215, 1217, 1218, 1219, 1220, 1221, 1223, 1222, 1224, 1225, 1229,\n",
" 1226, 1227, 1230, 1231, 1232, 1233, 1234, 1235, 1236, 1237, 1239,\n",
" 1241, 1242, 1243, 1245, 1244, 1246, 1247, 1249, 1254, 1250, 1256,\n",
" 1258, 1259, 1260, 1263, 1264, 1266, 1265, 1267])"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"restaurant_id\"].unique()"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"52"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df[\"restaurant_id\"].unique())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To obtain an `ndarray` of all customer IDs of \"Max A Table\", we write the following."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([10298, 6037, 74268, 74901, 74966, 75047, 76187, 76218, 76442,\n",
" 76396, 76421, 76786, 76822, 76730, 76871, 75687, 77409, 77386,\n",
" 77355, 77556, 78129, 78353, 78608, 78621, 78958, 79119, 79153,\n",
" 76838, 79234, 79486, 79576, 79563, 79653, 79900, 79912, 80026,\n",
" 80204, 80095, 80163])"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[max_a_table, \"customer_id\"].unique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[.value_counts() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html#pandas.DataFrame.value_counts) is similar to [.unique() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html#pandas.Series.unique) and provides an array sorted by the counts of how often an element occurs in a column or `Series` in descending order.\n",
"\n",
"We use it to list the `10` most popular restaurants and customers in the dataset."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"restaurant_id\n",
"1254 78\n",
"1207 47\n",
"1204 39\n",
"1217 37\n",
"1212 32\n",
"1244 25\n",
"1225 25\n",
"1249 23\n",
"1242 19\n",
"1221 18\n",
"Name: count, dtype: int64"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"restaurant_id\"].value_counts().head(10)"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"customer_id\n",
"73919 14\n",
"10298 12\n",
"6037 8\n",
"77048 5\n",
"4210 4\n",
"74426 4\n",
"9304 3\n",
"76838 3\n",
"75905 3\n",
"74791 3\n",
"Name: count, dtype: int64"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"customer_id\"].value_counts().head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[.sum() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html#pandas.DataFrame.sum), [.min() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.min.html#pandas.DataFrame.min), [.max() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.max.html#pandas.DataFrame.max), [.mean() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html#pandas.DataFrame.mean), and [.round() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.round.html#pandas.DataFrame.round) are self-explanatory.\n",
"\n",
"We use it to analyze the overall spendings in Bordeaux and for \"Max A Table\"."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"np.float64(15924.78)"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"total\"].sum() / 100 # Convert to Euro"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"np.float64(885.0)"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[max_a_table, \"total\"].sum() / 100"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"np.float64(3.5)"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"total\"].min() / 100"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"np.float64(83.7)"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"total\"].max() / 100"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"np.float64(12.5)"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[max_a_table, \"total\"].min() / 100"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"np.float64(60.0)"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[max_a_table, \"total\"].max() / 100"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"np.float64(22.94636887608069)"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"total\"].mean() / 100"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"np.float64(22.95)"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"total\"].mean().round() / 100"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"np.float64(22.69)"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[max_a_table, \"total\"].mean().round() / 100"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "intro-to-data-science",
"language": "python",
"name": "intro-to-data-science"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.4"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}