"**Note**: Click on \"*Kernel*\" > \"*Restart Kernel and Clear All Outputs*\" in [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/) *before* reading this notebook to reset its output. If you cannot run this file on your machine, you may want to open it [in the cloud <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_mb.png\">](https://mybinder.org/v2/gh/webartifex/intro-to-data-science/main?urlpath=lab/tree/01_scientific_stack/02_content_pandas.ipynb)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Chapter 1: Python's Scientific Stack (Part 2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For practitioners, the [numpy <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_np.png\">](https://numpy.org/) library may feel a bit too \"technical\" or too close to \"real programming\" and they may prefer something that looks and feels more like Excel. That is where the [pandas <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/) library comes in.\n",
"In the same folder as this notebook there is a file named \"*orders.csv*\" that holds the order data of an urban meal delivery platform operating in Bordeaux, France. Open in with a double-click and take a look at its contents right here in JupyterLab!\n",
"\n",
"[pandas <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/) provides a [pd.read_csv() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv) function that, as the name suggests, can open and read in CSV data. For Excel files, there is also a [pd.read_excel() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html) function but the CSV format is probably more widespread in use.\n",
"\n",
"Let's read in the \"*orders.csv*\" file with [pd.read_csv() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv) specifying the \"order_id\" column as the **index**. Here, index is a column with *unique* values that allow the identification of each row in a dataset. If we don't specify an index column, [pandas <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/) creates a surrogate index as a sequence of integers 1, 2, 3, and so on."
"`df` models a table-like data structure, comparable to one tab in an Excel file. [pandas <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/) and JupyterLab are designed to work well together: The `df` object shows a preview of the dataset below the code cell. The rows are the **records** in the dataset and the columns take the role of the **attributes** each record has. Each column comes with a **domain** of allowable values."
]
},
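{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the index idea described above, the following snippet reads a tiny in-memory CSV twice, once with and once without an index column. The three columns and the two rows are made up for illustration and only stand in for the real *orders.csv* file; `io.StringIO` is used so that the snippet runs without any file on disk.\n",
"\n",
"```python\n",
"import io\n",
"\n",
"import pandas as pd\n",
"\n",
"# Hypothetical two-row excerpt standing in for orders.csv (values are made up).\n",
"csv_text = '''order_id,restaurant,total\n",
"101,Cafe A,2050\n",
"102,Cafe B,2450\n",
"'''\n",
"\n",
"# With index_col, the order IDs become the row labels.\n",
"with_index = pd.read_csv(io.StringIO(csv_text), index_col='order_id')\n",
"\n",
"# Without index_col, pandas generates integer row labels starting at 0.\n",
"without_index = pd.read_csv(io.StringIO(csv_text))\n",
"\n",
"print(with_index.index.tolist())\n",
"print(without_index.index.tolist())\n",
"```\n",
"\n",
"In the second case, \"order_id\" stays an ordinary column next to the generated row labels."
]
},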
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>placed_at</th>\n",
" <th>restaurant_id</th>\n",
" <th>restaurant</th>\n",
" <th>o_street</th>\n",
" <th>o_zip</th>\n",
" <th>o_city</th>\n",
" <th>o_latitude</th>\n",
" <th>o_longitude</th>\n",
" <th>customer_id</th>\n",
" <th>d_street</th>\n",
" <th>d_zip</th>\n",
" <th>d_city</th>\n",
" <th>d_latitude</th>\n",
" <th>d_longitude</th>\n",
" <th>total</th>\n",
" <th>courier_id</th>\n",
" <th>pickup_at</th>\n",
" <th>delivery_at</th>\n",
" <th>cancelled</th>\n",
" </tr>\n",
" <tr>\n",
" <th>order_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>192594</th>\n",
" <td>2016-07-18 12:23:13</td>\n",
" <td>1204</td>\n",
" <td>Max A Table</td>\n",
" <td>36 Rue Cornac</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.851402</td>\n",
" <td>-0.575870</td>\n",
" <td>10298</td>\n",
" <td>Rue Rolland 14</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.842592</td>\n",
" <td>-0.580521</td>\n",
" <td>2050</td>\n",
" <td>1423.0</td>\n",
" <td>2016-07-18 12:38:08</td>\n",
" <td>2016-07-18 12:48:22</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>192644</th>\n",
" <td>2016-07-18 12:48:55</td>\n",
" <td>1204</td>\n",
" <td>Max A Table</td>\n",
" <td>36 Rue Cornac</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.851402</td>\n",
" <td>-0.575870</td>\n",
" <td>6037</td>\n",
" <td>Rue Rolland 14</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.842592</td>\n",
" <td>-0.580521</td>\n",
" <td>2450</td>\n",
" <td>1426.0</td>\n",
" <td>2016-07-18 13:03:08</td>\n",
" <td>2016-07-18 13:12:01</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>192658</th>\n",
" <td>2016-07-18 13:00:13</td>\n",
" <td>1205</td>\n",
" <td>Taj Mahal</td>\n",
" <td>24 Rue Du Parlement Sainte-Catherine</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.840405</td>\n",
" <td>-0.573940</td>\n",
" <td>73830</td>\n",
" <td>Rue Batailley 12</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.838504</td>\n",
" <td>-0.591961</td>\n",
" <td>2550</td>\n",
" <td>1423.0</td>\n",
" <td>2016-07-18 13:19:04</td>\n",
" <td>2016-07-18 13:29:03</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>193242</th>\n",
" <td>2016-07-18 20:39:54</td>\n",
" <td>1208</td>\n",
" <td>Chez Ambre And Michel</td>\n",
" <td>1 Rue Matignon</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.850258</td>\n",
" <td>-0.586204</td>\n",
" <td>10298</td>\n",
" <td>Rue Rolland 14</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.842592</td>\n",
" <td>-0.580521</td>\n",
" <td>1550</td>\n",
" <td>1420.0</td>\n",
" <td>2016-07-18 20:55:52</td>\n",
" <td>2016-07-18 21:05:28</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>192719</th>\n",
" <td>2016-07-18 13:52:04</td>\n",
" <td>1206</td>\n",
" <td>La Maison Du Glacier</td>\n",
" <td>1 Place Saint Pierre</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.839706</td>\n",
" <td>-0.570672</td>\n",
" <td>6037</td>\n",
" <td>Rue Rolland 14</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.842592</td>\n",
" <td>-0.580521</td>\n",
" <td>2450</td>\n",
" <td>1426.0</td>\n",
" <td>2016-07-18 14:01:23</td>\n",
" <td>2016-07-18 14:08:36</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>212021</th>\n",
" <td>2016-07-30 22:29:52</td>\n",
" <td>1249</td>\n",
" <td>Pitaya Sainte Catherine</td>\n",
" <td>275 Rue Sainte Catherine</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.831692</td>\n",
" <td>-0.573207</td>\n",
" <td>80400</td>\n",
" <td>Boulevard President Franklin Roosevelt 15</td>\n",
" <td>33400</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.820591</td>\n",
" <td>-0.582048</td>\n",
" <td>2250</td>\n",
" <td>1410.0</td>\n",
" <td>2016-07-30 22:50:16</td>\n",
" <td>2016-07-30 23:02:54</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>211501</th>\n",
" <td>2016-07-30 20:44:50</td>\n",
" <td>1204</td>\n",
" <td>Max A Table</td>\n",
" <td>36 Rue Cornac</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.851402</td>\n",
" <td>-0.575870</td>\n",
" <td>80163</td>\n",
" <td>Rue Marsan 22</td>\n",
" <td>33300</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.856133</td>\n",
" <td>-0.576172</td>\n",
" <td>1250</td>\n",
" <td>1415.0</td>\n",
" <td>2016-07-30 21:02:32</td>\n",
" <td>2016-07-30 21:06:19</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>211508</th>\n",
" <td>2016-07-30 20:45:55</td>\n",
" <td>1254</td>\n",
" <td>Funky Burger</td>\n",
" <td>5 Rue Du Loup</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.838081</td>\n",
" <td>-0.572281</td>\n",
" <td>80168</td>\n",
" <td>Rue Des Sablieres 42</td>\n",
" <td>33800</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.825488</td>\n",
" <td>-0.575264</td>\n",
" <td>1680</td>\n",
" <td>1461.0</td>\n",
" <td>2016-07-30 21:13:31</td>\n",
" <td>2016-07-30 21:19:45</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>211510</th>\n",
" <td>2016-07-30 20:46:05</td>\n",
" <td>1219</td>\n",
" <td>La Tagliatella</td>\n",
" <td>14 Rue Guiraude</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.839388</td>\n",
" <td>-0.574781</td>\n",
" <td>80169</td>\n",
" <td>Rue Pasteur 35</td>\n",
" <td>33200</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.845053</td>\n",
" <td>-0.601157</td>\n",
" <td>4085</td>\n",
" <td>1411.0</td>\n",
" <td>2016-07-30 21:11:00</td>\n",
" <td>2016-07-30 21:23:24</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>211519</th>\n",
" <td>2016-07-30 20:46:55</td>\n",
" <td>1254</td>\n",
" <td>Funky Burger</td>\n",
" <td>5 Rue Du Loup</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.838081</td>\n",
" <td>-0.572281</td>\n",
" <td>80172</td>\n",
" <td>Rue Monadey 28</td>\n",
" <td>33800</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.828816</td>\n",
" <td>-0.570789</td>\n",
" <td>2050</td>\n",
" <td>1817.0</td>\n",
" <td>2016-07-30 21:05:46</td>\n",
" <td>2016-07-30 21:14:07</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>694 rows × 19 columns</p>\n",
"</div>"
],
"text/plain": [
" placed_at restaurant_id restaurant \\\n",
"order_id \n",
"192594 2016-07-18 12:23:13 1204 Max A Table \n",
"192644 2016-07-18 12:48:55 1204 Max A Table \n",
"192658 2016-07-18 13:00:13 1205 Taj Mahal \n",
"193242 2016-07-18 20:39:54 1208 Chez Ambre And Michel \n",
"192719 2016-07-18 13:52:04 1206 La Maison Du Glacier \n",
"The data type behind `df` is called a [pd.DataFrame <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame). `DataFrame`s are built around [numpy <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_np.png\">](https://numpy.org/)'s `ndarray`s providing an interface optimized for **interactive usage** (i.e., a data scientist exploring a dataset step by step)."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.frame.DataFrame"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`DataFrame`s come with many methdods.\n",
"\n",
"For example, [.head() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html#pandas.DataFrame.head) and [.tail() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html#pandas.DataFrame.tail) show the first and last `n` rows, defaulting to `5`."
]
},
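{
"cell_type": "markdown",
"metadata": {},
"source": [
"The effect of the `n` argument can be sketched on a small stand-in table. The `toy` object and its values are made up for illustration and are not part of the orders dataset.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Made-up table with seven rows.\n",
"toy = pd.DataFrame({'total': [10, 20, 30, 40, 50, 60, 70]})\n",
"\n",
"first_five = toy.head()   # no argument: the first 5 rows\n",
"last_two = toy.tail(2)    # explicit n: the last 2 rows\n",
"\n",
"print(len(first_five))\n",
"print(last_two['total'].tolist())\n",
"```\n",
"\n",
"Both methods return new `DataFrame`s and leave the original object unchanged."
]
},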
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>placed_at</th>\n",
" <th>restaurant_id</th>\n",
" <th>restaurant</th>\n",
" <th>o_street</th>\n",
" <th>o_zip</th>\n",
" <th>o_city</th>\n",
" <th>o_latitude</th>\n",
" <th>o_longitude</th>\n",
" <th>customer_id</th>\n",
" <th>d_street</th>\n",
" <th>d_zip</th>\n",
" <th>d_city</th>\n",
" <th>d_latitude</th>\n",
" <th>d_longitude</th>\n",
" <th>total</th>\n",
" <th>courier_id</th>\n",
" <th>pickup_at</th>\n",
" <th>delivery_at</th>\n",
" <th>cancelled</th>\n",
" </tr>\n",
" <tr>\n",
" <th>order_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>192594</th>\n",
" <td>2016-07-18 12:23:13</td>\n",
" <td>1204</td>\n",
" <td>Max A Table</td>\n",
" <td>36 Rue Cornac</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.851402</td>\n",
" <td>-0.575870</td>\n",
" <td>10298</td>\n",
" <td>Rue Rolland 14</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.842592</td>\n",
" <td>-0.580521</td>\n",
" <td>2050</td>\n",
" <td>1423.0</td>\n",
" <td>2016-07-18 12:38:08</td>\n",
" <td>2016-07-18 12:48:22</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>192644</th>\n",
" <td>2016-07-18 12:48:55</td>\n",
" <td>1204</td>\n",
" <td>Max A Table</td>\n",
" <td>36 Rue Cornac</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.851402</td>\n",
" <td>-0.575870</td>\n",
" <td>6037</td>\n",
" <td>Rue Rolland 14</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.842592</td>\n",
" <td>-0.580521</td>\n",
" <td>2450</td>\n",
" <td>1426.0</td>\n",
" <td>2016-07-18 13:03:08</td>\n",
" <td>2016-07-18 13:12:01</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>192658</th>\n",
" <td>2016-07-18 13:00:13</td>\n",
" <td>1205</td>\n",
" <td>Taj Mahal</td>\n",
" <td>24 Rue Du Parlement Sainte-Catherine</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.840405</td>\n",
" <td>-0.573940</td>\n",
" <td>73830</td>\n",
" <td>Rue Batailley 12</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.838504</td>\n",
" <td>-0.591961</td>\n",
" <td>2550</td>\n",
" <td>1423.0</td>\n",
" <td>2016-07-18 13:19:04</td>\n",
" <td>2016-07-18 13:29:03</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>193242</th>\n",
" <td>2016-07-18 20:39:54</td>\n",
" <td>1208</td>\n",
" <td>Chez Ambre And Michel</td>\n",
" <td>1 Rue Matignon</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.850258</td>\n",
" <td>-0.586204</td>\n",
" <td>10298</td>\n",
" <td>Rue Rolland 14</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.842592</td>\n",
" <td>-0.580521</td>\n",
" <td>1550</td>\n",
" <td>1420.0</td>\n",
" <td>2016-07-18 20:55:52</td>\n",
" <td>2016-07-18 21:05:28</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>192719</th>\n",
" <td>2016-07-18 13:52:04</td>\n",
" <td>1206</td>\n",
" <td>La Maison Du Glacier</td>\n",
" <td>1 Place Saint Pierre</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.839706</td>\n",
" <td>-0.570672</td>\n",
" <td>6037</td>\n",
" <td>Rue Rolland 14</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.842592</td>\n",
" <td>-0.580521</td>\n",
" <td>2450</td>\n",
" <td>1426.0</td>\n",
" <td>2016-07-18 14:01:23</td>\n",
" <td>2016-07-18 14:08:36</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" placed_at restaurant_id restaurant \\\n",
"order_id \n",
"192594 2016-07-18 12:23:13 1204 Max A Table \n",
"192644 2016-07-18 12:48:55 1204 Max A Table \n",
"192658 2016-07-18 13:00:13 1205 Taj Mahal \n",
"193242 2016-07-18 20:39:54 1208 Chez Ambre And Michel \n",
"192719 2016-07-18 13:52:04 1206 La Maison Du Glacier \n",
"\n",
" o_street o_zip o_city o_latitude \\\n",
"order_id \n",
"192594 36 Rue Cornac 33000 Bordeaux 44.851402 \n",
"192644 36 Rue Cornac 33000 Bordeaux 44.851402 \n",
"192658 24 Rue Du Parlement Sainte-Catherine 33000 Bordeaux 44.840405 \n",
"193242 1 Rue Matignon 33000 Bordeaux 44.850258 \n",
"192719 1 Place Saint Pierre 33000 Bordeaux 44.839706 \n",
"[.info() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html#pandas.DataFrame.info) shows on overview of the columns. In particular, it shows how many cells are filled in in a column (i.e., are \"non-null\") and what **data type** (i.e., \"dtype\") *all* values in a column have. \"int64\" and \"float64\" imply that there are only `int` and `float` values in a column (taking up to 64 bits or 1s and 0s in memory). \"object\" is [pandas <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/)' way of telling us it could not deduce any data type more specific than textual data. For the columns holding timestamps (e.g., \"placed_at\") we will convert the values further below.\n",
"\n",
"Looking at the output, we see that some columns hold the data of **origin**-**destination** pairs, corresponding to restaurants and customers. Other columns store data following the dispatch and delivery process of couriers picking up and delivering meals at various points in time."
"[.describe() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html#pandas.DataFrame.describe) shows statistics on all numerical columns in a `DataFrame`.\n",
"\n",
"For the example orders, such statistics may not be meaningful for all numerical columns as some of them merely hold IDs or zip codes."
"`DataFrame`s support being indexed or sliced, both in the row and column dimensions.\n",
"\n",
"To obtain all data in a single column, we index into the `DataFrame` with the column's name.\n",
"\n",
"For example, `restaurant_col` provides a list of only the restaurant names. Its index are still the \"order_id\"s."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"order_id\n",
"192594 Max A Table\n",
"192644 Max A Table\n",
"192658 Taj Mahal\n",
"193242 Chez Ambre And Michel\n",
"192719 La Maison Du Glacier\n",
"Name: restaurant, dtype: object"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"restaurant_col = df[\"restaurant\"]\n",
"\n",
"restaurant_col.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data type of a single column is [pd.Series <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.Series.html#pandas.Series), which is very similar to a `DataFrame` with only one column. `Series` objects work like built-in `list`s with added functionalities."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.series.Series"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(restaurant_col)"
]
},
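{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the comparison between a `list` and a `Series` concrete, here is a small sketch. It rebuilds a three-element `Series` by hand from restaurant names and order IDs shown in the preview above; the variable names `names` and `series` are chosen for this illustration only.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"names = ['Max A Table', 'Taj Mahal', 'Max A Table']\n",
"series = pd.Series(names, index=[192594, 192658, 192644], name='restaurant')\n",
"\n",
"print(names[0])                      # a list is accessed by position\n",
"print(series.loc[192658])            # a Series can also be accessed by label\n",
"print(series.str.upper().tolist())   # string methods apply to every element\n",
"```\n",
"\n",
"The labels and the element-wise methods are examples of what a `Series` offers on top of a plain `list`."
]
},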
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we index with a `list` of column names, the result is itself another `DataFrame`. That operation is like slicing out a smaller matrix from a larger one as we saw with `ndarray`s before.\n",
"\n",
"For example, let's pull out all location data of the orders' origins (i.e., restaurants)."
]
},
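{
"cell_type": "markdown",
"metadata": {},
"source": [
"The difference between indexing with a single name and with a `list` of names can be sketched on a two-row stand-in table. The `toy` object below copies a few values from the preview above and is only meant as an illustration.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"toy = pd.DataFrame(\n",
"    {\n",
"        'restaurant': ['Max A Table', 'Taj Mahal'],\n",
"        'o_street': ['36 Rue Cornac', '24 Rue Du Parlement Sainte-Catherine'],\n",
"        'o_zip': [33000, 33000],\n",
"        'total': [2050, 2550],\n",
"    },\n",
"    index=pd.Index([192594, 192658], name='order_id'),\n",
")\n",
"\n",
"one_column = toy['o_street']               # a single name gives a Series\n",
"two_columns = toy[['o_street', 'o_zip']]   # a list of names gives a DataFrame\n",
"\n",
"print(type(one_column).__name__, type(two_columns).__name__)\n",
"```\n",
"\n",
"Note the double brackets in the second case: the outer pair is the indexing operator, the inner pair creates the `list`."
]
},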
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>o_street</th>\n",
" <th>o_zip</th>\n",
" <th>o_city</th>\n",
" <th>o_latitude</th>\n",
" <th>o_longitude</th>\n",
" </tr>\n",
" <tr>\n",
" <th>order_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>192594</th>\n",
" <td>36 Rue Cornac</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.851402</td>\n",
" <td>-0.575870</td>\n",
" </tr>\n",
" <tr>\n",
" <th>192644</th>\n",
" <td>36 Rue Cornac</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.851402</td>\n",
" <td>-0.575870</td>\n",
" </tr>\n",
" <tr>\n",
" <th>192658</th>\n",
" <td>24 Rue Du Parlement Sainte-Catherine</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.840405</td>\n",
" <td>-0.573940</td>\n",
" </tr>\n",
" <tr>\n",
" <th>193242</th>\n",
" <td>1 Rue Matignon</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.850258</td>\n",
" <td>-0.586204</td>\n",
" </tr>\n",
" <tr>\n",
" <th>192719</th>\n",
" <td>1 Place Saint Pierre</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.839706</td>\n",
" <td>-0.570672</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" o_street o_zip o_city o_latitude \\\n",
"order_id \n",
"192594 36 Rue Cornac 33000 Bordeaux 44.851402 \n",
"192644 36 Rue Cornac 33000 Bordeaux 44.851402 \n",
"192658 24 Rue Du Parlement Sainte-Catherine 33000 Bordeaux 44.840405 \n",
"193242 1 Rue Matignon 33000 Bordeaux 44.850258 \n",
"192719 1 Place Saint Pierre 33000 Bordeaux 44.839706 \n",
"To access individual rows, we index not into a `DataFrame` directly but into its [.loc <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc) object (which also exists for `Series`).\n",
"\n",
"Here, `200800` is an \"order_id\" number. The result is a `Series` object where the original `DataFrame`'s columns become the index."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"placed_at 2016-07-24 19:30:52\n",
"restaurant_id 1204\n",
"restaurant Max A Table\n",
"o_street 36 Rue Cornac\n",
"o_zip 33000\n",
"o_city Bordeaux\n",
"o_latitude 44.851402\n",
"o_longitude -0.57587\n",
"customer_id 76187\n",
"d_street Rue Judaique 213\n",
"d_zip 33000\n",
"d_city Bordeaux\n",
"d_latitude 44.840829\n",
"d_longitude -0.595445\n",
"total 2250\n",
"courier_id 1468.0\n",
"pickup_at 2016-07-24 19:50:52\n",
"delivery_at 2016-07-24 19:58:16\n",
"cancelled 0\n",
"Name: 200800, dtype: object"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[200800]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also index into the `restaurant_col` and `origins` objects from above. As `restaurant_col` is a `Series`, we get back a scalar value."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Max A Table'"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"restaurant_col.loc[200800]"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"o_street 36 Rue Cornac\n",
"o_zip 33000\n",
"o_city Bordeaux\n",
"o_latitude 44.851402\n",
"o_longitude -0.57587\n",
"Name: 200800, dtype: object"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"origins.loc[200800]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Slicing also works with [.loc <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc). A tiny difference to Python's built-in slicing, the upper bound is included in the slice as well!"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>placed_at</th>\n",
" <th>restaurant_id</th>\n",
" <th>restaurant</th>\n",
" <th>o_street</th>\n",
" <th>o_zip</th>\n",
" <th>o_city</th>\n",
" <th>o_latitude</th>\n",
" <th>o_longitude</th>\n",
" <th>customer_id</th>\n",
" <th>d_street</th>\n",
" <th>d_zip</th>\n",
" <th>d_city</th>\n",
" <th>d_latitude</th>\n",
" <th>d_longitude</th>\n",
" <th>total</th>\n",
" <th>courier_id</th>\n",
" <th>pickup_at</th>\n",
" <th>delivery_at</th>\n",
" <th>cancelled</th>\n",
" </tr>\n",
" <tr>\n",
" <th>order_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>200300</th>\n",
" <td>2016-07-24 13:46:15</td>\n",
" <td>1207</td>\n",
" <td>Le Jardin Pekinois</td>\n",
" <td>9 Rue Des Freres Bonie</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.837078</td>\n",
" <td>-0.579572</td>\n",
" <td>76030</td>\n",
" <td>Rue Villeneuve 1</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.839927</td>\n",
" <td>-0.580012</td>\n",
" <td>3820</td>\n",
" <td>1426.0</td>\n",
" <td>2016-07-24 14:12:45</td>\n",
" <td>2016-07-24 14:16:59</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>200305</th>\n",
" <td>2016-07-24 13:49:25</td>\n",
" <td>1207</td>\n",
" <td>Le Jardin Pekinois</td>\n",
" <td>9 Rue Des Freres Bonie</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.837078</td>\n",
" <td>-0.579572</td>\n",
" <td>76033</td>\n",
" <td>Rue Du Ha 54</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.835898</td>\n",
" <td>-0.577941</td>\n",
" <td>1689</td>\n",
" <td>1405.0</td>\n",
" <td>2016-07-24 14:12:04</td>\n",
" <td>2016-07-24 14:15:54</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>200800</th>\n",
" <td>2016-07-24 19:30:52</td>\n",
" <td>1204</td>\n",
" <td>Max A Table</td>\n",
" <td>36 Rue Cornac</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.851402</td>\n",
" <td>-0.575870</td>\n",
" <td>76187</td>\n",
" <td>Rue Judaique 213</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>44.840829</td>\n",
" <td>-0.595445</td>\n",
" <td>2250</td>\n",
" <td>1468.0</td>\n",
" <td>2016-07-24 19:50:52</td>\n",
" <td>2016-07-24 19:58:16</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" placed_at restaurant_id restaurant \\\n",
"order_id \n",
"200300 2016-07-24 13:46:15 1207 Le Jardin Pekinois \n",
"200305 2016-07-24 13:49:25 1207 Le Jardin Pekinois \n",
"200300 9 Rue Des Freres Bonie 33000 Bordeaux 44.837078 -0.579572\n",
"200305 9 Rue Des Freres Bonie 33000 Bordeaux 44.837078 -0.579572\n",
"200800 36 Rue Cornac 33000 Bordeaux 44.851402 -0.575870"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"origins.loc[200300:200800]"
]
},
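{
"cell_type": "markdown",
"metadata": {},
"source": [
"The inclusive upper bound is easy to overlook, so here is a small side-by-side sketch with a plain `list`. The first three order IDs and totals are taken from the output above; the fourth entry is made up so that there is a row beyond the slice.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"order_ids = [200300, 200305, 200800, 200900]  # 200900 is a made-up ID\n",
"totals = pd.Series([3820, 1689, 2250, 1000], index=order_ids)\n",
"\n",
"by_position = order_ids[0:2]           # list slice: the upper bound is excluded\n",
"by_label = totals.loc[200300:200800]   # label slice: the row 200800 is included\n",
"\n",
"print(by_position)\n",
"print(by_label.index.tolist())\n",
"```\n",
"\n",
"The label slice returns three rows, whereas the positional slice of the `list` stops before its upper bound."
]
},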
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[.loc <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc) also allows us to index and slice in both dimensions simultaneously. The first index or slice goes along the row dimension while the second index or slice selects the columns."
"As [.info() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html#pandas.DataFrame.info) already revealed above, the timestamp columns could only be parsed as generic objects (i.e., textual data). Also, the \"cancelled\" column which holds only `True` or `False` values does not have a `bool` data type."
"The [pd.to_datetime() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html#pandas.to_datetime) function **casts** the timestamp columns correctly."
"The [.astype() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html#pandas.DataFrame.astype) method generalizes this idea and allows us to cast several columns in a `DataFrame`. It takes a `dict`ionary mapping column names to data types as its input. Instead of references to actual data types (e.g., `bool`), it also understands [pandas <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/)-specific data types provides as text."
"A common operation when working with `DataFrame`s is to filter for rows fulfilling certain conditions. That is implemented by so-called **boolean filters** in [pandas <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/), which is simply a `DataFrame` or `Series` holding only `True` or `False` values.\n",
"\n",
"One way to obtain such objects is to use relational operators with columns.\n",
"\n",
"`max_a_table` holds `True` values for all orders at the restaurant with the ID `1204`."
]
},
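{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two casting steps described above can be sketched on a stand-in table before we continue with the filter. The `toy` object and its values are made up, and the choice of `'Int64'` (the nullable integer type in pandas, given as text) for the courier IDs is an assumption made for this illustration, not necessarily the type used for the orders data.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Made-up rows that mimic a few of the columns discussed above.\n",
"toy = pd.DataFrame(\n",
"    {\n",
"        'placed_at': ['2016-07-18 12:23:13', '2016-07-18 12:48:55'],\n",
"        'courier_id': [1423.0, None],\n",
"        'cancelled': [0, 1],\n",
"    },\n",
"    index=pd.Index([1, 2], name='order_id'),\n",
")\n",
"\n",
"# Parse the textual timestamps into proper datetime values.\n",
"toy['placed_at'] = pd.to_datetime(toy['placed_at'])\n",
"\n",
"# Cast several columns at once with a dict of column names and types.\n",
"toy = toy.astype({'cancelled': bool, 'courier_id': 'Int64'})\n",
"\n",
"print(toy.dtypes)\n",
"```\n",
"\n",
"After the cast, the \"cancelled\" column already is a `Series` of `True` and `False` values and could serve as a boolean filter itself."
]
},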
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"order_id\n",
"192594 True\n",
"192644 True\n",
"192658 False\n",
"193242 False\n",
"192719 False\n",
" ... \n",
"212021 False\n",
"211501 True\n",
"211508 False\n",
"211510 False\n",
"211519 False\n",
"Name: restaurant_id, Length: 694, dtype: bool"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"max_a_table = df[\"restaurant_id\"] == 1204\n",
"\n",
"max_a_table"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, let's use a boolean filter to index into `df`. That gives us back a new `DataFame` with all orders belonging to the restaurant \"Max A Table\"."
"[.isin() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html#pandas.DataFrame.isin) is another useful method: It allows us to provide a `list` of values that we are filtering for in a column."
]
},
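{
"cell_type": "markdown",
"metadata": {},
"source": [
"Both ideas, indexing with a boolean filter and filtering with a `list` of values, can be sketched together on a four-row stand-in table. The `toy` object copies restaurant and customer IDs from the preview at the top of the notebook; combining two filters with the `&` operator is shown as one possible way to narrow the selection further.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"toy = pd.DataFrame(\n",
"    {\n",
"        'restaurant_id': [1204, 1205, 1208, 1204],\n",
"        'customer_id': [10298, 73830, 10298, 6037],\n",
"    },\n",
"    index=pd.Index([192594, 192658, 193242, 192644], name='order_id'),\n",
")\n",
"\n",
"is_1204 = toy['restaurant_id'] == 1204            # boolean Series\n",
"from_1204 = toy[is_1204]                          # keeps rows where the filter is True\n",
"\n",
"wanted = toy['customer_id'].isin([6037, 73830])   # True where the value is in the list\n",
"both = toy[is_1204 & wanted]                      # rows fulfilling both conditions\n",
"\n",
"print(from_1204.index.tolist())\n",
"print(both.index.tolist())\n",
"```\n",
"\n",
"The filters must be combined with `&` rather than the `and` keyword because the comparison happens element by element."
]
},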
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>placed_at</th>\n",
" <th>customer_id</th>\n",
" <th>d_street</th>\n",
" <th>d_zip</th>\n",
" <th>d_city</th>\n",
" <th>total</th>\n",
" </tr>\n",
" <tr>\n",
" <th>order_id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>192644</th>\n",
" <td>2016-07-18 12:48:55</td>\n",
" <td>6037</td>\n",
" <td>Rue Rolland 14</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>2450</td>\n",
" </tr>\n",
" <tr>\n",
" <th>210945</th>\n",
" <td>2016-07-30 19:30:39</td>\n",
" <td>79900</td>\n",
" <td>Rue Du Couvent 16</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>1650</td>\n",
" </tr>\n",
" <tr>\n",
" <th>211363</th>\n",
" <td>2016-07-30 20:27:45</td>\n",
" <td>80095</td>\n",
" <td>Rue De La Porte Saint-Jean 8</td>\n",
" <td>33000</td>\n",
" <td>Bordeaux</td>\n",
" <td>2400</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" placed_at customer_id d_street \\\n",
"order_id \n",
"192644 2016-07-18 12:48:55 6037 Rue Rolland 14 \n",
"210945 2016-07-30 19:30:39 79900 Rue Du Couvent 16 \n",
"211363 2016-07-30 20:27:45 80095 Rue De La Porte Saint-Jean 8 \n",
"Now that we have learned the basics of selecting the data we want from a `DataFrame`, let's look at a couple of methods that allow us to obtain some infos out of a `DataFrame`, in particular, to run some **descriptive statistics**.\n",
"\n",
"[.unique() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html#pandas.Series.unique) is a simple `Series` method returning an `ndarray` with all values that are in the `Series` once.\n",
"\n",
"Here, we get an overview of how many restaurants there are in Bordeaux in the target time horizon."
"[.value_counts() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html#pandas.DataFrame.value_counts) is similar to [.unique() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_pd.png\">](https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html#pandas.Series.unique) and provides an array sorted by the counts of how often an element occurs in a column or `Series` in descending order.\n",
"\n",
"We use it to list the `10` most popular restaurants and customers in the dataset."