{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "**Note**: Click on \"*Kernel*\" > \"*Restart Kernel and Clear All Outputs*\" in [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/) *before* reading this notebook to reset its output. If you cannot run this file on your machine, you may want to open it [in the cloud ](https://mybinder.org/v2/gh/webartifex/intro-to-data-science/main?urlpath=lab/tree/01_scientific_stack/02_content_pandas.ipynb)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Chapter 1: Python's Scientific Stack (Part 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For practitioners, the [numpy ](https://numpy.org/) library may feel a bit too \"technical\" or too close to \"real programming\" and they may prefer something that looks and feels more like Excel. That is where the [pandas ](https://pandas.pydata.org/) library comes in.\n", "\n", "Let's first `pip` install and then `import` it." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: pandas in /home/instructor/Repositories/intro-to-data-science/.venv/lib/python3.12/site-packages (2.2.2)\n", "Requirement already satisfied: numpy>=1.26.0 in /home/instructor/Repositories/intro-to-data-science/.venv/lib/python3.12/site-packages (from pandas) (2.0.0)\n", "Requirement already satisfied: python-dateutil>=2.8.2 in /home/instructor/Repositories/intro-to-data-science/.venv/lib/python3.12/site-packages (from pandas) (2.9.0.post0)\n", "Requirement already satisfied: pytz>=2020.1 in /home/instructor/Repositories/intro-to-data-science/.venv/lib/python3.12/site-packages (from pandas) (2024.1)\n", "Requirement already satisfied: tzdata>=2022.7 in /home/instructor/Repositories/intro-to-data-science/.venv/lib/python3.12/site-packages (from pandas) (2024.1)\n", "Requirement already satisfied: six>=1.5 in /home/instructor/Repositories/intro-to-data-science/.venv/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)\n" ] } ], "source": [ "!pip install pandas" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Excel-like Data with Pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the same folder as this notebook there is a file named \"*orders.csv*\" that holds the order data of an urban meal delivery platform operating in Bordeaux, France. Open in with a double-click and take a look at its contents right here in JupyterLab!\n", "\n", "[pandas ](https://pandas.pydata.org/) provides a [pd.read_csv() ](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv) function that, as the name suggests, can open and read in CSV data. For Excel files, there is also a [pd.read_excel() ](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html) function but the CSV format is probably more widespread in use.\n", "\n", "Let's read in the \"*orders.csv*\" file with [pd.read_csv() ](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv) specifying the \"order_id\" column as the **index**. Here, index is a column with *unique* values that allow the identification of each row in a dataset. If we don't specify an index column, [pandas ](https://pandas.pydata.org/) creates a surrogate index as a sequence of integers 1, 2, 3, and so on." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(\"orders.csv\", index_col=\"order_id\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`df` models a table-like data structure, comparable to one tab in an Excel file. [pandas ](https://pandas.pydata.org/) and JupyterLab are designed to work well together: The `df` object shows a preview of the dataset below the code cell. The rows are the **records** in the dataset and the columns take the role of the **attributes** each record has. Each column comes with a **domain** of allowable values." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
placed_atrestaurant_idrestauranto_streeto_zipo_cityo_latitudeo_longitudecustomer_idd_streetd_zipd_cityd_latituded_longitudetotalcourier_idpickup_atdelivery_atcancelled
order_id
1925942016-07-18 12:23:131204Max A Table36 Rue Cornac33000Bordeaux44.851402-0.57587010298Rue Rolland 1433000Bordeaux44.842592-0.58052120501423.02016-07-18 12:38:082016-07-18 12:48:220
1926442016-07-18 12:48:551204Max A Table36 Rue Cornac33000Bordeaux44.851402-0.5758706037Rue Rolland 1433000Bordeaux44.842592-0.58052124501426.02016-07-18 13:03:082016-07-18 13:12:010
1926582016-07-18 13:00:131205Taj Mahal24 Rue Du Parlement Sainte-Catherine33000Bordeaux44.840405-0.57394073830Rue Batailley 1233000Bordeaux44.838504-0.59196125501423.02016-07-18 13:19:042016-07-18 13:29:030
1932422016-07-18 20:39:541208Chez Ambre And Michel1 Rue Matignon33000Bordeaux44.850258-0.58620410298Rue Rolland 1433000Bordeaux44.842592-0.58052115501420.02016-07-18 20:55:522016-07-18 21:05:280
1927192016-07-18 13:52:041206La Maison Du Glacier1 Place Saint Pierre33000Bordeaux44.839706-0.5706726037Rue Rolland 1433000Bordeaux44.842592-0.58052124501426.02016-07-18 14:01:232016-07-18 14:08:360
............................................................
2120212016-07-30 22:29:521249Pitaya Sainte Catherine275 Rue Sainte Catherine33000Bordeaux44.831692-0.57320780400Boulevard President Franklin Roosevelt 1533400Bordeaux44.820591-0.58204822501410.02016-07-30 22:50:162016-07-30 23:02:540
2115012016-07-30 20:44:501204Max A Table36 Rue Cornac33000Bordeaux44.851402-0.57587080163Rue Marsan 2233300Bordeaux44.856133-0.57617212501415.02016-07-30 21:02:322016-07-30 21:06:190
2115082016-07-30 20:45:551254Funky Burger5 Rue Du Loup33000Bordeaux44.838081-0.57228180168Rue Des Sablieres 4233800Bordeaux44.825488-0.57526416801461.02016-07-30 21:13:312016-07-30 21:19:450
2115102016-07-30 20:46:051219La Tagliatella14 Rue Guiraude33000Bordeaux44.839388-0.57478180169Rue Pasteur 3533200Bordeaux44.845053-0.60115740851411.02016-07-30 21:11:002016-07-30 21:23:240
2115192016-07-30 20:46:551254Funky Burger5 Rue Du Loup33000Bordeaux44.838081-0.57228180172Rue Monadey 2833800Bordeaux44.828816-0.57078920501817.02016-07-30 21:05:462016-07-30 21:14:070
\n", "

694 rows × 19 columns

\n", "
" ], "text/plain": [ " placed_at restaurant_id restaurant \\\n", "order_id \n", "192594 2016-07-18 12:23:13 1204 Max A Table \n", "192644 2016-07-18 12:48:55 1204 Max A Table \n", "192658 2016-07-18 13:00:13 1205 Taj Mahal \n", "193242 2016-07-18 20:39:54 1208 Chez Ambre And Michel \n", "192719 2016-07-18 13:52:04 1206 La Maison Du Glacier \n", "... ... ... ... \n", "212021 2016-07-30 22:29:52 1249 Pitaya Sainte Catherine \n", "211501 2016-07-30 20:44:50 1204 Max A Table \n", "211508 2016-07-30 20:45:55 1254 Funky Burger \n", "211510 2016-07-30 20:46:05 1219 La Tagliatella \n", "211519 2016-07-30 20:46:55 1254 Funky Burger \n", "\n", " o_street o_zip o_city o_latitude \\\n", "order_id \n", "192594 36 Rue Cornac 33000 Bordeaux 44.851402 \n", "192644 36 Rue Cornac 33000 Bordeaux 44.851402 \n", "192658 24 Rue Du Parlement Sainte-Catherine 33000 Bordeaux 44.840405 \n", "193242 1 Rue Matignon 33000 Bordeaux 44.850258 \n", "192719 1 Place Saint Pierre 33000 Bordeaux 44.839706 \n", "... ... ... ... ... \n", "212021 275 Rue Sainte Catherine 33000 Bordeaux 44.831692 \n", "211501 36 Rue Cornac 33000 Bordeaux 44.851402 \n", "211508 5 Rue Du Loup 33000 Bordeaux 44.838081 \n", "211510 14 Rue Guiraude 33000 Bordeaux 44.839388 \n", "211519 5 Rue Du Loup 33000 Bordeaux 44.838081 \n", "\n", " o_longitude customer_id d_street \\\n", "order_id \n", "192594 -0.575870 10298 Rue Rolland 14 \n", "192644 -0.575870 6037 Rue Rolland 14 \n", "192658 -0.573940 73830 Rue Batailley 12 \n", "193242 -0.586204 10298 Rue Rolland 14 \n", "192719 -0.570672 6037 Rue Rolland 14 \n", "... ... ... ... \n", "212021 -0.573207 80400 Boulevard President Franklin Roosevelt 15 \n", "211501 -0.575870 80163 Rue Marsan 22 \n", "211508 -0.572281 80168 Rue Des Sablieres 42 \n", "211510 -0.574781 80169 Rue Pasteur 35 \n", "211519 -0.572281 80172 Rue Monadey 28 \n", "\n", " d_zip d_city d_latitude d_longitude total courier_id \\\n", "order_id \n", "192594 33000 Bordeaux 44.842592 -0.580521 2050 1423.0 \n", "192644 33000 Bordeaux 44.842592 -0.580521 2450 1426.0 \n", "192658 33000 Bordeaux 44.838504 -0.591961 2550 1423.0 \n", "193242 33000 Bordeaux 44.842592 -0.580521 1550 1420.0 \n", "192719 33000 Bordeaux 44.842592 -0.580521 2450 1426.0 \n", "... ... ... ... ... ... ... \n", "212021 33400 Bordeaux 44.820591 -0.582048 2250 1410.0 \n", "211501 33300 Bordeaux 44.856133 -0.576172 1250 1415.0 \n", "211508 33800 Bordeaux 44.825488 -0.575264 1680 1461.0 \n", "211510 33200 Bordeaux 44.845053 -0.601157 4085 1411.0 \n", "211519 33800 Bordeaux 44.828816 -0.570789 2050 1817.0 \n", "\n", " pickup_at delivery_at cancelled \n", "order_id \n", "192594 2016-07-18 12:38:08 2016-07-18 12:48:22 0 \n", "192644 2016-07-18 13:03:08 2016-07-18 13:12:01 0 \n", "192658 2016-07-18 13:19:04 2016-07-18 13:29:03 0 \n", "193242 2016-07-18 20:55:52 2016-07-18 21:05:28 0 \n", "192719 2016-07-18 14:01:23 2016-07-18 14:08:36 0 \n", "... ... ... ... \n", "212021 2016-07-30 22:50:16 2016-07-30 23:02:54 0 \n", "211501 2016-07-30 21:02:32 2016-07-30 21:06:19 0 \n", "211508 2016-07-30 21:13:31 2016-07-30 21:19:45 0 \n", "211510 2016-07-30 21:11:00 2016-07-30 21:23:24 0 \n", "211519 2016-07-30 21:05:46 2016-07-30 21:14:07 0 \n", "\n", "[694 rows x 19 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data type behind `df` is called a [pd.DataFrame ](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame). `DataFrame`s are built around [numpy ](https://numpy.org/)'s `ndarray`s providing an interface optimized for **interactive usage** (i.e., a data scientist exploring a dataset step by step)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.frame.DataFrame" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`DataFrame`s come with many methdods.\n", "\n", "For example, [.head() ](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html#pandas.DataFrame.head) and [.tail() ](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html#pandas.DataFrame.tail) show the first and last `n` rows, defaulting to `5`." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
placed_atrestaurant_idrestauranto_streeto_zipo_cityo_latitudeo_longitudecustomer_idd_streetd_zipd_cityd_latituded_longitudetotalcourier_idpickup_atdelivery_atcancelled
order_id
1925942016-07-18 12:23:131204Max A Table36 Rue Cornac33000Bordeaux44.851402-0.57587010298Rue Rolland 1433000Bordeaux44.842592-0.58052120501423.02016-07-18 12:38:082016-07-18 12:48:220
1926442016-07-18 12:48:551204Max A Table36 Rue Cornac33000Bordeaux44.851402-0.5758706037Rue Rolland 1433000Bordeaux44.842592-0.58052124501426.02016-07-18 13:03:082016-07-18 13:12:010
1926582016-07-18 13:00:131205Taj Mahal24 Rue Du Parlement Sainte-Catherine33000Bordeaux44.840405-0.57394073830Rue Batailley 1233000Bordeaux44.838504-0.59196125501423.02016-07-18 13:19:042016-07-18 13:29:030
1932422016-07-18 20:39:541208Chez Ambre And Michel1 Rue Matignon33000Bordeaux44.850258-0.58620410298Rue Rolland 1433000Bordeaux44.842592-0.58052115501420.02016-07-18 20:55:522016-07-18 21:05:280
1927192016-07-18 13:52:041206La Maison Du Glacier1 Place Saint Pierre33000Bordeaux44.839706-0.5706726037Rue Rolland 1433000Bordeaux44.842592-0.58052124501426.02016-07-18 14:01:232016-07-18 14:08:360
\n", "
" ], "text/plain": [ " placed_at restaurant_id restaurant \\\n", "order_id \n", "192594 2016-07-18 12:23:13 1204 Max A Table \n", "192644 2016-07-18 12:48:55 1204 Max A Table \n", "192658 2016-07-18 13:00:13 1205 Taj Mahal \n", "193242 2016-07-18 20:39:54 1208 Chez Ambre And Michel \n", "192719 2016-07-18 13:52:04 1206 La Maison Du Glacier \n", "\n", " o_street o_zip o_city o_latitude \\\n", "order_id \n", "192594 36 Rue Cornac 33000 Bordeaux 44.851402 \n", "192644 36 Rue Cornac 33000 Bordeaux 44.851402 \n", "192658 24 Rue Du Parlement Sainte-Catherine 33000 Bordeaux 44.840405 \n", "193242 1 Rue Matignon 33000 Bordeaux 44.850258 \n", "192719 1 Place Saint Pierre 33000 Bordeaux 44.839706 \n", "\n", " o_longitude customer_id d_street d_zip d_city \\\n", "order_id \n", "192594 -0.575870 10298 Rue Rolland 14 33000 Bordeaux \n", "192644 -0.575870 6037 Rue Rolland 14 33000 Bordeaux \n", "192658 -0.573940 73830 Rue Batailley 12 33000 Bordeaux \n", "193242 -0.586204 10298 Rue Rolland 14 33000 Bordeaux \n", "192719 -0.570672 6037 Rue Rolland 14 33000 Bordeaux \n", "\n", " d_latitude d_longitude total courier_id pickup_at \\\n", "order_id \n", "192594 44.842592 -0.580521 2050 1423.0 2016-07-18 12:38:08 \n", "192644 44.842592 -0.580521 2450 1426.0 2016-07-18 13:03:08 \n", "192658 44.838504 -0.591961 2550 1423.0 2016-07-18 13:19:04 \n", "193242 44.842592 -0.580521 1550 1420.0 2016-07-18 20:55:52 \n", "192719 44.842592 -0.580521 2450 1426.0 2016-07-18 14:01:23 \n", "\n", " delivery_at cancelled \n", "order_id \n", "192594 2016-07-18 12:48:22 0 \n", "192644 2016-07-18 13:12:01 0 \n", "192658 2016-07-18 13:29:03 0 \n", "193242 2016-07-18 21:05:28 0 \n", "192719 2016-07-18 14:08:36 0 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
placed_atrestaurant_idrestauranto_streeto_zipo_cityo_latitudeo_longitudecustomer_idd_streetd_zipd_cityd_latituded_longitudetotalcourier_idpickup_atdelivery_atcancelled
order_id
2115102016-07-30 20:46:051219La Tagliatella14 Rue Guiraude33000Bordeaux44.839388-0.57478180169Rue Pasteur 3533200Bordeaux44.845053-0.60115740851411.02016-07-30 21:11:002016-07-30 21:23:240
2115192016-07-30 20:46:551254Funky Burger5 Rue Du Loup33000Bordeaux44.838081-0.57228180172Rue Monadey 2833800Bordeaux44.828816-0.57078920501817.02016-07-30 21:05:462016-07-30 21:14:070
\n", "
" ], "text/plain": [ " placed_at restaurant_id restaurant o_street \\\n", "order_id \n", "211510 2016-07-30 20:46:05 1219 La Tagliatella 14 Rue Guiraude \n", "211519 2016-07-30 20:46:55 1254 Funky Burger 5 Rue Du Loup \n", "\n", " o_zip o_city o_latitude o_longitude customer_id \\\n", "order_id \n", "211510 33000 Bordeaux 44.839388 -0.574781 80169 \n", "211519 33000 Bordeaux 44.838081 -0.572281 80172 \n", "\n", " d_street d_zip d_city d_latitude d_longitude total \\\n", "order_id \n", "211510 Rue Pasteur 35 33200 Bordeaux 44.845053 -0.601157 4085 \n", "211519 Rue Monadey 28 33800 Bordeaux 44.828816 -0.570789 2050 \n", "\n", " courier_id pickup_at delivery_at cancelled \n", "order_id \n", "211510 1411.0 2016-07-30 21:11:00 2016-07-30 21:23:24 0 \n", "211519 1817.0 2016-07-30 21:05:46 2016-07-30 21:14:07 0 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.tail(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[.info() ](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html#pandas.DataFrame.info) shows on overview of the columns. In particular, it shows how many cells are filled in in a column (i.e., are \"non-null\") and what **data type** (i.e., \"dtype\") *all* values in a column have. \"int64\" and \"float64\" imply that there are only `int` and `float` values in a column (taking up to 64 bits or 1s and 0s in memory). \"object\" is [pandas ](https://pandas.pydata.org/)' way of telling us it could not deduce any data type more specific than textual data. For the columns holding timestamps (e.g., \"placed_at\") we will convert the values further below.\n", "\n", "Looking at the output, we see that some columns hold the data of **origin**-**destination** pairs, corresponding to restaurants and customers. Other columns store data following the dispatch and delivery process of couriers picking up and delivering meals at various points in time." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Index: 694 entries, 192594 to 211519\n", "Data columns (total 19 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 placed_at 694 non-null object \n", " 1 restaurant_id 694 non-null int64 \n", " 2 restaurant 694 non-null object \n", " 3 o_street 694 non-null object \n", " 4 o_zip 694 non-null int64 \n", " 5 o_city 694 non-null object \n", " 6 o_latitude 694 non-null float64\n", " 7 o_longitude 694 non-null float64\n", " 8 customer_id 694 non-null int64 \n", " 9 d_street 694 non-null object \n", " 10 d_zip 694 non-null int64 \n", " 11 d_city 694 non-null object \n", " 12 d_latitude 694 non-null float64\n", " 13 d_longitude 694 non-null float64\n", " 14 total 694 non-null int64 \n", " 15 courier_id 690 non-null float64\n", " 16 pickup_at 665 non-null object \n", " 17 delivery_at 663 non-null object \n", " 18 cancelled 694 non-null int64 \n", "dtypes: float64(5), int64(6), object(8)\n", "memory usage: 108.4+ KB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[.describe() ](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html#pandas.DataFrame.describe) shows statistics on all numerical columns in a `DataFrame`.\n", "\n", "For the example orders, such statistics may not be meaningful for all numerical columns as some of them merely hold IDs or zip codes." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
restaurant_ido_zipo_latitudeo_longitudecustomer_idd_zipd_latituded_longitudetotalcourier_idcancelled
count694.000000694.000000694.000000694.000000694.000000694.000000694.000000694.000000694.000000690.000000694.000000
mean1228.47982733075.21613844.839258-0.57575974751.12680133191.61383344.838623-0.5760402294.6368881484.7550720.044669
std18.001091207.9714350.0074710.00692014604.304963307.3786970.0115450.0107991060.695748154.5862100.206724
min1204.00000033000.00000044.818180-0.5994002377.00000033000.00000044.809813-0.606892350.0000001403.0000000.000000
25%1212.00000033000.00000044.836910-0.57934576648.50000033000.00000044.829981-0.5816121500.0000001415.0000000.000000
50%1224.00000033000.00000044.838287-0.57394078146.00000033000.00000044.838364-0.5750561969.5000001424.0000000.000000
75%1244.00000033000.00000044.841721-0.57228179331.50000033300.00000044.846696-0.5696012750.0000001462.0000000.000000
max1267.00000033800.00000044.855438-0.55057680401.00000033800.00000044.877693-0.5379528370.0000002013.0000001.000000
\n", "
" ], "text/plain": [ " restaurant_id o_zip o_latitude o_longitude customer_id \\\n", "count 694.000000 694.000000 694.000000 694.000000 694.000000 \n", "mean 1228.479827 33075.216138 44.839258 -0.575759 74751.126801 \n", "std 18.001091 207.971435 0.007471 0.006920 14604.304963 \n", "min 1204.000000 33000.000000 44.818180 -0.599400 2377.000000 \n", "25% 1212.000000 33000.000000 44.836910 -0.579345 76648.500000 \n", "50% 1224.000000 33000.000000 44.838287 -0.573940 78146.000000 \n", "75% 1244.000000 33000.000000 44.841721 -0.572281 79331.500000 \n", "max 1267.000000 33800.000000 44.855438 -0.550576 80401.000000 \n", "\n", " d_zip d_latitude d_longitude total courier_id \\\n", "count 694.000000 694.000000 694.000000 694.000000 690.000000 \n", "mean 33191.613833 44.838623 -0.576040 2294.636888 1484.755072 \n", "std 307.378697 0.011545 0.010799 1060.695748 154.586210 \n", "min 33000.000000 44.809813 -0.606892 350.000000 1403.000000 \n", "25% 33000.000000 44.829981 -0.581612 1500.000000 1415.000000 \n", "50% 33000.000000 44.838364 -0.575056 1969.500000 1424.000000 \n", "75% 33300.000000 44.846696 -0.569601 2750.000000 1462.000000 \n", "max 33800.000000 44.877693 -0.537952 8370.000000 2013.000000 \n", "\n", " cancelled \n", "count 694.000000 \n", "mean 0.044669 \n", "std 0.206724 \n", "min 0.000000 \n", "25% 0.000000 \n", "50% 0.000000 \n", "75% 0.000000 \n", "max 1.000000 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Indexing & Slicing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`DataFrame`s support being indexed or sliced, both in the row and column dimensions.\n", "\n", "To obtain all data in a single column, we index into the `DataFrame` with the column's name.\n", "\n", "For example, `restaurant_col` provides a list of only the restaurant names. Its index are still the \"order_id\"s." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "order_id\n", "192594 Max A Table\n", "192644 Max A Table\n", "192658 Taj Mahal\n", "193242 Chez Ambre And Michel\n", "192719 La Maison Du Glacier\n", "Name: restaurant, dtype: object" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "restaurant_col = df[\"restaurant\"]\n", "\n", "restaurant_col.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data type of a single column is [pd.Series ](https://pandas.pydata.org/docs/reference/api/pandas.Series.html#pandas.Series), which is very similar to a `DataFrame` with only one column. `Series` objects work like built-in `list`s with added functionalities." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(restaurant_col)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we index with a `list` of column names, the result is itself another `DataFrame`. That operation is like slicing out a smaller matrix from a larger one as we saw with `ndarray`s before.\n", "\n", "For example, let's pull out all location data of the orders' origins (i.e., restaurants)." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
o_streeto_zipo_cityo_latitudeo_longitude
order_id
19259436 Rue Cornac33000Bordeaux44.851402-0.575870
19264436 Rue Cornac33000Bordeaux44.851402-0.575870
19265824 Rue Du Parlement Sainte-Catherine33000Bordeaux44.840405-0.573940
1932421 Rue Matignon33000Bordeaux44.850258-0.586204
1927191 Place Saint Pierre33000Bordeaux44.839706-0.570672
\n", "
" ], "text/plain": [ " o_street o_zip o_city o_latitude \\\n", "order_id \n", "192594 36 Rue Cornac 33000 Bordeaux 44.851402 \n", "192644 36 Rue Cornac 33000 Bordeaux 44.851402 \n", "192658 24 Rue Du Parlement Sainte-Catherine 33000 Bordeaux 44.840405 \n", "193242 1 Rue Matignon 33000 Bordeaux 44.850258 \n", "192719 1 Place Saint Pierre 33000 Bordeaux 44.839706 \n", "\n", " o_longitude \n", "order_id \n", "192594 -0.575870 \n", "192644 -0.575870 \n", "192658 -0.573940 \n", "193242 -0.586204 \n", "192719 -0.570672 " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "origins = df[[\"o_street\", \"o_zip\", \"o_city\", \"o_latitude\", \"o_longitude\"]]\n", "\n", "origins.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To access individual rows, we index not into a `DataFrame` directly but into its [.loc ](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc) object (which also exists for `Series`).\n", "\n", "Here, `200800` is an \"order_id\" number. The result is a `Series` object where the original `DataFrame`'s columns become the index." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "placed_at 2016-07-24 19:30:52\n", "restaurant_id 1204\n", "restaurant Max A Table\n", "o_street 36 Rue Cornac\n", "o_zip 33000\n", "o_city Bordeaux\n", "o_latitude 44.851402\n", "o_longitude -0.57587\n", "customer_id 76187\n", "d_street Rue Judaique 213\n", "d_zip 33000\n", "d_city Bordeaux\n", "d_latitude 44.840829\n", "d_longitude -0.595445\n", "total 2250\n", "courier_id 1468.0\n", "pickup_at 2016-07-24 19:50:52\n", "delivery_at 2016-07-24 19:58:16\n", "cancelled 0\n", "Name: 200800, dtype: object" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[200800]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also index into the `restaurant_col` and `origins` objects from above. As `restaurant_col` is a `Series`, we get back a scalar value." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Max A Table'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "restaurant_col.loc[200800]" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "o_street 36 Rue Cornac\n", "o_zip 33000\n", "o_city Bordeaux\n", "o_latitude 44.851402\n", "o_longitude -0.57587\n", "Name: 200800, dtype: object" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "origins.loc[200800]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Slicing also works with [.loc ](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc). A tiny difference to Python's built-in slicing, the upper bound is included in the slice as well!" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
placed_atrestaurant_idrestauranto_streeto_zipo_cityo_latitudeo_longitudecustomer_idd_streetd_zipd_cityd_latituded_longitudetotalcourier_idpickup_atdelivery_atcancelled
order_id
2003002016-07-24 13:46:151207Le Jardin Pekinois9 Rue Des Freres Bonie33000Bordeaux44.837078-0.57957276030Rue Villeneuve 133000Bordeaux44.839927-0.58001238201426.02016-07-24 14:12:452016-07-24 14:16:590
2003052016-07-24 13:49:251207Le Jardin Pekinois9 Rue Des Freres Bonie33000Bordeaux44.837078-0.57957276033Rue Du Ha 5433000Bordeaux44.835898-0.57794116891405.02016-07-24 14:12:042016-07-24 14:15:540
2008002016-07-24 19:30:521204Max A Table36 Rue Cornac33000Bordeaux44.851402-0.57587076187Rue Judaique 21333000Bordeaux44.840829-0.59544522501468.02016-07-24 19:50:522016-07-24 19:58:160
\n", "
" ], "text/plain": [ " placed_at restaurant_id restaurant \\\n", "order_id \n", "200300 2016-07-24 13:46:15 1207 Le Jardin Pekinois \n", "200305 2016-07-24 13:49:25 1207 Le Jardin Pekinois \n", "200800 2016-07-24 19:30:52 1204 Max A Table \n", "\n", " o_street o_zip o_city o_latitude o_longitude \\\n", "order_id \n", "200300 9 Rue Des Freres Bonie 33000 Bordeaux 44.837078 -0.579572 \n", "200305 9 Rue Des Freres Bonie 33000 Bordeaux 44.837078 -0.579572 \n", "200800 36 Rue Cornac 33000 Bordeaux 44.851402 -0.575870 \n", "\n", " customer_id d_street d_zip d_city d_latitude \\\n", "order_id \n", "200300 76030 Rue Villeneuve 1 33000 Bordeaux 44.839927 \n", "200305 76033 Rue Du Ha 54 33000 Bordeaux 44.835898 \n", "200800 76187 Rue Judaique 213 33000 Bordeaux 44.840829 \n", "\n", " d_longitude total courier_id pickup_at \\\n", "order_id \n", "200300 -0.580012 3820 1426.0 2016-07-24 14:12:45 \n", "200305 -0.577941 1689 1405.0 2016-07-24 14:12:04 \n", "200800 -0.595445 2250 1468.0 2016-07-24 19:50:52 \n", "\n", " delivery_at cancelled \n", "order_id \n", "200300 2016-07-24 14:16:59 0 \n", "200305 2016-07-24 14:15:54 0 \n", "200800 2016-07-24 19:58:16 0 " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[200300:200800]" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "order_id\n", "200300 Le Jardin Pekinois\n", "200305 Le Jardin Pekinois\n", "200800 Max A Table\n", "Name: restaurant, dtype: object" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "restaurant_col.loc[200300:200800]" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
o_streeto_zipo_cityo_latitudeo_longitude
order_id
2003009 Rue Des Freres Bonie33000Bordeaux44.837078-0.579572
2003059 Rue Des Freres Bonie33000Bordeaux44.837078-0.579572
20080036 Rue Cornac33000Bordeaux44.851402-0.575870
\n", "
" ], "text/plain": [ " o_street o_zip o_city o_latitude o_longitude\n", "order_id \n", "200300 9 Rue Des Freres Bonie 33000 Bordeaux 44.837078 -0.579572\n", "200305 9 Rue Des Freres Bonie 33000 Bordeaux 44.837078 -0.579572\n", "200800 36 Rue Cornac 33000 Bordeaux 44.851402 -0.575870" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "origins.loc[200300:200800]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[.loc ](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc) also allows us to index and slice in both dimensions simultaneously. The first index or slice goes along the row dimension while the second index or slice selects the columns." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
o_streeto_zipo_cityo_latitudeo_longitude
order_id
2003009 Rue Des Freres Bonie33000Bordeaux44.837078-0.579572
2003059 Rue Des Freres Bonie33000Bordeaux44.837078-0.579572
20080036 Rue Cornac33000Bordeaux44.851402-0.575870
\n", "
" ], "text/plain": [ " o_street o_zip o_city o_latitude o_longitude\n", "order_id \n", "200300 9 Rue Des Freres Bonie 33000 Bordeaux 44.837078 -0.579572\n", "200305 9 Rue Des Freres Bonie 33000 Bordeaux 44.837078 -0.579572\n", "200800 36 Rue Cornac 33000 Bordeaux 44.851402 -0.575870" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[200300:200800, [\"o_street\", \"o_zip\", \"o_city\", \"o_latitude\", \"o_longitude\"]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Type Casting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As [.info() ](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html#pandas.DataFrame.info) already revealed above, the timestamp columns could only be parsed as generic objects (i.e., textual data). Also, the \"cancelled\" column which holds only `True` or `False` values does not have a `bool` data type." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Index: 694 entries, 192594 to 211519\n", "Data columns (total 19 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 placed_at 694 non-null object \n", " 1 restaurant_id 694 non-null int64 \n", " 2 restaurant 694 non-null object \n", " 3 o_street 694 non-null object \n", " 4 o_zip 694 non-null int64 \n", " 5 o_city 694 non-null object \n", " 6 o_latitude 694 non-null float64\n", " 7 o_longitude 694 non-null float64\n", " 8 customer_id 694 non-null int64 \n", " 9 d_street 694 non-null object \n", " 10 d_zip 694 non-null int64 \n", " 11 d_city 694 non-null object \n", " 12 d_latitude 694 non-null float64\n", " 13 d_longitude 694 non-null float64\n", " 14 total 694 non-null int64 \n", " 15 courier_id 690 non-null float64\n", " 16 pickup_at 665 non-null object \n", " 17 delivery_at 663 non-null object \n", " 18 cancelled 694 non-null int64 \n", "dtypes: float64(5), int64(6), object(8)\n", "memory usage: 124.6+ KB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [pd.to_datetime() ](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html#pandas.to_datetime) function **casts** the timestamp columns correctly." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "order_id\n", "192594 2016-07-18 12:23:13\n", "192644 2016-07-18 12:48:55\n", "192658 2016-07-18 13:00:13\n", "193242 2016-07-18 20:39:54\n", "192719 2016-07-18 13:52:04\n", " ... \n", "212021 2016-07-30 22:29:52\n", "211501 2016-07-30 20:44:50\n", "211508 2016-07-30 20:45:55\n", "211510 2016-07-30 20:46:05\n", "211519 2016-07-30 20:46:55\n", "Name: placed_at, Length: 694, dtype: datetime64[ns]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.to_datetime(df[\"placed_at\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's overwrite the original \"placed_at\" column with one that has the correct data type." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "df[\"placed_at\"] = pd.to_datetime(df[\"placed_at\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [.astype() ](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html#pandas.DataFrame.astype) method generalizes this idea and allows us to cast several columns in a `DataFrame`. It takes a `dict`ionary mapping column names to data types as its input. Instead of references to actual data types (e.g., `bool`), it also understands [pandas ](https://pandas.pydata.org/)-specific data types provides as text." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "df = df.astype(\n", " {\n", " \"pickup_at\": \"datetime64[ns]\",\n", " \"delivery_at\": \"datetime64[ns]\",\n", " \"cancelled\": bool,\n", " }\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, all columns in `df` have more applicable data types." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Index: 694 entries, 192594 to 211519\n", "Data columns (total 19 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 placed_at 694 non-null datetime64[ns]\n", " 1 restaurant_id 694 non-null int64 \n", " 2 restaurant 694 non-null object \n", " 3 o_street 694 non-null object \n", " 4 o_zip 694 non-null int64 \n", " 5 o_city 694 non-null object \n", " 6 o_latitude 694 non-null float64 \n", " 7 o_longitude 694 non-null float64 \n", " 8 customer_id 694 non-null int64 \n", " 9 d_street 694 non-null object \n", " 10 d_zip 694 non-null int64 \n", " 11 d_city 694 non-null object \n", " 12 d_latitude 694 non-null float64 \n", " 13 d_longitude 694 non-null float64 \n", " 14 total 694 non-null int64 \n", " 15 courier_id 690 non-null float64 \n", " 16 pickup_at 665 non-null datetime64[ns]\n", " 17 delivery_at 663 non-null datetime64[ns]\n", " 18 cancelled 694 non-null bool \n", "dtypes: bool(1), datetime64[ns](3), float64(5), int64(5), object(5)\n", "memory usage: 119.9+ KB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filtering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A common operation when working with `DataFrame`s is to filter for rows fulfilling certain conditions. That is implemented by so-called **boolean filters** in [pandas ](https://pandas.pydata.org/), which is simply a `DataFrame` or `Series` holding only `True` or `False` values.\n", "\n", "One way to obtain such objects is to use relational operators with columns.\n", "\n", "`max_a_table` holds `True` values for all orders at the restaurant with the ID `1204`." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "order_id\n", "192594 True\n", "192644 True\n", "192658 False\n", "193242 False\n", "192719 False\n", " ... \n", "212021 False\n", "211501 True\n", "211508 False\n", "211510 False\n", "211519 False\n", "Name: restaurant_id, Length: 694, dtype: bool" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "max_a_table = df[\"restaurant_id\"] == 1204\n", "\n", "max_a_table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's use a boolean filter to index into `df`. That gives us back a new `DataFame` with all orders belonging to the restaurant \"Max A Table\"." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
placed_atrestaurant_idrestauranto_streeto_zipo_cityo_latitudeo_longitudecustomer_idd_streetd_zipd_cityd_latituded_longitudetotalcourier_idpickup_atdelivery_atcancelled
order_id
1925942016-07-18 12:23:131204Max A Table36 Rue Cornac33000Bordeaux44.851402-0.5758710298Rue Rolland 1433000Bordeaux44.842592-0.58052120501423.02016-07-18 12:38:082016-07-18 12:48:22False
1926442016-07-18 12:48:551204Max A Table36 Rue Cornac33000Bordeaux44.851402-0.575876037Rue Rolland 1433000Bordeaux44.842592-0.58052124501426.02016-07-18 13:03:082016-07-18 13:12:01False
1943352016-07-19 20:35:211204Max A Table36 Rue Cornac33000Bordeaux44.851402-0.5758774268Place Canteloup 1233800Bordeaux44.833834-0.56567431001420.02016-07-19 20:51:162016-07-19 21:01:08False
1966152016-07-21 19:50:151204Max A Table36 Rue Cornac33000Bordeaux44.851402-0.5758774901Rue Marcelin Jourdan 5533200Bordeaux44.850360-0.59736120501418.02016-07-21 20:12:292016-07-21 20:25:57False
1968392016-07-21 20:27:221204Max A Table36 Rue Cornac33000Bordeaux44.851402-0.5758774966Rue Sainte-Catherine 13733000Bordeaux44.836516-0.57398337501472.02016-07-21 20:41:422016-07-21 21:14:41False
\n", "
" ], "text/plain": [ " placed_at restaurant_id restaurant o_street \\\n", "order_id \n", "192594 2016-07-18 12:23:13 1204 Max A Table 36 Rue Cornac \n", "192644 2016-07-18 12:48:55 1204 Max A Table 36 Rue Cornac \n", "194335 2016-07-19 20:35:21 1204 Max A Table 36 Rue Cornac \n", "196615 2016-07-21 19:50:15 1204 Max A Table 36 Rue Cornac \n", "196839 2016-07-21 20:27:22 1204 Max A Table 36 Rue Cornac \n", "\n", " o_zip o_city o_latitude o_longitude customer_id \\\n", "order_id \n", "192594 33000 Bordeaux 44.851402 -0.57587 10298 \n", "192644 33000 Bordeaux 44.851402 -0.57587 6037 \n", "194335 33000 Bordeaux 44.851402 -0.57587 74268 \n", "196615 33000 Bordeaux 44.851402 -0.57587 74901 \n", "196839 33000 Bordeaux 44.851402 -0.57587 74966 \n", "\n", " d_street d_zip d_city d_latitude d_longitude \\\n", "order_id \n", "192594 Rue Rolland 14 33000 Bordeaux 44.842592 -0.580521 \n", "192644 Rue Rolland 14 33000 Bordeaux 44.842592 -0.580521 \n", "194335 Place Canteloup 12 33800 Bordeaux 44.833834 -0.565674 \n", "196615 Rue Marcelin Jourdan 55 33200 Bordeaux 44.850360 -0.597361 \n", "196839 Rue Sainte-Catherine 137 33000 Bordeaux 44.836516 -0.573983 \n", "\n", " total courier_id pickup_at delivery_at cancelled \n", "order_id \n", "192594 2050 1423.0 2016-07-18 12:38:08 2016-07-18 12:48:22 False \n", "192644 2450 1426.0 2016-07-18 13:03:08 2016-07-18 13:12:01 False \n", "194335 3100 1420.0 2016-07-19 20:51:16 2016-07-19 21:01:08 False \n", "196615 2050 1418.0 2016-07-21 20:12:29 2016-07-21 20:25:57 False \n", "196839 3750 1472.0 2016-07-21 20:41:42 2016-07-21 21:14:41 False " ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[df[\"restaurant_id\"] == 1204].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of an explicit condition, we can also use a reference to a boolean filter created above." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
placed_atrestaurant_idrestauranto_streeto_zipo_cityo_latitudeo_longitudecustomer_idd_streetd_zipd_cityd_latituded_longitudetotalcourier_idpickup_atdelivery_atcancelled
order_id
1925942016-07-18 12:23:131204Max A Table36 Rue Cornac33000Bordeaux44.851402-0.5758710298Rue Rolland 1433000Bordeaux44.842592-0.58052120501423.02016-07-18 12:38:082016-07-18 12:48:22False
1926442016-07-18 12:48:551204Max A Table36 Rue Cornac33000Bordeaux44.851402-0.575876037Rue Rolland 1433000Bordeaux44.842592-0.58052124501426.02016-07-18 13:03:082016-07-18 13:12:01False
1943352016-07-19 20:35:211204Max A Table36 Rue Cornac33000Bordeaux44.851402-0.5758774268Place Canteloup 1233800Bordeaux44.833834-0.56567431001420.02016-07-19 20:51:162016-07-19 21:01:08False
1966152016-07-21 19:50:151204Max A Table36 Rue Cornac33000Bordeaux44.851402-0.5758774901Rue Marcelin Jourdan 5533200Bordeaux44.850360-0.59736120501418.02016-07-21 20:12:292016-07-21 20:25:57False
1968392016-07-21 20:27:221204Max A Table36 Rue Cornac33000Bordeaux44.851402-0.5758774966Rue Sainte-Catherine 13733000Bordeaux44.836516-0.57398337501472.02016-07-21 20:41:422016-07-21 21:14:41False
\n", "
" ], "text/plain": [ " placed_at restaurant_id restaurant o_street \\\n", "order_id \n", "192594 2016-07-18 12:23:13 1204 Max A Table 36 Rue Cornac \n", "192644 2016-07-18 12:48:55 1204 Max A Table 36 Rue Cornac \n", "194335 2016-07-19 20:35:21 1204 Max A Table 36 Rue Cornac \n", "196615 2016-07-21 19:50:15 1204 Max A Table 36 Rue Cornac \n", "196839 2016-07-21 20:27:22 1204 Max A Table 36 Rue Cornac \n", "\n", " o_zip o_city o_latitude o_longitude customer_id \\\n", "order_id \n", "192594 33000 Bordeaux 44.851402 -0.57587 10298 \n", "192644 33000 Bordeaux 44.851402 -0.57587 6037 \n", "194335 33000 Bordeaux 44.851402 -0.57587 74268 \n", "196615 33000 Bordeaux 44.851402 -0.57587 74901 \n", "196839 33000 Bordeaux 44.851402 -0.57587 74966 \n", "\n", " d_street d_zip d_city d_latitude d_longitude \\\n", "order_id \n", "192594 Rue Rolland 14 33000 Bordeaux 44.842592 -0.580521 \n", "192644 Rue Rolland 14 33000 Bordeaux 44.842592 -0.580521 \n", "194335 Place Canteloup 12 33800 Bordeaux 44.833834 -0.565674 \n", "196615 Rue Marcelin Jourdan 55 33200 Bordeaux 44.850360 -0.597361 \n", "196839 Rue Sainte-Catherine 137 33000 Bordeaux 44.836516 -0.573983 \n", "\n", " total courier_id pickup_at delivery_at cancelled \n", "order_id \n", "192594 2050 1423.0 2016-07-18 12:38:08 2016-07-18 12:48:22 False \n", "192644 2450 1426.0 2016-07-18 13:03:08 2016-07-18 13:12:01 False \n", "194335 3100 1420.0 2016-07-19 20:51:16 2016-07-19 21:01:08 False \n", "196615 2050 1418.0 2016-07-21 20:12:29 2016-07-21 20:25:57 False \n", "196839 3750 1472.0 2016-07-21 20:41:42 2016-07-21 21:14:41 False " ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[max_a_table].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Combining the filter with a `list` of columns allows us to further narrow down the `DataFrame`.\n", "\n", "For example, the preview below shows us the first five customers \"Max A Table\" had in the target period." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
customer_idd_streetd_zipd_cityd_latituded_longitude
order_id
19259410298Rue Rolland 1433000Bordeaux44.842592-0.580521
1926446037Rue Rolland 1433000Bordeaux44.842592-0.580521
19433574268Place Canteloup 1233800Bordeaux44.833834-0.565674
19661574901Rue Marcelin Jourdan 5533200Bordeaux44.850360-0.597361
19683974966Rue Sainte-Catherine 13733000Bordeaux44.836516-0.573983
\n", "
" ], "text/plain": [ " customer_id d_street d_zip d_city d_latitude \\\n", "order_id \n", "192594 10298 Rue Rolland 14 33000 Bordeaux 44.842592 \n", "192644 6037 Rue Rolland 14 33000 Bordeaux 44.842592 \n", "194335 74268 Place Canteloup 12 33800 Bordeaux 44.833834 \n", "196615 74901 Rue Marcelin Jourdan 55 33200 Bordeaux 44.850360 \n", "196839 74966 Rue Sainte-Catherine 137 33000 Bordeaux 44.836516 \n", "\n", " d_longitude \n", "order_id \n", "192594 -0.580521 \n", "192644 -0.580521 \n", "194335 -0.565674 \n", "196615 -0.597361 \n", "196839 -0.573983 " ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[\n", " max_a_table,\n", " [\"customer_id\", \"d_street\", \"d_zip\", \"d_city\", \"d_latitude\", \"d_longitude\"],\n", "].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Boolean filters can be created in an arbitray fashion by combining several conditions with `&` and `|` modeling logical AND and OR operators.\n", "\n", "The example lists the first five customers of \"Max A Table\" in a target area provided as latitude-longitude coordinates." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
customer_idd_streetd_zipd_cityd_latituded_longitude
order_id
19661574901Rue Marcelin Jourdan 5533200Bordeaux44.850360-0.597361
20080076187Rue Judaique 21333000Bordeaux44.840829-0.595445
20089376218Rue Notre Dame 2133000Bordeaux44.850260-0.572377
20278876786Rue De Leybardie 2733300Bordeaux44.861360-0.565057
20256376730Rue Lombard 4733300Bordeaux44.858661-0.563095
\n", "
" ], "text/plain": [ " customer_id d_street d_zip d_city d_latitude \\\n", "order_id \n", "196615 74901 Rue Marcelin Jourdan 55 33200 Bordeaux 44.850360 \n", "200800 76187 Rue Judaique 213 33000 Bordeaux 44.840829 \n", "200893 76218 Rue Notre Dame 21 33000 Bordeaux 44.850260 \n", "202788 76786 Rue De Leybardie 27 33300 Bordeaux 44.861360 \n", "202563 76730 Rue Lombard 47 33300 Bordeaux 44.858661 \n", "\n", " d_longitude \n", "order_id \n", "196615 -0.597361 \n", "200800 -0.595445 \n", "200893 -0.572377 \n", "202788 -0.565057 \n", "202563 -0.563095 " ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[\n", " (\n", " max_a_table\n", " &\n", " (\n", " (df[\"d_latitude\"] > 44.85) | (df[\"d_longitude\"] < -0.59)\n", " )\n", " ),\n", " [\"customer_id\", \"d_street\", \"d_zip\", \"d_city\", \"d_latitude\", \"d_longitude\"],\n", "].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[.isin() ](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html#pandas.DataFrame.isin) is another useful method: It allows us to provide a `list` of values that we are filtering for in a column." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
placed_atcustomer_idd_streetd_zipd_citytotal
order_id
1926442016-07-18 12:48:556037Rue Rolland 1433000Bordeaux2450
2109452016-07-30 19:30:3979900Rue Du Couvent 1633000Bordeaux1650
2113632016-07-30 20:27:4580095Rue De La Porte Saint-Jean 833000Bordeaux2400
\n", "
" ], "text/plain": [ " placed_at customer_id d_street \\\n", "order_id \n", "192644 2016-07-18 12:48:55 6037 Rue Rolland 14 \n", "210945 2016-07-30 19:30:39 79900 Rue Du Couvent 16 \n", "211363 2016-07-30 20:27:45 80095 Rue De La Porte Saint-Jean 8 \n", "\n", " d_zip d_city total \n", "order_id \n", "192644 33000 Bordeaux 2450 \n", "210945 33000 Bordeaux 1650 \n", "211363 33000 Bordeaux 2400 " ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[\n", " (max_a_table & df[\"customer_id\"].isin([6037, 79900, 80095])),\n", " [\"placed_at\", \"customer_id\", \"d_street\", \"d_zip\", \"d_city\", \"total\"],\n", "].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `~` operator negates a condition. So, in the cell below we see all orders at \"Max A Table\" except the ones from the indicated customers." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
placed_atcustomer_idd_streetd_zipd_citytotal
order_id
1925942016-07-18 12:23:1310298Rue Rolland 1433000Bordeaux2050
1943352016-07-19 20:35:2174268Place Canteloup 1233800Bordeaux3100
1966152016-07-21 19:50:1574901Rue Marcelin Jourdan 5533200Bordeaux2050
1968392016-07-21 20:27:2274966Rue Sainte-Catherine 13733000Bordeaux3750
1986312016-07-22 21:29:4075047Rue Boudet 2933000Bordeaux2650
\n", "
" ], "text/plain": [ " placed_at customer_id d_street d_zip \\\n", "order_id \n", "192594 2016-07-18 12:23:13 10298 Rue Rolland 14 33000 \n", "194335 2016-07-19 20:35:21 74268 Place Canteloup 12 33800 \n", "196615 2016-07-21 19:50:15 74901 Rue Marcelin Jourdan 55 33200 \n", "196839 2016-07-21 20:27:22 74966 Rue Sainte-Catherine 137 33000 \n", "198631 2016-07-22 21:29:40 75047 Rue Boudet 29 33000 \n", "\n", " d_city total \n", "order_id \n", "192594 Bordeaux 2050 \n", "194335 Bordeaux 3100 \n", "196615 Bordeaux 2050 \n", "196839 Bordeaux 3750 \n", "198631 Bordeaux 2650 " ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[\n", " (max_a_table & ~df[\"customer_id\"].isin([6037, 79900, 80095])),\n", " [\"placed_at\", \"customer_id\", \"d_street\", \"d_zip\", \"d_city\", \"total\"],\n", "].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## DataFrame Methods" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have learned the basics of selecting the data we want from a `DataFrame`, let's look at a couple of methods that allow us to obtain some infos out of a `DataFrame`, in particular, to run some **descriptive statistics**.\n", "\n", "[.unique() ](https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html#pandas.Series.unique) is a simple `Series` method returning an `ndarray` with all values that are in the `Series` once.\n", "\n", "Here, we get an overview of how many restaurants there are in Bordeaux in the target time horizon." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1204, 1205, 1208, 1206, 1209, 1207, 1211, 1213, 1214, 1212, 1216,\n", " 1215, 1217, 1218, 1219, 1220, 1221, 1223, 1222, 1224, 1225, 1229,\n", " 1226, 1227, 1230, 1231, 1232, 1233, 1234, 1235, 1236, 1237, 1239,\n", " 1241, 1242, 1243, 1245, 1244, 1246, 1247, 1249, 1254, 1250, 1256,\n", " 1258, 1259, 1260, 1263, 1264, 1266, 1265, 1267])" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"restaurant_id\"].unique()" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "52" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df[\"restaurant_id\"].unique())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To obtain an `ndarray` of all customer IDs of \"Max A Table\", we write the following." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([10298, 6037, 74268, 74901, 74966, 75047, 76187, 76218, 76442,\n", " 76396, 76421, 76786, 76822, 76730, 76871, 75687, 77409, 77386,\n", " 77355, 77556, 78129, 78353, 78608, 78621, 78958, 79119, 79153,\n", " 76838, 79234, 79486, 79576, 79563, 79653, 79900, 79912, 80026,\n", " 80204, 80095, 80163])" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[max_a_table, \"customer_id\"].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[.value_counts() ](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html#pandas.DataFrame.value_counts) is similar to [.unique() ](https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html#pandas.Series.unique) and provides an array sorted by the counts of how often an element occurs in a column or `Series` in descending order.\n", "\n", "We use it to list the `10` most popular restaurants and customers in the dataset." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "restaurant_id\n", "1254 78\n", "1207 47\n", "1204 39\n", "1217 37\n", "1212 32\n", "1244 25\n", "1225 25\n", "1249 23\n", "1242 19\n", "1221 18\n", "Name: count, dtype: int64" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"restaurant_id\"].value_counts().head(10)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "customer_id\n", "73919 14\n", "10298 12\n", "6037 8\n", "77048 5\n", "4210 4\n", "74426 4\n", "9304 3\n", "76838 3\n", "75905 3\n", "74791 3\n", "Name: count, dtype: int64" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"customer_id\"].value_counts().head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[.sum() ](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html#pandas.DataFrame.sum), [.min() ](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.min.html#pandas.DataFrame.min), [.max() ](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.max.html#pandas.DataFrame.max), [.mean() ](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html#pandas.DataFrame.mean), and [.round() ](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.round.html#pandas.DataFrame.round) are self-explanatory.\n", "\n", "We use it to analyze the overall spendings in Bordeaux and for \"Max A Table\"." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "np.float64(15924.78)" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"total\"].sum() / 100 # Convert to Euro" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "np.float64(885.0)" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[max_a_table, \"total\"].sum() / 100" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "np.float64(3.5)" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"total\"].min() / 100" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "np.float64(83.7)" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"total\"].max() / 100" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "np.float64(12.5)" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[max_a_table, \"total\"].min() / 100" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "np.float64(60.0)" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[max_a_table, \"total\"].max() / 100" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "np.float64(22.94636887608069)" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"total\"].mean() / 100" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "np.float64(22.95)" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"total\"].mean().round() / 100" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "np.float64(22.69)" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[max_a_table, \"total\"].mean().round() / 100" ] } ], "metadata": { "kernelspec": { "display_name": "intro-to-data-science", "language": "python", "name": "intro-to-data-science" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.4" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }