{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Cleaning the raw Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The purpose of this notebook is to take the original PostgreSQL database and clean the data in it. The output is a set of tables in a new PostgreSQL schema that hold the cleaned data.\n",
    "\n",
    "Cleaning occurs at several levels:\n",
    "- textual data is unified (e.g., spelling, accents, ...)\n",
    "- duplicate rows/records are merged together\n",
    "- numeric columns are checked for plausibility\n",
    "- foreign key relationships are strictly enforced\n",
    "\n",
    "The structure of the data can be viewed at the [ORM layer](https://github.com/webartifex/urban-meal-delivery/tree/develop/src/urban_meal_delivery/db) in the package."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[32murban-meal-delivery\u001b[0m, version \u001b[34m0.2.0\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "!umd --version"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from urban_meal_delivery import config, db"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import collections\n",
    "import datetime\n",
    "import hashlib\n",
    "\n",
    "import pandas as pd\n",
    "import pytz as tz\n",
    "import numpy as np\n",
    "import sqlalchemy as sa"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Settings & Globals"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext lab_black"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.set_option(\"display.max_columns\", 999)\n",
    "pd.set_option(\"display.max_rows\", 999)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "connection = db.connection"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### New Database Schema"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a result of this notebook, a new PostgreSQL schema called `\"clean\"` is created holding the tables with the cleaned data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'clean'"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "config.CLEAN_SCHEMA"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All tables with the original data are stored in the default PostgreSQL schema called `\"public\"`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'public'"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "config.ORIGINAL_SCHEMA"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use `alembic` to run the very first database migration script that creates the new tables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO  [alembic.runtime.migration] Context impl PostgresqlImpl.\n",
      "INFO  [alembic.runtime.migration] Will assume transactional DDL.\n",
      "INFO  [alembic.runtime.migration] Running upgrade  -> f11cd76d2f45, Create the database from scratch.\n"
     ]
    }
   ],
   "source": [
    "%cd -q ..\n",
    "!alembic upgrade f11cd76d2f45\n",
    "%cd -q notebooks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Cities"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "While the original database consists of data concerning the UDP's operations in five cities in France, we only look at `\"Bordeaux\"`, `\"Lyon\"`, and `\"Paris\"` in this research project, as the amount of data for `\"Lille\"` and `\"Nantes\"` is simply not a lot due to the very short time horizons the UDP had been operating there."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Raw Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following `target_cities` data were manually obtained from Google Maps and mapped to the `\"database_id\"`s of the cities in the original database where the UDP was operating in."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "target_cities = {\n",
    "    \"Bordeaux\": {\n",
    "        \"database_id\": 4,\n",
    "        \"google_maps_data\": {\n",
    "            \"center_latitude\": 44.837789,\n",
    "            \"center_longitude\": -0.57918,\n",
    "            \"northeast_latitude\": 44.91670389999999,\n",
    "            \"northeast_longitude\": -0.5333089999999999,\n",
    "            \"southwest_latitude\": 44.810752,\n",
    "            \"southwest_longitude\": -0.638973,\n",
    "            \"initial_zoom\": 13,\n",
    "        },\n",
    "    },\n",
    "    \"Lyon\": {\n",
    "        \"database_id\": 1,\n",
    "        \"google_maps_data\": {\n",
    "            \"center_latitude\": 45.764043,\n",
    "            \"center_longitude\": 4.835659,\n",
    "            \"northeast_latitude\": 45.808425,\n",
    "            \"northeast_longitude\": 4.898393,\n",
    "            \"southwest_latitude\": 45.707486,\n",
    "            \"southwest_longitude\": 4.7718489,\n",
    "            \"initial_zoom\": 13,\n",
    "        },\n",
    "    },\n",
    "    \"Paris\": {\n",
    "        \"database_id\": 2,\n",
    "        \"google_maps_data\": {\n",
    "            \"center_latitude\": 48.856614,\n",
    "            \"center_longitude\": 2.3522219,\n",
    "            \"northeast_latitude\": 48.9021449,\n",
    "            \"northeast_longitude\": 2.4699208,\n",
    "            \"southwest_latitude\": 48.815573,\n",
    "            \"southwest_longitude\": 2.225193,\n",
    "            \"initial_zoom\": 12,\n",
    "        },\n",
    "    },\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "city_ids = tuple(city[\"database_id\"] for city in target_cities.values())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`cities` below holds the cleaned city related data from the original database. They come with KML data (i.e., area) associated with a city, which is kept."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "cities = pd.read_sql_query(\n",
    "    f\"\"\"\n",
    "    SELECT\n",
    "        cities.id,\n",
    "        cities.name,\n",
    "        geo_areas.kml\n",
    "    FROM\n",
    "        {config.ORIGINAL_SCHEMA}.cities\n",
    "    LEFT OUTER JOIN\n",
    "        {config.ORIGINAL_SCHEMA}.geo_areas ON cities.geo_area_id = geo_areas.id\n",
    "    WHERE\n",
    "        cities.id IN %(city_ids)s\n",
    "    ORDER BY\n",
    "        cities.id\n",
    "    \"\"\",\n",
    "    con=connection,\n",
    "    index_col=\"id\",\n",
    "    params={\"city_ids\": city_ids},\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Merge in the data from Google Maps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "for city in target_cities.values():\n",
    "    for col, val in city[\"google_maps_data\"].items():\n",
    "        cities.loc[city[\"database_id\"], col] = val"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Cast the columns' types explicitly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "cities = cities.astype(\n",
    "    {\n",
    "        \"name\": \"string\",\n",
    "        \"kml\": \"string\",\n",
    "        \"center_latitude\": float,\n",
    "        \"center_longitude\": float,\n",
    "        \"northeast_latitude\": float,\n",
    "        \"northeast_longitude\": float,\n",
    "        \"southwest_latitude\": float,\n",
    "        \"southwest_longitude\": float,\n",
    "        \"initial_zoom\": int,\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Clean Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>kml</th>\n",
       "      <th>center_latitude</th>\n",
       "      <th>center_longitude</th>\n",
       "      <th>northeast_latitude</th>\n",
       "      <th>northeast_longitude</th>\n",
       "      <th>southwest_latitude</th>\n",
       "      <th>southwest_longitude</th>\n",
       "      <th>initial_zoom</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Lyon</td>\n",
       "      <td>&lt;?xml version='1.0' encoding='UTF-8'?&gt;&lt;kml xml...</td>\n",
       "      <td>45.764043</td>\n",
       "      <td>4.835659</td>\n",
       "      <td>45.808425</td>\n",
       "      <td>4.898393</td>\n",
       "      <td>45.707486</td>\n",
       "      <td>4.771849</td>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Paris</td>\n",
       "      <td>&lt;?xml version='1.0' encoding='UTF-8'?&gt;&lt;kml xml...</td>\n",
       "      <td>48.856614</td>\n",
       "      <td>2.352222</td>\n",
       "      <td>48.902145</td>\n",
       "      <td>2.469921</td>\n",
       "      <td>48.815573</td>\n",
       "      <td>2.225193</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Bordeaux</td>\n",
       "      <td>&lt;?xml version='1.0' encoding='UTF-8'?&gt;&lt;kml xml...</td>\n",
       "      <td>44.837789</td>\n",
       "      <td>-0.579180</td>\n",
       "      <td>44.916704</td>\n",
       "      <td>-0.533309</td>\n",
       "      <td>44.810752</td>\n",
       "      <td>-0.638973</td>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        name                                                kml  \\\n",
       "id                                                                \n",
       "1       Lyon  <?xml version='1.0' encoding='UTF-8'?><kml xml...   \n",
       "2      Paris  <?xml version='1.0' encoding='UTF-8'?><kml xml...   \n",
       "4   Bordeaux  <?xml version='1.0' encoding='UTF-8'?><kml xml...   \n",
       "\n",
       "    center_latitude  center_longitude  northeast_latitude  \\\n",
       "id                                                          \n",
       "1         45.764043          4.835659           45.808425   \n",
       "2         48.856614          2.352222           48.902145   \n",
       "4         44.837789         -0.579180           44.916704   \n",
       "\n",
       "    northeast_longitude  southwest_latitude  southwest_longitude  initial_zoom  \n",
       "id                                                                              \n",
       "1              4.898393           45.707486             4.771849            13  \n",
       "2              2.469921           48.815573             2.225193            12  \n",
       "4             -0.533309           44.810752            -0.638973            13  "
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cities.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Int64Index: 3 entries, 1 to 4\n",
      "Data columns (total 9 columns):\n",
      " #   Column               Non-Null Count  Dtype  \n",
      "---  ------               --------------  -----  \n",
      " 0   name                 3 non-null      string \n",
      " 1   kml                  3 non-null      string \n",
      " 2   center_latitude      3 non-null      float64\n",
      " 3   center_longitude     3 non-null      float64\n",
      " 4   northeast_latitude   3 non-null      float64\n",
      " 5   northeast_longitude  3 non-null      float64\n",
      " 6   southwest_latitude   3 non-null      float64\n",
      " 7   southwest_longitude  3 non-null      float64\n",
      " 8   initial_zoom         3 non-null      int64  \n",
      "dtypes: float64(6), int64(1), string(2)\n",
      "memory usage: 320.0 bytes\n"
     ]
    }
   ],
   "source": [
    "cities.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As this notebook was developed iteratively, we validate that the cleaned data stays unchanged using SHA256 checksums of the cleaned DataFrames and other `assert`s."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert (\n",
    "    hashlib.sha256(cities.to_json().encode()).hexdigest()\n",
    "    == \"800689a6ba5b6d03f583f258e058eca0b12e6df8e34c98bfe7aec246ed688c92\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Addresses"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Raw Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Only load addresses with orders in the target cities, excluding the cut-off day."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "addresses = pd.read_sql_query(\n",
    "    f\"\"\"\n",
    "    SELECT\n",
    "        id,\n",
    "        created_at,\n",
    "        place_id,\n",
    "        latitude,\n",
    "        longitude,\n",
    "        city_id,\n",
    "        city_name AS city,\n",
    "        zip,\n",
    "        street_address AS street,\n",
    "        floor,\n",
    "        special_instructions\n",
    "    FROM\n",
    "        {config.ORIGINAL_SCHEMA}.addresses\n",
    "    WHERE\n",
    "        city_id IN %(city_ids)s\n",
    "        AND\n",
    "        id IN (\n",
    "            SELECT DISTINCT address_id\n",
    "            FROM (\n",
    "                SELECT DISTINCT\n",
    "                    pickup_address_id AS address_id\n",
    "                FROM\n",
    "                    {config.ORIGINAL_SCHEMA}.orders\n",
    "                WHERE\n",
    "                    created_at < '{config.CUTOFF_DAY}'\n",
    "                UNION\n",
    "                SELECT DISTINCT\n",
    "                    dropoff_address_id AS address_id\n",
    "                FROM\n",
    "                    {config.ORIGINAL_SCHEMA}.orders\n",
    "                WHERE\n",
    "                    created_at < '{config.CUTOFF_DAY}'\n",
    "            ) AS orders\n",
    "        )\n",
    "    ORDER BY\n",
    "        id\n",
    "    \"\"\",\n",
    "    con=connection,\n",
    "    index_col=\"id\",\n",
    "    params={\"city_ids\": city_ids},\n",
    "    parse_dates=[\"created_at\"],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All columns are `\"strings\"`, even `zip`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "addresses = addresses.astype(\n",
    "    {\n",
    "        \"place_id\": \"string\",\n",
    "        \"city\": \"string\",\n",
    "        \"zip\": \"string\",\n",
    "        \"street\": \"string\",\n",
    "        \"floor\": \"string\",\n",
    "        \"special_instructions\": \"string\",\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>created_at</th>\n",
       "      <th>place_id</th>\n",
       "      <th>latitude</th>\n",
       "      <th>longitude</th>\n",
       "      <th>city_id</th>\n",
       "      <th>city</th>\n",
       "      <th>zip</th>\n",
       "      <th>street</th>\n",
       "      <th>floor</th>\n",
       "      <th>special_instructions</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2016-02-22 09:42:10.204171</td>\n",
       "      <td>ChIJSfxJmlXq9EcRX2ChkiPW9J8</td>\n",
       "      <td>45.763149</td>\n",
       "      <td>4.832660</td>\n",
       "      <td>1</td>\n",
       "      <td>Lyon</td>\n",
       "      <td>69002</td>\n",
       "      <td>31 rue Mercière</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2016-02-22 09:42:10.351150</td>\n",
       "      <td>ChIJwwMvNPnq9EcRY7Qu-Tw2HL8</td>\n",
       "      <td>45.767227</td>\n",
       "      <td>4.835750</td>\n",
       "      <td>1</td>\n",
       "      <td>Lyon</td>\n",
       "      <td>69001</td>\n",
       "      <td>Rue de la République 2</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2016-02-22 09:43:57.077475</td>\n",
       "      <td>ChIJr1RhGN7B9EcRv6XSHmmN6a8</td>\n",
       "      <td>45.743725</td>\n",
       "      <td>4.873138</td>\n",
       "      <td>1</td>\n",
       "      <td>Lyon</td>\n",
       "      <td>69008</td>\n",
       "      <td>123 avenue des frères lumières</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2016-02-22 09:43:57.220446</td>\n",
       "      <td>ChIJEwQm9H7q9EcRTymMxQ71z84</td>\n",
       "      <td>45.759369</td>\n",
       "      <td>4.864087</td>\n",
       "      <td>1</td>\n",
       "      <td>Lyon</td>\n",
       "      <td>69003</td>\n",
       "      <td>Avenue Georges Pompidou 17</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>Appel au 06 46 12 20 27</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>2016-02-22 10:06:08.762590</td>\n",
       "      <td>ChIJbQz7p6vr9EcRr9L2cH5942I</td>\n",
       "      <td>45.761181</td>\n",
       "      <td>4.826371</td>\n",
       "      <td>1</td>\n",
       "      <td>Lyon</td>\n",
       "      <td>69005</td>\n",
       "      <td>8 bis Place Saint Jean</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                   created_at                     place_id   latitude  \\\n",
       "id                                                                      \n",
       "2  2016-02-22 09:42:10.204171  ChIJSfxJmlXq9EcRX2ChkiPW9J8  45.763149   \n",
       "3  2016-02-22 09:42:10.351150  ChIJwwMvNPnq9EcRY7Qu-Tw2HL8  45.767227   \n",
       "4  2016-02-22 09:43:57.077475  ChIJr1RhGN7B9EcRv6XSHmmN6a8  45.743725   \n",
       "5  2016-02-22 09:43:57.220446  ChIJEwQm9H7q9EcRTymMxQ71z84  45.759369   \n",
       "6  2016-02-22 10:06:08.762590  ChIJbQz7p6vr9EcRr9L2cH5942I  45.761181   \n",
       "\n",
       "    longitude  city_id  city    zip                          street floor  \\\n",
       "id                                                                          \n",
       "2    4.832660        1  Lyon  69002                 31 rue Mercière  <NA>   \n",
       "3    4.835750        1  Lyon  69001          Rue de la République 2  <NA>   \n",
       "4    4.873138        1  Lyon  69008  123 avenue des frères lumières  <NA>   \n",
       "5    4.864087        1  Lyon  69003      Avenue Georges Pompidou 17  <NA>   \n",
       "6    4.826371        1  Lyon  69005          8 bis Place Saint Jean  <NA>   \n",
       "\n",
       "       special_instructions  \n",
       "id                           \n",
       "2                            \n",
       "3                            \n",
       "4                            \n",
       "5   Appel au 06 46 12 20 27  \n",
       "6                            "
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "addresses.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Int64Index: 663082 entries, 2 to 691914\n",
      "Data columns (total 10 columns):\n",
      " #   Column                Non-Null Count   Dtype         \n",
      "---  ------                --------------   -----         \n",
      " 0   created_at            663082 non-null  datetime64[ns]\n",
      " 1   place_id              662953 non-null  string        \n",
      " 2   latitude              663082 non-null  float64       \n",
      " 3   longitude             663082 non-null  float64       \n",
      " 4   city_id               663082 non-null  int64         \n",
      " 5   city                  663082 non-null  string        \n",
      " 6   zip                   663082 non-null  string        \n",
      " 7   street                663082 non-null  string        \n",
      " 8   floor                 337703 non-null  string        \n",
      " 9   special_instructions  585310 non-null  string        \n",
      "dtypes: datetime64[ns](1), float64(2), int64(1), string(6)\n",
      "memory usage: 55.6 MB\n"
     ]
    }
   ],
   "source": [
    "addresses.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert len(addresses) == 663_082"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Adjust Time Zones"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a helper function that strips out the microseconds from datetime columns and converts their time zones from UTC to Europe/Paris."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "def clean_datetime(col):\n",
    "    \"\"\"Strip Microseconds and convert timezone to Europe/Paris.\"\"\"\n",
    "    return (\n",
    "        col.dt.tz_localize(tz.utc)\n",
    "        .dt.tz_convert(tz.timezone(\"Europe/Paris\"))\n",
    "        .dt.tz_localize(None)\n",
    "        .map(\n",
    "            lambda x: datetime.datetime(\n",
    "                x.year, x.month, x.day, x.hour, x.minute, x.second\n",
    "            )\n",
    "            if x is not pd.NaT\n",
    "            else x\n",
    "        )\n",
    "    )"
   ]
  },
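  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check (a made-up example, not part of the cleaning itself): Paris is UTC+1 in February, so the hour shifts by one and the microseconds are stripped."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Made-up example timestamp; not taken from the database.\n",
    "_example = pd.Series([pd.Timestamp(\"2016-02-22 09:42:10.204171\")])\n",
    "\n",
    "assert clean_datetime(_example)[0] == datetime.datetime(2016, 2, 22, 10, 42, 10)"
   ]
  },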
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "addresses[\"created_at\"] = clean_datetime(addresses[\"created_at\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Clean Place IDs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A tiny number of addresses has `latitude` / `longitude` pairs as `place_id`s."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "addresses[\"place_id\"] = (\n",
    "    addresses[\"place_id\"].str.replace(r\"^[\\d\\.,-]+$\", \"\", regex=True).str.strip()\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Discard addresses without a `place_id` by Google Maps. If even Google does not know where these addresses are geo-located, we do not even try."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "msk = addresses[\"place_id\"].isnull() | (addresses[\"place_id\"] == \"\")\n",
    "addresses = addresses[~msk]\n",
    "\n",
    "assert msk.sum() == 139"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Clean City Names"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some customers entered too much data into the `city` part of the address. Unify this column by only keeping the city's name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<StringArray>\n",
       "[                     'Lyon',   'Lyon 2 E Arrondissement',\n",
       "              'Villeurbanne',   'Lyon 7 E Arrondissement',\n",
       "   'Lyon 6 E Arrondissement',   'Lyon 8 E Arrondissement',\n",
       "   'Lyon 9 E Arrondissement',                     'Paris',\n",
       "          'Levallois-Perret',                'Courbevoie',\n",
       "                   'Puteaux',         'Neuilly Sur Seine',\n",
       "      'Boulogne Billancourt',          'Levallois Perret',\n",
       "                  'Malakoff',               'Saint-Mandé',\n",
       " 'Paris 14 E Arrondissement',                 'Montrouge',\n",
       "       'Issy-les-moulineaux', 'Paris 10 E Arrondissement',\n",
       "       'Issy Les Moulineaux',                  'Nanterre',\n",
       " 'Paris 12 E Arrondissement',       'La Garenne-colombes',\n",
       "                 'Vincennes',                 'Montreuil',\n",
       "        'Asnières-sur-Seine', 'Paris 15 E Arrondissement',\n",
       "                'Saint-Ouen',  'Paris 9 E Arrondissement',\n",
       "                    'Pantin', 'Paris 17 E Arrondissement',\n",
       " 'Paris 1 Er Arrondissement',                  'Bordeaux',\n",
       "                   'Talence',            'Ivry-sur-Seine',\n",
       "                    'Testas',               'Saint-Denis']\n",
       "Length: 38, dtype: string"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "addresses[\"city\"].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "addresses[\"city\"] = (\n",
    "    addresses[\"city\"]\n",
    "    .str.replace(r\"(E Arrondissement|Er Arrondissement)\", \"\", regex=True)\n",
    "    .str.replace(r\"(\\d)\", \"\", regex=True)\n",
    "    # Get rid off accents.\n",
    "    .str.normalize(\"NFKD\")\n",
    "    .str.encode(\"ascii\", errors=\"ignore\")\n",
    "    .str.decode(\"utf8\")\n",
    "    .astype(\"string\")\n",
    "    # Unify hyphens.\n",
    "    .str.strip()\n",
    "    .str.replace(\"-\", \" \")\n",
    "    .str.title()\n",
    "    .str.replace(\" \", \"-\")\n",
    ")"
   ]
  },
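  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The normalize-encode-decode idiom above is what strips the accents: NFKD decomposes an accented character into its base character plus a combining mark, and the ASCII encoding step drops the mark. A minimal illustration on a made-up value:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Made-up example value; NFKD splits \"è\" into \"e\" + a combining accent,\n",
    "# which encoding to ASCII then drops.\n",
    "_example = pd.Series([\"Asnières-Sur-Seine\"], dtype=\"string\")\n",
    "\n",
    "assert (\n",
    "    _example.str.normalize(\"NFKD\")\n",
    "    .str.encode(\"ascii\", errors=\"ignore\")\n",
    "    .str.decode(\"utf8\")\n",
    ")[0] == \"Asnieres-Sur-Seine\""
   ]
  },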
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sub-urban city names surrounding the three big cities in this research project are kept."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Paris                   362252\n",
       "Lyon                    199872\n",
       "Bordeaux                 64379\n",
       "Villeurbanne             15555\n",
       "Levallois-Perret          4620\n",
       "Courbevoie                3510\n",
       "Puteaux                   2323\n",
       "Issy-Les-Moulineaux       1219\n",
       "Malakoff                  1214\n",
       "Montrouge                 1164\n",
       "Pantin                    1119\n",
       "Saint-Mande               1043\n",
       "Boulogne-Billancourt       986\n",
       "Montreuil                  909\n",
       "Neuilly-Sur-Seine          835\n",
       "Vincennes                  826\n",
       "Ivry-Sur-Seine             438\n",
       "Nanterre                   361\n",
       "La-Garenne-Colombes        213\n",
       "Asnieres-Sur-Seine          62\n",
       "Saint-Ouen                  35\n",
       "Saint-Denis                  4\n",
       "Testas                       3\n",
       "Talence                      1\n",
       "Name: city, dtype: Int64"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "addresses[\"city\"].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Clean Zip Codes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<StringArray>\n",
       "[      '69002',       '69001',       '69008',       '69003',       '69005',\n",
       "       '69007',       '69004',       '69009',       '69006',       '69100',\n",
       " ...\n",
       "       '92053',       '92806',      '69003 ', '75015 PARIS',          '69',\n",
       "       '33077',       '33092',       '33139',      '75010z',       '92040']\n",
       "Length: 167, dtype: string"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "addresses[\"zip\"].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "addresses[\"zip\"] = (\n",
    "    addresses[\"zip\"]\n",
    "    .str.replace(r\".*(\\d{5}).*\", r\"\\1\", regex=True)\n",
    "    .str.replace(r\"\\D+\", \"\", regex=True)\n",
    "    .replace(\"\", \"NaN\")\n",
    "    .astype(float)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Zip codes with less than 5 digits are invalid. Paris has zip codes with 75xxx (with 92xxx, 93xxx, and 94xxx being suburbs), Lyon 69xxx, and Bordeaux 33xxx (cf., [source](https://en.wikipedia.org/wiki/Postal_codes_in_France)). Keep only valid zip codes in target cities."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "invalid = addresses[\"zip\"].notnull() & (addresses[\"zip\"] < 10000)\n",
    "\n",
    "assert invalid.sum() == 9\n",
    "\n",
    "not_in_target_cities = (\n",
    "    addresses[\"zip\"].notnull()\n",
    "    & ~invalid\n",
    "    & ~(\n",
    "        (33000 <= addresses[\"zip\"]) & (addresses[\"zip\"] < 34000)\n",
    "        | (69000 <= addresses[\"zip\"]) & (addresses[\"zip\"] < 70000)\n",
    "        | (75000 <= addresses[\"zip\"]) & (addresses[\"zip\"] < 76000)\n",
    "        | (92000 <= addresses[\"zip\"]) & (addresses[\"zip\"] < 95000)\n",
    "    )\n",
    ")\n",
    "\n",
    "assert not_in_target_cities.sum() == 10\n",
    "\n",
    "addresses.loc[invalid | not_in_target_cities, \"zip\"] = np.NaN"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([69002., 69001., 69008., 69003., 69005., 69007., 69004., 69009.,\n",
       "       69006., 69100., 69200., 69300., 69350.,    nan, 75011., 69370.,\n",
       "       75009., 75010., 75001., 75008., 75017., 75003., 75002., 75018.,\n",
       "       92300., 75007., 75012., 75006., 75116., 75015., 92400., 92800.,\n",
       "       92200., 92100., 75020., 75019., 75004., 75013., 75016., 75005.,\n",
       "       75014., 92130., 92092., 92932., 92000., 92150., 92270., 92190.,\n",
       "       92110., 92170., 93400., 92600., 94250., 69160., 92120., 92250.,\n",
       "       92078., 92310., 92240., 75543., 92064., 92210., 75270., 75000.,\n",
       "       92057., 93100., 92140., 92081., 75741., 69326., 69130., 69500.,\n",
       "       94160., 94300., 93170., 94110., 94220., 93260., 69000., 93500.,\n",
       "       75045., 94120., 94200., 92220., 93000., 92974., 92063., 69429.,\n",
       "       33000., 33800., 33200., 33400., 33300., 33130., 33110., 33100.,\n",
       "       33270., 33520., 92935., 33067., 33150., 93507., 33700., 69628.,\n",
       "       33080., 33076., 92700., 93310., 92042., 92058., 92930., 69120.,\n",
       "       92936., 93300., 93210., 92671., 93200., 92053., 92806., 33077.,\n",
       "       33092., 33139., 92040.])"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "addresses[\"zip\"].unique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Discard addresses with missing zip codes because they are hard to geo-code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "msk = addresses[\"zip\"].isnull()\n",
    "addresses = addresses[~msk]\n",
    "\n",
    "assert msk.sum() == 21"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "addresses = addresses.astype({\"zip\": int})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Clean Street Names"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remove extra whitespace, HTML encodings, and accents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [],
   "source": [
    "addresses[\"street\"] = (\n",
    "    addresses[\"street\"]\n",
    "    .str.replace(\"\\s+\", \" \", regex=True)\n",
    "    .str.replace(\"&#39;\", \"'\")\n",
    "    # Get rid off accents.\n",
    "    .str.normalize(\"NFKD\")\n",
    "    .str.encode(\"ascii\", errors=\"ignore\")\n",
    "    .str.decode(\"utf8\")\n",
    "    .astype(\"string\")\n",
    "    .str.strip()\n",
    "    .str.title()\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are no addresses without a `street` name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert not addresses[\"street\"].isnull().any()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Parse Floor Numbers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Make `floor` an integer column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<StringArray>\n",
       "[                                 <NA>,                                    '',\n",
       "                                   '3',                                  '19',\n",
       "                          '2ème étage',                                   '1',\n",
       "                                 '1er',            '2eme etage face escalier',\n",
       "                                   '2',                        '2 eme droite',\n",
       " ...\n",
       " ' Premiere Bati  interphone 16  2eme',                 'Le Cargo - 4e étage',\n",
       "                       'Rdc appart 11',             'Etage 3, appartement 11',\n",
       "                    '5e étage gauche ',          '2eme etage a gauche bat b ',\n",
       "              '4ème étage, chambre 41',                          'Cinquiéme ',\n",
       "       'Montez jusqu&#39;à la grille ',                '5ème étage à droite ']\n",
       "Length: 11990, dtype: string"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "addresses[\"floor\"].unique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Parse out floors from the `floor` text column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "addresses[\"floor\"] = (\n",
    "    addresses[\"floor\"]\n",
    "    # Get rid of accents and lower case everything.\n",
    "    .str.normalize(\"NFKD\")\n",
    "    .str.encode(\"ascii\", errors=\"ignore\")\n",
    "    .str.decode(\"utf8\")\n",
    "    .astype(\"string\")\n",
    "    .str.casefold()\n",
    "    # Replace common text that messes up the matching.\n",
    "    .str.replace(\".\", \"\")\n",
    "    .str.replace(\":\", \"\")\n",
    "    .str.replace(\";\", \"\")\n",
    "    .str.replace(\"&#39;\", \"'\")\n",
    "    .str.replace(\"36b25\", \"\")\n",
    "    .str.replace(\"n°\", \"\")\n",
    "    .str.replace(\"#\", \"\")\n",
    "    .str.replace(\"face l'assanceur\", \"\")\n",
    "    .str.replace(\"\\(drt\\)\", \"\")\n",
    "    .str.replace(\"floor\", \"\")\n",
    "    .str.replace(\"et demi\", \"\")\n",
    "    .str.replace(\"et droite\", \"\")\n",
    "    .str.replace(\"droite\", \"\")\n",
    "    .str.replace(\"droit\", \"\")\n",
    "    .str.replace(\"a gauche\", \"\")\n",
    "    .str.replace(\"e gauche\", \"\")\n",
    "    .str.replace(\"gauche\", \"\")\n",
    "    .str.replace(\"entrez\", \"\")\n",
    "    .str.replace(\"serez\", \"\")\n",
    "    .str.replace(\"dussol\", \"\")\n",
    "    .str.replace(\"soler\", \"\")\n",
    "    .str.replace(\"sonner\", \"\")\n",
    "    .str.replace(\"code\", \"\")\n",
    "    .str.replace(\"perez\", \"\")\n",
    "    .str.replace(\"-\", \"\")\n",
    "    .str.replace(\"\\s+\", \" \", regex=True)\n",
    "    .str.strip()\n",
    "    # Abbreviations.\n",
    "    .str.replace(\n",
    "        r\"^.*?((\\d+)\\s?(er|ere|em|eme|ele|ieme|bis|(e|g|st|nd|rd|th|z)($|,|\\s+))).*\",\n",
    "        r\"\\2\",\n",
    "        regex=True,\n",
    "    )\n",
    "    # French written out.\n",
    "    .str.replace(r\".*(rdc|rez|sol|ground).*\", \"0\", regex=True)\n",
    "    .str.replace(r\".*(premiere|premier).*\", \"1\", regex=True)\n",
    "    .str.replace(r\".*(deuxieme).*\", \"2\", regex=True)\n",
    "    .str.replace(r\".*(troisieme).*\", \"3\", regex=True)\n",
    "    .str.replace(r\".*(quatrieme).*\", \"4\", regex=True)\n",
    "    .str.replace(r\".*(cinquieme).*\", \"5\", regex=True)\n",
    "    .str.replace(r\".*(sixieme).*\", \"6\", regex=True)\n",
    "    .str.replace(r\".*(septieme).*\", \"7\", regex=True)\n",
    "    .str.replace(r\".*(huitieme).*\", \"8\", regex=True)\n",
    "    .str.replace(r\".*(neuvieme).*\", \"9\", regex=True)\n",
    "    .str.replace(r\".*(dixieme).*\", \"10\", regex=True)\n",
    "    .str.replace(r\"^.*?((etage|etg) (\\d+))($|\\D+.*)\", r\"\\3\", regex=True)\n",
    "    .str.replace(r\"^.*?((\\d+)(etage| etage|etg| etg)).*\", r\"\\2\", regex=True)\n",
    "    # Remove apartment info to not confuse it with floor\n",
    "    .str.replace(\n",
    "        r\"(.*)(ap|apt|app|appt|appart|appartment|appartement|chambre|room)\\s*\\w?\\d+(.*)\",\n",
    "        r\"\\1 \\3\",\n",
    "        regex=True,\n",
    "    )\n",
    "    .str.replace(r\"(.*)(code|digicode)\\s*\\w?\\d+(.*)\", r\"\\1 \\3\", regex=True)\n",
    "    # Take number at start.\n",
    "    .str.replace(r\"^(\\d+)(,|\\s+).*\", r\"\\1\", regex=True)\n",
    "    # Ignore anything with non-numeric symbols entirely.\n",
    "    .str.replace(r\".*\\D+.*\", \"\", regex=True)\n",
    "    .str.replace(\"^$\", \"NaN\")\n",
    "    .fillna(\"NaN\")\n",
    "    .astype(float)\n",
    ")"
   ]
  },
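  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cascade of rules above is easiest to understand on examples. Below is a minimal replay of three representative rules (the abbreviation matcher, one written-out French rule, and the final catch-all) on made-up inputs:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Made-up inputs; only three of the rules above are replayed here.\n",
    "_examples = pd.Series(\n",
    "    [\"2eme etage face escalier\", \"rez de chaussee\", \"porte bleue\"], dtype=\"string\"\n",
    ")\n",
    "\n",
    "_parsed = (\n",
    "    _examples\n",
    "    # Abbreviations: keep the leading number.\n",
    "    .str.replace(\n",
    "        r\"^.*?((\\d+)\\s?(er|ere|em|eme|ele|ieme|bis|(e|g|st|nd|rd|th|z)($|,|\\s+))).*\",\n",
    "        r\"\\2\",\n",
    "        regex=True,\n",
    "    )\n",
    "    # Ground floor written out in French.\n",
    "    .str.replace(r\".*(rdc|rez|sol|ground).*\", \"0\", regex=True)\n",
    "    # Catch-all: anything still non-numeric becomes empty.\n",
    "    .str.replace(r\".*\\D+.*\", \"\", regex=True)\n",
    ")\n",
    "\n",
    "assert list(_parsed) == [\"2\", \"0\", \"\"]"
   ]
  },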
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the `floor` column is empty, parse out floor info from the `special_instructions` column that must have been used before the `floor` column was introduced (slightly different parsing logic than above)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "addresses[\"special_instructions\"] = (\n",
    "    addresses[\"special_instructions\"]\n",
    "    # Get rid of accents and lower case everything.\n",
    "    .str.normalize(\"NFKD\")\n",
    "    .str.encode(\"ascii\", errors=\"ignore\")\n",
    "    .str.decode(\"utf8\")\n",
    "    .astype(\"string\")\n",
    "    .str.casefold()\n",
    "    # Replace common text that messes up the matching.\n",
    "    .str.replace(\".\", \"\")\n",
    "    .str.replace(\":\", \"\")\n",
    "    .str.replace(\";\", \"\")\n",
    "    .str.replace(\"&#39;\", \"'\")\n",
    "    .str.replace(\"36b25\", \"\")\n",
    "    .str.replace(\"n°\", \"\")\n",
    "    .str.replace(\"#\", \"\")\n",
    "    .str.replace(\"face l'assanceur\", \"\")\n",
    "    .str.replace(\"\\(drt\\)\", \"\")\n",
    "    .str.replace(\"floor\", \"\")\n",
    "    .str.replace(\"et demi\", \"\")\n",
    "    .str.replace(\"et droite\", \"\")\n",
    "    .str.replace(\"droite\", \"\")\n",
    "    .str.replace(\"droit\", \"\")\n",
    "    .str.replace(\"a gauche\", \"\")\n",
    "    .str.replace(\"e gauche\", \"\")\n",
    "    .str.replace(\"gauche\", \"\")\n",
    "    .str.replace(\"entrez\", \"\")\n",
    "    .str.replace(\"serez\", \"\")\n",
    "    .str.replace(\"dussol\", \"\")\n",
    "    .str.replace(\"soler\", \"\")\n",
    "    .str.replace(\"sonner\", \"\")\n",
    "    .str.replace(\"code\", \"\")\n",
    "    .str.replace(\"perez\", \"\")\n",
    "    .str.replace(\"-\", \"\")\n",
    "    .str.replace(\"\\s+\", \" \", regex=True)\n",
    "    .str.strip()\n",
    "    # Abbreviations.\n",
    "    .str.replace(\n",
    "        r\"^.*?((\\d+)\\s?(er|ere|em|eme|ele|ieme|bis|(e|g|st|nd|rd|th|z)($|,|\\s+))).*\",\n",
    "        r\"\\2\",\n",
    "        regex=True,\n",
    "    )\n",
    "    # French written out.\n",
    "    .str.replace(r\".*(rdc|rez|sol|ground).*\", \"0\", regex=True)\n",
    "    .str.replace(r\".*(premiere|premier).*\", \"1\", regex=True)\n",
    "    .str.replace(r\".*(deuxieme).*\", \"2\", regex=True)\n",
    "    .str.replace(r\".*(troisieme).*\", \"3\", regex=True)\n",
    "    .str.replace(r\".*(quatrieme).*\", \"4\", regex=True)\n",
    "    .str.replace(r\".*(cinquieme).*\", \"5\", regex=True)\n",
    "    .str.replace(r\".*(sixieme).*\", \"6\", regex=True)\n",
    "    .str.replace(r\".*(septieme).*\", \"7\", regex=True)\n",
    "    .str.replace(r\".*(huitieme).*\", \"8\", regex=True)\n",
    "    .str.replace(r\".*(neuvieme).*\", \"9\", regex=True)\n",
    "    .str.replace(r\".*(dixieme).*\", \"10\", regex=True)\n",
    "    .str.replace(r\"^.*?((etage|etg) (\\d+))($|\\D+.*)\", r\"\\3\", regex=True)\n",
    "    .str.replace(r\"^.*?((\\d+)(etage| etage|etg| etg)).*\", r\"\\2\", regex=True)\n",
    "    # Remove apartment info to not confuse it with floor.\n",
    "    .str.replace(\n",
    "        r\"(.*)(ap|apt|app|appt|appart|appartment|appartement|chambre|room)\\s*\\w?\\d+(.*)\",\n",
    "        r\"\\1 \\3\",\n",
    "        regex=True,\n",
    "    )\n",
    "    .str.replace(r\"(.*)(code|digicode)\\s*\\w?\\d+(.*)\", r\"\\1 \\3\", regex=True)\n",
    "    # Ignore anything with non-numeric symbols entirely.\n",
    "    .str.replace(r\".*\\D+.*\", \"\", regex=True)\n",
    "    .str.replace(\"^$\", \"NaN\")\n",
    "    .fillna(\"NaN\")\n",
    "    .astype(float)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Fill in `floor` from `special_instructions` and cast the type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "msk = addresses[\"floor\"].isnull() & addresses[\"special_instructions\"].notnull()\n",
    "addresses.loc[msk, \"floor\"] = addresses.loc[msk, \"special_instructions\"].values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "del addresses[\"special_instructions\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [],
   "source": [
    "addresses = addresses.astype({\"floor\": \"Int64\"})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Only keep the realisic numbers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "addresses.loc[addresses[\"floor\"].notnull() & (addresses[\"floor\"] > 40), \"floor\"] = pd.NA"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Most addresses have no floor number given."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert len(addresses.loc[addresses[\"floor\"].isnull(), \"floor\"]) == 307_973"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Most `floor`s are near the ground floor, which is plausible."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     20977\n",
       "1     69765\n",
       "2     62439\n",
       "3     59011\n",
       "4     52226\n",
       "5     42930\n",
       "6     28396\n",
       "7      8853\n",
       "8      3349\n",
       "9      1631\n",
       "10     1384\n",
       "11      764\n",
       "12      657\n",
       "13      448\n",
       "14      322\n",
       "15      206\n",
       "16      202\n",
       "17      109\n",
       "18      135\n",
       "19      105\n",
       "20       77\n",
       "21       51\n",
       "22       99\n",
       "23       39\n",
       "24       85\n",
       "25       76\n",
       "26       56\n",
       "27       30\n",
       "28       49\n",
       "29       51\n",
       "30       33\n",
       "31       55\n",
       "32       22\n",
       "33       11\n",
       "34       37\n",
       "35       55\n",
       "36       19\n",
       "37        8\n",
       "38       32\n",
       "39      136\n",
       "40       19\n",
       "Name: floor, dtype: Int64"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "addresses[\"floor\"].value_counts().sort_index()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Deduplicate"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The number of addresses (ca. 663,000) is inflated, probably due to some sort of automated re-entering."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert len(addresses) == 662_922"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, merge all addresses with the same `place_id`, `latitude` / `longitude`, `city`, `zip`, `street`, *and* `floor` into one entry, namely its first occurrence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [],
   "source": [
    "addresses[\"floor\"] = addresses[\"floor\"].fillna(999)  # dummy -> No grouping with NaN's"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [],
   "source": [
    "by = [\"place_id\", \"latitude\", \"longitude\", \"city_id\", \"city\", \"zip\", \"street\", \"floor\"]\n",
    "\n",
    "addresses = (\n",
    "    addresses.reset_index()\n",
    "    .set_index(by)\n",
    "    .merge(\n",
    "        (\n",
    "            addresses.reset_index()\n",
    "            .groupby(by)[[\"id\"]]\n",
    "            .min()\n",
    "            .rename(columns={\"id\": \"merged_on_id\"})\n",
    "        ),\n",
    "        left_index=True,\n",
    "        right_index=True,\n",
    "    )\n",
    "    .reset_index()\n",
    "    .astype({\"place_id\": \"string\", \"city\": \"string\", \"street\": \"string\"})\n",
    ")"
   ]
  },
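  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The self-merge above is equivalent to a `groupby(...).transform(\"min\")` over the grouping columns. A minimal sketch on a toy frame (made-up data):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy frame (made-up data): the first two rows are exact duplicates.\n",
    "_toy = pd.DataFrame(\n",
    "    {\"id\": [1, 2, 3], \"place_id\": [\"a\", \"a\", \"b\"], \"floor\": [999, 999, 2]}\n",
    ")\n",
    "_toy[\"merged_on_id\"] = _toy.groupby([\"place_id\", \"floor\"])[\"id\"].transform(\"min\")\n",
    "\n",
    "assert list(_toy[\"merged_on_id\"]) == [1, 1, 3]"
   ]
  },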
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Keep a dictionary `address_merger` to map the ID's that are merged away to the ones that are kept."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [],
   "source": [
    "address_merger = collections.defaultdict(lambda: np.NaN)\n",
    "address_merger.update(\n",
    "    {\n",
    "        id_: merged_on_id\n",
    "        for _, id_, merged_on_id in addresses[[\"id\", \"merged_on_id\"]].itertuples()\n",
    "    }\n",
    ")"
   ]
  },
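  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because `address_merger` is a `defaultdict`, looking up an ID that was discarded earlier (i.e., one that never made it into `addresses`) yields `NaN` instead of a `KeyError`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# -1 is a made-up ID that does not occur in the data.\n",
    "assert np.isnan(address_merger[-1])"
   ]
  },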
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [],
   "source": [
    "addresses = (\n",
    "    addresses[addresses[\"id\"] == addresses[\"merged_on_id\"]]\n",
    "    .set_index(\"id\")\n",
    "    .sort_index()[\n",
    "        [\n",
    "            \"created_at\",\n",
    "            \"place_id\",\n",
    "            \"latitude\",\n",
    "            \"longitude\",\n",
    "            \"city_id\",\n",
    "            \"city\",\n",
    "            \"zip\",\n",
    "            \"street\",\n",
    "            \"floor\",\n",
    "        ]\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [],
   "source": [
    "addresses[\"floor\"] = addresses[\"floor\"].replace(999, pd.NA).astype(\"Int64\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Only about 178,000 addresses remain!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert len(addresses) == 178_101"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Second, many addresses are still redundant as they are referring to *different* `floor`s in the *same* house or their `street` name is written differently. \n",
    "\n",
    "We create a `primary_id` column that holds the ID of the first occurrence of an address independent of the exact spelling of the `street` name and the `floor` number.\n",
    "\n",
    "That column is created via grouping the remaining addresses twice, once with their GPS location, and second by a simplified version of `street`. The latter accounts for slightly different `latitude` / `longitude` pairs of the same location, potentially due to an update in the Google Maps database."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [],
   "source": [
    "by = [\"place_id\", \"latitude\", \"longitude\"]\n",
    "\n",
    "addresses = (\n",
    "    addresses.reset_index()\n",
    "    .set_index(by)\n",
    "    .merge(\n",
    "        (\n",
    "            addresses.reset_index()\n",
    "            .groupby(by)[[\"id\"]]\n",
    "            .min()\n",
    "            .rename(columns={\"id\": \"unified1_id\"})\n",
    "        ),\n",
    "        left_index=True,\n",
    "        right_index=True,\n",
    "    )\n",
    "    .reset_index()\n",
    "    .set_index(\"id\")\n",
    "    .sort_index()\n",
    "    .astype({\"place_id\": \"string\"})[\n",
    "        [\n",
    "            \"unified1_id\",\n",
    "            \"created_at\",\n",
    "            \"place_id\",\n",
    "            \"latitude\",\n",
    "            \"longitude\",\n",
    "            \"city_id\",\n",
    "            \"city\",\n",
    "            \"zip\",\n",
    "            \"street\",\n",
    "            \"floor\",\n",
    "        ]\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [],
   "source": [
    "addresses[\"street_simple\"] = (\n",
    "    addresses[\"street\"]\n",
    "    .str.replace(\"Avenue\", \"Ave\")\n",
    "    .str.replace(\"Place\", \"Pl\")\n",
    "    .str.replace(\".\", \"\")\n",
    "    .str.replace(\"-\", \"\")\n",
    "    .str.replace(\" \", \"\")\n",
    "    .str.lower()\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [],
   "source": [
    "by = [\"city_id\", \"street_simple\"]\n",
    "\n",
    "addresses = (\n",
    "    addresses.reset_index()\n",
    "    .set_index(by)\n",
    "    .merge(\n",
    "        (\n",
    "            addresses.reset_index()\n",
    "            .groupby(by)[[\"id\"]]\n",
    "            .min()\n",
    "            .rename(columns={\"id\": \"unified2_id\"})\n",
    "        ),\n",
    "        left_index=True,\n",
    "        right_index=True,\n",
    "    )\n",
    "    .reset_index()\n",
    "    .set_index(\"id\")\n",
    "    .sort_index()[\n",
    "        [\n",
    "            \"unified1_id\",\n",
    "            \"unified2_id\",\n",
    "            \"created_at\",\n",
    "            \"place_id\",\n",
    "            \"latitude\",\n",
    "            \"longitude\",\n",
    "            \"city_id\",\n",
    "            \"city\",\n",
    "            \"zip\",\n",
    "            \"street\",\n",
    "            \"floor\",\n",
    "        ]\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So, an address may be a duplicate of *two* different earlier addresses and we choose the earliest one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [],
   "source": [
    "addresses[\"primary_id\"] = addresses[[\"unified1_id\", \"unified2_id\"]].min(axis=1)\n",
    "\n",
    "del addresses[\"unified1_id\"]\n",
    "del addresses[\"unified2_id\"]\n",
    "\n",
    "addresses = addresses[\n",
    "    [\n",
    "        \"primary_id\",\n",
    "        \"created_at\",\n",
    "        \"place_id\",\n",
    "        \"latitude\",\n",
    "        \"longitude\",\n",
    "        \"city_id\",\n",
    "        \"city\",\n",
    "        \"zip\",\n",
    "        \"street\",\n",
    "        \"floor\",\n",
    "    ]\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A tricky issue is that an address could be identified as a duplicate of an earlier one that itself is a duplicate of an even earlier one. The following loop does the trick and maps each address to its earlierst version."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [],
   "source": [
    "_address_unifier = {\n",
    "    id_: unified_id\n",
    "    for _, id_, unified_id in addresses.reset_index()[[\"id\", \"primary_id\"]].itertuples()\n",
    "}\n",
    "\n",
    "while True:\n",
    "    if (addresses[\"primary_id\"] != addresses[\"primary_id\"].map(_address_unifier)).any():\n",
    "        addresses[\"primary_id\"] = addresses[\"primary_id\"].map(_address_unifier)\n",
    "    else:\n",
    "        break"
   ]
  },
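  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The loop converges because every mapping step can only move a `primary_id` to an earlier (i.e., smaller) ID. A minimal sketch with made-up IDs shows how a chain resolves:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy example, for illustration only: address 7 was merged into 5,\n",
    "# which itself was merged into 2, so 7 must resolve to 2.\n",
    "_toy_unifier = {2: 2, 5: 2, 7: 5}\n",
    "_toy = pd.Series(_toy_unifier, name=\"primary_id\")\n",
    "\n",
    "while (_toy != _toy.map(_toy_unifier)).any():\n",
    "    _toy = _toy.map(_toy_unifier)\n",
    "\n",
    "assert _toy.to_dict() == {2: 2, 5: 2, 7: 2}"
   ]
  },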
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Only about 87,000 of the remaining 178,000 addresses are unique locations disregarding `floor`s and different spellings of the `street` name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [],
   "source": [
    "_addresses = addresses.reset_index()\n",
    "msk = _addresses[\"id\"] == _addresses[\"primary_id\"]\n",
    "del _addresses\n",
    "\n",
    "assert msk.sum() == 87_287"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To not overwrite a Python built-in in the ORM layer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [],
   "source": [
    "addresses = addresses.rename(columns={\"zip\": \"zip_code\"})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Clean Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>primary_id</th>\n",
       "      <th>created_at</th>\n",
       "      <th>place_id</th>\n",
       "      <th>latitude</th>\n",
       "      <th>longitude</th>\n",
       "      <th>city_id</th>\n",
       "      <th>city</th>\n",
       "      <th>zip_code</th>\n",
       "      <th>street</th>\n",
       "      <th>floor</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>2016-02-22 10:42:10</td>\n",
       "      <td>ChIJSfxJmlXq9EcRX2ChkiPW9J8</td>\n",
       "      <td>45.763149</td>\n",
       "      <td>4.832660</td>\n",
       "      <td>1</td>\n",
       "      <td>Lyon</td>\n",
       "      <td>69002</td>\n",
       "      <td>31 Rue Merciere</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>2016-02-22 10:42:10</td>\n",
       "      <td>ChIJwwMvNPnq9EcRY7Qu-Tw2HL8</td>\n",
       "      <td>45.767227</td>\n",
       "      <td>4.835750</td>\n",
       "      <td>1</td>\n",
       "      <td>Lyon</td>\n",
       "      <td>69001</td>\n",
       "      <td>Rue De La Republique 2</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>2016-02-22 10:43:57</td>\n",
       "      <td>ChIJr1RhGN7B9EcRv6XSHmmN6a8</td>\n",
       "      <td>45.743725</td>\n",
       "      <td>4.873138</td>\n",
       "      <td>1</td>\n",
       "      <td>Lyon</td>\n",
       "      <td>69008</td>\n",
       "      <td>123 Avenue Des Freres Lumieres</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5</td>\n",
       "      <td>2016-02-22 10:43:57</td>\n",
       "      <td>ChIJEwQm9H7q9EcRTymMxQ71z84</td>\n",
       "      <td>45.759369</td>\n",
       "      <td>4.864087</td>\n",
       "      <td>1</td>\n",
       "      <td>Lyon</td>\n",
       "      <td>69003</td>\n",
       "      <td>Avenue Georges Pompidou 17</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>6</td>\n",
       "      <td>2016-02-22 11:06:08</td>\n",
       "      <td>ChIJbQz7p6vr9EcRr9L2cH5942I</td>\n",
       "      <td>45.761181</td>\n",
       "      <td>4.826371</td>\n",
       "      <td>1</td>\n",
       "      <td>Lyon</td>\n",
       "      <td>69005</td>\n",
       "      <td>8 Bis Place Saint Jean</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    primary_id          created_at                     place_id   latitude  \\\n",
       "id                                                                           \n",
       "2            2 2016-02-22 10:42:10  ChIJSfxJmlXq9EcRX2ChkiPW9J8  45.763149   \n",
       "3            3 2016-02-22 10:42:10  ChIJwwMvNPnq9EcRY7Qu-Tw2HL8  45.767227   \n",
       "4            4 2016-02-22 10:43:57  ChIJr1RhGN7B9EcRv6XSHmmN6a8  45.743725   \n",
       "5            5 2016-02-22 10:43:57  ChIJEwQm9H7q9EcRTymMxQ71z84  45.759369   \n",
       "6            6 2016-02-22 11:06:08  ChIJbQz7p6vr9EcRr9L2cH5942I  45.761181   \n",
       "\n",
       "    longitude  city_id  city  zip_code                          street  floor  \n",
       "id                                                                             \n",
       "2    4.832660        1  Lyon     69002                 31 Rue Merciere   <NA>  \n",
       "3    4.835750        1  Lyon     69001          Rue De La Republique 2   <NA>  \n",
       "4    4.873138        1  Lyon     69008  123 Avenue Des Freres Lumieres   <NA>  \n",
       "5    4.864087        1  Lyon     69003      Avenue Georges Pompidou 17   <NA>  \n",
       "6    4.826371        1  Lyon     69005          8 Bis Place Saint Jean   <NA>  "
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "addresses.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Int64Index: 178101 entries, 2 to 691914\n",
      "Data columns (total 10 columns):\n",
      " #   Column      Non-Null Count   Dtype         \n",
      "---  ------      --------------   -----         \n",
      " 0   primary_id  178101 non-null  int64         \n",
      " 1   created_at  178101 non-null  datetime64[ns]\n",
      " 2   place_id    178101 non-null  string        \n",
      " 3   latitude    178101 non-null  float64       \n",
      " 4   longitude   178101 non-null  float64       \n",
      " 5   city_id     178101 non-null  int64         \n",
      " 6   city        178101 non-null  string        \n",
      " 7   zip_code    178101 non-null  int64         \n",
      " 8   street      178101 non-null  string        \n",
      " 9   floor       100540 non-null  Int64         \n",
      "dtypes: Int64(1), datetime64[ns](1), float64(2), int64(3), string(3)\n",
      "memory usage: 15.1 MB\n"
     ]
    }
   ],
   "source": [
    "addresses.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert (\n",
    "    hashlib.sha256(addresses.to_json().encode()).hexdigest()\n",
    "    == \"4f9f3b63a9b2472bf07207d0e06f4901619066121d6bb5fd3ad4ebf21b590410\"\n",
    ")"
   ]
  },
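  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The SHA-256 digest above acts as a regression test: if any upstream cleaning step changes, the serialized frame and thus the digest change as well. A minimal sketch of the pattern on a toy frame:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy frame, for illustration only: `.to_json()` serializes deterministically,\n",
    "# so equal data always yields the same fingerprint.\n",
    "_toy_frame = pd.DataFrame({\"a\": [1, 2]})\n",
    "\n",
    "assert (\n",
    "    hashlib.sha256(_toy_frame.to_json().encode()).hexdigest()\n",
    "    == hashlib.sha256(_toy_frame.copy().to_json().encode()).hexdigest()\n",
    ")"
   ]
  },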
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Restaurants"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Raw Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load restaurants associated with *all* addresses in the target cities. Further below, *all* restaurants are shown to have a clean address."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [],
   "source": [
    "restaurants = pd.read_sql_query(\n",
    "    f\"\"\"\n",
    "    SELECT\n",
    "        id,\n",
    "        created_at,\n",
    "        name,\n",
    "        address_id,\n",
    "        estimated_prep_duration\n",
    "    FROM\n",
    "        {config.ORIGINAL_SCHEMA}.businesses\n",
    "    WHERE\n",
    "        address_id IN (\n",
    "            SELECT id FROM {config.ORIGINAL_SCHEMA}.addresses WHERE city_id IN %(city_ids)s\n",
    "        )\n",
    "        AND\n",
    "        created_at < '{config.CUTOFF_DAY}'\n",
    "    ORDER BY\n",
    "        id\n",
    "    \"\"\",\n",
    "    con=connection,\n",
    "    index_col=\"id\",\n",
    "    params={\"city_ids\": city_ids},\n",
    "    parse_dates=[\"created_at\"],\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [],
   "source": [
    "restaurants[\"name\"] = restaurants[\"name\"].astype(\"string\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>created_at</th>\n",
       "      <th>name</th>\n",
       "      <th>address_id</th>\n",
       "      <th>estimated_prep_duration</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2016-02-22 09:42:10.228854</td>\n",
       "      <td>King Marcel Mercière</td>\n",
       "      <td>2</td>\n",
       "      <td>1200</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2016-02-22 09:43:57.103750</td>\n",
       "      <td>Trotekala</td>\n",
       "      <td>4</td>\n",
       "      <td>1200</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2016-02-22 10:06:08.796042</td>\n",
       "      <td>Soul Food &amp; Jazz Café</td>\n",
       "      <td>6</td>\n",
       "      <td>1200</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2016-02-22 10:16:11.478981</td>\n",
       "      <td>Gourmix Bellecour</td>\n",
       "      <td>8</td>\n",
       "      <td>900</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2016-02-22 10:36:09.150481</td>\n",
       "      <td>Yabio Hôtel de Ville</td>\n",
       "      <td>10</td>\n",
       "      <td>1200</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                   created_at                   name  address_id  \\\n",
       "id                                                                 \n",
       "1  2016-02-22 09:42:10.228854   King Marcel Mercière           2   \n",
       "2  2016-02-22 09:43:57.103750              Trotekala           4   \n",
       "3  2016-02-22 10:06:08.796042  Soul Food & Jazz Café           6   \n",
       "4  2016-02-22 10:16:11.478981      Gourmix Bellecour           8   \n",
       "5  2016-02-22 10:36:09.150481   Yabio Hôtel de Ville          10   \n",
       "\n",
       "    estimated_prep_duration  \n",
       "id                           \n",
       "1                      1200  \n",
       "2                      1200  \n",
       "3                      1200  \n",
       "4                       900  \n",
       "5                      1200  "
      ]
     },
     "execution_count": 66,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "restaurants.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Int64Index: 1654 entries, 1 to 1787\n",
      "Data columns (total 4 columns):\n",
      " #   Column                   Non-Null Count  Dtype         \n",
      "---  ------                   --------------  -----         \n",
      " 0   created_at               1654 non-null   datetime64[ns]\n",
      " 1   name                     1654 non-null   string        \n",
      " 2   address_id               1654 non-null   int64         \n",
      " 3   estimated_prep_duration  1654 non-null   int64         \n",
      "dtypes: datetime64[ns](1), int64(2), string(1)\n",
      "memory usage: 64.6 KB\n"
     ]
    }
   ],
   "source": [
    "restaurants.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert len(restaurants) == 1_654"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Adjust Time Zone"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [],
   "source": [
    "restaurants[\"created_at\"] = clean_datetime(restaurants[\"created_at\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Simplify Names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [],
   "source": [
    "restaurants[\"name\"] = (\n",
    "    restaurants[\"name\"]\n",
    "    .str.replace(\"\\s+\", \" \", regex=True)\n",
    "    .str.replace(\"&#39;\", \"'\")\n",
    "    # Get rid off accents.\n",
    "    .str.normalize(\"NFKD\")\n",
    "    .str.encode(\"ascii\", errors=\"ignore\")\n",
    "    .str.decode(\"utf8\")\n",
    "    .astype(\"string\")\n",
    "    .str.title()\n",
    "    # To find duplicates further below.\n",
    "    .str.replace(\" & \", \" And \")\n",
    "    .str.replace(\"The \", \"\")\n",
    "    .str.replace(\"Pasta Pizza \", \"\")\n",
    "    .str.replace(\" - Bar A Taboule\", \"\")\n",
    "    .str.replace(\"- Cuisine Mediterraneenne\", \"\")\n",
    "    .str.replace(\" - Petit-Dejeuner\", \"\")\n",
    "    .str.replace(\"Lyon\", \"\")\n",
    "    .str.replace(\"La Burgeria Saint Mande\", \"La Fromagette Saint Mande\")\n",
    "    .str.replace(\"Mansou'\", \"Mansouria\")\n",
    "    .str.strip()\n",
    ")"
   ]
  },
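  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `NFKD` round trip in the middle of the chain is what strips the accents: each character is decomposed into its base letter plus combining marks, the `ascii` encoding drops the marks, and decoding yields plain ASCII again. A sketch on a made-up name:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Made-up name, for illustration only.\n",
    "_name = pd.Series([\"Café Mercière\"], dtype=\"string\")\n",
    "\n",
    "_ascii = (\n",
    "    _name.str.normalize(\"NFKD\")  # e.g., \"é\" -> \"e\" + combining accent\n",
    "    .str.encode(\"ascii\", errors=\"ignore\")  # drops the combining marks\n",
    "    .str.decode(\"utf8\")\n",
    ")\n",
    "\n",
    "assert _ascii.iloc[0] == \"Cafe Merciere\""
   ]
  },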
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Use Merged Addresses"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [],
   "source": [
    "restaurants[\"address_id\"] = restaurants[\"address_id\"].map(address_merger)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert not restaurants[\"address_id\"].isnull().any()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Deduplicate"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Restaurants with the same name at the same (unified) address are merged."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {},
   "outputs": [],
   "source": [
    "restaurants = restaurants.merge(\n",
    "    addresses[\"primary_id\"], left_on=\"address_id\", right_index=True\n",
    ")\n",
    "restaurants = restaurants.rename(columns={\"primary_id\": \"primary_address_id\"})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [],
   "source": [
    "by = [\"name\", \"primary_address_id\"]\n",
    "\n",
    "restaurants = (\n",
    "    restaurants.reset_index()\n",
    "    .set_index(by)\n",
    "    .merge(\n",
    "        (\n",
    "            restaurants.reset_index()\n",
    "            .groupby(by)[[\"id\"]]\n",
    "            .min()\n",
    "            .rename(columns={\"id\": \"merged_on_id\"})\n",
    "        ),\n",
    "        left_index=True,\n",
    "        right_index=True,\n",
    "    )\n",
    "    .reset_index()\n",
    "    .astype({\"name\": \"string\"})\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Keep a dictionary to map the ID's that are merged away to the ones that are kept."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {},
   "outputs": [],
   "source": [
    "restaurants_merger = collections.defaultdict(lambda: np.NaN)\n",
    "restaurants_merger.update(\n",
    "    {\n",
    "        id_: merged_on_id\n",
    "        for _, id_, merged_on_id in restaurants[[\"id\", \"merged_on_id\"]].itertuples()\n",
    "    }\n",
    ")"
   ]
  },
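  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because the merger is a `defaultdict`, looking up an ID that is not in the mapping yields `NaN` instead of raising a `KeyError`, so later `.map()` calls can mark such records as missing. A toy sketch (made-up IDs) of the same pattern:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy mapping, for illustration only: 11 was merged onto 10.\n",
    "_toy_merger = collections.defaultdict(lambda: np.NaN)\n",
    "_toy_merger.update({10: 10, 11: 10})\n",
    "\n",
    "assert _toy_merger[11] == 10\n",
    "assert np.isnan(_toy_merger[99])  # unknown ID -> NaN, not a KeyError"
   ]
  },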
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {},
   "outputs": [],
   "source": [
    "restaurants = (\n",
    "    restaurants[restaurants[\"id\"] == restaurants[\"merged_on_id\"]]\n",
    "    .set_index(\"id\")\n",
    "    .sort_index()[[\"created_at\", \"name\", \"address_id\", \"estimated_prep_duration\"]]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert len(restaurants) == 1_644"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Clean Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>created_at</th>\n",
       "      <th>name</th>\n",
       "      <th>address_id</th>\n",
       "      <th>estimated_prep_duration</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2016-02-22 10:42:10</td>\n",
       "      <td>King Marcel Merciere</td>\n",
       "      <td>2</td>\n",
       "      <td>1200</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2016-02-22 10:43:57</td>\n",
       "      <td>Trotekala</td>\n",
       "      <td>4</td>\n",
       "      <td>1200</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2016-02-22 11:06:08</td>\n",
       "      <td>Soul Food And Jazz Cafe</td>\n",
       "      <td>6</td>\n",
       "      <td>1200</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2016-02-22 11:16:11</td>\n",
       "      <td>Gourmix Bellecour</td>\n",
       "      <td>8</td>\n",
       "      <td>900</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2016-02-22 11:36:09</td>\n",
       "      <td>Yabio Hotel De Ville</td>\n",
       "      <td>10</td>\n",
       "      <td>1200</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            created_at                     name  address_id  \\\n",
       "id                                                            \n",
       "1  2016-02-22 10:42:10     King Marcel Merciere           2   \n",
       "2  2016-02-22 10:43:57                Trotekala           4   \n",
       "3  2016-02-22 11:06:08  Soul Food And Jazz Cafe           6   \n",
       "4  2016-02-22 11:16:11        Gourmix Bellecour           8   \n",
       "5  2016-02-22 11:36:09     Yabio Hotel De Ville          10   \n",
       "\n",
       "    estimated_prep_duration  \n",
       "id                           \n",
       "1                      1200  \n",
       "2                      1200  \n",
       "3                      1200  \n",
       "4                       900  \n",
       "5                      1200  "
      ]
     },
     "execution_count": 78,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "restaurants.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Int64Index: 1644 entries, 1 to 1787\n",
      "Data columns (total 4 columns):\n",
      " #   Column                   Non-Null Count  Dtype         \n",
      "---  ------                   --------------  -----         \n",
      " 0   created_at               1644 non-null   datetime64[ns]\n",
      " 1   name                     1644 non-null   string        \n",
      " 2   address_id               1644 non-null   int64         \n",
      " 3   estimated_prep_duration  1644 non-null   int64         \n",
      "dtypes: datetime64[ns](1), int64(2), string(1)\n",
      "memory usage: 64.2 KB\n"
     ]
    }
   ],
   "source": [
    "restaurants.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert (\n",
    "    hashlib.sha256(restaurants.to_json().encode()).hexdigest()\n",
    "    == \"8eb852690c027e2fcc0d9cf988391741b1cd028e35c494766f8c21d4ea1722b7\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Couriers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Raw Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Only load couriers that worked in one of the target cities (i.e., had an order) and include the vehicle information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {},
   "outputs": [],
   "source": [
    "couriers = pd.read_sql_query(\n",
    "    f\"\"\"\n",
    "    SELECT\n",
    "        couriers.id,\n",
    "        couriers.created_at,\n",
    "        MD5(couriers.name) AS name,\n",
    "        vehicle_types.icon as vehicle,\n",
    "        couriers.speed,\n",
    "        vehicle_bag_types.capacity,\n",
    "        couriers.pay_per_hour,\n",
    "        couriers.pay_per_order\n",
    "    FROM\n",
    "        {config.ORIGINAL_SCHEMA}.couriers\n",
    "    LEFT OUTER JOIN\n",
    "        {config.ORIGINAL_SCHEMA}.vehicles ON couriers.vehicle_id = vehicles.id\n",
    "    LEFT OUTER JOIN\n",
    "        {config.ORIGINAL_SCHEMA}.vehicle_types ON vehicles.vehicle_type_id = vehicle_types.id\n",
    "    LEFT OUTER JOIN\n",
    "        {config.ORIGINAL_SCHEMA}.vehicle_bag_types ON vehicles.vehicle_bag_type_id = vehicle_bag_types.id\n",
    "    WHERE\n",
    "        couriers.id in (\n",
    "            SELECT DISTINCT\n",
    "                deliveries.courier_id\n",
    "            FROM\n",
    "                {config.ORIGINAL_SCHEMA}.orders\n",
    "            INNER JOIN\n",
    "                {config.ORIGINAL_SCHEMA}.deliveries ON orders.id = deliveries.order_id\n",
    "            WHERE\n",
    "                orders.featured_business_id IN (\n",
    "                    SELECT  -- Subquery based off the restaurants query above!\n",
    "                        id\n",
    "                    FROM\n",
    "                        {config.ORIGINAL_SCHEMA}.businesses\n",
    "                    WHERE\n",
    "                        address_id IN (\n",
    "                            SELECT id FROM {config.ORIGINAL_SCHEMA}.addresses WHERE city_id IN %(city_ids)s\n",
    "                        )\n",
    "                        AND\n",
    "                        created_at < '{config.CUTOFF_DAY}'\n",
    "                )\n",
    "                AND\n",
    "                deliveries.courier_id IS NOT NULL\n",
    "        )\n",
    "    ORDER BY\n",
    "        couriers.id\n",
    "    \"\"\",\n",
    "    con=connection,\n",
    "    index_col=\"id\",\n",
    "    params={\"city_ids\": city_ids},\n",
    "    parse_dates=[\"created_at\"],\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {},
   "outputs": [],
   "source": [
    "couriers = couriers.astype({\"name\": \"string\", \"vehicle\": \"string\"})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>created_at</th>\n",
       "      <th>name</th>\n",
       "      <th>vehicle</th>\n",
       "      <th>speed</th>\n",
       "      <th>capacity</th>\n",
       "      <th>pay_per_hour</th>\n",
       "      <th>pay_per_order</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2016-02-21 13:28:41.571046</td>\n",
       "      <td>9181c2be2b3746cbcd6abf86d77fb2aa</td>\n",
       "      <td>bicycle</td>\n",
       "      <td>13.90</td>\n",
       "      <td>150</td>\n",
       "      <td>0</td>\n",
       "      <td>300</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2016-02-21 13:28:41.673711</td>\n",
       "      <td>1b5d5709985e874e9c54e30127b3049f</td>\n",
       "      <td>bicycle</td>\n",
       "      <td>18.47</td>\n",
       "      <td>100</td>\n",
       "      <td>0</td>\n",
       "      <td>300</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2016-02-21 13:28:41.776893</td>\n",
       "      <td>0e385d898f81147d460a45d2943a4eed</td>\n",
       "      <td>bicycle</td>\n",
       "      <td>19.18</td>\n",
       "      <td>100</td>\n",
       "      <td>0</td>\n",
       "      <td>300</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2016-02-21 13:28:41.879876</td>\n",
       "      <td>ca21f91ea528f1b6e596d0c43959062b</td>\n",
       "      <td>bicycle</td>\n",
       "      <td>14.73</td>\n",
       "      <td>150</td>\n",
       "      <td>0</td>\n",
       "      <td>200</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2016-02-21 13:28:41.982542</td>\n",
       "      <td>04af1119256f88ea3a25e79344dec96c</td>\n",
       "      <td>bicycle</td>\n",
       "      <td>15.20</td>\n",
       "      <td>150</td>\n",
       "      <td>0</td>\n",
       "      <td>200</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                   created_at                              name  vehicle  \\\n",
       "id                                                                         \n",
       "1  2016-02-21 13:28:41.571046  9181c2be2b3746cbcd6abf86d77fb2aa  bicycle   \n",
       "2  2016-02-21 13:28:41.673711  1b5d5709985e874e9c54e30127b3049f  bicycle   \n",
       "3  2016-02-21 13:28:41.776893  0e385d898f81147d460a45d2943a4eed  bicycle   \n",
       "4  2016-02-21 13:28:41.879876  ca21f91ea528f1b6e596d0c43959062b  bicycle   \n",
       "5  2016-02-21 13:28:41.982542  04af1119256f88ea3a25e79344dec96c  bicycle   \n",
       "\n",
       "    speed  capacity  pay_per_hour  pay_per_order  \n",
       "id                                                \n",
       "1   13.90       150             0            300  \n",
       "2   18.47       100             0            300  \n",
       "3   19.18       100             0            300  \n",
       "4   14.73       150             0            200  \n",
       "5   15.20       150             0            200  "
      ]
     },
     "execution_count": 83,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "couriers.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Int64Index: 2471 entries, 1 to 53764\n",
      "Data columns (total 7 columns):\n",
      " #   Column         Non-Null Count  Dtype         \n",
      "---  ------         --------------  -----         \n",
      " 0   created_at     2471 non-null   datetime64[ns]\n",
      " 1   name           2471 non-null   string        \n",
      " 2   vehicle        2471 non-null   string        \n",
      " 3   speed          2471 non-null   float64       \n",
      " 4   capacity       2471 non-null   int64         \n",
      " 5   pay_per_hour   2471 non-null   int64         \n",
      " 6   pay_per_order  2471 non-null   int64         \n",
      "dtypes: datetime64[ns](1), float64(1), int64(3), string(2)\n",
      "memory usage: 154.4 KB\n"
     ]
    }
   ],
   "source": [
    "couriers.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert len(couriers) == 2_471"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Adjust Time Zone"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {},
   "outputs": [],
   "source": [
    "couriers[\"created_at\"] = clean_datetime(couriers[\"created_at\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Deduplicate"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Couriers with the same name either have the same phone number or signed up within a short time window: They are merged."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "metadata": {},
   "outputs": [],
   "source": [
    "by = [\"name\"]\n",
    "\n",
    "couriers = (\n",
    "    couriers.reset_index()\n",
    "    .set_index(by)\n",
    "    .merge(\n",
    "        (\n",
    "            couriers.reset_index()\n",
    "            .groupby(by)[[\"id\"]]\n",
    "            .max()  # merged on the latest courier!\n",
    "            .rename(columns={\"id\": \"merged_on_id\"})\n",
    "        ),\n",
    "        left_index=True,\n",
    "        right_index=True,\n",
    "    )\n",
    "    .reset_index()\n",
    "    .astype({\"name\": \"string\"})\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Keep a dictionary to map the ID's that are merged away to the ones that are kept."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {},
   "outputs": [],
   "source": [
    "couriers_merger = collections.defaultdict(lambda: np.NaN)\n",
    "couriers_merger.update(\n",
    "    {\n",
    "        id_: merged_on_id\n",
    "        for _, id_, merged_on_id in couriers[[\"id\", \"merged_on_id\"]].itertuples()\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {},
   "outputs": [],
   "source": [
    "couriers = (\n",
    "    couriers[couriers[\"id\"] == couriers[\"merged_on_id\"]]\n",
    "    .set_index(\"id\")\n",
    "    .sort_index()[\n",
    "        [\"created_at\", \"vehicle\", \"speed\", \"capacity\", \"pay_per_hour\", \"pay_per_order\"]\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert len(couriers) == 2_469"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Clean Salary"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The column `pay_per_hour` defaults to `0` in the database definition. The actual default value is EUR 7,50, which is also the mode in the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0       2398\n",
       "750       70\n",
       "1500       1\n",
       "Name: pay_per_hour, dtype: int64"
      ]
     },
     "execution_count": 91,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "couriers[\"pay_per_hour\"].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The column `pay_per_order` defaults to `0` in the database definition. A more realistic value is EUR 2 (i.e., 200 cents), which is the mode in the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "200        2146\n",
       "0           159\n",
       "300          97\n",
       "400          29\n",
       "500          11\n",
       "2             9\n",
       "600           8\n",
       "20000         3\n",
       "250           3\n",
       "650           2\n",
       "1400000       1\n",
       "1             1\n",
       "Name: pay_per_order, dtype: int64"
      ]
     },
     "execution_count": 92,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "couriers[\"pay_per_order\"].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Whenever a `0` appears in `pay_per_order`, the corresponding `pay_per_hour` is `0` in all cases except one, which is the highest paid courier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert ((couriers[\"pay_per_order\"] == 0) & (couriers[\"pay_per_hour\"] == 0)).sum() == 158"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert ((couriers[\"pay_per_order\"] == 0) & (couriers[\"pay_per_hour\"] > 0)).sum() == 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>created_at</th>\n",
       "      <th>vehicle</th>\n",
       "      <th>speed</th>\n",
       "      <th>capacity</th>\n",
       "      <th>pay_per_hour</th>\n",
       "      <th>pay_per_order</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>721</th>\n",
       "      <td>2016-03-18 16:56:39</td>\n",
       "      <td>bicycle</td>\n",
       "      <td>20.49</td>\n",
       "      <td>100</td>\n",
       "      <td>1500</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             created_at  vehicle  speed  capacity  pay_per_hour  pay_per_order\n",
       "id                                                                            \n",
       "721 2016-03-18 16:56:39  bicycle  20.49       100          1500              0"
      ]
     },
     "execution_count": 95,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "couriers[(couriers[\"pay_per_order\"] == 0) & (couriers[\"pay_per_hour\"] > 0)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Couriers with `0`s in both columns receive the default payment scheme."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "metadata": {},
   "outputs": [],
   "source": [
    "msk_0_pay = (couriers[\"pay_per_hour\"] == 0) & (couriers[\"pay_per_order\"] == 0)\n",
    "\n",
    "couriers.loc[msk_0_pay, \"pay_per_hour\"] = 750\n",
    "couriers.loc[msk_0_pay, \"pay_per_order\"] = 200"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Couriers with a `0` in the `pay_per_hour` column, receive a fixed salary of EUR 7,50."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "metadata": {},
   "outputs": [],
   "source": [
    "couriers.loc[couriers[\"pay_per_hour\"] == 0, \"pay_per_hour\"] = 750"
   ]
  },
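  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check (added for illustration, not part of the original cleaning), no courier should be left with a zero hourly wage after the two imputations above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert (couriers[\"pay_per_hour\"] > 0).all()"
   ]
  },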
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The column `pay_per_order` contains obvious typos that are corrected."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "metadata": {},
   "outputs": [],
   "source": [
    "couriers.loc[\n",
    "    couriers[\"pay_per_order\"].isin([1, 2, 20, 2000, 20000]), \"pay_per_order\"\n",
    "] = 200\n",
    "couriers.loc[couriers[\"pay_per_order\"] == 1400000, \"pay_per_order\"] = 400"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Distribution of the various `pay_per_hour` / `pay_per_order` combinations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Counter({(750, 300): 97,\n",
       "         (750, 200): 2317,\n",
       "         (750, 500): 11,\n",
       "         (750, 400): 30,\n",
       "         (750, 600): 8,\n",
       "         (750, 650): 2,\n",
       "         (1500, 0): 1,\n",
       "         (750, 250): 3})"
      ]
     },
     "execution_count": 99,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "collections.Counter(\n",
    "    (y, z) for (x, y, z) in couriers[[\"pay_per_hour\", \"pay_per_order\"]].itertuples()\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Clean Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 100,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>created_at</th>\n",
       "      <th>vehicle</th>\n",
       "      <th>speed</th>\n",
       "      <th>capacity</th>\n",
       "      <th>pay_per_hour</th>\n",
       "      <th>pay_per_order</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2016-02-21 14:28:41</td>\n",
       "      <td>bicycle</td>\n",
       "      <td>13.90</td>\n",
       "      <td>150</td>\n",
       "      <td>750</td>\n",
       "      <td>300</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2016-02-21 14:28:41</td>\n",
       "      <td>bicycle</td>\n",
       "      <td>18.47</td>\n",
       "      <td>100</td>\n",
       "      <td>750</td>\n",
       "      <td>300</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2016-02-21 14:28:41</td>\n",
       "      <td>bicycle</td>\n",
       "      <td>19.18</td>\n",
       "      <td>100</td>\n",
       "      <td>750</td>\n",
       "      <td>300</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2016-02-21 14:28:41</td>\n",
       "      <td>bicycle</td>\n",
       "      <td>14.73</td>\n",
       "      <td>150</td>\n",
       "      <td>750</td>\n",
       "      <td>200</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2016-02-21 14:28:41</td>\n",
       "      <td>bicycle</td>\n",
       "      <td>15.20</td>\n",
       "      <td>150</td>\n",
       "      <td>750</td>\n",
       "      <td>200</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            created_at  vehicle  speed  capacity  pay_per_hour  pay_per_order\n",
       "id                                                                           \n",
       "1  2016-02-21 14:28:41  bicycle  13.90       150           750            300\n",
       "2  2016-02-21 14:28:41  bicycle  18.47       100           750            300\n",
       "3  2016-02-21 14:28:41  bicycle  19.18       100           750            300\n",
       "4  2016-02-21 14:28:41  bicycle  14.73       150           750            200\n",
       "5  2016-02-21 14:28:41  bicycle  15.20       150           750            200"
      ]
     },
     "execution_count": 100,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "couriers.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 101,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Int64Index: 2469 entries, 1 to 53764\n",
      "Data columns (total 6 columns):\n",
      " #   Column         Non-Null Count  Dtype         \n",
      "---  ------         --------------  -----         \n",
      " 0   created_at     2469 non-null   datetime64[ns]\n",
      " 1   vehicle        2469 non-null   string        \n",
      " 2   speed          2469 non-null   float64       \n",
      " 3   capacity       2469 non-null   int64         \n",
      " 4   pay_per_hour   2469 non-null   int64         \n",
      " 5   pay_per_order  2469 non-null   int64         \n",
      "dtypes: datetime64[ns](1), float64(1), int64(3), string(1)\n",
      "memory usage: 135.0 KB\n"
     ]
    }
   ],
   "source": [
    "couriers.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 102,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert (\n",
    "    hashlib.sha256(couriers.to_json().encode()).hexdigest()\n",
    "    == \"a1059e93095842120a58c4f74145bb6a14aeb94f47a0d77043896c70f4772afe\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Orders"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Raw Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The order related data is spread over many different tables in the original database. Also, some data is not even normalized. The following SQL query puts all the data into one big relation that is cleaned further below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 103,
   "metadata": {},
   "outputs": [],
   "source": [
    "orders = pd.read_sql_query(\n",
    "    f\"\"\"\n",
    "    SELECT\n",
    "        orders.id,\n",
    "        deliveries.id AS delivery_id,\n",
    "        MD5(CONCAT(orders.email, orders.phone_number)) as customer_id,  -- anonymize the customer data\n",
    "        orders.order_placed_at AS placed_at,\n",
    "        CASE\n",
    "            WHEN orders.preorder IS FALSE\n",
    "                THEN TRUE\n",
    "                ELSE FALSE    \n",
    "            END AS ad_hoc,\n",
    "        CASE\n",
    "            WHEN orders.preorder is TRUE\n",
    "                THEN orders.scheduled_dropoff_at\n",
    "                ELSE NULL\n",
    "            END AS scheduled_delivery_at,\n",
    "        deliveries.status,\n",
    "        cancellations.cancelled_at,\n",
    "        orders.featured_business_id as restaurant_id,\n",
    "        orders.order_sent_at AS restaurant_notified_at,\n",
    "        orders.order_received_at AS restaurant_confirmed_at,\n",
    "        orders.estimated_prep_duration,\n",
    "        orders.estimated_prep_buffer,\n",
    "        deliveries.courier_id,\n",
    "        deliveries.courier_dispatched_at AS dispatch_at,\n",
    "        deliveries.courier_notified_at,\n",
    "        deliveries.courier_accepted_at,\n",
    "        courier_no_accept_confirmed.issue AS courier_no_accept_confirmed_issue,\n",
    "        orders.pickup_address_id,\n",
    "        orders.scheduled_pickup_at,\n",
    "        deliveries.courier_picked_up_at AS pickup_at,\n",
    "        left_pickups.left_pickup_at,\n",
    "        courier_late_at_pickup.issue AS courier_late_at_pickup_issue,\n",
    "        courier_waited_at_pickup.issue AS courier_waited_at_pickup_issue,\n",
    "        courier_no_pickup_confirmed.issue AS courier_no_pickup_confirmed_issue,\n",
    "        orders.dropoff_address_id AS delivery_address_id,\n",
    "        deliveries.first_estimated_dropoff_at AS first_estimated_delivery_at,\n",
    "        deliveries.courier_dropped_off_at AS delivery_at,\n",
    "        courier_waited_at_delivery.issue AS courier_waited_at_delivery_issue,\n",
    "        courier_no_delivery_confirmed.issue AS courier_no_delivery_confirmed_issue,\n",
    "        orders.utilization,\n",
    "        items_totals.sub_total,\n",
    "        orders.delivery_fee,\n",
    "        orders.total,\n",
    "        deliveries.delivery_distance AS logged_delivery_distance,\n",
    "        deliveries.courier_avg_speed AS logged_avg_courier_speed,\n",
    "        CAST(deliveries.courier_avg_speed_distance AS INTEGER) AS logged_avg_courier_speed_distance,\n",
    "        delivery_timings.accepting_time AS logged_accepting_time,\n",
    "        delivery_timings.courier_reaction_time AS logged_reaction_time,\n",
    "        delivery_timings.to_pickup_time AS logged_to_pickup_time,\n",
    "        delivery_timings.expected_wait_pickup_time AS expected_wait_pickup_time,\n",
    "        delivery_timings.wait_pickup_time AS logged_wait_pickup_time,\n",
    "        delivery_timings.pickup_time AS logged_pickup_time,\n",
    "        delivery_timings.courier_late AS logged_courier_late_time,\n",
    "        delivery_timings.vendor_late AS logged_restaurant_late_time,\n",
    "        delivery_timings.to_dropoff_time AS logged_to_delivery_time,\n",
    "        delivery_timings.expected_dropoff_time AS expected_delivery_time,\n",
    "        delivery_timings.dropoff_time AS logged_delivery_time,\n",
    "        delivery_timings.delivery_late AS logged_delivery_late_time,\n",
    "        delivery_timings.total_time AS logged_total_time,\n",
    "        delivery_timings.confirmed_total_time AS logged_confirmed_total_time\n",
    "    FROM\n",
    "        {config.ORIGINAL_SCHEMA}.orders\n",
    "    LEFT OUTER JOIN\n",
    "        {config.ORIGINAL_SCHEMA}.deliveries ON orders.id = deliveries.order_id\n",
    "    LEFT OUTER JOIN\n",
    "        (\n",
    "            SELECT\n",
    "                order_id,\n",
    "                CAST(100 * SUM(price) AS INTEGER) AS sub_total\n",
    "            FROM\n",
    "                {config.ORIGINAL_SCHEMA}.order_records\n",
    "            GROUP BY\n",
    "                order_id\n",
    "        ) AS items_totals ON orders.id = items_totals.order_id\n",
    "    LEFT OUTER JOIN\n",
    "        {config.ORIGINAL_SCHEMA}.delivery_timings ON deliveries.id = delivery_timings.delivery_id\n",
    "    LEFT OUTER JOIN (\n",
    "            SELECT\n",
    "                delivery_id,\n",
    "                MAX(notes) AS issue\n",
    "            FROM\n",
    "                {config.ORIGINAL_SCHEMA}.issues\n",
    "            WHERE\n",
    "                type = 'DispatchIssue'\n",
    "                AND\n",
    "                category = 'no_courier_interaction'\n",
    "            GROUP BY\n",
    "                delivery_id\n",
    "        ) AS courier_no_accept_confirmed ON deliveries.id = courier_no_accept_confirmed.delivery_id\n",
    "    LEFT OUTER JOIN (\n",
    "            SELECT\n",
    "                delivery_id,\n",
    "                MAX(notes) AS issue\n",
    "            FROM\n",
    "                {config.ORIGINAL_SCHEMA}.issues\n",
    "            WHERE\n",
    "                type = 'PickupIssue'\n",
    "                AND\n",
    "                category = 'waiting'\n",
    "            GROUP BY\n",
    "                delivery_id\n",
    "        ) AS courier_waited_at_pickup ON deliveries.id = courier_waited_at_pickup.delivery_id\n",
    "    LEFT OUTER JOIN (\n",
    "            SELECT\n",
    "                delivery_id,\n",
    "                MAX(notes) AS issue\n",
    "            FROM\n",
    "                {config.ORIGINAL_SCHEMA}.issues\n",
    "            WHERE\n",
    "                type = 'PickupIssue'\n",
    "                AND\n",
    "                category = 'late'\n",
    "            GROUP BY\n",
    "                delivery_id\n",
    "        ) AS courier_late_at_pickup ON deliveries.id = courier_late_at_pickup.delivery_id\n",
    "    LEFT OUTER JOIN (\n",
    "            SELECT\n",
    "                delivery_id,\n",
    "                MAX(notes) AS issue\n",
    "            FROM\n",
    "                {config.ORIGINAL_SCHEMA}.issues\n",
    "            WHERE\n",
    "                type = 'PickupIssue'\n",
    "                AND\n",
    "                category = 'no_courier_interaction'\n",
    "            GROUP BY\n",
    "                delivery_id\n",
    "        ) AS courier_no_pickup_confirmed ON deliveries.id = courier_no_pickup_confirmed.delivery_id\n",
    "    LEFT OUTER JOIN (\n",
    "            SELECT\n",
    "                delivery_id,\n",
    "                MAX(notes) AS issue\n",
    "            FROM\n",
    "                {config.ORIGINAL_SCHEMA}.issues\n",
    "            WHERE\n",
    "                type = 'DropoffIssue'\n",
    "                AND\n",
    "                category = 'waiting'\n",
    "            GROUP BY\n",
    "                delivery_id\n",
    "        ) AS courier_waited_at_delivery ON deliveries.id = courier_waited_at_delivery.delivery_id\n",
    "    LEFT OUTER JOIN (\n",
    "            SELECT\n",
    "                delivery_id,\n",
    "                MAX(notes) AS issue\n",
    "            FROM\n",
    "                {config.ORIGINAL_SCHEMA}.issues\n",
    "            WHERE\n",
    "                type = 'DropoffIssue'\n",
    "                AND\n",
    "                category = 'late'\n",
    "            GROUP BY\n",
    "                delivery_id\n",
    "        ) AS courier_late_at_delivery ON deliveries.id = courier_late_at_delivery.delivery_id\n",
    "    LEFT OUTER JOIN (\n",
    "            SELECT\n",
    "                delivery_id,\n",
    "                MAX(notes) AS issue\n",
    "            FROM\n",
    "                {config.ORIGINAL_SCHEMA}.issues\n",
    "            WHERE\n",
    "                type = 'DropoffIssue'\n",
    "                AND\n",
    "                category = 'no_courier_interaction'\n",
    "            GROUP BY\n",
    "                delivery_id\n",
    "        ) AS courier_no_delivery_confirmed ON deliveries.id = courier_no_delivery_confirmed.delivery_id\n",
    "    LEFT OUTER JOIN (\n",
    "            SELECT\n",
    "                delivery_id,\n",
    "                courier_id,\n",
    "                MAX(created_at) AS left_pickup_at\n",
    "            FROM (\n",
    "                SELECT\n",
    "                    delivery_id,\n",
    "                    (metadata -> 'courier_id')::TEXT::INTEGER AS courier_id,\n",
    "                    created_at\n",
    "                FROM\n",
    "                    {config.ORIGINAL_SCHEMA}.delivery_transitions\n",
    "                WHERE\n",
    "                    to_state = 'left_pickup'\n",
    "            ) AS left_pickups\n",
    "            GROUP BY\n",
    "                delivery_id,\n",
    "                courier_id\n",
    "        ) AS left_pickups ON deliveries.id = left_pickups.delivery_id AND deliveries.courier_id = left_pickups.courier_id\n",
    "    LEFT OUTER JOIN (\n",
    "            SELECT\n",
    "                delivery_id,\n",
    "                MAX(created_at) AS cancelled_at\n",
    "            FROM\n",
    "                {config.ORIGINAL_SCHEMA}.delivery_transitions\n",
    "            WHERE\n",
    "                to_state = 'cancelled'\n",
    "            GROUP BY\n",
    "                delivery_id\n",
    "        ) AS cancellations ON deliveries.id = cancellations.delivery_id\n",
    "    WHERE\n",
    "        orders.featured_business_id IN (\n",
    "            SELECT  -- Subquery based off the restaurants query above!\n",
    "                id\n",
    "            FROM\n",
    "                {config.ORIGINAL_SCHEMA}.businesses\n",
    "            WHERE\n",
    "                address_id IN (\n",
    "                    SELECT id FROM {config.ORIGINAL_SCHEMA}.addresses WHERE city_id IN %(city_ids)s\n",
    "                )\n",
    "                AND\n",
    "                created_at < '{config.CUTOFF_DAY}'\n",
    "        )\n",
    "        AND\n",
    "        scheduled_dropoff_at < '{config.CUTOFF_DAY}'\n",
    "        AND\n",
    "        deliveries.is_primary IS TRUE\n",
    "    ORDER BY\n",
    "        orders.id\n",
    "    \"\"\",\n",
    "    con=connection,\n",
    "    index_col=\"id\",\n",
    "    params={\"city_ids\": city_ids},\n",
    "    parse_dates=[\n",
    "        \"placed_at\",\n",
    "        \"scheduled_delivery_at\",\n",
    "        \"cancelled_at\",\n",
    "        \"restaurant_notified_at\",\n",
    "        \"restaurant_confirmed_at\",\n",
    "        \"dispatch_at\",\n",
    "        \"courier_notified_at\",\n",
    "        \"courier_accepted_at\",\n",
    "        \"pickup_at\",\n",
    "        \"left_pickup_at\",\n",
    "        \"first_estimated_delivery_at\",\n",
    "        \"delivery_at\",\n",
    "    ],\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "metadata": {},
   "outputs": [],
   "source": [
    "orders = orders.astype(\n",
    "    {\n",
    "        \"customer_id\": \"string\",\n",
    "        \"status\": \"string\",\n",
    "        \"estimated_prep_duration\": \"Int64\",\n",
    "        \"courier_id\": \"Int64\",\n",
    "        \"courier_no_accept_confirmed_issue\": \"string\",\n",
    "        \"courier_late_at_pickup_issue\": \"string\",\n",
    "        \"courier_waited_at_pickup_issue\": \"string\",\n",
    "        \"courier_no_pickup_confirmed_issue\": \"string\",\n",
    "        \"courier_waited_at_delivery_issue\": \"string\",\n",
    "        \"courier_no_delivery_confirmed_issue\": \"string\",\n",
    "        \"logged_avg_courier_speed_distance\": \"Int64\",\n",
    "        \"logged_accepting_time\": \"Int64\",\n",
    "        \"logged_reaction_time\": \"Int64\",\n",
    "        \"logged_to_pickup_time\": \"Int64\",\n",
    "        \"expected_wait_pickup_time\": \"Int64\",\n",
    "        \"logged_wait_pickup_time\": \"Int64\",\n",
    "        \"logged_pickup_time\": \"Int64\",\n",
    "        \"logged_courier_late_time\": \"Int64\",\n",
    "        \"logged_restaurant_late_time\": \"Int64\",\n",
    "        \"logged_to_delivery_time\": \"Int64\",\n",
    "        \"expected_delivery_time\": \"Int64\",\n",
    "        \"logged_delivery_time\": \"Int64\",\n",
    "        \"logged_delivery_late_time\": \"Int64\",\n",
    "        \"logged_total_time\": \"Int64\",\n",
    "        \"logged_confirmed_total_time\": \"Int64\",\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 105,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>delivery_id</th>\n",
       "      <th>customer_id</th>\n",
       "      <th>placed_at</th>\n",
       "      <th>ad_hoc</th>\n",
       "      <th>scheduled_delivery_at</th>\n",
       "      <th>status</th>\n",
       "      <th>cancelled_at</th>\n",
       "      <th>restaurant_id</th>\n",
       "      <th>restaurant_notified_at</th>\n",
       "      <th>restaurant_confirmed_at</th>\n",
       "      <th>estimated_prep_duration</th>\n",
       "      <th>estimated_prep_buffer</th>\n",
       "      <th>courier_id</th>\n",
       "      <th>dispatch_at</th>\n",
       "      <th>courier_notified_at</th>\n",
       "      <th>courier_accepted_at</th>\n",
       "      <th>courier_no_accept_confirmed_issue</th>\n",
       "      <th>pickup_address_id</th>\n",
       "      <th>scheduled_pickup_at</th>\n",
       "      <th>pickup_at</th>\n",
       "      <th>left_pickup_at</th>\n",
       "      <th>courier_late_at_pickup_issue</th>\n",
       "      <th>courier_waited_at_pickup_issue</th>\n",
       "      <th>courier_no_pickup_confirmed_issue</th>\n",
       "      <th>delivery_address_id</th>\n",
       "      <th>first_estimated_delivery_at</th>\n",
       "      <th>delivery_at</th>\n",
       "      <th>courier_waited_at_delivery_issue</th>\n",
       "      <th>courier_no_delivery_confirmed_issue</th>\n",
       "      <th>utilization</th>\n",
       "      <th>sub_total</th>\n",
       "      <th>delivery_fee</th>\n",
       "      <th>total</th>\n",
       "      <th>logged_delivery_distance</th>\n",
       "      <th>logged_avg_courier_speed</th>\n",
       "      <th>logged_avg_courier_speed_distance</th>\n",
       "      <th>logged_accepting_time</th>\n",
       "      <th>logged_reaction_time</th>\n",
       "      <th>logged_to_pickup_time</th>\n",
       "      <th>expected_wait_pickup_time</th>\n",
       "      <th>logged_wait_pickup_time</th>\n",
       "      <th>logged_pickup_time</th>\n",
       "      <th>logged_courier_late_time</th>\n",
       "      <th>logged_restaurant_late_time</th>\n",
       "      <th>logged_to_delivery_time</th>\n",
       "      <th>expected_delivery_time</th>\n",
       "      <th>logged_delivery_time</th>\n",
       "      <th>logged_delivery_late_time</th>\n",
       "      <th>logged_total_time</th>\n",
       "      <th>logged_confirmed_total_time</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>487377</td>\n",
       "      <td>f172baefbdf8550eb2c6bcf667e43090</td>\n",
       "      <td>2016-02-22 09:42:01</td>\n",
       "      <td>False</td>\n",
       "      <td>2016-02-22 11:30:00</td>\n",
       "      <td>cancelled</td>\n",
       "      <td>2016-10-18 07:52:45.136243</td>\n",
       "      <td>1</td>\n",
       "      <td>NaT</td>\n",
       "      <td>NaT</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>0</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>NaT</td>\n",
       "      <td>NaT</td>\n",
       "      <td>NaT</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>2</td>\n",
       "      <td>2016-02-22 11:15:00.000000</td>\n",
       "      <td>NaT</td>\n",
       "      <td>NaT</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>3</td>\n",
       "      <td>NaT</td>\n",
       "      <td>NaT</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>15</td>\n",
       "      <td>1250</td>\n",
       "      <td>250</td>\n",
       "      <td>1500</td>\n",
       "      <td>671</td>\n",
       "      <td>NaN</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>487433</td>\n",
       "      <td>cc2f0563675b4b2e9b985bfa4d7a157f</td>\n",
       "      <td>2016-02-22 09:40:09</td>\n",
       "      <td>False</td>\n",
       "      <td>2016-02-22 11:00:00</td>\n",
       "      <td>completed</td>\n",
       "      <td>NaT</td>\n",
       "      <td>2</td>\n",
       "      <td>2016-02-22 10:31:03.664857</td>\n",
       "      <td>2016-02-22 10:46:43.348671</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>0</td>\n",
       "      <td>96</td>\n",
       "      <td>2016-02-22 10:37:00.683271</td>\n",
       "      <td>2016-02-22 10:36:04.532254</td>\n",
       "      <td>2016-02-22 10:37:00.683188</td>\n",
       "      <td>Courier did not hit \"Accept\"</td>\n",
       "      <td>4</td>\n",
       "      <td>2016-02-22 10:41:03.664793</td>\n",
       "      <td>2016-02-22 10:54:00.886417</td>\n",
       "      <td>NaT</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>Courier did not hit \"Picked up\"</td>\n",
       "      <td>5</td>\n",
       "      <td>2016-02-23 22:55:23</td>\n",
       "      <td>2016-02-22 11:13:23.772580</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>19</td>\n",
       "      <td>1550</td>\n",
       "      <td>250</td>\n",
       "      <td>1800</td>\n",
       "      <td>2281</td>\n",
       "      <td>NaN</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>2916</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>187</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>69</td>\n",
       "      <td>125</td>\n",
       "      <td>67</td>\n",
       "      <td>69</td>\n",
       "      <td>1745</td>\n",
       "      <td>4791</td>\n",
       "      <td>264</td>\n",
       "      <td>942</td>\n",
       "      <td>5734</td>\n",
       "      <td>5595</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>487444</td>\n",
       "      <td>c09807c923d356a43fac084543de664b</td>\n",
       "      <td>2016-02-22 09:56:16</td>\n",
       "      <td>False</td>\n",
       "      <td>2016-02-22 11:00:00</td>\n",
       "      <td>completed</td>\n",
       "      <td>NaT</td>\n",
       "      <td>3</td>\n",
       "      <td>2016-02-22 10:31:02.531350</td>\n",
       "      <td>2016-02-22 10:42:00.334529</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>0</td>\n",
       "      <td>57</td>\n",
       "      <td>2016-02-22 10:31:30.963423</td>\n",
       "      <td>2016-02-22 10:31:03.587634</td>\n",
       "      <td>2016-02-22 10:31:30.963342</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>6</td>\n",
       "      <td>2016-02-22 10:41:02.531284</td>\n",
       "      <td>2016-02-22 10:54:32.122854</td>\n",
       "      <td>NaT</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>7</td>\n",
       "      <td>2016-02-23 22:25:47</td>\n",
       "      <td>2016-02-22 11:06:07.677724</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>25</td>\n",
       "      <td>2000</td>\n",
       "      <td>250</td>\n",
       "      <td>2250</td>\n",
       "      <td>2449</td>\n",
       "      <td>NaN</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>28</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>1381</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>522</td>\n",
       "      <td>3824</td>\n",
       "      <td>270</td>\n",
       "      <td>464</td>\n",
       "      <td>4288</td>\n",
       "      <td>4192</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>470503</td>\n",
       "      <td>440d641745a8707a7afd5565adc99be2</td>\n",
       "      <td>2016-02-22 10:11:46</td>\n",
       "      <td>False</td>\n",
       "      <td>2016-02-22 11:30:00</td>\n",
       "      <td>completed</td>\n",
       "      <td>NaT</td>\n",
       "      <td>4</td>\n",
       "      <td>2016-02-22 10:32:04.810633</td>\n",
       "      <td>2016-02-22 10:48:01.402101</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>0</td>\n",
       "      <td>39</td>\n",
       "      <td>2016-02-22 10:50:29.141811</td>\n",
       "      <td>2016-02-22 10:50:03.610169</td>\n",
       "      <td>2016-02-22 10:50:29.141722</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>8</td>\n",
       "      <td>2016-02-22 11:17:04.810526</td>\n",
       "      <td>2016-02-22 11:11:26.166161</td>\n",
       "      <td>NaT</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>9</td>\n",
       "      <td>2016-02-22 11:25:50</td>\n",
       "      <td>2016-02-22 11:18:43.289947</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>100</td>\n",
       "      <td>9800</td>\n",
       "      <td>250</td>\n",
       "      <td>10050</td>\n",
       "      <td>196</td>\n",
       "      <td>NaN</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>26</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>439</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>-180</td>\n",
       "      <td>818</td>\n",
       "      <td>-1061</td>\n",
       "      <td>-338</td>\n",
       "      <td>87</td>\n",
       "      <td>4694</td>\n",
       "      <td>350</td>\n",
       "      <td>-676</td>\n",
       "      <td>4017</td>\n",
       "      <td>4017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>487439</td>\n",
       "      <td>0e7b3e4012a00a56b5bf1dee1bc6c1c7</td>\n",
       "      <td>2016-02-22 10:35:58</td>\n",
       "      <td>False</td>\n",
       "      <td>2016-02-22 11:30:00</td>\n",
       "      <td>completed</td>\n",
       "      <td>NaT</td>\n",
       "      <td>5</td>\n",
       "      <td>2016-02-22 10:50:03.866327</td>\n",
       "      <td>2016-02-22 10:50:19.473584</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>0</td>\n",
       "      <td>128</td>\n",
       "      <td>2016-02-22 10:50:28.153663</td>\n",
       "      <td>2016-02-22 10:50:04.830855</td>\n",
       "      <td>2016-02-22 10:50:28.153559</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>10</td>\n",
       "      <td>2016-02-22 11:10:03.866256</td>\n",
       "      <td>2016-02-22 11:05:19.667624</td>\n",
       "      <td>NaT</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>11</td>\n",
       "      <td>2016-02-22 11:28:19</td>\n",
       "      <td>2016-02-22 11:20:44.147345</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>45</td>\n",
       "      <td>3600</td>\n",
       "      <td>250</td>\n",
       "      <td>3850</td>\n",
       "      <td>2584</td>\n",
       "      <td>NaN</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>24</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>284</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>-440</td>\n",
       "      <td>607</td>\n",
       "      <td>-735</td>\n",
       "      <td>-284</td>\n",
       "      <td>694</td>\n",
       "      <td>3242</td>\n",
       "      <td>231</td>\n",
       "      <td>-555</td>\n",
       "      <td>2686</td>\n",
       "      <td>2686</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    delivery_id                       customer_id           placed_at  ad_hoc  \\\n",
       "id                                                                              \n",
       "1        487377  f172baefbdf8550eb2c6bcf667e43090 2016-02-22 09:42:01   False   \n",
       "2        487433  cc2f0563675b4b2e9b985bfa4d7a157f 2016-02-22 09:40:09   False   \n",
       "3        487444  c09807c923d356a43fac084543de664b 2016-02-22 09:56:16   False   \n",
       "4        470503  440d641745a8707a7afd5565adc99be2 2016-02-22 10:11:46   False   \n",
       "5        487439  0e7b3e4012a00a56b5bf1dee1bc6c1c7 2016-02-22 10:35:58   False   \n",
       "\n",
       "   scheduled_delivery_at     status               cancelled_at  restaurant_id  \\\n",
       "id                                                                              \n",
       "1    2016-02-22 11:30:00  cancelled 2016-10-18 07:52:45.136243              1   \n",
       "2    2016-02-22 11:00:00  completed                        NaT              2   \n",
       "3    2016-02-22 11:00:00  completed                        NaT              3   \n",
       "4    2016-02-22 11:30:00  completed                        NaT              4   \n",
       "5    2016-02-22 11:30:00  completed                        NaT              5   \n",
       "\n",
       "       restaurant_notified_at    restaurant_confirmed_at  \\\n",
       "id                                                         \n",
       "1                         NaT                        NaT   \n",
       "2  2016-02-22 10:31:03.664857 2016-02-22 10:46:43.348671   \n",
       "3  2016-02-22 10:31:02.531350 2016-02-22 10:42:00.334529   \n",
       "4  2016-02-22 10:32:04.810633 2016-02-22 10:48:01.402101   \n",
       "5  2016-02-22 10:50:03.866327 2016-02-22 10:50:19.473584   \n",
       "\n",
       "    estimated_prep_duration  estimated_prep_buffer  courier_id  \\\n",
       "id                                                               \n",
       "1                      <NA>                      0        <NA>   \n",
       "2                      <NA>                      0          96   \n",
       "3                      <NA>                      0          57   \n",
       "4                      <NA>                      0          39   \n",
       "5                      <NA>                      0         128   \n",
       "\n",
       "                  dispatch_at        courier_notified_at  \\\n",
       "id                                                         \n",
       "1                         NaT                        NaT   \n",
       "2  2016-02-22 10:37:00.683271 2016-02-22 10:36:04.532254   \n",
       "3  2016-02-22 10:31:30.963423 2016-02-22 10:31:03.587634   \n",
       "4  2016-02-22 10:50:29.141811 2016-02-22 10:50:03.610169   \n",
       "5  2016-02-22 10:50:28.153663 2016-02-22 10:50:04.830855   \n",
       "\n",
       "          courier_accepted_at courier_no_accept_confirmed_issue  \\\n",
       "id                                                                \n",
       "1                         NaT                              <NA>   \n",
       "2  2016-02-22 10:37:00.683188      Courier did not hit \"Accept\"   \n",
       "3  2016-02-22 10:31:30.963342                              <NA>   \n",
       "4  2016-02-22 10:50:29.141722                              <NA>   \n",
       "5  2016-02-22 10:50:28.153559                              <NA>   \n",
       "\n",
       "    pickup_address_id        scheduled_pickup_at                  pickup_at  \\\n",
       "id                                                                            \n",
       "1                   2 2016-02-22 11:15:00.000000                        NaT   \n",
       "2                   4 2016-02-22 10:41:03.664793 2016-02-22 10:54:00.886417   \n",
       "3                   6 2016-02-22 10:41:02.531284 2016-02-22 10:54:32.122854   \n",
       "4                   8 2016-02-22 11:17:04.810526 2016-02-22 11:11:26.166161   \n",
       "5                  10 2016-02-22 11:10:03.866256 2016-02-22 11:05:19.667624   \n",
       "\n",
       "   left_pickup_at courier_late_at_pickup_issue courier_waited_at_pickup_issue  \\\n",
       "id                                                                              \n",
       "1             NaT                         <NA>                           <NA>   \n",
       "2             NaT                         <NA>                           <NA>   \n",
       "3             NaT                         <NA>                           <NA>   \n",
       "4             NaT                         <NA>                           <NA>   \n",
       "5             NaT                         <NA>                           <NA>   \n",
       "\n",
       "   courier_no_pickup_confirmed_issue  delivery_address_id  \\\n",
       "id                                                          \n",
       "1                               <NA>                    3   \n",
       "2    Courier did not hit \"Picked up\"                    5   \n",
       "3                               <NA>                    7   \n",
       "4                               <NA>                    9   \n",
       "5                               <NA>                   11   \n",
       "\n",
       "   first_estimated_delivery_at                delivery_at  \\\n",
       "id                                                          \n",
       "1                          NaT                        NaT   \n",
       "2          2016-02-23 22:55:23 2016-02-22 11:13:23.772580   \n",
       "3          2016-02-23 22:25:47 2016-02-22 11:06:07.677724   \n",
       "4          2016-02-22 11:25:50 2016-02-22 11:18:43.289947   \n",
       "5          2016-02-22 11:28:19 2016-02-22 11:20:44.147345   \n",
       "\n",
       "   courier_waited_at_delivery_issue courier_no_delivery_confirmed_issue  \\\n",
       "id                                                                        \n",
       "1                              <NA>                                <NA>   \n",
       "2                              <NA>                                <NA>   \n",
       "3                              <NA>                                <NA>   \n",
       "4                              <NA>                                <NA>   \n",
       "5                              <NA>                                <NA>   \n",
       "\n",
       "    utilization  sub_total  delivery_fee  total  logged_delivery_distance  \\\n",
       "id                                                                          \n",
       "1            15       1250           250   1500                       671   \n",
       "2            19       1550           250   1800                      2281   \n",
       "3            25       2000           250   2250                      2449   \n",
       "4           100       9800           250  10050                       196   \n",
       "5            45       3600           250   3850                      2584   \n",
       "\n",
       "    logged_avg_courier_speed  logged_avg_courier_speed_distance  \\\n",
       "id                                                                \n",
       "1                        NaN                               <NA>   \n",
       "2                        NaN                               <NA>   \n",
       "3                        NaN                               <NA>   \n",
       "4                        NaN                               <NA>   \n",
       "5                        NaN                               <NA>   \n",
       "\n",
       "    logged_accepting_time  logged_reaction_time  logged_to_pickup_time  \\\n",
       "id                                                                       \n",
       "1                    <NA>                  <NA>                   <NA>   \n",
       "2                    2916                  <NA>                    187   \n",
       "3                      28                  <NA>                   1381   \n",
       "4                      26                  <NA>                    439   \n",
       "5                      24                  <NA>                    284   \n",
       "\n",
       "    expected_wait_pickup_time  logged_wait_pickup_time  logged_pickup_time  \\\n",
       "id                                                                           \n",
       "1                        <NA>                     <NA>                <NA>   \n",
       "2                        <NA>                       69                 125   \n",
       "3                        <NA>                     <NA>                   0   \n",
       "4                        <NA>                     -180                 818   \n",
       "5                        <NA>                     -440                 607   \n",
       "\n",
       "    logged_courier_late_time  logged_restaurant_late_time  \\\n",
       "id                                                          \n",
       "1                       <NA>                         <NA>   \n",
       "2                         67                           69   \n",
       "3                          0                            0   \n",
       "4                      -1061                         -338   \n",
       "5                       -735                         -284   \n",
       "\n",
       "    logged_to_delivery_time  expected_delivery_time  logged_delivery_time  \\\n",
       "id                                                                          \n",
       "1                      <NA>                    <NA>                  <NA>   \n",
       "2                      1745                    4791                   264   \n",
       "3                       522                    3824                   270   \n",
       "4                        87                    4694                   350   \n",
       "5                       694                    3242                   231   \n",
       "\n",
       "    logged_delivery_late_time  logged_total_time  logged_confirmed_total_time  \n",
       "id                                                                             \n",
       "1                        <NA>               <NA>                         <NA>  \n",
       "2                         942               5734                         5595  \n",
       "3                         464               4288                         4192  \n",
       "4                        -676               4017                         4017  \n",
       "5                        -555               2686                         2686  "
      ]
     },
     "execution_count": 105,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "orders.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 106,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Int64Index: 661314 entries, 1 to 688690\n",
      "Data columns (total 50 columns):\n",
      " #   Column                               Non-Null Count   Dtype         \n",
      "---  ------                               --------------   -----         \n",
      " 0   delivery_id                          661314 non-null  int64         \n",
      " 1   customer_id                          661314 non-null  string        \n",
      " 2   placed_at                            661314 non-null  datetime64[ns]\n",
      " 3   ad_hoc                               661314 non-null  bool          \n",
      " 4   scheduled_delivery_at                83579 non-null   datetime64[ns]\n",
      " 5   status                               661314 non-null  string        \n",
      " 6   cancelled_at                         23665 non-null   datetime64[ns]\n",
      " 7   restaurant_id                        661314 non-null  int64         \n",
      " 8   restaurant_notified_at               656311 non-null  datetime64[ns]\n",
      " 9   restaurant_confirmed_at              639636 non-null  datetime64[ns]\n",
      " 10  estimated_prep_duration              552317 non-null  Int64         \n",
      " 11  estimated_prep_buffer                661314 non-null  int64         \n",
      " 12  courier_id                           648850 non-null  Int64         \n",
      " 13  dispatch_at                          656923 non-null  datetime64[ns]\n",
      " 14  courier_notified_at                  430500 non-null  datetime64[ns]\n",
      " 15  courier_accepted_at                  650964 non-null  datetime64[ns]\n",
      " 16  courier_no_accept_confirmed_issue    40105 non-null   string        \n",
      " 17  pickup_address_id                    661314 non-null  int64         \n",
      " 18  scheduled_pickup_at                  656727 non-null  datetime64[ns]\n",
      " 19  pickup_at                            638249 non-null  datetime64[ns]\n",
      " 20  left_pickup_at                       312171 non-null  datetime64[ns]\n",
      " 21  courier_late_at_pickup_issue         67499 non-null   string        \n",
      " 22  courier_waited_at_pickup_issue       29469 non-null   string        \n",
      " 23  courier_no_pickup_confirmed_issue    35394 non-null   string        \n",
      " 24  delivery_address_id                  661314 non-null  int64         \n",
      " 25  first_estimated_delivery_at          657796 non-null  datetime64[ns]\n",
      " 26  delivery_at                          637808 non-null  datetime64[ns]\n",
      " 27  courier_waited_at_delivery_issue     32415 non-null   string        \n",
      " 28  courier_no_delivery_confirmed_issue  13884 non-null   string        \n",
      " 29  utilization                          661314 non-null  int64         \n",
      " 30  sub_total                            661314 non-null  int64         \n",
      " 31  delivery_fee                         661314 non-null  int64         \n",
      " 32  total                                661314 non-null  int64         \n",
      " 33  logged_delivery_distance             661314 non-null  int64         \n",
      " 34  logged_avg_courier_speed             213841 non-null  float64       \n",
      " 35  logged_avg_courier_speed_distance    213841 non-null  Int64         \n",
      " 36  logged_accepting_time                637808 non-null  Int64         \n",
      " 37  logged_reaction_time                 410349 non-null  Int64         \n",
      " 38  logged_to_pickup_time                637808 non-null  Int64         \n",
      " 39  expected_wait_pickup_time            532387 non-null  Int64         \n",
      " 40  logged_wait_pickup_time              497916 non-null  Int64         \n",
      " 41  logged_pickup_time                   637808 non-null  Int64         \n",
      " 42  logged_courier_late_time             503505 non-null  Int64         \n",
      " 43  logged_restaurant_late_time          503505 non-null  Int64         \n",
      " 44  logged_to_delivery_time              637808 non-null  Int64         \n",
      " 45  expected_delivery_time               637808 non-null  Int64         \n",
      " 46  logged_delivery_time                 637808 non-null  Int64         \n",
      " 47  logged_delivery_late_time            637808 non-null  Int64         \n",
      " 48  logged_total_time                    637808 non-null  Int64         \n",
      " 49  logged_confirmed_total_time          637808 non-null  Int64         \n",
      "dtypes: Int64(17), bool(1), datetime64[ns](13), float64(1), int64(10), string(8)\n",
      "memory usage: 263.6 MB\n"
     ]
    }
   ],
   "source": [
    "orders.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 107,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert len(orders) == 661_314"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Adjust Time Zones"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 108,
   "metadata": {},
   "outputs": [],
   "source": [
    "for column in [\n",
    "    \"placed_at\",\n",
    "    \"scheduled_delivery_at\",\n",
    "    \"cancelled_at\",\n",
    "    \"restaurant_notified_at\",\n",
    "    \"restaurant_confirmed_at\",\n",
    "    \"dispatch_at\",\n",
    "    \"courier_notified_at\",\n",
    "    \"courier_accepted_at\",\n",
    "    \"scheduled_pickup_at\",\n",
    "    \"pickup_at\",\n",
    "    \"left_pickup_at\",\n",
    "    \"first_estimated_delivery_at\",\n",
    "    \"delivery_at\",\n",
    "]:\n",
    "    orders[column] = clean_datetime(orders[column])"
   ]
  },
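  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`clean_datetime` is defined earlier in this notebook. As a rough, illustrative sketch of the idea only (treating the raw timestamps as naive UTC and using `Europe/Paris` as the local time zone are assumptions made for this sketch; the actual helper may differ):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch only; the notebook's actual `clean_datetime`\n",
    "# is defined in an earlier cell and may differ in its details.\n",
    "def _clean_datetime_sketch(column: pd.Series) -> pd.Series:\n",
    "    return (\n",
    "        column.dt.tz_localize(tz.utc)  # assumption: raw timestamps are UTC\n",
    "        .dt.tz_convert(tz.timezone(\"Europe/Paris\"))  # assumption: local tz\n",
    "        .dt.tz_localize(None)  # keep naive timestamps, now in local time\n",
    "    )"
   ]
  },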
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Use Merged Addresses"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "About 0.02 % of the orders belong to discarded addresses and are discarded also."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 109,
   "metadata": {},
   "outputs": [],
   "source": [
    "orders[\"pickup_address_id\"] = orders[\"pickup_address_id\"].map(address_merger)\n",
    "orders[\"delivery_address_id\"] = orders[\"delivery_address_id\"].map(address_merger)\n",
    "msk = orders[\"pickup_address_id\"].isnull() | orders[\"delivery_address_id\"].isnull()\n",
    "orders = orders[~msk].astype({\"pickup_address_id\": int, \"delivery_address_id\": int,})\n",
    "\n",
    "assert msk.sum() == 160"
   ]
  },
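  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`address_merger` (defined earlier in this notebook) behaves like a mapping from original address IDs to the canonical, merged IDs; discarded addresses are absent from it, so `.map()` yields missing values for them. A miniature illustration with made-up IDs:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Made-up IDs, for illustration only: addresses 1 and 2 were merged\n",
    "# into 1, address 3 is kept, and address 4 was discarded.\n",
    "_merger_example = pd.Series({1: 1, 2: 1, 3: 3})\n",
    "\n",
    "assert pd.Series([1, 2, 4]).map(_merger_example).isnull().sum() == 1"
   ]
  },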
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Use Merged Restaurants"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 110,
   "metadata": {},
   "outputs": [],
   "source": [
    "orders[\"restaurant_id\"] = orders[\"restaurant_id\"].map(restaurants_merger)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 111,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert not orders[\"restaurant_id\"].isnull().any()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Use Merged Couriers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 112,
   "metadata": {},
   "outputs": [],
   "source": [
    "orders[\"courier_id\"] = orders[\"courier_id\"].map(couriers_merger).astype(\"Int64\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Verify that the couriers' IDs are the same in `couriers` and `orders`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 113,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert set(couriers.index) == set(\n",
    "    orders.loc[orders[\"courier_id\"].notnull(), \"courier_id\"].unique()\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Clean User IDs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Convert the MD5 hashed emails and phone numbers into integer ID's."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 114,
   "metadata": {},
   "outputs": [],
   "source": [
    "orders[\"customer_id\"] = (\n",
    "    orders[\"customer_id\"]\n",
    "    .map({y: x for (x, y) in enumerate(orders[\"customer_id\"].unique(), start=1)})\n",
    "    .astype({\"customer_id\": int})\n",
    ")"
   ]
  },
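  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By construction, the enumeration starts at `1` and covers every unique hash, so the new integer IDs are dense. A quick sanity check (illustrative; not part of the original pipeline):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The IDs must cover 1 .. number of unique customers without gaps.\n",
    "assert orders[\"customer_id\"].min() == 1\n",
    "assert orders[\"customer_id\"].max() == orders[\"customer_id\"].nunique()"
   ]
  },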
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Ad-hoc vs. Scheduled Orders"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ad-hoc orders never have a `scheduled_delivery_at` value set, and scheduled orders always have it set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 115,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert not (\n",
    "    (orders[\"ad_hoc\"] == True) & orders[\"scheduled_delivery_at\"].notnull()\n",
    ").any()\n",
    "assert not (\n",
    "    (orders[\"ad_hoc\"] == False) & orders[\"scheduled_delivery_at\"].isnull()\n",
    ").any()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For all adjusted timestamps we add `*_corrected` columns indicating if a correction is made in the following."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 116,
   "metadata": {},
   "outputs": [],
   "source": [
    "for column in [\n",
    "    \"scheduled_delivery_at\",\n",
    "    \"cancelled_at\",\n",
    "    \"restaurant_notified_at\",\n",
    "    \"restaurant_confirmed_at\",\n",
    "    \"estimated_prep_duration\",\n",
    "    \"dispatch_at\",\n",
    "    \"courier_notified_at\",\n",
    "    \"courier_accepted_at\",\n",
    "    \"pickup_at\",\n",
    "    \"left_pickup_at\",\n",
    "    \"delivery_at\",\n",
    "]:\n",
    "    orders[column + \"_corrected\"] = False\n",
    "    orders.loc[orders[column].isnull(), column + \"_corrected\"] = pd.NA"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some customers managed to place scheduled orders for the past. These are converted into ad-hoc orders."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 117,
   "metadata": {},
   "outputs": [],
   "source": [
    "msk = orders[\"scheduled_delivery_at\"] < orders[\"placed_at\"]\n",
    "orders.loc[msk, \"ad_hoc\"] = True\n",
    "orders.loc[msk, \"scheduled_delivery_at\"] = pd.NaT\n",
    "orders.loc[msk, \"scheduled_delivery_at_corrected\"] = True\n",
    "\n",
    "assert msk.sum() == 11"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Orders scheduled within the next 30 minutes are treated as ad-hoc orders. With the median fulfillment time of ad-hoc orders being 34 minutes, it is absolutely unrealistic to fulfill such a scheduled order on time. This should not influence the KPIs in a bad way."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 118,
   "metadata": {},
   "outputs": [],
   "source": [
    "msk = (orders[\"ad_hoc\"] == False) & (\n",
    "    orders[\"scheduled_delivery_at\"] - orders[\"placed_at\"]\n",
    "    < datetime.timedelta(minutes=30)\n",
    ")\n",
    "orders.loc[msk, \"ad_hoc\"] = True\n",
    "orders.loc[msk, \"scheduled_delivery_at\"] = pd.NaT\n",
    "orders.loc[msk, \"scheduled_delivery_at_corrected\"] = True\n",
    "\n",
    "assert msk.sum() == 3_267"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For scheduled orders, `scheduled_delivery_at` is mostly set to quarters of an hour. The seconds part is always `0`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 119,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert not (\n",
    "    (orders[\"ad_hoc\"] == False) & (orders[\"scheduled_delivery_at\"].dt.second != 0)\n",
    ").any()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If a customer managed to enter something other than a quarter of an hour as `scheduled_delivery_at`, we adjust that."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 120,
   "metadata": {},
   "outputs": [],
   "source": [
    "msk = (orders[\"ad_hoc\"] == False) & (\n",
    "    orders[\"scheduled_delivery_at\"].dt.minute % 15 != 0\n",
    ")\n",
    "round_down = msk & (orders[\"scheduled_delivery_at\"].dt.minute % 15 < 8)\n",
    "orders.loc[round_down, \"scheduled_delivery_at\"] = orders.loc[\n",
    "    round_down, \"scheduled_delivery_at\"\n",
    "] - (orders.loc[round_down, \"scheduled_delivery_at\"].dt.minute % 15).map(\n",
    "    lambda m: datetime.timedelta(minutes=m)\n",
    ")\n",
    "round_up = msk & (orders[\"scheduled_delivery_at\"].dt.minute % 15 >= 8)\n",
    "orders.loc[round_up, \"scheduled_delivery_at\"] = orders.loc[\n",
    "    round_up, \"scheduled_delivery_at\"\n",
    "] + (orders.loc[round_up, \"scheduled_delivery_at\"].dt.minute % 15).map(\n",
    "    lambda m: datetime.timedelta(minutes=(15 - m))\n",
    ")\n",
    "orders.loc[msk, \"scheduled_delivery_at_corrected\"] = True\n",
    "\n",
    "assert msk.sum() == 6"
   ]
  },
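  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a toy illustration of the rounding rule on synthetic timestamps (not part of the cleaning itself):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Minutes fewer than 8 past a quarter are rounded down, the rest up.\n",
    "ts = pd.Series(pd.to_datetime([\"2016-02-22 11:07:00\", \"2016-02-22 11:08:00\"]))\n",
    "rest = ts.dt.minute % 15\n",
    "down = (rest != 0) & (rest < 8)\n",
    "up = rest >= 8\n",
    "ts.loc[down] -= rest.loc[down].map(lambda m: datetime.timedelta(minutes=m))\n",
    "ts.loc[up] += rest.loc[up].map(lambda m: datetime.timedelta(minutes=15 - m))\n",
    "\n",
    "assert list(ts.dt.minute) == [0, 15]  # 11:07 -> 11:00 and 11:08 -> 11:15"
   ]
  },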
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Timestamps"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All timestamps in `orders` must occur in a strict sequence (i.e., order) according to the delivery process. A tiny fraction of the orders has timestamps that do not comply with that and are adjusted in the following."
   ]
  },
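  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To summarize one of the orderings enforced below: an illustrative helper (not used in the actual cleaning) that checks a single row, skipping missing timestamps:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# One chain implied by the process; each timestamp that is present\n",
    "# must be strictly later than the previous one.\n",
    "_process_sequence = [\n",
    "    \"placed_at\",\n",
    "    \"restaurant_notified_at\",\n",
    "    \"restaurant_confirmed_at\",\n",
    "    \"pickup_at\",\n",
    "    \"left_pickup_at\",\n",
    "    \"delivery_at\",\n",
    "]\n",
    "\n",
    "\n",
    "def _in_sequence(row: pd.Series) -> bool:\n",
    "    stamps = [row[col] for col in _process_sequence if pd.notnull(row[col])]\n",
    "    return all(a < b for (a, b) in zip(stamps, stamps[1:]))"
   ]
  },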
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`placed_at` must always be the earliest of all timestamps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 121,
   "metadata": {},
   "outputs": [],
   "source": [
    "for column in [\n",
    "    \"scheduled_delivery_at\",\n",
    "    \"cancelled_at\",\n",
    "    \"restaurant_notified_at\",\n",
    "    \"restaurant_confirmed_at\",\n",
    "    \"dispatch_at\",\n",
    "    \"courier_notified_at\",\n",
    "    \"courier_accepted_at\",\n",
    "    \"pickup_at\",\n",
    "    \"left_pickup_at\",\n",
    "    \"first_estimated_delivery_at\",\n",
    "    \"delivery_at\",\n",
    "]:\n",
    "    assert not (orders[\"placed_at\"] >= orders[column]).any()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Rarely, a restaurant confirmed an order before it was notified about it. We keep `restaurant_confirmed_at` in these cases."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 122,
   "metadata": {},
   "outputs": [],
   "source": [
    "msk = orders[\"restaurant_notified_at\"] >= orders[\"restaurant_confirmed_at\"]\n",
    "orders.loc[msk, \"restaurant_notified_at\"] = pd.NaT\n",
    "orders.loc[msk, \"restaurant_notified_at_corrected\"] = True\n",
    "\n",
    "assert msk.sum() == 47"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Whenever `restaurant_notified_at` or `restaurant_confirmed_at` is later than `pickup_at`, we discard the values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 123,
   "metadata": {},
   "outputs": [],
   "source": [
    "msk = orders[\"restaurant_notified_at\"] >= orders[\"pickup_at\"]\n",
    "orders.loc[msk, \"restaurant_notified_at\"] = pd.NaT\n",
    "orders.loc[msk, \"restaurant_notified_at_corrected\"] = True\n",
    "\n",
    "assert msk.sum() == 73"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 124,
   "metadata": {},
   "outputs": [],
   "source": [
    "msk = orders[\"restaurant_confirmed_at\"] >= orders[\"pickup_at\"]\n",
    "orders.loc[msk, \"restaurant_confirmed_at\"] = pd.NaT\n",
    "orders.loc[msk, \"restaurant_confirmed_at_corrected\"] = True\n",
    "\n",
    "assert msk.sum() == 2_001"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If a courier forgot to confirm the pickup, `pickup_at` and `delivery_at` are the same."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 125,
   "metadata": {},
   "outputs": [],
   "source": [
    "msk = orders[\"delivery_at\"] == orders[\"pickup_at\"]\n",
    "orders.loc[msk, \"pickup_at\"] = pd.NaT\n",
    "orders.loc[msk, \"pickup_at_corrected\"] = True\n",
    "\n",
    "assert msk.sum() == 16"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 126,
   "metadata": {},
   "outputs": [],
   "source": [
    "msk = orders[\"delivery_at\"] == orders[\"left_pickup_at\"]\n",
    "orders.loc[msk, \"left_pickup_at\"] = pd.NaT\n",
    "orders.loc[msk, \"left_pickup_at_corrected\"] = True\n",
    "\n",
    "assert msk.sum() == 15"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`delivery_at` must be the latest of all dispatch-related timestamps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 127,
   "metadata": {},
   "outputs": [],
   "source": [
    "for column in [\n",
    "    \"dispatch_at\",\n",
    "    \"courier_notified_at\",\n",
    "    \"courier_accepted_at\",\n",
    "    \"pickup_at\",\n",
    "    \"left_pickup_at\",\n",
    "]:\n",
    "    assert not (orders[\"delivery_at\"] <= orders[column]).any()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In about 14,500 cases the `left_pickup_at` lies before or on `pickup_at`. This only affects orders between September 6 and October 17. We discard these timestamps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 128,
   "metadata": {},
   "outputs": [],
   "source": [
    "msk = orders[\"left_pickup_at\"] < orders[\"pickup_at\"]\n",
    "orders.loc[msk, \"left_pickup_at\"] = pd.NaT\n",
    "orders.loc[msk, \"left_pickup_at_corrected\"] = True\n",
    "\n",
    "assert msk.sum() == 14_013\n",
    "assert orders.loc[msk, \"placed_at\"].min().date() == datetime.date(2016, 9, 6)\n",
    "assert orders.loc[msk, \"placed_at\"].max().date() == datetime.date(2016, 10, 17)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 129,
   "metadata": {},
   "outputs": [],
   "source": [
    "msk = orders[\"left_pickup_at\"] == orders[\"pickup_at\"]\n",
    "orders.loc[msk, \"left_pickup_at\"] = pd.NaT\n",
    "orders.loc[msk, \"left_pickup_at_corrected\"] = True\n",
    "\n",
    "assert msk.sum() == 496"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 130,
   "metadata": {},
   "outputs": [],
   "source": [
    "for column in [\n",
    "    \"dispatch_at\",\n",
    "    \"courier_notified_at\",\n",
    "    \"courier_accepted_at\",\n",
    "    \"pickup_at\",\n",
    "]:\n",
    "    assert not (orders[\"left_pickup_at\"] <= orders[column]).any()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Rarely, `pickup_at` is earlier than or equal to `dispatch_at`, `courier_notified_at`, or `courier_accepted_at`. They are discarded."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 131,
   "metadata": {},
   "outputs": [],
   "source": [
    "msk = orders[\"pickup_at\"] <= orders[\"dispatch_at\"]\n",
    "orders.loc[msk, \"dispatch_at\"] = pd.NaT\n",
    "orders.loc[msk, \"dispatch_at_corrected\"] = True\n",
    "\n",
    "assert msk.sum() == 15"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 132,
   "metadata": {},
   "outputs": [],
   "source": [
    "msk = orders[\"pickup_at\"] <= orders[\"courier_notified_at\"]\n",
    "orders.loc[msk, \"courier_notified_at\"] = pd.NaT\n",
    "orders.loc[msk, \"courier_notified_at_corrected\"] = True\n",
    "\n",
    "assert msk.sum() == 8\n",
    "assert set(orders.loc[msk, \"status\"].unique()) == set([\"cancelled\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 133,
   "metadata": {},
   "outputs": [],
   "source": [
    "msk = orders[\"pickup_at\"] <= orders[\"courier_accepted_at\"]\n",
    "orders.loc[msk, \"courier_accepted_at\"] = pd.NaT\n",
    "orders.loc[msk, \"courier_accepted_at_corrected\"] = True\n",
    "\n",
    "assert msk.sum() == 15"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 134,
   "metadata": {},
   "outputs": [],
   "source": [
    "for column in [\"dispatch_at\", \"courier_notified_at\", \"courier_accepted_at\"]:\n",
    "    assert not (orders[\"pickup_at\"] <= orders[column]).any()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For about 66.000 orders `courier_accepted_at` equals `dispatch_at` or lies before it. We assume the former is correct and discard the latter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 135,
   "metadata": {},
   "outputs": [],
   "source": [
    "msk = orders[\"courier_accepted_at\"] <= orders[\"dispatch_at\"]\n",
    "orders.loc[msk, \"dispatch_at\"] = pd.NaT\n",
    "orders.loc[msk, \"dispatch_at_corrected\"] = True\n",
    "\n",
    "assert msk.sum() == 65_848"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If `courier_accepted_at` is equal or before `courier_notified_at`, we discard the latter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 136,
   "metadata": {},
   "outputs": [],
   "source": [
    "msk = orders[\"courier_accepted_at\"] <= orders[\"courier_notified_at\"]\n",
    "orders.loc[msk, \"courier_notified_at\"] = pd.NaT\n",
    "orders.loc[msk, \"courier_notified_at_corrected\"] = True\n",
    "\n",
    "assert msk.sum() == 165_585"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 137,
   "metadata": {},
   "outputs": [],
   "source": [
    "for column in [\"dispatch_at\", \"courier_notified_at\"]:\n",
    "    assert not (orders[\"courier_accepted_at\"] <= orders[column]).any()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For some more orders, `courier_notified_at` lies before `dispatch_at`. Manual analysis reveals that in most of these cases, the courier did not hit \"accept\". We discard `dispatch_at` as the timings between `courier_notified_at` and `courier_accepted_at` fit the issue messages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 138,
   "metadata": {},
   "outputs": [],
   "source": [
    "msk = orders[\"courier_notified_at\"] <= orders[\"dispatch_at\"]\n",
    "orders.loc[msk, \"dispatch_at\"] = pd.NaT\n",
    "orders.loc[msk, \"dispatch_at_corrected\"] = True\n",
    "\n",
    "assert msk.sum() == 3_397"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ad-hoc orders that were placed before 11 in the morning and after 23 in the evening are discarded. Most of them were cancelled anyways."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 139,
   "metadata": {},
   "outputs": [],
   "source": [
    "msk = (orders[\"ad_hoc\"] == True) & (\n",
    "    (orders[\"placed_at\"].dt.hour <= 10) | (orders[\"placed_at\"].dt.hour >= 23)\n",
    ")\n",
    "orders = orders[~msk]\n",
    "\n",
    "assert msk.sum() == 337"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The orders scheduled for 11:15 (=1) and 11:30 (=37) are scheduled for 11:45. Most of them were not delivered until 12 anyways. This is in line with the 30-minute minimum horizon above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 140,
   "metadata": {},
   "outputs": [],
   "source": [
    "msk = (orders[\"scheduled_delivery_at\"].dt.hour == 11) & (\n",
    "    orders[\"scheduled_delivery_at\"].dt.minute == 30\n",
    ")\n",
    "orders.loc[msk, \"scheduled_delivery_at\"] += datetime.timedelta(minutes=15)\n",
    "orders.loc[msk, \"scheduled_delivery_at_corrected\"] = True\n",
    "\n",
    "assert msk.sum() == 37\n",
    "\n",
    "msk = (orders[\"scheduled_delivery_at\"].dt.hour == 11) & (\n",
    "    orders[\"scheduled_delivery_at\"].dt.minute == 15\n",
    ")\n",
    "orders.loc[msk, \"scheduled_delivery_at\"] += datetime.timedelta(minutes=30)\n",
    "orders.loc[msk, \"scheduled_delivery_at_corrected\"] = True\n",
    "\n",
    "assert msk.sum() == 1\n",
    "\n",
    "assert not (\n",
    "    (orders[\"scheduled_delivery_at\"].dt.hour == 11)\n",
    "    & (orders[\"scheduled_delivery_at\"].dt.minute == 0)\n",
    ").any()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Orders with a scheduled delivery before 11 in the morning and after 23 in the evening are discarded."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 141,
   "metadata": {},
   "outputs": [],
   "source": [
    "msk = (orders[\"ad_hoc\"] == False) & (\n",
    "    (orders[\"scheduled_delivery_at\"].dt.hour <= 10)\n",
    "    | (orders[\"scheduled_delivery_at\"].dt.hour >= 23)\n",
    ")\n",
    "orders = orders[~msk]\n",
    "\n",
    "assert msk.sum() == 159"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 142,
   "metadata": {},
   "outputs": [],
   "source": [
    "ad_hoc = orders[\"ad_hoc\"] == True\n",
    "scheduled = ~ad_hoc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Order Stati"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are only cancelled and completed orders. We replace the `status` column with a boolean `cancelled` column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 143,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert set(orders[\"status\"].unique()) == set([\"cancelled\", \"completed\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 144,
   "metadata": {},
   "outputs": [],
   "source": [
    "orders[\"cancelled\"] = False\n",
    "msk = orders[\"status\"] == \"cancelled\"\n",
    "orders.loc[msk, \"cancelled\"] = True\n",
    "del orders[\"status\"]\n",
    "\n",
    "assert msk.sum() == 23_552"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some cancelled orders still have a `delivery_at` value. All of them have a dummy value for the `cancelled_at` value (cf., below). For roughly two thirds of them, the time between pickup and delivery is so small that it seems unrealistic that they actually were delivered. In these cases, we take `delivery_at` as the realistic `cancelled_at` value. The ones that could have been delivered realistically are treated as completed orders."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 145,
   "metadata": {},
   "outputs": [],
   "source": [
    "claimed_to_be_delivered = (orders[\"cancelled\"] == True) & orders[\n",
    "    \"delivery_at\"\n",
    "].notnull()\n",
    "\n",
    "assert (\n",
    "    orders.loc[claimed_to_be_delivered, \"cancelled_at\"].min()\n",
    "    == orders.loc[claimed_to_be_delivered, \"cancelled_at\"].max()\n",
    "    == datetime.datetime(2016, 10, 18, 9, 52, 45)\n",
    ")\n",
    "\n",
    "realistically_delivered = (\n",
    "    orders[\"delivery_at\"] - orders[\"pickup_at\"]\n",
    ").dt.total_seconds() > 120\n",
    "msk = claimed_to_be_delivered & realistically_delivered\n",
    "orders.loc[msk, \"cancelled\"] = False\n",
    "orders.loc[msk, \"cancelled_at\"] = pd.NaT\n",
    "orders.loc[msk, \"cancelled_at_corrected\"] = True\n",
    "msk = claimed_to_be_delivered & ~realistically_delivered\n",
    "orders.loc[msk, \"cancelled_at\"] = orders.loc[msk, \"delivery_at\"]\n",
    "orders.loc[msk, \"cancelled_at_corrected\"] = True\n",
    "orders.loc[msk, \"delivery_at\"] = pd.NaT\n",
    "orders.loc[msk, \"delivery_at_corrected\"] = pd.NA\n",
    "\n",
    "assert claimed_to_be_delivered.sum() == 159\n",
    "assert (claimed_to_be_delivered & realistically_delivered).sum() == 61\n",
    "assert (claimed_to_be_delivered & ~realistically_delivered).sum() == 98"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Only cancelled orders have a `cancelled_at` value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 146,
   "metadata": {},
   "outputs": [],
   "source": [
    "cancelled = orders[\"cancelled\"] == True\n",
    "completed = orders[\"cancelled\"] == False\n",
    "\n",
    "assert not orders.loc[cancelled, \"cancelled_at\"].isnull().any()\n",
    "assert not orders.loc[completed, \"cancelled_at\"].notnull().any()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Cancelled Orders"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For about 40% of the orders the `cancelled_at` field was only filled in after a system change on October 18 (i.e., in a batch). For these orders, this field is not meaningful because of that. We discard it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 147,
   "metadata": {},
   "outputs": [],
   "source": [
    "batch = orders[\"cancelled_at\"] == datetime.datetime(2016, 10, 18, 9, 52, 45)\n",
    "orders.loc[cancelled & batch, \"cancelled_at\"] = pd.NaT\n",
    "orders.loc[cancelled & batch, \"cancelled_at_corrected\"] = True\n",
    "\n",
    "assert (cancelled & batch).sum() == 9_410"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When a restaurant was notified about an order after the order was cancelled, we discard `restaurant_notified_at` and `restaurant_confirmed_at`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 148,
   "metadata": {},
   "outputs": [],
   "source": [
    "msk = orders[\"cancelled_at\"] <= orders[\"restaurant_notified_at\"]\n",
    "orders.loc[msk, \"restaurant_notified_at\"] = pd.NaT\n",
    "orders.loc[msk, \"restaurant_notified_at_corrected\"] = True\n",
    "\n",
    "assert msk.sum() == 6"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 149,
   "metadata": {},
   "outputs": [],
   "source": [
    "msk = orders[\"cancelled_at\"] <= orders[\"restaurant_confirmed_at\"]\n",
    "orders.loc[msk, \"restaurant_confirmed_at\"] = pd.NaT\n",
    "orders.loc[msk, \"restaurant_confirmed_at_corrected\"] = True\n",
    "\n",
    "assert msk.sum() == 1_253"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When an order was dispatched in the moment it was cancelled, we adjust that."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 150,
   "metadata": {},
   "outputs": [],
   "source": [
    "msk = orders[\"cancelled_at\"] == orders[\"dispatch_at\"]\n",
    "orders.loc[msk, \"dispatch_at\"] -= datetime.timedelta(seconds=1)\n",
    "orders.loc[msk, \"dispatch_at_corrected\"] = True\n",
    "\n",
    "assert msk.sum() == 3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When a courier was notified about or accepted an order in the moment it was cancelled, we adjust that."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 151,
   "metadata": {},
   "outputs": [],
   "source": [
    "msk = orders[\"cancelled_at\"] == orders[\"courier_notified_at\"]\n",
    "orders.loc[msk, \"courier_notified_at\"] -= datetime.timedelta(seconds=1)\n",
    "orders.loc[msk, \"courier_notified_at_corrected\"] = True\n",
    "\n",
    "assert msk.sum() == 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 152,
   "metadata": {},
   "outputs": [],
   "source": [
    "msk = orders[\"cancelled_at\"] == orders[\"courier_accepted_at\"]\n",
    "orders.loc[msk, \"courier_accepted_at\"] -= datetime.timedelta(seconds=1)\n",
    "orders.loc[msk, \"courier_accepted_at_corrected\"] = True\n",
    "\n",
    "assert msk.sum() == 8"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When a courier picked up an order in the moment it was cancelled, we adjust that."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 153,
   "metadata": {},
   "outputs": [],
   "source": [
    "msk = orders[\"cancelled_at\"] == orders[\"pickup_at\"]\n",
    "orders.loc[msk, \"pickup_at\"] -= datetime.timedelta(seconds=1)\n",
    "orders.loc[msk, \"pickup_at_corrected\"] = True\n",
    "\n",
    "assert msk.sum() == 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Verify that `cancelled_at` is indeed the latest timestamp in every row."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 154,
   "metadata": {},
   "outputs": [],
   "source": [
    "orders[\"_max_datetime\"] = pd.NaT\n",
    "orders[\"_max_datetime\"] = orders[\n",
    "    [\n",
    "        \"restaurant_notified_at\",\n",
    "        \"restaurant_confirmed_at\",\n",
    "        \"dispatch_at\",\n",
    "        \"courier_notified_at\",\n",
    "        \"courier_accepted_at\",\n",
    "        \"pickup_at\",\n",
    "        \"left_pickup_at\",\n",
    "        \"delivery_at\",\n",
    "        \"cancelled_at\",\n",
    "    ]\n",
    "].max(axis=1)\n",
    "\n",
    "assert not (\n",
    "    cancelled & ~batch & (orders[\"cancelled_at\"] != orders[\"_max_datetime\"])\n",
    ").any()\n",
    "\n",
    "del orders[\"_max_datetime\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Timings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The times in between the timestamps can be used to obtain timings of individual steps in the delivery process. In the original database, such timings were already logged. In the following, we validate the timestamps against the timings and only keep the timestamps as the timings are then calculated as `@property`s in the ORM layer."
   ]
  },
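  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sketch (simplified and hypothetical; the real classes live in the ORM layer), such a timing is derived on access from two stored timestamps:\n",
    "\n",
    "```python\n",
    "# Minimal sketch of an ORM timing property (simplified, hypothetical):\n",
    "class Order:\n",
    "    def __init__(self, placed_at, delivery_at):\n",
    "        self.placed_at = placed_at\n",
    "        self.delivery_at = delivery_at\n",
    "\n",
    "    @property\n",
    "    def total_time(self):\n",
    "        \"\"\"Time from placement to delivery, as a timedelta.\"\"\"\n",
    "        return self.delivery_at - self.placed_at\n",
    "```"
   ]
  },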
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`confirmed_total_time` is the difference between `placed_at` and `delivery_at`. It is set only for completed orders and useful only for ad_hoc orders. The `Order` class has a `total_time` property that computes that value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 155,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.998"
      ]
     },
     "execution_count": 155,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_data_available = (\n",
    "    orders[\"logged_confirmed_total_time\"].notnull()\n",
    "    & orders[\"delivery_at\"].notnull()\n",
    "    & orders[\"placed_at\"].notnull()\n",
    ")\n",
    "good_data = (\n",
    "    orders[\"logged_confirmed_total_time\"]\n",
    "    - ((orders[\"delivery_at\"] - orders[\"placed_at\"]).dt.total_seconds().round())\n",
    ").abs() <= 5\n",
    "del orders[\"logged_confirmed_total_time\"]\n",
    "\n",
    "assert (all_data_available & good_data).sum() == 635_768\n",
    "assert (all_data_available & good_data & completed).sum() == 635_768\n",
    "assert (all_data_available & good_data & ad_hoc).sum() == 561_340\n",
    "\n",
    "round(\n",
    "    (all_data_available & good_data & ad_hoc).sum()\n",
    "    / (all_data_available & ad_hoc).sum(),\n",
    "    3,\n",
    ")"
   ]
  },
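  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same validation pattern recurs throughout this section: a logged duration counts as \"good\" if it deviates from the corresponding timestamp difference by at most 5 seconds. A self-contained toy example (hypothetical values):\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "placed = pd.Series(pd.to_datetime([\"2016-03-01 12:00:00\"]))\n",
    "delivered = pd.Series(pd.to_datetime([\"2016-03-01 12:35:07\"]))\n",
    "logged = pd.Series([2_105], dtype=\"Int64\")  # seconds\n",
    "\n",
    "# |2105 - 2107| = 2 seconds -> within the 5-second tolerance\n",
    "deviation = (logged - (delivered - placed).dt.total_seconds().round()).abs()\n",
    "assert (deviation <= 5).all()\n",
    "```"
   ]
  },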
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The best guess for `accepting_time` is the difference between `dispatch_at` and `courier_accepted_at`. `Order.time_to_accept` models that."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 156,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.607"
      ]
     },
     "execution_count": 156,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_data_available = (\n",
    "    orders[\"logged_accepting_time\"].notnull()\n",
    "    & orders[\"courier_accepted_at\"].notnull()\n",
    "    & orders[\"dispatch_at\"].notnull()\n",
    ")\n",
    "good_data = (\n",
    "    orders[\"logged_accepting_time\"]\n",
    "    - (\n",
    "        (orders[\"courier_accepted_at\"] - orders[\"dispatch_at\"])\n",
    "        .dt.total_seconds()\n",
    "        .round()\n",
    "    )\n",
    ").abs() <= 5\n",
    "\n",
    "assert (all_data_available & good_data).sum() == 345_803\n",
    "assert (all_data_available & good_data & completed).sum() == 345_803\n",
    "\n",
    "round((all_data_available & good_data).sum() / all_data_available.sum(), 3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use `accepting_time` to extrapolate missing values for `dispatch_at`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 157,
   "metadata": {},
   "outputs": [],
   "source": [
    "extrapolate = (\n",
    "    orders[\"dispatch_at\"].isnull()\n",
    "    & orders[\"courier_accepted_at\"].notnull()\n",
    "    & orders[\"logged_accepting_time\"].notnull()\n",
    ")\n",
    "\n",
    "accept_time = orders[\"logged_accepting_time\"].map(\n",
    "    lambda x: datetime.timedelta(seconds=x) if x is not pd.NA else pd.NaT\n",
    ")\n",
    "extrapolated_dispatch_at = orders[\"courier_accepted_at\"] - accept_time\n",
    "still_wrong = extrapolated_dispatch_at >= orders[\"courier_notified_at\"]\n",
    "msk = extrapolate & ~still_wrong\n",
    "orders.loc[msk, \"dispatch_at\"] = extrapolated_dispatch_at.loc[msk]\n",
    "orders.loc[msk, \"dispatch_at_corrected\"] = True\n",
    "del orders[\"logged_accepting_time\"]\n",
    "\n",
    "assert extrapolate.sum() == 67_372\n",
    "assert (extrapolate & ~still_wrong).sum() == 61_545"
   ]
  },
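  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A note on the `map(lambda ...)` conversion above: `pd.to_timedelta` achieves the same, assuming a prior cast to `float` so that `<NA>` becomes `NaN` and ends up as `NaT`. A minimal sketch:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "seconds = pd.Series([60, pd.NA, 120], dtype=\"Int64\")\n",
    "\n",
    "# The float cast turns <NA> into NaN, which pd.to_timedelta maps to NaT.\n",
    "delta = pd.to_timedelta(seconds.astype(float), unit=\"s\")\n",
    "assert delta.isnull().sum() == 1\n",
    "```"
   ]
  },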
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The best guess for `reaction_time` is the difference between `courier_notified_at` and `courier_accepted_at`. `Order.time_to_react` models that in the ORM."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 158,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.843"
      ]
     },
     "execution_count": 158,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_data_available = (\n",
    "    orders[\"logged_reaction_time\"].notnull()\n",
    "    & orders[\"courier_accepted_at\"].notnull()\n",
    "    & orders[\"courier_notified_at\"].notnull()\n",
    ")\n",
    "good_data = (\n",
    "    orders[\"logged_reaction_time\"]\n",
    "    - (\n",
    "        (orders[\"courier_accepted_at\"] - orders[\"courier_notified_at\"])\n",
    "        .dt.total_seconds()\n",
    "        .round()\n",
    "    )\n",
    ").abs() <= 5\n",
    "\n",
    "assert (all_data_available & good_data).sum() == 165_355\n",
    "assert (all_data_available & good_data & completed).sum() == 165_355\n",
    "\n",
    "round((all_data_available & good_data).sum() / all_data_available.sum(), 3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use `reaction_time` to extrapolate missing values for `courier_notified_at`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 159,
   "metadata": {},
   "outputs": [],
   "source": [
    "extrapolate = (\n",
    "    orders[\"courier_notified_at\"].isnull()\n",
    "    & orders[\"courier_accepted_at\"].notnull()\n",
    "    & orders[\"logged_reaction_time\"].notnull()\n",
    ")\n",
    "extrapolated_courier_notified_at = (\n",
    "    orders[\"courier_accepted_at\"]\n",
    "    # Some values for logged_reaction_time are <= 0.\n",
    "    - orders[\"logged_reaction_time\"].map(\n",
    "        lambda x: datetime.timedelta(seconds=x) if x is not pd.NA and x > 0 else pd.NaT\n",
    "    )\n",
    ")\n",
    "still_wrong = extrapolated_courier_notified_at <= orders[\"dispatch_at\"]\n",
    "msk = extrapolate & ~still_wrong\n",
    "orders.loc[msk, \"courier_notified_at\"] = extrapolated_courier_notified_at.loc[msk]\n",
    "orders.loc[msk, \"courier_notified_at_corrected\"] = True\n",
    "\n",
    "assert extrapolate.sum() == 214_043\n",
    "assert (extrapolate & ~still_wrong).sum() == 213_290"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "No need to extrapolate `courier_accepted_at` from `courier_notified_at`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 160,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert not (\n",
    "    orders[\"courier_notified_at\"].notnull()\n",
    "    & orders[\"courier_accepted_at\"].isnull()\n",
    "    & orders[\"logged_reaction_time\"].notnull()\n",
    ").any()\n",
    "\n",
    "del orders[\"logged_reaction_time\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`estimated_prep_duration` equals `expected_wait_pickup_time`. As the latter is not filled in for cancelled orders, we keep the former.\n",
    "Also, `estimated_prep_duration` is only filled in starting with May 24. It is always a multiple of `60`, so it is stored as full minutes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 161,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.0"
      ]
     },
     "execution_count": 161,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_data_available = (\n",
    "    orders[\"estimated_prep_duration\"].notnull()\n",
    "    & orders[\"expected_wait_pickup_time\"].notnull()\n",
    ")\n",
    "good_data = (\n",
    "    orders[\"estimated_prep_duration\"] - orders[\"expected_wait_pickup_time\"]\n",
    ").abs() <= 5\n",
    "\n",
    "no_duration = orders[\"estimated_prep_duration\"].isnull()\n",
    "\n",
    "assert not (no_duration & orders[\"expected_wait_pickup_time\"].notnull()).any()\n",
    "assert (~no_duration & orders[\"expected_wait_pickup_time\"].isnull()).sum() == 19_865\n",
    "assert not (\n",
    "    (orders[\"placed_at\"].dt.date > datetime.date(2016, 5, 24))\n",
    "    & orders[\"expected_wait_pickup_time\"].isnull()\n",
    "    & (orders[\"cancelled\"] == False)\n",
    ").any()\n",
    "\n",
    "del orders[\"expected_wait_pickup_time\"]\n",
    "\n",
    "assert orders.loc[no_duration, \"placed_at\"].min().date() == datetime.date(2016, 2, 21)\n",
    "assert orders.loc[no_duration, \"placed_at\"].max().date() == datetime.date(2016, 5, 24)\n",
    "assert orders.loc[~no_duration, \"placed_at\"].min().date() == datetime.date(2016, 5, 24)\n",
    "assert orders.loc[~no_duration, \"placed_at\"].max().date() == datetime.date(2017, 1, 31)\n",
    "assert not (~no_duration & (orders[\"estimated_prep_duration\"] % 60 != 0)).any()\n",
    "\n",
    "round((all_data_available & good_data).sum() / all_data_available.sum(), 3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`estimated_prep_duration` is the difference between `restaurant_notified_at` and `scheduled_pickup_at` when allowing up to half a minute of clock skew. `restaurant_confirmed_at` only works in about 40% of the cases. So, if and when a restaurant confirms an order, does not affect the dispatching process."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 162,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.986"
      ]
     },
     "execution_count": 162,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_data_available = (\n",
    "    orders[\"estimated_prep_duration\"].notnull()\n",
    "    & orders[\"restaurant_notified_at\"].notnull()\n",
    "    & orders[\"scheduled_pickup_at\"].notnull()\n",
    ")\n",
    "good_data = (\n",
    "    orders[\"estimated_prep_duration\"]\n",
    "    - (\n",
    "        (orders[\"scheduled_pickup_at\"] - orders[\"restaurant_notified_at\"])\n",
    "        .dt.total_seconds()\n",
    "        .round()\n",
    "    )\n",
    ").abs() <= 35\n",
    "\n",
    "assert (all_data_available & good_data).sum() == 539_668\n",
    "assert (all_data_available & good_data & completed).sum() == 524_709\n",
    "\n",
    "round((all_data_available & good_data).sum() / all_data_available.sum(), 3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use `estimated_prep_duration` to correct about a third of the 1.5% of `restaurant_notified_at` that are off for orders after May 24."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 163,
   "metadata": {},
   "outputs": [],
   "source": [
    "duration = orders[\"estimated_prep_duration\"].map(\n",
    "    lambda x: datetime.timedelta(seconds=x) if x is not pd.NA else pd.NaT\n",
    ")\n",
    "calc_restaurant_notified_at = orders[\"scheduled_pickup_at\"] - duration\n",
    "\n",
    "not_wrong = (\n",
    "    completed\n",
    "    & (orders[\"placed_at\"] < calc_restaurant_notified_at)\n",
    "    & (calc_restaurant_notified_at < orders[\"restaurant_confirmed_at\"])\n",
    ")\n",
    "\n",
    "msk = all_data_available & ~good_data & not_wrong\n",
    "orders.loc[msk, \"restaurant_notified_at\"] = calc_restaurant_notified_at.loc[msk]\n",
    "orders.loc[msk, \"restaurant_notified_at_corrected\"] = True\n",
    "\n",
    "assert (all_data_available & ~good_data).sum() == 7_514\n",
    "assert msk.sum() == 2_425\n",
    "assert orders.loc[msk, \"placed_at\"].min().date() == datetime.date(2016, 5, 24)\n",
    "assert orders.loc[msk, \"placed_at\"].max().date() == datetime.date(2017, 1, 31)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Also, we use `estimated_prep_duration` to extrapolate missing `restaurant_notified_at` values for orders after May 24."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 164,
   "metadata": {},
   "outputs": [],
   "source": [
    "duration = orders[\"estimated_prep_duration\"].map(\n",
    "    lambda x: datetime.timedelta(seconds=x) if x is not pd.NA else pd.NaT\n",
    ")\n",
    "extrapolated = orders[\"scheduled_pickup_at\"] - duration\n",
    "\n",
    "extrapolate = (\n",
    "    orders[\"restaurant_notified_at\"].isnull()\n",
    "    & orders[\"scheduled_pickup_at\"].notnull()\n",
    "    & orders[\"estimated_prep_duration\"].notnull()\n",
    ")\n",
    "still_wrong = (\n",
    "    (extrapolated <= orders[\"placed_at\"])\n",
    "    | (extrapolated >= orders[\"restaurant_confirmed_at\"])\n",
    "    | (extrapolated >= orders[\"cancelled_at\"])\n",
    ")\n",
    "\n",
    "msk = extrapolate & ~still_wrong\n",
    "orders.loc[msk, \"restaurant_notified_at\"] = extrapolated.loc[msk]\n",
    "orders.loc[msk, \"restaurant_notified_at_corrected\"] = True\n",
    "\n",
    "assert extrapolate.sum() == 469\n",
    "assert msk.sum() == 374\n",
    "assert orders.loc[msk, \"placed_at\"].min().date() == datetime.date(2016, 5, 29)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Vice versa, we extrapolate `estimated_prep_duration` as the difference of `scheduled_pickup_at` and `restaurant_notified_at` for orders before May 24."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 165,
   "metadata": {},
   "outputs": [],
   "source": [
    "extrapolated = orders[\"scheduled_pickup_at\"] - orders[\"restaurant_notified_at\"]\n",
    "extrapolated = (extrapolated.dt.total_seconds() // 60 * 60).astype(\"Int64\")\n",
    "\n",
    "extrapolate = (\n",
    "    orders[\"restaurant_notified_at\"].notnull()\n",
    "    & orders[\"scheduled_pickup_at\"].notnull()\n",
    "    & orders[\"estimated_prep_duration\"].isnull()\n",
    ")\n",
    "orders.loc[extrapolate, \"estimated_prep_duration\"] = extrapolated.loc[extrapolate]\n",
    "orders.loc[extrapolate, \"estimated_prep_duration_corrected\"] = True\n",
    "\n",
    "assert extrapolate.sum() == 108_398\n",
    "assert orders.loc[extrapolate, \"placed_at\"].min().date() == datetime.date(2016, 2, 21)\n",
    "assert orders.loc[extrapolate, \"placed_at\"].max().date() == datetime.date(2016, 5, 24)"
   ]
  },
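  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `// 60 * 60` above floors the extrapolated durations to full minutes (expressed in seconds), matching the multiple-of-60 convention of the original `estimated_prep_duration` values. A toy check (hypothetical values):\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "seconds = pd.Series([119.0, 120.0, 121.0])\n",
    "floored = (seconds // 60 * 60).astype(\"Int64\")\n",
    "assert floored.tolist() == [60, 120, 120]\n",
    "```"
   ]
  },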
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "More than 99.9% of the orders with `estimated_prep_duration` set, have this value be under 45 minutes. We view the remaining ones as outliers and adjust them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 166,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.99973"
      ]
     },
     "execution_count": 166,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "more_than_45_mins = orders[\"estimated_prep_duration\"].notnull()\n",
    "more_than_45_mins &= orders[\"estimated_prep_duration\"] > 45 * 60\n",
    "\n",
    "orders.loc[more_than_45_mins, \"estimated_prep_duration\"] = 45 * 60\n",
    "orders.loc[more_than_45_mins, \"estimated_prep_duration_corrected\"] = True\n",
    "\n",
    "assert more_than_45_mins.sum() == 449\n",
    "\n",
    "round((~more_than_45_mins).sum() / orders[\"estimated_prep_duration\"].notnull().sum(), 5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We create a boolean column `pickup_not_confirmed` out of the text column `courier_no_pickup_confirmed_issue`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 167,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Courier did not hit \"Picked up\"    35345\n",
       "Name: courier_no_pickup_confirmed_issue, dtype: Int64"
      ]
     },
     "execution_count": 167,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "orders[\"courier_no_pickup_confirmed_issue\"].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 168,
   "metadata": {},
   "outputs": [],
   "source": [
    "orders[\"pickup_not_confirmed\"] = False\n",
    "\n",
    "msk = orders[\"courier_no_pickup_confirmed_issue\"].notnull()\n",
    "orders.loc[msk, \"pickup_not_confirmed\"] = True\n",
    "\n",
    "msk = orders[\"pickup_at\"].isnull()\n",
    "orders.loc[msk, \"pickup_not_confirmed\"] = pd.NA\n",
    "\n",
    "del orders[\"courier_no_pickup_confirmed_issue\"]\n",
    "\n",
    "assert orders[\"pickup_not_confirmed\"].sum() == 34_966"
   ]
  },
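  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The resulting column is three-valued: `True`, `False`, or `<NA>` for orders without any pickup. Note that `.sum()` counts the `True`s and skips the `<NA>`s, which the assert above relies on. A minimal sketch with pandas' nullable boolean dtype:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "flag = pd.Series([True, False, pd.NA], dtype=\"boolean\")\n",
    "assert flag.sum() == 1  # <NA> entries are skipped by .sum()\n",
    "```"
   ]
  },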
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`logged_to_pickup_time` and `logged_pickup_time` constitute the difference between `courier_accepted_at` and `pickup_at`. `logged_pickup_time` is negative in rare cases."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 169,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.94"
      ]
     },
     "execution_count": 169,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "assert not (orders[\"logged_to_pickup_time\"] < 0).any()\n",
    "assert (orders[\"logged_pickup_time\"] < 0).sum() == 30\n",
    "\n",
    "all_data_available = (\n",
    "    orders[\"logged_to_pickup_time\"].notnull()\n",
    "    & orders[\"logged_pickup_time\"].notnull()\n",
    "    & (orders[\"logged_pickup_time\"] >= 0)\n",
    "    & orders[\"pickup_at\"].notnull()\n",
    "    & orders[\"courier_accepted_at\"].notnull()\n",
    ")\n",
    "good_data = (\n",
    "    orders[\"logged_to_pickup_time\"]\n",
    "    + orders[\"logged_pickup_time\"]\n",
    "    - ((orders[\"pickup_at\"] - orders[\"courier_accepted_at\"]).dt.total_seconds().round())\n",
    ").abs() <= 5\n",
    "\n",
    "pickup_not_confirmed = orders[\"pickup_not_confirmed\"] == True\n",
    "\n",
    "assert (all_data_available & good_data).sum() == 599_195\n",
    "assert (all_data_available & good_data & completed).sum() == 599_111\n",
    "assert (all_data_available & (good_data | pickup_not_confirmed)).sum() == 604_483\n",
    "\n",
    "round((all_data_available & good_data).sum() / all_data_available.sum(), 3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For the 6% where `pickup_at` does not relate back to `courier_accepted_at`, we correct the former. Unconfirmed pickups seem to not be the cause of these inconsistencies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 170,
   "metadata": {},
   "outputs": [],
   "source": [
    "calc_pickup_at = (\n",
    "    orders[\"courier_accepted_at\"]\n",
    "    + orders[\"logged_to_pickup_time\"].map(\n",
    "        lambda x: datetime.timedelta(seconds=x) if x is not pd.NA else pd.NaT\n",
    "    )\n",
    "    + orders[\"logged_pickup_time\"].map(\n",
    "        lambda x: datetime.timedelta(seconds=x) if x is not pd.NA else pd.NaT\n",
    "    )\n",
    ")\n",
    "\n",
    "msk = all_data_available & ~good_data\n",
    "orders.loc[msk, \"pickup_at\"] = calc_pickup_at.loc[msk]\n",
    "orders.loc[msk, \"pickup_at_corrected\"] = True\n",
    "\n",
    "assert (all_data_available & ~good_data).sum() == 38_015\n",
    "assert (all_data_available & ~good_data & pickup_not_confirmed).sum() == 5_288"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Keep other timestamps consistent after the correction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 171,
   "metadata": {},
   "outputs": [],
   "source": [
    "msk = orders[\"pickup_at\"] <= orders[\"restaurant_notified_at\"]\n",
    "orders.loc[msk, \"restaurant_notified_at\"] = pd.NaT\n",
    "orders.loc[msk, \"restaurant_notified_at_corrected\"] = True\n",
    "\n",
    "assert msk.sum() == 107"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 172,
   "metadata": {},
   "outputs": [],
   "source": [
    "msk = orders[\"pickup_at\"] <= orders[\"restaurant_confirmed_at\"]\n",
    "orders.loc[msk, \"restaurant_confirmed_at\"] = pd.NaT\n",
    "orders.loc[msk, \"restaurant_confirmed_at_corrected\"] = True\n",
    "\n",
    "assert msk.sum() == 892"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With `logged_to_pickup_time` we calculate a new timestamp `reached_pickup_at`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 173,
   "metadata": {},
   "outputs": [],
   "source": [
    "to_pickup_time = orders[\"logged_to_pickup_time\"].map(\n",
    "    lambda x: datetime.timedelta(seconds=x) if x is not pd.NA and x > 0 else pd.NaT\n",
    ")\n",
    "reached_pickup_at = orders[\"courier_accepted_at\"] + to_pickup_time\n",
    "\n",
    "orders[\"reached_pickup_at\"] = pd.NaT\n",
    "msk = (\n",
    "    completed & reached_pickup_at.notnull() & (reached_pickup_at < orders[\"pickup_at\"])\n",
    ")\n",
    "orders.loc[msk, \"reached_pickup_at\"] = reached_pickup_at.loc[msk]\n",
    "\n",
    "assert msk.sum() == 530_724"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`logged_courier_late_time` and `logged_restaurant_late_time` are always set together. The ca. 110,000 missing values are spread over the entire horizon."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 174,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert not (\n",
    "    (\n",
    "        orders[\"logged_courier_late_time\"].notnull()\n",
    "        & orders[\"logged_restaurant_late_time\"].isnull()\n",
    "    )\n",
    "    | (\n",
    "        orders[\"logged_courier_late_time\"].isnull()\n",
    "        & orders[\"logged_restaurant_late_time\"].notnull()\n",
    "    )\n",
    ").any()\n",
    "\n",
    "assert orders.loc[\n",
    "    orders[\"logged_courier_late_time\"].isnull(), \"placed_at\"\n",
    "].min().date() == datetime.date(2016, 2, 22)\n",
    "\n",
    "assert orders.loc[\n",
    "    orders[\"logged_courier_late_time\"].isnull(), \"placed_at\"\n",
    "].max().date() == datetime.date(2017, 1, 31)\n",
    "\n",
    "assert orders.loc[\n",
    "    orders[\"logged_courier_late_time\"].notnull(), \"placed_at\"\n",
    "].min().date() == datetime.date(2016, 2, 21)\n",
    "\n",
    "assert orders.loc[\n",
    "    orders[\"logged_courier_late_time\"].notnull(), \"placed_at\"\n",
    "].max().date() == datetime.date(2017, 1, 31)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`logged_courier_late_time` is mostly explained with `reached_pickup_at` and `scheduled_pickup_at`. `Order.courier_early` and `Order.courier_late` model that in the ORM."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 175,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.964"
      ]
     },
     "execution_count": 175,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_data_available = (\n",
    "    orders[\"logged_courier_late_time\"].notnull()\n",
    "    & orders[\"reached_pickup_at\"].notnull()\n",
    "    & orders[\"scheduled_pickup_at\"].notnull()\n",
    ")\n",
    "good_data = (\n",
    "    orders[\"logged_courier_late_time\"]\n",
    "    - (\n",
    "        (orders[\"reached_pickup_at\"] - orders[\"scheduled_pickup_at\"])\n",
    "        .dt.total_seconds()\n",
    "        .round()\n",
    "    )\n",
    ").abs() <= 5\n",
    "\n",
    "assert (all_data_available & good_data).sum() == 471_553\n",
    "assert (all_data_available & good_data & completed).sum() == 471_553\n",
    "\n",
    "round((all_data_available & good_data).sum() / all_data_available.sum(), 3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`logged_restaurant_late_time` is mostly explained with `pickup_at` and `scheduled_pickup_at`. `logged_restaurant_late_time` is also `0` quite often, indicating no timing was taken. `Order.restaurant_early` and `Order.restaurant_late` model that in the ORM."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 176,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.943"
      ]
     },
     "execution_count": 176,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_data_available = (\n",
    "    orders[\"logged_restaurant_late_time\"].notnull()\n",
    "    & orders[\"pickup_at\"].notnull()\n",
    "    & orders[\"scheduled_pickup_at\"].notnull()\n",
    ")\n",
    "good_data = (\n",
    "    orders[\"logged_restaurant_late_time\"]\n",
    "    - ((orders[\"pickup_at\"] - orders[\"scheduled_pickup_at\"]).dt.total_seconds().round())\n",
    ").abs() <= 5\n",
    "\n",
    "restaurant_not_timed = orders[\"logged_restaurant_late_time\"] == 0\n",
    "\n",
    "assert (all_data_available).sum() == 503_179\n",
    "assert (all_data_available & good_data).sum() == 245_714\n",
    "assert (all_data_available & restaurant_not_timed).sum() == 246_362\n",
    "assert (all_data_available & (good_data | restaurant_not_timed)).sum() == 488_512\n",
    "\n",
    "restaurant_timed = orders[\"logged_restaurant_late_time\"] != 0\n",
    "\n",
    "round(\n",
    "    (all_data_available & restaurant_timed & good_data).sum()\n",
    "    / (all_data_available & restaurant_timed).sum(),\n",
    "    3,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`logged_wait_pickup_time` is unfortunately not a good timing to extrapolate when a meal was ready to picked up by the courier. It is only good to explain the difference between `reached_pickup_at` and `left_pickup_at`, which is not really the time the courier had to wait. Also, the field seems to only be tracked correctly if the courier was late."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 177,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.396"
      ]
     },
     "execution_count": 177,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_data_available = (\n",
    "    (orders[\"logged_wait_pickup_time\"]).notnull()\n",
    "    & orders[\"reached_pickup_at\"].notnull()\n",
    "    & orders[\"left_pickup_at\"].notnull()\n",
    ")\n",
    "good_data = (\n",
    "    orders[\"logged_wait_pickup_time\"]\n",
    "    - (\n",
    "        (orders[\"left_pickup_at\"] - orders[\"reached_pickup_at\"])\n",
    "        .dt.total_seconds()\n",
    "        .round()\n",
    "    )\n",
    ").abs() <= 5\n",
    "\n",
    "round(\n",
    "    (all_data_available & good_data).sum() / (all_data_available).sum(), 3,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 178,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.0"
      ]
     },
     "execution_count": 178,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_data_available = (\n",
    "    (orders[\"logged_wait_pickup_time\"]).notnull()\n",
    "    & (orders[\"logged_courier_late_time\"] >= 0)\n",
    "    & orders[\"reached_pickup_at\"].notnull()\n",
    "    & orders[\"left_pickup_at\"].notnull()\n",
    ")\n",
    "good_data = (\n",
    "    orders[\"logged_wait_pickup_time\"]\n",
    "    - (\n",
    "        (orders[\"left_pickup_at\"] - orders[\"reached_pickup_at\"])\n",
    "        .dt.total_seconds()\n",
    "        .round()\n",
    "    )\n",
    ").abs() <= 5\n",
    "\n",
    "round(\n",
    "    (all_data_available & good_data).sum() / (all_data_available).sum(), 3,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 179,
   "metadata": {},
   "outputs": [],
   "source": [
    "del orders[\"logged_courier_late_time\"]\n",
    "del orders[\"logged_restaurant_late_time\"]\n",
    "del orders[\"logged_wait_pickup_time\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We create a boolean column `delivery_not_confirmed` out of the text column `courier_no_pickup_confirmed_issue`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 180,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Courier did not hit \"Dropped off\"    13862\n",
       "Name: courier_no_delivery_confirmed_issue, dtype: Int64"
      ]
     },
     "execution_count": 180,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "orders[\"courier_no_delivery_confirmed_issue\"].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 181,
   "metadata": {},
   "outputs": [],
   "source": [
    "orders[\"delivery_not_confirmed\"] = False\n",
    "\n",
    "msk = orders[\"courier_no_delivery_confirmed_issue\"].notnull()\n",
    "orders.loc[msk, \"delivery_not_confirmed\"] = True\n",
    "\n",
    "msk = orders[\"delivery_at\"].isnull()\n",
    "orders.loc[msk, \"delivery_not_confirmed\"] = pd.NA\n",
    "\n",
    "del orders[\"courier_no_delivery_confirmed_issue\"]\n",
    "\n",
    "assert orders[\"delivery_not_confirmed\"].sum() == 13_817"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`logged_to_delivery_time` and `logged_delivery_time` constitute the difference between `pickup_at` and `delivery_at`. Without the `pickup_at` corrections above, not 91% but only 86% of the differences would work. `Order.time_to_delivery` and `Order.time_at_delivery` model that in the ORM."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 182,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.913"
      ]
     },
     "execution_count": 182,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "assert not (orders[\"logged_to_delivery_time\"] < 0).any()\n",
    "assert not (orders[\"logged_delivery_time\"] < 0).any()\n",
    "\n",
    "all_data_available = (\n",
    "    orders[\"logged_to_delivery_time\"].notnull()\n",
    "    & orders[\"logged_delivery_time\"].notnull()\n",
    "    & orders[\"delivery_at\"].notnull()\n",
    "    & orders[\"pickup_at\"].notnull()\n",
    ")\n",
    "good_data = (\n",
    "    orders[\"logged_to_delivery_time\"]\n",
    "    + orders[\"logged_delivery_time\"]\n",
    "    - ((orders[\"delivery_at\"] - orders[\"pickup_at\"]).dt.total_seconds().round())\n",
    ").abs() <= 5\n",
    "\n",
    "delivery_not_confirmed = orders[\"delivery_not_confirmed\"] == True\n",
    "\n",
    "assert (all_data_available & good_data).sum() == 572_609\n",
    "assert (all_data_available & good_data & completed).sum() == 572_609\n",
    "assert (all_data_available & (good_data | delivery_not_confirmed)).sum() == 581_700\n",
    "\n",
    "round(\n",
    "    (all_data_available & (good_data | delivery_not_confirmed)).sum()\n",
    "    / all_data_available.sum(),\n",
    "    3,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`courier_waited_at_delivery_issue` is filled in whenever the courier needed to wait for the customer at delivery. It is also filled in if the courier forgot to confirm the delivery, which mostly happened at the end of a shift. If a courier needed to wait for more than 45 minutes, that is summarized as 'waiting about 1, 2, or 3 hours.'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 183,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Waiting at Dropoff                   1597\n",
       "Waiting at Dropoff: 10 minutes       4132\n",
       "Waiting at Dropoff: 11 minutes       3151\n",
       "Waiting at Dropoff: 12 minutes       2348\n",
       "Waiting at Dropoff: 13 minutes       1775\n",
       "Waiting at Dropoff: 14 minutes       1432\n",
       "Waiting at Dropoff: 15 minutes       1199\n",
       "Waiting at Dropoff: 16 minutes        941\n",
       "Waiting at Dropoff: 17 minutes        826\n",
       "Waiting at Dropoff: 18 minutes        720\n",
       "Waiting at Dropoff: 19 minutes        594\n",
       "Waiting at Dropoff: 20 minutes        522\n",
       "Waiting at Dropoff: 21 minutes        431\n",
       "Waiting at Dropoff: 22 minutes        415\n",
       "Waiting at Dropoff: 23 minutes        284\n",
       "Waiting at Dropoff: 24 minutes        281\n",
       "Waiting at Dropoff: 25 minutes        215\n",
       "Waiting at Dropoff: 26 minutes        193\n",
       "Waiting at Dropoff: 27 minutes        167\n",
       "Waiting at Dropoff: 28 minutes        162\n",
       "Waiting at Dropoff: 29 minutes        118\n",
       "Waiting at Dropoff: 30 minutes        109\n",
       "Waiting at Dropoff: 31 minutes         97\n",
       "Waiting at Dropoff: 32 minutes         86\n",
       "Waiting at Dropoff: 33 minutes         67\n",
       "Waiting at Dropoff: 34 minutes         69\n",
       "Waiting at Dropoff: 35 minutes         59\n",
       "Waiting at Dropoff: 36 minutes         50\n",
       "Waiting at Dropoff: 37 minutes         62\n",
       "Waiting at Dropoff: 38 minutes         38\n",
       "Waiting at Dropoff: 39 minutes         41\n",
       "Waiting at Dropoff: 40 minutes         41\n",
       "Waiting at Dropoff: 41 minutes         32\n",
       "Waiting at Dropoff: 42 minutes         26\n",
       "Waiting at Dropoff: 43 minutes         22\n",
       "Waiting at Dropoff: 44 minutes         35\n",
       "Waiting at Dropoff: 8 minutes        3958\n",
       "Waiting at Dropoff: 9 minutes        5814\n",
       "Waiting at Dropoff: about 1 hour      244\n",
       "Waiting at Dropoff: about 2 hours       7\n",
       "Waiting at Dropoff: about 3 hours       3\n",
       "Name: courier_waited_at_delivery_issue, dtype: Int64"
      ]
     },
     "execution_count": 183,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "orders[\"courier_waited_at_delivery_issue\"].value_counts().sort_index()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We convert `courier_waited_at_delivery_issue` into `courier_waited_at_delivery` to validate it further below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 184,
   "metadata": {},
   "outputs": [],
   "source": [
    "waited_at_delivery = (\n",
    "    orders[\"courier_waited_at_delivery_issue\"]\n",
    "    .str.replace(r\"\\D+\", \"\", regex=True)\n",
    "    .fillna(\"NaN\")\n",
    "    .replace(\"\", \"NaN\")\n",
    "    .astype(float)\n",
    "    .astype(\"Int64\")\n",
    ")\n",
    "\n",
    "orders[\"courier_waited_at_delivery\"] = pd.NA\n",
    "orders[\"courier_waited_at_delivery\"] = orders[\"courier_waited_at_delivery\"].astype(\n",
    "    \"Int64\"\n",
    ")\n",
    "\n",
    "hours = orders[\"courier_waited_at_delivery_issue\"].str.contains(\"hour\").fillna(0)\n",
    "orders.loc[hours, \"courier_waited_at_delivery\"] = (\n",
    "    60 * 60 * waited_at_delivery.loc[hours]\n",
    ")\n",
    "\n",
    "mins = orders[\"courier_waited_at_delivery_issue\"].str.contains(\"minutes\").fillna(0)\n",
    "orders.loc[mins, \"courier_waited_at_delivery\"] = 60 * waited_at_delivery.loc[mins]\n",
    "\n",
    "customer_late = orders[\"courier_waited_at_delivery_issue\"].notnull()\n",
    "\n",
    "del orders[\"courier_waited_at_delivery_issue\"]\n",
    "\n",
    "assert hours.sum() == 254\n",
    "assert mins.sum() == 30_512"
   ]
  },
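  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A toy check of the parsing above (hypothetical inputs): the regex strips all non-digits, and the unit suffix decides the factor. Bare \"Waiting at Dropoff\" entries are left empty and become `<NA>` via the `.replace(\"\", \"NaN\")` step.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "issues = pd.Series(\n",
    "    [\"Waiting at Dropoff\", \"Waiting at Dropoff: 9 minutes\", \"Waiting at Dropoff: about 2 hours\"]\n",
    ")\n",
    "digits = issues.str.replace(r\"\\D+\", \"\", regex=True)\n",
    "assert digits.tolist() == [\"\", \"9\", \"2\"]\n",
    "```"
   ]
  },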
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For the roughly 9% of orders where `logged_to_delivery_time` and `logged_delivery_time` do not explain `delivery_at`, the latter is corrected. However, this is only done if the courier did not have to wait for the customer or forgot to confirm the delivery, which wrongly shows up as waiting for the customer as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 185,
   "metadata": {},
   "outputs": [],
   "source": [
    "calc_delivery_at = (\n",
    "    orders[\"pickup_at\"]\n",
    "    + orders[\"logged_to_delivery_time\"].map(\n",
    "        lambda x: datetime.timedelta(seconds=x) if x is not pd.NA else pd.NaT\n",
    "    )\n",
    "    + orders[\"logged_delivery_time\"].map(\n",
    "        lambda x: datetime.timedelta(seconds=x) if x is not pd.NA else pd.NaT\n",
    "    )\n",
    ")\n",
    "\n",
    "del orders[\"logged_delivery_time\"]\n",
    "\n",
    "orders[\"delivery_at_orig\"] = orders[\"delivery_at\"]\n",
    "\n",
    "msk = (\n",
    "    all_data_available\n",
    "    & ~good_data\n",
    "    & (~customer_late | (customer_late & delivery_not_confirmed))\n",
    ")\n",
    "orders.loc[msk, \"delivery_at\"] = calc_delivery_at.loc[msk]\n",
    "orders.loc[msk, \"delivery_at_corrected\"] = True\n",
    "\n",
    "assert (all_data_available & ~good_data).sum() == 64_543\n",
    "assert (all_data_available & ~good_data & ~customer_late).sum() == 49_122\n",
    "\n",
    "assert (\n",
    "    all_data_available & ~good_data & customer_late & delivery_not_confirmed\n",
    ").sum() == 5_241\n",
    "\n",
    "assert (all_data_available & ~good_data & delivery_not_confirmed).sum() == 9_091"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With `logged_to_delivery_time` we calculate a new timestamp `reached_delivery_at`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 186,
   "metadata": {},
   "outputs": [],
   "source": [
    "to_delivery_time = orders[\"logged_to_delivery_time\"].map(\n",
    "    lambda x: datetime.timedelta(seconds=x) if x is not pd.NA and x > 0 else pd.NaT\n",
    ")\n",
    "reached_delivery_at = orders[\"pickup_at\"] + to_delivery_time\n",
    "\n",
    "del orders[\"logged_to_delivery_time\"]\n",
    "\n",
    "orders[\"reached_delivery_at\"] = pd.NaT\n",
    "\n",
    "msk = (\n",
    "    completed\n",
    "    & reached_delivery_at.notnull()\n",
    "    & (reached_delivery_at < orders[\"delivery_at\"])\n",
    ")\n",
    "orders.loc[msk, \"reached_delivery_at\"] = reached_delivery_at.loc[msk]\n",
    "\n",
    "assert msk.sum() == 608_160"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some `left_pickup_at` values conflict with that and are discarded."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 187,
   "metadata": {},
   "outputs": [],
   "source": [
    "msk = orders[\"left_pickup_at\"] >= orders[\"reached_delivery_at\"]\n",
    "orders.loc[msk, \"left_pickup_at\"] = pd.NaT\n",
    "orders.loc[msk, \"left_pickup_at_corrected\"] = True\n",
    "\n",
    "assert msk.sum() == 4_215"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`logged_delivery_late_time` is the difference between `scheduled_delivery_at` and `delivery_at` for pre-orders. `Order.delivery_early` and `Order.delivery_late` model that in the ORM."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 188,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.986"
      ]
     },
     "execution_count": 188,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_data_available = (\n",
    "    scheduled\n",
    "    & orders[\"logged_delivery_late_time\"].notnull()\n",
    "    & orders[\"delivery_at\"].notnull()\n",
    "    & orders[\"scheduled_delivery_at\"].notnull()\n",
    ")\n",
    "good_data = (\n",
    "    orders[\"logged_delivery_late_time\"]\n",
    "    - (\n",
    "        (orders[\"delivery_at\"] - orders[\"scheduled_delivery_at\"])\n",
    "        .dt.total_seconds()\n",
    "        .round()\n",
    "    )\n",
    ").abs() <= 5\n",
    "\n",
    "assert (all_data_available & good_data).sum() == 73_658\n",
    "assert (all_data_available & good_data & completed).sum() == 73_658\n",
    "assert (all_data_available & (good_data | delivery_not_confirmed)).sum() == 73_659\n",
    "\n",
    "round(\n",
    "    (all_data_available & (good_data | delivery_not_confirmed)).sum()\n",
    "    / all_data_available.sum(),\n",
    "    3,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`expected_delivery_time` is simply the difference between `placed_at` and `scheduled_delivery_at`, for both ad-hoc and pre-orders. So, the field provides no new information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 189,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.996"
      ]
     },
     "execution_count": 189,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_data_available = (\n",
    "    orders[\"expected_delivery_time\"].notnull()\n",
    "    & orders[\"placed_at\"].notnull()\n",
    "    & orders[\"scheduled_delivery_at\"].notnull()\n",
    ")\n",
    "good_data = (\n",
    "    orders[\"expected_delivery_time\"]\n",
    "    - (\n",
    "        (orders[\"scheduled_delivery_at\"] - orders[\"placed_at\"])\n",
    "        .dt.total_seconds()\n",
    "        .round()\n",
    "    )\n",
    ").abs() <= 5\n",
    "\n",
    "del orders[\"expected_delivery_time\"]\n",
    "\n",
    "assert (all_data_available & good_data).sum() == 74_386\n",
    "\n",
    "round((all_data_available & good_data).sum() / all_data_available.sum(), 3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`courier_waited_at_delivery` can be mostly explained as the difference between `reached_delivery_at` and `delivery_at`. `Order.courier_waited_at_delivery` models that in the ORM."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 190,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.871"
      ]
     },
     "execution_count": 190,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_data_available = (\n",
    "    orders[\"courier_waited_at_delivery\"].notnull()\n",
    "    & orders[\"delivery_at\"].notnull()\n",
    "    & orders[\"reached_delivery_at\"].notnull()\n",
    ")\n",
    "good_data = (\n",
    "    orders[\"courier_waited_at_delivery\"]\n",
    "    - (\n",
    "        (orders[\"delivery_at\"] - orders[\"reached_delivery_at\"])\n",
    "        .dt.total_seconds()\n",
    "        .round()\n",
    "    )\n",
    ").abs() <= 90\n",
    "imprecise_wait_times = orders[\"courier_waited_at_delivery\"].fillna(0) >= 45 * 60\n",
    "\n",
    "assert (all_data_available & good_data).sum() == 26_268\n",
    "assert (all_data_available & good_data & completed).sum() == 26_268\n",
    "assert (all_data_available & (good_data | imprecise_wait_times)).sum() == 26_499\n",
    "\n",
    "round(\n",
    "    (all_data_available & (good_data | imprecise_wait_times)).sum()\n",
    "    / all_data_available.sum(),\n",
    "    3,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We keep `courier_waited_at_delivery` as a boolean field here to be used by `Order.courier_waited_at_delivery`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 191,
   "metadata": {},
   "outputs": [],
   "source": [
    "msk = orders[\"delivery_at\"].notnull() & orders[\"courier_waited_at_delivery\"].notnull()\n",
    "orders[\"courier_waited_at_delivery\"] = pd.NA\n",
    "orders.loc[orders[\"delivery_at\"].notnull(), \"courier_waited_at_delivery\"] = False\n",
    "orders.loc[msk, \"courier_waited_at_delivery\"] = True\n",
    "\n",
    "assert orders[\"courier_waited_at_delivery\"].sum() == msk.sum() == 30_658"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Statistical Columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Keep the columns that log the courier's speed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 192,
   "metadata": {},
   "outputs": [],
   "source": [
    "orders = orders.rename(\n",
    "    columns={\n",
    "        \"logged_avg_courier_speed\": \"logged_avg_speed\",\n",
    "        \"logged_avg_courier_speed_distance\": \"logged_avg_speed_distance\",\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 193,
   "metadata": {},
   "outputs": [],
   "source": [
    "unrealistic = orders[\"logged_delivery_distance\"] > 12_000\n",
    "orders.loc[unrealistic, \"logged_delivery_distance\"] = pd.NA\n",
    "\n",
    "assert unrealistic.sum() == 17"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Clean Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 194,
   "metadata": {},
   "outputs": [],
   "source": [
    "orders = orders[\n",
    "    [\n",
    "        # Generic columns\n",
    "        \"delivery_id\",\n",
    "        \"customer_id\",\n",
    "        \"placed_at\",\n",
    "        \"ad_hoc\",\n",
    "        \"scheduled_delivery_at\",\n",
    "        \"scheduled_delivery_at_corrected\",\n",
    "        \"first_estimated_delivery_at\",\n",
    "        \"cancelled\",\n",
    "        \"cancelled_at\",\n",
    "        \"cancelled_at_corrected\",\n",
    "        # Price related columns\n",
    "        \"sub_total\",\n",
    "        \"delivery_fee\",\n",
    "        \"total\",\n",
    "        # Restaurant related columns\n",
    "        \"restaurant_id\",\n",
    "        \"restaurant_notified_at\",\n",
    "        \"restaurant_notified_at_corrected\",\n",
    "        \"restaurant_confirmed_at\",\n",
    "        \"restaurant_confirmed_at_corrected\",\n",
    "        \"estimated_prep_duration\",\n",
    "        \"estimated_prep_duration_corrected\",\n",
    "        \"estimated_prep_buffer\",\n",
    "        # Dispatch related columns\n",
    "        \"courier_id\",\n",
    "        \"dispatch_at\",\n",
    "        \"dispatch_at_corrected\",\n",
    "        \"courier_notified_at\",\n",
    "        \"courier_notified_at_corrected\",\n",
    "        \"courier_accepted_at\",\n",
    "        \"courier_accepted_at_corrected\",\n",
    "        \"utilization\",\n",
    "        # Pickup related columns\n",
    "        \"pickup_address_id\",\n",
    "        \"reached_pickup_at\",\n",
    "        \"pickup_at\",\n",
    "        \"pickup_at_corrected\",\n",
    "        \"pickup_not_confirmed\",\n",
    "        \"left_pickup_at\",\n",
    "        \"left_pickup_at_corrected\",\n",
    "        # Delivery related columns\n",
    "        \"delivery_address_id\",\n",
    "        \"reached_delivery_at\",\n",
    "        \"delivery_at\",\n",
    "        \"delivery_at_corrected\",\n",
    "        \"delivery_not_confirmed\",\n",
    "        \"courier_waited_at_delivery\",\n",
    "        # Statistical columns\n",
    "        \"logged_delivery_distance\",\n",
    "        \"logged_avg_speed\",\n",
    "        \"logged_avg_speed_distance\",\n",
    "    ]\n",
    "].sort_index()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 195,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>delivery_id</th>\n",
       "      <th>customer_id</th>\n",
       "      <th>placed_at</th>\n",
       "      <th>ad_hoc</th>\n",
       "      <th>scheduled_delivery_at</th>\n",
       "      <th>scheduled_delivery_at_corrected</th>\n",
       "      <th>first_estimated_delivery_at</th>\n",
       "      <th>cancelled</th>\n",
       "      <th>cancelled_at</th>\n",
       "      <th>cancelled_at_corrected</th>\n",
       "      <th>sub_total</th>\n",
       "      <th>delivery_fee</th>\n",
       "      <th>total</th>\n",
       "      <th>restaurant_id</th>\n",
       "      <th>restaurant_notified_at</th>\n",
       "      <th>restaurant_notified_at_corrected</th>\n",
       "      <th>restaurant_confirmed_at</th>\n",
       "      <th>restaurant_confirmed_at_corrected</th>\n",
       "      <th>estimated_prep_duration</th>\n",
       "      <th>estimated_prep_duration_corrected</th>\n",
       "      <th>estimated_prep_buffer</th>\n",
       "      <th>courier_id</th>\n",
       "      <th>dispatch_at</th>\n",
       "      <th>dispatch_at_corrected</th>\n",
       "      <th>courier_notified_at</th>\n",
       "      <th>courier_notified_at_corrected</th>\n",
       "      <th>courier_accepted_at</th>\n",
       "      <th>courier_accepted_at_corrected</th>\n",
       "      <th>utilization</th>\n",
       "      <th>pickup_address_id</th>\n",
       "      <th>reached_pickup_at</th>\n",
       "      <th>pickup_at</th>\n",
       "      <th>pickup_at_corrected</th>\n",
       "      <th>pickup_not_confirmed</th>\n",
       "      <th>left_pickup_at</th>\n",
       "      <th>left_pickup_at_corrected</th>\n",
       "      <th>delivery_address_id</th>\n",
       "      <th>reached_delivery_at</th>\n",
       "      <th>delivery_at</th>\n",
       "      <th>delivery_at_corrected</th>\n",
       "      <th>delivery_not_confirmed</th>\n",
       "      <th>courier_waited_at_delivery</th>\n",
       "      <th>logged_delivery_distance</th>\n",
       "      <th>logged_avg_speed</th>\n",
       "      <th>logged_avg_speed_distance</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>487377</td>\n",
       "      <td>1</td>\n",
       "      <td>2016-02-22 10:42:01</td>\n",
       "      <td>False</td>\n",
       "      <td>2016-02-22 12:30:00</td>\n",
       "      <td>False</td>\n",
       "      <td>NaT</td>\n",
       "      <td>True</td>\n",
       "      <td>NaT</td>\n",
       "      <td>True</td>\n",
       "      <td>1250</td>\n",
       "      <td>250</td>\n",
       "      <td>1500</td>\n",
       "      <td>1</td>\n",
       "      <td>NaT</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>NaT</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>0</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>NaT</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>NaT</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>NaT</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>15</td>\n",
       "      <td>2</td>\n",
       "      <td>NaT</td>\n",
       "      <td>NaT</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>NaT</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>3</td>\n",
       "      <td>NaT</td>\n",
       "      <td>NaT</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>671</td>\n",
       "      <td>NaN</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>487433</td>\n",
       "      <td>2</td>\n",
       "      <td>2016-02-22 10:40:09</td>\n",
       "      <td>False</td>\n",
       "      <td>2016-02-22 12:00:00</td>\n",
       "      <td>False</td>\n",
       "      <td>2016-02-23 23:55:23</td>\n",
       "      <td>False</td>\n",
       "      <td>NaT</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>1550</td>\n",
       "      <td>250</td>\n",
       "      <td>1800</td>\n",
       "      <td>2</td>\n",
       "      <td>2016-02-22 11:31:03</td>\n",
       "      <td>False</td>\n",
       "      <td>NaT</td>\n",
       "      <td>True</td>\n",
       "      <td>600</td>\n",
       "      <td>True</td>\n",
       "      <td>0</td>\n",
       "      <td>96</td>\n",
       "      <td>2016-02-22 10:48:24</td>\n",
       "      <td>True</td>\n",
       "      <td>2016-02-22 11:36:04</td>\n",
       "      <td>False</td>\n",
       "      <td>2016-02-22 11:37:00</td>\n",
       "      <td>False</td>\n",
       "      <td>19</td>\n",
       "      <td>4</td>\n",
       "      <td>2016-02-22 11:40:07</td>\n",
       "      <td>2016-02-22 11:42:12</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>NaT</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>5</td>\n",
       "      <td>2016-02-22 12:11:17</td>\n",
       "      <td>2016-02-22 12:15:41</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>2281</td>\n",
       "      <td>NaN</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>487444</td>\n",
       "      <td>3</td>\n",
       "      <td>2016-02-22 10:56:16</td>\n",
       "      <td>False</td>\n",
       "      <td>2016-02-22 12:00:00</td>\n",
       "      <td>False</td>\n",
       "      <td>2016-02-23 23:25:47</td>\n",
       "      <td>False</td>\n",
       "      <td>NaT</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>2000</td>\n",
       "      <td>250</td>\n",
       "      <td>2250</td>\n",
       "      <td>3</td>\n",
       "      <td>2016-02-22 11:31:02</td>\n",
       "      <td>False</td>\n",
       "      <td>2016-02-22 11:42:00</td>\n",
       "      <td>False</td>\n",
       "      <td>600</td>\n",
       "      <td>True</td>\n",
       "      <td>0</td>\n",
       "      <td>57</td>\n",
       "      <td>2016-02-22 11:31:02</td>\n",
       "      <td>True</td>\n",
       "      <td>2016-02-22 11:31:03</td>\n",
       "      <td>False</td>\n",
       "      <td>2016-02-22 11:31:30</td>\n",
       "      <td>False</td>\n",
       "      <td>25</td>\n",
       "      <td>6</td>\n",
       "      <td>2016-02-22 11:54:31</td>\n",
       "      <td>2016-02-22 11:54:32</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>NaT</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>7</td>\n",
       "      <td>2016-02-22 12:03:14</td>\n",
       "      <td>2016-02-22 12:07:44</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>2449</td>\n",
       "      <td>NaN</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>470503</td>\n",
       "      <td>4</td>\n",
       "      <td>2016-02-22 11:11:46</td>\n",
       "      <td>False</td>\n",
       "      <td>2016-02-22 12:30:00</td>\n",
       "      <td>False</td>\n",
       "      <td>2016-02-22 12:25:50</td>\n",
       "      <td>False</td>\n",
       "      <td>NaT</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>9800</td>\n",
       "      <td>250</td>\n",
       "      <td>10050</td>\n",
       "      <td>4</td>\n",
       "      <td>2016-02-22 11:32:04</td>\n",
       "      <td>False</td>\n",
       "      <td>2016-02-22 11:48:01</td>\n",
       "      <td>False</td>\n",
       "      <td>2700</td>\n",
       "      <td>True</td>\n",
       "      <td>0</td>\n",
       "      <td>39</td>\n",
       "      <td>NaT</td>\n",
       "      <td>True</td>\n",
       "      <td>2016-02-22 11:50:03</td>\n",
       "      <td>False</td>\n",
       "      <td>2016-02-22 11:50:29</td>\n",
       "      <td>False</td>\n",
       "      <td>100</td>\n",
       "      <td>8</td>\n",
       "      <td>2016-02-22 11:57:48</td>\n",
       "      <td>2016-02-22 12:11:26</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>NaT</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>9</td>\n",
       "      <td>2016-02-22 12:12:53</td>\n",
       "      <td>2016-02-22 12:18:43</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>196</td>\n",
       "      <td>NaN</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>487439</td>\n",
       "      <td>5</td>\n",
       "      <td>2016-02-22 11:35:58</td>\n",
       "      <td>False</td>\n",
       "      <td>2016-02-22 12:30:00</td>\n",
       "      <td>False</td>\n",
       "      <td>2016-02-22 12:28:19</td>\n",
       "      <td>False</td>\n",
       "      <td>NaT</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>3600</td>\n",
       "      <td>250</td>\n",
       "      <td>3850</td>\n",
       "      <td>5</td>\n",
       "      <td>2016-02-22 11:50:03</td>\n",
       "      <td>False</td>\n",
       "      <td>2016-02-22 11:50:19</td>\n",
       "      <td>False</td>\n",
       "      <td>1200</td>\n",
       "      <td>True</td>\n",
       "      <td>0</td>\n",
       "      <td>128</td>\n",
       "      <td>NaT</td>\n",
       "      <td>True</td>\n",
       "      <td>2016-02-22 11:50:04</td>\n",
       "      <td>False</td>\n",
       "      <td>2016-02-22 11:50:28</td>\n",
       "      <td>False</td>\n",
       "      <td>45</td>\n",
       "      <td>10</td>\n",
       "      <td>2016-02-22 11:55:12</td>\n",
       "      <td>2016-02-22 12:05:19</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>NaT</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>11</td>\n",
       "      <td>2016-02-22 12:16:53</td>\n",
       "      <td>2016-02-22 12:20:44</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>2584</td>\n",
       "      <td>NaN</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    delivery_id  customer_id           placed_at  ad_hoc  \\\n",
       "id                                                         \n",
       "1        487377            1 2016-02-22 10:42:01   False   \n",
       "2        487433            2 2016-02-22 10:40:09   False   \n",
       "3        487444            3 2016-02-22 10:56:16   False   \n",
       "4        470503            4 2016-02-22 11:11:46   False   \n",
       "5        487439            5 2016-02-22 11:35:58   False   \n",
       "\n",
       "   scheduled_delivery_at scheduled_delivery_at_corrected  \\\n",
       "id                                                         \n",
       "1    2016-02-22 12:30:00                           False   \n",
       "2    2016-02-22 12:00:00                           False   \n",
       "3    2016-02-22 12:00:00                           False   \n",
       "4    2016-02-22 12:30:00                           False   \n",
       "5    2016-02-22 12:30:00                           False   \n",
       "\n",
       "   first_estimated_delivery_at  cancelled cancelled_at cancelled_at_corrected  \\\n",
       "id                                                                              \n",
       "1                          NaT       True          NaT                   True   \n",
       "2          2016-02-23 23:55:23      False          NaT                   <NA>   \n",
       "3          2016-02-23 23:25:47      False          NaT                   <NA>   \n",
       "4          2016-02-22 12:25:50      False          NaT                   <NA>   \n",
       "5          2016-02-22 12:28:19      False          NaT                   <NA>   \n",
       "\n",
       "    sub_total  delivery_fee  total  restaurant_id restaurant_notified_at  \\\n",
       "id                                                                         \n",
       "1        1250           250   1500              1                    NaT   \n",
       "2        1550           250   1800              2    2016-02-22 11:31:03   \n",
       "3        2000           250   2250              3    2016-02-22 11:31:02   \n",
       "4        9800           250  10050              4    2016-02-22 11:32:04   \n",
       "5        3600           250   3850              5    2016-02-22 11:50:03   \n",
       "\n",
       "   restaurant_notified_at_corrected restaurant_confirmed_at  \\\n",
       "id                                                            \n",
       "1                              <NA>                     NaT   \n",
       "2                             False                     NaT   \n",
       "3                             False     2016-02-22 11:42:00   \n",
       "4                             False     2016-02-22 11:48:01   \n",
       "5                             False     2016-02-22 11:50:19   \n",
       "\n",
       "   restaurant_confirmed_at_corrected  estimated_prep_duration  \\\n",
       "id                                                              \n",
       "1                               <NA>                     <NA>   \n",
       "2                               True                      600   \n",
       "3                              False                      600   \n",
       "4                              False                     2700   \n",
       "5                              False                     1200   \n",
       "\n",
       "   estimated_prep_duration_corrected  estimated_prep_buffer  courier_id  \\\n",
       "id                                                                        \n",
       "1                               <NA>                      0        <NA>   \n",
       "2                               True                      0          96   \n",
       "3                               True                      0          57   \n",
       "4                               True                      0          39   \n",
       "5                               True                      0         128   \n",
       "\n",
       "           dispatch_at dispatch_at_corrected courier_notified_at  \\\n",
       "id                                                                 \n",
       "1                  NaT                  <NA>                 NaT   \n",
       "2  2016-02-22 10:48:24                  True 2016-02-22 11:36:04   \n",
       "3  2016-02-22 11:31:02                  True 2016-02-22 11:31:03   \n",
       "4                  NaT                  True 2016-02-22 11:50:03   \n",
       "5                  NaT                  True 2016-02-22 11:50:04   \n",
       "\n",
       "   courier_notified_at_corrected courier_accepted_at  \\\n",
       "id                                                     \n",
       "1                           <NA>                 NaT   \n",
       "2                          False 2016-02-22 11:37:00   \n",
       "3                          False 2016-02-22 11:31:30   \n",
       "4                          False 2016-02-22 11:50:29   \n",
       "5                          False 2016-02-22 11:50:28   \n",
       "\n",
       "   courier_accepted_at_corrected  utilization  pickup_address_id  \\\n",
       "id                                                                 \n",
       "1                           <NA>           15                  2   \n",
       "2                          False           19                  4   \n",
       "3                          False           25                  6   \n",
       "4                          False          100                  8   \n",
       "5                          False           45                 10   \n",
       "\n",
       "     reached_pickup_at           pickup_at pickup_at_corrected  \\\n",
       "id                                                               \n",
       "1                  NaT                 NaT                <NA>   \n",
       "2  2016-02-22 11:40:07 2016-02-22 11:42:12                True   \n",
       "3  2016-02-22 11:54:31 2016-02-22 11:54:32               False   \n",
       "4  2016-02-22 11:57:48 2016-02-22 12:11:26               False   \n",
       "5  2016-02-22 11:55:12 2016-02-22 12:05:19               False   \n",
       "\n",
       "   pickup_not_confirmed left_pickup_at left_pickup_at_corrected  \\\n",
       "id                                                                \n",
       "1                  <NA>            NaT                     <NA>   \n",
       "2                  True            NaT                     <NA>   \n",
       "3                 False            NaT                     <NA>   \n",
       "4                 False            NaT                     <NA>   \n",
       "5                 False            NaT                     <NA>   \n",
       "\n",
       "    delivery_address_id reached_delivery_at         delivery_at  \\\n",
       "id                                                                \n",
       "1                     3                 NaT                 NaT   \n",
       "2                     5 2016-02-22 12:11:17 2016-02-22 12:15:41   \n",
       "3                     7 2016-02-22 12:03:14 2016-02-22 12:07:44   \n",
       "4                     9 2016-02-22 12:12:53 2016-02-22 12:18:43   \n",
       "5                    11 2016-02-22 12:16:53 2016-02-22 12:20:44   \n",
       "\n",
       "   delivery_at_corrected delivery_not_confirmed courier_waited_at_delivery  \\\n",
       "id                                                                           \n",
       "1                   <NA>                   <NA>                       <NA>   \n",
       "2                   True                  False                      False   \n",
       "3                   True                  False                      False   \n",
       "4                  False                  False                      False   \n",
       "5                  False                  False                      False   \n",
       "\n",
       "   logged_delivery_distance  logged_avg_speed  logged_avg_speed_distance  \n",
       "id                                                                        \n",
       "1                       671               NaN                       <NA>  \n",
       "2                      2281               NaN                       <NA>  \n",
       "3                      2449               NaN                       <NA>  \n",
       "4                       196               NaN                       <NA>  \n",
       "5                      2584               NaN                       <NA>  "
      ]
     },
     "execution_count": 195,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "orders.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 196,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Int64Index: 660658 entries, 1 to 688690\n",
      "Data columns (total 45 columns):\n",
      " #   Column                             Non-Null Count   Dtype         \n",
      "---  ------                             --------------   -----         \n",
      " 0   delivery_id                        660658 non-null  int64         \n",
      " 1   customer_id                        660658 non-null  int64         \n",
      " 2   placed_at                          660658 non-null  datetime64[ns]\n",
      " 3   ad_hoc                             660658 non-null  bool          \n",
      " 4   scheduled_delivery_at              80121 non-null   datetime64[ns]\n",
      " 5   scheduled_delivery_at_corrected    83398 non-null   object        \n",
      " 6   first_estimated_delivery_at        657168 non-null  datetime64[ns]\n",
      " 7   cancelled                          660658 non-null  bool          \n",
      " 8   cancelled_at                       14081 non-null   datetime64[ns]\n",
      " 9   cancelled_at_corrected             23552 non-null   object        \n",
      " 10  sub_total                          660658 non-null  int64         \n",
      " 11  delivery_fee                       660658 non-null  int64         \n",
      " 12  total                              660658 non-null  int64         \n",
      " 13  restaurant_id                      660658 non-null  int64         \n",
      " 14  restaurant_notified_at             655847 non-null  datetime64[ns]\n",
      " 15  restaurant_notified_at_corrected   655982 non-null  object        \n",
      " 16  restaurant_confirmed_at            634974 non-null  datetime64[ns]\n",
      " 17  restaurant_confirmed_at_corrected  639102 non-null  object        \n",
      " 18  estimated_prep_duration            660385 non-null  Int64         \n",
      " 19  estimated_prep_duration_corrected  660385 non-null  object        \n",
      " 20  estimated_prep_buffer              660658 non-null  int64         \n",
      " 21  courier_id                         648239 non-null  Int64         \n",
      " 22  dispatch_at                        648608 non-null  datetime64[ns]\n",
      " 23  dispatch_at_corrected              656301 non-null  object        \n",
      " 24  courier_notified_at                314281 non-null  datetime64[ns]\n",
      " 25  courier_notified_at_corrected      481307 non-null  object        \n",
      " 26  courier_accepted_at                650348 non-null  datetime64[ns]\n",
      " 27  courier_accepted_at_corrected      650363 non-null  object        \n",
      " 28  utilization                        660658 non-null  int64         \n",
      " 29  pickup_address_id                  660658 non-null  int64         \n",
      " 30  reached_pickup_at                  530724 non-null  datetime64[ns]\n",
      " 31  pickup_at                          637684 non-null  datetime64[ns]\n",
      " 32  pickup_at_corrected                637700 non-null  object        \n",
      " 33  pickup_not_confirmed               637684 non-null  object        \n",
      " 34  left_pickup_at                     293397 non-null  datetime64[ns]\n",
      " 35  left_pickup_at_corrected           312136 non-null  object        \n",
      " 36  delivery_address_id                660658 non-null  int64         \n",
      " 37  reached_delivery_at                608160 non-null  datetime64[ns]\n",
      " 38  delivery_at                        637167 non-null  datetime64[ns]\n",
      " 39  delivery_at_corrected              637167 non-null  object        \n",
      " 40  delivery_not_confirmed             637167 non-null  object        \n",
      " 41  courier_waited_at_delivery         637167 non-null  object        \n",
      " 42  logged_delivery_distance           660641 non-null  object        \n",
      " 43  logged_avg_speed                   213812 non-null  float64       \n",
      " 44  logged_avg_speed_distance          213812 non-null  Int64         \n",
      "dtypes: Int64(3), bool(2), datetime64[ns](14), float64(1), int64(10), object(15)\n",
      "memory usage: 224.9+ MB\n"
     ]
    }
   ],
   "source": [
    "orders.info()"
   ]
  },
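   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "Most of the `object` columns above are flags (e.g., the `*_corrected` ones) that mix `True`/`False` with `pd.NA` and therefore do not fit a plain `bool` dtype. As a quick sketch, the dtypes can be tallied:"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "orders.dtypes.value_counts()"
    ]
   },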
  {
   "cell_type": "code",
   "execution_count": 197,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "scheduled_delivery_at_corrected 3321\n",
      "cancelled_at_corrected 9569\n",
      "restaurant_notified_at_corrected 2862\n",
      "restaurant_confirmed_at_corrected 4128\n",
      "estimated_prep_duration_corrected 108591\n",
      "dispatch_at_corrected 69241\n",
      "courier_notified_at_corrected 216869\n",
      "courier_accepted_at_corrected 23\n",
      "pickup_at_corrected 38032\n",
      "left_pickup_at_corrected 18739\n",
      "delivery_at_corrected 54363\n"
     ]
    }
   ],
   "source": [
    "for column in orders.columns:\n",
    "    if column.endswith(\"corrected\"):\n",
    "        print(column, (orders[column] == True).sum())"
   ]
  },
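   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "To put these counts into perspective, they can be related to the number of non-null values per column (a small sketch reusing the loop above):"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "# Sketch: share of corrected values among the non-null entries per column.\n",
     "for column in orders.columns:\n",
     "    if column.endswith(\"corrected\"):\n",
     "        n_corrected = (orders[column] == True).sum()\n",
     "        n_total = orders[column].notnull().sum()\n",
     "        print(f\"{column}: {n_corrected} of {n_total} ({n_corrected / n_total:.1%})\")"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "The exact state of the cleaned `orders` DataFrame is pinned down with a SHA-256 digest so that re-runs of this notebook detect any unintended change to the cleaning steps:"
    ]
   },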
  {
   "cell_type": "code",
   "execution_count": 198,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert (\n",
    "    hashlib.sha256(orders.to_json().encode()).hexdigest()\n",
    "    == \"c548084f094bd220f3aff7e9b7072a4964127f6962dffd54f21c8d1f5b846a7f\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All couriers had at least one order."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 199,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert set(couriers.reset_index()[\"id\"]) == set(\n",
    "    orders.loc[orders[\"courier_id\"].notnull(), \"courier_id\"].unique()\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Only keep restaurants that had at least one order."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 200,
   "metadata": {},
   "outputs": [],
   "source": [
    "restaurants = restaurants.reset_index()\n",
    "msk = restaurants[\"id\"].isin(orders[\"restaurant_id\"].unique())\n",
    "restaurants = restaurants[msk].set_index(\"id\")\n",
    "\n",
    "assert (~msk).sum() == 6"
   ]
  },
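   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "As a cross-check (a sketch, assuming every order references a known restaurant, which the cleaning above enforced), the remaining restaurants are exactly the ones occurring in the orders:"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "assert set(restaurants.index) == set(orders[\"restaurant_id\"].unique())"
    ]
   },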
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Only keep addresses with pickups or deliveries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 201,
   "metadata": {},
   "outputs": [],
   "source": [
    "addresses = addresses.reset_index()\n",
    "msk = addresses[\"id\"].isin(\n",
    "    set(restaurants[\"address_id\"])\n",
    "    | set(orders[\"pickup_address_id\"])\n",
    "    | set(orders[\"delivery_address_id\"])\n",
    ")\n",
    "addresses = addresses[msk].set_index(\"id\")\n",
    "\n",
    "assert (~msk).sum() == 100\n",
    "\n",
    "discarded_addresses = set(addresses[\"primary_id\"]) - set(addresses.reset_index()[\"id\"])\n",
    "for old_primary_id in set(addresses[\"primary_id\"]) - set(addresses.reset_index()[\"id\"]):\n",
    "    msk = addresses[\"primary_id\"] == old_primary_id\n",
    "    new_primary_id = addresses[msk].index.min()\n",
    "    addresses.loc[msk, \"primary_id\"] = new_primary_id"
   ]
  },
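   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "A few sanity checks (sketched here) confirm that all `primary_id`s now point at retained addresses and that the orders' address references still resolve:"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "assert set(addresses[\"primary_id\"]) <= set(addresses.index)\n",
     "assert set(orders[\"pickup_address_id\"]) <= set(addresses.index)\n",
     "assert set(orders[\"delivery_address_id\"]) <= set(addresses.index)"
    ]
   },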
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Store the Results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 202,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'clean'"
      ]
     },
     "execution_count": 202,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "config.CLEAN_SCHEMA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 203,
   "metadata": {},
   "outputs": [],
   "source": [
    "cities.to_sql(\n",
    "    \"cities\",\n",
    "    con=connection,\n",
    "    schema=config.CLEAN_SCHEMA,\n",
    "    if_exists=\"append\",\n",
    "    index=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 204,
   "metadata": {},
   "outputs": [],
   "source": [
    "addresses.to_sql(\n",
    "    \"addresses\",\n",
    "    con=connection,\n",
    "    schema=config.CLEAN_SCHEMA,\n",
    "    if_exists=\"append\",\n",
    "    index=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 205,
   "metadata": {},
   "outputs": [],
   "source": [
    "restaurants.to_sql(\n",
    "    \"restaurants\",\n",
    "    con=connection,\n",
    "    schema=config.CLEAN_SCHEMA,\n",
    "    if_exists=\"append\",\n",
    "    index=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 206,
   "metadata": {},
   "outputs": [],
   "source": [
    "couriers.to_sql(\n",
    "    \"couriers\",\n",
    "    con=connection,\n",
    "    schema=config.CLEAN_SCHEMA,\n",
    "    if_exists=\"append\",\n",
    "    index=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 207,
   "metadata": {},
   "outputs": [],
   "source": [
    "customers = pd.DataFrame({\"id\": orders[\"customer_id\"].unique()}).set_index(\"id\")\n",
    "\n",
    "customers.to_sql(\n",
    "    \"customers\",\n",
    "    con=connection,\n",
    "    schema=config.CLEAN_SCHEMA,\n",
    "    if_exists=\"append\",\n",
    "    index=True,\n",
    ")"
   ]
  },
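   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "Before the `orders` table is written below, the row counts of the tables stored so far can be compared against the cleaned DataFrames. This is only a sketch: it assumes the `\"clean\"` schema held no rows beforehand and that `db.connection` accepts direct `execute()` calls (SQLAlchemy 1.3-style); the `orders` table can be verified the same way afterwards:"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "# Sketch: each stored table should hold exactly as many rows\n",
     "# as the DataFrame it was created from.\n",
     "for table, frame in [\n",
     "    (\"cities\", cities),\n",
     "    (\"addresses\", addresses),\n",
     "    (\"restaurants\", restaurants),\n",
     "    (\"couriers\", couriers),\n",
     "    (\"customers\", customers),\n",
     "]:\n",
     "    n_rows = connection.execute(\n",
     "        sa.text(f\"SELECT COUNT(*) FROM {config.CLEAN_SCHEMA}.{table}\")\n",
     "    ).scalar()\n",
     "    assert n_rows == len(frame)"
    ]
   },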
  {
   "cell_type": "code",
   "execution_count": 208,
   "metadata": {},
   "outputs": [],
   "source": [
    "orders.to_sql(\n",
    "    \"orders\",\n",
    "    con=connection,\n",
    "    schema=config.CLEAN_SCHEMA,\n",
    "    if_exists=\"append\",\n",
    "    index=True,\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}