{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Cleaning the raw Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The purpose of this notebook is to take the original PostgreSQL database and clean the data in it. The output is a set of tables in a new PostgreSQL schema that hold the cleaned data.\n", "\n", "Cleaning occurs at several levels:\n", "- textual data is unified (e.g., spelling, accents, ...)\n", "- duplicate rows/records are merged together\n", "- numeric columns are checked for plausibility\n", "- foreign key relationships are strictly enforced\n", "\n", "The structure of the data can be viewed at the [ORM layer](https://github.com/webartifex/urban-meal-delivery/tree/main/src/urban_meal_delivery/db) in the package." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[32murban-meal-delivery\u001b[0m, version \u001b[34m0.2.0\u001b[0m\n" ] } ], "source": [ "!umd --version" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Imports" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from urban_meal_delivery import config, db" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import collections\n", "import datetime\n", "import hashlib\n", "\n", "import pandas as pd\n", "import pytz as tz\n", "import numpy as np\n", "import sqlalchemy as sa" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Settings & Globals" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "%load_ext lab_black" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "pd.set_option(\"display.max_columns\", 999)\n", "pd.set_option(\"display.max_rows\", 999)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "connection = db.connection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### New Database Schema" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a result of this notebook, a new PostgreSQL schema called `\"clean\"` is created holding the tables with the cleaned data." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'clean'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "config.CLEAN_SCHEMA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All tables with the original data are stored in the default PostgreSQL schema called `\"public\"`." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'public'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "config.ORIGINAL_SCHEMA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use `alembic` to run the very first database migration script that creates the new tables." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "INFO [alembic.runtime.migration] Context impl PostgresqlImpl.\n", "INFO [alembic.runtime.migration] Will assume transactional DDL.\n", "INFO [alembic.runtime.migration] Running upgrade -> f11cd76d2f45, Create the database from scratch.\n" ] } ], "source": [ "%cd -q ..\n", "!alembic upgrade f11cd76d2f45\n", "%cd -q research" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cities" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While the original database consists of data concerning the UDP's operations in five cities in France, we only look at `\"Bordeaux\"`, `\"Lyon\"`, and `\"Paris\"` in this research project, as the amount of data for `\"Lille\"` and `\"Nantes\"` is simply not a lot due to the very short time horizons the UDP had been operating there." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Raw Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following `target_cities` data were manually obtained from Google Maps and mapped to the `\"database_id\"`s of the cities in the original database where the UDP was operating in." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "target_cities = {\n", " \"Bordeaux\": {\n", " \"database_id\": 4,\n", " \"google_maps_data\": {\n", " \"center_latitude\": 44.837789,\n", " \"center_longitude\": -0.57918,\n", " \"northeast_latitude\": 44.91670389999999,\n", " \"northeast_longitude\": -0.5333089999999999,\n", " \"southwest_latitude\": 44.810752,\n", " \"southwest_longitude\": -0.638973,\n", " \"initial_zoom\": 13,\n", " },\n", " },\n", " \"Lyon\": {\n", " \"database_id\": 1,\n", " \"google_maps_data\": {\n", " \"center_latitude\": 45.764043,\n", " \"center_longitude\": 4.835659,\n", " \"northeast_latitude\": 45.808425,\n", " \"northeast_longitude\": 4.898393,\n", " \"southwest_latitude\": 45.707486,\n", " \"southwest_longitude\": 4.7718489,\n", " \"initial_zoom\": 13,\n", " },\n", " },\n", " \"Paris\": {\n", " \"database_id\": 2,\n", " \"google_maps_data\": {\n", " \"center_latitude\": 48.856614,\n", " \"center_longitude\": 2.3522219,\n", " \"northeast_latitude\": 48.9021449,\n", " \"northeast_longitude\": 2.4699208,\n", " \"southwest_latitude\": 48.815573,\n", " \"southwest_longitude\": 2.225193,\n", " \"initial_zoom\": 12,\n", " },\n", " },\n", "}" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "city_ids = tuple(city[\"database_id\"] for city in target_cities.values())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`cities` below holds the cleaned city related data from the original database. They come with KML data (i.e., area) associated with a city, which is kept." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "cities = pd.read_sql_query(\n", " f\"\"\"\n", " SELECT\n", " cities.id,\n", " cities.name,\n", " geo_areas.kml\n", " FROM\n", " {config.ORIGINAL_SCHEMA}.cities\n", " LEFT OUTER JOIN\n", " {config.ORIGINAL_SCHEMA}.geo_areas ON cities.geo_area_id = geo_areas.id\n", " WHERE\n", " cities.id IN %(city_ids)s\n", " ORDER BY\n", " cities.id\n", " \"\"\",\n", " con=connection,\n", " index_col=\"id\",\n", " params={\"city_ids\": city_ids},\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Merge in the data from Google Maps." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "for city in target_cities.values():\n", " for col, val in city[\"google_maps_data\"].items():\n", " cities.loc[city[\"database_id\"], col] = val" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cast the columns' types explicitly." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "cities = cities.astype(\n", " {\n", " \"name\": \"string\",\n", " \"kml\": \"string\",\n", " \"center_latitude\": float,\n", " \"center_longitude\": float,\n", " \"northeast_latitude\": float,\n", " \"northeast_longitude\": float,\n", " \"southwest_latitude\": float,\n", " \"southwest_longitude\": float,\n", " \"initial_zoom\": int,\n", " }\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Clean Data" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namekmlcenter_latitudecenter_longitudenortheast_latitudenortheast_longitudesouthwest_latitudesouthwest_longitudeinitial_zoom
id
1Lyon<?xml version='1.0' encoding='UTF-8'?><kml xml...45.7640434.83565945.8084254.89839345.7074864.77184913
2Paris<?xml version='1.0' encoding='UTF-8'?><kml xml...48.8566142.35222248.9021452.46992148.8155732.22519312
4Bordeaux<?xml version='1.0' encoding='UTF-8'?><kml xml...44.837789-0.57918044.916704-0.53330944.810752-0.63897313
\n", "
" ], "text/plain": [ " name kml \\\n", "id \n", "1 Lyon \n", "Int64Index: 3 entries, 1 to 4\n", "Data columns (total 9 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 name 3 non-null string \n", " 1 kml 3 non-null string \n", " 2 center_latitude 3 non-null float64\n", " 3 center_longitude 3 non-null float64\n", " 4 northeast_latitude 3 non-null float64\n", " 5 northeast_longitude 3 non-null float64\n", " 6 southwest_latitude 3 non-null float64\n", " 7 southwest_longitude 3 non-null float64\n", " 8 initial_zoom 3 non-null int64 \n", "dtypes: float64(6), int64(1), string(2)\n", "memory usage: 320.0 bytes\n" ] } ], "source": [ "cities.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As this notebook was developed iteratively, we validate that the cleaned data stays unchanged using SHA256 checksums of the cleaned DataFrames and other `assert`s." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "assert (\n", " hashlib.sha256(cities.to_json().encode()).hexdigest()\n", " == \"800689a6ba5b6d03f583f258e058eca0b12e6df8e34c98bfe7aec246ed688c92\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Addresses" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Raw Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Only load addresses with orders in the target cities, excluding the cut-off day." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "addresses = pd.read_sql_query(\n", " f\"\"\"\n", " SELECT\n", " id,\n", " created_at,\n", " place_id,\n", " latitude,\n", " longitude,\n", " city_id,\n", " city_name AS city,\n", " zip,\n", " street_address AS street,\n", " floor,\n", " special_instructions\n", " FROM\n", " {config.ORIGINAL_SCHEMA}.addresses\n", " WHERE\n", " city_id IN %(city_ids)s\n", " AND\n", " id IN (\n", " SELECT DISTINCT address_id\n", " FROM (\n", " SELECT DISTINCT\n", " pickup_address_id AS address_id\n", " FROM\n", " {config.ORIGINAL_SCHEMA}.orders\n", " WHERE\n", " created_at < '{config.CUTOFF_DAY}'\n", " UNION\n", " SELECT DISTINCT\n", " dropoff_address_id AS address_id\n", " FROM\n", " {config.ORIGINAL_SCHEMA}.orders\n", " WHERE\n", " created_at < '{config.CUTOFF_DAY}'\n", " ) AS orders\n", " )\n", " ORDER BY\n", " id\n", " \"\"\",\n", " con=connection,\n", " index_col=\"id\",\n", " params={\"city_ids\": city_ids},\n", " parse_dates=[\"created_at\"],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All columns are `\"strings\"`, even `zip`." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "addresses = addresses.astype(\n", " {\n", " \"place_id\": \"string\",\n", " \"city\": \"string\",\n", " \"zip\": \"string\",\n", " \"street\": \"string\",\n", " \"floor\": \"string\",\n", " \"special_instructions\": \"string\",\n", " }\n", ")" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
created_atplace_idlatitudelongitudecity_idcityzipstreetfloorspecial_instructions
id
22016-02-22 09:42:10.204171ChIJSfxJmlXq9EcRX2ChkiPW9J845.7631494.8326601Lyon6900231 rue Mercière<NA>
32016-02-22 09:42:10.351150ChIJwwMvNPnq9EcRY7Qu-Tw2HL845.7672274.8357501Lyon69001Rue de la République 2<NA>
42016-02-22 09:43:57.077475ChIJr1RhGN7B9EcRv6XSHmmN6a845.7437254.8731381Lyon69008123 avenue des frères lumières<NA>
52016-02-22 09:43:57.220446ChIJEwQm9H7q9EcRTymMxQ71z8445.7593694.8640871Lyon69003Avenue Georges Pompidou 17<NA>Appel au 06 46 12 20 27
62016-02-22 10:06:08.762590ChIJbQz7p6vr9EcRr9L2cH5942I45.7611814.8263711Lyon690058 bis Place Saint Jean<NA>
\n", "
" ], "text/plain": [ " created_at place_id latitude \\\n", "id \n", "2 2016-02-22 09:42:10.204171 ChIJSfxJmlXq9EcRX2ChkiPW9J8 45.763149 \n", "3 2016-02-22 09:42:10.351150 ChIJwwMvNPnq9EcRY7Qu-Tw2HL8 45.767227 \n", "4 2016-02-22 09:43:57.077475 ChIJr1RhGN7B9EcRv6XSHmmN6a8 45.743725 \n", "5 2016-02-22 09:43:57.220446 ChIJEwQm9H7q9EcRTymMxQ71z84 45.759369 \n", "6 2016-02-22 10:06:08.762590 ChIJbQz7p6vr9EcRr9L2cH5942I 45.761181 \n", "\n", " longitude city_id city zip street floor \\\n", "id \n", "2 4.832660 1 Lyon 69002 31 rue Mercière \n", "3 4.835750 1 Lyon 69001 Rue de la République 2 \n", "4 4.873138 1 Lyon 69008 123 avenue des frères lumières \n", "5 4.864087 1 Lyon 69003 Avenue Georges Pompidou 17 \n", "6 4.826371 1 Lyon 69005 8 bis Place Saint Jean \n", "\n", " special_instructions \n", "id \n", "2 \n", "3 \n", "4 \n", "5 Appel au 06 46 12 20 27 \n", "6 " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "addresses.head()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 663082 entries, 2 to 691914\n", "Data columns (total 10 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 created_at 663082 non-null datetime64[ns]\n", " 1 place_id 662953 non-null string \n", " 2 latitude 663082 non-null float64 \n", " 3 longitude 663082 non-null float64 \n", " 4 city_id 663082 non-null int64 \n", " 5 city 663082 non-null string \n", " 6 zip 663082 non-null string \n", " 7 street 663082 non-null string \n", " 8 floor 337703 non-null string \n", " 9 special_instructions 585310 non-null string \n", "dtypes: datetime64[ns](1), float64(2), int64(1), string(6)\n", "memory usage: 55.6 MB\n" ] } ], "source": [ "addresses.info()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "assert len(addresses) == 663_082" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adjust Time Zones" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a helper function that strips out the microseconds from datetime columns and converts their time zones from UTC to Europe/Paris." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "def clean_datetime(col):\n", " \"\"\"Strip Microseconds and convert timezone to Europe/Paris.\"\"\"\n", " return (\n", " col.dt.tz_localize(tz.utc)\n", " .dt.tz_convert(tz.timezone(\"Europe/Paris\"))\n", " .dt.tz_localize(None)\n", " .map(\n", " lambda x: datetime.datetime(\n", " x.year, x.month, x.day, x.hour, x.minute, x.second\n", " )\n", " if x is not pd.NaT\n", " else x\n", " )\n", " )" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "addresses[\"created_at\"] = clean_datetime(addresses[\"created_at\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Clean Place IDs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A tiny number of addresses has `latitude` / `longitude` pairs as `place_id`s." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "addresses[\"place_id\"] = (\n", " addresses[\"place_id\"].str.replace(r\"^[\\d\\.,-]+$\", \"\", regex=True).str.strip()\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Discard addresses without a `place_id` by Google Maps. If even Google does not know where these addresses are geo-located, we do not even try." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "msk = addresses[\"place_id\"].isnull() | (addresses[\"place_id\"] == \"\")\n", "addresses = addresses[~msk]\n", "\n", "assert msk.sum() == 139" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Clean City Names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some customers entered too much data into the `city` part of the address. Unify this column by only keeping the city's name." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", "[ 'Lyon', 'Lyon 2 E Arrondissement',\n", " 'Villeurbanne', 'Lyon 7 E Arrondissement',\n", " 'Lyon 6 E Arrondissement', 'Lyon 8 E Arrondissement',\n", " 'Lyon 9 E Arrondissement', 'Paris',\n", " 'Levallois-Perret', 'Courbevoie',\n", " 'Puteaux', 'Neuilly Sur Seine',\n", " 'Boulogne Billancourt', 'Levallois Perret',\n", " 'Malakoff', 'Saint-Mandé',\n", " 'Paris 14 E Arrondissement', 'Montrouge',\n", " 'Issy-les-moulineaux', 'Paris 10 E Arrondissement',\n", " 'Issy Les Moulineaux', 'Nanterre',\n", " 'Paris 12 E Arrondissement', 'La Garenne-colombes',\n", " 'Vincennes', 'Montreuil',\n", " 'Asnières-sur-Seine', 'Paris 15 E Arrondissement',\n", " 'Saint-Ouen', 'Paris 9 E Arrondissement',\n", " 'Pantin', 'Paris 17 E Arrondissement',\n", " 'Paris 1 Er Arrondissement', 'Bordeaux',\n", " 'Talence', 'Ivry-sur-Seine',\n", " 'Testas', 'Saint-Denis']\n", "Length: 38, dtype: string" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "addresses[\"city\"].unique()" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "addresses[\"city\"] = (\n", " addresses[\"city\"]\n", " .str.replace(r\"(E Arrondissement|Er Arrondissement)\", \"\", regex=True)\n", " .str.replace(r\"(\\d)\", \"\", regex=True)\n", " # Get rid off accents.\n", " .str.normalize(\"NFKD\")\n", " .str.encode(\"ascii\", errors=\"ignore\")\n", " .str.decode(\"utf8\")\n", " .astype(\"string\")\n", " # Unify hyphens.\n", " .str.strip()\n", " .str.replace(\"-\", \" \")\n", " .str.title()\n", " .str.replace(\" \", \"-\")\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sub-urban city names surrounding the three big cities in this research project are kept." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Paris 362252\n", "Lyon 199872\n", "Bordeaux 64379\n", "Villeurbanne 15555\n", "Levallois-Perret 4620\n", "Courbevoie 3510\n", "Puteaux 2323\n", "Issy-Les-Moulineaux 1219\n", "Malakoff 1214\n", "Montrouge 1164\n", "Pantin 1119\n", "Saint-Mande 1043\n", "Boulogne-Billancourt 986\n", "Montreuil 909\n", "Neuilly-Sur-Seine 835\n", "Vincennes 826\n", "Ivry-Sur-Seine 438\n", "Nanterre 361\n", "La-Garenne-Colombes 213\n", "Asnieres-Sur-Seine 62\n", "Saint-Ouen 35\n", "Saint-Denis 4\n", "Testas 3\n", "Talence 1\n", "Name: city, dtype: Int64" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "addresses[\"city\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Clean Zip Codes" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", "[ '69002', '69001', '69008', '69003', '69005',\n", " '69007', '69004', '69009', '69006', '69100',\n", " ...\n", " '92053', '92806', '69003 ', '75015 PARIS', '69',\n", " '33077', '33092', '33139', '75010z', '92040']\n", "Length: 167, dtype: string" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "addresses[\"zip\"].unique()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "addresses[\"zip\"] = (\n", " addresses[\"zip\"]\n", " .str.replace(r\".*(\\d{5}).*\", r\"\\1\", regex=True)\n", " .str.replace(r\"\\D+\", \"\", regex=True)\n", " .replace(\"\", \"NaN\")\n", " .astype(float)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Zip codes with less than 5 digits are invalid. Paris has zip codes with 75xxx (with 92xxx, 93xxx, and 94xxx being suburbs), Lyon 69xxx, and Bordeaux 33xxx (cf., [source](https://en.wikipedia.org/wiki/Postal_codes_in_France)). Keep only valid zip codes in target cities." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "invalid = addresses[\"zip\"].notnull() & (addresses[\"zip\"] < 10000)\n", "\n", "assert invalid.sum() == 9\n", "\n", "not_in_target_cities = (\n", " addresses[\"zip\"].notnull()\n", " & ~invalid\n", " & ~(\n", " (33000 <= addresses[\"zip\"]) & (addresses[\"zip\"] < 34000)\n", " | (69000 <= addresses[\"zip\"]) & (addresses[\"zip\"] < 70000)\n", " | (75000 <= addresses[\"zip\"]) & (addresses[\"zip\"] < 76000)\n", " | (92000 <= addresses[\"zip\"]) & (addresses[\"zip\"] < 95000)\n", " )\n", ")\n", "\n", "assert not_in_target_cities.sum() == 10\n", "\n", "addresses.loc[invalid | not_in_target_cities, \"zip\"] = np.NaN" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([69002., 69001., 69008., 69003., 69005., 69007., 69004., 69009.,\n", " 69006., 69100., 69200., 69300., 69350., nan, 75011., 69370.,\n", " 75009., 75010., 75001., 75008., 75017., 75003., 75002., 75018.,\n", " 92300., 75007., 75012., 75006., 75116., 75015., 92400., 92800.,\n", " 92200., 92100., 75020., 75019., 75004., 75013., 75016., 75005.,\n", " 75014., 92130., 92092., 92932., 92000., 92150., 92270., 92190.,\n", " 92110., 92170., 93400., 92600., 94250., 69160., 92120., 92250.,\n", " 92078., 92310., 92240., 75543., 92064., 92210., 75270., 75000.,\n", " 92057., 93100., 92140., 92081., 75741., 69326., 69130., 69500.,\n", " 94160., 94300., 93170., 94110., 94220., 93260., 69000., 93500.,\n", " 75045., 94120., 94200., 92220., 93000., 92974., 92063., 69429.,\n", " 33000., 33800., 33200., 33400., 33300., 33130., 33110., 33100.,\n", " 33270., 33520., 92935., 33067., 33150., 93507., 33700., 69628.,\n", " 33080., 33076., 92700., 93310., 92042., 92058., 92930., 69120.,\n", " 92936., 93300., 93210., 92671., 93200., 92053., 92806., 33077.,\n", " 33092., 33139., 92040.])" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "addresses[\"zip\"].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Discard addresses with missing zip codes because they are hard to geo-code." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "msk = addresses[\"zip\"].isnull()\n", "addresses = addresses[~msk]\n", "\n", "assert msk.sum() == 21" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "addresses = addresses.astype({\"zip\": int})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Clean Street Names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remove extra whitespace, HTML encodings, and accents." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "addresses[\"street\"] = (\n", " addresses[\"street\"]\n", " .str.replace(\"\\s+\", \" \", regex=True)\n", " .str.replace(\"'\", \"'\")\n", " # Get rid off accents.\n", " .str.normalize(\"NFKD\")\n", " .str.encode(\"ascii\", errors=\"ignore\")\n", " .str.decode(\"utf8\")\n", " .astype(\"string\")\n", " .str.strip()\n", " .str.title()\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are no addresses without a `street` name." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "assert not addresses[\"street\"].isnull().any()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parse Floor Numbers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make `floor` an integer column." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", "[ , '',\n", " '3', '19',\n", " '2ème étage', '1',\n", " '1er', '2eme etage face escalier',\n", " '2', '2 eme droite',\n", " ...\n", " ' Premiere Bati interphone 16 2eme', 'Le Cargo - 4e étage',\n", " 'Rdc appart 11', 'Etage 3, appartement 11',\n", " '5e étage gauche ', '2eme etage a gauche bat b ',\n", " '4ème étage, chambre 41', 'Cinquiéme ',\n", " 'Montez jusqu'à la grille ', '5ème étage à droite ']\n", "Length: 11990, dtype: string" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "addresses[\"floor\"].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Parse out floors from the `floor` text column." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "addresses[\"floor\"] = (\n", " addresses[\"floor\"]\n", " # Get rid of accents and lower case everything.\n", " .str.normalize(\"NFKD\")\n", " .str.encode(\"ascii\", errors=\"ignore\")\n", " .str.decode(\"utf8\")\n", " .astype(\"string\")\n", " .str.casefold()\n", " # Replace common text that messes up the matching.\n", " .str.replace(\".\", \"\")\n", " .str.replace(\":\", \"\")\n", " .str.replace(\";\", \"\")\n", " .str.replace(\"'\", \"'\")\n", " .str.replace(\"36b25\", \"\")\n", " .str.replace(\"n°\", \"\")\n", " .str.replace(\"#\", \"\")\n", " .str.replace(\"face l'assanceur\", \"\")\n", " .str.replace(\"\\(drt\\)\", \"\")\n", " .str.replace(\"floor\", \"\")\n", " .str.replace(\"et demi\", \"\")\n", " .str.replace(\"et droite\", \"\")\n", " .str.replace(\"droite\", \"\")\n", " .str.replace(\"droit\", \"\")\n", " .str.replace(\"a gauche\", \"\")\n", " .str.replace(\"e gauche\", \"\")\n", " .str.replace(\"gauche\", \"\")\n", " .str.replace(\"entrez\", \"\")\n", " .str.replace(\"serez\", \"\")\n", " .str.replace(\"dussol\", \"\")\n", " .str.replace(\"soler\", \"\")\n", " .str.replace(\"sonner\", \"\")\n", " .str.replace(\"code\", \"\")\n", " .str.replace(\"perez\", \"\")\n", " .str.replace(\"-\", \"\")\n", " .str.replace(\"\\s+\", \" \", regex=True)\n", " .str.strip()\n", " # Abbreviations.\n", " .str.replace(\n", " r\"^.*?((\\d+)\\s?(er|ere|em|eme|ele|ieme|bis|(e|g|st|nd|rd|th|z)($|,|\\s+))).*\",\n", " r\"\\2\",\n", " regex=True,\n", " )\n", " # French written out.\n", " .str.replace(r\".*(rdc|rez|sol|ground).*\", \"0\", regex=True)\n", " .str.replace(r\".*(premiere|premier).*\", \"1\", regex=True)\n", " .str.replace(r\".*(deuxieme).*\", \"2\", regex=True)\n", " .str.replace(r\".*(troisieme).*\", \"3\", regex=True)\n", " .str.replace(r\".*(quatrieme).*\", \"4\", regex=True)\n", " .str.replace(r\".*(cinquieme).*\", \"5\", regex=True)\n", " .str.replace(r\".*(sixieme).*\", \"6\", regex=True)\n", " .str.replace(r\".*(septieme).*\", \"7\", regex=True)\n", " .str.replace(r\".*(huitieme).*\", \"8\", regex=True)\n", " .str.replace(r\".*(neuvieme).*\", \"9\", regex=True)\n", " .str.replace(r\".*(dixieme).*\", \"10\", regex=True)\n", " .str.replace(r\"^.*?((etage|etg) (\\d+))($|\\D+.*)\", r\"\\3\", regex=True)\n", " .str.replace(r\"^.*?((\\d+)(etage| etage|etg| etg)).*\", r\"\\2\", regex=True)\n", " # Remove apartment info to not confuse it with floor\n", " .str.replace(\n", " r\"(.*)(ap|apt|app|appt|appart|appartment|appartement|chambre|room)\\s*\\w?\\d+(.*)\",\n", " r\"\\1 \\3\",\n", " regex=True,\n", " )\n", " .str.replace(r\"(.*)(code|digicode)\\s*\\w?\\d+(.*)\", r\"\\1 \\3\", regex=True)\n", " # Take number at start.\n", " .str.replace(r\"^(\\d+)(,|\\s+).*\", r\"\\1\", regex=True)\n", " # Ignore anything with non-numeric symbols entirely.\n", " .str.replace(r\".*\\D+.*\", \"\", regex=True)\n", " .str.replace(\"^$\", \"NaN\")\n", " .fillna(\"NaN\")\n", " .astype(float)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the `floor` column is empty, parse out floor info from the `special_instructions` column that must have been used before the `floor` column was introduced (slightly different parsing logic than above)." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "addresses[\"special_instructions\"] = (\n", " addresses[\"special_instructions\"]\n", " # Get rid of accents and lower case everything.\n", " .str.normalize(\"NFKD\")\n", " .str.encode(\"ascii\", errors=\"ignore\")\n", " .str.decode(\"utf8\")\n", " .astype(\"string\")\n", " .str.casefold()\n", " # Replace common text that messes up the matching.\n", " .str.replace(\".\", \"\")\n", " .str.replace(\":\", \"\")\n", " .str.replace(\";\", \"\")\n", " .str.replace(\"'\", \"'\")\n", " .str.replace(\"36b25\", \"\")\n", " .str.replace(\"n°\", \"\")\n", " .str.replace(\"#\", \"\")\n", " .str.replace(\"face l'assanceur\", \"\")\n", " .str.replace(\"\\(drt\\)\", \"\")\n", " .str.replace(\"floor\", \"\")\n", " .str.replace(\"et demi\", \"\")\n", " .str.replace(\"et droite\", \"\")\n", " .str.replace(\"droite\", \"\")\n", " .str.replace(\"droit\", \"\")\n", " .str.replace(\"a gauche\", \"\")\n", " .str.replace(\"e gauche\", \"\")\n", " .str.replace(\"gauche\", \"\")\n", " .str.replace(\"entrez\", \"\")\n", " .str.replace(\"serez\", \"\")\n", " .str.replace(\"dussol\", \"\")\n", " .str.replace(\"soler\", \"\")\n", " .str.replace(\"sonner\", \"\")\n", " .str.replace(\"code\", \"\")\n", " .str.replace(\"perez\", \"\")\n", " .str.replace(\"-\", \"\")\n", " .str.replace(\"\\s+\", \" \", regex=True)\n", " .str.strip()\n", " # Abbreviations.\n", " .str.replace(\n", " r\"^.*?((\\d+)\\s?(er|ere|em|eme|ele|ieme|bis|(e|g|st|nd|rd|th|z)($|,|\\s+))).*\",\n", " r\"\\2\",\n", " regex=True,\n", " )\n", " # French written out.\n", " .str.replace(r\".*(rdc|rez|sol|ground).*\", \"0\", regex=True)\n", " .str.replace(r\".*(premiere|premier).*\", \"1\", regex=True)\n", " .str.replace(r\".*(deuxieme).*\", \"2\", regex=True)\n", " .str.replace(r\".*(troisieme).*\", \"3\", regex=True)\n", " .str.replace(r\".*(quatrieme).*\", \"4\", regex=True)\n", " .str.replace(r\".*(cinquieme).*\", \"5\", regex=True)\n", " .str.replace(r\".*(sixieme).*\", \"6\", regex=True)\n", " .str.replace(r\".*(septieme).*\", \"7\", regex=True)\n", " .str.replace(r\".*(huitieme).*\", \"8\", regex=True)\n", " .str.replace(r\".*(neuvieme).*\", \"9\", regex=True)\n", " .str.replace(r\".*(dixieme).*\", \"10\", regex=True)\n", " .str.replace(r\"^.*?((etage|etg) (\\d+))($|\\D+.*)\", r\"\\3\", regex=True)\n", " .str.replace(r\"^.*?((\\d+)(etage| etage|etg| etg)).*\", r\"\\2\", regex=True)\n", " # Remove apartment info to not confuse it with floor.\n", " .str.replace(\n", " r\"(.*)(ap|apt|app|appt|appart|appartment|appartement|chambre|room)\\s*\\w?\\d+(.*)\",\n", " r\"\\1 \\3\",\n", " regex=True,\n", " )\n", " .str.replace(r\"(.*)(code|digicode)\\s*\\w?\\d+(.*)\", r\"\\1 \\3\", regex=True)\n", " # Ignore anything with non-numeric symbols entirely.\n", " .str.replace(r\".*\\D+.*\", \"\", regex=True)\n", " .str.replace(\"^$\", \"NaN\")\n", " .fillna(\"NaN\")\n", " .astype(float)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fill in `floor` from `special_instructions` and cast the type." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "msk = addresses[\"floor\"].isnull() & addresses[\"special_instructions\"].notnull()\n", "addresses.loc[msk, \"floor\"] = addresses.loc[msk, \"special_instructions\"].values" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "del addresses[\"special_instructions\"]" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "addresses = addresses.astype({\"floor\": \"Int64\"})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Only keep the realisic numbers." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "addresses.loc[addresses[\"floor\"].notnull() & (addresses[\"floor\"] > 40), \"floor\"] = pd.NA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most addresses have no floor number given." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "assert len(addresses.loc[addresses[\"floor\"].isnull(), \"floor\"]) == 307_973" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most `floor`s are near the ground floor, which is plausible." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 20977\n", "1 69765\n", "2 62439\n", "3 59011\n", "4 52226\n", "5 42930\n", "6 28396\n", "7 8853\n", "8 3349\n", "9 1631\n", "10 1384\n", "11 764\n", "12 657\n", "13 448\n", "14 322\n", "15 206\n", "16 202\n", "17 109\n", "18 135\n", "19 105\n", "20 77\n", "21 51\n", "22 99\n", "23 39\n", "24 85\n", "25 76\n", "26 56\n", "27 30\n", "28 49\n", "29 51\n", "30 33\n", "31 55\n", "32 22\n", "33 11\n", "34 37\n", "35 55\n", "36 19\n", "37 8\n", "38 32\n", "39 136\n", "40 19\n", "Name: floor, dtype: Int64" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "addresses[\"floor\"].value_counts().sort_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Deduplicate" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The number of addresses (ca. 663,000) is inflated, probably due to some sort of automated re-entering." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "assert len(addresses) == 662_922" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, merge all addresses with the same `place_id`, `latitude` / `longitude`, `city`, `zip`, `street`, *and* `floor` into one entry, namely its first occurrence." ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "addresses[\"floor\"] = addresses[\"floor\"].fillna(999) # dummy -> No grouping with NaN's" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "by = [\"place_id\", \"latitude\", \"longitude\", \"city_id\", \"city\", \"zip\", \"street\", \"floor\"]\n", "\n", "addresses = (\n", " addresses.reset_index()\n", " .set_index(by)\n", " .merge(\n", " (\n", " addresses.reset_index()\n", " .groupby(by)[[\"id\"]]\n", " .min()\n", " .rename(columns={\"id\": \"merged_on_id\"})\n", " ),\n", " left_index=True,\n", " right_index=True,\n", " )\n", " .reset_index()\n", " .astype({\"place_id\": \"string\", \"city\": \"string\", \"street\": \"string\"})\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Keep a dictionary `address_merger` to map the ID's that are merged away to the ones that are kept." ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "address_merger = collections.defaultdict(lambda: np.NaN)\n", "address_merger.update(\n", " {\n", " id_: merged_on_id\n", " for _, id_, merged_on_id in addresses[[\"id\", \"merged_on_id\"]].itertuples()\n", " }\n", ")" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "addresses = (\n", " addresses[addresses[\"id\"] == addresses[\"merged_on_id\"]]\n", " .set_index(\"id\")\n", " .sort_index()[\n", " [\n", " \"created_at\",\n", " \"place_id\",\n", " \"latitude\",\n", " \"longitude\",\n", " \"city_id\",\n", " \"city\",\n", " \"zip\",\n", " \"street\",\n", " \"floor\",\n", " ]\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "addresses[\"floor\"] = addresses[\"floor\"].replace(999, pd.NA).astype(\"Int64\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Only about 178,000 addresses remain!" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "assert len(addresses) == 178_101" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Second, many addresses are still redundant as they are referring to *different* `floor`s in the *same* house or their `street` name is written differently. \n", "\n", "We create a `primary_id` column that holds the ID of the first occurrence of an address independent of the exact spelling of the `street` name and the `floor` number.\n", "\n", "That column is created via grouping the remaining addresses twice, once with their GPS location, and second by a simplified version of `street`. The latter accounts for slightly different `latitude` / `longitude` pairs of the same location, potentially due to an update in the Google Maps database." ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "by = [\"place_id\", \"latitude\", \"longitude\"]\n", "\n", "addresses = (\n", " addresses.reset_index()\n", " .set_index(by)\n", " .merge(\n", " (\n", " addresses.reset_index()\n", " .groupby(by)[[\"id\"]]\n", " .min()\n", " .rename(columns={\"id\": \"unified1_id\"})\n", " ),\n", " left_index=True,\n", " right_index=True,\n", " )\n", " .reset_index()\n", " .set_index(\"id\")\n", " .sort_index()\n", " .astype({\"place_id\": \"string\"})[\n", " [\n", " \"unified1_id\",\n", " \"created_at\",\n", " \"place_id\",\n", " \"latitude\",\n", " \"longitude\",\n", " \"city_id\",\n", " \"city\",\n", " \"zip\",\n", " \"street\",\n", " \"floor\",\n", " ]\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "addresses[\"street_simple\"] = (\n", " addresses[\"street\"]\n", " .str.replace(\"Avenue\", \"Ave\")\n", " .str.replace(\"Place\", \"Pl\")\n", " .str.replace(\".\", \"\")\n", " .str.replace(\"-\", \"\")\n", " .str.replace(\" \", \"\")\n", " .str.lower()\n", ")" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "by = [\"city_id\", \"street_simple\"]\n", "\n", "addresses = (\n", " addresses.reset_index()\n", " .set_index(by)\n", " .merge(\n", " (\n", " addresses.reset_index()\n", " .groupby(by)[[\"id\"]]\n", " .min()\n", " .rename(columns={\"id\": \"unified2_id\"})\n", " ),\n", " left_index=True,\n", " right_index=True,\n", " )\n", " .reset_index()\n", " .set_index(\"id\")\n", " .sort_index()[\n", " [\n", " \"unified1_id\",\n", " \"unified2_id\",\n", " \"created_at\",\n", " \"place_id\",\n", " \"latitude\",\n", " \"longitude\",\n", " \"city_id\",\n", " \"city\",\n", " \"zip\",\n", " \"street\",\n", " \"floor\",\n", " ]\n", " ]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, an address may be a duplicate of *two* different earlier addresses and we choose the earliest one." ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "addresses[\"primary_id\"] = addresses[[\"unified1_id\", \"unified2_id\"]].min(axis=1)\n", "\n", "del addresses[\"unified1_id\"]\n", "del addresses[\"unified2_id\"]\n", "\n", "addresses = addresses[\n", " [\n", " \"primary_id\",\n", " \"created_at\",\n", " \"place_id\",\n", " \"latitude\",\n", " \"longitude\",\n", " \"city_id\",\n", " \"city\",\n", " \"zip\",\n", " \"street\",\n", " \"floor\",\n", " ]\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A tricky issue is that an address could be identified as a duplicate of an earlier one that itself is a duplicate of an even earlier one. The following loop does the trick and maps each address to its earlierst version." ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "_address_unifier = {\n", " id_: unified_id\n", " for _, id_, unified_id in addresses.reset_index()[[\"id\", \"primary_id\"]].itertuples()\n", "}\n", "\n", "while True:\n", " if (addresses[\"primary_id\"] != addresses[\"primary_id\"].map(_address_unifier)).any():\n", " addresses[\"primary_id\"] = addresses[\"primary_id\"].map(_address_unifier)\n", " else:\n", " break" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Only about 87,000 of the remaining 178,000 addresses are unique locations disregarding `floor`s and different spellings of the `street` name." ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "_addresses = addresses.reset_index()\n", "msk = _addresses[\"id\"] == _addresses[\"primary_id\"]\n", "del _addresses\n", "\n", "assert msk.sum() == 87_287" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To not overwrite a Python built-in in the ORM layer." ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [], "source": [ "addresses = addresses.rename(columns={\"zip\": \"zip_code\"})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Clean Data" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
primary_idcreated_atplace_idlatitudelongitudecity_idcityzip_codestreetfloor
id
222016-02-22 10:42:10ChIJSfxJmlXq9EcRX2ChkiPW9J845.7631494.8326601Lyon6900231 Rue Merciere<NA>
332016-02-22 10:42:10ChIJwwMvNPnq9EcRY7Qu-Tw2HL845.7672274.8357501Lyon69001Rue De La Republique 2<NA>
442016-02-22 10:43:57ChIJr1RhGN7B9EcRv6XSHmmN6a845.7437254.8731381Lyon69008123 Avenue Des Freres Lumieres<NA>
552016-02-22 10:43:57ChIJEwQm9H7q9EcRTymMxQ71z8445.7593694.8640871Lyon69003Avenue Georges Pompidou 17<NA>
662016-02-22 11:06:08ChIJbQz7p6vr9EcRr9L2cH5942I45.7611814.8263711Lyon690058 Bis Place Saint Jean<NA>
\n", "
" ], "text/plain": [ " primary_id created_at place_id latitude \\\n", "id \n", "2 2 2016-02-22 10:42:10 ChIJSfxJmlXq9EcRX2ChkiPW9J8 45.763149 \n", "3 3 2016-02-22 10:42:10 ChIJwwMvNPnq9EcRY7Qu-Tw2HL8 45.767227 \n", "4 4 2016-02-22 10:43:57 ChIJr1RhGN7B9EcRv6XSHmmN6a8 45.743725 \n", "5 5 2016-02-22 10:43:57 ChIJEwQm9H7q9EcRTymMxQ71z84 45.759369 \n", "6 6 2016-02-22 11:06:08 ChIJbQz7p6vr9EcRr9L2cH5942I 45.761181 \n", "\n", " longitude city_id city zip_code street floor \n", "id \n", "2 4.832660 1 Lyon 69002 31 Rue Merciere \n", "3 4.835750 1 Lyon 69001 Rue De La Republique 2 \n", "4 4.873138 1 Lyon 69008 123 Avenue Des Freres Lumieres \n", "5 4.864087 1 Lyon 69003 Avenue Georges Pompidou 17 \n", "6 4.826371 1 Lyon 69005 8 Bis Place Saint Jean " ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "addresses.head()" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 178101 entries, 2 to 691914\n", "Data columns (total 10 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 primary_id 178101 non-null int64 \n", " 1 created_at 178101 non-null datetime64[ns]\n", " 2 place_id 178101 non-null string \n", " 3 latitude 178101 non-null float64 \n", " 4 longitude 178101 non-null float64 \n", " 5 city_id 178101 non-null int64 \n", " 6 city 178101 non-null string \n", " 7 zip_code 178101 non-null int64 \n", " 8 street 178101 non-null string \n", " 9 floor 100540 non-null Int64 \n", "dtypes: Int64(1), datetime64[ns](1), float64(2), int64(3), string(3)\n", "memory usage: 15.1 MB\n" ] } ], "source": [ "addresses.info()" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [], "source": [ "assert (\n", " hashlib.sha256(addresses.to_json().encode()).hexdigest()\n", " == \"4f9f3b63a9b2472bf07207d0e06f4901619066121d6bb5fd3ad4ebf21b590410\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Restaurants" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Raw Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load restaurants associated with *all* addresses in the target cities. Further below, *all* restaurants are shown to have a clean address." ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [], "source": [ "restaurants = pd.read_sql_query(\n", " f\"\"\"\n", " SELECT\n", " id,\n", " created_at,\n", " name,\n", " address_id,\n", " estimated_prep_duration\n", " FROM\n", " {config.ORIGINAL_SCHEMA}.businesses\n", " WHERE\n", " address_id IN (\n", " SELECT id FROM {config.ORIGINAL_SCHEMA}.addresses WHERE city_id IN %(city_ids)s\n", " )\n", " AND\n", " created_at < '{config.CUTOFF_DAY}'\n", " ORDER BY\n", " id\n", " \"\"\",\n", " con=connection,\n", " index_col=\"id\",\n", " params={\"city_ids\": city_ids},\n", " parse_dates=[\"created_at\"],\n", ")" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [], "source": [ "restaurants[\"name\"] = restaurants[\"name\"].astype(\"string\")" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
created_atnameaddress_idestimated_prep_duration
id
12016-02-22 09:42:10.228854King Marcel Mercière21200
22016-02-22 09:43:57.103750Trotekala41200
32016-02-22 10:06:08.796042Soul Food & Jazz Café61200
42016-02-22 10:16:11.478981Gourmix Bellecour8900
52016-02-22 10:36:09.150481Yabio Hôtel de Ville101200
\n", "
" ], "text/plain": [ " created_at name address_id \\\n", "id \n", "1 2016-02-22 09:42:10.228854 King Marcel Mercière 2 \n", "2 2016-02-22 09:43:57.103750 Trotekala 4 \n", "3 2016-02-22 10:06:08.796042 Soul Food & Jazz Café 6 \n", "4 2016-02-22 10:16:11.478981 Gourmix Bellecour 8 \n", "5 2016-02-22 10:36:09.150481 Yabio Hôtel de Ville 10 \n", "\n", " estimated_prep_duration \n", "id \n", "1 1200 \n", "2 1200 \n", "3 1200 \n", "4 900 \n", "5 1200 " ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "restaurants.head()" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 1654 entries, 1 to 1787\n", "Data columns (total 4 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 created_at 1654 non-null datetime64[ns]\n", " 1 name 1654 non-null string \n", " 2 address_id 1654 non-null int64 \n", " 3 estimated_prep_duration 1654 non-null int64 \n", "dtypes: datetime64[ns](1), int64(2), string(1)\n", "memory usage: 64.6 KB\n" ] } ], "source": [ "restaurants.info()" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [], "source": [ "assert len(restaurants) == 1_654" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adjust Time Zone" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [], "source": [ "restaurants[\"created_at\"] = clean_datetime(restaurants[\"created_at\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Simplify Names" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [], "source": [ "restaurants[\"name\"] = (\n", " restaurants[\"name\"]\n", " .str.replace(\"\\s+\", \" \", regex=True)\n", " .str.replace(\"'\", \"'\")\n", " # Get rid off accents.\n", " .str.normalize(\"NFKD\")\n", " .str.encode(\"ascii\", errors=\"ignore\")\n", " .str.decode(\"utf8\")\n", " .astype(\"string\")\n", " .str.title()\n", " # To find duplicates further below.\n", " .str.replace(\" & \", \" And \")\n", " .str.replace(\"The \", \"\")\n", " .str.replace(\"Pasta Pizza \", \"\")\n", " .str.replace(\" - Bar A Taboule\", \"\")\n", " .str.replace(\"- Cuisine Mediterraneenne\", \"\")\n", " .str.replace(\" - Petit-Dejeuner\", \"\")\n", " .str.replace(\"Lyon\", \"\")\n", " .str.replace(\"La Burgeria Saint Mande\", \"La Fromagette Saint Mande\")\n", " .str.replace(\"Mansou'\", \"Mansouria\")\n", " .str.strip()\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use Merged Addresses" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [], "source": [ "restaurants[\"address_id\"] = restaurants[\"address_id\"].map(address_merger)" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "assert not restaurants[\"address_id\"].isnull().any()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Deduplicate" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Restaurants with the same name at the same (unified) address are merged." ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [], "source": [ "restaurants = restaurants.merge(\n", " addresses[\"primary_id\"], left_on=\"address_id\", right_index=True\n", ")\n", "restaurants = restaurants.rename(columns={\"primary_id\": \"primary_address_id\"})" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [], "source": [ "by = [\"name\", \"primary_address_id\"]\n", "\n", "restaurants = (\n", " restaurants.reset_index()\n", " .set_index(by)\n", " .merge(\n", " (\n", " restaurants.reset_index()\n", " .groupby(by)[[\"id\"]]\n", " .min()\n", " .rename(columns={\"id\": \"merged_on_id\"})\n", " ),\n", " left_index=True,\n", " right_index=True,\n", " )\n", " .reset_index()\n", " .astype({\"name\": \"string\"})\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Keep a dictionary to map the ID's that are merged away to the ones that are kept." ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [], "source": [ "restaurants_merger = collections.defaultdict(lambda: np.NaN)\n", "restaurants_merger.update(\n", " {\n", " id_: merged_on_id\n", " for _, id_, merged_on_id in restaurants[[\"id\", \"merged_on_id\"]].itertuples()\n", " }\n", ")" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [], "source": [ "restaurants = (\n", " restaurants[restaurants[\"id\"] == restaurants[\"merged_on_id\"]]\n", " .set_index(\"id\")\n", " .sort_index()[[\"created_at\", \"name\", \"address_id\", \"estimated_prep_duration\"]]\n", ")" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [], "source": [ "assert len(restaurants) == 1_644" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Clean Data" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
created_atnameaddress_idestimated_prep_duration
id
12016-02-22 10:42:10King Marcel Merciere21200
22016-02-22 10:43:57Trotekala41200
32016-02-22 11:06:08Soul Food And Jazz Cafe61200
42016-02-22 11:16:11Gourmix Bellecour8900
52016-02-22 11:36:09Yabio Hotel De Ville101200
\n", "
" ], "text/plain": [ " created_at name address_id \\\n", "id \n", "1 2016-02-22 10:42:10 King Marcel Merciere 2 \n", "2 2016-02-22 10:43:57 Trotekala 4 \n", "3 2016-02-22 11:06:08 Soul Food And Jazz Cafe 6 \n", "4 2016-02-22 11:16:11 Gourmix Bellecour 8 \n", "5 2016-02-22 11:36:09 Yabio Hotel De Ville 10 \n", "\n", " estimated_prep_duration \n", "id \n", "1 1200 \n", "2 1200 \n", "3 1200 \n", "4 900 \n", "5 1200 " ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "restaurants.head()" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 1644 entries, 1 to 1787\n", "Data columns (total 4 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 created_at 1644 non-null datetime64[ns]\n", " 1 name 1644 non-null string \n", " 2 address_id 1644 non-null int64 \n", " 3 estimated_prep_duration 1644 non-null int64 \n", "dtypes: datetime64[ns](1), int64(2), string(1)\n", "memory usage: 64.2 KB\n" ] } ], "source": [ "restaurants.info()" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [], "source": [ "assert (\n", " hashlib.sha256(restaurants.to_json().encode()).hexdigest()\n", " == \"8eb852690c027e2fcc0d9cf988391741b1cd028e35c494766f8c21d4ea1722b7\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Couriers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Raw Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Only load couriers that worked in one of the target cities (i.e., had an order) and include the vehicle information." ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [], "source": [ "couriers = pd.read_sql_query(\n", " f\"\"\"\n", " SELECT\n", " couriers.id,\n", " couriers.created_at,\n", " MD5(couriers.name) AS name,\n", " vehicle_types.icon as vehicle,\n", " couriers.speed,\n", " vehicle_bag_types.capacity,\n", " couriers.pay_per_hour,\n", " couriers.pay_per_order\n", " FROM\n", " {config.ORIGINAL_SCHEMA}.couriers\n", " LEFT OUTER JOIN\n", " {config.ORIGINAL_SCHEMA}.vehicles ON couriers.vehicle_id = vehicles.id\n", " LEFT OUTER JOIN\n", " {config.ORIGINAL_SCHEMA}.vehicle_types ON vehicles.vehicle_type_id = vehicle_types.id\n", " LEFT OUTER JOIN\n", " {config.ORIGINAL_SCHEMA}.vehicle_bag_types ON vehicles.vehicle_bag_type_id = vehicle_bag_types.id\n", " WHERE\n", " couriers.id in (\n", " SELECT DISTINCT\n", " deliveries.courier_id\n", " FROM\n", " {config.ORIGINAL_SCHEMA}.orders\n", " INNER JOIN\n", " {config.ORIGINAL_SCHEMA}.deliveries ON orders.id = deliveries.order_id\n", " WHERE\n", " orders.featured_business_id IN (\n", " SELECT -- Subquery based off the restaurants query above!\n", " id\n", " FROM\n", " {config.ORIGINAL_SCHEMA}.businesses\n", " WHERE\n", " address_id IN (\n", " SELECT id FROM {config.ORIGINAL_SCHEMA}.addresses WHERE city_id IN %(city_ids)s\n", " )\n", " AND\n", " created_at < '{config.CUTOFF_DAY}'\n", " )\n", " AND\n", " deliveries.courier_id IS NOT NULL\n", " )\n", " ORDER BY\n", " couriers.id\n", " \"\"\",\n", " con=connection,\n", " index_col=\"id\",\n", " params={\"city_ids\": city_ids},\n", " parse_dates=[\"created_at\"],\n", ")" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [], "source": [ "couriers = couriers.astype({\"name\": \"string\", \"vehicle\": \"string\"})" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
created_atnamevehiclespeedcapacitypay_per_hourpay_per_order
id
12016-02-21 13:28:41.5710469181c2be2b3746cbcd6abf86d77fb2aabicycle13.901500300
22016-02-21 13:28:41.6737111b5d5709985e874e9c54e30127b3049fbicycle18.471000300
32016-02-21 13:28:41.7768930e385d898f81147d460a45d2943a4eedbicycle19.181000300
42016-02-21 13:28:41.879876ca21f91ea528f1b6e596d0c43959062bbicycle14.731500200
52016-02-21 13:28:41.98254204af1119256f88ea3a25e79344dec96cbicycle15.201500200
\n", "
" ], "text/plain": [ " created_at name vehicle \\\n", "id \n", "1 2016-02-21 13:28:41.571046 9181c2be2b3746cbcd6abf86d77fb2aa bicycle \n", "2 2016-02-21 13:28:41.673711 1b5d5709985e874e9c54e30127b3049f bicycle \n", "3 2016-02-21 13:28:41.776893 0e385d898f81147d460a45d2943a4eed bicycle \n", "4 2016-02-21 13:28:41.879876 ca21f91ea528f1b6e596d0c43959062b bicycle \n", "5 2016-02-21 13:28:41.982542 04af1119256f88ea3a25e79344dec96c bicycle \n", "\n", " speed capacity pay_per_hour pay_per_order \n", "id \n", "1 13.90 150 0 300 \n", "2 18.47 100 0 300 \n", "3 19.18 100 0 300 \n", "4 14.73 150 0 200 \n", "5 15.20 150 0 200 " ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "couriers.head()" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 2471 entries, 1 to 53764\n", "Data columns (total 7 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 created_at 2471 non-null datetime64[ns]\n", " 1 name 2471 non-null string \n", " 2 vehicle 2471 non-null string \n", " 3 speed 2471 non-null float64 \n", " 4 capacity 2471 non-null int64 \n", " 5 pay_per_hour 2471 non-null int64 \n", " 6 pay_per_order 2471 non-null int64 \n", "dtypes: datetime64[ns](1), float64(1), int64(3), string(2)\n", "memory usage: 154.4 KB\n" ] } ], "source": [ "couriers.info()" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [], "source": [ "assert len(couriers) == 2_471" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adjust Time Zone" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [], "source": [ "couriers[\"created_at\"] = clean_datetime(couriers[\"created_at\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Deduplicate" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Couriers with the same name either have the same phone number or signed up within a short time window: They are merged." ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [], "source": [ "by = [\"name\"]\n", "\n", "couriers = (\n", " couriers.reset_index()\n", " .set_index(by)\n", " .merge(\n", " (\n", " couriers.reset_index()\n", " .groupby(by)[[\"id\"]]\n", " .max() # merged on the latest courier!\n", " .rename(columns={\"id\": \"merged_on_id\"})\n", " ),\n", " left_index=True,\n", " right_index=True,\n", " )\n", " .reset_index()\n", " .astype({\"name\": \"string\"})\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Keep a dictionary to map the ID's that are merged away to the ones that are kept." ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [], "source": [ "couriers_merger = collections.defaultdict(lambda: np.NaN)\n", "couriers_merger.update(\n", " {\n", " id_: merged_on_id\n", " for _, id_, merged_on_id in couriers[[\"id\", \"merged_on_id\"]].itertuples()\n", " }\n", ")" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [], "source": [ "couriers = (\n", " couriers[couriers[\"id\"] == couriers[\"merged_on_id\"]]\n", " .set_index(\"id\")\n", " .sort_index()[\n", " [\"created_at\", \"vehicle\", \"speed\", \"capacity\", \"pay_per_hour\", \"pay_per_order\"]\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [], "source": [ "assert len(couriers) == 2_469" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Clean Salary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The column `pay_per_hour` defaults to `0` in the database definition. The actual default value is EUR 7,50, which is also the mode in the dataset." ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 2398\n", "750 70\n", "1500 1\n", "Name: pay_per_hour, dtype: int64" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "couriers[\"pay_per_hour\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The column `pay_per_order` defaults to `0` in the database definition. A more realistic value is EUR 2 (i.e., 200 cents), which is the mode in the dataset." ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "200 2146\n", "0 159\n", "300 97\n", "400 29\n", "500 11\n", "2 9\n", "600 8\n", "20000 3\n", "250 3\n", "650 2\n", "1400000 1\n", "1 1\n", "Name: pay_per_order, dtype: int64" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "couriers[\"pay_per_order\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Whenever a `0` appears in `pay_per_order`, the corresponding `pay_per_hour` is `0` in all cases except one, which is the highest paid courier." ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [], "source": [ "assert ((couriers[\"pay_per_order\"] == 0) & (couriers[\"pay_per_hour\"] == 0)).sum() == 158" ] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [], "source": [ "assert ((couriers[\"pay_per_order\"] == 0) & (couriers[\"pay_per_hour\"] > 0)).sum() == 1" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
created_atvehiclespeedcapacitypay_per_hourpay_per_order
id
7212016-03-18 16:56:39bicycle20.4910015000
\n", "
" ], "text/plain": [ " created_at vehicle speed capacity pay_per_hour pay_per_order\n", "id \n", "721 2016-03-18 16:56:39 bicycle 20.49 100 1500 0" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "couriers[(couriers[\"pay_per_order\"] == 0) & (couriers[\"pay_per_hour\"] > 0)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Couriers with `0`s in both columns receive the default payment scheme." ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [], "source": [ "msk_0_pay = (couriers[\"pay_per_hour\"] == 0) & (couriers[\"pay_per_order\"] == 0)\n", "\n", "couriers.loc[msk_0_pay, \"pay_per_hour\"] = 750\n", "couriers.loc[msk_0_pay, \"pay_per_order\"] = 200" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Couriers with a `0` in the `pay_per_hour` column, receive a fixed salary of EUR 7,50." ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [], "source": [ "couriers.loc[couriers[\"pay_per_hour\"] == 0, \"pay_per_hour\"] = 750" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The column `pay_per_order` contains obvious typos that are corrected." ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [], "source": [ "couriers.loc[\n", " couriers[\"pay_per_order\"].isin([1, 2, 20, 2000, 20000]), \"pay_per_order\"\n", "] = 200\n", "couriers.loc[couriers[\"pay_per_order\"] == 1400000, \"pay_per_order\"] = 400" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Distribution of the various `pay_per_hour` / `pay_per_order` combinations." ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({(750, 300): 97,\n", " (750, 200): 2317,\n", " (750, 500): 11,\n", " (750, 400): 30,\n", " (750, 600): 8,\n", " (750, 650): 2,\n", " (1500, 0): 1,\n", " (750, 250): 3})" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "collections.Counter(\n", " (y, z) for (x, y, z) in couriers[[\"pay_per_hour\", \"pay_per_order\"]].itertuples()\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Clean Data" ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
created_atvehiclespeedcapacitypay_per_hourpay_per_order
id
12016-02-21 14:28:41bicycle13.90150750300
22016-02-21 14:28:41bicycle18.47100750300
32016-02-21 14:28:41bicycle19.18100750300
42016-02-21 14:28:41bicycle14.73150750200
52016-02-21 14:28:41bicycle15.20150750200
\n", "
" ], "text/plain": [ " created_at vehicle speed capacity pay_per_hour pay_per_order\n", "id \n", "1 2016-02-21 14:28:41 bicycle 13.90 150 750 300\n", "2 2016-02-21 14:28:41 bicycle 18.47 100 750 300\n", "3 2016-02-21 14:28:41 bicycle 19.18 100 750 300\n", "4 2016-02-21 14:28:41 bicycle 14.73 150 750 200\n", "5 2016-02-21 14:28:41 bicycle 15.20 150 750 200" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "couriers.head()" ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 2469 entries, 1 to 53764\n", "Data columns (total 6 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 created_at 2469 non-null datetime64[ns]\n", " 1 vehicle 2469 non-null string \n", " 2 speed 2469 non-null float64 \n", " 3 capacity 2469 non-null int64 \n", " 4 pay_per_hour 2469 non-null int64 \n", " 5 pay_per_order 2469 non-null int64 \n", "dtypes: datetime64[ns](1), float64(1), int64(3), string(1)\n", "memory usage: 135.0 KB\n" ] } ], "source": [ "couriers.info()" ] }, { "cell_type": "code", "execution_count": 102, "metadata": {}, "outputs": [], "source": [ "assert (\n", " hashlib.sha256(couriers.to_json().encode()).hexdigest()\n", " == \"a1059e93095842120a58c4f74145bb6a14aeb94f47a0d77043896c70f4772afe\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Orders" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Raw Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The order related data is spread over many different tables in the original database. Also, some data is not even normalized. The following SQL query puts all the data into one big relation that is cleaned further below." ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [], "source": [ "orders = pd.read_sql_query(\n", " f\"\"\"\n", " SELECT\n", " orders.id,\n", " deliveries.id AS delivery_id,\n", " MD5(CONCAT(orders.email, orders.phone_number)) as customer_id, -- anonymize the customer data\n", " orders.order_placed_at AS placed_at,\n", " CASE\n", " WHEN orders.preorder IS FALSE\n", " THEN TRUE\n", " ELSE FALSE \n", " END AS ad_hoc,\n", " CASE\n", " WHEN orders.preorder is TRUE\n", " THEN orders.scheduled_dropoff_at\n", " ELSE NULL\n", " END AS scheduled_delivery_at,\n", " deliveries.status,\n", " cancellations.cancelled_at,\n", " orders.featured_business_id as restaurant_id,\n", " orders.order_sent_at AS restaurant_notified_at,\n", " orders.order_received_at AS restaurant_confirmed_at,\n", " orders.estimated_prep_duration,\n", " orders.estimated_prep_buffer,\n", " deliveries.courier_id,\n", " deliveries.courier_dispatched_at AS dispatch_at,\n", " deliveries.courier_notified_at,\n", " deliveries.courier_accepted_at,\n", " courier_no_accept_confirmed.issue AS courier_no_accept_confirmed_issue,\n", " orders.pickup_address_id,\n", " orders.scheduled_pickup_at,\n", " deliveries.courier_picked_up_at AS pickup_at,\n", " left_pickups.left_pickup_at,\n", " courier_late_at_pickup.issue AS courier_late_at_pickup_issue,\n", " courier_waited_at_pickup.issue AS courier_waited_at_pickup_issue,\n", " courier_no_pickup_confirmed.issue AS courier_no_pickup_confirmed_issue,\n", " orders.dropoff_address_id AS delivery_address_id,\n", " deliveries.first_estimated_dropoff_at AS first_estimated_delivery_at,\n", " deliveries.courier_dropped_off_at AS delivery_at,\n", " courier_waited_at_delivery.issue AS courier_waited_at_delivery_issue,\n", " courier_no_delivery_confirmed.issue AS courier_no_delivery_confirmed_issue,\n", " orders.utilization,\n", " items_totals.sub_total,\n", " orders.delivery_fee,\n", " orders.total,\n", " deliveries.delivery_distance AS logged_delivery_distance,\n", " deliveries.courier_avg_speed AS logged_avg_courier_speed,\n", " CAST(deliveries.courier_avg_speed_distance AS INTEGER) AS logged_avg_courier_speed_distance,\n", " delivery_timings.accepting_time AS logged_accepting_time,\n", " delivery_timings.courier_reaction_time AS logged_reaction_time,\n", " delivery_timings.to_pickup_time AS logged_to_pickup_time,\n", " delivery_timings.expected_wait_pickup_time AS expected_wait_pickup_time,\n", " delivery_timings.wait_pickup_time AS logged_wait_pickup_time,\n", " delivery_timings.pickup_time AS logged_pickup_time,\n", " delivery_timings.courier_late AS logged_courier_late_time,\n", " delivery_timings.vendor_late AS logged_restaurant_late_time,\n", " delivery_timings.to_dropoff_time AS logged_to_delivery_time,\n", " delivery_timings.expected_dropoff_time AS expected_delivery_time,\n", " delivery_timings.dropoff_time AS logged_delivery_time,\n", " delivery_timings.delivery_late AS logged_delivery_late_time,\n", " delivery_timings.total_time AS logged_total_time,\n", " delivery_timings.confirmed_total_time AS logged_confirmed_total_time\n", " FROM\n", " {config.ORIGINAL_SCHEMA}.orders\n", " LEFT OUTER JOIN\n", " {config.ORIGINAL_SCHEMA}.deliveries ON orders.id = deliveries.order_id\n", " LEFT OUTER JOIN\n", " (\n", " SELECT\n", " order_id,\n", " CAST(100 * SUM(price) AS INTEGER) AS sub_total\n", " FROM\n", " {config.ORIGINAL_SCHEMA}.order_records\n", " GROUP BY\n", " order_id\n", " ) AS items_totals ON orders.id = items_totals.order_id\n", " LEFT OUTER JOIN\n", " {config.ORIGINAL_SCHEMA}.delivery_timings ON deliveries.id = delivery_timings.delivery_id\n", " LEFT OUTER JOIN (\n", " SELECT\n", " delivery_id,\n", " MAX(notes) AS issue\n", " FROM\n", " {config.ORIGINAL_SCHEMA}.issues\n", " WHERE\n", " type = 'DispatchIssue'\n", " AND\n", " category = 'no_courier_interaction'\n", " GROUP BY\n", " delivery_id\n", " ) AS courier_no_accept_confirmed ON deliveries.id = courier_no_accept_confirmed.delivery_id\n", " LEFT OUTER JOIN (\n", " SELECT\n", " delivery_id,\n", " MAX(notes) AS issue\n", " FROM\n", " {config.ORIGINAL_SCHEMA}.issues\n", " WHERE\n", " type = 'PickupIssue'\n", " AND\n", " category = 'waiting'\n", " GROUP BY\n", " delivery_id\n", " ) AS courier_waited_at_pickup ON deliveries.id = courier_waited_at_pickup.delivery_id\n", " LEFT OUTER JOIN (\n", " SELECT\n", " delivery_id,\n", " MAX(notes) AS issue\n", " FROM\n", " {config.ORIGINAL_SCHEMA}.issues\n", " WHERE\n", " type = 'PickupIssue'\n", " AND\n", " category = 'late'\n", " GROUP BY\n", " delivery_id\n", " ) AS courier_late_at_pickup ON deliveries.id = courier_late_at_pickup.delivery_id\n", " LEFT OUTER JOIN (\n", " SELECT\n", " delivery_id,\n", " MAX(notes) AS issue\n", " FROM\n", " {config.ORIGINAL_SCHEMA}.issues\n", " WHERE\n", " type = 'PickupIssue'\n", " AND\n", " category = 'no_courier_interaction'\n", " GROUP BY\n", " delivery_id\n", " ) AS courier_no_pickup_confirmed ON deliveries.id = courier_no_pickup_confirmed.delivery_id\n", " LEFT OUTER JOIN (\n", " SELECT\n", " delivery_id,\n", " MAX(notes) AS issue\n", " FROM\n", " {config.ORIGINAL_SCHEMA}.issues\n", " WHERE\n", " type = 'DropoffIssue'\n", " AND\n", " category = 'waiting'\n", " GROUP BY\n", " delivery_id\n", " ) AS courier_waited_at_delivery ON deliveries.id = courier_waited_at_delivery.delivery_id\n", " LEFT OUTER JOIN (\n", " SELECT\n", " delivery_id,\n", " MAX(notes) AS issue\n", " FROM\n", " {config.ORIGINAL_SCHEMA}.issues\n", " WHERE\n", " type = 'DropoffIssue'\n", " AND\n", " category = 'late'\n", " GROUP BY\n", " delivery_id\n", " ) AS courier_late_at_delivery ON deliveries.id = courier_late_at_delivery.delivery_id\n", " LEFT OUTER JOIN (\n", " SELECT\n", " delivery_id,\n", " MAX(notes) AS issue\n", " FROM\n", " {config.ORIGINAL_SCHEMA}.issues\n", " WHERE\n", " type = 'DropoffIssue'\n", " AND\n", " category = 'no_courier_interaction'\n", " GROUP BY\n", " delivery_id\n", " ) AS courier_no_delivery_confirmed ON deliveries.id = courier_no_delivery_confirmed.delivery_id\n", " LEFT OUTER JOIN (\n", " SELECT\n", " delivery_id,\n", " courier_id,\n", " MAX(created_at) AS left_pickup_at\n", " FROM (\n", " SELECT\n", " delivery_id,\n", " (metadata -> 'courier_id')::TEXT::INTEGER AS courier_id,\n", " created_at\n", " FROM\n", " {config.ORIGINAL_SCHEMA}.delivery_transitions\n", " WHERE\n", " to_state = 'left_pickup'\n", " ) AS left_pickups\n", " GROUP BY\n", " delivery_id,\n", " courier_id\n", " ) AS left_pickups ON deliveries.id = left_pickups.delivery_id AND deliveries.courier_id = left_pickups.courier_id\n", " LEFT OUTER JOIN (\n", " SELECT\n", " delivery_id,\n", " MAX(created_at) AS cancelled_at\n", " FROM\n", " {config.ORIGINAL_SCHEMA}.delivery_transitions\n", " WHERE\n", " to_state = 'cancelled'\n", " GROUP BY\n", " delivery_id\n", " ) AS cancellations ON deliveries.id = cancellations.delivery_id\n", " WHERE\n", " orders.featured_business_id IN (\n", " SELECT -- Subquery based off the restaurants query above!\n", " id\n", " FROM\n", " {config.ORIGINAL_SCHEMA}.businesses\n", " WHERE\n", " address_id IN (\n", " SELECT id FROM {config.ORIGINAL_SCHEMA}.addresses WHERE city_id IN %(city_ids)s\n", " )\n", " AND\n", " created_at < '{config.CUTOFF_DAY}'\n", " )\n", " AND\n", " scheduled_dropoff_at < '{config.CUTOFF_DAY}'\n", " AND\n", " deliveries.is_primary IS TRUE\n", " ORDER BY\n", " orders.id\n", " \"\"\",\n", " con=connection,\n", " index_col=\"id\",\n", " params={\"city_ids\": city_ids},\n", " parse_dates=[\n", " \"placed_at\",\n", " \"scheduled_delivery_at\",\n", " \"cancelled_at\",\n", " \"restaurant_notified_at\",\n", " \"restaurant_confirmed_at\",\n", " \"dispatch_at\",\n", " \"courier_notified_at\",\n", " \"courier_accepted_at\",\n", " \"pickup_at\",\n", " \"left_pickup_at\",\n", " \"first_estimated_delivery_at\",\n", " \"delivery_at\",\n", " ],\n", ")" ] }, { "cell_type": "code", "execution_count": 104, "metadata": {}, "outputs": [], "source": [ "orders = orders.astype(\n", " {\n", " \"customer_id\": \"string\",\n", " \"status\": \"string\",\n", " \"estimated_prep_duration\": \"Int64\",\n", " \"courier_id\": \"Int64\",\n", " \"courier_no_accept_confirmed_issue\": \"string\",\n", " \"courier_late_at_pickup_issue\": \"string\",\n", " \"courier_waited_at_pickup_issue\": \"string\",\n", " \"courier_no_pickup_confirmed_issue\": \"string\",\n", " \"courier_waited_at_delivery_issue\": \"string\",\n", " \"courier_no_delivery_confirmed_issue\": \"string\",\n", " \"logged_avg_courier_speed_distance\": \"Int64\",\n", " \"logged_accepting_time\": \"Int64\",\n", " \"logged_reaction_time\": \"Int64\",\n", " \"logged_to_pickup_time\": \"Int64\",\n", " \"expected_wait_pickup_time\": \"Int64\",\n", " \"logged_wait_pickup_time\": \"Int64\",\n", " \"logged_pickup_time\": \"Int64\",\n", " \"logged_courier_late_time\": \"Int64\",\n", " \"logged_restaurant_late_time\": \"Int64\",\n", " \"logged_to_delivery_time\": \"Int64\",\n", " \"expected_delivery_time\": \"Int64\",\n", " \"logged_delivery_time\": \"Int64\",\n", " \"logged_delivery_late_time\": \"Int64\",\n", " \"logged_total_time\": \"Int64\",\n", " \"logged_confirmed_total_time\": \"Int64\",\n", " }\n", ")" ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
delivery_idcustomer_idplaced_atad_hocscheduled_delivery_atstatuscancelled_atrestaurant_idrestaurant_notified_atrestaurant_confirmed_atestimated_prep_durationestimated_prep_buffercourier_iddispatch_atcourier_notified_atcourier_accepted_atcourier_no_accept_confirmed_issuepickup_address_idscheduled_pickup_atpickup_atleft_pickup_atcourier_late_at_pickup_issuecourier_waited_at_pickup_issuecourier_no_pickup_confirmed_issuedelivery_address_idfirst_estimated_delivery_atdelivery_atcourier_waited_at_delivery_issuecourier_no_delivery_confirmed_issueutilizationsub_totaldelivery_feetotallogged_delivery_distancelogged_avg_courier_speedlogged_avg_courier_speed_distancelogged_accepting_timelogged_reaction_timelogged_to_pickup_timeexpected_wait_pickup_timelogged_wait_pickup_timelogged_pickup_timelogged_courier_late_timelogged_restaurant_late_timelogged_to_delivery_timeexpected_delivery_timelogged_delivery_timelogged_delivery_late_timelogged_total_timelogged_confirmed_total_time
id
1487377f172baefbdf8550eb2c6bcf667e430902016-02-22 09:42:01False2016-02-22 11:30:00cancelled2016-10-18 07:52:45.1362431NaTNaT<NA>0<NA>NaTNaTNaT<NA>22016-02-22 11:15:00.000000NaTNaT<NA><NA><NA>3NaTNaT<NA><NA>1512502501500671NaN<NA><NA><NA><NA><NA><NA><NA><NA><NA><NA><NA><NA><NA><NA><NA>
2487433cc2f0563675b4b2e9b985bfa4d7a157f2016-02-22 09:40:09False2016-02-22 11:00:00completedNaT22016-02-22 10:31:03.6648572016-02-22 10:46:43.348671<NA>0962016-02-22 10:37:00.6832712016-02-22 10:36:04.5322542016-02-22 10:37:00.683188Courier did not hit \"Accept\"42016-02-22 10:41:03.6647932016-02-22 10:54:00.886417NaT<NA><NA>Courier did not hit \"Picked up\"52016-02-23 22:55:232016-02-22 11:13:23.772580<NA><NA>19155025018002281NaN<NA>2916<NA>187<NA>6912567691745479126494257345595
3487444c09807c923d356a43fac084543de664b2016-02-22 09:56:16False2016-02-22 11:00:00completedNaT32016-02-22 10:31:02.5313502016-02-22 10:42:00.334529<NA>0572016-02-22 10:31:30.9634232016-02-22 10:31:03.5876342016-02-22 10:31:30.963342<NA>62016-02-22 10:41:02.5312842016-02-22 10:54:32.122854NaT<NA><NA><NA>72016-02-23 22:25:472016-02-22 11:06:07.677724<NA><NA>25200025022502449NaN<NA>28<NA>1381<NA><NA>000522382427046442884192
4470503440d641745a8707a7afd5565adc99be22016-02-22 10:11:46False2016-02-22 11:30:00completedNaT42016-02-22 10:32:04.8106332016-02-22 10:48:01.402101<NA>0392016-02-22 10:50:29.1418112016-02-22 10:50:03.6101692016-02-22 10:50:29.141722<NA>82016-02-22 11:17:04.8105262016-02-22 11:11:26.166161NaT<NA><NA><NA>92016-02-22 11:25:502016-02-22 11:18:43.289947<NA><NA>100980025010050196NaN<NA>26<NA>439<NA>-180818-1061-338874694350-67640174017
54874390e7b3e4012a00a56b5bf1dee1bc6c1c72016-02-22 10:35:58False2016-02-22 11:30:00completedNaT52016-02-22 10:50:03.8663272016-02-22 10:50:19.473584<NA>01282016-02-22 10:50:28.1536632016-02-22 10:50:04.8308552016-02-22 10:50:28.153559<NA>102016-02-22 11:10:03.8662562016-02-22 11:05:19.667624NaT<NA><NA><NA>112016-02-22 11:28:192016-02-22 11:20:44.147345<NA><NA>45360025038502584NaN<NA>24<NA>284<NA>-440607-735-2846943242231-55526862686
\n", "
" ], "text/plain": [ " delivery_id customer_id placed_at ad_hoc \\\n", "id \n", "1 487377 f172baefbdf8550eb2c6bcf667e43090 2016-02-22 09:42:01 False \n", "2 487433 cc2f0563675b4b2e9b985bfa4d7a157f 2016-02-22 09:40:09 False \n", "3 487444 c09807c923d356a43fac084543de664b 2016-02-22 09:56:16 False \n", "4 470503 440d641745a8707a7afd5565adc99be2 2016-02-22 10:11:46 False \n", "5 487439 0e7b3e4012a00a56b5bf1dee1bc6c1c7 2016-02-22 10:35:58 False \n", "\n", " scheduled_delivery_at status cancelled_at restaurant_id \\\n", "id \n", "1 2016-02-22 11:30:00 cancelled 2016-10-18 07:52:45.136243 1 \n", "2 2016-02-22 11:00:00 completed NaT 2 \n", "3 2016-02-22 11:00:00 completed NaT 3 \n", "4 2016-02-22 11:30:00 completed NaT 4 \n", "5 2016-02-22 11:30:00 completed NaT 5 \n", "\n", " restaurant_notified_at restaurant_confirmed_at \\\n", "id \n", "1 NaT NaT \n", "2 2016-02-22 10:31:03.664857 2016-02-22 10:46:43.348671 \n", "3 2016-02-22 10:31:02.531350 2016-02-22 10:42:00.334529 \n", "4 2016-02-22 10:32:04.810633 2016-02-22 10:48:01.402101 \n", "5 2016-02-22 10:50:03.866327 2016-02-22 10:50:19.473584 \n", "\n", " estimated_prep_duration estimated_prep_buffer courier_id \\\n", "id \n", "1 0 \n", "2 0 96 \n", "3 0 57 \n", "4 0 39 \n", "5 0 128 \n", "\n", " dispatch_at courier_notified_at \\\n", "id \n", "1 NaT NaT \n", "2 2016-02-22 10:37:00.683271 2016-02-22 10:36:04.532254 \n", "3 2016-02-22 10:31:30.963423 2016-02-22 10:31:03.587634 \n", "4 2016-02-22 10:50:29.141811 2016-02-22 10:50:03.610169 \n", "5 2016-02-22 10:50:28.153663 2016-02-22 10:50:04.830855 \n", "\n", " courier_accepted_at courier_no_accept_confirmed_issue \\\n", "id \n", "1 NaT \n", "2 2016-02-22 10:37:00.683188 Courier did not hit \"Accept\" \n", "3 2016-02-22 10:31:30.963342 \n", "4 2016-02-22 10:50:29.141722 \n", "5 2016-02-22 10:50:28.153559 \n", "\n", " pickup_address_id scheduled_pickup_at pickup_at \\\n", "id \n", "1 2 2016-02-22 11:15:00.000000 NaT \n", "2 4 2016-02-22 10:41:03.664793 2016-02-22 10:54:00.886417 \n", "3 6 2016-02-22 10:41:02.531284 2016-02-22 10:54:32.122854 \n", "4 8 2016-02-22 11:17:04.810526 2016-02-22 11:11:26.166161 \n", "5 10 2016-02-22 11:10:03.866256 2016-02-22 11:05:19.667624 \n", "\n", " left_pickup_at courier_late_at_pickup_issue courier_waited_at_pickup_issue \\\n", "id \n", "1 NaT \n", "2 NaT \n", "3 NaT \n", "4 NaT \n", "5 NaT \n", "\n", " courier_no_pickup_confirmed_issue delivery_address_id \\\n", "id \n", "1 3 \n", "2 Courier did not hit \"Picked up\" 5 \n", "3 7 \n", "4 9 \n", "5 11 \n", "\n", " first_estimated_delivery_at delivery_at \\\n", "id \n", "1 NaT NaT \n", "2 2016-02-23 22:55:23 2016-02-22 11:13:23.772580 \n", "3 2016-02-23 22:25:47 2016-02-22 11:06:07.677724 \n", "4 2016-02-22 11:25:50 2016-02-22 11:18:43.289947 \n", "5 2016-02-22 11:28:19 2016-02-22 11:20:44.147345 \n", "\n", " courier_waited_at_delivery_issue courier_no_delivery_confirmed_issue \\\n", "id \n", "1 \n", "2 \n", "3 \n", "4 \n", "5 \n", "\n", " utilization sub_total delivery_fee total logged_delivery_distance \\\n", "id \n", "1 15 1250 250 1500 671 \n", "2 19 1550 250 1800 2281 \n", "3 25 2000 250 2250 2449 \n", "4 100 9800 250 10050 196 \n", "5 45 3600 250 3850 2584 \n", "\n", " logged_avg_courier_speed logged_avg_courier_speed_distance \\\n", "id \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "5 NaN \n", "\n", " logged_accepting_time logged_reaction_time logged_to_pickup_time \\\n", "id \n", "1 \n", "2 2916 187 \n", "3 28 1381 \n", "4 26 439 \n", "5 24 284 \n", "\n", " expected_wait_pickup_time logged_wait_pickup_time logged_pickup_time \\\n", "id \n", "1 \n", "2 69 125 \n", "3 0 \n", "4 -180 818 \n", "5 -440 607 \n", "\n", " logged_courier_late_time logged_restaurant_late_time \\\n", "id \n", "1 \n", "2 67 69 \n", "3 0 0 \n", "4 -1061 -338 \n", "5 -735 -284 \n", "\n", " logged_to_delivery_time expected_delivery_time logged_delivery_time \\\n", "id \n", "1 \n", "2 1745 4791 264 \n", "3 522 3824 270 \n", "4 87 4694 350 \n", "5 694 3242 231 \n", "\n", " logged_delivery_late_time logged_total_time logged_confirmed_total_time \n", "id \n", "1 \n", "2 942 5734 5595 \n", "3 464 4288 4192 \n", "4 -676 4017 4017 \n", "5 -555 2686 2686 " ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "orders.head()" ] }, { "cell_type": "code", "execution_count": 106, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 661314 entries, 1 to 688690\n", "Data columns (total 50 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 delivery_id 661314 non-null int64 \n", " 1 customer_id 661314 non-null string \n", " 2 placed_at 661314 non-null datetime64[ns]\n", " 3 ad_hoc 661314 non-null bool \n", " 4 scheduled_delivery_at 83579 non-null datetime64[ns]\n", " 5 status 661314 non-null string \n", " 6 cancelled_at 23665 non-null datetime64[ns]\n", " 7 restaurant_id 661314 non-null int64 \n", " 8 restaurant_notified_at 656311 non-null datetime64[ns]\n", " 9 restaurant_confirmed_at 639636 non-null datetime64[ns]\n", " 10 estimated_prep_duration 552317 non-null Int64 \n", " 11 estimated_prep_buffer 661314 non-null int64 \n", " 12 courier_id 648850 non-null Int64 \n", " 13 dispatch_at 656923 non-null datetime64[ns]\n", " 14 courier_notified_at 430500 non-null datetime64[ns]\n", " 15 courier_accepted_at 650964 non-null datetime64[ns]\n", " 16 courier_no_accept_confirmed_issue 40105 non-null string \n", " 17 pickup_address_id 661314 non-null int64 \n", " 18 scheduled_pickup_at 656727 non-null datetime64[ns]\n", " 19 pickup_at 638249 non-null datetime64[ns]\n", " 20 left_pickup_at 312171 non-null datetime64[ns]\n", " 21 courier_late_at_pickup_issue 67499 non-null string \n", " 22 courier_waited_at_pickup_issue 29469 non-null string \n", " 23 courier_no_pickup_confirmed_issue 35394 non-null string \n", " 24 delivery_address_id 661314 non-null int64 \n", " 25 first_estimated_delivery_at 657796 non-null datetime64[ns]\n", " 26 delivery_at 637808 non-null datetime64[ns]\n", " 27 courier_waited_at_delivery_issue 32415 non-null string \n", " 28 courier_no_delivery_confirmed_issue 13884 non-null string \n", " 29 utilization 661314 non-null int64 \n", " 30 sub_total 661314 non-null int64 \n", " 31 delivery_fee 661314 non-null int64 \n", " 32 total 661314 non-null int64 \n", " 33 logged_delivery_distance 661314 non-null int64 \n", " 34 logged_avg_courier_speed 213841 non-null float64 \n", " 35 logged_avg_courier_speed_distance 213841 non-null Int64 \n", " 36 logged_accepting_time 637808 non-null Int64 \n", " 37 logged_reaction_time 410349 non-null Int64 \n", " 38 logged_to_pickup_time 637808 non-null Int64 \n", " 39 expected_wait_pickup_time 532387 non-null Int64 \n", " 40 logged_wait_pickup_time 497916 non-null Int64 \n", " 41 logged_pickup_time 637808 non-null Int64 \n", " 42 logged_courier_late_time 503505 non-null Int64 \n", " 43 logged_restaurant_late_time 503505 non-null Int64 \n", " 44 logged_to_delivery_time 637808 non-null Int64 \n", " 45 expected_delivery_time 637808 non-null Int64 \n", " 46 logged_delivery_time 637808 non-null Int64 \n", " 47 logged_delivery_late_time 637808 non-null Int64 \n", " 48 logged_total_time 637808 non-null Int64 \n", " 49 logged_confirmed_total_time 637808 non-null Int64 \n", "dtypes: Int64(17), bool(1), datetime64[ns](13), float64(1), int64(10), string(8)\n", "memory usage: 263.6 MB\n" ] } ], "source": [ "orders.info()" ] }, { "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [], "source": [ "assert len(orders) == 661_314" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adjust Time Zones" ] }, { "cell_type": "code", "execution_count": 108, "metadata": {}, "outputs": [], "source": [ "for column in [\n", " \"placed_at\",\n", " \"scheduled_delivery_at\",\n", " \"cancelled_at\",\n", " \"restaurant_notified_at\",\n", " \"restaurant_confirmed_at\",\n", " \"dispatch_at\",\n", " \"courier_notified_at\",\n", " \"courier_accepted_at\",\n", " \"scheduled_pickup_at\",\n", " \"pickup_at\",\n", " \"left_pickup_at\",\n", " \"first_estimated_delivery_at\",\n", " \"delivery_at\",\n", "]:\n", " orders[column] = clean_datetime(orders[column])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use Merged Addresses" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "About 0.02 % of the orders belong to discarded addresses and are discarded also." ] }, { "cell_type": "code", "execution_count": 109, "metadata": {}, "outputs": [], "source": [ "orders[\"pickup_address_id\"] = orders[\"pickup_address_id\"].map(address_merger)\n", "orders[\"delivery_address_id\"] = orders[\"delivery_address_id\"].map(address_merger)\n", "msk = orders[\"pickup_address_id\"].isnull() | orders[\"delivery_address_id\"].isnull()\n", "orders = orders[~msk].astype({\"pickup_address_id\": int, \"delivery_address_id\": int,})\n", "\n", "assert msk.sum() == 160" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use Merged Restaurants" ] }, { "cell_type": "code", "execution_count": 110, "metadata": {}, "outputs": [], "source": [ "orders[\"restaurant_id\"] = orders[\"restaurant_id\"].map(restaurants_merger)" ] }, { "cell_type": "code", "execution_count": 111, "metadata": {}, "outputs": [], "source": [ "assert not orders[\"restaurant_id\"].isnull().any()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use Merged Couriers" ] }, { "cell_type": "code", "execution_count": 112, "metadata": {}, "outputs": [], "source": [ "orders[\"courier_id\"] = orders[\"courier_id\"].map(couriers_merger).astype(\"Int64\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Verify that the couriers' IDs are the same in `couriers` and `orders`." ] }, { "cell_type": "code", "execution_count": 113, "metadata": {}, "outputs": [], "source": [ "assert set(couriers.index) == set(\n", " orders.loc[orders[\"courier_id\"].notnull(), \"courier_id\"].unique()\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Clean User IDs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Convert the MD5 hashed emails and phone numbers into integer ID's." ] }, { "cell_type": "code", "execution_count": 114, "metadata": {}, "outputs": [], "source": [ "orders[\"customer_id\"] = (\n", " orders[\"customer_id\"]\n", " .map({y: x for (x, y) in enumerate(orders[\"customer_id\"].unique(), start=1)})\n", " .astype({\"customer_id\": int})\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ad-hoc vs. Scheduled Orders" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ad-hoc orders never have a `scheduled_delivery_at` value set, and scheduled orders always have it set." ] }, { "cell_type": "code", "execution_count": 115, "metadata": {}, "outputs": [], "source": [ "assert not (\n", " (orders[\"ad_hoc\"] == True) & orders[\"scheduled_delivery_at\"].notnull()\n", ").any()\n", "assert not (\n", " (orders[\"ad_hoc\"] == False) & orders[\"scheduled_delivery_at\"].isnull()\n", ").any()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For all adjusted timestamps we add `*_corrected` columns indicating if a correction is made in the following." ] }, { "cell_type": "code", "execution_count": 116, "metadata": {}, "outputs": [], "source": [ "for column in [\n", " \"scheduled_delivery_at\",\n", " \"cancelled_at\",\n", " \"restaurant_notified_at\",\n", " \"restaurant_confirmed_at\",\n", " \"estimated_prep_duration\",\n", " \"dispatch_at\",\n", " \"courier_notified_at\",\n", " \"courier_accepted_at\",\n", " \"pickup_at\",\n", " \"left_pickup_at\",\n", " \"delivery_at\",\n", "]:\n", " orders[column + \"_corrected\"] = False\n", " orders.loc[orders[column].isnull(), column + \"_corrected\"] = pd.NA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some customers managed to place scheduled orders for the past. These are converted into ad-hoc orders." ] }, { "cell_type": "code", "execution_count": 117, "metadata": {}, "outputs": [], "source": [ "msk = orders[\"scheduled_delivery_at\"] < orders[\"placed_at\"]\n", "orders.loc[msk, \"ad_hoc\"] = True\n", "orders.loc[msk, \"scheduled_delivery_at\"] = pd.NaT\n", "orders.loc[msk, \"scheduled_delivery_at_corrected\"] = True\n", "\n", "assert msk.sum() == 11" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Orders scheduled within the next 30 minutes are treated as ad-hoc orders. With the median fulfillment time of ad-hoc orders being 34 minutes, it is absolutely unrealistic to fulfill such a scheduled order on time. This should not influence the KPIs in a bad way." ] }, { "cell_type": "code", "execution_count": 118, "metadata": {}, "outputs": [], "source": [ "msk = (orders[\"ad_hoc\"] == False) & (\n", " orders[\"scheduled_delivery_at\"] - orders[\"placed_at\"]\n", " < datetime.timedelta(minutes=30)\n", ")\n", "orders.loc[msk, \"ad_hoc\"] = True\n", "orders.loc[msk, \"scheduled_delivery_at\"] = pd.NaT\n", "orders.loc[msk, \"scheduled_delivery_at_corrected\"] = True\n", "\n", "assert msk.sum() == 3_267" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For scheduled orders, `scheduled_delivery_at` is mostly set to quarters of an hour. The seconds part is always `0`." ] }, { "cell_type": "code", "execution_count": 119, "metadata": {}, "outputs": [], "source": [ "assert not (\n", " (orders[\"ad_hoc\"] == False) & (orders[\"scheduled_delivery_at\"].dt.second != 0)\n", ").any()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If a customer managed to enter something other than a quarter of an hour as `scheduled_delivery_at`, we adjust that." ] }, { "cell_type": "code", "execution_count": 120, "metadata": {}, "outputs": [], "source": [ "msk = (orders[\"ad_hoc\"] == False) & (\n", " orders[\"scheduled_delivery_at\"].dt.minute % 15 != 0\n", ")\n", "round_down = msk & (orders[\"scheduled_delivery_at\"].dt.minute % 15 < 8)\n", "orders.loc[round_down, \"scheduled_delivery_at\"] = orders.loc[\n", " round_down, \"scheduled_delivery_at\"\n", "] - (orders.loc[round_down, \"scheduled_delivery_at\"].dt.minute % 15).map(\n", " lambda m: datetime.timedelta(minutes=m)\n", ")\n", "round_up = msk & (orders[\"scheduled_delivery_at\"].dt.minute % 15 >= 8)\n", "orders.loc[round_up, \"scheduled_delivery_at\"] = orders.loc[\n", " round_up, \"scheduled_delivery_at\"\n", "] + (orders.loc[round_up, \"scheduled_delivery_at\"].dt.minute % 15).map(\n", " lambda m: datetime.timedelta(minutes=(15 - m))\n", ")\n", "orders.loc[msk, \"scheduled_delivery_at_corrected\"] = True\n", "\n", "assert msk.sum() == 6" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Timestamps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All timestamps in `orders` must occur in a strict sequence (i.e., order) according to the delivery process. A tiny fraction of the orders has timestamps that do not comply with that and are adjusted in the following." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`placed_at` must always be the earliest of all timestamps." ] }, { "cell_type": "code", "execution_count": 121, "metadata": {}, "outputs": [], "source": [ "for column in [\n", " \"scheduled_delivery_at\",\n", " \"cancelled_at\",\n", " \"restaurant_notified_at\",\n", " \"restaurant_confirmed_at\",\n", " \"dispatch_at\",\n", " \"courier_notified_at\",\n", " \"courier_accepted_at\",\n", " \"pickup_at\",\n", " \"left_pickup_at\",\n", " \"first_estimated_delivery_at\",\n", " \"delivery_at\",\n", "]:\n", " assert not (orders[\"placed_at\"] >= orders[column]).any()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rarely, a restaurant confirmed an order before it was notified about it. We keep `restaurant_confirmed_at` in these cases." ] }, { "cell_type": "code", "execution_count": 122, "metadata": {}, "outputs": [], "source": [ "msk = orders[\"restaurant_notified_at\"] >= orders[\"restaurant_confirmed_at\"]\n", "orders.loc[msk, \"restaurant_notified_at\"] = pd.NaT\n", "orders.loc[msk, \"restaurant_notified_at_corrected\"] = True\n", "\n", "assert msk.sum() == 47" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Whenever `restaurant_notified_at` or `restaurant_confirmed_at` is later than `pickup_at`, we discard the values." ] }, { "cell_type": "code", "execution_count": 123, "metadata": {}, "outputs": [], "source": [ "msk = orders[\"restaurant_notified_at\"] >= orders[\"pickup_at\"]\n", "orders.loc[msk, \"restaurant_notified_at\"] = pd.NaT\n", "orders.loc[msk, \"restaurant_notified_at_corrected\"] = True\n", "\n", "assert msk.sum() == 73" ] }, { "cell_type": "code", "execution_count": 124, "metadata": {}, "outputs": [], "source": [ "msk = orders[\"restaurant_confirmed_at\"] >= orders[\"pickup_at\"]\n", "orders.loc[msk, \"restaurant_confirmed_at\"] = pd.NaT\n", "orders.loc[msk, \"restaurant_confirmed_at_corrected\"] = True\n", "\n", "assert msk.sum() == 2_001" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If a courier forgot to confirm the pickup, `pickup_at` and `delivery_at` are the same." ] }, { "cell_type": "code", "execution_count": 125, "metadata": {}, "outputs": [], "source": [ "msk = orders[\"delivery_at\"] == orders[\"pickup_at\"]\n", "orders.loc[msk, \"pickup_at\"] = pd.NaT\n", "orders.loc[msk, \"pickup_at_corrected\"] = True\n", "\n", "assert msk.sum() == 16" ] }, { "cell_type": "code", "execution_count": 126, "metadata": {}, "outputs": [], "source": [ "msk = orders[\"delivery_at\"] == orders[\"left_pickup_at\"]\n", "orders.loc[msk, \"left_pickup_at\"] = pd.NaT\n", "orders.loc[msk, \"left_pickup_at_corrected\"] = True\n", "\n", "assert msk.sum() == 15" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`delivery_at` must be the latest of all dispatch-related timestamps." ] }, { "cell_type": "code", "execution_count": 127, "metadata": {}, "outputs": [], "source": [ "for column in [\n", " \"dispatch_at\",\n", " \"courier_notified_at\",\n", " \"courier_accepted_at\",\n", " \"pickup_at\",\n", " \"left_pickup_at\",\n", "]:\n", " assert not (orders[\"delivery_at\"] <= orders[column]).any()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In about 14,500 cases the `left_pickup_at` lies before or on `pickup_at`. This only affects orders between September 6 and October 17. We discard these timestamps." ] }, { "cell_type": "code", "execution_count": 128, "metadata": {}, "outputs": [], "source": [ "msk = orders[\"left_pickup_at\"] < orders[\"pickup_at\"]\n", "orders.loc[msk, \"left_pickup_at\"] = pd.NaT\n", "orders.loc[msk, \"left_pickup_at_corrected\"] = True\n", "\n", "assert msk.sum() == 14_013\n", "assert orders.loc[msk, \"placed_at\"].min().date() == datetime.date(2016, 9, 6)\n", "assert orders.loc[msk, \"placed_at\"].max().date() == datetime.date(2016, 10, 17)" ] }, { "cell_type": "code", "execution_count": 129, "metadata": {}, "outputs": [], "source": [ "msk = orders[\"left_pickup_at\"] == orders[\"pickup_at\"]\n", "orders.loc[msk, \"left_pickup_at\"] = pd.NaT\n", "orders.loc[msk, \"left_pickup_at_corrected\"] = True\n", "\n", "assert msk.sum() == 496" ] }, { "cell_type": "code", "execution_count": 130, "metadata": {}, "outputs": [], "source": [ "for column in [\n", " \"dispatch_at\",\n", " \"courier_notified_at\",\n", " \"courier_accepted_at\",\n", " \"pickup_at\",\n", "]:\n", " assert not (orders[\"left_pickup_at\"] <= orders[column]).any()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rarely, `pickup_at` is earlier than or equal to `dispatch_at`, `courier_notified_at`, or `courier_accepted_at`. They are discarded." ] }, { "cell_type": "code", "execution_count": 131, "metadata": {}, "outputs": [], "source": [ "msk = orders[\"pickup_at\"] <= orders[\"dispatch_at\"]\n", "orders.loc[msk, \"dispatch_at\"] = pd.NaT\n", "orders.loc[msk, \"dispatch_at_corrected\"] = True\n", "\n", "assert msk.sum() == 15" ] }, { "cell_type": "code", "execution_count": 132, "metadata": {}, "outputs": [], "source": [ "msk = orders[\"pickup_at\"] <= orders[\"courier_notified_at\"]\n", "orders.loc[msk, \"courier_notified_at\"] = pd.NaT\n", "orders.loc[msk, \"courier_notified_at_corrected\"] = True\n", "\n", "assert msk.sum() == 8\n", "assert set(orders.loc[msk, \"status\"].unique()) == set([\"cancelled\"])" ] }, { "cell_type": "code", "execution_count": 133, "metadata": {}, "outputs": [], "source": [ "msk = orders[\"pickup_at\"] <= orders[\"courier_accepted_at\"]\n", "orders.loc[msk, \"courier_accepted_at\"] = pd.NaT\n", "orders.loc[msk, \"courier_accepted_at_corrected\"] = True\n", "\n", "assert msk.sum() == 15" ] }, { "cell_type": "code", "execution_count": 134, "metadata": {}, "outputs": [], "source": [ "for column in [\"dispatch_at\", \"courier_notified_at\", \"courier_accepted_at\"]:\n", " assert not (orders[\"pickup_at\"] <= orders[column]).any()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For about 66.000 orders `courier_accepted_at` equals `dispatch_at` or lies before it. We assume the former is correct and discard the latter." ] }, { "cell_type": "code", "execution_count": 135, "metadata": {}, "outputs": [], "source": [ "msk = orders[\"courier_accepted_at\"] <= orders[\"dispatch_at\"]\n", "orders.loc[msk, \"dispatch_at\"] = pd.NaT\n", "orders.loc[msk, \"dispatch_at_corrected\"] = True\n", "\n", "assert msk.sum() == 65_848" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If `courier_accepted_at` is equal or before `courier_notified_at`, we discard the latter." ] }, { "cell_type": "code", "execution_count": 136, "metadata": {}, "outputs": [], "source": [ "msk = orders[\"courier_accepted_at\"] <= orders[\"courier_notified_at\"]\n", "orders.loc[msk, \"courier_notified_at\"] = pd.NaT\n", "orders.loc[msk, \"courier_notified_at_corrected\"] = True\n", "\n", "assert msk.sum() == 165_585" ] }, { "cell_type": "code", "execution_count": 137, "metadata": {}, "outputs": [], "source": [ "for column in [\"dispatch_at\", \"courier_notified_at\"]:\n", " assert not (orders[\"courier_accepted_at\"] <= orders[column]).any()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For some more orders, `courier_notified_at` lies before `dispatch_at`. Manual analysis reveals that in most of these cases, the courier did not hit \"accept\". We discard `dispatch_at` as the timings between `courier_notified_at` and `courier_accepted_at` fit the issue messages." ] }, { "cell_type": "code", "execution_count": 138, "metadata": {}, "outputs": [], "source": [ "msk = orders[\"courier_notified_at\"] <= orders[\"dispatch_at\"]\n", "orders.loc[msk, \"dispatch_at\"] = pd.NaT\n", "orders.loc[msk, \"dispatch_at_corrected\"] = True\n", "\n", "assert msk.sum() == 3_397" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ad-hoc orders that were placed before 11 in the morning and after 23 in the evening are discarded. Most of them were cancelled anyways." ] }, { "cell_type": "code", "execution_count": 139, "metadata": {}, "outputs": [], "source": [ "msk = (orders[\"ad_hoc\"] == True) & (\n", " (orders[\"placed_at\"].dt.hour <= 10) | (orders[\"placed_at\"].dt.hour >= 23)\n", ")\n", "orders = orders[~msk]\n", "\n", "assert msk.sum() == 337" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The orders scheduled for 11:15 (=1) and 11:30 (=37) are scheduled for 11:45. Most of them were not delivered until 12 anyways. This is in line with the 30-minute minimum horizon above." ] }, { "cell_type": "code", "execution_count": 140, "metadata": {}, "outputs": [], "source": [ "msk = (orders[\"scheduled_delivery_at\"].dt.hour == 11) & (\n", " orders[\"scheduled_delivery_at\"].dt.minute == 30\n", ")\n", "orders.loc[msk, \"scheduled_delivery_at\"] += datetime.timedelta(minutes=15)\n", "orders.loc[msk, \"scheduled_delivery_at_corrected\"] = True\n", "\n", "assert msk.sum() == 37\n", "\n", "msk = (orders[\"scheduled_delivery_at\"].dt.hour == 11) & (\n", " orders[\"scheduled_delivery_at\"].dt.minute == 15\n", ")\n", "orders.loc[msk, \"scheduled_delivery_at\"] += datetime.timedelta(minutes=30)\n", "orders.loc[msk, \"scheduled_delivery_at_corrected\"] = True\n", "\n", "assert msk.sum() == 1\n", "\n", "assert not (\n", " (orders[\"scheduled_delivery_at\"].dt.hour == 11)\n", " & (orders[\"scheduled_delivery_at\"].dt.minute == 0)\n", ").any()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Orders with a scheduled delivery before 11 in the morning and after 23 in the evening are discarded." ] }, { "cell_type": "code", "execution_count": 141, "metadata": {}, "outputs": [], "source": [ "msk = (orders[\"ad_hoc\"] == False) & (\n", " (orders[\"scheduled_delivery_at\"].dt.hour <= 10)\n", " | (orders[\"scheduled_delivery_at\"].dt.hour >= 23)\n", ")\n", "orders = orders[~msk]\n", "\n", "assert msk.sum() == 159" ] }, { "cell_type": "code", "execution_count": 142, "metadata": {}, "outputs": [], "source": [ "ad_hoc = orders[\"ad_hoc\"] == True\n", "scheduled = ~ad_hoc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Order Stati" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are only cancelled and completed orders. We replace the `status` column with a boolean `cancelled` column." ] }, { "cell_type": "code", "execution_count": 143, "metadata": {}, "outputs": [], "source": [ "assert set(orders[\"status\"].unique()) == set([\"cancelled\", \"completed\"])" ] }, { "cell_type": "code", "execution_count": 144, "metadata": {}, "outputs": [], "source": [ "orders[\"cancelled\"] = False\n", "msk = orders[\"status\"] == \"cancelled\"\n", "orders.loc[msk, \"cancelled\"] = True\n", "del orders[\"status\"]\n", "\n", "assert msk.sum() == 23_552" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some cancelled orders still have a `delivery_at` value. All of them have a dummy value for the `cancelled_at` value (cf., below). For roughly two thirds of them, the time between pickup and delivery is so small that it seems unrealistic that they actually were delivered. In these cases, we take `delivery_at` as the realistic `cancelled_at` value. The ones that could have been delivered realistically are treated as completed orders." ] }, { "cell_type": "code", "execution_count": 145, "metadata": {}, "outputs": [], "source": [ "claimed_to_be_delivered = (orders[\"cancelled\"] == True) & orders[\n", " \"delivery_at\"\n", "].notnull()\n", "\n", "assert (\n", " orders.loc[claimed_to_be_delivered, \"cancelled_at\"].min()\n", " == orders.loc[claimed_to_be_delivered, \"cancelled_at\"].max()\n", " == datetime.datetime(2016, 10, 18, 9, 52, 45)\n", ")\n", "\n", "realistically_delivered = (\n", " orders[\"delivery_at\"] - orders[\"pickup_at\"]\n", ").dt.total_seconds() > 120\n", "msk = claimed_to_be_delivered & realistically_delivered\n", "orders.loc[msk, \"cancelled\"] = False\n", "orders.loc[msk, \"cancelled_at\"] = pd.NaT\n", "orders.loc[msk, \"cancelled_at_corrected\"] = True\n", "msk = claimed_to_be_delivered & ~realistically_delivered\n", "orders.loc[msk, \"cancelled_at\"] = orders.loc[msk, \"delivery_at\"]\n", "orders.loc[msk, \"cancelled_at_corrected\"] = True\n", "orders.loc[msk, \"delivery_at\"] = pd.NaT\n", "orders.loc[msk, \"delivery_at_corrected\"] = pd.NA\n", "\n", "assert claimed_to_be_delivered.sum() == 159\n", "assert (claimed_to_be_delivered & realistically_delivered).sum() == 61\n", "assert (claimed_to_be_delivered & ~realistically_delivered).sum() == 98" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Only cancelled orders have a `cancelled_at` value." ] }, { "cell_type": "code", "execution_count": 146, "metadata": {}, "outputs": [], "source": [ "cancelled = orders[\"cancelled\"] == True\n", "completed = orders[\"cancelled\"] == False\n", "\n", "assert not orders.loc[cancelled, \"cancelled_at\"].isnull().any()\n", "assert not orders.loc[completed, \"cancelled_at\"].notnull().any()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cancelled Orders" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For about 40% of the orders the `cancelled_at` field was only filled in after a system change on October 18 (i.e., in a batch). For these orders, this field is not meaningful because of that. We discard it." ] }, { "cell_type": "code", "execution_count": 147, "metadata": {}, "outputs": [], "source": [ "batch = orders[\"cancelled_at\"] == datetime.datetime(2016, 10, 18, 9, 52, 45)\n", "orders.loc[cancelled & batch, \"cancelled_at\"] = pd.NaT\n", "orders.loc[cancelled & batch, \"cancelled_at_corrected\"] = True\n", "\n", "assert (cancelled & batch).sum() == 9_410" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When a restaurant was notified about an order after the order was cancelled, we discard `restaurant_notified_at` and `restaurant_confirmed_at`." ] }, { "cell_type": "code", "execution_count": 148, "metadata": {}, "outputs": [], "source": [ "msk = orders[\"cancelled_at\"] <= orders[\"restaurant_notified_at\"]\n", "orders.loc[msk, \"restaurant_notified_at\"] = pd.NaT\n", "orders.loc[msk, \"restaurant_notified_at_corrected\"] = True\n", "\n", "assert msk.sum() == 6" ] }, { "cell_type": "code", "execution_count": 149, "metadata": {}, "outputs": [], "source": [ "msk = orders[\"cancelled_at\"] <= orders[\"restaurant_confirmed_at\"]\n", "orders.loc[msk, \"restaurant_confirmed_at\"] = pd.NaT\n", "orders.loc[msk, \"restaurant_confirmed_at_corrected\"] = True\n", "\n", "assert msk.sum() == 1_253" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When an order was dispatched in the moment it was cancelled, we adjust that." ] }, { "cell_type": "code", "execution_count": 150, "metadata": {}, "outputs": [], "source": [ "msk = orders[\"cancelled_at\"] == orders[\"dispatch_at\"]\n", "orders.loc[msk, \"dispatch_at\"] -= datetime.timedelta(seconds=1)\n", "orders.loc[msk, \"dispatch_at_corrected\"] = True\n", "\n", "assert msk.sum() == 3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When a courier was notified about or accepted an order in the moment it was cancelled, we adjust that." ] }, { "cell_type": "code", "execution_count": 151, "metadata": {}, "outputs": [], "source": [ "msk = orders[\"cancelled_at\"] == orders[\"courier_notified_at\"]\n", "orders.loc[msk, \"courier_notified_at\"] -= datetime.timedelta(seconds=1)\n", "orders.loc[msk, \"courier_notified_at_corrected\"] = True\n", "\n", "assert msk.sum() == 1" ] }, { "cell_type": "code", "execution_count": 152, "metadata": {}, "outputs": [], "source": [ "msk = orders[\"cancelled_at\"] == orders[\"courier_accepted_at\"]\n", "orders.loc[msk, \"courier_accepted_at\"] -= datetime.timedelta(seconds=1)\n", "orders.loc[msk, \"courier_accepted_at_corrected\"] = True\n", "\n", "assert msk.sum() == 8" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When a courier picked up an order in the moment it was cancelled, we adjust that." ] }, { "cell_type": "code", "execution_count": 153, "metadata": {}, "outputs": [], "source": [ "msk = orders[\"cancelled_at\"] == orders[\"pickup_at\"]\n", "orders.loc[msk, \"pickup_at\"] -= datetime.timedelta(seconds=1)\n", "orders.loc[msk, \"pickup_at_corrected\"] = True\n", "\n", "assert msk.sum() == 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Verify that `cancelled_at` is indeed the latest timestamp in every row." ] }, { "cell_type": "code", "execution_count": 154, "metadata": {}, "outputs": [], "source": [ "orders[\"_max_datetime\"] = pd.NaT\n", "orders[\"_max_datetime\"] = orders[\n", " [\n", " \"restaurant_notified_at\",\n", " \"restaurant_confirmed_at\",\n", " \"dispatch_at\",\n", " \"courier_notified_at\",\n", " \"courier_accepted_at\",\n", " \"pickup_at\",\n", " \"left_pickup_at\",\n", " \"delivery_at\",\n", " \"cancelled_at\",\n", " ]\n", "].max(axis=1)\n", "\n", "assert not (\n", " cancelled & ~batch & (orders[\"cancelled_at\"] != orders[\"_max_datetime\"])\n", ").any()\n", "\n", "del orders[\"_max_datetime\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Timings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The times in between the timestamps can be used to obtain timings of individual steps in the delivery process. In the original database, such timings were already logged. In the following, we validate the timestamps against the timings and only keep the timestamps as the timings are then calculated as `@property`s in the ORM layer." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`confirmed_total_time` is the difference between `placed_at` and `delivery_at`. It is set only for completed orders and useful only for ad_hoc orders. The `Order` class has a `total_time` property that computes that value." ] }, { "cell_type": "code", "execution_count": 155, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.998" ] }, "execution_count": 155, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_data_available = (\n", " orders[\"logged_confirmed_total_time\"].notnull()\n", " & orders[\"delivery_at\"].notnull()\n", " & orders[\"placed_at\"].notnull()\n", ")\n", "good_data = (\n", " orders[\"logged_confirmed_total_time\"]\n", " - ((orders[\"delivery_at\"] - orders[\"placed_at\"]).dt.total_seconds().round())\n", ").abs() <= 5\n", "del orders[\"logged_confirmed_total_time\"]\n", "\n", "assert (all_data_available & good_data).sum() == 635_768\n", "assert (all_data_available & good_data & completed).sum() == 635_768\n", "assert (all_data_available & good_data & ad_hoc).sum() == 561_340\n", "\n", "round(\n", " (all_data_available & good_data & ad_hoc).sum()\n", " / (all_data_available & ad_hoc).sum(),\n", " 3,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The best guess for `accepting_time` is the difference between `dispatch_at` and `courier_accepted_at`. `Order.time_to_accept` models that." ] }, { "cell_type": "code", "execution_count": 156, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.607" ] }, "execution_count": 156, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_data_available = (\n", " orders[\"logged_accepting_time\"].notnull()\n", " & orders[\"courier_accepted_at\"].notnull()\n", " & orders[\"dispatch_at\"].notnull()\n", ")\n", "good_data = (\n", " orders[\"logged_accepting_time\"]\n", " - (\n", " (orders[\"courier_accepted_at\"] - orders[\"dispatch_at\"])\n", " .dt.total_seconds()\n", " .round()\n", " )\n", ").abs() <= 5\n", "\n", "assert (all_data_available & good_data).sum() == 345_803\n", "assert (all_data_available & good_data & completed).sum() == 345_803\n", "\n", "round((all_data_available & good_data).sum() / all_data_available.sum(), 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use `accepting_time` to extrapolate missing values for `dispatch_at`." ] }, { "cell_type": "code", "execution_count": 157, "metadata": {}, "outputs": [], "source": [ "extrapolate = (\n", " orders[\"dispatch_at\"].isnull()\n", " & orders[\"courier_accepted_at\"].notnull()\n", " & orders[\"logged_accepting_time\"].notnull()\n", ")\n", "\n", "accept_time = orders[\"logged_accepting_time\"].map(\n", " lambda x: datetime.timedelta(seconds=x) if x is not pd.NA else pd.NaT\n", ")\n", "extrapolated_dispatch_at = orders[\"courier_accepted_at\"] - accept_time\n", "still_wrong = extrapolated_dispatch_at >= orders[\"courier_notified_at\"]\n", "msk = extrapolate & ~still_wrong\n", "orders.loc[msk, \"dispatch_at\"] = extrapolated_dispatch_at.loc[msk]\n", "orders.loc[msk, \"dispatch_at_corrected\"] = True\n", "del orders[\"logged_accepting_time\"]\n", "\n", "assert extrapolate.sum() == 67_372\n", "assert (extrapolate & ~still_wrong).sum() == 61_545" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The best guess for `reaction_time` is the difference between `courier_notified_at` and `courier_accepted_at`. `Order.time_to_react` models that in the ORM." ] }, { "cell_type": "code", "execution_count": 158, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.843" ] }, "execution_count": 158, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_data_available = (\n", " orders[\"logged_reaction_time\"].notnull()\n", " & orders[\"courier_accepted_at\"].notnull()\n", " & orders[\"courier_notified_at\"].notnull()\n", ")\n", "good_data = (\n", " orders[\"logged_reaction_time\"]\n", " - (\n", " (orders[\"courier_accepted_at\"] - orders[\"courier_notified_at\"])\n", " .dt.total_seconds()\n", " .round()\n", " )\n", ").abs() <= 5\n", "\n", "assert (all_data_available & good_data).sum() == 165_355\n", "assert (all_data_available & good_data & completed).sum() == 165_355\n", "\n", "round((all_data_available & good_data).sum() / all_data_available.sum(), 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use `reaction_time` to extrapolate missing values for `courier_notified_at`." ] }, { "cell_type": "code", "execution_count": 159, "metadata": {}, "outputs": [], "source": [ "extrapolate = (\n", " orders[\"courier_notified_at\"].isnull()\n", " & orders[\"courier_accepted_at\"].notnull()\n", " & orders[\"logged_reaction_time\"].notnull()\n", ")\n", "extrapolated_courier_notified_at = (\n", " orders[\"courier_accepted_at\"]\n", " # Some values for logged_reaction_time are <= 0.\n", " - orders[\"logged_reaction_time\"].map(\n", " lambda x: datetime.timedelta(seconds=x) if x is not pd.NA and x > 0 else pd.NaT\n", " )\n", ")\n", "still_wrong = extrapolated_courier_notified_at <= orders[\"dispatch_at\"]\n", "msk = extrapolate & ~still_wrong\n", "orders.loc[msk, \"courier_notified_at\"] = extrapolated_courier_notified_at.loc[msk]\n", "orders.loc[msk, \"courier_notified_at_corrected\"] = True\n", "\n", "assert extrapolate.sum() == 214_043\n", "assert (extrapolate & ~still_wrong).sum() == 213_290" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "No need to extrapolate `courier_accepted_at` from `courier_notified_at`." ] }, { "cell_type": "code", "execution_count": 160, "metadata": {}, "outputs": [], "source": [ "assert not (\n", " orders[\"courier_notified_at\"].notnull()\n", " & orders[\"courier_accepted_at\"].isnull()\n", " & orders[\"logged_reaction_time\"].notnull()\n", ").any()\n", "\n", "del orders[\"logged_reaction_time\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`estimated_prep_duration` equals `expected_wait_pickup_time`. As the latter is not filled in for cancelled orders, we keep the former.\n", "Also, `estimated_prep_duration` is only filled in starting with May 24. It is always a multiple of `60`, so it is stored as full minutes." ] }, { "cell_type": "code", "execution_count": 161, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0" ] }, "execution_count": 161, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_data_available = (\n", " orders[\"estimated_prep_duration\"].notnull()\n", " & orders[\"expected_wait_pickup_time\"].notnull()\n", ")\n", "good_data = (\n", " orders[\"estimated_prep_duration\"] - orders[\"expected_wait_pickup_time\"]\n", ").abs() <= 5\n", "\n", "no_duration = orders[\"estimated_prep_duration\"].isnull()\n", "\n", "assert not (no_duration & orders[\"expected_wait_pickup_time\"].notnull()).any()\n", "assert (~no_duration & orders[\"expected_wait_pickup_time\"].isnull()).sum() == 19_865\n", "assert not (\n", " (orders[\"placed_at\"].dt.date > datetime.date(2016, 5, 24))\n", " & orders[\"expected_wait_pickup_time\"].isnull()\n", " & (orders[\"cancelled\"] == False)\n", ").any()\n", "\n", "del orders[\"expected_wait_pickup_time\"]\n", "\n", "assert orders.loc[no_duration, \"placed_at\"].min().date() == datetime.date(2016, 2, 21)\n", "assert orders.loc[no_duration, \"placed_at\"].max().date() == datetime.date(2016, 5, 24)\n", "assert orders.loc[~no_duration, \"placed_at\"].min().date() == datetime.date(2016, 5, 24)\n", "assert orders.loc[~no_duration, \"placed_at\"].max().date() == datetime.date(2017, 1, 31)\n", "assert not (~no_duration & (orders[\"estimated_prep_duration\"] % 60 != 0)).any()\n", "\n", "round((all_data_available & good_data).sum() / all_data_available.sum(), 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`estimated_prep_duration` is the difference between `restaurant_notified_at` and `scheduled_pickup_at` when allowing up to half a minute of clock skew. `restaurant_confirmed_at` only works in about 40% of the cases. So, if and when a restaurant confirms an order, does not affect the dispatching process." ] }, { "cell_type": "code", "execution_count": 162, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.986" ] }, "execution_count": 162, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_data_available = (\n", " orders[\"estimated_prep_duration\"].notnull()\n", " & orders[\"restaurant_notified_at\"].notnull()\n", " & orders[\"scheduled_pickup_at\"].notnull()\n", ")\n", "good_data = (\n", " orders[\"estimated_prep_duration\"]\n", " - (\n", " (orders[\"scheduled_pickup_at\"] - orders[\"restaurant_notified_at\"])\n", " .dt.total_seconds()\n", " .round()\n", " )\n", ").abs() <= 35\n", "\n", "assert (all_data_available & good_data).sum() == 539_668\n", "assert (all_data_available & good_data & completed).sum() == 524_709\n", "\n", "round((all_data_available & good_data).sum() / all_data_available.sum(), 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use `estimated_prep_duration` to correct about a third of the 1.5% of `restaurant_notified_at` that are off for orders after May 24." ] }, { "cell_type": "code", "execution_count": 163, "metadata": {}, "outputs": [], "source": [ "duration = orders[\"estimated_prep_duration\"].map(\n", " lambda x: datetime.timedelta(seconds=x) if x is not pd.NA else pd.NaT\n", ")\n", "calc_restaurant_notified_at = orders[\"scheduled_pickup_at\"] - duration\n", "\n", "not_wrong = (\n", " completed\n", " & (orders[\"placed_at\"] < calc_restaurant_notified_at)\n", " & (calc_restaurant_notified_at < orders[\"restaurant_confirmed_at\"])\n", ")\n", "\n", "msk = all_data_available & ~good_data & not_wrong\n", "orders.loc[msk, \"restaurant_notified_at\"] = calc_restaurant_notified_at.loc[msk]\n", "orders.loc[msk, \"restaurant_notified_at_corrected\"] = True\n", "\n", "assert (all_data_available & ~good_data).sum() == 7_514\n", "assert msk.sum() == 2_425\n", "assert orders.loc[msk, \"placed_at\"].min().date() == datetime.date(2016, 5, 24)\n", "assert orders.loc[msk, \"placed_at\"].max().date() == datetime.date(2017, 1, 31)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also, we use `estimated_prep_duration` to extrapolate missing `restaurant_notified_at` values for orders after May 24." ] }, { "cell_type": "code", "execution_count": 164, "metadata": {}, "outputs": [], "source": [ "duration = orders[\"estimated_prep_duration\"].map(\n", " lambda x: datetime.timedelta(seconds=x) if x is not pd.NA else pd.NaT\n", ")\n", "extrapolated = orders[\"scheduled_pickup_at\"] - duration\n", "\n", "extrapolate = (\n", " orders[\"restaurant_notified_at\"].isnull()\n", " & orders[\"scheduled_pickup_at\"].notnull()\n", " & orders[\"estimated_prep_duration\"].notnull()\n", ")\n", "still_wrong = (\n", " (extrapolated <= orders[\"placed_at\"])\n", " | (extrapolated >= orders[\"restaurant_confirmed_at\"])\n", " | (extrapolated >= orders[\"cancelled_at\"])\n", ")\n", "\n", "msk = extrapolate & ~still_wrong\n", "orders.loc[msk, \"restaurant_notified_at\"] = extrapolated.loc[msk]\n", "orders.loc[msk, \"restaurant_notified_at_corrected\"] = True\n", "\n", "assert extrapolate.sum() == 469\n", "assert msk.sum() == 374\n", "assert orders.loc[msk, \"placed_at\"].min().date() == datetime.date(2016, 5, 29)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Vice versa, we extrapolate `estimated_prep_duration` as the difference of `scheduled_pickup_at` and `restaurant_notified_at` for orders before May 24." ] }, { "cell_type": "code", "execution_count": 165, "metadata": {}, "outputs": [], "source": [ "extrapolated = orders[\"scheduled_pickup_at\"] - orders[\"restaurant_notified_at\"]\n", "extrapolated = (extrapolated.dt.total_seconds() // 60 * 60).astype(\"Int64\")\n", "\n", "extrapolate = (\n", " orders[\"restaurant_notified_at\"].notnull()\n", " & orders[\"scheduled_pickup_at\"].notnull()\n", " & orders[\"estimated_prep_duration\"].isnull()\n", ")\n", "orders.loc[extrapolate, \"estimated_prep_duration\"] = extrapolated.loc[extrapolate]\n", "orders.loc[extrapolate, \"estimated_prep_duration_corrected\"] = True\n", "\n", "assert extrapolate.sum() == 108_398\n", "assert orders.loc[extrapolate, \"placed_at\"].min().date() == datetime.date(2016, 2, 21)\n", "assert orders.loc[extrapolate, \"placed_at\"].max().date() == datetime.date(2016, 5, 24)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "More than 99.9% of the orders with `estimated_prep_duration` set, have this value be under 45 minutes. We view the remaining ones as outliers and adjust them." ] }, { "cell_type": "code", "execution_count": 166, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.99973" ] }, "execution_count": 166, "metadata": {}, "output_type": "execute_result" } ], "source": [ "more_than_45_mins = orders[\"estimated_prep_duration\"].notnull()\n", "more_than_45_mins &= orders[\"estimated_prep_duration\"] > 45 * 60\n", "\n", "orders.loc[more_than_45_mins, \"estimated_prep_duration\"] = 45 * 60\n", "orders.loc[more_than_45_mins, \"estimated_prep_duration_corrected\"] = True\n", "\n", "assert more_than_45_mins.sum() == 449\n", "\n", "round((~more_than_45_mins).sum() / orders[\"estimated_prep_duration\"].notnull().sum(), 5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We create a boolean column `pickup_not_confirmed` out of the text column `courier_no_pickup_confirmed_issue`." ] }, { "cell_type": "code", "execution_count": 167, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Courier did not hit \"Picked up\" 35345\n", "Name: courier_no_pickup_confirmed_issue, dtype: Int64" ] }, "execution_count": 167, "metadata": {}, "output_type": "execute_result" } ], "source": [ "orders[\"courier_no_pickup_confirmed_issue\"].value_counts()" ] }, { "cell_type": "code", "execution_count": 168, "metadata": {}, "outputs": [], "source": [ "orders[\"pickup_not_confirmed\"] = False\n", "\n", "msk = orders[\"courier_no_pickup_confirmed_issue\"].notnull()\n", "orders.loc[msk, \"pickup_not_confirmed\"] = True\n", "\n", "msk = orders[\"pickup_at\"].isnull()\n", "orders.loc[msk, \"pickup_not_confirmed\"] = pd.NA\n", "\n", "del orders[\"courier_no_pickup_confirmed_issue\"]\n", "\n", "assert orders[\"pickup_not_confirmed\"].sum() == 34_966" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`logged_to_pickup_time` and `logged_pickup_time` constitute the difference between `courier_accepted_at` and `pickup_at`. `logged_pickup_time` is negative in rare cases." ] }, { "cell_type": "code", "execution_count": 169, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.94" ] }, "execution_count": 169, "metadata": {}, "output_type": "execute_result" } ], "source": [ "assert not (orders[\"logged_to_pickup_time\"] < 0).any()\n", "assert (orders[\"logged_pickup_time\"] < 0).sum() == 30\n", "\n", "all_data_available = (\n", " orders[\"logged_to_pickup_time\"].notnull()\n", " & orders[\"logged_pickup_time\"].notnull()\n", " & (orders[\"logged_pickup_time\"] >= 0)\n", " & orders[\"pickup_at\"].notnull()\n", " & orders[\"courier_accepted_at\"].notnull()\n", ")\n", "good_data = (\n", " orders[\"logged_to_pickup_time\"]\n", " + orders[\"logged_pickup_time\"]\n", " - ((orders[\"pickup_at\"] - orders[\"courier_accepted_at\"]).dt.total_seconds().round())\n", ").abs() <= 5\n", "\n", "pickup_not_confirmed = orders[\"pickup_not_confirmed\"] == True\n", "\n", "assert (all_data_available & good_data).sum() == 599_195\n", "assert (all_data_available & good_data & completed).sum() == 599_111\n", "assert (all_data_available & (good_data | pickup_not_confirmed)).sum() == 604_483\n", "\n", "round((all_data_available & good_data).sum() / all_data_available.sum(), 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the 6% where `pickup_at` does not relate back to `courier_accepted_at`, we correct the former. Unconfirmed pickups seem to not be the cause of these inconsistencies." ] }, { "cell_type": "code", "execution_count": 170, "metadata": {}, "outputs": [], "source": [ "calc_pickup_at = (\n", " orders[\"courier_accepted_at\"]\n", " + orders[\"logged_to_pickup_time\"].map(\n", " lambda x: datetime.timedelta(seconds=x) if x is not pd.NA else pd.NaT\n", " )\n", " + orders[\"logged_pickup_time\"].map(\n", " lambda x: datetime.timedelta(seconds=x) if x is not pd.NA else pd.NaT\n", " )\n", ")\n", "\n", "msk = all_data_available & ~good_data\n", "orders.loc[msk, \"pickup_at\"] = calc_pickup_at.loc[msk]\n", "orders.loc[msk, \"pickup_at_corrected\"] = True\n", "\n", "assert (all_data_available & ~good_data).sum() == 38_015\n", "assert (all_data_available & ~good_data & pickup_not_confirmed).sum() == 5_288" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Keep other timestamps consistent after the correction." ] }, { "cell_type": "code", "execution_count": 171, "metadata": {}, "outputs": [], "source": [ "msk = orders[\"pickup_at\"] <= orders[\"restaurant_notified_at\"]\n", "orders.loc[msk, \"restaurant_notified_at\"] = pd.NaT\n", "orders.loc[msk, \"restaurant_notified_at_corrected\"] = True\n", "\n", "assert msk.sum() == 107" ] }, { "cell_type": "code", "execution_count": 172, "metadata": {}, "outputs": [], "source": [ "msk = orders[\"pickup_at\"] <= orders[\"restaurant_confirmed_at\"]\n", "orders.loc[msk, \"restaurant_confirmed_at\"] = pd.NaT\n", "orders.loc[msk, \"restaurant_confirmed_at_corrected\"] = True\n", "\n", "assert msk.sum() == 892" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With `logged_to_pickup_time` we calculate a new timestamp `reached_pickup_at`." ] }, { "cell_type": "code", "execution_count": 173, "metadata": {}, "outputs": [], "source": [ "to_pickup_time = orders[\"logged_to_pickup_time\"].map(\n", " lambda x: datetime.timedelta(seconds=x) if x is not pd.NA and x > 0 else pd.NaT\n", ")\n", "reached_pickup_at = orders[\"courier_accepted_at\"] + to_pickup_time\n", "\n", "orders[\"reached_pickup_at\"] = pd.NaT\n", "msk = (\n", " completed & reached_pickup_at.notnull() & (reached_pickup_at < orders[\"pickup_at\"])\n", ")\n", "orders.loc[msk, \"reached_pickup_at\"] = reached_pickup_at.loc[msk]\n", "\n", "assert msk.sum() == 530_724" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`logged_courier_late_time` and `logged_restaurant_late_time` are always set together. The ca. 110,000 missing values are spread over the entire horizon." ] }, { "cell_type": "code", "execution_count": 174, "metadata": {}, "outputs": [], "source": [ "assert not (\n", " (\n", " orders[\"logged_courier_late_time\"].notnull()\n", " & orders[\"logged_restaurant_late_time\"].isnull()\n", " )\n", " | (\n", " orders[\"logged_courier_late_time\"].isnull()\n", " & orders[\"logged_restaurant_late_time\"].notnull()\n", " )\n", ").any()\n", "\n", "assert orders.loc[\n", " orders[\"logged_courier_late_time\"].isnull(), \"placed_at\"\n", "].min().date() == datetime.date(2016, 2, 22)\n", "\n", "assert orders.loc[\n", " orders[\"logged_courier_late_time\"].isnull(), \"placed_at\"\n", "].max().date() == datetime.date(2017, 1, 31)\n", "\n", "assert orders.loc[\n", " orders[\"logged_courier_late_time\"].notnull(), \"placed_at\"\n", "].min().date() == datetime.date(2016, 2, 21)\n", "\n", "assert orders.loc[\n", " orders[\"logged_courier_late_time\"].notnull(), \"placed_at\"\n", "].max().date() == datetime.date(2017, 1, 31)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`logged_courier_late_time` is mostly explained with `reached_pickup_at` and `scheduled_pickup_at`. `Order.courier_early` and `Order.courier_late` model that in the ORM." ] }, { "cell_type": "code", "execution_count": 175, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.964" ] }, "execution_count": 175, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_data_available = (\n", " orders[\"logged_courier_late_time\"].notnull()\n", " & orders[\"reached_pickup_at\"].notnull()\n", " & orders[\"scheduled_pickup_at\"].notnull()\n", ")\n", "good_data = (\n", " orders[\"logged_courier_late_time\"]\n", " - (\n", " (orders[\"reached_pickup_at\"] - orders[\"scheduled_pickup_at\"])\n", " .dt.total_seconds()\n", " .round()\n", " )\n", ").abs() <= 5\n", "\n", "assert (all_data_available & good_data).sum() == 471_553\n", "assert (all_data_available & good_data & completed).sum() == 471_553\n", "\n", "round((all_data_available & good_data).sum() / all_data_available.sum(), 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`logged_restaurant_late_time` is mostly explained with `pickup_at` and `scheduled_pickup_at`. `logged_restaurant_late_time` is also `0` quite often, indicating no timing was taken. `Order.restaurant_early` and `Order.restaurant_late` model that in the ORM." ] }, { "cell_type": "code", "execution_count": 176, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.943" ] }, "execution_count": 176, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_data_available = (\n", " orders[\"logged_restaurant_late_time\"].notnull()\n", " & orders[\"pickup_at\"].notnull()\n", " & orders[\"scheduled_pickup_at\"].notnull()\n", ")\n", "good_data = (\n", " orders[\"logged_restaurant_late_time\"]\n", " - ((orders[\"pickup_at\"] - orders[\"scheduled_pickup_at\"]).dt.total_seconds().round())\n", ").abs() <= 5\n", "\n", "restaurant_not_timed = orders[\"logged_restaurant_late_time\"] == 0\n", "\n", "assert (all_data_available).sum() == 503_179\n", "assert (all_data_available & good_data).sum() == 245_714\n", "assert (all_data_available & restaurant_not_timed).sum() == 246_362\n", "assert (all_data_available & (good_data | restaurant_not_timed)).sum() == 488_512\n", "\n", "restaurant_timed = orders[\"logged_restaurant_late_time\"] != 0\n", "\n", "round(\n", " (all_data_available & restaurant_timed & good_data).sum()\n", " / (all_data_available & restaurant_timed).sum(),\n", " 3,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`logged_wait_pickup_time` is unfortunately not a good timing to extrapolate when a meal was ready to picked up by the courier. It is only good to explain the difference between `reached_pickup_at` and `left_pickup_at`, which is not really the time the courier had to wait. Also, the field seems to only be tracked correctly if the courier was late." ] }, { "cell_type": "code", "execution_count": 177, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.396" ] }, "execution_count": 177, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_data_available = (\n", " (orders[\"logged_wait_pickup_time\"]).notnull()\n", " & orders[\"reached_pickup_at\"].notnull()\n", " & orders[\"left_pickup_at\"].notnull()\n", ")\n", "good_data = (\n", " orders[\"logged_wait_pickup_time\"]\n", " - (\n", " (orders[\"left_pickup_at\"] - orders[\"reached_pickup_at\"])\n", " .dt.total_seconds()\n", " .round()\n", " )\n", ").abs() <= 5\n", "\n", "round(\n", " (all_data_available & good_data).sum() / (all_data_available).sum(), 3,\n", ")" ] }, { "cell_type": "code", "execution_count": 178, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0" ] }, "execution_count": 178, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_data_available = (\n", " (orders[\"logged_wait_pickup_time\"]).notnull()\n", " & (orders[\"logged_courier_late_time\"] >= 0)\n", " & orders[\"reached_pickup_at\"].notnull()\n", " & orders[\"left_pickup_at\"].notnull()\n", ")\n", "good_data = (\n", " orders[\"logged_wait_pickup_time\"]\n", " - (\n", " (orders[\"left_pickup_at\"] - orders[\"reached_pickup_at\"])\n", " .dt.total_seconds()\n", " .round()\n", " )\n", ").abs() <= 5\n", "\n", "round(\n", " (all_data_available & good_data).sum() / (all_data_available).sum(), 3,\n", ")" ] }, { "cell_type": "code", "execution_count": 179, "metadata": {}, "outputs": [], "source": [ "del orders[\"logged_courier_late_time\"]\n", "del orders[\"logged_restaurant_late_time\"]\n", "del orders[\"logged_wait_pickup_time\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We create a boolean column `delivery_not_confirmed` out of the text column `courier_no_pickup_confirmed_issue`." ] }, { "cell_type": "code", "execution_count": 180, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Courier did not hit \"Dropped off\" 13862\n", "Name: courier_no_delivery_confirmed_issue, dtype: Int64" ] }, "execution_count": 180, "metadata": {}, "output_type": "execute_result" } ], "source": [ "orders[\"courier_no_delivery_confirmed_issue\"].value_counts()" ] }, { "cell_type": "code", "execution_count": 181, "metadata": {}, "outputs": [], "source": [ "orders[\"delivery_not_confirmed\"] = False\n", "\n", "msk = orders[\"courier_no_delivery_confirmed_issue\"].notnull()\n", "orders.loc[msk, \"delivery_not_confirmed\"] = True\n", "\n", "msk = orders[\"delivery_at\"].isnull()\n", "orders.loc[msk, \"delivery_not_confirmed\"] = pd.NA\n", "\n", "del orders[\"courier_no_delivery_confirmed_issue\"]\n", "\n", "assert orders[\"delivery_not_confirmed\"].sum() == 13_817" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`logged_to_delivery_time` and `logged_delivery_time` constitute the difference between `pickup_at` and `delivery_at`. Without the `pickup_at` corrections above, not 91% but only 86% of the differences would work. `Order.time_to_delivery` and `Order.time_at_delivery` model that in the ORM." ] }, { "cell_type": "code", "execution_count": 182, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.913" ] }, "execution_count": 182, "metadata": {}, "output_type": "execute_result" } ], "source": [ "assert not (orders[\"logged_to_delivery_time\"] < 0).any()\n", "assert not (orders[\"logged_delivery_time\"] < 0).any()\n", "\n", "all_data_available = (\n", " orders[\"logged_to_delivery_time\"].notnull()\n", " & orders[\"logged_delivery_time\"].notnull()\n", " & orders[\"delivery_at\"].notnull()\n", " & orders[\"pickup_at\"].notnull()\n", ")\n", "good_data = (\n", " orders[\"logged_to_delivery_time\"]\n", " + orders[\"logged_delivery_time\"]\n", " - ((orders[\"delivery_at\"] - orders[\"pickup_at\"]).dt.total_seconds().round())\n", ").abs() <= 5\n", "\n", "delivery_not_confirmed = orders[\"delivery_not_confirmed\"] == True\n", "\n", "assert (all_data_available & good_data).sum() == 572_609\n", "assert (all_data_available & good_data & completed).sum() == 572_609\n", "assert (all_data_available & (good_data | delivery_not_confirmed)).sum() == 581_700\n", "\n", "round(\n", " (all_data_available & (good_data | delivery_not_confirmed)).sum()\n", " / all_data_available.sum(),\n", " 3,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`courier_waited_at_delivery_issue` is filled in whenever the courier needed to wait for the customer at delivery. It is also filled in if the courier forgot to confirm the delivery, which mostly happened at the end of a shift. If a courier needed to wait for more than 45 minutes, that is summarized as 'waiting about 1, 2, or 3 hours.'" ] }, { "cell_type": "code", "execution_count": 183, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Waiting at Dropoff 1597\n", "Waiting at Dropoff: 10 minutes 4132\n", "Waiting at Dropoff: 11 minutes 3151\n", "Waiting at Dropoff: 12 minutes 2348\n", "Waiting at Dropoff: 13 minutes 1775\n", "Waiting at Dropoff: 14 minutes 1432\n", "Waiting at Dropoff: 15 minutes 1199\n", "Waiting at Dropoff: 16 minutes 941\n", "Waiting at Dropoff: 17 minutes 826\n", "Waiting at Dropoff: 18 minutes 720\n", "Waiting at Dropoff: 19 minutes 594\n", "Waiting at Dropoff: 20 minutes 522\n", "Waiting at Dropoff: 21 minutes 431\n", "Waiting at Dropoff: 22 minutes 415\n", "Waiting at Dropoff: 23 minutes 284\n", "Waiting at Dropoff: 24 minutes 281\n", "Waiting at Dropoff: 25 minutes 215\n", "Waiting at Dropoff: 26 minutes 193\n", "Waiting at Dropoff: 27 minutes 167\n", "Waiting at Dropoff: 28 minutes 162\n", "Waiting at Dropoff: 29 minutes 118\n", "Waiting at Dropoff: 30 minutes 109\n", "Waiting at Dropoff: 31 minutes 97\n", "Waiting at Dropoff: 32 minutes 86\n", "Waiting at Dropoff: 33 minutes 67\n", "Waiting at Dropoff: 34 minutes 69\n", "Waiting at Dropoff: 35 minutes 59\n", "Waiting at Dropoff: 36 minutes 50\n", "Waiting at Dropoff: 37 minutes 62\n", "Waiting at Dropoff: 38 minutes 38\n", "Waiting at Dropoff: 39 minutes 41\n", "Waiting at Dropoff: 40 minutes 41\n", "Waiting at Dropoff: 41 minutes 32\n", "Waiting at Dropoff: 42 minutes 26\n", "Waiting at Dropoff: 43 minutes 22\n", "Waiting at Dropoff: 44 minutes 35\n", "Waiting at Dropoff: 8 minutes 3958\n", "Waiting at Dropoff: 9 minutes 5814\n", "Waiting at Dropoff: about 1 hour 244\n", "Waiting at Dropoff: about 2 hours 7\n", "Waiting at Dropoff: about 3 hours 3\n", "Name: courier_waited_at_delivery_issue, dtype: Int64" ] }, "execution_count": 183, "metadata": {}, "output_type": "execute_result" } ], "source": [ "orders[\"courier_waited_at_delivery_issue\"].value_counts().sort_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We convert `courier_waited_at_delivery_issue` into `courier_waited_at_delivery` to validate it further below." ] }, { "cell_type": "code", "execution_count": 184, "metadata": {}, "outputs": [], "source": [ "waited_at_delivery = (\n", " orders[\"courier_waited_at_delivery_issue\"]\n", " .str.replace(r\"\\D+\", \"\", regex=True)\n", " .fillna(\"NaN\")\n", " .replace(\"\", \"NaN\")\n", " .astype(float)\n", " .astype(\"Int64\")\n", ")\n", "\n", "orders[\"courier_waited_at_delivery\"] = pd.NA\n", "orders[\"courier_waited_at_delivery\"] = orders[\"courier_waited_at_delivery\"].astype(\n", " \"Int64\"\n", ")\n", "\n", "hours = orders[\"courier_waited_at_delivery_issue\"].str.contains(\"hour\").fillna(0)\n", "orders.loc[hours, \"courier_waited_at_delivery\"] = (\n", " 60 * 60 * waited_at_delivery.loc[hours]\n", ")\n", "\n", "mins = orders[\"courier_waited_at_delivery_issue\"].str.contains(\"minutes\").fillna(0)\n", "orders.loc[mins, \"courier_waited_at_delivery\"] = 60 * waited_at_delivery.loc[mins]\n", "\n", "customer_late = orders[\"courier_waited_at_delivery_issue\"].notnull()\n", "\n", "del orders[\"courier_waited_at_delivery_issue\"]\n", "\n", "assert hours.sum() == 254\n", "assert mins.sum() == 30_512" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the roughly 9% of orders where `logged_to_delivery_time` and `logged_delivery_time` do not explain `delivery_at`, the latter is corrected. However, this is only done if the courier did not have to wait for the customer or forgot to confirm the delivery, which wrongly shows up as waiting for the customer as well." ] }, { "cell_type": "code", "execution_count": 185, "metadata": {}, "outputs": [], "source": [ "calc_delivery_at = (\n", " orders[\"pickup_at\"]\n", " + orders[\"logged_to_delivery_time\"].map(\n", " lambda x: datetime.timedelta(seconds=x) if x is not pd.NA else pd.NaT\n", " )\n", " + orders[\"logged_delivery_time\"].map(\n", " lambda x: datetime.timedelta(seconds=x) if x is not pd.NA else pd.NaT\n", " )\n", ")\n", "\n", "del orders[\"logged_delivery_time\"]\n", "\n", "orders[\"delivery_at_orig\"] = orders[\"delivery_at\"]\n", "\n", "msk = (\n", " all_data_available\n", " & ~good_data\n", " & (~customer_late | (customer_late & delivery_not_confirmed))\n", ")\n", "orders.loc[msk, \"delivery_at\"] = calc_delivery_at.loc[msk]\n", "orders.loc[msk, \"delivery_at_corrected\"] = True\n", "\n", "assert (all_data_available & ~good_data).sum() == 64_543\n", "assert (all_data_available & ~good_data & ~customer_late).sum() == 49_122\n", "\n", "assert (\n", " all_data_available & ~good_data & customer_late & delivery_not_confirmed\n", ").sum() == 5_241\n", "\n", "assert (all_data_available & ~good_data & delivery_not_confirmed).sum() == 9_091" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With `logged_to_delivery_time` we calculate a new timestamp `reached_delivery_at`." ] }, { "cell_type": "code", "execution_count": 186, "metadata": {}, "outputs": [], "source": [ "to_delivery_time = orders[\"logged_to_delivery_time\"].map(\n", " lambda x: datetime.timedelta(seconds=x) if x is not pd.NA and x > 0 else pd.NaT\n", ")\n", "reached_delivery_at = orders[\"pickup_at\"] + to_delivery_time\n", "\n", "del orders[\"logged_to_delivery_time\"]\n", "\n", "orders[\"reached_delivery_at\"] = pd.NaT\n", "\n", "msk = (\n", " completed\n", " & reached_delivery_at.notnull()\n", " & (reached_delivery_at < orders[\"delivery_at\"])\n", ")\n", "orders.loc[msk, \"reached_delivery_at\"] = reached_delivery_at.loc[msk]\n", "\n", "assert msk.sum() == 608_160" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some `left_pickup_at` values conflict with that and are discarded." ] }, { "cell_type": "code", "execution_count": 187, "metadata": {}, "outputs": [], "source": [ "msk = orders[\"left_pickup_at\"] >= orders[\"reached_delivery_at\"]\n", "orders.loc[msk, \"left_pickup_at\"] = pd.NaT\n", "orders.loc[msk, \"left_pickup_at_corrected\"] = True\n", "\n", "assert msk.sum() == 4_215" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`logged_delivery_late_time` is the difference between `scheduled_delivery_at` and `delivery_at` for pre-orders. `Order.delivery_early` and `Order.delivery_late` model that in the ORM." ] }, { "cell_type": "code", "execution_count": 188, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.986" ] }, "execution_count": 188, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_data_available = (\n", " scheduled\n", " & orders[\"logged_delivery_late_time\"].notnull()\n", " & orders[\"delivery_at\"].notnull()\n", " & orders[\"scheduled_delivery_at\"].notnull()\n", ")\n", "good_data = (\n", " orders[\"logged_delivery_late_time\"]\n", " - (\n", " (orders[\"delivery_at\"] - orders[\"scheduled_delivery_at\"])\n", " .dt.total_seconds()\n", " .round()\n", " )\n", ").abs() <= 5\n", "\n", "assert (all_data_available & good_data).sum() == 73_658\n", "assert (all_data_available & good_data & completed).sum() == 73_658\n", "assert (all_data_available & (good_data | delivery_not_confirmed)).sum() == 73_659\n", "\n", "round(\n", " (all_data_available & (good_data | delivery_not_confirmed)).sum()\n", " / all_data_available.sum(),\n", " 3,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`expected_delivery_time` is simply the difference between `placed_at` and `scheduled_delivery_at`, for both ad-hoc and pre-orders. So, the field provides no new information." ] }, { "cell_type": "code", "execution_count": 189, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.996" ] }, "execution_count": 189, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_data_available = (\n", " orders[\"expected_delivery_time\"].notnull()\n", " & orders[\"placed_at\"].notnull()\n", " & orders[\"scheduled_delivery_at\"].notnull()\n", ")\n", "good_data = (\n", " orders[\"expected_delivery_time\"]\n", " - (\n", " (orders[\"scheduled_delivery_at\"] - orders[\"placed_at\"])\n", " .dt.total_seconds()\n", " .round()\n", " )\n", ").abs() <= 5\n", "\n", "del orders[\"expected_delivery_time\"]\n", "\n", "assert (all_data_available & good_data).sum() == 74_386\n", "\n", "round((all_data_available & good_data).sum() / all_data_available.sum(), 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`courier_waited_at_delivery` can be mostly explained as the difference between `reached_delivery_at` and `delivery_at`. `Order.courier_waited_at_delivery` models that in the ORM." ] }, { "cell_type": "code", "execution_count": 190, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.871" ] }, "execution_count": 190, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_data_available = (\n", " orders[\"courier_waited_at_delivery\"].notnull()\n", " & orders[\"delivery_at\"].notnull()\n", " & orders[\"reached_delivery_at\"].notnull()\n", ")\n", "good_data = (\n", " orders[\"courier_waited_at_delivery\"]\n", " - (\n", " (orders[\"delivery_at\"] - orders[\"reached_delivery_at\"])\n", " .dt.total_seconds()\n", " .round()\n", " )\n", ").abs() <= 90\n", "imprecise_wait_times = orders[\"courier_waited_at_delivery\"].fillna(0) >= 45 * 60\n", "\n", "assert (all_data_available & good_data).sum() == 26_268\n", "assert (all_data_available & good_data & completed).sum() == 26_268\n", "assert (all_data_available & (good_data | imprecise_wait_times)).sum() == 26_499\n", "\n", "round(\n", " (all_data_available & (good_data | imprecise_wait_times)).sum()\n", " / all_data_available.sum(),\n", " 3,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We keep `courier_waited_at_delivery` as a boolean field here to be used by `Order.courier_waited_at_delivery`." ] }, { "cell_type": "code", "execution_count": 191, "metadata": {}, "outputs": [], "source": [ "msk = orders[\"delivery_at\"].notnull() & orders[\"courier_waited_at_delivery\"].notnull()\n", "orders[\"courier_waited_at_delivery\"] = pd.NA\n", "orders.loc[orders[\"delivery_at\"].notnull(), \"courier_waited_at_delivery\"] = False\n", "orders.loc[msk, \"courier_waited_at_delivery\"] = True\n", "\n", "assert orders[\"courier_waited_at_delivery\"].sum() == msk.sum() == 30_658" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Statistical Columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Keep the columns that log the courier's speed." ] }, { "cell_type": "code", "execution_count": 192, "metadata": {}, "outputs": [], "source": [ "orders = orders.rename(\n", " columns={\n", " \"logged_avg_courier_speed\": \"logged_avg_speed\",\n", " \"logged_avg_courier_speed_distance\": \"logged_avg_speed_distance\",\n", " }\n", ")" ] }, { "cell_type": "code", "execution_count": 193, "metadata": {}, "outputs": [], "source": [ "unrealistic = orders[\"logged_delivery_distance\"] > 12_000\n", "orders.loc[unrealistic, \"logged_delivery_distance\"] = pd.NA\n", "\n", "assert unrealistic.sum() == 17" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Clean Data" ] }, { "cell_type": "code", "execution_count": 194, "metadata": {}, "outputs": [], "source": [ "orders = orders[\n", " [\n", " # Generic columns\n", " \"delivery_id\",\n", " \"customer_id\",\n", " \"placed_at\",\n", " \"ad_hoc\",\n", " \"scheduled_delivery_at\",\n", " \"scheduled_delivery_at_corrected\",\n", " \"first_estimated_delivery_at\",\n", " \"cancelled\",\n", " \"cancelled_at\",\n", " \"cancelled_at_corrected\",\n", " # Price related columns\n", " \"sub_total\",\n", " \"delivery_fee\",\n", " \"total\",\n", " # Restaurant related columns\n", " \"restaurant_id\",\n", " \"restaurant_notified_at\",\n", " \"restaurant_notified_at_corrected\",\n", " \"restaurant_confirmed_at\",\n", " \"restaurant_confirmed_at_corrected\",\n", " \"estimated_prep_duration\",\n", " \"estimated_prep_duration_corrected\",\n", " \"estimated_prep_buffer\",\n", " # Dispatch related columns\n", " \"courier_id\",\n", " \"dispatch_at\",\n", " \"dispatch_at_corrected\",\n", " \"courier_notified_at\",\n", " \"courier_notified_at_corrected\",\n", " \"courier_accepted_at\",\n", " \"courier_accepted_at_corrected\",\n", " \"utilization\",\n", " # Pickup related columns\n", " \"pickup_address_id\",\n", " \"reached_pickup_at\",\n", " \"pickup_at\",\n", " \"pickup_at_corrected\",\n", " \"pickup_not_confirmed\",\n", " \"left_pickup_at\",\n", " \"left_pickup_at_corrected\",\n", " # Delivery related columns\n", " \"delivery_address_id\",\n", " \"reached_delivery_at\",\n", " \"delivery_at\",\n", " \"delivery_at_corrected\",\n", " \"delivery_not_confirmed\",\n", " \"courier_waited_at_delivery\",\n", " # Statistical columns\n", " \"logged_delivery_distance\",\n", " \"logged_avg_speed\",\n", " \"logged_avg_speed_distance\",\n", " ]\n", "].sort_index()" ] }, { "cell_type": "code", "execution_count": 195, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
delivery_idcustomer_idplaced_atad_hocscheduled_delivery_atscheduled_delivery_at_correctedfirst_estimated_delivery_atcancelledcancelled_atcancelled_at_correctedsub_totaldelivery_feetotalrestaurant_idrestaurant_notified_atrestaurant_notified_at_correctedrestaurant_confirmed_atrestaurant_confirmed_at_correctedestimated_prep_durationestimated_prep_duration_correctedestimated_prep_buffercourier_iddispatch_atdispatch_at_correctedcourier_notified_atcourier_notified_at_correctedcourier_accepted_atcourier_accepted_at_correctedutilizationpickup_address_idreached_pickup_atpickup_atpickup_at_correctedpickup_not_confirmedleft_pickup_atleft_pickup_at_correcteddelivery_address_idreached_delivery_atdelivery_atdelivery_at_correcteddelivery_not_confirmedcourier_waited_at_deliverylogged_delivery_distancelogged_avg_speedlogged_avg_speed_distance
id
148737712016-02-22 10:42:01False2016-02-22 12:30:00FalseNaTTrueNaTTrue125025015001NaT<NA>NaT<NA><NA><NA>0<NA>NaT<NA>NaT<NA>NaT<NA>152NaTNaT<NA><NA>NaT<NA>3NaTNaT<NA><NA><NA>671NaN<NA>
248743322016-02-22 10:40:09False2016-02-22 12:00:00False2016-02-23 23:55:23FalseNaT<NA>1550250180022016-02-22 11:31:03FalseNaTTrue600True0962016-02-22 10:48:24True2016-02-22 11:36:04False2016-02-22 11:37:00False1942016-02-22 11:40:072016-02-22 11:42:12TrueTrueNaT<NA>52016-02-22 12:11:172016-02-22 12:15:41TrueFalseFalse2281NaN<NA>
348744432016-02-22 10:56:16False2016-02-22 12:00:00False2016-02-23 23:25:47FalseNaT<NA>2000250225032016-02-22 11:31:02False2016-02-22 11:42:00False600True0572016-02-22 11:31:02True2016-02-22 11:31:03False2016-02-22 11:31:30False2562016-02-22 11:54:312016-02-22 11:54:32FalseFalseNaT<NA>72016-02-22 12:03:142016-02-22 12:07:44TrueFalseFalse2449NaN<NA>
447050342016-02-22 11:11:46False2016-02-22 12:30:00False2016-02-22 12:25:50FalseNaT<NA>98002501005042016-02-22 11:32:04False2016-02-22 11:48:01False2700True039NaTTrue2016-02-22 11:50:03False2016-02-22 11:50:29False10082016-02-22 11:57:482016-02-22 12:11:26FalseFalseNaT<NA>92016-02-22 12:12:532016-02-22 12:18:43FalseFalseFalse196NaN<NA>
548743952016-02-22 11:35:58False2016-02-22 12:30:00False2016-02-22 12:28:19FalseNaT<NA>3600250385052016-02-22 11:50:03False2016-02-22 11:50:19False1200True0128NaTTrue2016-02-22 11:50:04False2016-02-22 11:50:28False45102016-02-22 11:55:122016-02-22 12:05:19FalseFalseNaT<NA>112016-02-22 12:16:532016-02-22 12:20:44FalseFalseFalse2584NaN<NA>
\n", "
" ], "text/plain": [ " delivery_id customer_id placed_at ad_hoc \\\n", "id \n", "1 487377 1 2016-02-22 10:42:01 False \n", "2 487433 2 2016-02-22 10:40:09 False \n", "3 487444 3 2016-02-22 10:56:16 False \n", "4 470503 4 2016-02-22 11:11:46 False \n", "5 487439 5 2016-02-22 11:35:58 False \n", "\n", " scheduled_delivery_at scheduled_delivery_at_corrected \\\n", "id \n", "1 2016-02-22 12:30:00 False \n", "2 2016-02-22 12:00:00 False \n", "3 2016-02-22 12:00:00 False \n", "4 2016-02-22 12:30:00 False \n", "5 2016-02-22 12:30:00 False \n", "\n", " first_estimated_delivery_at cancelled cancelled_at cancelled_at_corrected \\\n", "id \n", "1 NaT True NaT True \n", "2 2016-02-23 23:55:23 False NaT \n", "3 2016-02-23 23:25:47 False NaT \n", "4 2016-02-22 12:25:50 False NaT \n", "5 2016-02-22 12:28:19 False NaT \n", "\n", " sub_total delivery_fee total restaurant_id restaurant_notified_at \\\n", "id \n", "1 1250 250 1500 1 NaT \n", "2 1550 250 1800 2 2016-02-22 11:31:03 \n", "3 2000 250 2250 3 2016-02-22 11:31:02 \n", "4 9800 250 10050 4 2016-02-22 11:32:04 \n", "5 3600 250 3850 5 2016-02-22 11:50:03 \n", "\n", " restaurant_notified_at_corrected restaurant_confirmed_at \\\n", "id \n", "1 NaT \n", "2 False NaT \n", "3 False 2016-02-22 11:42:00 \n", "4 False 2016-02-22 11:48:01 \n", "5 False 2016-02-22 11:50:19 \n", "\n", " restaurant_confirmed_at_corrected estimated_prep_duration \\\n", "id \n", "1 \n", "2 True 600 \n", "3 False 600 \n", "4 False 2700 \n", "5 False 1200 \n", "\n", " estimated_prep_duration_corrected estimated_prep_buffer courier_id \\\n", "id \n", "1 0 \n", "2 True 0 96 \n", "3 True 0 57 \n", "4 True 0 39 \n", "5 True 0 128 \n", "\n", " dispatch_at dispatch_at_corrected courier_notified_at \\\n", "id \n", "1 NaT NaT \n", "2 2016-02-22 10:48:24 True 2016-02-22 11:36:04 \n", "3 2016-02-22 11:31:02 True 2016-02-22 11:31:03 \n", "4 NaT True 2016-02-22 11:50:03 \n", "5 NaT True 2016-02-22 11:50:04 \n", "\n", " courier_notified_at_corrected courier_accepted_at \\\n", "id \n", "1 NaT \n", "2 False 2016-02-22 11:37:00 \n", "3 False 2016-02-22 11:31:30 \n", "4 False 2016-02-22 11:50:29 \n", "5 False 2016-02-22 11:50:28 \n", "\n", " courier_accepted_at_corrected utilization pickup_address_id \\\n", "id \n", "1 15 2 \n", "2 False 19 4 \n", "3 False 25 6 \n", "4 False 100 8 \n", "5 False 45 10 \n", "\n", " reached_pickup_at pickup_at pickup_at_corrected \\\n", "id \n", "1 NaT NaT \n", "2 2016-02-22 11:40:07 2016-02-22 11:42:12 True \n", "3 2016-02-22 11:54:31 2016-02-22 11:54:32 False \n", "4 2016-02-22 11:57:48 2016-02-22 12:11:26 False \n", "5 2016-02-22 11:55:12 2016-02-22 12:05:19 False \n", "\n", " pickup_not_confirmed left_pickup_at left_pickup_at_corrected \\\n", "id \n", "1 NaT \n", "2 True NaT \n", "3 False NaT \n", "4 False NaT \n", "5 False NaT \n", "\n", " delivery_address_id reached_delivery_at delivery_at \\\n", "id \n", "1 3 NaT NaT \n", "2 5 2016-02-22 12:11:17 2016-02-22 12:15:41 \n", "3 7 2016-02-22 12:03:14 2016-02-22 12:07:44 \n", "4 9 2016-02-22 12:12:53 2016-02-22 12:18:43 \n", "5 11 2016-02-22 12:16:53 2016-02-22 12:20:44 \n", "\n", " delivery_at_corrected delivery_not_confirmed courier_waited_at_delivery \\\n", "id \n", "1 \n", "2 True False False \n", "3 True False False \n", "4 False False False \n", "5 False False False \n", "\n", " logged_delivery_distance logged_avg_speed logged_avg_speed_distance \n", "id \n", "1 671 NaN \n", "2 2281 NaN \n", "3 2449 NaN \n", "4 196 NaN \n", "5 2584 NaN " ] }, "execution_count": 195, "metadata": {}, "output_type": "execute_result" } ], "source": [ "orders.head()" ] }, { "cell_type": "code", "execution_count": 196, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 660658 entries, 1 to 688690\n", "Data columns (total 45 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 delivery_id 660658 non-null int64 \n", " 1 customer_id 660658 non-null int64 \n", " 2 placed_at 660658 non-null datetime64[ns]\n", " 3 ad_hoc 660658 non-null bool \n", " 4 scheduled_delivery_at 80121 non-null datetime64[ns]\n", " 5 scheduled_delivery_at_corrected 83398 non-null object \n", " 6 first_estimated_delivery_at 657168 non-null datetime64[ns]\n", " 7 cancelled 660658 non-null bool \n", " 8 cancelled_at 14081 non-null datetime64[ns]\n", " 9 cancelled_at_corrected 23552 non-null object \n", " 10 sub_total 660658 non-null int64 \n", " 11 delivery_fee 660658 non-null int64 \n", " 12 total 660658 non-null int64 \n", " 13 restaurant_id 660658 non-null int64 \n", " 14 restaurant_notified_at 655847 non-null datetime64[ns]\n", " 15 restaurant_notified_at_corrected 655982 non-null object \n", " 16 restaurant_confirmed_at 634974 non-null datetime64[ns]\n", " 17 restaurant_confirmed_at_corrected 639102 non-null object \n", " 18 estimated_prep_duration 660385 non-null Int64 \n", " 19 estimated_prep_duration_corrected 660385 non-null object \n", " 20 estimated_prep_buffer 660658 non-null int64 \n", " 21 courier_id 648239 non-null Int64 \n", " 22 dispatch_at 648608 non-null datetime64[ns]\n", " 23 dispatch_at_corrected 656301 non-null object \n", " 24 courier_notified_at 314281 non-null datetime64[ns]\n", " 25 courier_notified_at_corrected 481307 non-null object \n", " 26 courier_accepted_at 650348 non-null datetime64[ns]\n", " 27 courier_accepted_at_corrected 650363 non-null object \n", " 28 utilization 660658 non-null int64 \n", " 29 pickup_address_id 660658 non-null int64 \n", " 30 reached_pickup_at 530724 non-null datetime64[ns]\n", " 31 pickup_at 637684 non-null datetime64[ns]\n", " 32 pickup_at_corrected 637700 non-null object \n", " 33 pickup_not_confirmed 637684 non-null object \n", " 34 left_pickup_at 293397 non-null datetime64[ns]\n", " 35 left_pickup_at_corrected 312136 non-null object \n", " 36 delivery_address_id 660658 non-null int64 \n", " 37 reached_delivery_at 608160 non-null datetime64[ns]\n", " 38 delivery_at 637167 non-null datetime64[ns]\n", " 39 delivery_at_corrected 637167 non-null object \n", " 40 delivery_not_confirmed 637167 non-null object \n", " 41 courier_waited_at_delivery 637167 non-null object \n", " 42 logged_delivery_distance 660641 non-null object \n", " 43 logged_avg_speed 213812 non-null float64 \n", " 44 logged_avg_speed_distance 213812 non-null Int64 \n", "dtypes: Int64(3), bool(2), datetime64[ns](14), float64(1), int64(10), object(15)\n", "memory usage: 224.9+ MB\n" ] } ], "source": [ "orders.info()" ] }, { "cell_type": "code", "execution_count": 197, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "scheduled_delivery_at_corrected 3321\n", "cancelled_at_corrected 9569\n", "restaurant_notified_at_corrected 2862\n", "restaurant_confirmed_at_corrected 4128\n", "estimated_prep_duration_corrected 108591\n", "dispatch_at_corrected 69241\n", "courier_notified_at_corrected 216869\n", "courier_accepted_at_corrected 23\n", "pickup_at_corrected 38032\n", "left_pickup_at_corrected 18739\n", "delivery_at_corrected 54363\n" ] } ], "source": [ "for column in orders.columns:\n", " if column.endswith(\"corrected\"):\n", " print(column, (orders[column] == True).sum())" ] }, { "cell_type": "code", "execution_count": 198, "metadata": {}, "outputs": [], "source": [ "assert (\n", " hashlib.sha256(orders.to_json().encode()).hexdigest()\n", " == \"c548084f094bd220f3aff7e9b7072a4964127f6962dffd54f21c8d1f5b846a7f\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All couriers had at least one order." ] }, { "cell_type": "code", "execution_count": 199, "metadata": {}, "outputs": [], "source": [ "assert set(couriers.reset_index()[\"id\"]) == set(\n", " orders.loc[orders[\"courier_id\"].notnull(), \"courier_id\"].unique()\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Only keep restaurants that had at least one order." ] }, { "cell_type": "code", "execution_count": 200, "metadata": {}, "outputs": [], "source": [ "restaurants = restaurants.reset_index()\n", "msk = restaurants[\"id\"].isin(orders[\"restaurant_id\"].unique())\n", "restaurants = restaurants[msk].set_index(\"id\")\n", "\n", "assert (~msk).sum() == 6" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Only keep addresses with pickups or deliveries." ] }, { "cell_type": "code", "execution_count": 201, "metadata": {}, "outputs": [], "source": [ "addresses = addresses.reset_index()\n", "msk = addresses[\"id\"].isin(\n", " set(restaurants[\"address_id\"])\n", " | set(orders[\"pickup_address_id\"])\n", " | set(orders[\"delivery_address_id\"])\n", ")\n", "addresses = addresses[msk].set_index(\"id\")\n", "\n", "assert (~msk).sum() == 100\n", "\n", "discarded_addresses = set(addresses[\"primary_id\"]) - set(addresses.reset_index()[\"id\"])\n", "for old_primary_id in set(addresses[\"primary_id\"]) - set(addresses.reset_index()[\"id\"]):\n", " msk = addresses[\"primary_id\"] == old_primary_id\n", " new_primary_id = addresses[msk].index.min()\n", " addresses.loc[msk, \"primary_id\"] = new_primary_id" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Store the Results" ] }, { "cell_type": "code", "execution_count": 202, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'clean'" ] }, "execution_count": 202, "metadata": {}, "output_type": "execute_result" } ], "source": [ "config.CLEAN_SCHEMA" ] }, { "cell_type": "code", "execution_count": 203, "metadata": {}, "outputs": [], "source": [ "cities.to_sql(\n", " \"cities\",\n", " con=connection,\n", " schema=config.CLEAN_SCHEMA,\n", " if_exists=\"append\",\n", " index=True,\n", ")" ] }, { "cell_type": "code", "execution_count": 204, "metadata": {}, "outputs": [], "source": [ "addresses.to_sql(\n", " \"addresses\",\n", " con=connection,\n", " schema=config.CLEAN_SCHEMA,\n", " if_exists=\"append\",\n", " index=True,\n", ")" ] }, { "cell_type": "code", "execution_count": 205, "metadata": {}, "outputs": [], "source": [ "restaurants.to_sql(\n", " \"restaurants\",\n", " con=connection,\n", " schema=config.CLEAN_SCHEMA,\n", " if_exists=\"append\",\n", " index=True,\n", ")" ] }, { "cell_type": "code", "execution_count": 206, "metadata": {}, "outputs": [], "source": [ "couriers.to_sql(\n", " \"couriers\",\n", " con=connection,\n", " schema=config.CLEAN_SCHEMA,\n", " if_exists=\"append\",\n", " index=True,\n", ")" ] }, { "cell_type": "code", "execution_count": 207, "metadata": {}, "outputs": [], "source": [ "customers = pd.DataFrame({\"id\": orders[\"customer_id\"].unique()}).set_index(\"id\")\n", "\n", "customers.to_sql(\n", " \"customers\",\n", " con=connection,\n", " schema=config.CLEAN_SCHEMA,\n", " if_exists=\"append\",\n", " index=True,\n", ")" ] }, { "cell_type": "code", "execution_count": 208, "metadata": {}, "outputs": [], "source": [ "orders.to_sql(\n", " \"orders\",\n", " con=connection,\n", " schema=config.CLEAN_SCHEMA,\n", " if_exists=\"append\",\n", " index=True,\n", ")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.6" } }, "nbformat": 4, "nbformat_minor": 4 }