Create notebook for the third application of tidying

Alexander Hess 2018-08-26 12:54:34 +02:00
commit 442a541ad5

@ -0,0 +1,994 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Variables Are Stored in Both Rows and Columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## \"Housekeeping\""
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2018-08-26 12:56:31 CEST\n",
"\n",
"CPython 3.6.5\n",
"IPython 6.5.0\n",
"\n",
"numpy 1.15.1\n",
"pandas 0.23.4\n"
]
}
],
"source": [
"%load_ext watermark\n",
"%watermark -d -t -v -z -p numpy,pandas"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"pd.set_option('display.max_columns', 40)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example: Weather\n",
"\n",
"The [Global Historical Climatology Network](https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-ghcn) collects daily weather observations. This example uses the data for one weather station (MX17004) in Mexico."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load the Data\n",
"\n",
"The raw dataset comes in a format that mixes a fixed-width layout with the occasional use of characters as separators. Some tedious cleaning work is necessary."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Extract the data as one column and\n",
"# use string slicing to obtain groups of columns.\n",
"weather = pd.read_csv('data/weather.txt', header=None, sep='^')\n",
"\n",
"# First, remove the weird character separators,\n",
"# then split the columns by whitespace, and\n",
"# finally name them appropriately.\n",
"days = (\n",
" weather[0]\n",
" .map(lambda x: x[21:]).str.replace('OI', ' ')\n",
" .str.replace('OS', ' ').str.replace('SI', ' ').str.replace('I', ' ')\n",
" .str.replace('S', ' ').str.replace('B', ' ').str.replace('D', ' ')\n",
" .map(str.lstrip).str.split(r'\\s+', expand=True)\n",
")[list(range(31))].rename(columns={i: f'd{i+1}' for i in range(31)})\n",
"\n",
"# The non-temperature columns can be extracted as simple slices.\n",
"weather = pd.DataFrame(data={\n",
" 'id': weather[0].map(lambda x: x[:11]),\n",
" 'year': weather[0].map(lambda x: x[11:15]).astype(int),\n",
" 'month': weather[0].map(lambda x: x[15:17]).astype(int),\n",
" 'element': weather[0].map(lambda x: x[17:21]).str.lower(),\n",
"})\n",
"\n",
"# The temperatures are stored as integers in tenths of a degree\n",
"# Celsius, with -9999 indicating missing values.\n",
"for i in range(1, 32):\n",
" weather[f'd{i}'] = days[f'd{i}'].astype(float) / 10\n",
"weather = weather.replace(-999.9, np.nan)\n",
"\n",
"# Discard the non-temperature observations and\n",
"# sort the dataset as in the paper.\n",
"weather = (\n",
" weather[weather['element'].isin(['tmax', 'tmin'])]\n",
" .sort_values(['id', 'year', 'month', 'element'])\n",
" .reset_index(drop=True)\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Messy Data\n",
"\n",
"The dataset below is assumed to have been provided \"raw\" in this shape, i.e., the parsing above was done by some third party and not by the data analyst.\n",
"\n",
"> The most complicated form of messy data occurs when variables are stored in both rows and columns. Table 11 shows daily weather data from the Global Historical Climatology Network for one weather station (MX17004) in Mexico for five months in 2010. It has variables in individual columns (*id*, *year*, *month*), spread across columns (day, d1–d31) and across rows (*tmin*, *tmax*) (minimum and maximum temperature). Months with less than 31 days have structural missing values for the last day(s) of the month. The *element* column is not a variable; it stores the names of variables."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>year</th>\n",
" <th>month</th>\n",
" <th>element</th>\n",
" <th>d1</th>\n",
" <th>d2</th>\n",
" <th>d3</th>\n",
" <th>d4</th>\n",
" <th>d5</th>\n",
" <th>d6</th>\n",
" <th>d7</th>\n",
" <th>d8</th>\n",
" <th>d9</th>\n",
" <th>d10</th>\n",
" <th>d11</th>\n",
" <th>d12</th>\n",
" <th>d13</th>\n",
" <th>d14</th>\n",
" <th>d15</th>\n",
" <th>d16</th>\n",
" <th>d17</th>\n",
" <th>d18</th>\n",
" <th>d19</th>\n",
" <th>d20</th>\n",
" <th>d21</th>\n",
" <th>d22</th>\n",
" <th>d23</th>\n",
" <th>d24</th>\n",
" <th>d25</th>\n",
" <th>d26</th>\n",
" <th>d27</th>\n",
" <th>d28</th>\n",
" <th>d29</th>\n",
" <th>d30</th>\n",
" <th>d31</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1099</th>\n",
" <td>MX000017004</td>\n",
" <td>2010</td>\n",
" <td>1</td>\n",
" <td>tmax</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>27.8</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1100</th>\n",
" <td>MX000017004</td>\n",
" <td>2010</td>\n",
" <td>1</td>\n",
" <td>tmin</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>14.5</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1101</th>\n",
" <td>MX000017004</td>\n",
" <td>2010</td>\n",
" <td>2</td>\n",
" <td>tmax</td>\n",
" <td>NaN</td>\n",
" <td>27.3</td>\n",
" <td>24.1</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>29.7</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>29.9</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1102</th>\n",
" <td>MX000017004</td>\n",
" <td>2010</td>\n",
" <td>2</td>\n",
" <td>tmin</td>\n",
" <td>NaN</td>\n",
" <td>14.4</td>\n",
" <td>14.4</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>13.4</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>10.7</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1103</th>\n",
" <td>MX000017004</td>\n",
" <td>2010</td>\n",
" <td>3</td>\n",
" <td>tmax</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>32.1</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>34.5</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>31.1</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1104</th>\n",
" <td>MX000017004</td>\n",
" <td>2010</td>\n",
" <td>3</td>\n",
" <td>tmin</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>14.2</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>16.8</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>17.6</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1105</th>\n",
" <td>MX000017004</td>\n",
" <td>2010</td>\n",
" <td>4</td>\n",
" <td>tmax</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>36.3</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1106</th>\n",
" <td>MX000017004</td>\n",
" <td>2010</td>\n",
" <td>4</td>\n",
" <td>tmin</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>16.7</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1107</th>\n",
" <td>MX000017004</td>\n",
" <td>2010</td>\n",
" <td>5</td>\n",
" <td>tmax</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>33.2</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1108</th>\n",
" <td>MX000017004</td>\n",
" <td>2010</td>\n",
" <td>5</td>\n",
" <td>tmin</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>18.2</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id year month element d1 d2 d3 d4 d5 d6 d7 d8 \\\n",
"1099 MX000017004 2010 1 tmax NaN NaN NaN NaN NaN NaN NaN NaN \n",
"1100 MX000017004 2010 1 tmin NaN NaN NaN NaN NaN NaN NaN NaN \n",
"1101 MX000017004 2010 2 tmax NaN 27.3 24.1 NaN NaN NaN NaN NaN \n",
"1102 MX000017004 2010 2 tmin NaN 14.4 14.4 NaN NaN NaN NaN NaN \n",
"1103 MX000017004 2010 3 tmax NaN NaN NaN NaN 32.1 NaN NaN NaN \n",
"1104 MX000017004 2010 3 tmin NaN NaN NaN NaN 14.2 NaN NaN NaN \n",
"1105 MX000017004 2010 4 tmax NaN NaN NaN NaN NaN NaN NaN NaN \n",
"1106 MX000017004 2010 4 tmin NaN NaN NaN NaN NaN NaN NaN NaN \n",
"1107 MX000017004 2010 5 tmax NaN NaN NaN NaN NaN NaN NaN NaN \n",
"1108 MX000017004 2010 5 tmin NaN NaN NaN NaN NaN NaN NaN NaN \n",
"\n",
" d9 d10 d11 d12 d13 d14 d15 d16 d17 d18 d19 d20 d21 d22 \\\n",
"1099 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"1100 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"1101 NaN NaN 29.7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"1102 NaN NaN 13.4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"1103 NaN 34.5 NaN NaN NaN NaN NaN 31.1 NaN NaN NaN NaN NaN NaN \n",
"1104 NaN 16.8 NaN NaN NaN NaN NaN 17.6 NaN NaN NaN NaN NaN NaN \n",
"1105 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"1106 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"1107 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"1108 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"\n",
" d23 d24 d25 d26 d27 d28 d29 d30 d31 \n",
"1099 NaN NaN NaN NaN NaN NaN NaN 27.8 NaN \n",
"1100 NaN NaN NaN NaN NaN NaN NaN 14.5 NaN \n",
"1101 29.9 NaN NaN NaN NaN NaN NaN NaN NaN \n",
"1102 10.7 NaN NaN NaN NaN NaN NaN NaN NaN \n",
"1103 NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"1104 NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"1105 NaN NaN NaN NaN 36.3 NaN NaN NaN NaN \n",
"1106 NaN NaN NaN NaN 16.7 NaN NaN NaN NaN \n",
"1107 NaN NaN NaN NaN 33.2 NaN NaN NaN NaN \n",
"1108 NaN NaN NaN NaN 18.2 NaN NaN NaN NaN "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"weather[(weather['year'] == 2010)].head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Molten Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> To tidy this dataset we first melt it with colvars *id*, *year*, *month* and the column that contains variable names, *element* [...]. For presentation, we have dropped the missing values, making them implicit rather than explicit. This is permissible because we know how many days are in each month and can easily reconstruct the explicit missing values."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Melt the dataset and extract a date column.\n",
"molten_weather = (\n",
" pd.melt(weather, id_vars=['id', 'year', 'month', 'element'], var_name='day')\n",
" .assign(day=lambda x: x['day'].str.extract(r'(\\d+)', expand=False).astype(int))\n",
" .assign(date=lambda x: pd.to_datetime(x[['year', 'month', 'day']], errors='coerce'))\n",
")[['id', 'date', 'element', 'value']]\n",
"\n",
"# Make the missing values implicit.\n",
"molten_weather = molten_weather[molten_weather['value'].notnull()]\n",
"\n",
"# Sort the data as in the paper.\n",
"molten_weather = (\n",
" molten_weather\n",
" .sort_values(['id', 'date', 'element'])\n",
" .reset_index(drop=True)\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> This dataset is mostly tidy, but we have two variables stored in rows: *tmin* and *tmax*, the type of observation."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>date</th>\n",
" <th>element</th>\n",
" <th>value</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>23183</th>\n",
" <td>MX000017004</td>\n",
" <td>2010-01-30</td>\n",
" <td>tmax</td>\n",
" <td>27.8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23184</th>\n",
" <td>MX000017004</td>\n",
" <td>2010-01-30</td>\n",
" <td>tmin</td>\n",
" <td>14.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23185</th>\n",
" <td>MX000017004</td>\n",
" <td>2010-02-02</td>\n",
" <td>tmax</td>\n",
" <td>27.3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23186</th>\n",
" <td>MX000017004</td>\n",
" <td>2010-02-02</td>\n",
" <td>tmin</td>\n",
" <td>14.4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23187</th>\n",
" <td>MX000017004</td>\n",
" <td>2010-02-03</td>\n",
" <td>tmax</td>\n",
" <td>24.1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23188</th>\n",
" <td>MX000017004</td>\n",
" <td>2010-02-03</td>\n",
" <td>tmin</td>\n",
" <td>14.4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23189</th>\n",
" <td>MX000017004</td>\n",
" <td>2010-02-11</td>\n",
" <td>tmax</td>\n",
" <td>29.7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23190</th>\n",
" <td>MX000017004</td>\n",
" <td>2010-02-11</td>\n",
" <td>tmin</td>\n",
" <td>13.4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23191</th>\n",
" <td>MX000017004</td>\n",
" <td>2010-02-23</td>\n",
" <td>tmax</td>\n",
" <td>29.9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23192</th>\n",
" <td>MX000017004</td>\n",
" <td>2010-02-23</td>\n",
" <td>tmin</td>\n",
" <td>10.7</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id date element value\n",
"23183 MX000017004 2010-01-30 tmax 27.8\n",
"23184 MX000017004 2010-01-30 tmin 14.5\n",
"23185 MX000017004 2010-02-02 tmax 27.3\n",
"23186 MX000017004 2010-02-02 tmin 14.4\n",
"23187 MX000017004 2010-02-03 tmax 24.1\n",
"23188 MX000017004 2010-02-03 tmin 14.4\n",
"23189 MX000017004 2010-02-11 tmax 29.7\n",
"23190 MX000017004 2010-02-11 tmin 13.4\n",
"23191 MX000017004 2010-02-23 tmax 29.9\n",
"23192 MX000017004 2010-02-23 tmin 10.7"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"molten_weather[(molten_weather['date'].dt.year == 2010)].head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Tidy Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> Fixing this requires the cast, or unstack, operation. This performs the inverse of melting by rotating the element variable back out into the columns.\n",
"\n",
"Note that the [pd.DataFrame.unstack](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.unstack.html) method rotates a level of a DataFrame's row index into the columns, which is why the identifying columns are first moved into the index with `set_index`."
]
},
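{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# Move the identifying columns into the index and\n",
"# rotate the 'element' level back out into the columns.\n",
"tidy_weather = molten_weather.set_index(['id', 'date', 'element']).unstack('element')\n",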
"\n",
"# Make the column headers look as in the paper.\n",
"tidy_weather.columns = tidy_weather.columns.droplevel(0)\n",
"tidy_weather.columns.name = None\n",
"tidy_weather = tidy_weather.reset_index()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> This form is tidy. There is one variable in each column, and each row represents a day's observations."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>date</th>\n",
" <th>tmax</th>\n",
" <th>tmin</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>12087</th>\n",
" <td>MX000017004</td>\n",
" <td>2010-01-30</td>\n",
" <td>27.8</td>\n",
" <td>14.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12088</th>\n",
" <td>MX000017004</td>\n",
" <td>2010-02-02</td>\n",
" <td>27.3</td>\n",
" <td>14.4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12089</th>\n",
" <td>MX000017004</td>\n",
" <td>2010-02-03</td>\n",
" <td>24.1</td>\n",
" <td>14.4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12090</th>\n",
" <td>MX000017004</td>\n",
" <td>2010-02-11</td>\n",
" <td>29.7</td>\n",
" <td>13.4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12091</th>\n",
" <td>MX000017004</td>\n",
" <td>2010-02-23</td>\n",
" <td>29.9</td>\n",
" <td>10.7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12092</th>\n",
" <td>MX000017004</td>\n",
" <td>2010-03-05</td>\n",
" <td>32.1</td>\n",
" <td>14.2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12093</th>\n",
" <td>MX000017004</td>\n",
" <td>2010-03-10</td>\n",
" <td>34.5</td>\n",
" <td>16.8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12094</th>\n",
" <td>MX000017004</td>\n",
" <td>2010-03-16</td>\n",
" <td>31.1</td>\n",
" <td>17.6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12095</th>\n",
" <td>MX000017004</td>\n",
" <td>2010-04-27</td>\n",
" <td>36.3</td>\n",
" <td>16.7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12096</th>\n",
" <td>MX000017004</td>\n",
" <td>2010-05-27</td>\n",
" <td>33.2</td>\n",
" <td>18.2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id date tmax tmin\n",
"12087 MX000017004 2010-01-30 27.8 14.5\n",
"12088 MX000017004 2010-02-02 27.3 14.4\n",
"12089 MX000017004 2010-02-03 24.1 14.4\n",
"12090 MX000017004 2010-02-11 29.7 13.4\n",
"12091 MX000017004 2010-02-23 29.9 10.7\n",
"12092 MX000017004 2010-03-05 32.1 14.2\n",
"12093 MX000017004 2010-03-10 34.5 16.8\n",
"12094 MX000017004 2010-03-16 31.1 17.6\n",
"12095 MX000017004 2010-04-27 36.3 16.7\n",
"12096 MX000017004 2010-05-27 33.2 18.2"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tidy_weather[(tidy_weather['date'].dt.year == 2010)].head(10)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
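
The melt-then-unstack reshaping done in the notebook above can also be spelled out without pandas. The following is a minimal sketch under stated assumptions, not the notebook's implementation: the `messy` rows are a hand-made miniature of the table (the values are copied from the February 2010 rows in the notebook output), and `melt` and `unstack` are hypothetical plain-Python helpers that only mirror conceptually what `pd.melt` and `DataFrame.unstack` do.

```python
from datetime import date

# Hypothetical miniature of the messy table: one row per
# (id, year, month, element) and one key per day; None marks a missing value.
messy = [
    {"id": "MX000017004", "year": 2010, "month": 2, "element": "tmax",
     "d2": 27.3, "d3": 24.1, "d30": None},
    {"id": "MX000017004", "year": 2010, "month": 2, "element": "tmin",
     "d2": 14.4, "d3": 14.4, "d30": None},
]

ID_VARS = ("id", "year", "month", "element")


def melt(rows):
    """Turn every day column into its own (id, date, element, value) record."""
    molten = []
    for row in rows:
        for column, value in row.items():
            if column in ID_VARS or value is None:
                continue  # missing values become implicit
            day = int(column[1:])  # 'd23' -> 23
            try:
                when = date(row["year"], row["month"], day)
            except ValueError:
                continue  # impossible date such as February 30
            molten.append((row["id"], when, row["element"], value))
    return sorted(molten)


def unstack(molten):
    """Rotate the element names back out into columns, one record per day."""
    tidy = {}
    for station, when, element, value in molten:
        tidy.setdefault((station, when), {})[element] = value
    return [
        {"id": station, "date": when, **measurements}
        for (station, when), measurements in sorted(tidy.items())
    ]


molten = melt(messy)
tidy = unstack(molten)
for record in tidy:
    print(record)
```

Each printed record holds one day with `tmax` and `tmin` side by side, which is the same shape as `tidy_weather`. Skipping impossible dates in `melt` plays the role that `errors='coerce'` plays in the notebook's `pd.to_datetime` call.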