{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Column Headers are Values, not Variable Names\n",
"\n",
"This notebook shows two examples of how column headers display values. These type of messy datasets have practical use in two types of settings:\n",
"\n",
"1. Presentations\n",
"2. Recordings of regularly spaced observations over time"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## \"Housekeeping\""
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2018-08-26 14:39:56 CEST\n",
"\n",
"CPython 3.6.5\n",
"IPython 6.5.0\n",
"\n",
"numpy 1.15.1\n",
"pandas 0.23.4\n"
]
}
],
"source": [
"% load_ext watermark\n",
"% watermark -d -t -v -z -p numpy,pandas"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import datetime\n",
"import re\n",
"\n",
"import pandas as pd\n",
"import savReaderWriter as spss"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example 1: Religion vs. Income\n",
"\n",
"> A common type of messy dataset is tabular data designed for **presentation**, where variables\n",
"form both the rows and columns, and column headers are values, not variable names.\n",
"\n",
"The [Pew Research Center](http://www.pewresearch.org/) provides many studies on all kinds of aspects of life in the USA. The following examples uses data taken from its [Religious Landscape Study](http://www.pewforum.org/religious-landscape-study/)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load the Data\n",
"\n",
"The data are provided as a SPSS data file. This is a binary specification with a built-in header section describing the data, for example, what variables / columns are included and what the realizations categorical data can have."
]
},
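  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick orientation (a minimal sketch, not needed for the rest of the notebook), the header section can be inspected with *savReaderWriter* before any further processing:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Peek at the SPSS file's header section: the first few variable names\n",
    "# and the first few variables that come with labeled (categorical) values.\n",
    "with spss.SavHeaderReader('data/pew.sav') as header:\n",
    "    meta = header.all()\n",
    "print(meta.varNames[:5])\n",
    "print(list(meta.valueLabels)[:5])"
   ]
  },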
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load the dataset's meta data."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"columns = ['q16', 'reltrad', 'income']\n",
"encodings = {}\n",
"\n",
"# For sake of simplicity all data cleaning operations\n",
"# are done within the for-loop for all columns.\n",
"with spss.SavHeaderReader('data/pew.sav') as pew:\n",
" for c in columns:\n",
" encodings[c] = {\n",
" int(k): (\n",
" re.sub(r'\\(.*\\)', '', (\n",
" v.decode('iso-8859-1')\n",
" .replace('\\x92', \"'\")\n",
" .replace(' Churches', '')\n",
" .replace('Less than $10,000', '<$10k')\n",
" .replace('10 to under $20,000', '$10-20k')\n",
" .replace('20 to under $30,000', '$20-30k')\n",
" .replace('30 to under $40,000', '$30-40k')\n",
" .replace('40 to under $50,000', '$40-50k')\n",
" .replace('50 to under $75,000', '$50-75k')\n",
" .replace('75 to under $100,000', '$75-100k')\n",
" .replace('100 to under $150,000', '$100-150k')\n",
" .replace('$150,000 or more', '>150k')\n",
" ),\n",
" ).strip()\n",
" )\n",
" for (k, v) in pew.all().valueLabels[c.encode()].items()\n",
" }"
]
},
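  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what the loop produced, the next cell displays one of the resulting mappings from numeric codes to cleaned labels (purely illustrative):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# One of the three mappings built above: numeric code -> cleaned label.\n",
    "encodings['income']"
   ]
  },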
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load the actual data and prepare them as they are presented in the paper."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"with spss.SavReader('data/pew.sav', selectVars=[c.encode() for c in columns]) as pew:\n",
" pew = list(pew)\n",
"\n",
"# Use the above encodings to map the numeric data\n",
"# to the actual labels.\n",
"pew = pd.DataFrame(pew, columns=columns, dtype=int)\n",
"for c in columns:\n",
" pew[c] = pew[c].map(encodings[c])\n",
"\n",
"for v in ('Atheist', 'Agnostic'):\n",
" pew.loc[(pew['q16'] == v), 'reltrad'] = v\n",
"\n",
"income_columns = ['<$10k', '$10-20k', '$20-30k', '$30-40k', '$40-50k', '$50-75k',\n",
" '$75-100k', '$100-150k', '>150k', 'Don\\'t know/Refused']\n",
"\n",
"pew = pew.groupby(['reltrad', 'income']).size().unstack('income')\n",
"\n",
"pew = pew[income_columns]\n",
"pew.index.name = 'religion'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Messy Data\n",
"\n",
"The next cell shows the data as they can actually be provided as \"raw\" data (i.e., the pre-processing as done above is assumed to be done by someone else and the data analyst is only presented with the below dataset)."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(18, 10)"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pew.shape"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>income</th>\n",
" <th><$10k</th>\n",
" <th>$10-20k</th>\n",
" <th>$20-30k</th>\n",
" <th>$30-40k</th>\n",
" <th>$40-50k</th>\n",
" <th>$50-75k</th>\n",
" <th>$75-100k</th>\n",
" <th>$100-150k</th>\n",
" <th>>150k</th>\n",
" <th>Don't know/Refused</th>\n",
" </tr>\n",
" <tr>\n",
" <th>religion</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Agnostic</th>\n",
" <td>27</td>\n",
" <td>34</td>\n",
" <td>60</td>\n",
" <td>81</td>\n",
" <td>76</td>\n",
" <td>137</td>\n",
" <td>122</td>\n",
" <td>109</td>\n",
" <td>84</td>\n",
" <td>96</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Atheist</th>\n",
" <td>12</td>\n",
" <td>27</td>\n",
" <td>37</td>\n",
" <td>52</td>\n",
" <td>35</td>\n",
" <td>70</td>\n",
" <td>73</td>\n",
" <td>59</td>\n",
" <td>74</td>\n",
" <td>76</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Buddhist</th>\n",
" <td>27</td>\n",
" <td>21</td>\n",
" <td>30</td>\n",
" <td>34</td>\n",
" <td>33</td>\n",
" <td>58</td>\n",
" <td>62</td>\n",
" <td>39</td>\n",
" <td>53</td>\n",
" <td>54</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Catholic</th>\n",
" <td>418</td>\n",
" <td>617</td>\n",
" <td>732</td>\n",
" <td>670</td>\n",
" <td>638</td>\n",
" <td>1116</td>\n",
" <td>949</td>\n",
" <td>792</td>\n",
" <td>633</td>\n",
" <td>1489</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Don't know/refused</th>\n",
" <td>15</td>\n",
" <td>14</td>\n",
" <td>15</td>\n",
" <td>11</td>\n",
" <td>10</td>\n",
" <td>35</td>\n",
" <td>21</td>\n",
" <td>17</td>\n",
" <td>18</td>\n",
" <td>116</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Evangelical Protestant</th>\n",
" <td>575</td>\n",
" <td>869</td>\n",
" <td>1064</td>\n",
" <td>982</td>\n",
" <td>881</td>\n",
" <td>1486</td>\n",
" <td>949</td>\n",
" <td>723</td>\n",
" <td>414</td>\n",
" <td>1529</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Hindu</th>\n",
" <td>1</td>\n",
" <td>9</td>\n",
" <td>7</td>\n",
" <td>9</td>\n",
" <td>11</td>\n",
" <td>34</td>\n",
" <td>47</td>\n",
" <td>48</td>\n",
" <td>54</td>\n",
" <td>37</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Historically Black Protestant</th>\n",
" <td>228</td>\n",
" <td>244</td>\n",
" <td>236</td>\n",
" <td>238</td>\n",
" <td>197</td>\n",
" <td>223</td>\n",
" <td>131</td>\n",
" <td>81</td>\n",
" <td>78</td>\n",
" <td>339</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Jehovah's Witness</th>\n",
" <td>20</td>\n",
" <td>27</td>\n",
" <td>24</td>\n",
" <td>24</td>\n",
" <td>21</td>\n",
" <td>30</td>\n",
" <td>15</td>\n",
" <td>11</td>\n",
" <td>6</td>\n",
" <td>37</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Jewish</th>\n",
" <td>19</td>\n",
" <td>19</td>\n",
" <td>25</td>\n",
" <td>25</td>\n",
" <td>30</td>\n",
" <td>95</td>\n",
" <td>69</td>\n",
" <td>87</td>\n",
" <td>151</td>\n",
" <td>162</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"income <$10k $10-20k $20-30k $30-40k $40-50k \\\n",
"religion \n",
"Agnostic 27 34 60 81 76 \n",
"Atheist 12 27 37 52 35 \n",
"Buddhist 27 21 30 34 33 \n",
"Catholic 418 617 732 670 638 \n",
"Don't know/refused 15 14 15 11 10 \n",
"Evangelical Protestant 575 869 1064 982 881 \n",
"Hindu 1 9 7 9 11 \n",
"Historically Black Protestant 228 244 236 238 197 \n",
"Jehovah's Witness 20 27 24 24 21 \n",
"Jewish 19 19 25 25 30 \n",
"\n",
"income $50-75k $75-100k $100-150k >150k \\\n",
"religion \n",
"Agnostic 137 122 109 84 \n",
"Atheist 70 73 59 74 \n",
"Buddhist 58 62 39 53 \n",
"Catholic 1116 949 792 633 \n",
"Don't know/refused 35 21 17 18 \n",
"Evangelical Protestant 1486 949 723 414 \n",
"Hindu 34 47 48 54 \n",
"Historically Black Protestant 223 131 81 78 \n",
"Jehovah's Witness 30 15 11 6 \n",
"Jewish 95 69 87 151 \n",
"\n",
"income Don't know/Refused \n",
"religion \n",
"Agnostic 96 \n",
"Atheist 76 \n",
"Buddhist 54 \n",
"Catholic 1489 \n",
"Don't know/refused 116 \n",
"Evangelical Protestant 1529 \n",
"Hindu 37 \n",
"Historically Black Protestant 339 \n",
"Jehovah's Witness 37 \n",
"Jewish 162 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pew.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Tidy Data\n",
"\n",
"> This dataset has **three** variables, **religion**, **income** and **frequency**. To tidy it, we need to **melt**, or stack it. In other words, we need to turn columns into rows.\n",
"\n",
"pandas provides a [pd.melt](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html) function to un-pivot the dataset.\n",
"\n",
"**Notes:** *reset_index()* transforms the religion index column into a data column (*pd.melt()* needs that). Further, the resulting table is sorted implicitly by the *religion* column. To get to the same ordering as in the paper, the molten table is explicitly sorted."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"molten_pew = pd.melt(pew.reset_index(), id_vars=['religion'], value_name='frequency')"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# Create a ordered column for the income labels.\n",
"income_dtype = pd.api.types.CategoricalDtype(income_columns, ordered=True)\n",
"molten_pew['income'] = molten_pew['income'].astype(income_dtype)\n",
"molten_pew = molten_pew.sort_values(['religion', 'income']).reset_index(drop=True)"
]
},
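  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A short demonstration (not needed for the analysis) of why the ordered categorical dtype matters: plain strings would sort lexically, while the categorical sorts by income order."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# With plain strings, '$50-75k' would sort before '<$10k';\n",
    "# with the ordered categorical, the income order is respected.\n",
    "pd.Series(['>150k', '<$10k', '$50-75k']).astype(income_dtype).sort_values()"
   ]
  },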
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(180, 3)"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"molten_pew.shape"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>religion</th>\n",
" <th>income</th>\n",
" <th>frequency</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Agnostic</td>\n",
" <td><$10k</td>\n",
" <td>27</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Agnostic</td>\n",
" <td>$10-20k</td>\n",
" <td>34</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Agnostic</td>\n",
" <td>$20-30k</td>\n",
" <td>60</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Agnostic</td>\n",
" <td>$30-40k</td>\n",
" <td>81</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Agnostic</td>\n",
" <td>$40-50k</td>\n",
" <td>76</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Agnostic</td>\n",
" <td>$50-75k</td>\n",
" <td>137</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Agnostic</td>\n",
" <td>$75-100k</td>\n",
" <td>122</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>Agnostic</td>\n",
" <td>$100-150k</td>\n",
" <td>109</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>Agnostic</td>\n",
" <td>>150k</td>\n",
" <td>84</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>Agnostic</td>\n",
" <td>Don't know/Refused</td>\n",
" <td>96</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" religion income frequency\n",
"0 Agnostic <$10k 27\n",
"1 Agnostic $10-20k 34\n",
"2 Agnostic $20-30k 60\n",
"3 Agnostic $30-40k 81\n",
"4 Agnostic $40-50k 76\n",
"5 Agnostic $50-75k 137\n",
"6 Agnostic $75-100k 122\n",
"7 Agnostic $100-150k 109\n",
"8 Agnostic >150k 84\n",
"9 Agnostic Don't know/Refused 96"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"molten_pew.head(10)"
]
},
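  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check (not part of the paper), melting is reversible: pivoting the molten table back yields the messy presentation form again."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Pivot the tidy table back into the wide presentation form.\n",
    "molten_pew.pivot(index='religion', columns='income', values='frequency').head()"
   ]
  },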
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example 2: Billboard\n",
"\n",
"> Another common use of this data format is to record regularly spaced observations over time. For example, the Billboard dataset shown in Table 7 records the date a song first entered the Billboard Top 100. It has variables for **artist**, **track**, **date.entered**, **rank** and **week**. The rank in each week after it enters the top 100 is recorded in 75 columns, wk1 to wk75. If a song is in the Top 100 for less than 75 weeks the remaining columns are filled with missing values. This form of storage is not tidy, but it is useful for data entry. It reduces duplication since otherwise each song in each week would need its own row, and song metadata like title and artist would need to be repeated."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load the Data\n",
"\n",
"The data come in a CSV file with tediously named week columns."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# Usage of \"1st\", \"2nd\", \"3rd\" should be forbidden by law :)\n",
"usecols = ['artist.inverted', 'track', 'time', 'date.entered'] + (\n",
" [f'x{i}st.week' for i in range(1, 76, 10) if i != 11]\n",
" + [f'x{i}nd.week' for i in range(2, 76, 10) if i != 12]\n",
" + [f'x{i}rd.week' for i in range(3, 76, 10) if i != 13]\n",
" + [f'x{i}th.week' for i in range(1, 76) if (i % 10) not in (1, 2, 3)]\n",
" + [f'x11th.week', f'x12th.week', f'x13th.week']\n",
")\n",
"\n",
"billboard = pd.read_csv('data/billboard.csv', encoding='iso-8859-1',\n",
" parse_dates=['date.entered'], usecols=usecols)\n",
"\n",
"billboard = billboard.assign(year=lambda x: x['date.entered'].dt.year)\n",
"\n",
"# Rename the week columns.\n",
"week_columns = {\n",
" c: ('wk' + re.sub(r'[^\\d]+', '', c))\n",
" for c in billboard.columns\n",
" if c.endswith('.week')\n",
"}\n",
"billboard = billboard.rename(columns={'artist.inverted': 'artist', **week_columns})\n",
"\n",
"# Ensure the columns' order is the same as in the paper.\n",
"columns = ['year', 'artist', 'track', 'time', 'date.entered'] + [\n",
" f'wk{i}' for i in range(1, 76)\n",
"]\n",
"billboard = billboard[columns]\n",
"\n",
"# Ensure the rows' order is similar as in the paper.\n",
"# For unknown reasons the exact ordering as in the paper cannot be reconstructed.\n",
"billboard = billboard[billboard['year'] == 2000]\n",
"billboard = billboard.sort_values(['artist', 'track'])"
]
},
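  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The ordinal suffixes could also be produced by a small helper function. The following sketch (not used above; *ordinal_suffix* is a hypothetical helper, not part of the notebook's pipeline) rebuilds exactly the same week column names:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def ordinal_suffix(i):\n",
    "    \"\"\"Return the English ordinal suffix for a positive integer.\"\"\"\n",
    "    if 10 < (i % 100) < 14:  # 11th, 12th, and 13th are irregular\n",
    "        return 'th'\n",
    "    return {1: 'st', 2: 'nd', 3: 'rd'}.get(i % 10, 'th')\n",
    "\n",
    "expected = sorted(f'x{i}{ordinal_suffix(i)}.week' for i in range(1, 76))\n",
    "assert expected == sorted(c for c in usecols if c.endswith('.week'))"
   ]
  },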
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Messy Data\n",
"\n",
"Again, the next cell shows the data as they were actually provided as \"raw\" data."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(267, 80)"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"billboard.shape"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>year</th>\n",
" <th>artist</th>\n",
" <th>track</th>\n",
" <th>time</th>\n",
" <th>date.entered</th>\n",
" <th>wk1</th>\n",
" <th>wk2</th>\n",
" <th>wk3</th>\n",
" <th>wk4</th>\n",
" <th>wk5</th>\n",
" <th>...</th>\n",
" <th>wk66</th>\n",
" <th>wk67</th>\n",
" <th>wk68</th>\n",
" <th>wk69</th>\n",
" <th>wk70</th>\n",
" <th>wk71</th>\n",
" <th>wk72</th>\n",
" <th>wk73</th>\n",
" <th>wk74</th>\n",
" <th>wk75</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>246</th>\n",
" <td>2000</td>\n",
" <td>2 Pac</td>\n",
" <td>Baby Don't Cry (Keep Ya Head Up II)</td>\n",
" <td>4:22</td>\n",
" <td>2000-02-26</td>\n",
" <td>87</td>\n",
" <td>82.0</td>\n",
" <td>72.0</td>\n",
" <td>77.0</td>\n",
" <td>87.0</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>287</th>\n",
" <td>2000</td>\n",
" <td>2Ge+her</td>\n",
" <td>The Hardest Part Of Breaking Up (Is Getting Ba...</td>\n",
" <td>3:15</td>\n",
" <td>2000-09-02</td>\n",
" <td>91</td>\n",
" <td>87.0</td>\n",
" <td>92.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24</th>\n",
" <td>2000</td>\n",
" <td>3 Doors Down</td>\n",
" <td>Kryptonite</td>\n",
" <td>3:53</td>\n",
" <td>2000-04-08</td>\n",
" <td>81</td>\n",
" <td>70.0</td>\n",
" <td>68.0</td>\n",
" <td>67.0</td>\n",
" <td>66.0</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>193</th>\n",
" <td>2000</td>\n",
" <td>3 Doors Down</td>\n",
" <td>Loser</td>\n",
" <td>4:24</td>\n",
" <td>2000-10-21</td>\n",
" <td>76</td>\n",
" <td>76.0</td>\n",
" <td>72.0</td>\n",
" <td>69.0</td>\n",
" <td>67.0</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>69</th>\n",
" <td>2000</td>\n",
" <td>504 Boyz</td>\n",
" <td>Wobble Wobble</td>\n",
" <td>3:35</td>\n",
" <td>2000-04-15</td>\n",
" <td>57</td>\n",
" <td>34.0</td>\n",
" <td>25.0</td>\n",
" <td>17.0</td>\n",
" <td>17.0</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22</th>\n",
" <td>2000</td>\n",
" <td>98¡</td>\n",
" <td>Give Me Just One Night (Una Noche)</td>\n",
" <td>3:24</td>\n",
" <td>2000-08-19</td>\n",
" <td>51</td>\n",
" <td>39.0</td>\n",
" <td>34.0</td>\n",
" <td>26.0</td>\n",
" <td>26.0</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>304</th>\n",
" <td>2000</td>\n",
" <td>A*Teens</td>\n",
" <td>Dancing Queen</td>\n",
" <td>3:44</td>\n",
" <td>2000-07-08</td>\n",
" <td>97</td>\n",
" <td>97.0</td>\n",
" <td>96.0</td>\n",
" <td>95.0</td>\n",
" <td>100.0</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>135</th>\n",
" <td>2000</td>\n",
" <td>Aaliyah</td>\n",
" <td>I Don't Wanna</td>\n",
" <td>4:15</td>\n",
" <td>2000-01-29</td>\n",
" <td>84</td>\n",
" <td>62.0</td>\n",
" <td>51.0</td>\n",
" <td>41.0</td>\n",
" <td>38.0</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>2000</td>\n",
" <td>Aaliyah</td>\n",
" <td>Try Again</td>\n",
" <td>4:03</td>\n",
" <td>2000-03-18</td>\n",
" <td>59</td>\n",
" <td>53.0</td>\n",
" <td>38.0</td>\n",
" <td>28.0</td>\n",
" <td>21.0</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>200</th>\n",
" <td>2000</td>\n",
" <td>Adams, Yolanda</td>\n",
" <td>Open My Heart</td>\n",
" <td>5:30</td>\n",
" <td>2000-08-26</td>\n",
" <td>76</td>\n",
" <td>76.0</td>\n",
" <td>74.0</td>\n",
" <td>69.0</td>\n",
" <td>68.0</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>10 rows × 80 columns</p>\n",
"</div>"
],
"text/plain": [
" year artist track \\\n",
"246 2000 2 Pac Baby Don't Cry (Keep Ya Head Up II) \n",
"287 2000 2Ge+her The Hardest Part Of Breaking Up (Is Getting Ba... \n",
"24 2000 3 Doors Down Kryptonite \n",
"193 2000 3 Doors Down Loser \n",
"69 2000 504 Boyz Wobble Wobble \n",
"22 2000 98¡ Give Me Just One Night (Una Noche) \n",
"304 2000 A*Teens Dancing Queen \n",
"135 2000 Aaliyah I Don't Wanna \n",
"14 2000 Aaliyah Try Again \n",
"200 2000 Adams, Yolanda Open My Heart \n",
"\n",
" time date.entered wk1 wk2 wk3 wk4 wk5 ... wk66 wk67 wk68 \\\n",
"246 4:22 2000-02-26 87 82.0 72.0 77.0 87.0 ... NaN NaN NaN \n",
"287 3:15 2000-09-02 91 87.0 92.0 NaN NaN ... NaN NaN NaN \n",
"24 3:53 2000-04-08 81 70.0 68.0 67.0 66.0 ... NaN NaN NaN \n",
"193 4:24 2000-10-21 76 76.0 72.0 69.0 67.0 ... NaN NaN NaN \n",
"69 3:35 2000-04-15 57 34.0 25.0 17.0 17.0 ... NaN NaN NaN \n",
"22 3:24 2000-08-19 51 39.0 34.0 26.0 26.0 ... NaN NaN NaN \n",
"304 3:44 2000-07-08 97 97.0 96.0 95.0 100.0 ... NaN NaN NaN \n",
"135 4:15 2000-01-29 84 62.0 51.0 41.0 38.0 ... NaN NaN NaN \n",
"14 4:03 2000-03-18 59 53.0 38.0 28.0 21.0 ... NaN NaN NaN \n",
"200 5:30 2000-08-26 76 76.0 74.0 69.0 68.0 ... NaN NaN NaN \n",
"\n",
" wk69 wk70 wk71 wk72 wk73 wk74 wk75 \n",
"246 NaN NaN NaN NaN NaN NaN NaN \n",
"287 NaN NaN NaN NaN NaN NaN NaN \n",
"24 NaN NaN NaN NaN NaN NaN NaN \n",
"193 NaN NaN NaN NaN NaN NaN NaN \n",
"69 NaN NaN NaN NaN NaN NaN NaN \n",
"22 NaN NaN NaN NaN NaN NaN NaN \n",
"304 NaN NaN NaN NaN NaN NaN NaN \n",
"135 NaN NaN NaN NaN NaN NaN NaN \n",
"14 NaN NaN NaN NaN NaN NaN NaN \n",
"200 NaN NaN NaN NaN NaN NaN NaN \n",
"\n",
"[10 rows x 80 columns]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"billboard.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "### \"Tidy\" Data\n",
    "\n",
    "As before, the *pd.melt()* function is used to transform the data from \"wide\" to \"long\" format."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"molten_billboard = pd.melt(\n",
" billboard,\n",
" id_vars=['year', 'artist', 'track', 'time', 'date.entered'],\n",
" var_name='week',\n",
" value_name='rank',\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In contrast to R, pandas keeps (unneccesary) rows for weeks where the song was already out of the charts. These are discarded. Also, a new column *date* indicating when exactly a particular song was at a certain rank in the charts is added."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"# pandas keeps \"wide\" variables that had missing values as rows.\n",
"molten_billboard = molten_billboard[molten_billboard['rank'].notnull()]\n",
"\n",
"# Cast as integer after missing values are removed.\n",
"molten_billboard['week'] = molten_billboard['week'].map(lambda x: int(x[2:]))\n",
"molten_billboard['rank'] = molten_billboard['rank'].map(int)\n",
"\n",
"# Calculate the actual week from the date of first entering the list.\n",
"molten_billboard = molten_billboard.assign(\n",
" date=lambda x: x['date.entered'] + (x['week'] - 1) * datetime.timedelta(weeks=1)\n",
")\n",
"\n",
"# Sort rows and columns as in the paper.\n",
"molten_billboard = molten_billboard[\n",
" ['year', 'artist', 'time', 'track', 'date', 'week', 'rank']\n",
"]\n",
"molten_billboard = (\n",
" molten_billboard.sort_values(['artist', 'track', 'week']).reset_index(drop=True)\n",
")"
]
},
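  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A tiny check of the date arithmetic above (a sketch only): week 3 of a song that entered the charts on 2000-02-26 should map to 2000-03-11, i.e., two weeks later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Week 1 is the week of entry, so week n lies (n - 1) weeks later.\n",
    "datetime.date(2000, 2, 26) + (3 - 1) * datetime.timedelta(weeks=1)"
   ]
  },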
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that this dataset is not yet fully tidy as will be explained in notebook No. 4."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>year</th>\n",
" <th>artist</th>\n",
" <th>time</th>\n",
" <th>track</th>\n",
" <th>date</th>\n",
" <th>week</th>\n",
" <th>rank</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2000</td>\n",
" <td>2 Pac</td>\n",
" <td>4:22</td>\n",
" <td>Baby Don't Cry (Keep Ya Head Up II)</td>\n",
" <td>2000-02-26</td>\n",
" <td>1</td>\n",
" <td>87</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2000</td>\n",
" <td>2 Pac</td>\n",
" <td>4:22</td>\n",
" <td>Baby Don't Cry (Keep Ya Head Up II)</td>\n",
" <td>2000-03-04</td>\n",
" <td>2</td>\n",
" <td>82</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2000</td>\n",
" <td>2 Pac</td>\n",
" <td>4:22</td>\n",
" <td>Baby Don't Cry (Keep Ya Head Up II)</td>\n",
" <td>2000-03-11</td>\n",
" <td>3</td>\n",
" <td>72</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2000</td>\n",
" <td>2 Pac</td>\n",
" <td>4:22</td>\n",
" <td>Baby Don't Cry (Keep Ya Head Up II)</td>\n",
" <td>2000-03-18</td>\n",
" <td>4</td>\n",
" <td>77</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2000</td>\n",
" <td>2 Pac</td>\n",
" <td>4:22</td>\n",
" <td>Baby Don't Cry (Keep Ya Head Up II)</td>\n",
" <td>2000-03-25</td>\n",
" <td>5</td>\n",
" <td>87</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>2000</td>\n",
" <td>2 Pac</td>\n",
" <td>4:22</td>\n",
" <td>Baby Don't Cry (Keep Ya Head Up II)</td>\n",
" <td>2000-04-01</td>\n",
" <td>6</td>\n",
" <td>94</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>2000</td>\n",
" <td>2 Pac</td>\n",
" <td>4:22</td>\n",
" <td>Baby Don't Cry (Keep Ya Head Up II)</td>\n",
" <td>2000-04-08</td>\n",
" <td>7</td>\n",
" <td>99</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>2000</td>\n",
" <td>2Ge+her</td>\n",
" <td>3:15</td>\n",
" <td>The Hardest Part Of Breaking Up (Is Getting Ba...</td>\n",
" <td>2000-09-02</td>\n",
" <td>1</td>\n",
" <td>91</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>2000</td>\n",
" <td>2Ge+her</td>\n",
" <td>3:15</td>\n",
" <td>The Hardest Part Of Breaking Up (Is Getting Ba...</td>\n",
" <td>2000-09-09</td>\n",
" <td>2</td>\n",
" <td>87</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>2000</td>\n",
" <td>2Ge+her</td>\n",
" <td>3:15</td>\n",
" <td>The Hardest Part Of Breaking Up (Is Getting Ba...</td>\n",
" <td>2000-09-16</td>\n",
" <td>3</td>\n",
" <td>92</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>2000</td>\n",
" <td>3 Doors Down</td>\n",
" <td>3:53</td>\n",
" <td>Kryptonite</td>\n",
" <td>2000-04-08</td>\n",
" <td>1</td>\n",
" <td>81</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>2000</td>\n",
" <td>3 Doors Down</td>\n",
" <td>3:53</td>\n",
" <td>Kryptonite</td>\n",
" <td>2000-04-15</td>\n",
" <td>2</td>\n",
" <td>70</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>2000</td>\n",
" <td>3 Doors Down</td>\n",
" <td>3:53</td>\n",
" <td>Kryptonite</td>\n",
" <td>2000-04-22</td>\n",
" <td>3</td>\n",
" <td>68</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>2000</td>\n",
" <td>3 Doors Down</td>\n",
" <td>3:53</td>\n",
" <td>Kryptonite</td>\n",
" <td>2000-04-29</td>\n",
" <td>4</td>\n",
" <td>67</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>2000</td>\n",
" <td>3 Doors Down</td>\n",
" <td>3:53</td>\n",
" <td>Kryptonite</td>\n",
" <td>2000-05-06</td>\n",
" <td>5</td>\n",
" <td>66</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" year artist time \\\n",
"0 2000 2 Pac 4:22 \n",
"1 2000 2 Pac 4:22 \n",
"2 2000 2 Pac 4:22 \n",
"3 2000 2 Pac 4:22 \n",
"4 2000 2 Pac 4:22 \n",
"5 2000 2 Pac 4:22 \n",
"6 2000 2 Pac 4:22 \n",
"7 2000 2Ge+her 3:15 \n",
"8 2000 2Ge+her 3:15 \n",
"9 2000 2Ge+her 3:15 \n",
"10 2000 3 Doors Down 3:53 \n",
"11 2000 3 Doors Down 3:53 \n",
"12 2000 3 Doors Down 3:53 \n",
"13 2000 3 Doors Down 3:53 \n",
"14 2000 3 Doors Down 3:53 \n",
"\n",
" track date week rank \n",
"0 Baby Don't Cry (Keep Ya Head Up II) 2000-02-26 1 87 \n",
"1 Baby Don't Cry (Keep Ya Head Up II) 2000-03-04 2 82 \n",
"2 Baby Don't Cry (Keep Ya Head Up II) 2000-03-11 3 72 \n",
"3 Baby Don't Cry (Keep Ya Head Up II) 2000-03-18 4 77 \n",
"4 Baby Don't Cry (Keep Ya Head Up II) 2000-03-25 5 87 \n",
"5 Baby Don't Cry (Keep Ya Head Up II) 2000-04-01 6 94 \n",
"6 Baby Don't Cry (Keep Ya Head Up II) 2000-04-08 7 99 \n",
"7 The Hardest Part Of Breaking Up (Is Getting Ba... 2000-09-02 1 91 \n",
"8 The Hardest Part Of Breaking Up (Is Getting Ba... 2000-09-09 2 87 \n",
"9 The Hardest Part Of Breaking Up (Is Getting Ba... 2000-09-16 3 92 \n",
"10 Kryptonite 2000-04-08 1 81 \n",
"11 Kryptonite 2000-04-15 2 70 \n",
"12 Kryptonite 2000-04-22 3 68 \n",
"13 Kryptonite 2000-04-29 4 67 \n",
"14 Kryptonite 2000-05-06 5 66 "
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"molten_billboard.head(15)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Save the Data\n",
"\n",
"The above \"tidy\" billboard dataset is saved as input for notebook No. 4."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"molten_billboard.to_csv('data/billboard_cleaned.csv', index=False)"
]
  },
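  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When the saved file is read back in (e.g., in notebook No. 4), the *date* column must be parsed into datetimes again, since the round trip through CSV does not preserve dtypes. A minimal sketch:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Re-read the cleaned dataset; parse_dates restores the datetime dtype\n",
    "# that is otherwise lost in the CSV round trip.\n",
    "pd.read_csv('data/billboard_cleaned.csv', parse_dates=['date']).dtypes"
   ]
  }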
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}