
Create notebook for the fourth application of tidying

Alexander Hess 2018-08-26 15:33:33 +02:00
commit d18f7133a8
3 changed files with 5001 additions and 2 deletions


@@ -28,7 +28,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"2018-08-26 00:55:18 CEST\n",
"2018-08-26 14:39:56 CEST\n",
"\n",
"CPython 3.6.5\n",
"IPython 6.5.0\n",
@@ -1026,7 +1026,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Tidy Data\n",
"### \"Tidy\" Data\n",
"\n",
"As before the *pd.melt* function is used to transform the data from \"wide\" to \"long\" form."
]
@@ -1079,6 +1079,13 @@
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that this dataset is not yet fully tidy as will be explained in notebook No. 4."
]
},
{
"cell_type": "code",
"execution_count": 16,
@@ -1313,6 +1320,24 @@
"source": [
"molten_billboard.head(15)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Save the Data\n",
"\n",
"The above \"tidy\" billboard dataset is saved as input for notebook No. 4."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"molten_billboard.to_csv('data/billboard_cleaned.csv', index=False)"
]
}
],
"metadata": {


@@ -0,0 +1,708 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Multiple Types in one Table\n",
"\n",
 Datasets often">
"> Datasets often involve values collected at multiple levels, on different types of observational units. During tidying, each type of observational unit should be stored in its own table. This is closely related to the idea of database normalisation, where each fact is expressed in only one place. If this is not done, it is possible for inconsistencies to occur."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## \"Housekeeping\""
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2018-08-26 15:32:47 CEST\n",
"\n",
"CPython 3.6.5\n",
"IPython 6.5.0\n",
"\n",
"numpy 1.15.1\n",
"pandas 0.23.4\n"
]
}
],
"source": [
"% load_ext watermark\n",
"% watermark -d -t -v -z -p numpy,pandas"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example: Billboard revisited"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load the Data\n",
"\n",
"Load the cleaned and almost tidy dataset from notebook No. 1."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"billboard = pd.read_csv('data/billboard_cleaned.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Messy Data\n",
"\n",
 The Billboard">
"> The Billboard dataset described in Table 8 actually contains observations on two types of observational units: the *song* and its *rank* in each week. This manifests itself through the duplication of facts about the song: *artist* and *time* are repeated for every song in each *week*."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>year</th>\n",
" <th>artist</th>\n",
" <th>time</th>\n",
" <th>track</th>\n",
" <th>date</th>\n",
" <th>week</th>\n",
" <th>rank</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2000</td>\n",
" <td>2 Pac</td>\n",
" <td>4:22</td>\n",
" <td>Baby Don't Cry (Keep Ya Head Up II)</td>\n",
" <td>2000-02-26</td>\n",
" <td>1</td>\n",
" <td>87</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2000</td>\n",
" <td>2 Pac</td>\n",
" <td>4:22</td>\n",
" <td>Baby Don't Cry (Keep Ya Head Up II)</td>\n",
" <td>2000-03-04</td>\n",
" <td>2</td>\n",
" <td>82</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2000</td>\n",
" <td>2 Pac</td>\n",
" <td>4:22</td>\n",
" <td>Baby Don't Cry (Keep Ya Head Up II)</td>\n",
" <td>2000-03-11</td>\n",
" <td>3</td>\n",
" <td>72</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2000</td>\n",
" <td>2 Pac</td>\n",
" <td>4:22</td>\n",
" <td>Baby Don't Cry (Keep Ya Head Up II)</td>\n",
" <td>2000-03-18</td>\n",
" <td>4</td>\n",
" <td>77</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2000</td>\n",
" <td>2 Pac</td>\n",
" <td>4:22</td>\n",
" <td>Baby Don't Cry (Keep Ya Head Up II)</td>\n",
" <td>2000-03-25</td>\n",
" <td>5</td>\n",
" <td>87</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>2000</td>\n",
" <td>2 Pac</td>\n",
" <td>4:22</td>\n",
" <td>Baby Don't Cry (Keep Ya Head Up II)</td>\n",
" <td>2000-04-01</td>\n",
" <td>6</td>\n",
" <td>94</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>2000</td>\n",
" <td>2 Pac</td>\n",
" <td>4:22</td>\n",
" <td>Baby Don't Cry (Keep Ya Head Up II)</td>\n",
" <td>2000-04-08</td>\n",
" <td>7</td>\n",
" <td>99</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>2000</td>\n",
" <td>2Ge+her</td>\n",
" <td>3:15</td>\n",
" <td>The Hardest Part Of Breaking Up (Is Getting Ba...</td>\n",
" <td>2000-09-02</td>\n",
" <td>1</td>\n",
" <td>91</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>2000</td>\n",
" <td>2Ge+her</td>\n",
" <td>3:15</td>\n",
" <td>The Hardest Part Of Breaking Up (Is Getting Ba...</td>\n",
" <td>2000-09-09</td>\n",
" <td>2</td>\n",
" <td>87</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>2000</td>\n",
" <td>2Ge+her</td>\n",
" <td>3:15</td>\n",
" <td>The Hardest Part Of Breaking Up (Is Getting Ba...</td>\n",
" <td>2000-09-16</td>\n",
" <td>3</td>\n",
" <td>92</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>2000</td>\n",
" <td>3 Doors Down</td>\n",
" <td>3:53</td>\n",
" <td>Kryptonite</td>\n",
" <td>2000-04-08</td>\n",
" <td>1</td>\n",
" <td>81</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>2000</td>\n",
" <td>3 Doors Down</td>\n",
" <td>3:53</td>\n",
" <td>Kryptonite</td>\n",
" <td>2000-04-15</td>\n",
" <td>2</td>\n",
" <td>70</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>2000</td>\n",
" <td>3 Doors Down</td>\n",
" <td>3:53</td>\n",
" <td>Kryptonite</td>\n",
" <td>2000-04-22</td>\n",
" <td>3</td>\n",
" <td>68</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>2000</td>\n",
" <td>3 Doors Down</td>\n",
" <td>3:53</td>\n",
" <td>Kryptonite</td>\n",
" <td>2000-04-29</td>\n",
" <td>4</td>\n",
" <td>67</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>2000</td>\n",
" <td>3 Doors Down</td>\n",
" <td>3:53</td>\n",
" <td>Kryptonite</td>\n",
" <td>2000-05-06</td>\n",
" <td>5</td>\n",
" <td>66</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" year artist time \\\n",
"0 2000 2 Pac 4:22 \n",
"1 2000 2 Pac 4:22 \n",
"2 2000 2 Pac 4:22 \n",
"3 2000 2 Pac 4:22 \n",
"4 2000 2 Pac 4:22 \n",
"5 2000 2 Pac 4:22 \n",
"6 2000 2 Pac 4:22 \n",
"7 2000 2Ge+her 3:15 \n",
"8 2000 2Ge+her 3:15 \n",
"9 2000 2Ge+her 3:15 \n",
"10 2000 3 Doors Down 3:53 \n",
"11 2000 3 Doors Down 3:53 \n",
"12 2000 3 Doors Down 3:53 \n",
"13 2000 3 Doors Down 3:53 \n",
"14 2000 3 Doors Down 3:53 \n",
"\n",
" track date week rank \n",
"0 Baby Don't Cry (Keep Ya Head Up II) 2000-02-26 1 87 \n",
"1 Baby Don't Cry (Keep Ya Head Up II) 2000-03-04 2 82 \n",
"2 Baby Don't Cry (Keep Ya Head Up II) 2000-03-11 3 72 \n",
"3 Baby Don't Cry (Keep Ya Head Up II) 2000-03-18 4 77 \n",
"4 Baby Don't Cry (Keep Ya Head Up II) 2000-03-25 5 87 \n",
"5 Baby Don't Cry (Keep Ya Head Up II) 2000-04-01 6 94 \n",
"6 Baby Don't Cry (Keep Ya Head Up II) 2000-04-08 7 99 \n",
"7 The Hardest Part Of Breaking Up (Is Getting Ba... 2000-09-02 1 91 \n",
"8 The Hardest Part Of Breaking Up (Is Getting Ba... 2000-09-09 2 87 \n",
"9 The Hardest Part Of Breaking Up (Is Getting Ba... 2000-09-16 3 92 \n",
"10 Kryptonite 2000-04-08 1 81 \n",
"11 Kryptonite 2000-04-15 2 70 \n",
"12 Kryptonite 2000-04-22 3 68 \n",
"13 Kryptonite 2000-04-29 4 67 \n",
"14 Kryptonite 2000-05-06 5 66 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"billboard.head(15)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Tidy Data\n",
"\n",
"> The billboard dataset needs to be broken down into two datasets: a **song** dataset which stores *artist*, *song name* and *time*, and a **ranking** dataset which gives the *rank* of the song in each *week*.\n",
"\n",
"Transforming data columns into index columns is enough in pandas to obtain unique tuples from several columns. So no real \"function\" is needed to tidy up the dataset."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Get the unique combinations for the song DataFrame and\n",
"# \"store\" them in the original dataset for reuse.\n",
"billboard = billboard.set_index(['artist', 'track', 'time'])\n",
"\n",
"# Create the song DataFrame.\n",
"songs = pd.DataFrame.from_records(\n",
" columns=['id', 'artist', 'track', 'time'],\n",
" data=[ # Combine enumerate with tuple unpacking\n",
" (a + 1, b, c, d) # to create the ID column.\n",
" for (a, (b, c, d))\n",
" in enumerate(billboard.index.unique())\n",
" ],\n",
")\n",
"\n",
"# Take the date and rank columns from the original dataset\n",
"# and use the implicit index alignment to assign the songs' IDs.\n",
"ranking = billboard[['date', 'rank']].copy()\n",
"ranking['id'] = songs.set_index(['artist', 'track', 'time'])\n",
"\n",
"# Use the song ID as the index as in the paper.\n",
"ranking = ranking.reset_index(drop=True).set_index('id')\n",
"songs = songs.set_index('id')"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>artist</th>\n",
" <th>track</th>\n",
" <th>time</th>\n",
" </tr>\n",
" <tr>\n",
" <th>id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2 Pac</td>\n",
" <td>Baby Don't Cry (Keep Ya Head Up II)</td>\n",
" <td>4:22</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2Ge+her</td>\n",
" <td>The Hardest Part Of Breaking Up (Is Getting Ba...</td>\n",
" <td>3:15</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3 Doors Down</td>\n",
" <td>Kryptonite</td>\n",
" <td>3:53</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>3 Doors Down</td>\n",
" <td>Loser</td>\n",
" <td>4:24</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>504 Boyz</td>\n",
" <td>Wobble Wobble</td>\n",
" <td>3:35</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>98¡</td>\n",
" <td>Give Me Just One Night (Una Noche)</td>\n",
" <td>3:24</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>A*Teens</td>\n",
" <td>Dancing Queen</td>\n",
" <td>3:44</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>Aaliyah</td>\n",
" <td>I Don't Wanna</td>\n",
" <td>4:15</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>Aaliyah</td>\n",
" <td>Try Again</td>\n",
" <td>4:03</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>Adams, Yolanda</td>\n",
" <td>Open My Heart</td>\n",
" <td>5:30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>Adkins, Trace</td>\n",
" <td>More</td>\n",
" <td>3:05</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>Aguilera, Christina</td>\n",
" <td>Come On Over Baby (All I Want Is You)</td>\n",
" <td>3:38</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>Aguilera, Christina</td>\n",
" <td>I Turn To You</td>\n",
" <td>4:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>Alice Deejay</td>\n",
" <td>Better Off Alone</td>\n",
" <td>6:50</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>Allan, Gary</td>\n",
" <td>Smoke Rings In The Dark</td>\n",
" <td>4:18</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" artist track \\\n",
"id \n",
"1 2 Pac Baby Don't Cry (Keep Ya Head Up II) \n",
"2 2Ge+her The Hardest Part Of Breaking Up (Is Getting Ba... \n",
"3 3 Doors Down Kryptonite \n",
"4 3 Doors Down Loser \n",
"5 504 Boyz Wobble Wobble \n",
"6 98¡ Give Me Just One Night (Una Noche) \n",
"7 A*Teens Dancing Queen \n",
"8 Aaliyah I Don't Wanna \n",
"9 Aaliyah Try Again \n",
"10 Adams, Yolanda Open My Heart \n",
"11 Adkins, Trace More \n",
"12 Aguilera, Christina Come On Over Baby (All I Want Is You) \n",
"13 Aguilera, Christina I Turn To You \n",
"14 Alice Deejay Better Off Alone \n",
"15 Allan, Gary Smoke Rings In The Dark \n",
"\n",
" time \n",
"id \n",
"1 4:22 \n",
"2 3:15 \n",
"3 3:53 \n",
"4 4:24 \n",
"5 3:35 \n",
"6 3:24 \n",
"7 3:44 \n",
"8 4:15 \n",
"9 4:03 \n",
"10 5:30 \n",
"11 3:05 \n",
"12 3:38 \n",
"13 4:00 \n",
"14 6:50 \n",
"15 4:18 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"songs.head(15)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>date</th>\n",
" <th>rank</th>\n",
" </tr>\n",
" <tr>\n",
" <th>id</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2000-02-26</td>\n",
" <td>87</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2000-03-04</td>\n",
" <td>82</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2000-03-11</td>\n",
" <td>72</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2000-03-18</td>\n",
" <td>77</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2000-03-25</td>\n",
" <td>87</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2000-04-01</td>\n",
" <td>94</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2000-04-08</td>\n",
" <td>99</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2000-09-02</td>\n",
" <td>91</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2000-09-09</td>\n",
" <td>87</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2000-09-16</td>\n",
" <td>92</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2000-04-08</td>\n",
" <td>81</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2000-04-15</td>\n",
" <td>70</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2000-04-22</td>\n",
" <td>68</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2000-04-29</td>\n",
" <td>67</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2000-05-06</td>\n",
" <td>66</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" date rank\n",
"id \n",
"1 2000-02-26 87\n",
"1 2000-03-04 82\n",
"1 2000-03-11 72\n",
"1 2000-03-18 77\n",
"1 2000-03-25 87\n",
"1 2000-04-01 94\n",
"1 2000-04-08 99\n",
"2 2000-09-02 91\n",
"2 2000-09-09 87\n",
"2 2000-09-16 92\n",
"3 2000-04-08 81\n",
"3 2000-04-15 70\n",
"3 2000-04-22 68\n",
"3 2000-04-29 67\n",
"3 2000-05-06 66"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ranking.head(15)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
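
Review note on the tidying cell of the new notebook: the split into a song table and a ranking table can also be written with `drop_duplicates` and `merge`, which makes it straightforward to check that no information is lost. What follows is a minimal sketch of that alternative, not the approach the notebook takes. The sample rows are copied from the `billboard.head(15)` output in the diff; the names `song_columns` and `restored` are introduced here only for illustration.

```python
import pandas as pd

# A few rows copied from the billboard.head(15) output shown in the diff.
billboard = pd.DataFrame({
    'artist': ['2 Pac', '2 Pac', '2Ge+her', '3 Doors Down'],
    'track': [
        "Baby Don't Cry (Keep Ya Head Up II)",
        "Baby Don't Cry (Keep Ya Head Up II)",
        'The Hardest Part Of Breaking Up (Is Getting Ba...',
        'Kryptonite',
    ],
    'time': ['4:22', '4:22', '3:15', '3:53'],
    'date': ['2000-02-26', '2000-03-04', '2000-09-02', '2000-04-08'],
    'rank': [87, 82, 91, 81],
})

# Hypothetical helper name: the columns that describe a song.
song_columns = ['artist', 'track', 'time']

# One row per song; the 1-based ID mirrors the notebook's enumerate() + 1.
songs = billboard[song_columns].drop_duplicates().reset_index(drop=True)
songs.insert(0, 'id', list(range(1, len(songs) + 1)))

# Replace the repeated song facts in each weekly row with the song's ID.
ranking = billboard.merge(
    songs, on=song_columns, how='left', validate='many_to_one',
)[['id', 'date', 'rank']]

# Joining the two tables again should reproduce the original rows.
restored = ranking.merge(songs, on='id', how='left')[list(billboard.columns)]
print(restored.equals(billboard))
```

The `validate='many_to_one'` argument makes `merge` raise if a song appears more than once in `songs`, which guards against accidentally duplicating ranking rows.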

data/billboard_cleaned.csv (normal file, 4266 additions)

File diff suppressed because it is too large
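
Review note on the CSV hand-off between the notebooks: `to_csv(..., index=False)` keeps the row index out of the file, and `read_csv` returns the `date` column as plain text unless it is asked to parse it. The sketch below illustrates both points on two rows taken from the diff. It assumes the `date` column is a datetime column before saving, which this commit does not show, and it uses an in-memory buffer as a stand-in for `data/billboard_cleaned.csv`.

```python
import io

import pandas as pd

# Two rows taken from the output in the diff; a datetime 'date' column is an assumption.
molten_billboard = pd.DataFrame({
    'artist': ['2 Pac', '2 Pac'],
    'time': ['4:22', '4:22'],
    'date': pd.to_datetime(['2000-02-26', '2000-03-04']),
    'week': [1, 2],
    'rank': [87, 82],
})

# In-memory stand-in for 'data/billboard_cleaned.csv'.
buffer = io.StringIO()
molten_billboard.to_csv(buffer, index=False)

# Read the same content twice: once as-is, once with date parsing.
buffer.seek(0)
plain = pd.read_csv(buffer)
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=['date'])

print(list(plain.columns))  # No extra index column was written.
print(pd.api.types.is_datetime64_any_dtype(plain['date']))
print(pd.api.types.is_datetime64_any_dtype(parsed['date']))
```

Whether the fourth notebook needs parsed dates depends on what is done with the `ranking` table afterwards; for display purposes the plain text values look the same.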