diff --git a/1_column_headers_are_values.ipynb b/1_column_headers_are_values.ipynb
new file mode 100644
index 0000000..4f87b83
--- /dev/null
+++ b/1_column_headers_are_values.ipynb
@@ -0,0 +1,1339 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Column Headers are Values, not Variable Names\n",
+    "\n",
+    "This notebook shows two examples of datasets whose column headers are values rather than variable names. This type of messy dataset has practical use in two settings:\n",
+    "\n",
+    "1. Presentations\n",
+    "2. Recordings of regularly spaced observations over time"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## \"Housekeeping\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "2018-08-26 00:55:18 CEST\n",
+      "\n",
+      "CPython 3.6.5\n",
+      "IPython 6.5.0\n",
+      "\n",
+      "numpy 1.15.1\n",
+      "pandas 0.23.4\n"
+     ]
+    }
+   ],
+   "source": [
+    "%load_ext watermark\n",
+    "%watermark -d -t -v -z -p numpy,pandas"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import datetime\n",
+    "import re\n",
+    "\n",
+    "import pandas as pd\n",
+    "import savReaderWriter as spss"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Example 1: Religion vs. Income\n",
+    "\n",
+    "> A common type of messy dataset is tabular data designed for **presentation**, where variables\n",
+    "form both the rows and columns, and column headers are values, not variable names.\n",
+    "\n",
+    "The [Pew Research Center](http://www.pewresearch.org/) provides many studies on all kinds of aspects of life in the USA. The following example uses data taken from its [Religious Landscape Study](http://www.pewforum.org/religious-landscape-study/)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Load the Data\n",
+    "\n",
+    "The data are provided as an SPSS data file. This is a binary format with a built-in header section describing the data, for example, which variables / columns are included and which values the categorical variables can take."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Load the dataset's metadata."
+   ]
+  },
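+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The value labels in the header come back as byte strings with rather verbose category names, which is why the cleaning loop below decodes, shortens, and strips them. A minimal sketch to peek at the raw labels first (same file, no cleaning yet):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Sketch only: inspect the raw value labels of the income variable.\n",
+    "# Keys and labels are byte strings in savReaderWriter's metadata.\n",
+    "with spss.SavHeaderReader('data/pew.sav') as header:\n",
+    "    print(header.all().valueLabels[b'income'])"
+   ]
+  },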
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "columns = ['q16', 'reltrad', 'income']\n",
+    "encodings = {}\n",
+    "\n",
+    "# For the sake of simplicity, all data cleaning operations\n",
+    "# are done within the for-loop for all columns.\n",
+    "with spss.SavHeaderReader('data/pew.sav') as pew:\n",
+    "    for c in columns:\n",
+    "        encodings[c] = {\n",
+    "            int(k): re.sub(r'\(.*\)', '', (\n",
+    "                v.decode('iso-8859-1')\n",
+    "                .replace('\x92', \"'\")\n",
+    "                .replace(' Churches', '')\n",
+    "                .replace('Less than $10,000', '<$10k')\n",
+    "                .replace('10 to under $20,000', '$10-20k')\n",
+    "                .replace('20 to under $30,000', '$20-30k')\n",
+    "                .replace('30 to under $40,000', '$30-40k')\n",
+    "                .replace('40 to under $50,000', '$40-50k')\n",
+    "                .replace('50 to under $75,000', '$50-75k')\n",
+    "                .replace('75 to under $100,000', '$75-100k')\n",
+    "                .replace('100 to under $150,000', '$100-150k')\n",
+    "                .replace('$150,000 or more', '>150k')\n",
+    "            )).strip()\n",
+    "            for (k, v) in pew.all().valueLabels[c.encode()].items()\n",
+    "        }"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Load the actual data and prepare them as they are presented in the paper."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "with spss.SavReader('data/pew.sav', selectVars=[c.encode() for c in columns]) as pew:\n",
+    "    pew = list(pew)\n",
+    "\n",
+    "# Use the above encodings to map the numeric data\n",
+    "# to the actual labels.\n",
+    "pew = pd.DataFrame(pew, columns=columns, dtype=int)\n",
+    "for c in columns:\n",
+    "    pew[c] = pew[c].map(encodings[c])\n",
+    "\n",
+    "# Re-classify self-identified atheists and agnostics (q16).\n",
+    "for v in ('Atheist', 'Agnostic'):\n",
+    "    pew.loc[(pew['q16'] == v), 'reltrad'] = v\n",
+    "\n",
+    "income_columns = ['<$10k', '$10-20k', '$20-30k', '$30-40k', '$40-50k', '$50-75k',\n",
+    "                  '$75-100k', '$100-150k', '>150k', 'Don\'t know/Refused']\n",
+    "\n",
+    "# Aggregate to a religion-by-income table of counts.\n",
+    "pew = pew.groupby(['reltrad', 'income']).size().unstack('income')\n",
+    "\n",
+    "pew = pew[income_columns]\n",
+    "pew.index.name = 'religion'"
+   ]
+  },
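+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As an aside, the *groupby / size / unstack* chain above is one way to build such a contingency table; *pd.crosstab* would yield the same counts in a single call. A sketch on toy data (the long-form *pew* records are no longer around at this point):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Sketch only: pd.crosstab tabulates frequencies directly.\n",
+    "toy = pd.DataFrame({\n",
+    "    'religion': ['Agnostic', 'Agnostic', 'Atheist'],\n",
+    "    'income': ['<$10k', '$10-20k', '<$10k'],\n",
+    "})\n",
+    "pd.crosstab(toy['religion'], toy['income'])"
+   ]
+  },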
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Messy Data\n",
+    "\n",
+    "The next cell shows the data as they would actually be provided \"raw\" (i.e., the pre-processing above is assumed to have been done by someone else, and the data analyst is presented only with the dataset below)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "(18, 10)"
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "pew.shape"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "income                          <$10k  $10-20k  $20-30k  $30-40k  $40-50k  \\\n",
+       "religion                                                                    \n",
+       "Agnostic                           27       34       60       81       76   \n",
+       "Atheist                            12       27       37       52       35   \n",
+       "Buddhist                           27       21       30       34       33   \n",
+       "Catholic                          418      617      732      670      638   \n",
+       "Don't know/refused                 15       14       15       11       10   \n",
+       "Evangelical Protestant            575      869     1064      982      881   \n",
+       "Hindu                               1        9        7        9       11   \n",
+       "Historically Black Protestant     228      244      236      238      197   \n",
+       "Jehovah's Witness                  20       27       24       24       21   \n",
+       "Jewish                             19       19       25       25       30   \n",
+       "\n",
+       "income                          $50-75k  $75-100k  $100-150k  >150k  \\\n",
+       "religion                                                              \n",
+       "Agnostic                            137       122        109     84   \n",
+       "Atheist                              70        73         59     74   \n",
+       "Buddhist                             58        62         39     53   \n",
+       "Catholic                           1116       949        792    633   \n",
+       "Don't know/refused                   35        21         17     18   \n",
+       "Evangelical Protestant             1486       949        723    414   \n",
+       "Hindu                                34        47         48     54   \n",
+       "Historically Black Protestant       223       131         81     78   \n",
+       "Jehovah's Witness                    30        15         11      6   \n",
+       "Jewish                               95        69         87    151   \n",
+       "\n",
+       "income                         Don't know/Refused  \n",
+       "religion                                           \n",
+       "Agnostic                                       96  \n",
+       "Atheist                                        76  \n",
+       "Buddhist                                       54  \n",
+       "Catholic                                     1489  \n",
+       "Don't know/refused                            116  \n",
+       "Evangelical Protestant                       1529  \n",
+       "Hindu                                          37  \n",
+       "Historically Black Protestant                 339  \n",
+       "Jehovah's Witness                              37  \n",
+       "Jewish                                        162  "
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "pew.head(10)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Tidy Data\n",
+    "\n",
+    "> This dataset has **three** variables, **religion**, **income** and **frequency**. To tidy it, we need to **melt**, or stack it. In other words, we need to turn columns into rows.\n",
+    "\n",
+    "pandas provides a [pd.melt](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html) function to un-pivot the dataset.\n",
+    "\n",
+    "**Notes:** *reset_index()* turns the *religion* index into a regular data column (*pd.melt()* needs that). The molten table then has to be sorted explicitly by *religion* and *income* to obtain the same ordering as in the paper."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "molten_pew = pd.melt(pew.reset_index(), id_vars=['religion'], value_name='frequency')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Create an ordered categorical dtype for the income labels.\n",
+    "income_dtype = pd.api.types.CategoricalDtype(income_columns, ordered=True)\n",
+    "molten_pew['income'] = molten_pew['income'].astype(income_dtype)\n",
+    "molten_pew = molten_pew.sort_values(['religion', 'income']).reset_index(drop=True)"
+   ]
+  },
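+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The ordered categorical dtype is what makes the sort above follow the income scale rather than lexicographic string order. A minimal sketch of that behaviour:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Sketch only: sorting follows the declared category order,\n",
+    "# not lexicographic string order.\n",
+    "s = pd.Series(['>150k', '<$10k', '$50-75k'], dtype=income_dtype)\n",
+    "s.sort_values().tolist()  # ['<$10k', '$50-75k', '>150k']"
+   ]
+  },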
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "(180, 3)"
+      ]
+     },
+     "execution_count": 9,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "molten_pew.shape"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "   religion              income  frequency\n",
+       "0  Agnostic               <$10k         27\n",
+       "1  Agnostic             $10-20k         34\n",
+       "2  Agnostic             $20-30k         60\n",
+       "3  Agnostic             $30-40k         81\n",
+       "4  Agnostic             $40-50k         76\n",
+       "5  Agnostic             $50-75k        137\n",
+       "6  Agnostic            $75-100k        122\n",
+       "7  Agnostic           $100-150k        109\n",
+       "8  Agnostic               >150k         84\n",
+       "9  Agnostic  Don't know/Refused         96"
+      ]
+     },
+     "execution_count": 10,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "molten_pew.head(10)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Example 2: Billboard\n",
+    "\n",
+    "> Another common use of this data format is to record regularly spaced observations over time. For example, the Billboard dataset shown in Table 7 records the date a song first entered the Billboard Top 100. It has variables for **artist**, **track**, **date.entered**, **rank** and **week**. The rank in each week after it enters the top 100 is recorded in 75 columns, wk1 to wk75. If a song is in the Top 100 for less than 75 weeks the remaining columns are filled with missing values. This form of storage is not tidy, but it is useful for data entry. It reduces duplication since otherwise each song in each week would need its own row, and song metadata like title and artist would need to be repeated."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Load the Data\n",
+    "\n",
+    "The data come in a CSV file with tediously named week columns."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Usage of \"1st\", \"2nd\", \"3rd\" should be forbidden by law :)\n",
+    "usecols = ['artist.inverted', 'track', 'time', 'date.entered'] + (\n",
+    "    [f'x{i}st.week' for i in range(1, 76, 10) if i != 11]\n",
+    "    + [f'x{i}nd.week' for i in range(2, 76, 10) if i != 12]\n",
+    "    + [f'x{i}rd.week' for i in range(3, 76, 10) if i != 13]\n",
+    "    + [f'x{i}th.week' for i in range(1, 76) if (i % 10) not in (1, 2, 3)]\n",
+    "    + ['x11th.week', 'x12th.week', 'x13th.week']\n",
+    ")\n",
+    "\n",
+    "billboard = pd.read_csv('data/billboard.csv', encoding='iso-8859-1',\n",
+    "                        parse_dates=['date.entered'], usecols=usecols)\n",
+    "\n",
+    "billboard = billboard.assign(year=lambda x: x['date.entered'].dt.year)\n",
+    "\n",
+    "# Rename the week columns.\n",
+    "week_columns = {\n",
+    "    c: ('wk' + re.sub(r'[^\d]+', '', c))\n",
+    "    for c in billboard.columns\n",
+    "    if c.endswith('.week')\n",
+    "}\n",
+    "billboard = billboard.rename(columns={'artist.inverted': 'artist', **week_columns})\n",
+    "\n",
+    "# Ensure the columns' order is the same as in the paper.\n",
+    "columns = ['year', 'artist', 'track', 'time', 'date.entered'] + [\n",
+    "    f'wk{i}' for i in range(1, 76)\n",
+    "]\n",
+    "billboard = billboard[columns]\n",
+    "\n",
+    "# Ensure the rows' order is similar to that in the paper.\n",
+    "# For unknown reasons the exact ordering from the paper cannot be reconstructed.\n",
+    "billboard = billboard[billboard['year'] == 2000]\n",
+    "billboard = billboard.sort_values(['artist', 'track'])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Messy Data\n",
+    "\n",
+    "Again, the next cell shows the data as they were actually provided \"raw\"."
+ ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(267, 80)" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "billboard.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" ],
+      "text/plain": [
+       "     year          artist                                             track  \\\n",
+       "246  2000           2 Pac               Baby Don't Cry (Keep Ya Head Up II)   \n",
+       "287  2000         2Ge+her  The Hardest Part Of Breaking Up (Is Getting Ba...   \n",
+       "24   2000    3 Doors Down                                        Kryptonite   \n",
+       "193  2000    3 Doors Down                                             Loser   \n",
+       "69   2000        504 Boyz                                     Wobble Wobble   \n",
+       "22   2000             98¡                Give Me Just One Night (Una Noche)   \n",
+       "304  2000         A*Teens                                     Dancing Queen   \n",
+       "135  2000         Aaliyah                                     I Don't Wanna   \n",
+       "14   2000         Aaliyah                                         Try Again   \n",
+       "200  2000  Adams, Yolanda                                     Open My Heart   \n",
+       "\n",
+       "     time date.entered  wk1   wk2   wk3   wk4    wk5 ...   wk66  wk67  wk68  \\\n",
+       "246  4:22   2000-02-26   87  82.0  72.0  77.0   87.0 ...    NaN   NaN   NaN   \n",
+       "287  3:15   2000-09-02   91  87.0  92.0   NaN    NaN ...    NaN   NaN   NaN   \n",
+       "24   3:53   2000-04-08   81  70.0  68.0  67.0   66.0 ...    NaN   NaN   NaN   \n",
+       "193  4:24   2000-10-21   76  76.0  72.0  69.0   67.0 ...    NaN   NaN   NaN   \n",
+       "69   3:35   2000-04-15   57  34.0  25.0  17.0   17.0 ...    NaN   NaN   NaN   \n",
+       "22   3:24   2000-08-19   51  39.0  34.0  26.0   26.0 ...    NaN   NaN   NaN   \n",
+       "304  3:44   2000-07-08   97  97.0  96.0  95.0  100.0 ...    NaN   NaN   NaN   \n",
+       "135  4:15   2000-01-29   84  62.0  51.0  41.0   38.0 ...    NaN   NaN   NaN   \n",
+       "14   4:03   2000-03-18   59  53.0  38.0  28.0   21.0 ...    NaN   NaN   NaN   \n",
+       "200  5:30   2000-08-26   76  76.0  74.0  69.0   68.0 ...    NaN   NaN   NaN   \n",
+       "\n",
+       "     wk69  wk70  wk71  wk72  wk73  wk74  wk75  \n",
+       "246   NaN   NaN   NaN   NaN   NaN   NaN   NaN  \n",
+       "287   NaN   NaN   NaN   NaN   NaN   NaN   NaN  \n",
+       "24    NaN   NaN   NaN   NaN   NaN   NaN   NaN  \n",
+       "193   NaN   NaN   NaN   NaN   NaN   NaN   NaN  \n",
+       "69    NaN   NaN   NaN   NaN   NaN   NaN   NaN  \n",
+       "22    NaN   NaN   NaN   NaN   NaN   NaN   NaN  \n",
+       "304   NaN   NaN   NaN   NaN   NaN   NaN   NaN  \n",
+       "135   NaN   NaN   NaN   NaN   NaN   NaN   NaN  \n",
+       "14    NaN   NaN   NaN   NaN   NaN   NaN   NaN  \n",
+       "200   NaN   NaN   NaN   NaN   NaN   NaN   NaN  \n",
+       "\n",
+       "[10 rows x 80 columns]"
+      ]
+     },
+     "execution_count": 13,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "billboard.head(10)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Tidy Data\n",
+    "\n",
+    "As before, the *pd.melt* function is used to transform the data from \"wide\" to \"long\" form."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "molten_billboard = pd.melt(\n",
+    "    billboard,\n",
+    "    id_vars=['year', 'artist', 'track', 'time', 'date.entered'],\n",
+    "    var_name='week',\n",
+    "    value_name='rank',\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In contrast to R, pandas keeps (unnecessary) rows for weeks in which a song was already out of the charts. These are discarded. Also, a new column *date* is added, computed as *date.entered* plus *(week - 1)* weeks; it indicates when exactly a particular song held a certain rank."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# pandas keeps \"wide\" variables that had missing values as rows.\n",
+    "molten_billboard = molten_billboard[molten_billboard['rank'].notnull()]\n",
+    "\n",
+    "# Cast to integer only after the missing values are removed.\n",
+    "molten_billboard['week'] = molten_billboard['week'].map(lambda x: int(x[2:]))\n",
+    "molten_billboard['rank'] = molten_billboard['rank'].map(int)\n",
+    "\n",
+    "# Derive the actual date from the date of first entry and the week number.\n",
+    "molten_billboard = molten_billboard.assign(\n",
+    "    date=lambda x: x['date.entered'] + (x['week'] - 1) * datetime.timedelta(weeks=1)\n",
+    ")\n",
+    "\n",
+    "# Sort rows and columns as in the paper.\n",
+    "molten_billboard = molten_billboard[\n",
+    "    ['year', 'artist', 'time', 'track', 'date', 'week', 'rank']\n",
+    "]\n",
+    "molten_billboard = (\n",
+    "    molten_billboard.sort_values(['artist', 'track', 'week']).reset_index(drop=True)\n",
+    ")"
+   ]
+  },
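+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A quick standalone check of the date arithmetic just performed, using the first 2 Pac row: a song that entered the charts on 2000-02-26 should, in week 7, sit six weeks later, on 2000-04-08."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Sketch only: verify date.entered + (week - 1) weeks by hand.\n",
+    "pd.Timestamp('2000-02-26') + (7 - 1) * datetime.timedelta(weeks=1)"
+   ]
+  },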
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "
" + ], + "text/plain": [ + " year artist time \\\n", + "0 2000 2 Pac 4:22 \n", + "1 2000 2 Pac 4:22 \n", + "2 2000 2 Pac 4:22 \n", + "3 2000 2 Pac 4:22 \n", + "4 2000 2 Pac 4:22 \n", + "5 2000 2 Pac 4:22 \n", + "6 2000 2 Pac 4:22 \n", + "7 2000 2Ge+her 3:15 \n", + "8 2000 2Ge+her 3:15 \n", + "9 2000 2Ge+her 3:15 \n", + "10 2000 3 Doors Down 3:53 \n", + "11 2000 3 Doors Down 3:53 \n", + "12 2000 3 Doors Down 3:53 \n", + "13 2000 3 Doors Down 3:53 \n", + "14 2000 3 Doors Down 3:53 \n", + "\n", + " track date week rank \n", + "0 Baby Don't Cry (Keep Ya Head Up II) 2000-02-26 1 87 \n", + "1 Baby Don't Cry (Keep Ya Head Up II) 2000-03-04 2 82 \n", + "2 Baby Don't Cry (Keep Ya Head Up II) 2000-03-11 3 72 \n", + "3 Baby Don't Cry (Keep Ya Head Up II) 2000-03-18 4 77 \n", + "4 Baby Don't Cry (Keep Ya Head Up II) 2000-03-25 5 87 \n", + "5 Baby Don't Cry (Keep Ya Head Up II) 2000-04-01 6 94 \n", + "6 Baby Don't Cry (Keep Ya Head Up II) 2000-04-08 7 99 \n", + "7 The Hardest Part Of Breaking Up (Is Getting Ba... 2000-09-02 1 91 \n", + "8 The Hardest Part Of Breaking Up (Is Getting Ba... 2000-09-09 2 87 \n", + "9 The Hardest Part Of Breaking Up (Is Getting Ba... 2000-09-16 3 92 \n", + "10 Kryptonite 2000-04-08 1 81 \n", + "11 Kryptonite 2000-04-15 2 70 \n", + "12 Kryptonite 2000-04-22 3 68 \n", + "13 Kryptonite 2000-04-29 4 67 \n", + "14 Kryptonite 2000-05-06 5 66 " + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "molten_billboard.head(15)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.5" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}