- replace pipenv with poetry
- update the README.md:
* streamline the text
* update links to notebooks with nbviewer
* update installation notes with poetry info
- streamline the notebooks:
* use backticks in MarkDown cells to make references to
columns in DataFrames clearer
* blacken all code cells
- add MIT license
- ignore .venv/ and .python-version
Tidy Data
The purpose of this repository is to illustrate how the data cleaning process described in the paper "Tidy Data" by Hadley Wickham, a member of the RStudio team, can be done in Python.
The paper was published in 2014 in the Journal of Statistical Software. The author offers it for free here. Furthermore, the original R code is available here.
After installing the dependencies for this project (see the installation notes below), first read the paper to get the big picture and then work through the six Jupyter notebooks listed below.
Summary
Definition
Tidy data is data in tabular form that meets the following requirements:
- each variable is a column,
- each observation is a row, and
- each type of observational unit forms a table.
This corresponds to Codd's third normal form, a concept from relational database theory, framed in statistical language. A dataset that does not satisfy these properties is called messy.
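As a minimal sketch of the third requirement, the following pandas snippet splits a table that mixes two observational units (songs and their weekly chart ranks) into one table per unit. The data, the column names, and the surrogate key `song_id` are made up for illustration and are not taken from the paper or the notebooks.

```python
import pandas as pd

# Made-up table mixing two observational units: songs and their weekly ranks.
mixed = pd.DataFrame(
    {
        "artist": ["A", "A", "B"],
        "track": ["X", "X", "Y"],
        "week": [1, 2, 1],
        "rank": [10, 8, 5],
    }
)

# First observational unit: one row per song, with a surrogate key.
songs = mixed[["artist", "track"]].drop_duplicates().reset_index(drop=True)
songs["song_id"] = songs.index

# Second observational unit: one row per song and week, referencing the key.
ranks = mixed.merge(songs, on=["artist", "track"])[["song_id", "week", "rank"]]

print(songs)
print(ranks)
```

Song attributes are now stored once per song instead of once per week, which is the redundancy that normalization is meant to remove.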
Tidying Data
The five most common problems with messy data are:
- column headers are values, not variable names (see notebook 1)
- multiple variables are stored in one column (see notebook 2)
- variables are stored in both rows and columns (see notebook 3)
- multiple types of observational units are stored in the same table (see notebook 4)
- a single observational unit is stored in multiple tables (see notebook 5)
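The first two problems can be sketched in a few lines of pandas. The example below is an illustration under assumptions, not the code from the notebooks: the table, its values, and the header encoding (`m014` standing for male, ages 0-14) are made up.

```python
import pandas as pd

# Made-up messy table: the sex/age groups are stored as column headers,
# and each header packs two variables (sex and age range) into one name.
messy = pd.DataFrame(
    {
        "country": ["AD", "AE"],
        "m014": [0, 2],
        "f014": [1, 3],
    }
)

# Problem 1: move the header values into a column, one row per observation.
molten = messy.melt(id_vars="country", var_name="group", value_name="cases")

# Problem 2: split the combined column into its two variables.
molten["sex"] = molten["group"].str[0]
molten["age"] = molten["group"].str[1:]
tidy = molten[["country", "sex", "age", "cases"]]

print(tidy)
```

`DataFrame.melt` does the reshaping here; slicing the string column is only one way to separate the variables and works because the assumed encoding has a fixed-width prefix.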
Case Study
A case study (see notebook 6) shows the advantages of tidy data as a standardized input to statistical functions.
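To illustrate why a standardized layout helps, the sketch below feeds a small tidy table directly into pandas aggregation functions. It is not the case study from notebook 6; the data and column names are made up.

```python
import pandas as pd

# Made-up tidy table: one row per observation, one column per variable.
tidy = pd.DataFrame(
    {
        "country": ["AD", "AD", "AE", "AE"],
        "sex": ["m", "f", "m", "f"],
        "cases": [0, 1, 2, 3],
    }
)

# Aggregations refer to variables by column name; no reshaping is needed first.
totals = tidy.groupby("country")["cases"].sum()

# The same table can be cross-tabulated for presentation in one call.
by_sex = tidy.pivot_table(
    index="country", columns="sex", values="cases", aggfunc="sum"
)

print(totals)
print(by_sex)
```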
Installation
Get a local copy of this repository with git.
git clone https://github.com/webartifex/tidy-data.git
If you are not familiar with git, simply download the latest version of the files in a zip archive here.
This project uses poetry to manage its dependencies. To install all third-party packages into a virtual environment, run:
poetry install
Alternatively, the Anaconda Distribution should also suffice to run the provided notebooks.