Update the project for 2020

- replace pipenv with poetry
- update the README.md:
  * streamline the text
  * update links to notebooks with nbviewer
  * update installation notes with poetry info
- streamline the notebooks:
  * use backticks in MarkDown cells to make references to
    columns in DataFrames clearer
  * blacken all code cells
- add MIT license
- ignore .venv/ and .python-version

2020-08-26 00:40:43 +02:00


Tidy Data

The purpose of this repository is to illustrate how the data cleaning process described in the paper "Tidy Data" by Hadley Wickham, a member of the RStudio team, can be done in Python.

The paper was published in 2014 in the Journal of Statistical Software. The author offers it for free here. Furthermore, the original R code is available here.

After installing the dependencies for this project (see the installation notes below), first read the paper to get the big picture and then work through the six Jupyter notebooks listed below.

Summary

Definition

Tidy data is defined as data that comes in a table form adhering to the following requirements:

- each variable is a column,
- each observation is a row, and
- each type of observational unit forms a table.

This is equivalent to Codd's third normal form, a concept from the theory of relational databases. A dataset that does not satisfy these properties is called messy.
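To make the definition concrete, here is a minimal sketch that holds the same values once in a messy layout and once in a tidy layout. It assumes pandas is installed; the column names and numbers are invented for illustration and are not taken from the paper or the notebooks.

```python
import pandas as pd

# Hypothetical measurements, invented for illustration only.
# Messy: the two treatment columns are values of a "treatment" variable.
messy = pd.DataFrame(
    {
        "person": ["A", "B", "C"],
        "treatment_a": [10, 16, 3],
        "treatment_b": [2, 11, 1],
    }
)

# Tidy: one column per variable (person, treatment, result)
# and one row per observation (one person under one treatment).
tidy = pd.DataFrame(
    {
        "person": ["A", "B", "C", "A", "B", "C"],
        "treatment": ["a", "a", "a", "b", "b", "b"],
        "result": [10, 16, 3, 2, 11, 1],
    }
)
```

Both frames carry the same six measurements; only the tidy one lets every variable be addressed by a column name.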

Tidying Data

The five most common problems with messy data are:

- column headers are values, not variable names (see notebook 1)
- multiple variables are stored in one column (see notebook 2)
- variables are stored in both rows and columns (see notebook 3)
- multiple types of observational units are stored in the same table (see notebook 4)
- a single observational unit is stored in multiple tables (see notebook 5)
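As a rough sketch of how the first two problems might be addressed with pandas (the notebooks may well proceed differently), the following uses `DataFrame.melt` and `Series.str.extract`. All data, column names, and the `"m014"`-style codes are assumptions made up for this example.

```python
import pandas as pd

# Problem 1: column headers are values, not variable names.
# Hypothetical counts per income bracket, invented for illustration.
wide = pd.DataFrame(
    {
        "religion": ["X", "Y"],
        "<$10k": [27, 12],
        "$10-20k": [34, 27],
    }
)
# Turn the bracket columns into an "income" variable with one row per pair.
tidy_income = wide.melt(
    id_vars="religion", var_name="income", value_name="frequency"
)

# Problem 2: multiple variables are stored in one column.
# A hypothetical code such as "m014" combines sex and age group.
combined = pd.DataFrame({"group": ["m014", "f1524"], "cases": [5, 7]})
# Named groups in the pattern become the column names of the result.
parts = combined["group"].str.extract(r"^(?P<sex>[mf])(?P<age>\d+)$")
tidy_cases = pd.concat([parts, combined[["cases"]]], axis=1)
```

The remaining problems typically call for other tools, such as pivoting or splitting and merging tables, which this sketch does not cover.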

Case Study

A case study (cf., notebook 6) shows the advantages of tidy data as a standardized input to statistical functions.
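One way to see why tidy data works well as a standardized input is a grouped aggregation, sketched below with pandas. The data are hypothetical and unrelated to the dataset used in notebook 6.

```python
import pandas as pd

# Hypothetical tidy data, invented for illustration:
# one row per observation, one column per variable.
tidy = pd.DataFrame(
    {
        "person": ["A", "B", "C", "A", "B", "C"],
        "treatment": ["a", "a", "a", "b", "b", "b"],
        "result": [10, 16, 4, 2, 11, 2],
    }
)

# Because "treatment" is a proper column, a summary per treatment
# is a single groupby call, with no reshaping beforehand.
mean_by_treatment = tidy.groupby("treatment")["result"].mean()
```

With the messy layout, the same summary would first require reshaping or column-specific code.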

Installation

Get a local copy of this repository with git.

git clone https://github.com/webartifex/tidy-data.git

If you are not familiar with git, simply download the latest version of the files in a zip archive here.

This project uses poetry to manage its dependencies. Install all third-party packages into a virtual environment.

poetry install

Alternatively, the Anaconda Distribution should also suffice to run the provided notebooks.
