
Update the project for 2020

- replace pipenv with poetry
- update the README.md:
  * streamline the text
  * update links to notebooks with nbviewer
  * update installation notes with poetry info
- streamline the notebooks:
  * use backticks in Markdown cells to make references to
    columns in DataFrames clearer
  * blacken all code cells
- add MIT license
- ignore .venv/ and .python-version
Alexander Hess 2020-08-26 00:07:58 +02:00
commit a3a17236a2
Signed by: alexander
GPG key ID: 344EA5AB10D868E0
13 changed files with 1975 additions and 1158 deletions

@@ -1,21 +1,18 @@
# Tidy Data
The purpose of this repository is to re-do the work described in the paper
[Tidy Data](tidy-data.pdf) by Hadley Wickham (member of the RStudio team) in
Python.
The purpose of this repository is to illustrate how the data cleaning process described
in the paper "[Tidy Data](tidy-data.pdf)" by Hadley Wickham, a member of the
[RStudio](https://rstudio.com/) team, can be done in
[Python](https://www.python.org/).
The paper was published in 2014 in the Journal of
[Statistical Software](https://www.jstatsoft.org/article/view/v059i10). The
author offers it for free download
[here](http://vita.had.co.nz/papers/tidy-data.html). Furthermore, the original
R code is available in a Github
[repository](https://github.com/hadley/tidy-data)
The paper was published in 2014 in the [Journal of Statistical Software](https://www.jstatsoft.org/article/view/v059i10).
The author offers it for free [here](http://vita.had.co.nz/papers/tidy-data.html).
Furthermore, the original [R](https://www.r-project.org/) code is available [here](https://github.com/hadley/tidy-data).
After installing this project, it is recommended to first read the paper to get
the big picture and then work through the six Jupyter notebooks (listed further
below).
After installing the dependencies for this project (cf., the [installation notes](https://github.com/webartifex/tidy-data#installation)
below), it is recommended to first read the paper to get the big picture and
then work through the six Jupyter notebooks listed below.
See installation notes at the bottom.
## Summary
@@ -23,50 +20,51 @@ See installation notes at the bottom.
### Definition
**Tidy** data is defined as data that comes in a table form adhering to the
following requirements:
following requirements:
1. each variable is a column,
2. each observation a row, and
3. each type of observational unit forms a table.
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
This is equivalent to Codd's 3rd normal form (in the context of relational
databases). A dataset that does not satisfy these properties is called
**messy**.
This is equivalent to [Codd's 3rd normal form](https://en.wikipedia.org/wiki/Third_normal_form),
a concept from the theory of relational databases.
A dataset that does *not* satisfy these properties is called **messy**.
### Tidying messy Data
### Tidying Data
The five most common problems with messy data are as follows:
The five most common problems with messy data are:
- Column headers are values, not variable names
[[notebook](1_column_headers_are_values.ipynb)]
- Multiple variables are stored in one column
[[notebook](2_multiple_variables_stored_in_one_column.ipynb)]
- Variables are stored in both rows and columns
[[notebook](3_variables_are_stored_in_both_rows_and_columns.ipynb)]
- Multiple types of observational units are stored in the same table
[[notebook](4_multiple_types_in_one_table.ipynb)]
- A single observational unit is stored in multiple tables
[[notebook](5_one_type_in_multiple_tables.ipynb)]
- column headers are values, not variable names
(cf., [notebook 1](https://nbviewer.jupyter.org/github/webartifex/tidy-data/blob/master/1_column_headers_are_values.ipynb))
- multiple variables are stored in one column
(cf., [notebook 2](https://nbviewer.jupyter.org/github/webartifex/tidy-data/blob/master/2_multiple_variables_stored_in_one_column.ipynb))
- variables are stored in both rows and columns
(cf., [notebook 3](https://nbviewer.jupyter.org/github/webartifex/tidy-data/blob/master/3_variables_are_stored_in_both_rows_and_columns.ipynb))
- multiple types of observational units are stored in the same table
(cf., [notebook 4](https://nbviewer.jupyter.org/github/webartifex/tidy-data/blob/master/4_multiple_types_in_one_table.ipynb))
- a single observational unit is stored in multiple tables
(cf., [notebook 5](https://nbviewer.jupyter.org/github/webartifex/tidy-data/blob/master/5_one_type_in_multiple_tables.ipynb))
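
The first problem in the list, column headers that are really values, can be illustrated with a minimal sketch. It is not taken from the notebooks: the table contents, the column names, and the `melt` helper are all hypothetical, and the sketch deliberately uses only the Python standard library so that it runs without any third-party packages, whereas the notebooks work with DataFrames.

```python
# Hypothetical messy table (made-up numbers): the income brackets appear
# as column headers although they are values of an "income" variable.
messy = [
    {"religion": "A", "<$10k": 27, "$10-20k": 34},
    {"religion": "B", "<$10k": 12, "$10-20k": 27},
]


def melt(rows, id_column, var_name, value_name):
    """Turn every non-identifier column into its own (variable, value) row.

    Illustrative helper only; it assumes every row contains `id_column`.
    """
    tidy_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on each new row instead
            tidy_rows.append(
                {id_column: row[id_column], var_name: column, value_name: value}
            )
    return tidy_rows


# Each variable now forms a column and each observation forms a row.
tidy = melt(messy, id_column="religion", var_name="income", value_name="count")
for tidy_row in tidy:
    print(tidy_row)
```

The two messy rows with two value columns each become four tidy rows, one per combination of `religion` and `income`.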
Further, a [case study](6_case_study.ipynb) shows the advantages of tidy data
(as standardized input/output to statistical functions).
## Download & Installation
### Case Study
Create a local copy of this repository with:
A case study (cf., [notebook 6](https://nbviewer.jupyter.org/github/webartifex/tidy-data/blob/master/6_case_study.ipynb))
shows the advantages of tidy data as a standardized input to statistical functions.
## Installation
Get a local copy of this repository with [git](https://git-scm.com/).
`git clone https://github.com/webartifex/tidy-data.git`
This project uses [pipenv](https://docs.pipenv.org/) to manage its
dependencies.
If you are not familiar with [git](https://git-scm.com/), simply download the latest
version of the files in a zip archive [here](https://github.com/webartifex/tidy-data/archive/master.zip).
To install all third-party Python packages in the most recent version into a
project-local virtual environment, run:
This project uses [poetry](https://python-poetry.org/docs/) to manage its dependencies.
Install all third-party packages into a [virtual environment](https://docs.python.org/3/library/venv.html).
`pipenv install`
`poetry install`
To install all packages with the same version as of the time of creating this
project (for exact reproducibility), run:
`pipenv install --ignore-pipfile`
Alternatively, use the [Anaconda Distribution](https://www.anaconda.com/products/individual),
which *should* also suffice to run the provided notebooks.