From f4f2cf476e74334e04dcaef54b4079f03952a4e1 Mon Sep 17 00:00:00 2001 From: Alexander Hess Date: Wed, 21 Oct 2020 18:48:59 +0200 Subject: [PATCH] Add initial version of chapter 08's exercises, part 1 --- 08_mfr/02_exercises.ipynb | 760 ++++++++++++++++++++++++++++++++++++++ 08_mfr/stream.py | 51 +++ CONTENTS.md | 5 +- 3 files changed, 815 insertions(+), 1 deletion(-) create mode 100644 08_mfr/02_exercises.ipynb create mode 100644 08_mfr/stream.py diff --git a/08_mfr/02_exercises.ipynb b/08_mfr/02_exercises.ipynb new file mode 100644 index 0000000..0911aaf --- /dev/null +++ b/08_mfr/02_exercises.ipynb @@ -0,0 +1,760 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Note**: Click on \"*Kernel*\" > \"*Restart Kernel and Run All*\" in [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/) *after* finishing the exercises to ensure that your solution runs top to bottom *without* any errors. If you cannot run this file on your machine, you may want to open it [in the cloud ](https://mybinder.org/v2/gh/webartifex/intro-to-python/develop?urlpath=lab/tree/08_mfr/02_exercises.ipynb)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Chapter 8: Map, Filter, & Reduce (Coding Exercises)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The exercises below assume that you have read the [first ](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/develop/08_mfr/00_content.ipynb) and [second ](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/develop/08_mfr/01_content.ipynb) part of Chapter 8.\n", + "\n", + "The `...`'s in the code cells indicate where you need to fill in code snippets. The number of `...`'s within a code cell give you a rough idea of how many lines of code are needed to solve the task. You should not need to create any additional code cells for your final solution. However, you may want to use temporary code cells to try out some ideas." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Removing Outliers in Streaming Data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's say we are given a `list` object with random integers like `sample` below, and we want to calculate some basic statistics on them." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sample = [\n", + " 45, 46, 40, 49, 36, 53, 49, 42, 25, 40, 39, 36, 38, 40, 40, 52, 36, 52, 40, 41,\n", + " 35, 29, 48, 43, 42, 30, 29, 33, 55, 33, 38, 50, 39, 56, 52, 28, 37, 56, 45, 37,\n", + " 41, 41, 37, 30, 51, 32, 23, 40, 53, 40, 45, 39, 99, 42, 34, 42, 34, 39, 39, 53,\n", + " 43, 37, 46, 36, 45, 42, 32, 38, 57, 34, 36, 44, 47, 51, 46, 39, 28, 40, 35, 46,\n", + " 41, 51, 41, 23, 46, 40, 40, 51, 50, 32, 47, 36, 38, 29, 32, 53, 34, 43, 39, 41,\n", + " 40, 34, 44, 40, 41, 43, 47, 57, 50, 42, 38, 25, 45, 41, 58, 37, 45, 55, 44, 53,\n", + " 82, 31, 45, 33, 32, 39, 46, 48, 42, 47, 40, 45, 51, 35, 31, 46, 40, 44, 61, 57,\n", + " 40, 36, 35, 55, 40, 56, 36, 35, 86, 36, 51, 40, 54, 50, 49, 36, 41, 37, 48, 41,\n", + " 42, 44, 40, 43, 51, 47, 46, 50, 40, 23, 40, 39, 28, 38, 42, 46, 46, 42, 46, 31,\n", + " 32, 40, 48, 27, 40, 40, 30, 32, 25, 31, 30, 43, 44, 29, 45, 41, 63, 32, 33, 58,\n", + "]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "len(sample)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Q1**: `list` objects are **sequences**. What *four* behaviors do they always come with?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + " < your answer >" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Q2**: Write a function `mean()` that calculates the simple arithmetic mean of a given `sequence` with numbers!\n", + "\n", + "Hints: You can solve this task with [built-in functions ](https://docs.python.org/3/library/functions.html) only. A `for`-loop is *not* needed." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def mean(sequence):\n", + " ..." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sample_mean = mean(sample)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sample_mean" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Q3**: Write a function `std()` that calculates the [standard deviation ](https://en.wikipedia.org/wiki/Standard_deviation) of a `sequence` of numbers! Integrate your `mean()` version from before and the [sqrt() ](https://docs.python.org/3/library/math.html#math.sqrt) function from the [math ](https://docs.python.org/3/library/math.html) module in the [standard library ](https://docs.python.org/3/library/index.html) provided to you below. Make sure `std()` calls `mean()` only *once* internally! Repeated calls to `mean()` would be a waste of computational resources.\n", + "\n", + "Hints: Parts of the code are probably too long to fit within the suggested 79 characters per line. So, use *temporary variables* inside your function. Instead of a `for`-loop, you may want to use a `list` comprehension or, even better, a memoryless `generator` expression." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from math import sqrt" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def std(sequence):\n", + " ..." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sample_std = std(sample)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sample_std" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Q4**: Complete `standardize()` below that takes a `sequence` of numbers and returns a `list` object with the **[z-scores ](https://en.wikipedia.org/wiki/Standard_score)** of these numbers! A z-score is calculated by subtracting the mean and dividing by the standard deviation. Re-use `mean()` and `std()` from before. Again, ensure that `standardize()` calls `mean()` and `std()` only *once*! Further, round all z-scores with the built-in [round() ](https://docs.python.org/3/library/functions.html#round) function and pass on the keyword-only argument `digits` to it.\n", + "\n", + "Hint: You may want to use a `list` comprehension instead of a `for`-loop." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def standardize(sequence, *, digits=3):\n", + " ..." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "z_scores = standardize(sample)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The [pprint() ](https://docs.python.org/3/library/pprint.html#pprint.pprint) function from the [pprint ](https://docs.python.org/3/library/pprint.html) module in the [standard library ](https://docs.python.org/3/library/index.html) allows us to \"pretty print\" long `list` objects compactly." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from pprint import pprint" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pprint(z_scores, compact=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We know that `standardize()` works correctly if the resulting z-scores' mean and standard deviation approach `0` and `1` for a long enough `sequence`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "mean(z_scores), std(z_scores)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Even though `standardize()` calls `mean()` and `std()` only once each, `mean()` is called *twice*! That is so because `std()` internally also re-uses `mean()`!\n", + "\n", + "**Q5.1**: Rewrite `std()` to take an optional keyword-only argument `seq_mean`, defaulting to `None`. If provided, `seq_mean` is used instead of the result of calling `mean()`. Otherwise, the latter is called.\n", + "\n", + "Hint: You must check if `seq_mean` is still the default value." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def std(sequence, *, seq_mean=None):\n", + " ..." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`std()` continues to work as before." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sample_std = std(sample)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sample_std" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Q5.2**: Now, rewrite `standardize()` to pass on the return value of `mean()` to `std()`! In summary, `standardize()` calculates the z-scores for the numbers in the `sequence` with as few computational steps as possible." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def standardize(sequence, *, digits=3):\n", + " ..." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "z_scores = standardize(sample)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "mean(z_scores), std(z_scores)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Q6**: With both `sample` and `z_scores` being materialized `list` objects, we can loop over pairs consisting of a number from `sample` and its corresponding z-score. Write a `for`-loop that prints out all the \"outliers,\" as which we define numbers with an absolute z-score above `1.96`. There are *four* of them in the `sample`.\n", + "\n", + "Hint: Use the [abs() ](https://docs.python.org/3/library/functions.html#abs) and [zip() ](https://docs.python.org/3/library/functions.html#zip) built-ins." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "..." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We provide a `stream` module with a `data` object that models an *infinite* **stream** of data (cf., the [*stream.py* ](https://github.com/webartifex/intro-to-python/blob/develop/08_mfr/stream.py) file in the repository)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from stream import data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`data` is of type `generator` and has *no* length." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "type(data)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "len(data)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "So, the only thing we can do with it is to pass it to the built-in [next() ](https://docs.python.org/3/library/functions.html#next) function and go over the numbers it streams one by one." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "next(data)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Q7**: What happens if you call `mean()` with `data` as the argument? What is the problem?\n", + "\n", + "Hints: If you try it out, you may have to press the \"Stop\" button in the toolbar at the top. Your computer should *not* crash, but you will *have to* restart this Jupyter notebook with \"Kernel\" > \"Restart\" and import `data` again." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + " < your answer >" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "mean(data)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Q8**: Write a function `take_sample()` that takes an `iterable` as its argument, like `data`, and creates a *materialized* `list` object out of its first `n` elements, defaulting to `1_000`!\n", + "\n", + "Hints: [next() ](https://docs.python.org/3/library/functions.html#next) and the [range() ](https://docs.python.org/3/library/functions.html#func-range) built-in may be helpful. You may want to use a `list` comprehension instead of a `for`-loop and write a one-liner. Audacious students may want to look at [isclice() ](https://docs.python.org/3/library/itertools.html#itertools.islice) in the [itertools ](https://docs.python.org/3/library/itertools.html) module in the [standard library ](https://docs.python.org/3/library/index.html)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def take_sample(iterable, *, n=1_000):\n", + " ..." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We take a `new_sample` from the stream of `data`, and its statistics are similar to the initial `sample`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "new_sample = take_sample(data)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "len(new_sample)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "mean(new_sample)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "std(new_sample)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Q9**: Convert `standardize()` into a *new* function `standardized()` that implements the *same* logic but works on a possibly *infinite* stream of data, provided as an `iterable`, instead of a *finite* `sequence`.\n", + "\n", + "To calculate a z-score, we need the stream's overall mean and standard deviation, and that is *impossible* to calculate if we do not know how long the stream is, and, in particular, if it is *infinite*. So, `standardized()` first takes a sample from the `iterable` internally, and uses the sample's mean and standard deviation to calculate the z-scores.\n", + "\n", + "Hint: `standardized()` *must* return a `generator` object. So, use a `generator` expression as the return value; unless you know about the `yield` statement already (cf., [reference ](https://docs.python.org/3/reference/simple_stmts.html#the-yield-statement))." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def standardized(iterable, *, digits=3):\n", + " ..." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`standardized()` works almost like `standardize()` except that we use it with [next() ](https://docs.python.org/3/library/functions.html#next) to obtain the z-scores one by one." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "z_scores = standardized(data)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "z_scores" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "type(z_scores)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "next(z_scores)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Q10.1**: `standardized()` allows us to go over an *infinite* stream of z-scores. What we want to do instead is to loop over the stream's raw numbers and skip the outliers. In the remainder of this exercise, you look at the parts that make up the `skip_outliers()` function below to achieve precisely that.\n", + "\n", + "The first steps in `skip_outliers()` are the same as in `standardized()`: We take a `sample` from the stream of `data` and calculate its statistics." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sample = ...\n", + "seq_mean = ...\n", + "seq_std = ..." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Q10.2**: Just as in `standardized()`, write a `generator` expression that produces z-scores one by one! However, instead of just generating a z-score, the resulting `generator` object should produce `tuple` objects consisting of a \"raw\" number from `data` and its z-score.\n", + "\n", + "Hint: Look at the revisited \"*Averaging Even Numbers*\" example in [Chapter 7 ](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/develop/07_sequences/01_content.ipynb#Example:-Averaging-all-even-Numbers-in-a-List-%28revisited%29) for some inspiration, which also contains a `generator` expression producing `tuple` objects." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "standardizer = (... for ... in data)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`standardizer` should produce `tuple` objects." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "next(standardizer)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Q10.3**: Write another `generator` expression that loops over `standardizer`. It contains an `if`-clause that keeps only numbers with an absolute z-score below the `threshold_z`. If you fancy, use `tuple` unpacking." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "threshold_z = 1.96" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "no_outliers = (... for ... in standardizer if ...)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`no_outliers` should produce `int` objects." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "next(no_outliers)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Q10.4**: Lastly, put everything together in the `skip_outliers()` function! Make sure you refer to `iterable` inside the function and not the global `data`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def skip_outliers(iterable, *, threshold_z=1.96):\n", + " sample = ...\n", + " seq_mean = ...\n", + " seq_std = ...\n", + " standardizer = ...\n", + " no_outliers = ...\n", + " return no_outliers" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, we can create a `generator` object and loop over the `data` in the stream with outliers skipped. Instead of the default `1.96`, we use a `threshold_z` of only `0.05`: That filters out all numbers except `42`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "skipper = skip_outliers(data, threshold_z=0.05)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "skipper" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "type(skipper)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "next(skipper)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Q11**: You implemented the functions `mean()`, `std()`, `standardize()`, `standardized()`, and `skip_outliers()`. Which of them are **eager**, and which are **lazy**? How do these two concepts relate to **finite** and **infinite** data?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + " < your answer >" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.6" + }, + "toc": { + "base_numbering": 1, + "nav_menu": {}, + "number_sections": false, + "sideBar": true, + "skip_h1_title": true, + "title_cell": "Table of Contents", + "title_sidebar": "Contents", + "toc_cell": false, + "toc_position": {}, + "toc_section_display": false, + "toc_window_display": false + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/08_mfr/stream.py b/08_mfr/stream.py new file mode 100644 index 0000000..6aa6cc7 --- /dev/null +++ b/08_mfr/stream.py @@ -0,0 +1,51 @@ +"""Simulation of random streams of data. + +This module defines: +- a generator object `data` modeling an infinite stream of integers +- a function `make_finite_stream()` that creates finite streams of data + +The probability distribution underlying the integers is Gaussian-like with a +mean of 42 and a standard deviation of 8. The left tail of the distribution is +cut off meaning that the streams only produce non-negative numbers. Further, +one in a hundred random numbers has an increased chance to be an outlier. +""" + +import itertools as _itertools +import random as _random + + +_random.seed(87) + + +def _infinite_stream(): + """Internal generator function to simulate an infinite stream of data.""" + while True: + number = max(0, int(_random.gauss(42, 8))) + if _random.randint(1, 100) == 1: + number *= 2 + yield number + + +def make_finite_stream(min_=5, max_=15): + """Simulate a finite stream of data. + + The returned stream is finite, but the number of elements to be produced + by it is still random. This default behavior may be turned off by passing + in `min_` and `max_` arguments with `min_ == max_`. + + Args: + min_ (optional, int): minimum numbers in the stream; defaults to 5 + max_ (optional, int): maximum numbers in the stream; defaults to 15 + + Returns: + finite_stream (generator) + + Raises: + ValueError: if max_ < min_ + """ + stream = _infinite_stream() + n = _random.randint(min_, max_) + yield from _itertools.islice(stream, n) + + +data = _infinite_stream() diff --git a/CONTENTS.md b/CONTENTS.md index 6291fb2..c377dbd 100644 --- a/CONTENTS.md +++ b/CONTENTS.md @@ -216,4 +216,7 @@ If this is not possible, (`list` Comprehension; `generator` Expression; Streams of Data; - Boolean Reducers) \ No newline at end of file + Boolean Reducers) + - [exercises ](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/develop/08_mfr/02_exercises.ipynb) + [](https://mybinder.org/v2/gh/webartifex/intro-to-python/develop?urlpath=lab/tree/08_mfr/02_exercises.ipynb) + (Removing Outliers in Streaming Data) \ No newline at end of file