intro-to-python/07_sequences_02_exercises.ipynb
Alexander Hess 86e59b847a Restructure *.ipynb files
- rename *_content.ipynb into *_lecture.ipynb
- adjust links in README.md
2020-02-02 17:29:59 +01:00

1813 lines
46 KiB
Text

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"# Chapter 7: Sequential Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Coding Exercises"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read [Chapter 7](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/07_sequences_00_lecture.ipynb) of the book. Then, work through the exercises below."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Working with Lists"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q1.1**: Write a function `nested_sum()` that takes a `list` object as its argument, which contains other `list` objects with numbers, and adds up the numbers! Use `nested_numbers` below to test your function!\n",
"\n",
"Hint: You need at least one `for`-loop."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nested_numbers = [[1, 2, 3], [4], [5], [6, 7], [8], [9]]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def nested_sum(list_of_lists):\n",
" ..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nested_sum(nested_numbers)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q1.2**: Provide a one-line expression to obtain the *same* result as `nested_sum()`!\n",
"\n",
"Hints: Use a *list comprehension*, or maybe even a *generator expression*. You may want to use the built-in [sum()](https://docs.python.org/3/library/functions.html#sum) function several times."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"..."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q1.3**: Generalize `nested_sum()` into a function `mixed_sum()` that can process a \"mixed\" `list` object, which contains numbers and other `list` objects with numbers! Use `mixed_numbers` below for testing!\n",
"\n",
"Hints: Use the built-in [isinstance()](https://docs.python.org/3/library/functions.html#isinstance) function to check how an element is to be processed. Get extra credit for adhering to *goose typing*, as explained in [Chapter 5](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/05_numbers_00_lecture.ipynb#Goose-Typing)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mixed_numbers = [[1, 2, 3], 4, 5, [6, 7], 8, [9]]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import collections.abc as abc"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def mixed_sum(list_of_lists):\n",
" ..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mixed_sum(mixed_numbers)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q1.4.1**: Write a function `cum_sum()` that takes a `list` object with numbers as its argument and returns a *new* `list` object with the **cumulative sums** of these numbers! So, `sum_up` below, `[1, 2, 3, 4, 5]`, should return `[1, 3, 6, 10, 15]`.\n",
"\n",
"Hint: The idea behind is similar to the [cumulative distribution function](https://en.wikipedia.org/wiki/Cumulative_distribution_function) from statistics."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sum_up = [1, 2, 3, 4, 5]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def cum_sum(numbers):\n",
" ..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cum_sum(sum_up)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q1.4.2**: We should always make sure that our functions also work in corner cases. What happens if your implementation of `cum_sum()` is called with an empty list `[]`? Make sure it handles that case *without* crashing! What would be a good return value in this corner case? Describe everything in the docstring.\n",
"\n",
"Hint: It is possible to write this without any extra input validation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cum_sum([])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Packing & Unpacking with Functions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the \"*Function Definitions & Calls*\" section in [Chapter 7](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/07_sequences_00_lecture.ipynb#Function-Definitions-&-Calls), we define the following function `product()`. In this exercise, you will improve it by making it more \"user-friendly.\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def product(*args):\n",
" \"\"\"Multiply all arguments.\"\"\"\n",
" result = args[0]\n",
"\n",
" for arg in args[1:]:\n",
" result *= arg\n",
"\n",
" return result"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `*` in the function's header line *packs* all *positional* arguments passed to `product()` into one *iterable* called `args`.\n",
"\n",
"**Q2.1**: What is the data type of `args` within the function's body?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because of the packing, we may call `product()` with an abitrary number of *positional* arguments: The product of just `42` remains `42`, while `2`, `5`, and `10` multiplied together result in `100`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"product(42)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"product(2, 5, 10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, \"abitrary\" does not mean that we can pass *no* argument. If we do so, we get an `IndexError`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"product()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q2.2**: What line in the body of `product()` causes this exception? What is the exact problem?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In [Chapter 7](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/07_sequences_00_lecture.ipynb#Function-Definitions-&-Calls), we also pass a `list` object, like `one_hundred`, to `product()`, and *no* exception is raised."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"one_hundred = [2, 5, 10]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"product(one_hundred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q2.3**: What is wrong with that? What *kind* of error (cf., [Chapter 1](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/01_elements_00_lecture.ipynb#Formal-vs.-Natural-Languages)) is that conceptually? Describe precisely what happens to the passed in `one_hundred` in every line within `product()`!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Of course, one solution is to *unpack* `one_hundred` with the `*` symbol. We look at another solution further below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"product(*one_hundred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's continue with the issue when calling `product()` *without* any argument.\n",
"\n",
"This revised version of `product()` avoids the `IndexError` from before."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def product(*args):\n",
" \"\"\"Multiply all arguments.\"\"\"\n",
" result = None\n",
"\n",
" for arg in args:\n",
" result *= arg\n",
"\n",
" return result"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"product()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q2.4**: Describe why no error occurs by going over every line in `product()`!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Unfortunately, the new version cannot process any arguments we pass in any more."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"product(42)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"product(2, 5, 10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q2.5**: What line causes troubles now? What is the exact problem?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q2.6**: Replace the `None` in `product()` above with something reasonable that does *not* cause exceptions! Ensure that `product(42)` and `product(2, 5, 10)` return a correct result.\n",
"\n",
"Hints: It is ok if `product()` returns a result *different* from the `None` above. Look at the documentation of the built-in [sum()](https://docs.python.org/3/library/functions.html#sum) function for some inspiration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def product(*args):\n",
" \"\"\"Multiply all arguments.\"\"\"\n",
" result = ...\n",
"\n",
" for arg in args:\n",
" result *= arg\n",
"\n",
" return result"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"product(42)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"product(2, 5, 10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, calling `product()` without any arguments returns what we would best describe as a *default* or *start* value. To be \"philosophical,\" what is the product of *no* numbers? We know that the product of *one* number is just the number itself, but what could be a reasonable result when multiplying *no* numbers? The answer is what you use as the initial value of `result` above, and there is only *one* way to make `product(42)` and `product(2, 5, 10)` work."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"product()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q2.7**: Rewrite `product()` so that it takes a *keyword-only* argument `start`, defaulting to the above *default* or *start* value, and use `start` internally instead of `result`!\n",
"\n",
"Hint: Remember that a *keyword-only* argument is any parameter specified in a function's header line after the first (and only) `*` (cf., [Chapter 2](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/02_functions_00_lecture.ipynb#Keyword-only-Arguments))."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def product(*args, ...):\n",
" \"\"\"Multiply all arguments.\"\"\"\n",
" ..."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we can call `product()` with a truly arbitrary number of *positional* arguments."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"product(42)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"product(2, 5, 10)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"product()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Without any *positional* arguments but only the *keyword* argument `start`, for example, `start=0`, we can adjust the answer to the \"philosophical\" problem of multiplying *no* numbers. Because of the *keyword-only* syntax, there is *no* way to pass in a `start` number *without* naming it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"product(start=0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We could use `start` to inject a multiplier, for example, to double the outcomes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"product(42, start=2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"product(2, 5, 10, start=2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There is still one issue left: Because of the function's name, a user of `product()` may assume that it is ok to pass a *collection* of numbers, like `one_hundred`, which are then multiplied."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"product(one_hundred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q2.8**: What is a **collection**? How is that different from a **sequence**?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q2.9**: Rewrite the latest version of `product()` to check if the *only* positional argument is a *collection* type! If so, its elements are multiplied together. Otherwise, the logic remains the same.\n",
"\n",
"Hints: Use the built-in [len()](https://docs.python.org/3/library/functions.html#len) and [isinstance()](https://docs.python.org/3/library/functions.html#isinstance) functions to check if there is only *one* positional argument and if it is a *collection* type. Use the *abstract base class* `Collection` from the [collections.abc](https://docs.python.org/3/library/collections.abc.html) module in the [standard library](https://docs.python.org/3/library/index.html). You may want to *re-assign* `args` inside the body."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import collections.abc as abc"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def product(*args, ...):\n",
" \"\"\"Multiply all arguments.\"\"\"\n",
" ..."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All *five* code cells below now return correct results. We may unpack `one_hundred` or not."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"product(42)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"product(2, 5, 10)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"product()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"product(one_hundred)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"product(*one_hundred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Side Note**: Above, we make `product()` work with a single *collection* type argument instead of a *sequence* type to keep it more generic: For example, we can pass in a `set` object, like `{2, 5, 10}` below, and `product()` continues to work correctly. The `set` type is introducted in [Chapter 8](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/08_mappings_00_lecture.ipynb#The-set-Type), and one essential difference to the `list` type is that objects of type `set` have *no* order regarding their elements. So, even though `[2, 5, 10]` and `{2, 5, 10}` look almost the same, the order implied in the literal notation gets lost in memory!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"product([2, 5, 10]) # the argument is a special collection type, namely a sequence"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"product({2, 5, 10}) # the argument is a collection that is NOT a sequence"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"isinstance({2, 5, 10}, abc.Sequence) # sets are NO sequences"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's continue to improve `product()` and make it more Pythonic. It is always a good idea to mimic the behavior of built-ins when writing our own functions. And, [sum()](https://docs.python.org/3/library/functions.html#sum), for example, raises a `TypeError` if called *without* any arguments. It does *not* return the \"philosophical\" answer to adding *no* numbers, which would be `0`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q2.10**: Adapt the latest version of `product()` to also raise a `TypeError` if called *without* any *positional* arguments!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def product(*args, ...):\n",
" \"\"\"Multiply all arguments.\"\"\"\n",
" ..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"product()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we have an implementation of `product()` that is convenient to use for the caller of our function. In particular, we can pass it a *collection* with or without *unpacking* it.\n",
"\n",
"However, this version of `product()` suffers from one more flaw: We cannot pass it a *stream* of data, as modeled, for example, with an *iterator* object that produces elements on a one-by-one basis.\n",
"\n",
"Let's look at an example. The [*stream.py*](https://github.com/webartifex/intro-to-python/blob/master/stream.py) module in the repository provides a `make_finite_stream()` function. It is a *factory* function creating objects of type `generator` that we use to model *streaming* data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from stream import make_finite_stream"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"stream = make_finite_stream()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"stream"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"type(stream)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Being a `generator`, `stream` is also an `Iterator` in the abstract sense."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"isinstance(stream, abc.Iterator)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Iterators* are good for only *one* thing: Giving us the \"next\" element in a line of many."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"next(stream)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"They themselves have *no* idea of how many elements they produce eventually: The built-in [len()](https://docs.python.org/3/library/functions.html#len) function raises a `TypeError`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"len(stream)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use the [list()](https://docs.python.org/3/library/functions.html#func-list) built-in to *materialize* the elements. However, in a real-world scenario, these may *not* fit into our machine's memory!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"list(stream)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To be more realistic, `make_finite_stream()` creates `generator` objects producing a varying number of elements."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"list(make_finite_stream())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"list(make_finite_stream())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"list(make_finite_stream())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see what happens if we pass an *iterator*, as created by `make_finite_stream()`, instead of a materialized *collection*, like `one_hundred`, to `product()`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"product(make_finite_stream())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q2.11**: What line causes the `TypeError`? What line is really the problem in the latest implementation of `product()`? Describe what happens on each line in the function's body until the exception is raised!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q2.12**: Adapt `product()` one last time to make it work with *iterators* as well!\n",
"\n",
"Hints: This task is as easy as replacing `Collection` with something else. Which of the three behaviors of *collections* do *iterators* also exhibit? You may want to look at the documentations on the built-in [max()](https://docs.python.org/3/library/functions.html#max), [min()](https://docs.python.org/3/library/functions.html#min), and [sum()](https://docs.python.org/3/library/functions.html#sum) functions: What kind of argument do they take?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def product(*args, ...):\n",
" \"\"\"Multiply all arguments.\"\"\"\n",
" ..."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The final version of `product()` behaves like built-ins in edge cases, ..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"product()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"... works with the arguments passed either as independent *positional* arguments, *packed* into a single *collection* argument, or *unpacked*, ..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"product(42)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"product(2, 5, 10)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"product([2, 5, 10])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"product(*[2, 5, 10])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"... and can handle *streaming* data with *indefinite* \"length.\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"product(make_finite_stream())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In real-world projects, the data science practitioner must decide if it is worthwhile to make a function usable in various different forms as we did in this exercise, or if that is over-engineered.\n",
"\n",
"Yet, two lessons are important to take away:\n",
"- It is always a good idea to *mimic* the behavior of *built-ins* when in doubt.\n",
"- Make functions capable of working with *streaming* data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Removing Outliers in Streaming Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's say we are given a `list` object with random integers like `sample` below, and we want to calculate some basic statistics on them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sample = [\n",
" 45, 46, 40, 49, 36, 53, 49, 42, 25, 40, 39, 36, 38, 40, 40, 52, 36, 52, 40, 41,\n",
" 35, 29, 48, 43, 42, 30, 29, 33, 55, 33, 38, 50, 39, 56, 52, 28, 37, 56, 45, 37,\n",
" 41, 41, 37, 30, 51, 32, 23, 40, 53, 40, 45, 39, 99, 42, 34, 42, 34, 39, 39, 53,\n",
" 43, 37, 46, 36, 45, 42, 32, 38, 57, 34, 36, 44, 47, 51, 46, 39, 28, 40, 35, 46,\n",
" 41, 51, 41, 23, 46, 40, 40, 51, 50, 32, 47, 36, 38, 29, 32, 53, 34, 43, 39, 41,\n",
" 40, 34, 44, 40, 41, 43, 47, 57, 50, 42, 38, 25, 45, 41, 58, 37, 45, 55, 44, 53,\n",
" 82, 31, 45, 33, 32, 39, 46, 48, 42, 47, 40, 45, 51, 35, 31, 46, 40, 44, 61, 57,\n",
" 40, 36, 35, 55, 40, 56, 36, 35, 86, 36, 51, 40, 54, 50, 49, 36, 41, 37, 48, 41,\n",
" 42, 44, 40, 43, 51, 47, 46, 50, 40, 23, 40, 39, 28, 38, 42, 46, 46, 42, 46, 31,\n",
" 32, 40, 48, 27, 40, 40, 30, 32, 25, 31, 30, 43, 44, 29, 45, 41, 63, 32, 33, 58,\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"len(sample)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q3.1**: `list` objects are **sequences**. What *four* behaviors do they always come with?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q3.2**: Write a function `mean()` that calculates the simple arithmetic mean of a given `sequence` with numbers!\n",
"\n",
"Hints: You can solve this task with [built-in functions](https://docs.python.org/3/library/functions.html) only. A `for`-loop is *not* needed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def mean(sequence):\n",
" ..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sample_mean = mean(sample)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sample_mean"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q3.3**: Write a function `std()` that calculates the [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation) of a `sequence` of numbers! Integrate your `mean()` version from before and the [sqrt()](https://docs.python.org/3/library/math.html#math.sqrt) function from the [math](https://docs.python.org/3/library/math.html) module in the [standard library](https://docs.python.org/3/library/index.html) provided to you below. Make sure `std()` calls `mean()` only *once* internally! Repeated calls to `mean()` would be a waste of computational resources.\n",
"\n",
"Hints: Parts of the code are probably too long to fit within the suggested 79 characters per line. So, use *temporary variables* inside your function. Instead of a `for`-loop, you may want to use a *list comprehension* or, even better, a memoryless *generator expression*."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from math import sqrt"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def std(sequence):\n",
" ..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sample_std = std(sample)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sample_std"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q3.4**: Complete `standardize()` below that takes a `sequence` of numbers and returns a `list` object with the **[z-scores](https://en.wikipedia.org/wiki/Standard_score)** of these numbers! A z-score is calculated by subtracting the mean and dividing by the standard deviation. Re-use `mean()` and `std()` from before. Again, ensure that `standardize()` calls `mean()` and `std()` only *once*! Further, round all z-scores with the built-in [round()](https://docs.python.org/3/library/functions.html#round) function and pass on the keyword-only argument `digits` to it.\n",
"\n",
"Hint: You may want to use a *list comprehension* instead of a `for`-loop."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def standardize(sequence, *, digits=3):\n",
" ..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"z_scores = standardize(sample)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The [pprint()](https://docs.python.org/3/library/pprint.html#pprint.pprint) function from the [pprint](https://docs.python.org/3/library/pprint.html) module in the [standard library](https://docs.python.org/3/library/index.html) allows us to \"pretty print\" long `list` objects compactly."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pprint import pprint"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pprint(z_scores, compact=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We know that `standardize()` works correctly if the resulting z-scores' mean and standard deviation approach `0` and `1` for a long enough `sequence`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mean(z_scores), std(z_scores)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Even though `standardize()` calls `mean()` and `std()` only once each, `mean()` is called *twice*! That is so because `std()` internally also re-uses `mean()`!\n",
"\n",
"**Q3.5.1**: Rewrite `std()` to take an optional keyword-only argument `seq_mean`, defaulting to `None`. If provided, `seq_mean` is used instead of the result of calling `mean()`. Otherwise, the latter is called.\n",
"\n",
"Hint: You must check if `seq_mean` is still the default value."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def std(sequence, *, seq_mean=None):\n",
" ..."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`std()` continues to work as before."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sample_std = std(sample)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sample_std"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q3.5.2**: Now, rewrite `standardize()` to pass on the return value of `mean()` to `std()`! In summary, `standardize()` calculates the z-scores for the numbers in the `sequence` with as few computational steps as possible."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def standardize(sequence, *, digits=3):\n",
" ..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"z_scores = standardize(sample)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mean(z_scores), std(z_scores)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q3.6**: With both `sample` and `z_scores` being materialized `list` objects, we can loop over pairs consisting of a number from `sample` and its corresponding z-score. Write a `for`-loop that prints out all the \"outliers,\" as which we define numbers with an absolute z-score above `1.96`. There are *four* of them in the `sample`.\n",
"\n",
"Hint: Use the [abs()](https://docs.python.org/3/library/functions.html#abs) and [zip()](https://docs.python.org/3/library/functions.html#zip) built-ins."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"..."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We provide a `stream` module with a `data` object that models an *infinite* **stream** of data (cf., the [*stream.py*](https://github.com/webartifex/intro-to-python/blob/master/stream.py) file in the repository)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from stream import data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`data` is of type `generator` and has *no* length."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"type(data)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"len(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Being a `generator`, it is an `Iterator` in the abstract sense ..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import collections.abc as abc"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"isinstance(data, abc.Iterator)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"... and so the only thing we can do with it is to pass it to the built-in [next()](https://docs.python.org/3/library/functions.html#next) function and go over the numbers it streams one by one."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"next(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q3.7**: What happens if you call `mean()` with `data` as the argument? What is the problem?\n",
"\n",
"Hints: If you try it out, you may have to press the \"Stop\" button in the toolbar at the top. Your computer should *not* crash, but you will *have to* restart this Jupyter notebook with \"Kernel\" > \"Restart\" and import `data` again."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mean(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q3.8**: Write a function `take_sample()` that takes an `iterator` as its argument, like `data`, and creates a *materialized* `list` object out of its first `n` elements, defaulting to `1_000`!\n",
"\n",
"Hints: [next()](https://docs.python.org/3/library/functions.html#next) and the [range()](https://docs.python.org/3/library/functions.html#func-range) built-in may be helpful. You may want to use a *list comprehension* instead of a `for`-loop and write a one-liner. Audacious students may want to look at [isclice()](https://docs.python.org/3/library/itertools.html#itertools.islice) in the [itertools](https://docs.python.org/3/library/itertools.html) module in the [standard library](https://docs.python.org/3/library/index.html)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def take_sample(iterator, *, n=1_000):\n",
" ..."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We take a `new_sample` from the stream of `data`, and its statistics are similar to the initial `sample`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"new_sample = take_sample(data)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"len(new_sample)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mean(new_sample)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"std(new_sample)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q3.9**: Convert `standardize()` into a *new* function `standardized()` that implements the *same* logic but works on a possibly *infinite* stream of data, provided as an `iterable`, instead of a *finite* `sequence`.\n",
"\n",
"To calculate a z-score, we need the stream's overall mean and standard deviation, and that is *impossible* to calculate if we do not know how long the stream is, and, in particular, if it is *infinite*. So, `standardized()` first takes a sample from the `iterable` internally, and uses the sample's mean and standard deviation to calculate the z-scores.\n",
"\n",
"Hint: `standardized()` *must* return a `generator` object. So, use a *generator expression* as the return value; unless you know about the `yield` statement already (cf., [reference](https://docs.python.org/3/reference/simple_stmts.html#the-yield-statement))."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def standardized(iterable, *, digits=3):\n",
" ..."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`standardized()` works almost like `standardize()` except that we use it with [next()](https://docs.python.org/3/library/functions.html#next) to obtain the z-scores one by one."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"z_scores = standardized(data)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"z_scores"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"type(z_scores)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"next(z_scores)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q3.10.1**: `standardized()` allows us to go over an *infinite* stream of z-scores. What we want to do instead is to loop over the stream's raw numbers and skip the outliers. In the remainder of this exercise, you look at the parts that make up the `skip_outliers()` function below to achieve precisely that.\n",
"\n",
"The first steps in `skip_outliers()` are the same as in `standardized()`: We take a `sample` from the stream of `data` and calculate its statistics."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sample = ...\n",
"seq_mean = ...\n",
"seq_std = ..."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q3.10.2**: Just as in `standardized()`, write a *generator expression* that produces z-scores one by one! However, instead of just generating a z-score, the resulting `generator` object should produce `tuple` objects consisting of a \"raw\" number from `data` and its z-score.\n",
"\n",
"Hint: Look at the revisited \"*Averaging Even Numbers*\" example in [Chapter 7](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/07_sequences_00_lecture.ipynb#Example:-Averaging-Even-Numbers-%28revisited%29) for some inspiration, which also contains a generator expression producing `tuple` objects."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"standardizer = (... for ... in data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`standardizer` should produce `tuple` objects."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"next(standardizer)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q3.10.3**: Write another generator expression that loops over `standardizer`. It contains an `if`-clause that keeps only numbers with an absolute z-score below the `threshold_z`. If you fancy, use *tuple unpacking*."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"threshold_z = 1.96"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"no_outliers = (... for ... in standardizer if ...)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`no_outliers` should produce `int` objects."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"next(no_outliers)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q3.10.4**: Lastly, put everything together in the `skip_outliers()` function! Make sure you refer to `iterable` inside the function and not the global `data`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def skip_outliers(iterable, *, threshold_z=1.96):\n",
" sample = ...\n",
" seq_mean = ...\n",
" seq_std = ...\n",
" standardizer = ...\n",
" no_outliers = ...\n",
" return no_outliers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we can create a `generator` object and loop over the `data` in the stream with outliers skipped. Instead of the default `1.96`, we use a `threshold_z` of only `0.05`: That filters out all numbers except `42`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"skipper = skip_outliers(data, threshold_z=0.05)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"skipper"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"type(skipper)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"next(skipper)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q3.11**: You implemented the functions `mean()`, `std()`, `standardize()`, `standardized()`, and `skip_outliers()`. Which of them are **eager**, and which are **lazy**? How do these two concepts relate to **finite** and **infinite** data?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": true,
"skip_h1_title": true,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": false,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}