# Chapter 7: Sequential Data

We studied numbers (cf., [Chapter 5](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/05_numbers.ipynb)) and textual data (cf., [Chapter 6](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/06_text.ipynb)) first, mainly because objects of the presented data types are "simple," for two reasons: First, they are *immutable*, and, as we saw in the "*Who am I? And how many?*" section in [Chapter 1](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/01_elements.ipynb#Who-am-I?-And-how-many?), mutable objects can quickly become hard to reason about. Second, they are "flat" in the sense that they are *not* composed of other objects.

The `str` type is a bit of a corner case in this regard. While one could argue that a longer `str` object, for example, `"text"`, is composed of individual characters, this is *not* the case in memory as the literal `"text"` only creates *one* object (i.e., one "bag" of $0$s and $1$s modeling all characters).

This chapter and Chapter 8 introduce various "complex" data types. While some are mutable and others are not, they all share that they are primarily used to "manage," or structure, the memory in a program. Unsurprisingly, computer scientists refer to the ideas and theories behind these data types as **[data structures](https://en.wikipedia.org/wiki/Data_structure)**.

In this chapter, we focus on data types that model all kinds of sequential data. Examples of such data are [spreadsheets](https://en.wikipedia.org/wiki/Spreadsheet) or [matrices](https://en.wikipedia.org/wiki/Matrix_%28mathematics%29)/[vectors](https://en.wikipedia.org/wiki/Vector_%28mathematics_and_physics%29). Such formats share the property that they are composed of smaller units that come in a sequence of, for example, rows/columns/cells or elements/entries.

## Collections vs. Sequences

[Chapter 6](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/06_text.ipynb#A-"String"-of-Characters) already describes the *sequence* properties of `str` objects. Here, we take a step back and study these properties on their own before looking at bigger ideas.

The [collections.abc](https://docs.python.org/3/library/collections.abc.html) module in the [standard library](https://docs.python.org/3/library/index.html) defines a variety of **abstract base classes** (ABCs). We saw ABCs already in [Chapter 5](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/05_numbers.ipynb#The-Numerical-Tower), where we use the ones from the [numbers](https://docs.python.org/3/library/numbers.html) module in the [standard library](https://docs.python.org/3/library/index.html) to classify Python's numeric data types according to mathematical ideas. Now, we take the ABCs from the [collections.abc](https://docs.python.org/3/library/collections.abc.html) module to classify the data types in this chapter according to their behavior in various contexts.

As an illustration, consider `numbers` and `word` below, two objects of *different* types.

In [1]:
numbers = [7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4]
word = "random"

They have in common that we may loop over them with the `for` statement. So, in the context of iteration, both exhibit the *same* behavior.

In [2]:
for number in numbers:
    print(number, end=" ")

7 11 8 5 3 12 2 6 9 10 1 4 

In [3]:
for character in word:
    print(character, end=" ")

r a n d o m 

In [Chapter 4](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/04_iteration.ipynb#Containers-vs.-Iterables), we referred to such types as *iterables*. That is *not* a proper [English](https://dictionary.cambridge.org/spellcheck/english-german/?q=iterable) word, even if it may sound like one at first sight. Yet, it is an official term in the Python world formalized with the `Iterable` ABC in the [collections.abc](https://docs.python.org/3/library/collections.abc.html) module.

For the data science practitioner, it is worthwhile to know such terms as, for example, the documentation on the [built-ins](https://docs.python.org/3/library/functions.html) uses them extensively: In simple words, any built-in that takes an argument called "*iterable*" may be called with *any* object that supports being looped over. Already familiar [built-ins](https://docs.python.org/3/library/functions.html) include, among others, [enumerate()](https://docs.python.org/3/library/functions.html#enumerate), [sum()](https://docs.python.org/3/library/functions.html#sum), or [zip()](https://docs.python.org/3/library/functions.html#zip). So, they do *not* require the argument to be of a certain concrete data type (e.g., `list`); instead, any *iterable* type works.

In [4]:
import collections.abc as abc

In [5]:
abc.Iterable

collections.abc.Iterable

As in the context of *goose typing* in [Chapter 5](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/05_numbers.ipynb#Goose-Typing), we can use ABCs with the built-in [isinstance()](https://docs.python.org/3/library/functions.html#isinstance) function to check if an object supports a behavior.

So, let's "ask" Python if it can loop over `numbers` or `word`.

In [6]:
isinstance(numbers, abc.Iterable)

True

In [7]:
isinstance(word, abc.Iterable)

True

Contrary to `list` or `str` objects, numeric objects are *not* iterable.

In [8]:
isinstance(999, abc.Iterable)

False

Instead of asking, we could try to loop over `999`, but this results in a `TypeError`.

In [9]:
for digit in 999:
    print(digit)

TypeError: 'int' object is not iterable

Most of the data types in this and the next chapter exhibit three [orthogonal](https://en.wikipedia.org/wiki/Orthogonality) (i.e., "independent") behaviors, formalized by ABCs in the [collections.abc](https://docs.python.org/3/library/collections.abc.html) module as:
- `Iterable`: An object supports being looped over.
- `Container`: An object "contains" references to other objects; a "whole" is composed of many "parts."
- `Sized`: The number of references to other objects, the "parts," is *finite*.
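
To see that the three behaviors listed above are indeed independent of one another, consider the following minimal sketch. It uses the object returned by the already familiar enumerate() built-in; the variable names are hypothetical and chosen for illustration only. Such an object supports being looped over, but it is neither a `Container` nor `Sized` as far as the ABCs are concerned.

```python
import collections.abc as abc

pairs = enumerate("random")  # `pairs` is a hypothetical name for illustration

is_iterable = isinstance(pairs, abc.Iterable)    # supports being looped over
is_container = isinstance(pairs, abc.Container)  # no dedicated membership test
is_sized = isinstance(pairs, abc.Sized)          # len() is not supported

print(is_iterable, is_container, is_sized)
```

So, an object may exhibit one of the behaviors without the other two.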

The characteristic operation supported by `Container` types is the `in` operator for membership testing.

In [10]:
0 in numbers

False

In [11]:
"r" in word

True

Alternatively, we could also check if `numbers` and `word` are `Container` types with the [isinstance()](https://docs.python.org/3/library/functions.html#isinstance) function.

In [12]:
isinstance(numbers, abc.Container)

True

In [13]:
isinstance(word, abc.Container)

True

Numeric objects do *not* "contain" references to other objects, and that is why they are considered "flat" data types. The `in` operator raises a `TypeError`. Conceptually speaking, Python views numeric types as "wholes" without any "parts."

In [14]:
isinstance(999, abc.Container)

False

In [15]:
9 in 999

TypeError: argument of type 'int' is not iterable

Analogously, being `Sized` types, we can pass `numbers` and `word` as the argument to the built-in [len()](https://docs.python.org/3/library/functions.html#len) function and obtain "meaningful" results. The exact meaning depends on the *concrete* data type: For `numbers`, [len()](https://docs.python.org/3/library/functions.html#len) tells us how many elements are in the `list` object; for `word`, it tells us how many [Unicode characters](https://en.wikipedia.org/wiki/Unicode) make up the `str` object. But, *abstractly* speaking, both data types exhibit the *same* behavior of *finiteness*.

In [16]:
len(numbers)

12

In [17]:
len(word)

6

In [18]:
isinstance(numbers, abc.Sized)

True

In [19]:
isinstance(word, abc.Sized)

True

On the contrary, even though `999` consists of three digits for humans, numeric objects in Python have no concept of a "size" or "length," and the [len()](https://docs.python.org/3/library/functions.html#len) function raises a `TypeError`.

In [20]:
isinstance(999, abc.Sized)

False

In [21]:
len(999)

TypeError: object of type 'int' has no len()

These three behaviors are so essential that whenever they coincide for a data type, it is called a **collection**, formalized with the `Collection` ABC. That is where the [collections.abc](https://docs.python.org/3/library/collections.abc.html) module got its name from: It summarizes all ABCs related to collections; in particular, it defines a hierarchy of specialized kinds of collections.

So, both `numbers` and `word` are collections.

In [22]:
isinstance(numbers, abc.Collection)

True

In [23]:
isinstance(word, abc.Collection)

True

They share one more common behavior: When looping over them, we can *predict* the *order* of the elements or characters. The ABC in the [collections.abc](https://docs.python.org/3/library/collections.abc.html) module corresponding to this behavior is `Reversible`. While sounding unintuitive at first, it is evident that if something is reversible, it must have a forward order, to begin with.

We add the [reversed()](https://docs.python.org/3/library/functions.html#reversed) built-in to the `for`-loop from above to iterate over the elements or characters in reverse order.

In [24]:
for number in reversed(numbers):
    print(number, end=" ")

4 1 10 9 6 2 12 3 5 8 11 7 

In [25]:
for character in reversed(word):
    print(character, end=" ")

m o d n a r 

In [26]:
isinstance(numbers, abc.Reversible)

True

In [27]:
isinstance(word, abc.Reversible)

True

Collections that exhibit this fourth behavior are referred to as **sequences**, formalized with the `Sequence` ABC in the [collections.abc](https://docs.python.org/3/library/collections.abc.html) module.

In [28]:
isinstance(numbers, abc.Sequence)

True

In [29]:
isinstance(word, abc.Sequence)

True

Most of the data types introduced in the remainder of this chapter are sequences. Nevertheless, we also look at some data types that are neither collections nor sequences but still useful to model sequential data in practice.

In Python-related documentation, the terms collection and sequence are heavily used, and the data science practitioner should always think of them in terms of the three or four behaviors they exhibit.

Data types that are collections but not sequences are covered in Chapter 8.
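
As a brief illustration of a collection that is not a sequence, the sketch below uses a `set` object. Choosing this built-in type as the example is an assumption on our side, and the variable names are hypothetical. A `set` object is iterable, sized, and a container, but it comes without a predictable order.

```python
import collections.abc as abc

unordered = {7, 11, 8}  # a `set` object, used here only as an example

is_collection = isinstance(unordered, abc.Collection)
is_reversible = isinstance(unordered, abc.Reversible)
is_sequence = isinstance(unordered, abc.Sequence)

print(is_collection, is_reversible, is_sequence)
```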

## The `list` Type

As already seen multiple times, to create a `list` object, we use the *literal notation* and list all elements within brackets `[` and `]`.

In [30]:
empty = []

In [31]:
simple = [40, 50]

The elements do *not* need to be of the *same* type, and `list` objects may also be **nested**.

In [32]:
nested = [empty, 10, 20.0, "Thirty", simple]

[PythonTutor](http://www.pythontutor.com/visualize.html#code=empty%20%3D%20%5B%5D%0Asimple%20%3D%20%5B40,%2050%5D%0Anested%20%3D%20%5Bempty,%2010,%2020.0,%20%22Thirty%22,%20simple%5D&cumulative=false&curInstr=0&heapPrimitives=nevernest&mode=display&origin=opt-frontend.js&py=3&rawInputLstJSON=%5B%5D&textReferences=false) shows how `nested` holds references to the `empty` and `simple` objects. Technically, it holds three more references pointing to the `10`, `20.0`, and `"Thirty"` objects as well. However, to simplify the visualization, these three objects are shown right inside the `nested` object as they are immutable and of "flat" data types. In general, the $0$s and $1$s inside a `list` object in memory always constitute pointers to other objects only.

In [33]:
nested

[[], 10, 20.0, 'Thirty', [40, 50]]

Let's not forget that `nested` is an object on its own with an *identity* and *data type*.

In [34]:
id(nested)

140322034424136

In [35]:
type(nested)

list

Alternatively, we may use the [list()](https://docs.python.org/3/library/functions.html#func-list) built-in to create a `list` object out of an iterable we pass to it as the argument.

For example, we can wrap the [range()](https://docs.python.org/3/library/functions.html#func-range) built-in with [list()](https://docs.python.org/3/library/functions.html#func-list): As described in [Chapter 4](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/04_iteration.ipynb#Containers-vs.-Iterables), `range` objects, like `range(1, 13)` below, are iterable and generate `int` objects "on the fly" (i.e., one by one). The [list()](https://docs.python.org/3/library/functions.html#func-list) around it acts like a `for`-loop and **materializes** twelve `int` objects in memory that then become the elements of the newly created `list` object. [PythonTutor](http://www.pythontutor.com/visualize.html#code=r%20%3D%20range%281,%2013%29%0Al%20%3D%20list%28range%281,%2013%29%29&cumulative=false&curInstr=0&heapPrimitives=nevernest&mode=display&origin=opt-frontend.js&py=3&rawInputLstJSON=%5B%5D&textReferences=false) shows this difference visually.

In [36]:
list(range(1, 13))

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

In [37]:
isinstance(range(1, 13), abc.Iterable)

True

Beware of passing a `range` object over a "big" horizon as the argument to [list()](https://docs.python.org/3/library/functions.html#func-list), as that may lead to a `MemoryError` or even crash the computer.

In [38]:
list(range(999_999_999_999))

MemoryError: 
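
In many situations, there is no need to materialize a `range` object in the first place. The following sketch, with hypothetical variable names, shows that a `range` object answers questions about its length, its last element, and membership of an `int` object without creating a `list` object.

```python
big = range(1_000_000_000)  # no `list` object is created here

length = len(big)          # the number of elements
last = big[-1]             # indexing works on the `range` object directly
has_it = 999_999 in big    # membership test for an `int` object

print(length, last, has_it)
```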

As another example, we may also create a `list` object from a `str` object as the latter is iterable, as well. Then, the individual characters become the elements of the new `list` object!

In [39]:
list("WHU")

['W', 'H', 'U']

### Sequence Behaviors

`list` objects are *sequences*. To reiterate that concept from above *without* the formal ABCs from the [collections.abc](https://docs.python.org/3/library/collections.abc.html) module, we briefly summarize the *four* behaviors of a sequence and provide some more `list`-specific details below:

- **Container**:
 - holds references to other objects in memory (with their own *identity* and *type*)
 - implements membership testing via the `in` operator
- **Iterable**:
 - supports being looped over
 - works with the `for` or `while` statements
- **Reversible**:
 - the elements come in a *predictable* order that we may traverse in a forward or backward fashion
 - works with the [reversed()](https://docs.python.org/3/library/functions.html#reversed) built-in
- **Sized**:
 - the number of elements is finite *and* known in advance
 - works with the built-in [len()](https://docs.python.org/3/library/functions.html#len) function

The "length" of `nested` is *five* because `simple` counts as only *one* element. In other words, `nested` holds five references to other objects.

In [40]:
len(nested)

5

With a `for`-loop, we can traverse all elements in a *predictable* order, forward or backward. As `list` objects only hold references to other objects, the elements each have their own *identity* and may be of *different* types; however, mixing types is rarely, if ever, useful in practice.

In [41]:
for element in nested:
    print(element, id(element), type(element), sep="      \t")

[]      	140322025703432      	<class 'list'>
10      	94360180081984      	<class 'int'>
20.0      	140322034534104      	<class 'float'>
Thirty      	140322025251816      	<class 'str'>
[40, 50]      	140322034424072      	<class 'list'>


In [42]:
for element in reversed(nested):
    print(element, end="     ")

[40, 50]     Thirty     20.0     10     []     

The `in` operator checks if a given object is "contained" in a `list` object. It uses the `==` operator behind the scenes (i.e., *not* the `is` operator) conducting a so-called **[linear search](https://en.wikipedia.org/wiki/Linear_search)**: So, Python implicitly loops over *all* elements and only stops prematurely if an element evaluates equal to the given object. A linear search may, therefore, be relatively *slow* for big `list` objects.

In [43]:
10 in nested

True

`20` compares equal to the `20.0` in `nested`.

In [44]:
20 in nested

True

In [45]:
30 in nested

False
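
The linear search described above can be sketched in plain Python. The `linear_search()` function and the `example` list below are hypothetical names for illustration; the function is a simplified model of what the `in` operator does for `list` objects, not the actual implementation.

```python
def linear_search(elements, target):
    """Return True if some element compares equal to target (simplified)."""
    for element in elements:
        if element == target:
            return True  # stop prematurely at the first match
    return False  # all elements were compared without a match


example = [[], 10, 20.0, "Thirty", [40, 50]]

found_20 = linear_search(example, 20)  # 20 == 20.0 evaluates to True
found_30 = linear_search(example, 30)

print(found_20, found_30)
```

In the worst case, the loop runs once per element, which is why the search time grows with the length of the `list` object.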

### Indexing

Because of the *predictable* order and the *finiteness*, each element in a sequence can be labeled with a unique *index* (i.e., an `int` object in the range $0 \leq \text{index} < \lvert \text{sequence} \rvert$).

Brackets, `[` and `]`, are the literal syntax for accessing individual elements of any sequence type. In this book, we also call them the *indexing operator*.

In [46]:
nested[1]

10

The last index is one less than `len(nested)` above, and Python raises an `IndexError` if we look up an index that is not in the implied range.

In [47]:
nested[5]

IndexError: list index out of range

Negative indices are used to count in reverse order from the end of a sequence, and brackets may be chained to access nested objects. So, to access the `50` inside `simple` via the `nested` object, we write `nested[-1][1]`.

In [48]:
nested[-1][1]

50
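
As a small sketch of how negative indices relate to non-negative ones, the index `-1` refers to the same element as `len(...) - 1`. The names `example`, `last`, and `also_last` are chosen for illustration only.

```python
example = [[], 10, 20.0, "Thirty", [40, 50]]

last = example[-1]
also_last = example[len(example) - 1]  # same element via a non-negative index

same_object = last is also_last
inner = example[-1][-1]  # chained brackets with two negative indices

print(same_object, inner)
```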

### Slicing

Slicing `list` objects works analogously to slicing `str` objects: We use the literal syntax with either one or two colons `:` inside the brackets `[]` to separate the *start*, *stop*, and *step* values. Slicing creates a *new* `list` object with the elements chosen from the original one.

For example, to obtain the three elements in the "middle" of `nested`, we slice from `1` (including) to `4` (excluding).

In [49]:
nested[1:4]

[10, 20.0, 'Thirty']

To obtain "every other" element, we slice from beginning to end, defaulting to `0` and `len(nested)`, in steps of `2`.

In [50]:
nested[::2]

[[], 20.0, [40, 50]]

The literal notation with the colons `:` is *syntactic sugar*, and Python provides the [slice()](https://docs.python.org/3/library/functions.html#slice) built-in to slice with `slice` objects. [slice()](https://docs.python.org/3/library/functions.html#slice) takes *start*, *stop*, and *step* arguments in the same way as the already familiar [range()](https://docs.python.org/3/library/functions.html#func-range) built-in.

In [51]:
middle = slice(1, 4)

In [52]:
type(middle)

slice

In most cases, the literal notation is more convenient to use; however, with `slice` objects, we may give names to slices and re-use them across several sequences.

In [53]:
nested[middle]

[10, 20.0, 'Thirty']

In [54]:
numbers[middle]

[11, 8, 5]

In [55]:
word[middle]

'and'

`slice` objects come with three read-only attributes `start`, `stop`, and `step` on them.

In [56]:
middle.start

1

In [57]:
middle.stop

4

If not passed to [slice()](https://docs.python.org/3/library/functions.html#slice), these attributes default to `None`. That is why the cell below has no output.

In [58]:
middle.step
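
To illustrate the `None` defaults, the sketch below builds a `slice` object that corresponds to the literal notation `[::2]`. The names `example` and `every_other` are hypothetical.

```python
example = [[], 10, 20.0, "Thirty", [40, 50]]

every_other = slice(None, None, 2)  # start and stop are left at None

start_is_none = every_other.start is None
same_result = example[every_other] == example[::2]

print(start_is_none, same_result)
```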

A good trick to know is taking a "full" slice: This copies *all* elements of a `list` object into a *new* `list` object.

In [59]:
nested_copy = nested[:]

In [60]:
nested_copy

[[], 10, 20.0, 'Thirty', [40, 50]]

At first glance, working with `nested` and `nested_copy` seems unproblematic. For `list` objects, the comparison operator `==` goes over the elements in both operands in a pairwise fashion and checks if they all evaluate equal.

We confirm that `nested` and `nested_copy` compare equal as expected but also that they are *different* objects.

In [61]:
nested == nested_copy

True

In [62]:
nested is nested_copy

False

However, as [PythonTutor](http://pythontutor.com/visualize.html#code=nested%20%3D%20%5B%5B%5D,%2010,%2020.0,%20%22Thirty%22,%20%5B40,%2050%5D%5D%0Anested_copy%20%3D%20nested%5B%3A%5D&cumulative=false&curInstr=0&heapPrimitives=nevernest&mode=display&origin=opt-frontend.js&py=3&rawInputLstJSON=%5B%5D&textReferences=false) reveals, only the *pointers* to the elements are copied! That concept is called a **[shallow copy](https://en.wikipedia.org/wiki/Object_copying#Shallow_copy)**.

We can also verify this with the `is` operator, which compares the identities as returned by the [id()](https://docs.python.org/3/library/functions.html#id) function: The respective first elements in both `nested` and `nested_copy` are the *same* `empty` object.

In [63]:
nested[0] is nested_copy[0]

True

In [64]:
nested[0]

[]

Knowing this becomes critical if the elements in a `list` object are mutable objects. Then, because of the original `list` object and its copy both pointing at the *same* objects in memory, if some of them are mutated, these changes are visible to both! We already saw a similar kind of confusion in [Chapter 1](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/01_elements.ipynb#Who-am-I?-And-how-many?) in a "simpler" setting and look into this in detail in the next section.

Instead of a shallow copy, we could also create a so-called **[deep copy](https://en.wikipedia.org/wiki/Object_copying#Deep_copy)** of `nested`: That concept recursively follows every pointer in a possible nested data structure and creates copies of *every* involved object.

To explicitly create shallow or deep copies, the [copy](https://docs.python.org/3/library/copy.html) module in the [standard library](https://docs.python.org/3/library/index.html) provides two functions, [copy()](https://docs.python.org/3/library/copy.html#copy.copy) and [deepcopy()](https://docs.python.org/3/library/copy.html#copy.deepcopy). We must always remember that slicing creates *shallow* copies only.

In [65]:
import copy

In [66]:
nested_deep_copy = copy.deepcopy(nested)

In [67]:
nested == nested_deep_copy

True

Now, the first elements of `nested` and `nested_deep_copy` are *different* objects, and [PythonTutor](http://pythontutor.com/visualize.html#code=import%20copy%0Anested%20%3D%20%5B%5B%5D,%2010,%2020.0,%20%22Thirty%22,%20%5B40,%2050%5D%5D%0Anested_deep_copy%20%3D%20copy.deepcopy%28nested%29&cumulative=false&curInstr=0&heapPrimitives=nevernest&mode=display&origin=opt-frontend.js&py=3&rawInputLstJSON=%5B%5D&textReferences=false) shows that there are *six* `list` objects in memory.

In [68]:
nested[0] is nested_deep_copy[0]

False
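
For comparison, the following sketch places [copy()](https://docs.python.org/3/library/copy.html#copy.copy) and [deepcopy()](https://docs.python.org/3/library/copy.html#copy.deepcopy) side by side. The variable names are hypothetical and the `list` object is a shortened version of the one above.

```python
import copy

original = [[], 10, [40, 50]]

shallow = copy.copy(original)   # for a `list` object, comparable to original[:]
deep = copy.deepcopy(original)  # also copies the nested `list` objects

shares_inner = shallow[-1] is original[-1]
deep_shares_inner = deep[-1] is original[-1]
deep_is_equal = deep == original

print(shares_inner, deep_shares_inner, deep_is_equal)
```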

As this [StackOverflow question](https://stackoverflow.com/questions/184710/what-is-the-difference-between-a-deep-copy-and-a-shallow-copy) shows, understanding shallow and deep copies is a common source of confusion independent of the programming language.

### Mutability

In contrast to `str` objects, `list` objects are *mutable*: We may assign new elements to indices or slices and also remove elements. That changes *parts* of a `list` object in memory.

In [69]:
nested[0] = 0

In [70]:
nested

[0, 10, 20.0, 'Thirty', [40, 50]]

When we re-assign a slice, we can even change the size of the `list` object.

In [71]:
nested[:4] = [100, 100, 100]  # assign three elements where there were four before

In [72]:
nested

[100, 100, 100, [40, 50]]

In [73]:
len(nested)

4

The `list` object's identity does *not* change. That is the whole point behind mutable objects.

In [74]:
id(nested)

140322034424136

`nested_copy` is still unchanged!

In [75]:
nested_copy

[[], 10, 20.0, 'Thirty', [40, 50]]

Let's change the nested `[40, 50]` via `nested_copy` into `[1, 2, 3]` by replacing all its elements.

In [76]:
nested_copy[-1][:] = [1, 2, 3]

In [77]:
nested_copy

[[], 10, 20.0, 'Thirty', [1, 2, 3]]

This has a surprising side effect on `nested`!

In [78]:
nested

[100, 100, 100, [1, 2, 3]]

That is because `nested_copy` is a shallow copy of `nested`. [PythonTutor](http://pythontutor.com/visualize.html#code=nested%20%3D%20%5B%5B%5D,%2010,%2020.0,%20%22Thirty%22,%20%5B40,%2050%5D%5D%0Anested_copy%20%3D%20nested%5B%3A%5D%0Anested%5B%3A4%5D%20%3D%20%5B100,%20100,%20100%5D%0Anested_copy%5B-1%5D%5B%3A%5D%20%3D%20%5B1,%202,%203%5D&cumulative=false&curInstr=0&heapPrimitives=nevernest&mode=display&origin=opt-frontend.js&py=3&rawInputLstJSON=%5B%5D&textReferences=false) shows how both point to the *same* nested `list` object.

Lastly, we use the `del` statement to remove an element.

In [79]:
del nested[-1]

In [80]:
nested

[100, 100, 100]

The `del` statement, of course, also works for slices.

In [81]:
del nested[:2]

In [82]:
nested

[100]

### List Operations

As with `str` objects, the `+` and `*` operators are overloaded for concatenation and repetition, and always return *new* `list` objects.

In [83]:
first = [10, 20, 30]
second = [40, 50, 60]

In [84]:
first + second

[10, 20, 30, 40, 50, 60]

In [85]:
2 * first

[10, 20, 30, 10, 20, 30]

In [86]:
second * 3

[40, 50, 60, 40, 50, 60, 40, 50, 60]
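
The new `list` object created by the `*` operator holds copies of the *references* only, in line with the idea of a shallow copy. The sketch below, with the hypothetical names `row` and `grid`, shows what that means when the repeated element is itself mutable.

```python
row = [0, 0]
grid = [row] * 3  # three references to the *same* `row` object

grid[0][0] = 1    # mutate the shared `list` object via the first element

same_object = grid[0] is grid[2]

print(grid, same_object)
```

All three "rows" change at once because there is only one inner `list` object in memory.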

Besides being an operator, the `*` symbol has a second syntactical use, as explained in [PEP 3132](https://www.python.org/dev/peps/pep-3132/) and [PEP 448](https://www.python.org/dev/peps/pep-0448/): It implements what is called **iterable unpacking**. It is *not* an operator syntactically but a notation that Python processes as a literal.

In the example below, Python interprets the expression as if the elements of the iterable `second` were placed between `30` and `70` one by one. So, we do not obtain a nested but a *flat* list.

In [87]:
[30, *second, 70]

[30, 40, 50, 60, 70]
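
Unpacking is not limited to `list` objects or to a single iterable. As a sketch with hypothetical variable names, we unpack a `range`, a `list`, and a `str` object into one new `list` object and contrast that with the nested result we get without the `*`.

```python
second = [40, 50, 60]

combined = [*range(3), *second, *"ab"]  # three iterables unpacked into one list
nested_instead = [30, second, 70]       # without the *, the list is nested

print(combined)
print(nested_instead)
```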

### List Methods

The `list` type is an essential data structure in any real-world Python application, and many typical `list` related algorithms from computer science theory are already built into it at the C level (cf., the [documentation](https://docs.python.org/3/tutorial/datastructures.html#more-on-lists) for a full overview). So, understanding and applying the built-in methods of the `list` type not only speeds up the development process but also makes programs significantly faster.

In contrast to the `str` type's methods, the `list` type's methods that modify an object *always* do so *in place*. They do *not* create a *new* `list` object and, with the exception of pop(), return `None` to indicate that. So, we must *never* assign the return value of such a method to the variable holding the list!

Let's look at the following `names` example.

In [88]:
names = ["Carl", "Berthold", "Achim", "Xavier", "Peter"]

To add an object to the end of `names`, we use the append() method. The code cell shows no output, indicating that `None` must have been returned.

In [89]:
names.append("Eckardt")

In [90]:
names

['Carl', 'Berthold', 'Achim', 'Xavier', 'Peter', 'Eckardt']

With the extend() method, we may also append multiple elements provided by an iterable at once. Here, the iterable is a `list` object itself holding two `str` objects.

In [91]:
names.extend(["Karl", "Oliver"])

In [92]:
names

['Carl', 'Berthold', 'Achim', 'Xavier', 'Peter', 'Eckardt', 'Karl', 'Oliver']

`list` objects may be sorted *in place* with the [sort()](https://docs.python.org/3/library/stdtypes.html#list.sort) method. That is different from the built-in [sorted()](https://docs.python.org/3/library/functions.html#sorted) function that takes any *finite* and *iterable* object and returns a *new* `list` object with the iterable's elements sorted.

In [93]:
sorted(names)

['Achim', 'Berthold', 'Carl', 'Eckardt', 'Karl', 'Oliver', 'Peter', 'Xavier']

In [94]:
names

['Carl', 'Berthold', 'Achim', 'Xavier', 'Peter', 'Eckardt', 'Karl', 'Oliver']

In [95]:
names.sort()

In [96]:
names

['Achim', 'Berthold', 'Carl', 'Eckardt', 'Karl', 'Oliver', 'Peter', 'Xavier']
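
The difference between the two becomes obvious once we look at the return values. The following sketch uses a shortened `names` list and the hypothetical variables `result` and `in_order`.

```python
names = ["Carl", "Berthold", "Achim"]

result = names.sort()          # sorts in place and returns None
in_order = sorted(["b", "a"])  # returns a new `list` object

print(result)
print(names)
print(in_order)
```

Had we written `names = names.sort()`, the variable `names` would reference `None` afterward, and the sorted `list` object would be lost.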

To sort in reverse order, we pass a keyword-only `reverse=True` argument to either the [sort()](https://docs.python.org/3/library/stdtypes.html#list.sort) method or the [sorted()](https://docs.python.org/3/library/functions.html#sorted) function. In the latter case, we could also use the [reversed()](https://docs.python.org/3/library/functions.html#reversed) built-in instead; however, that *neither* returns a new `list` object *nor* changes the existing one in place. We revisit it at the end of this chapter.

In [97]:
names.sort(reverse=True)

In [98]:
names

['Xavier', 'Peter', 'Oliver', 'Karl', 'Eckardt', 'Carl', 'Berthold', 'Achim']

Both the [sort()](https://docs.python.org/3/library/stdtypes.html#list.sort) method and the [sorted()](https://docs.python.org/3/library/functions.html#sorted) function also accept a keyword-only `key` argument that must be a reference to a `function` object accepting one positional argument. Then, the elements in the `list` object are passed to that function one by one, and the return values are used as the **sort keys**.

For example, to sort `names` not by alphabet but by the names' lengths, we pass in a reference to the built-in [len()](https://docs.python.org/3/library/functions.html#len) function as `key=len`. Note that there are *no* parentheses after `len`!

In [99]:
names.sort(key=len)

If two names have the same length, their relative order is kept as is. A [sorting algorithm](https://en.wikipedia.org/wiki/Sorting_algorithm) is called **[stable](https://en.wikipedia.org/wiki/Sorting_algorithm#Stability)** if it has that property. That is why `"Karl"` comes before `"Carl"` below.

Sorting is an important topic in programming, and we refer to the official [HOWTO](https://docs.python.org/3/howto/sorting.html) for a more comprehensive introduction.

In [100]:
names

['Karl', 'Carl', 'Peter', 'Achim', 'Xavier', 'Oliver', 'Eckardt', 'Berthold']
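
The `key` argument also accepts functions we define ourselves. As a sketch, the hypothetical `last_letter()` function below sorts a shortened `names` list by the last character of each name; the second call combines `key=len` with `reverse=True`.

```python
def last_letter(name):
    """Return the last character of a name (hypothetical helper)."""
    return name[-1]


names = ["Carl", "Berthold", "Achim"]

by_last_letter = sorted(names, key=last_letter)           # "d" < "l" < "m"
longest_first = sorted(names, key=len, reverse=True)      # 8, 5, 4 characters

print(by_last_letter)
print(longest_first)
```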

The pop() method removes the last element from a `list` object *and* returns it.

In [101]:
names.pop()

'Berthold'

In [102]:
names

['Karl', 'Carl', 'Peter', 'Achim', 'Xavier', 'Oliver', 'Eckardt']

It takes an optional index argument and removes that instead.

In [103]:
names.pop(1)

'Carl'

In [104]:
names

['Karl', 'Peter', 'Achim', 'Xavier', 'Oliver', 'Eckardt']

Instead of removing an element by its index, we can remove it by its value with the remove() method. Behind the scenes, Python then compares the object passed as its argument, `"Peter"` in the example, sequentially to each element with the `==` operator and removes the first one that evaluates equal.

In [105]:
names.remove("Peter")

In [106]:
names

['Karl', 'Achim', 'Xavier', 'Oliver', 'Eckardt']

remove() raises a `ValueError` if the value is not found.

In [107]:
names.remove("Peter")

ValueError: list.remove(x): x not in list

`list` objects implement an index() method that returns the position of the first occurrence of an element. It fails loudly with a `ValueError` if the element cannot be found by value.

In [108]:
names

['Karl', 'Achim', 'Xavier', 'Oliver', 'Eckardt']

In [109]:
names.index("Oliver")

3

In [110]:
names.index("Carl")

ValueError: 'Carl' is not in list
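
If a missing element is an expected situation and not an error, we may guard the call to index() with the `in` operator. The `find()` function below is a hypothetical helper and only one of several ways to do that; note that it searches the `list` object twice in the successful case.

```python
def find(elements, value):
    """Return the index of value, or None if it is absent (hypothetical helper)."""
    if value in elements:
        return elements.index(value)
    return None


names = ["Karl", "Achim", "Xavier", "Oliver", "Eckardt"]

position = find(names, "Oliver")
missing = find(names, "Carl")

print(position, missing)
```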

The count() method returns the number of occurrences of a value.

In [111]:
names.count("Xavier")

1

In [112]:
names.count("Yves")

0

### List Comparison

The relational operators also work with `list` objects; yet another example of operator overloading.

Comparison is made in a pairwise fashion until the first pair of elements does not evaluate equal or one of the `list` objects ends. The exact comparison rules depend on the elements and not the `list` object. We say that comparison is **[delegated](https://en.wikipedia.org/wiki/Delegation_(object-oriented_programming))** to the objects to be compared. Usually, all elements are of the *same* type. Then, the comparison is straightforward and conceptually the same as for string comparison in [Chapter 6](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/06_text.ipynb#String-Comparison).

In [113]:
names

['Karl', 'Achim', 'Xavier', 'Oliver', 'Eckardt']

In [114]:
names < ["Karl", "Achim", "Oliver", "Xavier", "Eckardt"]

False

In [115]:
names < ["Karl", "Xavier", "Achim", "Oliver", "Eckardt"]

True

If all pairs evaluate equal until one of the `list` objects ends, the shorter `list` object is considered "smaller," and vice versa.

In [116]:
names < ["Karl", "Achim", "Xavier", "Oliver"]

False

In [117]:
names < ["Karl", "Achim", "Xavier", "Oliver", "Eckardt", "Peter"]

True
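
The comparison rules above can be sketched as a function. The `less_than()` function is a hypothetical, simplified model of the `<` operator for `list` objects and not the actual implementation; it relies on the [zip()](https://docs.python.org/3/library/functions.html#zip) built-in to form the pairs.

```python
def less_than(left, right):
    """Mimic left < right for two lists (simplified sketch)."""
    for a, b in zip(left, right):
        if a != b:
            return a < b  # the first unequal pair decides the comparison
    return len(left) < len(right)  # otherwise, the shorter list is "smaller"


names = ["Karl", "Achim", "Xavier", "Oliver", "Eckardt"]

first = less_than(names, ["Karl", "Achim", "Oliver", "Xavier", "Eckardt"])
second = less_than(names, ["Karl", "Achim", "Xavier", "Oliver"])
third = less_than(names, ["Karl", "Achim", "Xavier", "Oliver", "Eckardt", "Peter"])

print(first, second, third)
```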

### Modifiers vs. Pure Functions

As `list` objects are mutable, the caller of a function can see the changes made to a `list` object passed to the function as an argument. That is often a surprising *side effect* and should be avoided.

As an example, consider the `add_xyz()` function.

In [118]:
letters = ["a", "b", "c"]

In [119]:
def add_xyz(arg):
    """Append letters to a list."""
    arg.extend(["x", "y", "z"])
    return arg

While this function is being executed, two variables, namely `letters` in the global scope and `arg` inside the function's local scope, point to the *same* `list` object in memory. Furthermore, the passed in `arg` is also the return value.

So, after the function call, `letters_with_xyz` and `letters` are **aliases** as well, pointing to the *same* object.

In [120]:
letters_with_xyz = add_xyz(letters)

In [121]:
letters_with_xyz

['a', 'b', 'c', 'x', 'y', 'z']

In [122]:
letters

['a', 'b', 'c', 'x', 'y', 'z']

A better practice is to first create a copy of `arg` within the function that is then modified and returned. If we are sure that `arg` contains immutable elements only, we get away with a shallow copy. The downside of this approach is the higher amount of memory necessary.

The revised `add_xyz()` function below is more natural to reason about as it does *not* modify the passed in `arg` internally. This approach is following the **[functional programming](https://en.wikipedia.org/wiki/Functional_programming)** paradigm that is going through a "renaissance" currently. Two essential characteristics of functional programming are that a function *never* changes its inputs and *always* returns the same output given the same inputs.

For a beginner, it is probably better to stick to this idea and not change any arguments as the original `add_xyz()` above does. However, functions that modify and return the argument passed in are an important aspect of object-oriented programming, as explained in Chapter 9.

In [123]:
letters = ["a", "b", "c"]

In [124]:
def add_xyz(arg):
    """Create a new list from an existing one."""
    new_arg = arg[:]  # a shallow copy is good enough here
    new_arg.extend(["x", "y", "z"])
    return new_arg

In [125]:
letters_with_xyz = add_xyz(letters)

In [126]:
letters_with_xyz

['a', 'b', 'c', 'x', 'y', 'z']

In [127]:
letters

['a', 'b', 'c']

If we want to modify the argument passed in, it is best to return `None` and not `arg`, as does the final version of `add_xyz()` below. Then, the user of our function cannot accidentally create two aliases to the same object. That is also why list methods that only modify a `list` object in place, such as `.extend()`, return `None`.

In [128]:
letters = ["a", "b", "c"]

In [129]:
def add_xyz(arg):
    """Append letters to a list."""
    arg.extend(["x", "y", "z"])
    return  # = None

In [130]:
add_xyz(letters)

In [131]:
letters

['a', 'b', 'c', 'x', 'y', 'z']

If we call `add_xyz()` with `letters` as the argument again, we end up with an even longer `list` object.

In [132]:
add_xyz(letters)

In [133]:
letters

['a', 'b', 'c', 'x', 'y', 'z', 'x', 'y', 'z']

Functions that only work on the argument passed in are called **modifiers**. Their primary purpose is to change the **state** of the argument. On the contrary, functions that have *no* side effects on the arguments are said to be **pure**.
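
Python's built-ins offer a pair that mirrors this distinction, which may help to remember it. The sketch below uses a made-up `scores` list: The `sorted()` built-in leaves its argument untouched and returns a new `list` object, whereas the `.sort()` method changes the `list` object in place and returns `None`.

```python
scores = [3, 1, 2]  # hypothetical example data

# sorted() has no side effect on its argument: it returns a new list.
ordered = sorted(scores)
unchanged_after_sorted = scores == [3, 1, 2]

# .sort() is a modifier: it changes the list in place and returns None.
return_value = scores.sort()

print(ordered, unchanged_after_sorted, return_value, scores)
```

A common mistake is to write `scores = scores.sort()`, which replaces the `list` object with `None`.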

## The `tuple` Type

To create a `tuple` object, we can use the same literal notation as for `list` objects *without* the brackets and list all elements.

In [134]:
numbers = 7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4

In [135]:
numbers

(7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4)

However, to be clearer, many Pythonistas write out the optional parentheses `(` and `)`.

In [136]:
numbers = (7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4)

In [137]:
numbers

(7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4)

As before, `numbers` is an object on its own.

In [138]:
id(numbers)

140322025210648

In [139]:
type(numbers)

tuple

While we could use empty parentheses `()` to create an empty `tuple` object ...

In [140]:
empty_tuple = ()

In [141]:
empty_tuple

()

In [142]:
type(empty_tuple)

tuple

... we must use a *trailing comma* to create a `tuple` object holding one element. If we forget the comma, the parentheses are interpreted as the grouping operator and effectively useless!

In [143]:
one_tuple = (1,)  # we could omit the parentheses but not the comma

In [144]:
one_tuple

(1,)

In [145]:
type(one_tuple)

tuple

In [146]:
no_tuple = (1)

In [147]:
no_tuple

1

In [148]:
type(no_tuple)

int

Alternatively, we may use the [tuple()](https://docs.python.org/3/library/functions.html#func-tuple) built-in that takes any iterable as its argument and creates a new `tuple` from its elements.

In [149]:
tuple([1])

(1,)
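
As the argument may be *any* iterable, the same call works for other data types as well. The following sketch, with made-up values, converts a `str`, a `range`, and a `list` object.

```python
from_str = tuple("abc")       # one element per character
from_range = tuple(range(3))  # materializes the range object
from_list = tuple([1, 2, 3])  # copies the references into a new tuple

print(from_str, from_range, from_list)
```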

### Tuples are like "Immutable Lists"

Most operations involving `tuple` objects work in the same way as with `list` objects. The main difference is that `tuple` objects are *immutable*. So, if our program does not depend on mutability, we may and should use `tuple` and not `list` objects to model sequential data. That way, we avoid the pitfalls seen above.

`tuple` objects are *sequences* exhibiting the familiar *four* behaviors.

In [150]:
isinstance(numbers, abc.Sequence)

True

So, `numbers` holds a *finite* number of elements ...

In [151]:
len(numbers)

12

... that we can obtain individually by looping over it in a predictable *forward* or *reverse* order.

In [152]:
for number in numbers:
    print(number, end="   ")

7   11   8   5   3   12   2   6   9   10   1   4   

In [153]:
for number in reversed(numbers):
    print(number, end="   ")

4   1   10   9   6   2   12   3   5   8   11   7   

To check if a given object is *contained* in `numbers`, we use the `in` operator and conduct a linear search.

In [154]:
0 in numbers

False

In [155]:
1 in numbers

True

In [156]:
1.0 in numbers  # in relies on == behind the scenes

True

We may index and slice with the `[]` operator. The latter returns *new* `tuple` objects.

In [157]:
numbers[0]

7

In [158]:
numbers[-1]

4

In [159]:
numbers[6:]

(2, 6, 9, 10, 1, 4)

As tuples are *immutable*, index assignment does *not* work and results in a `TypeError`.

In [160]:
numbers[-1] = 99

TypeError: 'tuple' object does not support item assignment

If we need to "modify" the `tuple` object, we must create a *new* `tuple` object, for example, like so: We take a slice of the elements we want to keep and use the overloaded `+` operator to concatenate the slice with another `tuple` object.

In [161]:
new_numbers = numbers[:-1] + (99,)

In [162]:
new_numbers

(7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 99)

The `*` operator works as well.

In [163]:
2 * numbers

(7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4, 7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4)

Being immutable, `tuple` objects only provide the `.count()` and `.index()` methods.

In [164]:
numbers.count(0)

0

In [165]:
numbers.index(1)

10

The relational operators compare the elements of two `tuple` objects in a pairwise fashion as above.

In [166]:
numbers < new_numbers

True

While `tuple` objects are immutable, this only relates to the references they hold. If a `tuple` object contains mutable objects, the entire nested structure is *not* immutable as a whole.

Consider the following stylized example `not_immutable`: It contains *three* elements, `1`, `[2, ..., 11]`, and `12`, and the elements of the nested `list` object may be changed. While it is not practical to mix data types in a `tuple` object that is used as an "immutable list," we want to make the point that the mere usage of the `tuple` type does *not* guarantee a nested object to be immutable.

In [167]:
not_immutable = (1, [2, 3, 4, 5, 6, 7, 8, 9, 10, 11], 12)

In [168]:
not_immutable[1][:] = [99, 99, 99]

In [169]:
not_immutable

(1, [99, 99, 99], 12)
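
To make the distinction between the references and the referenced objects more tangible, here is a small sketch with the hypothetical names `inner` and `outer`. Mutating the nested `list` object through another variable is visible via the `tuple` object, while the reference stored in the `tuple` object stays the same. Only rebinding that reference is forbidden.

```python
inner = [2, 3]
outer = (1, inner, 4)  # the tuple stores a reference to inner

before = id(outer[1])
inner.append(99)       # mutate the list via the other name
after = id(outer[1])

# The tuple still holds the very same reference; only the list changed.
print(outer, before == after)

# Rebinding the reference itself is what the tuple forbids.
try:
    outer[1] = [0]
except TypeError as error:
    print("TypeError:", error)
```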

### Packing & Unpacking

In the "*List Operations*" section above, the `*` symbol **unpacks** the elements of a `list` object into another one. This idea of *iterable unpacking* is built into Python at various places, even *without* the `*` symbol.

For example, we may write variables on the left-hand side of an `=` statement in a `tuple` style. Then, any *finite* iterable on the right-hand side is unpacked. So, `numbers` is unpacked into *twelve* variables below.

In [170]:
n1, n2, n3, n4, n5, n6, n7, n8, n9, n10, n11, n12 = numbers

In [171]:
n1

7

In [172]:
n2

11

In [173]:
n3

8

Having to type twelve variables on the left is already tedious. Furthermore, if the iterable on the right yields a number of elements *different* from the number of variables, we get a `ValueError`.

In [174]:
n1, n2, n3, n4, n5, n6, n7, n8, n9, n10, n11 = numbers

ValueError: too many values to unpack (expected 11)

In [175]:
n1, n2, n3, n4, n5, n6, n7, n8, n9, n10, n11, n12, n13 = numbers

ValueError: not enough values to unpack (expected 13, got 12)

So, to make iterable unpacking useful, we prepend the `*` symbol to *one* of the variables on the left: That variable then becomes a `list` object holding the elements not captured by the other variables. We say that the excess elements from the iterable are **packed** into this variable.

For example, let's get the `first` and `last` element of `numbers` and collect the rest in `middle`.

In [176]:
first, *middle, last = numbers

In [177]:
first

7

In [178]:
middle

[11, 8, 5, 3, 12, 2, 6, 9, 10, 1]

In [179]:
last

4

If we do not need the `middle` elements, we go with the underscore `_` convention and "throw" them away.

In [180]:
first, *_, last = numbers

In [181]:
first

7

In [182]:
last

4
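
The starred variable does not have to sit in the middle. The sketch below, based on a made-up `values` tuple, places it at the end and at the beginning, and also shows that it may end up as an empty `list` object if no elements are left over.

```python
values = (1, 2, 3, 4, 5)  # hypothetical example data

head, *tail = values      # star at the end
*init, final = values     # star at the beginning
one, two, *rest = (1, 2)  # the starred variable may end up empty

print(head, tail)
print(init, final)
print(one, two, rest)
```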

We already used unpacking before this section without knowing it: Whenever we write a `for`-loop over the [zip()](https://docs.python.org/3/library/functions.html#zip) built-in, it generates a new `tuple` object in each iteration, which we unpack by listing several loop variables.

So, the `name, position` acts like a left-hand side of an `=` statement and unpacks the `tuple` objects generated from "zipping" the `names` list and the `positions` tuple together.

In [183]:
positions = ("goalkeeper", "defender", "midfielder", "striker", "coach")

In [184]:
for name, position in zip(names, positions):
    print(name, "is a", position)

Karl is a goalkeeper
Achim is a defender
Xavier is a midfielder
Oliver is a striker
Eckardt is a coach


Without unpacking, [zip()](https://docs.python.org/3/library/functions.html#zip) generates a series of `tuple` objects.

In [185]:
for pair in zip(names, positions):
    print(type(pair), pair, sep="   ")

<class 'tuple'>   ('Karl', 'goalkeeper')
<class 'tuple'>   ('Achim', 'defender')
<class 'tuple'>   ('Xavier', 'midfielder')
<class 'tuple'>   ('Oliver', 'striker')
<class 'tuple'>   ('Eckardt', 'coach')


Unpacking also works for nested objects. Below, we wrap [zip()](https://docs.python.org/3/library/functions.html#zip) with the [enumerate()](https://docs.python.org/3/library/functions.html#enumerate) built-in to have an index variable `i` inside the `for`-loop. In each iteration, a `tuple` object consisting of `i` and another `tuple` object is created. The inner one then holds the `name` and `position`.

In [186]:
for i, (name, position) in enumerate(zip(names, positions)):
    print(i, "->", name, "is a", position)

0 -> Karl is a goalkeeper
1 -> Achim is a defender
2 -> Xavier is a midfielder
3 -> Oliver is a striker
4 -> Eckardt is a coach


#### Swapping Variables

A popular use case of unpacking is **swapping** two variables.

Consider `a` and `b` below.

In [187]:
a = 0
b = 1

Without unpacking, we must use a temporary variable `temp` to swap `a` and `b`.

In [188]:
temp = a
a = b
b = temp

In [189]:
a, b

(1, 0)

With unpacking, the solution is more elegant and also a bit faster. *All* expressions on the right-hand side are evaluated *before* any assignment takes place.

In [190]:
a = 0
b = 1

In [191]:
a, b = b, a

In [192]:
a, b

(1, 0)
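
The same evaluation order applies to more than two variables. As a minimal sketch with the made-up variables `x`, `y`, and `z`, the following rotates three values in one statement.

```python
x, y, z = 1, 2, 3

# The right-hand side is first packed into the tuple (2, 3, 1) ...
# ... and only then unpacked into x, y, and z.
x, y, z = y, z, x

print(x, y, z)
```

Written as three separate assignments without a temporary variable, one of the values would be overwritten before it is read.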

#### Example: [Fibonacci Numbers](https://en.wikipedia.org/wiki/Fibonacci_number) (revisited)

Unpacking allows us to rewrite the iterative `fibonacci()` function from [Chapter 4](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/04_iteration.ipynb#"Hard-at-first-Glance"-Example:-Fibonacci-Numbers-%28revisited%29) in a concise way, now also supporting *goose typing* with the [numbers](https://docs.python.org/3/library/numbers.html) module from the [standard library](https://docs.python.org/3/library/index.html).

In [193]:
import numbers

In [194]:
def fibonacci(i):
    """Calculate the ith Fibonacci number.

    Args:
        i (int): index of the Fibonacci number to calculate

    Returns:
        ith_fibonacci (int)

    Raises:
        TypeError: if i is not an integer
        ValueError: if i is negative
    """
    if not isinstance(i, numbers.Integral):
        raise TypeError("i must be an integer")
    elif i < 0:
        raise ValueError("i must be non-negative")

    a, b = 0, 1

    for _ in range(i):  # after the loop, a holds the ith Fibonacci number
        a, b = b, a + b

    return a

In [195]:
fibonacci(12)

144
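
To see how the unpacking assignment advances both variables at once, the following sketch collects the intermediate values in a `list` object. The helper `first_fibonacci_numbers()` is a hypothetical name introduced here for illustration only. It assumes the convention that the sequence starts with $0$.

```python
def first_fibonacci_numbers(n):
    """Collect the first n Fibonacci numbers, starting with 0."""
    collected = []
    a, b = 0, 1

    for _ in range(n):
        collected.append(a)
        a, b = b, a + b  # both right-hand expressions use the old a and b

    return collected


sequence = first_fibonacci_numbers(13)
print(sequence)
```

The last of the thirteen collected numbers is `144`, the Fibonacci number with index `12`.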

#### Function Definitions & Calls

The concepts of packing and unpacking are also helpful when writing and using functions.

For example, let's look at the `product()` function below. Its implementation suggests that `args` must be a sequence type. Otherwise, it would not make sense to index into it with `[0]` or take a slice with `[1:]`. In line with the function's name, the `for`-loop multiplies all elements of the `args` sequence. So, what does the `*` do in the header line, and what is the exact data type of `args`?

The `*` is again *not* an operator in this context but a special syntax that makes Python *pack* all *positional* arguments passed to `product()` into a single `tuple` object called `args`.

In [196]:
def product(*args):
    """Multiply all arguments."""
    result = args[0]

    for arg in args[1:]:
        result *= arg

    return result

So, we can pass an *arbitrary* (i.e., also none) number of *positional* arguments to `product()`.

The product of just one number is the number itself.

In [197]:
product(42)

42

Passing in several numbers works as expected.

In [198]:
product(2, 5, 10)

100

However, this implementation of `product()` needs *at least* one argument passed in due to the expression `args[0]` used internally. Otherwise, we see a *runtime* error, namely an `IndexError`. We emphasize that this error is *not* caused in the header line.

In [199]:
product()

IndexError: tuple index out of range
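
One possible way around the `IndexError` is to start with the neutral element of multiplication. The variant below, under the hypothetical name `product_or_one()`, is only a sketch: It assumes that the product of no numbers should be `1`, which is the convention in mathematics and also the one followed by `math.prod()` in the standard library (Python 3.8 and later).

```python
def product_or_one(*args):
    """Multiply all arguments; return 1 if there are none."""
    result = 1  # neutral element of multiplication

    for arg in args:
        result *= arg

    return result


print(product_or_one(), product_or_one(42), product_or_one(2, 5, 10))
```

Whether an empty call should return `1` or raise an exception is a design choice that depends on the use case.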

Another downside of this implementation is that we can easily generate *semantic* errors: For example, if we pass in an iterable object like the `one_hundred` list, *no* exception is raised. However, the return value is also not a numeric object as we expect. The reason for this is that during the function call, `args` becomes a `tuple` object holding *one* element, which is `one_hundred`, a `list` object. So, we created a nested structure by accident.

In [200]:
one_hundred = [2, 5, 10]

In [201]:
product(one_hundred)

[2, 5, 10]

This error does not occur if we unpack `one_hundred` upon passing it as the argument.

In [202]:
product(*one_hundred)

100

That is the equivalent of writing out the following tedious expression. Yet, that does *not* scale for iterables with many elements in them.

In [203]:
product(one_hundred[0], one_hundred[1], one_hundred[2])

100

While we needed to unpack `one_hundred` above to avoid the semantic error, unpacking an argument in a function call may also be a convenience in general.

For example, until now, we needed a `for` statement to print the elements of `one_hundred` in one line. With unpacking, we get away *without* a loop.

In [204]:
print(one_hundred)  # prints the list; we do not want that

[2, 5, 10]


In [205]:
for number in one_hundred:
    print(number, end=" ")

2 5 10 

In [206]:
print(*one_hundred)

2 5 10
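
Unpacking in a function call works with any function that accepts several positional arguments. As a small sketch with a made-up `bounds` tuple, the following passes the *start*, *stop*, and *step* values to the `range()` built-in and then prints the resulting elements with a custom separator.

```python
bounds = (1, 13, 2)  # hypothetical start, stop, and step values

odd_numbers = list(range(*bounds))  # same as range(1, 13, 2)

print(odd_numbers)
print(*odd_numbers, sep=", ")  # unpacking combined with a keyword argument
```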


## The `namedtuple` Type

Above, we proposed the idea that `tuple` objects are like "immutable lists." Often, however, we use `tuple` objects to represent a **record** of related **fields**. Then, each element has a *semantic* meaning (i.e., a descriptive name).

As an example, think of a spreadsheet with information on students in a course. Each row represents a record and holds all the data associated with an individual student. The columns (e.g., matriculation number, first name, last name) are the fields that may come as *different* data types (e.g., `int` for the matriculation number, `str` for the names).

A simple way of modeling a single student is as a `tuple` object, for example, `(123456, "John", "Doe")`. A disadvantage of this approach is that we must remember the order and meaning of the elements/fields in the `tuple` object.

An example from a different domain is the representation of $(x, y)$-points in the $x$-$y$-plane. Again, we could use a `tuple` object like `current_position` below to model the point $(4, 2)$.

In [207]:
current_position = (4, 2)

We implicitly assume that the first element represents the $x$ and the second the $y$ coordinate. While that follows intuitively from convention in math, we should at least add comments somewhere in the code to document this assumption.

A better way is to create a *custom* data type. While that is covered in depth in Chapter 9, the [collections](https://docs.python.org/3/library/collections.html) module in the [standard library](https://docs.python.org/3/library/index.html) provides a [namedtuple()](https://docs.python.org/3/library/collections.html#collections.namedtuple) **factory function** that creates "simple" custom data types on top of the standard `tuple` type.

In [208]:
from collections import namedtuple

[namedtuple()](https://docs.python.org/3/library/collections.html#collections.namedtuple) takes two arguments. The first argument is the name of the data type. That could be different from the variable `Point` we use to refer to the new type, but in most cases it is best to keep them in sync. The second argument is a sequence with the field names as `str` objects. The names' order corresponds to the one assumed in `current_position`.

In [209]:
Point = namedtuple("Point", ["x", "y"])

The `Point` object is a so-called **class**. That is what it means if an object is of type `type`. It can be used as a **factory** to create *new* `tuple`-like objects of type `Point`.

In [210]:
id(Point)  # classes are objects as well

140321614839000

In [211]:
type(Point)

type

To create a `Point` object, we use the same *literal syntax* as for `current_position` above and prepend it with `Point`.

In [212]:
current_position = Point(4, 2)

Now, `current_position` has a somewhat nicer representation. In particular, the coordinates are named `x` and `y`.

In [213]:
current_position

Point(x=4, y=2)

It is *not* a plain `tuple` any more but an object of type `Point`.

In [214]:
id(current_position)

140322025282656

In [215]:
type(current_position)

__main__.Point

We use the dot operator `.` to access the defined attributes.

In [216]:
current_position.x

4

In [217]:
current_position.y

2

As before, we get an `AttributeError` if we try to access an undefined attribute.

In [218]:
current_position.z

AttributeError: 'Point' object has no attribute 'z'

`current_position` continues to work like a `tuple` object! That is why we can use `namedtuple` as a replacement for `tuple`. The underlying implementations exhibit the *same* computational efficiencies and memory usages.

For example, we can index into or loop over `current_position` as it is still a sequence.

In [219]:
isinstance(current_position, abc.Sequence)

True

In [220]:
current_position[0]

4

In [221]:
current_position[1]

2

In [222]:
for number in current_position:
    print(number)

4
2


In [223]:
for number in reversed(current_position):
    print(number)

2
4
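
Beyond indexing and looping, objects created with [namedtuple()](https://docs.python.org/3/library/collections.html#collections.namedtuple) offer a few helpers of their own. The sketch below re-creates a `Point` type to be self-contained and uses the hypothetical variables `position` and `moved`. It shows unpacking, the `._fields` attribute, and the `._replace()` method, and it confirms that the attributes cannot be re-assigned.

```python
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])
position = Point(x=4, y=2)  # keyword arguments work as well

x, y = position                 # unpacking works as for any tuple
moved = position._replace(x=5)  # returns a new Point; position is unchanged

try:
    position.x = 99             # the attributes are read-only
except AttributeError as error:
    print("AttributeError:", error)

print(Point._fields, position, moved, isinstance(position, tuple))
```

So, "modifying" a `Point` object means creating a new one, just as with plain `tuple` objects.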


## The Map-Filter-Reduce Paradigm

Whenever we process sequential data, most tasks can be classified into one of the three categories **map**, **filter**, or **reduce**. This paradigm has attracted attention in recent years as it lends itself to **[parallel computing](https://en.wikipedia.org/wiki/Parallel_computing)**, which becomes important when dealing with big amounts of data.

Let's look at a simple example.

In [224]:
numbers = [7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4]

### Mapping

**Mapping** refers to the idea of applying a transformation to every element in a sequence.

For example, let's square each element in `numbers` and add `1` to it. In essence, we apply the transformation $y := x^2 + 1$ expressed as the `transform()` function below.

In [225]:
def transform(element):
    """Map elements to their squares plus 1."""
    return (element ** 2) + 1

With the syntax we know so far, we revert to a `for`-loop that iteratively appends the transformed elements to the initially empty `transformed_numbers` list.

In [226]:
transformed_numbers = []

for old in numbers:
    new = transform(old)
    transformed_numbers.append(new)

In [227]:
transformed_numbers

[50, 122, 65, 26, 10, 145, 5, 37, 82, 101, 2, 17]

As this kind of data processing is so common, Python provides the [map()](https://docs.python.org/3/library/functions.html#map) built-in. In its simplest usage form, it takes two arguments: A transformation function that takes one positional argument and an iterable.

We call [map()](https://docs.python.org/3/library/functions.html#map) with the `transform()` function and the `numbers` list as the arguments and store the result in the variable `transformer` to inspect it.

In [228]:
transformer = map(transform, numbers)

We might expect to get back a materialized sequence (i.e., all elements exist in memory), and a `list` object would feel the most natural because of the type of the `numbers` argument. However, `transformer` is an object of type `map`.

In [229]:
transformer

<map at 0x7f9f44745ac8>

In [230]:
type(transformer)

map

Like `range` objects, `map` objects generate a series of objects "on the fly" (i.e., one by one), and we use the built-in [next()](https://docs.python.org/3/library/functions.html#next) function to obtain the next object in line. So, we should think of a `map` object as a "rule" stored in memory that only knows how to calculate the next object of possibly *infinitely* many.

It is essential to understand that by creating a `map` object with the [map()](https://docs.python.org/3/library/functions.html#map) built-in, nothing happens in memory except the creation of the `map` object. In particular, no second `list` object derived from `numbers` is created. Also, we may view `range` objects as a special case of `map` objects: They are constrained to generating `int` objects only, and the *iterable* argument is replaced with *start*, *stop*, and *step* arguments.

In [231]:
next(transformer)

50

In [232]:
next(transformer)

122

In [233]:
next(transformer)

65

If we are sure that a `map` object generates a *finite* number of elements, we may materialize them into a `list` object with the [list()](https://docs.python.org/3/library/functions.html#func-list) built-in. In the example, this is the case as `transformer` is derived from a *finite* `list` object.

In summary, instead of creating an empty list first and appending to it in a `for`-loop as above, we write the following one-liner and obtain an equal `transformed_numbers` list.

In [234]:
transformed_numbers = list(map(transform, numbers))

In [235]:
transformed_numbers

[50, 122, 65, 26, 10, 145, 5, 37, 82, 101, 2, 17]
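
One consequence of generating elements "on the fly" is that a `map` object can be consumed only once. The sketch below uses a made-up `square()` function and illustrates that a second attempt to materialize the same `map` object yields an empty `list` object.

```python
def square(element):
    """Map elements to their squares."""
    return element ** 2


squares = map(square, [1, 2, 3])  # hypothetical example data

first_pass = list(squares)   # consumes all elements
second_pass = list(squares)  # nothing is left to generate

print(first_pass, second_pass)
```

If the elements are needed more than once, we either keep the materialized `list` object or create a new `map` object.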

### Filtering

**Filtering** refers to the idea of creating a subset of a sequence with a **boolean filter** function that indicates if an element should be kept or not.

In the example, let's only keep the even elements in `numbers`. The `is_even()` function implements that as a filter.

In [236]:
def is_even(element):
    """Filter out odd numbers."""
    return element % 2 == 0

As before, we must revert to a `for`-loop that appends the elements to be kept iteratively to an initially empty `even_numbers` list.

In [237]:
even_numbers = []

for number in transformed_numbers:
    if is_even(number):
        even_numbers.append(number)

In [238]:
even_numbers

[50, 122, 26, 10, 82, 2]

As filtering is also a common task, we use the [filter()](https://docs.python.org/3/library/functions.html#filter) built-in that returns an object of type `filter` stored in the `evens` variable.

In [239]:
evens = filter(is_even, transformed_numbers)

In [240]:
evens

<filter at 0x7f9f447271d0>

In [241]:
type(evens)

filter

`evens` works like `transformer` above, and we use the built-in [next()](https://docs.python.org/3/library/functions.html#next) function to obtain the even numbers one by one. So, the "next" element in line is simply the next even `int` object the `filter` object encounters.

In [242]:
transformed_numbers  # for quick reference

[50, 122, 65, 26, 10, 145, 5, 37, 82, 101, 2, 17]

In [243]:
next(evens)

50

In [244]:
next(evens)

122

In [245]:
next(evens)

26

As above, we must explicitly create a materialized `list` object with the [list()](https://docs.python.org/3/library/functions.html#func-list) built-in.

In [246]:
list(filter(is_even, transformed_numbers))

[50, 122, 26, 10, 82, 2]

We may also chain mappings and filters based on the original `numbers` list.

In [247]:
list(filter(is_even, map(transform, numbers)))

[50, 122, 26, 10, 82, 2]

Using the [map()](https://docs.python.org/3/library/functions.html#map) and [filter()](https://docs.python.org/3/library/functions.html#filter) built-ins, we can quickly switch the order: Filter first and then transform the remaining elements. This variant equals the "*A simple Filter*" example from [Chapter 4](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/04_iteration.ipynb#Example:-A-simple-Filter). On the contrary, code with `for`-loops and `if` statements is more tedious to adapt. Additionally, `map` and `filter` objects are implemented in C in CPython and, therefore, often faster as well.

In [248]:
list(map(transform, filter(is_even, numbers)))

[65, 145, 5, 37, 101, 17]
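
The filter function does not have to be written by hand in every case. According to the documentation, passing `None` as the first argument to [filter()](https://docs.python.org/3/library/functions.html#filter) keeps all elements that are truthy. The following sketch uses a made-up `mixed` list to illustrate that.

```python
mixed = [0, 1, "", "text", None, [], [0]]  # hypothetical example data

truthy = list(filter(None, mixed))  # None means: keep the truthy elements

print(truthy)
```

Note that `[0]` is kept: A non-empty `list` object is truthy regardless of its elements.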

### Reducing

Lastly, **reducing** sequential data means to summarize the elements into a single statistic.

A simple example is the built-in [sum()](https://docs.python.org/3/library/functions.html#sum) function.

In [249]:
sum(map(transform, filter(is_even, numbers)))

370

Other straightforward examples are the built-in [min()](https://docs.python.org/3/library/functions.html#min) or [max()](https://docs.python.org/3/library/functions.html#max) functions.

In [250]:
min(map(transform, filter(is_even, numbers)))

5

In [251]:
max(map(transform, filter(is_even, numbers)))

145

[sum()](https://docs.python.org/3/library/functions.html#sum), [min()](https://docs.python.org/3/library/functions.html#min), and [max()](https://docs.python.org/3/library/functions.html#max) can be regarded as special cases.

The generic way of reducing a sequence is to apply a function of *two* arguments on a rolling horizon: Its first argument is the reduction of the elements processed so far, and the second the next element to be reduced.

For illustration, let's replicate [sum()](https://docs.python.org/3/library/functions.html#sum) as such a function, called `add()`. Its implementation only adds two numbers.

In [252]:
def add(sum_so_far, next_number):
    """Reduce a sequence by addition."""
    return sum_so_far + next_number

Further, we create a *new* `map` object derived from `numbers` ...

In [253]:
evens_transformed = map(transform, filter(is_even, numbers))

... and loop over all *but* the first element it generates. We capture the first element separately as the initial `result` with the [next()](https://docs.python.org/3/library/functions.html#next) function. So, `map` objects must be *iterable* as we may loop over them.

We know that `evens_transformed` generates *six* elements. That is why we see *five* growing `result` values resembling a [cumulative sum](http://mathworld.wolfram.com/CumulativeSum.html).

In [254]:
result = next(evens_transformed)  # first element is the initial value

for number in evens_transformed:  # iterate over the remaining elements
    print(result, end=" ")  # line added for didactical purposes
    result = add(result, number)

65 210 215 252 353 

The final `result` is the same `370` as above.

In [255]:
result

370

The [reduce()](https://docs.python.org/3/library/functools.html#functools.reduce) function in the [functools](https://docs.python.org/3/library/functools.html) module in the [standard library](https://docs.python.org/3/library/index.html) is a more convenient replacement for the `for`-loop. In its basic form, it takes two arguments, a function and an iterable, in the same way as the [map()](https://docs.python.org/3/library/functions.html#map) and [filter()](https://docs.python.org/3/library/functions.html#filter) built-ins.

[reduce()](https://docs.python.org/3/library/functools.html#functools.reduce) is **[eager](https://en.wikipedia.org/wiki/Eager_evaluation)**, meaning that all computations implied by the contained `map` and `filter` "rules" are executed right away, and the code cell returns `370`. On the contrary, [map()](https://docs.python.org/3/library/functions.html#map) and [filter()](https://docs.python.org/3/library/functions.html#filter) create **[lazy](https://en.wikipedia.org/wiki/Lazy_evaluation)** `map` and `filter` objects, and we have to use the [next()](https://docs.python.org/3/library/functions.html#next) function to obtain the elements.

In [256]:
from functools import reduce

In [257]:
reduce(add, map(transform, filter(is_even, numbers)))

370
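
[reduce()](https://docs.python.org/3/library/functools.html#functools.reduce) is not limited to addition, and it accepts an optional third argument as the initial value. The sketch below defines a hypothetical `multiply()` function, analogous to `add()`, to calculate a factorial. It also shows that an empty iterable is only allowed if an initial value is given.

```python
from functools import reduce


def multiply(product_so_far, next_number):
    """Reduce a sequence by multiplication."""
    return product_so_far * next_number


factorial_5 = reduce(multiply, range(1, 6))  # ((((1 * 2) * 3) * 4) * 5)
empty_product = reduce(multiply, [], 1)      # the initial value is returned

try:
    reduce(multiply, [])                     # no elements, no initial value
except TypeError as error:
    print("TypeError:", error)

print(factorial_5, empty_product)
```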

### Lambda Expressions

[map()](https://docs.python.org/3/library/functions.html#map), [filter()](https://docs.python.org/3/library/functions.html#filter), and [reduce()](https://docs.python.org/3/library/functools.html#functools.reduce) take a `function` object as their first argument, and we defined `transform()`, `is_even()`, and `add()` to be used precisely for that.

Often, such functions are used *only once* in a program. However, the primary purpose of functions is to *re-use* them. In such cases, it makes more sense to define them "anonymously" right at the position where the first argument goes.

As mentioned in [Chapter 2](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/02_functions.ipynb#Anonymous-Functions), Python provides `lambda` expressions to create `function` objects *without* a variable pointing to them.

So, the above `add()` function could be rewritten as a `lambda` expression like so ...

In [258]:
lambda sum_so_far, next_number: sum_so_far + next_number

<function __main__.<lambda>(sum_so_far, next_number)>

... or even shorter.

In [259]:
lambda x, y: x + y

<function __main__.<lambda>(x, y)>

With the new concepts in this section, we can rewrite the entire example in just a few lines of code *without* any `for`, `if`, and `def` statements. The resulting code is concise, easy to read, quick to modify, and often faster in execution. Most importantly, it scales to big amounts of data as *no* temporary `list` objects are materialized in memory.

In [260]:
numbers = [7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4]
evens = filter(lambda x: x % 2 == 0, numbers)
transformed = map(lambda x: (x ** 2) + 1, evens)
sum(transformed)

370

If `numbers` comes as a sorted sequence of whole numbers, we may use the [range()](https://docs.python.org/3/library/functions.html#func-range) built-in and get away *without any* `list` object in memory at all!

In [261]:
numbers = range(1, 13)
evens = filter(lambda x: x % 2 == 0, numbers)
transformed = map(lambda x: (x ** 2) + 1, evens)
sum(transformed)

370

To additionally save the temporary variables, `numbers`, `evens`, and `transformed`, we write the entire computation as *one* expression.

In [262]:
sum(map(lambda x: (x ** 2) + 1, filter(lambda x: x % 2 == 0, range(1, 13))))

370

PythonTutor visualizes the differences in the number of computational steps and memory usage:
- [Version 1](http://pythontutor.com/visualize.html#code=def%20is_even%28element%29%3A%0A%20%20%20%20if%20element%20%25%202%20%3D%3D%200%3A%0A%20%20%20%20%20%20%20%20return%20True%0A%20%20%20%20return%20False%0A%0Adef%20transform%28element%29%3A%0A%20%20%20%20return%20%28element%20**%202%29%20%2B%201%0A%0Anumbers%20%3D%20list%28range%281,%2013%29%29%0A%0Aevens%20%3D%20%5B%5D%0Afor%20number%20in%20numbers%3A%0A%20%20%20%20if%20is_even%28number%29%3A%0A%20%20%20%20%20%20%20%20evens.append%28number%29%0A%0Atransformed%20%3D%20%5B%5D%0Afor%20number%20in%20evens%3A%0A%20%20%20%20transformed.append%28transform%28number%29%29%0A%0Aresult%20%3D%20sum%28transformed%29&cumulative=false&curInstr=0&heapPrimitives=nevernest&mode=display&origin=opt-frontend.js&py=3&rawInputLstJSON=%5B%5D&textReferences=false): With `for`-loops, `if` statements, and named functions -> **116** steps
- [Version 2](http://pythontutor.com/visualize.html#code=numbers%20%3D%20range%281,%2013%29%0Aevens%20%3D%20filter%28lambda%20x%3A%20x%20%25%202%20%3D%3D%200,%20numbers%29%0Atransformed%20%3D%20map%28lambda%20x%3A%20%28x%20**%202%29%20%2B%201,%20evens%29%0Aresult%20%3D%20sum%28transformed%29&cumulative=false&curInstr=0&heapPrimitives=nevernest&mode=display&origin=opt-frontend.js&py=3&rawInputLstJSON=%5B%5D&textReferences=false): With named `map` and `filter` objects -> **58** steps
- [Version 3](http://pythontutor.com/visualize.html#code=result%20%3D%20sum%28map%28lambda%20x%3A%20%28x%20**%202%29%20%2B%201,%20filter%28lambda%20x%3A%20x%20%25%202%20%3D%3D%200,%20range%281,%2013%29%29%29%29&cumulative=false&curInstr=0&heapPrimitives=nevernest&mode=display&origin=opt-frontend.js&py=3&rawInputLstJSON=%5B%5D&textReferences=false): Everything in *one* expression -> **55** steps

Versions 2 and 3 are the same, except for the three additional steps required to create the temporary variables. The *major* downside of Version 1 is that, in the worst case, it may need *three times* as much memory as the other two versions!

An experienced Pythonista would probably go with Version 2 in a production system to keep the code readable and maintainable.
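The memory argument can be made tangible with a rough sketch that is not part of the original example. It assumes that [sys.getsizeof()](https://docs.python.org/3/library/sys.html#sys.getsizeof) is a good enough proxy: It reports only the size of the container object itself, not of the elements it references, and the variable names are chosen for illustration only.

```python
import sys

numbers = range(1, 100_001)

# Eager: both intermediate results are materialized as list objects.
evens_list = list(filter(lambda x: x % 2 == 0, numbers))
transformed_list = list(map(lambda x: (x ** 2) + 1, evens_list))

# Lazy: the filter and map objects only store the "rule," not the elements.
evens_lazy = filter(lambda x: x % 2 == 0, numbers)
transformed_lazy = map(lambda x: (x ** 2) + 1, evens_lazy)

# Size of the container objects in bytes (the elements are not included).
list_bytes = sys.getsizeof(evens_list) + sys.getsizeof(transformed_list)
lazy_bytes = sys.getsizeof(evens_lazy) + sys.getsizeof(transformed_lazy)

# Both approaches lead to the same overall sum.
eager_result = sum(transformed_list)
lazy_result = sum(transformed_lazy)

print(list_bytes > lazy_bytes, eager_result == lazy_result)
```

The exact byte counts depend on the Python version and platform, which is why the sketch only compares them.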

### List Comprehensions

For [map()](https://docs.python.org/3/library/functions.html#map) and [filter()](https://docs.python.org/3/library/functions.html#filter), Python provides a nice syntax appealing to people who like mathematics.

Consider again the "*A simple Filter*" example from [Chapter 4](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/04_iteration.ipynb#Example:-A-simple-Filter), written with combined `for` and `if` statements. So, the mapping and filtering steps happen simultaneously.

In [263]:
numbers = [7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4]

In [264]:
evens_transformed = []

for number in numbers:
    if number % 2 == 0:
        evens_transformed.append((number ** 2) + 1)

In [265]:
evens_transformed

[65, 145, 5, 37, 101, 17]

**List comprehensions**, or **list-comps** for short, are *expressions* to derive *new* `list` objects out of *existing* ones (cf., [reference](https://docs.python.org/3/reference/expressions.html#displays-for-lists-sets-and-dictionaries)). A single *expression* like below can replace the compound `for` *statement* from above.

In [266]:
[(n ** 2) + 1 for n in numbers if n % 2 == 0]

[65, 145, 5, 37, 101, 17]

A list comprehension may be used in place of any `list` object.

For example, let's add up all the elements with [sum()](https://docs.python.org/3/library/functions.html#sum). The code below *materializes* all elements in memory *before* summing them up. So,  this code might cause a `MemoryError` when executed with a bigger `numbers` list. [PythonTutor](http://pythontutor.com/visualize.html#code=numbers%20%3D%20range%281,%2013%29%0Aresult%20%3D%20sum%28%5B%28n%20**%202%29%20%2B%201%20for%20n%20in%20numbers%20if%20n%20%25%202%20%3D%3D%200%5D%29&cumulative=false&curInstr=0&heapPrimitives=nevernest&mode=display&origin=opt-frontend.js&py=3&rawInputLstJSON=%5B%5D&textReferences=false) shows how a `list` object exists in memory at step 17 and then "gets lost" right after. 

In [267]:
sum([(n ** 2) + 1 for n in numbers if n % 2 == 0])

370

#### Example: Nested Lists

List comprehensions may come with several `for`'s and `if`'s.

The cell below creates a `list` object that contains other `list` objects with numbers in them. The starting number in each inner `list` object is offset by `1`.

In [268]:
nested_numbers = [list(range(x, y + 1)) for x, y in zip([1, 2, 3], [7, 8, 9])]

In [269]:
nested_numbers

[[1, 2, 3, 4, 5, 6, 7], [2, 3, 4, 5, 6, 7, 8], [3, 4, 5, 6, 7, 8, 9]]

To do something meaningful with the numbers, we have to get rid of the inner layer of `list` objects and flatten the data.

Without list comprehensions, we would probably write two nested `for`-loops.

In [270]:
flat_numbers = []

for inner_numbers in nested_numbers:
    for number in inner_numbers:
        flat_numbers.append(number)

In [271]:
flat_numbers

[1, 2, 3, 4, 5, 6, 7, 2, 3, 4, 5, 6, 7, 8, 3, 4, 5, 6, 7, 8, 9]

That translates into a list comprehension like below. The order of the `for`'s may be confusing at first but is the *same* as writing out the nested `for`-loops.

In [272]:
[number for inner_numbers in nested_numbers for number in inner_numbers]

[1, 2, 3, 4, 5, 6, 7, 2, 3, 4, 5, 6, 7, 8, 3, 4, 5, 6, 7, 8, 9]
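The same ordering rule also applies once an `if` clause joins the `for`'s. As a small sketch, with `flat_evens_loop` and `flat_evens` as illustrative names, the cell below flattens `nested_numbers` and keeps only the even numbers, first with nested loops and then as one list comprehension whose clauses appear in the same order.

```python
nested_numbers = [[1, 2, 3, 4, 5, 6, 7], [2, 3, 4, 5, 6, 7, 8], [3, 4, 5, 6, 7, 8, 9]]

# Written out with two nested for-loops and an if statement ...
flat_evens_loop = []
for inner_numbers in nested_numbers:
    for number in inner_numbers:
        if number % 2 == 0:
            flat_evens_loop.append(number)

# ... and as one list comprehension with the clauses in the same order.
flat_evens = [
    number
    for inner_numbers in nested_numbers
    for number in inner_numbers
    if number % 2 == 0
]

print(flat_evens)  # [2, 4, 6, 2, 4, 6, 8, 4, 6, 8]
```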

Now, we may use the `list` object resulting from the list comprehension in any context we want.

As an example, we add up the flattened numbers with [sum()](https://docs.python.org/3/library/functions.html#sum). The same caveat holds as before: The `list` object passed into [sum()](https://docs.python.org/3/library/functions.html#sum) is *materialized* before the sum is calculated!

In [273]:
sum([number for inner_numbers in nested_numbers for number in inner_numbers])

105

In this particular example, however, we can exploit the fact that any sum of numbers can be expressed as the sum of sums of mutually exclusive and collectively exhaustive subsets of these numbers and get away with just *one* `for` in the list comprehension.

In [274]:
sum([sum(inner_numbers) for inner_numbers in nested_numbers])

105

#### Example: Cartesian Products

A popular use case of nested list comprehensions is applying a transformation to each $2$-tuple of the [Cartesian product](https://en.wikipedia.org/wiki/Cartesian_product) of two iterables.

For example, let's add `1` to each quotient obtained by taking the numerator from `[10, 20, 30]` and the denominator from `[40, 50, 60]`, and then find the product of all the resulting numbers. The table below visualizes the calculations: The result is the product of *nine* entries.

||**10**|**20**|**30**|
|-|-|-|-|
|**40**|1.25|1.50|1.75|
|**50**|1.20|1.40|1.60|
|**60**|1.17|1.33|1.50|

To express that in Python, we start by creating two `list` objects, `first` and `second`.

In [275]:
first = [10, 20, 30]
second = [40, 50, 60]

For a Cartesian product, we loop over *all* possible $2$-tuples where one element is drawn from `first` and the other from `second`. That is equivalent to two nested `for`-loops.

In [276]:
cartesian_product = []

for numerator in first:
    for denominator in second:
        quotient = numerator / denominator
        cartesian_product.append(quotient + 1)

cartesian_product

[1.25, 1.2, 1.1666666666666667, 1.5, 1.4, 1.3333333333333333, 1.75, 1.6, 1.5]

We translate the two `for`-loops into one list comprehension with two `for`'s in it and use `x` and `y` as shorter variable names.

In [277]:
[(x / y) + 1 for x in first for y in second]

[1.25, 1.2, 1.1666666666666667, 1.5, 1.4, 1.3333333333333333, 1.75, 1.6, 1.5]

The order of the `for`'s is *important*: The list comprehension above divides numbers from `first` by numbers from `second`, whereas the list comprehension below does the opposite.

In [278]:
[(x / y) + 1 for x in second for y in first]

[5.0, 3.0, 2.333333333333333, 6.0, 3.5, 2.666666666666667, 7.0, 4.0, 3.0]

To find the overall product, we *unpack* the first list comprehension right into the `product()` function from the "*Packing & Unpacking*" section above.

In [279]:
product(*[(x / y) + 1 for x in first for y in second])

20.58

Alternatively, we use a `lambda` expression with the [reduce()](https://docs.python.org/3/library/functools.html#functools.reduce) function from the [functools](https://docs.python.org/3/library/functools.html) module.

In [280]:
reduce(lambda x, y: x * y, [(x / y) + 1 for x in first for y in second])

20.58

While this example is stylized, Cartesian products are hidden in many applications, and it shows how the various language features introduced in this chapter can be seamlessly combined to process sequential data.
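As an aside, the standard library offers ready-made tools for both steps. The sketch below is an alternative to the approach in this section, not a replacement for it: It assumes Python 3.8 or later for [math.prod()](https://docs.python.org/3/library/math.html#math.prod) and accesses [itertools.product()](https://docs.python.org/3/library/itertools.html#itertools.product) via its module so that it does not shadow the `product()` function defined earlier in this chapter.

```python
import itertools
import math

first = [10, 20, 30]
second = [40, 50, 60]

# itertools.product() generates the 2-tuples of the Cartesian product
# in the same order as two nested for-loops would.
pairs = list(itertools.product(first, second))

# math.prod() multiplies all elements of an iterable.
result = math.prod([(x / y) + 1 for x, y in itertools.product(first, second)])

print(pairs[:3])  # [(10, 40), (10, 50), (10, 60)]
```

Due to floating-point arithmetic, `result` should be compared to `20.58` with a tolerance, for example, with [math.isclose()](https://docs.python.org/3/library/math.html#math.isclose).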

### Generator Expressions

Pythonistas would forgo materialized `list` objects, and, thus, also list comprehensions, altogether, and use a more memory-efficient approach with **[generator expressions](https://docs.python.org/3/reference/expressions.html#generator-expressions)**, or **genexps** for short. Syntactically, they work like list comprehensions except that parentheses replace the brackets.

Let's go back to the original example in this section and find the transformation $y := x^2 + 1$ of all even elements in `numbers`.

In [281]:
numbers = [7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4]

To filter and transform `numbers`, we wrote this list comprehension above ...

In [282]:
[(n ** 2) + 1 for n in numbers if n % 2 == 0]

[65, 145, 5, 37, 101, 17]

... that now becomes a generator expression.

In [283]:
((n ** 2) + 1 for n in numbers if n % 2 == 0)

<generator object <genexpr> at 0x7f9f447618b8>

We can think of it as yet another "rule" in memory that knows how to generate the individual objects in a series one by one. Whereas a list comprehension materializes its elements in memory *when* it is evaluated, the opposite holds for generator expressions, and *no* object is created in memory except the "rule" itself. Because of this behavior, we describe generator expressions as *lazy* and list comprehensions as *eager*.

So, to materialize all elements specified by a generator expression, we revert to the [list()](https://docs.python.org/3/library/functions.html#func-list) built-in.

In [284]:
list(((n ** 2) + 1 for n in numbers if n % 2 == 0))

[65, 145, 5, 37, 101, 17]

Whenever a generator expression is the only argument to a function, we may leave out the parentheses.

In [285]:
list((n ** 2) + 1 for n in numbers if n % 2 == 0)

[65, 145, 5, 37, 101, 17]
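The difference between *eager* and *lazy* evaluation described above can be made visible with a small experiment. The sketch below uses a hypothetical helper function, `transform()`, that records every argument it receives in a `list` object named `calls`. Both names are made up for illustration.

```python
calls = []  # records every number that gets transformed


def transform(n):
    """Hypothetical helper: remember the argument, then compute n^2 + 1."""
    calls.append(n)
    return (n ** 2) + 1


numbers = [7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4]

# Eager: all six even numbers are transformed right away.
eager = [transform(n) for n in numbers if n % 2 == 0]
calls_after_listcomp = len(calls)

calls.clear()

# Lazy: creating the generator object does not call transform() at all.
lazy = (transform(n) for n in numbers if n % 2 == 0)
calls_after_genexp = len(calls)

# Only materializing the elements triggers the six calls.
materialized = list(lazy)
calls_after_list = len(calls)

print(calls_after_listcomp, calls_after_genexp, calls_after_list)  # 6 0 6
```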

A common use case is to reduce the elements into a single object instead, for example, by adding them up with [sum()](https://docs.python.org/3/library/functions.html#sum). [PythonTutor](http://pythontutor.com/visualize.html#code=numbers%20%3D%20range%281,%2013%29%0Asum_with_list%20%3D%20sum%28%5B%28n%20**%202%29%20%2B%201%20for%20n%20in%20numbers%20if%20n%20%25%202%20%3D%3D%200%5D%29%0Asum_with_gen%20%3D%20sum%28%28n%20**%202%29%20%2B%201%20for%20n%20in%20numbers%20if%20n%20%25%202%20%3D%3D%200%29&cumulative=false&curInstr=0&heapPrimitives=nevernest&mode=display&origin=opt-frontend.js&py=3&rawInputLstJSON=%5B%5D&textReferences=false) shows how the code cell below does *not* create a temporary `list` object in memory, whereas a list comprehension would (cf., step 17).

In [286]:
sum((n ** 2) + 1 for n in numbers if n % 2 == 0)

370

Let's assign the object returned from a generator expression to a variable and inspect it.

In [287]:
gen = ((n ** 2) + 1 for n in numbers if n % 2 == 0)

In [288]:
gen

<generator object <genexpr> at 0x7f9f44761d68>

Unsurprisingly, generator expressions create objects of type `generator`.

In [289]:
type(gen)

generator

With the [next()](https://docs.python.org/3/library/functions.html#next) function, we can retrieve the generated elements one by one.

In [290]:
next(gen)

65

In [291]:
next(gen)

145

In [292]:
next(gen)

5

In [293]:
next(gen)

37

In [294]:
next(gen)

101

In [295]:
next(gen)

17

Once a `generator` object runs out of elements, it raises a `StopIteration` exception. We say that the `generator` object is **exhausted**, and to loop over its elements again, we must create a *new* one.

In [296]:
next(gen)

StopIteration: 

In [297]:
next(gen)

StopIteration: 

Calling the [next()](https://docs.python.org/3/library/functions.html#next) function repeatedly with the *same* `generator` object as the argument is essentially what a `for`-loop automates for us. So, `generator` objects are *iterable*.

In [298]:
for number in ((n ** 2) + 1 for n in numbers if n % 2 == 0):
    print(number, end=" ")

65 145 5 37 101 17 
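When calling [next()](https://docs.python.org/3/library/functions.html#next) manually, the `StopIteration` exception from above can be avoided by passing a default value as the optional second argument. The sketch below illustrates that with `None` as the default. The variable names are chosen for illustration only.

```python
numbers = [7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4]
gen = ((n ** 2) + 1 for n in numbers if n % 2 == 0)

first_element = next(gen)  # 65, as before
remaining = list(gen)  # consumes the other five elements

# gen is exhausted now: The default is returned instead of an exception.
fallback = next(gen, None)

print(first_element, remaining, fallback)
```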

#### Example: Nested Lists (revisited)

If we are only interested in a *reduction* of `nested_numbers` into a single statistic, such as the overall sum in the "*Nested Lists*" example, we should replace lists or list comprehensions with generator expressions wherever possible.

The result is the *same*, but no intermediate lists are materialized! That makes our code scale to larger amounts of data and uses the available hardware more efficiently.

Let's adapt the example but keep `nested_numbers` unchanged for now.

In [299]:
nested_numbers = [list(range(x, y + 1)) for x, y in zip([1, 2, 3], [7, 8, 9])]

In [300]:
nested_numbers

[[1, 2, 3, 4, 5, 6, 7], [2, 3, 4, 5, 6, 7, 8], [3, 4, 5, 6, 7, 8, 9]]

We leave out the brackets and keep everything else as-is: The argument to [sum()](https://docs.python.org/3/library/functions.html#sum), a list comprehension in the initial implementation above, becomes a generator expression.

In [301]:
sum(number for inner_numbers in nested_numbers for number in inner_numbers)

105

That also holds for the alternative formulation as a sum of sums.

In [302]:
sum(sum(inner_numbers) for inner_numbers in nested_numbers)

105

Because `nested_numbers` has an internal structure, we can make it **memoryless** by expressing it as a generator expression derived from `range` objects. [PythonTutor](http://pythontutor.com/visualize.html#code=nested_numbers%20%3D%20%28%28range%28x,%20y%20%2B%201%29%29%20for%20x,%20y%20in%20zip%28range%281,%204%29,%20range%287,%2010%29%29%29%0Aresult%20%3D%20sum%28number%20for%20inner_numbers%20in%20nested_numbers%20for%20number%20in%20inner_numbers%29&cumulative=false&curInstr=0&heapPrimitives=nevernest&mode=display&origin=opt-frontend.js&py=3&rawInputLstJSON=%5B%5D&textReferences=false) confirms that no `list` object materializes at any point in time.

In [303]:
nested_numbers = ((range(x, y + 1)) for x, y in zip(range(1, 4), range(7, 10)))

In [304]:
nested_numbers

<generator object <genexpr> at 0x7f9f44761e58>

In [305]:
sum(number for inner_numbers in nested_numbers for number in inner_numbers)

105

We must be careful when assigning a `generator` object to a variable: If we use `nested_numbers` again, for example, in the alternative formulation below, [sum()](https://docs.python.org/3/library/functions.html#sum) returns `0` because `nested_numbers` is exhausted after executing the previous code cell. [PythonTutor](http://pythontutor.com/visualize.html#code=nested_numbers%20%3D%20%28%28range%28x,%20y%20%2B%201%29%29%20for%20x,%20y%20in%20zip%28range%281,%204%29,%20range%287,%2010%29%29%29%0Aresult%20%3D%20sum%28number%20for%20inner_numbers%20in%20nested_numbers%20for%20number%20in%20inner_numbers%29%0Ano_result%20%3D%20sum%28sum%28inner_numbers%29%20for%20inner_numbers%20in%20nested_numbers%29&cumulative=false&curInstr=0&heapPrimitives=nevernest&mode=display&origin=opt-frontend.js&py=3&rawInputLstJSON=%5B%5D&textReferences=false) also shows that.

In [306]:
sum(sum(inner_numbers) for inner_numbers in nested_numbers)

0
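One way around the exhaustion problem, sketched here under the assumption that re-creating the "rule" is cheap, is to wrap the generator expression in a function that returns a *new* `generator` object on every call. The name `make_nested_numbers()` is hypothetical.

```python
def make_nested_numbers():
    """Hypothetical helper: return a fresh generator object on every call."""
    return (range(x, y + 1) for x, y in zip(range(1, 4), range(7, 10)))


# Each reduction works on its own, not yet exhausted, generator object.
flat_sum = sum(
    number for inner_numbers in make_nested_numbers() for number in inner_numbers
)
sum_of_sums = sum(sum(inner_numbers) for inner_numbers in make_nested_numbers())

print(flat_sum, sum_of_sums)  # 105 105
```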

#### Example: Cartesian Products (revisited)

Let's also rewrite the "*Cartesian Products*" example from above with generator expressions.

As a first optimization, we replace the materialized `list` objects, `first` and `second`, with memoryless `range` objects.

In [307]:
first = range(10, 31, 10)  # = [10, 20, 30]
second = range(40, 61, 10)  # = [40, 50, 60]

Now, the first of the two alternatives may be more appealing to many readers. In general, many practitioners seem to dislike `lambda` expressions.

The code cell below *unpacks* the elements produced by `((x / y) + 1 for x in first for y in second)` into the `product()` function from the "*Packing & Unpacking*" section above. However, inside `product()`, the elements are *packed* into `args`, a materialized `tuple` object. So, all the memory efficiency gained with the generator expression is voided! [PythonTutor](http://pythontutor.com/visualize.html#code=def%20product%28*args%29%3A%0A%20%20%20%20result%20%3D%20args%5B0%5D%0A%20%20%20%20for%20arg%20in%20args%5B1%3A%5D%3A%0A%20%20%20%20%20%20%20%20result%20*%3D%20arg%0A%20%20%20%20return%20result%0A%0Afirst%20%3D%20range%2810,%2031,%2010%29%0Asecond%20%3D%20range%2840,%2061,%2010%29%0A%0Aresult%20%3D%20product%28*%28%28x%20/%20y%29%20%2B%201%20for%20x%20in%20first%20for%20y%20in%20second%29%29&cumulative=false&curInstr=0&heapPrimitives=nevernest&mode=display&origin=opt-frontend.js&py=3&rawInputLstJSON=%5B%5D&textReferences=false) shows how a `tuple` object exists in steps 38-58.

In [308]:
product(*((x / y) + 1 for x in first for y in second))

20.58

On the contrary, the solution with the [reduce()](https://docs.python.org/3/library/functools.html#functools.reduce) function from the [functools](https://docs.python.org/3/library/functools.html) module and the `lambda` expression works *without* all elements materialized at the same time, and [PythonTutor](http://pythontutor.com/visualize.html#code=from%20functools%20import%20reduce%0A%0Afirst%20%3D%20range%2810,%2031,%2010%29%0Asecond%20%3D%20range%2840,%2061,%2010%29%0A%0Aresult%20%3D%20reduce%28%0A%20%20%20%20lambda%20x,%20y%3A%20x%20*%20y,%0A%20%20%20%20%28%28x%20/%20y%29%20%2B%201%20for%20x%20in%20first%20for%20y%20in%20second%29%0A%29&cumulative=false&curInstr=0&heapPrimitives=nevernest&mode=display&origin=opt-frontend.js&py=3&rawInputLstJSON=%5B%5D&textReferences=false) confirms that. So, only the second alternative is truly memory-efficient.

In [309]:
reduce(lambda x, y: x * y, ((x / y) + 1 for x in first for y in second))

20.58

In summary, we learn from this example that unpacking generator expressions may be a *bad* idea.
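If we prefer a named function over a `lambda` expression, one possible alternative is a function that accepts a *single* iterable argument instead of packing its arguments. The sketch below uses a hypothetical `product_of()` function for that. It is not the `product()` function from the "*Packing & Unpacking*" section and, unlike it, returns `1` for an empty iterable.

```python
import math


def product_of(iterable):
    """Hypothetical variant of product(): multiply the elements one by one."""
    result = 1
    for element in iterable:  # no tuple object is materialized here
        result *= element
    return result


first = range(10, 31, 10)
second = range(40, 61, 10)

# The generator expression is passed as-is, without unpacking it.
result = product_of((x / y) + 1 for x in first for y in second)

print(math.isclose(result, 20.58))  # True
```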

### Tuple Comprehensions

There is no syntax to derive *new* `tuple` objects out of existing ones. However, we can mimic such a construct by combining the [tuple()](https://docs.python.org/3/library/functions.html#func-tuple) built-in with a generator expression.

So, to convert the list comprehension `[(n ** 2) + 1 for n in numbers if n % 2 == 0]` from above into a "tuple comprehension," we write the following.

In [310]:
tuple((n ** 2) + 1 for n in numbers if n % 2 == 0)

(65, 145, 5, 37, 101, 17)

### Boolean Reducers

Besides [min()](https://docs.python.org/3/library/functions.html#min), [max()](https://docs.python.org/3/library/functions.html#max), and [sum()](https://docs.python.org/3/library/functions.html#sum), Python provides two boolean reduce functions: [all()](https://docs.python.org/3/library/functions.html#all) and [any()](https://docs.python.org/3/library/functions.html#any).

Let's look at straightforward examples involving `numbers` again.

In [311]:
numbers = [7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4]

[all()](https://docs.python.org/3/library/functions.html#all) takes an *iterable* argument and returns `True` if *all* elements are *truthy*.

For example, let's check if the square of each element in `numbers` is below `100` or `150`, respectively. We express the computation with a generator expression passed as the only argument to [all()](https://docs.python.org/3/library/functions.html#all).

In [312]:
all(x ** 2 < 100 for x in numbers)

False

In [313]:
all(x ** 2 < 150 for x in numbers)

True

[all()](https://docs.python.org/3/library/functions.html#all) can be viewed as syntactic sugar replacing a `for`-loop: Internally, [all()](https://docs.python.org/3/library/functions.html#all) implements the *short-circuiting* strategy from [Chapter 3](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/03_conditionals.ipynb#Short-Circuiting), and we mimic that by testing for the *opposite* condition in the `if` statement and leaving the `for`-loop early with the `break` statement. In the worst case, if `threshold` were, for example, `150`, we would loop over *all* elements in the *iterable*, which must be *finite* for the code to work. So, [all()](https://docs.python.org/3/library/functions.html#all) is a *linear search* in disguise.

In [314]:
threshold = 100

for number in numbers:
    if number ** 2 >= threshold:  # = the opposite of what we are checking for
        all_below_threshold = False
        break
else:
    all_below_threshold = True

all_below_threshold

False

The documentation of [all()](https://docs.python.org/3/library/functions.html#all) shows in another way what it does with code: By placing a `return` statement inside the `for`-loop of a function's body, iteration is stopped prematurely once an element does *not* meet the condition. That is the familiar *early exit* pattern at work.

In [315]:
def all_alt(iterable):
    """Alternative implementation of the built-in all() function."""
    for element in iterable:
        if not element:  # = the opposite of what we are checking for
            return False
    return True

In [316]:
all_alt(x ** 2 < 100 for x in numbers)

False

In [317]:
all_alt(x ** 2 < 150 for x in numbers)

True

Similarly, [any()](https://docs.python.org/3/library/functions.html#any) checks if *at least* one element in the *iterable* argument is *truthy*.

To continue the example, let's check if the square of *any* element in `numbers` is above `100` or `150`, respectively.

In [318]:
any(x ** 2 > 100 for x in numbers)

True

In [319]:
any(x ** 2 > 150 for x in numbers)

False

Expressed with a `for`-loop, the implementation below reveals that [any()](https://docs.python.org/3/library/functions.html#any) follows the *short-circuiting* strategy as well. Here, we do *not* need to check for the opposite condition.

In [320]:
threshold = 100

for number in numbers:
    if number ** 2 > threshold:
        any_above_threshold = True
        break
else:
    any_above_threshold = False

any_above_threshold

True

The alternative formulation in the documentation of [any()](https://docs.python.org/3/library/functions.html#any) is straightforward.

In [321]:
def any_alt(iterable):
    """Alternative implementation of the built-in any() function."""
    for element in iterable:
        if element:
            return True
    return False

In [322]:
any_alt(x ** 2 > 100 for x in numbers)

True

In [323]:
any_alt(x ** 2 > 150 for x in numbers)

False
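Short-circuiting also means that both functions can return a result for an *infinite* iterable, as long as some element settles the answer. The sketch below illustrates that with [itertools.count()](https://docs.python.org/3/library/itertools.html#itertools.count), which generates whole numbers without end, and also shows the edge case of an empty iterable. The variable names are chosen for illustration.

```python
import itertools

# itertools.count(1) generates 1, 2, 3, ... without end.
# any() stops at the first truthy element, here for x = 11.
found = any(x ** 2 > 100 for x in itertools.count(1))

# all() stops at the first falsy element, here for x = 10.
all_small = all(x ** 2 < 100 for x in itertools.count(1))

# Edge cases: With no elements, all() is True, and any() is False.
empty_all = all([])
empty_any = any([])

print(found, all_small, empty_all, empty_any)  # True False True False
```

Without an element that settles the answer, for example, in `any(x < 0 for x in itertools.count(1))`, the call would never return.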

### Example: Averaging Even Numbers (revisited)

With the new concepts in this chapter, let's rewrite the book's introductory "*Averaging Even Numbers*" example in [Chapter 1](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/01_elements.ipynb#Example:-Averaging-Even-Numbers) such that it efficiently handles a large sequence of numbers.

We assume the `average_evens()` function below is called with a *finite* and *iterable* object, which generates a "stream" of numeric objects that can be cast as `int` objects because the idea of even and odd numbers only makes sense in the context of whole numbers.

The generator expression `(int(n) for n in numbers)` implements the type casting, and when it is evaluated, *nothing* happens except that a `generator` object is stored in `integers`. Then, with the [reduce()](https://docs.python.org/3/library/functools.html#functools.reduce) function from the [functools](https://docs.python.org/3/library/functools.html) module, we *simultaneously* add up *and* count the even numbers produced by the inner generator expression `((n, 1) for n in integers if n % 2 == 0)`. That results in a `generator` object producing `tuple` objects consisting of the next *even* number in line and `1`. Two such `tuple` objects are then iteratively passed to the `lambda` expression as the `x` and `y` arguments. `x` represents the total and the count of the even numbers processed so far, while `y`'s first element, `y[0]`, is the next even number to be added to the running total. The result of the [reduce()](https://docs.python.org/3/library/functools.html#functools.reduce) function is again a `tuple` object, namely the final `total` and `count`. Lastly, we calculate the simple average.

In summary, the implementation of `average_evens()` does *not* keep materialized `list` objects internally like its predecessors from [Chapter 2](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/02_functions.ipynb), but processes the elements of the `numbers` argument on a one-by-one basis.

In [324]:
def average_evens(numbers):
    """Calculate the average of all even integers.

    Args:
        numbers (iterable): a finite stream of numbers;
            may be integers or floats; floats are truncated

    Returns:
        float: average
    """
    integers = (int(n) for n in numbers)
    total, count = reduce(
        lambda x, y: (x[0] + y[0], x[1] + y[1]),
        ((n, 1) for n in integers if n % 2 == 0)
    )
    return total / count

In [325]:
average_evens([7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4])

7.0

An argument generating `float` objects works as well.

In [326]:
average_evens([7., 11., 8., 5., 3., 12., 2., 6., 9., 10., 1., 4.])

7.0

To show that `average_evens()` can process a **stream** of data, we simulate `10_000_000` randomly drawn integers between `0` and `100` with the [randint()](https://docs.python.org/3/library/random.html#random.randint) function from the [random](https://docs.python.org/3/library/random.html) module. We use a generator expression derived from a `range` object as the `numbers` argument. So, at *no* point in time is there a materialized `list` or `tuple` object in memory. The result approaching `50` is what we expect if [randint()](https://docs.python.org/3/library/random.html#random.randint) draws the integers from a uniform distribution.

In [327]:
import random

In [328]:
random.seed(42)

In [329]:
average_evens(random.randint(0, 100) for _ in range(10_000_000))

49.994081434519636

To show that `average_evens()` filters out odd numbers, we simulate another stream of `10_000_000` randomly drawn odd integers between `1` and `99`. As no function in the [random](https://docs.python.org/3/library/random.html) module does that "out of the box," we must be creative: Doubling a number drawn from `random.randint(0, 49)` results in an even number between `0` and `98`, and adding `1` makes it odd. Then, `average_evens()` raises a `TypeError`, essentially because the inner generator expression `((n, 1) for n in integers if n % 2 == 0)` does not generate any element, and [reduce()](https://docs.python.org/3/library/functools.html#functools.reduce) is called without an initial value.

In [330]:
average_evens(2 * random.randint(0, 49) + 1 for _ in range(10_000_000))

TypeError: reduce() of empty sequence with no initial value
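The error message hints at a possible remedy: [reduce()](https://docs.python.org/3/library/functools.html#functools.reduce) accepts an optional third argument that serves as the initial value. The sketch below is a hypothetical variant, `average_evens_safe()`, that starts with a total and count of `(0, 0)`. Returning `None` for a stream without even numbers is an assumption made here for illustration. Raising a more descriptive exception would be equally valid.

```python
from functools import reduce


def average_evens_safe(numbers):
    """Hypothetical variant of average_evens() that tolerates missing evens.

    Returns None if the stream contains no even number at all.
    """
    integers = (int(n) for n in numbers)
    total, count = reduce(
        lambda x, y: (x[0] + y[0], x[1] + y[1]),
        ((n, 1) for n in integers if n % 2 == 0),
        (0, 0),  # initial value; returned as-is if there are no even numbers
    )
    if count == 0:
        return None
    return total / count


with_evens = average_evens_safe([7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4])
without_evens = average_evens_safe([1, 3, 5, 7])

print(with_evens, without_evens)  # 7.0 None
```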

## Iterators vs. Iterables

In the "*Collections vs. Sequences*" section above, we studied the three *behaviors* of collections and the four *behaviors* of sequences. Collections and sequences are *abstract* ideas, and we mainly use them to classify *concrete* data types.

Similarly, we have introduced data types in this chapter that all share the "behavior" of modeling some "rule" in memory to generate objects "on the fly:" They are the `map`, `filter`, and `generator` types. Their main commonality is supporting the built-in [next()](https://docs.python.org/3/library/functions.html#next) function. In computer science terminology, such data types are called **[iterators](https://en.wikipedia.org/wiki/Iterator)**, and the [collections.abc](https://docs.python.org/3/library/collections.abc.html) module formalizes them with the `Iterator` ABC.

So, an example of an iterator is `evens_transformed` below, an object of type `map`.

In [331]:
numbers = [7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4]

In [332]:
evens_transformed = map(lambda x: (x ** 2) + 1, filter(lambda x: x % 2 == 0, numbers))

Let's first confirm that `evens_transformed` is indeed an `Iterator`, "abstractly speaking."

In [333]:
isinstance(evens_transformed, abc.Iterator)

True

In Python, iterators are *always* also iterables. The reverse does *not* hold! To be precise, iterators are *specializations* of iterables. That is what the "Inherits from" column means in the [collections.abc](https://docs.python.org/3/library/collections.abc.html) module's documentation.

In [334]:
isinstance(evens_transformed, abc.Iterable)

True
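To see that the reverse does not hold, we can run the same checks on a plain `list` object. The sketch below assumes that `abc` refers to the [collections.abc](https://docs.python.org/3/library/collections.abc.html) module, as in the cells above. The other variable names are chosen for illustration.

```python
import collections.abc as abc

numbers = [7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4]

# A list object is an iterable but not an iterator.
list_is_iterable = isinstance(numbers, abc.Iterable)
list_is_iterator = isinstance(numbers, abc.Iterator)

# Consequently, next() does not work with a list object directly.
try:
    next(numbers)
except TypeError as error:
    message = str(error)

print(list_is_iterable, list_is_iterator)  # True False
print(message)
```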

Furthermore, we revise our definition of *iterables* from above: Just as we defined an *iterator* to be an object that supports the [next()](https://docs.python.org/3/library/functions.html#next) function, we define an *iterable* to be an object that supports the built-in [iter()](https://docs.python.org/3/library/functions.html#iter) function.

The confused reader may now be wondering how the two concepts relate to each other.

In short, the [iter()](https://docs.python.org/3/library/functions.html#iter) function is the general way to create an *iterator* object out of a given *iterable* object. In real-world code, we hardly ever see [iter()](https://docs.python.org/3/library/functions.html#iter) as Python calls it for us in the background. Then, the *iterator* object manages the iteration over the *iterable* object.

For illustration, let's do that ourselves and create *two* iterators out of the iterable `numbers` and see what we can do with them.

In [335]:
iterator1 = iter(numbers)

In [336]:
iterator2 = iter(numbers)

`iterator1` and `iterator2` are of type `list_iterator`.

In [337]:
type(iterator1)

list_iterator

*Iterators* are useful for only *one* operation: Get the next object from the associated *iterable*.

By calling [next()](https://docs.python.org/3/library/functions.html#next) three times with `iterator1` as the argument, we obtain the first three elements of `numbers`.

In [338]:
next(iterator1), next(iterator1), next(iterator1)

(7, 11, 8)

`iterator1` and `iterator2` keep their *states* separate. So, we could "manually" loop over an *iterable* in parallel.

In [339]:
next(iterator1), next(iterator2)

(5, 7)

We can also play a "trick" and exchange some elements in `numbers`. `iterator1` and `iterator2` do *not* keep copies of the elements and, therefore, present us with the new ones. So, *iterators* have state on their own, namely their position in the iteration, and keep it separate from the underlying *iterable*, whose elements they look up only when asked for the next one.

In [340]:
numbers[1], numbers[4] = 99, 99

In [341]:
next(iterator1), next(iterator2)

(99, 99)

Let's re-assign the elements in `numbers` so that they are in order. Now, the numbers returned from [next()](https://docs.python.org/3/library/functions.html#next) also tell us how often [next()](https://docs.python.org/3/library/functions.html#next) was called with `iterator1` or `iterator2`. We conclude that `list_iterator` objects must be keeping track of the *last* index obtained from the underlying *iterable*.

In [342]:
numbers[:] = list(range(1, 13))

In [343]:
numbers

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

In [344]:
next(iterator1), next(iterator2)

(6, 3)

With the concepts introduced in this section, we can now understand the first sentence in the documentation on the [zip()](https://docs.python.org/3/library/functions.html#zip) built-in better: "Make an *iterator* that aggregates elements from each of the *iterables*."

Because *iterators* are always also *iterables*, we pass `iterator1` and `iterator2` as arguments to [zip()](https://docs.python.org/3/library/functions.html#zip).

The returned `zipper` object is of type `zip` and, "abstractly speaking," an `Iterator` as well.

In [345]:
zipper = zip(iterator1, iterator2)

In [346]:
zipper

<zip at 0x7f9f44726a48>

In [347]:
type(zipper)

zip

In [348]:
isinstance(zipper, abc.Iterator)

True

So far, we have always used [zip()](https://docs.python.org/3/library/functions.html#zip) with a `for` statement and looped over it. That was our earlier definition of an *iterable*. Our revised definition in this section states that an *iterable* is an object that supports the [iter()](https://docs.python.org/3/library/functions.html#iter) function. So, let's see what happens if we pass `zipper` to [iter()](https://docs.python.org/3/library/functions.html#iter).

In [349]:
zipper_iterator = iter(zipper)

In [350]:
zipper_iterator

<zip at 0x7f9f44726a48>

`zipper_iterator` points to the *same* object as `zipper`! That holds for *iterators* in general: Any *iterator* created from an existing *iterator* with [iter()](https://docs.python.org/3/library/functions.html#iter) is the *iterator* itself.

In [351]:
zipper is zipper_iterator

True

The Python core developers made that design decision so that *iterators* may also be looped over.
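To make the two behaviors concrete, here is a minimal sketch of a hand-written iterator. It relies on the `class` statement and the special methods `__iter__()` and `__next__()`, which go beyond what this chapter has covered so far. The `IndexIterator` class is hypothetical and only approximates what a `list_iterator` object does. The actual built-in is implemented differently.

```python
import collections.abc as abc


class IndexIterator:
    """Hypothetical iterator over a sequence that only remembers an index."""

    def __init__(self, sequence):
        self._sequence = sequence  # a reference, not a copy
        self._index = 0  # the iterator's own state

    def __iter__(self):
        # Called by iter(): An iterator returns itself.
        return self

    def __next__(self):
        # Called by next(): Look up the next element or signal exhaustion.
        if self._index >= len(self._sequence):
            raise StopIteration
        element = self._sequence[self._index]
        self._index += 1
        return element


letters = ["a", "b", "c"]
iterator = IndexIterator(letters)

first_letter = next(iterator)
same_object = iter(iterator) is iterator
remaining = [letter for letter in iterator]  # loops over the iterator itself
is_iterator = isinstance(iterator, abc.Iterator)

print(first_letter, same_object, remaining, is_iterator)  # a True ['b', 'c'] True
```

Because `__iter__()` returns the object itself, the list comprehension picks up where the earlier [next()](https://docs.python.org/3/library/functions.html#next) call left off.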

The `for`-loop below prints out *six* more `tuple` objects derived from the re-ordered `numbers` because the `iterator1` object hidden inside `zipper` already returned the first *six* elements. So, the respective first elements of the `tuple` objects printed range from `7` to `12`. Similarly, as `iterator2` already provided *three* elements from `numbers`, we see the respective second elements in the range from `4` to `9`.

In [352]:
for x, y in zipper:
    print(x, ">", y, end="   ")

7 > 4   8 > 5   9 > 6   10 > 7   11 > 8   12 > 9   

`zipper` is now *exhausted*. So, the `for`-loop below does *not* make any iteration at all.

In [353]:
for x, y in zipper:
    print(x, ">", y, end="   ")
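
The same effect can be reproduced in isolation. The sketch below uses made-up data and the illustrative names `values` and `one_shot`. It suggests that exhaustion is a property of the *iterator* only, and not of the underlying `list` object.

```python
values = [1, 2, 3]  # hypothetical example data
one_shot = iter(values)

# The first pass consumes every element of the iterator.
first_pass = list(one_shot)
# The second pass finds the iterator exhausted and obtains nothing.
second_pass = list(one_shot)

print(first_pass)   # [1, 2, 3]
print(second_pass)  # []

# The list itself is unaffected: a new pass starts with a fresh iterator.
print(list(values))  # [1, 2, 3]
```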

We verify that `iterator1` is exhausted by passing it to [next()](https://docs.python.org/3/library/functions.html#next) again, which raises a `StopIteration` exception.

In [354]:
next(iterator1)

StopIteration: 

On the contrary, `iterator2` is *not* yet exhausted.

In [355]:
next(iterator2)

10
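
That `iterator2` still has elements left follows from how [zip()](https://docs.python.org/3/library/functions.html#zip) works: It asks its arguments for their next elements from left to right and stops as soon as one of them is exhausted. The sketch below, with made-up data and the illustrative names `shorter` and `longer`, suggests that the order of the arguments matters when *iterators* of different lengths are zipped.

```python
# Case 1: the shorter iterator comes first.
shorter, longer = iter([1, 2]), iter("abc")
pairs = list(zip(shorter, longer))
leftover = next(longer)  # the longer iterator was not advanced a third time
print(pairs)     # [(1, 'a'), (2, 'b')]
print(leftover)  # c

# Case 2: the longer iterator comes first.
shorter, longer = iter([1, 2]), iter("abc")
pairs_swapped = list(zip(longer, shorter))
# zip() already took "c" from the longer iterator before it noticed
# that the shorter one was exhausted, so that element is gone.
leftover_swapped = next(longer, "nothing left")
print(pairs_swapped)     # [('a', 1), ('b', 2)]
print(leftover_swapped)  # nothing left
```

In the second case, one element of the longer *iterator* is silently consumed and discarded. With *iterables* like `list` objects, this does not matter, as every loop over them starts afresh.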

### The `for` Statement (revisited)

In [Chapter 4](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/04_iteration.ipynb#The-for-Statement), we argue that the `for` statement is syntactic sugar, replacing the `while` statement in many scenarios. In particular, a `for`-loop saves us two tasks: Managing an index variable *and* obtaining the individual elements by indexing. In this sub-section, we look at a more realistic picture, using the new terminology as well.

Let's print out the elements of a `list` object as the *iterable* to be looped over.

In [356]:
iterable = [0, 1, 2, 3, 4]

In [357]:
for element in iterable:
    print(element, end=" ")

0 1 2 3 4 

Our previous and equivalent formulation with a `while` statement is like so.

In [358]:
index = 0
while index < len(iterable):
    element = iterable[index]
    print(element, end=" ")
    index += 1
del index

0 1 2 3 4 

What happens behind the scenes in the Python interpreter is shown below.

First, Python calls [iter()](https://docs.python.org/3/library/functions.html#iter) with the `iterable` to be looped over, and obtains an `iterator`. The `iterator` contains the entire logic of how the `iterable` is looped over. In particular, the `iterator` may or may not pick the `iterable`'s elements in a predictable order. That is up to the "rule" it models.

Second, Python enters an *indefinite* `while`-loop. It tries to obtain the next element with [next()](https://docs.python.org/3/library/functions.html#next). If that succeeds, the `for`-loop's code block is executed. Below, that code is placed within the `else`-clause that runs only if *no* exception is raised in the `try`-clause. Then, Python jumps into the next iteration and tries to obtain the next element from the `iterator`, and so on. Once the `iterator` is exhausted, it raises a `StopIteration` exception, and Python leaves the `while`-loop with the `break` statement.

In [359]:
iterator = iter(iterable)

while True:
    try:
        element = next(iterator)
    except StopIteration:
        break
    else:
        print(element, end=" ")

0 1 2 3 4 
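
To see why this formulation is the more realistic one, we can wrap the same logic in a function and pass it different kinds of *iterables*. This is only a sketch: The name `collect_like_for` is hypothetical, and the function gathers the elements in a `list` object instead of printing them. Unlike the index-based `while`-loop, it needs neither [len()](https://docs.python.org/3/library/functions.html#len) nor the indexing operator.

```python
def collect_like_for(iterable):
    """Collect an iterable's elements the way a for-loop obtains them."""
    collected = []
    iterator = iter(iterable)
    while True:
        try:
            element = next(iterator)
        except StopIteration:
            break
        else:
            collected.append(element)
    return collected


print(collect_like_for([0, 1, 2, 3, 4]))    # [0, 1, 2, 3, 4]
print(collect_like_for("abc"))              # ['a', 'b', 'c']
print(collect_like_for(zip([1, 2], "xy")))  # [(1, 'x'), (2, 'y')]

# Indexing, as in the earlier while-loop, is not available for a zip object.
try:
    zip([1, 2], "xy")[0]
except TypeError as error:
    print("TypeError:", error)
```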

Understanding *iterators* and *iterables* is helpful for any data science practitioner who deals with large amounts of data. Even without that, these two terms occur everywhere in Python-related texts and documentation.

## TL;DR

**Sequences** are an abstract idea that summarizes *four* behaviors an object may or may not exhibit: We describe them as **finite** and **ordered** **containers** that we may **loop over**. Examples are objects of the types `list` and `tuple`, but also `str`. Objects that exhibit all behaviors *except* being ordered are referred to as **collections**. The objects inside a sequence are called its **elements**, and each of them is labeled with a unique *index*, an `int` object in the range $0 \leq \text{index} < \lvert \text{sequence} \rvert$.

`list` objects are **mutable**. That means we can change the references to other objects a `list` object contains, and, in particular, re-assign them. On the contrary, `tuple` objects are like **immutable** lists: We can use a `tuple` object in place of any `list` object as long as we do *not* mutate it.

Often, the work we do with sequences follows the **map-filter-reduce paradigm**: We apply the same transformation to all elements, filter some of them out, and calculate summary statistics from the remaining ones.

An essential idea in this chapter is that, in many situations, we need *not* work with all the data **materialized** in memory. Instead, **iterators** allow us to process sequential data on a one-by-one basis.
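
As a closing sketch with made-up data, the two previous ideas can be combined. The helper names `square` and `is_even` are hypothetical. The [map()](https://docs.python.org/3/library/functions.html#map) and [filter()](https://docs.python.org/3/library/functions.html#filter) built-ins return *iterators*, so no intermediate `list` object is materialized before [sum()](https://docs.python.org/3/library/functions.html#sum) reduces the elements one by one.

```python
def square(x):
    """Transform an element (the "map" step)."""
    return x ** 2


def is_even(x):
    """Decide whether to keep an element (the "filter" step)."""
    return x % 2 == 0


data = [3, 8, 1, 6, 4]  # hypothetical example data

squared = map(square, data)     # an iterator; nothing is computed yet
even = filter(is_even, squared) # another iterator wrapping the first one
total = sum(even)               # the "reduce" step drives the iteration

print(total)  # 116
```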