intro-to-python/09_mappings/00_content.ipynb

3774 lines
97 KiB
Text
Raw Permalink Normal View History

{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"**Note**: Click on \"*Kernel*\" > \"*Restart Kernel and Clear All Outputs*\" in [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/) *before* reading this notebook to reset its output. If you cannot run this file on your machine, you may want to open it [in the cloud <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_mb.png\">](https://mybinder.org/v2/gh/webartifex/intro-to-python/main?urlpath=lab/tree/09_mappings/00_content.ipynb)."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Chapter 9: Mappings & Sets"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"While [Chapter 7 <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_nb.png\">](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/main/07_sequences/00_content.ipynb) focuses on one special kind of *collection* types, namely *sequences*, this chapter introduces two more kinds: **Mappings** and **sets**. Both are presented in this chapter as they share the *same* underlying implementation.\n",
"\n",
"The `dict` type (cf, [documentation <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#dict)) introduced in the next section is an essential part in a data scientist's toolbox for two reasons: First, Python employs `dict` objects basically everywhere internally. Second, after the many concepts involving *sequential* data, *mappings* provide a different perspective on data and enhance our general problem solving skills."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## The `dict` Type"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"A *mapping* is a one-to-one correspondence from a set of **keys** to a set of **values**. In other words, a *mapping* is a *collection* of **key-value pairs**, also called **items** for short.\n",
"\n",
"In the context of mappings, the term *value* has a meaning different from the *value* every object has: In the \"bag\" analogy from [Chapter 1 <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_nb.png\">](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/main/01_elements/00_content.ipynb#Value-/-%28Semantic%29-\"Meaning\"), we describe an object's value to be the semantic meaning of the $0$s and $1$s it contains. Here, the terms *key* and *value* mean the *role* an object takes within a mapping. Both, *keys* and *values*, are *objects* on their own with distinct *values*.\n",
"\n",
"Let's continue with an example. To create a `dict` object, we commonly use the literal notation, `{..: .., ..: .., ...}`, and list all the items. `to_words` below maps the `int` objects `0`, `1`, and `2` to their English word equivalents, `\"zero\"`, `\"one\"`, and `\"two\"`, and `from_words` does the opposite. A stylistic side note: Pythonistas often expand `dict` or `list` definitions by writing each item or element on a line on their own. Also, the commas `,` after the respective *last* items, `2: \"two\"` and `\"two\": 2`, are *not* a mistake although they *may* be left out. Besides easier reading, such a style has technical advantages that we do not go into detail about here (cf., [source <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://www.python.org/dev/peps/pep-0008/#when-to-use-trailing-commas))."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"to_words = {\n",
" 0: \"zero\",\n",
" 1: \"one\",\n",
" 2: \"two\",\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from_words = {\n",
" \"zero\": 0,\n",
" \"one\": 1,\n",
" \"two\": 2,\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"As before, `dict` objects are objects on their own: They have an identity, a type, and a value."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/plain": [
"139936685526208"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"id(to_words)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/plain": [
"dict"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(to_words)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{0: 'zero', 1: 'one', 2: 'two'}"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"to_words"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/plain": [
"139936686018688"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"id(from_words)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/plain": [
"dict"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(from_words)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{'zero': 0, 'one': 1, 'two': 2}"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from_words"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"The built-in [dict() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/functions.html#func-dict) constructor gives us an alternative way to create a `dict` object. It is versatile and can be used in different ways.\n",
"\n",
"First, we may pass it any *mapping* type, for example, a `dict` object, to obtain a *new* `dict` object. That is the easiest way to obtain a *shallow* copy of a `dict` object or convert any other mapping object into a `dict` one."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{'zero': 0, 'one': 1, 'two': 2}"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dict(from_words)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Second, we may pass it a *finite* `iterable` providing *iterables* with *two* elements each. So, both of the following two code cells work: A `list` of `tuple` objects, or a `tuple` of `list` objects. More importantly, we could use an *iterator*, for example, a `generator` object, that produces the inner iterables \"on the fly.\""
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{'zero': 0, 'one': 1, 'two': 2}"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dict([(\"zero\", 0), (\"one\", 1), (\"two\", 2)])"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{'zero': 0, 'one': 1, 'two': 2}"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dict(([\"zero\", 0], [\"one\", 1], [\"two\", 2]))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Lastly, [dict() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/functions.html#func-dict) may also be called with *keyword* arguments: The keywords become the keys and the arguments the values."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{'zero': 0, 'one': 1, 'two': 2}"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dict(zero=0, one=1, two=2)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Nested Data"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Often, `dict` objects occur in a nested form and combined with other collection types, such as `list` or `tuple` objects, to model more complex entities \"from the real world.\"\n",
"\n",
"The reason for this popularity is that many modern [ReST APIs <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_wiki.png\">](https://en.wikipedia.org/wiki/Representational_state_transfer#Applied_to_Web_services) on the internet (e.g., [Google Maps API](https://cloud.google.com/maps-platform/), [Yelp API](https://www.yelp.com/developers/documentation/v3), [Twilio API](https://www.twilio.com/docs/usage/api)) provide their data in the popular [JSON <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_wiki.png\">](https://en.wikipedia.org/wiki/JSON) format, which looks almost like a combination of `dict` and `list` objects in Python. \n",
"\n",
"The `people` example below models three groups of people: `\"mathematicians\"`, `\"physicists\"`, and `\"programmers\"`. Each person may have an arbitrary number of email addresses. In the example, [Leonhard Euler <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_wiki.png\">](https://en.wikipedia.org/wiki/Leonhard_Euler) has not lived long enough to get one whereas [Guido <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_wiki.png\">](https://en.wikipedia.org/wiki/Guido_van_Rossum) has more than one.\n",
"\n",
"`people` makes many implicit assumptions about the structure of the data. For example, there are a [one-to-many <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_wiki.png\">](https://en.wikipedia.org/wiki/One-to-many_%28data_model%29) relationship between a person and their email addresses and a [one-to-one <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_wiki.png\">](https://en.wikipedia.org/wiki/One-to-one_%28data_model%29) relationship between each person and their name."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"people = {\n",
" \"mathematicians\": [\n",
" {\n",
" \"name\": \"Gilbert Strang\",\n",
" \"emails\": [\"gilbert@mit.edu\"],\n",
" },\n",
" {\n",
" \"name\": \"Leonhard Euler\",\n",
" \"emails\": [],\n",
" },\n",
" ],\n",
" \"physicists\": [],\n",
" \"programmers\": [\n",
" {\n",
" \"name\": \"Guido\",\n",
" \"emails\": [\"guido@python.org\", \"guido@dropbox.com\"],\n",
" },\n",
" ],\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"The literal notation of such a nested `dict` object may be hard to read ..."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{'mathematicians': [{'name': 'Gilbert Strang', 'emails': ['gilbert@mit.edu']},\n",
" {'name': 'Leonhard Euler', 'emails': []}],\n",
" 'physicists': [],\n",
" 'programmers': [{'name': 'Guido',\n",
" 'emails': ['guido@python.org', 'guido@dropbox.com']}]}"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"people"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"... but the [pprint <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/pprint.html) module in the [standard library <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/index.html) provides a [pprint() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/pprint.html#pprint.pprint) function for \"pretty printing.\""
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"from pprint import pprint"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'mathematicians': [{'emails': ['gilbert@mit.edu'],\n",
" 'name': 'Gilbert Strang'},\n",
" {'emails': [],\n",
" 'name': 'Leonhard Euler'}],\n",
" 'physicists': [],\n",
" 'programmers': [{'emails': ['guido@python.org',\n",
" 'guido@dropbox.com'],\n",
" 'name': 'Guido'}]}\n"
]
}
],
"source": [
"pprint(people, indent=1, width=60)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Hash Tables & (Key) Hashability"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"In [Chapter 0 <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_nb.png\">](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/main/00_intro/00_content.ipynb#Isn't-C-a-lot-faster?), we argue that a major advantage of using Python is that it takes care of the memory managment for us. In line with that, we have never talked about the C level implementation thus far in the book. However, the `dict` type, among others, exhibits some behaviors that may seem \"weird\" for a beginner. To build some intuition, we describe the underlying implementation details on a conceptual level.\n",
"\n",
"The first unintuitive behavior is that we may *not* use a *mutable* object as a key. That results in a `TypeError`."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"ename": "TypeError",
"evalue": "unhashable type: 'list'",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[17], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m {\n\u001b[1;32m 2\u001b[0m [\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mzero\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mone\u001b[39m\u001b[38;5;124m\"\u001b[39m]: [\u001b[38;5;241m0\u001b[39m, \u001b[38;5;241m1\u001b[39m],\n\u001b[1;32m 3\u001b[0m }\n",
"\u001b[0;31mTypeError\u001b[0m: unhashable type: 'list'"
]
}
],
"source": [
"{\n",
" [\"zero\", \"one\"]: [0, 1],\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Similarly surprising is that items with the *same* key get merged together. The resulting `dict` object keeps the position of the *first* mention of the `\"zero\"` key while only the *last* mention of the corresponding values, `999`, survives."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{'zero': 999, 'one': 1, 'two': 2}"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"{\n",
" \"zero\": 0,\n",
" \"one\": 1,\n",
" \"two\": 2,\n",
" \"zero\": 999, # to illustrate a point\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"The reason for that is that the `dict` type is implemented with so-called [hash tables <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_wiki.png\">](https://en.wikipedia.org/wiki/Hash_table).\n",
"\n",
"Conceptually, when we create a *new* `dict` object, Python creates a \"bag\" in memory that takes significantly more space than needed to store the references to all the key and value objects. This bag is a **contiguous array** similar to the `list` type's implementation. Whereas in the `list` case the array is divided into equally sized *slots* capable of holding *one* reference, a `dict` object's array is divided into equally sized **buckets** with enough space to store *two* references each: One for an item's key and one for the mapped value. The buckets are labeled with *index* numbers. Because Python knows how wide each bucket, it can jump directly into *any* bucket by calculating its *offset* from the start.\n",
"\n",
"The figure below visualizes how we should think of hash tables. An empty `dict` object, created with the literal `{}`, still takes a lot of memory: It is essentially one big, contiguous, and empty table."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"| Bucket | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |\n",
"| :---: |:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|\n",
"| **Key** |*...*|*...*|*...*|*...*|*...*|*...*|*...*|*...*|\n",
"|**Value**|*...*|*...*|*...*|*...*|*...*|*...*|*...*|*...*|"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"To insert a key-value pair, the key must be translated into a bucket's index.\n",
"\n",
"As the first step to do so, the built-in [hash() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/functions.html#hash) function maps any **hashable** object to its **hash value**, a long and \"random\" `int` number, similar to the ones returned by the built-in [id() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/functions.html#id) function. This hash value is a *summary* of all the $0$s and $1$s inside the object.\n",
"\n",
"According to the official [glossary <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/glossary.html#term-hashable), an object is hashable *only if* \"it has a hash value which *never* changes during its *lifetime*.\" So, hashability implies immutability! Without this formal requirement an object may end up in *different* buckets depending on its current value. As the name of the `dict` type (i.e., \"dictionary\") suggests, a primary purpose of it is to insert objects and look them up later on. Without a *unique* bucket, this is of course not doable. The exact logic behind [hash() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/functions.html#hash) is beyond the scope of this book.\n",
"\n",
"Let's calculate the hash value of `\"zero\"`, an immutable `str` object. Hash values have *no* semantic meaning. Also, every time we re-start Python, we see *different* hash values for the *same* objects. That is a security measure, and we do not go into the technicalities here (cf. [source <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/using/cmdline.html#envvar-PYTHONHASHSEED))."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"-85344695604937002"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hash(\"zero\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"For numeric objects, we can sometimes predict the hash values. However, we must *never* interpret any meaning into them."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hash(0)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hash(0.0)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"The [glossary <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/glossary.html#term-hashable) states a second requirement for hashability, namely that \"objects which *compare equal* must have the *same* hash value.\" The purpose of this is to ensure that if we put, for example, `1` as a key in a `dict` object, we can look it up later with `1.0`. In other words, we can look up keys by their object's semantic value. The converse statement does *not* hold: Two objects *may* (accidentally) have the *same* hash value and *not* compare equal. However, that rarely happens."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"1 == 1.0"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hash(1) == hash(1.0)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Because `list` objects are not immutable, they are *never* hashable, as indicated by the `TypeError`."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"ename": "TypeError",
"evalue": "unhashable type: 'list'",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[24], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[38;5;28;43mhash\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mzero\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mone\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m)\u001b[49m\n",
"\u001b[0;31mTypeError\u001b[0m: unhashable type: 'list'"
]
}
],
"source": [
"hash([\"zero\", \"one\"])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"If we need keys composed of several objects, we can use `tuple` objects instead."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"-1616807732336770172"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hash((\"zero\", \"one\"))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"There is no such restiction on objects inserted into `dict` objects as *values*."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{('zero', 'one'): [0, 1]}"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"{\n",
" (\"zero\", \"one\"): [0, 1],\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"After obtaining the key object's hash value, Python must still convert that into a bucket index. We do not cover this step in technical detail but provide a conceptual description of it.\n",
"\n",
"The `buckets()` function below shows how we can obtain indexes from the binary representation of a hash value by simply extracting its least significant `bits` and interpreting them as index numbers. Alternatively, the hash value may also be divided with the `%` operator by the number of available buckets. We show this idea in the `buckets_alt()` function that takes the number of buckets, `n_buckets`, as its second argument."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"def buckets(mapping, *, bits):\n",
" \"\"\"Calculate the bucket indices for a mapping's keys.\"\"\"\n",
" for key in mapping: # cf., next section for details on looping\n",
" hash_value = hash(key)\n",
" binary = bin(hash_value)\n",
" address = binary[-bits:]\n",
" bucket = int(\"0b\" + address, base=2)\n",
" print(key, hash_value, \"0b...\" + binary[-8:], address, bucket, sep=\"\\t\")"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"def buckets_alt(mapping, *, n_buckets):\n",
" \"\"\"Calculate the bucket indices for a mapping's keys.\"\"\"\n",
" for key in mapping: # cf., next section for details on looping\n",
" hash_value = hash(key)\n",
" bucket = hash_value % n_buckets\n",
" print(key, hash_value, bucket, sep=\"\\t\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"With an infinite number of possible keys being mapped to a limited number of buckets, there is a realistic chance that two or more keys end up in the *same* bucket. That is called a **hash collision**. In such cases, Python uses a perturbation rule to rearrange the bits, and if the corresponding next bucket is empty, places an item there. Then, the nice offsetting logic from above breaks down and Python needs more time on average to place items into a hash table or look them up. The remedy is to use a bigger hash table as then the chance of collisions decreases. Python does all that for us in the background, and the main cost we pay for that is a *high* memory usage of `dict` objects in general.\n",
"\n",
"Because keys with the *same* semantic value have the *same* hash value, they end up in the *same* bucket. That is why the item that gets inserted last *overwrites* all previously inserted items whose keys compare equal, as we saw with the two `\"zero\"` keys above.\n",
"\n",
"Thus, to come up with indexes for 4 buckets, we need to extract 2 bits from the hash value (i.e., $2^2 = 4$)."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"zero\t-85344695604937002\t0b...00101010\t10\t2\n",
"one\t6414592332130781825\t0b...10000001\t01\t1\n",
"two\t4316247523642253857\t0b...00100001\t01\t1\n"
]
}
],
"source": [
"buckets(from_words, bits=2)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"zero\t-85344695604937002\t2\n",
"one\t6414592332130781825\t1\n",
"two\t4316247523642253857\t1\n"
]
}
],
"source": [
"buckets_alt(from_words, n_buckets=4)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Similarly, 3 bits provide indexes for 8 buckets (i.e., $2^3 = 8$) ..."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"zero\t-85344695604937002\t0b...00101010\t010\t2\n",
"one\t6414592332130781825\t0b...10000001\t001\t1\n",
"two\t4316247523642253857\t0b...00100001\t001\t1\n"
]
}
],
"source": [
"buckets(from_words, bits=3)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"zero\t-85344695604937002\t6\n",
"one\t6414592332130781825\t1\n",
"two\t4316247523642253857\t1\n"
]
}
],
"source": [
"buckets_alt(from_words, n_buckets=8)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"... while 4 bits do so for 16 buckets (i.e., $2^4 = 16$)."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"zero\t-85344695604937002\t0b...00101010\t1010\t10\n",
"one\t6414592332130781825\t0b...10000001\t0001\t1\n",
"two\t4316247523642253857\t0b...00100001\t0001\t1\n"
]
}
],
"source": [
"buckets(from_words, bits=4)"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"zero\t-85344695604937002\t6\n",
"one\t6414592332130781825\t1\n",
"two\t4316247523642253857\t1\n"
]
}
],
"source": [
"buckets_alt(from_words, n_buckets=16)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Python allocates the memory for a `dict` object's hash table according to some internal heuristics: Whenever a hash table is roughly 2/3 full, it creates a *new* one with twice the space, and re-inserts all items, one by one, from the *old* one. So, during its lifetime, a `dict` object may have several hash tables.\n",
"\n",
"Although hash tables seem quite complex at first sight, they help us to make certain operations very fast as we see further below."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Mappings are Collections of Key-Value Pairs"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"In [Chapter 7 <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_nb.png\">](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/main/07_sequences/00_content.ipynb#Collections-vs.-Sequences), we see how a *sequence* is a special kind of a *collection*, and that collections can be described as\n",
"- *iterable*\n",
"- *containers*\n",
"- with a *finite* number of elements.\n",
"\n",
"The `dict` type is another *collection* type and has these three properties as well.\n",
"\n",
"For example, we may pass `to_words` or `from_words` to the built-in [len() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/functions.html#len) function to obtain the number of *items* they contain. In the terminology of the [collections.abc <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/collections.abc.html) module in the [standard library <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/index.html), both are `Sized` objects."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(to_words)"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(from_words)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Also, `dict` objects may be looped over, for example, with the `for` statement. So, in the terminology of the [collections.abc <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/collections.abc.html) module, they are `Iterable` objects.\n",
"\n",
"Regarding the *iteration order* things are not that easy, and programmers seem to often be confused about this (e.g., this [discussion](https://stackoverflow.com/questions/58413076/why-are-python-dictionaries-not-reversible-for-python3-7)). The confusion usually comes from one of two reasons:\n",
"1. The internal implementation of the `dict` type has been changed over the last couple of minor release versions, and the communication thereof in the official release notes was done only in a later version. In a nutshell, before Python 3.6, the core developers did not care about the iteration order at all as the goal was to optimize `dict` objects for computational speed, primarily regarding key look-up (cf., the \"Indexing -> Key Look-up\" section below). That meant that looping over the *same* `dict` object several times during its lifetime could have resulted in *different* iteration orders. In Python 3.6, it was discovered that it is possible to make `dict` objects remember the order in that items have been inserted *without* giving up any computational speed or memory (cf., [Raymond Hettinger](https://github.com/rhettinger)'s talk in the [Further Resources <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_nb.png\">](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/main/09_mappings/08_resources.ipynb#History-of-the-dict-Type) section at the end of the chapter. However, that change was kept an *implementation detail* and *not* made official in the release notes. That was then done in Python 3.7's release notes (cf., [Python 3.7 release notes <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://www.python.org/downloads/release/python-370/)).\n",
"2. To make order an official part of a data type, it must adhere to the `Reversible` ABC in the [collections.abc <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/collections.abc.html) module and support the [reversed() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/functions.html#reversed) built-in. Even though the items' order inside a `dict` is remembered for Python 3.6 and 3.7, `dict` objects are *not* `Reversible`. That was then changed in Python 3.8, but again *not* officially communicated (cf., [Python 3.8 release notes](https://www.python.org/downloads/release/python-380/)).\n",
"\n",
"In summary, we can say that depending on the exact Python version a `dict` object *may* remember the **insertion order** of its items.\n",
"\n",
"However, that order is only apparent to us (i.e., we could look it up) if we put the data stored in a `dict` object into the *source code* itself. Then, we say that we \"hard code\" the data in our program. That is often *not* useful as we want our software load the data to be processed, for example, from a file or a database.\n",
"\n",
"Therefore, we suggest and adopt the following *best practices* in this book:\n",
"- We *assume* that the items in a `dict` object are *not* in a **predictable order** and *never* make the correctness of the logic in our code dependent on it.\n",
"- Whenever we want or need to model data in `dict`-like objects with an *explicit* order, we use the [OrderedDict <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/collections.html#collections.OrderedDict) type in the [collections <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/collections.html) module in the [standard library <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/index.html).\n",
"\n",
"If you installed Python, as recommended, via the Anaconda Distribution, the order in the two `for`-loops below is the *same* as in the *source code* that defines `to_words` and `from_words` above. In that sense, it is *predictable*."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Python 3.12.2\n"
]
}
],
"source": [
"!python --version # the order in the for-loops is predictable only for Python 3.7 or higher"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"By convention, iteration goes over the *keys* in the `dict` object only. The \"*Dictionary Methods*\" section below shows how to loop over the *items* or the *values* instead."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0\n",
"1\n",
"2\n"
]
}
],
"source": [
"for number in to_words:\n",
" print(number)"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"zero\n",
"one\n",
"two\n"
]
}
],
"source": [
"for word in from_words:\n",
" print(word)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"For Python 3.8, `dict` objects are `Reversible` as well. So, passing a `dict` object to the [reversed() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/functions.html#reversed) built-in works. However, for ealier Python versions, the two next cells raise a `TypeError`."
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2\n",
"1\n",
"0\n"
]
}
],
"source": [
"for number in reversed(to_words):\n",
" print(number)"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"two\n",
"one\n",
"zero\n"
]
}
],
"source": [
"for word in reversed(from_words):\n",
" print(word)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Of course, we may always use the built-in [sorted() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/functions.html#sorted) function to loop over, for example, `from_words` in a *predictable* order. However, that creates a temporary `list` object in memory and an order that has *nothing* to do with how the items are ordered inside the `dict` object."
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"one\n",
"two\n",
"zero\n"
]
}
],
"source": [
"for word in sorted(from_words):\n",
" print(word)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"To show the `Container` behavior of *collection* types, we use the boolean `in` operator to check if a given object evaluates equal to a *key* in `to_words` or `from_words`."
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"1.0 in to_words # 1.0 is not a key but compares equal to a key"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"-1 in to_words"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\"one\" in from_words"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\"ten\" in from_words"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Membership Testing: `list` vs. `dict`"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Because of the [hash table <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_wiki.png\">](https://en.wikipedia.org/wiki/Hash_table) implementation, the `in` operator is *extremely* fast: Python does *not* need to initiate a [linear search <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_wiki.png\">](https://en.wikipedia.org/wiki/Linear_search) as in the `list` case but immediately knows the only places in memory where the searched object must be located if present in the hash table at all. Then, the Python interpreter jumps right there in only *one* step. Because that is true no matter how many items are in the hash table, we call that a **constant time** operation.\n",
"\n",
"Conceptually, the overall behavior of the `in` operator is like comparing the searched object against *all* key objects with the `==` operator *without* doing it.\n",
"\n",
"To show the speed, we run an experiment. We create a `haystack`, a `list` object, with `10_000_001` elements in it, *one* of which is the `needle`, namely `42`. Once again, the [randint() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/random.html#random.randint) function in the [random <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/random.html) module is helpful."
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"import random"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"random.seed(87)"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"needle = 42"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"haystack = [random.randint(99, 9999) for _ in range(10_000_000)]\n",
"haystack.append(needle)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"We put the elements in `haystack` in a *random* order with the [shuffle() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/random.html#random.shuffle) function in the [random <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/random.html) module."
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"random.shuffle(haystack)"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"[8126, 7370, 3735, 213, 7922, 1434, 8557, 9609, 9704, 9564]"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"haystack[:10]"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/plain": [
"[7237, 886, 5945, 4014, 4998, 2055, 3531, 6919, 7875, 1944]"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"haystack[-10:]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"As modern computers are generally fast, we search the `haystack` a total of `10` times."
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"4.44 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n"
]
}
],
"source": [
"%%timeit -n 1 -r 1\n",
"for _ in range(10):\n",
" needle in haystack"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Now, we convert the elements of the `haystack` into the keys of a `magic_haystack`, a `dict` object. We use `None` as a dummy value for all items."
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"magic_haystack = dict((x, None) for x in haystack)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"To show the *massive* effect of the hash table implementation, we search the `magic_haystack` not `10` but `10_000_000` times. The code cell still runs in only a fraction of the time its counterpart does above."
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"560 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n"
]
}
],
"source": [
"%%timeit -n 1 -r 1\n",
"for _ in range(10_000_000):\n",
" needle in magic_haystack"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"However, there is no fast way to look up the values the keys are mapped to. To achieve that, we have to loop over *all* items and check for each value object if it compares equal to the searched object. That is, by definition, a linear search, as well, and rather slow. In the context of `dict` objects, we call that a **reverse look-up**."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Indexing -> Key Look-up"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"The same efficient key look-up executed in the background with the `in` operator is also behind the indexing operator `[]`. Instead of returning either `True` or `False`, it returns the value object the looked up key maps to.\n",
"\n",
"To show the similarity to indexing into `list` objects, we provide another example with `to_words_list`."
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"to_words_list = [\"zero\", \"one\", \"two\"]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Without the above definitions, we could not tell the difference between `to_words` and `to_words_list`: The usage of the `[]` is the same."
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'zero'"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"to_words[0]"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'zero'"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"to_words_list[0]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Because key objects can be of any immutable type and are, in particular, not constrained to just the `int` type, the word \"*indexing*\" is an understatement here. Therefore, in the context of `dict` objects, we view the `[]` operator as a generalization of the indexing operator and refer to it as the **(key) look-up** operator. "
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"2"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from_words[\"two\"]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"If a key is not in a `dict` object, Python raises a `KeyError`. A sequence type would raise an `IndexError` in this situation."
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"ename": "KeyError",
"evalue": "'drei'",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[61], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mfrom_words\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mdrei\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m]\u001b[49m\n",
"\u001b[0;31mKeyError\u001b[0m: 'drei'"
]
}
],
"source": [
"from_words[\"drei\"]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"While `dict` objects support the `[]` operator to look up a *single* key, the more general concept of *slicing* is *not* available. That is in line with the idea that there is *no* predictable *order* associated with a `dict` object's keys, and slicing requires an order.\n",
"\n",
"To access deeper levels in nested data, like `people`, we *chain* the look-up operator `[]`. For example, let's view all the `\"mathematicians\"` in `people`."
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"[{'name': 'Gilbert Strang', 'emails': ['gilbert@mit.edu']},\n",
" {'name': 'Leonhard Euler', 'emails': []}]"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"people[\"mathematicians\"]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Let's take the first mathematician on the list, ..."
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{'name': 'Gilbert Strang', 'emails': ['gilbert@mit.edu']}"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"people[\"mathematicians\"][0]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"... and output his `\"name\"` ..."
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'Gilbert Strang'"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"people[\"mathematicians\"][0][\"name\"]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"... or his `\"emails\"`."
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['gilbert@mit.edu']"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"people[\"mathematicians\"][0][\"emails\"]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Mutability"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"We may mutate `dict` objects *in place*.\n",
"\n",
"For example, let's translate the English words in `to_words` to their German counterparts. Behind the scenes, Python determines the bucket of the objects passed to the `[]` operator, looks them up in the hash table, and, if present, *updates* the references to the mapped value objects."
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{0: 'zero', 1: 'one', 2: 'two'}"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"to_words"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"to_words[0] = \"null\"\n",
"to_words[1] = \"eins\"\n",
"to_words[2] = \"zwei\""
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{0: 'null', 1: 'eins', 2: 'zwei'}"
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"to_words"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Let's add two more items. Again, Python determines their buckets, but this time finds them to be empty, and *inserts* the references to their key and value objects."
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"to_words[3] = \"drei\"\n",
"to_words[4] = \"vier\""
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{0: 'null', 1: 'eins', 2: 'zwei', 3: 'drei', 4: 'vier'}"
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"to_words"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"None of these operations change the identity of the `to_words` object."
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/plain": [
"139936685526208"
]
},
"execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"id(to_words) # same memory location as before"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"The `del` statement removes individual items. Python just removes the *two* references to the key and value objects in the corresponding bucket."
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"del to_words[0]"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{1: 'eins', 2: 'zwei', 3: 'drei', 4: 'vier'}"
]
},
"execution_count": 73,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"to_words"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"We may also change parts of nested data, such as `people`.\n",
"\n",
"For example, let's add [Albert Einstein <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_wiki.png\">](https://en.wikipedia.org/wiki/Albert_Einstein) to the list of `\"physicists\"`, ..."
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"[]"
]
},
"execution_count": 74,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"people[\"physicists\"]"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"people[\"physicists\"].append({\"name\": \"Albert Einstein\"})"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"... complete Guido's `\"name\"`, ..."
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{'name': 'Guido', 'emails': ['guido@python.org', 'guido@dropbox.com']}"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"people[\"programmers\"][0]"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"people[\"programmers\"][0][\"name\"] = \"Guido van Rossum\""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"... and remove his work email because he retired."
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"del people[\"programmers\"][0][\"emails\"][1]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Now, `people` looks like this."
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'mathematicians': [{'emails': ['gilbert@mit.edu'],\n",
" 'name': 'Gilbert Strang'},\n",
" {'emails': [],\n",
" 'name': 'Leonhard Euler'}],\n",
" 'physicists': [{'name': 'Albert Einstein'}],\n",
" 'programmers': [{'emails': ['guido@python.org'],\n",
" 'name': 'Guido van Rossum'}]}\n"
]
}
],
"source": [
"pprint(people, indent=1, width=60)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## `dict` Methods"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"`dict` objects come with many methods bound on them (cf., [documentation <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#dict)), many of which are standardized by the `Mapping` and `MutableMapping` ABCs from the [collections.abc <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/collections.abc.html) module. While the former requires the [.keys() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#dict.keys), [.values() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#dict.values), [.items() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#dict.items), and [.get() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#dict.get) methods, which *never* mutate an object, the latter formalizes the [.update() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#dict.update), [.pop() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#dict.pop), [.popitem() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#dict.popitem), [.clear() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#dict.clear), and [.setdefault() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#dict.setdefault) methods, which *may* do so."
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"import collections.abc as abc"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 81,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"isinstance(from_words, abc.Mapping)"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 82,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"isinstance(from_words, abc.MutableMapping)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"While iteration over a mapping type already goes over its keys, we may emphasize this explicitly by adding the [.keys() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#dict.keys) method in the `for`-loop. Again, the iteration order is equivalent to the insertion order but still considered *unpredictable*."
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"zero\n",
"one\n",
"two\n"
]
}
],
"source": [
"for word in from_words.keys():\n",
" print(word)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"[.keys() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#dict.keys) returns an object of type `dict_keys`. That is a dynamic **view** inside the `from_words`'s hash table, which means it does *not* copy the references to the keys, and changes to `from_words` can be seen through it. View objects behave much like `dict` objects themselves."
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/plain": [
"dict_keys(['zero', 'one', 'two'])"
]
},
"execution_count": 84,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from_words.keys()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Views can be materialized with the [list() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/functions.html#func-list) built-in. However, that may introduce *semantic* errors into a program as the newly created `list` object has a \"*predictable*\" order (i.e., indexes `0`, `1`, ...) created from an *unpredictable* one."
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['zero', 'one', 'two']"
]
},
"execution_count": 85,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(from_words.keys())"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"To loop over the value objects instead, we use the [.values() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#dict.values) method. That returns a *view* (i.e., type `dict_values`) on the value objects inside `from_words` without copying them."
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0\n",
"1\n",
"2\n"
]
}
],
"source": [
"for number in from_words.values():\n",
" print(number)"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/plain": [
"dict_values([0, 1, 2])"
]
},
"execution_count": 87,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from_words.values()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"To loop over key-value pairs, we invoke the [.items() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#dict.items) method. That returns a view (i.e., type `dict_items`) on the key-value pairs as `tuple` objects, where the first element is the key and the second the value. Because of that, we use tuple unpacking in the `for`-loop."
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"zero -> 0\n",
"one -> 1\n",
"two -> 2\n"
]
}
],
"source": [
"for word, number in from_words.items():\n",
" print(f\"{word} -> {number}\")"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/plain": [
"dict_items([('zero', 0), ('one', 1), ('two', 2)])"
]
},
"execution_count": 89,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from_words.items()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Above, we see how the look-up operator fails *loudly* with a `KeyError` if a key is *not* in a `dict` object. For example, `to_words` does *not* have a key `0` any more."
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{1: 'eins', 2: 'zwei', 3: 'drei', 4: 'vier'}"
]
},
"execution_count": 90,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"to_words"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"ename": "KeyError",
"evalue": "0",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[91], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mto_words\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m0\u001b[39;49m\u001b[43m]\u001b[49m\n",
"\u001b[0;31mKeyError\u001b[0m: 0"
]
}
],
"source": [
"to_words[0]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"That may be mitigated with the [.get() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#dict.get) method that takes two arguments: `key` and `default`. It returns the value object `key` maps to if it is in the `dict` object; otherwise, `default` is returned. If not provided, `default` is `None`."
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'n/a'"
]
},
"execution_count": 92,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"to_words.get(0, \"n/a\")"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'eins'"
]
},
"execution_count": 93,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"to_words.get(1, \"n/a\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"The [.update() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#dict.update) method takes the items of another mapping and either inserts them or overwrites the ones with matching keys already in the `dict` objects. It may be used in the other two ways as the [dict() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/functions.html#func-dict) constructor allows, as well."
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"to_spanish = {\n",
" 0: \"cero\",\n",
" 1: \"uno\",\n",
" 2: \"dos\",\n",
" 3: \"tres\",\n",
" 4: \"cuatro\",\n",
" 5: \"cinco\", \n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"to_words.update(to_spanish)"
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{1: 'uno', 2: 'dos', 3: 'tres', 4: 'cuatro', 0: 'cero', 5: 'cinco'}"
]
},
"execution_count": 96,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"to_words"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"In contrast to the `pop()` method of the `list` type, the [.pop() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#dict.pop) method of the `dict` type *requires* a `key` argument to be passed. Then, it removes the corresponding key-value pair *and* returns the value object. If the `key` is not in the `dict` object, a `KeyError` is raised."
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{'zero': 0, 'one': 1, 'two': 2}"
]
},
"execution_count": 97,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from_words"
]
},
{
"cell_type": "code",
"execution_count": 98,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"number = from_words.pop(\"zero\")"
]
},
{
"cell_type": "code",
"execution_count": 99,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 99,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"number"
]
},
{
"cell_type": "code",
"execution_count": 100,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{'one': 1, 'two': 2}"
]
},
"execution_count": 100,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from_words"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"With an optional `default` argument, the loud `KeyError` may be suppressed and the `default` returned instead, just as with the [.get() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#dict.get) method above."
]
},
{
"cell_type": "code",
"execution_count": 101,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"ename": "KeyError",
"evalue": "'zero'",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[101], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mfrom_words\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mpop\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mzero\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m)\u001b[49m\n",
"\u001b[0;31mKeyError\u001b[0m: 'zero'"
]
}
],
"source": [
"from_words.pop(\"zero\")"
]
},
{
"cell_type": "code",
"execution_count": 102,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 102,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from_words.pop(\"zero\", 0)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Similar to the `pop()` method of the `list` type, the [.popitem() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#dict.popitem) method of the `dict` type removes *and* returns an \"arbitrary\" key-value pair as a `tuple` object from a `dict` object. With the preservation of the insertion order in Python 3.7 and higher, this effectively becomes a \"last in, first out\" rule, just as with the `list` type. Once a `dict` object is empty, [.popitem() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#dict.popitem) raises a `KeyError`."
]
},
{
"cell_type": "code",
"execution_count": 103,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"word, number = from_words.popitem()"
]
},
{
"cell_type": "code",
"execution_count": 104,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"('two', 2)"
]
},
"execution_count": 104,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"word, number"
]
},
{
"cell_type": "code",
"execution_count": 105,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{'one': 1}"
]
},
"execution_count": 105,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from_words"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"The [.clear() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#dict.clear) method removes all items but keeps the `dict` object alive in memory."
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"to_words.clear()"
]
},
{
"cell_type": "code",
"execution_count": 107,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{}"
]
},
"execution_count": 107,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"to_words"
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"from_words.clear()"
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{}"
]
},
"execution_count": 109,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from_words"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"The [.setdefault() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#dict.setdefault) method may have a bit of an unfortunate name but is useful, in particular, with nested `list` objects. It takes two arguments, `key` and `default`, and returns the value mapped to `key` if `key` is in the `dict` object; otherwise, it inserts the `key`-`default` pair *and* returns a reference to the newly created value object. So, it is similar to the [.get() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#dict.get) method above but also *mutates* the `dict` object.\n",
"\n",
"Consider the `people` example again and note how the `dict` object modeling `\"Albert Einstein\"` has *no* `\"emails\"` key in it."
]
},
{
"cell_type": "code",
"execution_count": 110,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'mathematicians': [{'emails': ['gilbert@mit.edu'],\n",
" 'name': 'Gilbert Strang'},\n",
" {'emails': [],\n",
" 'name': 'Leonhard Euler'}],\n",
" 'physicists': [{'name': 'Albert Einstein'}],\n",
" 'programmers': [{'emails': ['guido@python.org'],\n",
" 'name': 'Guido van Rossum'}]}\n"
]
}
],
"source": [
"pprint(people, indent=1, width=60)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Let's say we want to append the imaginary emails `\"leonhard@math.org\"` and `\"albert@physics.org\"`. We cannot be sure if a `dict` object modeling a person has already a `\"emails\"` key or not. To play it safe, we could first use the `in` operator to check for that and create a new `list` object in a second step if one is missing. Then, we would finally append the new email.\n",
"\n",
"[.setdefault() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#dict.setdefault) allows us to do all of the three steps at once. More importantly, behind the scenes Python only needs to make *one* key look-up instead of potentially three. For large nested data, that could speed up the computations significantly.\n",
"\n",
"So, the first code cell below adds the email to the already existing empty `list` object, while the second one creates a new one first."
]
},
{
"cell_type": "code",
"execution_count": 111,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"people[\"mathematicians\"][1].setdefault(\"emails\", []).append(\"leonhard@math.org\")"
]
},
{
"cell_type": "code",
"execution_count": 112,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"people[\"physicists\"][0].setdefault(\"emails\", []).append(\"albert@physics.org\")"
]
},
{
"cell_type": "code",
"execution_count": 113,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'mathematicians': [{'emails': ['gilbert@mit.edu'],\n",
" 'name': 'Gilbert Strang'},\n",
" {'emails': ['leonhard@math.org'],\n",
" 'name': 'Leonhard Euler'}],\n",
" 'physicists': [{'emails': ['albert@physics.org'],\n",
" 'name': 'Albert Einstein'}],\n",
" 'programmers': [{'emails': ['guido@python.org'],\n",
" 'name': 'Guido van Rossum'}]}\n"
]
}
],
"source": [
"pprint(people, indent=1, width=60)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"`dict` objects also come with a [copy() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#dict.copy) method on them that creates *shallow* copies."
]
},
{
"cell_type": "code",
"execution_count": 114,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"guido = people[\"programmers\"][0].copy()"
]
},
{
"cell_type": "code",
"execution_count": 115,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{'name': 'Guido van Rossum', 'emails': ['guido@python.org']}"
]
},
"execution_count": 115,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"guido"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"If we mutate `guido` and, for example, remove all his emails with the `.clear()` method on the `list` type, these changes are also visible through `people`."
]
},
{
"cell_type": "code",
"execution_count": 116,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"guido[\"emails\"].clear()"
]
},
{
"cell_type": "code",
"execution_count": 117,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{'name': 'Guido van Rossum', 'emails': []}"
]
},
"execution_count": 117,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"guido"
]
},
{
"cell_type": "code",
"execution_count": 118,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'mathematicians': [{'emails': ['gilbert@mit.edu'],\n",
" 'name': 'Gilbert Strang'},\n",
" {'emails': ['leonhard@math.org'],\n",
" 'name': 'Leonhard Euler'}],\n",
" 'physicists': [{'emails': ['albert@physics.org'],\n",
" 'name': 'Albert Einstein'}],\n",
" 'programmers': [{'emails': [],\n",
" 'name': 'Guido van Rossum'}]}\n"
]
}
],
"source": [
"pprint(people, indent=1, width=60)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## `dict` Comprehensions"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Analogous to `list` comprehensions in [Chapter 8 <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_nb.png\">](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/main/08_mfr/01_content.ipynb#list-Comprehensions), `dict` comprehensions are a concise literal notation to derive new `dict` objects out of existing ones.\n",
"\n",
"For example, let's derive `from_words` out of `to_words` below by swapping the keys and values."
]
},
{
"cell_type": "code",
"execution_count": 119,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"to_words = {\n",
" 0: \"zero\",\n",
" 1: \"one\",\n",
" 2: \"two\",\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Without a dictionary comprehension, we would have to initialize an empty `dict` object, loop over the items of the original one, and insert the key-value pairs one by one in a reversed fashion as value-key pairs. That assumes that the values are unique as otherwise some would be merged."
]
},
{
"cell_type": "code",
"execution_count": 120,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{'zero': 0, 'one': 1, 'two': 2}"
]
},
"execution_count": 120,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from_words = {}\n",
"\n",
"for number, word in to_words.items():\n",
" from_words[word] = number\n",
"\n",
"from_words"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"While that code is correct, it is also unnecessarily verbose. The dictionary comprehension below works in the same way as list comprehensions except that curly braces `{}` replace the brackets `[]` and a colon `:` is added to separate the keys from the values."
]
},
{
"cell_type": "code",
"execution_count": 121,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{'zero': 0, 'one': 1, 'two': 2}"
]
},
"execution_count": 121,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"{v: k for k, v in to_words.items()}"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"We may filter out items with an `if`-clause and transform the remaining key and value objects.\n",
"\n",
"For no good reason, let's filter out all words starting with a `\"z\"` and upper case the remainin words."
]
},
{
"cell_type": "code",
"execution_count": 122,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{'ONE': 1, 'TWO': 2}"
]
},
"execution_count": 122,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"{v.upper(): k for k, v in to_words.items() if not v.startswith(\"z\")}"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Multiple `for`- and/or `if`-clauses are allowed.\n",
"\n",
"For example, let's find all pairs of two numbers from `1` through `10` whose product is \"close\" to `50` (e.g., within a delta of `5`)."
]
},
{
"cell_type": "code",
"execution_count": 123,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{(5, 9): 45,\n",
" (5, 10): 50,\n",
" (6, 8): 48,\n",
" (6, 9): 54,\n",
" (7, 7): 49,\n",
" (8, 6): 48,\n",
" (9, 5): 45,\n",
" (9, 6): 54,\n",
" (10, 5): 50}"
]
},
"execution_count": 123,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"{\n",
" (x, y): x * y\n",
" for x in range(1, 11) for y in range(1, 11)\n",
" if abs(x * y - 50) <= 5\n",
"}"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.2"
},
"livereveal": {
"auto_select": "code",
"auto_select_fragment": true,
"scroll": true,
"theme": "serif"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": true,
"skip_h1_title": true,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {
"height": "calc(100% - 180px)",
"left": "10px",
"top": "150px",
"width": "384px"
},
"toc_section_display": false,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}