diff --git a/06_text/02_content.ipynb b/06_text/02_content.ipynb
new file mode 100644
index 0000000..6ff2407
--- /dev/null
+++ b/06_text/02_content.ipynb
@@ -0,0 +1,2100 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "**Note**: Click on \"*Kernel*\" > \"*Restart Kernel and Clear All Outputs*\" in [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/) *before* reading this notebook to reset its output. If you cannot run this file on your machine, you may want to open it [in the cloud ](https://mybinder.org/v2/gh/webartifex/intro-to-python/develop?urlpath=lab/tree/06_text/01_content.ipynb)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+ "# Chapter 6: Text & Bytes (continued)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "In this second part of the chapter, we look in more detail at how `str` objects work in memory, in particular how the $0$s and $1$s in the memory translate into characters."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+ "## Special Characters"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "As previously seen, some characters have a special meaning when following the **escape character** `\"\\\"`. Besides escaping the kind of quote used as the `str` object's delimiter, `'` or `\"`, most of these **escape sequences** (i.e., `\"\\\"` with the subsequent character), act as a **control character** that moves the \"cursor\" in the output *without* generating any pixel on the screen. Because of that, we only see the effect of such escape sequences when used with the [print() ](https://docs.python.org/3/library/functions.html#print) function. The [documentation ](https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals) lists all available escape sequences, of which we show the most important ones below.\n",
+ "\n",
+ "The most common escape sequence is `\"\\n\"` that \"prints\" a [newline character ](https://en.wikipedia.org/wiki/Newline) that is also called the line feed character or LF for short."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'This is a sentence\\nthat is printed\\non three lines.'"
+ ]
+ },
+ "execution_count": 1,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "\"This is a sentence\\nthat is printed\\non three lines.\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "This is a sentence\n",
+ "that is printed\n",
+ "on three lines.\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(\"This is a sentence\\nthat is printed\\non three lines.\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "`\"\\b\"` is the [backspace character ](https://en.wikipedia.org/wiki/Backspace), or BS for short, that moves the cursor back by one character."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "ABX\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(\"ABC\\bX\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "ABXY\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(\"ABC\\bXY\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "Similarly, `\"\\r\"` is the [carriage return character ](https://en.wikipedia.org/wiki/Carriage_return), or CR for short, that moves the cursor back to the beginning of the line."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "XBC\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(\"ABC\\rX\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "XYC\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(\"ABC\\rXY\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "While Linux and modern MacOS systems use solely `\"\\n\"` to express a new line, Windows systems default to using `\"\\r\\n\"`. This may lead to \"weird\" bugs on software projects where people using both kind of operating systems collaborate."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "This is a sentence\n",
+ "that is printed\n",
+ "on three lines.\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(\"This is a sentence\\r\\nthat is printed\\r\\non three lines.\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "`\"\\t\"` makes the cursor \"jump\" in equidistant tab stops. That may be useful for formatting a program with lengthy and tabular results."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Jump\tfrom\ttab\tstop\tto\ttab\tstop.\n",
+ "The\tsecond\tline\tdoes\tso\ttoo.\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(\"Jump\\tfrom\\ttab\\tstop\\tto\\ttab\\tstop.\\nThe\\tsecond\\tline\\tdoes\\tso\\ttoo.\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+ "### Raw Strings"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "Sometimes we do *not* want the backslash `\"\\\"` and its subsequent character be interpreted as an escape sequence. For example, let's print a typical installation path on a Windows systems. Obviously, the newline character `\"\\n\"` does *not* makes sense here."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "C:\\Programs\n",
+ "ew_application\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(\"C:\\Programs\\new_application\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "Some `str` objects even produce a `SyntaxError` because the `\"\\U\"` can *not* be interpreted as a Unicode code point (cf., next section)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [
+ {
+ "ename": "SyntaxError",
+ "evalue": "(unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \\UXXXXXXXX escape (, line 1)",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[0;36m File \u001b[0;32m\"\"\u001b[0;36m, line \u001b[0;32m1\u001b[0m\n\u001b[0;31m print(\"C:\\Users\\Administrator\\Desktop\\Project\")\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \\UXXXXXXXX escape\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(\"C:\\Users\\Administrator\\Desktop\\Project\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "A simple solution would be to escape the escape character with a *second* backslash `\"\\\"`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "C:\\Programs\\new_application\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(\"C:\\\\Programs\\\\new_application\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "C:\\Users\\Administrator\\Desktop\\Project\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(\"C:\\\\Users\\\\Administrator\\\\Desktop\\\\Project\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "However, this is tedious to remember and type. For such use cases, Python allows to prefix any string literal with a `r`. The literal is then interpreted in a \"raw\" way."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "C:\\Programs\\new_application\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(r\"C:\\Programs\\new_application\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "C:\\Users\\Administrator\\Desktop\\Project\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(r\"C:\\Users\\Administrator\\Desktop\\Project\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+ "## Characters are Numbers with a Convention"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "So far, we used the term **character** without any further consideration. In this section, we briefly look into what characters are and how they are modeled in software.\n",
+ "\n",
+ "[Chapter 5 ](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/develop/05_numbers/00_content.ipynb) gives us an idea on how individual **bits** are used to express all types of numbers, from \"simple\" `int` objects to \"complex\" `float` ones. To model characters, another **layer of abstraction** is put on top of whole numbers. So, just as bits are used to express integers, they themselves are used to express characters."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+ "### ASCII"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "Many conventions have been developed as to what integer is associated with which character. The most basic one that was also adopted around the world is the the so-called [American Standard Code for Information Interchange ](https://en.wikipedia.org/wiki/ASCII), or **ASCII** for short. It uses 7 bits of information to map the unprintable control characters as well as the printable letters of the alphabet, numbers, and common symbols to the numbers `0` through `127`.\n",
+ "\n",
+ "A mapping from characters to numbers is referred to by the technical term **encoding**. We may use the built-in [ord() ](https://docs.python.org/3/library/functions.html#ord) function to **encode** any single character. The inverse to that is the built-in [chr() ](https://docs.python.org/3/library/functions.html#chr) function, which **decodes** a number into a character."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "65"
+ ]
+ },
+ "execution_count": 15,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "ord(\"A\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'A'"
+ ]
+ },
+ "execution_count": 16,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "chr(65)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "Of course, unprintable escape sequences like `\"\\n\"` count as only *one* character."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "10"
+ ]
+ },
+ "execution_count": 17,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "ord(\"\\n\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'\\n'"
+ ]
+ },
+ "execution_count": 18,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "chr(10)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "In ASCII, the numbers `0` through `31` (and `127`) are mapped to all kinds of unprintable control characters. The decimal digits are encoded with the numbers `48` through `57`, the upper case letters with `65` through `90`, and the lower case letters with `97` through `122`. While this seems random as first, there is of course a \"sophisticated\" system behind it. That can immediately be seen when looking at the encoded numbers in their *binary* representations.\n",
+ "\n",
+ "For example, the digit `5` is mapped to the number `53` in ASCII. The binary representation of `53` is `0b_11_0101` and the least significant four bits, `0101`, mean $5$. Similarly, the letter `\"E\"` is the fifth letter in the alphabet. It is encoded with the number `69` in ASCII, which is `0b_100_0101` in binary. And, the least significant bits, `0_0101`, mean $5$. Analogously, `\"e\"` is encoded with `101` in ASCII, which is `0b_110_0101` in binary. And, the least significant bits, `0_0101`, mean $5$ again. This encoding was chosen mainly because programmers \"in the old days\" needed to implement these encodings \"by hand.\" Python abstracts that logic away from its users.\n",
+ "\n",
+ "This encoding scheme is also the cause for the \"weird\" sorting in the \"*String Comparison*\" section in the [first part ](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/develop/06_text/01_content.ipynb#String-Comparison) of this chapter, where `\"apple\"` comes *after* `\"Banana\"`. As `\"a\"` is encoded with `97` and `\"B\"` with `66`, `\"Banana\"` must of course be \"smaller\" than `\"apple\"` when comparison is done in a pairwise fashion of the individual characters."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "48 0b110000 -> 0\n",
+ "49 0b110001 -> 1\n",
+ "50 0b110010 -> 2\n",
+ "51 0b110011 -> 3\n",
+ "52 0b110100 -> 4\n",
+ "53 0b110101 -> 5\n",
+ "54 0b110110 -> 6\n",
+ "55 0b110111 -> 7\n",
+ "56 0b111000 -> 8\n",
+ "57 0b111001 -> 9\n"
+ ]
+ }
+ ],
+ "source": [
+ "for number in range(48, 58):\n",
+ " print(number, bin(number), \"-> \", chr(number))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "65 0b1000001 -> A\t66 0b1000010 -> B\t67 0b1000011 -> C\n",
+ "68 0b1000100 -> D\t69 0b1000101 -> E\t70 0b1000110 -> F\n",
+ "71 0b1000111 -> G\t72 0b1001000 -> H\t73 0b1001001 -> I\n",
+ "74 0b1001010 -> J\t75 0b1001011 -> K\t76 0b1001100 -> L\n",
+ "77 0b1001101 -> M\t78 0b1001110 -> N\t79 0b1001111 -> O\n",
+ "80 0b1010000 -> P\t81 0b1010001 -> Q\t82 0b1010010 -> R\n",
+ "83 0b1010011 -> S\t84 0b1010100 -> T\t85 0b1010101 -> U\n",
+ "86 0b1010110 -> V\t87 0b1010111 -> W\t88 0b1011000 -> X\n",
+ "89 0b1011001 -> Y\t90 0b1011010 -> Z\t"
+ ]
+ }
+ ],
+ "source": [
+ "for i, number in enumerate(range(65, 91), start=1):\n",
+ " end = \"\\n\" if i % 3 == 0 else \"\\t\"\n",
+ " print(number, bin(number), \"-> \", chr(number), end=end)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " 97 0b1100001 -> a\t 98 0b1100010 -> b\t 99 0b1100011 -> c\n",
+ "100 0b1100100 -> d\t101 0b1100101 -> e\t102 0b1100110 -> f\n",
+ "103 0b1100111 -> g\t104 0b1101000 -> h\t105 0b1101001 -> i\n",
+ "106 0b1101010 -> j\t107 0b1101011 -> k\t108 0b1101100 -> l\n",
+ "109 0b1101101 -> m\t110 0b1101110 -> n\t111 0b1101111 -> o\n",
+ "112 0b1110000 -> p\t113 0b1110001 -> q\t114 0b1110010 -> r\n",
+ "115 0b1110011 -> s\t116 0b1110100 -> t\t117 0b1110101 -> u\n",
+ "118 0b1110110 -> v\t119 0b1110111 -> w\t120 0b1111000 -> x\n",
+ "121 0b1111001 -> y\t122 0b1111010 -> z\t"
+ ]
+ }
+ ],
+ "source": [
+ "for i, number in enumerate(range(97, 123), start=1):\n",
+ " end = \"\\n\" if i % 3 == 0 else \"\\t\"\n",
+ " print(str(number).rjust(3), bin(number), \"-> \", chr(number), end=end)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "The remaining `symbols` encoded in ASCII are encoded with the numbers still unused, which is why they are scattered."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "symbols = (\n",
+ " list(range(32, 48))\n",
+ " + list(range(58, 65))\n",
+ " + list(range(91, 97))\n",
+ " + list(range(123, 127))\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " 32 0b100000 -> \t 33 0b100001 -> !\t 34 0b100010 -> \"\n",
+ " 35 0b100011 -> #\t 36 0b100100 -> $\t 37 0b100101 -> %\n",
+ " 38 0b100110 -> &\t 39 0b100111 -> '\t 40 0b101000 -> (\n",
+ " 41 0b101001 -> )\t 42 0b101010 -> *\t 43 0b101011 -> +\n",
+ " 44 0b101100 -> ,\t 45 0b101101 -> -\t 46 0b101110 -> .\n",
+ " 47 0b101111 -> /\t 58 0b111010 -> :\t 59 0b111011 -> ;\n",
+ " 60 0b111100 -> <\t 61 0b111101 -> =\t 62 0b111110 -> >\n",
+ " 63 0b111111 -> ?\t 64 0b1000000 -> @\t 91 0b1011011 -> [\n",
+ " 92 0b1011100 -> \\\t 93 0b1011101 -> ]\t 94 0b1011110 -> ^\n",
+ " 95 0b1011111 -> _\t 96 0b1100000 -> `\t123 0b1111011 -> {\n",
+ "124 0b1111100 -> |\t125 0b1111101 -> }\t126 0b1111110 -> ~\n"
+ ]
+ }
+ ],
+ "source": [
+ "for i, number in enumerate(symbols, start=1):\n",
+ " end = \"\\n\" if i % 3 == 0 else \"\\t\"\n",
+ " print(str(number).rjust(3), bin(number).rjust(10), \"-> \", chr(number), end=end)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "As the ASCII character set does not work for many languages other than English, various encodings were developed. Popular examples are [ISO 8859-1 ](https://en.wikipedia.org/wiki/ISO/IEC_8859-1) for western European letters or [Windows 1250 ](https://en.wikipedia.org/wiki/Windows-1250) for Latin ones. Many of these encodings use 8-bit numbers (i.e., `0` through `255`) to map the multitude of non-English letters (e.g., the German [umlauts ](https://en.wikipedia.org/wiki/Umlaut_%28linguistics%29) `\"ä\"`, `\"ö\"`, `\"ü\"`, or `\"ß\"`)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+ "### Unicode"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "However, none of these specialized encodings can map *all* characters of *all* languages around the world from *all* times in human history. To achieve that, a truly global standard called **[Unicode ](https://en.wikipedia.org/wiki/Unicode)** was developed and its first version released in 1991. Since then, Unicode has been amended with many other \"characters.\" The most popular among them being [emojis ](https://en.wikipedia.org/wiki/Emoji) or the [Klingon ](https://en.wikipedia.org/wiki/Klingon_scripts) language (from the science fiction series [Star Trek ](https://en.wikipedia.org/wiki/Star_Trek)). In Unicode, every character is given an identity referred to as the **code point**. Code points are hexadecimal numbers from `0x0000` through `0x10ffff`, written as U+0000 and U+10FFFF outside of Python. Consequently, there exist at most $1,114,112$ code points, of which only about 10% are currently in use, allowing lots of room for new characters to be invented. The first `127` code points are identical to the ASCII encoding for reasons explained in the \"*The `bytes` Type*\" section further below. There exist plenty of lists of all Unicode characters on the web (e.g., [Wikipedia ](https://en.wikipedia.org/wiki/List_of_Unicode_characters)).\n",
+ "\n",
+ "All we need to know to print a character is its code point. Python uses the escape sequence `\"\\U\"` that is followed by eight hexadecimal digits. Underscore separators are unfortunately *not* allowed here.\n",
+ "\n",
+ "So, to print a smiley, we just need to look up the corresponding number (e.g., [here ](https://en.wikipedia.org/wiki/Emoji#Unicode_blocks))."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'😄'"
+ ]
+ },
+ "execution_count": 24,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "\"\\U0001f604\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "Every Unicode character also has a descriptive name that we can use with the escape sequence `\"\\N\"` and within curly braces `{}`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'😂'"
+ ]
+ },
+ "execution_count": 25,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "\"\\N{FACE WITH TEARS OF JOY}\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "Whenever the code point can be expressed with just four hexadecimal digits, we may use the escape sequence `\"\\u\"` for brevity."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'A'"
+ ]
+ },
+ "execution_count": 26,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "\"\\U00000041\" # hex(65) == 0x41"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 27,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'A'"
+ ]
+ },
+ "execution_count": 27,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "\"\\u0041\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "Analogously, if the code point can be expressed with two hexadecimal digits, we may use the escape sequence `\"\\x\"` for even conciser code."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 28,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'A'"
+ ]
+ },
+ "execution_count": 28,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "\"\\x41\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "As the `str` type is based on Unicode, a `str` object's behavior is more in line with how humans view text and not how it is expressed in source code.\n",
+ "\n",
+ "For example, while it is obvious that `len(\"A\")` evaluates to `1`, ..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "1"
+ ]
+ },
+ "execution_count": 29,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "len(\"A\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "... what should `len(\"\\N{SNAKE}\")` evaluate to? As the idea of a snake is expressed as *one* \"character,\" [len() ](https://docs.python.org/3/library/functions.html#len) also returns `1` here."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 30,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'🐍'"
+ ]
+ },
+ "execution_count": 30,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "\"\\N{SNAKE}\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 31,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "1"
+ ]
+ },
+ "execution_count": 31,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "len(\"\\N{SNAKE}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "Many of the built-in `str` methods also consider Unicode. For example, in contrast to [lower() ](https://docs.python.org/3/library/stdtypes.html#str.lower), the [casefold() ](https://docs.python.org/3/library/stdtypes.html#str.casefold) method knows that the German `\"ß\"` is commonly converted to `\"ss\"`. So, when searching for exact matches, normalizing text with [casefold() ](https://docs.python.org/3/library/stdtypes.html#str.casefold) may yield better results than with [lower() ](https://docs.python.org/3/library/stdtypes.html#str.lower)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 32,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'straße'"
+ ]
+ },
+ "execution_count": 32,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "\"Straße\".lower()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 33,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'strasse'"
+ ]
+ },
+ "execution_count": 33,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "\"Straße\".casefold()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "Many other methods like [isdecimal() ](https://docs.python.org/3/library/stdtypes.html#str.isdecimal), [isdigit() ](https://docs.python.org/3/library/stdtypes.html#str.isdigit), [isnumeric() ](https://docs.python.org/3/library/stdtypes.html#str.isnumeric), [isprintable() ](https://docs.python.org/3/library/stdtypes.html#str.isprintable), [isidentifier() ](https://docs.python.org/3/library/stdtypes.html#str.isidentifier), and many more may be worthwhile to know for the data science practitioner, especially when it comes to data cleaning."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+ "## Multi-line Strings"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "Sometimes, it is convenient to split text across multiple lines in source code. For example, to make lines fit into the 79 characters requirement of [PEP 8 ](https://www.python.org/dev/peps/pep-0008/) or because the text consists of many lines and typing out `\"\\n\"` is tedious. However, using single double quotes `\"` around multiple lines results in a `SyntaxError`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 34,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "outputs": [
+ {
+ "ename": "SyntaxError",
+ "evalue": "EOL while scanning string literal (, line 1)",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[0;36m File \u001b[0;32m\"\"\u001b[0;36m, line \u001b[0;32m1\u001b[0m\n\u001b[0;31m \"\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m EOL while scanning string literal\n"
+ ]
+ }
+ ],
+ "source": [
+ "\"\n",
+ "Do not break the lines like this\n",
+ "\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "Instead, we may enclose a string literal with either **triple double** quotes `\"\"\"` or **triple single** quotes `'''`. Then, newline characters in the source code are converted into `\"\\n\"` characters in the resulting `str` object. Docstrings are precisely that, and, by convention, always written within triple double quotes `\"\"\"`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 35,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "multi_line = \"\"\"\n",
+ "I am a multi-line string\n",
+ "consisting of four lines.\n",
+ "\"\"\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "A caveat is that `\"\\n\"` characters are often inserted at the beginning or end of the text when we try to format the source code nicely."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 36,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'\\nI am a multi-line string\\nconsisting of four lines.\\n'"
+ ]
+ },
+ "execution_count": 36,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "multi_line"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 37,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "I am a multi-line string\n",
+ "consisting of four lines.\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(multi_line)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "Using the [split() ](https://docs.python.org/3/library/stdtypes.html#str.split) method with the optional `sep` argument, we confirm that `multi_line` consists of *four* lines with the first and last line being empty."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 38,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "1 \n",
+ "2 I am a multi-line string\n",
+ "3 consisting of four lines.\n",
+ "4 \n"
+ ]
+ }
+ ],
+ "source": [
+ "for i, line in enumerate(multi_line.split(\"\\n\"), start=1):\n",
+ " print(i, line)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "To mitigate that, we often see the [strip() ](https://docs.python.org/3/library/stdtypes.html#bytes.strip) method in source code."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 39,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "multi_line = \"\"\"\n",
+ "I am a multi-line string\n",
+ "consisting of two lines.\n",
+ "\"\"\".strip()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 40,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "1 I am a multi-line string\n",
+ "2 consisting of two lines.\n"
+ ]
+ }
+ ],
+ "source": [
+ "for i, line in enumerate(multi_line.split(\"\\n\"), start=1):\n",
+ " print(i, line)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+ "## The `bytes` Type"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "To end this chapter, we want to briefly look at the `bytes` data type, which conceptually is a sequence of bytes. That data format is probably one of the most generic ways of exchanging data between any two programs or computers (e.g., a web browser obtains its data from a web server in this format).\n",
+ "\n",
+ "Let's open a binary file in read-only mode (i.e., `mode=\"rb\"`) and read in all of its contents."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 41,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "with open(\"full_house.bin\", mode=\"rb\") as binary_file:\n",
+ " data = binary_file.read()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "`data` is an object of type `bytes`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 42,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "139880714782512"
+ ]
+ },
+ "execution_count": 42,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "id(data)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 43,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "bytes"
+ ]
+ },
+ "execution_count": 43,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "type(data)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "It's value is given out in the literal bytes notation with a `b` prefix (cf., the [reference ](https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals)). Every byte is expressed in hexadecimal representation with the escape sequence `\"\\x\"`. This representation is commonly chosen as we can *not* tell what kind of information is hidden in the `data` by just looking at the bytes. Instead, we must be told by some other source how to **decode** the raw bytes into information we can interpret."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 44,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "b'\\xf0\\x9f\\x82\\xa7\\xf0\\x9f\\x82\\xb7\\xf0\\x9f\\x83\\x97\\xf0\\x9f\\x83\\x8e\\xf0\\x9f\\x83\\x9e'"
+ ]
+ },
+ "execution_count": 44,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "`bytes` objects work like `str` objects in many ways. In particular, they are *sequences* as well: The number of bytes is *finite* and we may *iterate* over them in *order*."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 45,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "20"
+ ]
+ },
+ "execution_count": 45,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "len(data)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "Consisting of 8 bits, a single byte can always be interpreted as a whole number between `0` through `255`. That is exactly what we see when we loop over the `data` ..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 46,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "240 159 130 167 240 159 130 183 240 159 131 151 240 159 131 142 240 159 131 158 "
+ ]
+ }
+ ],
+ "source": [
+ "for byte in data:\n",
+ " print(byte, end=\" \")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "... or index into them."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 47,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "158"
+ ]
+ },
+ "execution_count": 47,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "data[-1]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "Slicing returns another `bytes` object."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 48,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "b'\\xf0\\x82\\xf0\\x82\\xf0\\x83\\xf0\\x83\\xf0\\x83'"
+ ]
+ },
+ "execution_count": 48,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "data[::2]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+ "### Character Encodings"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "Luckily, `data` consists of bytes encoded with the [UTF-8 ](https://en.wikipedia.org/wiki/UTF-8) encoding. That is the most common way of mapping a Unicode character's code point to a sequence of bytes.\n",
+ "\n",
+ "To obtain a `str` object out of a given `bytes` object, we decode it with the `bytes` type's [decode() ](https://docs.python.org/3/library/stdtypes.html#bytes.decode) method."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 49,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "cards = data.decode()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 50,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "str"
+ ]
+ },
+ "execution_count": 50,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "type(cards)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "So, `data` consisted of a [full house ](https://en.wikipedia.org/wiki/List_of_poker_hands#Full_house) hand in a poker game."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 51,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'🂧🂷🃗🃎🃞'"
+ ]
+ },
+ "execution_count": 51,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "cards"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "To go the opposite direction and encode a given `str` object, we use the `str` type's [encode() ](https://docs.python.org/3/library/stdtypes.html#str.encode) method."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 52,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "place = \"Café Kastanientörtchen\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 53,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "b'Caf\\xc3\\xa9 Kastanient\\xc3\\xb6rtchen'"
+ ]
+ },
+ "execution_count": 53,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "place.encode()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "By default, [encode() ](https://docs.python.org/3/library/stdtypes.html#str.encode) and [decode() ](https://docs.python.org/3/library/stdtypes.html#bytes.decode) use an `encoding=\"utf-8\"` argument. We may use another encoding like, for example, `\"iso-8859-1\"`, which can deal with ASCII and western European letters."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 54,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "b'Caf\\xe9 Kastanient\\xf6rtchen'"
+ ]
+ },
+ "execution_count": 54,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "place.encode(\"iso-8859-1\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "However, we must use the *same* encoding for the decoding step as for the encoding step. Otherwise, a `UnicodeDecodeError` is raised."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 55,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "outputs": [
+ {
+ "ename": "UnicodeDecodeError",
+ "evalue": "'utf-8' codec can't decode byte 0xe9 in position 3: invalid continuation byte",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+ "\u001b[0;31mUnicodeDecodeError\u001b[0m Traceback (most recent call last)",
+ "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mplace\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mencode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"iso-8859-1\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdecode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+ "\u001b[0;31mUnicodeDecodeError\u001b[0m: 'utf-8' codec can't decode byte 0xe9 in position 3: invalid continuation byte"
+ ]
+ }
+ ],
+ "source": [
+ "place.encode(\"iso-8859-1\").decode()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "Not all encodings map all Unicode code points. For example `\"iso-8859-1\"` does not know Czech letters. Below, [encode() ](https://docs.python.org/3/library/stdtypes.html#str.encode) raises a `UnicodeEncodeError` because of that."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 56,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [
+ {
+ "ename": "UnicodeEncodeError",
+ "evalue": "'latin-1' codec can't encode character '\\u0159' in position 12: ordinal not in range(256)",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+ "\u001b[0;31mUnicodeEncodeError\u001b[0m Traceback (most recent call last)",
+ "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0;34m\"Dobrý den, přátelé!\"\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mencode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"iso-8859-1\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+ "\u001b[0;31mUnicodeEncodeError\u001b[0m: 'latin-1' codec can't encode character '\\u0159' in position 12: ordinal not in range(256)"
+ ]
+ }
+ ],
+ "source": [
+ "\"Dobrý den, přátelé!\".encode(\"iso-8859-1\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+ "### Reading Files (continued)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "The [open() ](https://docs.python.org/3/library/functions.html#open) function takes an optional `encoding` argument as well."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 57,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "outputs": [
+ {
+ "ename": "UnicodeDecodeError",
+ "evalue": "'utf-8' codec can't decode byte 0xe4 in position 9: invalid continuation byte",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+ "\u001b[0;31mUnicodeDecodeError\u001b[0m Traceback (most recent call last)",
+ "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mwith\u001b[0m \u001b[0mopen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"umlauts.txt\"\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mfile\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"\"\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mjoin\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfile\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreadlines\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+ "\u001b[0;32m~/.pyenv/versions/3.8.6/lib/python3.8/codecs.py\u001b[0m in \u001b[0;36mdecode\u001b[0;34m(self, input, final)\u001b[0m\n\u001b[1;32m 320\u001b[0m \u001b[0;31m# decode input (taking the buffer into account)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 321\u001b[0m \u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbuffer\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0minput\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 322\u001b[0;31m \u001b[0;34m(\u001b[0m\u001b[0mresult\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mconsumed\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_buffer_decode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0merrors\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfinal\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 323\u001b[0m \u001b[0;31m# keep undecoded input until the next call\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 324\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbuffer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mconsumed\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+ "\u001b[0;31mUnicodeDecodeError\u001b[0m: 'utf-8' codec can't decode byte 0xe4 in position 9: invalid continuation byte"
+ ]
+ }
+ ],
+ "source": [
+ "with open(\"umlauts.txt\") as file:\n",
+ " print(\"\".join(file.readlines()))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 58,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Lerchen-Lärchen-Ähnlichkeiten\n",
+ "fehlen. Dieses abzustreiten\n",
+ "mag im Klang der Worte liegen.\n",
+ "Merke, eine Lerch' kann fliegen,\n",
+ "Lärchen nicht, was kaum verwundert,\n",
+ "denn nicht eine unter hundert\n",
+ "ist geflügelt. Auch im Singen\n",
+ "sind die Bäume zu bezwingen.\n",
+ "Die Bätrachtung sollte reichen,\n",
+ "Rächtschreibfählern auszuweichen.\n",
+ "Leicht gälingt's, zu unterscheiden,\n",
+ "wär ist wär nun von dän beiden.\n"
+ ]
+ }
+ ],
+ "source": [
+ "with open(\"umlauts.txt\", encoding=\"iso-8859-1\") as file:\n",
+ " print(\"\".join(file.readlines()))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+ "### Best Practice: Use UTF-8 explicitly"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "source": [
+ "A best practice is to *always* specify the `encoding`, especially on computers running on Windows (cf., the talk by Łukasz Langa in the [Further Resources ](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/develop/06_text/05_resources.ipynb#Unicode)) section at the end of this chapter.\n",
+ "\n",
+ "Below is the first example involving [open() ](https://docs.python.org/3/library/functions.html#open) one last time: It shows how *all* the contents of a text file should be read into one `str` object."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 59,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "with open(\"lorem_ipsum.txt\", encoding=\"utf-8\") as file:\n",
+ " content = \"\".join(file.readlines())"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 60,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "\"Lorem Ipsum is simply dummy text of the printing and typesetting industry.\\nLorem Ipsum has been the industry's standard dummy text ever since the 1500s\\nwhen an unknown printer took a galley of type and scrambled it to make a type\\nspecimen book. It has survived not only five centuries but also the leap into\\nelectronic typesetting, remaining essentially unchanged. It was popularised in\\nthe 1960s with the release of Letraset sheets.\\n\""
+ ]
+ },
+ "execution_count": 60,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "content"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 61,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Lorem Ipsum is simply dummy text of the printing and typesetting industry.\n",
+ "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s\n",
+ "when an unknown printer took a galley of type and scrambled it to make a type\n",
+ "specimen book. It has survived not only five centuries but also the leap into\n",
+ "electronic typesetting, remaining essentially unchanged. It was popularised in\n",
+ "the 1960s with the release of Letraset sheets.\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(content)"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.6"
+ },
+ "livereveal": {
+ "auto_select": "code",
+ "auto_select_fragment": true,
+ "scroll": true,
+ "theme": "serif"
+ },
+ "toc": {
+ "base_numbering": 1,
+ "nav_menu": {},
+ "number_sections": false,
+ "sideBar": true,
+ "skip_h1_title": true,
+ "title_cell": "Table of Contents",
+ "title_sidebar": "Contents",
+ "toc_cell": false,
+ "toc_position": {
+ "height": "calc(100% - 180px)",
+ "left": "10px",
+ "top": "150px",
+ "width": "384px"
+ },
+ "toc_section_display": false,
+ "toc_window_display": false
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/06_text/full_house.bin b/06_text/full_house.bin
new file mode 100644
index 0000000..4ad5e35
--- /dev/null
+++ b/06_text/full_house.bin
@@ -0,0 +1 @@
+🂧🂷🃗🃎🃞
\ No newline at end of file
diff --git a/06_text/umlauts.txt b/06_text/umlauts.txt
new file mode 100644
index 0000000..9c0f911
--- /dev/null
+++ b/06_text/umlauts.txt
@@ -0,0 +1,12 @@
+Lerchen-Lrchen-hnlichkeiten
+fehlen. Dieses abzustreiten
+mag im Klang der Worte liegen.
+Merke, eine Lerch' kann fliegen,
+Lrchen nicht, was kaum verwundert,
+denn nicht eine unter hundert
+ist geflgelt. Auch im Singen
+sind die Bume zu bezwingen.
+Die Btrachtung sollte reichen,
+Rchtschreibfhlern auszuweichen.
+Leicht glingt's, zu unterscheiden,
+wr ist wr nun von dn beiden.
\ No newline at end of file
diff --git a/README.md b/README.md
index 430dbb9..fd12c45 100644
--- a/README.md
+++ b/README.md
@@ -152,6 +152,13 @@ Alternatively, the content can be viewed in a web browser
- [exercises ](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/develop/06_text/01_exercises.ipynb)
[](https://mybinder.org/v2/gh/webartifex/intro-to-python/develop?urlpath=lab/tree/06_text/01_exercises.ipynb)
(Detecting Palindromes)
+ - [content ](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/develop/06_text/02_content.ipynb)
+ [](https://mybinder.org/v2/gh/webartifex/intro-to-python/develop?urlpath=lab/tree/06_text/02_content.ipynb)
+ (Special Characters;
+ ASCII & Unicode;
+ Multi-line Strings;
+ `bytes` Type;
+ Character Encodings)
#### Videos