From 3c91b00749cfcfa6998d3f71479893573f4f8b03 Mon Sep 17 00:00:00 2001 From: Alexander Hess Date: Mon, 19 Oct 2020 11:40:27 +0200 Subject: [PATCH] Add initial version of chapter 06, part 2 --- 06_text/02_content.ipynb | 2100 ++++++++++++++++++++++++++++++++++++++ 06_text/full_house.bin | 1 + 06_text/umlauts.txt | 12 + README.md | 7 + 4 files changed, 2120 insertions(+) create mode 100644 06_text/02_content.ipynb create mode 100644 06_text/full_house.bin create mode 100644 06_text/umlauts.txt diff --git a/06_text/02_content.ipynb b/06_text/02_content.ipynb new file mode 100644 index 0000000..6ff2407 --- /dev/null +++ b/06_text/02_content.ipynb @@ -0,0 +1,2100 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "**Note**: Click on \"*Kernel*\" > \"*Restart Kernel and Clear All Outputs*\" in [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/) *before* reading this notebook to reset its output. If you cannot run this file on your machine, you may want to open it [in the cloud ](https://mybinder.org/v2/gh/webartifex/intro-to-python/develop?urlpath=lab/tree/06_text/01_content.ipynb)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Chapter 6: Text & Bytes (continued)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "In this second part of the chapter, we look in more detail at how `str` objects work in memory, in particular how the $0$s and $1$s in the memory translate into characters." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Special Characters" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "As previously seen, some characters have a special meaning when following the **escape character** `\"\\\"`. Besides escaping the kind of quote used as the `str` object's delimiter, `'` or `\"`, most of these **escape sequences** (i.e., `\"\\\"` with the subsequent character), act as a **control character** that moves the \"cursor\" in the output *without* generating any pixel on the screen. Because of that, we only see the effect of such escape sequences when used with the [print() ](https://docs.python.org/3/library/functions.html#print) function. The [documentation ](https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals) lists all available escape sequences, of which we show the most important ones below.\n", + "\n", + "The most common escape sequence is `\"\\n\"` that \"prints\" a [newline character ](https://en.wikipedia.org/wiki/Newline) that is also called the line feed character or LF for short." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "'This is a sentence\\nthat is printed\\non three lines.'" + ] + }, + "execution_count": 1, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "\"This is a sentence\\nthat is printed\\non three lines.\"" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "This is a sentence\n", + "that is printed\n", + "on three lines.\n" + ] + } + ], + "source": [ + "print(\"This is a sentence\\nthat is printed\\non three lines.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "`\"\\b\"` is the [backspace character ](https://en.wikipedia.org/wiki/Backspace), or BS for short, that moves the cursor back by one character." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ABX\n" + ] + } + ], + "source": [ + "print(\"ABC\\bX\")" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ABXY\n" + ] + } + ], + "source": [ + "print(\"ABC\\bXY\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "Similarly, `\"\\r\"` is the [carriage return character ](https://en.wikipedia.org/wiki/Carriage_return), or CR for short, that moves the cursor back to the beginning of the line." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "XBC\n" + ] + } + ], + "source": [ + "print(\"ABC\\rX\")" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "XYC\n" + ] + } + ], + "source": [ + "print(\"ABC\\rXY\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "While Linux and modern MacOS systems use solely `\"\\n\"` to express a new line, Windows systems default to using `\"\\r\\n\"`. This may lead to \"weird\" bugs on software projects where people using both kind of operating systems collaborate." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "This is a sentence\n", + "that is printed\n", + "on three lines.\n" + ] + } + ], + "source": [ + "print(\"This is a sentence\\r\\nthat is printed\\r\\non three lines.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "`\"\\t\"` makes the cursor \"jump\" in equidistant tab stops. That may be useful for formatting a program with lengthy and tabular results." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Jump\tfrom\ttab\tstop\tto\ttab\tstop.\n", + "The\tsecond\tline\tdoes\tso\ttoo.\n" + ] + } + ], + "source": [ + "print(\"Jump\\tfrom\\ttab\\tstop\\tto\\ttab\\tstop.\\nThe\\tsecond\\tline\\tdoes\\tso\\ttoo.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "### Raw Strings" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "Sometimes we do *not* want the backslash `\"\\\"` and its subsequent character be interpreted as an escape sequence. For example, let's print a typical installation path on a Windows systems. Obviously, the newline character `\"\\n\"` does *not* makes sense here." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "C:\\Programs\n", + "ew_application\n" + ] + } + ], + "source": [ + "print(\"C:\\Programs\\new_application\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "Some `str` objects even produce a `SyntaxError` because the `\"\\U\"` can *not* be interpreted as a Unicode code point (cf., next section)." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "ename": "SyntaxError", + "evalue": "(unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \\UXXXXXXXX escape (, line 1)", + "output_type": "error", + "traceback": [ + "\u001b[0;36m File \u001b[0;32m\"\"\u001b[0;36m, line \u001b[0;32m1\u001b[0m\n\u001b[0;31m print(\"C:\\Users\\Administrator\\Desktop\\Project\")\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \\UXXXXXXXX escape\n" + ] + } + ], + "source": [ + "print(\"C:\\Users\\Administrator\\Desktop\\Project\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "A simple solution would be to escape the escape character with a *second* backslash `\"\\\"`." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "C:\\Programs\\new_application\n" + ] + } + ], + "source": [ + "print(\"C:\\\\Programs\\\\new_application\")" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "C:\\Users\\Administrator\\Desktop\\Project\n" + ] + } + ], + "source": [ + "print(\"C:\\\\Users\\\\Administrator\\\\Desktop\\\\Project\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "However, this is tedious to remember and type. For such use cases, Python allows to prefix any string literal with a `r`. The literal is then interpreted in a \"raw\" way." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "C:\\Programs\\new_application\n" + ] + } + ], + "source": [ + "print(r\"C:\\Programs\\new_application\")" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "C:\\Users\\Administrator\\Desktop\\Project\n" + ] + } + ], + "source": [ + "print(r\"C:\\Users\\Administrator\\Desktop\\Project\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Characters are Numbers with a Convention" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "So far, we used the term **character** without any further consideration. In this section, we briefly look into what characters are and how they are modeled in software.\n", + "\n", + "[Chapter 5 ](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/develop/05_numbers/00_content.ipynb) gives us an idea on how individual **bits** are used to express all types of numbers, from \"simple\" `int` objects to \"complex\" `float` ones. To model characters, another **layer of abstraction** is put on top of whole numbers. So, just as bits are used to express integers, they themselves are used to express characters." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "### ASCII" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "Many conventions have been developed as to what integer is associated with which character. The most basic one that was also adopted around the world is the the so-called [American Standard Code for Information Interchange ](https://en.wikipedia.org/wiki/ASCII), or **ASCII** for short. It uses 7 bits of information to map the unprintable control characters as well as the printable letters of the alphabet, numbers, and common symbols to the numbers `0` through `127`.\n", + "\n", + "A mapping from characters to numbers is referred to by the technical term **encoding**. We may use the built-in [ord() ](https://docs.python.org/3/library/functions.html#ord) function to **encode** any single character. The inverse to that is the built-in [chr() ](https://docs.python.org/3/library/functions.html#chr) function, which **decodes** a number into a character." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "65" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ord(\"A\")" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "'A'" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "chr(65)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "Of course, unprintable escape sequences like `\"\\n\"` count as only *one* character." + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "10" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ord(\"\\n\")" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "'\\n'" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "chr(10)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "In ASCII, the numbers `0` through `31` (and `127`) are mapped to all kinds of unprintable control characters. The decimal digits are encoded with the numbers `48` through `57`, the upper case letters with `65` through `90`, and the lower case letters with `97` through `122`. While this seems random as first, there is of course a \"sophisticated\" system behind it. That can immediately be seen when looking at the encoded numbers in their *binary* representations.\n", + "\n", + "For example, the digit `5` is mapped to the number `53` in ASCII. The binary representation of `53` is `0b_11_0101` and the least significant four bits, `0101`, mean $5$. Similarly, the letter `\"E\"` is the fifth letter in the alphabet. It is encoded with the number `69` in ASCII, which is `0b_100_0101` in binary. And, the least significant bits, `0_0101`, mean $5$. Analogously, `\"e\"` is encoded with `101` in ASCII, which is `0b_110_0101` in binary. And, the least significant bits, `0_0101`, mean $5$ again. This encoding was chosen mainly because programmers \"in the old days\" needed to implement these encodings \"by hand.\" Python abstracts that logic away from its users.\n", + "\n", + "This encoding scheme is also the cause for the \"weird\" sorting in the \"*String Comparison*\" section in the [first part ](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/develop/06_text/01_content.ipynb#String-Comparison) of this chapter, where `\"apple\"` comes *after* `\"Banana\"`. As `\"a\"` is encoded with `97` and `\"B\"` with `66`, `\"Banana\"` must of course be \"smaller\" than `\"apple\"` when comparison is done in a pairwise fashion of the individual characters." + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "48 0b110000 -> 0\n", + "49 0b110001 -> 1\n", + "50 0b110010 -> 2\n", + "51 0b110011 -> 3\n", + "52 0b110100 -> 4\n", + "53 0b110101 -> 5\n", + "54 0b110110 -> 6\n", + "55 0b110111 -> 7\n", + "56 0b111000 -> 8\n", + "57 0b111001 -> 9\n" + ] + } + ], + "source": [ + "for number in range(48, 58):\n", + " print(number, bin(number), \"-> \", chr(number))" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "65 0b1000001 -> A\t66 0b1000010 -> B\t67 0b1000011 -> C\n", + "68 0b1000100 -> D\t69 0b1000101 -> E\t70 0b1000110 -> F\n", + "71 0b1000111 -> G\t72 0b1001000 -> H\t73 0b1001001 -> I\n", + "74 0b1001010 -> J\t75 0b1001011 -> K\t76 0b1001100 -> L\n", + "77 0b1001101 -> M\t78 0b1001110 -> N\t79 0b1001111 -> O\n", + "80 0b1010000 -> P\t81 0b1010001 -> Q\t82 0b1010010 -> R\n", + "83 0b1010011 -> S\t84 0b1010100 -> T\t85 0b1010101 -> U\n", + "86 0b1010110 -> V\t87 0b1010111 -> W\t88 0b1011000 -> X\n", + "89 0b1011001 -> Y\t90 0b1011010 -> Z\t" + ] + } + ], + "source": [ + "for i, number in enumerate(range(65, 91), start=1):\n", + " end = \"\\n\" if i % 3 == 0 else \"\\t\"\n", + " print(number, bin(number), \"-> \", chr(number), end=end)" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " 97 0b1100001 -> a\t 98 0b1100010 -> b\t 99 0b1100011 -> c\n", + "100 0b1100100 -> d\t101 0b1100101 -> e\t102 0b1100110 -> f\n", + "103 0b1100111 -> g\t104 0b1101000 -> h\t105 0b1101001 -> i\n", + "106 0b1101010 -> j\t107 0b1101011 -> k\t108 0b1101100 -> l\n", + "109 0b1101101 -> m\t110 0b1101110 -> n\t111 0b1101111 -> o\n", + "112 0b1110000 -> p\t113 0b1110001 -> q\t114 0b1110010 -> r\n", + "115 0b1110011 -> s\t116 0b1110100 -> t\t117 0b1110101 -> u\n", + "118 0b1110110 -> v\t119 0b1110111 -> w\t120 0b1111000 -> x\n", + "121 0b1111001 -> y\t122 0b1111010 -> z\t" + ] + } + ], + "source": [ + "for i, number in enumerate(range(97, 123), start=1):\n", + " end = \"\\n\" if i % 3 == 0 else \"\\t\"\n", + " print(str(number).rjust(3), bin(number), \"-> \", chr(number), end=end)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "The remaining `symbols` encoded in ASCII are encoded with the numbers still unused, which is why they are scattered." + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "symbols = (\n", + " list(range(32, 48))\n", + " + list(range(58, 65))\n", + " + list(range(91, 97))\n", + " + list(range(123, 127))\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " 32 0b100000 -> \t 33 0b100001 -> !\t 34 0b100010 -> \"\n", + " 35 0b100011 -> #\t 36 0b100100 -> $\t 37 0b100101 -> %\n", + " 38 0b100110 -> &\t 39 0b100111 -> '\t 40 0b101000 -> (\n", + " 41 0b101001 -> )\t 42 0b101010 -> *\t 43 0b101011 -> +\n", + " 44 0b101100 -> ,\t 45 0b101101 -> -\t 46 0b101110 -> .\n", + " 47 0b101111 -> /\t 58 0b111010 -> :\t 59 0b111011 -> ;\n", + " 60 0b111100 -> <\t 61 0b111101 -> =\t 62 0b111110 -> >\n", + " 63 0b111111 -> ?\t 64 0b1000000 -> @\t 91 0b1011011 -> [\n", + " 92 0b1011100 -> \\\t 93 0b1011101 -> ]\t 94 0b1011110 -> ^\n", + " 95 0b1011111 -> _\t 96 0b1100000 -> `\t123 0b1111011 -> {\n", + "124 0b1111100 -> |\t125 0b1111101 -> }\t126 0b1111110 -> ~\n" + ] + } + ], + "source": [ + "for i, number in enumerate(symbols, start=1):\n", + " end = \"\\n\" if i % 3 == 0 else \"\\t\"\n", + " print(str(number).rjust(3), bin(number).rjust(10), \"-> \", chr(number), end=end)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "As the ASCII character set does not work for many languages other than English, various encodings were developed. Popular examples are [ISO 8859-1 ](https://en.wikipedia.org/wiki/ISO/IEC_8859-1) for western European letters or [Windows 1250 ](https://en.wikipedia.org/wiki/Windows-1250) for Latin ones. Many of these encodings use 8-bit numbers (i.e., `0` through `255`) to map the multitude of non-English letters (e.g., the German [umlauts ](https://en.wikipedia.org/wiki/Umlaut_%28linguistics%29) `\"ä\"`, `\"ö\"`, `\"ü\"`, or `\"ß\"`)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "### Unicode" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "However, none of these specialized encodings can map *all* characters of *all* languages around the world from *all* times in human history. To achieve that, a truly global standard called **[Unicode ](https://en.wikipedia.org/wiki/Unicode)** was developed and its first version released in 1991. Since then, Unicode has been amended with many other \"characters.\" The most popular among them being [emojis ](https://en.wikipedia.org/wiki/Emoji) or the [Klingon ](https://en.wikipedia.org/wiki/Klingon_scripts) language (from the science fiction series [Star Trek ](https://en.wikipedia.org/wiki/Star_Trek)). In Unicode, every character is given an identity referred to as the **code point**. Code points are hexadecimal numbers from `0x0000` through `0x10ffff`, written as U+0000 and U+10FFFF outside of Python. Consequently, there exist at most $1,114,112$ code points, of which only about 10% are currently in use, allowing lots of room for new characters to be invented. The first `127` code points are identical to the ASCII encoding for reasons explained in the \"*The `bytes` Type*\" section further below. There exist plenty of lists of all Unicode characters on the web (e.g., [Wikipedia ](https://en.wikipedia.org/wiki/List_of_Unicode_characters)).\n", + "\n", + "All we need to know to print a character is its code point. Python uses the escape sequence `\"\\U\"` that is followed by eight hexadecimal digits. Underscore separators are unfortunately *not* allowed here.\n", + "\n", + "So, to print a smiley, we just need to look up the corresponding number (e.g., [here ](https://en.wikipedia.org/wiki/Emoji#Unicode_blocks))." + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "'😄'" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "\"\\U0001f604\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "Every Unicode character also has a descriptive name that we can use with the escape sequence `\"\\N\"` and within curly braces `{}`." + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "'😂'" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "\"\\N{FACE WITH TEARS OF JOY}\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "Whenever the code point can be expressed with just four hexadecimal digits, we may use the escape sequence `\"\\u\"` for brevity." + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "'A'" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "\"\\U00000041\" # hex(65) == 0x41" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "'A'" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "\"\\u0041\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "Analogously, if the code point can be expressed with two hexadecimal digits, we may use the escape sequence `\"\\x\"` for even conciser code." + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "'A'" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "\"\\x41\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "As the `str` type is based on Unicode, a `str` object's behavior is more in line with how humans view text and not how it is expressed in source code.\n", + "\n", + "For example, while it is obvious that `len(\"A\")` evaluates to `1`, ..." + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "1" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(\"A\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "... what should `len(\"\\N{SNAKE}\")` evaluate to? As the idea of a snake is expressed as *one* \"character,\" [len() ](https://docs.python.org/3/library/functions.html#len) also returns `1` here." + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "'🐍'" + ] + }, + "execution_count": 30, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "\"\\N{SNAKE}\"" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "1" + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(\"\\N{SNAKE}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "Many of the built-in `str` methods also consider Unicode. For example, in contrast to [lower() ](https://docs.python.org/3/library/stdtypes.html#str.lower), the [casefold() ](https://docs.python.org/3/library/stdtypes.html#str.casefold) method knows that the German `\"ß\"` is commonly converted to `\"ss\"`. So, when searching for exact matches, normalizing text with [casefold() ](https://docs.python.org/3/library/stdtypes.html#str.casefold) may yield better results than with [lower() ](https://docs.python.org/3/library/stdtypes.html#str.lower)." + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "'straße'" + ] + }, + "execution_count": 32, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "\"Straße\".lower()" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "'strasse'" + ] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "\"Straße\".casefold()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "Many other methods like [isdecimal() ](https://docs.python.org/3/library/stdtypes.html#str.isdecimal), [isdigit() ](https://docs.python.org/3/library/stdtypes.html#str.isdigit), [isnumeric() ](https://docs.python.org/3/library/stdtypes.html#str.isnumeric), [isprintable() ](https://docs.python.org/3/library/stdtypes.html#str.isprintable), [isidentifier() ](https://docs.python.org/3/library/stdtypes.html#str.isidentifier), and many more may be worthwhile to know for the data science practitioner, especially when it comes to data cleaning." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## Multi-line Strings" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "Sometimes, it is convenient to split text across multiple lines in source code. For example, to make lines fit into the 79 characters requirement of [PEP 8 ](https://www.python.org/dev/peps/pep-0008/) or because the text consists of many lines and typing out `\"\\n\"` is tedious. However, using single double quotes `\"` around multiple lines results in a `SyntaxError`." + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [ + { + "ename": "SyntaxError", + "evalue": "EOL while scanning string literal (, line 1)", + "output_type": "error", + "traceback": [ + "\u001b[0;36m File \u001b[0;32m\"\"\u001b[0;36m, line \u001b[0;32m1\u001b[0m\n\u001b[0;31m \"\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m EOL while scanning string literal\n" + ] + } + ], + "source": [ + "\"\n", + "Do not break the lines like this\n", + "\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "Instead, we may enclose a string literal with either **triple double** quotes `\"\"\"` or **triple single** quotes `'''`. Then, newline characters in the source code are converted into `\"\\n\"` characters in the resulting `str` object. Docstrings are precisely that, and, by convention, always written within triple double quotes `\"\"\"`." + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "multi_line = \"\"\"\n", + "I am a multi-line string\n", + "consisting of four lines.\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "A caveat is that `\"\\n\"` characters are often inserted at the beginning or end of the text when we try to format the source code nicely." + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "'\\nI am a multi-line string\\nconsisting of four lines.\\n'" + ] + }, + "execution_count": 36, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "multi_line" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "I am a multi-line string\n", + "consisting of four lines.\n", + "\n" + ] + } + ], + "source": [ + "print(multi_line)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "Using the [split() ](https://docs.python.org/3/library/stdtypes.html#str.split) method with the optional `sep` argument, we confirm that `multi_line` consists of *four* lines with the first and last line being empty." + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1 \n", + "2 I am a multi-line string\n", + "3 consisting of four lines.\n", + "4 \n" + ] + } + ], + "source": [ + "for i, line in enumerate(multi_line.split(\"\\n\"), start=1):\n", + " print(i, line)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "To mitigate that, we often see the [strip() ](https://docs.python.org/3/library/stdtypes.html#bytes.strip) method in source code." + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "multi_line = \"\"\"\n", + "I am a multi-line string\n", + "consisting of two lines.\n", + "\"\"\".strip()" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1 I am a multi-line string\n", + "2 consisting of two lines.\n" + ] + } + ], + "source": [ + "for i, line in enumerate(multi_line.split(\"\\n\"), start=1):\n", + " print(i, line)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## The `bytes` Type" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "To end this chapter, we want to briefly look at the `bytes` data type, which conceptually is a sequence of bytes. That data format is probably one of the most generic ways of exchanging data between any two programs or computers (e.g., a web browser obtains its data from a web server in this format).\n", + "\n", + "Let's open a binary file in read-only mode (i.e., `mode=\"rb\"`) and read in all of its contents." + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "with open(\"full_house.bin\", mode=\"rb\") as binary_file:\n", + " data = binary_file.read()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "`data` is an object of type `bytes`." + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "139880714782512" + ] + }, + "execution_count": 42, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "id(data)" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "bytes" + ] + }, + "execution_count": 43, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "type(data)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "It's value is given out in the literal bytes notation with a `b` prefix (cf., the [reference ](https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals)). Every byte is expressed in hexadecimal representation with the escape sequence `\"\\x\"`. This representation is commonly chosen as we can *not* tell what kind of information is hidden in the `data` by just looking at the bytes. Instead, we must be told by some other source how to **decode** the raw bytes into information we can interpret." + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "b'\\xf0\\x9f\\x82\\xa7\\xf0\\x9f\\x82\\xb7\\xf0\\x9f\\x83\\x97\\xf0\\x9f\\x83\\x8e\\xf0\\x9f\\x83\\x9e'" + ] + }, + "execution_count": 44, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "`bytes` objects work like `str` objects in many ways. In particular, they are *sequences* as well: The number of bytes is *finite* and we may *iterate* over them in *order*." + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "20" + ] + }, + "execution_count": 45, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(data)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "Consisting of 8 bits, a single byte can always be interpreted as a whole number between `0` through `255`. That is exactly what we see when we loop over the `data` ..." + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "240 159 130 167 240 159 130 183 240 159 131 151 240 159 131 142 240 159 131 158 " + ] + } + ], + "source": [ + "for byte in data:\n", + " print(byte, end=\" \")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "... or index into them." + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "158" + ] + }, + "execution_count": 47, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data[-1]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "Slicing returns another `bytes` object." + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "b'\\xf0\\x82\\xf0\\x82\\xf0\\x83\\xf0\\x83\\xf0\\x83'" + ] + }, + "execution_count": 48, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data[::2]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "### Character Encodings" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "Luckily, `data` consists of bytes encoded with the [UTF-8 ](https://en.wikipedia.org/wiki/UTF-8) encoding. That is the most common way of mapping a Unicode character's code point to a sequence of bytes.\n", + "\n", + "To obtain a `str` object out of a given `bytes` object, we decode it with the `bytes` type's [decode() ](https://docs.python.org/3/library/stdtypes.html#bytes.decode) method." + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "cards = data.decode()" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "str" + ] + }, + "execution_count": 50, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "type(cards)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "So, `data` consisted of a [full house ](https://en.wikipedia.org/wiki/List_of_poker_hands#Full_house) hand in a poker game." + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "'🂧🂷🃗🃎🃞'" + ] + }, + "execution_count": 51, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "cards" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "To go the opposite direction and encode a given `str` object, we use the `str` type's [encode() ](https://docs.python.org/3/library/stdtypes.html#str.encode) method." + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "place = \"Café Kastanientörtchen\"" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "b'Caf\\xc3\\xa9 Kastanient\\xc3\\xb6rtchen'" + ] + }, + "execution_count": 53, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "place.encode()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "By default, [encode() ](https://docs.python.org/3/library/stdtypes.html#str.encode) and [decode() ](https://docs.python.org/3/library/stdtypes.html#bytes.decode) use an `encoding=\"utf-8\"` argument. We may use another encoding like, for example, `\"iso-8859-1\"`, which can deal with ASCII and western European letters." + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "b'Caf\\xe9 Kastanient\\xf6rtchen'" + ] + }, + "execution_count": 54, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "place.encode(\"iso-8859-1\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "However, we must use the *same* encoding for the decoding step as for the encoding step. Otherwise, a `UnicodeDecodeError` is raised." + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [ + { + "ename": "UnicodeDecodeError", + "evalue": "'utf-8' codec can't decode byte 0xe9 in position 3: invalid continuation byte", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mUnicodeDecodeError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mplace\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mencode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"iso-8859-1\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdecode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;31mUnicodeDecodeError\u001b[0m: 'utf-8' codec can't decode byte 0xe9 in position 3: invalid continuation byte" + ] + } + ], + "source": [ + "place.encode(\"iso-8859-1\").decode()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "Not all encodings map all Unicode code points. For example `\"iso-8859-1\"` does not know Czech letters. Below, [encode() ](https://docs.python.org/3/library/stdtypes.html#str.encode) raises a `UnicodeEncodeError` because of that." + ] + }, + { + "cell_type": "code", + "execution_count": 56, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "ename": "UnicodeEncodeError", + "evalue": "'latin-1' codec can't encode character '\\u0159' in position 12: ordinal not in range(256)", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mUnicodeEncodeError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0;34m\"Dobrý den, přátelé!\"\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mencode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"iso-8859-1\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;31mUnicodeEncodeError\u001b[0m: 'latin-1' codec can't encode character '\\u0159' in position 12: ordinal not in range(256)" + ] + } + ], + "source": [ + "\"Dobrý den, přátelé!\".encode(\"iso-8859-1\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "### Reading Files (continued)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "The [open() ](https://docs.python.org/3/library/functions.html#open) function takes an optional `encoding` argument as well." + ] + }, + { + "cell_type": "code", + "execution_count": 57, + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [ + { + "ename": "UnicodeDecodeError", + "evalue": "'utf-8' codec can't decode byte 0xe4 in position 9: invalid continuation byte", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mUnicodeDecodeError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mwith\u001b[0m \u001b[0mopen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"umlauts.txt\"\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mfile\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"\"\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mjoin\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfile\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreadlines\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m~/.pyenv/versions/3.8.6/lib/python3.8/codecs.py\u001b[0m in \u001b[0;36mdecode\u001b[0;34m(self, input, final)\u001b[0m\n\u001b[1;32m 320\u001b[0m \u001b[0;31m# decode input (taking the buffer into account)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 321\u001b[0m \u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbuffer\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0minput\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 322\u001b[0;31m \u001b[0;34m(\u001b[0m\u001b[0mresult\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mconsumed\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_buffer_decode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0merrors\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfinal\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 323\u001b[0m \u001b[0;31m# keep undecoded input until the next call\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 324\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbuffer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mconsumed\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mUnicodeDecodeError\u001b[0m: 'utf-8' codec can't decode byte 0xe4 in position 9: invalid continuation byte" + ] + } + ], + "source": [ + "with open(\"umlauts.txt\") as file:\n", + " print(\"\".join(file.readlines()))" + ] + }, + { + "cell_type": "code", + "execution_count": 58, + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Lerchen-Lärchen-Ähnlichkeiten\n", + "fehlen. Dieses abzustreiten\n", + "mag im Klang der Worte liegen.\n", + "Merke, eine Lerch' kann fliegen,\n", + "Lärchen nicht, was kaum verwundert,\n", + "denn nicht eine unter hundert\n", + "ist geflügelt. Auch im Singen\n", + "sind die Bäume zu bezwingen.\n", + "Die Bätrachtung sollte reichen,\n", + "Rächtschreibfählern auszuweichen.\n", + "Leicht gälingt's, zu unterscheiden,\n", + "wär ist wär nun von dän beiden.\n" + ] + } + ], + "source": [ + "with open(\"umlauts.txt\", encoding=\"iso-8859-1\") as file:\n", + " print(\"\".join(file.readlines()))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "### Best Practice: Use UTF-8 explicitly" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "skip" + } + }, + "source": [ + "A best practice is to *always* specify the `encoding`, especially on computers running on Windows (cf., the talk by Łukasz Langa in the [Further Resources ](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/develop/06_text/05_resources.ipynb#Unicode)) section at the end of this chapter.\n", + "\n", + "Below is the first example involving [open() ](https://docs.python.org/3/library/functions.html#open) one last time: It shows how *all* the contents of a text file should be read into one `str` object." + ] + }, + { + "cell_type": "code", + "execution_count": 59, + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "with open(\"lorem_ipsum.txt\", encoding=\"utf-8\") as file:\n", + " content = \"\".join(file.readlines())" + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "\"Lorem Ipsum is simply dummy text of the printing and typesetting industry.\\nLorem Ipsum has been the industry's standard dummy text ever since the 1500s\\nwhen an unknown printer took a galley of type and scrambled it to make a type\\nspecimen book. It has survived not only five centuries but also the leap into\\nelectronic typesetting, remaining essentially unchanged. It was popularised in\\nthe 1960s with the release of Letraset sheets.\\n\"" + ] + }, + "execution_count": 60, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "content" + ] + }, + { + "cell_type": "code", + "execution_count": 61, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Lorem Ipsum is simply dummy text of the printing and typesetting industry.\n", + "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s\n", + "when an unknown printer took a galley of type and scrambled it to make a type\n", + "specimen book. It has survived not only five centuries but also the leap into\n", + "electronic typesetting, remaining essentially unchanged. It was popularised in\n", + "the 1960s with the release of Letraset sheets.\n", + "\n" + ] + } + ], + "source": [ + "print(content)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.6" + }, + "livereveal": { + "auto_select": "code", + "auto_select_fragment": true, + "scroll": true, + "theme": "serif" + }, + "toc": { + "base_numbering": 1, + "nav_menu": {}, + "number_sections": false, + "sideBar": true, + "skip_h1_title": true, + "title_cell": "Table of Contents", + "title_sidebar": "Contents", + "toc_cell": false, + "toc_position": { + "height": "calc(100% - 180px)", + "left": "10px", + "top": "150px", + "width": "384px" + }, + "toc_section_display": false, + "toc_window_display": false + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/06_text/full_house.bin b/06_text/full_house.bin new file mode 100644 index 0000000..4ad5e35 --- /dev/null +++ b/06_text/full_house.bin @@ -0,0 +1 @@ +🂧🂷🃗🃎🃞 \ No newline at end of file diff --git a/06_text/umlauts.txt b/06_text/umlauts.txt new file mode 100644 index 0000000..9c0f911 --- /dev/null +++ b/06_text/umlauts.txt @@ -0,0 +1,12 @@ +Lerchen-Lrchen-hnlichkeiten +fehlen. Dieses abzustreiten +mag im Klang der Worte liegen. +Merke, eine Lerch' kann fliegen, +Lrchen nicht, was kaum verwundert, +denn nicht eine unter hundert +ist geflgelt. Auch im Singen +sind die Bume zu bezwingen. +Die Btrachtung sollte reichen, +Rchtschreibfhlern auszuweichen. +Leicht glingt's, zu unterscheiden, +wr ist wr nun von dn beiden. \ No newline at end of file diff --git a/README.md b/README.md index 430dbb9..fd12c45 100644 --- a/README.md +++ b/README.md @@ -152,6 +152,13 @@ Alternatively, the content can be viewed in a web browser - [exercises ](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/develop/06_text/01_exercises.ipynb) [](https://mybinder.org/v2/gh/webartifex/intro-to-python/develop?urlpath=lab/tree/06_text/01_exercises.ipynb) (Detecting Palindromes) + - [content ](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/develop/06_text/02_content.ipynb) + [](https://mybinder.org/v2/gh/webartifex/intro-to-python/develop?urlpath=lab/tree/06_text/02_content.ipynb) + (Special Characters; + ASCII & Unicode; + Multi-line Strings; + `bytes` Type; + Character Encodings) #### Videos