intro-to-python/06_text/02_content.ipynb

2111 lines
56 KiB
Text
Raw Normal View History

{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"**Note**: Click on \"*Kernel*\" > \"*Restart Kernel and Clear All Outputs*\" in [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/) *before* reading this notebook to reset its output. If you cannot run this file on your machine, you may want to open it [in the cloud <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_mb.png\">](https://mybinder.org/v2/gh/webartifex/intro-to-python/main?urlpath=lab/tree/06_text/01_content.ipynb)."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Chapter 6: Text & Bytes (continued)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"In this second part of the chapter, we look in more detail at how `str` objects work in memory, in particular how the $0$s and $1$s in the memory translate into characters."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Special Characters"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"As previously seen, some characters have a special meaning when following the **escape character** `\"\\\"`. Besides escaping the kind of quote used as the `str` object's delimiter, `'` or `\"`, most of these **escape sequences** (i.e., `\"\\\"` with the subsequent character), act as a **control character** that moves the \"cursor\" in the output *without* generating any pixel on the screen. Because of that, we only see the effect of such escape sequences when used with the [print() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/functions.html#print) function. The [documentation <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals) lists all available escape sequences, of which we show the most important ones below.\n",
"\n",
"The most common escape sequence is `\"\\n\"` that \"prints\" a [newline character <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_wiki.png\">](https://en.wikipedia.org/wiki/Newline) that is also called the line feed character or LF for short."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'This is a sentence\\nthat is printed\\non three lines.'"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\"This is a sentence\\nthat is printed\\non three lines.\""
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"This is a sentence\n",
"that is printed\n",
"on three lines.\n"
]
}
],
"source": [
"print(\"This is a sentence\\nthat is printed\\non three lines.\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"`\"\\b\"` is the [backspace character <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_wiki.png\">](https://en.wikipedia.org/wiki/Backspace), or BS for short, that moves the cursor back by one character."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ABX\n"
]
}
],
"source": [
"print(\"ABC\\bX\")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ABXY\n"
]
}
],
"source": [
"print(\"ABC\\bXY\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Similarly, `\"\\r\"` is the [carriage return character <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_wiki.png\">](https://en.wikipedia.org/wiki/Carriage_return), or CR for short, that moves the cursor back to the beginning of the line."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"XBC\n"
]
}
],
"source": [
"print(\"ABC\\rX\")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"XYC\n"
]
}
],
"source": [
"print(\"ABC\\rXY\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"While Linux and modern MacOS systems use solely `\"\\n\"` to express a new line, Windows systems default to using `\"\\r\\n\"`. This may lead to \"weird\" bugs on software projects where people using both kind of operating systems collaborate."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"This is a sentence\n",
"that is printed\n",
"on three lines.\n"
]
}
],
"source": [
"print(\"This is a sentence\\r\\nthat is printed\\r\\non three lines.\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"`\"\\t\"` makes the cursor \"jump\" in equidistant tab stops. That may be useful for formatting a program with lengthy and tabular results."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Jump\tfrom\ttab\tstop\tto\ttab\tstop.\n",
"The\tsecond\tline\tdoes\tso\ttoo.\n"
]
}
],
"source": [
"print(\"Jump\\tfrom\\ttab\\tstop\\tto\\ttab\\tstop.\\nThe\\tsecond\\tline\\tdoes\\tso\\ttoo.\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Raw Strings"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Sometimes we do *not* want the backslash `\"\\\"` and its subsequent character be interpreted as an escape sequence. For example, let's print a typical installation path on a Windows systems. Obviously, the newline character `\"\\n\"` does *not* makes sense here."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"C:\\Programs\n",
"ew_application\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"<>:1: SyntaxWarning: invalid escape sequence '\\P'\n",
"<>:1: SyntaxWarning: invalid escape sequence '\\P'\n",
"/tmp/ipykernel_159416/1102122489.py:1: SyntaxWarning: invalid escape sequence '\\P'\n",
" print(\"C:\\Programs\\new_application\")\n"
]
}
],
"source": [
"print(\"C:\\Programs\\new_application\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Some `str` objects even produce a `SyntaxError` because the `\"\\U\"` can *not* be interpreted as a Unicode code point (cf., next section)."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"ename": "SyntaxError",
"evalue": "(unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \\UXXXXXXXX escape (2296736867.py, line 1)",
"output_type": "error",
"traceback": [
"\u001b[0;36m Cell \u001b[0;32mIn[10], line 1\u001b[0;36m\u001b[0m\n\u001b[0;31m print(\"C:\\Users\\Administrator\\Desktop\\Project\")\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \\UXXXXXXXX escape\n"
]
}
],
"source": [
"print(\"C:\\Users\\Administrator\\Desktop\\Project\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"A simple solution would be to escape the escape character with a *second* backslash `\"\\\"`."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"C:\\Programs\\new_application\n"
]
}
],
"source": [
"print(\"C:\\\\Programs\\\\new_application\")"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"C:\\Users\\Administrator\\Desktop\\Project\n"
]
}
],
"source": [
"print(\"C:\\\\Users\\\\Administrator\\\\Desktop\\\\Project\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"However, this is tedious to remember and type. For such use cases, Python allows to prefix any string literal with a `r`. The literal is then interpreted in a \"raw\" way."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"C:\\Programs\\new_application\n"
]
}
],
"source": [
"print(r\"C:\\Programs\\new_application\")"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"C:\\Users\\Administrator\\Desktop\\Project\n"
]
}
],
"source": [
"print(r\"C:\\Users\\Administrator\\Desktop\\Project\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Characters are Numbers with a Convention"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"So far, we used the term **character** without any further consideration. In this section, we briefly look into what characters are and how they are modeled in software.\n",
"\n",
"[Chapter 5 <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_nb.png\">](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/main/05_numbers/00_content.ipynb) gives us an idea on how individual **bits** are used to express all types of numbers, from \"simple\" `int` objects to \"complex\" `float` ones. To model characters, another **layer of abstraction** is put on top of whole numbers. So, just as bits are used to express integers, they themselves are used to express characters."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### ASCII"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Many conventions have been developed as to what integer is associated with which character. The most basic one that was also adopted around the world is the the so-called [American Standard Code for Information Interchange <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_wiki.png\">](https://en.wikipedia.org/wiki/ASCII), or **ASCII** for short. It uses 7 bits of information to map the unprintable control characters as well as the printable letters of the alphabet, numbers, and common symbols to the numbers `0` through `127`.\n",
"\n",
"A mapping from characters to numbers is referred to by the technical term **encoding**. We may use the built-in [ord() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/functions.html#ord) function to **encode** any single character. The inverse to that is the built-in [chr() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/functions.html#chr) function, which **decodes** a number into a character."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"65"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ord(\"A\")"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'A'"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chr(65)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Of course, unprintable escape sequences like `\"\\n\"` count as only *one* character."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"10"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ord(\"\\n\")"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'\\n'"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chr(10)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"In ASCII, the numbers `0` through `31` (and `127`) are mapped to all kinds of unprintable control characters. The decimal digits are encoded with the numbers `48` through `57`, the upper case letters with `65` through `90`, and the lower case letters with `97` through `122`. While this seems random as first, there is of course a \"sophisticated\" system behind it. That can immediately be seen when looking at the encoded numbers in their *binary* representations.\n",
"\n",
"For example, the digit `5` is mapped to the number `53` in ASCII. The binary representation of `53` is `0b_11_0101` and the least significant four bits, `0101`, mean $5$. Similarly, the letter `\"E\"` is the fifth letter in the alphabet. It is encoded with the number `69` in ASCII, which is `0b_100_0101` in binary. And, the least significant bits, `0_0101`, mean $5$. Analogously, `\"e\"` is encoded with `101` in ASCII, which is `0b_110_0101` in binary. And, the least significant bits, `0_0101`, mean $5$ again. This encoding was chosen mainly because programmers \"in the old days\" needed to implement these encodings \"by hand.\" Python abstracts that logic away from its users.\n",
"\n",
"This encoding scheme is also the cause for the \"weird\" sorting in the \"*String Comparison*\" section in the [first part <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_nb.png\">](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/main/06_text/01_content.ipynb#String-Comparison) of this chapter, where `\"apple\"` comes *after* `\"Banana\"`. As `\"a\"` is encoded with `97` and `\"B\"` with `66`, `\"Banana\"` must of course be \"smaller\" than `\"apple\"` when comparison is done in a pairwise fashion of the individual characters."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"48 0b110000 -> 0\n",
"49 0b110001 -> 1\n",
"50 0b110010 -> 2\n",
"51 0b110011 -> 3\n",
"52 0b110100 -> 4\n",
"53 0b110101 -> 5\n",
"54 0b110110 -> 6\n",
"55 0b110111 -> 7\n",
"56 0b111000 -> 8\n",
"57 0b111001 -> 9\n"
]
}
],
"source": [
"for number in range(48, 58):\n",
" print(number, bin(number), \"-> \", chr(number))"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"65 0b1000001 -> A\t66 0b1000010 -> B\t67 0b1000011 -> C\n",
"68 0b1000100 -> D\t69 0b1000101 -> E\t70 0b1000110 -> F\n",
"71 0b1000111 -> G\t72 0b1001000 -> H\t73 0b1001001 -> I\n",
"74 0b1001010 -> J\t75 0b1001011 -> K\t76 0b1001100 -> L\n",
"77 0b1001101 -> M\t78 0b1001110 -> N\t79 0b1001111 -> O\n",
"80 0b1010000 -> P\t81 0b1010001 -> Q\t82 0b1010010 -> R\n",
"83 0b1010011 -> S\t84 0b1010100 -> T\t85 0b1010101 -> U\n",
"86 0b1010110 -> V\t87 0b1010111 -> W\t88 0b1011000 -> X\n",
"89 0b1011001 -> Y\t90 0b1011010 -> Z\t"
]
}
],
"source": [
"for i, number in enumerate(range(65, 91), start=1):\n",
" end = \"\\n\" if i % 3 == 0 else \"\\t\"\n",
" print(number, bin(number), \"-> \", chr(number), end=end)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 97 0b1100001 -> a\t 98 0b1100010 -> b\t 99 0b1100011 -> c\n",
"100 0b1100100 -> d\t101 0b1100101 -> e\t102 0b1100110 -> f\n",
"103 0b1100111 -> g\t104 0b1101000 -> h\t105 0b1101001 -> i\n",
"106 0b1101010 -> j\t107 0b1101011 -> k\t108 0b1101100 -> l\n",
"109 0b1101101 -> m\t110 0b1101110 -> n\t111 0b1101111 -> o\n",
"112 0b1110000 -> p\t113 0b1110001 -> q\t114 0b1110010 -> r\n",
"115 0b1110011 -> s\t116 0b1110100 -> t\t117 0b1110101 -> u\n",
"118 0b1110110 -> v\t119 0b1110111 -> w\t120 0b1111000 -> x\n",
"121 0b1111001 -> y\t122 0b1111010 -> z\t"
]
}
],
"source": [
"for i, number in enumerate(range(97, 123), start=1):\n",
" end = \"\\n\" if i % 3 == 0 else \"\\t\"\n",
" print(str(number).rjust(3), bin(number), \"-> \", chr(number), end=end)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"The remaining `symbols` encoded in ASCII are encoded with the numbers still unused, which is why they are scattered."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"symbols = (\n",
" list(range(32, 48))\n",
" + list(range(58, 65))\n",
" + list(range(91, 97))\n",
" + list(range(123, 127))\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 32 0b100000 -> \t 33 0b100001 -> !\t 34 0b100010 -> \"\n",
" 35 0b100011 -> #\t 36 0b100100 -> $\t 37 0b100101 -> %\n",
" 38 0b100110 -> &\t 39 0b100111 -> '\t 40 0b101000 -> (\n",
" 41 0b101001 -> )\t 42 0b101010 -> *\t 43 0b101011 -> +\n",
" 44 0b101100 -> ,\t 45 0b101101 -> -\t 46 0b101110 -> .\n",
" 47 0b101111 -> /\t 58 0b111010 -> :\t 59 0b111011 -> ;\n",
" 60 0b111100 -> <\t 61 0b111101 -> =\t 62 0b111110 -> >\n",
" 63 0b111111 -> ?\t 64 0b1000000 -> @\t 91 0b1011011 -> [\n",
" 92 0b1011100 -> \\\t 93 0b1011101 -> ]\t 94 0b1011110 -> ^\n",
" 95 0b1011111 -> _\t 96 0b1100000 -> `\t123 0b1111011 -> {\n",
"124 0b1111100 -> |\t125 0b1111101 -> }\t126 0b1111110 -> ~\n"
]
}
],
"source": [
"for i, number in enumerate(symbols, start=1):\n",
" end = \"\\n\" if i % 3 == 0 else \"\\t\"\n",
" print(str(number).rjust(3), bin(number).rjust(10), \"-> \", chr(number), end=end)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"As the ASCII character set does not work for many languages other than English, various encodings were developed. Popular examples are [ISO 8859-1 <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_wiki.png\">](https://en.wikipedia.org/wiki/ISO/IEC_8859-1) for western European letters or [Windows 1250 <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_wiki.png\">](https://en.wikipedia.org/wiki/Windows-1250) for Latin ones. Many of these encodings use 8-bit numbers (i.e., `0` through `255`) to map the multitude of non-English letters (e.g., the German [umlauts <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_wiki.png\">](https://en.wikipedia.org/wiki/Umlaut_%28linguistics%29) `\"ä\"`, `\"ö\"`, `\"ü\"`, or `\"ß\"`)."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Unicode"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"However, none of these specialized encodings can map *all* characters of *all* languages around the world from *all* times in human history. To achieve that, a truly global standard called **[Unicode <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_wiki.png\">](https://en.wikipedia.org/wiki/Unicode)** was developed and its first version released in 1991. Since then, Unicode has been amended with many other \"characters.\" The most popular among them being [emojis <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_wiki.png\">](https://en.wikipedia.org/wiki/Emoji) or the [Klingon <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_wiki.png\">](https://en.wikipedia.org/wiki/Klingon_scripts) language (from the science fiction series [Star Trek <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_wiki.png\">](https://en.wikipedia.org/wiki/Star_Trek)). In Unicode, every character is given an identity referred to as the **code point**. Code points are hexadecimal numbers from `0x0000` through `0x10ffff`, written as U+0000 and U+10FFFF outside of Python. Consequently, there exist at most $1,114,112$ code points, of which only about 10% are currently in use, allowing lots of room for new characters to be invented. The first `127` code points are identical to the ASCII encoding for reasons explained in the \"*The `bytes` Type*\" section further below. There exist plenty of lists of all Unicode characters on the web (e.g., [Wikipedia <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_wiki.png\">](https://en.wikipedia.org/wiki/List_of_Unicode_characters)).\n",
"\n",
"All we need to know to print a character is its code point. Python uses the escape sequence `\"\\U\"` that is followed by eight hexadecimal digits. Underscore separators are unfortunately *not* allowed here.\n",
"\n",
"So, to print a smiley, we just need to look up the corresponding number (e.g., [here <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_wiki.png\">](https://en.wikipedia.org/wiki/Emoji#Unicode_blocks))."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'😄'"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\"\\U0001f604\""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Every Unicode character also has a descriptive name that we can use with the escape sequence `\"\\N\"` and within curly braces `{}`."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'😂'"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\"\\N{FACE WITH TEARS OF JOY}\""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Whenever the code point can be expressed with just four hexadecimal digits, we may use the escape sequence `\"\\u\"` for brevity."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'A'"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\"\\U00000041\" # hex(65) == 0x41"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'A'"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\"\\u0041\""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Analogously, if the code point can be expressed with two hexadecimal digits, we may use the escape sequence `\"\\x\"` for even conciser code."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'A'"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\"\\x41\""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"As the `str` type is based on Unicode, a `str` object's behavior is more in line with how humans view text and not how it is expressed in source code.\n",
"\n",
"For example, while it is obvious that `len(\"A\")` evaluates to `1`, ..."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(\"A\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"... what should `len(\"\\N{SNAKE}\")` evaluate to? As the idea of a snake is expressed as *one* \"character,\" [len() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/functions.html#len) also returns `1` here."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'🐍'"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\"\\N{SNAKE}\""
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(\"\\N{SNAKE}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Many of the built-in `str` methods also consider Unicode. For example, in contrast to [lower() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#str.lower), the [casefold() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#str.casefold) method knows that the German `\"ß\"` is commonly converted to `\"ss\"`. So, when searching for exact matches, normalizing text with [casefold() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#str.casefold) may yield better results than with [lower() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#str.lower)."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'straße'"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\"Straße\".lower()"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'strasse'"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\"Straße\".casefold()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Many other methods like [isdecimal() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#str.isdecimal), [isdigit() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#str.isdigit), [isnumeric() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#str.isnumeric), [isprintable() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#str.isprintable), [isidentifier() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#str.isidentifier), and many more may be worthwhile to know for the data science practitioner, especially when it comes to data cleaning."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Multi-line Strings"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Sometimes, it is convenient to split text across multiple lines in source code. For example, to make lines fit into the 79 characters requirement of [PEP 8 <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://www.python.org/dev/peps/pep-0008/) or because the text consists of many lines and typing out `\"\\n\"` is tedious. However, using single double quotes `\"` around multiple lines results in a `SyntaxError`."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"ename": "SyntaxError",
"evalue": "unterminated string literal (detected at line 1) (2682216481.py, line 1)",
"output_type": "error",
"traceback": [
"\u001b[0;36m Cell \u001b[0;32mIn[34], line 1\u001b[0;36m\u001b[0m\n\u001b[0;31m \"\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m unterminated string literal (detected at line 1)\n"
]
}
],
"source": [
"\"\n",
"Do not break the lines like this\n",
"\""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Instead, we may enclose a string literal with either **triple double** quotes `\"\"\"` or **triple single** quotes `'''`. Then, newline characters in the source code are converted into `\"\\n\"` characters in the resulting `str` object. Docstrings are precisely that, and, by convention, always written within triple double quotes `\"\"\"`."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"multi_line = \"\"\"\n",
"I am a multi-line string\n",
"consisting of four lines.\n",
"\"\"\""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"A caveat is that `\"\\n\"` characters are often inserted at the beginning or end of the text when we try to format the source code nicely."
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'\\nI am a multi-line string\\nconsisting of four lines.\\n'"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"multi_line"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"I am a multi-line string\n",
"consisting of four lines.\n",
"\n"
]
}
],
"source": [
"print(multi_line)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Using the [split() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#str.split) method with the optional `sep` argument, we confirm that `multi_line` consists of *four* lines with the first and last line being empty."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 \n",
"2 I am a multi-line string\n",
"3 consisting of four lines.\n",
"4 \n"
]
}
],
"source": [
"for i, line in enumerate(multi_line.split(\"\\n\"), start=1):\n",
" print(i, line)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"To mitigate that, we often see the [strip() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#bytes.strip) method in source code."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"multi_line = \"\"\"\n",
"I am a multi-line string\n",
"consisting of two lines.\n",
"\"\"\".strip()"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 I am a multi-line string\n",
"2 consisting of two lines.\n"
]
}
],
"source": [
"for i, line in enumerate(multi_line.split(\"\\n\"), start=1):\n",
" print(i, line)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## The `bytes` Type"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"To end this chapter, we want to briefly look at the `bytes` data type, which conceptually is a sequence of bytes. That data format is probably one of the most generic ways of exchanging data between any two programs or computers (e.g., a web browser obtains its data from a web server in this format).\n",
"\n",
"Let's open a binary file in read-only mode (i.e., `mode=\"rb\"`) and read in all of its contents."
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"with open(\"full_house.bin\", mode=\"rb\") as binary_file:\n",
" data = binary_file.read()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"`data` is an object of type `bytes`."
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"139880714782512"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"id(data)"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"bytes"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(data)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"It's value is given out in the literal bytes notation with a `b` prefix (cf., the [reference <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals)). Every byte is expressed in hexadecimal representation with the escape sequence `\"\\x\"`. This representation is commonly chosen as we can *not* tell what kind of information is hidden in the `data` by just looking at the bytes. Instead, we must be told by some other source how to **decode** the raw bytes into information we can interpret."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"b'\\xf0\\x9f\\x82\\xa7\\xf0\\x9f\\x82\\xb7\\xf0\\x9f\\x83\\x97\\xf0\\x9f\\x83\\x8e\\xf0\\x9f\\x83\\x9e'"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"`bytes` objects work like `str` objects in many ways. In particular, they are *sequences* as well: The number of bytes is *finite* and we may *iterate* over them in *order*."
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"20"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(data)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Consisting of 8 bits, a single byte can always be interpreted as a whole number between `0` through `255`. That is exactly what we see when we loop over the `data` ..."
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"240 159 130 167 240 159 130 183 240 159 131 151 240 159 131 142 240 159 131 158 "
]
}
],
"source": [
"for byte in data:\n",
" print(byte, end=\" \")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"... or index into them."
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"158"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[-1]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Slicing returns another `bytes` object."
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"b'\\xf0\\x82\\xf0\\x82\\xf0\\x83\\xf0\\x83\\xf0\\x83'"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[::2]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Character Encodings"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Luckily, `data` consists of bytes encoded with the [UTF-8 <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_wiki.png\">](https://en.wikipedia.org/wiki/UTF-8) encoding. That is the most common way of mapping a Unicode character's code point to a sequence of bytes.\n",
"\n",
"To obtain a `str` object out of a given `bytes` object, we decode it with the `bytes` type's [decode() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#bytes.decode) method."
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"cards = data.decode()"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"str"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(cards)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"So, `data` consisted of a [full house <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_wiki.png\">](https://en.wikipedia.org/wiki/List_of_poker_hands#Full_house) hand in a poker game."
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'🂧🂷🃗🃎🃞'"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cards"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"To go the opposite direction and encode a given `str` object, we use the `str` type's [encode() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#str.encode) method."
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"place = \"Café Kastanientörtchen\""
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"b'Caf\\xc3\\xa9 Kastanient\\xc3\\xb6rtchen'"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"place.encode()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"By default, [encode() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#str.encode) and [decode() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#bytes.decode) use an `encoding=\"utf-8\"` argument. We may use another encoding like, for example, `\"iso-8859-1\"`, which can deal with ASCII and western European letters."
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"b'Caf\\xe9 Kastanient\\xf6rtchen'"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"place.encode(\"iso-8859-1\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"However, we must use the *same* encoding for the decoding step as for the encoding step. Otherwise, a `UnicodeDecodeError` is raised."
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"ename": "UnicodeDecodeError",
"evalue": "'utf-8' codec can't decode byte 0xe9 in position 3: invalid continuation byte",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mUnicodeDecodeError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[55], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mplace\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mencode\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43miso-8859-1\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m)\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mdecode\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n",
"\u001b[0;31mUnicodeDecodeError\u001b[0m: 'utf-8' codec can't decode byte 0xe9 in position 3: invalid continuation byte"
]
}
],
"source": [
"place.encode(\"iso-8859-1\").decode()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Not all encodings map all Unicode code points. For example `\"iso-8859-1\"` does not know Czech letters. Below, [encode() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/stdtypes.html#str.encode) raises a `UnicodeEncodeError` because of that."
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"ename": "UnicodeEncodeError",
"evalue": "'latin-1' codec can't encode character '\\u0159' in position 12: ordinal not in range(256)",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mUnicodeEncodeError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[56], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mDobrý den, přátelé!\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mencode\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43miso-8859-1\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m)\u001b[49m\n",
"\u001b[0;31mUnicodeEncodeError\u001b[0m: 'latin-1' codec can't encode character '\\u0159' in position 12: ordinal not in range(256)"
]
}
],
"source": [
"\"Dobrý den, přátelé!\".encode(\"iso-8859-1\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Reading Files (continued)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"The [open() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/functions.html#open) function takes an optional `encoding` argument as well."
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"ename": "UnicodeDecodeError",
"evalue": "'utf-8' codec can't decode byte 0xe4 in position 9: invalid continuation byte",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mUnicodeDecodeError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[57], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;28;01mwith\u001b[39;00m \u001b[38;5;28mopen\u001b[39m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mumlauts.txt\u001b[39m\u001b[38;5;124m\"\u001b[39m) \u001b[38;5;28;01mas\u001b[39;00m file:\n\u001b[0;32m----> 2\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;241m.\u001b[39mjoin(\u001b[43mfile\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mreadlines\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m))\n",
"File \u001b[0;32m<frozen codecs>:322\u001b[0m, in \u001b[0;36mdecode\u001b[0;34m(self, input, final)\u001b[0m\n",
"\u001b[0;31mUnicodeDecodeError\u001b[0m: 'utf-8' codec can't decode byte 0xe4 in position 9: invalid continuation byte"
]
}
],
"source": [
"with open(\"umlauts.txt\") as file:\n",
" print(\"\".join(file.readlines()))"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Lerchen-Lärchen-Ähnlichkeiten\n",
"fehlen. Dieses abzustreiten\n",
"mag im Klang der Worte liegen.\n",
"Merke, eine Lerch' kann fliegen,\n",
"Lärchen nicht, was kaum verwundert,\n",
"denn nicht eine unter hundert\n",
"ist geflügelt. Auch im Singen\n",
"sind die Bäume zu bezwingen.\n",
"Die Bätrachtung sollte reichen,\n",
"Rächtschreibfählern auszuweichen.\n",
"Leicht gälingt's, zu unterscheiden,\n",
"wär ist wär nun von dän beiden.\n"
]
}
],
"source": [
"with open(\"umlauts.txt\", encoding=\"iso-8859-1\") as file:\n",
" print(\"\".join(file.readlines()))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Best Practice: Use UTF-8 explicitly"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"A best practice is to *always* specify the `encoding`, especially on computers running on Windows (cf., the talk by Łukasz Langa in the [Further Resources <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_nb.png\">](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/main/06_text/05_resources.ipynb#Unicode)) section at the end of this chapter.\n",
"\n",
"Below is the first example involving [open() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](https://docs.python.org/3/library/functions.html#open) one last time: It shows how *all* the contents of a text file should be read into one `str` object."
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"with open(\"lorem_ipsum.txt\", encoding=\"utf-8\") as file:\n",
" content = \"\".join(file.readlines())"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"\"Lorem Ipsum is simply dummy text of the printing and typesetting industry.\\nLorem Ipsum has been the industry's standard dummy text ever since the 1500s\\nwhen an unknown printer took a galley of type and scrambled it to make a type\\nspecimen book. It has survived not only five centuries but also the leap into\\nelectronic typesetting, remaining essentially unchanged. It was popularised in\\nthe 1960s with the release of Letraset sheets.\\n\""
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"content"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Lorem Ipsum is simply dummy text of the printing and typesetting industry.\n",
"Lorem Ipsum has been the industry's standard dummy text ever since the 1500s\n",
"when an unknown printer took a galley of type and scrambled it to make a type\n",
"specimen book. It has survived not only five centuries but also the leap into\n",
"electronic typesetting, remaining essentially unchanged. It was popularised in\n",
"the 1960s with the release of Letraset sheets.\n",
"\n"
]
}
],
"source": [
"print(content)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.2"
},
"livereveal": {
"auto_select": "code",
"auto_select_fragment": true,
"scroll": true,
"theme": "serif"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": true,
"skip_h1_title": true,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {
"height": "calc(100% - 180px)",
"left": "10px",
"top": "150px",
"width": "384px"
},
"toc_section_display": false,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}