{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Chapter 6: Bytes & Text" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "In this chapter, we continue the study of the built-in data types. Building on our knowledge of numbers, the next layer consists of textual data that are modeled primarily with the `str` type in Python. `str` objects are naturally more \"complex\" than numerical objects as any text consists of an arbitrary and possibly large number of individual characters that may be chosen from any alphabet in the history of humankind. Luckily, Python abstracts away most of this complexity." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## The `str` Type" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The `str` type is the default way of modeling **textual data**. To create a `str` object, we use a **literal notation** and type the text between enclosing **double quotes** `\"`." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "school = \"WHU - Otto Beisheim School of Management\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Like everything in Python, `school` is an object." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "140133793916176" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "id(school)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "str" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(school)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "A `str` object evaluates to itself in a literal notation with enclosing **single quotes** `'` by default. In [Chapter 1](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/01_elements.ipynb#Value), we already specified the double quotes `\"` convention we stick to in this book. Yet, single quotes `'` and double quotes `\"` are *perfect* substitutes for all `str` objects that do *not* contain either of the two symbols. We could use the reverse convention as well.\n", "\n", "As [this discussion](https://stackoverflow.com/questions/56011/single-quotes-vs-double-quotes-in-python) shows, many programmers have *strong* opinions about that and make up *new* conventions for their projects. Consequently, the discussion was \"closed as not constructive\" by the moderators." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "'WHU - Otto Beisheim School of Management'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "school" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "As the single quote `'` is often used in the English language as an apostrophe (e.g., in contractions like *it's*), we could make an argument in favor of using the double quotes `\"`: There are possibly fewer situations like in the two code cells below, in which we must resort to using a `\\` to **escape** a single quote `'` in a text (cf., the [Special Characters](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/06_text.ipynb#Special-Characters) section further below). However, double quotes `\"` are often used in texts as well. So, this argument is not fully convincing.\n", "\n", "Many proponents of the single quote `'` usage claim that double quotes `\"` make more **visual noise** on the screen. This argument is also not convincing. On the contrary, one could claim that *two* single quotes `''` look so similar to *one* double quote `\"` that it might not be apparent right away what we are looking at. By sticking to double quotes `\"`, we avoid this danger of confusion.\n", "\n", "This discussion is an excellent example of a [flame war](https://en.wikipedia.org/wiki/Flaming_%28Internet%29#Flame_war) in the programming world that leads to *no* result.\n", "\n", "An *important* fact to know is that enclosing quotes of either kind are *not* part of the `str` object's *value*! They are merely *syntax* to make the text in a code cell a *literal* that Python converts into a `str` object upon reading." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "\"It's cool that strings are so versatile in Python!\"" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"It's cool that strings are so versatile in Python!\"" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "\"It's cool that strings are so versatile in Python!\"" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'It\\'s cool that strings are so versatile in Python!'" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "We can always use the [str()](https://docs.python.org/3/library/stdtypes.html#str) built-in to cast non-`str` objects as a `str`." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "'123'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "str(123)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Another common situation where we obtain `str` objects is when reading the contents of a file with the [open()](https://docs.python.org/3/library/functions.html#open) built-in. In its simplest form, to open a [text file](https://en.wikipedia.org/wiki/Text_file) in read-only mode, we pass in its path (i.e., \"filename\") as a `str` object." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "file = open(\"lorem_ipsum.txt\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "[open()](https://docs.python.org/3/library/functions.html#open) returns a **[proxy](https://en.wikipedia.org/wiki/Proxy_pattern)** object of type `TextIOWrapper` that allows us to interact with the file on disk." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "<_io.TextIOWrapper name='lorem_ipsum.txt' mode='r' encoding='UTF-8'>" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "file" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "_io.TextIOWrapper" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(file)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "While `file` provides, for example, the [read()](https://docs.python.org/3/library/io.html#io.TextIOBase.read), [readline()](https://docs.python.org/3/library/io.html#io.TextIOBase.readline), and [readlines()](https://docs.python.org/3/library/io.html#io.IOBase.readlines) methods to access its contents, it is also *iterable*, and we may loop over the individual lines with a `for` statement." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Lorem Ipsum is simply dummy text of the printing and typesetting industry.\n", "\n", "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s\n", "\n", "when an unknown printer took a galley of type and scrambled it to make a type\n", "\n", "specimen book. It has survived not only five centuries but also the leap into\n", "\n", "electronic typesetting, remaining essentially unchanged. It was popularised in\n", "\n", "the 1960s with the release of Letraset sheets.\n", "\n" ] } ], "source": [ "for line in file:\n", "    print(line)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Once we have looped over `file` the first time, it is **exhausted**: That means we do not see any output if we loop over it another time." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "for line in file:\n", "    print(line)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "After the `for`-loop, `line` is still set to the last line in the file, and we verify that it is indeed a `str` object." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "'the 1960s with the release of Letraset sheets.\\n'" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "line" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "str" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(line)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "An important fact is that `file` is still associated with an *open* **[file descriptor](https://en.wikipedia.org/wiki/File_descriptor)**. Without going into any technical details, we note that an operating system can only handle a limited number of \"open files\" at the same time, and, therefore, we should always *close* the file once we are done processing it.\n", "\n", "`file` has a `closed` attribute on it that shows us if a file descriptor is open or closed, and with the [close()](https://docs.python.org/3/library/io.html#io.IOBase.close) method, we can \"manually\" close it." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "file.closed" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "file.close()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "file.closed" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The more Pythonic way is to open a file with the `with` statement (cf., [reference](https://docs.python.org/3/reference/compound_stmts.html#the-with-statement)): The indented code block is said to be executed in the **context** of the header line that acts as a **[context manager](https://docs.python.org/3/reference/datamodel.html?highlight=context%20manager#with-statement-context-managers)**. Such objects may have many different purposes. Here, the context manager created with `with open(...) as file:` mainly ensures that the file descriptor gets automatically closed after the last line in the code block is executed." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Lorem Ipsum is simply dummy text of the printing and typesetting industry.\n", "\n", "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s\n", "\n", "when an unknown printer took a galley of type and scrambled it to make a type\n", "\n", "specimen book. It has survived not only five centuries but also the leap into\n", "\n", "electronic typesetting, remaining essentially unchanged. 
It was popularised in\n", "\n", "the 1960s with the release of Letraset sheets.\n", "\n" ] } ], "source": [ "with open(\"lorem_ipsum.txt\") as file:\n", "    for line in file:\n", "        print(line)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "file.closed" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "To use constructs familiar from [Chapter 3](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/03_conditionals.ipynb#The-try-Statement) to explain what `with open(...) as file:` does, below is a formulation with a `try` statement *equivalent* to the `with` statement above. The `finally`-branch is always executed, even if an exception is raised in the `for`-loop. So, `file` is sure to be closed here as well, albeit with a somewhat less expressive formulation." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Lorem Ipsum is simply dummy text of the printing and typesetting industry.\n", "\n", "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s\n", "\n", "when an unknown printer took a galley of type and scrambled it to make a type\n", "\n", "specimen book. It has survived not only five centuries but also the leap into\n", "\n", "electronic typesetting, remaining essentially unchanged. 
It was popularised in\n", "\n", "the 1960s with the release of Letraset sheets.\n", "\n" ] } ], "source": [ "file = open(\"lorem_ipsum.txt\")\n", "try:\n", "    for line in file:\n", "        print(line)\n", "finally:\n", "    file.close()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "file.closed" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "A subtlety to notice is that an empty line is printed after each `line`. That is because each `line` already ends with a `\"\\n\"`, which results in a line break and is explained further below, and [print()](https://docs.python.org/3/library/functions.html#print) appends another line break by default. To print the text without empty lines in between, we pass an `end=\"\"` argument to the [print()](https://docs.python.org/3/library/functions.html#print) function." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Lorem Ipsum is simply dummy text of the printing and typesetting industry.\n", "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s\n", "when an unknown printer took a galley of type and scrambled it to make a type\n", "specimen book. It has survived not only five centuries but also the leap into\n", "electronic typesetting, remaining essentially unchanged. 
It was popularised in\n", "the 1960s with the release of Letraset sheets.\n" ] } ], "source": [ "with open(\"lorem_ipsum.txt\") as file:\n", "    for line in file:\n", "        print(line, end=\"\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## A \"String\" of Characters" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The idea of a **sequence** is yet another *abstract* concept.\n", "\n", "It unifies *four* [orthogonal](https://en.wikipedia.org/wiki/Orthogonality) (i.e., \"independent\") *abstract* concepts into one: Any *concrete* data type, such as `str`, is considered a sequence if it simultaneously\n", "\n", "1. **contains** other \"things,\"\n", "2. is **iterable**, and\n", "3. comes with a *predictable* **order** of its\n", "4. **finite** number of elements.\n", "\n", "[Chapter 7](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/07_sequences.ipynb#Collections-vs.-Sequences) formalizes sequences in great detail. Here, we keep our focus on the `str` type that historically received its name as it models a \"**[string of characters](https://en.wikipedia.org/wiki/String_%28computer_science%29)**,\" and a \"string\" is more formally called a sequence in the computer science literature.\n", "\n", "Behaving like a sequence, `str` objects may be treated like `list` objects in many cases. For example, the built-in [len()](https://docs.python.org/3/library/functions.html#len) function tells us how many elements (i.e., characters) make up `school`. 
[len()](https://docs.python.org/3/library/functions.html#len) would not work on an \"*infinite*\" object: As anything modeled in a program must fit into a computer's finite memory at runtime, there cannot exist objects containing a truly infinite number of elements; however, [Chapter 7](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/07_sequences.ipynb#Collections-vs.-Sequences#Mapping) introduces *concrete* iterable data types that can be used to model an *infinite* series of elements and that, consequently, have no concept of \"length.\"" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "40" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(school)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Being iterable, we can iterate over a `str` object, for example, with a `for`-loop, and do something with the individual characters, for example, print them out with extra space in between them.\n", "\n", "[Chapter 4](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/04_iteration.ipynb#Containers-vs.-Iterables) already shows that we can loop over *different* concrete types: The example there first loops over the `list` object `[0, 1, 2, 3, 4]` and then the `range` object `range(5)`, with the same outcome. Now, we add the `str` type to the list of *concrete* types we can loop over. *Abstractly* speaking, all three are *iterable*, and there are many more iterable types in Python." 
] }, { "cell_type": "code", "execution_count": 24, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "W H U - O t t o B e i s h e i m S c h o o l o f M a n a g e m e n t " ] } ], "source": [ "for letter in school:\n", " print(letter, end=\" \")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Being a container, we can check if a given object is a member of a sequence with the `in` operator. In the context of `str` objects, the `in` operator has *two* usages: First, it checks if a *single* character is contained in a `str` object. Second, it may also check if a shorter `str` object, then called a **substring**, is contained in a longer one." ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"O\" in school" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"WHU\" in school" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"EBS\" in school" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Indexing" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "As `str` objects have the additional property of being *ordered*, we may **index** into them to obtain individual characters with the **indexing operator** `[]`. 
This is analogous to how we obtained individual elements of a `list` object in [Chapter 1](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/01_elements.ipynb#Who-am-I?-And-how-many?)." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "'W'" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "school[0]" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "'H'" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "school[1]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The index must be of type `int`." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "ename": "TypeError", "evalue": "string indices must be integers", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mschool\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1.0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mTypeError\u001b[0m: string indices must be integers" ] } ], "source": [ "school[1.0]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The last index is one less than the above \"length\" of the `str` object as we start counting at 0." 
] }, { "cell_type": "code", "execution_count": 31, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "'t'" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "school[39]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "An `IndexError` is raised whenever the index is too large." ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "ename": "IndexError", "evalue": "string index out of range", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mIndexError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mschool\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m40\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mIndexError\u001b[0m: string index out of range" ] } ], "source": [ "school[40]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "We may use *negative* indexes to start counting from the end of the `str` object." ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "'t'" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "school[-1]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "One reason why programmers like to start counting at 0 is that the absolute values of a positive index and its *corresponding* negative index always add up to the length of the sequence (e.g., $6 + 34 = 40$ for the indexes `6` and `-34` below)." 
] }, { "cell_type": "code", "execution_count": 34, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'O'" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "school[6]" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "'O'" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "school[-34]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Slicing" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "A **slice** is a substring of a `str` object.\n", "\n", "The **slicing operator** is a generalization of the indexing operator: We can put one, two, or three integers within the brackets, separated by colons `:`. The three integers are then referred to as the *start*, *end*, and *step* values." ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "'WHU'" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "school[0:3]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Whereas the *start* is always included in the result, the *end* is not. Counter-intuitive at first, this makes working with individual slices easier as they add up to the original `str` object again. As the *end* is not included, we must end the second slice with `len(school)` or `40` below." 
] }, { "cell_type": "code", "execution_count": 37, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'WHU - Otto Beisheim School of Management'" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "school[0:3] + school[3:40]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "For convenience, the indexes do not need to lie in the range from 0 to the `str` object's \"length\" when slicing. This is *not* the case for indexing as the `IndexError` above shows." ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'WHU - Otto Beisheim School of Management'" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "school[0:999]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Commonly, we leave out the `0` for the *start* and the *end* if it is equal to the \"length.\"" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'WHU - Otto Beisheim School of Management'" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "school[:3] + school[3:]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Slicing makes it easy to obtain shorter versions of the original `str` object." 
] }, { "cell_type": "code", "execution_count": 40, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/plain": [ "'WHU Otto Beisheim School'" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "school[:3] + school[5:26]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "A *step* value of $i$ can be used to obtain only every $i$th letter." ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "'WU-Ot esemSho fMngmn'" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "school[::2]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "A negative *step* size reverses the order of the characters." ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'tnemeganaM fo loohcS miehsieB ottO - UHW'" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "school[::-1]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Immutability" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Whereas elements of a `list` object *may* be *re-assigned*, as briefly hinted at in [Chapter 1](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/01_elements.ipynb#Who-am-I?-And-how-many?), this is *not* allowed for `str` objects. Once created, they *cannot* be *changed*. Formally, we say that they are **immutable**. 
In that regard, `str` objects and all the numerical types in [Chapter 5](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/05_numbers.ipynb) are alike.\n", "\n", "In contrast, objects that may be changed after creation are called **mutable**. We already saw in [Chapter 1](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/01_elements.ipynb#Who-am-I?-And-how-many?) how mutable objects are more difficult to reason about for a beginner, in particular, if more than *one* variable points to one. Yet, mutability does have its place in a programmer's toolbox, and we revisit this idea in the next chapters.\n", "\n", "`str` objects are *immutable* as the `TypeError` indicates: Assignment to an index is *not* supported by this type." ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "ename": "TypeError", "evalue": "'str' object does not support item assignment", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mschool\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m\"E\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mTypeError\u001b[0m: 'str' object does not support item assignment" ] } ], "source": [ "school[0] = \"E\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The only thing we can do is to create a *new* `str` object in memory." 
] }, { "cell_type": "code", "execution_count": 44, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "new_school = \"EBS\" + school[3:]" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'EBS - Otto Beisheim School of Management'" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_school" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "140133784610832" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "id(new_school)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "140133793916176" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "id(school)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## String Operations" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "As mentioned before, the `+` and `*` operators are *overloaded* for `str` objects: `+` performs **string concatenation**, and `*` repeats a `str` object a given number of times." 
] }, { "cell_type": "code", "execution_count": 48, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "greeting = \"Hello \"" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "'Hello WHU'" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "greeting + school[:3]" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "'WHU WHU WHU WHU WHU WHU WHU WHU WHU WHU '" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "10 * school[:4]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## String Methods" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Objects of type `str` come with many **methods** bound on them (cf., the [documentation](https://docs.python.org/3/library/stdtypes.html#string-methods) for a full list). As seen before, they work like *normal* functions and are accessed via the **dot operator** `.`. Calling a method is also referred to as **method invocation**.\n", "\n", "The [find()](https://docs.python.org/3/library/stdtypes.html#str.find) method returns the index of the first occurrence of a character or a substring. If no match is found, it returns `-1`." 
] }, { "cell_type": "code", "execution_count": 51, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "6" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "school.find(\"O\")" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "-1" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "school.find(\"Z\")" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "11" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "school.find(\"Beisheim\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "[find()](https://docs.python.org/3/library/stdtypes.html#str.find) takes optional *start* and *end* arguments that allow us to find occurrences other than the first one." 
] }, { "cell_type": "code", "execution_count": 54, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "12" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "school.find(\"e\")" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "16" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "school.find(\"e\", 13) # 13 not 12 as otherwise the same character is found" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "-1" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "school.find(\"e\", 13, 15) # \"e\" does not occur in the specified slice" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "[count()](https://docs.python.org/3/library/stdtypes.html#str.count) returns the number of non-overlapping occurrences of a character or a substring." ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "school.count(\"o\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "As [count()](https://docs.python.org/3/library/stdtypes.html#str.count) is *case-sensitive*, we must **chain** it with the [lower()](https://docs.python.org/3/library/stdtypes.html#str.lower) method to get the count of all \"O\"s and \"o\"s." 
] }, { "cell_type": "code", "execution_count": 58, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "5" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "school.lower().count(\"o\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Alternatively, we may use the [upper()](https://docs.python.org/3/library/stdtypes.html#str.upper) method and search for \"O\"s." ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/plain": [ "5" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "school.upper().count(\"O\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Because `str` objects are *immutable*, the methods never change an object in place but return their results as *new* objects. As the example below shows, this may be the case even if a method does *not* change the value of the `str` object at all." 
] }, { "cell_type": "code", "execution_count": 60, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "example = \"test\"" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "140133891261864" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "id(example)" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "lower = example.lower()" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "140133784910848" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "id(lower)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "`example` and `lower` are *different* objects with the *same* value." ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example is lower" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example == lower" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Another popular string method is [split()](https://docs.python.org/3/library/stdtypes.html#str.split): It separates a longer `str` object into a list of smaller ones. By default, groups of whitespace are used as the *separator*." 
] }, { "cell_type": "code", "execution_count": 66, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WHU\n", "-\n", "Otto\n", "Beisheim\n", "School\n", "of\n", "Management\n" ] } ], "source": [ "for word in school.split():\n", " print(word)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The opposite of splitting is done with the [join()](https://docs.python.org/3/library/stdtypes.html#str.join) method. It is typically invoked on a `str` object that represents a separator (e.g., `\" \"` or `\", \"`) and connects the elements of an *iterable* argument passed in (e.g., `words` below) into one *new* `str` object." ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "words = [\"This\", \"will\", \"become\", \"a\", \"sentence\"]" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "sentence = \" \".join(words)" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "'This will become a sentence'" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sentence" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "As the `str` object `\"abcde\"` below is an *iterable* itself, its elements (i.e., characters) are joined together with a space `\" \"` in between." 
] }, { "cell_type": "code", "execution_count": 70, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'a b c d e'" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\" \".join(\"abcde\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The [replace()](https://docs.python.org/3/library/stdtypes.html#str.replace) method creates a *new* `str` object with parts of the text replaced." ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "'This is a sentence'" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sentence.replace(\"will become\", \"is\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## String Comparison" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The *relational* operators also work with `str` objects, another example of operator overloading. Comparison is done one character at a time until the first pair differs or one operand ends. However, `str` objects are sorted in a \"weird\" way. The reason for this is that computers store characters internally as numbers (i.e., $0$s and $1$s), and it is these numbers that are compared. Depending on the character encoding, these numbers vary. Commonly, the characters and symbols used in the English language are encoded with the numbers 0 through 127, the so-called [ASCII standard](https://en.wikipedia.org/wiki/ASCII). However, Python works with the more general [Unicode/UTF-8 standard](https://en.wikipedia.org/wiki/UTF-8) that covers the characters of virtually all writing systems in use today, and even emojis." 
] }, { "cell_type": "code", "execution_count": 72, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "A = \"Apple\" # ignore snake_case for variable names in this example\n", "a = \"apple\"\n", "B = \"Banana\"" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A < B" ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a < B" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "One way to fix this is to only compare lower-cased `str` objects." ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a < B.lower()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "To provide a simple intuition for the \"weird\" sorting above, let's think of the English alphabet as being represented by the numbers as listed below. Then `\"Banana\"` is clearly \"smaller\" than `\"apple\"` because `\"B\"` is represented by 66 and `\"a\"` by 97." 
] }, { "cell_type": "code", "execution_count": 76, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A -> 65 \t a -> 97\n", "B -> 66 \t b -> 98\n", "C -> 67 \t c -> 99\n", "D -> 68 \t d -> 100\n", "E -> 69 \t e -> 101\n", "F -> 70 \t f -> 102\n", "G -> 71 \t g -> 103\n", "H -> 72 \t h -> 104\n", "I -> 73 \t i -> 105\n", "J -> 74 \t j -> 106\n", "K -> 75 \t k -> 107\n", "L -> 76 \t l -> 108\n", "M -> 77 \t m -> 109\n", "N -> 78 \t n -> 110\n", "O -> 79 \t o -> 111\n", "P -> 80 \t p -> 112\n", "Q -> 81 \t q -> 113\n", "R -> 82 \t r -> 114\n", "S -> 83 \t s -> 115\n", "T -> 84 \t t -> 116\n", "U -> 85 \t u -> 117\n", "V -> 86 \t v -> 118\n", "W -> 87 \t w -> 119\n", "X -> 88 \t x -> 120\n", "Y -> 89 \t y -> 121\n", "Z -> 90 \t z -> 122\n" ] } ], "source": [ "for upper_i in range(65, 91):\n", "    lower_i = upper_i + 32  # all the lower case characters are offset by 32\n", "    upper_char = chr(upper_i)  # from their upper case counterparts\n", "    lower_char = chr(lower_i)\n", "    print(f\"{upper_char} -> {upper_i} \\t {lower_char} -> {lower_i}\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## String Interpolation" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The previous code cell shows an example of a so-called **f-string**, as introduced by [PEP 498](https://www.python.org/dev/peps/pep-0498/) only in 2016, that is passed as the argument to the [print()](https://docs.python.org/3/library/functions.html#print) function.\n", "\n", "The \"f\" stands for \"formatted\", and we can think of the `str` object as a text \"draft\" that is filled in with values determined at runtime. This concept is formally called **string interpolation**, and there are three ways to achieve that in Python." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### f-strings" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "f-strings, formally called **[formatted string literals](https://docs.python.org/3/reference/lexical_analysis.html#formatted-string-literals)**, are the most recently added and most readable way: We prefix a string literal with an `f` and put variables, or more generally, expressions, within curly braces. These are then filled in when the literal is evaluated." ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "name = \"Alexander\"\n", "time_of_day = \"morning\"" ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "'Hello Alexander! Good morning.'" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f\"Hello {name}! Good {time_of_day}.\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Separated by a colon `:`, various formatting options are available. In the beginning, the ability to round may be particularly useful: This can be achieved by adding `:.2f` to the variable name inside the curly braces, which formats the number in fixed-point notation rounded to two decimal places. The `:.2f` is a so-called format specifier, and there exists a whole **[format specification mini-language](https://docs.python.org/3/library/string.html#formatspec)** to govern how specifiers work." 
] }, { "cell_type": "code", "execution_count": 79, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "pi = 3.141592653" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "'Pi is 3.14'" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f\"Pi is {pi:.2f}\"" ] }, { "cell_type": "code", "execution_count": 81, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "'Pi is 3.142'" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f\"Pi is {pi:.3f}\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### [format()](https://docs.python.org/3/library/stdtypes.html#str.format) Method" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "`str` objects also provide a [format()](https://docs.python.org/3/library/stdtypes.html#str.format) method that accepts an arbitrary number of *positional* arguments that are inserted into the `str` object in the same order, replacing the empty curly braces. String interpolation with the [format()](https://docs.python.org/3/library/stdtypes.html#str.format) method is the more traditional and probably still the most common way as of today. While f-strings are the recommended way going forward, usage of the [format()](https://docs.python.org/3/library/stdtypes.html#str.format) method is unlikely to decline any time soon." ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "'Hello Alexander! Good morning.'" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"Hello {}! 
Good {}.\".format(name, time_of_day)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "We may use index numbers inside the curly braces if the order is different in the `str` object." ] }, { "cell_type": "code", "execution_count": 83, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'Good morning, Alexander'" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"Good {1}, {0}\".format(name, time_of_day)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The [format()](https://docs.python.org/3/library/stdtypes.html#str.format) method may alternatively be used with *keyword* arguments as well. Then, we must put the keywords' names within the curly brackets." ] }, { "cell_type": "code", "execution_count": 84, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'Hello Alexander! Good morning.'" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"Hello {name}! Good {time}.\".format(name=name, time=time_of_day)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Format specifiers work as in the f-string case." 
] }, { "cell_type": "code", "execution_count": 85, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'Pi is 3.14'" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"Pi is {:.2f}\".format(pi)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "### `%` Operator" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The `%` operator that we saw in the context of modulo division in [Chapter 1](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/01_elements.ipynb#%28Arithmetic%29-Operators) is overloaded to perform string interpolation when its first operand is a `str` object. The second operand consists of all expressions to be filled in. Format specifiers work with a `%` instead of curly braces and according to a different set of rules referred to as **[printf-style string formatting](https://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting)**. So, `{:.2f}` becomes `%.2f`.\n", "\n", "This way of string interpolation is the oldest and originates from the [C language](https://en.wikipedia.org/wiki/C_%28programming_language%29). It is still widespread, but we should use one of the other two ways instead. We show it here mainly for completeness' sake." ] }, { "cell_type": "code", "execution_count": 86, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/plain": [ "'Pi is 3.14'" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"Pi is %.2f\" % pi" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "To insert more than one expression, we must list them in order and between parentheses `(` and `)`. 
As [Chapter 7](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/07_sequences.ipynb#The-tuple-Type) reveals, this literal syntax creates an object of type `tuple`. Also, to format an expression as text, we use the format specifier `%s`." ] }, { "cell_type": "code", "execution_count": 87, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/plain": [ "'Hello Alexander! Good morning.'" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"Hello %s! Good %s.\" % (name, time_of_day)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Special Characters" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Some symbols have a special meaning within `str` objects. Popular examples are the newline `\\n` and tab `\\t` \"characters.\" The backslash symbol `\\` is also referred to as an **escape character** in this context, indicating that the following character has a meaning other than its literal meaning.\n", "\n", "The built-in [print()](https://docs.python.org/3/library/functions.html#print) function then \"prints\" out these special characters accordingly." 
] }, { "cell_type": "code", "execution_count": 88, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is a sentence\n", "that is printed\n", "on three lines.\n" ] } ], "source": [ "print(\"This is a sentence\\nthat is printed\\non three lines.\")" ] }, { "cell_type": "code", "execution_count": 89, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Words\taligned\twith\ttabs.\n" ] } ], "source": [ "print(\"Words\\taligned\\twith\\ttabs.\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "As emojis are important as well, they can be inserted with the corresponding **Unicode code point** number starting with `\\U`. See this [list](https://en.wikipedia.org/wiki/List_of_Unicode_characters) of Unicode characters for an overview." ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "😄\n" ] } ], "source": [ "print(\"\\U0001f604\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Outside the [print()](https://docs.python.org/3/library/functions.html#print) function, the special characters are not treated any differently from non-special ones: When a `str` object is merely evaluated, its text representation shows them in their escaped form." 
] }, { "cell_type": "code", "execution_count": 91, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/plain": [ "'This is a sentence\\nthat is printed\\non three lines.'" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"This is a sentence\\nthat is printed\\non three lines.\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Raw Strings" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Sometimes we do *not* want the backslash `\\` and its following character to be interpreted as a special character.\n", "\n", "For example, let's print a typical installation path on a Windows system. Obviously, the newline character `\\n` does *not* make sense here." ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C:\\Programs\n", "ew_application\n" ] } ], "source": [ "print(\"C:\\Programs\\new_application\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Some string literals even produce a `SyntaxError` because the `\\U` *cannot* be interpreted as a Unicode code point." 
] }, { "cell_type": "code", "execution_count": 93, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "ename": "SyntaxError", "evalue": "(unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \\UXXXXXXXX escape (, line 1)", "output_type": "error", "traceback": [ "\u001b[0;36m File \u001b[0;32m\"\"\u001b[0;36m, line \u001b[0;32m1\u001b[0m\n\u001b[0;31m print(\"C:\\Users\\Administrator\\Desktop\\Project_Folder\")\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \\UXXXXXXXX escape\n" ] } ], "source": [ "print(\"C:\\Users\\Administrator\\Desktop\\Project_Folder\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "A simple solution would be to escape the escape character with a *second* backslash `\\`." ] }, { "cell_type": "code", "execution_count": 94, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C:\\Programs\\new_application\n" ] } ], "source": [ "print(\"C:\\\\Programs\\\\new_application\")" ] }, { "cell_type": "code", "execution_count": 95, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C:\\Users\\Administrator\\Desktop\\Project_Folder\n" ] } ], "source": [ "print(\"C:\\\\Users\\\\Administrator\\\\Desktop\\\\Project_Folder\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "However, this is tedious to remember and type. Luckily, Python allows treating any string literal as \"raw,\" and this is indicated in the string literal by the prefix `r`." 
] }, { "cell_type": "code", "execution_count": 96, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C:\\Programs\\new_application\n" ] } ], "source": [ "print(r\"C:\\Programs\\new_application\")" ] }, { "cell_type": "code", "execution_count": 97, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C:\\Users\\Administrator\\Desktop\\Project_Folder\n" ] } ], "source": [ "print(r\"C:\\Users\\Administrator\\Desktop\\Project_Folder\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Multi-line Strings" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Sometimes, it is convenient to split text across multiple lines in source code, for example, to make lines fit into the 79-character limit of [PEP 8](https://www.python.org/dev/peps/pep-0008/) or because the text naturally contains many newlines. Using double quotes `\"` around multiple lines results in a `SyntaxError`." ] }, { "cell_type": "code", "execution_count": 98, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "ename": "SyntaxError", "evalue": "EOL while scanning string literal (, line 1)", "output_type": "error", "traceback": [ "\u001b[0;36m File \u001b[0;32m\"\"\u001b[0;36m, line \u001b[0;32m1\u001b[0m\n\u001b[0;31m \"\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m EOL while scanning string literal\n" ] } ], "source": [ "\"\n", "Do not break the lines like this\n", "\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "However, if we enclose a string literal with either **triple-double** quotes `\"\"\"` or **triple-single** quotes `'''`, Python creates a \"plain\" `str` object that may span several lines. 
Docstrings are precisely that, and, by convention, always written in triple-double quotes `\"\"\"`." ] }, { "cell_type": "code", "execution_count": 99, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "multi_line = \"\"\"\n", "I am a multi-line string\n", "consisting of 4 lines.\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Line breaks are kept and implicitly converted into `\\n` characters." ] }, { "cell_type": "code", "execution_count": 100, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'\\nI am a multi-line string\\nconsisting of 4 lines.\\n'" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "multi_line" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The built-in [print()](https://docs.python.org/3/library/functions.html#print) function correctly prints out the `\\n` characters." ] }, { "cell_type": "code", "execution_count": 101, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "I am a multi-line string\n", "consisting of 4 lines.\n", "\n" ] } ], "source": [ "print(multi_line)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Using the [split()](https://docs.python.org/3/library/stdtypes.html#str.split) method with the optional *sep* argument, we confirm that `multi_line` consists of *four* lines with the first and last line breaks being the first and last characters in the `str` object." 
] }, { "cell_type": "code", "execution_count": 102, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 \n", "1 I am a multi-line string\n", "2 consisting of 4 lines.\n", "3 \n" ] } ], "source": [ "for i, line in enumerate(multi_line.split(\"\\n\")):\n", " print(i, line)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The next code cell puts several constructs from this chapter together to create a multi-line `str` object `content`: The `with` statement provides a context that ensures `file` is not left open. Then, the [readlines()](https://docs.python.org/3/library/io.html#io.IOBase.readlines) method returns the contents of `file` as a `list` object holding as many `str` objects as there are lines in the file on disk. Lastly, we concatenate these together with the [join()](https://docs.python.org/3/library/stdtypes.html#str.join) method to obtain `content`. We do so on an empty `str` object `\"\"` as each line already ends with a `\"\\n\"`." ] }, { "cell_type": "code", "execution_count": 103, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "with open(\"lorem_ipsum.txt\") as file:\n", " content = \"\".join(file.readlines())" ] }, { "cell_type": "code", "execution_count": 104, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "\"Lorem Ipsum is simply dummy text of the printing and typesetting industry.\\nLorem Ipsum has been the industry's standard dummy text ever since the 1500s\\nwhen an unknown printer took a galley of type and scrambled it to make a type\\nspecimen book. It has survived not only five centuries but also the leap into\\nelectronic typesetting, remaining essentially unchanged. 
It was popularised in\\nthe 1960s with the release of Letraset sheets.\\n\"" ] }, "execution_count": 104, "metadata": {}, "output_type": "execute_result" } ], "source": [ "content" ] }, { "cell_type": "code", "execution_count": 105, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Lorem Ipsum is simply dummy text of the printing and typesetting industry.\n", "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s\n", "when an unknown printer took a galley of type and scrambled it to make a type\n", "specimen book. It has survived not only five centuries but also the leap into\n", "electronic typesetting, remaining essentially unchanged. It was popularised in\n", "the 1960s with the release of Letraset sheets.\n", "\n" ] } ], "source": [ "print(content)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## TL;DR" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Textual data is modeled with the **immutable** `str` type.\n", "\n", "The `str` type supports *four* orthogonal **abstract concepts** that together constitute the idea of a **sequence**: Every `str` object is an iterable container of a finite number of ordered characters." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" }, "livereveal": { "auto_select": "code", "auto_select_fragment": true, "scroll": true, "theme": "serif" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": true, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "384px" }, "toc_section_display": false, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }