# Chapter 6: Bytes & Text

In this chapter, we continue the study of the built-in data types. Building on our knowledge of numbers, the next layer consists of textual data that are modeled primarily with the `str` type in Python. `str` objects are naturally more "complex" than numeric objects as any text consists of an arbitrary and possibly large number of individual characters that may be chosen from any alphabet in the history of humankind. Luckily, Python abstracts away most of this complexity.

## The `str` Type

The `str` type is the default way of modeling **textual data**. To create a `str` object, we use a **literal notation** and type the text between enclosing **double quotes** `"`.

In [1]:
text = "Lorem ipsum dolor sit amet, ..."

Like everything in Python, `text` is an object.

In [2]:
id(text)

140483431254256

In [3]:
type(text)

str

A `str` object evaluates to itself in a literal notation with enclosing **single quotes** `'` by default. In [Chapter 1](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/01_elements_00_content.ipynb#Value), we already specified the double quotes `"` convention we stick to in this book. Yet, single quotes `'` and double quotes `"` are *perfect* substitutes for all `str` objects that contain *neither* of the two symbols. We could use the reverse convention as well.

As [this discussion](https://stackoverflow.com/questions/56011/single-quotes-vs-double-quotes-in-python) shows, many programmers have *strong* opinions about that and make up *new* conventions for their projects. Consequently, the discussion was "closed as not constructive" by the moderators.

In [4]:
text

'Lorem ipsum dolor sit amet, ...'

As the single quote `'` is often used in the English language as a shortener, we could make an argument in favor of using the double quotes `"`: There are possibly fewer situations like in the two code cells below, in which we must revert to using a `\` to **escape** a single quote `'` in a text (cf., the [Special Characters](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/06_text_00_content.ipynb#Special-Characters) section further below). However, double quotes `"` are often used in text as well. So, this argument is not entirely convincing.

Many proponents of the single quote `'` usage claim that double quotes `"` make more **visual noise** on the screen. This argument is also not convincing. On the contrary, one could claim that *two* single quotes `''` look so similar to *one* double quote `"` that it might not be apparent right away what we are looking at. By sticking to double quotes `"`, we avoid such danger of confusion.

This discussion is an excellent example of a [flame war](https://en.wikipedia.org/wiki/Flaming_%28Internet%29#Flame_war) in the programming world that leads to *no* result.

An *important* fact to know is that enclosing quotes of either kind are *not* part of the `str` object's *value*! They are merely *syntax* to make the text in a code cell a *literal* that Python converts into a `str` object upon reading.
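As a minimal sketch of this point, the cell below compares two literals that differ only in their enclosing quotes. The variable names are chosen for illustration only.

```python
double_quoted = "Lorem ipsum"
single_quoted = 'Lorem ipsum'

# Both literals create str objects with the same value ...
same_value = double_quoted == single_quoted

# ... and the length counts only the characters between the quotes.
n_chars = len(double_quoted)

print(same_value, n_chars)
```

The enclosing quotes do not show up in the length, which suggests that they are not stored as part of the value.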

In [5]:
"It's cool that \"strings\" are versatile"

'It\'s cool that "strings" are versatile'

In [6]:
'It\'s cool that "strings" are versatile'

'It\'s cool that "strings" are versatile'

We can always use the [str()](https://docs.python.org/3/library/stdtypes.html#str) built-in to cast non-`str` objects as a `str`.

In [7]:
str(123)

'123'

Another common situation where we obtain `str` objects is when reading the contents of a file with the [open()](https://docs.python.org/3/library/functions.html#open) built-in. In its simplest form, to open a [text file](https://en.wikipedia.org/wiki/Text_file) in read-only mode, we pass in its path (i.e., "filename") as a `str` object.

In [8]:
file = open("lorem_ipsum.txt")

[open()](https://docs.python.org/3/library/functions.html#open) returns a **[proxy](https://en.wikipedia.org/wiki/Proxy_pattern)** object of type `TextIOWrapper` that allows us to interact with the file on disk.

In [9]:
file

<_io.TextIOWrapper name='lorem_ipsum.txt' mode='r' encoding='UTF-8'>

In [10]:
type(file)

_io.TextIOWrapper

While `file` provides, for example, the [read()](https://docs.python.org/3/library/io.html#io.TextIOBase.read), [readline()](https://docs.python.org/3/library/io.html#io.TextIOBase.readline), and [readlines()](https://docs.python.org/3/library/io.html#io.IOBase.readlines) methods to access its contents, it is also *iterable*, and we may loop over the individual lines with a `for` statement.

In [11]:
for line in file:
    print(line)

Lorem Ipsum is simply dummy text of the printing and typesetting industry.

Lorem Ipsum has been the industry's standard dummy text ever since the 1500s

when an unknown printer took a galley of type and scrambled it to make a type

specimen book. It has survived not only five centuries but also the leap into

electronic typesetting, remaining essentially unchanged. It was popularised in

the 1960s with the release of Letraset sheets.



Once we have looped over `file` the first time, it is **exhausted**: That means we do not see any output if we loop over it another time.

In [12]:
for line in file:
    print(line)

After the `for`-loop, `line` is still set to the last line in the file, and we verify that it is indeed a `str` object.

In [13]:
line

'the 1960s with the release of Letraset sheets.\n'

In [14]:
type(line)

str

An important fact is that `file` is still associated with an *open* **[file descriptor](https://en.wikipedia.org/wiki/File_descriptor)**. Without going into any technical details, we note that an operating system can only handle a limited number of "open files" at the same time, and, therefore, we should always *close* the file once we are done processing it.

`file` has a `closed` attribute on it that shows us if a file descriptor is open or closed, and with the [close()](https://docs.python.org/3/library/io.html#io.IOBase.close) method, we can "manually" close it.

In [15]:
file.closed

False

In [16]:
file.close()

In [17]:
file.closed

True

The more Pythonic way is to open a file with the `with` statement (cf., [reference](https://docs.python.org/3/reference/compound_stmts.html#the-with-statement)): The indented code block is said to be executed in the **context** of the header line that acts as a **[context manager](https://docs.python.org/3/reference/datamodel.html?highlight=context%20manager#with-statement-context-managers)**. Such objects may have many different purposes. Here, the context manager created with `with open(...) as file:` mainly ensures that the file descriptor gets automatically closed after the last line in the code block is executed.

In [18]:
with open("lorem_ipsum.txt") as file:
    for line in file:
        print(line)

Lorem Ipsum is simply dummy text of the printing and typesetting industry.

Lorem Ipsum has been the industry's standard dummy text ever since the 1500s

when an unknown printer took a galley of type and scrambled it to make a type

specimen book. It has survived not only five centuries but also the leap into

electronic typesetting, remaining essentially unchanged. It was popularised in

the 1960s with the release of Letraset sheets.



In [19]:
file.closed

True

To use constructs familiar from [Chapter 3](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/03_conditionals_00_content.ipynb#The-try-Statement) to explain what `with open(...) as file:` does, below is a formulation with a `try` statement *equivalent* to the `with` statement above. The `finally`-branch is *always* executed, even if an exception is raised in the `for`-loop. So, `file` is sure to be closed too, with a somewhat less expressive formulation.

In [20]:
file = open("lorem_ipsum.txt")  # opened before `try` so that `file` exists in `finally`
try:
    for line in file:
        print(line)
finally:
    file.close()

Lorem Ipsum is simply dummy text of the printing and typesetting industry.

Lorem Ipsum has been the industry's standard dummy text ever since the 1500s

when an unknown printer took a galley of type and scrambled it to make a type

specimen book. It has survived not only five centuries but also the leap into

electronic typesetting, remaining essentially unchanged. It was popularised in

the 1960s with the release of Letraset sheets.



In [21]:
file.closed

True
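To illustrate the claim that the file gets closed even if an exception is raised, here is a small sketch. It writes a throwaway file into a temporary folder (the file name `example.txt` and its contents are made up for this example) and then simulates a problem inside the `with` block.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.txt")  # hypothetical file name

    # Create a small text file to read from.
    with open(path, "w") as out_file:
        out_file.write("first line\nsecond line\n")

    try:
        with open(path) as in_file:
            for line in in_file:
                # Simulate a problem while processing the first line.
                raise ValueError("simulated problem")
    except ValueError:
        pass

    # The context manager closed the file although the block was left early.
    closed_despite_error = in_file.closed

print(closed_despite_error)
```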

A subtlety to notice is that there is an empty line printed between each `line`. That is because each `line` ends with a `"\n"` that results in a line break and that is explained further below. To print the text without empty lines in between, we pass an `end=""` argument to the [print()](https://docs.python.org/3/library/functions.html#print) function.

In [22]:
with open("lorem_ipsum.txt") as file:
    for line in file:
        print(line, end="")

Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s
when an unknown printer took a galley of type and scrambled it to make a type
specimen book. It has survived not only five centuries but also the leap into
electronic typesetting, remaining essentially unchanged. It was popularised in
the 1960s with the release of Letraset sheets.
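An alternative to `end=""`, sketched below, is to remove the trailing `"\n"` from each line with the [rstrip()](https://docs.python.org/3/library/stdtypes.html#str.rstrip) method before working with it. To keep the sketch independent of any file on disk, it assumes an in-memory `io.StringIO` object as a stand-in for an opened text file.

```python
import io

# io.StringIO behaves like a text file opened with open(), but lives in memory.
fake_file = io.StringIO("first line\nsecond line\n")

stripped = []
for line in fake_file:
    stripped.append(line.rstrip("\n"))  # drop only the trailing newline

print(stripped)
```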


## A "String" of Characters

A **sequence** is yet another *abstract* concept (cf., the "*Containers vs. Iterables*" section in [Chapter 4](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/04_iteration_00_content.ipynb#Containers-vs.-Iterables)).

It unifies *four* [orthogonal](https://en.wikipedia.org/wiki/Orthogonality) (i.e., "independent") behaviors into one idea: Any data type, such as `str`, is considered a sequence if it simultaneously

1. **contains** other "things,"
2. is **iterable**, and 
3. comes with a *predictable* **order** of its
4. **finite** number of "things."

[Chapter 7](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/07_sequences_00_content.ipynb#Collections-vs.-Sequences) formalizes sequences in great detail. Here, we keep our focus on the `str` type that historically received its name as it models a "**[string of characters](https://en.wikipedia.org/wiki/String_%28computer_science%29)**," and a "string" is more formally called a sequence in the computer science literature.

Behaving like a sequence, `str` objects may be treated like `list` objects in many cases. For example, the built-in [len()](https://docs.python.org/3/library/functions.html#len) function tells us how many elements (i.e., characters) make up `text`. [len()](https://docs.python.org/3/library/functions.html#len) would not work with an *infinite* object: As anything modeled in a program must fit into a computer's finite memory at runtime, there cannot exist objects containing a truly infinite number of elements; however, [Chapter 7](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/07_sequences_00_content.ipynb#Iterators-vs.-Iterables) introduces iterable data types that can be used to model an *infinite* series of elements and that, consequently, have no concept of "length."

In [23]:
len(text)

31

Being iterable, we may loop over `text` and do something with the individual characters, for example, print them out with extra space in between them.

In [24]:
for letter in text:
    print(letter, end=" ")

L o r e m   i p s u m   d o l o r   s i t   a m e t ,   . . . 

Being a container, we may check if a given `str` object is contained in `text` with the `in` operator.

The `in` operator has *two* usages: First, it checks if a *single* character is contained in a `str` object. Second, it may also check if a shorter `str` object, then called a **substring**, is contained in a longer one.

In [25]:
"L" in text

True

In [26]:
"ipsum" in text

True

In [27]:
"veni, vidi, vici" in text

False
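As a short sketch of how strict the `in` operator is, the checks below suggest that a substring must match character by character, including the case and the order of the characters. The variable names are for illustration only.

```python
text = "Lorem ipsum dolor sit amet, ..."

capitalized = "Lorem" in text         # matches the beginning of text
lower_cased = "lorem" in text         # differs in the case of the first letter
out_of_order = "ipsum Lorem" in text  # both words occur, but not in this order

print(capitalized, lower_cased, out_of_order)
```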

## Indexing

As `str` objects have the additional property of being *ordered*, we may **index** into them to obtain individual characters with the **indexing operator** `[]`. This is analogous to how we obtained individual elements of a `list` object in [Chapter 1](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/01_elements_00_content.ipynb#Who-am-I?-And-how-many?).

In [28]:
text[0]

'L'

In [29]:
text[1]

'o'

The index must be of type `int` or we get a `TypeError`.

In [30]:
text[1.0]

TypeError: string indices must be integers

The last index is one less than the above "length" of the `str` object as we start counting at 0.

In [31]:
text[30]  # = len(text) - 1

'.'

An `IndexError` is raised whenever the index is too large.

In [32]:
text[31]  # = len(text)

IndexError: string index out of range

We may use *negative* indexes to start counting from the end of the `str` object, as shown in the table below. That only works because sequences are *finite*.

|   Slot    | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10| 11| 12| 13| 14| 15| 16| 17| 18| 19| 20| 21| 22| 23| 24| 25| 26| 27| 28| 29| 30|
|:---------:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|**Reverse**|-31|-30|-29|-28|-27|-26|-25|-24|-23|-22|-21|-20|-19|-18|-17|-16|-15|-14|-13|-12|-11|-10|-9 |-8 |-7 |-6 |-5 |-4 |-3 |-2 |-1 |
| **Char**  |`L`|`o`|`r`|`e`|`m`|` `|`i`|`p`|`s`|`u`|`m`|` `|`d`|`o`|`l`|`o`|`r`|` `|`s`|`i`|`t`|` `|`a`|`m`|`e`|`t`|`,`|` `|`.`|`.`|`.`|

In [33]:
text[-1]

'.'

In [34]:
text[-31]  # = -len(text)

'L'

One reason why programmers like to start counting at 0 is that the absolute values of a positive index and its *corresponding* negative index always add up to the length of the sequence. Here, `6` and `25` add to `31`.

In [35]:
text[6]

'i'

In [36]:
text[-25]

'i'
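The correspondence between the two kinds of indexes can be checked for every position at once. The sketch below assumes the same `text` as above and relies on the observation that subtracting `len(text)` from a non-negative index yields its negative counterpart.

```python
text = "Lorem ipsum dolor sit amet, ..."
n = len(text)

# For every index i, the index i - n refers to the same character.
pairs_match = all(text[i] == text[i - n] for i in range(n))

print(pairs_match, 6 - n)
```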

## Slicing

A **slice** is a substring of a `str` object.

The **slicing operator** is a generalization of the indexing operator: We can put one, two, or three integers within the brackets, separated by colons `:`. The three integers are then referred to as the *start*, *end*, and *step* values.

Let's start with two integers, *start* and *end*.

In [37]:
text[0:5]

'Lorem'

Whereas the *start* is always included in the result, the *end* is not. Counter-intuitive at first, this makes working with individual slices easier as they "add" up to the original `str` object again (cf., the "*String Operations*" sub-section below regarding the overloaded `+` operator). Because the *end* is *not* included, we end the second slice below with `len(text)` or `31`.

Not including the *end* has another advantage: The difference "*end* minus *start*" tells us how many elements the resulting slice has. Above, for example, `5 - 0` implies that `"Lorem"` consists of `5` characters. So, colloquially, `0:5` means "taking the first five characters." That rule only works if both *start* and *end* are *non-negative* and lie within the bounds of the `str` object.

In [38]:
text[0:5] + text[5:len(text)]

'Lorem ipsum dolor sit amet, ...'

By combining a *positive* start with a *negative* end index, we specify both ends of the slice *relative* to the ends of the entire `str` object. So, colloquially, `6:-5` below means "drop the first six and last five characters." The length of the resulting slice can *not* be calculated from the indexes alone as it also depends on the length of the original `str` object.

In [39]:
text[6:-5]

'ipsum dolor sit amet'
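To see how the length of the original `str` object enters the calculation, we may translate the negative *end* into its non-negative counterpart. The sketch below uses the `indices()` method of the built-in `slice` type, which is not covered in this chapter, to do the translation for a sequence of a given length.

```python
text = "Lorem ipsum dolor sit amet, ..."

# Resolve the slice 6:-5 against a sequence with len(text) elements.
start, end, step = slice(6, -5).indices(len(text))

# With non-negative indexes, "end minus start" is the slice's length again.
slice_length = end - start

print(start, end, step, slice_length)
```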

For convenience, the indexes do not need to lie within the range from `0` to `len(text)` when slicing. So, no `IndexError` is raised here.

In [40]:
text[0:999]

'Lorem ipsum dolor sit amet, ...'

If left out, *start* defaults to `0` and *end* to the "length" of the `str` object. Here, we take a "full" slice that is essentially a copy of the original `str` object.

In [41]:
text[:]

'Lorem ipsum dolor sit amet, ...'

Slicing (and indexing) makes it easy to obtain shorter versions of the original `str` object.

In [42]:
text[:26] + text[30]

'Lorem ipsum dolor sit amet.'

A *step* value of $i$ can be used to obtain only every $i$th letter.

In [43]:
text[::2]

'Lrmismdlrstae,..'

A negative *step* size reverses the order of the characters.

In [44]:
text[::-1]

'... ,tema tis rolod muspi meroL'
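A typical use of a negative *step* is checking if a word reads the same in both directions. The function `is_palindrome()` below is a hypothetical helper written only for this illustration.

```python
def is_palindrome(word):
    """Check if a word reads the same forward and backward."""
    return word == word[::-1]  # compare the word with its reversed copy


print(is_palindrome("level"), is_palindrome("lorem"))
```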

## Immutability

Whereas elements of a `list` object *may* be *re-assigned*, as shortly hinted at in [Chapter 1](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/01_elements_00_content.ipynb#Who-am-I?-And-how-many?), this is *not* allowed for `str` objects. Once created, they *cannot* be *changed*. Formally, we say that they are **immutable**. In that regard, `str` objects and all the numeric types in [Chapter 5](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/05_numbers_00_content.ipynb) are alike.

On the contrary, objects that may be changed after creation, are called **mutable**. We already saw in [Chapter 1](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/01_elements_00_content.ipynb#Who-am-I?-And-how-many?) how mutable objects are more difficult to reason about for a beginner, in particular, if more than *one* variable references it. Yet, mutability does have its place in a programmer's toolbox, and we revisit this idea in the next chapters.

The `TypeError` below indicates that `str` objects are *immutable*: Assignment to an index or a slice is *not* supported.

In [45]:
text[0] = "Z"

TypeError: 'str' object does not support item assignment

In [46]:
text[:5] = "random"

TypeError: 'str' object does not support item assignment
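If we need a "changed" version of a `str` object, one possible workaround is to build a *new* object from slices of the original. The sketch below assumes the same `text` as above and leaves it untouched.

```python
text = "Lorem ipsum dolor sit amet, ..."

# Combine a new first character with a slice holding everything else.
changed = "Z" + text[1:]

print(changed)
print(text)  # the original object still has its old value
```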

## String Methods

Objects of type `str` come with many **methods** bound on them (cf., the [documentation](https://docs.python.org/3/library/stdtypes.html#string-methods) for a full list). As seen before, they work like *normal* functions and are accessed via the **dot operator** `.`. Calling a method is also referred to as **method invocation**.

The [find()](https://docs.python.org/3/library/stdtypes.html#str.find) method returns the index of the first occurrence of a character or a substring. If no match is found, it returns `-1`.

In [47]:
text

'Lorem ipsum dolor sit amet, ...'

In [48]:
text.find("a")

22

In [49]:
text.find("z")

-1

In [50]:
text.find("dolor")

12

[find()](https://docs.python.org/3/library/stdtypes.html#str.find) takes optional *start* and *end* arguments that allow us to find occurrences other than the first one.

In [51]:
text.find("o")

1

In [52]:
text.find("o", 2)  # 2 not 1 as otherwise the same "o" is found again

13

In [53]:
text.find("o", 2, 12)  # "o" does not occur in the specified slice

-1
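Building on the optional *start* argument, the following sketch collects the indexes of *all* occurrences of `"o"` by restarting the search right after the previous match. The names `positions` and `index` are chosen for illustration.

```python
text = "Lorem ipsum dolor sit amet, ..."

positions = []
index = text.find("o")
while index != -1:  # find() returns -1 once there are no more matches
    positions.append(index)
    index = text.find("o", index + 1)  # continue after the last match

print(positions)
```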

[count()](https://docs.python.org/3/library/stdtypes.html#str.count) returns the number of non-overlapping occurrences of a character or a substring.

In [54]:
text.count("l")

1

As [count()](https://docs.python.org/3/library/stdtypes.html#str.count) is *case-sensitive*, we must **chain** it with the [lower()](https://docs.python.org/3/library/stdtypes.html#str.lower) method to get the count of all `"L"`s and `"l"`s.

In [55]:
text.lower().count("l")

2

Alternatively, we may use the [upper()](https://docs.python.org/3/library/stdtypes.html#str.upper) method and search for `"L"`s.

In [56]:
text.upper().count("L")

2

Because `str` objects are *immutable*, [upper()](https://docs.python.org/3/library/stdtypes.html#str.upper) and [lower()](https://docs.python.org/3/library/stdtypes.html#str.lower) return *new* `str` objects, even if they do *not* change the value of the original `str` object.

In [57]:
example = "random"

In [58]:
id(example)

140483525349552

In [59]:
lower = example.lower()

In [60]:
id(lower)

140483422704176

`example` and `lower` are *different* objects with the *same* value.

In [61]:
example is lower

False

In [62]:
example == lower

True

Another popular string method is [split()](https://docs.python.org/3/library/stdtypes.html#str.split): It separates a longer `str` object into smaller ones contained in a `list` object. By default, groups of contiguous whitespace are used as the *separator*.

As an example, we use [split()](https://docs.python.org/3/library/stdtypes.html#str.split) to print out the individual words in `text` on separate lines.

In [63]:
for word in text.split():
    print(word)

Lorem
ipsum
dolor
sit
amet,
...


The opposite of splitting is done with the [join()](https://docs.python.org/3/library/stdtypes.html#str.join) method. It is typically invoked on a `str` object that represents a separator (e.g., `" "` or `", "`) and connects the elements of an *iterable* argument passed in (e.g., `words` below) into one *new* `str` object.

In [64]:
words = ["This", "will", "become", "a", "sentence"]

In [65]:
sentence = " ".join(words)

In [66]:
sentence

'This will become a sentence'

As the `str` object `"abcde"` below is an *iterable* itself, its elements (i.e., characters) are joined together with a space `" "` in between.

In [67]:
" ".join("abcde")

'a b c d e'
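Used together, [split()](https://docs.python.org/3/library/stdtypes.html#str.split) and [join()](https://docs.python.org/3/library/stdtypes.html#str.join) offer one possible way to tidy up irregular whitespace. The `messy` example below is made up for this sketch.

```python
messy = "Lorem   ipsum \t dolor\nsit  amet"

# split() without arguments treats any run of whitespace as one separator,
# and join() then puts exactly one space between the words.
tidy = " ".join(messy.split())

print(tidy)
```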

The [replace()](https://docs.python.org/3/library/stdtypes.html#str.replace) method creates a *new* `str` object with parts of the original `str` object potentially replaced.

In [68]:
sentence.replace("will become", "is")

'This is a sentence'

## String Operations

As mentioned in [Chapter 1](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/01_elements_00_content.ipynb#Operator-Overloading), the `+` and `*` operators are *overloaded* and used for **string concatenation**. They always create *new* `str` objects. That has nothing to do with the `str` type's immutability, but is the default behavior of operators.

In [69]:
"Hello " + text[:4]

'Hello Lore'

In [70]:
5 * text[:12]

'Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum '

### String Comparison

The *relational* operators also work with `str` objects, another example of operator overloading. Comparison is done one character at a time until the first pair differs or one operand ends. However, `str` objects are sorted in a "weird" way. The reason for this is that computers store characters internally as numbers (i.e., $0$s and $1$s). Depending on the character encoding, these numbers vary. Commonly, characters and symbols used in American English are encoded with the numbers 0 through 127, the so-called [ASCII standard](https://en.wikipedia.org/wiki/ASCII). However, Python works with the more general [Unicode/UTF-8 standard](https://en.wikipedia.org/wiki/UTF-8) that covers the writing systems of virtually all languages, and even emojis.

In [71]:
A = "Apple"  # ignore snake_case for variable names in this example
a = "apple"
B = "Banana"

In [72]:
A < B

True

In [73]:
a < B

False

One way to fix this is to only compare lower-cased `str` objects.

In [74]:
a < B.lower()

True

To provide a simple intuition for the "weird" sorting above, let's think of the alphabet as being represented by the numbers as listed below. Then `"Banana"` is clearly "smaller" than `"apple"`. In general, all the upper case letters are "smaller" than all the lower case letters.

In [75]:
for upper_i in range(65, 91):
    lower_i = upper_i + 32  # each lower case character is offset by 32
    upper_char = chr(upper_i)  # from its upper case counterpart
    lower_char = chr(lower_i)
    print(f"{upper_char} -> {upper_i} \t {lower_char} -> {lower_i}")

A -> 65 	 a -> 97
B -> 66 	 b -> 98
C -> 67 	 c -> 99
D -> 68 	 d -> 100
E -> 69 	 e -> 101
F -> 70 	 f -> 102
G -> 71 	 g -> 103
H -> 72 	 h -> 104
I -> 73 	 i -> 105
J -> 74 	 j -> 106
K -> 75 	 k -> 107
L -> 76 	 l -> 108
M -> 77 	 m -> 109
N -> 78 	 n -> 110
O -> 79 	 o -> 111
P -> 80 	 p -> 112
Q -> 81 	 q -> 113
R -> 82 	 r -> 114
S -> 83 	 s -> 115
T -> 84 	 t -> 116
U -> 85 	 u -> 117
V -> 86 	 v -> 118
W -> 87 	 w -> 119
X -> 88 	 x -> 120
Y -> 89 	 y -> 121
Z -> 90 	 z -> 122
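The built-in [ord()](https://docs.python.org/3/library/functions.html#ord) function is the inverse of [chr()](https://docs.python.org/3/library/functions.html#chr) and lets us look up the numbers behind the first letters directly. The sketch below also shows one possible way to sort words regardless of case, by passing `str.lower` as the `key` argument to the built-in [sorted()](https://docs.python.org/3/library/functions.html#sorted) function. The list `words` is made up for this example.

```python
words = ["apple", "Banana", "Apple"]

# The numbers behind the first letter of each word.
code_points = [ord(word[0]) for word in words]

# Default sorting puts all upper case letters before the lower case ones.
default_order = sorted(words)

# Comparing lower-cased versions ignores the case; ties keep their order.
case_insensitive = sorted(words, key=str.lower)

print(code_points, default_order, case_insensitive)
```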


## String Interpolation

The previous code cell shows an example of a so-called **f-string**, as introduced by [PEP 498](https://www.python.org/dev/peps/pep-0498/) only in 2016, that is passed as the argument to the [print()](https://docs.python.org/3/library/functions.html#print) function.

The "f" stands for "formatted", and we can think of the `str` object as a text "draft" that is filled in with values determined at runtime. This concept is formally called **string interpolation**, and there are three ways to achieve that in Python.

### f-strings

f-strings, formally called **[formatted string literals](https://docs.python.org/3/reference/lexical_analysis.html#formatted-string-literals)**, are the most recently added and most readable way: We prepend a `str` in literal notation with an `f`, and put variables, or more generally, expressions, within curly braces. These are then filled in when the literal is evaluated.

In [76]:
name = "Alexander"
time_of_day = "morning"

In [77]:
f"Hello {name}! Good {time_of_day}."

'Hello Alexander! Good morning.'

Separated by a colon `:`, various formatting options are available. In the beginning, the ability to round may be particularly useful: This can be achieved by adding `:.2f` to the variable name inside the curly braces, which formats the number as a `float` rounded to two decimal places. The `:.2f` is a so-called format specifier, and there exists a whole **[format specification mini-language](https://docs.python.org/3/library/string.html#formatspec)** to govern how specifiers work.

In [78]:
pi = 3.141592653

In [79]:
f"Pi is {pi:.2f}"

'Pi is 3.14'

In [80]:
f"Pi is {pi:.3f}"

'Pi is 3.142'
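Beyond rounding, the format specification mini-language covers further options. The sketch below shows three of them: a thousands separator, a percentage, and right-alignment within a fixed width. The variables `amount`, `share`, and `label` are made up for this illustration.

```python
amount = 1234567.891
share = 0.256
label = "pi"

with_separator = f"{amount:,.2f}"  # comma as thousands separator, two decimals
as_percent = f"{share:.1%}"        # multiplied by 100 and followed by "%"
right_aligned = f"{label:>6}"      # padded with spaces to a width of six

print(with_separator, as_percent, right_aligned)
```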

### [format()](https://docs.python.org/3/library/stdtypes.html#str.format) Method

`str` objects also provide a [format()](https://docs.python.org/3/library/stdtypes.html#str.format) method that accepts an arbitrary number of *positional* arguments that are inserted into the `str` object in the same order replacing empty curly brackets. String interpolation with the [format()](https://docs.python.org/3/library/stdtypes.html#str.format) method is a more traditional and probably the most common way as of today. While f-strings are the recommended way going forward, usage of the [format()](https://docs.python.org/3/library/stdtypes.html#str.format) method is unlikely to decline any time soon.

In [81]:
"Hello {}! Good {}.".format(name, time_of_day)

'Hello Alexander! Good morning.'

We may use index numbers inside the curly braces if the order is different in the `str` object.

In [82]:
"Good {1}, {0}".format(name, time_of_day)

'Good morning, Alexander'

The [format()](https://docs.python.org/3/library/stdtypes.html#str.format) method may alternatively be used with *keyword* arguments. Then, we must put the keywords' names within the curly brackets.

In [83]:
"Hello {name}! Good {time}.".format(name=name, time=time_of_day)

'Hello Alexander! Good morning.'

Format specifiers work as in the f-string case.

In [84]:
"Pi is {:.2f}".format(pi)

'Pi is 3.14'

### `%` Operator

The `%` operator that we saw in the context of modulo division in [Chapter 1](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/01_elements_00_content.ipynb#%28Arithmetic%29-Operators) is overloaded with string interpolation when its first operand is a `str` object. The second operand consists of all expressions to be filled in. Format specifiers work with a `%` instead of curly braces and according to a different set of rules referred to as **[printf-style string formatting](https://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting)**. So, `{:.2f}` becomes `%.2f`.

This way of string interpolation is the oldest and originates from the [C language](https://en.wikipedia.org/wiki/C_%28programming_language%29). It is still widespread, but we should use one of the other two ways instead. We show it here mainly for completeness' sake.

In [85]:
"Pi is %.2f" % pi

'Pi is 3.14'

To insert more than one expression, we must list them in order and between parentheses `(` and `)`. As [Chapter 7](https://nbviewer.jupyter.org/github/webartifex/intro-to-python/blob/master/07_sequences_00_content.ipynb#The-tuple-Type) reveals, this literal syntax creates an object of type `tuple`. Also, to format an expression as text, we use the format specifier `%s`.

In [86]:
"Hello %s! Good %s." % (name, time_of_day)

'Hello Alexander! Good morning.'
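As a brief recap, the sketch below places the three ways of string interpolation side by side and confirms that, for this example, they produce the same `str` value. The names on the left-hand side are chosen for illustration.

```python
name = "Alexander"
time_of_day = "morning"

with_f_string = f"Hello {name}! Good {time_of_day}."
with_format = "Hello {}! Good {}.".format(name, time_of_day)
with_percent = "Hello %s! Good %s." % (name, time_of_day)

all_equal = with_f_string == with_format == with_percent

print(all_equal)
```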

## Special Characters

Some symbols have a special meaning within `str` objects. Popular examples are the newline `\n` and tab `\t` "characters." The backslash symbol `\` is also referred to as an **escape character** in this context, indicating that the following character has a meaning other than its literal meaning.

The built-in [print()](https://docs.python.org/3/library/functions.html#print) function then "prints" out these special characters accordingly.

In [87]:
print("This is a sentence\nthat is printed\non three lines.")

This is a sentence
that is printed
on three lines.


In [88]:
print("Words\taligned\twith\ttabs.")

Words	aligned	with	tabs.


As emojis are important as well, they can be inserted with the corresponding **Unicode code point** number starting with `\U`. See this [list](https://en.wikipedia.org/wiki/List_of_Unicode_characters) of Unicode characters for an overview.

In [89]:
print("\U0001f604")

😄


Outside the [print()](https://docs.python.org/3/library/functions.html#print) function, the special characters are not treated any differently from non-special ones: A `str` object evaluates to its literal notation, in which they show up as escape sequences.

In [90]:
"This is a sentence\nthat is printed\non three lines."

'This is a sentence\nthat is printed\non three lines.'
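Although a special character is typed with two symbols in a literal, it counts as a *single* character in the resulting `str` object. The sketch below checks that with [len()](https://docs.python.org/3/library/functions.html#len) and uses the built-in [repr()](https://docs.python.org/3/library/functions.html#repr) function to obtain the literal notation as a `str` object. The variable names are for illustration.

```python
newline = "\n"
two_words = "one\ntwo"

newline_length = len(newline)      # backslash and "n" form one character
two_words_length = len(two_words)  # 3 letters + 1 newline + 3 letters
literal_notation = repr(two_words)  # the text a code cell shows as output

print(newline_length, two_words_length, literal_notation)
```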

## Raw Strings

Sometimes we do *not* want the backslash `\` and its following character to be interpreted as special characters.

For example, let's print a typical installation path on a Windows system. Obviously, the newline character `\n` does *not* make sense here.

In [91]:
print("C:\Programs\new_application")

C:\Programs
ew_application


Some string literals even produce a `SyntaxError` because the `\U` *cannot* be interpreted as a Unicode code point.

In [92]:
print("C:\Users\Administrator\Desktop\Project_Folder")

SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape (<ipython-input-92-0aad8a365b02>, line 1)

A simple solution would be to escape the escape character with a *second* backslash `\`.

In [93]:
print("C:\\Programs\\new_application")

C:\Programs\new_application


In [94]:
print("C:\\Users\\Administrator\\Desktop\\Project_Folder")

C:\Users\Administrator\Desktop\Project_Folder


However, this is tedious to remember and type. Luckily, Python allows treating any string literal as "raw," and this is indicated in the string literal by the prefix `r`.

In [95]:
print(r"C:\Programs\new_application")

C:\Programs\new_application


In [96]:
print(r"C:\Users\Administrator\Desktop\Project_Folder")

C:\Users\Administrator\Desktop\Project_Folder
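Raw string literals only change how the source code is read. They still create ordinary `str` objects. As a small sketch, the comparison below suggests that the escaped and the raw notation lead to the same value. The variable names are for illustration.

```python
escaped = "C:\\Programs\\new_application"
raw = r"C:\Programs\new_application"

same_value = escaped == raw

# In a raw literal, the backslash and the "n" remain two separate characters.
raw_length = len(r"\n")
normal_length = len("\n")

print(same_value, raw_length, normal_length)
```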


## Multi-line Strings

Sometimes, it is convenient to split text across multiple lines in source code, for example, to make lines fit into the 79-character limit of [PEP 8](https://www.python.org/dev/peps/pep-0008/) or because the text naturally contains many newlines. Using double quotes `"` around multiple lines results in a `SyntaxError`.

In [97]:
"
Do not break the lines like this
"

SyntaxError: EOL while scanning string literal (<ipython-input-97-4cef690f1f4a>, line 1)

However, by enclosing a string literal with either **triple-double** quotes `"""` or **triple-single** quotes `'''`, Python creates a "plain" `str` object. Docstrings are precisely that, and, by convention, always written in triple-double quotes `"""`.

In [98]:
multi_line = """
I am a multi-line string
consisting of 4 lines.
"""

Line breaks are kept and implicitly converted into `\n` characters.

In [99]:
multi_line

'\nI am a multi-line string\nconsisting of 4 lines.\n'

The built-in [print()](https://docs.python.org/3/library/functions.html#print) function correctly prints out the `\n` characters.

In [100]:
print(multi_line)


I am a multi-line string
consisting of 4 lines.



Using the [split()](https://docs.python.org/3/library/stdtypes.html#str.split) method with the optional *sep* argument, we confirm that `multi_line` consists of *four* lines with the first and last line breaks being the first and last characters in the `str` object.

In [101]:
for i, line in enumerate(multi_line.split("\n")):
    print(i, line)

0 
1 I am a multi-line string
2 consisting of 4 lines.
3 
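As a possible alternative to `split("\n")`, the [splitlines()](https://docs.python.org/3/library/stdtypes.html#str.splitlines) method also breaks a `str` object at line boundaries. The sketch below re-creates `multi_line` from above and hints at one difference: [splitlines()](https://docs.python.org/3/library/stdtypes.html#str.splitlines) does not report an empty last element for a trailing line break.

```python
multi_line = """
I am a multi-line string
consisting of 4 lines.
"""

with_split = multi_line.split("\n")        # keeps an empty last element
with_splitlines = multi_line.splitlines()  # drops it

print(len(with_split), len(with_splitlines))
```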


The next code cell puts several constructs from this chapter together to create a multi-line `str` object `content`: The `with` statement provides a context that ensures `file` is not left open. Then, the [readlines()](https://docs.python.org/3/library/io.html#io.IOBase.readlines) method returns the contents of `file` as a `list` object holding as many `str` objects as there are lines in the file on disk. Lastly, we concatenate these together with the [join()](https://docs.python.org/3/library/stdtypes.html#str.join) method to obtain `content`. We do so on an empty `str` object `""` as each line already ends with a `"\n"`.

In [102]:
with open("lorem_ipsum.txt") as file:
    content = "".join(file.readlines())

In [103]:
content

"Lorem Ipsum is simply dummy text of the printing and typesetting industry.\nLorem Ipsum has been the industry's standard dummy text ever since the 1500s\nwhen an unknown printer took a galley of type and scrambled it to make a type\nspecimen book. It has survived not only five centuries but also the leap into\nelectronic typesetting, remaining essentially unchanged. It was popularised in\nthe 1960s with the release of Letraset sheets.\n"

In [104]:
print(content)

Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s
when an unknown printer took a galley of type and scrambled it to make a type
specimen book. It has survived not only five centuries but also the leap into
electronic typesetting, remaining essentially unchanged. It was popularised in
the 1960s with the release of Letraset sheets.
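The [read()](https://docs.python.org/3/library/io.html#io.TextIOBase.read) method mentioned earlier in this chapter is a shorter way to arrive at the same result. To stay independent of `lorem_ipsum.txt`, the sketch below assumes a throwaway file named `example.txt` that it creates in a temporary folder.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.txt")  # hypothetical file name

    with open(path, "w") as file:
        file.write("first line\nsecond line\n")

    # Variant 1: join the individual lines as above.
    with open(path) as file:
        joined = "".join(file.readlines())

    # Variant 2: read the entire contents in one go.
    with open(path) as file:
        read_at_once = file.read()

print(joined == read_at_once)
```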



## TL;DR

Textual data is modeled with the **immutable** `str` type.

The `str` type supports *four* orthogonal **abstract concepts** that together constitute the idea of a **sequence**: Every `str` object is an iterable container of a finite number of ordered characters.