

"cells": [
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
"source": [
"**Note**: Click on \"*Kernel*\" > \"*Restart Kernel and Clear All Outputs*\" in [JupyterLab]( *before* reading this notebook to reset its output. If you cannot run this file on your machine, you may want to open it [in the cloud <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_mb.png\">]("
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
"source": [
"# Chapter 7: Sequential Data"
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
"source": [
"We studied numbers (cf., [Chapter 5 <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_nb.png\">]( and textual data (cf., [Chapter 6 <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_nb.png\">]( first mainly because objects of the presented data types are \"simple.\" That is so for two reasons: First, they are *immutable*, and, as we saw in the \"*Who am I? And how many?*\" section in [Chapter 1 <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_nb.png\">](, mutable objects can quickly become hard to reason about. Second, they are \"flat\" in the sense that they are *not* composed of other objects.\n",
"The `str` type is a bit of a corner case in this regard. While one could argue that a longer `str` object, for example, `\"text\"`, is composed of individual characters, this is *not* the case in memory as the literal `\"text\"` only creates *one* object (i.e., one \"bag\" of $0$s and $1$s modeling all characters).\n",
"This chapter, [Chapter 8 <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_nb.png\">](, [Chapter 9 <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_nb.png\">](, and [Chapter 10 <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_nb.png\">]( introduce various \"complex\" data types. While some are mutable and others are not, they all share that they are primarily used to \"manage,\" or structure, the memory in a program (i.e., they provide references to other objects). Unsurprisingly, computer scientists refer to the ideas behind these data types as **[data structures <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_wiki.png\">](**.\n",
"In this chapter, we focus on data types that model all kinds of sequential data. Examples of such data are [spreadsheets <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_wiki.png\">]( or [matrices <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_wiki.png\">]( and [vectors <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_wiki.png\">]( These formats share the property that they are composed of smaller units that come in a sequence of, for example, rows/columns/cells or elements/entries."
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
"source": [
"## Collections vs. Sequences"
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
"source": [
"[Chapter 6 <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_nb.png\">](\"String\"-of-Characters) already describes the **sequence** properties of `str` objects. In this section, we take a step back and study these properties one by one.\n",
"The [ <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">]( module in the [standard library <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">]( defines a variety of **abstract base classes** (ABCs). We saw ABCs already in [Chapter 5 <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_nb.png\">](, where we use the ones from the [numbers <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">]( module in the [standard library <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">]( to classify Python's numeric data types according to mathematical ideas. Now, we take the ABCs from the [ <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">]( module to classify the data types in this chapter according to their behavior in various contexts.\n",
"As an illustration, consider `numbers` and `text` below, two objects of *different* types."
"cell_type": "code",
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "slide"
"outputs": [],
"source": [
"numbers = [7, 11, 8, 5, 3, 12, 2, 6, 9, 10, 1, 4]\n",
"text = \"Lorem ipsum dolor sit amet.\""
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
"source": [
"Among others, one commonality between the two is that we may loop over them with the `for` statement. So, in the context of iteration, both exhibit the *same* behavior."
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "fragment"
"outputs": [
"name": "stdout",
"output_type": "stream",
"text": [
"7 11 8 5 3 12 2 6 9 10 1 4 "
"source": [
"for number in numbers:\n",
" print(number, end=\" \")"
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "fragment"
"outputs": [
"name": "stdout",
"output_type": "stream",
"text": [
"L o r e m i p s u m d o l o r s i t a m e t . "
"source": [
"for character in text:\n",
" print(character, end=\" \")"
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
"source": [
"In [Chapter 4 <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_nb.png\">](, we referred to such types as *iterables*. That is *not* a proper [English]( word, even if it may sound like one at first sight. Yet, it is an official term in the Python world formalized with the `Iterable` ABC in the [ <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">]( module.\n",
"For the data science practitioner, it is worthwhile to know such terms as, for example, the documentation on the [built-ins <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">]( uses them extensively: In simple words, any built-in that takes an argument called \"*iterable*\" may be called with *any* object that supports being looped over. Already familiar [built-ins <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">]( include [enumerate() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](, [sum() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">](, or [zip() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">]( So, they do *not* require the argument to be of a certain data type (e.g., `list`); instead, any *iterable* type works."
"cell_type": "code",
"execution_count": 4,
"metadata": {
"slideshow": {
"slide_type": "slide"
"outputs": [],
"source": [
"import as abc"
"cell_type": "code",
"execution_count": 5,
"metadata": {
"slideshow": {
"slide_type": "fragment"
"outputs": [
"data": {
"text/plain": [
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
"source": [
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
"source": [
"As seen in [Chapter 5 <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_nb.png\">](, we can use ABCs with the built-in [isinstance() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">]( function to check if an object supports a behavior.\n",
"So, let's \"ask\" Python if it can loop over `numbers` or `text`."
"cell_type": "code",
"execution_count": 6,
"metadata": {
"slideshow": {
"slide_type": "slide"
"outputs": [
"data": {
"text/plain": [
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
"source": [
"isinstance(numbers, abc.Iterable)"
"cell_type": "code",
"execution_count": 7,
"metadata": {
"slideshow": {
"slide_type": "fragment"
"outputs": [
"data": {
"text/plain": [
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
"source": [
"isinstance(text, abc.Iterable)"
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
"source": [
"Contrary to `list` or `str` objects, numeric objects are *not* iterable."
"cell_type": "code",
"execution_count": 8,
"metadata": {
"slideshow": {
"slide_type": "skip"
"outputs": [
"data": {
"text/plain": [
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
"source": [
"isinstance(999, abc.Iterable)"
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
"source": [
"Instead of asking, we could try to loop over `999`, but this results in a `TypeError`."
"cell_type": "code",
"execution_count": 9,
"metadata": {
"slideshow": {
"slide_type": "skip"
"outputs": [
"ename": "TypeError",
"evalue": "'int' object is not iterable",
"output_type": "error",
"traceback": [
"\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[9], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[38;5;28;43;01mfor\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43mdigit\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01min\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[38;5;241;43m999\u001b[39;49m\u001b[43m:\u001b[49m\n\u001b[1;32m 2\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mprint\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43mdigit\u001b[49m\u001b[43m)\u001b[49m\n",
"\u001b[0;31mTypeError\u001b[0m: 'int' object is not iterable"
"source": [
"for digit in 999:\n",
" print(digit)"
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
"source": [
"Most of the data types in this chapter and [Chapter 9 <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_nb.png\">]( and [Chapter 10 <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_nb.png\">]( exhibit three [orthogonal <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_wiki.png\">]( (i.e., \"independent\") behaviors, formalized by ABCs in the [ <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">]( module as:\n",
"- `Iterable`: An object may be looped over.\n",
"- `Container`: An object \"contains\" references to other objects; a \"whole\" is composed of many \"parts.\"\n",
"- `Sized`: The number of references to other objects, the \"parts,\" is *finite*.\n",
"The characteristical operation supported by `Container` types is the `in` operator for membership testing."
"cell_type": "code",
"execution_count": 10,
"metadata": {
"slideshow": {
"slide_type": "slide"
"outputs": [
"data": {
"text/plain": [
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
"source": [
"0 in numbers"
"cell_type": "code",
"execution_count": 11,
"metadata": {
"slideshow": {
"slide_type": "fragment"
"outputs": [
"data": {
"text/plain": [
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
"source": [
"\"l\" in text"
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
"source": [
"Alternatively, we could also check if `numbers` and `text` are `Container` types with [isinstance() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">]("
"cell_type": "code",
"execution_count": 12,
"metadata": {
"slideshow": {
"slide_type": "fragment"
"outputs": [
"data": {
"text/plain": [
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
"source": [
"isinstance(numbers, abc.Container)"
"cell_type": "code",
"execution_count": 13,
"metadata": {
"slideshow": {
"slide_type": "fragment"
"outputs": [
"data": {
"text/plain": [
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
"source": [
"isinstance(text, abc.Container)"
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
"source": [
"Numeric objects do *not* \"contain\" references to other objects, and that is why they are considered \"flat\" data types. The `in` operator raises a `TypeError`. Conceptually speaking, Python views numeric types as \"wholes\" without any \"parts.\""
"cell_type": "code",
"execution_count": 14,
"metadata": {
"slideshow": {
"slide_type": "skip"
"outputs": [
"data": {
"text/plain": [
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
"source": [
"isinstance(999, abc.Container)"
"cell_type": "code",
"execution_count": 15,
"metadata": {
"slideshow": {
"slide_type": "skip"
"outputs": [
"ename": "TypeError",
"evalue": "argument of type 'int' is not iterable",
"output_type": "error",
"traceback": [
"\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[15], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[38;5;241;43m9\u001b[39;49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01min\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[38;5;241;43m999\u001b[39;49m\n",
"\u001b[0;31mTypeError\u001b[0m: argument of type 'int' is not iterable"
"source": [
"9 in 999"
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
"source": [
"Analogously, being `Sized` types, we can pass `numbers` and `text` as the argument to the built-in [len() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">]( function and obtain \"meaningful\" results. The exact meaning depends on the data type: For `numbers`, [len() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">]( tells us how many elements are in the `list` object; for `text`, it tells us how many [Unicode characters <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_wiki.png\">]( make up the `str` object. *Abstractly* speaking, both data types exhibit the *same* behavior of *finiteness*."
"cell_type": "code",
"execution_count": 16,
"metadata": {
"slideshow": {
"slide_type": "slide"
"outputs": [
"data": {
"text/plain": [
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
"source": [
"cell_type": "code",
"execution_count": 17,
"metadata": {
"slideshow": {
"slide_type": "fragment"
"outputs": [
"data": {
"text/plain": [
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
"source": [
"cell_type": "code",
"execution_count": 18,
"metadata": {
"slideshow": {
"slide_type": "fragment"
"outputs": [
"data": {
"text/plain": [
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
"source": [
"isinstance(numbers, abc.Sized)"
"cell_type": "code",
"execution_count": 19,
"metadata": {
"slideshow": {
"slide_type": "fragment"
"outputs": [
"data": {
"text/plain": [
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
"source": [
"isinstance(text, abc.Sized)"
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
"source": [
"On the contrary, even though `999` consists of three digits for humans, numeric objects in Python have no concept of a \"size\" or \"length,\" and the [len() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">]( function raises a `TypeError`."
"cell_type": "code",
"execution_count": 20,
"metadata": {
"slideshow": {
"slide_type": "skip"
"outputs": [
"data": {
"text/plain": [
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
"source": [
"isinstance(999, abc.Sized)"
"cell_type": "code",
"execution_count": 21,
"metadata": {
"slideshow": {
"slide_type": "skip"
"outputs": [
"ename": "TypeError",
"evalue": "object of type 'int' has no len()",
"output_type": "error",
"traceback": [
"\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[21], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[38;5;28;43mlen\u001b[39;49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m999\u001b[39;49m\u001b[43m)\u001b[49m\n",
"\u001b[0;31mTypeError\u001b[0m: object of type 'int' has no len()"
"source": [
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
"source": [
"These three behaviors are so essential that whenever they coincide for a data type, it is called a **collection**, formalized with the `Collection` ABC. That is where the [ <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">]( module got its name from: It summarizes all ABCs related to collections; in particular, it defines a hierarchy of specialized kinds of collections.\n",
"Without going into too much detail, one way to read the summary table at the beginning of the [ <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">]( module's documention is as follows: The first column, titled \"ABC\", lists all collection-related ABCs in Python. The second column, titled \"Inherits from,\" indicates if the idea behind the ABC is *original* (e.g., the first row with the `Container` ABC has an empty \"Inherits from\" column) or a *combination* (e.g., the row with the `Collection` ABC has `Sized`, `Iterable`, and `Container` in the \"Inherits from\" column). The third and fourth columns list the methods that come with a data type following an ABC. We keep ignoring the methods named in the dunder style for now.\n",
"So, let's confirm that both `numbers` and `text` are collections."
"cell_type": "code",
"execution_count": 22,
"metadata": {
"slideshow": {
"slide_type": "slide"
"outputs": [
"data": {
"text/plain": [
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
"source": [
"isinstance(numbers, abc.Collection)"
"cell_type": "code",
"execution_count": 23,
"metadata": {
"slideshow": {
"slide_type": "fragment"
"outputs": [
"data": {
"text/plain": [
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
"source": [
"isinstance(text, abc.Collection)"
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
"source": [
"They share one more common behavior: When looping over them, we can *predict* the *order* of the elements or characters. The ABC in the [ <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">]( module corresponding to this behavior is `Reversible`. While sounding unintuitive at first, it is evident that if something is reversible, it must have a forward order, to begin with.\n",
"The [reversed() <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">]( built-in allows us to loop over the elements or characters in reverse order."
"cell_type": "code",
"execution_count": 24,
"metadata": {
"slideshow": {
"slide_type": "slide"
"outputs": [
"name": "stdout",
"output_type": "stream",
"text": [
"4 1 10 9 6 2 12 3 5 8 11 7 "
"source": [
"for number in reversed(numbers):\n",
" print(number, end=\" \")"
"cell_type": "code",
"execution_count": 25,
"metadata": {
"slideshow": {
"slide_type": "fragment"
"outputs": [
"name": "stdout",
"output_type": "stream",
"text": [
". t e m a t i s r o l o d m u s p i m e r o L "
"source": [
"for character in reversed(text):\n",
" print(character, end=\" \")"
"cell_type": "code",
"execution_count": 26,
"metadata": {
"slideshow": {
"slide_type": "fragment"
"outputs": [
"data": {
"text/plain": [
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
"source": [
"isinstance(numbers, abc.Reversible)"
"cell_type": "code",
"execution_count": 27,
"metadata": {
"slideshow": {
"slide_type": "fragment"
"outputs": [
"data": {
"text/plain": [
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
"source": [
"isinstance(text, abc.Reversible)"
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
"source": [
"Collections that exhibit this fourth behavior are referred to as **sequences**, formalized with the `Sequence` ABC in the [ <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_py.png\">]( module."
"cell_type": "code",
"execution_count": 28,
"metadata": {
"slideshow": {
"slide_type": "slide"
"outputs": [
"data": {
"text/plain": [
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
"source": [
"isinstance(numbers, abc.Sequence)"
"cell_type": "code",
"execution_count": 29,
"metadata": {
"slideshow": {
"slide_type": "fragment"
"outputs": [
"data": {
"text/plain": [
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
"source": [
"isinstance(text, abc.Sequence)"
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
"source": [
"The data types introduced in this chapter are sequences. Nevertheless, we also look at some data types that are neither collections nor sequences but are still useful to model sequential data in practice in [Chapter 8 <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_nb.png\">](\n",
"In Python-related documentations, the terms collection and sequence are heavily used, and the data science practitioner should always think of them in terms of the three or four behaviors they exhibit.\n",
"Data types that are collections but not sequences are covered in [Chapter 9 <img height=\"12\" style=\"display: inline-block\" src=\"../static/link/to_nb.png\">]("
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.2"
"livereveal": {
"auto_select": "code",
"auto_select_fragment": true,
"scroll": true,
"theme": "serif"
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": true,
"skip_h1_title": true,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {
"height": "calc(100% - 180px)",
"left": "10px",
"top": "150px",
"width": "384px"
"toc_section_display": false,
"toc_window_display": false
"nbformat": 4,
"nbformat_minor": 4