{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Cleaning"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## \"Housekeeping\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Import all the third-party (scientific) libraries needed."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import missingno as msno\n",
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The *utils.py* module defines helper dictionaries, lists, and functions that help with parsing the data types correctly, look up column descriptions, and refer to groups of data columns.\n",
"\n",
"**Note:** the suffix \\_*COLUMNS* indicates a dictionary with all meta information on the provided data file and \\_*VARIABLES* a list with only the column names (i.e., the keys of the respective \\_*COLUMNS* dictionary)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from utils import (\n",
" ALL_COLUMNS,\n",
" ALL_VARIABLES,\n",
" CONTINUOUS_COLUMNS,\n",
" CONTINUOUS_VARIABLES,\n",
" DISCRETE_COLUMNS,\n",
" DISCRETE_VARIABLES,\n",
" INDEX_COLUMNS,\n",
" LABEL_COLUMNS, # groups nominal and ordinal\n",
" LABEL_TYPES,\n",
" NOMINAL_COLUMNS,\n",
" NOMINAL_VARIABLES,\n",
" NUMERIC_VARIABLES, # groups continuous and discrete\n",
" ORDINAL_COLUMNS,\n",
" ORDINAL_VARIABLES,\n",
" TARGET_VARIABLES, # = Sale Price\n",
" correct_column_names,\n",
" print_column_list,\n",
" update_column_descriptions,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Show all data columns.\n",
"pd.set_option(\"display.max_columns\", 100)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The original data are available for [download](https://www.amstat.org/publications/jse/v19n3/decock/AmesHousing.xls) and a detailed description of the data types for each column can be found [here](https://www.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt). These meta data go into the `dtype` argument of the `read_excel` function below to parse the data correctly. There are four different generic data types defined that are casted as follows:\n",
"\n",
"- continuous -> np.float64\n",
"- discrete -> actually np.int64 but np.float64 because of missing values\n",
"- nominal -> object (str)\n",
"- ordinal -> object (str), the order can be looked up in the above mentioned *ALL_COLUMNS* dictionary\n",
"\n",
"**Note 1:** the data come with a lot of \"NA\" text strings that do **not** indicate missing data but, for example, the absence of a basement or a parking lot (see the linked data description).\n",
"\n",
"**Note 2:** the mappings from column names to data types are encoded in the \"utils.py\" module that defines the aforementioned helper dictionaries / lists.\n",
"\n",
"**Note 3:** the Excel file with all the data is either loaded from the local dictionary (= \"cache\") or obtained fresh from the source."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# To avoid redundancy.\n",
"kwargs = {\n",
" \"dtype\": { # Ensure each column is parsed as the correct data type.\n",
" column: ( # This creates a mapping from column name to data type.\n",
" object if mapping_info[\"type\"] in LABEL_TYPES else np.float64\n",
" )\n",
" for (column, mapping_info) in ALL_COLUMNS.items()\n",
" },\n",
" \"na_values\": \"\", # By default, pandas treats NA strings as missing,\n",
" \"keep_default_na\": False, # which is not the correct meaning here.\n",
"}\n",
"\n",
"try:\n",
" df = pd.read_excel(\"data/data_raw.xls\", **kwargs)\n",
"except FileNotFoundError:\n",
" df = pd.read_excel(\n",
" \"https://www.amstat.org/publications/jse/v19n3/decock/AmesHousing.xls\", **kwargs\n",
" )\n",
" # Cache the obtained file.\n",
" df.to_excel(\"data/data_raw.xls\")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Some columns names differ between the Excel file and\n",
"# the data description file. Correct that with the values\n",
"# in the Excel file.\n",
"correct_column_names(df.columns)\n",
"# Use a compound index and keep both\n",
"# identifying columns in the DataFrame.\n",
"df = df.set_index(INDEX_COLUMNS)\n",
"# Put the provided columns into the same\n",
"# order as in the encoded description file.\n",
"# Note that the target variable \"SalePrice\"\n",
"# is not in the description file.\n",
"df = df[ALL_VARIABLES + TARGET_VARIABLES]"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
1st Flr SF
\n",
"
2nd Flr SF
\n",
"
3Ssn Porch
\n",
"
Alley
\n",
"
Bedroom AbvGr
\n",
"
Bldg Type
\n",
"
Bsmt Cond
\n",
"
Bsmt Exposure
\n",
"
Bsmt Full Bath
\n",
"
Bsmt Half Bath
\n",
"
Bsmt Qual
\n",
"
Bsmt Unf SF
\n",
"
BsmtFin SF 1
\n",
"
BsmtFin SF 2
\n",
"
BsmtFin Type 1
\n",
"
BsmtFin Type 2
\n",
"
Central Air
\n",
"
Condition 1
\n",
"
Condition 2
\n",
"
Electrical
\n",
"
Enclosed Porch
\n",
"
Exter Cond
\n",
"
Exter Qual
\n",
"
Exterior 1st
\n",
"
Exterior 2nd
\n",
"
Fence
\n",
"
Fireplace Qu
\n",
"
Fireplaces
\n",
"
Foundation
\n",
"
Full Bath
\n",
"
Functional
\n",
"
Garage Area
\n",
"
Garage Cars
\n",
"
Garage Cond
\n",
"
Garage Finish
\n",
"
Garage Qual
\n",
"
Garage Type
\n",
"
Garage Yr Blt
\n",
"
Gr Liv Area
\n",
"
Half Bath
\n",
"
Heating
\n",
"
Heating QC
\n",
"
House Style
\n",
"
Kitchen AbvGr
\n",
"
Kitchen Qual
\n",
"
Land Contour
\n",
"
Land Slope
\n",
"
Lot Area
\n",
"
Lot Config
\n",
"
Lot Frontage
\n",
"
Lot Shape
\n",
"
Low Qual Fin SF
\n",
"
MS SubClass
\n",
"
MS Zoning
\n",
"
Mas Vnr Area
\n",
"
Mas Vnr Type
\n",
"
Misc Feature
\n",
"
Misc Val
\n",
"
Mo Sold
\n",
"
Neighborhood
\n",
"
Open Porch SF
\n",
"
Overall Cond
\n",
"
Overall Qual
\n",
"
Paved Drive
\n",
"
Pool Area
\n",
"
Pool QC
\n",
"
Roof Matl
\n",
"
Roof Style
\n",
"
Sale Condition
\n",
"
Sale Type
\n",
"
Screen Porch
\n",
"
Street
\n",
"
TotRms AbvGrd
\n",
"
Total Bsmt SF
\n",
"
Utilities
\n",
"
Wood Deck SF
\n",
"
Year Built
\n",
"
Year Remod/Add
\n",
"
Yr Sold
\n",
"
SalePrice
\n",
"
\n",
"
\n",
"
Order
\n",
"
PID
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
" \n",
"
\n",
"
1
\n",
"
526301100
\n",
"
1656.0
\n",
"
0.0
\n",
"
0
\n",
"
NA
\n",
"
3
\n",
"
1Fam
\n",
"
Gd
\n",
"
Gd
\n",
"
1.0
\n",
"
0.0
\n",
"
TA
\n",
"
441.0
\n",
"
639.0
\n",
"
0.0
\n",
"
BLQ
\n",
"
Unf
\n",
"
Y
\n",
"
Norm
\n",
"
Norm
\n",
"
SBrkr
\n",
"
0.0
\n",
"
TA
\n",
"
TA
\n",
"
BrkFace
\n",
"
Plywood
\n",
"
NA
\n",
"
Gd
\n",
"
2.0
\n",
"
CBlock
\n",
"
1.0
\n",
"
Typ
\n",
"
528.0
\n",
"
2.0
\n",
"
TA
\n",
"
Fin
\n",
"
TA
\n",
"
Attchd
\n",
"
1960.0
\n",
"
1656.0
\n",
"
0.0
\n",
"
GasA
\n",
"
Fa
\n",
"
1Story
\n",
"
1
\n",
"
TA
\n",
"
Lvl
\n",
"
Gtl
\n",
"
31770.0
\n",
"
Corner
\n",
"
141.0
\n",
"
IR1
\n",
"
0.0
\n",
"
020
\n",
"
RL
\n",
"
112.0
\n",
"
Stone
\n",
"
NA
\n",
"
0.0
\n",
"
5.0
\n",
"
NAmes
\n",
"
62.0
\n",
"
5
\n",
"
6
\n",
"
P
\n",
"
0.0
\n",
"
NA
\n",
"
CompShg
\n",
"
Hip
\n",
"
Normal
\n",
"
WD
\n",
"
0.0
\n",
"
Pave
\n",
"
7
\n",
"
1080.0
\n",
"
AllPub
\n",
"
210.0
\n",
"
1960.0
\n",
"
1960.0
\n",
"
2010.0
\n",
"
215000
\n",
"
\n",
"
\n",
"
2
\n",
"
526350040
\n",
"
896.0
\n",
"
0.0
\n",
"
0
\n",
"
NA
\n",
"
2
\n",
"
1Fam
\n",
"
TA
\n",
"
No
\n",
"
0.0
\n",
"
0.0
\n",
"
TA
\n",
"
270.0
\n",
"
468.0
\n",
"
144.0
\n",
"
Rec
\n",
"
LwQ
\n",
"
Y
\n",
"
Feedr
\n",
"
Norm
\n",
"
SBrkr
\n",
"
0.0
\n",
"
TA
\n",
"
TA
\n",
"
VinylSd
\n",
"
VinylSd
\n",
"
MnPrv
\n",
"
NA
\n",
"
0.0
\n",
"
CBlock
\n",
"
1.0
\n",
"
Typ
\n",
"
730.0
\n",
"
1.0
\n",
"
TA
\n",
"
Unf
\n",
"
TA
\n",
"
Attchd
\n",
"
1961.0
\n",
"
896.0
\n",
"
0.0
\n",
"
GasA
\n",
"
TA
\n",
"
1Story
\n",
"
1
\n",
"
TA
\n",
"
Lvl
\n",
"
Gtl
\n",
"
11622.0
\n",
"
Inside
\n",
"
80.0
\n",
"
Reg
\n",
"
0.0
\n",
"
020
\n",
"
RH
\n",
"
0.0
\n",
"
None
\n",
"
NA
\n",
"
0.0
\n",
"
6.0
\n",
"
NAmes
\n",
"
0.0
\n",
"
6
\n",
"
5
\n",
"
Y
\n",
"
0.0
\n",
"
NA
\n",
"
CompShg
\n",
"
Gable
\n",
"
Normal
\n",
"
WD
\n",
"
120.0
\n",
"
Pave
\n",
"
5
\n",
"
882.0
\n",
"
AllPub
\n",
"
140.0
\n",
"
1961.0
\n",
"
1961.0
\n",
"
2010.0
\n",
"
105000
\n",
"
\n",
"
\n",
"
3
\n",
"
526351010
\n",
"
1329.0
\n",
"
0.0
\n",
"
0
\n",
"
NA
\n",
"
3
\n",
"
1Fam
\n",
"
TA
\n",
"
No
\n",
"
0.0
\n",
"
0.0
\n",
"
TA
\n",
"
406.0
\n",
"
923.0
\n",
"
0.0
\n",
"
ALQ
\n",
"
Unf
\n",
"
Y
\n",
"
Norm
\n",
"
Norm
\n",
"
SBrkr
\n",
"
0.0
\n",
"
TA
\n",
"
TA
\n",
"
Wd Sdng
\n",
"
Wd Sdng
\n",
"
NA
\n",
"
NA
\n",
"
0.0
\n",
"
CBlock
\n",
"
1.0
\n",
"
Typ
\n",
"
312.0
\n",
"
1.0
\n",
"
TA
\n",
"
Unf
\n",
"
TA
\n",
"
Attchd
\n",
"
1958.0
\n",
"
1329.0
\n",
"
1.0
\n",
"
GasA
\n",
"
TA
\n",
"
1Story
\n",
"
1
\n",
"
Gd
\n",
"
Lvl
\n",
"
Gtl
\n",
"
14267.0
\n",
"
Corner
\n",
"
81.0
\n",
"
IR1
\n",
"
0.0
\n",
"
020
\n",
"
RL
\n",
"
108.0
\n",
"
BrkFace
\n",
"
Gar2
\n",
"
12500.0
\n",
"
6.0
\n",
"
NAmes
\n",
"
36.0
\n",
"
6
\n",
"
6
\n",
"
Y
\n",
"
0.0
\n",
"
NA
\n",
"
CompShg
\n",
"
Hip
\n",
"
Normal
\n",
"
WD
\n",
"
0.0
\n",
"
Pave
\n",
"
6
\n",
"
1329.0
\n",
"
AllPub
\n",
"
393.0
\n",
"
1958.0
\n",
"
1958.0
\n",
"
2010.0
\n",
"
172000
\n",
"
\n",
"
\n",
"
4
\n",
"
526353030
\n",
"
2110.0
\n",
"
0.0
\n",
"
0
\n",
"
NA
\n",
"
3
\n",
"
1Fam
\n",
"
TA
\n",
"
No
\n",
"
1.0
\n",
"
0.0
\n",
"
TA
\n",
"
1045.0
\n",
"
1065.0
\n",
"
0.0
\n",
"
ALQ
\n",
"
Unf
\n",
"
Y
\n",
"
Norm
\n",
"
Norm
\n",
"
SBrkr
\n",
"
0.0
\n",
"
TA
\n",
"
Gd
\n",
"
BrkFace
\n",
"
BrkFace
\n",
"
NA
\n",
"
TA
\n",
"
2.0
\n",
"
CBlock
\n",
"
2.0
\n",
"
Typ
\n",
"
522.0
\n",
"
2.0
\n",
"
TA
\n",
"
Fin
\n",
"
TA
\n",
"
Attchd
\n",
"
1968.0
\n",
"
2110.0
\n",
"
1.0
\n",
"
GasA
\n",
"
Ex
\n",
"
1Story
\n",
"
1
\n",
"
Ex
\n",
"
Lvl
\n",
"
Gtl
\n",
"
11160.0
\n",
"
Corner
\n",
"
93.0
\n",
"
Reg
\n",
"
0.0
\n",
"
020
\n",
"
RL
\n",
"
0.0
\n",
"
None
\n",
"
NA
\n",
"
0.0
\n",
"
4.0
\n",
"
NAmes
\n",
"
0.0
\n",
"
5
\n",
"
7
\n",
"
Y
\n",
"
0.0
\n",
"
NA
\n",
"
CompShg
\n",
"
Hip
\n",
"
Normal
\n",
"
WD
\n",
"
0.0
\n",
"
Pave
\n",
"
8
\n",
"
2110.0
\n",
"
AllPub
\n",
"
0.0
\n",
"
1968.0
\n",
"
1968.0
\n",
"
2010.0
\n",
"
244000
\n",
"
\n",
"
\n",
"
5
\n",
"
527105010
\n",
"
928.0
\n",
"
701.0
\n",
"
0
\n",
"
NA
\n",
"
3
\n",
"
1Fam
\n",
"
TA
\n",
"
No
\n",
"
0.0
\n",
"
0.0
\n",
"
Gd
\n",
"
137.0
\n",
"
791.0
\n",
"
0.0
\n",
"
GLQ
\n",
"
Unf
\n",
"
Y
\n",
"
Norm
\n",
"
Norm
\n",
"
SBrkr
\n",
"
0.0
\n",
"
TA
\n",
"
TA
\n",
"
VinylSd
\n",
"
VinylSd
\n",
"
MnPrv
\n",
"
TA
\n",
"
1.0
\n",
"
PConc
\n",
"
2.0
\n",
"
Typ
\n",
"
482.0
\n",
"
2.0
\n",
"
TA
\n",
"
Fin
\n",
"
TA
\n",
"
Attchd
\n",
"
1997.0
\n",
"
1629.0
\n",
"
1.0
\n",
"
GasA
\n",
"
Gd
\n",
"
2Story
\n",
"
1
\n",
"
TA
\n",
"
Lvl
\n",
"
Gtl
\n",
"
13830.0
\n",
"
Inside
\n",
"
74.0
\n",
"
IR1
\n",
"
0.0
\n",
"
060
\n",
"
RL
\n",
"
0.0
\n",
"
None
\n",
"
NA
\n",
"
0.0
\n",
"
3.0
\n",
"
Gilbert
\n",
"
34.0
\n",
"
5
\n",
"
5
\n",
"
Y
\n",
"
0.0
\n",
"
NA
\n",
"
CompShg
\n",
"
Gable
\n",
"
Normal
\n",
"
WD
\n",
"
0.0
\n",
"
Pave
\n",
"
6
\n",
"
928.0
\n",
"
AllPub
\n",
"
212.0
\n",
"
1997.0
\n",
"
1998.0
\n",
"
2010.0
\n",
"
189900
\n",
"
\n",
"
\n",
"
6
\n",
"
527105030
\n",
"
926.0
\n",
"
678.0
\n",
"
0
\n",
"
NA
\n",
"
3
\n",
"
1Fam
\n",
"
TA
\n",
"
No
\n",
"
0.0
\n",
"
0.0
\n",
"
TA
\n",
"
324.0
\n",
"
602.0
\n",
"
0.0
\n",
"
GLQ
\n",
"
Unf
\n",
"
Y
\n",
"
Norm
\n",
"
Norm
\n",
"
SBrkr
\n",
"
0.0
\n",
"
TA
\n",
"
TA
\n",
"
VinylSd
\n",
"
VinylSd
\n",
"
NA
\n",
"
Gd
\n",
"
1.0
\n",
"
PConc
\n",
"
2.0
\n",
"
Typ
\n",
"
470.0
\n",
"
2.0
\n",
"
TA
\n",
"
Fin
\n",
"
TA
\n",
"
Attchd
\n",
"
1998.0
\n",
"
1604.0
\n",
"
1.0
\n",
"
GasA
\n",
"
Ex
\n",
"
2Story
\n",
"
1
\n",
"
Gd
\n",
"
Lvl
\n",
"
Gtl
\n",
"
9978.0
\n",
"
Inside
\n",
"
78.0
\n",
"
IR1
\n",
"
0.0
\n",
"
060
\n",
"
RL
\n",
"
20.0
\n",
"
BrkFace
\n",
"
NA
\n",
"
0.0
\n",
"
6.0
\n",
"
Gilbert
\n",
"
36.0
\n",
"
6
\n",
"
6
\n",
"
Y
\n",
"
0.0
\n",
"
NA
\n",
"
CompShg
\n",
"
Gable
\n",
"
Normal
\n",
"
WD
\n",
"
0.0
\n",
"
Pave
\n",
"
7
\n",
"
926.0
\n",
"
AllPub
\n",
"
360.0
\n",
"
1998.0
\n",
"
1998.0
\n",
"
2010.0
\n",
"
195500
\n",
"
\n",
"
\n",
"
7
\n",
"
527127150
\n",
"
1338.0
\n",
"
0.0
\n",
"
0
\n",
"
NA
\n",
"
2
\n",
"
TwnhsE
\n",
"
TA
\n",
"
Mn
\n",
"
1.0
\n",
"
0.0
\n",
"
Gd
\n",
"
722.0
\n",
"
616.0
\n",
"
0.0
\n",
"
GLQ
\n",
"
Unf
\n",
"
Y
\n",
"
Norm
\n",
"
Norm
\n",
"
SBrkr
\n",
"
170.0
\n",
"
TA
\n",
"
Gd
\n",
"
CemntBd
\n",
"
CmentBd
\n",
"
NA
\n",
"
NA
\n",
"
0.0
\n",
"
PConc
\n",
"
2.0
\n",
"
Typ
\n",
"
582.0
\n",
"
2.0
\n",
"
TA
\n",
"
Fin
\n",
"
TA
\n",
"
Attchd
\n",
"
2001.0
\n",
"
1338.0
\n",
"
0.0
\n",
"
GasA
\n",
"
Ex
\n",
"
1Story
\n",
"
1
\n",
"
Gd
\n",
"
Lvl
\n",
"
Gtl
\n",
"
4920.0
\n",
"
Inside
\n",
"
41.0
\n",
"
Reg
\n",
"
0.0
\n",
"
120
\n",
"
RL
\n",
"
0.0
\n",
"
None
\n",
"
NA
\n",
"
0.0
\n",
"
4.0
\n",
"
StoneBr
\n",
"
0.0
\n",
"
5
\n",
"
8
\n",
"
Y
\n",
"
0.0
\n",
"
NA
\n",
"
CompShg
\n",
"
Gable
\n",
"
Normal
\n",
"
WD
\n",
"
0.0
\n",
"
Pave
\n",
"
6
\n",
"
1338.0
\n",
"
AllPub
\n",
"
0.0
\n",
"
2001.0
\n",
"
2001.0
\n",
"
2010.0
\n",
"
213500
\n",
"
\n",
"
\n",
"
8
\n",
"
527145080
\n",
"
1280.0
\n",
"
0.0
\n",
"
0
\n",
"
NA
\n",
"
2
\n",
"
TwnhsE
\n",
"
TA
\n",
"
No
\n",
"
0.0
\n",
"
0.0
\n",
"
Gd
\n",
"
1017.0
\n",
"
263.0
\n",
"
0.0
\n",
"
ALQ
\n",
"
Unf
\n",
"
Y
\n",
"
Norm
\n",
"
Norm
\n",
"
SBrkr
\n",
"
0.0
\n",
"
TA
\n",
"
Gd
\n",
"
HdBoard
\n",
"
HdBoard
\n",
"
NA
\n",
"
NA
\n",
"
0.0
\n",
"
PConc
\n",
"
2.0
\n",
"
Typ
\n",
"
506.0
\n",
"
2.0
\n",
"
TA
\n",
"
RFn
\n",
"
TA
\n",
"
Attchd
\n",
"
1992.0
\n",
"
1280.0
\n",
"
0.0
\n",
"
GasA
\n",
"
Ex
\n",
"
1Story
\n",
"
1
\n",
"
Gd
\n",
"
HLS
\n",
"
Gtl
\n",
"
5005.0
\n",
"
Inside
\n",
"
43.0
\n",
"
IR1
\n",
"
0.0
\n",
"
120
\n",
"
RL
\n",
"
0.0
\n",
"
None
\n",
"
NA
\n",
"
0.0
\n",
"
1.0
\n",
"
StoneBr
\n",
"
82.0
\n",
"
5
\n",
"
8
\n",
"
Y
\n",
"
0.0
\n",
"
NA
\n",
"
CompShg
\n",
"
Gable
\n",
"
Normal
\n",
"
WD
\n",
"
144.0
\n",
"
Pave
\n",
"
5
\n",
"
1280.0
\n",
"
AllPub
\n",
"
0.0
\n",
"
1992.0
\n",
"
1992.0
\n",
"
2010.0
\n",
"
191500
\n",
"
\n",
"
\n",
"
9
\n",
"
527146030
\n",
"
1616.0
\n",
"
0.0
\n",
"
0
\n",
"
NA
\n",
"
2
\n",
"
TwnhsE
\n",
"
TA
\n",
"
No
\n",
"
1.0
\n",
"
0.0
\n",
"
Gd
\n",
"
415.0
\n",
"
1180.0
\n",
"
0.0
\n",
"
GLQ
\n",
"
Unf
\n",
"
Y
\n",
"
Norm
\n",
"
Norm
\n",
"
SBrkr
\n",
"
0.0
\n",
"
TA
\n",
"
Gd
\n",
"
CemntBd
\n",
"
CmentBd
\n",
"
NA
\n",
"
TA
\n",
"
1.0
\n",
"
PConc
\n",
"
2.0
\n",
"
Typ
\n",
"
608.0
\n",
"
2.0
\n",
"
TA
\n",
"
RFn
\n",
"
TA
\n",
"
Attchd
\n",
"
1995.0
\n",
"
1616.0
\n",
"
0.0
\n",
"
GasA
\n",
"
Ex
\n",
"
1Story
\n",
"
1
\n",
"
Gd
\n",
"
Lvl
\n",
"
Gtl
\n",
"
5389.0
\n",
"
Inside
\n",
"
39.0
\n",
"
IR1
\n",
"
0.0
\n",
"
120
\n",
"
RL
\n",
"
0.0
\n",
"
None
\n",
"
NA
\n",
"
0.0
\n",
"
3.0
\n",
"
StoneBr
\n",
"
152.0
\n",
"
5
\n",
"
8
\n",
"
Y
\n",
"
0.0
\n",
"
NA
\n",
"
CompShg
\n",
"
Gable
\n",
"
Normal
\n",
"
WD
\n",
"
0.0
\n",
"
Pave
\n",
"
5
\n",
"
1595.0
\n",
"
AllPub
\n",
"
237.0
\n",
"
1995.0
\n",
"
1996.0
\n",
"
2010.0
\n",
"
236500
\n",
"
\n",
"
\n",
"
10
\n",
"
527162130
\n",
"
1028.0
\n",
"
776.0
\n",
"
0
\n",
"
NA
\n",
"
3
\n",
"
1Fam
\n",
"
TA
\n",
"
No
\n",
"
0.0
\n",
"
0.0
\n",
"
TA
\n",
"
994.0
\n",
"
0.0
\n",
"
0.0
\n",
"
Unf
\n",
"
Unf
\n",
"
Y
\n",
"
Norm
\n",
"
Norm
\n",
"
SBrkr
\n",
"
0.0
\n",
"
TA
\n",
"
TA
\n",
"
VinylSd
\n",
"
VinylSd
\n",
"
NA
\n",
"
TA
\n",
"
1.0
\n",
"
PConc
\n",
"
2.0
\n",
"
Typ
\n",
"
442.0
\n",
"
2.0
\n",
"
TA
\n",
"
Fin
\n",
"
TA
\n",
"
Attchd
\n",
"
1999.0
\n",
"
1804.0
\n",
"
1.0
\n",
"
GasA
\n",
"
Gd
\n",
"
2Story
\n",
"
1
\n",
"
Gd
\n",
"
Lvl
\n",
"
Gtl
\n",
"
7500.0
\n",
"
Inside
\n",
"
60.0
\n",
"
Reg
\n",
"
0.0
\n",
"
060
\n",
"
RL
\n",
"
0.0
\n",
"
None
\n",
"
NA
\n",
"
0.0
\n",
"
6.0
\n",
"
Gilbert
\n",
"
60.0
\n",
"
5
\n",
"
7
\n",
"
Y
\n",
"
0.0
\n",
"
NA
\n",
"
CompShg
\n",
"
Gable
\n",
"
Normal
\n",
"
WD
\n",
"
0.0
\n",
"
Pave
\n",
"
7
\n",
"
994.0
\n",
"
AllPub
\n",
"
140.0
\n",
"
1999.0
\n",
"
1999.0
\n",
"
2010.0
\n",
"
189000
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 1st Flr SF 2nd Flr SF 3Ssn Porch Alley Bedroom AbvGr \\\n",
"Order PID \n",
"1 526301100 1656.0 0.0 0 NA 3 \n",
"2 526350040 896.0 0.0 0 NA 2 \n",
"3 526351010 1329.0 0.0 0 NA 3 \n",
"4 526353030 2110.0 0.0 0 NA 3 \n",
"5 527105010 928.0 701.0 0 NA 3 \n",
"6 527105030 926.0 678.0 0 NA 3 \n",
"7 527127150 1338.0 0.0 0 NA 2 \n",
"8 527145080 1280.0 0.0 0 NA 2 \n",
"9 527146030 1616.0 0.0 0 NA 2 \n",
"10 527162130 1028.0 776.0 0 NA 3 \n",
"\n",
" Bldg Type Bsmt Cond Bsmt Exposure Bsmt Full Bath \\\n",
"Order PID \n",
"1 526301100 1Fam Gd Gd 1.0 \n",
"2 526350040 1Fam TA No 0.0 \n",
"3 526351010 1Fam TA No 0.0 \n",
"4 526353030 1Fam TA No 1.0 \n",
"5 527105010 1Fam TA No 0.0 \n",
"6 527105030 1Fam TA No 0.0 \n",
"7 527127150 TwnhsE TA Mn 1.0 \n",
"8 527145080 TwnhsE TA No 0.0 \n",
"9 527146030 TwnhsE TA No 1.0 \n",
"10 527162130 1Fam TA No 0.0 \n",
"\n",
" Bsmt Half Bath Bsmt Qual Bsmt Unf SF BsmtFin SF 1 \\\n",
"Order PID \n",
"1 526301100 0.0 TA 441.0 639.0 \n",
"2 526350040 0.0 TA 270.0 468.0 \n",
"3 526351010 0.0 TA 406.0 923.0 \n",
"4 526353030 0.0 TA 1045.0 1065.0 \n",
"5 527105010 0.0 Gd 137.0 791.0 \n",
"6 527105030 0.0 TA 324.0 602.0 \n",
"7 527127150 0.0 Gd 722.0 616.0 \n",
"8 527145080 0.0 Gd 1017.0 263.0 \n",
"9 527146030 0.0 Gd 415.0 1180.0 \n",
"10 527162130 0.0 TA 994.0 0.0 \n",
"\n",
" BsmtFin SF 2 BsmtFin Type 1 BsmtFin Type 2 Central Air \\\n",
"Order PID \n",
"1 526301100 0.0 BLQ Unf Y \n",
"2 526350040 144.0 Rec LwQ Y \n",
"3 526351010 0.0 ALQ Unf Y \n",
"4 526353030 0.0 ALQ Unf Y \n",
"5 527105010 0.0 GLQ Unf Y \n",
"6 527105030 0.0 GLQ Unf Y \n",
"7 527127150 0.0 GLQ Unf Y \n",
"8 527145080 0.0 ALQ Unf Y \n",
"9 527146030 0.0 GLQ Unf Y \n",
"10 527162130 0.0 Unf Unf Y \n",
"\n",
" Condition 1 Condition 2 Electrical Enclosed Porch Exter Cond \\\n",
"Order PID \n",
"1 526301100 Norm Norm SBrkr 0.0 TA \n",
"2 526350040 Feedr Norm SBrkr 0.0 TA \n",
"3 526351010 Norm Norm SBrkr 0.0 TA \n",
"4 526353030 Norm Norm SBrkr 0.0 TA \n",
"5 527105010 Norm Norm SBrkr 0.0 TA \n",
"6 527105030 Norm Norm SBrkr 0.0 TA \n",
"7 527127150 Norm Norm SBrkr 170.0 TA \n",
"8 527145080 Norm Norm SBrkr 0.0 TA \n",
"9 527146030 Norm Norm SBrkr 0.0 TA \n",
"10 527162130 Norm Norm SBrkr 0.0 TA \n",
"\n",
" Exter Qual Exterior 1st Exterior 2nd Fence Fireplace Qu \\\n",
"Order PID \n",
"1 526301100 TA BrkFace Plywood NA Gd \n",
"2 526350040 TA VinylSd VinylSd MnPrv NA \n",
"3 526351010 TA Wd Sdng Wd Sdng NA NA \n",
"4 526353030 Gd BrkFace BrkFace NA TA \n",
"5 527105010 TA VinylSd VinylSd MnPrv TA \n",
"6 527105030 TA VinylSd VinylSd NA Gd \n",
"7 527127150 Gd CemntBd CmentBd NA NA \n",
"8 527145080 Gd HdBoard HdBoard NA NA \n",
"9 527146030 Gd CemntBd CmentBd NA TA \n",
"10 527162130 TA VinylSd VinylSd NA TA \n",
"\n",
" Fireplaces Foundation Full Bath Functional Garage Area \\\n",
"Order PID \n",
"1 526301100 2.0 CBlock 1.0 Typ 528.0 \n",
"2 526350040 0.0 CBlock 1.0 Typ 730.0 \n",
"3 526351010 0.0 CBlock 1.0 Typ 312.0 \n",
"4 526353030 2.0 CBlock 2.0 Typ 522.0 \n",
"5 527105010 1.0 PConc 2.0 Typ 482.0 \n",
"6 527105030 1.0 PConc 2.0 Typ 470.0 \n",
"7 527127150 0.0 PConc 2.0 Typ 582.0 \n",
"8 527145080 0.0 PConc 2.0 Typ 506.0 \n",
"9 527146030 1.0 PConc 2.0 Typ 608.0 \n",
"10 527162130 1.0 PConc 2.0 Typ 442.0 \n",
"\n",
" Garage Cars Garage Cond Garage Finish Garage Qual \\\n",
"Order PID \n",
"1 526301100 2.0 TA Fin TA \n",
"2 526350040 1.0 TA Unf TA \n",
"3 526351010 1.0 TA Unf TA \n",
"4 526353030 2.0 TA Fin TA \n",
"5 527105010 2.0 TA Fin TA \n",
"6 527105030 2.0 TA Fin TA \n",
"7 527127150 2.0 TA Fin TA \n",
"8 527145080 2.0 TA RFn TA \n",
"9 527146030 2.0 TA RFn TA \n",
"10 527162130 2.0 TA Fin TA \n",
"\n",
" Garage Type Garage Yr Blt Gr Liv Area Half Bath Heating \\\n",
"Order PID \n",
"1 526301100 Attchd 1960.0 1656.0 0.0 GasA \n",
"2 526350040 Attchd 1961.0 896.0 0.0 GasA \n",
"3 526351010 Attchd 1958.0 1329.0 1.0 GasA \n",
"4 526353030 Attchd 1968.0 2110.0 1.0 GasA \n",
"5 527105010 Attchd 1997.0 1629.0 1.0 GasA \n",
"6 527105030 Attchd 1998.0 1604.0 1.0 GasA \n",
"7 527127150 Attchd 2001.0 1338.0 0.0 GasA \n",
"8 527145080 Attchd 1992.0 1280.0 0.0 GasA \n",
"9 527146030 Attchd 1995.0 1616.0 0.0 GasA \n",
"10 527162130 Attchd 1999.0 1804.0 1.0 GasA \n",
"\n",
" Heating QC House Style Kitchen AbvGr Kitchen Qual \\\n",
"Order PID \n",
"1 526301100 Fa 1Story 1 TA \n",
"2 526350040 TA 1Story 1 TA \n",
"3 526351010 TA 1Story 1 Gd \n",
"4 526353030 Ex 1Story 1 Ex \n",
"5 527105010 Gd 2Story 1 TA \n",
"6 527105030 Ex 2Story 1 Gd \n",
"7 527127150 Ex 1Story 1 Gd \n",
"8 527145080 Ex 1Story 1 Gd \n",
"9 527146030 Ex 1Story 1 Gd \n",
"10 527162130 Gd 2Story 1 Gd \n",
"\n",
" Land Contour Land Slope Lot Area Lot Config Lot Frontage \\\n",
"Order PID \n",
"1 526301100 Lvl Gtl 31770.0 Corner 141.0 \n",
"2 526350040 Lvl Gtl 11622.0 Inside 80.0 \n",
"3 526351010 Lvl Gtl 14267.0 Corner 81.0 \n",
"4 526353030 Lvl Gtl 11160.0 Corner 93.0 \n",
"5 527105010 Lvl Gtl 13830.0 Inside 74.0 \n",
"6 527105030 Lvl Gtl 9978.0 Inside 78.0 \n",
"7 527127150 Lvl Gtl 4920.0 Inside 41.0 \n",
"8 527145080 HLS Gtl 5005.0 Inside 43.0 \n",
"9 527146030 Lvl Gtl 5389.0 Inside 39.0 \n",
"10 527162130 Lvl Gtl 7500.0 Inside 60.0 \n",
"\n",
" Lot Shape Low Qual Fin SF MS SubClass MS Zoning \\\n",
"Order PID \n",
"1 526301100 IR1 0.0 020 RL \n",
"2 526350040 Reg 0.0 020 RH \n",
"3 526351010 IR1 0.0 020 RL \n",
"4 526353030 Reg 0.0 020 RL \n",
"5 527105010 IR1 0.0 060 RL \n",
"6 527105030 IR1 0.0 060 RL \n",
"7 527127150 Reg 0.0 120 RL \n",
"8 527145080 IR1 0.0 120 RL \n",
"9 527146030 IR1 0.0 120 RL \n",
"10 527162130 Reg 0.0 060 RL \n",
"\n",
" Mas Vnr Area Mas Vnr Type Misc Feature Misc Val Mo Sold \\\n",
"Order PID \n",
"1 526301100 112.0 Stone NA 0.0 5.0 \n",
"2 526350040 0.0 None NA 0.0 6.0 \n",
"3 526351010 108.0 BrkFace Gar2 12500.0 6.0 \n",
"4 526353030 0.0 None NA 0.0 4.0 \n",
"5 527105010 0.0 None NA 0.0 3.0 \n",
"6 527105030 20.0 BrkFace NA 0.0 6.0 \n",
"7 527127150 0.0 None NA 0.0 4.0 \n",
"8 527145080 0.0 None NA 0.0 1.0 \n",
"9 527146030 0.0 None NA 0.0 3.0 \n",
"10 527162130 0.0 None NA 0.0 6.0 \n",
"\n",
" Neighborhood Open Porch SF Overall Cond Overall Qual \\\n",
"Order PID \n",
"1 526301100 NAmes 62.0 5 6 \n",
"2 526350040 NAmes 0.0 6 5 \n",
"3 526351010 NAmes 36.0 6 6 \n",
"4 526353030 NAmes 0.0 5 7 \n",
"5 527105010 Gilbert 34.0 5 5 \n",
"6 527105030 Gilbert 36.0 6 6 \n",
"7 527127150 StoneBr 0.0 5 8 \n",
"8 527145080 StoneBr 82.0 5 8 \n",
"9 527146030 StoneBr 152.0 5 8 \n",
"10 527162130 Gilbert 60.0 5 7 \n",
"\n",
" Paved Drive Pool Area Pool QC Roof Matl Roof Style \\\n",
"Order PID \n",
"1 526301100 P 0.0 NA CompShg Hip \n",
"2 526350040 Y 0.0 NA CompShg Gable \n",
"3 526351010 Y 0.0 NA CompShg Hip \n",
"4 526353030 Y 0.0 NA CompShg Hip \n",
"5 527105010 Y 0.0 NA CompShg Gable \n",
"6 527105030 Y 0.0 NA CompShg Gable \n",
"7 527127150 Y 0.0 NA CompShg Gable \n",
"8 527145080 Y 0.0 NA CompShg Gable \n",
"9 527146030 Y 0.0 NA CompShg Gable \n",
"10 527162130 Y 0.0 NA CompShg Gable \n",
"\n",
" Sale Condition Sale Type Screen Porch Street TotRms AbvGrd \\\n",
"Order PID \n",
"1 526301100 Normal WD 0.0 Pave 7 \n",
"2 526350040 Normal WD 120.0 Pave 5 \n",
"3 526351010 Normal WD 0.0 Pave 6 \n",
"4 526353030 Normal WD 0.0 Pave 8 \n",
"5 527105010 Normal WD 0.0 Pave 6 \n",
"6 527105030 Normal WD 0.0 Pave 7 \n",
"7 527127150 Normal WD 0.0 Pave 6 \n",
"8 527145080 Normal WD 144.0 Pave 5 \n",
"9 527146030 Normal WD 0.0 Pave 5 \n",
"10 527162130 Normal WD 0.0 Pave 7 \n",
"\n",
" Total Bsmt SF Utilities Wood Deck SF Year Built \\\n",
"Order PID \n",
"1 526301100 1080.0 AllPub 210.0 1960.0 \n",
"2 526350040 882.0 AllPub 140.0 1961.0 \n",
"3 526351010 1329.0 AllPub 393.0 1958.0 \n",
"4 526353030 2110.0 AllPub 0.0 1968.0 \n",
"5 527105010 928.0 AllPub 212.0 1997.0 \n",
"6 527105030 926.0 AllPub 360.0 1998.0 \n",
"7 527127150 1338.0 AllPub 0.0 2001.0 \n",
"8 527145080 1280.0 AllPub 0.0 1992.0 \n",
"9 527146030 1595.0 AllPub 237.0 1995.0 \n",
"10 527162130 994.0 AllPub 140.0 1999.0 \n",
"\n",
" Year Remod/Add Yr Sold SalePrice \n",
"Order PID \n",
"1 526301100 1960.0 2010.0 215000 \n",
"2 526350040 1961.0 2010.0 105000 \n",
"3 526351010 1958.0 2010.0 172000 \n",
"4 526353030 1968.0 2010.0 244000 \n",
"5 527105010 1998.0 2010.0 189900 \n",
"6 527105030 1998.0 2010.0 195500 \n",
"7 527127150 2001.0 2010.0 213500 \n",
"8 527145080 1992.0 2010.0 191500 \n",
"9 527146030 1996.0 2010.0 236500 \n",
"10 527162130 1999.0 2010.0 189000 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Spelling Mistakes & Data Types"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Some textual values appear differently in the provided data file as compared to the specification. These inconsistencies are manually repaired."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# Repair spelling and whitespace mistakes.\n",
"df[\"Bldg Type\"] = df[\"Bldg Type\"].replace(to_replace=\"2fmCon\", value=\"2FmCon\")\n",
"df[\"Bldg Type\"] = df[\"Bldg Type\"].replace(to_replace=\"Duplex\", value=\"Duplx\")\n",
"df[\"Bldg Type\"] = df[\"Bldg Type\"].replace(to_replace=\"Twnhs\", value=\"TwnhsI\")\n",
"df[\"Exterior 2nd\"] = df[\"Exterior 2nd\"].replace(to_replace=\"Brk Cmn\", value=\"BrkComm\")\n",
"df[\"Exterior 2nd\"] = df[\"Exterior 2nd\"].replace(to_replace=\"CmentBd\", value=\"CemntBd\")\n",
"df[\"Exterior 2nd\"] = df[\"Exterior 2nd\"].replace(to_replace=\"Wd Shng\", value=\"WdShing\")\n",
"df[\"MS Zoning\"] = df[\"MS Zoning\"].replace(to_replace=\"A (agr)\", value=\"A\")\n",
"df[\"MS Zoning\"] = df[\"MS Zoning\"].replace(to_replace=\"C (all)\", value=\"C\")\n",
"df[\"MS Zoning\"] = df[\"MS Zoning\"].replace(to_replace=\"I (all)\", value=\"I\")\n",
"df[\"Neighborhood\"] = df[\"Neighborhood\"].replace(to_replace=\"NAmes\", value=\"Names\")\n",
"df[\"Sale Type\"] = df[\"Sale Type\"].replace(to_replace=\"WD \", value=\"WD\")"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# Ensure that the remaining textual values in the data file are a subset\n",
"# of the values allowed in the specification.\n",
"for column, mapping_info in LABEL_COLUMNS.items():\n",
" # Note that .unique() returns a numpy array with integer dtype in cases\n",
" # where the provided data can be casted as such (e.g., \"Overall Qual\" column).\n",
" values_in_data = set(str(x) for x in df[column].unique() if x is not np.nan)\n",
" values_in_description = set(mapping_info[\"lookups\"].keys())\n",
" assert values_in_data <= values_in_description"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Interestingly, all numeric columns (i.e. also \"continuous\" variables) come with only integer values."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# Show that all \"continuous\" variables come as integers.\n",
"for column in NUMERIC_VARIABLES + TARGET_VARIABLES:\n",
" not_null = df[column].notnull()\n",
" mask = (\n",
" df.loc[not_null, column].astype(np.int64)\n",
" != df.loc[not_null, column].astype(np.float64)\n",
" )\n",
" assert not mask.any()\n",
"# Cast discrete fields as integers where possible,\n",
"# i.e., all columns without missing values.\n",
"for column in DISCRETE_VARIABLES:\n",
" try:\n",
" df[column] = df[column].astype(np.int64)\n",
" except ValueError:\n",
" mask = df[column].notnull()\n",
" df.loc[mask, column].astype(np.int64)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Raw Data Overview"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The overall shape of the data is a 2930 rows x 80 columns matrix."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(2930, 80)"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Continuous Variables"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The continuous columns are truly continuous in the sense that each column has at least 14 unique value realizations."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"for column in CONTINUOUS_VARIABLES:\n",
" mask = df[column].notnull()\n",
" num_realizations = len(list(x for x in df.loc[mask, column].unique()))\n",
" assert num_realizations > 13"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A brief description of the variables:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1st Flr SF First Floor square feet\n",
"2nd Flr SF Second floor square feet\n",
"3Ssn Porch Three season porch area in square feet\n",
"Bsmt Unf SF Unfinished square feet of basement area\n",
"BsmtFin SF 1 Type 1 finished square feet\n",
"BsmtFin SF 2 Type 2 finished square feet\n",
"Enclosed Porch Enclosed porch area in square feet\n",
"Garage Area Size of garage in square feet\n",
"Gr Liv Area Above grade (ground) living area square feet\n",
"Lot Area Lot size in square feet\n",
"Lot Frontage Linear feet of street connected to property\n",
"Low Qual Fin SF Low quality finished square feet (all floors)\n",
"Mas Vnr Area Masonry veneer area in square feet\n",
"Misc Val $Value of miscellaneous feature\n",
"Open Porch SF Open porch area in square feet\n",
"Pool Area Pool area in square feet\n",
"Screen Porch Screen porch area in square feet\n",
"Total Bsmt SF Total square feet of basement area\n",
"Wood Deck SF Wood deck area in square feet\n"
]
}
],
"source": [
"print_column_list(CONTINUOUS_COLUMNS)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" 1st Flr SF 2nd Flr SF 3Ssn Porch Bsmt Unf SF \\\n",
"Order PID \n",
"1 526301100 1656.0 0.0 0 441.0 \n",
"2 526350040 896.0 0.0 0 270.0 \n",
"3 526351010 1329.0 0.0 0 406.0 \n",
"4 526353030 2110.0 0.0 0 1045.0 \n",
"5 527105010 928.0 701.0 0 137.0 \n",
"\n",
" BsmtFin SF 1 BsmtFin SF 2 Enclosed Porch Garage Area \\\n",
"Order PID \n",
"1 526301100 639.0 0.0 0.0 528.0 \n",
"2 526350040 468.0 144.0 0.0 730.0 \n",
"3 526351010 923.0 0.0 0.0 312.0 \n",
"4 526353030 1065.0 0.0 0.0 522.0 \n",
"5 527105010 791.0 0.0 0.0 482.0 \n",
"\n",
" Gr Liv Area Lot Area Lot Frontage Low Qual Fin SF \\\n",
"Order PID \n",
"1 526301100 1656.0 31770.0 141.0 0.0 \n",
"2 526350040 896.0 11622.0 80.0 0.0 \n",
"3 526351010 1329.0 14267.0 81.0 0.0 \n",
"4 526353030 2110.0 11160.0 93.0 0.0 \n",
"5 527105010 1629.0 13830.0 74.0 0.0 \n",
"\n",
" Mas Vnr Area Misc Val Open Porch SF Pool Area \\\n",
"Order PID \n",
"1 526301100 112.0 0.0 62.0 0.0 \n",
"2 526350040 0.0 0.0 0.0 0.0 \n",
"3 526351010 108.0 12500.0 36.0 0.0 \n",
"4 526353030 0.0 0.0 0.0 0.0 \n",
"5 527105010 0.0 0.0 34.0 0.0 \n",
"\n",
" Screen Porch Total Bsmt SF Wood Deck SF \n",
"Order PID \n",
"1 526301100 0.0 1080.0 210.0 \n",
"2 526350040 120.0 882.0 140.0 \n",
"3 526351010 0.0 1329.0 393.0 \n",
"4 526353030 0.0 2110.0 0.0 \n",
"5 527105010 0.0 928.0 212.0 "
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[CONTINUOUS_VARIABLES].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Except for *Lot Frontage* (490 of 2,930 values missing, i.e., about 17%), the columns with missing data lack only a few values each (i.e., < 1% of all rows)."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"MultiIndex: 2930 entries, (np.int64(1), np.int64(526301100)) to (np.int64(2930), np.int64(924151050))\n",
"Data columns (total 19 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 1st Flr SF 2930 non-null float64\n",
" 1 2nd Flr SF 2930 non-null float64\n",
" 2 3Ssn Porch 2930 non-null int64 \n",
" 3 Bsmt Unf SF 2929 non-null float64\n",
" 4 BsmtFin SF 1 2929 non-null float64\n",
" 5 BsmtFin SF 2 2929 non-null float64\n",
" 6 Enclosed Porch 2930 non-null float64\n",
" 7 Garage Area 2929 non-null float64\n",
" 8 Gr Liv Area 2930 non-null float64\n",
" 9 Lot Area 2930 non-null float64\n",
" 10 Lot Frontage 2440 non-null float64\n",
" 11 Low Qual Fin SF 2930 non-null float64\n",
" 12 Mas Vnr Area 2907 non-null float64\n",
" 13 Misc Val 2930 non-null float64\n",
" 14 Open Porch SF 2930 non-null float64\n",
" 15 Pool Area 2930 non-null float64\n",
" 16 Screen Porch 2930 non-null float64\n",
" 17 Total Bsmt SF 2929 non-null float64\n",
" 18 Wood Deck SF 2930 non-null float64\n",
"dtypes: float64(18), int64(1)\n",
"memory usage: 621.3 KB\n"
]
}
],
"source": [
"df[CONTINUOUS_VARIABLES].info()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"# The columns with a lot of missing\n",
"# values will be treated separately below.\n",
"missing_a_lot = [\"Lot Frontage\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Discrete Variables"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Excluding the three columns that hold construction years (*Year Built*, *Year Remod/Add*, and *Garage Yr Blt*), each discrete column has between 3 and 14 unique realizations."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"for column in DISCRETE_VARIABLES:\n",
"    # nunique() ignores missing values by default.\n",
"    num_realizations = df[column].nunique()\n",
"    if column not in (\"Year Built\", \"Year Remod/Add\", \"Garage Yr Blt\"):\n",
"        assert num_realizations < 15\n",
"    assert num_realizations > 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A brief description of the variables:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Bedroom AbvGr Bedrooms above grade (does NOT include basement bedrooms)\n",
"Bsmt Full Bath Basement full bathrooms\n",
"Bsmt Half Bath Basement half bathrooms\n",
"Fireplaces Number of fireplaces\n",
"Full Bath Full bathrooms above grade\n",
"Garage Cars Size of garage in car capacity\n",
"Garage Yr Blt Year garage was built\n",
"Half Bath Half baths above grade\n",
"Kitchen AbvGr Kitchens above grade\n",
"Mo Sold Month Sold (MM)\n",
"TotRms AbvGrd Total rooms above grade (does not include bathrooms)\n",
"Year Built Original construction date\n",
"Year Remod/Add Remodel date (same as construction date if no remodeling or additions)\n",
"Yr Sold Year Sold (YYYY)\n"
]
}
],
"source": [
"print_column_list(DISCRETE_COLUMNS)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note:** columns with missing values are implicitly cast to the *float64* type, as the *int64* type has no concept of a NaN (=\"Not a Number\") value."
]
},
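{
"cell_type": "markdown",
"metadata": {},
"source": [
"The casting behavior can be reproduced in isolation. The following is only a minimal sketch on a made-up toy Series (the names `complete`, `with_gap`, and `restored` are hypothetical and not part of this notebook or of *utils.py*):\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Toy data (an assumption for illustration, not taken from the Ames data).\n",
"complete = pd.Series([1, 2, 3])\n",
"with_gap = pd.Series([1, np.nan, 3])\n",
"\n",
"# Without missing values, pandas infers an integer type.\n",
"print(complete.dtype)\n",
"# NaN is a floating-point value, so the whole column is stored as float64.\n",
"print(with_gap.dtype)\n",
"\n",
"# Once the missing entry is discarded, a cast back to int64 succeeds.\n",
"restored = with_gap.dropna().astype(np.int64)\n",
"print(restored.dtype)\n",
"```\n",
"\n",
"This is why the discrete columns can only be cast to *int64* after the rows with missing values have been removed."
]
},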
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Bedroom AbvGr Bsmt Full Bath Bsmt Half Bath Fireplaces \\\n",
"Order PID \n",
"1 526301100 3 1.0 0.0 2 \n",
"2 526350040 2 0.0 0.0 0 \n",
"3 526351010 3 0.0 0.0 0 \n",
"4 526353030 3 1.0 0.0 2 \n",
"5 527105010 3 0.0 0.0 1 \n",
"\n",
" Full Bath Garage Cars Garage Yr Blt Half Bath \\\n",
"Order PID \n",
"1 526301100 1 2.0 1960.0 0 \n",
"2 526350040 1 1.0 1961.0 0 \n",
"3 526351010 1 1.0 1958.0 1 \n",
"4 526353030 2 2.0 1968.0 1 \n",
"5 527105010 2 2.0 1997.0 1 \n",
"\n",
" Kitchen AbvGr Mo Sold TotRms AbvGrd Year Built \\\n",
"Order PID \n",
"1 526301100 1 5 7 1960 \n",
"2 526350040 1 6 5 1961 \n",
"3 526351010 1 6 6 1958 \n",
"4 526353030 1 4 8 1968 \n",
"5 527105010 1 3 6 1997 \n",
"\n",
" Year Remod/Add Yr Sold \n",
"Order PID \n",
"1 526301100 1960 2010 \n",
"2 526350040 1961 2010 \n",
"3 526351010 1958 2010 \n",
"4 526353030 1968 2010 \n",
"5 527105010 1998 2010 "
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[DISCRETE_VARIABLES].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Except for *Garage Yr Blt* (159 of 2,930 values missing, i.e., about 5%), no discrete variable has a significant number of missing values (i.e., > 1% of all rows)."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"MultiIndex: 2930 entries, (np.int64(1), np.int64(526301100)) to (np.int64(2930), np.int64(924151050))\n",
"Data columns (total 14 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Bedroom AbvGr 2930 non-null int64 \n",
" 1 Bsmt Full Bath 2928 non-null float64\n",
" 2 Bsmt Half Bath 2928 non-null float64\n",
" 3 Fireplaces 2930 non-null int64 \n",
" 4 Full Bath 2930 non-null int64 \n",
" 5 Garage Cars 2929 non-null float64\n",
" 6 Garage Yr Blt 2771 non-null float64\n",
" 7 Half Bath 2930 non-null int64 \n",
" 8 Kitchen AbvGr 2930 non-null int64 \n",
" 9 Mo Sold 2930 non-null int64 \n",
" 10 TotRms AbvGrd 2930 non-null int64 \n",
" 11 Year Built 2930 non-null int64 \n",
" 12 Year Remod/Add 2930 non-null int64 \n",
" 13 Yr Sold 2930 non-null int64 \n",
"dtypes: float64(4), int64(10)\n",
"memory usage: 506.9 KB\n"
]
}
],
"source": [
"df[DISCRETE_VARIABLES].info()"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"missing_a_lot.append(\"Garage Yr Blt\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Nominal Variables"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Apart from *Neighborhood* with its 28 labels, each nominal column comes with between 2 and 17 distinct labels."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"for column in NOMINAL_VARIABLES:\n",
"    # nunique() ignores missing values by default.\n",
"    num_realizations = df[column].nunique()\n",
"    # Compare against the string itself: (\"Neighborhood\") is not a tuple.\n",
"    if column != \"Neighborhood\":\n",
"        assert num_realizations < 18\n",
"    assert num_realizations > 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A brief description of the variables:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Alley Type of alley access to property\n",
"Bldg Type Type of dwelling\n",
"Central Air Central air conditioning\n",
"Condition 1 Proximity to various conditions\n",
"Condition 2 Proximity to various conditions (if more than one is present)\n",
"Exterior 1st Exterior covering on house\n",
"Exterior 2nd Exterior covering on house (if more than one material)\n",
"Foundation Type of foundation\n",
"Garage Type Garage location\n",
"Heating Type of heating\n",
"House Style Style of dwelling\n",
"Land Contour Flatness of the property\n",
"Lot Config Lot configuration\n",
"MS SubClass Identifies the type of dwelling involved in the sale.\n",
"MS Zoning Identifies the general zoning classification of the sale.\n",
"Mas Vnr Type Masonry veneer type\n",
"Misc Feature Miscellaneous feature not covered in other categories\n",
"Neighborhood Physical locations within Ames city limits (map available)\n",
"Roof Matl Roof material\n",
"Roof Style Type of roof\n",
"Sale Condition Condition of sale\n",
"Sale Type Type of sale\n",
"Street Type of road access to property\n"
]
}
],
"source": [
"print_column_list(NOMINAL_COLUMNS)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Alley Bldg Type Central Air Condition 1 Condition 2 \\\n",
"Order PID \n",
"1 526301100 NA 1Fam Y Norm Norm \n",
"2 526350040 NA 1Fam Y Feedr Norm \n",
"3 526351010 NA 1Fam Y Norm Norm \n",
"4 526353030 NA 1Fam Y Norm Norm \n",
"5 527105010 NA 1Fam Y Norm Norm \n",
"6 527105030 NA 1Fam Y Norm Norm \n",
"7 527127150 NA TwnhsE Y Norm Norm \n",
"8 527145080 NA TwnhsE Y Norm Norm \n",
"9 527146030 NA TwnhsE Y Norm Norm \n",
"10 527162130 NA 1Fam Y Norm Norm \n",
"\n",
" Exterior 1st Exterior 2nd Foundation Garage Type Heating \\\n",
"Order PID \n",
"1 526301100 BrkFace Plywood CBlock Attchd GasA \n",
"2 526350040 VinylSd VinylSd CBlock Attchd GasA \n",
"3 526351010 Wd Sdng Wd Sdng CBlock Attchd GasA \n",
"4 526353030 BrkFace BrkFace CBlock Attchd GasA \n",
"5 527105010 VinylSd VinylSd PConc Attchd GasA \n",
"6 527105030 VinylSd VinylSd PConc Attchd GasA \n",
"7 527127150 CemntBd CemntBd PConc Attchd GasA \n",
"8 527145080 HdBoard HdBoard PConc Attchd GasA \n",
"9 527146030 CemntBd CemntBd PConc Attchd GasA \n",
"10 527162130 VinylSd VinylSd PConc Attchd GasA \n",
"\n",
" House Style Land Contour Lot Config MS SubClass MS Zoning \\\n",
"Order PID \n",
"1 526301100 1Story Lvl Corner 020 RL \n",
"2 526350040 1Story Lvl Inside 020 RH \n",
"3 526351010 1Story Lvl Corner 020 RL \n",
"4 526353030 1Story Lvl Corner 020 RL \n",
"5 527105010 2Story Lvl Inside 060 RL \n",
"6 527105030 2Story Lvl Inside 060 RL \n",
"7 527127150 1Story Lvl Inside 120 RL \n",
"8 527145080 1Story HLS Inside 120 RL \n",
"9 527146030 1Story Lvl Inside 120 RL \n",
"10 527162130 2Story Lvl Inside 060 RL \n",
"\n",
" Mas Vnr Type Misc Feature Neighborhood Roof Matl Roof Style \\\n",
"Order PID \n",
"1 526301100 Stone NA Names CompShg Hip \n",
"2 526350040 None NA Names CompShg Gable \n",
"3 526351010 BrkFace Gar2 Names CompShg Hip \n",
"4 526353030 None NA Names CompShg Hip \n",
"5 527105010 None NA Gilbert CompShg Gable \n",
"6 527105030 BrkFace NA Gilbert CompShg Gable \n",
"7 527127150 None NA StoneBr CompShg Gable \n",
"8 527145080 None NA StoneBr CompShg Gable \n",
"9 527146030 None NA StoneBr CompShg Gable \n",
"10 527162130 None NA Gilbert CompShg Gable \n",
"\n",
" Sale Condition Sale Type Street \n",
"Order PID \n",
"1 526301100 Normal WD Pave \n",
"2 526350040 Normal WD Pave \n",
"3 526351010 Normal WD Pave \n",
"4 526353030 Normal WD Pave \n",
"5 527105010 Normal WD Pave \n",
"6 527105030 Normal WD Pave \n",
"7 527127150 Normal WD Pave \n",
"8 527145080 Normal WD Pave \n",
"9 527146030 Normal WD Pave \n",
"10 527162130 Normal WD Pave "
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[NOMINAL_VARIABLES].head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Among the nominal variables, only *Mas Vnr Type* has missing values (23 of 2,930 rows), which is a negligible number."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"MultiIndex: 2930 entries, (np.int64(1), np.int64(526301100)) to (np.int64(2930), np.int64(924151050))\n",
"Data columns (total 23 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Alley 2930 non-null object\n",
" 1 Bldg Type 2930 non-null object\n",
" 2 Central Air 2930 non-null object\n",
" 3 Condition 1 2930 non-null object\n",
" 4 Condition 2 2930 non-null object\n",
" 5 Exterior 1st 2930 non-null object\n",
" 6 Exterior 2nd 2930 non-null object\n",
" 7 Foundation 2930 non-null object\n",
" 8 Garage Type 2930 non-null object\n",
" 9 Heating 2930 non-null object\n",
" 10 House Style 2930 non-null object\n",
" 11 Land Contour 2930 non-null object\n",
" 12 Lot Config 2930 non-null object\n",
" 13 MS SubClass 2930 non-null object\n",
" 14 MS Zoning 2930 non-null object\n",
" 15 Mas Vnr Type 2907 non-null object\n",
" 16 Misc Feature 2930 non-null object\n",
" 17 Neighborhood 2930 non-null object\n",
" 18 Roof Matl 2930 non-null object\n",
" 19 Roof Style 2930 non-null object\n",
" 20 Sale Condition 2930 non-null object\n",
" 21 Sale Type 2930 non-null object\n",
" 22 Street 2930 non-null object\n",
"dtypes: object(23)\n",
"memory usage: 712.9+ KB\n"
]
}
],
"source": [
"df[NOMINAL_VARIABLES].info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Ordinal Variables"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each ordinal column comes with between 3 and 10 distinct labels."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"for column in ORDINAL_VARIABLES:\n",
"    # nunique() ignores missing values by default.\n",
"    num_realizations = df[column].nunique()\n",
"    assert 2 < num_realizations < 11"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A brief description of the variables:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Bsmt Cond Evaluates the general condition of the basement\n",
"Bsmt Exposure Refers to walkout or garden level walls\n",
"Bsmt Qual Evaluates the height of the basement\n",
"BsmtFin Type 1 Rating of basement finished area\n",
"BsmtFin Type 2 Rating of basement finished area (if multiple types)\n",
"Electrical Electrical system\n",
"Exter Cond Evaluates the present condition of the material on the exterior\n",
"Exter Qual Evaluates the quality of the material on the exterior\n",
"Fence Fence quality\n",
"Fireplace Qu Fireplace quality\n",
"Functional Home functionality (Assume typical unless deductions are warranted)\n",
"Garage Cond Garage condition\n",
"Garage Finish Interior finish of the garage\n",
"Garage Qual Garage quality\n",
"Heating QC Heating quality and condition\n",
"Kitchen Qual Kitchen quality\n",
"Land Slope Slope of property\n",
"Lot Shape General shape of property\n",
"Overall Cond Rates the overall condition of the house\n",
"Overall Qual Rates the overall material and finish of the house\n",
"Paved Drive Paved driveway\n",
"Pool QC Pool quality\n",
"Utilities Type of utilities available\n"
]
}
],
"source": [
"print_column_list(ORDINAL_COLUMNS)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Bsmt Cond Bsmt Exposure Bsmt Qual BsmtFin Type 1 \\\n",
"Order PID \n",
"1 526301100 Gd Gd TA BLQ \n",
"2 526350040 TA No TA Rec \n",
"3 526351010 TA No TA ALQ \n",
"4 526353030 TA No TA ALQ \n",
"5 527105010 TA No Gd GLQ \n",
"6 527105030 TA No TA GLQ \n",
"7 527127150 TA Mn Gd GLQ \n",
"8 527145080 TA No Gd ALQ \n",
"9 527146030 TA No Gd GLQ \n",
"10 527162130 TA No TA Unf \n",
"\n",
" BsmtFin Type 2 Electrical Exter Cond Exter Qual Fence \\\n",
"Order PID \n",
"1 526301100 Unf SBrkr TA TA NA \n",
"2 526350040 LwQ SBrkr TA TA MnPrv \n",
"3 526351010 Unf SBrkr TA TA NA \n",
"4 526353030 Unf SBrkr TA Gd NA \n",
"5 527105010 Unf SBrkr TA TA MnPrv \n",
"6 527105030 Unf SBrkr TA TA NA \n",
"7 527127150 Unf SBrkr TA Gd NA \n",
"8 527145080 Unf SBrkr TA Gd NA \n",
"9 527146030 Unf SBrkr TA Gd NA \n",
"10 527162130 Unf SBrkr TA TA NA \n",
"\n",
" Fireplace Qu Functional Garage Cond Garage Finish Garage Qual \\\n",
"Order PID \n",
"1 526301100 Gd Typ TA Fin TA \n",
"2 526350040 NA Typ TA Unf TA \n",
"3 526351010 NA Typ TA Unf TA \n",
"4 526353030 TA Typ TA Fin TA \n",
"5 527105010 TA Typ TA Fin TA \n",
"6 527105030 Gd Typ TA Fin TA \n",
"7 527127150 NA Typ TA Fin TA \n",
"8 527145080 NA Typ TA RFn TA \n",
"9 527146030 TA Typ TA RFn TA \n",
"10 527162130 TA Typ TA Fin TA \n",
"\n",
" Heating QC Kitchen Qual Land Slope Lot Shape Overall Cond \\\n",
"Order PID \n",
"1 526301100 Fa TA Gtl IR1 5 \n",
"2 526350040 TA TA Gtl Reg 6 \n",
"3 526351010 TA Gd Gtl IR1 6 \n",
"4 526353030 Ex Ex Gtl Reg 5 \n",
"5 527105010 Gd TA Gtl IR1 5 \n",
"6 527105030 Ex Gd Gtl IR1 6 \n",
"7 527127150 Ex Gd Gtl Reg 5 \n",
"8 527145080 Ex Gd Gtl IR1 5 \n",
"9 527146030 Ex Gd Gtl IR1 5 \n",
"10 527162130 Gd Gd Gtl Reg 5 \n",
"\n",
" Overall Qual Paved Drive Pool QC Utilities \n",
"Order PID \n",
"1 526301100 6 P NA AllPub \n",
"2 526350040 5 Y NA AllPub \n",
"3 526351010 6 Y NA AllPub \n",
"4 526353030 7 Y NA AllPub \n",
"5 527105010 5 Y NA AllPub \n",
"6 527105030 6 Y NA AllPub \n",
"7 527127150 8 Y NA AllPub \n",
"8 527145080 8 Y NA AllPub \n",
"9 527146030 8 Y NA AllPub \n",
"10 527162130 7 Y NA AllPub "
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[ORDINAL_VARIABLES].head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ordinal variables have only a negligible number of missing values (at most 4 per column)."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"MultiIndex: 2930 entries, (np.int64(1), np.int64(526301100)) to (np.int64(2930), np.int64(924151050))\n",
"Data columns (total 23 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Bsmt Cond 2929 non-null object\n",
" 1 Bsmt Exposure 2926 non-null object\n",
" 2 Bsmt Qual 2929 non-null object\n",
" 3 BsmtFin Type 1 2929 non-null object\n",
" 4 BsmtFin Type 2 2928 non-null object\n",
" 5 Electrical 2929 non-null object\n",
" 6 Exter Cond 2930 non-null object\n",
" 7 Exter Qual 2930 non-null object\n",
" 8 Fence 2930 non-null object\n",
" 9 Fireplace Qu 2930 non-null object\n",
" 10 Functional 2930 non-null object\n",
" 11 Garage Cond 2929 non-null object\n",
" 12 Garage Finish 2928 non-null object\n",
" 13 Garage Qual 2929 non-null object\n",
" 14 Heating QC 2930 non-null object\n",
" 15 Kitchen Qual 2930 non-null object\n",
" 16 Land Slope 2930 non-null object\n",
" 17 Lot Shape 2930 non-null object\n",
" 18 Overall Cond 2930 non-null object\n",
" 19 Overall Qual 2930 non-null object\n",
" 20 Paved Drive 2930 non-null object\n",
" 21 Pool QC 2930 non-null object\n",
" 22 Utilities 2930 non-null object\n",
"dtypes: object(23)\n",
"memory usage: 712.9+ KB\n"
]
}
],
"source": [
"df[ORDINAL_VARIABLES].info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Missing Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Visualizations"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure: missingno matrix of the ordinal variables>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"msno.matrix(df[ORDINAL_VARIABLES]);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Cleansing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Disregarding the columns *Lot Frontage* and *Garage Yr Blt*, only about 1% of the observations (32 of 2,930 rows) have a missing value in any variable. These rows are discarded entirely, which avoids having to find meaningful replacements for the missing values."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"remaining_columns = sorted(set(ALL_VARIABLES) - set(missing_a_lot)) + TARGET_VARIABLES\n",
"mask = df[remaining_columns].isnull().any(axis=1)\n",
"assert (100 * mask.sum() / df.shape[0]) < 1.1 # percent\n",
"df = df[~mask]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two columns with a lot of missing values describe the year in which a house's (optional) garage was built and the linear feet of street connected to the property. The former is assumed to be unimportant for the appraisal of a house, and the latter is assumed to be captured by other variables (e.g., the overall size of the house). For the sake of simplicity, both columns are therefore dropped from the DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Garage Yr Blt Year garage was built\n",
"Lot Frontage Linear feet of street connected to property\n"
]
}
],
"source": [
"print_column_list(missing_a_lot)"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"df = df[remaining_columns]"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"# Remove the discarded columns from the helper dictionaries / lists.\n",
"update_column_descriptions(df.columns)\n",
"# Without any more missing data, cast all numeric\n",
"# columns as floats or integers respectively.\n",
"for column in CONTINUOUS_VARIABLES + TARGET_VARIABLES:\n",
" df[column] = df[column].astype(np.float64)\n",
"for column in DISCRETE_VARIABLES:\n",
" df[column] = df[column].astype(np.int64)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Clean Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cleaned data set consists of 2,898 rows and 78 columns."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(2898, 78)"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.shape"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" 1st Flr SF 2nd Flr SF 3Ssn Porch Alley Bedroom AbvGr \\\n",
"Order PID \n",
"1 526301100 1656.0 0.0 0.0 NA 3 \n",
"2 526350040 896.0 0.0 0.0 NA 2 \n",
"3 526351010 1329.0 0.0 0.0 NA 3 \n",
"4 526353030 2110.0 0.0 0.0 NA 3 \n",
"5 527105010 928.0 701.0 0.0 NA 3 \n",
"\n",
" Bldg Type Bsmt Cond Bsmt Exposure Bsmt Full Bath \\\n",
"Order PID \n",
"1 526301100 1Fam Gd Gd 1 \n",
"2 526350040 1Fam TA No 0 \n",
"3 526351010 1Fam TA No 0 \n",
"4 526353030 1Fam TA No 1 \n",
"5 527105010 1Fam TA No 0 \n",
"\n",
" Bsmt Half Bath Bsmt Qual Bsmt Unf SF BsmtFin SF 1 \\\n",
"Order PID \n",
"1 526301100 0 TA 441.0 639.0 \n",
"2 526350040 0 TA 270.0 468.0 \n",
"3 526351010 0 TA 406.0 923.0 \n",
"4 526353030 0 TA 1045.0 1065.0 \n",
"5 527105010 0 Gd 137.0 791.0 \n",
"\n",
" BsmtFin SF 2 BsmtFin Type 1 BsmtFin Type 2 Central Air \\\n",
"Order PID \n",
"1 526301100 0.0 BLQ Unf Y \n",
"2 526350040 144.0 Rec LwQ Y \n",
"3 526351010 0.0 ALQ Unf Y \n",
"4 526353030 0.0 ALQ Unf Y \n",
"5 527105010 0.0 GLQ Unf Y \n",
"\n",
" Condition 1 Condition 2 Electrical Enclosed Porch Exter Cond \\\n",
"Order PID \n",
"1 526301100 Norm Norm SBrkr 0.0 TA \n",
"2 526350040 Feedr Norm SBrkr 0.0 TA \n",
"3 526351010 Norm Norm SBrkr 0.0 TA \n",
"4 526353030 Norm Norm SBrkr 0.0 TA \n",
"5 527105010 Norm Norm SBrkr 0.0 TA \n",
"\n",
" Exter Qual Exterior 1st Exterior 2nd Fence Fireplace Qu \\\n",
"Order PID \n",
"1 526301100 TA BrkFace Plywood NA Gd \n",
"2 526350040 TA VinylSd VinylSd MnPrv NA \n",
"3 526351010 TA Wd Sdng Wd Sdng NA NA \n",
"4 526353030 Gd BrkFace BrkFace NA TA \n",
"5 527105010 TA VinylSd VinylSd MnPrv TA \n",
"\n",
" Fireplaces Foundation Full Bath Functional Garage Area \\\n",
"Order PID \n",
"1 526301100 2 CBlock 1 Typ 528.0 \n",
"2 526350040 0 CBlock 1 Typ 730.0 \n",
"3 526351010 0 CBlock 1 Typ 312.0 \n",
"4 526353030 2 CBlock 2 Typ 522.0 \n",
"5 527105010 1 PConc 2 Typ 482.0 \n",
"\n",
" Garage Cars Garage Cond Garage Finish Garage Qual \\\n",
"Order PID \n",
"1 526301100 2 TA Fin TA \n",
"2 526350040 1 TA Unf TA \n",
"3 526351010 1 TA Unf TA \n",
"4 526353030 2 TA Fin TA \n",
"5 527105010 2 TA Fin TA \n",
"\n",
" Garage Type Gr Liv Area Half Bath Heating Heating QC \\\n",
"Order PID \n",
"1 526301100 Attchd 1656.0 0 GasA Fa \n",
"2 526350040 Attchd 896.0 0 GasA TA \n",
"3 526351010 Attchd 1329.0 1 GasA TA \n",
"4 526353030 Attchd 2110.0 1 GasA Ex \n",
"5 527105010 Attchd 1629.0 1 GasA Gd \n",
"\n",
" House Style Kitchen AbvGr Kitchen Qual Land Contour \\\n",
"Order PID \n",
"1 526301100 1Story 1 TA Lvl \n",
"2 526350040 1Story 1 TA Lvl \n",
"3 526351010 1Story 1 Gd Lvl \n",
"4 526353030 1Story 1 Ex Lvl \n",
"5 527105010 2Story 1 TA Lvl \n",
"\n",
" Land Slope Lot Area Lot Config Lot Shape Low Qual Fin SF \\\n",
"Order PID \n",
"1 526301100 Gtl 31770.0 Corner IR1 0.0 \n",
"2 526350040 Gtl 11622.0 Inside Reg 0.0 \n",
"3 526351010 Gtl 14267.0 Corner IR1 0.0 \n",
"4 526353030 Gtl 11160.0 Corner Reg 0.0 \n",
"5 527105010 Gtl 13830.0 Inside IR1 0.0 \n",
"\n",
" MS SubClass MS Zoning Mas Vnr Area Mas Vnr Type Misc Feature \\\n",
"Order PID \n",
"1 526301100 020 RL 112.0 Stone NA \n",
"2 526350040 020 RH 0.0 None NA \n",
"3 526351010 020 RL 108.0 BrkFace Gar2 \n",
"4 526353030 020 RL 0.0 None NA \n",
"5 527105010 060 RL 0.0 None NA \n",
"\n",
" Misc Val Mo Sold Neighborhood Open Porch SF Overall Cond \\\n",
"Order PID \n",
"1 526301100 0.0 5 Names 62.0 5 \n",
"2 526350040 0.0 6 Names 0.0 6 \n",
"3 526351010 12500.0 6 Names 36.0 6 \n",
"4 526353030 0.0 4 Names 0.0 5 \n",
"5 527105010 0.0 3 Gilbert 34.0 5 \n",
"\n",
" Overall Qual Paved Drive Pool Area Pool QC Roof Matl \\\n",
"Order PID \n",
"1 526301100 6 P 0.0 NA CompShg \n",
"2 526350040 5 Y 0.0 NA CompShg \n",
"3 526351010 6 Y 0.0 NA CompShg \n",
"4 526353030 7 Y 0.0 NA CompShg \n",
"5 527105010 5 Y 0.0 NA CompShg \n",
"\n",
" Roof Style Sale Condition Sale Type Screen Porch Street \\\n",
"Order PID \n",
"1 526301100 Hip Normal WD 0.0 Pave \n",
"2 526350040 Gable Normal WD 120.0 Pave \n",
"3 526351010 Hip Normal WD 0.0 Pave \n",
"4 526353030 Hip Normal WD 0.0 Pave \n",
"5 527105010 Gable Normal WD 0.0 Pave \n",
"\n",
" TotRms AbvGrd Total Bsmt SF Utilities Wood Deck SF \\\n",
"Order PID \n",
"1 526301100 7 1080.0 AllPub 210.0 \n",
"2 526350040 5 882.0 AllPub 140.0 \n",
"3 526351010 6 1329.0 AllPub 393.0 \n",
"4 526353030 8 2110.0 AllPub 0.0 \n",
"5 527105010 6 928.0 AllPub 212.0 \n",
"\n",
" Year Built Year Remod/Add Yr Sold SalePrice \n",
"Order PID \n",
"1 526301100 1960 1960 2010 215000.0 \n",
"2 526350040 1961 1961 2010 105000.0 \n",
"3 526351010 1958 1958 2010 172000.0 \n",
"4 526353030 1968 1968 2010 244000.0 \n",
"5 527105010 1997 1998 2010 189900.0 "
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"df.to_csv(\"data/data_clean.csv\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "ames-housing",
"language": "python",
"name": "ames-housing"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.4"
}
},
"nbformat": 4,
"nbformat_minor": 4
}