{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pair-wise Correlations\n",
"\n",
"The purpose of this notebook is to identify predictor variables that are strongly correlated with the sales price and with each other. This gives a first idea of which variables could be good predictors and where collinearity could become an issue.\n",
"\n",
"Furthermore, Box-Cox transformations and linear combinations of variables are added where applicable or useful."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## \"Housekeeping\""
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"import json\n",
"\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"\n",
"from sklearn.preprocessing import PowerTransformer\n",
"from tabulate import tabulate\n",
"\n",
"from utils import (\n",
" ALL_VARIABLES,\n",
" CONTINUOUS_VARIABLES,\n",
" DISCRETE_VARIABLES,\n",
" NUMERIC_VARIABLES,\n",
" ORDINAL_VARIABLES,\n",
" TARGET_VARIABLES,\n",
" encode_ordinals,\n",
" load_clean_data,\n",
" print_column_list,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"pd.set_option(\"display.max_columns\", 100)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"sns.set_style(\"white\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the Data\n",
"\n",
"Only a subset of the previously cleaned data is used in this analysis. In particular, it does not make sense to calculate correlations involving nominal variables.\n",
"\n",
"Furthermore, ordinal variables are encoded as integers (with greater values indicating a higher sales price by \"gut feeling\"; refer to the [data documentation](https://www.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt) to see the un-encoded values) and take part in the analysis.\n",
"\n",
"A `cleaned_df` DataFrame with the original data from the previous notebook is kept so that the un-encoded ordinal labels can be restored at the end of this notebook for correct storage."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"cleaned_df = load_clean_data()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"df = cleaned_df[NUMERIC_VARIABLES + ORDINAL_VARIABLES + TARGET_VARIABLES]\n",
"df = encode_ordinals(df)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th>1st Flr SF</th>\n",
" <th>2nd Flr SF</th>\n",
" <th>3Ssn Porch</th>\n",
" <th>Bedroom AbvGr</th>\n",
" <th>Bsmt Full Bath</th>\n",
" <th>Bsmt Half Bath</th>\n",
" <th>Bsmt Unf SF</th>\n",
" <th>BsmtFin SF 1</th>\n",
" <th>BsmtFin SF 2</th>\n",
" <th>Enclosed Porch</th>\n",
" <th>Fireplaces</th>\n",
" <th>Full Bath</th>\n",
" <th>Garage Area</th>\n",
" <th>Garage Cars</th>\n",
" <th>Gr Liv Area</th>\n",
" <th>Half Bath</th>\n",
" <th>Kitchen AbvGr</th>\n",
" <th>Lot Area</th>\n",
" <th>Low Qual Fin SF</th>\n",
" <th>Mas Vnr Area</th>\n",
" <th>Misc Val</th>\n",
" <th>Mo Sold</th>\n",
" <th>Open Porch SF</th>\n",
" <th>Pool Area</th>\n",
" <th>Screen Porch</th>\n",
" <th>TotRms AbvGrd</th>\n",
" <th>Total Bsmt SF</th>\n",
" <th>Wood Deck SF</th>\n",
" <th>Year Built</th>\n",
" <th>Year Remod/Add</th>\n",
" <th>Yr Sold</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Order</th>\n",
" <th>PID</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <th>526301100</th>\n",
" <td>1656.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>441.0</td>\n",
" <td>639.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>528.0</td>\n",
" <td>2</td>\n",
" <td>1656.0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>31770.0</td>\n",
" <td>0.0</td>\n",
" <td>112.0</td>\n",
" <td>0.0</td>\n",
" <td>5</td>\n",
" <td>62.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>7</td>\n",
" <td>1080.0</td>\n",
" <td>210.0</td>\n",
" <td>1960</td>\n",
" <td>1960</td>\n",
" <td>2010</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <th>526350040</th>\n",
" <td>896.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>270.0</td>\n",
" <td>468.0</td>\n",
" <td>144.0</td>\n",
" <td>0.0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>730.0</td>\n",
" <td>1</td>\n",
" <td>896.0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>11622.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>6</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>120.0</td>\n",
" <td>5</td>\n",
" <td>882.0</td>\n",
" <td>140.0</td>\n",
" <td>1961</td>\n",
" <td>1961</td>\n",
" <td>2010</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <th>526351010</th>\n",
" <td>1329.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>406.0</td>\n",
" <td>923.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>312.0</td>\n",
" <td>1</td>\n",
" <td>1329.0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>14267.0</td>\n",
" <td>0.0</td>\n",
" <td>108.0</td>\n",
" <td>12500.0</td>\n",
" <td>6</td>\n",
" <td>36.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>6</td>\n",
" <td>1329.0</td>\n",
" <td>393.0</td>\n",
" <td>1958</td>\n",
" <td>1958</td>\n",
" <td>2010</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <th>526353030</th>\n",
" <td>2110.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1045.0</td>\n",
" <td>1065.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>522.0</td>\n",
" <td>2</td>\n",
" <td>2110.0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>11160.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>4</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>8</td>\n",
" <td>2110.0</td>\n",
" <td>0.0</td>\n",
" <td>1968</td>\n",
" <td>1968</td>\n",
" <td>2010</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <th>527105010</th>\n",
" <td>928.0</td>\n",
" <td>701.0</td>\n",
" <td>0.0</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>137.0</td>\n",
" <td>791.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>482.0</td>\n",
" <td>2</td>\n",
" <td>1629.0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>13830.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>3</td>\n",
" <td>34.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>6</td>\n",
" <td>928.0</td>\n",
" <td>212.0</td>\n",
" <td>1997</td>\n",
" <td>1998</td>\n",
" <td>2010</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 1st Flr SF 2nd Flr SF 3Ssn Porch Bedroom AbvGr \\\n",
"Order PID \n",
"1 526301100 1656.0 0.0 0.0 3 \n",
"2 526350040 896.0 0.0 0.0 2 \n",
"3 526351010 1329.0 0.0 0.0 3 \n",
"4 526353030 2110.0 0.0 0.0 3 \n",
"5 527105010 928.0 701.0 0.0 3 \n",
"\n",
" Bsmt Full Bath Bsmt Half Bath Bsmt Unf SF BsmtFin SF 1 \\\n",
"Order PID \n",
"1 526301100 1 0 441.0 639.0 \n",
"2 526350040 0 0 270.0 468.0 \n",
"3 526351010 0 0 406.0 923.0 \n",
"4 526353030 1 0 1045.0 1065.0 \n",
"5 527105010 0 0 137.0 791.0 \n",
"\n",
" BsmtFin SF 2 Enclosed Porch Fireplaces Full Bath \\\n",
"Order PID \n",
"1 526301100 0.0 0.0 2 1 \n",
"2 526350040 144.0 0.0 0 1 \n",
"3 526351010 0.0 0.0 0 1 \n",
"4 526353030 0.0 0.0 2 2 \n",
"5 527105010 0.0 0.0 1 2 \n",
"\n",
" Garage Area Garage Cars Gr Liv Area Half Bath \\\n",
"Order PID \n",
"1 526301100 528.0 2 1656.0 0 \n",
"2 526350040 730.0 1 896.0 0 \n",
"3 526351010 312.0 1 1329.0 1 \n",
"4 526353030 522.0 2 2110.0 1 \n",
"5 527105010 482.0 2 1629.0 1 \n",
"\n",
" Kitchen AbvGr Lot Area Low Qual Fin SF Mas Vnr Area \\\n",
"Order PID \n",
"1 526301100 1 31770.0 0.0 112.0 \n",
"2 526350040 1 11622.0 0.0 0.0 \n",
"3 526351010 1 14267.0 0.0 108.0 \n",
"4 526353030 1 11160.0 0.0 0.0 \n",
"5 527105010 1 13830.0 0.0 0.0 \n",
"\n",
" Misc Val Mo Sold Open Porch SF Pool Area Screen Porch \\\n",
"Order PID \n",
"1 526301100 0.0 5 62.0 0.0 0.0 \n",
"2 526350040 0.0 6 0.0 0.0 120.0 \n",
"3 526351010 12500.0 6 36.0 0.0 0.0 \n",
"4 526353030 0.0 4 0.0 0.0 0.0 \n",
"5 527105010 0.0 3 34.0 0.0 0.0 \n",
"\n",
" TotRms AbvGrd Total Bsmt SF Wood Deck SF Year Built \\\n",
"Order PID \n",
"1 526301100 7 1080.0 210.0 1960 \n",
"2 526350040 5 882.0 140.0 1961 \n",
"3 526351010 6 1329.0 393.0 1958 \n",
"4 526353030 8 2110.0 0.0 1968 \n",
"5 527105010 6 928.0 212.0 1997 \n",
"\n",
" Year Remod/Add Yr Sold \n",
"Order PID \n",
"1 526301100 1960 2010 \n",
"2 526350040 1961 2010 \n",
"3 526351010 1958 2010 \n",
"4 526353030 1968 2010 \n",
"5 527105010 1998 2010 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[NUMERIC_VARIABLES].head()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th>Bsmt Cond</th>\n",
" <th>Bsmt Exposure</th>\n",
" <th>Bsmt Qual</th>\n",
" <th>BsmtFin Type 1</th>\n",
" <th>BsmtFin Type 2</th>\n",
" <th>Electrical</th>\n",
" <th>Exter Cond</th>\n",
" <th>Exter Qual</th>\n",
" <th>Fence</th>\n",
" <th>Fireplace Qu</th>\n",
" <th>Functional</th>\n",
" <th>Garage Cond</th>\n",
" <th>Garage Finish</th>\n",
" <th>Garage Qual</th>\n",
" <th>Heating QC</th>\n",
" <th>Kitchen Qual</th>\n",
" <th>Land Slope</th>\n",
" <th>Lot Shape</th>\n",
" <th>Overall Cond</th>\n",
" <th>Overall Qual</th>\n",
" <th>Paved Drive</th>\n",
" <th>Pool QC</th>\n",
" <th>Utilities</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Order</th>\n",
" <th>PID</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <th>526301100</th>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>4</td>\n",
" <td>7</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>4</td>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <th>526350040</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>7</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>5</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <th>526351010</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>7</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>5</td>\n",
" <td>5</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <th>526353030</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>7</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" <td>6</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <th>527105010</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>4</td>\n",
" <td>6</td>\n",
" <td>1</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>7</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Bsmt Cond Bsmt Exposure Bsmt Qual BsmtFin Type 1 \\\n",
"Order PID \n",
"1 526301100 4 4 3 4 \n",
"2 526350040 3 1 3 3 \n",
"3 526351010 3 1 3 5 \n",
"4 526353030 3 1 3 5 \n",
"5 527105010 3 1 4 6 \n",
"\n",
" BsmtFin Type 2 Electrical Exter Cond Exter Qual Fence \\\n",
"Order PID \n",
"1 526301100 1 4 2 2 0 \n",
"2 526350040 2 4 2 2 3 \n",
"3 526351010 1 4 2 2 0 \n",
"4 526353030 1 4 2 3 0 \n",
"5 527105010 1 4 2 2 3 \n",
"\n",
" Fireplace Qu Functional Garage Cond Garage Finish \\\n",
"Order PID \n",
"1 526301100 4 7 3 3 \n",
"2 526350040 0 7 3 1 \n",
"3 526351010 0 7 3 1 \n",
"4 526353030 3 7 3 3 \n",
"5 527105010 3 7 3 3 \n",
"\n",
" Garage Qual Heating QC Kitchen Qual Land Slope Lot Shape \\\n",
"Order PID \n",
"1 526301100 3 1 2 2 2 \n",
"2 526350040 3 2 2 2 3 \n",
"3 526351010 3 2 3 2 2 \n",
"4 526353030 3 4 4 2 3 \n",
"5 527105010 3 3 2 2 2 \n",
"\n",
" Overall Cond Overall Qual Paved Drive Pool QC Utilities \n",
"Order PID \n",
"1 526301100 4 5 1 0 3 \n",
"2 526350040 5 4 2 0 3 \n",
"3 526351010 5 5 2 0 3 \n",
"4 526353030 4 6 2 0 3 \n",
"5 527105010 4 4 2 0 3 "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[ORDINAL_VARIABLES].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Linearly \"dependent\" Features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The \"above grade (ground) living area\" (= *Gr Liv Area*) is the sum of the 1st and 2nd floor living areas plus the low quality finished area (= *Low Qual Fin SF*)."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"assert not (\n",
" df[\"Gr Liv Area\"]\n",
" != (df[\"1st Flr SF\"] + df[\"2nd Flr SF\"] + df[\"Low Qual Fin SF\"])\n",
").any()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The various basement areas also add up."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"assert not (\n",
" df[\"Total Bsmt SF\"]\n",
" != (df[\"BsmtFin SF 1\"] + df[\"BsmtFin SF 2\"] + df[\"Bsmt Unf SF\"])\n",
").any()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Calculate a variable for the total living area *Total SF* as this is the number communicated most often in housing ads."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"df[\"Total SF\"] = df[\"Gr Liv Area\"] + df[\"Total Bsmt SF\"]\n",
"new_variables = [\"Total SF\"]\n",
"CONTINUOUS_VARIABLES.append(\"Total SF\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The different porch areas are unified into a new variable *Total Porch SF*. This potentially helps making the presence of a porch in general relevant in the prediction."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"df[\"Total Porch SF\"] = (\n",
" df[\"3Ssn Porch\"] + df[\"Enclosed Porch\"] + df[\"Open Porch SF\"]\n",
" + df[\"Screen Porch\"] + df[\"Wood Deck SF\"]\n",
")\n",
"new_variables.append(\"Total Porch SF\")\n",
"CONTINUOUS_VARIABLES.append(\"Total Porch SF\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The various types of rooms \"above grade\" (i.e., *TotRms AbvGrd*, *Bedroom AbvGr*, *Kitchen AbvGr*, and *Full Bath*) do not add up: bedrooms, kitchens, and full bathrooms sum to the total number of rooms in only 29% of the cases. Therefore, no single unified variable can be used as a predictor."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"29"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"round(\n",
" 100\n",
" * (\n",
" df[\"TotRms AbvGrd\"]\n",
" == (df[\"Bedroom AbvGr\"] + df[\"Kitchen AbvGr\"] + df[\"Full Bath\"])\n",
" ).sum()\n",
" / df.shape[0]\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Unify the number of various types of bathrooms into a single variable. Note that \"half\" bathrooms are counted as such."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"df[\"Total Bath\"] = (\n",
" df[\"Full Bath\"] + 0.5 * df[\"Half Bath\"]\n",
" + df[\"Bsmt Full Bath\"] + 0.5 * df[\"Bsmt Half Bath\"]\n",
")\n",
"new_variables.append(\"Total Bath\")\n",
"DISCRETE_VARIABLES.append(\"Total Bath\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Box-Cox Transformations\n",
"\n",
"Only numeric columns with strictly positive values are eligible for a Box-Cox transformation."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1st Flr SF First Floor square feet\n",
"Gr Liv Area Above grade (ground) living area square feet\n",
"Lot Area Lot size in square feet\n",
"SalePrice\n",
"Total SF\n"
]
}
],
"source": [
"columns = CONTINUOUS_VARIABLES + TARGET_VARIABLES\n",
"transforms = df[columns].describe().T\n",
"transforms = list(transforms[transforms['min'] > 0].index)\n",
"print_column_list(transforms)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A common convention is to use a Box-Cox transformation only if the estimated lambda value (found by maximum likelihood estimation) is in the range from -3 to +3.\n",
"\n",
"In this data set, the estimated lambda of every eligible column lies within that range. Consequently, transformations are added for all five columns, including *SalePrice* and the new variable *Total SF*."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1st Flr SF: use lambda of -0.0\n",
"Gr Liv Area: use lambda of -0.0\n",
"Lot Area: use lambda of 0.1\n",
"SalePrice: use lambda of 0.0\n",
"Total SF: use lambda of 0.2\n"
]
}
],
"source": [
"# Check the Box-Cox transformations for each column separately\n",
"# to decide if the optimal lambda value is in an acceptable range.\n",
"output = []\n",
"transformed_columns = []\n",
"for column in transforms:\n",
"    X = df[[column]]  # 2D array needed!\n",
"    pt = PowerTransformer(method=\"box-cox\", standardize=False)\n",
"    # Suppress a weird but harmless warning from scipy\n",
"    with warnings.catch_warnings():\n",
"        warnings.simplefilter(\"ignore\")\n",
"        pt.fit(X)\n",
"    # Check if the optimal lambda is ok.\n",
"    lambda_ = pt.lambdas_[0].round(1)\n",
"    if -3 <= lambda_ <= 3:\n",
"        # After rounding, lambda is either zero or at least 0.1 in absolute\n",
"        # value; comparing with zero also avoids a label of -0.0.\n",
"        lambda_label = 0 if lambda_ == 0 else lambda_\n",
"        new_column = f\"{column} (box-cox-{lambda_label})\"\n",
"        # A lambda of zero corresponds to the log transformation.\n",
"        df[new_column] = (\n",
"            np.log(X) if lambda_ == 0 else (((X ** lambda_) - 1) / lambda_)\n",
"        )\n",
"        # Track the new column in the appropriate list.\n",
"        new_variables.append(new_column)\n",
"        if column in TARGET_VARIABLES:\n",
"            TARGET_VARIABLES.append(new_column)\n",
"        else:\n",
"            CONTINUOUS_VARIABLES.append(new_column)\n",
"        # To show only the transformed columns below.\n",
"        transformed_columns.append(column)\n",
"        transformed_columns.append(new_column)\n",
"        output.append((\n",
"            f\"{column}:\",\n",
"            f\"use lambda of {lambda_}\",\n",
"        ))\n",
"    else:\n",
"        output.append((\n",
"            f\"{column}:\",\n",
"            f\"lambda of {lambda_} not in realistic range\",\n",
"        ))\n",
"print(tabulate(sorted(output), tablefmt=\"plain\"))"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th>1st Flr SF</th>\n",
" <th>1st Flr SF (box-cox-0)</th>\n",
" <th>Gr Liv Area</th>\n",
" <th>Gr Liv Area (box-cox-0)</th>\n",
" <th>Lot Area</th>\n",
" <th>Lot Area (box-cox-0.1)</th>\n",
" <th>Total SF</th>\n",
" <th>Total SF (box-cox-0.2)</th>\n",
" <th>SalePrice</th>\n",
" <th>SalePrice (box-cox-0)</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Order</th>\n",
" <th>PID</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <th>526301100</th>\n",
" <td>1656.0</td>\n",
" <td>7.412160</td>\n",
" <td>1656.0</td>\n",
" <td>7.412160</td>\n",
" <td>31770.0</td>\n",
" <td>18.196923</td>\n",
" <td>2736.0</td>\n",
" <td>19.344072</td>\n",
" <td>215000.0</td>\n",
" <td>12.278393</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <th>526350040</th>\n",
" <td>896.0</td>\n",
" <td>6.797940</td>\n",
" <td>896.0</td>\n",
" <td>6.797940</td>\n",
" <td>11622.0</td>\n",
" <td>15.499290</td>\n",
" <td>1778.0</td>\n",
" <td>17.333478</td>\n",
" <td>105000.0</td>\n",
" <td>11.561716</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <th>526351010</th>\n",
" <td>1329.0</td>\n",
" <td>7.192182</td>\n",
" <td>1329.0</td>\n",
" <td>7.192182</td>\n",
" <td>14267.0</td>\n",
" <td>16.027549</td>\n",
" <td>2658.0</td>\n",
" <td>19.203658</td>\n",
" <td>172000.0</td>\n",
" <td>12.055250</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <th>526353030</th>\n",
" <td>2110.0</td>\n",
" <td>7.654443</td>\n",
" <td>2110.0</td>\n",
" <td>7.654443</td>\n",
" <td>11160.0</td>\n",
" <td>15.396064</td>\n",
" <td>4220.0</td>\n",
" <td>21.548042</td>\n",
" <td>244000.0</td>\n",
" <td>12.404924</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <th>527105010</th>\n",
" <td>928.0</td>\n",
" <td>6.833032</td>\n",
" <td>1629.0</td>\n",
" <td>7.395722</td>\n",
" <td>13830.0</td>\n",
" <td>15.946705</td>\n",
" <td>2557.0</td>\n",
" <td>19.016856</td>\n",
" <td>189900.0</td>\n",
" <td>12.154253</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 1st Flr SF 1st Flr SF (box-cox-0) Gr Liv Area \\\n",
"Order PID \n",
"1 526301100 1656.0 7.412160 1656.0 \n",
"2 526350040 896.0 6.797940 896.0 \n",
"3 526351010 1329.0 7.192182 1329.0 \n",
"4 526353030 2110.0 7.654443 2110.0 \n",
"5 527105010 928.0 6.833032 1629.0 \n",
"\n",
" Gr Liv Area (box-cox-0) Lot Area Lot Area (box-cox-0.1) \\\n",
"Order PID \n",
"1 526301100 7.412160 31770.0 18.196923 \n",
"2 526350040 6.797940 11622.0 15.499290 \n",
"3 526351010 7.192182 14267.0 16.027549 \n",
"4 526353030 7.654443 11160.0 15.396064 \n",
"5 527105010 7.395722 13830.0 15.946705 \n",
"\n",
" Total SF Total SF (box-cox-0.2) SalePrice \\\n",
"Order PID \n",
"1 526301100 2736.0 19.344072 215000.0 \n",
"2 526350040 1778.0 17.333478 105000.0 \n",
"3 526351010 2658.0 19.203658 172000.0 \n",
"4 526353030 4220.0 21.548042 244000.0 \n",
"5 527105010 2557.0 19.016856 189900.0 \n",
"\n",
" SalePrice (box-cox-0) \n",
"Order PID \n",
"1 526301100 12.278393 \n",
"2 526350040 11.561716 \n",
"3 526351010 12.055250 \n",
"4 526353030 12.404924 \n",
"5 527105010 12.154253 "
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[transformed_columns].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Correlations\n",
"\n",
"The pair-wise correlations are calculated based on the type of the variables:\n",
"- **continuous** variables are assumed to be related to the target and to each other linearly, if at all: use **Pearson's correlation coefficient**\n",
"- **discrete** and **ordinal** variables (both have a low number of distinct realizations, as seen in the data cleaning notebook) are assumed to be related to the target and to each other monotonically, if at all: use **Spearman's rank correlation coefficient**\n",
"\n",
"Furthermore, for a **naive feature selection**, a \"rule of thumb\" classification into *weak* and *strong* correlation is applied to the predictor variables. The identified variables will be used in the prediction modelling part to speed up the feature selection. An absolute correlation between 0.33 and 0.66 is considered *weak*, an absolute correlation above 0.66 is considered *strong*, and an absolute correlation below 0.1 is considered \"uncorrelated\". Correlations are calculated for **each** target variable (i.e., the raw \"SalePrice\" and the Box-Cox transformation thereof)."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"strong = 0.66\n",
"weak = 0.33\n",
"uncorrelated = 0.1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Two heatmaps below (implemented in the reusable `plot_correlation` function) help visualize the correlations.\n",
"\n",
"Many variables are pair-wise correlated. This could make the regression coefficients *imprecise* and hard to use or interpret. At the same time, it does not lower the predictive power of a model as a whole. In contrast to the pair-wise correlations, *multi-collinearity* is not checked here."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"def plot_correlation(data, title):\n",
"    \"\"\"Visualize a correlation matrix in a nice heatmap.\"\"\"\n",
"    fig, ax = plt.subplots(figsize=(12, 12))\n",
"    ax.set_title(title, fontsize=24)\n",
"    # Blank out the upper triangular part of the matrix.\n",
"    # The builtin bool is used as np.bool is not available in all NumPy versions.\n",
"    mask = np.zeros_like(data, dtype=bool)\n",
"    mask[np.triu_indices_from(mask)] = True\n",
"    # Use a diverging color map.\n",
"    cmap = sns.diverging_palette(240, 0, as_cmap=True)\n",
"    # Adjust the labels' font size.\n",
"    labels = data.columns\n",
"    ax.set_xticks(range(len(labels)), labels=labels, fontsize=10)\n",
"    ax.set_yticks(range(len(labels)), labels=labels, fontsize=10)\n",
"    # Plot it.\n",
"    sns.heatmap(\n",
"        data, vmin=-1, vmax=1, cmap=cmap, center=0, linewidths=.5,\n",
"        cbar_kws={\"shrink\": .5}, square=True, mask=mask, ax=ax\n",
"    )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pearson\n",
"\n",
"Pearson's correlation coefficient measures the strength of the linear relationship between two variables."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"columns = CONTINUOUS_VARIABLES + TARGET_VARIABLES\n",
"pearson = df[columns].corr(method=\"pearson\")"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
2024-07-10 01:48:08 +02:00
"image/png": "iVBORw0KGgoAAAANSUhEUgAABB8AAAPICAYAAACRpFOxAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/TGe4hAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdeXiM1///8dckkkZklUQQSRCMILHWVmqpFrEv1ara91J7iSq1Ry2lpYvatUV9BLVVa61qkbaU1lKtLWJrSCIIksj8/vAz30wTBJmJ8nxc11zX3Oc+93mfeybz+fR+O4vBZDKZBAAAAAAAYCV2Od0BAAAAAADwZCP5AAAAAAAArIrkAwAAAAAAsCqSDwAAAAAAwKpIPgAAAAAAAKsi+QAAAAAAAKyK5AMAAAAAALAqkg8AAAAAAMCqSD4AAAAAAACrIvkAAADwlGjfvr2MRqOMRqP27NmT0925q5iYGHM/69atm9PdAQBkg1w53QEAgHW1b99eUVFRdz3v7OwsT09PGY1GVa9eXc2bN5erq6sNewhb2rNnjzp06CBJ6tu3r958880c7tHdnT17Vtu3b9dPP/2k48ePKz4+XleuXFHu3Lnl4eEho9Go0NBQNWzYUP7+/jndXQAAcA8kHwDgKZeUlKSkpCSdOXNGW7du1YwZMzRy5Eg1b948p7uGp9S5c+f00UcfadWqVUpNTc1wPiUlRYmJiYqOjtamTZs0bdo0Va1aVYMGDVLZsmVzoMe4l5iYGL3wwguSJD8/P23dujWHewQAyAkkHwDgKRISEqLQ0FDzsclkUmJiov744w+dPHlSknT16lUNGzZMN2/e1CuvvJJDPcXTavfu3erXr58uX75sLjMYDDIajQoICJCHh4euXbum2NhY/fHHH0pKSjJf16ZNGy1fvpwEBAAAjyGSDwDwFKlVq9Zdh9lv2rRJw4cP15UrVyRJ48ePV61atZQ/f35bdhFPsa1bt6pfv35KSUmRdHtKUKdOndSuXTt5e3tnqJ+cnKyffvpJn332mX799VdJ0o0bN2zaZ1hHoUKF9Oeff+Z0NwAA2YgFJwEAkqQXX3xRU6dONR8nJydryZIlOdgjPE1Onz6tYcOGmRMPfn5+ioyMVP/+/TNNPEiSo6OjateurSVLlmjWrFlyd3e3ZZcBAMADIPkAADCrXbu2SpYsaT7+6aefcrA3eJqMHDlSiYmJkm6PeFi0aJGKFi2a5etffPFFRUZGqkCBAtbqIgAAeARMuwAAWChfvryOHDki6fa/Rt/NuXPntHLlSv3444+Kjo5WQkKCnJ2dVbBgQVWrVk1t2rRRkSJF7hvvxo0b2rlzp3bv3q2DBw/q1KlTSkxMlIODgzw9PRUcHKzatWurWbNmcnR0vGdb6XdyqFy5sj7//HNJ0vfff6+vv/5af/zxh2JjY5WUlKThw4erU6dO5mtNJpO2bNmib775xlzv+vXreuaZZ5Q3b14VKlRIISEhqlmzpipXriw7u3vn73/44Qdt2LBBe/fuVWxsrFJTU+Xl5aVSpUrphRdeUJMmTeTg4HDPNsLDw7Vq1SpJUkREhFq2bKnr169r5cqVWrdunfmz8vLyUsWKFdWuXTtVrFjxfh95lh04cECrV6/Wvn37FBMTo2vXrilXrlxyc3NTwYIFFRwcrCpVqqh27dpydnZ+6Di///67du3aZT4eNGjQQ+1ekZVrfvvtN61Zs0Z79uzRP//8oxs3bsjT01PFixdXnTp11LJly/vey8yZMzVr1ixJ/7djyI0bN7R27Vp98803On78uC5evKiUlBStXr1awcHBWrlypYYPHy5JatGihSZNmqRbt25p48aNWrdunY4eParY2FjdvHlTH330kerVq5ch7oEDB7Ru3Trt2bNHFy5c0NWrV+Xu7q4iRYro+eef1yuvvJJtoz+y43eZ/p7vOHPmjIxGY6b100+zeJhFKq313aampmrdunVavXq1jh07pvj4eHl4eC
g0NFQvv/yy6tSpc9++AQBIPgAA/iX9w8u1a9cynE9LS9PMmTM1b9483bx50+Lc5cuXdfnyZR0+fFiLFy9Wt27dNGDAABkMhkxj7d+/X506dTIvGpheSkqKeReOzZs365NPPtGsWbNUqlSpLN/LlStXNHz4cG3atOme9S5evKi+fftq3759Gc5dv35dZ86c0ZkzZ7Rnzx7NnTtXCxYsUPXq1TNt69KlSxo8eLDFw/QdZ8+e1dmzZ7V582bNnj1bU6dOVUhISJbv5++//1a/fv107Ngxi/Lz589r/fr1Wr9+vfr06aN+/fpluc3MpKamauzYsfrqq68ynLt165ZiY2MVGxur/fv3a9myZerVq5cGDhz40PGWLl1qfu/q6qrWrVs/dFt3k5SUpBEjRmjDhg0Zzl24cEEXLlzQzp079emnn2rChAmqVatWlts+duyY+vfvr7/++ivL11y4cEEDBw40r1VxL5cvX9bIkSP17bffZjh38eJFXbx4UT///LPmzJmjcePGqUGDBlnuR2as/bvMbtb8bi9cuKD+/ftn+N+G2NhYbdmyRVu2bFHLli01YcKE+yYkAeBpR/IBAGAh/S4DLi4uFudu3bqlgQMHWjwE+fr6KjQ0VHnz5tW1a9d04MABRUdHKzU1VZ9++qni4uI0bty4u8a684Dj5eWlYsWKKX/+/MqdO7du3LihU6dO6ffff1dqaqrOnDmj119/XatWrVJgYOB978NkMumtt97Stm3bZDAYVKZMGRUrVkwmk0l//fWXOSFy69Yt9ejRQwcPHjRfW6JECRUvXlyurq5KTk5WbGysjhw5otjY2HvGvHjxotq2bavo6GhzWUBAgEJDQ+Xo6Khjx45p//79kqSTJ0+qQ4cOmjt3bpZGK/zzzz/q1KmTYmNj5ebmpooVK8rHx0fx8fHavXu3eaHQjz76SMWKFVNYWNh927ybyZMnWyQe0n/HaWlpSkhI0N9//60TJ048dIz0du/ebX7/wgsvKHfu3NnS7h3Xr19Xx44ddeDAAXNZvnz5VKlSJTk7Oys6Olq//vqrObHyxhtvaNq0aVl6iE9ISFC3bt109uxZPfPMM6pYsaIKFiyopKQk83f9b8nJyerdu7cOHjyoXLlyqXz58vL391dycrIOHTpkUTc2NlYdO3a0SDgVL15cRqNRefLk0aVLl/TLL78oISFBiYmJGjBggCZPnqymTZs+5KeVfb/LoKAgtWvXTteuXdPq1aslSXny5MnWbXyt+d0mJSWpW7duOnr0qHLnzq2KFSuqQIECunbtmvbs2aNLly5Juj3Co0iRIurRo0e23RcAPIlIPgAALOzdu9f8vlChQhbnZs2aZU48+Pj4aNSoUXrxxRczjGz45ptvNHLkSF25ckXLly9XtWrVMn0Ydnd3V69evdSoUSOVKFEi0/5cunRJ7733nr7++mtdu3ZN7777rhYuXHjf+9i3b59SU1NVokQJTZ06NcNQ7+TkZEnStm3bzIkHHx8fffTRR3fdqvGvv/7S119/nSEpc8fw4cPNiQdnZ2eNHz9ejRo1sqjz+++/a+DAgTp9+rSSkpI0ePBgrVmzRm5ubve8n48++kjJycnq3r27+vTpY/GAnpCQoP79+5sf4t9//301bNgw0xEnVapUuecuAvHx8fryyy8lSfb29powYYKaN2+eaVv//POPvv32Wzk5Od2z7/dy/vx5nTlzxnycfivY7PLee++ZH07t7e01bNgwtW/f3uJfqk+ePKlBgwbp4MGDSk1N1YgRI1SmTJkMv4F/W7ZsmVJTU1W/fn2NHj1aefPmNZ9LS0vTrVu3Mlzz7bffKjU1VZUrV1ZERESGGHf+NtPS0jR48GBz4iE0NFRjxozJMMrg5s2bmjNnjmbNmiWTyaR3333XnNB4GNn1uyxbtqzKli2rmJgYc/LBw8NDo0aNeqh+Zcaa3+0XX3yh5ORktWjRQuHh4fLw8DCfu379ut555x2tW7dOkvTJJ5/o9ddff6
TpRwDwpGN8GADAbPv27RYPptWqVTO/j4mJ0ezZsyXdfoBYsmSJXnrppUwfShs2bGieNy3J/FD0b2XLltXAgQPv+oA
"text/plain": [
"<Figure size 1200x1200 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plot_correlation(pearson, \"Pearson's Correlation\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Predictors that are uncorrelated, weakly correlated, or strongly correlated with a target variable are collected."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"pearson_weakly_correlated = set()\n",
"pearson_strongly_correlated = set()\n",
"pearson_uncorrelated = set()\n",
"# Iterate over the raw and transformed target.\n",
"for target in TARGET_VARIABLES:\n",
" corrs = pearson.loc[target].drop(TARGET_VARIABLES).abs()\n",
" pearson_weakly_correlated |= set(corrs[(weak < corrs) & (corrs <= strong)].index)\n",
" pearson_strongly_correlated |= set(corrs[(strong < corrs)].index)\n",
" pearson_uncorrelated |= set(corrs[(corrs < uncorrelated)].index)\n",
"# Show that no contradiction exists between the classifications.\n",
"assert pearson_weakly_correlated & pearson_strongly_correlated == set()\n",
"assert pearson_weakly_correlated & pearson_uncorrelated == set()"
]
},
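{
"cell_type": "markdown",
"metadata": {},
"source": [
"The bucketing above can be illustrated on a toy example. This is only a minimal sketch: the thresholds and the predictor names are hypothetical and need not match the `weak`, `strong`, and `uncorrelated` values defined earlier in this notebook. With thresholds like these, a correlation between the uncorrelated and the weak cut-off ends up in none of the three sets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical thresholds, for illustration only.\n",
"toy_uncorrelated, toy_weak, toy_strong = 0.1, 0.5, 0.8\n",
"# Hypothetical absolute correlations of four predictors with a target.\n",
"toy_corrs = pd.Series({\"a\": 0.05, \"b\": 0.3, \"c\": 0.6, \"d\": 0.9})\n",
"toy_sets = {\n",
"    \"uncorrelated\": set(toy_corrs[toy_corrs < toy_uncorrelated].index),\n",
"    \"weak\": set(toy_corrs[(toy_weak < toy_corrs) & (toy_corrs <= toy_strong)].index),\n",
"    \"strong\": set(toy_corrs[toy_strong < toy_corrs].index),\n",
"}\n",
"# Predictor \"b\" lies between the two lower thresholds and lands in no set.\n",
"toy_sets"
]
},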
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Show the continuous variables that are uncorrelated, weakly correlated, or strongly correlated with the sales price."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3Ssn Porch Three season porch area in square feet\n",
"BsmtFin SF 2 Type 2 finished square feet\n",
"Low Qual Fin SF Low quality finished square feet (all floors)\n",
"Misc Val $Value of miscellaneous feature\n",
"Pool Area Pool area in square feet\n"
]
}
],
"source": [
"print_column_list(pearson_uncorrelated)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1st Flr SF First Floor square feet\n",
"1st Flr SF (box-cox-0)\n",
"BsmtFin SF 1 Type 1 finished square feet\n",
"Garage Area Size of garage in square feet\n",
"Lot Area (box-cox-0.1)\n",
"Mas Vnr Area Masonry veneer area in square feet\n",
"Total Bsmt SF Total square feet of basement area\n",
"Total Porch SF\n",
"Wood Deck SF Wood deck area in square feet\n"
]
}
],
"source": [
"print_column_list(pearson_weakly_correlated)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Gr Liv Area Above grade (ground) living area square feet\n",
"Gr Liv Area (box-cox-0)\n",
"Total SF\n",
"Total SF (box-cox-0.2)\n"
]
}
],
"source": [
"print_column_list(pearson_strongly_correlated)"
]
},
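{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Spearman\n",
"\n",
"Spearman's rank correlation coefficient measures how well the relationship between two variables can be described by a monotonic function; it is Pearson's coefficient computed on the ranks of the values."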
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"columns = sorted(DISCRETE_VARIABLES + ORDINAL_VARIABLES) + TARGET_VARIABLES\n",
"spearman = df[columns].corr(method=\"spearman\")"
]
},
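{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate the difference to Pearson's coefficient, the following sketch uses hypothetical toy data (not the housing data) in which `y` grows monotonically but not linearly with `x`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical data: y is a strictly increasing, non-linear function of x.\n",
"toy = pd.DataFrame({\"x\": np.arange(1, 11, dtype=float)})\n",
"toy[\"y\"] = np.exp(toy[\"x\"])\n",
"toy_pearson = toy[\"x\"].corr(toy[\"y\"], method=\"pearson\")\n",
"toy_spearman = toy[\"x\"].corr(toy[\"y\"], method=\"spearman\")\n",
"# The ranks of x and y agree perfectly, so Spearman's coefficient is 1,\n",
"# while Pearson's coefficient is lower because the relation is not linear.\n",
"toy_pearson, toy_spearman"
]
},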
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABBIAAAO6CAYAAAAisHTLAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/TGe4hAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzde3zP9f//8duONoYNb2SYOY0YG9OSLYcIKzSnfIhyGFI5FpNy1oRC5Cw5RQ4zklRIWaSUQ0RqDjOGMdtsYyf7/eHn9d27jY3ec1j36+Xyvlzer9fr+Xq8nq/X++1tr8frebDKzMzMREREREREREQkD6wfdAVERERERERE5NGhRIKIiIiIiIiI5JkSCSIiIiIiIiKSZ0okiIiIiIiIiEieKZEgIiIiIiIiInmmRIKIiIiIiIiI5JkSCSIiIiIiIiKSZ0okiIiIiIiIiEieKZEgIiIiIiIiInmmRIKIiIgUWFFRUXh4eODh4UGzZs0edHUkH3Tv3t34jPfu3fugq3Nb+i6KSEFi+6ArICJyt/bu3ctXX33FoUOHOHfuHImJiVhbW1OkSBEee+wx3N3d8fT0pH79+tSuXRsrK6sHXWX5j4iKiuKZZ54BIDAwkMmTJz/gGt3UrFkzzp49m+M2KysrihQpQvHixfHw8ODJJ5+kbdu2uLi43Oda/jecO3eOnTt3snv3bk6cOMGVK1e4evUqjo6OODs74+HhQZ06dWjdujUVKlR40NUVERHJkRIJIvLIiIiI4O233+bAgQM5bk9JSSE2NpYjR46wefNmAKpVq2a8F5HsMjMzSUxMJDExkbNnz7Jjxw5mzJjBiBEj6NKly4OuXoERHR3Nxx9/zIYNG0hPT8+2PS0tjYSEBCIjI/n222/54IMPePLJJxk6dCh169Z9ADWWO8maNHR1dWXHjh0PuEYiIveXEgki8kj4448/ePnll0lISDDWlSpVitq1a1OqVCmsrKyIi4vjr7/+4vTp02RmZgKYlRcRaNiwIZUrVzaWMzMzSUhI4Pfff+f06dMAJCcnM2bMGFJTU+nRo8eDqmqB8dNPPzFw4EDi4+ONdVZWVnh4eFCxYkWcnZ1JSkoiJiaGw4cPk5ycbOzXuXNn1qxZo2SCiIg8VJRIEJGHXlpaGsOGDTOSAqVLl2bMmDE0a9YMa+vsQ73Exsayfft2Nm7cyJkzZ+53dUUeam3btqV9+/Y5btuxYwfBwcHGDe+0adNo2bIlZcqUuZ9VLFB27NjBwIEDSUtLA6Bw4cK88sordOvWjVKlSmUrn5qayu7du1mwYAG//vorANevX7+vdZb8Ub58ef78888HXQ0REYvQYIsi8tDbtm0bJ06cAMDBwYFly5bRvHnzHJMIACVKlKBTp06sWLGC5cuX38+qijzSmjVrRkhIiLGckpLChg0bHmCNHm1nzpxhxIgRRhLB1dWV9evXM2jQoByTCAD29vY0adKEzz77jNmzZ1O8ePH7WWUREZE8USJBRB56P/74o/H+mWeewd3dPc/7VqxYMT+qJFJgPfPMM5QvX95Y/uWXXx5gbR5t7777rtGSqnDhwixdutSsW0luWrRowfr163nsscfyq4oiIiL3RF0bROShd+HCBeN9uXLlLBa3e/fu/PzzzwAsW7YMX19foqOjWb16Nd999x3nz58nNTWVsmXL8vTTT9OtWzfc3Nzu6hh79uzhq6++4tdffyUmJobk5GRjZPamTZvSsWNHHBwcco1z9uxZvv/+e/bt28fx48eJjo7m+vXrODk5Ubp0aerVq0dgYCBeXl65xgoODjaeMoeEhNC+fXsSEhLYsGED33zzDZGRkVy+fJmMjAx++eUXihUrxqxZs5g9ezYAr7/+Om+88YbxtPrLL7/k5MmTxMXFUaJECZ588kn69u1L1apVzY6blJREWFgYmzdv5syZM8THx1OmTBn8/f3p168fZcuWzbXuly9fZufOnfz888/8+eefnDt3jqSkJB
wdHSlVqhTe3t4EBATg7++fa6yczik9PZ3NmzcTFhZGREQEV65cwdnZmTp16tCpUyeaNm2aa9y8ioiIYP369ezbt4/Tp0+TlJSElZUVTk5OPPbYY3h4ePDEE0/QrFmz+/5UumbNmkRFRQFw8eLFO5a9fv064eHh/PTTTxw5coTTp0+TkJCAnZ0dLi4u1KxZkyZNmtCuXTvs7e3vGGvv3r3GmAxPPPGE0aJoz549rFmzhkOHDnHx4kUKFy5M1apVad26NS+++CJ2dnYWOOubLTCGDh3Ktm3bgJutmxYtWkStWrXuOtbvv//Onj17jOWhQ4fe0ywMednnwIEDbNq0ib1793Lx4kWuX7+Oi4sL1apVo2nTprRv357ChQvfMUZO/x6uX7/OF198wVdffcWJEye4dOkSaWlphIWFUbNmTUJDQxk5ciTwf7OUZGRksHXrVjZv3szx48eJiYkhJSWFjz/+mObNm2c77qFDh9i8eTN79+7lwoULJCYmUrx4cdzd3Xn66ad58cUXLfb9t8R3Nes533L27Fk8PDxyLJ+1K8O9DNCYX5/t/fytE5GCSYkEEXnoZe3CcOvmJj9s376dESNGcPXqVbP1J0+e5OTJk3z++ee8/fbbvPjii7nGio6OZvjw4UaiIquYmBhiYmIIDw9n/vz5TJ8+HR8fn9vGev/991myZIkxgGRWcXFxxMXFcfz4cVavXs1zzz3HpEmTcHR0zMMZ3/Trr78ybNgwoqOj87zPmTNneP311zl27JjZ+gsXLrBx40a++uor5syZY9zQHzp0iNdff90sKXQrzmeffcamTZtYvHjxHRMhy5YtM25U/unq1atcvXqVkydPEhoaypNPPsmMGTPuagrDCxcuMGjQIPbv32+2PiYmhu3bt7N9+3bat2/PpEmTbtutJq9mzZrF3LlzczyX2NhYY/aR0NBQ2rRpw7Rp0/7V8e5WoUKFjPepqam3LXfw4EFeeeUVY3DArNLS0khOTubs2bNs27aNuXPnMnv2bB5//PE81yM1NZUJEyawZs2abOv37dvHvn37CA0NZdGiRZQoUSLPcXNy9epVBgwYYPybdXV15ZNPPqFSpUr3FG/VqlXG+6JFi9KxY8d/Vb+cJCcnM2rUKLZs2ZJt24ULF7hw4QLh4eHMmzePSZMm0bhx4zzHjoiIYNCgQfz111953ufChQsMGTLEGNvhTuLj43n33Xf5+uuvs227dOkSly5d4pdffmHhwoVMmDCBVq1a5bkeOcnv76ql5ednez9/60Sk4FIiQUQeelmfyH333Xf8/fff2Z52/1uHDx9m+vTppKWl4ezsjK+vL8WKFePs2bP88ssvpKWlcf36dUaPHo21tTWdOnW6bayIiAhefvllYmJigJujsz/++ONUrVoVBwcHLly4wC+//EJSUhIXL16kZ8+eLFy4kCeffDLHeOfPnyczMxMrKyvc3d1xd3fH2dkZW1tb4uLiOHr0KJGRkQB8+eWXJCYmMn/+fKysrHI979OnT/Pee+9x9epVihQpQoMGDShdujTx8fHs27cvx30SExPp06cPp06dwsnJiQYNGmAymYiJieGnn37i2rVrpKam8vrrr/PFF1+QlpZGz549SUxMxMXFhQYNGuDs7My5c+fYu3cvaWlpJCYm8tprr7F161aKFi2a43EvXrxo3HhXqFCBKlWqUKJECezt7bl69SrHjx83bnp++uknevbsyZo1a3J9Cg43/2jv06cPx48fx9HRkfr16/PYY4+RlJTE3r17uXz5MnDzaaS7uzt9+/bNNebtLF261Hg6CODi4oKXlxcmk8mYfeTkyZNERETkmGi4H7K2QihZsuRty8XHxxs3ZiVLlqRq1aqULVsWR0dHrl+/zunTp/n9999JT0/n7NmzvPTSS2zYsCHPLXtGjx7Nhg0bsLa2pm7duri7u5OZmcmBAwc4efIkAEeOHGHEiBEsXLjwns83JiaGoKAgjh49CkD16tVZtGjRvxpk8qeffj
LeP/PMM3eV3MuLa9eu8fLLL3Po0CFjXenSpfHx8aFw4cJERkby66+/kpGRQUxMDAMGDOCDDz7I0w15XFwcffr04dy
"text/plain": [
"<Figure size 1200x1200 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plot_correlation(spearman, \"Spearman's Rank Correlation\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Predictors that are uncorrelated, weakly correlated, or strongly correlated with a target variable are collected."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"spearman_weakly_correlated = set()\n",
"spearman_strongly_correlated = set()\n",
"spearman_uncorrelated = set()\n",
"# Iterate over the raw and transformed target.\n",
"for target in TARGET_VARIABLES:\n",
" corrs = spearman.loc[target].drop(TARGET_VARIABLES).abs()\n",
" spearman_weakly_correlated |= set(corrs[(weak < corrs) & (corrs <= strong)].index)\n",
" spearman_strongly_correlated |= set(corrs[(strong < corrs)].index)\n",
" spearman_uncorrelated |= set(corrs[(corrs < uncorrelated)].index)\n",
"# Show that no contradiction exists between the classifications.\n",
"assert spearman_weakly_correlated & spearman_strongly_correlated == set()\n",
"assert spearman_weakly_correlated & spearman_uncorrelated == set()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Show the discrete and ordinal variables that are uncorrelated, weakly correlated, or strongly correlated with the sales price."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Bsmt Half Bath Basement half bathrooms\n",
"BsmtFin Type 2 Rating of basement finished area (if multiple types)\n",
"Exter Cond Evaluates the present condition of the material on the exterior\n",
"Land Slope Slope of property\n",
"Mo Sold Month Sold (MM)\n",
"Pool QC Pool quality\n",
"Utilities Type of utilities available\n",
"Yr Sold Year Sold (YYYY)\n"
]
}
],
"source": [
"print_column_list(spearman_uncorrelated)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Bsmt Exposure Refers to walkout or garden level walls\n",
"BsmtFin Type 1 Rating of basement finished area\n",
"Fireplace Qu Fireplace quality\n",
"Fireplaces Number of fireplaces\n",
"Full Bath Full bathrooms above grade\n",
"Garage Cond Garage condition\n",
"Garage Finish Interior finish of the garage\n",
"Garage Qual Garage quality\n",
"Half Bath Half baths above grade\n",
"Heating QC Heating quality and condition\n",
"Lot Shape General shape of property\n",
"Paved Drive Paved driveway\n",
"TotRms AbvGrd Total rooms above grade (does not include bathrooms)\n",
"Year Remod/Add Remodel date (same as construction date if no remodeling or additions)\n"
]
}
],
"source": [
"print_column_list(spearman_weakly_correlated)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Bsmt Qual Evaluates the height of the basement\n",
"Exter Qual Evaluates the quality of the material on the exterior\n",
"Garage Cars Size of garage in car capacity\n",
"Kitchen Qual Kitchen quality\n",
"Overall Qual Rates the overall material and finish of the house\n",
"Total Bath\n",
"Year Built Original construction date\n"
]
}
],
"source": [
"print_column_list(spearman_strongly_correlated)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Save the Results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Save the Weakly and Strongly Correlated Variables\n",
"\n",
"The variables classified as uncorrelated, weakly correlated, or strongly correlated with the sales price are saved in a simple JSON file for easy re-use."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"with open(\"data/correlated_variables.json\", \"w\") as file:\n",
" file.write(json.dumps({\n",
" \"uncorrelated\": sorted(\n",
" list(pearson_uncorrelated) + list(spearman_uncorrelated)\n",
" ),\n",
" \"weakly_correlated\": sorted(\n",
" list(pearson_weakly_correlated) + list(spearman_weakly_correlated)\n",
" ),\n",
" \"strongly_correlated\": sorted(\n",
" list(pearson_strongly_correlated) + list(spearman_strongly_correlated)\n",
" ),\n",
" }))"
]
},
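{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of how the saved file might be consumed later, the following round trip writes a small dictionary with the same three keys to a temporary file and reads it back. The file name and the handful of variable names picked from the lists above are for illustration only."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import os\n",
"import tempfile\n",
"\n",
"# Hypothetical excerpt with the same structure as the file saved above.\n",
"toy_result = {\n",
"    \"uncorrelated\": sorted([\"Pool Area\", \"Mo Sold\"]),\n",
"    \"weakly_correlated\": sorted([\"Garage Area\", \"Fireplaces\"]),\n",
"    \"strongly_correlated\": sorted([\"Total SF\", \"Overall Qual\"]),\n",
"}\n",
"# Write to a temporary directory so that no real data file is touched.\n",
"toy_path = os.path.join(tempfile.mkdtemp(), \"toy_correlated_variables.json\")\n",
"with open(toy_path, \"w\") as file:\n",
"    json.dump(toy_result, file)\n",
"with open(toy_path) as file:\n",
"    toy_loaded = json.load(file)\n",
"# The keys and the sorted lists survive the round trip unchanged.\n",
"toy_loaded == toy_result"
]
},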
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Save the Data\n",
"\n",
"Sort the new variables into the unprocessed `cleaned_df` DataFrame with the targets at the end. Because `cleaned_df` still holds the original ordinal labels rather than their integer encodings, this \"restores\" the labels for storage."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"for column in new_variables:\n",
" cleaned_df[column] = df[column]\n",
"for target in set(TARGET_VARIABLES) & set(new_variables):\n",
" new_variables.remove(target)\n",
"cleaned_df = cleaned_df[sorted(ALL_VARIABLES + new_variables) + TARGET_VARIABLES]"
]
},
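{
"cell_type": "markdown",
"metadata": {},
"source": [
"The reordering can be sketched on a toy DataFrame. All names below are hypothetical stand-ins for `ALL_VARIABLES`, `new_variables`, and `TARGET_VARIABLES`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical frame with two predictors and one target in arbitrary order.\n",
"toy_df = pd.DataFrame({\"b\": [1], \"target\": [9], \"a\": [2]})\n",
"# Add a new linear combination of the two predictors.\n",
"toy_df[\"a + b\"] = toy_df[\"a\"] + toy_df[\"b\"]\n",
"toy_all, toy_new, toy_targets = [\"a\", \"b\"], [\"a + b\"], [\"target\"]\n",
"# Sort old and new predictors together and append the targets at the end.\n",
"toy_df = toy_df[sorted(toy_all + toy_new) + toy_targets]\n",
"list(toy_df.columns)"
]
},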
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In total, this notebook added three linear combinations (`Total Bath`, `Total Porch SF`, and `Total SF`) and five Box-Cox transformations to the previous 78 columns, giving 86 columns."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(2898, 86)"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cleaned_df.shape"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th>1st Flr SF</th>\n",
" <th>1st Flr SF (box-cox-0)</th>\n",
" <th>2nd Flr SF</th>\n",
" <th>3Ssn Porch</th>\n",
" <th>Alley</th>\n",
" <th>Bedroom AbvGr</th>\n",
" <th>Bldg Type</th>\n",
" <th>Bsmt Cond</th>\n",
" <th>Bsmt Exposure</th>\n",
" <th>Bsmt Full Bath</th>\n",
" <th>Bsmt Half Bath</th>\n",
" <th>Bsmt Qual</th>\n",
" <th>Bsmt Unf SF</th>\n",
" <th>BsmtFin SF 1</th>\n",
" <th>BsmtFin SF 2</th>\n",
" <th>BsmtFin Type 1</th>\n",
" <th>BsmtFin Type 2</th>\n",
" <th>Central Air</th>\n",
" <th>Condition 1</th>\n",
" <th>Condition 2</th>\n",
" <th>Electrical</th>\n",
" <th>Enclosed Porch</th>\n",
" <th>Exter Cond</th>\n",
" <th>Exter Qual</th>\n",
" <th>Exterior 1st</th>\n",
" <th>Exterior 2nd</th>\n",
" <th>Fence</th>\n",
" <th>Fireplace Qu</th>\n",
" <th>Fireplaces</th>\n",
" <th>Foundation</th>\n",
" <th>Full Bath</th>\n",
" <th>Functional</th>\n",
" <th>Garage Area</th>\n",
" <th>Garage Cars</th>\n",
" <th>Garage Cond</th>\n",
" <th>Garage Finish</th>\n",
" <th>Garage Qual</th>\n",
" <th>Garage Type</th>\n",
" <th>Gr Liv Area</th>\n",
" <th>Gr Liv Area (box-cox-0)</th>\n",
" <th>Half Bath</th>\n",
" <th>Heating</th>\n",
" <th>Heating QC</th>\n",
" <th>House Style</th>\n",
" <th>Kitchen AbvGr</th>\n",
" <th>Kitchen Qual</th>\n",
" <th>Land Contour</th>\n",
" <th>Land Slope</th>\n",
" <th>Lot Area</th>\n",
" <th>Lot Area (box-cox-0.1)</th>\n",
" <th>Lot Config</th>\n",
" <th>Lot Shape</th>\n",
" <th>Low Qual Fin SF</th>\n",
" <th>MS SubClass</th>\n",
" <th>MS Zoning</th>\n",
" <th>Mas Vnr Area</th>\n",
" <th>Mas Vnr Type</th>\n",
" <th>Misc Feature</th>\n",
" <th>Misc Val</th>\n",
" <th>Mo Sold</th>\n",
" <th>Neighborhood</th>\n",
" <th>Open Porch SF</th>\n",
" <th>Overall Cond</th>\n",
" <th>Overall Qual</th>\n",
" <th>Paved Drive</th>\n",
" <th>Pool Area</th>\n",
" <th>Pool QC</th>\n",
" <th>Roof Matl</th>\n",
" <th>Roof Style</th>\n",
" <th>Sale Condition</th>\n",
" <th>Sale Type</th>\n",
" <th>Screen Porch</th>\n",
" <th>Street</th>\n",
" <th>TotRms AbvGrd</th>\n",
" <th>Total Bath</th>\n",
" <th>Total Bsmt SF</th>\n",
" <th>Total Porch SF</th>\n",
" <th>Total SF</th>\n",
" <th>Total SF (box-cox-0.2)</th>\n",
" <th>Utilities</th>\n",
" <th>Wood Deck SF</th>\n",
" <th>Year Built</th>\n",
" <th>Year Remod/Add</th>\n",
" <th>Yr Sold</th>\n",
" <th>SalePrice</th>\n",
" <th>SalePrice (box-cox-0)</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Order</th>\n",
" <th>PID</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <th>526301100</th>\n",
" <td>1656.0</td>\n",
" <td>7.412160</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NA</td>\n",
" <td>3</td>\n",
" <td>1Fam</td>\n",
" <td>Gd</td>\n",
" <td>Gd</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>TA</td>\n",
" <td>441.0</td>\n",
" <td>639.0</td>\n",
" <td>0.0</td>\n",
" <td>BLQ</td>\n",
" <td>Unf</td>\n",
" <td>Y</td>\n",
" <td>Norm</td>\n",
" <td>Norm</td>\n",
" <td>SBrkr</td>\n",
" <td>0.0</td>\n",
" <td>TA</td>\n",
" <td>TA</td>\n",
" <td>BrkFace</td>\n",
" <td>Plywood</td>\n",
" <td>NA</td>\n",
" <td>Gd</td>\n",
" <td>2</td>\n",
" <td>CBlock</td>\n",
" <td>1</td>\n",
" <td>Typ</td>\n",
" <td>528.0</td>\n",
" <td>2</td>\n",
" <td>TA</td>\n",
" <td>Fin</td>\n",
" <td>TA</td>\n",
" <td>Attchd</td>\n",
" <td>1656.0</td>\n",
" <td>7.412160</td>\n",
" <td>0</td>\n",
" <td>GasA</td>\n",
" <td>Fa</td>\n",
" <td>1Story</td>\n",
" <td>1</td>\n",
" <td>TA</td>\n",
" <td>Lvl</td>\n",
" <td>Gtl</td>\n",
" <td>31770.0</td>\n",
" <td>18.196923</td>\n",
" <td>Corner</td>\n",
" <td>IR1</td>\n",
" <td>0.0</td>\n",
" <td>020</td>\n",
" <td>RL</td>\n",
" <td>112.0</td>\n",
" <td>Stone</td>\n",
" <td>NA</td>\n",
" <td>0.0</td>\n",
" <td>5</td>\n",
" <td>Names</td>\n",
" <td>62.0</td>\n",
" <td>5</td>\n",
" <td>6</td>\n",
" <td>P</td>\n",
" <td>0.0</td>\n",
" <td>NA</td>\n",
" <td>CompShg</td>\n",
" <td>Hip</td>\n",
" <td>Normal</td>\n",
" <td>WD</td>\n",
" <td>0.0</td>\n",
" <td>Pave</td>\n",
" <td>7</td>\n",
" <td>2.0</td>\n",
" <td>1080.0</td>\n",
" <td>272.0</td>\n",
" <td>2736.0</td>\n",
" <td>19.344072</td>\n",
" <td>AllPub</td>\n",
" <td>210.0</td>\n",
" <td>1960</td>\n",
" <td>1960</td>\n",
" <td>2010</td>\n",
" <td>215000.0</td>\n",
" <td>12.278393</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <th>526350040</th>\n",
" <td>896.0</td>\n",
" <td>6.797940</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NA</td>\n",
" <td>2</td>\n",
" <td>1Fam</td>\n",
" <td>TA</td>\n",
" <td>No</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>TA</td>\n",
" <td>270.0</td>\n",
" <td>468.0</td>\n",
" <td>144.0</td>\n",
" <td>Rec</td>\n",
" <td>LwQ</td>\n",
" <td>Y</td>\n",
" <td>Feedr</td>\n",
" <td>Norm</td>\n",
" <td>SBrkr</td>\n",
" <td>0.0</td>\n",
" <td>TA</td>\n",
" <td>TA</td>\n",
" <td>VinylSd</td>\n",
" <td>VinylSd</td>\n",
" <td>MnPrv</td>\n",
" <td>NA</td>\n",
" <td>0</td>\n",
" <td>CBlock</td>\n",
" <td>1</td>\n",
" <td>Typ</td>\n",
" <td>730.0</td>\n",
" <td>1</td>\n",
" <td>TA</td>\n",
" <td>Unf</td>\n",
" <td>TA</td>\n",
" <td>Attchd</td>\n",
" <td>896.0</td>\n",
" <td>6.797940</td>\n",
" <td>0</td>\n",
" <td>GasA</td>\n",
" <td>TA</td>\n",
" <td>1Story</td>\n",
" <td>1</td>\n",
" <td>TA</td>\n",
" <td>Lvl</td>\n",
" <td>Gtl</td>\n",
" <td>11622.0</td>\n",
" <td>15.499290</td>\n",
" <td>Inside</td>\n",
" <td>Reg</td>\n",
" <td>0.0</td>\n",
" <td>020</td>\n",
" <td>RH</td>\n",
" <td>0.0</td>\n",
" <td>None</td>\n",
" <td>NA</td>\n",
" <td>0.0</td>\n",
" <td>6</td>\n",
" <td>Names</td>\n",
" <td>0.0</td>\n",
" <td>6</td>\n",
" <td>5</td>\n",
" <td>Y</td>\n",
" <td>0.0</td>\n",
" <td>NA</td>\n",
" <td>CompShg</td>\n",
" <td>Gable</td>\n",
" <td>Normal</td>\n",
" <td>WD</td>\n",
" <td>120.0</td>\n",
" <td>Pave</td>\n",
" <td>5</td>\n",
" <td>1.0</td>\n",
" <td>882.0</td>\n",
" <td>260.0</td>\n",
" <td>1778.0</td>\n",
" <td>17.333478</td>\n",
" <td>AllPub</td>\n",
" <td>140.0</td>\n",
" <td>1961</td>\n",
" <td>1961</td>\n",
" <td>2010</td>\n",
" <td>105000.0</td>\n",
" <td>11.561716</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <th>526351010</th>\n",
" <td>1329.0</td>\n",
" <td>7.192182</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NA</td>\n",
" <td>3</td>\n",
" <td>1Fam</td>\n",
" <td>TA</td>\n",
" <td>No</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>TA</td>\n",
" <td>406.0</td>\n",
" <td>923.0</td>\n",
" <td>0.0</td>\n",
" <td>ALQ</td>\n",
" <td>Unf</td>\n",
" <td>Y</td>\n",
" <td>Norm</td>\n",
" <td>Norm</td>\n",
" <td>SBrkr</td>\n",
" <td>0.0</td>\n",
" <td>TA</td>\n",
" <td>TA</td>\n",
" <td>Wd Sdng</td>\n",
" <td>Wd Sdng</td>\n",
" <td>NA</td>\n",
" <td>NA</td>\n",
" <td>0</td>\n",
" <td>CBlock</td>\n",
" <td>1</td>\n",
" <td>Typ</td>\n",
" <td>312.0</td>\n",
" <td>1</td>\n",
" <td>TA</td>\n",
" <td>Unf</td>\n",
" <td>TA</td>\n",
" <td>Attchd</td>\n",
" <td>1329.0</td>\n",
" <td>7.192182</td>\n",
" <td>1</td>\n",
" <td>GasA</td>\n",
" <td>TA</td>\n",
" <td>1Story</td>\n",
" <td>1</td>\n",
" <td>Gd</td>\n",
" <td>Lvl</td>\n",
" <td>Gtl</td>\n",
" <td>14267.0</td>\n",
" <td>16.027549</td>\n",
" <td>Corner</td>\n",
" <td>IR1</td>\n",
" <td>0.0</td>\n",
" <td>020</td>\n",
" <td>RL</td>\n",
" <td>108.0</td>\n",
" <td>BrkFace</td>\n",
" <td>Gar2</td>\n",
" <td>12500.0</td>\n",
" <td>6</td>\n",
" <td>Names</td>\n",
" <td>36.0</td>\n",
" <td>6</td>\n",
" <td>6</td>\n",
" <td>Y</td>\n",
" <td>0.0</td>\n",
" <td>NA</td>\n",
" <td>CompShg</td>\n",
" <td>Hip</td>\n",
" <td>Normal</td>\n",
" <td>WD</td>\n",
" <td>0.0</td>\n",
" <td>Pave</td>\n",
" <td>6</td>\n",
" <td>1.5</td>\n",
" <td>1329.0</td>\n",
" <td>429.0</td>\n",
" <td>2658.0</td>\n",
" <td>19.203658</td>\n",
" <td>AllPub</td>\n",
" <td>393.0</td>\n",
" <td>1958</td>\n",
" <td>1958</td>\n",
" <td>2010</td>\n",
" <td>172000.0</td>\n",
" <td>12.055250</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <th>526353030</th>\n",
" <td>2110.0</td>\n",
" <td>7.654443</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NA</td>\n",
" <td>3</td>\n",
" <td>1Fam</td>\n",
" <td>TA</td>\n",
" <td>No</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>TA</td>\n",
" <td>1045.0</td>\n",
" <td>1065.0</td>\n",
" <td>0.0</td>\n",
" <td>ALQ</td>\n",
" <td>Unf</td>\n",
" <td>Y</td>\n",
" <td>Norm</td>\n",
" <td>Norm</td>\n",
" <td>SBrkr</td>\n",
" <td>0.0</td>\n",
" <td>TA</td>\n",
" <td>Gd</td>\n",
" <td>BrkFace</td>\n",
" <td>BrkFace</td>\n",
" <td>NA</td>\n",
" <td>TA</td>\n",
" <td>2</td>\n",
" <td>CBlock</td>\n",
" <td>2</td>\n",
" <td>Typ</td>\n",
" <td>522.0</td>\n",
" <td>2</td>\n",
" <td>TA</td>\n",
" <td>Fin</td>\n",
" <td>TA</td>\n",
" <td>Attchd</td>\n",
" <td>2110.0</td>\n",
" <td>7.654443</td>\n",
" <td>1</td>\n",
" <td>GasA</td>\n",
" <td>Ex</td>\n",
" <td>1Story</td>\n",
" <td>1</td>\n",
" <td>Ex</td>\n",
" <td>Lvl</td>\n",
" <td>Gtl</td>\n",
" <td>11160.0</td>\n",
" <td>15.396064</td>\n",
" <td>Corner</td>\n",
" <td>Reg</td>\n",
" <td>0.0</td>\n",
" <td>020</td>\n",
" <td>RL</td>\n",
" <td>0.0</td>\n",
" <td>None</td>\n",
" <td>NA</td>\n",
" <td>0.0</td>\n",
" <td>4</td>\n",
" <td>Names</td>\n",
" <td>0.0</td>\n",
" <td>5</td>\n",
" <td>7</td>\n",
" <td>Y</td>\n",
" <td>0.0</td>\n",
" <td>NA</td>\n",
" <td>CompShg</td>\n",
" <td>Hip</td>\n",
" <td>Normal</td>\n",
" <td>WD</td>\n",
" <td>0.0</td>\n",
" <td>Pave</td>\n",
" <td>8</td>\n",
" <td>3.5</td>\n",
" <td>2110.0</td>\n",
" <td>0.0</td>\n",
" <td>4220.0</td>\n",
" <td>21.548042</td>\n",
" <td>AllPub</td>\n",
" <td>0.0</td>\n",
" <td>1968</td>\n",
" <td>1968</td>\n",
" <td>2010</td>\n",
" <td>244000.0</td>\n",
" <td>12.404924</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <th>527105010</th>\n",
" <td>928.0</td>\n",
" <td>6.833032</td>\n",
" <td>701.0</td>\n",
" <td>0.0</td>\n",
" <td>NA</td>\n",
" <td>3</td>\n",
" <td>1Fam</td>\n",
" <td>TA</td>\n",
" <td>No</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>Gd</td>\n",
" <td>137.0</td>\n",
" <td>791.0</td>\n",
" <td>0.0</td>\n",
" <td>GLQ</td>\n",
" <td>Unf</td>\n",
" <td>Y</td>\n",
" <td>Norm</td>\n",
" <td>Norm</td>\n",
" <td>SBrkr</td>\n",
" <td>0.0</td>\n",
" <td>TA</td>\n",
" <td>TA</td>\n",
" <td>VinylSd</td>\n",
" <td>VinylSd</td>\n",
" <td>MnPrv</td>\n",
" <td>TA</td>\n",
" <td>1</td>\n",
" <td>PConc</td>\n",
" <td>2</td>\n",
" <td>Typ</td>\n",
" <td>482.0</td>\n",
" <td>2</td>\n",
" <td>TA</td>\n",
" <td>Fin</td>\n",
" <td>TA</td>\n",
" <td>Attchd</td>\n",
" <td>1629.0</td>\n",
" <td>7.395722</td>\n",
" <td>1</td>\n",
" <td>GasA</td>\n",
" <td>Gd</td>\n",
" <td>2Story</td>\n",
" <td>1</td>\n",
" <td>TA</td>\n",
" <td>Lvl</td>\n",
" <td>Gtl</td>\n",
" <td>13830.0</td>\n",
" <td>15.946705</td>\n",
" <td>Inside</td>\n",
" <td>IR1</td>\n",
" <td>0.0</td>\n",
" <td>060</td>\n",
" <td>RL</td>\n",
" <td>0.0</td>\n",
" <td>None</td>\n",
" <td>NA</td>\n",
" <td>0.0</td>\n",
" <td>3</td>\n",
" <td>Gilbert</td>\n",
" <td>34.0</td>\n",
" <td>5</td>\n",
" <td>5</td>\n",
" <td>Y</td>\n",
" <td>0.0</td>\n",
" <td>NA</td>\n",
" <td>CompShg</td>\n",
" <td>Gable</td>\n",
" <td>Normal</td>\n",
" <td>WD</td>\n",
" <td>0.0</td>\n",
" <td>Pave</td>\n",
" <td>6</td>\n",
" <td>2.5</td>\n",
" <td>928.0</td>\n",
" <td>246.0</td>\n",
" <td>2557.0</td>\n",
" <td>19.016856</td>\n",
" <td>AllPub</td>\n",
" <td>212.0</td>\n",
" <td>1997</td>\n",
" <td>1998</td>\n",
" <td>2010</td>\n",
" <td>189900.0</td>\n",
" <td>12.154253</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 1st Flr SF 1st Flr SF (box-cox-0) 2nd Flr SF 3Ssn Porch \\\n",
"Order PID \n",
"1 526301100 1656.0 7.412160 0.0 0.0 \n",
"2 526350040 896.0 6.797940 0.0 0.0 \n",
"3 526351010 1329.0 7.192182 0.0 0.0 \n",
"4 526353030 2110.0 7.654443 0.0 0.0 \n",
"5 527105010 928.0 6.833032 701.0 0.0 \n",
"\n",
" Alley Bedroom AbvGr Bldg Type Bsmt Cond Bsmt Exposure \\\n",
"Order PID \n",
"1 526301100 NA 3 1Fam Gd Gd \n",
"2 526350040 NA 2 1Fam TA No \n",
"3 526351010 NA 3 1Fam TA No \n",
"4 526353030 NA 3 1Fam TA No \n",
"5 527105010 NA 3 1Fam TA No \n",
"\n",
" Bsmt Full Bath Bsmt Half Bath Bsmt Qual Bsmt Unf SF \\\n",
"Order PID \n",
"1 526301100 1 0 TA 441.0 \n",
"2 526350040 0 0 TA 270.0 \n",
"3 526351010 0 0 TA 406.0 \n",
"4 526353030 1 0 TA 1045.0 \n",
"5 527105010 0 0 Gd 137.0 \n",
"\n",
" BsmtFin SF 1 BsmtFin SF 2 BsmtFin Type 1 BsmtFin Type 2 \\\n",
"Order PID \n",
"1 526301100 639.0 0.0 BLQ Unf \n",
"2 526350040 468.0 144.0 Rec LwQ \n",
"3 526351010 923.0 0.0 ALQ Unf \n",
"4 526353030 1065.0 0.0 ALQ Unf \n",
"5 527105010 791.0 0.0 GLQ Unf \n",
"\n",
" Central Air Condition 1 Condition 2 Electrical \\\n",
"Order PID \n",
"1 526301100 Y Norm Norm SBrkr \n",
"2 526350040 Y Feedr Norm SBrkr \n",
"3 526351010 Y Norm Norm SBrkr \n",
"4 526353030 Y Norm Norm SBrkr \n",
"5 527105010 Y Norm Norm SBrkr \n",
"\n",
" Enclosed Porch Exter Cond Exter Qual Exterior 1st \\\n",
"Order PID \n",
"1 526301100 0.0 TA TA BrkFace \n",
"2 526350040 0.0 TA TA VinylSd \n",
"3 526351010 0.0 TA TA Wd Sdng \n",
"4 526353030 0.0 TA Gd BrkFace \n",
"5 527105010 0.0 TA TA VinylSd \n",
"\n",
" Exterior 2nd Fence Fireplace Qu Fireplaces Foundation \\\n",
"Order PID \n",
"1 526301100 Plywood NA Gd 2 CBlock \n",
"2 526350040 VinylSd MnPrv NA 0 CBlock \n",
"3 526351010 Wd Sdng NA NA 0 CBlock \n",
"4 526353030 BrkFace NA TA 2 CBlock \n",
"5 527105010 VinylSd MnPrv TA 1 PConc \n",
"\n",
" Full Bath Functional Garage Area Garage Cars Garage Cond \\\n",
"Order PID \n",
"1 526301100 1 Typ 528.0 2 TA \n",
"2 526350040 1 Typ 730.0 1 TA \n",
"3 526351010 1 Typ 312.0 1 TA \n",
"4 526353030 2 Typ 522.0 2 TA \n",
"5 527105010 2 Typ 482.0 2 TA \n",
"\n",
" Garage Finish Garage Qual Garage Type Gr Liv Area \\\n",
"Order PID \n",
"1 526301100 Fin TA Attchd 1656.0 \n",
"2 526350040 Unf TA Attchd 896.0 \n",
"3 526351010 Unf TA Attchd 1329.0 \n",
"4 526353030 Fin TA Attchd 2110.0 \n",
"5 527105010 Fin TA Attchd 1629.0 \n",
"\n",
" Gr Liv Area (box-cox-0) Half Bath Heating Heating QC \\\n",
"Order PID \n",
"1 526301100 7.412160 0 GasA Fa \n",
"2 526350040 6.797940 0 GasA TA \n",
"3 526351010 7.192182 1 GasA TA \n",
"4 526353030 7.654443 1 GasA Ex \n",
"5 527105010 7.395722 1 GasA Gd \n",
"\n",
" House Style Kitchen AbvGr Kitchen Qual Land Contour \\\n",
"Order PID \n",
"1 526301100 1Story 1 TA Lvl \n",
"2 526350040 1Story 1 TA Lvl \n",
"3 526351010 1Story 1 Gd Lvl \n",
"4 526353030 1Story 1 Ex Lvl \n",
"5 527105010 2Story 1 TA Lvl \n",
"\n",
" Land Slope Lot Area Lot Area (box-cox-0.1) Lot Config \\\n",
"Order PID \n",
"1 526301100 Gtl 31770.0 18.196923 Corner \n",
"2 526350040 Gtl 11622.0 15.499290 Inside \n",
"3 526351010 Gtl 14267.0 16.027549 Corner \n",
"4 526353030 Gtl 11160.0 15.396064 Corner \n",
"5 527105010 Gtl 13830.0 15.946705 Inside \n",
"\n",
" Lot Shape Low Qual Fin SF MS SubClass MS Zoning \\\n",
"Order PID \n",
"1 526301100 IR1 0.0 020 RL \n",
"2 526350040 Reg 0.0 020 RH \n",
"3 526351010 IR1 0.0 020 RL \n",
"4 526353030 Reg 0.0 020 RL \n",
"5 527105010 IR1 0.0 060 RL \n",
"\n",
" Mas Vnr Area Mas Vnr Type Misc Feature Misc Val Mo Sold \\\n",
"Order PID \n",
"1 526301100 112.0 Stone NA 0.0 5 \n",
"2 526350040 0.0 None NA 0.0 6 \n",
"3 526351010 108.0 BrkFace Gar2 12500.0 6 \n",
"4 526353030 0.0 None NA 0.0 4 \n",
"5 527105010 0.0 None NA 0.0 3 \n",
"\n",
" Neighborhood Open Porch SF Overall Cond Overall Qual \\\n",
"Order PID \n",
"1 526301100 Names 62.0 5 6 \n",
"2 526350040 Names 0.0 6 5 \n",
"3 526351010 Names 36.0 6 6 \n",
"4 526353030 Names 0.0 5 7 \n",
"5 527105010 Gilbert 34.0 5 5 \n",
"\n",
" Paved Drive Pool Area Pool QC Roof Matl Roof Style \\\n",
"Order PID \n",
"1 526301100 P 0.0 NA CompShg Hip \n",
"2 526350040 Y 0.0 NA CompShg Gable \n",
"3 526351010 Y 0.0 NA CompShg Hip \n",
"4 526353030 Y 0.0 NA CompShg Hip \n",
"5 527105010 Y 0.0 NA CompShg Gable \n",
"\n",
" Sale Condition Sale Type Screen Porch Street TotRms AbvGrd \\\n",
"Order PID \n",
"1 526301100 Normal WD 0.0 Pave 7 \n",
"2 526350040 Normal WD 120.0 Pave 5 \n",
"3 526351010 Normal WD 0.0 Pave 6 \n",
"4 526353030 Normal WD 0.0 Pave 8 \n",
"5 527105010 Normal WD 0.0 Pave 6 \n",
"\n",
" Total Bath Total Bsmt SF Total Porch SF Total SF \\\n",
"Order PID \n",
"1 526301100 2.0 1080.0 272.0 2736.0 \n",
"2 526350040 1.0 882.0 260.0 1778.0 \n",
"3 526351010 1.5 1329.0 429.0 2658.0 \n",
"4 526353030 3.5 2110.0 0.0 4220.0 \n",
"5 527105010 2.5 928.0 246.0 2557.0 \n",
"\n",
" Total SF (box-cox-0.2) Utilities Wood Deck SF Year Built \\\n",
"Order PID \n",
"1 526301100 19.344072 AllPub 210.0 1960 \n",
"2 526350040 17.333478 AllPub 140.0 1961 \n",
"3 526351010 19.203658 AllPub 393.0 1958 \n",
"4 526353030 21.548042 AllPub 0.0 1968 \n",
"5 527105010 19.016856 AllPub 212.0 1997 \n",
"\n",
" Year Remod/Add Yr Sold SalePrice SalePrice (box-cox-0) \n",
"Order PID \n",
"1 526301100 1960 2010 215000.0 12.278393 \n",
"2 526350040 1961 2010 105000.0 11.561716 \n",
"3 526351010 1958 2010 172000.0 12.055250 \n",
"4 526353030 1968 2010 244000.0 12.404924 \n",
"5 527105010 1998 2010 189900.0 12.154253 "
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cleaned_df.head()"
]
},
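{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a plausibility check, the transformed values shown above can be reproduced with the textbook one-parameter Box-Cox formula, assuming that the number in a column suffix such as `(box-cox-0.1)` is the lambda parameter. The helper `box_cox` below is a hypothetical stand-alone function written for this check; it is not necessarily how the columns were computed in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"\n",
"def box_cox(x, lmbda):\n",
"    # One-parameter Box-Cox transform for positive x (lambda = 0 is the log).\n",
"    if lmbda == 0:\n",
"        return np.log(x)\n",
"    return (x ** lmbda - 1) / lmbda\n",
"\n",
"\n",
"# Raw values of the first row (Order 1): 1st Flr SF, Lot Area, SalePrice.\n",
"toy_values = np.round(\n",
"    [box_cox(1656.0, 0), box_cox(31770.0, 0.1), box_cox(215000.0, 0)], 4\n",
")\n",
"toy_values"
]
},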
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"cleaned_df.to_csv(\"data/data_clean_with_transformations.csv\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "ames-housing",
"language": "python",
"name": "ames-housing"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.4"
}
},
"nbformat": 4,
"nbformat_minor": 4
}