{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pair-wise Correlations\n",
"\n",
"The purpose is to identify predictor variables that correlate strongly with the sales price and with each other, in order to see which variables could be good predictors and where collinearity may cause problems.\n",
"\n",
"Furthermore, Box-Cox transformations and linear combinations of variables are added where applicable or useful."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## \"Housekeeping\""
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2018-09-03 15:55:55 CEST\n",
"\n",
"CPython 3.6.5\n",
"IPython 6.5.0\n",
"\n",
"matplotlib 3.0.0rc2\n",
"numpy 1.15.1\n",
"pandas 0.23.4\n",
"seaborn 0.9.0\n",
"sklearn 0.20rc1\n"
]
}
],
"source": [
"%load_ext watermark\n",
"%watermark -d -t -v -z -p matplotlib,numpy,pandas,seaborn,sklearn"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"import json\n",
"\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"\n",
"from sklearn.preprocessing import PowerTransformer\n",
"\n",
"from utils import (\n",
" ALL_VARIABLES,\n",
" CONTINUOUS_VARIABLES,\n",
" DISCRETE_VARIABLES,\n",
" NUMERIC_VARIABLES,\n",
" ORDINAL_VARIABLES,\n",
" TARGET_VARIABLES,\n",
" load_clean_data,\n",
" print_column_list,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"%load_ext blackcellmagic"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"pd.set_option(\"display.max_columns\", 100)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"sns.set_style(\"white\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the Data\n",
"\n",
"Only a subset of the previously cleaned data is used in this analysis: nominal variables are left out because correlations involving them are not meaningful."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"df = load_clean_data(ordinal_encoded=True)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th>1st Flr SF</th>\n",
" <th>2nd Flr SF</th>\n",
" <th>3Ssn Porch</th>\n",
" <th>Bedroom AbvGr</th>\n",
" <th>Bsmt Full Bath</th>\n",
" <th>Bsmt Half Bath</th>\n",
" <th>Bsmt Unf SF</th>\n",
" <th>BsmtFin SF 1</th>\n",
" <th>BsmtFin SF 2</th>\n",
" <th>Enclosed Porch</th>\n",
" <th>Fireplaces</th>\n",
" <th>Full Bath</th>\n",
" <th>Garage Area</th>\n",
" <th>Garage Cars</th>\n",
" <th>Gr Liv Area</th>\n",
" <th>Half Bath</th>\n",
" <th>Kitchen AbvGr</th>\n",
" <th>Lot Area</th>\n",
" <th>Low Qual Fin SF</th>\n",
" <th>Mas Vnr Area</th>\n",
" <th>Misc Val</th>\n",
" <th>Mo Sold</th>\n",
" <th>Open Porch SF</th>\n",
" <th>Pool Area</th>\n",
" <th>Screen Porch</th>\n",
" <th>TotRms AbvGrd</th>\n",
" <th>Total Bsmt SF</th>\n",
" <th>Wood Deck SF</th>\n",
" <th>Year Built</th>\n",
" <th>Year Remod/Add</th>\n",
" <th>Yr Sold</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Order</th>\n",
" <th>PID</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <th>526301100</th>\n",
" <td>1656.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>441.0</td>\n",
" <td>639.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>528.0</td>\n",
" <td>2</td>\n",
" <td>1656.0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>31770.0</td>\n",
" <td>0.0</td>\n",
" <td>112.0</td>\n",
" <td>0.0</td>\n",
" <td>5</td>\n",
" <td>62.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>7</td>\n",
" <td>1080.0</td>\n",
" <td>210.0</td>\n",
" <td>1960</td>\n",
" <td>1960</td>\n",
" <td>2010</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <th>526350040</th>\n",
" <td>896.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>270.0</td>\n",
" <td>468.0</td>\n",
" <td>144.0</td>\n",
" <td>0.0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>730.0</td>\n",
" <td>1</td>\n",
" <td>896.0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>11622.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>6</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>120.0</td>\n",
" <td>5</td>\n",
" <td>882.0</td>\n",
" <td>140.0</td>\n",
" <td>1961</td>\n",
" <td>1961</td>\n",
" <td>2010</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <th>526351010</th>\n",
" <td>1329.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>406.0</td>\n",
" <td>923.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>312.0</td>\n",
" <td>1</td>\n",
" <td>1329.0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>14267.0</td>\n",
" <td>0.0</td>\n",
" <td>108.0</td>\n",
" <td>12500.0</td>\n",
" <td>6</td>\n",
" <td>36.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>6</td>\n",
" <td>1329.0</td>\n",
" <td>393.0</td>\n",
" <td>1958</td>\n",
" <td>1958</td>\n",
" <td>2010</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <th>526353030</th>\n",
" <td>2110.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1045.0</td>\n",
" <td>1065.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>522.0</td>\n",
" <td>2</td>\n",
" <td>2110.0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>11160.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>4</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>8</td>\n",
" <td>2110.0</td>\n",
" <td>0.0</td>\n",
" <td>1968</td>\n",
" <td>1968</td>\n",
" <td>2010</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <th>527105010</th>\n",
" <td>928.0</td>\n",
" <td>701.0</td>\n",
" <td>0.0</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>137.0</td>\n",
" <td>791.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>482.0</td>\n",
" <td>2</td>\n",
" <td>1629.0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>13830.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>3</td>\n",
" <td>34.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>6</td>\n",
" <td>928.0</td>\n",
" <td>212.0</td>\n",
" <td>1997</td>\n",
" <td>1998</td>\n",
" <td>2010</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 1st Flr SF 2nd Flr SF 3Ssn Porch Bedroom AbvGr \\\n",
"Order PID \n",
"1 526301100 1656.0 0.0 0.0 3 \n",
"2 526350040 896.0 0.0 0.0 2 \n",
"3 526351010 1329.0 0.0 0.0 3 \n",
"4 526353030 2110.0 0.0 0.0 3 \n",
"5 527105010 928.0 701.0 0.0 3 \n",
"\n",
" Bsmt Full Bath Bsmt Half Bath Bsmt Unf SF BsmtFin SF 1 \\\n",
"Order PID \n",
"1 526301100 1 0 441.0 639.0 \n",
"2 526350040 0 0 270.0 468.0 \n",
"3 526351010 0 0 406.0 923.0 \n",
"4 526353030 1 0 1045.0 1065.0 \n",
"5 527105010 0 0 137.0 791.0 \n",
"\n",
" BsmtFin SF 2 Enclosed Porch Fireplaces Full Bath \\\n",
"Order PID \n",
"1 526301100 0.0 0.0 2 1 \n",
"2 526350040 144.0 0.0 0 1 \n",
"3 526351010 0.0 0.0 0 1 \n",
"4 526353030 0.0 0.0 2 2 \n",
"5 527105010 0.0 0.0 1 2 \n",
"\n",
" Garage Area Garage Cars Gr Liv Area Half Bath \\\n",
"Order PID \n",
"1 526301100 528.0 2 1656.0 0 \n",
"2 526350040 730.0 1 896.0 0 \n",
"3 526351010 312.0 1 1329.0 1 \n",
"4 526353030 522.0 2 2110.0 1 \n",
"5 527105010 482.0 2 1629.0 1 \n",
"\n",
" Kitchen AbvGr Lot Area Low Qual Fin SF Mas Vnr Area \\\n",
"Order PID \n",
"1 526301100 1 31770.0 0.0 112.0 \n",
"2 526350040 1 11622.0 0.0 0.0 \n",
"3 526351010 1 14267.0 0.0 108.0 \n",
"4 526353030 1 11160.0 0.0 0.0 \n",
"5 527105010 1 13830.0 0.0 0.0 \n",
"\n",
" Misc Val Mo Sold Open Porch SF Pool Area Screen Porch \\\n",
"Order PID \n",
"1 526301100 0.0 5 62.0 0.0 0.0 \n",
"2 526350040 0.0 6 0.0 0.0 120.0 \n",
"3 526351010 12500.0 6 36.0 0.0 0.0 \n",
"4 526353030 0.0 4 0.0 0.0 0.0 \n",
"5 527105010 0.0 3 34.0 0.0 0.0 \n",
"\n",
" TotRms AbvGrd Total Bsmt SF Wood Deck SF Year Built \\\n",
"Order PID \n",
"1 526301100 7 1080.0 210.0 1960 \n",
"2 526350040 5 882.0 140.0 1961 \n",
"3 526351010 6 1329.0 393.0 1958 \n",
"4 526353030 8 2110.0 0.0 1968 \n",
"5 527105010 6 928.0 212.0 1997 \n",
"\n",
" Year Remod/Add Yr Sold \n",
"Order PID \n",
"1 526301100 1960 2010 \n",
"2 526350040 1961 2010 \n",
"3 526351010 1958 2010 \n",
"4 526353030 1968 2010 \n",
"5 527105010 1998 2010 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[NUMERIC_VARIABLES].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ordinal variables are encoded as integers and take part in the analysis. Greater values are meant to indicate a higher sales price, an ordering chosen by gut feeling; refer to the [data documentation](https://www.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt) to see the un-encoded values."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th>Bsmt Cond</th>\n",
" <th>Bsmt Exposure</th>\n",
" <th>Bsmt Qual</th>\n",
" <th>BsmtFin Type 1</th>\n",
" <th>BsmtFin Type 2</th>\n",
" <th>Electrical</th>\n",
" <th>Exter Cond</th>\n",
" <th>Exter Qual</th>\n",
" <th>Fence</th>\n",
" <th>Fireplace Qu</th>\n",
" <th>Functional</th>\n",
" <th>Garage Cond</th>\n",
" <th>Garage Finish</th>\n",
" <th>Garage Qual</th>\n",
" <th>Heating QC</th>\n",
" <th>Kitchen Qual</th>\n",
" <th>Land Slope</th>\n",
" <th>Lot Shape</th>\n",
" <th>Overall Cond</th>\n",
" <th>Overall Qual</th>\n",
" <th>Paved Drive</th>\n",
" <th>Pool QC</th>\n",
" <th>Utilities</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Order</th>\n",
" <th>PID</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <th>526301100</th>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>4</td>\n",
" <td>7</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>4</td>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <th>526350040</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>7</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>5</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <th>526351010</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>7</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>5</td>\n",
" <td>5</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <th>526353030</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>7</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" <td>6</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <th>527105010</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>4</td>\n",
" <td>6</td>\n",
" <td>1</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>7</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Bsmt Cond Bsmt Exposure Bsmt Qual BsmtFin Type 1 \\\n",
"Order PID \n",
"1 526301100 4 4 3 4 \n",
"2 526350040 3 1 3 3 \n",
"3 526351010 3 1 3 5 \n",
"4 526353030 3 1 3 5 \n",
"5 527105010 3 1 4 6 \n",
"\n",
" BsmtFin Type 2 Electrical Exter Cond Exter Qual Fence \\\n",
"Order PID \n",
"1 526301100 1 4 2 2 0 \n",
"2 526350040 2 4 2 2 3 \n",
"3 526351010 1 4 2 2 0 \n",
"4 526353030 1 4 2 3 0 \n",
"5 527105010 1 4 2 2 3 \n",
"\n",
" Fireplace Qu Functional Garage Cond Garage Finish \\\n",
"Order PID \n",
"1 526301100 4 7 3 3 \n",
"2 526350040 0 7 3 1 \n",
"3 526351010 0 7 3 1 \n",
"4 526353030 3 7 3 3 \n",
"5 527105010 3 7 3 3 \n",
"\n",
" Garage Qual Heating QC Kitchen Qual Land Slope Lot Shape \\\n",
"Order PID \n",
"1 526301100 3 1 2 2 2 \n",
"2 526350040 3 2 2 2 3 \n",
"3 526351010 3 2 3 2 2 \n",
"4 526353030 3 4 4 2 3 \n",
"5 527105010 3 3 2 2 2 \n",
"\n",
" Overall Cond Overall Qual Paved Drive Pool QC Utilities \n",
"Order PID \n",
"1 526301100 4 5 1 0 3 \n",
"2 526350040 5 4 2 0 3 \n",
"3 526351010 5 5 2 0 3 \n",
"4 526353030 4 6 2 0 3 \n",
"5 527105010 4 4 2 0 3 "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[ORDINAL_VARIABLES].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Linearly \"dependent\" Features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The \"above grade (ground) living area\" (= *Gr Liv Area*) is the sum of the 1st and 2nd floor living areas plus the low-quality finished area (*Low Qual Fin SF*)."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"assert not (\n",
" df[\"Gr Liv Area\"]\n",
" != (df[\"1st Flr SF\"] + df[\"2nd Flr SF\"] + df[\"Low Qual Fin SF\"])\n",
").any()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The basement areas add up in the same way: *Total Bsmt SF* is the sum of *BsmtFin SF 1*, *BsmtFin SF 2*, and *Bsmt Unf SF*."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"assert not (\n",
" df[\"Total Bsmt SF\"]\n",
" != (df[\"BsmtFin SF 1\"] + df[\"BsmtFin SF 2\"] + df[\"Bsmt Unf SF\"])\n",
").any()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The different porch areas are combined into a new variable *Total Porch SF*. This may help make the presence of a porch in general relevant for the prediction."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"df[\"Total Porch SF\"] = (\n",
" df[\"3Ssn Porch\"] + df[\"Enclosed Porch\"] + df[\"Open Porch SF\"]\n",
" + df[\"Screen Porch\"] + df[\"Wood Deck SF\"]\n",
")\n",
"\n",
"new_variables = [\"Total Porch SF\"]\n",
"CONTINUOUS_VARIABLES.append(\"Total Porch SF\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The various types of rooms \"above grade\" (i.e., *TotRms AbvGrd*, *Bedroom AbvGr*, *Kitchen AbvGr*, and *Full Bath*) do not add up: the sum of bedrooms, kitchens, and full baths equals *TotRms AbvGrd* in only 29% of the cases. Therefore, no single unified variable can be used as a predictor."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"29.0"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"round(\n",
" 100\n",
" * (\n",
" df[\"TotRms AbvGrd\"]\n",
" == (df[\"Bedroom AbvGr\"] + df[\"Kitchen AbvGr\"] + df[\"Full Bath\"])\n",
" ).sum()\n",
" / df.shape[0]\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Unify the counts of the various types of bathrooms into a single variable. Note that each \"half\" bathroom is counted as 0.5."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"df[\"Total Bath\"] = (\n",
" df[\"Full Bath\"] + 0.5 * df[\"Half Bath\"]\n",
" + df[\"Bsmt Full Bath\"] + 0.5 * df[\"Bsmt Half Bath\"]\n",
")\n",
"\n",
"new_variables.append(\"Total Bath\")\n",
"DISCRETE_VARIABLES.append(\"Total Bath\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Box-Cox Transformations\n",
"\n",
"Only columns with strictly positive values are eligible for a Box-Cox transformation."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1st Flr SF First Floor square feet\n",
"Gr Liv Area Above grade (ground) living area square feet\n",
"Lot Area Lot size in square feet\n",
"Mo Sold Month Sold (MM)\n",
"SalePrice\n",
"TotRms AbvGrd Total rooms above grade (does not include bathrooms)\n",
"Total Bath\n",
"Year Built Original construction date\n",
"Year Remod/Add Remodel date (same as construction date if no remodeling or additions)\n",
"Yr Sold Year Sold (YYYY)\n"
]
}
],
"source": [
"columns = CONTINUOUS_VARIABLES + DISCRETE_VARIABLES + TARGET_VARIABLES\n",
"transforms = df[columns].describe().T\n",
"transforms = list(transforms[transforms['min'] > 0].index)\n",
"print_column_list(transforms)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A common convention is to apply a Box-Cox transformation only if the estimated lambda (found by maximum likelihood estimation) lies in the range from -3 to +3, and to use the lambda rounded to the nearest \"half\" integer."
]
},
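{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the one-parameter Box-Cox transformation of a strictly positive value $y$ is commonly written as follows. The symbols $y$ and $\\lambda$ are notation assumed here for illustration and are not names used in the code.\n",
"\n",
"$$\n",
"y^{(\\lambda)} =\n",
"\\begin{cases}\n",
"\\dfrac{y^{\\lambda} - 1}{\\lambda} & \\text{if } \\lambda \\neq 0 \\\\\n",
"\\ln y & \\text{if } \\lambda = 0\n",
"\\end{cases}\n",
"$$\n",
"\n",
"Rounding to the nearest half integer can be expressed as $0.5 \\cdot \\operatorname{round}(2 \\lambda)$: an estimate of $0.775$ becomes $1.0$, $0.511$ becomes $0.5$, and $0.107$ becomes $0.0$. As a quick check, a value of $2.0$ transformed with $\\lambda = 0.5$ gives $(\\sqrt{2} - 1) / 0.5 \\approx 0.828$."
]
},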
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Exact lambda of -8.398 for 1st Flr SF not in realistic range\n",
"Exact lambda of -8.398 for Gr Liv Area not in realistic range\n",
"Exact lambda of -8.398 for Lot Area not in realistic range\n",
"Rounded lambda of 1.0 (exact is 0.775) used to transform Mo Sold\n",
"Rounded lambda of 0.0 (exact is 0.107) used to transform TotRms AbvGrd\n",
"Exact lambda of 21.823 for Year Built not in realistic range\n",
"Exact lambda of 35.529 for Year Remod/Add not in realistic range\n",
"Exact lambda of -8.398 for Yr Sold not in realistic range\n",
"Rounded lambda of 0.5 (exact is 0.511) used to transform Total Bath\n",
"Rounded lambda of 0.0 (exact is 0.004) used to transform SalePrice\n"
]
}
],
"source": [
"# Check the Box-Cox transformation for each column separately\n",
"# to decide if the optimal lambda value is in an acceptable range.\n",
"for column in transforms:\n",
"    X = df[[column]]  # 2D array needed!\n",
"    pt = PowerTransformer(method=\"box-cox\", standardize=False)\n",
"    # Suppress a weird but harmless warning from scipy\n",
"    with warnings.catch_warnings():\n",
"        warnings.simplefilter(\"ignore\")\n",
"        pt.fit(X)\n",
"    # Check if the optimal lambda is ok.\n",
"    exact_lambda = pt.lambdas_[0]\n",
"    used_lambda = 0.5 * np.round(2.0 * exact_lambda)\n",
"    if -3 <= exact_lambda <= 3:\n",
"        print(\n",
"            f\"Rounded lambda of {used_lambda} (exact is {exact_lambda:.3f}) \"\n",
"            f\"used to transform {column}\"\n",
"        )\n",
"        new_column = f\"{column} (box-cox-{used_lambda})\"\n",
"        df[new_column] = (\n",
"            np.log(X) if used_lambda == 0 else (((X ** used_lambda) - 1) / used_lambda)\n",
"        )\n",
"        # Track the new column in the appropriate list.\n",
"        if column in CONTINUOUS_VARIABLES:\n",
"            new_variables.append(new_column)\n",
"            CONTINUOUS_VARIABLES.append(new_column)\n",
"        elif column in DISCRETE_VARIABLES:\n",
"            new_variables.append(new_column)\n",
"            DISCRETE_VARIABLES.append(new_column)\n",
"        else:\n",
"            TARGET_VARIABLES.append(new_column)\n",
"    else:\n",
"        print(f\"Exact lambda of {exact_lambda:.3f} for {column} not in realistic range\")"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th>1st Flr SF</th>\n",
" <th>2nd Flr SF</th>\n",
" <th>3Ssn Porch</th>\n",
" <th>Alley</th>\n",
" <th>Bedroom AbvGr</th>\n",
" <th>Bldg Type</th>\n",
" <th>Bsmt Cond</th>\n",
" <th>Bsmt Exposure</th>\n",
" <th>Bsmt Full Bath</th>\n",
" <th>Bsmt Half Bath</th>\n",
" <th>Bsmt Qual</th>\n",
" <th>Bsmt Unf SF</th>\n",
" <th>BsmtFin SF 1</th>\n",
" <th>BsmtFin SF 2</th>\n",
" <th>BsmtFin Type 1</th>\n",
" <th>BsmtFin Type 2</th>\n",
" <th>Central Air</th>\n",
" <th>Condition 1</th>\n",
" <th>Condition 2</th>\n",
" <th>Electrical</th>\n",
" <th>Enclosed Porch</th>\n",
" <th>Exter Cond</th>\n",
" <th>Exter Qual</th>\n",
" <th>Exterior 1st</th>\n",
" <th>Exterior 2nd</th>\n",
" <th>Fence</th>\n",
" <th>Fireplace Qu</th>\n",
" <th>Fireplaces</th>\n",
" <th>Foundation</th>\n",
" <th>Full Bath</th>\n",
" <th>Functional</th>\n",
" <th>Garage Area</th>\n",
" <th>Garage Cars</th>\n",
" <th>Garage Cond</th>\n",
" <th>Garage Finish</th>\n",
" <th>Garage Qual</th>\n",
" <th>Garage Type</th>\n",
" <th>Gr Liv Area</th>\n",
" <th>Half Bath</th>\n",
" <th>Heating</th>\n",
" <th>Heating QC</th>\n",
" <th>House Style</th>\n",
" <th>Kitchen AbvGr</th>\n",
" <th>Kitchen Qual</th>\n",
" <th>Land Contour</th>\n",
" <th>Land Slope</th>\n",
" <th>Lot Area</th>\n",
" <th>Lot Config</th>\n",
" <th>Lot Shape</th>\n",
" <th>Low Qual Fin SF</th>\n",
" <th>MS SubClass</th>\n",
" <th>MS Zoning</th>\n",
" <th>Mas Vnr Area</th>\n",
" <th>Mas Vnr Type</th>\n",
" <th>Misc Feature</th>\n",
" <th>Misc Val</th>\n",
" <th>Mo Sold</th>\n",
" <th>Neighborhood</th>\n",
" <th>Open Porch SF</th>\n",
" <th>Overall Cond</th>\n",
" <th>Overall Qual</th>\n",
" <th>Paved Drive</th>\n",
" <th>Pool Area</th>\n",
" <th>Pool QC</th>\n",
" <th>Roof Matl</th>\n",
" <th>Roof Style</th>\n",
" <th>Sale Condition</th>\n",
" <th>Sale Type</th>\n",
" <th>Screen Porch</th>\n",
" <th>Street</th>\n",
" <th>TotRms AbvGrd</th>\n",
" <th>Total Bsmt SF</th>\n",
" <th>Utilities</th>\n",
" <th>Wood Deck SF</th>\n",
" <th>Year Built</th>\n",
" <th>Year Remod/Add</th>\n",
" <th>Yr Sold</th>\n",
" <th>SalePrice</th>\n",
" <th>Total Porch SF</th>\n",
" <th>Total Bath</th>\n",
" <th>Mo Sold (box-cox-1.0)</th>\n",
" <th>TotRms AbvGrd (box-cox-0.0)</th>\n",
" <th>Total Bath (box-cox-0.5)</th>\n",
" <th>SalePrice (box-cox-0.0)</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Order</th>\n",
" <th>PID</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <th>526301100</th>\n",
" <td>1656.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NA</td>\n",
" <td>3</td>\n",
" <td>1Fam</td>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>441.0</td>\n",
" <td>639.0</td>\n",
" <td>0.0</td>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>Y</td>\n",
" <td>Norm</td>\n",
" <td>Norm</td>\n",
" <td>4</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>BrkFace</td>\n",
" <td>Plywood</td>\n",
" <td>0</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>CBlock</td>\n",
" <td>1</td>\n",
" <td>7</td>\n",
" <td>528.0</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>Attchd</td>\n",
" <td>1656.0</td>\n",
" <td>0</td>\n",
" <td>GasA</td>\n",
" <td>1</td>\n",
" <td>1Story</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>Lvl</td>\n",
" <td>2</td>\n",
" <td>31770.0</td>\n",
" <td>Corner</td>\n",
" <td>2</td>\n",
" <td>0.0</td>\n",
" <td>020</td>\n",
" <td>RL</td>\n",
" <td>112.0</td>\n",
" <td>Stone</td>\n",
" <td>NA</td>\n",
" <td>0.0</td>\n",
" <td>5</td>\n",
" <td>Names</td>\n",
" <td>62.0</td>\n",
" <td>4</td>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" <td>0.0</td>\n",
" <td>0</td>\n",
" <td>CompShg</td>\n",
" <td>Hip</td>\n",
" <td>Normal</td>\n",
" <td>WD</td>\n",
" <td>0.0</td>\n",
" <td>Pave</td>\n",
" <td>7</td>\n",
" <td>1080.0</td>\n",
" <td>3</td>\n",
" <td>210.0</td>\n",
" <td>1960</td>\n",
" <td>1960</td>\n",
" <td>2010</td>\n",
" <td>215000.0</td>\n",
" <td>272.0</td>\n",
" <td>2.0</td>\n",
" <td>4.0</td>\n",
" <td>1.945910</td>\n",
" <td>0.828427</td>\n",
" <td>12.278393</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <th>526350040</th>\n",
" <td>896.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NA</td>\n",
" <td>2</td>\n",
" <td>1Fam</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>270.0</td>\n",
" <td>468.0</td>\n",
" <td>144.0</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>Y</td>\n",
" <td>Feedr</td>\n",
" <td>Norm</td>\n",
" <td>4</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>VinylSd</td>\n",
" <td>VinylSd</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>CBlock</td>\n",
" <td>1</td>\n",
" <td>7</td>\n",
" <td>730.0</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Attchd</td>\n",
" <td>896.0</td>\n",
" <td>0</td>\n",
" <td>GasA</td>\n",
" <td>2</td>\n",
" <td>1Story</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>Lvl</td>\n",
" <td>2</td>\n",
" <td>11622.0</td>\n",
" <td>Inside</td>\n",
" <td>3</td>\n",
" <td>0.0</td>\n",
" <td>020</td>\n",
" <td>RH</td>\n",
" <td>0.0</td>\n",
" <td>None</td>\n",
" <td>NA</td>\n",
" <td>0.0</td>\n",
" <td>6</td>\n",
" <td>Names</td>\n",
" <td>0.0</td>\n",
" <td>5</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>0.0</td>\n",
" <td>0</td>\n",
" <td>CompShg</td>\n",
" <td>Gable</td>\n",
" <td>Normal</td>\n",
" <td>WD</td>\n",
" <td>120.0</td>\n",
" <td>Pave</td>\n",
" <td>5</td>\n",
" <td>882.0</td>\n",
" <td>3</td>\n",
" <td>140.0</td>\n",
" <td>1961</td>\n",
" <td>1961</td>\n",
" <td>2010</td>\n",
" <td>105000.0</td>\n",
" <td>260.0</td>\n",
" <td>1.0</td>\n",
" <td>5.0</td>\n",
" <td>1.609438</td>\n",
" <td>0.000000</td>\n",
" <td>11.561716</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <th>526351010</th>\n",
" <td>1329.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NA</td>\n",
" <td>3</td>\n",
" <td>1Fam</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>406.0</td>\n",
" <td>923.0</td>\n",
" <td>0.0</td>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" <td>Y</td>\n",
" <td>Norm</td>\n",
" <td>Norm</td>\n",
" <td>4</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>Wd Sdng</td>\n",
" <td>Wd Sdng</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>CBlock</td>\n",
" <td>1</td>\n",
" <td>7</td>\n",
" <td>312.0</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Attchd</td>\n",
" <td>1329.0</td>\n",
" <td>1</td>\n",
" <td>GasA</td>\n",
" <td>2</td>\n",
" <td>1Story</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Lvl</td>\n",
" <td>2</td>\n",
" <td>14267.0</td>\n",
" <td>Corner</td>\n",
" <td>2</td>\n",
" <td>0.0</td>\n",
" <td>020</td>\n",
" <td>RL</td>\n",
" <td>108.0</td>\n",
" <td>BrkFace</td>\n",
" <td>Gar2</td>\n",
" <td>12500.0</td>\n",
" <td>6</td>\n",
" <td>Names</td>\n",
" <td>36.0</td>\n",
" <td>5</td>\n",
" <td>5</td>\n",
" <td>2</td>\n",
" <td>0.0</td>\n",
" <td>0</td>\n",
" <td>CompShg</td>\n",
" <td>Hip</td>\n",
" <td>Normal</td>\n",
" <td>WD</td>\n",
" <td>0.0</td>\n",
" <td>Pave</td>\n",
" <td>6</td>\n",
" <td>1329.0</td>\n",
" <td>3</td>\n",
" <td>393.0</td>\n",
" <td>1958</td>\n",
" <td>1958</td>\n",
" <td>2010</td>\n",
" <td>172000.0</td>\n",
" <td>429.0</td>\n",
" <td>1.5</td>\n",
" <td>5.0</td>\n",
" <td>1.791759</td>\n",
" <td>0.449490</td>\n",
" <td>12.055250</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <th>526353030</th>\n",
" <td>2110.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NA</td>\n",
" <td>3</td>\n",
" <td>1Fam</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>1045.0</td>\n",
" <td>1065.0</td>\n",
" <td>0.0</td>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" <td>Y</td>\n",
" <td>Norm</td>\n",
" <td>Norm</td>\n",
" <td>4</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>BrkFace</td>\n",
" <td>BrkFace</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>CBlock</td>\n",
" <td>2</td>\n",
" <td>7</td>\n",
" <td>522.0</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>Attchd</td>\n",
" <td>2110.0</td>\n",
" <td>1</td>\n",
" <td>GasA</td>\n",
" <td>4</td>\n",
" <td>1Story</td>\n",
" <td>1</td>\n",
" <td>4</td>\n",
" <td>Lvl</td>\n",
" <td>2</td>\n",
" <td>11160.0</td>\n",
" <td>Corner</td>\n",
" <td>3</td>\n",
" <td>0.0</td>\n",
" <td>020</td>\n",
" <td>RL</td>\n",
" <td>0.0</td>\n",
" <td>None</td>\n",
" <td>NA</td>\n",
" <td>0.0</td>\n",
" <td>4</td>\n",
" <td>Names</td>\n",
" <td>0.0</td>\n",
" <td>4</td>\n",
" <td>6</td>\n",
" <td>2</td>\n",
" <td>0.0</td>\n",
" <td>0</td>\n",
" <td>CompShg</td>\n",
" <td>Hip</td>\n",
" <td>Normal</td>\n",
" <td>WD</td>\n",
" <td>0.0</td>\n",
" <td>Pave</td>\n",
" <td>8</td>\n",
" <td>2110.0</td>\n",
" <td>3</td>\n",
" <td>0.0</td>\n",
" <td>1968</td>\n",
" <td>1968</td>\n",
" <td>2010</td>\n",
" <td>244000.0</td>\n",
" <td>0.0</td>\n",
" <td>3.5</td>\n",
" <td>3.0</td>\n",
" <td>2.079442</td>\n",
" <td>1.741657</td>\n",
" <td>12.404924</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <th>527105010</th>\n",
" <td>928.0</td>\n",
" <td>701.0</td>\n",
" <td>0.0</td>\n",
" <td>NA</td>\n",
" <td>3</td>\n",
" <td>1Fam</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>4</td>\n",
" <td>137.0</td>\n",
" <td>791.0</td>\n",
" <td>0.0</td>\n",
" <td>6</td>\n",
" <td>1</td>\n",
" <td>Y</td>\n",
" <td>Norm</td>\n",
" <td>Norm</td>\n",
" <td>4</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>VinylSd</td>\n",
" <td>VinylSd</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>PConc</td>\n",
" <td>2</td>\n",
" <td>7</td>\n",
" <td>482.0</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>Attchd</td>\n",
" <td>1629.0</td>\n",
" <td>1</td>\n",
" <td>GasA</td>\n",
" <td>3</td>\n",
" <td>2Story</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>Lvl</td>\n",
" <td>2</td>\n",
" <td>13830.0</td>\n",
" <td>Inside</td>\n",
" <td>2</td>\n",
" <td>0.0</td>\n",
" <td>060</td>\n",
" <td>RL</td>\n",
" <td>0.0</td>\n",
" <td>None</td>\n",
" <td>NA</td>\n",
" <td>0.0</td>\n",
" <td>3</td>\n",
" <td>Gilbert</td>\n",
" <td>34.0</td>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>0.0</td>\n",
" <td>0</td>\n",
" <td>CompShg</td>\n",
" <td>Gable</td>\n",
" <td>Normal</td>\n",
" <td>WD</td>\n",
" <td>0.0</td>\n",
" <td>Pave</td>\n",
" <td>6</td>\n",
" <td>928.0</td>\n",
" <td>3</td>\n",
" <td>212.0</td>\n",
" <td>1997</td>\n",
" <td>1998</td>\n",
" <td>2010</td>\n",
" <td>189900.0</td>\n",
" <td>246.0</td>\n",
" <td>2.5</td>\n",
" <td>2.0</td>\n",
" <td>1.791759</td>\n",
" <td>1.162278</td>\n",
" <td>12.154253</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 1st Flr SF 2nd Flr SF 3Ssn Porch Alley Bedroom AbvGr \\\n",
"Order PID \n",
"1 526301100 1656.0 0.0 0.0 NA 3 \n",
"2 526350040 896.0 0.0 0.0 NA 2 \n",
"3 526351010 1329.0 0.0 0.0 NA 3 \n",
"4 526353030 2110.0 0.0 0.0 NA 3 \n",
"5 527105010 928.0 701.0 0.0 NA 3 \n",
"\n",
" Bldg Type Bsmt Cond Bsmt Exposure Bsmt Full Bath \\\n",
"Order PID \n",
"1 526301100 1Fam 4 4 1 \n",
"2 526350040 1Fam 3 1 0 \n",
"3 526351010 1Fam 3 1 0 \n",
"4 526353030 1Fam 3 1 1 \n",
"5 527105010 1Fam 3 1 0 \n",
"\n",
" Bsmt Half Bath Bsmt Qual Bsmt Unf SF BsmtFin SF 1 \\\n",
"Order PID \n",
"1 526301100 0 3 441.0 639.0 \n",
"2 526350040 0 3 270.0 468.0 \n",
"3 526351010 0 3 406.0 923.0 \n",
"4 526353030 0 3 1045.0 1065.0 \n",
"5 527105010 0 4 137.0 791.0 \n",
"\n",
" BsmtFin SF 2 BsmtFin Type 1 BsmtFin Type 2 Central Air \\\n",
"Order PID \n",
"1 526301100 0.0 4 1 Y \n",
"2 526350040 144.0 3 2 Y \n",
"3 526351010 0.0 5 1 Y \n",
"4 526353030 0.0 5 1 Y \n",
"5 527105010 0.0 6 1 Y \n",
"\n",
" Condition 1 Condition 2 Electrical Enclosed Porch \\\n",
"Order PID \n",
"1 526301100 Norm Norm 4 0.0 \n",
"2 526350040 Feedr Norm 4 0.0 \n",
"3 526351010 Norm Norm 4 0.0 \n",
"4 526353030 Norm Norm 4 0.0 \n",
"5 527105010 Norm Norm 4 0.0 \n",
"\n",
" Exter Cond Exter Qual Exterior 1st Exterior 2nd Fence \\\n",
"Order PID \n",
"1 526301100 2 2 BrkFace Plywood 0 \n",
"2 526350040 2 2 VinylSd VinylSd 3 \n",
"3 526351010 2 2 Wd Sdng Wd Sdng 0 \n",
"4 526353030 2 3 BrkFace BrkFace 0 \n",
"5 527105010 2 2 VinylSd VinylSd 3 \n",
"\n",
" Fireplace Qu Fireplaces Foundation Full Bath Functional \\\n",
"Order PID \n",
"1 526301100 4 2 CBlock 1 7 \n",
"2 526350040 0 0 CBlock 1 7 \n",
"3 526351010 0 0 CBlock 1 7 \n",
"4 526353030 3 2 CBlock 2 7 \n",
"5 527105010 3 1 PConc 2 7 \n",
"\n",
" Garage Area Garage Cars Garage Cond Garage Finish \\\n",
"Order PID \n",
"1 526301100 528.0 2 3 3 \n",
"2 526350040 730.0 1 3 1 \n",
"3 526351010 312.0 1 3 1 \n",
"4 526353030 522.0 2 3 3 \n",
"5 527105010 482.0 2 3 3 \n",
"\n",
" Garage Qual Garage Type Gr Liv Area Half Bath Heating \\\n",
"Order PID \n",
"1 526301100 3 Attchd 1656.0 0 GasA \n",
"2 526350040 3 Attchd 896.0 0 GasA \n",
"3 526351010 3 Attchd 1329.0 1 GasA \n",
"4 526353030 3 Attchd 2110.0 1 GasA \n",
"5 527105010 3 Attchd 1629.0 1 GasA \n",
"\n",
" Heating QC House Style Kitchen AbvGr Kitchen Qual \\\n",
"Order PID \n",
"1 526301100 1 1Story 1 2 \n",
"2 526350040 2 1Story 1 2 \n",
"3 526351010 2 1Story 1 3 \n",
"4 526353030 4 1Story 1 4 \n",
"5 527105010 3 2Story 1 2 \n",
"\n",
" Land Contour Land Slope Lot Area Lot Config Lot Shape \\\n",
"Order PID \n",
"1 526301100 Lvl 2 31770.0 Corner 2 \n",
"2 526350040 Lvl 2 11622.0 Inside 3 \n",
"3 526351010 Lvl 2 14267.0 Corner 2 \n",
"4 526353030 Lvl 2 11160.0 Corner 3 \n",
"5 527105010 Lvl 2 13830.0 Inside 2 \n",
"\n",
" Low Qual Fin SF MS SubClass MS Zoning Mas Vnr Area \\\n",
"Order PID \n",
"1 526301100 0.0 020 RL 112.0 \n",
"2 526350040 0.0 020 RH 0.0 \n",
"3 526351010 0.0 020 RL 108.0 \n",
"4 526353030 0.0 020 RL 0.0 \n",
"5 527105010 0.0 060 RL 0.0 \n",
"\n",
" Mas Vnr Type Misc Feature Misc Val Mo Sold Neighborhood \\\n",
"Order PID \n",
"1 526301100 Stone NA 0.0 5 Names \n",
"2 526350040 None NA 0.0 6 Names \n",
"3 526351010 BrkFace Gar2 12500.0 6 Names \n",
"4 526353030 None NA 0.0 4 Names \n",
"5 527105010 None NA 0.0 3 Gilbert \n",
"\n",
" Open Porch SF Overall Cond Overall Qual Paved Drive \\\n",
"Order PID \n",
"1 526301100 62.0 4 5 1 \n",
"2 526350040 0.0 5 4 2 \n",
"3 526351010 36.0 5 5 2 \n",
"4 526353030 0.0 4 6 2 \n",
"5 527105010 34.0 4 4 2 \n",
"\n",
" Pool Area Pool QC Roof Matl Roof Style Sale Condition \\\n",
"Order PID \n",
"1 526301100 0.0 0 CompShg Hip Normal \n",
"2 526350040 0.0 0 CompShg Gable Normal \n",
"3 526351010 0.0 0 CompShg Hip Normal \n",
"4 526353030 0.0 0 CompShg Hip Normal \n",
"5 527105010 0.0 0 CompShg Gable Normal \n",
"\n",
" Sale Type Screen Porch Street TotRms AbvGrd Total Bsmt SF \\\n",
"Order PID \n",
"1 526301100 WD 0.0 Pave 7 1080.0 \n",
"2 526350040 WD 120.0 Pave 5 882.0 \n",
"3 526351010 WD 0.0 Pave 6 1329.0 \n",
"4 526353030 WD 0.0 Pave 8 2110.0 \n",
"5 527105010 WD 0.0 Pave 6 928.0 \n",
"\n",
" Utilities Wood Deck SF Year Built Year Remod/Add Yr Sold \\\n",
"Order PID \n",
"1 526301100 3 210.0 1960 1960 2010 \n",
"2 526350040 3 140.0 1961 1961 2010 \n",
"3 526351010 3 393.0 1958 1958 2010 \n",
"4 526353030 3 0.0 1968 1968 2010 \n",
"5 527105010 3 212.0 1997 1998 2010 \n",
"\n",
" SalePrice Total Porch SF Total Bath Mo Sold (box-cox-1.0) \\\n",
"Order PID \n",
"1 526301100 215000.0 272.0 2.0 4.0 \n",
"2 526350040 105000.0 260.0 1.0 5.0 \n",
"3 526351010 172000.0 429.0 1.5 5.0 \n",
"4 526353030 244000.0 0.0 3.5 3.0 \n",
"5 527105010 189900.0 246.0 2.5 2.0 \n",
"\n",
" TotRms AbvGrd (box-cox-0.0) Total Bath (box-cox-0.5) \\\n",
"Order PID \n",
"1 526301100 1.945910 0.828427 \n",
"2 526350040 1.609438 0.000000 \n",
"3 526351010 1.791759 0.449490 \n",
"4 526353030 2.079442 1.741657 \n",
"5 527105010 1.791759 1.162278 \n",
"\n",
" SalePrice (box-cox-0.0) \n",
"Order PID \n",
"1 526301100 12.278393 \n",
"2 526350040 11.561716 \n",
"3 526351010 12.055250 \n",
"4 526353030 12.404924 \n",
"5 527105010 12.154253 "
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Correlations\n",
"\n",
"The pair-wise correlations are calculated based on the type of the variables:\n",
"- **continuous** variables are assumed to be either linearly related to the target and to each other, or not related at all: use **Pearson's correlation coefficient**\n",
"- **discrete** and **ordinal** variables (both have a low number of distinct realizations, as seen in the data cleaning notebook) are assumed to be either monotonically related to the target and to each other, or not related at all: use **Spearman's rank correlation coefficient**\n",
"\n",
"Furthermore, for a **naive feature selection**, a \"rule of thumb\" classification into *weak* and *strong* correlation is applied to the predictor variables. The identified variables will be used in the prediction modelling part to speed up the feature selection. A positive correlation above 0.33 and up to 0.66 is considered *weak*, while a correlation above 0.66 is considered *strong*. Correlations are calculated for **each** target variable (i.e., the raw \"SalePrice\" and its Box-Cox transformation)."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"strong = 0.66\n",
"weak = 0.33"
]
},
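{
"cell_type": "markdown",
"metadata": {},
"source": [
"The choice of coefficient and the rule of thumb above can be illustrated with a minimal sketch. It uses synthetic data rather than the housing data, and the `classify` helper is hypothetical (it is not part of `utils`); it only mirrors a signed comparison against the weak and strong thresholds.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Synthetic data (an assumption for illustration): y grows monotonically\n",
"# with x, but the relationship is far from linear.\n",
"x = pd.Series(np.linspace(1, 10, 50))\n",
"y = np.exp(x)\n",
"\n",
"pearson_r = x.corr(y, method=\"pearson\")\n",
"spearman_r = x.corr(y, method=\"spearman\")\n",
"\n",
"\n",
"def classify(r, weak=0.33, strong=0.66):\n",
"    \"\"\"Hypothetical helper: label a coefficient by the rule of thumb.\"\"\"\n",
"    if r > strong:\n",
"        return \"strong\"\n",
"    if r > weak:\n",
"        return \"weak\"\n",
"    return \"neither\"\n",
"\n",
"\n",
"print(pearson_r, classify(pearson_r))\n",
"print(spearman_r, classify(spearman_r))\n",
"```\n",
"\n",
"Because `y` is strictly increasing in `x`, the ranks agree perfectly and Spearman's coefficient is 1 (up to floating-point error), whereas Pearson's coefficient is lower since the relationship is not linear."
]
},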
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Two heatmaps below (implemented in the reusable `plot_correlation` function) help visualize the correlations.\n",
"\n",
"Obviously, many variables are pair-wise correlated. This could make the regression coefficients *imprecise* and hard to interpret. At the same time, it does not lower the predictive power of a model as a whole. In contrast to the pair-wise correlations, *multi-collinearity* is not checked here."
]
},
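{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pair-wise correlations cannot reveal every kind of multi-collinearity. As a minimal sketch of one common diagnostic that is not part of this notebook, the variance inflation factor (VIF) can be computed with NumPy alone. The data below is synthetic and all names are illustrative assumptions.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.RandomState(42)\n",
"# Synthetic predictors (an assumption for illustration): c is nearly a + b.\n",
"a = rng.normal(size=500)\n",
"b = rng.normal(size=500)\n",
"c = a + b + rng.normal(scale=0.1, size=500)\n",
"X = np.column_stack([a, b, c])\n",
"\n",
"# The VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 results from\n",
"# regressing predictor j on all other predictors. It equals the j-th\n",
"# diagonal element of the inverse of the predictors' correlation matrix.\n",
"corr = np.corrcoef(X, rowvar=False)\n",
"vif = np.diag(np.linalg.inv(corr))\n",
"\n",
"print(corr.round(2))\n",
"print(vif.round(1))\n",
"```\n",
"\n",
"Here `a` and `b` are nearly uncorrelated with each other, yet all three VIFs come out large because `c` is almost an exact linear combination of the other two."
]
},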
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"def plot_correlation(data, title):\n",
"    \"\"\"Visualize a correlation matrix in a nice heatmap.\"\"\"\n",
"    fig, ax = plt.subplots(figsize=(12, 12))\n",
"    ax.set_title(title, fontsize=24)\n",
"    # Blank out the upper triangular part of the matrix.\n",
"    # Use the builtin bool; the np.bool alias is not available in every NumPy version.\n",
"    mask = np.zeros_like(data, dtype=bool)\n",
"    mask[np.triu_indices_from(mask)] = True\n",
"    # Use a diverging color map.\n",
"    cmap = sns.diverging_palette(240, 0, as_cmap=True)\n",
"    # Adjust the labels' font size.\n",
"    labels = data.columns\n",
"    ax.set_xticklabels(labels, fontsize=10)\n",
"    ax.set_yticklabels(labels, fontsize=10)\n",
"    # Plot it.\n",
"    sns.heatmap(\n",
"        data, vmin=-1, vmax=1, cmap=cmap, center=0, linewidths=.5,\n",
"        cbar_kws={\"shrink\": .5}, square=True, mask=mask, ax=ax\n",
"    )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pearson\n",
"\n",
"Pearson's correlation coefficient measures the strength and direction of the linear relationship between two variables."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"columns = CONTINUOUS_VARIABLES + TARGET_VARIABLES\n",
"pearson = df[columns].corr(method=\"pearson\")"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAuIAAAKtCAYAAABi7QuGAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADx0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4wcmMyLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvMCCy2AAAIABJREFUeJzs3Xl4zOf+//HnTGSTIIktihRhYqkgqqiWKtWiumjtiUOLVqspagnVomqt0tLKsRaxL6Ek2nNq6dfS2nqopdYKTWyxSyIyWT6/P/wyR44gFWMkXo/rynVlPst93/PJXK7X3N6f+2MyDMNAREREREQeKLOjByAiIiIi8ihSEBcRERERcQAFcRERERERB1AQFxERERFxAAVxEREREREHUBAXEREREXGAAo4egIjkbXFxcTRp0iTbfSaTCRcXF7y8vKhWrRpvvPEGTZs2fcAjzPvCwsJYsWIFTz31FBEREXbtyzAM1q1bx+rVq9mzZw/nzp3D2dmZxx57jPr16xMcHEy5cuXsOgZ7CwgIAGD16tVYLJZct5eYmEhSUhIlS5a0bZs8eTLffPMNL774IpMmTcp1HyKSPymIi8h988QTT+Di4mJ7bRgGVquVuLg41q9fz/r16+nYsSNDhw514Cjlds6ePUvfvn3ZuXMnAAULFqRChQokJydz/Phxjh49yqJFixgwYACdO3d28GgfDlFRUYwZM4YRI0ZkCeIiIjmhIC4i983XX39NmTJlbtmemprKN998wz//+U8WLFjAs88+y/PPP++AEeZNJpPJ7n3ExcXRrl07zp8/T9WqVenTpw8NGza07b948SJTpkwhIiKCkSNHUqBAATp27Gj3cT3sJkyYwLlz527Z3qlTJ1q0aIGnp6cDRiUieYVqxEXE7pydnenTpw+1atUCYMGCBQ4eUd6SOdP62GOP2aX9jIwMBgwYwPnz5wkKCmL+/PlZQjiAj48PQ4YM4Z133gFg3LhxnD171i7jyQ98fHzw9/fXLLmI3JGCuIg8MI0bNwZg7969Dh5J3lKpUiXgv7XN99v333/Pb7/9RoECBRg3bhwFCxa87bHvvfcePj4+JCcns3TpUruMR0TkUaHSFBF5YDL/mz4pKemWfbGxsUyfPp3NmzcTHx+Ph4cHNWvWpEuXLtSvXz/b9mJjY5k7dy5bt27l5MmTWK1WvLy8qFGjBiEhIdSrVy/L8SEhIWzfvp2FCxeyevVqvv/+ewACAwOZNWsWZrOZbdu2MWfOHHbt2sXVq1cpXLgwVatW5Y033qBFixbZjuPf//43ixcvZu/evVy7do3ixYtTv359unfvTvny5bMcGxkZyaBBg+jUqRO9evXim2++Yf369Zw/f56iRYvy3HPP8f7771OiRAnbORUrVgS45cbCAwcOMGPGDLZv387Fixfx8PDAYrHQqlUr3njjDQoUyNk/8cuXLwfghRdeoGzZsnc81s3NjTFjxuDh4UFgYOAt+7du3UpERITt+nl5eVGnTh3eeustqlevfttrUa9ePb744gvOnDlDqVKlGD16NCdOnLjj/tq1awNw4cIFZsyYwfr16zl9+jSurq5UrVqVDh068NJLL+XoGgCkp6cTFRVFdHQ0f/zxB5cvX8bFxYWyZcvSpEkTunbtSqFChbKMPdO7774LwOjRo2nduvUdb9aMj49n1qxZbNiwgVOnTuHi4oLFYuH111+ndevWt/zdnn/+eU6ePMmvv/7Kzp07mT17NgcPHsQwDAICAggJCaFly5Y5fp8i8vBQEBeRB+avv/4CoFSpUlm2b9q0idDQUK5du4a7uzuVKlXi4sWL/Pzzz/z888988MEH9OrVK8s5mzdv5v333+f69esUKlQIPz8/UlJSiI2NZe3ataxbt47x48fz8ssv3zKOsWPHsnv3biwWC5cvX6Z48eKYzWZWr17NgAEDyMjIwNfXl8qVK3P+/Hk2b97M5s2b2b
t3LwMHDrS1k1nSsXr1atv7Klu2LDExMSxfvpyoqCjGjx9Ps2bNbhlDfHw8rVu35syZM5QuXZpy5cpx5MgRFi1axKZNm1i5ciWFCxcGbsyEHzp0KMv527dv5+2338ZqtVK0aFEqV67MlStX2LFjBzt27GDLli05Wq0jJSWFXbt2Adz2C8//atSoUbbbv/zyS6ZNmwZAsWLFqFy5MrGxsaxZs4Yff/yRjz/+mODg4FvO2717N4sXL8bLy4ty5cpx8uRJAgICOHHixB33A+zfv5/u3btz4cIFXFxcKF++PNeuXWPr1q1s3bqV1q1bM2rUqLvW2aemptKzZ082bdoEgJ+fHyVLluTMmTMcPHiQgwcPsnbtWpYtW4aLiwtFixYlKCiIffv2YbVaqVixIoULF6Zo0aJ37GfXrl28++67tpBfqVIlkpKS+M9//sN//vMfoqOjmTJlCh4eHrecGx4ezty5cylYsCDlypXj1KlT7Nq1i127dnHu3Dm6dOlyx75F5CFkiIjkQmxsrGGxWAyLxWLExsbe9rjLly8b9erVMywWizFixIgs5wcFBRkWi8X46quvjJSUFNu+tWvX2vb99NNPtu0pKSnGM888Y1gsFmPUqFFZzjl37pzRpUsXw2KxGM2bN88yhuDgYNtY//3vfxuGYRjp6enGpUuXjPT0dOPpp582LBaLER0dneW8FStWGAEBAUblypWzvMfJkycbFovFqF27trF+/Xrb9uTkZGPUqFGGxWIxqlevbhw6dMi2b/ny5bYxNGvWzNi3b59t33/+8x+jRo0ahsViMaZNm3b7i24YRuvWrQ2LxWLMnDnTSE9Pt23fvHmzUb16dcNisRg7duy4YxuGYRiHDx+2jee333676/G3s2LFCsNisRjVqlUzli5damRkZBiGYRhpaWnGtGnTjICAACMgIMDYvHmz7Zybr8UHH3xgWK1WwzAM48KFCznaf/XqVaNRo0aGxWIxPv74YyMhIcHW9s6dO22fke+++y7LWDPbvPnvMnfuXMNisRhPP/20ceDAgSzHr1mzxqhcuXK2n43GjRsbFosly9/fMAxj0qRJtnFnunTpkvHUU08ZFovFCA0NNS5dumTb9/vvv9veS79+/bLtw2KxGBMmTLB93lNSUow+ffrYPoOZ10dE8g7ViIuI3RiGwdWrV9m4cSPdunXj4sWLFCpUiLffftt2zKxZs0hMTOS1117jww8/zLL8YZMmTfjoo48A+Oabb2zb9+3bx7Vr1yhZsiQDBgzIck6xYsV4//33AYiJiSEjI+OWcdWqVYsXXngBALPZjJeXFxcuXOD8+fMUKVKE5s2bZzn+tddeo23btrRs2ZLExEQArl27xqxZswD47LPPbPXvcKN8Y9CgQTRp0oSUlBSmTJmS7fUZN24c1apVyzKuzBKD33///bbXFeDw4cMAvPnmm5jN//2nvEGDBnTr1o0WLVqQmpp6xzYArl69avvdy8vrrsffTubfJzQ0lDfffNM2A+3k5ET37t0JCQnBMAy++uqrbM/v06cPzs7OwI0bHXOyf8mSJZw+fZqnnnqKESNGZFmhpHbt2nz++ecATJs27a7XYuvWrTg5OfHBBx9QuXLlLPuaN29O3bp1Afjzzz/vfCHuYP78+Vy+fBmLxcKXX36Z5XoHBgYyZcoUTCYTq1ev5ujRo7ec/+yzz9KnTx/b593FxYUBAwYAkJCQkKuxiYhjqDRFRO6b2z3YJ5O3tzeTJk3KUpqyfv16gNvWuLZs2ZLPPvuMAwcOcO7cOYoXL05QUBC//fYb169fx8nJ6ZZz3N3dgRulIykpKbbXmWrWrJnt2AoVKsSVK1cYPHgwb731lu0mSbgRtm+2c+dOkpKS8PHxuW0dckhICOvWrWPjxo2kp6dnGWtmLfv/yqwpzwz8t+Pn58fRo0cZMGAA77//Pk888YQt/IaGht7x3JvdfG3S09NzfN7N/vzzT2JjYzGbzbRv3z7bYzp37szcuXPZs2cPFy
5cyFLC4eXldUst/c1ut3/dunUAtGjRItvSk4YNG1KkSBEuXLjA/v37s/27Z/r2229JTU3Ntp309HRbqUhycvJt27i
"text/plain": [
"<Figure size 864x864 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plot_correlation(pearson, \"Pearson's Correlation\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Predictors weakly or strongly correlated with a target variable are collected."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"pearson_weakly_correlated = set()\n",
"pearson_strongly_correlated = set()\n",
"# Iterate over the raw and transformed target.\n",
"for target in TARGET_VARIABLES:\n",
"    corrs = pearson.loc[target].drop(TARGET_VARIABLES)\n",
"    pearson_weakly_correlated |= set(corrs[(weak < corrs) & (corrs <= strong)].index)\n",
"    pearson_strongly_correlated |= set(corrs[(strong < corrs)].index)\n",
"# Show that no contradiction exists between weak and strong classification.\n",
"assert pearson_weakly_correlated & pearson_strongly_correlated == set()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Show the continuous variables that are weakly and strongly correlated with the sales price."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1st Flr SF First Floor square feet\n",
"BsmtFin SF 1 Type 1 finished square feet\n",
"Garage Area Size of garage in square feet\n",
"Mas Vnr Area Masonry veneer area in square feet\n",
"Total Bsmt SF Total square feet of basement area\n",
"Total Porch SF\n",
"Wood Deck SF Wood deck area in square feet\n"
]
}
],
"source": [
"print_column_list(pearson_weakly_correlated)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Gr Liv Area Above grade (ground) living area square feet\n"
]
}
],
"source": [
"print_column_list(pearson_strongly_correlated)"
]
},
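{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Spearman\n",
"\n",
"Spearman's rank correlation coefficient measures the strength and direction of the monotonic relationship between two variables; it is Pearson's coefficient applied to the ranks of the observations."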
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"columns = sorted(DISCRETE_VARIABLES + ORDINAL_VARIABLES) + TARGET_VARIABLES\n",
"spearman = df[columns].corr(method=\"spearman\")"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAv8AAALKCAYAAAClTD48AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADx0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4wcmMyLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvMCCy2AAAIABJREFUeJzs3XlcVNX/+PHXIJsbAqKIuLEILimaG2i5kZqYlpaABLmVn6g0K1RwKVxJ0Ewxd0kFU0ARS7QsNUktTNGPfsQlQRQEERdEURiQ+f3hb+7XCRBwL97Px4NH4733nHvunZke73vmfc5RaTQaDUIIIYQQQoh/Pb1n3QAhhBBCCCHE0yHBvxBCCCGEEFWEBP9CCCGEEEJUERL8CyGEEEIIUUVI8C+EEEIIIUQVIcG/EEIIIYQQVYT+s26AEOL5s3fvXrZu3crRo0e5cuUKhoaG1K9fny5duvDmm2/ywgsvPOsmivs4OjoCEBQUxJAhQ55Y/WUxNDSkbt26tGzZEm9vb7p16/bY21BR6enpuLq6ApCYmEjNmjUfW91ZWVlERkby22+/kZKSQn5+PnXq1KFly5YMHDiQgQMHUq1atcd2vqctNDSUxYsX069fPxYtWvRY6kxOTsbOzk5nm/bz9MMPP+Dg4PBYziOEqDgJ/oUQiqKiIvz8/NixYwcADRo0wNHRkdzcXNLT00lOTmbDhg2MHDmSSZMmPePWiqfNwcGBWrVqldiem5tLamoqmZmZ7N69mwkTJvDuu+8+gxY+OVFRUcyaNYuCggL09PSwtLSkSZMmXLx4kX379rFv3z7WrVvHkiVLaNCgwbNu7jOXnZ3N7NmzSUtLY/Pmzc+6OUKI+0jwL4RQfP311+zYsQMbGxu++uorWrVqpezLz89n3bp1LFiwgLCwMBo2bIiPj88zbK34O5VK9UTrnzp1Kl26dCl139WrV5k8eTK//vor8+fPp1evXiV6fP+pQkJCWLVqFQYGBrz33nuMGjUKc3NzZf+vv/7KnDlzOHHiBMOHD2fTpk3Url37Gbb42fvtt9/YsWMHrVu3LrFv+/btADRu3PhpN0sIgeT8CyH+v9u3b7N+/Xrg3kPA/YE/gLGxMWPGjMHX1xeA5cuXU1xc/NTbKUrS9jRbWVk9szbUrVuXkJAQTE1NKS4uJiYm5pm15XHat28fq1evplq1asybNw8/Pz+dwB+gZ8+ehIeHY2ZmRmpqKqGhoc+otf8MdnZ22NnZYWho+KybIkSVJMG/EAKA1NRUbt++jaGhIS1atCjzuKFDhwL3ftbPzMx8Ws0TD2Bvbw+Un5v/pJmYmNCuXTvgXq73P11xcTHTp09Ho9EwePBgXn311TKPtbS0ZMyYMQBER0dz586dp9VMIYSoFEn7EUIAoK9/738HarWa33//HRcXl1KPs7KyIjY2FhMTE53c5piYGAICAvDw8ODDDz8kODiY/fv3U1BQgI2NDZ6enrz11lvo6ZXsc1Cr1Xz33Xf88MMPpKSkoNFosLGx4bXXXsPb2xsjI6MSZQoKCti0aRM7d+7kzJkz3Lx5E2NjY2xsbOjfvz/e3t46PYva9r399ts4OzsTEhLCpUuXsLKyIigoiKKiIt555x1cXV0JDg5myZIl/Pjjj2RnZ1O/fn3eeOMNfH190dfXZ8eOHXz77becOXMGfX19OnfujJ+fH7a2tiXaeerUKSIiIvjzzz+5fPkyRUVF1K1blw4dOjBq1KgSaRE+Pj4cPHiQTZs2cfv2bVasWMHx48cpKCjA1taWt956i2HDhuncx+bNm3P69GnMzMyUbRqNhk2bNrFlyxZOnTqFWq3GwsKCDh06MHz4cNq2bVvWR+GRPCj16MqVK4SHh/Pbb7+RlpbGnTt3qF27Nq1atcLd3Z1+/frpHK8dgDpx4kRlEOqBAwfIycmhQYMG9OvXj/fff7/CKTZr1qwhKCgIAwMDQkND6dWr1w
OPP3jwIBcuXABg1KhR5dY/ePBgzM3N6dy5M9WrV9fZl5uby9q1a9m5cycXLlxAT08PGxsb3Nzc8Pb2xtjYWOf43r17c/HiRX7++WcWLFjAnj17MDAwoFevXgQHB5e7X+uXX35hw4YN/O9//+P27dtYWlrSs2dPxowZQ/369St03wDS0tJYt24df/zxBxcvXkStVmNqaoqTkxM+Pj44OzuXaDvAiRMncHR0xNramt27dwMPHvC7c+dOIiMjOX78OLdv36ZevXq4uLjw3nvvYWNjo3Ps/d/pjz76iMWLF7N7926uXLlC3bp16dmzJx9++GGlrlOIqkCCfyEEALa2tlhaWpKVlcWHH37I8OHDGThwYKkBbcuWLcusJzs7G3d3dy5duoSdnR3FxcWcOHGCadOmsW/fPr766ivlQQMgJyeH9957j2PHjqGnp0fjxo0xNjbm9OnTJCUlERcXx+rVq3UC25s3bzJ8+HBOnDhBtWrVaNKkCVZWVly8eJFjx45x7NgxDhw4wKpVq0q07+jRo0RGRmJqakqzZs24ePEijo6OnDhxArgXpHl4eCizlNSvX5/09HQWL17MlStXsLCwYPHixZiZmWFjY8Nff/3Frl27OHr0KHFxcTrt3LJlC1OmTOHu3bvK8bdu3SI9PZ1t27bx008/8e2339KpU6cS7dy6dSsREREYGRnRrFkzrl69SlJSEjNmzODcuXNMnTpVOdbf3x9/f3+d8l988QWRkZGoVCqaNm1KzZo1lfPu2LGDRYsW8corr5T5Pj6MnJwcEhISAErMCHXy5ElGjhzJ9evXqVGjBo0aNQLuBZXaAbOfffaZ0nt+v7/++oulS5dy+/Zt5VpSU1NZtWoVv//+O1FRUTqfqdJER0fz5ZdfYmBgwNdff11u4A/wxx9/AFCvXr0KjV8wMzPjjTfeKLE9NTWVkSNHkpGRQbVq1WjevDnFxcUkJSVx4sQJvv/+e1avXk29evVKlJ0wYQLHjx/HwcGBS5cu0bBhwwrt12g0fP7550RFRSnX0Lx5c86dO0d4eDhxcXGsWLGCNm3alHtd+/bt48MPPyQ/P5/atWvTpEkTCgoKSEtL45dffmHXrl3MmzeP1157Dbj33hsYGJCamkqNGjVo0aJFqdd2v+LiYiZOnMgPP/wA3OtkaNy4MefOnWPz5s1s27aNefPm0bdv3xJlL1++zJAhQ7h06RLW1tY0a9aMv/76i40bN/Lbb78pnRVCiP9PI4QQ/9/OnTs1jo6OGgcHB+WvZ8+emokTJ2o2b96sycrKKrPs5s2blTKdO3fWHDx4UNl34MABzYsvvqhxcHDQrFu3Tqfcf/7zH42Dg4PGw8NDc/78eWV7RkaGxsvLS+Pg4KDx9fXVKRMUFKRxcHDQ9O/fX5Oenq5sLyoq0qxZs0Zpx3//+99S2zd27FiNWq3WaDQazdWrVzUajUbzxx9/KPu7deumOX78uFJ28eLFGgcHB02LFi00jo6Omm+//VZTXFys0Wg0mnPnzmk6d+6scXBw0Kxdu1Ypk52drXFyctI4ODhoVq9erSkqKlL2XbhwQTNo0CCNg4OD5t1339W5Nm9vb6UdAQEBmps3byrX9uWXX2ocHBw0LVu2fOB7cebMGY2Dg4PG2dlZc+bMGWV7QUGBJjAwUOPg4KBxdXUts/zfadvzxx9/lHnM+fPnlferU6dOmsuXL+vsHzx4sMbBwUEzfvx45Zo0Go3m5s2bms8++0zj4OCg6dChg/K+aDQazaJFi5Rzu7u763w+7v+sxsXFKdvT0tKUMrdu3dJoNBpNXFycpkWLFppWrVppfvrppwpf99ixYzUODg6akSNHVrjM36nVak2/fv00Dg4OGm9vb01mZqayLyUlRTNw4ECNg4ODxsvLS6dcr169NA4ODpoXXnhBk5iYqNSlvXfl7Q8LC9M4ODhoXnrpJc2BAweUevPy8pTPQI8ePXTeC+39Hjt2rLKtoKBA89JLL2kcHBw0c+bM0RQUFC
j7srOzNSNGjFC+i/fTft8GDx5c4p5o35/Tp08r20JDQ5XPwO7du5Xtd+7c0cyZM0fj4OCgadOmjU6Z+7/Tffv21fz
"text/plain": [
"<Figure size 864x864 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plot_correlation(spearman, \"Spearman's Rank Correlation\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Predictors weakly or strongly correlated with a target variable are collected."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'SalePrice'"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"TARGET_VARIABLES[0]"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"spearman_weakly_correlated = set()\n",
"spearman_strongly_correlated = set()\n",
"# Iterate over the raw and transformed target.\n",
"for target in TARGET_VARIABLES:\n",
" corrs = spearman.loc[target].drop(TARGET_VARIABLES)\n",
" spearman_weakly_correlated |= set(corrs[(weak < corrs) & (corrs <= strong)].index)\n",
" spearman_strongly_correlated |= set(corrs[(strong < corrs)].index)\n",
"# Show that no contradiction exists between weak and strong classification.\n",
"assert spearman_weakly_correlated & spearman_strongly_correlated == set()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Show the discrete and ordinal variables that are weakly and strongly correlated with the sales price."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Bsmt Exposure Refers to walkout or garden level walls\n",
"BsmtFin Type 1 Rating of basement finished area\n",
"Fireplace Qu Fireplace quality\n",
"Fireplaces Number of fireplaces\n",
"Full Bath Full bathrooms above grade\n",
"Garage Cond Garage condition\n",
"Garage Finish Interior finish of the garage\n",
"Garage Qual Garage quality\n",
"Half Bath Half baths above grade\n",
"Heating QC Heating quality and condition\n",
"Paved Drive Paved driveway\n",
"TotRms AbvGrd Total rooms above grade (does not include bathrooms)\n",
"TotRms AbvGrd (box-cox-0.0)\n",
"Year Remod/Add Remodel date (same as construction date if no remodeling or additions)\n"
]
}
],
"source": [
"print_column_list(spearman_weakly_correlated)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Bsmt Qual Evaluates the height of the basement\n",
"Exter Qual Evaluates the quality of the material on the exterior\n",
"Garage Cars Size of garage in car capacity\n",
"Kitchen Qual Kitchen quality\n",
"Overall Qual Rates the overall material and finish of the house\n",
"Total Bath\n",
"Total Bath (box-cox-0.5)\n",
"Year Built Original construction date\n"
]
}
],
"source": [
"print_column_list(spearman_strongly_correlated)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Save the Results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Save the Weakly and Strongly Correlated Variables\n",
"\n",
"The subsets of variables that are weakly or strongly correlated with the house price are saved in a simple JSON file for easy re-use."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"with open(\"weakly_and_strongly_correlated_variables.json\", \"w\") as file:\n",
"    file.write(json.dumps({\n",
"        \"weakly_correlated\": sorted(\n",
"            list(pearson_weakly_correlated) + list(spearman_weakly_correlated)\n",
"        ),\n",
"        \"strongly_correlated\": sorted(\n",
"            list(pearson_strongly_correlated) + list(spearman_strongly_correlated)\n",
"        ),\n",
"    }))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Save the Data\n",
"\n",
"For convenience, sort the columns alphabetically with the targets at the end."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"df = df[sorted(ALL_VARIABLES + new_variables) + TARGET_VARIABLES]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Two new linear combinations and four Box-Cox transformations were added to the previous 78 columns, yielding 84 columns in total."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(2898, 84)"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.shape"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th>1st Flr SF</th>\n",
" <th>2nd Flr SF</th>\n",
" <th>3Ssn Porch</th>\n",
" <th>Alley</th>\n",
" <th>Bedroom AbvGr</th>\n",
" <th>Bldg Type</th>\n",
" <th>Bsmt Cond</th>\n",
" <th>Bsmt Exposure</th>\n",
" <th>Bsmt Full Bath</th>\n",
" <th>Bsmt Half Bath</th>\n",
" <th>Bsmt Qual</th>\n",
" <th>Bsmt Unf SF</th>\n",
" <th>BsmtFin SF 1</th>\n",
" <th>BsmtFin SF 2</th>\n",
" <th>BsmtFin Type 1</th>\n",
" <th>BsmtFin Type 2</th>\n",
" <th>Central Air</th>\n",
" <th>Condition 1</th>\n",
" <th>Condition 2</th>\n",
" <th>Electrical</th>\n",
" <th>Enclosed Porch</th>\n",
" <th>Exter Cond</th>\n",
" <th>Exter Qual</th>\n",
" <th>Exterior 1st</th>\n",
" <th>Exterior 2nd</th>\n",
" <th>Fence</th>\n",
" <th>Fireplace Qu</th>\n",
" <th>Fireplaces</th>\n",
" <th>Foundation</th>\n",
" <th>Full Bath</th>\n",
" <th>Functional</th>\n",
" <th>Garage Area</th>\n",
" <th>Garage Cars</th>\n",
" <th>Garage Cond</th>\n",
" <th>Garage Finish</th>\n",
" <th>Garage Qual</th>\n",
" <th>Garage Type</th>\n",
" <th>Gr Liv Area</th>\n",
" <th>Half Bath</th>\n",
" <th>Heating</th>\n",
" <th>Heating QC</th>\n",
" <th>House Style</th>\n",
" <th>Kitchen AbvGr</th>\n",
" <th>Kitchen Qual</th>\n",
" <th>Land Contour</th>\n",
" <th>Land Slope</th>\n",
" <th>Lot Area</th>\n",
" <th>Lot Config</th>\n",
" <th>Lot Shape</th>\n",
" <th>Low Qual Fin SF</th>\n",
" <th>MS SubClass</th>\n",
" <th>MS Zoning</th>\n",
" <th>Mas Vnr Area</th>\n",
" <th>Mas Vnr Type</th>\n",
" <th>Misc Feature</th>\n",
" <th>Misc Val</th>\n",
" <th>Mo Sold</th>\n",
" <th>Mo Sold (box-cox-1.0)</th>\n",
" <th>Neighborhood</th>\n",
" <th>Open Porch SF</th>\n",
" <th>Overall Cond</th>\n",
" <th>Overall Qual</th>\n",
" <th>Paved Drive</th>\n",
" <th>Pool Area</th>\n",
" <th>Pool QC</th>\n",
" <th>Roof Matl</th>\n",
" <th>Roof Style</th>\n",
" <th>Sale Condition</th>\n",
" <th>Sale Type</th>\n",
" <th>Screen Porch</th>\n",
" <th>Street</th>\n",
" <th>TotRms AbvGrd</th>\n",
" <th>TotRms AbvGrd (box-cox-0.0)</th>\n",
" <th>Total Bath</th>\n",
" <th>Total Bath (box-cox-0.5)</th>\n",
" <th>Total Bsmt SF</th>\n",
" <th>Total Porch SF</th>\n",
" <th>Utilities</th>\n",
" <th>Wood Deck SF</th>\n",
" <th>Year Built</th>\n",
" <th>Year Remod/Add</th>\n",
" <th>Yr Sold</th>\n",
" <th>SalePrice</th>\n",
" <th>SalePrice (box-cox-0.0)</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Order</th>\n",
" <th>PID</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <th>526301100</th>\n",
" <td>1656.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NA</td>\n",
" <td>3</td>\n",
" <td>1Fam</td>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>441.0</td>\n",
" <td>639.0</td>\n",
" <td>0.0</td>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>Y</td>\n",
" <td>Norm</td>\n",
" <td>Norm</td>\n",
" <td>4</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>BrkFace</td>\n",
" <td>Plywood</td>\n",
" <td>0</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>CBlock</td>\n",
" <td>1</td>\n",
" <td>7</td>\n",
" <td>528.0</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>Attchd</td>\n",
" <td>1656.0</td>\n",
" <td>0</td>\n",
" <td>GasA</td>\n",
" <td>1</td>\n",
" <td>1Story</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>Lvl</td>\n",
" <td>2</td>\n",
" <td>31770.0</td>\n",
" <td>Corner</td>\n",
" <td>2</td>\n",
" <td>0.0</td>\n",
" <td>020</td>\n",
" <td>RL</td>\n",
" <td>112.0</td>\n",
" <td>Stone</td>\n",
" <td>NA</td>\n",
" <td>0.0</td>\n",
" <td>5</td>\n",
" <td>4.0</td>\n",
" <td>Names</td>\n",
" <td>62.0</td>\n",
" <td>4</td>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" <td>0.0</td>\n",
" <td>0</td>\n",
" <td>CompShg</td>\n",
" <td>Hip</td>\n",
" <td>Normal</td>\n",
" <td>WD</td>\n",
" <td>0.0</td>\n",
" <td>Pave</td>\n",
" <td>7</td>\n",
" <td>1.945910</td>\n",
" <td>2.0</td>\n",
" <td>0.828427</td>\n",
" <td>1080.0</td>\n",
" <td>272.0</td>\n",
" <td>3</td>\n",
" <td>210.0</td>\n",
" <td>1960</td>\n",
" <td>1960</td>\n",
" <td>2010</td>\n",
" <td>215000.0</td>\n",
" <td>12.278393</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <th>526350040</th>\n",
" <td>896.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NA</td>\n",
" <td>2</td>\n",
" <td>1Fam</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>270.0</td>\n",
" <td>468.0</td>\n",
" <td>144.0</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>Y</td>\n",
" <td>Feedr</td>\n",
" <td>Norm</td>\n",
" <td>4</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>VinylSd</td>\n",
" <td>VinylSd</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>CBlock</td>\n",
" <td>1</td>\n",
" <td>7</td>\n",
" <td>730.0</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Attchd</td>\n",
" <td>896.0</td>\n",
" <td>0</td>\n",
" <td>GasA</td>\n",
" <td>2</td>\n",
" <td>1Story</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>Lvl</td>\n",
" <td>2</td>\n",
" <td>11622.0</td>\n",
" <td>Inside</td>\n",
" <td>3</td>\n",
" <td>0.0</td>\n",
" <td>020</td>\n",
" <td>RH</td>\n",
" <td>0.0</td>\n",
" <td>None</td>\n",
" <td>NA</td>\n",
" <td>0.0</td>\n",
" <td>6</td>\n",
" <td>5.0</td>\n",
" <td>Names</td>\n",
" <td>0.0</td>\n",
" <td>5</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>0.0</td>\n",
" <td>0</td>\n",
" <td>CompShg</td>\n",
" <td>Gable</td>\n",
" <td>Normal</td>\n",
" <td>WD</td>\n",
" <td>120.0</td>\n",
" <td>Pave</td>\n",
" <td>5</td>\n",
" <td>1.609438</td>\n",
" <td>1.0</td>\n",
" <td>0.000000</td>\n",
" <td>882.0</td>\n",
" <td>260.0</td>\n",
" <td>3</td>\n",
" <td>140.0</td>\n",
" <td>1961</td>\n",
" <td>1961</td>\n",
" <td>2010</td>\n",
" <td>105000.0</td>\n",
" <td>11.561716</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <th>526351010</th>\n",
" <td>1329.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NA</td>\n",
" <td>3</td>\n",
" <td>1Fam</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>406.0</td>\n",
" <td>923.0</td>\n",
" <td>0.0</td>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" <td>Y</td>\n",
" <td>Norm</td>\n",
" <td>Norm</td>\n",
" <td>4</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>Wd Sdng</td>\n",
" <td>Wd Sdng</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>CBlock</td>\n",
" <td>1</td>\n",
" <td>7</td>\n",
" <td>312.0</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Attchd</td>\n",
" <td>1329.0</td>\n",
" <td>1</td>\n",
" <td>GasA</td>\n",
" <td>2</td>\n",
" <td>1Story</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Lvl</td>\n",
" <td>2</td>\n",
" <td>14267.0</td>\n",
" <td>Corner</td>\n",
" <td>2</td>\n",
" <td>0.0</td>\n",
" <td>020</td>\n",
" <td>RL</td>\n",
" <td>108.0</td>\n",
" <td>BrkFace</td>\n",
" <td>Gar2</td>\n",
" <td>12500.0</td>\n",
" <td>6</td>\n",
" <td>5.0</td>\n",
" <td>Names</td>\n",
" <td>36.0</td>\n",
" <td>5</td>\n",
" <td>5</td>\n",
" <td>2</td>\n",
" <td>0.0</td>\n",
" <td>0</td>\n",
" <td>CompShg</td>\n",
" <td>Hip</td>\n",
" <td>Normal</td>\n",
" <td>WD</td>\n",
" <td>0.0</td>\n",
" <td>Pave</td>\n",
" <td>6</td>\n",
" <td>1.791759</td>\n",
" <td>1.5</td>\n",
" <td>0.449490</td>\n",
" <td>1329.0</td>\n",
" <td>429.0</td>\n",
" <td>3</td>\n",
" <td>393.0</td>\n",
" <td>1958</td>\n",
" <td>1958</td>\n",
" <td>2010</td>\n",
" <td>172000.0</td>\n",
" <td>12.055250</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <th>526353030</th>\n",
" <td>2110.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NA</td>\n",
" <td>3</td>\n",
" <td>1Fam</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>1045.0</td>\n",
" <td>1065.0</td>\n",
" <td>0.0</td>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" <td>Y</td>\n",
" <td>Norm</td>\n",
" <td>Norm</td>\n",
" <td>4</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>BrkFace</td>\n",
" <td>BrkFace</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>CBlock</td>\n",
" <td>2</td>\n",
" <td>7</td>\n",
" <td>522.0</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>Attchd</td>\n",
" <td>2110.0</td>\n",
" <td>1</td>\n",
" <td>GasA</td>\n",
" <td>4</td>\n",
" <td>1Story</td>\n",
" <td>1</td>\n",
" <td>4</td>\n",
" <td>Lvl</td>\n",
" <td>2</td>\n",
" <td>11160.0</td>\n",
" <td>Corner</td>\n",
" <td>3</td>\n",
" <td>0.0</td>\n",
" <td>020</td>\n",
" <td>RL</td>\n",
" <td>0.0</td>\n",
" <td>None</td>\n",
" <td>NA</td>\n",
" <td>0.0</td>\n",
" <td>4</td>\n",
" <td>3.0</td>\n",
" <td>Names</td>\n",
" <td>0.0</td>\n",
" <td>4</td>\n",
" <td>6</td>\n",
" <td>2</td>\n",
" <td>0.0</td>\n",
" <td>0</td>\n",
" <td>CompShg</td>\n",
" <td>Hip</td>\n",
" <td>Normal</td>\n",
" <td>WD</td>\n",
" <td>0.0</td>\n",
" <td>Pave</td>\n",
" <td>8</td>\n",
" <td>2.079442</td>\n",
" <td>3.5</td>\n",
" <td>1.741657</td>\n",
" <td>2110.0</td>\n",
" <td>0.0</td>\n",
" <td>3</td>\n",
" <td>0.0</td>\n",
" <td>1968</td>\n",
" <td>1968</td>\n",
" <td>2010</td>\n",
" <td>244000.0</td>\n",
" <td>12.404924</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <th>527105010</th>\n",
" <td>928.0</td>\n",
" <td>701.0</td>\n",
" <td>0.0</td>\n",
" <td>NA</td>\n",
" <td>3</td>\n",
" <td>1Fam</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>4</td>\n",
" <td>137.0</td>\n",
" <td>791.0</td>\n",
" <td>0.0</td>\n",
" <td>6</td>\n",
" <td>1</td>\n",
" <td>Y</td>\n",
" <td>Norm</td>\n",
" <td>Norm</td>\n",
" <td>4</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>VinylSd</td>\n",
" <td>VinylSd</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>PConc</td>\n",
" <td>2</td>\n",
" <td>7</td>\n",
" <td>482.0</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>Attchd</td>\n",
" <td>1629.0</td>\n",
" <td>1</td>\n",
" <td>GasA</td>\n",
" <td>3</td>\n",
" <td>2Story</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>Lvl</td>\n",
" <td>2</td>\n",
" <td>13830.0</td>\n",
" <td>Inside</td>\n",
" <td>2</td>\n",
" <td>0.0</td>\n",
" <td>060</td>\n",
" <td>RL</td>\n",
" <td>0.0</td>\n",
" <td>None</td>\n",
" <td>NA</td>\n",
" <td>0.0</td>\n",
" <td>3</td>\n",
" <td>2.0</td>\n",
" <td>Gilbert</td>\n",
" <td>34.0</td>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>0.0</td>\n",
" <td>0</td>\n",
" <td>CompShg</td>\n",
" <td>Gable</td>\n",
" <td>Normal</td>\n",
" <td>WD</td>\n",
" <td>0.0</td>\n",
" <td>Pave</td>\n",
" <td>6</td>\n",
" <td>1.791759</td>\n",
" <td>2.5</td>\n",
" <td>1.162278</td>\n",
" <td>928.0</td>\n",
" <td>246.0</td>\n",
" <td>3</td>\n",
" <td>212.0</td>\n",
" <td>1997</td>\n",
" <td>1998</td>\n",
" <td>2010</td>\n",
" <td>189900.0</td>\n",
" <td>12.154253</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 1st Flr SF 2nd Flr SF 3Ssn Porch Alley Bedroom AbvGr \\\n",
"Order PID \n",
"1 526301100 1656.0 0.0 0.0 NA 3 \n",
"2 526350040 896.0 0.0 0.0 NA 2 \n",
"3 526351010 1329.0 0.0 0.0 NA 3 \n",
"4 526353030 2110.0 0.0 0.0 NA 3 \n",
"5 527105010 928.0 701.0 0.0 NA 3 \n",
"\n",
" Bldg Type Bsmt Cond Bsmt Exposure Bsmt Full Bath \\\n",
"Order PID \n",
"1 526301100 1Fam 4 4 1 \n",
"2 526350040 1Fam 3 1 0 \n",
"3 526351010 1Fam 3 1 0 \n",
"4 526353030 1Fam 3 1 1 \n",
"5 527105010 1Fam 3 1 0 \n",
"\n",
" Bsmt Half Bath Bsmt Qual Bsmt Unf SF BsmtFin SF 1 \\\n",
"Order PID \n",
"1 526301100 0 3 441.0 639.0 \n",
"2 526350040 0 3 270.0 468.0 \n",
"3 526351010 0 3 406.0 923.0 \n",
"4 526353030 0 3 1045.0 1065.0 \n",
"5 527105010 0 4 137.0 791.0 \n",
"\n",
" BsmtFin SF 2 BsmtFin Type 1 BsmtFin Type 2 Central Air \\\n",
"Order PID \n",
"1 526301100 0.0 4 1 Y \n",
"2 526350040 144.0 3 2 Y \n",
"3 526351010 0.0 5 1 Y \n",
"4 526353030 0.0 5 1 Y \n",
"5 527105010 0.0 6 1 Y \n",
"\n",
" Condition 1 Condition 2 Electrical Enclosed Porch \\\n",
"Order PID \n",
"1 526301100 Norm Norm 4 0.0 \n",
"2 526350040 Feedr Norm 4 0.0 \n",
"3 526351010 Norm Norm 4 0.0 \n",
"4 526353030 Norm Norm 4 0.0 \n",
"5 527105010 Norm Norm 4 0.0 \n",
"\n",
" Exter Cond Exter Qual Exterior 1st Exterior 2nd Fence \\\n",
"Order PID \n",
"1 526301100 2 2 BrkFace Plywood 0 \n",
"2 526350040 2 2 VinylSd VinylSd 3 \n",
"3 526351010 2 2 Wd Sdng Wd Sdng 0 \n",
"4 526353030 2 3 BrkFace BrkFace 0 \n",
"5 527105010 2 2 VinylSd VinylSd 3 \n",
"\n",
" Fireplace Qu Fireplaces Foundation Full Bath Functional \\\n",
"Order PID \n",
"1 526301100 4 2 CBlock 1 7 \n",
"2 526350040 0 0 CBlock 1 7 \n",
"3 526351010 0 0 CBlock 1 7 \n",
"4 526353030 3 2 CBlock 2 7 \n",
"5 527105010 3 1 PConc 2 7 \n",
"\n",
" Garage Area Garage Cars Garage Cond Garage Finish \\\n",
"Order PID \n",
"1 526301100 528.0 2 3 3 \n",
"2 526350040 730.0 1 3 1 \n",
"3 526351010 312.0 1 3 1 \n",
"4 526353030 522.0 2 3 3 \n",
"5 527105010 482.0 2 3 3 \n",
"\n",
" Garage Qual Garage Type Gr Liv Area Half Bath Heating \\\n",
"Order PID \n",
"1 526301100 3 Attchd 1656.0 0 GasA \n",
"2 526350040 3 Attchd 896.0 0 GasA \n",
"3 526351010 3 Attchd 1329.0 1 GasA \n",
"4 526353030 3 Attchd 2110.0 1 GasA \n",
"5 527105010 3 Attchd 1629.0 1 GasA \n",
"\n",
" Heating QC House Style Kitchen AbvGr Kitchen Qual \\\n",
"Order PID \n",
"1 526301100 1 1Story 1 2 \n",
"2 526350040 2 1Story 1 2 \n",
"3 526351010 2 1Story 1 3 \n",
"4 526353030 4 1Story 1 4 \n",
"5 527105010 3 2Story 1 2 \n",
"\n",
" Land Contour Land Slope Lot Area Lot Config Lot Shape \\\n",
"Order PID \n",
"1 526301100 Lvl 2 31770.0 Corner 2 \n",
"2 526350040 Lvl 2 11622.0 Inside 3 \n",
"3 526351010 Lvl 2 14267.0 Corner 2 \n",
"4 526353030 Lvl 2 11160.0 Corner 3 \n",
"5 527105010 Lvl 2 13830.0 Inside 2 \n",
"\n",
" Low Qual Fin SF MS SubClass MS Zoning Mas Vnr Area \\\n",
"Order PID \n",
"1 526301100 0.0 020 RL 112.0 \n",
"2 526350040 0.0 020 RH 0.0 \n",
"3 526351010 0.0 020 RL 108.0 \n",
"4 526353030 0.0 020 RL 0.0 \n",
"5 527105010 0.0 060 RL 0.0 \n",
"\n",
" Mas Vnr Type Misc Feature Misc Val Mo Sold \\\n",
"Order PID \n",
"1 526301100 Stone NA 0.0 5 \n",
"2 526350040 None NA 0.0 6 \n",
"3 526351010 BrkFace Gar2 12500.0 6 \n",
"4 526353030 None NA 0.0 4 \n",
"5 527105010 None NA 0.0 3 \n",
"\n",
" Mo Sold (box-cox-1.0) Neighborhood Open Porch SF \\\n",
"Order PID \n",
"1 526301100 4.0 Names 62.0 \n",
"2 526350040 5.0 Names 0.0 \n",
"3 526351010 5.0 Names 36.0 \n",
"4 526353030 3.0 Names 0.0 \n",
"5 527105010 2.0 Gilbert 34.0 \n",
"\n",
" Overall Cond Overall Qual Paved Drive Pool Area Pool QC \\\n",
"Order PID \n",
"1 526301100 4 5 1 0.0 0 \n",
"2 526350040 5 4 2 0.0 0 \n",
"3 526351010 5 5 2 0.0 0 \n",
"4 526353030 4 6 2 0.0 0 \n",
"5 527105010 4 4 2 0.0 0 \n",
"\n",
" Roof Matl Roof Style Sale Condition Sale Type Screen Porch \\\n",
"Order PID \n",
"1 526301100 CompShg Hip Normal WD 0.0 \n",
"2 526350040 CompShg Gable Normal WD 120.0 \n",
"3 526351010 CompShg Hip Normal WD 0.0 \n",
"4 526353030 CompShg Hip Normal WD 0.0 \n",
"5 527105010 CompShg Gable Normal WD 0.0 \n",
"\n",
" Street TotRms AbvGrd TotRms AbvGrd (box-cox-0.0) \\\n",
"Order PID \n",
"1 526301100 Pave 7 1.945910 \n",
"2 526350040 Pave 5 1.609438 \n",
"3 526351010 Pave 6 1.791759 \n",
"4 526353030 Pave 8 2.079442 \n",
"5 527105010 Pave 6 1.791759 \n",
"\n",
" Total Bath Total Bath (box-cox-0.5) Total Bsmt SF \\\n",
"Order PID \n",
"1 526301100 2.0 0.828427 1080.0 \n",
"2 526350040 1.0 0.000000 882.0 \n",
"3 526351010 1.5 0.449490 1329.0 \n",
"4 526353030 3.5 1.741657 2110.0 \n",
"5 527105010 2.5 1.162278 928.0 \n",
"\n",
" Total Porch SF Utilities Wood Deck SF Year Built \\\n",
"Order PID \n",
"1 526301100 272.0 3 210.0 1960 \n",
"2 526350040 260.0 3 140.0 1961 \n",
"3 526351010 429.0 3 393.0 1958 \n",
"4 526353030 0.0 3 0.0 1968 \n",
"5 527105010 246.0 3 212.0 1997 \n",
"\n",
" Year Remod/Add Yr Sold SalePrice SalePrice (box-cox-0.0) \n",
"Order PID \n",
"1 526301100 1960 2010 215000.0 12.278393 \n",
"2 526350040 1961 2010 105000.0 11.561716 \n",
"3 526351010 1958 2010 172000.0 12.055250 \n",
"4 526353030 1968 2010 244000.0 12.404924 \n",
"5 527105010 1998 2010 189900.0 12.154253 "
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"df.to_csv(\"data_clean_with_transformations.csv\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}