{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pair-wise Correlations\n",
"\n",
"The purpose is to identify predictor variables that are strongly correlated with the sales price and with each other, to get an idea of which variables could be good predictors and where collinearity could become an issue.\n",
"\n",
"Furthermore, Box-Cox transformations and linear combinations of variables are added where applicable or useful."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## \"Housekeeping\""
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2018-09-03 18:03:41 CEST\n",
"\n",
"CPython 3.6.5\n",
"IPython 6.5.0\n",
"\n",
"matplotlib 3.0.0rc2\n",
"numpy 1.15.1\n",
"pandas 0.23.4\n",
"seaborn 0.9.0\n",
"sklearn 0.20rc1\n"
]
}
],
"source": [
"%load_ext watermark\n",
"%watermark -d -t -v -z -p matplotlib,numpy,pandas,seaborn,sklearn"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"import json\n",
"\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"\n",
"from sklearn.preprocessing import PowerTransformer\n",
"from tabulate import tabulate\n",
"\n",
"from utils import (\n",
"    ALL_VARIABLES,\n",
"    CONTINUOUS_VARIABLES,\n",
"    DISCRETE_VARIABLES,\n",
"    NUMERIC_VARIABLES,\n",
"    ORDINAL_VARIABLES,\n",
"    TARGET_VARIABLES,\n",
"    encode_ordinals,\n",
"    load_clean_data,\n",
"    print_column_list,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"pd.set_option(\"display.max_columns\", 100)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"sns.set_style(\"white\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the Data\n",
"\n",
"Only a subset of the previously cleaned data is used in this analysis. In particular, it does not make sense to calculate correlations involving nominal variables.\n",
"\n",
"Furthermore, ordinal variables are encoded as integers (with greater values indicating a higher sales price by \"gut feeling\"; refer to the [data documentation](https://www.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt) to see the un-encoded values) and take part in the analysis.\n",
"\n",
"A `cleaned_df` DataFrame with the original data from the previous notebook is kept so that the ordinal labels can be restored at the end of this notebook for correct storage."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"cleaned_df = load_clean_data()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"df = cleaned_df[NUMERIC_VARIABLES + ORDINAL_VARIABLES + TARGET_VARIABLES]\n",
"df = encode_ordinals(df)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th>1st Flr SF</th>\n",
" <th>2nd Flr SF</th>\n",
" <th>3Ssn Porch</th>\n",
" <th>Bedroom AbvGr</th>\n",
" <th>Bsmt Full Bath</th>\n",
" <th>Bsmt Half Bath</th>\n",
" <th>Bsmt Unf SF</th>\n",
" <th>BsmtFin SF 1</th>\n",
" <th>BsmtFin SF 2</th>\n",
" <th>Enclosed Porch</th>\n",
" <th>Fireplaces</th>\n",
" <th>Full Bath</th>\n",
" <th>Garage Area</th>\n",
" <th>Garage Cars</th>\n",
" <th>Gr Liv Area</th>\n",
" <th>Half Bath</th>\n",
" <th>Kitchen AbvGr</th>\n",
" <th>Lot Area</th>\n",
" <th>Low Qual Fin SF</th>\n",
" <th>Mas Vnr Area</th>\n",
" <th>Misc Val</th>\n",
" <th>Mo Sold</th>\n",
" <th>Open Porch SF</th>\n",
" <th>Pool Area</th>\n",
" <th>Screen Porch</th>\n",
" <th>TotRms AbvGrd</th>\n",
" <th>Total Bsmt SF</th>\n",
" <th>Wood Deck SF</th>\n",
" <th>Year Built</th>\n",
" <th>Year Remod/Add</th>\n",
" <th>Yr Sold</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Order</th>\n",
" <th>PID</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <th>526301100</th>\n",
" <td>1656.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>441.0</td>\n",
" <td>639.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>528.0</td>\n",
" <td>2</td>\n",
" <td>1656.0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>31770.0</td>\n",
" <td>0.0</td>\n",
" <td>112.0</td>\n",
" <td>0.0</td>\n",
" <td>5</td>\n",
" <td>62.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>7</td>\n",
" <td>1080.0</td>\n",
" <td>210.0</td>\n",
" <td>1960</td>\n",
" <td>1960</td>\n",
" <td>2010</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <th>526350040</th>\n",
" <td>896.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>270.0</td>\n",
" <td>468.0</td>\n",
" <td>144.0</td>\n",
" <td>0.0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>730.0</td>\n",
" <td>1</td>\n",
" <td>896.0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>11622.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>6</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>120.0</td>\n",
" <td>5</td>\n",
" <td>882.0</td>\n",
" <td>140.0</td>\n",
" <td>1961</td>\n",
" <td>1961</td>\n",
" <td>2010</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <th>526351010</th>\n",
" <td>1329.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>406.0</td>\n",
" <td>923.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>312.0</td>\n",
" <td>1</td>\n",
" <td>1329.0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>14267.0</td>\n",
" <td>0.0</td>\n",
" <td>108.0</td>\n",
" <td>12500.0</td>\n",
" <td>6</td>\n",
" <td>36.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>6</td>\n",
" <td>1329.0</td>\n",
" <td>393.0</td>\n",
" <td>1958</td>\n",
" <td>1958</td>\n",
" <td>2010</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <th>526353030</th>\n",
" <td>2110.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1045.0</td>\n",
" <td>1065.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>522.0</td>\n",
" <td>2</td>\n",
" <td>2110.0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>11160.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>4</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>8</td>\n",
" <td>2110.0</td>\n",
" <td>0.0</td>\n",
" <td>1968</td>\n",
" <td>1968</td>\n",
" <td>2010</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <th>527105010</th>\n",
" <td>928.0</td>\n",
" <td>701.0</td>\n",
" <td>0.0</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>137.0</td>\n",
" <td>791.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>482.0</td>\n",
" <td>2</td>\n",
" <td>1629.0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>13830.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>3</td>\n",
" <td>34.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>6</td>\n",
" <td>928.0</td>\n",
" <td>212.0</td>\n",
" <td>1997</td>\n",
" <td>1998</td>\n",
" <td>2010</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 1st Flr SF 2nd Flr SF 3Ssn Porch Bedroom AbvGr \\\n",
"Order PID \n",
"1 526301100 1656.0 0.0 0.0 3 \n",
"2 526350040 896.0 0.0 0.0 2 \n",
"3 526351010 1329.0 0.0 0.0 3 \n",
"4 526353030 2110.0 0.0 0.0 3 \n",
"5 527105010 928.0 701.0 0.0 3 \n",
"\n",
" Bsmt Full Bath Bsmt Half Bath Bsmt Unf SF BsmtFin SF 1 \\\n",
"Order PID \n",
"1 526301100 1 0 441.0 639.0 \n",
"2 526350040 0 0 270.0 468.0 \n",
"3 526351010 0 0 406.0 923.0 \n",
"4 526353030 1 0 1045.0 1065.0 \n",
"5 527105010 0 0 137.0 791.0 \n",
"\n",
" BsmtFin SF 2 Enclosed Porch Fireplaces Full Bath \\\n",
"Order PID \n",
"1 526301100 0.0 0.0 2 1 \n",
"2 526350040 144.0 0.0 0 1 \n",
"3 526351010 0.0 0.0 0 1 \n",
"4 526353030 0.0 0.0 2 2 \n",
"5 527105010 0.0 0.0 1 2 \n",
"\n",
" Garage Area Garage Cars Gr Liv Area Half Bath \\\n",
"Order PID \n",
"1 526301100 528.0 2 1656.0 0 \n",
"2 526350040 730.0 1 896.0 0 \n",
"3 526351010 312.0 1 1329.0 1 \n",
"4 526353030 522.0 2 2110.0 1 \n",
"5 527105010 482.0 2 1629.0 1 \n",
"\n",
" Kitchen AbvGr Lot Area Low Qual Fin SF Mas Vnr Area \\\n",
"Order PID \n",
"1 526301100 1 31770.0 0.0 112.0 \n",
"2 526350040 1 11622.0 0.0 0.0 \n",
"3 526351010 1 14267.0 0.0 108.0 \n",
"4 526353030 1 11160.0 0.0 0.0 \n",
"5 527105010 1 13830.0 0.0 0.0 \n",
"\n",
" Misc Val Mo Sold Open Porch SF Pool Area Screen Porch \\\n",
"Order PID \n",
"1 526301100 0.0 5 62.0 0.0 0.0 \n",
"2 526350040 0.0 6 0.0 0.0 120.0 \n",
"3 526351010 12500.0 6 36.0 0.0 0.0 \n",
"4 526353030 0.0 4 0.0 0.0 0.0 \n",
"5 527105010 0.0 3 34.0 0.0 0.0 \n",
"\n",
" TotRms AbvGrd Total Bsmt SF Wood Deck SF Year Built \\\n",
"Order PID \n",
"1 526301100 7 1080.0 210.0 1960 \n",
"2 526350040 5 882.0 140.0 1961 \n",
"3 526351010 6 1329.0 393.0 1958 \n",
"4 526353030 8 2110.0 0.0 1968 \n",
"5 527105010 6 928.0 212.0 1997 \n",
"\n",
" Year Remod/Add Yr Sold \n",
"Order PID \n",
"1 526301100 1960 2010 \n",
"2 526350040 1961 2010 \n",
"3 526351010 1958 2010 \n",
"4 526353030 1968 2010 \n",
"5 527105010 1998 2010 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[NUMERIC_VARIABLES].head()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th>Bsmt Cond</th>\n",
" <th>Bsmt Exposure</th>\n",
" <th>Bsmt Qual</th>\n",
" <th>BsmtFin Type 1</th>\n",
" <th>BsmtFin Type 2</th>\n",
" <th>Electrical</th>\n",
" <th>Exter Cond</th>\n",
" <th>Exter Qual</th>\n",
" <th>Fence</th>\n",
" <th>Fireplace Qu</th>\n",
" <th>Functional</th>\n",
" <th>Garage Cond</th>\n",
" <th>Garage Finish</th>\n",
" <th>Garage Qual</th>\n",
" <th>Heating QC</th>\n",
" <th>Kitchen Qual</th>\n",
" <th>Land Slope</th>\n",
" <th>Lot Shape</th>\n",
" <th>Overall Cond</th>\n",
" <th>Overall Qual</th>\n",
" <th>Paved Drive</th>\n",
" <th>Pool QC</th>\n",
" <th>Utilities</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Order</th>\n",
" <th>PID</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <th>526301100</th>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>4</td>\n",
" <td>7</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>4</td>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <th>526350040</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>7</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>5</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <th>526351010</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>7</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>5</td>\n",
" <td>5</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <th>526353030</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>7</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" <td>6</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <th>527105010</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>4</td>\n",
" <td>6</td>\n",
" <td>1</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>7</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Bsmt Cond Bsmt Exposure Bsmt Qual BsmtFin Type 1 \\\n",
"Order PID \n",
"1 526301100 4 4 3 4 \n",
"2 526350040 3 1 3 3 \n",
"3 526351010 3 1 3 5 \n",
"4 526353030 3 1 3 5 \n",
"5 527105010 3 1 4 6 \n",
"\n",
" BsmtFin Type 2 Electrical Exter Cond Exter Qual Fence \\\n",
"Order PID \n",
"1 526301100 1 4 2 2 0 \n",
"2 526350040 2 4 2 2 3 \n",
"3 526351010 1 4 2 2 0 \n",
"4 526353030 1 4 2 3 0 \n",
"5 527105010 1 4 2 2 3 \n",
"\n",
" Fireplace Qu Functional Garage Cond Garage Finish \\\n",
"Order PID \n",
"1 526301100 4 7 3 3 \n",
"2 526350040 0 7 3 1 \n",
"3 526351010 0 7 3 1 \n",
"4 526353030 3 7 3 3 \n",
"5 527105010 3 7 3 3 \n",
"\n",
" Garage Qual Heating QC Kitchen Qual Land Slope Lot Shape \\\n",
"Order PID \n",
"1 526301100 3 1 2 2 2 \n",
"2 526350040 3 2 2 2 3 \n",
"3 526351010 3 2 3 2 2 \n",
"4 526353030 3 4 4 2 3 \n",
"5 527105010 3 3 2 2 2 \n",
"\n",
" Overall Cond Overall Qual Paved Drive Pool QC Utilities \n",
"Order PID \n",
"1 526301100 4 5 1 0 3 \n",
"2 526350040 5 4 2 0 3 \n",
"3 526351010 5 5 2 0 3 \n",
"4 526353030 4 6 2 0 3 \n",
"5 527105010 4 4 2 0 3 "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[ORDINAL_VARIABLES].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Linearly \"dependent\" Features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The \"above grade (ground) living area\" (= *Gr Liv Area*) can be split into the 1st and 2nd floor living areas plus a remainder of low-quality finished area (*Low Qual Fin SF*)."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"assert not (\n",
"    df[\"Gr Liv Area\"]\n",
"    != (df[\"1st Flr SF\"] + df[\"2nd Flr SF\"] + df[\"Low Qual Fin SF\"])\n",
").any()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The various basement areas also add up."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"assert not (\n",
"    df[\"Total Bsmt SF\"]\n",
"    != (df[\"BsmtFin SF 1\"] + df[\"BsmtFin SF 2\"] + df[\"Bsmt Unf SF\"])\n",
").any()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The different porch areas are unified into a new variable *Total Porch SF*. This potentially makes the presence of a porch in general relevant for the prediction."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"df[\"Total Porch SF\"] = (\n",
"    df[\"3Ssn Porch\"] + df[\"Enclosed Porch\"] + df[\"Open Porch SF\"]\n",
"    + df[\"Screen Porch\"] + df[\"Wood Deck SF\"]\n",
")\n",
"\n",
"new_variables = [\"Total Porch SF\"]\n",
"CONTINUOUS_VARIABLES.append(\"Total Porch SF\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The various types of rooms \"above grade\" (i.e., *TotRms AbvGrd*, *Bedroom AbvGr*, *Kitchen AbvGr*, and *Full Bath*) do not add up: they do so in only 29% of the cases. Therefore, no single unified variable can be used as a predictor."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"29.0"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"round(\n",
"    100\n",
"    * (\n",
"        df[\"TotRms AbvGrd\"]\n",
"        == (df[\"Bedroom AbvGr\"] + df[\"Kitchen AbvGr\"] + df[\"Full Bath\"])\n",
"    ).sum()\n",
"    / df.shape[0]\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Unify the number of various types of bathrooms into a single variable. Note that each \"half\" bathroom is counted as 0.5."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"df[\"Total Bath\"] = (\n",
"    df[\"Full Bath\"] + 0.5 * df[\"Half Bath\"]\n",
"    + df[\"Bsmt Full Bath\"] + 0.5 * df[\"Bsmt Half Bath\"]\n",
")\n",
"\n",
"new_variables.append(\"Total Bath\")\n",
"DISCRETE_VARIABLES.append(\"Total Bath\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Box-Cox Transformations\n",
"\n",
"Only numeric columns with strictly positive values are eligible for a Box-Cox transformation."
]
},
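{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the one-parameter Box-Cox transformation of a value $x$ is usually written as follows (the symbols $x$ and $\\lambda$ are notation chosen here for illustration, not names used in the code):\n",
"\n",
"$$\n",
"x^{(\\lambda)} =\n",
"\\begin{cases}\n",
"\\dfrac{x^{\\lambda} - 1}{\\lambda} & \\text{if } \\lambda \\neq 0 \\\\\n",
"\\ln x & \\text{if } \\lambda = 0\n",
"\\end{cases}\n",
"$$\n",
"\n",
"The logarithm and negative powers are undefined at zero, and fractional powers are undefined for negative values, so the transformation requires $x > 0$. This is why a column with a minimum of zero cannot be transformed this way."
]
},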
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1st Flr SF First Floor square feet\n",
"Gr Liv Area Above grade (ground) living area square feet\n",
"Lot Area Lot size in square feet\n",
"SalePrice\n"
]
}
],
"source": [
"columns = CONTINUOUS_VARIABLES + TARGET_VARIABLES\n",
"transforms = df[columns].describe().T\n",
"transforms = list(transforms[transforms['min'] > 0].index)\n",
"print_column_list(transforms)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A common convention is to use Box-Cox transformations only if the lambda value found (via maximum likelihood estimation) lies in the range from -3 to +3. Also, the lambda is rounded to the nearest \"half\" integer.\n",
"\n",
"Consequently, the only applicable transformation is for \"SalePrice\"."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1st Flr SF: exact lambda of -8.398 not in realistic range\n",
"Gr Liv Area: exact lambda of -8.398 not in realistic range\n",
"Lot Area: exact lambda of -8.398 not in realistic range\n",
"SalePrice: use rounded lambda of 0.0 (exact is 0.004)\n"
]
}
],
"source": [
"# Check the Box-Cox transformations for each column separately\n",
"# to decide if the optimal lambda value is in an acceptable range.\n",
"output = []\n",
"transformed_columns = []\n",
"for column in transforms:\n",
"    X = df[[column]]  # 2D array needed!\n",
"    pt = PowerTransformer(method=\"box-cox\", standardize=False)\n",
"    # Suppress a weird but harmless warning from scipy\n",
"    with warnings.catch_warnings():\n",
"        warnings.simplefilter(\"ignore\")\n",
"        pt.fit(X)\n",
"    # Check if the optimal lambda is ok.\n",
"    exact_lambda = pt.lambdas_[0]\n",
"    used_lambda = 0.5 * np.round(2.0 * exact_lambda)\n",
"    if -3 <= exact_lambda <= 3:\n",
"        new_column = f\"{column} (box-cox-{used_lambda})\"\n",
"        df[new_column] = (\n",
"            np.log(X) if used_lambda == 0 else (((X ** used_lambda) - 1) / used_lambda)\n",
"        )\n",
"        # Show that only SalePrice has a useful transformation.\n",
"        assert column in TARGET_VARIABLES\n",
"        # Track the new column in the appropriate list.\n",
"        new_variables.append(new_column)\n",
"        TARGET_VARIABLES.append(new_column)\n",
"        # To show only the transformed columns below.\n",
"        transformed_columns.append(column)\n",
"        transformed_columns.append(new_column)\n",
"        output.append((\n",
"            f\"{column}:\",\n",
"            f\"use rounded lambda of {used_lambda} (exact is {exact_lambda:.3f})\",\n",
"        ))\n",
"    else:\n",
"        output.append((\n",
"            f\"{column}:\",\n",
"            f\"exact lambda of {exact_lambda:.3f} not in realistic range\",\n",
"        ))\n",
"print(tabulate(sorted(output), tablefmt=\"plain\"))"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th>SalePrice</th>\n",
" <th>SalePrice (box-cox-0.0)</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Order</th>\n",
" <th>PID</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <th>526301100</th>\n",
" <td>215000.0</td>\n",
" <td>12.278393</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <th>526350040</th>\n",
" <td>105000.0</td>\n",
" <td>11.561716</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <th>526351010</th>\n",
" <td>172000.0</td>\n",
" <td>12.055250</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <th>526353030</th>\n",
" <td>244000.0</td>\n",
" <td>12.404924</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <th>527105010</th>\n",
" <td>189900.0</td>\n",
" <td>12.154253</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" SalePrice SalePrice (box-cox-0.0)\n",
"Order PID \n",
"1 526301100 215000.0 12.278393\n",
"2 526350040 105000.0 11.561716\n",
"3 526351010 172000.0 12.055250\n",
"4 526353030 244000.0 12.404924\n",
"5 527105010 189900.0 12.154253"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[transformed_columns].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Correlations\n",
"\n",
"The pair-wise correlations are calculated based on the type of the variables:\n",
"- **continuous** variables are assumed to be either linearly related with the target and each other or not related at all: use **Pearson's correlation coefficient**\n",
"- **discrete** (because of the low number of distinct realizations as seen in the data cleaning notebook) and **ordinal** (low number of distinct realizations as well) variables are assumed to be either monotonically related with the target and each other or not related at all: use **Spearman's rank correlation coefficient**\n",
"\n",
"Furthermore, for a **naive feature selection** a \"rule of thumb\" classification into *weak* and *strong* correlation is applied to the predictor variables. The identified variables will be used in the prediction modelling part to speed up the feature selection. A correlation above 0.33 and up to 0.66 is considered *weak* while a correlation above 0.66 is considered *strong*. Correlations are calculated for **each** target variable (i.e., raw \"SalePrice\" and the Box-Cox transformation thereof)."
]
},
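{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of what the rank correlation measures: Spearman's coefficient is Pearson's coefficient computed on ranks instead of raw values. With $R(x)$ and $R(y)$ denoting the rank-transformed variables (notation assumed here for illustration), it can be written as\n",
"\n",
"$$\n",
"r_s = \\frac{\\operatorname{cov}\\big(R(x), R(y)\\big)}{\\sigma_{R(x)} \\, \\sigma_{R(y)}}\n",
"$$\n",
"\n",
"so that, in the absence of ties, any strictly monotonic relationship yields $r_s = \\pm 1$, whether it is linear or not."
]
},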
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"strong = 0.66\n",
"weak = 0.33"
]
},
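{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch (not part of the analysis itself), the rule of thumb can be written as a small helper. The name `classify_correlation` and the label `\"negligible\"` are made up for illustration. The comparison on signed coefficients mirrors how the thresholds are applied to the correlation matrices in this notebook, which means that negative correlations are never classified as weak or strong.\n",
"\n",
"```python\n",
"def classify_correlation(r, weak=0.33, strong=0.66):\n",
"    \"\"\"Classify a signed correlation coefficient (hypothetical helper).\"\"\"\n",
"    if r > strong:\n",
"        return \"strong\"\n",
"    if r > weak:\n",
"        return \"weak\"\n",
"    # Everything else, including all negative values.\n",
"    return \"negligible\"\n",
"\n",
"\n",
"labels = [classify_correlation(r) for r in (0.80, 0.50, 0.10, -0.70)]\n",
"```\n",
"\n",
"Applying the helper to `abs(r)` instead would also pick up negatively correlated predictors."
]
},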
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Two heatmaps below (implemented in the reusable `plot_correlation` function) help visualize the correlations.\n",
"\n",
"Obviously, many variables are pair-wise correlated. This could make regression coefficients *imprecise* and hard to use or interpret. At the same time, this does not lower the predictive power of a model as a whole. In contrast to the pair-wise correlations, *multi-collinearity* is not checked here."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"def plot_correlation(data, title):\n",
"    \"\"\"Visualize a correlation matrix in a nice heatmap.\"\"\"\n",
"    fig, ax = plt.subplots(figsize=(12, 12))\n",
"    ax.set_title(title, fontsize=24)\n",
"    # Blank out the upper triangular part of the matrix.\n",
"    # (The builtin bool replaces the deprecated np.bool alias.)\n",
"    mask = np.zeros_like(data, dtype=bool)\n",
"    mask[np.triu_indices_from(mask)] = True\n",
"    # Use a diverging color map.\n",
"    cmap = sns.diverging_palette(240, 0, as_cmap=True)\n",
"    # Adjust the labels' font size.\n",
"    labels = data.columns\n",
"    ax.set_xticklabels(labels, fontsize=10)\n",
"    ax.set_yticklabels(labels, fontsize=10)\n",
"    # Plot it.\n",
"    sns.heatmap(\n",
"        data, vmin=-1, vmax=1, cmap=cmap, center=0, linewidths=.5,\n",
"        cbar_kws={\"shrink\": .5}, square=True, mask=mask, ax=ax\n",
"    )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pearson\n",
"\n",
"Pearson's correlation coefficient measures the strength and direction of a linear relationship between two variables."
]
},
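{
"cell_type": "markdown",
"metadata": {},
"source": [
"For two variables with observations $(x_i, y_i)$, $i = 1, \\dots, n$, and sample means $\\bar{x}$ and $\\bar{y}$ (notation assumed here for illustration), the coefficient is commonly defined as\n",
"\n",
"$$\n",
"r_{xy} = \\frac{\\sum_{i=1}^{n} (x_i - \\bar{x})(y_i - \\bar{y})}{\\sqrt{\\sum_{i=1}^{n} (x_i - \\bar{x})^2} \\, \\sqrt{\\sum_{i=1}^{n} (y_i - \\bar{y})^2}}\n",
"$$\n",
"\n",
"It ranges from $-1$ (perfect negative linear relationship) to $+1$ (perfect positive linear relationship); a value near $0$ indicates no linear relationship."
]
},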
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"columns = CONTINUOUS_VARIABLES + TARGET_VARIABLES\n",
"pearson = df[columns].corr(method=\"pearson\")"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 864x864 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plot_correlation(pearson, \"Pearson's Correlation\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Predictors weakly or strongly correlated with a target variable are collected."
]
},
{
"cell_type": "code",
2018-09-03 18:04:40 +02:00
"execution_count": 22,
2018-09-03 15:57:24 +02:00
"metadata": {},
"outputs": [],
"source": [
"pearson_weakly_correlated = set()\n",
"pearson_strongly_correlated = set()\n",
"# Iterate over the raw and transformed target.\n",
"for target in TARGET_VARIABLES:\n",
"    corrs = pearson.loc[target].drop(TARGET_VARIABLES)\n",
"    # The thresholds are compared against the signed coefficients (no absolute value is taken).\n",
"    pearson_weakly_correlated |= set(corrs[(weak < corrs) & (corrs <= strong)].index)\n",
"    pearson_strongly_correlated |= set(corrs[(strong < corrs)].index)\n",
"# Show that no contradiction exists between weak and strong classification.\n",
"assert pearson_weakly_correlated & pearson_strongly_correlated == set()"
]
},
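{
"cell_type": "markdown",
"metadata": {},
"source": [
"The bucketing above can be sketched on made-up numbers. This is only an illustration: the thresholds `0.33` and `0.66` and the predictor names `a` to `d` are assumptions and not the notebook's actual values.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical thresholds; the notebook defines its own `weak` and `strong`.\n",
"weak, strong = 0.33, 0.66\n",
"\n",
"# Made-up correlations of four predictors with one target.\n",
"corrs = pd.Series({'a': 0.10, 'b': 0.50, 'c': 0.80, 'd': -0.70})\n",
"\n",
"weakly = set(corrs[(weak < corrs) & (corrs <= strong)].index)\n",
"strongly = set(corrs[strong < corrs].index)\n",
"\n",
"print(weakly, strongly)\n",
"```\n",
"\n",
"Note that `d` ends up in neither set because the thresholds are compared against the signed coefficients."
]
},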
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Show the continuous variables that are weakly and strongly correlated with the sales price."
]
},
{
"cell_type": "code",
2018-09-03 18:04:40 +02:00
"execution_count": 23,
2018-09-03 15:57:24 +02:00
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1st Flr SF      First Floor square feet\n",
"BsmtFin SF 1    Type 1 finished square feet\n",
"Garage Area     Size of garage in square feet\n",
"Mas Vnr Area    Masonry veneer area in square feet\n",
"Total Bsmt SF   Total square feet of basement area\n",
"Total Porch SF\n",
"Wood Deck SF    Wood deck area in square feet\n"
]
}
],
"source": [
"print_column_list(pearson_weakly_correlated)"
]
},
{
"cell_type": "code",
2018-09-03 18:04:40 +02:00
"execution_count": 24,
2018-09-03 15:57:24 +02:00
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Gr Liv Area Above grade (ground) living area square feet\n"
]
}
],
"source": [
"print_column_list(pearson_strongly_correlated)"
]
},
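{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Spearman\n",
"\n",
"Spearman's rank correlation coefficient measures the strength of a monotonic relationship between two variables based on their ranks, which makes it suitable for the discrete and ordinal variables."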
]
},
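{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch on synthetic data (the `toy` DataFrame below is made up and not part of the housing data), the difference between Pearson's and Spearman's coefficients shows for a monotonic but non-linear relationship.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data: y grows monotonically, but not linearly, with x.\n",
"toy = pd.DataFrame({'x': np.arange(1.0, 11.0)})\n",
"toy['y'] = np.exp(toy['x'])\n",
"\n",
"pearson_r = toy.corr(method='pearson').loc['x', 'y']\n",
"spearman_r = toy.corr(method='spearman').loc['x', 'y']\n",
"\n",
"print(pearson_r, spearman_r)\n",
"```\n",
"\n",
"Spearman's coefficient is 1 here because the ranks of `x` and `y` agree perfectly, whereas Pearson's coefficient is lower due to the curvature."
]
},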
{
"cell_type": "code",
2018-09-03 18:04:40 +02:00
"execution_count": 25,
2018-09-03 15:57:24 +02:00
"metadata": {},
"outputs": [],
"source": [
"columns = sorted(DISCRETE_VARIABLES + ORDINAL_VARIABLES) + TARGET_VARIABLES\n",
"spearman = df[columns].corr(method=\"spearman\")"
]
},
{
"cell_type": "code",
2018-09-03 18:04:40 +02:00
"execution_count": 26,
2018-09-03 15:57:24 +02:00
"metadata": {},
"outputs": [
{
"data": {
2018-09-03 16:34:14 +02:00
"image/png": "(base64-encoded PNG of the heatmap titled \"Spearman's Rank Correlation\" omitted)",
2018-09-03 15:57:24 +02:00
"text/plain": [
"<Figure size 864x864 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plot_correlation(spearman, \"Spearman's Rank Correlation\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Predictors weakly or strongly correlated with a target variable are collected."
]
},
{
"cell_type": "code",
2018-09-03 18:04:40 +02:00
"execution_count": 27,
2018-09-03 15:57:24 +02:00
"metadata": {},
"outputs": [],
"source": [
"spearman_weakly_correlated = set()\n",
"spearman_strongly_correlated = set()\n",
"# Iterate over the raw and transformed target.\n",
"for target in TARGET_VARIABLES:\n",
"    corrs = spearman.loc[target].drop(TARGET_VARIABLES)\n",
"    # The thresholds are compared against the signed coefficients (no absolute value is taken).\n",
"    spearman_weakly_correlated |= set(corrs[(weak < corrs) & (corrs <= strong)].index)\n",
"    spearman_strongly_correlated |= set(corrs[(strong < corrs)].index)\n",
"# Show that no contradiction exists between weak and strong classification.\n",
"assert spearman_weakly_correlated & spearman_strongly_correlated == set()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Show the discrete and ordinal variables that are weakly and strongly correlated with the sales price."
]
},
{
"cell_type": "code",
2018-09-03 18:04:40 +02:00
"execution_count": 28,
2018-09-03 15:57:24 +02:00
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
2018-09-03 16:34:14 +02:00
"Bsmt Exposure   Refers to walkout or garden level walls\n",
"BsmtFin Type 1  Rating of basement finished area\n",
"Fireplace Qu    Fireplace quality\n",
"Fireplaces      Number of fireplaces\n",
"Full Bath       Full bathrooms above grade\n",
"Garage Cond     Garage condition\n",
"Garage Finish   Interior finish of the garage\n",
"Garage Qual     Garage quality\n",
"Half Bath       Half baths above grade\n",
"Heating QC      Heating quality and condition\n",
"Paved Drive     Paved driveway\n",
"TotRms AbvGrd   Total rooms above grade (does not include bathrooms)\n",
"Year Remod/Add  Remodel date (same as construction date if no remodeling or additions)\n"
2018-09-03 15:57:24 +02:00
]
}
],
"source": [
"print_column_list(spearman_weakly_correlated)"
]
},
{
"cell_type": "code",
2018-09-03 18:04:40 +02:00
"execution_count": 29,
2018-09-03 15:57:24 +02:00
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
2018-09-03 16:34:14 +02:00
"Bsmt Qual     Evaluates the height of the basement\n",
"Exter Qual    Evaluates the quality of the material on the exterior\n",
"Garage Cars   Size of garage in car capacity\n",
"Kitchen Qual  Kitchen quality\n",
"Overall Qual  Rates the overall material and finish of the house\n",
2018-09-03 15:57:24 +02:00
"Total Bath\n",
2018-09-03 16:34:14 +02:00
"Year Built    Original construction date\n"
2018-09-03 15:57:24 +02:00
]
}
],
"source": [
"print_column_list(spearman_strongly_correlated)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Save the Results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Save the Weakly and Strongly Correlated Variables\n",
"\n",
"The subset of variables that are weakly or strongly correlated with the house price is saved in a simple JSON file for easy re-use."
]
},
{
"cell_type": "code",
2018-09-03 18:04:40 +02:00
"execution_count": 30,
2018-09-03 15:57:24 +02:00
"metadata": {},
"outputs": [],
"source": [
2018-09-03 16:12:15 +02:00
"with open(\"data/weakly_and_strongly_correlated_variables.json\", \"w\") as file:\n",
2018-09-03 15:57:24 +02:00
"    file.write(json.dumps({\n",
"        \"weakly_correlated\": sorted(\n",
"            list(pearson_weakly_correlated) + list(spearman_weakly_correlated)\n",
"        ),\n",
"        \"strongly_correlated\": sorted(\n",
"            list(pearson_strongly_correlated) + list(spearman_strongly_correlated)\n",
"        ),\n",
"    }))"
]
},
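{
"cell_type": "markdown",
"metadata": {},
"source": [
"A downstream notebook could read the file back in as sketched below. To keep the example self-contained, it first writes a small stand-in dictionary to a temporary file; the file name and the shortened variable lists are assumptions for illustration only.\n",
"\n",
"```python\n",
"import json\n",
"import os\n",
"import tempfile\n",
"\n",
"# Shortened stand-in for the real result sets.\n",
"results = {\n",
"    'weakly_correlated': sorted(['Garage Area', 'Fireplaces']),\n",
"    'strongly_correlated': sorted(['Gr Liv Area', 'Overall Qual']),\n",
"}\n",
"\n",
"# A temporary file is used here instead of the notebook's data folder.\n",
"path = os.path.join(tempfile.mkdtemp(), 'correlated_variables.json')\n",
"with open(path, 'w') as file:\n",
"    json.dump(results, file)\n",
"\n",
"with open(path) as file:\n",
"    loaded = json.load(file)\n",
"\n",
"print(loaded['strongly_correlated'])\n",
"```\n",
"\n",
"Storing sorted lists rather than sets keeps the file valid JSON and its contents stable across runs."
]
},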
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Save the Data\n",
"\n",
2018-09-03 18:04:40 +02:00
"Sort the new variables into the unprocessed `cleaned_df` DataFrame, with the targets at the end. This \"restores\" the original ordinal labels for storage."
2018-09-03 15:57:24 +02:00
]
},
{
"cell_type": "code",
2018-09-03 18:04:40 +02:00
"execution_count": 31,
2018-09-03 15:57:24 +02:00
"metadata": {},
"outputs": [],
"source": [
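2018-09-03 18:04:40 +02:00
"# Copy the newly created columns into the unprocessed DataFrame.\n",
"for column in new_variables:\n",
"    cleaned_df[column] = df[column]\n",
"# Targets are appended separately below, so drop them from the list of new variables.\n",
"for target in set(TARGET_VARIABLES) & set(new_variables):\n",
"    new_variables.remove(target)\n",
"# Order the predictors alphabetically and put the targets last.\n",
"cleaned_df = cleaned_df[sorted(ALL_VARIABLES + new_variables) + TARGET_VARIABLES]"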
2018-09-03 15:57:24 +02:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
2018-09-03 18:04:40 +02:00
"In total, this notebook added two new linear combinations and one Box-Cox transformation to the previous 78 columns, which yields 81 columns."
2018-09-03 15:57:24 +02:00
]
},
{
"cell_type": "code",
2018-09-03 18:04:40 +02:00
"execution_count": 32,
2018-09-03 15:57:24 +02:00
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
2018-09-03 16:34:14 +02:00
"(2898, 81)"
2018-09-03 15:57:24 +02:00
]
},
2018-09-03 18:04:40 +02:00
"execution_count": 32,
2018-09-03 15:57:24 +02:00
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
2018-09-03 18:04:40 +02:00
"cleaned_df.shape"
2018-09-03 15:57:24 +02:00
]
},
{
"cell_type": "code",
2018-09-03 18:04:40 +02:00
"execution_count": 33,
2018-09-03 15:57:24 +02:00
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th>1st Flr SF</th>\n",
" <th>2nd Flr SF</th>\n",
" <th>3Ssn Porch</th>\n",
" <th>Alley</th>\n",
" <th>Bedroom AbvGr</th>\n",
" <th>Bldg Type</th>\n",
" <th>Bsmt Cond</th>\n",
" <th>Bsmt Exposure</th>\n",
" <th>Bsmt Full Bath</th>\n",
" <th>Bsmt Half Bath</th>\n",
" <th>Bsmt Qual</th>\n",
" <th>Bsmt Unf SF</th>\n",
" <th>BsmtFin SF 1</th>\n",
" <th>BsmtFin SF 2</th>\n",
" <th>BsmtFin Type 1</th>\n",
" <th>BsmtFin Type 2</th>\n",
" <th>Central Air</th>\n",
" <th>Condition 1</th>\n",
" <th>Condition 2</th>\n",
" <th>Electrical</th>\n",
" <th>Enclosed Porch</th>\n",
" <th>Exter Cond</th>\n",
" <th>Exter Qual</th>\n",
" <th>Exterior 1st</th>\n",
" <th>Exterior 2nd</th>\n",
" <th>Fence</th>\n",
" <th>Fireplace Qu</th>\n",
" <th>Fireplaces</th>\n",
" <th>Foundation</th>\n",
" <th>Full Bath</th>\n",
" <th>Functional</th>\n",
" <th>Garage Area</th>\n",
" <th>Garage Cars</th>\n",
" <th>Garage Cond</th>\n",
" <th>Garage Finish</th>\n",
" <th>Garage Qual</th>\n",
" <th>Garage Type</th>\n",
" <th>Gr Liv Area</th>\n",
" <th>Half Bath</th>\n",
" <th>Heating</th>\n",
" <th>Heating QC</th>\n",
" <th>House Style</th>\n",
" <th>Kitchen AbvGr</th>\n",
" <th>Kitchen Qual</th>\n",
" <th>Land Contour</th>\n",
" <th>Land Slope</th>\n",
" <th>Lot Area</th>\n",
" <th>Lot Config</th>\n",
" <th>Lot Shape</th>\n",
" <th>Low Qual Fin SF</th>\n",
" <th>MS SubClass</th>\n",
" <th>MS Zoning</th>\n",
" <th>Mas Vnr Area</th>\n",
" <th>Mas Vnr Type</th>\n",
" <th>Misc Feature</th>\n",
" <th>Misc Val</th>\n",
" <th>Mo Sold</th>\n",
" <th>Neighborhood</th>\n",
" <th>Open Porch SF</th>\n",
" <th>Overall Cond</th>\n",
" <th>Overall Qual</th>\n",
" <th>Paved Drive</th>\n",
" <th>Pool Area</th>\n",
" <th>Pool QC</th>\n",
" <th>Roof Matl</th>\n",
" <th>Roof Style</th>\n",
" <th>Sale Condition</th>\n",
" <th>Sale Type</th>\n",
" <th>Screen Porch</th>\n",
" <th>Street</th>\n",
" <th>TotRms AbvGrd</th>\n",
" <th>Total Bath</th>\n",
" <th>Total Bsmt SF</th>\n",
" <th>Total Porch SF</th>\n",
" <th>Utilities</th>\n",
" <th>Wood Deck SF</th>\n",
" <th>Year Built</th>\n",
" <th>Year Remod/Add</th>\n",
" <th>Yr Sold</th>\n",
" <th>SalePrice</th>\n",
" <th>SalePrice (box-cox-0.0)</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Order</th>\n",
" <th>PID</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <th>526301100</th>\n",
" <td>1656.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NA</td>\n",
" <td>3</td>\n",
" <td>1Fam</td>\n",
2018-09-03 18:04:40 +02:00
" <td>Gd</td>\n",
" <td>Gd</td>\n",
2018-09-03 15:57:24 +02:00
" <td>1</td>\n",
" <td>0</td>\n",
2018-09-03 18:04:40 +02:00
" <td>TA</td>\n",
2018-09-03 15:57:24 +02:00
" <td>441.0</td>\n",
" <td>639.0</td>\n",
" <td>0.0</td>\n",
2018-09-03 18:04:40 +02:00
" <td>BLQ</td>\n",
" <td>Unf</td>\n",
2018-09-03 15:57:24 +02:00
" <td>Y</td>\n",
" <td>Norm</td>\n",
" <td>Norm</td>\n",
2018-09-03 18:04:40 +02:00
" <td>SBrkr</td>\n",
2018-09-03 15:57:24 +02:00
" <td>0.0</td>\n",
2018-09-03 18:04:40 +02:00
" <td>TA</td>\n",
" <td>TA</td>\n",
2018-09-03 15:57:24 +02:00
" <td>BrkFace</td>\n",
" <td>Plywood</td>\n",
2018-09-03 18:04:40 +02:00
" <td>NA</td>\n",
" <td>Gd</td>\n",
2018-09-03 15:57:24 +02:00
" <td>2</td>\n",
" <td>CBlock</td>\n",
" <td>1</td>\n",
2018-09-03 18:04:40 +02:00
" <td>Typ</td>\n",
2018-09-03 15:57:24 +02:00
" <td>528.0</td>\n",
" <td>2</td>\n",
2018-09-03 18:04:40 +02:00
" <td>TA</td>\n",
" <td>Fin</td>\n",
" <td>TA</td>\n",
2018-09-03 15:57:24 +02:00
" <td>Attchd</td>\n",
" <td>1656.0</td>\n",
" <td>0</td>\n",
" <td>GasA</td>\n",
2018-09-03 18:04:40 +02:00
" <td>Fa</td>\n",
2018-09-03 15:57:24 +02:00
" <td>1Story</td>\n",
" <td>1</td>\n",
2018-09-03 18:04:40 +02:00
" <td>TA</td>\n",
2018-09-03 15:57:24 +02:00
" <td>Lvl</td>\n",
2018-09-03 18:04:40 +02:00
" <td>Gtl</td>\n",
2018-09-03 15:57:24 +02:00
" <td>31770.0</td>\n",
" <td>Corner</td>\n",
2018-09-03 18:04:40 +02:00
" <td>IR1</td>\n",
2018-09-03 15:57:24 +02:00
" <td>0.0</td>\n",
" <td>020</td>\n",
" <td>RL</td>\n",
" <td>112.0</td>\n",
" <td>Stone</td>\n",
" <td>NA</td>\n",
" <td>0.0</td>\n",
" <td>5</td>\n",
" <td>Names</td>\n",
" <td>62.0</td>\n",
" <td>5</td>\n",
2018-09-03 18:04:40 +02:00
" <td>6</td>\n",
" <td>P</td>\n",
2018-09-03 15:57:24 +02:00
" <td>0.0</td>\n",
2018-09-03 18:04:40 +02:00
" <td>NA</td>\n",
2018-09-03 15:57:24 +02:00
" <td>CompShg</td>\n",
" <td>Hip</td>\n",
" <td>Normal</td>\n",
" <td>WD</td>\n",
" <td>0.0</td>\n",
" <td>Pave</td>\n",
" <td>7</td>\n",
" <td>2.0</td>\n",
" <td>1080.0</td>\n",
" <td>272.0</td>\n",
2018-09-03 18:04:40 +02:00
" <td>AllPub</td>\n",
2018-09-03 15:57:24 +02:00
" <td>210.0</td>\n",
" <td>1960</td>\n",
" <td>1960</td>\n",
" <td>2010</td>\n",
" <td>215000.0</td>\n",
" <td>12.278393</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <th>526350040</th>\n",
" <td>896.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NA</td>\n",
" <td>2</td>\n",
" <td>1Fam</td>\n",
2018-09-03 18:04:40 +02:00
" <td>TA</td>\n",
" <td>No</td>\n",
2018-09-03 15:57:24 +02:00
" <td>0</td>\n",
" <td>0</td>\n",
2018-09-03 18:04:40 +02:00
" <td>TA</td>\n",
2018-09-03 15:57:24 +02:00
" <td>270.0</td>\n",
" <td>468.0</td>\n",
" <td>144.0</td>\n",
2018-09-03 18:04:40 +02:00
" <td>Rec</td>\n",
" <td>LwQ</td>\n",
2018-09-03 15:57:24 +02:00
" <td>Y</td>\n",
" <td>Feedr</td>\n",
" <td>Norm</td>\n",
2018-09-03 18:04:40 +02:00
" <td>SBrkr</td>\n",
2018-09-03 15:57:24 +02:00
" <td>0.0</td>\n",
2018-09-03 18:04:40 +02:00
" <td>TA</td>\n",
" <td>TA</td>\n",
2018-09-03 15:57:24 +02:00
" <td>VinylSd</td>\n",
" <td>VinylSd</td>\n",
2018-09-03 18:04:40 +02:00
" <td>MnPrv</td>\n",
" <td>NA</td>\n",
2018-09-03 15:57:24 +02:00
" <td>0</td>\n",
" <td>CBlock</td>\n",
" <td>1</td>\n",
2018-09-03 18:04:40 +02:00
" <td>Typ</td>\n",
2018-09-03 15:57:24 +02:00
" <td>730.0</td>\n",
" <td>1</td>\n",
2018-09-03 18:04:40 +02:00
" <td>TA</td>\n",
" <td>Unf</td>\n",
" <td>TA</td>\n",
2018-09-03 15:57:24 +02:00
" <td>Attchd</td>\n",
" <td>896.0</td>\n",
" <td>0</td>\n",
" <td>GasA</td>\n",
2018-09-03 18:04:40 +02:00
" <td>TA</td>\n",
2018-09-03 15:57:24 +02:00
" <td>1Story</td>\n",
" <td>1</td>\n",
2018-09-03 18:04:40 +02:00
" <td>TA</td>\n",
2018-09-03 15:57:24 +02:00
" <td>Lvl</td>\n",
2018-09-03 18:04:40 +02:00
" <td>Gtl</td>\n",
2018-09-03 15:57:24 +02:00
" <td>11622.0</td>\n",
" <td>Inside</td>\n",
2018-09-03 18:04:40 +02:00
" <td>Reg</td>\n",
2018-09-03 15:57:24 +02:00
" <td>0.0</td>\n",
" <td>020</td>\n",
" <td>RH</td>\n",
" <td>0.0</td>\n",
" <td>None</td>\n",
" <td>NA</td>\n",
" <td>0.0</td>\n",
" <td>6</td>\n",
" <td>Names</td>\n",
" <td>0.0</td>\n",
2018-09-03 18:04:40 +02:00
" <td>6</td>\n",
2018-09-03 15:57:24 +02:00
" <td>5</td>\n",
2018-09-03 18:04:40 +02:00
" <td>Y</td>\n",
2018-09-03 15:57:24 +02:00
" <td>0.0</td>\n",
2018-09-03 18:04:40 +02:00
" <td>NA</td>\n",
2018-09-03 15:57:24 +02:00
" <td>CompShg</td>\n",
" <td>Gable</td>\n",
" <td>Normal</td>\n",
" <td>WD</td>\n",
" <td>120.0</td>\n",
" <td>Pave</td>\n",
" <td>5</td>\n",
" <td>1.0</td>\n",
" <td>882.0</td>\n",
" <td>260.0</td>\n",
2018-09-03 18:04:40 +02:00
" <td>AllPub</td>\n",
2018-09-03 15:57:24 +02:00
" <td>140.0</td>\n",
" <td>1961</td>\n",
" <td>1961</td>\n",
" <td>2010</td>\n",
" <td>105000.0</td>\n",
" <td>11.561716</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <th>526351010</th>\n",
" <td>1329.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NA</td>\n",
" <td>3</td>\n",
" <td>1Fam</td>\n",
2018-09-03 18:04:40 +02:00
" <td>TA</td>\n",
" <td>No</td>\n",
2018-09-03 15:57:24 +02:00
" <td>0</td>\n",
" <td>0</td>\n",
2018-09-03 18:04:40 +02:00
" <td>TA</td>\n",
2018-09-03 15:57:24 +02:00
" <td>406.0</td>\n",
" <td>923.0</td>\n",
" <td>0.0</td>\n",
2018-09-03 18:04:40 +02:00
" <td>ALQ</td>\n",
" <td>Unf</td>\n",
2018-09-03 15:57:24 +02:00
" <td>Y</td>\n",
" <td>Norm</td>\n",
" <td>Norm</td>\n",
2018-09-03 18:04:40 +02:00
" <td>SBrkr</td>\n",
2018-09-03 15:57:24 +02:00
" <td>0.0</td>\n",
2018-09-03 18:04:40 +02:00
" <td>TA</td>\n",
" <td>TA</td>\n",
2018-09-03 15:57:24 +02:00
" <td>Wd Sdng</td>\n",
" <td>Wd Sdng</td>\n",
2018-09-03 18:04:40 +02:00
" <td>NA</td>\n",
" <td>NA</td>\n",
2018-09-03 15:57:24 +02:00
" <td>0</td>\n",
" <td>CBlock</td>\n",
" <td>1</td>\n",
2018-09-03 18:04:40 +02:00
" <td>Typ</td>\n",
2018-09-03 15:57:24 +02:00
" <td>312.0</td>\n",
" <td>1</td>\n",
2018-09-03 18:04:40 +02:00
" <td>TA</td>\n",
" <td>Unf</td>\n",
" <td>TA</td>\n",
2018-09-03 15:57:24 +02:00
" <td>Attchd</td>\n",
" <td>1329.0</td>\n",
" <td>1</td>\n",
" <td>GasA</td>\n",
2018-09-03 18:04:40 +02:00
" <td>TA</td>\n",
2018-09-03 15:57:24 +02:00
" <td>1Story</td>\n",
" <td>1</td>\n",
2018-09-03 18:04:40 +02:00
" <td>Gd</td>\n",
2018-09-03 15:57:24 +02:00
" <td>Lvl</td>\n",
2018-09-03 18:04:40 +02:00
" <td>Gtl</td>\n",
2018-09-03 15:57:24 +02:00
" <td>14267.0</td>\n",
" <td>Corner</td>\n",
2018-09-03 18:04:40 +02:00
" <td>IR1</td>\n",
2018-09-03 15:57:24 +02:00
" <td>0.0</td>\n",
" <td>020</td>\n",
" <td>RL</td>\n",
" <td>108.0</td>\n",
" <td>BrkFace</td>\n",
" <td>Gar2</td>\n",
" <td>12500.0</td>\n",
" <td>6</td>\n",
" <td>Names</td>\n",
" <td>36.0</td>\n",
2018-09-03 18:04:40 +02:00
" <td>6</td>\n",
" <td>6</td>\n",
" <td>Y</td>\n",
2018-09-03 15:57:24 +02:00
" <td>0.0</td>\n",
2018-09-03 18:04:40 +02:00
" <td>NA</td>\n",
2018-09-03 15:57:24 +02:00
" <td>CompShg</td>\n",
" <td>Hip</td>\n",
" <td>Normal</td>\n",
" <td>WD</td>\n",
" <td>0.0</td>\n",
" <td>Pave</td>\n",
" <td>6</td>\n",
" <td>1.5</td>\n",
" <td>1329.0</td>\n",
" <td>429.0</td>\n",
2018-09-03 18:04:40 +02:00
" <td>AllPub</td>\n",
2018-09-03 15:57:24 +02:00
" <td>393.0</td>\n",
" <td>1958</td>\n",
" <td>1958</td>\n",
" <td>2010</td>\n",
" <td>172000.0</td>\n",
" <td>12.055250</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <th>526353030</th>\n",
" <td>2110.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NA</td>\n",
" <td>3</td>\n",
" <td>1Fam</td>\n",
2018-09-03 18:04:40 +02:00
" <td>TA</td>\n",
" <td>No</td>\n",
2018-09-03 15:57:24 +02:00
" <td>1</td>\n",
" <td>0</td>\n",
2018-09-03 18:04:40 +02:00
" <td>TA</td>\n",
2018-09-03 15:57:24 +02:00
" <td>1045.0</td>\n",
" <td>1065.0</td>\n",
" <td>0.0</td>\n",
2018-09-03 18:04:40 +02:00
" <td>ALQ</td>\n",
" <td>Unf</td>\n",
2018-09-03 15:57:24 +02:00
" <td>Y</td>\n",
" <td>Norm</td>\n",
" <td>Norm</td>\n",
2018-09-03 18:04:40 +02:00
" <td>SBrkr</td>\n",
2018-09-03 15:57:24 +02:00
" <td>0.0</td>\n",
2018-09-03 18:04:40 +02:00
" <td>TA</td>\n",
" <td>Gd</td>\n",
2018-09-03 15:57:24 +02:00
" <td>BrkFace</td>\n",
" <td>BrkFace</td>\n",
2018-09-03 18:04:40 +02:00
" <td>NA</td>\n",
" <td>TA</td>\n",
2018-09-03 15:57:24 +02:00
" <td>2</td>\n",
" <td>CBlock</td>\n",
" <td>2</td>\n",
2018-09-03 18:04:40 +02:00
" <td>Typ</td>\n",
2018-09-03 15:57:24 +02:00
" <td>522.0</td>\n",
" <td>2</td>\n",
2018-09-03 18:04:40 +02:00
" <td>TA</td>\n",
" <td>Fin</td>\n",
" <td>TA</td>\n",
2018-09-03 15:57:24 +02:00
" <td>Attchd</td>\n",
" <td>2110.0</td>\n",
" <td>1</td>\n",
" <td>GasA</td>\n",
2018-09-03 18:04:40 +02:00
" <td>Ex</td>\n",
2018-09-03 15:57:24 +02:00
" <td>1Story</td>\n",
" <td>1</td>\n",
2018-09-03 18:04:40 +02:00
" <td>Ex</td>\n",
2018-09-03 15:57:24 +02:00
" <td>Lvl</td>\n",
2018-09-03 18:04:40 +02:00
" <td>Gtl</td>\n",
2018-09-03 15:57:24 +02:00
" <td>11160.0</td>\n",
" <td>Corner</td>\n",
2018-09-03 18:04:40 +02:00
" <td>Reg</td>\n",
2018-09-03 15:57:24 +02:00
" <td>0.0</td>\n",
" <td>020</td>\n",
" <td>RL</td>\n",
" <td>0.0</td>\n",
" <td>None</td>\n",
" <td>NA</td>\n",
" <td>0.0</td>\n",
" <td>4</td>\n",
" <td>Names</td>\n",
" <td>0.0</td>\n",
2018-09-03 18:04:40 +02:00
" <td>5</td>\n",
" <td>7</td>\n",
" <td>Y</td>\n",
2018-09-03 15:57:24 +02:00
" <td>0.0</td>\n",
2018-09-03 18:04:40 +02:00
" <td>NA</td>\n",
2018-09-03 15:57:24 +02:00
" <td>CompShg</td>\n",
" <td>Hip</td>\n",
" <td>Normal</td>\n",
" <td>WD</td>\n",
" <td>0.0</td>\n",
" <td>Pave</td>\n",
" <td>8</td>\n",
" <td>3.5</td>\n",
" <td>2110.0</td>\n",
" <td>0.0</td>\n",
2018-09-03 18:04:40 +02:00
" <td>AllPub</td>\n",
2018-09-03 15:57:24 +02:00
" <td>0.0</td>\n",
" <td>1968</td>\n",
" <td>1968</td>\n",
" <td>2010</td>\n",
" <td>244000.0</td>\n",
" <td>12.404924</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <th>527105010</th>\n",
" <td>928.0</td>\n",
" <td>701.0</td>\n",
" <td>0.0</td>\n",
" <td>NA</td>\n",
" <td>3</td>\n",
" <td>1Fam</td>\n",
2018-09-03 18:04:40 +02:00
" <td>TA</td>\n",
" <td>No</td>\n",
2018-09-03 15:57:24 +02:00
" <td>0</td>\n",
" <td>0</td>\n",
2018-09-03 18:04:40 +02:00
" <td>Gd</td>\n",
2018-09-03 15:57:24 +02:00
" <td>137.0</td>\n",
" <td>791.0</td>\n",
" <td>0.0</td>\n",
2018-09-03 18:04:40 +02:00
" <td>GLQ</td>\n",
" <td>Unf</td>\n",
2018-09-03 15:57:24 +02:00
" <td>Y</td>\n",
" <td>Norm</td>\n",
" <td>Norm</td>\n",
2018-09-03 18:04:40 +02:00
" <td>SBrkr</td>\n",
2018-09-03 15:57:24 +02:00
" <td>0.0</td>\n",
2018-09-03 18:04:40 +02:00
" <td>TA</td>\n",
" <td>TA</td>\n",
2018-09-03 15:57:24 +02:00
" <td>VinylSd</td>\n",
" <td>VinylSd</td>\n",
2018-09-03 18:04:40 +02:00
" <td>MnPrv</td>\n",
" <td>TA</td>\n",
2018-09-03 15:57:24 +02:00
" <td>1</td>\n",
" <td>PConc</td>\n",
" <td>2</td>\n",
2018-09-03 18:04:40 +02:00
" <td>Typ</td>\n",
2018-09-03 15:57:24 +02:00
" <td>482.0</td>\n",
" <td>2</td>\n",
2018-09-03 18:04:40 +02:00
" <td>TA</td>\n",
" <td>Fin</td>\n",
" <td>TA</td>\n",
2018-09-03 15:57:24 +02:00
" <td>Attchd</td>\n",
" <td>1629.0</td>\n",
" <td>1</td>\n",
" <td>GasA</td>\n",
2018-09-03 18:04:40 +02:00
" <td>Gd</td>\n",
2018-09-03 15:57:24 +02:00
" <td>2Story</td>\n",
" <td>1</td>\n",
2018-09-03 18:04:40 +02:00
" <td>TA</td>\n",
2018-09-03 15:57:24 +02:00
" <td>Lvl</td>\n",
2018-09-03 18:04:40 +02:00
" <td>Gtl</td>\n",
2018-09-03 15:57:24 +02:00
" <td>13830.0</td>\n",
" <td>Inside</td>\n",
2018-09-03 18:04:40 +02:00
" <td>IR1</td>\n",
2018-09-03 15:57:24 +02:00
" <td>0.0</td>\n",
" <td>060</td>\n",
" <td>RL</td>\n",
" <td>0.0</td>\n",
" <td>None</td>\n",
" <td>NA</td>\n",
" <td>0.0</td>\n",
" <td>3</td>\n",
" <td>Gilbert</td>\n",
" <td>34.0</td>\n",
2018-09-03 18:04:40 +02:00
" <td>5</td>\n",
" <td>5</td>\n",
" <td>Y</td>\n",
2018-09-03 15:57:24 +02:00
" <td>0.0</td>\n",
2018-09-03 18:04:40 +02:00
" <td>NA</td>\n",
2018-09-03 15:57:24 +02:00
" <td>CompShg</td>\n",
" <td>Gable</td>\n",
" <td>Normal</td>\n",
" <td>WD</td>\n",
" <td>0.0</td>\n",
" <td>Pave</td>\n",
" <td>6</td>\n",
" <td>2.5</td>\n",
" <td>928.0</td>\n",
" <td>246.0</td>\n",
2018-09-03 18:04:40 +02:00
" <td>AllPub</td>\n",
2018-09-03 15:57:24 +02:00
" <td>212.0</td>\n",
" <td>1997</td>\n",
" <td>1998</td>\n",
" <td>2010</td>\n",
" <td>189900.0</td>\n",
" <td>12.154253</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 1st Flr SF 2nd Flr SF 3Ssn Porch Alley Bedroom AbvGr \\\n",
"Order PID \n",
"1 526301100 1656.0 0.0 0.0 NA 3 \n",
"2 526350040 896.0 0.0 0.0 NA 2 \n",
"3 526351010 1329.0 0.0 0.0 NA 3 \n",
"4 526353030 2110.0 0.0 0.0 NA 3 \n",
"5 527105010 928.0 701.0 0.0 NA 3 \n",
"\n",
2018-09-03 18:04:40 +02:00
" Bldg Type Bsmt Cond Bsmt Exposure Bsmt Full Bath \\\n",
"Order PID \n",
"1 526301100 1Fam Gd Gd 1 \n",
"2 526350040 1Fam TA No 0 \n",
"3 526351010 1Fam TA No 0 \n",
"4 526353030 1Fam TA No 1 \n",
"5 527105010 1Fam TA No 0 \n",
2018-09-03 15:57:24 +02:00
"\n",
2018-09-03 18:04:40 +02:00
" Bsmt Half Bath Bsmt Qual Bsmt Unf SF BsmtFin SF 1 \\\n",
"Order PID \n",
"1 526301100 0 TA 441.0 639.0 \n",
"2 526350040 0 TA 270.0 468.0 \n",
"3 526351010 0 TA 406.0 923.0 \n",
"4 526353030 0 TA 1045.0 1065.0 \n",
"5 527105010 0 Gd 137.0 791.0 \n",
2018-09-03 15:57:24 +02:00
"\n",
2018-09-03 18:04:40 +02:00
" BsmtFin SF 2 BsmtFin Type 1 BsmtFin Type 2 Central Air \\\n",
"Order PID \n",
"1 526301100 0.0 BLQ Unf Y \n",
"2 526350040 144.0 Rec LwQ Y \n",
"3 526351010 0.0 ALQ Unf Y \n",
"4 526353030 0.0 ALQ Unf Y \n",
"5 527105010 0.0 GLQ Unf Y \n",
2018-09-03 15:57:24 +02:00
"\n",
2018-09-03 18:04:40 +02:00
" Condition 1 Condition 2 Electrical Enclosed Porch Exter Cond \\\n",
"Order PID \n",
"1 526301100 Norm Norm SBrkr 0.0 TA \n",
"2 526350040 Feedr Norm SBrkr 0.0 TA \n",
"3 526351010 Norm Norm SBrkr 0.0 TA \n",
"4 526353030 Norm Norm SBrkr 0.0 TA \n",
"5 527105010 Norm Norm SBrkr 0.0 TA \n",
2018-09-03 15:57:24 +02:00
"\n",
2018-09-03 18:04:40 +02:00
" Exter Qual Exterior 1st Exterior 2nd Fence Fireplace Qu \\\n",
2018-09-03 15:57:24 +02:00
"Order PID \n",
2018-09-03 18:04:40 +02:00
"1 526301100 TA BrkFace Plywood NA Gd \n",
"2 526350040 TA VinylSd VinylSd MnPrv NA \n",
"3 526351010 TA Wd Sdng Wd Sdng NA NA \n",
"4 526353030 Gd BrkFace BrkFace NA TA \n",
"5 527105010 TA VinylSd VinylSd MnPrv TA \n",
2018-09-03 15:57:24 +02:00
"\n",
2018-09-03 18:04:40 +02:00
" Fireplaces Foundation Full Bath Functional Garage Area \\\n",
"Order PID \n",
"1 526301100 2 CBlock 1 Typ 528.0 \n",
"2 526350040 0 CBlock 1 Typ 730.0 \n",
"3 526351010 0 CBlock 1 Typ 312.0 \n",
"4 526353030 2 CBlock 2 Typ 522.0 \n",
"5 527105010 1 PConc 2 Typ 482.0 \n",
2018-09-03 15:57:24 +02:00
"\n",
2018-09-03 18:04:40 +02:00
" Garage Cars Garage Cond Garage Finish Garage Qual \\\n",
"Order PID \n",
"1 526301100 2 TA Fin TA \n",
"2 526350040 1 TA Unf TA \n",
"3 526351010 1 TA Unf TA \n",
"4 526353030 2 TA Fin TA \n",
"5 527105010 2 TA Fin TA \n",
2018-09-03 15:57:24 +02:00
"\n",
2018-09-03 18:04:40 +02:00
" Garage Type Gr Liv Area Half Bath Heating Heating QC \\\n",
"Order PID \n",
"1 526301100 Attchd 1656.0 0 GasA Fa \n",
"2 526350040 Attchd 896.0 0 GasA TA \n",
"3 526351010 Attchd 1329.0 1 GasA TA \n",
"4 526353030 Attchd 2110.0 1 GasA Ex \n",
"5 527105010 Attchd 1629.0 1 GasA Gd \n",
2018-09-03 15:57:24 +02:00
"\n",
2018-09-03 18:04:40 +02:00
" House Style Kitchen AbvGr Kitchen Qual Land Contour \\\n",
2018-09-03 15:57:24 +02:00
"Order PID \n",
2018-09-03 18:04:40 +02:00
"1 526301100 1Story 1 TA Lvl \n",
"2 526350040 1Story 1 TA Lvl \n",
"3 526351010 1Story 1 Gd Lvl \n",
"4 526353030 1Story 1 Ex Lvl \n",
"5 527105010 2Story 1 TA Lvl \n",
2018-09-03 15:57:24 +02:00
"\n",
2018-09-03 18:04:40 +02:00
" Land Slope Lot Area Lot Config Lot Shape Low Qual Fin SF \\\n",
"Order PID \n",
"1 526301100 Gtl 31770.0 Corner IR1 0.0 \n",
"2 526350040 Gtl 11622.0 Inside Reg 0.0 \n",
"3 526351010 Gtl 14267.0 Corner IR1 0.0 \n",
"4 526353030 Gtl 11160.0 Corner Reg 0.0 \n",
"5 527105010 Gtl 13830.0 Inside IR1 0.0 \n",
"\n",
" MS SubClass MS Zoning Mas Vnr Area Mas Vnr Type Misc Feature \\\n",
"Order PID \n",
"1 526301100 020 RL 112.0 Stone NA \n",
"2 526350040 020 RH 0.0 None NA \n",
"3 526351010 020 RL 108.0 BrkFace Gar2 \n",
"4 526353030 020 RL 0.0 None NA \n",
"5 527105010 060 RL 0.0 None NA \n",
"\n",
" Misc Val Mo Sold Neighborhood Open Porch SF Overall Cond \\\n",
"Order PID \n",
"1 526301100 0.0 5 Names 62.0 5 \n",
"2 526350040 0.0 6 Names 0.0 6 \n",
"3 526351010 12500.0 6 Names 36.0 6 \n",
"4 526353030 0.0 4 Names 0.0 5 \n",
"5 527105010 0.0 3 Gilbert 34.0 5 \n",
"\n",
" Overall Qual Paved Drive Pool Area Pool QC Roof Matl \\\n",
"Order PID \n",
"1 526301100 6 P 0.0 NA CompShg \n",
"2 526350040 5 Y 0.0 NA CompShg \n",
"3 526351010 6 Y 0.0 NA CompShg \n",
"4 526353030 7 Y 0.0 NA CompShg \n",
"5 527105010 5 Y 0.0 NA CompShg \n",
"\n",
" Roof Style Sale Condition Sale Type Screen Porch Street \\\n",
"Order PID \n",
"1 526301100 Hip Normal WD 0.0 Pave \n",
"2 526350040 Gable Normal WD 120.0 Pave \n",
"3 526351010 Hip Normal WD 0.0 Pave \n",
"4 526353030 Hip Normal WD 0.0 Pave \n",
"5 527105010 Gable Normal WD 0.0 Pave \n",
"\n",
" TotRms AbvGrd Total Bath Total Bsmt SF Total Porch SF \\\n",
"Order PID \n",
"1 526301100 7 2.0 1080.0 272.0 \n",
"2 526350040 5 1.0 882.0 260.0 \n",
"3 526351010 6 1.5 1329.0 429.0 \n",
"4 526353030 8 3.5 2110.0 0.0 \n",
"5 527105010 6 2.5 928.0 246.0 \n",
"\n",
" Utilities Wood Deck SF Year Built Year Remod/Add Yr Sold \\\n",
"Order PID \n",
"1 526301100 AllPub 210.0 1960 1960 2010 \n",
"2 526350040 AllPub 140.0 1961 1961 2010 \n",
"3 526351010 AllPub 393.0 1958 1958 2010 \n",
"4 526353030 AllPub 0.0 1968 1968 2010 \n",
"5 527105010 AllPub 212.0 1997 1998 2010 \n",
"\n",
" SalePrice SalePrice (box-cox-0.0) \n",
"Order PID \n",
"1 526301100 215000.0 12.278393 \n",
"2 526350040 105000.0 11.561716 \n",
"3 526351010 172000.0 12.055250 \n",
"4 526353030 244000.0 12.404924 \n",
"5 527105010 189900.0 12.154253 "
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cleaned_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"cleaned_df.to_csv(\"data/data_clean_with_transformations.csv\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}