Predicting House Prices
King County House Sales Dataset
Objective: Clean, explore, and model this dataset with a multiple linear regression to predict the sale price of houses as accurately as possible.
Data Exploration and Scrubbing
Our first step is to explore and understand the dataset we are working with. We need to know what every column represents and how the computer is interpreting it, for example whether a column is read as a number, a date, or text. This step lets us convert each column to the type we actually intend.
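A minimal sketch of this inspection step, assuming the data is loaded with pandas. The three toy rows and the column names (date, price, zipcode) are illustrative assumptions, not values taken from the project itself.

```python
import pandas as pd

# Toy rows standing in for the real file; names and values are assumptions.
raw = pd.DataFrame({
    "date": ["10/13/2014", "12/9/2014", "2/25/2015"],
    "price": ["221900", "538000", "180000"],
    "zipcode": [98178, 98125, 98028],
})

# See how each column was interpreted on load.
print(raw.dtypes)

# Convert each column to the type we intend it to have.
clean = raw.assign(
    date=pd.to_datetime(raw["date"], format="%m/%d/%Y"),  # text -> datetime
    price=pd.to_numeric(raw["price"]),                    # text -> number
    zipcode=raw["zipcode"].astype("category"),            # number -> label
)
print(clean.dtypes)
```

Treating zipcode as a category is one possible choice: the digits are labels, so a regression should not read them as a quantity.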
Deal with Missing Data
Some rows are missing values. Incomplete data causes problems when running statistical analysis, so we decide how best to handle the missing values on a case-by-case basis.
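The case-by-case idea can be sketched as follows, assuming pandas. The columns, the "?" placeholder, and the rule chosen for each column are hypothetical illustrations; the right rule depends on what each column means.

```python
import numpy as np
import pandas as pd

# Toy data with three different kinds of gaps (values are made up).
houses = pd.DataFrame({
    "price": [221900.0, 538000.0, np.nan, 604000.0],
    "waterfront": [0.0, np.nan, 0.0, 1.0],
    "sqft_basement": ["0.0", "?", "300.0", "910.0"],
})

# Count missing values per column before deciding anything.
print(houses.isna().sum())

# Case 1: the target is missing, so the row cannot be used for training.
houses = houses.dropna(subset=["price"]).copy()

# Case 2: a binary flag is missing; here we assume the common value, 0.
houses["waterfront"] = houses["waterfront"].fillna(0.0)

# Case 3: a placeholder string hides a gap; coerce it to NaN, then
# fill with the median of the remaining values.
basement = pd.to_numeric(houses["sqft_basement"], errors="coerce")
houses["sqft_basement"] = basement.fillna(basement.median())

print(houses)
```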
Check for Multicollinearity
We make sure that each predictor we use describes price and not another predictor. If two of our predictors are highly correlated, it is hard to determine which one is affecting the price.
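One way to spot highly correlated predictors is a pairwise correlation matrix. The sketch below uses synthetic data; the column names and the 0.75 cutoff are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic predictors: sqft_above is built to track sqft_living closely.
rng = np.random.default_rng(0)
sqft_living = rng.uniform(500, 4000, size=200)
predictors = pd.DataFrame({
    "sqft_living": sqft_living,
    "sqft_above": sqft_living * 0.8 + rng.normal(0, 50, size=200),
    "yr_built": rng.integers(1900, 2015, size=200),
})

# Absolute pairwise Pearson correlations between predictors.
corr = predictors.corr().abs()

# Report each pair once if its correlation exceeds the cutoff.
threshold = 0.75  # assumed cutoff, not a universal rule
high_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]
print(high_pairs)
```

For each flagged pair, one of the two predictors is a candidate for removal, since both carry nearly the same information.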
Check Model Assumptions
We want to be sure that our data fulfills the assumptions of linear regression: a linear relationship between predictors and price, independent errors, constant error variance, and normally distributed residuals. If our data does not satisfy these assumptions, we need to transform it appropriately so that we can build a statistically valid model.
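As an illustration of what "transform our data" can mean, the sketch below applies a log transform to a right-skewed variable. The synthetic lognormal "price" is an assumption for demonstration; whether a log transform is appropriate for the real columns has to be checked against the real data.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed prices (lognormal by construction).
rng = np.random.default_rng(1)
price = pd.Series(rng.lognormal(mean=13.0, sigma=0.5, size=1000))

# A log transform pulls in the long right tail.
log_price = np.log(price)

# Skewness near 0 indicates a roughly symmetric distribution.
print("skew before:", price.skew())
print("skew after: ", log_price.skew())
```

If the target is modeled on the log scale, predictions must be exponentiated to get back to prices.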
Test Our Model
We want to ensure that our model is actually predicting results, so we run several tests to protect us from creating a model that works only on the data we currently have.
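A common test of this kind is a holdout split: fit on one part of the data and score on rows the model has never seen. The sketch below uses NumPy and synthetic data; the 80/20 split and the helper names add_intercept and r_squared are assumptions, not the project's own code.

```python
import numpy as np

# Synthetic data with a known linear relationship plus noise.
rng = np.random.default_rng(42)
n = 300
X = rng.normal(size=(n, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Shuffle the row indices, then hold out 20% for testing.
idx = rng.permutation(n)
train, test = idx[:240], idx[240:]

def add_intercept(features):
    """Prepend a column of ones so the model can learn an intercept."""
    return np.column_stack([np.ones(len(features)), features])

def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by y_pred."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Fit ordinary least squares on the training rows only.
coef, *_ = np.linalg.lstsq(add_intercept(X[train]), y[train], rcond=None)

r2_train = r_squared(y[train], add_intercept(X[train]) @ coef)
r2_test = r_squared(y[test], add_intercept(X[test]) @ coef)
print("train R^2:", r2_train)
print("test R^2: ", r2_test)
```

A test score far below the training score would suggest the model has memorized the training data rather than learned a general relationship.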
Remove Inconsequential Predictors
We figure out which predictors don’t actually influence our model and remove them from the equation to keep everything as simple as possible.
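One possible way to judge whether a predictor influences the model is the t-statistic of its coefficient. The sketch below computes these by hand with NumPy on synthetic data. The predictor names, the third "noise_feature" column, and the 1.96 cutoff (roughly a 5% significance level) are all assumptions for illustration.

```python
import numpy as np

# Synthetic data: y depends on the first two columns, not the third.
rng = np.random.default_rng(7)
n = 500
names = ["sqft_living", "grade", "noise_feature"]
X = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=n)

# Ordinary least squares with an intercept column.
design = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

# Standard errors from the residual variance and (X'X)^-1.
residuals = y - design @ coef
dof = n - design.shape[1]
sigma2 = residuals @ residuals / dof
cov = sigma2 * np.linalg.inv(design.T @ design)
t_stats = coef / np.sqrt(np.diag(cov))

# Keep predictors whose coefficient is clearly different from zero.
kept = [name for name, t in zip(names, t_stats[1:]) if abs(t) > 1.96]
for name, t in zip(names, t_stats[1:]):
    print(f"{name}: t = {t:.2f}")
print("kept:", kept)
```

Predictors with a small absolute t-statistic are candidates for removal; after dropping any, the model should be refit and rechecked.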
Our final model explains 98.9% of the variation in our dataset.