Kaggle House Prices Competition - Regression Models
As a further part of my machine learning exploration and training I decided to tackle a regression prediction problem (as opposed to a classification one). Kaggle's house prices playground competition provides that opportunity. The walk-through below takes you through my solution, which ultimately uses a linear regression model to predict house prices.
Introduction
The object of the competition is to use the data provided to train a model that predicts how much a house in Ames, Iowa would sell for. There are 80 feature variables to choose from, relating to the house's location, size, number of fireplaces, basement quality, age, and so on. The training set contains observed sale prices for 1,460 houses in Ames built between 1872 and 2010.
This tutorial/walk-through uses R, so if you are following along the first thing you need to do (if you have not done so already) is install it. This is very easy to do, and there is only one dominant IDE you need to worry about: R itself is available on the CRAN website, and the RStudio IDE is available for download on the RStudio website. Once R is installed you can use the code below to train the models and predict house prices in Kaggle's test set.
1. Required Packages
In order to perform the analysis below, first install the required packages (dplyr, caret and party) as shown below.
###############################################################################
## House Prices: Advanced Regression Techniques
###############################################################################
# Purpose:
#   To submit a prediction to kaggle for house prices based on
#   Ames house price training dataset
#
# Inputs:
#   - 'C:/Users/craig/Google Drive/201705/0526hous/cache/train.csv'
#   - 'C:/Users/craig/Google Drive/201705/0526hous/cache/test.csv'
#
# Anatomy:
#   1. Required Packages
#   2. Combine Train/Test
#   3. Analyse the Data
#   4. Engineer Variables
#   5. Train a Forest Model
#   6. Score Forest Model
#   7. Train Linear Model
#   8. Score Linear Model
#   9. Make Submission

###1. Required Packages
install.packages('dplyr')
install.packages('caret')
install.packages('party')
2. Combine Train/Test
You next need to import the train and test data to R. The first step is to download the housing datasets in .csv format from Kaggle's house price competition site. After doing this I have used the setwd function (as shown below) to point to the directory where the train.csv and test.csv files are located. Setwd stands for set working directory. So we are telling R that this is the default directory to use for reading/writing files to/from R in this session.
###2. Combine Train/Test
setwd('C:/Users/craig/Google Drive/201705/0526hous/cache/')
ktrain<-read.csv('train.csv')
ktest<-read.csv('test.csv')

#create saleprice variable in test data
ktest$SalePrice<-NA
bs<-rbind(ktrain,ktest)

dim(ktrain)
table(ktrain$YearBuilt)
The subsequent steps in the above code read the train.csv dataset into an R dataframe called ktrain and then repeat this for the test data, creating ktest. In order to properly clean the data it is easier if the train and test sets are appended together into one dataframe (bs), so this is also done in the above snippet. Since SalePrice is the target variable and is missing from ktest, I have added it as NA (missing) before the rbind.
3. Analyse the Data
With the data imported, the next step is to explore it.
###3. Analyse the Data
#general summary
dim(bs)
summary(bs)

#show freq counts for all variables where <20 categories in the variable
lapply(bs, function(var) {
  if(dim(table(var))<20) table(var)#if there are <20 items then show table
})
#<--although this helps me get a better sense of the data from a univariate
#perspective, there are many trade offs I'm not sure about and I am basically
#discarding few variables that clearly have no value. A better approach I think
#would be to use a random forest with pretty full growth then assess variable
#importances after that analysis and see if the results can be used to build a
#better linear (or other?) model off the back of it-->
What I have done above is look at the dataset dimensions (rows * cols) and then produce a summary of each variable in the dataset. Because some numeric variables are more like categoricals, I have used lapply to view all variables that have fewer than 20 'levels' or distinct items. For example, MSSubClass is numeric, but each number in that variable actually corresponds to a dwelling type. So it is not a continuous variable.
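To illustrate that point, here is a quick check of my own (not part of the original script) showing how MSSubClass comes in from read.csv; the exact output depends on the data but the idea is that an integer column hides a small set of dwelling-type codes.

#quick check (not in the original script): MSSubClass is read in as an integer
#but only takes a handful of distinct dwelling-type codes
class(ktrain$MSSubClass)           #should return "integer"
sort(unique(ktrain$MSSubClass))    #the distinct dwelling-type codes
length(unique(ktrain$MSSubClass))  #how many codes there are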
After spending some time going through the variables and making decisions about whether to discard them from the model or not, I decided (as is documented in the long comment above) to use a random forest model to help me decide on which features to keep or toss out.
##Push all vars into a random forest to see feature importances
#view mean saleprice by MSSubClass to determine if the variable's continuous
#property is a true representation
aggregate(data=ktrain,SalePrice~MSSubClass,mean)
#from the data dictionary it does not look like a true representation

#convert MSSubClass to a categorical variable
bs$MSSubClass<-as.factor(paste('c',as.character(bs$MSSubClass),sep=''))

#view number of levels for each categorical variable
sapply(bs[,sapply(bs,is.factor)],nlevels)
#there are no factors with >32 levels--OK!

#Replace NA's
nacounts<-sapply(bs, function(x) sum(length(which(is.na(x)))))
types<-sapply(bs,class)
cbind(types,nacounts)

##Loop through list of categoric vars and complete NA's with mode
#define mode2 function (for calculating a variable's mode)
mode2<-function(x){
  ux<-unique(x)
  ux[which.max(tabulate(match(x,ux)))]
}

#vars to complete with mode
nafacs<-sapply(bs[,sapply(bs,is.factor)],
               function(x) sum(length(which(is.na(x)))))
data.frame(nafacs)#view results
mvars<-nafacs[0<nafacs & nafacs<200]#vars with <200 NA's
#loop through mvars completing NA's with the variable's mode
for(varnm in names(mvars)){
  bs[,varnm][is.na(bs[,varnm])]<-mode2(bs[,varnm][!is.na(bs[,varnm])])
}

#vars with >=200 NA's
dvars<-nafacs[200<=nafacs]
dvarnms<-names(dvars)
bs[,dvarnms]<-NULL#drop vars
After deciding to use a forest as a short-cut to feature importances, the first thing I need to do is complete any variables that have missing observations. To do that I fill in the NA values of categorical variables with each variable's mode. Before doing that I want to encode the MSSubClass variable as a categorical variable. In the above snippet, the first thing I do is look at mean SalePrice by MSSubClass to see if there is a direct correlation between the numeric code and the price. There is not. To make sure the forest model does not treat MSSubClass as continuous, I prefixed each code with a 'c' and encoded the result as a factor.
Next I just want to make sure there are no factor variables with >32 levels since I know the forest model can't cope with that. There are none!
Next I need to start replacing NA's with the variable's mode. So I list all factor variables, define a mode function, then loop through the variables with <200 NA's and replace their NA's with each variable's mode. After consulting the attributes of the other variables with >=200 NA's, I decided to remove them at this point.
That is the categorical variables taken care of. But what about the numerics? I can't complete them with the mode--or at least it wouldn't make sense to, since they are continuous variables. So I will instead complete them with their median (that is, the median of the non-missing observations).

##Loop through numerics completing NA's with median
#vars to complete with median
nanums<-sapply(bs[,sapply(bs,is.numeric)],
               function(x) sum(length(which(is.na(x)))))
data.frame(nanums)#view results
mvars<-nanums[0<nanums]#vars with >0 NAs
mvars<-mvars[-12]#exclude saleprice here..

#loop through mvars completing NA's with median
for(varnm in names(mvars)){
  print(varnm)
  bs[,varnm][is.na(bs[,varnm])]<-median(bs[,varnm],na.rm=T)
}
In the snippet above, I first obtain a list of all numeric variables that have missing data. I store this in a vector called mvars and then exclude SalePrice from it. Then I loop through each variable and replace its NA values with the median of that variable's non-missing values.
Now I am ready to run the random forest model so I can start to see which features are more useful than others. Running a forest algorithm at this point is useful because it will use repeated random sampling of different combinations of variables that I supply to the algorithm, as well as varying samples of data from the training dataset provided, in order to determine which variables are most useful for making predictions. Each tree grown by the algorithm will use a different subset of variables and observations, so the model averages out feature performances across many different-shaped trees to overcome split biases expressed by any one tree. This makes the random forest useful for feature selection.
##Run the Random Forest to identify best variables
#prep data
t1<-subset(ktrain,select='Id')
tr1<-merge(bs,t1,by='Id')
tr1$Id<-NULL

#train forest model
library(party)
mdl_rf<-cforest(SalePrice~.,data=tr1,control=cforest_unbiased(ntree=200))

#view feature importances
library(dplyr)
vi1<-data.frame(varimp(mdl_rf))#get importances
vi1<-add_rownames(vi1,'var')#convert rownames to a column
names(vi1)[names(vi1)=='varimp.mdl_rf.']<-'varimp'#rename varimp column
vi1<-data.frame(vi1[with(vi1,order(-varimp, var)),])
vi1#view results
My next step is to train the forest model using the prepared features. The first step performed in the above snippet is to subset the combined dataset so that it only includes training observations. This way I keep all the cleaning work done so far, but only use rows where a SalePrice is observed. When it comes to actually training the model (mdl_rf) I have used the cforest function from the party package, specified SalePrice as the target variable, all available features (.) as the predictors, and tr1 as the training dataset, and I have specified that 200 trees are to be grown.
After running the model, feature importances are viewed with the help of the varimp() function. This provides a list of all variables and their importance for predicting SalePrice in the trained model. Note that OverallQual is the most predictive feature, while including YrSold actually decreases the model's predictive ability (on average). Several other variables (like Alley or PoolQC) make no difference to model performance at all. Note that this is after building 200 decision trees with varying combinations of all variables--so we should have a fairly good picture of which variables are useful or not useful.
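If you prefer a picture to a printed table, something like the sketch below (my addition, not part of the original script) plots the top of the importance list; it assumes vi1 exists and is sorted by descending importance, as in the snippet above.

#optional (not in the original script): visualise the top of the importance list
top15<-head(vi1,15)                       #vi1 is already sorted descending
par(mar=c(5,8,2,1))                       #widen left margin for variable names
barplot(rev(top15$varimp),
        names.arg=rev(top15$var),
        horiz=TRUE,las=1,
        xlab='permutation importance',
        main='Top 15 cforest features')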
4. Engineer Variables
However, what we have not yet done is engineer any new variables that may actually prove to be more predictive than the sum of their parts. It has been said that smart feature engineering is like combining atomic elements to produce useful compounds in chemistry. So I tried to think creatively about what combinations of features might boost my model's predictive capacity.
###4. Engineer Variables
##Encode Categoricals
#get all factors from bs
facs<-data.frame(names(bs)[sapply(bs,is.factor)])
names(facs)[1]<-'var'
#find the ones that are in the forest's top 20 features
merge(facs,vi1[0:20,],by='var')

#neighborhood
nbhds<-aggregate(data=bs,SalePrice~Neighborhood,mean)
names(nbhds)[2]<-'Neighborhood_cg'
bs<-merge(x=bs,y=nbhds,by='Neighborhood',all.x=T)

#external quality
bs$ExterQual_cg<-ifelse(bs$ExterQual=='Ex',5,
                 ifelse(bs$ExterQual=='Gd',4,
                 ifelse(bs$ExterQual=='TA',3,
                 ifelse(bs$ExterQual=='Fa',2,
                 ifelse(bs$ExterQual=='Po',1,0)))))

#basement quality
bs$BsmtQual_cg<-ifelse(bs$BsmtQual=='Ex',5,
                ifelse(bs$BsmtQual=='Gd',4,
                ifelse(bs$BsmtQual=='TA',3,
                ifelse(bs$BsmtQual=='Fa',2,
                ifelse(bs$BsmtQual=='Po',1,0)))))

#kitchen quality
bs$KitchenQual_cg<-ifelse(bs$KitchenQual=='Ex',5,
                   ifelse(bs$KitchenQual=='Gd',4,
                   ifelse(bs$KitchenQual=='TA',3,
                   ifelse(bs$KitchenQual=='Fa',2,
                   ifelse(bs$KitchenQual=='Po',1,0)))))

#dwelling type
dweltypes<-aggregate(data=bs,SalePrice~MSSubClass,mean)
names(dweltypes)[2]<-'MSSubClass_cg'
bs<-merge(x=bs,y=dweltypes,by='MSSubClass',all.x=T)

#garage interior finish
garagefin<-aggregate(data=bs,SalePrice~GarageFinish,mean)
names(garagefin)[2]<-'GarageFinish_cg'
bs<-merge(x=bs,y=garagefin,by='GarageFinish',all.x=T)

#foundation
foundations<-aggregate(data=bs,SalePrice~Foundation,mean)
names(foundations)[2]<-'Foundation_cg'
bs<-merge(bs,foundations,by='Foundation',all.x=T)

##Build New Variables
#age at time of sale
bs$saleage_cg<-bs$YrSold-bs$YearBuilt
#neighbourhood*size
bs$nbhdsize_cg<-bs$Neighborhood_cg*bs$GrLivArea
A linear regression model ultimately works on numeric predictors. So I needed to go through each important categorical variable and somehow encode it as numeric. However, the first part of the above snippet simply finds the factor variables I need to worry about--these are the factors in the top 20 important features as returned by the forest model above.
When faced with the problem of encoding Neighborhood as numeric I had three options: (1) set up dummy variables for each neighbourhood, (2) assign some ordinal scale to the variable, (3) dump it completely. I decided to look at the mean SalePrice per level and then use this to create a new variable called Neighborhood_cg. Thus, instead of a factor variable I now have a numeric variable where each 'level' is substituted with that level's mean SalePrice.
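For what it's worth, the same lookup-and-join could be written with dplyr (which is already installed in step 1). The sketch below is an equivalent alternative, not what the script above actually runs, which is why the join line is left commented out (running it after the aggregate/merge version would create a duplicate column).

#alternative sketch: build the Neighborhood mean-price lookup with dplyr
library(dplyr)
nbhd_means<-bs %>%
  group_by(Neighborhood) %>%
  summarise(Neighborhood_cg=mean(SalePrice,na.rm=TRUE)) #na.rm skips the NA test-set prices
#bs<-left_join(bs,nbhd_means,by='Neighborhood') #commented out: the merge above already did this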
I simply converted categoricals to numeric where an ordinal scale was implied. So, for example, BsmtQual (basement quality) is one of Ex, Gd, TA, Fa or Po. It is more informative to assign an ordinal value to the variable from 1-5, with Ex (excellent) being 5 and Po (poor) being 1. I repeated this for the other quality-scale variables (external and kitchen).
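As an aside, the nested ifelse() calls can be written more compactly with a named lookup vector. This is just a sketch of an alternative, not what the script above uses; the helper name to_ordinal is my own.

#alternative sketch: map the quality codes through a named vector instead of
#nested ifelse() calls (not what the original script uses)
qual_map<-c(Ex=5,Gd=4,TA=3,Fa=2,Po=1)
to_ordinal<-function(x){
  out<-unname(qual_map[as.character(x)])
  out[is.na(out)]<-0  #anything unmapped falls back to 0, like the final else above
  out
}
#e.g. bs$ExterQual_cg<-to_ordinal(bs$ExterQual)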
For the remaining factors in the top 20 most important features (MSSubClass, GarageFinish, and Foundation) I adopted a similar approach to that used with Neighbourhood.
Build New Variables
Two ideas I had for new features that may combine existing ones into something more valuable were 'age at time of sale' and 'neighbourhood*size'. I thought that perhaps if a house was brand new it may sell at a slight premium to comparable houses sold at a similar time. For neighbourhood*size--I thought that a combination of these two features might yield a good predictor of sale price. You can see above how the variables are built.
We now have a set of features that have been completed, cleaned, encoded, and engineered to some degree. I find that the process of feature engineering and modelling is a creative and iterative one. That is, you don't just do the cleaning and engineering once and then tweak a model; rather, I keep coming back to the earlier stages after choosing and tweaking a model, so that the entire program is continually refined in order to improve the model. However, with some preparation work done, we can now commence with training a model.
5. Train a Random Forest Model
###5. Train a Forest Model
##Define training data
#cut back the modified bs data to include only training observations
ktrain_id<-subset(ktrain,select='Id')
rf1_tr<-merge(bs,ktrain_id,by='Id')
use_vars<-names(rf1_tr)
rf1_tr<-subset(rf1_tr,select=use_vars)#note that use_vars is re-defined below!!!

#split further into train/test for true out of sample test
set.seed(1)
tr_index<-sample(nrow(rf1_tr),size=floor(.6*nrow(rf1_tr)))
rf1_tr_train<-rf1_tr[tr_index,]
rf1_tr_test<-rf1_tr[-tr_index,]

##Train cforest model with rf1_tr_train
mdl_rf1<-cforest(SalePrice~.,
                 data=rf1_tr_train,
                 control=cforest_unbiased(ntree=100,
                                          mtry=8,
                                          minsplit=30,
                                          minbucket=10))

#view feature importances
vi1<-data.frame(varimp(mdl_rf1))#get importances
vi1<-add_rownames(vi1,'var')#convert rownames to a column
names(vi1)[names(vi1)=='varimp.mdl_rf1.']<-'varimp'#rename varimp column
vi1<-data.frame(vi1[with(vi1,order(-varimp, var)),])
vi1#view results

#get list of vars that add to model
use_vars<-c(vi1[vi1$varimp>0,1][0:30],'SalePrice')
use_vars
Although our base dataset has been prepared for modelling, we still need to define the dataset used to train the model. So the first thing I do in the above snippet is define the variables and observations required for training, by merging the prepared data with the original train.csv dataset from kaggle (ktrain) on Id.
After doing this I elect to keep out a portion of the data to use as an out of sample test set, so I can see how well the trained model performs on a randomly split portion of the observations that were not used to train it. Doing this kind of test should give me an idea of how well the model can predict never-before-seen (out of sample) cases. We will see the results of that test shortly.
Next I use the cforest algorithm along with the rf1_tr_train dataframe to create a predictive model. I have chosen to grow only 100 trees, selecting 8 features per tree, splitting only when at least 30 observations can be found for the split, and only producing a leaf node containing at least 10 observations.
When the model has been trained, feature importances can be viewed and I have rolled the more important features into a list called use_vars. That list can then be used to run the model again if the 'Define Training Data' step above is re-run.
A random forest model has now been created and this can be used to make predictions; however, the next critical step before deciding to use the model is to score (evaluate/assess) it.
6. Score the Forest Model
###6. Score the Forest Model
#define log rmse function
rmse<-function(x,y){mean((log(x)-log(y))^2)^.5}

#define assess function
score_rf1<-function(df){
  preds_rf1<-predict(mdl_rf1,df,OOB=TRUE,type='response')
  sc1<<-data.frame(cbind(df,preds_rf1))
  names(sc1)[names(sc1)=='SalePrice.1']<<-'preds_rf1'
  rmse(sc1$SalePrice,sc1$preds_rf1)
}

#gauge accuracy
score_rf1(rf1_tr_train)#in-sample observations
score_rf1(rf1_tr_test)#out of sample observations
The measure Kaggle use for scoring this competition is the root mean squared logarithmic error. It involves taking the log of the predicted sale price, subtracting the log of the actual sale price, and squaring the difference so that you have a positive number; you then take the mean of all these squared differences, and finally the square root of that mean. The lower the score the better. I have written a function (rmse) that calculates this score from two vectors (x and y).
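To get a feel for the scale of the metric, here is a toy calculation with made-up numbers (not from the competition data): two houses sell for 200,000 and 100,000 and the model predicts 210,000 and 95,000.

#toy example with made-up numbers to show the scale of the log-RMSE metric
rmse<-function(x,y){mean((log(x)-log(y))^2)^.5}
rmse(c(200000,100000),c(210000,95000))
#returns approximately 0.05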
I have also scripted a function that generates predictions using the trained model (mdl_rf1) and then calls the rmse function to return a score for a given set of observations. This function is then called twice--once using the same data that was used to train the forest model, and once for the 'left out' observations that were not included when training the model.
The score will usually be better on the training data (i.e. rf1_tr_train here) because that is the dataset used to train the model and is therefore exposed to overfitting. The result based on the test data (rf1_tr_test here) should be a more accurate measure of how well the prediction would perform on a completely new set of data (or Kaggle's test.csv file here). If the training score is very accurate and the test score is very inaccurate, this is a sign that the model is probably overfitted to the training data, making it great at predicting on training observations but much less so on out of sample observations. Adjusting the minsplit or minbucket control parameters in the model can help to make the model more general and avoid the overfitting problem.
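A simple way to probe this (my addition, not part of the original script) is to re-train the forest with a few different minbucket values and watch how the train and test scores move. The sketch below assumes the rf1_tr_train/rf1_tr_test split and the score_rf1() function defined above.

#optional sketch (not in the original script): vary minbucket and compare
#in-sample vs out-of-sample log-RMSE; note this overwrites mdl_rf1, so re-run
#the original training step afterwards if you want the minbucket=10 model back
library(party)
for(mb in c(5,10,20)){
  mdl_rf1<-cforest(SalePrice~.,data=rf1_tr_train,
                   control=cforest_unbiased(ntree=100,mtry=8,
                                            minsplit=30,minbucket=mb))
  cat('minbucket =',mb,
      '| train:',round(score_rf1(rf1_tr_train),4),
      '| test:',round(score_rf1(rf1_tr_test),4),'\n')
}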
The above techniques are the nuts and bolts of the machine learning process here: data pre-processing, feature engineering, selecting an algorithm, splitting the training data, then training and evaluating a model. However, at this point I would also like to try a linear model to see how the results compare.
7/8. Train and Score a Linear Model
###7. Train a Linear Model
##Define training data
#cut back the modified bs data to include only training observations
ktrain_id<-subset(ktrain,select='Id')
lm_tr<-merge(bs,ktrain_id,by='Id')
lm_tr$log_SalePrice<-log(lm_tr$SalePrice)

#subset to include numerics only
nums_use<-names(lm_tr)[sapply(lm_tr,is.numeric)]
nums_use<-c('LotArea','OverallQual','OverallCond','BsmtFinSF1','BsmtUnfSF',
            'X1stFlrSF','X2ndFlrSF','Fireplaces','ScreenPorch',
            'Neighborhood_cg','KitchenQual_cg','nbhdsize_cg','LotFrontage',
            'BsmtFinSF2','BsmtQual_cg','log_SalePrice')

#highly correlated features
library(caret)
write.csv(cor(lm_tr[,nums_use]),file='cor.csv')
#<--view correlations in excel-->
findCorrelation(cor(lm_tr[,nums_use]))
findLinearCombos(lm_tr[,nums_use])
lm_tr<-subset(lm_tr,select=nums_use)

#split further into train/test for true out of sample test
set.seed(1)
tr_index<-sample(nrow(lm_tr),size=floor(.6*nrow(lm_tr)))
lm_tr_train<-lm_tr[tr_index,]
lm_tr_test<-lm_tr[-tr_index,]

##Train a linear model
mdl_lm<-lm(log_SalePrice~.,data=lm_tr_train)
summary(mdl_lm)#view initial results

###8. Score the Linear Model
#define log rmse function
rmse<-function(x,y){mean((log(x)-log(y))^2)^.5}

#define assess function
score_lm<-function(df){
  preds<-predict(mdl_lm,df,type='response')
  preds<-exp(preds)
  df$SalePrice<-exp(df$log_SalePrice)
  sc2<<-data.frame(cbind(df,preds))
  rmse(sc2$SalePrice,sc2$preds)
}

#gauge accuracy
score_lm(lm_tr_train)#in-sample observations
score_lm(lm_tr_test)#out of sample observations
The above modelling process and resulting scores should not require much additional explanation. I have used the same technique of leaving out some observations to get an out of sample estimate, but the algorithm is a basic least-squares multiple linear regression. I used the summary function to see feature significances. The linear modelling approach does not require growing trees or splitting predictors, etc. When defining which features to use for training I limited the candidates to numeric variables, having already encoded the important categoricals as numerics above. Note also that the model is trained on log(SalePrice), so predictions are exponentiated back to dollar values before scoring. I have also included some functions to try to weed out highly correlated features and features that are linear combinations of each other--but both of these steps led to no remedial action (i.e. I didn't remove anything as a result of finding that any two features were highly correlated or were linear combinations of each other!).
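One extra check that I find worthwhile (not in the original script) is the standard set of lm diagnostic plots, which give a quick read on whether the residuals from the log-price model look reasonably well behaved.

#optional (not in the original script): standard lm diagnostic plots for the
#log-price model--residuals vs fitted, normal Q-Q, scale-location, leverage
par(mfrow=c(2,2))
plot(mdl_lm)
par(mfrow=c(1,1))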
However, after scoring the linear model's predictions I found that the out of sample result was slightly better than that of the cforest model. So, I decided to use it to predict on Kaggle's test.csv file for this competition.
9. Make Submission
###9. Make Submission
#get ktest-altered data
t1<-subset(ktest,select='Id')
bs_ts<-merge(t1,bs,by='Id')
preds<-exp(predict(mdl_lm,bs_ts,type='response'))
submit<-data.frame(Id=ktest$Id,SalePrice=preds)
write.csv(submit,file='submit.csv',row.names=FALSE)
The last thing to do once a model is selected is to use it to predict on the true test data (i.e. Kaggle's test set--not my out of sample test sets!). And at this point I will not pretend I didn't make several submissions in this way before arriving at the code used in this tutorial :).
The above code just merges my cleaned base dataset to the original Kaggle test data (ktest) on Id. This makes sure I only have test cases in my submit dataframe. Then I predict SalePrice using the predict function and the linear model trained earlier (mdl_lm) and write the resulting dataframe to submit.csv in the working directory. This can then be uploaded to Kaggle's House Prices: Advanced Regression Techniques site.
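Before uploading, it is worth a quick sanity check (my addition, not part of the original code) that the submission has the expected shape.

#quick sanity checks before uploading (not in the original script)
nrow(submit)==nrow(ktest)    #expect TRUE: one prediction per test Id
sum(is.na(submit$SalePrice)) #expect 0: no missing predictions
head(submit)                 #eyeball the first few rows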
House Prices: Advanced Regression Techniques (In Full)
All the above code is contained in the below block in full for convenience.

###############################################################################
## House Price Predictions
###############################################################################
# Purpose:
#   To submit a credible prediction to kaggle for house prices based on
#   Ames house price training dataset
#
# Inputs:
#   - 'C:/Users/craig/Google Drive/201705/0526hous/cache/train.csv'
#   - 'C:/Users/craig/Google Drive/201705/0526hous/cache/test.csv'
#
# Anatomy:
#   1. Required Packages
#   2. Combine Train/Test
#   3. Analyse the Data
#   4. Engineer Variables
#   5. Train a Forest Model
#   6. Score Forest Model
#   7. Train Linear Model
#   8. Score Linear Model
#   9. Make Submission

###1. Required Packages
#install.packages('dplyr')
#install.packages('caret')
#install.packages('party')

###2. Combine Train/Test
setwd('C:/Users/craig/Google Drive/201705/0526hous/cache/')
ktrain<-read.csv('train.csv')
ktest<-read.csv('test.csv')

#create saleprice variable in test data
ktest$SalePrice<-NA
bs<-rbind(ktrain,ktest)

dim(ktrain)
table(ktrain$YearBuilt)

###3. Analyse the Data
#general summary
dim(bs)
summary(bs)

#show freq counts for all variables where <20 categories in the variable
lapply(bs, function(var) {
  if(dim(table(var))<20) table(var)#if there are <20 items then show table
})
#<--although this helps me get a better sense of the data from a univariate
#perspective, there are many trade offs I'm not sure about and I am basically
#discarding few variables that clearly have no value. A better approach I think
#would be to use a random forest with pretty full growth then assess variable
#importances after that analysis and see if the results can be used to build a
#better linear (or other?) model off the back of it-->

##Push all vars into a random forest to see feature importances
#view mean saleprice by MSSubClass to determine if the variable's continuous
#property is a true representation
aggregate(data=ktrain,SalePrice~MSSubClass,mean)
#from the data dictionary it does not look like a true representation

#convert MSSubClass to a categorical variable
bs$MSSubClass<-as.factor(paste('c',as.character(bs$MSSubClass),sep=''))

#view number of levels for each categorical variable
sapply(bs[,sapply(bs,is.factor)],nlevels)
#there are no factors with >32 levels--OK!

#Replace NA's
nacounts<-sapply(bs, function(x) sum(length(which(is.na(x)))))
types<-sapply(bs,class)
cbind(types,nacounts)

##Loop through list of categoric vars and complete NA's with mode
#define mode2 function (for calculating a variable's mode)
mode2<-function(x){
  ux<-unique(x)
  ux[which.max(tabulate(match(x,ux)))]
}

#vars to complete with mode
nafacs<-sapply(bs[,sapply(bs,is.factor)],
               function(x) sum(length(which(is.na(x)))))
data.frame(nafacs)#view results
mvars<-nafacs[0<nafacs & nafacs<200]#vars with <200 NA's
#loop through mvars completing NA's with the variable's mode
for(varnm in names(mvars)){
  bs[,varnm][is.na(bs[,varnm])]<-mode2(bs[,varnm][!is.na(bs[,varnm])])
}

#vars with >=200 NA's
dvars<-nafacs[200<=nafacs]
dvarnms<-names(dvars)
bs[,dvarnms]<-NULL#drop vars

##Loop through numerics completing NA's with median
#vars to complete with median
nanums<-sapply(bs[,sapply(bs,is.numeric)],
               function(x) sum(length(which(is.na(x)))))
data.frame(nanums)#view results
mvars<-nanums[0<nanums]#vars with >0 NAs
mvars<-mvars[-12]#exclude saleprice here..

#loop through mvars completing NA's with median
for(varnm in names(mvars)){
  print(varnm)
  bs[,varnm][is.na(bs[,varnm])]<-median(bs[,varnm],na.rm=T)
}

##Run the Random Forest to identify best variables
#prep data
t1<-subset(ktrain,select='Id')
tr1<-merge(bs,t1,by='Id')
tr1$Id<-NULL

#train forest model
library(party)
mdl_rf<-cforest(SalePrice~.,data=tr1,control=cforest_unbiased(ntree=200))

#view feature importances
library(dplyr)
vi1<-data.frame(varimp(mdl_rf))#get importances
vi1<-add_rownames(vi1,'var')#convert rownames to a column
names(vi1)[names(vi1)=='varimp.mdl_rf.']<-'varimp'#rename varimp column
vi1<-data.frame(vi1[with(vi1,order(-varimp, var)),])
vi1#view results

###4. Engineer Variables
##Encode Categoricals
#get all factors from bs
facs<-data.frame(names(bs)[sapply(bs,is.factor)])
names(facs)[1]<-'var'
#find the ones that are in the forest's top 20 features
merge(facs,vi1[0:20,],by='var')

#neighborhood
nbhds<-aggregate(data=bs,SalePrice~Neighborhood,mean)
names(nbhds)[2]<-'Neighborhood_cg'
bs<-merge(x=bs,y=nbhds,by='Neighborhood',all.x=T)

#external quality
bs$ExterQual_cg<-ifelse(bs$ExterQual=='Ex',5,
                 ifelse(bs$ExterQual=='Gd',4,
                 ifelse(bs$ExterQual=='TA',3,
                 ifelse(bs$ExterQual=='Fa',2,
                 ifelse(bs$ExterQual=='Po',1,0)))))

#basement quality
bs$BsmtQual_cg<-ifelse(bs$BsmtQual=='Ex',5,
                ifelse(bs$BsmtQual=='Gd',4,
                ifelse(bs$BsmtQual=='TA',3,
                ifelse(bs$BsmtQual=='Fa',2,
                ifelse(bs$BsmtQual=='Po',1,0)))))

#kitchen quality
bs$KitchenQual_cg<-ifelse(bs$KitchenQual=='Ex',5,
                   ifelse(bs$KitchenQual=='Gd',4,
                   ifelse(bs$KitchenQual=='TA',3,
                   ifelse(bs$KitchenQual=='Fa',2,
                   ifelse(bs$KitchenQual=='Po',1,0)))))

#dwelling type
dweltypes<-aggregate(data=bs,SalePrice~MSSubClass,mean)
names(dweltypes)[2]<-'MSSubClass_cg'
bs<-merge(x=bs,y=dweltypes,by='MSSubClass',all.x=T)

#garage interior finish
garagefin<-aggregate(data=bs,SalePrice~GarageFinish,mean)
names(garagefin)[2]<-'GarageFinish_cg'
bs<-merge(x=bs,y=garagefin,by='GarageFinish',all.x=T)

#foundation
foundations<-aggregate(data=bs,SalePrice~Foundation,mean)
names(foundations)[2]<-'Foundation_cg'
bs<-merge(bs,foundations,by='Foundation',all.x=T)

##Build New Variables
#age at time of sale
bs$saleage_cg<-bs$YrSold-bs$YearBuilt
#neighbourhood*size
bs$nbhdsize_cg<-bs$Neighborhood_cg*bs$GrLivArea

###5. Train a Forest Model
##Define training data
#cut back the modified bs data to include only training observations
ktrain_id<-subset(ktrain,select='Id')
rf1_tr<-merge(bs,ktrain_id,by='Id')
use_vars<-names(rf1_tr)
rf1_tr<-subset(rf1_tr,select=use_vars)#note that use_vars is re-defined below!!!

#split further into train/test for true out of sample test
set.seed(1)
tr_index<-sample(nrow(rf1_tr),size=floor(.6*nrow(rf1_tr)))
rf1_tr_train<-rf1_tr[tr_index,]
rf1_tr_test<-rf1_tr[-tr_index,]

##Train cforest model with rf1_tr_train
mdl_rf1<-cforest(SalePrice~.,
                 data=rf1_tr_train,
                 control=cforest_unbiased(ntree=100,
                                          mtry=8,
                                          minsplit=30,
                                          minbucket=10))

#view feature importances
vi1<-data.frame(varimp(mdl_rf1))#get importances
vi1<-add_rownames(vi1,'var')#convert rownames to a column
names(vi1)[names(vi1)=='varimp.mdl_rf1.']<-'varimp'#rename varimp column
vi1<-data.frame(vi1[with(vi1,order(-varimp, var)),])
vi1#view results

#get list of vars that add to model
use_vars<-c(vi1[vi1$varimp>0,1][0:30],'SalePrice')
use_vars

###6. Score the Forest Model
#define log rmse function
rmse<-function(x,y){mean((log(x)-log(y))^2)^.5}

#define assess function
score_rf1<-function(df){
  preds_rf1<-predict(mdl_rf1,df,OOB=TRUE,type='response')
  sc1<<-data.frame(cbind(df,preds_rf1))
  names(sc1)[names(sc1)=='SalePrice.1']<<-'preds_rf1'
  rmse(sc1$SalePrice,sc1$preds_rf1)
}

#gauge accuracy
score_rf1(rf1_tr_train)#in-sample observations
score_rf1(rf1_tr_test)#out of sample observations

###7. Train a Linear Model
##Define training data
#cut back the modified bs data to include only training observations
ktrain_id<-subset(ktrain,select='Id')
lm_tr<-merge(bs,ktrain_id,by='Id')
lm_tr$log_SalePrice<-log(lm_tr$SalePrice)

#subset to include numerics only
nums_use<-names(lm_tr)[sapply(lm_tr,is.numeric)]
nums_use<-c('LotArea','OverallQual','OverallCond','BsmtFinSF1','BsmtUnfSF',
            'X1stFlrSF','X2ndFlrSF','Fireplaces','ScreenPorch',
            'Neighborhood_cg','KitchenQual_cg','nbhdsize_cg','LotFrontage',
            'BsmtFinSF2','BsmtQual_cg','log_SalePrice')

#highly correlated features
library(caret)
write.csv(cor(lm_tr[,nums_use]),file='cor.csv')
#<--view correlations in excel-->
findCorrelation(cor(lm_tr[,nums_use]))
findLinearCombos(lm_tr[,nums_use])
lm_tr<-subset(lm_tr,select=nums_use)

#split further into train/test for true out of sample test
set.seed(1)
tr_index<-sample(nrow(lm_tr),size=floor(.6*nrow(lm_tr)))
lm_tr_train<-lm_tr[tr_index,]
lm_tr_test<-lm_tr[-tr_index,]

##Train a linear model
mdl_lm<-lm(log_SalePrice~.,data=lm_tr_train)
summary(mdl_lm)#view initial results

###8. Score the Linear Model
#define log rmse function
rmse<-function(x,y){mean((log(x)-log(y))^2)^.5}

#define assess function
score_lm<-function(df){
  preds<-predict(mdl_lm,df,type='response')
  preds<-exp(preds)
  df$SalePrice<-exp(df$log_SalePrice)
  sc2<<-data.frame(cbind(df,preds))
  rmse(sc2$SalePrice,sc2$preds)
}

#gauge accuracy
score_lm(lm_tr_train)#in-sample observations
score_lm(lm_tr_test)#out of sample observations

###9. Make Submission
#get ktest-altered data
t1<-subset(ktest,select='Id')
bs_ts<-merge(t1,bs,by='Id')
preds<-exp(predict(mdl_lm,bs_ts,type='response'))
submit<-data.frame(Id=ktest$Id,SalePrice=preds)
write.csv(submit,file='submit.csv',row.names=FALSE)