Python Titanic Machine Learning Walkthrough

I have recently been getting to grips with machine learning techniques. Coursera's Data Science courses (Johns Hopkins) and R were where I started--quickly followed by Kaggle's Titanic competition. I initially tried using Excel VBA to construct what I now understand is a basic decision tree classifier. However, this was very labour intensive and so I started exploring R.

Then I got busy with a challenging project at work as well as my Base SAS certifications... until a few months ago when I again resumed the challenge and found an absolute gem of a tutorial for starting machine learning in R. I cannot praise Trevor Stephens and his excellent tutorial enough for their lucidity and beginner-friendly approach that re-invigorated me to get on my journey with machine learning! Thank you Trevor! :)

After completing that tutorial on the Titanic competition I set about replicating the effort from scratch in R with a goal of improving my score on Kaggle's leaderboard.

Around this time (a couple of months ago) I got involved with a predictive modelling project at work and had an opportunity to use Python. Consequently, I have replicated the effort in Python and included a fairly basic attempt at beating my previous score.

After much perseverance I did manage to improve the score I had achieved through the R tutorial. I will walk you through the Python code used to do so.

Installing Python

Before diving into the code I should mention how I installed Python on my laptop. You can get Python from its website (python.org), but you will want an IDE to work in. Initially I installed PyCharm, then ran into problems downloading scikit-learn and other packages. After some struggles with compatibility I simply downloaded Anaconda. This installs Python, the Spyder IDE, some other useful applications and all the packages needed to run my code--it takes ages to install but rest assured you will have everything!

When installing Anaconda I opted for Python 2.7 because I thought there would be better online community support for it; but I had some problems updating it, so I later cleaned it from my machine and installed Python 3.6 instead.

I will also mention a few things I didn't find initially but later gladly discovered about Spyder:

  1. F9 runs a highlighted section of code or the current line (i.e. the line where the cursor is). This is immensely useful.
  2. You can type #%% to create a cell, then hit Ctrl+Enter to run the current cell only.
  3. Ctrl+R is used for find and replace :)

Titanic Prediction Code

You can get everything you need to do the Titanic challenge on Kaggle's competitions website. The Titanic competition provides train and test datasets containing data on actual passengers of the Titanic (e.g. name, age, fare, cabin, gender). Basically your job is to predict whether each passenger in the test dataset survived, using a model built on the training dataset.

My Approach

The phases of my code follow in this order:

  1. Import Kaggle Data
  2. Explore the Datasets
  3. Wrangle Name
  4. Wrangle Cabin
  5. Wrangle Ticket
  6. Wrangle Age
  7. Wrangle Sex
  8. Complete Fare and Embarked
  9. Complete Age
  10. Complete cabin_ltr
  11. Create Family Features
  12. Complete Survived (Decision Tree Method)
  13. Complete Survived (Random Forest Method)
  14. Prepare Kaggle Submission

1. Import Kaggle Data

### 1. Import Kaggle Data
import os #import the os library to use the getcwd and chdir commands
print(os.getcwd()) #prints the current working directory
os.chdir('C:\\Users\\craig\\Google Drive\\201704\\0408pyth\\cache')#changes the
		#current working directory

import pandas as pd #import pandas library - needed for read_csv
ktrain=pd.read_csv('train.csv')#read in train data
ktest=pd.read_csv('test.csv')#read in test data
type(ktest)#check ktest is a data frame
ktest.shape#check rows and columns in ktest

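bs1=pd.concat([ktrain,ktest],sort=True,ignore_index=True)#concatenate the data
	#sort=True orders the columns alphabetically (the column numbers used in
	#later sections assume this); ignore_index=True gives each row a unique index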
bs1.shape#check dimensions of bs1

#check that Survived has been concatenated properly
bs1.head(n=5)
bs1.tail(n=5)
#it is concatenated as NaN--good!
				

Commentary: The os module lets you interact with the operating system from Python. Here I use it to change the working directory to the 0408pyth/cache folder.

The pandas package is used extensively later to manipulate data. Here it is used for its read_csv function, which reads in the train and test datasets from Kaggle. Note--I had previously downloaded the csv files from the Kaggle competitions site to my cache folder.

After loading the train and test data I concatenate (stack) them on top of each other to create a combined base dataset to work with.
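As a small illustration of what the stacking step does, here is a sketch using two made-up frames in place of train.csv and test.csv (the toy_ names and values are mine, not part of the Kaggle data). It shows that columns missing from the test frame, such as Survived, come through as NaN.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (values are made up)
toy_train = pd.DataFrame({'PassengerId': [1, 2, 3],
                          'Survived': [0, 1, 1],
                          'Fare': [7.25, 71.28, 7.92]})
toy_test = pd.DataFrame({'PassengerId': [4, 5],
                         'Fare': [8.05, 13.00]})

# Stack the two frames; columns missing from one frame are filled with NaN
toy_all = pd.concat([toy_train, toy_test], ignore_index=True)

print(toy_all.shape)                       # rows from both frames
print(toy_all['Survived'].isnull().sum())  # one NaN per test row
```

Filtering on Survived being null (or not null) is then enough to recover the test (or train) rows later on.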

2. Explore the Datasets

###2. Explore the Dataset
bs1.info()#data frame variables, missing counts, and types
bs1.describe()#view summary of numeric variables
bs1.describe(include=['O'])#view summary of non-numeric variables

#view null counts per variable
pd.isnull(bs1).sum()

#see avg numerics by survived
bs1.groupby('Survived').mean(numeric_only=True)#numeric columns only
#Fare, Parch, and Pclass seem most predictive features here

#mean of survived by all categorical variables in turn
feats=list(ktrain.select_dtypes(include=['O']))#get list of categorical feats
feats#view list of categorical feats
for feat in feats:#loop through each variable and show survived * feat
	print(ktrain[[feat,'Survived']].\
		groupby([feat]).\
		agg(['mean','count']))
#seems like there could be potential to split out:
	#title from name variable, 
	#cabin type from cabin,
	#and ticket type from ticket then drill down for further investigation
				

Commentary: Analyse the key attributes of the dataset variables (numeric and categorical variables). Assess null counts.

Look at the average numeric variables by Survived to see if any predictive potential is apparent.

Consider the count and average of Survived for each level of each categorical variable--to gauge whether any apparent predictive potential exists within the categorical variables. It seems like title could be extracted from the name variable, as well as the first letter from cabin and the first character from ticket.
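The mean-and-count summary used throughout this walkthrough can be sketched on a tiny made-up frame (the five rows below are invented purely for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male', 'male'],
                    'Survived': [0, 1, 1, 0, 1]})

# Mean of Survived is the survival rate; count shows how many rows back it up
summary = toy.groupby('Sex')['Survived'].agg(['mean', 'count'])
print(summary)
```

Reading the count alongside the mean matters: a high survival rate based on two or three passengers is not much evidence, which is why small groups get merged in the later sections.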

3. Wrangle Name

###3. Wrangle Name
bs1.shape#indicates there are 12 columns

#extract Title from Name
bs1['title']=bs1.Name.str.extract(r'([A-Za-z]+)\.',expand=False)#regex
   #extraction of title (expand=False returns a Series)
   #logic as follows:
	   #[A-Za-z]: matches any letter
	   #+: preceding set repeated one or more times
	   #\.: followed by a literal dot
bs1.shape
bs1.sample(n=10)

#view the new average survival by title table (Note: NaNs are excluded from 
#mean calc)
bs1[['title','Survived']].\
	groupby(['title']).\
	agg(['mean','count']).\
	sort_values(by=('Survived','mean'), ascending=False)

#combine titles into broader groups
bs1['title']=bs1['title'].replace(['Mme','Countess','Lady','Dona'],'Mrs')#
	#replace Mme, Countess, Lady, and Dona with Mrs
bs1['title']=bs1['title'].replace(['Mlle','Ms'],'Miss')#replace Mlle and Ms
	#with Miss
bs1['title']=bs1['title'].replace(['Sir','Col','Capt','Rev','Don','Jonkheer',\
	 'Major','Dr'],'Mr')#replace rare male titles with Mr

#view the new average survival by title table (Note: NaNs are excluded from 
#mean calc)
bs1[['title','Survived']].\
	groupby('title').\
	agg(['mean','count']).\
	sort_values(by=('Survived','mean'), ascending=False)

#create title2 variable to convert title to numeric (for regtree modelling)
bs1['title2']=bs1['title'].map({'Mrs':1,'Miss':2,'Master':3,'Mr':4})
				

Commentary: Extract title from the Name variable (using a regular expression), then view the average of Survived by title to look for predictive potential. There is clearly some correlation: Mrs shows a 79% survival rate against 16% for Mr. But some of the rare titles need to be combined to avoid overfitting downstream.

Encode title to numeric values so that the tree model can use it later.
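To see the title steps in isolation, here is a minimal sketch run on a handful of names written in the dataset's "Surname, Title. Given names" format. Only the Mme replacement is shown; the full grouping is in the code above.

```python
import pandas as pd

names = pd.Series(['Braund, Mr. Owen Harris',
                   'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
                   'Heikkinen, Miss. Laina',
                   'Palsson, Master. Gosta Leonard',
                   'Aubart, Mme. Leontine Pauline'])

# Capture the run of letters immediately before the first full stop
title = names.str.extract(r'([A-Za-z]+)\.', expand=False)

# Fold a rare title into a broader group, then encode as integers
title = title.replace(['Mme'], 'Mrs')
title2 = title.map({'Mrs': 1, 'Miss': 2, 'Master': 3, 'Mr': 4})
print(pd.DataFrame({'title': title, 'title2': title2}))
```

Any title missing from the mapping dictionary would come out as NaN, so it is worth checking value_counts() on the result before modelling.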

4. Wrangle Cabin

###4. Wrangle Cabin
bs1['cabin_ltr']=bs1['Cabin'].str[:1]#take first character from Cabin
bs1['cabin_ltr'].sample(n=10)#inspect new variable

#view summary of new variable
bs1[['cabin_ltr','Survived']].groupby(['cabin_ltr']).agg(['mean','count'])
#looks like there is some predictive value in cabin_ltr

#broaden out small groups so as to avoid overfitting
bs1['cabin_ltr']=bs1['cabin_ltr'].replace(['A','F','G','T'],'Other')

#encode numeric cabin_ltr
cabin_ltr_map={'D':1,'E':2,'B':3,'C':4,'Other':5}#create mapping dictionary
bs1['cabin_ltr2']=bs1['cabin_ltr'].map(cabin_ltr_map)
bs1['cabin_ltr2'].value_counts()#check has assigned correctly
				

Commentary: Create a new cabin_ltr variable by extracting the first letter from the Cabin variable. Then view the grouped impact on the Survived variable. Although there are a lot of NaNs here, the survival rate varies noticeably between letters (A=47% and D=76%), which suggests some predictive value.

Combine into larger groups then encode to numeric.

5. Wrangle Ticket

###5. Wrangle Ticket

#pull sample of 40 where survived is not null (i.e. train data)
bs1[['Ticket','Survived']][bs1['Survived'].notnull()].sample(n=40)
#no observable correlation patterns... try splitting by left-most char

#parse left-most character and view relationship
bs1['ticket_ltr']=bs1['Ticket'].str[:1]#create ticket_ltr feat using Ticket

#view mean of Survived by ticket_ltr
bs1[['ticket_ltr','Survived']].groupby(['ticket_ltr']).agg(['mean','count'])
#to me it looks like there is some correlative relationship (e.g. 63% of
    #those with 1* as ticket# survived vs. 23% of those with 3* ticket#)

#group categories with small vols into combined groups
bs1['ticket_ltr']=bs1['ticket_ltr'].\
    replace(['4','5','6','7','8','9','A'],'4-A')
bs1['ticket_ltr']=bs1['ticket_ltr'].replace(['C','F','L','P','S','W'],'C-W')

#view mean survived grouped by new categories
bs1[['ticket_ltr','Survived']].groupby('ticket_ltr').agg(['mean','count'])

#make categories numeric
bs1['ticket_ltr2']=bs1['ticket_ltr'].\
    map({'1':1,'2':2,'C-W':3,'3':4,'4-A':5}).astype(int)
				

Commentary: View a sample of 40 records to see if any kind of pattern can be observed between ticket and survived. No apparent pattern can be seen.

Try parsing the left-most character from the Ticket variable and looking at mean of survived by the resulting groups. After doing this it looks like there is some variability in mean survived by group. So I will use this in the model to see if it helps.

Combine resulting variable items into larger item groups then encode to numeric.
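The parse, group and encode steps for Ticket can be sketched on a few sample ticket strings (chosen only to exercise each branch of the grouping):

```python
import pandas as pd

tickets = pd.Series(['113803', 'PC 17599', 'A/5 21171', '347082', '7534'])

# Left-most character of each ticket
ltr = tickets.str[:1]

# Merge the low-volume levels into two combined groups
ltr = ltr.replace(['4', '5', '6', '7', '8', '9', 'A'], '4-A')
ltr = ltr.replace(['C', 'F', 'L', 'P', 'S', 'W'], 'C-W')

# Encode each group as an integer for the tree models
codes = ltr.map({'1': 1, '2': 2, 'C-W': 3, '3': 4, '4-A': 5})
print(pd.DataFrame({'ticket': tickets, 'ticket_ltr': ltr, 'ticket_ltr2': codes}))
```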

6/7. Wrangle Age and Sex

###6. Wrangle Age

#currently no wrangling--however, potential to explore further



###7. Wrangle Sex
bs1.Sex.value_counts()#see all levels in Sex variable
bs1['Sex2']=bs1['Sex'].map({'male':1,'female':0}).astype(int)#encode to integers
bs1.Sex2.value_counts()
				

Commentary: I thought about wrangling Age--i.e. converting it into logical age bins (e.g. infant, youth, young adult, middle-aged, elderly) but decided to run the model first then come back and adjust later if necessary. But I managed to beat the benchmark score on Kaggle without doing so--so this section is blank.

As for Sex--the only 'wrangling' to be done (if it can be termed such) is encoding the variable to numeric values (so that the tree classifier model we will see later can swallow it!)
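For anyone who does want to try the age bins mentioned above, a minimal sketch with pd.cut might look like the following. This was not part of my code, and the bin edges are hypothetical values picked for illustration only.

```python
import pandas as pd

ages = pd.Series([0.8, 9, 17, 30, 52, 71, None])

# Hypothetical bin edges; right=False makes each bin [low, high)
edges = [0, 2, 18, 40, 60, 120]
labels = ['infant', 'youth', 'young adult', 'middle-aged', 'elderly']
age_bin = pd.cut(ages, bins=edges, labels=labels, right=False)

print(age_bin.tolist())
```

Missing ages stay missing after binning, so Age would still need to be completed first (or the nulls given a bin of their own).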

8. Complete Fare and Embarked

###8. Complete Fare and Embarked

##Complete Fare

#just use median of non-nulls
bs1['Fare']=bs1['Fare'].fillna(bs1['Fare'].median())

#check there are now no nulls
pd.isnull(bs1['Fare']).sum()
#returns 0


##Complete Embarked

#see frequency of Embarked items
bs1.Embarked.value_counts()
#'S' is most frequent value

#replace NaNs with mode of Embarked ('S')
embarked_mode=bs1.Embarked.mode()[0]#return mode of Embarked
bs1['Embarked']=bs1['Embarked'].fillna(embarked_mode)
pd.isnull(bs1['Embarked']).sum()#check it has worked!

#encode as numeric
bs1['Embarked2']=bs1['Embarked'].map({'S':1,'C':2,'Q':3}).astype(int)
				

Commentary: Using pd.isnull(bs1['Fare']).sum() I can see that there is only one Fare record that is null. So I just use the median of the non-null values to complete it.

Similarly, with Embarked, there are only 2 null observations so I just fill them with the mode of the remaining observations.
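Both fills can be checked on a four-row made-up frame, where the median and mode are easy to work out by hand:

```python
import pandas as pd

toy = pd.DataFrame({'Fare': [7.25, 8.05, None, 71.28],
                    'Embarked': ['S', 'C', 'S', None]})

# Median of the non-null fares (7.25, 8.05, 71.28) is the middle value, 8.05
toy['Fare'] = toy['Fare'].fillna(toy['Fare'].median())

# mode() returns a Series (there can be ties), so take its first entry
toy['Embarked'] = toy['Embarked'].fillna(toy['Embarked'].mode()[0])
print(toy)
```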

9. Complete Age

###9. Complete Age

#check # of null values
pd.isnull(bs1['Age']).sum()#see how many nulls there are

#define model training data
tr_age=bs1[bs1['Age'].notnull()].copy()#define data to use for training
    #(.copy() so that new columns can be added without a pandas warning)

#define decision tree regressor parameters
from sklearn.tree import DecisionTreeRegressor
rtree=DecisionTreeRegressor(min_samples_split=20,min_samples_leaf=10,
                            random_state=1)

#define predictor and target variables
colnames=bs1.columns.values.tolist()#save colnames as list
for i, item in enumerate(colnames):#list colnames and item number for reference
    print('%02d'%i+', '+item+', '+str(bs1[item].dtype))#'%02d' denotes 00 format

predictors=list(colnames[i] for i in[5,7,13])#get desired predictors
predictors#check I've selected the intended columns
target=list(colnames[i] for i in[0])#extract target (Age)
target#check correct target has been defined (Age)

#fit a regression tree model based on the algorithm and predictors 
rtree.fit(tr_age[predictors],tr_age[target])

#use the model to predict age on training data--and compare vs. actuals
tr_age['age_pred']=rtree.predict(tr_age[predictors])#get predicted vals

#define a function for calculating rmse
def rmse(preds, actuals):
    sq_err=(actuals-preds)**2#squared errors
    mse=sq_err.mean()#mean of squared errors
    return mse**.5#square root of the mean

#compute rmse
rmse(tr_age['age_pred'],tr_age['Age'])

#view data in Excel to verify/confirm
tr_age[['age_pred','Age']].to_csv('age_guesses2.csv')

#cross-validate (k-fold)
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

x=tr_age[predictors]
y=tr_age[target]

#set up cross validation parameters
crossv=KFold(n_splits=6,shuffle=True,random_state=1)
folds=cross_val_score(rtree,x,y,scoring='neg_mean_squared_error',
                      cv=crossv,n_jobs=1)
folds_rmse=abs(folds)**.5#convert negative mse of each fold to rmse
folds_rmse.mean()#mean of folds rmse

#see feature importances
a=x.columns.values.tolist()
b=rtree.feature_importances_.tolist()
pd.DataFrame({'imp':b,'feat':a}).sort_values(by=('imp'), ascending=False)
#the higher the value the more important for prediction in this tree model
				

Commentary: So, we now come to a variable that has a substantial proportion of observations with missing data, and I considered several approaches to completing it (so that the classification tree model will work). The simplest method would be to use the average of the non-null Age observations--as it happens I don't think the choice of method was hugely influential. Maybe that is because the 'title' variable acts as a partial surrogate for Age.

But I spent a bit of time creating estimates and testing them anyway. Doing this is actually a useful tutorial in itself--since essentially the same process is followed later when predicting Survived.

DecisionTreeRegressor Method

From the block of code above you can see that I define the ad hoc training set (tr_age) by taking all observations where Age is not null.

I then use the DecisionTreeRegressor algorithm to create the rtree model (with min_samples_split=20 and min_samples_leaf=10).

Some variables cannot be used since they are either not yet complete or they are not numeric. So I create my list of predictors (indexed by 5, 7, and 13 in the colnames list). I also set the target equal to Age.

The next stage is to fit the rtree model on the train data. The decision tree regressor will use the CART algorithm to essentially create an extensive nested if-statement based on the predictor variables and the target variable (age) in the training data. This nested if-statement can then later be used on the test data predictors to predict Age for the test observations.
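To make the "nested if-statement" idea concrete, here is a small sketch that fits a shallow regression tree on made-up data and prints its rules as text. The data, the relationship between the columns and the max_depth=2 setting are all assumptions for illustration; export_text needs a reasonably recent scikit-learn (0.21 or later).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.RandomState(1)
X = rng.randint(1, 4, size=(200, 2)).astype(float)  # two made-up features
y = 40 - 8 * X[:, 0] + rng.normal(0, 2, size=200)   # target driven by the first

# Keep the tree shallow so the printed rules stay readable
small_tree = DecisionTreeRegressor(max_depth=2, random_state=1).fit(X, y)

# Each indented line is one if/else branch; 'value' is the leaf prediction
rules = export_text(small_tree, feature_names=['Pclass', 'Parch'])
print(rules)
```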

But first I want to test whether the fit is a good one or not. To do this I look at the root mean squared error (RMSE) between predicted and actual values in the train data. The RMSE is the square root of the mean of all squared errors in the data. I have created a function to calculate it given two vectors (predicted values and actual values). It turns out to be 10.78, which is pretty high: in rough terms my predictions are off by about 10 years per person on the train data.
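A worked example with four made-up ages shows the arithmetic behind the rmse function:

```python
import pandas as pd

def rmse(preds, actuals):
    """Root mean squared error between two equal-length Series."""
    return ((actuals - preds) ** 2).mean() ** .5

actuals = pd.Series([22.0, 38.0, 26.0, 35.0])
preds = pd.Series([25.0, 34.0, 26.0, 30.0])

# errors: -3, 4, 0, 5 -> squared: 9, 16, 0, 25 -> mean 12.5 -> root approx 3.54
print(round(rmse(preds, actuals), 2))
```

Because the errors are squared before averaging, one large miss (the 5 here) pulls the RMSE up more than it would pull up a plain average of absolute errors.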

A better way to gauge the fit is k-fold cross-validation. Here I randomly split the train data into k folds (I chose 6). Each fold in turn is held out while the model is fitted on the remaining folds, and the rmse is calculated on the held-out fold. I then take the average of the rmse across the folds.

After doing this you can see by hitting F9 on the folds_rmse line that the folds show rmse's ranging from 10.2 to 12.3. The mean of these is 11.01.
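The cross-validation step can be tried in isolation with the sketch below. It uses synthetic data rather than the Titanic frame, so the numbers it prints will not match mine; the point is only the mechanics of KFold and cross_val_score from sklearn.model_selection.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: three made-up predictors, target driven by the first
rng = np.random.RandomState(1)
X = rng.uniform(0, 10, size=(120, 3))
y = 3 * X[:, 0] + rng.normal(0, 1, size=120)

tree = DecisionTreeRegressor(min_samples_split=20, min_samples_leaf=10,
                             random_state=1)
cv = KFold(n_splits=6, shuffle=True, random_state=1)

# Each fold is held out once while the tree is refitted on the other five
neg_mse = cross_val_score(tree, X, y, scoring='neg_mean_squared_error', cv=cv)
fold_rmse = np.sqrt(-neg_mse)  # scores are negated mse, so flip the sign first

print(fold_rmse)         # one rmse per held-out fold
print(fold_rmse.mean())  # the single summary figure
```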

It is useful to look at the features (predictors) used in the model and see how important they were for obtaining the achieved accuracy. Looks like title2 and Pclass were most important and Parch was the least valuable from the features used. It is a useful exercise to go back and add more features (or even tweak them, and create new ones) then repeat the modelling process to see if a better result can be achieved.. and to be able to explain why it may have moved/changed.

RandomForestRegressor Method

Since I was not too happy about the RMSE of 11--suspecting that the trained model used was probably giving me bogus Age estimates on those passengers with null ages--I decided to see if a random forest might yield a tighter generalised fit on the train data.

A random forest essentially trains many variations of the tree model, then averages their predictions. It works because a lone tree is limited by the variables it happens to use and where it happens to split. Since a random forest considers only a random subset of the available predictors at each split, it more fully explores the potential of each feature. It also does not use 100% of the available observations for each tree: each tree is trained on a bootstrap sample of the observations, and the observations left out of that sample (out of bag, or oob for short) can be used to gauge accuracy. Out of bag here means 'not used for training' on any particular tree.

I have used a similar approach to the earlier tree model in terms of the steps taken. But this time I have randomly split the tr_age train set into two subsets: a quasi train set (tr_age_train) and a quasi test set (tr_age_test). This way I can reserve a portion of the data for out of sample testing once the model has been trained.

With the forest model I select a much wider range of features since I want the forest to explore all potential avenues for predictive potential. I am also not as concerned about limiting the min_samples_leaf and min_samples_split criteria since I can let the forest overfit across all samples then give me the mean impact across all trees.

Using the model to predict age on the training data I get a much improved rmse score of 6.47 when used on the tr_age_train dataset. This is quite encouraging. But when I then take the trained forest model and use it on the out of sample (i.e. the left out) portion, the rmse is still 11.7! So, it looks like the forest didn't give me a much better score after all.

In any case I will use the trained forest model (randfr in my code) to complete Age.
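The forest code for Age is not reproduced in this walkthrough, so here is a minimal sketch of the steps just described, run on synthetic data. Everything in it (the toy frame, the feature list, the forest settings) is an assumption for illustration and not the actual randfr model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for tr_age (made-up values)
rng = np.random.RandomState(1)
toy = pd.DataFrame({'Pclass': rng.randint(1, 4, size=300),
                    'Parch': rng.randint(0, 3, size=300),
                    'title2': rng.randint(1, 5, size=300)})
toy['Age'] = 45 - 6 * toy['Pclass'] + rng.normal(0, 10, size=300)

# Quasi train/test split: 70% to train on, the rest held back
feats = ['Pclass', 'Parch', 'title2']
toy_train = toy.sample(frac=.7, random_state=1)
toy_test = toy.drop(toy_train.index)

forest = RandomForestRegressor(n_estimators=100, oob_score=True,
                               random_state=1)
forest.fit(toy_train[feats], toy_train['Age'])

def rmse(preds, actuals):
    return ((actuals - preds) ** 2).mean() ** .5

train_rmse = rmse(forest.predict(toy_train[feats]), toy_train['Age'])
test_rmse = rmse(forest.predict(toy_test[feats]), toy_test['Age'])

# For a regressor, oob_score_ is an R-squared figure, not an rmse
print(train_rmse, test_rmse, forest.oob_score_)
```

Comparing the train and held-out figures is what exposes the gap described above: a forest can look far better on the rows it was trained on than on rows it has never seen.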

10. Complete cabin_ltr

###10. Complete cabin_ltr

#check # of null values
pd.isnull(bs1['cabin_ltr']).sum()

#define variables for predictor and target variables
colnames=bs1.columns.values.tolist()#save colnames as list
for i, item in enumerate(colnames):#list colnames and item number for reference
    print('%02d'%i+', '+item+', '+str(bs1[item].dtype))#'%02d' denotes 00 format

predictors=list(colnames[i] for i in[0,3,5,7,17,18,19])#get predictors
predictors#check I've selected the intended columns
target=list(colnames[i] for i in[14])#extract target (cabin_ltr)
target#check correct target has been defined (cabin_ltr)

#define decision tree classifier parameters
import numpy as np #needed for np.where below
from sklearn.tree import DecisionTreeClassifier
ctree=DecisionTreeClassifier(criterion='entropy',min_samples_split=20,
                             random_state=1)

#define model training data then fit the model
tr_cab=bs1[bs1['cabin_ltr'].notnull()]#define data to use for training
tr_cab_train=tr_cab.sample(frac=.7)#define training set
tr_cab_test=tr_cab.drop(tr_cab_train.index)#define test set

#fit a classification tree model based on the algorithm and predictors
x=tr_cab_train[predictors]
y=tr_cab_train[target]

ctree.fit(x,y)#train the tree model

#use the model to predict cabin_ltr on training data--and compare vs. actuals
tr_cab_train['cabin_ltr_pred']=ctree.predict(x)#get predicted vals

##See percent correctly classified in training data
#create new variable 'correct' where predicted=actual
tr_cab_train['correct']=np.where(tr_cab_train['cabin_ltr']==\
            tr_cab_train['cabin_ltr_pred'],1,0)
tr_cab_train['correct'].sum()/tr_cab_train.shape[0]#sum corrects/all possible

#now test on out of sample observations
tr_cab_test['cabin_ltr_pred']=ctree.predict(tr_cab_test[predictors])
tr_cab_test['correct']=np.where(tr_cab_test['cabin_ltr']==\
            tr_cab_test['cabin_ltr_pred'],1,0)
tr_cab_test['correct'].sum()/tr_cab_test.shape[0]#sum corrects/all possible


##Cross-Validate (K-Fold)
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

x=tr_cab_train[predictors]
y=tr_cab_train[target]

#set up cross validation parameters
crossv=KFold(n_splits=6,shuffle=True,random_state=1)#xval parameters
fold_scores=cross_val_score(ctree,x,y,scoring='accuracy',cv=crossv)#score folds
fold_scores.mean()#get mean of fold scores

#see feature importances
a=x.columns.values.tolist()
b=ctree.feature_importances_.tolist()
pd.DataFrame({'var':a,'imp':b}).sort_values(by=('imp'), ascending=False)
#the higher the value the more important for prediction in this tree model

#%%
##See if Random Forest Classifier is Better

#define random forest classifier parameters
from sklearn.ensemble import RandomForestClassifier
randfc=RandomForestClassifier(criterion='entropy',
                              n_jobs=2,oob_score=True,
                              n_estimators=100)

#define model training data, predictors and target
tr_cab_train=tr_cab.sample(frac=.7)#define training set
tr_cab_test=tr_cab.drop(tr_cab_train.index)#define test set

#predictor and target names have already been defined, but x and y must be
#rebuilt from the new training sample
x=tr_cab_train[predictors]
y=tr_cab_train[target].values.ravel()#flatten target to 1-d

#train a randfc model based on data and predictors
randfc.fit(x,y)
randfc.oob_score_

##See percent correctly classified in training data
#create new variable 'correct' where predicted=actual
tr_cab_train['cabin_ltr_pred_rf']=randfc.predict(x)
tr_cab_train['correct_rf']=np.where(tr_cab_train['cabin_ltr']==\
            tr_cab_train['cabin_ltr_pred_rf'],1,0)
print(tr_cab_train['correct_rf'].sum()/tr_cab_train.shape[0])

#now test on out of sample observations
tr_cab_test['cabin_ltr_pred_rf']=randfc.predict(tr_cab_test[predictors])
tr_cab_test['correct_rf']=np.where(tr_cab_test['cabin_ltr']==\
            tr_cab_test['cabin_ltr_pred_rf'],1,0)
print(tr_cab_test['correct_rf'].sum()/tr_cab_test.shape[0])


#%%
#choose a model and complete cabin_ltr
bs1['cabin_ltr_pred']=randfc.predict(bs1[predictors])#derive fitted values
bs1[['cabin_ltr','cabin_ltr_pred']].sample(n=20,random_state=1)#check looks ok
bs1['cabin_ltr']=bs1['cabin_ltr'].fillna(bs1['cabin_ltr_pred'])#complete cabin_ltr
pd.isnull(bs1['cabin_ltr']).sum()#check nulls have been filled
bs1['cabin_ltr'].value_counts()#view new split of cabin_ltr
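bs1['cabin_ltr2']=bs1['cabin_ltr'].\
    map({'B':1,'C':2,'D':3,'E':4,'Other':5}).astype(int)#convert to numeric
    #(note: this numbering differs from cabin_ltr_map in section 4 and
    #overwrites the earlier cabin_ltr2 values)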
				

Commentary: I will not go into the detail of completing the cabin_ltr feature since the process is much the same as with Age. However, cabin_ltr is not a continuous variable, so I use classifier tree models here in place of the regressor models. Their syntax is very similar--as you can see by comparing the above code with that for completing Age.
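The classifier version of the fit, predict and score cycle can be sketched on synthetic data as follows. The made-up rule linking Pclass to the cabin letter is an assumption chosen so that the tree has something easy to learn.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(1)
toy = pd.DataFrame({'Pclass': rng.randint(1, 4, size=200),
                    'Fare': rng.uniform(5, 100, size=200)})
# Made-up rule: first class -> 'B', everyone else -> 'Other'
toy['cabin_ltr'] = np.where(toy['Pclass'] == 1, 'B', 'Other')

toy_train = toy.sample(frac=.7, random_state=1)
toy_test = toy.drop(toy_train.index)

# Same settings style as ctree above, but the target is a label, not a number
clf = DecisionTreeClassifier(criterion='entropy', min_samples_split=20,
                             random_state=1)
clf.fit(toy_train[['Pclass', 'Fare']], toy_train['cabin_ltr'])

# Share of held-out rows where the predicted label equals the actual label
pred = clf.predict(toy_test[['Pclass', 'Fare']])
accuracy = (pred == toy_test['cabin_ltr']).mean()
print(accuracy)
```

Taking the mean of the boolean comparison gives the same percent-correct figure as the np.where(...,1,0) sum divided by the row count in my code.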

11. Create Family Features

###11. Create Family Features

##Family Feature
bs1['surname']=bs1.Name.str.extract(r'([A-Za-z]+),',expand=False)
   #logic as follows:
	   #[A-Za-z]: matches any letter
	   #+: preceding set repeated one or more times
	   #,: followed by a comma
bs1['surname'].sample(n=10)#check looks ok
bs1['famsize']=bs1.SibSp+bs1.Parch+1
bs1['famsize_str']=bs1['famsize'].apply(str)#convert to string
bs1['family']=bs1['surname']+bs1['famsize_str']


##Create Family Cost Feature

#create dataset of avg famcosts in bs1
famcosts=bs1[['family','Fare']].groupby('family').agg(['count','mean'])#create 
    #famcosts lookup dataframe
famcosts.columns=famcosts.columns.droplevel()#drop the first column level
fc2=famcosts.reset_index()#reset the index (i.e. not family)
fc3=fc2.reset_index()#reset index again
fc3.head(n=20)#check index has worked
fc3.rename(columns={'index':'family_int','count':'fam_n','mean':'fam_avgcost'},
                inplace=True)
fc3.sample(n=10)

#merge with bs1
bs1=pd.merge(bs1,fc3,on=['family'])
bs1.sample(n=10)#check merge looks ok
bs1['famcost']=bs1.famsize*bs1.fam_avgcost#calculate total famcost
bs1.sample(n=10)#inspect calc has worked as expected

#encode family variable to blend anyone solo or couple
bs1['family2']=np.where(bs1['famsize']==1,
   'solo',
   np.where(bs1['famsize']==2,
            'couple',
            bs1['family']))

fm1=bs1['family2'].value_counts()#get counts per family2 level
fm3=pd.DataFrame({'family2_int':range(len(fm1)),'family2':fm1.index})#number
    #each family2 level (most frequent first) without relying on the column
    #names that value_counts produces, which differ between pandas versions

#merge the numeric code into original bs1
bs1=pd.merge(bs1,fm3,on=['family2'])
				

Commentary: I must confess that this section was created AFTER the subsequent sections in an attempt to improve my score. The family idea comes straight from Trevor Stephens' R tutorial. The family cost idea I thought of myself... I just considered that the total family ££ spent might be a deciding feature for Survived.

As with title, a regular expression is used to extract the last name from the Name variable. Then the family's size is determined using the SibSp + Parch variables. These can then be combined to create the family feature.

The family cost feature is a little more tricky. It involves taking the mean fare for a family based on the available data, then multiplying it by a passenger's family size to gauge the total family cost associated with that individual. I have made use of the reset_index() method as a quick way to encode a family as numeric for the tree classifier (droplevel() is only used to flatten the two-level column names left by the groupby).
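The family and family cost calculations can be followed on three made-up passengers. Note that this sketch uses groupby(...).transform('mean') as a compact alternative to the groupby-then-merge lookup in my code; the resulting values are the same, but it is a swapped-in technique.

```python
import pandas as pd

# Three invented passengers: a couple and a solo traveller
toy = pd.DataFrame({
    'Name': ['Smith, Mr. John', 'Smith, Mrs. Jane', 'Jones, Miss. Ann'],
    'SibSp': [1, 1, 0],
    'Parch': [0, 0, 0],
    'Fare': [20.0, 30.0, 8.0]})

# Surname is the run of letters before the comma; family = surname + size
toy['surname'] = toy['Name'].str.extract(r'([A-Za-z]+),', expand=False)
toy['famsize'] = toy['SibSp'] + toy['Parch'] + 1
toy['family'] = toy['surname'] + toy['famsize'].astype(str)

# Average fare within each family, broadcast back to every member's row
toy['fam_avgcost'] = toy.groupby('family')['Fare'].transform('mean')
toy['famcost'] = toy['famsize'] * toy['fam_avgcost']

print(toy[['family', 'famsize', 'fam_avgcost', 'famcost']])
```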

12. Complete Survived (Decision Tree Method)

###12. Complete Survived (Decision Tree Method)

#check # of null values
pd.isnull(bs1['Survived']).sum()

#define variables for predictor and target variables
colnames=bs1.columns.values.tolist()#save colnames as list
for i, item in enumerate(colnames):#list colnames and item number for reference
    print('%02d'%i+', '+item+', '+str(bs1[item].dtype))#'%02d' denotes 00 format

predictors=list(colnames[i] for i in[0,3,7,9,13,15,17])#get predictors
predictors#check I've selected the intended columns
target=list(colnames[i] for i in[10])#extract target (Survived)
target#check correct target has been defined (Survived)

#define decision tree classifier parameters
from sklearn.tree import DecisionTreeClassifier
ctree=DecisionTreeClassifier(criterion='entropy',min_samples_split=20,
                             random_state=1)

#define model training data then fit the model
tr_svd=bs1[bs1['Survived'].notnull()]#define data to use for training
tr_svd_train=tr_svd.sample(frac=.7)#define training set
tr_svd_test=tr_svd.drop(tr_svd_train.index)#define test set

#fit a classification tree model based on the algorithm and predictors
x=tr_svd_train[predictors]
y=tr_svd_train[target]

ctree.fit(x,y)


##See percent correctly classified in training data

#create new variable 'correct' where predicted=actual
#use the trained model to predict survived
tr_svd_train['Survived_pred_ctree']=ctree.predict(x)
tr_svd_train['correct']=np.where(tr_svd_train['Survived']==\
            tr_svd_train['Survived_pred_ctree'],1,0)
tr_svd_train['correct'].sum()/tr_svd_train.shape[0]#sum corrects/all possible

#try on test out of sample data
tr_svd_test['Survived_pred_ctree']=ctree.predict(tr_svd_test[predictors])
tr_svd_test['correct']=np.where(tr_svd_test['Survived']==\
            tr_svd_test['Survived_pred_ctree'],1,0)
tr_svd_test['correct'].sum()/tr_svd_test.shape[0]#sum corrects/all possible

#view crosstab of predictions vs. actuals
pd.crosstab(tr_svd_train['Survived'],tr_svd_train['Survived_pred_ctree'],
            rownames=['actual'],
            colnames=['predicted'])

#cross-validate (k-fold)
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

#set up cross validation parameters
crossv=KFold(n_splits=6,shuffle=True,random_state=1)#xval parameters
fold_scores=cross_val_score(ctree,x,y,scoring='accuracy',cv=crossv)#score folds
fold_scores
fold_scores.mean()#get mean of fold scores

#see feature importances
a=x.columns.values.tolist()
b=ctree.feature_importances_.tolist()
pd.DataFrame({'var':a,'imp':b}).sort_values(by=('imp'), ascending=False)
#the higher the value the more important for prediction in this tree model
				

Commentary: Once again, the same tree model method has been applied as with Age and cabin_ltr (a classifier here, as for cabin_ltr). I used k-fold validation again, then re-worked the model by including/excluding features or altering the min_samples_split parameter value. I will not walk through it again, since the methods have largely been covered in the Complete Age section above.

I decided to supersede this tree method with the forest classifier method covered below.

13. Complete Survived (Random Forest Method)

###13. Complete Survived (Random Forest Method)

#define variables for predictor and target variables
colnames=bs1.columns.values.tolist()#save colnames as list
for i, item in enumerate(colnames):#list colnames and item number for reference
    print('%02d'%i+', '+item+', '+str(bs1[item].dtype))#'%02d' denotes 00 format

predictors=list(colnames[i] for i in[0,3,5,7,9,13,15,17,18,19,23,29,31])
predictors#check I've selected the intended columns
target=list(colnames[i] for i in[10])#extract target (Survived)
target#check correct target has been defined (Survived)

#define the random forest classifier model (it is trained further below)
from sklearn.ensemble import RandomForestClassifier
randfc=RandomForestClassifier(oob_score=True,
                              criterion='entropy',
                              max_features=4,
                              n_jobs=2,
                              min_samples_split=10,
                              min_samples_leaf=5,
                              n_estimators=200)

#divide ktrain into a further train/test split
tr_svd=bs1[bs1['Survived'].notnull()]#define data to use for training
tr_svd_train=tr_svd.sample(frac=.7)#define training set
tr_svd_test=tr_svd.drop(tr_svd_train.index)#define test set

x=tr_svd_train[predictors]
y=np.ravel(tr_svd_train[target])#to get around indices error..

#configure ensemble tuning grid
from sklearn.model_selection import GridSearchCV
param_grid={'min_samples_leaf':[3,4,5,6,7],
        'min_samples_split':[2,4,6],
        #'n_estimators':[100,200,300],
        'max_features':[2,3,4,5]}

cv_randfc=GridSearchCV(estimator=randfc,
                     param_grid=param_grid,
                     cv=3)

#cv_randfc.fit(x,y)#fit random forest with gridsearch
#cv_randfc.best_params_#view best param_grid
#cv_randfc.best_score_#view associated best score

#<--Now adjust the randfc parameters to match best_params-->

#train model with altered parameters
randfc.fit(x,y)#train the model on train data
randfc.oob_score_#out of bag accuracy

#assess accuracy on trained data
tr_svd_train['Survived_pred_randfc']=randfc.predict(tr_svd_train[predictors])
tr_svd_train['correct_rfc']=np.where(tr_svd_train['Survived']==\
            tr_svd_train['Survived_pred_randfc'],1,0)
tr_svd_train['correct_rfc'].sum()/tr_svd_train.shape[0]

#try on test out of sample data
tr_svd_test['Survived_pred_randfc']=randfc.predict(tr_svd_test[predictors])
tr_svd_test['correct']=\
    np.where(tr_svd_test['Survived']==tr_svd_test['Survived_pred_randfc'],1,0)
print(tr_svd_test['correct'].sum()/tr_svd_test.shape[0])

##Make Predictions on Base Data
bs1['Survived_pred_randfc']=randfc.predict(bs1[predictors])
randfc.oob_score_

				

Commentary: Again, the random forest classifier has been used, as when completing the cabin_ltr feature. However, some additional techniques have been employed here. I have used some extra parameters in the model itself. max_features is the number of features the forest considers when looking for the best split (it is not a cap on the features per tree); n_estimators is simply the number of trees to grow... in my experience, for a small set of features anything more than 50 doesn't make a big difference.
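One way to check the n_estimators remark for yourself is to compare out-of-bag accuracy across a few forest sizes. The sketch below assumes synthetic data from make_classification (13 features, to mirror the number of predictors used above) and arbitrary forest sizes; it is not the tuning I actually ran.

```python
# Sketch only: synthetic stand-in data, not the Titanic features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# hypothetical data with 13 predictors
X, y = make_classification(n_samples=600, n_features=13, n_informative=6,
                           random_state=1)

oob_by_trees = {}  # n_estimators -> out-of-bag accuracy
for n_trees in [50, 100, 200]:  # assumed forest sizes to compare
    rf = RandomForestClassifier(n_estimators=n_trees,
                                criterion='entropy',
                                max_features=4,      # features tried per split
                                min_samples_leaf=5,
                                oob_score=True,
                                random_state=1)
    rf.fit(X, y)
    oob_by_trees[n_trees] = rf.oob_score_

for n_trees, score in oob_by_trees.items():
    print(n_trees, round(score, 3))
```

If the scores barely move between sizes, the extra trees are only costing run time.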

A parameter grid has also been set up to tune the model parameters. As you can hopefully infer, the parameter grid just lists the values to test for each parameter. These are then fed into the model via GridSearchCV so that every possible combination of these parameters is run and each resulting scenario is scored with cross-validation to arrive at the best combination within the grid. This helped me eke out a few more percentage points of accuracy in my % correct score. The downside of this technique is that it hogs a lot of computer resource: it took my laptop 15+ minutes to run that section of code.
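To see why the search is so slow, it helps to count the work involved. The sketch below uses scikit-learn's ParameterGrid to count the combinations in the grid from the code above, then runs a much smaller search on synthetic data. The small grid, the 25-tree forest and the make_classification data are assumptions chosen purely to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid

# the grid used in the walkthrough
param_grid = {'min_samples_leaf': [3, 4, 5, 6, 7],
              'min_samples_split': [2, 4, 6],
              'max_features': [2, 3, 4, 5]}

n_combos = len(ParameterGrid(param_grid))  # 5 * 3 * 4 combinations
n_fits = n_combos * 3                      # each one is fitted once per fold (cv=3)
print(n_combos, n_fits)

# a deliberately small search on stand-in data (hypothetical values)
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=1)
small_grid = {'min_samples_leaf': [3, 5], 'max_features': [2, 4]}
search = GridSearchCV(estimator=RandomForestClassifier(n_estimators=25,
                                                       random_state=1),
                      param_grid=small_grid,
                      cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

With 60 combinations, 3 folds and 200 trees per forest, the full grid grows 36,000 trees before the final refit, which goes some way to explaining the run time.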

Once I was happy with the trained model, I used it to make predictions on the base data.

14. Prepare Kaggle Submission

###14. Prepare Kaggle Submission

#merge reworked base dataset with ktest to include only test samples
submit=pd.merge(bs1,ktest,on=['PassengerId'])\
    [['PassengerId','Survived_pred_randfc']].sort_values(by='PassengerId')
submit.rename(columns={'Survived_pred_randfc':'Survived'},
              inplace=True)#rename survived
submit.sample(n=20)#inspect a sample of submission
submit.head(n=20)
submit.to_csv('submit6.csv', index=False)#output to csv without printing index
				

Commentary: After applying the predicted values to the base dataset (base includes train and test in my program) the last thing to do is extract the test set for Kaggle and then prepare the .csv file. I did this by merging with the original Kaggle test set (ktest) on PassengerId. The to_csv method is then used to write the csv in preparation for uploading to Kaggle.
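A few quick checks on the submission frame can catch merge mistakes before uploading. The sketch below uses tiny made-up frames (bs1_toy and ktest_toy are hypothetical stand-ins for bs1 and ktest). The cast to integer is an optional tidy-up I am assuming here, not something the code above does; without it the predictions are written as 0.0/1.0, because Survived is stored as a float in the base data.

```python
import pandas as pd

# toy stand-ins for the base data and the Kaggle test set (made-up values)
bs1_toy = pd.DataFrame({'PassengerId': [1, 2, 3, 4, 5],
                        'Survived_pred_randfc': [0.0, 1.0, 1.0, 0.0, 1.0]})
ktest_toy = pd.DataFrame({'PassengerId': [4, 5], 'Pclass': [3, 1]})

# same merge/select/sort/rename steps as the walkthrough, chained together
submit_toy = (
    pd.merge(bs1_toy, ktest_toy, on=['PassengerId'])
    [['PassengerId', 'Survived_pred_randfc']]
    .sort_values(by='PassengerId')
    .rename(columns={'Survived_pred_randfc': 'Survived'})
)
submit_toy['Survived'] = submit_toy['Survived'].astype(int)  # optional cast

# sanity checks: one row per test passenger, no duplicates, only 0/1 labels
assert len(submit_toy) == len(ktest_toy)
assert submit_toy['PassengerId'].is_unique
assert submit_toy['Survived'].isin([0, 1]).all()
print(submit_toy)
```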

I used a random split of train/test when training the final random forest model used to predict Survived. Since I did not include a seed value (i.e. random_state=n) you may receive a slightly different result when submitting your results to Kaggle. My last result was .80861; the attached code should yield a score in that vicinity.
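If you would rather have repeatable results, both the sampling step and the forest accept a seed. The following is a minimal sketch on synthetic data (the frame, column names and the split_and_fit helper are hypothetical), showing that seeding both places gives the same out-of-sample score on every run.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# hypothetical stand-in data held in a data frame
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
df = pd.DataFrame(X, columns=['f%d' % i for i in range(6)])
df['target'] = y

def split_and_fit(seed):
    """70/30 split and forest fit, both seeded; returns test accuracy."""
    train = df.sample(frac=.7, random_state=seed)  # seeded random split
    test = df.drop(train.index)                    # remaining 30%
    rf = RandomForestClassifier(n_estimators=50, random_state=seed)
    rf.fit(train.drop(columns='target'), train['target'])
    return rf.score(test.drop(columns='target'), test['target'])

score_a = split_and_fit(seed=1)
score_b = split_and_fit(seed=1)
print(score_a == score_b)  # identical seeds give identical scores
```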




Kaggle Titanic Program (In Full)

All of the code above is reproduced in full in the block below for convenience.
###############################################################################
##                        Titanic Survival Predictions
###############################################################################
#Purpose:
#    To rank better than 0.786 on the Kaggle public leaderboard by building
#    a machine learning solution
#    
#Inputs:
#    - train.csv
#    - test.csv
#    
#Anatomy:
#    1. Import Kaggle Data
#    2. Explore the Datasets
#    3. Wrangle Name
#    4. Wrangle Cabin
#    5. Wrangle Ticket
#    6. Wrangle Age
#    7. Wrangle Sex
#    8. Complete Fare and Embarked
#    9. Complete Age
#    10. Complete cabin_ltr
#    11. Create Family Features
#    12. Complete Survived (Decision Tree Method)
#    13. Complete Survived (Random Forest Method)
#    14. Prepare Kaggle Submission



### 1. Import Kaggle Data
import os #import the os library to use chdir command
print(os.getcwd()) #returns the current working directory
os.chdir('C:\\Users\\craig\\Google Drive\\201704\\0408pyth\\cache')#changes the
		#current working directory

import pandas as pd #import pandas library - needed for read_csv
ktrain=pd.read_csv('train.csv')#read in train data
ktest=pd.read_csv('test.csv')#read in test data
type(ktest)#check ktest is a data frame
ktest.shape#check rows and columns in ktest

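bs1=pd.concat([ktrain,ktest],sort=True,ignore_index=True)#concatenate the
    #data; sort=True keeps the alphabetical column order that the column
    #numbers used later rely on, ignore_index=True gives unique row labels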
bs1.shape#check dimensions of bs1

#check that Survived has been concatenated properly
bs1.head(n=5)
bs1.tail(n=5)
#it is concatenated as NaN--good!



###2. Explore the Dataset
bs1.info()#data frame variables, missing counts, and types
bs1.describe()#view summary of numeric variables
bs1.describe(include=['O'])#view summary of non-numeric variables

#view null counts per variable
pd.isnull(bs1).sum()

#see avg numerics by survived
bs1.groupby('Survived').mean(numeric_only=True)
#Fare, Parch, and Pclass seem most predictive features here

#mean of survived by all categorical variables in turn
feats=list(ktrain.select_dtypes(include=['O']))#get list of categorical feats
feats#view list of categorical feats
for feat in feats:#loop through each variable and show survived * feat
    print(ktrain[[feat,'Survived']].\
        groupby([feat]).\
        agg(['mean','count']))
#seems like there could be potential to split out:
	#title from name variable, 
	#cabin type from cabin,
	#and ticket type from ticket then drill down for further investigation



###3. Wrangle Name
bs1.shape#indicates there are 12 columns

#extract Title from Name
bs1['title']=bs1.Name.str.extract(r'([A-Za-z]+)\.',expand=False)#regex
    #extraction of title (expand=False returns a Series)
    #logic as follows:
        #[A-Za-z]: matches any letter
        #+: preceding set repeated one or more times
        #\.: has a dot at the end of the expression
bs1.shape
bs1.sample(n=10)

#view the new average survival by title table (Note: NaNs are excluded from 
#mean calc)
bs1[['title','Survived']].\
	groupby(['title']).\
	agg(['mean','count']).\
	sort_values(by=('Survived','mean'), ascending=False)

#combine titles into broader groups
bs1['title']=bs1['title'].replace(['Mme','Countess','Lady','Dona'],'Mrs')#
	#replace Mme, Countess, Lady, and Dona with Mrs
bs1['title']=bs1['title'].replace(['Mlle','Ms'],'Miss')#replace Mlle and Ms
bs1['title']=bs1['title'].replace(['Sir','Col','Capt','Rev','Don','Jonkheer',\
	 'Major','Dr'],'Mr')#replace rare male titles with Mr

#view the new average survival by title table (Note: NaNs are excluded from 
#mean calc)
bs1[['title','Survived']].\
	groupby('title').\
	agg(['mean','count']).\
	sort_values(by=('Survived','mean'), ascending=False)

#create title2 variable to convert title to numeric (for regtree modelling)
bs1['title2']=bs1['title'].map({'Mrs':1,'Miss':2,'Master':3,'Mr':4})



###4. Wrangle Cabin
bs1['cabin_ltr']=bs1['Cabin'].str[:1]#take first character from Cabin
bs1['cabin_ltr'].sample(n=10)#inspect new variable

#view summary of new variable
bs1[['cabin_ltr','Survived']].groupby(['cabin_ltr']).agg(['mean','count'])
#looks like there is some predictive value in cabin_ltr

#broaden out small groups so as to avoid overfitting
bs1['cabin_ltr']=bs1['cabin_ltr'].replace(['A','F','G','T'],'Other')

#encode numeric cabin_ltr
cabin_ltr_map={'D':1,'E':2,'B':3,'C':4,'Other':5}#create mapping dictionary
bs1['cabin_ltr2']=bs1['cabin_ltr'].map(cabin_ltr_map)
bs1['cabin_ltr2'].value_counts()#check has assigned correctly



###5. Wrangle Ticket

#pull a random sample of 40 rows where survived is not null (i.e. train data)
bs1[['Ticket','Survived']][bs1['Survived'].notnull()].sample(n=40)
#no observable correlation patterns... try splitting by left-most char

#parse left-most character and view relationship
bs1['ticket_ltr']=bs1['Ticket'].str[:1]#create ticket_ltr feat using Ticket

#view mean of Survived by ticket_ltr
bs1[['ticket_ltr','Survived']].groupby(['ticket_ltr']).agg(['mean','count'])
#to me it looks like there is some correlative relationship (e.g. 63% of
    #those with 1* as ticket# survived vs. 23% of those with 3* ticket#)

#group categories with small vols into separate groups
bs1['ticket_ltr']=bs1['ticket_ltr'].\
    replace(['4','5','6','7','8','9','A'],'4-A')
bs1['ticket_ltr']=bs1['ticket_ltr'].replace(['C','F','L','P','S','W'],'C-W')

#view mean survived grouped by new categories
bs1[['ticket_ltr','Survived']].groupby('ticket_ltr').agg(['mean','count'])

#make categories numeric
bs1['ticket_ltr2']=bs1['ticket_ltr'].\
    map({'1':1,'2':2,'C-W':3,'3':4,'4-A':5}).astype(int)



###6. Wrangle Age

#currently no wrangling--however, potential to explore further



###7. Wrangle Sex
bs1.Sex.value_counts()#see all levels in Sex variable
bs1['Sex2']=bs1['Sex'].map({'male':1,'female':0}).astype(int)#encode to integers
bs1.Sex2.value_counts()



###8. Complete Fare and Embarked

##Complete Fare

#just use median of non-nulls
bs1['Fare']=bs1['Fare'].fillna(bs1['Fare'].median())

#check there are now no nulls
pd.isnull(bs1['Fare']).sum()
#returns 0


##Complete Embarked

#see frequency of Embarked items
bs1.Embarked.value_counts()
#'S' is most frequent value

#replace NaNs with mode of Embarked ('S')
embarked_mode=bs1.Embarked.mode()[0]#return mode of Embarked
bs1['Embarked']=bs1['Embarked'].fillna(embarked_mode)
pd.isnull(bs1['Embarked']).sum()#check it has worked!

#encode as numeric
bs1['Embarked2']=bs1['Embarked'].map({'S':1,'C':2,'Q':3}).astype(int)



###9. Complete Age

#check # of null values
pd.isnull(bs1['Age']).sum()#see how many nulls there are

#define model training data
tr_age=bs1[bs1['Age'].notnull()]#define data to use for training

#define decision tree regressor parameters
from sklearn.tree import DecisionTreeRegressor
rtree=DecisionTreeRegressor(min_samples_split=20,min_samples_leaf=10,
                            random_state=1)

#define predictor and target variables
colnames=bs1.columns.values.tolist()#save colnames as list
for i, item in enumerate(colnames):#list colnames and item number for reference
    print('%02d'%i+', '+item+', '+str(bs1[item].dtype))#'%02d' denotes 00 format

predictors=list(colnames[i] for i in[5,7,13])#get desired predictors
predictors#check I've selected the intended columns
target=list(colnames[i] for i in[0])#extract target (Age)
target#check correct target has been defined (Age)

#fit a regression tree model based on the algorithm and predictors 
rtree.fit(tr_age[predictors],tr_age[target])

#use the model to predict age on training data--and compare vs. actuals
tr_age['age_pred']=rtree.predict(tr_age[predictors])#get predicted vals

#define a function for calculating rmse
def rmse(preds, actuals):
    sq_err=(actuals-preds)**2#squared error per observation
    mse=sq_err.mean()#mean squared error
    return mse**.5#root mean squared error

#compute rmse
rmse(tr_age['age_pred'],tr_age['Age'])

#view data in Excel to verify/confirm
tr_age[['age_pred','Age']].to_csv('age_guesses2.csv')

#cross-validate (k-fold)
from sklearn.model_selection import KFold#sklearn.cross_validation was
    #removed in scikit-learn 0.20
from sklearn.model_selection import cross_val_score

x=tr_age[predictors]
y=tr_age[target]

#set up cross validation parameters
crossv=KFold(n_splits=6,shuffle=True,random_state=1)
folds=cross_val_score(rtree,x,y,scoring='neg_mean_squared_error',
                      cv=crossv,n_jobs=1)
folds_rmse=abs(folds)**.5#view rmse of folds
folds_rmse.mean()#mean of folds rmse

#see feature importances
a=x.columns.values.tolist()
b=rtree.feature_importances_.tolist()
pd.DataFrame({'imp':b,'feat':a}).sort_values(by=('imp'), ascending=False)
#the higher the value the more important for prediction in this tree model

#%%
##See if RandForestRegressor Works Better
#train the randfr model
from sklearn.ensemble import RandomForestRegressor
randfr=RandomForestRegressor(n_estimators=200, oob_score=True)

#divide ktrain into a further train/test split
tr_age_train=tr_age.sample(frac=.7)#define training set
tr_age_test=tr_age.drop(tr_age_train.index)#define test set

#define predictor and target variables
colnames=bs1.columns.values.tolist()#save colnames as list
for i, item in enumerate(colnames):#list colnames and item number for reference
    print('%02d'%i+', '+item+', '+str(bs1[item].dtype))#'%02d' denotes 00 format

predictors=list(colnames[i] for i in[3,5,7,9,13,17,18,19])#get desired predictors
predictors#check I've selected the intended columns
target=list(colnames[i] for i in[0])#extract target (Age)
target#check correct target has been defined (Age)

x=tr_age_train[predictors]
y=tr_age_train[target]

randfr.fit(x,y)#fit a model on the training data
randfr.oob_score_

#use the model to predict age on training data--and compare vs. actuals
tr_age_train['age_pred_rfr']=randfr.predict(tr_age_train[predictors])#get 
    #predicted vals
print(rmse(tr_age_train['age_pred_rfr'],tr_age_train['Age']))#compute rmse

#now use model on tr_age_test to see out of sample score
tr_age_test['age_pred_rfr']=randfr.predict(tr_age_test[predictors])#get 
    #predicted vals
print(rmse(tr_age_test['age_pred_rfr'],tr_age_test['Age']))#compute rmse


#%%
##use a trained model to complete age (random forest regressor)
import numpy as np
bs1['Age_pred']=randfr.predict(bs1[predictors])#get predictions variable
bs1['Age']=np.where(bs1['Age'].isnull(),bs1['Age_pred'],bs1['Age'])#replace Age
    #with Age_pred where Age is null
pd.isnull(bs1['Age']).sum()#see how many nulls there are

#%%


###10. Complete cabin_ltr

#check # of null values
pd.isnull(bs1['cabin_ltr']).sum()

#define variables for predictor and target variables
colnames=bs1.columns.values.tolist()#save colnames as list
for i, item in enumerate(colnames):#list colnames and item number for reference
    print('%02d'%i+', '+item+', '+str(bs1[item].dtype))#'%02d' denotes 00 format

predictors=list(colnames[i] for i in[0,3,5,7,17,18,19])#get predictors
predictors#check I've selected the intended columns
target=list(colnames[i] for i in[14])#extract target (cabin_ltr)
target#check correct target has been defined (cabin_ltr)

#define decision tree classifier parameters
from sklearn.tree import DecisionTreeClassifier
ctree=DecisionTreeClassifier(criterion='entropy',min_samples_split=20,
                             random_state=1)

#define model training data then fit the model
tr_cab=bs1[bs1['cabin_ltr'].notnull()]#define data to use for training
tr_cab_train=tr_cab.sample(frac=.7)#define training set
tr_cab_test=tr_cab.drop(tr_cab_train.index)#define test set

#fit a classification tree model based on the algorithm and predictors
x=tr_cab_train[predictors]
y=tr_cab_train[target]

ctree.fit(x,y)#train the tree model

#use the model to predict cabin_ltr on training data--and compare vs. actuals
tr_cab_train['cabin_ltr_pred']=ctree.predict(x)#get predicted vals

##See percent correctly classified in training data
#create new variable 'correct' where predicted=actual
tr_cab_train['correct']=np.where(tr_cab_train['cabin_ltr']==\
            tr_cab_train['cabin_ltr_pred'],1,0)
tr_cab_train['correct'].sum()/tr_cab_train.shape[0]#sum corrects/all possible

#now test on out of sample observations
tr_cab_test['cabin_ltr_pred']=ctree.predict(tr_cab_test[predictors])
tr_cab_test['correct']=np.where(tr_cab_test['cabin_ltr']==\
            tr_cab_test['cabin_ltr_pred'],1,0)
tr_cab_test['correct'].sum()/tr_cab_test.shape[0]#sum corrects/all possible


##Cross-Validate (K-Fold)
from sklearn.model_selection import KFold#sklearn.cross_validation was
    #removed in scikit-learn 0.20
from sklearn.model_selection import cross_val_score

x=tr_cab_train[predictors]
y=tr_cab_train[target]

#set up cross validation parameters
crossv=KFold(n_splits=6,shuffle=True,random_state=1)#xval parameters
fold_scores=cross_val_score(ctree,x,y,scoring='accuracy',cv=crossv)#score folds
fold_scores.mean()#get mean of fold scores

#see feature importances
a=x.columns.values.tolist()
b=ctree.feature_importances_.tolist()
pd.DataFrame({'var':a,'imp':b}).sort_values(by=('imp'), ascending=False)
#the higher the value the more important for prediction in this tree model

#%%
##See if Random Forest Classifier is Better

#define random forest classifier parameters
from sklearn.ensemble import RandomForestClassifier
randfc=RandomForestClassifier(criterion='entropy',
                              n_jobs=2,oob_score=True,
                              n_estimators=100)

#define model training data, predictors and target
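tr_cab_train=tr_cab.sample(frac=.7)#define training set
tr_cab_test=tr_cab.drop(tr_cab_train.index)#define test set

#predictors and target have already been defined; x and y must be rebuilt
#from the new sample so that they line up with tr_cab_train
x=tr_cab_train[predictors]
y=np.ravel(tr_cab_train[target])#flatten to a 1-d array for the forest

#train a randfc model based on data and predictors
randfc.fit(x,y)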
randfc.oob_score_

##See percent correctly classified in training data
#create new variable 'correct' where predicted=actual
tr_cab_train['cabin_ltr_pred_rf']=randfc.predict(x)
tr_cab_train['correct_rf']=np.where(tr_cab_train['cabin_ltr']==\
            tr_cab_train['cabin_ltr_pred_rf'],1,0)
print(tr_cab_train['correct_rf'].sum()/tr_cab_train.shape[0])

#now test on out of sample observations
tr_cab_test['cabin_ltr_pred_rf']=randfc.predict(tr_cab_test[predictors])
tr_cab_test['correct_rf']=np.where(tr_cab_test['cabin_ltr']==\
            tr_cab_test['cabin_ltr_pred_rf'],1,0)
print(tr_cab_test['correct_rf'].sum()/tr_cab_test.shape[0])



#%%
#choose a model and complete cabin_ltr
bs1['cabin_ltr_pred']=randfc.predict(bs1[predictors])#derive fitted values
bs1[['cabin_ltr','cabin_ltr_pred']].sample(n=20,random_state=1)#check looks ok
bs1['cabin_ltr']=bs1['cabin_ltr'].fillna(bs1['cabin_ltr_pred'])#complete cabin_ltr
pd.isnull(bs1['cabin_ltr']).sum()#check nulls have been filled
bs1['cabin_ltr'].value_counts()#view new split of cabin_ltr
bs1['cabin_ltr2']=bs1['cabin_ltr'].\
    map({'B':1,'C':2,'D':3,'E':4,'Other':5}).astype(int)#convert to numeric



#%%
###11. Create Family Features

##Family Feature
bs1['surname']=bs1.Name.str.extract(r'([A-Za-z]+),',expand=False)
    #logic as follows:
        #[A-Za-z]: matches any letter
        #+: preceding set repeated one or more times
        #,: has a comma at the end of the expression
bs1['surname'].sample(n=10)#check looks ok
bs1['famsize']=bs1.SibSp+bs1.Parch+1
bs1['famsize_str']=bs1['famsize'].apply(str)#convert to string
bs1['family']=bs1['surname']+bs1['famsize_str']


##Create Family Cost Feature

#create dataset of avg famcosts in bs1
famcosts=bs1[['family','Fare']].groupby('family').agg(['count','mean'])#create 
    #famcosts lookup dataframe
famcosts.columns=famcosts.columns.droplevel()#drop the first column level
fc2=famcosts.reset_index()#reset the index (i.e. not family)
fc3=fc2.reset_index()#reset index again
fc3.head(n=20)#check index has worked
fc3.rename(columns={'index':'family_int','count':'fam_n','mean':'fam_avgcost'},
                inplace=True)
fc3.sample(n=10)

#merge with bs1
bs1=pd.merge(bs1,fc3,on=['family'])
bs1.sample(n=10)#check merge looks ok
bs1['famcost']=bs1.famsize*bs1.fam_avgcost#calculate total famcost
bs1.sample(n=10)#inspect calc has worked as expected

#encode family variable to blend anyone solo or couple
bs1['family2']=np.where(bs1['famsize']==1,
   'solo',
   np.where(bs1['famsize']==2,
            'couple',
            bs1['family']))

#number each family2 level by frequency (0=most common) for merging
fam_levels=bs1['family2'].value_counts().index#levels, most frequent first
fm3=pd.DataFrame({'family2':fam_levels,
                  'family2_int':list(range(len(fam_levels)))})
bs1=pd.merge(bs1,fm3,on=['family2'])#merge into original bs1


#%%
###12. Complete Survived (Decision Tree Method)

#check # of null values
pd.isnull(bs1['Survived']).sum()

#define variables for predictor and target variables
colnames=bs1.columns.values.tolist()#save colnames as list
for i, item in enumerate(colnames):#list colnames and item number for reference
    print('%02d'%i+', '+item+', '+str(bs1[item].dtype))#'%02d' denotes 00 format

predictors=list(colnames[i] for i in[0,3,7,9,13,15,17])#get predictors
predictors#check I've selected the intended columns
target=list(colnames[i] for i in[10])#extract target (Survived)
target#check correct target has been defined (Survived)

#define decision tree classifier parameters
from sklearn.tree import DecisionTreeClassifier
ctree=DecisionTreeClassifier(criterion='entropy',min_samples_split=20,
                             random_state=1)

#define model training data then fit the model
tr_svd=bs1[bs1['Survived'].notnull()]#define data to use for training
tr_svd_train=tr_svd.sample(frac=.7)#define training set
tr_svd_test=tr_svd.drop(tr_svd_train.index)#define test set

#fit a classification tree model based on the algorithm and predictors
x=tr_svd_train[predictors]
y=tr_svd_train[target]

ctree.fit(x,y)


##See percent correctly classified in training data

#create new variable 'correct' where predicted=actual
#use the trained model to predict survived
tr_svd_train['Survived_pred_ctree']=ctree.predict(x)
tr_svd_train['correct']=np.where(tr_svd_train['Survived']==\
            tr_svd_train['Survived_pred_ctree'],1,0)
tr_svd_train['correct'].sum()/tr_svd_train.shape[0]#sum corrects/all possible

#try on test out of sample data
tr_svd_test['Survived_pred_ctree']=ctree.predict(tr_svd_test[predictors])
tr_svd_test['correct']=np.where(tr_svd_test['Survived']==\
            tr_svd_test['Survived_pred_ctree'],1,0)
tr_svd_test['correct'].sum()/tr_svd_test.shape[0]#sum corrects/all possible

#view crosstab of predictions vs. actuals
pd.crosstab(tr_svd_train['Survived'],tr_svd_train['Survived_pred_ctree'],
            rownames=['actual'],
            colnames=['predicted'])

#cross-validate (k-fold)
from sklearn.model_selection import KFold#sklearn.cross_validation was
    #removed in scikit-learn 0.20
from sklearn.model_selection import cross_val_score

#set up cross validation parameters
crossv=KFold(n_splits=6,shuffle=True,random_state=1)#xval parameters
fold_scores=cross_val_score(ctree,x,y,scoring='accuracy',cv=crossv)#score folds
fold_scores
fold_scores.mean()#get mean of fold scores

#see feature importances
a=x.columns.values.tolist()
b=ctree.feature_importances_.tolist()
pd.DataFrame({'var':a,'imp':b}).sort_values(by=('imp'), ascending=False)
#the higher the value the more important for prediction in this tree model



#%%
###13. Complete Survived (Random Forest Method)

#define variables for predictor and target variables
colnames=bs1.columns.values.tolist()#save colnames as list
for i, item in enumerate(colnames):#list colnames and item number for reference
    print('%02d'%i+', '+item+', '+str(bs1[item].dtype))#'%02d' denotes 00 format

predictors=list(colnames[i] for i in[0,3,5,7,9,13,15,17,18,19,23,29,31])
predictors#check I've selected the intended columns
target=list(colnames[i] for i in[10])#extract target (Survived)
target#check correct target has been defined (Survived)

#define the random forest model (it is trained further down)
from sklearn.ensemble import RandomForestClassifier
randfc=RandomForestClassifier(oob_score=True,
                              criterion='entropy',
                              max_features=4,
                              n_jobs=2,
                              min_samples_split=10,
                              min_samples_leaf=5,
                              n_estimators=200)

#divide ktrain into a further train/test split
tr_svd=bs1[bs1['Survived'].notnull()]#define data to use for training
tr_svd_train=tr_svd.sample(frac=.7)#define training set
tr_svd_test=tr_svd.drop(tr_svd_train.index)#define test set

x=tr_svd_train[predictors]
y=np.ravel(tr_svd_train[target])#to get around indices error..

#configure ensemble tuning grid
from sklearn.model_selection import GridSearchCV
param_grid={'min_samples_leaf':[3,4,5,6,7],
        'min_samples_split':[2,4,6],
        #'n_estimators':[100,200,300],
        'max_features':[2,3,4,5]}

cv_randfc=GridSearchCV(estimator=randfc,
                     param_grid=param_grid,
                     cv=3)

#cv_randfc.fit(x,y)#fit random forest with gridsearch
#cv_randfc.best_params_#view best param_grid
#cv_randfc.best_score_#view associated best score

#<--Now adjust the randfc parameters to match best_params-->

#train model with altered parameters
randfc.fit(x,y)#train the model on train data
randfc.oob_score_#out of bag accuracy

#assess accuracy on trained data
tr_svd_train['Survived_pred_randfc']=randfc.predict(tr_svd_train[predictors])
tr_svd_train['correct_rfc']=np.where(tr_svd_train['Survived']==\
            tr_svd_train['Survived_pred_randfc'],1,0)
tr_svd_train['correct_rfc'].sum()/tr_svd_train.shape[0]

#try on test out of sample data
tr_svd_test['Survived_pred_randfc']=randfc.predict(tr_svd_test[predictors])
tr_svd_test['correct']=\
    np.where(tr_svd_test['Survived']==tr_svd_test['Survived_pred_randfc'],1,0)
print(tr_svd_test['correct'].sum()/tr_svd_test.shape[0])

##Make Predictions on Base Data
bs1['Survived_pred_randfc']=randfc.predict(bs1[predictors])
randfc.oob_score_


###14. Prepare Kaggle Submission

#merge reworked base dataset with ktest to include only test samples
submit=pd.merge(bs1,ktest,on=['PassengerId'])\
    [['PassengerId','Survived_pred_randfc']].sort_values(by='PassengerId')
submit.rename(columns={'Survived_pred_randfc':'Survived'},
              inplace=True)#rename survived
submit.sample(n=20)#inspect a sample of submission
submit.head(n=20)
submit.to_csv('submit6.csv', index=False)#output to csv without printing index






######Workings