Python Titanic Machine Learning Walkthrough
I have recently been getting to grips with machine learning techniques. Coursera's Data Science courses (Johns Hopkins) and R were where I started--quickly followed by Kaggle's Titanic competition. I initially tried using Excel VBA to construct what I now understand is a basic decision tree classifier. However, this was very labour intensive and so I started exploring R.
Then I got busy with a challenging project at work as well as my Base SAS certifications... until a few months ago when I again resumed the challenge and found an absolute gem of a tutorial for starting machine learning in R. I cannot praise Trevor Stephens and his excellent tutorial enough for their lucidity and beginner-friendly approach that re-invigorated me to get on my journey with machine learning! Thank you Trevor! :)
After completing that tutorial on the Titanic competition I set about replicating the effort from scratch in R with a goal of improving my score on Kaggle's leaderboard.
Around this time (couple of months ago) I got involved with a predictive modelling project at work and an opportunity to use Python. Consequently, I have replicated the effort in Python and include a fairly basic effort for beating my previous score.
After much perseverance I did manage to improve the score I had achieved through the R tutorial. I will walk you through the Python code used to do so.
Installing Python
Before diving into the actual code I thought I should just mention the process of installing Python on my laptop first. You can get Python from its website (python.org) but you will want to use an IDE to manage it. Initially I got PyCharm, then had some issues trying to download scikit-learn and other packages. After some struggles and compatibility issues I just downloaded Anaconda. This installs Python, the Spyder IDE, some other useful applications and all the packages needed to run my code--it takes ages to install but rest assured you will have everything!
When installing Anaconda I opted for Python 2.7 because I thought there would be better online community support for it; but I had some problems updating it, so I later cleaned it from my machine and installed Python 3.6 instead.
I will also mention a few things I didn't find initially but later gladly discovered about Spyder:
- F9 runs a highlighted section of code or the current line (i.e. line where cursor is). This is immensely useful.
- You can type #%% to create a cell. Then hit Ctrl+Enter to run the current cell only.
- Ctrl+r is used for find and replace :)
Titanic Prediction Code
You can get everything you need to do the Titanic challenge on Kaggle's competitions website. The Titanic competition provides train and test datasets containing data on actual passengers of the Titanic (e.g. name, age, fare, cabin, gender). Basically your job is to predict whether each passenger in the test dataset survived, using a model built on the training dataset.
My Approach
The phases of my code follow in this order:
- Import Kaggle Data
- Explore the Datasets
- Wrangle Name
- Wrangle Cabin
- Wrangle Ticket
- Wrangle Age
- Wrangle Sex
- Complete Fare and Embarked
- Complete Age
- Complete cabin_ltr
- Create Family Features
- Complete Survived (Decision Tree Method)
- Complete Survived (Random Forest Method)
- Prepare Kaggle Submission
1. Import Kaggle Data
### 1. Import Kaggle Data
import os #import the os library to use the chdir command
print(os.getcwd()) #returns the current working directory
os.chdir('C:\\Users\\craig\\Google Drive\\201704\\0408pyth\\cache')#changes the
#current working directory
import pandas as pd #import pandas library - needed for read_csv
ktrain=pd.read_csv('train.csv')#read in train data
ktest=pd.read_csv('test.csv')#read in test data
type(ktest)#check ktest is a data frame
ktest.shape#check rows and columns in ktest
bs1=pd.concat([ktrain,ktest])#concatenate the data
bs1.shape#check dimensions of bs1
#check that Survived has been concatenated properly
bs1.head(n=5)
bs1.tail(n=5)
#it is concatenated as NaN--good!
Commentary: The os module lets you interact with the operating system from Python. Here I use it to change the working directory to the 0408pyth/cache folder.
The pandas package is later used extensively to manipulate data. But here it is used to access the read_csv method to read in the train and test datasets from Kaggle. Note--I had previously downloaded the csv files from the Kaggle competitions site to my cache folder.
After loading the train and test data I concatenate (stack) them on top of each other to create a combined base dataset to work with.
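To make the stacking step concrete, here is a minimal sketch on two tiny made-up frames (the column values are hypothetical stand-ins, not the Kaggle data). It shows why the test rows end up with NaN in Survived after concatenation.

```python
import pandas as pd

# Hypothetical miniature stand-ins for train.csv and test.csv
train = pd.DataFrame({"PassengerId": [1, 2],
                      "Survived": [0, 1],
                      "Fare": [7.25, 71.28]})
test = pd.DataFrame({"PassengerId": [3],
                     "Fare": [8.05]})  # no Survived column in the test file

# Stack the two frames; columns missing from one frame are filled with NaN
combined = pd.concat([train, test])

# Rows that came from the test frame are the ones with a null Survived
n_test_rows = int(combined["Survived"].isnull().sum())
print(combined)
print(n_test_rows)
```

The null Survived values are what later sections rely on to tell train rows from test rows.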
2. Explore the Datasets
###2. Explore the Dataset
bs1.info()#data frame variables, missing counts, and types
bs1.describe()#view summary of numeric variables
bs1.describe(include=['O'])#view summary of non-numeric variables
#view null counts per variable
pd.isnull(bs1).sum()
#see avg numerics by survived
bs1.groupby('Survived').mean(numeric_only=True)
#Fare, Parch, and Pclass seem most predictive features here
#mean of survived by all categorical variables in turn
feats=list(ktrain.select_dtypes(include=['O']))#get list of categorical feats
feats#view list of categorical feats
for feat in feats:#loop through each variable and show survived * feat
    print(ktrain[[feat,'Survived']].\
          groupby([feat]).\
          agg(['mean','count']))
#seems like there could be potential to split out:
#title from name variable,
#cabin type from cabin,
#and ticket type from ticket then drill down for further investigation
Commentary: Analyse the key attributes of the dataset variables (numeric and categorical variables). Assess null counts.
Look at the average numeric variables by Survived to see if any predictive potential is apparent.
Consider the count and average of Survived by each categorical variable item--to gauge whether any apparent predictive potential exists within categorical variables. From the name variable, it seems like title could be extracted, as well as the first letter/number from cabin, and the first character from ticket.
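As a small illustration of the "count and average of Survived by category" idea, the sketch below runs the same groupby/agg pattern on four made-up rows (the data is hypothetical). The mean of a 0/1 column is simply the survival rate for that group.

```python
import pandas as pd

# Hypothetical toy data: one categorical feature and the 0/1 target
sample = pd.DataFrame({"Sex": ["male", "female", "female", "male"],
                       "Survived": [0, 1, 1, 1]})

# Survival rate (mean) and group size (count) for each category
summary = sample.groupby("Sex")["Survived"].agg(["mean", "count"])
print(summary)
```

Looking at the count alongside the mean matters, because a high survival rate in a group of two or three passengers says very little.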
3. Wrangle Name
###3. Wrangle Name
bs1.shape#indicates there are 12 columns
#extract Title from Name
bs1['title']=bs1.Name.str.extract(r'([A-Za-z]+)\.',expand=False)#regex extraction of title
#logic as follows:
#[A-Za-z]: matches any letter
#+: preceding set repeated one or more times
#\.: has a dot at the end of the expression
bs1.shape
bs1.sample(n=10)
#view the new average survival by title table (Note: NaNs are excluded from
#mean calc)
bs1[['title','Survived']].\
    groupby(['title']).\
    agg(['mean','count']).\
    sort_values(by=('Survived','mean'), ascending=False)
#combine titles into broader groups
bs1['title']=bs1['title'].replace(['Mme','Countess','Lady','Dona'],'Mrs')#
#replace Mme, countess, lady, and dona with mrs
bs1['title']=bs1['title'].replace(['Mlle','Ms'],'Miss')#replace Mlle and Ms
bs1['title']=bs1['title'].replace(['Sir','Col','Capt','Rev','Don','Jonkheer',\
   'Major','Dr'],'Mr')#replace rare male titles
#view the new average survival by title table (Note: NaNs are excluded from
#mean calc)
bs1[['title','Survived']].\
    groupby('title').\
    agg(['mean','count']).\
    sort_values(by=('Survived','mean'), ascending=False)
#create title2 variable to convert title to numeric (for regtree modelling)
bs1['title2']=bs1['title'].map({'Mrs':1,'Miss':2,'Master':3,'Mr':4})
Commentary: Extract title from the Name variable (using a regular expression) then view the average of Survived by title to look for predictive potential. There is definitely some correlation: Mrs shows a 79% survival rate while Mr shows only 16%. But some of the rare titles need to be combined to avoid overfitting downstream.
Encode title to numeric values so that the tree model can use it later.
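The extract, group, and encode steps can be seen end to end on three names written in the dataset's "Surname, Title. Given names" layout. This is only a sketch: the grouping list is cut down to one rare title, and the mapping dictionary is the one used in the walkthrough.

```python
import pandas as pd

# Three names in the "Surname, Title. Given names" layout
names = pd.Series(["Braund, Mr. Owen Harris",
                   "Heikkinen, Miss. Laina",
                   "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)"])

# Capture the first run of letters that is immediately followed by a dot
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)

# Fold a rare title into a broader group, then encode the groups as integers
titles = titles.replace(["Countess"], "Mrs")
title_codes = titles.map({"Mrs": 1, "Miss": 2, "Master": 3, "Mr": 4})
print(titles.tolist(), title_codes.tolist())
```

Any title missing from the mapping dictionary would come out as NaN, which is a quick way to spot groups that were overlooked.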
4. Wrangle Cabin
###4. Wrangle Cabin
bs1['cabin_ltr']=bs1['Cabin'].str[:1]#take first character from Cabin
bs1['cabin_ltr'].sample(n=10)#inspect new variable
#view summary of new variable
bs1[['cabin_ltr','Survived']].groupby(['cabin_ltr']).agg(['mean','count'])
#looks like there is some predictive value in cabin_ltr
#broaden out small groups so as to avoid overfitting
bs1['cabin_ltr']=bs1['cabin_ltr'].replace(['A','F','G','T'],'Other')
#encode numeric cabin_ltr
cabin_ltr_map={'D':1,'E':2,'B':3,'C':4,'Other':5}#create mapping dictionary
bs1['cabin_ltr2']=bs1['cabin_ltr'].map(cabin_ltr_map)
bs1['cabin_ltr2'].value_counts()#check has assigned correctly
Commentary: Create a new cabin_ltr variable by extracting the first letter from the Cabin variable. Then view the grouped impact on the Survived variable. Although there are a lot of NaNs here, there is clearly some predictive value since there is some volatility in output (A=47% and D=76%).
Combine into larger groups then encode to numeric.
5. Wrangle Ticket
###5. Wrangle Ticket
#pull sample of 40 where survived is not null (i.e. train data)
bs1[['Ticket','Survived']][bs1['Survived'].notnull()].sample(n=40)
#no observable correlation patterns... try splitting by left-most char
#parse left-most character and view relationship
bs1['ticket_ltr']=bs1['Ticket'].str[:1]#create ticket_ltr feat using Ticket
#view mean of Survived by ticket_ltr
bs1[['ticket_ltr','Survived']].groupby(['ticket_ltr']).agg(['mean','count'])
#to me it looks like there is some correlative relationship (e.g. 63% of
#those with 1* as ticket# survived vs. 23% of those with 3* ticket#)
#group categories with small vols into separate group
bs1['ticket_ltr']=bs1['ticket_ltr'].\
    replace(['4','5','6','7','8','9','A'],'4-A')
bs1['ticket_ltr']=bs1['ticket_ltr'].replace(['C','F','L','P','S','W'],'C-W')
#view mean survived grouped by new categories
bs1[['ticket_ltr','Survived']].groupby('ticket_ltr').agg(['mean','count'])
#make categories numeric
bs1['ticket_ltr2']=bs1['ticket_ltr'].\
    map({'1':1,'2':2,'C-W':3,'3':4,'4-A':5}).astype(int)
Commentary: View a sample of 40 records to see if any kind of pattern can be observed between ticket and survived. No apparent pattern can be seen.
Try parsing the left-most character from the Ticket variable and looking at mean of survived by the resulting groups. After doing this it looks like there is some variability in mean survived by group. So I will use this in the model to see if it helps.
Combine resulting variable items into larger item groups then encode to numeric.
6/7. Wrangle Age and Sex
###6. Wrangle Age
#currently no wrangling--however, potential to explore further
###7. Wrangle Sex
bs1.Sex.value_counts()#see all levels in Sex variable
bs1['Sex2']=bs1['Sex'].map({'male':1,'female':0}).astype(int)#encode to integers
bs1.Sex2.value_counts()
Commentary: I thought about wrangling Age--i.e. converting it into logical age bins (e.g. infant, youth, young adult, middle-aged, elderly) but decided to run the model first then come back and adjust later if necessary. But I managed to beat the benchmark score on Kaggle without doing so--so this section is blank.
As for Sex--the only 'wrangling' to be done (if it can be termed such) is encoding the variable to numeric values (so that the tree classifier model we will see later can swallow it!)
8. Complete Fare and Embarked
###8. Complete Fare and Embarked
##Complete Fare
#just use median of non-nulls
bs1['Fare']=bs1['Fare'].fillna(bs1['Fare'].median())
#check there are now no nulls
pd.isnull(bs1['Fare']).sum() #returns 0
##Complete Embarked
#see frequency of Embarked items
bs1.Embarked.value_counts() #'S' is most frequent value
#replace NaNs with mode of Embarked ('S')
embarked_mode=bs1.Embarked.mode()[0]#return mode of Embarked
bs1['Embarked']=bs1['Embarked'].fillna(embarked_mode)
pd.isnull(bs1['Embarked']).sum()#check it has worked!
#encode as numeric
bs1['Embarked2']=bs1['Embarked'].map({'S':1,'C':2,'Q':3}).astype(int)
Commentary: Using pd.isnull(bs1['Fare']).sum() I can see that there is only one Fare record that is null. So I just use the median of the non-null values to complete the null instance of Fare.
Similarly, with Embarked, there are only 2 null observations so I just fill them with the mode of the remaining observations.
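The two fill rules (median for a numeric column, mode for a categorical one) look like this on a four-row toy frame. The values are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing Fare and one missing Embarked
df = pd.DataFrame({"Fare": [7.25, np.nan, 8.05, 71.28],
                   "Embarked": ["S", None, "C", "S"]})

# Numeric column: fill with the median of the non-null values
df["Fare"] = df["Fare"].fillna(df["Fare"].median())

# Categorical column: fill with the most frequent value (the mode)
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
print(df)
```

The median is used rather than the mean because a single very high fare (71.28 here) would pull the mean well away from a typical ticket price.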
9. Complete Age
###9. Complete Age
#check # of null values
pd.isnull(bs1['Age']).sum()#see how many nulls there are
#define model training data
tr_age=bs1[bs1['Age'].notnull()].copy()#define data to use for training
#define decision tree regressor parameters
from sklearn.tree import DecisionTreeRegressor
rtree=DecisionTreeRegressor(min_samples_split=20,min_samples_leaf=10,
                            random_state=1)
#define predictor and target variables
colnames=bs1.columns.values.tolist()#save colnames as list
for i, item in enumerate(colnames):#list colnames and item number for reference
    print('%02d'%i+', '+item+', '+str(bs1[item].dtype))#'%02d' denotes 00 format
predictors=list(colnames[i] for i in[5,7,13])#get desired predictors
predictors#check I've selected the intended columns
target=list(colnames[i] for i in[0])#extract target (Age)
target#check correct target has been defined (Age)
#fit a regression tree model based on the algorithm and predictors
rtree.fit(tr_age[predictors],tr_age[target])
#use the model to predict age on training data--and compare vs. actuals
tr_age['age_pred']=rtree.predict(tr_age[predictors])#get predicted vals
#define a function for calculating rmse
def rmse(preds, actuals):
    sq_err=(actuals-preds)**2
    mse=sq_err.mean()
    rmse=mse**.5
    return rmse
#compute rmse
rmse(tr_age['age_pred'],tr_age['Age'])
#view data in Excel to verify/confirm
tr_age[['age_pred','Age']].to_csv('age_guesses2.csv')
#cross-validate (k-fold)
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
x=tr_age[predictors]
y=tr_age[target]
#set up cross validation parameters
crossv=KFold(n_splits=6,shuffle=True,random_state=1)
folds=cross_val_score(rtree,x,y,scoring='neg_mean_squared_error',
                      cv=crossv,n_jobs=1)
folds_rmse=abs(folds)**.5#view rmse of folds
folds_rmse.mean()#mean of folds rmse
#see feature importances
a=x.columns.values.tolist()
b=rtree.feature_importances_.tolist()
pd.DataFrame({'imp':b,'feat':a}).sort_values(by=('imp'), ascending=False)
#the higher the value the more important for prediction in this tree model
Commentary: So, we now come to a variable that has a substantial proportion of observations with missing data; and I considered some potential approaches to deciding how to complete the variable (so that the classification tree model will work). The simplest method would be to just use the average of the non-null age observations--as it happens I don't think it was hugely influential which method was adopted. Maybe that is because the 'title' variable acts as a surrogate for Age in some ways.
But I spent a bit of time creating estimates and testing them anyway. It actually becomes a useful tutorial in itself to do this--since essentially the same process is followed later when predicting Survived.
DecisionTreeRegressor Method
From the block of code above you can see that I define the ad hoc training set (tr_age) by taking all observations where age is not null.
I then use the DecisionTreeRegressor algorithm to create the rtree model (with min_samples_split=20 and min_samples_leaf=10).
Some variables cannot be used since they are either not yet complete or they are not numeric. So I create my list of predictors (indexed by 5, 7, and 13 in the colnames list). I also set the target equal to Age.
The next stage is to fit the rtree model on the train data. The decision tree regressor will use the CART algorithm to essentially create an extensive nested if-statement based on the predictor variables and the target variable (age) in the training data. This nested if-statement can then later be used on the test data predictors to predict Age for the test observations.
But first I want to test whether the fit is a good one or not. To do this I look for the root mean squared errors (RMSE) from predicted to actuals in the train data. The RMSE is the square root of the mean of all squared errors in the data. I have created a function to calculate it given two vectors (predicted values, and actual values). It turns out to be 10.78.. which is pretty high. My algorithm is off by ca. 10yrs for each person on the train data.. on average.
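For anyone who wants to check the RMSE arithmetic by hand, here is a small worked sketch with three made-up predictions. The function is a NumPy restatement of the same calculation, not the exact function from the code above.

```python
import numpy as np

def rmse(preds, actuals):
    """Root mean squared error between two equal-length sequences."""
    preds = np.asarray(preds, dtype=float)
    actuals = np.asarray(actuals, dtype=float)
    return float(np.sqrt(np.mean((actuals - preds) ** 2)))

# Errors are 2, 0 and -4, so the squared errors are 4, 0 and 16.
# Their mean is 20/3, and the square root of that is the RMSE.
example = rmse([20, 30, 40], [22, 30, 36])
print(example)
```

Because the errors are squared before averaging, one large miss (the 4-year error here) counts for more than several small ones.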
A better way to gauge if the model has a close fit is to use k-fold cross validation. This means I essentially repeat all the above steps--i.e. taking some train data and fitting a model, then getting the rmse; but instead of using all the train data I randomly split, or fold, the train data into n folds (I chose 6) then take the average of the rmse from each fold.
After doing this you can see by hitting F9 on the folds_rmse line that the folds show rmse's ranging from 10.2 to 12.3. The mean of these is 11.01.
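The fold-by-fold RMSE calculation can be tried in isolation with the sketch below. It uses synthetic data (the feature matrix and the noisy "age" target are invented for the example) and the KFold and cross_val_score functions from sklearn.model_selection, with the same tree parameters and six folds as in the walkthrough.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: two predictors and a noisy age-like target
rng = np.random.RandomState(1)
X = rng.uniform(0, 3, size=(120, 2))
y = 25 + 5 * X[:, 0] + rng.normal(0, 10, size=120)

model = DecisionTreeRegressor(min_samples_split=20, min_samples_leaf=10,
                              random_state=1)

# Six shuffled folds: each fold is held out once while the rest train the tree
cv = KFold(n_splits=6, shuffle=True, random_state=1)
neg_mse = cross_val_score(model, X, y,
                          scoring="neg_mean_squared_error", cv=cv)

# The scorer returns negated MSE, so flip the sign before the square root
fold_rmse = np.sqrt(-neg_mse)
print(fold_rmse)
print(fold_rmse.mean())
```

Each fold's RMSE is measured on rows the tree did not see during fitting, which is why the cross-validated figure tends to sit above the RMSE measured on the training data itself.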
It is useful to look at the features (predictors) used in the model and see how important they were for obtaining the achieved accuracy. Looks like title2 and Pclass were most important and Parch was the least valuable from the features used. It is a useful exercise to go back and add more features (or even tweak them, and create new ones) then repeat the modelling process to see if a better result can be achieved.. and to be able to explain why it may have moved/changed.
RandomForestRegressor Method
Since I was not too happy about the RMSE of 11--suspecting that the trained model was probably giving me bogus Age estimates for those passengers with null ages--I decided to see if a random forest might yield a tighter generalised fit on the train data.
A random forest essentially trains many variations of the tree model, then averages the results across all 'trees' in the forest. It works because a lone tree model is limited by which variables it uses and where its splits are made. Since a random forest considers only a random subset of the available predictors at each split, it explores the potential of each feature more fully. It also does not use 100% of the available observations for each tree. It uses a sample of the available observations, then benchmarks the performance of each tree using out of bag (oob for short) observations to gauge accuracy. Out of bag here means 'not used for training' on any particular tree.
I have used a similar approach as with the earlier tree model in terms of the steps taken. But this time, I have actually randomly split the tr_age train set into two sub-sets. I have split it into quasi train (tr_age_train) and test (tr_age_test) subsets. This way I can reserve a portion of the data to test out of sample data on once the model has been trained.
With the forest model I select a much wider range of features since I want the forest to explore all potential avenues for predictive potential. I am also not as concerned about limiting the min_samples_leaf and min_samples_split criteria since I can let the forest overfit across all samples then give me the mean impact across all trees.
Using the model to predict age on the training data I get a much improved rmse score of 6.47 when used on the tr_age_train dataset. This is quite encouraging. But when I then take the trained forest model and use it on the out of sample (i.e. the left out) portion, the rmse is still 11.7! So, it looks like the forest didn't give me a much better score after all.
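The gap between the in-sample and out-of-sample RMSE is easy to reproduce. The sketch below is not the randfr code from my program: it is a minimal illustration on synthetic data, and train_test_split is swapped in for the sample/drop split used elsewhere in this walkthrough. All variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: four predictors and a noisy age-like target
rng = np.random.RandomState(1)
X = rng.uniform(0, 3, size=(300, 4))
y = 25 + 5 * X[:, 0] + rng.normal(0, 10, size=300)

# Hold back 30% of the rows as a quasi test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# Unrestricted trees, so each one is free to overfit its bootstrap sample
forest = RandomForestRegressor(n_estimators=100, oob_score=True,
                               random_state=1)
forest.fit(X_train, y_train)

def rmse(preds, actuals):
    return float(np.sqrt(np.mean((actuals - preds) ** 2)))

train_rmse = rmse(forest.predict(X_train), y_train)  # optimistic
test_rmse = rmse(forest.predict(X_test), y_test)     # honest estimate
print(train_rmse, test_rmse, forest.oob_score_)
```

For a regressor, oob_score_ is an R-squared value computed on the out-of-bag rows, so it is read differently from the RMSE figures but tells the same story about how well the forest generalises.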
In any case I will use the trained forest model (randfr in my code) to complete Age.
10. Complete cabin_ltr
###10. Complete cabin_ltr
import numpy as np#needed for np.where below
#check # of null values
pd.isnull(bs1['cabin_ltr']).sum()
#define variables for predictor and target variables
colnames=bs1.columns.values.tolist()#save colnames as list
for i, item in enumerate(colnames):#list colnames and item number for reference
    print('%02d'%i+', '+item+', '+str(bs1[item].dtype))#'%02d' denotes 00 format
predictors=list(colnames[i] for i in[0,3,5,7,17,18,19])#get predictors
predictors#check I've selected the intended columns
target=list(colnames[i] for i in[14])#extract target (cabin_ltr)
target#check correct target has been defined (cabin_ltr)
#define decision tree classifier parameters
from sklearn.tree import DecisionTreeClassifier
ctree=DecisionTreeClassifier(criterion='entropy',min_samples_split=20,
                             random_state=1)
#define model training data then fit the model
tr_cab=bs1[bs1['cabin_ltr'].notnull()]#define data to use for training
tr_cab_train=tr_cab.sample(frac=.7)#define training set
tr_cab_test=tr_cab.drop(tr_cab_train.index)#define test set
#fit a classification tree model based on the algorithm and predictors
x=tr_cab_train[predictors]
y=tr_cab_train[target]
ctree.fit(x,y)#train the tree model
#use the model to predict cabin_ltr on training data--and compare vs. actuals
tr_cab_train['cabin_ltr_pred']=ctree.predict(x)#get predicted vals
##See percent correctly classified in training data
#create new variable 'correct' where predicted=actual
tr_cab_train['correct']=np.where(tr_cab_train['cabin_ltr']==\
            tr_cab_train['cabin_ltr_pred'],1,0)
tr_cab_train['correct'].sum()/tr_cab_train.shape[0]#sum corrects/all possible
#now test on out of sample observations
tr_cab_test['cabin_ltr_pred']=ctree.predict(tr_cab_test[predictors])
tr_cab_test['correct']=np.where(tr_cab_test['cabin_ltr']==\
           tr_cab_test['cabin_ltr_pred'],1,0)
tr_cab_test['correct'].sum()/tr_cab_test.shape[0]#sum corrects/all possible
##Cross-Validate (K-Fold)
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
x=tr_cab_train[predictors]
y=tr_cab_train[target]
#set up cross validation parameters
crossv=KFold(n_splits=6,shuffle=True,random_state=1)#xval parameters
fold_scores=cross_val_score(ctree,x,y,scoring='accuracy',cv=crossv)#score folds
fold_scores.mean()#get mean of fold scores
#see feature importances
a=x.columns.values.tolist()
b=ctree.feature_importances_.tolist()
pd.DataFrame({'var':a,'imp':b}).sort_values(by=('imp'), ascending=False)
#the higher the value the more important for prediction in this tree model
#%%
##See if Random Forest Classifier is Better
#define random forest classifier parameters
from sklearn.ensemble import RandomForestClassifier
randfc=RandomForestClassifier(criterion='entropy',
                              n_jobs=2,oob_score=True,
                              n_estimators=100)
#define model training data, predictors and target
tr_cab_train=tr_cab.sample(frac=.7)#define training set
tr_cab_test=tr_cab.drop(tr_cab_train.index)#define test set
#predictors and targets have already been defined
x=tr_cab_train[predictors]#rebuild x and y from the new sample
y=np.ravel(tr_cab_train[target])
#train a randfc model based on data and predictors
randfc.fit(x,y)
randfc.oob_score_
##See percent correctly classified in training data
#create new variable 'correct' where predicted=actual
tr_cab_train['cabin_ltr_pred_rf']=randfc.predict(x)
tr_cab_train['correct_rf']=np.where(tr_cab_train['cabin_ltr']==\
            tr_cab_train['cabin_ltr_pred_rf'],1,0)
print(tr_cab_train['correct_rf'].sum()/tr_cab_train.shape[0])
#now test on out of sample observations
tr_cab_test['cabin_ltr_pred_rf']=randfc.predict(tr_cab_test[predictors])
tr_cab_test['correct_rf']=np.where(tr_cab_test['cabin_ltr']==\
           tr_cab_test['cabin_ltr_pred_rf'],1,0)
print(tr_cab_test['correct_rf'].sum()/tr_cab_test.shape[0])
#%%
#choose a model and complete cabin_ltr
bs1['cabin_ltr_pred']=randfc.predict(bs1[predictors])#derive fitted values
bs1[['cabin_ltr','cabin_ltr_pred']].sample(n=20,random_state=1)#check looks ok
bs1['cabin_ltr']=bs1['cabin_ltr'].fillna(bs1['cabin_ltr_pred'])#complete cabin_ltr
pd.isnull(bs1['cabin_ltr']).sum()#check nulls have been filled
bs1['cabin_ltr'].value_counts()#view new split of cabin_ltr
bs1['cabin_ltr2']=bs1['cabin_ltr'].\
    map({'B':1,'C':2,'D':3,'E':4,'Other':5}).astype(int)#convert to numeric
Commentary: I will not go into the detail of completing the cabin_ltr feature since the process is much the same as with Age. However, cabin_ltr is not a continuous variable, so I am using classifier tree models instead of the regressor models. Their syntax is very similar--as you can see when comparing the above code to that for completing Age.
11. Create Family Features
###11. Create Family Features
##Family Feature
bs1['surname']=bs1.Name.str.extract(r'([A-Za-z]+)\,',expand=False)
#logic as follows:
#[A-Za-z]: matches any letter
#+: preceding set repeated one or more times
#\,: has a comma at the end of the expression
bs1['surname'].sample(n=10)#check looks ok
bs1['famsize']=bs1.SibSp+bs1.Parch+1
bs1['famsize_str']=bs1['famsize'].apply(str)#convert to string
bs1['family']=bs1['surname']+bs1['famsize_str']
##Create Family Cost Feature
#create dataset of avg famcosts in bs1
famcosts=bs1[['family','Fare']].groupby('family').agg(['count','mean'])#create
#famcosts lookup dataframe
famcosts.columns=famcosts.columns.droplevel()#drop the first column level
fc2=famcosts.reset_index()#reset the index (i.e. not family)
fc3=fc2.reset_index()#reset index again
fc3.head(n=20)#check index has worked
fc3.rename(columns={'index':'family_int','count':'fam_n','mean':'fam_avgcost'},
           inplace=True)
fc3.sample(n=10)
#merge with bs1
bs1=pd.merge(bs1,fc3,on=['family'])
bs1.sample(n=10)#check merge looks ok
bs1['famcost']=bs1.famsize*bs1.fam_avgcost#calculate total famcost
bs1.sample(n=10)#inspect calc has worked as expected
#encode family variable to blend anyone solo or couple
bs1['family2']=np.where(bs1['famsize']==1, 'solo',
   np.where(bs1['famsize']==2, 'couple', bs1['family']))
fm1=pd.DataFrame(bs1['family2'].value_counts())#get df of counts
fm1.drop('family2',axis=1,inplace=True)
fm2=fm1.reset_index()#reset index
fm3=fm2.reset_index()#reset index again to obtain last index as variable
#rename the vars for merging
fm3.rename(columns={'index':'family2','level_0':'family2_int'},inplace=True)
bs1=pd.merge(bs1,fm3,on=['family2'])#merge into original bs1
Commentary: I must confess that this section was created AFTER the subsequent sections in an attempt to improve my score. The family idea comes straight from Trevor Stephens' R tutorial. The family cost idea I thought of myself... I just considered that the total amount (in £) a family spent may be a deciding feature for Survived.
As with title, a regular expression is used to extract the last name from the Name variable. Then the family's size is determined using the SibSp + Parch variables. These can then be combined to create the family feature.
The family cost feature is a little more tricky. It involves taking the mean cost for a family based on available data, then applying it to a passenger's family size to gauge the total family cost associated with an individual. I have used droplevel() to flatten the two-level column names produced by agg(), and a second reset_index() call as a quick way to encode a family as numeric for the tree classifier.
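Here is a compact sketch of the same family and family cost logic on three invented passengers. It swaps in groupby(...).transform('mean') and ngroup() for the lookup-table-and-merge route in my code, purely to keep the example short; the names and fares are hypothetical.

```python
import pandas as pd

# Hypothetical passengers in the dataset's "Surname, Title. Given" layout
df = pd.DataFrame({"Name": ["Smith, Mr. John", "Smith, Mrs. Jane",
                            "Jones, Miss. Ann"],
                   "SibSp": [1, 1, 0],
                   "Parch": [0, 0, 0],
                   "Fare": [20.0, 30.0, 8.0]})

# Surname = run of letters immediately before the comma
df["surname"] = df["Name"].str.extract(r"([A-Za-z]+),", expand=False)

# Family size counts the passenger plus siblings/spouse plus parents/children
df["famsize"] = df["SibSp"] + df["Parch"] + 1
df["family"] = df["surname"] + df["famsize"].astype(str)

# Average fare within each family, then scale by family size
df["fam_avgcost"] = df.groupby("family")["Fare"].transform("mean")
df["famcost"] = df["famsize"] * df["fam_avgcost"]

# Integer code per family (assigned in sorted order of the family label)
df["family_int"] = df.groupby("family").ngroup()
print(df[["family", "famcost", "family_int"]])
```

Combining surname with family size is what keeps two unrelated families that share a surname from being merged, provided their sizes differ.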
12. Complete Survived (Decision Tree Method)
###12. Complete Survived (Decision Tree Method)
#check # of null values
pd.isnull(bs1['Survived']).sum()
#define variables for predictor and target variables
colnames=bs1.columns.values.tolist()#save colnames as list
for i, item in enumerate(colnames):#list colnames and item number for reference
    print('%02d'%i+', '+item+', '+str(bs1[item].dtype))#'%02d' denotes 00 format
predictors=list(colnames[i] for i in[0,3,7,9,13,15,17])#get predictors
predictors#check I've selected the intended columns
target=list(colnames[i] for i in[10])#extract target (Survived)
target#check correct target has been defined (Survived)
#define decision tree classifier parameters
from sklearn.tree import DecisionTreeClassifier
ctree=DecisionTreeClassifier(criterion='entropy',min_samples_split=20,
                             random_state=1)
#define model training data then fit the model
tr_svd=bs1[bs1['Survived'].notnull()]#define data to use for training
tr_svd_train=tr_svd.sample(frac=.7)#define training set
tr_svd_test=tr_svd.drop(tr_svd_train.index)#define test set
#fit a classification tree model based on the algorithm and predictors
x=tr_svd_train[predictors]
y=tr_svd_train[target]
ctree.fit(x,y)
##See percent correctly classified in training data
#create new variable 'correct' where predicted=actual
#use the trained model to predict survived
tr_svd_train['Survived_pred_ctree']=ctree.predict(x)
tr_svd_train['correct']=np.where(tr_svd_train['Survived']==\
            tr_svd_train['Survived_pred_ctree'],1,0)
tr_svd_train['correct'].sum()/tr_svd_train.shape[0]#sum corrects/all possible
#try on test out of sample data
tr_svd_test['Survived_pred_ctree']=ctree.predict(tr_svd_test[predictors])
tr_svd_test['correct']=np.where(tr_svd_test['Survived']==\
           tr_svd_test['Survived_pred_ctree'],1,0)
tr_svd_test['correct'].sum()/tr_svd_test.shape[0]#sum corrects/all possible
#view crosstab of predictions vs. actuals
pd.crosstab(tr_svd_train['Survived'],tr_svd_train['Survived_pred_ctree'],
            rownames=['actual'], colnames=['predicted'])
#cross-validate (k-fold)
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
#set up cross validation parameters
crossv=KFold(n_splits=6,shuffle=True,random_state=1)#xval parameters
fold_scores=cross_val_score(ctree,x,y,scoring='accuracy',cv=crossv)#score folds
fold_scores
fold_scores.mean()#get mean of fold scores
#see feature importances
a=x.columns.values.tolist()
b=ctree.feature_importances_.tolist()
pd.DataFrame({'var':a,'imp':b}).sort_values(by=('imp'), ascending=False)
#the higher the value the more important for prediction in this tree model
Commentary: Once again, as with Age and cabin_ltr the same tree classifier model method has been applied. You can see that I used k-fold validation again, then re-worked the model by including/excluding features or altering the min_samples_split parameter value. I will not walk through again, since the methods have already been largely covered in Complete Age section above.
I decided to supersede this tree method with the forest classifier method covered below.
13. Complete Survived (Random Forest Method)
###13. Complete Survived (Random Forest Method)
#define variables for predictor and target variables
colnames=bs1.columns.values.tolist()#save colnames as list
for i, item in enumerate(colnames):#list colnames and item number for reference
    print('%02d'%i+', '+item+', '+str(bs1[item].dtype))#'%02d' denotes 00 format
predictors=list(colnames[i] for i in[0,3,5,7,9,13,15,17,18,19,23,29,31])
predictors#check I've selected the intended columns
target=list(colnames[i] for i in[10])#extract target (Survived)
target#check correct target has been defined (Survived)
#train the cforest model
from sklearn.ensemble import RandomForestClassifier
randfc=RandomForestClassifier(oob_score=True,
                              criterion='entropy',
                              max_features=4,
                              n_jobs=2,
                              min_samples_split=10,
                              min_samples_leaf=5,
                              n_estimators=200)
#divide ktrain into a further train/test split
tr_svd=bs1[bs1['Survived'].notnull()]#define data to use for training
tr_svd_train=tr_svd.sample(frac=.7)#define training set
tr_svd_test=tr_svd.drop(tr_svd_train.index)#define test set
x=tr_svd_train[predictors]
y=np.ravel(tr_svd_train[target])#to get around indices error..
#configure ensemble tuning grid
from sklearn.model_selection import GridSearchCV
param_grid={'min_samples_leaf':[3,4,5,6,7],
            'min_samples_split':[2,4,6],
            #'n_estimators':[100,200,300],
            'max_features':[2,3,4,5]}
cv_randfc=GridSearchCV(estimator=randfc, param_grid=param_grid, cv=3)
#cv_randfc.fit(x,y)#fit random forest with gridsearch
#cv_randfc.best_params_#view best param_grid
#cv_randfc.best_score_#view associated best score
#<--Now adjust the randfc parameters to match best_params-->
#train model with altered parameters
randfc.fit(x,y)#train the model on train data
randfc.oob_score_#out of bag accuracy
#assess accuracy on trained data
tr_svd_train['Survived_pred_randfc']=randfc.predict(tr_svd_train[predictors])
tr_svd_train['correct_rfc']=np.where(tr_svd_train['Survived']==\
            tr_svd_train['Survived_pred_randfc'],1,0)
tr_svd_train['correct_rfc'].sum()/tr_svd_train.shape[0]
#try on test out of sample data
tr_svd_test['Survived_pred_randfc']=randfc.predict(tr_svd_test[predictors])
tr_svd_test['correct']=\
    np.where(tr_svd_test['Survived']==tr_svd_test['Survived_pred_randfc'],1,0)
print(tr_svd_test['correct'].sum()/tr_svd_test.shape[0])
##Make Predictions on Base Data
bs1['Survived_pred_randfc']=randfc.predict(bs1[predictors])
randfc.oob_score_
Commentary: As when completing the cabin_ltr feature, I have used the random forest classifier again. This time, however, I have passed some extra parameters to the model itself. max_features is the number of features the forest considers when looking for the best split at each node; n_estimators is simply the number of trees to grow. In my experience, for a small set of features anything more than 50 trees doesn't make a big difference.
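To make those parameters a little more concrete, here is a minimal sketch that fits the same kind of forest with different tree counts and compares the out-of-bag accuracy. Note that it uses synthetic data from make_classification as a stand-in for the Titanic features (an assumption on my part, not the real dataset), and the tree counts 25/50/200 are picked purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 13 Titanic predictors (assumption: not the real data)
X, y = make_classification(n_samples=600, n_features=13, n_informative=6,
                           random_state=0)

oob_by_trees = {}
for n_trees in (25, 50, 200):
    forest = RandomForestClassifier(
        n_estimators=n_trees,   # number of trees to grow
        max_features=4,         # features considered at each split
        min_samples_split=10,
        min_samples_leaf=5,
        criterion='entropy',
        oob_score=True,         # score each row using only trees that never saw it
        random_state=0,         # fixed seed so the comparison is repeatable
    )
    forest.fit(X, y)
    oob_by_trees[n_trees] = forest.oob_score_

for n_trees, score in oob_by_trees.items():
    print(n_trees, round(score, 3))
```

If the scores for 50 and 200 trees come out close on your own data, that is a hint that the extra trees are mostly costing run time.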
A parameter grid has also been set up to tune the model parameters. As you can hopefully infer, the parameter grid just lists the values to test for each parameter. These are fed into the model via GridSearchCV so that every possible combination of these parameters is run and scored with cross-validation, to arrive at the best combination in the grid. With the grid above that is 5 x 3 x 4 = 60 combinations, each fitted on 3 folds. This helped me eke out a few more percentage points of accuracy in my % correct score. The downside of this technique is that it hogs a lot of computer resource: it took my laptop 15+ minutes to run that section of code.
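For anyone who wants to see the grid search run end to end without the long wait, here is a cut-down sketch. It assumes synthetic data and a much smaller grid than mine (2 x 2 = 4 combinations), so treat the values as placeholders rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the Titanic set)
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

# A deliberately tiny grid: 2 x 2 = 4 parameter combinations
param_grid = {'min_samples_leaf': [3, 5], 'max_features': [2, 4]}
base = RandomForestClassifier(n_estimators=30, random_state=0)

# Each combination is scored with 3-fold cross-validation
search = GridSearchCV(estimator=base, param_grid=param_grid, cv=3)
search.fit(X, y)

n_combinations = len(search.cv_results_['params'])
print(n_combinations)
print(search.best_params_, round(search.best_score_, 3))

# Refit a fresh model using the winning parameters
tuned = RandomForestClassifier(n_estimators=30, random_state=0,
                               **search.best_params_).fit(X, y)
```

Unpacking best_params_ into a new classifier is one way to "adjust the parameters to match best_params" without retyping them by hand.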
Once I was happy with the trained model, I used it to make predictions on the base data.
14. Prepare Kaggle Submission
###14. Prepare Kaggle Submission
#merge reworked base dataset with ktest to include only test samples
submit=pd.merge(bs1,ktest,on=['PassengerId'])\
    [['PassengerId','Survived_pred_randfc']].sort_values(by='PassengerId')
submit.rename(columns={'Survived_pred_randfc':'Survived'},
              inplace=True)#rename survived
submit.sample(n=20)#inspect a sample of submission
submit.head(n=20)
submit.to_csv('submit6.csv', index=False)#output to csv without printing index
Commentary: After applying the predicted values to the base dataset (base includes train and test in my program) the last thing to do is extricate the test set for Kaggle and then prepare the .csv file. I did this by merging with the original Kaggle test set (ktest) on PassengerId, which keeps only the test passengers. The to_csv method is then used to write the csv in preparation for uploading to Kaggle.
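The merge step can be sketched on a toy example as follows. The two small data frames here are made up for illustration (they are not the Kaggle files), and the final cast to int is an optional tidy-up I have added rather than something the program above does.

```python
import pandas as pd

# Toy "base" data: train and test rows together, with model predictions
base = pd.DataFrame({'PassengerId': [1, 2, 3, 4, 5],
                     'Survived_pred_randfc': [0.0, 1.0, 1.0, 0.0, 1.0]})
# Toy "test" data: only the passengers Kaggle wants predictions for
test = pd.DataFrame({'PassengerId': [5, 4], 'Pclass': [3, 1]})

# Default merge is an inner join, so only ids present in both frames survive
submit = (pd.merge(base, test, on=['PassengerId'])
          [['PassengerId', 'Survived_pred_randfc']]
          .sort_values(by='PassengerId')
          .rename(columns={'Survived_pred_randfc': 'Survived'}))

# Optional: store the 0.0/1.0 predictions as whole numbers
submit['Survived'] = submit['Survived'].astype(int)
print(submit)
```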
I used a random split of train/test when training the final random forest model used to predict Survived. Since I did not include a seed value (i.e. random_state=n) you may receive a slightly different result when submitting your results to Kaggle. My last result was .80861--the attached code should yield a score in that vicinity.
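If you would rather get the same split every time, a seed can be passed to sample. The snippet below is a small sketch on a made-up ten-row data frame (the seed value 1 is arbitrary); the same idea applies to RandomForestClassifier, which also accepts a random_state argument.

```python
import pandas as pd

# Made-up data frame standing in for the training rows
df = pd.DataFrame({'PassengerId': range(1, 11)})

# Same seed, same 70% sample on every run
train_a = df.sample(frac=.7, random_state=1)
train_b = df.sample(frac=.7, random_state=1)
test_a = df.drop(train_a.index)  # remaining 30% becomes the test set

same_split = train_a.index.equals(train_b.index)
print(same_split, len(train_a), len(test_a))
```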
Kaggle Titanic Program (In Full)
All the above code is contained in the block below in full for convenience.

###############################################################################
## Titanic Survival Predictions
###############################################################################
#Purpose:
#   To rank better than 0.786 on the Kaggle public leaderboard by building
#   a machine learning solution
#
#Inputs:
#   - train.csv
#   - test.csv
#
#Anatomy:
#   1. Import Kaggle Data
#   2. Explore the Datasets
#   3. Wrangle Name
#   4. Wrangle Cabin
#   5. Wrangle Ticket
#   6. Wrangle Age
#   7. Wrangle Sex
#   8. Complete Fare and Embarked
#   9. Complete Age
#   10. Complete cabin_ltr
#   11. Create Family Features
#   12. Complete Survived (Decision Tree Method)
#   13. Complete Survived (Random Forest Method)
#   14. Prepare Kaggle Submission

### 1. Import Kaggle Data
import os #import the os library to use chdir command
print(os.getcwd()) #returns the current working directory
os.chdir('C:\\Users\\craig\\Google Drive\\201704\\0408pyth\\cache')#changes the
#current working directory
import pandas as pd #import pandas library - needed for read_csv
ktrain=pd.read_csv('train.csv')#read in train data
ktest=pd.read_csv('test.csv')#read in test data
type(ktest)#check ktest is a data frame
ktest.shape#check rows and columns in ktest
bs1=pd.concat([ktrain,ktest])#concatenate the data
bs1.shape#check dimensions of bs1
#check that Survived has been concatenated properly
bs1.head(n=5)
bs1.tail(n=5)
#it is concatenated as NaN--good!

###2. Explore the Dataset
bs1.info()#data frame variables, missing counts, and types
bs1.describe()#view summary of numeric variables
bs1.describe(include=['O'])#view summary of non-numeric variables
#view null counts per variable
pd.isnull(bs1).sum()
#see avg numerics by survived
bs1.groupby('Survived').mean()
#Fare, Parch, and Pclass seem most predictive features here
#mean of survived by all categorical variables in turn
feats=list(ktrain.select_dtypes(include=['O']))#get list of categorical feats
feats#view list of categorical feats
for feat in feats:#loop through each variable and show survived * feat
    print(ktrain[[feat,'Survived']].\
          groupby([feat]).\
          agg(['mean','count']))
#seems like there could be potential to split out:
#title from name variable,
#cabin type from cabin,
#and ticket type from ticket then drill down for further investigation

###3. Wrangle Name
bs1.shape#indicates there are 12 columns
#extract Title from Name
bs1['title']=bs1.Name.str.extract(r'([A-Za-z]+)\.',
                                  expand=False)#regex extraction of title
#logic as follows:
#[A-Za-z]: matches any letter
#+: preceding set repeated one or more times
#\.: has a dot at the end of the expression
bs1.shape
bs1.sample(n=10)
#view the new average survival by title table (Note: NaNs are excluded from
#mean calc)
bs1[['title','Survived']].\
    groupby(['title']).\
    agg(['mean','count']).\
    sort_values(by=('Survived','mean'), ascending=False)
#combine titles into broader groups
bs1['title']=bs1['title'].replace(['Mme','Countess','Lady','Dona'],'Mrs')#
#replace Mme, countess, lady, and dona with mrs
bs1['title']=bs1['title'].replace(['Mlle','Ms'],'Miss')#replace Mlle and Ms
bs1['title']=bs1['title'].replace(['Sir','Col','Capt','Rev','Don','Jonkheer',\
    'Major','Dr'],'Mr')#replace rare male
bs1['title']=bs1['title'].replace(['Countess','Lady'],'Mrs')
#view the new average survival by title table (Note: NaNs are excluded from
#mean calc)
bs1[['title','Survived']].\
    groupby('title').\
    agg(['mean','count']).\
    sort_values(by=('Survived','mean'), ascending=False)
#create title2 variable to convert title to numeric (for regtree modelling)
bs1['title2']=bs1['title'].map({'Mrs':1,'Miss':2,'Master':3,'Mr':4})

###4. Wrangle Cabin
bs1['cabin_ltr']=bs1['Cabin'].str[:1]#take first character from Cabin
bs1['cabin_ltr'].sample(n=10)#inspect new variable
#view summary of new variable
bs1[['cabin_ltr','Survived']].groupby(['cabin_ltr']).agg(['mean','count'])
#looks like there is some predictive value in cabin_ltr
#broaden out small groups so as to avoid overfitting
bs1['cabin_ltr']=bs1['cabin_ltr'].replace(['A','F','G','T'],'Other')
#encode numeric cabin_ltr
cabin_ltr_map={'D':1,'E':2,'B':3,'C':4,'Other':5}#create mapping dictionary
bs1['cabin_ltr2']=bs1['cabin_ltr'].map(cabin_ltr_map)
bs1['cabin_ltr2'].value_counts()#check has assigned correctly

###5. Wrangle Ticket
#pull a random sample of 40 where survived is not null (i.e. train data)
bs1[['Ticket','Survived']][bs1['Survived'].notnull()].sample(n=40)
#no observable correlation patterns... try splitting by left-most char
#parse left-most character and view relationship
bs1['ticket_ltr']=bs1['Ticket'].str[:1]#create ticket_ltr feat using Ticket
#view mean of Survived by ticket_ltr
bs1[['ticket_ltr','Survived']].groupby(['ticket_ltr']).agg(['mean','count'])
#to me it looks like there is some correlative relationship (e.g. 63% of
#those with 1* as ticket# survived vs. 23% of those with 3* ticket#)
#group categories with small vols into separate group
bs1['ticket_ltr']=bs1['ticket_ltr'].\
    replace(['4','5','6','7','8','9','A'],'4-A')
bs1['ticket_ltr']=bs1['ticket_ltr'].replace(['C','F','L','P','S','W'],'C-W')
#view mean survived grouped by new categories
bs1[['ticket_ltr','Survived']].groupby('ticket_ltr').agg(['mean','count'])
#make categories numeric
bs1['ticket_ltr2']=bs1['ticket_ltr'].\
    map({'1':1,'2':2,'C-W':3,'3':4,'4-A':5}).astype(int)

###6. Wrangle Age
#currently no wrangling--however, potential to explore further

###7. Wrangle Sex
bs1.Sex.value_counts()#see all levels in Sex variable
bs1['Sex2']=bs1['Sex'].map({'male':1,'female':0}).astype(int)#encode to integers
bs1.Sex2.value_counts()

###8. Complete Fare and Embarked
##Complete Fare
#just use median of non-nulls
bs1['Fare'].fillna(bs1['Fare'].median(),inplace=True)
#check there are now no nulls
pd.isnull(bs1['Fare']).sum() #returns 0
##Complete Embarked
#see frequency of Embarked items
bs1.Embarked.value_counts() #'S' is most frequent value
#replace NaNs with mode of Embarked ('S')
embarked_mode=bs1.Embarked.mode()[0]#return mode of Embarked
bs1['Embarked']=bs1['Embarked'].fillna(embarked_mode)
pd.isnull(bs1['Embarked']).sum()#check it has worked!
#encode as numeric
bs1['Embarked2']=bs1['Embarked'].map({'S':1,'C':2,'Q':3}).astype(int)

###9. Complete Age
#check # of null values
pd.isnull(bs1['Age']).sum()#see how many nulls there are
#define model training data
tr_age=bs1[bs1['Age'].notnull()]#define data to use for training
#define decision tree regressor parameters
from sklearn.tree import DecisionTreeRegressor
rtree=DecisionTreeRegressor(min_samples_split=20,min_samples_leaf=10,
                            random_state=1)
#define predictor and target variables
colnames=bs1.columns.values.tolist()#save colnames as list
for i, item in enumerate(colnames):#list colnames and item number for reference
    print('%02d'%i+', '+item+', '+str(bs1[item].dtype))#'%02d' denotes 00 format
predictors=list(colnames[i] for i in[5,7,13])#get desired predictors
predictors#check I've selected the intended columns
target=list(colnames[i] for i in[0])#extract target (Age)
target#check correct target has been defined (Age)
#fit a regression tree model based on the algorithm and predictors
rtree.fit(tr_age[predictors],tr_age[target])
#use the model to predict age on training data--and compare vs. actuals
tr_age['age_pred']=rtree.predict(tr_age[predictors])#get predicted vals
#define a function for calculating rmse
def rmse(preds, actuals):
    sq_err=(actuals-preds)**2
    mse=sq_err.mean()
    rmse=mse**.5
    return rmse
#compute rmse
rmse(tr_age['age_pred'],tr_age['Age'])
#view data in Excel to verify/confirm
tr_age[['age_pred','Age']].to_csv('age_guesses2.csv')
#cross-validate (k-fold)
#note: sklearn.cross_validation was removed in scikit-learn 0.20, so KFold and
#cross_val_score are imported from sklearn.model_selection instead
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
x=tr_age[predictors]
y=tr_age[target]
#set up cross validation parameters
crossv=KFold(n_splits=6,shuffle=True,random_state=1)
folds=cross_val_score(rtree,x,y,scoring='neg_mean_squared_error',
                      cv=crossv,n_jobs=1)
folds_rmse=abs(folds)**.5#view rmse of folds
folds_rmse.mean()#mean of folds rmse
#see feature importances
a=x.columns.values.tolist()
b=rtree.feature_importances_.tolist()
pd.DataFrame({'imp':b,'feat':a}).sort_values(by=('imp'), ascending=False)
#the higher the value the more important for prediction in this tree model
#%%
##See if RandForestRegressor Works Better
#train the randfr model
from sklearn.ensemble import RandomForestRegressor
randfr=RandomForestRegressor(n_estimators=200, oob_score=True)
#divide ktrain into a further train/test split
tr_age_train=tr_age.sample(frac=.7)#define training set
tr_age_test=tr_age.drop(tr_age_train.index)#define test set
#define predictor and target variables
colnames=bs1.columns.values.tolist()#save colnames as list
for i, item in enumerate(colnames):#list colnames and item number for reference
    print('%02d'%i+', '+item+', '+str(bs1[item].dtype))#'%02d' denotes 00 format
predictors=list(colnames[i] for i in[3,5,7,9,13,17,18,19])#get desired predictors
predictors#check I've selected the intended columns
target=list(colnames[i] for i in[0])#extract target (Age)
target#check correct target has been defined (Age)
x=tr_age_train[predictors]
y=tr_age_train[target]
randfr.fit(x,y)#fit a model on the training data
randfr.oob_score_
#use the model to predict age on training data--and compare vs. actuals
tr_age_train['age_pred_rfr']=randfr.predict(tr_age_train[predictors])#get
#predicted vals
print(rmse(tr_age_train['age_pred_rfr'],tr_age_train['Age']))#compute rmse
#now use model on tr_age_test to see out of sample score
tr_age_test['age_pred_rfr']=randfr.predict(tr_age_test[predictors])#get
#predicted vals
print(rmse(tr_age_test['age_pred_rfr'],tr_age_test['Age']))#compute rmse
#%%
##use a trained model to complete age (random forest regressor)
import numpy as np
bs1['Age_pred']=randfr.predict(bs1[predictors])#get predictions variable
bs1['Age']=np.where(bs1['Age'].isnull(),bs1['Age_pred'],bs1['Age'])#replace Age
#with Age_pred where Age is null
pd.isnull(bs1['Age']).sum()#see how many nulls there are
#%%

###10. Complete cabin_ltr
#check # of null values
pd.isnull(bs1['cabin_ltr']).sum()
#define variables for predictor and target variables
colnames=bs1.columns.values.tolist()#save colnames as list
for i, item in enumerate(colnames):#list colnames and item number for reference
    print('%02d'%i+', '+item+', '+str(bs1[item].dtype))#'%02d' denotes 00 format
predictors=list(colnames[i] for i in[0,3,5,7,17,18,19])#get predictors
predictors#check I've selected the intended columns
target=list(colnames[i] for i in[14])#extract target (cabin_ltr)
target#check correct target has been defined (cabin_ltr)
#define decision tree classifier parameters
from sklearn.tree import DecisionTreeClassifier
ctree=DecisionTreeClassifier(criterion='entropy',min_samples_split=20,
                             random_state=1)
#define model training data then fit the model
tr_cab=bs1[bs1['cabin_ltr'].notnull()]#define data to use for training
tr_cab_train=tr_cab.sample(frac=.7)#define training set
tr_cab_test=tr_cab.drop(tr_cab_train.index)#define test set
#fit a classification tree model based on the algorithm and predictors
x=tr_cab_train[predictors]
y=tr_cab_train[target]
ctree.fit(x,y)#train the tree model
#use the model to predict cabin_ltr on training data--and compare vs. actuals
tr_cab_train['cabin_ltr_pred']=ctree.predict(x)#get predicted vals
##See percent correctly classified in training data
#create new variable 'correct' where predicted=actual
tr_cab_train['correct']=np.where(tr_cab_train['cabin_ltr']==\
            tr_cab_train['cabin_ltr_pred'],1,0)
tr_cab_train['correct'].sum()/tr_cab_train.shape[0]#sum corrects/all possible
#now test on out of sample observations
tr_cab_test['cabin_ltr_pred']=ctree.predict(tr_cab_test[predictors])
tr_cab_test['correct']=np.where(tr_cab_test['cabin_ltr']==\
            tr_cab_test['cabin_ltr_pred'],1,0)
tr_cab_test['correct'].sum()/tr_cab_test.shape[0]#sum corrects/all possible
##Cross-Validate (K-Fold)
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
x=tr_cab_train[predictors]
y=tr_cab_train[target]
#set up cross validation parameters
crossv=KFold(n_splits=6,shuffle=True,random_state=1)#xval parameters
fold_scores=cross_val_score(ctree,x,y,scoring='accuracy',cv=crossv)#score folds
fold_scores.mean()#get mean of fold scores
#see feature importances
a=x.columns.values.tolist()
b=ctree.feature_importances_.tolist()
pd.DataFrame({'var':a,'imp':b}).sort_values(by=('imp'), ascending=False)
#the higher the value the more important for prediction in this tree model
#%%
##See if Random Forest Classifier is Better
#define random forest classifier parameters
from sklearn.ensemble import RandomForestClassifier
randfc=RandomForestClassifier(criterion='entropy', n_jobs=2,oob_score=True,
                              n_estimators=100)
#define model training data, predictors and target
tr_cab_train=tr_cab.sample(frac=.7)#define training set
tr_cab_test=tr_cab.drop(tr_cab_train.index)#define test set
#predictors and target are already defined; re-extract x and y so they match
#the new training split
x=tr_cab_train[predictors]
y=np.ravel(tr_cab_train[target])
#train a randfc model based on data and predictors
randfc.fit(x,y)
randfc.oob_score_
##See percent correctly classified in training data
#create new variable 'correct' where predicted=actual
tr_cab_train['cabin_ltr_pred_rf']=randfc.predict(x)
tr_cab_train['correct_rf']=np.where(tr_cab_train['cabin_ltr']==\
            tr_cab_train['cabin_ltr_pred_rf'],1,0)
print(tr_cab_train['correct_rf'].sum()/tr_cab_train.shape[0])
#now test on out of sample observations
tr_cab_test['cabin_ltr_pred_rf']=randfc.predict(tr_cab_test[predictors])
tr_cab_test['correct_rf']=np.where(tr_cab_test['cabin_ltr']==\
            tr_cab_test['cabin_ltr_pred_rf'],1,0)
print(tr_cab_test['correct_rf'].sum()/tr_cab_test.shape[0])
#%%
#choose a model and complete cabin_ltr
bs1['cabin_ltr_pred']=randfc.predict(bs1[predictors])#derive fitted values
bs1[['cabin_ltr','cabin_ltr_pred']].sample(n=20,random_state=1)#check looks ok
bs1['cabin_ltr'].fillna(bs1['cabin_ltr_pred'],inplace=True)#complete cabin_ltr
pd.isnull(bs1['cabin_ltr']).sum()#check nulls have been filled
bs1['cabin_ltr'].value_counts()#view new split of cabin_ltr
bs1['cabin_ltr2']=bs1['cabin_ltr'].\
    map({'B':1,'C':2,'D':3,'E':4,'Other':5}).astype(int)#convert to numeric
#%%

###11. Create Family Features
##Family Feature
bs1['surname']=bs1.Name.str.extract(r'([A-Za-z]+)\,',expand=False)
#logic as follows:
#[A-Za-z]: matches any letter
#+: preceding set repeated one or more times
#\,: has a comma at the end of the expression
bs1['surname'].sample(n=10)#check looks ok
bs1['famsize']=bs1.SibSp+bs1.Parch+1
bs1['famsize_str']=bs1['famsize'].apply(str)#convert to string
bs1['family']=bs1['surname']+bs1['famsize_str']
##Create Family Cost Feature
#create dataset of avg famcosts in bs1
famcosts=bs1[['family','Fare']].groupby('family').agg(['count','mean'])#create
#famcosts lookup dataframe
famcosts.columns=famcosts.columns.droplevel()#drop the first column level
fc2=famcosts.reset_index()#reset the index (i.e. not family)
fc3=fc2.reset_index()#reset index again
fc3.head(n=20)#check index has worked
fc3.rename(columns={'index':'family_int','count':'fam_n','mean':'fam_avgcost'},
           inplace=True)
fc3.sample(n=10)
#merge with bs1
bs1=pd.merge(bs1,fc3,on=['family'])
bs1.sample(n=10)#check merge looks ok
bs1['famcost']=bs1.famsize*bs1.fam_avgcost#calculate total famcost
bs1.sample(n=10)#inspect calc has worked as expected
#encode family variable to blend anyone solo or couple
bs1['family2']=np.where(bs1['famsize']==1, 'solo',
                        np.where(bs1['famsize']==2, 'couple', bs1['family']))
fm1=pd.DataFrame(bs1['family2'].value_counts())#get df of counts
fm1.drop('family2',axis=1,inplace=True)
fm2=fm1.reset_index()#reset index
fm3=fm2.reset_index()#reset index again to obtain last index as variable
#rename the vars for merging
fm3.rename(columns={'index':'family2','level_0':'family2_int'},inplace=True)
bs1=pd.merge(bs1,fm3,on=['family2'])#merge into original bs1
#%%

###12. Complete Survived (Decision Tree Method)
#check # of null values
pd.isnull(bs1['Survived']).sum()
#define variables for predictor and target variables
colnames=bs1.columns.values.tolist()#save colnames as list
for i, item in enumerate(colnames):#list colnames and item number for reference
    print('%02d'%i+', '+item+', '+str(bs1[item].dtype))#'%02d' denotes 00 format
predictors=list(colnames[i] for i in[0,3,7,9,13,15,17])#get predictors
predictors#check I've selected the intended columns
target=list(colnames[i] for i in[10])#extract target (Survived)
target#check correct target has been defined (Survived)
#define decision tree classifier parameters
from sklearn.tree import DecisionTreeClassifier
ctree=DecisionTreeClassifier(criterion='entropy',min_samples_split=20,
                             random_state=1)
#define model training data then fit the model
tr_svd=bs1[bs1['Survived'].notnull()]#define data to use for training
tr_svd_train=tr_svd.sample(frac=.7)#define training set
tr_svd_test=tr_svd.drop(tr_svd_train.index)#define test set
#fit a classification tree model based on the algorithm and predictors
x=tr_svd_train[predictors]
y=tr_svd_train[target]
ctree.fit(x,y)
##See percent correctly classified in training data
#create new variable 'correct' where predicted=actual
#use the trained model to predict survived
tr_svd_train['Survived_pred_ctree']=ctree.predict(x)
tr_svd_train['correct']=np.where(tr_svd_train['Survived']==\
            tr_svd_train['Survived_pred_ctree'],1,0)
tr_svd_train['correct'].sum()/tr_svd_train.shape[0]#sum corrects/all possible
#try on test out of sample data
tr_svd_test['Survived_pred_ctree']=ctree.predict(tr_svd_test[predictors])
tr_svd_test['correct']=np.where(tr_svd_test['Survived']==\
            tr_svd_test['Survived_pred_ctree'],1,0)
tr_svd_test['correct'].sum()/tr_svd_test.shape[0]#sum corrects/all possible
#view crosstab of predictions vs. actuals
pd.crosstab(tr_svd_train['Survived'],tr_svd_train['Survived_pred_ctree'],
            rownames=['actual'], colnames=['predicted'])
#cross-validate (k-fold)
#note: sklearn.cross_validation was removed in scikit-learn 0.20, so KFold and
#cross_val_score are imported from sklearn.model_selection instead
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
#set up cross validation parameters
crossv=KFold(n_splits=6,shuffle=True,random_state=1)#xval parameters
fold_scores=cross_val_score(ctree,x,y,scoring='accuracy',cv=crossv)#score folds
fold_scores
fold_scores.mean()#get mean of fold scores
#see feature importances
a=x.columns.values.tolist()
b=ctree.feature_importances_.tolist()
pd.DataFrame({'var':a,'imp':b}).sort_values(by=('imp'), ascending=False)
#the higher the value the more important for prediction in this tree model
#%%

###13. Complete Survived (Random Forest Method)
#define variables for predictor and target variables
colnames=bs1.columns.values.tolist()#save colnames as list
for i, item in enumerate(colnames):#list colnames and item number for reference
    print('%02d'%i+', '+item+', '+str(bs1[item].dtype))#'%02d' denotes 00 format
predictors=list(colnames[i] for i in[0,3,5,7,9,13,15,17,18,19,23,29,31])
predictors#check I've selected the intended columns
target=list(colnames[i] for i in[10])#extract target (Survived)
target#check correct target has been defined (Survived)
#train the random forest model
from sklearn.ensemble import RandomForestClassifier
randfc=RandomForestClassifier(oob_score=True, criterion='entropy',
                              max_features=4, n_jobs=2,
                              min_samples_split=10, min_samples_leaf=5,
                              n_estimators=200)
#divide ktrain into a further train/test split
tr_svd=bs1[bs1['Survived'].notnull()]#define data to use for training
tr_svd_train=tr_svd.sample(frac=.7)#define training set
tr_svd_test=tr_svd.drop(tr_svd_train.index)#define test set
x=tr_svd_train[predictors]
y=np.ravel(tr_svd_train[target])#to get around indices error..
#configure ensemble tuning grid
from sklearn.model_selection import GridSearchCV
param_grid={'min_samples_leaf':[3,4,5,6,7],
            'min_samples_split':[2,4,6],
            #'n_estimators':[100,200,300],
            'max_features':[2,3,4,5]}
cv_randfc=GridSearchCV(estimator=randfc, param_grid=param_grid, cv=3)
#cv_randfc.fit(x,y)#fit random forest with gridsearch
#cv_randfc.best_params_#view best param_grid
#cv_randfc.best_score_#view associated best score
#<--Now adjust the randfc parameters to match best_params-->
#train model with altered parameters
randfc.fit(x,y)#train the model on train data
randfc.oob_score_#out of bag accuracy
#assess accuracy on trained data
tr_svd_train['Survived_pred_randfc']=randfc.predict(tr_svd_train[predictors])
tr_svd_train['correct_rfc']=np.where(tr_svd_train['Survived']==\
            tr_svd_train['Survived_pred_randfc'],1,0)
tr_svd_train['correct_rfc'].sum()/tr_svd_train.shape[0]
#try on test out of sample data
tr_svd_test['Survived_pred_randfc']=randfc.predict(tr_svd_test[predictors])
tr_svd_test['correct']=\
    np.where(tr_svd_test['Survived']==tr_svd_test['Survived_pred_randfc'],1,0)
print(tr_svd_test['correct'].sum()/tr_svd_test.shape[0])
##Make Predictions on Base Data
bs1['Survived_pred_randfc']=randfc.predict(bs1[predictors])
randfc.oob_score_

###14. Prepare Kaggle Submission
#merge reworked base dataset with ktest to include only test samples
submit=pd.merge(bs1,ktest,on=['PassengerId'])\
    [['PassengerId','Survived_pred_randfc']].sort_values(by='PassengerId')
submit.rename(columns={'Survived_pred_randfc':'Survived'},
              inplace=True)#rename survived
submit.sample(n=20)#inspect a sample of submission
submit.head(n=20)
submit.to_csv('submit6.csv', index=False)#output to csv without printing index

######Workings