These are my notes from various blogs on different ways to predict survival on the Titanic using the Python stack. This post continues from part 2.
3. A way to predict survival on the Titanic
We tweak the style of this notebook a little bit to have centered plots.
from IPython.core.display import HTML
HTML("""
<style>
.output_png {
display: table-cell;
text-align: center;
vertical-align: middle;
}
</style>
""")
#Import the libraries
# remove warnings
import warnings
warnings.filterwarnings('ignore')
# ---
%matplotlib inline
import pandas as pd
pd.options.display.max_columns = 100
from matplotlib import pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
import numpy as np
pd.options.display.max_rows = 100
#Now let's start by loading the training set.
data = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\Titanic\train.csv")
#Pandas allows you to have a sneak peek at your data.
data.head(2)
data.describe()
#The count row shows that Age has only 714 values out of 891, i.e. 177 values are missing in the Age column.
data['Age'].fillna(data['Age'].median(), inplace=True)
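A quick check (a sketch on the same data frame) confirms the fill worked:
# after the median fill, no Age values should be missing
print(data['Age'].isnull().sum())   # expect 0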
Let's now make some charts.
#Let's visualize survival based on the gender.
survived_sex = data[data['Survived']==1]['Sex'].value_counts()
dead_sex = data[data['Survived']==0]['Sex'].value_counts()
df = pd.DataFrame([survived_sex,dead_sex])
df.index = ['Survived','Dead']
df.plot(kind='bar',stacked=True, figsize=(15,8))
The Sex variable seems to be a decisive feature. Women are more likely to survive.
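We can back this up with a quick number (a sketch on the same data frame): the mean of Survived within each group is that group's survival rate.
# survival rate by sex: roughly 0.74 for women vs 0.19 for men in train.csv
print(data.groupby('Sex')['Survived'].mean())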
#Let's now correlate the survival with the age variable.
figure = plt.figure(figsize=(15,8))
plt.hist([data[data['Survived']==1]['Age'], data[data['Survived']==0]['Age']],
         stacked=True, color=['g','r'], bins=30, label=['Survived','Dead'])
plt.xlabel('Age')
plt.ylabel('Number of passengers')
plt.legend()
If you follow the chart bin by bin, you will notice that passengers under 10 are more likely to survive than those aged between 12 and 50. Elderly passengers also seem to have been rescued.
These first two charts confirm an old code of conduct that sailors and captains followed in threatening situations: “Women and children first!”.
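A rough numeric check of the "children first" effect (a sketch, same data frame):
# survival rate of young children vs the overall rate (~0.38)
print(data[data['Age'] < 10]['Survived'].mean())
print(data['Survived'].mean())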
#Let's now focus on the Fare ticket of each passenger and correlate it with the survival.
figure = plt.figure(figsize=(15,8))
plt.hist([data[data['Survived']==1]['Fare'], data[data['Survived']==0]['Fare']],
         stacked=True, color=['g','r'], bins=30, label=['Survived','Dead'])
plt.xlabel('Fare')
plt.ylabel('Number of passengers')
plt.legend()
Passengers with cheaper ticket fares are more likely to die. Put differently, passengers with more expensive tickets, and therefore a higher social status, seem to have been rescued first.
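One rough way to quantify this (a sketch): bucket Fare into quartiles with pd.qcut and compare survival rates across buckets.
# survival rate per fare quartile - it increases with the fare
print(data.groupby(pd.qcut(data['Fare'], 4))['Survived'].mean())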
# Let's now combine the age, the fare and the survival on a single chart.
plt.figure(figsize=(15,8))
ax = plt.subplot()
ax.scatter(data[data['Survived']==1]['Age'],data[data['Survived']==1]['Fare'],c='green',s=40)
ax.scatter(data[data['Survived']==0]['Age'],data[data['Survived']==0]['Fare'],c='red',s=40)
ax.set_xlabel('Age')
ax.set_ylabel('Fare')
ax.legend(('survived','dead'), scatterpoints=1, loc='upper right', fontsize=15)
A distinct cluster of dead passengers (the red one) appears on the chart. Those people are adults (age between 15 and 50) of lower class (lowest ticket fares).
#The ticket fare correlates with the passenger class, as the chart below shows.
ax = plt.subplot()
ax.set_ylabel('Average fare')
data.groupby('Pclass').mean()['Fare'].plot(kind='bar',figsize=(15,8), ax = ax)
#Let's now see how the embarkation site affects the survival.
survived_embark = data[data['Survived']==1]['Embarked'].value_counts()
dead_embark = data[data['Survived']==0]['Embarked'].value_counts()
df = pd.DataFrame([survived_embark,dead_embark])
df.index = ['Survived','Dead']
df.plot(kind='bar',stacked=True, figsize=(15,8))
There seems to be no distinct correlation here.
II – Feature engineering
#Let's define a helper that prints a message when a feature has been processed.
def status(feature):
    print('Processing', feature, ': ok')
Loading the data
One trick when starting a machine learning problem is to combine the training set and the test set. This is especially useful when the test set contains category values that do not appear in the training set: encoding the two sets separately would produce mismatched dummy columns, and testing the model on the test set would fail.
Besides, combining the two sets will save us some repeated work to do later on when testing.
The procedure is quite simple.
- We start by loading the train set and the test set.
- We create an empty dataframe called combined.
- Then we append test to train and assign the result to combined.
def get_combined_data():
    # reading train data
    train = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\Titanic\train.csv")
    # reading test data
    test = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\Titanic\test.csv")
    # extracting and then removing the targets from the training data
    targets = train.Survived
    train.drop('Survived', axis=1, inplace=True)
    # merging train data and test data for future feature engineering
    # (DataFrame.append was removed in recent pandas, so we use pd.concat)
    combined = pd.concat([train, test])
    combined.reset_index(inplace=True)
    combined.drop('index', inplace=True, axis=1)
    return combined
combined = get_combined_data()
combined.shape
You may notice that the total number of rows (1309) is exactly the sum of the rows in the train set (891) and the test set (418).
Extracting the passenger titles
def get_titles():
    global combined
    # we extract the title from each name
    combined['Title'] = combined['Name'].map(lambda name: name.split(',')[1].split('.')[0].strip())
    # a map of more aggregated titles
    Title_Dictionary = {
        "Capt": "Officer",
        "Col": "Officer",
        "Major": "Officer",
        "Jonkheer": "Royalty",
        "Don": "Royalty",
        "Sir": "Royalty",
        "Dr": "Officer",
        "Rev": "Officer",
        "the Countess": "Royalty",
        "Dona": "Royalty",
        "Mme": "Mrs",
        "Mlle": "Miss",
        "Ms": "Mrs",
        "Mr": "Mr",
        "Mrs": "Mrs",
        "Miss": "Miss",
        "Master": "Master",
        "Lady": "Royalty"
    }
    # we map each title to one of the aggregated categories
    combined['Title'] = combined.Title.map(Title_Dictionary)
get_titles()
combined.head(2)
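Any raw title that is missing from Title_Dictionary becomes NaN after the mapping, so a quick coverage check is worth doing (a sketch):
# verify the mapping covered every raw title - no NaN bucket should appear
print(combined['Title'].value_counts(dropna=False))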
Processing the ages
Simply replacing the missing ages with the overall mean or median might not be the best solution, since the typical age differs across groups and categories of passengers.
To understand why, let's group our dataset by Sex, Pclass and Title, and compute the median age for each subset.
grouped = combined.groupby(['Sex','Pclass','Title'])
grouped['Age'].median()
Look at how the median age differs when Sex, Pclass and Title are put together.
For example:
- If the passenger is female, from Pclass 1, and royalty, the median age is 39.
- If the passenger is male, from Pclass 3, with a Mr title, the median age is 26.
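An individual group median can also be read off directly with a MultiIndex lookup on the grouped object defined above (a quick illustration):
# median age of first-class female royalty - 39, as quoted above
print(grouped['Age'].median().loc[('female', 1, 'Royalty')])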
def process_age():
    global combined
    # a function that fills a missing Age with the median age
    # of the passenger's (Sex, Pclass, Title) group
    def fillAges(row):
        if row['Sex'] == 'female' and row['Pclass'] == 1:
            if row['Title'] == 'Miss':
                return 30
            elif row['Title'] == 'Mrs':
                return 45
            elif row['Title'] == 'Officer':
                return 49
            elif row['Title'] == 'Royalty':
                return 39
        elif row['Sex'] == 'female' and row['Pclass'] == 2:
            if row['Title'] == 'Miss':
                return 20
            elif row['Title'] == 'Mrs':
                return 30
        elif row['Sex'] == 'female' and row['Pclass'] == 3:
            if row['Title'] == 'Miss':
                return 18
            elif row['Title'] == 'Mrs':
                return 31
        elif row['Sex'] == 'male' and row['Pclass'] == 1:
            if row['Title'] == 'Master':
                return 6
            elif row['Title'] == 'Mr':
                return 41.5
            elif row['Title'] == 'Officer':
                return 52
            elif row['Title'] == 'Royalty':
                return 40
        elif row['Sex'] == 'male' and row['Pclass'] == 2:
            if row['Title'] == 'Master':
                return 2
            elif row['Title'] == 'Mr':
                return 30
            elif row['Title'] == 'Officer':
                return 41.5
        elif row['Sex'] == 'male' and row['Pclass'] == 3:
            if row['Title'] == 'Master':
                return 6
            elif row['Title'] == 'Mr':
                return 26
    combined.Age = combined.apply(lambda r: fillAges(r) if np.isnan(r['Age']) else r['Age'], axis=1)
    status('age')
process_age()
combined.info()
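For reference, the same per-group median fill can be written more concisely with groupby/transform. This is an equivalent sketch, not the approach used above; it computes the medians from the data instead of hard-coding them:
# fill each missing age with the median of its (Sex, Pclass, Title) group
combined['Age'] = combined.groupby(['Sex', 'Pclass', 'Title'])['Age'].transform(
    lambda g: g.fillna(g.median()))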
#Let's now process the names.
def process_names():
    global combined
    # we clean the Name variable
    combined.drop('Name', axis=1, inplace=True)
    # encoding the titles as dummy variables
    titles_dummies = pd.get_dummies(combined['Title'], prefix='Title')
    combined = pd.concat([combined, titles_dummies], axis=1)
    # removing the title variable
    combined.drop('Title', axis=1, inplace=True)
    status('names')
This function drops the Name column, which we no longer need now that we have a Title column.
Then it encodes the title values using dummy encoding.
process_names()
combined.head()
As you can see:
- There is no longer a Name feature.
- New binary variables (Title_X) appeared.
- For example, if Title_Mr = 1, the corresponding Title is Mr.
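If dummy encoding is unfamiliar, here is a toy example (hypothetical values) of what pd.get_dummies produces:
# three title values become three binary columns: Title_Miss, Title_Mr, Title_Mrs
print(pd.get_dummies(pd.Series(['Mr', 'Miss', 'Mrs']), prefix='Title'))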
Processing Fare
#This function replaces the single missing Fare value with the mean fare.
def process_fares():
    global combined
    # there's one missing fare value - replacing it with the mean
    combined.Fare.fillna(combined.Fare.mean(), inplace=True)
    status('fare')
process_fares()
Processing Embarked
#This function replaces the two missing values of Embarked with the most frequent Embarked value.
def process_embarked():
    global combined
    # two missing embarked values - filling them with the most frequent one (S)
    combined.Embarked.fillna('S', inplace=True)
    # dummy encoding
    embarked_dummies = pd.get_dummies(combined['Embarked'], prefix='Embarked')
    combined = pd.concat([combined, embarked_dummies], axis=1)
    combined.drop('Embarked', axis=1, inplace=True)
    status('embarked')
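Incidentally, instead of hard-coding 'S' we could compute the most frequent port; a small sketch (run while the Embarked column still exists, i.e. before calling process_embarked):
# mode() ignores NaN; for this dataset the most frequent port is 'S'
print(combined['Embarked'].mode()[0])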
process_embarked()
Processing Cabin
#This function replaces NaN values with U (for Unknown), maps each Cabin value to its first letter, then encodes the cabin letters using dummy encoding.
def process_cabin():
    global combined
    # replacing missing cabins with U (for Unknown)
    combined.Cabin.fillna('U', inplace=True)
    # mapping each Cabin value to the cabin letter
    combined['Cabin'] = combined['Cabin'].map(lambda c: c[0])
    # dummy encoding
    cabin_dummies = pd.get_dummies(combined['Cabin'], prefix='Cabin')
    combined = pd.concat([combined, cabin_dummies], axis=1)
    combined.drop('Cabin', axis=1, inplace=True)
    status('cabin')
process_cabin()
combined.info()
OK, no missing values now.
Processing Sex
#This function maps the string values male and female to 1 and 0 respectively.
def process_sex():
    global combined
    # mapping string values to numerical ones
    combined['Sex'] = combined['Sex'].map({'male': 1, 'female': 0})
    status('sex')
process_sex()
Processing Pclass
#This function encodes the values of Pclass (1,2,3) using a dummy encoding.
def process_pclass():
    global combined
    # encoding Pclass into 3 categories
    pclass_dummies = pd.get_dummies(combined['Pclass'], prefix="Pclass")
    # adding dummy variables
    combined = pd.concat([combined, pclass_dummies], axis=1)
    # removing "Pclass"
    combined.drop('Pclass', axis=1, inplace=True)
    status('pclass')
process_pclass()
Processing Ticket
- This function preprocesses the tickets by extracting the ticket prefix. When it fails to extract a prefix, it returns XXX.
- It then encodes the prefixes using dummy encoding.
def process_ticket():
    global combined
    # a function that extracts the prefix of a ticket, returning 'XXX'
    # if there is no prefix (i.e. the ticket is all digits)
    def cleanTicket(ticket):
        ticket = ticket.replace('.', '')
        ticket = ticket.replace('/', '')
        ticket = ticket.split()
        ticket = map(lambda t: t.strip(), ticket)
        ticket = list(filter(lambda t: not t.isdigit(), ticket))
        if len(ticket) > 0:
            return ticket[0]
        else:
            return 'XXX'
    # extracting dummy variables from the ticket prefixes
    combined['Ticket'] = combined['Ticket'].map(cleanTicket)
    tickets_dummies = pd.get_dummies(combined['Ticket'], prefix='Ticket')
    combined = pd.concat([combined, tickets_dummies], axis=1)
    combined.drop('Ticket', inplace=True, axis=1)
    status('ticket')
process_ticket()
Processing Family
This function introduces 4 new features:
- FamilySize: the total number of relatives, including the passenger him/herself.
- Singleton: a boolean variable describing families of size 1.
- SmallFamily: a boolean variable describing families of size 2 to 4.
- LargeFamily: a boolean variable describing families of size 5 or more.
def process_family():
    global combined
    # introducing a new feature: the size of the family (including the passenger)
    combined['FamilySize'] = combined['Parch'] + combined['SibSp'] + 1
    # introducing other features based on the family size
    combined['Singleton'] = combined['FamilySize'].map(lambda s: 1 if s == 1 else 0)
    combined['SmallFamily'] = combined['FamilySize'].map(lambda s: 1 if 2 <= s <= 4 else 0)
    combined['LargeFamily'] = combined['FamilySize'].map(lambda s: 1 if 5 <= s else 0)
    status('family')
process_family()
combined.shape
#Let's now scale all the features to the unit interval,
#all of them except PassengerId, which we'll need for the submission.
def scale_all_features():
    global combined
    features = list(combined.columns)
    features.remove('PassengerId')
    combined[features] = combined[features].apply(lambda x: x / x.max(), axis=0)
    print('Features scaled successfully !')
scale_all_features()
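Note that dividing by the column maximum keeps values in the unit interval only because every feature here is non-negative. A more robust alternative (a sketch, not what is used above) is scikit-learn's MinMaxScaler:
# alternative: scale each column to [0, 1] via (x - min) / (max - min)
from sklearn.preprocessing import MinMaxScaler
features = [c for c in combined.columns if c != 'PassengerId']
combined[features] = MinMaxScaler().fit_transform(combined[features])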
III – Modeling
We now have to:
- Break the combined dataset in train set and test set.
- Use the train set to build a predictive model.
- Evaluate the model using the train set.
- Test the model using the test set and generate an output file for the submission.
#Let's start by importing the useful libraries.
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest
# cross_validation and grid_search were removed in scikit-learn 0.20;
# their contents now live in model_selection
from sklearn.model_selection import StratifiedKFold, GridSearchCV, cross_val_score
#To evaluate our model we'll be using a 5-fold cross validation with the Accuracy metric.
def compute_score(clf, X, y, scoring='accuracy'):
    xval = cross_val_score(clf, X, y, cv=5, scoring=scoring)
    return np.mean(xval)
#Recover the train set and the test set from the combined dataset
def recover_train_test_target():
    global combined
    train0 = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\Titanic\train.csv")
    targets = train0.Survived
    # the first 891 rows of combined come from the train set, the rest from the test set
    train = combined.iloc[:891]
    test = combined.iloc[891:]
    return train, test, targets
train,test,targets = recover_train_test_target()
Feature selection
#Tree-based estimators can be used to compute feature importances, which in turn can be used to discard irrelevant features.
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
clf = ExtraTreesClassifier(n_estimators=200)
clf = clf.fit(train, targets)
#Let's have a look at the importance of each feature.
features = pd.DataFrame()
features['feature'] = train.columns
features['importance'] = clf.feature_importances_
features.sort_values('importance', ascending=False)
As you may notice, Title_Mr, Age, Fare, and Sex carry great importance.
There is also a surprisingly high importance attached to PassengerId.
#Let's now transform our train set and test set into more compact datasets.
model = SelectFromModel(clf, prefit=True)
train_new = model.transform(train)
train_new.shape
test_new = model.transform(test)
test_new.shape
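Before tuning anything, the compute_score helper from earlier gives a quick baseline on the reduced feature set (a sketch; the hyperparameters below are arbitrary):
# 5-fold cross-validated accuracy of an untuned forest, as a baseline
clf = RandomForestClassifier(n_estimators=50, max_features='sqrt')
print(compute_score(clf, train_new, targets, scoring='accuracy'))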
Hyperparameter tuning
#Random Forests are quite handy. They do, however, come with some parameters to tweak in order to get an optimal model for the prediction task.
forest = RandomForestClassifier(max_features='sqrt')
parameter_grid = {
    'max_depth': [4, 5, 6, 7, 8],
    'n_estimators': [200, 210, 240, 250],
    'criterion': ['gini', 'entropy']
}
cross_validation = StratifiedKFold(n_splits=5)
grid_search = GridSearchCV(forest,
                           param_grid=parameter_grid,
                           cv=cross_validation)
grid_search.fit(train_new, targets)
print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))
output = grid_search.predict(test_new).astype(int)
df_output = pd.DataFrame()
df_output['PassengerId'] = test['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv('output.csv',index=False)
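As a final sanity check (a sketch), the submission should contain one row per test passenger:
# the test set has 418 passengers (1309 combined - 891 train)
print(df_output.shape)   # expect (418, 2)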