Classification
KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier
alg_ngbh = KNeighborsClassifier(n_neighbors=3)
scores = cross_validation.cross_val_score(alg_ngbh, train_data_scaled, train_data_munged["Survived"], cv=cv, n_jobs=-1)
print("Accuracy (k-neighbors): {}/{}".format(scores.mean(), scores.std()))
SGDClassifier
from sklearn.linear_model.stochastic_gradient import SGDClassifier
alg_sgd = SGDClassifier(random_state=1)
scores = cross_validation.cross_val_score(alg_sgd, train_data_scaled, train_data_munged["Survived"], cv=cv, n_jobs=-1)
print("Accuracy (sgd): {}/{}".format(scores.mean(), scores.std()))
SVC
from sklearn.svm import SVC
alg_svm = SVC(C=1.0)
scores = cross_validation.cross_val_score(alg_svm, train_data_scaled, train_data_munged["Survived"], cv=cv, n_jobs=-1)
print("Accuracy (svm): {}/{}".format(scores.mean(), scores.std()))
GaussianNB
from sklearn.naive_bayes import GaussianNB
alg_nbs = GaussianNB()
scores = cross_validation.cross_val_score(alg_nbs, train_data_scaled, train_data_munged["Survived"], cv=cv, n_jobs=-1)
print("Accuracy (naive bayes): {}/{}".format(scores.mean(), scores.std()))
LinearRegression
from sklearn import metrics

def linear_scorer(estimator, x, y):
    scorer_predictions = estimator.predict(x)
    scorer_predictions[scorer_predictions > 0.5] = 1
    scorer_predictions[scorer_predictions <= 0.5] = 0
    return metrics.accuracy_score(y, scorer_predictions)
from sklearn import linear_model
alg_lnr = linear_model.LinearRegression()
scores = cross_validation.cross_val_score(alg_lnr, train_data_scaled, train_data_munged["Survived"], cv=cv, n_jobs=-1,
scoring=linear_scorer)
print("Accuracy (linear regression): {}/{}".format(scores.mean(), scores.std()))
The linear_scorer function is needed because LinearRegression returns an arbitrary real number rather than a class label. We therefore threshold its output at 0.5 and map every prediction to one of the two classes, 0 or 1.
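As a tiny illustration of that thresholding (the numbers here are made up), raw regression outputs become class labels like this:
import numpy as np
raw = np.array([0.12, 0.47, 0.51, 0.93])   # hypothetical raw LinearRegression outputs
labels = (raw > 0.5).astype(int)           # -> array([0, 0, 1, 1])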
LogisticRegression
from sklearn import linear_model
alg_log = linear_model.LogisticRegression(random_state=1)
scores = cross_validation.cross_val_score(alg_log, train_data_scaled, train_data_munged["Survived"], cv=cv, n_jobs=-1,
scoring=linear_scorer)
print("Accuracy (logistic regression): {}/{}".format(scores.mean(), scores.std()))
RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
alg_frst = RandomForestClassifier(random_state=1, n_estimators=500, min_samples_split=8, min_samples_leaf=2)
scores = cross_validation.cross_val_score(alg_frst, train_data_scaled, train_data_munged["Survived"], cv=cv, n_jobs=-1)
print("Accuracy (random forest): {}/{}".format(scores.mean(), scores.std()))
from sklearn.grid_search import GridSearchCV
alg_frst_model = RandomForestClassifier(random_state=1)
alg_frst_params = [{
    "n_estimators": [350, 400, 450],
    "min_samples_split": [6, 8, 10],
    "min_samples_leaf": [1, 2, 4]
}]
alg_frst_grid = GridSearchCV(alg_frst_model, alg_frst_params, cv=cv, refit=True, verbose=1, n_jobs=-1)
alg_frst_grid.fit(train_data_scaled, train_data_munged["Survived"])
alg_frst_best = alg_frst_grid.best_estimator_
print("Accuracy (random forest auto): {} with params {}"
.format(alg_frst_grid.best_score_, alg_frst_grid.best_params_))
alg_test = alg_frst_best
alg_test.fit(train_data_scaled, train_data_munged["Survived"])
predictions = alg_test.predict(test_data_scaled)
submission = pd.DataFrame({
    "PassengerId": test_data["PassengerId"],
    "Survived": predictions
})
submission.to_csv("titanic-submission.csv", index=False)
2. Another way to predict survival on the Titanic
These notes are adapted from this link.
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
df = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\Titanic\train.csv",header=0)
#Let's take a look at the data format below
df.info()
If you look carefully at the summary above, there are 891 rows in total, but Age shows only 714 non-null values (some are missing), Embarked is missing 2, and Cabin is missing a lot as well. Object dtypes are non-numeric, so we have to find a way to encode them as numerical values. One such way is columnization, i.e. turning each distinct row value into its own column header (one indicator column per category), or factorizing the values into integer codes.
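As a quick, hedged illustration on the Sex column (the exact codes depend on the data order): factorize produces integer codes, while get_dummies, which we use below, produces one 0/1 column per category.
codes, uniques = pd.factorize(df['Sex'])   #integer codes, e.g. 0/1 for male/female
print(uniques)                             #the categories behind those codes
pd.get_dummies(df['Sex']).head(2)          #one indicator column per category instead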
Let's drop some of the columns that may not contribute much to the prediction:
cols = ['Name','Ticket','Cabin']
df = df.drop(cols,axis=1)
If we wanted, we could simply drop every row with a missing value:
#df = df.dropna()
Doing that would reduce the dataset from 891 rows to 712, which means we are wasting data. Machine learning models need training data to perform well, so we preserve the data and make use of as much of it as we can.
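Before deciding what to drop or fill, a quick check of how much is actually missing in each column helps (a minimal sketch):
#count the missing values per column
print(df.isnull().sum())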
Now we convert Pclass, Sex and Embarked into indicator columns with pandas, and drop the originals after conversion.
dummies = []
cols = ['Pclass','Sex','Embarked']
for col in cols:
    dummies.append(pd.get_dummies(df[col]))
titanic_dummies = pd.concat(dummies, axis=1)
titanic_dummies.head(2)
#finally we concatenate to the original dataframe columnwise
df = pd.concat((df,titanic_dummies),axis=1)
Now that Pclass, Sex and Embarked have been converted into indicator columns, we drop the now-redundant original columns from the dataframe.
df = df.drop(['Pclass','Sex','Embarked'],axis=1)
#now look at the new dataframe
df.info()
All is good, except Age, which still has lots of missing values. We can fill them with the median or with interpolate(). Pandas has a nice interpolate() function that replaces all the missing NaNs with interpolated values.
df['Age'] = df['Age'].interpolate()
df.info()
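If you prefer the median approach mentioned above, a hedged one-line alternative to interpolate() would be:
#df['Age'] = df['Age'].fillna(df['Age'].median())   #fill missing ages with the column median instead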
Machine Learning
X = the input set with 14 attributes
y = the output, in this case 'Survived'
Now we convert our pandas dataframe to numpy arrays and assign the input and output:
X = df.values
y = df['Survived'].values
X still contains the Survived values, which should not be there, so we delete that column (index 1) from the numpy array:
X = np.delete(X,1,axis=1)
Now that X and y are ready, let's split the dataset into a 70% training set and a 30% test set using scikit-learn's cross-validation utilities:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=0)
Let's start with a simple decision tree classifier and see how it goes:
from sklearn import tree
clf = tree.DecisionTreeClassifier(max_depth=5)
clf.fit(X_train,y_train)
clf.score(X_test,y_test)
Not bad: it gives a score of 78.73%.
After fitting a decision tree, the feature_importances_ attribute tells you how much each column of the data contributes to the tree's decisions.
clf.feature_importances_
This output shows that the entry of about 0.111 corresponds to "Fare", i.e. roughly 11% importance, while the entry of about 0.51 corresponds to the female indicator column, i.e. roughly 51%. Very interesting! Indeed, a large share of the Titanic's survivors were women and children.
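The raw array is hard to read on its own. A minimal sketch (assuming X was built from df as above, with the Survived column deleted) pairs each remaining column name with its importance score:
feature_names = df.columns.drop('Survived')   #same order as the columns left in X
for name, importance in sorted(zip(feature_names, clf.feature_importances_),
                               key=lambda pair: pair[1], reverse=True):
    print("{}: {:.3f}".format(name, importance))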
Random Forests
from sklearn import ensemble
clf = ensemble.RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
Gradient boosting
clf = ensemble.GradientBoostingClassifier()
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
#Let's not give up; let's play around and fine-tune this gradient booster.
clf = ensemble.GradientBoostingClassifier(n_estimators=50)
clf.fit(X_train,y_train)
clf.score(X_test,y_test)
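One hedged way to fine-tune more systematically is to reuse the GridSearchCV pattern from part 1; the parameter values below are only illustrative:
from sklearn.grid_search import GridSearchCV
gb_params = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 4, 5],
    "learning_rate": [0.05, 0.1]
}
gb_grid = GridSearchCV(ensemble.GradientBoostingClassifier(), gb_params, cv=5, n_jobs=-1)
gb_grid.fit(X_train, y_train)
print("Best CV accuracy: {} with params {}".format(gb_grid.best_score_, gb_grid.best_params_))
print("Test accuracy: {}".format(gb_grid.score(X_test, y_test)))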