Women’s Health Risk Assessment is a multi-class classification competition aimed at finding an optimized machine learning solution that allows a young woman (age 15-30 years old) to be accurately categorized according to her particular health risk. Based on the category a patient falls within, healthcare providers can offer appropriate education and training programs to help reduce the patient’s reproductive health risks. In this blog post, I model the Women’s Health Risk Assessment dataset.
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
# importing the dataset
df = pd.read_csv("WomenHealth_Training.csv")
df.head(5)
Review input features
print ("\n\n---------------------")
print ("TRAIN SET INFORMATION")
print ("---------------------")
print ("Shape of training set:", df.shape, "\n")
print ("Column Headers:", list(df.columns.values), "\n")
print (df.dtypes)
import re
missing_values = []
nonumeric_values = []
print ("TRAINING SET INFORMATION")
print ("========================\n")
for column in df:
    # Find all the unique feature values
    uniq = df[column].unique()
    print ("'{}' has {} unique values".format(column, uniq.size))
    if (uniq.size > 10):
        print("~~Listing up to 10 unique values~~")
    print (uniq[0:10])
    print ("\n-----------------------------------------------------------------------\n")

    # Find features with missing values
    if (True in pd.isnull(uniq)):
        s = "{} has {} missing".format(column, pd.isnull(df[column]).sum())
        missing_values.append(s)

    # Find features with non-numeric values
    for i in range(1, np.prod(uniq.shape)):
        if (re.match('nan', str(uniq[i]))):
            break
        if not (re.search(r'(^\d+\.?\d*$)|(^\d*\.?\d+$)', str(uniq[i]))):
            nonumeric_values.append(column)
            break
print ("\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n")
print ("Features with missing values:\n{}\n\n" .format(missing_values))
print ("Features with non-numeric values:\n{}" .format(nonumeric_values))
print ("\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n")
Select the rows with NaN values
df[np.isnan(df['christian'])]
Looking at the dataset, religion has already been split into separate columns for each religion, so there is no need to convert the string variable to numerical values. Also, religion has only 15 rows missing out of the whole dataset, so I will drop those rows as well.
df.drop('religion', axis=1, inplace=True)
df = df.dropna(subset = ['christian'])
Here are the assumptions I made for the missing data, based on the competition’s context around sex and health: ‘married has 22 missing: Yes’, ‘multpart has 1154 missing: 1’, ‘educ has 90 missing: Median’, ‘inschool has 17 missing: Yes’, ‘ownincome has 59 missing: No’, ‘literacy has 113 missing: Median’, ‘LaborDeliv has 2513 missing: No’, ‘babydoc has 67 missing: Median’, ‘Debut has 566 missing: Median’, ‘ModCon has 318 missing: No’, ‘usecondom has 2360 missing: No’, ‘hivknow has 607 missing: No’, ‘lowlit has 113 missing: Median’, ‘highlit has 113 missing: Median’, ‘single has 1383 missing: No’.
df['married'].fillna(1, inplace=True)
df['multpart'].fillna(1, inplace=True)
df['educ'].fillna(df['educ'].median(), inplace=True)
df['inschool'].fillna(1, inplace=True)
df['ownincome'].fillna(0, inplace=True)
df['literacy'].fillna(df['literacy'].median(), inplace=True)
df['LaborDeliv'].fillna(0, inplace=True)
df['babydoc'].fillna(df['babydoc'].median(), inplace=True)
df['Debut'].fillna(df['Debut'].median(), inplace=True)
df['ModCon'].fillna(0, inplace=True)
df['usecondom'].fillna(0, inplace=True)
df['hivknow'].fillna(0, inplace=True)
df['lowlit'].fillna(df['lowlit'].median(), inplace=True)
df['highlit'].fillna(df['highlit'].median(), inplace=True)
df['single'].fillna(0, inplace=True)
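To confirm that the imputation covered every gap, we can check whether any missing values remain; a minimal sanity check:
# total number of missing values left after imputation (should be 0)
print(df.isnull().sum().sum())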
We will work with this dataset in all the examples below, namely with the feature matrix X and the values of the target variable y.
# separate the data from the target attributes
X = df[['patientID', 'christian', 'muslim', 'hindu', 'other',
'cellphone', 'motorcycle', 'radio', 'cooker', 'fridge', 'furniture',
'computer', 'cart', 'irrigation', 'thrasher', 'car', 'generator',
'INTNR', 'REGION_PROVINCE', 'DISTRICT', 'electricity', 'age', 'tribe',
'foodinsecurity', 'EVER_HAD_SEX', 'EVER_BEEN_PREGNANT', 'CHILDREN',
'india', 'married', 'multpart', 'educ', 'inschool', 'ownincome',
'literacy', 'urbanicity', 'LaborDeliv', 'babydoc', 'Debut', 'ModCon',
'usecondom', 'hivknow', 'lowlit', 'highlit', 'urban', 'rural', 'single']]
y = df[['geo','segment','subgroup']]
Data Normalization
All of us know well that the majority of gradient methods (on which almost all machine learning algorithms are based) are highly sensitive to data scaling. Therefore, before running an algorithm, we should perform either normalization or the so-called standardization. Normalization involves rescaling each feature so that its values fall in the range from 0 to 1. As for standardization, it is a pre-processing step after which each feature has a mean of 0 and a variance of 1.
from sklearn import preprocessing
# normalize the data attributes
normalized_X = preprocessing.normalize(X)
# standardize the data attributes
standardized_X = preprocessing.scale(X)
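Note that preprocessing.normalize rescales each sample (row) to unit norm rather than mapping each feature into the 0-1 range described above. If per-feature 0-1 scaling is preferred, MinMaxScaler is one option; a minimal sketch (the models below keep using standardized_X):
from sklearn.preprocessing import MinMaxScaler
# rescale each feature (column) into the [0, 1] range
minmax_X = MinMaxScaler().fit_transform(X)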
I will combine the three target variables into a single label for predicting the outcome.
for i, col in enumerate(y.columns.tolist(), 1):
    y.loc[:, col] *= i
y = y.sum(axis=1)
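As a quick sanity check of the combined label, we can look at how the samples are distributed over the resulting classes; a minimal sketch:
# count how many samples fall into each combined class
print(y.value_counts())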
Feature Selection
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
# create the RFE model and select 3 attributes
rfe = RFE(model, n_features_to_select=3)
rfe = rfe.fit(X, y)
Summarize the selection of the attributes
print(rfe.support_)
print(rfe.ranking_)
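The boolean support mask is easier to interpret when mapped back to the column names; a minimal sketch, assuming X is still the DataFrame defined above:
# names of the three features RFE kept
selected_features = [col for col, keep in zip(X.columns, rfe.support_) if keep]
print(selected_features)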
Algorithm Development
Logistic regression is most often used for binary classification tasks, but multiclass classification (via the so-called one-vs-all method) is also possible. The advantage of this algorithm is that it outputs, for each object, the probability of belonging to each class.
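As a hedged sketch of that one-vs-all idea, scikit-learn’s OneVsRestClassifier can wrap a binary classifier explicitly; the fits below simply pass the multiclass target directly to each estimator instead.
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
# one-vs-rest: fit one binary logistic regression per class and pick the highest-scoring class
ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(standardized_X, y)
print(ovr.score(standardized_X, y))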
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
Logistic Regression
model = LogisticRegression()
model.fit(standardized_X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(standardized_X)
# summarize the fit of the model
#print(metrics.classification_report(expected, predicted))
print(metrics.accuracy_score(expected, predicted))
SGDClassifier
Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to discriminative learning of linear classifiers under convex loss functions such as (linear) Support Vector Machines and Logistic Regression. SGD has been successfully applied to large-scale and sparse machine learning problems often encountered in text classification and natural language processing.
model = SGDClassifier(random_state=1)
model.fit(standardized_X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(standardized_X)
# summarize the fit of the model
#print(metrics.classification_report(expected, predicted))
print(metrics.accuracy_score(expected, predicted))
Naive Bayes
Naive Bayes is also one of the most well-known machine learning algorithms; its main task is to restore the density of the data distribution of the training sample. This method often provides good quality in multiclass classification problems.
model = GaussianNB()
model.fit(standardized_X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(standardized_X)
# summarize the fit of the model
#print(metrics.classification_report(expected, predicted))
print(metrics.accuracy_score(expected, predicted))
k-Nearest Neighbours
The kNN (k-Nearest Neighbors) method is often used as part of a more complex classification algorithm. For instance, we can use its estimate as one of an object’s features. Sometimes, a simple kNN on well-chosen features provides great quality. When the parameters (mostly the distance metric) are set well, the algorithm often gives good quality in regression problems as well.
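For instance, the neighbor count and the distance metric can be set explicitly; a minimal sketch with hypothetical values (the fit below keeps the defaults):
# kNN with an explicit neighbor count and Manhattan distance (illustrative values only)
model_tuned = KNeighborsClassifier(n_neighbors=7, metric='manhattan')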
# fit a k-nearest neighbor model to the data
model = KNeighborsClassifier()
model.fit(standardized_X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(standardized_X)
# summarize the fit of the model
#print(metrics.classification_report(expected, predicted))
print(metrics.accuracy_score(expected, predicted))
Decision Trees
Classification and Regression Trees (CART) are often used in problems where objects have categorical features, and they work for both regression and classification tasks. Trees are very well suited for multiclass classification.
# fit a CART model to the data
model = DecisionTreeClassifier()
model.fit(standardized_X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(standardized_X)
# summarize the fit of the model
#print(metrics.classification_report(expected, predicted))
print(metrics.accuracy_score(expected, predicted))
Support Vector Machines
SVM (Support Vector Machines) is one of the most popular machine learning algorithms, used mainly for classification problems. Like logistic regression, SVM allows multi-class classification with the help of the one-vs-all method.
# fit a SVM model to the data
model = SVC()
model.fit(standardized_X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(standardized_X)
# summarize the fit of the model
#print(metrics.classification_report(expected, predicted))
print(metrics.accuracy_score(expected, predicted))
Optimize Algorithm Parameters
Let’s take a look at the selection of the regularization parameter, in which several values are searched in turn:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
# prepare a range of alpha values to test
alphas = np.array([1,0.1,0.01,0.001,0.0001,0])
# create and fit a ridge regression model, testing each alpha
model = Ridge()
grid = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
grid.fit(X, y)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_.alpha)
Sometimes it is more efficient to randomly select a parameter from the given range, estimate the algorithm quality for this parameter and choose the best one.
from scipy.stats import uniform as sp_rand
from sklearn.model_selection import RandomizedSearchCV
# prepare a uniform distribution to sample for the alpha parameter
param_grid = {'alpha': sp_rand()}
# create and fit a ridge regression model, testing random alpha values
model = Ridge()
rsearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100)
rsearch.fit(standardized_X, y)
print(rsearch)
# summarize the results of the random parameter search
print(rsearch.best_score_)
print(rsearch.best_estimator_.alpha)