Use Feature Selection Techniques and Build an Ensemble of Classification Models
Feature selection is an automatic or manual process that selects the features which contribute to the prediction and removes irrelevant features that hurt model performance. It helps to reduce overfitting and training time while improving performance, and it is especially important when there are many features (Zhang and Sejdić, 2019). This is a continuation from here.
Feature Selection using Variance Threshold, Random Forest and Univariate Selection
The implementation uses all features from the data set; only rows that are not empty are included. Features that provide no information about the predicted class, e.g., update stamp and site, are removed. Categorical features such as gender, marital status and race are converted to dummy variables. Feature selection is then applied to retain the most predictive features. The project selects features using:
• Variance threshold of over 0.90
• Random forest classifier to identify important features
• Univariate feature selection using chi-square test
54 features are selected from a total of 1,452 available features. Imputation is not required, and scaling is done before dividing the data into training and test sets.
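The three selection steps above can be sketched as follows; this is a minimal illustration on synthetic data, not the project's actual pipeline, and the variable names are assumptions. Note that the chi-square test requires non-negative inputs, so the features are min-max scaled first.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import (SelectFromModel, SelectKBest,
                                       VarianceThreshold, chi2)
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the dummy-encoded feature matrix and labels.
X, y = make_classification(n_samples=300, n_features=40,
                           n_informative=8, random_state=0)

# 1. Drop near-constant features with a variance threshold of 0.90.
X_vt = VarianceThreshold(threshold=0.90).fit_transform(X)

# 2. Keep the features a random forest ranks above mean importance.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
X_rf = SelectFromModel(rf).fit(X_vt, y).transform(X_vt)

# 3. Univariate chi-square selection (chi2 needs non-negative inputs).
X_pos = MinMaxScaler().fit_transform(X_rf)
X_sel = SelectKBest(chi2, k=min(10, X_pos.shape[1])).fit_transform(X_pos, y)
print(X_sel.shape)
```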
Use Principal Component Analysis to Understand the Variance Explained
The figure shows the variance explained after feature selection, using principal component analysis. The first principal component explains the most variance (approximately 99%) among the new variables.
This shows that the selected features contain most of the information in the data set and hence are good for modelling.
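Inspecting the explained-variance ratios with scikit-learn's PCA can be sketched as below; `X_sel` here is random data standing in for the 54 selected features.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_sel = rng.normal(size=(200, 10))  # placeholder for the selected features

pca = PCA().fit(X_sel)
ratios = pca.explained_variance_ratio_  # fraction of variance per component
print(ratios[:3])  # the first components explain the most variance
```

With all components kept, the ratios are non-increasing and sum to 1, which is what the cumulative-variance figure visualizes.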
Implementation, Evaluation and Result of Decision Tree using 50 Leaf Nodes
The decision tree is implemented with the scikit-learn function DecisionTreeClassifier(), limited to 50 leaf nodes. The model achieved an average AUROC score of 0.821. The figure shows a decision tree in which the probabilities of an input belonging to each class are 0.30, 0.45 and 0.23 when the value of baseline CDRSB is less than or equal to 0.25; the Gini impurity at this node is 0.642.
The probabilities are 0.908, 0.071 and 0.021 when the CDRSB condition holds and the baseline month is less than or equal to 23.9, with a Gini impurity of 0.17; however, this covers only 29.6% of the data. Similarly, the probabilities are 0.057, 0.617 and 0.326 when the value of CDRSB is less than or equal to 2.75, with a Gini impurity of 0.51; this covers only 70.4% of the data.
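A minimal sketch of such a tree, assuming synthetic three-class data in place of the project's features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class stand-in for the normal/MCI/dementia labels.
X, y = make_classification(n_samples=500, n_classes=3,
                           n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Cap the tree at 50 leaf nodes, as in the text.
tree = DecisionTreeClassifier(max_leaf_nodes=50, random_state=0)
tree.fit(X_tr, y_tr)
proba = tree.predict_proba(X_te)  # per-class probabilities, as in the figure
```

The per-node class probabilities and Gini impurities discussed above can be visualized with `sklearn.tree.plot_tree(tree)`.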
Implementation, Evaluation and Result of Random Forest
Random forest trains each tree independently on a random sample of the data and is less likely to overfit than gradient-boosted trees. It is implemented with the scikit-learn function RandomForestClassifier(). With no fixed threshold, the model predicts normal against dementia and MCI with an AUROC score of 0.898, MCI against normal and dementia with an AUROC score of 0.779, and dementia against normal and MCI with an AUROC score of 0.841.
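The one-vs-rest AUROC scores per class can be computed as sketched below; synthetic data stands in for the project's features, and the class indices are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

X, y = make_classification(n_samples=600, n_classes=3,
                           n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
proba = rf.predict_proba(X_te)

# One-vs-rest AUROC: each class against the other two, threshold-free.
Y_te = label_binarize(y_te, classes=[0, 1, 2])
per_class = [roc_auc_score(Y_te[:, k], proba[:, k]) for k in range(3)]
print(per_class)
```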
Implementation, Evaluation and Result of XGBoost
Gradient boosting is an ensemble method that trains many individual trees sequentially, with each tree improving on the previous one; the combination of weak learners creates a strong ensemble learner. Extreme Gradient Boosting (XGBoost) is designed for efficient multi-core parallel processing to make use of all CPU cores. It is implemented with the XGBoost library function XGBClassifier() with 100 estimators; the function has built-in support for regularization, cross-validation and handling missing values. With no fixed threshold, the model predicts normal against dementia and MCI with an AUROC score of 0.901, MCI against normal and dementia with an AUROC score of 0.685, and dementia against normal and MCI with an AUROC score of 0.812.
Implementation, Evaluation and Result of Ensemble of Classifiers
Random grid search and ensemble learning with a voting classifier are used to combine the results of multiple models into a single output. The hyperparameters for the classifiers are as follows:
- Random Forest: number of estimators – 49, minimum number of samples required to split an internal node – 25, number of features to consider when looking for the best split – auto, and a maximum depth of the tree
- Extra Trees: number of trees in the forest – 60, minimum number of samples required to split an internal node – 8, number of features to consider when looking for the best split – square root, and a maximum depth of the tree – 37
- AdaBoost: number of estimators – 37 and learning rate – 0.2
- Gradient Boosting: subsample – 0.9, number of trees – 32, minimum fraction of samples required to split an internal node – 0.01, minimum number of samples required to be at a leaf node – 10, number of features to consider when looking for the best split – square root, maximum depth of the individual estimators – 25, loss function to be optimized – deviance, learning rate to shrink the contribution of each tree – 0.025, and function to measure the quality of a split – friedman_mse
- XGBoost: fraction of observations to be randomly sampled for each tree – 0.6, silent mode – False, L2 regularization term on weights – 0.1, number of trees – 120, minimum sum of instance weights required in a child – 1.0, maximum depth of a tree – 6, learning rate – 0.01, minimum loss reduction required to make a split – 0.25, fraction of columns to be randomly sampled for each tree – 0.9, and subsample ratio of columns for each split at each level – 0.4
The classifiers with optimized hyperparameters are combined using a voting classifier, implemented with the scikit-learn function VotingClassifier(). It uses "soft" voting, and the model weights are 2, 3, 3, 1 and 3. The figure is a normalized confusion matrix in which the diagonal elements denote the proportion of correctly predicted samples per class, i.e., 0.87 for normal, 0.59 for MCI and 0.92 for dementia when the threshold for the model is fixed at 0.5.
The off-diagonal elements are the proportions mistakenly assigned to other classes, e.g., 0.35 of the samples that are actually MCI are classified as dementia. Therefore, the model is better at predicting normal and dementia than MCI.
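The soft-voting ensemble described above can be sketched as follows. This is a simplified illustration on synthetic data with abbreviated hyperparameters; to keep the sketch dependency-free, a second scikit-learn gradient-boosting model stands in for the XGBoost member of the real ensemble.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)

X, y = make_classification(n_samples=400, n_classes=3,
                           n_informative=8, random_state=0)

vote = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=49, min_samples_split=25,
                                      random_state=0)),
        ("et", ExtraTreesClassifier(n_estimators=60, min_samples_split=8,
                                    max_features="sqrt", max_depth=37,
                                    random_state=0)),
        ("ada", AdaBoostClassifier(n_estimators=37, learning_rate=0.2,
                                   random_state=0)),
        ("gb", GradientBoostingClassifier(n_estimators=32, subsample=0.9,
                                          learning_rate=0.025, random_state=0)),
        # Stand-in for the tuned XGBClassifier in the actual project.
        ("xgb", GradientBoostingClassifier(n_estimators=120,
                                           learning_rate=0.01, random_state=1)),
    ],
    voting="soft",          # average predicted probabilities ...
    weights=[2, 3, 3, 1, 3],  # ... with the per-model weights from the text
)
vote.fit(X, y)
proba = vote.predict_proba(X)
```

Soft voting averages the weighted class probabilities of the members, so every estimator must implement `predict_proba`.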
The model predicts normal against dementia and MCI with an AUROC score of 0.908, MCI against normal and dementia with an AUROC score of 0.760, and dementia against normal and MCI with an AUROC score of 0.846. The figure shows the AUROC curves, which plot a point for every possible threshold.
It shows that the ensemble of classifiers is better at predicting normal and dementia than MCI when the threshold is not fixed.
Comparison of Developed Models
The figure shows the AUROC score per class for each of the developed models. Both XGBoost and the ensemble of classifiers are good at predicting the normal clinical stage.
However, the ensemble is better than the other models at distinguishing each class from the others, with AUROC scores of 0.908 for normal, 0.760 for MCI and 0.846 for dementia. Therefore, the combination of diverse classifiers and feature selection has resulted in a strong machine learning model.
Interpreting Machine Learning Model
It is not possible to explain the predictions made by the voting classifier using SHAP, so XGBoost is used instead to gain insight and discover the important features. The figure shows the most important features by plotting the sum of SHAP value magnitudes over all samples for each feature.
In this plot, red indicates a high feature value and blue a low feature value for a given sample. It shows that CDRSB at baseline has a positive impact and MMSE at baseline a negative impact on the model output. The importance of the cognitive tests is in accordance with papers such as (Goyal et al., 2018) and (Mehdipour Ghazi et al., 2019) but differs from the findings of (Lee et al., 2016). The volume of the left hippocampus, the cortical thickness of the right entorhinal cortex, and months to the nearest 6 months from baseline as a continuous factor are the next three important features. Whether or not the base image includes a timepoint has very limited impact on the model.
The figure below further confirms that both CDRSB and MMSE at baseline are the two most important features. Plotting the mean absolute SHAP value of every feature over all samples shows that CDRSB is twice as important as MMSE.
Furthermore, the volume of the left hippocampus, the average cortical thickness of the right entorhinal cortex, months to the nearest 6 months from baseline as a continuous factor, months from baseline, and age at baseline are the next few significant features that affect the magnitude of the model output. Whether or not the base image includes a timepoint has very little significance.
The project has implemented different machine learning techniques on data from the domains of cognitive behaviour and radiology imaging after performing exploratory analysis. The evaluation and results of the models are also discussed. A web-based application has been developed and deployed to the cloud.
The report continues here.