This notebook contains my notes on predictive analysis for binary classification and is meant to be used as a cookbook. It continues from the previous post on visualization and covers Pearson’s correlation.
Pearson’s Correlation Calculation for Attributes 2 versus 3 and 2 versus 21
import sys
from math import sqrt

# calculate correlations between real-valued attributes
# iloc[1, 0:60] selects row index 1 (the second row) across the first 60 columns;
# tolist() gives plain lists so that dataRow2[i] indexes by position
dataRow2 = rocksVMines.iloc[1, 0:60].tolist()
dataRow3 = rocksVMines.iloc[2, 0:60].tolist()
dataRow21 = rocksVMines.iloc[20, 0:60].tolist()

# means
mean2 = 0.0; mean3 = 0.0; mean21 = 0.0
numElt = len(dataRow2)
for i in range(numElt):
    mean2 += dataRow2[i]/numElt
    mean3 += dataRow3[i]/numElt
    mean21 += dataRow21[i]/numElt

# population variances (divide by numElt)
var2 = 0.0; var3 = 0.0; var21 = 0.0
for i in range(numElt):
    var2 += (dataRow2[i] - mean2) * (dataRow2[i] - mean2)/numElt
    var3 += (dataRow3[i] - mean3) * (dataRow3[i] - mean3)/numElt
    var21 += (dataRow21[i] - mean21) * (dataRow21[i] - mean21)/numElt

# correlations: average product of deviations divided by both standard deviations
corr23 = 0.0; corr221 = 0.0
for i in range(numElt):
    corr23 += (dataRow2[i] - mean2) * \
              (dataRow3[i] - mean3) / (sqrt(var2*var3) * numElt)
    corr221 += (dataRow2[i] - mean2) * \
               (dataRow21[i] - mean21) / (sqrt(var2*var21) * numElt)

sys.stdout.write("Correlation between attribute 2 and 3 \n")
print(corr23)
sys.stdout.write(" \n")

sys.stdout.write("Correlation between attribute 2 and 21 \n")
print(corr221)
sys.stdout.write(" \n")
Correlation between attribute 2 and 3
0.770938121191

Correlation between attribute 2 and 21
0.466548080789
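The three loops above can be folded into one reusable function. The following is only a minimal sketch of the same arithmetic in plain Python. The name `pearson` and the sample values `x` and `y` are made up for illustration and are not part of the original notebook.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n == 0:
        raise ValueError("sequences must be non-empty and the same length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Population variances (divide by n), matching the loops above.
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    # Average product of deviations, scaled by both standard deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    return cov / sqrt(var_x * var_y)


# Made-up values, only to exercise the function.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
print(pearson(x, y))                   # partly related values
print(pearson(x, [2 * v for v in x]))  # exact linear relation
```

With the rows from the listing above, `pearson(list(dataRow2), list(dataRow3))` should agree with `corr23` up to floating-point rounding, since it performs the same calculation.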
Visualizing Attribute and Label Correlations Using a Heat Map
One way to check correlations among a large number of attributes is to calculate Pearson’s correlation coefficient for each pair of attributes, arrange those correlations into a matrix whose ij-th entry is the correlation between the ith attribute and the jth attribute, and then plot the matrix as a heat map.
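from pandas import DataFrame
import matplotlib.pyplot as plot

# calculate correlations between real-valued attributes
# (assumes the 60 numeric attributes are in columns 0 to 59; a text label column would make corr() fail in recent pandas)
corMat = DataFrame(rocksVMines.iloc[:, 0:60].corr())

# visualize correlations using heatmap
plot.pcolor(corMat)
plot.show()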
The light areas along the diagonal confirm that attributes close to one another in index have relatively high correlations. As mentioned earlier, this is due to the way in which the data are generated. Close indices are sampled at short time intervals from one another and consequently have similar frequencies. Similar frequencies reflect off the targets similarly (and so on).
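The pattern described here, with neighbouring indices correlating more strongly than distant ones, can be reproduced on synthetic data without the sonar file. The sketch below is an illustration under stated assumptions: the data are made up (each column is its left neighbour plus noise), and the helper names `make_smooth_columns`, `correlation_matrix` and `mean_corr_at_offset` are hypothetical. It builds the same kind of correlation matrix and averages its entries by distance from the diagonal.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def make_smooth_columns(n_rows=300, n_cols=12, seed=0):
    """Synthetic stand-in data: each column is its left neighbour plus noise."""
    rng = random.Random(seed)
    columns = [[rng.gauss(0.0, 1.0) for _ in range(n_rows)]]
    for _ in range(n_cols - 1):
        prev = columns[-1]
        columns.append([v + rng.gauss(0.0, 0.5) for v in prev])
    return columns


def correlation_matrix(columns):
    """Matrix whose [i][j] entry is the correlation of column i with column j."""
    return [[pearson(ci, cj) for cj in columns] for ci in columns]


def mean_corr_at_offset(cor_mat, offset):
    """Average correlation between columns whose indices differ by `offset`."""
    vals = [cor_mat[i][i + offset] for i in range(len(cor_mat) - offset)]
    return sum(vals) / len(vals)


columns = make_smooth_columns()
cor_mat = correlation_matrix(columns)
# Print the average correlation at increasing distance from the diagonal.
for offset in (1, 5, 10):
    print(offset, round(mean_corr_at_offset(cor_mat, offset), 3))
```

The averages should fall as the offset grows, which is the numeric counterpart of the heat map fading away from the diagonal.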
Perfect correlation (correlation = 1) between attributes means that you may have made a mistake and included the same thing twice. Very high correlation between a set of attributes (pairwise correlations > 0.7) is known as multicollinearity and can lead to unstable estimates. Correlation with the targets is a different matter. Having an attribute that’s correlated with the target generally indicates a predictive relation.
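Both checks can be sketched in a few lines: flag attribute pairs whose correlation exceeds the 0.7 threshold mentioned above, and compute each attribute's correlation with the target. This is an illustration only. The tiny data set is made up, the helper name `high_corr_pairs` is hypothetical, and encoding the two classes as 0 and 1 is an assumption made so that a correlation with the label can be computed.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def high_corr_pairs(columns, threshold=0.7):
    """Return (i, j, r) for attribute pairs with |r| above the threshold."""
    pairs = []
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            r = pearson(columns[i], columns[j])
            if abs(r) > threshold:
                pairs.append((i, j, r))
    return pairs


# Made-up data: a1 is nearly a copy of a0, a2 is mostly unrelated to both.
a0 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
a1 = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
a2 = [2.0, -1.0, 0.5, 2.5, -0.5, 1.0]
target = [0, 0, 0, 1, 1, 1]  # assumption: the two classes encoded as 0 and 1

attributes = [a0, a1, a2]
pairs = high_corr_pairs(attributes)
target_corr = [pearson(col, target) for col in attributes]

print("highly correlated pairs:", [(i, j) for i, j, _ in pairs])
print("correlation with target:", [round(r, 3) for r in target_corr])
```

Here the near-duplicate pair is the only one flagged, while the correlations with the target show which attributes carry predictive signal.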