This notebook contains my notes on predictive analysis for binary classification. It acts as a cookbook and continues the previous post on using pandas.
Parallel Coordinates Plots
```python
import matplotlib.pyplot as plot

for i in range(208):
    # assign color based on "M" or "R" labels
    if rocksVMines.iat[i, 60] == "M":
        pcolor = "red"
    else:
        pcolor = "blue"
    # plot rows of data as if they were series data
    dataRow = rocksVMines.iloc[i, 0:60]
    dataRow.plot(color=pcolor, alpha=0.5)

plot.xlabel("Attribute Index")
plot.ylabel("Attribute Values")
plot.show()
```
No extremely clear separation is evident in the line plot, but there are some areas where the blues and reds are separated. Along the bottom of the plot the blues stand out a bit, and in the range of attribute indices from 30 to 40 the blues sit somewhat higher than the reds. These kinds of insights can help in interpreting and confirming predictions made by your trained model.
Visualizing Interrelationships between Attributes and Labels
Another question you might ask of the data is how the various attributes relate to one another. One quick way to get an idea of pairwise relationships is to cross-plot the attributes with the labels.
Scatter Plot / Cross-Plots
```python
# cross-plot pairs of rows to eyeball correlations between real-valued attributes
dataRow2 = rocksVMines.iloc[1, 0:60]
dataRow3 = rocksVMines.iloc[2, 0:60]

plot.scatter(dataRow2, dataRow3)
plot.xlabel("2nd Attribute")
plot.ylabel("3rd Attribute")
plot.show()

dataRow21 = rocksVMines.iloc[20, 0:60]

plot.scatter(dataRow2, dataRow21)
plot.xlabel("2nd Attribute")
plot.ylabel("21st Attribute")
plot.show()
```
If you want to develop your intuition about the relation between numeric correlation and the shape of a scatter plot, just search for “correlation” and compare the reported coefficients with the shapes of the example plots.
Basically, if the points in the scatter plot lie along a thin straight line, the two variables are highly correlated; if they form a ball of points, they’re uncorrelated.
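This rule of thumb is easy to check numerically. The sketch below (synthetic data, not the sonar set) builds one "thin line" relationship and one "ball of points" and compares their correlation coefficients:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 1, 500)

# points lying along a thin straight line: highly correlated with x
line = 2.0 * x + rng.normal(0, 0.05, 500)

# a "ball" of points: drawn independently of x, so uncorrelated
ball = rng.uniform(0, 1, 500)

corr_line = np.corrcoef(x, line)[0, 1]  # close to 1
corr_ball = np.corrcoef(x, ball)[0, 1]  # close to 0
print(round(corr_line, 2), round(corr_ball, 2))
```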
Correlation between Classification Target and Real Attributes
Plot a scatter plot of the targets against attribute 35.
The idea of using attribute 35 for the example showing correlation with the target came from the parallel coordinates graph above.
```python
from random import uniform

# change the targets to numeric values
target = []
for i in range(208):
    # assign 0 or 1 target value based on "M" or "R" labels
    if rocksVMines.iat[i, 60] == "M":
        target.append(1.0)
    else:
        target.append(0.0)

# plot attribute 35 against the target
dataRow = rocksVMines.iloc[0:208, 35]
plot.scatter(dataRow, target)
plot.xlabel("Attribute Value")
plot.ylabel("Target Value")
plot.show()

# To improve the visualization, this version dithers the points a little
# and makes them somewhat transparent
target = []
for i in range(208):
    # assign 0 or 1 target value based on "M" or "R" labels
    # and add some dither
    if rocksVMines.iat[i, 60] == "M":
        target.append(1.0 + uniform(-0.1, 0.1))
    else:
        target.append(0.0 + uniform(-0.1, 0.1))

dataRow = rocksVMines.iloc[0:208, 35]
plot.scatter(dataRow, target, alpha=0.5, s=120)
plot.xlabel("Attribute Value")
plot.ylabel("Target Value")
plot.show()
```
Notice the somewhat higher concentration of attribute 35 on the left end of the upper band of points, whereas the points are more uniformly spread from right to left in the lower band. The upper band of points corresponds to mines; the lower band corresponds to rocks. You could build a classifier for this problem by testing whether attribute 35 is greater than or less than 0.5: if it is greater than 0.5, call it a rock, and if it is less than 0.5, call it a mine. The examples where attribute 35 is less than 0.5 contain a higher concentration of mines than rocks, and the examples where attribute 35 is greater than 0.5 contain a lower density of mines, so you’d get better performance than you would with random guessing.
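The threshold rule above can be sketched in a few lines. Since the rocksVMines DataFrame isn't loaded here, this uses synthetic stand-in values whose distribution mimics the description (mines skew low, rocks spread more uniformly); the numbers are illustrative assumptions, not the real data:

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic stand-in for attribute 35: 104 mines (label 1) skewing low,
# 104 rocks (label 0) spread more uniformly
n = 104
mine_vals = rng.uniform(0.0, 0.7, n)
rock_vals = rng.uniform(0.0, 1.0, n)
values = np.concatenate([mine_vals, rock_vals])
labels = np.concatenate([np.ones(n), np.zeros(n)])

# threshold rule from the text: attribute < 0.5 -> mine (1), otherwise rock (0)
predictions = (values < 0.5).astype(float)
accuracy = (predictions == labels).mean()
print(accuracy)  # better than the 0.5 expected from random guessing
```

Even this one-attribute rule beats a coin flip, which is the intuition the scatter plot is meant to convey.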
The degree of correlation between two attributes (or an attribute and a target) can be quantified using Pearson’s correlation coefficient.
Attributes with close index numbers have relatively higher correlations than those whose indices are separated further apart.
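You can check this pattern with `DataFrame.corr()`. The sketch below uses synthetic rows (white noise smoothed with a short moving average, so nearby columns share information, as the sonar readings do), not the actual rocksVMines data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# each synthetic row is noise smoothed over a 5-wide window,
# so columns with nearby indices move together across rows
kernel = np.ones(5) / 5
rows = [np.convolve(rng.normal(size=60), kernel, mode="same") for _ in range(208)]
df = pd.DataFrame(rows)

corr = df.corr()
near = corr.iloc[1, 2]    # adjacent attribute indices: high correlation
far = corr.iloc[1, 31]    # widely separated indices: near zero
print(round(near, 2), round(far, 2))
```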