This notebook collects my notes on predictive analysis for binary classification, written as a cookbook. It continues the previous post on summary statistics.
Using Python Pandas to Read Data
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plot
%matplotlib inline
target_url = ("https://archive.ics.uci.edu/ml/machine-learning-"
"databases/undocumented/connectionist-bench/sonar/sonar.all-data")
#read rocks versus mines data into pandas data frame
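#read_csv no longer accepts prefix= (removed in pandas 2.0); add_prefix gives the same V0, V1, ... column names
rocksVMines = pd.read_csv(target_url, header=None).add_prefix("V")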
#print head and tail of data frame
print(rocksVMines.head())
print(rocksVMines.tail())
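Comparing the head and the tail is a quick check for structure in the way the data are stored. If, for example, the rows turn out to be grouped by label, a sample taken without shuffling can end up containing only one class, so that ordering needs to be factored into your approach for any subsequent sampling.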
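As a minimal sketch of why row order matters for sampling, the snippet below builds a small hypothetical frame (the name toy and its values are invented for illustration, not taken from the sonar file) whose rows are grouped by label. Taking the first rows sees a single class, while sampling within each label keeps both. It assumes pandas 1.1 or later, where groupby(...).sample is available.

```python
import pandas as pd

# Hypothetical toy frame sorted by label, mimicking a file with class-grouped rows.
toy = pd.DataFrame({
    "V0": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    "label": ["R", "R", "R", "R", "M", "M", "M", "M"],
})

# Taking the first half without shuffling only ever sees one class.
naive_sample = toy.iloc[:4]

# Sampling a fixed fraction within each label keeps both classes represented.
stratified_sample = toy.groupby("label").sample(frac=0.5, random_state=0)

print(naive_sample["label"].unique())
print(stratified_sample["label"].value_counts())
```

The same pattern should carry over to the real data frame by grouping on whichever column holds the labels.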
Using Python Pandas to Summarize Data
#print summary of data frame
summary = rocksVMines.describe()
print(summary)
Notice that the summary produced by the describe function is itself a data frame, so you can automate the process of screening for attributes that have outliers. To do that, compare the differences between the various quantiles and raise a flag if any of the differences for an attribute is out of scale with the other differences for the same attribute. The output shown here suggests that several attributes have outliers. It is worth checking how many rows are involved in those outliers.
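The screening idea can be sketched roughly as follows. The helper names flag_outlier_columns and count_outlier_rows, the toy data, and the threshold ratio=3.0 are all assumptions made for illustration. The rule used here (flag a column when the gap from a quartile to the min or max is much larger than the interquartile range) is one possible way to compare quantile differences, not necessarily the one intended above.

```python
import pandas as pd

def flag_outlier_columns(frame, ratio=3.0):
    """Return numeric columns whose min/max sit far outside the quartiles.

    ratio is an arbitrary illustrative threshold, not a value from the text.
    """
    summary = frame.describe()  # numeric columns only by default
    iqr = summary.loc["75%"] - summary.loc["25%"]
    lower_gap = summary.loc["25%"] - summary.loc["min"]
    upper_gap = summary.loc["max"] - summary.loc["75%"]
    flagged = (lower_gap > ratio * iqr) | (upper_gap > ratio * iqr)
    return flagged[flagged].index.tolist()

def count_outlier_rows(frame, column, ratio=3.0):
    """Count rows of one column lying more than ratio * IQR beyond the quartiles."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    outside = (frame[column] < q1 - ratio * iqr) | (frame[column] > q3 + ratio * iqr)
    return int(outside.sum())

# Hypothetical data: one well-behaved column and one with a single extreme value.
toy = pd.DataFrame({
    "steady": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "spiky": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 80.0],
})

flagged_columns = flag_outlier_columns(toy)
print(flagged_columns)
print({name: count_outlier_rows(toy, name) for name in flagged_columns})
```

Applied to rocksVMines, the first helper would return the attribute names to inspect and the second would give a row count per flagged attribute. The threshold would likely need tuning for the real data.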