These are mainly notes for myself to win a Data Science competition , but I figured that they might be of interest to some of the blog readers too. Comments on what is written below are most welcome!
Typically, steps 1-5 would happen once per competition or problem, while steps 6-9 would be repeated in a loop or occur in parallel until you run out of time
Steps of Data Exploration
First, identify Predictor (Input) and Target (output) variables. Next, identify the data type and category of the variables.
A key part of doing well in a competition is understanding how the performance measure works. It’s often easy to significantly improve your score by using an optimisation approach that is suitable to the measure. A classic example is optimising the mean absolute error (MAE) versus the mean square error (MSE). It’s easy to show that given no other data for a set of numbers, the predictor that minimises the MAE is the median, while the predictor that minimises the MSE is the mean.
The Data Exploration
When competing, exploiting anomalies in the data can work in your favour. For example, in the aforementioned hackathon, we noticed that even though we had to produce hourly predictions for air pollutant levels, the measured levels didn’t change every hour (probably due to limitations in the measuring equipment). This led us to try a simple “model” for the first few hours, where we predicted exactly the last measured value. This proved to be one of our most valuable insights. Obviously, this means that we were predicting what the measurement equipment would say rather than actual pollutant levels – something you’d definitely want to avoid in a real-life situation!
Understand the data and what useful patterns they want to model.
At this stage, we explore variables one by one. Method to perform uni-variate analysis will depend on whether the variable type is categorical or continuous. Let’s look at these methods and statistical measures for categorical and continuous variables individually:
- Continuous Variables:- In case of continuous variables, we need to understand the central tendency and spread of the variable.
- Categorical Variables:- For categorical variables, we’ll use frequency table to understand distribution of each category. We can also read as percentage of values under each category. It can be be measured using two metrics, Count and Count% against each category.
Univariate analysis is also used to highlight missing and outlier values.
Bi-variate Analysis finds out the relationship between two variables. Here, we look for association and disassociation between variables at a pre-defined significance level. We can perform bi-variate analysis for any combination of categorical and continuous variables. The combination can be: Categorical & Categorical, Categorical & Continuous and Continuous & Continuous.
- Continuous & Continuous: While doing bi-variate analysis between two continuous variables, we should look at scatter plot. It is a nifty way to find out the relationship between two variables. The pattern of scatter plot indicates the relationship between variables. The relationship can be linear or non-linear. Scatter plot shows the relationship between two variable but does not indicates the strength of relationship amongst them. To find the strength of the relationship, we use Correlation. Correlation varies between -1 and +1.
- Categorical & Categorical: To find the relationship between two categorical variables, we can use following methods:
- Two-way table: We can start analyzing the relationship by creating a two-way table of count and count%. The rows represents the category of one variable and the columns represent the categories of the other variable. We show count or count% of observations available in each combination of row and column categories.
- Stacked Column Chart: This method is more of a visual form of Two-way table.
- Chi-Square Test: This test is used to derive the statistical significance of relationship between the variables. Also, it tests whether the evidence in the sample is strong enough to generalize that the relationship for a larger population as well. Chi-square is based on the difference between the expected and observed frequencies in one or more categories in the two-way table. It returns probability for the computed chi-square distribution with the degree of freedom.
- Probability of 0: It indicates that both categorical variable are dependent
- Probability of 1: It shows that both variables are independent.
- Probability less than 0.05: It indicates that the relationship between the variables is significant at 95% confidence.
- Categorical & Continuous: While exploring relation between categorical and continuous variables, we can draw box plots for each level of categorical variables. If levels are small in number, it will not show the statistical significance. To look at the statistical significance we can perform Z-test, T-test or ANOVA.
- Z-Test/ T-Test:- Either test assess whether mean of two groups are statistically different from each other or not.
- ANOVA:- It assesses whether the average of more than two groups is statistically different.
Having a local validation environment allows you to move faster and produce more reliable results than when relying on leaderboard scores.
k-fold cross validation, which I used is, in general, a good enough start for most competitions. The decision regarding how to perform the split is critical. Random splits might be good enough at times. Other times the classes (0/1) are unbalanced so you might need to do stratified sampling or sometimes time-based splits (month, quarter, year, etc.) will have to be made.
Over time, I have learnt what it represents and why it’s an indispensable part of any machine learning problem.
For any given problem, it’s likely that there are people dedicating their lives to its solution. Those people (often academics) have probably published papers, benchmarks and code, which you can learn from.
You can read more about feature engineering.
Try different algorithms to give a better performance both locally and on the leader board
Hyperparameters can be thought of as cogs which can be turned to fine-tune the machine that is your algorithm.
Running a method without minimal tuning is worse than not running it at all, because you may get a false negative – giving up on something that would actually work very well.
There are automated methods such as GridSearch and RandomizedSearch which can be used. Or Read this post for manual tuning
The idea is to combine models that were developed independently. In high-profile competitions, it is often the case that teams merge and gain a significant boost from combining their models.
Solving a problem such as this takes time and perseverance, among other things.