Regression is often the first algorithm an aspiring data scientist needs to master. It is one of the easiest algorithms to learn, yet it takes understanding and effort to master. This blog is a guide to linear regression using Python. It focuses on simple and multiple linear regression. We will learn the crucial concepts along with code in Python. Once finished, we'll be able to build, improve, and optimize regression models.
The term “regression” was used by Francis Galton in his 1886 paper “Regression towards mediocrity in hereditary stature”. He used the term in the context of regression toward the mean. Today regression has an even broader meaning. It’s any model that makes continuous predictions.
Regression is a parametric technique used to predict the value of an outcome variable Y based on one or more input predictor variables X. It is parametric in nature because it makes certain assumptions about the data set. If the data set follows those assumptions, regression can give very accurate results. Otherwise, it struggles to provide convincing accuracy.
Regression analysis helps us to:
- Predict future observations.
- Find an association or relationship between variables.
- Identify which variables contribute more towards predicting future outcomes.
Types of regression problems:
Simple Linear Regression:
If the model deals with one input, called the independent or predictor variable, and one output, called the dependent or response variable, then it is called Simple Linear Regression. This type of linear regression assumes that there is a linear relationship between the predictor and the response variable.
Multiple Linear Regression:
If the problem contains more than one input variable and one response variable, then it is called Multiple Linear Regression.
Guide for Linear Regression using Python
The aim of linear regression is to establish a linear relationship (a mathematical formula) between the predictor variable(s) and the response variable.
This mathematical equation can be generalized as follows:
Y = β1 + β2X + ϵ
X is the known input variable, and if we can estimate β1 and β2 by some method, then Y can be predicted. To predict future outcomes, we need to estimate the unknown model parameters (β1, β2) from the training data.
β1 is the intercept. It is the predicted value when X = 0. β2 is the slope. It explains the change in Y when X changes by 1 unit. Collectively, β1 and β2 are called the regression coefficients. ϵ is the error term, the part of Y the regression model is unable to explain. For a fitted model it is estimated by the residual, i.e. the difference between the actual and predicted values.
The error is an inevitable part of the prediction-making process. We can't eliminate the error term (ϵ), but we try to make it as small as possible. To do this, regression uses estimation techniques such as Ordinary Least Squares (OLS), Generalized Least Squares, Percentage Least Squares, Total Least Squares, Least Absolute Deviations, and many more.
In this blog, we use the OLS technique for both simple and multiple linear regression. OLS minimizes the sum of squared errors, ∑[Actual(y) – Predicted(y’)]², by finding the best possible values of the regression coefficients (β1, β2, etc.).
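As a minimal sketch of what OLS does for a single predictor, the closed-form estimates can be computed directly with NumPy. The function names (fit_simple_ols, sum_squared_errors) and the toy data are hypothetical, made up for illustration only.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Return (beta1, beta2) minimizing the sum of squared errors for y = beta1 + beta2 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: co-variation of x and y divided by the variation of x.
    beta2 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: makes the fitted line pass through the point of means.
    beta1 = y_mean - beta2 * x_mean
    return beta1, beta2

def sum_squared_errors(x, y, beta1, beta2):
    """Sum of squared differences between actual y and predicted y'."""
    predicted = beta1 + beta2 * np.asarray(x, dtype=float)
    return float(np.sum((np.asarray(y, dtype=float) - predicted) ** 2))

# Hypothetical toy data (an assumption, not taken from a real data set).
x = [1, 2, 3, 4, 5]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

beta1, beta2 = fit_simple_ols(x, y)
sse = sum_squared_errors(x, y, beta1, beta2)
print(f"intercept={beta1:.3f}, slope={beta2:.3f}, SSE={sse:.3f}")
```

Any other choice of intercept or slope gives a larger sum of squared errors on the same data, which is what "best possible values" means here.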
OLS uses squared error, which has nice mathematical properties: it is easy to differentiate, so the minimum can be found in closed form or by gradient descent. It is also easy to analyze and computationally fast, so it can be applied quickly to data sets with thousands of features. Further, OLS coefficients are generally easier to interpret than the output of many other regression techniques.
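With more than one predictor, the same least-squares idea can be written with a design matrix. The sketch below is one possible way to do it, using NumPy's np.linalg.lstsq; the helper name fit_ols and the noise-free toy data are assumptions for illustration.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients [intercept, b_1, ..., b_p] for y regressed on the columns of X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    # Prepend a column of ones so the first coefficient is the intercept.
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

# Hypothetical data generated from y = 1 + 2*x1 - 3*x2 with no noise.
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 1], [1, 2]])
y = 1 + 2 * X[:, 0] - 3 * X[:, 1]

coef = fit_ols(X, y)
print(coef)  # intercept first, then one coefficient per predictor
```

Because the toy data has no noise, the fit recovers the generating coefficients up to floating-point error; with real data the estimates would only approximate the underlying relationship.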
Graphical Analysis of variables
Before modeling, we need to understand the independent variables (predictors) and the dependent variable. The best way to do this is to visualize their behavior through:
- Scatter plot: helps to visualize any linear relationship between the dependent (response) variable and the independent (predictor) variables. If we have multiple predictor variables, a scatter plot is drawn for each one of them against the response variable.
- Box plot: helps to spot any outlier observations in a variable. Outliers in a predictor can drastically affect the predictions, as they can easily change the direction/slope of the line of best fit.
- Density plot: checks whether the response variable is close to normally distributed and shows the distribution of the predictor variable. Ideally, a distribution close to normal (a bell-shaped curve), without skew to the left or right, is preferred.
- Correlation: a statistical measure of the level of linear dependence between two variables that occur in pairs. It takes values between -1 and +1. The correlation between two variables will be close to +1 if there is a strong positive relationship, and close to -1 if there is a strong inverse relationship. A value close to 0 suggests a weak linear relationship between the variables. A low correlation (between -0.2 and 0.2) probably suggests that much of the variation in the response variable (Y) is unexplained by the predictor (X), in which case we should probably look for better explanatory variables. Covariance is the unstandardized counterpart of correlation: it measures how two variables vary together, but its magnitude depends on the units of the variables, so correlation is the better gauge of the strength of the relationship.
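Two of the checks above can also be done numerically. The following is a rough sketch, not a replacement for the plots: pearson_correlation divides the covariance by the two standard deviations, and iqr_outliers applies the usual box plot rule of flagging points more than 1.5 times the interquartile range beyond the quartiles. Both function names and the sample values are hypothetical.

```python
import numpy as np

def pearson_correlation(x, y):
    """Correlation = covariance of x and y divided by the product of their standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    covariance = np.mean((x - x.mean()) * (y - y.mean()))
    return covariance / (x.std() * y.std())

def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR], the points a box plot draws individually."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]

# Hypothetical sample values, for illustration only.
predictor = [1, 2, 3, 4, 5, 100]
response = [2, 4, 5, 4, 5, 60]

print("correlation:", pearson_correlation(predictor, response))
print("outliers in predictor:", iqr_outliers(predictor))
```

Note how a single extreme point can dominate the correlation, which is why the box plot check is worth doing before trusting the number.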
What are the assumptions made in regression?
Regression is a parametric technique, so it makes assumptions. These assumptions make regression quite restrictive, because its performance is conditioned on their fulfillment. Once the assumptions are violated, regression can make biased, erratic predictions. If the data violates these assumptions, one option is to use tree-based algorithms, which capture nonlinearity quite well. We can also check performance metrics to get a sense of whether the assumptions are violated.
Let’s look at the assumptions and interpretations of regression plots:
- Linear and additive relationship: a linear and additive relationship exists between the response variable (Y) and the predictors (X). By linear, we mean that the change in Y for a 1-unit change in X is constant. If we fit a linear model to a nonlinear, non-additive data set, the regression algorithm will fail to capture the trend mathematically, resulting in a poorly fitting model and erroneous predictions on unseen data.
By additive, we mean that the effect of X on Y is independent of the other variables.
How to check for a linear and additive relationship: look at the residual vs. fitted values plot.
Residual vs. Fitted Values Plot
This scatter plot shows the distribution of residuals (errors) vs. fitted values (predicted values). It reveals various useful insights, including outliers. When the outliers are labeled by their observation number, they are easy to detect.
Ideally, this plot shouldn't show any pattern. If there is a visible shape (a curve or U shape), it suggests non-linearity in the data, meaning that the model doesn't capture nonlinear effects. In addition, if a funnel shape is evident in the plot, consider it a sign of non-constant variance, i.e. heteroscedasticity.
Image for residuals Vs fitted values: https://stats.stackexchange.com/questions/76226/interpreting-the-residuals-vs-fitted-values-plot-for-verifying-the-assumptions
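To make the U-shaped pattern concrete, here is a small sketch that fits a straight line to deliberately curved data and summarizes the residuals by region of the fitted values. The helper fitted_and_residuals and the synthetic data are assumptions for illustration; with real data you would normally draw the scatter plot itself.

```python
import numpy as np

def fitted_and_residuals(x, y):
    """Fit y = b1 + b2*x by least squares and return (fitted values, residuals)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coef
    return fitted, y - fitted

# Synthetic, deliberately nonlinear data (an assumption for this example).
x = np.linspace(0, 10, 51)
y = x ** 2

fitted, residuals = fitted_and_residuals(x, y)

# Average residual in the lower, middle and upper third of the fitted values.
order = np.argsort(fitted)
low, middle, high = (chunk.mean() for chunk in np.array_split(residuals[order], 3))
print(f"mean residual: low={low:+.2f}, middle={middle:+.2f}, high={high:+.2f}")
```

Positive averages at both ends and a negative average in the middle are the numeric signature of the U shape. If matplotlib is available, plotting fitted against residuals with plt.scatter shows the same curve visually.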
Steps: to overcome the issue of nonlinearity, we can apply a nonlinear transformation to the predictors, such as log(X), √X, or X², or transform the dependent variable using sqrt, log, square, etc. We can also include polynomial terms (X, X², X³) in the model to capture the nonlinear effect.
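The effect of adding a polynomial term can be sketched as follows. This example assumes synthetic data generated from a quadratic curve and compares the sum of squared errors of a straight-line fit with a fit that also includes X²; the helper name sse_of_fit is hypothetical.

```python
import numpy as np

def sse_of_fit(design, y):
    """Least-squares fit for a given design matrix; returns (coefficients, sum of squared errors)."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return coef, float(np.sum(residuals ** 2))

# Synthetic curved relationship (an assumption for this example): y = 4 + 0.5 * x^2.
x = np.linspace(0, 10, 51)
y = 4 + 0.5 * x ** 2

linear_design = np.column_stack([np.ones_like(x), x])        # columns: 1, X
poly_design = np.column_stack([np.ones_like(x), x, x ** 2])  # columns: 1, X, X^2

_, sse_linear = sse_of_fit(linear_design, y)
poly_coef, sse_poly = sse_of_fit(poly_design, y)

print(f"SSE with X only:    {sse_linear:.3f}")
print(f"SSE with X and X^2: {sse_poly:.3f}")
```

The model with the X² column is still linear in its coefficients, so OLS applies unchanged; only the inputs were transformed.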
- Error terms must possess constant variance (homoscedasticity). The absence of constant variance in the error terms is called heteroscedasticity. Generally, non-constant variance arises in the presence of outliers or extreme leverage values. These values get too much weight and hence disproportionately influence the model's performance. When this phenomenon occurs, the confidence interval for out-of-sample prediction tends to be unrealistically wide or narrow.
How to check for heteroscedasticity:
- Look at the residual vs. fitted values plot. If heteroscedasticity exists, the plot will exhibit a funnel-shaped pattern.
- Use the Breusch-Pagan / Cook-Weisberg test or White's general test to detect this phenomenon. If we find p < 0.05, we reject the null hypothesis of constant variance and infer that heteroscedasticity is present.
- Look at the scale-location plot.
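The idea behind the Breusch-Pagan test can be sketched by hand for a single predictor. The code below is an illustration under stated assumptions, not a library implementation: it uses the studentized (Koenker) form of the statistic, LM = n × R² from regressing the squared residuals on X, and, because there is only one predictor, the chi-square p-value with 1 degree of freedom can be written with math.erfc. The function names and the deterministic toy data are hypothetical.

```python
import math
import numpy as np

def ols_residuals(x, y):
    """Residuals from a least-squares fit of y on an intercept and x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef

def breusch_pagan_one_predictor(x, y):
    """Return (LM statistic, p-value); the null hypothesis is constant error variance."""
    squared = ols_residuals(x, y) ** 2
    # Auxiliary regression: do the squared residuals depend on x?
    aux_residuals = ols_residuals(x, squared)
    ss_total = np.sum((squared - squared.mean()) ** 2)
    if ss_total == 0:
        return 0.0, 1.0
    r_squared = max(0.0, 1.0 - np.sum(aux_residuals ** 2) / ss_total)
    lm = len(squared) * r_squared
    # Chi-square with 1 degree of freedom: P(chi2 > lm) = erfc(sqrt(lm / 2)).
    p_value = math.erfc(math.sqrt(lm / 2.0))
    return float(lm), p_value

# Deterministic toy data whose error size grows with x (an assumption for this example).
x = np.linspace(1, 10, 200)
signs = np.where(np.arange(x.size) % 2 == 0, 1.0, -1.0)
y = 2 + 3 * x + x * signs

lm, p_value = breusch_pagan_one_predictor(x, y)
print(f"LM={lm:.2f}, p={p_value:.4g}")
```

With several predictors the degrees of freedom equal the number of predictors and a general chi-square distribution is needed, so in practice a statistics library is the more convenient route.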
Scale Location Plot
This plot shows how the residuals are spread along the range of fitted values. It's similar to the residual vs. fitted values plot, except that it uses (the square root of the absolute) standardized residual values. We need to corroborate the findings of this plot with the funnel shape in the residual vs. fitted values plot. Ideally, this plot shouldn't show any pattern: if the points are spread evenly around a roughly horizontal line, the errors have constant variance. However, if the plot shows a discernible pattern (probably a funnel shape or a rising trend), it implies non-constant variance of the errors.
Image for Scale Location Plot: https://stats.stackexchange.com/questions/93464/homoscedastic-and-heteroscedastic-data-and-regression-models
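The quantities behind a scale-location plot can be computed as in the sketch below. It assumes a single predictor and the common convention of plotting the square root of the absolute standardized residuals against the fitted values; the function name scale_location_values and the deterministic toy data are hypothetical.

```python
import numpy as np

def scale_location_values(x, y):
    """Return (fitted values, sqrt(|standardized residuals|)) for a straight-line fit."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coef
    residuals = y - fitted
    n, p = design.shape
    # Leverage: diagonal of the hat matrix H = X (X'X)^-1 X'.
    xtx_inv = np.linalg.inv(design.T @ design)
    leverage = np.sum((design @ xtx_inv) * design, axis=1)
    # Residual standard error, then standardize each residual.
    sigma = np.sqrt(np.sum(residuals ** 2) / (n - p))
    standardized = residuals / (sigma * np.sqrt(1.0 - leverage))
    return fitted, np.sqrt(np.abs(standardized))

# Deterministic toy data whose error size grows with x (an assumption for this example).
x = np.linspace(1, 10, 200)
signs = np.where(np.arange(x.size) % 2 == 0, 1.0, -1.0)
y = 2 + 3 * x + x * signs

fitted, spread = scale_location_values(x, y)
trend = np.corrcoef(fitted, spread)[0, 1]
print(f"correlation between fitted values and spread: {trend:.2f}")
```

A clearly positive (or negative) correlation between the fitted values and the spread is a numeric hint of the trend that would appear in the plot; values near zero are consistent with constant variance.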
Steps: to overcome the issue of heteroscedasticity, transform the response variable (Y) using sqrt, log, square, etc. Also, we can use the weighted least squares method to tackle this problem.
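Weighted least squares can be sketched with NumPy by scaling each row by the square root of its weight and then running ordinary least squares. This example assumes the error standard deviation is proportional to X, so the weights are 1/X²; in practice the right weights have to be justified from the data. The function name fit_wls and the toy data are hypothetical.

```python
import numpy as np

def fit_wls(x, y, weights):
    """Weighted least squares for y = b1 + b2*x; returns [b1, b2]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    root_w = np.sqrt(np.asarray(weights, dtype=float))
    design = np.column_stack([np.ones_like(x), x])
    # Scaling rows by sqrt(weight) turns the weighted problem into an ordinary one.
    coef, *_ = np.linalg.lstsq(design * root_w[:, None], y * root_w, rcond=None)
    return coef

# Deterministic toy data whose error size grows with x (an assumption for this example).
x = np.linspace(1, 10, 200)
signs = np.where(np.arange(x.size) % 2 == 0, 1.0, -1.0)
y = 2 + 3 * x + x * signs

# Assumed weights: inverse of the error variance, here proportional to 1 / x^2.
wls_coef = fit_wls(x, y, 1.0 / x ** 2)
print("WLS intercept and slope:", wls_coef)
```

Observations with larger assumed variance get smaller weights, so the noisy high-X points pull on the fitted line less than they would under plain OLS.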