Regression analysis explores the relationship between a quantitative response variable and one or more explanatory variables. With just one explanatory variable it is called simple regression; with more than one it is called multiple regression. Usually x is the explanatory variable and y the response variable; x is sometimes also called the independent variable and y the dependent variable.
Can the density of wood be used to predict its hardness? Hardness is difficult to measure, whereas density is easy to measure. If we can establish a relationship between these two variables, we can use the easy-to-measure density to estimate the hard-to-measure hardness.
The question is: can we use a known value of density x to help predict the hardness y? To answer it, we will fit a regression line through the points described by the density and hardness data and then use this line for prediction. This raises three questions:
- Is the relationship strong enough so that we can use it for prediction?
- How do we come up with a regression line?
- Is a line a reasonable summary of the relationship between the variables?
Correlation coefficient (How strong is the relationship?)
The correlation coefficient of two variables in a data set equals their covariance divided by the product of their individual standard deviations. It is a normalized measure of how strongly the two are linearly related and always lies between -1 and 1.
The correlation coefficient tells us how closely the points in a scatter plot follow an imaginary straight line. Put more simply, it tells us whether there is a linear relationship between two features x and y and how strong it is.
Look at the figure above: both lines in the left scatter plot have a correlation coefficient of 1, the lines in the middle scatter plot have a correlation coefficient of -1, and the correlation coefficient in the right scatter plot is 0.
1, all points lie exactly on a rising straight line
-1, all points lie exactly on a falling straight line
0, there is no linear relationship, so we cannot say what the scatter plot looks like
The correlation coefficient applies only to linear relationships, that is, relationships that are represented by a straight line in a scatter plot. It does not tell us how steep the line is or what shape a curve has; its sign only tells us whether the trend is rising or falling. The right plot depicts a quadratic function whose falling and rising halves cancel each other out, therefore the coefficient is 0.
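The three cases can be reproduced in a few lines. The following is a minimal sketch in plain Python, not the code behind the figure: the helper `pearson_r` and the three small data sets are made up for illustration, and the function simply follows the definition given earlier (covariance divided by the product of the standard deviations).

```python
import math


def pearson_r(xs, ys):
    """Correlation coefficient: covariance / (sd of x * sd of y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample covariance and sample standard deviations (divisor n - 1).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov / (sd_x * sd_y)


# Made-up illustrative data, symmetric around x = 0.
xs = [-2, -1, 0, 1, 2]
rising = [2 * x + 1 for x in xs]      # points on a rising line
falling = [-3 * x + 4 for x in xs]    # points on a falling line
parabola = [x ** 2 for x in xs]       # quadratic, not linear

r_rising = pearson_r(xs, rising)
r_falling = pearson_r(xs, falling)
r_parabola = pearson_r(xs, parabola)

print(round(r_rising, 6), round(r_falling, 6), round(r_parabola, 6))
```

Note that the parabola is a perfectly deterministic relationship and still gets a coefficient of 0, because the coefficient only measures the linear part of a relationship.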
Regression (How to determine a regression line?)
The closer the correlation coefficient is to 0, that is, the more the data points are scattered, the harder it is to determine the line. Determining the exact position of the line is important because only then can we predict an unknown y for a known x.
The scatter plot above illustrates four data points. The task is to fit a regression line through these data points.
The first approach (gray line): an approximate line is drawn by visual judgment. Are we done?
The second approach: another approximate line (yellow) is drawn by visual judgment. Which of the two is the better one? Maybe we need some mathematics to solve this problem.
The distances of the points from the line should be minimal, so we look for the line for which the sum of all distances is smallest. More precisely, we measure the vertical distances, that is, the distances between the points and the regression line in the y direction. This gives us a straight line with high predictive accuracy.
More specifically, the sum of the squared y deviations is minimized. In the figure each squared deviation is drawn as a square whose area equals the squared value, so the task is to minimize the total area of these squares; this is why the method of least squares is applied. The deviations are also called errors or residuals.
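To make the comparison of two hand-drawn lines concrete, here is a minimal sketch in Python. The four points and the coefficients of the "gray" and "yellow" lines are invented for illustration and are not the values from the figure; the helper name `sum_squared_residuals` is likewise an assumption.

```python
def sum_squared_residuals(xs, ys, intercept, slope):
    """Total squared vertical distance between the points and a line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Four made-up data points (illustrative only).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 6.0]

# Two candidate lines drawn "by eye": (intercept, slope), both hypothetical.
gray = (0.0, 1.5)
yellow = (1.0, 1.2)

ssr_gray = sum_squared_residuals(xs, ys, *gray)
ssr_yellow = sum_squared_residuals(xs, ys, *yellow)

# The line with the smaller sum of squared residuals fits these points better.
better = "yellow" if ssr_yellow < ssr_gray else "gray"
print(ssr_gray, ssr_yellow, better)
```

With this criterion the question "which line is better?" has a definite answer, and the least squares line is simply the one that no other line can beat.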
By now we know how to fit a regression line, but we do not yet know the coefficients of the equation of the line. We assume a linear relationship between x and y, therefore we use a linear equation.
The first equation, y = beta 0 + beta 1 · x + epsilon, is a simple linear equation: beta 0 is the intercept, beta 1 the slope, and epsilon the random error. The error term expresses that the y values lie near the line but not precisely on it.
The red triangles in the plot show some predictions. The predictions are always on the regression line; the observed values are not. They lie near the regression line, and the gap between them and the line is what epsilon describes.
To do statistical inference later, we will have to make some assumptions about the distribution of these errors.
The parameters beta 0 and beta 1 are typically unknown, so we have to estimate them.
Note the hats on the terms of the second equation, y hat = beta 0 hat + beta 1 hat · x. They show that we use sample data to obtain the estimated regression line.
Beta 0 hat and beta 1 hat are values computed from the sample data that estimate the parameters beta 0 and beta 1. Y hat is the predicted value for a given value of x. There is no error term because the predicted values lie on the line.
How are we going to estimate the parameters? A common approach is to apply the method of least squares to estimate beta 0 and beta 1.
I will spare you the details of calculating the parameters. They are part of the model created in the statistical language R; you can see them in the red area of the figure above.
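For readers who want a glimpse of the calculation anyway, the least squares estimates for simple linear regression have a well-known closed form: the slope is the sum of the cross products of the deviations divided by the sum of the squared x deviations, and the intercept follows from the two means. The sketch below uses Python rather than R and four made-up points, so its numbers are not those of the wood model; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least squares estimates (intercept, slope) for simple linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Four made-up data points (illustrative only, not the wood data).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 6.0]

b0_hat, b1_hat = fit_line(xs, ys)

# Prediction for a new x: the predicted value lies on the fitted line.
x_new = 2.5
y_hat = b0_hat + b1_hat * x_new
print(b0_hat, b1_hat, y_hat)
```

In R the same estimates would normally come from the model object rather than from hand-written formulas; the sketch only shows what happens under the hood.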
Assumptions (Is a line a reasonable summary of the relationship between the variables?)
In order to carry out a statistical analysis, we need to make a few assumptions about our model to make sure that linear regression is suitable. Epsilon is assumed to be a random variable that:
- Has a mean of 0
- Has constant variance sigma squared at every value of x (homoscedasticity)
- Is normally distributed (this needs to be checked, and it seldom holds exactly in practice)
- Is independent from one observation to the next
The statement is that the variance should be the same at all values of x. Besides presenting the constant variance, the figure above shows that I have to train more to make a living as a painter. Sigma squared is typically unknown, therefore it is estimated from the sample data.
The figure depicts that, for a given value of x, y is normally distributed with a mean of beta 0 + beta 1 times x and a variance of sigma squared. We assume that the distribution of y has the same shape at every value of x: the variance stays the same and only the mean changes. These theoretical means of y lie on the regression line. This is the model we use for prediction. We should check whether our model is reasonable by verifying some preconditions, which can be investigated with appropriate plots of the observed residuals.
The residuals in a simple linear regression with an intercept always sum to zero, therefore their mean is 0. We hope to see a random scatter of points and no pattern of any kind. In our case the variability (the blue lines drawn by hand) is not exactly the same for all values of x. Overall the variance is OK because the dots do not form a pattern and the two blue lines are "almost" horizontal.
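The two claims about the residuals can also be checked numerically. The following is a rough sketch under stated assumptions: the eight data points are invented, `fit_line` is a hypothetical helper implementing the usual least squares formulas, and comparing the spread of the residuals in the lower and upper half of the x range is only a crude stand-in for looking at the residual plot.

```python
import statistics


def fit_line(xs, ys):
    """Least squares estimates (intercept, slope) for simple linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    return mean_y - slope * mean_x, slope


# Made-up data, sorted by x (illustrative only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

b0_hat, b1_hat = fit_line(xs, ys)
residuals = [y - (b0_hat + b1_hat * x) for x, y in zip(xs, ys)]

# With an intercept in the model, the residuals sum to zero
# (up to floating point rounding).
residual_sum = sum(residuals)

# Crude constant-variance check: compare the residual spread for
# small x values with the spread for large x values.
half = len(xs) // 2
spread_low = statistics.pstdev(residuals[:half])
spread_high = statistics.pstdev(residuals[half:])
print(residual_sum, spread_low, spread_high)
```

If the two spreads differ strongly, the constant variance assumption deserves a closer look; with so few points, though, such a comparison is only a hint and does not replace the plot.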
This plot checks whether the residuals are normally distributed. The values seem normally distributed, although you can see some deviation at the beginning and the end of the line. Normality should be tested with a dedicated statistical test. Moreover, the hypothesis of an underlying normal distribution seldom holds exactly. All in all, our model seems to be OK, but it is far from perfect.
This blog post presented what linear regression is used for, how the relationship between the variables is quantified, how a regression line is fitted through the data points, which assumptions have to be fulfilled to apply linear regression, and how to check them.