Regression analysis explores the relationship between a quantitative response variable and one or more explanatory variables. With just one explanatory variable it is called simple regression; with more than one it is called multiple regression. Usually x is the explanatory variable and y the response variable; x is sometimes also called the independent variable and y the dependent variable.
Can the density of wood be used to predict its hardness? Hardness is difficult to measure; density is easier. If we can establish a relationship between these two variables, we can use the easy-to-measure density to estimate the hard-to-measure hardness.
The question is whether a known value of density x can help predict the hardness y. To answer it, we fit a line through the points described by the density and hardness data and then use this line for prediction. Three questions come up along the way:
- Is the relationship strong enough so that we can use it for prediction?
- How do we come up with a good fitting line?
- Is a line a reasonable summary of the relationship between the variables?
Correlation coefficient (How strong is the relationship?)
The correlation coefficient tells us how closely the points in a scatter plot follow an imaginary line. Put more simply, it tells us whether there is a linear relationship between two features x and y and how strong it is, that is, how well the points can be connected by a straight line.
Look at the figure above. Both lines in the left scatter plot have a correlation coefficient of 1, both lines in the middle scatter plot have a correlation coefficient of -1, and the correlation coefficient in the right scatter plot is 0.
1, all points lie exactly on a rising line
-1, all points lie exactly on a falling line
0, there is no linear relationship, and we can’t say what the scatter plot looks like
The correlation coefficient applies only to linear relationships, that is, relationships that are represented by a straight line in a scatter plot. It does not tell us how steep the line is or what shape a curve has; its sign only tells us whether the relationship is rising or falling. The right plot shows a quadratic function, so the coefficient is 0 even though y clearly depends on x.
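The three cases can be reproduced with a few lines of Python. This is a minimal sketch: the function name `pearson_r` and the data values are made up for illustration and are not taken from the figure.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross products and sums of squares of the deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [-2, -1, 0, 1, 2]               # made-up x values
rising = [2 * x + 1 for x in xs]     # points exactly on a rising line
falling = [-3 * x + 4 for x in xs]   # points exactly on a falling line
parabola = [x ** 2 for x in xs]      # symmetric quadratic relationship

r_rising = pearson_r(xs, rising)
r_falling = pearson_r(xs, falling)
r_parabola = pearson_r(xs, parabola)
print(r_rising, r_falling, r_parabola)
```

The slopes 2 and -3 differ, yet both lines give a coefficient of magnitude 1, which illustrates that the coefficient says nothing about steepness.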
Regression (How to determine a well-fitting line?)
The lower the absolute value of the correlation coefficient, that is, the more the data points scatter, the harder it is to determine the line. Determining the exact position of the line matters because only then can we predict an unknown y for a known x.
The scatter plot above shows 4 data points. The task is to fit a well-fitting line through these data points.
The first approach (gray line): a line is drawn by eye. Are we done?
The second approach: another line (yellow) is drawn by eye. Which of the two is better? Maybe we need some mathematics to solve this problem.
The distance of the points from the line should be minimal. We look for the line for which the sum of all distances is smallest; more precisely, the distances measured along the y-axis. This gives us a straight line with high predictive accuracy. Regression means going back: the y deviations are pushed back to the smallest possible value.
More specifically, the sum of the squared y deviations is minimized. Since each squared deviation can be drawn as the area of a square, the task is to minimize the total area of these squares. The deviations are also called errors or residuals.
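To make the comparison of two hand-drawn lines concrete, the total area of the squares can be computed for each candidate. The sketch below uses four made-up data points and two hypothetical candidate lines; neither the numbers nor the helper name `sum_squared_residuals` come from the figure.

```python
def sum_squared_residuals(xs, ys, intercept, slope):
    """Total area of the squares: sum of squared vertical distances to the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Four made-up data points (not the values behind the scatter plot)
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 6]

# Two hypothetical candidate lines, standing in for the gray and yellow lines
sse_gray = sum_squared_residuals(xs, ys, intercept=1.0, slope=1.0)
sse_yellow = sum_squared_residuals(xs, ys, intercept=0.5, slope=1.4)

print(sse_gray, sse_yellow)
better = "yellow" if sse_yellow < sse_gray else "gray"
print("smaller total area:", better)
```

Whichever candidate has the smaller sum is the better fit in the least-squares sense.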
By now we know how to fit the line, but we don’t know its equation. The equation is straightforward: since we assume a linear relationship between y and x, we use a linear equation.
The first equation is a simple linear equation: beta 0 is the intercept, beta 1 is the slope, and epsilon holds the random errors (estimated by the residuals). The errors express that the y values vary about the line and do not fall precisely on it.
The red triangles on the plot show some predictions. The predictions always lie on the regression line; the observed values do not. They vary around the regression line, and this variation is epsilon.
Later, to carry out statistical inference, we will have to make some assumptions about the distribution of this error component.
The parameters beta 0 and beta 1 are typically unknown, so we want to estimate them.
Note the hats on the terms of the second equation. They show that we use sample data to obtain the estimated regression line.
Beta 0 hat and beta 1 hat are sample statistics that estimate the parameters beta 0 and beta 1. Y hat is the predicted value for a given value of x. There is no error term because the predicted values fall precisely on the line.
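Written out from the descriptions above, the two equations take the following standard form. This is a reconstruction from the surrounding definitions; the symbols follow the names used in the text.

```latex
% First equation: the assumed model with its error term
y = \beta_0 + \beta_1 x + \varepsilon

% Second equation: the estimated regression line (no error term)
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x
```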
How are we going to estimate the parameters? We usually use the method of least squares to estimate beta 0 and beta 1.
I will spare you the details of calculating the parameters. They are part of the model created in the statistical language R; you can see them in the red area of the figure above.
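For readers who want to see the calculation anyway, here is a minimal Python sketch of the closed-form least-squares estimates for simple linear regression. It is not the R model from the figure: the function name `least_squares_line` and the data are made up for illustration.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx                       # beta 1 hat
    intercept = mean_y - slope * mean_x     # beta 0 hat
    return intercept, slope

# Made-up sample data, not real density/hardness measurements
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 6]

b0_hat, b1_hat = least_squares_line(xs, ys)
y_hat_at_5 = b0_hat + b1_hat * 5            # prediction for a new x value
print(b0_hat, b1_hat, y_hat_at_5)
```

The intercept formula implies that the fitted line always passes through the point made up of the mean of x and the mean of y.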
Assumptions (Is a line a reasonable summary of the relationship between the variables?)
In order to carry out statistical inference, we need to make a few assumptions about our model. Epsilon is assumed to be a random variable that:
- Has a mean of 0
- Has constant variance sigma squared at every value of x (homoscedasticity)
- Is normally distributed
- Is independent from one observation to the next
The constant-variance assumption says that the variance should be the same at every value of x. Besides showing the constant variance, the figure above shows that I have to train more to make a living as a painter. Sigma squared is typically unknown, so it is estimated from the sample data.
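A common way to estimate sigma squared in simple linear regression is the residual mean square. The formula below is offered as an assumption about what is meant here; n is the number of observations, and the divisor n - 2 accounts for the two estimated parameters.

```latex
s^2 = \frac{1}{n - 2} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```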
The figure shows that, for a given value of x, y is normally distributed with a mean of beta 0 + beta 1 times x and a variance of sigma squared. We assume that the distribution of y has the same shape at every value of x: only the mean changes, and these theoretical means of y fall on the regression line. This is the assumed model we use for prediction. We should check whether these assumptions are reasonable, which can be done with appropriate plots of the observed residuals. If the assumptions hold, the observed residuals should behave like the errors: centered on 0, with constant spread and a roughly normal distribution.
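The assumed model can also be simulated. The sketch below draws many y values at two x values and compares their sample mean and variance. The parameter values for beta 0, beta 1 and sigma are hypothetical, chosen only for illustration.

```python
import random
import statistics

# Hypothetical parameter values, chosen only for illustration
beta0, beta1, sigma = 1.0, 2.0, 1.0
rng = random.Random(0)  # fixed seed so the run is repeatable

def simulate_y(x, n):
    """Draw n values of y = beta0 + beta1 * x + epsilon, epsilon ~ N(0, sigma^2)."""
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for _ in range(n)]

summary = {}
for x in (1.0, 5.0):
    ys = simulate_y(x, 20000)
    # Sample mean and sample variance of y at this x
    summary[x] = (statistics.fmean(ys), statistics.variance(ys))
    print(x, summary[x])
```

The sample means should land close to beta 0 + beta 1 times x (3 and 11 with these values), while the sample variances should both stay close to sigma squared.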
The residuals of a least-squares line with an intercept always sum to zero, so their mean is 0. We hope to see a random scatter of points and no pattern of any kind. In our case the variability (the blue line drawn by hand) is not exactly the same for all values of x, but overall the variance doesn’t look that bad.
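That the residuals sum to zero can be checked numerically. This sketch fits a least-squares line to four made-up points (not the data behind the plot) and adds up the residuals.

```python
# Four made-up data points, used only to illustrate the property
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 6]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares estimates for slope and intercept
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# Residual = observed y minus predicted y
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
residual_sum = sum(residuals)
print(residuals)
print(residual_sum)  # zero up to floating-point rounding
```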
This plot shows how closely the values follow a normal distribution. They seem fairly normally distributed, although you can see some deviation at the beginning and the end of the line. All in all, our model seems to be OK, but it is far from perfect.
We covered what linear regression is used for, how the relationship between the variables is quantified, how a line is fitted through the data points, which assumptions have to be fulfilled to apply linear regression, and how to check these assumptions.