Lab 9
Regression




 

To this point, we have looked at scatterplots and "imagined" a line running through the data points that characterizes the general linear pattern of the data. In the last lab, we added a number, the Pearson correlation, which summarizes how tightly clustered the points would be around that imaginary line. In today's lab we'll actually put the line onto the scatterplots. This process is called regression.

Let's start by talking about lines and graphs. Consider the following graph, a straight line that passes through these points:

at X = 0, Y = 1
at X = 1, Y = 1.5
at X = 2, Y = 2.0
at X = 3, Y = 2.5
at X = 4, Y = 3.0

So as X goes up by 1, Y goes up by 0.5. This is called the slope (b1). This is a constant.

The intercept (b0) is the value of Y when X = 0. In other words, this is the point at which the line intersects the Y-axis. This is also a constant.

We can describe the line in the following linear equation:

Y = intercept + slope*X = b0 + b1X

For our example: Y = 1 + .5*X

For our example, if X = 3, then

Y = 1 + .5*3 = 1 + 1.5 = 2.5.

If we look at the graph at X = 3, sure enough, Y = 2.5.

In other words, using the linear equation, we can determine the value of Y, if we know the values of X, b1 (slope), and b0 (intercept).
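
As a quick check outside of SPSS, the linear equation above can be written as a small Python sketch. The function name predict_y is hypothetical (it is not part of the lab materials); the default intercept and slope are the b0 = 1 and b1 = .5 from the example line.

```python
def predict_y(x, b0=1.0, b1=0.5):
    """Return Y on the line Y = b0 + b1*X (b0 = intercept, b1 = slope)."""
    return b0 + b1 * x


# Reproduce the five points listed for the example line.
for x in range(5):
    print(f"at X = {x}, Y = {predict_y(x)}")
```

Changing b0 moves the whole line up or down, while changing b1 makes it steeper or flatter.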

Why am I writing the slope and the intercept as the letter b with subscripts (b1 and b0) instead of Y = mx + b or Y = bx + a like you were taught in algebra? The reason is that regression is a complex topic. In this class we will consider only simple regression. However, with something called multiple regression, you can estimate a variable from many other variables. Each variable has its own "slope" and so statisticians have settled on the convention of using the letter b with subscripts for each variable (i.e., b1, b2, b3, and so forth with b0 as the intercept). Therefore, I decided to introduce you to this notation now rather than making you unlearn the more familiar notation later.

Okay, now let's return to our scatterplots. Let's start with the simple case of r = 1.0. In this situation it is easy to decide where our line goes, because all of the data points fit exactly on the line (remember that's what a "perfect" correlation refers to, a "perfect fit").

When we do a regression analysis, we are trying to find the line (and linear equation) that best fits the data points. For this example it is pretty easy: there is only one possible line that makes sense to fit to this set of data. To find the line, all we need to do is draw a straight line through all of the points. To figure out the equation for that line, we can just read it off the way we did in the above example (in fact, if you look carefully, you'll see that this is the same line as the one in the above example).

Now let's look at a case where the correlation is not perfect.
Here it isn't as easy. Clearly, no single straight line will fit every data point (that is, you can't draw a single line through all of the data points). In fact, it is not too hard to imagine several different possible lines fitting these data. What we want is the line (and linear equation) that fits the best.

    For the questions in this lab, you'll need the height.sav file that we used in the last lab.

      Re-create the scatterplots that you created in the last lab with weight on the X-axis and height on the Y-axis.
      Double click on the scatterplot to open the Chart Editor. This will allow us to make changes to the scatterplot SPSS gives us by default.
      Click the "Add Fit Line at Total" button as seen below:




      Blackboard 1) Which is the closest to the fitline generated by SPSS?

      One of the important questions in regression is where the fitline (AKA "regression line") crosses the Y-axis. The chart as it appears now is misleading because neither the X-axis nor the Y-axis starts at 0, where you are accustomed to seeing them start.

      While the Chart Editor is still open, click on the Y-axis so that it is highlighted. Now double-click it so that a properties box appears to the right. This can be tricky, so do it precisely. Click on the "Scale" tab if it is not already selected. Set the minimum box to 0.

      It will look something like this:




      Now repeat the same process for the X axis and set the minimum to 0. Also, check the box that says, "Display line at origin."

      This will rescale the scatterplot so that you can see where the fitline crosses the Y-axis.

      Blackboard 2) Approximately where does the fitline cross the Y-axis?

    Let's see what it means to be the line that best fits the data.

      Basically what we want to do is minimize the error. That is, the line that differs the least from all of the data points is the best fitting line.
        Remember what the line is: a formula (a linear equation) that predicts the value of Y given X, the slope, and the intercept. So what we want to do is pick the line that gives the best estimate of Y, that is, the line that makes the smallest error in estimating all of the Y values.
      So how do we do this (by hand, so we understand what goes into the computations)?

        We find the least-squares solution
          To get this we'll look at each point and compare the actual value of Y with the predicted value of Y (called Ŷ, pronounced "Y-hat").

          Note: You should notice that an important difference between correlation and regression is that with correlation it doesn't matter which variable is assigned as the independent (explanatory) variable X, and which is assigned as the dependent (response) variable Y. However, for regression it DOES matter. In regression we are predicting the outcome of Y based on X.

    distance = Y - Ŷ

    SSerror = sum of the squared residual errors = Σ(Y - Ŷ)²

    We get the Ŷ values from the line, and the Y values from the actual data points.

    We need to do this for all of the values of X and Y.
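
To make the idea of "smallest error" concrete, here is a minimal Python sketch that computes SSerror for two candidate lines. The data are the five (X, Y) pairs from the worked example in this lab. The two candidate lines and the function name ss_error are hypothetical, chosen only for illustration; neither line is claimed to be the best-fitting one.

```python
def ss_error(xs, ys, b0, b1):
    """Sum of squared distances (Y - Y-hat) for the line Y-hat = b0 + b1*X."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# The five data points from the worked example in this lab.
xs = [0, 10, 4, 8, 8]
ys = [1, 3, 1, 2, 3]

# Two hypothetical guesses at a fit line (illustrative values only).
guess_a = ss_error(xs, ys, b0=1.0, b1=0.10)
guess_b = ss_error(xs, ys, b0=0.5, b1=0.25)

print("SSerror for guess A:", round(guess_a, 4))
print("SSerror for guess B:", round(guess_b, 4))
print("Better guess:", "A" if guess_a < guess_b else "B")
```

The least-squares line is the one line for which this quantity is as small as it can possibly be.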

        The formula for the slope of the best fitting line is:

          b1 = SP / SSX

          where SP = Σ(X - X̄)(Y - Ȳ) is the sum of products and SSX = Σ(X - X̄)² is the sum of squares for X

          - or -

          b1 = r * (sY / sX)

            that is, the correlation coefficient times the ratio of the standard deviation of Y to the standard deviation of X.

          Both formulas give you the same answer (they are mathematically equivalent). You can choose whichever one best fits the information that you have (e.g., if you know the SP and SSX, use the top; if you know r and the standard deviations, use the bottom).

        The formula for the intercept of the best fitting line is:

          b0 = Ȳ - b1*X̄ (the mean of Y minus the slope times the mean of X)

      So let's revisit the example that we used for correlations.

              X      Y
              0      1
             10      3
              4      1
              8      2
              8      3
    sum      30     10
    mean    6.0    2.0

    Our first step was to draw the scatterplot.

    Based on this scatterplot we would expect an r that is positive and fairly strong (because the points cluster fairly strongly around an imaginary straight line). We then compute r and find that it is 0.875.

So now our next step is to compute our regression equation for this data.

slope = b1 = SP / SSX = 14/64 = .22

    intercept = b0 = Ȳ - b1*X̄ = 2.0 - (.22)(6.0) = .68

So the regression equation is:
 Ŷ = .22(X) + .68
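
The hand computation can be double-checked with a short Python sketch. This is not part of the SPSS procedure, and the variable names are my own. It computes the slope both ways (SP/SSX, and r times the ratio of the standard deviations) and then the intercept, using the five data points from the table.

```python
import math

xs = [0, 10, 4, 8, 8]
ys = [1, 3, 1, 2, 3]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Sum of products and sums of squares.
sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
ss_x = sum((x - mean_x) ** 2 for x in xs)
ss_y = sum((y - mean_y) ** 2 for y in ys)

# Slope, first formula: SP / SSX.
slope_from_sp = sp / ss_x

# Slope, second formula: r * (sY / sX), using sample standard deviations.
r = sp / math.sqrt(ss_x * ss_y)
sd_x = math.sqrt(ss_x / (n - 1))
sd_y = math.sqrt(ss_y / (n - 1))
slope_from_r = r * sd_y / sd_x

# Intercept: mean of Y minus slope times mean of X.
intercept = mean_y - slope_from_sp * mean_x

print("SP =", sp, " SSX =", ss_x, " r =", r)
print("slope (SP/SSX) =", slope_from_sp)
print("slope (r * sY/sX) =", slope_from_r)
print("intercept =", intercept)
```

Note that the code keeps the slope unrounded (0.21875), so its intercept (0.6875) differs slightly from the .68 obtained above by plugging in the rounded slope of .22.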

Okay, so now we know how regression works and (if we must) we can do it by hand. Now let's see how to do regression in SPSS.

Getting SPSS to compute the ordinary least squares regression equation

    Note: the regression analysis also gives us the power to do more than just get the equation for the line. Because of this, our output will have a lot of information in it. Be prepared to have to sift through it to get the information that we want. Later in the course we'll discuss some of the rest of the output of the regression analyses.

    • Under the "Analyze" menu select "Regression". 
    • Under the "Regression" submenu select "Linear". 

    • Enter your dependent variable and your independent variable into the appropriate fields.

      • Note: you can add more than one explanatory variable at a time. This is for that procedure I mentioned briefly before called "multiple regression". For now, just do regressions with one independent variable at a time.

    • Your output window will have a bunch of information in it. The information for the least-squares regression line is highlighted in yellow here (but won't be highlighted in your SPSS output). It corresponds to the "Unstandardized Coefficients" for the intercept (constant) and the slope (your variable name).

So for this relationship the linear equation is:

Ŷ = -12.878 + 1.193X

Blackboard 3) Use SPSS to compute the regression components (slope and intercept) for weight predicting height (i.e., weight is the independent variable and height is the dependent variable). What is the slope? (Round to 2 decimals)
Blackboard 4) What is the intercept of the regression line for weight predicting height? (Round to 2 decimals)

Note that the answer is not that far from your estimate from the scatterplot.

You can use regression equations to estimate what one variable is likely to be when the predictor variable is known. For the example of predicting the height of an adult from the average height of the parents, we can use this equation (the coefficients above, rounded to 2 decimals):

Ŷ = 1.19X - 12.88

If one's parents have an average height of 61 inches, you can plug in the X to get the Ŷ like this:

1.19*61 - 12.88 = 59.71

Thus, if your parents are 61 inches tall on average, you are predicted to be 59.71 inches tall. Of course, you know and I know that this equation will overestimate height for females and underestimate height for males. This is where that multiple regression thing comes in handy. It uses 2 or more variables (e.g., parents' height and sex) to arrive at more accurate estimates.
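
The plug-in step can also be sketched in Python. The function name predict_height is hypothetical; the coefficients are the rounded slope (1.19) and intercept (-12.88) from the equation above.

```python
def predict_height(parent_avg_height, b0=-12.88, b1=1.19):
    """Predicted adult height (inches) from the parents' average height."""
    return b0 + b1 * parent_avg_height


# The example from the text: parents average 61 inches.
print(round(predict_height(61), 2))
```

The same function can be reused for any other parental average, as long as it stays within the range of the data the equation was fit to.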

Blackboard 5) This question will be different for each person. It is about using the smoke15 variable to estimate height.

The standard error of the estimate is simply the standard deviation of the error scores. The formula is the square root of the residual sum of squares divided by N-2:
Standard error of the estimate = sqrt(SSerror / (N - 2))






The standard error of the estimate represents the "average" error. In other words, when you use regression to make estimates, you are likely to be off in your predictions. The standard error of the estimate tells you by how much you are likely to be off when you make predictions.
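
Here is a minimal Python sketch of that formula, applied to the small five-point example from earlier in this lab (not to the height.sav data). The variable names are my own; SPSS reports the same quantity in its output, so this is only meant to show what goes into the number.

```python
import math

xs = [0, 10, 4, 8, 8]
ys = [1, 3, 1, 2, 3]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares slope (SP / SSX) and intercept.
sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
ss_x = sum((x - mean_x) ** 2 for x in xs)
b1 = sp / ss_x
b0 = mean_y - b1 * mean_x

# Residual sum of squares: sum of (Y - Y-hat) squared.
ss_error = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Standard error of the estimate: sqrt(SSerror / (N - 2)).
std_error_estimate = math.sqrt(ss_error / (n - 2))
print("SSerror =", ss_error)
print("Standard error of the estimate =", round(std_error_estimate, 3))
```

For these five points the result is a little over half a unit of Y, which is the typical size of a prediction error for that tiny data set.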

Blackboard 6) Answer this question about the standard error of the estimate.

The term "percent of variance explained in Y by X" refers to how much the variance of the errors around the regression line is reduced compared to the variance in Y. One way to calculate variance explained is:

1 - SSerror/SSY

In the ANOVA section of the regression output, SSerror corresponds to the Residual Sum of Squares (273.142 in the picture above). SSY corresponds to the Total Sum of Squares (759.975 in the picture above). An easier way to find the variance explained by X is to find the value of R squared in the output. It is in the "Model Summary" section.
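
As a sketch of the arithmetic, the two sums of squares quoted above can be plugged into the formula in Python. The variable names are my own; the two numbers are the Residual and Total Sum of Squares from the example output.

```python
ss_error = 273.142  # Residual Sum of Squares from the ANOVA table
ss_y = 759.975      # Total Sum of Squares from the ANOVA table

# Proportion of variance explained: 1 - SSerror/SSY.
r_squared = 1 - ss_error / ss_y
percent_explained = 100 * r_squared

print(f"Proportion explained: {r_squared:.3f}")
print(f"Percent explained: {percent_explained:.1f}%")
```

The result, roughly 64%, should agree with the R squared value in the "Model Summary" section of the same output.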

Blackboard 7) Answer this question about the percent of variance explained in height by smoke15.

Another way to think about variance explained is that it corresponds to the correlation coefficient squared multiplied by 100.

% of variance explained = 100 * r * r

If you know the % of variance explained, you can calculate the correlation like this:

r = sqrt(% of variance explained/100)
That is, convert the % to a decimal by dividing by 100 and then take the square root.

This formula only works if you know that the correlation between the variables is positive. If you know that the correlation is negative, you'll have to multiply your answer by -1.
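
The conversion in both directions, including the sign rule, can be sketched in Python. The function names and the negative flag are hypothetical; the example value r = 0.875 is the correlation from the five-point example earlier in this lab.

```python
import math


def percent_explained_from_r(r):
    """Percent of variance explained = 100 * r * r."""
    return 100 * r * r


def r_from_percent_explained(percent, negative=False):
    """Recover r from the percent explained; the sign must be known separately."""
    r = math.sqrt(percent / 100)
    return -r if negative else r


percent = percent_explained_from_r(0.875)
print("Percent explained:", percent)
print("r if the relationship is positive:", r_from_percent_explained(percent))
print("r if the relationship is negative:", r_from_percent_explained(percent, negative=True))
```

Squaring throws away the sign, which is why the direction of the relationship has to come from somewhere else, such as the scatterplot or the sign of the slope.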

Blackboard 8 and 9) Answer questions about the relationship between correlations and variance explained.