To this point, we have looked at scatterplots and
"imagined" a line running through the datapoints that characterizes the
general linear pattern of the data. In the last lab, we added a number,
the Pearson correlation, which summarizes how tightly clustered the
points would be around that imaginary line. In today's lab we'll
actually put the line onto the scatterplots. This process is called Regression.
Let's start by talking about lines and graphs. Consider
the following graph.
at X = 0, Y = 1
at X = 1, Y = 1.5
at X = 2, Y = 2.0
at X = 3, Y = 2.5
at X = 4, Y = 3.0
So as X goes up by 1, Y goes up by 0.5. This rate of change is
called the slope (b1). It is a constant: it is the same everywhere
along the line.
The intercept (b0) is the value
of Y when X = 0. In other words, this is the point at which the line
intersects the Y-axis. This is also a constant.
We can describe the line in the following linear
equation:
Y = intercept + slope*X = b0 + b1X
For our example: Y = 1 + .5*X
For our example, if X = 3, then
Y = 1 + .5*3 = 1 + 1.5 = 2.5.
If we look at the graph, X = 3 and sure enough Y =
2.5.
In other words, using the linear equation, we can
determine the value of Y, if we know the values of X, b1
(slope), and b0 (intercept).
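As a quick check of the equation, here is a minimal Python sketch that plugs
each X from the example into Y = b0 + b1X with b0 = 1 and b1 = .5. The
function name predict_y is just an illustrative choice; it is not part of
the lab or of SPSS.

```python
def predict_y(x, b0, b1):
    """Return Y for a given X using the linear equation Y = b0 + b1*X."""
    return b0 + b1 * x

# Intercept and slope of the example line in this lab
b0, b1 = 1.0, 0.5

# Evaluate the line at the same X values listed for the graph
points = [(x, predict_y(x, b0, b1)) for x in range(5)]
for x, y in points:
    print(f"at X = {x}, Y = {y}")
```

The printed values should match the five points read off the graph above.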
Why am I writing the slope and the intercept as the
letter b with subscripts (b1 and b0) instead of Y
= mx + b or Y = bx + a like you were taught in algebra? The reason is
that regression is a complex topic. In this class we will consider only
simple regression. However, with something called multiple regression,
you can estimate a variable from many other variables. Each variable
has its own "slope" and so statisticians have settled on the convention
of using the letter b with subscripts for each variable (i.e., b1,
b2, b3, and so forth with b0 as the
intercept). Therefore, I decided to introduce you to this notation now
rather than making you unlearn the more familiar notation later.
Okay, now let's return to our scatterplots. Let's start
with the simple case of r = 1.0. In this situation it is easy to decide
where our line goes, because all of the data points fit exactly on the
line (remember that's what a "perfect" correlation refers to, a
"perfect fit").
When we do a regression analysis, what we are doing is trying to find
the line (and linear equation) that best fits the data points. For this
example it is pretty easy. There is only one possible line that makes
sense to fit to this set of data. To find the line all we need to do is
draw a straight line through all of the points and then to figure out
the equation for that line we can just look at it the way we did in the
above example (in fact, if you look carefully you'll see that this
is the same line as the one in the above example).
Now let's look at a case when the correlation is
not perfect.
Now it isn't as easy. Clearly no single straight
line will fit each data point (that is, you can't draw a single line
through all of the data points). In fact it is not too hard to imagine
several different possible lines fitting this data. What we want is
the line (and linear equation) that fits the best.
For the questions in this lab, you'll need the height.sav file that we used
in the last lab.
Re-create the scatterplots that you created in the last lab with weight
on the X-axis and height on the Y-axis.
Double click on the scatterplot to open the Chart Editor. This will
allow us to make changes to the scatterplot SPSS gives us by default.
Click the "Add Fit Line at Total" button as seen below:
Blackboard 1) Which is the closest to the fitline generated by SPSS?
One of the important
questions in regression is where the fitline (AKA "regression line")
crosses the Y-axis. The chart as it appears now is misleading because
neither the X-axis nor the Y-axis starts at 0, where you are accustomed
to seeing it start.
While the Chart Editor is still open, click on the Y-axis so that it is
highlighted. Now double-click it so that a properties box appears to
the right. This can be tricky so do it precisely. Click on the "Scale"
tab if it is not already selected. Set the minimum box to 0.
It will look
something like this:
Now repeat the same process for the X axis and set the minimum to 0.
Also, check the box that says, "Display line at origin."
This will rescale
the scatterplot so that you can see where the fitline crosses the
Y-axis.
Blackboard 2) Approximately where does the fitline cross the Y-axis?
Let's see what it means to be the line that best fits the data.
Basically what we want to do is minimize the error. That is,
the line that differs the least from all of the data points is
the best fitting line.
Remember what the line is: it is a formula (a linear equation) that
predicts the value of Y given X, the slope, and the intercept. So
what we want to do is
pick the line that gives the best estimate of Y. That is, the
line that makes the smallest total error in estimating all of the Y values.
So how do we do this (by hand, so we understand what goes into the
computations)?
We find the least-squares solution.
To get this we'll look at each point, and compare the actual value for
Y with the predicted value of Y (called Ŷ, pronounced "Y-hat").
Note: You should notice that an important
difference between correlation and regression is that with correlation
it doesn't matter which variable is assigned as the independent
(explanatory) variable X, and which is assigned as the dependent
(response) variable Y. However, for regression it DOES matter.
In regression we are predicting the outcome of Y based on X.
distance = Y - Ŷ
SSerror = residual squared error = Σ(Y - Ŷ)²
We get the Ŷ values from the line, and the Y
values from the actual data points.
We need to do this for all of the values of X
and Y.
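To make the idea concrete, here is a small Python sketch that computes
SSerror for a few candidate lines and picks the one with the smallest
value. The data and the three candidate lines are made up for illustration
(they are not from height.sav), and the function name ss_error is my own.

```python
def ss_error(xs, ys, b0, b1):
    """Sum of squared residuals, sum((Y - Y_hat)**2), for the line Y_hat = b0 + b1*X."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, invented for this illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

# Three candidate lines, each given as (intercept, slope)
candidates = {"A": (0.0, 1.0), "B": (1.5, 0.5), "C": (0.6, 0.8)}

# Compute SSerror for each candidate and keep the smallest
errors = {name: ss_error(xs, ys, b0, b1) for name, (b0, b1) in candidates.items()}
best = min(errors, key=errors.get)
print(errors)
print("Smallest SSerror:", best)
```

Line C is the least-squares line for these made-up points, so no other
straight line would give a smaller SSerror for this data.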
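The formula for the slope of the best fitting line is:
b1 = SP / SSX
- or -
b1 = r * (sY / sX)
The first is the sum of products of the X and Y deviations divided by the
sum of squared deviations of X. The second is the correlation coefficient
times the ratio of the standard deviation of Y to the standard deviation
of X.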
Both formulas give you the same answer (they are
mathematically equivalent). You can choose whichever one best
fits the information that you have (e.g., if you know the SP and SSX
use the top, if you know r and the standard deviations use the bottom).
The formula for the intercept of the best fitting line is:
b0 = Ȳ - b1*X̄
where Ȳ and X̄ are the means of Y and X.
So now our next step is to compute our regression equation for
this data.
slope = SP/SSX = 14/64 = .22
So the regression equation is:
Ŷ = .22(X) + .68
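If you would like to check a hand computation, the steps can be sketched
in Python. This sketch uses a small made-up data set (not the data behind
the .22 and .68 above, and not height.sav), and all variable names are my
own. It computes the slope both ways, from SP and SSX and from r and the
standard deviations, and then the intercept from the two means.

```python
import math

# Hypothetical data, invented for this illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
n = len(xs)

mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Sum of products and sums of squared deviations
sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
ssx = sum((x - mean_x) ** 2 for x in xs)
ssy = sum((y - mean_y) ** 2 for y in ys)

# Slope, first way: SP / SSX
slope_sp = sp / ssx

# Slope, second way: r times the ratio of the standard deviations
r = sp / math.sqrt(ssx * ssy)
sd_x = math.sqrt(ssx / (n - 1))
sd_y = math.sqrt(ssy / (n - 1))
slope_r = r * (sd_y / sd_x)

# Intercept: mean of Y minus slope times mean of X
intercept = mean_y - slope_sp * mean_x

print(f"slope (SP/SSX) = {slope_sp:.2f}")
print(f"slope (r * sY/sX) = {slope_r:.2f}")
print(f"intercept = {intercept:.2f}")
```

Both slope calculations should agree, which is what "mathematically
equivalent" means in practice.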
Okay, so now we know how regression works and (if we
must) we can do it by hand. Now let's see how to do regression in SPSS.
Getting SPSS to compute the ordinary least squares regression equation
Note: the regression analysis also gives us the
power to do more than just get the equation for the line. Because of
this, our output will have a lot of information in it. Be prepared to
have to sift through it to get the information that we want. Later in
the course we'll discuss some of the rest of the output of the
regression analyses.
- Under the "Analyze" menu select "Regression".
- Under the "Regression" submenu select
"Linear".

- Enter your dependent variable and your
independent variable into the appropriate fields.

Note: you can add more than one explanatory
variable at a time. This is for that procedure I mentioned briefly
before called "multiple regression". For now, just do regressions with
one independent variable at a
time.
- Your output window will have a lot of information
in it. The values for the least-squares regression line are
highlighted in yellow here (but won't be in your SPSS output). They
correspond to the "Unstandardized Coefficients" for the intercept
(constant) and the slope (your variable name).
So for this relationship the linear equation is:
Ŷ
= -12.878 + 1.193X
Blackboard 3) Use SPSS to compute the regression
components (slope and intercept) for weight predicting height (i.e.,
weight is the independent variable and height is the dependent
variable). What is the slope? (Round to 2 decimals)
Blackboard 4) What is the intercept of the regression line for weight
predicting height? (Round to 2 decimals)
Note that the answer is not
that far from your estimate from the scatterplot.
You can use regression
equations to estimate what one variable is likely to be when the
predictor variable is known. From the example of predicting the height
of an adult from the average height of the parents we can use this
equation:
Y = 1.19X - 12.88
If one's parents have an
average height of 61 inches, you can plug in the X to get the Y like
this
1.19*61 - 12.88 = 59.71
Thus, if your parents are
61 inches tall on average, you are predicted to be 59.71 inches tall.
Of course, you know and I know that this equation will overestimate
height for females and underestimate height for males. This is where that
multiple regression thing comes in handy. It uses 2 or more variables
(e.g., parents' height and sex) to arrive at more accurate estimates.
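Here is the same prediction as a short Python sketch. It uses the rounded
coefficients from the equation above and, for comparison, the unrounded
coefficients from the SPSS output shown earlier (-12.878 and 1.193). The
function name predict_height is hypothetical.

```python
def predict_height(parent_avg, intercept, slope):
    """Predicted adult height (inches) from the parents' average height."""
    return intercept + slope * parent_avg

parent_avg = 61  # average height of the parents, in inches

# Coefficients rounded to 2 decimals, as in the worked example
rounded = predict_height(parent_avg, intercept=-12.88, slope=1.19)

# Unrounded coefficients as they appear in the SPSS output
unrounded = predict_height(parent_avg, intercept=-12.878, slope=1.193)

print(f"rounded coefficients:   {rounded:.2f}")
print(f"unrounded coefficients: {unrounded:.2f}")
```

The two answers differ slightly because rounding the slope changes the
prediction a little for every inch of X, so it is worth noting which
version of the coefficients you used.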
Blackboard 5) This question will be
different for each person. It is about using the smoke15 variable to
estimate height.
The standard error of the estimate
is simply the standard deviation of the error scores. The formula is
the square root of the residual sum of squares divided by N-2:
Standard error of the estimate = sqrt(SSerror / (N - 2))
The standard error of the estimate represents the "average" error. In
other words, when you use regression to make estimates, you are likely
to be off in your predictions. The standard error of the estimate tells
you by how much you are likely to be off when you make predictions.
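The calculation can be sketched in Python as follows. The data are made up
for illustration (not from height.sav), the line Ŷ = .6 + .8X is the
least-squares line for those made-up points, and the function name is my
own.

```python
import math

def standard_error_of_estimate(ys, y_hats):
    """Square root of SSerror / (N - 2), where SSerror = sum((Y - Y_hat)**2)."""
    n = len(ys)
    ss_err = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))
    return math.sqrt(ss_err / (n - 2))

# Hypothetical data, invented for this illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

# Least-squares intercept and slope for these made-up points
b0, b1 = 0.6, 0.8
y_hats = [b0 + b1 * x for x in xs]

se = standard_error_of_estimate(ys, y_hats)
print(f"standard error of the estimate = {se:.3f}")
```

Here SSerror is 3.6 and N - 2 is 3, so the result is the square root of
1.2, a little more than 1 unit of Y.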
Blackboard 6) Answer this question
about the
standard error of the estimate.
The term "percent of variance explained in Y by X" refers to how much
the regression equation reduces the variance of the errors compared to
the variance in Y. One way to calculate the proportion of variance
explained is:
1 - SSerror/SSY
Multiply the result by 100 to express it as a percent.
In the ANOVA section of the regression output, SSerror
corresponds to the Residual Sum of Squares (273.142 in the picture
above). SSY
corresponds to the Total Sum of Squares (759.975 in the picture above).
An easier way to
find the variance explained by X is to find the value
of R squared in the output. It is in the "Model Summary" section.
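As a check on the arithmetic, here is a short Python sketch that applies
1 - SSerror/SSY to the two sums of squares quoted above. The variable
names are my own.

```python
# Sums of squares quoted from the ANOVA section of the example output
ss_error = 273.142   # Residual Sum of Squares
ss_total = 759.975   # Total Sum of Squares (SSY)

# Proportion of variance explained, then expressed as a percent
proportion_explained = 1 - ss_error / ss_total
percent_explained = 100 * proportion_explained

print(f"proportion explained = {proportion_explained:.4f}")
print(f"percent explained    = {percent_explained:.2f}%")
```

The proportion printed here should agree with the R squared value in the
"Model Summary" section for the same analysis.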
Blackboard 7) Answer this question
about the
percent of variance explained in height by smoke15.
Another way to think about variance explained is that it corresponds to the
correlation coefficient squared multiplied by 100.
% of variance explained = 100 * r * r
If you know the % of variance explained, you can calculate the correlation
like this:
r = sqrt(% of variance explained/100)
That is, convert the % to a decimal by dividing by 100 and then take
the square root.
This formula only works if you know that the correlation between the
variables is positive. If you know that the correlation is negative,
you'll have to multiply your answer by -1.
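Both directions of this conversion can be sketched in Python. The function
names and the 64% example value are made up for illustration; the negative
flag is my way of passing in what you already know about the direction of
the relationship, since the percent alone cannot tell you the sign.

```python
import math

def percent_from_r(r):
    """Percent of variance explained = 100 * r * r."""
    return 100 * r * r

def r_from_percent(percent_explained, negative=False):
    """Recover r from the percent of variance explained.

    The square root is always positive, so the caller must say
    whether the correlation is known to be negative.
    """
    r = math.sqrt(percent_explained / 100)
    return -r if negative else r

# Example with a made-up value of 64% variance explained
print(r_from_percent(64))                  # positive relationship
print(r_from_percent(64, negative=True))   # negative relationship
print(percent_from_r(-0.8))                # the sign disappears when squaring
```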
Blackboard 8 and 9) Answer
questions
about the relationship between correlations and variance explained.