Now I shall move to a more generalized definition of degrees of freedom. The degrees of freedom of a statistic is the number of observations minus the number of necessary auxiliary values, where those auxiliary values are themselves computed from the observations. This is a somewhat loose statement, but don't panic; the rule works for 95% of situations, and that isn't bad statistically, is it? In the last example the variables are the observations and the auxiliary value is the mean (note it is based on the four observed values), and therefore df = 4 - 1 = 3. More specifically, the estimated standard error of the mean, and therefore the t-statistic here, would have df = 3 and would test a hypothesis about a population mean.
Finally, one more example using this better rule and I shall close up shop for the month. When related pairs are present and your concern is the correlation coefficient r, what is the df for that situation? If I give you the correlated pairs (X1, Y1), (X2, Y2), (X3, Y3), (X4, Y4) and again allow you to assign the correlated values (a wee bit tricky, huh?), the number of observations is 4, not 8, because a related pair counts as one observation (now don't be rigid and not allow this). Also, in computing the sample correlation coefficient r there are two auxiliary values: the slope of the best-fitting straight line to the scatter plot of the data and the Y-intercept. In other words, in a situation such as a t-test about the correlation coefficient r, the degrees of freedom here is df = 4 - 2 = 2, or in general df = N - 2, where N is the number of correlated pairs and 2 is the number of auxiliary values. With this definition you must expand your notion of an observation and be cautious about new auxiliary values. Next month I will use this new rule to explain the df for the standard errors of some test statistics.
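For readers who like to tinker, the counting rule can be sketched in a few lines of Python. The helper function here is hypothetical (my own invention for illustration), but it captures the rule exactly as stated above:

```python
# A hypothetical helper illustrating the counting rule:
# df = (number of observations) - (number of necessary auxiliary values).
def degrees_of_freedom(n_observations, n_auxiliary):
    return n_observations - n_auxiliary

# One-sample case: four scores, one auxiliary value (the mean).
df_mean = degrees_of_freedom(4, 1)      # 3

# Correlation case: four (X, Y) pairs count as four observations,
# and the slope and Y-intercept are two auxiliary values.
df_corr = degrees_of_freedom(4, 2)      # 2
```

The subtraction is trivial, of course; the whole trick is counting the observations and the auxiliary values correctly, as the two cases show.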
I shall now turn to the most basic t-test of them all displayed to the right, the t for testing a hypothesis about a single population mean. Recall that basically degrees of freedom is the number of variables that are allowed to vary freely without restriction. In hypothesis testing we usually work with a random sample (or samples) of scores of some sort. Here think of each score in the sample as a variable that is capable of taking on any value. Thus, each score becomes an observation and the total number of observations is the sample size N. Now the only necessary auxiliary value in this case is the sample mean. Hence, invoking the principle from last month that the df-value of a statistic is the number of observations minus the number of necessary auxiliary values, the df-value of the estimated standard error in the denominator of this ratio is N-1, which becomes the df-value for this basic test. This ratio, for example, might be used to test the null hypothesis that a population mean of IQ scores is 100, which would be substituted on the right in the numerator. Of course, the sample mean and standard deviation s would be calculated from the data and plugged in also.
Several interesting observations are in order. Last month we determined that the sample standard deviation s had df = N-1, which is the same df as the estimated standard error of the mean in this test. Also, if you suddenly go brain dead and forget the df-value for this test, it is staring you right in the face in the denominator of the formula. This is not by chance but occurs quite frequently with t-tests. Pretty nifty, huh?
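If you want to see this basic t in action, here is a small Python sketch. The function name and the five IQ scores are made up for illustration; the standard error is computed in a form that works no matter which N-versus-N-1 definition of s you favor:

```python
import math

def one_sample_t(scores, mu0):
    """One-sample t with df = N - 1 (a sketch; the scores below are made up)."""
    n = len(scores)
    mean = sum(scores) / n
    ss = sum((x - mean) ** 2 for x in scores)   # sum of squared deviations
    # Estimated standard error of the mean; note the N - 1 in the
    # denominator -- the df-value staring you in the face.
    se = math.sqrt(ss / (n * (n - 1)))
    return (mean - mu0) / se, n - 1

# Test H0: mu = 100 with five made-up IQ scores.
t, df = one_sample_t([96, 102, 108, 110, 99], 100)   # df = 4
```

The sample mean is 103, so the t comes out a bit above 1 with df = 4, nowhere near significance.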
OK, since things are going so smoothly, I next want to discuss the most widely used t-test in the literature...the so-called independent samples t. It is used to test the hypothesis that there is no difference in the means of two distinct populations (i.e., the null hypothesis 0 is plugged into the right side of the numerator). The formula to the right admittedly looks a little scary, but again it is nothing more than an iteration of the basic t template with the difference in the sample means serving as the test statistic. Here the observations are the scores in both samples and the auxiliary values are the two separate sample means. Thus, by our rule the df-value for the estimated standard error and the test is N1+N2-2. Now that is pretty slick! The ingredients you need to calculate this t are the two sample means, the two sample variances, and of course the two sample sizes. Again popping out like a neon light from the formula is the df-value from the denominator to jog your memory. This test finds many applications. When you have two separate random samples of scores, as in Experimental and Control groups or two different treatment groups, and you desire to test the significance of the difference in the two means, this t becomes the star of your stat world.
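A minimal sketch of the independent samples t in Python, using the standard pooled-variance form. The function name and the two little samples of Experimental and Control scores are made up:

```python
import math

def independent_t(sample1, sample2):
    """Independent-samples t with a pooled variance; df = N1 + N2 - 2 (a sketch)."""
    n1, n2 = len(sample1), len(sample2)
    m1, m2 = sum(sample1) / n1, sum(sample2) / n2
    ss1 = sum((x - m1) ** 2 for x in sample1)
    ss2 = sum((x - m2) ** 2 for x in sample2)
    df = n1 + n2 - 2
    pooled_var = (ss1 + ss2) / df                   # pooled variance estimate
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))  # estimated SE of the difference
    return (m1 - m2) / se, df

# Made-up Experimental and Control scores.
t_ind, df_ind = independent_t([12, 15, 11, 14], [9, 10, 8, 13])   # df = 6
```

Again the df of N1+N2-2 pops right out of the denominator of the pooled variance.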
Another relatively important t-test is presented that tests the hypothesis that there is no difference in the means of two correlated populations. To conduct this test, we must use the framework of the correlated pairs of (X1, X2) scores which was discussed last month. Fortunately, in this situation you are allowed to compute a difference score (D) for each related pair in the sample and subsequently work with the sample D's from then on out. In essence you have reverted back to the simple t-test with D's taking the place of X's. Thank God for little favors. Without this simple move, you must use an alternate method which requires that you compute the correlation coefficient and treat the X1's and X2's separately. Believe me, unless you do this on a computer it is a statistician's nightmare and requires three times the work. Returning to the main problem of getting the df-value here, an observation becomes a D-value, of which we have N, and we have one auxiliary value, which is the mean D. Thus the df for the estimated standard error is N-1, which becomes the df-value for the test. Beware of something with the calculation of t. You are working with a sample of D's, so each difference must be computed in the same order, and you will probably end up with positive and negative D's which must be accounted for. The sample mean D and the sample standard deviation of D, along with 0 for the hypothesized value of the population mean D, are substituted in the formula and the value of t rolls out. This test is employed when you have a pre-test and post-test situation for a number of subjects or when you have subjects that are matched on another variable prior to administering two treatments. A common mistake with this test is to treat the X1's and the X2's as independent samples and use N+N-2 or 2N-2 as the df-value (too large) and employ the independent samples t-test above. This would be a positively biased test and result in too many Type I errors.
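The difference-score trick is easy to sketch in code. The function name and the four pre-test/post-test pairs below are made up; note how the subtraction order is kept consistent for every pair:

```python
import math

def paired_t(pre, post):
    """Correlated-samples t via difference scores; df = N - 1 (a sketch)."""
    # Subtract in the same order for every pair so the signs come out right.
    d = [b - a for a, b in zip(pre, post)]
    n = len(d)
    mean_d = sum(d) / n
    ss_d = sum((x - mean_d) ** 2 for x in d)
    se = math.sqrt(ss_d / (n * (n - 1)))   # estimated SE of the mean D
    return mean_d / se, n - 1              # hypothesized population mean D is 0

# Made-up pre-test and post-test scores for four subjects.
t_pair, df_pair = paired_t([10, 12, 9, 14], [13, 14, 10, 17])   # df = 3
```

With the D's in hand, everything is just the simple one-sample t all over again, exactly as described above.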
One Last Critical Note: The above test involves correlated or related pairs and then obtains D-scores and the mean D. This t-test has df = N-1 and tests a hypothesis about a population mean difference. Shown below is one last t-test (for now anyway), the t for testing the hypothesis that a population correlation coefficient is zero. It also involves correlated or related pairs and the sample correlation r, but employs a t-test with df = N-2, as we explained in an earlier Sticky Wicket. Many folks get the N-1 of the former and the N-2 of the latter confused since they both involve related pairs of scores, but remember there are two distinctly different hypothesis tests being performed here. Now you know the rest of the story!
Well, that concludes my ramblings for January. I hope you are realizing that statistics has many recurring themes. Certainly the principle for counting degrees of freedom is one of them. You all should now be experts in counting degrees of freedom, at least when you perform William Sealy Gosset's celebrated t-test.
Well, I understand where you are coming from in the moderate to large sample situation, but these two quantities have caused students of statistics more problems and confusion than a barrel of monkeys, particularly when the students have used several textbooks in a course or in different statistics courses. The crux of the issue, which an author generally makes no mention of, is that the sample variance and standard deviation can be defined two different ways. This in turn makes subsequent formulas such as estimated standard errors (or error variances) look "seemingly" different depending on the definition, when in reality the formulas are equivalent. The reason I feel that this issue requires discussion is that, to my knowledge, I have not seen a good explanation of this problem in any statistics textbook, and it will save you questioning whether there are typos on many pages of the book. Let us then examine each definition, see where it takes us, and talk about the positives and negatives of each choice. Here you are going to see an issue that many statisticians are split on. If I were to guess, I would say that the statistical community is about 50/50 on this one!
Now look at the two methods that are labeled (A) and (B) to the right. One thing both methods have in common is the sum of the squared deviations of the scores about the mean (∑x²). Three Cheers! In other words, statisticians pretty much agree that in most situations, in order to measure how variable a set of scores is, you first must take into account each and every score in the sample. That is, you find out how far each score is above or below the mean of the sample (a deviation score). Then you square each of these deviation scores and sum the squared deviations. This is the direct or "brute force" method of computing this quantity and it involves far too many messy decimals. It is far easier to make this computation with only the raw scores and not fuss with the mean. You get ∑X and ∑X² and employ STEP ONE of the World Famous Three Step Method (i.e., ∑x² = ∑X² - (∑X)²/N). See Step One WFTSM for an example of this calculation.
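If you want to convince yourself that STEP ONE really does agree with the brute-force route, here is a quick Python check. The function names are mine and the ten scores are just sample data:

```python
def brute_force(scores):
    """Direct method: deviations about the mean, squared and summed."""
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** 2 for x in scores)

def step_one(scores):
    """STEP ONE shortcut: sum of X^2 minus (sum of X)^2 / N, raw scores only."""
    n = len(scores)
    return sum(x * x for x in scores) - sum(scores) ** 2 / n

data = [10, 8, 5, 12, 4, 3, 14, 12, 6, 12]   # made-up scores
```

Running both on the same data gives the identical ∑x², but notice the shortcut never touches the mean or a single deviation score.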
Now the two methods part company. In (A) we divide the ∑x² by the sample size N and this produces the sample variance s². Since this index is in squared units, if we want an index in the original score units we extract the square root and have the sample standard deviation s. Division by N in this process makes sense logically because then we are able to state that the sample variance is the average squared deviation of the scores in the sample about the mean. This just shouts that it is measuring variation and it also just feels like a meaningful way of getting at the spread of a set of scores. Also it is valid when you have an N of 1, since the variance and standard deviation would be 0, which upon reflection is exactly what it should be.
Turning to method (B), we divide ∑x² by N-1 to get s² and then take the square root if desired to obtain s. However, the N-1 just seems nonintuitive. You cannot now neatly interpret the sample variance as an average, and the two formulas seem to lose their logical appeal. In addition, if N is 1 then the variance and standard deviation are both undefined because you are dividing by 0. Why then would anyone employ (B) to define the sample variance and standard deviation? I am going to whisper this, but there is one slight advantage of (B). The reality of the matter is that with (B) you really have calculated an unbiased estimate of the population variance, and very close to the same for the population standard deviation. So some authors feel that this method bypasses the sample index and moves directly to the population estimate. Thus, when authors label s² = ∑x²/(N-1) and the subsequent square root as the sample variance and sample standard deviation, they are really somewhat disingenuous in doing so.
I will illustrate the confusion that the two definitions can create when you are reading different books. If the author uses (A), the estimated error variance of the mean is given by s²/(N-1), whereas if the author prefers (B) the same estimated error variance of the mean is s²/N...two seemingly different results! But wait, two different definitions have been used for s². REALLY THE TWO RESULTS ARE IDENTICAL! To show this, using the former result and substituting (A) for s², we have ∑x²/[N(N-1)]. Now using the latter result and substituting (B) for s², we have ∑x²/[(N-1)N]...precisely identical results. Sooo...(as Steve Jobs would utter) what does all this mean? THE FIRST THING A PERSON SHOULD CHECK UPON OPENING A STATISTICS TEXTBOOK IS WHAT STANCE THE AUTHOR TAKES ON THE N VS N-1 ISSUE IN DEFINING THE SAMPLE VARIANCE AND STANDARD DEVIATION. My opinion favors division by N, but about half of the textbooks use division by N-1, so be prepared to make adjustments in your thinking. Statisticians end up at the same place on this one but sure create some illusions along the way. Thanks for reading my blurb and see you next month.
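P.S. for the skeptics: the identity is easy to verify numerically. Here is a tiny Python check with a made-up sample of four scores; both definitional routes land on the very same estimated error variance:

```python
scores = [4, 7, 9, 12]                       # a made-up sample
n = len(scores)
mean = sum(scores) / n
ss = sum((x - mean) ** 2 for x in scores)    # sum of squared deviations

# Definition (A): divide by N; its estimated error variance of the mean uses N - 1.
var_a = ss / n
err_var_a = var_a / (n - 1)

# Definition (B): divide by N - 1; its estimated error variance of the mean uses N.
var_b = ss / (n - 1)
err_var_b = var_b / n

# Both routes reduce to ss / (N * (N - 1)) -- identical results.
```

Whichever book you pick up, the arithmetic agrees; only the labels differ.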
We will present a simple example of calculating upper and lower limits of a 95% confidence interval for a population mean μ. The figure at the right displays a standard normal curve of z-scores with two examples of useful percentiles that would be needed to obtain a 95% confidence interval. The first is called z.025 = -1.96 and by definition is the point on the z-scale such that 2.5% (.025) of the area falls below it (remember the total area under this curve is 1, so areas correspond to probabilities). Now at the upper end we have z.975 = +1.96, or the point on the z-scale such that 97.5% (.975) of the area falls below it (the upper blue area is therefore .025). The -1.96 and +1.96 come from the standard normal curve table and were perhaps memorized by some of you. Also, the middle white area (called Δ or the confidence coefficient) then becomes 95% or .95. Note that in building a confidence interval, Δ is selected first and the tail areas are always equal. Some other commonly used percentiles that may be dear to your heart from the tables are z.005 = -2.58 and z.995 = 2.58 with a middle area of 99% or Δ = .99, and z.05 = -1.64 and z.95 = 1.64 with a middle area of 90% or Δ = .90. Great memories, huh? Now returning to the pictured example: if a random z is drawn from this distribution, the probability that the z will fall between -1.96 and +1.96 is .95, or mathematically, P(-1.96 ≤ z ≤ +1.96) = .95.
Next, moving to another 3-Step Procedure displayed to the right (notice never 2, never 4, always 3 steps for nice psychological closure), draw a random sample of size N from a population with known σ. Then convert the sample mean to a z in the previous probability statement and get statement (1) for a result. Then, solving this three-way inequality with some simple algebra and getting μ smack dab in the middle by itself and everything else on the ends, we arrive exactly where we want to be with statement (2). These end expressions are indeed the formulas for the lower and upper limits of a 95% confidence interval for μ. They are pulled out and stated for emphasis in statements (3). To cement these formulas in our minds, let's do a simple example. Suppose we have a population of IQ scores with an unknown μ and σ = 16, and we want to generate a 95% confidence interval for the population mean μ. If a random sample of N = 64 is drawn and the sample mean is computed to be 98.7, we substitute into statements (3):
LL = 98.7 - 1.96(16/√64) = 98.7 - 1.96(2) = 98.7 - 3.92 = 94.78
UL = 98.7 + 1.96(16/√64) = 98.7 + 1.96(2) = 98.7 + 3.92 = 102.62
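For the keyboard-inclined, the same substitution can be done in a few lines of Python. The function name is my own; the numbers are exactly the IQ example above:

```python
import math

def confidence_limits(sample_mean, sigma, n, z=1.96):
    """Limits of a 95% CI for mu with known sigma (statements (3), sketched)."""
    se = sigma / math.sqrt(n)          # standard error of the mean
    return sample_mean - z * se, sample_mean + z * se

# The IQ example: sample mean 98.7, sigma = 16, N = 64.
ll, ul = confidence_limits(98.7, 16, 64)   # 94.78 and 102.62
```

Swapping in z = 2.58 or z = 1.64 would give the 99% or 90% limits with no other change.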
Now the fun begins, folks, when we try to interpret these results. But you say, "This is a snap. We simply say the probability that the population mean μ is between 94.78 and 102.62 is .95." But wait, I hate to inform you that the population μ is a fixed parameter: either it is between 94.78 and 102.62 ahead of time, in which case the probability is one, or it is not between 94.78 and 102.62 ahead of time, in which case the probability is zero. Keep in mind probabilities refer to random variables, and the mean μ is a fixed constant even though we don't know what it is. In other words, we cannot associate a probability with any single pair of limits. This seems like a minor problem, but to many experts it is a real deterrent to using confidence intervals. Now we could replicate the experiment and obtain several sets of limits. Would this add any information? Certainly it would, but each pair of limits would be subject to the same criticism. But if I did collect an infinity of limits from N's of 64, 95% of the limits would contain the true value of μ. This is a true statement, but many would deem this fact essentially useless.
The big advantage of a hypothesis test where an H0 is tested against a two-tailed alternative is that you do end up with an observed test statistic that has a probability associated with it when you reject or retain the null hypothesis. This method appears to appeal to many researchers even though we all know one hypothesis test does not prove anything. It is my speculation that the language itself of hypothesis testing has a certain degree of strength and finality associated with it. Expressions such as "Reject H0: μ1 - μ2 = 0 at the .05 level of significance and accept the alternative H1: μ1 > μ2" have a ring of authority linked to them. Recall also, thanks to Neyman and Pearson, we have our dear old friends Type I Error, Type II Error, and the Power of the test. It is indeed sad that the confidence interval approach has no such counterparts. In addition, the terminology of reject or retain H0 seems to mesh with complex ANOVAs where multiple comparisons are performed following a significant overall test. For these reasons, and perhaps others that I have overlooked, hypothesis testing currently is the KING of the HILL with statisticians.
I would like to give you one advantage of the lonely confidence interval before I close shop. Assume the limits of the previous example where Δ = .95. If another reader reads these results and desires to hypothesis test instead, the results can be predicted very easily. Remember, confidence intervals by nature are two-tailed and must be compared with a two-tailed hypothesis test. If the reader wants to test H0: μ = 100 at the .05 level of significance against a two-tailed alternative, retention of H0: μ = 100 would be predicted because 100 is contained between the limits of 94.78 and 102.62. If the reader desires to test H0: μ = 104 against a two-tailed alternative, rejection of H0: μ = 104 would be predicted and acceptance of H1: μ < 104 would be supported, since 104 is above both limits of 94.78 and 102.62. This may be continued on and on. Thus, the reader may very quickly and easily test any null hypothesis that his heart desires with the single set of data and limits given. Mathematicians have always thought this was pretty neat. However, it has not caught on in other disciplines and this interpretation has not helped the cause for confidence intervals.
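The prediction rule is so mechanical that it fits in one line of Python. The function name is hypothetical; the limits are the ones from the IQ example:

```python
def predict_two_tailed_test(mu0, lower, upper):
    """Predict a two-tailed .05 test outcome from 95% limits (a sketch)."""
    return "retain H0" if lower <= mu0 <= upper else "reject H0"

# Using the limits from the IQ example.
result_100 = predict_two_tailed_test(100, 94.78, 102.62)   # retain H0
result_104 = predict_two_tailed_test(104, 94.78, 102.62)   # reject H0
```

One pair of limits, and you can screen as many null values as your heart desires.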
Thus, we conclude our cases for both methodologies of inference. I must admit I also favor hypothesis testing but who knows where we will be in ten years. Maybe we will turn to Tukey's Exploratory Data Analysis and refine sampling procedures to such a point where we do not even have to use inferential statistics. Now that would be a monumental advance. Meanwhile, thanks again for reading this presentation and HAPPY INFERRING!
Now let us look at this small glob of critical topics and skills that should be the focus of the course. I will present these in a sequential fashion but there is some flexibility in how they are ordered:
(1) Collecting and Organizing Data.
(2) Picturing Distributions of Scores through Polygons, Histograms, Stem and Leaf Designs, and Box-and-Whisker Plots.
(3) Describing the Central Tendency of Distributions (Mean, Median, and Mode) and Examining Skewness and Kurtosis.
(4) Variability - What Makes the Whole Field of Statistics Tick. The Most Important Skill of All--- Applying WFTSM, Which is The World Famous Three Step Method Used to Calculate the Standard Deviation. (The Golden Key is Step 1, which is at the top of the heap as far as important formulas in Statistics go)
(5) Interpreting a Score's Location in a Distribution - Percentiles and Standard Scores (Primarily z-Scores)
(6) The Normal Curve and Reading Out Probabilities from Under the Curve.
(7) Simple Hypothesis Testing with the z-Test using LFFSM, which is the Locally Famous Five Step Method, another Critical Skill Almost as Important as WFTSM. It is imperative that you have knowledge of three important aspects of the Sampling Distribution of the test statistic: Form, Mean, and Standard Error. If these three have not been at least estimated by mathematical statisticians for the statistic, all bets are off and a hypothesis test cannot be employed with this particular statistic. Fortunately, the common sampling distributions discussed here have been thoroughly worked out and described.
(8) The t-Test and Reading the Table. Coverage of the Related and Independent Samples Tests.
(9) Correlation and the Importance of Step1's Cousin (Sum of products of the deviation scores) in the Calculation of the Correlation Coefficient.
(10) Simple Regression Analysis with One Predictor Variable.
Well, there you have my Ten Super Topics that give a student the solid underpinnings of statistical thought and allow him to easily move into more advanced areas. But wait, you say, there are so many topics being left out, such as One and Two Way Analysis of Variance, Confidence Intervals, the Chi-square Statistic, the Power of the Test, Non-parametric Statistics, Follow-up Tests in ANOVA, and on and on. No doubt these are important, but they are not core in the sense of lower level themes. If you made time for some or all of these more advanced topics, the course would evolve into a hodgepodge of techniques with only the surface being scratched on each one. Precisely what you don't want at the basic level. You want depth in the above TEN topics. After all, there are entire courses devoted to Analysis of Variance and Covariance called Experimental Design and also semesters directed at Nonparametric techniques. There is a time and a place for these courses, but don't muddle the beginning student's mind with the whole ball of wax in one semester. Allow the student to have some fun and ensure that he walks away with a good impression of the statistics field. Thanks for your attention.
Consider the score format that has been visited before and that we have termed related or correlated pairs. That is, we are given (X1, Y1), (X2, Y2),...(Xi, Yi),...(XN, YN). Now your dear little $15 calculator can still easily handle the task of entering the X member of the pair with one key and the Y member of the pair with another key until all pairs are entered. Then we can retrieve the following descriptive indices by pushing 4 different keys: X̄, Ȳ, sX, and sY. Now that is a pretty impressive array of indices. But recall that each of these pertains to either the separate X scores or the separate Y scores. We have no information on how high (or low or intermediate) the X score is relative to its mean compared with how high (or low or intermediate) the paired Y is relative to its mean. Putting this in very crude language, do the pairs of scores tend to be high together, low together, and intermediate together, or a completely different pattern such as the pairs being high and low together or low and high together? I hope you can see that the 4 basic indices do not touch on this type of "togetherness or covariability" relationship. Let us try out a calculation that may get at what we want...the sum of the products of the X and Y deviations about their respective means, or in formula form:
This calculation can be either a positive number or a negative number, unlike ∑x² and ∑y², which are ALWAYS POSITIVE! So this has great promise for doing what we want it to. However, this is usually referred to as a "thinking" formula because it allows you to see exactly how to calculate it directly, but a direct calculation often is a very messy creature. Here we generally get decimals for both means; then we must subtract a decimal from each raw score for both the X's and Y's, resulting in signed decimals; next we must find the products of these decimals, again paying close attention to the signs; and finally, sweating profusely, we add up the whole batch of signed decimal products to arrive at the final ∑xy. Whew!!!
Fortunately, we are blessed with a neat computational formula, displayed to the right, where the ingredients are stored in memories in the calculator as you enter the pairs of scores (the proof will be omitted). Some calculators will (some won't) allow you to push still another key and ∑xy will appear on the display. If not, you can still pull out from memories the sum of raw score products and the sums of the raw scores (i.e., ∑XY, ∑X, and ∑Y) and finish the simple calculation on the right. Remember that in this formula most raw scores will be whole numbers, so this formula will be comparatively easy if done separately by hand on the calculator. Oh, I must mention we have finally arrived at what I call "STEP ONE'S COUSIN" because the procedure is so "analogously similar" (neat expression, huh?) to "STEP ONE". In other words, the right hand side of ∑x² involves the sum of raw squares and the square of the raw sum, whereas here the ∑xy on the right involves the sum of the raw products and the product of the raw sums. Hope you can see how similar they are! If STEP ONE is gold studded then STEP ONE'S COUSIN must rate silver studded!!!
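A quick Python check that the "thinking" formula and STEP ONE'S COUSIN agree. The function names are mine, and the data are the pretest-posttest pairs from the worked example later in this blurb:

```python
xs = [10, 8, 5, 12, 4, 3, 14, 12, 6, 12]   # the X's from the worked example
ys = [8, 6, 4, 12, 5, 5, 9, 8, 8, 10]      # the Y's from the worked example

def thinking_formula(xs, ys):
    """Direct route: sum of products of deviations about the respective means."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys))

def step_ones_cousin(xs, ys):
    """Computational shortcut: sum of XY minus (sum of X)(sum of Y) / N."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
```

Both routes give ∑xy = 72 for these pairs, but notice the cousin never touches a single deviation score.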
Now we present two widely publicized formulas that are just tiny steps away from STEP ONE'S COUSIN and will give the measures of "togetherness" that we want for the X and Y pairs. Examine the formulas that are labeled (A) and (B) below:
To obtain result (A), we simply divide STEP ONE'S COUSIN by N, the number of pairs of scores and this produces the widely known Covariance of the X and Y pairs. In simple language, the Covariance is the mean product of the deviations of the X and Y scores about their respective means. Some authors refer to this as the mean cross product of the deviation scores. Recall that ∑xy/N can either be positive or negative and can range between negative infinity and positive infinity. If your calculator is top of the line it possibly has a button that will recall this result. But don't count on it. The Covariance is a very crucial index when you have a multiple number of variables. For example, with 4 variables we would arrange the 4 Variances down the main diagonal of a matrix with the 6 possible Covariances located in the off diagonal positions in the matrix. A wealth of information is contained in this 4x4 variance-covariance matrix with the Variance of each individual variable and the Covariance of all possible pairs of variables being displayed. Matrix algebra becomes the mode of operation when you delve into multivariate analysis.
Now in result (B) we move one more tiny step and divide the Covariance of X and Y by the product of the standard deviations of X and Y. Putting it in statistical language, we are simply standardizing the covariance with this maneuver. Lo and Behold, the result may surprise you. We have now arrived at one of the most celebrated statistics ever employed...the Pearson Product-Moment Correlation Coefficient. This index, of course, behaves very well: a full range of values between -1 and +1 may occur, including the value of 0 as a possibility. A high positive index such as +.90 or +.80 would suggest that high X's occur very frequently with high Y's, intermediate X's occur often with intermediate Y's, and low X's tend to be paired with low Y's. An inverse or negative correlation such as -.85 or -.90 would suggest low X's being paired with high Y's and high X's being paired with low Y's. A 0 index suggests no correspondence whatsoever. That is, given a high or low X value, it is impossible to predict where the Y will be. In terms of a scientific calculator, a high-end unit will almost always give you a button that will crank out the (B) result after all pairs are entered.
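Results (A) and (B) can be sketched directly in Python. The function names are mine, the N divisor is used throughout (my preferred definition, as discussed in an earlier blurb), and the data are the worked example's pairs:

```python
import math

xs = [10, 8, 5, 12, 4, 3, 14, 12, 6, 12]
ys = [8, 6, 4, 12, 5, 5, 9, 8, 8, 10]

def covariance(xs, ys):
    """(A) Mean cross product of the deviation scores (division by N)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """(B) Pearson r: the covariance standardized by the two SDs (N divisor)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return covariance(xs, ys) / (sx * sy)
```

Note that r comes out the same whichever N-versus-N-1 convention you use, since the divisors cancel in the standardizing step.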
Finally, we shall calculate an example to show you how things work but will use a reasonably small set of pairs so you can use any type of calculator including a basic $5 unit. Please realize you will have to make three passes at the data if you use an el cheapo unit but it is still doable. Here are the paired data or the (Xi, Yi)'s which you may think of as pretest posttest scores for 10 individuals:
(10, 8) (8, 6) (5, 4) (12, 12) (4, 5) (3, 5) (14, 9) (12, 8) (6, 8) (12, 10)
After all the pairs of scores are entered, we recall from the calculator memories the following basic calculations and statistical indices:
∑X = 86, ∑X² = 878, ∑Y = 75, ∑Y² = 619, ∑XY = 717, X̄ = 8.6, Ȳ = 7.5, sX = 3.72, sY = 2.38 (WFTSM steps will be omitted and only the last new formulas will be shown.)
Now for the three new calculations of this entire blurb, substituting from the above:
STEP ONE'S COUSIN
∑xy = ∑XY - (∑X)(∑Y)/N
    = 717 - (86)(75)/10
    = 717 - 645
    = 72
(A) COVARIANCE
COVXY = ∑xy/N
    = 72/10
    = 7.2
(B) CORRELATION COEFFICIENT
r = COVXY / (sXsY)
=7.2 / [(3.72)(2.38)]
=7.2 / 8.85
= .814 This r of .814 would indicate there is a fairly strong tendency for high X's to be paired with high Y's and low X's to be paired with low Y's.
With this example we now finish our presentation of three very useful formulas that grew out of moving from a single set of scores representing a single variable to a format of pairs of scores representing two different variables. We have seen that statistical indices are important for each set of scores separately, but now we also need indices that measure so-called "togetherness" relationships between the two variables. This type of need has brought into play STEP ONE'S COUSIN and, in sequence, the Covariance and the Correlation Coefficient. This wicket has indeed become a little more sticky. In fact, a point of confusion is that these three formulas take on many different forms, and in each case an equivalent form may look nothing like the original symbolically. These formulas require much study and practice to develop depth of understanding. I will leave you with a neat little exercise just for fun. The formula for the correlation coefficient r can be thought of as STEP ONE'S COUSIN divided by the product of the square roots of two different STEP ONES! See if you can verify this goofy statement in your mind. Thanks for reading this somewhat rambling presentation and please tune in again.
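And if your mind wants backup, here is the goofy statement verified in Python on the worked example's data (variable names are my own):

```python
import math

xs = [10, 8, 5, 12, 4, 3, 14, 12, 6, 12]
ys = [8, 6, 4, 12, 5, 5, 9, 8, 8, 10]
n = len(xs)

cousin = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n   # 72
step_one_x = sum(x * x for x in xs) - sum(xs) ** 2 / n                # 138.4
step_one_y = sum(y * y for y in ys) - sum(ys) ** 2 / n                # 56.5

# r = COUSIN divided by the square roots of two different STEP ONES.
r = cousin / math.sqrt(step_one_x * step_one_y)                       # about .814
```

Sure enough, the same r of about .814 rolls out, with the tiny difference from the hand calculation due only to rounding the SDs to two decimals there.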
During my career as a statistics professor, one question has repeatedly been directed my way by students, academicians, and professional people...WHY DO WE NEED STATISTICS? My answer has evolved over the years from the simplistic...we need it to summarize data and make inferences about large populations...to a more research oriented reason...we need it to intelligently read and interpret inferences from the scholarly literature in our own particular disciplines. Somehow I have not felt entirely comfortable with these reasons because they appear to be effects of statistical acumen rather than a primary need for statistics itself. I have finally, after extensive hand-wringing, come up with a more basic reason for the very existence of statistics...statistics is needed because individuals and objects VARY on traits and characteristics. There it is...I finally hit the nail on the head without smashing my finger. Statistics is all about variation and what to do with it. Imagine a world where all people were the same height, same weight, same intelligence, same hair color, and on and on, or every car were the same color (black, according to Henry Ford), same body, same engine horsepower, same number of doors, same dashboard and accessories, etc. Now think about this horrendous scenario for every trait of every individual and every characteristic of every object. Could we even survive in a boring, convoluted world like this?? I think not. So that is why, when you progress through several statistics courses, much time is devoted, sometimes unknowingly, to variation and how it applies to the particular procedures that are presented to you. Yes, variability RULES and our world will always need it to be appropriately understood.
Now I want to take you on a stroll through some of the statistical indices that are closely tied in with variation in the first several courses. You will be surprised at the sheer number of these; some are very familiar, and others may surprise you as being associated with variation. Remember that all measures of variability should represent the spread or clustering of a set of scores and should be capable of being viewed as a distance on a score scale. I will present the group of indices generally from the simplest to the most complex and make parenthetical comments as needed:
R = H - L RANGE (The highest score minus the lowest. A supplementary index, since only 2 scores are involved.)
Q = (Q3 - Q1)/2 SEMI-INTERQUARTILE RANGE (The distance between the 75th percentile, or 3rd quartile, and the 25th percentile, or 1st quartile, divided by 2. It can also be thought of as the mean distance that Q3 and Q1 are from Q2, the median. Q is a good variability partner for the median as the measure of central tendency when a distribution is skewed to any extent.)
MD = ∑|x|/N MEAN DEVIATION (The mean absolute distance that each score is from the mean, where x is a deviation score. A very intuitive index that makes a lot of sense. The only problem is that it is difficult to calculate because of the absolute value signs, particularly for moderate to large samples. No STEP ONE-LIKE algorithm is available for this one.)
s² = ∑x²/N SAMPLE VARIANCE (Remember, this is just the mean of the squares of the deviations of the scores about the mean, or ∑x²/N, or in its simplest form... STEP ONE divided by N. This index is very useful because it employs every score in the sample and avoids the absolute value signs by squaring the deviations.)
s = √s² SAMPLE STANDARD DEVIATION (The square root of the above sample variance s². The most widely used descriptive index of variability, and a perfect partner when the sample mean is used as the measure of central tendency. Its main features are that, again, it depends on each and every score value, and it may also be interpreted in terms of the original score scale.)
ESTIMATED STANDARD ERROR OF THE MEAN (Remember, a standard error is nothing more than a glorified standard deviation. This is the estimated SE of the sampling distribution of the sample mean, and as such it is employed in inferential statistics or hypothesis testing. If you work with many other test statistics besides the sample mean, you can usually find stated SE formulas for them and use them in your hypothesis test. Some of these formulas are quite complex.)
MSW = SSW/(N-k) MEAN SQUARE WITHIN (This is employed in a one-way ANOVA with k groups of scores and nj scores in the j-th group. It is an extension of ∑x² in which the deviation scores are taken about their respective group means and then pooled together across all k groups. Looks suspiciously like variability again, WITHIN the groups. N-k is the df-value for the sum of squares within.)
MSB = SSB/(k-1) MEAN SQUARE BETWEEN (This also is employed in a one-way ANOVA. It is another extension of ∑x², but the deviations are those of the group means about the overall mean M. Again, this has the appearance of BETWEEN-group variation, with k-1 as the df-value for the sum of squares between. Now, as some of you know, you form an F-ratio, F = MSB/MSW. This tests the significance of the differences among all k group means in one big shot. WOW!)
Λ = |W|/|T| WILKS' LAMBDA (This is a multivariate statistic that tests the significance of the differences among several population centroids of multivariate normal distributions. The bars in the numerator and denominator are determinants of the WITHIN GROUPS MATRIX W and the TOTAL GROUPS MATRIX T respectively. These matrices are variance-covariance matrices for the multiple dependent variables. Interestingly, a small value of Wilks' Lambda is desirable for significance. This Lambda is usually employed in rather complex functions to actually run the test.)
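The simpler descriptive indices in the list above can all be computed from a handful of sums. Here is a minimal Python sketch; the function name and the quartile interpolation rule are my own choices (quartile conventions vary across textbooks), and the variance uses the N denominator exactly as defined in the text:

```python
import math

def variability_indices(scores):
    """Compute the descriptive indices of variability discussed above.

    The variance is the mean of the squared deviations about the mean
    (N denominator), matching the definition in the text.
    """
    n = len(scores)
    xs = sorted(scores)
    mean = sum(xs) / n

    # Quartile by linear interpolation -- one of several common conventions.
    def quantile(p):
        pos = p * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    variance = sum((x - mean) ** 2 for x in xs) / n
    return {
        "range": xs[-1] - xs[0],                            # R = H - L
        "semi_iqr": (quantile(0.75) - quantile(0.25)) / 2,  # Q = (Q3 - Q1)/2
        "mean_dev": sum(abs(x - mean) for x in xs) / n,     # mean deviation
        "variance": variance,                               # s squared
        "sd": math.sqrt(variance),                          # s
    }

print(variability_indices([2, 4, 5, 5, 8]))
```

For the five scores used later in this column (2, 4, 5, 5, 8), the range is 6 and the variance is 18.8/5 = 3.76.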
OK, from this list you can appreciate the statement that variation makes the statistical world go around. Keep in mind the list is not exhaustive, but I would be exhausted if I continued this process without presenting something new and sort of exciting. To finish off this sticky wicket for the month, I want to propose to you one last intuitive and compelling index of variation. To my knowledge, this index has never appeared in any behavioral science statistics textbook, nor has it been used in any published study in this field. I am less certain of its use in economics, business, and other disciplines, but at best its employment would be rare. To set the stage, consider a set of N scores: X1, X2, X3, ..., XN. Now take all the possible differences between each and every score: X1 - X2, X1 - X3, ..., X1 - XN, X2 - X3, X2 - X4, ..., X2 - XN, ..., XN-1 - XN. It should strike you with little thought that a possible measure of the spread of a set of scores may involve simply looking at each and every pair-wise difference. This is a simple concept but very powerful. Let's give it a try. One slight modification we will make is to take the square of each difference, to solve the problem of negative and positive differences. Now I want you to direct your attention to formula (A), which states that the sum of all these squared differences equals N∑x². Notice the big surprise on the right side of the equation: dear old STEP ONE multiplied by an N in front of it, staring right at you. Are you kidding me? Absolutely amazing. I have omitted the proof of this formula, but believe me it is as solid as a rock. Next it seems reasonable to find the mean of all these squared differences. That would involve determining the total number of differences present in a set of N scores. This is the number of combinations of N things taken 2 at a time, which is N!/[2!(N-2)!] = N(N-1)/2 (recall "!" is the factorial function in mathematics).
So, dividing the left- and right-hand sides of (A) by N(N-1)/2, we arrive at what we will term the MSAD (Mean Square of All Differences). Thus, in its simplest form, MSAD = 2∑x²/(N-1).
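Formula (A) is easy to spot-check numerically. A small Python sketch (the function names are my own) compares the brute-force sum of all squared pair-wise differences against N times STEP ONE:

```python
from itertools import combinations
import random

def step_one(xs):
    """STEP ONE: the sum of squared deviations about the mean,
    computed with the shortcut sum(X^2) - (sum X)^2 / N."""
    n = len(xs)
    return sum(x * x for x in xs) - sum(xs) ** 2 / n

def sum_of_squared_differences(xs):
    """Left side of (A): every pair-wise difference, squared and summed."""
    return sum((a - b) ** 2 for a, b in combinations(xs, 2))

# Check the identity on an arbitrary batch of scores.
random.seed(1)
xs = [random.randint(0, 50) for _ in range(12)]
lhs = sum_of_squared_differences(xs)
rhs = len(xs) * step_one(xs)        # N times STEP ONE
print(abs(lhs - rhs) < 1e-6)        # True -- the two sides agree
```

Any set of scores you try will give the same agreement (up to floating-point rounding), which is exactly what the omitted proof guarantees.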
To give you a feel for this "new" index, I will use a small set of scores and compute it directly first and then use STEP ONE.
Consider the following set of 5 X-scores: 2, 4, 5, 5, 8
Using the left side of (A) for the direct calculation, the 10 squared differences are:
(2 - 4)² + (2 - 5)² + (2 - 5)² + (2 - 8)² + (4 - 5)² + (4 - 5)² + (4 - 8)² + (5 - 5)² + (5 - 8)² + (5 - 8)² = 4 + 9 + 9 + 36 + 1 + 1 + 16 + 0 + 9 + 9 = 94
Finally, MSAD = 94/10 = 9.4. This was not so bad here, but imagine 20 scores with (20)(19)/2 = 190 differences. This would be a bear.
Now, using STEP ONE on the right side of (A), we have:
∑X = 24, ∑X² = 134, and ∑x² = 134 - (24)²/5 = 134 - 576/5 = 134 - 115.2 = 18.8. Finally, MSAD = 2∑x²/(N-1) = (2)(18.8)/4 = 9.4. Now I would like to make several comments about this index. First, it is intuitively appealing and can easily be computed with a simple function of the STEP ONE algorithm. Second, as with the variance and standard deviation, should we take the square root of MSAD as a descriptive index? Also, mathematical statisticians would have to develop the sampling distribution of this statistic and its standard error before it could be employed in inferential statistics. I would certainly like to hear from my loyal readers about any reactions you have to MSAD. Thank you for reading this rather lengthy blurb.
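For readers who want to experiment with MSAD on their own data, here is a small Python sketch (the function names are my own) that reproduces the 9.4 above by both routes, the direct pair-wise calculation and the STEP ONE shortcut:

```python
from itertools import combinations

def msad_direct(xs):
    """MSAD the long way: the mean of every squared pair-wise difference."""
    sq_diffs = [(a - b) ** 2 for a, b in combinations(xs, 2)]
    return sum(sq_diffs) / len(sq_diffs)      # N(N-1)/2 differences in all

def msad_shortcut(xs):
    """MSAD via formula (A): 2 * STEP ONE / (N - 1)."""
    n = len(xs)
    step_one = sum(x * x for x in xs) - sum(xs) ** 2 / n
    return 2 * step_one / (n - 1)

scores = [2, 4, 5, 5, 8]
print(msad_direct(scores), msad_shortcut(scores))  # both 9.4, up to rounding
```

The shortcut turns a calculation that grows with the square of N into one that needs only ∑X and ∑X², which is the whole charm of formula (A).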
Now Back to the Fun and Humourous Side of Statistics:
Visit the Best Collection of Annotated Stat Jokes in the World With Over 200 entries. First Internet Gallery of Statistics Jokes
Read About the Fun Activities That Can Be Introduced in a Statistics Classroom. Archives of Statistics Fun
Also, If You Want Information About the Author That Created This Set of Pages Check the Home Page of Gary C. Ramseyer.