Lab 14

Statistical Power

When we conduct statistical tests, remember that we can make two kinds of correct decisions and two kinds of errors:

                                        Reality
                           H0 is true               H0 is false
Researcher’s   Retain H0   Correct Retention of H0  Type II Error
Decisions      Reject H0   Type I Error             Correct Rejection of H0

Replication Studies

Although scientists hope to be the first to discover new facts about important variables, they are wary of making false claims (i.e., making a Type I error). In psychology and many other disciplines, we typically set the Type I error rate at α = 0.05. This means that when the null hypothesis is true, 5% of the time a statistical test will lead us to believe that the null hypothesis is unlikely to be true, and so we will incorrectly reject it. Of course, as researchers, we never know when we’ve made a Type I error until further data are collected by our research team or by other researchers.

What usually happens is that we get a result suggesting that the effect we were hoping to find is present. Because we are excited about our result, we publish our finding. However, it is quite possible that our finding is a Type I error. After we have published our finding, other researchers might attempt to conduct a replication study. When efforts to replicate a finding fail repeatedly, the researchers who reported the original finding usually try to replicate it themselves; perhaps the other researchers are not doing the study correctly. However, if new data keep coming back from new studies and we cannot reject the null hypothesis (except for maybe 5% of the time), we have to conclude that the first study was a fluke: a Type I error. There is no shame in this. It is just bad luck.

We could cut the probability of making a Type I error by setting α to a lower level (e.g., 0.01), but there is an unfortunate side effect of setting a low α: when the null hypothesis is false, it increases β, the probability of making a Type II error! How can a researcher catch a break? If it isn’t a Type I error, it’s a Type II error! Actually, there are a number of things that can be done to reduce the frequency of both types of errors.

Statistical Power

Closely related to β is something called statistical power (also referred to as simply power). Whereas β is the probability of incorrectly retaining a false null hypothesis, power is the probability of correctly rejecting a false null hypothesis. So,

\[\begin{align}\text{Power} &= 1 - \beta\\ \beta &= 1 - \text{Power}\end{align}\]

The more powerful the test, the more readily it will detect the effect of a variable if it is there. However, increased power does NOT increase the chances of rejecting the null hypothesis if the null hypothesis is true.
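In symbols (our notation, not the lab’s): when the null hypothesis is true, the rejection rate is fixed by α, no matter how powerful the design is:

\[P(\text{reject } H_0 \mid H_0\ \text{true})=\alpha\]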

Effect Sizes

Imagine that there is a new treatment for memory problems in older adults. Because the treatment works, the treated population performs better than the untreated population does on memory tests. The magnitude of the difference between population means is called the effect size. In this case, the memory test scores have a population mean of 100 and a standard deviation of 15 in the untreated population. In the treated population, the mean is 106 and the standard deviation is still 15. The two means differ by 6 points, which equals 0.4 standard deviations (the standard deviation is 15 in both populations). The effect size here is +0.4 standard deviation units because the two population means are 0.4 standard deviations apart.
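Written as a formula (this standardized difference is commonly called Cohen’s d, although the lab does not use that name):

\[d=\dfrac{\mu_1-\mu_0}{\sigma}=\dfrac{106-100}{15}=0.4\]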

Null & Alternative Distributions of the Sample Mean

Suppose that we do not know yet that the memory treatment works. However, we do know that for untreated older adults the scores on the memory test have a population mean of 100 and a standard deviation of 15. Therefore, we can specify the null distribution of the sample mean for any particular sample size. Suppose that we give the memory treatment to a sample of 25 older adults. The sampling distribution of the sample mean when N = 25 is shown below, along with the critical regions for a two-tailed test. If the null hypothesis is true, the sample mean is a single member of the null distribution. The size of the critical regions is determined by α. Because α = 0.05, the two critical regions together occupy 5% of the null distribution of the sample mean.
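With σ = 15 and N = 25, the standard error of the mean (the formula appears again in the Influences section below) works out to:

\[\sigma_{\bar{X}}=\dfrac{\sigma}{\sqrt{N}}=\dfrac{15}{\sqrt{25}}=3\]

This is the value of 3 that appears in the calculations below.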

If the null hypothesis is false, the sample mean is a single member of the alternative distribution of the sample mean.

Calculating β & Statistical Power

To simplify matters, let’s zoom in on the null and alternative distributions. To further simplify, we will also assume that we are conducting a one-tailed test. For now we will set aside my warning that in practice one-tailed tests are problematic.

Because α is set to 0.05, 5% of the null distribution is in the critical region. Remember from the previous lab how the critical region is defined when μ1 > μ0:

\[\text{Critical region: }\bar{X}\ge\mu_0 + z_{crit}\sigma_{\bar{X}}\] We can find zcrit using the NORMSINV function in Excel: \[\begin{align}z_{crit}&=\mathtt{NORMSINV}(1-\alpha/\text{tails})\\&=\mathtt{NORMSINV}(1-0.05/1)\\&\approx 1.645\end{align}\]

Therefore, the critical region includes all values above:

\[\begin{align}\mu_0 + z_{crit}\sigma_{\bar{X}}&\approx 100+1.645 \cdot 3\\&\approx 104.93\end{align}\]

If the sample mean is above this value, we will reject the null hypothesis. If the null hypothesis is true, we will therefore make a Type I error 5% of the time when we run this study with 25 people in the sample. On the other hand, if the alternative hypothesis is true (and the population mean is 106), we will sometimes correctly reject the null hypothesis and sometimes incorrectly retain it (i.e., make a Type II error). To find out how often we will make a Type II error (i.e., β), we can use the NORMDIST function in Excel:

\[\begin{align}&=\mathtt{NORMDIST}(\mu_0 + z_{crit}\sigma_{\bar{X}},\mu_1,\sigma_{\bar{X}},TRUE)\\ &=\mathtt{NORMDIST}(104.93,106,3,TRUE)\\ &\approx 0.36\end{align}\]

To find out how often we will correctly reject the null hypothesis, we simply subtract β from 1.

\[\begin{align}\text{Power} &= 1 - \beta\\ &=1 - 0.36\\ &=0.64\end{align}\]
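If you would like to double-check these numbers outside Excel, here is a minimal Python sketch using scipy.stats.norm in place of NORMSINV and NORMDIST (the variable names are ours, not part of the lab):

```python
from scipy.stats import norm

mu0, mu1 = 100, 106   # null and alternative population means
sigma, n = 15, 25     # population SD and sample size
alpha = 0.05          # one-tailed test, as in the lab

se = sigma / n ** 0.5         # standard error of the mean: 3.0
z_crit = norm.ppf(1 - alpha)  # ~1.645, the analogue of NORMSINV(1 - 0.05/1)
cutoff = mu0 + z_crit * se    # ~104.93, lower edge of the critical region

# ~0.36, the analogue of NORMDIST(104.93, 106, 3, TRUE)
beta = norm.cdf(cutoff, loc=mu1, scale=se)
power = 1 - beta              # ~0.64
print(f"cutoff = {cutoff:.2f}, beta = {beta:.2f}, power = {power:.2f}")
```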

Null & Alternative Distributions of the z-test

So far, we have referred to the null distribution of the sample mean. However, the term null distribution typically refers to the distribution of a test statistic, such as the z in a z-test. The null distribution of the z in a z-test is the distribution that the observed z will have if the null hypothesis is true. Like all z-scores, the observed z of the z-test has a standard normal distribution (μ = 0, σ = 1).

The alternative distribution in this example also has a standard deviation of 1, but it has a higher mean: \[\begin{align}\mu_{z_1}&=\dfrac{\mu_1-\mu_0}{\sigma_{\bar{X}}}\\&=\dfrac{106-100}{3}\\&=2\end{align}\]
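On this z scale, power is the area of the alternative distribution beyond zcrit. Writing Φ for the standard normal cumulative distribution function (our notation), the earlier one-tailed result reappears:

\[\begin{align}\text{Power}&=P(z>z_{crit})\\&=1-\Phi(z_{crit}-\mu_{z_1})\\&=1-\Phi(1.645-2)\\&\approx 0.64\end{align}\]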

Influences on Statistical Power

1. Larger effect sizes increase power.

When there are two populations, statistical power is related to how far apart the two population means are. This makes sense: large effects are easy to detect, whereas small effects are harder to detect. In the animation below, notice how power rises and falls with the distance between the null and alternative distributions.

2. Increasing α increases power.

It is possible to increase power by raising α, but this is a bad strategy because it increases the chance you will make a Type I error if the null hypothesis is true. Although it is your prerogative to set α to any value you choose, it is everyone else’s prerogative to distrust your research findings if you set α too high. It is generally recommended that you keep α at the same level as is common in your field of study. In most cases, α = 0.05.

In the animation below, see how raising α increases power as well as the Type I error rate. You can also see the problem with setting α too low: although doing so will lower your Type I error rate, it will simultaneously decrease power. Note that the trade-off between α and power is not a one-to-one linear relationship.

3. One-tailed tests increase power if they are in the correct direction.

One-tailed tests have more power than two-tailed tests, given that you have specified the correct tail. If you specify the wrong tail, power is essentially 0 because the critical region sits on the side opposite the true effect, leaving no way to correctly reject the null hypothesis. For this reason, most researchers routinely conduct two-tailed tests. Even though they are less powerful, they prevent the awkwardness of having to retain the null hypothesis when the mean difference is huge but in the opposite direction of what was expected.

In the animation below, note that the one-tailed test has a power of 0.64 whereas the two-tailed test has a power of 0.52. Like increasing α, using a one-tailed test is generally a poor strategy for increasing power.
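You can check the two-tailed figure yourself, evaluating both tail probabilities under μ1 = 106 (our rounding):

\[\begin{align}\text{Power}&=P(\bar{X}<100-1.96\cdot 3)+P(\bar{X}>100+1.96\cdot 3)\\&=P(\bar{X}<94.12)+P(\bar{X}>105.88)\\&\approx 0.000+0.516\approx 0.52\end{align}\]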

4. Increasing the sample size increases power.

The distribution of sample means becomes narrower as the sample size grows. The standard error (the standard deviation of the distribution of sample means) has this formula:

\[\sigma_{\bar{X}}=\dfrac{\sigma}{\sqrt{N}}\]

It is easy to see that as sample size grows, the standard error shrinks. Dividing by a large number (or in this case, the square root of a large number) results in a small number.

The animation below refers to the null and alternative distributions of the sample means, not of the observed z. As the sample size increases, the sampling distributions become narrower and therefore have less overlap.
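To see the same effect numerically, here is a short extension of the earlier Python sketch (again using scipy.stats.norm; the sample sizes beyond 25 are our illustrative choices):

```python
from scipy.stats import norm

mu0, mu1, sigma, alpha = 100, 106, 15, 0.05  # one-tailed memory-treatment example

for n in (25, 50, 100):
    se = sigma / n ** 0.5                    # standard error shrinks as N grows
    cutoff = mu0 + norm.ppf(1 - alpha) * se  # critical value moves toward mu0
    power = 1 - norm.cdf(cutoff, loc=mu1, scale=se)
    print(f"N = {n:3d}: power = {power:.2f}")  # prints 0.64, 0.88, 0.99
```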

5. Smaller population standard deviations increase power.

Researchers rarely have any control over the standard deviation of variables. However, for the sake of completeness, we should note that a large population standard deviation reduces power. Remember that the z-score formula has a standard deviation in the denominator. Dividing by a large number makes the z smaller. In order to reject the null hypothesis, z has to be big (i.e., far from 0).

There is a way to decrease the standard deviation: study people who are mostly alike. There are sometimes good reasons to do this, but this strategy for increasing power has a clear disadvantage: by studying people who are very similar to each other, your findings will apply only to the narrow segment of the population from which the sample was drawn. There will be legitimate doubts about whether the findings will generalize to any other kinds of people.

Calculating Power with an Excel Spreadsheet

Here is an Excel spreadsheet that illustrates aspects of statistical power.

Download the file and open it.

Let’s walk through how to use it.

Suppose you collect data from N = 30 people and the sample mean is 55. You think that this sample did not come from a population with μ0 = 50 and σ0 = 20. You didn’t specify whether you expected your sample to have a higher or lower mean than the population, so this is a two-tailed hypothesis. The null hypothesis is that the sample does come from that population with μ0 = 50 and σ0 = 20. Now, suppose that, in reality and unbeknownst to you, the sample came from a different population with μ1 = 55 and σ1 = 20. Thus, although you are not in a position to know this for sure, the null hypothesis is false.

First, set the “Null Hypothesis is” option box to False.

Next, set the Hypothesis Type option box to two-tailed (H1 ≠ H0) like this:

Set the distribution mean for the null hypothesis (H0) to 50 and the standard deviation to 20.

Set the treatment distribution (H1) mean to 55 and the standard deviation to 20.

Set the α level to 0.05.

Set the sample size (n) to 30.

It should look like this:

The graph below should look something like this:

The blue distribution is the distribution of sample means of the null distribution and the red distribution is the distribution of sample means of the alternative distribution.

The green vertical line is the sample mean. Right now it falls on 55.

The critical regions are shaded in blue. When the vertical line (the sample mean) falls in a critical region, the null hypothesis is rejected.

The red shaded regions represent statistical power. They represent the part of the alternative distribution that is beyond the critical regions. You can’t see the very small red portion that falls in the blue critical region on the left side because it is covered by the blue shading, but it is still included in the calculation of statistical power.

Notice that the sample mean is 55, exactly the same as the alternative distribution mean. You’d think that we’d reject the null hypothesis in a situation like this, right? Wrong. Remember, researchers never know the mean of the alternative distribution. Unfortunately, the sample mean did not fall in a critical region, so we must retain the null hypothesis. Thus, unbeknownst to us, we have made a Type II error.

It turns out that making a Type II error was very likely in this situation. If you look at the Type II Error Rate (β) box, you’ll see that the probability of making a Type II error was quite high (about 72%). Power was only about 0.28, meaning that randomly sampling 30 people from the alternative distribution would result in a correct decision only 28% of the time. If you were a researcher, you would not want to put in hours of toil for only a 28% chance of success. You would want to improve your power substantially.
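You can verify the spreadsheet’s numbers with the same approach we used earlier, now with two critical regions and both probabilities computed under μ1 = 55 (values rounded):

\[\begin{align}\sigma_{\bar{X}}&=\dfrac{20}{\sqrt{30}}\approx 3.65\\ \text{Critical values}&=50\pm 1.96\cdot 3.65\approx 42.84\ \text{and}\ 57.16\\ \text{Power}&=P(\bar{X}<42.84)+P(\bar{X}>57.16)\\&\approx 0.000+0.277\approx 0.28\end{align}\]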

The Effect of Using One-Tailed Tests Instead of Two-Tailed Tests

If you had specified a one-tailed hypothesis, your power would have improved somewhat (if you guessed in the right direction).

Set the Hypothesis Type option box to one-tailed (H1>H0)

ReggieNet: What is your power now?

ReggieNet: Does the sample mean fall in the critical region now that you have a one-tailed hypothesis?

The Effect of Raising α

Set the Hypothesis Type option box back to two-tailed (H1≠H0).

Raising α is the easiest way to raise power. It is also the stupidest. Don’t do it when conducting real data analysis! People will laugh at you and won’t take you seriously anymore. ;)

However, just to see the effect of raising it, temporarily raise α from 0.05 to 0.2.

ReggieNet: What happened to the size of the critical regions when you raised α from 0.05 to 0.20?

Hint: Remember to set the Hypothesis Type option box back to two-tailed (H1≠H0).

ReggieNet: Does the sample mean fall in the critical region now that you have raised α from 0.05 to 0.2?

You can see that power increased from 0.28 to 0.54. On the surface, this would seem to be a good thing. It is not. The problem with raising α is that it increases the frequency of Type I errors if the null hypothesis is true. Type I errors are false “facts,” and once they have entered the scientific literature they are harder to get rid of than Type II errors.

Reset α to 0.05 again.

The Effect of Increasing Sample Size (N)

The most effective way to increase power that is directly under the researcher’s control is to increase the sample size.

ReggieNet: What happens to the distributions in the graph when you change the sample size from N = 30 to N = 100 (Remember that α must be set to 0.05)?

The result you noticed in question 5 should not be surprising; you probably predicted what would happen before you did it. The graph shows distributions of sample means, and the width of each distribution depends on its standard deviation. The standard deviation of the distribution of sample means is also called the standard error. Recall that the standard error is the original population standard deviation divided by the square root of N, so you can guess what happens to the standard error when you divide by the square root of a large number.
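For instance, with the walkthrough’s σ = 20:

\[\sigma_{\bar{X}}=\dfrac{20}{\sqrt{30}}\approx 3.65 \qquad\text{versus}\qquad \sigma_{\bar{X}}=\dfrac{20}{\sqrt{100}}=2\]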

Try experimenting with different values of N and see what happens.

ReggieNet: What is the lowest N that causes the two-tailed null hypothesis to be rejected?

Hint: Pay attention to the Decision box in the upper right corner. It will change to red when the null hypothesis is rejected.

The Effect of Larger Population Mean Differences

Right now, the two distribution means are 0.25 standard deviations apart: \[\dfrac{\mu_1-\mu_0}{\sigma} =\dfrac{55-50}{20}=0.25 \] Suppose that the alternative distribution had μ = 60 instead of μ = 55. Now the distributions are 0.50 standard deviations apart: \[\dfrac{\mu_1-\mu_0}{\sigma} =\dfrac{60-50}{20}=0.50 \] If N is set back to 30, you can compare the original power we started with (0.28) to what you have now. As you can see, the odds of correctly rejecting the null hypothesis have gone up considerably.

ReggieNet: What is the power now that μ = 60 for the alternative distribution?

Hint: Remember to set N back to 30.

ReggieNet: If the sample mean is also set to 60, is the null hypothesis rejected or retained?

Review

Statistical power is the probability of correctly rejecting the null hypothesis when the null hypothesis is false. It is influenced by:

Influence                Direction
Type of hypothesis       One-tailed test (correct tail) → More power
Type I error rate        Larger α → More power
Sample size              Larger N → More power
Effect size              Larger mean differences → More power
Population variability   Smaller σ → More power

A Type I error is when the null hypothesis is true but is rejected by the researcher.
A Type II error is when the alternative hypothesis is true but the researcher fails to find evidence for it and therefore retains the null hypothesis.
The null distribution is the sampling distribution of a statistic when the null hypothesis is assumed to be true.