When we conduct statistical tests, remember that we can make two kinds of correct decisions and two kinds of errors:

Researcher's Decision | Reality: H_{0} is true | Reality: H_{0} is false
---|---|---
Retain H_{0} | Correct Retention of H_{0} | Type II Error
Reject H_{0} | Type I Error | Correct Rejection of H_{0}

Although scientists hope to be the first to discover new facts about
important variables, they are wary of making false claims (i.e., making a
Type I error). In Psychology and many other disciplines, we
typically set a **Type I error rate** at
*α* = 0.05. This means that when the null hypothesis
is true, 5% of the time a statistical test will lead us to believe that the
null hypothesis is unlikely to be true and so we will incorrectly reject
it. Of course, as researchers, we never know when we’ve made a Type I
error until further data are collected by our research team or by other
researchers.

What usually happens is that we get a result that suggests that the
effect we were hoping to find is present. Because we are excited about our
result, we publish our finding. However, it is quite possible that our
finding is a Type I error. After we have published our finding, other
researchers might attempt to conduct a **replication study**.
When efforts to replicate a finding fail repeatedly, the researchers who
reported the original finding usually try to replicate it themselves. Perhaps
the other researchers are not doing the study correctly. However, if
data from new studies keep coming back and we cannot reject the null
hypothesis (except perhaps 5% of the time), we have to conclude that the
first study we did was just a fluke, a Type I error. There is
no shame in this; it is just bad luck.

We could cut the probability of making a Type I error by setting
*α* to a lower level (e.g., 0.01), but there is an unfortunate
side effect of setting a low *α*: when the null hypothesis is
false, it increases **β**, the probability of making a Type II
error! How can a researcher catch a break? If it isn't a Type I
error, it's a Type II error! Fortunately, there are a number of things
that can be done to reduce the frequency of both types of errors.

Closely related to *β* is something called
**statistical power** (also referred to as simply
*power*). Whereas *β* is the probability of incorrectly
retaining a false null hypothesis, power is the probability of correctly
rejecting a false null hypothesis. So,

\[\text{power}=1-\beta\]

The more powerful the test, the more readily it will detect the effect of a variable if it is there. However, increased power does NOT increase the chances of rejecting the null hypothesis when the null hypothesis is true.

Imagine that there is a new treatment for memory problems in older adults.
Because the treatment works, the treated population performs better than the
untreated population does on memory tests. The magnitude of the
difference between population means is called the **effect
size**. In this case, the memory test scores have a population mean
of 100 and a standard deviation of 15 in the untreated population. In the
treated population, the mean is 106 and the standard deviation is still 15.
The two means differ by 6 points, which is 6/15 = 0.4 of the common
standard deviation. The effect size here is +0.4 standard deviation units
because the two population means are 0.4 standard deviations apart.
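As a quick arithmetic check, using only the numbers given above:

```python
# Standardized effect size for the memory-treatment example
mu_untreated = 100  # untreated population mean
mu_treated = 106    # treated population mean
sigma = 15          # standard deviation in both populations

effect_size = (mu_treated - mu_untreated) / sigma
print(effect_size)  # 0.4
```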

Suppose that we do not know yet that the memory treatment works.
However, we do know that for untreated older adults the scores on the
memory test have a population mean of 100 and a standard deviation of 15.
Therefore, we can specify the null distribution of the sample mean for any
particular sample size. Suppose that we give the memory treatment to a
sample of 25 older adults. The sampling distribution of the sample mean
when *N* = 25 is shown below, along with the critical regions
for a two-tailed test. If the null hypothesis is true, the sample mean is a
single member of the null distribution. The size of the critical regions is
determined by *α*. Because *α* = 0.05,
the two critical regions together occupy 5% of the null distribution of the
sample mean.
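A sketch of those critical boundaries in Python, using the standard library's `statistics.NormalDist` in place of Excel's functions:

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, N, alpha = 100, 15, 25, 0.05
se = sigma / sqrt(N)                          # standard error = 3.0

z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed: about 1.96
lower = mu0 - z_crit * se                     # lower critical boundary
upper = mu0 + z_crit * se                     # upper critical boundary
print(round(lower, 2), round(upper, 2))       # 94.12 105.88
```

A sample mean below 94.12 or above 105.88 would fall in a critical region and lead us to reject the null hypothesis.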

If the null hypothesis is false, the sample mean is a single member of the alternative distribution of the sample mean.

To simplify matters, let’s zoom in on the null and alternative distributions. To further simplify, we will also assume that we are conducting a one-tailed test. For now we will set aside my warning that in practice one-tailed tests are problematic.

Because *α* is set to 0.05, 5% of the null distribution is
in the critical region. Remember from the previous lab how the critical
region is defined when
*μ*_{1} > *μ*_{0}: the critical *z* can be
computed with the `NORMSINV` function in Excel:

\[\begin{align}z_{crit}&=\mathtt{NORMSINV}(1-\alpha/tails)\\&=\mathtt{NORMSINV}(1-0.05/1)\\&\approx 1.645\end{align}\]
Therefore, the critical region includes all values above:

\[\begin{align}\mu_0 + z_{crit}\sigma_{\bar{X}}&=100+1.645 \cdot 3\\&\approx 104.93\end{align}\]

If the sample mean is above this value, we will reject the null
hypothesis. If the null hypothesis is true, we will therefore make a Type I
error 5% of the time when we run this study with 25 people in the sample.
On the other hand, if the alternative hypothesis is true (and the
population mean is 106), we will sometimes correctly reject the null
hypothesis or incorrectly retain it (i.e., make a Type II error). To find
out how often we will make a Type II error (i.e., *β*), we can
use the `NORMDIST` function in Excel to find the proportion of the
alternative distribution that falls below the critical value:

\[\begin{align}\beta&=\mathtt{NORMDIST}(104.93,\ 106,\ 3,\ \mathtt{TRUE})\\&\approx 0.36\end{align}\]

To find out how often we will correctly reject the null hypothesis, we
simply subtract *β* from 1; here, power = 1 − 0.36 = 0.64.
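The same calculation can be sketched outside Excel. Python's standard-library `statistics.NormalDist` provides `inv_cdf` (the counterpart of `NORMSINV`) and `cdf` (the counterpart of cumulative `NORMDIST`):

```python
from math import sqrt
from statistics import NormalDist

mu0, mu1, sigma, N, alpha = 100, 106, 15, 25, 0.05
se = sigma / sqrt(N)                          # standard error = 3.0

z_crit = NormalDist().inv_cdf(1 - alpha)      # one-tailed: about 1.645
cutoff = mu0 + z_crit * se                    # about 104.93

# beta: chance that a sample mean drawn from the treated population
# (mean 106) still falls below the cutoff, so we retain the null
beta = NormalDist(mu1, se).cdf(cutoff)
power = 1 - beta
print(round(beta, 2), round(power, 2))        # 0.36 0.64
```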

So far,
we have referred to the null distribution of the sample mean. However, the term
*null distribution* typically refers to the distribution of a test
statistic, such as the *z* in a *z*-test. The null distribution of the *z* in a *z*-test
is the distribution that the observed *z* will have if the null hypothesis is
true. Like all *z*-scores, the observed *z* of the
*z*-test has a standard normal distribution
(*μ* = 0, *σ* = 1).

The alternative distribution in this example also has a standard deviation of 1, but it has a higher mean:

\[\begin{align}\mu_{z_1}&=\dfrac{\mu_1-\mu_0}{\sigma_{\bar{X}}}\\&=\dfrac{106-100}{3}\\&=2\end{align}\]
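Power computed on the z scale agrees with power computed on the sample-mean scale: it is the area of the alternative z distribution (mean 2, SD 1) above the critical z. A quick check:

```python
from statistics import NormalDist

z_crit = NormalDist().inv_cdf(0.95)   # one-tailed critical z, about 1.645
mu_z1 = (106 - 100) / 3               # mean of z under H1 = 2.0
power = 1 - NormalDist(mu_z1, 1).cdf(z_crit)
print(round(power, 2))                # 0.64
```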

When there are two populations, statistical power will be related to how big a difference there is between the two. This makes sense: Large effects are easy to detect whereas small effects are harder to detect. In the animation below, notice how power rises and falls with the distance between the null and alternative distributions.

It is possible to increase power by raising *α*, but this
is a bad strategy because it increases the chance you will make a Type I
error if the null hypothesis is true. Although it is your prerogative to
set *α* to any value you choose, it is everyone else’s
prerogative to distrust your research findings if you set *α*
too high. It is generally recommended that you keep *α* at the
same level as is common in your field of study. In most cases,
*α* = 0.05.

In the animation below, see how raising *α* increases power,
as well as the Type I error rate. You can also see the problem with
lowering *α* too low. Although doing so will lower your Type I
error rate, it will simultaneously decrease power. Note that the
trade-off between *α* and power is not a one-to-one linear
relationship.

One-tailed tests have more power than two-tailed tests, provided that you have specified the correct tail. If you specify the wrong tail, power is essentially 0, because the critical region sits in the tail where the effect will never appear, so there is no way to correctly reject the null hypothesis. For this reason, most researchers conduct two-tailed tests. Even though they are less powerful, they prevent the awkwardness of having to retain the null hypothesis when the mean difference is huge but in the opposite direction of what was expected.

In the animation below, note that the one-tailed test has a power of
0.64 whereas the two-tailed test has a power of 0.52. Like increasing
*α*, using a one-tailed test is generally a poor strategy for
increasing power.
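Assuming the animation uses the memory-treatment example from earlier (μ0 = 100, μ1 = 106, σ = 15, N = 25, α = 0.05), both power values can be reproduced with `statistics.NormalDist`:

```python
from math import sqrt
from statistics import NormalDist

mu0, mu1, sigma, N, alpha = 100, 106, 15, 25, 0.05
se = sigma / sqrt(N)
alt = NormalDist(mu1, se)   # alternative distribution of the sample mean
z = NormalDist()            # standard normal

# One-tailed test: all of alpha in the upper (correct) tail
cut_one = mu0 + z.inv_cdf(1 - alpha) * se
power_one = 1 - alt.cdf(cut_one)

# Two-tailed test: alpha split across both tails
z2 = z.inv_cdf(1 - alpha / 2)
power_two = (1 - alt.cdf(mu0 + z2 * se)) + alt.cdf(mu0 - z2 * se)

print(round(power_one, 2), round(power_two, 2))   # 0.64 0.52
```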

The distribution of sample means becomes narrow when the sample size is large. The standard error (the standard deviation of the distribution of sample means) has this formula:

\[\sigma_{\bar{X}}=\dfrac{\sigma}{\sqrt{N}}\]

It is easy to see that as the sample size grows, the standard error shrinks: dividing by the square root of a large number yields a small number.
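For example, with σ = 15, quadrupling the sample size halves the standard error:

```python
from math import sqrt

sigma = 15
for N in (25, 100, 400):
    se = sigma / sqrt(N)
    print(N, se)   # 25 -> 3.0, 100 -> 1.5, 400 -> 0.75
```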

The animation below refers to the null and alternative distributions of
the sample means, not of the observed *z*. As the sample size
increases, the sampling distributions become narrower and therefore have
less overlap.

Researchers rarely have any control over the standard deviation of
variables. However, for the sake of completeness, we should note that a
large population standard deviation reduces power. Remember that the
*z*-score formula has a standard deviation in the denominator.
Dividing by a large number makes the *z* smaller. In order to reject
the null hypothesis, *z* has to be big (i.e., far from 0).

There is a way to decrease the standard deviation: Study people who are mostly alike. There are sometimes good reasons to do this but this strategy of increasing power has a clear disadvantage. By studying people who are very similar to each other, your findings will apply only to that narrow segment of the population from which the sample was drawn. There will be legitimate doubts about whether the findings will generalize to any other kinds of people.

Here is an Excel spreadsheet that illustrates aspects of statistical power.

Download the file and open it.

Let’s walk through how to use it.

Suppose you collect data from *N* = 30 people and the
sample mean is 55. You think that this sample did not come from a
population with *μ*_{0} = 50 and
σ_{0} = 20. You didn’t specify whether you
expected your sample to have a higher or lower mean than the population, so
this is a two-tailed hypothesis. The null hypothesis is that the sample
does come from that population with
*μ*_{0} = 50 and
*σ*_{0} = 20. Now, suppose that, in reality
and unbeknownst to you, the sample came from a different population with
*μ*_{1} = 55 and *σ*_{1} = 20.
First, set the *Null Hypothesis is* option box to
*False*.

Next, set the *Hypothesis Type* option box to
*two-tailed (H_{1} ≠ H_{0})* like
this:

Set the distribution mean for the null hypothesis
(H_{0}) to 50 and the standard deviation to 20.

Set the *treatment* distribution (H_{1}) mean
to 55 and the standard deviation to 20.

Set the *α* level to 0.05.

Set the sample size (n) to 30.

It should look like this:

The graph below should look something like this:

The blue distribution is the distribution of sample means of the null distribution and the red distribution is the distribution of sample means of the alternative distribution.

The green vertical line is the sample mean. Right now it falls on 55.

The critical regions are shaded in blue. When the vertical line (the sample mean) falls in a critical region, the null hypothesis is rejected.

The red shaded regions represent statistical power: the part of the alternative distribution that lies beyond the critical boundaries. A very small red portion also falls inside the blue critical region on the left side; it is hidden behind the blue shading, but it is still included in the calculation of statistical power.

Notice that the sample mean is 55, exactly the same as the alternative distribution mean. You’d think that we’d reject the null hypothesis in a situation like this, right? Wrong. Remember, researchers never know the mean of the alternative distribution. Unfortunately, the sample mean did not fall in a critical region so we must retain the null hypothesis. Thus, unbeknownst to us, we have made a Type II error.

It turns out that making a Type II error was very likely in this
situation. If you look at the *Type II Error Rate (*β) box,
you’ll see that the probability of making a Type II error was quite
high (about 72%). Power was only about 0.28, meaning that randomly sampling
30 people from the alternative distribution would result in a correct
decision only 28% of the time. If you were a researcher, you would not want
to put in hours of toil for only a 28% chance of success. You would want to
improve your power substantially.
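The spreadsheet's β and power figures can be checked directly. A sketch in Python for the two-tailed setup just described (μ0 = 50, μ1 = 55, σ = 20, N = 30, α = 0.05):

```python
from math import sqrt
from statistics import NormalDist

mu0, mu1, sigma, N, alpha = 50, 55, 20, 30, 0.05
se = sigma / sqrt(N)                      # about 3.65
alt = NormalDist(mu1, se)                 # alternative distribution of the mean

z2 = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96
power = (1 - alt.cdf(mu0 + z2 * se)) + alt.cdf(mu0 - z2 * se)
beta = 1 - power
print(round(beta, 2), round(power, 2))    # 0.72 0.28
```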

If you had specified a one-tailed hypothesis, your power would have improved somewhat (if you guessed in the right direction).

Set the *Hypothesis Type* option box to
*one-tailed (H_{1} > H_{0})*.

ReggieNet: What is your power now?

ReggieNet: Does the sample mean fall in the critical region now that you have a one-tailed hypothesis?

Set the *Hypothesis Type* option box back to
*two-tailed (H_{1} ≠ H_{0})*.

Raising *α* is the easiest way to raise power. It is also
the stupidest. Don’t do it when conducting real data analysis! People
will laugh at you and won’t take you seriously anymore. ;)

However, just to see the effect of raising it, temporarily raise
*α* from 0.05 to 0.2.

ReggieNet: What happened to the size of the critical
regions when you raised *α* from 0.05 to 0.20?

**Hint**: Remember to set the *Hypothesis
Type* option box back to *two-tailed
(H_{1} ≠ H_{0})*.

ReggieNet: Does the sample mean fall in the critical region now that you have raised α from 0.05 to 0.2?

You can see that power increased from 0.28 to 0.54. On the surface, this
would seem to be a good thing. It is not. The problem with raising
*α* is that it increases the frequency of Type I errors if the
null hypothesis is true. Type I errors are false facts and they are harder
to get rid of than Type II errors once they’ve been entered into the
scientific literature.

Reset *α* to 0.05 again.

The most effective way to increase power that is directly under the researcher’s control is to increase the sample size.

ReggieNet: What happens to the distributions in the
graph when you change the sample size from *N* = 30 to
*N* = 100 (Remember that α must be set to 0.05)?

The result you noticed in question 5 should not be surprising and you
probably predicted what would happen before you did it. The graph is the
distribution of sample means. The width of this distribution is related to
the standard deviation of the distribution. The standard deviation of the
distribution of sample means is also called the standard error. You may
recall that the standard error is the original population standard
deviation divided by the square root of *N*, so you can guess what
happens to the standard error when you divide by the square root of a large
number.

Try experimenting with different values of *N* and
see what happens.

ReggieNet: What is the lowest *N* that causes
the two-tailed null hypothesis to be rejected?

**Hint**: Pay attention to the
*Decision* box in the upper right corner. It will change to red when
the null hypothesis is rejected.

Right now, the two distribution means are 0.25 standard deviations apart:
\[\dfrac{\mu_1-\mu_0}{\sigma} =\dfrac{55-50}{20}=0.25 \] Suppose that the
alternative distribution had *μ* = 60 instead of
*μ* = 55. Now the distributions are 0.50 standard
deviations apart: \[\dfrac{\mu_1-\mu_0}{\sigma} =\dfrac{60-50}{20}=0.50 \]
If *N* is set back to 30, you can compare the original power we
started with (0.28) to what you have now. As you can see, the odds of
correctly rejecting the null hypothesis have gone up considerably.
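Both standardized distances are simple arithmetic:

```python
# Standardized distance between the null and alternative means
mu0, sigma = 50, 20
for mu1 in (55, 60):
    print(mu1, (mu1 - mu0) / sigma)   # 55 -> 0.25, 60 -> 0.5
```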

ReggieNet: What is the power now that
*μ* = 60 for the alternative distribution?

**Hint**: Remember to set *N* back to
30.

ReggieNet: If the sample mean is also set to 60, is the null hypothesis rejected or retained?

Statistical power is the probability of correctly rejecting the null hypothesis when the null hypothesis is false. It is influenced by:

Influence | Direction
---|---
Type of hypothesis | One-tailed test → more power
Type I error rate | Larger *α* → more power
Sample size | Larger *N* → more power
Effect size | Larger mean difference → more power
Population variability | Smaller *σ* → more power

A **Type I error** is when the null hypothesis is true but
is rejected by the researcher.

A **Type II error** is when the alternative hypothesis is
true but the researcher fails to find evidence for it and therefore
retains the null hypothesis.

The **null distribution** is the sampling distribution of
a statistic when the null hypothesis is assumed to be true.