Chapter 16: Statistical tests

STAT 1010 - Fall 2022

Learning outcomes

  • Vocabulary: alternative hypothesis, null hypothesis, one-sided and two-sided hypotheses, and test statistic
  • Understand the differences between a statistical test for proportions and for means.
  • Recognize the two types of error, Type I and Type II, when designing a test
  • Distinguish between statistical and substantive significance
  • Relate confidence intervals to two-sided tests

Does ESP exist?

Some people believe in the presence of Extrasensory Perception, or ESP. How could we prove whether it does or doesn’t exist?

One test for ESP is with Zener cards.

Does ESP exist?

Since there are 5 cards, someone guessing at random would pick the correct card with probability \(p=1/5\), or 20% of the time.

A person with ESP should be able to guess the correct card more often than 20%. But how much more often do they need to get it right for us to believe that ESP exists?
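One way to make this question concrete is with the binomial distribution. The sketch below assumes a hypothetical session of 25 guesses (the number of guesses is not given above) and looks for the smallest number of correct answers that pure guessing would reach less than 5% of the time.

```r
# Hypothetical session: 25 guesses, each correct with probability 1/5 under pure guessing
n_guesses <- 25      # assumption, not from the slides
p_guess   <- 1/5

# P(X >= k) for k = 0, ..., 25 when X ~ Binomial(n_guesses, p_guess)
k <- 0:n_guesses
upper_tail <- pbinom(k - 1, size = n_guesses, prob = p_guess, lower.tail = FALSE)

# Smallest k whose upper-tail probability falls below 5%
k_min <- min(k[upper_tail < 0.05])
k_min
round(upper_tail[k == k_min], 4)
```

Under these assumptions, a guesser would need well above the expected 5 correct answers before the result starts to look unusual. The 5% cutoff is itself a choice, which the significance level formalizes later in the chapter.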

Statistical test

One way to determine something like this is to use a statistical test. A statistical test is a procedure to determine if the results from the sample are convincing enough to allow us to conclude something about the population.


When we perform a statistical test, we set out our hypotheses before we begin. There are two hypotheses:

  • null hypothesis \(H_0\): a statement about there being no effect, or no difference.
  • alternative hypothesis \(H_A\): the thing that we secretly hope will turn out to be true.

ESP hypotheses

Thinking about the ESP experiment, we could use words to state our hypotheses

\(\begin{eqnarray*} &H_0:& \text{ ESP does not exist} \\ &H_A:& \text{ESP exists} \end{eqnarray*}\)

We could also write the hypotheses in terms of parameters,

\(\begin{eqnarray*} H_0: p \leq 1/5 \\ H_A: p > 1/5 \end{eqnarray*}\)


  • Hypotheses are always written about the population parameter (\(\mu\), \(p\), \(\mu_1-\mu_2\), \(p_1-p_2\)), never about the sample statistics (\(\bar{x}\), \(\hat{p}\), \(\bar{x}_1-\bar{x}_2\), \(\hat{p}_1-\hat{p}_2\)).
  • The null hypothesis is the boring thing that we’re trying to gather evidence against
  • The alternative hypothesis is the exciting thing that would make headlines

One and two sided hypotheses

  • \(H_A\) has a \(>\) sign: upper tail, or “one-sided”
  • \(H_A\) has a \(<\) sign: lower tail, or “one-sided”
  • \(H_A\) has a \(\neq\) sign: both sides, or “two-sided”
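The direction of \(H_A\) decides which tail of the null distribution counts as evidence against \(H_0\). As a rough sketch, assuming a test statistic that is standard normal under \(H_0\) and an illustrative value of 1.2, the three tail areas (the p-values used later in the chapter) could be computed like this:

```r
# Illustrative test statistic (an assumption for this sketch)
z <- 1.2

p_upper <- pnorm(z, lower.tail = FALSE)   # H_A has ">": area above z
p_lower <- pnorm(z)                       # H_A has "<": area below z
p_two   <- 2 * pnorm(-abs(z))             # H_A has "!=": area in both tails

round(c(upper = p_upper, lower = p_lower, two_sided = p_two), 3)
```

For the same data, the two-sided tail area is twice the smaller one-sided area, so a two-sided test needs stronger evidence to reach the same conclusion.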

Example 1

What are the null and alternative hypotheses for these decisions?

  • A person interviews for a job opening. The company has to decide whether to hire the person
  • An inventor proposes a new way to wrap packages that they say will speed up the manufacturing process. Should they adopt the new method?
  • A sales representative submits receipts from a recent business trip. Staff must determine whether the claims are legitimate.

Example 1 solns

  • \(H_0\): Person is not hired
  • \(H_0\): Retain the current method
  • \(H_0\): Treat claims as legitimate. The employee is innocent until evidence to the contrary is found.

Past tests

Example 2

A snack-food chain runs a promotion in which shoppers are told that 1 in 4 kids’ meals includes a prize. A father buys two kids’ meals, and neither has a prize. He concludes that because neither has a prize, the chain is being deceptive.

  1. What are the null and alternative hypotheses?
  2. Describe a Type I error.
  3. Describe a Type II error.
  4. What is the probability that the father has made a Type I error?

Example 2 - solns

  1. \(H_0\) = chain is being honest; \(H_A\) = chain is not being honest
  2. Falsely accusing the chain of being deceptive
  3. Failing to realize that the chain is being deceptive
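  4. Father rejects \(H_0\) if both meals are missing a prize. If the chain is honest, \(P(\text{neither has a prize}) = (1 - 1/4)^2 \approx 0.56\), so the probability of a Type I error is about \(0.56\).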
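The arithmetic in part 4 can be checked in R. This is a minimal sketch that assumes the two meals are independent and that each contains a prize with the advertised probability of 1/4; the variable names are illustrative.

```r
p_prize <- 1/4   # advertised chance that a kids' meal has a prize
n_meals <- 2

# Probability that none of the meals has a prize when the chain is honest
p_no_prize <- (1 - p_prize)^n_meals
p_no_prize

# Same value from the binomial distribution: P(X = 0) for X ~ Binomial(2, 1/4)
dbinom(0, size = n_meals, prob = p_prize)
```

With a rejection rule this easy to trigger, an honest chain would be accused more than half the time, which is why two meals are far too few to decide.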

Test statistic

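In the previous example, we used probability to determine how likely it is to get two meals, neither of which contains a prize. This process involves a test statistic: a number computed from the sample (here, the number of prizes found, 0) whose probability under \(H_0\) (\(0.56\)) tells us how surprising the data are.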

Testing a proportion

To test a proportion, we must have a distribution with which to compare the sample proportion \(\hat{p}\). The distribution that we use is the one assumed under \(H_0\); we use \(p_0\) to denote the hypothesized proportion.

\[\hat{p} \sim N\left(p_0, \frac{p_0(1-p_0)}{n}\right) \]
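A quick simulation can make the normal model for the sample proportion more tangible. The sketch below uses hypothetical values (\(p_0 = 0.2\), \(n = 100\), and an arbitrary seed) and compares the simulated mean and standard deviation of \(\hat{p}\) with \(p_0\) and \(\sqrt{p_0(1-p_0)/n}\).

```r
set.seed(1010)            # arbitrary seed, for reproducibility only

p0   <- 0.2               # hypothesized proportion
n    <- 100               # hypothetical sample size
reps <- 10000             # number of simulated samples

# Simulate many sample proportions when H_0 is true
p_hat <- rbinom(reps, size = n, prob = p0) / n

c(simulated_mean = mean(p_hat), theory_mean = p0)
c(simulated_sd = sd(p_hat), theory_sd = sqrt(p0 * (1 - p0) / n))
```

A histogram of the simulated values, for example `hist(p_hat)`, should look roughly bell-shaped when the sample-size conditions hold.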

Testing a proportion - contd

Under \(H_0\), we find

\[z_0 = \frac{\hat{p} - p_0} {\sqrt{\frac{p_0(1-p_0)}{n}}} \]


  • SRS and sample must be < \(10\%\) of population
  • Both \(np_0\) and \(n(1-p_0)\) are larger than 10
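The formula and the sample-size condition can be wrapped into a small helper. This is only a sketch: `prop_z_test` is a hypothetical name, not a built-in R function, and the data in the usage line are invented.

```r
# One-proportion z-test from counts; `prop_z_test` is a hypothetical helper name
prop_z_test <- function(x, n, p0, alternative = c("greater", "less", "two.sided")) {
  alternative <- match.arg(alternative)

  # Sample-size condition from the list above
  if (n * p0 <= 10 || n * (1 - p0) <= 10) {
    warning("n*p0 and n*(1 - p0) should both be larger than 10")
  }

  p_hat <- x / n
  z0 <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

  # Tail area that matches the direction of H_A
  p_value <- switch(alternative,
    greater   = pnorm(z0, lower.tail = FALSE),
    less      = pnorm(z0),
    two.sided = 2 * pnorm(-abs(z0))
  )
  list(p_hat = p_hat, z0 = z0, p_value = p_value)
}

# Hypothetical data: 30 successes in 100 trials, testing p0 = 0.2
res <- prop_z_test(30, 100, p0 = 0.2, alternative = "greater")
res
```

The helper cannot check the SRS and 10% conditions, since those depend on how the data were collected.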

Example 3

In the ESP example, we know that the population average for guessing ESP cards is 9 out of 24 cards. An interested participant took a training course to enhance their ESP. In the follow-up exam, they guessed 17 out of 36 cards correctly. Did the course improve their ESP abilities?

Steps for a hypothesis test

  1. State hypotheses
  2. Determine the level of significance
  3. Check conditions
  4. Calculate test statistic
  5. Compute p-value
  6. Generic conclusion
  7. Interpret in context

Example 3 - solns

  1. \(H_0: p \leq 9/24\) and \(H_A: p > 9/24\)
  2. \(\alpha = 0.05\)
  3. Yes: \(np_0 = 36 \times 9/24 = 13.5\) and \(n(1-p_0) = 36 \times 15/24 = 22.5\) are both larger than 10
  4. \(\hat{p} = \frac{x}{n} = \frac{17}{36} = 0.472\), so \(\begin{aligned} z_0 &= \frac{\hat{p} - p_0} {\sqrt{\frac{p_0(1-p_0)}{n}}} \\ &= \frac{0.472 - 0.375} {\sqrt{\frac{0.375(1-0.375)}{36}}} \\ &= \frac{0.0972}{0.0807} \approx 1.2 \end{aligned}\)

Example 3 - solns

  5. 1 - pnorm(1.2) \(\approx 0.115\), or, without rounding \(z_0\), 1 - pnorm((17/36 - 9/24)/sqrt((0.375*(1-0.375))/36)) \(\approx 0.114\)
  6. Since \(0.115 > \alpha = 0.05\), this is not significant. We fail to reject \(H_0\).
  7. The score does not provide evidence that the course improved ESP.
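The whole of Example 3 can be reproduced in a few lines of R. The sketch below follows the steps listed earlier; the variable names are illustrative. Because \(z_0\) is not rounded to 1.2 here, the p-value comes out as about 0.114 instead of 0.115.

```r
x  <- 17        # correct guesses on the follow-up exam
n  <- 36        # cards attempted
p0 <- 9 / 24    # null value
alpha <- 0.05

# Step 3: sample-size condition
c(n * p0, n * (1 - p0))              # 13.5 and 22.5

# Step 4: test statistic
p_hat <- x / n
z0 <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# Step 5: upper-tail p-value, because H_A is p > 9/24
p_value <- pnorm(z0, lower.tail = FALSE)

round(c(p_hat = p_hat, z0 = z0, p_value = p_value), 3)
p_value < alpha                      # FALSE: fail to reject H_0

# Cross-check with base R; correct = FALSE turns off the continuity correction
check <- prop.test(x, n, p = p0, alternative = "greater", correct = FALSE)$p.value
check
```

`prop.test()` reports a chi-squared statistic, which for a single proportion is \(z_0^2\), so its one-sided p-value should agree with the hand calculation.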

Testing a mean

To test a mean, we must have a distribution with which to compare the sample mean \(\bar{x}\). The distribution that we use is the one assumed under \(H_0\); we use \(\mu_0\) to denote the hypothesized mean. Because we don’t know \(\sigma\), we will once again use \(s\) from the sample.

Testing a mean - contd

Under \(H_0\), we find the test statistic, which we compare to a \(t\) distribution with \(n - 1\) degrees of freedom:

\[t = \frac{\bar{X} - \mu_0} {s/\sqrt{{n}}} \]


  • SRS and sample must be < \(10\%\) of population
  • If we do not know if the population is normal, \(n > 10|K_4|\), where \(K_4\) is the kurtosis of the sample
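The sample-size condition can be checked from the data. The sketch below assumes that \(K_4\) is the sample kurtosis, computed as the average of the fourth powers of the standardized values minus 3; `kurtosis_check` is a hypothetical helper name and the data are invented.

```r
# Check the sample-size condition n > 10 * |K4|.
# Assumption: K4 is the sample kurtosis, the average of z^4 minus 3.
kurtosis_check <- function(x) {
  n  <- length(x)
  z  <- (x - mean(x)) / sd(x)     # standardized values
  k4 <- mean(z^4) - 3             # close to 0 for normal data
  list(n = n, k4 = k4, ok = n > 10 * abs(k4))
}

# Hypothetical data: the integers 1 to 20
res <- kurtosis_check(1:20)
res
```

Heavier tails or outliers push \(|K_4|\) up, so more observations are needed before the \(t\) procedure can be trusted.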

Example 4

Let \(\bar{x} = 3281\), \(s = 529\), and \(n = 59\). Test whether the sample comes from a population whose mean is less than \(\mu_0 = 4000\).

Steps for a hypothesis test

  1. State hypotheses
  2. Determine the level of significance
  3. Check conditions
  4. Calculate test statistic
  5. Compute p-value
  6. Generic conclusion
  7. Interpret in context

Example 4 - solns

  1. \(H_0: \mu \geq 4000\) and \(H_A: \mu < 4000\)
  2. \(\alpha = 0.05\)
  3. Yes
  4. \(\bar{x} = 3281\), so \(\begin{aligned} t &= \frac{\bar{X} - \mu_0} {s/\sqrt{{n}}}\\ &= \frac{3281-4000}{529/\sqrt{59}} \\ &= \frac{-719}{68.87} \approx -10.44 \end{aligned}\)

Example 4 - solns

  5. pt(-10.44, df = 58) or pt((3281-4000)/(529/sqrt(59)), df = 58) \(\approx 3.07 \times 10^{-15}\)
  6. Since \(3.07 \times 10^{-15} < \alpha = 0.05\), this is significant. We reject \(H_0\).
  7. There is very strong evidence that the population mean is less than 4000.
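Example 4 can be reproduced from the summary statistics alone. This is a minimal sketch with illustrative variable names; it uses the lower tail of the \(t\) distribution because \(H_A\) points below 4000.

```r
x_bar <- 3281
s     <- 529
n     <- 59
mu0   <- 4000
alpha <- 0.05

# Step 4: test statistic
se <- s / sqrt(n)                 # estimated standard error of the mean
t_stat <- (x_bar - mu0) / se

# Step 5: lower-tail p-value with n - 1 degrees of freedom
p_value <- pt(t_stat, df = n - 1)

c(se = se, t = t_stat, p_value = p_value)
p_value < alpha                   # TRUE: reject H_0
```

With the raw observations in a vector `x`, `t.test(x, mu = 4000, alternative = "less")` would carry out the same test in one call.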

Do extremists see the world in black and white?

Researcher Matt Motyl ran a study in 2010 with 2,000 participants. He wanted to determine if political moderates were able to perceive shades of grey more accurately than people on the far left or the far right. The p-value from the study was 0.01. Reject the null! Evidence in support of the idea that moderates can see grey better.

But then… he re-ran the study. This time, the p-value was 0.59. Not even close to significant.

What happened?!

Via Regina Nuzzo’s Nature article, Scientific method: Statistical Errors


There are two types of errors defined in hypothesis testing:

  • Type I error, rejecting a true null
  • Type II error, not rejecting a false null

Law analogy

In the US, a person is innocent until proven guilty, and guilt must be proven “beyond a reasonable doubt.” We can make two types of mistakes:

  • Convict an innocent person (type I error)
  • Release a guilty person (type II error)

Extremists and black and white

Three possibilities:

  • the original study (p-value 0.01) made a Type I error, and \(H_0\) was really true
  • the second study (p-value 0.59) made a Type II error, and \(H_A\) is really true
  • maybe there were no errors made, just different studies found different things

Multiple testing

Because the probability of a Type I error is \(\alpha\), if you run many tests in which \(H_0\) is true, you should expect about a fraction \(\alpha\) of them to come out significant just by chance.

With \(\alpha = 0.05\), if you do 100 such tests, you should expect about 5 of them to be significant, just by chance.

This is the problem of multiple testing.
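A simulation shows the effect directly. The sketch below uses hypothetical settings (2,000 studies of 30 observations each, an arbitrary seed) and draws every sample from a population where \(H_0\) really is true, so every rejection is a Type I error.

```r
set.seed(16)          # arbitrary seed, for reproducibility only
alpha   <- 0.05
n_tests <- 2000       # hypothetical number of independent studies
n       <- 30         # hypothetical sample size per study

# Every study samples from a population whose mean really is 0, so H_0: mu = 0 is true
p_values <- replicate(n_tests, t.test(rnorm(n), mu = 0)$p.value)

# Share of studies that reject H_0 anyway: each one is a Type I error
false_positive_rate <- mean(p_values < alpha)
false_positive_rate

# Chance of at least one false positive in 100 independent tests of true nulls
p_at_least_one <- 1 - (1 - alpha)^100
p_at_least_one
```

The simulated rate should land near \(\alpha\), and with 100 independent tests at least one false positive is close to certain.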

Multiple testing + publication bias

Okay, so about a fraction \(\alpha\) of tests of a true null show significance, just by random chance.

And things that look significant get published…

That means that a fair number of things that are published are actually false! This is pretty scary.

How to fix the problem

As a researcher:

  • make sure your results can be replicated (like Motyl tried to do with the politics and grey study)
  • publish code and data so others can study your work

As someone who reads about statistics:

  • be skeptical about claims that are just one of many tests
  • look for replication and reproducibility!

Reducing the probability of Type II error

In order to reduce the probability of making a Type II error, we can either

  • increase the significance level \(\alpha\) (at the cost of more Type I errors)
  • increase the sample size
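Both effects can be seen by computing the power of a test, which is one minus the probability of a Type II error. The sketch below uses the normal approximation for an upper-tail test of a proportion; `power_prop` is a hypothetical helper, and the "true" proportion of 0.47 is an illustrative assumption, since in practice it is unknown.

```r
# Approximate power of the upper-tail z-test for a proportion.
# p_true is the hypothetical true proportion; the test never knows it.
power_prop <- function(n, alpha, p0, p_true) {
  # Smallest sample proportion that leads to rejecting H_0
  cutoff <- p0 + qnorm(1 - alpha) * sqrt(p0 * (1 - p0) / n)
  # Probability that p-hat lands above the cutoff when p = p_true
  pnorm(cutoff, mean = p_true, sd = sqrt(p_true * (1 - p_true) / n),
        lower.tail = FALSE)
}

p0 <- 0.375; p_true <- 0.47   # illustrative values

base_power   <- power_prop(n = 36,  alpha = 0.05, p0, p_true)   # baseline
alpha_power  <- power_prop(n = 36,  alpha = 0.10, p0, p_true)   # larger alpha
sample_power <- power_prop(n = 144, alpha = 0.05, p0, p_true)   # larger sample

round(c(baseline = base_power, larger_alpha = alpha_power,
        larger_n = sample_power), 3)
```

Raising \(\alpha\) buys power by accepting more Type I errors, while a larger sample raises power without that trade-off.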

Statistical versus practical significance

Sometimes, you find something that is very statistically significant, but not practically significant.

For example, a revision program might increase final exam grades with \(p\text{-value} < 0.00001\), but it may only increase final exam grades by \(1\%\). This is statistically significant but not practically significant.
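Large samples are what make this possible. In the sketch below, all numbers are invented for illustration: an average gain of 1 percentage point, a standard deviation of 10, and 10,000 students.

```r
# Hypothetical revision-program study (all numbers invented for illustration)
mean_gain <- 1        # average improvement, in percentage points
sd_gain   <- 10       # standard deviation of the improvements
n         <- 10000    # number of students

# Upper-tail t-test of H_0: mean gain <= 0
t_stat  <- (mean_gain - 0) / (sd_gain / sqrt(n))
p_value <- pt(t_stat, df = n - 1, lower.tail = FALSE)

c(t = t_stat, p_value = p_value)
```

The p-value is tiny because the standard error shrinks with \(\sqrt{n}\), not because the gain is large. Whether 1 point matters is a question for the context, not for the test.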

Context is important!

Confidence interval or test

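These two are mostly interchangeable: a two-sided test at level \(\alpha\) rejects \(H_0: \mu = \mu_0\) when \(\mu_0\) falls outside the \(100(1-\alpha)\%\) confidence interval for \(\mu\). However, a test only provides a negative statement (reject or fail to reject) and does not give us much information about parameter values.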
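The link between a confidence interval and a two-sided test can be checked numerically. The sketch below reuses the summary statistics from Example 4 and compares the 95% interval for the mean with two-sided \(t\)-tests of two candidate values; the value 3300 is chosen only for illustration.

```r
x_bar <- 3281; s <- 529; n <- 59   # summary statistics from Example 4
alpha <- 0.05
se <- s / sqrt(n)

# 95% confidence interval for the mean
ci <- x_bar + c(-1, 1) * qt(1 - alpha / 2, df = n - 1) * se
round(ci, 1)

# Two-sided p-value for H_0: mu = mu0
two_sided_p <- function(mu0) 2 * pt(-abs((x_bar - mu0) / se), df = n - 1)

# Compare "inside the interval" with "rejected by the test" for each candidate
for (mu0 in c(4000, 3300)) {
  inside <- mu0 >= ci[1] && mu0 <= ci[2]
  cat(mu0, "inside CI:", inside, " rejected:", two_sided_p(mu0) < alpha, "\n")
}
```

The interval also shows which values of \(\mu\) are plausible, which is the extra information a test alone does not give.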