The birthday simulation (AE-5)

Important

This application exercise is due on 14 Oct at 2:00pm.

Computers let you assemble, manipulate, and visualize data sets, all at speeds that would have wowed yesterday’s scientists. In short, computers give you superpowers! But if you wish to use them, you’ll need to pick up some programming skills. Steve Job said that “computers are bicycles for our minds” because the efficiency rating of humans on bicycles is so incredible, and computers give us similar powers.

One reason computers are so incredible is that they allow us to simulate a multitude of events. Today we will simulate the probability that two of you in this section of Stat1010 have the same birthdays.

library(tidyverse) # for data manipulation
library(vctrs) # to find the length of a tibble

Suppose you are in a classroom with 100 people. If we assume this is a randomly selected group of 100 people, what is the chance that at least two people have the same birthday? Although it is somewhat advanced, we can deduce this mathematically. We will do this later. Here we use a Monte Carlo simulation. For simplicity, we assume nobody was born on February 29. This actually doesn’t change the answer much.

First, note that birthdays can be represented as numbers between 1 and 365, so a sample of 100 birthdays can be obtained like this:

all_bdays <- as_tibble(1:365) # data from where we can sample

bdays_class <- 
  all_bdays %>% # data from where to sample
  slice_sample(n = 100, replace = TRUE) # sample of size 100 with replacement

To check if in this particular set of 100 people we have at least two with the same birthday, we can use the functions group_by and filter, which returns a tibble with a vector of duplicated dates. Here is an example:

bdays_class %>% 
  group_by(value) %>% # for each birthday
  count() %>% # count them
  filter(n > 1) # those have more than 1

To estimate the probability of a shared birthday in the group, we repeat this experiment by sampling sets of 100 birthdays over and over. Prior to replicating, we need to write this as a function.

bdays_dups <- 
  all_bdays %>% # data from where to sample
  slice_sample(n = 100, replace = TRUE) %>% # take a sample of n 100
  group_by(value) %>% # for each birthday
  count() %>% # count them
  filter(n > 1) %>% # those have more than 1
  vec_n(.) # the number that are duplicated

A function has an input (in this case $n$)

What does $n$ represent in this coding?

num_of_same_birthdays <- function(n){
  all_bdays %>% # data from where to sample
  slice_sample(n = n, replace = TRUE) %>% # take a sample
  group_by(value) %>% # for each birthday
  count() %>% # count them
  filter(n > 1) %>% # those have more than 1
  vec_size(.)  # the number that are duplicated
}

To run this function, we do this:

num_of_same_birthdays(10) # run the function

How many students are there in class today?
Run the coding above but include the number of students in class today.

The law of large numbers says that if we do this many times we should have an estimate of the number of people that have the same birthdays in class today. The function replicate can be used to run functions multiple times.

B <- 500 # the number of times to run
results <- replicate(B, # replicate this number of times
          ifelse( # if there are some duplicated birthdays
            num_of_same_birthdays(50) >= 1, # our function
            1, # give me a 1
            0)) # if not, give me a 0

mean(results) # take the mean

Were you expecting the probability to be this high?

People tend to underestimate these probabilities. To get an intuition as to why it is so high, think about what happens when the group size is close to 365. At this stage, we run out of days and the probability is one.

Say we want to use this knowledge to bet with friends about two people having the same birthday in a group of people. When are the chances larger than 50%? Larger than 75%?

Let’s create a look-up table. We can create a function to compute this for any group size. This function runs our function $500$ times for every value of $n$. Statisticians will usually do this $10000$ times, but this will take a very very very very looooooooooonnnnnggggg time, so we are simplifying.

compute_prob <- # name the function
  function(n, B = 500){ # function inputs
  results <- # store results
    replicate(B, # run num_of_same_birthdays B times 
              ifelse( # if there are some duplicated birthdays
            num_of_same_birthdays(n) >= 1, # our function
            1, # give me a 1
            0)) # if not, give me a 0
  
  mean(results) # find the mean
}

Using the function map_dbl, we can perform element-wise operations on any function. Note that this may take awhile to run.

n <- seq(1, 60) # which values of n are important
prob <- # save the results as prob
  map_dbl(n,  # for each value of n 
          compute_prob) # run the function

We can now make a plot of the estimated probabilities of two people having the same birthday in a group of size n:

as_tibble(n, prob) %>% # the format for ggplot
  ggplot() + # draw a graph
  geom_point(aes(x = n, y = prob)) # of points

Now let’s compute the exact probabilities rather than use Monte Carlo approximations. Not only do we get the exact answer using math, but the computations are much faster since we don’t have to generate experiments.

To make the math simpler, instead of computing the probability of it happening, we will compute the probability of it not happening, or the compliment. For this, we use the multiplication rule.

Let’s start with the first person. The probability that person 1 has a unique birthday is 1. The probability that person 2 has a unique birthday, given that person 1 already took one, is 364/365. Then, given that the first two people have unique birthdays, person 3 is left with 363 days to choose from. We continue this way and find the chances of all $n$ people having a unique birthday is:

\[1×\frac{364}{365}×\frac{363}{365}…\frac{365−n+1}{365}\] We can write a function that does this for any number:

exact_prob <- function(n){
  1 - 
    prod(365:(365-n+1))/ # the product from the numerator above
    365^(n) # the denominator from above
}

Run this coding for the number of people that are in class today.

Now, we run this on multiple values of $n$

eprob <- map_dbl(n, exact_prob) # run on multiple values of n

Next we plot the results.

as_tibble(n, eprob) %>% # the format for ggplot
  ggplot() + # draw a graph
  geom_point(aes(x = n, y = eprob)) # of points

On mercury, the year is only 88 days. Update the coding above to simulate the expected number of people with the same birthdays on Mercury.

Assuming we have the same number of people, would you expect the number of people with the same birthdays to be higher or lower than those on earth?