library(tidyverse) # for data manipulation
library(vctrs) # to find the length of a tibble
The birthday simulation (AE-5)
Computers let you assemble, manipulate, and visualize data sets, all at speeds that would have wowed yesterday’s scientists. In short, computers give you superpowers! But if you wish to use them, you’ll need to pick up some programming skills. Steve Job said that “computers are bicycles for our minds” because the efficiency rating of humans on bicycles is so incredible, and computers give us similar powers.
One reason computers are so incredible is that they allow us to simulate a multitude of events. Today we will simulate the probability that two of you in this section of Stat1010 have the same birthdays.
Suppose you are in a classroom with 100 people. If we assume this is a randomly selected group of 100 people, what is the chance that at least two people have the same birthday? Although it is somewhat advanced, we can deduce this mathematically. We will do this later. Here we use a Monte Carlo simulation. For simplicity, we assume nobody was born on February 29. This actually doesn’t change the answer much.
First, note that birthdays can be represented as numbers between 1 and 365, so a sample of 100 birthdays can be obtained like this:
<- as_tibble(1:365) # data from where we can sample
all_bdays
<-
bdays_class %>% # data from where to sample
all_bdays slice_sample(n = 100, replace = TRUE) # sample of size 100 with replacement
To check if in this particular set of 100 people we have at least two with the same birthday, we can use the functions group_by
and filter
, which returns a tibble with a vector of duplicated dates. Here is an example:
%>%
bdays_class group_by(value) %>% # for each birthday
count() %>% # count them
filter(n > 1) # those have more than 1
To estimate the probability of a shared birthday in the group, we repeat this experiment by sampling sets of 100 birthdays over and over. Prior to replicating, we need to write this as a function.
<-
bdays_dups %>% # data from where to sample
all_bdays slice_sample(n = 100, replace = TRUE) %>% # take a sample of n 100
group_by(value) %>% # for each birthday
count() %>% # count them
filter(n > 1) %>% # those have more than 1
vec_n(.) # the number that are duplicated
A function has an input (in this case \(n\))
- What does \(n\) represent in this coding?
<- function(n){
num_of_same_birthdays %>% # data from where to sample
all_bdays slice_sample(n = n, replace = TRUE) %>% # take a sample
group_by(value) %>% # for each birthday
count() %>% # count them
filter(n > 1) %>% # those have more than 1
vec_size(.) # the number that are duplicated
}
To run this function, we do this:
num_of_same_birthdays(10) # run the function
How many students are there in class today?
Run the coding above but include the number of students in class today.
The law of large numbers says that if we do this many times we should have an estimate of the number of people that have the same birthdays in class today. The function replicate
can be used to run functions multiple times.
<- 500 # the number of times to run
B <- replicate(B, # replicate this number of times
results ifelse( # if there are some duplicated birthdays
num_of_same_birthdays(50) >= 1, # our function
1, # give me a 1
0)) # if not, give me a 0
mean(results) # take the mean
Were you expecting the probability to be this high?
People tend to underestimate these probabilities. To get an intuition as to why it is so high, think about what happens when the group size is close to 365. At this stage, we run out of days and the probability is one.
Say we want to use this knowledge to bet with friends about two people having the same birthday in a group of people. When are the chances larger than 50%? Larger than 75%?
Let’s create a look-up table. We can create a function to compute this for any group size. This function runs our function \(500\) times for every value of $n$. Statisticians will usually do this \(10000\) times, but this will take a very very very very looooooooooonnnnnggggg time, so we are simplifying.
<- # name the function
compute_prob function(n, B = 500){ # function inputs
<- # store results
results replicate(B, # run num_of_same_birthdays B times
ifelse( # if there are some duplicated birthdays
num_of_same_birthdays(n) >= 1, # our function
1, # give me a 1
0)) # if not, give me a 0
mean(results) # find the mean
}
Using the function map_dbl
, we can perform element-wise operations on any function. Note that this may take awhile to run.
<- seq(1, 60) # which values of n are important
n <- # save the results as prob
prob map_dbl(n, # for each value of n
# run the function compute_prob)
We can now make a plot of the estimated probabilities of two people having the same birthday in a group of size n:
as_tibble(n, prob) %>% # the format for ggplot
ggplot() + # draw a graph
geom_point(aes(x = n, y = prob)) # of points
Now let’s compute the exact probabilities rather than use Monte Carlo approximations. Not only do we get the exact answer using math, but the computations are much faster since we don’t have to generate experiments.
To make the math simpler, instead of computing the probability of it happening, we will compute the probability of it not happening, or the compliment. For this, we use the multiplication rule.
Let’s start with the first person. The probability that person 1 has a unique birthday is 1. The probability that person 2 has a unique birthday, given that person 1 already took one, is 364/365. Then, given that the first two people have unique birthdays, person 3 is left with 363 days to choose from. We continue this way and find the chances of all \(n\) people having a unique birthday is:
\[1×\frac{364}{365}×\frac{363}{365}…\frac{365−n+1}{365}\] We can write a function that does this for any number:
<- function(n){
exact_prob 1 -
prod(365:(365-n+1))/ # the product from the numerator above
365^(n) # the denominator from above
}
- Run this coding for the number of people that are in class today.
Now, we run this on multiple values of \(n\)
<- map_dbl(n, exact_prob) # run on multiple values of n eprob
Next we plot the results.
as_tibble(n, eprob) %>% # the format for ggplot
ggplot() + # draw a graph
geom_point(aes(x = n, y = eprob)) # of points
On mercury, the year is only 88 days. Update the coding above to simulate the expected number of people with the same birthdays on Mercury.
Assuming we have the same number of people, would you expect the number of people with the same birthdays to be higher or lower than those on earth?