### BIRTHDAY PROBLEM ###

# We assume 365 days in a year, equally probable as bithdays. We assume some 
# number k of people in a room, with no 'dependencies' (e.g. no twins). We're 
# interested in the probability that all k people have distinct birthdays.

# This function computes the probability exactly (up to machine imprecision).
exact <- function(k) {
  prod(1 - (0:(k - 1)) / 365)
}

# Here are some crucial k to consider. In English, what do these results mean?
exact(21)
exact(22)
exact(23)
exact(24)

# This function computes the probability approximately. The approximation works 
# best when k is much smaller than 365.
approx <- function(k) {
  exp((1 - k) * k / 730)
}

# Here's a plot comparing the exact answer and the approximation. For which k 
# does the approximation look worst?
xs <- 1:180
exacts <- sapply(xs, exact)
approxs <- sapply(xs, approx)
plot(x=c(xs, xs), y=c(exacts, approxs))

# Here's a plot of the exact answer divided by the approximation. For which k 
# does the approximation look worst?
plot(x=xs, y=(exacts / approxs))


### ARIZONA DNA DATABASE EXAMPLE ###

# This is Example 2.9 in our textbook. It's a true story from 2001 in Arizona. 
# Police were collecting DNA for evidence in criminal investigations. 
# Specifically, they built DNA profiles based on nine loci on the chromosomes. 
# Without going into the details, there are about 754,000,000 possible 
# profiles. The database had 65,493 profiles in it. A query of the database 
# revealed that two of the profiles matched exactly. Does this match call into 
# question the reliability of the database?

# Assume that all of the 754,000,000 possible profiles are equally probable, 
# and that all 65,493 current profiles are 'independent' (e.g. no twins). Then 
# this is the birthday problem with different numbers. The probability that all 
# 65,493 profiles will be distinct is approximately...
k <- 65493
m <- 754000000
exp((1 - k) * k / (2 * m))

# So does the match call into question the reliability of the database?


### HASH TABLE EXAMPLE ###

# In computer science, a hash table is a common data structure --- a way of 
# organizing information inside the computer's memory. Essentially, the hash 
# table consists of m boxes, into which we want to place k pieces of 
# information. The hash table works fastest when there are no collisions --- 
# that is, when no two pieces of information land in the same box. So what is 
# the probability of no collisions?

# Assume that all pieces of information have the same probability of landing in 
# all boxes. Then we can work out the approximate probability...
k <- 1000000
m <- 1000000 * k
exp((1 - k) * k / (2 * m))

# In practice, we have no choice about k --- it's determined by how much data 
# our customer wants to store --- but we can choose m. For the sake of speed, 
# do we want to choose m to be large or small?

# A computer scientist might also ask: For the sake of memory usage, do we want 
# to choose m to be large or small? And what effect does using a lot of memory 
# have on speed? This is an example of how computer scientists often consider 
# subtle tradeoffs between time efficiency and space efficiency. Also, the hash 
# table is not ruined when a collision happens. Rather, its performance slowly 
# degrades, as more and more collisions happen. So the computer scientist needs 
# to ask more subtle questions about *how many* collisions are expected. Let's 
# not go into all of that.