# Your homework is to work through this tutorial, handing in your solutions to 
# Exercises B, C, E, F, G, and H. Except for Exercise G, each exercise involves 
# writing a small amount of R code and/or a small amount of discussion. The 
# code that you write will really be slight modifications of code snippets 
# already in this file. Having you write a bunch of code is not a goal of this 
# course.

# You will hand in your work on paper --- either printing it out or copying it 
# by hand. Exercise G will need to be done by hand. If you print, then don't 
# print this entire file. Just print the code snippets and discussion that are 
# needed, for the grader to grade the six required exercises.


### BERNOULLI DISTRIBUTION ###

# Here is a function for sampling a Bernoulli-distributed random variable. 
# (Sampling means generating a concrete random outcome.) Don't worry about how 
# this code works.
rbern <- function(n, p) {
  sample.int(2, n, prob=c(1 - p, p), replace=TRUE) - 1
}

# Here are some examples. Study them, to figure out what they mean.
rbern(100, 1 / 6)
numTrials <- 1000000
sum(rbern(numTrials, 1 / 6)) / numTrials


### GEOMETRIC DISTRIBUTION ###

# We know that if X ~ Geom(n, p), then P(X = k) = (1 - p)^k p. Here are some 
# examples computed by hand.
p <- 1 / 3
(1 - p)^4 * p
p <- 0.001
(1 - p)^16 * p

# Here are the same probabilities computed using R's built-in knowledge of the 
# geometric distribution.
dgeom(4, 1 / 3)
dgeom(16, 0.001)

# Let's focus on X ~ Geom(0.001). What is P(X <= 3)? The pgeom function exists 
# to answer this kind of question.
pgeom(3, 0.001)

# Exercise A: How would you compute P(X <= 3) knowing only dgeom?

# How high does k need to be, for P(X <= k) to reach 95%? Change the first 
# input given to pgeom, trying various values, until you find an answer.
pgeom(450, 0.001)

# That task is so common in probability and statistics that R offers a shortcut.
qgeom(0.95, 0.001)

# This code produces a sample from the geometric distribution.
data <- rgeom(1000, 0.09)
data

# Here is a function for drawing a histogram, which shows how many of each 
# possible value we get. Don't worry about how this code works.
discreteHistogram <- function(a, b, data) {
  hist(
    data, 
    breaks=seq(from=(a - 0.5), to=(b + 0.5), length.out=(b - a + 1)), 
    xaxp=c(a, b, b - a))
}

# Here is a picture of the sample from the geometric distribution. Does it look 
# like you would expect?
discreteHistogram(min(data), max(data), data)


### BINOMIAL DISTRIBUTION ###

# Let X be the number of sixes you encounter when rolling a (fair, six-sided) 
# die 12 times. You expect X to take values around 2. But what is the 
# probability that you get exactly 2 sixes? Only about 30%, according to the 
# binomial distribution.
choose(12, 0) * (5 / 6)^12 * (1 / 6)^0
choose(12, 1) * (5 / 6)^11 * (1 / 6)^1
choose(12, 2) * (5 / 6)^10 * (1 / 6)^2

# Here's how to compute the same numbers using R's built-in knowledge of the 
# binomial distribution.
dbinom(0, 12, 1 / 6)
dbinom(1, 12, 1 / 6)
dbinom(2, 12, 1 / 6)

# Here are those three numbers again, made succinctly using sapply.
sapply(0:2, dbinom, 12, 1 / 6)

# How do we compute P(X = 0) + P(X = 1) + P(X = 2)? Here are three ways.
dbinom(0, 12, 1 / 6) + dbinom(1, 12, 1 / 6) + dbinom(2, 12, 1 / 6)
sum(sapply(0:2, dbinom, 12, 1 / 6))
pbinom(2, 12, 1 / 6)

# Explain the result of these commands.
sum(sapply(0:12, dbinom, 12, 1 / 6))
pbinom(12, 12, 1 / 6)

# Here's the PMF and a histogram of 10,000 random values of X. They match 
# fairly well.
plot(x=0:12, y=sapply(0:12, dbinom, 12, 1 / 6))
discreteHistogram(0, 12, rbinom(10000, 12, 1 / 6))

# Let's switch to X ~ Binom(100, 0.2). For which k is P(X <= k) = 0.95? Use 
# trial and error.
pbinom(24, 100, 0.2)
pbinom(25, 100, 0.2)
pbinom(26, 100, 0.2)
pbinom(27, 100, 0.2)
pbinom(28, 100, 0.2)

# Here's the efficient way.
qbinom(0.95, 100, 0.2)


### OVERBOOKING YOUR #^&!$& AIRPLANE FLIGHT ###

# In a policy called overbooking, airlines sell more seats than they have, 
# counting on some of the passengers not to show up. Below, specify the 
# probability of a passenger showing up for her flight, the capacity of the 
# airplane, and the overbooking limit of the airline.
prob <- 0.88
capacity <- 100
limit <- 110

# Here's the probability that exactly one too many passengers show up, assuming 
# that the passengers are independent.
k <- capacity + 1
choose(limit, k) * prob^k * (1 - prob)^(limit - k)
dbinom(k, limit, prob)

# Here are the probabilities that two and three too many show up.
dbinom(capacity + 2, limit, prob)
dbinom(capacity + 3, limit, prob)

# Here's the total probability of overbooking the flight.
1 - pbinom(capacity, limit, prob)

# Exercise B: Change the limit below, working by trial and error, to get the 
# total probability below 5%. Show your work. That is, show a few lines of 
# code, with their resulting output, that demonstrate how the overbooking limit 
# should be set.
limit <- 110
1 - pbinom(capacity, limit, prob)


### HYPERGEOMETRIC DISTRIBUTION ###

# Draw five cards from a standard deck, and let X be the number of aces. Then X 
# ~ HGeom(4, 48, 5). Do the following six numbers seem reasonable?
dhyper(0, 4, 48, 5)
dhyper(1, 4, 48, 5)
dhyper(2, 4, 48, 5)
dhyper(3, 4, 48, 5)
dhyper(4, 4, 48, 5)
dhyper(5, 4, 48, 5)


### LEGALIZE IT? ###

# Suppose that, at a college of 2,000 students, 613 of them support 
# legalization of marijuana. Let X be the number of legalization supporters in 
# a survey of 200 students. Then X ~ HGeom(613, 2000 - 613, 200). Here are the 
# probabilities of a few values of X.
dhyper(60, 613, 2000 - 613, 200)
dhyper(61, 613, 2000 - 613, 200)
dhyper(62, 613, 2000 - 613, 200)

# In a more realistic problem, the number of supporters is unknown, and we 
# conduct the survey in the hopes of ascertaining that number. Suppose that 
# 89 of the 200 students surveyed support legalization.

# Exercise C: By trial and error, find the value of the parameter 'supporters' 
# that maximizes the probability of observing 89 supporters. (This approach to 
# finding a parameter of a population is called maximum likelihood estimation. 
# It is extremely common in statistics.)
supporters <- 882
dhyper(89, supporters, 2000 - supporters, 200)


### NEGATIVE BINOMIAL ###

# Remember that X ~ NBin(r, p) models the number of failures before the rth 
# success in repeated Bernoulli trials, each with probability p. For the sake 
# of examples, let's focus on r = 5 and p = 1 / 3.

# Here is P(X = 20) computed in two ways.
choose(20 + 5 - 1, 5 - 1) * (1 / 3)^5 * (1 - 1 / 3)^20
dnbinom(20, 5, 1 / 3)

# Here is P(X > 20) computed in two ways.
1 - sum(sapply(0:20, dnbinom, 5, 1 / 3))
1 - pnbinom(20, 5, 1 / 3)

# Here's a sample.
data <- rnbinom(100, 5, 1 / 3)
data
discreteHistogram(min(data), max(data), data)

# Exercise D. For X ~ NBin(5, 1 / 3), find k such that P(X <= k) = 95%. Do it 
# in two ways: trial and error with pnbinom, and more simply with qnbinom.


### MISCELLANEOUS ###

# When I borrow a book from another Math/Stats professor, I sometimes return 
# the book by placing it in their mailbox in the Math/Stats department. This is 
# a little risky, because the book could be stolen. I do it 25 times with no 
# problems. On the 26th time, the book is stolen.

# Exercise E. Which of the discrete distributions, that we have studied, would 
# you use to model this problem? (If it's not clear what the problem is yet, 
# then read Exercise F.)

# Exercise F. Use trial and error in R to find the parameter value(s) that 
# maximizes the probability of the result observed above. An approximate 
# answer is fine. Show a succinct version of your code and the answer(s) that 
# it gives you. Explain.

# Exercise G. Solve the same problem as in Exercise F, but this time exactly, 
# using calculus instead of R.

# Exercise H. In your area, there are 58,000 Republicans and 42,000 Democrats 
# (and no one else). You are working a phone bank, calling random people, 
# trying to reach Republicans. Let X ~ NBin(r, 0.58), Y ~ Binom(n, 0.58), and 
# Z ~ HGeom(58000, 42000, n). Within the context of the phone bank, give three 
# questions in plain English that are answered by X, Y, and Z, respectively.