Solving Simple Probability Problems with Simulation in R

Problem 1

Considering the probability distribution associated with rolling 3 fair dice labelled d1, d2 and d3, calculate the probability of the following:

  1. Compute the probability that the sum of the dice is greater than 12 and less than 18.
  2. Compute the probability that the sum is even.
  3. Compute the probability that the mean is exactly 4.

The problem is from the following stackoverflow post: http://stackoverflow.com/questions/40253035/how-to-compute-the-probability-with-rolling-3-fair-dice-for-some-specific-condit/40254805#40254805

Theoretical Solution with Classical Definition of Probability

  • Total number of points in the Sample Space that are mutually exclusive, exhaustive and equally likely = 6^3=216.
  • Number of cases favorable to the events in (1) is 55. Here are all the 55 cases.
#[1] "1 6 6"
#[1] "2 5 6"
#[1] "2 6 5"
#[1] "2 6 6"
#[1] "3 4 6"
#[1] "3 5 5"
#[1] "3 5 6"
#[1] "3 6 4"
#[1] "3 6 5"
#[1] "3 6 6"
#[1] "4 3 6"
#[1] "4 4 5"
#[1] "4 4 6"
#[1] "4 5 4"
#[1] "4 5 5"
#[1] "4 5 6"
#[1] "4 6 3"
#[1] "4 6 4"
#[1] "4 6 5"
#[1] "4 6 6"
#[1] "5 2 6"
#[1] "5 3 5"
#[1] "5 3 6"
#[1] "5 4 4"
#[1] "5 4 5"
#[1] "5 4 6"
#[1] "5 5 3"
#[1] "5 5 4"
#[1] "5 5 5"
#[1] "5 5 6"
#[1] "5 6 2"
#[1] "5 6 3"
#[1] "5 6 4"
#[1] "5 6 5"
#[1] "5 6 6"
#[1] "6 1 6"
#[1] "6 2 5"
#[1] "6 2 6"
#[1] "6 3 4"
#[1] "6 3 5"
#[1] "6 3 6"
#[1] "6 4 3"
#[1] "6 4 4"
#[1] "6 4 5"
#[1] "6 4 6"
#[1] "6 5 2"
#[1] "6 5 3"
#[1] "6 5 4"
#[1] "6 5 5"
#[1] "6 5 6"
#[1] "6 6 1"
#[1] "6 6 2"
#[1] "6 6 3"
#[1] "6 6 4"
#[1] "6 6 5"
  • Number of cases favorable to the events in (2) is 108.
  • Number of cases favorable to the events in (3) is 25.
  • Hence, by the classical definition of probability, the corresponding probabilities are 55/216=0.2546296, 108/216=0.5 and 25/216=0.1157407 respectively.
# compute theroetical probs
actual.probs <- rep(0,3)
# all points in the sample space
for (d1 in 1:6) 
  for (d2 in 1:6)
    for (d3 in 1:6) {
      s <- d1 + d2 + d3
      actual.probs[1] <- actual.probs[1] + ((s > 12) && (s < 18))
      actual.probs[2] <- actual.probs[2] + ((s %% 2) == 0)
      actual.probs[3] <- actual.probs[3] + ((s / 3) == 4)
    }
actual.probs <- actual.probs / 6^3 # theoretical probs

Solution with Simulation

  • Now let’s try to obtain the solution using simulation.
  • Let’s simulate an experiment of throwing 3 dice together for a number of trials.
  • Then let’s repeat the experiment for a number of times
# compute simulated probs
num.repeat <- 100
num.trials <- 10^3
sim.probs <- vector("list", 3)
for (j in 1:num.repeat) {
  res <- .rowMeans(replicate(num.trials, {
    #dice 
    dice <- sample(1:6, 3, replace=TRUE)
    s <- sum(dice)
    p1 <- ((s > 12) && (s < 18))
    p2 <- (s %% 2 == 0)
    p3 <- ((s/3) == 4)
    c(p1, p2, p3)
  }), 3, num.trials)
  #print(res)
  for (i in 1:3) {
    sim.probs[[i]] <- c(sim.probs[[i]], res[i])
  }
}

Here is a visual comparison between the simulated and the actual (theoretical) probabilities for each of the 3 cases, the theoretical probabilities are represented as dotted lines. As can be seen, in general, as the number of trials increase, the simulated probability tends to more accurately estimate the theoretical probabilities.

i1

Now let’s find the impact of the number of trials on the mean and absolute difference from the theoretical probabilities w.r.t. the computed probabilities with simulation.

i2i3

Problem 2

Let Θ be an unknown constant. Let W1,W2,,Wn be independent exponential random variables each with parameter 1.

Let Xi=Θ+Wi. It can be shown that Θ^ML=minXi. Now, we have been asked to construct a confidence interval of the particular form [Θ^c,Θ^], where Θ^=minXand c is a constant that we need to choose.

For n=10, how should the constant be chosen so that we have a 95% confidence interval? (Give the smallest possible value of c.) Your answer should be accurate to 3 decimal places.

This problem appeared in the final exam of the edX course MITx: 6.041x Introduction to Probability – The Science of Uncertainty.

n <- 10
th <- 100
exp.samp <- function() {
  w <- rexp(n, 1)
  x <- th + w
  th.hat <- min(x)  
  th.hat
}
ntrial <- 10^6
th.hat <- replicate(ntrial, exp.samp())
#c 
#print(c)
c <- quantile(th.hat - th, probs=0.95) 
print(c)
##       95% 
## 0.3002047
print(mean(th.hat - c <= th)) # & (th <= th.hat))
## [1] 0.95
hist(th.hat - th, xlab=expression(hat(theta)-theta), main=expression(hat(theta)-theta))

i4
Problem 3

Find the area of the following regions with Monte-Carlo Simulation.

area.png

  1. Let’s use Monte-Carlo-Integration to compute the area of the above green shaded region and then compare with the actural area enclosed by the curves.
n <- 10^6
df <- data.frame(x = runif(n), y = runif(n))
indices <- which(((df$y)^2<=df$x)&((df$x)^2<=df$y))
print(paste('Area with Monte-Carlo Integration = ', length(indices) / n))
## [1] "Area with Monte-Carlo Integration =  0.332888"
res <- integrate(function(x) sqrt(x) - x^2,0,1)
print(paste('Actual Area = ', res$value, ' with absolute error', res$abs.err))
## [1] "Actual Area =  0.333333408167678  with absolute error 7.80933322530327e-05"
  1. Again let’s use Monte-Carlo-Integration to compute the area under the below sinc function in between the vertical lines (-3,3) and x axis then compare with the actual area. The sinc function is defined as follows, by removing the discontinuity at 0.|eq.png
    area2

    eq  <- 10^6
    xmax <- 3
    df <- data.frame(x = runif(n,-xmax,xmax), y = runif(n,-0.25,1))
    indices <- which((df$y>0)&(abs(df$x)<=xmax)&(df$y<eq(df$x)))
    print(paste('Area with Monte-Carlo Integration = ', 6*1.25*length(indices)/n))
    ## [1] "Area with Monte-Carlo Integration =  3.700785"
    res <- integrate(eq,-3,3)
    print(paste('Actual Area = ', res$value, ' with absolute error', res$abs.err))
    ## [1] "Actual Area =  3.69730505599894  with absolute error 4.1048332022e-14"
  2. Again let’s use Monte-Carlo-Integration to compute the area under the Standard
    Normal Curve between the vertical lines and then compare the values with those obtained from the actual cdf Φ(.) values available from the standard normal table.
    normal
  3. Again let’s use Monte-Carlo-Integration to compute the area under the pdf of the Student’s t-distribution with degrees of freedom ν=between the vertical lines and then compare the values with those obtained from the actual cdf values availble from the t-table.t.gif
Advertisements

Movie Recommendation with Bayesian

In this article, a movie recommender system is going to be designed under Bayesian framework. This problem appeared as a project in the course MITx: 6.008.1x Computational Probability and Inference, where Bayesian inference is used to determine which movie to watch next.

recommrecomm1

The below table shows the input for the recommender system, we have 470 movies and each movie is rated by 1000 different users, the user ratings for the first 5 movies are shown below.

recomm2.png

The following shows the MAP estimates for the movie ratings for the first 5 movies.

MAP_Ratings Movies
0 2 Toy Story (1995)
1 5 Jumanji (1995)
2 6 Grumpier Old Men (1995)
3 6 Father of the Bride Part II (1995)
4 8 Heat (1995)
Movies with the highest MAP ratings
12 Monkeys (Twelve Monkeys) (1995)
Babe (1995)
Seven (a.k.a. Se7en) (1995)
From Dusk Till Dawn (1996)
Braveheart (1995)
Apollo 13 (1995)
Batman Forever (1995)
Crimson Tide (1995)
Die Hard: With a Vengeance (1995)
Net, The (1995)
Don Juan DeMarco (1995)
Dumb & Dumber (1994)
Hoop Dreams (1994)
Star Wars: Episode IV - A New Hope (a.k.a. Star Wars) (1977)
Like Water for Chocolate (Como agua para chocolate) (1992)
Legends of the Fall (1994)
Outbreak (1995)
Stargate (1994)
Santa Clause, The (1994)
Star Trek: Generations (1994)
Forrest Gump (1994)
Lion King, The (1994)
Naked Gun 33 1/3: The Final Insult (1994)
Speed (1994)
Addams Family Values (1993)
Beverly Hills Cop III (1994)
Fugitive, The (1993)
Hot Shots! Part Deux (1993)
Jurassic Park (1993)
Much Ado About Nothing (1993)
Mrs. Doubtfire (1993)
Robin Hood: Men in Tights (1993)
Schindler's List (1993)
Sleepless in Seattle (1993)
Blade Runner (1982)
Tombstone (1993)
Home Alone (1990)
Ghost (1990)
Aladdin (1992)
Terminator 2: Judgment Day (1991)
Dances with Wolves (1990)
Batman (1989)
Snow White and the Seven Dwarfs (1937)
Beauty and the Beast (1991)
Fargo (1996)
Wallace & Gromit: A Close Shave (1995)
Singin' in the Rain (1952)
Rear Window (1954)
North by Northwest (1959)
Some Like It Hot (1959)
Maltese Falcon, The (1941)
Wizard of Oz, The (1939)
Gone with the Wind (1939)
Citizen Kane (1941)
2001: A Space Odyssey (1968)
Mary Poppins (1964)
English Patient, The (1996)
One Flew Over the Cuckoo's Nest (1975)
Star Wars: Episode V - The Empire Strikes Back (1980)
Princess Bride, The (1987)
Brazil (1985)
Aliens (1986)
Army of Darkness (1993)
Psycho (1960)
Amadeus (1984)
Annie Hall (1977)
Boat, The (Das Boot) (1981)
Fantasia (1940)
Butch Cassidy and the Sundance Kid (1969)
Alien³ (1992)
Star Trek IV: The Voyage Home (1986)
Jaws (1975)
Contact (1997)
Hunt for Red October, The (1990)
Game, The (1997)
Gattaca (1997)
Starship Troopers (1997)
As Good As It Gets (1997)
Blade (1998)
Enemy of the State (1998)
Rushmore (1998)
Shakespeare in Love (1998)
You've Got Mail (1998)
Office Space (1999)
Matrix, The (1999)
Notting Hill (1999)
Run Lola Run (Lola rennt) (1998)
Ghostbusters (a.k.a. Ghost Busters) (1984)
Sixth Sense, The (1999)
Airplane! (1980)
American Beauty (1999)
Gladiator (2000)
Unbreakable (2000)
Crouching Tiger, Hidden Dragon (Wu hu zang long) (2000)
Memento (2000)
Moulin Rouge (2001)
Donnie Darko (2001)
Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)
Catch Me If You Can (2002)
Pirates of the Caribbean: The Curse of the Black Pearl (2003
Eternal Sunshine of the Spotless Mind (2004)

 Average Posterior Entropy of the Movie Ratings = 0.0366234427451

The following figures show how the posterior entropy of the movie-ratings changes as we have more and more data available.

recomm3recomm4recomm5recomm6recomm7