Kernel Density Estimation (KDE) and Kernel Regression (KR) in R

  • In this article, the basic concepts of Kernel Density Estimation and Kernel Regression, along with their applications, will be discussed.

Kernel Density Estimation (KDE)

  • Let’s start with an example (from the edX course Applied Machine Learning by Microsoft): Let’s say that we have data points representing crime rates along one street. As shown in the following figure, let’s say crime 1 happens at location 15, crime 2 happens at location 12, and so on.
  • We want to know how likely it is to have a crime at a particular place along the street. In other words we want to estimate the density of crimes along the street.
  • So we’re going to assume that crimes happen in the future in similar places to where they’ve happened in the past, and we’ll place a bump on each crime that has happened, and we’ll add all the bumps up.
  • We have some flexibility in the size of the bumps: we could choose narrower or wider bumps. If we choose wider bumps, the kernel density estimate will be smoother; if we choose narrower bumps, it will be peakier.
  • The following figure illustrates the formula used for KDE, f_h(x) = (1/(nh)) * sum_{i=1..n} K((x - x_i)/h), where K(.) is a kernel, i.e., a non-negative function that integrates to 1, and h is a positive number representing the bandwidth of the kernel estimate of the density of the data x_1, ..., x_n.

kde.png
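  • As a quick illustration, here is a minimal R sketch (with made-up 1-D crime locations, not the data from the course) showing the effect of the kernel and bandwidth choices using R's built-in density() function:
    # hypothetical 1-D crime locations along the street
    x <- c(15, 12, 9, 21, 13, 16, 4, 11)
    par(mfrow=c(1,3))
    # wider bumps (large bandwidth) -> smoother estimate
    plot(density(x, kernel='gaussian', bw=4), main='Gaussian, bw=4')
    # narrower bumps (small bandwidth) -> peakier estimate
    plot(density(x, kernel='gaussian', bw=1), main='Gaussian, bw=1')
    # a different kernel with the default bandwidth
    plot(density(x, kernel='epanechnikov'), main='Epanechnikov, default bw')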

  • The next animations show how the density function estimated by KDE tends to over-fit the data when the bandwidth is low and to under-fit it when the bandwidth is high. They also show the density estimates obtained with different kernel functions (rectangular, Epanechnikov, and Gaussian).

rect.gif epi.gif gauss.gif

  • The next animation shows density function estimated by KDE in 2 dimensions.

kde2d.gif
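  • A 2-D estimate like the one in the animation can be computed, for example, with kde2d() from the MASS package (the data below are simulated, not the data used in the animation):
    library(MASS)
    set.seed(1)
    x <- rnorm(200, 10, 3)              # simulated 2-D locations
    y <- rnorm(200, 5, 2)
    est <- kde2d(x, y, n=50)            # 2-D kernel density estimate on a 50x50 grid
    image(est)                          # heat map of the estimated density
    contour(est, add=TRUE)              # overlay density contours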

Kernel Regression

  • Kernel regression is a non-parametric regression technique. The idea is to estimate the dependent variable value at a query point as a weighted average of the dependent variable values of similar points, where similarity is measured with a kernel function.
kr
kr1
  • The following animations show how the regression curve fitted by KR tends to over-fit the data when the bandwidth of the Gaussian kernel used is low and to under-fit it when the bandwidth is high (a minimal ksmooth() sketch follows the animations below).
  • As shown, the black point represents the query point (for which the y value is to be estimated), and the color / size of the other points represents their contributions (in terms of weights) to the estimate: the larger and redder a point, the more weight it carries.

gk-5

gk5

gk20
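  • As a rough, minimal sketch of kernel regression (Nadaraya-Watson style) in R, the built-in ksmooth() function can be used; the data and bandwidths below are made up and are not those used in the animations:
    set.seed(1)
    x <- runif(100, 0, 10)
    y <- sin(x) + rnorm(100, sd=0.3)
    plot(x, y, col='gray')
    # small bandwidth -> wiggly, over-fitted regression curve
    lines(ksmooth(x, y, kernel='normal', bandwidth=0.5), col='red', lwd=2)
    # large bandwidth -> smooth, possibly under-fitted regression curve
    lines(ksmooth(x, y, kernel='normal', bandwidth=5), col='blue', lwd=2)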


Some Statistics Concepts: Order Statistics and an Application in Auctions

  • In this article, the concept of order statistics, along with its applications, will be discussed.

Order Statistics

  • Definition: Consider a collection of i.i.d. continuous random variables X_1, X_2, ..., X_n. If X_(j) is the j-th smallest of X_1, X_2, ..., X_n, it is called the j-th order statistic. The 1st and the n-th order statistics are the minimum and the maximum of the variables, respectively. We are interested in the marginal density and the expected value of the j-th order statistic. The figure below shows the density of the order statistics in general and also for specific distributions (exponential and uniform).
order_stat.png
  • The following figure and animation show the densities of the order statistics of n i.i.d. random variables X_i ~ U(0,1), for different values of n, along with the expected values of the order statistics shown as dotted vertical lines, quantities we will also be interested in (a small simulation check is sketched after the animations below).
o1.png
    animation1.gif
  • The following animation shows the densities of the order statistics of n i.i.d. random variables X_i ~ Exp(5).

    animation2.gif
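  • As a quick simulation check for the uniform case: the j-th order statistic of n i.i.d. U(0,1) variables has a Beta(j, n-j+1) density with expected value j/(n+1). The following minimal R sketch (with arbitrary n and j) compares the simulated and theoretical densities:
    set.seed(1)
    n <- 5; j <- 2
    samples <- replicate(10000, sort(runif(n))[j])            # simulated j-th order statistic
    hist(samples, breaks=50, freq=FALSE, main='2nd order statistic of 5 U(0,1) samples')
    curve(dbeta(x, j, n - j + 1), col='red', lwd=2, add=TRUE) # theoretical Beta(j, n-j+1) density
    abline(v=j/(n + 1), lty=2)                                # expected value j/(n+1)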

  • Application

    • Now, let’s apply the order-statistics concepts in the following auction setting.
      • Let there be N potential buyers of some good.
      • Their valuations are i.i.d. U(0,1).
      • The seller can offer the good
        • at no cost
        • at a posted price
        • or can auction it off.
      • The seller knows the distribution of valuations, but does not know the individual realizations.
      • As shown in the following figure, the expected profit at the posted price depends on the CDF of the N-th order statistic (the maximum) of the valuations. The seller wants to maximize his expected profit; the optimal posted price is (1/(N+1))^(1/N), with optimal expected profit (N/(N+1))(1/(N+1))^(1/N).
      • Again, as shown in the next figure, the profit in the 2nd-price auction depends on the distribution of the (N-1)-th order statistic of the valuations (the second-highest valuation), and the expected profit is computed to be (N-1)/(N+1).
        auction_posted.png
    • The following animation shows the distribution of the (N-1)-th order statistic of the N valuations, for different values of N. The vertical dotted line shows the expected value, as before.
      animation3.gif
    • As can be seen from the following figure, once there are enough potential buyers (N >= 3), the 2nd-price auction becomes more profitable in expectation than the optimal posted price (a short numerical check is sketched below).
    • auc_post
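    • As a short numerical check, plugging the two closed-form expected profits above into R reproduces this crossover:
      N <- 2:10
      posted  <- (N/(N + 1)) * (1/(N + 1))^(1/N)   # optimal posted-price expected profit
      auction <- (N - 1)/(N + 1)                   # 2nd-price auction expected profit
      round(rbind(posted, auction), 3)             # the auction wins from N = 3 onwards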

Some Statistics Concepts: Probability Integral Transformations

  • In this article, the concept of the probability integral transformation, along with its applications, will be discussed.

Probability Integral Transformation

  • First let’s convince ourselves of the fact that a continuous random variable transformed by its own CDF will always have a U(0,1) distribution. This can be proved as shown in the figure below.
    prob_integral.png
  • Let’s draw n=1000 i.i.d. samples from X ~ Exp(λ=5) using the R function rexp and then transform the variable by its own CDF F_X(x) = 1 - exp(-λx), so that Y = F_X(X). We can see from the figure below that F_Y is the distribution function of a U(0,1) random variable, i.e., Y ~ U(0,1).
    n <- 1000
    lambda <- 5
    x <- rexp(n, lambda)                   # i.i.d. samples from Exp(lambda)
    y <- 1 - exp(-lambda*x)                # transform by the CDF of X: Y = F_X(X)
    F_y <- cumsum(table(y))/sum(table(y))  # empirical CDF values of Y at the sorted sample points
    par(mfrow=c(1,3))
    hist(x, col='blue')                    # histogram of the exponential samples
    # empirical distribution function of Y: it should look like the CDF of U(0,1), i.e., a straight line
    plot(F_y, col='green', pch=19, xlab='y', ylab=expression('F'['Y']), main=expression(paste('Y=1-e'^ paste('-',lambda,'X'))))
    plot(ecdf(y), col='red', pch=19, xlab='y', ylab='ECDF(y)')

    cdf.png

  • Application

    • Now, suppose that we have at our disposal only a function that can generate i.i.d. samples from the X ~ U(0,1) distribution, and we want to use this function to generate i.i.d. samples from some other distribution (say from X ~ Exp(λ=5), X ~ Geom(p=0.3), or X ~ Laplace(μ=0, b=4)).
    • We can use a U(0,1) random variable transformed by the inverse CDF corresponding to the other distribution to get a random variable with that CDF.
    • Let’s use the above facts to draw n=1000 samples from a Y ~ Exp(λ=5) distribution, using just samples drawn from X ~ U(0,1) with the R function runif and the probability integral transform.
      • First draw n samples from X ~ U(0,1).
      • Transform them with the inverse CDF: Y = F^(-1)(X) = -(1/λ)ln(1-X).
      • Now Y has CDF F_Y(y) = 1 - exp(-λy), i.e., Y ~ Exp(λ).
    • Let’s compare the histograms of the samples drawn from Y ~ Exp(λ=5) using the probability integral transform and using the R function rexp (a minimal R sketch follows the figures below). As expected, the histograms look almost exactly the same, as can be seen from the following figure. Also, the times taken to draw 10000 such samples are quite comparable.
      exp
      p1animation.gif
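    • A minimal sketch of this inverse-transform step (sample size and seed are arbitrary; the timing comparison is omitted):
      set.seed(1)
      n <- 1000; lambda <- 5
      u <- runif(n)                     # X ~ U(0,1)
      y_pit  <- -log(1 - u)/lambda      # inverse CDF of Exp(lambda) applied to the uniform samples
      y_rexp <- rexp(n, lambda)         # direct sampling, for comparison
      par(mfrow=c(1,2))
      hist(y_pit,  col='blue',  main='inverse transform')
      hist(y_rexp, col='green', main='rexp')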
    • Similarly, let’s use the probability integral transform to draw samples from Y ~ Geom(p=0.1) using only X ~ U(0,1) transformed with the inverse CDF ⌈ln(1-x)/ln(1-p)⌉ of the geometric distribution (which has CDF 1 - (1-p)^x), and then compare with samples drawn using the R function rgeom. As expected, the histograms look almost exactly the same, as can be seen from the following figure. Also, the times taken to draw 10000 such samples are quite comparable; see the sketch after the figures.
      geom.png
      p2
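    • A minimal sketch for the geometric case (note that R's rgeom counts failures before the first success, i.e., its support starts at 0, whereas the CDF 1 - (1-p)^x above counts trials, so 1 is added to rgeom's output for comparison):
      set.seed(1)
      n <- 1000; p <- 0.1
      u <- runif(n)
      y_pit   <- ceiling(log(1 - u)/log(1 - p))   # number of trials until the first success
      y_rgeom <- rgeom(n, p) + 1                  # shift rgeom to the same 'number of trials' convention
      par(mfrow=c(1,2))
      hist(y_pit,   col='blue',  main='inverse transform')
      hist(y_rgeom, col='green', main='rgeom + 1')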
    • Again, let’s use the probability integral transform to draw samples from Y ~ Laplace(μ=0, b=4) using only X ~ U(0,1) transformed with the following inverse CDF,
      icdf
      since the Laplace distribution has the following CDF:
      lcdf
      Then let’s compare with samples drawn using the R function rlaplace (a sketch that checks against the analytic density instead follows the figures below). As expected, the histograms look almost exactly the same, as can be seen from the following figure. Also, the times taken to draw 10000 such samples are quite comparable.
      laplace
      p3
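    • A self-contained sketch for the Laplace case (rlaplace lives in add-on packages, so the sketch below checks against the analytic Laplace density instead):
      set.seed(1)
      n <- 1000; mu <- 0; b <- 4
      u <- runif(n)
      y_pit <- mu - b * sign(u - 0.5) * log(1 - 2 * abs(u - 0.5))    # inverse CDF of Laplace(mu, b)
      hist(y_pit, breaks=50, freq=FALSE, col='blue', main='Laplace via inverse transform')
      curve(exp(-abs(x - mu)/b)/(2 * b), col='red', lwd=2, add=TRUE) # analytic Laplace density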
    • Finally, let’s say we don’t have the function rnorm but only the inverse CDF qnorm, and we want to sample from a normal distribution with a given mean and variance using the probability integral transform (see the sketch after the figure below).
    • Then let’s compare the histogram with one generated using rnorm. As can be seen, they look exactly the same. Also, the times taken to draw 10000 such samples are quite comparable.
      norm
      p4
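    • A minimal sketch for the normal case (the mean and standard deviation below are arbitrary):
      set.seed(1)
      n <- 1000; mu <- 1; sigma <- 2
      u <- runif(n)
      y_pit   <- qnorm(u, mean=mu, sd=sigma)   # inverse CDF (quantile function) of N(mu, sigma^2)
      y_rnorm <- rnorm(n, mean=mu, sd=sigma)   # direct sampling, for comparison
      par(mfrow=c(1,2))
      hist(y_pit,   col='blue',  main='qnorm(runif(n))')
      hist(y_rnorm, col='green', main='rnorm')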

Inference Algorithms in Probabilistic Graphical Models: implementation of the Sum-Product and the Chow-Liu Algorithms for tree learning with Python

The following problem appeared as the final project in the edX course MITx: 6.008.1x Computational Probability and Inference.

Coconut Oil Price Movements

You really want coconut oil, but you don’t know which of your four nearby markets to buy the coconut oil at. You decide to do a detailed study on how these four markets relate, beginning with how coconut oil prices change across these markets.

The data provided consists of 2.8 years of daily coconut oil price movement data quantized to +1 (price went up in the past day), 0 (price stayed exactly the same in the past day), and -1 (price went down in the past day). Each line of the data file looks something like the following:

10/9/2012,-1,1,0,1

Note that there are five items: the date (month/day/year) followed by price movements for markets 0, 1, 2, and 3.

Let’s suppose for the purposes of this project that the price movements on different days are reasonably modeled to be i.i.d.

(a) Implement the Chow-Liu algorithm for learning a tree for these four markets.

Learning the tree PGM involves the following steps:

(1) Compute the empirical distributions: First we need to compute the empirical distribution for each node potential (here we have 4 nodes corresponding to the 4 markets) and the empirical joint distribution for each edge potential, before we can compute the mutual information between the random variables representing the nodes in the next step.

(2) Tree Structure Learning: Once the pairwise mutual information values are computed, we can run the Chow-Liu algorithm, which builds a maximum spanning tree over the complete graph weighted by mutual information, to learn the tree structure, as shown below.

chowliu
As can be seen from the mutual information matrix and from the Chow-Liu algorithm animation, the edges are selected in decreasing order of mutual information, skipping any edge that would create a cycle.

mu
chowliu
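Just to make the edge-selection step concrete, here is a small illustrative sketch (written in R only to keep the code in this post in one language; the project itself is in Python, and the mutual information values below are hypothetical, not the ones in the matrix above):

    MI <- matrix(c(0,    0.30, 0.25, 0.05,
                   0.30, 0,    0.10, 0.04,
                   0.25, 0.10, 0,    0.20,
                   0.05, 0.04, 0.20, 0), 4, 4, byrow=TRUE)   # pairwise empirical mutual information
    edges <- which(upper.tri(MI), arr.ind=TRUE)              # candidate edges (i, j) with i < j
    edges <- edges[order(MI[edges], decreasing=TRUE), ]      # sort candidates by decreasing MI
    comp <- 1:4                                              # component label of each node
    tree <- NULL
    for (e in seq_len(nrow(edges))) {
      i <- edges[e, 1]; j <- edges[e, 2]
      if (comp[i] != comp[j]) {                              # keep the edge only if it joins two components
        tree <- rbind(tree, c(i, j))
        comp[comp == comp[j]] <- comp[i]
      }
    }
    tree                                                     # the 3 edges of the learned tree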

(b) Once you learn the Chow-Liu tree, of course the next thing is to learn the parameters for this specific tree.

(1) Parameter learning: We need to find the maximum-likelihood estimates (MLE) of the node potentials and of the conditional probability tables (CPTs) used as edge potentials, given the observations; for discrete data these are simply the empirical (conditional) relative frequencies, as shown below.

mle

The observations for the 4 markets:
obs

Here are the parameters for the node potentials learnt from the data with MLE:
np

The parameters for the edge potentials learnt from the data with MLE (in python nested dictionary format):

{(0, 1): {0: {0: 0.8597997138769671, 1: 0.0715307582260372, -1: 0.06866952789699571}, 1: {0: 0.6608187134502924, 1: 0.29239766081871343, -1: 0.04678362573099415}, -1: {0: 0.7697368421052632, 1: 0.05921052631578947, -1: 0.17105263157894737}}, (3, 0): {0: {0: 0.5135908440629471, 1: 0.2982456140350877, -1: 0.28289473684210525}, 1: {0: 0.2503576537911302, 1: 0.5555555555555556, -1: 0.08552631578947369}, -1: {0: 0.23605150214592274, 1: 0.14619883040935672, -1: 0.631578947368421}}, (0, 2): {0: {0: 0.8497854077253219, 1: 0.08583690987124463, -1: 0.06437768240343347}, 1: {0: 0.6257309941520468, 1: 0.3391812865497076, -1: 0.03508771929824561}, -1: {0: 0.6578947368421053, 1: 0.039473684210526314, -1: 0.3026315789473684}}, (2, 0): {0: {0: 0.8497854077253219, 1: 0.6257309941520468, -1: 0.6578947368421053}, 1: {0: 0.08583690987124463, 1: 0.3391812865497076, -1: 0.039473684210526314}, -1: {0: 0.06437768240343347, 1: 0.03508771929824561, -1: 0.3026315789473684}}, (1, 0): {0: {0: 0.8597997138769671, 1: 0.6608187134502924, -1: 0.7697368421052632}, 1: {0: 0.0715307582260372, 1: 0.29239766081871343, -1: 0.05921052631578947}, -1: {0: 0.06866952789699571, 1: 0.04678362573099415, -1: 0.17105263157894737}}, (0, 3): {0: {0: 0.5135908440629471, 1: 0.2503576537911302, -1: 0.23605150214592274}, 1: {0: 0.2982456140350877, 1: 0.5555555555555556, -1: 0.14619883040935672}, -1: {0: 0.28289473684210525, 1: 0.08552631578947369, -1: 0.631578947368421}}}
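To make the MLE step concrete: with discrete observations these estimates are just relative-frequency tables. A tiny R illustration (with simulated +1/0/-1 data, not the project data or code):

    x0 <- sample(c(-1, 0, 1), 100, replace=TRUE)          # simulated price movements for market 0
    x1 <- sample(c(-1, 0, 1), 100, replace=TRUE)          # simulated price movements for market 1
    node_potential_0 <- prop.table(table(x0))             # MLE of P(X0): empirical marginal frequencies
    edge_cpt_0_1 <- prop.table(table(x0, x1), margin=1)   # MLE of the CPT P(X1 | X0): rows sum to 1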

(c) After learning both the tree and the parameters, of course this means that we now have a graphical model, for which we can solve specific inference tasks. First, let’s not worry about incorporating observations yet. Implement the function sum_product for running the general Sum-Product algorithm for a given tree-structured graphical model.

The sum-product algorithm for inferring the marginals on a tree is shown below:

spalgo.png
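As a rough illustration of the message-passing recursion (again in R for consistency with the rest of this post, and on a made-up 3-node chain rather than the project's graphical model), the sum-product updates can be sketched as follows:

    # toy tree 1 - 2 - 3 with binary variables; node potentials phi and edge potentials psi
    adj <- list(`1`=c(2), `2`=c(1, 3), `3`=c(2))
    phi <- list(`1`=c(0.7, 0.3), `2`=c(0.5, 0.5), `3`=c(0.4, 0.6))
    psi12 <- matrix(c(0.8, 0.2, 0.2, 0.8), 2, 2)          # psi(x1, x2)
    psi23 <- matrix(c(0.6, 0.4, 0.4, 0.6), 2, 2)          # psi(x2, x3)
    psi <- list(`1 2`=psi12, `2 1`=t(psi12), `2 3`=psi23, `3 2`=t(psi23))

    # message from node j to node i: sum over x_j of psi(x_i, x_j) * phi_j(x_j) * messages into j (except from i)
    msg <- function(j, i) {
      incoming <- rep(1, length(phi[[as.character(j)]]))
      for (k in setdiff(adj[[as.character(j)]], i)) incoming <- incoming * msg(k, j)
      as.vector(psi[[paste(i, j)]] %*% (phi[[as.character(j)]] * incoming))
    }

    # marginal at node i: node potential times all incoming messages, normalized
    marginal <- function(i) {
      m <- phi[[as.character(i)]]
      for (k in adj[[as.character(i)]]) m <- m * msg(k, i)
      m / sum(m)
    }

    sapply(1:3, marginal)                                 # marginals of the 3 nodes, one column each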

The following figure illustrates the different phases of the sum-product algorithm on trees, computing the node marginals on a different test dataset, the blue-green tree.
bg

(d) We want to be able to answer questions like “what if we see on the last day that the price moved up for markets 1 and 2, but we didn’t get to observe what happened with markets 0 and 3?” What can we say about the posterior marginals for markets 0 and 3? This amounts to incorporating the observations that markets 1 and 2 each have observed value +1. Fill in the general function compute_marginals_given_observations that lets us incorporate any subset of observations for the four markets and produces the posterior marginals.

The posterior marginals computed at the nodes with the sum-product algorithm, using the new observations, are shown below:
marg
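In terms of the toy R sketch above (not the project's compute_marginals_given_observations), incorporating an observation simply amounts to zeroing out the disallowed states of the observed node's potential before re-running the same message passing:

    phi[["3"]] <- phi[["3"]] * c(0, 1)   # clamp node 3 of the toy chain to its second state
    sapply(1:3, marginal)                # posterior marginals given that observation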