- In this article, the *Expectation Maximization* (EM) algorithm will be used for *soft clustering* of a dataset into *k* clusters. Given *n* data points in an *m*-dimensional space, *Data = (Data_1, …, Data_n)*, let's first represent their soft assignments to the *k* clusters as a *k × n column-stochastic* matrix *HiddenMatrix = (HiddenVector_1, …, HiddenVector_n)*, where each *HiddenVector_j* is itself a *k*-dimensional vector holding the probabilities of assigning *Data_j* to each of the *k* clusters, for *j ∈ {1, …, n}*. The *k* centers can then be represented as *k* points in the *m*-dimensional space, *Parameters = (θ_1, …, θ_k)* (Reference: *Stepic, Rosalind and the Coursera course on Bioinformatics Algorithms by UCSD*).
- The
*expectation maximization algorithm* starts with a random choice of Parameters. It then alternates between the *E-step*, in which a responsibility matrix *HiddenMatrix* for Data given Parameters is computed, and the *M-step*, in which the Parameters are re-estimated (with *MLE*) from the *HiddenMatrix*, until convergence. The iterative algorithm outline is shown below.
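The alternation above can be sketched in NumPy. This is a minimal sketch, not the course's reference implementation: it assumes Euclidean distances and partition-function (Gibbs) responsibilities with a stiffness parameter `beta`, with the M-step re-estimating each center as the responsibility-weighted mean of the data; the function names and the optional `init` argument are my own.

```python
import numpy as np

def e_step(data, centers, beta):
    """Compute the k x n column-stochastic HiddenMatrix:
    HiddenMatrix[i, j] is proportional to exp(-beta * dist(Data_j, theta_i))."""
    # pairwise Euclidean distances, shape (k, n)
    dists = np.linalg.norm(data[None, :, :] - centers[:, None, :], axis=2)
    weights = np.exp(-beta * dists)
    return weights / weights.sum(axis=0, keepdims=True)  # each column sums to 1

def m_step(data, hidden):
    """Re-estimate each center theta_i as the responsibility-weighted
    mean of all data points (the MLE under these soft assignments)."""
    return (hidden @ data) / hidden.sum(axis=1, keepdims=True)

def soft_kmeans_em(data, k, beta, n_iter=100, seed=0, init=None):
    """Alternate E- and M-steps, starting from k randomly chosen data
    points as centers (or from explicitly supplied initial centers)."""
    rng = np.random.default_rng(seed)
    centers = init if init is not None else data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        hidden = e_step(data, centers, beta)
        centers = m_step(data, hidden)
    return centers, hidden
```

Running a fixed number of iterations keeps the sketch short; in practice one would stop once the centers move less than some tolerance between iterations.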

- For this assignment, a 2-dimensional dataset with ~1000 data points will be used, and the EM algorithm will be run to cluster the dataset into 6 segments, starting with 6 random centers. The following figures / animations show the EM algorithm steps, the responsibility matrix and the clusters obtained with different values of the stiffness parameter β.
- As can be seen, the soft assignments come closer to hard assignments as the value of the stiffness parameter β is increased.
- For β = 1.1

- For β = 10

- For β = 0.1
