(Sandipan Dey, 20 August 2016)
- In this article, Gibbs Sampling is used to find the Best K-mer Motifs from a collection of Dna strings. This problem appears in the Rosalindand the Stepic Bioinformatics problem-solving sites and also as an assignment in a Coursera BioInformatics Algorithms course provided by UCSD.
- A Profile-randomly generated k-mer in a string is defined as follows: For each k-mer Pattern in Text, compute the probability Pr(Pattern | Profile), resulting in n = |Text| – k + 1 probabilities (p1, ., pn). These probabilities do not necessarily sum to 1, but can still form the random number generator Random(p1, ., pn) based on them.
- GIBBSSAMPLER uses this random number generator to select a Profile-randomly generated k-mer at each step: if the n-sided (biased) die with probability of face-showups (p1, ., pn) rolls the number i, then the Profile-randomly generated k-mer is defined as the i-th k-mer in Text.
- The algorithm is then repeated a few times with a few random starts and the best motifs are chosen in the end.
- The following shows the algorithm used by to implement the Gibbs Sampler and also couple of sample input datasets used and the output best motifs computed.
- The animations show the Profile matrix (with Information Content Scaled / unscaled) computed at each iteration of the Repeated GibsSampling algorithm and how it changes.