Gibbs Sampling to find the Best K-mer Motifs from a collection of Dna strings in R: BioInformatics Concepts

(Sandipan Dey, 20 August 2016)


  • In this article, Gibbs Sampling is used to find the Best K-mer Motifs from a collection of Dna strings. This problem appears in the Rosalindand the Stepic Bioinformatics problem-solving sites and also as an assignment in a Coursera BioInformatics Algorithms course provided by UCSD.
  • A Profile-randomly generated k-mer in a string is defined as follows: For each k-mer Pattern in Text, compute the probability Pr(Pattern | Profile), resulting in n = |Text| – k + 1 probabilities (p1, ., pn). These probabilities do not necessarily sum to 1, but can still form the random number generator Random(p1, ., pn) based on them.
  • GIBBSSAMPLER uses this random number generator to select a Profile-randomly generated k-mer at each step: if the n-sided (biased) die with probability of face-showups (p1, ., pn) rolls the number i, then the Profile-randomly generated k-mer is defined as the i-th k-mer in Text.
  • The algorithm is then repeated a few times with a few random starts and the best motifs are chosen in the end.
  • The following shows the algorithm used by to implement the Gibbs Sampler and also couple of sample input datasets used and the output best motifs computed.
  • The animations show the Profile matrix (with Information Content Scaled / unscaled) computed at each iteration of the Repeated GibsSampling algorithm and how it changes.algodata1





Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s