Exceptional motifs in different Markov chain models
for a statistical analysis of DNA sequences

Sophie SCHBATH, Bernard PRUM and Élisabeth de TURCKHEIM

J. Comp. Biol., vol. 2, 417-437, 1995.


Identifying exceptional motifs is a common way to extract information from long DNA sequences. The method faces two difficulties: the choice of the model that defines the expected frequencies of words, and the approximation of the variance of the difference $T(W)$ between the number of occurrences of a word $W$ and its estimate under the model. We consider here different Markov chain models, with either stationary or periodic transition probabilities. We estimate the variance of $T(W)$ by the conditional variance of the number of occurrences of $W$ given the oligonucleotide counts that define the model. Two applications show how to use the asymptotically standard normal statistics associated with the counts to describe a given sequence in terms of its outlying words. Sequences of Escherichia coli and of Bacillus subtilis are compared with respect to their exceptional tri- and tetranucleotides. For both bacteria, exceptional 3-words are found mainly in the coding frame. E. coli palindrome counts are analyzed under different models, showing that many overabundant words are one-letter mutations of avoided palindromes.
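
As an illustration of the difference $T(W)$ described above, here is a minimal Python sketch for one particular case: a stationary first-order Markov chain, for which the expected count of a word is commonly estimated by the product of its dinucleotide counts divided by the product of the counts of its interior letters. This is not the authors' code. The function names (`count_words`, `expected_count_m1`, `difference`) are hypothetical, and the variance normalization that turns $T(W)$ into an asymptotically standard normal statistic is deliberately left out, since its formula is not given in this abstract.

```python
from collections import Counter


def count_words(seq, k):
    """Count overlapping occurrences of every k-letter word in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def expected_count_m1(seq, word):
    """Plug-in estimate of the expected count of `word` (length >= 2)
    under a stationary first-order Markov model fitted to `seq`.

    Assumption (not taken from the abstract): the estimate is the product
    of the dinucleotide counts along the word divided by the product of
    the counts of its interior letters.
    """
    mono = count_words(seq, 1)
    di = count_words(seq, 2)
    numerator = 1.0
    for i in range(len(word) - 1):
        numerator *= di[word[i:i + 2]]
    denominator = 1.0
    for letter in word[1:-1]:
        denominator *= mono[letter]
    return numerator / denominator if denominator else 0.0


def difference(seq, word):
    """Observed count of `word` minus its estimated expected count."""
    observed = count_words(seq, len(word))[word]
    return observed - expected_count_m1(seq, word)
```

For example, `difference(seq, "GATC")` is positive when the word occurs more often than the first-order model predicts and negative when it is avoided. The raw difference is not comparable across words until it is divided by an estimate of its standard deviation.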

Key words and phrases: DNA sequences, unexpected frequencies, statistical models, Markov chains, asymptotic variance.
