# Exceptional motifs in different Markov chain models for a statistical analysis of DNA sequences

### Sophie SCHBATH, Bernard PRUM and Élisabeth de TURCKHEIM

### *J. Comp. Biol.*, vol. 2, 417-437, 1995.

**Abstract**

Identifying exceptional motifs is often used for extracting
information from long DNA sequences. The two difficulties of the
method are the choice of the model that defines the expected
frequencies of words and the approximation of the variance of the
difference $T(W)$ between the number of occurrences of a word $W$ and
its estimate. We consider here different Markov chain models,
with either stationary or periodic transition probabilities. We
estimate the variance of the difference $T(W)$ by the conditional
variance of the number of occurrences of $W$ given the oligonucleotide
counts that define the model. Two applications show how to use
asymptotically standard normal statistics associated with the counts
to describe a given sequence in terms of its outlying words. Sequences
of *Escherichia coli* and of *Bacillus subtilis* are compared
with respect to their exceptional tri- and tetranucleotides. For both
bacteria, exceptional 3-words are mainly found in the coding
frame. *E. coli* palindrome counts are analyzed in different
models, showing that many overabundant words are one-letter mutations
of avoided palindromes.

**Key words and phrases**
DNA sequences, unexpected frequencies,
statistical models, Markov chains, asymptotic variance.
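
The difference $T(W)$ mentioned in the abstract can be illustrated with a minimal sketch. The abstract gives no formula, so the code below makes an assumption: it uses the stationary Markov model of maximal order $m-2$ for a word of length $m$, for which the usual plug-in estimate of the expected count is $N(w_1 \dots w_{m-1})\,N(w_2 \dots w_m)\,/\,N(w_2 \dots w_{m-1})$. The function names `count_overlapping` and `word_difference` are hypothetical. The conditional variance is not computed, so this does not yield the asymptotically standard normal statistic itself.

```python
def count_overlapping(seq: str, word: str) -> int:
    """Count occurrences of `word` in `seq`, allowing overlapping matches."""
    if not word:
        return 0
    last_start = len(seq) - len(word)
    return sum(1 for i in range(last_start + 1) if seq.startswith(word, i))


def word_difference(seq: str, word: str) -> tuple[int, float, float]:
    """Return (observed, estimated, difference) for `word` in `seq`.

    Assumption (not stated in the abstract): the expected count is
    estimated under the Markov model of order len(word) - 2 by
    N(prefix) * N(suffix) / N(core), where prefix and suffix are the
    word without its last and first letter, and core is the word
    without both.
    """
    if len(word) < 3:
        raise ValueError("this sketch needs a word of length >= 3")
    observed = count_overlapping(seq, word)
    n_prefix = count_overlapping(seq, word[:-1])
    n_suffix = count_overlapping(seq, word[1:])
    n_core = count_overlapping(seq, word[1:-1])
    # If the core never occurs, neither can the word; report 0 as the estimate.
    estimated = n_prefix * n_suffix / n_core if n_core else 0.0
    return observed, estimated, observed - estimated


if __name__ == "__main__":
    # Toy sequence, for illustration only (not data from the paper).
    print(word_difference("ACGTACGTAC", "ACG"))
```

A positive difference suggests an overrepresented word and a negative one an avoided word, but the difference only becomes interpretable once it is divided by an estimate of its standard deviation, which is the point the paper addresses.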
