# Compound Poisson approximation of word counts in DNA sequences

### Sophie SCHBATH

### *ESAIM: Probability and Statistics*, vol. 1, 1-16, 1995.

**Abstract**

Identifying words with unexpected frequencies is an important problem
in the analysis of long DNA sequences. To solve it, we need an
approximation of the distribution of the number of occurrences $N(W)$
of a word $W$. Modeling DNA sequences with $m$-order Markov chains,
we use the Chen-Stein method to obtain Poisson approximations for two
different counts. We approximate the "declumped" count of $W$ by a
Poisson variable and the number of occurrences $N(W)$ by a compound
Poisson variable. Combinatorial results are used to solve the general
case of overlapping words and to calculate the parameters of these
distributions.
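
The difference between the two counts can be illustrated with a minimal sketch. It assumes that a clump is a maximal run of occurrences in which each occurrence overlaps the previous one, so that the declumped count is the number of clumps. The function names, the example sequence and the word are hypothetical and are not taken from the paper.

```python
def occurrence_positions(seq, word):
    """Return start indices of all occurrences of word in seq, overlaps included."""
    positions = []
    start = seq.find(word)
    while start != -1:
        positions.append(start)
        # Advance by one character only, so overlapping occurrences are found.
        start = seq.find(word, start + 1)
    return positions


def count_clumps(positions, word_length):
    """Count clumps, assuming two occurrences overlap when their
    start indices differ by less than the word length."""
    clumps = 0
    previous = None
    for pos in positions:
        if previous is None or pos - previous >= word_length:
            clumps += 1  # no overlap with the previous occurrence: new clump
        previous = pos
    return clumps


# Hypothetical example data, chosen so that the word overlaps itself.
seq = "GATATATCCATATG"
word = "ATAT"

positions = occurrence_positions(seq, word)
n_occurrences = len(positions)                 # plays the role of N(W)
n_clumps = count_clumps(positions, len(word))  # the declumped count

print(positions, n_occurrences, n_clumps)
```

Here the word occurs three times but forms only two clumps, because the first two occurrences overlap. For a word that cannot overlap itself, the two counts coincide.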

**Key words and phrases**
Compound Poisson approximation, word counts, word periods, Chen-Stein method.
