Exact distribution of the distances between
any occurrences of a set of words


Stéphane ROBIN and Jean-Jacques DAUDIN


Ann. Inst. Statist. Math., 36 895-905.


Abstract

The distribution of the distance between two (or more) successive occurrences of a specific word in a random sequence of letters is known under different models. In this paper, a more general problem is studied: the distribution of the distance between two (or more) successive occurrences of any word of a given set under a Markov model for the sequence. The generating function and a recurrence for obtaining the probabilities are given. These results are applied to study the distribution of the 'CHI' motif in the genome sequence of Haemophilus influenzae.
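To make the quantity under study concrete, the following minimal Python sketch computes the observed distances between successive occurrences of any word of a set in a given sequence. It only illustrates the definition and is not the exact method of the paper (no generating function or recurrence is computed). The function names `occurrence_positions` and `distances`, the choice to locate an occurrence by the position of its last letter, and the choice to count two words ending at the same position as a single occurrence are all assumptions made for this illustration.

```python
from typing import Iterable, List


def occurrence_positions(sequence: str, words: Iterable[str]) -> List[int]:
    """Sorted 1-based end positions of every occurrence of any word of the set.

    Assumption: an occurrence is located by the position of its last letter,
    and words ending at the same position count as one occurrence.
    """
    positions = set()
    for word in words:
        start = sequence.find(word)
        while start != -1:
            positions.add(start + len(word))  # 1-based index of the last letter
            start = sequence.find(word, start + 1)  # step by 1 to keep overlaps
    return sorted(positions)


def distances(positions: List[int], order: int = 1) -> List[int]:
    """Distances between occurrences that are `order` occurrences apart."""
    return [b - a for a, b in zip(positions, positions[order:])]


# Toy example (hypothetical sequence and word set, not data from the paper).
pos = occurrence_positions("ACACCACA", {"ACA", "CC"})
print(pos)                      # [3, 5, 8]
print(distances(pos))           # [2, 3]
print(distances(pos, order=2))  # [5]
```

With `order=1` the function returns the distances between consecutive occurrences; larger values of `order` give the distances spanning more than two successive occurrences, which corresponds to the "(or more)" case mentioned above.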

Key words and phrases: distance between occurrences, genome sequence analysis, semi-Markov process.