Abstract
The distribution of the distance between two (or more) successive occurrences of a specific word in a random sequence of letters is known under different models. In this paper, a more general problem is studied: the distribution of the distance between two (or more) successive occurrences of any word of a given set under a Markov model for the sequence. The generating function and a recurrence for obtaining the probabilities are given. These results are applied to study the distribution of the 'CHI' motif in the genome sequence of Haemophilus influenzae.
Key words and phrases: distance between occurrences, genome sequence analysis, semi-Markov process.