Finding Words with Unexpected Frequencies in Deoxyribonucleic Acid Sequences

Bernard PRUM, François RODOLPHE and Élisabeth de TURCKHEIM

J. R. Statist. Soc. B, vol. 57, 205-220, 1995.

Abstract

Considering a Markov chain model for DNA sequences, this paper proposes two asymptotically normal statistics to test whether the frequency of a given word is consistent with the first-order Markov chain model. The problem is to choose an estimate $\hat{\mu}(W)$ of the expectation of the frequency $M_{W}$ of a word $W$ in the observed sequence such that the asymptotic variance of $M_{W}-\hat{\mu}(W)$ is easily computable. The first estimator is derived from the frequency of $W^{[-1]}$, which is $W$ with its last letter deleted. The second, following an idea of Cowan (1991), is the conditional expectation of $M_{W}$ given the observed frequencies of all 2-letter words. Finally, two examples, on phage $\lambda$ and phage T7, are presented.
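The abstract does not state the estimators explicitly, so the following Python sketch is only one plausible reading of them, using standard plug-in forms for a first-order Markov chain. The function names (`count_word`, `expected_from_prefix`, `expected_from_pairs`) are hypothetical, and the exact formulas are an assumption, not taken from the paper. The first function scales the count of $W^{[-1]}$ by an estimated transition probability for the last letter; the second uses the product of 2-letter word counts divided by the counts of the interior letters, a common large-sample approximation to the expected count given the 2-letter word frequencies. Neither computes the asymptotic variance needed for the actual test statistics.

```python
def count_word(seq, word):
    """Count occurrences of `word` in `seq`, allowing overlaps."""
    m = len(word)
    return sum(1 for i in range(len(seq) - m + 1) if seq[i:i + m] == word)


def expected_from_prefix(seq, word):
    """Assumed form of the first estimator (word length >= 2):
    count of the word without its last letter, times the estimated
    probability that its last-but-one letter is followed by its last letter."""
    prefix = word[:-1]
    a, b = word[-2], word[-1]
    n_prefix = count_word(seq, prefix)
    n_ab = count_word(seq, a + b)
    # Occurrences of `a` that are followed by some letter (exclude final position).
    n_a = count_word(seq[:-1], a)
    return n_prefix * n_ab / n_a if n_a else 0.0


def expected_from_pairs(seq, word):
    """Assumed plug-in approximation to the second estimator (word length >= 2):
    product of 2-letter word counts over product of interior letter counts."""
    numerator = 1.0
    for i in range(len(word) - 1):
        numerator *= count_word(seq, word[i:i + 2])
    denominator = 1.0
    for letter in word[1:-1]:
        denominator *= count_word(seq, letter)
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    seq = "ACGTACGTACGA"  # toy sequence, not real phage data
    word = "ACG"
    print(count_word(seq, word))
    print(expected_from_prefix(seq, word))
    print(expected_from_pairs(seq, word))
```

Comparing the observed count with either estimate gives only the numerator of a test statistic; a usable test would also need the asymptotic variance of the difference, which is the quantity the paper is concerned with.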

Key words and phrases: Words in DNA sequences, unexpected frequencies, Markov chains, central limit theorems.
