An efficient statistic to detect
over- and under-represented words
in DNA sequences


Sophie SCHBATH


J. Comp. Biol., vol. 4, 189-192, 1997.


Abstract

In this note, we point out a very efficient statistic for detecting over- and under-represented words in DNA sequences when the sequences are modeled as Markov chains. This statistic is missing from the recent review of this important problem, and it appears to be a better measure of the rarity and abundance of words in DNA sequences.

Key words and phrases: DNA sequences, word counts, unexpected frequencies, Gaussian approximation, Markov chains.


