Abstract
In this note, we point out a very efficient statistic for detecting over- and under-represented words in DNA sequences when Markov chain models are used to represent the sequences. This statistic is missing from the recent review of this important problem, and it appears to be a better measure of the rarity and abundance of words in DNA sequences.
Key words and phrases: DNA sequences, word counts, unexpected frequencies, Gaussian approximation, Markov chains.