Heterogeneity of DNA sequences and hidden Markov chains.

Florence Muri
Université Paris V, UA CNRS 1323, France

Abstract:

Statistical analysis of DNA sequences requires models describing the succession of the bases adenine, cytosine, guanine and thymine. A model that assumes homogeneity along the whole sequence cannot reflect the compositional variation that may exist between different regions of the genome. Our purpose is to define a model that takes the observed heterogeneity into account and to use it to identify distinct regions of the sequence. The break points that delimit these regions may then separate parts of the genome with different functional or structural properties.

The statistical approach we consider is based on hidden Markov chains. We suppose that the sequence has an underlying structure corresponding to the unobservable states of a first-order Markov chain (the ``hidden process''). The bases A, C, G and T then appear according to a distribution that depends on the hidden state: we consider models in which, conditionally on the states, the bases are drawn either independently or according to a first- (or higher-) order Markov chain. The aim is to reconstruct the hidden process and to estimate the parameters of the model.
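
To make the simplest of these models concrete, the following minimal sketch simulates a hidden first-order Markov chain and draws each base independently given the current hidden state. The two-state setup (an AT-rich and a GC-rich regime), the numerical values in TRANS, EMIT and INIT, and the function name simulate_hmm are illustrative assumptions, not taken from the paper.

```python
import random

BASES = "ACGT"

# Hypothetical parameters (assumed for illustration only).
# State 0 is AT-rich, state 1 is GC-rich.
TRANS = [[0.98, 0.02],   # TRANS[i][j]: P(next state = j | current state = i)
         [0.03, 0.97]]
EMIT = [[0.35, 0.15, 0.15, 0.35],   # EMIT[i][b]: P(base = BASES[b] | state = i)
        [0.15, 0.35, 0.35, 0.15]]
INIT = [0.5, 0.5]        # INIT[i]: P(first state = i)


def simulate_hmm(n, trans, emit, init, seed=0):
    """Draw a hidden state path and a DNA sequence, both of length n.

    Conditionally on the hidden states, the bases are drawn independently.
    """
    rng = random.Random(seed)
    n_states = len(init)
    states, seq = [], []
    state = rng.choices(range(n_states), weights=init)[0]
    for _ in range(n):
        states.append(state)
        # Emit a base according to the distribution of the current state.
        seq.append(rng.choices(BASES, weights=emit[state])[0])
        # Move the hidden chain one step forward.
        state = rng.choices(range(n_states), weights=trans[state])[0]
    return states, "".join(seq)


if __name__ == "__main__":
    states, seq = simulate_hmm(60, TRANS, EMIT, INIT, seed=1)
    print(seq)
    print("".join(str(s) for s in states))
```

In a real analysis only the sequence is observed; the state path is what the identification procedures try to recover. A model with Markov-dependent bases would replace each row of EMIT with a state-specific transition matrix on the bases.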

This is a missing-data problem involving mixtures of distributions with dependent data. We compare several procedures for identifying the hidden chain: the EM algorithm (and its stochastic version, SEM), the Viterbi algorithm, and Markov chain Monte Carlo methods (in particular Gibbs sampling). All these methods aim at maximum likelihood estimation. We also consider Bayesian estimation using Gibbs sampling. The performance of these algorithms is discussed (dependence on the initialization, rate of convergence, etc.).
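
As an illustration of one of the procedures named above, here is a minimal sketch of the Viterbi algorithm for the model with independent bases given the states. It assumes that the parameters are already known, with all probabilities strictly positive, and that the sequence is non-empty. The parameter values and the names viterbi and break_points are hypothetical. In practice the parameters would first have to be estimated, for instance by EM.

```python
import math

BASES = "ACGT"

# Hypothetical, known parameters (assumed for illustration only).
TRANS = [[0.95, 0.05],   # TRANS[j][k]: P(next state = k | current state = j)
         [0.05, 0.95]]
EMIT = [[0.4, 0.1, 0.1, 0.4],   # state 0: AT-rich
        [0.1, 0.4, 0.4, 0.1]]   # state 1: GC-rich
INIT = [0.5, 0.5]


def viterbi(seq, trans, emit, init):
    """Return the most probable hidden state path for seq (log-space)."""
    n_states = len(init)
    idx = [BASES.index(b) for b in seq]
    # score[k]: best log-probability of any path ending in state k.
    score = [math.log(init[k]) + math.log(emit[k][idx[0]])
             for k in range(n_states)]
    back = []  # back[t][k]: best predecessor of state k at step t + 1
    for b in idx[1:]:
        prev, new = [], []
        for k in range(n_states):
            best = max(range(n_states),
                       key=lambda j: score[j] + math.log(trans[j][k]))
            prev.append(best)
            new.append(score[best] + math.log(trans[best][k])
                       + math.log(emit[k][b]))
        back.append(prev)
        score = new
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda k: score[k])
    path = [state]
    for prev in reversed(back):
        state = prev[state]
        path.append(state)
    path.reverse()
    return path


def break_points(path):
    """Positions at which the reconstructed hidden state changes."""
    return [t for t in range(1, len(path)) if path[t] != path[t - 1]]


if __name__ == "__main__":
    path = viterbi("A" * 20 + "G" * 20, TRANS, EMIT, INIT)
    print(break_points(path))
```

Working with log-probabilities avoids numerical underflow on long sequences, which matters for genome-scale data.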

An application to bacteriophage lambda and to the HIV-1 virus is presented to illustrate the method.