Abstract
The Z-value is an attempt to estimate the statistical significance of a Smith-Waterman dynamic programming alignment score (H-score) by means of a Monte-Carlo procedure. In this paper, we give an approximation of the Z-value distribution, deduced from the Poisson clumping heuristic developed by Waterman and Vingron (Stat. Sci. 9 (1994) 367), for the comparison of independent and identically distributed sequences. As for non-gapped alignment scores, our approximation is of Gumbel type, but with parameters that are sequence-independent. This result explains the related experimental results mentioned by Comet et al. (Comput. Chem. 23 (1999) 317). Using "quasi-real" sequences (i.e. randomly shuffled sequences of the same length and amino acid composition as the real ones), we investigate the relevance of our approximation. Since the Monte-Carlo approach we use biases the estimate of the Gumbel decay parameter, a correction procedure is proposed. Applications to real sequences are considered, and we show how our results can be used to detect potential biological relationships between real sequences.
Key words and phrases: Sequence alignment, Dynamic programming, Significance, Z-value, Approximated distribution, Gumbel distribution.
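As an illustration of the Monte-Carlo procedure referred to in the abstract, the following minimal Python sketch computes a Z-value as the observed Smith-Waterman score, centered and scaled by the mean and standard deviation of scores obtained after shuffling. It is a sketch under stated assumptions, not the implementation used in the paper: the function names `sw_score` and `z_value`, the toy match/mismatch scores, the linear gap penalty, the choice to shuffle only the second sequence, and the default of 100 shuffles are all illustrative.

```python
import random
import statistics


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score (H-score).

    Assumption: toy match/mismatch scores and a linear gap penalty,
    not a substitution matrix with affine gaps.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for ca in a:
        curr = [0]  # first column is always 0 for local alignment
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            h = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(h)
            if h > best:
                best = h
        prev = curr
    return best


def z_value(a, b, n_shuffles=100, seed=0):
    """Monte-Carlo Z-value: (observed - mean) / standard deviation.

    Assumption: only `b` is shuffled; each shuffle keeps its length
    and letter composition, as for "quasi-real" sequences.
    """
    rng = random.Random(seed)
    observed = sw_score(a, b)
    letters = list(b)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        scores.append(sw_score(a, "".join(letters)))
    mu = statistics.mean(scores)
    sigma = statistics.stdev(scores)
    if sigma == 0:
        raise ValueError("shuffled scores have zero variance")
    return (observed - mu) / sigma
```

A call such as `z_value(seq1, seq2)` returns a large positive value when the observed score is well above what shuffled sequences of the same composition produce. Because the mean and standard deviation are themselves estimated from a finite number of shuffles, the resulting Z-values are noisy, which is consistent with the bias on the Gumbel decay parameter that the abstract says must be corrected.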