Motif Statistics

Pierre Nicodème, Bruno Salvy and Philippe Flajolet

ESA99, Lecture Notes in Computer Science, vol. 1643, 194-211, 1999.


We present a complete analysis of the statistics of number of occurrences of a regular expression pattern in a random text. This covers ``motifs'' widely used in computational biology. Our approach is based on: (i) classical constructive results in theoretical computer science (automata and formal language theory); (ii) analytic combinatorics to compute asymptotic properties from generating functions; (iii) computer algebra to determine generating functions explicitly, analyse generating functions and extract coefficients efficiently. We provide constructions for overlapping or non-overlapping matches of a regular expression. A companion implementation produces: multivariate generating functions for the statistics under study; a fast computation of their Taylor coefficients which yields exact values of the moments with typical application to random texts of size 30,000; precise asymptotic formulae that allow predictions in texts of arbitrarily large sizes. Our implementation was tested by comparing predictions of the number of occurrences of motifs against the 7 megabytes aminoacid database Prodom. We handled more than 88% of the standard collection of Prosite motifs with our programs. Such comparisons help detect which motifs are observed in real biological data more or less frequently than theoretically predicted.

Statistiques des Séquences Biologiques Home Page