Abstract
We present a complete analysis of the statistics of number of
occurrences of a regular expression
pattern in a random text. This covers ``motifs'' widely
used in computational biology.
Our approach is based on: (i)
classical constructive results in
theoretical computer science (automata and formal language theory);
(ii) analytic combinatorics
to compute
asymptotic properties from generating functions;
(iii) computer algebra to determine generating functions
explicitly, analyse generating functions
and extract coefficients efficiently. We provide
constructions for
overlapping or non-overlapping matches of a regular expression.
A companion
implementation produces: multivariate generating functions for the statistics
under study; a fast computation of their Taylor
coefficients which yields exact values of the
moments with typical application to random
texts of size 30,000;
precise asymptotic formulae that allow predictions in texts
of arbitrarily large sizes. Our implementation was tested by
comparing predictions of the number of occurrences of motifs against
the 7 megabytes aminoacid database Prodom.
We handled more than 88% of the standard collection of Prosite motifs
with our programs.
Such comparisons help detect which motifs are observed in real
biological data more or less frequently than theoretically predicted.