TREEMM is a program for unsupervised clustering of promoter sequences
based on modeling of distinct classes of bipartite motifs designed to
represent binding sites of different Sigma factors. It can account
for the non-random distribution of such motifs across a tree
that summarizes the correlation between the activities of the
promoters.
The publication that describes TREEMM and should be referenced is
Nicolas, Mäder, Dervyn et al. (submitted).
For questions concerning TREEMM, you can contact
pierre.nicolas@jouy.inra.fr
##########################
#Command line and options#
##########################
Usage : treemm -seq <file> [-tree <file>] -nmotifs <#> -niter <#>
[-nheur <#(default 0)>] [-thin <#(default 50)>]
[-alpha <#>] [-rseed <#>] [-bmm <#(default 3)>]
[-lmaxbox1 <#(default 30)>] [-lmaxbox2 <#(default 30)>]
[-lminbox1 <#(default 5)>] [-lminbox2 <#(default 1)>]
[-uniformd] [-box2oblig]
Description of the command line arguments
-----------------------------------------
-seq <file> : input sequence file, see format below.
-nmotifs <#> : how many motif types
-niter <#> : how many iterations for the MCMC algorithm
Optional
--------
-tree <file> : input tree file, optional, see format below.
-nheur <#(default 0)> : for how many iterations the heuristic split
moves are allowed. This type of move consists of splitting randomly
the set of sequences associated with a motif type into two subsets
when one motif type accounts for less than 0.001 of the total number
of sequences (this motif disappears with the heuristic split). These
moves improve the convergence towards the high likelihood region of
the parameter space but are not reversible. Their use should be
restricted to the first part of the MCMC burn-in phase.
-thin <#(default 50)> : the number of MCMC iterations between the
recording of parameters and allocations, see output files.
-alpha <#> : the parameter of the Dirichlet priors
(nucleotide composition parameters, probability of the different
motif classes).
-rseed <#> : the seed of the random number generator, useful if you
want to reproduce exactly the same results in different runs. If not
specified a seed is generated using the current time.
-bmm <#(default 3)> : the order of the Markov chain describing the
background nucleotide composition. The default is r=3 meaning that
the composition in words of length 4 is modeled. Select a value
compatible with the size of your data set (there are 4^{r+1}
parameters to be estimated for the background nucleotide
composition).
-lmaxbox1 <#(default 30)> : maximum allowed length for the first box
(-10 box).
-lmaxbox2 <#(default 30)> : maximum allowed length for the second box
(-35 box).
-lminbox1 <#(default 5)> : minimum allowed length for the first box
-lminbox2 <#(default 1)> : minimum allowed length for the second box
-uniformd : when this flag is set there is no preferential position
for the motifs within the sequences.
-box2oblig : when this flag is set the second box is obligatory,
otherwise the probability of the second box is estimated for each
motif (>0.5).
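As an illustration of how the options above combine, the following
Python sketch assembles a TREEMM command line. The helper name
build_treemm_command and the file names promoters.txt and tree.txt are
hypothetical; only the flags come from the usage line above, and the
executable is assumed to be reachable as "treemm". The iteration counts
are arbitrary placeholders, not recommendations.

```python
import shlex


def build_treemm_command(seq_file, nmotifs, niter, tree_file=None,
                         nheur=None, thin=None, rseed=None, bmm=None,
                         uniformd=False, box2oblig=False, exe="treemm"):
    """Return the argument list for one TREEMM run (hypothetical helper)."""
    cmd = [exe, "-seq", str(seq_file)]
    if tree_file is not None:
        cmd += ["-tree", str(tree_file)]
    cmd += ["-nmotifs", str(nmotifs), "-niter", str(niter)]
    # Optional numeric options are only added when a value is given,
    # so the program's own defaults apply otherwise.
    for flag, value in (("-nheur", nheur), ("-thin", thin),
                        ("-rseed", rseed), ("-bmm", bmm)):
        if value is not None:
            cmd += [flag, str(value)]
    if uniformd:
        cmd.append("-uniformd")
    if box2oblig:
        cmd.append("-box2oblig")
    return cmd


# Hypothetical file names and arbitrary iteration counts.
cmd = build_treemm_command("promoters.txt", nmotifs=10, niter=100000,
                           tree_file="tree.txt", nheur=5000, rseed=42)
print(shlex.join(cmd))
```

The returned list can be handed to subprocess.run(cmd, check=True).
The remaining options (-alpha, -lmaxbox1, ...) could be added to the
helper in the same way.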
#############
#Input files#
#############
The input sequences file
------------------------
The sequence data should be formatted like
3242 101
S1 TGTCCGCTTTGTGGATAAGATTGTGACAACCATTGCAAGCTCTCGTTTATTTTGGTATTATATTTGTGTTTTAACTCTTGATTACTAATCCTACCTTTCCT
S2 AACACAAAAAAAGAGCAAATGGCGCTTACCATTCGTACACATTAAATGTTGAAAACATATAATATAGTAGATAAATAGCTTTTCGACAAATTTCACAACTT
...
where 3242 specifies the number of sequences to be analyzed and 101
indicates the nucleotide length of each sequence. The subsequent lines
(here 3242) provide the sequence information: they are composed of a
sequence identifier separated from the sequence itself by a tab
character.
Note: all the sequences must have the same length, and
missing data is not allowed.
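A file in this format can be produced with a few lines of Python. The
sketch below is only an illustration: the function name
write_treemm_sequences is hypothetical, the two header numbers are
assumed to be separated by a single space, and "missing data" is
interpreted here as any character other than A, C, G or T.

```python
def write_treemm_sequences(path, records):
    """Write (identifier, sequence) pairs in the format described above."""
    records = list(records)
    if not records:
        raise ValueError("no sequences to write")
    length = len(records[0][1])
    for name, seq in records:
        # All sequences must share the same length.
        if len(seq) != length:
            raise ValueError(f"{name}: length {len(seq)} != {length}")
        # Assumption: only the four nucleotides count as valid data.
        if set(seq.upper()) - set("ACGT"):
            raise ValueError(f"{name}: characters other than A, C, G, T")
    with open(path, "w") as handle:
        # Header: number of sequences, then the common sequence length.
        handle.write(f"{len(records)} {length}\n")
        for name, seq in records:
            # Identifier and sequence are separated by a tab character.
            handle.write(f"{name}\t{seq.upper()}\n")
```

For example, write_treemm_sequences("promoters.txt", [("S1", "ACGT"),
("S2", "TTGA")]) writes a header line "2 4" followed by two sequence
lines; the file name is again hypothetical.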
The input tree file (optional)
------------------------------
The topology and branch length information about the tree should be
provided in a four-column format as shown below
node child1 child2 height
1 -20 -1231 0.0025
2 -2855 -2978 0.0025
...
21 -982 6 0.0038
...
470 353 434 0.0155
...
3241 3238 3240 0.5587
where the first column indicates the number of the node, the next two
columns indicate the two children of this node and the last column
gives the height of the node. The n leaves (here 3242) are numbered
negatively from -1 to -n, the n-1 internal nodes are numbered
positively from 1 to n-1. The nodes need to be ordered by increasing
height such that all the descendants of a node are found above the
line that describes the node.
Note: this input format assumes an ultrametric binary tree, as all
the leaves are implicitly assumed to have height 0. These requirements
are naturally fulfilled if the tree was obtained by hierarchical
clustering of the matrix of pair-wise correlation coefficients using
the average linkage algorithm. Modifying the program to allow other
types of tree should not be too difficult.
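The numbering and ordering rules above can be checked before a run
with a short Python sketch such as the one below. It is not part of
TREEMM: the function name check_treemm_tree is hypothetical, and it is
assumed that a "node child1 child2 height" header line may or may not
be present in the file (the sketch skips it if found).

```python
def check_treemm_tree(lines):
    """Check tree rows against the rules above; return the number of leaves."""
    rows = []
    for line in lines:
        fields = line.split()
        # Skip blank lines and a header line (assumed to be optional).
        if not fields or fields[0] == "node":
            continue
        node, child1, child2 = (int(value) for value in fields[:3])
        rows.append((node, child1, child2, float(fields[3])))

    n_leaves = len(rows) + 1  # n leaves imply n-1 internal nodes
    defined, has_parent = set(), set()
    previous_height = 0.0     # leaves are implicitly at height 0
    for node, child1, child2, height in rows:
        if node in defined or not 1 <= node <= n_leaves - 1:
            raise ValueError(f"unexpected node number {node}")
        if height < previous_height:
            raise ValueError(f"node {node}: height decreases")
        for child in (child1, child2):
            is_leaf = -n_leaves <= child <= -1
            is_earlier_node = child in defined
            if not (is_leaf or is_earlier_node):
                raise ValueError(
                    f"node {node}: child {child} is neither a leaf "
                    "nor a node described on an earlier line")
            if child in has_parent:
                raise ValueError(
                    f"node {node}: child {child} already has a parent")
            has_parent.add(child)
        defined.add(node)
        previous_height = height
    return n_leaves
```

Calling check_treemm_tree(open("tree.txt")) on a hypothetical file
tree.txt returns the number of leaves, which should match the number
of sequences given in the sequence file.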
##############
#Output files#
##############
The standard output provides the following useful information:
* The user-specified parameters that were used.
* The value of the seed of the random number generator. You will need
it if you want to rerun the algorithm and to obtain exactly the same
results.
* The log-likelihood of the no-motif (background only) model. Both the
marginal likelihood (expectation with respect to the prior) and the
maximum likelihood are reported.
* The progress of the MCMC algorithm. Useful as running the program on big
data sets can take several days.
Two other files report the values of the parameters and the clustering
(values of the allocation variables) across iterations after thinning
(see the command line option). Note that for inference purposes the
first part of the iterations should be discarded as they correspond to
the burn-in phase of the MCMC algorithm.
The clustering information file
-------------------------------
Lines alternate: one line provides the MCMC iteration number and the next
line provides the clustering information, formatted as follows.
**** iter 99950 ****
alloc s0 15 56 1 7.262 2.881 s1 4 57 1 9.081 3.653 ...
where (1+i*6) is the column offset to access information
concerning the ith sequence (i starting at 0):
* column (1+i*6)+1 gives the sequence identifier;
* column (1+i*6)+2 indicates the motif type (cluster) associated with
the sequence, motifs are numbered from 0 to nclust-1, and nclust
corresponds to an absence of motif;
* column (1+i*6)+3 reports the reference position of the first box in
the sequence;
* column (1+i*6)+4 reports the variable component of the spacer
distance between the two boxes;
* column (1+i*6)+5 reports the log likelihood ratio (motif
vs. background) associated with the first box;
* column (1+i*6)+6 reports the log likelihood ratio (motif
vs. background) associated with the second box.
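The column layout above translates directly into a small parser. The
Python sketch below is an illustration only: the function names, the
dictionary keys and the burn_in argument are hypothetical, and it is
assumed that the motif type, position and spacer columns are integers
and that the two log likelihood ratios are always printed as numbers.

```python
import re

# Matches iteration header lines such as "**** iter 99950 ****".
ITER_RE = re.compile(r"\*+\s*iter\s+(\d+)\s*\*+")


def parse_alloc_line(line):
    """Split one "alloc ..." line into one record per sequence."""
    fields = line.split()
    if not fields or fields[0] != "alloc":
        raise ValueError("not an allocation line")
    values = fields[1:]
    if len(values) % 6:
        raise ValueError("expected 6 columns per sequence")
    records = []
    for start in range(0, len(values), 6):
        name, motif, pos, var_spacer, llr1, llr2 = values[start:start + 6]
        records.append({
            "name": name,
            "motif": int(motif),            # nclust means "no motif"
            "position": int(pos),           # reference position of box 1
            "var_spacer": int(var_spacer),  # variable part of the spacer
            "llr_box1": float(llr1),
            "llr_box2": float(llr2),
        })
    return records


def read_clustering(lines, burn_in=0):
    """Yield (iteration, records), skipping iterations <= burn_in."""
    iteration = None
    for line in lines:
        stripped = line.strip()
        match = ITER_RE.match(stripped)
        if match:
            iteration = int(match.group(1))
        elif (stripped.startswith("alloc") and iteration is not None
              and iteration > burn_in):
            yield iteration, parse_alloc_line(stripped)
```

Passing an open clustering file together with a burn_in value chosen
by the user yields only the recorded iterations that follow the
burn-in phase, for example to count how often each sequence is
allocated to each motif type.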
The parameter information file
------------------------------
After the number of the MCMC iteration one finds the information on the
parameters in the following order:
* current distribution for the reference position of box 1 in the
sequence (lines _nbins_p_pos_motif, _breakpoints_p_pos_motif,
_p_pos_motif).
* current background composition as a vector of length
4+4^2+...+4^{r+1} where r is the order of the Markov chain that
serves to model the background. Only the last 4^{r+1} values are
estimated and thus change between iterations.
* the probability of each motif type after a switch occurring on a
branch of the tree (_pmotifs_ontree) or at a terminal leaf
(_pmotifs). These are vectors of length nclust+1 as the last value
corresponds to the absence of motif.
* the rate of motif switches per unit of branch length in the tree
(_alpha_evolontree) and the probability of motif switch at terminal
leaf (_epsilon_evolontree).
* then one finds the description of each motif with the length of the
two boxes (_lbox1, _lbox2), the fixed length of the spacer distance
(lspacer), the length of the variable component of the spacer
distance (lvarspacer) and the probability of each distance
(_pvarspacer), whether or not the second box is "optional"
(_box2isoptional), the probability of the second box (pbox2),
tracking information allowing to align position-specific nucleotide
composition parameters across iterations (_track_posbox1,
_track_posbox2), the nucleotide composition at each position in each
box (5'->3', column order A G C T).
* the log-likelihood of the sequence data with the current set of parameters.
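The length of the background composition vector described above, and
the size of its estimated part, follow from the order r given with
-bmm. The short Python sketch below simply evaluates the two formulas;
the function names are hypothetical.

```python
def background_vector_length(order):
    """Total length 4 + 4^2 + ... + 4^(r+1) of the background vector."""
    return sum(4 ** k for k in range(1, order + 2))


def background_estimated_count(order):
    """Number of trailing values, 4^(r+1), that are estimated."""
    return 4 ** (order + 1)


# With the default order r=3 the vector holds 4 + 16 + 64 + 256 values,
# of which the last 256 change between iterations.
print(background_vector_length(3), background_estimated_count(3))
```

Given such a vector as a Python list, the estimated part is the slice
vector[-background_estimated_count(order):].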