TREEMM is a program for unsupervised clustering of promoter sequences. It
models distinct classes of bipartite motifs designed to represent the binding
sites of different Sigma factors, and it can account for the non-random
distribution of such motifs across a tree that summarizes the correlation
between the activities of the promoters.

The publication that describes TREEMM and should be referenced is
Nicolas, Mäder, Dervyn et al. (submitted).

For questions concerning TREEMM, contact pierre.nicolas@jouy.inra.fr

##########################
#Command line and options#
##########################

Usage : treemm -seq <file> [-tree <file>] -nmotifs <#> -niter <#>
        [-nheur <#(default 0)>] [-thin <#(default 50)>] [-alpha <#>]
        [-rseed <#>] [-bmm <#(default 3)>]
        [-lmaxbox1 <#(default 30)>] [-lmaxbox2 <#(default 30)>]
        [-lminbox1 <#(default 5)>] [-lminbox2 <#(default 1)>]
        [-uniformd] [-box2oblig]

Description of the command line arguments
-----------------------------------------
-seq <file> : input sequence file, see format below.
-nmotifs <#> : number of motif types.
-niter <#> : number of iterations of the MCMC algorithm.

Optional
--------
-tree <file> : input tree file, optional, see format below.
-nheur <#(default 0)> : number of iterations during which the heuristic split
  moves are allowed. This type of move consists of randomly splitting the set
  of sequences associated with a motif type into two subsets when one motif
  type accounts for less than 0.001 of the total number of sequences (this
  motif disappears with the heuristic split). These moves improve convergence
  towards the high-likelihood region of the parameter space but are not
  reversible. Their use should be restricted to the first part of the MCMC
  burn-in phase.
-thin <#(default 50)> : number of MCMC iterations between two recordings of
  the parameters and allocations, see output files.
-alpha <#> : the parameter of the Dirichlet priors (nucleotide composition
  parameters, probability of the different motif classes).
-rseed <#> : the seed of the random number generator, useful if you want to
  reproduce exactly the same results in different runs. If not specified, a
  seed is generated using the current time.
-bmm <#(default 3)> : the order of the Markov chain describing the background
  nucleotide composition. The default is r=3, meaning that the composition in
  words of length 4 is modeled. Select a value compatible with the size of
  your data set (there are 4^{r+1} parameters to be estimated for the
  background nucleotide composition).
-lmaxbox1 <#(default 30)> : maximum allowed length for the first box (-10 box).
-lmaxbox2 <#(default 30)> : maximum allowed length for the second box (-35 box).
-lminbox1 <#(default 5)> : minimum allowed length for the first box.
-lminbox2 <#(default 1)> : minimum allowed length for the second box.
-uniformd : when this flag is set, there is no preferential position for the
  motifs within the sequences.
-box2oblig : when this flag is set, the second box is obligatory; otherwise
  the probability of the second box is estimated for each motif (>0.5).

#############
#Input files#
#############

The input sequences file
------------------------
The sequence data should be formatted like

3242 101
S1	TGTCCGCTTTGTGGATAAGATTGTGACAACCATTGCAAGCTCTCGTTTATTTTGGTATTATATTTGTGTTTTAACTCTTGATTACTAATCCTACCTTTCCT
S2	AACACAAAAAAAGAGCAAATGGCGCTTACCATTCGTACACATTAAATGTTGAAAACATATAATATAGTAGATAAATAGCTTTTCGACAAATTTCACAACTT
...

where 3242 specifies the number of sequences to be analyzed and 101 indicates
the nucleotide length of each sequence. The subsequent lines (here 3242)
provide the sequence information: each is composed of a sequence identifier
separated from the sequence itself by a tabulation character.

Notice: all the sequences need to have the same length, and missing data is
not allowed.
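As an illustration of the input sequence format described above, here is a minimal Python sketch of a checker that could be run before launching TREEMM. It is not part of TREEMM: the function name `parse_sequences` is hypothetical, and the check that sequences contain only A, C, G and T is an assumption about what "missing data is not allowed" means in practice.

```python
def parse_sequences(lines):
    """Parse and validate a TREEMM-style sequence file given as an iterable of lines.

    Returns a dict mapping sequence identifier to sequence.
    Raises ValueError when the content disagrees with the header.
    """
    # Drop empty lines and trailing newlines.
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if not lines:
        raise ValueError("empty input")

    # Header: number of sequences, then the common sequence length.
    n_seq, seq_len = (int(x) for x in lines[0].split())
    records = lines[1:]
    if len(records) != n_seq:
        raise ValueError(f"header announces {n_seq} sequences, found {len(records)}")

    seqs = {}
    for ln in records:
        # Identifier and sequence are separated by a tabulation character.
        ident, sep, seq = ln.partition("\t")
        if not sep:
            raise ValueError(f"no tabulation character in line: {ln!r}")
        seq = seq.strip()
        if len(seq) != seq_len:
            raise ValueError(f"{ident}: length {len(seq)}, expected {seq_len}")
        # Assumption: only A, C, G, T are valid (no N or gap characters).
        if set(seq.upper()) - set("ACGT"):
            raise ValueError(f"{ident}: unexpected character in sequence")
        seqs[ident] = seq
    return seqs
```

A typical use would be `parse_sequences(open(path))`, where `path` is the file later passed to `-seq`.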

The input tree file (optional)
------------------------------
The topology and branch length information about the tree should be provided
in a four-column format as shown below

node child1 child2 height
1 -20 -1231 0.0025
2 -2855 -2978 0.0025
...
21 -982 6 0.0038
...
470 353 434 0.0155
...
3241 3238 3240 0.5587

where the first column indicates the number of the node, the next two columns
indicate the two children of this node, and the last column gives the height
of the node. The n leaves (here 3242) are numbered negatively from -1 to -n;
the n-1 internal nodes are numbered positively from 1 to n-1. The nodes need
to be ordered by increasing height, such that all the descendants of a node
are found above the line that describes the node.

Notice: this input format supposes an ultra-metric binary tree, as all the
leaves are implicitly assumed to have height 0. These requirements are
naturally fulfilled if the tree was obtained by hierarchical clustering of the
matrix of pair-wise correlation coefficients using the average linkage
algorithm. Modifying the program to allow other types of tree should not be
too difficult.

##############
#Output files#
##############

The standard output provides the following useful information:
* The user-specified parameters that were used.
* The value of the seed of the random number generator. You will need it if
  you want to rerun the algorithm and obtain exactly the same results.
* The log-likelihood of the no-motif (background only) model. Both the
  marginal likelihood (expectation wrt the prior) and the maximum likelihood
  are reported.
* The progress of the MCMC algorithm. This is useful as running the program on
  big data sets can take several days.

Two other files report the values of the parameters and the clustering (values
of the allocation variables) across iterations after thinning (see the -thin
command line option).
Notice that, for inference purposes, the first part of the iterations should
be discarded as it corresponds to the burn-in phase of the MCMC algorithm.

The clustering information file
-------------------------------
One out of every two lines provides the MCMC iteration number, and the other
line provides the clustering information, formatted as follows.

**** iter 99950 ****
alloc s0 15 56 1 7.262 2.881 s1 4 57 1 9.081 3.653 ...

where (1+i*6) is the column offset to access information concerning the ith
sequence (i starting at 0):
* column (1+i*6)+1 recalls the sequence identifier;
* column (1+i*6)+2 indicates the motif type (cluster) associated with the
  sequence; motifs are numbered from 0 to nclust-1, and nclust corresponds to
  an absence of motif;
* column (1+i*6)+3 reports the reference position of the first box in the
  sequence;
* column (1+i*6)+4 reports the variable component of the spacer distance
  between the two boxes;
* column (1+i*6)+5 reports the log-likelihood ratio (motif vs. background)
  associated with the first box;
* column (1+i*6)+6 reports the log-likelihood ratio (motif vs. background)
  associated with the second box.

The parameter information file
------------------------------
After the number of the MCMC iteration, one finds the information on the
parameters in the following order:
* the current distribution for the reference position of box 1 in the sequence
  (lines _nbins_p_pos_motif, _breakpoints_p_pos_motif, _p_pos_motif).
* the current background composition, as a vector of length 4+4^2+...+4^{r+1}
  where r is the order of the Markov chain that serves to model the
  background. Only the last 4^{r+1} values are estimated and thus change
  between iterations.
* the probability of each motif type after a switch occurring on a branch of
  the tree (_pmotifs_ontree) or at a terminal leaf (_pmotifs). These are
  vectors of length nclust+1, as the last value corresponds to the absence of
  motif.
* the rate of motif switches per unit of branch length in the tree
  (_alpha_evolontree) and the probability of a motif switch at a terminal leaf
  (_epsilon_evolontree).
* the description of each motif, with:
  the length of the two boxes (_lbox1, _lbox2),
  the fixed length of the spacer distance (lspacer),
  the length of the variable component of the spacer distance (lvarspacer)
  and the probability of each distance (_pvarspacer),
  whether or not the second box is "optional" (_box2isoptional),
  the probability of the second box (pbox2),
  tracking information that allows aligning the position-specific nucleotide
  composition parameters across iterations (_track_posbox1, _track_posbox2),
  the nucleotide composition at each position in each box (5'->3', column
  order A G C T).
* the log-likelihood of the sequence data with the current set of parameters.
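
The column layout of the clustering information file described above can be read with a few lines of Python. The following is only a sketch, not code distributed with TREEMM: the names `parse_alloc_line`, `read_clustering` and `burn_in` are hypothetical, columns are assumed to be separated by whitespace, and the integer types chosen for the position and spacer columns are an assumption based on the example values.

```python
import re

# Matches header lines such as "**** iter 99950 ****".
ITER_RE = re.compile(r"\*+\s*iter\s+(\d+)\s*\*+")


def parse_alloc_line(line):
    """Split one "alloc ..." line into one record per sequence."""
    tokens = line.split()
    if not tokens or tokens[0] != "alloc" or (len(tokens) - 1) % 6:
        raise ValueError("not a well-formed alloc line")
    records = []
    # tokens[base] is column (1+i*6)+1 of the README, i.e. the identifier.
    for base in range(1, len(tokens), 6):
        records.append({
            "id": tokens[base],
            "motif": int(tokens[base + 1]),       # nclust means "no motif"
            "pos_box1": int(tokens[base + 2]),
            "var_spacer": int(tokens[base + 3]),
            "llr_box1": float(tokens[base + 4]),
            "llr_box2": float(tokens[base + 5]),
        })
    return records


def read_clustering(lines, burn_in=0):
    """Return {iteration: records}, skipping iterations below burn_in."""
    samples = {}
    current = None
    for line in lines:
        match = ITER_RE.match(line.strip())
        if match:
            current = int(match.group(1))
        elif line.strip() and current is not None:
            if current >= burn_in:
                samples[current] = parse_alloc_line(line)
            current = None
    return samples
```

The `burn_in` threshold has to be chosen by the user, for instance by inspecting the log-likelihood reported in the parameter information file; no particular value is implied here.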