This program implements the approach presented in our paper "Transcriptional landscape estimation from tiling array data using a model of signal shift and drift" (submitted). Example data sets and example command lines can be found in the section EXAMPLES. The options that control the behavior of the program are detailed in the section OPTIONS. ########## #EXAMPLES# ########## Data sets that were used in our study and that are formatted to suit the input file format can be found in the "examples" directory. - Basysbio, B. subtilis (exp # 147634_532) data : basysbio_s1_exp147634_532_hmm.dat (direct strand) basysbio_s2_exp147634_532_hmm.dat (complementary strand) - David, S. cerevisae first chromosome data: davidK1_s1_hmm.dat (direct strand) davidK1_s2_hmm.dat (complementary strand) As a first test of the program you may wish to run it on these two data sets. If this is the case, move to the examples directory and run the following commands to reproduce the results presented in our paper. ../bin/hmmtiling -data davidK1_s1_hmm.dat davidK1_s2_hmm.dat \ -randompar T -seqsmax 0.5 -rtrunc 2 >hmmtiling_david.stdout 2>&1 ../bin/hmmtiling -data basysbio_s1_hmm.dat basysbio_s2_hmm.dat \ -randompar T >hmmtiling_basysbio.stdout 2>&1 ############ #DATA INPUT# ############ The first seven lines of the basysbio data files are shown below. The first line indicates the column names, this line serves only as a reminder. Each line after this header corresponds to a different probe. The first column gives the position of the probe, the second column corresponds to the value of the signal (in log2 scale), the third column gives the value of the gDNA covariate (residual of the substraction with the mode or the median), the fourth column is a value that indicates the amount of "non-uniqueness" of the probe in the genome (probes with value above a threshold will be discarded, see -seqsmax parameter). Probes should be already ordonned according to their physical location on the chromosome. The positions given in the first column are not used by the program that does not accound for non equally spaced probes. This column is only intended to help the user. Because the model is intended to distinguish between upward and downward drift it is very important to always orient the probes in the same direction (conventionally 5'->3'): notice the decreasing positions of the probes in basysbio_s2_hmm.dat and davidK1_s2_hmm.dat because probes are on the complementary strand. head -7 basysbio_s2_hmm.dat: Pos Signal gDNA SeqS 1 9.09334 -0.27766 2 23 8.67281 -0.61719 2 45 9.00874 -1.33126 2 67 8.69349 -1.34151 1 89 8.8535 -0.8385 1 111 8.55159 -1.40841 1 head -7 basysbio_s2_hmm.dat: Pos Signal gDNA SeqS 4214617 10.35229 -0.70271 1 4214595 8.19638 -1.37462 1 4214573 9.58538 -0.27262 1 4214551 9.81876 0.07776 1 4214529 14.32713 0.47013 1 4214507 15.11244 0.44044 1 ################################## #UNDERLYING SIGNAL RECONSTRUCTION# ################################## The 20 columns of the *.path files are Xstatus 0: missing data, 1: fully observed, 2: left censored, 3: right censored t self generated probe id SeqS the same as in the input file X the value of the signal (same as the Signal column in the input file) XrX The difference Signal - gDNA XrrX The difference Signal - rho*gDNA Poutl The probability for this data point to be an outlier ES The expected value for the underlying signal (best estimator in terms of quadratic risk) MLS The most likely value value of the underlying signal at this position CIminS The lower end of the 95% prediction interval for the underlying signal CImaxS The upper end of the 95% prediction interval for the underlying signal PR0 The probability to be in regime 0 (1 if only one regime i.e. nR=1) PM1 The probability of a shift move at this position PM2 The probability of an upward drift move at this position PM3 The probability of an downward drift move at this position VitR The underlying regime at this position on Viterbi path VitM 1 if there is a shift, 2 if upward drift, 3 if downward drift, 0 elswhere VitS The value of the underlying signal according to the Viteri path PM1up The probability of an upward shift move at this position (PM1=PM1up+PM1down) PM1down The probability of a downward shift move at this position ######### #OPTIONS# ######### The shortest command line is: hmmtiling -data The complete list of options is printed when running the program without arguments: Usage: hmmtiling -data [ ...] [-nR ] (default 1) [-nS ] (default 100) [-Xrange ] (default 0.9) [-Srange ] (default 1.1) [-rtrunc ] (default 200) [-itermax ] (default 100) [-oprefix ] (default out) [-fraccat ] (default 0.13) [-iparam ] (default none) [-randompar ] (default F) [-seed ] (default time(NULL)) [-eta ] (default kernel) [-rhotype ] (default 2) [-sigmatype ] (default 2) [-outliertype ] (default 2) [-eps ] (default 0.01) [-nexplo ] (default 10) [-niterexplo ] (default 100) [-epsexplo ] (default 0.1) [-seqsmax ] (default 1.9) [-alphamin ] (default 0 0 0 0) [-nR ] (default 1) This option permits more than one regime for the dynamic of the underlying signal. It was not described in the paper and we do not recommand changing its default value. [-nS ] (default 100) This option allows you to control nS, the number of discretization steps (the notation used in the paper for this parameter is K). [-Xrange ] (default 0.9) Fraction of the total range of variation of the signal where the data are not considered as censored. With the default value value, data points belonging to the lower and higher 5% of the range are considered as censored. Censoring is not equivalent to missing. Censoring is informative in the sense that the model still consider known that the value is low or high but it consider that the measure is not accurate. Notice also that 90% of the range (the default) is not the same that 90% of the data points. [-Srange ] (default 1.1) Total span of the underlying signal state space as compared with the total range of variation of the data. This control the value of Umin and Umax (notations of the paper). [-rtrunc ] (default 200) If the absolute value of the gDNA covariate (gDNA residuals) is above this threshold the data point is considered as missing. The default value corresponds to no trimming with typical values of the gDNA covariate. [-itermax ] (default 100) Maximum number of iterations allowed in the terminal phase of the EM algorithm. [-oprefix ] (default out) Prefix that will added to all output files. With the default prefix, output files are : * out_params.explotrace, this file contains the information on the parameters and the value of the log-likelihood at each iteration of the exploratory phase of the EM algorithm. * out_params.trace, reports the parameters and log-likelihood in the the terminal phase of the EM algorithm * out.oparams, final parameters * out.path out.path, these files report the information on the denoised signal. They contains 17 columns and one line per probe. See the section "UNDERLYING SIGNAL RECONSTRUCTION" below for a detailed description of this output. [-fraccat ] (default 0.13) Control the number of categories for rho and sigma. The default value corresponds to 8 categories (1/0.13 ~ 8) [-iparam ] (default none) Allows you to specify the initial values of the model parameters with an input file. The format of this input file is the same as the format of the output file out.oparams. [-randompar ] (default F) Boolean, indicates whether or not the starting point of the EM algorithm should be random. Notice that if F (FALSE) only one starting point is considered and the the exploration phase of the EM algorithm is skipped (see arg -nexplo, -niterexplo, -epsexplo). [-seed ] (default time(NULL)) To be used if one want to specify the random seed that serves to draw random starting points in the parameter space. [-eta ] (default kernel) Choose 'flat' if you want eta to be the uniform distribution eta (see equation 3 in the paper) instead of a kernel density estimate of the empirical distribution of the signal. [-rhotype ] (default 2) 2 if rho is a function of the underlying signal value to be estimated, 1 if a unique rho is to be estimated. We strongly recommend to take rhotype=2 (see our paper) unless you want to use prederminated values for rho in which case you would specify the 0 for rhotype in an input parameter file (-iparam option) that would allow you to give the values of rho (defaut value of rho is 0 for rhotype 0). [-sigmatype ] (default 2) The same as rhotype but for sigma. Modelling sigma as a function of the underlying signal may not be as important as for rho (see paper). [-outliertype ] (default 2) 2 means use a logistic function of the gDNA covariate. Select 1 if you prefer the probability of outlier to be independent of the gDNA signal (not recommended!). The value 0 could be used if you want to fix this probability (for this you would use the -iparam option). [-eps ] (default 0.01) Threshold on the difference of the log-likelihood between two iterations that serve to check convergence in the terminal phase of the EM algorithm. [-nexplo ] (default 10) number of starting points to be considered in the exploration phase of the EM algorithm. [-niterexplo ] (default 100) Maximum number of iterations allowed in from each starting point in the exploration phase of the EM algorithm. [-epsexplo ] (default 0.1) Threshold on the difference of the log-likelihood between two iterations that serve to check convergence from each starting point in the exploration phase of the EM algorithm. [-seqsmax ] (default 1.9) This indicates the threshold on the SeqS value of the probe (probes with value above seqsmax a threshold will be discarded, see the fourth column of the input data file). [-maxdriftamplitude ] (default -1 = disabled) Set the max allowed value for the average drift amplitude (i.e. the mean of the geometric distribution modeling drift x size of grid step). [-alphamin ] (default 0 0 0 0) These 4 values allows to set lower bounds on the 4 individual components of the alpha vector (resp. probability of "no move", "shift", "drift up", and "drift down"). The 4 values should sum below 1. ################### #OUTPUT MODEL FILE# ################### The final model is written in file out.oparams ("out" is the default prefix, see option -oprefix). An example of such a file is given below, before a discussion of the meaning of the different lines. The same syntax is used to report the values of parameters at each step of the EM algorithm. It serves also when specifying a parameter input file (see option -iparam and section below). nS 100 mu 5.38241 5.49477 5.60712 5.71948 5.83183 5.94419 6.05654 6.1689 6.28125 6.39361 6.50596 6.61832 6.73067 6.84303 6.95538 7.06774 7.1801 7.29245 7.40481 7.51716 7.62952 7.74187 7.85423 7.96658 8.07894 8.19129 8.30365 8.416 8.52836 8.64071 8.75307 8.86542 8.97778 9.09013 9.20249 9.31484 9.4272 9.53955 9.65191 9.76427 9.87662 9.98898 10.1013 10.2137 10.326 10.4384 10.5508 10.6631 10.7755 10.8878 11.0002 11.1125 11.2249 11.3372 11.4496 11.5619 11.6743 11.7867 11.899 12.0114 12.1237 12.2361 12.3484 12.4608 12.5731 12.6855 12.7979 12.9102 13.0226 13.1349 13.2473 13.3596 13.472 13.5843 13.6967 13.8091 13.9214 14.0338 14.1461 14.2585 14.3708 14.4832 14.5955 14.7079 14.8202 14.9326 15.045 15.1573 15.2697 15.382 15.4944 15.6067 15.7191 15.8314 15.9438 16.0562 16.1685 16.2809 16.3932 16.5056 eta 1.09116e-06 6.63692e-06 3.33707e-05 0.000137539 0.000467373 0.00132132 0.00314368 0.00637342 0.0111728 0.0172063 0.0237025 0.029746 0.0346621 0.0381271 0.0401214 0.0407774 0.0402998 0.0389075 0.036806 0.0342149 0.0313516 0.0284274 0.0255941 0.0229448 0.0205297 0.0183875 0.0165321 0.0149423 0.0135773 0.0124011 0.0113878 0.0105182 0.00978268 0.00917388 0.00868328 0.00828999 0.0079649 0.0076812 0.00743353 0.00723623 0.00710145 0.00701997 0.00697123 0.00694125 0.00692509 0.00691257 0.00688579 0.00683146 0.00675672 0.00668129 0.00661902 0.00656942 0.00652818 0.00649742 0.0064806 0.00647334 0.0064694 0.00646736 0.00646877 0.00646814 0.00645439 0.00641948 0.00636986 0.00632159 0.00628642 0.00626104 0.00623924 0.00622413 0.00622575 0.00624437 0.00626509 0.00626389 0.00622431 0.00614417 0.00603138 0.00589526 0.00573979 0.00555895 0.00533985 0.00507299 0.00476091 0.00441774 0.0040649 0.00372458 0.00341161 0.00313396 0.00290238 0.00273499 0.00264827 0.00265216 0.00275117 0.00293839 0.00314962 0.00322118 0.00295078 0.0022891 0.0014439 0.000721175 0.000280562 8.41537e-05 xmin 6.39361 xmax 15.4944 ncat_mu 8 cat_mu 13 16 20 24 35 53 74 100 sigma_type 2 sigma_val 0.229045 0.229045 0.229045 0.229045 0.229045 0.229045 0.229045 0.229045 0.229045 0.229045 0.229045 0.229045 0.229045 0.255477 0.255477 0.255477 0.264645 0.264645 0.264645 0.264645 0.296707 0.296707 0.296707 0.296707 0.305125 0.305125 0.305125 0.305125 0.305125 0.305125 0.305125 0.305125 0.305125 0.305125 0.305125 0.335788 0.335788 0.335788 0.335788 0.335788 0.335788 0.335788 0.335788 0.335788 0.335788 0.335788 0.335788 0.335788 0.335788 0.335788 0.335788 0.335788 0.335788 0.344305 0.344305 0.344305 0.344305 0.344305 0.344305 0.344305 0.344305 0.344305 0.344305 0.344305 0.344305 0.344305 0.344305 0.344305 0.344305 0.344305 0.344305 0.344305 0.344305 0.344305 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 rho_type 2 rho_val 0.100499 0.100499 0.100499 0.100499 0.100499 0.100499 0.100499 0.100499 0.100499 0.100499 0.100499 0.100499 0.100499 0.195064 0.195064 0.195064 0.354788 0.354788 0.354788 0.354788 0.398874 0.398874 0.398874 0.398874 0.621677 0.621677 0.621677 0.621677 0.621677 0.621677 0.621677 0.621677 0.621677 0.621677 0.621677 0.865603 0.865603 0.865603 0.865603 0.865603 0.865603 0.865603 0.865603 0.865603 0.865603 0.865603 0.865603 0.865603 0.865603 0.865603 0.865603 0.865603 0.865603 0.911147 0.911147 0.911147 0.911147 0.911147 0.911147 0.911147 0.911147 0.911147 0.911147 0.911147 0.911147 0.911147 0.911147 0.911147 0.911147 0.911147 0.911147 0.911147 0.911147 0.911147 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 outlier_type 2 outlier_logit -5.56875 0.938642 nR 1 b_type 1 b 1 alpha_type 1 alpha 0.860532 0.0119378 0.0498707 0.0776596 lambda_type 1 lambda 0.707708 0.62689 We will now briefly explain each line of the file. nS : the number of hidden states (see the -nS option, and the parameter K in the paper). mu : a vector of size nS with the value of the underlying signal associated to each hidden state. eta : a vector of size nS with the value of the function eta (see paper) that define the pdf of the arrival hidden after shift moves. If not specified by an input model file this function is automatically computed with a kernel density estimator on the bulk signal data using a gaussian kernel with Scott's bandwidth. xmin, xmax : signal below xmin and above xmax is considered as censored ncat_mu : the number of categories for the definition of the function sigma and rho (serves only if rhotype=2 and/or sigmatype=2) cat_mu : the index of the hidden states that define the break-points of each category. sigma_type : see option -sigmatype, note that sigma type can be 0 meaning a fix value for sigma that should in this case be specified with an input model file (see next section). sigma_val : a vector of size nS with the value of sigma for each hidden state. If sigma_type=2 this is a piecwise constant function of mu with a number of segments ncat_mu. rho_type : see option -rhotype, note that rho type can be 0 meaning a fix value for rho that should in this case be specified with an input model file (see next section). rho_val : a vector of size nS with the value of rho for each hidden state. If rho_type=2 this is a piecwise constant function of mu with a number of segments ncat_mu. outlier_type : see -outliertype option. 0=fixed probability of outlier, 1=probability of outlier to be estimated, 2=logistic model for the probability of outlier to be estimated. outlier or outlier_logit : in the first case (outliertype=0 or 1) this corresponds to the probability of outlier, in the second case (outliertype=2), this gives the two parameters of the logistic function that define the relation between the probability of outlier and the absolute value of the gDNA covariate. nR : see the -nR option. Should be 1 expected if you decide to go with nR>1. This possibility was not explored in the paper. b_type : whether or not the transition matrix between regimes should be estimated b : the transition matrix (a signle number if nR=1) alpha_type : 0 if not to be estimated, 1 estimated and random staring point for the four values of the vector alpha. 2 estimated but specified starting point for the value of the third and fourth components (useful if you wish to impose no drift). 3 estimated but specified starting point for the third component (useful if you wish to impose no drift up). 4 estimated but specified starting point for the fourth component (useful if you wish to impose no drift down). alpha : a four value vector whose component should sum to 1. The first component is the probability of no move, second compenent is the probability of shift move, third component is the probability of upward drift, fourth component is the probability of downward drift. lambda_type : whether or not lambda should be estimated. lambda : a two-component vector that specifies the amplitude of the drift moves. The first component is the parameter of the geometric distribution that govern the amplitude of upward drift. The second component corresponds to downward drift. NOTICE that if nR>1 there will be as many lines alpha_type, alpha, lambda_type, lambda as the number of regime. ################## #INPUT MODEL FILE# ################## The syntax for the input model file is the same as the syntax of the output model file. In any case you should not attempt to create such a file from scratch but you should instead consider reutilisnig (and modifying) a file created by the program in a previous run of the algorithm.