This program implements the approach presented in our paper
"Transcriptional landscape estimation from tiling array data using a
model of signal shift and drift" (submitted).

Example data sets and example command lines can be found in the
section EXAMPLES. The options that control the behavior of the program
are detailed in the section OPTIONS.

##########
#EXAMPLES#
##########

Data sets that were used in our study and that are formatted to suit
the input file format can be found in the "examples" directory.

- Basysbio, B. subtilis (exp # 147634_532) data :
    basysbio_s1_exp147634_532_hmm.dat (direct strand) 
    basysbio_s2_exp147634_532_hmm.dat (complementary strand)
- David, S. cerevisae first chromosome data:
    davidK1_s1_hmm.dat (direct strand) 
    davidK1_s2_hmm.dat (complementary strand)

As a first test of the program you may wish to run it on these two
data sets. If this is the case, move to the examples directory and run
the following commands to reproduce the results presented in our
paper.

../bin/hmmtiling -data davidK1_s1_hmm.dat davidK1_s2_hmm.dat \
 -randompar T -seqsmax 0.5 -rtrunc 2 >hmmtiling_david.stdout 2>&1

../bin/hmmtiling -data basysbio_s1_hmm.dat basysbio_s2_hmm.dat \
 -randompar T >hmmtiling_basysbio.stdout 2>&1


############
#DATA INPUT#
############

The first seven lines of the basysbio data files are shown below. 

The first line indicates the column names, this line serves only as a
reminder.  

Each line after this header corresponds to a different probe. The
first column gives the position of the probe, the second column
corresponds to the value of the signal (in log2 scale), the third
column gives the value of the gDNA covariate (residual of the
substraction with the mode or the median), the fourth column is a
value that indicates the amount of "non-uniqueness" of the probe in
the genome (probes with value above a threshold will be discarded, see
-seqsmax parameter).  Probes should be already ordonned according to
their physical location on the chromosome. The positions given in the
first column are not used by the program that does not accound for non
equally spaced probes. This column is only intended to help the user.

Because the model is intended to distinguish between upward and
downward drift it is very important to always orient the probes in the
same direction (conventionally 5'->3'): notice the decreasing
positions of the probes in basysbio_s2_hmm.dat and davidK1_s2_hmm.dat
because probes are on the complementary strand.

head -7 basysbio_s2_hmm.dat:
Pos Signal gDNA SeqS
1 9.09334 -0.27766 2
23 8.67281 -0.61719 2
45 9.00874 -1.33126 2
67 8.69349 -1.34151 1
89 8.8535 -0.8385 1
111 8.55159 -1.40841 1

head -7 basysbio_s2_hmm.dat:
Pos Signal gDNA SeqS
4214617 10.35229 -0.70271 1
4214595 8.19638 -1.37462 1
4214573 9.58538 -0.27262 1
4214551 9.81876 0.07776 1
4214529 14.32713 0.47013 1
4214507 15.11244 0.44044 1


##################################
#UNDERLYING SIGNAL RECONSTRUCTION#
##################################

The 20 columns of the *.path files are

Xstatus	0: missing data, 1: fully observed, 2: left censored, 3: right censored 
t	self generated probe id
SeqS	the same as in the input file
X	the value of the signal (same as the Signal column in the input file)
XrX	The difference Signal - gDNA
XrrX	The difference Signal - rho*gDNA
Poutl	The probability for this data point to be an outlier 
ES	The expected value for the underlying signal (best estimator in terms of quadratic risk)
MLS	The most likely value value of the underlying signal at this position
CIminS	The lower end of the 95% prediction interval for the underlying signal
CImaxS	The upper end of the 95% prediction interval for the underlying signal
PR0	The probability to be in regime 0 (1 if only one regime i.e. nR=1)
PM1	The probability of a shift move at this position
PM2	The probability of an upward drift move at this position
PM3     The probability of an downward drift move at this position
VitR	The underlying regime at this position on Viterbi path
VitM	1 if there is a shift, 2 if upward drift, 3 if downward drift, 0 elswhere
VitS    The value of the underlying signal according to the Viteri path
PM1up	The probability of an upward shift move at this position (PM1=PM1up+PM1down)
PM1down	The probability of a downward shift move at this position

#########
#OPTIONS#
#########

The shortest command line is: hmmtiling -data <file>

The complete list of options is printed when running the program without 
arguments:

Usage: hmmtiling
 -data <file1> [<file2> ...]
 [-nR <nR>] (default 1)
 [-nS <nS>] (default 100)
 [-Xrange <xr>] (default 0.9)
 [-Srange <sr>] (default 1.1)
 [-rtrunc <rm>] (default 200)
 [-itermax <im>] (default 100)
 [-oprefix <op>] (default out)
 [-fraccat <fc>] (default 0.13)
 [-iparam <filename>] (default none)
 [-randompar <T or F>] (default F)
 [-seed <s>] (default time(NULL))
 [-eta <kernel or flat>] (default kernel)
 [-rhotype <rhotype>] (default 2)
 [-sigmatype <sigmatype>] (default 2)
 [-outliertype <outliertype>] (default 2)
 [-eps <epsmax>] (default 0.01)
 [-nexplo <nexplo>] (default 10)
 [-niterexplo <niterexplo>] (default 100)
 [-epsexplo <epsexplo>] (default 0.1)
 [-seqsmax <seqsmax>] (default 1.9)
 [-alphamin <alpha_n alpha_s alpha_u alpha_d>] (default 0 0 0 0)

 [-nR <nR>] (default 1) This option permits more than one regime for
 the dynamic of the underlying signal. It was not described in the
 paper and we do not recommand changing its default value.

 [-nS <nS>] (default 100) This option allows you to control nS, the
 number of discretization steps (the notation used in the paper for
 this parameter is K).
 
 [-Xrange <xr>] (default 0.9) Fraction of the total range of variation
 of the signal where the data are not considered as censored. With the
 default value value, data points belonging to the lower and higher 5%
 of the range are considered as censored. Censoring is not equivalent
 to missing. Censoring is informative in the sense that the model
 still consider known that the value is low or high but it consider
 that the measure is not accurate. Notice also that 90% of the range
 (the default) is not the same that 90% of the data points.

 [-Srange <sr>] (default 1.1) Total span of the underlying signal
 state space as compared with the total range of variation of the
 data. This control the value of Umin and Umax (notations of the
 paper).
 
 [-rtrunc <rm>] (default 200) If the absolute value of the gDNA
 covariate (gDNA residuals) is above this threshold the data point is
 considered as missing. The default value corresponds to no trimming
 with typical values of the gDNA covariate.

 [-itermax <im>] (default 100) Maximum number of iterations allowed in
 the terminal phase of the EM algorithm.

 [-oprefix <op>] (default out) Prefix that will added to all output
 files. With the default prefix, output files are : 

 * out_params.explotrace, this file contains the information on the
 parameters and the value of the log-likelihood at each iteration of
 the exploratory phase of the EM algorithm.

 * out_params.trace, reports the parameters and log-likelihood in the 
 the terminal phase of the EM algorithm

 * out.oparams, final parameters

 * out<file1>.path out<file2>.path, these files report the information
 on the denoised signal. They contains 17 columns and one line per
 probe. See the section "UNDERLYING SIGNAL RECONSTRUCTION" below for a detailed
 description of this output.
 
 [-fraccat <fc>] (default 0.13) Control the number of categories for
 rho and sigma. The default value corresponds to 8 categories (1/0.13 ~ 8)

 [-iparam <filename>] (default none) Allows you to specify the initial
 values of the model parameters with an input file. The format of this
 input file is the same as the format of the output file out.oparams.

 [-randompar <T or F>] (default F) Boolean, indicates whether or not
 the starting point of the EM algorithm should be random. Notice that
 if F (FALSE) only one starting point is considered and the the
 exploration phase of the EM algorithm is skipped (see arg -nexplo,
 -niterexplo, -epsexplo).

 [-seed <s>] (default time(NULL)) To be used if one want to specify
 the random seed that serves to draw random starting points in the
 parameter space.

 [-eta <kernel or flat>] (default kernel) Choose 'flat' if you want
 eta to be the uniform distribution eta (see equation 3 in the paper)
 instead of a kernel density estimate of the empirical distribution of
 the signal.

 [-rhotype <rhotype>] (default 2) 2 if rho is a function of the
 underlying signal value to be estimated, 1 if a unique rho is to be
 estimated. We strongly recommend to take rhotype=2 (see our paper)
 unless you want to use prederminated values for rho in which case you
 would specify the 0 for rhotype in an input parameter file (-iparam
 option) that would allow you to give the values of rho (defaut value
 of rho is 0 for rhotype 0).

 [-sigmatype <sigmatype>] (default 2) The same as rhotype but for
 sigma. Modelling sigma as a function of the underlying signal may not
 be as important as for rho (see paper).

 [-outliertype <outliertype>] (default 2) 2 means use a logistic
 function of the gDNA covariate. Select 1 if you prefer the
 probability of outlier to be independent of the gDNA signal (not
 recommended!). The value 0 could be used if you want to fix this
 probability (for this you would use the -iparam option).

 [-eps <epsmax>] (default 0.01) Threshold on the difference of the
 log-likelihood between two iterations that serve to check convergence
 in the terminal phase of the EM algorithm.

 [-nexplo <nexplo>] (default 10) number of starting points to be
 considered in the exploration phase of the EM algorithm.

 [-niterexplo <niterexplo>] (default 100) Maximum number of iterations
 allowed in from each starting point in the exploration phase of the
 EM algorithm.

 [-epsexplo <epsexplo>] (default 0.1) Threshold on the difference of
 the log-likelihood between two iterations that serve to check
 convergence from each starting point in the exploration phase of the
 EM algorithm.

 [-seqsmax <seqsmax>] (default 1.9) This indicates the threshold on
 the SeqS value of the probe (probes with value above seqsmax a
 threshold will be discarded, see the fourth column of the input data
 file).

 [-maxdriftamplitude <amplitude>] (default -1 = disabled) Set the max
 allowed value for the average drift amplitude (i.e. the mean of the
 geometric distribution modeling drift x size of grid step).

 [-alphamin <alpha_n alpha_s alpha_u alpha_d>] (default 0 0 0 0) These
 4 values allows to set lower bounds on the 4 individual components of
 the alpha vector (resp. probability of "no move", "shift", "drift
 up", and "drift down"). The 4 values should sum below 1.

###################
#OUTPUT MODEL FILE#
###################

The final model is written in file out.oparams ("out" is the default
prefix, see option -oprefix). An example of such a file is given
below, before a discussion of the meaning of the different lines. The
same syntax is used to report the values of parameters at each step of
the EM algorithm. It serves also when specifying a parameter input
file (see option -iparam and section below).


nS 100
mu 5.38241 5.49477 5.60712 5.71948 5.83183 5.94419 6.05654 6.1689 6.28125 6.39361 6.50596 6.61832 6.73067 6.84303 6.95538 7.06774 7.1801 7.29245 7.40481 7.51716 7.62952 7.74187 7.85423 7.96658 8.07894 8.19129 8.30365 8.416 8.52836 8.64071 8.75307 8.86542 8.97778 9.09013 9.20249 9.31484 9.4272 9.53955 9.65191 9.76427 9.87662 9.98898 10.1013 10.2137 10.326 10.4384 10.5508 10.6631 10.7755 10.8878 11.0002 11.1125 11.2249 11.3372 11.4496 11.5619 11.6743 11.7867 11.899 12.0114 12.1237 12.2361 12.3484 12.4608 12.5731 12.6855 12.7979 12.9102 13.0226 13.1349 13.2473 13.3596 13.472 13.5843 13.6967 13.8091 13.9214 14.0338 14.1461 14.2585 14.3708 14.4832 14.5955 14.7079 14.8202 14.9326 15.045 15.1573 15.2697 15.382 15.4944 15.6067 15.7191 15.8314 15.9438 16.0562 16.1685 16.2809 16.3932 16.5056
eta 1.09116e-06 6.63692e-06 3.33707e-05 0.000137539 0.000467373 0.00132132 0.00314368 0.00637342 0.0111728 0.0172063 0.0237025 0.029746 0.0346621 0.0381271 0.0401214 0.0407774 0.0402998 0.0389075 0.036806 0.0342149 0.0313516 0.0284274 0.0255941 0.0229448 0.0205297 0.0183875 0.0165321 0.0149423 0.0135773 0.0124011 0.0113878 0.0105182 0.00978268 0.00917388 0.00868328 0.00828999 0.0079649 0.0076812 0.00743353 0.00723623 0.00710145 0.00701997 0.00697123 0.00694125 0.00692509 0.00691257 0.00688579 0.00683146 0.00675672 0.00668129 0.00661902 0.00656942 0.00652818 0.00649742 0.0064806 0.00647334 0.0064694 0.00646736 0.00646877 0.00646814 0.00645439 0.00641948 0.00636986 0.00632159 0.00628642 0.00626104 0.00623924 0.00622413 0.00622575 0.00624437 0.00626509 0.00626389 0.00622431 0.00614417 0.00603138 0.00589526 0.00573979 0.00555895 0.00533985 0.00507299 0.00476091 0.00441774 0.0040649 0.00372458 0.00341161 0.00313396 0.00290238 0.00273499 0.00264827 0.00265216 0.00275117 0.00293839 0.00314962 0.00322118 0.00295078 0.0022891 0.0014439 0.000721175 0.000280562 8.41537e-05
xmin 6.39361 xmax 15.4944
ncat_mu 8
cat_mu 13 16 20 24 35 53 74 100
sigma_type 2
sigma_val 0.229045 0.229045 0.229045 0.229045 0.229045 0.229045 0.229045 0.229045 0.229045 0.229045 0.229045 0.229045 0.229045 0.255477 0.255477 0.255477 0.264645 0.264645 0.264645 0.264645 0.296707 0.296707 0.296707 0.296707 0.305125 0.305125 0.305125 0.305125 0.305125 0.305125 0.305125 0.305125 0.305125 0.305125 0.305125 0.335788 0.335788 0.335788 0.335788 0.335788 0.335788 0.335788 0.335788 0.335788 0.335788 0.335788 0.335788 0.335788 0.335788 0.335788 0.335788 0.335788 0.335788 0.344305 0.344305 0.344305 0.344305 0.344305 0.344305 0.344305 0.344305 0.344305 0.344305 0.344305 0.344305 0.344305 0.344305 0.344305 0.344305 0.344305 0.344305 0.344305 0.344305 0.344305 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208 0.301208
rho_type 2
rho_val 0.100499 0.100499 0.100499 0.100499 0.100499 0.100499 0.100499 0.100499 0.100499 0.100499 0.100499 0.100499 0.100499 0.195064 0.195064 0.195064 0.354788 0.354788 0.354788 0.354788 0.398874 0.398874 0.398874 0.398874 0.621677 0.621677 0.621677 0.621677 0.621677 0.621677 0.621677 0.621677 0.621677 0.621677 0.621677 0.865603 0.865603 0.865603 0.865603 0.865603 0.865603 0.865603 0.865603 0.865603 0.865603 0.865603 0.865603 0.865603 0.865603 0.865603 0.865603 0.865603 0.865603 0.911147 0.911147 0.911147 0.911147 0.911147 0.911147 0.911147 0.911147 0.911147 0.911147 0.911147 0.911147 0.911147 0.911147 0.911147 0.911147 0.911147 0.911147 0.911147 0.911147 0.911147 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318 0.751318
outlier_type 2
outlier_logit -5.56875 0.938642
nR 1
b_type 1
b
 1
alpha_type 1
alpha 0.860532 0.0119378 0.0498707 0.0776596
lambda_type 1
lambda 0.707708 0.62689


We will now briefly explain each line of the file.

nS : the number of hidden states (see the -nS option, and the parameter K in the paper).

mu : a vector of size nS with the value of the underlying signal
associated to each hidden state.

eta : a vector of size nS with the value of the function eta (see
paper) that define the pdf of the arrival hidden after shift moves. If
not specified by an input model file this function is automatically
computed with a kernel density estimator on the bulk signal data
using a gaussian kernel with Scott's bandwidth.

xmin, xmax : signal below xmin and above xmax is considered as censored

ncat_mu : the number of categories for the definition of the function
sigma and rho (serves only if rhotype=2 and/or sigmatype=2)

cat_mu : the index of the hidden states that define the break-points of each category.

sigma_type : see option -sigmatype, note that sigma type can be 0
meaning a fix value for sigma that should in this case be specified
with an input model file (see next section).

sigma_val : a vector of size nS with the value of sigma for each
hidden state. If sigma_type=2 this is a piecwise constant function of
mu with a number of segments ncat_mu.

rho_type : see option -rhotype, note that rho type can be 0
meaning a fix value for rho that should in this case be specified
with an input model file (see next section).

rho_val : a vector of size nS with the value of rho for each hidden
state. If rho_type=2 this is a piecwise constant function of mu with a
number of segments ncat_mu.

outlier_type : see -outliertype option. 0=fixed probability of
outlier, 1=probability of outlier to be estimated, 2=logistic model
for the probability of outlier to be estimated.

outlier or outlier_logit : in the first case (outliertype=0 or 1) this
corresponds to the probability of outlier, in the second case
(outliertype=2), this gives the two parameters of the logistic
function that define the relation between the probability of outlier
and the absolute value of the gDNA covariate.

nR : see the -nR option. Should be 1 expected if you decide to go with
nR>1. This possibility was not explored in the paper.
  b_type : whether or not the transition matrix between regimes should be estimated
  b : the transition matrix (a signle number if nR=1)

alpha_type : 0 if not to be estimated, 1 estimated and random staring
point for the four values of the vector alpha. 2 estimated but
specified starting point for the value of the third and fourth
components (useful if you wish to impose no drift). 3 estimated but
specified starting point for the third component (useful if you wish
to impose no drift up). 4 estimated but specified starting point for
the fourth component (useful if you wish to impose no drift down).

alpha : a four value vector whose component should sum to 1. The first
component is the probability of no move, second compenent is the
probability of shift move, third component is the probability of
upward drift, fourth component is the probability of downward drift.

lambda_type : whether or not lambda should be estimated.

lambda : a two-component vector that specifies the amplitude of the
drift moves. The first component is the parameter of the geometric
distribution that govern the amplitude of upward drift. The second
component corresponds to downward drift.

NOTICE that if nR>1 there will be as many lines alpha_type, alpha,
lambda_type, lambda as the number of regime.


##################
#INPUT MODEL FILE#
##################

The syntax for the input model file is the same as the syntax of the
output model file. In any case you should not attempt to create such a
file from scratch but you should instead consider reutilisnig (and
modifying) a file created by the program in a previous run of the
algorithm.