This set of three programs implements the approach presented in our paper "The Stem Cell Population of the Human Colon Crypt: Analysis via Methylation Patterns" (Pierre Nicolas, Kyoung-Mee Kim, Darryl Shibata and Simon Tavaré, submitted). - "mcmc" samples the posterior distribution of the parameters given the data (Bayesian Inference). This may be very time consuming if you want to get a good estimate of the posterior for a large data set. - "sim" can simulate multiple data sets from a model with a same set of parameters. It computes summary statistics and, as an option, returns the raw data. - "predictivesim" simulates data sets from a model whose parameters can change from one data set to the other and, for each data set, it computes an array of summary statistics. It served in the paper for the posterior predictive assessment of model fit. We now present in more details each one of the three programs. Example parameter and data files can be found in the "examples" directory. The data sets used in our study on human colon crypt can be found in the "data" directory. ---------------------------------------------------------------- --------------------- mcmc program ----------------------------- Usage: mcmc -data -params -out Example1: move in the "examples" directory and run ../bin/mcmc -data afewcrypts.dat -params ex1_mcmc.params -out ex1_mcmc.out Example2: move in the "examples" directory and run ../bin/mcmc -data afewcrypts.dat -params ex2_mcmc.params -out ex2_mcmc.out First example runs the MCMC algorithm for a methylation model with independent sites. Second example do the same for a model with context-dependent effects (4 sets of methylation/demethylation parameters for increasing level of methylation). Parameters, priors and MCMC set-up are described in ex1_mcmc.params and ex2_mcmc.params. These files are commented (after # symbols) and should be understandable. A toy data set is described in afewcrypts.dat. Data files start with a line indicating the number of crypt sampled. Then for each crypt there is a descriptor line of the type "0 I b16I 40 lseq 9 nseqs 3 ncells 6". The first three words need to be here but are skipped by the program, you could use them to comment your data set (here the second word indicates the patient). Next, you find the age of the crypt (in year), the length of the sequences (that should match information given in the parameter data file), the number of distinct patterns sampled from the crypt and the total number of cells (sequences) sampled from the crypt. Then, for each pattern, one line describes the pattern and the number of cells sampled with this pattern. Each run of the program produces three output files. For example 1, those files are ex1_mcmc.out, ex1_mcmc.out.alloc, ex1_mcmc.out.patrates, ex1_mcmc.out.pred, ex1_mcmc.out.trees. File "ex1_mcmc.out" is the main output file, it contains parameters sampled every 100 sweeps of the algorithm. Each line corresponds to a set of parameters. An example of line is: 12 1 0 2.50895 0.00593909 0.0166878 0.05 7.56604 0.00365107 0.0105869 where - column 1 corresponds to the number of stem cells. - columns 2 and 3 can be ignored. - column 4 corresponds to the hyper-parameter of the methylation model (meaningless in this case, as a model with independent sites is used). - columns 5 and 6 report methylation and demethylation rate (in year^-1) and should be multiplied by tau to obtain the nu mentioned in the paper. - column 7 corresponds to lambda (1/tau in the paper) in year^-1. - column 8 corresponds to g. - column 9 corresponds to alpha. - column 10 reports the sequencing error rate. File "ex2_mcmc.out" has the same format excepted that, a context-dependent methylation model being used, methylation and demethylation rates are both reported in the form of lseq+1 numbers. Each number corresponds to the rate for a given number of methylated sites: column 5 reports the methylation rate when no site is methylated, column 6 reports the methylation rate when one site is already methylated, and so on. File ex?_mcmc.out.alloc contains lines indicating the number of stem cells in each crypt for each set of parameters reported in "ex2_mcmc.out". For each crypt, two number are reported. The first number can be ignored and second number is the number of stem cells. Notice that numbers are allowed to differ by 1 between crypts. File ex?_mcmc.out.patrates can be ignored. It is used in a version of the program where methylation rate is allowed to vary between patients (not described here). File ex?_mcmc.out.trees reports the description of the tree sampled from the posterior. Those are printed every 100000 sweeps. The description format of the tree is the one used in the "hclust" function of the R statistical language format. As an example the tree described by crypt1iter1000 <- list(merge=rbind(c(-1,-3),c(-5,-4),c(-2,2),c(3,1)), height=c(-0.73126,-0.0637079,5.57427,12.5732), order=c(1,2,3,4,5), labels=c("000000000","000000000","000000000","000000000","000000000"), method = "mymethod", call = "mycall") can be plotted with R after converting it to an object of class hclust: R> crypt1iter1000$height <- crypt1iter1000$height+1 R> class(crypt1iter1000) <- "hclust" R> plot(crypt1iter1000,hang=-1) File ex?_mcmc.out.pred can also be ignored. It contains values of a number of summary statistics obtained for synthetic data sets generated across the branches of the genealogical trees sampled from their posterior distributions. ---------------------------------------------------------------- ---------------------- sim program ----------------------------- Usage: sim -data -params -outstat [-outseq ] Example1: move in the "examples" directory and run ../bin/sim -data afewcrypts.dat -params ex1_sim.params -outstat ex1_sim.outstat -outseq ex1_sim.outseq Example2: move in the "examples" directory and run ../bin/sim -data afewcrypts.dat -params ex2_sim.params -outstat ex2_sim.outstat -outseq ex2_sim.outseq First example runs the MCMC algorithm for a methylation model with independent sites. Second example corresponds to a model with context-dependent effects. The data file "afewcrypts.dat" specifies the characteristics of the data sets to be simulated: number of crypts, number of cells sampled from each crypt, length of the sampled patterns, ages of the crypts. Running the program produces two result files: ex?_sim.outstat and ex?_sim.outseq. First file corresponds to the values of the summary statistics computed for each crypt of each simulated data set. Second file corresponds to the simulated data sets. ---------------------------------------------------------------- -------------------- predictivesim program --------------------- Usage: predictivesim -templateparams -data -params -outstat -alloc -patrate Example1: move in the "examples" directory and run ../bin/predictivesim -data afewcrypts.dat -templateparams ex1_pred.templateparams -params ex1_pred.params -alloc ex1_pred.alloc -patrate ex1_pred.patrate -outstat ex1_pred.out In this example, 10 data sets are simulated with 10 distinct sets of parameters. The two output files are "ex1_pred.out" and "ex1_pred.out.trees". First file correspond to the value of a set of summary statistics for each crypt each simulated data set (lines starting with -1 correspond to the data from the "-data" file). Second file report the genealogical trees of each crypt in the same format as mentioned for the file with suffix ".trees" of the "mcmc" program. File "afewcrypts.dat" is a template of the data set (see sim program). File "ex1_pred.templateparams" is a template of the parameter file that contains characteristics of the model such as model with independent sites or not. This will typically be the parameter's file of "mcmc" if the parameters used for the simulation were sampled with this program. File "ex1_pred.params" contains the parameters used for the simulation. The format is the same as the one used for the main output file of "mcmc" program. IMPORTANT: the first line indicates the number of data set to simulate. Then, each line corresponds to the set of parameter for a particular simulation. Input files "ex1_pred.alloc" and "ex1_pred.patrate" have the same format as the corresponding output files of "mcmc". The number of stem cells indicated in "ex1_pred.alloc" should match the number of stem cells indicated in "ex1_pred.params" for the same simulated data set. File "ex1_pred.patrate" should contain one line of "1" (with as many "1" as the number of crypts) per simulated data set. ----------------------------------------------------------------