---
title: "compartment-model-vignette"
output: rmarkdown::html_vignette
geometry: margin=0.3in
vignette: >
  %\VignetteIndexEntry{compartment-model-vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction

A document summarizing "compartment model" which models time evolution of Pol II signal in pause region and gene body (`Dutta et al. 2021, Mattada et al. 2019`). 

# Getting Started

## Installing the package

We will install a python based package, `compartmentModel` currently hosted on [Test PyPI](https://test.pypi.org/project/compartmentModel/). The package uses [COPASI](http://copasi.org/) to run parameter estimation and it needs to be installed on the local or any virtual environment. The following commands install the package:-

```{r, eval=FALSE}
python3 -m pip install python-copasi 
python3 -m pip install copasi-basico
python3 -m pip install -i https://test.pypi.org/simple compartmentModel
```

Please verify that the latest version has been installed (by cross-checking the package repo). (If an old version has been installed, then you need to upgrade all the above packages and try again.) After successful installation, the help menu can be accessed by running the following on command line:-
```{r, eval=FALSE}
polcomp estparam --help
```

## Inputs to the compartment model

We need two tab-separated files from each of control and treatment conditions containing 'gene', 'pause_sum', 'body_density' for all genes of our interest. (Please refer to the section `Preparation of Inputs` for detailed overview of creation of these files.) For this example, we will use two files prepared from Triptolide treatment data (Jonkers et al. 2014):-
```{r, engine='bash',  comment=''}
head InternalControl_repressed_TRP125min_baseline.txt
```
```{r, engine='bash',  comment=''}
head InternalControl_repressed_TRP125min_treatment.txt
```

## Running the model

We will specify the inputs to our model as mentioned in the help message:-
```{r, eval=FALSE}
polcomp estparam --help
```
By default for parameter estimation, the lower and higher limits for premature termination (`Kpre`), pause release (`Krel`), initiation (`Kinit`) rates are 0.01 and 1 respectively, whereas lower and upper limits for elongation (`Kelong`) are 30 and 60. To demonstrate that these limits can be changed, we will use a config file to change the lower limit to `1e-8`. 
```{r, engine='bash',  comment=''}
cat TRP_limitfile_AllParams
```
Note: In similar manner, the upper limit can be changed for any parameter by specifying `upper` value for the desired paramter, say premature termination rate. This is for demonstration and not being used in this section.:-
```{r, eval=FALSE}
[prematuretermination]
lower = 0.02
upper = 0.023
```
We will run the following command:-
```{r, eval=FALSE}
polcomp estparam --basename baseline --basefile InternalControl_repressed_TRP125min_baseline.txt --exprname TRP_12.5min --exprfile InternalControl_repressed_TRP125min_treatment.txt --limitfile TRP_limitfile_AllParams --nscan 700 --outputdir InternalControl_TRP12min_repressed/
```
The command will attempt finding `nscan=700` parameters for each set of genes provided in the input. At the end of the run, it should produce the following output (last few lines mentioned):-
```{r, eval=FALSE}
> Valid fits are saved as InternalControl_TRP12min_repressed/paramEstimationResults_combined.txt
>> Genes with no valid fits have been written to InternalControl_TRP12min_repressed/NoValidFits.txt. Please read vignette or check annotations to see if these genes have correct pause windows.
>> Genes with pause sum or body avg as zero in either condition have been written to InternalControl_TRP12min_repressed/GeneswithZeroPauseSums_or_BodyAvg.txt
```

The output file `paramEstimationResults_combined.txt` contain all the valid parameter sets for all the genes. The column names are self-explanatory and depend on the `basename` and `exprname` mentioned in the input command:-
```{r, engine='bash',  comment=''}
head InternalControl_TRP12min_repressed/paramEstimationResults_combined.txt
```
A way to display the parameter results is mentioned in the section `Plotting Model Results`. The user may use own custom functions for displaying/summarising the parameter sets mentioned in `paramEstimationResults_combined.txt`.

# Preparation of Inputs

## Processing of raw sequencing files
We will be using Triptolide (TRP) treatment dataset from `Jonkers et al. 2014`. We will process the raw GRO-seq files from `control` and `TRP 12.5min` conditions. Mouse V6.5 cells were processed in two replicates. These are the accession numbers for files `control rep1` - `SRR935093`, `SRR935094`, `control rep2` - `SRR935095`, `SRR935096` and `TRP 12.5min` - `SRR935119` (rep1), `SRR935120` (rep2). As Arabidopsis thaliana RNAs were used as spike-ins, we also remove reads aligning to Arabidopsis genome. We are using `Mus_musculus.GRCm39.105` genome for mouse and [`GCF_000001735.4`](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001735.4/) for Arabidopsis.
We follow the almost the same workflow as detailed in [Nascent RNA Methods](https://github.com/guertinlab/Nascent_RNA_Methods) with following changes:-  

* We align to Arabidopsis genome and collect reads aligning to Arbidopsis (spikeIns) and mouse.
* The mouse reads are further aligned to mouse ribosome and we remove the ribosomal reads.
* The remaining reads aligning to mouse are then processed.

We use the following file as our "placeholder" and we replace the `placeholder` within this file by the names of our replicates:-

```{r, engine='bash',  comment=''}
cat processPRO_seqdata_placeholder_spikeIn.sh
```
We can replace the `placeholder` and submit jobs in the following manner (submission example for slurm environment in HPC cluster @ UConn):-

```{r, eval=FALSE}
for filename in V65_cntrl_rep1 V65_cntrl_rep2 V65_TRP_treat_12min_rep1 V65_TRP_treat_12min_rep2; do
        sed "s/placeholder/${filename}/g" processPRO_seqdata_placeholder_spikeIn.sh > processPRO_seqdata_${filename}_spikeIn.sh
        echo ${filename}_spikeIn
        sbatch processPRO_seqdata_${filename}_spikeIn.sh
        sleep 1
done
```
## Normalization of sequencing data by their read-depth

We eventually will  normalize data by [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) size factors. For running DESeq2, we plan to use the following criteria:-
* consider genes with length greater than 50kbp and treat regions from `[TSS + 50kbp]` to `TTS` as unchanging between control and TRP treatment for 12.5 mins. TSS = transcription start site, TTS = transcription termination site.
* for differential expression analysis by DESeq2, only consider regions between 500 and 25kbp..

We create normalized files in the same manner as mentioned in `PRO_normalization` of [Nascent RNA Methods](https://github.com/guertinlab/Nascent_RNA_Methods) and use `normalize_bedGraph_v2.py` for increased execution speed.

```{r, engine='bash',  comment=''}
cat normalization_factor.R
```
```{r, engine='bash',  comment=''}
cat PRO_normalization
```
We will now use the normalized files from `control` condition - `V65_cntrl_plus_normalized.bigWig` and `V65_cntrl_minus_normalized.bigWig` for estimating the TSS in next section.  

## Estimation of transcription start sites for genes

We will use `primaryTranscriptAnnotation` (Anderson et al. 2020, [GitHub](https://github.com/WarrenDavidAnderson/genomicsRpackage/tree/master/primaryTranscriptAnnotation) ) to estimate TSS and TTS. Comments mention why some thresholds were used:-
```{r, engine='bash',  comment=''}
cat createPrimaryTranscAnnot_Mouse_cntThresh_point9.R
```
## Running DESeq2 for differential analysis

We will first prepare bed files for "affected genes" (TSS+500 to 25kb) and "internalcontrol genes" (50kbp to TTS) as discussed above.
```{r, engine='bash',  comment=''}
cat parsePTACoordsToShortAffectedAndLongControlGenes.R
```
```{r, engine='bash',  comment=''}
cat parsePTACoordsToShortAffectedAndLongControlGenes.R
```
We will generate read count data from our processed GRO-seq files and the bed files for "affected" and "internalcontrol" genes created earlier.
```{r, engine='bash',  comment=''}
cat deseq2DataCreator_InternalControl_TRP12min.sh
```
```{r, engine='bash',  comment=''}
cat runDeseq2_TRP_12min_matched_internalControl.R
```
We get a MA plot showing 9632 genes as repressed.
```{r, echo=FALSE, fig.cap="MA plot for repressed genes", out.width = '100%', out.height = '100%'}
knitr::include_graphics("figures/MA_plot_TRP_treat_matched.png")
```


We then save the DESeq2 size factors in a tab-separated text file.
```{r, engine='bash',  comment=''}
cat saveSizeFactorsAsTxtFile.R
```
We use the DESeq2 size factors for normalization. The code is derived from `PRO_normalization_sizeFactor` of [Nascent RNA Methods](https://github.com/guertinlab/Nascent_RNA_Methods)
```{r, engine='bash',  comment=''}
cat PRO_normalization_sizeFactor_TRP  
```
The processed files are hosted as a [track hub](http://guertinlab.cam.uchc.edu/Jonkers_FP_TRP_hub/hub.txt) which can be loaded in UCSC Genome Browser. A session can be accessed [here](https://genome.ucsc.edu/s/rmukherjee/Jonkers_et_al_mm39_TRP_FP).

## Estimating Pol II densities in pause region and gene body
We will prepare the pause windows and calculate the densities in gene body regions. We assume all packages mentioned in this file has been installed.
```{r, engine='bash',  comment=''}
cat run_pauseworkflow_InternalTRP_matched.R
```
Thus, we have the output files of interest - `InternalControl_repressed_TRP125min_baseline.txt` and `InternalControl_repressed_TRP125min_treatment.txt`.

# Plotting Model Results

After sections `Preparation of Inputs` and `Running the model`, we can plot results. The code files mentioned here prepare R data structures containing the rate data as well summarizes them:-
```{r, engine='bash',  comment=''}
cat plotParamsFromRepeatParamScan.R
```
The file `plotParamsFromRepeatParamScan.R` references `plotRateFuncs.R` below:-
```{r, engine='bash',  comment=''}
cat plotRateFuncs.R
```
We get `log2_Mean_foldchange_kinit,krel_TRP_12min_repressed_allParamsConstrained.pdf` as one of the figures and `repeatedParamScanData_TRP_12min_repressed_allParamsConstrained.rds` containing all valid param sets for all the genes.

```{r, echo=FALSE, fig.cap="log2 Mean Fold Change in Initiation and Pause Release", out.width = '100%', out.height = '100%'}
knitr::include_graphics("figures/log2_Mean_foldchange_kinit,krel_TRP_12min_repressed_allParamsConstrained.pdf")
```
There are some custom functions available for plotting in `customSummariseRateData.R`. We use `repeatedParamScanData_TRP_12min_repressed_allParamsConstrained.rds` and `density.TRP_12.5min_all.v.control_all.rds` (produced by `run_pauseworkflow_InternalTRP_matched.R`) to plot other relevant figures.
```{r, engine='bash',  comment=''}
cat customSummariseRateData.R
```

The figures generated are given below.  
```{r, echo=FALSE, fig.cap="Frequency bar plot of genes by their Initiation foldchange behavior", out.width = '100%', out.height = '100%'}
knitr::include_graphics("figures/CountFrequency_stacked_KinitFC_profile_TRP_treat.pdf")
```
```{r, echo=FALSE, fig.cap="Frequency bar plot of genes by their Pause Release foldchange behavior", out.width = '100%', out.height = '100%'}
knitr::include_graphics("figures/CountFrequency_stacked_KinitFC_KrelFC_TRP_treat.pdf")
```
We select genes with a wide and narrow distribution of foldchange in Kinit.
```{r, echo=FALSE, fig.cap="Initiation Foldchange across all parameter sets", out.width = '100%', out.height = '100%'}
knitr::include_graphics("figures/log2_TRP_kinitFC_distribution.pdf")
```
```{r, echo=FALSE, fig.cap="Pause release Foldchange across all parameter sets", out.width = '100%', out.height = '100%'}
knitr::include_graphics("figures/log2_krel_FC_TRP_distribution.pdf")
```
Finally, we have figures showing index of dispersion. Notice that Krel FC always have a narrow distribution because of the simple relation of `B_fold_change = Krel_foldchange * P_foldchange`.

```{r, echo=FALSE, fig.cap="Index of Dispersion for Initation rate foldchnage", out.width = '100%', out.height = '100%'}
knitr::include_graphics("figures/log10_TRP_kinitFC_indexOfDispersion.pdf")
```
```{r, echo=FALSE, fig.cap="Index of Dispersion for Pause Release rate foldchnage", out.width = '100%', out.height = '100%'}
knitr::include_graphics("figures/log10_krel_FC_TRP_IndexOfDispersion.pdf")
```

We also constrained premature termination rate and used this limit file:- 
```{r, eval=FALSE}
[initiation]
lower = 0.00000001

[prematuretermination]
lower = 0.02
upper = 0.023
```
All valid parameters are stored in `InternalControl_TRP12min_repressed_KpreConstrained/paramEstimationResults_combined.txt`.