Contents

1 Introduction

A document summarizing “compartment model” which models time evolution of Pol II signal in pause region and gene body (Dutta et al. 2021, Mattada et al. 2019).

2 Getting Started

2.1 Installing the package

We will install a python based package, compartmentModel currently hosted on Test PyPI. The package uses COPASI to run parameter estimation and it needs to be installed on the local or any virtual environment. The following commands install the package:-

python3 -m pip install python-copasi 
python3 -m pip install copasi-basico
python3 -m pip install -i https://test.pypi.org/simple compartmentModel

Please verify that the latest version has been installed (by cross-checking the package repo). (If an old version has been installed, then you need to upgrade all the above packages and try again.) After successful installation, the help menu can be accessed by running the following on command line:-

polcomp estparam --help

2.2 Inputs to the compartment model

We need two tab-separated files from each of control and treatment conditions containing ‘gene’, ‘pause_sum’, ‘body_density’ for all genes of our interest. (Please refer to the section Preparation of Inputs for detailed overview of creation of these files.) For this example, we will use two files prepared from Triptolide treatment data (Jonkers et al. 2014):-

head InternalControl_repressed_TRP125min_baseline.txt
gene    pause_sum   body_density
Tcf24   9.19386497139931    0.0804360718203523
Cops5   85.9334249198437    0.0384385124459365
Cspp1   756.197381913662    0.0254058947988453
Arfgef1 5.56534001231194    0.0219468799772712
Rnf152  6.98348507285118    0.0068474169675437
Pign    322.371615976095    0.0171639551516065
Relch   33.6940849125385    0.0252412666679433
Zcchc2  78.383134752512 0.0418215350599081
Kdsr    153.332449406385    0.0335362723951963
head InternalControl_repressed_TRP125min_treatment.txt
gene    pause_sum   body_density
Tcf24   3.97053015232086    0.0243700172313062
Cops5   16.6258801221848    0.0278898475135028
Cspp1   116.273120522499    0.0247544261562293
Arfgef1 1.59940004348755    0.0177578365557946
Rnf152  0.523810029029846   0.0044233950876575
Pign    63.7204707860947    0.0124597407610381
Relch   10.836000084877 0.0200736366477201
Zcchc2  14.5026798248291    0.0268547701719555
Kdsr    27.2139506340027    0.0272897903308197

2.3 Running the model

We will specify the inputs to our model as mentioned in the help message:-

polcomp estparam --help

By default for parameter estimation, the lower and higher limits for premature termination (Kpre), pause release (Krel), initiation (Kinit) rates are 0.01 and 1 respectively, whereas lower and upper limits for elongation (Kelong) are 30 and 60. To demonstrate that these limits can be changed, we will use a config file to change the lower limit to 1e-8.

cat TRP_limitfile_AllParams
[initiation]
lower = 0.00000001

[pauserelease]
lower = 0.00000001

[prematuretermination]
lower = 0.00000001

Note: In similar manner, the upper limit can be changed for any parameter by specifying upper value for the desired paramter, say premature termination rate. This is for demonstration and not being used in this section.:-

[prematuretermination]
lower = 0.02
upper = 0.023

We will run the following command:-

polcomp estparam --basename baseline --basefile InternalControl_repressed_TRP125min_baseline.txt --exprname TRP_12.5min --exprfile InternalControl_repressed_TRP125min_treatment.txt --limitfile TRP_limitfile_AllParams --nscan 700 --outputdir InternalControl_TRP12min_repressed/

The command will attempt finding nscan=700 parameters for each set of genes provided in the input. At the end of the run, it should produce the following output (last few lines mentioned):-

> Valid fits are saved as InternalControl_TRP12min_repressed/paramEstimationResults_combined.txt
>> Genes with no valid fits have been written to InternalControl_TRP12min_repressed/NoValidFits.txt. Please read vignette or check annotations to see if these genes have correct pause windows.
>> Genes with pause sum or body avg as zero in either condition have been written to InternalControl_TRP12min_repressed/GeneswithZeroPauseSums_or_BodyAvg.txt

The output file paramEstimationResults_combined.txt contain all the valid parameter sets for all the genes. The column names are self-explanatory and depend on the basename and exprname mentioned in the input command:-

head InternalControl_TRP12min_repressed/paramEstimationResults_combined.txt
P_to_B_ratio_baseline_Fitted    P_to_B_ratio_baseline_Measured  P_to_B_TRP_12.5min_Fitted   P_to_B_TRP_12.5min_Measured P_fold_change_Fitted    P_fold_change_Measured  B_fold_change_Fitted    B_fold_change_Measured  P_baseline_model    P_TRP_12.5min_model elongationRate  initiationRate_baseline initiationRate_TRP_12.5min  prematureTerminationRate    pauseReleaseRate_baseline   pauseReleaseRate_TRP_12.5min    objective_func_bestval  gene
114.3   114.3   162.927 162.927 0.431867    0.431867    0.302974    0.302974    4.67645e-06 2.01961e-06 30.03   1.57438e-06 5.21716e-07 0.0741941   0.262467    0.184132    1.41891e-33 Tcf24
114.3   114.3   162.927 162.927 0.431867    0.431867    0.302974    0.302974    3.30129e-07 1.42572e-07 49.2197 1.42017e-07 4.30275e-08 1e-08   0.430188    0.301795    1.5166000000000001e-33  Tcf24
114.3   114.3   162.927 162.927 0.431867    0.431867    0.302974    0.302974    5.48365e-07 2.36821e-07 30.03   1.43928e-07 4.36063e-08 4.04015e-08 0.262467    0.184132    1.41891e-33 Tcf24
114.3   114.3   162.927 162.927 0.431867    0.431867    0.302974    0.302974    5.02756e-06 2.17124e-06 30.03   1.50315e-06 4.79079e-07 0.036516    0.262467    0.184132    0.0 Tcf24
114.3   114.3   162.927 162.927 0.431867    0.431867    0.302974    0.302974    5.04544e-06 2.17896e-06 30.03   1.4048e-06  4.35999e-07 0.0159634   0.262467    0.184132    1.41891e-33 Tcf24
114.3   114.3   162.927 162.927 0.431867    0.431867    0.302974    0.302974    2.13914e-05 9.23824e-06 30.03   5.61453e-06 1.70105e-06 1.61773e-07 0.262467    0.184132    1.41891e-33 Tcf24
114.3   114.3   162.927 162.927 0.431867    0.431867    0.302974    0.302974    6.57341e-06 2.83884e-06 30.03   5.34314e-06 2.08515e-06 0.550375    0.262467    0.184132    0.0 Tcf24
114.3   114.3   162.927 162.927 0.431867    0.431867    0.302974    0.302974    1.68193e-06 7.26372e-07 30.03   2.12339e-06 8.6012e-07  1.0 0.262467    0.184132    1.41891e-33 Tcf24
114.3   114.3   162.927 162.927 0.431867    0.431867    0.302974    0.302974    5.99757e-06 2.59016e-06 30.03   1.57919e-06 4.79103e-07 0.000838913 0.262467    0.184132    0.0 Tcf24

A way to display the parameter results is mentioned in the section Plotting Model Results. The user may use own custom functions for displaying/summarising the parameter sets mentioned in paramEstimationResults_combined.txt.

3 Preparation of Inputs

3.1 Processing of raw sequencing files

We will be using Triptolide (TRP) treatment dataset from Jonkers et al. 2014. We will process the raw GRO-seq files from control and TRP 12.5min conditions. Mouse V6.5 cells were processed in two replicates. These are the accession numbers for files control rep1 - SRR935093, SRR935094, control rep2 - SRR935095, SRR935096 and TRP 12.5min - SRR935119 (rep1), SRR935120 (rep2). As Arabidopsis thaliana RNAs were used as spike-ins, we also remove reads aligning to Arabidopsis genome. We are using Mus_musculus.GRCm39.105 genome for mouse and GCF_000001735.4 for Arabidopsis. We follow the almost the same workflow as detailed in Nascent RNA Methods with following changes:-

  • We align to Arabidopsis genome and collect reads aligning to Arbidopsis (spikeIns) and mouse.
  • The mouse reads are further aligned to mouse ribosome and we remove the ribosomal reads.
  • The remaining reads aligning to mouse are then processed.

We use the following file as our “placeholder” and we replace the placeholder within this file by the names of our replicates:-

cat processPRO_seqdata_placeholder_spikeIn.sh
#!/bin/bash
#SBATCH --job-name=processPRO_seqdata
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 8
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=END
#SBATCH --mem=50G
#SBATCH --mail-user=rmukherjee@uchc.edu
#SBATCH -o processPRO_seqdata_placeholder_%j.out
#SBATCH -e processPRO_seqdata_placeholder_%j.err

set -e
# add path to the R files
export PATH=$PATH:/home/FCAM/rmukherjee/Spring2022Rotation/R_files
module load cutadapt;
module load seqtk;
module load bowtie2;
module load bedtools;
module load samtools;
module load bamtools;
module load genometools
seqBiasDir=/home/FCAM/rmukherjee/seqOutBias/target/release
filesDir=/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/fastqfiles
UMI_length=0
mouseGenomeIndex=/home/FCAM/rmukherjee/Spring2022Rotation/mm39_bt/mm39
mouserDNAIndex=/home/FCAM/rmukherjee/Spring2022Rotation/mouse_rDNA_bt2/mouse_rDNA
name=placeholder
annotation_prefix=/home/FCAM/rmukherjee/Spring2022Rotation/genome/annotations/Mus_musculus.GRCm39.105
aThalianaGenomeIndex=/home/FCAM/rmukherjee/arabidopsisThalianaGenome/AThaliana_GCF1735
cores=8
read_size=50
parentMouseGenomeDir=/home/FCAM/rmukherjee/Spring2022Rotation/genome
tallymer=${parentMouseGenomeDir}/mm39.tal_${read_size}.gtTxt.gz
table=${parentMouseGenomeDir}/mm39_${read_size}.4.2.2.tbl
aThalianaGenomeDir=/home/FCAM/rmukherjee/arabidopsisThalianaGenome
aThalianaTallymer=${aThalianaGenomeDir}/GCF_000001735.4.tal_${read_size}.gtTxt.gz
aThalianaTable=${aThalianaGenomeDir}/GCF_000001735.4_${read_size}.4.2.2.tbl

cd ${filesDir}
gunzip ${name}.fastq.gz

cutadapt --cores=${cores} -m $((UMI_length+2)) -O 1 -a TGGAATTCTCGGGTGCCAAGG ${name}.fastq \
                                -o ${name}_noadap.fastq --too-short-output ${name}_short.fastq > ${name}_cutadapt.txt
## initialize the metrics file
#echo -e  "value\texperiment\tthreshold\tmetric" > ${name}_QC_metrics.txt
#PE1_total=$(wc -l ${name}.fastq | awk '{print $1/4}')
#PE1_w_Adapter=$(wc -l ${name}_short.fastq | awk '{print $1/4}')
#AAligation=$(echo "scale=2 ; $PE1_w_Adapter / $PE1_total" | bc)
#echo -e "$AAligation\t$name\t0.80\tAdapter/Adapter" >> ${name}_QC_metrics.txt
seqtk seq -L $((UMI_length+10)) ${name}_noadap.fastq > ${name}_noadap_trimmed.fastq
#
########################
####### since UMI length is zero
##### - no fqdedup as UMI is not present
##### - no reverse complementing for GRO-seq as `opposite adapters on the 3 and 5 ends for GRO`.
mv ${name}_noadap_trimmed.fastq ${name}_processed_UMI_${UMI_length}.fastq
#########################

## bowtie log showed highest alignment with both b and e removed
#seqtk trimfq -b ${UMI_length} -e ${UMI_length} ${name}_dedup.fastq | seqtk seq -r - > ${name}_processed_UMI_${UMI_length}.fastq

##bowtie2 -p $((cores)) -x ${aThalianaGenomeIndex} -U ${name}_processed_UMI_${UMI_length}.fastq 2>${name}_bowtie2_${UMI_length}_aThaliana.log | samtools view -b - | \
#                          samtools sort - -o ${name}_${UMI_length}_aThaliana.bam
##parse bowtie log for alignment
#filename=${name}_bowtie2_${UMI_length}_aThaliana.log # logfile
#totalReads=$(wc -l ${name}_processed_UMI_${UMI_length}.fastq | awk '{ print $1/4 }')
#alignmentRate=$(grep "overall alignment rate" $filename | awk -F"%" '{ print $1 }')
#alignmentFraction=$(echo "scale=2 ; ${alignmentRate} / 100" | bc)
#spikeInReads=$(echo "scale=0; ${alignmentFraction} * ${totalReads}" | bc)
#mousealignedPercent=$(grep "overall alignment rate" ${name}_bowtie2_${UMI_length}_human.log | awk -F"%" '{ print $1 }')
#mouseReads=$(echo "scale=0; ${mousealignedPercent} * ${totalReads} * 0.01" | bc)
#echo -e "${alignmentFraction}\t$name\t0.30\tThaliana Alignment Rate" > ${name}_metrics_spikeIn.txt
#echo -e "spikeInReads\tmouseReads\ttotalReads" > ${name}_metrics_spikeIn.txt
#echo -e "${name}\t${spikeInReads}\t${mouseReads}\t${totalReads}" >> ${name}_metrics_spikeIn.txt

#
#bowtie2 -p $((cores)) -x ${mouseGenomeIndex} -U ${name}_processed_UMI_${UMI_length}.fastq 2>${name}_bowtie2_${UMI_length}_mouse.log | samtools view -b - | \
#                          samtools sort - -o ${name}_${UMI_length}_mouse.bam
##parse bowtie log for alignment
#filename=${name}_bowtie2_${UMI_length}_mouse.log # logfile
#alignmentRate=$(grep "overall alignment rate" $filename | awk -F"%" '{ print $1 }')
#alignmentFraction=$(echo "scale=2 ; ${alignmentRate} / 100" | bc)
#echo -e "${alignmentFraction}\t$name\t0.80\tAlignment Rate" >> ${name}_QC_metrics.txt

# thaliana reads - collect all reads not aligning to mouse genome
bowtie2 -p $((cores)) --maxins 1000 -x ${mouseGenomeIndex} -U ${name}_processed_UMI_${UMI_length}.fastq  2>${name}_bowtie2_non_mouseDNA.log |  samtools sort -n - | samtools fastq -f 0x4 - > ${name}.non.mouse.fastq

# align to ribosomal reads and remove them
bowtie2 -p $((cores)) -x ${mouserDNAIndex} -U ${name}_processed_UMI_${UMI_length}.fastq 2>${name}_bowtie2_rDNA.log | \
                        samtools sort -n - | samtools fastq -f 0x4 - > ${name}.non.rDNA.fastq
reads=$(wc -l ${name}.non.rDNA.fastq | awk '{print $1/4}')

filename=${name}_bowtie2_rDNA.log # logfile
rDNA_alignment=$(grep "overall alignment rate" $filename | awk -F"%" '{ print $1 }')
rDNA_alignmentFraction=$(echo "scale=2 ; ${rDNA_alignment} / 100" | bc)
echo -e "$rDNA_alignmentFraction\t$name\t0.10\trDNA Alignment Rate" >> ${name}_QC_metrics.txt

bowtie2 -p $((cores)) --maxins 1000 -x ${mouseGenomeIndex} -U ${name}.non.rDNA.fastq  2>${name}_bowtie2_non_rDNA.log | samtools view -b - | samtools sort - -o ${name}.nonrDNA.bam
# parse bowtie log to find number of non-rDNA reads mapping to mouse genome.
filename=${name}_bowtie2_non_rDNA.log # logfile
non_rDNA_reads=$(wc -l ${name}.non.rDNA.fastq | awk '{ print $1/4 }')
non_rDNA_alignment=$(grep "overall alignment rate" $filename | awk -F"%" '{ print $1 }')
nonrDNAalignmentFraction=$(echo "scale=2 ; ${non_rDNA_alignment} / 100" | bc)
non_rDNA_ReadCount=$(echo "scale=0 ; ( ${non_rDNA_alignment} * ${non_rDNA_reads} ) / 100" | bc)
echo -e "${nonrDNAalignmentFraction}\t$name\t0.80\tAlignment Rate" >> ${name}_QC_metrics.txt
#
### complexity and theoretical read depth
#fqComplexity -i ${name}_noadap_trimmed.fastq
#PE1_total=$(wc -l ${name}.fastq | awk '{print $1/4}')
#PE1_noadap_trimmed=$(wc -l ${name}_noadap_trimmed.fastq | awk '{print $1/4}')
#factorX=$(echo "scale=2 ; $PE1_noadap_trimmed / $PE1_total" | bc)
#echo fraction of reads that are not adapter/adapter ligation products or below 10 base inserts
#echo $factorX
###calculate PE1 deduplicated reads
#PE1_dedup=$(wc -l ${name}_dedup.fastq | awk '{print $1/4}')
###divide
#factorY=$(echo "scale=2 ; $non_rDNA_ReadCount / $PE1_dedup" | bc)
##re-run with factors
#fqComplexity -i ${name}_noadap_trimmed.fastq -x $factorX -y $factorY
#
bowtie2 -p $((cores)) -x ${aThalianaGenomeIndex} -U ${name}.non.mouse.fastq  2>${name}_bowtie2_${UMI_length}_aThaliana.log | samtools view -b - | \
                                   samtools sort - -o ${name}_${UMI_length}_aThaliana.bam
##parse bowtie log for alignment
filename=${name}_bowtie2_${UMI_length}_aThaliana.log # logfile
totalReads=$(wc -l ${name}_processed_UMI_${UMI_length}.fastq | awk '{ print $1/4 }')
alignmentRate=$(grep "overall alignment rate" $filename | awk -F"%" '{ print $1 }')
alignmentFraction=$(echo "scale=2 ; ${alignmentRate} / 100" | bc)
spikeInReads=$(echo "scale=0; ${alignmentFraction} * ${totalReads}" | bc)
mousealignedPercent=$(grep "overall alignment rate" ${name}_bowtie2_${UMI_length}_human.log | awk -F"%" '{ print $1 }')
mouseReads=$(echo "scale=0; ${mousealignedPercent} * ${totalReads} * 0.01" | bc)
#echo -e "${alignmentFraction}\t$name\t0.30\tThaliana Alignment Rate" > ${name}_metrics_spikeIn.txt
echo -e "exprName\tspikeInReads\tmouseReads\ttotalReads" > ${name}_metrics_spikeIn.txt
echo -e "${name}\t${spikeInReads}\t${mouseReads}\t${totalReads}" >> ${name}_metrics_spikeIn.txt

##TBD: convert to bigWig and BED6 using seqOutBias
${seqBiasDir}/seqOutBias scale $table ${name}.nonrDNA.bam --no-scale --stranded --bed-stranded-positive \
                    --bw=${name}.bigWig --bed=${name}.bed \
            --tail-edge --read-size=${read_size} --tallymer=$tallymer

grep -v "random" ${name}_not_scaled.bed | grep -v "chrUn" | grep -v "chrEBV" | sort -k1,1 -k2,2n > ${name}_tmp.txt
mv ${name}_tmp.txt ${name}_not_scaled.bed
# run seqoutBias on aThaliana bam
${seqBiasDir}/seqOutBias scale ${aThalianaTable} ${name}_${UMI_length}_aThaliana.bam --no-scale --stranded --bed-stranded-positive \
                    --bw=${name}_aThaliana.bigWig --bed=${name}_aThaliana.bed \
            --tail-edge --read-size=${read_size} --tallymer=${aThalianaTallymer}

sort -k1,1 -k2,2n ${name}_aThaliana_not_scaled.bed > ${name}_pause_counts_tmp.bed
mv ${name}_pause_counts_tmp.bed ${name}_aThaliana_not_scaled.bed

##count reads in pause region
mapBed -null "0" -s -a $annotation_prefix.pause.bed -b ${name}_not_scaled.bed | \
                                awk '$7>0' | sort -k5,5 -k7,7nr | sort -k5,5 -u > ${name}_pause.bed
#discard anything with chr and strand inconsistencies
join -1 5 -2 5 ${name}_pause.bed $annotation_prefix.bed | \
                                awk '{OFS="\t";} $2==$8 && $6==$12 {print $2, $3, $4, $1, $6, $7, $9, $10}' | \
                                                awk '{OFS="\t";} $5 == "+" {print $1,$2+480,$8,$4,$6,$5} $5 == "-" {print $1,$7,$2 - 380,$4,$6,$5}' | \
                                                            awk  '{OFS="\t";} $3>$2 {print $1,$2,$3,$4,$5,$6}' > ${name}_pause_counts_body_coordinates.bed

sort -k1,1 -k2,2n ${name}_pause_counts_body_coordinates.bed > ${name}_pause_counts_tmp.bed
mv ${name}_pause_counts_tmp.bed ${name}_pause_counts_body_coordinates.bed

#column ten is Pause index
mapBed -null "0" -s -a ${name}_pause_counts_body_coordinates.bed \
            -b ${name}_not_scaled.bed | awk '$7>0' | \
                                    awk '{OFS="\t";} {print $1,$2,$3,$4,$5,$6,$7,$5/100,$7/($3 - $2)}' | \
                                                    awk '{OFS="\t";} {print $1,$2,$3,$4,$5,$6,$7,$8,$9,$8/$9}' > ${name}_pause_body.bed
pause_index.R ${name}_pause_body.bed
#
mapBed -null "0" -s -a $annotation_prefix.introns.bed \
                        -b ${name}_not_scaled.bed | awk '$7>0' | \
            awk '{OFS="\t";} {print $1,$2,$3,$5,$5,$6,$7,($3 - $2)}' > ${name}_intron_counts.bed

mapBed -null "0" -s -a $annotation_prefix.no.first.exons.named.bed \
            -b ${name}_not_scaled.bed | awk '$7>0' | \
                awk '{OFS="\t";} {print $1,$2,$3,$4,$4,$6,$7,($3 - $2)}' > ${name}_exon_counts.bed

exon_intron_ratio.R ${name}_exon_counts.bed ${name}_intron_counts.bed

gzip ${name}.fastq

#cat *_QC_metrics.txt | awk '!x[$0]++' > project_QC_metrics.txt 
#plot_all_metrics.R project_QC_metrics.txt Estrogen_treatment_PRO

We can replace the placeholder and submit jobs in the following manner (submission example for slurm environment in HPC cluster @ UConn):-

for filename in V65_cntrl_rep1 V65_cntrl_rep2 V65_TRP_treat_12min_rep1 V65_TRP_treat_12min_rep2; do
        sed "s/placeholder/${filename}/g" processPRO_seqdata_placeholder_spikeIn.sh > processPRO_seqdata_${filename}_spikeIn.sh
        echo ${filename}_spikeIn
        sbatch processPRO_seqdata_${filename}_spikeIn.sh
        sleep 1
done

3.2 Normalization of sequencing data by their read-depth

We eventually will normalize data by DESeq2 size factors. For running DESeq2, we plan to use the following criteria:- * consider genes with length greater than 50kbp and treat regions from [TSS + 50kbp] to TTS as unchanging between control and TRP treatment for 12.5 mins. TSS = transcription start site, TTS = transcription termination site. * for differential expression analysis by DESeq2, only consider regions between 500 and 25kbp..

We create normalized files in the same manner as mentioned in PRO_normalization of Nascent RNA Methods and use normalize_bedGraph_v2.py for increased execution speed.

cat normalization_factor.R
#!/usr/bin/env Rscript

# modified from normalization_factor.R because of different file names 
# assumes the naming convention CELL_COND_TIME_ETC_rep<#>_<plus><minus>.bigWig
# assumes that you are running the script in the directory with your bigWigs 

if(!require(devtools)) { install.packages("devtools", repos='http://cran.us.r-project.org') }
if(!require(bigWig)) {devtools::install_github("andrelmartins/bigWig", subdir="bigWig")}

library(bigWig)

coverage = c()
file.name = c()

# commented out from the original code as it seems not to serve any purpose
#for (plus.bigWig in Sys.glob(file.path("*_rep1_plus_PE1.bigWig"))) { # name changed
#   file.prefix = strsplit(plus.bigWig, '_rep1')[[1]][1]
#   #print(file.prefix)
#    minus.bigWig = paste0(strsplit(plus.bigWig, 'plus')[[1]][1], "plus.bigWig") # should actually be minus.bigWig
#    #print(minus.bigWig)
#    }
# these bigWigs were creataed using seqOutBias
for (replicate.plus.bigWig in Sys.glob(file.path("*_plus.bigWig"))) { # name changed
    #print(replicate.plus.bigWig)
    replicate.prefix = strsplit(replicate.plus.bigWig, '_plus.bigWig')[[1]][1] # name changed
    replicate.minus.bigWig = paste0(strsplit(replicate.prefix, '_plus.bigWig')[[1]][1], "_minus.bigWig") # name changed
    #print(replicate.minus.bigWig)
    bigwig.plus = load.bigWig(replicate.plus.bigWig)
    bigwig.minus = load.bigWig(replicate.minus.bigWig)
    coverage.bigwig.plus = bigwig.plus$basesCovered*bigwig.plus$mean # coveredBases * mean (mysterious because I don't know the definitions)
    coverage.bigwig.minus = abs(bigwig.minus$basesCovered*bigwig.minus$mean) # abs used because of negstive values
    coverage = append(coverage, coverage.bigwig.plus)
    coverage = append(coverage, coverage.bigwig.minus)
    file.name = append(file.name, replicate.plus.bigWig)
    file.name = append(file.name, replicate.minus.bigWig)
    #print(coverage)
    unload.bigWig(bigwig.plus)
    unload.bigWig(bigwig.minus)
}   
# NOTE: CHANGE THE FOllowing prefix to `basename` in PRO_normalization (line 34)
#celltype.prefix = strsplit(file.prefix, '_')[[1]][1]
celltype.prefix = "V65" 
write.table(data.frame(file.name, coverage), file = paste0(celltype.prefix, '_normalization.txt'), sep = '\t', row.names = FALSE, col.names=FALSE, quote =FALSE)
cat PRO_normalization
#! /bin/sh

# modified from original code as file names are different: https://github.com/guertinlab/Nascent_RNA_Methods/blob/main/PRO_normalization

# put this in your path: normalize_bedGraph.py
# put this executable in your path: normalization_factor.R

#normalization_factor_v2.R and normalize_bedGraph_v2.py are present in the same directory and they have been modifed accordingly.
set -e
module load R/4.1.2
export PATH=${PATH}:/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014 # for normalization_bedgraph_v2.py
export PATH=${PATH}:/home/FCAM/rmukherjee/Summer2022Rotation/R_files # for bigWigToBedGraph
while getopts "c:" OPTION
do
    case $OPTION in
    c)
        chrSizes=$OPTARG
        ;;

    esac
done

bigWigFilesDir=/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/fastqfiles

cd ${bigWigFilesDir}

if [ $chrSizes ]; then

    Rscript /home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/normalization_factor.R # modified version of normalization_factor.R from Guertin lab's GitHub
        #for i in V65_cntrl_rep1_plus.bigWig V65_FP_treat_25min_rep1_plus.bigWig V65_TRP_treat_12min_rep1_plus.bigWig; # name modified
        for i in V65_cntrl_rep1_plus.bigWig V65_TRP_treat_12min_rep1_plus.bigWig; # name modified
          do
          name=$(echo $i | awk -F"_rep1_plus.bigWig" '{print $1}') # name modified
              printf "%s - processing individual replicates\n" ${name}
          basename=$(echo $i | awk -F"_" '{print $1}') 
          # count the number of replicates
                  reps=$(ls ${name}_rep?_plus.bigWig | wc -w | bc) # modified name
                  for j in ${name}_rep?_plus.bigWig; 
          do
                  repNum=$(echo $j | awk -F"rep" '{print $NF}' | awk -F"_plus" '{print $1}') # take the num# of replicate 
                              invscalePlus=$(grep "${name}_rep${repNum}_plus.bigWig" ${basename}_normalization.txt | awk -F" " '{print $2}' | bc)
                  invscaleMinus=$(grep "${name}_rep${repNum}_minus.bigWig" ${basename}_normalization.txt | awk -F" " '{print $2}' | bc)
                  echo $invscalePlus
                      echo $invscaleMinus
                  # scale to 10 million
                  invScale=$(expr $invscalePlus + $invscaleMinus)
                          scale=$(bc <<< "scale=3 ; 10000000 / $invScale") # normalised to 1e7 reads
                      bigWigToBedGraph ${name}_rep${repNum}_plus.bigWig ${name}_rep${repNum}_plus.bedGraph
                  bigWigToBedGraph ${name}_rep${repNum}_minus.bigWig ${name}_rep${repNum}_minus.bedGraph
                          normalize_bedGraph_v2.py -i ${name}_rep${repNum}_plus.bedGraph -s $scale -o ${name}_rep${repNum}_plus_normalized.bedGraph
                      normalize_bedGraph_v2.py -i ${name}_rep${repNum}_minus.bedGraph -s $scale -o ${name}_rep${repNum}_minus_normalized.bedGraph
                  bedGraphToBigWig ${name}_rep${repNum}_plus_normalized.bedGraph $chrSizes ${name}_rep${repNum}_plus_normalized.bigWig 
                  bedGraphToBigWig ${name}_rep${repNum}_minus_normalized.bedGraph $chrSizes ${name}_rep${repNum}_minus_normalized.bigWig
          done
          printf "%s - creating and merging normalized replicate bigWigs\n" ${name}
          plusfiles=$(ls ${name}_rep*_plus_normalized.bigWig)
          bigWigMerge $plusfiles tmpPlus.bg
              minusfiles=$(ls ${name}_rep*_minus_normalized.bigWig)
              bigWigMerge -threshold=-10000000000 $minusfiles tmpMinus.bg # for bigWigMerge utility: -threshold=0.N - don't output values at or below this threshold. Default is 0.0
                  scaleall=$(bc <<< "scale=4 ; 1.0 / $reps")
          normalize_bedGraph_v2.py -i tmpPlus.bg -s $scaleall -o ${name}_plus_normalized.bg # scaleall=1/rep#
              normalize_bedGraph_v2.py -i tmpMinus.bg -s $scaleall -o ${name}_minus_normalized.bg
              sort -k1,1 -k2,2n ${name}_plus_normalized.bg > ${name}_plus_normalized_sorted.bg
          sort -k1,1 -k2,2n ${name}_minus_normalized.bg > ${name}_minus_normalized_sorted.bg
          bedGraphToBigWig ${name}_plus_normalized_sorted.bg $chrSizes ${name}_plus_normalized.bigWig 
          bedGraphToBigWig ${name}_minus_normalized_sorted.bg $chrSizes ${name}_minus_normalized.bigWig   
              rm ${name}_plus_normalized.bg
          rm ${name}_minus_normalized.bg
          rm ${name}_plus_normalized_sorted.bg
          rm ${name}_minus_normalized_sorted.bg
              rm tmpPlus.bg
              rm tmpMinus.bg
      done
fi

We will now use the normalized files from control condition - V65_cntrl_plus_normalized.bigWig and V65_cntrl_minus_normalized.bigWig for estimating the TSS in next section.

3.3 Estimation of transcription start sites for genes

We will use primaryTranscriptAnnotation (Anderson et al. 2020, GitHub ) to estimate TSS and TTS. Comments mention why some thresholds were used:-

cat createPrimaryTranscAnnot_Mouse_cntThresh_point9.R
library(bigWig)
library(NMF)
library(dplyr)
library(pracma)
library(RColorBrewer)
library(dplyr)

# source primaryTranscriptAnnotation package files here
source("/home/FCAM/rmukherjee/Summer2022Rotation/genomicsRpackage-master/primaryTranscriptAnnotation/R/gene_ann.R")
source("/home/FCAM/rmukherjee/Summer2022Rotation/genomicsRpackage-master/primaryTranscriptAnnotation/R/map_tu.R")

# modified from get.TSS()
get.counts.around.potential.TSS <- function(bed.in=NULL, bed.out=NULL, bw.plus=NULL, bw.minus=NULL,
                                           bp.range=NULL, cnt.thresh=NULL, low.read=FALSE, by="cnt"){
     final.df = c()
      # error handling. note that input and output genes must be matched
      if( length(unique(bed.in$gene)) != length(unique(bed.out$gene)) ){stop("input genes do not match output genes")}
  if( setequal(unique(bed.in$gene),unique(bed.out$gene)) == FALSE ){stop("input genes do not match output genes")}
    # loop through each unique gene
    # map reads a range for each annotated TSS
    bed0 = bed.out
    bed = bed.in
      genes = unique(bed$gene)
      tss = rep(0,length(genes))
        for(ii in 1:length(genes)){
    ind.gene = which(bed$gene == genes[ii])
    strand.gene = unique(bed$strand[ind.gene])[1]
    chr.gene = unique(bed$chr[ind.gene])[1]
        # get all sets of intervals around each annotated TSS
        # sort upstream to downstream
        if(strand.gene == "-"){
              starts = unique(bed$end[ind.gene])
          ranges = data.frame(chr=chr.gene, start=starts-bp.range[2],
                                    end=starts-bp.range[1], stringsAsFactors=FALSE)
            ranges = ranges[order(ranges$end,decreasing=TRUE),]
          }
        if(strand.gene == "+"){
              starts = unique(bed$start[ind.gene])
          ranges = data.frame(chr=chr.gene, start=starts+bp.range[1],
                                    end=starts+bp.range[2], stringsAsFactors=FALSE)
                ranges = ranges[order(ranges$start),]
              }

        # get counts and densities for each interval of interest
        if(strand.gene == "-"){ counts = abs(bed.region.bpQuery.bigWig(bw.minus, ranges)) }
        if(strand.gene == "+"){ counts = abs(bed.region.bpQuery.bigWig(bw.plus, ranges)) }
            dens = counts / (ranges$end - ranges$start)

        # select the best TSS, select the most upstream for ties
        if(by == "cnt"){ind.max = which(counts == max(counts))[1]}
        if(by == "den"){ind.max = which(dens == max(dens))[1]}
           final.df = rbind(final.df, c(genes[ii], counts[ind.max], ranges[ind.max, ], strand.gene))
    }
      final.df = as.data.frame(final.df)
      return(final.df)
}
# load transcript, firstExon files
annotDir="/home/FCAM/rmukherjee/Summer2022Rotation/annotations"
outputDir = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/primaryTranscriptAnnotation"

# import data for first exons, annotate, and remove duplicate transcripts
fname = sprintf("%s/gencode.mm39.firstExon.bed", annotDir)
dat0 = read.table(fname,header=F,stringsAsFactors=F)
names(dat0) = c('chr', 'start', 'end', 'gene', 'xy', 'strand')
dat0 = unique(dat0)
gencode.firstExon = dat0
# import data for all transcripts, annotate, and remove duplicate transcripts
fname = sprintf("%s/gencode.mm39.transcript.bed", annotDir)
dat0 = read.table(fname,header=F,stringsAsFactors=F)
names(dat0) = c('chr', 'start', 'end', 'gene', 'xy', 'strand')
dat0 = unique(dat0)
gencode.transcript = dat0
# chromosome sizes
chrom.sizes = read.table("/home/FCAM/rmukherjee/Spring2022Rotation/genome/mm39.chrom.sizes",stringsAsFactors=F,header=F)
names(chrom.sizes) = c("chr","size")

bigwigDir = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/fastqfiles"
#plus.file = sprintf("%s/TRP_FP_control_merged_plus.bigWig", bigwigDir)
#minus.file = sprintf("%s/TRP_FP_control_merged_minus.bigWig", bigwigDir)
#plus.file = sprintf("%s/V65_CNTRL_All_plus_normalized.bigWig", bigwigDir)
#minus.file = sprintf("%s/V65_CNTRL_All_minus_normalized.bigWig", bigwigDir)
plus.file = sprintf("%s/V65_cntrl_plus_normalized.bigWig", bigwigDir)
minus.file = sprintf("%s/V65_cntrl_minus_normalized.bigWig", bigwigDir)
bw.plus = load.bigWig(plus.file)
bw.minus = load.bigWig(minus.file)
cntrl.rep1.plus = load.bigWig(sprintf("%s/V65_cntrl_rep1_plus.bigWig", bigwigDir))
cntrl.rep1.minus = load.bigWig(sprintf("%s/V65_cntrl_rep1_minus.bigWig", bigwigDir))
cntrl.rep2.plus = load.bigWig(sprintf("%s/V65_cntrl_rep2_plus.bigWig", bigwigDir))
cntrl.rep2.minus = load.bigWig(sprintf("%s/V65_cntrl_rep2_minus.bigWig", bigwigDir))

# (using gencode transcript bed) get intervals for furthest TSS and TTS +/- interval
# stitch together the lowest start and highest end of each gene
largest.interval.bed = get.largest.interval(bed=gencode.transcript)
saveRDS(largest.interval.bed, file=sprintf("%s/gencode.largest.interval.rds", outputDir))
# (use gencode transcript bed) get read counts and densities for each annotated gene
transcript.reads = read.count.transcript(bed=gencode.transcript,
                     bw.plus=bw.plus, bw.minus=bw.minus)
saveRDS(transcript.reads, file = sprintf("%s/gencode.transcript.reads.rds", outputDir))
# decide on cutoffs for discarding reads - evaluate count and density distributions
pdf(sprintf("%s/read_count_density.pdf", outputDir))
par(mfrow=c(1,2))
hist(log(transcript.reads$density), breaks=200,
          col="black",xlab="log read density",main="")
hist(log(transcript.reads$counts), breaks=200,
          col="black",xlab="log read count",main="")
dev.off()
# create cutoffs and discard gene with low expression
den.cut = -7.2 # density < exp(-7.5)
cnt.cut = 0.3    # count < exp (0)
# plot the graph
pdf(sprintf("%s/read_count_density_thresholdLine.pdf", outputDir))
par(mfrow=c(1,2))
hist(log(transcript.reads$density), breaks=200,
          col="black",xlab="log read density",main="")
abline(v=den.cut, col="red")
hist(log(transcript.reads$counts), breaks=200,
          col="black",xlab="log read count",main="")
abline(v=cnt.cut, col="red")
dev.off()
# remove "unexpressed" genes
# get indices based on density and cnt threshold
ind.cut.den = which(log(transcript.reads$density) < den.cut)
ind.cut.cnt = which(log(transcript.reads$counts) < cnt.cut)
ind.cut = union(ind.cut.den, ind.cut.cnt) # take union of indices to be removed
unexp = names(transcript.reads$counts)[ind.cut] # get all unexpressed genes
largest.interval.expr.bed = largest.interval.bed[!(largest.interval.bed$gene %in% unexp),] # get interval bed of all genes which are expressed in sufficient amount
saveRDS(largest.interval.expr.bed, file=sprintf("%s/largest_interval_expr_bed.rds", outputDir))

# select the TSS (defined as having the highest count) for each gene and incorporate these TSSs
# into the largest interval coordinates
bp.range = c(20,160) # estimated using plotMetaProfileOfBigwigs.R
cnt.thresh = 0.9 # chose this because the Stat4 correct annotation was at 0.1805. Also doesn't make sense to use this threshold as most reads are really low 
#original count threshold was five. Took few cases with read-depth normalized values and computed the threshold
## normalization factor
# control_rep1 = 0.2495727
# control_rep2 = 0.3610368
# the nromalized count calculated using the following cases
#  control_rep1 control_rep2 normalizedRead
#1            5            0      0.6239319
#2            0            5      0.9025920
#3            5            5      1.5265239
#4            2            3      0.7911280
#5            3            2      0.7353959
#6            1            4      0.8468600
#7            4            1      0.6796639
bed.out = largest.interval.expr.bed
# select exon 1 having genes in largest.interval. Probably to enforce same genes?
bed.in = gencode.firstExon[gencode.firstExon$gene %in% bed.out$gene,]
saveRDS(bed.in, sprintf("%s/V65.Jonkers.bed.in.rds", outputDir))
# bp.range is the interval within which the greatest read density is searched for
# the following regions on gene are searched through bigwig.query
# for - strand, the start and end are set as "end-120", "end-20"
# for + strand, the start and end are set as "start+20" and "start+120"
# after the max coverage is found,
# the + strand TSS is set to "start"
# whereas - strand TSS is set to "end"
# finally, the gene names from TSS calculated are used to set the same genes in largest
# interval bed, where genes in plus(+) strand get their "start" set as TSS whereas genes in minus(-) strand get their "end" set as TSS
# get.TSS also returns an "issues" object which contain all genes whose ends are lower than starts and whose start are less than zero
# the "bed" object returned contains the interval data patched with TSS as start.

##################### EXERCISE FOR estimating the counts around TSS 
#counts.df = get.counts.around.potential.TSS(bed.in=bed.in, bed.out=bed.out,
#                                          bw.plus=bw.plus, bw.minus=bw.minus,
#                                                     bp.range=bp.range, cnt.thresh=cnt.thresh)
#counts.df = as.data.frame(counts.df)
#counts.df[,2] =as.numeric(counts.df[,2])
#saveRDS(counts.df, file=sprintf("%s/max.counts.around.potential.tss.rds", outputDir))
## control rep1
#counts.df.cntrl.rep1 = get.counts.around.potential.TSS(bed.in=bed.in, bed.out=bed.out,
#                                          bw.plus=cntrl.rep1.plus, bw.minus=cntrl.rep1.minus,
#                                                     bp.range=bp.range, cnt.thresh=cnt.thresh)
#counts.df.cntrl.rep1 = as.data.frame(counts.df.cntrl.rep1)
#colnames(counts.df.cntrl.rep1) = c("gene", "max.count", "chr", "start", "end", "strand")
#counts.df.cntrl.rep1[,2] =as.numeric(counts.df.cntrl.rep1[,2])
#saveRDS(counts.df.cntrl.rep1, file=sprintf("%s/cntrl.rep1.max.counts.potential.tss.rds", outputDir))
## control rep2
#counts.df.cntrl.rep2 = get.counts.around.potential.TSS(bed.in=bed.in, bed.out=bed.out,
#                                          bw.plus=cntrl.rep2.plus, bw.minus=cntrl.rep2.minus,
#                                                     bp.range=bp.range, cnt.thresh=cnt.thresh)
#counts.df.cntrl.rep2 = as.data.frame(counts.df.cntrl.rep2)
#counts.df.cntrl.rep2[,2] =as.numeric(counts.df.cntrl.rep2[,2])
#colnames(counts.df.cntrl.rep2) = c("gene", "max.count", "chr", "start", "end", "strand")
#saveRDS(counts.df.cntrl.rep2, file=sprintf("%s/cntrl.rep2.max.counts.potential.tss.rds", outputDir))
#############################

gene.TSS = get.TSS(bed.in=bed.in, bed.out=bed.out, 
           bw.plus=bw.plus, bw.minus=bw.minus,
           bp.range=bp.range, cnt.thresh=cnt.thresh)
TSS.gene = gene.TSS$bed
saveRDS(TSS.gene, file=sprintf("%s/TSS_gene_unfiltered.cnt.thresh.%0.2f.rds", outputDir, cnt.thresh))
# pre-filter to remove poorly annotated genes
for (substr in c("Gm", "Rik", "AC", "AW")) {
               grep.result = grep(substr, TSS.gene$gene)
   if (length(grep.result) > 0) {
                        TSS.gene = TSS.gene[-grep.result,]
         }
}
saveRDS(TSS.gene, file=sprintf("%s/TSS_gene_filtered.cnt.thresh.%0.2f.rds", outputDir, cnt.thresh))
# remove overlaps by selecting the gene with the highest density
# (for both plus and minus strand) gene.overlaps searches for genes within the same chromosome which has a start inside other genes. A dataframe is created with gene list and which genes start within them ("has.start.inside") or where they start within other genes ("is.start.inside").
# also, number of cases of such occurence of inner and outer genes are documented for each of the genes. (integer overlap.)
overlap.data = gene.overlaps( bed = TSS.gene )
# the cases from gene.overlaps is used again within remove.overlaps function
# only those genes are kept which have the maximum density within the transcript bed
# the Tss.gene object with rows filtered out are returned
filtered.id.overlaps = remove.overlaps(bed=TSS.gene,
                       overlaps=overlap.data$cases,                                   
                       transcripts=gencode.transcript,bw.plus=bw.plus, bw.minus=bw.minus, by="den")
# now start finding TTS coordinates
# get intervals for TTS evaluation
# the output from the function is an input for get.TTS()
# for plus strand,
# add.to.end is number of bases added to continue searching after gene end annotation
# start of TTS search is set by subtracting fraction.end*(gene_length_from_annotation) from (end+add.to.end)
# for minus strand,
# add.to.end is subtracted from start
# the end of TTS search is set by adding fraction.end*(fraction.end*gene_length) from (start - add.to.end)
# (for both plus and minus strand) dist.to.start is the number of bases left from gene start of the next gene in annotati
add.to.end = 100000
fraction.end = 0.2
dist.from.start = 50
bed.for.tts.eval = get.end.intervals(bed=filtered.id.overlaps,
                     add.to.end=add.to.end, fraction.end=fraction.end,                                                                  dist.from.start=dist.from.start)
saveRDS(bed.for.tts.eval, file=sprintf("%s/bed.for.tts.eval.cnt.thresh.%0.2f.rds", outputDir, cnt.thresh))
# identify gene ends
# @param knot.div (in get.TTS) the number of knots for the spline fit is defined as the number of bins covering the gene end divided by this number;
#' increasing this parameter results and a smoother curve
add.to.end = max(bed.for.tts.eval$xy)
knot.div = 40
pk.thresh = 0.05
bp.bin = 50
knot.thresh = 5
Cnt.thresh = 4 # same threshold as for get.TSS
tau.dist = 50000
frac.max = 1
frac.min = 0.3
# get.TTS tries to figure how to consider the area with the last transcription signal
# as count data is disordered, we need to find proper peaks. For that purpose a spline is fitted with number of knots (the number of polynomial functions.)
# then we search for the maximum value reached by spline. From that peak of spline,
# we move to the last peak of the spline which at a threshold decided by equations in the code.
# we call the x coodirnate of that peak as TTS.
# (all the above process in reverse direction for minus strand.)
inferred.coords = get.TTS(bed=bed.for.tts.eval, tss=filtered.id.overlaps,
                 bw.plus=bw.plus, bw.minus=bw.minus, 
                 bp.bin=bp.bin, add.to.end=add.to.end,
                 pk.thresh=pk.thresh, knot.thresh=knot.thresh,cnt.thresh=Cnt.thresh, tau.dist=tau.dist,
                 frac.max=frac.max, frac.min=frac.min, knot.div=knot.div)
saveRDS(inferred.coords, file=sprintf("%s/inferred_coords.cnt.thresh.%0.2f.rds", outputDir, cnt.thresh))
coords = inferred.coords$bed
coords.filt = chrom.size.filter(coords=coords, chrom.sizes=chrom.sizes)
coords.filt$log
head(coords.filt$bed)
# process output, remove mitochondrial coordinates, and write out data
out = coords.filt$bed %>% select(chr,start,end,gene)
out = cbind(out, c(500), coords.filt$bed$strand) %>% as.data.frame(stringsAsFactors=F)
# create a gene list for use while merging
pta.gene.list = out$gene # gene lists from primary Transcript annotation workflow
# save the previous data as inferred coords
out = out[-which(out$chr == "chrM"),]
write.table(out,sprintf("%s/inferredcoords.cnt.thresh.%0.2f.bed", outputDir, cnt.thresh),quote=F,sep="\t",col.names=F,row.names=F)
# now, begin to merge together a full annotation list.
removed.gene.interval = largest.interval.bed %>% filter(!(gene %in% pta.gene.list | chr == "chrM"))
names(removed.gene.interval) = names(out) # set the colnames equal
removed.gene.interval[c("c(500)")] = 100 # set score to something low as these were discarded
master.annotation.bed = rbind(out, removed.gene.interval)
write.table(master.annotation.bed, sprintf("%s/mastercoords.cnt.thresh.%0.2f.bed", outputDir, cnt.thresh), quote=F,sep="\t",col.names=F,row.names=F)

3.4 Running DESeq2 for differential analysis

We will first prepare bed files for “affected genes” (TSS+500 to 25kb) and “internalcontrol genes” (50kbp to TTS) as discussed above.

cat parsePTACoordsToShortAffectedAndLongControlGenes.R
library(dplyr)

# wrong # df.pause.body <- readRDS("pauseWindowsAndBWplots/TRP_FP_vscntrl.pTA.pause.body.rds")
df.pause.body = read.table("primaryTranscriptAnnotation/mastercoords.bed", header=F)
colnames(df.pause.body) = c("chr", "start", "end", "gene", "score", "strand")

# break long genes into long genes with no signal
controlgenes.len.gt.50kb =  subset(df.pause.body, end - start >= 49999)

control.genes.df = c()
genes.for.affect = c()
colnames.of.interest = c("chr","start", "end", "gene", "score", "strand")
for (i in c(1:nrow(controlgenes.len.gt.50kb))) {
    this.row = controlgenes.len.gt.50kb[i,]
    if ( this.row$strand == "+" ) {
        copy.row = this.row
        this.row$start = this.row$start + 49999
        this.row$gene = sprintf("%s_cntrlGene", this.row$gene)
        control.genes.df = rbind(control.genes.df, this.row[,colnames.of.interest])
        copy.row$end = copy.row$start + 24999
        copy.row$gene = sprintf("%s_affected", copy.row$gene)
        genes.for.affect = rbind(genes.for.affect, copy.row[,colnames.of.interest])
    } else {
        copy.row = this.row
        this.row$end = this.row$end - 49999
        this.row$gene = sprintf("%s_cntrlGene", this.row$gene)
        control.genes.df = rbind(control.genes.df, this.row[,colnames.of.interest])
        copy.row$start = copy.row$end - 24999
        copy.row$gene = sprintf("%s_affected", copy.row$gene)
        genes.for.affect = rbind(genes.for.affect, copy.row[,colnames.of.interest])
       }
}

# shorter genes
controlgenes.len.short.50kb =  subset(df.pause.body, end - start < 49999)
for (i in c(1:nrow(controlgenes.len.short.50kb))) {
    this.row = controlgenes.len.short.50kb[i,]
    if ( this.row$strand == "+" ) {
    copy.row = this.row
    copy.row$end = min(copy.row$end, copy.row$start + 24999)
    copy.row$gene = sprintf("%s_affected", copy.row$gene)
    genes.for.affect = rbind(genes.for.affect, copy.row[,colnames.of.interest])
    } else {
    copy.row = this.row
    copy.row$start = max(copy.row$start, copy.row$end - 24999)
    copy.row$gene = sprintf("%s_affected", copy.row$gene)
    genes.for.affect = rbind(genes.for.affect, copy.row[,colnames.of.interest])
    }
}
write.table(as.data.frame(control.genes.df), file="primaryTranscriptAnnotation/InternalControlGenes_TRP_12min.bed", quote=F, sep="\t", row.names=F, col.names=F)
write.table(as.data.frame(genes.for.affect), file="primaryTranscriptAnnotation/GenesWithRegionsAffected_TRP_12min.bed", quote=F, sep="\t", row.names=F, col.names=F)
cat parsePTACoordsToShortAffectedAndLongControlGenes.R
library(dplyr)

# wrong # df.pause.body <- readRDS("pauseWindowsAndBWplots/TRP_FP_vscntrl.pTA.pause.body.rds")
df.pause.body = read.table("primaryTranscriptAnnotation/mastercoords.bed", header=F)
colnames(df.pause.body) = c("chr", "start", "end", "gene", "score", "strand")

# break long genes into long genes with no signal
controlgenes.len.gt.50kb =  subset(df.pause.body, end - start >= 49999)

control.genes.df = c()
genes.for.affect = c()
colnames.of.interest = c("chr","start", "end", "gene", "score", "strand")
for (i in c(1:nrow(controlgenes.len.gt.50kb))) {
    this.row = controlgenes.len.gt.50kb[i,]
    if ( this.row$strand == "+" ) {
        copy.row = this.row
        this.row$start = this.row$start + 49999
        this.row$gene = sprintf("%s_cntrlGene", this.row$gene)
        control.genes.df = rbind(control.genes.df, this.row[,colnames.of.interest])
        copy.row$end = copy.row$start + 24999
        copy.row$gene = sprintf("%s_affected", copy.row$gene)
        genes.for.affect = rbind(genes.for.affect, copy.row[,colnames.of.interest])
    } else {
        copy.row = this.row
        this.row$end = this.row$end - 49999
        this.row$gene = sprintf("%s_cntrlGene", this.row$gene)
        control.genes.df = rbind(control.genes.df, this.row[,colnames.of.interest])
        copy.row$start = copy.row$end - 24999
        copy.row$gene = sprintf("%s_affected", copy.row$gene)
        genes.for.affect = rbind(genes.for.affect, copy.row[,colnames.of.interest])
       }
}

# shorter genes
controlgenes.len.short.50kb =  subset(df.pause.body, end - start < 49999)
for (i in c(1:nrow(controlgenes.len.short.50kb))) {
    this.row = controlgenes.len.short.50kb[i,]
    if ( this.row$strand == "+" ) {
    copy.row = this.row
    copy.row$end = min(copy.row$end, copy.row$start + 24999)
    copy.row$gene = sprintf("%s_affected", copy.row$gene)
    genes.for.affect = rbind(genes.for.affect, copy.row[,colnames.of.interest])
    } else {
    copy.row = this.row
    copy.row$start = max(copy.row$start, copy.row$end - 24999)
    copy.row$gene = sprintf("%s_affected", copy.row$gene)
    genes.for.affect = rbind(genes.for.affect, copy.row[,colnames.of.interest])
    }
}
write.table(as.data.frame(control.genes.df), file="primaryTranscriptAnnotation/InternalControlGenes_TRP_12min.bed", quote=F, sep="\t", row.names=F, col.names=F)
write.table(as.data.frame(genes.for.affect), file="primaryTranscriptAnnotation/GenesWithRegionsAffected_TRP_12min.bed", quote=F, sep="\t", row.names=F, col.names=F)

We will generate read count data from our processed GRO-seq files and the bed files for “affected” and “internalcontrol” genes created earlier.

cat deseq2DataCreator_InternalControl_TRP12min.sh
#!/bin/bash
set -e
module load bedtools # specific for Xanadu cluster
proseqfolder=/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/fastqfiles
annotation_bed_TRP_sorted=/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/primaryTranscriptAnnotation/GenesWithRegionsAffected_TRP_12min_sorted.bed
annotationbed_TRP_internalControls=/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/primaryTranscriptAnnotation/InternalControlGenes_TRP_12min_sorted.bed
outputFolder=/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/deseq2

printf "entering proseqfolder: %s\n" "${proseqfolder}"
cd ${proseqfolder}
filenames=$(ls V65*rep?_not_scaled.bed | grep -v "FP")
for filename in ${filenames}
do
 name=$(echo $filename | awk -F"_not_scaled.bed" '{print $1}')
 echo -e  "\t${name}" > ${outputFolder}/${name}_gene_counts_Affected_Trp12min.txt
 mapBed -null "0" -s -a ${annotation_bed_TRP_sorted} -b $filename |\
     awk '{OFS="\t";} {print $4,$7}' >> ${outputFolder}/${name}_gene_counts_Affected_Trp12min.txt
done

#internal spike ins
for filename in ${filenames}
do
 name=$(echo $filename | awk -F"_not_scaled.bed" '{print $1}')
 echo -e  "\t${name}" > ${outputFolder}/${name}_gene_counts_InternalControl_Trp12min.txt
 mapBed -null "0" -s -a ${annotationbed_TRP_internalControls} -b $filename |\
     awk '{OFS="\t";} {print $4,$7}' >> ${outputFolder}/${name}_gene_counts_InternalControl_Trp12min.txt
done

printf "entering outputFolder: %s\n" "${outputFolder}"
cd ${outputFolder}
paste -d'\t' *_gene_counts_Affected_Trp12min.txt > PRO_TRP_V65_Affected_GeneCounts.txt
paste -d'\t' *_gene_counts_InternalControl_Trp12min.txt >  PRO_TRP_V65_InternalControl_GeneCounts.txt
cat runDeseq2_TRP_12min_matched_internalControl.R
library(MatchIt)
library(bigWig)
library(DESeq2)

source("/home/FCAM/rmukherjee/Spring2022Rotation/R_files/ZNF143_functions.R")
setwd("/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/deseq2")

outputDir = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/deseq2"
pTA_dir = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/primaryTranscriptAnnotation"
fastqfilesdir = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/fastqfiles"
genecountsDir = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/deseq2"
spikeInCountsFile = sprintf("%s/PRO_TRP_V65_InternalControl_GeneCounts.txt", genecountsDir)
mouseGeneCountFile = sprintf("%s/PRO_TRP_V65_Affected_GeneCounts.txt", genecountsDir) # made from primary transcript annotation
counts.obj.internal = read.table(spikeInCountsFile, sep = '\t', header = T)
row.names.internal = counts.obj.internal[,1]
counts.obj.internal = counts.obj.internal[,seq(2,to=ncol(counts.obj.internal),by=2)]
rownames(counts.obj.internal) = row.names.internal

## only consider TRP and control files - regex V65_[cT]
## only consider genes with length upto 25kb 
inferred.coords=read.table(sprintf('%s/GenesWithRegionsAffected_TRP_12min_sorted.bed', pTA_dir), sep="\t", header =FALSE)
inferred.coords[,"V4"] = gsub("_affected", "", inferred.coords[,"V4"])
counts.df = abs(get.raw.counts.interval(inferred.coords, fastqfilesdir, file.prefix = "V65_[cT]", 
        file.plus.suffix = "*rep?_plus.bigWig", file.minus.suffix = "_minus.bigWig"))
saveRDS(counts.df, "raw.counts.TRP12min_affected.rds")
merged.counts.obj = rbind(counts.obj.internal, counts.df)
saveRDS(merged.counts.obj, file = "merged.internal.plus.raw.counts.rds")
cond.list = c("control", "control", "TRP_12min", "TRP_12min")
cond.levels=c("TRP_12min", "control")
sample.conditions = factor(cond.list, levels = cond.levels)
sample.conditions = relevel(sample.conditions, ref="control")

# #PCA for experiments
dds = run.deseq.list.dds(merged.counts.obj, sample.conditions, cond.levels, ControlGenes.Indices=1:nrow(counts.obj.internal))
rld <- rlog(dds)
plotPCA(rld) # basic plot

pca.plot = plotPCA(rld, intgroup="sample.conditions", returnData=TRUE)
pca.plot$sample.conditions = rownames(pca.plot)
plotPCAlattice(pca.plot, file = 'PCA_TRP_12min_PRO.pdf')

rep = factor(sapply(strsplit(colnames(counts.df), 'rep'), '[', 2))
# formaula: ~ rep + sample.conditions
deseq.df = DESeqDataSetFromMatrix(counts.df,colData = cbind.data.frame(sample.conditions, rep), ~ rep + sample.conditions)
deseq.df$sizeFactor = estimateSizeFactorsForMatrix(merged.counts.obj, controlGenes = 1:nrow(counts.obj.internal))
saveRDS(deseq.df$sizeFactor, file = sprintf("%s/size.factors.rawCounts.TRPInternalGenes.rds", outputDir))
deseq.df = DESeq(deseq.df)
saveRDS(deseq.df, "deseq.df.TRPinternal.RAW.counts.rds")

# formula: ~ sample.conditions
deseq.df = DESeqDataSetFromMatrix(counts.df,colData = DataFrame(sample.conditions), ~ sample.conditions)
deseq.df$sizeFactor = estimateSizeFactorsForMatrix(merged.counts.obj, controlGenes = 1:nrow(counts.obj.internal))
saveRDS(deseq.df$sizeFactor, file = sprintf("%s/size.factors.modifiedFormula.rawCounts.TRPInternalGenes.rds", outputDir))
deseq.df = DESeq(deseq.df)
saveRDS(deseq.df, "deseq.df.modifiedFormula.TRPinternal.RAW.counts.rds")

# Create tables of activated/repressed genes and their matching genes 
for (i in levels(sample.conditions)[-1]) {
    res.deseq = results(deseq.df, contrast = c("sample.conditions", i, "control"))
  sum(res.deseq$padj < 0.1 & !is.na(res.deseq$padj))
    activated = res.deseq[res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange > 0,]
    activated.strand = merge(cbind(tighten_summit_window(activated), "Activated"), inferred.coords, by.x = "gene", by.y = "V4")[,c(2, 3, 4, 1, 5, 10)]    
      write.table(activated.strand, file = paste0(i, '_activated_genes.bed'), quote = FALSE, row.names = FALSE, col.names = FALSE, sep = '\t')
      unchanged = res.deseq[!is.na(res.deseq$padj) & res.deseq$padj > 0.1 & abs(res.deseq$log2FoldChange) < 0.01,]
        #unchanged = unchanged[sample(x, size, replace = FALSE, prob = NULL),]
        unchanged$treatment = 0
        activated$treatment = 1
      df.deseq.effects.lattice = rbind(unchanged, activated)
      out = matchit(treatment ~ baseMean, data = df.deseq.effects.lattice, method = "optimal", ratio = 1)
      summary(out, un=F) # don't show stats from pre-matching data
      pdf("matchIt.optimal.activated.vs.unchanged.TRP12min.pdf")
      plot(out, type="density", interactive = FALSE)
      dev.off()  
      
      # chose unchanged genes which are matched to activated
      unchanged = df.deseq.effects.lattice[rownames(df.deseq.effects.lattice) %in% out$match.matrix,]
      unchanged.strand = merge(cbind(tighten_summit_window(unchanged), "Matched to Activated"), inferred.coords, by.x = "gene", by.y = "V4")[,c(2, 3, 4, 1, 5, 10)] 
          write.table(unchanged.strand, file = paste0(i, '_activated_matched_unchanged_genes.bed'), quote = FALSE, row.names = FALSE, col.names = FALSE, sep = '\t')
          unchanged$treatment = "Matched to Activated"
        activated$treatment = "Activated"
        df.x = rbind(activated, unchanged)
    # do the similar procedure using repressed genes
      repressed = res.deseq[res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange < 0,]
      repressed.strand = merge(cbind(tighten_summit_window(repressed), "Repressed"), inferred.coords, by.x = "gene", by.y = "V4")[,c(2, 3, 4, 1, 5, 10)]    
        write.table(repressed.strand, file = paste0(i, '_repressed_genes.bed'), quote = FALSE, row.names = FALSE, col.names = FALSE, sep = '\t')
        unchanged = res.deseq[!is.na(res.deseq$padj) & res.deseq$padj > 0.1 & abs(res.deseq$log2FoldChange) < 0.01,]

      # unchanged = res.deseq[rownames(res.deseq) %notin% rownames(repressed) & !is.na(res.deseq$padj) & res.deseq$log2FoldChange > 0,]
       unchanged$treatment = 0
      repressed$treatment = 1
        df.deseq.effects.lattice = rbind(unchanged, repressed)
        out = matchit(treatment ~ baseMean, data = df.deseq.effects.lattice, method = "optimal", ratio = 1)
    summary(out, un=F)
        ## Warning message: Fewer control units than treated units; not all treated units will get a match.
    pdf("matchIt.optimal.repressed.vs.unchanged.TRP12min.pdf")
      plot(out, type="density", interactive = FALSE)
      dev.off()
      unchanged = res.deseq[!is.na(res.deseq$padj) & res.deseq$padj > 0.1 & abs(res.deseq$log2FoldChange) < 0.01,]  
      unchanged = df.deseq.effects.lattice[rownames(df.deseq.effects.lattice) %in% out$match.matrix,]
      unchanged.strand = merge(cbind(tighten_summit_window(unchanged), "Matched to Repressed"), inferred.coords, by.x = "gene", by.y = "V4")[,c(2, 3, 4, 1, 5, 10)] 

        write.table(unchanged.strand, file = paste0(i, '_repressed_matched_unchanged_genes.bed'), quote = FALSE, row.names = FALSE, col.names = FALSE, sep = '\t')
        unchanged$treatment = "Matched to Repressed"
          repressed$treatment = "Repressed"
          df.x = rbind(df.x, unchanged)
            df.x = rbind(df.x, repressed)
}

## MA Plot for activated, repressed, unchanged

temp = df.x
unique(temp$treatment)

dim(res.deseq[is.na(res.deseq$padj),])
otherGenes = res.deseq[!(rownames(res.deseq) %in% rownames(temp)) & !(is.na(res.deseq$padj)),]
otherGenes$treatment = "Other"

ma.df = rbind(temp,otherGenes)
dim(ma.df)
# `matched to activated` and `matched to repressed` are labels of unchanged genes
levelsMA = c("Other","Matched to Activated", "Matched to Repressed","Activated","Repressed")
ma.df$treatment = factor(ma.df$treatment, levels = levelsMA)
saveRDS(ma.df, file = "TRP_treat_matched_DESeq_genes_categorized.rds")
ma.plot.lattice(ma.df, filename='TRP_treat_matched',title.main = "Diff Exp (TRP 12.5min)", col=c("grey90","grey30","grey30", "#ce228e" ,"#2290cf"))
deseqGenes = tighten_summit_window(ma.df)
dim(deseqGenes)

deseqGenesInfo = cbind(deseqGenes, ma.df)
dim(deseqGenesInfo)
# save cateogrized data
saveRDS(deseqGenesInfo, "deseqGenesInfo.TRP12min.matched.rds")
write.table(deseqGenesInfo,file="TRP_treat_matched_DESeq_genes_categorized.bed",quote = FALSE, sep ='\t', row.names = FALSE, col.names = TRUE)
We get a MA plot showing 9632 genes as repressed.
MA plot for repressed genes

Figure 1: MA plot for repressed genes

We then save the DESeq2 size factors in a tab-separated text file.

cat saveSizeFactorsAsTxtFile.R

### For TRP 12.5min treatment
#genecountsDir = "/Users/rudradeepmukherjee/Documents/UConn Health/comparment.model.project/datasets, genome/Jonkers_2014_Trp_FP_paper"
genecountsDir = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014"
outputDir = sprintf("%s/deseq2", genecountsDir)
size.factors = readRDS(sprintf("%s/size.factors.rawCounts.TRPInternalGenes.rds", outputDir))
#size.factors = readRDS(sprintf("%s/size.factors.modifiedFormula.rawCounts.TRPInternalGenes.rds", outputDir))
size.factors.transpose = t(t(size.factors))
write.table(size.factors.transpose, sprintf("%s/sizefactors_TRP_12min_InternalGenes.txt",outputDir), sep="\t", quote=F, col.names=F)

### For FP 25min treatment
#genecountsDir = "/Users/rudradeepmukherjee/Documents/UConn Health/comparment.model.project/datasets, genome/Jonkers_2014_Trp_FP_paper"
genecountsDir = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014"
outputDir = sprintf("%s/deseq2", genecountsDir)
#size.factors = readRDS(sprintf("%s/size.factors.FPInternalGenes.rds", outputDir))
size.factors = readRDS(sprintf("%s/size.factors.modifiedFormula.rawCounts.FPInternalGenes.rds", outputDir))
size.factors.transpose = t(t(size.factors))
write.table(size.factors.transpose, sprintf("%s/sizefactors_FP_25min_InternalGenes.txt",outputDir), sep="\t", quote=F, col.names=F)

We use the DESeq2 size factors for normalization. The code is derived from PRO_normalization_sizeFactor of Nascent RNA Methods

cat PRO_normalization_sizeFactor_TRP  
#! /bin/sh

# modified from original code as file names are different: https://github.com/guertinlab/Nascent_RNA_Methods/blob/main/PRO_normalization

# put this in your path: normalize_bedGraph.py
# put this executable in your path: normalization_factor.R

usage() { echo "Usage: $0 -c chrSizes -z sizeFactorsDataTxtFile" 1>&2; exit 1; }

set -e
module load R/4.1.2
export PATH=${PATH}:/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014 # for normalization_bedgraph_v2.py
export PATH=${PATH}:/home/FCAM/rmukherjee/Summer2022Rotation/R_files # for bigWigToBedGraph
while getopts "c:z:" OPTION
do
    case $OPTION in
    c)
        chrSizes=$OPTARG
        ;;
    z) sizeFactorsFile=$OPTARG
    ;;
    *) 
        usage 
        ;;
    esac
done

if [ -z "${chrSizes}" ] || [ -z "${sizeFactorsFile}" ]; then
        usage
fi

bigWigFilesDir=/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/fastqfiles

cd ${bigWigFilesDir}

if [ $chrSizes ]; then

    #NOTE - normalization_factor.R is not needed as we are directly using the sizeFactors file containing two columns - 
    # `replicateName`, `deseq2.sizefactor`

    #Rscript /home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/normalization_factor_v2.R # modified version of normalization_factor.R from Guertin lab's GitHub
        #for i in V65_cntrl_rep1_plus.bigWig V65_FP_treat_25min_rep1_plus.bigWig V65_TRP_treat_12min_rep1_plus.bigWig; # name modified
        for i in V65_cntrl_rep1_plus.bigWig  V65_TRP_treat_12min_rep1_plus.bigWig; # name modified
          do
          name=$(echo $i | awk -F"_rep1_plus.bigWig" '{print $1}') # name modified
              printf "%s - processing individual replicates\n" ${name}
          basename=$(echo $i | awk -F"_" '{print $1}') 
          # count the number of replicates
                  reps=$(ls ${name}_rep?_plus.bigWig | wc -w | bc) # modified name
                  for j in ${name}_rep?_plus.bigWig; 
          do
                  repNum=$(echo $j | awk -F"rep" '{print $NF}' | awk -F"_plus" '{print $1}') # take the num# of replicate 
                              # NOTE - we uill use DESeq2 sizeFactors instead of invscalePlus and invscaleMinus - 
                  #scaleFactor=$(grep ${name}_rep${repNum} ../deseq2/sizefactors_TRP_12min_InternalGenes.txt | awk -F"\t" '{print $2}' | bc)
                  scaleFactor=$(grep ${name}_rep${repNum} ${sizeFactorsFile} | awk -F"\t" '{print $2}' | bc)
                  echo "scalefactor for ${name}_rep${repNum} - ${scaleFactor}"
                  
                  # TAKE THE INVERSE - eqn (6) of Anders and Huber, 2010 finds normalized values by dividing counts with sizeFactors
                  scaleFactor=$(echo "scale=5; 1/${scaleFactor}" | bc)
                  echo "scalefactor inverse for ${name}_rep${repNum} - ${scaleFactor}"

                  # scale to 10 million
                      bigWigToBedGraph ${name}_rep${repNum}_plus.bigWig ${name}_rep${repNum}_plus.bedGraph
                  bigWigToBedGraph ${name}_rep${repNum}_minus.bigWig ${name}_rep${repNum}_minus.bedGraph
                          normalize_bedGraph_v2_pandas.py -i ${name}_rep${repNum}_plus.bedGraph -s $scaleFactor -o ${name}_rep${repNum}_plus_normalized.bedGraph
                      normalize_bedGraph_v2_pandas.py -i ${name}_rep${repNum}_minus.bedGraph -s $scaleFactor -o ${name}_rep${repNum}_minus_normalized.bedGraph
                  bedGraphToBigWig ${name}_rep${repNum}_plus_normalized.bedGraph $chrSizes ${name}_rep${repNum}_plus_normalized.bigWig 
                  bedGraphToBigWig ${name}_rep${repNum}_minus_normalized.bedGraph $chrSizes ${name}_rep${repNum}_minus_normalized.bigWig
          done
          printf "%s - creating and merging normalized replicate bigWigs\n" ${name}
          plusfiles=$(ls ${name}_rep*_plus_normalized.bigWig)
          bigWigMerge $plusfiles tmpPlus.bg
              minusfiles=$(ls ${name}_rep*_minus_normalized.bigWig)
              bigWigMerge -threshold=-10000000000 $minusfiles tmpMinus.bg # for bigWigMerge utility: -threshold=0.N - don't output values at or below this threshold. Default is 0.0
                  scaleall=$(bc <<< "scale=4 ; 1.0 / $reps")
          normalize_bedGraph_v2_pandas.py -i tmpPlus.bg -s $scaleall -o ${name}_plus_normalized.bg # scaleall=1/rep#
              normalize_bedGraph_v2_pandas.py -i tmpMinus.bg -s $scaleall -o ${name}_minus_normalized.bg
              sort -k1,1 -k2,2n ${name}_plus_normalized.bg > ${name}_plus_normalized_sorted.bg
          sort -k1,1 -k2,2n ${name}_minus_normalized.bg > ${name}_minus_normalized_sorted.bg
          bedGraphToBigWig ${name}_plus_normalized_sorted.bg $chrSizes ${name}_plus_normalized_sizeFactor_TRP.bigWig 
          bedGraphToBigWig ${name}_minus_normalized_sorted.bg $chrSizes ${name}_minus_normalized_sizeFactor_TRP.bigWig   
              rm ${name}_plus_normalized.bg
          rm ${name}_minus_normalized.bg
          rm ${name}_plus_normalized_sorted.bg
          rm ${name}_minus_normalized_sorted.bg
              rm tmpPlus.bg
              rm tmpMinus.bg
      done
fi

The processed files are hosted as a track hub which can be loaded in UCSC Genome Browser. A session can be accessed here.

3.5 Estimating Pol II densities in pause region and gene body

We will prepare the pause windows and calculate the densities in gene body regions. We assume all packages mentioned in this file has been installed.

cat run_pauseworkflow_InternalTRP_matched.R
# calls functions in pause_workflow_v2.R to plot composites from conditions.
# this code was written for a project where there were three conditions - control, 45min and 90min.
library(dplyr)
library(bigWig)

# source the functions
source("pause_workflow_funcs_v2.R")

## NOTE: CHANGE these PATHS before running on your own platform.
# set the working directory
outputDir = '/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/pauseWindowsAndBWplots'
# load the merged bigWigs from all conditions x replicates
merged.bw.dir = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/fastqfiles"
#bw.plus <- load.bigWig(sprintf('%s/TRP_FP_control_merged_plus.bigWig', merged.bw.dir))
#bw.minus <- load.bigWig(sprintf('%s/TRP_FP_control_merged_minus.bigWig', merged.bw.dir))

normalized.bw.dir = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/fastqfiles"

bw.cntrl.plus = load.bigWig(sprintf('%s/V65_cntrl_plus_normalized_sizeFactor_TRP.bigWig', normalized.bw.dir))
bw.cntrl.minus = load.bigWig(sprintf('%s/V65_cntrl_minus_normalized_sizeFactor_TRP.bigWig', normalized.bw.dir))

bw.TRP.plus = load.bigWig(sprintf('%s/V65_TRP_treat_12min_plus_normalized_sizeFactor_TRP.bigWig', normalized.bw.dir))
bw.TRP.minus = load.bigWig(sprintf('%s/V65_TRP_treat_12min_minus_normalized_sizeFactor_TRP.bigWig', normalized.bw.dir))

deseq2.files.dir = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/deseq2"
categorized.TRPtreatvscontrol = readRDS(sprintf("%s/deseqGenesInfo.TRP12min.matched.rds",deseq2.files.dir))
#categorized.TRPtreatvscontrol$gene = categorized.TRPtreatvscontrol$gene %>% str_replace("_affected", "")
categorized.TRPtreatvscontrol  = categorized.TRPtreatvscontrol[!duplicated(categorized.TRPtreatvscontrol$gene),]
rownames(categorized.TRPtreatvscontrol) = categorized.TRPtreatvscontrol$gene

# find pause region for all genes listed in primaryTranscriptAnnotation bed file using the merged bigWigs 
# load the coordinates file
affectedGenes.newTTS.bed = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/primaryTranscriptAnnotation/GenesWithRegionsAffected_TRP_12min_sorted_renamed.bed"
pTA <- read.table(affectedGenes.newTTS.bed, sep="\t")
df.pause.body <- find.pause.regions(pTA,bw.cntrl.plus,bw.cntrl.minus, searchWindow = 300)
rownames(df.pause.body) = df.pause.body$gene
saveRDS(df.pause.body, file=sprintf("%s/Internal_TRP_vscntrl.pause.body.rds",outputDir))

merged.pbody.TRPtreat.vs.control = merge(df.pause.body, as.data.frame(categorized.TRPtreatvscontrol))
merged.pbody.TRPtreat.vs.control = merged.pbody.TRPtreat.vs.control[,c(colnames(df.pause.body), colnames(categorized.TRPtreatvscontrol))]
rownames(merged.pbody.TRPtreat.vs.control) = merged.pbody.TRPtreat.vs.control$gene
saveRDS(merged.pbody.TRPtreat.vs.control, file=sprintf("%s/Internal.merged.pbody.TRP.vs.control.rds",outputDir))

setwd(outputDir)
# call plotting steps for all genes
run.plotting.steps(merged.pbody.TRPtreat.vs.control, "control_all", bw.cntrl.plus, bw.cntrl.minus,
                "TRP_12.5min_all", bw.TRP.plus, bw.TRP.minus,
            "TRP_12.5min (all genes, Internal Control)", color.names=c(rgb(1,0,0,1/2), rgb(0,0,1,1/2))) # red, blue from rgb value. Indices set in alphabetical order of column names
density.TRP = readRDS("density.TRP_12.5min_all.v.control_all.rds")
repressed.TRP.baseline = subset(density.TRP, cond == "control_all" & treatment == "Repressed")
repressed.TRP.treatment = subset(density.TRP, cond == "TRP_12.5min_all" & treatment == "Repressed")

colNames = c("gene", "pause.sum", "body.avg")
write.table(repressed.TRP.baseline[,colNames], "InternalControl_repressed_TRP125min_baseline.txt", sep="\t", quote=F, row.names=F)
write.table(repressed.TRP.treatment[,colNames], "InternalControl_repressed_TRP125min_treatment.txt", sep="\t", quote=F, row.names=F)

# call the run.plotting.steps on pause body object containing genes of our interest.
repressed.TRPvscontrol = subset(merged.pbody.TRPtreat.vs.control, treatment == "Repressed")
saveRDS(repressed.TRPvscontrol, file=sprintf("%s/InternalControl.repressed.pbody.TRP.vs.control.rds", outputDir))
run.plotting.steps(repressed.TRPvscontrol, "control", bw.cntrl.plus, bw.cntrl.minus,
                "TRP_12.5min (repressed, internal spikein)", bw.TRP.plus, bw.TRP.minus,
            "TRP_12.5min(rrepressed, internal control)", color.names=c(rgb(1,0,0,1/2), rgb(0,0,1,1/2))) # red, blue from rgb value. Indices set in alphabetical order of column names

Thus, we have the output files of interest - InternalControl_repressed_TRP125min_baseline.txt and InternalControl_repressed_TRP125min_treatment.txt.

4 Plotting Model Results

NOTE: All figures are present in figures directory in the vignette folder.
After sections Preparation of Inputs and Running the model, we can plot results. The code files mentioned here prepare R data structures containing the rate data as well summarizes them:-

cat plotParamsFromRepeatParamScan.R
library(ggplot2)
library(dplyr)
# contains source colde of plotRates and plotRates_noKpre
#source("/home/rudramukh/Documents/COPASI_compartmentModel/plotRateFuncs.R")
source("plotRateFuncs.R")
#repeatScanParamDir = "/home/rudramukh/Documents/COPASI_compartmentModel/compPolModelpkg/paramfitdir_TBP_dTAGv1/repeatedParamScans" # Documents/COPASI_compartmentModel/compPolModelpkg/paramfitdir

## TRP 12.5 min - with kpre
# for all params between 1e-8 to 1.
combnResultsFile = "InternalControl_TRP12min_repressed/paramEstimationResults_combined.txt"
plotRates(combnResultsFile, "baseline", "TRP_12.5min", "TRP_12.5min (repressed)", "TRP_12min_repressed_allParamsConstrained", 12)

The file plotParamsFromRepeatParamScan.R references plotRateFuncs.R below:-

cat plotRateFuncs.R
library(ggplot2)
library(scales)

# genefile = the file to read gene names from
# repeatScanParamDir = the directory where outputFiles of the form `repeatedParamEstimates_<gene>_onlyvalidfits.txt` exists
# exprName = name of experiment to be used for referencing column names.
# filePrefix = an identifier, used while saving plots and files
# colNames = columns to parsse from the repeatedParamScans
# uniqGeneNum = how many unique genes to use to create plots. Default is 10.
plotRates <- function(combinedResultsFile, baselineName, exprName, yLabel, fileidentifier=NULL, uniqGeneNum = 10, ALPHA=0.03) {
 #P_base = "Values.P_base."  #'Values[P_base]'
 #P_expr = "Values.P_expr." # 'Values[P_expr]'
 fontTheme = theme(
  axis.title.x = element_text(size = 16),
  axis.text.x = element_text(size = 16),
  axis.text.y = element_text(size = 16),
  axis.title.y = element_text(size = 18),
  legend.title = element_text(size=14),
  legend.text = element_text(size=13),
 legend.position="top")

  if (is.null(fileidentifier)) {
     fileidentifier = sprintf("%s_vs_%s", exprName, baselineName)
     print(sprintf("setting filePrefix to %s", fileidentifier))
 }
 kelong = "elongationRate" # 'Values[kelong]'
 kinit_base = sprintf("initiationRate_%s", baselineName) #'Values[kinit_base]'
 kinit_expr = sprintf("initiationRate_%s", exprName) #'Values[kinit_expr]'
 kpre = "prematureTerminationRate" # 'Values[kpre]'
 krel_base = sprintf("pauseReleaseRate_%s", baselineName) #'Values[krel_base]'
 krel_expr = sprintf("pauseReleaseRate_%s", exprName) #'Values[krel_expr]'
 colNamesToParse = c("gene", kelong, kpre, kinit_base, kinit_expr, krel_base, krel_expr)
 combinedResults = read.table(combinedResultsFile,sep="\t", header=T)
 # if (file.exists(genefile)) { 
#    geneInfoTable = read.table(genefile, sep="\t", header=T)
# } else {
#        stop(sprintf("genefile %s not found", genefile))
# }
 # start reading in the repeated param estimation results
 #rateData = c() 
 rateData = combinedResults[,colNamesToParse]
 rateData = as.data.frame(rateData)
 colnames(rateData) = c("gene", "kelong", "kpre", "kinit_base", "kinit_expr", "krel_base", "krel_expr")
 rateData$kinit_FC = rateData$kinit_expr / rateData$kinit_base
 rateData$krel_FC = rateData$krel_expr / rateData$krel_base
 saveRDS(rateData, sprintf("repeatedParamScanData_%s.rds", fileidentifier))
 dodge = position_dodge(width=0.05)
 # min, max kinit_FC, krel_FC
 minMaxSummary =  group_by(rateData, gene) %>% summarise_at(c("kinit_FC", "krel_FC"), c("min", "max"), na.rm=T)
 lenGene = nrow(minMaxSummary)
 stepSize = 0.25
 minMaxSummary$Ycoord = seq(0.3, 0.3 + stepSize * (lenGene - 1), stepSize)
 minMaxSummary$kinitFCRange = minMaxSummary$kinit_FC_max - minMaxSummary$kinit_FC_min
 minMaxSummary$krelFCRange = minMaxSummary$krel_FC_max - minMaxSummary$krel_FC_min
 fontTheme = theme(
  axis.title.x = element_text(size = 16),
  axis.text.x = element_text(size = 16),
  axis.text.y = element_text(size = 16),
  axis.title.y = element_text(size = 18),
  legend.title = element_text(size=14),
  legend.text = element_text(size=13),
 legend.position="top")
 saveRDS(minMaxSummary, sprintf("minMaxSummary_%s.rds", fileidentifier))
 # line segments
 ggplot(minMaxSummary) + geom_segment(aes(x=kinit_FC_min, xend=kinit_FC_max, y=Ycoord, yend=Ycoord)) + scale_x_continuous(trans=log2_trans(),breaks=trans_breaks('log2', function(x) 2^x),labels=trans_format('log2', math_format(2^.x))) + xlab("segment joining min and max kinit FC for all gene") + ylab("vertical axis to separate intervals") + theme_bw() + fontTheme
 ggsave(sprintf("intervals_kinit_FC_%s.pdf", fileidentifier))
 ggplot(minMaxSummary) + geom_segment(aes(x=krel_FC_min, xend=krel_FC_max, y=Ycoord, yend=Ycoord)) + scale_x_continuous(trans=log2_trans(),breaks=trans_breaks('log2', function(x) 2^x),labels=trans_format('log2', math_format(2^.x))) + xlab("segment joining min and max kinit FC for all gene") + ylab("vertical axis to separate intervals") + theme_bw() + fontTheme
 ggsave(sprintf("intervals_krel_FC_%s.pdf", fileidentifier))


 # width of kinit FC and krel FC
 ggplot(minMaxSummary) + geom_violin(aes(x="width - kinit_FC", y = kinitFCRange)) + geom_point(position = dodge, alpha=0.5, aes(x="width - kinit_FC", y=kinitFCRange)) + xlab("") + ylab(sprintf("(max - min) kinit FC for all genes - %s", exprName)) + theme_bw()
  ggsave(sprintf("Width_kinit_FC_%s.pdf", fileidentifier))
 ggplot(minMaxSummary) + geom_violin(aes(x="width - krel_FC", y = krelFCRange)) + geom_point(position = dodge, alpha=0.5, aes(x="width - krel_FC", y=krelFCRange)) + xlab("") + ylab(sprintf("(max - min) krel FC for all genes - %s", exprName)) + theme_bw()
  ggsave(sprintf("Width_krel_FC_%s.pdf", fileidentifier))
 # save summary
 summaryRateData = group_by(rateData, gene) %>% summarise_at(c("kinit_base", "kinit_expr", "krel_base", "krel_expr", "kinit_FC", "krel_FC","kelong", "kpre"), mean, na.rm=T)
 #summaryRateData$kinit_FC = summaryRateData$kinit_expr / summaryRateData$kinit_base
 #summaryRateData$krel_FC = summaryRateData$krel_expr / summaryRateData$krel_base
 saveRDS(summaryRateData, sprintf("summary_repeatedParamScanData_%s.rds", fileidentifier))
 # savefrequency table
 freqTableGenes = table(rateData$gene)
 saveRDS(freqTableGenes, sprintf("freqTable_repeatedParamScans_%s.rds", fileidentifier))
 uniqueGenes = unique(rateData[,"gene"])
 # start plotting
 uniqGeneNum = min(uniqGeneNum, length(uniqueGenes))
 print(sprintf("plotting data for first %d genes", uniqGeneNum))
 subsetData = rateData[rateData$gene %in% uniqueGenes[1:uniqGeneNum],]
 ## Mean kinit_FC and krel_FC
 ggplot(summaryRateData) + geom_violin(aes(x="Initiation", y=kinit_FC)) + geom_violin(aes(x="Pause Release", y=krel_FC)) +
           geom_boxplot(aes(x="Initiation", y=kinit_FC), alpha=0.9) + geom_boxplot(aes(x="Pause Release", y=krel_FC),alpha=0.9) +  
           geom_point(position = "jitter", alpha=ALPHA, aes(x="Initiation", y=kinit_FC)) +
            geom_point(position = "jitter", alpha=ALPHA, aes(x="Pause Release", y=krel_FC)) +
        scale_y_continuous(trans=log2_trans(),breaks=trans_breaks('log2', function(x) 2^x),labels=trans_format('log2', math_format(.x))) +
        #scale_y_continuous(breaks = seq(0, ceiling(max(max(summaryRateData[,"kinit_FC"]), max(summaryRateData[,"krel_FC"]))), len = 10)) +
            ylab(sprintf("log2 (Mean Fold Change) for each gene - %s", yLabel)) + xlab("Rates") + theme_bw()  + fontTheme
  ggsave(sprintf("log2_Mean_foldchange_kinit,krel_%s.pdf", fileidentifier))

 
  # log2 TRANSFORMATION - plot y vals
#  ggplot(summaryRateData) + geom_violin(aes(x="kinit", y=kinit_FC)) + 
#     geom_point(position=dodge, alpha=0.5, aes(x="kinit",y=kinit_FC)) + 
#     scale_y_continuous(trans=log2_trans(),breaks=trans_breaks('log2', function(x) 2^x),labels=trans_format('log2', math_format(2^.x))) +
#           ylab(sprintf("Mean value of Fold Change for each gene - %s", yLabel)) + xlab("Rates") + theme(text = element_text(size=12)) + theme_bw()
#  ggsave(sprintf("log2_Mean_foldchange_kinit_%s.pdf", fileidentifier))
# 
# ggplot(summaryRateData) + geom_violin(aes(x="krel", y=krel_FC)) + 
#     geom_point(position=dodge, alpha=0.5, aes(x="krel",y=krel_FC)) + 
#     scale_y_continuous(trans=log2_trans(),breaks=trans_breaks('log2', function(x) 2^x),labels=trans_format('log2', math_format(2^.x))) +
#           ylab(sprintf("Mean value of Fold Change for each gene - %s", yLabel)) + xlab("Rates") + theme(text = element_text(size=12)) + theme_bw()
#  ggsave(sprintf("log2_Mean_foldchange_krel_%s.pdf", fileidentifier))
#

   ## both kinit fc and krel fc on same plot for one single gene
  #ggplot(subsetData) + geom_violin(aes(x="kinit_fc", y=kinit_FC)) + geom_point(position = position_jitter(width= 0.05), alpha=0.05,  aes(x="kinit_fc", y=kinit_FC)) + geom_violin(aes(x="krel_fc", y=krel_FC)) + geom_point(position = position_jitter(width= 0.05), alpha=0.05,  aes(x="krel_fc", y=kinit_FC)) + ylim(0,NA) + ylab("fold change for TMEM221 over multi param estimation")
  ## KINIT FC
 ggplot(subsetData) + geom_violin(aes(x=gene, y= kinit_FC, group=gene)) + geom_point(position = position_jitter(width= 0.05), alpha=0.05, aes(gene, y=kinit_FC, group=gene)) + ylab(sprintf("kinit foldchange - %s vs %s", exprName, baselineName)) + theme_bw()
ggsave(sprintf("kinit_FC_%s.png", fileidentifier), width=14,dpi=300)
## KREL FC
 ggplot(subsetData) + geom_violin(aes(x=gene, y= krel_FC)) + geom_point(position = position_jitter(width= 0.05), alpha=0.05, aes(gene, y=krel_FC, group=gene)) + ylab(sprintf("krel foldchange - %s vs %s", exprName, baselineName))+ theme_bw()
ggsave(sprintf("krel_FC_%s.png", fileidentifier), width=14,dpi=300)
## KINIT - EXPeriment
 ggplot(subsetData) + geom_violin(aes(x=gene, y= kinit_expr)) + geom_point(position = position_jitter(width= 0.05), alpha=0.05, aes(gene, y=kinit_expr, group=gene)) + ylab(sprintf("kinit - %s", exprName))+ theme_bw()
 ggsave(sprintf("kinit_expr_%s.png", fileidentifier), width=14,dpi=300)
## KINIT - baseline
 ggplot(subsetData) + geom_violin(aes(x=gene, y= kinit_base)) + geom_point(position = position_jitter(width= 0.05), alpha=0.05, aes(gene, y=kinit_base, group=gene)) + ylab(sprintf("kinit - %s", baselineName))+ theme_bw()
 ggsave(sprintf("kinit_base_%s.png", fileidentifier), width=14,dpi=300)
## KREL - EXPeriment
 ggplot(subsetData) + geom_violin(aes(x=gene, y= krel_expr)) + geom_point(position = position_jitter(width= 0.05), alpha=0.05, aes(gene, y=krel_expr, group=gene)) + ylab(sprintf("krel - %s", exprName))+ theme_bw()
 ggsave(sprintf("krel_expr_%s.png", fileidentifier), width=14,dpi=300)
## KREL - baseline
 ggplot(subsetData) + geom_violin(aes(x=gene, y= krel_base)) + geom_point(position = position_jitter(width= 0.05), alpha=0.05, aes(gene, y=krel_base, group=gene)) + ylab(sprintf("krel - %s", baselineName))+ theme_bw()
 ggsave(sprintf("krel_base_%s.png", fileidentifier), width=14,dpi=300)
## KELONG
 ggplot(subsetData) + geom_violin(aes(x=gene, y= kelong)) + geom_point(position = position_jitter(width= 0.05), alpha=0.05, aes(gene, y=kelong, group=gene)) + ylab(sprintf("kelong - %s, %s", exprName,baselineName))+ theme_bw()
 ggsave(sprintf("kelong_%s.png", fileidentifier), width=14,dpi=300)
}

We get log2_Mean_foldchange_kinit,krel_TRP_12min_repressed_allParamsConstrained.pdf as one of the figures and repeatedParamScanData_TRP_12min_repressed_allParamsConstrained.rds containing all valid param sets for all the genes.

Figure 2: log2 Mean Fold Change in Initiation and Pause Release

There are some custom functions available for plotting in customSummariseRateData.R. We use repeatedParamScanData_TRP_12min_repressed_allParamsConstrained.rds and density.TRP_12.5min_all.v.control_all.rds (produced by run_pauseworkflow_InternalTRP_matched.R) to plot other relevant figures.

cat customSummariseRateData.R
library(ggplot2)
library(dplyr)
library(ggsci)
library(RColorBrewer)
library(scales)
# https://cran.r-project.org/web/packages/ggsci/vignettes/ggsci.html
source("plotting_composites_lattice.R")

percentTotal_trans <- function(x){
    TOTAL = 7527 # change it to the total you are using
    paste(round((x / TOTAL)*100, digits=2), "%")
}
fontTheme = theme(
  axis.title.x = element_text(size = 20),
  #axis.text.x = element_text(size = 4),
  #axis.text.x = element_text(size = 11),
  #axis.text.x = element_text(size=28, angle = 30, hjust=1),
  axis.text.x = element_text(size=27),
  axis.text.y = element_text(size = 25),
  axis.title.y = element_text(size = 28),
  legend.title = element_text(size=25),
  legend.text = element_text(size=24),
  panel.border = element_rect(colour="black", linewidth=2), 
  panel.grid.major = element_blank(),
 panel.grid.minor = element_blank(),
 axis.ticks.length = unit(2, "mm")
 #axis.line = element_line(colour = "black"),
 #axis.line.x = element_line(colour = "black"),
 #axis.line.y = element_line(colour = "black"),
  #legend.position="top"
 #axis.line.x.top = element_line(colour = "black"),
 #axis.line.y.right = element_line(colour = "black")
)

density.TRP = readRDS("Jonkers_2014_Trp_FP_paper/density.TRP_12.5min_all.v.control_all.rds")
rateData.TRP = readRDS("InternalControl_TRP12min_repressed/repeatedParamScanData_TRP_12min_repressed_allParamsConstrained.rds") # kinit, kpre, krel between 1e-8 and 1.
## plot pause sum and gene body avg
rateData.TRP$P_base = rateData.TRP$kinit_base / (rateData.TRP$kpre + rateData.TRP$krel_base)
rateData.TRP$P_expr = rateData.TRP$kinit_expr / (rateData.TRP$kpre + rateData.TRP$krel_expr)
rateData.TRP$B_base = (rateData.TRP$krel_base * rateData.TRP$P_base) / rateData.TRP$kelong
rateData.TRP$B_expr = (rateData.TRP$krel_expr * rateData.TRP$P_expr) / rateData.TRP$kelong
rateData.TRP$Release_base = rateData.TRP$P_base * rateData.TRP$krel_base
rateData.TRP$Release_expr = rateData.TRP$P_expr * rateData.TRP$krel_expr
rateData.TRP$Terminal_base = rateData.TRP$kpre * rateData.TRP$P_base
rateData.TRP$Terminal_expr = rateData.TRP$kpre * rateData.TRP$P_expr

newSummary.TRP = group_by(rateData.TRP, gene) %>%  summarise_at(c("Release_base", "Release_expr", "P_base", "P_expr", "B_base", "B_expr", "kinit_base", "kinit_expr", "krel_base", "krel_expr", "kpre", "kelong", "kinit_FC", "krel_FC"), c("min", "max", "var", "mean"))

# update category of kinit and krel foldchange
newSummary.TRP$kinitFCcategory = "Undefined"
newSummary.TRP[newSummary.TRP$kinit_FC_max < 1, ]$kinitFCcategory = "kinitFC_less_1"
newSummary.TRP[newSummary.TRP$kinit_FC_min <= 1 & newSummary.TRP$kinit_FC_max > 1, ]$kinitFCcategory = "kinitFC_span_1"
#newSummary.TRP[newSummary.TRP$kinit_FC_min > 1, ]$kinitFCcategory = "kinitFC_more_1"
newSummary.TRP$kinitFCcategory = factor(newSummary.TRP$kinitFCcategory, levels = c("kinitFC_less_1", "kinitFC_span_1"))
# krel FC category
newSummary.TRP$krelFCcategory = NA
newSummary.TRP[newSummary.TRP$krel_FC_min >= 1, ]$krelFCcategory = "krelFCmore1"
newSummary.TRP[newSummary.TRP$krel_FC_max <= 1, ]$krelFCcategory = "krelFCless1"
subset(newSummary.TRP, is.na(krelFCcategory))[,c("gene","krel_FC_min", "krel_FC_max")] # 5 genes with krel_FC_max hovering around 1
newSummary.TRP[is.na(newSummary.TRP$krelFCcategory), ]$krelFCcategory = "krelFCless1"

newSummary.TRP$krelFCcategory = factor(newSummary.TRP$krelFCcategory,levels = c("krelFCless1", "krelFCmore1") )


# bar plot for initiation foldchange categories
df.TRP = as.data.frame(cbind(c("kinitFCless1", "kinitFCspan1"),
                 c(sum(newSummary.TRP$kinitFCcategory == "kinitFC_less_1"), 
                   sum(newSummary.TRP$kinitFCcategory == "kinitFC_span_1"))))
                   #sum(newSummary.TRP$kinitFCcategory == "kinitFC_more_1"))))
colnames(df.TRP) = c("typeOfData", "countOfData")
df.TRP$typeOfData = factor(df.TRP$typeOfData, levels = c("kinitFCless1", "kinitFCspan1"))
df.TRP$countOfData = as.numeric(df.TRP$countOfData)
sum_all = sum(df.TRP$countOfData)
df.TRP$countOfData = df.TRP$countOfData / sum_all 

colorManual.one <- c("#1F77B4", "#FF7F0E", "#2CA02C")

ggplot(df.TRP, aes(x = "All Genes", y = countOfData, fill=typeOfData)) +
         #geom_bar(stat="identity", position=position_fill(reverse=T)) +
         geom_bar(stat="identity", position=position_stack(reverse=T)) +
     #scale_y_continuous(labels = scales::percent) + 
     ylab("Fraction of Genes") +
     #labs(fill = "") + 
     #xlab("") + 
     #scale_fill_brewer(palette = "Dark2",
     scale_fill_manual(values = colorManual.one,
               #labels= c("Decrease Initiation Rate", "Ambiguous Change", "Increase Initiation Rate")) +
               labels= c("Decrease Initiation Rate", "Uncertain Change")) +
     # scale_color_manual(values = colorManual.one,
    #          labels= c("Decrease Initiation Rate", "Uncertain Change")) +
               #  labels= c("Decrease Initiation Rate", "Ambiguous Change", "Increase Initiation Rate")) +
     theme_bw() + fontTheme +
     theme(legend.justification = "center",
           axis.text.x = element_text(size=31, face = 1.3),
              axis.text.y = element_text(size=27),
               axis.title.y = element_text(size=32),
          axis.title.x = element_blank(), legend.title = element_blank(), aspect.ratio = 9)
#ggsave("CountFrequency_KinitFC_profile_TRP_treat.pdf")
ggsave("CountFrequency_stacked_KinitFC_profile_TRP_treat.pdf", height=8.4) # 6.67 x 8

## selection of genes which have wide and narrow distribution of Kinit FC
newSummary.TRP$IndexDisp = newSummary.TRP$kinit_FC_var / newSummary.TRP$kinit_FC_mean
newSummary.TRP = newSummary.TRP[order(newSummary.TRP$IndexDisp, decreasing=T),]
TRP.kinitFC.less1 = subset(newSummary.TRP, kinit_FC_max < 1)
TRP.kinitFC.span1 = subset(newSummary.TRP, kinit_FC_max >= 1 & kinit_FC_min <= 1)
colNames = c("gene", "kinit_FC_max", "kinit_FC_min", "IndexDisp", "strand")
TRP.kinitFC.less1 = TRP.kinitFC.less1[order(TRP.kinitFC.less1$IndexDisp, decreasing = T),]
TRP.kinitFC.span1 = TRP.kinitFC.span1[order(TRP.kinitFC.span1$IndexDisp, decreasing = T),]
plus.TRP.kinitFC.less1 = subset(TRP.kinitFC.less1, strand == "+")
head(plus.TRP.kinitFC.less1[,colNames], n=12) # Nup214 (selected)
tail(TRP.kinitFC.less1[,colNames], n=12) #gene , Fgfr2 (selected), Top2a, Cox11, Cdkl3
TRP.kinitFC.span1 = TRP.kinitFC.span1[order(TRP.kinitFC.span1$IndexDisp, decreasing=T),]
head(TRP.kinitFC.span1[, colNames]) # Clhc1, Klhl35, Car2
tail(TRP.kinitFC.span1[,colNames], n=30) # Dmrta2, Gal
gene.list.Init = c("Nup214", "Fgfr2", "Clhc1", "Dmrta2")

# plot log2 kinit foldchange
colorManual.two = c("#1F77B4", "#FF7F0E", "#E7298A", "#1B9E77") # blue, orange, dark palette
subset.rateData.TRP = subset(rateData.TRP, gene %in% gene.list.Init)
#subset.rateData.TRP$kinitFCcategory = "undefined"
subset.rateData.TRP$kinitFCcategory = NA 
#subset.rateData.TRP$kinitFCcategory = factor(subset.rateData.TRP$kinitFCcategory, levels= c("kinitFC_less_1", "kinitFC_span_1","kinitFC_more_1"))
for (gene.name in gene.list.Init) {
 subset.rateData.TRP[subset.rateData.TRP$gene == gene.name,]$kinitFCcategory = subset(newSummary.TRP,gene == gene.name)$kinitFCcategory
}
subset.rateData.TRP$kinitFCcategory = factor(subset.rateData.TRP$kinitFCcategory, levels= c("kinitFC_less_1", "kinitFC_span_1"))
subset.rateData.TRP$gene = factor(subset.rateData.TRP$gene, levels = gene.list.Init)

ggplot(subset.rateData.TRP) + geom_violin(aes(x=gene, y=kinit_FC, fill=kinitFCcategory), scale="width") + 
    geom_point(aes(x=gene,y=kinit_FC), position="jitter", alpha=0.3 , size=0.16) + 
    scale_y_continuous(trans=log2_trans(), breaks=trans_breaks('log2', function(x) 2^x), labels=trans_format('log2', math_format(.x))) + 
    #ylab(expression(atop(log[2] ~ "Foldchange in" ~ K[init] ~ ", TRP treatment")))  + 
ylab(expression(log[2] ~ "Foldchange in" ~ italic(K[init])))  + 
    scale_fill_manual(values = colorManual.two,
               labels= c("Decrease Initiation Rate", "Uncertain Change")) +
    theme_bw() + fontTheme + 
    theme(legend.position="none",axis.text.x = element_text(size=32, face=1.5, angle=30, vjust=0.9, hjust=1),
           axis.title.x = element_blank(), axis.title.y = element_text(size=32), legend.title = element_blank())

ggsave("log2_TRP_kinitFC_distribution.pdf", height=8.4, width = 6)

# plot log10 kinit foldchnage index of dispersion
ggplot() + 
    geom_violin(data=newSummary.TRP, aes(x=kinitFCcategory, y = IndexDisp, fill = kinitFCcategory), scale="width") + 
    geom_boxplot(data=newSummary.TRP, aes(x=kinitFCcategory, y = IndexDisp),alpha = 0.1, width=0.36, linewidth=0.3, outlier.shape=NA) + 
    
    geom_point(data=subset(newSummary.TRP, gene %in% gene.list.Init), aes(x=kinitFCcategory, y = IndexDisp, fill = kinitFCcategory), size=2) + 
       geom_text(data=subset(newSummary.TRP, gene %in% gene.list.Init), aes(x=kinitFCcategory, y = IndexDisp, label=gene, fontface=2), nudge_x=-0.12, nudge_y=0.3, size=7) + 
       #scale_y_continuous(trans=log2_trans(), breaks=trans_breaks('log2', function(x) 2^x), labels=trans_format('log2', math_format(.x))) +     
       scale_y_continuous(trans=log10_trans(), breaks=trans_breaks('log10', function(x) 10^x), labels=trans_format('log10', math_format(.x))) +
       scale_fill_d3() +
      scale_color_d3() +  
          theme_bw() + fontTheme +
    #ggtitle("Gene Categories - Triptolide 12.5 min") + 
    theme(plot.title = element_text(size=23),
          axis.text.x = element_text(size=31),
          axis.text.y = element_text(size=26),
               axis.title.y = element_text(size=29),
          axis.title.x = element_blank(),
         # axis.text.x = element_text(size=28, angle=30, hjust=0.1, vjust=0.22),
        legend.position = "None") + 
          labs(fill = "Gene Categories") +  
    #ylab("Index of Dispersion - TRP treatment") 
    #ylab("log2 Index of Dispersion - TRP treatment") 
    #ylab(expression(atop(log[10] ~ frac("Variance", "Mean") ~ "Foldchange in Initiation, TRP treatment", "(valid parameter sets, TRP treatment)"))) +
    ylab(expression(log[10] ~ frac("Variance (Foldchange" ~ italic(K[init]) ~")", "Mean (Foldchange" ~ italic(K[init]) ~")"))) +
    scale_x_discrete(labels = c("Decrease\nInitiation", "Uncertain\nChange"))  
    #scale_x_discrete(labels = c("Decrease\nInitiation", "Ambiguous\nChange", "Increase\nInitiation"))  
    #theme(aspect.ratio = 7 / 4)
#ggsave("TRP_kinitFC_indexOfDispersion.pdf")
#ggsave("log2_TRP_kinitFC_indexOfDispersion.pdf")
ggsave("log10_TRP_kinitFC_indexOfDispersion.pdf", width=6, height=8.4) # 6.97 x 8.4

## plot for krel foldchange
df.TRP.krel = as.data.frame(cbind(c("krelFCless1", "krelFCmore1"),
                  c(sum(newSummary.TRP$krelFCcategory == "krelFCless1"), 
                   sum(newSummary.TRP$krelFCcategory == "krelFCmore1"))))   
colnames(df.TRP.krel) = c("typeOfData", "countOfData")
df.TRP.krel$typeOfData = factor(df.TRP.krel$typeOfData, levels = c("krelFCless1", "krelFCmore1"))
df.TRP.krel$countOfData = as.numeric(df.TRP.krel$countOfData)
sum_all = sum(df.TRP.krel$countOfData)
df.TRP.krel$countOfData = df.TRP.krel$countOfData / sum_all 
colorManual.two = c("#C7A76C", "#86B875")
ggplot(df.TRP.krel, aes(x = "genes", y = countOfData, fill=typeOfData)) +
         #geom_bar(stat="identity", position=position_fill(reverse=T)) +
         geom_bar(stat="identity", position=position_stack(reverse=F)) +
     #scale_y_continuous(labels = scales::percent) + 
     ylab("Fraction of Genes") + 
     #xlab("") + 
     #labsForLegend + 
     #scale_fill_brewer(palette = "Dark2",
    scale_fill_manual(values = colorManual.two, 
               labels= c("Decrease Pause Release", "Increase Pause Release")) +
     #scale_fill_frontiers(labels= c("Decrease Pause Release", "Increase Pause Release")) + 
    #labs(fill = "Gene Categories") + 
    theme_bw() + fontTheme + 
     theme(legend.justification = "center", axis.text.x = element_text(size=31),
              axis.text.y = element_text(size=26),
               axis.title.y = element_text(size=29),
              axis.title.x = element_blank(), legend.title = element_blank(), aspect.ratio = 9)

#ggsave("CountFrequency_KinitFC_profile_TRP_treat.pdf")
ggsave("CountFrequency_stacked_KrelFC_TRP_treat.pdf", height = 8.4) # 

# compute how many genes have krel fC > 1 , krelFC < 1 for kinitFC < 1
TRP.kinitFCless1.krelFCless1 = subset(newSummary.TRP, kinitFCcategory == "kinitFC_less_1" & krelFCcategory ==  "krelFCless1")
TRP.kinitFCless1.krelFCmore1 = subset(newSummary.TRP, kinitFCcategory == "kinitFC_less_1" & krelFCcategory ==  "krelFCmore1")
df.TRP.krel.kinit = as.data.frame(cbind(c("krelFCless1", "krelFCmore1"),
                             c(nrow(TRP.kinitFCless1.krelFCless1) , nrow(TRP.kinitFCless1.krelFCmore1))))
colnames(df.TRP.krel.kinit) = c("typeOfData", "countOfData")
df.TRP.krel.kinit$typeOfData = factor(df.TRP.krel.kinit$typeOfData, levels = c("krelFCless1", "krelFCmore1"))
df.TRP.krel.kinit$countOfData = as.numeric(df.TRP.krel.kinit$countOfData)
sum_all = sum(df.TRP.krel.kinit$countOfData)
df.TRP.krel.kinit$countOfData = df.TRP.krel.kinit$countOfData / sum_all
df.TRP.krel.kinit$class = "Decreasing Kinit"
TRP.kinitFCspan1.krelFCless1 = subset(newSummary.TRP, kinitFCcategory == "kinitFC_span_1" & krelFCcategory ==  "krelFCless1")
TRP.kinitFCspan1.krelFCmore1 = subset(newSummary.TRP, kinitFCcategory == "kinitFC_span_1" & krelFCcategory ==  "krelFCmore1")
tmp.df = as.data.frame(cbind(c("krelFCless1", "krelFCmore1"),
                             c(nrow(TRP.kinitFCspan1.krelFCless1) , nrow(TRP.kinitFCspan1.krelFCmore1))))
colnames(tmp.df) = c("typeOfData", "countOfData")
tmp.df$countOfData = as.numeric(tmp.df$countOfData)
sum_all = sum(tmp.df$countOfData)
tmp.df$countOfData = tmp.df$countOfData / sum_all
tmp.df$class = "Uncertain\nChange"
df.TRP.krel.kinit = rbind(df.TRP.krel.kinit, tmp.df)
# for all genes
#df.TRP$class = c("All Genes", "All Genes")
#df.TRP.krel.kinit = rbind(df.TRP.krel.kinit, df.TRP)
#df.TRP.krel.kinit$class = factor(df.TRP.krel.kinit$class, levels = c("All Genes", "Decreasing Kinit", "Kinit span 1"))
#df.TRP.krel.kinit$typeOfData = factor(df.TRP.krel.kinit$typeOfData, levels = c("kinitFCspan1", "kinitFCless1", "krelFCless1", "krelFCmore1"))
colors.New = c("#E7298A", "#1B9E77")
ggplot(df.TRP.krel.kinit, aes(x=class, y = countOfData, fill=typeOfData)) +
         #geom_bar(stat="identity", position=position_fill(reverse=T)) +
         geom_bar(stat="identity", position=position_stack(reverse=F), width=0.6) +
     #scale_y_continuous(labels = scales::percent) + 
     ylab("Fraction of Genes") + 
     #ylab(expression("Fraction of Genes (with" ~ italic(K[init]) ~ " decrease)")) + 
     #xlab(expression("Genes with decreasing" ~ italic(K[init]) )) + 
     #labsForLegend + 
     #scale_fill_brewer(palette = "Dark2",
     scale_fill_manual(values = colors.New,        
              labels= c(expression("Decrease" ~ italic(K[rel])), expression("Increase" ~ italic(K[rel])))) +
    #scale_fill_manual(values = colorManual.two, 
    #          labels= c("Decrease Pause Release", "Increase Pause Release")) +
     #scale_fill_frontiers(labels= c("Decrease Pause Release", "Increase Pause Release")) + 
    #labs(fill = "Gene Categories") + 
    scale_x_discrete(labels = c(expression(atop("Decrease","in" ~ italic(K[init]))), 
                expression(atop("Uncertain",italic(K[init]) ~ "Change")))) + 
    theme_bw() + fontTheme + 
     theme(legend.justification = "center",
               axis.ticks.length.x.bottom = unit(4, "mm"),
               axis.title.x= element_blank(),
               axis.text.x = element_text(size=30, angle=90, face=1.3, vjust=0.65, hjust=0.5),
              axis.text.y = element_text(size=26),
               axis.title.y = element_text(size=32), legend.title = element_blank())
       #    aspect.ratio = 4)
ggsave("CountFrequency_stacked_KinitFC_KrelFC_TRP_treat.pdf", height = 10.4) 

# plot krel FC for the genes selected
gene.list = gene.list.Init
subset.rateData.TRP = subset(rateData.TRP, gene %in% gene.list)
subset.rateData.TRP$gene = factor(subset.rateData.TRP$gene, levels = gene.list)
subset.rateData.TRP$krelFCcategory = NA
for (gene.name in gene.list) {
 subset.rateData.TRP[subset.rateData.TRP$gene == gene.name,]$krelFCcategory = newSummary.TRP[newSummary.TRP$gene == gene.name,]$krelFCcategory
}
subset.rateData.TRP$krelFCcategory = factor(subset.rateData.TRP$krelFCcategory, levels=c("krelFCless1", "krelFCmore1"))

ggplot(subset.rateData.TRP) + geom_violin(aes(x=gene, y=krel_FC, fill=krelFCcategory), scale="width") +
        geom_point(aes(x=gene,y=krel_FC,color=krelFCcategory), position="jitter", alpha=0.3 , size=0.16) +
        scale_y_continuous(trans=log2_trans(),
               breaks=trans_breaks('log2', function(x) 2^x), 
               labels=trans_format('log2', math_format(.x))) + 
    ylab(expression(log[2] ~ "Foldchange in" ~ italic(K[rel])))  + 
    xlab("Genes") + 
    scale_fill_manual(values = c("#E7298A", "#1B9E77"),
              labels = c("decrease Krel", "increase Krel")) +   
    scale_color_manual(values = c("#E7298A", "#1B9E77")) + 
    #scale_color_brewer(palette = "Dark2") + 
    #scale_fill_brewer(palette = "Dark2") +
    #labs(fill = "Pause Release Rate", color = "Pause Release Rate") +
    #scale_color_startrek() + scale_fill_startrek()  + 
    #theme_light() + 
    theme_bw() + 
    fontTheme + 
    theme(legend.position="none",axis.text.x = element_text(size=32, face=1.5, angle=30, vjust=0.9, hjust=1),
               axis.title.x = element_blank(), axis.title.y = element_text(size=32), legend.title = element_blank())
ggsave("log2_krel_FC_TRP_distribution.pdf", height=8.4, width=6)
# index of dispersion - krel - always tight because B_fold_change = Krel_foldchange * P_foldchange
newSummary.TRP$IndexDisp_krel = newSummary.TRP$krel_FC_var / newSummary.TRP$krel_FC_mean

ggplot() + 
    geom_violin(data=subset(newSummary.TRP, IndexDisp_krel > 0), aes(x=krelFCcategory, y = IndexDisp_krel, fill=krelFCcategory), scale="width") +
        geom_boxplot(data=subset(newSummary.TRP, IndexDisp_krel > 0), aes(x=krelFCcategory, y = IndexDisp_krel),alpha = 0.1, width=0.36, linewidth=0.3, outlier.shape=NA) +
    geom_point(data=subset(newSummary.TRP, IndexDisp_krel > 0 & gene %in% gene.list), aes(x=krelFCcategory,y=IndexDisp_krel), size=2) +
        geom_text(data=subset(newSummary.TRP, gene %in% gene.list), aes(x=krelFCcategory,y=IndexDisp_krel,label=gene, fontface=2), nudge_x=-0.12, nudge_y=0.3, size=7) +
    scale_y_continuous(trans=log10_trans(), breaks=trans_breaks('log10', function(x) 10^x), labels=trans_format('log10', math_format(.x)), limits = c(NA,1)) +
    ylab(expression(log[10] ~ frac("Variance (Foldchange" ~ italic(K[rel]) ~")", "Mean (Foldchange" ~ italic(K[rel]) ~")"))) +
    #ylab(expression(atop(log[10] ~ "Index of Dispersion", "(valid parameter sets, TRP treatment)")))  + 
    #ylab("log10 index of dispersion - krel") + 
    xlab("Gene Categories") +
    #labs(fill = "Pause Release Rate") + 
    scale_fill_manual(values = c("#E7298A", "#1B9E77")) + 
    #scale_fill_brewer(palette = "Dark2") +
    #scale_fill_startrek() + 
        theme_bw() + fontTheme + 
       theme(legend.position = "none", 
                 axis.title.y = element_text(size=29),
              axis.title.x = element_blank(),
         axis.text.x = element_text(size=31)) +
       scale_x_discrete(labels = c(expression(atop("Decrease", "in" ~ italic(K[rel]))), 
                   expression(atop("Increase", "in" ~ italic(K[rel])))))  
       #theme(aspect.ratio = 6/4)

ggsave("log10_krel_FC_TRP_IndexOfDispersion.pdf", width=6, height=8.4)
The figures generated are given below.

Figure 3: Frequency bar plot of genes by their Initiation foldchange behavior

Figure 4: Frequency bar plot of genes by their Pause Release foldchange behavior

We select genes with a wide and narrow distribution of foldchange in Kinit.

Figure 5: Initiation Foldchange across all parameter sets

Figure 6: Pause release Foldchange across all parameter sets

Finally, we have figures showing index of dispersion. Notice that Krel FC always have a narrow distribution because of the simple relation of B_fold_change = Krel_foldchange * P_foldchange.

Figure 7: Index of Dispersion for Initation rate foldchnage

Figure 8: Index of Dispersion for Pause Release rate foldchnage

We also constrained premature termination rate and used this limit file:-

[initiation]
lower = 0.00000001

[prematuretermination]
lower = 0.02
upper = 0.023

All valid parameters are stored in InternalControl_TRP12min_repressed_KpreConstrained/paramEstimationResults_combined.txt. The following code plots the results:-

cat plotForConstrainedKpre.R
@
library(ggplot2)
library(dplyr)
library(ggsci)
library(RColorBrewer)
library(scales)
# https://cran.r-project.org/web/packages/ggsci/vignettes/ggsci.html
source("plotting_composites_lattice.R")

fontTheme = theme(
  axis.title.x = element_text(size = 20),
  #axis.text.x = element_text(size = 4),
  #axis.text.x = element_text(size = 11),
  #axis.text.x = element_text(size=28, angle = 30, hjust=1),
  axis.text.x = element_text(size=27),
  axis.text.y = element_text(size = 25),
  axis.title.y = element_text(size = 28),
  legend.title = element_text(size=25),
  legend.text = element_text(size=24),
  panel.border = element_rect(colour="black", linewidth=2),
  panel.grid.major = element_blank(),
 panel.grid.minor = element_blank(),
 axis.ticks.length = unit(2, "mm")
 #axis.line = element_line(colour = "black"),
 #axis.line.x = element_line(colour = "black"),
 #axis.line.y = element_line(colour = "black"),
  #legend.position="top"
 #axis.line.x.top = element_line(colour = "black"),
 #axis.line.y.right = element_line(colour = "black")
)

#scanDataDir = "/Users/rudradeepmukherjee/Documents/UConn Health/comparment.model.project/datasets, genome/Jonkers_2014_Trp_FP_paper/ParamEstimation_FP_TRP/InternalControl_TRP12min_repressed_KPRE_CONSTRAINED_ALLPARAMS/figures"
scanDataDir = "InternalControl_TRP12min_repressed_KpreConstrained/"
rateData.Kpre.constrained = readRDS(sprintf("%s/repeatedParamScanData_TRP_12min_repressed_Steurer_constrained.rds", scanDataDir))
rateData.Kpre.constrained$P_base = rateData.Kpre.constrained$kinit_base / (rateData.Kpre.constrained$kpre + rateData.Kpre.constrained$krel_base)
rateData.Kpre.constrained$P_expr = rateData.Kpre.constrained$kinit_expr / (rateData.Kpre.constrained$kpre + rateData.Kpre.constrained$krel_expr)
rateData.Kpre.constrained$B_base = ( rateData.Kpre.constrained$krel_base * rateData.Kpre.constrained$P_base) / (rateData.Kpre.constrained$kelong)
rateData.Kpre.constrained$B_expr = ( rateData.Kpre.constrained$krel_expr * rateData.Kpre.constrained$P_expr) / (rateData.Kpre.constrained$kelong)
rateData.Kpre.constrained$Release_base = rateData.Kpre.constrained$P_base * rateData.Kpre.constrained$krel_base
rateData.Kpre.constrained$Release_expr = rateData.Kpre.constrained$P_expr * rateData.Kpre.constrained$krel_expr
rateData.Kpre.constrained$Terminal_base = rateData.Kpre.constrained$kpre * rateData.Kpre.constrained$P_base
rateData.Kpre.constrained$Terminal_expr = rateData.Kpre.constrained$kpre * rateData.Kpre.constrained$P_expr

newSummary.TRP.Kpre.constrained = group_by(rateData.Kpre.constrained, gene) %>%  summarise_at(c("Terminal_base", "Terminal_expr", "Release_base", "Release_expr", "kinit_base", "kinit_expr", "krel_base", "krel_expr", "kpre", "kelong", "kinit_FC", "krel_FC", "P_base", "P_expr", "B_base", "B_expr"), c("min", "max", "mean", "var"))

newSummary.TRP.Kpre.constrained$IndexDisp = newSummary.TRP.Kpre.constrained$kinit_FC_var /  newSummary.TRP.Kpre.constrained$kinit_FC_mean
newSummary.TRP.Kpre.constrained$IndexDisp_Krel = newSummary.TRP.Kpre.constrained$krel_FC_var /  newSummary.TRP.Kpre.constrained$krel_FC_mean

newSummary.TRP.Kpre.constrained = newSummary.TRP.Kpre.constrained[order(newSummary.TRP.Kpre.constrained$IndexDisp, decreasing=T),]
newSummary.TRP.Kpre.constrained$kinitFCcategory = "Undefined"
newSummary.TRP.Kpre.constrained[newSummary.TRP.Kpre.constrained$kinit_FC_max < 1, ]$kinitFCcategory = "kinitFC_less_1"
newSummary.TRP.Kpre.constrained[newSummary.TRP.Kpre.constrained$kinit_FC_max >= 1 & newSummary.TRP.Kpre.constrained$kinit_FC_min <= 1, ]$kinitFCcategory = "kinitFC_span_1"

newSummary.TRP.Kpre.constrained.kinitfc.less1 = subset(newSummary.TRP.Kpre.constrained, kinit_FC_max < 1)
newSummary.TRP.Kpre.constrained.kinitfc.span1 = subset(newSummary.TRP.Kpre.constrained, kinit_FC_min <= 1 & kinit_FC_max >= 1)

colNames = c("gene", "kinit_FC_min", "kinit_FC_max", "krel_FC_min","krel_FC_max","IndexDisp")

newSummary.TRP.Kpre.constrained.kinitfc.span1[, colNames]
min(newSummary.TRP.Kpre.constrained.kinitfc.less1$kinit_FC_min)
max(newSummary.TRP.Kpre.constrained.kinitfc.less1$kinit_FC_max)
min(newSummary.TRP.Kpre.constrained.kinitfc.less1$krel_FC_min)
max(newSummary.TRP.Kpre.constrained.kinitfc.less1$krel_FC_max)

gene.list.Init = c("Nup214", "Fgfr2", "Clhc1", "Dmrta2")
colorManual.one = c("#FF7F0E", "#1F77B4", "#E7298A", "#1B9E77") # blue, orange, dark palette

subset.rateData.TRP.kpreConstrain = subset(rateData.Kpre.constrained, gene %in% gene.list.Init)
subset.rateData.TRP.kpreConstrain$kinitFCcategory = NA

for (gene.name in gene.list.Init) {
    subset.rateData.TRP.kpreConstrain[subset.rateData.TRP.kpreConstrain$gene == gene.name, ]$kinitFCcategory = subset(newSummary.TRP.Kpre.constrained, gene == gene.name)$kinitFCcategory 
}
subset.rateData.TRP.kpreConstrain$kinitFCcategory = factor(subset.rateData.TRP.kpreConstrain$kinitFCcategory, levels = c("kinitFC_less_1", "kinitFC_span_1"))

subset.rateData.TRP.kpreConstrain$gene = factor(subset.rateData.TRP.kpreConstrain$gene, levels = gene.list.Init)

# kinit FC changes
colorManual.one = c("#1F77B4", "#FF7F0E", "#E7298A", "#1B9E77") # orange, blue, dark palette
ggplot(subset.rateData.TRP.kpreConstrain) + 
    geom_violin(aes(x=gene, y=kinit_FC, fill=kinitFCcategory), scale="width") +
        geom_point(aes(x=gene,y=kinit_FC), position="jitter", alpha=0.3 , size=0.16) +
    scale_y_continuous(trans=log2_trans(), breaks=trans_breaks('log2', function(x) 2^x), labels=trans_format('log2', math_format(.x)), limits = c(NA,1)) +
        #ylab(expression(atop(log[2] ~ "Foldchange in" ~ K[init] ~ ", TRP treatment")))  + 
        ylab(expression(log[2] ~ "Foldchange in" ~ italic(K[init])))  +
        scale_fill_manual(values = colorManual.one,
                           labels= c("Decrease Initiation Rate", "Uncertain Change")) + 
    theme_bw() + fontTheme +
        theme(legend.position="none",axis.text.x = element_text(size=32, face=1.5, angle=30, vjust=0.9, hjust=1),
               axis.title.x = element_blank(), axis.title.y = element_text(size=32), legend.title = element_blank())

ggsave("log2_TRP_kinitFC_distribution_Kpre_constrained.pdf", height=8.4, width = 6)

# kinit FC index of dispersion

ggplot() +
        geom_violin(data=newSummary.TRP.Kpre.constrained, aes(x=kinitFCcategory, y = IndexDisp, fill = kinitFCcategory), scale="width") +
        geom_boxplot(data=newSummary.TRP.Kpre.constrained, aes(x=kinitFCcategory, y = IndexDisp),alpha = 0.1, width=0.36, linewidth=0.3, outlier.shape=NA) +

        geom_point(data=subset(newSummary.TRP.Kpre.constrained, gene %in% gene.list.Init), aes(x=kinitFCcategory, y = IndexDisp, fill = kinitFCcategory), size=2) +
       geom_text(data=subset(newSummary.TRP.Kpre.constrained, gene %in% gene.list.Init), aes(x=kinitFCcategory, y = IndexDisp, label=gene, fontface=2), nudge_x=-0.12, nudge_y=0.3, size=7) +
       #scale_y_continuous(trans=log2_trans(), breaks=trans_breaks('log2', function(x) 2^x), labels=trans_format('log2', math_format(.x))) +
       scale_y_continuous(trans=log10_trans(), breaks=trans_breaks('log10', function(x) 10^x), labels=trans_format('log10', math_format(.x)), limits = c(NA,1)) +
       scale_fill_d3() +
      scale_color_d3() +
                  theme_bw() + fontTheme +
        #ggtitle("Gene Categories - Triptolide 12.5 min") +
        theme(plot.title = element_text(size=23),
              axis.text.x = element_text(size=31),
              axis.text.y = element_text(size=26),
               axis.title.y = element_text(size=29),
              axis.title.x = element_blank(),
             # axis.text.x = element_text(size=28, angle=30, hjust=0.1, vjust=0.22),
                legend.position = "None") +
                  labs(fill = "Gene Categories") +
        #ylab("Index of Dispersion - TRP treatment")
        #ylab("log2 Index of Dispersion - TRP treatment")
        #ylab(expression(atop(log[10] ~ frac("Variance", "Mean") ~ "Foldchange in Initiation, TRP treatment", "(valid parameter sets, TRP treatment)"))) +
        ylab(expression(log[10] ~ frac("Variance (Foldchange" ~ italic(K[init]) ~")", "Mean (Foldchange" ~ italic(K[init]) ~")"))) +
        scale_x_discrete(labels = c("Decrease\nInitiation", "Uncertain\nChange"))

ggsave("log10_TRP_kinitFC_indexOfDispersion_Kpre_constrained.pdf", width=6, height=8.4) # 6.97 x 8.4
### histograms
colorManual.two = c("#1F77B4", "#FF7F0E", "#E7298A", "#1B9E77") # orange, blue, dark palette
df.TRP.kpreconstrain = as.data.frame(cbind(c("kinitFCless1", "kinitFCspan1"),
                   c(sum(newSummary.TRP.Kpre.constrained$kinitFCcategory == "kinitFC_less_1"),
                     sum(newSummary.TRP.Kpre.constrained$kinitFCcategory == "kinitFC_span_1"))))
                               #sum(newSummary.TRP$kinitFCcategory == "kinitFC_more_1"))))
colnames(df.TRP.kpreconstrain) = c("typeOfData", "countOfData")
df.TRP.kpreconstrain$typeOfData = factor(df.TRP.kpreconstrain$typeOfData, levels = c("kinitFCless1", "kinitFCspan1"))
df.TRP.kpreconstrain$countOfData = as.numeric(df.TRP.kpreconstrain$countOfData)
sum_all = sum(df.TRP.kpreconstrain$countOfData)
df.TRP.kpreconstrain$countOfData = df.TRP.kpreconstrain$countOfData / sum_all

ggplot(df.TRP.kpreconstrain, aes(x = "All Genes", y = countOfData, fill=typeOfData)) +
         #geom_bar(stat="identity", position=position_fill(reverse=T)) +
         geom_bar(stat="identity", position=position_stack(reverse=T)) +
         #scale_y_continuous(labels = scales::percent) +
         ylab("Fraction of Genes") +
         #labs(fill = "") +
         #xlab("") +
         #scale_fill_brewer(palette = "Dark2",
         scale_fill_manual(values = colorManual.two,
                           #labels= c("Decrease Initiation Rate", "Ambiguous Change", "Increase Initiation Rate")) +
                           labels= c("Decrease Initiation Rate", "Uncertain Change")) +
         # scale_color_manual(values = colorManual.one,
        #                  labels= c("Decrease Initiation Rate", "Uncertain Change")) +
                           #  labels= c("Decrease Initiation Rate", "Ambiguous Change", "Increase Initiation Rate")) +
         theme_bw() + fontTheme +
         theme(legend.justification = "center",
               axis.text.x = element_text(size=31),
              axis.text.y = element_text(size=27),
               axis.title.y = element_text(size=32),
              axis.title.x = element_blank(), legend.title = element_blank(), aspect.ratio = 9)

ggsave("CountFrequency_stacked_KinitFC_profile_TRP_treat_Kpre_constrained.pdf", height=8.4, width=8) # 6.53 x 8.4

# distribution of mean Kinit, Krel, kpre, kelong
# first - compare between krel_base/krel_expr and kpre
ggplot(newSummary.TRP.Kpre.constrained) +
    geom_violin(aes(x="release_base_mean",y= Release_base_mean), scale="width") +
        geom_violin(aes(x="release_expr_mean",y= Release_expr_mean), scale="width") +
        geom_violin(aes(x="kinit_base_mean",y= kinit_base_mean), scale="width") +
        geom_violin(aes(x="kinit_expr_mean",y= kinit_expr_mean), scale="width") +
        geom_violin(aes(x="kpre_mean",y= kpre_mean), scale="width") +
        geom_violin(aes(x="krel_base_mean",y= krel_base_mean), scale="width") +
        geom_violin(aes(x="krel_expr_mean",y= krel_expr_mean), scale="width") +
        geom_violin(aes(x="terminal_base_mean",y= Terminal_base_mean), scale="width") +
        geom_violin(aes(x="terminal_expr_mean",y= Terminal_expr_mean), scale="width") +

        geom_point(aes(x="release_base_mean",y= Release_base_mean), position="jitter", alpha=0.02) +
        geom_point(aes(x="release_expr_mean",y= Release_expr_mean), position="jitter", alpha=0.02) +
        geom_point(aes(x="kinit_base_mean",y= kinit_base_mean), position="jitter", alpha=0.02) +
        geom_point(aes(x="kinit_expr_mean",y= kinit_expr_mean), position="jitter", alpha=0.02) +
        geom_point(aes(x="kpre_mean",y= kpre_mean), position="jitter", alpha=0.02) +
        geom_point(aes(x="krel_base_mean",y= krel_base_mean), position="jitter", alpha=0.02) +
        geom_point(aes(x="krel_expr_mean",y= krel_expr_mean), position="jitter", alpha=0.02) +
        geom_point(aes(x="terminal_base_mean",y= Terminal_base_mean), position="jitter", alpha=0.02) +
        geom_point(aes(x="terminal_expr_mean",y= Terminal_expr_mean), position="jitter", alpha=0.02) +

        geom_boxplot(aes(x="release_base_mean",y= Release_base_mean), alpha=0.1, width=0.36, linewidth=0.3) +
        geom_boxplot(aes(x="release_expr_mean",y= Release_expr_mean), alpha=0.1, width=0.36, linewidth=0.3)+
        geom_boxplot(aes(x="kinit_base_mean",y= kinit_base_mean), alpha=0.1, width=0.36, linewidth=0.3) +
        geom_boxplot(aes(x="kinit_expr_mean",y= kinit_expr_mean), alpha=0.1, width=0.36, linewidth=0.3) +
        geom_boxplot(aes(x="kpre_mean",y= kpre_mean), alpha=0.1, width=0.36, linewidth=0.3) +
        geom_boxplot(aes(x="krel_base_mean",y= krel_base_mean), alpha=0.1, width=0.36, linewidth=0.3) +
        geom_boxplot(aes(x="krel_expr_mean",y= krel_expr_mean), alpha=0.1, width=0.36, linewidth=0.3) +
        geom_boxplot(aes(x="terminal_base_mean",y= Terminal_base_mean), alpha=0.1, width=0.36, linewidth=0.3) +
        geom_boxplot(aes(x="terminal_expr_mean",y= Terminal_expr_mean), alpha=0.1, width=0.36, linewidth=0.3)+
                  scale_y_continuous(trans=log10_trans(),
                breaks=trans_breaks('log10', function(x) 10^x),
                labels=trans_format('log10', math_format(.x))) +
         scale_x_discrete(labels = c(expression(atop(italic(K[init]), "(control)")), expression(atop(italic(K[init]), "(triptolide)")),
                                     expression(italic(K[pre])),expression(atop(italic(K[rel]), "(control)")),expression(atop(italic(K[rel]), "(triptolide)")),
                                     expression(atop("Effective" ~ italic(K[rel]), "(control)")),expression(atop("Effective" ~ italic(K[rel]), "(triptolide)")),
                                     expression(atop("Effective" ~ italic(K[pre]), "(control)")), expression(atop("Effective" ~ italic(K[pre]), "(triptolide)")))) + # kinit_base_mean, kinit_expr_mean, kpre_mean, krel_base_mean,krel_expr_mean, release_base_mean,release_expr_mean, terminal_base_mean, terminal_expr_mean
         ylab(expression(log[10] ~ "Mean Raw Values - All Genes")) +
         theme_bw() + fontTheme + theme(axis.title.x = element_blank(),
         axis.text.x = element_text(size=32, face=1, vjust=0.9))

ggsave("TRP_trends_Kpre_Krel_withconstrainedKpre.pdf", width = 26, height=8)
## plot P values trend
ggplot(newSummary.TRP.Kpre.constrained) + 
    geom_violin(aes(x="P_base_mean",y= P_base_mean)) +
    geom_violin(aes(x="P_expr_mean",y= P_expr_mean)) +
    geom_violin(aes(x="B_base_mean",y= B_base_mean)) +
     geom_violin(aes(x="B_expr_mean",y= B_expr_mean)) +

    geom_point(aes(x="P_base_mean",y= P_base_mean), position="jitter", alpha=0.05) +
    geom_point(aes(x="P_expr_mean",y= P_expr_mean), position="jitter", alpha=0.05) + 
     geom_point(aes(x="B_base_mean",y= B_base_mean), position="jitter", alpha=0.05) +
     geom_point(aes(x="B_expr_mean",y= B_expr_mean), position="jitter", alpha=0.05) +

    geom_boxplot(aes(x="P_base_mean",y= P_base_mean), alpha=0.1, width=0.36, linewidth=0.3) +
    geom_boxplot(aes(x="P_expr_mean",y= P_expr_mean), alpha=0.1, width=0.36, linewidth=0.3) +
    geom_boxplot(aes(x="B_base_mean",y= B_base_mean), alpha=0.1, width=0.36, linewidth=0.3) +
    geom_boxplot(aes(x="B_expr_mean",y= B_expr_mean), alpha=0.1, width=0.36, linewidth=0.3) +   
         scale_y_continuous(trans=log10_trans(), 
        breaks=trans_breaks('log10', function(x) 10^x), 
        labels=trans_format('log10', math_format(.x)), limits = c(NA,1)) +
         ylab(expression(log[10] ~ "raw values - all genes")) + 
     theme_bw() + fontTheme + theme(axis.title.x = element_blank(), 
                    axis.text.x = element_text(size=26, angle=30, vjust=1, hjust=0.9))

ggsave("TRP_trends_P_B_Vals_withconstrainedKpre.pdf")

### plot Krel trends
newSummary.TRP.Kpre.constrained$krelFCcategory = "Undefined"
newSummary.TRP.Kpre.constrained[newSummary.TRP.Kpre.constrained$krel_FC_max < 1, ]$krelFCcategory = "krelFC_less_1"
newSummary.TRP.Kpre.constrained[newSummary.TRP.Kpre.constrained$krel_FC_min >= 1, ]$krelFCcategory = "krelFC_more_1"

subset.rateData.TRP.kpreConstrain = subset(rateData.Kpre.constrained, gene %in% gene.list.Init)
subset.rateData.TRP.kpreConstrain$krelFCcategory = NA

for (gene.name in gene.list.Init) {
    subset.rateData.TRP.kpreConstrain[subset.rateData.TRP.kpreConstrain$gene == gene.name, ]$krelFCcategory = subset(newSummary.TRP.Kpre.constrained, gene == gene.name)$krelFCcategory 
}

subset.rateData.TRP.kpreConstrain$krelFCcategory = factor(subset.rateData.TRP.kpreConstrain$krelFCcategory, levels = c("krelFC_less_1", "krelFC_more_1"))

subset.rateData.TRP.kpreConstrain$gene = factor(subset.rateData.TRP.kpreConstrain$gene, levels = gene.list.Init)

# krel FC changes

ggplot(subset.rateData.TRP.kpreConstrain) + 
    geom_violin(aes(x=gene, y=krel_FC, fill=krelFCcategory), scale="width") +
        geom_point(aes(x=gene,y=krel_FC, color=krelFCcategory), position="jitter", alpha=0.3 , size=0.16) +
    scale_y_continuous(trans=log2_trans(), breaks=trans_breaks('log2', function(x) 2^x), labels=trans_format('log2', math_format(.x))) +
        ylab(expression(log[2] ~ "Foldchange in" ~ italic(K[rel])))  +
        scale_fill_manual(values = c("#E7298A", "#1B9E77"),
                          labels = c("decrease Krel", "increase Krel")) +
        scale_color_manual(values = c("#E7298A", "#1B9E77")) +
    theme_bw() + fontTheme +
        theme(legend.position="none",axis.text.x = element_text(size=32, face=1.5, angle=30, vjust=0.9, hjust=1),
               axis.title.x = element_blank(), axis.title.y = element_text(size=32), legend.title = element_blank())

ggsave("log2_TRP_krelFC_distribution_Kpre_constrained.pdf", height=8.4, width = 6)

## krel FC index of dispersion

ggplot() +
        geom_violin(data=newSummary.TRP.Kpre.constrained, aes(x=krelFCcategory, y = IndexDisp_Krel, fill = krelFCcategory), scale="width") +
        geom_boxplot(data=newSummary.TRP.Kpre.constrained, aes(x=krelFCcategory, y = IndexDisp_Krel),alpha = 0.1, width=0.36, linewidth=0.3, outlier.shape=NA) +

        geom_point(data=subset(newSummary.TRP.Kpre.constrained, gene %in% gene.list.Init), aes(x=krelFCcategory, y = IndexDisp_Krel, fill = krelFCcategory), size=2) +
       geom_text(data=subset(newSummary.TRP.Kpre.constrained, gene %in% gene.list.Init), aes(x=krelFCcategory, y = IndexDisp_Krel, label=gene, fontface=2), nudge_x=-0.12, nudge_y=0.3, size=7) +
       #scale_y_continuous(trans=log2_trans(), breaks=trans_breaks('log2', function(x) 2^x), labels=trans_format('log2', math_format(.x))) +
       scale_y_continuous(trans=log10_trans(), breaks=trans_breaks('log10', function(x) 10^x), labels=trans_format('log10', math_format(.x)), limits = c(NA,1)) +
       scale_fill_manual(values = c("#E7298A", "#1B9E77"),
                          labels = c("decrease Krel", "increase Krel")) +
       scale_color_manual(values = c("#E7298A", "#1B9E77")) +
        theme_bw() + fontTheme +
        #ggtitle("Gene Categories - Triptolide 12.5 min") +
        theme(plot.title = element_text(size=23),
              axis.text.x = element_text(size=31),
              axis.text.y = element_text(size=26),
               axis.title.y = element_text(size=29),
              axis.title.x = element_blank(),
             # axis.text.x = element_text(size=28, angle=30, hjust=0.1, vjust=0.22),
                legend.position = "None") +
                  labs(fill = "Gene Categories") +
        #ylab("Index of Dispersion - TRP treatment")
        #ylab("log2 Index of Dispersion - TRP treatment")
        #ylab(expression(atop(log[10] ~ frac("Variance", "Mean") ~ "Foldchange in Initiation, TRP treatment", "(valid parameter sets, TRP treatment)"))) +
        ylab(expression(log[10] ~ frac("Variance (Foldchange" ~ italic(K[rel]) ~")", "Mean (Foldchange" ~ italic(K[rel]) ~")"))) +
        scale_x_discrete(labels = c(expression(atop("Decrease", italic(K[rel]))), expression(atop("Increase", italic(K[rel])))))

ggsave("log10_TRP_krelFC_indexOfDispersion_Kpre_constrained.pdf", width=6, height=8.4) # 6.97 x 8.4

# krel for kinitFC_less_1 and kinitFC_span_1 
TRP.kpreConstrain.kinitFCless1.krelFCless1 = subset(newSummary.TRP.Kpre.constrained, kinit_FC_max < 1 & krel_FC_max < 1)
TRP.kpreConstrain.kinitFCless1.krelFCmore1 = subset(newSummary.TRP.Kpre.constrained, kinit_FC_max < 1 & krel_FC_min >= 1)

df.TRP.kpreConstrain.krel.kinit = as.data.frame(cbind(c("krelFCless1", "krelFCmore1"),
                             c(nrow(TRP.kpreConstrain.kinitFCless1.krelFCless1) ,
                   nrow(TRP.kpreConstrain.kinitFCless1.krelFCmore1))))
colnames(df.TRP.kpreConstrain.krel.kinit) = c("typeOfData", "countOfData")
df.TRP.kpreConstrain.krel.kinit$typeOfData = factor(df.TRP.kpreConstrain.krel.kinit$typeOfData, levels = c("krelFCless1", "krelFCmore1"))
df.TRP.kpreConstrain.krel.kinit$countOfData = as.numeric(df.TRP.kpreConstrain.krel.kinit$countOfData)
sum_all = sum(df.TRP.kpreConstrain.krel.kinit$countOfData)
df.TRP.kpreConstrain.krel.kinit$countOfData = df.TRP.kpreConstrain.krel.kinit$countOfData / sum_all
df.TRP.kpreConstrain.krel.kinit$class = "Decreasing Kinit"
TRP.kpreConstrain.kinitFCspan1.krelFCless1 = subset(newSummary.TRP.Kpre.constrained, kinit_FC_max >= 1 & kinit_FC_min < 1 & krel_FC_max < 1)
TRP.kpreConstrain.kinitFCspan1.krelFCmore1 = subset(newSummary.TRP.Kpre.constrained, kinit_FC_max >= 1 & kinit_FC_min < 1 & krel_FC_min >= 1)

tmp.df = as.data.frame(cbind(c("krelFCless1", "krelFCmore1"),
                             c(nrow(TRP.kpreConstrain.kinitFCspan1.krelFCless1) , 
                   nrow(TRP.kpreConstrain.kinitFCspan1.krelFCmore1))))

colnames(tmp.df) = c("typeOfData", "countOfData")
tmp.df$countOfData = as.numeric(tmp.df$countOfData)
sum_all = sum(tmp.df$countOfData)
tmp.df$countOfData = tmp.df$countOfData / sum_all
tmp.df$class = "Uncertain\nChange"
df.TRP.kpreConstrain.krel.kinit = rbind(df.TRP.kpreConstrain.krel.kinit, tmp.df)
colors.New = c("#E7298A", "#1B9E77")
ggplot(df.TRP.kpreConstrain.krel.kinit, aes(x=class, y = countOfData, fill=typeOfData)) +
         #geom_bar(stat="identity", position=position_fill(reverse=T)) +
         geom_bar(stat="identity", position=position_stack(reverse=F), width=0.6) +
         #scale_y_continuous(labels = scales::percent) +
         ylab("Fraction of Genes") +
         #ylab(expression("Fraction of Genes (with" ~ italic(K[init]) ~ " decrease)")) +
         #xlab(expression("Genes with decreasing" ~ italic(K[init]) )) +
         #labsForLegend +
         #scale_fill_brewer(palette = "Dark2",
        scale_fill_manual(values = colors.New ,
                          labels= c(expression("Decrease" ~ italic(K[rel])), expression("Increase" ~ italic(K[rel])))) +
        #scale_fill_manual(values = colorManual.two,
        #              labels= c("Decrease Pause Release", "Increase Pause Release")) +
         #scale_fill_frontiers(labels= c("Decrease Pause Release", "Increase Pause Release")) +
        #labs(fill = "Gene Categories") +
        scale_x_discrete(labels = c(expression(atop("Decrease","in" ~ italic(K[init]))),
                                expression(atop("Uncertain",italic(K[init]) ~ "Change")))) +
        theme_bw() + fontTheme +
         theme(legend.justification = "center",
               axis.ticks.length.x.bottom = unit(4, "mm"),
           axis.title.x= element_blank(),
               axis.text.x = element_text(size=30, angle=90, face=1.3, vjust=0.65, hjust=0.5),
              axis.text.y = element_text(size=26),
               axis.title.y = element_text(size=32), legend.title = element_blank())
ggsave("CountFrequency_stacked_KrelFC_TRP_treat_Kpre_constrained.pdf", height=10.4)

# plot absolute raets for gene Nup214 (selected). Other genes are Fgfr2 Clhc1 Dmrta2.
subset.rateData.TRP.kpreConstrain$P_base = subset.rateData.TRP.kpreConstrain$kinit_base / (subset.rateData.TRP.kpreConstrain$kpre + subset.rateData.TRP.kpreConstrain$krel_base)
subset.rateData.TRP.kpreConstrain$P_expr = subset.rateData.TRP.kpreConstrain$kinit_expr / (subset.rateData.TRP.kpreConstrain$kpre + subset.rateData.TRP.kpreConstrain$krel_expr)

ggplot(subset(subset.rateData.TRP.kpreConstrain, gene %in% c("Nup214"))) + 
#ggplot(subset(subset.rateData.TRP.kpreConstrain, gene %in% c("Fgfr2"))) + 
    geom_violin(aes(x="Premature\nTermination",y= kpre)) +
    #geom_violin(aes(x="Elongation",y= kelong)) + 
    geom_violin(aes(x="Initiation\n(control)",y= kinit_base)) +
    geom_violin(aes(x="Initiation\n(triptolide)",y= kinit_expr)) +
    geom_violin(aes(x="Pause Release\n(control)",y= krel_base)) +
    geom_violin(aes(x="Pause Release\n(triptolide)",y= krel_expr)) +
    geom_violin(aes(x="release_base",y= Release_base), scale="width") +
        geom_violin(aes(x="release_expr",y= Release_expr), scale="width") +
    geom_violin(aes(x="terminal_base",y= Terminal_base), scale="width") +
        geom_violin(aes(x="terminal_expr",y= Terminal_expr), scale="width") +

        geom_point(aes(x="release_base",y= Release_base), position="jitter", alpha=0.1) +
        geom_point(aes(x="release_expr",y= Release_expr), position="jitter", alpha=0.1) +
    geom_point(aes(x="terminal_base",y= Terminal_base), position="jitter", alpha=0.1) +
        geom_point(aes(x="terminal_expr",y= Terminal_expr), position="jitter", alpha=0.1) +
    geom_point(aes(x="Premature\nTermination",y= kpre), position="jitter", alpha=0.1) +
    geom_point(aes(x="Initiation\n(control)",y= kinit_base), position="jitter", alpha=0.1) + 
    geom_point(aes(x="Initiation\n(triptolide)",y= kinit_expr), position="jitter", alpha=0.1) +
    geom_point(aes(x="Pause Release\n(control)",y= krel_base), position="jitter", alpha=0.1) +
    geom_point(aes(x="Pause Release\n(triptolide)",y= krel_expr), position="jitter", alpha=0.1) +

    geom_boxplot(aes(x="terminal_base",y= Terminal_base), alpha=0.1, width=0.36, linewidth=0.3) +
        geom_boxplot(aes(x="terminal_expr",y= Terminal_expr), alpha=0.1, width=0.36, linewidth=0.3)+
    geom_boxplot(aes(x="release_base",y= Release_base), alpha=0.1, width=0.36, linewidth=0.3) +
        geom_boxplot(aes(x="release_expr",y= Release_expr), alpha=0.1, width=0.36, linewidth=0.3)+
    geom_boxplot(aes(x="Premature\nTermination",y= kpre), alpha=0.1, width=0.36, linewidth=0.3) +
    geom_boxplot(aes(x="Initiation\n(control)",y= kinit_base), alpha=0.1, width=0.36, linewidth=0.3) + 
    geom_boxplot(aes(x="Initiation\n(triptolide)",y= kinit_expr), alpha=0.1, width=0.36, linewidth=0.3) + 
    geom_boxplot(aes(x="Pause Release\n(control)",y= krel_base), alpha=0.1, width=0.36, linewidth=0.3) +
    geom_boxplot(aes(x="Pause Release\n(triptolide)",y= krel_expr), alpha=0.1, width=0.36, linewidth=0.3) +
         scale_y_continuous(trans=log10_trans(), 
        breaks=trans_breaks('log10', function(x) 10^x), 
        labels=trans_format('log10', math_format(.x)), limits=c(NA,1)) +
         ylab(expression(log[10] ~ "Raw values for gene" ~ italic("Nup214"))) + 
         #ylab(expression(log[10] ~ "Raw values for gene" ~ italic("Fgrf2"))) + 
     scale_x_discrete(labels = c(expression(atop(italic(K[init]), "(control)")),
                     expression(atop(italic(K[init]), "(triptolide)")),
                     expression(atop(italic(K[rel]), "(control)")),
                     expression(atop(italic(K[rel]), "(triptolide)")),
                     expression(italic(K[pre])),
                                     expression(atop("Effective" ~ italic(K[rel]), "(control)")),expression(atop("Effective" ~ italic(K[rel]), "(triptolide)")),
                                     expression(atop("Effective" ~ italic(K[pre]), "(control)")), expression(atop("Effective" ~ italic(K[pre]), "(triptolide)")))) + 
     theme_bw() + fontTheme + 
     theme(axis.title.x = element_blank(),
     axis.text.x = element_text(size=30, face=1.3),
         axis.text.y = element_text(size=26))
ggsave("TRP_treatment_values_gene_Nup214_kpre_constrained.pdf", height=8.4, width=26)
#ggsave("TRP_treatment_values_gene_Fgrf2_kpre_constrained.pdf", height=8.4)

# plot pause sums
ggplot(subset(subset.rateData.TRP.kpreConstrain, gene %in% c("Nup214"))) + 
    geom_violin(aes(x="Initiation\n(control)",y= kinit_base)) +
    geom_violin(aes(x="Elongation",y= kelong)) +
    geom_violin(aes(x="Pause Sum\n(control)",y= P_base)) +
    geom_violin(aes(x="Pause Sum\n(triptolide)",y= P_expr)) +
    
    geom_point(aes(x="Initiation\n(control)",y= kinit_base), position="jitter", alpha=0.1) + 
    geom_point(aes(x="Elongation",y= kelong), position="jitter", alpha=0.1) +
    geom_point(aes(x="Pause Sum\n(control)",y= P_base), position="jitter", alpha=0.1) +
    geom_point(aes(x="Pause Sum\n(triptolide)",y= P_expr), position="jitter", alpha=0.1) +

    geom_boxplot(aes(x="Initiation\n(control)",y= kinit_base), alpha=0.1, width=0.36, linewidth=0.3) + 
    geom_boxplot(aes(x="Elongation",y= kelong), alpha=0.1, width=0.36, linewidth=0.3) + 
    geom_boxplot(aes(x="Pause Sum\n(control)",y= P_base), alpha=0.1, width=0.36, linewidth=0.3) +
    geom_boxplot(aes(x="Pause Sum\n(triptolide)",y= P_expr), alpha=0.1, width=0.36, linewidth=0.3) +
         scale_y_continuous(trans=log2_trans(), 
        breaks=trans_breaks('log2', function(x) 2^x), 
        labels=trans_format('log2', math_format(.x))) +
         ylab(expression(log[2] ~ "Values for gene" ~ italic("Nup214"))) + 
     scale_x_discrete(labels = c(expression(italic(K[elong])),
                     expression(atop(italic(K[init]), "(control)")),
                     expression(atop("Pause Sum", "(control)")),
                     expression(atop("Pause Sum", "(triptolide)"))
                     )) + 
     theme_bw() + fontTheme + 
     theme(axis.title.x = element_blank(),
     axis.text.x = element_text(size=30, face=1.3),
         axis.text.y = element_text(size=26))


ggsave("TRP_treatment_P_values_gene_Nup214_kpre_constrained.pdf", height=8.4)