This document describes the steps to run the compartment model (Mukherjee et al.) on nascent transcript datasets to estimate changes in pause release and initiation.
Note: This section assumes that the user already has genes of interest together with their pause sums and body densities. Please refer to the section Preparation of Inputs
for the creation of these input files.
We will install a Python-based package, compartmentModel,
currently hosted on Test PyPI:
python3 -m pip install -i https://test.pypi.org/simple compartmentModel
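Note: Test PyPI does not mirror every dependency. If installation fails while resolving requirements, letting pip fall back to the main index usually helps (a sketch; this assumes the package's dependencies are available on regular PyPI):
python3 -m pip install -i https://test.pypi.org/simple --extra-index-url https://pypi.org/simple compartmentModel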
After successful installation, the help menu can be accessed by running the following on the command line:
polcomp estparam --help
We need two tab-separated files, one for each of the control and treatment conditions, containing 'gene', 'pause_sum' and 'body_density' for all genes of interest. We also need a file containing the pause sums of all expressed genes in the baseline condition, in order to estimate an occupancy value (see Mukherjee et al.). Please refer to the section Preparation of Inputs
for a detailed overview of how these files are created. For this example, we will use two files prepared from Triptolide treatment data (Jonkers et al. 2014):
head TRP125min_repressed_control_varlessthanpoint1.txt
gene pause_sum body_density
Tcf24 9.19386497139931 0.0804360718203523
Cops5 85.9334249198437 0.0384385124459365
Cspp1 756.197381913662 0.0254058947988453
Arfgef1 5.56534001231194 0.0219468799772712
Rnf152 6.98348507285118 0.0068474169675437
Pign 322.371615976095 0.0171639551516065
Relch 33.6940849125385 0.0252412666679433
Kdsr 153.332449406385 0.0335362723951963
Vps4b 185.665255844593 0.0386704052081501
head TRP125min_repressed_treatment_varlessthanpoint1.txt
gene pause_sum body_density
Tcf24 3.97053015232086 0.0243700172313062
Cops5 16.6258801221848 0.0278898475135028
Cspp1 116.273120522499 0.0247544261562293
Arfgef1 1.59940004348755 0.0177578365557946
Rnf152 0.523810029029846 0.0044233950876575
Pign 63.7204707860947 0.0124597407610381
Relch 10.836000084877 0.0200736366477201
Kdsr 27.2139506340027 0.0272897903308197
Vps4b 81.6698702573776 0.0313645508928707
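Before running the model, it is worth checking that both files are tab-separated with the expected header and contain the same genes. A minimal shell check (illustrative):
head -1 TRP125min_repressed_control_varlessthanpoint1.txt # expect: gene, pause_sum, body_density
awk -F'\t' 'NF!=3' TRP125min_repressed_control_varlessthanpoint1.txt | wc -l # rows without exactly 3 fields; expect 0
# genes present in one file but not the other; expect no output
comm -3 <(cut -f1 TRP125min_repressed_control_varlessthanpoint1.txt | sort) \
    <(cut -f1 TRP125min_repressed_treatment_varlessthanpoint1.txt | sort)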
We will specify the inputs to our model as described in the help message:
polcomp estparam --help
By default, the lower and upper limits for the premature termination (Kpre), pause release (Krel) and initiation (Kinit) rates are 0.01 and 1 respectively, whereas the lower and upper limits for the elongation rate (Kelong) are 30 and 60. To demonstrate that these limits can be changed, we will use a config file that sets the lower limit to 1e-8 (and the upper limit to 10) for the initiation, pause release and premature termination rates.
cat TRP_UpperLimitFile
[initiation]
lower = 0.00000001
upper = 10
[pauserelease]
lower = 0.00000001
upper = 10
[prematuretermination]
lower = 0.00000001
upper = 10
Note: In a similar manner, the upper limit can be changed for any parameter by specifying an upper
value for the desired parameter, say the premature termination rate. The following is for demonstration only and is not used in this section:
[prematuretermination]
lower = 0.02
upper = 0.023
We will run the following command:
polcomp estparam --basename baseline --basefile TRP125min_repressed_control_varlessthanpoint1.txt --exprname TRP_12.5min --exprfile TRP125min_repressed_treatment_varlessthanpoint1.txt --allgenesfile TRP125min_control_allgenes_varlessthanpoint1.txt --conf TRP_UpperLimitFile --outputdir paramFit_TRP_EntireRange_percentile
At the end of the run, it should produce output like the following (last few lines shown):
removed 625 genes from run as they have zero pause sum or body density
Genes with pause sum or body avg as zero in either condition have been written to paramFit_TRP_EntireRange_percentile/GeneswithZeroPauseSums_or_BodyAvg.txt
Received the command:
polcomp estparam --basename baseline --basefile TRP125min_repressed_control_varlessthanpoint1.txt --exprname TRP_12.5min --exprfile TRP125min_repressed_treatment_varlessthanpoint1.txt --allgenesfile TRP125min_control_allgenes_varlessthanpoint1.txt --conf TRP_UpperLimitFile --outputdir paramFit_TRP_EntireRange_percentile
writing command information to paramFit_TRP_EntireRange_percentile/commandUsed.txt
> Estimated rates saved as paramFit_TRP_EntireRange_percentile/paramEstimationResults_TRP_12.5min_vs_baseline.txt
The output file paramEstimationResults_TRP_12.5min_vs_baseline.txt
contains all the valid parameter sets for all genes. The column names are self-explanatory and depend on the basename
and exprname
given in the input command:
head paramFit_TRP_EntireRange_percentile/paramEstimationResults_TRP_12.5min_vs_baseline.txt
gene P_normalized_baseline B_normalized_baseline P_normalized_TRP_12.5min B_normalized_TRP_12.5min B_by_P_baseline pauseRelease_kelong30_baseline pauseRelease_kelong45_baseline pauseRelease_kelong60_baseline B_by_P_TRP_12.5min pauseRelease_kelong30_TRP_12.5min pauseRelease_kelong45_TRP_12.5min pauseRelease_kelong60_TRP_12.5min pauseRelease_FC initiationRate_FC_bound1 initiationRate_FC_bound2 initiationRate_baseline_kelong45_termination_point01 initiationRate_baseline_kelong45_termination_point1 initiationRate_baseline_kelong45_termination_1 initiationRate_TRP_12.5min_kelong45_termination_point01 initiationRate_TRP_12.5min_kelong45_termination_point1 initiationRate_TRP_12.5min_kelong45_termination_1
Tcf24 0.07061542377222484 0.0006178062562194436 0.0304964963243157 0.0001871790698991729 0.008748885487287061 0.26246656461861184 0.39369984692791776 0.5249331292372237 0.006137723753857254 0.18413171261571762 0.27619756892357644 0.36826342523143524 0.7015435009151662 0.43186735553247346 0.30297373653122617 0.02850743576759721 0.03486282390709745 0.0984167053020998 0.008728023108705936 0.01147270777789435 0.03891955446977848
Cops5 0.660029839005765 0.00029523512189788217 0.1276985874892177 0.0002142138705795001 0.0004473057193029473 0.013419171579088418 0.02012875736863263 0.026838343158176837 0.0016774960067399915 0.050324880202199745 0.07548732030329962 0.10064976040439949 3.750222575633717 0.19347396124026195 0.7255704172405131 0.019885878875462348 0.0792885643859812 0.6733154194911697 0.010916610050969682 0.022409482924999273 0.1373382116652952
Cspp1 5.808133874643234 0.0001951353465723098 0.8930602858055213 0.00019013160392264555 3.359690922831837e-05 0.001007907276849551 0.0015118609152743267 0.002015814553699102 0.00021289895760077485 0.006386968728023246 0.009580453092034869 0.012773937456046491 6.336861410493238 0.15376027913274945 0.9743575793029887 0.06686242934218628 0.5895944780600774 5.816914965238988 0.017486525034574263 0.09786195075707119 0.9016162079820403
Arfgef1 0.04274577064471635 0.00016856765189550999 0.012284530195248049 0.00013639281820717074 0.003943493107109206 0.1183047932132762 0.1774571898199143 0.2366095864265524 0.011102811099762753 0.3330843329928826 0.49962649948932386 0.6661686659857652 2.8154762283587997 0.2873858632085861 0.809128066230148 0.008013002041745113 0.011860121399769585 0.0503313149800143 0.006260522121275164 0.007366129838847489 0.018422207014570733
Rnf152 0.05363813362067858 5.2593033768977255e-05 0.004023233677147883 3.3974821209427734e-05 0.0009805157304858493 0.029415471914575476 0.04412320787186322 0.05883094382915095 0.008444655204197057 0.2533396561259117 0.3800094841888676 0.5066793122518234 8.612462749590666 0.07500696623039929 0.6459947026191188 0.002903067855810762 0.0077304998816718346 0.05600482014028256 0.0015690992911957268 0.0019311903221390363 0.005552100631572131
Pign 2.47604335555345 0.00013183138651792507 0.48941854829535325 9.569967327884901e-05 5.3242761772423756e-05 0.0015972828531727126 0.002395924279759069 0.003194565706345425 0.00019553748751895765 0.00586612462556873 0.008799186938353094 0.01173224925113746 3.6725647019353755 0.1976615422333578 0.7259248029363383 0.03069284594884113 0.2535367479486516 2.4819757679467567 0.009200670780501737 0.053248340127083536 0.49372503359290143
Relch 0.25879454311303635 0.00019387088540558193 0.08322818970801042 0.00015417981044294933 0.0007491304997142191 0.022473914991426574 0.03371087248713986 0.04494782998285315 0.0018524950618757727 0.055574851856273186 0.08336227778440977 0.11114970371254637 2.4728602861350177 0.32159947697064845 0.7952705746425097 0.01131213527438155 0.03460364415455482 0.26751873295628753 0.007770373367012824 0.015260910440733763 0.09016628117794315
Kdsr 1.177702296754216 0.00025758243070728 0.209022501691584 0.00020960500451777837 0.00021871608081022276 0.006561482424306682 0.009842223636460024 0.013122964848613364 0.0010027867948258213 0.03008360384477464 0.04512540576716196 0.06016720768954928 4.584879132394143 0.1774833098888034 0.8137395238574177 0.023368232349369758 0.1293614390572492 1.1892935061360435 0.011522450220215866 0.030334475372458432 0.21845472689488404
Vps4b 1.4260412527299717 0.0002970162232871636 0.6272827059770899 0.00024090206454146785 0.00020828024625414195 0.006248407387624258 0.009372611081436388 0.012496814775248517 0.00038404066020953915 0.011521219806286174 0.01728182970942926 0.02304243961257235 1.8438650189655321 0.43987697044264196 0.8110737584477228 0.027626142575222078 0.15596985532091953 1.4394069827778941 0.017113419964136952 0.07356886350207505 0.6381232988814559
One way to display the parameter results is described in the section Plotting Rates.
The user may also apply custom functions to display or summarise the parameter sets in paramEstimationResults_TRP_12.5min_vs_baseline.txt.
_normalized_
in the column names refers to pause and body densities transformed to occupancy values (as described in Mukherjee et al.). Please refer to the section Saturation values,
where running polcomp
with --norm saturation
is discussed.
We will be using the Triptolide (TRP) treatment dataset from Jonkers et al. 2014.
We will process the raw GRO-seq files from the control
and TRP 12.5 min
conditions. Mouse V6.5 cells were processed in two replicates. The accession numbers are: control rep1
- SRR935093, SRR935094; control rep2
- SRR935095, SRR935096; and TRP 12.5 min
- SRR935119 (rep1), SRR935120 (rep2). As Arabidopsis thaliana RNAs were used as spike-ins, we also remove reads aligning to the Arabidopsis genome. We are using the Mus_musculus.GRCm39.105
genome for mouse and GCF_000001735.4
for Arabidopsis. Processing scripts for all other datasets used in Mukherjee et al. are present in the processingscripts/
folder on GitHub: TBD.
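The raw files can be fetched from SRA using the accession numbers above (a sketch, assuming sra-tools is installed; runs belonging to the same replicate, e.g. SRR935093 and SRR935094 for control rep1, can afterwards be concatenated and renamed to the V65_* names used below):
for acc in SRR935093 SRR935094 SRR935095 SRR935096 SRR935119 SRR935120; do
    prefetch ${acc}
    fasterq-dump ${acc} # writes ${acc}.fastq for single-end runs
    gzip ${acc}.fastq
done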
We follow almost the same workflow as detailed in Nascent RNA Methods (Smith et al. 2021, PEPPRO) and Scott et al. 2022, with the following changes:
We use the following file as our template and replace the string placeholder
within it with the names of our replicates:
cat processPRO_seqdata_placeholder_spikeIn.sh
#!/bin/bash
#SBATCH --job-name=processPRO_seqdata
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 8
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=END
#SBATCH --mem=50G
#SBATCH --mail-user=rmukherjee@uchc.edu
#SBATCH -o processPRO_seqdata_placeholder_%j.out
#SBATCH -e processPRO_seqdata_placeholder_%j.err
set -e
# add path to the R files
export PATH=$PATH:/home/FCAM/rmukherjee/Spring2022Rotation/R_files
module load cutadapt;
module load seqtk;
module load bowtie2;
module load bedtools;
module load samtools;
module load bamtools;
module load genometools
seqBiasDir=/home/FCAM/rmukherjee/seqOutBias/target/release
filesDir=/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/fastqfiles
UMI_length=0
mouseGenomeIndex=/home/FCAM/rmukherjee/Spring2022Rotation/mm39_bt/mm39
mouserDNAIndex=/home/FCAM/rmukherjee/Spring2022Rotation/mouse_rDNA_bt2/mouse_rDNA
name=placeholder
annotation_prefix=/home/FCAM/rmukherjee/Spring2022Rotation/genome/annotations/Mus_musculus.GRCm39.105
aThalianaGenomeIndex=/home/FCAM/rmukherjee/arabidopsisThalianaGenome/AThaliana_GCF1735
cores=8
read_size=50
parentMouseGenomeDir=/home/FCAM/rmukherjee/Spring2022Rotation/genome
tallymer=${parentMouseGenomeDir}/mm39.tal_${read_size}.gtTxt.gz
table=${parentMouseGenomeDir}/mm39_${read_size}.4.2.2.tbl
aThalianaGenomeDir=/home/FCAM/rmukherjee/arabidopsisThalianaGenome
aThalianaTallymer=${aThalianaGenomeDir}/GCF_000001735.4.tal_${read_size}.gtTxt.gz
aThalianaTable=${aThalianaGenomeDir}/GCF_000001735.4_${read_size}.4.2.2.tbl
cd ${filesDir}
gunzip ${name}.fastq.gz
cutadapt --cores=${cores} -m $((UMI_length+2)) -O 1 -a TGGAATTCTCGGGTGCCAAGG ${name}.fastq \
-o ${name}_noadap.fastq --too-short-output ${name}_short.fastq > ${name}_cutadapt.txt
## initialize the metrics file
#echo -e "value\texperiment\tthreshold\tmetric" > ${name}_QC_metrics.txt
#PE1_total=$(wc -l ${name}.fastq | awk '{print $1/4}')
#PE1_w_Adapter=$(wc -l ${name}_short.fastq | awk '{print $1/4}')
#AAligation=$(echo "scale=2 ; $PE1_w_Adapter / $PE1_total" | bc)
#echo -e "$AAligation\t$name\t0.80\tAdapter/Adapter" >> ${name}_QC_metrics.txt
seqtk seq -L $((UMI_length+10)) ${name}_noadap.fastq > ${name}_noadap_trimmed.fastq
#
########################
####### since UMI length is zero
##### - no fqdedup as UMI is not present
##### - no reverse complementing for GRO-seq, as GRO libraries have opposite adapters on the 3' and 5' ends.
mv ${name}_noadap_trimmed.fastq ${name}_processed_UMI_${UMI_length}.fastq
#########################
## bowtie log showed highest alignment with both b and e removed
#seqtk trimfq -b ${UMI_length} -e ${UMI_length} ${name}_dedup.fastq | seqtk seq -r - > ${name}_processed_UMI_${UMI_length}.fastq
##bowtie2 -p $((cores)) -x ${aThalianaGenomeIndex} -U ${name}_processed_UMI_${UMI_length}.fastq 2>${name}_bowtie2_${UMI_length}_aThaliana.log | samtools view -b - | \
# samtools sort - -o ${name}_${UMI_length}_aThaliana.bam
##parse bowtie log for alignment
#filename=${name}_bowtie2_${UMI_length}_aThaliana.log # logfile
#totalReads=$(wc -l ${name}_processed_UMI_${UMI_length}.fastq | awk '{ print $1/4 }')
#alignmentRate=$(grep "overall alignment rate" $filename | awk -F"%" '{ print $1 }')
#alignmentFraction=$(echo "scale=2 ; ${alignmentRate} / 100" | bc)
#spikeInReads=$(echo "scale=0; ${alignmentFraction} * ${totalReads}" | bc)
#mousealignedPercent=$(grep "overall alignment rate" ${name}_bowtie2_${UMI_length}_human.log | awk -F"%" '{ print $1 }')
#mouseReads=$(echo "scale=0; ${mousealignedPercent} * ${totalReads} * 0.01" | bc)
#echo -e "${alignmentFraction}\t$name\t0.30\tThaliana Alignment Rate" > ${name}_metrics_spikeIn.txt
#echo -e "spikeInReads\tmouseReads\ttotalReads" > ${name}_metrics_spikeIn.txt
#echo -e "${name}\t${spikeInReads}\t${mouseReads}\t${totalReads}" >> ${name}_metrics_spikeIn.txt
#
#bowtie2 -p $((cores)) -x ${mouseGenomeIndex} -U ${name}_processed_UMI_${UMI_length}.fastq 2>${name}_bowtie2_${UMI_length}_mouse.log | samtools view -b - | \
# samtools sort - -o ${name}_${UMI_length}_mouse.bam
##parse bowtie log for alignment
#filename=${name}_bowtie2_${UMI_length}_mouse.log # logfile
#alignmentRate=$(grep "overall alignment rate" $filename | awk -F"%" '{ print $1 }')
#alignmentFraction=$(echo "scale=2 ; ${alignmentRate} / 100" | bc)
#echo -e "${alignmentFraction}\t$name\t0.80\tAlignment Rate" >> ${name}_QC_metrics.txt
# thaliana reads - collect all reads not aligning to mouse genome
bowtie2 -p $((cores)) --maxins 1000 -x ${mouseGenomeIndex} -U ${name}_processed_UMI_${UMI_length}.fastq 2>${name}_bowtie2_non_mouseDNA.log | samtools sort -n - | samtools fastq -f 0x4 - > ${name}.non.mouse.fastq
# align to ribosomal reads and remove them
bowtie2 -p $((cores)) -x ${mouserDNAIndex} -U ${name}_processed_UMI_${UMI_length}.fastq 2>${name}_bowtie2_rDNA.log | \
samtools sort -n - | samtools fastq -f 0x4 - > ${name}.non.rDNA.fastq
reads=$(wc -l ${name}.non.rDNA.fastq | awk '{print $1/4}')
filename=${name}_bowtie2_rDNA.log # logfile
rDNA_alignment=$(grep "overall alignment rate" $filename | awk -F"%" '{ print $1 }')
rDNA_alignmentFraction=$(echo "scale=2 ; ${rDNA_alignment} / 100" | bc)
echo -e "$rDNA_alignmentFraction\t$name\t0.10\trDNA Alignment Rate" >> ${name}_QC_metrics.txt
bowtie2 -p $((cores)) --maxins 1000 -x ${mouseGenomeIndex} -U ${name}.non.rDNA.fastq 2>${name}_bowtie2_non_rDNA.log | samtools view -b - | samtools sort - -o ${name}.nonrDNA.bam
# parse bowtie log to find number of non-rDNA reads mapping to mouse genome.
filename=${name}_bowtie2_non_rDNA.log # logfile
non_rDNA_reads=$(wc -l ${name}.non.rDNA.fastq | awk '{ print $1/4 }')
non_rDNA_alignment=$(grep "overall alignment rate" $filename | awk -F"%" '{ print $1 }')
nonrDNAalignmentFraction=$(echo "scale=2 ; ${non_rDNA_alignment} / 100" | bc)
non_rDNA_ReadCount=$(echo "scale=0 ; ( ${non_rDNA_alignment} * ${non_rDNA_reads} ) / 100" | bc)
echo -e "${nonrDNAalignmentFraction}\t$name\t0.80\tAlignment Rate" >> ${name}_QC_metrics.txt
#
### complexity and theoretical read depth
#fqComplexity -i ${name}_noadap_trimmed.fastq
#PE1_total=$(wc -l ${name}.fastq | awk '{print $1/4}')
#PE1_noadap_trimmed=$(wc -l ${name}_noadap_trimmed.fastq | awk '{print $1/4}')
#factorX=$(echo "scale=2 ; $PE1_noadap_trimmed / $PE1_total" | bc)
#echo fraction of reads that are not adapter/adapter ligation products or below 10 base inserts
#echo $factorX
###calculate PE1 deduplicated reads
#PE1_dedup=$(wc -l ${name}_dedup.fastq | awk '{print $1/4}')
###divide
#factorY=$(echo "scale=2 ; $non_rDNA_ReadCount / $PE1_dedup" | bc)
##re-run with factors
#fqComplexity -i ${name}_noadap_trimmed.fastq -x $factorX -y $factorY
#
bowtie2 -p $((cores)) -x ${aThalianaGenomeIndex} -U ${name}.non.mouse.fastq 2>${name}_bowtie2_${UMI_length}_aThaliana.log | samtools view -b - | \
samtools sort - -o ${name}_${UMI_length}_aThaliana.bam
##parse bowtie log for alignment
filename=${name}_bowtie2_${UMI_length}_aThaliana.log # logfile
totalReads=$(wc -l ${name}_processed_UMI_${UMI_length}.fastq | awk '{ print $1/4 }')
alignmentRate=$(grep "overall alignment rate" $filename | awk -F"%" '{ print $1 }')
alignmentFraction=$(echo "scale=2 ; ${alignmentRate} / 100" | bc)
spikeInReads=$(echo "scale=0; ${alignmentFraction} * ${totalReads}" | bc)
mousealignedPercent=$(grep "overall alignment rate" ${name}_bowtie2_non_mouseDNA.log | awk -F"%" '{ print $1 }') # mouse alignment rate of processed reads, logged by the earlier bowtie2 run against the mouse genome
mouseReads=$(echo "scale=0; ${mousealignedPercent} * ${totalReads} * 0.01" | bc)
#echo -e "${alignmentFraction}\t$name\t0.30\tThaliana Alignment Rate" > ${name}_metrics_spikeIn.txt
echo -e "exprName\tspikeInReads\tmouseReads\ttotalReads" > ${name}_metrics_spikeIn.txt
echo -e "${name}\t${spikeInReads}\t${mouseReads}\t${totalReads}" >> ${name}_metrics_spikeIn.txt
## convert to bigWig and BED6 using seqOutBias
${seqBiasDir}/seqOutBias scale $table ${name}.nonrDNA.bam --no-scale --stranded --bed-stranded-positive \
--bw=${name}.bigWig --bed=${name}.bed \
--tail-edge --read-size=${read_size} --tallymer=$tallymer
grep -v "random" ${name}_not_scaled.bed | grep -v "chrUn" | grep -v "chrEBV" | sort -k1,1 -k2,2n > ${name}_tmp.txt
mv ${name}_tmp.txt ${name}_not_scaled.bed
# run seqoutBias on aThaliana bam
${seqBiasDir}/seqOutBias scale ${aThalianaTable} ${name}_${UMI_length}_aThaliana.bam --no-scale --stranded --bed-stranded-positive \
--bw=${name}_aThaliana.bigWig --bed=${name}_aThaliana.bed \
--tail-edge --read-size=${read_size} --tallymer=${aThalianaTallymer}
sort -k1,1 -k2,2n ${name}_aThaliana_not_scaled.bed > ${name}_pause_counts_tmp.bed
mv ${name}_pause_counts_tmp.bed ${name}_aThaliana_not_scaled.bed
##count reads in pause region
mapBed -null "0" -s -a $annotation_prefix.pause.bed -b ${name}_not_scaled.bed | \
awk '$7>0' | sort -k5,5 -k7,7nr | sort -k5,5 -u > ${name}_pause.bed
#discard anything with chr and strand inconsistencies
join -1 5 -2 5 ${name}_pause.bed $annotation_prefix.bed | \
awk '{OFS="\t";} $2==$8 && $6==$12 {print $2, $3, $4, $1, $6, $7, $9, $10}' | \
awk '{OFS="\t";} $5 == "+" {print $1,$2+480,$8,$4,$6,$5} $5 == "-" {print $1,$7,$2 - 380,$4,$6,$5}' | \
awk '{OFS="\t";} $3>$2 {print $1,$2,$3,$4,$5,$6}' > ${name}_pause_counts_body_coordinates.bed
sort -k1,1 -k2,2n ${name}_pause_counts_body_coordinates.bed > ${name}_pause_counts_tmp.bed
mv ${name}_pause_counts_tmp.bed ${name}_pause_counts_body_coordinates.bed
#column ten is Pause index
mapBed -null "0" -s -a ${name}_pause_counts_body_coordinates.bed \
-b ${name}_not_scaled.bed | awk '$7>0' | \
awk '{OFS="\t";} {print $1,$2,$3,$4,$5,$6,$7,$5/100,$7/($3 - $2)}' | \
awk '{OFS="\t";} {print $1,$2,$3,$4,$5,$6,$7,$8,$9,$8/$9}' > ${name}_pause_body.bed
pause_index.R ${name}_pause_body.bed
#
mapBed -null "0" -s -a $annotation_prefix.introns.bed \
-b ${name}_not_scaled.bed | awk '$7>0' | \
awk '{OFS="\t";} {print $1,$2,$3,$5,$5,$6,$7,($3 - $2)}' > ${name}_intron_counts.bed
mapBed -null "0" -s -a $annotation_prefix.no.first.exons.named.bed \
-b ${name}_not_scaled.bed | awk '$7>0' | \
awk '{OFS="\t";} {print $1,$2,$3,$4,$4,$6,$7,($3 - $2)}' > ${name}_exon_counts.bed
exon_intron_ratio.R ${name}_exon_counts.bed ${name}_intron_counts.bed
gzip ${name}.fastq
#cat *_QC_metrics.txt | awk '!x[$0]++' > project_QC_metrics.txt
#plot_all_metrics.R project_QC_metrics.txt Estrogen_treatment_PRO
We can replace the placeholder
and submit jobs in the following manner (example submission for a Slurm environment on the UConn HPC cluster):
for filename in V65_cntrl_rep1 V65_cntrl_rep2 V65_TRP_treat_12min_rep1 V65_TRP_treat_12min_rep2; do
sed "s/placeholder/${filename}/g" processPRO_seqdata_placeholder_spikeIn.sh > processPRO_seqdata_${filename}_spikeIn.sh
echo ${filename}_spikeIn
sbatch processPRO_seqdata_${filename}_spikeIn.sh
sleep 1
done
We will eventually normalize the data by DESeq2 size factors. For running DESeq2, we will use the following criteria (see the sketch after this list):
* consider genes longer than 50 kbp and treat the regions from [TSS + 50 kbp] to the TTS as unchanging between control and 12.5 min TRP treatment (TSS = transcription start site, TTS = transcription termination site);
* for differential expression analysis by DESeq2, only consider the regions from TSS + 500 bp to TSS + 25 kbp.
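As a quick sketch of the length cutoff (illustrative; it assumes a six-column BED of gene coordinates such as the mastercoords.bed produced later, and the R script further below does the actual region bookkeeping):
awk -F'\t' '($3 - $2) >= 50000' mastercoords.bed | wc -l # number of genes at least 50 kbp long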
We create normalized files in the same manner as described in PRO_normalization
of Nascent RNA Methods, using normalize_bedGraph_v2.py
for increased execution speed.
cat normalization_factor.R
#!/usr/bin/env Rscript
# modified from normalization_factor.R because of different file names
# assumes the naming convention CELL_COND_TIME_ETC_rep<#>_<plus><minus>.bigWig
# assumes that you are running the script in the directory with your bigWigs
if(!require(devtools)) { install.packages("devtools", repos='http://cran.us.r-project.org') }
if(!require(bigWig)) {devtools::install_github("andrelmartins/bigWig", subdir="bigWig")}
library(bigWig)
coverage = c()
file.name = c()
# commented out from the original code as it seems not to serve any purpose
#for (plus.bigWig in Sys.glob(file.path("*_rep1_plus_PE1.bigWig"))) { # name changed
# file.prefix = strsplit(plus.bigWig, '_rep1')[[1]][1]
# #print(file.prefix)
# minus.bigWig = paste0(strsplit(plus.bigWig, 'plus')[[1]][1], "plus.bigWig") # should actually be minus.bigWig
# #print(minus.bigWig)
# }
# these bigWigs were created using seqOutBias
for (replicate.plus.bigWig in Sys.glob(file.path("*_plus.bigWig"))) { # name changed
#print(replicate.plus.bigWig)
replicate.prefix = strsplit(replicate.plus.bigWig, '_plus.bigWig')[[1]][1] # name changed
replicate.minus.bigWig = paste0(replicate.prefix, "_minus.bigWig") # prefix already has '_plus.bigWig' stripped
#print(replicate.minus.bigWig)
bigwig.plus = load.bigWig(replicate.plus.bigWig)
bigwig.minus = load.bigWig(replicate.minus.bigWig)
coverage.bigwig.plus = bigwig.plus$basesCovered*bigwig.plus$mean # total signal = bases covered * mean per-base coverage
coverage.bigwig.minus = abs(bigwig.minus$basesCovered*bigwig.minus$mean) # abs used because of negative values on the minus strand
coverage = append(coverage, coverage.bigwig.plus)
coverage = append(coverage, coverage.bigwig.minus)
file.name = append(file.name, replicate.plus.bigWig)
file.name = append(file.name, replicate.minus.bigWig)
#print(coverage)
unload.bigWig(bigwig.plus)
unload.bigWig(bigwig.minus)
}
# NOTE: change the following prefix to `basename` in PRO_normalization (line 34)
#celltype.prefix = strsplit(file.prefix, '_')[[1]][1]
celltype.prefix = "V65"
write.table(data.frame(file.name, coverage), file = paste0(celltype.prefix, '_normalization.txt'), sep = '\t', row.names = FALSE, col.names=FALSE, quote =FALSE)
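normalization_factor.R assumes that every *_plus.bigWig in the directory has a matching *_minus.bigWig; a quick shell check (illustrative):
for f in *_plus.bigWig; do
    m=${f%_plus.bigWig}_minus.bigWig
    [ -f "$m" ] || echo "missing: $m"
done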
cat PRO_normalization
#! /bin/sh
# modified from original code as file names are different: https://github.com/guertinlab/Nascent_RNA_Methods/blob/main/PRO_normalization
# put this in your path: normalize_bedGraph.py
# put this executable in your path: normalization_factor.R
#normalization_factor_v2.R and normalize_bedGraph_v2.py are present in the same directory and have been modified accordingly.
set -e
module load R/4.1.2
export PATH=${PATH}:/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014 # for normalize_bedGraph_v2.py
export PATH=${PATH}:/home/FCAM/rmukherjee/Summer2022Rotation/R_files # for bigWigToBedGraph
while getopts "c:" OPTION
do
case $OPTION in
c)
chrSizes=$OPTARG
;;
esac
done
bigWigFilesDir=/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/fastqfiles
cd ${bigWigFilesDir}
if [ -n "$chrSizes" ]; then
Rscript /home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/normalization_factor.R # modified version of normalization_factor.R from Guertin lab's GitHub
#for i in V65_cntrl_rep1_plus.bigWig V65_FP_treat_25min_rep1_plus.bigWig V65_TRP_treat_12min_rep1_plus.bigWig; # name modified
for i in V65_cntrl_rep1_plus.bigWig V65_TRP_treat_12min_rep1_plus.bigWig; # name modified
do
name=$(echo $i | awk -F"_rep1_plus.bigWig" '{print $1}') # name modified
printf "%s - processing individual replicates\n" ${name}
basename=$(echo $i | awk -F"_" '{print $1}')
# count the number of replicates
reps=$(ls ${name}_rep?_plus.bigWig | wc -w | bc) # modified name
for j in ${name}_rep?_plus.bigWig;
do
repNum=$(echo $j | awk -F"rep" '{print $NF}' | awk -F"_plus" '{print $1}') # extract the replicate number
invscalePlus=$(grep "${name}_rep${repNum}_plus.bigWig" ${basename}_normalization.txt | awk -F" " '{print $2}' | bc)
invscaleMinus=$(grep "${name}_rep${repNum}_minus.bigWig" ${basename}_normalization.txt | awk -F" " '{print $2}' | bc)
echo $invscalePlus
echo $invscaleMinus
# scale to 10 million
invScale=$(bc <<< "$invscalePlus + $invscaleMinus") # bc rather than expr, since the coverage values may be fractional
scale=$(bc <<< "scale=3 ; 10000000 / $invScale") # normalised to 1e7 reads
bigWigToBedGraph ${name}_rep${repNum}_plus.bigWig ${name}_rep${repNum}_plus.bedGraph
bigWigToBedGraph ${name}_rep${repNum}_minus.bigWig ${name}_rep${repNum}_minus.bedGraph
normalize_bedGraph_v2.py -i ${name}_rep${repNum}_plus.bedGraph -s $scale -o ${name}_rep${repNum}_plus_normalized.bedGraph
normalize_bedGraph_v2.py -i ${name}_rep${repNum}_minus.bedGraph -s $scale -o ${name}_rep${repNum}_minus_normalized.bedGraph
bedGraphToBigWig ${name}_rep${repNum}_plus_normalized.bedGraph $chrSizes ${name}_rep${repNum}_plus_normalized.bigWig
bedGraphToBigWig ${name}_rep${repNum}_minus_normalized.bedGraph $chrSizes ${name}_rep${repNum}_minus_normalized.bigWig
done
printf "%s - creating and merging normalized replicate bigWigs\n" ${name}
plusfiles=$(ls ${name}_rep*_plus_normalized.bigWig)
bigWigMerge $plusfiles tmpPlus.bg
minusfiles=$(ls ${name}_rep*_minus_normalized.bigWig)
bigWigMerge -threshold=-10000000000 $minusfiles tmpMinus.bg # for bigWigMerge utility: -threshold=0.N - don't output values at or below this threshold. Default is 0.0
scaleall=$(bc <<< "scale=4 ; 1.0 / $reps")
normalize_bedGraph_v2.py -i tmpPlus.bg -s $scaleall -o ${name}_plus_normalized.bg # scaleall=1/rep#
normalize_bedGraph_v2.py -i tmpMinus.bg -s $scaleall -o ${name}_minus_normalized.bg
sort -k1,1 -k2,2n ${name}_plus_normalized.bg > ${name}_plus_normalized_sorted.bg
sort -k1,1 -k2,2n ${name}_minus_normalized.bg > ${name}_minus_normalized_sorted.bg
bedGraphToBigWig ${name}_plus_normalized_sorted.bg $chrSizes ${name}_plus_normalized.bigWig
bedGraphToBigWig ${name}_minus_normalized_sorted.bg $chrSizes ${name}_minus_normalized.bigWig
rm ${name}_plus_normalized.bg
rm ${name}_minus_normalized.bg
rm ${name}_plus_normalized_sorted.bg
rm ${name}_minus_normalized_sorted.bg
rm tmpPlus.bg
rm tmpMinus.bg
done
fi
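With the mouse chromosome sizes file used elsewhere in this workflow, the script can be invoked as follows (illustrative path):
sh PRO_normalization -c /home/FCAM/rmukherjee/Spring2022Rotation/genome/mm39.chrom.sizes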
We will now use the normalized files from the control
condition - V65_cntrl_plus_normalized.bigWig
and V65_cntrl_minus_normalized.bigWig
- for estimating TSSs in the next section.
We will use primaryTranscriptAnnotation
(Anderson et al. 2020, GitHub) to estimate the TSS and TTS. Comments explain why particular thresholds were used:
cat createPrimaryTranscAnnot_Mouse_cntThresh_point9.R
library(bigWig)
library(NMF)
library(dplyr)
library(pracma)
library(RColorBrewer)
library(dplyr)
# source primaryTranscriptAnnotation package files here
source("/home/FCAM/rmukherjee/Summer2022Rotation/genomicsRpackage-master/primaryTranscriptAnnotation/R/gene_ann.R")
source("/home/FCAM/rmukherjee/Summer2022Rotation/genomicsRpackage-master/primaryTranscriptAnnotation/R/map_tu.R")
# modified from get.TSS()
get.counts.around.potential.TSS <- function(bed.in=NULL, bed.out=NULL, bw.plus=NULL, bw.minus=NULL,
bp.range=NULL, cnt.thresh=NULL, low.read=FALSE, by="cnt"){
final.df = c()
# error handling. note that input and output genes must be matched
if( length(unique(bed.in$gene)) != length(unique(bed.out$gene)) ){stop("input genes do not match output genes")}
if( setequal(unique(bed.in$gene),unique(bed.out$gene)) == FALSE ){stop("input genes do not match output genes")}
# loop through each unique gene
# map reads in a range around each annotated TSS
bed0 = bed.out
bed = bed.in
genes = unique(bed$gene)
tss = rep(0,length(genes))
for(ii in 1:length(genes)){
ind.gene = which(bed$gene == genes[ii])
strand.gene = unique(bed$strand[ind.gene])[1]
chr.gene = unique(bed$chr[ind.gene])[1]
# get all sets of intervals around each annotated TSS
# sort upstream to downstream
if(strand.gene == "-"){
starts = unique(bed$end[ind.gene])
ranges = data.frame(chr=chr.gene, start=starts-bp.range[2],
end=starts-bp.range[1], stringsAsFactors=FALSE)
ranges = ranges[order(ranges$end,decreasing=TRUE),]
}
if(strand.gene == "+"){
starts = unique(bed$start[ind.gene])
ranges = data.frame(chr=chr.gene, start=starts+bp.range[1],
end=starts+bp.range[2], stringsAsFactors=FALSE)
ranges = ranges[order(ranges$start),]
}
# get counts and densities for each interval of interest
if(strand.gene == "-"){ counts = abs(bed.region.bpQuery.bigWig(bw.minus, ranges)) }
if(strand.gene == "+"){ counts = abs(bed.region.bpQuery.bigWig(bw.plus, ranges)) }
dens = counts / (ranges$end - ranges$start)
# select the best TSS, select the most upstream for ties
if(by == "cnt"){ind.max = which(counts == max(counts))[1]}
if(by == "den"){ind.max = which(dens == max(dens))[1]}
final.df = rbind(final.df, c(genes[ii], counts[ind.max], ranges[ind.max, ], strand.gene))
}
final.df = as.data.frame(final.df)
return(final.df)
}
# load transcript, firstExon files
annotDir="/home/FCAM/rmukherjee/Summer2022Rotation/annotations"
outputDir = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/primaryTranscriptAnnotation"
# import data for first exons, annotate, and remove duplicate transcripts
fname = sprintf("%s/gencode.mm39.firstExon.bed", annotDir)
dat0 = read.table(fname,header=F,stringsAsFactors=F)
names(dat0) = c('chr', 'start', 'end', 'gene', 'xy', 'strand')
dat0 = unique(dat0)
gencode.firstExon = dat0
# import data for all transcripts, annotate, and remove duplicate transcripts
fname = sprintf("%s/gencode.mm39.transcript.bed", annotDir)
dat0 = read.table(fname,header=F,stringsAsFactors=F)
names(dat0) = c('chr', 'start', 'end', 'gene', 'xy', 'strand')
dat0 = unique(dat0)
gencode.transcript = dat0
# chromosome sizes
chrom.sizes = read.table("/home/FCAM/rmukherjee/Spring2022Rotation/genome/mm39.chrom.sizes",stringsAsFactors=F,header=F)
names(chrom.sizes) = c("chr","size")
bigwigDir = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/fastqfiles"
#plus.file = sprintf("%s/TRP_FP_control_merged_plus.bigWig", bigwigDir)
#minus.file = sprintf("%s/TRP_FP_control_merged_minus.bigWig", bigwigDir)
#plus.file = sprintf("%s/V65_CNTRL_All_plus_normalized.bigWig", bigwigDir)
#minus.file = sprintf("%s/V65_CNTRL_All_minus_normalized.bigWig", bigwigDir)
plus.file = sprintf("%s/V65_cntrl_plus_normalized.bigWig", bigwigDir)
minus.file = sprintf("%s/V65_cntrl_minus_normalized.bigWig", bigwigDir)
bw.plus = load.bigWig(plus.file)
bw.minus = load.bigWig(minus.file)
cntrl.rep1.plus = load.bigWig(sprintf("%s/V65_cntrl_rep1_plus.bigWig", bigwigDir))
cntrl.rep1.minus = load.bigWig(sprintf("%s/V65_cntrl_rep1_minus.bigWig", bigwigDir))
cntrl.rep2.plus = load.bigWig(sprintf("%s/V65_cntrl_rep2_plus.bigWig", bigwigDir))
cntrl.rep2.minus = load.bigWig(sprintf("%s/V65_cntrl_rep2_minus.bigWig", bigwigDir))
# (using gencode transcript bed) get intervals for furthest TSS and TTS +/- interval
# stitch together the lowest start and highest end of each gene
largest.interval.bed = get.largest.interval(bed=gencode.transcript)
saveRDS(largest.interval.bed, file=sprintf("%s/gencode.largest.interval.rds", outputDir))
# (use gencode transcript bed) get read counts and densities for each annotated gene
transcript.reads = read.count.transcript(bed=gencode.transcript,
bw.plus=bw.plus, bw.minus=bw.minus)
saveRDS(transcript.reads, file = sprintf("%s/gencode.transcript.reads.rds", outputDir))
# decide on cutoffs for discarding reads - evaluate count and density distributions
pdf(sprintf("%s/read_count_density.pdf", outputDir))
par(mfrow=c(1,2))
hist(log(transcript.reads$density), breaks=200,
col="black",xlab="log read density",main="")
hist(log(transcript.reads$counts), breaks=200,
col="black",xlab="log read count",main="")
dev.off()
# create cutoffs and discard genes with low expression
den.cut = -7.2 # i.e., discard if density < exp(-7.2)
cnt.cut = 0.3 # i.e., discard if count < exp(0.3)
# plot the graph
pdf(sprintf("%s/read_count_density_thresholdLine.pdf", outputDir))
par(mfrow=c(1,2))
hist(log(transcript.reads$density), breaks=200,
col="black",xlab="log read density",main="")
abline(v=den.cut, col="red")
hist(log(transcript.reads$counts), breaks=200,
col="black",xlab="log read count",main="")
abline(v=cnt.cut, col="red")
dev.off()
# remove "unexpressed" genes
# get indices based on density and cnt threshold
ind.cut.den = which(log(transcript.reads$density) < den.cut)
ind.cut.cnt = which(log(transcript.reads$counts) < cnt.cut)
ind.cut = union(ind.cut.den, ind.cut.cnt) # take union of indices to be removed
unexp = names(transcript.reads$counts)[ind.cut] # get all unexpressed genes
largest.interval.expr.bed = largest.interval.bed[!(largest.interval.bed$gene %in% unexp),] # get interval bed of all genes which are expressed in sufficient amount
saveRDS(largest.interval.expr.bed, file=sprintf("%s/largest_interval_expr_bed.rds", outputDir))
# select the TSS (defined as having the highest count) for each gene and incorporate these TSSs
# into the largest interval coordinates
bp.range = c(20,160) # estimated using plotMetaProfileOfBigwigs.R
cnt.thresh = 0.9 # chosen because the correct Stat4 annotation had a normalized count of 0.1805; the original threshold of 5 does not suit these read-depth-normalized values, which are low
# the original count threshold was five; took a few cases with read-depth normalized values and computed the new threshold
## normalization factor
# control_rep1 = 0.2495727
# control_rep2 = 0.3610368
# the normalized count calculated using the following cases
# control_rep1 control_rep2 normalizedRead
#1 5 0 0.6239319
#2 0 5 0.9025920
#3 5 5 1.5265239
#4 2 3 0.7911280
#5 3 2 0.7353959
#6 1 4 0.8468600
#7 4 1 0.6796639
bed.out = largest.interval.expr.bed
# keep first-exon entries only for genes present in largest.interval, so both inputs cover the same set of genes
bed.in = gencode.firstExon[gencode.firstExon$gene %in% bed.out$gene,]
saveRDS(bed.in, sprintf("%s/V65.Jonkers.bed.in.rds", outputDir))
# bp.range is the interval within which the greatest read density is searched for
# the following regions of each gene are queried from the bigWigs
# for the - strand, the search window is [end - bp.range[2], end - bp.range[1]]
# for the + strand, the search window is [start + bp.range[1], start + bp.range[2]]
# after the window with the maximum coverage is found,
# the + strand TSS is set to the corresponding annotated "start"
# whereas the - strand TSS is set to the corresponding annotated "end"
# finally, the gene names from the calculated TSSs are used to update the same genes in the largest
# interval bed, where genes on the plus(+) strand get their "start" set to the TSS whereas genes on the minus(-) strand get their "end" set to the TSS
# get.TSS also returns an "issues" object which contains all genes whose ends are lower than their starts or whose starts are less than zero
# the "bed" object returned contains the interval data patched with the TSS as start.
##################### EXERCISE FOR estimating the counts around TSS
#counts.df = get.counts.around.potential.TSS(bed.in=bed.in, bed.out=bed.out,
# bw.plus=bw.plus, bw.minus=bw.minus,
# bp.range=bp.range, cnt.thresh=cnt.thresh)
#counts.df = as.data.frame(counts.df)
#counts.df[,2] =as.numeric(counts.df[,2])
#saveRDS(counts.df, file=sprintf("%s/max.counts.around.potential.tss.rds", outputDir))
## control rep1
#counts.df.cntrl.rep1 = get.counts.around.potential.TSS(bed.in=bed.in, bed.out=bed.out,
# bw.plus=cntrl.rep1.plus, bw.minus=cntrl.rep1.minus,
# bp.range=bp.range, cnt.thresh=cnt.thresh)
#counts.df.cntrl.rep1 = as.data.frame(counts.df.cntrl.rep1)
#colnames(counts.df.cntrl.rep1) = c("gene", "max.count", "chr", "start", "end", "strand")
#counts.df.cntrl.rep1[,2] =as.numeric(counts.df.cntrl.rep1[,2])
#saveRDS(counts.df.cntrl.rep1, file=sprintf("%s/cntrl.rep1.max.counts.potential.tss.rds", outputDir))
## control rep2
#counts.df.cntrl.rep2 = get.counts.around.potential.TSS(bed.in=bed.in, bed.out=bed.out,
# bw.plus=cntrl.rep2.plus, bw.minus=cntrl.rep2.minus,
# bp.range=bp.range, cnt.thresh=cnt.thresh)
#counts.df.cntrl.rep2 = as.data.frame(counts.df.cntrl.rep2)
#counts.df.cntrl.rep2[,2] =as.numeric(counts.df.cntrl.rep2[,2])
#colnames(counts.df.cntrl.rep2) = c("gene", "max.count", "chr", "start", "end", "strand")
#saveRDS(counts.df.cntrl.rep2, file=sprintf("%s/cntrl.rep2.max.counts.potential.tss.rds", outputDir))
#############################
gene.TSS = get.TSS(bed.in=bed.in, bed.out=bed.out,
bw.plus=bw.plus, bw.minus=bw.minus,
bp.range=bp.range, cnt.thresh=cnt.thresh)
TSS.gene = gene.TSS$bed
saveRDS(TSS.gene, file=sprintf("%s/TSS_gene_unfiltered.cnt.thresh.%0.2f.rds", outputDir, cnt.thresh))
# pre-filter to remove poorly annotated genes
for (substr in c("Gm", "Rik", "AC", "AW")) {
grep.result = grep(substr, TSS.gene$gene)
if (length(grep.result) > 0) {
TSS.gene = TSS.gene[-grep.result,]
}
}
saveRDS(TSS.gene, file=sprintf("%s/TSS_gene_filtered.cnt.thresh.%0.2f.rds", outputDir, cnt.thresh))
# remove overlaps by selecting the gene with the highest density
# (for both plus and minus strands) gene.overlaps searches, within each chromosome, for genes whose start lies inside another gene. A dataframe is created listing, for each gene, the genes that start within it ("has.start.inside") and the genes it starts within ("is.start.inside").
# the number of such inner and outer overlap cases is also recorded for each gene (integer overlap).
overlap.data = gene.overlaps( bed = TSS.gene )
# the cases from gene.overlaps are used again within the remove.overlaps function
# only those genes are kept which have the maximum density within the transcript bed
# the TSS.gene object, with overlapping rows filtered out, is returned
filtered.id.overlaps = remove.overlaps(bed=TSS.gene,
overlaps=overlap.data$cases,
transcripts=gencode.transcript,bw.plus=bw.plus, bw.minus=bw.minus, by="den")
# now start finding TTS coordinates
# get intervals for TTS evaluation
# the output from the function is an input for get.TTS()
# for the plus strand,
# add.to.end is the number of bases added to continue searching past the annotated gene end
# the start of the TTS search is set by subtracting fraction.end*(annotated gene length) from (end + add.to.end)
# for the minus strand,
# add.to.end is subtracted from the start
# the end of the TTS search is set by adding fraction.end*(annotated gene length) to (start - add.to.end)
# (for both plus and minus strands) dist.from.start is the number of bases kept clear of the start of the next gene in the annotation
add.to.end = 100000
fraction.end = 0.2
dist.from.start = 50
bed.for.tts.eval = get.end.intervals(bed=filtered.id.overlaps,
add.to.end=add.to.end, fraction.end=fraction.end, dist.from.start=dist.from.start)
saveRDS(bed.for.tts.eval, file=sprintf("%s/bed.for.tts.eval.cnt.thresh.%0.2f.rds", outputDir, cnt.thresh))
# identify gene ends
# @param knot.div (in get.TTS): the number of knots for the spline fit is defined as the number of bins covering the gene end divided by this number;
#' increasing this parameter results in a smoother curve
add.to.end = max(bed.for.tts.eval$xy)
knot.div = 40
pk.thresh = 0.05
bp.bin = 50
knot.thresh = 5
Cnt.thresh = 4 # count threshold for get.TTS
tau.dist = 50000
frac.max = 1
frac.min = 0.3
# get.TTS tries to identify the region containing the last transcription signal.
# as count data is noisy, we need to find proper peaks. For that purpose a spline is fitted with a given number of knots (the number of polynomial pieces).
# then we search for the maximum value reached by the spline. From that peak,
# we move to the last peak of the spline that clears a threshold decided by equations in the code.
# we call the x coordinate of that peak the TTS.
# (the whole process runs in the reverse direction for the minus strand.)
inferred.coords = get.TTS(bed=bed.for.tts.eval, tss=filtered.id.overlaps,
bw.plus=bw.plus, bw.minus=bw.minus,
bp.bin=bp.bin, add.to.end=add.to.end,
pk.thresh=pk.thresh, knot.thresh=knot.thresh,cnt.thresh=Cnt.thresh, tau.dist=tau.dist,
frac.max=frac.max, frac.min=frac.min, knot.div=knot.div)
saveRDS(inferred.coords, file=sprintf("%s/inferred_coords.cnt.thresh.%0.2f.rds", outputDir, cnt.thresh))
coords = inferred.coords$bed
coords.filt = chrom.size.filter(coords=coords, chrom.sizes=chrom.sizes)
coords.filt$log
head(coords.filt$bed)
# process output, remove mitochondrial coordinates, and write out data
out = coords.filt$bed %>% select(chr,start,end,gene)
out = cbind(out, c(500), coords.filt$bed$strand) %>% as.data.frame(stringsAsFactors=F)
# create a gene list for use while merging
pta.gene.list = out$gene # gene lists from primary Transcript annotation workflow
# save the previous data as inferred coords
out = out[-which(out$chr == "chrM"),]
write.table(out,sprintf("%s/inferredcoords.cnt.thresh.%0.2f.bed", outputDir, cnt.thresh),quote=F,sep="\t",col.names=F,row.names=F)
# now, begin to merge together a full annotation list.
removed.gene.interval = largest.interval.bed %>% filter(!(gene %in% pta.gene.list | chr == "chrM"))
names(removed.gene.interval) = names(out) # set the colnames equal
removed.gene.interval[c("c(500)")] = 100 # set score to something low as these were discarded
master.annotation.bed = rbind(out, removed.gene.interval)
write.table(master.annotation.bed, sprintf("%s/mastercoords.cnt.thresh.%0.2f.bed", outputDir, cnt.thresh), quote=F,sep="\t",col.names=F,row.names=F)
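Note: with cnt.thresh = 0.9, the script above writes mastercoords.cnt.thresh.0.90.bed, while the script below reads primaryTranscriptAnnotation/mastercoords.bed. Assuming the 0.90 output is the one to use and we are in the project directory, a copy bridges the two names:
cp primaryTranscriptAnnotation/mastercoords.cnt.thresh.0.90.bed primaryTranscriptAnnotation/mastercoords.bed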
We will first prepare bed files for "affected genes" (TSS + 500 to 25 kb) and "internal control genes" (50 kbp to TTS), as discussed above.
cat parsePTACoordsToShortAffectedAndLongControlGenes.R
library(dplyr)
# wrong # df.pause.body <- readRDS("pauseWindowsAndBWplots/TRP_FP_vscntrl.pTA.pause.body.rds")
df.pause.body = read.table("primaryTranscriptAnnotation/mastercoords.bed", header=F)
colnames(df.pause.body) = c("chr", "start", "end", "gene", "score", "strand")
# split long genes into an internal-control region and an affected region
controlgenes.len.gt.50kb = subset(df.pause.body, end - start >= 49999)
control.genes.df = c()
genes.for.affect = c()
colnames.of.interest = c("chr","start", "end", "gene", "score", "strand")
for (i in c(1:nrow(controlgenes.len.gt.50kb))) {
this.row = controlgenes.len.gt.50kb[i,]
if ( this.row$strand == "+" ) {
copy.row = this.row
this.row$start = this.row$start + 49999
this.row$gene = sprintf("%s_cntrlGene", this.row$gene)
control.genes.df = rbind(control.genes.df, this.row[,colnames.of.interest])
copy.row$end = copy.row$start + 24999
copy.row$gene = sprintf("%s_affected", copy.row$gene)
genes.for.affect = rbind(genes.for.affect, copy.row[,colnames.of.interest])
} else {
copy.row = this.row
this.row$end = this.row$end - 49999
this.row$gene = sprintf("%s_cntrlGene", this.row$gene)
control.genes.df = rbind(control.genes.df, this.row[,colnames.of.interest])
copy.row$start = copy.row$end - 24999
copy.row$gene = sprintf("%s_affected", copy.row$gene)
genes.for.affect = rbind(genes.for.affect, copy.row[,colnames.of.interest])
}
}
# shorter genes
controlgenes.len.short.50kb = subset(df.pause.body, end - start < 49999)
for (i in c(1:nrow(controlgenes.len.short.50kb))) {
this.row = controlgenes.len.short.50kb[i,]
if ( this.row$strand == "+" ) {
copy.row = this.row
copy.row$end = min(copy.row$end, copy.row$start + 24999)
copy.row$gene = sprintf("%s_affected", copy.row$gene)
genes.for.affect = rbind(genes.for.affect, copy.row[,colnames.of.interest])
} else {
copy.row = this.row
copy.row$start = max(copy.row$start, copy.row$end - 24999)
copy.row$gene = sprintf("%s_affected", copy.row$gene)
genes.for.affect = rbind(genes.for.affect, copy.row[,colnames.of.interest])
}
}
write.table(as.data.frame(control.genes.df), file="primaryTranscriptAnnotation/InternalControlGenes_TRP_12min.bed", quote=F, sep="\t", row.names=F, col.names=F)
write.table(as.data.frame(genes.for.affect), file="primaryTranscriptAnnotation/GenesWithRegionsAffected_TRP_12min.bed", quote=F, sep="\t", row.names=F, col.names=F)
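The count-generation script below expects coordinate-sorted versions of these bed files (the *_sorted.bed names); they can be produced with sort (illustrative):
for f in InternalControlGenes_TRP_12min GenesWithRegionsAffected_TRP_12min; do
    sort -k1,1 -k2,2n primaryTranscriptAnnotation/${f}.bed > primaryTranscriptAnnotation/${f}_sorted.bed
done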
We will now generate read count data from our processed GRO-seq files and the bed files for "affected" and "internal control" genes created earlier.
cat deseq2DataCreator_InternalControl_TRP12min.sh
#!/bin/bash
set -e
module load bedtools # specific for Xanadu cluster
proseqfolder=/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/fastqfiles
annotation_bed_TRP_sorted=/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/primaryTranscriptAnnotation/GenesWithRegionsAffected_TRP_12min_sorted.bed
annotationbed_TRP_internalControls=/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/primaryTranscriptAnnotation/InternalControlGenes_TRP_12min_sorted.bed
outputFolder=/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/deseq2
printf "entering proseqfolder: %s\n" "${proseqfolder}"
cd ${proseqfolder}
filenames=$(ls V65*rep?_not_scaled.bed | grep -v "FP")
for filename in ${filenames}
do
name=$(echo $filename | awk -F"_not_scaled.bed" '{print $1}')
echo -e "\t${name}" > ${outputFolder}/${name}_gene_counts_Affected_Trp12min.txt
mapBed -null "0" -s -a ${annotation_bed_TRP_sorted} -b $filename |\
awk '{OFS="\t";} {print $4,$7}' >> ${outputFolder}/${name}_gene_counts_Affected_Trp12min.txt
done
# internal control genes, used in place of spike-ins
for filename in ${filenames}
do
name=$(echo $filename | awk -F"_not_scaled.bed" '{print $1}')
echo -e "\t${name}" > ${outputFolder}/${name}_gene_counts_InternalControl_Trp12min.txt
mapBed -null "0" -s -a ${annotationbed_TRP_internalControls} -b $filename |\
awk '{OFS="\t";} {print $4,$7}' >> ${outputFolder}/${name}_gene_counts_InternalControl_Trp12min.txt
done
printf "entering outputFolder: %s\n" "${outputFolder}"
cd ${outputFolder}
paste -d'\t' *_gene_counts_Affected_Trp12min.txt > PRO_TRP_V65_Affected_GeneCounts.txt
paste -d'\t' *_gene_counts_InternalControl_Trp12min.txt > PRO_TRP_V65_InternalControl_GeneCounts.txt
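paste glues columns side by side, so it assumes that every per-replicate counts file lists genes in the same order. A quick check that the gene columns agree (illustrative):
cd /home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/deseq2
files=(*_gene_counts_Affected_Trp12min.txt)
for f in "${files[@]:1}"; do
    cmp -s <(cut -f1 "${files[0]}") <(cut -f1 "$f") || echo "gene order differs: $f"
done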
cat runDeseq2_TRP_12min_matched_internalControl.R
library(MatchIt)
library(bigWig)
library(DESeq2)
source("/home/FCAM/rmukherjee/Spring2022Rotation/R_files/ZNF143_functions.R")
setwd("/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/deseq2")
outputDir = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/deseq2"
pTA_dir = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/primaryTranscriptAnnotation"
fastqfilesdir = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/fastqfiles"
genecountsDir = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/deseq2"
spikeInCountsFile = sprintf("%s/PRO_TRP_V65_InternalControl_GeneCounts.txt", genecountsDir)
mouseGeneCountFile = sprintf("%s/PRO_TRP_V65_Affected_GeneCounts.txt", genecountsDir) # made from primary transcript annotation
counts.obj.internal = read.table(spikeInCountsFile, sep = '\t', header = T)
row.names.internal = counts.obj.internal[,1]
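# paste duplicated the gene-name column for every sample, so keep only the count columns (every second column)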
counts.obj.internal = counts.obj.internal[,seq(2,to=ncol(counts.obj.internal),by=2)]
rownames(counts.obj.internal) = row.names.internal
## only consider TRP and control files - regex V65_[cT]
## only consider genes with length up to 25 kb
inferred.coords=read.table(sprintf('%s/GenesWithRegionsAffected_TRP_12min_sorted.bed', pTA_dir), sep="\t", header =FALSE)
inferred.coords[,"V4"] = gsub("_affected", "", inferred.coords[,"V4"])
counts.df = abs(get.raw.counts.interval(inferred.coords, fastqfilesdir, file.prefix = "V65_[cT]",
file.plus.suffix = "*rep?_plus.bigWig", file.minus.suffix = "_minus.bigWig"))
saveRDS(counts.df, "raw.counts.TRP12min_affected.rds")
merged.counts.obj = rbind(counts.obj.internal, counts.df)
saveRDS(merged.counts.obj, file = "merged.internal.plus.raw.counts.rds")
cond.list = c("control", "control", "TRP_12min", "TRP_12min")
cond.levels=c("TRP_12min", "control")
sample.conditions = factor(cond.list, levels = cond.levels)
sample.conditions = relevel(sample.conditions, ref="control")
# #PCA for experiments
dds = run.deseq.list.dds(merged.counts.obj, sample.conditions, cond.levels, ControlGenes.Indices=1:nrow(counts.obj.internal))
rld <- rlog(dds)
plotPCA(rld) # basic plot
pca.plot = plotPCA(rld, intgroup="sample.conditions", returnData=TRUE)
pca.plot$sample.conditions = rownames(pca.plot)
plotPCAlattice(pca.plot, file = 'PCA_TRP_12min_PRO.pdf')
rep = factor(sapply(strsplit(colnames(counts.df), 'rep'), '[', 2))
# formula: ~ rep + sample.conditions
deseq.df = DESeqDataSetFromMatrix(counts.df,colData = cbind.data.frame(sample.conditions, rep), ~ rep + sample.conditions)
deseq.df$sizeFactor = estimateSizeFactorsForMatrix(merged.counts.obj, controlGenes = 1:nrow(counts.obj.internal))
saveRDS(deseq.df$sizeFactor, file = sprintf("%s/size.factors.rawCounts.TRPInternalGenes.rds", outputDir))
deseq.df = DESeq(deseq.df)
saveRDS(deseq.df, "deseq.df.TRPinternal.RAW.counts.rds")
# formula: ~ sample.conditions
deseq.df = DESeqDataSetFromMatrix(counts.df,colData = DataFrame(sample.conditions), ~ sample.conditions)
deseq.df$sizeFactor = estimateSizeFactorsForMatrix(merged.counts.obj, controlGenes = 1:nrow(counts.obj.internal))
saveRDS(deseq.df$sizeFactor, file = sprintf("%s/size.factors.modifiedFormula.rawCounts.TRPInternalGenes.rds", outputDir))
deseq.df = DESeq(deseq.df)
saveRDS(deseq.df, "deseq.df.modifiedFormula.TRPinternal.RAW.counts.rds")
# Create tables of activated/repressed genes and their matching genes
for (i in levels(sample.conditions)[-1]) {
res.deseq = results(deseq.df, contrast = c("sample.conditions", i, "control"))
sum(res.deseq$padj < 0.1 & !is.na(res.deseq$padj))
activated = res.deseq[res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange > 0,]
activated.strand = merge(cbind(tighten_summit_window(activated), "Activated"), inferred.coords, by.x = "gene", by.y = "V4")[,c(2, 3, 4, 1, 5, 10)]
write.table(activated.strand, file = paste0(i, '_activated_genes.bed'), quote = FALSE, row.names = FALSE, col.names = FALSE, sep = '\t')
unchanged = res.deseq[!is.na(res.deseq$padj) & res.deseq$padj > 0.1 & abs(res.deseq$log2FoldChange) < 0.01,]
#unchanged = unchanged[sample(x, size, replace = FALSE, prob = NULL),]
unchanged$treatment = 0
activated$treatment = 1
df.deseq.effects.lattice = rbind(unchanged, activated)
out = matchit(treatment ~ baseMean, data = df.deseq.effects.lattice, method = "optimal", ratio = 1)
summary(out, un=F) # don't show stats from pre-matching data
pdf("matchIt.optimal.activated.vs.unchanged.TRP12min.pdf")
plot(out, type="density", interactive = FALSE)
dev.off()
# choose unchanged genes which are matched to activated genes
unchanged = df.deseq.effects.lattice[rownames(df.deseq.effects.lattice) %in% out$match.matrix,]
unchanged.strand = merge(cbind(tighten_summit_window(unchanged), "Matched to Activated"), inferred.coords, by.x = "gene", by.y = "V4")[,c(2, 3, 4, 1, 5, 10)]
write.table(unchanged.strand, file = paste0(i, '_activated_matched_unchanged_genes.bed'), quote = FALSE, row.names = FALSE, col.names = FALSE, sep = '\t')
unchanged$treatment = "Matched to Activated"
activated$treatment = "Activated"
df.x = rbind(activated, unchanged)
# repeat the same procedure for repressed genes
repressed = res.deseq[res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange < 0,]
repressed.strand = merge(cbind(tighten_summit_window(repressed), "Repressed"), inferred.coords, by.x = "gene", by.y = "V4")[,c(2, 3, 4, 1, 5, 10)]
write.table(repressed.strand, file = paste0(i, '_repressed_genes.bed'), quote = FALSE, row.names = FALSE, col.names = FALSE, sep = '\t')
unchanged = res.deseq[!is.na(res.deseq$padj) & res.deseq$padj > 0.1 & abs(res.deseq$log2FoldChange) < 0.01,]
# unchanged = res.deseq[rownames(res.deseq) %notin% rownames(repressed) & !is.na(res.deseq$padj) & res.deseq$log2FoldChange > 0,]
unchanged$treatment = 0
repressed$treatment = 1
df.deseq.effects.lattice = rbind(unchanged, repressed)
out = matchit(treatment ~ baseMean, data = df.deseq.effects.lattice, method = "optimal", ratio = 1)
summary(out, un=F)
## Warning message: Fewer control units than treated units; not all treated units will get a match.
pdf("matchIt.optimal.repressed.vs.unchanged.TRP12min.pdf")
plot(out, type="density", interactive = FALSE)
dev.off()
unchanged = res.deseq[!is.na(res.deseq$padj) & res.deseq$padj > 0.1 & abs(res.deseq$log2FoldChange) < 0.01,]
unchanged = df.deseq.effects.lattice[rownames(df.deseq.effects.lattice) %in% out$match.matrix,]
unchanged.strand = merge(cbind(tighten_summit_window(unchanged), "Matched to Repressed"), inferred.coords, by.x = "gene", by.y = "V4")[,c(2, 3, 4, 1, 5, 10)]
write.table(unchanged.strand, file = paste0(i, '_repressed_matched_unchanged_genes.bed'), quote = FALSE, row.names = FALSE, col.names = FALSE, sep = '\t')
unchanged$treatment = "Matched to Repressed"
repressed$treatment = "Repressed"
df.x = rbind(df.x, unchanged)
df.x = rbind(df.x, repressed)
}
## MA Plot for activated, repressed, unchanged
temp = df.x
unique(temp$treatment)
dim(res.deseq[is.na(res.deseq$padj),])
otherGenes = res.deseq[!(rownames(res.deseq) %in% rownames(temp)) & !(is.na(res.deseq$padj)),]
otherGenes$treatment = "Other"
ma.df = rbind(temp,otherGenes)
dim(ma.df)
# `matched to activated` and `matched to repressed` are labels of unchanged genes
levelsMA = c("Other","Matched to Activated", "Matched to Repressed","Activated","Repressed")
ma.df$treatment = factor(ma.df$treatment, levels = levelsMA)
saveRDS(ma.df, file = "TRP_treat_matched_DESeq_genes_categorized.rds")
ma.plot.lattice(ma.df, filename='TRP_treat_matched',title.main = "Diff Exp (TRP 12.5min)", col=c("grey90","grey30","grey30", "#ce228e" ,"#2290cf"))
deseqGenes = tighten_summit_window(ma.df)
dim(deseqGenes)
deseqGenesInfo = cbind(deseqGenes, ma.df)
dim(deseqGenesInfo)
# save categorized data
saveRDS(deseqGenesInfo, "deseqGenesInfo.TRP12min.matched.rds")
write.table(deseqGenesInfo,file="TRP_treat_matched_DESeq_genes_categorized.bed",quote = FALSE, sep ='\t', row.names = FALSE, col.names = TRUE)
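As an aside, MatchIt's match.data() offers an equivalent way to retrieve the matched units; a small optional check on the last matchit() fit above:
library(MatchIt)
# optional: match.data() returns the matched treated and control units together
# with matching weights and subclass labels - an alternative to indexing
# df.deseq.effects.lattice rows with out$match.matrix
matched.df = match.data(out)
table(matched.df$treatment)  # 1 = repressed, 0 = matched unchanged (from the last fit)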
We get an MA plot showing 9632 genes as repressed.
We then save the DESeq2 size factors in a tab-separated text file.
cat saveSizeFactorsAsTxtFile.R
### For TRP 12.5min treatment
#genecountsDir = "/Users/rudradeepmukherjee/Documents/UConn Health/comparment.model.project/datasets, genome/Jonkers_2014_Trp_FP_paper"
genecountsDir = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014"
outputDir = sprintf("%s/deseq2", genecountsDir)
size.factors = readRDS(sprintf("%s/size.factors.rawCounts.TRPInternalGenes.rds", outputDir))
#size.factors = readRDS(sprintf("%s/size.factors.modifiedFormula.rawCounts.TRPInternalGenes.rds", outputDir))
size.factors.transpose = t(t(size.factors)) # t(t(.)) coerces the named vector into a one-column matrix for write.table
write.table(size.factors.transpose, sprintf("%s/sizefactors_TRP_12min_InternalGenes.txt",outputDir), sep="\t", quote=F, col.names=F)
### For FP 25min treatment
#genecountsDir = "/Users/rudradeepmukherjee/Documents/UConn Health/comparment.model.project/datasets, genome/Jonkers_2014_Trp_FP_paper"
genecountsDir = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014"
outputDir = sprintf("%s/deseq2", genecountsDir)
#size.factors = readRDS(sprintf("%s/size.factors.FPInternalGenes.rds", outputDir))
size.factors = readRDS(sprintf("%s/size.factors.modifiedFormula.rawCounts.FPInternalGenes.rds", outputDir))
size.factors.transpose = t(t(size.factors))
write.table(size.factors.transpose, sprintf("%s/sizefactors_FP_25min_InternalGenes.txt",outputDir), sep="\t", quote=F, col.names=F)
We use the DESeq2 size factors for normalization. The code is derived from PRO_normalization_sizeFactor of Nascent RNA Methods.
cat PRO_normalization_sizeFactor_TRP
#! /bin/sh
# modified from original code as file names are different: https://github.com/guertinlab/Nascent_RNA_Methods/blob/main/PRO_normalization
# put this in your path: normalize_bedGraph.py
# put this executable in your path: normalization_factor.R
usage() { echo "Usage: $0 -c chrSizes -z sizeFactorsDataTxtFile" 1>&2; exit 1; }
set -e
module load R/4.1.2
export PATH=${PATH}:/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014 # for normalization_bedgraph_v2.py
export PATH=${PATH}:/home/FCAM/rmukherjee/Summer2022Rotation/R_files # for bigWigToBedGraph
while getopts "c:z:" OPTION
do
case $OPTION in
c)
chrSizes=$OPTARG
;;
z) sizeFactorsFile=$OPTARG
;;
*)
usage
;;
esac
done
if [ -z "${chrSizes}" ] || [ -z "${sizeFactorsFile}" ]; then
usage
fi
bigWigFilesDir=/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/fastqfiles
cd ${bigWigFilesDir}
if [ -n "${chrSizes}" ]; then
#NOTE - normalization_factor.R is not needed as we are directly using the sizeFactors file containing two columns -
# `replicateName`, `deseq2.sizefactor`
#Rscript /home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/normalization_factor_v2.R # modified version of normalization_factor.R from Guertin lab's GitHub
#for i in V65_cntrl_rep1_plus.bigWig V65_FP_treat_25min_rep1_plus.bigWig V65_TRP_treat_12min_rep1_plus.bigWig; # name modified
for i in V65_cntrl_rep1_plus.bigWig V65_TRP_treat_12min_rep1_plus.bigWig; # name modified
do
name=$(echo $i | awk -F"_rep1_plus.bigWig" '{print $1}') # name modified
printf "%s - processing individual replicates\n" ${name}
basename=$(echo $i | awk -F"_" '{print $1}')
# count the number of replicates
reps=$(ls ${name}_rep?_plus.bigWig | wc -w | bc) # modified name
for j in ${name}_rep?_plus.bigWig;
do
repNum=$(echo $j | awk -F"rep" '{print $NF}' | awk -F"_plus" '{print $1}') # take the num# of replicate
# NOTE - we will use DESeq2 sizeFactors instead of invscalePlus and invscaleMinus -
#scaleFactor=$(grep ${name}_rep${repNum} ../deseq2/sizefactors_TRP_12min_InternalGenes.txt | awk -F"\t" '{print $2}' | bc)
scaleFactor=$(grep ${name}_rep${repNum} ${sizeFactorsFile} | awk -F"\t" '{print $2}' | bc)
echo "scalefactor for ${name}_rep${repNum} - ${scaleFactor}"
# TAKE THE INVERSE - eqn (6) of Anders and Huber, 2010 finds normalized values by dividing counts with sizeFactors
scaleFactor=$(echo "scale=5; 1/${scaleFactor}" | bc)
echo "scalefactor inverse for ${name}_rep${repNum} - ${scaleFactor}"
# convert to bedGraph and scale by the inverse size factor
bigWigToBedGraph ${name}_rep${repNum}_plus.bigWig ${name}_rep${repNum}_plus.bedGraph
bigWigToBedGraph ${name}_rep${repNum}_minus.bigWig ${name}_rep${repNum}_minus.bedGraph
normalize_bedGraph_v2_pandas.py -i ${name}_rep${repNum}_plus.bedGraph -s $scaleFactor -o ${name}_rep${repNum}_plus_normalized.bedGraph
normalize_bedGraph_v2_pandas.py -i ${name}_rep${repNum}_minus.bedGraph -s $scaleFactor -o ${name}_rep${repNum}_minus_normalized.bedGraph
bedGraphToBigWig ${name}_rep${repNum}_plus_normalized.bedGraph $chrSizes ${name}_rep${repNum}_plus_normalized.bigWig
bedGraphToBigWig ${name}_rep${repNum}_minus_normalized.bedGraph $chrSizes ${name}_rep${repNum}_minus_normalized.bigWig
done
printf "%s - creating and merging normalized replicate bigWigs\n" ${name}
plusfiles=$(ls ${name}_rep*_plus_normalized.bigWig)
bigWigMerge $plusfiles tmpPlus.bg
minusfiles=$(ls ${name}_rep*_minus_normalized.bigWig)
bigWigMerge -threshold=-10000000000 $minusfiles tmpMinus.bg # for bigWigMerge utility: -threshold=0.N - don't output values at or below this threshold. Default is 0.0
scaleall=$(bc <<< "scale=4 ; 1.0 / $reps")
normalize_bedGraph_v2_pandas.py -i tmpPlus.bg -s $scaleall -o ${name}_plus_normalized.bg # scaleall=1/rep#
normalize_bedGraph_v2_pandas.py -i tmpMinus.bg -s $scaleall -o ${name}_minus_normalized.bg
sort -k1,1 -k2,2n ${name}_plus_normalized.bg > ${name}_plus_normalized_sorted.bg
sort -k1,1 -k2,2n ${name}_minus_normalized.bg > ${name}_minus_normalized_sorted.bg
bedGraphToBigWig ${name}_plus_normalized_sorted.bg $chrSizes ${name}_plus_normalized_sizeFactor_TRP.bigWig
bedGraphToBigWig ${name}_minus_normalized_sorted.bg $chrSizes ${name}_minus_normalized_sizeFactor_TRP.bigWig
rm ${name}_plus_normalized.bg
rm ${name}_minus_normalized.bg
rm ${name}_plus_normalized_sorted.bg
rm ${name}_minus_normalized_sorted.bg
rm tmpPlus.bg
rm tmpMinus.bg
done
fi
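The script can then be invoked with a chromosome sizes file and the size-factor table written earlier (the mm10.chrom.sizes file name is illustrative):-
sh PRO_normalization_sizeFactor_TRP -c mm10.chrom.sizes -z deseq2/sizefactors_TRP_12min_InternalGenes.txt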
The processed files are hosted as a track hub which can be loaded in the UCSC Genome Browser. A session can be accessed here.
We will prepare the pause windows and calculate the densities in gene body regions. We assume all packages mentioned in this file have been installed.
Removing genes with high variability: we will also remove genes with high signal variability in their body regions. We will compute body averages in 200 bp windows at step sizes of 50 bp using find.density.across.windows(..., start_bp=0, end_bp=15000, windowsize_bp=200, stepsize_bp=50) and then filter out genes whose body variance/mean exceeds 0.1, as sketched below.
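In isolation, the filter amounts to the following sketch (result_df stands for the windowed densities returned by find.density.across.windows, as in the script below):
library(dplyr)
# summarise the windowed body densities per gene and keep genes whose
# variance/mean ratio stays at or below 0.1
summary.windows = group_by(result_df, gene) %>%
  summarise_at(c("body.avg"), c("mean", "var"))
summary.windows$var_by_mean = summary.windows$var / summary.windows$mean
filteredGenes = subset(summary.windows, var_by_mean <= 0.1)$gene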
cat run_pauseworkflow_InternalTRP_matched.R
# calls functions in pause_workflow_v2.R to plot composites from conditions.
# this code was written for a project where there were three conditions - control, 45min and 90min.
library(dplyr)
library(bigWig)
# source the functions
source("pause_workflow_funcs_v2.R")
## NOTE: CHANGE these PATHS before running on your own platform.
# set the working directory
outputDir = '/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/pauseWindowsAndBWplots'
# load the merged bigWigs from all conditions x replicates
merged.bw.dir = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/fastqfiles"
#bw.plus <- load.bigWig(sprintf('%s/TRP_FP_control_merged_plus.bigWig', merged.bw.dir))
#bw.minus <- load.bigWig(sprintf('%s/TRP_FP_control_merged_minus.bigWig', merged.bw.dir))
normalized.bw.dir = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/fastqfiles"
bw.cntrl.plus = load.bigWig(sprintf('%s/V65_cntrl_plus_normalized_sizeFactor_TRP.bigWig', normalized.bw.dir))
bw.cntrl.minus = load.bigWig(sprintf('%s/V65_cntrl_minus_normalized_sizeFactor_TRP.bigWig', normalized.bw.dir))
bw.TRP.plus = load.bigWig(sprintf('%s/V65_TRP_treat_12min_plus_normalized_sizeFactor_TRP.bigWig', normalized.bw.dir))
bw.TRP.minus = load.bigWig(sprintf('%s/V65_TRP_treat_12min_minus_normalized_sizeFactor_TRP.bigWig', normalized.bw.dir))
deseq2.files.dir = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/deseq2"
categorized.TRPtreatvscontrol = readRDS(sprintf("%s/deseqGenesInfo.TRP12min.matched.rds",deseq2.files.dir))
#categorized.TRPtreatvscontrol$gene = categorized.TRPtreatvscontrol$gene %>% str_replace("_affected", "")
categorized.TRPtreatvscontrol = categorized.TRPtreatvscontrol[!duplicated(categorized.TRPtreatvscontrol$gene),]
rownames(categorized.TRPtreatvscontrol) = categorized.TRPtreatvscontrol$gene
# find pause region for all genes listed in primaryTranscriptAnnotation bed file using the merged bigWigs
# load the coordinates file
affectedGenes.newTTS.bed = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/primaryTranscriptAnnotation/GenesWithRegionsAffected_TRP_12min_sorted_renamed.bed"
pTA <- read.table(affectedGenes.newTTS.bed, sep="\t")
df.pause.body <- find.pause.regions(pTA,bw.cntrl.plus,bw.cntrl.minus, searchWindow = 300)
rownames(df.pause.body) = df.pause.body$gene
saveRDS(df.pause.body, file=sprintf("%s/Internal_TRP_vscntrl.pause.body.rds",outputDir))
merged.pbody.TRPtreat.vs.control = merge(df.pause.body, as.data.frame(categorized.TRPtreatvscontrol))
merged.pbody.TRPtreat.vs.control = merged.pbody.TRPtreat.vs.control[,c(colnames(df.pause.body), colnames(categorized.TRPtreatvscontrol))]
rownames(merged.pbody.TRPtreat.vs.control) = merged.pbody.TRPtreat.vs.control$gene
saveRDS(merged.pbody.TRPtreat.vs.control, file=sprintf("%s/Internal.merged.pbody.TRP.vs.control.rds",outputDir))
## find genes with high variability
result_df = find.density.across.windows(df.pause.body, bw.cntrl.plus, bw.cntrl.minus, start_bp=0, end_bp=15000,windowsize_bp=200, stepsize_bp=50)
saveRDS(result_df, sprintf("%s/TRP.body.density.rollingwindow.windowsize=200,stepsize=50,bp_end=15000.rds", outputDir))
summary.TRP.windows = group_by(result_df, gene) %>% summarise_at(c("body.avg"), c("mean", "median", "var", "min", "max"))
summary.TRP.windows$var_by_mean = summary.TRP.windows$var / summary.TRP.windows$mean
saveRDS(summary.TRP.windows, sprintf("%s/summary.TRP.windows.rds", outputDir))
#summary.TRP.windows = readRDS( sprintf("%s/summary.TRP.windows.rds", outputDir))
filteredGenes = subset(summary.TRP.windows, var_by_mean <= 0.1)$gene
setwd(outputDir)
# call plotting steps for all genes
run.plotting.steps(merged.pbody.TRPtreat.vs.control, "control_all", bw.cntrl.plus, bw.cntrl.minus,
"TRP_12.5min_all", bw.TRP.plus, bw.TRP.minus,
"TRP_12.5min (all genes, Internal Control)", color.names=c(rgb(1,0,0,1/2), rgb(0,0,1,1/2))) # red, blue from rgb value. Indices set in alphabetical order of column names
density.TRP = readRDS("density.TRP_12.5min_all.v.control_all.rds")
repressed.TRP.baseline = subset(density.TRP, cond == "control_all" & treatment == "Repressed")
repressed.TRP.treatment = subset(density.TRP, cond == "TRP_12.5min_all" & treatment == "Repressed")
colNames = c("gene", "pause.sum", "body.avg")
write.table(repressed.TRP.baseline[,colNames], "InternalControl_repressed_TRP125min_baseline.txt", sep="\t", quote=F, row.names=F)
write.table(repressed.TRP.treatment[,colNames], "InternalControl_repressed_TRP125min_treatment.txt", sep="\t", quote=F, row.names=F)
# call the run.plotting.steps on pause body object containing genes of our interest.
repressed.TRPvscontrol = subset(merged.pbody.TRPtreat.vs.control, treatment == "Repressed")
saveRDS(repressed.TRPvscontrol, file=sprintf("%s/InternalControl.repressed.pbody.TRP.vs.control.rds", outputDir))
run.plotting.steps(repressed.TRPvscontrol, "control", bw.cntrl.plus, bw.cntrl.minus,
"TRP_12.5min (repressed, internal spikein)", bw.TRP.plus, bw.TRP.minus,
"TRP_12.5min(rrepressed, internal control)", color.names=c(rgb(1,0,0,1/2), rgb(0,0,1,1/2))) # red, blue from rgb value. Indices set in alphabetical order of column names
## USE GENES with body variance / mean <= 0.1
repressed.TRP.baseline = subset(density.TRP, cond == "control_all" & treatment == "Repressed")
repressed.TRP.treatment = subset(density.TRP, cond == "TRP_12.5min_all" & treatment == "Repressed")
control.TRP.allgenes.varlessthanpoint1 = subset(density.TRP, cond == "control_all" & gene %in% filteredGenes)
repressed.TRP.baseline.varlessthanpoint1 = subset(density.TRP, cond == "control_all" & treatment == "Repressed" & gene %in% filteredGenes)
repressed.TRP.treatment.varlessthanpoint1 = subset(density.TRP, cond == "TRP_12.5min_all" & treatment == "Repressed"& gene %in% filteredGenes)
colNames = c("gene", "pause.sum", "body.avg")
write.table(subset(control.TRP.allgenes.varlessthanpoint1,pause.sum>0)[, colNames], "TRP125min_control_allgenes_varlessthanpoint1.txt", sep="\t", quote=F, row.names=F)
write.table(repressed.TRP.baseline.varlessthanpoint1[, colNames], "TRP125min_repressed_control_varlessthanpoint1.txt", sep="\t", quote=F, row.names=F)
write.table(repressed.TRP.treatment.varlessthanpoint1[, colNames], "TRP125min_repressed_treatment_varlessthanpoint1.txt", sep="\t", quote=F, row.names=F)
We now have the desired files - TRP125min_repressed_control_varlessthanpoint1.txt and TRP125min_repressed_treatment_varlessthanpoint1.txt - containing pause sums and body densities of the genes of interest. We will use TRP125min_control_allgenes_varlessthanpoint1.txt to estimate a full occupancy value for 1 RNAPII in the pause region.
Note: As discussed in the original manuscript (Fig S1, Mukherjee et al), we used the pause sum at the 90th percentile of TRP125min_control_allgenes_varlessthanpoint1.txt as the occupancy value. See section Running the model.
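For intuition, the occupancy value is simply a percentile of the pause-sum distribution; a minimal sketch in R, assuming the pause_sum column layout shown earlier:
# the 90th-percentile pause sum serves as the full-occupancy value for one
# RNAPII in the pause region (Fig S1, Mukherjee et al)
allgenes = read.table("TRP125min_control_allgenes_varlessthanpoint1.txt",
                      sep = "\t", header = TRUE)
quantile(allgenes$pause_sum, probs = 0.90)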
To demonstrate saturation values from fits to a hyperbolic tangent function (Materials & Methods, Mukherjee et al), we will use --norm saturation:
polcomp estparam --basename baseline --basefile TRP125min_repressed_control_varlessthanpoint1.txt --exprname TRP_12.5min --exprfile TRP125min_repressed_treatment_varlessthanpoint1.txt --allgenesfile TRP125min_control_allgenes_varlessthanpoint1.txt --norm saturation --conf TRP_UpperLimitFile --outputdir paramFit_TRP_EntireRange_Saturation
The output will look like the following:
checking inputs
will use 1153.7 as saturation value corresponding to max index 12762 (at percentile 100.0) provided
saturation fits saved as paramFit_TRP_EntireRange_Saturation/TRP_12.5min_baseline_saturationFit_Yes_logY.pdf
saturation fits saved as paramFit_TRP_EntireRange_Saturation/TRP_12.5min_baseline_saturationFit_No_logY.pdf
parameters sets written as paramFit_TRP_EntireRange_Saturation/TRP_12.5min_vs_baseline_fittedParameters.txt
removed 625 genes from run as they have zero pause sum or body density
Genes with pause sum or body avg as zero in either condition have been written to paramFit_TRP_EntireRange_Saturation/GeneswithZeroPauseSums_or_BodyAvg.txt
Received the command:
polcomp estparam --basename baseline --basefile TRP125min_repressed_control_varlessthanpoint1.txt --exprname TRP_12.5min --exprfile TRP125min_repressed_treatment_varlessthanpoint1.txt --allgenesfile TRP125min_control_allgenes_varlessthanpoint1.txt --norm saturation --conf TRP_UpperLimitFile --outputdir paramFit_TRP_EntireRange_Saturation
writing command information to paramFit_TRP_EntireRange_Saturation/commandUsed.txt
Estimated rates saved as paramFit_TRP_EntireRange_Saturation/paramEstimationResults_TRP_12.5min_vs_baseline.txt
The fitted parameters look like the following:-
head paramFit_TRP_EntireRange_Saturation/TRP_12.5min_vs_baseline_fittedParameters.txt
rankFitted percentileRankFitted A B C saturationVal percentileOfSaturationVal
11486 90.0 2.431881136506332 0.0001237883637946522 -0.23589809241576964 157.0 92.56
12124 95.0 2.6319695245707972 0.00010586476889332477 -0.1976872023523184 271.8 97.48
12762 100.0 3.1943701718101787 7.667947285209665e-05 -0.13226970355876166 1153.7 99.94
The saturation curve can be derived over ranks 1 to rankFitted above, using A*tanh(B*x) + C, where 'x' is a placeholder variable for ranks. Since tanh(B*x) approaches 1 at large x, the maximum saturation value for each of these fits is S = A + C. For plotting different saturation curves, please refer to Plotting saturation fits under the section Miscellaneous Plots.
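As an illustration, the 90th-percentile fit from the table above can be reconstructed in R. Note that 10^(A + C) reproduces the saturationVal column (10^2.196 is roughly 157), which suggests the fit operates on log10-transformed pause sums, consistent with the LOG_HYPER_TANG option used later:
# reconstruct the saturation curve for the 90th-percentile fit (first table row)
A = 2.431881136506332
B = 0.0001237883637946522
C = -0.23589809241576964
x = 1:11486                        # ranks 1..rankFitted
fit = A * tanh(B * x) + C          # fitted curve, apparently on log10 pause sums
S = A + C                          # plateau, since tanh(B*x) -> 1 for large x
10^S                               # ~157, matching saturationVal for this row
plot(x, fit, type = "l", xlab = "pause sum rank", ylab = "fitted value")
abline(h = S, lty = 2)             # dashed line at maximum saturation S = A + C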
NOTE: for datasets created using UMIs, like ZNF143 dTAG (Dong et al 2024), Dexamethasone treatment (Wissink et al 2021) and TBP dTAG (Santana et al 2022), --norm saturation should be used while invoking polcomp.
Subsections describe plotting functions for various use cases. All subsections show basic plotting code; users are encouraged to refer to these sections and reuse this code for custom plotting needs.
After a model run (see Running the model and other sections), a file with rates is produced. For instance, in the subsection Saturation Values, we generated the file paramFit_TRP_EntireRange_Saturation/paramEstimationResults_TRP_12.5min_vs_baseline.txt containing the following columns:-
head paramFit_TRP_EntireRange_Saturation/paramEstimationResults_TRP_12.5min_vs_baseline.txt
gene P_normalized_baseline B_normalized_baseline P_normalized_TRP_12.5min B_normalized_TRP_12.5min B_by_P_baseline pauseRelease_kelong30_baseline pauseRelease_kelong45_baseline pauseRelease_kelong60_baseline B_by_P_TRP_12.5min pauseRelease_kelong30_TRP_12.5min pauseRelease_kelong45_TRP_12.5min pauseRelease_kelong60_TRP_12.5min pauseRelease_FC initiationRate_FC_bound1 initiationRate_FC_bound2 initiationRate_baseline_kelong45_termination_point01 initiationRate_baseline_kelong45_termination_point1 initiationRate_baseline_kelong45_termination_1 initiationRate_TRP_12.5min_kelong45_termination_point01 initiationRate_TRP_12.5min_kelong45_termination_point1 initiationRate_TRP_12.5min_kelong45_termination_1
Tcf24 0.00796888671188262 6.971887730342455e-05 0.003441502030798615 2.1122988763380635e-05 0.00874888548728706 0.2624665646186118 0.39369984692791765 0.5249331292372236 0.006137723753857253 0.1841317126157176 0.2761975689235764 0.3682634252314352 0.7015435009151662 0.43186735553247346 0.30297373653122617 0.003217038345772931 0.003934238149842367 0.011106236190536726 0.0009849495146601148 0.0012946846974319901 0.004392036525150743
The columns P_normalized_baseline, P_normalized_TRP_12.5min and so on denote the quantities associated with the baseline and TRP_12.5min conditions, specified in the polcomp run as --basename and --exprname. To make the column names easily usable, we will transform this file using another helper code.
cat transformRates.R ## changes column names of parameter file to make it more usable
transform_rates <- function(combinedResultsFile, baselineName, exprName, yLabel, fileidentifier=NULL, kelong_vals = c(30,45,60), kpre_vals = c(1e-3,1e-2,1e-1,1), ALPHA=0.03) {
combinedResults = read.table(combinedResultsFile,sep="\t", header=T)
if (is.null(fileidentifier)) {
fileidentifier = sprintf("%s_vs_%s", exprName, baselineName)
print(sprintf("setting filePrefix to %s", fileidentifier))
}
p_b_df = combinedResults[,c("gene", sprintf("P_normalized_%s", baselineName), sprintf("B_normalized_%s", baselineName),
sprintf("P_normalized_%s", exprName), sprintf("B_normalized_%s", exprName))]
colnames(p_b_df) = c("gene", "P_base", "B_base", "P_expr", "B_expr")
rate.df = as.data.frame(c())
for (kelong_val in kelong_vals) {
p_b_df$kelong = kelong_val
estimatedRate = p_b_df
for (kpre_val in kpre_vals) {
estimatedRate$kpre = kpre_val
estimatedRate$kinit_base = estimatedRate$kpre * estimatedRate$P_base + estimatedRate$kelong * estimatedRate$B_base
estimatedRate$kinit_expr = estimatedRate$kpre * estimatedRate$P_expr + estimatedRate$kelong * estimatedRate$B_expr
estimatedRate$krel_base = estimatedRate$kelong * ( estimatedRate$B_base / estimatedRate$P_base )
estimatedRate$krel_expr = estimatedRate$kelong * ( estimatedRate$B_expr / estimatedRate$P_expr )
rate.df = rbind(rate.df, estimatedRate)
}
}
saveRDS(rate.df, sprintf("calculatedRateData_%s.rds", fileidentifier))
}
combnResultsFile = "paramFit_TRP_EntireRange_Saturation/paramEstimationResults_TRP_12.5min_vs_baseline.txt"
transform_rates(combnResultsFile, "baseline", "TRP_12.5min", "TRP_12.5min (repressed)") # default fileidentifier yields calculatedRateData_TRP_12.5min_vs_baseline.rds
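Inside transform_rates, the rates follow kinit = kpre*P + kelong*B and krel = kelong*(B/P). A quick numeric check against the Tcf24 row of the head output shown above:
# verify the rate formulas using Tcf24's baseline values from the head above
P_base = 0.00796888671188262      # P_normalized_baseline
B_base = 6.971887730342455e-05    # B_normalized_baseline
30 * (B_base / P_base)            # ~0.26247 = pauseRelease_kelong30_baseline
0.01 * P_base + 45 * B_base       # ~0.0032170 =
                                  # initiationRate_baseline_kelong45_termination_point01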
The created file calculatedRateData_TRP_12.5min_vs_baseline.rds contains column names where the baseline and experiment conditions are denoted generically by base and expr. We will use this file to plot the rate changes:
source("generalPlotter.R")
scanDataFile.entireRange = "paramFit_TRP_EntireRange_Saturation/calculatedRateData_TRP_12.5min_vs_baseline.rds"
## creates summary for rates per gene
summary.TRP.entireRange = summaryFunction(readRDS(scanDataFile.entireRange))
#plots fold change in pause release (krel)
krelFoldchangePlotter(scanDataFile.entireRange, "TRP_krel_foldchange", "Triptolide (TRP)", "TRP", filetype="pdf", UP=32, DOWN=1/4.5, WIDTH=5, pointALPHA = 0.1)
#plots fold change in initiation (kinit)
kinitFCvsPfoldchangePlotter(scanDataFile.entireRange, "TRP_kinit_FC", "Triptolide", "(TRP)", filetype="pdf", directionKinit="down", pointALPHA = 0.1)
#plots pause release values against elongation rates
kelongVsKrelPlotter(summary.TRP.entireRange, "Fig_TRP_krel_vs_kelong_minute", "Triptolide (TRP)", "TRP", filetype="pdf")
The generated plots show that Triptolide decreases initiation and may also increase pause release.
The generalPlotter.R file is present in the vignette folder and is omitted here for brevity.
We will use paramFit_TRP_EntireRange_Saturation/TRP_12.5min_vs_baseline_fittedParameters.txt, created in the previous subsection, Saturation Values.
source("residualAndFitPlotter.R")
paramFileLoc.TRP = "paramFit_TRP_EntireRange_Saturation/TRP_12.5min_vs_baseline_fittedParameters.txt"
originalPauseSums.TRP = read.table("TRP125min_control_allgenes_varlessthanpoint1.txt", sep="\t", header=T)
originalPauseSums.TRP = originalPauseSums.TRP[order(originalPauseSums.TRP$pause_sum),] # sort rows by ascending pause_sum (rank(order(x)) reduces to order(x))
color.vals = c("#F69994", "#F5807B", "#F26660", "#ED4843", "#E92623")
fitPlotter_multiplePercentiles("TRP", paramFileLoc.TRP, originalPauseSums.TRP, percentileVals = c(70,80,90,95,100), "LOG_HYPER_TANG", "No", plotTitle = "Triptolide (mESCs, no UMIs)", logYPlotter="Yes", showLegend = "Yes", plotMaxSaturation="Yes", outputDir = "figures/", V_JUST = c(1.2,-0.8, -0.7,-1.1, -1.1), H_JUST=c(rep(0.6,2),rep(0.58,2), 0.545), color.list = color.vals)
The inset shows the legend of fits used (X percentile referring to a fit for pause sums up to the Xth percentile). The dashed lines and the saturation values are indicated with their percentiles (with respect to the entire data).
The residualAndFitPlotter.R file is present in the vignette folder and is omitted here for brevity.