This document describes the steps to run the compartment model (Mukherjee et al.) on nascent transcript datasets to estimate changes in pause release and initiation.
Note: This section assumes that the user already has genes of interest together with their pause sums and body densities. Please refer to the section Preparation of Inputs
for the creation of these input files.
We will install a Python-based package, compartmentModel,
currently hosted on Test PyPI:
python3 -m pip install -i https://test.pypi.org/simple compartmentModel
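Note: Test PyPI does not mirror every dependency. If installation fails while resolving requirements, letting pip fall back to the main index usually helps (a sketch; this assumes the package's dependencies are available on regular PyPI):
python3 -m pip install -i https://test.pypi.org/simple --extra-index-url https://pypi.org/simple compartmentModel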
After successful installation, the help menu can be accessed by running the following on the command line:
polcomp estparam --help
We need two tab-separated files, one for each of the control and treatment conditions, containing 'gene', 'pause_sum' and 'body_density' for all genes of interest. We also need a file containing the pause sums of all expressed genes in the baseline condition, in order to estimate an occupancy value (see Mukherjee et al.). Please refer to the section Preparation of Inputs
for a detailed overview of how these files are created. For this example, we will use two files prepared from Triptolide treatment data (Jonkers et al. 2014):
head TRP125min_repressed_control_varlessthanpoint1.txt
gene pause_sum body_density
Tcf24 9.19386497139931 0.0804360718203523
Cops5 85.9334249198437 0.0384385124459365
Cspp1 756.197381913662 0.0254058947988453
Arfgef1 5.56534001231194 0.0219468799772712
Rnf152 6.98348507285118 0.0068474169675437
Pign 322.371615976095 0.0171639551516065
Relch 33.6940849125385 0.0252412666679433
Kdsr 153.332449406385 0.0335362723951963
Vps4b 185.665255844593 0.0386704052081501
head TRP125min_repressed_treatment_varlessthanpoint1.txt
gene pause_sum body_density
Tcf24 3.97053015232086 0.0243700172313062
Cops5 16.6258801221848 0.0278898475135028
Cspp1 116.273120522499 0.0247544261562293
Arfgef1 1.59940004348755 0.0177578365557946
Rnf152 0.523810029029846 0.0044233950876575
Pign 63.7204707860947 0.0124597407610381
Relch 10.836000084877 0.0200736366477201
Kdsr 27.2139506340027 0.0272897903308197
Vps4b 81.6698702573776 0.0313645508928707
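Before running the model, it is worth checking that both files are tab-separated with the expected header and contain the same genes. A minimal shell check (illustrative):
head -1 TRP125min_repressed_control_varlessthanpoint1.txt # expect: gene, pause_sum, body_density
awk -F'\t' 'NF!=3' TRP125min_repressed_control_varlessthanpoint1.txt | wc -l # rows without exactly 3 fields; expect 0
# genes present in one file but not the other; expect no output
comm -3 <(cut -f1 TRP125min_repressed_control_varlessthanpoint1.txt | sort) \
    <(cut -f1 TRP125min_repressed_treatment_varlessthanpoint1.txt | sort)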
We will specify the inputs to our model as described in the help message:
polcomp estparam --help
By default, the lower and upper limits for the premature termination (Kpre), pause release (Krel) and initiation (Kinit) rates are 0.01 and 1 respectively, whereas the lower and upper limits for the elongation rate (Kelong) are 30 and 60. To demonstrate that these limits can be changed, we will use a config file that sets the lower limit to 1e-8 (and the upper limit to 10) for the initiation, pause release and premature termination rates.
cat TRP_UpperLimitFile
[initiation]
lower = 0.00000001
upper = 10
[pauserelease]
lower = 0.00000001
upper = 10
[prematuretermination]
lower = 0.00000001
upper = 10
Note: In a similar manner, the upper limit can be changed for any parameter by specifying an upper
value for the desired parameter, say the premature termination rate. The following is for demonstration only and is not used in this section:
[prematuretermination]
lower = 0.02
upper = 0.023
We will run the following command:
polcomp estparam --basename baseline --basefile TRP125min_repressed_control_varlessthanpoint1.txt --exprname TRP_12.5min --exprfile TRP125min_repressed_treatment_varlessthanpoint1.txt --allgenesfile TRP125min_control_allgenes_varlessthanpoint1.txt --conf TRP_UpperLimitFile --outputdir paramFit_TRP_EntireRange_percentile
At the end of the run, it should produce output like the following (last few lines shown):
removed 625 genes from run as they have zero pause sum or body density
Genes with pause sum or body avg as zero in either condition have been written to paramFit_TRP_EntireRange_percentile/GeneswithZeroPauseSums_or_BodyAvg.txt
Received the command:
polcomp estparam --basename baseline --basefile TRP125min_repressed_control_varlessthanpoint1.txt --exprname TRP_12.5min --exprfile TRP125min_repressed_treatment_varlessthanpoint1.txt --allgenesfile TRP125min_control_allgenes_varlessthanpoint1.txt --conf TRP_UpperLimitFile --outputdir paramFit_TRP_EntireRange_percentile
writing command information to paramFit_TRP_EntireRange_percentile/commandUsed.txt
> Estimated rates saved as paramFit_TRP_EntireRange_percentile/paramEstimationResults_TRP_12.5min_vs_baseline.txt
The output file paramEstimationResults_TRP_12.5min_vs_baseline.txt
contains all the valid parameter sets for all genes. The column names are self-explanatory and depend on the basename
and exprname
given in the input command:
head paramFit_TRP_EntireRange_percentile/paramEstimationResults_TRP_12.5min_vs_baseline.txt
gene P_normalized_baseline B_normalized_baseline P_normalized_TRP_12.5min B_normalized_TRP_12.5min B_by_P_baseline pauseRelease_kelong30_baseline pauseRelease_kelong45_baseline pauseRelease_kelong60_baseline B_by_P_TRP_12.5min pauseRelease_kelong30_TRP_12.5min pauseRelease_kelong45_TRP_12.5min pauseRelease_kelong60_TRP_12.5min pauseRelease_FC initiationRate_FC_bound1 initiationRate_FC_bound2 initiationRate_baseline_kelong45_termination_point01 initiationRate_baseline_kelong45_termination_point1 initiationRate_baseline_kelong45_termination_1 initiationRate_TRP_12.5min_kelong45_termination_point01 initiationRate_TRP_12.5min_kelong45_termination_point1 initiationRate_TRP_12.5min_kelong45_termination_1
Tcf24 0.07061542377222484 0.0006178062562194436 0.0304964963243157 0.0001871790698991729 0.008748885487287061 0.26246656461861184 0.39369984692791776 0.5249331292372237 0.006137723753857254 0.18413171261571762 0.27619756892357644 0.36826342523143524 0.7015435009151662 0.43186735553247346 0.30297373653122617 0.02850743576759721 0.03486282390709745 0.0984167053020998 0.008728023108705936 0.01147270777789435 0.03891955446977848
Cops5 0.660029839005765 0.00029523512189788217 0.1276985874892177 0.0002142138705795001 0.0004473057193029473 0.013419171579088418 0.02012875736863263 0.026838343158176837 0.0016774960067399915 0.050324880202199745 0.07548732030329962 0.10064976040439949 3.750222575633717 0.19347396124026195 0.7255704172405131 0.019885878875462348 0.0792885643859812 0.6733154194911697 0.010916610050969682 0.022409482924999273 0.1373382116652952
Cspp1 5.808133874643234 0.0001951353465723098 0.8930602858055213 0.00019013160392264555 3.359690922831837e-05 0.001007907276849551 0.0015118609152743267 0.002015814553699102 0.00021289895760077485 0.006386968728023246 0.009580453092034869 0.012773937456046491 6.336861410493238 0.15376027913274945 0.9743575793029887 0.06686242934218628 0.5895944780600774 5.816914965238988 0.017486525034574263 0.09786195075707119 0.9016162079820403
Arfgef1 0.04274577064471635 0.00016856765189550999 0.012284530195248049 0.00013639281820717074 0.003943493107109206 0.1183047932132762 0.1774571898199143 0.2366095864265524 0.011102811099762753 0.3330843329928826 0.49962649948932386 0.6661686659857652 2.8154762283587997 0.2873858632085861 0.809128066230148 0.008013002041745113 0.011860121399769585 0.0503313149800143 0.006260522121275164 0.007366129838847489 0.018422207014570733
Rnf152 0.05363813362067858 5.2593033768977255e-05 0.004023233677147883 3.3974821209427734e-05 0.0009805157304858493 0.029415471914575476 0.04412320787186322 0.05883094382915095 0.008444655204197057 0.2533396561259117 0.3800094841888676 0.5066793122518234 8.612462749590666 0.07500696623039929 0.6459947026191188 0.002903067855810762 0.0077304998816718346 0.05600482014028256 0.0015690992911957268 0.0019311903221390363 0.005552100631572131
Pign 2.47604335555345 0.00013183138651792507 0.48941854829535325 9.569967327884901e-05 5.3242761772423756e-05 0.0015972828531727126 0.002395924279759069 0.003194565706345425 0.00019553748751895765 0.00586612462556873 0.008799186938353094 0.01173224925113746 3.6725647019353755 0.1976615422333578 0.7259248029363383 0.03069284594884113 0.2535367479486516 2.4819757679467567 0.009200670780501737 0.053248340127083536 0.49372503359290143
Relch 0.25879454311303635 0.00019387088540558193 0.08322818970801042 0.00015417981044294933 0.0007491304997142191 0.022473914991426574 0.03371087248713986 0.04494782998285315 0.0018524950618757727 0.055574851856273186 0.08336227778440977 0.11114970371254637 2.4728602861350177 0.32159947697064845 0.7952705746425097 0.01131213527438155 0.03460364415455482 0.26751873295628753 0.007770373367012824 0.015260910440733763 0.09016628117794315
Kdsr 1.177702296754216 0.00025758243070728 0.209022501691584 0.00020960500451777837 0.00021871608081022276 0.006561482424306682 0.009842223636460024 0.013122964848613364 0.0010027867948258213 0.03008360384477464 0.04512540576716196 0.06016720768954928 4.584879132394143 0.1774833098888034 0.8137395238574177 0.023368232349369758 0.1293614390572492 1.1892935061360435 0.011522450220215866 0.030334475372458432 0.21845472689488404
Vps4b 1.4260412527299717 0.0002970162232871636 0.6272827059770899 0.00024090206454146785 0.00020828024625414195 0.006248407387624258 0.009372611081436388 0.012496814775248517 0.00038404066020953915 0.011521219806286174 0.01728182970942926 0.02304243961257235 1.8438650189655321 0.43987697044264196 0.8110737584477228 0.027626142575222078 0.15596985532091953 1.4394069827778941 0.017113419964136952 0.07356886350207505 0.6381232988814559
One way to display the parameter results is described in the section Plotting Rates.
The user may also apply custom functions to display or summarise the parameter sets in paramEstimationResults_TRP_12.5min_vs_baseline.txt.
_normalized_
in the column names refers to pause and body densities transformed to occupancy values (as described in Mukherjee et al.). Please refer to the section Saturation values,
where running polcomp
with --norm saturation
is discussed.
We will be using the Triptolide (TRP) treatment dataset from Jonkers et al. 2014.
We will process the raw GRO-seq files from the control
and TRP 12.5 min
conditions. Mouse V6.5 cells were processed in two replicates. The accession numbers are: control rep1
- SRR935093, SRR935094; control rep2
- SRR935095, SRR935096; and TRP 12.5 min
- SRR935119 (rep1), SRR935120 (rep2). As Arabidopsis thaliana RNAs were used as spike-ins, we also remove reads aligning to the Arabidopsis genome. We are using the Mus_musculus.GRCm39.105
genome for mouse and GCF_000001735.4
for Arabidopsis. Processing scripts for all other datasets used in Mukherjee et al. are present in the processingscripts/
folder on GitHub: TBD.
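The raw files can be fetched from SRA using the accession numbers above (a sketch, assuming sra-tools is installed; runs belonging to the same replicate, e.g. SRR935093 and SRR935094 for control rep1, can afterwards be concatenated and renamed to the V65_* names used below):
for acc in SRR935093 SRR935094 SRR935095 SRR935096 SRR935119 SRR935120; do
    prefetch ${acc}
    fasterq-dump ${acc} # writes ${acc}.fastq for single-end runs
    gzip ${acc}.fastq
done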
We follow almost the same workflow as detailed in Nascent RNA Methods (Smith et al. 2021, PEPPRO) and Scott et al. 2022, with the following changes:
We use the following file as our template and replace the string placeholder
within it with the names of our replicates:
cat processPRO_seqdata_placeholder_spikeIn.sh
#!/bin/bash
#SBATCH --job-name=processPRO_seqdata
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 8
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=END
#SBATCH --mem=50G
#SBATCH --mail-user=rmukherjee@uchc.edu
#SBATCH -o processPRO_seqdata_placeholder_%j.out
#SBATCH -e processPRO_seqdata_placeholder_%j.err
set -e
# add path to the R files
export PATH=$PATH:/home/FCAM/rmukherjee/Spring2022Rotation/R_files
module load cutadapt;
module load seqtk;
module load bowtie2;
module load bedtools;
module load samtools;
module load bamtools;
module load genometools
seqBiasDir=/home/FCAM/rmukherjee/seqOutBias/target/release
filesDir=/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/fastqfiles
UMI_length=0
mouseGenomeIndex=/home/FCAM/rmukherjee/Spring2022Rotation/mm39_bt/mm39
mouserDNAIndex=/home/FCAM/rmukherjee/Spring2022Rotation/mouse_rDNA_bt2/mouse_rDNA
name=placeholder
annotation_prefix=/home/FCAM/rmukherjee/Spring2022Rotation/genome/annotations/Mus_musculus.GRCm39.105
aThalianaGenomeIndex=/home/FCAM/rmukherjee/arabidopsisThalianaGenome/AThaliana_GCF1735
cores=8
read_size=50
parentMouseGenomeDir=/home/FCAM/rmukherjee/Spring2022Rotation/genome
tallymer=${parentMouseGenomeDir}/mm39.tal_${read_size}.gtTxt.gz
table=${parentMouseGenomeDir}/mm39_${read_size}.4.2.2.tbl
aThalianaGenomeDir=/home/FCAM/rmukherjee/arabidopsisThalianaGenome
aThalianaTallymer=${aThalianaGenomeDir}/GCF_000001735.4.tal_${read_size}.gtTxt.gz
aThalianaTable=${aThalianaGenomeDir}/GCF_000001735.4_${read_size}.4.2.2.tbl
cd ${filesDir}
gunzip ${name}.fastq.gz
cutadapt --cores=${cores} -m $((UMI_length+2)) -O 1 -a TGGAATTCTCGGGTGCCAAGG ${name}.fastq \
-o ${name}_noadap.fastq --too-short-output ${name}_short.fastq > ${name}_cutadapt.txt
## initialize the metrics file
#echo -e "value\texperiment\tthreshold\tmetric" > ${name}_QC_metrics.txt
#PE1_total=$(wc -l ${name}.fastq | awk '{print $1/4}')
#PE1_w_Adapter=$(wc -l ${name}_short.fastq | awk '{print $1/4}')
#AAligation=$(echo "scale=2 ; $PE1_w_Adapter / $PE1_total" | bc)
#echo -e "$AAligation\t$name\t0.80\tAdapter/Adapter" >> ${name}_QC_metrics.txt
seqtk seq -L $((UMI_length+10)) ${name}_noadap.fastq > ${name}_noadap_trimmed.fastq
#
########################
####### since UMI length is zero
##### - no fqdedup as UMI is not present
##### - no reverse complementing for GRO-seq, as GRO libraries have opposite adapters on the 3' and 5' ends.
mv ${name}_noadap_trimmed.fastq ${name}_processed_UMI_${UMI_length}.fastq
#########################
## bowtie log showed highest alignment with both b and e removed
#seqtk trimfq -b ${UMI_length} -e ${UMI_length} ${name}_dedup.fastq | seqtk seq -r - > ${name}_processed_UMI_${UMI_length}.fastq
##bowtie2 -p $((cores)) -x ${aThalianaGenomeIndex} -U ${name}_processed_UMI_${UMI_length}.fastq 2>${name}_bowtie2_${UMI_length}_aThaliana.log | samtools view -b - | \
# samtools sort - -o ${name}_${UMI_length}_aThaliana.bam
##parse bowtie log for alignment
#filename=${name}_bowtie2_${UMI_length}_aThaliana.log # logfile
#totalReads=$(wc -l ${name}_processed_UMI_${UMI_length}.fastq | awk '{ print $1/4 }')
#alignmentRate=$(grep "overall alignment rate" $filename | awk -F"%" '{ print $1 }')
#alignmentFraction=$(echo "scale=2 ; ${alignmentRate} / 100" | bc)
#spikeInReads=$(echo "scale=0; ${alignmentFraction} * ${totalReads}" | bc)
#mousealignedPercent=$(grep "overall alignment rate" ${name}_bowtie2_${UMI_length}_human.log | awk -F"%" '{ print $1 }')
#mouseReads=$(echo "scale=0; ${mousealignedPercent} * ${totalReads} * 0.01" | bc)
#echo -e "${alignmentFraction}\t$name\t0.30\tThaliana Alignment Rate" > ${name}_metrics_spikeIn.txt
#echo -e "spikeInReads\tmouseReads\ttotalReads" > ${name}_metrics_spikeIn.txt
#echo -e "${name}\t${spikeInReads}\t${mouseReads}\t${totalReads}" >> ${name}_metrics_spikeIn.txt
#
#bowtie2 -p $((cores)) -x ${mouseGenomeIndex} -U ${name}_processed_UMI_${UMI_length}.fastq 2>${name}_bowtie2_${UMI_length}_mouse.log | samtools view -b - | \
# samtools sort - -o ${name}_${UMI_length}_mouse.bam
##parse bowtie log for alignment
#filename=${name}_bowtie2_${UMI_length}_mouse.log # logfile
#alignmentRate=$(grep "overall alignment rate" $filename | awk -F"%" '{ print $1 }')
#alignmentFraction=$(echo "scale=2 ; ${alignmentRate} / 100" | bc)
#echo -e "${alignmentFraction}\t$name\t0.80\tAlignment Rate" >> ${name}_QC_metrics.txt
# thaliana reads - collect all reads not aligning to mouse genome
bowtie2 -p $((cores)) --maxins 1000 -x ${mouseGenomeIndex} -U ${name}_processed_UMI_${UMI_length}.fastq 2>${name}_bowtie2_non_mouseDNA.log | samtools sort -n - | samtools fastq -f 0x4 - > ${name}.non.mouse.fastq
# align to ribosomal reads and remove them
bowtie2 -p $((cores)) -x ${mouserDNAIndex} -U ${name}_processed_UMI_${UMI_length}.fastq 2>${name}_bowtie2_rDNA.log | \
samtools sort -n - | samtools fastq -f 0x4 - > ${name}.non.rDNA.fastq
reads=$(wc -l ${name}.non.rDNA.fastq | awk '{print $1/4}')
filename=${name}_bowtie2_rDNA.log # logfile
rDNA_alignment=$(grep "overall alignment rate" $filename | awk -F"%" '{ print $1 }')
rDNA_alignmentFraction=$(echo "scale=2 ; ${rDNA_alignment} / 100" | bc)
echo -e "$rDNA_alignmentFraction\t$name\t0.10\trDNA Alignment Rate" >> ${name}_QC_metrics.txt
bowtie2 -p $((cores)) --maxins 1000 -x ${mouseGenomeIndex} -U ${name}.non.rDNA.fastq 2>${name}_bowtie2_non_rDNA.log | samtools view -b - | samtools sort - -o ${name}.nonrDNA.bam
# parse bowtie log to find number of non-rDNA reads mapping to mouse genome.
filename=${name}_bowtie2_non_rDNA.log # logfile
non_rDNA_reads=$(wc -l ${name}.non.rDNA.fastq | awk '{ print $1/4 }')
non_rDNA_alignment=$(grep "overall alignment rate" $filename | awk -F"%" '{ print $1 }')
nonrDNAalignmentFraction=$(echo "scale=2 ; ${non_rDNA_alignment} / 100" | bc)
non_rDNA_ReadCount=$(echo "scale=0 ; ( ${non_rDNA_alignment} * ${non_rDNA_reads} ) / 100" | bc)
echo -e "${nonrDNAalignmentFraction}\t$name\t0.80\tAlignment Rate" >> ${name}_QC_metrics.txt
#
### complexity and theoretical read depth
#fqComplexity -i ${name}_noadap_trimmed.fastq
#PE1_total=$(wc -l ${name}.fastq | awk '{print $1/4}')
#PE1_noadap_trimmed=$(wc -l ${name}_noadap_trimmed.fastq | awk '{print $1/4}')
#factorX=$(echo "scale=2 ; $PE1_noadap_trimmed / $PE1_total" | bc)
#echo fraction of reads that are not adapter/adapter ligation products or below 10 base inserts
#echo $factorX
###calculate PE1 deduplicated reads
#PE1_dedup=$(wc -l ${name}_dedup.fastq | awk '{print $1/4}')
###divide
#factorY=$(echo "scale=2 ; $non_rDNA_ReadCount / $PE1_dedup" | bc)
##re-run with factors
#fqComplexity -i ${name}_noadap_trimmed.fastq -x $factorX -y $factorY
#
bowtie2 -p $((cores)) -x ${aThalianaGenomeIndex} -U ${name}.non.mouse.fastq 2>${name}_bowtie2_${UMI_length}_aThaliana.log | samtools view -b - | \
samtools sort - -o ${name}_${UMI_length}_aThaliana.bam
##parse bowtie log for alignment
filename=${name}_bowtie2_${UMI_length}_aThaliana.log # logfile
totalReads=$(wc -l ${name}_processed_UMI_${UMI_length}.fastq | awk '{ print $1/4 }')
alignmentRate=$(grep "overall alignment rate" $filename | awk -F"%" '{ print $1 }')
alignmentFraction=$(echo "scale=2 ; ${alignmentRate} / 100" | bc)
spikeInReads=$(echo "scale=0; ${alignmentFraction} * ${totalReads}" | bc)
mousealignedPercent=$(grep "overall alignment rate" ${name}_bowtie2_non_mouseDNA.log | awk -F"%" '{ print $1 }') # mouse alignment rate of processed reads, logged by the earlier bowtie2 run against the mouse genome
mouseReads=$(echo "scale=0; ${mousealignedPercent} * ${totalReads} * 0.01" | bc)
#echo -e "${alignmentFraction}\t$name\t0.30\tThaliana Alignment Rate" > ${name}_metrics_spikeIn.txt
echo -e "exprName\tspikeInReads\tmouseReads\ttotalReads" > ${name}_metrics_spikeIn.txt
echo -e "${name}\t${spikeInReads}\t${mouseReads}\t${totalReads}" >> ${name}_metrics_spikeIn.txt
## convert to bigWig and BED6 using seqOutBias
${seqBiasDir}/seqOutBias scale $table ${name}.nonrDNA.bam --no-scale --stranded --bed-stranded-positive \
--bw=${name}.bigWig --bed=${name}.bed \
--tail-edge --read-size=${read_size} --tallymer=$tallymer
grep -v "random" ${name}_not_scaled.bed | grep -v "chrUn" | grep -v "chrEBV" | sort -k1,1 -k2,2n > ${name}_tmp.txt
mv ${name}_tmp.txt ${name}_not_scaled.bed
# run seqoutBias on aThaliana bam
${seqBiasDir}/seqOutBias scale ${aThalianaTable} ${name}_${UMI_length}_aThaliana.bam --no-scale --stranded --bed-stranded-positive \
--bw=${name}_aThaliana.bigWig --bed=${name}_aThaliana.bed \
--tail-edge --read-size=${read_size} --tallymer=${aThalianaTallymer}
sort -k1,1 -k2,2n ${name}_aThaliana_not_scaled.bed > ${name}_pause_counts_tmp.bed
mv ${name}_pause_counts_tmp.bed ${name}_aThaliana_not_scaled.bed
##count reads in pause region
mapBed -null "0" -s -a $annotation_prefix.pause.bed -b ${name}_not_scaled.bed | \
awk '$7>0' | sort -k5,5 -k7,7nr | sort -k5,5 -u > ${name}_pause.bed
#discard anything with chr and strand inconsistencies
join -1 5 -2 5 ${name}_pause.bed $annotation_prefix.bed | \
awk '{OFS="\t";} $2==$8 && $6==$12 {print $2, $3, $4, $1, $6, $7, $9, $10}' | \
awk '{OFS="\t";} $5 == "+" {print $1,$2+480,$8,$4,$6,$5} $5 == "-" {print $1,$7,$2 - 380,$4,$6,$5}' | \
awk '{OFS="\t";} $3>$2 {print $1,$2,$3,$4,$5,$6}' > ${name}_pause_counts_body_coordinates.bed
sort -k1,1 -k2,2n ${name}_pause_counts_body_coordinates.bed > ${name}_pause_counts_tmp.bed
mv ${name}_pause_counts_tmp.bed ${name}_pause_counts_body_coordinates.bed
#column ten is Pause index
mapBed -null "0" -s -a ${name}_pause_counts_body_coordinates.bed \
-b ${name}_not_scaled.bed | awk '$7>0' | \
awk '{OFS="\t";} {print $1,$2,$3,$4,$5,$6,$7,$5/100,$7/($3 - $2)}' | \
awk '{OFS="\t";} {print $1,$2,$3,$4,$5,$6,$7,$8,$9,$8/$9}' > ${name}_pause_body.bed
pause_index.R ${name}_pause_body.bed
#
mapBed -null "0" -s -a $annotation_prefix.introns.bed \
-b ${name}_not_scaled.bed | awk '$7>0' | \
awk '{OFS="\t";} {print $1,$2,$3,$5,$5,$6,$7,($3 - $2)}' > ${name}_intron_counts.bed
mapBed -null "0" -s -a $annotation_prefix.no.first.exons.named.bed \
-b ${name}_not_scaled.bed | awk '$7>0' | \
awk '{OFS="\t";} {print $1,$2,$3,$4,$4,$6,$7,($3 - $2)}' > ${name}_exon_counts.bed
exon_intron_ratio.R ${name}_exon_counts.bed ${name}_intron_counts.bed
gzip ${name}.fastq
#cat *_QC_metrics.txt | awk '!x[$0]++' > project_QC_metrics.txt
#plot_all_metrics.R project_QC_metrics.txt Estrogen_treatment_PRO
We can replace the placeholder
and submit jobs in the following manner (example submission for a Slurm environment on the UConn HPC cluster):
for filename in V65_cntrl_rep1 V65_cntrl_rep2 V65_TRP_treat_12min_rep1 V65_TRP_treat_12min_rep2; do
sed "s/placeholder/${filename}/g" processPRO_seqdata_placeholder_spikeIn.sh > processPRO_seqdata_${filename}_spikeIn.sh
echo ${filename}_spikeIn
sbatch processPRO_seqdata_${filename}_spikeIn.sh
sleep 1
done
We will eventually normalize the data by DESeq2 size factors. For running DESeq2, we will use the following criteria (see the sketch after this list):
* consider genes longer than 50 kbp and treat the regions from [TSS + 50 kbp] to the TTS as unchanging between control and 12.5 min TRP treatment (TSS = transcription start site, TTS = transcription termination site);
* for differential expression analysis by DESeq2, only consider the regions from TSS + 500 bp to TSS + 25 kbp.
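As a quick sketch of the length cutoff (illustrative; it assumes a six-column BED of gene coordinates such as the mastercoords.bed produced later, and the R script further below does the actual region bookkeeping):
awk -F'\t' '($3 - $2) >= 50000' mastercoords.bed | wc -l # number of genes at least 50 kbp long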
We create normalized files in the same manner as described in PRO_normalization
of Nascent RNA Methods, using normalize_bedGraph_v2.py
for increased execution speed.
cat normalization_factor.R
#!/usr/bin/env Rscript
# modified from normalization_factor.R because of different file names
# assumes the naming convention CELL_COND_TIME_ETC_rep<#>_<plus><minus>.bigWig
# assumes that you are running the script in the directory with your bigWigs
if(!require(devtools)) { install.packages("devtools", repos='http://cran.us.r-project.org') }
if(!require(bigWig)) {devtools::install_github("andrelmartins/bigWig", subdir="bigWig")}
library(bigWig)
coverage = c()
file.name = c()
# commented out from the original code as it seems not to serve any purpose
#for (plus.bigWig in Sys.glob(file.path("*_rep1_plus_PE1.bigWig"))) { # name changed
# file.prefix = strsplit(plus.bigWig, '_rep1')[[1]][1]
# #print(file.prefix)
# minus.bigWig = paste0(strsplit(plus.bigWig, 'plus')[[1]][1], "plus.bigWig") # should actually be minus.bigWig
# #print(minus.bigWig)
# }
# these bigWigs were created using seqOutBias
for (replicate.plus.bigWig in Sys.glob(file.path("*_plus.bigWig"))) { # name changed
#print(replicate.plus.bigWig)
replicate.prefix = strsplit(replicate.plus.bigWig, '_plus.bigWig')[[1]][1] # name changed
replicate.minus.bigWig = paste0(replicate.prefix, "_minus.bigWig") # prefix already has '_plus.bigWig' stripped
#print(replicate.minus.bigWig)
bigwig.plus = load.bigWig(replicate.plus.bigWig)
bigwig.minus = load.bigWig(replicate.minus.bigWig)
coverage.bigwig.plus = bigwig.plus$basesCovered*bigwig.plus$mean # total signal = bases covered * mean per-base coverage
coverage.bigwig.minus = abs(bigwig.minus$basesCovered*bigwig.minus$mean) # abs used because of negative values on the minus strand
coverage = append(coverage, coverage.bigwig.plus)
coverage = append(coverage, coverage.bigwig.minus)
file.name = append(file.name, replicate.plus.bigWig)
file.name = append(file.name, replicate.minus.bigWig)
#print(coverage)
unload.bigWig(bigwig.plus)
unload.bigWig(bigwig.minus)
}
# NOTE: change the following prefix to `basename` in PRO_normalization (line 34)
#celltype.prefix = strsplit(file.prefix, '_')[[1]][1]
celltype.prefix = "V65"
write.table(data.frame(file.name, coverage), file = paste0(celltype.prefix, '_normalization.txt'), sep = '\t', row.names = FALSE, col.names=FALSE, quote =FALSE)
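normalization_factor.R assumes that every *_plus.bigWig in the directory has a matching *_minus.bigWig; a quick shell check (illustrative):
for f in *_plus.bigWig; do
    m=${f%_plus.bigWig}_minus.bigWig
    [ -f "$m" ] || echo "missing: $m"
done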
cat PRO_normalization
#! /bin/sh
# modified from original code as file names are different: https://github.com/guertinlab/Nascent_RNA_Methods/blob/main/PRO_normalization
# put this in your path: normalize_bedGraph.py
# put this executable in your path: normalization_factor.R
#normalization_factor_v2.R and normalize_bedGraph_v2.py are present in the same directory and have been modified accordingly.
set -e
module load R/4.1.2
export PATH=${PATH}:/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014 # for normalize_bedGraph_v2.py
export PATH=${PATH}:/home/FCAM/rmukherjee/Summer2022Rotation/R_files # for bigWigToBedGraph
while getopts "c:" OPTION
do
case $OPTION in
c)
chrSizes=$OPTARG
;;
esac
done
bigWigFilesDir=/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/fastqfiles
cd ${bigWigFilesDir}
if [ -n "$chrSizes" ]; then
Rscript /home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/normalization_factor.R # modified version of normalization_factor.R from Guertin lab's GitHub
#for i in V65_cntrl_rep1_plus.bigWig V65_FP_treat_25min_rep1_plus.bigWig V65_TRP_treat_12min_rep1_plus.bigWig; # name modified
for i in V65_cntrl_rep1_plus.bigWig V65_TRP_treat_12min_rep1_plus.bigWig; # name modified
do
name=$(echo $i | awk -F"_rep1_plus.bigWig" '{print $1}') # name modified
printf "%s - processing individual replicates\n" ${name}
basename=$(echo $i | awk -F"_" '{print $1}')
# count the number of replicates
reps=$(ls ${name}_rep?_plus.bigWig | wc -w | bc) # modified name
for j in ${name}_rep?_plus.bigWig;
do
repNum=$(echo $j | awk -F"rep" '{print $NF}' | awk -F"_plus" '{print $1}') # extract the replicate number
invscalePlus=$(grep "${name}_rep${repNum}_plus.bigWig" ${basename}_normalization.txt | awk -F" " '{print $2}' | bc)
invscaleMinus=$(grep "${name}_rep${repNum}_minus.bigWig" ${basename}_normalization.txt | awk -F" " '{print $2}' | bc)
echo $invscalePlus
echo $invscaleMinus
# scale to 10 million
invScale=$(bc <<< "$invscalePlus + $invscaleMinus") # bc rather than expr, since the coverage values may be fractional
scale=$(bc <<< "scale=3 ; 10000000 / $invScale") # normalised to 1e7 reads
bigWigToBedGraph ${name}_rep${repNum}_plus.bigWig ${name}_rep${repNum}_plus.bedGraph
bigWigToBedGraph ${name}_rep${repNum}_minus.bigWig ${name}_rep${repNum}_minus.bedGraph
normalize_bedGraph_v2.py -i ${name}_rep${repNum}_plus.bedGraph -s $scale -o ${name}_rep${repNum}_plus_normalized.bedGraph
normalize_bedGraph_v2.py -i ${name}_rep${repNum}_minus.bedGraph -s $scale -o ${name}_rep${repNum}_minus_normalized.bedGraph
bedGraphToBigWig ${name}_rep${repNum}_plus_normalized.bedGraph $chrSizes ${name}_rep${repNum}_plus_normalized.bigWig
bedGraphToBigWig ${name}_rep${repNum}_minus_normalized.bedGraph $chrSizes ${name}_rep${repNum}_minus_normalized.bigWig
done
printf "%s - creating and merging normalized replicate bigWigs\n" ${name}
plusfiles=$(ls ${name}_rep*_plus_normalized.bigWig)
bigWigMerge $plusfiles tmpPlus.bg
minusfiles=$(ls ${name}_rep*_minus_normalized.bigWig)
bigWigMerge -threshold=-10000000000 $minusfiles tmpMinus.bg # for bigWigMerge utility: -threshold=0.N - don't output values at or below this threshold. Default is 0.0
scaleall=$(bc <<< "scale=4 ; 1.0 / $reps")
normalize_bedGraph_v2.py -i tmpPlus.bg -s $scaleall -o ${name}_plus_normalized.bg # scaleall=1/rep#
normalize_bedGraph_v2.py -i tmpMinus.bg -s $scaleall -o ${name}_minus_normalized.bg
sort -k1,1 -k2,2n ${name}_plus_normalized.bg > ${name}_plus_normalized_sorted.bg
sort -k1,1 -k2,2n ${name}_minus_normalized.bg > ${name}_minus_normalized_sorted.bg
bedGraphToBigWig ${name}_plus_normalized_sorted.bg $chrSizes ${name}_plus_normalized.bigWig
bedGraphToBigWig ${name}_minus_normalized_sorted.bg $chrSizes ${name}_minus_normalized.bigWig
rm ${name}_plus_normalized.bg
rm ${name}_minus_normalized.bg
rm ${name}_plus_normalized_sorted.bg
rm ${name}_minus_normalized_sorted.bg
rm tmpPlus.bg
rm tmpMinus.bg
done
fi
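With the mouse chromosome sizes file used elsewhere in this workflow, the script can be invoked as follows (illustrative path):
sh PRO_normalization -c /home/FCAM/rmukherjee/Spring2022Rotation/genome/mm39.chrom.sizes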
We will now use the normalized files from the control
condition - V65_cntrl_plus_normalized.bigWig
and V65_cntrl_minus_normalized.bigWig
- for estimating TSSs in the next section.
We will use primaryTranscriptAnnotation
(Anderson et al. 2020, GitHub) to estimate the TSS and TTS. Comments explain why particular thresholds were used:
cat createPrimaryTranscAnnot_Mouse_cntThresh_point9.R
library(bigWig)
library(NMF)
library(dplyr)
library(pracma)
library(RColorBrewer)
library(dplyr)
# source primaryTranscriptAnnotation package files here
source("/home/FCAM/rmukherjee/Summer2022Rotation/genomicsRpackage-master/primaryTranscriptAnnotation/R/gene_ann.R")
source("/home/FCAM/rmukherjee/Summer2022Rotation/genomicsRpackage-master/primaryTranscriptAnnotation/R/map_tu.R")
# modified from get.TSS()
get.counts.around.potential.TSS <- function(bed.in=NULL, bed.out=NULL, bw.plus=NULL, bw.minus=NULL,
bp.range=NULL, cnt.thresh=NULL, low.read=FALSE, by="cnt"){
final.df = c()
# error handling. note that input and output genes must be matched
if( length(unique(bed.in$gene)) != length(unique(bed.out$gene)) ){stop("input genes do not match output genes")}
if( setequal(unique(bed.in$gene),unique(bed.out$gene)) == FALSE ){stop("input genes do not match output genes")}
# loop through each unique gene
# map reads in a range around each annotated TSS
bed0 = bed.out
bed = bed.in
genes = unique(bed$gene)
tss = rep(0,length(genes))
for(ii in 1:length(genes)){
ind.gene = which(bed$gene == genes[ii])
strand.gene = unique(bed$strand[ind.gene])[1]
chr.gene = unique(bed$chr[ind.gene])[1]
# get all sets of intervals around each annotated TSS
# sort upstream to downstream
if(strand.gene == "-"){
starts = unique(bed$end[ind.gene])
ranges = data.frame(chr=chr.gene, start=starts-bp.range[2],
end=starts-bp.range[1], stringsAsFactors=FALSE)
ranges = ranges[order(ranges$end,decreasing=TRUE),]
}
if(strand.gene == "+"){
starts = unique(bed$start[ind.gene])
ranges = data.frame(chr=chr.gene, start=starts+bp.range[1],
end=starts+bp.range[2], stringsAsFactors=FALSE)
ranges = ranges[order(ranges$start),]
}
# get counts and densities for each interval of interest
if(strand.gene == "-"){ counts = abs(bed.region.bpQuery.bigWig(bw.minus, ranges)) }
if(strand.gene == "+"){ counts = abs(bed.region.bpQuery.bigWig(bw.plus, ranges)) }
dens = counts / (ranges$end - ranges$start)
# select the best TSS, select the most upstream for ties
if(by == "cnt"){ind.max = which(counts == max(counts))[1]}
if(by == "den"){ind.max = which(dens == max(dens))[1]}
final.df = rbind(final.df, c(genes[ii], counts[ind.max], ranges[ind.max, ], strand.gene))
}
final.df = as.data.frame(final.df)
return(final.df)
}
# load transcript, firstExon files
annotDir="/home/FCAM/rmukherjee/Summer2022Rotation/annotations"
outputDir = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/primaryTranscriptAnnotation"
# import data for first exons, annotate, and remove duplicate transcripts
fname = sprintf("%s/gencode.mm39.firstExon.bed", annotDir)
dat0 = read.table(fname,header=F,stringsAsFactors=F)
names(dat0) = c('chr', 'start', 'end', 'gene', 'xy', 'strand')
dat0 = unique(dat0)
gencode.firstExon = dat0
# import data for all transcripts, annotate, and remove duplicate transcripts
fname = sprintf("%s/gencode.mm39.transcript.bed", annotDir)
dat0 = read.table(fname,header=F,stringsAsFactors=F)
names(dat0) = c('chr', 'start', 'end', 'gene', 'xy', 'strand')
dat0 = unique(dat0)
gencode.transcript = dat0
# chromosome sizes
chrom.sizes = read.table("/home/FCAM/rmukherjee/Spring2022Rotation/genome/mm39.chrom.sizes",stringsAsFactors=F,header=F)
names(chrom.sizes) = c("chr","size")
bigwigDir = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/fastqfiles"
#plus.file = sprintf("%s/TRP_FP_control_merged_plus.bigWig", bigwigDir)
#minus.file = sprintf("%s/TRP_FP_control_merged_minus.bigWig", bigwigDir)
#plus.file = sprintf("%s/V65_CNTRL_All_plus_normalized.bigWig", bigwigDir)
#minus.file = sprintf("%s/V65_CNTRL_All_minus_normalized.bigWig", bigwigDir)
plus.file = sprintf("%s/V65_cntrl_plus_normalized.bigWig", bigwigDir)
minus.file = sprintf("%s/V65_cntrl_minus_normalized.bigWig", bigwigDir)
bw.plus = load.bigWig(plus.file)
bw.minus = load.bigWig(minus.file)
cntrl.rep1.plus = load.bigWig(sprintf("%s/V65_cntrl_rep1_plus.bigWig", bigwigDir))
cntrl.rep1.minus = load.bigWig(sprintf("%s/V65_cntrl_rep1_minus.bigWig", bigwigDir))
cntrl.rep2.plus = load.bigWig(sprintf("%s/V65_cntrl_rep2_plus.bigWig", bigwigDir))
cntrl.rep2.minus = load.bigWig(sprintf("%s/V65_cntrl_rep2_minus.bigWig", bigwigDir))
# (using gencode transcript bed) get intervals for furthest TSS and TTS +/- interval
# stitch together the lowest start and highest end of each gene
largest.interval.bed = get.largest.interval(bed=gencode.transcript)
saveRDS(largest.interval.bed, file=sprintf("%s/gencode.largest.interval.rds", outputDir))
# (use gencode transcript bed) get read counts and densities for each annotated gene
transcript.reads = read.count.transcript(bed=gencode.transcript,
bw.plus=bw.plus, bw.minus=bw.minus)
saveRDS(transcript.reads, file = sprintf("%s/gencode.transcript.reads.rds", outputDir))
# decide on cutoffs for discarding reads - evaluate count and density distributions
pdf(sprintf("%s/read_count_density.pdf", outputDir))
par(mfrow=c(1,2))
hist(log(transcript.reads$density), breaks=200,
col="black",xlab="log read density",main="")
hist(log(transcript.reads$counts), breaks=200,
col="black",xlab="log read count",main="")
dev.off()
# create cutoffs and discard genes with low expression
den.cut = -7.2 # i.e., discard if density < exp(-7.2)
cnt.cut = 0.3 # i.e., discard if count < exp(0.3)
# plot the graph
pdf(sprintf("%s/read_count_density_thresholdLine.pdf", outputDir))
par(mfrow=c(1,2))
hist(log(transcript.reads$density), breaks=200,
col="black",xlab="log read density",main="")
abline(v=den.cut, col="red")
hist(log(transcript.reads$counts), breaks=200,
col="black",xlab="log read count",main="")
abline(v=cnt.cut, col="red")
dev.off()
# remove "unexpressed" genes
# get indices based on density and cnt threshold
ind.cut.den = which(log(transcript.reads$density) < den.cut)
ind.cut.cnt = which(log(transcript.reads$counts) < cnt.cut)
ind.cut = union(ind.cut.den, ind.cut.cnt) # take union of indices to be removed
unexp = names(transcript.reads$counts)[ind.cut] # get all unexpressed genes
largest.interval.expr.bed = largest.interval.bed[!(largest.interval.bed$gene %in% unexp),] # get interval bed of all genes which are expressed in sufficient amount
saveRDS(largest.interval.expr.bed, file=sprintf("%s/largest_interval_expr_bed.rds", outputDir))
# select the TSS (defined as having the highest count) for each gene and incorporate these TSSs
# into the largest interval coordinates
bp.range = c(20,160) # estimated using plotMetaProfileOfBigwigs.R
cnt.thresh = 0.9 # chosen because the correct Stat4 annotation had a normalized count of 0.1805; the original threshold of 5 does not suit these read-depth-normalized values, which are low
# the original count threshold was five; took a few cases with read-depth normalized values and computed the new threshold
## normalization factor
# control_rep1 = 0.2495727
# control_rep2 = 0.3610368
# the normalized count calculated using the following cases
# control_rep1 control_rep2 normalizedRead
#1 5 0 0.6239319
#2 0 5 0.9025920
#3 5 5 1.5265239
#4 2 3 0.7911280
#5 3 2 0.7353959
#6 1 4 0.8468600
#7 4 1 0.6796639
bed.out = largest.interval.expr.bed
# keep first-exon entries only for genes present in largest.interval, so both inputs cover the same set of genes
bed.in = gencode.firstExon[gencode.firstExon$gene %in% bed.out$gene,]
saveRDS(bed.in, sprintf("%s/V65.Jonkers.bed.in.rds", outputDir))
# bp.range is the interval within which the greatest read density is searched for
# the following regions of each gene are queried from the bigWigs
# for the - strand, the search window is [end - bp.range[2], end - bp.range[1]]
# for the + strand, the search window is [start + bp.range[1], start + bp.range[2]]
# after the window with the maximum coverage is found,
# the + strand TSS is set to the corresponding annotated "start"
# whereas the - strand TSS is set to the corresponding annotated "end"
# finally, the gene names from the calculated TSSs are used to update the same genes in the largest
# interval bed, where genes on the plus(+) strand get their "start" set to the TSS whereas genes on the minus(-) strand get their "end" set to the TSS
# get.TSS also returns an "issues" object which contains all genes whose ends are lower than their starts or whose starts are less than zero
# the "bed" object returned contains the interval data patched with the TSS as start.
##################### EXERCISE FOR estimating the counts around TSS
#counts.df = get.counts.around.potential.TSS(bed.in=bed.in, bed.out=bed.out,
# bw.plus=bw.plus, bw.minus=bw.minus,
# bp.range=bp.range, cnt.thresh=cnt.thresh)
#counts.df = as.data.frame(counts.df)
#counts.df[,2] =as.numeric(counts.df[,2])
#saveRDS(counts.df, file=sprintf("%s/max.counts.around.potential.tss.rds", outputDir))
## control rep1
#counts.df.cntrl.rep1 = get.counts.around.potential.TSS(bed.in=bed.in, bed.out=bed.out,
# bw.plus=cntrl.rep1.plus, bw.minus=cntrl.rep1.minus,
# bp.range=bp.range, cnt.thresh=cnt.thresh)
#counts.df.cntrl.rep1 = as.data.frame(counts.df.cntrl.rep1)
#colnames(counts.df.cntrl.rep1) = c("gene", "max.count", "chr", "start", "end", "strand")
#counts.df.cntrl.rep1[,2] =as.numeric(counts.df.cntrl.rep1[,2])
#saveRDS(counts.df.cntrl.rep1, file=sprintf("%s/cntrl.rep1.max.counts.potential.tss.rds", outputDir))
## control rep2
#counts.df.cntrl.rep2 = get.counts.around.potential.TSS(bed.in=bed.in, bed.out=bed.out,
# bw.plus=cntrl.rep2.plus, bw.minus=cntrl.rep2.minus,
# bp.range=bp.range, cnt.thresh=cnt.thresh)
#counts.df.cntrl.rep2 = as.data.frame(counts.df.cntrl.rep2)
#counts.df.cntrl.rep2[,2] =as.numeric(counts.df.cntrl.rep2[,2])
#colnames(counts.df.cntrl.rep2) = c("gene", "max.count", "chr", "start", "end", "strand")
#saveRDS(counts.df.cntrl.rep2, file=sprintf("%s/cntrl.rep2.max.counts.potential.tss.rds", outputDir))
#############################
gene.TSS = get.TSS(bed.in=bed.in, bed.out=bed.out,
bw.plus=bw.plus, bw.minus=bw.minus,
bp.range=bp.range, cnt.thresh=cnt.thresh)
TSS.gene = gene.TSS$bed
saveRDS(TSS.gene, file=sprintf("%s/TSS_gene_unfiltered.cnt.thresh.%0.2f.rds", outputDir, cnt.thresh))
# pre-filter to remove poorly annotated genes
for (substr in c("Gm", "Rik", "AC", "AW")) {
grep.result = grep(substr, TSS.gene$gene)
if (length(grep.result) > 0) {
TSS.gene = TSS.gene[-grep.result,]
}
}
saveRDS(TSS.gene, file=sprintf("%s/TSS_gene_filtered.cnt.thresh.%0.2f.rds", outputDir, cnt.thresh))
# remove overlaps by selecting the gene with the highest density
# (for both plus and minus strands) gene.overlaps searches, within each chromosome, for genes whose start lies inside another gene. A dataframe is created listing, for each gene, the genes that start within it ("has.start.inside") and the genes it starts within ("is.start.inside").
# the number of such inner and outer overlap cases is also recorded for each gene (integer overlap).
overlap.data = gene.overlaps( bed = TSS.gene )
# the cases from gene.overlaps are used again within the remove.overlaps function
# only those genes are kept which have the maximum density within the transcript bed
# the TSS.gene object, with overlapping rows filtered out, is returned
filtered.id.overlaps = remove.overlaps(bed=TSS.gene,
overlaps=overlap.data$cases,
transcripts=gencode.transcript,bw.plus=bw.plus, bw.minus=bw.minus, by="den")
# now start finding TTS coordinates
# get intervals for TTS evaluation
# the output from the function is an input for get.TTS()
# for the plus strand,
# add.to.end is the number of bases added to continue searching past the annotated gene end
# the start of the TTS search is set by subtracting fraction.end*(annotated gene length) from (end + add.to.end)
# for the minus strand,
# add.to.end is subtracted from the start
# the end of the TTS search is set by adding fraction.end*(annotated gene length) to (start - add.to.end)
# (for both plus and minus strands) dist.from.start is the number of bases kept clear of the start of the next gene in the annotation
add.to.end = 100000
fraction.end = 0.2
dist.from.start = 50
bed.for.tts.eval = get.end.intervals(bed=filtered.id.overlaps,
add.to.end=add.to.end, fraction.end=fraction.end, dist.from.start=dist.from.start)
saveRDS(bed.for.tts.eval, file=sprintf("%s/bed.for.tts.eval.cnt.thresh.%0.2f.rds", outputDir, cnt.thresh))
# identify gene ends
# @param knot.div (in get.TTS): the number of knots for the spline fit is defined as the number of bins covering the gene end divided by this number;
#' increasing this parameter results in a smoother curve
add.to.end = max(bed.for.tts.eval$xy)
knot.div = 40
pk.thresh = 0.05
bp.bin = 50
knot.thresh = 5
Cnt.thresh = 4 # count threshold for get.TTS
tau.dist = 50000
frac.max = 1
frac.min = 0.3
# get.TTS tries to identify the region containing the last transcription signal.
# as count data is noisy, we need to find proper peaks. For that purpose a spline is fitted with a given number of knots (the number of polynomial pieces).
# then we search for the maximum value reached by the spline. From that peak,
# we move to the last peak of the spline that clears a threshold decided by equations in the code.
# we call the x coordinate of that peak the TTS.
# (the whole process runs in the reverse direction for the minus strand.)
inferred.coords = get.TTS(bed=bed.for.tts.eval, tss=filtered.id.overlaps,
bw.plus=bw.plus, bw.minus=bw.minus,
bp.bin=bp.bin, add.to.end=add.to.end,
pk.thresh=pk.thresh, knot.thresh=knot.thresh,cnt.thresh=Cnt.thresh, tau.dist=tau.dist,
frac.max=frac.max, frac.min=frac.min, knot.div=knot.div)
saveRDS(inferred.coords, file=sprintf("%s/inferred_coords.cnt.thresh.%0.2f.rds", outputDir, cnt.thresh))
coords = inferred.coords$bed
coords.filt = chrom.size.filter(coords=coords, chrom.sizes=chrom.sizes)
coords.filt$log
head(coords.filt$bed)
# process output, remove mitochondrial coordinates, and write out data
out = coords.filt$bed %>% select(chr,start,end,gene)
out = cbind(out, c(500), coords.filt$bed$strand) %>% as.data.frame(stringsAsFactors=F)
# create a gene list for use while merging
pta.gene.list = out$gene # gene lists from primary Transcript annotation workflow
# save the previous data as inferred coords
out = out[-which(out$chr == "chrM"),]
write.table(out,sprintf("%s/inferredcoords.cnt.thresh.%0.2f.bed", outputDir, cnt.thresh),quote=F,sep="\t",col.names=F,row.names=F)
# now, begin to merge together a full annotation list.
removed.gene.interval = largest.interval.bed %>% filter(!(gene %in% pta.gene.list | chr == "chrM"))
names(removed.gene.interval) = names(out) # set the colnames equal
removed.gene.interval[c("c(500)")] = 100 # set score to something low as these were discarded
master.annotation.bed = rbind(out, removed.gene.interval)
write.table(master.annotation.bed, sprintf("%s/mastercoords.cnt.thresh.%0.2f.bed", outputDir, cnt.thresh), quote=F,sep="\t",col.names=F,row.names=F)
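Note: with cnt.thresh = 0.9, the script above writes mastercoords.cnt.thresh.0.90.bed, while the script below reads primaryTranscriptAnnotation/mastercoords.bed. Assuming the 0.90 output is the one to use and we are in the project directory, a copy bridges the two names:
cp primaryTranscriptAnnotation/mastercoords.cnt.thresh.0.90.bed primaryTranscriptAnnotation/mastercoords.bed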
We will first prepare bed files for "affected genes" (TSS + 500 to 25 kb) and "internal control genes" (50 kbp to TTS), as discussed above.
cat parsePTACoordsToShortAffectedAndLongControlGenes.R
library(dplyr)
# wrong # df.pause.body <- readRDS("pauseWindowsAndBWplots/TRP_FP_vscntrl.pTA.pause.body.rds")
df.pause.body = read.table("primaryTranscriptAnnotation/mastercoords.bed", header=F)
colnames(df.pause.body) = c("chr", "start", "end", "gene", "score", "strand")
# split long genes into an internal-control region and an affected region
controlgenes.len.gt.50kb = subset(df.pause.body, end - start >= 49999)
control.genes.df = c()
genes.for.affect = c()
colnames.of.interest = c("chr","start", "end", "gene", "score", "strand")
for (i in c(1:nrow(controlgenes.len.gt.50kb))) {
this.row = controlgenes.len.gt.50kb[i,]
if ( this.row$strand == "+" ) {
copy.row = this.row
this.row$start = this.row$start + 49999
this.row$gene = sprintf("%s_cntrlGene", this.row$gene)
control.genes.df = rbind(control.genes.df, this.row[,colnames.of.interest])
copy.row$end = copy.row$start + 24999
copy.row$gene = sprintf("%s_affected", copy.row$gene)
genes.for.affect = rbind(genes.for.affect, copy.row[,colnames.of.interest])
} else {
copy.row = this.row
this.row$end = this.row$end - 49999
this.row$gene = sprintf("%s_cntrlGene", this.row$gene)
control.genes.df = rbind(control.genes.df, this.row[,colnames.of.interest])
copy.row$start = copy.row$end - 24999
copy.row$gene = sprintf("%s_affected", copy.row$gene)
genes.for.affect = rbind(genes.for.affect, copy.row[,colnames.of.interest])
}
}
# shorter genes
controlgenes.len.short.50kb = subset(df.pause.body, end - start < 49999)
for (i in c(1:nrow(controlgenes.len.short.50kb))) {
this.row = controlgenes.len.short.50kb[i,]
if ( this.row$strand == "+" ) {
copy.row = this.row
copy.row$end = min(copy.row$end, copy.row$start + 24999)
copy.row$gene = sprintf("%s_affected", copy.row$gene)
genes.for.affect = rbind(genes.for.affect, copy.row[,colnames.of.interest])
} else {
copy.row = this.row
copy.row$start = max(copy.row$start, copy.row$end - 24999)
copy.row$gene = sprintf("%s_affected", copy.row$gene)
genes.for.affect = rbind(genes.for.affect, copy.row[,colnames.of.interest])
}
}
write.table(as.data.frame(control.genes.df), file="primaryTranscriptAnnotation/InternalControlGenes_TRP_12min.bed", quote=F, sep="\t", row.names=F, col.names=F)
write.table(as.data.frame(genes.for.affect), file="primaryTranscriptAnnotation/GenesWithRegionsAffected_TRP_12min.bed", quote=F, sep="\t", row.names=F, col.names=F)
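The count-generation script below expects coordinate-sorted versions of these bed files (the *_sorted.bed names); they can be produced with sort (illustrative):
for f in InternalControlGenes_TRP_12min GenesWithRegionsAffected_TRP_12min; do
    sort -k1,1 -k2,2n primaryTranscriptAnnotation/${f}.bed > primaryTranscriptAnnotation/${f}_sorted.bed
done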
We will now generate read count data from our processed GRO-seq files and the bed files for "affected" and "internal control" genes created earlier.
cat deseq2DataCreator_InternalControl_TRP12min.sh
#!/bin/bash
set -e
module load bedtools # specific for Xanadu cluster
proseqfolder=/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/fastqfiles
annotation_bed_TRP_sorted=/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/primaryTranscriptAnnotation/GenesWithRegionsAffected_TRP_12min_sorted.bed
annotationbed_TRP_internalControls=/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/primaryTranscriptAnnotation/InternalControlGenes_TRP_12min_sorted.bed
outputFolder=/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/deseq2
printf "entering proseqfolder: %s\n" "${proseqfolder}"
cd ${proseqfolder}
filenames=$(ls V65*rep?_not_scaled.bed | grep -v "FP")
for filename in ${filenames}
do
name=$(echo $filename | awk -F"_not_scaled.bed" '{print $1}')
echo -e "\t${name}" > ${outputFolder}/${name}_gene_counts_Affected_Trp12min.txt
mapBed -null "0" -s -a ${annotation_bed_TRP_sorted} -b $filename |\
awk '{OFS="\t";} {print $4,$7}' >> ${outputFolder}/${name}_gene_counts_Affected_Trp12min.txt
done
# internal control genes, used in place of spike-ins
for filename in ${filenames}
do
name=$(echo $filename | awk -F"_not_scaled.bed" '{print $1}')
echo -e "\t${name}" > ${outputFolder}/${name}_gene_counts_InternalControl_Trp12min.txt
mapBed -null "0" -s -a ${annotationbed_TRP_internalControls} -b $filename |\
awk '{OFS="\t";} {print $4,$7}' >> ${outputFolder}/${name}_gene_counts_InternalControl_Trp12min.txt
done
printf "entering outputFolder: %s\n" "${outputFolder}"
cd ${outputFolder}
paste -d'\t' *_gene_counts_Affected_Trp12min.txt > PRO_TRP_V65_Affected_GeneCounts.txt
paste -d'\t' *_gene_counts_InternalControl_Trp12min.txt > PRO_TRP_V65_InternalControl_GeneCounts.txt
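paste glues columns side by side, so it assumes that every per-replicate counts file lists genes in the same order. A quick check that the gene columns agree (illustrative):
cd /home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/deseq2
files=(*_gene_counts_Affected_Trp12min.txt)
for f in "${files[@]:1}"; do
    cmp -s <(cut -f1 "${files[0]}") <(cut -f1 "$f") || echo "gene order differs: $f"
done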
cat runDeseq2_TRP_12min_matched_internalControl.R
library(MatchIt)
library(bigWig)
library(DESeq2)
source("/home/FCAM/rmukherjee/Spring2022Rotation/R_files/ZNF143_functions.R")
setwd("/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/deseq2")
outputDir = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/deseq2"
pTA_dir = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/primaryTranscriptAnnotation"
fastqfilesdir = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/fastqfiles"
genecountsDir = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/deseq2"
spikeInCountsFile = sprintf("%s/PRO_TRP_V65_InternalControl_GeneCounts.txt", genecountsDir)
mouseGeneCountFile = sprintf("%s/PRO_TRP_V65_Affected_GeneCounts.txt", genecountsDir) # made from primary transcript annotation
counts.obj.internal = read.table(spikeInCountsFile, sep = '\t', header = T)
row.names.internal = counts.obj.internal[,1]
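# paste duplicated the gene-name column for every sample, so keep only the count columns (every second column)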
counts.obj.internal = counts.obj.internal[,seq(2,to=ncol(counts.obj.internal),by=2)]
rownames(counts.obj.internal) = row.names.internal
## only consider TRP and control files - regex V65_[cT]
## only consider genes with length up to 25 kb
inferred.coords=read.table(sprintf('%s/GenesWithRegionsAffected_TRP_12min_sorted.bed', pTA_dir), sep="\t", header =FALSE)
inferred.coords[,"V4"] = gsub("_affected", "", inferred.coords[,"V4"])
counts.df = abs(get.raw.counts.interval(inferred.coords, fastqfilesdir, file.prefix = "V65_[cT]",
file.plus.suffix = "*rep?_plus.bigWig", file.minus.suffix = "_minus.bigWig"))
saveRDS(counts.df, "raw.counts.TRP12min_affected.rds")
merged.counts.obj = rbind(counts.obj.internal, counts.df)
saveRDS(merged.counts.obj, file = "merged.internal.plus.raw.counts.rds")
cond.list = c("control", "control", "TRP_12min", "TRP_12min")
cond.levels=c("TRP_12min", "control")
sample.conditions = factor(cond.list, levels = cond.levels)
sample.conditions = relevel(sample.conditions, ref="control")
# #PCA for experiments
dds = run.deseq.list.dds(merged.counts.obj, sample.conditions, cond.levels, ControlGenes.Indices=1:nrow(counts.obj.internal))
rld <- rlog(dds)
plotPCA(rld) # basic plot
pca.plot = plotPCA(rld, intgroup="sample.conditions", returnData=TRUE)
pca.plot$sample.conditions = rownames(pca.plot)
plotPCAlattice(pca.plot, file = 'PCA_TRP_12min_PRO.pdf')
rep = factor(sapply(strsplit(colnames(counts.df), 'rep'), '[', 2))
# formula: ~ rep + sample.conditions
deseq.df = DESeqDataSetFromMatrix(counts.df,colData = cbind.data.frame(sample.conditions, rep), ~ rep + sample.conditions)
deseq.df$sizeFactor = estimateSizeFactorsForMatrix(merged.counts.obj, controlGenes = 1:nrow(counts.obj.internal))
saveRDS(deseq.df$sizeFactor, file = sprintf("%s/size.factors.rawCounts.TRPInternalGenes.rds", outputDir))
deseq.df = DESeq(deseq.df)
saveRDS(deseq.df, "deseq.df.TRPinternal.RAW.counts.rds")
# formula: ~ sample.conditions
deseq.df = DESeqDataSetFromMatrix(counts.df,colData = DataFrame(sample.conditions), ~ sample.conditions)
deseq.df$sizeFactor = estimateSizeFactorsForMatrix(merged.counts.obj, controlGenes = 1:nrow(counts.obj.internal))
saveRDS(deseq.df$sizeFactor, file = sprintf("%s/size.factors.modifiedFormula.rawCounts.TRPInternalGenes.rds", outputDir))
deseq.df = DESeq(deseq.df)
saveRDS(deseq.df, "deseq.df.modifiedFormula.TRPinternal.RAW.counts.rds")
# Create tables of activated/repressed genes and their matching genes
for (i in levels(sample.conditions)[-1]) {
res.deseq = results(deseq.df, contrast = c("sample.conditions", i, "control"))
sum(res.deseq$padj < 0.1 & !is.na(res.deseq$padj))
activated = res.deseq[res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange > 0,]
activated.strand = merge(cbind(tighten_summit_window(activated), "Activated"), inferred.coords, by.x = "gene", by.y = "V4")[,c(2, 3, 4, 1, 5, 10)]
write.table(activated.strand, file = paste0(i, '_activated_genes.bed'), quote = FALSE, row.names = FALSE, col.names = FALSE, sep = '\t')
unchanged = res.deseq[!is.na(res.deseq$padj) & res.deseq$padj > 0.1 & abs(res.deseq$log2FoldChange) < 0.01,]
#unchanged = unchanged[sample(x, size, replace = FALSE, prob = NULL),]
unchanged$treatment = 0
activated$treatment = 1
df.deseq.effects.lattice = rbind(unchanged, activated)
out = matchit(treatment ~ baseMean, data = df.deseq.effects.lattice, method = "optimal", ratio = 1)
summary(out, un=F) # don't show stats from pre-matching data
pdf("matchIt.optimal.activated.vs.unchanged.TRP12min.pdf")
plot(out, type="density", interactive = FALSE)
dev.off()
# choose unchanged genes which are matched to activated genes
unchanged = df.deseq.effects.lattice[rownames(df.deseq.effects.lattice) %in% out$match.matrix,]
unchanged.strand = merge(cbind(tighten_summit_window(unchanged), "Matched to Activated"), inferred.coords, by.x = "gene", by.y = "V4")[,c(2, 3, 4, 1, 5, 10)]
write.table(unchanged.strand, file = paste0(i, '_activated_matched_unchanged_genes.bed'), quote = FALSE, row.names = FALSE, col.names = FALSE, sep = '\t')
unchanged$treatment = "Matched to Activated"
activated$treatment = "Activated"
df.x = rbind(activated, unchanged)
# repeat the same procedure for repressed genes
repressed = res.deseq[res.deseq$padj < 0.1 & !is.na(res.deseq$padj) & res.deseq$log2FoldChange < 0,]
repressed.strand = merge(cbind(tighten_summit_window(repressed), "Repressed"), inferred.coords, by.x = "gene", by.y = "V4")[,c(2, 3, 4, 1, 5, 10)]
write.table(repressed.strand, file = paste0(i, '_repressed_genes.bed'), quote = FALSE, row.names = FALSE, col.names = FALSE, sep = '\t')
unchanged = res.deseq[!is.na(res.deseq$padj) & res.deseq$padj > 0.1 & abs(res.deseq$log2FoldChange) < 0.01,]
# unchanged = res.deseq[rownames(res.deseq) %notin% rownames(repressed) & !is.na(res.deseq$padj) & res.deseq$log2FoldChange > 0,]
unchanged$treatment = 0
repressed$treatment = 1
df.deseq.effects.lattice = rbind(unchanged, repressed)
out = matchit(treatment ~ baseMean, data = df.deseq.effects.lattice, method = "optimal", ratio = 1)
summary(out, un=F)
## Warning message: Fewer control units than treated units; not all treated units will get a match.
pdf("matchIt.optimal.repressed.vs.unchanged.TRP12min.pdf")
plot(out, type="density", interactive = FALSE)
dev.off()
unchanged = res.deseq[!is.na(res.deseq$padj) & res.deseq$padj > 0.1 & abs(res.deseq$log2FoldChange) < 0.01,]
unchanged = df.deseq.effects.lattice[rownames(df.deseq.effects.lattice) %in% out$match.matrix,]
unchanged.strand = merge(cbind(tighten_summit_window(unchanged), "Matched to Repressed"), inferred.coords, by.x = "gene", by.y = "V4")[,c(2, 3, 4, 1, 5, 10)]
write.table(unchanged.strand, file = paste0(i, '_repressed_matched_unchanged_genes.bed'), quote = FALSE, row.names = FALSE, col.names = FALSE, sep = '\t')
unchanged$treatment = "Matched to Repressed"
repressed$treatment = "Repressed"
df.x = rbind(df.x, unchanged)
df.x = rbind(df.x, repressed)
}
## MA Plot for activated, repressed, unchanged
temp = df.x
unique(temp$treatment)
dim(res.deseq[is.na(res.deseq$padj),])
otherGenes = res.deseq[!(rownames(res.deseq) %in% rownames(temp)) & !(is.na(res.deseq$padj)),]
otherGenes$treatment = "Other"
ma.df = rbind(temp,otherGenes)
dim(ma.df)
# `matched to activated` and `matched to repressed` are labels of unchanged genes
levelsMA = c("Other","Matched to Activated", "Matched to Repressed","Activated","Repressed")
ma.df$treatment = factor(ma.df$treatment, levels = levelsMA)
saveRDS(ma.df, file = "TRP_treat_matched_DESeq_genes_categorized.rds")
ma.plot.lattice(ma.df, filename='TRP_treat_matched',title.main = "Diff Exp (TRP 12.5min)", col=c("grey90","grey30","grey30", "#ce228e" ,"#2290cf"))
deseqGenes = tighten_summit_window(ma.df)
dim(deseqGenes)
deseqGenesInfo = cbind(deseqGenes, ma.df)
dim(deseqGenesInfo)
# save categorized data
saveRDS(deseqGenesInfo, "deseqGenesInfo.TRP12min.matched.rds")
write.table(deseqGenesInfo,file="TRP_treat_matched_DESeq_genes_categorized.bed",quote = FALSE, sep ='\t', row.names = FALSE, col.names = TRUE)
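As an aside, MatchIt's match.data() offers an equivalent way to retrieve the matched units; a small optional check on the last matchit() fit above:
library(MatchIt)
# optional: match.data() returns the matched treated and control units together
# with matching weights and subclass labels - an alternative to indexing
# df.deseq.effects.lattice rows with out$match.matrix
matched.df = match.data(out)
table(matched.df$treatment)  # 1 = repressed, 0 = matched unchanged (from the last fit)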
We get an MA plot showing 9632 genes as repressed.
We then save the DESeq2 size factors in a tab-separated text file.
cat saveSizeFactorsAsTxtFile.R
### For TRP 12.5min treatment
#genecountsDir = "/Users/rudradeepmukherjee/Documents/UConn Health/comparment.model.project/datasets, genome/Jonkers_2014_Trp_FP_paper"
genecountsDir = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014"
outputDir = sprintf("%s/deseq2", genecountsDir)
size.factors = readRDS(sprintf("%s/size.factors.rawCounts.TRPInternalGenes.rds", outputDir))
#size.factors = readRDS(sprintf("%s/size.factors.modifiedFormula.rawCounts.TRPInternalGenes.rds", outputDir))
size.factors.transpose = t(t(size.factors)) # t(t(.)) coerces the named vector into a one-column matrix for write.table
write.table(size.factors.transpose, sprintf("%s/sizefactors_TRP_12min_InternalGenes.txt",outputDir), sep="\t", quote=F, col.names=F)
### For FP 25min treatment
#genecountsDir = "/Users/rudradeepmukherjee/Documents/UConn Health/comparment.model.project/datasets, genome/Jonkers_2014_Trp_FP_paper"
genecountsDir = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014"
outputDir = sprintf("%s/deseq2", genecountsDir)
#size.factors = readRDS(sprintf("%s/size.factors.FPInternalGenes.rds", outputDir))
size.factors = readRDS(sprintf("%s/size.factors.modifiedFormula.rawCounts.FPInternalGenes.rds", outputDir))
size.factors.transpose = t(t(size.factors))
write.table(size.factors.transpose, sprintf("%s/sizefactors_FP_25min_InternalGenes.txt",outputDir), sep="\t", quote=F, col.names=F)
We use the DESeq2 size factors for normalization. The code is derived from PRO_normalization_sizeFactor of Nascent RNA Methods.
cat PRO_normalization_sizeFactor_TRP
#! /bin/sh
# modified from original code as file names are different: https://github.com/guertinlab/Nascent_RNA_Methods/blob/main/PRO_normalization
# put this in your path: normalize_bedGraph.py
# put this executable in your path: normalization_factor.R
usage() { echo "Usage: $0 -c chrSizes -z sizeFactorsDataTxtFile" 1>&2; exit 1; }
set -e
module load R/4.1.2
export PATH=${PATH}:/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014 # for normalization_bedgraph_v2.py
export PATH=${PATH}:/home/FCAM/rmukherjee/Summer2022Rotation/R_files # for bigWigToBedGraph
while getopts "c:z:" OPTION
do
case $OPTION in
c)
chrSizes=$OPTARG
;;
z) sizeFactorsFile=$OPTARG
;;
*)
usage
;;
esac
done
if [ -z "${chrSizes}" ] || [ -z "${sizeFactorsFile}" ]; then
usage
fi
bigWigFilesDir=/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/fastqfiles
cd ${bigWigFilesDir}
if [ -n "${chrSizes}" ]; then
#NOTE - normalization_factor.R is not needed as we are directly using the sizeFactors file containing two columns -
# `replicateName`, `deseq2.sizefactor`
#Rscript /home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/normalization_factor_v2.R # modified version of normalization_factor.R from Guertin lab's GitHub
#for i in V65_cntrl_rep1_plus.bigWig V65_FP_treat_25min_rep1_plus.bigWig V65_TRP_treat_12min_rep1_plus.bigWig; # name modified
for i in V65_cntrl_rep1_plus.bigWig V65_TRP_treat_12min_rep1_plus.bigWig; # name modified
do
name=$(echo $i | awk -F"_rep1_plus.bigWig" '{print $1}') # name modified
printf "%s - processing individual replicates\n" ${name}
basename=$(echo $i | awk -F"_" '{print $1}')
# count the number of replicates
reps=$(ls ${name}_rep?_plus.bigWig | wc -w | bc) # modified name
for j in ${name}_rep?_plus.bigWig;
do
repNum=$(echo $j | awk -F"rep" '{print $NF}' | awk -F"_plus" '{print $1}') # take the num# of replicate
# NOTE - we will use DESeq2 sizeFactors instead of invscalePlus and invscaleMinus -
#scaleFactor=$(grep ${name}_rep${repNum} ../deseq2/sizefactors_TRP_12min_InternalGenes.txt | awk -F"\t" '{print $2}' | bc)
scaleFactor=$(grep ${name}_rep${repNum} ${sizeFactorsFile} | awk -F"\t" '{print $2}' | bc)
echo "scalefactor for ${name}_rep${repNum} - ${scaleFactor}"
# TAKE THE INVERSE - eqn (6) of Anders and Huber, 2010 finds normalized values by dividing counts with sizeFactors
scaleFactor=$(echo "scale=5; 1/${scaleFactor}" | bc)
echo "scalefactor inverse for ${name}_rep${repNum} - ${scaleFactor}"
# convert to bedGraph and scale by the inverse size factor
bigWigToBedGraph ${name}_rep${repNum}_plus.bigWig ${name}_rep${repNum}_plus.bedGraph
bigWigToBedGraph ${name}_rep${repNum}_minus.bigWig ${name}_rep${repNum}_minus.bedGraph
normalize_bedGraph_v2_pandas.py -i ${name}_rep${repNum}_plus.bedGraph -s $scaleFactor -o ${name}_rep${repNum}_plus_normalized.bedGraph
normalize_bedGraph_v2_pandas.py -i ${name}_rep${repNum}_minus.bedGraph -s $scaleFactor -o ${name}_rep${repNum}_minus_normalized.bedGraph
bedGraphToBigWig ${name}_rep${repNum}_plus_normalized.bedGraph $chrSizes ${name}_rep${repNum}_plus_normalized.bigWig
bedGraphToBigWig ${name}_rep${repNum}_minus_normalized.bedGraph $chrSizes ${name}_rep${repNum}_minus_normalized.bigWig
done
printf "%s - creating and merging normalized replicate bigWigs\n" ${name}
plusfiles=$(ls ${name}_rep*_plus_normalized.bigWig)
bigWigMerge $plusfiles tmpPlus.bg
minusfiles=$(ls ${name}_rep*_minus_normalized.bigWig)
bigWigMerge -threshold=-10000000000 $minusfiles tmpMinus.bg # for bigWigMerge utility: -threshold=0.N - don't output values at or below this threshold. Default is 0.0
scaleall=$(bc <<< "scale=4 ; 1.0 / $reps")
normalize_bedGraph_v2_pandas.py -i tmpPlus.bg -s $scaleall -o ${name}_plus_normalized.bg # scaleall=1/rep#
normalize_bedGraph_v2_pandas.py -i tmpMinus.bg -s $scaleall -o ${name}_minus_normalized.bg
sort -k1,1 -k2,2n ${name}_plus_normalized.bg > ${name}_plus_normalized_sorted.bg
sort -k1,1 -k2,2n ${name}_minus_normalized.bg > ${name}_minus_normalized_sorted.bg
bedGraphToBigWig ${name}_plus_normalized_sorted.bg $chrSizes ${name}_plus_normalized_sizeFactor_TRP.bigWig
bedGraphToBigWig ${name}_minus_normalized_sorted.bg $chrSizes ${name}_minus_normalized_sizeFactor_TRP.bigWig
rm ${name}_plus_normalized.bg
rm ${name}_minus_normalized.bg
rm ${name}_plus_normalized_sorted.bg
rm ${name}_minus_normalized_sorted.bg
rm tmpPlus.bg
rm tmpMinus.bg
done
fi
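The script can then be invoked with a chromosome sizes file and the size-factor table written earlier (the mm10.chrom.sizes file name is illustrative):-
sh PRO_normalization_sizeFactor_TRP -c mm10.chrom.sizes -z deseq2/sizefactors_TRP_12min_InternalGenes.txt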
The processed files are hosted as a track hub which can be loaded in the UCSC Genome Browser. A session can be accessed here.
We will prepare the pause windows and calculate the densities in gene body regions. We assume all packages mentioned in this file have been installed.
Removing genes with high variability: we will also remove genes with high signal variability in their body regions. We will compute body averages in 200 bp windows at step sizes of 50 bp using find.density.across.windows(..., start_bp=0, end_bp=15000, windowsize_bp=200, stepsize_bp=50) and then filter out genes whose body variance/mean exceeds 0.1, as sketched below.
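In isolation, the filter amounts to the following sketch (result_df stands for the windowed densities returned by find.density.across.windows, as in the script below):
library(dplyr)
# summarise the windowed body densities per gene and keep genes whose
# variance/mean ratio stays at or below 0.1
summary.windows = group_by(result_df, gene) %>%
  summarise_at(c("body.avg"), c("mean", "var"))
summary.windows$var_by_mean = summary.windows$var / summary.windows$mean
filteredGenes = subset(summary.windows, var_by_mean <= 0.1)$gene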
cat run_pauseworkflow_InternalTRP_matched.R
# calls functions in pause_workflow_v2.R to plot composites from conditions.
# this code was written for a project where there were three conditions - control, 45min and 90min.
library(dplyr)
library(bigWig)
# source the functions
source("pause_workflow_funcs_v2.R")
## NOTE: CHANGE these PATHS before running on your own platform.
# set the working directory
outputDir = '/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/pauseWindowsAndBWplots'
# load the merged bigWigs from all conditions x replicates
merged.bw.dir = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/fastqfiles"
#bw.plus <- load.bigWig(sprintf('%s/TRP_FP_control_merged_plus.bigWig', merged.bw.dir))
#bw.minus <- load.bigWig(sprintf('%s/TRP_FP_control_merged_minus.bigWig', merged.bw.dir))
normalized.bw.dir = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/fastqfiles"
bw.cntrl.plus = load.bigWig(sprintf('%s/V65_cntrl_plus_normalized_sizeFactor_TRP.bigWig', normalized.bw.dir))
bw.cntrl.minus = load.bigWig(sprintf('%s/V65_cntrl_minus_normalized_sizeFactor_TRP.bigWig', normalized.bw.dir))
bw.TRP.plus = load.bigWig(sprintf('%s/V65_TRP_treat_12min_plus_normalized_sizeFactor_TRP.bigWig', normalized.bw.dir))
bw.TRP.minus = load.bigWig(sprintf('%s/V65_TRP_treat_12min_minus_normalized_sizeFactor_TRP.bigWig', normalized.bw.dir))
deseq2.files.dir = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/deseq2"
categorized.TRPtreatvscontrol = readRDS(sprintf("%s/deseqGenesInfo.TRP12min.matched.rds",deseq2.files.dir))
#categorized.TRPtreatvscontrol$gene = categorized.TRPtreatvscontrol$gene %>% str_replace("_affected", "")
categorized.TRPtreatvscontrol = categorized.TRPtreatvscontrol[!duplicated(categorized.TRPtreatvscontrol$gene),]
rownames(categorized.TRPtreatvscontrol) = categorized.TRPtreatvscontrol$gene
# find pause region for all genes listed in primaryTranscriptAnnotation bed file using the merged bigWigs
# load the coordinates file
affectedGenes.newTTS.bed = "/home/FCAM/rmukherjee/compartmentModel_project_datasets/Jonkers_et_al_2014/primaryTranscriptAnnotation/GenesWithRegionsAffected_TRP_12min_sorted_renamed.bed"
pTA <- read.table(affectedGenes.newTTS.bed, sep="\t")
df.pause.body <- find.pause.regions(pTA,bw.cntrl.plus,bw.cntrl.minus, searchWindow = 300)
rownames(df.pause.body) = df.pause.body$gene
saveRDS(df.pause.body, file=sprintf("%s/Internal_TRP_vscntrl.pause.body.rds",outputDir))
merged.pbody.TRPtreat.vs.control = merge(df.pause.body, as.data.frame(categorized.TRPtreatvscontrol))
merged.pbody.TRPtreat.vs.control = merged.pbody.TRPtreat.vs.control[,c(colnames(df.pause.body), colnames(categorized.TRPtreatvscontrol))]
rownames(merged.pbody.TRPtreat.vs.control) = merged.pbody.TRPtreat.vs.control$gene
saveRDS(merged.pbody.TRPtreat.vs.control, file=sprintf("%s/Internal.merged.pbody.TRP.vs.control.rds",outputDir))
## find genes with high variability
result_df = find.density.across.windows(df.pause.body, bw.cntrl.plus, bw.cntrl.minus, start_bp=0, end_bp=15000,windowsize_bp=200, stepsize_bp=50)
saveRDS(result_df, sprintf("%s/TRP.body.density.rollingwindow.windowsize=200,stepsize=50,bp_end=15000.rds", outputDir))
summary.TRP.windows = group_by(result_df, gene) %>% summarise_at(c("body.avg"), c("mean", "median", "var", "min", "max"))
summary.TRP.windows$var_by_mean = summary.TRP.windows$var / summary.TRP.windows$mean
saveRDS(summary.TRP.windows, sprintf("%s/summary.TRP.windows.rds", outputDir))
#summary.TRP.windows = readRDS( sprintf("%s/summary.TRP.windows.rds", outputDir))
filteredGenes = subset(summary.TRP.windows, var_by_mean <= 0.1)$gene
setwd(outputDir)
# call plotting steps for all genes
run.plotting.steps(merged.pbody.TRPtreat.vs.control, "control_all", bw.cntrl.plus, bw.cntrl.minus,
"TRP_12.5min_all", bw.TRP.plus, bw.TRP.minus,
"TRP_12.5min (all genes, Internal Control)", color.names=c(rgb(1,0,0,1/2), rgb(0,0,1,1/2))) # red, blue from rgb value. Indices set in alphabetical order of column names
density.TRP = readRDS("density.TRP_12.5min_all.v.control_all.rds")
repressed.TRP.baseline = subset(density.TRP, cond == "control_all" & treatment == "Repressed")
repressed.TRP.treatment = subset(density.TRP, cond == "TRP_12.5min_all" & treatment == "Repressed")
colNames = c("gene", "pause.sum", "body.avg")
write.table(repressed.TRP.baseline[,colNames], "InternalControl_repressed_TRP125min_baseline.txt", sep="\t", quote=F, row.names=F)
write.table(repressed.TRP.treatment[,colNames], "InternalControl_repressed_TRP125min_treatment.txt", sep="\t", quote=F, row.names=F)
# call the run.plotting.steps on pause body object containing genes of our interest.
repressed.TRPvscontrol = subset(merged.pbody.TRPtreat.vs.control, treatment == "Repressed")
saveRDS(repressed.TRPvscontrol, file=sprintf("%s/InternalControl.repressed.pbody.TRP.vs.control.rds", outputDir))
run.plotting.steps(repressed.TRPvscontrol, "control", bw.cntrl.plus, bw.cntrl.minus,
"TRP_12.5min (repressed, internal spikein)", bw.TRP.plus, bw.TRP.minus,
"TRP_12.5min(rrepressed, internal control)", color.names=c(rgb(1,0,0,1/2), rgb(0,0,1,1/2))) # red, blue from rgb value. Indices set in alphabetical order of column names
## USE GENES with body variance / mean <= 0.1
repressed.TRP.baseline = subset(density.TRP, cond == "control_all" & treatment == "Repressed")
repressed.TRP.treatment = subset(density.TRP, cond == "TRP_12.5min_all" & treatment == "Repressed")
control.TRP.allgenes.varlessthanpoint1 = subset(density.TRP, cond == "control_all" & gene %in% filteredGenes)
repressed.TRP.baseline.varlessthanpoint1 = subset(density.TRP, cond == "control_all" & treatment == "Repressed" & gene %in% filteredGenes)
repressed.TRP.treatment.varlessthanpoint1 = subset(density.TRP, cond == "TRP_12.5min_all" & treatment == "Repressed"& gene %in% filteredGenes)
colNames = c("gene", "pause.sum", "body.avg")
write.table(subset(control.TRP.allgenes.varlessthanpoint1,pause.sum>0)[, colNames], "TRP125min_control_allgenes_varlessthanpoint1.txt", sep="\t", quote=F, row.names=F)
write.table(repressed.TRP.baseline.varlessthanpoint1[, colNames], "TRP125min_repressed_control_varlessthanpoint1.txt", sep="\t", quote=F, row.names=F)
write.table(repressed.TRP.treatment.varlessthanpoint1[, colNames], "TRP125min_repressed_treatment_varlessthanpoint1.txt", sep="\t", quote=F, row.names=F)
We now have the desired files - TRP125min_repressed_control_varlessthanpoint1.txt and TRP125min_repressed_treatment_varlessthanpoint1.txt - containing pause sums and body densities of the genes of interest. We will use TRP125min_control_allgenes_varlessthanpoint1.txt to estimate a full occupancy value for 1 RNAPII in the pause region.
Note: As discussed in the original manuscript (Fig S1, Mukherjee et al), we used the pause sum at the 90th percentile of TRP125min_control_allgenes_varlessthanpoint1.txt as the occupancy value. See section Running the model.
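For intuition, the occupancy value is simply a percentile of the pause-sum distribution; a minimal sketch in R, assuming the pause_sum column layout shown earlier:
# the 90th-percentile pause sum serves as the full-occupancy value for one
# RNAPII in the pause region (Fig S1, Mukherjee et al)
allgenes = read.table("TRP125min_control_allgenes_varlessthanpoint1.txt",
                      sep = "\t", header = TRUE)
quantile(allgenes$pause_sum, probs = 0.90)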
To demonstrate saturation values from fits to a hyperbolic tangent function (Materials & Methods, Mukherjee et al), we will use --norm saturation:
polcomp estparam --basename baseline --basefile TRP125min_repressed_control_varlessthanpoint1.txt --exprname TRP_12.5min --exprfile TRP125min_repressed_treatment_varlessthanpoint1.txt --allgenesfile TRP125min_control_allgenes_varlessthanpoint1.txt --norm saturation --conf TRP_UpperLimitFile --outputdir paramFit_TRP_EntireRange_Saturation
The output will look like the following:
checking inputs
will use 1153.7 as saturation value corresponding to max index 12762 (at percentile 100.0) provided
saturation fits saved as paramFit_TRP_EntireRange_Saturation/TRP_12.5min_baseline_saturationFit_Yes_logY.pdf
saturation fits saved as paramFit_TRP_EntireRange_Saturation/TRP_12.5min_baseline_saturationFit_No_logY.pdf
parameters sets written as paramFit_TRP_EntireRange_Saturation/TRP_12.5min_vs_baseline_fittedParameters.txt
removed 625 genes from run as they have zero pause sum or body density
Genes with pause sum or body avg as zero in either condition have been written to paramFit_TRP_EntireRange_Saturation/GeneswithZeroPauseSums_or_BodyAvg.txt
Received the command:
polcomp estparam --basename baseline --basefile TRP125min_repressed_control_varlessthanpoint1.txt --exprname TRP_12.5min --exprfile TRP125min_repressed_treatment_varlessthanpoint1.txt --allgenesfile TRP125min_control_allgenes_varlessthanpoint1.txt --norm saturation --conf TRP_UpperLimitFile --outputdir paramFit_TRP_EntireRange_Saturation
writing command information to paramFit_TRP_EntireRange_Saturation/commandUsed.txt
Estimated rates saved as paramFit_TRP_EntireRange_Saturation/paramEstimationResults_TRP_12.5min_vs_baseline.txt
The fitted parameters look like the following:-
head paramFit_TRP_EntireRange_Saturation/TRP_12.5min_vs_baseline_fittedParameters.txt
rankFitted percentileRankFitted A B C saturationVal percentileOfSaturationVal
11486 90.0 2.431881136506332 0.0001237883637946522 -0.23589809241576964 157.0 92.56
12124 95.0 2.6319695245707972 0.00010586476889332477 -0.1976872023523184 271.8 97.48
12762 100.0 3.1943701718101787 7.667947285209665e-05 -0.13226970355876166 1153.7 99.94
The saturation curve can be derived over ranks 1 to rankFitted above, using A*tanh(B*x) + C, where 'x' is a placeholder variable for ranks. Since tanh(B*x) approaches 1 at large x, the maximum saturation value for each of these fits is S = A + C. For plotting different saturation curves, please refer to Plotting saturation fits under the section Miscellaneous Plots.
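As an illustration, the 90th-percentile fit from the table above can be reconstructed in R. Note that 10^(A + C) reproduces the saturationVal column (10^2.196 is roughly 157), which suggests the fit operates on log10-transformed pause sums, consistent with the LOG_HYPER_TANG option used later:
# reconstruct the saturation curve for the 90th-percentile fit (first table row)
A = 2.431881136506332
B = 0.0001237883637946522
C = -0.23589809241576964
x = 1:11486                        # ranks 1..rankFitted
fit = A * tanh(B * x) + C          # fitted curve, apparently on log10 pause sums
S = A + C                          # plateau, since tanh(B*x) -> 1 for large x
10^S                               # ~157, matching saturationVal for this row
plot(x, fit, type = "l", xlab = "pause sum rank", ylab = "fitted value")
abline(h = S, lty = 2)             # dashed line at maximum saturation S = A + C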
NOTE: for datasets created using UMIs, like ZNF143 dTAG (Dong et al 2024), Dexamethasone treatment (Wissink et al 2021) and TBP dTAG (Santana et al 2022), --norm saturation should be used while invoking polcomp.
Subsections describe plotting functions for various use cases. All subsections show basic plotting code; users are encouraged to refer to these sections and reuse this code for custom plotting needs.
After a model run (see Running the model and other sections), a file with rates is produced. For instance, in the subsection Saturation Values, we generated the file paramFit_TRP_EntireRange_Saturation/paramEstimationResults_TRP_12.5min_vs_baseline.txt containing the following columns:-
head paramFit_TRP_EntireRange_Saturation/paramEstimationResults_TRP_12.5min_vs_baseline.txt
gene P_normalized_baseline B_normalized_baseline P_normalized_TRP_12.5min B_normalized_TRP_12.5min B_by_P_baseline pauseRelease_kelong30_baseline pauseRelease_kelong45_baseline pauseRelease_kelong60_baseline B_by_P_TRP_12.5min pauseRelease_kelong30_TRP_12.5min pauseRelease_kelong45_TRP_12.5min pauseRelease_kelong60_TRP_12.5min pauseRelease_FC initiationRate_FC_bound1 initiationRate_FC_bound2 initiationRate_baseline_kelong45_termination_point01 initiationRate_baseline_kelong45_termination_point1 initiationRate_baseline_kelong45_termination_1 initiationRate_TRP_12.5min_kelong45_termination_point01 initiationRate_TRP_12.5min_kelong45_termination_point1 initiationRate_TRP_12.5min_kelong45_termination_1
Tcf24 0.00796888671188262 6.971887730342455e-05 0.003441502030798615 2.1122988763380635e-05 0.00874888548728706 0.2624665646186118 0.39369984692791765 0.5249331292372236 0.006137723753857253 0.1841317126157176 0.2761975689235764 0.3682634252314352 0.7015435009151662 0.43186735553247346 0.30297373653122617 0.003217038345772931 0.003934238149842367 0.011106236190536726 0.0009849495146601148 0.0012946846974319901 0.004392036525150743
The columns P_normalized_baseline, P_normalized_TRP_12.5min and so on denote the quantities associated with the baseline and TRP_12.5min conditions, specified in the polcomp run as --basename and --exprname. To make the column names easily usable, we will transform this file using another helper code.
cat transformRates.R ## changes column names of parameter file to make it more usable
transform_rates <- function(combinedResultsFile, baselineName, exprName, yLabel, fileidentifier=NULL, kelong_vals = c(30,45,60), kpre_vals = c(1e-3,1e-2,1e-1,1), ALPHA=0.03) {
combinedResults = read.table(combinedResultsFile,sep="\t", header=T)
if (is.null(fileidentifier)) {
fileidentifier = sprintf("%s_vs_%s", exprName, baselineName)
print(sprintf("setting filePrefix to %s", fileidentifier))
}
p_b_df = combinedResults[,c("gene", sprintf("P_normalized_%s", baselineName), sprintf("B_normalized_%s", baselineName),
sprintf("P_normalized_%s", exprName), sprintf("B_normalized_%s", exprName))]
colnames(p_b_df) = c("gene", "P_base", "B_base", "P_expr", "B_expr")
rate.df = as.data.frame(c())
for (kelong_val in kelong_vals) {
p_b_df$kelong = kelong_val
estimatedRate = p_b_df
for (kpre_val in kpre_vals) {
estimatedRate$kpre = kpre_val
estimatedRate$kinit_base = estimatedRate$kpre * estimatedRate$P_base + estimatedRate$kelong * estimatedRate$B_base
estimatedRate$kinit_expr = estimatedRate$kpre * estimatedRate$P_expr + estimatedRate$kelong * estimatedRate$B_expr
estimatedRate$krel_base = estimatedRate$kelong * ( estimatedRate$B_base / estimatedRate$P_base )
estimatedRate$krel_expr = estimatedRate$kelong * ( estimatedRate$B_expr / estimatedRate$P_expr )
rate.df = rbind(rate.df, estimatedRate)
}
}
saveRDS(rate.df, sprintf("calculatedRateData_%s.rds", fileidentifier))
}
combnResultsFile = "paramFit_TRP_EntireRange_Saturation/paramEstimationResults_TRP_12.5min_vs_baseline.txt"
transform_rates(combnResultsFile, "baseline", "TRP_12.5min", "TRP_12.5min (repressed)") # default fileidentifier yields calculatedRateData_TRP_12.5min_vs_baseline.rds
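Inside transform_rates, the rates follow kinit = kpre*P + kelong*B and krel = kelong*(B/P). A quick numeric check against the Tcf24 row of the head output shown above:
# verify the rate formulas using Tcf24's baseline values from the head above
P_base = 0.00796888671188262      # P_normalized_baseline
B_base = 6.971887730342455e-05    # B_normalized_baseline
30 * (B_base / P_base)            # ~0.26247 = pauseRelease_kelong30_baseline
0.01 * P_base + 45 * B_base       # ~0.0032170 =
                                  # initiationRate_baseline_kelong45_termination_point01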
The created file calculatedRateData_TRP_12.5min_vs_baseline.rds contains column names where the baseline and experiment conditions are denoted generically by base and expr. We will use this file to plot the rate changes:
source("generalPlotter.R")
scanDataFile.entireRange = "paramFit_TRP_EntireRange_Saturation/calculatedRateData_TRP_12.5min_vs_baseline.rds"
## creates summary for rates per gene
summary.TRP.entireRange = summaryFunction(readRDS(scanDataFile.entireRange))
#plots fold change in pause release (krel)
krelFoldchangePlotter(scanDataFile.entireRange, "TRP_krel_foldchange", "Triptolide (TRP)", "TRP", filetype="pdf", UP=32, DOWN=1/4.5, WIDTH=5, pointALPHA = 0.1)
#plots fold change in initiation (kinit)
kinitFCvsPfoldchangePlotter(scanDataFile.entireRange, "TRP_kinit_FC", "Triptolide", "(TRP)", filetype="pdf", directionKinit="down", pointALPHA = 0.1)
#plots pause release values against elongation rates
kelongVsKrelPlotter(summary.TRP.entireRange, "Fig_TRP_krel_vs_kelong_minute", "Triptolide (TRP)", "TRP", filetype="pdf")
The generated plots show that Triptolide decreases initiation and may also increase pause release.
The generalPlotter.R file is present in the vignette folder and is omitted here for brevity.
We will use paramFit_TRP_EntireRange_Saturation/TRP_12.5min_vs_baseline_fittedParameters.txt, created in the previous subsection, Saturation Values.
source("residualAndFitPlotter.R")
paramFileLoc.TRP = "paramFit_TRP_EntireRange_Saturation/TRP_12.5min_vs_baseline_fittedParameters.txt"
originalPauseSums.TRP = read.table("TRP125min_control_allgenes_varlessthanpoint1.txt", sep="\t", header=T)
originalPauseSums.TRP = originalPauseSums.TRP[order(originalPauseSums.TRP$pause_sum),] # sort rows by ascending pause_sum (rank(order(x)) reduces to order(x))
color.vals = c("#F69994", "#F5807B", "#F26660", "#ED4843", "#E92623")
fitPlotter_multiplePercentiles("TRP", paramFileLoc.TRP, originalPauseSums.TRP, percentileVals = c(70,80,90,95,100), "LOG_HYPER_TANG", "No", plotTitle = "Triptolide (mESCs, no UMIs)", logYPlotter="Yes", showLegend = "Yes", plotMaxSaturation="Yes", outputDir = "figures/", V_JUST = c(1.2,-0.8, -0.7,-1.1, -1.1), H_JUST=c(rep(0.6,2),rep(0.58,2), 0.545), color.list = color.vals)
The inset shows the legend of fits used (X percentile referring to a fit for pause sums up to the Xth percentile). The dashed lines and the saturation values are indicated with their percentiles (with respect to the entire data).
The residualAndFitPlotter.R file is present in the vignette folder and is omitted here for brevity.