We ran peak calling and continued visualization of data on the genome browser
macs3 callpeak -t ATF_ChIP_10mil.bam -c Input_10mil.bam -f BAM -g hs -n ATF1 -B -q 0.01 --outdir atf_peaks 2>&1 | tee -a ATF1_peaks_log.txt
addTrackLines.sh Input_macs3 ATF1_control_lambda.bdg
addTrackLines.sh ATF1_macs3 ATF1_treat_pileup.bdg
gzip ATF1_treat_pileup_header.bedGraph
gzip ATF1_control_lambda_header.bedGraph
addTrackLine_macs_narrowPeak.sh ATF1_narrow ATF1_peaks.narrowPeak
addTrackLine_macs_bed.sh ATF1_summits ATF1_summits.bed
# transfer to your CPU with sftp and load to UCSC
bedtools
for ChIP-seq analysesToday we will learn how to:
awk
Extending reads by estimated fragment length from MACS analysis.
last time we used MACS to call peaks in our ChIP-seq data and create a model that describes the width of our peaks. Now we want to use the lag between the forward and reverse strands from the model to adjust our reads such that they best represent where our factor binds.
To do this, we will extend the reads by this estimation of the fragment size. Stated differently, the reads will no longer be as long as the aligned sequence, they will all the the same length, which is the average sonication size.
Awk can also be used for simple manipulations of columns. Let’s add a column that adds fragment size (120 in our case) to column 2 of ATF_ChIP_10mil_sorted.bed
from the post-processing lecture 12:
head -4 ./ATF_ChIP_10mil_sorted.bed
## chr1 10050 10100 SRR5331338.7693560 1 -
## chr1 10062 10112 SRR5331338.5121754 1 -
## chr1 10068 10118 SRR5331338.1602006 1 -
## chr1 10111 10161 SRR5331338.8329283 1 +
awk '{OFS="\t";} {print $1, $2, $2+120, $4, $5, $6}' ATF_ChIP_10mil_sorted.bed > ATF1_ChIP_10mil_extended.bed
head -4 ./ATF1_ChIP_10mil_extended.bed
## chr1 10050 10170 SRR5331338.7693560 1 -
## chr1 10062 10182 SRR5331338.5121754 1 -
## chr1 10068 10188 SRR5331338.1602006 1 -
## chr1 10111 10231 SRR5331338.8329283 1 +
\t
Specifies that the fields will be tab-delimited$<col_num>
Specifies the column number to be printed. $0 prints entire line.awk
commands are surrounded by single quotes and curly brackets: ‘{}’We are not finished yet since we have strand specific reads, we need to add to the + strand reads and subtract from the - strand reads.
We already know how to use awk to select rows based on strand:
awk '{if($6=="+") print $0}' ./ATF_ChIP_10mil_sorted.bed > ./ATF_ChIP_10mil_plus.bed
head -4 ./ATF_ChIP_10mil_plus.bed
## chr1 10111 10161 SRR5331338.8329283 1 +
## chr1 10122 10172 SRR5331338.6021060 1 +
## chr1 10154 10204 SRR5331338.7065577 30 +
## chr1 10157 10207 SRR5331338.5694959 30 +
Now let’s combine this with the print function:
awk '{OFS="\t";} {if($6=="+")
print $1, $2, $2+120, $4, $5, $6;
else
print $1, $3-120, $3, $4, $5, $6;}' ATF_ChIP_10mil_sorted.bed > ATF_ChIP_10mil_extended_strand.bed
head -4 ./ATF_ChIP_10mil_extended_strand.bed
## chr1 9980 10100 SRR5331338.7693560 1 -
## chr1 9992 10112 SRR5331338.5121754 1 -
## chr1 9998 10118 SRR5331338.1602006 1 -
## chr1 10111 10231 SRR5331338.8329283 1 +
Problem: sometimes shifting or extending reads will create negative genome coordinates. We can resolve this by using awk
:
awk '{if ($2 >0) print $0}' ATF_ChIP_10mil_extended_strand.bed > ATF_10mil_extended_filtered.bed
We can then create the bedGraph
and upload it to the browser. Here, introducing the hg38.chrom.sizes
file has the same effect of excluding any reads off the chromosome edges in the bed file:
bedtools genomecov -bg -i ATF_10mil_extended_filtered.bed -g genomes/hg38.chrom.sizes > ATF_10mil_extended.bedGraph
## Add tracklines
you know how to do this by now!
## Upload to browser
you know how to do this by now!
Reformatting gene lists:
We are going to use bedtools
for comparing our ChIP-seq peaks locations to genes. However, standard gene lists are not necessarily in .bed
format (http://genome.ucsc.edu/FAQ/FAQformat.html#format1), and thus must be converted. Here is the standard format of a gene list downloaded from the UCSC genome browser.
head -5 Homo_sapiens.GRCh38.104.unique.txt
## 5S_rRNA ENSG00000285609 chr1 - 182944365 182944490
## 5_8S_rRNA ENSG00000283274 chr14 + 16057472 16057622
## 7SK ENSG00000277313 chr1 - 120228046 120228295
## A1BG ENSG00000121410 chr19 - 58345178 58353492
## A1BG-AS1 ENSG00000268895 chr19 + 58347718 58355455
IMPORTANT NOTE ON GENE LISTS:
Gene lists are always represented so that the lowest genomic coordinate is first (in the start column) regardless of the strand of the gene. This means that the gene start and promoter for a plus-stranded gene is in the second column (start) and the gene start and promoter for a gene on the minus strand is in the third column (end).
To convert to .bed
format, we simply need to reorder the columns. Remember, the column specifications for .bed
(6) format is as follows:
1. chr
2. start, TSS for + strand genes
3. end, TSS for - strand genes
4. name
5. score
6. strand
We will review using the awk
language to reorder and manipulate columns from large tables very easily using the print command. I uploaded an annotation file to /home/FCAM/meds5420/annotations
. You should also have downloaded a GENCODE annotation file in BED format in Lecture 13.
awk '{OFS="\t";} {print $3, $5, $6, $1, 1, $4}' Homo_sapiens.GRCh38.104.unique.txt > hg38_genes.bed
# The '1' in column 5 is a space filler for the score
# since there is no score information for our gene list.
awking
We have covered several uses of awk and there are many more. As a reminder, Peteris Krummis has tabulated many useful awk
commands in a three part series on-line (http://www.catonmat.net/blog/awk-one-liners-explained-part-three/). Just swap in ‘one’ and ‘two’ at the end of the url to see the others. He also has an inexpensive book for sale with these awk
tips and tricks included.
As we discussed in the previous class, awk
can also be used for summing data in columns.
cat table.txt
## This 1 red +
## is 2 orange -
## a 4 yellow -
## test 4 green -
## this 7 blue +
## is 6 purple -
## only 80 red -
## a 19 orange +
## test 100 yellow +
## if 6 green +
## this 4 blue -
## had 54 purple -
## been 23 red +
## a 60 orange -
## real 4 yellow +
## emergency 5 green +
##
awk '{sum+=$2} END {print sum}' table.txt
## 379
awk
syntax for selecting something that is not equal
(!=):
#select minus strand genes
awk '{if ($6 != "+") print $0}' hg38_genes.bed > hg38_genes_minus.bed
head -4 hg38_genes_minus.bed
## chr1 182944365 182944490 5S_rRNA 1 -
## chr1 120228046 120228295 7SK 1 -
## chr19 58345178 58353492 A1BG 1 -
## chr10 50799409 50885675 A1CF 1 -
Bedtools has a coverageBed
command that can calculate the depth and breadth of coverage of two bed files. That is, given a sequencing ‘reads’ .bed
file and a ‘loci’ .bed
file one can calculate the number of reads found in each loci and the number of bases that are covered.
Link to manual here:
http://bedtools.readthedocs.io/en/latest/content/bedtools-suite.html.