Contents

1 Previous class:

We ran peak calling and continued visualization of data on the genome browser

macs3 callpeak -t ATF_ChIP_10mil.bam -c Input_10mil.bam -f BAM -g hs -n ATF1 -B -q 0.01 --outdir atf_peaks 2>&1 | tee -a ATF1_peaks_log.txt

addTrackLines.sh Input_macs3 ATF1_control_lambda.bdg
addTrackLines.sh ATF1_macs3 ATF1_treat_pileup.bdg 
gzip ATF1_treat_pileup_header.bedGraph
gzip ATF1_control_lambda_header.bedGraph

addTrackLine_macs_narrowPeak.sh ATF1_narrow ATF1_peaks.narrowPeak 
addTrackLine_macs_bed.sh ATF1_summits ATF1_summits.bed
# transfer to your CPU with sftp and load to UCSC

2 TODAY: Using bedtools for ChIP-seq analyses

Today we will learn how to:

2.1 Reformatting tables with awk

Extending reads by estimated fragment length from MACS analysis. last time we used MACS to call peaks in our ChIP-seq data and create a model that describes the width of our peaks. Now we want to use the lag between the forward and reverse strands from the model to adjust our reads such that they best represent where our factor binds.

To do this, we will extend the reads by this estimation of the fragment size. Stated differently, the reads will no longer be as long as the aligned sequence, they will all the the same length, which is the average sonication size.

Awk can also be used for simple manipulations of columns. Let’s add a column that adds fragment size (120 in our case) to column 2 of ATF_ChIP_10mil_sorted.bed from the post-processing lecture 12:

head -4 ./ATF_ChIP_10mil_sorted.bed
## chr1 10050   10100   SRR5331338.7693560  1   -
## chr1 10062   10112   SRR5331338.5121754  1   -
## chr1 10068   10118   SRR5331338.1602006  1   -
## chr1 10111   10161   SRR5331338.8329283  1   +
awk '{OFS="\t";} {print $1, $2, $2+120, $4, $5, $6}' ATF_ChIP_10mil_sorted.bed > ATF1_ChIP_10mil_extended.bed
head -4 ./ATF1_ChIP_10mil_extended.bed
## chr1 10050   10170   SRR5331338.7693560  1   -
## chr1 10062   10182   SRR5331338.5121754  1   -
## chr1 10068   10188   SRR5331338.1602006  1   -
## chr1 10111   10231   SRR5331338.8329283  1   +

2.1.1 awk syntax notes (a review):

  • \t Specifies that the fields will be tab-delimited
  • $<col_num> Specifies the column number to be printed. $0 prints entire line.
  • Pay close attention to formatting: awk commands are surrounded by single quotes and curly brackets: ‘{}’

We are not finished yet since we have strand specific reads, we need to add to the + strand reads and subtract from the - strand reads.
We already know how to use awk to select rows based on strand:

awk '{if($6=="+") print $0}' ./ATF_ChIP_10mil_sorted.bed > ./ATF_ChIP_10mil_plus.bed 
head -4 ./ATF_ChIP_10mil_plus.bed 
## chr1 10111   10161   SRR5331338.8329283  1   +
## chr1 10122   10172   SRR5331338.6021060  1   +
## chr1 10154   10204   SRR5331338.7065577  30  +
## chr1 10157   10207   SRR5331338.5694959  30  +

Now let’s combine this with the print function:

awk '{OFS="\t";} {if($6=="+")
      print $1, $2, $2+120, $4, $5, $6;
      else
      print $1, $3-120, $3, $4, $5, $6;}' ATF_ChIP_10mil_sorted.bed > ATF_ChIP_10mil_extended_strand.bed
head -4 ./ATF_ChIP_10mil_extended_strand.bed
## chr1 9980    10100   SRR5331338.7693560  1   -
## chr1 9992    10112   SRR5331338.5121754  1   -
## chr1 9998    10118   SRR5331338.1602006  1   -
## chr1 10111   10231   SRR5331338.8329283  1   +

Problem: sometimes shifting or extending reads will create negative genome coordinates. We can resolve this by using awk:

awk '{if ($2 >0) print $0}' ATF_ChIP_10mil_extended_strand.bed > ATF_10mil_extended_filtered.bed

We can then create the bedGraph and upload it to the browser. Here, introducing the hg38.chrom.sizes file has the same effect of excluding any reads off the chromosome edges in the bed file:

bedtools genomecov -bg -i ATF_10mil_extended_filtered.bed -g genomes/hg38.chrom.sizes > ATF_10mil_extended.bedGraph

## Add tracklines
you know how to do this by now!

## Upload to browser
you know how to do this by now!

Reformatting gene lists:

We are going to use bedtools for comparing our ChIP-seq peaks locations to genes. However, standard gene lists are not necessarily in .bed format (http://genome.ucsc.edu/FAQ/FAQformat.html#format1), and thus must be converted. Here is the standard format of a gene list downloaded from the UCSC genome browser.

head -5 Homo_sapiens.GRCh38.104.unique.txt
## 5S_rRNA  ENSG00000285609 chr1    -   182944365   182944490
## 5_8S_rRNA    ENSG00000283274 chr14   +   16057472    16057622
## 7SK  ENSG00000277313 chr1    -   120228046   120228295
## A1BG ENSG00000121410 chr19   -   58345178    58353492
## A1BG-AS1 ENSG00000268895 chr19   +   58347718    58355455


IMPORTANT NOTE ON GENE LISTS: Gene lists are always represented so that the lowest genomic coordinate is first (in the start column) regardless of the strand of the gene. This means that the gene start and promoter for a plus-stranded gene is in the second column (start) and the gene start and promoter for a gene on the minus strand is in the third column (end).
To convert to .bed format, we simply need to reorder the columns. Remember, the column specifications for .bed (6) format is as follows:

1. chr
2. start, TSS for + strand genes
3. end, TSS for - strand genes
4. name

5. score
6. strand

We will review using the awk language to reorder and manipulate columns from large tables very easily using the print command. I uploaded an annotation file to /home/FCAM/meds5420/annotations. You should also have downloaded a GENCODE annotation file in BED format in Lecture 13.

awk '{OFS="\t";} {print $3, $5, $6, $1, 1, $4}' Homo_sapiens.GRCh38.104.unique.txt > hg38_genes.bed

# The '1' in column 5 is a space filler for the score
# since there is no score information for our gene list.

3 More awking

We have covered several uses of awk and there are many more. As a reminder, Peteris Krummis has tabulated many useful awk commands in a three part series on-line (http://www.catonmat.net/blog/awk-one-liners-explained-part-three/). Just swap in ‘one’ and ‘two’ at the end of the url to see the others. He also has an inexpensive book for sale with these awk tips and tricks included.

As we discussed in the previous class, awk can also be used for summing data in columns.

cat table.txt
## This 1   red +
## is   2   orange  -
## a    4   yellow  -
## test 4   green   -
## this 7   blue    +
## is   6   purple  -
## only 80  red -
## a    19  orange  +
## test 100 yellow  +
## if   6   green   +
## this 4   blue    -
## had  54  purple  -
## been 23  red +
## a    60  orange  -
## real     4   yellow +
## emergency    5   green   +
##  
awk '{sum+=$2} END {print sum}' table.txt
## 379

awk syntax for selecting something that is not equal (!=):

#select minus strand genes
awk '{if ($6 != "+") print $0}' hg38_genes.bed > hg38_genes_minus.bed 
head -4 hg38_genes_minus.bed 
## chr1 182944365   182944490   5S_rRNA 1   -
## chr1 120228046   120228295   7SK 1   -
## chr19    58345178    58353492    A1BG    1   -
## chr10    50799409    50885675    A1CF    1   -

3.1 Number of ChIP-seq reads in defined regions of the genome

Bedtools has a coverageBed command that can calculate the depth and breadth of coverage of two bed files. That is, given a sequencing ‘reads’ .bed file and a ‘loci’ .bed file one can calculate the number of reads found in each loci and the number of bases that are covered.

Link to manual here:
http://bedtools.readthedocs.io/en/latest/content/bedtools-suite.html.