Contents

1 Previous class:

We ran peak calling and continued visualization of data on the genome browser

macs3 callpeak -t ATF_ChIP_10mil.bam -c Input_10mil.bam -f BAM -g hs -n ATF1 -B -q 0.01 --outdir atf_peaks 2>&1 | tee -a ATF1_peaks_log.txt

addTrackLines.sh Input_macs3 ATF1_control_lambda.bdg
addTrackLines.sh ATF1_macs3 ATF1_treat_pileup.bdg 
gzip ATF1_treat_pileup_header.bedGraph
gzip ATF1_control_lambda_header.bedGraph

addTrackLine_macs_narrowPeak.sh ATF1_narrow ATF1_peaks.narrowPeak 
addTrackLine_macs_bed.sh ATF1_summits ATF1_summits.bed
# transfer to your CPU with sftp and load to UCSC

2 TODAY: Using bedtools for ChIP-seq analyses

Today we will learn how to:

2.1 Reformatting tables with awk

Extending reads by estimated fragment length from MACS analysis. last time we used MACS to call peaks in our ChIP-seq data and create a model that describes the width of our peaks. Now we want to use the lag between the forward and reverse strands from the model to adjust our reads such that they best represent where our factor binds.

To do this, we will extend the reads by this estimation of the fragment size. Stated differently, the reads will no longer be as long as the aligned sequence, they will all the the same length, which is the average sonication size.

Awk can also be used for simple manipulations of columns. Let’s add a column that adds fragment size (120 in our case) to column 2 of ATF_ChIP_10mil_sorted.bed from the post-processing lecture 12:

head -4 ./ATF_ChIP_10mil_sorted.bed
## chr1 10050   10100   SRR5331338.7693560  1   -
## chr1 10062   10112   SRR5331338.5121754  1   -
## chr1 10068   10118   SRR5331338.1602006  1   -
## chr1 10111   10161   SRR5331338.8329283  1   +
awk '{OFS="\t";} {print $1, $2, $2+120, $4, $5, $6}' ATF_ChIP_10mil_sorted.bed > ATF1_ChIP_10mil_extended.bed
head -4 ./ATF1_ChIP_10mil_extended.bed
## chr1 10050   10170   SRR5331338.7693560  1   -
## chr1 10062   10182   SRR5331338.5121754  1   -
## chr1 10068   10188   SRR5331338.1602006  1   -
## chr1 10111   10231   SRR5331338.8329283  1   +

2.1.1 awk syntax notes (a review):

  • \t Specifies that the fields will be tab-delimited
  • $<col_num> Specifies the column number to be printed. $0 prints entire line.
  • Pay close attention to formatting: awk commands are surrounded by single quotes and curly brackets: ‘{}’

We are not finished yet since we have strand specific reads, we need to add to the + strand reads and subtract from the - strand reads.
We already know how to use awk to select rows based on strand:

awk '{if($6=="+") print $0}' ./ATF_ChIP_10mil_sorted.bed > ./ATF_ChIP_10mil_plus.bed 
head -4 ./ATF_ChIP_10mil_plus.bed 
## chr1 10111   10161   SRR5331338.8329283  1   +
## chr1 10122   10172   SRR5331338.6021060  1   +
## chr1 10154   10204   SRR5331338.7065577  30  +
## chr1 10157   10207   SRR5331338.5694959  30  +

Now let’s combine this with the print function:

awk '{OFS="\t";} {if($6=="+")
      print $1, $2, $2+120, $4, $5, $6;
      else
      print $1, $3-120, $3, $4, $5, $6;}' ATF_ChIP_10mil_sorted.bed > ATF_ChIP_10mil_extended_strand.bed
head -4 ./ATF_ChIP_10mil_extended_strand.bed
## chr1 9980    10100   SRR5331338.7693560  1   -
## chr1 9992    10112   SRR5331338.5121754  1   -
## chr1 9998    10118   SRR5331338.1602006  1   -
## chr1 10111   10231   SRR5331338.8329283  1   +

Problem: sometimes shifting or extending reads will create negative genome coordinates. We can resolve this by using awk:

awk '{if ($2 >0) print $0}' ATF_ChIP_10mil_extended_strand.bed > ATF_10mil_extended_filtered.bed

We can then create the bedGraph and upload it to the browser. Here, introducing the hg38.chrom.sizes file has the same effect of excluding any reads off the chromosome edges in the bed file:

bedtools genomecov -bg -i ATF_10mil_extended_filtered.bed -g genomes/hg38.chrom.sizes > ATF_10mil_extended.bedGraph

## Add tracklines
you know how to do this by now!

## Upload to browser
you know how to do this by now!

Reformatting gene lists:

We are going to use bedtools for comparing our ChIP-seq peaks locations to genes. However, standard gene lists are not necessarily in .bed format (http://genome.ucsc.edu/FAQ/FAQformat.html#format1), and thus must be converted. Here is the standard format of a gene list downloaded from the UCSC genome browser.

head -5 Homo_sapiens.GRCh38.104.unique.txt
## 5S_rRNA  ENSG00000285609 chr1    -   182944365   182944490
## 5_8S_rRNA    ENSG00000283274 chr14   +   16057472    16057622
## 7SK  ENSG00000277313 chr1    -   120228046   120228295
## A1BG ENSG00000121410 chr19   -   58345178    58353492
## A1BG-AS1 ENSG00000268895 chr19   +   58347718    58355455


IMPORTANT NOTE ON GENE LISTS: Gene lists are always represented so that the lowest genomic coordinate is first (in the start column) regardless of the strand of the gene. This means that the gene start and promoter for a plus-stranded gene is in the second column (start) and the gene start and promoter for a gene on the minus strand is in the third column (end).
To convert to .bed format, we simply need to reorder the columns. Remember, the column specifications for .bed (6) format is as follows:

1. chr
2. start, TSS for + strand genes
3. end, TSS for - strand genes
4. name

5. score
6. strand

We will review using the awk language to reorder and manipulate columns from large tables very easily using the print command. I uploaded an annotation file to /home/FCAM/meds5420/annotations. You should also have downloaded a GENCODE annotation file in BED format in Lecture 13.

awk '{OFS="\t";} {print $3, $5, $6, $1, 1, $4}' Homo_sapiens.GRCh38.104.unique.txt > hg38_genes.bed

# The '1' in column 5 is a space filler for the score
# since there is no score information for our gene list.

3 More awking

We have covered several uses of awk and there are many more. As a reminder, Peteris Krummis has tabulated many useful awk commands in a three part series on-line (http://www.catonmat.net/blog/awk-one-liners-explained-part-three/). Just swap in ‘one’ and ‘two’ at the end of the url to see the others. He also has an inexpensive book for sale with these awk tips and tricks included.

As we discussed in the previous class, awk can also be used for summing data in columns.

cat table.txt
## This 1   red +
## is   2   orange  -
## a    4   yellow  -
## test 4   green   -
## this 7   blue    +
## is   6   purple  -
## only 80  red -
## a    19  orange  +
## test 100 yellow  +
## if   6   green   +
## this 4   blue    -
## had  54  purple  -
## been 23  red +
## a    60  orange  -
## real     4   yellow +
## emergency    5   green   +
##  
awk '{sum+=$2} END {print sum}' table.txt
## 379

awk syntax for selecting something that is not equal (!=):

#select minus strand genes
awk '{if ($6 != "+") print $0}' hg38_genes.bed > hg38_genes_minus.bed 
head -4 hg38_genes_minus.bed 
## chr1 182944365   182944490   5S_rRNA 1   -
## chr1 120228046   120228295   7SK 1   -
## chr19    58345178    58353492    A1BG    1   -
## chr10    50799409    50885675    A1CF    1   -

3.1 Number of ChIP-seq reads in defined regions of the genome

Bedtools has a coverageBed command that can calculate the depth and breadth of coverage of two bed files. That is, given a sequencing ‘reads’ .bed file and a ‘loci’ .bed file one can calculate the number of reads found in each loci and the number of bases that are covered.

Link to manual here:
http://bedtools.readthedocs.io/en/latest/content/bedtools-suite.html.

coverageBed

Figure 1: coverageBed

Usage:

coverageBed -a LOCI.bed -b READS.bed  > OUTFILE.bed

After each entry from A, coverageBed will report:
1) The number of features in B that overlapped (by at least one base pair) the A interval.
2) The number of bases in A that had non-zero coverage from features in B.
3) The length of the entry in A.
4) The fraction of bases in A that had non-zero coverage from features in B.

Example:

head -5 ./coverageStrand_ATF1_peaks_chr5.bed
## chr5 157285930   157286466   MACS_peak_444   239.31  98  453 536 0.8451493
## chr5 5373747 5374632 MACS_peak_8 75.29   55  653 885 0.7378531
## chr5 12615159    12616046    MACS_peak_26    133.77  76  687 887 0.7745209
## chr5 16629414    16629942    MACS_peak_35    101.21  31  412 528 0.7803030
## chr5 29982605    29982952    MACS_peak_46    52.09   16  229 347 0.6599424

3.2 Understanding functions:

As functions and tools become more complicated with more options and variations, it is critical to look at the input and output files (head, tail, wc -l, etc.). Ask yourself if the output makes sense. If you selected all plus strand genes, does the number of lines reduce by about half? Is there a + in column 6 when I randomly look at 10 lines?

Maybe you are trying to decipher cryptic language describing an option. Employ the option and compare the input and output files to determine if you are interpreting the option correctly. I am constantly doing these types of coherence checks and empirical confirmations when I employ new tools and even tools that I am familiar with.

3.3 Find closest gene to ChIP-seq peaks:

There is a simple but useful command in bedtools to find the closest gene to your ChIP-seq peak. There are nice options to force the program to consider the strand of your loci, to set distance cutoffs, and to compute the distance between ‘matched’ loci.

bedtools closest

Figure 2: bedtools closest

Usage:

closestBed  -a PEAKS.bed -b GENES.bed  > OUTFILE.bed

Important options:

-D:

  • a: Report distance with respect to A. When A is on the - strand, “upstream” means B has a higher (start,stop).
  • b: Report distance with respect to B. When B is on the - strand, “upstream” means A has a higher (start,stop).

-t:

  • all: In the event of a tie, report all
  • first: In the event of a tie, report the first hit
  • last: In the event of a tie, report the last hit

Example:

head -5 ./chr5_closest_summits_genes_D_b.out
## chr5 207444  207445  MACS_peak_1 42.00   chr5    204874  218297  NM_145265   CCDC127 -   0
## chr5 1345286 1345287 MACS_peak_2 90.00   chr5    1317999 1345002 NM_030782   CLPTM1L -   -284
## chr5 1634151 1634152 MACS_peak_3 66.00   chr5    1597671 1634120 NR_003713   LOC728613   -   -31
## chr5 1683698 1683699 MACS_peak_4 10.00   chr5    1708899 1708983 NR_036240   MIR4277 -   25200
## chr5 1746763 1746764 MACS_peak_5 51.00   chr5    1708899 1708983 NR_036240   MIR4277 -   -37780

4 More useful ways to compare or manipulate regions

4.1 Merging regions

mergeBed

Figure 3: mergeBed

4.2 Clustering regions

clusterBed

Figure 4: clusterBed

4.3 Complementing regions

complementBed

Figure 5: complementBed

4.4 Changing boundaries

slopBed

Figure 6: slopBed

4.5 Getting nearby regions

flankBed

Figure 7: flankBed

4.6 Intersecting regions

intersectBed

Figure 8: intersectBed

4.7 In Class Exercise 1:

Today you will begin to analyze the ATF1_ChIP-seq data for your chromosome by quantifying the number or reads in peaks and finding the closest genes in each peak. Your chromosome is the same as your usr#. I am usr17, so my chromosome is chr17. You must convert the format of several files.

1) You should have all 10 million ChIP-seq reads in a sorted bed file (if you cannot find this, then you will find one here: /home/FCAM/meds5420/data/ATF1/bed). First use grep and to make a bed file with only reads from your chromosome. I called mine ATF1_chr17_sorted.bed

2) Use your ATF1 peak file (MACS output from last time), and your reads file to determine the number of reads in the narrowPeak file genomic regions. Again, use grep to only pull out peaks from your chromosome in the narrowPeak file.


  1. Convert the gene list to .bed format and select genes from your chromosome:
  • Copy the gene list from /home/FCAM/meds5420/data/ATF1/annotations/Homo_sapiens.GRCh38.104.unique.txt
    to an annotations folder within your home directory tree.
  • Use awk to convert the gene list to .bed format. Use column 2 in the name column (bed column 4) and “1” in the score column (bed column 5).
  • Use grep to move only genes from your chromosome into a new file



  1. Use your ATF1 summits file (MACS output from last time), and your gene list .bed file to determine the closest gene to your peaks. First pull only your chromosome from the summits file. Use proper options so that you:
  • only print one gene locus in the event of a tie in distance.
  • report the distance between each peak and the closest gene relative to the gene.

  1. list the bedtools command(s) in sequence that you could use to do the following:
  • Merge peaks in the narrowPeak file that are within 1000 bp of each other
  • Get intergenic regions of the entire genome
  • Find peaks that are within genes
  • Find peaks that are outside of genes


  1. Thought experiment: You are trying to find the closest ChIP-seq peak to the transcription start site of genes (TSS) of your genes. Can you see a problem with how bedtools commands used above in 5) would report this?

5 Answers to in class Exercise

Note: To make the sorted bedfile of the data, refer back to the post-processing lecture section Introduction to BED format:

#convert .sam to .bam
samtools view -b ATF_ChIP_10mil.sam > ATF_ChIP_10mil.bam

#convert to bed
bedtools bamtobed -i ATF_ChIP_10mil.bam > ATF_ChIP_10mil.bed

# sort 
sort -k1,1 -k2,2n ATF_ChIP_10mil.bed > ATF_ChIP_10mil_sorted.bed 
grep -w "chr17" lec13/ATF_ChIP_10mil_sorted.bed > ATF1_reads_chr17.bed
  1. Get number of reads in peaks
grep -w "chr17" peaks/atf_peaks/ATF1_peaks.narrowPeak > ATF1_peaks_chr17.narrowPeak 
coverageBed -a ATF1_peaks_chr17.narrowPeak -b ATF1_reads_chr17.bed > ATF1_peak_coverage.bed
  1. Reformat file to .bed and select genes on your chromosome
awk '{OFS="\t";} {print $3, $5, $6, $1, 1, $4}' Homo_sapiens.GRCh38.104.unique.txt | grep -w "chr17" > hg38_chr17_genes.bed 
  1. Call closest peaks with distance in respect to genes:

grep -w "chr17" peaks/atf_peaks/ATF1_summits.bed > ATF_chr15_summits.bed
bedtools closest -t "first" -D b -a ATF_chr17_summits.bed -b hg38_chr17_genes.bed > ATF_gene_proximity.bed

#This gives me an error. I need to sort! 

sortBed -i hg38_chr17_genes.bed > hg38_chr17_genes_sorted.bed 
bedtools closest -t "first" -D b -a ATF_chr17_summits.bed -b hg38_chr17_genes_sorted.bed > ATF_gene_proximity.bed 
  1. list the bedtools command(s) in sequence that you could use to do the following: Merge peaks that are within 1000 bp of each other:

wc -l to make sure the output makes sense

Get intergenic regions (I had a few sorting Errors that I had to correct on my first pass) * sortBed -i hg38_genes.bed > hg38_genes_sorted.bed or sort -k 1,1 -k2,2n hg38_genes.bed > hg38_genes_sorted.bed * sort -k 1,1 -k2,2n genomes/hg38.chrom.sizes > genomes/hg38.chrom.sizes.sorted * complementBed -i hg38_genes_sorted.bed -g genomes/hg38.chrom.sizes.sorted > intergenic_hg38.bed

Find peaks that are within genes. I like to use the summits file because peaks are variable width and the summit is the best guess of where the TF is bound.

Find peaks that are outside of genes

Define a promoter region that is +/- 500 base pairs around the TSS

  1. Bedtools will report the closest gene regardless of whether the peak is closer to the gene start or end (see below).

6 Creating custom annotations lists for ChIP-seq analysis:

One caveat with closestBed, is that it chooses the closet edge of your targets (genes) to compare your peaks locations to. This means that it is not always obvious whether or not your peak is closest to the gene start or end. One way to get around this is to create a new annotation list that just contains TSSs. This would allow you to use bedtools closest to get the closest TSS to your peak.

peak assignment problem

Figure 9: peak assignment problem

Using awk, we can easily create such a TSS file. The awk script below creates a new file and sets both the start and stop to the TSS of the gene, which is the ‘start’ for genes on the plus strand and the ‘end’ for genes on the minus strand.


awk '{OFS="\t";} 
{ if($6=="+")
print $1, $2, $2+1, $4,$5,$6;
else 
print $1, $3-1, $3, $4,$5,$6}' hg38_genes.bed > hg38_genes_strandedTSS.bed

One can then use bedtools closest to determine the closest TSS (or promoter) to each peak location.

sortBed -i hg38_genes_strandedTSS.bed > hg38_genes_strandedTSS_sorted.bed
closestBed -D a -a peaks/atf_peaks/ATF1_summits.bed -b hg38_genes_strandedTSS_sorted.bed > peaks_promoters.bed

This same type of manipulation can be used for creating bedfiles of different regions for comparing to ChIP-seq peak locations. i.e. You can create promoter, genic and intergenic regions and determine where you ChIP-seq peaks tend to go to.