Contents

1 Review and mapping of data for ChIP-seq analysis:

1.1 Review of bowtie and bowtie2 and mapping options

In order for the output to be a proper .sam file use the -S option to explicitly say that you want a .sam output. This is required for bowtie2, and ensures that the header is included in the .sam file which is important for downstream steps.

We will continue working with a ChIP-seq dataset from human cells. The factor that was IP’ed was ATF1 (SRR5331338). The fastq file for the experiment and control (Input SRR5331584) is here:
/home/FCAM/meds5420/data/ATF1/fastq/

You have already selected 10 million reads, aligned to a genome, converted to BED, sorted, converted to bedGraph, and visualized these data on the UCSC genome browser.

2 ChIP-seq analyses

ChIP-seq overview

Figure 1: ChIP-seq overview

Question: How do we find real peaks in the data?

3 Calling peaks in ChIP-seq data using MACS

3.1 Considerations:

  • Resolution / Fragment size estimation.
  • Peak type: narrow vs. broad
  • Background estimation
  • Enrichment calculation
  • Significance testing
  • Removal of artifacts
  • Differential peak calling analysis (e.g. comparing two conditions).
  • Sequencing depth

4 Model based analysis of ChIP-seq (MACS)

MACS Zhang et al. (2008) and Feng et al. (2013) is a widely used peak calling software utility that does well with respect to all of the above considerations, except the final two. Other algorithms can be used for calling differential enriched peaks of MACS-called peaks. MACS2 and MACS3 have better options for calling broad peaks. We can also discuss how to assess sequencing depth from MACS output.

The documentation and manual for MACS3 can be found at: https://macs3-project.github.io/MACS/

4.1 Overview of MACS workflow

MACS workflow

Figure 2: MACS workflow

4.2 Narrow vs. Broad Peaks

Read Structure in peaks

Figure 3: Read Structure in peaks

We will focus on narrow peaks today. Because of the nature of the assay, sequence reads from enriched regions will cluster into an expected structure. Since crosslinking of a factor occurs in a small region, all the signal of narrow peaks should span this region. These molecules are sequence randomly from either end in single end sequencing data (or both ends in pair end sequencing). Reads that align to the forward strand (red) are upstream of the peak, aligning to the reverse strand are downstream. The red and blue peaks are shifted by the average fragment size.

Read Structure in peaks

Figure 4: Read Structure in peaks

4.3 Resolution / Fragment size estimation:

MACS can identify this pattern and build a model of the fragment length with the following workflow:
* Scan the genome for regions enriched for aligned reads (10-30X)
* Select 1000 random enriched regions
* Separate reads into positive and negative strand
* Identify peak summits (bimodal pattern around site)
* Fragment length (d) = distance between summits.
* Peak center = d/2

Fragment size from MACS Model

Figure 5: Fragment size from MACS Model

This model is important for the accuracy of the binding site prediction and for searching for known or new binding site motifs under the peak.

4.4 Estimating enrichment / background

To call peaks, MACS uses both information from local background near enriched sites and data from a control or input sample that should not have peaks.

Local background can influence peak enrichment for a number of reasons all involving sample preparation:

  • Sonication bias (see below)
  • ligation bias
  • Gel purification
  • PCR / amplification bias
  • poor antibody specificity

For example: The number of reads obtained in regions in the experimental and control samples is correlated. This tells you that there is something inherent that locus that allows it to be isolated / enriched independently of your protein of interest (i.e. one of the biases mentioned above).

correlation of input vs ChIP

Figure 6: correlation of input vs ChIP

Above, red dots are inside of called peaks, black dots are in control peaks.

Also, below is an example showing that Non-immunoprecipitatied DNA has ‘structure’. This is from a technique called ‘SONO-Seq’ by Auerbach et al. (2009).

SONO-seq method (left) and data (right).

Figure 7: SONO-seq method (left) and data (right)

Above, they sequenced sonicated chromatin without doing the IP and found that the shearing of chromatin is not random. Breaks occurred more often in active promoter regions resulting in enrichment in those regions without IP.
This is why the control sample is important.

Solutions:

  • Use the control sample to account for biases whatever their source. Why is input a substantially worse control than mock IP?
  • PCR biases are also removed by removing duplicate sequence reads. Sometimes up to two of the same read is allowed (the default in MACS is to keep only one read from any location).
  • To ensure antibody specificity:
    • One can knockdown the factor of interest, ChIP, and select for peaks that were depleted upon knockdown
    • One can ChIP with two different antibodies and choose only overlapping peaks

4.5 Using MACS3:

The basic usage for MACS3 is as follows:


macs3 callpeak -t <TREATMENT>.bam -c <Input>.bam -f BAM -g GENOMESIZE -n <name_prefix> --outdir <peaks_folder> 2>&1 | tee -a ATF1_peaks_log.txt

#Don't forget to save terminal output (std error in this case) as a log file.
  • -t: Specifies the experimental sample
  • -c: Control sample
  • -n: Prefix name for the output files
  • -g: genome size. Can use integer (e.g. 2900000000) or scientific (e.g. 2.9e9).  There are also a few shortcuts (hs = humans = 2.9e9).

Note: if you want to keep duplicate reads you can use the --keep-dup option. This option followed by an interger or ‘all’ will keep that number or all duplicates respectively. I beleive the best practice is to perform paired-end sequencing, remove duplicates from the bam file and use --keep-dup all. Removing duplicates from a paired-end bam file necessitates that the paired-end 1 and paired-end 2 reads are identical between two sequencing read; this is very unlikely to happen independently and it predominantly due to PCR duplications.

You can type macs3 --help and macs3 callpeak -h to get a list of other options and their descriptions.
You can also generate bedGraph files from MACS, but this takes a while and generates bedGraph data for every bp (also known as a wiggle format). These files can be quite large when uploading to the genome, so I prefer to make bedGraph files of the data myself as we have previously in this class.

I ran the following:

macs3 callpeak -t ATF_ChIP_10mil.bam -c Input_10mil.bam -f BAM -g hs -n ATF1 -B -q 0.01 --outdir atf_peaks 2>&1 | tee -a ATF1_peaks_log.txt

What does -B above specify?

You will get the following output files from MACS3:

  • NAME_PREFIX_model.r
  • NAME_PREFIX_peaks.narrowPeak
  • NAME_PREFIX_summits.bed
  • NAME_PREFIX_peaks.xls

The model.r is a script written in R that will produce a nice .PDF figure of the fragment length model for your data. You can run this script on the server or on your local machine with:

#first load R
module load R

Rscript NAME_PREFIX_model.r

The excel spreadsheets contain information about the datasets (total number of reads) and peak locations (reads in peaks and significance of peaks (FDR) for the experimental (_peaks.xls) and input (_negative_peaks.xls) samples.

The peaks.bed and summits.bed are bed files that give the locations of enriched regions and their maximum, respectively.

Have a look at the files: peaks bed file

head -4 ./ATF1_peaks.narrowPeak
## chr1 778568  778817  ATF1_peak_1 751 .   15.5695 81.2068 75.1172 130
## chr1 1000795 1000968 ATF1_peak_2 559 .   19.6258 61.765  55.9288 85
## chr1 1471609 1471790 ATF1_peak_3 244 .   8.97496 29.6308 24.4232 82
## chr1 1511940 1512176 ATF1_peak_4 365 .   14.9583 42.0281 36.5468 144

In this format the 5th column (or score column) contains the transformed p-value (-10*log10[pVal]) for the signifcance of the peak.

summits

head -4 ./ATF1_summits.bed
## chr1 778698  778699  ATF1_peak_1 75.1172
## chr1 1000880 1000881 ATF1_peak_2 55.9288
## chr1 1471691 1471692 ATF1_peak_3 24.4232
## chr1 1512084 1512085 ATF1_peak_4 36.5468

In this format the 5th column (or score column) contains the the number of reads that overlap with this peak summit.

The above files are ready to be uploaded to the browser - you just need to add track lines. The example in the next section below shows track line addition using awk.

peaks table with most information

head -5 ./ATF1_peaks.xls
## # This file is generated by MACS version 3.0.0a6
## # Command line: callpeak -t ATF_ChIP_10mil.bam -c Input_10mil.bam -f BAM -g hs -n ATF1 -q 0.01 --outdir atf_peaks
## # ARGUMENTS LIST:
## # name = ATF1
## # format = BAM

Unfortunately, The peaks file with the most information in it is meant to be opened in excel. The first 20-25 lines contain information regaring the parameters of the analysis. Fortunately, these lines begin with a “##” ad so can be easily parsed out with grep.


grep -v -w "#" ./ATF1_peaks.xls > ATF1_peaks_noHeader.txt

Then you get:

head ./ATF1_peaks_noHeader.txt
## 
## chr  start   end length  abs_summit  pileup  -log10(pvalue)  fold_enrichment -log10(qvalue)  name
## chr1 778569  778817  249 778699  92.42   81.2068 15.5695 75.1172 ATF1_peak_1
## chr1 1000796 1000968 173 1000881 57.88   61.765  19.6258 55.9288 ATF1_peak_2
## chr1 1471610 1471790 181 1471692 43.87   29.6308 8.97496 24.4232 ATF1_peak_3
## chr1 1511941 1512176 236 1512085 43.87   42.0281 14.9583 36.5468 ATF1_peak_4
## chr1 1574807 1575021 215 1574917 103.62  135.566 34.8731 128.438 ATF1_peak_5
## chr1 1615300 1615504 205 1615396 57.88   76.8936 28.3065 70.8667 ATF1_peak_6
## chr1 1658956 1659236 281 1659009 28.01   18.3673 7.2513  13.4733 ATF1_peak_7
## chr1 1724322 1724525 204 1724418 34.54   30.322  11.8466 25.0955 ATF1_peak_8


There is still a blank line that you can remove with grep:

grep -v -e '^$'  ./ATF1_peaks_noHeader.txt | head -5
## chr  start   end length  abs_summit  pileup  -log10(pvalue)  fold_enrichment -log10(qvalue)  name
## chr1 778569  778817  249 778699  92.42   81.2068 15.5695 75.1172 ATF1_peak_1
## chr1 1000796 1000968 173 1000881 57.88   61.765  19.6258 55.9288 ATF1_peak_2
## chr1 1471610 1471790 181 1471692 43.87   29.6308 8.97496 24.4232 ATF1_peak_3
## chr1 1511941 1512176 236 1512085 43.87   42.0281 14.9583 36.5468 ATF1_peak_4

This searches for a line where the start and end of a line are immediately next to each other. It uses regular expressions as follows:

  • -e tells grep that reg-expressions are being used
  • ^ indicates the start of the line.
  • $ indicates the end.

Two ways to remove a 1 line header is to use tail -n +2, which takes the second line to the end.

#tail
grep -v -e '^$' ./ATF1_peaks_noHeader.txt | tail -n +2 | head -5

#awk:
grep -v -e '^$' ./ATF1_peaks_noHeader.txt | awk 'NR>1 {print $0}' | head -5

grep -v -e '^$' ./ATF1_peaks_noHeader.txt | awk 'NR>1 {print $0}' > ATF1_peaks_onlyPeaks.txt
## chr1 778569  778817  249 778699  92.42   81.2068 15.5695 75.1172 ATF1_peak_1
## chr1 1000796 1000968 173 1000881 57.88   61.765  19.6258 55.9288 ATF1_peak_2
## chr1 1471610 1471790 181 1471692 43.87   29.6308 8.97496 24.4232 ATF1_peak_3
## chr1 1511941 1512176 236 1512085 43.87   42.0281 14.9583 36.5468 ATF1_peak_4
## chr1 1574807 1575021 215 1574917 103.62  135.566 34.8731 128.438 ATF1_peak_5
## chr1 778569  778817  249 778699  92.42   81.2068 15.5695 75.1172 ATF1_peak_1
## chr1 1000796 1000968 173 1000881 57.88   61.765  19.6258 55.9288 ATF1_peak_2
## chr1 1471610 1471790 181 1471692 43.87   29.6308 8.97496 24.4232 ATF1_peak_3
## chr1 1511941 1512176 236 1512085 43.87   42.0281 14.9583 36.5468 ATF1_peak_4
## chr1 1574807 1575021 215 1574917 103.62  135.566 34.8731 128.438 ATF1_peak_5

4.6 Adding tracklines and using score column in bed files:

Note that the score column relates to the intensity of the peak and interpreting the score can be turned on with useScore=1 in the track options line.

#UCSC tracklines for displaying MACS output
awk 'BEGIN {  print "browser position chr1:1,505,048-1,516,424"
              print "track type=bed name=\"ATF1_peaks_bed\" 
              description=\"ATF1_peaks_bed\" visibility=squish 
              autoScale=on useScore=1"}
           {  print $0}' INFILE.bed > INFILE_header.bed

You will also have to make a script for adding a header to narrowPeak files, specifying the type as narrowPeak so that UCSC recognizes the format.

I include three addTrackLine*.sh scripts: to /home/FCAM/meds5420/scripts/

Bed Score Usage

Figure 8: Bed Score Usage

4.7 Exercise

You will run MACS on the ATF1 ChIP-seq data. Previously, we mapped this data to the genome, converted in into bed and bedGraph and uploaded to the UCSC genome browser. Now we will call peaks on this data with MACS and upload the tracks to the browser to compare with our raw data.

  • Create a new folder in your home data folder (maybe call it peaks).

  • Start an interactive session with srun (4G 4 cores) if you haven’t already: srun --pty -p mcbstudent --qos=mcbstudent --mem=4G -c 4 bash

  • To run MACS, you will need information regarding the size of your ‘genome’. You can find this info in the chrom.sizes file: /home/FCAM/meds5420/genomes/hg38.chrom.sizes. We did not explicitly learn how to add column values using the command line, but you should have sufficient basic awk knowledge to Google and come up with a solution like this:
    awk '{sum+=$2;} END {print sum;}' hg38.chrom.sizes
    1. Move to your ‘peaks’ directory and run MACS (use the -B option) from there on this data using the experimental and control (Input) samples. Everyone should have their own experimental and input bam files from Lecture 12-in class exercise 1: Processing ChIP-seq data. However, if you did not do this exercise, you can find preprocessed files to copy here: /home/FCAM/meds5420/data/ATF1/bam
    2. When finished, generate the model figure by running the R script NAME_model.r as follows:

Rscript NAME_PREFIX_model.r

3. Add tracklines to the bed, narrowPeak, and bedGraph files (see notes above and from last class). If you get stuck, you can use the scripts I generated /home/FCAM/meds5420/scripts/addTrackLine*—each of these scripts is specific to a type of input file (i.e. bedGraph, bed, narrowPeak). Look at the file content to determine which is which. Feel free to copy and edit these scripts to change options.
4. How could you quickly find the peak with the highest enrichment so that you could make the browser automatically open to that peak after uploading the data?
5. Create a new folder on your local machine that more or less mirrors your data folder on the browser. Transfer all MACS output to this folder using sftp.
* Upload the narrowPeak and summits bed files to the saved UCSC genome browser Session that you sent me last class.
* Visually compare your new data to the .bed and .bedGraph files we generated previously for ATF1.

Since a few students did not send me a saved session, you can add them to the track here:

ATF1_raw_data

However, please go back and complete the previous exercises if you have not already.

After uploading the peak and summits file you should see something like this ATF1_comparisons and the figure below (although my narrowPeaks file is hidden, so how do we show this track?):

ATF1 browser image

Figure 9: ATF1 browser image

4.8 Thought Question:

How could you determine if you have sequenced enough of the library to capture all the peaks?

5 Answers to exercises:

5.1 In class exercise:

  1. run MACS: the number is the size of the genome from the hg38.chrom.sizes file or hs
module load macs3

macs3 callpeak -t ATF_ChIP_10mil.bam -c Input_10mil.bam -f BAM -g hs -n ATF1 -B -q 0.01 --outdir atf_peaks 2>&1 | tee -a ATF1_peaks_log.txt
  1. Move to MACS output directory and run Rscript to make figure of model
cd atf_peaks

Rscript ATF1_model.r
Plus and Minus read distribution for 10 million ATF1 reads

Figure 10: Plus and Minus read distribution for 10 million ATF1 reads

  1. add tracklines using the addTrackline*.sh scripts we modified from last week.
/home/FCAM/meds5420/scripts/addTrackLines.sh Input_macs3 ATF1_control_lambda.bdg
/home/FCAM/meds5420/scripts/addTrackLines.sh ATF1_macs3 ATF1_treat_pileup.bdg 
gzip ATF1_treat_pileup_header.bedGraph
gzip ATF1_control_lambda_header.bedGraph

/home/FCAM/meds5420/scripts/addTrackLine_macs_narrowPeak.sh ATF1_narrow ATF1_peaks.narrowPeak 
/home/FCAM/meds5420/scripts/addTrackLine_macs_bed.sh ATF1_summits ATF1_summits.bed
  1. Make it open to the largest peak: Use the result from below to set the browser position in a modified addTrackLiness.sh script.
sort -k5nr ATF1_peaks_summits.bed | head -1
  1. copy to your machine and upload to browser (see previous lessons).

5.2 Thought question:

One can subsample the ChIP-seq data and record the number of peaks called at certain read depths. Plotting of the number of peaks called vs. read depth, will reveal whether you are at saturation. This is outlined nicely in Figure 6A of Kharchenko, et. al., 2008

5.2.1 Implementation

To select random lines from a file you can use the shuf command in shell as follows:


shuf -n #of_lines <INPUT.file> > <OUTPUT.file>

However, it is not as easy as just choosing random lines from a FASTQ or SAM file. Why? Can you think of solutions?

It is easiest to operate on a sam file (without the header), because fastq have four lines per read. Then you would have to add back a header.

You can also use seq in combination with shuffle in a shell script to get random lines in defined intervals for doing the analysis.


seq 1000 500 3000 
# gives a sequence of numbers from 1000 to 3000 in increments of 500.
## 1000
## 1500
## 2000
## 2500
## 3000

The next step is to loop through the sequence of numbers and use shuf to randomly select lines from the input file. These outputs can be converted to bam files and used for peak calling with macs3

ChIP-seq saturation

Figure 11: ChIP-seq saturation