bowtie
and bowtie2
and mapping optionsIn order for the output to be a proper .sam file use the -S
option to explicitly say that you want a .sam
output. This is required for bowtie2
, and ensures that the header is included in the .sam
file which is important for downstream steps.
We will continue working with a ChIP-seq dataset from human cells. The factor that was IP’ed was ATF1 (SRR5331338). The fastq
file for the experiment and control (Input SRR5331584) is here:
/home/FCAM/meds5420/data/ATF1/fastq/
You have already selected 10 million reads, aligned to a genome, converted to BED, sorted, converted to bedGraph
, and visualized these data on the UCSC genome browser.
Question: How do we find real peaks in the data?
MACS Zhang et al. (2008) and Feng et al. (2013) is a widely used peak calling software utility that does well with respect to all of the above considerations, except the final two. Other algorithms can be used for calling differential enriched peaks of MACS-called peaks. MACS2 and MACS3 have better options for calling broad peaks. We can also discuss how to assess sequencing depth from MACS output.
The documentation and manual for MACS3 can be found at: https://macs3-project.github.io/MACS/
We will focus on narrow peaks today. Because of the nature of the assay, sequence reads from enriched regions will cluster into an expected structure. Since crosslinking of a factor occurs in a small region, all the signal of narrow peaks should span this region. These molecules are sequence randomly from either end in single end sequencing data (or both ends in pair end sequencing). Reads that align to the forward strand (red) are upstream of the peak, aligning to the reverse strand are downstream. The red and blue peaks are shifted by the average fragment size.
MACS can identify this pattern and build a model of the fragment length with the following workflow:
* Scan the genome for regions enriched for aligned reads (10-30X)
* Select 1000 random enriched regions
* Separate reads into positive and negative strand
* Identify peak summits (bimodal pattern around site)
* Fragment length (d) = distance between summits.
* Peak center = d/2
This model is important for the accuracy of the binding site prediction and for searching for known or new binding site motifs under the peak.
To call peaks, MACS uses both information from local background near enriched sites and data from a control or input sample that should not have peaks.
Local background can influence peak enrichment for a number of reasons all involving sample preparation:
For example: The number of reads obtained in regions in the experimental and control samples is correlated. This tells you that there is something inherent that locus that allows it to be isolated / enriched independently of your protein of interest (i.e. one of the biases mentioned above).
Above, red dots are inside of called peaks, black dots are in control peaks.
Also, below is an example showing that Non-immunoprecipitatied DNA has ‘structure’. This is from a technique called ‘SONO-Seq’ by Auerbach et al. (2009).
Above, they sequenced sonicated chromatin without doing the IP and found that the shearing of chromatin is not random. Breaks occurred more often in active promoter regions resulting in enrichment in those regions without IP.
This is why the control sample is important.
Solutions:
The basic usage for MACS3 is as follows:
macs3 callpeak -t <TREATMENT>.bam -c <Input>.bam -f BAM -g GENOMESIZE -n <name_prefix> --outdir <peaks_folder> 2>&1 | tee -a ATF1_peaks_log.txt
#Don't forget to save terminal output (std error in this case) as a log file.
-t
: Specifies the experimental sample-c
: Control sample-n
: Prefix name for the output files-g
: genome size. Can use integer (e.g. 2900000000) or scientific (e.g. 2.9e9). There are also a few shortcuts (hs
= humans = 2.9e9).Note: if you want to keep duplicate reads you can use the --keep-dup
option. This option followed by an interger or ‘all’ will keep that number or all duplicates respectively. I beleive the best practice is to perform paired-end sequencing, remove duplicates from the bam
file and use --keep-dup all
. Removing duplicates from a paired-end bam
file necessitates that the paired-end 1 and paired-end 2 reads are identical between two sequencing read; this is very unlikely to happen independently and it predominantly due to PCR duplications.
You can type macs3 --help
and macs3 callpeak -h
to get a list of other options and their descriptions.
You can also generate bedGraph
files from MACS, but this takes a while and generates bedGraph
data for every bp (also known as a wiggle
format). These files can be quite large when uploading to the genome, so I prefer to make bedGraph
files of the data myself as we have previously in this class.
I ran the following:
macs3 callpeak -t ATF_ChIP_10mil.bam -c Input_10mil.bam -f BAM -g hs -n ATF1 -B -q 0.01 --outdir atf_peaks 2>&1 | tee -a ATF1_peaks_log.txt
What does -B
above specify?
You will get the following output files from MACS3:
NAME_PREFIX_model.r
NAME_PREFIX_peaks.narrowPeak
NAME_PREFIX_summits.bed
NAME_PREFIX_peaks.xls
The model.r is a script written in R
that will produce a nice .PDF figure of the fragment length model for your data.
You can run this script on the server or on your local machine with:
#first load R
module load R
Rscript NAME_PREFIX_model.r
The excel spreadsheets contain information about the datasets (total number of reads) and peak locations (reads in peaks and significance of peaks (FDR) for the experimental (_peaks.xls
) and input (_negative_peaks.xls
) samples.
The peaks.bed and summits.bed are bed files that give the locations of enriched regions and their maximum, respectively.
Have a look at the files:
peaks bed file
head -4 ./ATF1_peaks.narrowPeak
## chr1 778568 778817 ATF1_peak_1 751 . 15.5695 81.2068 75.1172 130
## chr1 1000795 1000968 ATF1_peak_2 559 . 19.6258 61.765 55.9288 85
## chr1 1471609 1471790 ATF1_peak_3 244 . 8.97496 29.6308 24.4232 82
## chr1 1511940 1512176 ATF1_peak_4 365 . 14.9583 42.0281 36.5468 144
In this format the 5th column (or score column) contains the transformed p-value (-10*log10[pVal]) for the signifcance of the peak.
summits
head -4 ./ATF1_summits.bed
## chr1 778698 778699 ATF1_peak_1 75.1172
## chr1 1000880 1000881 ATF1_peak_2 55.9288
## chr1 1471691 1471692 ATF1_peak_3 24.4232
## chr1 1512084 1512085 ATF1_peak_4 36.5468
In this format the 5th column (or score column) contains the the number of reads that overlap with this peak summit.
The above files are ready to be uploaded to the browser - you just need to add track lines. The example in the next section below shows track line addition using awk
.
peaks table with most information
head -5 ./ATF1_peaks.xls
## # This file is generated by MACS version 3.0.0a6
## # Command line: callpeak -t ATF_ChIP_10mil.bam -c Input_10mil.bam -f BAM -g hs -n ATF1 -q 0.01 --outdir atf_peaks
## # ARGUMENTS LIST:
## # name = ATF1
## # format = BAM
Unfortunately, The peaks file with the most information in it is meant to be opened in excel. The first 20-25 lines contain information regaring the parameters of the analysis. Fortunately, these lines begin with a “##” ad so can be easily parsed out with grep.
grep -v -w "#" ./ATF1_peaks.xls > ATF1_peaks_noHeader.txt
Then you get:
head ./ATF1_peaks_noHeader.txt
##
## chr start end length abs_summit pileup -log10(pvalue) fold_enrichment -log10(qvalue) name
## chr1 778569 778817 249 778699 92.42 81.2068 15.5695 75.1172 ATF1_peak_1
## chr1 1000796 1000968 173 1000881 57.88 61.765 19.6258 55.9288 ATF1_peak_2
## chr1 1471610 1471790 181 1471692 43.87 29.6308 8.97496 24.4232 ATF1_peak_3
## chr1 1511941 1512176 236 1512085 43.87 42.0281 14.9583 36.5468 ATF1_peak_4
## chr1 1574807 1575021 215 1574917 103.62 135.566 34.8731 128.438 ATF1_peak_5
## chr1 1615300 1615504 205 1615396 57.88 76.8936 28.3065 70.8667 ATF1_peak_6
## chr1 1658956 1659236 281 1659009 28.01 18.3673 7.2513 13.4733 ATF1_peak_7
## chr1 1724322 1724525 204 1724418 34.54 30.322 11.8466 25.0955 ATF1_peak_8
There is still a blank line that you can remove with grep:
grep -v -e '^$' ./ATF1_peaks_noHeader.txt | head -5
## chr start end length abs_summit pileup -log10(pvalue) fold_enrichment -log10(qvalue) name
## chr1 778569 778817 249 778699 92.42 81.2068 15.5695 75.1172 ATF1_peak_1
## chr1 1000796 1000968 173 1000881 57.88 61.765 19.6258 55.9288 ATF1_peak_2
## chr1 1471610 1471790 181 1471692 43.87 29.6308 8.97496 24.4232 ATF1_peak_3
## chr1 1511941 1512176 236 1512085 43.87 42.0281 14.9583 36.5468 ATF1_peak_4
This searches for a line where the start and end of a line are immediately next to each other. It uses regular expressions
as follows:
-e
tells grep
that reg-expressions are being used^
indicates the start of the line.$
indicates the end.Two ways to remove a 1 line header is to use tail -n +2
, which takes the second line to the end.
#tail
grep -v -e '^$' ./ATF1_peaks_noHeader.txt | tail -n +2 | head -5
#awk:
grep -v -e '^$' ./ATF1_peaks_noHeader.txt | awk 'NR>1 {print $0}' | head -5
grep -v -e '^$' ./ATF1_peaks_noHeader.txt | awk 'NR>1 {print $0}' > ATF1_peaks_onlyPeaks.txt
## chr1 778569 778817 249 778699 92.42 81.2068 15.5695 75.1172 ATF1_peak_1
## chr1 1000796 1000968 173 1000881 57.88 61.765 19.6258 55.9288 ATF1_peak_2
## chr1 1471610 1471790 181 1471692 43.87 29.6308 8.97496 24.4232 ATF1_peak_3
## chr1 1511941 1512176 236 1512085 43.87 42.0281 14.9583 36.5468 ATF1_peak_4
## chr1 1574807 1575021 215 1574917 103.62 135.566 34.8731 128.438 ATF1_peak_5
## chr1 778569 778817 249 778699 92.42 81.2068 15.5695 75.1172 ATF1_peak_1
## chr1 1000796 1000968 173 1000881 57.88 61.765 19.6258 55.9288 ATF1_peak_2
## chr1 1471610 1471790 181 1471692 43.87 29.6308 8.97496 24.4232 ATF1_peak_3
## chr1 1511941 1512176 236 1512085 43.87 42.0281 14.9583 36.5468 ATF1_peak_4
## chr1 1574807 1575021 215 1574917 103.62 135.566 34.8731 128.438 ATF1_peak_5
Note that the score column relates to the intensity of the peak and interpreting the score can be turned on with useScore=1
in the track options line.
#UCSC tracklines for displaying MACS output
awk 'BEGIN { print "browser position chr1:1,505,048-1,516,424"
print "track type=bed name=\"ATF1_peaks_bed\"
description=\"ATF1_peaks_bed\" visibility=squish
autoScale=on useScore=1"}
{ print $0}' INFILE.bed > INFILE_header.bed
You will also have to make a script for adding a header to narrowPeak files, specifying the type as narrowPeak so that UCSC recognizes the format.
I include three addTrackLine*.sh
scripts: to /home/FCAM/meds5420/scripts/
You will run MACS on the ATF1 ChIP-seq data. Previously, we mapped this data to the genome, converted in into bed
and bedGraph
and uploaded to the UCSC genome browser. Now we will call peaks on this data with MACS and upload the tracks to the browser to compare with our raw data.
Create a new folder in your home data folder (maybe call it peaks
).
Start an interactive session with srun (4G 4 cores) if you haven’t already: srun --pty -p mcbstudent --qos=mcbstudent --mem=4G -c 4 bash
To run MACS, you will need information regarding the size of your ‘genome’. You can find this info in the chrom.sizes
file: /home/FCAM/meds5420/genomes/hg38.chrom.sizes
. We did not explicitly learn how to add column values using the command line, but you should have sufficient basic awk
knowledge to Google and come up with a solution like this:
awk '{sum+=$2;} END {print sum;}' hg38.chrom.sizes
1. Move to your ‘peaks’ directory and run MACS (use the -B
option) from there on this data using the experimental and control (Input) samples. Everyone should have their own experimental and input bam
files from Lecture 12-in class exercise 1: Processing ChIP-seq data. However, if you did not do this exercise, you can find preprocessed files to copy here: /home/FCAM/meds5420/data/ATF1/bam
2. When finished, generate the model figure by running the R script NAME_model.r
as follows:
Rscript NAME_PREFIX_model.r
3. Add tracklines to the bed
, narrowPeak
, and bedGraph
files (see notes above and from last class). If you get stuck, you can use the scripts I generated /home/FCAM/meds5420/scripts/addTrackLine*
—each of these scripts is specific to a type of input file (i.e. bedGraph, bed, narrowPeak). Look at the file content to determine which is which. Feel free to copy and edit these scripts to change options.
4. How could you quickly find the peak with the highest enrichment so that you could make the browser automatically open to that peak after uploading the data?
5. Create a new folder on your local machine that more or less mirrors your data folder on the browser. Transfer all MACS output to this folder using sftp
.
* Upload the narrowPeak
and summits
bed
files to the saved UCSC genome browser Session that you sent me last class.
* Visually compare your new data to the .bed
and .bedGraph
files we generated previously for ATF1.
Since a few students did not send me a saved session, you can add them to the track here:
However, please go back and complete the previous exercises if you have not already.
After uploading the peak and summits file you should see something like this ATF1_comparisons and the figure below (although my narrowPeaks file is hidden, so how do we show this track?):
How could you determine if you have sequenced enough of the library to capture all the peaks?
module load macs3
macs3 callpeak -t ATF_ChIP_10mil.bam -c Input_10mil.bam -f BAM -g hs -n ATF1 -B -q 0.01 --outdir atf_peaks 2>&1 | tee -a ATF1_peaks_log.txt
cd atf_peaks
Rscript ATF1_model.r
addTrackline*.sh
scripts we modified from last week./home/FCAM/meds5420/scripts/addTrackLines.sh Input_macs3 ATF1_control_lambda.bdg
/home/FCAM/meds5420/scripts/addTrackLines.sh ATF1_macs3 ATF1_treat_pileup.bdg
gzip ATF1_treat_pileup_header.bedGraph
gzip ATF1_control_lambda_header.bedGraph
/home/FCAM/meds5420/scripts/addTrackLine_macs_narrowPeak.sh ATF1_narrow ATF1_peaks.narrowPeak
/home/FCAM/meds5420/scripts/addTrackLine_macs_bed.sh ATF1_summits ATF1_summits.bed
addTrackLiness.sh
script.sort -k5nr ATF1_peaks_summits.bed | head -1
One can subsample the ChIP-seq data and record the number of peaks called at certain read depths. Plotting of the number of peaks called vs. read depth, will reveal whether you are at saturation. This is outlined nicely in Figure 6A of Kharchenko, et. al., 2008
To select random lines from a file you can use the shuf
command in shell as follows:
shuf -n #of_lines <INPUT.file> > <OUTPUT.file>
However, it is not as easy as just choosing random lines from a FASTQ or SAM file. Why? Can you think of solutions?
It is easiest to operate on a sam
file (without the header), because fastq
have four lines per read. Then you would have to add back a header.
You can also use seq
in combination with shuffle in a shell script to get random lines in defined intervals for doing the analysis.
seq 1000 500 3000
# gives a sequence of numbers from 1000 to 3000 in increments of 500.
## 1000
## 1500
## 2000
## 2500
## 3000
The next step is to loop through the sequence of numbers and use shuf
to randomly select lines from the input file. These outputs can be converted to bam
files and used for peak calling with macs3