Last time we used bedtools
to tell us the number of reads within our peaks and also what the closest genes were to peaks. However, one caveat with bedtools closest
, is that it chooses the closet edge of your targets (genes) to compare your peaks locations to. This means that it is not always obvious whether or not your peak is closest to the gene start or end. One way to get around this is to create a new annotation list that just contains TSSs. This would allow you to use bedtools closest
to get the closest TSS to your peak.
Using awk
, we can easily create such a TSS file. The awk sript below creates a new file and sets both the start and stop to the TSS of the gene, which is the ‘start’ for genes on the plus strand and the ‘stop’ for genes on the minus strand.
awk '{OFS="\t";} { if($6=="+")
print $1, $2, $2+1, $4, $5, $6;
else
print $1, $3-1, $3, $4, $5, $6}' hg38_genes.bed > hg38_genes_strandedTSS.bed
One can then use bedtools closest
to determine the closest TSS (or promoter) to each peak location.
bedtools closest -a PEAKS.bed -b TSS.bed > OUTFILE.bed
This same type of manipulation can be used for creating bedfiles of different regions for comparing to ChIP-seq peak locations. i.e. You can create promoter, genic and intergenic regions and determine where you ChIP-seq peaks tend to go to. A file that contains the closest gene to all ATF1 peaks is available at /home/FCAM/meds5420/data/ATF1/closest_peak_TSS/peaks_promoters.bed
. The last column is the distance to the gene. What does a negative distance mean?
An important piece of information that can be gained from genomic binding data is the discovery of the underlying DNA motif that is bound by the transcription factor. There are many tools out there that will allow you to discover motifs. The MEME analysis suite is a popular and powerful set of tools that can allow you to perform several useful analyses on DNA sequences of interest, such as:
MEME Website: http://meme-suite.org/
Several steps must be performed prior to MEME analysis. Here is a summary list:
1. Create bed
file with peak locations to analyze.
- Short windows around summits.
- Only most intense peaks needed.
- Get random windows of the same size in for background estimation.
2. retrieve DNA sequences from regions to analyze
3. Create a background base frequency model (part of the MEME suite)
MEME requires DNA sequences (in fasta format) of your regions of interest. Long sequences can decrease accuracy and take a long time to search through, so it is best to choose narrow regions. When finding motifs within ChIP-seq peaks we can use our peak summits to center our analysis since most TFs will be found in the immediate vicinity of their peak summits. We can use bedtools slop
to create small windows around our summits:
Let’s get 100bp window centered on peak summits (/home/FCAM/meds5420/data/ATF1/ATF1_peaks/ATF1_summits.bed
) as follows (technically this is a 101 base window):
bedtools slop -i SUMMITS.BED -g <chrom_sizes> -b 50 > SUMMITS_100bp.bed
#-g is chromosome sizes file
#-b The number of basebairs to add to both sides
#-l The number of base pairs to subtract from the start coordinate
#-r The number of base pairs to add from the end coordinate
Another way to reduce time for running MEME and to increase accuracy is to run it on a subset of the most intense peaks. The peak intensity is in column 5 of the peak summits file. The summits .bed file can be sorted and parsed as follows:
sort -n -r -k5 SUMMITS_100bp.bed | head -200 > SUMMITS_100bp_top200.bed
# Performs numeric sort on column 5, reports it in
# reverse (-r) order (descending) and outputs the first 200 to output file.
Next, we can use bedtools getfasta
to get the sequence within the desired intervals and return them in fasta format.
fastaFromBed -fi GENOME.fa -bed SUMMITS_100bp.bed -fo SUMMITS_100bp.fa
#-name: Will use the 'name' column from bed file to name sequence
#-fi: Input fasta sequences to retrieve interval sequences from
#-bed: Input bed file
#-fo: Output fasta file
Fasta formatted genome sequences and the genome sizes file for this class can be found at:
/home/FCAM/meds5420/genomes/
If you get the error: could not open index file /home/FCAM/meds5420/genomes/hg38.fa.fai for writing!
, then just copy the file to your directory and operate directly on the local file.
(Optional): Create a file with background sequences for comparison to our enriched sequences.
We can generate the interval with bedtools random
as below.
bedtools random -g hg38.chrom.sizes -n 1000 -l 101 > hg38_random_intervals.bed
Then get the fasta sequences for the random regions:
fastaFromBed -fi hg38.fa -bed hg38_random_intervals.bed -fo hg38_random_intervals.fa
Then create a background file. You can create a Markov model of any order from an input FASTA file of sequences using fasta-get-markov
.
fasta-get-markov -m 3 hg38_random_intervals.fa > hg38_random_intervals_bkgrnd.txt
This is optional because the end results are a simple probability of bases occuring based on frequency in your sequences. Have a look at hg38_random_intervals_bkgrnd.txt
to see. meme
will automatically create a background on the fly by creating a ‘shuffled’ version of your sequences.
Now we can run MEME on the .fasta
formatted sequences. On the Xanadu server meme
functions are run as singularity exec /isg/shared/apps/meme/5.4.1/meme.sif <function>
:
singularity exec /isg/shared/apps/meme/5.4.1/meme.sif fasta-get-markov -m 3 hg38_random_intervals.fa > hg38_random_intervals_bkgrnd.txt
I think running meme
for motif analysis without the singularity
command will work, but this also functions:
singularity exec /isg/shared/apps/meme/5.4.1/meme.sif meme SUMMITS_100bp.fa -oc meme_OUT_FOLDER -bfile hg38_random_intervals_bkgrnd.txt -dna -nmotifs 2 -minw 5 -maxw 10 -revcomp -mod zoops
I am getting some weird warnings about perl when running on the server, but the results seem ok.
I much prefer running the classic function for more sensitive and wider motifs, but it takes longer. I employ -markov_order 3
in the command, which incorporates a background model on the fly using the file that specifies all k-mer frequencies up to the specified value. I typically just calculate it from the input file as opposed to random regions or the entire genome.
singularity exec /isg/shared/apps/meme/5.4.1/meme.sif meme -oc ${name}.meme_output -objfun classic -evt 0.01 -searchsize 0 -minw 5 -maxw 18 -revcomp -dna -markov_order 3 -maxsize 100000000 SUMMITS_100bp.fa
There are a number of options to consider here:
-oc
: Folder to create where files will be written. Allows overwriting of previously made folders (use -o to disable this)
-dna
: Tell MEME to interpret symbols as nucleic acids (as opposed to amino acids)
-nmotifs
: number of motifs to search for
-evt
: stop searching for motifs when a motif cannot be found with a samller e-value (if nmotifs
and evt
are invoked, then evt
will supercede the nmotifs
option) Â
-minw
: minumum motif width in bases
-maxw
: maximum motif width in bases
-revcomp
: look for motif on both strands of DNA (do not assume strandedness)
-mod zoops
: MEME assumes that each sequence contains zero or one motif. Opposed to anr
where any number of motifs can be present per sequence.
There are more options avaiable. See manual or http://meme-suite.org/doc/meme.html?man_type=web.
The following files are placed in the designated output folder:
.xml
file of output (machine readable).txt
file of output (human readable).html
file of output browser viewable file of output.eps
file of sequence logo for motif(s). Both forward and reverse complementThe output file contains:
Given a motif position weight matrix, one can search other for sequences for the occurance of that motif. For instance, finding a motif under a ChIP-seq peak with MEME does not mean that the motif is present at every binding site. We can use MAST to determine the number of sequences or found binding sites that have a specific motif.
For usage go here: http://meme-suite.org/tools/mast
Click on manual bar on left and view command line version for MAST
.
The basic usage is simple:
singularity exec /isg/shared/apps/meme/5.4.1/meme.sif mast MEME_output.txt SUMMITS_100bp.fa -oc mast_OUT_FOLDER
# The first file is the meme output with motifs that you want to search for
# The second file is the fasta formatted ChIP-seq regions we want to search for motifs within
# The third is the output folder as in MEME
The MAST output format is similar to meme and is placed in the designated output folder:
.xml
file of output (machine readable).txt
file of output (human readable).html
file of output browser viewable file of outputThe output file contains the names of the sequences that have significant occurences of the motif.
Using the -hit_list
option in mast creates a list of matches in a more easily processible format. However, this is sent to stdout
and so must be redirected to a file.
#sending hits list to file by redirect
singularity exec /isg/shared/apps/meme/5.4.1/meme.sif mast MEME_output.txt -hit_list SUMMITS_100bp.fa > mast_hits.txt
MAST is useful for determining if a sequence (or lists of them) have a specific motif. FIMO, is good for finding ALL significant occurrences of a motif within a sequence (i.e. a whole chromosome or genome).
FIMO can use a background sequence to account for biases in sequences. You can create this file as follows.
singularity exec /isg/shared/apps/meme/5.4.1/meme.sif fasta-get-markov -m 3 hg38.fa > hg38_bkgrnd.txt
You can then run FIMO to scan sequences for all occurrences of a motif. I usually set --max-stored-scores
quite high and --max-strand
provides the best match to identically overlapping sequences (only difference being the strand):
singularity exec /isg/shared/apps/meme/5.4.1/meme.sif fimo --max-strand --max-stored-scores 10000000 --oc fimo_OUT_FOLDER --bgfile hg38_bkgrnd.txt MEME_output.txt hg38.fa
This is a basic run, but there are several options one can implement:Â http://meme-suite.org/doc/fimo.html?man_type=web
FIMO returns .html
, .gff3
, and .tsv
(tab-separated values) files. I usually invoke the option --text
and redirect (>
) the stdout
to a file name to create a text file output that contains a list of the motifs, their locations, and significance.
Here is a run that I recently did on my local machine (it does not require singularity exec ...
)
fimo --max-strand --text --max-stored-scores 10000000 --oc fimo_atf1_out meme.txt hg38.fa > fimo_output.txt
After going through this workflow, there are a number of things one can do with these locations and bedtools.
mast
output using bedtools
.fimo
output using bedtools
, choose non-overlapping.Starting with the ChIP-seq summits file, try the following:
/home/FCAM/meds5420/data/ATF1/ATF1_peaks/ATF1_summits.bed
hg38.fa
and chromosomes (needed later) can be found here:/home/FCAM/meds5420/genomes/
and/home/FCAM/meds5420/genomes/chroms
narrowPeaks
file for this):.tar
files from the output and scp
or sftp
the folders to your machine..html
files in the browser.bedtools
to assign motifs (from FIMO) as inside or outside of peaks. Tip: Convert FIMO output to a bed
file first.module load meme
module load bedtools
#Let's set a variable to the chromInfo file so we don't have to type it every time:
genome=/home/FCAM/meds5420/genomes/hg38.chrom.sizes
peaks=/home/FCAM/meds5420/data/ATF1/ATF1_peaks/ATF1_summits.bed
mkdir in_class_lec16
cd in_class_lec16
#starting from your own ATF1/peaks directory make a window around the summits
slopBed -i $peaks -g $genome -b 50 > ATF1_summit_101bp.bed
# get the top 200 peaks
sort -k5nr ATF1_summit_101bp.bed | head -n 200 > ATF1_summit_101bp_top200.bed
#you will have to cp the hg38.fa file because an intermediate file is made
cp /home/FCAM/meds5420/genomes/hg38.fa ./
fastaFromBed -fi hg38.fa -bed ATF1_summit_101bp_top200.bed -fo ATF1_summit_101bp_top200.fasta
meme ATF1_summit_101bp_top200.fasta -oc meme_ATF1 -nmotifs 2 -objfun classic -searchsize 0 -minw 5 -maxw 10 -revcomp -dna -markov_order 3 -maxsize 10000000
#Now we must get the fasta sequence for all peaks in order to use MAST on them
fastaFromBed -fi hg38.fa -bed ATF1_summit_101bp.bed -fo ATF1_summit_101bp.fasta
# Now run MAST on them to ID peaks with the MEME motifs.
# Check the manual for the options.
singularity exec /isg/shared/apps/meme/5.4.1/meme.sif mast -hit_list meme_ATF1/meme.txt ATF1_summit_101bp.fasta > mast_ATF1_hits.txt
# To then run FIMO we should create a background file,
# Using our chromosome, which you can get from UCSC, or use hg38.fa
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chromFa.tar.gz
tar -xzf hg38.chromFa.tar.gz
cp /home/FCAM/meds5420/genomes/chroms/chr17.fa ./
# recall I put them here /home/FCAM/meds5420/genomes/chroms
singularity exec /isg/shared/apps/meme/5.4.1/meme.sif fasta-get-markov -m 3 chr17.fa > chr17_bkgrnd.txt
#Next run FIMO with the background file, MEME output and your chr.fasta to search
singularity exec /isg/shared/apps/meme/5.4.1/meme.sif fimo --text --max-strand --max-stored-scores 10000000 --bgfile chr17_bkgrnd.txt ./meme_ATF1/meme.txt chr17.fa > fimo_ATF1_chr17.txt
#Convert FIMO output to bed
awk '{OFS="\t";} NR>1 {print $3, $4, $5, $7, $8, $6}' fimo_ATF1_chr17.txt > fimo_ATF1_chr17.bed
#Intersect FIMO output with peaks
intersectBed -wo -a fimo_ATF1_chr17.bed -b ATF1_summit_101bp.bed -g $genome > ATF1_chr17_motifs_in_peaks.txt
# to get all the fimo output files, note the tsv files is nearly identical to the txt file above
singularity exec /isg/shared/apps/meme/5.4.1/meme.sif fimo --max-strand --max-stored-scores 10000000 -oc fimo_ATF1 --bgfile chr17_bkgrnd.txt ./meme_ATF1/meme.txt chr17.fa
# run meme with bfile on random intervals
randomBed -n 1000 -l 101 -g $genome > hg38_random_intervals.bed
fastaFromBed -fi hg38.fa -bed hg38_random_intervals.bed -fo hg38_random_intervals.fasta
singularity exec /isg/shared/apps/meme/5.4.1/meme.sif fasta-get-markov -m 3 hg38_random_intervals.fasta > hg38_random_intervals_markov.txt
meme ATF1_summit_101bp_top200.fasta -oc meme_ATF1_bfile -nmotifs 2 -objfun classic -evt 0.01 -searchsize 0 -minw 5 -maxw 10 -revcomp -dna -bfile hg38_random_intervals_markov.txt -maxsize 10000000
Reporting:
#How many peaks are there on chr17?
grep -w chr17 ATF1_summit_101bp.bed > ATF1_chr17_peaks.bed
wc -l ATF1_chr17_peaks.bed
#What is the number of peaks (summit -/+ 50bp) on your chromosome with MAST called motifs:
grep "chr17:" mast_ATF1_hits.txt | cut -f1 -d " " | sort | uniq | wc -l
#how many FIMO motifs on chr17:
cat fimo_ATF1/fimo.tsv | grep "#" -v | grep "motif_id" -v | wc -l
#how many FIMO motifs in peaks?
wc -l ATF1_chr17_motifs_in_peaks.txt
#How many unique peaks in FIMO overlap?
cat ATF1_chr17_motifs_in_peaks.txt | cut -f 10 | sort | uniq | wc -l
# MAST and FIMO call motif instances a bit differently