Last time we used bedtools
to tell us the number of reads within our peaks and also what the closest genes were to peaks. However, one caveat with bedtools closest
, is that it chooses the closet edge of your targets (genes) to compare your peaks locations to. This means that it is not always obvious whether or not your peak is closest to the gene start or end. One way to get around this is to create a new annotation list that just contains TSSs. This would allow you to use bedtools closest
to get the closest TSS to your peak.
Figure 1: peak assignment problem
Using awk
, we can easily create such a TSS file. The awk sript below creates a new file and sets both the start and stop to the TSS of the gene, which is the ‘start’ for genes on the plus strand and the ‘stop’ for genes on the minus strand.
awk '{OFS="\t";} { if($6=="+")
print $1, $2, $2+1, $4, $5, $6;
else
print $1, $3-1, $3, $4, $5, $6}' hg38_genes.bed > hg38_genes_strandedTSS.bed
One can then use bedtools closest
to determine the closest TSS (or promoter) to each peak location.
bedtools closest -a PEAKS.bed -b TSS.bed > OUTFILE.bed
This same type of manipulation can be used for creating bedfiles of different regions for comparing to ChIP-seq peak locations. i.e. You can create promoter, genic and intergenic regions and determine where you ChIP-seq peaks tend to go to. A file that contains the closest gene to all ATF1 peaks is available at /home/FCAM/meds5420/data/ATF1/closest_peak_TSS/peaks_promoters.bed
. The last column is the distance to the gene. What does a negative distance mean?