1 Last Time:

Last time we used bedtools to tell us the number of reads within our peaks and also what the closest genes were to peaks. However, one caveat with bedtools closest, is that it chooses the closet edge of your targets (genes) to compare your peaks locations to. This means that it is not always obvious whether or not your peak is closest to the gene start or end. One way to get around this is to create a new annotation list that just contains TSSs. This would allow you to use bedtools closest to get the closest TSS to your peak.

peak assignment problem

Figure 1: peak assignment problem

Using awk, we can easily create such a TSS file. The awk sript below creates a new file and sets both the start and stop to the TSS of the gene, which is the ‘start’ for genes on the plus strand and the ‘stop’ for genes on the minus strand.

awk '{OFS="\t";} { if($6=="+")
print $1, $2, $2+1, $4, $5, $6;
print $1, $3-1, $3, $4, $5, $6}' hg38_genes.bed > hg38_genes_strandedTSS.bed

One can then use bedtools closest to determine the closest TSS (or promoter) to each peak location.

bedtools closest -a PEAKS.bed -b TSS.bed > OUTFILE.bed

This same type of manipulation can be used for creating bedfiles of different regions for comparing to ChIP-seq peak locations. i.e. You can create promoter, genic and intergenic regions and determine where you ChIP-seq peaks tend to go to. A file that contains the closest gene to all ATF1 peaks is available at /home/FCAM/meds5420/data/ATF1/closest_peak_TSS/peaks_promoters.bed. The last column is the distance to the gene. What does a negative distance mean?

1.1 Review of analysis pipeline