samtools
and bedtools
bowtie
and bowtie2
and mapping optionsIn order for the output to be a proper .sam
file use the -S
option to explicitly say that you want a .sam
output. This is required for bowtie2
, and ensures that the header is included in the .sam
file which is important for downstream steps (today’s lesson).
For the next part of the course we will be working with a ChIP-seq dataset from human cells. The factor that was IP’ed was ATF1 (SRR5331338). The fastq
file for the experiment and control (Input SRR5331584) is here:
/home/FCAM/meds5420/data/ATF1/fastq/
It may take a while for us all to map the data, so I did it already. Here’s the commands I used.
#request 2 CPUs and 4G of RAM
srun --pty -p mcbstudent --qos=mcbstudent --mem=4G -c 2 bash
#Set a variable where my genome is
hgGen=/home/FCAM/meds5420/genomes/hg38_bt2/hg38
#set variable to where raw data
atfRaw=/home/FCAM/meds5420/data/ATF1/fastq/
#set variable to where i want output fq to go
atfFq=/home/FCAM/meds5420/data/ATF1/fastq/
#set variable to where i want output sam to go
atfSam=/home/FCAM/meds5420/data/ATF1/sam/
mkdir data
cd data
mkdir ATF1
cd ATF1
mkdir fastq
mkdir sam
#move 10million reads to a new file, omit the first million
zcat ${atfRaw}SRR5331338_ATF1_ChIP.fastq.gz | head -44000000 |tail -40000000 > ${atfFq}ATF1_chip_10m_230227.fastq
#load bowtie module and run Bowtie
module load bowtie2
bowtie2 -p2 -t -x $hgGen -U ${atfFq}ATF1_chip_10m_230227.fastq -S ${atfSam}ATF1_chip_10m_230227.sam 2>&1 | tee ${atfSam}ATF1_chip_10m_230227_alignment_log.txt
##LOOK at manual for options description.
OR
module load bowtie
hg_bt="/home/FCAM/meds5420/genomes/hg38_bt/hg38"
bowtie -p4 -v2 -m1 -x $hg_bt ${atfFq}ATF1_chip_10m_230227.fastq -S ${atfSam}ATF1_10m_bt1_align_230227.sam 2>&1 | tee ${atfSam}ATF1_chip_10m_230227_alignment_bt1_log.txt
# Order is options, genome, reads-to-map, outfilename (can be designated with -S)
Notes (take note and let me know if you have questions!):
ATF1_chip_10m_230227
and ATF1_chip_10m_230227_alignment
). I should set a variable such as ‘prefix’ and appended .sam
’ and _alignment_log.txt
to the output. This would ensure that the basename of all the files is the same and I can repurpose this script to process any file in a consistent manner.-p
option of mkdir
specify?Today we will conduct basic steps in processing data after genome alignment. We will convert the data into a format that is compatible for viewing in the UCSC genome browser and for downstream analyses.
The most common output format from bowtie
and other aligners is the SAM format. SAM format allows for storing a wealth of information about the sequence alignment in a single line of text. The format is a mix of human-readable and computer-readable information.
More information can be found in the publication: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2723002/pdf/btp352.pdf, and in the manual:
http://samtools.github.io/hts-specs/SAMv1.pdf.
Notes:
@
symbol. There is an option in bowtie to omit these lines from the output if so desired.There is a lot of information here, but for most analytical purposes, we can use four pieces of information contained in 4 columns. These columns are:
Interpreting the FLAG:
There are also several types of information in the FLAG. For simplicity, we only need to worry about three FLAGs:
If you do get a flag that you don’t understand, you can get the interpretation of it at this website:
https://broadinstitute.github.io/picard/explain-flags.html.
In addition to the above information the CIGAR string (6th column) can also be useful. This line gives information about the nature of the alignment. (i.e. which bases matched the reference and where in the read there are any mismatches, insertions or deletions). The link below shows a decent example of how the CIGAR string relates to the alignment.
SAM / CIGAR string example: http://genome.sph.umich.edu/wiki/SAM
samtools
and bedtools
.bam
files with samtools
.SamTools is a set of utilities for manipulating SAM files. Compared to other sets of Utilities (e.g. bedtools
), Samtools is somewhat limited. However, it is important that you know that it exists, and we will use it for at least one task - converting to binary sam format (BAM).
The .bam
format is simply a binary version of your .sam file. This allows the file to be read faster by your computer, but it is not human readable. Try viewing the head of a .bam
file and you can see that the file structure is lost to human eyes. Several downstream utilities can use .bam
files as input, making this a useful step in post-processing.
The command is the samtools view
, with the following usage:
module load samtools
samtools view -S -b infile.sam > outfile.bam
# -S auto-detects input format, -b specifies bam output.
Note about samtool
s usage:
All results are reported to standard output (i.e. to the screen). This means you have to use the >
when specifying the output file. This is useful because it also means that we can use pipes (‘|’), to string together multiple samtools commands
bedtools
utilities for manipulating files and performing genomic operations.Bedtools, is a powerful ‘swiss-army knife’ (self-proclaimed) set of utilities that allows simple and complex operations to be done with minimal effort by the user. Here are some useful operations that can be performed with bedtools
:
bamtobed
- convert .bam file to .bed filebamtofastq
- convert .bam back to .fastqbedtobam
- convert .bed back to .bamclosest
- find the closest region in one file to locations in another filegenomecov
- compute bases covered by reads in genomegetfastq
- extract sequence regions out of a fasta sequence fileintersect
- get overlapping interval from various filesmerge
- merge overlapping intervals of two filesrandom
- select random lines from a fileToday we will use bedtools
to convert our .bam
file to a .bed
file and another type of file (.bedGraph
) for viewing in the genome browser.
bamtobed usage:
module load bedtools
bedtools bamtobed -i infile.bam > outfile.bed # see note
Note about Bedtools usage:
As with samtools
, all results are reported to standard output (i.e. to the screen). This means you have to use the >
when specifying the output file.
Bed format (Browser Extensible Data), is a file format developed for displaying genomic data on genome browsers. You can displaying both data and annotations with bedfiles. See the UCSC genome browser for a detailed description of the .bed format. There is some critical data needed for displaying data in the browser, while other types of info are optional. Here are the column specifications for each feature:
1. chrom - chromosome
2. start - where item starts in the chromosome
3. end - where item ends on the chromosome
4. name (optional) - name (e.g. gene name) of item
5. score (optional) - number from 1-1000 that specifies shading of the item (higher number = darker)
6. strand (optional) - the strand the feature aligns to
7. thickStart
8. thickStart
9. itemRGB
10. blockCount
11. blockSizes
12. blockStarts
Specific information about bed files accepted as input to bedtools can be found here: http://bedtools.readthedocs.org/en/latest/content/general-usage.html.
Here’s an example of a simple bedfile:
head -5 ATF1_ChIP_10mil.bed
## chr11 79457171 79457221 SRR5331338.1000001 255 +
## chr12 41587919 41587969 SRR5331338.1000002 255 -
## chr8 43564697 43564747 SRR5331338.1000006 255 +
## chr1 112597481 112597531 SRR5331338.1000007 255 -
## chr10 32414161 32414211 SRR5331338.1000008 255 -
Before moving forward, you must sort the bedfile for both viewing in the genome browser and for any future operations with Bedtools. Bedtools has a sort function called sortBed with the following usage:
sortBed -i infile.bed > infile_sorted.bed
From an earlier lesson, you can also use the sort shell command as follows:
sort -k1,1 -k2,2n infile.bed > infile_sorted.bed
Now look at the sorted file:
head ATF1_ChIP_10mil_sorted.bed
## chr1 10433 10483 SRR5331338.1925486 255 -
## chr1 17459 17509 SRR5331338.7795627 255 +
## chr1 17475 17525 SRR5331338.6585365 255 -
## chr1 181448 181498 SRR5331338.2607872 255 +
## chr1 182898 182948 SRR5331338.9132195 255 -
## chr1 191526 191576 SRR5331338.10621969 255 -
## chr1 456415 456465 SRR5331338.4610441 255 +
## chr1 505622 505672 SRR5331338.8028261 255 -
## chr1 606777 606827 SRR5331338.9768832 255 -
## chr1 629074 629124 SRR5331338.6454906 255 -
bedGraphs
for viewing in the browserBED files as we made them have a unique line or entry for every read in your data. This provides a detailed view of the data. However, when working with large datasets, in can be an inefficient and cumbersome way to display data in the browser. bedGraphs
are another way of displaying data in the browser. Instead of showing a line per read in the file (as in a .bed file), bases with data are shown with with the number of reads covering that base shown in a separate column. When possible, .bedGraph
files are often more efficient than using .bed
files for two reasons.
The bedtool utility genomecov
can be used to create a .bedGraph
from a sorted .bed
file. bedtools
developers provide a handy diagram about how different options handle the data:
For a basic bedGraph
use the following format:
bedtools genomecov -bg -i infile.sorted.bed -g hg38.chrom.sizes > test.bedGraph
# bg specifies bedgraph
#-g option is required and points to a chromosome Info file.
The hg38.chrom.sizes
file is a file that specifies the chromsomes represented in the data and their lengths.
It is located here: /home/FCAM/meds5420/genomes/hg38.chrom.sizes
See what the new file looks like:
head ATF1_ChIP_10mil.bedGraph
## chr1 10433 10483 1
## chr1 17459 17475 1
## chr1 17475 17509 2
## chr1 17509 17525 1
## chr1 181448 181498 1
## chr1 182898 182948 1
## chr1 191526 191576 1
## chr1 456415 456465 1
## chr1 505622 505672 1
## chr1 606777 606827 1
More info on bedgraph format specifications can be found here.
Mapping and post-processing of ChIP-seq data for viewing in the genome browser as a .bed and .bedGraph file.
- The data is from a ChIP-seq experiment for the transcription factor, ATF1, in human cells.
- The file names are: SRR5331338_ATF1_ChIP.fastq.gz
and SRR5331584_Control_Input.fastq.gz
and the files are in the /home/FCAM/meds5420/data/ATF1/fastq
folder on the server.
- Recall there is an indexed hg38 genome for bowtie2:
/home/FCAM/meds5420/genomes/hg38_bt2
- there is a chromosome size file here : /home/FCAM/meds5420/genomes/hg38.chrom.sizes
Map 10 million reads from each file to the genome with bowtie2
:
1. use 4G of RAM and 4CPUs (you may have to log in again and request more memory and cores: srun --pty -p mcbstudent --qos=mcbstudent --mem=4G -c 2 bash
)
2. Before aligning the data, ignore the first 1,000,000 reads, use reads 1,000,001 - 11,000,000.
3. specify .sam
output in the alignment
4. Convert the output file to a .bam
file (samtools
)
5. Convert the previous output file to a .bed
file (bedtools
)
6. Sort the bed
file
7. Convert the file to a bedGraph
file (bedtools
) - look at the manual online.
8. Create a variation bedGraph
:
* One that computes coverage for only the 5’ end of the read.
Track lines allow you to specify what data is being displayed and how to display it. Browser lines also allow you to specific where in the genome viewing will begin. There is more information from the browser on track Lines here:
http://genome.ucsc.edu/goldenpath/help/customTrack.html#TRACK
There is an option in the genomecov
program that allows you to do this, but these can be added easily with awk as well.
Adding tracklines to .bedGraph
files:
#UCSC tracklines for bedgraph
awk 'BEGIN { print "browser position chr11:5,289,521-5,291,937";
print "track type=bedGraph name=\"NAME_bedgraph\" description=\"NAME_bedgraph\" visibility=full autoScale=on alwaysZero=on color=0,125,0 windowingFunction=maximum"}
{ print $0}' INFILE.bedgraph > INFILE_header.bedgraph
Here’s another one for a .bed
file that colors by strand:
#UCSC tracklines for bed file
awk 'BEGIN { print "browser position chr11:5,289,521-5,291,937";
print "track type=bed name=\"NAME_bed\" description=\"NAME_bed\" visibility=squish autoScale=on colorByStrand=\"255,0,0 0,0,255\""}
{ print $0}' INFILE.bed > INFILE_header.bed
These awk commands are on the server in the following location: /home/FCAM/meds5420/scripts/addTrackLine_awk.txt
I would stash these commands in your awk_one_liner
file if you have one.
Let’s turn this into a shell script. However, there is one tip in awk
that we need to revisit to make this possible:
awk
Consider the following situation where you want to introduce a variable within a line of printed text. The text must be wrapped in quotes, therfore you try this:
#create variable
var=ChIP-seq
#pass variable to awk in quoted text
echo $var | awk -v var="$var" '{print "This is my track with var data"}'
## This is my track with var data
One can escape these quotes to print the variable by adding a second set of quotes. Recall that the variable has to be outside of a double quote or it will be interpreted literally and not as a variable.
#create variable
var=ChIP-seq
#pass variable to awk AND quote it inside quoted text
echo $var | awk -v var="$var" '{print "This is my track with " var " data"}'
## This is my track with ChIP-seq data
What if we want to add real quotes to a printed line in awk
? Add a backslash \
inside of quotes to prevent awk
from interpreting it. Stated differently: when we want awk
to print a "
string, we must precede the quote with a \
character. The \"
is printed as a "
.
#create variable
var=ChIP-seq
#pass variable to awk with quotes and then use backslash for literal print of a quote.
echo $var | awk -v var="$var" '{print "This is my track with \""var"\" data"}'
## This is my track with "ChIP-seq" data
Let’s take everything we learned and convert this to a script with variables and name the file addTrackLines.sh
:
#!/bin/bash
# argument 1 is the descriptive name of the experiment
# argument 2 is the input bedGraph file without a header
# usage: add a header file to a headerless bedGraph file
# some of these header variables are hard coded, but can be passed as variables
name="$1"
outname=$(echo "$2" | cut -d "." -f 1) #use the base name in outfile
awk -v var="$name" 'BEGIN { print "browser position chr11:5,289,521-5,291,937"; print "track type=bedGraph name=\"" var "\" description=\"" var "_bedGraph\" visibility=full autoScale=on alwaysZero=on color=0,125,0 windowingFunction=maximum"} { print $0}' "$2" > ${outname}_header.bedGraph
The above script is complicated because of the excessive use of double quotes. However, one should only have to write a script like this once and only make minor modifications over time.
You can upload these data (.bed
and .bedGraph
files) as custom tracks on the browser. Just go to ‘custom tracks’ -> ‘add custom tracks’ and follow instructions.
Uploading Tips:
.gzip
ped files.Starting with the .bed
file and .bedGraph
file you just created, do the following.
- Add appropriate track lines to the head of the .bed
and .bedGraph
files.
- gzip
the files
- Transfer from the server to your laptop
- Upload the data onto the UCSC genome browser
Mapping and post-processing of ChIP-seq data for viewing in the genome browser as a .bed and .bedgraph file.
The data is from a ChIP-seq experiment for the transcription factor, ATF1, in human cells. The file names are: ATF1_chip_10m.fastq.gz
and ATF1_input_10m.fastq.gz
and the files are in the /home/FCAM/meds5420/data/ATF1/fastq/
folder on the server.
#Map the data (answers to Bowtie 1)
bowtie -p4 -m1 -v2 -S -x /home/FCAM/meds5420/genomes/hg38_bt/hg38_bt
/home/FCAM/meds5420/data/ATF1/fastq/ATF1_chip_10m.fastq ATF1_chip_10m.sam
bowtie -p4 -m1 -v2 -S -x /home/FCAM/meds5420/genomes/hg38_bt/hg38_bt
/home/FCAM/meds5420/data/ATF1/fastq/ATF1_input_10m.fastq ATF1_input_10m.sam
# what I asked bowtie2
hgGen=/home/FCAM/meds5420/genomes/hg38_bt2/hg38
bowtie2 -p4 -t -x $hgGen -U /home/FCAM/meds5420/data/ATF1/fastq/ATF1_chip_10m.fastq -S ATF1_chip_10m.sam
bowtie2 -p4 -t -x $hgGen -U /home/FCAM/meds5420/data/ATF1/fastq/ATF1_input_10m.fastq -S ATF1_input_10m.sam
#convert to .bam
samtools view -Sb ATF1_chip_10m.sam > ATF1_chip_10m.bam
samtools view -Sb ATF1_input_10m.sam > ATF1_input_10m.bam
#convert to .bed
bedtools bamtobed ATF1_chip_10m.bam > ATF1_chip_10m.bed
bedtools bamtobed ATF1_input_10m.bam > ATF1_input_10m.bed
#sort the bed
sort -k1,1 -k2,2n ATF1_chip_10m.bed > ATF1_chip_10m_sorted.bed
sort -k1,1 -k2,2n ATF1_input_10m.bed > ATF1_input_10m_sorted.bed
# bedgraph: hg38.chrom.sizes file is at /home/FCAM/meds5420/genomes/
bedtools genomecov -bg -i ATF1_chip_10m_sorted.bed -g /home/FCAM/meds5420/genomes/hg38.chrom.sizes > ATF1_ChIP.bedGraph
#report 5' end of read only
bedtools genomecov -bg -5 -i ATF1_chip_10m_sorted.bed -g hg38.chrom.sizes > ATF1_ChIP_5prime.bedGraph
addTrackLines.sh
script we wrote in class to add the track lines.scp
os sftp
to copy the files to your computer.I will demonstrate in class.