samtools and bedtools
bowtie, bowtie2, and mapping options
For the output to be a proper .sam file, use the -S option to explicitly request .sam output. In bowtie, -S switches the output format to SAM; in bowtie2, which writes SAM by default, -S names the file the alignments are written to. Either way, the result is a .sam file that includes the header, which is important for downstream steps (today’s lesson).
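A quick way to confirm that a .sam file kept its header is to count the lines that start with @. The sketch below is only an illustration: the toy file and its contents are made up, and `toy_sam` and `header_count` are hypothetical names, not part of the course data.

```shell
# Build a tiny, made-up SAM file: two header lines plus one alignment line.
tmpdir=$(mktemp -d)
toy_sam="${tmpdir}/toy.sam"
printf '@HD\tVN:1.0\tSO:unsorted\n' > "$toy_sam"
printf '@SQ\tSN:chr1\tLN:1000\n' >> "$toy_sam"
printf 'read1\t0\tchr1\t100\t42\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n' >> "$toy_sam"

# Count header lines (lines beginning with @). A count of 0 would mean
# the header is missing.
header_count=$(grep -c '^@' "$toy_sam")
echo "header lines: ${header_count}"
```

On a real alignment file, `samtools view -H` followed by the file name prints only the header, which serves the same purpose.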
For the next part of the course we will be working with a ChIP-seq dataset from human cells. The factor that was IP’ed was ATF1 (SRR5331338). The fastq files for the experiment and the control (Input, SRR5331584) are here:
/home/FCAM/meds5420/data/ATF1/fastq/
It may take a while for us all to map the data, so I did it already. Here are the commands I used.
#request 2 CPUs and 4G of RAM
srun --pty -p mcbstudent --qos=mcbstudent --mem=4G -c 2 bash
#set variable to where my genome index is
hgGen=/home/FCAM/meds5420/genomes/hg38_bt2/hg38
#set variable to where the raw data are
atfRaw=/home/FCAM/meds5420/data/ATF1/fastq/
#set variable to where I want the output fastq to go
atfFq=/home/FCAM/meds5420/data/ATF1/fastq/
#set variable to where I want the output sam to go
atfSam=/home/FCAM/meds5420/data/ATF1/sam/
mkdir data
cd data
mkdir ATF1
cd ATF1
mkdir fastq
mkdir sam
#copy 10 million reads (40 million lines; each read is 4 lines) to a new file, skipping the first million reads
zcat ${atfRaw}SRR5331338_ATF1_ChIP.fastq.gz | head -44000000 |tail -40000000 > ${atfFq}ATF1_chip_10m_230227.fastq
#load bowtie module and run Bowtie
module load bowtie2
bowtie2 -p2 -t -x $hgGen -U ${atfFq}ATF1_chip_10m_230227.fastq -S ${atfSam}ATF1_chip_10m_230227.sam 2>&1 | tee ${atfSam}ATF1_chip_10m_230227_alignment_log.txt
##LOOK at manual for options description.
OR
module load bowtie
hg_bt="/home/FCAM/meds5420/genomes/hg38_bt/hg38"
bowtie -p4 -v2 -m1 -x $hg_bt ${atfFq}ATF1_chip_10m_230227.fastq -S ${atfSam}ATF1_10m_bt1_align_230227.sam 2>&1 | tee ${atfSam}ATF1_chip_10m_230227_alignment_bt1_log.txt
# Order is options, genome index, reads to map, output filename (-S requests SAM output)
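The head and tail numbers in the subsetting step come from fastq arithmetic: each read takes 4 lines, so skipping 1 million reads and keeping 10 million means taking the first 44 million lines and then the last 40 million of those. Here is a minimal sketch of the same idea on a made-up 5-read file, skipping 1 read and keeping 2. The names `skip_reads`, `keep_reads`, and `subset_fq` are hypothetical.

```shell
# Build a made-up fastq file with 5 reads (4 lines per read).
tmpdir=$(mktemp -d)
toy_fq="${tmpdir}/toy.fastq"
for i in 1 2 3 4 5; do
  printf '@read%s\nACGT\n+\nIIII\n' "$i"
done > "$toy_fq"

# Skip the first read and keep the next two.
skip_reads=1
keep_reads=2
head_lines=$(( (skip_reads + keep_reads) * 4 ))  # lines to take from the top
tail_lines=$(( keep_reads * 4 ))                 # lines to keep from the end of those
subset_fq="${tmpdir}/subset.fastq"
head -n "$head_lines" "$toy_fq" | tail -n "$tail_lines" > "$subset_fq"

# Check the result: read count is the line count divided by 4.
subset_reads=$(( $(wc -l < "$subset_fq") / 4 ))
first_read=$(head -n 1 "$subset_fq")
echo "reads kept: ${subset_reads}, first read: ${first_read}"
```

With skip_reads=1000000 and keep_reads=10000000 the same arithmetic gives the 44000000 and 40000000 used above.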
Notes (take note and let me know if you have questions!):
I typed the same basename repeatedly in the output file names (ATF1_chip_10m_230227 and ATF1_chip_10m_230227_alignment). I should have set a variable such as ‘prefix’ and appended .sam and _alignment_log.txt to it for the output names. This would ensure that the basename of all the files is the same, and I could repurpose this script to process any file in a consistent manner.
What does the -p option of mkdir specify?
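The ‘prefix’ idea from the notes could look something like the sketch below. This is one possible layout, not the script I actually ran: `prefix`, `outdir`, `sam_out`, and `log_out` are hypothetical names, and `outdir` is a temporary directory standing in for ${atfSam}.

```shell
# Temporary stand-in for the real output directory (${atfSam} in the commands above).
outdir="$(mktemp -d)/"

# Set the basename once, then derive every output name from it.
prefix=ATF1_chip_10m_230227
sam_out="${outdir}${prefix}.sam"
log_out="${outdir}${prefix}_alignment_log.txt"
echo "$sam_out"
echo "$log_out"

# The mapping command would then read (shown as a comment, not run here):
# bowtie2 -p2 -t -x $hgGen -U ${atfFq}${prefix}.fastq -S "$sam_out" 2>&1 | tee "$log_out"
```

Changing the single `prefix` line is then enough to process a different file with consistently named outputs.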
Figure 1: post processing flow
Today we will conduct basic steps in processing data after genome alignment. We will convert the data into a format that can be viewed in the UCSC genome browser and used for downstream analyses.
The most common output format from bowtie and other aligners is the SAM format. SAM format allows for storing a wealth of information about the sequence alignment in a single line of text. The format is a mix of human-readable and computer-readable information.
More information can be found in the publication: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2723002/pdf/btp352.pdf, and in the manual: http://samtools.github.io/hts-specs/SAMv1.pdf.
Notes:
Header lines begin with the @ symbol. There is an option in bowtie to omit these lines from the output if so desired.
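Because each alignment is a single tab-separated line, standard command-line tools can pull out individual fields. In the SAM specification the first mandatory columns are the read name, flag, reference name, position, and mapping quality. The sketch below uses a made-up two-read file (the names `toy_sam` and `summary` are hypothetical) and skips the header lines before printing a few columns.

```shell
# Build a tiny, made-up SAM file: one header line and two alignment lines.
tmpdir=$(mktemp -d)
toy_sam="${tmpdir}/toy.sam"
printf '@SQ\tSN:chr1\tLN:1000\n' > "$toy_sam"
printf 'read1\t0\tchr1\t100\t42\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n' >> "$toy_sam"
printf 'read2\t16\tchr1\t250\t30\t10M\t*\t0\t0\tTTTTACGTAC\tIIIIIIIIII\n' >> "$toy_sam"

# Skip header lines (starting with @), then print read name ($1),
# reference name ($3), position ($4), and mapping quality ($5).
summary=$(awk -F'\t' '!/^@/ {print $1, $3, $4, $5}' "$toy_sam")
echo "$summary"
```

The same awk pattern works on a real .sam file, although for large files it is usually worth piping through head first to look at just a few lines.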