1 Review and mapping of data for ChIP-seq analysis

1.1 Review of bowtie and bowtie2 and mapping options

To get a proper .sam file, use the -S option. In bowtie, -S requests SAM-formatted output; in bowtie2, -S names the file that the SAM output is written to (without it, the alignments are printed to standard output). By default the resulting file includes the SAM header, which is important for downstream steps (today’s lesson).

For the next part of the course we will be working with a ChIP-seq dataset from human cells. The factor that was IP’ed was ATF1 (SRR5331338). The fastq files for the experiment and the control (Input, SRR5331584) are here:
/home/FCAM/meds5420/data/ATF1/fastq/

Mapping the data may take a while for all of us, so I have already done it. Here are the commands I used.

#request 2 CPUs and 4G of RAM
srun --pty -p mcbstudent --qos=mcbstudent --mem=4G -c 2 bash

#set a variable for the location of the genome index
hgGen=/home/FCAM/meds5420/genomes/hg38_bt2/hg38
#set a variable for the location of the raw data
atfRaw=/home/FCAM/meds5420/data/ATF1/fastq/
#set a variable for where the output fastq should go
atfFq=/home/FCAM/meds5420/data/ATF1/fastq/
#set a variable for where the output sam should go
atfSam=/home/FCAM/meds5420/data/ATF1/sam/

mkdir data
cd data 
mkdir ATF1
cd ATF1
mkdir fastq
mkdir sam

#copy 10 million reads to a new file, skipping the first million (each read is 4 lines)
zcat ${atfRaw}SRR5331338_ATF1_ChIP.fastq.gz | head -44000000 |tail -40000000 > ${atfFq}ATF1_chip_10m_230227.fastq

#load the bowtie2 module and run bowtie2
module load bowtie2

bowtie2 -p2 -t -x $hgGen -U ${atfFq}ATF1_chip_10m_230227.fastq -S ${atfSam}ATF1_chip_10m_230227.sam 2>&1 | tee ${atfSam}ATF1_chip_10m_230227_alignment_log.txt

##Look at the manual for descriptions of the options.

OR

module load bowtie
hg_bt="/home/FCAM/meds5420/genomes/hg38_bt/hg38"

bowtie -p4 -v2 -m1 -x $hg_bt ${atfFq}ATF1_chip_10m_230227.fastq  -S ${atfSam}ATF1_10m_bt1_align_230227.sam 2>&1 | tee ${atfSam}ATF1_chip_10m_230227_alignment_bt1_log.txt
# Order is options, genome, reads-to-map, outfilename (-S sets the output format to SAM)
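
The head and tail values in the subsetting command come from the fact that each read occupies 4 lines in a fastq file: (1,000,000 + 10,000,000) × 4 = 44,000,000 lines for head, and 10,000,000 × 4 = 40,000,000 lines for tail. Here is a minimal sketch of the same arithmetic on a small synthetic file. The reads, the variable names, and the temporary files are made up for illustration and are not part of the course data.

```shell
# Build a small synthetic fastq (20 made-up reads, 4 lines each).
demoFq="$(mktemp)"
for i in $(seq 1 20); do
    printf '@read%s\nACGT\n+\nIIII\n' "$i"
done > "$demoFq"

skipReads=5    # hypothetical number of reads to discard from the start
keepReads=10   # hypothetical number of reads to keep after the skipped ones

# Convert read counts to line counts (4 lines per read).
headLines=$(( (skipReads + keepReads) * 4 ))
tailLines=$(( keepReads * 4 ))

# head keeps the skipped + wanted reads; tail then drops the skipped ones.
subsetFq="$(mktemp)"
head -n "$headLines" "$demoFq" | tail -n "$tailLines" > "$subsetFq"

# Count the reads that were kept and report the first read name.
nReads=$(( $(wc -l < "$subsetFq") / 4 ))
firstRead="$(head -n 1 "$subsetFq")"
echo "reads kept: $nReads, first read: $firstRead"
# prints: reads kept: 10, first read: @read6
```

Computing the line counts from read counts makes it harder to pick a value that is not a multiple of 4, which would split a read across the cut.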

Notes (take note and let me know if you have questions!):

  • The zcat | head | tail command skips the first million reads because they are often the most prone to errors caused by technical artifacts of sequencing.
  • Setting long paths to variables can help declutter command lines.
  • The names of the output files are messy and inconsistent (ATF1_chip_10m_230227 and ATF1_chip_10m_230227_alignment). I should have set a variable such as ‘prefix’ and appended ‘.sam’ and ‘_alignment_log.txt’ to it to build the output names. This would ensure that all the files share the same basename, and it would let me reuse this script to process any file in a consistent manner.
  • I made all the directories manually, but I should have checked for the presence of the directory using an if statement and only made the directory if needed. What does the -p option of mkdir specify?
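
The last two notes could be sketched roughly as follows. This is one possible layout, not the script that was actually run: baseDir is a hypothetical temporary location chosen so the example can be run anywhere, and the variable names fqFile, samFile and logFile are assumptions introduced for illustration.

```shell
# Hypothetical base directory for the demo; replace with your own location.
baseDir="${TMPDIR:-/tmp}/meds5420_demo/data/ATF1"

# One prefix shared by every output file.
prefix="ATF1_chip_10m_230227"

# Make each output directory only if it is not already there.
# -p also creates any missing parent directories.
for dir in "${baseDir}/fastq" "${baseDir}/sam"; do
    if [ ! -d "$dir" ]; then
        mkdir -p "$dir"
    fi
done

# Build all output names from the same prefix.
fqFile="${baseDir}/fastq/${prefix}.fastq"
samFile="${baseDir}/sam/${prefix}.sam"
logFile="${baseDir}/sam/${prefix}_alignment_log.txt"

echo "$fqFile"
echo "$samFile"
echo "$logFile"
```

With names built this way, processing a different sample only requires changing prefix, and the alignment command can refer to "$fqFile", "$samFile" and "$logFile" instead of repeating the full names.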

2 Overview through today:

Figure 1: post processing flow

3 Today: Post-processing and viewing data in UCSC genome browser.

Today we will carry out the basic steps in processing data after genome alignment. We will convert the data into a format that is suitable for viewing in the UCSC genome browser and for downstream analyses.

4 The Sequence Alignment/Map Format (.sam)

The most common output format from bowtie and other aligners is the SAM format. SAM stores a wealth of information about each sequence alignment in a single line of text. The format is a mix of human-readable and computer-readable information. More information can be found in the publication: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2723002/pdf/btp352.pdf, and in the format specification:
http://samtools.github.io/hts-specs/SAMv1.pdf.
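
To make the "single line of text" idea concrete, here is a small sketch that writes a synthetic SAM file and pulls it apart with standard shell tools. The header lines and the alignment record are made up for illustration (they are not taken from the ATF1 data), and the variable names are assumptions. It relies on two properties of the format: header lines start with @, and each alignment line has 11 mandatory tab-separated fields.

```shell
# Write a tiny synthetic SAM file: two header lines and one alignment (all made up).
demoSam="$(mktemp)"
printf '@HD\tVN:1.6\tSO:unsorted\n' > "$demoSam"
printf '@SQ\tSN:chr1\tLN:248956422\n' >> "$demoSam"
printf 'read1\t0\tchr1\t10468\t42\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$demoSam"

# Header lines start with '@'; count them.
nHeader=$(grep -c '^@' "$demoSam")

# Count the tab-separated fields in the first alignment line.
nFields=$(grep -v '^@' "$demoSam" | head -n 1 | awk -F'\t' '{print NF}')

echo "header lines: $nHeader, fields per alignment: $nFields"

# Label the first six mandatory fields of each alignment line.
awk -F'\t' '!/^@/ {print "name=" $1, "flag=" $2, "chrom=" $3, "pos=" $4, "mapq=" $5, "cigar=" $6}' "$demoSam"
```

The same grep and awk commands can be pointed at a real .sam file, although for large files it is worth piping through head first.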

Notes: