You just joined a new research lab that uses molecular genomics to study gene expression and transcription factors. Your plan is to take over a project from a former student named Jimothy. Jimothy was proficient at the bench and so produced several ChIP-seq datasets. Jimothy was, however, less proficient at analyzing high-throughput sequencing data, so several datasets remain unanalyzed. Your task is to perform a preliminary analysis of the ChIP-seq datasets he left behind. Unfortunately, Jimothy was also poor at naming his samples and maintaining his lab notebook. Therefore, in the course of your analysis, you must also identify the factor that was targeted in the ChIP-seq experiments.
What you should turn in:
netID_midterm.sh: shell script file for the pipeline (ChIP-seq processing and mapping)
netID_midterm.txt: file containing answers to questions, displaying tables, or reporting commands used for running programs.
Scoring: 120 points total, 25% of final grade.
Datasets:
Your datasets are found within /home/FCAM/meds5420/midterm/ and correspond to your user number. Your datasets are the following:
usr#_chip.fastq.gz and usr#_control.fastq.gz
bowtie2 index location for hg38
/home/FCAM/meds5420/genomes/hg38_bt2
chromosome size info for hg38
/home/FCAM/meds5420/genomes/hg38.chrom.sizes
ENSEMBL gene file - .bed
/home/FCAM/meds5420/annotations/hg38_genes.bed
ENSEMBL TSS file - .bed
/home/FCAM/meds5420/annotations/hg38_genes_strandedTSS.bed
fasta formatted chromosomes
/home/FCAM/meds5420/genomes/chroms/
Jaspar motif database in MEME format
/home/FCAM/meds5420/TF_db/JASPAR/JASPAR2022_CORE_vertebrates_non-redundant.meme
Illumina ChIP-seq adapter sequence:
GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGAAA
The pipeline should be a single shell script that strings together the following operations. The script must use a loop that iterates through lists of files and should define user-defined variables at the beginning so that only limited alterations are needed in the future (e.g., see the HW2 key). It should also be well annotated and write a log file. (35 points)
1. Run fastqc on the data
2. Remove the ChIP-seq adapter sequence
3. Trim reads by removing bases from the end with a quality score lower than 25 and keeping a minimum length of 26bp
4. Run fastqc on the processed data
5. Align the original and processed (post clipping and trimming) .fastq files to the genome with bowtie2
6. Convert to a .bed file and sort the aligned bed file by chromosome and position.
7. Use bedtools to create a bedGraph file, complete with tracklines, for viewing in the genome browser. Make the Input and ChIP data display in different colors.
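The structural requirements above (user-defined variables at the top, a loop over samples, a log file, and differently colored tracks) can be sketched roughly as follows. This is only a skeleton under stated assumptions: the sample prefixes, the output directory, the log file name, the colors, and the helper names `track_color` and `add_trackline` are all hypothetical, and the actual fastqc, clipping, trimming, bowtie2, and bedtools calls are left out.

```shell
#!/usr/bin/env bash
# Skeleton only: every name and color below is an assumption for illustration.

# --- user-defined variables (edit these, not the loop below) ---
samples="usr0_chip usr0_control"      # hypothetical sample prefixes
outdir="midterm_out"                  # hypothetical output directory
logfile="${outdir}/pipeline.log"      # hypothetical log file

# Choose a browser color per sample type so ChIP and input look different.
track_color() {
    case "$1" in
        *control*) echo "0,0,255" ;;  # input/control in blue
        *)         echo "255,0,0" ;;  # ChIP in red
    esac
}

# Prepend a UCSC track line to an existing bedGraph file.
add_trackline() {
    local bg="$1" name="$2" color="$3"
    {
        printf 'track type=bedGraph name="%s" color=%s\n' "$name" "$color"
        cat "$bg"
    } > "${bg}.tmp"
    mv "${bg}.tmp" "$bg"
}

mkdir -p "$outdir"
for s in $samples; do
    echo "$(date) starting ${s}" >> "$logfile"
    # fastqc, adapter removal, trimming, bowtie2, and bedtools steps go here
    bg="${outdir}/${s}.bedGraph"
    if [ -f "$bg" ]; then
        add_trackline "$bg" "$s" "$(track_color "$s")"
        echo "$(date) added track line to ${bg}" >> "$logfile"
    fi
done
```

Keeping the color choice in one small function means a new sample type only needs one extra `case` branch rather than changes inside the loop.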
If the questions require command line operations, please report the command lines used.
Create a table (use Google Sheets) summarizing the following for each sample before and after pre-processing (10 points):
1. The total number of reads
2. The number of reads that mapped uniquely
3. The number of reads that mapped to multiple places
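One possible way to collect these three numbers is to parse the alignment summary that bowtie2 prints to stderr. The sketch below assumes single-end reads and that the summary was saved to a file (for example by redirecting stderr with `2>`); the function name `bt2_summary` and the file names in the usage line are hypothetical.

```shell
# Print one tab-separated row: sample, total reads, uniquely mapped, multi-mapped.
# Assumes $2 is a saved bowtie2 summary for single-end reads.
bt2_summary() {
    sample="$1"
    log="$2"
    awk -v s="$sample" '
        /reads; of these:/       { total  = $1 }   # first summary line
        /aligned exactly 1 time/ { unique = $1 }   # uniquely mapped reads
        /aligned >1 times/       { multi  = $1 }   # reads mapped to multiple places
        END { printf "%s\t%s\t%s\t%s\n", s, total, unique, multi }
    ' "$log"
}
```

Calling `bt2_summary usr0_chip usr0_chip_bowtie2.log` for each sample, before and after pre-processing, gives rows that can be pasted into the table.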
Based on your mapping and fastqc results, indicate whether you think the datasets are of good quality in terms of sequencing quality and adapter contamination. Are there any minor or major problems? (5 points)
Provide the code or commands used for the following operations or analyses.
1. Use MACS3 to call peaks using the input as the control (5 points).
2. Add tracklines to your summits and narrowPeak files for display in the browser.
3. Display the raw bed and bedGraph data and the peaks and summits in the browser. Find the region that has your strongest peak (lowest q-value, i.e. highest -log10 q-value), zoom out 3x, save a .PDF file, save the peak in a session, and send me a session link. (10 points)
1. Use your summits file to get 101bp of DNA centered on the top 500 binding sites for the meme analysis. (10 points)
2. Search for 2 motifs and use a minimum size of 6 and a maximum size of 15 (5 points)
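The window arithmetic for the first item can be sketched as below. It assumes the summits file is BED-like with the 0-based summit position in column 2 and the score to rank by in column 5; check your own file before relying on that. The function name `top_summit_windows` is hypothetical.

```shell
# Read a summits BED on stdin and print 101 bp BED windows around the
# n highest-scoring summits (50 bp upstream + summit base + 50 bp downstream).
# Assumes column 2 = 0-based summit position, column 5 = ranking score.
top_summit_windows() {
    n="$1"
    sort -k5,5nr | head -n "$n" |
        awk 'BEGIN { OFS = "\t" }
             {
                 start = $2 - 50
                 if (start < 0) start = 0   # clip at the chromosome start
                 print $1, start, $2 + 51, $4, $5
             }'
}
```

For example, `top_summit_windows 500 < sample_summits.bed > top500_101bp.bed` (file names are placeholders). Windows clipped at position 0 come out shorter than 101 bp, and the sketch does not check the chromosome end, so the chromosome size file would be needed for that.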
1. Using the complete MEME output, show the command used, and be sure to do this for all peaks, not only the ones used to identify the motif (Hint: you need to make fasta files for all peaks). Report the number and percentage of motifs in peaks. (10 points)
2. Provide an explanation for why all of your peaks do or do not have identified motifs. (5 points)
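The counting step can be sketched as below, reading the question as "how many peaks carry at least one motif hit". That reading, the function name `motif_peak_fraction`, and both input files are assumptions: the first is a FASTA of all peak sequences, the second a plain list with one sequence name per motif hit, extracted from whatever motif-scanning output you produced.

```shell
# Print: peaks with a motif, total peaks, percentage (tab-separated).
# $1 = FASTA of all peak sequences
# $2 = file with one sequence name per motif hit (names may repeat)
motif_peak_fraction() {
    total=$(grep -c '^>' "$1" || true)            # one FASTA header per peak
    with_motif=$(sort -u "$2" | grep -c . || true) # distinct peaks with a hit
    awk -v t="$total" -v m="$with_motif" '
        BEGIN {
            if (t == 0) { print "0\t0\tNA"; exit }
            printf "%d\t%d\t%.1f\n", m, t, 100 * m / t
        }'
}
```

Deduplicating the hit list first matters because a peak with several motif occurrences should still count once.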
1. Show the FIMO command used to find motifs. (5 points)
2. Convert FIMO results to a .bed file and intersect the FIMO results with your peaks (.narrowPeak file, not summits). Report the number of motifs inside and outside of peaks. (5 points)
3. What do these results tell you about the likelihood of the TF finding its motif in the genome? Provide an explanation (5 points).
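The coordinate conversion in item 2 can be sketched as follows. FIMO's tab-separated output uses 1-based inclusive coordinates, whereas BED is 0-based and half-open, so only the start shifts by one. The sketch assumes FIMO was run on chromosome-level FASTA, so that the sequence name (column 3) is a chromosome and the coordinates are genomic; the function name `fimo_to_bed` is hypothetical.

```shell
# Convert FIMO tab-separated output to sorted 6-column BED on stdout.
# Columns used: 1 motif_id, 3 sequence_name, 4 start, 5 stop, 6 strand, 7 score.
# The header line, blank lines, and trailing '#' comment lines are skipped.
fimo_to_bed() {
    awk 'BEGIN { FS = OFS = "\t" }
         $1 == "motif_id" || /^#/ || NF < 9 { next }
         { print $3, $4 - 1, $5, $1, $7, $6 }' "$1" |
        sort -k1,1 -k2,2n
}
```

With the result saved as, say, `motifs.bed`, `bedtools intersect -u -a motifs.bed -b peaks.narrowPeak | wc -l` counts motifs inside peaks and the same command with `-v` in place of `-u` counts those outside (file names are placeholders). Column 5 here carries the raw FIMO score rather than an integer in the 0 to 1000 range, which may matter for viewers that enforce the BED score convention.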
1. Show the command and report the top 2 or 3 hits (5 points)