0.1 Midterm Project: ChIP-seq Processing and Analysis

You just joined a new research lab that uses molecular genomics to study gene expression and transcription factors. Your plan is to take over a project from a former student named Jimothy. Jimothy was proficient at the bench and so produced several ChIP-seq datasets. Jimothy was, however, less proficient at analyzing high-throughput sequencing data, so several datasets remain unanalyzed. Your task is to perform a preliminary analysis of the ChIP-seq datasets he left behind. Unfortunately, Jimothy was also poor at naming his samples and maintaining his lab notebook. Therefore, in the course of your analysis, you must also identify the factor that was targeted in the ChIP-seq experiments.

What you should turn in:

Scoring: 120 points total, 25% of final grade.

0.2 Due: Monday, April 3 11:59pm

0.3 Important Files / Directories:

Datasets:
Your dataset is found within /home/FCAM/meds5420/midterm/ and corresponds to your user number. Your datasets are the following:

bowtie2 index location for hg38
/home/FCAM/meds5420/genomes/hg38_bt2

chromosome size info for hg38
/home/FCAM/meds5420/genomes/hg38.chrom.sizes

ENSEMBL gene file - .bed
/home/FCAM/meds5420/annotations/hg38_genes.bed

ENSEMBL TSS file - .bed
/home/FCAM/meds5420/annotations/hg38_genes_strandedTSS.bed

fasta formatted chromosomes
/home/FCAM/meds5420/genomes/chroms/

Jaspar motif database in MEME format
/home/FCAM/meds5420/TF_db/JASPAR/JASPAR2022_CORE_vertebrates_non-redundant.meme

Illumina ChIP-seq adapter sequence:
GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGAAA

1 ChIP-seq processing, alignment, and display.

1.1 Create a pipeline to process the ChIP-seq data.

The pipeline should be a single shell script that strings together the following operations. The script must use a loop that iterates through lists of files and should define user-defined variables at the beginning so that only limited alterations are needed in the future (e.g., see the HW2 key). It should also be well annotated and write a log file. (35 points)

1. Run fastqc on the data
2. Remove the ChIP-seq adapter sequence
3. Trim reads by removing bases from the end with a quality score lower than 25 and keeping a minimum length of 26bp
4. Run fastqc on the processed data
5. Align the original and processed (post clipping and trimming) .fastq files to the genome with bowtie2
6. Convert the alignments to a .bed file and sort the .bed file by chromosome and position.
7. Use bedtools to create a bedGraph file, complete with tracklines, for viewing in the genome browser. Make the Input and ChIP data display in different colors.
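
The overall structure asked for above (variables at the top, a loop over samples, a log file) might look like the following minimal sketch. It is only a skeleton under stated assumptions: the sample names, the temporary working directory, and the log and run_step helpers are hypothetical, and run_step only records what would be run rather than calling any real tool.

```shell
#!/usr/bin/env bash
# Skeleton only: sample names, directory, and helper names are hypothetical.

# --- user-defined variables (edit here, not inside the loop) ---
workdir=$(mktemp -d)              # stand-in for your project directory
logfile="$workdir/pipeline.log"   # every message is appended here
samples="chip_rep1 input_rep1"    # hypothetical sample prefixes

# Write a timestamped message to the screen and to the log file.
log() {
    echo "[$(date '+%F %T')] $*" | tee -a "$logfile"
}

# Placeholder for a real processing step; it only logs a description.
run_step() {
    log "STEP: $*"
}

# --- main loop: one pass per sample ---
for s in $samples; do
    log "Processing sample: $s"
    run_step "quality report on ${s}.fastq"
    run_step "adapter removal and trimming on ${s}.fastq"
    run_step "alignment of ${s}_processed.fastq"
    log "Finished sample: $s"
done
```

In a real script each run_step call would be replaced by the actual command for that step, with its standard error redirected into the log where that is useful.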

1.2 Complete the following tasks or answer the questions.

If the questions require command line operations, please report the command lines used.

1.2.1 Create a summary table

Create a table (use Google Sheets) summarizing the following for each sample before and after pre-processing (10 points):

1. The number of reads total
2. The number of reads that mapped uniquely
3. The number of reads that mapped to multiple places
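
bowtie2 prints an alignment summary to standard error, so if your script saves that summary to a file, the three counts can be extracted with awk. The following is a minimal sketch, assuming single-end reads and the usual summary wording ("aligned exactly 1 time", "aligned >1 times"). The function name bt2_counts and the demo file are hypothetical, and the numbers in the demo text are made up for illustration.

```shell
# Hypothetical helper: print total, uniquely aligned, and multiply aligned
# read counts (tab-separated) from a saved bowtie2 single-end summary.
bt2_counts() {
    awk '
        /reads; of these:/        { total  = $1 }
        /aligned exactly 1 time/  { unique = $1 }
        /aligned >1 times/        { multi  = $1 }
        END { printf "%s\t%s\t%s\n", total, unique, multi }
    ' "$1"
}

# Made-up summary text, for illustration only.
demo_log=$(mktemp)
cat > "$demo_log" <<'EOF'
10000 reads; of these:
  10000 (100.00%) were unpaired; of these:
    596 (5.96%) aligned 0 times
    9404 (94.04%) aligned exactly 1 time
    0 (0.00%) aligned >1 times
94.04% overall alignment rate
EOF

bt2_counts "$demo_log"
```

Running the helper on the summary of each original and processed sample gives one table row per file.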

1.2.2 QC assessment

Based on your mapping and fastqc results, indicate whether you think the datasets are of good quality in terms of sequencing quality and adapter contamination. Are there any minor or major problems? (5 points)

2 Peak calling and analysis

Provide the code or commands used for the following operations or analyses.

2.1 Call peaks with MACS3

1. Use MACS3 to call peaks using the input as the control (5 points).
2. Add tracklines to your summits and narrow peaks files for display in the browser.
3. Display the raw bed and bedGraph data and the peaks and summits in the browser. Find the region that contains your strongest peak (the most significant q-value, i.e. the highest -log10(q-value)), zoom out 3x, save a .PDF file, save the peak in a session, and send me a session link. (10 points)
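
A trackline is simply a first line beginning with "track" that tells the browser how to label and draw the file. One way to add it is sketched below, assuming the header text is supplied by you. The function name add_trackline, the demo file, and the track name and color are hypothetical.

```shell
# Hypothetical helper: write <outfile> as <header> followed by <infile>.
# Usage: add_trackline "<track line>" <infile> <outfile>
add_trackline() {
    header=$1
    infile=$2
    outfile=$3
    { printf '%s\n' "$header"; cat "$infile"; } > "$outfile"
}

# Made-up one-line summits file, for illustration only.
demo_dir=$(mktemp -d)
printf 'chr1\t100\t101\tpeak_1\t12.5\n' > "$demo_dir/demo_summits.bed"

add_trackline 'track name="demo_summits" description="demo summits" color=0,0,255 visibility=dense' \
    "$demo_dir/demo_summits.bed" "$demo_dir/demo_summits_track.bed"

head -n 1 "$demo_dir/demo_summits_track.bed"
```

For a narrow peaks file the header would also need type=narrowPeak so that the browser interprets the extra columns correctly.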

2.2 Identify motifs under peaks and compare motif and peak locations.

2.2.1 Use MEME to identify the motifs under the peaks.

1. Use your summits file to get 101bp of DNA centered on the top 500 binding sites for the meme analysis. (10 points)
2. Search for 2 motifs and use a minimum size of 6 and a maximum size of 15 (5 points)
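
The window arithmetic for step 1 can be sketched as follows. Because a summit occupies a single base [s, s+1) in BED coordinates, a 101bp window centered on it runs from s-50 to s+51. This sketch assumes a summits file without a trackline whose fifth column is the score to rank by; the function name summit_windows and the demo data are hypothetical.

```shell
# Hypothetical helper: keep the top <n> summits by column 5 and expand each
# to a window of 2*<half>+1 bp centered on the summit.
# Usage: summit_windows <summits.bed> <n> <half>
summit_windows() {
    sort -k5,5nr "$1" | head -n "$2" |
        awk -v h="$3" 'BEGIN { OFS = "\t" }
            {
                s = $2 - h
                if (s < 0) s = 0        # do not run off the chromosome start
                print $1, s, $2 + h + 1, $4, $5
            }'
}

# Made-up summits, for illustration only.
demo_summits=$(mktemp)
printf 'chr1\t100\t101\tp1\t5.5\n'   >  "$demo_summits"
printf 'chr1\t20\t21\tp2\t50.2\n'    >> "$demo_summits"
printf 'chr2\t1000\t1001\tp3\t10\n'  >> "$demo_summits"

summit_windows "$demo_summits" 2 50
```

The resulting windows can then be passed to bedtools getfasta (-fi for the genome FASTA, -bed for the windows) to obtain the sequences for MEME. Note that a window clipped at position 0 will be shorter than 101bp.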

2.2.2 Use MAST to determine how many of your peaks have the motif.

1. Using the complete MEME output, show the command used, and be sure to do this for all peaks, not just the ones used to identify the motif (Hint: You need to make fasta files for all peaks). Report the number and percentage of peaks that contain the motif. (10 points)
2. Provide an explanation for why all of your peaks do or do not have identified motifs. (5 points)

2.2.3 Use FIMO to determine where the motifs are on the chromosome associated with your user number.

1. Show the FIMO command used to find motifs. (5 points)
2. Convert FIMO results to a .bed file and intersect the FIMO results with your peaks (.narrowPeak file, not summits). Report the number of motifs inside and outside of peaks. (5 points)
3. What do these results tell you about the likelihood of the TF finding its motif in the genome? Provide an explanation (5 points).
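
The coordinate conversion in step 2 can be sketched with awk. This assumes the tab-separated FIMO output has a header line and the column order motif_id, motif_alt_id, sequence_name, start, stop, strand, score, with 1-based inclusive coordinates, so only the start needs to shift by one for BED. Check this against your own output before relying on it. The function name fimo_to_bed and the demo rows are hypothetical.

```shell
# Hypothetical helper: convert FIMO tab-separated output to 6-column BED
# (chrom, start, end, name, score, strand), skipping the header line,
# comment lines, and blank lines.
fimo_to_bed() {
    awk 'BEGIN { FS = OFS = "\t" }
         NR > 1 && !/^#/ && NF >= 7 { print $3, $4 - 1, $5, $1, $7, $6 }' "$1"
}

# Made-up FIMO-style rows, for illustration only.
demo_fimo=$(mktemp)
printf 'motif_id\tmotif_alt_id\tsequence_name\tstart\tstop\tstrand\tscore\tp-value\tq-value\tmatched_sequence\n' > "$demo_fimo"
printf 'MOTIF1\tMEME-1\tchr5\t1001\t1010\t+\t12.3\t1.2e-05\t0.04\tACGTACGTAC\n' >> "$demo_fimo"
printf '\n# trailing comment line\n' >> "$demo_fimo"

fimo_to_bed "$demo_fimo"
```

With the motifs in BED format, bedtools intersect with -u reports the motifs that overlap a peak and -v reports those that do not, so piping each to wc -l gives the two counts.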

2.2.4 Use TOMTOM to identify the TFs that bind the motifs by comparing it to the JASPAR database.

1. Show the command and report the top 2 or 3 hits (5 points)

3 What transcription factor or transcription factor family did Jimothy ChIP? (10 points)