0.1 Due May 2nd, 2023 11:59pm

1 Data

Jimothy generated three replicates of mystery_treatment and no_treatment in the directory: /home/FCAM/meds5420/final/

To save time on alignments, I used the following commands to generate the .sam, .bam, and .bam.bai files within the directory:

module load hisat2
module load samtools
module load htseq/0.13.5


for i in /home/FCAM/meds5420/final/*fastq.gz
    name=$(echo $i | cut -d "." -f 1)
    echo $name
    hisat2 -p4 --dta --rna-strandness R -x $ht2_gen -U $i -S ${name}_R.sam
    hisat2 -p4 --dta --rna-strandness F -x $ht2_gen -U $i -S ${name}_F.sam
    samtools sort -@ 10 -o ${name}_R.bam ${name}_R.sam
    samtools sort -@ 10 -o ${name}_F.bam ${name}_F.sam
    samtools index ${name}_R.bam
    samtools index ${name}_F.bam

1.1 Task 1

Neither Jimothy nor I know the proper --rna-strandness option (unstranded or stranded: template strand or coding strand) for this library, so I used both R and F when aligning.

1. Use two independent methods to determine whether R, F, or the default unstranded is the appropriate option. If stranded, you do not need to interpret the R/F option as the biological template/coding strand nomenclature. Show all your code and provide a few sentences of interpretation to justify your answer. Write the commands as if you were in an interactive session. (20 points)

1.2 Task 2

2. Use htseq-count on the no_treatment_rep1_* data to show that htseq-count ignores strandedness specified by hisat2 (3 points/10) and empirically determine the appropriate --stranded= argument for htseq-count (7 points/10). Write the commands as if you were in an interactive session. Use the following gtf annotation file: /home/FCAM/meds5420/annotations/gencode.v39.annotation.gtf (10 points)

1.3 Task 3

3. Write a batch script that incorporates an appropriate htseq-count command into a loop to count reads in all six conditions. Name the batch script with your last name, user number as so: Keep this script separate from your interactive commands because I will run this in batch mode. This may take > 8 hours to run, but you need to run this script. If you are having issues getting the code correct, I recommend subsampling the data in some way to ensure the code is correct before running full-scale. (15 points)

1.4 Task 4

4. Visualize the data in the browser and include a UCSC session link in your interactive session script. If the data is stranded, color plus strand reads red and minus strand reads blue. If the data is not stranded upload a single track. (10 points)

1.5 Task 5

5. Perform differential expression analysis with DESeq2 on the output of Task 3. Report the number of activated genes using an FDR adjusted p-value (padj) of 0.001. Generate an MA plot PDF to submit. Since the code should be nearly identical to the lectures, there is no need to send along the code. (5 points)

1.6 Task 6

6. Perform Gene Set Enrichment Analysis with fgsea and report the category with the highest NES that also contains a multiple testing adjusted p-value below 0.3 (padj < 0.3). Use this GMT file: /home/FCAM/meds5420/annotations/RNA_seq_final.gmt. Since the code should be nearly identical to the lectures, there is no need to send along the code. Generate an enrichment plot PDF to submit. (5 points)

1.7 Task 7

7. Upload you activated gene list to Panther (Statistical overrepresentation test-PANTHER pathways) using “Homo sapiens-all genes in database” as the Reference list. Report the Gene Ontology class that has the highest Fold Enrichment value. (5 points)

1.8 Task 8

8. Upload you activated gene list to BART and report the top factor as ranked by Irwin-Hall p-value that is likely regulating the activated genes. (5 points)

1.9 Task 9

9. Use tar to generate a single file (named in this format: guertin_usr17.tar) that contains any interactive .sh files, your batch .sh file, your PDF files, and a text file with any non-code answers (--rna-strandness option, --stranded= option, number activated genes, fgsea category, GO category, TF). Email me this single file. (10 points)