1 Homework 2: Create a shell script to process and map NGS data.

1. Due Monday 3/13, 11:59pm
2. Points available: 10

Goal: You will create a shell script to automate the preprocessing and mapping of data.

Instructions:
- Email the answers to questions and your script. No need to send any results files.

- The data file is located in /home/FCAM/meds5420/data/HW/.

1.1 Questions and Tasks:

  1. Count how many lines there are in the file. How many sequence reads does this equate to? (1 point)
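
As a rough sketch of the arithmetic behind task 1: a FASTQ record spans four lines (header, sequence, '+', qualities), so the read count is the line count divided by four. The file name demo.fastq and its three toy reads are made up here so the commands run anywhere; point the fastq variable at the real file on the server instead.

```shell
# Build a tiny FASTQ file (3 reads, 12 lines) so the sketch runs anywhere.
# demo.fastq is a hypothetical name; substitute the real FASTQ file.
fastq=demo.fastq
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n@r3\nGGCC\n+\nIIII\n' > "$fastq"

# Count lines; reading from stdin keeps the file name out of wc's output.
lines=$(wc -l < "$fastq")

# Each FASTQ record is 4 lines: header, sequence, '+', qualities.
reads=$(( lines / 4 ))

echo "lines: $lines"
echo "reads: $reads"
```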

  2. Save the first 2 million lines and the last 2 million lines in two separate files named SRR412199_head.fastq and SRR412199_tail.fastq, respectively. Make sure you put these files into the data folder in your home directory.
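
One possible way to split a file by line count is with head and tail, sketched below on a toy input. The names demo.fastq, demo_data, demo_head.fastq, and demo_tail.fastq are hypothetical stand-ins, and n is set to 4 only so the demo stays small; for the assignment n would be 2000000.

```shell
# Demo input: 3 reads (12 lines). All demo_* names are hypothetical.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n@r3\nGGCC\n+\nIIII\n' > demo.fastq
outdir=demo_data      # stand-in for the data folder in your home directory
mkdir -p "$outdir"

n=4                   # 2000000 for the assignment; keep it a multiple of 4
head -n "$n" demo.fastq > "$outdir/demo_head.fastq"
tail -n "$n" demo.fastq > "$outdir/demo_tail.fastq"

# Sanity check: each piece should start with a header line beginning with '@'.
head -n 1 "$outdir/demo_head.fastq"
head -n 1 "$outdir/demo_tail.fastq"
```

Because both n and the total line count of a well-formed FASTQ file are multiples of four, the tail piece starts on a record boundary rather than in the middle of a read.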

  3. Write a single shell script that will automate the following processes for both files in succession (3 points):

  • Run fastqc on the data.
  • Run fastx_clipper to remove the adapter sequence
    (seq = TGCTTGGACTACATATGGTTGAGGGTTGTATGGAATTCTCGGGTGCCAAGG). You would know the sequence if you did the experiment, but you could also get it from the fastqc report.
  • Run fastqc on the new clipped file.
  • Run fastq_quality_trimmer on the original data to trim reads with Q-scores lower than 30. Keep sequences with a minimum length of 20 bases.
  • Rerun fastqc on the quality-trimmed data.
  • Map the original, clipped, and trimmed data separately to the Drosophila genome with bowtie2.
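
The order of the steps above could be laid out roughly as in the following dry-run sketch, which only prints each command unless DRY_RUN=0 is set. Several things here are assumptions rather than part of the assignment: the run helper, the output file names, the index prefix /path/to/bowtie2_index/dm, and the -Q33 flag (which assumes Phred+33 quality encoding). Check each tool's help output on the server before relying on any of it.

```shell
#!/bin/bash
# Dry-run sketch: prints each command instead of running it.
# Set DRY_RUN=0 where fastqc, the FASTX-Toolkit, and bowtie2 are installed.
DRY_RUN=${DRY_RUN:-1}
adapter=TGCTTGGACTACATATGGTTGAGGGTTGTATGGAATTCTCGGGTGCCAAGG
index=/path/to/bowtie2_index/dm   # hypothetical bowtie2 index prefix

# Hypothetical helper: echo the command, and execute it only when not in dry-run mode.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" -eq 0 ]; then "$@"; fi
}

for fq in SRR412199_head.fastq SRR412199_tail.fastq; do
    base=${fq%.fastq}             # strip the .fastq suffix to build output names

    run fastqc "$fq"
    # -Q33 assumes Phred+33 qualities (an assumption about this data set).
    run fastx_clipper -Q33 -a "$adapter" -i "$fq" -o "${base}_clipped.fastq"
    run fastqc "${base}_clipped.fastq"
    # -t: quality threshold, -l: minimum length kept after trimming.
    run fastq_quality_trimmer -Q33 -t 30 -l 20 -i "$fq" -o "${base}_trimmed.fastq"
    run fastqc "${base}_trimmed.fastq"

    # Map the original, clipped, and trimmed reads separately.
    for reads in "$fq" "${base}_clipped.fastq" "${base}_trimmed.fastq"; do
        run bowtie2 -x "$index" -U "$reads" -S "${reads%.fastq}.sam"
    done
done
```

Printing the commands first makes it easy to check file names and the order of steps before committing compute time on the cluster.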

  4. In your script, include the following:
  • Add comments to the script. (1 point)
  • Add lines that track the progress of the script by printing what is happening to the screen. (1 point)
  • Use the tee command to produce a log.txt file of what is printed to the terminal screen. (1 point)
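
A minimal sketch of progress messages combined with tee is shown below; the step messages are placeholders, not the required wording. Note that bowtie2 writes its alignment summary to stderr, so stderr has to be merged into stdout (2>&1) before the pipe if that summary should end up in the log.

```shell
log=log.txt

# tee shows the message on screen and writes it to the log.
echo "Step 1: running quality control" | tee "$log"      # first write creates or overwrites the log
echo "Step 2: clipping adapters" | tee -a "$log"         # -a appends instead of overwriting

# Stand-in for a tool that reports on stderr: merge stderr into stdout before tee.
{ echo "to stdout"; echo "to stderr" >&2; } 2>&1 | tee -a "$log"
```

Using tee without -a for the first message resets the log on every run, which avoids mixing output from earlier attempts.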

  5. Run the script in an interactive session (srun). View the contents of the log files. How many reads from each file mapped to the genome? (1 point)

  6. View the fastqc .html reports by transferring them from the server and opening them in a browser on your computer. Given the mapping results and the fastqc results, what would you say is the major problem with this data (1 point), and what is the best pre-mapping solution to deal with it, if you had to choose only one (1 point)?