1. Due Monday 3/13, 11:59pm
2. Points available: 10
Goal: You will create a shell script to automate the preprocessing and mapping of data.
- Email the answers to questions and your script. No need to send any results files.
srun), or by using an sbatch script as shown in class.
Count how many lines there are in the file. How many sequence reads does this equate to? (1 point)
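A FASTQ record normally spans four lines (header, sequence, "+" separator, quality string), so the read count is the line count divided by four. A minimal sketch, assuming an uncompressed FASTQ file with unwrapped sequence lines; the function name count_reads and the example path are hypothetical:

```shell
#!/usr/bin/env bash
# Print the number of lines in a FASTQ file and the implied number of reads.
# Assumes exactly 4 lines per record (no wrapped sequence lines).
count_reads() {
    local fastq="$1"
    local lines
    lines=$(wc -l < "$fastq")
    # $((...)) also strips any padding that wc adds on some systems.
    echo "lines: $((lines)) reads: $((lines / 4))"
}

# Example usage (the path is a placeholder):
# count_reads ~/data/your_file.fastq
```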
Move the first 2 million and last 2 million lines into two separate files and name them
SRR412199_tail.fastq, respectively. Make sure you put these files into your
data folder that is in your home directory.
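One way to pull out the two subsets is with head and tail. The sketch below is an illustration only: split_head_tail is a hypothetical helper, and the input and head file names in the usage line are placeholders. Because 2,000,000 is a multiple of 4, neither subset cuts a FASTQ record in half, provided the full file has a whole number of records.

```shell
#!/usr/bin/env bash
# Copy the first N and last N lines of a file into two separate files.
# N should be a multiple of 4 so that no FASTQ record is split.
split_head_tail() {
    local input="$1"
    local n="$2"
    local head_out="$3"
    local tail_out="$4"
    head -n "$n" "$input" > "$head_out"
    tail -n "$n" "$input" > "$tail_out"
}

# Example usage (input and head file names are placeholders):
# split_head_tail ~/data/input.fastq 2000000 ~/data/head_subset.fastq ~/data/SRR412199_tail.fastq
```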
Write a single shell script that will automate the following processes for both files in succession (3 points):
Run fastqc on the data.
Run fastx_clipper to remove the adapter sequences
(seq = TGCTTGGACTACATATGGTTGAGGGTTGTATGGAATTCTCGGGTGCCAAGG). You would know the sequence if you did the experiment, but you could also get this from the fastqc report.
Run fastqc on the new clipped file.
Run fastq_quality_trimmer on the original data to trim bases with Q-scores lower than 30 from the reads. Keep sequences with a minimum length of 20 bases.
Run fastqc on the quality trimmed data.
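The steps listed above could be laid out roughly as follows. This is a sketch under stated assumptions, not a model answer: the output file names, the run and preprocess helpers, and the DRY_RUN switch are hypothetical, and -Q33 assumes the quality scores use Phred+33 encoding. It covers only the steps listed above.

```shell
#!/usr/bin/env bash
# Sketch of the preprocessing steps for one FASTQ file.
# Set DRY_RUN=1 to print the commands instead of running them.

ADAPTER="TGCTTGGACTACATATGGTTGAGGGTTGTATGGAATTCTCGGGTGCCAAGG"

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

preprocess() {
    local fastq="$1"
    local base="${fastq%.fastq}"
    local clipped="${base}_clipped.fastq"   # hypothetical output name
    local trimmed="${base}_trimmed.fastq"   # hypothetical output name

    run fastqc "$fastq"
    # -a: adapter sequence to clip; -Q33 assumes Phred+33 qualities
    run fastx_clipper -Q33 -a "$ADAPTER" -i "$fastq" -o "$clipped"
    run fastqc "$clipped"
    # -t: quality threshold; -l: minimum read length to keep
    run fastq_quality_trimmer -Q33 -t 30 -l 20 -i "$fastq" -o "$trimmed"
    run fastqc "$trimmed"
}

# Example usage, processing each file given on the command line in succession:
# for f in "$@"; do preprocess "$f"; done
```

Wrapping each command in run makes it possible to check the generated command lines with DRY_RUN=1 before committing cluster time to the real run.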
log.txt file of what it printed to the terminal screen. (1 point)
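One common way to keep a copy of terminal output is to pipe it through tee. A minimal sketch, assuming that both normal output and error messages should end up in the log; run_with_log and the script name in the usage line are hypothetical:

```shell
#!/usr/bin/env bash
# Run a command, showing its output on screen and saving a copy to a log file.
# 2>&1 merges stderr into stdout so that error messages are logged too.
run_with_log() {
    local log="$1"
    shift
    "$@" 2>&1 | tee "$log"
}

# Example usage (the script name is a placeholder):
# run_with_log log.txt bash my_pipeline.sh
```

Note that tee overwrites the log file on each run; tee -a would append instead.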
Run the script in an interactive session (srun). View the contents of the log files. How many reads from each file mapped to the genome? (1 point)
View the results of the fastqc.html files by transferring them from the server and opening them in a browser on your computer. Given the mapping results and the fastqc results, what would you say the major problem with this data is (1 point) and what is the best pre-mapping solution to deal with this (if you had to choose only one) (1 point)?