1. Due Monday 3/13, 11:59pm
2. Points available: 10
Goal: You will create a shell script to automate the preprocessing and mapping of data.
Instructions:
- Email the answers to questions and your script. No need to send any results files.
- Data directory: /home/FCAM/meds5420/data/HW/
- Genome directory (dm6): /home/FCAM/meds5420/genomes/dm6/
- Run your work in an interactive session (srun), or by using an sbatch script as shown in class.

Count how many lines there are in the file. How many sequence reads does this equate to? (1 point)
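As a minimal sketch of the counting step, the two helper functions below report the number of lines in a FASTQ file and convert that to a read count. The function names are hypothetical, and the conversion assumes the standard four-lines-per-read FASTQ layout with no wrapped sequence lines.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: count lines and reads in a FASTQ file.
# Assumes each read occupies exactly 4 lines (header, sequence, '+', qualities).

count_lines() {
    # Read from stdin so wc prints only the number; strip any padding spaces.
    wc -l < "$1" | tr -d ' '
}

count_reads() {
    local lines
    lines=$(count_lines "$1")
    echo $(( lines / 4 ))
}

# Example usage (the file name is a placeholder, not taken from the assignment):
# count_lines my_reads.fastq
# count_reads my_reads.fastq
```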
Move the first 2 million and last 2 million lines into two separate files and name them SRR412199_head.fastq and SRR412199_tail.fastq, respectively. Make sure you put these files into the data folder that is in your home directory.
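One possible way to split off the two ends of the file is with head and tail, sketched below. The function name split_ends and the input file name in the usage comment are assumptions for illustration. Because 2,000,000 is a multiple of 4, both output files should contain only whole FASTQ records, provided the input itself has four lines per read.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write the first N and last N lines of a file
# to two separate output files.
# Usage: split_ends INPUT N HEAD_OUT TAIL_OUT

split_ends() {
    local input=$1 n=$2 head_out=$3 tail_out=$4
    head -n "$n" "$input" > "$head_out"   # first N lines
    tail -n "$n" "$input" > "$tail_out"   # last N lines
}

# Example usage (input name is a placeholder; outputs go to ~/data):
# split_ends my_reads.fastq 2000000 \
#     "$HOME/data/SRR412199_head.fastq" "$HOME/data/SRR412199_tail.fastq"
```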
Write a single shell script that will automate the following processes for both files in succession (3 points):
- Run fastqc on the data.
- Run fastx_clipper to remove the adapter sequences (seq = TGCTTGGACTACATATGGTTGAGGGTTGTATGGAATTCTCGGGTGCCAAGG). You would know the sequence if you did the experiment, but you could also get this from the fastqc report.
- Run fastqc on the new clipped file.
- Run fastq_quality_trimmer on the original data to trim reads with Q-scores lower than 30. Keep sequences with a minimum length of 20 bases.
- Run fastqc on the quality-trimmed data.
- Map the reads to the genome with bowtie2 and save a .log.txt file of what it printed to the terminal screen. (1 point)

Run the script in an interactive session (srun). View the contents of the log files. How many reads from each file mapped to the genome? (1 point)
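The preprocessing and mapping steps listed above could be organized roughly as in the sketch below. It is a skeleton under several assumptions, not a reference solution: the helper names (run_logged, process_fastq), the output file names, and the bowtie2 index basename (dm6 inside the genome directory) are all hypothetical; the -Q33 option assumes Phred+33 quality encoding; and the sketch maps the original reads, since which read set to map is a choice left to you. By default it runs in dry-run mode and only prints each command, so it can be checked without the tools installed.

```shell
#!/usr/bin/env bash
# Sketch: preprocess and map two FASTQ files in succession.
# DRY_RUN=1 (default here) only prints commands; set DRY_RUN=0 to execute them.
DRY_RUN=${DRY_RUN:-1}
ADAPTER=TGCTTGGACTACATATGGTTGAGGGTTGTATGGAATTCTCGGGTGCCAAGG
DATA_DIR=${DATA_DIR:-$HOME/data}
# Assumed index basename; check what is actually in the genome directory.
INDEX=${INDEX:-/home/FCAM/meds5420/genomes/dm6/dm6}

# run_logged LOG CMD [ARGS...]: run CMD and save everything it would print
# to the terminal (stdout and stderr) in LOG.
run_logged() {
    local log=$1
    shift
    echo "+ $* > $log 2>&1"
    if [ "$DRY_RUN" -eq 0 ]; then
        "$@" > "$log" 2>&1
    fi
}

process_fastq() {
    local fq=$1
    local base=${fq%.fastq}   # strip the .fastq extension

    run_logged "${base}_fastqc.log.txt" fastqc "$fq"
    run_logged "${base}_clip.log.txt" \
        fastx_clipper -Q33 -v -a "$ADAPTER" -i "$fq" -o "${base}_clipped.fastq"
    run_logged "${base}_clipped_fastqc.log.txt" fastqc "${base}_clipped.fastq"
    run_logged "${base}_trim.log.txt" \
        fastq_quality_trimmer -Q33 -v -t 30 -l 20 -i "$fq" -o "${base}_trimmed.fastq"
    run_logged "${base}_trimmed_fastqc.log.txt" fastqc "${base}_trimmed.fastq"
    # bowtie2 writes its alignment summary to stderr, which run_logged captures.
    run_logged "${base}_bowtie2.log.txt" \
        bowtie2 -x "$INDEX" -U "$fq" -S "${base}.sam"
}

for fq in "$DATA_DIR/SRR412199_head.fastq" "$DATA_DIR/SRR412199_tail.fastq"; do
    process_fastq "$fq"
done
```

Wrapping each step in one logging helper keeps the per-file function short and gives every tool its own .log.txt file. After a real run (DRY_RUN=0), the bowtie2 summary in each log includes a line containing "overall alignment rate", which can be pulled out with grep.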
View the results of the fastqc.html files by transferring them from the server and opening them in a browser on your computer. Given the mapping results and the fastqc results, what would you say the major problem with this data is (1 point) and what is the best pre-mapping solution to deal with this (if you had to choose only one) (1 point)?