1 Optional Homework 3: Compare awk and data.frame & data.table in R for processing big files.

1. Due Monday 4/17, 11:59pm
2. Points available: 5

Goal: write a single sbatch script to compare the processing times of three methods.

Instructions:
- Zip the following 4 files and send the single zipped file to :
  1) the sbatch script you ran (recall that Rscript must be run from an sbatch script);
  2) two .R files that process the data, one for the data.frame method and one for the data.table method;
  3) a plain .txt file reporting the amount of time each method takes to complete.

REMINDER: Do not run your scripts or any functions on the head node. I usually test the code from an interactive session (srun), then confirm that everything works in an sbatch script as shown in class.

1.1 Questions and Tasks:

  1. Using awk, add 50 to column 3 and subtract 49 from column 2 of the input file, and write the result to a new file containing the expanded genomic windows. Leave columns 1 & 4 unaltered, and keep the columns tab-separated (required for bed format). In the sbatch script, use statements that record when the command begins running and when it finishes running, and save them to a log. (2 points)

  2. Perform the same transformation of the genomic windows using data.frame and data.table in R. Each method must still write its own new output file, and every output file must have a different (and preferably logical) name. (2 points)

  3. Report the amount of time (hh:mm:ss) that each processing method takes to complete in the txt file. Do not report the start and end times. (1 point)
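
As a rough illustration of how the awk step and the timing report might fit together in one sbatch script, here is a minimal sketch. It is not the expected solution. The job name, the log file name, the file names windows.bed, windows_expanded_awk.bed and times.txt, and the helper functions expand_windows and format_elapsed are all hypothetical, and the input is assumed to be a tab-separated file with at least four columns. Timing uses date +%s, so it is only accurate to the second.

```shell
#!/bin/bash
#SBATCH --job-name=expand_awk        # hypothetical job name
#SBATCH --output=expand_awk_%j.log   # Slurm log file; %j expands to the job ID

# Expand each window: column 2 minus 49, column 3 plus 50, other columns untouched.
# FS and OFS are both set to tab so the output stays tab-separated.
expand_windows() {
    awk 'BEGIN { FS = OFS = "\t" } { $2 -= 49; $3 += 50; print }' "$1" > "$2"
}

# Convert a whole number of seconds to hh:mm:ss.
format_elapsed() {
    printf '%02d:%02d:%02d\n' $(($1 / 3600)) $(($1 % 3600 / 60)) $(($1 % 60))
}

# Hypothetical file names, overridable from the command line.
input="${1:-windows.bed}"
output="${2:-windows_expanded_awk.bed}"

# Only run the timed step when the input file is actually present.
if [ -f "$input" ]; then
    start=$(date +%s)
    echo "awk started:  $(date)"     # goes to the Slurm log
    expand_windows "$input" "$output"
    end=$(date +%s)
    echo "awk finished: $(date)"
    # Append only the elapsed time, not the start and end times.
    echo "awk: $(format_elapsed $((end - start)))" >> times.txt
fi
```

The same start/end pattern could wrap each Rscript call, so that all three methods append one line each to the same timing file.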