awk and data.frame & data.table in R for processing big files.
1. Due Monday 4/17, 11:59pm
2. Points available: 5
Goal: write a single sbatch script to compare the processing times of three methods.
Instructions:
- Zip the following 4 files and send the single zipped file to guertin@uchc.edu: 1) the sbatch script you ran (recall the need to run Rscript from an sbatch script); 2) two .R files that process the data, one file for the data.frame method and one file for the data.table method; 3) the amount of time each method takes to complete, as a plain txt file.
Input file: /home/FCAM/meds5420/optional_HW/big_bed_sorted.bed
REMINDER: Do not run your scripts or any functions on the head node. I usually test the code from an interactive session (srun), then confirm that everything works in an sbatch script as shown in class.
Add the value 50 to column 3 and subtract the value 49 from column 2 of the input file, and write a new file with the expanded genomic window size, retaining columns 1 & 4 unaltered. Maintain the tab separation between columns (needed for bed format). This should be done using awk. Use statements in the sbatch script to determine when the command begins running and when it finishes running, and save a log. (2 points)
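The awk step and the start/finish statements could be sketched roughly as follows. This is only an illustration under assumptions: the function name expand_windows, the two-line sample input, and the temporary file names are all made up so the sketch runs anywhere; it is not the course file or a required layout, and the #SBATCH header lines for your job are left out.

```shell
#!/bin/bash
# Hypothetical helper (name is an assumption): widen each interval by
# subtracting 49 from column 2 and adding 50 to column 3, keeping
# columns 1 and 4 unchanged. FS and OFS are both set to a tab so the
# output stays tab-separated. Reads the named file, or stdin if none.
expand_windows() {
    awk 'BEGIN { FS = OFS = "\t" } { print $1, $2 - 49, $3 + 50, $4 }' "$@"
}

# Tiny made-up input so the sketch is runnable without the course file.
sample=$(mktemp)
printf 'chr1\t1000\t1100\tpeak1\nchr2\t500\t650\tpeak2\n' > "$sample"

start=$(date +%s)    # seconds since the epoch when the command begins
expand_windows "$sample" > "${sample}.expanded.bed"
end=$(date +%s)      # seconds since the epoch when the command finishes

# Save a simple log of the two timestamps.
echo "awk start: $start  awk end: $end" > "${sample}.log"
cat "${sample}.expanded.bed"
```

In a real run the sample file would be replaced by the input file above, and the log and output names would be ones you choose.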
Perform the same transformation of the genomic windows using data.frame and data.table in R. Recall that you still need to make a new output file and all output files must be named differently (and preferably logically as well). (2 points)
Report the amount of time (hh:mm:ss) that each processing method takes to complete in the txt file; do not report the start and end times. (1 point)
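One possible way to turn a start and end timestamp into an hh:mm:ss duration is shown below. It is a sketch, not the expected solution: the helper name format_elapsed is hypothetical, and it assumes the timestamps were captured as whole seconds with date +%s.

```shell
#!/bin/bash
# Hypothetical helper (name is an assumption): format a whole number
# of seconds as zero-padded hh:mm:ss using shell integer arithmetic.
format_elapsed() {
    total=$1
    printf '%02d:%02d:%02d\n' \
        $((total / 3600)) $((total % 3600 / 60)) $((total % 60))
}

start=$(date +%s)
sleep 1              # stand-in for the real processing command
end=$(date +%s)

# Report only the duration, not the start and end times.
format_elapsed $((end - start))
```

For example, format_elapsed 3725 prints 01:02:05, since 3725 seconds is 1 hour, 2 minutes and 5 seconds.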