1. Due February 13, 11:59pm
2. Points: 10 + Bonus (2)
REMINDER: Please include identifying information, such as your username, in file names that are turned in. Example: mjg54_HW1.sh
All assignments can be emailed as attachments.
geneLists_for_HW.tar file from GitHub:
allGene_dat.txt is a list of gene expression data for all genes.
UP_IDonly.txt is a list of genes that are activated after mouse ES cells differentiate into endoderm tissue.
Down_IDonly.txt is a list of genes that are repressed after mouse ES cells differentiate into endoderm tissue.
Your goal is to use the gene names from the activated and repressed lists to extract the associated data from allGene_dat.txt. That is, use the activated and repressed lists to parse the data from allGene_dat.txt such that you create two separate files, each containing the data for the genes in one list. Do this in each of the following ways:
Use a single-line grep command for each list. (4pts)
Create a shell script with a loop that accomplishes the same two tasks (retrieves the activated and repressed genes) using the grep command. (4pts)
Bonus: Use awk in a shell script with a loop OR use awk in a single-line command to parse each table. (1pt)
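The three approaches above can be sketched on toy data as follows. This is only an illustration under stated assumptions, not the expected answer: the toy files stand in for the real ones, and the layout of allGene_dat.txt (tab-delimited, gene ID in the first column) is assumed, since it is not described here. The output file names are made up for the example.

```shell
#!/bin/sh
# Toy stand-ins for the assignment files. The real layout of allGene_dat.txt
# (tab-delimited, gene ID in column 1) is an ASSUMPTION made for this sketch.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'GeneA\t1.5\nGeneB\t0.2\nGeneAB\t3.1\nGeneC\t7.0\n' > allGene_dat.txt
printf 'GeneA\nGeneC\n' > UP_IDonly.txt
printf 'GeneB\n' > Down_IDonly.txt

# Single-line grep: -f reads the patterns from a file, -F treats them as
# fixed strings, and -w requires whole-word matches, so GeneA does not
# also pull in GeneAB.
grep -w -F -f UP_IDonly.txt allGene_dat.txt > UP_grep.txt

# Loop form: the same grep command, run once per ID list.
for list in UP Down; do
    grep -w -F -f "${list}_IDonly.txt" allGene_dat.txt > "${list}_loop.txt"
done

# awk form: while reading the first file (NR == FNR) store each ID,
# then print the lines of the second file whose first field was stored.
awk 'NR == FNR { ids[$1]; next } $1 in ids' UP_IDonly.txt allGene_dat.txt > UP_awk.txt

# cmp is silent when the two files are identical.
cmp UP_grep.txt UP_awk.txt && echo "grep and awk agree"
```

Note that grep -w matches the ID anywhere on the line, while the awk version compares only the first field. If gene names can also appear in other columns of the real file, the two may differ.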
Bonus: Last year a few students were doing coherence checks on their data and noticed that fewer genes were output than were present in the input. This is a common problem when you are comparing gene sets from different sources. If they are from the same source, then the activated and repressed genes must be a subset of all the genes. Write a shell script using grep (see the -v option) to generate a file of UP and DOWN genes that are not present in allGene_dat.txt. (1pt)
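One possible shape for this check, again on toy data, is to invert the direction of the search: use the IDs from the data file as the patterns and search the ID lists with -v. The column layout of allGene_dat.txt and the output file name missing_genes.txt are assumptions made for the sketch.

```shell
#!/bin/sh
# Toy stand-ins; gene ID in column 1 of a tab-delimited file is an ASSUMPTION.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'GeneA\t1.5\nGeneB\t0.2\nGeneC\t7.0\n' > allGene_dat.txt
printf 'GeneA\nGeneZ\n' > UP_IDonly.txt
printf 'GeneB\nGeneY\n' > Down_IDonly.txt

# Pull the ID column out of the data file to use as the pattern list.
cut -f1 allGene_dat.txt > all_ids.txt

# -v prints the lines that do NOT match any pattern; -x requires a pattern
# to match the whole line; -F treats the patterns as fixed strings.
for list in UP Down; do
    grep -v -x -F -f all_ids.txt "${list}_IDonly.txt"
done > missing_genes.txt

cat missing_genes.txt
```

With these toy inputs, missing_genes.txt holds GeneZ and GeneY, the two IDs that have no row in the data file.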
The up and down ID lists have some redundant gene names that you need to remove so you do not count data from any gene twice.
Devise a way to print only redundant genes to a new file. Show me the command line(s) used and report the number of redundancies for each list.
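As a hint at one way this could look, sort followed by uniq -d prints one copy of every line that occurs more than once. The sketch below uses a made-up toy list, and the output file names are hypothetical. How you count "redundancies" (distinct repeated IDs versus extra copies) is a choice you should state in your answer.

```shell
#!/bin/sh
# Toy ID list with repeats; the contents are made up for illustration.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'GeneA\nGeneB\nGeneA\nGeneC\nGeneB\nGeneA\n' > UP_IDonly.txt

# uniq only compares adjacent lines, so sort first.
# -d prints one copy of each line that appears more than once.
sort UP_IDonly.txt | uniq -d > UP_redundant.txt

# Number of distinct IDs that are repeated.
wc -l < UP_redundant.txt

# De-duplicated list: sort -u keeps one copy of every ID.
sort -u UP_IDonly.txt > UP_unique.txt

# Number of extra copies = total lines minus unique lines.
total=$(wc -l < UP_IDonly.txt)
unique=$(wc -l < UP_unique.txt)
echo "extra copies: $((total - unique))"
```

For the toy list, two IDs (GeneA and GeneB) are repeated, and removing duplicates drops three lines.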