1. Due February 13, 11:59pm
2. Points: 10 + Bonus (2)
REMINDER: Please include identifying information, such as your username, in the names of the files you turn in. Example: mjg54_HW1.sh
All assignments can be emailed as attachments.
geneLists_for_HW.tar file from GitHub:
allGene_dat.txt is a list of gene expression data for all genes.
UP_IDonly.txt is a list of genes that are activated after mouse ESCs differentiate into endoderm tissue.
Down_IDonly.txt is a list of genes that are repressed after mouse ESCs differentiate into endoderm tissue.
Your goal is to use the gene names in the activated and repressed lists to extract the associated data from allGene_dat.txt. That is, use each list to parse allGene_dat.txt so that you create two separate files: one with the data for the activated genes and one with the data for the repressed genes. Do this in each of the following ways:
Use a single-line grep command for each list. (4pts)
Create a shell script with a loop that accomplishes the same two tasks (retrieves the activated and repressed genes) using the grep command. (4pts)
Use awk in a shell script with a loop OR use awk in a single-line command to parse each table. (1pt)
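To make the three approaches concrete, here is a minimal sketch run on toy data. It is one possible way to do it, not the only accepted answer. Everything in the demo directory is made up for illustration: the toy file contents, the geneLists_demo directory, and the output names (UP_grep.txt, UP_loop.txt, UP_awk.txt and their Down counterparts) are hypothetical. The sketch also assumes that allGene_dat.txt is tab-delimited with the gene name in the first column, so check the real file before relying on it.

```shell
#!/usr/bin/env bash
# Work in a scratch directory so the toy files never overwrite real data.
demo_dir="${TMPDIR:-/tmp}/geneLists_demo"
mkdir -p "$demo_dir" && cd "$demo_dir" || exit 1

# Toy stand-ins for the real files (values are invented).
# ASSUMPTION: tab-delimited table, gene name in column 1.
printf 'Sox17\t5.2\t9.1\nSox2\t8.0\t2.3\nSox1\t1.1\t1.0\nGata4\t0.4\t7.7\nNanog\t9.9\t1.2\n' > allGene_dat.txt
printf 'Sox17\nGata4\n' > UP_IDonly.txt
printf 'Sox2\nNanog\n' > Down_IDonly.txt

# 1) Single-line grep, one command per list.
#    -f reads the patterns from a file, -F treats them as fixed strings,
#    -w requires whole-word matches (so Sox1 does not pull in Sox17).
grep -wFf UP_IDonly.txt allGene_dat.txt > UP_grep.txt
grep -wFf Down_IDonly.txt allGene_dat.txt > Down_grep.txt

# 2) Loop version: run one grep per gene name and append the hits.
for list in UP Down; do
    : > "${list}_loop.txt"                      # start from an empty file
    while IFS= read -r gene; do
        # Skip blank lines; an empty pattern would match every row.
        [ -n "$gene" ] && grep -wF -- "$gene" allGene_dat.txt >> "${list}_loop.txt"
    done < "${list}_IDonly.txt"
done

# 3) awk version: while reading the ID file (NR == FNR) store each name
#    as an array key, then print table rows whose column 1 is a stored key.
for list in UP Down; do
    awk -F'\t' 'NR == FNR { ids[$1]; next } $1 in ids' \
        "${list}_IDonly.txt" allGene_dat.txt > "${list}_awk.txt"
done

# Line counts per output file, for a quick comparison across methods.
wc -l UP_grep.txt UP_loop.txt UP_awk.txt Down_grep.txt Down_loop.txt Down_awk.txt
```

Note that the grep -f version writes rows in the order they appear in allGene_dat.txt, while the loop writes them in the order of the ID list, so sort both outputs before comparing them on real data. The -w option still treats characters such as "-" and "." as word boundaries, so a name that is a prefix of a hyphenated name can over-match; the awk version compares the whole first column and avoids that.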
Bonus: Last year a few students were doing coherence checks on their data and noticed that fewer genes were output than were present in the input lists. This is a common problem when you compare gene sets from different sources. If the lists come from the same source, then the activated and repressed genes must be a subset of all the genes. Write a shell script using grep (see the -v option) to generate a file of UP and DOWN genes that are not present in allGene_dat.txt.
The UP and Down ID lists have some redundant gene names that you need to remove so you do not count data from any gene twice.
Devise a way to print only redundant genes to a new file. Show me the command line(s) used and report the number of redundancies for each list.
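The two checks described above (genes missing from the full table, and gene names repeated within a list) can be sketched as follows. This is an illustrative sketch on invented toy data, not a reference solution. The geneLists_qc_demo directory, the intermediate allGene_IDs.txt file, the *_missing.txt and *_redundant.txt output names, and the placeholder gene FakeGene1 are all hypothetical. It again assumes that allGene_dat.txt is tab-delimited with the gene name in the first column.

```shell
#!/usr/bin/env bash
# Work in a scratch directory so the toy files never overwrite real data.
demo_dir="${TMPDIR:-/tmp}/geneLists_qc_demo"
mkdir -p "$demo_dir" && cd "$demo_dir" || exit 1

# Toy stand-ins for the real files (values are invented).
# ASSUMPTION: tab-delimited table, gene name in column 1.
printf 'Sox17\t5.2\nSox2\t8.0\nGata4\t0.4\nNanog\t9.9\n' > allGene_dat.txt
# Sox17 appears twice; FakeGene1 is a placeholder absent from the table.
printf 'Sox17\nGata4\nSox17\nFakeGene1\n' > UP_IDonly.txt
# Sox2 appears three times.
printf 'Sox2\nSox2\nSox2\nNanog\n' > Down_IDonly.txt

# Keep only the gene-name column so whole-line matching is possible.
cut -f1 allGene_dat.txt > allGene_IDs.txt

for list in UP Down; do
    # Missing genes: -v inverts the match, -x requires the whole line to
    # match, -F uses fixed strings, -f reads the patterns from a file.
    # What survives is every list entry with no exact match in the table.
    grep -vxFf allGene_IDs.txt "${list}_IDonly.txt" | sort -u > "${list}_missing.txt"

    # Redundant genes: uniq -d prints one copy of each repeated line,
    # and it only detects adjacent repeats, hence the sort first.
    sort "${list}_IDonly.txt" | uniq -d > "${list}_redundant.txt"

    echo "$list: $(wc -l < "${list}_redundant.txt") redundant name(s)," \
         "$(wc -l < "${list}_missing.txt") missing name(s)"
done
```

Here "number of redundancies" is counted as the number of distinct names that occur more than once. If the count of extra copies is wanted instead, `sort file | uniq -c` shows how many times each name occurs, and `sort -u` produces the de-duplicated list.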