1 HOMEWORK 1:

1. Due February 13, 11:59pm
2. Points: 10 + Bonus (2)

REMINDER: Please include identifying information, such as your username, in the names of the files you turn in. Example: mjg54_HW1.sh. All assignments can be emailed as attachments.

1.1 Part 1: Parsing shell script:

1.1.1 Retrieve the geneLists_for_HW.tar file from GitHub:

  • allGene_dat.txt is a list of gene expression data for all genes.
  • UP_IDonly.txt is a list of genes that are activated after mouse ESCs differentiate into endoderm tissue.
  • Down_IDonly.txt is a list of genes that are repressed after mouse ESCs differentiate into endoderm tissue.
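
If tar archives are new to you, the following is a minimal sketch of listing and unpacking one. So that it runs without the real download, it first builds a stand-in archive from made-up one-line files; the hw1_tar_demo directory and the toy contents are assumptions for illustration only, not the actual homework data.

```shell
set -eu
demo=hw1_tar_demo                       # hypothetical scratch directory
mkdir -p "$demo/src" "$demo/extracted"

# Build a toy archive with the same file names as the real one (contents are invented).
printf 'geneA\t1.0\n' > "$demo/src/allGene_dat.txt"
printf 'geneA\n'      > "$demo/src/UP_IDonly.txt"
printf 'geneB\n'      > "$demo/src/Down_IDonly.txt"
tar -cf "$demo/geneLists_for_HW.tar" -C "$demo/src" \
    allGene_dat.txt UP_IDonly.txt Down_IDonly.txt

# List the archive contents without unpacking anything.
tar -tf "$demo/geneLists_for_HW.tar"

# Unpack into a separate directory, then check line counts and peek at the data.
tar -xf "$demo/geneLists_for_HW.tar" -C "$demo/extracted"
wc -l "$demo/extracted"/*.txt
head -n 3 "$demo/extracted/allGene_dat.txt"
```

With the real archive, only the tar -tf and tar -xf lines are needed. Looking at the first few lines of allGene_dat.txt before writing any grep command tells you which column holds the gene ID and how the columns are delimited.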

1.1.2 Task (8pts + 2 bonus):

Your goal is to use the gene names in the activated and repressed lists to extract the associated data from allGene_dat.txt. That is, for each list, pull the matching rows out of allGene_dat.txt so that you end up with two separate files, one per list. Do this in each of the following ways:

  • Use a single-line grep command for each list. (4pts)

  • Create a shell script with a loop that accomplishes the same two tasks (retrieves the activated and repressed genes) using the grep command. (4pts)

  • Bonus: Use awk in a shell script with a loop OR use awk in a single line command to parse each table. (1pt)

  • Bonus: Last year a few students ran sanity checks on their results and noticed that fewer genes came out than went in. This is a common problem when you compare gene sets from different sources; if the sets come from the same source, the activated and repressed genes must be a subset of all the genes. Write a shell script using grep (see the -v option) to generate a file of UP and DOWN genes that are not present in allGene_dat.txt. (1pt)
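
The approaches listed above can be sketched on toy data as follows. This is one possible shape, not a model answer. It assumes allGene_dat.txt is tab-delimited with the gene ID in the first column, which you should confirm against the real file. The gene names, the hw1_grep_demo directory, and the output file names are all invented for illustration.

```shell
set -eu
demo=hw1_grep_demo                      # hypothetical scratch directory
mkdir -p "$demo"

# Toy inputs (invented). geneAB is there to show why whole-word matching matters.
printf 'geneA\t5.2\ngeneB\t3.1\ngeneC\t0.4\ngeneD\t0.9\ngeneAB\t7.0\n' > "$demo/allGene_dat.txt"
printf 'geneA\ngeneB\ngeneX\n' > "$demo/UP_IDonly.txt"
printf 'geneC\ngeneD\n'        > "$demo/Down_IDonly.txt"

# IDs present in the data file (assumes the ID is column 1, tab-delimited).
cut -f1 "$demo/allGene_dat.txt" > "$demo/all_ids.txt"

for list in UP_IDonly.txt Down_IDonly.txt; do
    prefix=${list%_IDonly.txt}          # UP or Down

    # grep: -f reads patterns from the list, -F treats them as fixed strings,
    # -w requires whole-word matches. Note this matches anywhere on the line.
    grep -F -w -f "$demo/$list" "$demo/allGene_dat.txt" > "$demo/${prefix}_data.txt"

    # awk: load the list into an array, then keep rows whose column 1 is in it.
    awk -F'\t' 'NR==FNR {ids[$1]; next} $1 in ids' \
        "$demo/$list" "$demo/allGene_dat.txt" > "$demo/${prefix}_data_awk.txt"

    # grep -v: list entries that match no ID exactly (-x = whole line).
    # grep exits 1 when nothing is printed, so "|| true" keeps set -e happy.
    grep -F -x -v -f "$demo/all_ids.txt" "$demo/$list" > "$demo/${prefix}_missing.txt" || true
done

wc -l "$demo"/*_data.txt "$demo"/*_missing.txt
```

The grep command inside the loop is also the single-line form when run on one list by itself. Without -w, the pattern geneA would also pull in the geneAB row, which is one way to end up with more rows than expected. The awk version compares only the first column, so it is unaffected by gene names that happen to appear in other columns.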

1.2 Part 2: Find redundant genes:

The up and down ID lists have some redundant gene names that you need to remove so you do not count data from any gene twice.

1.2.1 Task (2pts):

Devise a way to print only the redundant genes to a new file. Show me the command line(s) you used and report the number of redundancies in each list.
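
One possible approach, sketched on an invented list, is to sort the IDs so that repeats sit on adjacent lines and then let uniq -d print one copy of each repeated line. The hw1_dups_demo directory, the toy gene names, and the output file name are assumptions for illustration. "Number of redundancies" can be read two ways, so the sketch reports both.

```shell
set -eu
demo=hw1_dups_demo                      # hypothetical scratch directory
mkdir -p "$demo"

# Toy list (invented): geneA appears three times, geneB twice, geneC once.
printf 'geneA\ngeneB\ngeneA\ngeneC\ngeneB\ngeneA\n' > "$demo/UP_IDonly.txt"

# uniq only compares adjacent lines, so sort first.
# -d prints one copy of every line that occurs more than once.
sort "$demo/UP_IDonly.txt" | uniq -d > "$demo/UP_redundant.txt"

# Reading 1: how many distinct gene IDs are repeated.
repeated=$(wc -l < "$demo/UP_redundant.txt")

# Reading 2: how many surplus entries there are (total lines minus unique lines).
total=$(wc -l < "$demo/UP_IDonly.txt")
unique=$(sort -u "$demo/UP_IDonly.txt" | wc -l)
surplus=$((total - unique))

echo "distinct repeated IDs: $((repeated))"
echo "surplus entries: $surplus"
```

For the toy list this reports 2 distinct repeated IDs and 3 surplus entries. If you also want to see how many times each ID occurs, sort followed by uniq -c prints a count in front of every distinct line.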