1 Manipulation / parsing of tables with awk

One limitation of cut is that it cannot reorder columns: cut -f 3,1 table prints columns 1 and 3 in their original order, exactly as cut -f 1,3 table does.
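This behavior of cut is easy to confirm on a small made-up table. The sketch below assumes a hypothetical file name, demo-table.txt, that is not part of the course data:

```shell
# Build a tiny three-column, tab-delimited table (invented sample data).
printf 'This\t1\tred\nis\t2\torange\n' > demo-table.txt

# cut ignores the order of the numbers given to -f:
# both commands return columns 1 and 3 in their original order.
cut_31=$(cut -f 3,1 demo-table.txt)
cut_13=$(cut -f 1,3 demo-table.txt)

echo "$cut_31"
echo "$cut_13"
```

Both echo commands should print the same two lines, with the first column still on the left.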

A convenient way to reorder columns, especially in large files, is with another language called awk.
Awk user manual: https://www.cs.unibo.it/~renzo/doc/awk/nawkA4.pdf

Here’s an example:

cat color-table.txt | awk '{print $3, "\t", $2, "\t", $1}' > newTable.txt
## red   1   This
## orange    2   is
## yellow    4   a
## green     4   test
## blue      7   this
## purple    6   is
## red   80      only
## orange    19      a
## yellow    100     test
## green     6   if

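In the example above:
- $1, $2, and $3 refer to the first, second, and third columns of each line.
- "\t" is a tab character, printed between the columns.
- > writes the output to newTable.txt instead of the screen.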

awk has very cryptic syntax. Therefore, I recommend learning as needed for specific tasks and keeping track of useful operations by creating a repository to store them. For instance, I have an “awk one-liners” file where I keep useful operations.

1.1 Feedback: drop the awk

The only class feedback I ignored from last year's evaluations was the request not to teach awk. I view computational analyses for genomics in two parts: 1) basic, routine data processing that can usually be automated with shell scripts; 2) exploratory data analysis in your favorite programming language, such as python or R. awk is extremely useful for shell scripting, and you will learn 15 years of my accumulated (and frequently forgotten) awk knowledge from this class. Exploratory data analysis in R or python would be another class entirely. It is true that many tasks that awk performs can be performed in R or python, and awk is more awkward. However, awk is faster and easier to integrate into workflows.

1.2 More awk usage and syntax:

The Output Field Separator (OFS) can be used to specify the delimiter placed between printed columns:

cat color-table.txt | awk '{OFS="\t";} {print $3, $1, $2}' > newTable.txt
head -3 newTable.txt
## red  This    1
## orange   is  2
## yellow   a   4
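A common alternative is to set OFS once in a BEGIN block, which awk runs before it reads any input. The sketch below uses a made-up two-line table; demo-table.txt and demo-reordered.txt are hypothetical file names:

```shell
# Invented sample data: three tab-delimited columns.
printf 'This\t1\tred\nis\t2\torange\n' > demo-table.txt

# BEGIN runs once, so OFS is set a single time instead of on every line.
cat demo-table.txt | awk 'BEGIN{OFS="\t"} {print $3, $1, $2}' > demo-reordered.txt

cat demo-reordered.txt
```

The result is the same as setting OFS in its own {} block; the BEGIN form only makes explicit that the separator is defined before any line is processed.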

Input files can be given at the end of the command instead of using cat and a pipe (|):

awk '{OFS="\t";} {print $3, $1, $2}' < color-table.txt > newTable.txt
head -3 newTable.txt
## red  This    1
## orange   is  2
## yellow   a   4

The less-than sign (<) before the input file is not necessary, because awk also accepts the file name as an argument, but it can be used to avoid ambiguity.
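To illustrate, here is a minimal sketch that runs the same awk program three ways on a made-up table (demo-table.txt is a hypothetical file name) and stores each result for comparison:

```shell
# Invented sample data: three tab-delimited columns.
printf 'This\t1\tred\nis\t2\torange\n' > demo-table.txt

# 1) cat and a pipe
from_pipe=$(cat demo-table.txt | awk '{OFS="\t";} {print $3, $1, $2}')
# 2) input redirection with <
from_redirect=$(awk '{OFS="\t";} {print $3, $1, $2}' < demo-table.txt)
# 3) file name given as an argument after the awk program
from_argument=$(awk '{OFS="\t";} {print $3, $1, $2}' demo-table.txt)

echo "$from_argument"
```

All three variables should hold identical text.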

One can quickly parse columns with awk using “if” statements:

cat color-table.txt | awk '{ if($3 == "yellow") print $0}'
## a    4   yellow  
## test 100 yellow
## real     4   yellow

In this example, the if statement is followed by a test (in parentheses), followed by the desired action if the test result is true.
- $0 refers to the entire line instead of a specific column.
- The double equal sign == signifies that you are asking if the two items are equal to each other rather than setting the value of a variable.
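awk also accepts a shorter form of the same test: a condition written outside the braces acts as a filter, and when no action is given awk prints the matching line. A small sketch on invented data (demo-table.txt is a hypothetical file name):

```shell
# Invented sample data: three tab-delimited columns.
printf 'This\t1\tred\na\t4\tyellow\ntest\t100\tyellow\n' > demo-table.txt

# Explicit if statement inside the action block.
with_if=$(awk '{ if($3 == "yellow") print $0}' demo-table.txt)

# Same test written as a pattern; printing the line is the default action.
with_pattern=$(awk '$3 == "yellow"' demo-table.txt)

echo "$with_pattern"
```

Both forms should select the same two yellow rows.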

1.3 Passing shell variables to awk:

There are times when you want to use a variable created in shell in an awk command. Try:

y=yellow
cat color-table.txt | awk '{ if($3 == $y) print $0}'

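As you can see, awk does not recognize shell variables: inside the single quotes, awk reads $y as one of its own field references, so nothing matches. Shell variables can be “passed” into awk as follows: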

y=yellow
cat color-table.txt | awk -v cols="$y" '{ if($3 == cols) print $0}' #readable to awk

In this case, the -v option creates an awk variable from a shell variable. Note: the -v assignment comes before the awk program, which is wrapped in '{}'. I arbitrarily named the awk variable cols, but any name can be used, including y.
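The difference between the two commands can be checked on a small made-up table (demo-table.txt is a hypothetical file name). This is only a sketch of the behavior described above:

```shell
# Invented sample data: three tab-delimited columns.
printf 'This\t1\tred\na\t4\tyellow\n' > demo-table.txt

y=yellow

# Inside single quotes the shell does not expand $y, and awk has no
# variable y of its own, so this comparison never matches.
no_match=$(awk '{ if($3 == $y) print $0}' demo-table.txt)

# -v copies the shell value into an awk variable before the program runs.
match=$(awk -v cols="$y" '{ if($3 == cols) print $0}' demo-table.txt)

echo "without -v: [$no_match]"
echo "with -v:    [$match]"
```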

1.4 Running simple tests before actions in awk:

Example: sometimes we want to work on the beginning of a file in order to add, remove, or alter the column header. We can run a simple test on NR, awk's built-in variable holding the current row (record) number, to choose which lines the awk action applies to.

head -3 mm9_genes.txt
## genename geneID  chr strand  start   stop
## 4930594M22Rik    AK157947    chr14   +   123312252   123328664
## Zfp85-rs1    NM_001001130    chr13   -   67848736    67856071
cat mm9_genes.txt | awk 'NR>1{ print $0}' | head -3
## 4930594M22Rik    AK157947    chr14   +   123312252   123328664
## Zfp85-rs1    NM_001001130    chr13   -   67848736    67856071
## Scap NM_001001144    chr9    +   110235821   110287450
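The same kind of NR test can alter a header rather than remove it. In the sketch below, the file demo-genes.txt, its contents, and the replacement column names are all invented for illustration:

```shell
# Invented sample data: a header row plus one gene.
printf 'genename\tgeneID\tchr\nScap\tNM_001001144\tchr9\n' > demo-genes.txt

# Row 1: print a new header. All other rows: print the line unchanged.
new_table=$(awk 'NR==1 {print "name\tid\tchromosome"} NR>1 {print $0}' demo-genes.txt)

echo "$new_table"
```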

1.5 Common uses for awk we will use in this course:

1. Reordering columns
2. Simple math with columns
3. Printing certain rows or columns
4. Adding rows or columns to large files
5. Adding/removing headers from large files
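Item 2 in the list above (simple math with columns) might look like the following sketch. The file demo-coords.txt and its contents are invented for illustration:

```shell
# Invented sample data: name, start, stop (tab-delimited).
printf 'geneA\t100\t250\ngeneB\t400\t1000\n' > demo-coords.txt

# For each line, compute stop minus start and add it to a running total.
# The END block runs once, after the last line has been read.
lengths=$(awk '{len = $3 - $2; total += len; print $1 "\t" len} END {print "total\t" total}' demo-coords.txt)

echo "$lengths"
```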

1.6 In class exercise 1: Splitting strings and parsing files.

Consider the example path to mm9_genes.txt: /users/tempdata3/MEDS5420/annotations/mm9_genes.txt

1. Use cut to get the file name without the extension (.txt)
Download the mm9_genes.txt file from GitHub Lecture 4 and move it to your MEDS5420 folder.
2. Determine the size of your list by counting the number of lines in the file.
3. The number of gene names in the list (column 1) is not the same as the number of gene IDs (column 2). Determine how many redundant gene names are listed in the table.
4. Use awk to move genes on the plus strand to another file - call it PlusStrandGenes.txt
5. Use awk to create a file with another column that has the gene length - call it mm9GeneLengths.txt
6. Create a .bed file (used later in course) by reordering the columns as follows and separate the columns with a tab: chromosome, start, end, geneID, strand

2 Answers to exercises:

2.1 In class exercise 1:

Consider the path to the mm9_genes.txt in the MEDS5420 folder on a server: /archive/MEDS5420/annotations/mm9_genes.txt

1. Use cut (alone or in combination with other functions) to retrieve the file name without the extension (.txt):

file=/archive/MEDS5420/annotations/mm9_genes.txt
# the leading "/" creates an empty first field, so the file name is field 5
echo "$file" | cut -d "/" -f 5 | cut -d "." -f 1

# OR independent of positional information:

echo $file | rev | cut -d "/" -f 1 | rev | cut -d "." -f 1
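Outside the scope of the cut exercise, the standard basename utility does the same job in one step: it strips the directories and, when a suffix is given as the second argument, the extension as well. A minimal sketch using the same example path:

```shell
file=/archive/MEDS5420/annotations/mm9_genes.txt

# basename removes everything up to the last "/" and then the given suffix.
name=$(basename "$file" .txt)

echo "$name"
```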


2. Count the number of lines in the file:

wc -l mm9_genes.txt


3. The number of genes in the list is not the same as the number of genetic loci. Determine how many redundant gene names are listed in the table:



# to get the number of unique gene names (the header line is counted too, so subtract 1)
cat mm9_genes.txt | cut -f 1 | sort | uniq | wc -l

# to get the number of duplicated gene names
cat mm9_genes.txt | cut -f 1 | sort | uniq -d | wc -l
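The two counts answer slightly different questions, which a small invented list makes visible (demo-names.txt is a hypothetical file with made-up gene names). Here uniq -d counts how many names occur more than once, while total minus unique counts the extra copies:

```shell
# Invented sample data: geneA appears twice, geneC three times.
printf 'geneA\ngeneB\ngeneA\ngeneC\ngeneC\ngeneC\n' > demo-names.txt

total=$(wc -l < demo-names.txt)
unique=$(sort demo-names.txt | uniq | wc -l)
duplicated=$(sort demo-names.txt | uniq -d | wc -l)
extra=$((total - unique))

echo "total lines:        $total"
echo "unique names:       $unique"
echo "names seen > once:  $duplicated"
echo "extra copies:       $extra"
```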

4. Use awk to move genes on the plus strand to another file:

cat mm9_genes.txt | awk '{if($4 == "+") print $0}' > PlusStrandGenes.txt

5. Use awk to create another column that has the gene length

cat mm9_genes.txt | awk '{print $0, $6-$5}' > mm9GeneLengths.txt
# The command above works, but the header is not correct. Try it and see.

cat mm9_genes.txt | awk 'NR<2{ print $0 "\t" "geneLength"} NR>1 {print $0 "\t" $6-$5}' > mm9GeneLengths.txt
# The command above uses NR to run different actions on specific rows: row 1 gets the new column name and every other row gets the gene length.

6. Create a .bed file by reordering the columns and separating them with a tab delimiter:

cat mm9_genes.txt | awk '{OFS="\t";} {print $3, $5, $6, $2, $4}' > mm9_genes.bed
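Many tools that read .bed files expect data lines only, so it may be worth leaving out the column-name row. The sketch below adds an NR>1 test for that purpose; the file demo-genes.txt is a made-up two-line stand-in for mm9_genes.txt, and demo-genes.bed is a hypothetical output name:

```shell
# Invented stand-in for mm9_genes.txt: header row plus one gene.
printf 'genename\tgeneID\tchr\tstrand\tstart\tstop\nScap\tNM_001001144\tchr9\t+\t110235821\t110287450\n' > demo-genes.txt

# NR>1 skips the header; the remaining rows are reordered as
# chromosome, start, end, geneID, strand with tabs between columns.
awk 'BEGIN{OFS="\t"} NR>1 {print $3, $5, $6, $2, $4}' demo-genes.txt > demo-genes.bed

cat demo-genes.bed
```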