One limitation of cut is that it cannot reorder columns (e.g. cut -f 3,1 table does not work: it prints columns 1 and 3 in their original order).
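To see this limitation concretely, here is a minimal sketch. The file demo.txt and its two rows are made up for illustration and are not part of the course data:

```shell
# Build a tiny tab-delimited table (hypothetical data for illustration).
tmpdir=$(mktemp -d)
printf 'This\t1\tred\nis\t2\torange\n' > "$tmpdir/demo.txt"

# Ask cut for column 3 and then column 1 ...
swapped=$(cut -f 3,1 "$tmpdir/demo.txt")
# ... and compare with asking for column 1 and then column 3.
ordered=$(cut -f 1,3 "$tmpdir/demo.txt")

# Both commands print the selected columns in their original file order.
echo "$swapped"
echo "$ordered"
```

The two results are identical, because cut treats the field list as a set of columns to keep, not as an output order.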
A convenient way to reorder columns, especially with large files, is to use another language called awk.
Awk user manual:
https://www.cs.unibo.it/~renzo/doc/awk/nawkA4.pdf
Here’s an example:
cat color-table.txt | awk '{print $3, "\t", $2, "\t", $1}' > newTable.txt
## red 1 This
## orange 2 is
## yellow 4 a
## green 4 test
## blue 7 this
## purple 6 is
## red 80 only
## orange 19 a
## yellow 100 test
## green 6 if
In the example above:
- $# (where # is a number, such as $1 or $3) specifies the field or column number.
- "\t" specifies tabs in between fields.
awk has very cryptic syntax. Therefore, I recommend learning it as needed for specific tasks and keeping track of useful operations by creating a repository to store them. For instance, I have an “awk one-liners” file where I keep useful operations.
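One detail of the print statement above is worth checking yourself: each comma inserts awk's default output separator (a single space), so the tabs in that example are surrounded by spaces. The sketch below uses one made-up input line to compare commas with plain string concatenation:

```shell
# One hypothetical input line with three tab-separated fields.
line=$(printf 'This\t1\tred')

# Commas in print insert the output separator (a single space by default),
# so each "\t" ends up with a space on either side.
with_commas=$(echo "$line" | awk '{print $3, "\t", $2, "\t", $1}')

# Writing strings next to each other concatenates them, giving tabs only.
with_concat=$(echo "$line" | awk '{print $3 "\t" $2 "\t" $1}')

echo "$with_commas"
echo "$with_concat"
```

If the extra spaces matter for a downstream tool, the concatenation form is the safer of the two.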
The only class feedback I ignored from last year's evaluations was the request not to teach awk. I view computational analyses for genomics in two parts: 1) basic, routine data processing that can usually be automated with shell scripts; 2) exploratory data analysis in your favorite programming language, such as python or R. awk is extremely useful for shell scripting, and you will learn 15 years of my accumulated (and frequently forgotten) awk knowledge from this class. Exploratory data analysis in R or python would be another class entirely. It is true that many tasks that awk performs can be performed in R or python, and that awk is more awkward. However, awk is faster and easier to integrate into workflows.
awk usage and syntax:
The Output Field Separator (OFS) can be used to specify the delimiter placed between output fields:
cat color-table.txt | awk '{OFS="\t";} {print $3, $1, $2}' > newTable.txt
head -3 newTable.txt
## red This 1
## orange is 2
## yellow a 4
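A common alternative, shown here as a sketch and not as the form used in this course, is to set OFS once in a BEGIN block, which awk runs before it reads any input. The sample file below is a made-up stand-in for color-table.txt:

```shell
# Hypothetical two-row stand-in for color-table.txt (tab-delimited).
tmpdir=$(mktemp -d)
printf 'This\t1\tred\nis\t2\torange\n' > "$tmpdir/colors.txt"

# BEGIN runs once, before the first line is read, so OFS is set a single time
# instead of being reassigned on every line.
reordered=$(cat "$tmpdir/colors.txt" | awk 'BEGIN{OFS="\t"} {print $3, $1, $2}')

echo "$reordered"
```

Both forms give the same output here; the BEGIN version simply avoids repeating the assignment for each line.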
Input files can be passed in at the end of the command instead of using a pipe | and cat:
awk '{OFS="\t";} {print $3, $1, $2}' < color-table.txt > newTable.txt
head -3 newTable.txt
## red This 1
## orange is 2
## yellow a 4
The less-than sign (<) before the input file is not necessary, but it can be used to avoid ambiguity.
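The three ways of feeding a file to awk can be compared side by side. This is a minimal sketch on a made-up sample file; the variable names are only for illustration:

```shell
# Hypothetical two-row stand-in for color-table.txt (tab-delimited).
tmpdir=$(mktemp -d)
printf 'This\t1\tred\nis\t2\torange\n' > "$tmpdir/colors.txt"

# 1) cat and a pipe
from_pipe=$(cat "$tmpdir/colors.txt" | awk '{OFS="\t";} {print $3, $1, $2}')
# 2) input redirection with <
from_redirect=$(awk '{OFS="\t";} {print $3, $1, $2}' < "$tmpdir/colors.txt")
# 3) file name given as the last argument, with no < at all
from_argument=$(awk '{OFS="\t";} {print $3, $1, $2}' "$tmpdir/colors.txt")

echo "$from_argument"
```

All three variables hold the same text, so the choice is mostly a matter of readability.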
One can quickly parse columns with awk using “if” statements:
cat color-table.txt | awk '{ if($3 == "yellow") print $0}'
## a 4 yellow
## test 100 yellow
## real 4 yellow
In this example, the if statement is followed by a test (in parentheses), followed by the desired action if the test result is true.
- $0 refers to the entire line instead of a specific column, so print $0 prints the whole line.
- The double equal sign == signifies that you are asking if the two items are equal to each other rather than setting the value of a variable.
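awk also accepts a shorter form of the same test: a bare condition with no action prints every line for which the condition is true. The sketch below checks that the two forms agree on a made-up sample file:

```shell
# Hypothetical three-row stand-in for color-table.txt (tab-delimited).
tmpdir=$(mktemp -d)
printf 'This\t1\tred\na\t4\tyellow\ntest\t100\tyellow\n' > "$tmpdir/colors.txt"

# Explicit if statement inside an action block.
with_if=$(awk '{ if($3 == "yellow") print $0}' "$tmpdir/colors.txt")

# Condition used as a pattern; the default action is to print the line.
with_pattern=$(awk '$3 == "yellow"' "$tmpdir/colors.txt")

echo "$with_pattern"
```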
There are times when you want to use a variable created in the shell in an awk command. Try:
y=yellow
cat color-table.txt | awk '{ if($3 == $y) print $0}'
As you can see, awk does not recognize shell variables, but they can be “passed” into awk as follows:
y=yellow
cat color-table.txt | awk -v cols="$y" '{ if($3 == cols) print $0}' #readable to awk
In this case, the -v option allows you to create an awk variable from a shell variable. Note: the -v assignment comes before the awk program that is wrapped in '{}'. I arbitrarily named the variable cols within awk, but we can use any name, including y.
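The difference between the two commands can be checked on a small made-up file. In this sketch, no_match and match are illustrative variable names:

```shell
# Hypothetical two-row stand-in for color-table.txt (tab-delimited).
tmpdir=$(mktemp -d)
printf 'This\t1\tred\na\t4\tyellow\n' > "$tmpdir/colors.txt"
y=yellow

# Inside single quotes the shell never expands $y. awk sees its own, unset
# variable y, which counts as 0 in a field reference, so $y means $0
# (the whole line) and the comparison with $3 is never true.
no_match=$(awk '{ if($3 == $y) print $0}' "$tmpdir/colors.txt")

# -v copies the shell value into the awk variable cols before the program runs.
match=$(awk -v cols="$y" '{ if($3 == cols) print $0}' "$tmpdir/colors.txt")

echo "without -v: '$no_match'"
echo "with -v:    '$match'"
```

The first command prints nothing, while the second prints the yellow row.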
Example: sometimes we will want to work on the beginning of a file in order to add, remove, or alter the column header. We can run a simple test to determine which lines to work on by placing awk's built-in NR variable (the number of the current row) before the awk action.
head -3 mm9_genes.txt
## genename geneID chr strand start stop
## 4930594M22Rik AK157947 chr14 + 123312252 123328664
## Zfp85-rs1 NM_001001130 chr13 - 67848736 67856071
cat mm9_genes.txt | awk 'NR>1{ print $0}' | head -3
## 4930594M22Rik AK157947 chr14 + 123312252 123328664
## Zfp85-rs1 NM_001001130 chr13 - 67848736 67856071
## Scap NM_001001144 chr9 + 110235821 110287450
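The same NR test can select the header as well as skip it. The sketch below uses a made-up three-column file, and the replacement column names are hypothetical:

```shell
# Hypothetical miniature gene table with a header row (tab-delimited).
tmpdir=$(mktemp -d)
printf 'genename\tgeneID\tchr\nScap\tNM_001001144\tchr9\n' > "$tmpdir/genes.txt"

# NR==1 matches only the first line; NR>1 matches every later line.
header=$(awk 'NR==1' "$tmpdir/genes.txt")
body=$(awk 'NR>1' "$tmpdir/genes.txt")

# Swap in a new header while passing the data rows through unchanged.
renamed=$(awk 'BEGIN{OFS="\t"} NR==1{print "name", "id", "chrom"} NR>1{print $0}' "$tmpdir/genes.txt")

echo "$renamed"
```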
awk operations we will use in this course:
1. Reordering columns
2. Simple math with columns
3. Printing certain rows or columns
4. Adding rows or columns to large files
5. Adding/removing headers from large files
Consider the example path to the mm9_genes.txt file: /users/tempdata3/MEDS5420/annotations/mm9_genes.txt
1. Use cut to get the file name without the extension (.txt)
Download the mm9_genes.txt file from GitHub Lecture 4 and move it to your MEDS5420 folder.
2. Determine size of your list by counting the number of lines in the file.
3. The number of gene names in the list (column 1) is not the same as the number of gene IDs (column 2). Determine how many redundant gene names are listed in the table.
4. Use awk to move genes on the plus strand to another file - call it PlusStrandGenes.txt
5. Use awk to create a file with another column that has the gene length - call it mm9GeneLengths.txt
6. Create a .bed file (used later in course) by reordering the columns as follows and separate the columns with a tab: chromosome, start, end, geneID, strand
Consider the path to the mm9_genes.txt in the MEDS5420 folder on a server: /archive/MEDS5420/annotations/mm9_genes.txt
1. Use cut (alone or in combination with other functions) to retrieve the file name without the extension (.txt):
file=/archive/MEDS5420/annotations/mm9_genes.txt
# The path starts with "/", so field 1 is empty and the file name is field 5.
echo $file | cut -d "/" -f 5 | cut -d "." -f 1
# OR independent of positional information:
echo $file | rev | cut -d "/" -f 1 | rev | cut -d "." -f 1
2. Count the number of lines in the file:
wc -l mm9_genes.txt
3. The number of genes in the list is not the same as the number of genetic loci. Determine how many redundant gene names are listed in the table:
# to get number of unique gene names
cat mm9_genes.txt | cut -f 1| sort | uniq| wc -l
# to get the number of duplicated gene names
cat mm9_genes.txt | cut -f 1| sort | uniq -d| wc -l
4. Use awk to move genes on the plus strand to another file:
cat mm9_genes.txt | awk '{if($4 == "+") print $0}' > PlusStrandGenes.txt
5. Use awk to create another column that has the gene length
cat mm9_genes.txt | awk '{print $0, $6-$5}' > mm9GeneLengths.txt
# The command above works, but the header is not correct. Try it and see.
cat mm9_genes.txt | awk 'NR<2{ print $0 "\t" "geneLength"} NR>1 {print $0 "\t" $6-$5}' > mm9GeneLengths.txt
# The command above uses "NR" to run different commands on specific rows, which creates the proper header and data.
6. Create a .bed file by reordering the columns and separating them with a tab delimiter:
cat mm9_genes.txt | awk '{OFS="\t";} {print $3, $5, $6, $2, $4}' > mm9_genes.bed
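The command above also converts the header row into the first line of the .bed file. Many tools that read .bed files do not expect a line of column names, so you may want to skip it with NR>1. The sketch below is one way to do that, tried on the first rows of mm9_genes.txt shown earlier; the file names genes.txt and genes.bed are placeholders:

```shell
# Miniature copy of the first rows of mm9_genes.txt (tab-delimited).
tmpdir=$(mktemp -d)
printf 'genename\tgeneID\tchr\tstrand\tstart\tstop\n' > "$tmpdir/genes.txt"
printf '4930594M22Rik\tAK157947\tchr14\t+\t123312252\t123328664\n' >> "$tmpdir/genes.txt"
printf 'Zfp85-rs1\tNM_001001130\tchr13\t-\t67848736\t67856071\n' >> "$tmpdir/genes.txt"

# NR>1 skips the header; the columns are written as chr, start, stop, geneID, strand.
awk 'BEGIN{OFS="\t"} NR>1 {print $3, $5, $6, $2, $4}' "$tmpdir/genes.txt" > "$tmpdir/genes.bed"

bed=$(cat "$tmpdir/genes.bed")
echo "$bed"
```

Whether to keep or drop the header depends on the tool that will read the file, so treat the NR>1 filter as an optional step.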