awk for parsing text files
1. Conditional expressions and tests in shell scripts
2. Practicing good habits in shell scripts
3. Logging into the Xanadu cluster
4. Moving data to and from the Xanadu cluster
Shell
We’ve learned some basics of shell scripting with loops. Let’s add more sophistication with conditional statements.
if/else
snow="1 2 3 4 6 8 10"
for x in $snow
do
    echo "There are ${x} inches of snow"
    if [ $x -lt 3 ]
    then
        echo 'stay calm'
    else
        echo 'panic'
    fi
done
## There are 1 inches of snow
## stay calm
## There are 2 inches of snow
## stay calm
## There are 3 inches of snow
## panic
## There are 4 inches of snow
## panic
## There are 6 inches of snow
## panic
## There are 8 inches of snow
## panic
## There are 10 inches of snow
## panic
*Note: the test statement must be separated from the square brackets by a space.
End an if statement with fi.
To get a list of operators for numerical tests use:
man test
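As a quick reference, the sketch below exercises the six integer comparison operators that man test lists (-eq, -ne, -lt, -le, -gt, -ge). The function name compare_snow and its threshold argument are made up for illustration and are not part of the lecture script.

```shell
# compare_snow is a hypothetical helper for illustration only.
# It reports how a snow depth relates to a threshold using integer tests.
compare_snow() {
    depth=$1
    threshold=$2
    if [ "$depth" -eq "$threshold" ]; then
        echo "equal"
    elif [ "$depth" -lt "$threshold" ]; then
        echo "less"
    else
        echo "greater"
    fi
}

# -ne, -le and -ge are the negated and inclusive counterparts
[ 3 -ne 4 ] && echo "3 is not equal to 4"
[ 3 -le 3 ] && echo "3 is less than or equal to 3"
[ 4 -ge 3 ] && echo "4 is greater than or equal to 3"

compare_snow 2 3
compare_snow 3 3
compare_snow 8 3
```

These operators compare whole numbers only; string comparisons use = and != instead.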
What if I want to add another layer of contingency? Use elif:
snow="1 2 3 4 6 8 10"
type=windy
for x in $snow
do
    echo "There are ${x} inches of snow"
    if [ $x -lt 3 ]
    then
        echo 'stay calm'
    # the spaces around = are required; without them the test is always true
    elif [ $x -lt 8 ] && [ "${type}" = windy ]
    then
        echo 'it is windy, take cover'
    else
        echo 'ignore the wind and grab your sled'
    fi
done
## There are 1 inches of snow
## stay calm
## There are 2 inches of snow
## stay calm
## There are 3 inches of snow
## it is windy, take cover
## There are 4 inches of snow
## it is windy, take cover
## There are 6 inches of snow
## it is windy, take cover
## There are 8 inches of snow
## ignore the wind and grab your sled
## There are 10 inches of snow
## ignore the wind and grab your sled
Note that I also included a variable (type) that is tested as part of the condition.
&& represents "and", creating an if/and statement: both tests must be true.
|| represents "or", creating an if/or statement: at least one test must be true.
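For comparison, here is a small sketch that uses || alongside &&. The function name weather_advice, the 'icy' condition, and the thresholds are invented for this example.

```shell
# weather_advice is a hypothetical helper for illustration only.
# Usage: weather_advice <inches> <type>
weather_advice() {
    inches=$1
    type=$2
    # || : take cover if EITHER test is true
    if [ "$inches" -ge 8 ] || [ "$type" = icy ]; then
        echo 'take cover'
    # && : BOTH tests must be true
    elif [ "$inches" -ge 3 ] && [ "$type" = windy ]; then
        echo 'it is windy, stay inside'
    else
        echo 'stay calm'
    fi
}

weather_advice 2 icy
weather_advice 10 calm
weather_advice 4 windy
weather_advice 4 calm
```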
I uploaded a file with rain data to the Lecture 6 directory on GitHub, and I want to process it with this script.
cat rain_data.txt
## Inches 1
## Inches 2
## Inches 3
## Inches 4
## Inches 6
## Inches 8
## Inches 10
I could read it in directly:
rain=$(cat rain_data.txt | cut -f 2)
cond=windy
for x in $rain
do
    echo "There are ${x} inches of rain"
    if [ $x -lt 3 ]
    then
        echo 'stay calm'
    elif [ $x -gt 5 ] && [ ${cond} == windy ]
    then
        echo 'it is windy, take cover'
    else
        echo 'get in a boat'
    fi
done
## There are 1 inches of rain
## stay calm
## There are 2 inches of rain
## stay calm
## There are 3 inches of rain
## get in a boat
## There are 4 inches of rain
## get in a boat
## There are 6 inches of rain
## it is windy, take cover
## There are 8 inches of rain
## it is windy, take cover
## There are 10 inches of rain
## it is windy, take cover
Or I could take the file, the column, and the condition as arguments, as below, and save the script as rain.sh.
#!/bin/bash
rain=$(cat "$1" | cut -f "$2")
cond=$3
for x in $rain
do
    echo "There are ${x} inches of rain"
    if [ $x -lt 3 ]
    then
        echo 'stay calm'
    elif [ $x -gt 5 ] && [ "$cond" = windy ]
    then
        echo 'it is windy, take cover'
    else
        echo 'get in a boat'
    fi
done
Note: $1, $2, and $3 are positional parameters that hold the arguments the user gives on the command line. The usage would then be: <script_name> ARG1 ARG2 ARG3
$1 refers to the input file, rain_data.txt
$2 refers to the number 2, our second argument, which the script uses to parse out the second column
$3 refers to the condition, which is the third argument
Then I would run:
# script input1 input2 input3
bash rain.sh rain_data.txt 2 calm
# OR
bash rain.sh rain_data.txt 2 windy
# OR
chmod +x rain.sh
./rain.sh rain_data.txt 2 windy
More arguments can be added and the order of the arguments sets the substitution order.
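To see how the order of the arguments determines the substitution, here is a minimal sketch. The function show_args is hypothetical; positional parameters inside a function behave like the ones a script receives on the command line, which lets us try this without saving a separate file.

```shell
# show_args is a hypothetical function for illustration only.
show_args() {
    echo "number of arguments: $#"
    echo "file=$1 column=$2 condition=$3"
}

show_args rain_data.txt 2 windy
# Swapping the order changes which value lands in which variable
show_args 2 rain_data.txt windy
```

The special variable $# holds the number of arguments, which is handy for checking that the user supplied everything the script needs.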
Question: What if we have a shell variable that we want to use or pass to awk?
Try this:
list="1 2 3"
echo $list | awk '{print $list}'
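This doesn’t do what it looks like. A variable made in the shell cannot be read by awk: inside the single quotes, list is a new, empty awk variable, so awk treats $list as $0 and prints the whole input line no matter what the shell variable holds. You have to pass the variable to awk explicitly. Here’s how: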
list="1 2 3"
echo $list| awk -v nums="$list" '{print nums}'
## 1 2 3
Recall the -v option at the beginning of the awk command, which sets an awk variable (here nums) before the awk program runs.
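A variable passed with -v can also be used in a test inside awk. The sketch below is an illustration under assumptions: the data is a made-up, tab-separated stand-in for rain_data.txt, and the names threshold, min, and heavy are hypothetical.

```shell
# Hypothetical example data: tab-separated, same layout assumed for rain_data.txt
data=$(printf 'Inches\t1\nInches\t4\nInches\t8\n')
threshold=3

# -v copies the shell variable threshold into an awk variable named min;
# awk then prints column 2 only for rows where it exceeds min
heavy=$(echo "$data" | awk -v min="$threshold" '$2 > min {print $2}')
echo "$heavy"
```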
In class last week we used color-table.txt and learned how to isolate and parse different columns and rows with cut, uniq, and awk. Now, try writing a script that uses the color column to send each row to a file containing only rows of that color. That is, all the ‘red’ rows should go to a ‘red.txt’ file, all the ‘blue’ rows to a ‘blue.txt’ file, etc.
Even though your code may have worked, the script is not considered finished as it stands. We need to use indentation and add annotation to the code for several reasons.
1. To make the code more readable. Proper indentation of loops shows where each block starts and ends.
2. To provide USAGE instructions.
3. To describe the steps being taken. This is important to remind yourself what your coding steps were, or for others who might want to modify your script.
4. To track the progress of the script. This is most important for debugging, so that one can know where in the code a script failed.
Editing in a text editor with syntax highlighting will help you construct a readable script.
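The habits listed above can be sketched in a few lines. This is only an illustration: count_lines is a hypothetical stand-in for a real script body, and sending the progress messages to standard error (>&2) is my own choice so that the result can still be captured cleanly.

```shell
# count_lines is a hypothetical example, not part of the course scripts.
# USAGE: count_lines <INPUT_FILE>
count_lines() {
    # Print usage instructions and stop if the input file is missing
    if [ $# -lt 1 ]; then
        echo "USAGE: count_lines <INPUT_FILE>" >&2
        return 1
    fi

    echo "counting lines in $1" >&2     # progress message
    n=$(wc -l < "$1")
    echo "done with $1" >&2             # progress message
    echo $n                             # the actual result
}

# Try it on a temporary three-line file, then without an argument
tmp=$(mktemp)
printf 'a\nb\nc\n' > "$tmp"
count_lines "$tmp"
count_lines || echo "no input file given"
rm -f "$tmp"
```

If the script fails, the last progress message printed tells you which step it reached.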
To access the cluster you need to login with ssh (secure shell):
ssh <user_name>@xanadu-submit-ext.cam.uchc.edu
# your user name looks like this:
ssh meds5420usr17@xanadu-submit-ext.cam.uchc.edu
Once you log in, you are on the head (or login) node. DO NOT run any resource-intensive commands on the head node. We will go over the procedure for allocating resources using both interactive sessions and job submissions.
You can use scp from a terminal window, or you can use WinSCP, which is a convenient FTP client or user interface for transferring data between computers. Below are some links with tutorials for downloading, installing, and using WinSCP.
https://winscp.net/eng/docs/guide_connect
https://www.youtube.com/watch?v=58KmUBaEW34
To move files between computers you can log in with sftp or use scp (secure copy):
sftp: ftp stands for “File Transfer Protocol”, and sftp is “Secure File Transfer Protocol”. In other words, with sftp, a user account and password are required.
sftp <your_username>@<host_name>
For the Xanadu cluster, there is a special partition for transferring data:
sftp <your_username>@transfer.cam.uchc.edu
1. You can then navigate to the directory where you want to take files from.
2. put and get can be used to move files from or to your computer, respectively.
# copy a document to the cluster
put /Users/guertinlab/MEDS5420/color-table.txt
# retrieve a copy of a document from the cluster (will go in whatever folder you logged in from)
get /home/FCAM/meds5420/usr17/file.txt
scp
scp can be used without logging in, provided you know the exact location where your file of interest is or will go. I find sftp easier and will use sftp for class.
# for copying TO the server
scp -r <path_to_directory> <your_username>@transfer.cam.uchc.edu:~/path/to/target/folder
You should be prompted for a password. If not, the transfer probably failed.
# for copying FROM the server
scp -r <your_username>@transfer.cam.uchc.edu:<path_to_directory> <target_directory>
After transferring a file, how do we confirm that it arrived intact? There’s a program called md5 (Mac) or md5sum (Linux) that can help us with this. It returns a compact digital fingerprint (a checksum) for each file. Any change to the file will result in a different fingerprint.
on a mac:
md5 ./data-shell.tar
## MD5 (./data-shell.tar) = 600c193f4bffbbf029d357c86ff71c0c
on Linux:
md5sum ./data-shell.tar
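Comparing two fingerprints by eye is error-prone, so the comparison can be scripted. The sketch below is a minimal illustration: checksum is a hypothetical wrapper, and the temporary files stand in for the original file and its transferred copy.

```shell
# checksum is a hypothetical wrapper that works on both systems:
# it uses md5sum when available (Linux), otherwise md5 -q (Mac).
checksum() {
    if command -v md5sum > /dev/null 2>&1; then
        md5sum "$1" | cut -d ' ' -f 1
    else
        md5 -q "$1"
    fi
}

# Stand-in for a transfer: copy a temporary file, then compare fingerprints
original=$(mktemp)
copy=$(mktemp)
echo "some data" > "$original"
cp "$original" "$copy"

if [ "$(checksum "$original")" = "$(checksum "$copy")" ]; then
    echo "checksums match"
else
    echo "checksums differ"
fi

# Any change to the file produces a different fingerprint
echo "one more line" >> "$copy"
if [ "$(checksum "$original")" = "$(checksum "$copy")" ]; then
    echo "checksums match"
else
    echo "checksums differ"
fi
rm -f "$original" "$copy"
```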
1. Log onto the server using ssh.
2. Navigate to the MEDS5420 folder in /home/FCAM/meds5420/in_class.
3. View the contents of the data-shell.tar file without unbundling it.
4. View the checksum string for the file.
5. Log out and return to your home directory, or open a new terminal window (command-t).
6. Transfer the file to your computer using sftp.
7. Confirm that the transfer was complete.
colors=$(cat "$1"| cut -f 3 | sort | uniq)
for col in $colors
do
    touch ${col}_rows.txt
    cat "$1" | awk -v col="$col" '{ if ($3 == col) print $0}' >> ${col}_rows.txt
done
Here’s another version of the script with decent annotation:
# This script will parse unique items from column 3 to separate files
# USAGE: bash parse_colors.sh <INPUT_FILE>
# Create unique list of colors. Note that input file is first argument
colors=$(cat "$1"| cut -f 3 | sort | uniq)
echo $colors
# Iterate through list of colors and parse into new files
for col in $colors
do
    echo parsing ${col} # this prints the progress to the screen
    touch ${col}_rows.txt
    cat "$1" | awk -v col="$col" '{ if ($3 == col) print $0}' >> ${col}_rows.txt
    echo ${col} parsed # this prints the progress to the screen
done
The input file would be the color-table.txt file. The first argument then replaces the “$1” wherever it appears in the script.
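As an alternative to the loop, awk can do the whole job in one pass by building the output file name from the color column. This is a different technique from the script above, shown only as a sketch: the table contents, the file name example-table.txt, and the temporary working directory are made up, and the real color-table.txt may look different.

```shell
# Work in a temporary directory with a made-up, tab-separated table
# whose third column holds the color.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '1\tapple\tred\n2\tsky\tblue\n3\tcherry\tred\n' > example-table.txt

# One pass: each row is written to a file named after its third column
awk -F '\t' '{ out = $3 "_rows.txt"; print $0 > out }' example-table.txt

cat red_rows.txt
cat blue_rows.txt
```

Within a single awk run, > creates each output file once and then keeps adding rows to it, so no touch or >> is needed. Unlike the loop version, rerunning this command overwrites the old output files instead of appending to them.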
ssh meds5420usr17@xanadu-submit-ext.cam.uchc.edu
#the 17 refers to your user number
cd /home/FCAM/meds5420/in_class
tar -tvf data-shell.tar
md5sum data-shell.tar
# a174bf3795d25f39891f43571ba1c678
exit
Note that none of these commands demand significant compute resources.
cd ~
sftp meds5420usr17@transfer.cam.uchc.edu
cd /home/FCAM/meds5420/in_class
get data-shell.tar
exit
md5 data-shell.tar # mac
#OR
md5sum data-shell.tar # linux