
1 Readings

Practical Computing for Biologists: Chapters 2, 5, 6, 16, Appendices 2, 3.

Unix “Basics” and “Finding Things” from UConn CBC: http://bioinformatics.uconn.edu/unix-basics/

Software Carpentry Shell Novice lesson: Episodes 5-7: https://swcarpentry.github.io/shell-novice/

Review basic commands and server access from UConn_Unix_basics

2 Last Time:

2.1 Command line navigation:


1. Complete path:

head /home/username/MEDS5420/lec02_files/the_raven.txt
head /Users/username/MEDS5420/lec02_files/the_raven.txt
head ~/MEDS5420/lec02_files/the_raven.txt

OR

2. By using the relative path starting from where you are:

head ./username/MEDS5420/lec02_files/the_raven.txt
head username/MEDS5420/lec02_files/the_raven.txt
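
The difference between the two styles can be tried safely in a throwaway directory. The sketch below is only an illustration: the scratch directory and the placeholder file contents are assumptions, not the real course files.

```shell
# Build a throwaway copy of the folder layout (hypothetical stand-in file).
scratch=$(mktemp -d)
mkdir -p "$scratch/MEDS5420/lec02_files"
echo "placeholder first line" > "$scratch/MEDS5420/lec02_files/the_raven.txt"

# Complete (absolute) path: works from any working directory.
cd /
from_absolute=$(head -n 1 "$scratch/MEDS5420/lec02_files/the_raven.txt")

# Relative path: works only when the working directory is the parent of MEDS5420.
cd "$scratch"
from_relative=$(head -n 1 MEDS5420/lec02_files/the_raven.txt)
from_dot=$(head -n 1 ./MEDS5420/lec02_files/the_raven.txt)

echo "$from_absolute"
echo "$from_relative"
echo "$from_dot"
```

All three commands read the same file; only the relative forms depend on where you are when you run them.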

2.2 Command line utilities so far:

1. pwd - print working directory
2. ls - list directory contents
3. mkdir - create a directory
4. unzip - decompression
5. mv - move file
6. cp - copy file
7. cat - print contents of file
8. touch - create empty file
9. rm - remove file
10. wc - count lines, words, and characters in a file
11. > - redirects output to new file
12. >> - redirects output to append to existing file
13. * - wildcard that specifies any input
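
As a quick refresher on the two redirection operators, here is a minimal sketch. The file name log.txt and the scratch directory are made up for illustration, and the `<` input redirect is used only so that wc prints a bare number.

```shell
# Work in a throwaway directory so no real files are touched.
scratch=$(mktemp -d)
cd "$scratch"

echo "first"  > log.txt     # > creates log.txt (or overwrites it)
echo "second" >> log.txt    # >> appends a line to the existing file
echo "third"  >> log.txt

lines_after_append=$(wc -l < log.txt | tr -d ' ')
echo "$lines_after_append"

echo "fresh start" > log.txt    # > again: everything written so far is replaced
lines_after_overwrite=$(wc -l < log.txt | tr -d ' ')
echo "$lines_after_overwrite"
```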

3 Pipes, filtering with wildcards, redirecting outputs to files

One can select multiple files using the * wildcard. Navigate to the ~/MEDS5420/lec02_files directory and type:

wc *.txt

By default, wc prints three columns of numbers: the number of lines, words, and characters. The -l option limits the output to the number of lines:

wc -l *.txt

One can also add some specificity to wild cards using brackets: []

wc -l [Wt]*.txt
    # matches .txt files whose names start with "W" or "t"
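
To experiment with bracket patterns without touching the lecture files, something like the following can be run in a scratch directory. The file names are hypothetical and chosen only to make the patterns visible.

```shell
# Work in a throwaway directory so no real files are touched.
scratch=$(mktemp -d)
cd "$scratch"

# Hypothetical file names, chosen only to illustrate the patterns.
touch Walden.txt the_raven.txt tell_tale_heart.txt Ulysses.txt notes.md

ls [Wt]*.txt     # starts with "W" or "t" and ends in .txt
ls [!t]*.txt     # starts with any character except "t"
ls *.md          # the plain * wildcard, for comparison

# Count how many names each bracket pattern matched.
wt_count=$(ls [Wt]*.txt | wc -l | tr -d ' ')
not_t_count=$(ls [!t]*.txt | wc -l | tr -d ' ')
echo "$wt_count $not_t_count"
```

The brackets match exactly one character, so [Wt] means "one character that is either W or t", and the * that follows matches the rest of the name.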

Let’s find which file is shortest. First, save the wc output to disk with the redirection operator >. We can then verify that the contents of lengths.txt match what wc produces, using cat or less:

wc -l *.txt > lengths.txt
cat lengths.txt
less lengths.txt

To find the shortest file, sort the line counts numerically with sort -n, then keep only the first line of the sorted output with head -n 1:

sort -n lengths.txt > sorted-lengths.txt
head -n 1 sorted-lengths.txt

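Intermediate files are easy to lose track of, especially in more complex problems. Here they also match *.txt, so remove lengths.txt and sorted-lengths.txt (with rm) before running wc again, or they will be counted as well. Pipes (|) avoid the intermediate files and save typing: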

wc -l *.txt | sort -n | head -n 1
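
One detail worth seeing for yourself: when wc is given several files, it adds a "total" line. The sketch below uses three made-up files in a scratch directory (assumptions for illustration only) to show why head -n 1 still returns the shortest file, and how the longest file could be picked out despite the total line.

```shell
# Work in a throwaway directory so no real files are touched.
scratch=$(mktemp -d)
cd "$scratch"

# Three hypothetical files with 3, 1, and 2 lines.
printf 'a\nb\nc\n' > long.txt
printf 'a\n'       > short.txt
printf 'a\nb\n'    > medium.txt

# With several files, wc -l appends a "total" line; sort -n puts it last
# because it holds the largest number.
wc -l *.txt | sort -n

# Shortest file: the first line of the sorted output.
shortest=$(wc -l *.txt | sort -n | head -n 1)

# Longest file: take the last two lines (longest file + total),
# then keep the first of those two.
longest=$(wc -l *.txt | sort -n | tail -n 2 | head -n 1)

echo "shortest: $shortest"
echo "longest:  $longest"
```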

3.1 Exercise 1: Pipe Reading Comprehension

A file called animals.txt contains the following data:

deer
rabbit
raccoon
rabbit
deer
fox
rabbit
bear

3.1.1 Part 1:

What text passes through each of the pipes and the final redirect in the pipeline below? Manually rearrange and parse the input before you run or deconstruct the command.

cat animals.txt | head -n 5 | tail -n 3 | sort > final.txt

3.1.2 Part 2:

Alter the pipeline so that the final output contains only the three rabbit lines.

4 Additional Commands:

4.1 File Compression

Command   Function
gzip      Compression/decompression tool using Lempel-Ziv coding (LZ77)
tar       Bundles files and folders into a single archive

4.2 Finding things:

  • Files in directories
  • Words in files

Command   Function
grep      Global Regular Expression Print (useful flags: -w, -i, -v, -n)
find      Recursively lists all files and directories, with optional filters

4.3 Concepts:

1. Variables (creating and printing to screen).
2. Basics of shell scripts.
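
As a preview of these two concepts, here is a minimal sketch. The variable name, the script name count_lines.sh, and the test file are all hypothetical and serve only to illustrate the idea.

```shell
# 1. Variables: assign with = (no spaces around it) and read back with $.
course="MEDS5420"
echo "$course"
echo "Lecture files live in ~/${course}/lec02_files"

# 2. Shell scripts: a text file of commands that bash runs top to bottom.
scratch=$(mktemp -d)
cd "$scratch"

cat > count_lines.sh <<'EOF'
#!/bin/bash
# Hypothetical script: print the line count of the file given as argument $1.
wc -l < "$1"
EOF

printf 'x\ny\n' > two_lines.txt

# Run the script with bash and capture its output in a variable.
result=$(bash count_lines.sh two_lines.txt | tr -d ' ')
echo "$result"
```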

5 Dealing with compressed files (archives)

Download and move the data-shell.tar from GitHub to your MEDS5420 folder. See the third code chunk of section 6 of Lecture 2 for how to accomplish this for Windows OS.

We already unzipped a file using unzip:

unzip -d Example_files Example_files.zip

Other types of archives you will encounter:
.tar # bundles multiple files or folders
.gz # file compressed with gzip

Figure 1: XKCD: valid tar command

To view contents of archive:

tar -tvf data-shell.tar # displays tar contents

To extract contents of archive:

tar -xvf data-shell.tar # extracts contents into original directories

To combine contents of a directory:

tar -cvf data-shell_retar.tar data-shell 

#format is <target.tar> <directory-to-be-tarred>
#For directories, execute command in parent directory (one level up). 
#Don't use absolute path. 
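
The three tar operations can be practiced as a round trip on a small made-up directory. In this sketch, demo_dir stands in for data-shell, and the -C option (an addition not used above) tells tar to extract into a different directory.

```shell
# Work in a throwaway directory so no real files are touched.
scratch=$(mktemp -d)
cd "$scratch"

# Hypothetical directory standing in for data-shell.
mkdir -p demo_dir/sub
echo "hello" > demo_dir/a.txt
echo "world" > demo_dir/sub/b.txt

# Create the archive from the parent directory, using a relative path.
tar -cf demo_dir.tar demo_dir

# List the contents without extracting anything.
tar -tf demo_dir.tar

# Extract into a separate directory; the stored relative paths are recreated there.
mkdir restored
tar -xf demo_dir.tar -C restored

cat restored/demo_dir/sub/b.txt
```

Because the archive was created with a relative path, it unpacks cleanly wherever it is extracted.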

Compressing files with gzip:

gzip filename # compresses file, replacing it with filename.gz

Let’s look at a specific example in the writing folder within data-shell:

cd ./data-shell/writing/leisure/

ls

To view contents of a gzipped file (Linux):

zcat haiku.txt.gz | head

On a Mac use this instead:

gunzip -c haiku.txt.gz | head

Or use gzcat on a Mac:

gzcat haiku.txt.gz | head

Note: These commands are useful because they let you glance at or access the contents of large compressed files without spending time decompressing them to disk first.
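
A small sketch of this workflow, using a made-up file of numbers rather than the course data. gunzip -c is used here because it behaves the same on Linux and Mac.

```shell
# Work in a throwaway directory so no real files are touched.
scratch=$(mktemp -d)
cd "$scratch"

# Hypothetical 1000-line file.
seq 1 1000 > numbers.txt

# Compress it: numbers.txt is replaced by numbers.txt.gz.
gzip numbers.txt
ls

# Peek at the first lines; the .gz file on disk is left untouched.
gunzip -c numbers.txt.gz | head -n 3

# Count the lines without writing a decompressed copy to disk.
n_lines=$(gunzip -c numbers.txt.gz | wc -l | tr -d ' ')
echo "$n_lines"
```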

To extract gzipped files:

gunzip haiku.txt.gz #decompresses file

6 Finding things

6.1 Searching inside files using grep

We can search for patterns inside files and print the matching lines using the grep command. Let’s head over to the writing directory and try using grep. First, have a look at haiku.txt:

cat haiku.txt
## Some things can be hard
## covering your face is not
## wear a mask dammit
## 
## With searching comes loss
## and the presence of absence:
## "My Thesis" not found.
## 
## Yesterday it worked
## Today it is not working
## Software is like that.

When was haiku.txt last modified?

ls -l haiku.txt
## -rwxr--r--@ 1 mikeguertin  staff  216 Sep  9  2020 haiku.txt

Now search the file for the pattern mask:

grep mask haiku.txt
## wear a mask dammit

In the above command, the first argument “mask” is the pattern we are searching for. The default action for grep is to return the entire line in which the pattern was found.

Let’s instead search for the word day:

grep day haiku.txt
## Yesterday it worked
## Today it is not working

In this case, grep shows us lines where “day” is part of a larger word. We might instead want to see only exact words, not parts of larger words. To impose word boundaries, we use the -w flag:

grep -w day haiku.txt

There are no results because “day” only appears as part of larger words in haiku.txt.

Sometimes we want to search for more than a single word. To search for a phrase, we need to use double quotes so that grep treats the pattern as a single argument.

grep -w "is not" haiku.txt
## covering your face is not
## Today it is not working

Other very useful grep flags are -n, -i and -v:

grep -n "it" haiku.txt
## 3:wear a mask dammit
## 5:With searching comes loss
## 9:Yesterday it worked
## 10:Today it is not working
grep -n -w -i "the" haiku.txt
## 6:and the presence of absence:

As you might have guessed:

  • -n prints the line number of the matching line.
  • -i ignores capitalization (also called “case”; the “i” comes from case-insensitive).
  • -v inverts the match, printing only the lines that do not contain the pattern.
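
The -v flag is listed above but not demonstrated, so here is a minimal sketch. It recreates haiku.txt in a scratch directory from the text printed earlier (assuming the blank lines between stanzas are truly empty) and also uses -c, which prints a count of matching lines instead of the lines themselves.

```shell
# Work in a throwaway directory so no real files are touched.
scratch=$(mktemp -d)
cd "$scratch"

# Copy of haiku.txt as printed above.
cat > haiku.txt <<'EOF'
Some things can be hard
covering your face is not
wear a mask dammit

With searching comes loss
and the presence of absence:
"My Thesis" not found.

Yesterday it worked
Today it is not working
Software is like that.
EOF

# -v: print the lines that do NOT contain "not" (blank lines included).
grep -v "not" haiku.txt

# -c: count matching lines instead of printing them.
with_not=$(grep -c "not" haiku.txt)

# -v and -c together: count the lines that do not match.
without_not=$(grep -v -c "not" haiku.txt)

echo "$with_not lines contain 'not'; $without_not do not"
```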

You can learn more about grep flags using grep --help or man grep.

The real power of grep is using a special class of wildcards known as “regular expressions” (the “re” in grep). Let’s use regular expressions to find lines where the second letter is “o”:

grep '^.o' haiku.txt
## Some things can be hard
## covering your face is not
## Today it is not working
## Software is like that.

Explanation of the pattern:

  • The caret (^) tells grep to only look from the start of a line rather than anywhere in the line.
  • The dot (.) tells grep to match any single character (letter, number, or symbol) - basically a single character wild card.
  • The “o” means to specifically only match the letter “o” (it will not match an upper case “O”).

Some other useful expressions in grep:

  • $ anchors the match to the end of a line.
  • * is a repetition operator in grep: it matches zero or more of the preceding character. It is commonly paired with . (as .*) to act as a wildcard of unspecified length.
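
A few patterns that combine these pieces are sketched below. The example recreates haiku.txt from the text printed earlier (assuming the blank lines between stanzas are truly empty); the backslash in '\.$' makes the dot a literal period rather than a wildcard.

```shell
# Work in a throwaway directory so no real files are touched.
scratch=$(mktemp -d)
cd "$scratch"

# Copy of haiku.txt as printed above.
cat > haiku.txt <<'EOF'
Some things can be hard
covering your face is not
wear a mask dammit

With searching comes loss
and the presence of absence:
"My Thesis" not found.

Yesterday it worked
Today it is not working
Software is like that.
EOF

# Lines that end in a literal period.
grep '\.$' haiku.txt
ends_in_period=$(grep -c '\.$' haiku.txt)

# Empty lines: the start (^) is immediately followed by the end ($).
empty_lines=$(grep -c '^$' haiku.txt)

# "is", then anything of any length, then "not".
# Note that the "is" inside "Thesis" counts too.
grep 'is.*not' haiku.txt
is_then_not=$(grep -c 'is.*not' haiku.txt)

echo "$ends_in_period $empty_lines $is_then_not"
```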

Learning the full power of regular expressions takes time, but for now just know that they exist. If you want to make use of them, check out these cheat sheets and other online resources.

6.2 Exercise 2: ‘regex’ with grep

  1. Use the criteria below to print the appropriate lines from haiku.txt:
  • lines that begin with the letter s
  • lines that end in the letter d
  • lines that begin and end with the letters s and d
  • lines with words that begin with the letter n

6.3 Exercise 3: Command practice and ‘grepping’ patterns

Look in the song_lyrics folder inside the data-shell folder and you should see a single file: TS_example.txt. This file contains lyrics to a song by a well-known contemporary female artist. Using the command line utilities you have learned, try the following:

1. Print the number of lines in the file.

2. Print the lines, with their line numbers, that have the word ‘shake’ in them to a new file called shake-lines.txt.
3. Print the number of lines that have the word ‘shake’ in them.
4. Devise a way to print the number of times ‘shake’ appears in the song. Be sure to include all instances or forms of the word.

*Hint: use the manuals (man pages) for the different commands to see what your options could be.

7 Answers to in class exercises

7.1 Answers to Exercise 1

Part 1:
cat prints all the contents of animals.txt and passes it on to head. Standard output from cat (or standard input to head):

deer
rabbit
raccoon
rabbit
deer
fox
rabbit
bear

head reads the first 5 lines of that output and passes it on to tail. Standard output from head (or standard input to tail):

deer
rabbit
raccoon
rabbit
deer

tail reads the last 3 lines of the output and passes it on to sort. Standard output from tail (or standard input to sort):

raccoon
rabbit
deer

sort rearranges the lines in alphabetical order (see the man page of sort for its options, including -r for reverse order), and the > redirect saves the result into final.txt. Standard output from sort (or contents of final.txt):

deer
rabbit
raccoon

Part 2:

cat animals.txt | sort | tail -n 4 | head -n 3
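
Both parts can be checked by recreating animals.txt in a scratch directory, as in this sketch. The grep line is an alternative answer to Part 2 that relies on the grep command introduced later in this lecture; it is one possible solution, not the only one.

```shell
# Work in a throwaway directory so no real files are touched.
scratch=$(mktemp -d)
cd "$scratch"

# Recreate animals.txt exactly as listed in the exercise.
printf '%s\n' deer rabbit raccoon rabbit deer fox rabbit bear > animals.txt

# Part 1: run the original pipeline and inspect the result.
cat animals.txt | head -n 5 | tail -n 3 | sort > final.txt
cat final.txt

# Part 2: after sorting, the three rabbits sit just before raccoon.
rabbits_sorted=$(cat animals.txt | sort | tail -n 4 | head -n 3)

# Part 2, alternative: select the rabbit lines directly by pattern.
rabbits_grep=$(grep rabbit animals.txt)

echo "$rabbits_sorted"
```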

7.2 Answers to Exercise 2: ‘regex’ with grep

  1. Use the criteria below to print the appropriate lines from haiku.txt:
  • lines that begin with the letter s:
    grep -i '^s' haiku.txt
  • lines that end in the letter d:
    grep -i 'd$' haiku.txt
  • lines that begin and end with the letters s and d:
    grep -i '^s.*d$' haiku.txt
  • lines with words that begin with the letter n. This one requires a Google search: to anchor a pattern to the start or end of a word, use \< (start of word) and \> (end of word).
    grep -i '\<n' haiku.txt or grep -w -i "n.*" haiku.txt

7.3 Answers to Exercise 3: ‘grepping’ patterns

The TS_example.txt file contains lyrics to a song by a well-known contemporary female artist. Using the command line utilities you just learned, try the following:
1. Print the number of lines in the file: wc -l TS_example.txt
2. Print the lines, with their line numbers, that have the word ‘shake’ in them to a new file called shake-lines.txt: grep -n -i shake TS_example.txt > shake-lines.txt
-because of the -i option, this answer prints all instances of shake regardless of upper- and lowercase letters.
3. Print the number of lines that have the word ‘shake’ in them: grep -c -i shake TS_example.txt
-the -i option makes the count insensitive to upper- and lowercase; without it, grep -c counts only lines containing lowercase shake.
4. Devise a way to print the number of times ‘shake’ appears in the song: grep -o -i shake TS_example.txt | wc -w
-here -o prints only the matching shake part of each line, one match per output line, so we can pipe to wc and count the number of words (wc -l gives the same number).
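
The difference between counting lines and counting occurrences is easy to see on a tiny sample. The three-line file below is made up for illustration and is not the actual lyrics file.

```shell
# Work in a throwaway directory so no real files are touched.
scratch=$(mktemp -d)
cd "$scratch"

# Hypothetical three-line sample, not the real TS_example.txt.
printf '%s\n' 'Shake it, shake it' 'nothing here' 'SHAKES and Shaken' > sample.txt

# Lines containing shake in any case: each line counts once,
# however many times the pattern occurs on it.
matching_lines=$(grep -c -i shake sample.txt)

# Occurrences: -o prints every match on its own line, so counting
# the output lines counts the matches.
occurrences=$(grep -o -i shake sample.txt | wc -l | tr -d ' ')

# Without -i, only lines containing lowercase "shake" are counted.
case_sensitive_lines=$(grep -c shake sample.txt)

echo "lines: $matching_lines, occurrences: $occurrences, case-sensitive lines: $case_sensitive_lines"
```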

*Hint: use the manuals for different functions to see what your options could be.