Practical Computing for Biologists: Chapters 2, 5, 6, 16, Appendices 2, 3.
Unix “Basics” and “Finding Things” from UConn CBC: http://bioinformatics.uconn.edu/unix-basics/
Software Carpentry Shell Novice lesson: Episodes 5-7: https://swcarpentry.github.io/shell-novice/
Review basic commands and server access from UConn_Unix_basics
1. pwd
- print working directory
2. ls
- list directory contents
3. mkdir
- create a directory
4. unzip
- decompression
5. mv
- move file
6. cp
- copy file
7. cat
- print contents of file
8. touch
- create empty file
9. rm
- remove file
10. wc
- count lines/words/characters in file
11. >
- redirects output to new file
12. >>
- redirects output to append to existing file
13. *
- wildcard that matches any string of characters in a file name
One can select multiple files using the * wildcard. Navigate to the ~/MEDS5420/lec02_files directory and type:
wc *.txt
Instead of seeing the 3 columns of numbers for the number of lines, words and characters, we can limit the wc command to show only the number of lines using the -l argument:
wc -l *.txt
One can also add some specificity to wild cards using brackets: []
wc -l [Wt]*.txt
# this matches .txt files whose names start with a "W" or a "t"
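To see exactly which names a wildcard pattern selects before handing it to a command, one option is to let echo print the expansion. The sketch below is only an illustration: the file names are hypothetical and are created in a temporary directory, not in lec02_files.

```shell
# Work in a throwaway directory so your course files are not touched.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical file names, chosen only to exercise the patterns.
touch Walden.txt test.txt time.txt notes.txt Walden.md

# [Wt]*.txt: first character is "W" or "t", then anything, ending in .txt
matched=$(echo [Wt]*.txt)
echo "$matched"

# *.txt: every file whose name ends in .txt
all_txt=$(echo *.txt)
echo "$all_txt"
```

The first pattern should pick up three of the five files and skip notes.txt and Walden.md; the order in which the names are printed can differ between systems.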
Let’s find which file is shortest. Let’s save the wc output to disk with the redirection operator >; then we can verify that the contents of lengths.txt are the same as what wc produces using cat or less:
wc -l *.txt > lengths.txt
cat lengths.txt
less lengths.txt
To find the shortest file, we then sort the lengths numerically using the sort command with the -n flag. The shortest file is now on the first line, so we pick it using head -n 1:
sort -n lengths.txt > sorted-lengths.txt
head -n 1 sorted-lengths.txt
Using intermediate files can be confusing, especially in more complex problems. We can avoid a lot of messy files and typing by using pipes (|):
wc -l *.txt | sort -n | head -n 1
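Here is a small self-contained sketch of the same pipeline that you can run anywhere. The three files are hypothetical and are created in a temporary directory. It also shows one thing to watch for: when wc is given several files it adds a "total" line, which sorts to the bottom.

```shell
# Work in a throwaway directory so your course files are not touched.
workdir=$(mktemp -d)
cd "$workdir"

# Three hypothetical files with 3, 1, and 2 lines.
printf 'a\nb\nc\n' > long.txt
printf 'a\n' > short.txt
printf 'a\nb\n' > medium.txt

# Shortest file: the smallest count sorts to the top.
shortest=$(wc -l *.txt | sort -n | head -n 1)
echo "$shortest"

# Longest file: the "total" line sorts last, so take the last
# two lines and keep the first of them.
longest=$(wc -l *.txt | sort -n | tail -n 2 | head -n 1)
echo "$longest"
```

No intermediate files are written, so a later wc -l *.txt will not accidentally count lengths.txt or sorted-lengths.txt as data.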
A file called animals.txt contains the following data:
deer
rabbit
raccoon
rabbit
deer
fox
rabbit
bear
What text passes through each of the pipes and the final redirect in the pipeline below? Manually rearrange and parse the input before you run or deconstruct the command.
cat animals.txt | head -n 5 | tail -n 3 | sort > final.txt
Alter the commands to get only all three rabbits as the final output.
Command | Function |
---|---|
gzip | compression/decompression tool using Lempel-Ziv coding (LZ77) |
tar | Bundling files in folders |

Command | Function |
---|---|
grep | Global Regular Expression Print (useful flags: -w, -i, -v, -n) |
find | Recursively list all files and directories and filter |
1. Variables (creating and printing to screen).
2. Basics of shell scripts.
Download data-shell.tar from GitHub and move it to your MEDS5420 folder. See the third code chunk of section 6 of Lecture 2 for how to accomplish this on Windows.
We already unzipped a file using unzip:
unzip -d Example_files Example_files.zip
Other types of archives you will encounter:
.tar # bundles multiple files or folders
.gz # file compressed with gzip
To view contents of archive:
tar -tvf data-shell.tar # displays tar contents
To extract contents of archive:
tar -xvf data-shell.tar # extracts contents into original directories
To combine contents of a directory:
tar -cvf data-shell_retar.tar data-shell
#format is <target.tar> <directory-to-be-tarred>
#For directories, execute command in parent directory (one level up).
#Don't use absolute path.
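The three tar operations above (create, list, extract) can be tried as a round trip. This is a minimal sketch using a hypothetical project directory in a temporary location; the -C option, which tells tar to extract into a different directory, is an addition not used elsewhere in these notes. The v flag is left out here only to keep the output short.

```shell
# Work in a throwaway directory so your course files are not touched.
workdir=$(mktemp -d)
cd "$workdir"

# A hypothetical directory with one nested folder.
mkdir -p project/notes
echo "first file" > project/a.txt
echo "second file" > project/notes/b.txt

# Create: run from the parent directory and use a relative path.
tar -cf project.tar project

# List the contents without extracting anything.
listing=$(tar -tf project.tar)
echo "$listing"

# Extract into a separate directory so the original is not overwritten.
mkdir restored
tar -xf project.tar -C restored

restored_text=$(cat restored/project/notes/b.txt)
echo "$restored_text"
```

Because the archive was created with a relative path, the extracted copy lands under restored/project/ with the same internal layout.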
Compressing files with gzip:
gzip filename # compresses file
Let’s look at a specific example in the writing folder within data-shell:
cd ./data-shell/writing/leisure/
ls
To view contents of a gzipped file (linux):
zcat haiku.txt.gz | head
On a Mac use this instead:
gunzip -c haiku.txt.gz | head
Or use gzcat on a Mac:
gzcat haiku.txt.gz | head
Note: These commands are useful because they let you glance at or access the contents of large compressed files without first writing a decompressed copy to disk.
To extract gzipped files:
gunzip haiku.txt.gz #decompresses file
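To see how gzip and gunzip replace files on disk, here is a short round trip on a hypothetical sample.txt created in a temporary directory. It uses gunzip -c for peeking because that form works on both Linux and Mac.

```shell
# Work in a throwaway directory so your course files are not touched.
workdir=$(mktemp -d)
cd "$workdir"

printf 'line one\nline two\nline three\n' > sample.txt

# gzip replaces sample.txt with sample.txt.gz
gzip sample.txt

# Peek at the contents; the file on disk stays compressed.
first_line=$(gunzip -c sample.txt.gz | head -n 1)
echo "$first_line"

still_compressed=no
if [ -f sample.txt.gz ] && [ ! -f sample.txt ]; then
    still_compressed=yes
fi
echo "$still_compressed"

# gunzip restores sample.txt and removes sample.txt.gz
gunzip sample.txt.gz
line_count=$(wc -l < sample.txt | tr -d ' ')
echo "$line_count"
```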
grep
We can search for patterns inside of files and print them using the grep command.
Let’s head over to the writing directory and try using grep:
Have a look at haiku.txt:
cat haiku.txt
## Some things can be hard
## covering your face is not
## wear a mask dammit
##
## With searching comes loss
## and the presence of absence:
## "My Thesis" not found.
##
## Yesterday it worked
## Today it is not working
## Software is like that.
When was haiku.txt last modified?
ls -l haiku.txt
## -rwxr--r--@ 1 mikeguertin staff 216 Sep 9 2020 haiku.txt
grep mask haiku.txt
## wear a mask dammit
In the above command, the first argument “mask” is the pattern we are searching for. The default action for grep is to return the entire line in which the pattern was found.
Let’s instead search for the word day:
grep day haiku.txt
## Yesterday it worked
## Today it is not working
In this case grep returns lines in which “day” is part of a larger word.
We might instead want to match only whole words.
To impose word boundaries, we use the -w flag:
grep -w day haiku.txt
There are no results because “day” only appears as part of larger words in haiku.txt.
Sometimes we want to search for more than a single word.
To search for a phrase, we need to quote it (here with double quotes) so that the shell passes it to grep as a single argument.
grep -w "is not" haiku.txt
## covering your face is not
## Today it is not working
Other very useful grep flags are -n, -i and -v:
grep -n "it" haiku.txt
## 3:wear a mask dammit
## 5:With searching comes loss
## 9:Yesterday it worked
## 10:Today it is not working
grep -n -w -i "the" haiku.txt
## 6:and the presence of absence:
As you might have guessed:
-n prints the line number of each matching line.
-i ignores capitalization (also called “case”; the “i” comes from case-insensitive).
-v inverts the match, so only lines that do not contain the pattern are printed.
You can learn more about grep flags using grep --help
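The -v flag is not shown in the examples above, so here is a short sketch. It recreates haiku.txt in a temporary directory from the text printed earlier, then compares the lines that contain "it" with the lines that do not. It also uses -c, which prints a count of matching lines instead of the lines themselves.

```shell
# Work in a throwaway directory so your course files are not touched.
workdir=$(mktemp -d)
cd "$workdir"

# Recreate haiku.txt from the text shown earlier.
cat > haiku.txt <<'EOF'
Some things can be hard
covering your face is not
wear a mask dammit

With searching comes loss
and the presence of absence:
"My Thesis" not found.

Yesterday it worked
Today it is not working
Software is like that.
EOF

# Lines that do NOT contain "it" (blank lines are included).
grep -v "it" haiku.txt

# -c prints the number of matching lines instead of the lines.
with_it=$(grep -c "it" haiku.txt)
without_it=$(grep -v -c "it" haiku.txt)
echo "$with_it lines contain 'it', $without_it lines do not"
```

The two counts add up to the 11 lines of the file: 4 lines contain "it" and 7 do not, because the two blank lines count as non-matching.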
The real power of grep is using a special class of wildcards known as “regular expressions” (the “re” in grep).
Let’s use regular expressions to find lines where the second letter is “o”:
grep '^.o' haiku.txt
## Some things can be hard
## covering your face is not
## Today it is not working
## Software is like that.
Explanation of the pattern:
The caret (^) tells grep to only look from the start of a line rather than anywhere in the line.
The period (.) tells grep to match any single character (letter, number, or symbol); it is basically a single-character wildcard.
Some other useful expressions in grep:
$ anchors the match at the end of a line.
* is a repetition operator in grep: it matches zero or more of the preceding character. It is commonly coupled with . (as .*) to act as a wildcard of unspecified length.
Learning the full power of regular expressions takes time, but for now just know that they exist. If you want to make use of them, check out these cheat sheets and other online resources.
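To try the end-of-line anchor and the .* combination, here is a short sketch. It recreates haiku.txt in a temporary directory from the text printed earlier; the three patterns are illustrations chosen for this file.

```shell
# Work in a throwaway directory so your course files are not touched.
workdir=$(mktemp -d)
cd "$workdir"

# Recreate haiku.txt from the text shown earlier.
cat > haiku.txt <<'EOF'
Some things can be hard
covering your face is not
wear a mask dammit

With searching comes loss
and the presence of absence:
"My Thesis" not found.

Yesterday it worked
Today it is not working
Software is like that.
EOF

# $ anchors at the end of the line: lines whose last character is "t".
ends_t=$(grep -c 't$' haiku.txt)
echo "$ends_t"

# ^W.*s$ : starts with "W", then any number of characters, ends with "s".
w_to_s=$(grep '^W.*s$' haiku.txt)
echo "$w_to_s"

# ^$ : start of line immediately followed by end of line (an empty line).
blank=$(grep -c '^$' haiku.txt)
echo "$blank"
```

Single quotes are used around each pattern so that the shell does not try to interpret $ or * before grep sees them.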
Look in the song_lyrics folder inside the data-shell folder and you should see a single file: TS_example.txt
The TS_example.txt file contains lyrics to a song by a well-known contemporary female artist. Using the command line utilities you have learned, try the following:
1. Print the number of lines in the file.
2. Print the lines and line numbers that have the word ‘shake’ in them to a new file called shake-lines.txt.
3. Print the number of lines that have the word ‘shake’ in them.
4. Devise a way to print the number of times ‘shake’ appears in the song. Be sure to include all instances or forms of the word.
*hint: use the manuals for different functions to see what your options could be.
Part 1:
cat prints all the contents of animals.txt and passes it on to head. Standard output from cat (or standard input to head):
deer
rabbit
raccoon
rabbit
deer
fox
rabbit
bear
head reads the first 5 lines of that output and passes them on to tail. Standard output from head (or standard input to tail):
deer
rabbit
raccoon
rabbit
deer
tail reads the last 3 lines of that output and passes them on to sort. Standard output from tail (or standard input to sort):
raccoon
rabbit
deer
sort rearranges the lines in alphabetical order (you can read the man pages of sort to discern the arguments, including -r, which sorts in reverse alphabetical order), and the > redirect saves them into final.txt. Standard output from sort (or contents of final.txt):
deer
rabbit
raccoon
Part 2:
cat animals.txt | sort | tail -n 4 | head -n 3
haiku.txt:
grep -i '^s' haiku.txt
grep -i 'd$' haiku.txt
grep -i '^s.*d$' haiku.txt
\< (start of word) and \> (end of word).
grep -i '\<n' haiku.txt
or grep -w -i "n.*" haiku.txt
The TS_example.txt file contains lyrics to a song by a well-known contemporary female artist. Using the command line utilities you just learned, try the following:
1. Print the number of lines in the file: wc -l TS_example.txt
2. Print the lines and line numbers that have the word ‘shake’ in them to a new file called shake-lines.txt: grep -n -i shake TS_example.txt > shake-lines.txt
- This answer prints all instances of shake regardless of upper- and lowercase letters because of the -i option.
3. Print the number of lines that have the word ‘shake’ in them: grep -c -i shake TS_example.txt
- The -c option counts the matching lines; with -i the count is not sensitive to upper- and lowercase.
4. Devise a way to print the number of times ‘shake’ appears in the song: grep -o -i shake TS_example.txt | wc -w
- Here -o prints only the matched shake part of each line, each on its own line of output, so we can pipe to wc and count the number of words.
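The difference between counting lines (answer 3) and counting occurrences (answer 4) is easiest to see on a tiny file. The sketch below uses a made-up three-line lyrics.txt, not the real TS_example.txt, and counts with wc -l, which gives the same number as wc -w here because -o puts each match on its own line.

```shell
# Work in a throwaway directory so your course files are not touched.
workdir=$(mktemp -d)
cd "$workdir"

# Made-up text, used only to compare the two ways of counting.
cat > lyrics.txt <<'EOF'
Shake the jar and shake it again
nothing to see here
SHAKE shake shake
EOF

# Number of LINES that contain "shake" in any capitalization.
lines=$(grep -c -i shake lyrics.txt)
echo "$lines"

# Number of OCCURRENCES: -o prints one match per output line.
occurrences=$(grep -o -i shake lyrics.txt | wc -l | tr -d ' ')
echo "$occurrences"
```

Two lines contain the word, but it occurs five times in total. Note that the pattern shake would not match a form such as "shaking", so a shorter pattern would be needed to catch every form of the word.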