Contents

ChIP-seq schematic

Figure 1: ChIP-seq schematic

Shrinking data

Figure 2: Shrinking data

1 Today: Using R to manipulate, parse, and perform statistics on data frames

1.1 What is R?

R is an open source programming language and environment. It is partially based on the S statistical software originally developed by Bell Laboratories. R is useful for:

  • Handling data: maniupulation, conversion, and storage
  • Graphing or plotting data
  • Performing statistics
  • Modeling data

2 Downloading and installing R-base and R-packages:

On a MAC, you can download binaries for easy install. There are also R-GUIs available for MACs that can make using it slightly easier.
With Linux, you can install or update R with command line:

sudo apt-get install r-base
OR
sudo brew install r-base




Cran and BioConductor packages over time

Figure 3: Cran and BioConductor packages over time

2.1 R resources

2.2 R packages

2.3 Intro to R sessions

To start an are session type R in the command line:

R

You are now in an R environment. Notice command prompt: >

R commands have their own syntax, and it is somewhat different than shell commands. In general the format for R commands or functions is as follows:

command(<INPUTS>, <OPTIONS>)

To get information on how to use a command, use:

help(COMMAND) OR ?COMMAND

R anchors itself in whatever directory you were in when you started the session. To see where you are, use:

getwd() # no input or options needed
## [1] "/Users/guertinlab/Documents/class/2023_lec18"

You can also run shell commands in R:

system("pwd")

To change the directory you are working in:

setwd(<DESIRED_PATH>) 

To install packages from CRAN:

install.packages("PACKAGE_NAME")

To install packages from Bioconductor: Install bioconductor:

source("http://bioconductor.org/biocLite.R") # download source
biocLite() # install bioclite
biocLite("PACKAGE_NAME")

These packages are stored in the ‘library’ directory on your computer. You can view a list of installed packages with:

library()

To load a package for use in the current R workspace session:

biocLite("PACKAGE_NAME")

To find out what R version you have and what packages are currently loaded:

sessionInfo()  

To get help with a function:

help(<FUNCTION>) 

To quit R:

quit() # or just q()

3 Data Types:

You can test whether a variable or data is any of these types as follows:

To quit R:

is.numeric(<VALUE>)

Examples:

is.numeric(3)
## [1] TRUE
is.numeric(T)
## [1] FALSE
is.character(3)
## [1] FALSE

Quotes always makes the value a character:

is.numeric("3")
## [1] FALSE
is.character("3")
## [1] TRUE

However, you can force a character to become numeric if needed:

# Print 3 as character:
"3"
## [1] "3"
#Print 3 as a number
as.numeric("3")
## [1] 3

3.1 Data structures:

  • one dimensional:
    • lists
    • vectors
  • two dimensional:
    • data frames
    • matrixes
  • multi-dimensional
    • arrays
    • lists
data structures

Figure 4: data structures


3.2 Lists and Vectors

  • Are similar in that they are both one dimensional, but..
  • Lists can be a mixture of data types (strings, numbers)  
  • Vectors can be of any type but only one type at a time

Vectors
elements of a vector are combined with the format c(<element1>, <element2>, etc . . .)

v <- c(4,5,6) # multiple elements must be combined with c().
v

Note: <- is an assignment operator used with R. = can also be used, but <- is conventional. A decent explanation on the difference is provided here:
https://renkun.me/2014/01/28/difference-between-assignment-operators-in-r/

To access elements with in a vector, you can use their position or ‘index’:

v[2] # position with the vector is designated with square brackets [].
v[1]/v[2] # simple math with vector indexes

Remember: you cannot easily combine data types with vectors:

ve <- c(c(4,5,6), c('flapjack', 'waffle', 'frenchtoast'))
ve
## [1] "4"           "5"           "6"           "flapjack"    "waffle"      "frenchtoast"
  • The numbers are in quotes, because they are no longer considered numbers. For instance: try doing math with the elements in vector:
  • This is because vectors can only handle one data type at a time.
ve[1]/ve[2]

Lists

Lists can hold mutliple data types and can be nested:

vl <- list(c(4,5,6), c('flapjack', 'waffle', 'frenchtoast'))
vl
## [[1]]
## [1] 4 5 6
## 
## [[2]]
## [1] "flapjack"    "waffle"      "frenchtoast"

*Compare this output to that from the vector (ve) above.

Accessing items in list is similar:

vl[[1]][2] # access 2nd item of first item in the list
## [1] 5
vl[[1]][1]/ vl[[1]][2] # divide first item of first item in the list by the second
## [1] 0.8

3.3 Matrices and DataFrames

Vectors can be combined to create matrixes using cbind or rbind:

vec1 <- c(10, 9, 8)
vec2 <- c(7, 6, 5)
cbind(vec1,vec2)
##      vec1 vec2
## [1,]   10    7
## [2,]    9    6
## [3,]    8    5

OR

rbind(vec1,vec2)
##      [,1] [,2] [,3]
## vec1   10    9    8
## vec2    7    6    5

This makes a matrix.

mat=cbind(vec1,vec2)
is.matrix(mat)
## [1] TRUE
# like vectors, matrices can only handle one type of data at a time

Regardless of how you create the matrix, they can be transposed:

t(mat)
##      [,1] [,2] [,3]
## vec1   10    9    8
## vec2    7    6    5

Like vectors, matrices can only handle one data type at a time. To combine data types in a table, you can create dataframes.
You can also specify that you want a matrix directly as follows: For a matrix:

matrix(c(vec1, vec2), ncol=2)
##      [,1] [,2]
## [1,]   10    7
## [2,]    9    6
## [3,]    8    5

Dataframes
Vectors of different types can be combined into dataframes:

list1= c(5,6,5)
list2= c('flapjack', 'waffle', 'frenchtoast') 
data.frame(list2, list1)
##         list2 list1
## 1    flapjack     5
## 2      waffle     6
## 3 frenchtoast     5

We can also save this data.frame as a variable in our session.

df<-data.frame(list2, list1)

We can also set or change column names.

df<-data.frame(list2, list1)
colnames(df)<-c("name", "price")
df
##          name price
## 1    flapjack     5
## 2      waffle     6
## 3 frenchtoast     5

We can also set column names from the start.

list1= c(5,6,5)
list2= c('flapjack', 'waffle', 'frenchtoast') 
data.frame(names=list2, price=list1)
# notice that each column can be assigned a name

This table is a data frame. You can test this with:

# Set df variable equal to our table:
list1= c(5,6,5)
list2= c('flapjack', 'waffle', 'frenchtoast') 
df<-data.frame(names=list2, price=list1)
is.data.frame(df)
## [1] TRUE
# like lists, data frames can handle multiple types of data 
# (string and numbers in this case).

Dataframes can be merged if they have a common column:

# consider two dataframes
dfA<- data.frame(names = c("A", "B"), data1 = c(5,6))
dfB<- data.frame(names = c("A", "B"), data2 = c(10,11))
# merge them together
dfC <- merge(dfA, dfB, by="names")
dfC
##   names data1 data2
## 1     A     5    10
## 2     B     6    11

You can specificy the column by which to match in a number of ways (see below).

4 Accessing elements of matrixes and data frames:

Accessing elements of tables is similar to lists and vectors, but with two dimensions:

data.frame(names=list2, price=list1)
##         names price
## 1    flapjack     5
## 2      waffle     6
## 3 frenchtoast     5
df[2,1]  # specifies item in second row, first column.
## [1] "waffle"

To access an entire column or row at once:

df[,1]  # specifies entire first column.
## [1] "flapjack"    "waffle"      "frenchtoast"
df[3,]  # specifies 3rd row.
##         names price
## 3 frenchtoast     5

To access a range of columns or row:

df[1:2,1]  # specifies first two rows of column 1.
## [1] "flapjack" "waffle"

4.1 Loading your own data frames:

To load your own data frames you can use read.table()

read.table("<FILENAME>", header=T, sep='\t')
# Header states that there are column names at the beginning of each column to be used
# sep ='\t' specifies that the file is tab-delimited

Read in the file /home/FCAM/meds5420/data/ATF1/closest_peak_TSS/peaks_promoters.bed from the server and save it as a variable:

x.frame = read.table('peaks_promoters.bed', header=FALSE, sep = '\t')
is.data.frame(x.frame)
## [1] TRUE
is.data.table(x.frame)
## [1] FALSE

You can also load .csv files:

read.csv("<FILENAME>", header=F, sep=',')
# sep = ',' specifies that the files is comma-separated

4.2 Working with data:

For now lets look a datset that’s already in your environment, calledPuromycin:

head(Puromycin)
##   conc rate   state
## 1 0.02   76 treated
## 2 0.02   47 treated
## 3 0.06   97 treated
## 4 0.06  107 treated
## 5 0.11  123 treated
## 6 0.11  139 treated

This and other tables come preloaded with various packages and are often meant to be used as examples in tutorials.
To find all the tables tutorial tables type

data() # shows all tables available

Here are some useful functions when working with tables:

Show all tables currently loaded in environment:

ls() # list current tables in environment
##  [1] "closest"            "closest_cdf"        "closest_chr17"      "closest_hist"       "correlation"        "dat"                "df"                 "df_cast"           
##  [9] "df_melt"            "dfA"                "dfB"                "dfC"                "dog_x"              "dog_y"              "DT"                 "groove_width_major"
## [17] "groove_width_minor" "groove_x"           "groove_y"           "list"               "list1"              "list2"              "major_groove"       "major_groove_x"    
## [25] "major_groove_y1"    "major_groove_y2"    "mat"                "minor_groove"       "minor_groove_x"     "minor_groove_y1"    "minor_groove_y2"    "myList"            
## [33] "n"                  "query"              "sigma"              "sno.rna"            "theta"              "ve"                 "vec1"               "vec2"              
## [41] "vl"                 "x"                  "x.frame"            "xlim"               "xy"                 "y"                  "y.table"            "y1"                
## [49] "y2"                 "ylim"

Remove a item from environment

rm() # list current tables in environment

Remove all tables from envirnoment

rm(ls()) # list current tables in environment

Get the end of a table:

tail(Puromycin)
##    conc rate     state
## 18 0.11  115 untreated
## 19 0.22  131 untreated
## 20 0.22  124 untreated
## 21 0.56  144 untreated
## 22 0.56  158 untreated
## 23 1.10  160 untreated

Get the number of rows:

nrow(Puromycin)
## [1] 23

Get the number of rows:

ncol(Puromycin)
## [1] 3

Get both the number of rows and columns:

dim(Puromycin)
## [1] 23  3
dim(Puromycin)[2] # just number of columns
## [1] 3

Get column names of table:

names(Puromycin)
## [1] "conc"  "rate"  "state"

Get summary information about tables:

summary(Puromycin)
##       conc             rate             state   
##  Min.   :0.0200   Min.   : 47.0   treated  :12  
##  1st Qu.:0.0600   1st Qu.: 91.5   untreated:11  
##  Median :0.1100   Median :124.0                 
##  Mean   :0.3122   Mean   :126.8                 
##  3rd Qu.:0.5600   3rd Qu.:158.5                 
##  Max.   :1.1000   Max.   :207.0

Get structural information about tables:

str(Puromycin)
## 'data.frame':    23 obs. of  3 variables:
##  $ conc : num  0.02 0.02 0.06 0.06 0.11 0.11 0.22 0.22 0.56 0.56 ...
##  $ rate : num  76 47 97 107 123 139 159 152 191 201 ...
##  $ state: Factor w/ 2 levels "treated","untreated": 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "reference")= chr "A1.3, p. 269"

Specifcy columns by name:

Puromycin$rate[1:6]
## [1]  76  47  97 107 123 139
# '$' specifies column name, also indexed for first 6 items in this example
# OR
head(Puromycin$rate) # just another way of doing the above
## [1]  76  47  97 107 123 139

Simple math operations:

mean(Puromycin$rate)
## [1] 126.8261

Other mathematical operators are:

  • sum
  • median
  • mode
  • sd (standard deviation)
  • max
  • min

Slightly more complex, but powerful functions:

aggregate will perform an operation on a set of data, and separate the results based on a set of categories or ‘factors’ that you provide. It returns a data frame of the results.

More on factors can be found here: http://www.stat.berkeley.edu/~s133/factors.html

aggregate(Puromycin$rate, by=list(Puromycin$state), mean)
##     Group.1        x
## 1   treated 141.5833
## 2 untreated 110.7273

tapply is similar to aggregate, but returns a matrix.

tapply(Puromycin$rate, list(Puromycin$state), mean)
##   treated untreated 
##  141.5833  110.7273

4.3 data.tables

There is also a package that creates a more advanced data.frame called a data.table. data.tables have several advantages that we will see over the course of using R. Among them is being able to operate on columns using indexing and factors as shown above with simpler expressions and increased speed. It is generally recommended that new users start with data.tables as soon as possible to develop good habits.

See discussions: https://stackoverflow.com/questions/18001120/what-is-the-practical-difference-between-data-frame-and-data-table-in-r

https://www.analyticsvidhya.com/blog/2016/05/data-table-data-frame-work-large-data-sets/

https://www.r-bloggers.com/data-table-or-data-frame/

There’s also tidyverse which was created by Hadley Wickham:

Example:

library(data.table)

DT<- data.table(Puromycin)
DT[,mean(rate), by= state]
##        state       V1
## 1:   treated 141.5833
## 2: untreated 110.7273
#Format = DATA.TABLE[row, column, by= factor_column]
#Notice that functions can be combined with column designations to operate on large tables.

4.4 Converting table formats between ‘wide’ and ‘long’

# consider this table
dfC <- merge(dfA<- data.frame(names = c("A", "B"), data1 = c(5,6)), dfB<- data.frame(names = c("A", "B"), data2 = c(10,11)), by="names")
dfC
##   names data1 data2
## 1     A     5    10
## 2     B     6    11

You can melt the data.frame into a long format.

# merge tables together
dfC <- merge(dfA<- data.frame(names = c("A", "B"), data1 = c(5,6)), dfB<- data.frame(names = c("A", "B"), data2 = c(10,11)), by="names")
dfC
##   names data1 data2
## 1     A     5    10
## 2     B     6    11
# melt to 'long' format
library(reshape)
df_melt <- melt(dfC, by=names, variable.name="data")
## Using names as id variables
df_melt
##   names variable value
## 1     A    data1     5
## 2     B    data1     6
## 3     A    data2    10
## 4     B    data2    11

Or cast to wide format

# merge tables together
dfC <- merge(dfA<- data.frame(names = c("A", "B"), data1 = c(5,6)), dfB<- data.frame(names = c("A", "B"), data2 = c(10,11)), by="names")
dfC
##   names data1 data2
## 1     A     5    10
## 2     B     6    11
# melt to 'long' format
library(reshape)
df_melt <- melt(dfC, by=names,  variable.name="data")
## Using names as id variables
# cast to 'wide' format
df_cast <- cast(df_melt, variable = data )
df_cast
##   names data1 data2
## 1     A     5    10
## 2     B     6    11

5 Managing data and saving your environment workspace

Saving tables, etc:

save(<R_OBJECT>, file="FILENAME.Rdata")
# saves file as binary file readable by R

loading saved tables:

load(file="FILENAME.Rdata")

IMPORTANT: The saved R object retains its original name regardless of the filename. Thus, it may be difficult to find the object when loaded into your environment if you forget the original name or if you are opening an object from someone else. There are several ways around this:

list<-c(1,2,3,4)
save(list, file="list.r")
dat<- load("list.r")
myList<- get(dat)
saveRDS=(R_object, "FILENAME.RDS")
myObj <- readRDS("FILENAME.RDS"))

IMPORTANT!!: Saving workspace environment: This saves all the tables and variables you have created or loaded into an R sessions. That way you can continue where you left off instead or recreating all your work.

save.image(file="NAME.Rdata")
# saves currently loaded variables, functions, and packages.

You can then load the saved workspace as follows:

load(file="NAME.Rdata")
#loads previous workspace and all the associated tables

6 Writing tables to text file:

write.table(<R_Object>, file="FILENAME.txt", sep='\t', col.names=T, row.names=F, quote=F)
# you would reload this table with "read.table"

7 Reading text files fast as data.tables with fread:

y.table = fread('peaks_promoters.bed')
is.data.frame(y.table)
## [1] TRUE
is.data.table(y.table)
## [1] TRUE

7.1 writing data.tables:

You can write data.table fast with fwrite

fwrite(<DATA.TABLE>, file="FILENAME.txt", sep='\t', col.names=T, row.names=F, quote=F)
# you would reload this table with "fread(file=<DATA.table>"

8 Rscript to run R code

Just like one can run a shell script using the command bash scriptname.sh, we can run a file with R code by calling it directly from the Terminal.

touch filename.R
nano filename.R
# populate the file with R code and a few print statements

An example file to test Rscript (note the she bang is differnet for R scripts):

#!/usr/bin/env Rscript

print('test')
x <- mean(c(5,6,7,23))
print(x)

You can make the .R file in a text editor and call it directly from the terminal with Rscript:

Rscript filename.R

The output will print to the screen and this can be coupled with sbatch to automate/submit a job and tee to generate log files. This is how you will compare using awk, data.frame, and data.table in your optional homework.

9 For Next Time:

10 In Class Exercises:

There are two tables (exp_untreated.txt, exp_treated.txt) in the following location:

/home/FCAM/meds5420/in_class/R_intro

To Do:

1. Either copy these files to your local machine (using sfto) or work with them on the server in an R session.
2. Read the tables into R and set them equal to a variable that makes sense (e.g. “untreated”, “treated”).
3. Create column names for the data tables (i.e. “genes” for column 1, and “treated” or “untreated” for column 2)
4. Merge the tables into one table with the merge() function
5. Calculate a mean, median, and standard deviation for the treated and untreated samples.
6. Create a new column in the dataframe that has the fold change (treated/untreated) for the data.
7. Devise a way to merge the untreated and treated such that all data names and numbers are in the same column and a third factor column designates treated/untreated.
8. Use aggregate() to calculate the mean, median, and standard deviation for each treatment from this new table.
9. Convert the table to a data.table and then calculate the same parameters as above.
10. Try saving and reading back in your tables.

11 Answers to In Class Exercises:

Read the tables into R and set them equal to a variable that makes sense (e.g. “untreated”, “treated”)

#first look at the data structure to determine how to read it in
system("head exp_treated.txt")

#then read them in
treated <- read.table("exp_treated.txt", header=F, sep=" ")
untreated <- read.table("exp_untreated.txt", header=F, sep=" ")

Create column names for the data tables (i.e. “genes” for column 1, and “data” for column 2)

colnames(treated) <- c("genes", "data")
colnames(untreated) <- c("genes", "data")

Merge the tables into one table with the merge() function

merged <- merge(treated, untreated, by="genes")

Calculate a mean, median, and standard deviation for the treated and untreated samples.

mean(treated$data)
median(treated$data)
sd(treated$data)

#etc. . . .

Create a new column in the dataframe that has the fold change (treated/untreated) for the data.

merged[,4] <- merged[,2]/merged[,3]

#or

foldChange <- data.frame(merged, fc<-merged[,2]/merged[,3])

Devise a way to merge the untreated and treated such that all data names and numbers are in the same column and a third factor column designates treated/untreated.

# first add a column with the trement state to both
treated <- data.frame(treated, state="treated"
untreated <- data.frame(untreated, state="untreated"  

#Then use rbind to merge the tables
merged <- rbind(treated, untreated)

Use aggregate() to calculate the mean, median, and standard deviation from this new table.

aggregate(merged$data, by=list(merged$state), mean)
aggregate(merged$data, by=list(merged$state), median)
aggregate(merged$data, by=list(merged$state), sd)


Convert the table to a data.table and then calculate the same parameters as above.

library(data.table)

#Convert to data.table
DT <- data.table(merged)

#Calculations
DT[, mean(data), by = state]
DT[, median(data), by = state]
DT[, sd(data), by = state]