Figure 1: ChIP-seq schematic

Figure 2: Shrinking data

1 Today: Using R to manipulate, parse, and perform statistics on data frames

1.1 What is R?

R is an open source programming language and environment. It is partially based on the S statistical software originally developed by Bell Laboratories. R is useful for:

Handling data: maniupulation, conversion, and storage
Graphing or plotting data
Performing statistics
Modeling data

2 Downloading and installing R-base and R-packages:

R comes with most installations of Linux, Mac OS, and Windows but you can do custom installs or update to newer versions on your own.
You can go to http://www.r-project.org/about.html to get all the info you need.
R versions are kept at the CRAN (Comprehensive R Archive Network) website: http://cran.r-project.org/
- To download you first need to choose a CRAN site near you from which to download (called a mirror).

On a MAC, you can download binaries for easy install. There are also R-GUIs available for MACs that can make using it slightly easier.
With Linux, you can install or update R with command line:

sudo apt-get install r-base
OR
sudo brew install r-base

Figure 3: Cran and BioConductor packages over time

2.1 `R` resources

Books listed in syllabus
Software Carpentry: https://datacarpentry.org/R-ecology-lesson/02-starting-with-data.html
R for Data Science: https://r4ds.had.co.nz/

2.2 R packages

R on it’s own (R-base), has many basic statistical and plotting functions.
R can also be enhanced by installing any of the many ‘modules’ or ‘packages’ that are available at:
- The CRAN website: http://cran.r-project.org/src/contrib/
- Another repository called bioconductor: http://www.bioconductor.org/.
- Discussion of differences between CRAN and BioC: https://blog.revolutionanalytics.com/2015/08/a-short-introduction-to-bioconductor.html]
R Packages are collections of functions or compiled software that are designed to run in the R environment.
Lots of people all over the world contribute to the many thousands of packages available.

2.3 Intro to R sessions

To start an are session type R in the command line:

You are now in an R environment. Notice command prompt: >

R commands have their own syntax, and it is somewhat different than shell commands. In general the format for R commands or functions is as follows:

command(<INPUTS>, <OPTIONS>)

To get information on how to use a command, use:

help(COMMAND) OR ?COMMAND

R anchors itself in whatever directory you were in when you started the session. To see where you are, use:

getwd() # no input or options needed

## [1] "/Users/guertinlab/Documents/class/2023_lec18"

You can also run shell commands in R:

system("pwd")

To change the directory you are working in:

setwd(<DESIRED_PATH>)

To install packages from CRAN:

install.packages("PACKAGE_NAME")

To install packages from Bioconductor: Install bioconductor:

source("http://bioconductor.org/biocLite.R") # download source
biocLite() # install bioclite
biocLite("PACKAGE_NAME")

These packages are stored in the ‘library’ directory on your computer. You can view a list of installed packages with:

library()

To load a package for use in the current R workspace session:

biocLite("PACKAGE_NAME")

To find out what R version you have and what packages are currently loaded:

sessionInfo()

To get help with a function:

help(<FUNCTION>)

To quit R:

quit() # or just q()

3 Data Types:

Character - letters, strings
numeric - numbers
Factors - Categorical variables for plotting or statistical modeling

You can test whether a variable or data is any of these types as follows:

To quit R:

is.numeric(<VALUE>)

Examples:

is.numeric(3)

## [1] TRUE

is.numeric(T)

## [1] FALSE

is.character(3)

## [1] FALSE

Quotes always makes the value a character:

is.numeric("3")

## [1] FALSE

is.character("3")

## [1] TRUE

However, you can force a character to become numeric if needed:

# Print 3 as character:
"3"

## [1] "3"

#Print 3 as a number
as.numeric("3")

## [1] 3

3.1 Data structures:

one dimensional:
- lists
- vectors
two dimensional:
- data frames
- matrixes
multi-dimensional
- arrays
- lists

Figure 4: data structures

3.2 Lists and Vectors

Are similar in that they are both one dimensional, but..
Lists can be a mixture of data types (strings, numbers)
Vectors can be of any type but only one type at a time

Vectors
elements of a vector are combined with the format c(<element1>, <element2>, etc . . .)

v <- c(4,5,6) # multiple elements must be combined with c().
v

Note: <- is an assignment operator used with R. = can also be used, but <- is conventional. A decent explanation on the difference is provided here:
https://renkun.me/2014/01/28/difference-between-assignment-operators-in-r/

To access elements with in a vector, you can use their position or ‘index’:

v[2] # position with the vector is designated with square brackets [].
v[1]/v[2] # simple math with vector indexes

Remember: you cannot easily combine data types with vectors:

ve <- c(c(4,5,6), c('flapjack', 'waffle', 'frenchtoast'))
ve

## [1] "4"           "5"           "6"           "flapjack"    "waffle"      "frenchtoast"

The numbers are in quotes, because they are no longer considered numbers. For instance: try doing math with the elements in vector:
This is because vectors can only handle one data type at a time.

ve[1]/ve[2]

Lists

Lists can hold mutliple data types and can be nested:

vl <- list(c(4,5,6), c('flapjack', 'waffle', 'frenchtoast'))
vl

## [[1]]
## [1] 4 5 6
## 
## [[2]]
## [1] "flapjack"    "waffle"      "frenchtoast"

*Compare this output to that from the vector (ve) above.

Accessing items in list is similar:

vl[[1]][2] # access 2nd item of first item in the list

## [1] 5

vl[[1]][1]/ vl[[1]][2] # divide first item of first item in the list by the second

## [1] 0.8

3.3 Matrices and DataFrames

Vectors can be combined to create matrixes using cbind or rbind:

vec1 <- c(10, 9, 8)
vec2 <- c(7, 6, 5)
cbind(vec1,vec2)

##      vec1 vec2
## [1,]   10    7
## [2,]    9    6
## [3,]    8    5

rbind(vec1,vec2)

##      [,1] [,2] [,3]
## vec1   10    9    8
## vec2    7    6    5

This makes a matrix.

mat=cbind(vec1,vec2)
is.matrix(mat)

## [1] TRUE

# like vectors, matrices can only handle one type of data at a time

Regardless of how you create the matrix, they can be transposed:

t(mat)

##      [,1] [,2] [,3]
## vec1   10    9    8
## vec2    7    6    5

Like vectors, matrices can only handle one data type at a time. To combine data types in a table, you can create dataframes.
You can also specify that you want a matrix directly as follows: For a matrix:

matrix(c(vec1, vec2), ncol=2)

##      [,1] [,2]
## [1,]   10    7
## [2,]    9    6
## [3,]    8    5

Dataframes
Vectors of different types can be combined into dataframes:

list1= c(5,6,5)
list2= c('flapjack', 'waffle', 'frenchtoast') 
data.frame(list2, list1)

##         list2 list1
## 1    flapjack     5
## 2      waffle     6
## 3 frenchtoast     5

We can also save this data.frame as a variable in our session.

df<-data.frame(list2, list1)

We can also set or change column names.

df<-data.frame(list2, list1)
colnames(df)<-c("name", "price")
df

##          name price
## 1    flapjack     5
## 2      waffle     6
## 3 frenchtoast     5

We can also set column names from the start.

list1= c(5,6,5)
list2= c('flapjack', 'waffle', 'frenchtoast') 
data.frame(names=list2, price=list1)
# notice that each column can be assigned a name

This table is a data frame. You can test this with:

# Set df variable equal to our table:
list1= c(5,6,5)
list2= c('flapjack', 'waffle', 'frenchtoast') 
df<-data.frame(names=list2, price=list1)
is.data.frame(df)

## [1] TRUE

# like lists, data frames can handle multiple types of data 
# (string and numbers in this case).

Dataframes can be merged if they have a common column:

# consider two dataframes
dfA<- data.frame(names = c("A", "B"), data1 = c(5,6))
dfB<- data.frame(names = c("A", "B"), data2 = c(10,11))
# merge them together
dfC <- merge(dfA, dfB, by="names")
dfC

##   names data1 data2
## 1     A     5    10
## 2     B     6    11

You can specificy the column by which to match in a number of ways (see below).

4 Accessing elements of matrixes and data frames:

Accessing elements of tables is similar to lists and vectors, but with two dimensions:

data.frame(names=list2, price=list1)

##         names price
## 1    flapjack     5
## 2      waffle     6
## 3 frenchtoast     5

df[2,1]  # specifies item in second row, first column.

## [1] "waffle"

To access an entire column or row at once:

df[,1]  # specifies entire first column.

## [1] "flapjack"    "waffle"      "frenchtoast"

df[3,]  # specifies 3rd row.

##         names price
## 3 frenchtoast     5

To access a range of columns or row:

df[1:2,1]  # specifies first two rows of column 1.

## [1] "flapjack" "waffle"

4.1 Loading your own data frames:

To load your own data frames you can use read.table()

read.table("<FILENAME>", header=T, sep='\t')
# Header states that there are column names at the beginning of each column to be used
# sep ='\t' specifies that the file is tab-delimited

Read in the file /home/FCAM/meds5420/data/ATF1/closest_peak_TSS/peaks_promoters.bed from the server and save it as a variable:

x.frame = read.table('peaks_promoters.bed', header=FALSE, sep = '\t')
is.data.frame(x.frame)

## [1] TRUE

is.data.table(x.frame)

## [1] FALSE

You can also load .csv files:

read.csv("<FILENAME>", header=F, sep=',')
# sep = ',' specifies that the files is comma-separated

4.2 Working with data:

For now lets look a datset that’s already in your environment, calledPuromycin:

head(Puromycin)

##   conc rate   state
## 1 0.02   76 treated
## 2 0.02   47 treated
## 3 0.06   97 treated
## 4 0.06  107 treated
## 5 0.11  123 treated
## 6 0.11  139 treated

This and other tables come preloaded with various packages and are often meant to be used as examples in tutorials.
To find all the tables tutorial tables type

data() # shows all tables available

Here are some useful functions when working with tables:

Show all tables currently loaded in environment:

ls() # list current tables in environment

##  [1] "closest"            "closest_cdf"        "closest_chr17"      "closest_hist"       "correlation"        "dat"                "df"                 "df_cast"           
##  [9] "df_melt"            "dfA"                "dfB"                "dfC"                "dog_x"              "dog_y"              "DT"                 "groove_width_major"
## [17] "groove_width_minor" "groove_x"           "groove_y"           "list"               "list1"              "list2"              "major_groove"       "major_groove_x"    
## [25] "major_groove_y1"    "major_groove_y2"    "mat"                "minor_groove"       "minor_groove_x"     "minor_groove_y1"    "minor_groove_y2"    "myList"            
## [33] "n"                  "query"              "sigma"              "sno.rna"            "theta"              "ve"                 "vec1"               "vec2"              
## [41] "vl"                 "x"                  "x.frame"            "xlim"               "xy"                 "y"                  "y.table"            "y1"                
## [49] "y2"                 "ylim"

Remove a item from environment

rm() # list current tables in environment

Remove all tables from envirnoment

rm(ls()) # list current tables in environment

Get the end of a table:

tail(Puromycin)

##    conc rate     state
## 18 0.11  115 untreated
## 19 0.22  131 untreated
## 20 0.22  124 untreated
## 21 0.56  144 untreated
## 22 0.56  158 untreated
## 23 1.10  160 untreated

Get the number of rows:

nrow(Puromycin)

## [1] 23

Get the number of rows:

ncol(Puromycin)

## [1] 3

Get both the number of rows and columns:

dim(Puromycin)

## [1] 23  3

dim(Puromycin)[2] # just number of columns

## [1] 3

Get column names of table:

names(Puromycin)

## [1] "conc"  "rate"  "state"

Get summary information about tables:

summary(Puromycin)

##       conc             rate             state   
##  Min.   :0.0200   Min.   : 47.0   treated  :12  
##  1st Qu.:0.0600   1st Qu.: 91.5   untreated:11  
##  Median :0.1100   Median :124.0                 
##  Mean   :0.3122   Mean   :126.8                 
##  3rd Qu.:0.5600   3rd Qu.:158.5                 
##  Max.   :1.1000   Max.   :207.0

Get structural information about tables:

str(Puromycin)

## 'data.frame':    23 obs. of  3 variables:
##  $ conc : num  0.02 0.02 0.06 0.06 0.11 0.11 0.22 0.22 0.56 0.56 ...
##  $ rate : num  76 47 97 107 123 139 159 152 191 201 ...
##  $ state: Factor w/ 2 levels "treated","untreated": 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "reference")= chr "A1.3, p. 269"

Specifcy columns by name:

Puromycin$rate[1:6]

## [1]  76  47  97 107 123 139

# '$' specifies column name, also indexed for first 6 items in this example
# OR
head(Puromycin$rate) # just another way of doing the above

## [1]  76  47  97 107 123 139

Simple math operations:

mean(Puromycin$rate)

## [1] 126.8261

Other mathematical operators are:

sum
median
mode
sd (standard deviation)
max
min

Slightly more complex, but powerful functions:

aggregate will perform an operation on a set of data, and separate the results based on a set of categories or ‘factors’ that you provide. It returns a data frame of the results.

More on factors can be found here: http://www.stat.berkeley.edu/~s133/factors.html

aggregate(Puromycin$rate, by=list(Puromycin$state), mean)

##     Group.1        x
## 1   treated 141.5833
## 2 untreated 110.7273

tapply is similar to aggregate, but returns a matrix.

tapply(Puromycin$rate, list(Puromycin$state), mean)

##   treated untreated 
##  141.5833  110.7273

4.3 `data.tables`

There is also a package that creates a more advanced data.frame called a data.table. data.tables have several advantages that we will see over the course of using R. Among them is being able to operate on columns using indexing and factors as shown above with simpler expressions and increased speed. It is generally recommended that new users start with data.tables as soon as possible to develop good habits.

See discussions: https://stackoverflow.com/questions/18001120/what-is-the-practical-difference-between-data-frame-and-data-table-in-r

https://www.analyticsvidhya.com/blog/2016/05/data-table-data-frame-work-large-data-sets/

https://www.r-bloggers.com/data-table-or-data-frame/

There’s also tidyverse which was created by Hadley Wickham:

Example:

library(data.table)

DT<- data.table(Puromycin)
DT[,mean(rate), by= state]

##        state       V1
## 1:   treated 141.5833
## 2: untreated 110.7273

#Format = DATA.TABLE[row, column, by= factor_column]
#Notice that functions can be combined with column designations to operate on large tables.

4.4 Converting table formats between ‘wide’ and ‘long’

# consider this table
dfC <- merge(dfA<- data.frame(names = c("A", "B"), data1 = c(5,6)), dfB<- data.frame(names = c("A", "B"), data2 = c(10,11)), by="names")
dfC

##   names data1 data2
## 1     A     5    10
## 2     B     6    11

You can melt the data.frame into a long format.

# merge tables together
dfC <- merge(dfA<- data.frame(names = c("A", "B"), data1 = c(5,6)), dfB<- data.frame(names = c("A", "B"), data2 = c(10,11)), by="names")
dfC

##   names data1 data2
## 1     A     5    10
## 2     B     6    11

# melt to 'long' format
library(reshape)
df_melt <- melt(dfC, by=names, variable.name="data")

## Using names as id variables

df_melt

##   names variable value
## 1     A    data1     5
## 2     B    data1     6
## 3     A    data2    10
## 4     B    data2    11

Or cast to wide format

# merge tables together
dfC <- merge(dfA<- data.frame(names = c("A", "B"), data1 = c(5,6)), dfB<- data.frame(names = c("A", "B"), data2 = c(10,11)), by="names")
dfC

##   names data1 data2
## 1     A     5    10
## 2     B     6    11

# melt to 'long' format
library(reshape)
df_melt <- melt(dfC, by=names,  variable.name="data")

## Using names as id variables

# cast to 'wide' format
df_cast <- cast(df_melt, variable = data )
df_cast

##   names data1 data2
## 1     A     5    10
## 2     B     6    11

5 Managing data and saving your environment workspace

Saving tables, etc:

save(<R_OBJECT>, file="FILENAME.Rdata")
# saves file as binary file readable by R

loading saved tables:

load(file="FILENAME.Rdata")

IMPORTANT: The saved R object retains its original name regardless of the filename. Thus, it may be difficult to find the object when loaded into your environment if you forget the original name or if you are opening an object from someone else. There are several ways around this:

Surround the load statement in parentheses (load(file="FILENAME.Rdata")) and the names of the loaded R object(s) will be printed to the screen.
Use the get command to assign the object a new name:

list<-c(1,2,3,4)
save(list, file="list.r")
dat<- load("list.r")
myList<- get(dat)

Save the object as an .RDS file and then reassign it upon loading:

saveRDS=(R_object, "FILENAME.RDS")

myObj <- readRDS("FILENAME.RDS"))

IMPORTANT!!: Saving workspace environment: This saves all the tables and variables you have created or loaded into an R sessions. That way you can continue where you left off instead or recreating all your work.

save.image(file="NAME.Rdata")
# saves currently loaded variables, functions, and packages.

You can then load the saved workspace as follows:

load(file="NAME.Rdata")
#loads previous workspace and all the associated tables

6 Writing tables to text file:

write.table(<R_Object>, file="FILENAME.txt", sep='\t', col.names=T, row.names=F, quote=F)
# you would reload this table with "read.table"

7 Reading text files fast as `data.tables` with `fread`:

y.table = fread('peaks_promoters.bed')
is.data.frame(y.table)

## [1] TRUE

is.data.table(y.table)

## [1] TRUE

7.1 writing `data.tables`:

You can write data.table fast with fwrite

fwrite(<DATA.TABLE>, file="FILENAME.txt", sep='\t', col.names=T, row.names=F, quote=F)
# you would reload this table with "fread(file=<DATA.table>"

8 `Rscript` to run `R` code

Just like one can run a shell script using the command bash scriptname.sh, we can run a file with R code by calling it directly from the Terminal.

touch filename.R
nano filename.R
# populate the file with R code and a few print statements

An example file to test Rscript (note the she bang is differnet for R scripts):

#!/usr/bin/env Rscript

print('test')
x <- mean(c(5,6,7,23))
print(x)

You can make the .R file in a text editor and call it directly from the terminal with Rscript:

Rscript filename.R

The output will print to the screen and this can be coupled with sbatch to automate/submit a job and tee to generate log files. This is how you will compare using awk, data.frame, and data.table in your optional homework.

9 For Next Time:

Download and install R or RStudio from: http://www.rstudio.com/products/rstudio/download/

10 In Class Exercises:

There are two tables (exp_untreated.txt, exp_treated.txt) in the following location:

/home/FCAM/meds5420/in_class/R_intro

To Do:

1. Either copy these files to your local machine (using sfto) or work with them on the server in an R session.
2. Read the tables into R and set them equal to a variable that makes sense (e.g. “untreated”, “treated”).
3. Create column names for the data tables (i.e. “genes” for column 1, and “treated” or “untreated” for column 2)
4. Merge the tables into one table with the merge() function
5. Calculate a mean, median, and standard deviation for the treated and untreated samples.
6. Create a new column in the dataframe that has the fold change (treated/untreated) for the data.
7. Devise a way to merge the untreated and treated such that all data names and numbers are in the same column and a third factor column designates treated/untreated.
8. Use aggregate() to calculate the mean, median, and standard deviation for each treatment from this new table.
9. Convert the table to a data.table and then calculate the same parameters as above.
10. Try saving and reading back in your tables.

11 Answers to In Class Exercises:

Read the tables into R and set them equal to a variable that makes sense (e.g. “untreated”, “treated”)

#first look at the data structure to determine how to read it in
system("head exp_treated.txt")

#then read them in
treated <- read.table("exp_treated.txt", header=F, sep=" ")
untreated <- read.table("exp_untreated.txt", header=F, sep=" ")

Create column names for the data tables (i.e. “genes” for column 1, and “data” for column 2)

colnames(treated) <- c("genes", "data")
colnames(untreated) <- c("genes", "data")

Merge the tables into one table with the merge() function

merged <- merge(treated, untreated, by="genes")

Calculate a mean, median, and standard deviation for the treated and untreated samples.

mean(treated$data)
median(treated$data)
sd(treated$data)

#etc. . . .

Create a new column in the dataframe that has the fold change (treated/untreated) for the data.

merged[,4] <- merged[,2]/merged[,3]

#or

foldChange <- data.frame(merged, fc<-merged[,2]/merged[,3])

Devise a way to merge the untreated and treated such that all data names and numbers are in the same column and a third factor column designates treated/untreated.

# first add a column with the trement state to both
treated <- data.frame(treated, state="treated"
untreated <- data.frame(untreated, state="untreated"  

#Then use rbind to merge the tables
merged <- rbind(treated, untreated)

Use aggregate() to calculate the mean, median, and standard deviation from this new table.

aggregate(merged$data, by=list(merged$state), mean)
aggregate(merged$data, by=list(merged$state), median)
aggregate(merged$data, by=list(merged$state), sd)

Convert the table to a data.table and then calculate the same parameters as above.

library(data.table)

#Convert to data.table
DT <- data.table(merged)

#Calculations
DT[, mean(data), by = state]
DT[, median(data), by = state]
DT[, sd(data), by = state]

MEDS5420 - Introduction to R

March 27, 2023

Contents

1 Today: Using R to manipulate, parse, and perform statistics on data frames

1.1 What is R?

2 Downloading and installing R-base and R-packages:

2.1 `R` resources

2.2 R packages

2.3 Intro to R sessions

3 Data Types:

3.1 Data structures:

3.2 Lists and Vectors

3.3 Matrices and DataFrames

4 Accessing elements of matrixes and data frames:

4.1 Loading your own data frames:

4.2 Working with data:

4.3 `data.tables`

4.4 Converting table formats between ‘wide’ and ‘long’

5 Managing data and saving your environment workspace

6 Writing tables to text file:

7 Reading text files fast as `data.tables` with `fread`:

7.1 writing `data.tables`:

8 `Rscript` to run `R` code

9 For Next Time:

10 In Class Exercises:

11 Answers to In Class Exercises:

MEDS5420 - Introduction to R

March 27, 2023

Contents

1 Today: Using R to manipulate, parse, and perform statistics on data frames

1.1 What is R?

2 Downloading and installing R-base and R-packages:

2.1 R resources

2.2 R packages

2.3 Intro to R sessions

3 Data Types:

3.1 Data structures:

3.2 Lists and Vectors

3.3 Matrices and DataFrames

4 Accessing elements of matrixes and data frames:

4.1 Loading your own data frames:

4.2 Working with data:

4.3 data.tables

4.4 Converting table formats between ‘wide’ and ‘long’

5 Managing data and saving your environment workspace

6 Writing tables to text file:

7 Reading text files fast as data.tables with fread:

7.1 writing data.tables:

8 Rscript to run R code

9 For Next Time:

10 In Class Exercises:

11 Answers to In Class Exercises:

2.1 `R` resources

4.3 `data.tables`

7 Reading text files fast as `data.tables` with `fread`:

7.1 writing `data.tables`:

8 `Rscript` to run `R` code