data.tables
with fread
:
Rscript
to run R
codeR is an open source programming language and environment. It is partially based on the S statistical software originally developed by Bell Laboratories. R is useful for:
R comes with most installations of Linux, Mac OS, and Windows but you can do custom installs or update to newer versions on your own.
You can go to http://www.r-project.org/about.html to get all the info you need.
R versions are kept at the CRAN (Comprehensive R Archive Network) website: http://cran.r-project.org/
On a MAC, you can download binaries for easy install. There are also R-GUIs available for MACs that can make using it slightly easier.
With Linux, you can install or update R with command line:
sudo apt-get install r-base
OR
sudo brew install r-base
R
resourcesR
Packages are collections of functions or compiled software that are designed to run in the R
environment.To start an are session type R
in the command line:
R
You are now in an R environment. Notice command prompt: >
R commands have their own syntax, and it is somewhat different than shell commands. In general the format for R
commands or functions is as follows:
command(<INPUTS>, <OPTIONS>)
To get information on how to use a command, use:
help(COMMAND)
OR ?COMMAND
R
anchors itself in whatever directory you were in when you started the session. To see where you are, use:
getwd() # no input or options needed
## [1] "/Users/guertinlab/Documents/class/2023_lec18"
You can also run shell commands in R:
system("pwd")
To change the directory you are working in:
setwd(<DESIRED_PATH>)
To install packages from CRAN:
install.packages("PACKAGE_NAME")
To install packages from Bioconductor: Install bioconductor:
source("http://bioconductor.org/biocLite.R") # download source
biocLite() # install bioclite
biocLite("PACKAGE_NAME")
These packages are stored in the ‘library’ directory on your computer. You can view a list of installed packages with:
library()
To load a package for use in the current R workspace session:
biocLite("PACKAGE_NAME")
To find out what R
version you have and what packages are currently loaded:
sessionInfo()
To get help with a function:
help(<FUNCTION>)
To quit R:
quit() # or just q()
Character - letters, strings
numeric - numbers
Factors - Categorical variables for plotting or statistical modeling
You can test whether a variable or data is any of these types as follows:
To quit R:
is.numeric(<VALUE>)
Examples:
is.numeric(3)
## [1] TRUE
is.numeric(T)
## [1] FALSE
is.character(3)
## [1] FALSE
Quotes always makes the value a character:
is.numeric("3")
## [1] FALSE
is.character("3")
## [1] TRUE
However, you can force a character to become numeric if needed:
# Print 3 as character:
"3"
## [1] "3"
#Print 3 as a number
as.numeric("3")
## [1] 3
Vectors
elements of a vector are combined with the format c(<element1>, <element2>, etc . . .)
v <- c(4,5,6) # multiple elements must be combined with c().
v
Note: <-
is an assignment operator used with R
. =
can also be used, but <-
is conventional. A decent explanation on the difference is provided here:
https://renkun.me/2014/01/28/difference-between-assignment-operators-in-r/
To access elements with in a vector, you can use their position or ‘index’:
v[2] # position with the vector is designated with square brackets [].
v[1]/v[2] # simple math with vector indexes
Remember: you cannot easily combine data types with vectors:
ve <- c(c(4,5,6), c('flapjack', 'waffle', 'frenchtoast'))
ve
## [1] "4" "5" "6" "flapjack" "waffle" "frenchtoast"
ve[1]/ve[2]
Lists
Lists can hold mutliple data types and can be nested:
vl <- list(c(4,5,6), c('flapjack', 'waffle', 'frenchtoast'))
vl
## [[1]]
## [1] 4 5 6
##
## [[2]]
## [1] "flapjack" "waffle" "frenchtoast"
*Compare this output to that from the vector (ve
) above.
Accessing items in list is similar:
vl[[1]][2] # access 2nd item of first item in the list
## [1] 5
vl[[1]][1]/ vl[[1]][2] # divide first item of first item in the list by the second
## [1] 0.8
Vectors can be combined to create matrixes using cbind
or rbind
:
vec1 <- c(10, 9, 8)
vec2 <- c(7, 6, 5)
cbind(vec1,vec2)
## vec1 vec2
## [1,] 10 7
## [2,] 9 6
## [3,] 8 5
OR
rbind(vec1,vec2)
## [,1] [,2] [,3]
## vec1 10 9 8
## vec2 7 6 5
This makes a matrix.
mat=cbind(vec1,vec2)
is.matrix(mat)
## [1] TRUE
# like vectors, matrices can only handle one type of data at a time
Regardless of how you create the matrix, they can be transposed:
t(mat)
## [,1] [,2] [,3]
## vec1 10 9 8
## vec2 7 6 5
Like vectors, matrices can only handle one data type at a time. To combine data types in a table, you can create dataframes.
You can also specify that you want a matrix directly as follows:
For a matrix:
matrix(c(vec1, vec2), ncol=2)
## [,1] [,2]
## [1,] 10 7
## [2,] 9 6
## [3,] 8 5
Dataframes
Vectors of different types can be combined into dataframes:
list1= c(5,6,5)
list2= c('flapjack', 'waffle', 'frenchtoast')
data.frame(list2, list1)
## list2 list1
## 1 flapjack 5
## 2 waffle 6
## 3 frenchtoast 5
We can also save this data.frame as a variable in our session.
df<-data.frame(list2, list1)
We can also set or change column names.
df<-data.frame(list2, list1)
colnames(df)<-c("name", "price")
df
## name price
## 1 flapjack 5
## 2 waffle 6
## 3 frenchtoast 5
We can also set column names from the start.
list1= c(5,6,5)
list2= c('flapjack', 'waffle', 'frenchtoast')
data.frame(names=list2, price=list1)
# notice that each column can be assigned a name
This table is a data frame. You can test this with:
# Set df variable equal to our table:
list1= c(5,6,5)
list2= c('flapjack', 'waffle', 'frenchtoast')
df<-data.frame(names=list2, price=list1)
is.data.frame(df)
## [1] TRUE
# like lists, data frames can handle multiple types of data
# (string and numbers in this case).
Dataframes can be merged if they have a common column:
# consider two dataframes
dfA<- data.frame(names = c("A", "B"), data1 = c(5,6))
dfB<- data.frame(names = c("A", "B"), data2 = c(10,11))
# merge them together
dfC <- merge(dfA, dfB, by="names")
dfC
## names data1 data2
## 1 A 5 10
## 2 B 6 11
You can specificy the column by which to match in a number of ways (see below).
Accessing elements of tables is similar to lists and vectors, but with two dimensions:
data.frame(names=list2, price=list1)
## names price
## 1 flapjack 5
## 2 waffle 6
## 3 frenchtoast 5
df[2,1] # specifies item in second row, first column.
## [1] "waffle"
To access an entire column or row at once:
df[,1] # specifies entire first column.
## [1] "flapjack" "waffle" "frenchtoast"
df[3,] # specifies 3rd row.
## names price
## 3 frenchtoast 5
To access a range of columns or row:
df[1:2,1] # specifies first two rows of column 1.
## [1] "flapjack" "waffle"
To load your own data frames you can use read.table()
read.table("<FILENAME>", header=T, sep='\t')
# Header states that there are column names at the beginning of each column to be used
# sep ='\t' specifies that the file is tab-delimited
Read in the file /home/FCAM/meds5420/data/ATF1/closest_peak_TSS/peaks_promoters.bed
from the server and save it as a variable:
x.frame = read.table('peaks_promoters.bed', header=FALSE, sep = '\t')
is.data.frame(x.frame)
## [1] TRUE
is.data.table(x.frame)
## [1] FALSE
You can also load .csv files:
read.csv("<FILENAME>", header=F, sep=',')
# sep = ',' specifies that the files is comma-separated
For now lets look a datset that’s already in your environment, calledPuromycin
:
head(Puromycin)
## conc rate state
## 1 0.02 76 treated
## 2 0.02 47 treated
## 3 0.06 97 treated
## 4 0.06 107 treated
## 5 0.11 123 treated
## 6 0.11 139 treated
This and other tables come preloaded with various packages and are often meant to be used as examples in tutorials.
To find all the tables tutorial tables type
data() # shows all tables available
Here are some useful functions when working with tables:
Show all tables currently loaded in environment:
ls() # list current tables in environment
## [1] "closest" "closest_cdf" "closest_chr17" "closest_hist" "correlation" "dat" "df" "df_cast"
## [9] "df_melt" "dfA" "dfB" "dfC" "dog_x" "dog_y" "DT" "groove_width_major"
## [17] "groove_width_minor" "groove_x" "groove_y" "list" "list1" "list2" "major_groove" "major_groove_x"
## [25] "major_groove_y1" "major_groove_y2" "mat" "minor_groove" "minor_groove_x" "minor_groove_y1" "minor_groove_y2" "myList"
## [33] "n" "query" "sigma" "sno.rna" "theta" "ve" "vec1" "vec2"
## [41] "vl" "x" "x.frame" "xlim" "xy" "y" "y.table" "y1"
## [49] "y2" "ylim"
Remove a item from environment
rm() # list current tables in environment
Remove all tables from envirnoment
rm(ls()) # list current tables in environment
Get the end of a table:
tail(Puromycin)
## conc rate state
## 18 0.11 115 untreated
## 19 0.22 131 untreated
## 20 0.22 124 untreated
## 21 0.56 144 untreated
## 22 0.56 158 untreated
## 23 1.10 160 untreated
Get the number of rows:
nrow(Puromycin)
## [1] 23
Get the number of rows:
ncol(Puromycin)
## [1] 3
Get both the number of rows and columns:
dim(Puromycin)
## [1] 23 3
dim(Puromycin)[2] # just number of columns
## [1] 3
Get column names of table:
names(Puromycin)
## [1] "conc" "rate" "state"
Get summary information about tables:
summary(Puromycin)
## conc rate state
## Min. :0.0200 Min. : 47.0 treated :12
## 1st Qu.:0.0600 1st Qu.: 91.5 untreated:11
## Median :0.1100 Median :124.0
## Mean :0.3122 Mean :126.8
## 3rd Qu.:0.5600 3rd Qu.:158.5
## Max. :1.1000 Max. :207.0
Get structural information about tables:
str(Puromycin)
## 'data.frame': 23 obs. of 3 variables:
## $ conc : num 0.02 0.02 0.06 0.06 0.11 0.11 0.22 0.22 0.56 0.56 ...
## $ rate : num 76 47 97 107 123 139 159 152 191 201 ...
## $ state: Factor w/ 2 levels "treated","untreated": 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "reference")= chr "A1.3, p. 269"
Specifcy columns by name:
Puromycin$rate[1:6]
## [1] 76 47 97 107 123 139
# '$' specifies column name, also indexed for first 6 items in this example
# OR
head(Puromycin$rate) # just another way of doing the above
## [1] 76 47 97 107 123 139
Simple math operations:
mean(Puromycin$rate)
## [1] 126.8261
Other mathematical operators are:
Slightly more complex, but powerful functions:
aggregate
will perform an operation on a set of data, and separate the results based on a set of categories or ‘factors’ that you provide. It returns a data frame of the results.
More on factors can be found here: http://www.stat.berkeley.edu/~s133/factors.html
aggregate(Puromycin$rate, by=list(Puromycin$state), mean)
## Group.1 x
## 1 treated 141.5833
## 2 untreated 110.7273
tapply is similar to aggregate, but returns a matrix.
tapply(Puromycin$rate, list(Puromycin$state), mean)
## treated untreated
## 141.5833 110.7273
data.tables
There is also a package that creates a more advanced data.frame called a data.table
. data.tables
have several advantages that we will see over the course of using R. Among them is being able to operate on columns using indexing and factors as shown above with simpler expressions and increased speed. It is generally recommended that new users start with data.tables
as soon as possible to develop good habits.
See discussions:
https://stackoverflow.com/questions/18001120/what-is-the-practical-difference-between-data-frame-and-data-table-in-r
https://www.analyticsvidhya.com/blog/2016/05/data-table-data-frame-work-large-data-sets/
https://www.r-bloggers.com/data-table-or-data-frame/
There’s also tidyverse
which was created by Hadley Wickham:
Example:
library(data.table)
DT<- data.table(Puromycin)
DT[,mean(rate), by= state]
## state V1
## 1: treated 141.5833
## 2: untreated 110.7273
#Format = DATA.TABLE[row, column, by= factor_column]
#Notice that functions can be combined with column designations to operate on large tables.
# consider this table
dfC <- merge(dfA<- data.frame(names = c("A", "B"), data1 = c(5,6)), dfB<- data.frame(names = c("A", "B"), data2 = c(10,11)), by="names")
dfC
## names data1 data2
## 1 A 5 10
## 2 B 6 11
You can melt
the data.frame
into a long format.
# merge tables together
dfC <- merge(dfA<- data.frame(names = c("A", "B"), data1 = c(5,6)), dfB<- data.frame(names = c("A", "B"), data2 = c(10,11)), by="names")
dfC
## names data1 data2
## 1 A 5 10
## 2 B 6 11
# melt to 'long' format
library(reshape)
df_melt <- melt(dfC, by=names, variable.name="data")
## Using names as id variables
df_melt
## names variable value
## 1 A data1 5
## 2 B data1 6
## 3 A data2 10
## 4 B data2 11
Or cast
to wide format
# merge tables together
dfC <- merge(dfA<- data.frame(names = c("A", "B"), data1 = c(5,6)), dfB<- data.frame(names = c("A", "B"), data2 = c(10,11)), by="names")
dfC
## names data1 data2
## 1 A 5 10
## 2 B 6 11
# melt to 'long' format
library(reshape)
df_melt <- melt(dfC, by=names, variable.name="data")
## Using names as id variables
# cast to 'wide' format
df_cast <- cast(df_melt, variable = data )
df_cast
## names data1 data2
## 1 A 5 10
## 2 B 6 11
Saving tables, etc:
save(<R_OBJECT>, file="FILENAME.Rdata")
# saves file as binary file readable by R
loading saved tables:
load(file="FILENAME.Rdata")
IMPORTANT: The saved R object retains its original name regardless of the filename. Thus, it may be difficult to find the object when loaded into your environment if you forget the original name or if you are opening an object from someone else. There are several ways around this:
(load(file="FILENAME.Rdata"))
and the names of the loaded R object(s) will be printed to the screen.get
command to assign the object a new name:list<-c(1,2,3,4)
save(list, file="list.r")
dat<- load("list.r")
myList<- get(dat)
.RDS
file and then reassign it upon loading:saveRDS=(R_object, "FILENAME.RDS")
myObj <- readRDS("FILENAME.RDS"))
IMPORTANT!!: Saving workspace environment: This saves all the tables and variables you have created or loaded into an R
sessions. That way you can continue where you left off instead or recreating all your work.
save.image(file="NAME.Rdata")
# saves currently loaded variables, functions, and packages.
You can then load the saved workspace as follows:
load(file="NAME.Rdata")
#loads previous workspace and all the associated tables
write.table(<R_Object>, file="FILENAME.txt", sep='\t', col.names=T, row.names=F, quote=F)
# you would reload this table with "read.table"
data.tables
with fread
:y.table = fread('peaks_promoters.bed')
is.data.frame(y.table)
## [1] TRUE
is.data.table(y.table)
## [1] TRUE
data.tables
:You can write data.table
fast with fwrite
fwrite(<DATA.TABLE>, file="FILENAME.txt", sep='\t', col.names=T, row.names=F, quote=F)
# you would reload this table with "fread(file=<DATA.table>"
Rscript
to run R
codeJust like one can run a shell script using the command bash scriptname.sh
, we can run a file with R
code by calling it directly from the Terminal.
touch filename.R
nano filename.R
# populate the file with R code and a few print statements
An example file to test Rscript
(note the she bang is differnet for R scripts):
#!/usr/bin/env Rscript
print('test')
x <- mean(c(5,6,7,23))
print(x)
You can make the .R
file in a text editor and call it directly from the terminal with Rscript
:
Rscript filename.R
The output will print to the screen and this can be coupled with sbatch
to automate/submit a job and tee
to generate log files. This is how you will compare using awk
, data.frame
, and data.table
in your optional homework.
R
or RStudio
from:
http://www.rstudio.com/products/rstudio/download/There are two tables (exp_untreated.txt, exp_treated.txt) in the following location:
/home/FCAM/meds5420/in_class/R_intro
To Do:
1. Either copy these files to your local machine (using sfto
) or work with them on the server in an R
session.
2. Read the tables into R
and set them equal to a variable that makes sense (e.g. “untreated”, “treated”).
3. Create column names for the data tables (i.e. “genes” for column 1, and “treated” or “untreated” for column 2)
4. Merge the tables into one table with the merge()
function
5. Calculate a mean, median, and standard deviation for the treated and untreated samples.
6. Create a new column in the dataframe that has the fold change (treated/untreated) for the data.
7. Devise a way to merge the untreated and treated such that all data names and numbers are in the same column and a third factor
column designates treated/untreated.
8. Use aggregate()
to calculate the mean, median, and standard deviation for each treatment from this new table.
9. Convert the table to a data.table
and then calculate the same parameters as above.
10. Try saving and reading back in your tables.
Read the tables into R and set them equal to a variable that makes sense (e.g. “untreated”, “treated”)
#first look at the data structure to determine how to read it in
system("head exp_treated.txt")
#then read them in
treated <- read.table("exp_treated.txt", header=F, sep=" ")
untreated <- read.table("exp_untreated.txt", header=F, sep=" ")
Create column names for the data tables (i.e. “genes” for column 1, and “data” for column 2)
colnames(treated) <- c("genes", "data")
colnames(untreated) <- c("genes", "data")
Merge the tables into one table with the merge()
function
merged <- merge(treated, untreated, by="genes")
Calculate a mean, median, and standard deviation for the treated and untreated samples.
mean(treated$data)
median(treated$data)
sd(treated$data)
#etc. . . .
Create a new column in the dataframe that has the fold change (treated/untreated) for the data.
merged[,4] <- merged[,2]/merged[,3]
#or
foldChange <- data.frame(merged, fc<-merged[,2]/merged[,3])
Devise a way to merge the untreated and treated such that all data names and numbers are in the same column and a third factor
column designates treated/untreated.
# first add a column with the trement state to both
treated <- data.frame(treated, state="treated"
untreated <- data.frame(untreated, state="untreated"
#Then use rbind to merge the tables
merged <- rbind(treated, untreated)
Use aggregate()
to calculate the mean, median, and standard deviation from this new table.
aggregate(merged$data, by=list(merged$state), mean)
aggregate(merged$data, by=list(merged$state), median)
aggregate(merged$data, by=list(merged$state), sd)
Convert the table to a data.table and then calculate the same parameters as above.
library(data.table)
#Convert to data.table
DT <- data.table(merged)
#Calculations
DT[, mean(data), by = state]
DT[, median(data), by = state]
DT[, sd(data), by = state]