Contents

1 Relevant R tutorials and resources:

  1. Software carpentry: https://swcarpentry.github.io/r-novice-gapminder/
  2. Detailed R tutorial: http://adv-r.had.co.nz/
  3. Several images in today’s tutorial are from the book R in Action. Can be purchased here: http://www.manning.com/kabacoff/ or on Amazon.
  4. Quick R is also one of many good resources for plotting: http://www.statmethods.net/

2 Before we begin:

3 Today: Intro to R and Basic of plotting with R

3.1 More basic R: subsetting tables

Open an R environment and set you working directory to folder where the peaks_promoters.bed file is.

getwd() # no input or options needed
## [1] "/Users/guertinlab/Documents/class/2023_lec20"
setwd("/Users/guertinlab/MEDS5420") 
getwd()

The peaks_promoters.bed file contains information on which gene TSS is closest to ATF1 peaks that we called with MACS. Recall that we used bedtools closest to find the distance between ATF1 summits and TSSs and report them relative to genes with consideration of strand.

Read this table into your R session with read.table:

#read in table
closest <- read.table("peaks_promoters.bed")

#check structure
head(closest,3)
##     V1      V2      V3          V4      V5   V6      V7      V8        V9             V10 V11 V12
## 1 chr1  778706  778707 ATF1_peak_1 82.7650 chr1  778747  778748 LINC01409 ENSG00000237491   +  41
## 2 chr1 1000890 1000891 ATF1_peak_2 41.9708 chr1 1001138 1001139     ISG15 ENSG00000187608   + 248
## 3 chr1 1471676 1471677 ATF1_peak_3 30.5627 chr1 1471765 1471766    ATAD3B ENSG00000160072   +  89

Let’s check how many chromosomes are present in this data:

#report unique items in a list or vector
unique(closest[,1])
##  [1] "chr1"                    "chr10"                   "chr11"                   "chr11_KI270832v1_alt"    "chr12"                   "chr13"                  
##  [7] "chr14"                   "chr14_GL000225v1_random" "chr14_KI270846v1_alt"    "chr15"                   "chr16"                   "chr17"                  
## [13] "chr17_KI270857v1_alt"    "chr18"                   "chr18_GL383572v1_alt"    "chr18_KI270863v1_alt"    "chr18_KI270911v1_alt"    "chr19"                  
## [19] "chr1_KI270763v1_alt"     "chr2"                    "chr20"                   "chr21"                   "chr21_KI270872v1_alt"    "chr22"                  
## [25] "chr22_KI270733v1_random" "chr3"                    "chr3_KI270779v1_alt"     "chr3_KI270924v1_alt"     "chr4"                    "chr5"                   
## [31] "chr5_GL339449v2_alt"     "chr5_KI270791v1_alt"     "chr5_KI270897v1_alt"     "chr6"                    "chr6_GL000252v2_alt"     "chr6_GL000254v2_alt"    
## [37] "chr6_GL000255v2_alt"     "chr6_KI270797v1_alt"     "chr7"                    "chr7_KI270809v1_alt"     "chr8"                    "chr8_KI270816v1_alt"    
## [43] "chr9"                    "chr9_KI270718v1_random"  "chrUn_GL000216v2"        "chrUn_GL000220v1"        "chrX"                    "chrX_KI270880v1_alt"    
## [49] "chrX_KI270913v1_alt"

Looks like this is the whole dataset, but what if you only want the part from your chromosome.   Here are two ways to do this:

#subset a table based on a logical match to a column
closest_chr17<- subset(closest, closest[,1]=='chr17')
head(closest_chr17, 3)
##        V1      V2      V3            V4      V5    V6      V7      V8     V9             V10 V11  V12
## 203 chr17  782326  782327 ATF1_peak_202 66.2886 chr17  782353  782354   MRM3 ENSG00000171861   +   27
## 204 chr17 4143675 4143676 ATF1_peak_203 14.7257 chr17 4143168 4143169 CYB5D2 ENSG00000167740   + -507
## 205 chr17 4704157 4704158 ATF1_peak_204 30.8864 chr17 4704336 4704337  PELP1 ENSG00000141456   -  179

#subset a table based on a logical match to a column
closest_chr17<- closest[closest[,1]=='chr17',] 
#notice the syntax is based on table indexing
head(closest_chr17, 3)
##        V1      V2      V3            V4      V5    V6      V7      V8     V9             V10 V11  V12
## 203 chr17  782326  782327 ATF1_peak_202 66.2886 chr17  782353  782354   MRM3 ENSG00000171861   +   27
## 204 chr17 4143675 4143676 ATF1_peak_203 14.7257 chr17 4143168 4143169 CYB5D2 ENSG00000167740   + -507
## 205 chr17 4704157 4704158 ATF1_peak_204 30.8864 chr17 4704336 4704337  PELP1 ENSG00000141456   -  179

Notes on relational operators for logical tests:
Site: http://sites.stat.psu.edu/~dhunter/R/html/base/html/Comparison.html
Operator : Interpretation

  1. x < y : x is less than y
  2. x > y : x is greater than y
  3. x <= y : x is less than or equal to y
  4. x >= y : x is greater than or equal to y
  5. x == y : x is equal to y
  6. x != y : x is not equal to y

Also when conducting logical tests:

  • & = and
  • | = or

4 Plotting distributions

4.1 Boxplots

One simple way to plot the distribution of a data set is to use boxplots:

boxplot(closest[,12])

4.2 What do boxplots show?

anatomy of a box plot

Figure 1: anatomy of a box plot

median - Middle dark line.
box - Interquartile ranges. i.e. extending from 25th percentile to 75th percentile.
whisker - Extremes (cover 90% of the data).
outliers - Points outside of whiskers.

The default parameters from boxplot are not great for publication. However, there are options specific to the boxplot function that we can use to make the figure more meaningful and understandable.

For instance: see help(boxplots))

  • remove outliers (outline=F)
  • color border (border="red")
  • border thickness (lwd=2)
boxplot(closest[,12], outline=FALSE, border='red', col='white', lwd=2)

lwd = line width
col = inside box color
border= border color
outline= display outliers (if TRUE)

4.3 Commonly used options with graphs:

  • graph size
  • margins
  • Axis labels (xlab, ylab)
  • Axis font, size, color
  • graph title (main =“title”)
  • legend
  • line, fill colors
  • line width