0.1 Cleaning up data
1 Building genome with HISAT2
- 1.1 GTFs:
- 1.2 Using HISAT2 and related software on server:
2 Building your annotated genome for mapping with HISAT2.
3 Alignment with HISAT2
- 3.1 Note on --rna-strandness
- 3.2 Intermediate file conversions
4 StringTie for assembling transcripts
5 In class exercise 1:
6 Counting reads in transcripts with HTseq
- 6.1 Important update on HTseq-count
- 6.2 Strandedness with HTSeq
7 In class exercise 2:
8 Use Genome Coverage from bedtools to create a bedGraph:
- 8.1 Note on visualizing stranded data with genomeCoverageBed
9 In class exercise 3:
10 Code chunks and commentary not covered in class
11 Visualizing stranded PE data:
- 11.1 Mapping paired end and stranded data with HISAT2
12 Displaying splice junctions
13 Molecular biology of library preparations
14 Answers to in class exercise 1:
15 Answers to class exercise 2:
16 Answers to in class exercise 3:

Figure 1: RNAseq workflows

0.1 Cleaning up data

RNA-SeQC is a first-step option for quality control: http://archive.broadinstitute.org/cancer/cga/rna-seqc
RNA-SeQC is not available on the Xanadu server, but you can prefilter for quality and remove adapters just as in ChIP-seq:

fastqc or mutliQC: https://multiqc.info/docs/
quality filter / trimming
adapter removal

NOTE: Illumina adapters have barcodes these days, so when removing adapter from multiple files, just use the constant regions before the variable region (bolded below) to filter against. Example:\

GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG

Here, the bolded sequence is the variable region, so just use the region before it (GATCGGAAGAGCACACGTCTGAACTCCAGTCAC) for filtering.

1 Building genome with HISAT2

Things you need:

Indexed genome file(s): /home/FCAM/meds5420/genomes/hg38_hisat2
File with transcript and exon annotations (GFF or GTF files). :
/home/FCAM/meds5420/annotations/gencode.v39.annotation.gtf
installation of hisat2, stringTie, and GFFcompare: see below

1.1 GTFs:

GTF (General Transfer Format) format is a table of individual exons with retention of transcript ID.
Explanation of GTF and GFF files can be found here.

GENCODE GTFs This can be retrieved from GENCODE (Figure 2):

Figure 2: GENCODE GTF website

head -5 gencode.v39.annotation.gtf

## ##description: evidence-based annotation of the human genome (GRCh38), version 39 (Ensembl 105)
## ##provider: GENCODE
## ##contact: gencode-help@ebi.ac.uk
## ##format: gtf
## ##date: 2021-09-02

Obtaining .GTF (Gene Transfer Format) files from UCSC (Figure 3):

Figure 3: UCSC GTF Table Browser

1.2 Using HISAT2 and related software on server:

Use module to load necessary software:

module load hisat2
module load stringtie
module load gffcompare

2 Building your annotated genome for mapping with HISAT2.

HISAT2 will map to your genome, but will also use gene annotations to map to known exons and splice junctions specifically.

First, we need to use the GTF file to create a list of exons and splice sites to use when we map.
There are python scripts that come with the hisat2 software for this purpose.


extract_splice_sites.py gencode.v39.annotation.gtf > hg38_hisat_splicesites.txt

extract_exons.py gencode.v39.annotation.gtf > hg38_hisat_exons.txt

Here’s what the files look like:

head -5 ./hisat2/hg38_hisat_splicesites.txt

## chr1 12056   12178   +
## chr1 12226   12612   +
## chr1 12696   12974   +
## chr1 12720   13220   +
## chr1 13051   13220   +

head -5 ./hisat2/hg38_hisat_exons.txt

## chr1 11868   12226   +
## chr1 12612   12720   +
## chr1 12974   13051   +
## chr1 13220   14500   +
## chr1 15004   15037   -

These files are then used in building your annotated genome for mapping RNA-seq data. Here’s how to build it:

hisat2-build -p 8 --ss hg38_hisat_splicesites.txt --exon hg38_hisat_exons.txt hg38.fa hg38_hisat2

Should output the following to the screen (actually to stderr):

Settings:
  Output files: "hg38_hisat2.*.ht2"
  Line rate: 7 (line is 128 bytes)
  Lines per side: 1 (side is 128 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Local offset rate: 3 (one in 8)
  Local fTable chars: 6
  Local sequence length: 57344
  Local sequence overlap between two consecutive indexes: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  ../hg38.fa
Reading reference sizes
  Time reading reference sizes: 00:00:19
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
etc,
etc,
etc,
Total time for call to driver() for forward index: 02:47:58

NOTE: THIS IS THE MAJOR BOTTLENECK WITH HISAT2.

You can plan on this step taking ~2 hours and requiring ~160GB RAM for a large-ish genome such as human. You cannot do this with your average laptop, but you can run scripts that require high memory on the server with a special option.
Alternatively you may be able to download a pre-built genome for your organism from the HISAT2 website: https://ccb.jhu.edu/software/hisat2/index.shtml.
- In this case you lose flexibility due to being are locked in to the annotation set used in the build.
You can instead use annotations during the mapping, or (preferred) create a custom annotation using your initial alignment (after post processing below) and then re-align your data using the custom annotation set for good coverage.
Also, if you build your genome without annotations it requires much less memory and can probably be built on a desktop or even laptop.

To build your own genome of the server you can create a qsub script and make sure that the following lines are added to the header:

#SBATCH -c=32
#SBATCH -p himem
#SBATCH --qos=himem
#SBATCH --mem=256G

This would use half of the RAM and cores one one high memory node (i.e. each has 64 cores with 512G RAM).

This information is available on the BBC cluster site:

3 Alignment with HISAT2

HISAT2 manual:

hisat2 -p 4 --dta --rna-strandness F -x $ht2_gen -U INFILE.fastq -S OUTFILE.sam 

# ht2_gen=/home/FCAM/meds5420/genomes/hg38_hisat2/hg38_hisat2

Output is standard .sam output plus optional columns that provide greater detail. See manual for details.

Options are VERY important for RNA mapping - Some useful options:
* --dta: downstream transcript assembly -
* --rna-strandness: specify when data is stranded or not. Default is not stranded. F means read represents transcript, R means transcript is reverse complement. This is only used in stringtie and other software from the tuxedo suite. Invoking R or F here does not change the hexadecimal SAM flag. Instead, every read alignment will have an XS attribute tag with a - or + associated with the read alignment. This attribute tag is only (from what I can tell) used for downstream applications in the tuxedo suite.
- --tmo: transcript mapping only: only map to known transcripts.
- --phred33 or --phred64: specify Illumina quality score conversion.
- --summary-file [fileName.txt]: Print alignment summary to this file.
- --time: print time for alignment to the screen (sent to stnderr)
- --no-softclip prevents aligner from clipping ends that do not match
- --known-splicesite_infile: can add splice sites here if not in your assembly (need to extract splice sites with the python script).

3.1 Note on `--rna-strandness`

The option --rna-strandness is dependent upon the molecular biology that was used in library construction. Even when the molecular biology is known (sometimes it is not if you send out to a company), I recommend performing a coherence check like this by running hisat with the --rna-strandness F or --rna-strandness R option. If your data is stranded, you can also determine whether the R or F is the correct option for --rna-strandness. What do you expect to see if you chose the incorrect orientation? I will continue to stress this point because “pipelines” that ignore the molecular biology and these types of coherence checks can get researchers into trouble. In 2020 I had a student take my class and they realized that they were treating their unstranded libraries as stranded libraries in the analysis. This should not effect the conclusions, but using the wrong strand will. Treating unstranded libraries as stranded libraries is equivalent of sequencing to half the read depth—why?

3.2 Intermediate file conversions

Next we have to convert the alignments to a .bam file and sort them.

samtools view -bS INFILE.sam > OUTFILE.bam

samtools sort -@ 4 INFILE.bam OUTFILE_sorted

# no extension needed for output file name
# -@ option designates # of cores for multi-threading 
# Important to remove temporary files, especially when working on the server.

rm INFILE.sam
rm INFILE.bam

NOTE: newer versions of samtools can convert to .bam and sort in one step.

samtools sort -@ 4 -o OUTFILE_sorted.bam INFILE.sam

If you are interested in the number of reads with splice junctions in them you can extract them from a .bam or .sam as below:

samtools view INFILE.sam | awk '($6 ~ /N/)' > OUTFILE_spliced_reads.sam

This command uses the CIGAR (Compact Idiosyncratic Gapped Alignment Report) string to find gaps in reads with no match, which is designated by the N. M: Alignment match (can be a sequence match or mismatch)
I: Insertion to the reference
D: Deletion from the reference
N: Skipped region from the reference
S: Soft clip on the read (clipped sequence present in sequence)
H: Hard clip on the read (clipped sequence NOT present in sequence)
P: Padding (silent deletion from the padded reference sequence)

head -3 C1_RNAseq_rep1_F_chr15.sam

## @SQ  SN:chr15    LN:101991189
## SRR5882541.93    16  chr15   52066185    60  74M1S   *   0   0   TTTATTTTATTGAATACTGTCTGTATCTTTGGTTATCCTGTTTGAAGAAAAAGGACAAATAAAACATGGCCAGCC F.AFFFF.FFFFFFAFFFFFFFFAFFFFFAFFAFFAFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFAAAAA AS:i:-1 ZS:i:-11    XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:74 YT:Z:UU XS:A:-  NH:i:1
## SRR5882541.161   16  chr15   43793064    60  20M626N53M  *   0   0   GCTGCCGCCCGCAAGCAGAGGGACTCGGAGATCATGCAGCAGAAGCAGAAAAAGGCAAACGAGAAGAAGGAGG   FFAFF<FFFFA<FFF7A<<<F<FF.FFAFFA<FFFFF<AFFFFFFF7F<FFFFFFFFFFAAFFF7FFFAAAAA   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:73 YT:Z:UU XS:A:-  NH:i:1

head -2 C1_RNAseq_rep1_F_chr15_spliced_reads.sam

## SRR5882541.161   16  chr15   43793064    60  20M626N53M  *   0   0   GCTGCCGCCCGCAAGCAGAGGGACTCGGAGATCATGCAGCAGAAGCAGAAAAAGGCAAACGAGAAGAAGGAGG   FFAFF<FFFFA<FFF7A<<<F<FF.FFAFFA<FFFFF<AFFFFFFF7F<FFFFFFFFFFAAFFF7FFFAAAAA   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:73 YT:Z:UU XS:A:-  NH:i:1
## SRR5882541.176   0   chr15   89321203    60  1S58M475N17M    *   0   0   CGGGTGTAGCCAGGTGGGGCCTGCACCATGGCTTTCAACTCACTGCCTACTCGGTCAGGCCGGGCATTGCTGGCGG    AAAAAFF.FFFFFFFFFFFFFFFF<FAFFFFF.FFFA)AFF.FFFFFF.AFFFFF77FFF7FFFF.AFFFFFFFFF    AS:i:-1 XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:75 YT:Z:UU XS:A:+  NH:i:1

4 StringTie for assembling transcripts

In the author’s words:
StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts. It uses a novel network flow algorithm as well as an optional de novo assembly step to assemble and quantitate full-length transcripts representing multiple splice variants for each gene locus. Its input can include not only alignments of short reads that can also be used by other transcript assemblers, but also alignments of longer sequences that have been assembled from those reads. In order to identify differentially expressed genes between experiments, StringTie’s output can be processed by specialized downstream software.

StringTie Manual

Assemble transcripts with StringTie:

stringtie -p 4 -G gencode.v39.annotation.gtf -o OUTFILE.gtf -l TREATMENT_CELLS INFILE_sorted.bam

variables

-G: known annotation GTF
-o: output GTF
-l: prefix for transcript IDs
last entry is the input .bam file
--rf: Assumes a stranded library fr-firststrand
--fr: Assumes a stranded library fr-secondstrand
I empirically found that --rf and --fr are ignored if HISAT2 was invoked with the --rna-strandness argument as R or F.

4.1 StringTie for merging transcripts

At this point, if you want to do differential gene expression analysis, you will need to merge all of the transcripts from all sequencing samples and conditions.

Figure 4: Merging transcripts

stringtie --merge -p 8 -G gencode.v39.annotation.gtf -o stringtie_merged.gtf mergelist.txt

mergelist.txt is a simple text file with the names of each file to merge on a separate line.

4.2 gffcompare to compare to annotations:

gffcompare manual

gffcompare -R -r gencode.v39.annotation.gtf -o stingtie_merged stringtie_merged.gtf

The -R option forces the program to only consider transcripts for which there is data, so that the number of ‘missed’ exons or transcripts is not influenced simply by which genes are expressed.

Output files:

stringtie_merged.combined.gtf
stringtie_merged.loci
stringtie_merged.stats
stringtie_merged.tracking
stringtie_merged.gtf.refmap
stringtie_merged.gtf.tmap

less stingtie_merged.stats

## # gffcompare v0.12.6 | Command line was:
## #gffcompare -R -r /home/FCAM/meds5420/annotations/gencode.v39.annotation.gtf -o stingtie_merged stringtie_merged.gtf
## #
## 
## #= Summary for dataset: stringtie_merged.gtf 
## #     Query mRNAs :  245645 in   57989 loci  (218720 multi-exon transcripts)
## #            (22471 multi-transcript loci, ~4.2 transcripts per locus)
## # Reference mRNAs :  242722 in   57415 loci  (216649 multi-exon)
## # Super-loci w/ reference transcripts:    57413
## #-----------------| Sensitivity | Precision  |
##         Base level:   100.0     |    99.7    |
##         Exon level:    99.7     |    99.4    |
##       Intron level:   100.0     |    99.5    |
## Intron chain level:   100.0     |    99.1    |
##   Transcript level:   100.0     |    98.8    |
##        Locus level:    99.9     |    98.9    |
## 
##      Matching intron chains:  216647
##        Matching transcripts:  242608
##               Matching loci:   57373
## 
##           Missed exons:       0/652609   (  0.0%)
##            Novel exons:    2570/655065   (  0.4%)
##         Missed introns:       0/395147   (  0.0%)
##          Novel introns:    1901/397160   (  0.5%)
##            Missed loci:       0/57415    (  0.0%)
##             Novel loci:     576/57989    (  1.0%)
## 
##  Total union super-loci across all input datasets: 57989 
## 245645 out of 245645 consensus transcripts written in stingtie_merged.annotated.gtf (0 discarded as redundant)

head stringtie_merged.gtf.refmap

## ref_gene ref_id  class_code  qry_id_list
## DDX11L1  ENST00000456328.2   =   ENSG00000223972.5|ENST00000456328.2
## DDX11L1  ENST00000450305.2   =   ENSG00000223972.5|ENST00000450305.2
## MIR1302-2HG  ENST00000473358.1   =   ENSG00000243485.5|ENST00000473358.1
## MIR1302-2HG  ENST00000469289.1   =   ENSG00000243485.5|ENST00000469289.1
## MIR1302-2    ENST00000607096.1   =   ENSG00000284332.1|ENST00000607096.1
## OR4G4P   ENST00000606857.1   =   ENSG00000268020.3|ENST00000606857.1
## OR4G11P  ENST00000642116.1   =   ENSG00000240361.2|ENST00000642116.1
## OR4G11P  ENST00000492842.2   =   ENSG00000240361.2|ENST00000492842.2
## OR4F5    ENST00000641515.2   =   ENSG00000186092.7|ENST00000641515.2

4.3 Calculate transcript abundances with stringtie for use with ballgown or other DE analysis software.

stringtie -e -B -p 8 -G stringtie_merged.gtf 
          -o OUTFILE_est.gtf 
          -A OUTFILE_abun.tab INFILE_sorted.bam

Options:

-e: Only quantify transcripts that agree with reference annotations.
-B: Converts data to a table (.ctab) that is compatible with Ballgown DE analysis
-o: Path and name of new .GTF file with coverage data. Accessory files will be written to same path and path will be created if it doesn’t exist.
-A: Create an abundance table that reports transcript level quantification.
--rf: Assumes a stranded library fr-firststrand
--fr: Assumes a stranded library fr-secondstrand
I empirically found that --rf and --fr are ignored if HISAT2 was invoked with the --rna-strandness argument as R or F.

There are several outputs, one of which contains the data for each transcript (coverage, FPKM, and TPM values). You can use grep and your favorite gene’s name to look at its transcript abundance.

5 In class exercise 1:

I have aligned RNA-seq data from two clones of the same progenitor cell line to the hg38 genome with HISAT2. These are two clones, clone 1 and clone 7, that were expanded upon single cell dilution. The genome file indicies I used for alignment is here: /home/FCAM/meds5420/genomes/hg38_hisat2. I used the exact hisat2 command:

ht2_gen=/home/FCAM/meds5420/genomes/hg38_hisat2/hg38_hisat2

for i in C*rep*fastq.gz
do
    name=$(echo $i | cut -d "." -f 1)
    echo $name
    hisat2 -p4 --dta --rna-strandness R -x $ht2_gen -U $i -S ${name}_R.sam
    hisat2 -p4 --dta --rna-strandness F -x $ht2_gen -U $i -S ${name}_F.sam
done

The .sam files can be found here:

Folder location: /home/FCAM/meds5420/data/RNA_seq
Files:
- C1_RNAseq_rep*_R.sam
- C7_RNAseq_rep*_R.sam
- C1_RNAseq_rep*_F.sam
- C7_RNAseq_rep*_F.sam

The annotation .gtf file can be found here: /home/FCAM/meds5420/annotations/gencode.v39.annotation.gtf
Extract your specific chromosome from the annotation file to speed things up.

Access the aligned data and with each R and F strandedness file (using a loop) and with “your” specific chromosome perform:

transcript assembly with stringtie. In your loop you will want to generate a file with a list of individual annotation file names that you will use in the next step.
merge transcripts from each assembly, but keep R and F assemblies separate so you can compare them.
comparison the R and F merged assemblies to annotations with a focus on the .stats file. Interpret the stats file with a focus on Precision and novel features. Provide a written explanation of your interpretation with respect to the --rna-standedness input.
stringtie quantification of transcripts that match the annotation (i.e. not your merged transcripts gtf) using the C1_RNAseq_rep1_F and C1_RNAseq_rep1_R sorted BAM files from your chromosome.
Is the R or F option, or unstranded correct for these libraries? Perform tests to determine this and provide written explanations. What would you expect if the libraries were unstranded?

To speed things up try logging into an interactive session with 2 or 4 cores: srun --pty -p mcbstudent --qos=mcbstudent --mem=6G -c 4 bash
Only use data from your chromosome, i.e. retrieve your chromosome data from the .sam file to your $HOME directory tree and pull out only GTF annotations on your chromosome.
Take the time to look at the output and manual (-h) at each step to become familiar with the operations and output.

This exercise will take some time, so we will have at least the first 20 minutes of next class to finish this and ask questions.

6 Counting reads in transcripts with HTseq

HTseq is a python-based set of tools for analyzing sequencing data.

Link to the manual: https://htseq.readthedocs.io/en/release_0.10.0/
It is particularly useful because it can take input .GTF files and map reads to exons and return the results by whole transcript or gene.

HTseq only requires a sorted .sam or .bam file, so the output from hisat2 alignment can be used directly after sorting. With later versions of HTSeq you will need to make a bam index file with samtools using samtools index

Here is an example of it’s usage:

module load htseq/0.13.5

samtools index HISAT_ALIGNED_SORTED.bam
htseq-count -m union -r pos -f bam -i gene_id -a 10 --stranded=no HISAT_ALIGNED_SORTED.bam $GTF > FILE.counts
  
 # $GTF file is a variable for the annotation file of choice

Options:

-m union: union is one of three options for htseq count for binning of reads to features (genes/exons). It is the recommended option that is most stringent since it will only count a read if it aligns to only 1 gene.
-r pos : meaning that the BAM/SAM file is sorted by genomic coordinate/position
-f: specify input file type
-i gene_id : tells htseq-count to use the gene ID in output file
-a 10 : tells htseq-count to skip reads with alignment score less than 10.
–stranded=no/yes/reverse : the RNA-seq reads strand specificity.

6.1 Important update on HTseq-count

HTseq (older than version 0.8.0) only maps to unique portions of transcripts and genes. Therefore, read count will often be underrepresented if genes or transcript isoforms overlap or share common exons. The newer version of HTseq-count (0.8.0 and newer) have a new option called --nonunique that if set to all will allow reads to map to multiple isoforms or overlapping genes. So in the examples shown in the HTseq figure, ambiguous reads get mapped to both transcripts or genes. Xanadu has newer versions of HTseq that allow the use of the nonunique option.

Figure 5: HTseq counting options

6.2 Strandedness with HTSeq

We use htseq-count to bin reads into genomic features (i.e. genes). The options above include -r which tells the program the bam file is sorted by position, -f which tells the program the input is a BAM file, and --stranded tells the program the strand information of the file. Notice the .gtf file is included. The output of this command is a .txt file that contains the gene counts for that sample. We will get one .txt file per sample. We already confirmed that the data is stranded, but even if I know everything about the experiment, is not intuitive for me to know whether to specify yes or reverse for the --stranded argument. To make matters more confusing HTSeq is not in the tuxedo suite, so the XS attribute tag that contains the - or + associated with the read alignment is ignored. The SAM flag is used, so we must again specify the stranded option when counting with HTSeq.

To ensure that I use the correct option, I always run both options and observe the results—this is only necessary for one file if you have a whole bunch of files that were generated with the same molecular biology. Whatever option has more reads in the vast majority of the genes is correct. I appreciate that this is not elegant, but these empirical checks are often necessary to ensure the data is properly processed. The biggest mistakes that I have observed in genomics could have been avoided if the bioinformatician looked at the output of analyses and confirmed that the result is expected or, at minimum, reasonable.

7 In class exercise 2:

1) Count reads in genes and empirically determine the correct stranded parameter option for your C1_RNAseq_rep1.fastq.gz data aligned to your chromosome. Recall that C7_RNAseq_rep1_F.sam and C7_RNAseq_rep1_R.sam are found here: /home/FCAM/meds5420/data/RNA_seq. Confirm that HTSeq ignores the --rna-strandedness option from hisat2 by counting with both SAM files.

You should have already extracted only reads from your chromosome using grep and these input SAM files. Likewise, you should have subset your gtf annotation to only contain your chromosome using grep.

Note: The output counts file has a footer (see the tail) that we want to ignore in the counting. The grep -v __ command prints all the lines without __ to stdout, and you can use awk (you should have a column-summing one-liner awk command already saved in the awk_one_liners.sh file that I know you are all keeping) to count the second column.

8 Use Genome Coverage from bedtools to create a bedGraph:

bedtools genomecov -bg -split -ibam HISAT_out.sorted.bam > HISAT_out.bedgraph

NOTES: -split is an important option. If you do not specify it here, you will get signal across exons that looks like background.

Also, if you have strand specific data, make sure you have to run this twice and specify a different strand each time.

bedtools genomecov -bg -split -strand + -ibam INFILE > OUTFILE_plus.bedgraph

bedtools genomecov -bg -split -strand - -ibam INFILE > OUTFILE_minus.bedgraph

8.1 Note on visualizing stranded data with `genomeCoverageBed`

Again, we have to be careful about strandedness! genomeCoverageBed defines “strand” by their orientation and they will treat BAM files generated with the --rna-strandness R and F options identically because bedtools is not part of the tuxedo suite. Therefore in the testing stage for generating a workflow, you will have to try combinations of -strand arguments and plus vs. minus outputs to ensure proper visualization as I show in the code chunk below. Since you already confirmed the R option with HISAT2 and the rev option with HTseq, I will tell you that the specified strand and output is inconsistent (i.e. - paired with plus and + paired with minus) with this library prep method.

#test1
bedtools genomecov -bga -split -strand + -ibam INFILE > OUTFILE_plus.bedgraph
bedtools genomecov -bga -split -strand - -ibam INFILE > OUTFILE_minus.bedgraph

#test2
bedtools genomecov -bga -split -strand + -ibam INFILE > OUTFILE_minus.bedgraph
bedtools genomecov -bga -split -strand - -ibam INFILE > OUTFILE_plus.bedgraph

We can add track lines as option to include in the files:

bedtools genomecov -bg -split -strand + -ibam INFILE -trackline 
  -trackopts 'name=RNAseq_plus description="RNAseq_plus"  
  visibility=full autoScale=on alwaysZero=on 
  color=255,0,0 windowingFunction=maximum' 
  > OUTFILE_plus_track.bedgraph

You can add track lines as previously shown with awk or with a modified AddTrackLines script we wrote in class for ChIP-seq.

awk 'BEGIN {  print "browser position chr11:5,289,521-5,291,937"
              print "track type=bedGraph name=RNAseq_plus 
              description="RNAseq_plus"  visibility=full 
              autoScale=on alwaysZero=on 
              color=255,0,0 windowingFunction=maximum" }
           {  print $0}' RNAseq.bedgraph > RNAseq_header.bedgraph

You can now upload the file into the UCSC genome browser:

https://genome.ucsc.edu/s/Mike%20Guertin/hg38_rna_seq_class_C1

Figure 6: RNA-seq UCSC

9 In class exercise 3:

1. Use HISAT2 to align C1_RNAseq_rep1.fastq.gz (/home/FCAM/meds5420/data/RNA_seq/C1_RNAseq_rep1.fastq.gz) to your user ID chromosome (/home/FCAM/meds5420/genomes/chroms/chr#.fa) using the appropriate options. Recall that you will have to build your chromosome index files by first extracting splice sites and exons. Speed up the process by using you GTF annotation file that is unique to your chromosome.
2. Visualize the data in the browser and send me a UCSC session link.
- Your chromosome can be found here: /home/FCAM/meds5420/genomes/chroms
extract your chromosome gene annotations from here: /home/FCAM/meds5420/annotations/gencode.v39.annotation.gtf

10 Code chunks and commentary not covered in class

The following code chunks and commentary are for your reference. You have the knowledge base to implement the code and understand the processing if the need arises. Most RNA-seq libraries are paired end, so I provided the code and explanations to visualize PE data. UConn and JAX have very strong RNA biology and splicing labs, so I also included code to parse files to visualize splice junction reads on a browser.

11 Visualizing stranded PE data:

Original merged C1 and C7 PE FASTQ files: /home/FCAM/meds5420/data/RNA_seq/paired_end

11.1 Mapping paired end and stranded data with HISAT2

If you submit samples for RNA-seq at UConn’s CGI, then you will most likely get stranded data back. It will often be paired-end data as well. Here’s how to map and visualize the data.

module load hisat2
dir=/home/FCAM/meds5420/data/RNA_seq/paired_end/
ht2_gen=/home/FCAM/meds5420/genomes/hg38_hisat2/hg38_hisat2
hisat2 -p4 --dta --rna-strandness RF -x $ht2_gen -1 ${dir}C_RNAseq_all_PE1.fastq.gz -2 ${dir}C_RNAseq_all_PE2.fastq.gz -S C_RNAseq_all.sam

the --rne-strandness refers to whether the sequence read represents the RNA equivalent or is complementary to it. If the read represents the same strand as the RNA, then use the F as the input, if it is the complement of the RNA, use R.
For the Paired-end kit used in the CGI, the first read (mate pair 1) is the complement to the RNA, and the second read (mate pair 2) is the same strand as the RNA. Therefore, you should enter RF for this option.
See: Taken from: https://chipster.csc.fi/manual/library-type-summary.html

Next, We can convert this .sam file to a sorted .bam , however, for making bed files of paired end data, the reads should be sorted by name (-n), so mate-pairs as next to each other. The example below will just process chr15 to speed things up. Remember, when employing new code or building a pipeline it is always a good idea to start processing only a fraction of the data (i.e. head -n 400000 X > Y or grep -w chr19 X > Y) until all seems to be working well. This will save you time just waiting for things to run.

grep -w chr15 C_RNAseq_all.sam > C_RNAseq_chr15.sam
samtools view -b C_RNAseq_chr15.sam -o C_RNAseq_chr15.bam
samtools sort -n C_RNAseq_chr15.bam -o C_RNAseq_chr15_sorted.bam

Now we can convert to bed12 format for display in the browser. bed12 format specifications can be found here:
https://genome.ucsc.edu/FAQ/FAQformat.html#format1

bedtools bamtobed -bed12 -split -i C_RNAseq_chr15_sorted.bam > C_RNAseq_chr15_pe.bed12.bed

# These files will need to be sorted by position before uploading to browser.

Review bedtools commands and options here:

bam to bed: http://bedtools.readthedocs.io/en/latest/content/tools/bamtobed.html
genome coverage: http://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html

12 Displaying splice junctions

Figure 7: RNAseq browser

Although deeper exon coverage is readily apparent, there is no precise information regarding the splice sites that are detected.

Note: You can display splice junctions for any RNAseq data - stranded or not, paired-end or not.

We start with a bam file that is sorted by position, not name.

12.1 Move header to a new file

samtools sort C_RNAseq_chr15.bam > C_RNAseq_chr15_pos.bam
samtools view -H C_RNAseq_chr15_pos.bam > C_RNAseq_chr15_spliced.sam

12.2 Extract splice junction reads

samtools view C_RNAseq_chr15_pos.bam | awk '($6 ~/N/)' >> C_RNAseq_chr15_spliced.sam

12.3 Convert back to bam

samtools view -b C_RNAseq_chr15_spliced.sam > C_RNAseq_chr15_spliced.bam

12.4 Convert to bed12

Example of bed12 format:

head C_RNAseq_chr15_spliced_bed12.bed

## chr15    43793063    43793762    SRR5882541.161/1    60  -   43793063    43793762    255,0,0 2   20,53   0,646
## chr15    89321202    89321752    SRR5882541.176/1    60  +   89321202    89321752    255,0,0 2   58,17   0,533
## chr15    89321821    89322005    SRR5882541.176/2    60  -   89321821    89322005    255,0,0 2   32,44   0,140
## chr15    44715682    44716385    SRR5882541.229/2    60  +   44715682    44716385    255,0,0 2   19,57   0,646
## chr15    69455255    69455468    SRR5882541.472/1    60  -   69455255    69455468    255,0,0 2   32,41   0,172
## chr15    69455255    69455468    SRR5882541.472/2    60  +   69455255    69455468    255,0,0 2   32,41   0,172
## chr15    69426433    69429161    SRR5882541.561/2    60  +   69426433    69429161    255,0,0 2   24,51   0,2677
## chr15    90967188    90969543    SRR5882541.908/1    60  +   90967188    90969543    255,0,0 3   14,42,20    0,1890,2335
## chr15    90969571    90970427    SRR5882541.908/2    60  -   90969571    90970427    255,0,0 2   52,24   0,832
## chr15    22872955    22873548    SRR5882541.1109/1   60  +   22872955    22873548    255,0,0 2   17,58   0,535

bamToBed -bed12 -i C_RNAseq_chr15_spliced.bam > C_RNAseq_chr15_spliced_bed12.bed

IMPORTANT: If using paired-end data, you will need to flip the strand on this file and then add tracklines and upload. In the name column (column 4), mate pair one reads end with "/1" and mate pair 2 reads end in "/2". We can use this to "flip" mate pair 1 reads to the opposite strand. Below is an awk script that can do this.

#flip
awk '{OFS="\t"; if($4 ~"/2") print $0; else if($4 ~"/1" && $6 == "+") print($1, $2, $3,$4,$5,"-", $7,$8,$9,$10,$11,$12); else if($4 ~"/1" && $6 == "-") print($1, $2, $3,$4,$5,"+", $7,$8,$9,$10,$11,$12)}' C_RNAseq_chr15_spliced_bed12.bed > C_RNAseq_chr15_spliced_bed12_R1Flip.bed

#add trackline  
awk 'BEGIN { print "browser position chr15:38,408,924-38,589,493"; print "track type=bed name=\"C_RNAseq_chr15_pe\" description=\"C_RNAseq_chr15_pe\"  visibility=squish autoScale=on colorByStrand=\"255,0,0 0,0,255\""} { print $0}' C_RNAseq_chr15_spliced_bed12_R1Flip.bed > C_RNAseq_chr15_spliced_header.bed

https://genome.ucsc.edu/s/Mike%20Guertin/hg38_RNA_splicing_class

Figure 8: RNAseq splice junctions

We can then make bedgraph files as before using the -split and -strand options. However, the input .bed files need to be sorted by position, unmapped reads discarded (. in columns 1 or 3) and undergo some more processing first.

First, sort the bed file by position.

sort -k1,1 -k2,2n C_RNAseq_chr15_pe.bedpe.bed | grep -v -w "\." > C_RNAseq_chr15_pe.bedpe_sorted.bed
sort -k1,1 -k2,2n C_RNAseq_chr15_pe.bed12.bed | grep -v -w "\." > C_RNAseq_chr15_pe.bed12_sorted.bed

IMPORTANT: Bedtools can properly convert stranded data to a bedGraph viewable in the browser. However, it does not consider the strandedness of the pairs, so one of the mates in a pair will always map to the wrong strand creating data on both strands where it should only be one one. To get around this, we must manually flip the read that is the reverse complement of the RNA to the other strand. In this case, we must flip read 1. First, let’s look at a bed12 file and see how we can parse it ourselves.

head C_RNAseq_chr15_pe.bed12_sorted.bed | tail -5

## chr15    17014168    17014241    SRR5882548.2115642/1    1   +   17014168    17014241    255,0,0 1   73  0
## chr15    17014168    17014241    SRR5882548.2115642/2    1   -   17014168    17014241    255,0,0 1   73  0
## chr15    17014168    17014241    SRR5882548.6017347/1    1   +   17014168    17014241    255,0,0 1   73  0
## chr15    17014168    17014241    SRR5882548.6017347/2    1   -   17014168    17014241    255,0,0 1   73  0
## chr15    17016468    17016540    SRR5882544.2934632/1    0   +   17016468    17016540    255,0,0 1   72  0

In the name column (column 4), mate pair one reads end with "/1" and mate pair 2 reads end in "/2". We can use this to "flip" mate pair 1 reads to the opposite strand. Below is an awk script that can do this.

awk '{OFS="\t"; if($4 ~"/2") print $0; else if($4 ~"/1" && $6 == "+") print($1, $2, $3,$4,$5,"-", $7,$8,$9,$10,$11,$12); else if($4 ~"/1" && $6 == "-") print($1, $2, $3,$4,$5,"+", $7,$8,$9,$10,$11,$12)}' C_RNAseq_chr15_pe.bed12_sorted.bed > C_RNAseq_chr15_bed12_R1Flip.bed

Now create the bedGraphs, add track lines, zip the files, transfer them to our computer, and upload to UCSC;


hgSizes=/home/FCAM/meds5420/genomes/hg38.chrom.sizes
bedtools genomecov -bg -split -strand + -i C_RNAseq_chr15_bed12_R1Flip.bed -g $hgSizes > C_RNAseq_chr15_pe_R1flip_plus.bedGraph

bedtools genomecov -bg -split -strand - -i C_RNAseq_chr15_bed12_R1Flip.bed -g $hgSizes > C_RNAseq_chr15_pe_R1flip_minus.bedGraph

awk 'BEGIN { print "browser position chr15:38,449,131-38,569,510"; print "track type=bedGraph name=\"C_RNAseq_chr15_plus\" description=\"C_RNAseq_chr15_plus\"  visibility=squish autoScale=on color=\"255,0,0\""} { print $0}' C_RNAseq_chr15_pe_R1flip_plus.bedGraph > C_RNAseq_chr15_pe_plus.bedGraph

awk 'BEGIN { print "browser position chr15:38,449,131-38,569,510"; print "track type=bedGraph name=\"C_RNAseq_chr15_minus\" description=\"C_RNAseq_chr15_minus\"  visibility=squish autoScale=on color=\"0,0,255\""} { print $0}' C_RNAseq_chr15_pe_R1flip_minus.bedGraph > C_RNAseq_chr15_pe_minus.bedGraph

gzip C_RNAseq_chr15_pe_plus.bedGraph
gzip C_RNAseq_chr15_pe_minus.bedGraph

#sftp

https://genome.ucsc.edu/s/Mike%20Guertin/hg38_PE_class

Figure 9: RNAseq Paired End browser

13 Molecular biology of library preparations

The molecular biology of library preps diverge depending upon whether the experimenter wishes to preserve the RNA strand information in the sequencing data. Unstranded libraries were common over a decade ago, but I see no reason why contemporary research would employ strand-ambiguous RNA-seq. Therefore, I will briefly discuss the workflow for the library preparation that I prefer and how clever molecular biologists developed methods to retain the strand specificity of the RNA throughout the library prep.
1) Isolate RNA.
2) Negative selection of rRNA.
3) Fragment the RNA with heat.
4) Anneal random hexamer DNA primers to the RNA.
5) Perform reverse transcription of the RNA using a reverse transcriptase that is engineered to operate at high temperatures. RT at higher temperatures relieves secondary structure of the RNA and leads to less biased libraries. Including actinomycin D at this step prevents second strand synthesis during the RT by inhibiting DNA-dependent DNA synthesis.
6) Perform second strand synthesis using the dUTP nucleotide as opposed to dTTP. dUTP incorporation is important for conferring strand specificity.
7) The next step is to blunt the ends of the DNA and generate a 5’ A overhang. T4 Polymerase harbors 3’ to 5’ exonuclease activity and 5’ to 3’ polymerase activity, which will blunt both overhangs. T4 Polymerase also phosphorylates 5’ DNA ends.
8) Taq polymerase is used to add an A to the 3’ end of the dsDNA. The high throughput sequencing adapter is usually a forked Y adapter that contains a T overhang on the 5’ end. This pairs with the A overhang, traditionally referred to as TA cloning. Alternatively, a hairpin adapter is used that incorporates a dUTP that effectively generates a Y adapter when the dUTP is digested.
9) Ligate the sequencing adapters with DNA ligase.
10) Digest dUTPs incorporated during the second strand synthesis with uracil-DNA glycosylase (UGDase). This confers sequence specificity because now you have a single strand DNA molecule with a distinct adapter on each end. The ssDNA represents the template strand.
11) The last step is to PCR amplify with a variety of multiplexed adapters that permit pooling of samples.

The dsDNA molecules can be sequenced using a specific primer for each end of the molecule. Refer to step 10 to determine which sequencing primer would result in the coding strand and which would be the reverse complement thereof.

Figure 10: RNA-seq molecular biology

NEXT TIME: Differential gene expression analysis.

14 Answers to in class exercise 1:


anno=/home/FCAM/meds5420/annotations/gencode.v39.annotation.gtf

grep -w "chr15" $anno > gencode.v39.annotation.chr15.gtf
annochr15=gencode.v39.annotation.chr15.gtf
chr=chr15

#for i in /home/FCAM/meds5420/data/RNA_seq/C*_RNAseq_rep*_*.sam
touch mergelist.txt
for i in /home/FCAM/meds5420/data/RNA_seq/C*_RNAseq_rep*[FR].sam
do
    name=$(echo $i | rev | cut -d "/" -f 1 | rev | cut -d "." -f 1)
    echo $name
    grep -w $chr $i > ${name}_chr15.sam
    samtools sort -@ 4 -o ${name}_chr15_sorted.bam ${name}_chr15.sam
    stringtie -p 4 -G $annochr15 -o $name.gtf -l ${name} ${name}_chr15_sorted.bam
    echo $name.gtf >> mergelist.txt
done

grep "F.gtf" mergelist.txt > mergelist_F.txt
grep "R.gtf" mergelist.txt > mergelist_R.txt

fr="F R"
for x in $fr
do
    #2 merge transcripts
    stringtie --merge -p 4 -G $annochr15 -o stringtie_merged_${x}.gtf mergelist_${x}.txt
    #3 compare
    gffcompare -R -r $annochr15 -o stringtie_merged_${x}_all stringtie_merged_${x}.gtf
done    

The `R` oriented files have higher precision and fewer novel features. Novel features are false positives (FP), because we consider the reference gene annotations as true positives (TP). precision is calculated as Precision = TP / (TP+FP). All the novel features (FP counts) in `F` are actually found because the orientation of the strand is not correct. This further confirms that `R` strandedness is correct.


#4 abundance

name=C1_RNAseq_rep1_
for x in $fr
do
    stringtie -e -B -p 8 -G $annochr15 -o stringtie_est_${x}.gtf  -A ${name}${x}.tab ${name}${x}_chr15_sorted.bam
#5 tests    
    echo ${name}${x} HSP90 counts
    grep -i hsp90 ${name}${x}.tab
    echo cumulative Coverage  
    awk 'NR>1 {sum+=$7} END{print "sum=",sum}' ${name}${x}.tab 
    echo Cumulative FPKM
    awk 'NR>1 {sum+=$8} END{print "sum=",sum}' ${name}${x}.tab 
    echo Cumulative TPM
    awk 'NR>1 {sum+=$9} END{print "sum=",sum}' ${name}${x}.tab 
done

I did a few tests to see if R or F is correct:
1) I did not know what highly expressed genes were on chr15, so I googled the official gene names (i.e. Actin ==ACTN) and looked at expression of all the candidates: Actin, Tubulin, gapdh, and hsp90. A few variants of hsp90 are present on chr15 and there is coverage and normalized counts in the R orientation. 
2) I looked at the total Coverage, FPKM, and TPM for R and F. the FPKM and TPM are normalized for read depth and since fewer reads align to the wrong orientation, the cumulative values are similar. However, coverage is per-base coverage of the gene and not normailzed by read depth, so this value is directly comparable between R and F input files. The coverage is ~10 fold higher in the R orientation, further confirming R is correct. Coverage would be comparable in R and F if the libraries were unstranded, then we would need to remap with the --rna-standedness flag not specified.

15 Answers to class exercise 2:


annochr15=gencode.v39.annotation.chr15.gtf

for i in C1_RNAseq_rep1_[FR]_chr15_sorted.bam
do
    name=$(echo $i | rev | cut -d "/" -f 1 | rev | cut -d "." -f 1)
    echo $name
    htseq-count -r pos -f bam --stranded=yes ${i} $annochr15 > $name.gene.counts.yes.txt
    htseq-count -r pos -f bam --stranded=reverse ${i} $annochr15 > $name.gene.counts.reverse.txt
    tail $name.gene.counts.yes.txt
    tail $name.gene.counts.reverse.txt
    echo $name
    echo yes
    grep -v __ $name.gene.counts.yes.txt | awk '{sum+=$2} END{print "sum=",sum}'
    echo reverse
    grep -v __ $name.gene.counts.reverse.txt | awk '{sum+=$2} END{print "sum=",sum}'
done
# many more counts in reverse indicate that reverse is counting reads coming from coherent annotation/read strands

# I know that they are all generated using the same molecular bio protocols
# If I was unsure, I would write a loop to go through all of them

16 Answers to in class exercise 3:


grep -w "chr15" gencode.v39.annotation.gtf > gencode.v39.chr15.annotation.gtf

extract_splice_sites.py gencode.v39.chr15.annotation.gtf > hg38_hisat_chr15_splicesites.txt

extract_exons.py gencode.v39.chr15.annotation.gtf > hg38_hisat_chr15_exons.txt

cp /home/FCAM/meds5420/genomes/chroms/chr15.fa ./ 

hisat2-build -p 8 --ss hg38_hisat_chr15_splicesites.txt --exon hg38_hisat_chr15_exons.txt chr15.fa chr15_hisat2

name=C1_RNAseq_rep1
cp /home/FCAM/meds5420/data/RNA_seq/$name.fastq.gz ./

hisat2 -p4 --dta --rna-strandness R -x chr15_hisat2 -U $name.fastq.gz -S $name.sam 

samtools sort -@ 4 -o $name_sorted.bam $name.sam 

bedtools genomecov -bg -split -strand - -ibam $name_sorted.bam -trackline -trackopts 'name=C1_RNAseq_rep1_plus description="C1_RNAseq_rep1_plus" visibility=full autoScale=on alwaysZero=on color=255,0,0 windowingFunction=maximum' > C1_RNAseq_rep1_plus.bedGraph

bedtools genomecov -bg -split -strand + -ibam C1_RNAseq_rep1_sorted.bam -trackline -trackopts 'name=C1_RNAseq_rep1_minus description="C1_RNAseq_rep1_minus" visibility=full autoScale=on alwaysZero=on color=0,0,255 windowingFunction=maximum' > C1_RNAseq_rep1_minus.bedGraph

gzip *bedGraph

#sftp the bedGraphs to your computer and upload to UCSC.

# my session:

# https://genome.ucsc.edu/s/Mike%20Guertin/hg38_rna_seq_class_C1

#interpret the data

#The library is stranded because each strand separated track aligns to gene annotations 
#specifically, as opposed to both strand separated tracks having comparable coverage at #each gene annotation. Since the library is firststrand stranded (dUTP), the incoherent (- specification and plus labeling) is correct.

MEDS5420 - Lectures 21, 22, and 23 - RNA-seq mapping, visualization, and transcript assembly

April 11, 13, and 18 2022

Contents

0.1 Cleaning up data

1 Building genome with HISAT2

1.1 GTFs:

2 Building your annotated genome for mapping with HISAT2.

3 Alignment with HISAT2

3.1 Note on `--rna-strandness`

3.2 Intermediate file conversions

4 StringTie for assembling transcripts

4.1 StringTie for merging transcripts

4.2 gffcompare to compare to annotations:

4.3 Calculate transcript abundances with stringtie for use with ballgown or other DE analysis software.

5 In class exercise 1:

6 Counting reads in transcripts with HTseq

6.1 Important update on HTseq-count

6.2 Strandedness with HTSeq

7 In class exercise 2:

8 Use Genome Coverage from bedtools to create a bedGraph:

8.1 Note on visualizing stranded data with `genomeCoverageBed`

9 In class exercise 3:

10 Code chunks and commentary not covered in class

11 Visualizing stranded PE data:

11.1 Mapping paired end and stranded data with HISAT2

12 Displaying splice junctions

12.1 Move header to a new file

12.2 Extract splice junction reads

12.3 Convert back to bam

12.4 Convert to bed12

13 Molecular biology of library preparations

14 Answers to in class exercise 1:

15 Answers to class exercise 2:

16 Answers to in class exercise 3:

MEDS5420 - Lectures 21, 22, and 23 - RNA-seq mapping, visualization, and transcript assembly

April 11, 13, and 18 2022

Contents

0.1 Cleaning up data

1 Building genome with HISAT2

1.1 GTFs:

1.2 Using HISAT2 and related software on server:

2 Building your annotated genome for mapping with HISAT2.

3 Alignment with HISAT2

3.1 Note on --rna-strandness

3.2 Intermediate file conversions

4 StringTie for assembling transcripts

4.1 StringTie for merging transcripts

4.2 gffcompare to compare to annotations:

4.3 Calculate transcript abundances with stringtie for use with ballgown or other DE analysis software.

5 In class exercise 1:

6 Counting reads in transcripts with HTseq

6.1 Important update on HTseq-count

6.2 Strandedness with HTSeq

7 In class exercise 2:

8 Use Genome Coverage from bedtools to create a bedGraph:

8.1 Note on visualizing stranded data with genomeCoverageBed

9 In class exercise 3:

10 Code chunks and commentary not covered in class

11 Visualizing stranded PE data:

11.1 Mapping paired end and stranded data with HISAT2

12 Displaying splice junctions

12.1 Move header to a new file

12.2 Extract splice junction reads

12.3 Convert back to bam

12.4 Convert to bed12

13 Molecular biology of library preparations

14 Answers to in class exercise 1:

15 Answers to class exercise 2:

16 Answers to in class exercise 3:

3.1 Note on `--rna-strandness`

8.1 Note on visualizing stranded data with `genomeCoverageBed`