0.1 Cleaning up data
1 Building genome with HISAT2
- 1.1 GTFs:
- 1.2 Using HISAT2 and related software on server:
2 Building your annotated genome for mapping with HISAT2.
3 Alignment with HISAT2
- 3.1 Note on --rna-strandness
- 3.2 Intermediate file conversions
4 StringTie for assembling transcripts
- 4.1 StringTie for merging transcripts
- 4.2 gffcompare to compare to annotations:
- 4.3 Calculate transcript abundances with stringtie for use with ballgown or other DE analysis software.
5 In class exercise 1:
6 Counting reads in transcripts with HTseq
- 6.1 Important update on HTseq-count
- 6.2 Strandedness with HTSeq
7 In class exercise 2:
8 Use Genome Coverage from bedtools to create a bedGraph:
- 8.1 Note on visualizing stranded data with genomeCoverageBed
9 In class exercise 3:
10 Processing paired end RNA-seq and visualizing splicing
11 Visualizing stranded PE data:
- 11.1 Mapping paired end and stranded data with HISAT2
12 Displaying splice junctions
- 12.1 Move header to a new file
- 12.2 Extract splice junction reads
- 12.3 Convert back to bam
- 12.4 Convert to bed12
13 Making bedGraphs files of C_RNAseq_chr17_pe.bed12.bed data above (all split spliced reads and non-split reads).
14 Exercise 4: Extract and visualize splice junctions from paired end RNA-seq data.
15 For your reference: Molecular biology of library preparations
16 Answers to in class exercise 1:
17 Answers to class exercise 2:
18 Answers to in class exercise 3:
19 Answers to in class exercise 4:

Figure 1: RNAseq workflows

0.1 Cleaning up data

RNA-SeQC is a first-step option for quality control: http://archive.broadinstitute.org/cancer/cga/rna-seqc
RNA-SeQC is not available on the Xanadu server, but you can prefilter for quality and remove adapters just as in ChIP-seq:

fastqc or mutliQC: https://multiqc.info/docs/
quality filter / trimming
adapter removal

NOTE: Illumina adapters have barcodes these days, so when removing adapter from multiple files, just use the constant regions before the variable region (bolded below) to filter against. Example:\

GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG

Here, the bolded sequence is the variable region, so just use the region before it (GATCGGAAGAGCACACGTCTGAACTCCAGTCAC) for filtering.

1 Building genome with `HISAT2`

Things you need:

Indexed genome file(s): /home/FCAM/meds5420/genomes/hg38_hisat2
File with transcript and exon annotations (GFF or GTF files). :
/home/FCAM/meds5420/annotations/gencode.v39.annotation.gtf
installation of hisat2, stringTie, and GFFcompare: see below

1.1 GTFs:

GTF (General Transfer Format) format is a table of individual exons with retention of transcript ID.
Explanation of GTF and GFF files can be found here.

GENCODE GTFs This can be retrieved from GENCODE (Figure 2):

Figure 2: GENCODE GTF website

head -5 gencode.v39.annotation.gtf

## ##description: evidence-based annotation of the human genome (GRCh38), version 39 (Ensembl 105)
## ##provider: GENCODE
## ##contact: gencode-help@ebi.ac.uk
## ##format: gtf
## ##date: 2021-09-02

Obtaining .GTF (Gene Transfer Format) files from UCSC (Figure 3):

Figure 3: UCSC GTF Table Browser

1.2 Using HISAT2 and related software on server:

Use module to load necessary software:

module load hisat2
module load stringtie
module load gffcompare

2 Building your annotated genome for mapping with HISAT2.

HISAT2 will map to your genome, but will also use gene annotations to map to known exons and splice junctions specifically.

First, we need to use the GTF file to create a list of exons and splice sites to use when we map.
There are python scripts that come with the hisat2 software for this purpose.


extract_splice_sites.py gencode.v39.annotation.gtf > hg38_hisat_splicesites.txt

extract_exons.py gencode.v39.annotation.gtf > hg38_hisat_exons.txt

Here’s what the files look like:

head -5 ./hisat2/hg38_hisat_splicesites.txt

## chr1 12056   12178   +
## chr1 12226   12612   +
## chr1 12696   12974   +
## chr1 12720   13220   +
## chr1 13051   13220   +

head -5 ./hisat2/hg38_hisat_exons.txt

## chr1 11868   12226   +
## chr1 12612   12720   +
## chr1 12974   13051   +
## chr1 13220   14500   +
## chr1 15004   15037   -

These files are then used in building your annotated genome for mapping RNA-seq data. Here’s how to build it:

hisat2-build -p 8 --ss hg38_hisat_splicesites.txt --exon hg38_hisat_exons.txt hg38.fa hg38_hisat2

Should output the following to the screen (actually to stderr):

Settings:
  Output files: "hg38_hisat2.*.ht2"
  Line rate: 7 (line is 128 bytes)
  Lines per side: 1 (side is 128 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Local offset rate: 3 (one in 8)
  Local fTable chars: 6
  Local sequence length: 57344
  Local sequence overlap between two consecutive indexes: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  ../hg38.fa
Reading reference sizes
  Time reading reference sizes: 00:00:19
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
etc,
etc,
etc,
Total time for call to driver() for forward index: 02:47:58

NOTE: THIS IS THE MAJOR BOTTLENECK WITH HISAT2.

You can plan on this step taking ~2 hours and requiring ~160GB RAM for a large-ish genome such as human. You cannot do this with your average laptop, but you can run scripts that require high memory on the server with a special option.
Alternatively you may be able to download a pre-built genome for your organism from the HISAT2 website: https://ccb.jhu.edu/software/hisat2/index.shtml.
- In this case you lose flexibility due to being are locked in to the annotation set used in the build.
You can instead use annotations during the mapping, or (preferred) create a custom annotation using your initial alignment (after post processing below) and then re-align your data using the custom annotation set for good coverage.
Also, if you build your genome without annotations it requires much less memory and can probably be built on a desktop or even laptop.

To build your own genome of the server you can create a qsub script and make sure that the following lines are added to the header:

#SBATCH -c=32
#SBATCH -p himem
#SBATCH --qos=himem
#SBATCH --mem=256G

This would use half of the RAM and cores one one high memory node (i.e. each has 64 cores with 512G RAM).

This information is available on the BBC cluster site:

3 Alignment with HISAT2

HISAT2 manual:

hisat2 -p 4 --dta --rna-strandness F -x $ht2_gen -U INFILE.fastq -S OUTFILE.sam 

# ht2_gen=/home/FCAM/meds5420/genomes/hg38_hisat2/hg38_hisat2

Output is standard .sam output plus optional columns that provide greater detail. See manual for details.

Options are VERY important for RNA mapping - Some useful options:
* --dta: downstream transcript assembly -
* --rna-strandness: specify when data is stranded or not. Default is not stranded. F means read represents transcript, R means transcript is reverse complement. This is only used in stringtie and other software from the tuxedo suite. Invoking R or F here does not change the hexadecimal SAM flag. Instead, every read alignment will have an XS attribute tag with a - or + associated with the read alignment. This attribute tag is only (from what I can tell) used for downstream applications in the tuxedo suite.
- --tmo: transcript mapping only: only map to known transcripts.
- --phred33 or --phred64: specify Illumina quality score conversion.
- --summary-file [fileName.txt]: Print alignment summary to this file.
- --time: print time for alignment to the screen (sent to stnderr)
- --no-softclip prevents aligner from clipping ends that do not match
- --known-splicesite_infile: can add splice sites here if not in your assembly (need to extract splice sites with the python script).

3.1 Note on `--rna-strandness`

The option --rna-strandness is dependent upon the molecular biology that was used in library construction. Even when the molecular biology is known (sometimes it is not if you send out to a company), I recommend performing a coherence check like this by running hisat with the --rna-strandness F or --rna-strandness R option. If your data is stranded, you can also determine whether the R or F is the correct option for --rna-strandness. What do you expect to see if you chose the incorrect orientation? I will continue to stress this point because “pipelines” that ignore the molecular biology and these types of coherence checks can get researchers into trouble. In 2020 I had a student take my class and they realized that they were treating their unstranded libraries as stranded libraries in the analysis. This should not effect the conclusions, but using the wrong strand will. Treating unstranded libraries as stranded libraries is equivalent of sequencing to half the read depth—why?

3.2 Intermediate file conversions

Next we have to convert the alignments to a .bam file and sort them.

samtools view -bS INFILE.sam > OUTFILE.bam

samtools sort -@ 4 INFILE.bam -o OUTFILE_sorted.bam

# no extension needed for output file name
# -@ option designates # of cores for multi-threading 
# Important to remove intermediate files, especially when working on the server.

rm INFILE.sam
rm INFILE.bam

NOTE: the latest versions of samtools can convert to .bam and sort in one step.

samtools sort -@ 4 -o OUTFILE_sorted.bam INFILE.sam

If you are interested in the number of reads with splice junctions in them you can extract them from a .bam or .sam as below:

samtools view INFILE.sam | awk '($6 ~ /N/)' > OUTFILE_spliced_reads.sam

This command uses the CIGAR (Compact Idiosyncratic Gapped Alignment Report) string to find gaps in reads with no match, which is designated by the N.
M: Alignment match (can be a sequence match or mismatch)
I: Insertion to the reference
D: Deletion from the reference
N: Skipped region from the reference
S: Soft clip on the read (clipped sequence present in sequence)
H: Hard clip on the read (clipped sequence NOT present in sequence)
P: Padding (silent deletion from the padded reference sequence)

head -3 C1_RNAseq_rep1_F_chr15.sam

## @SQ  SN:chr15    LN:101991189
## SRR5882541.93    16  chr15   52066185    60  74M1S   *   0   0   TTTATTTTATTGAATACTGTCTGTATCTTTGGTTATCCTGTTTGAAGAAAAAGGACAAATAAAACATGGCCAGCC F.AFFFF.FFFFFFAFFFFFFFFAFFFFFAFFAFFAFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFAAAAA AS:i:-1 ZS:i:-11    XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:74 YT:Z:UU XS:A:-  NH:i:1
## SRR5882541.161   16  chr15   43793064    60  20M626N53M  *   0   0   GCTGCCGCCCGCAAGCAGAGGGACTCGGAGATCATGCAGCAGAAGCAGAAAAAGGCAAACGAGAAGAAGGAGG   FFAFF<FFFFA<FFF7A<<<F<FF.FFAFFA<FFFFF<AFFFFFFF7F<FFFFFFFFFFAAFFF7FFFAAAAA   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:73 YT:Z:UU XS:A:-  NH:i:1

head -2 C1_RNAseq_rep1_F_chr15_spliced_reads.sam

## SRR5882541.161   16  chr15   43793064    60  20M626N53M  *   0   0   GCTGCCGCCCGCAAGCAGAGGGACTCGGAGATCATGCAGCAGAAGCAGAAAAAGGCAAACGAGAAGAAGGAGG   FFAFF<FFFFA<FFF7A<<<F<FF.FFAFFA<FFFFF<AFFFFFFF7F<FFFFFFFFFFAAFFF7FFFAAAAA   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:73 YT:Z:UU XS:A:-  NH:i:1
## SRR5882541.176   0   chr15   89321203    60  1S58M475N17M    *   0   0   CGGGTGTAGCCAGGTGGGGCCTGCACCATGGCTTTCAACTCACTGCCTACTCGGTCAGGCCGGGCATTGCTGGCGG    AAAAAFF.FFFFFFFFFFFFFFFF<FAFFFFF.FFFA)AFF.FFFFFF.AFFFFF77FFF7FFFF.AFFFFFFFFF    AS:i:-1 XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:75 YT:Z:UU XS:A:+  NH:i:1

4 StringTie for assembling transcripts

In the author’s words:
StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts. It uses a novel network flow algorithm as well as an optional de novo assembly step to assemble and quantitate full-length transcripts representing multiple splice variants for each gene locus. Its input can include not only alignments of short reads that can also be used by other transcript assemblers, but also alignments of longer sequences that have been assembled from those reads. In order to identify differentially expressed genes between experiments, StringTie’s output can be processed by specialized downstream software.

StringTie Manual

Assemble transcripts with StringTie:

stringtie -p 4 -G gencode.v39.annotation.gtf -o OUTFILE.gtf -l TREATMENT_CELLS INFILE_sorted.bam

variables

-G: known annotation GTF
-o: output GTF
-l: prefix for transcript IDs
last entry is the input .bam file
--rf: Assumes a stranded library fr-firststrand
--fr: Assumes a stranded library fr-secondstrand
I empirically found that --rf and --fr are ignored if HISAT2 was invoked with the --rna-strandness argument as R or F.

4.1 StringTie for merging transcripts

At this point, if you want to do differential gene expression analysis, you will need to merge all of the transcripts from all sequencing samples and conditions.

Figure 4: Merging transcripts

stringtie --merge -p 8 -G gencode.v39.annotation.gtf -o stringtie_merged.gtf mergelist.txt

mergelist.txt is a simple text file with the names of each file to merge on a separate line.

4.2 gffcompare to compare to annotations:

gffcompare manual

gffcompare -R -r gencode.v39.annotation.gtf -o stingtie_merged stringtie_merged.gtf

The -R option forces the program to only consider transcripts for which there is data, so that the number of ‘missed’ exons or transcripts is not influenced simply by which genes are expressed.

Output files:

stringtie_merged.combined.gtf
stringtie_merged.loci
stringtie_merged.stats
stringtie_merged.tracking
stringtie_merged.gtf.refmap
stringtie_merged.gtf.tmap

less stingtie_merged.stats

## # gffcompare v0.12.6 | Command line was:
## #gffcompare -R -r /home/FCAM/meds5420/annotations/gencode.v39.annotation.gtf -o stingtie_merged stringtie_merged.gtf
## #
## 
## #= Summary for dataset: stringtie_merged.gtf 
## #     Query mRNAs :  245645 in   57989 loci  (218720 multi-exon transcripts)
## #            (22471 multi-transcript loci, ~4.2 transcripts per locus)
## # Reference mRNAs :  242722 in   57415 loci  (216649 multi-exon)
## # Super-loci w/ reference transcripts:    57413
## #-----------------| Sensitivity | Precision  |
##         Base level:   100.0     |    99.7    |
##         Exon level:    99.7     |    99.4    |
##       Intron level:   100.0     |    99.5    |
## Intron chain level:   100.0     |    99.1    |
##   Transcript level:   100.0     |    98.8    |
##        Locus level:    99.9     |    98.9    |
## 
##      Matching intron chains:  216647
##        Matching transcripts:  242608
##               Matching loci:   57373
## 
##           Missed exons:       0/652609   (  0.0%)
##            Novel exons:    2570/655065   (  0.4%)
##         Missed introns:       0/395147   (  0.0%)
##          Novel introns:    1901/397160   (  0.5%)
##            Missed loci:       0/57415    (  0.0%)
##             Novel loci:     576/57989    (  1.0%)
## 
##  Total union super-loci across all input datasets: 57989 
## 245645 out of 245645 consensus transcripts written in stingtie_merged.annotated.gtf (0 discarded as redundant)

head stringtie_merged.gtf.refmap

## ref_gene ref_id  class_code  qry_id_list
## DDX11L1  ENST00000456328.2   =   ENSG00000223972.5|ENST00000456328.2
## DDX11L1  ENST00000450305.2   =   ENSG00000223972.5|ENST00000450305.2
## MIR1302-2HG  ENST00000473358.1   =   ENSG00000243485.5|ENST00000473358.1
## MIR1302-2HG  ENST00000469289.1   =   ENSG00000243485.5|ENST00000469289.1
## MIR1302-2    ENST00000607096.1   =   ENSG00000284332.1|ENST00000607096.1
## OR4G4P   ENST00000606857.1   =   ENSG00000268020.3|ENST00000606857.1
## OR4G11P  ENST00000642116.1   =   ENSG00000240361.2|ENST00000642116.1
## OR4G11P  ENST00000492842.2   =   ENSG00000240361.2|ENST00000492842.2
## OR4F5    ENST00000641515.2   =   ENSG00000186092.7|ENST00000641515.2

4.3 Calculate transcript abundances with stringtie for use with ballgown or other DE analysis software.

stringtie -e -B -p 8 -G stringtie_merged.gtf 
          -o OUTFILE_est.gtf 
          -A OUTFILE_abun.tab INFILE_sorted.bam

Options:

-e: Only quantify transcripts that agree with reference annotations.
-B: Converts data to a table (.ctab) that is compatible with Ballgown DE analysis
-o: Path and name of new .GTF file with coverage data. Accessory files will be written to same path and path will be created if it doesn’t exist.
-A: Create an abundance table that reports transcript level quantification.
--rf: Assumes a stranded library fr-firststrand
--fr: Assumes a stranded library fr-secondstrand
I empirically found that --rf and --fr are ignored if HISAT2 was invoked with the --rna-strandness argument as R or F.

There are several outputs, the -A-specified abundance table contains the data for each transcript (coverage, FPKM, and TPM values). You can use grep and your favorite gene’s name to look at its transcript abundance.

Another way to test for R vs. F is to calculate abundance using the reference annotations and then look at accumulated Coverage (7th column). The FPKM and TPM are normalized for read depth and since fewer reads align to the wrong orientation, the cumulative values are similar. However, coverage is per-base coverage of the gene and not normailzed by read depth, so this value is directly comparable between R and F input files.

stringtie -e -B -p 8 -G gencode.v39.annotation.gtf -o R_est_anno.gtf -A R_abun_anno.tab C1_RNAseq_rep1_R_sorted.bam
stringtie -e -B -p 8 -G gencode.v39.annotation.gtf -o F_est_anno.gtf -A F_abun_anno.tab C1_RNAseq_rep1_F_sorted.bam

# and awk code to skip the header and sum over a column
awk 'NR > 1' R_abun_anno.tab | awk '{sum+=$7;} END{print sum;}'
awk 'NR > 1' F_abun_anno.tab | awk '{sum+=$7;} END{print sum;}'

5 In class exercise 1:

I have aligned RNA-seq data from two clones of the same progenitor cell line to the hg38 genome with HISAT2. These are two clones, clone 1 and clone 7, that were expanded upon single cell dilution. The genome file indicies I used for alignment is here: /home/FCAM/meds5420/genomes/hg38_hisat2. I used the exact hisat2 command:

ht2_gen=/home/FCAM/meds5420/genomes/hg38_hisat2/hg38_hisat2

for i in C*rep*fastq.gz
do
    name=$(echo $i | cut -d "." -f 1)
    echo $name
    hisat2 -p4 --dta --rna-strandness R -x $ht2_gen -U $i -S ${name}_R.sam
    hisat2 -p4 --dta --rna-strandness F -x $ht2_gen -U $i -S ${name}_F.sam
done

The .sam files can be found here:

Folder location: /home/FCAM/meds5420/data/RNA_seq
Files:
- C1_RNAseq_rep*_R.sam
- C7_RNAseq_rep*_R.sam
- C1_RNAseq_rep*_F.sam
- C7_RNAseq_rep*_F.sam

The annotation .gtf file can be found here: /home/FCAM/meds5420/annotations/gencode.v39.annotation.gtf
Extract your specific chromosome from the annotation file to speed things up.

Access the aligned data and with each R and F strandedness file (using a loop) and with “your” specific chromosome perform:

transcript assembly with stringtie. In your loop you will want to generate a file with a list of individual annotation file names that you will use in the next step.
merge transcripts from each assembly, but keep R and F assemblies separate so you can compare them.
comparison the R and F merged assemblies to annotations with a focus on the .stats file. Interpret the stats file with a focus on Precision and novel features. Provide a written explanation of your interpretation with respect to the --rna-standedness input.
stringtie quantification of transcripts that match the annotation (i.e. not your merged transcripts gtf) using the C1_RNAseq_rep1_F and C1_RNAseq_rep1_R sorted BAM files from your chromosome.
Is the R or F option, or unstranded correct for these libraries? Perform tests to determine this and provide written explanations. What would you expect if the libraries were unstranded?

To speed things up try logging into an interactive session with 2 or 4 cores: srun --pty -p mcbstudent --qos=mcbstudent --mem=6G -c 4 bash
Only use data from your chromosome, i.e. retrieve your chromosome data from the .sam file to your $HOME directory tree and pull out only GTF annotations on your chromosome.
Take the time to look at the output and manual (-h) at each step to become familiar with the operations and output.

This exercise will take some time, so we will have the first 20 minutes of next class to finish this and ask questions.

6 Counting reads in transcripts with HTseq

HTseq is a python-based set of tools for analyzing sequencing data.

Link to the manual: https://htseq.readthedocs.io/en/release_0.10.0/
It is particularly useful because it can take input .GTF files and map reads to exons and return the results by whole transcript or gene.

HTseq only requires a sorted .sam or .bam file, so the output from hisat2 alignment can be used directly after sorting. With later versions of HTSeq you will need to make a bam index file with samtools using samtools index. I noticed that the default htseq loaded on Xanadu is not the most recent.

Here is an example of it’s usage:

module load htseq/0.13.5

samtools index HISAT_ALIGNED_SORTED.bam
htseq-count -m union -r pos -f bam -i gene_id -a 10 --stranded=no HISAT_ALIGNED_SORTED.bam $GTF > FILE.counts
  
 # $GTF file is a variable for the annotation file of choice

Options:

-m union: union is one of three options for htseq count for binning of reads to features (genes/exons). It is the recommended option that is most stringent since it will only count a read if it aligns to only 1 gene.
-r pos : meaning that the BAM/SAM file is sorted by genomic coordinate/position
-f: specify input file type
-i gene_id : tells htseq-count to use the gene ID in output file
-a 10 : tells htseq-count to skip reads with alignment score less than 10.
–stranded=no/yes/reverse : the RNA-seq reads strand specificity.

6.1 Important update on HTseq-count

HTseq (older than version 0.8.0) only maps to unique portions of transcripts and genes. Therefore, read count will often be underrepresented if genes or transcript isoforms overlap or share common exons. The newer version of HTseq-count (0.8.0 and newer) have a new option called --nonunique that if set to all will allow reads to map to multiple isoforms or overlapping genes. So in the examples shown in the HTseq figure, ambiguous reads get mapped to both transcripts or genes. Xanadu has newer versions of HTseq that allow the use of the nonunique option.

Figure 5: HTseq counting options

6.2 Strandedness with HTSeq

We use htseq-count to bin reads into genomic features (i.e. genes). The options above include -r which tells the program the bam file is sorted by position, -f which tells the program the input is a BAM file, and --stranded tells the program the strand information of the file. Notice the .gtf file is included. The output of this command is a .txt file that contains the gene counts for that sample. We will get one .txt file per sample. We already confirmed that the data is stranded, but even if I know everything about the experiment, is not intuitive for me to know whether to specify yes or reverse for the --stranded argument. To make matters more confusing HTSeq is not in the tuxedo suite, so the XS attribute tag that contains the - or + associated with the read alignment is ignored. The SAM flag is used, so we must again specify the stranded option when counting with HTSeq.

To ensure that I use the correct option, I always run both options and observe the results—this is only necessary for one file if you have a whole bunch of files that were generated with the same molecular biology. Whatever option has more reads in the vast majority of the genes is correct. I appreciate that this is not elegant, but these empirical checks are often necessary to ensure the data is properly processed. The biggest mistakes that I have observed in genomics could have been avoided if the bioinformatician looked at the output of analyses and confirmed that the result is expected or, at minimum, reasonable.

7 In class exercise 2:

1) The initial alignments for C7_RNAseq_rep1.fastq.gz are located in /home/FCAM/meds5420/data/RNA_seq: C7_RNAseq_rep1_F.sam and C7_RNAseq_rep1_R.sam. I already converted these alignment files to bam, next use samtools to extract your specific chromosome from the bam file (this requires a Google or studying of the samtools manual). If (when?) you get an error, Google the error. Using samtools to select your chromosome is better because the very important header of the sam/bam file is retained. Using grep on the sam can cause downstream issues when a header is required. Sort the chromosome-specific bam files and index the bam files (use samtools for both). I did not post this answer below, because I want you to work through this problem.

# how I converted to bam
samtools sort -@ 4 -o C7_RNAseq_rep1_F.bam C7_RNAseq_rep1_F.sam
samtools sort -@ 4 -o C7_RNAseq_rep1_R.bam C7_RNAseq_rep1_R.sam

2) Count reads in genes from the sorted indexed bam files for your chromosome. Confirm that HTSeq ignores the --rna-strandedness option from hisat2 by counting with both the _R and _F files. Look at the output file (head, tail, wc -l, etc.).
3) Determine which --rna-strandedness option is correct for these libraries.

You can subset your gtf annotation to only contain your chromosome using grep. Usually the header of the gtf is not required, but you can use a combination of head and cat to generate a gtf with a header if needed.

Note: The output counts file has a footer (see the tail) that we want to ignore in the counting. The grep -v __ command prints all the lines without __ to stdout, and you can use awk (you should have a column-summing one-liner awk command already saved in the awk_one_liners.sh file that I know you are all keeping) to count the second column.

8 Use Genome Coverage from `bedtools` to create a `bedGraph`:

bedtools genomecov -bg -split -ibam HISAT_out.sorted.bam > HISAT_out.bedgraph

NOTES: -split is an important option. If you do not specify it here, you will get signal across exons that looks like background.

Also, if you have strand specific data, make sure you have to run this twice and specify a different strand each time.

bedtools genomecov -bg -split -strand + -ibam INFILE > OUTFILE_plus.bedgraph

bedtools genomecov -bg -split -strand - -ibam INFILE > OUTFILE_minus.bedgraph

8.1 Note on visualizing stranded data with `genomeCoverageBed`

Again, we have to be careful about strandedness! genomeCoverageBed defines “strand” by their orientation and they will treat BAM files identically, regardless of whether they were generated with the --rna-strandness R and F option. This is because bedtools is not part of the tuxedo suite. Therefore in the testing stage for generating a workflow, you will have to try combinations of -strand arguments and plus vs. minus outputs to ensure proper visualization as I show in the code chunk below. Since you already confirmed the R option with HISAT2 and the rev option with HTseq, I will tell you that the specified strand and output is inconsistent (i.e. - paired with plus and + paired with minus) with this library prep method.

#test1
bedtools genomecov -bga -split -strand + -ibam INFILE > OUTFILE_plus_test1.bedGraph
bedtools genomecov -bga -split -strand - -ibam INFILE > OUTFILE_minus_test1.bedGraph

#test2
bedtools genomecov -bga -split -strand + -ibam INFILE > OUTFILE_minus_test2.bedGraph
bedtools genomecov -bga -split -strand - -ibam INFILE > OUTFILE_plus_test2.bedGraph

We can add track lines as option to include in the files:

bedtools genomecov -bg -split -strand - -ibam INFILE -trackline -trackopts 'name=RNA-seq_plus description="RNA-seq_plus" visibility=full autoScale=on alwaysZero=on color=255,0,0 windowingFunction=maximum' > OUTFILE_plus_track.bedGraph

You can add track lines as previously shown with awk or with a modified AddTrackLines script we wrote in class for ChIP-seq.

awk 'BEGIN {  print "browser position chr17:1,310,184-1,377,073"
              print "track type=bedGraph name=RNAseq_plus description="RNAseq_plus"  visibility=full autoScale=on alwaysZero=on color=255,0,0 windowingFunction=maximum"}
           {  print $0}' RNAseq.bedGraph > RNAseq_header.bedGraph

You can now upload the file into the UCSC genome browser:

https://genome.ucsc.edu/s/Mike%20Guertin/hg38_rna_seq_class_C1

Figure 6: RNA-seq UCSC

9 In class exercise 3:

1. Use HISAT2 to align C1_RNAseq_rep1.fastq.gz (/home/FCAM/meds5420/data/RNA_seq/C1_RNAseq_rep1.fastq.gz) to your user ID chromosome (/home/FCAM/meds5420/genomes/chroms/chr#.fa) using the appropriate options. Recall that you will have to build your chromosome index files by first extracting splice sites and exons. Speed up the process by using you GTF annotation file that is unique to your chromosome.
2. Visualize the data in the browser and send me a UCSC session link.
- Your chromosome can be found here: /home/FCAM/meds5420/genomes/chroms
extract your chromosome gene annotations from here: /home/FCAM/meds5420/annotations/gencode.v39.annotation.gtf

10 Processing paired end RNA-seq and visualizing splicing

Most RNA-seq libraries are paired end, so I we will go over the code and explanations to visualize PE data. UConn and JAX have very strong RNA biology and splicing labs, so we will also go through the entire process of parsing files to visualize splice junction reads on a browser.

11 Visualizing stranded PE data:

Original merged C1 and C7 PE FASTQ files: /home/FCAM/meds5420/data/RNA_seq/paired_end

11.1 Mapping paired end and stranded data with HISAT2

If you submit samples for RNA-seq at UConn’s CGI, then you will most likely get stranded data back. It will often be paired-end data as well. Here’s how to map and visualize the data.

module load hisat2
dir=/home/FCAM/meds5420/data/RNA_seq/paired_end/
ht2_gen=/home/FCAM/meds5420/genomes/hg38_hisat2/hg38_hisat2
hisat2 -p4 --dta --rna-strandness RF -x $ht2_gen -1 ${dir}C_RNAseq_all_PE1.fastq.gz -2 ${dir}C_RNAseq_all_PE2.fastq.gz -S C_RNAseq_all.sam

the --rna-strandness refers to whether the sequence read represents the RNA equivalent or is complementary to it. If the read represents the same strand as the RNA, then use the F as the input, if it is the complement of the RNA, use R.
For the Paired-end kit used in the CGI, the first read (mate pair 1) is the complement to the RNA, and the second read (mate pair 2) is the same strand as the RNA. Therefore, you should enter RF for this option.

Next, we can convert this .sam file to a sorted .bam , however, for making bed files of paired end data, the reads should be sorted by name (-n), so mate-pairs as next to each other. The example below will just process chr17 to speed things up. Remember, when employing new code or building a pipeline it is always a good idea to start processing only a fraction of the data (i.e. head -n 400000 X > Y or grep -w "chr17" X > Y) until all seems to be working well. This will save you time just waiting for things to run. It is a bit convoluted to extract all chr17 entries from a sam or bam file. You may want to use grep for you favorite chromosome in the sam file, then convert to a bam and you are good to go, right? Remember, this excludes the header and causes problems. The following routine will allow you to extract your chromosome from an aligned sam file. It is convoluted, but this is the way.

# convert sam to bam
samtools view -b C_RNAseq_all.sam -o C_RNAseq_all.bam

# sort bam
samtools sort C_RNAseq_all.bam -o C_RNAseq_all_sorted.bam

# index the bam (makes a *.bai file)
samtools index C_RNAseq_all_sorted.bam

# finally we can extract chr17
samtools view -b C_RNAseq_all_sorted.bam chr17 -o C_RNAseq_chr17.bam

# index the chr bam (makes a *.bai file)
samtools index C_RNAseq_chr17.bam

# always sort your bams
samtools sort C_RNAseq_chr17.bam -o C_RNAseq_chr17_sorted.bam

The necessary bam and bam.bai files to extract your chromosome info are within the /home/FCAM/meds5420/data/RNA_seq/processed_bam/ directory on Xanadu.

Now we can convert to bed12 format for display in the browser. However, this will display all reads, not only the spliced reads. bed12 format specifications can be found here:
https://genome.ucsc.edu/FAQ/FAQformat.html#format1

bedtools bamtobed -bed12 -split -i C_RNAseq_chr17_sorted.bam > C_RNAseq_chr17_pe.bed12.bed

Review bedtools commands and options here:

bam to bed: http://bedtools.readthedocs.io/en/latest/content/tools/bamtobed.html
genome coverage: http://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html

Deeper exon coverage is somewhat apparent and we can observed some gapped reads that are indicative of spliced reads:

Figure 7: RNAseq browser bed

This is also a bit cluttered, so we may just want to observe the spliced reads.

12 Displaying splice junctions

To display the splice reads only, we can start with a bam file that is sorted by position (not name).

Note: You can display splice junctions for any RNA-seq data - stranded or not, paired-end or not.

12.1 Move header to a new file

Invoking the -H option in samtools will print the header to a new file, which is important for downstream processing.

samtools view -H C_RNAseq_chr17_sorted.bam > C_RNAseq_chr17_spliced.sam

12.2 Extract splice junction reads

See Section 3.2 Intermediate file conversions for the explanation of the awk code taking out relevant CIGAR codes. We append to the header file:

samtools view C_RNAseq_chr17_sorted.bam | awk '($6 ~/N/)' >> C_RNAseq_chr17_spliced.sam

12.3 Convert back to `bam`

samtools view -b C_RNAseq_chr17_spliced.sam > C_RNAseq_chr17_spliced.bam

12.4 Convert to `bed12`

Use bedtools to convert to bed12

bamToBed -bed12 -i C_RNAseq_chr17_spliced.bam > C_RNAseq_chr17_spliced_bed12.bed

Example of bed12 format:

head C_RNAseq_chr17_spliced_bed12.bed | tail -5

## chr17    65691   65847   SRR5882541.5406954/2    0   -   65691   65847   255,0,0 2   45,18   0,138
## chr17    65703   65871   SRR5882543.19047503/1   0   +   65703   65871   255,0,0 2   33,42   0,126
## chr17    65704   65872   SRR5882544.1641172/1    1   +   65704   65872   255,0,0 2   32,43   0,125
## chr17    65722   71369   SRR5882543.9412633/1    1   +   65722   71369   255,0,0 3   14,58,4 0,107,5643
## chr17    65725   71370   SRR5882544.25013538/2   60  -   65725   71370   255,0,0 3   11,58,5 0,104,5640

IMPORTANT: If using paired-end data, you will need to flip the strand on this file and then add tracklines and upload. In the name column (column 4), mate pair one reads end with "/1" and mate pair 2 reads end in "/2". We can use this to "flip" mate pair 1 reads to the opposite strand. Below is an awk script that does this.

#flip
awk '{OFS="\t"; if($4 ~"/2") print $0; else if($4 ~"/1" && $6 == "+") print($1, $2, $3,$4,$5,"-", $7,$8,$9,$10,$11,$12); else if($4 ~"/1" && $6 == "-") print($1, $2, $3,$4,$5,"+", $7,$8,$9,$10,$11,$12)}' C_RNAseq_chr17_spliced_bed12.bed > C_RNAseq_chr17_spliced_bed12_R1Flip.bed

#add trackline  
awk 'BEGIN { print "browser position chr17:8,408,924-8,589,493"; print "track type=bed name=\"C_RNAseq_chr17_pe\" description=\"C_RNAseq_chr17_pe\"  visibility=squish autoScale=on colorByStrand=\"255,0,0 0,0,255\""} { print $0}' C_RNAseq_chr17_spliced_bed12_R1Flip.bed > C_RNAseq_chr17_spliced_header.bed

https://genome.ucsc.edu/s/Mike%20Guertin/hg38_RNA_splicing_class

Figure 8: RNAseq splice junctions

13 Making `bedGraphs` files of `C_RNAseq_chr17_pe.bed12.bed` data above (all split spliced reads and non-split reads).

We can then make bedGraph files as before using the -split and -strand options. However, the input .bed files need to be sorted by position, unmapped reads discarded (. in columns 1 or 3) and undergo some more processing first.

First, sort the bed file by position.

sort -k1,1 -k2,2n C_RNAseq_chr17_pe.bed12.bed | grep -v -w "\." > C_RNAseq_chr17_pe.bed12_sorted.bed

IMPORTANT: bedtools can properly convert stranded data to a bedGraph viewable in the browser. However, it does not consider the strandedness of the pairs, so one of the mates in a pair will always map to the wrong strand creating data on both strands where it should only be one one. To get around this, we must manually flip the read that is the reverse complement of the RNA to the other strand. In this case, we must flip read 1. First, let’s look at a bed12 file and see how we can parse it ourselves.

head C_RNAseq_chr17_pe.bed12_sorted.bed | tail -5

## chr17    60030   60106   SRR5882548.19678534/1   1   +   60030   60106   255,0,0 1   76  0
## chr17    60030   60106   SRR5882548.7287074/1    1   +   60030   60106   255,0,0 1   76  0
## chr17    60044   60120   SRR5882548.28798868/2   0   -   60044   60120   255,0,0 1   76  0
## chr17    60050   60126   SRR5882542.1394632/1    1   +   60050   60126   255,0,0 1   76  0
## chr17    60082   60154   SRR5882542.1394632/2    1   -   60082   60154   255,0,0 1   72  0

In the name column (column 4), mate pair one reads end with "/1" and mate pair 2 reads end in "/2". We can use this to "flip" mate pair 1 reads to the opposite strand. Below is an awk script that can do this.

awk '{OFS="\t"; if($4 ~"/2") print $0; else if($4 ~"/1" && $6 == "+") print($1, $2, $3,$4,$5,"-", $7,$8,$9,$10,$11,$12); else if($4 ~"/1" && $6 == "-") print($1, $2, $3,$4,$5,"+", $7,$8,$9,$10,$11,$12)}' C_RNAseq_chr17_pe.bed12_sorted.bed > C_RNAseq_chr17_bed12_R1Flip.bed

Now create the bedGraphs, add track lines, zip the files, transfer them to our computer, and upload to UCSC:


hgSizes=/home/FCAM/meds5420/genomes/hg38.chrom.sizes
bedtools genomecov -bg -split -strand + -i C_RNAseq_chr17_bed12_R1Flip.bed -g $hgSizes > C_RNAseq_chr17_pe_R1flip_plus.bedGraph

bedtools genomecov -bg -split -strand - -i C_RNAseq_chr17_bed12_R1Flip.bed -g $hgSizes > C_RNAseq_chr17_pe_R1flip_minus.bedGraph

awk 'BEGIN { print "browser position chr17:8,449,131-8,569,510"; print "track type=bedGraph name=\"C_RNAseq_chr17_plus\" description=\"C_RNAseq_chr17_plus\"  visibility=squish autoScale=on color=\"255,0,0\""} { print $0}' C_RNAseq_chr17_pe_R1flip_plus.bedGraph > C_RNAseq_chr17_pe_plus.bedGraph

awk 'BEGIN { print "browser position chr17:8,449,131-8,569,510"; print "track type=bedGraph name=\"C_RNAseq_chr17_minus\" description=\"C_RNAseq_chr17_minus\"  visibility=squish autoScale=on color=\"0,0,255\""} { print $0}' C_RNAseq_chr17_pe_R1flip_minus.bedGraph > C_RNAseq_chr17_pe_minus.bedGraph

gzip C_RNAseq_chr17_pe_plus.bedGraph
gzip C_RNAseq_chr17_pe_minus.bedGraph

#sftp to your local machine

https://genome.ucsc.edu/s/Mike%20Guertin/hg38_PE_class

Figure 9: RNAseq Paired End browser

14 Exercise 4: Extract and visualize splice junctions from paired end RNA-seq data.

1. Extract the reads from your chromosome, then extract the spliced reads from your chromosome, convert to a BED12 format, add a trackline (see “bed12 format specifications” link above) that includes a different color for each forward and reverse aligned set of splice junctions (see colorByStrand), and upload to the browser. The necessary bam and bam.bai files to extract your chromosome info are within the /home/FCAM/meds5420/data/RNA_seq/processed_bam/ directory on Xanadu. I began extracting your chromosome from the bam file with the command samtools view -b C_RNAseq_all_sorted.bam chr# -o C_RNAseq_chr#.bam, so check to see if you chromosome-specific bam is already present.

2. Optional: make bedGraphs of your chromosome and try to understand what the complicated set of commands are doing by inspecting the input and output files.

15 For your reference: Molecular biology of library preparations

The molecular biology of library preps diverge depending upon whether the experimenter wishes to preserve the RNA strand information in the sequencing data. Unstranded libraries were common over a decade ago, but I see no reason why contemporary research would employ strand-ambiguous RNA-seq. Therefore, I will briefly discuss the workflow for the library preparation that I prefer and how clever molecular biologists developed methods to retain the strand specificity of the RNA throughout the library prep.
1) Isolate RNA.
2) Negative selection of rRNA.
3) Fragment the RNA with heat.
4) Anneal random hexamer DNA primers to the RNA.
5) Perform reverse transcription of the RNA using a reverse transcriptase that is engineered to operate at high temperatures. RT at higher temperatures relieves secondary structure of the RNA and leads to less biased libraries. Including actinomycin D at this step prevents second strand synthesis during the RT by inhibiting DNA-dependent DNA synthesis.
6) Perform second strand synthesis using the dUTP nucleotide as opposed to dTTP. dUTP incorporation is important for conferring strand specificity.
7) The next step is to blunt the ends of the DNA and generate a 5’ A overhang. T4 Polymerase harbors 3’ to 5’ exonuclease activity and 5’ to 3’ polymerase activity, which will blunt both overhangs. T4 Polymerase also phosphorylates 5’ DNA ends.
8) Taq polymerase is used to add an A to the 3’ end of the dsDNA. The high throughput sequencing adapter is usually a forked Y adapter that contains a T overhang on the 5’ end. This pairs with the A overhang, traditionally referred to as TA cloning. Alternatively, a hairpin adapter is used that incorporates a dUTP that effectively generates a Y adapter when the dUTP is digested.
9) Ligate the sequencing adapters with DNA ligase.
10) Digest dUTPs incorporated during the second strand synthesis with uracil-DNA glycosylase (UGDase). This confers sequence specificity because now you have a single strand DNA molecule with a distinct adapter on each end. The ssDNA represents the template strand.
11) The last step is to PCR amplify with a variety of multiplexed adapters that permit pooling of samples.

The dsDNA molecules can be sequenced using a specific primer for each end of the molecule. Refer to step 10 to determine which sequencing primer would result in the coding strand and which would be the reverse complement thereof.

Figure 10: RNA-seq molecular biology

NEXT TIME: Differential gene expression analysis.

16 Answers to in class exercise 1:


anno=/home/FCAM/meds5420/annotations/gencode.v39.annotation.gtf

grep -w "chr15" $anno > gencode.v39.annotation.chr15.gtf
annochr15=gencode.v39.annotation.chr15.gtf
chr=chr15

#for i in /home/FCAM/meds5420/data/RNA_seq/C*_RNAseq_rep*_*.sam
touch mergelist.txt
for i in /home/FCAM/meds5420/data/RNA_seq/C*_RNAseq_rep*[FR].sam
do
    name=$(echo $i | rev | cut -d "/" -f 1 | rev | cut -d "." -f 1)
    echo $name
    grep -w $chr $i > ${name}_chr15.sam
    samtools sort -@ 4 -o ${name}_chr15_sorted.bam ${name}_chr15.sam
    stringtie -p 4 -G $annochr15 -o $name.gtf -l ${name} ${name}_chr15_sorted.bam
    echo $name.gtf >> mergelist.txt
done

grep "F.gtf" mergelist.txt > mergelist_F.txt
grep "R.gtf" mergelist.txt > mergelist_R.txt

fr="F R"
for x in $fr
do
    #2 merge transcripts
    stringtie --merge -p 4 -G $annochr15 -o stringtie_merged_${x}.gtf mergelist_${x}.txt
    #3 compare
    gffcompare -R -r $annochr15 -o stringtie_merged_${x}_all stringtie_merged_${x}.gtf
done    

The `R` oriented files have higher precision and fewer novel features. Novel features are false positives (FP), because we consider the reference gene annotations as true positives (TP). precision is calculated as Precision = TP / (TP+FP). All the novel features (FP counts) in `F` are actually found because the orientation of the strand is not correct. This further confirms that `R` strandedness is correct.


#4 abundance

name=C1_RNAseq_rep1_
for x in $fr
do
    stringtie -e -B -p 8 -G $annochr15 -o stringtie_est_${x}.gtf  -A ${name}${x}.tab ${name}${x}_chr15_sorted.bam
#5 tests    
    echo ${name}${x} HSP90 counts
    grep -i hsp90 ${name}${x}.tab
    echo cumulative Coverage  
    awk 'NR>1 {sum+=$7} END{print "sum=",sum}' ${name}${x}.tab 
    echo Cumulative FPKM
    awk 'NR>1 {sum+=$8} END{print "sum=",sum}' ${name}${x}.tab 
    echo Cumulative TPM
    awk 'NR>1 {sum+=$9} END{print "sum=",sum}' ${name}${x}.tab 
done

I did a few tests to see if R or F is correct:
1) I did not know what highly expressed genes were on chr15, so I googled the official gene names (i.e. Actin ==ACTN) and looked at expression of all the candidates: Actin, Tubulin, gapdh, and hsp90. A few variants of hsp90 are present on chr15 and there is coverage and normalized counts in the R orientation. 
2) I looked at the total Coverage, FPKM, and TPM for R and F. the FPKM and TPM are normalized for read depth and since fewer reads align to the wrong orientation, the cumulative values are similar. However, coverage is per-base coverage of the gene and not normailzed by read depth, so this value is directly comparable between R and F input files. The coverage is ~10 fold higher in the R orientation, further confirming R is correct. Coverage would be comparable in R and F if the libraries were unstranded, then we would need to remap with the --rna-standedness flag not specified.

17 Answers to class exercise 2:

grep -w "chr15" /home/FCAM/meds5420/annotations/gencode.v39.annotation.gtf > gencode.v39.annotation.chr15.gtf
annochr15=gencode.v39.annotation.chr15.gtf

#you need to work through how to make C7_RNAseq_rep1_[FR]_chr15_sorted.bam

# OK, here is how to make the input files.
samtools index C7_RNAseq_rep1_F.bam
samtools index C7_RNAseq_rep1_R.bam
samtools view -b C7_RNAseq_rep1_F.bam chr15 > C7_RNAseq_rep1_chr15_F.bam
samtools view -b C7_RNAseq_rep1_R.bam chr15 > C7_RNAseq_rep1_chr15_R.bam
samtools sort C7_RNAseq_rep1_chr15_F.bam -o C7_RNAseq_rep1_F_chr15_sorted.bam
samtools sort C7_RNAseq_rep1_chr15_R.bam -o C7_RNAseq_rep1_R_chr15_sorted.bam
samtools index C7_RNAseq_rep1_F_chr15_sorted.bam
samtools index C7_RNAseq_rep1_R_chr15_sorted.bam

for i in C7_RNAseq_rep1_[FR]_chr15_sorted.bam
do
    name=$(echo $i | rev | cut -d "/" -f 1 | rev | cut -d "." -f 1)
    echo $name
    htseq-count -r pos -f bam --stranded=yes ${i} $annochr15 > $name.gene.counts.yes.txt
    htseq-count -r pos -f bam --stranded=reverse ${i} $annochr15 > $name.gene.counts.reverse.txt
    tail $name.gene.counts.yes.txt
    tail $name.gene.counts.reverse.txt
    echo $name
    echo yes
    grep -v __ $name.gene.counts.yes.txt | awk '{sum+=$2} END{print "sum=",sum}'
    echo reverse
    grep -v __ $name.gene.counts.reverse.txt | awk '{sum+=$2} END{print "sum=",sum}'
done
# many more counts in reverse indicate that reverse is counting reads coming from coherent annotation/read strands

# I know that they are all generated using the same molecular bio protocols
# If I was unsure, I would write a loop to go through all of them

18 Answers to in class exercise 3:


grep -w "chr15" gencode.v39.annotation.gtf > gencode.v39.chr15.annotation.gtf

extract_splice_sites.py gencode.v39.chr15.annotation.gtf > hg38_hisat_chr15_splicesites.txt

extract_exons.py gencode.v39.chr15.annotation.gtf > hg38_hisat_chr15_exons.txt

cp /home/FCAM/meds5420/genomes/chroms/chr15.fa ./ 

hisat2-build -p 8 --ss hg38_hisat_chr15_splicesites.txt --exon hg38_hisat_chr15_exons.txt chr15.fa chr15_hisat2

name=C7_RNAseq_rep1
cp /home/FCAM/meds5420/data/RNA_seq/$name.fastq.gz ./

hisat2 -p4 --dta --rna-strandness R -x chr15_hisat2 -U $name.fastq.gz -S $name.sam 

samtools sort -@ 4 -o $name_sorted.bam $name.sam 

bedtools genomecov -bg -split -strand - -ibam $name_sorted.bam -trackline -trackopts 'name=C7_RNAseq_rep1_plus description="C1_RNAseq_rep1_plus" visibility=full autoScale=on alwaysZero=on color=255,0,0 windowingFunction=maximum' > C7_RNAseq_rep1_plus.bedGraph

bedtools genomecov -bg -split -strand + -ibam C7_RNAseq_rep1_sorted.bam -trackline -trackopts 'name=C7_RNAseq_rep1_minus description="C7_RNAseq_rep1_minus" visibility=full autoScale=on alwaysZero=on color=0,0,255 windowingFunction=maximum' > C7_RNAseq_rep1_minus.bedGraph

gzip *bedGraph

#sftp the bedGraphs to your computer and upload to UCSC.

# my C1  session:

# https://genome.ucsc.edu/s/Mike%20Guertin/hg38_rna_seq_class_C1

#interpret the data

#The library is stranded because each strand separated track aligns to gene annotations 
#specifically, as opposed to both strand separated tracks having comparable coverage at #each gene annotation. Since the library is firststrand stranded (dUTP), the incoherent (- specification and plus labeling) is correct.

19 Answers to in class exercise 4:

module load samtools
module load bedtools

cp /home/FCAM/meds5420/data/RNA_seq/processed_bam/* ./

samtools view -b C_RNAseq_all_sorted.bam chr17 -o C_RNAseq_chr17.bam
samtools index C_RNAseq_chr17.bam
samtools sort C_RNAseq_chr17.bam -o C_RNAseq_chr17_sorted.bam
samtools view -H C_RNAseq_chr17_sorted.bam > C_RNAseq_chr17_spliced.sam
samtools view C_RNAseq_chr17_sorted.bam | awk '($6 ~/N/)' >> C_RNAseq_chr17_spliced.sam
samtools view -b C_RNAseq_chr17_spliced.sam > C_RNAseq_chr17_spliced.bam
bamToBed -bed12 -i C_RNAseq_chr17_spliced.bam > C_RNAseq_chr17_spliced_bed12.bed
awk '{OFS="\t"; if($4 ~"/2") print $0; else if($4 ~"/1" && $6 == "+") print($1, $2, $3,$4,$5,"-", $7,$8,$9,$10,$11,$12); else if($4 ~"/1" && $6 == "-") print($1, $2, $3,$4,$5,"+", $7,$8,$9,$10,$11,$12)}' C_RNAseq_chr17_spliced_bed12.bed > C_RNAseq_chr17_spliced_bed12_R1Flip.bed
awk 'BEGIN { print "browser position chr17:8,408,924-8,589,493"; print "track type=bed name=\"C_RNAseq_chr17_pe\" description=\"C_RNAseq_chr17_pe\"  visibility=squish autoScale=on colorByStrand=\"255,0,0 0,0,255\""} { print $0}' C_RNAseq_chr17_spliced_bed12_R1Flip.bed > C_RNAseq_chr17_spliced_header.bed

gzip C_RNAseq_chr17_spliced_header.bed

# then sftp to your computer and upload to ucsc

sort -k1,1 -k2,2n C_RNAseq_chr17_pe.bed12.bed | grep -v -w "\." > C_RNAseq_chr17_pe.bed12_sorted.bed
awk '{OFS="\t"; if($4 ~"/2") print $0; else if($4 ~"/1" && $6 == "+") print($1, $2, $3,$4,$5,"-", $7,$8,$9,$10,$11,$12); else if($4 ~"/1" && $6 == "-") print($1, $2, $3,$4,$5,"+", $7,$8,$9,$10,$11,$12)}' C_RNAseq_chr17_pe.bed12_sorted.bed > C_RNAseq_chr17_bed12_R1Flip.bed
hgSizes=/home/FCAM/meds5420/genomes/hg38.chrom.sizes
bedtools genomecov -bg -split -strand + -i C_RNAseq_chr17_bed12_R1Flip.bed -g $hgSizes > C_RNAseq_chr17_pe_R1flip_plus.bedGraph
bedtools genomecov -bg -split -strand - -i C_RNAseq_chr17_bed12_R1Flip.bed -g $hgSizes > C_RNAseq_chr17_pe_R1flip_minus.bedGraph
awk 'BEGIN { print "browser position chr17:8,449,131-8,569,510"; print "track type=bedGraph name=\"C_RNAseq_chr17_plus\" description=\"C_RNAseq_chr17_plus\"  visibility=squish autoScale=on color=\"255,0,0\""} { print $0}' C_RNAseq_chr17_pe_R1flip_plus.bedGraph > C_RNAseq_chr17_pe_plus.bedGraph
awk 'BEGIN { print "browser position chr17:8,449,131-8,569,510"; print "track type=bedGraph name=\"C_RNAseq_chr17_minus\" description=\"C_RNAseq_chr17_minus\"  visibility=squish autoScale=on color=\"0,0,255\""} { print $0}' C_RNAseq_chr17_pe_R1flip_minus.bedGraph > C_RNAseq_chr17_pe_minus.bedGraph
gzip C_RNAseq_chr17_pe_plus.bedGraph
gzip C_RNAseq_chr17_pe_minus.bedGraph

# then sftp to your computer and upload to ucsc

#use wc -l, head, and tail to figure out what each command is doing.

MEDS5420 - Lectures 21, 22, 23, and 24 - RNA-seq mapping, visualization, and transcript assembly

April 5, 10, 12, and 17 2023

Contents

0.1 Cleaning up data

1 Building genome with `HISAT2`

1.1 GTFs:

2 Building your annotated genome for mapping with HISAT2.

3 Alignment with HISAT2

3.1 Note on `--rna-strandness`

3.2 Intermediate file conversions

4 StringTie for assembling transcripts

4.1 StringTie for merging transcripts

4.2 gffcompare to compare to annotations:

4.3 Calculate transcript abundances with stringtie for use with ballgown or other DE analysis software.

5 In class exercise 1:

6 Counting reads in transcripts with HTseq

6.1 Important update on HTseq-count

6.2 Strandedness with HTSeq

7 In class exercise 2:

8 Use Genome Coverage from `bedtools` to create a `bedGraph`:

8.1 Note on visualizing stranded data with `genomeCoverageBed`

9 In class exercise 3:

10 Processing paired end RNA-seq and visualizing splicing

11 Visualizing stranded PE data:

11.1 Mapping paired end and stranded data with HISAT2

12 Displaying splice junctions

12.1 Move header to a new file

12.2 Extract splice junction reads

12.3 Convert back to `bam`

12.4 Convert to `bed12`

13 Making `bedGraphs` files of `C_RNAseq_chr17_pe.bed12.bed` data above (all split spliced reads and non-split reads).

14 Exercise 4: Extract and visualize splice junctions from paired end RNA-seq data.

15 For your reference: Molecular biology of library preparations

16 Answers to in class exercise 1:

17 Answers to class exercise 2:

18 Answers to in class exercise 3:

19 Answers to in class exercise 4:

MEDS5420 - Lectures 21, 22, 23, and 24 - RNA-seq mapping, visualization, and transcript assembly

April 5, 10, 12, and 17 2023

Contents

0.1 Cleaning up data

1 Building genome with HISAT2

1.1 GTFs:

1.2 Using HISAT2 and related software on server:

2 Building your annotated genome for mapping with HISAT2.

3 Alignment with HISAT2

3.1 Note on --rna-strandness

3.2 Intermediate file conversions

4 StringTie for assembling transcripts

4.1 StringTie for merging transcripts

4.2 gffcompare to compare to annotations:

4.3 Calculate transcript abundances with stringtie for use with ballgown or other DE analysis software.

5 In class exercise 1:

6 Counting reads in transcripts with HTseq

6.1 Important update on HTseq-count

6.2 Strandedness with HTSeq

7 In class exercise 2:

8 Use Genome Coverage from bedtools to create a bedGraph:

8.1 Note on visualizing stranded data with genomeCoverageBed

9 In class exercise 3:

10 Processing paired end RNA-seq and visualizing splicing

11 Visualizing stranded PE data:

11.1 Mapping paired end and stranded data with HISAT2

12 Displaying splice junctions

12.1 Move header to a new file

12.2 Extract splice junction reads

12.3 Convert back to bam

12.4 Convert to bed12

13 Making bedGraphs files of C_RNAseq_chr17_pe.bed12.bed data above (all split spliced reads and non-split reads).

14 Exercise 4: Extract and visualize splice junctions from paired end RNA-seq data.

15 For your reference: Molecular biology of library preparations

16 Answers to in class exercise 1:

17 Answers to class exercise 2:

18 Answers to in class exercise 3:

19 Answers to in class exercise 4:

1 Building genome with `HISAT2`

3.1 Note on `--rna-strandness`

8 Use Genome Coverage from `bedtools` to create a `bedGraph`:

8.1 Note on visualizing stranded data with `genomeCoverageBed`

12.3 Convert back to `bam`

12.4 Convert to `bed12`

13 Making `bedGraphs` files of `C_RNAseq_chr17_pe.bed12.bed` data above (all split spliced reads and non-split reads).