HISAT2
bedtools
to create a bedGraph
:
bedGraphs
files of C_RNAseq_chr17_pe.bed12.bed
data above (all split spliced reads and non-split reads).RNA-SeQC is a first-step option for quality control: http://archive.broadinstitute.org/cancer/cga/rna-seqc
RNA-SeQC is not available on the Xanadu server, but you can prefilter for quality and remove adapters just as in ChIP-seq:
NOTE: Illumina adapters have barcodes these days, so when removing adapter from multiple files, just use the constant regions before the variable region (bolded below) to filter against. Example:\
GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG
Here, the bolded sequence is the variable region, so just use the region before it (GATCGGAAGAGCACACGTCTGAACTCCAGTCAC) for filtering.
HISAT2
Things you need:
/home/FCAM/meds5420/genomes/hg38_hisat2
/home/FCAM/meds5420/annotations/gencode.v39.annotation.gtf
hisat2
, stringTie
, and GFFcompare
: see belowGTF (General Transfer Format) format is a table of individual exons with retention of transcript ID.
Explanation of GTF and GFF files can be found here.
GENCODE GTFs This can be retrieved from GENCODE (Figure 2):
head -5 gencode.v39.annotation.gtf
## ##description: evidence-based annotation of the human genome (GRCh38), version 39 (Ensembl 105)
## ##provider: GENCODE
## ##contact: gencode-help@ebi.ac.uk
## ##format: gtf
## ##date: 2021-09-02
Obtaining .GTF (Gene Transfer Format) files from UCSC (Figure 3):
HISAT2 will map to your genome, but will also use gene annotations to map to known exons and splice junctions specifically.
First, we need to use the GTF file to create a list of exons and splice sites to use when we map.
There are python scripts that come with the hisat2
software for this purpose.
extract_splice_sites.py gencode.v39.annotation.gtf > hg38_hisat_splicesites.txt
extract_exons.py gencode.v39.annotation.gtf > hg38_hisat_exons.txt
Here’s what the files look like:
head -5 ./hisat2/hg38_hisat_splicesites.txt
## chr1 12056 12178 +
## chr1 12226 12612 +
## chr1 12696 12974 +
## chr1 12720 13220 +
## chr1 13051 13220 +
head -5 ./hisat2/hg38_hisat_exons.txt
## chr1 11868 12226 +
## chr1 12612 12720 +
## chr1 12974 13051 +
## chr1 13220 14500 +
## chr1 15004 15037 -
These files are then used in building your annotated genome for mapping RNA-seq data. Here’s how to build it:
hisat2-build -p 8 --ss hg38_hisat_splicesites.txt --exon hg38_hisat_exons.txt hg38.fa hg38_hisat2
Should output the following to the screen (actually to stderr):
Settings:
Output files: "hg38_hisat2.*.ht2"
Line rate: 7 (line is 128 bytes)
Lines per side: 1 (side is 128 bytes)
Offset rate: 4 (one in 16)
FTable chars: 10
Strings: unpacked
Local offset rate: 3 (one in 8)
Local fTable chars: 6
Local sequence length: 57344
Local sequence overlap between two consecutive indexes: 1024
Endianness: little
Actual local endianness: little
Sanity checking: disabled
Assertions: disabled
Random seed: 0
Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
../hg38.fa
Reading reference sizes
Time reading reference sizes: 00:00:19
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
etc,
etc,
etc,
Total time for call to driver() for forward index: 02:47:58
NOTE: THIS IS THE MAJOR BOTTLENECK WITH HISAT2.
You can plan on this step taking ~2 hours and requiring ~160GB RAM for a large-ish genome such as human. You cannot do this with your average laptop, but you can run scripts that require high memory on the server with a special option.
Alternatively you may be able to download a pre-built genome for your organism from the HISAT2 website: https://ccb.jhu.edu/software/hisat2/index.shtml.
You can instead use annotations during the mapping, or (preferred) create a custom annotation using your initial alignment (after post processing below) and then re-align your data using the custom annotation set for good coverage.
Also, if you build your genome without annotations it requires much less memory and can probably be built on a desktop or even laptop.
To build your own genome of the server you can create a qsub script and make sure that the following lines are added to the header:
#SBATCH -c=32
#SBATCH -p himem
#SBATCH --qos=himem
#SBATCH --mem=256G
This would use half of the RAM and cores one one high memory node (i.e. each has 64 cores with 512G RAM).
hisat2 -p 4 --dta --rna-strandness F -x $ht2_gen -U INFILE.fastq -S OUTFILE.sam
# ht2_gen=/home/FCAM/meds5420/genomes/hg38_hisat2/hg38_hisat2
Output is standard .sam
output plus optional columns that provide greater detail. See manual for details.
Options are VERY important for RNA mapping - Some useful options:
* --dta
: downstream transcript assembly -
* --rna-strandness
: specify when data is stranded or not. Default is not stranded. F
means read represents transcript, R
means transcript is reverse complement. This is only used in stringtie and other software from the tuxedo suite. Invoking R
or F
here does not change the hexadecimal SAM flag. Instead, every read alignment will have an XS attribute tag with a -
or +
associated with the read alignment. This attribute tag is only (from what I can tell) used for downstream applications in the tuxedo suite.
- --tmo
: transcript mapping only: only map to known transcripts.
- --phred33
or --phred64
: specify Illumina quality score conversion.
- --summary-file
[fileName.txt]: Print alignment summary to this file.
- --time
: print time for alignment to the screen (sent to stnderr)
- --no-softclip
prevents aligner from clipping ends that do not match
- --known-splicesite_infile
: can add splice sites here if not in your assembly (need to extract splice sites with the python script).
--rna-strandness
The option --rna-strandness
is dependent upon the molecular biology that was used in library construction. Even when the molecular biology is known (sometimes it is not if you send out to a company), I recommend performing a coherence check like this by running hisat
with the --rna-strandness F
or --rna-strandness R
option. If your data is stranded, you can also determine whether the R
or F
is the correct option for --rna-strandness
. What do you expect to see if you chose the incorrect orientation? I will continue to stress this point because “pipelines” that ignore the molecular biology and these types of coherence checks can get researchers into trouble. In 2020 I had a student take my class and they realized that they were treating their unstranded libraries as stranded libraries in the analysis. This should not effect the conclusions, but using the wrong strand will. Treating unstranded libraries as stranded libraries is equivalent of sequencing to half the read depth—why?
Next we have to convert the alignments to a .bam
file and sort them.
samtools view -bS INFILE.sam > OUTFILE.bam
samtools sort -@ 4 INFILE.bam -o OUTFILE_sorted.bam
# no extension needed for output file name
# -@ option designates # of cores for multi-threading
# Important to remove intermediate files, especially when working on the server.
rm INFILE.sam
rm INFILE.bam
NOTE: the latest versions of samtools
can convert to .bam
and sort in one step.
samtools sort -@ 4 -o OUTFILE_sorted.bam INFILE.sam
If you are interested in the number of reads with splice junctions in them you can extract them from a .bam or .sam as below:
samtools view INFILE.sam | awk '($6 ~ /N/)' > OUTFILE_spliced_reads.sam
This command uses the CIGAR (Compact Idiosyncratic Gapped Alignment Report) string to find gaps in reads with no match, which is designated by the N.
M
: Alignment match (can be a sequence match or mismatch)
I
: Insertion to the reference
D
: Deletion from the reference
N
: Skipped region from the reference
S
: Soft clip on the read (clipped sequence present in sequence)
H
: Hard clip on the read (clipped sequence NOT present in sequence)
P
: Padding (silent deletion from the padded reference sequence)
head -3 C1_RNAseq_rep1_F_chr15.sam
## @SQ SN:chr15 LN:101991189
## SRR5882541.93 16 chr15 52066185 60 74M1S * 0 0 TTTATTTTATTGAATACTGTCTGTATCTTTGGTTATCCTGTTTGAAGAAAAAGGACAAATAAAACATGGCCAGCC F.AFFFF.FFFFFFAFFFFFFFFAFFFFFAFFAFFAFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFAAAAA AS:i:-1 ZS:i:-11 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:74 YT:Z:UU XS:A:- NH:i:1
## SRR5882541.161 16 chr15 43793064 60 20M626N53M * 0 0 GCTGCCGCCCGCAAGCAGAGGGACTCGGAGATCATGCAGCAGAAGCAGAAAAAGGCAAACGAGAAGAAGGAGG FFAFF<FFFFA<FFF7A<<<F<FF.FFAFFA<FFFFF<AFFFFFFF7F<FFFFFFFFFFAAFFF7FFFAAAAA AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:73 YT:Z:UU XS:A:- NH:i:1
head -2 C1_RNAseq_rep1_F_chr15_spliced_reads.sam
## SRR5882541.161 16 chr15 43793064 60 20M626N53M * 0 0 GCTGCCGCCCGCAAGCAGAGGGACTCGGAGATCATGCAGCAGAAGCAGAAAAAGGCAAACGAGAAGAAGGAGG FFAFF<FFFFA<FFF7A<<<F<FF.FFAFFA<FFFFF<AFFFFFFF7F<FFFFFFFFFFAAFFF7FFFAAAAA AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:73 YT:Z:UU XS:A:- NH:i:1
## SRR5882541.176 0 chr15 89321203 60 1S58M475N17M * 0 0 CGGGTGTAGCCAGGTGGGGCCTGCACCATGGCTTTCAACTCACTGCCTACTCGGTCAGGCCGGGCATTGCTGGCGG AAAAAFF.FFFFFFFFFFFFFFFF<FAFFFFF.FFFA)AFF.FFFFFF.AFFFFF77FFF7FFFF.AFFFFFFFFF AS:i:-1 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:75 YT:Z:UU XS:A:+ NH:i:1
In the author’s words:
StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts. It uses a novel network flow algorithm as well as an optional de novo assembly step to assemble and quantitate full-length transcripts representing multiple splice variants for each gene locus. Its input can include not only alignments of short reads that can also be used by other transcript assemblers, but also alignments of longer sequences that have been assembled from those reads. In order to identify differentially expressed genes between experiments, StringTie’s output can be processed by specialized downstream software.
Assemble transcripts with StringTie:
stringtie -p 4 -G gencode.v39.annotation.gtf -o OUTFILE.gtf -l TREATMENT_CELLS INFILE_sorted.bam
variables
-G
: known annotation GTF-o
: output GTF-l
: prefix for transcript IDs.bam
file--rf
: Assumes a stranded library fr-firststrand--fr
: Assumes a stranded library fr-secondstrand --rf
and --fr
are ignored if HISAT2 was invoked with the --rna-strandness
argument as R
or F
.
At this point, if you want to do differential gene expression analysis, you will need to merge all of the transcripts from all sequencing samples and conditions.
stringtie --merge -p 8 -G gencode.v39.annotation.gtf -o stringtie_merged.gtf mergelist.txt
mergelist.txt
is a simple text file with the names of each file to merge on a separate line.
gffcompare -R -r gencode.v39.annotation.gtf -o stingtie_merged stringtie_merged.gtf
The -R
option forces the program to only consider transcripts for which there is data, so that the number of ‘missed’ exons or transcripts is not influenced simply by which genes are expressed.
Output files:
less stingtie_merged.stats
## # gffcompare v0.12.6 | Command line was:
## #gffcompare -R -r /home/FCAM/meds5420/annotations/gencode.v39.annotation.gtf -o stingtie_merged stringtie_merged.gtf
## #
##
## #= Summary for dataset: stringtie_merged.gtf
## # Query mRNAs : 245645 in 57989 loci (218720 multi-exon transcripts)
## # (22471 multi-transcript loci, ~4.2 transcripts per locus)
## # Reference mRNAs : 242722 in 57415 loci (216649 multi-exon)
## # Super-loci w/ reference transcripts: 57413
## #-----------------| Sensitivity | Precision |
## Base level: 100.0 | 99.7 |
## Exon level: 99.7 | 99.4 |
## Intron level: 100.0 | 99.5 |
## Intron chain level: 100.0 | 99.1 |
## Transcript level: 100.0 | 98.8 |
## Locus level: 99.9 | 98.9 |
##
## Matching intron chains: 216647
## Matching transcripts: 242608
## Matching loci: 57373
##
## Missed exons: 0/652609 ( 0.0%)
## Novel exons: 2570/655065 ( 0.4%)
## Missed introns: 0/395147 ( 0.0%)
## Novel introns: 1901/397160 ( 0.5%)
## Missed loci: 0/57415 ( 0.0%)
## Novel loci: 576/57989 ( 1.0%)
##
## Total union super-loci across all input datasets: 57989
## 245645 out of 245645 consensus transcripts written in stingtie_merged.annotated.gtf (0 discarded as redundant)
head stringtie_merged.gtf.refmap
## ref_gene ref_id class_code qry_id_list
## DDX11L1 ENST00000456328.2 = ENSG00000223972.5|ENST00000456328.2
## DDX11L1 ENST00000450305.2 = ENSG00000223972.5|ENST00000450305.2
## MIR1302-2HG ENST00000473358.1 = ENSG00000243485.5|ENST00000473358.1
## MIR1302-2HG ENST00000469289.1 = ENSG00000243485.5|ENST00000469289.1
## MIR1302-2 ENST00000607096.1 = ENSG00000284332.1|ENST00000607096.1
## OR4G4P ENST00000606857.1 = ENSG00000268020.3|ENST00000606857.1
## OR4G11P ENST00000642116.1 = ENSG00000240361.2|ENST00000642116.1
## OR4G11P ENST00000492842.2 = ENSG00000240361.2|ENST00000492842.2
## OR4F5 ENST00000641515.2 = ENSG00000186092.7|ENST00000641515.2
stringtie -e -B -p 8 -G stringtie_merged.gtf
-o OUTFILE_est.gtf
-A OUTFILE_abun.tab INFILE_sorted.bam
Options:
-e
: Only quantify transcripts that agree with reference annotations.-B
: Converts data to a table (.ctab
) that is compatible with Ballgown DE analysis-o
: Path and name of new .GTF file with coverage data. Accessory files will be written to same path and path will be created if it doesn’t exist.-A
: Create an abundance table that reports transcript level quantification.--rf
: Assumes a stranded library fr-firststrand--fr
: Assumes a stranded library fr-secondstrand --rf
and --fr
are ignored if HISAT2
was invoked with the --rna-strandness
argument as R
or F
.There are several outputs, the -A
-specified abundance table contains the data for each transcript (coverage, FPKM, and TPM values). You can use grep
and your favorite gene’s name to look at its transcript abundance.
Another way to test for R
vs. F
is to calculate abundance using the reference annotations and then look at accumulated Coverage
(7th column). The FPKM and TPM are normalized for read depth and since fewer reads align to the wrong orientation, the cumulative values are similar. However, coverage is per-base coverage of the gene and not normailzed by read depth, so this value is directly comparable between R
and F
input files.
stringtie -e -B -p 8 -G gencode.v39.annotation.gtf -o R_est_anno.gtf -A R_abun_anno.tab C1_RNAseq_rep1_R_sorted.bam
stringtie -e -B -p 8 -G gencode.v39.annotation.gtf -o F_est_anno.gtf -A F_abun_anno.tab C1_RNAseq_rep1_F_sorted.bam
# and awk code to skip the header and sum over a column
awk 'NR > 1' R_abun_anno.tab | awk '{sum+=$7;} END{print sum;}'
awk 'NR > 1' F_abun_anno.tab | awk '{sum+=$7;} END{print sum;}'
I have aligned RNA-seq data from two clones of the same progenitor cell line to the hg38 genome with HISAT2. These are two clones, clone 1 and clone 7, that were expanded upon single cell dilution. The genome file indicies I used for alignment is here: /home/FCAM/meds5420/genomes/hg38_hisat2
. I used the exact hisat2
command:
ht2_gen=/home/FCAM/meds5420/genomes/hg38_hisat2/hg38_hisat2
for i in C*rep*fastq.gz
do
name=$(echo $i | cut -d "." -f 1)
echo $name
hisat2 -p4 --dta --rna-strandness R -x $ht2_gen -U $i -S ${name}_R.sam
hisat2 -p4 --dta --rna-strandness F -x $ht2_gen -U $i -S ${name}_F.sam
done
The .sam
files can be found here:
Folder location:
/home/FCAM/meds5420/data/RNA_seq
Files:
- C1_RNAseq_rep*_R.sam
- C7_RNAseq_rep*_R.sam
- C1_RNAseq_rep*_F.sam
- C7_RNAseq_rep*_F.sam
The annotation .gtf
file can be found here: /home/FCAM/meds5420/annotations/gencode.v39.annotation.gtf
Extract your specific chromosome from the annotation file to speed things up.
stringtie
. In your loop you will want to generate a file with a list of individual annotation file names that you will use in the next step.R
and F
assemblies separate so you can compare them..stats
file. Interpret the stats file with a focus on Precision and novel features. Provide a written explanation of your interpretation with respect to the --rna-standedness
input.C1_RNAseq_rep1_F
and C1_RNAseq_rep1_R
sorted BAM files from your chromosome.R
or F
option, or unstranded
correct for these libraries? Perform tests to determine this and provide written explanations. What would you expect if the libraries were unstranded
?srun --pty -p mcbstudent --qos=mcbstudent --mem=6G -c 4 bash
.sam
file to your $HOME
directory tree and pull out only GTF annotations on your chromosome.-h
) at each step to become familiar with the operations and output.This exercise will take some time, so we will have the first 20 minutes of next class to finish this and ask questions.
HTseq is a python-based set of tools for analyzing sequencing data.
Link to the manual: https://htseq.readthedocs.io/en/release_0.10.0/
It is particularly useful because it can take input .GTF files and map reads to exons and return the results by whole transcript or gene.
HTseq only requires a sorted .sam
or .bam
file, so the output from hisat2
alignment can be used directly after sorting. With later versions of HTSeq you will need to make a bam index file with samtools using samtools index
. I noticed that the default htseq
loaded on Xanadu is not the most recent.
Here is an example of it’s usage:
module load htseq/0.13.5
samtools index HISAT_ALIGNED_SORTED.bam
htseq-count -m union -r pos -f bam -i gene_id -a 10 --stranded=no HISAT_ALIGNED_SORTED.bam $GTF > FILE.counts
# $GTF file is a variable for the annotation file of choice
Options:
-m union
: union is one of three options for htseq count for binning of reads to features (genes/exons). It is the recommended option that is most stringent since it will only count a read if it aligns to only 1 gene.-r pos
: meaning that the BAM/SAM file is sorted by genomic coordinate/position-f
: specify input file type-i gene_id
: tells htseq-count to use the gene ID in output file-a 10
: tells htseq-count to skip reads with alignment score less than 10.–stranded=no/yes/reverse
: the RNA-seq reads strand specificity.HTseq (older than version 0.8.0) only maps to unique portions of transcripts and genes. Therefore, read count will often be underrepresented if genes or transcript isoforms overlap or share common exons. The newer version of HTseq-count (0.8.0 and newer) have a new option called --nonunique
that if set to all
will allow reads to map to multiple isoforms or overlapping genes. So in the examples shown in the HTseq figure, ambiguous reads get mapped to both transcripts or genes. Xanadu has newer versions of HTseq that allow the use of the nonunique
option.
We use htseq-count
to bin reads into genomic features (i.e. genes). The options above include -r
which tells the program the bam file is sorted by position, -f
which tells the program the input is a BAM file, and --stranded
tells the program the strand information of the file. Notice the .gtf file is included. The output of this command is a .txt file that contains the gene counts for that sample. We will get one .txt file per sample. We already confirmed that the data is stranded, but even if I know everything about the experiment, is not intuitive for me to know whether to specify yes
or reverse
for the --stranded
argument. To make matters more confusing HTSeq
is not in the tuxedo suite, so the XS attribute tag that contains the -
or +
associated with the read alignment is ignored. The SAM flag is used, so we must again specify the stranded option when counting with HTSeq
.
To ensure that I use the correct option, I always run both options and observe the results—this is only necessary for one file if you have a whole bunch of files that were generated with the same molecular biology. Whatever option has more reads in the vast majority of the genes is correct. I appreciate that this is not elegant, but these empirical checks are often necessary to ensure the data is properly processed. The biggest mistakes that I have observed in genomics could have been avoided if the bioinformatician looked at the output of analyses and confirmed that the result is expected or, at minimum, reasonable.
1) The initial alignments for C7_RNAseq_rep1.fastq.gz
are located in /home/FCAM/meds5420/data/RNA_seq
: C7_RNAseq_rep1_F.sam
and C7_RNAseq_rep1_R.sam
. I already converted these alignment files to bam
, next use samtools
to extract your specific chromosome from the bam
file (this requires a Google or studying of the samtools
manual). If (when?) you get an error, Google the error. Using samtools
to select your chromosome is better because the very important header of the sam
/bam
file is retained. Using grep
on the sam
can cause downstream issues when a header is required. Sort the chromosome-specific bam
files and index the bam
files (use samtools
for both). I did not post this answer below, because I want you to work through this problem.
# how I converted to bam
samtools sort -@ 4 -o C7_RNAseq_rep1_F.bam C7_RNAseq_rep1_F.sam
samtools sort -@ 4 -o C7_RNAseq_rep1_R.bam C7_RNAseq_rep1_R.sam
2) Count reads in genes from the sorted indexed bam
files for your chromosome. Confirm that HTSeq
ignores the --rna-strandedness
option from hisat2
by counting with both the _R
and _F
files. Look at the output file (head
, tail
, wc -l
, etc.).
3) Determine which --rna-strandedness
option is correct for these libraries.
You can subset your gtf
annotation to only contain your chromosome using grep
. Usually the header of the gtf
is not required, but you can use a combination of head
and cat
to generate a gtf
with a header if needed.
Note: The output counts file has a footer (see the tail
) that we want to ignore in the counting. The grep -v __
command prints all the lines without __
to stdout, and you can use awk
(you should have a column-summing one-liner awk
command already saved in the awk_one_liners.sh
file that I know you are all keeping) to count the second column.
bedtools
to create a bedGraph
:bedtools genomecov -bg -split -ibam HISAT_out.sorted.bam > HISAT_out.bedgraph
NOTES: -split
is an important option. If you do not specify it here, you will get signal across exons that looks like background.
Also, if you have strand specific data, make sure you have to run this twice and specify a different strand each time.
bedtools genomecov -bg -split -strand + -ibam INFILE > OUTFILE_plus.bedgraph
bedtools genomecov -bg -split -strand - -ibam INFILE > OUTFILE_minus.bedgraph
genomeCoverageBed
Again, we have to be careful about strandedness! genomeCoverageBed
defines “strand” by their orientation and they will treat BAM files identically, regardless of whether they were generated with the --rna-strandness
R
and F
option. This is because bedtools
is not part of the tuxedo suite. Therefore in the testing stage for generating a workflow, you will have to try combinations of -strand
arguments and plus
vs. minus
outputs to ensure proper visualization as I show in the code chunk below. Since you already confirmed the R
option with HISAT2 and the rev
option with HTseq
, I will tell you that the specified strand and output is inconsistent (i.e. -
paired with plus
and +
paired with minus
) with this library prep method.
#test1
bedtools genomecov -bga -split -strand + -ibam INFILE > OUTFILE_plus_test1.bedGraph
bedtools genomecov -bga -split -strand - -ibam INFILE > OUTFILE_minus_test1.bedGraph
#test2
bedtools genomecov -bga -split -strand + -ibam INFILE > OUTFILE_minus_test2.bedGraph
bedtools genomecov -bga -split -strand - -ibam INFILE > OUTFILE_plus_test2.bedGraph
We can add track lines as option to include in the files:
bedtools genomecov -bg -split -strand - -ibam INFILE -trackline -trackopts 'name=RNA-seq_plus description="RNA-seq_plus" visibility=full autoScale=on alwaysZero=on color=255,0,0 windowingFunction=maximum' > OUTFILE_plus_track.bedGraph
OR
You can add track lines as previously shown with awk
or with a modified AddTrackLines script we wrote in class for ChIP-seq.
awk 'BEGIN { print "browser position chr17:1,310,184-1,377,073"
print "track type=bedGraph name=RNAseq_plus description="RNAseq_plus" visibility=full autoScale=on alwaysZero=on color=255,0,0 windowingFunction=maximum"}
{ print $0}' RNAseq.bedGraph > RNAseq_header.bedGraph
You can now upload the file into the UCSC genome browser:
https://genome.ucsc.edu/s/Mike%20Guertin/hg38_rna_seq_class_C1
1. Use HISAT2 to align C1_RNAseq_rep1.fastq.gz
(/home/FCAM/meds5420/data/RNA_seq/C1_RNAseq_rep1.fastq.gz
) to your user ID chromosome (/home/FCAM/meds5420/genomes/chroms/chr#.fa
) using the appropriate options. Recall that you will have to build your chromosome index files by first extracting splice sites and exons. Speed up the process by using you GTF annotation file that is unique to your chromosome.
2. Visualize the data in the browser and send me a UCSC session link.
- Your chromosome can be found here: /home/FCAM/meds5420/genomes/chroms
extract your chromosome gene annotations from here: /home/FCAM/meds5420/annotations/gencode.v39.annotation.gtf
Most RNA-seq libraries are paired end, so I we will go over the code and explanations to visualize PE data. UConn and JAX have very strong RNA biology and splicing labs, so we will also go through the entire process of parsing files to visualize splice junction reads on a browser.
Original merged C1 and C7 PE FASTQ files: /home/FCAM/meds5420/data/RNA_seq/paired_end
If you submit samples for RNA-seq at UConn’s CGI, then you will most likely get stranded data back. It will often be paired-end data as well. Here’s how to map and visualize the data.
module load hisat2
dir=/home/FCAM/meds5420/data/RNA_seq/paired_end/
ht2_gen=/home/FCAM/meds5420/genomes/hg38_hisat2/hg38_hisat2
hisat2 -p4 --dta --rna-strandness RF -x $ht2_gen -1 ${dir}C_RNAseq_all_PE1.fastq.gz -2 ${dir}C_RNAseq_all_PE2.fastq.gz -S C_RNAseq_all.sam
the --rna-strandness
refers to whether the sequence read represents the RNA equivalent or is complementary to it. If the read represents the same strand as the RNA, then use the F
as the input, if it is the complement of the RNA, use R
.
For the Paired-end kit used in the CGI, the first read (mate pair 1) is the complement to the RNA, and the second read (mate pair 2) is the same strand as the RNA. Therefore, you should enter RF
for this option.
Next, we can convert this .sam
file to a sorted .bam
, however, for making bed
files of paired end data, the reads should be sorted by name (-n
), so mate-pairs as next to each other. The example below will just process chr17
to speed things up. Remember, when employing new code or building a pipeline it is always a good idea to start processing only a fraction of the data (i.e. head -n 400000 X > Y
or grep -w "chr17" X > Y
) until all seems to be working well. This will save you time just waiting for things to run. It is a bit convoluted to extract all chr17 entries from a sam
or bam
file. You may want to use grep
for you favorite chromosome in the sam
file, then convert to a bam
and you are good to go, right? Remember, this excludes the header and causes problems. The following routine will allow you to extract your chromosome from an aligned sam
file. It is convoluted, but this is the way
.
# convert sam to bam
samtools view -b C_RNAseq_all.sam -o C_RNAseq_all.bam
# sort bam
samtools sort C_RNAseq_all.bam -o C_RNAseq_all_sorted.bam
# index the bam (makes a *.bai file)
samtools index C_RNAseq_all_sorted.bam
# finally we can extract chr17
samtools view -b C_RNAseq_all_sorted.bam chr17 -o C_RNAseq_chr17.bam
# index the chr bam (makes a *.bai file)
samtools index C_RNAseq_chr17.bam
# always sort your bams
samtools sort C_RNAseq_chr17.bam -o C_RNAseq_chr17_sorted.bam
The necessary bam
and bam.bai
files to extract your chromosome info are within the /home/FCAM/meds5420/data/RNA_seq/processed_bam/
directory on Xanadu.
Now we can convert to bed12
format for display in the browser. However, this will display all reads, not only the spliced reads. bed12
format specifications can be found here:
https://genome.ucsc.edu/FAQ/FAQformat.html#format1
bedtools bamtobed -bed12 -split -i C_RNAseq_chr17_sorted.bam > C_RNAseq_chr17_pe.bed12.bed
Review bedtools
commands and options here:
Deeper exon coverage is somewhat apparent and we can observed some gapped reads that are indicative of spliced reads:
This is also a bit cluttered, so we may just want to observe the spliced reads.
To display the splice reads only, we can start with a bam
file that is sorted by position (not name).
Note: You can display splice junctions for any RNA-seq data - stranded or not, paired-end or not.
Invoking the -H
option in samtools
will print the header to a new file, which is important for downstream processing.
samtools view -H C_RNAseq_chr17_sorted.bam > C_RNAseq_chr17_spliced.sam
See Section 3.2 Intermediate file conversions
for the explanation of the awk
code taking out relevant CIGAR codes. We append to the header file:
samtools view C_RNAseq_chr17_sorted.bam | awk '($6 ~/N/)' >> C_RNAseq_chr17_spliced.sam
bam
samtools view -b C_RNAseq_chr17_spliced.sam > C_RNAseq_chr17_spliced.bam
bed12
Use bedtools
to convert to bed12
bamToBed -bed12 -i C_RNAseq_chr17_spliced.bam > C_RNAseq_chr17_spliced_bed12.bed
Example of bed12
format:
head C_RNAseq_chr17_spliced_bed12.bed | tail -5
## chr17 65691 65847 SRR5882541.5406954/2 0 - 65691 65847 255,0,0 2 45,18 0,138
## chr17 65703 65871 SRR5882543.19047503/1 0 + 65703 65871 255,0,0 2 33,42 0,126
## chr17 65704 65872 SRR5882544.1641172/1 1 + 65704 65872 255,0,0 2 32,43 0,125
## chr17 65722 71369 SRR5882543.9412633/1 1 + 65722 71369 255,0,0 3 14,58,4 0,107,5643
## chr17 65725 71370 SRR5882544.25013538/2 60 - 65725 71370 255,0,0 3 11,58,5 0,104,5640
IMPORTANT: If using paired-end data, you will need to flip the strand on this file and then add tracklines and upload. In the name
column (column 4), mate pair one reads end with "/1"
and mate pair 2 reads end in "/2"
. We can use this to "flip"
mate pair 1 reads to the opposite strand. Below is an awk
script that does this.
#flip
awk '{OFS="\t"; if($4 ~"/2") print $0; else if($4 ~"/1" && $6 == "+") print($1, $2, $3,$4,$5,"-", $7,$8,$9,$10,$11,$12); else if($4 ~"/1" && $6 == "-") print($1, $2, $3,$4,$5,"+", $7,$8,$9,$10,$11,$12)}' C_RNAseq_chr17_spliced_bed12.bed > C_RNAseq_chr17_spliced_bed12_R1Flip.bed
#add trackline
awk 'BEGIN { print "browser position chr17:8,408,924-8,589,493"; print "track type=bed name=\"C_RNAseq_chr17_pe\" description=\"C_RNAseq_chr17_pe\" visibility=squish autoScale=on colorByStrand=\"255,0,0 0,0,255\""} { print $0}' C_RNAseq_chr17_spliced_bed12_R1Flip.bed > C_RNAseq_chr17_spliced_header.bed
https://genome.ucsc.edu/s/Mike%20Guertin/hg38_RNA_splicing_class
bedGraphs
files of C_RNAseq_chr17_pe.bed12.bed
data above (all split spliced reads and non-split reads).We can then make bedGraph
files as before using the -split
and -strand
options. However, the input .bed
files need to be sorted by position, unmapped reads discarded (.
in columns 1 or 3) and undergo some more processing first.
First, sort the bed file by position.
sort -k1,1 -k2,2n C_RNAseq_chr17_pe.bed12.bed | grep -v -w "\." > C_RNAseq_chr17_pe.bed12_sorted.bed
IMPORTANT: bedtools
can properly convert stranded data to a bedGraph
viewable in the browser. However, it does not consider the strandedness of the pairs, so one of the mates in a pair will always map to the wrong strand creating data on both strands where it should only be one one. To get around this, we must manually flip
the read that is the reverse complement of the RNA to the other strand. In this case, we must flip read 1. First, let’s look at a bed12
file and see how we can parse it ourselves.
head C_RNAseq_chr17_pe.bed12_sorted.bed | tail -5
## chr17 60030 60106 SRR5882548.19678534/1 1 + 60030 60106 255,0,0 1 76 0
## chr17 60030 60106 SRR5882548.7287074/1 1 + 60030 60106 255,0,0 1 76 0
## chr17 60044 60120 SRR5882548.28798868/2 0 - 60044 60120 255,0,0 1 76 0
## chr17 60050 60126 SRR5882542.1394632/1 1 + 60050 60126 255,0,0 1 76 0
## chr17 60082 60154 SRR5882542.1394632/2 1 - 60082 60154 255,0,0 1 72 0
In the name
column (column 4), mate pair one reads end with "/1"
and mate pair 2 reads end in "/2"
. We can use this to "flip"
mate pair 1 reads to the opposite strand. Below is an awk script that can do this.
awk '{OFS="\t"; if($4 ~"/2") print $0; else if($4 ~"/1" && $6 == "+") print($1, $2, $3,$4,$5,"-", $7,$8,$9,$10,$11,$12); else if($4 ~"/1" && $6 == "-") print($1, $2, $3,$4,$5,"+", $7,$8,$9,$10,$11,$12)}' C_RNAseq_chr17_pe.bed12_sorted.bed > C_RNAseq_chr17_bed12_R1Flip.bed
Now create the bedGraphs
, add track lines, zip
the files, transfer them to our computer, and upload to UCSC:
hgSizes=/home/FCAM/meds5420/genomes/hg38.chrom.sizes
bedtools genomecov -bg -split -strand + -i C_RNAseq_chr17_bed12_R1Flip.bed -g $hgSizes > C_RNAseq_chr17_pe_R1flip_plus.bedGraph
bedtools genomecov -bg -split -strand - -i C_RNAseq_chr17_bed12_R1Flip.bed -g $hgSizes > C_RNAseq_chr17_pe_R1flip_minus.bedGraph
awk 'BEGIN { print "browser position chr17:8,449,131-8,569,510"; print "track type=bedGraph name=\"C_RNAseq_chr17_plus\" description=\"C_RNAseq_chr17_plus\" visibility=squish autoScale=on color=\"255,0,0\""} { print $0}' C_RNAseq_chr17_pe_R1flip_plus.bedGraph > C_RNAseq_chr17_pe_plus.bedGraph
awk 'BEGIN { print "browser position chr17:8,449,131-8,569,510"; print "track type=bedGraph name=\"C_RNAseq_chr17_minus\" description=\"C_RNAseq_chr17_minus\" visibility=squish autoScale=on color=\"0,0,255\""} { print $0}' C_RNAseq_chr17_pe_R1flip_minus.bedGraph > C_RNAseq_chr17_pe_minus.bedGraph
gzip C_RNAseq_chr17_pe_plus.bedGraph
gzip C_RNAseq_chr17_pe_minus.bedGraph
#sftp to your local machine
https://genome.ucsc.edu/s/Mike%20Guertin/hg38_PE_class
1. Extract the reads from your chromosome, then extract the spliced reads from your chromosome, convert to a BED12
format, add a trackline (see “bed12 format specifications” link above) that includes a different color for each forward and reverse aligned set of splice junctions (see colorByStrand
), and upload to the browser. The necessary bam
and bam.bai
files to extract your chromosome info are within the /home/FCAM/meds5420/data/RNA_seq/processed_bam/
directory on Xanadu. I began extracting your chromosome from the bam
file with the command samtools view -b C_RNAseq_all_sorted.bam chr# -o C_RNAseq_chr#.bam
, so check to see if you chromosome-specific bam
is already present.
2. Optional: make bedGraphs
of your chromosome and try to understand what the complicated set of commands are doing by inspecting the input and output files.
The molecular biology of library preps diverge depending upon whether the experimenter wishes to preserve the RNA strand information in the sequencing data. Unstranded libraries were common over a decade ago, but I see no reason why contemporary research would employ strand-ambiguous RNA-seq. Therefore, I will briefly discuss the workflow for the library preparation that I prefer and how clever molecular biologists developed methods to retain the strand specificity of the RNA throughout the library prep.
1) Isolate RNA.
2) Negative selection of rRNA.
3) Fragment the RNA with heat.
4) Anneal random hexamer DNA primers to the RNA.
5) Perform reverse transcription of the RNA using a reverse transcriptase that is engineered to operate at high temperatures. RT at higher temperatures relieves secondary structure of the RNA and leads to less biased libraries. Including actinomycin D at this step prevents second strand synthesis during the RT by inhibiting DNA-dependent DNA synthesis.
6) Perform second strand synthesis using the dUTP nucleotide as opposed to dTTP. dUTP incorporation is important for conferring strand specificity.
7) The next step is to blunt the ends of the DNA and generate a 5’ A overhang. T4 Polymerase harbors 3’ to 5’ exonuclease activity and 5’ to 3’ polymerase activity, which will blunt both overhangs. T4 Polymerase also phosphorylates 5’ DNA ends.
8) Taq polymerase is used to add an A to the 3’ end of the dsDNA. The high throughput sequencing adapter is usually a forked Y adapter that contains a T overhang on the 5’ end. This pairs with the A overhang, traditionally referred to as TA cloning. Alternatively, a hairpin adapter is used that incorporates a dUTP that effectively generates a Y adapter when the dUTP is digested.
9) Ligate the sequencing adapters with DNA ligase.
10) Digest dUTPs incorporated during the second strand synthesis with uracil-DNA glycosylase (UGDase). This confers sequence specificity because now you have a single strand DNA molecule with a distinct adapter on each end. The ssDNA represents the template strand.
11) The last step is to PCR amplify with a variety of multiplexed adapters that permit pooling of samples.
The dsDNA molecules can be sequenced using a specific primer for each end of the molecule. Refer to step 10 to determine which sequencing primer would result in the coding strand and which would be the reverse complement thereof.
NEXT TIME: Differential gene expression analysis.
anno=/home/FCAM/meds5420/annotations/gencode.v39.annotation.gtf
grep -w "chr15" $anno > gencode.v39.annotation.chr15.gtf
annochr15=gencode.v39.annotation.chr15.gtf
chr=chr15
#for i in /home/FCAM/meds5420/data/RNA_seq/C*_RNAseq_rep*_*.sam
touch mergelist.txt
for i in /home/FCAM/meds5420/data/RNA_seq/C*_RNAseq_rep*[FR].sam
do
name=$(echo $i | rev | cut -d "/" -f 1 | rev | cut -d "." -f 1)
echo $name
grep -w $chr $i > ${name}_chr15.sam
samtools sort -@ 4 -o ${name}_chr15_sorted.bam ${name}_chr15.sam
stringtie -p 4 -G $annochr15 -o $name.gtf -l ${name} ${name}_chr15_sorted.bam
echo $name.gtf >> mergelist.txt
done
grep "F.gtf" mergelist.txt > mergelist_F.txt
grep "R.gtf" mergelist.txt > mergelist_R.txt
fr="F R"
for x in $fr
do
#2 merge transcripts
stringtie --merge -p 4 -G $annochr15 -o stringtie_merged_${x}.gtf mergelist_${x}.txt
#3 compare
gffcompare -R -r $annochr15 -o stringtie_merged_${x}_all stringtie_merged_${x}.gtf
done
The `R` oriented files have higher precision and fewer novel features. Novel features are false positives (FP), because we consider the reference gene annotations as true positives (TP). precision is calculated as Precision = TP / (TP+FP). All the novel features (FP counts) in `F` are actually found because the orientation of the strand is not correct. This further confirms that `R` strandedness is correct.
#4 abundance
name=C1_RNAseq_rep1_
for x in $fr
do
stringtie -e -B -p 8 -G $annochr15 -o stringtie_est_${x}.gtf -A ${name}${x}.tab ${name}${x}_chr15_sorted.bam
#5 tests
echo ${name}${x} HSP90 counts
grep -i hsp90 ${name}${x}.tab
echo cumulative Coverage
awk 'NR>1 {sum+=$7} END{print "sum=",sum}' ${name}${x}.tab
echo Cumulative FPKM
awk 'NR>1 {sum+=$8} END{print "sum=",sum}' ${name}${x}.tab
echo Cumulative TPM
awk 'NR>1 {sum+=$9} END{print "sum=",sum}' ${name}${x}.tab
done
I did a few tests to see if R or F is correct:
1) I did not know what highly expressed genes were on chr15, so I googled the official gene names (i.e. Actin ==ACTN) and looked at expression of all the candidates: Actin, Tubulin, gapdh, and hsp90. A few variants of hsp90 are present on chr15 and there is coverage and normalized counts in the R orientation.
2) I looked at the total Coverage, FPKM, and TPM for R and F. the FPKM and TPM are normalized for read depth and since fewer reads align to the wrong orientation, the cumulative values are similar. However, coverage is per-base coverage of the gene and not normailzed by read depth, so this value is directly comparable between R and F input files. The coverage is ~10 fold higher in the R orientation, further confirming R is correct. Coverage would be comparable in R and F if the libraries were unstranded, then we would need to remap with the --rna-standedness flag not specified.
grep -w "chr15" /home/FCAM/meds5420/annotations/gencode.v39.annotation.gtf > gencode.v39.annotation.chr15.gtf
annochr15=gencode.v39.annotation.chr15.gtf
#you need to work through how to make C7_RNAseq_rep1_[FR]_chr15_sorted.bam
# OK, here is how to make the input files.
samtools index C7_RNAseq_rep1_F.bam
samtools index C7_RNAseq_rep1_R.bam
samtools view -b C7_RNAseq_rep1_F.bam chr15 > C7_RNAseq_rep1_chr15_F.bam
samtools view -b C7_RNAseq_rep1_R.bam chr15 > C7_RNAseq_rep1_chr15_R.bam
samtools sort C7_RNAseq_rep1_chr15_F.bam -o C7_RNAseq_rep1_F_chr15_sorted.bam
samtools sort C7_RNAseq_rep1_chr15_R.bam -o C7_RNAseq_rep1_R_chr15_sorted.bam
samtools index C7_RNAseq_rep1_F_chr15_sorted.bam
samtools index C7_RNAseq_rep1_R_chr15_sorted.bam
for i in C7_RNAseq_rep1_[FR]_chr15_sorted.bam
do
name=$(echo $i | rev | cut -d "/" -f 1 | rev | cut -d "." -f 1)
echo $name
htseq-count -r pos -f bam --stranded=yes ${i} $annochr15 > $name.gene.counts.yes.txt
htseq-count -r pos -f bam --stranded=reverse ${i} $annochr15 > $name.gene.counts.reverse.txt
tail $name.gene.counts.yes.txt
tail $name.gene.counts.reverse.txt
echo $name
echo yes
grep -v __ $name.gene.counts.yes.txt | awk '{sum+=$2} END{print "sum=",sum}'
echo reverse
grep -v __ $name.gene.counts.reverse.txt | awk '{sum+=$2} END{print "sum=",sum}'
done
# many more counts in reverse indicate that reverse is counting reads coming from coherent annotation/read strands
# I know that they are all generated using the same molecular bio protocols
# If I was unsure, I would write a loop to go through all of them
grep -w "chr15" gencode.v39.annotation.gtf > gencode.v39.chr15.annotation.gtf
extract_splice_sites.py gencode.v39.chr15.annotation.gtf > hg38_hisat_chr15_splicesites.txt
extract_exons.py gencode.v39.chr15.annotation.gtf > hg38_hisat_chr15_exons.txt
cp /home/FCAM/meds5420/genomes/chroms/chr15.fa ./
hisat2-build -p 8 --ss hg38_hisat_chr15_splicesites.txt --exon hg38_hisat_chr15_exons.txt chr15.fa chr15_hisat2
name=C7_RNAseq_rep1
cp /home/FCAM/meds5420/data/RNA_seq/$name.fastq.gz ./
hisat2 -p4 --dta --rna-strandness R -x chr15_hisat2 -U $name.fastq.gz -S $name.sam
samtools sort -@ 4 -o $name_sorted.bam $name.sam
bedtools genomecov -bg -split -strand - -ibam $name_sorted.bam -trackline -trackopts 'name=C7_RNAseq_rep1_plus description="C1_RNAseq_rep1_plus" visibility=full autoScale=on alwaysZero=on color=255,0,0 windowingFunction=maximum' > C7_RNAseq_rep1_plus.bedGraph
bedtools genomecov -bg -split -strand + -ibam C7_RNAseq_rep1_sorted.bam -trackline -trackopts 'name=C7_RNAseq_rep1_minus description="C7_RNAseq_rep1_minus" visibility=full autoScale=on alwaysZero=on color=0,0,255 windowingFunction=maximum' > C7_RNAseq_rep1_minus.bedGraph
gzip *bedGraph
#sftp the bedGraphs to your computer and upload to UCSC.
# my C1 session:
# https://genome.ucsc.edu/s/Mike%20Guertin/hg38_rna_seq_class_C1
#interpret the data
#The library is stranded because each strand separated track aligns to gene annotations
#specifically, as opposed to both strand separated tracks having comparable coverage at #each gene annotation. Since the library is firststrand stranded (dUTP), the incoherent (- specification and plus labeling) is correct.
module load samtools
module load bedtools
cp /home/FCAM/meds5420/data/RNA_seq/processed_bam/* ./
samtools view -b C_RNAseq_all_sorted.bam chr17 -o C_RNAseq_chr17.bam
samtools index C_RNAseq_chr17.bam
samtools sort C_RNAseq_chr17.bam -o C_RNAseq_chr17_sorted.bam
samtools view -H C_RNAseq_chr17_sorted.bam > C_RNAseq_chr17_spliced.sam
samtools view C_RNAseq_chr17_sorted.bam | awk '($6 ~/N/)' >> C_RNAseq_chr17_spliced.sam
samtools view -b C_RNAseq_chr17_spliced.sam > C_RNAseq_chr17_spliced.bam
bamToBed -bed12 -i C_RNAseq_chr17_spliced.bam > C_RNAseq_chr17_spliced_bed12.bed
awk '{OFS="\t"; if($4 ~"/2") print $0; else if($4 ~"/1" && $6 == "+") print($1, $2, $3,$4,$5,"-", $7,$8,$9,$10,$11,$12); else if($4 ~"/1" && $6 == "-") print($1, $2, $3,$4,$5,"+", $7,$8,$9,$10,$11,$12)}' C_RNAseq_chr17_spliced_bed12.bed > C_RNAseq_chr17_spliced_bed12_R1Flip.bed
awk 'BEGIN { print "browser position chr17:8,408,924-8,589,493"; print "track type=bed name=\"C_RNAseq_chr17_pe\" description=\"C_RNAseq_chr17_pe\" visibility=squish autoScale=on colorByStrand=\"255,0,0 0,0,255\""} { print $0}' C_RNAseq_chr17_spliced_bed12_R1Flip.bed > C_RNAseq_chr17_spliced_header.bed
gzip C_RNAseq_chr17_spliced_header.bed
# then sftp to your computer and upload to ucsc
sort -k1,1 -k2,2n C_RNAseq_chr17_pe.bed12.bed | grep -v -w "\." > C_RNAseq_chr17_pe.bed12_sorted.bed
awk '{OFS="\t"; if($4 ~"/2") print $0; else if($4 ~"/1" && $6 == "+") print($1, $2, $3,$4,$5,"-", $7,$8,$9,$10,$11,$12); else if($4 ~"/1" && $6 == "-") print($1, $2, $3,$4,$5,"+", $7,$8,$9,$10,$11,$12)}' C_RNAseq_chr17_pe.bed12_sorted.bed > C_RNAseq_chr17_bed12_R1Flip.bed
hgSizes=/home/FCAM/meds5420/genomes/hg38.chrom.sizes
bedtools genomecov -bg -split -strand + -i C_RNAseq_chr17_bed12_R1Flip.bed -g $hgSizes > C_RNAseq_chr17_pe_R1flip_plus.bedGraph
bedtools genomecov -bg -split -strand - -i C_RNAseq_chr17_bed12_R1Flip.bed -g $hgSizes > C_RNAseq_chr17_pe_R1flip_minus.bedGraph
awk 'BEGIN { print "browser position chr17:8,449,131-8,569,510"; print "track type=bedGraph name=\"C_RNAseq_chr17_plus\" description=\"C_RNAseq_chr17_plus\" visibility=squish autoScale=on color=\"255,0,0\""} { print $0}' C_RNAseq_chr17_pe_R1flip_plus.bedGraph > C_RNAseq_chr17_pe_plus.bedGraph
awk 'BEGIN { print "browser position chr17:8,449,131-8,569,510"; print "track type=bedGraph name=\"C_RNAseq_chr17_minus\" description=\"C_RNAseq_chr17_minus\" visibility=squish autoScale=on color=\"0,0,255\""} { print $0}' C_RNAseq_chr17_pe_R1flip_minus.bedGraph > C_RNAseq_chr17_pe_minus.bedGraph
gzip C_RNAseq_chr17_pe_plus.bedGraph
gzip C_RNAseq_chr17_pe_minus.bedGraph
# then sftp to your computer and upload to ucsc
#use wc -l, head, and tail to figure out what each command is doing.