1 Last time: Calling de novo motifs and determining their occurrences in sequences (motif scanning)
2 Today:
3 Inspect MEME results (quick exercise)
4 TOMTOM usage
- 4.1 Stream-lining motif discovery and database comparison
5 Motif similarity / redundancy
- 5.1 Question . . . Why do we get so many hits for our motifs from TF databases?
6 Answers to In Class Exercise:

1 Last time: Calling de novo motifs and determining their occurrences in sequences (motif scanning)

MEME: Motif calling
MAST: Determine if motif(s) is present in sequence
FIMO: Determine location of all occurrences of motifs in a sequence

2 Today:

Matrices that define motifs (frequency and weight matrices)
Comparison of motifs to databases of known motifs

3 Inspect MEME results (quick exercise)

First, let’s look at the results from our MEME analysis. You can copy them from the server at this location:
/home/FCAM/meds5420/motif
Open the .html and .txt files to see the results

3.1 Motif Matrices

Position Frequency Matrix (PFM): represents the frequency of each base occurrence at each position within the motif (Raw data):

Figure 1: Position frequency matrix

Position Weight Matrix (PWM): a score for the probability that a base will be present at a given position. It takes into account the number of sequences and the background frequency of bases. PWMs are a more realistic reflection of the binding strength of a protein for a given sequence.

Figure 2: Position weight matrix
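
To make the relationship between a PFM and a PWM concrete, here is a minimal awk sketch. It is not how MEME computes its matrices: it assumes a plain-text count matrix with one row per base (A, C, G, T) and one column per position, a uniform background of 0.25 per base, and a pseudocount of 1 per cell to avoid taking log(0). The function name pfm_to_pwm is made up for this illustration.

```shell
# pfm_to_pwm: convert a count matrix (rows = A,C,G,T; columns = positions)
# into log2-odds scores. Assumptions (not from the lecture): uniform
# background of 0.25 and a pseudocount of 1 added to every cell.
pfm_to_pwm() {
    awk -v bg=0.25 -v pseudo=1 '
    {
        # Store every count and keep a running total per column (position).
        for (j = 1; j <= NF; j++) { count[NR, j] = $j; total[j] += $j }
        ncol = NF; nrow = NR
    }
    END {
        for (i = 1; i <= nrow; i++) {
            line = ""
            for (j = 1; j <= ncol; j++) {
                # Smoothed frequency of base i at position j.
                p = (count[i, j] + pseudo) / (total[j] + nrow * pseudo)
                # Log2 ratio of observed frequency to background.
                line = line sprintf("%s%.2f", (j > 1 ? "\t" : ""), log(p / bg) / log(2))
            }
            print line
        }
    }' "$@"
}

# Example: the first three positions of the JASPAR AGL3 (MA0001.1) count matrix.
printf '0 3 79\n94 75 4\n1 0 3\n2 19 11\n' | pfm_to_pwm
```

Positive scores mark bases that are more frequent than background at that position; negative scores mark bases that are depleted.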

Matrices can be in many formats; see: http://meme-suite.org/doc/overview.html#motif_conversion_utilities.

You can create Sequence Logos from your enriched sequences with weblogo: http://weblogo.berkeley.edu/logo.cgi or a vector image using meme ceqlogo:

singularity exec /isg/shared/apps/meme/5.4.1/meme.sif ceqlogo -i meme.txt -m 2 -f EPS -o atf1.eps

3.2 Comparing Your Motif to Databases

Motif databases:
JASPAR: http://jaspardev.genereg.net/
CisBP: http://cisbp.ccbr.utoronto.ca/TFTools.php
Transfac: http://www.gene-regulation.com/pub/databases.html
HOCOMOCO: http://hocomoco11.autosome.ru/

Notes:
JASPAR DB is highly curated from a number of sources.
CisBP is largely a single experimental effort; see: http://www.sciencedirect.com/science/article/pii/S0092867414010368
HOMER: http://homer.ucsd.edu/homer/custom.motifs
Transfac is NOT open access. However, UConn recently purchased University-wide licenses for GeneXplain, which accesses the Transfac database: http://genexplain.com/
HOCOMOCO was made from the motif search tool ChIPMunk (not covered in this course): http://autosome.ru/ChIPMunk/

3.3 Motif database access and acquisition

NOTE: Database must be in meme format. Several formats can be converted using tools from the MEME tools suite.
see: http://meme-suite.org/doc/overview.html#motif_conversion_utilities.

For instance, the JASPAR motifs look like this:

head ./JASPAR_all_matrix.txt

## >MA0001.1 AGL3
## A  [ 0  3 79 40 66 48 65 11 65  0 ]
## C  [94 75  4  3  1  2  5  2  3  3 ]
## G  [ 1  0  3  4  1  0  5  3 28 88 ]
## T  [ 2 19 11 50 29 47 22 81  1  6 ]
## >MA0002.1 RUNX1
## A  [10 12  4  1  2  2  0  0  0  8 13 ]
## C  [ 2  2  7  1  0  8  0  0  1  2  2 ]
## G  [ 3  1  1  0 23  0 26 26  0  0  4 ]
## T  [11 11 14 24  1 16  0  0 25 16  7 ]

To convert a whole directory of JASPAR motif files:

singularity exec /isg/shared/apps/meme/5.4.1/meme.sif jaspar2meme -pfm DIRECTORY_INPUT > jaspar.meme

-pfm: specifies input format

head -28 ./jaspar.meme | tail -20

## 
## MOTIF CN0001.1 LM1
## 
## letter-probability matrix: alength= 4 w= 16 nsites= 5332 E= 0
##   0.168230     0.079895    0.383721    0.368155  
##   0.045949     0.003938    0.918792    0.031320  
##   0.009377     0.051013    0.010315    0.929295  
##   0.031508     0.018942    0.009002    0.940548  
##   0.107839     0.052138    0.721868    0.118155  
##   0.005064     0.874156    0.014254    0.106527  
##   0.003563     0.904351    0.000750    0.091335  
##   0.918980     0.022881    0.021943    0.036197  
##   0.066954     0.029632    0.039197    0.864216  
##   0.207802     0.000750    0.788635    0.002813  
##   0.019505     0.006377    0.969242    0.004876  
##   0.509002     0.249625    0.051013    0.190360  
##   0.744561     0.018192    0.197299    0.039947  
##   0.955551     0.010128    0.023631    0.010690  
##   0.015191     0.915229    0.010503    0.059077  
##   0.368530     0.346774    0.064891    0.219805
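
As a quick sanity check on a converted database, the sketch below lists the ID and name of every motif in a MEME-format file. It relies only on the "MOTIF <ID> <name>" lines visible in the output above; the function name list_motifs is made up for this illustration.

```shell
# list_motifs: print the ID and name of every motif in a MEME-format file.
# Relies on each motif starting with a line of the form "MOTIF <ID> <name>".
list_motifs() {
    awk '$1 == "MOTIF" { print $2 "\t" $3 }' "$@"
}
```

For example, `list_motifs jaspar.meme | wc -l` counts the motifs in the converted file, and `list_motifs jaspar.meme | head` previews the first few.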

A complete list of JASPAR and other database motifs can be found and downloaded from the MEME website:
https://meme-suite.org/meme/db/motifs

I typically download the databases directly from meme: https://meme-suite.org/meme/meme-software/Databases/motifs/motif_databases.12.22.tgz

You can find a few database files here: /home/FCAM/meds5420/TF_db/JASPAR/
I found that you need to copy them locally to use the files as input for TOMTOM.

4 TOMTOM usage

We can use TOMTOM to compare our discovered motif to databases of known motifs. See documentation: http://meme-suite.org/doc/tomtom.html?man_type=web
Output is an .html file and a text file. The output shows the name of each matching motif (database ID), the significance of the match, and the relevant consensus sequences.

Basic usage:

singularity exec /isg/shared/apps/meme/5.4.1/meme.sif tomtom -eps -oc tomtom_OUTPUT meme.txt DATABASE.meme

-eps: creates a sequence logo of your motif aligned to each similar known motif.
-oc: output folder (overwritten if it already exists)
meme.txt is the output text file from your meme analysis.
DATABASE.meme is the database containing the PWMs of known TFs.

Here’s an example usage with more options:

singularity exec /isg/shared/apps/meme/5.4.1/meme.sif tomtom -no-ssc -oc tomtom_OUTPUT -verbosity 1 -min-overlap 5 -mi 1 -evalue -thresh 0.05 meme.txt DATABASE.meme

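-no-ssc do not apply the small-sample correction to the input matrices
-mi index of the motif to use from your meme.txt file (use -m to select a motif by ID instead)
-verbosity (1-5) amount of progress reporting
-min-overlap minimum number of bases overlapping between your motif and the database motifs
-evalue apply the threshold to the E-value (the p-value corrected for multiple testing) instead of the default q-value
-thresh significance threshold for reporting a match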
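
Once Tomtom has run, the tab-separated results can be ranked from the command line. The sketch below assumes the usual tomtom.tsv column layout (Query_ID, Target_ID, Optimal_offset, p-value, E-value, q-value, ...) and that the file ends with a few "#" comment lines; check the header of your own file before relying on the column numbers. The function name top_hits is made up for this illustration.

```shell
# top_hits: print the N best matches (lowest E-value) from a tomtom.tsv file.
# Assumed column layout: 1=Query_ID 2=Target_ID 4=p-value 5=E-value 6=q-value.
# Usage: top_hits tomtom.tsv [N]      (N defaults to 10)
top_hits() {
    local tsv=$1 n=${2:-10}
    # Skip the header, blank lines, and trailing "#" comment lines, then
    # sort on the E-value column (-g understands notation such as 1e-05).
    awk -F'\t' 'NR > 1 && NF >= 6 && $1 !~ /^#/ { print $2 "\t" $5 "\t" $6 }' "$tsv" |
        sort -t "$(printf '\t')" -k2,2g |
        head -n "$n"
}
```

For example, `top_hits tomtom_OUTPUT/tomtom.tsv 5` prints the target ID, E-value, and q-value of the five strongest matches.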

4.1 Stream-lining motif discovery and database comparison

So far we have done the basic motif analysis in discrete steps. The MEME Suite now offers a semi-customizable pipeline, MEME-ChIP, for motif discovery and comparison to databases. However, it does not run MAST and FIMO.

MEME-ChIP can:

discover novel DNA-binding motifs (with MEME and DREME),
determine which motifs are most centrally enriched (with CentriMo),
analyze them for similarity to known binding motifs (with Tomtom),
automatically group significant motifs by similarity,
perform a motif spacing analysis (with SpaMo), and
create a GFF file for viewing each motif’s predicted sites in a genome browser.

I had to copy JASPAR2022_CORE_vertebrates*redundant.meme to my local directory for tomtom to work properly.

Documentation:
http://meme-suite.org/doc/meme-chip.html?man_type=web
Example:

singularity exec /isg/shared/apps/meme/5.4.1/meme.sif  meme-chip -oc meme_chip_ATF1 -db JASPAR2022_CORE_vertebrates_non-redundant.meme ATF1_summit_101bp_top200.fasta -meme-nmotifs 2 -minw 5 -maxw 8 -meme-mod zoops

Note: options for each program in the pipeline can be specified by prefixing the option with the program name, as shown with -meme-nmotifs and -meme-mod above.

In Class Exercise:

Complete the last exercise if you did not finish.

FIMO may take a while. Try to run it again in the background (add & to end of command) and then work on the other exercises or the midterm.

There are converted JASPAR databases in the following location: /home/FCAM/meds5420/TF_db/JASPAR/. Use either database to compare to your motif using tomtom.
View the beginning of the tomtom.tsv output; this is just a tab-separated values (.tsv) text file. Notice that the protein ID is in a format like MA0604.1. Use grep on the original JASPAR .meme file to find the common transcription factor name of some of your top hits. Use grep to see if any Atf1 motifs are found.
Copy the .html file to your local computer and view it in the browser. Any surprises regarding the TFs found?
Try searching for your motif using the web version of TOMTOM. Do you see any differences in the results?
Optional: Try running meme-chip on your ATF1 data and viewing all the resulting files.
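
For the FIMO step of the exercise, the following sketch shows one way to run a slow command in the background and check on it later. Here `sleep 1` is only a stand-in for your own FIMO command, and the log file name is made up for this illustration.

```shell
log=fimo_run.log                 # hypothetical log file name
nohup sleep 1 > "$log" 2>&1 &    # trailing & puts the job in the background
pid=$!                           # process ID of the background job
echo "started job $pid, logging to $log"
jobs                             # list background jobs started from this shell
wait "$pid"                      # optional: block until the job finishes
status=$?
echo "job finished with exit status $status"
```

Using nohup keeps the job running if you log out, and redirecting both stdout and stderr to a log file keeps its messages out of your terminal while you work on something else.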

5 Motif similarity / redundancy

5.1 Question . . . Why do we get so many hits for our motifs from TF databases?

Many transcription factors are part of families that have arisen through genome or local duplication events, or in rare cases through convergent evolution. TFs in families provide redundancy, but sequence divergence among family members allows certain TFs to interact with different partners and/or respond to different signaling cues. However, the DNA binding domains are often the most conserved part of these proteins, which results in overlapping and similar binding sites for seemingly distinct TFs.

Figure 3: Paralogous TF DBDs
The heat map shows the degree of protein sequence conservation among transcription factor families that bind similar consensus motifs (indicated by the sequence logo at the top). Note that Twist and ZNF appear to be an example of convergent evolution.

6 Answers to In Class Exercise:

Running TOMTOM:

#switch to directory above where meme was performed:
singularity exec /isg/shared/apps/meme/5.4.1/meme.sif tomtom -no-ssc -oc tomtom_OUTPUT -verbosity 1 -min-overlap 5 -mi 1 -evalue -thresh 0.05 ATF1_classic.meme_output/meme.txt JASPAR2022_CORE_vertebrates_non-redundant.meme

grep out some top hits:

grep MA1131.1 /home/FCAM/meds5420/TF_db/JASPAR/JASPAR2022_CORE_vertebrates_redundant.meme
grep MA0604.1 /home/FCAM/meds5420/TF_db/JASPAR/JASPAR2022_CORE_vertebrates_redundant.meme

#the "-f -" option tells grep to read its search patterns from stdin (here, the Atf1 motif IDs):
grep -i atf1 JASPAR2022_CORE_vertebrates_redundant.meme | cut -d ' ' -f2 | grep -f - tomtom.tsv
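
As a complement to the individual grep lookups above, this sketch labels every Tomtom hit with its TF name in one pass. It assumes the database uses "MOTIF <ID> <name>" lines and that tomtom.tsv has the target ID in column 2 and the E-value in column 5; the function name annotate_hits is made up for this illustration.

```shell
# annotate_hits: add the TF name from a MEME-format database to each Tomtom hit.
# Usage: annotate_hits DATABASE.meme tomtom.tsv
# Output columns: target ID, TF name (NA if not found), E-value.
annotate_hits() {
    awk -F'\t' '
        # First file (database): split "MOTIF <ID> <name>" lines on whitespace.
        NR == FNR {
            n = split($0, f, " ")
            if (f[1] == "MOTIF") name[f[2]] = f[3]
            next
        }
        # Second file (tomtom.tsv): skip the header, blank, and "#" lines.
        FNR > 1 && NF >= 6 && $1 !~ /^#/ {
            print $2 "\t" (($2 in name) ? name[$2] : "NA") "\t" $5
        }' "$1" "$2"
}
```

For example, `annotate_hits JASPAR2022_CORE_vertebrates_non-redundant.meme tomtom_OUTPUT/tomtom.tsv | grep -i atf1` shows whether any Atf1 motifs are among the hits.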

You can also search JASPAR online.

Example working MEME-ChIP command:

singularity exec /isg/shared/apps/meme/5.4.1/meme.sif meme-chip -oc meme_chip_ATF1 -db JASPAR2022_CORE_vertebrates_non-redundant.meme ATF1_summit_101bp_top200.fasta -meme-nmotifs 2 -minw 5 -maxw 8 -meme-mod zoops

MEDS5420 - Lecture 17 - motif queries

March 22, 2023
