Constructing Phylogenies with DECIPHER
We’ve now learned how to find Clusters of Orthologous Genes (COGs)
from a set of sequences. The last thing we need for our final analysis
are phylogenetic reconstructions of each gene cluster. In this step,
we’ll build phylogenies for our COGs using the new
TreeLine() function introduced in the latest update of
DECIPHER. This analysis uses the third COG from the
previous section (if you don’t have it, you can download it below!).
library(DECIPHER) # This is downloadable using the above button # If you finished the previous page, this is MatchedSequences[] datafile <- '/path/to/testCOG.RData'
# Load in example COG # Again, this is COG 3 from the previous section, which you can get with: # testCOG <- MatchedSequences[] # If you don't have it, load it in this way load(datafile, verbose=TRUE) # Should load 'testCOG' # Let's make the names a little more concise # These COGs exist in pairs with two lengths # We'll code them as <Assembly><A/B>, where A is the longer sequence # This renaming will make plotting look a little nicer names(testCOG) <- c('1A', '1B', '2A', '2B', '3B', '3A', '4B', '4A', '5A', '5B') testCOG <- testCOG[order(names(testCOG))]
# Let's look at the aligned COG sequences testCOG
## DNAStringSet object of length 10: ## width seq names ##  1215 GTGTACCTGTCCCACCTCACCGT...CGAGGGCGGCGCCGATGGCTGA 1A ##  564 ATGGCTGAGACCCCCGCCCCGTT...GCCACGCGACACCTACGGCTGA 1B ##  1215 GTGTACCTCTCCCACCTCACCGT...CGAGGGCGGCGCCGATGGCTGA 2A ##  564 ATGGCTGAGACCCCCGCCCCGTT...GCCGCGCGACACCTACGGCTGA 2B ##  1212 GTGCACCTGTCCCACCTGACCGT...CGGCGCCGGCGGGACCGAGGGT 3A ##  565 ATGGCTGAGCAGCCCGCCTCGTT...CCCGCGCGACACCTACGGGTGA 3B ##  1260 GTGCACCTGTCCCACCTCACCGT...GGACGGCGACGCCCGTGCGTGA 4A ##  597 GTGCGTGAGCGCTCGCCGGAGCC...TCCGCGGGACACCTACGGCTGA 4B ##  1215 GTGTACCTGTCCCACCTCACCGT...CGAGGGCGGCGCCGATGGCTGA 5A ##  564 ATGGCTGAGACCCCCGCCCCGTT...GCCACGCGACACCTACGGCTGA 5B
# Since these are coding regions, AlignTranslation is preferred testSeqs <- AlignTranslation(testCOG) # Construct a gene tree from aligned sequences treeCOG1 <- TreeLine(testSeqs, method='MP') # Visualize the result plot(treeCOG1, main='MP')
That’s all we need to construct a quick phylogenenetic tree in R! We
could also set the optional argument
TreeLine automatically reconstruct ancestral states at
TreeLine incorporates a wealth of functionality
past what is detailed here. In fact, this tree isn’t even the best tree
we can make! Let’s take a look at all the new features included in the
Our first example used
method='MP', meaning it
constructed a phyletic tree using a Maximum Parsimony method. However,
many more methods are available.
Maximum Parsimony (
MP), Neighbor-Joining (
WPGMA), and Maximum Likelihood
ML) methods. Each of these have different strengths,
weaknesses, and assumptions. While an in-depth look at the difference
between these methods is outside the scope of this tutorial, I recently
tutorial that goes into the mathematics of how these methods
Example code for each of these:
# Neighbor-Joining distMatrix <- DistanceMatrix(testSeqs) treeNJ <- TreeLine(myDistMatrix=distMatrix, method='NJ') plot(treeNJ, main='NJ')
# UPGMA tree distMatrix <- DistanceMatrix(testSeqs) treeUltra <- TreeLine(myDistMatrix=distMatrix, method='UPGMA') plot(treeUltra, main='UPGMA')
Maximum-Likelihood trees are the most accurate, but also the slowest
to create. This method iteratively maximizes the likelihood of the tree
under a given sequence evolution model for a set of aligned sequences.
In the interest of time, this demo will set the
argument to prevent the algorithm from taking too long.
# Maximum Likehood # - Max runtime is set here to 30sec, default is as long as it takes # - maxTime expresses time in HOURS, not sec/min # - Note that method='ML' is the default setting # - Longer runtime will produce better results treeML <- TreeLine(testSeqs, maxTime=(30/3600)) plot(treeML, main='Maximum-Likelihood')
One question you’re probably asking is, “What is this ‘given sequence evolution model’?” That’s an excellent question!
TreeLine() will test a variety of sequence
evolution models and pick the one that works best for your data. This
means that you shouldn’t typically have to worry about which model to
However, what if we wanted to explicitly pick a certain model? What if we wanted to exclude a handful of models? Or what if we’re just curious what models are even being tested?
Fret not, for there is a solution. Models are passed to
TreeLine() as a list with two named entries,
$Protein. To look at the
default models tested, simply print out the
## $Nucleotide ##  "JC69" "JC69+G4" "K80" "K80+G4" "F81" "F81+G4" ##  "HKY85" "HKY85+G4" "T92" "T92+G4" "TN93" "TN93+G4" ##  "SYM" "SYM+G4" "GTR" "GTR+G4" ## ## $Protein ##  "AB" "AB+G4" "BLOSUM62" "BLOSUM62+G4" ##  "cpREV" "cpREV+G4" "cpREV64" "cpREV64+G4" ##  "Dayhoff" "Dayhoff+G4" "DCMut-Dayhoff" "DCMut-Dayhoff+G4" ##  "DCMut-JTT" "DCMut-JTT+G4" "DEN" "DEN+G4" ##  "FLAVI" "FLAVI+G4" "FLU" "FLU+G4" ##  "gcpREV" "gcpREV+G4" "HIVb" "HIVb+G4" ##  "HIVw" "HIVw+G4" "JTT" "JTT+G4" ##  "LG" "LG+G4" "MtArt" "MtArt+G4" ##  "mtDeu" "mtDeu+G4" "mtInv" "mtInv+G4" ##  "mtMam" "mtMam+G4" "mtMet" "mtMet+G4" ##  "mtOrt" "mtOrt+G4" "mtREV" "mtREV+G4" ##  "mtVer" "mtVer+G4" "MtZoa" "MtZoa+G4" ##  "PMB" "PMB+G4" "Q.bird" "Q.bird+G4" ##  "Q.insect" "Q.insect+G4" "Q.LG" "Q.LG+G4" ##  "Q.mammal" "Q.mammal+G4" "Q.pfam" "Q.pfam+G4" ##  "Q.plant" "Q.plant+G4" "Q.yeast" "Q.yeast+G4" ##  "rtREV" "rtREV+G4" "stmtREV" "stmtREV+G4" ##  "VT" "VT+G4" "WAG" "WAG+G4" ##  "WAGstar" "WAGstar+G4"
Nucleotide models include classic names like Jukes-Cantor
JC69) and Felsenstein 1981 (
F81), as well as
many others. The amino acid substitution models contain a mixture of
general models (e.g.
well as models tailored towards specific organisms
To use a single model, simply create a list matching the structure of
MODELS, just with the models you want to include. To
exclude certain models, copy
MODELS and remove the entries
you don’t want.
# Excluding Specific Models myExcludedModel <- DECIPHER::MODELS myExcludedModel$Protein <- NULL # Remove all protein models myExcludedModel$Nucleotide <- myExcludedModel$Nucleotide[-1] # Remove JC69 exclTree <- TreeLine(testSeqs, model=myExcludedModel, maxTime=(30/3600)) plot(exclTree, main='Excluded Set ML')
Runtime depends greatly on the tree construction method used. Phylogenetic reconstruction methods exhibit a tradeoff between accuracy and runtime; slower algorithms yield better performance.
Runtime scales with the number of genes and the length of each gene. For all below estimates, trials were done using a set of 100 genes with an aligned length of ~350 bases on a single core compute node with 1GB RAM. Reconstruction in amino acid space is slower on average than reconstruction in nucleotide space.
Maximum likelihood methods are the most accurate but also the slowest. For our test set of 100 genes, total runtime for nucleotide alignments takes on the order of 1-3 hours, and for amino acid alignments on the order of 10-12 hours.
Maximum parsimony methods are typically a good compromise between speed and accuracy. On our test set, total runtime was around 5-10 minutes for nucleotide alignments, and around 30 minutes for amino acid alignments.
All the other methods (Neighbor Joining, UPGMA, etc.) are extremely fast, but have much lower accuracy than MP/ML. Constructing a tree with these method (including building a distance matrix) takes less than 1 second.
That’s all we need to know how to do to generate phylogenies for our dataset. In order to conduct our final coevolutionary analysis, we’re going to need to build a gene tree for all of our COGs. I’ve precomputed these trees for us, so we can load those in in the next step without having to worry about long runtimes.
There are a few other parameters I didn’t mention in this writeup.
The most significant one for our use is
as this reconstructs ancestral states (which will be important for later
analyses). There are also additional arguments for multiprocessing
processors=1), using Laguerre quadrature for likelihoods
quadrature=T/F), switching the type of information
criterion for ML trees
informationCriterion=c('AICc', 'BIC')), and many others.
See the documentation page for more information on these–for now, we’ll
continue on to our final goal.