Skip to contents

This tutorial has walked through a complete pipeline of comparative genomic analysis. By using this tool, we’ve been able to generate hypotheses of functional associations between genetic regions using only raw sequence data as input. Along the way, we’ve covered the following:

  • Working with sequencing data, including alignment and visualization
  • Finding and annotating genes from sequencing data
  • Finding Clusters of Orthologous Genes (COGs)
  • Reconstructing the evolutionary history of COGs
  • Analyzing coevolutionary signal between COGs

This coevolutionary signal allows us to generate hypotheses about possible unknown functional associations between proteins. In the previous example, we were able to correctly identify a cluster of ureases using only sequencing data.

This pipeline works on both genes that have been previously characterized and those that have not, meaning we can guide future wet lab investigation with these predictions. Future work will investigate ways to improve our predictions, so stay tuned to the SynExtend package as it develops!

Implementation at Scale

While our analyses are designed to be scalable, they are not yet fast enough to be able to examine thousands of genomes within a single workshop. Since we eventually need to make comparisons between every pair of COGs, the task grows quadratically as more genomes (and thus more COGs) are analyzed. Because of this, we chose a smaller test set to showcase performance in this workshop.

However, our methods can scale to many more genomes. All of our methods are designed to be run in parallel across a supercomputer system. The majority of the methods have low CPU and memory requirements, and thus can be run on small (2-4GB memory) nodes. We have recently completed analysis of all available Streptomyces genomes, which can complete in as little as an hour given sufficient compute nodes.

Our pipeline of analysis looks like this:

Runtime Estimations

The following is a rough estimate of runtime considerations at each stage of the pipeline. These examples use Streptomyces genomes, which are roughly 7-9 megabases. Stages marked single are only run once, whereas parallel are run multiple times. “Runtime” denotes total runtime if single and runtime per iteration if parallel. More details on each runtime component are available on the corresponding pages.

For scaling, \(G\) is the number of Genomes, \(g\) is the number of genes, \(C\) is the number of COGs, \(g_C\) is the number of genes in a particular COG.

NJ, MP, and ML denote Neighbor Joining, Maximum Parsimony, and Maximum Likelihood (respectively). ML is more accurate than MP at the cost of increased runtime, and the same is true of MP and NJ. See the section on Phylogenetics for more info.

Pipeline Stage Scaling Single or Parallel Streptomyces Example Example Runtime
Get Genomes \(O(G)\) Single 300 Genomes ~60 sec
Find Genes \(O(G)\) Parallel Single Genome ~15 min
Annotate Genes \(O(G)\) Parallel Single Genome ~20 min
Pairwise Orthology \(O(G^2)\) Parallel Pair of Genomes ~10 min
Ortholog Pairs to COGs \(O(g^2)\) Single All pairs from 300 Genomes ~5 min
Align and Make Tree \(O(C)\) Parallel Single COG <1 sec (NJ)
~0.5 hr (MP)
~13 hr (ML)
Calculate Coevolution \(O(C^2)\) Parallel Pair of COGs ~5 sec
(Optimal) Total Runtime 300 Genomes ~1 hr (NJ)
~1.5 hr (MP)
~14 hr (ML)

For this example, 300 Streptomyces genomes resulted in ~2.2 million genes, resulting in ~65k COGs. Overall runtime depends on the phylogenetic reconstruction algorithm used. Tree construction is performed using amino acid alignments.

Users interested in deploying these analyses at scale are encouraged to contact our lab for more information!

Thank you!

If you’ve made it through this entire tutorial, thank you for following along! I hope this series was informative and useful to your analyses. All code showcased here is actively being worked on by members of our lab, especially the ProtWeaver and ProtWeb functionalities. If you have any comments, suggestions, or feature requests for ProtWeaver, ProtWeb, or this tutorial, please feel free to either email me at or open an issue on GitHub.

Other Resources

If you’re interested in learning more about me, our lab, or phylogenetics, check out these resources: