Conclusion

This tutorial has walked through a complete pipeline of comparative genomic analysis. By using this tool, we’ve been able to generate hypotheses of functional associations between genetic regions using only raw sequence data as input. Along the way, we’ve covered the following:

Working with sequencing data, including alignment and visualization
Finding and annotating genes from sequencing data
Finding Clusters of Orthologous Genes (COGs)
Reconstructing the evolutionary history of COGs
Analyzing coevolutionary signal between COGs

This coevolutionary signal allows us to generate hypotheses about possible unknown functional associations between proteins. In the previous example, we were able to correctly identify a cluster of ureases using only sequencing data.

This pipeline works on both genes that have been previously characterized and those that have not, meaning we can guide future wet lab investigation with these predictions. Future work will investigate ways to improve our predictions, so stay tuned to the SynExtend package as it develops!

Implementation at Scale

While our analyses are designed to be scalable, they are not yet fast enough to be able to examine thousands of genomes within a single workshop. Since we eventually need to make comparisons between every pair of COGs, the task grows quadratically as more genomes (and thus more COGs) are analyzed. Because of this, we chose a smaller test set to showcase performance in this workshop.

However, our methods can scale to many more genomes. All of our methods are designed to be run in parallel across a supercomputer system. The majority of the methods have low CPU and memory requirements, and thus can be run on small (2-4GB memory) nodes. We have recently completed analysis of all available Streptomyces genomes, which can complete in as little as an hour given sufficient compute nodes.

Our pipeline of analysis looks like this:

Runtime Estimations

The following is a rough estimate of runtime considerations at each stage of the pipeline. These examples use Streptomyces genomes, which are roughly 7-9 megabases. Stages marked single are only run once, whereas parallel are run multiple times. “Runtime” denotes total runtime if single and runtime per iteration if parallel. More details on each runtime component are available on the corresponding pages.

For scaling, \(G\) is the number of Genomes, \(g\) is the number of genes, \(C\) is the number of COGs, \(g_C\) is the number of genes in a particular COG.

NJ, MP, and ML denote Neighbor Joining, Maximum Parsimony, and Maximum Likelihood (respectively). ML is more accurate than MP at the cost of increased runtime, and the same is true of MP and NJ. See the section on Phylogenetics for more info.

Pipeline Stage	Scaling	Single or Parallel	Streptomyces Example	Example Runtime
Get Genomes	\(O(G)\)	Single	300 Genomes	~60 sec
Find Genes	\(O(G)\)	Parallel	Single Genome	~15 min
Annotate Genes	\(O(G)\)	Parallel	Single Genome	~20 min
Pairwise Orthology	\(O(G^2)\)	Parallel	Pair of Genomes	~10 min
Ortholog Pairs to COGs	\(O(g^2)\)	Single	All pairs from 300 Genomes	~5 min
Align and Make Tree	\(O(C)\)	Parallel	Single COG	<1 sec (NJ) ~0.5 hr (MP) ~13 hr (ML)
Calculate Coevolution	\(O(C^2)\)	Parallel	Pair of COGs	~5 sec
(Optimal) Total Runtime	—	—	300 Genomes	~1 hr (NJ) ~1.5 hr (MP) ~14 hr (ML)

For this example, 300 Streptomyces genomes resulted in ~2.2 million genes, resulting in ~65k COGs. Overall runtime depends on the phylogenetic reconstruction algorithm used. Tree construction is performed using amino acid alignments.

Users interested in deploying these analyses at scale are encouraged to contact our lab for more information!

Thank you!

If you’ve made it through this entire tutorial, thank you for following along! I hope this series was informative and useful to your analyses. All code showcased here is actively being worked on by members of our lab, especially the EvoWeaver and EvoWeb functionalities. If you have any comments, suggestions, or feature requests for EvoWeaver, EvoWeb, or this tutorial, please feel free to either email me at ahl27@pitt.edu or open an issue on GitHub.

Other Resources

If you’re interested in learning more about me, our lab, or phylogenetics, check out these resources:

Aidan Lakshman1

Implementation at Scale

Runtime Estimations

Thank you!

Other Resources

Aidan Lakshman¹