Finding COGs

We’ve now learned some ways to load genomic data into R, as well as ways to find and annotate genomic sequences. Once we have annotated sequence data, we’ll want to find genes that are orthologous. Orthologous genes are genes that derive from a common ancestral gene. This is how we can “match” up genes from different organisms. It isn’t guaranteed that these genes have preserved their function since diverging from their ancestral state, but it does give us insight into the evolution of genes over time. Sets of orthologous genes will be referred to as COGs (Clusters of Orthologous Genes).

Building Our Dataset

We’re going to continue using our Micrococcus genomes from NCBI, this time on a subset of 5 genomes. As mentioned in previous sections, the complete data are available here, and you are more than welcome to try these analyses out with more genomes at any time!

All the code in this section will work on larger datasets, although it may take longer to run. See the Conclusions page for more information on running these analyses at scale.

For this analysis, we’re downloading the genomic data as .fasta files along with precalculated annotations as .gff files. We could also have called genes and annotated them in DECIPHER using the method on the previous page. Using prebuilt annotations here simply gives a more thorough overview of different use cases.