Skip to contents

Gene Calling and Annotation

At this point, we’ve learned how to read in some genomic data, and have gained some basic familiarity working with it. The next step in our pipeline is to take a set of genomes, identify the coding regions in them, and predict the function of as many genetic regions as we can. We’ll start off by identifying the genes themselves.

Finding Genes

We’re going to keep using our Micrococcus dataset, but in the interest of time we’ll focus on finding genes in a single sequence. We’ll begin by reading in the data from a .fasta file, as we did in the previous section.

 

library(DECIPHER)

# This file is downloadable at the above link
datafile <- '/path/to/SingleSeq.fa.gz'
dnaGenome <- readDNAStringSet(datafile)
aaGenome <- translate(dnaGenome) 

Next, we’re going to identify the genes in our sequence using FindGenes() from the DECIPHER package.

FindGenes() returns a Genes object with information on where genes start and end in the genome.

geneLocs <- FindGenes(dnaGenome)
# `Genes` object
geneLocs
## Genes object of size 2,219 specifying:
## 2,219 protein coding genes from 75 to 4,653 nucleotides.
## 
##   Index Strand Begin   End TotalScore ... Gene
## 1     1      0   256  1803     367.59 ...    1
## 2     1      0  2554  3669     255.97 ...    1
## 3     1      0  3691  4905     228.41 ...    1
## 4     1      0  4898  5461      74.94 ...    1
## 5     1      0  5819  7981     547.26 ...    1
## 6     1      0  8050 10746     686.36 ...    1
## ... with 2,213 more rows.

We can then extract the sequences corresponding to each gene with ExtractGenes().

genes <- ExtractGenes(geneLocs, dnaGenome, type='AAStringSet')
# Sequences corresponding to each gene
genes
## AAStringSet object of length 2219:
##        width seq
##    [1]   516 MVADQAVLSSWRSVVGSLEDDARVSARLMGFV...RKIRELMAERRTIYNQVTELTNEIKRKQRGA*
##    [2]   372 MKFTVERDILTDAVSWAARSLSPRPPVPVLSG...SAPKPALLTGVNQEDGVVSDYRYLVMPVRIA*
##    [3]   405 MYLSHLTVADFRSYRWADLELTPGSTVLLGAN...RQDAEGSEAVAAAAPVEGDIREPRREGGADG*
##    [4]   188 MAETPAPFEPDRPDLALVQLRRVREAARERGE...TRIEVAGPQAPSWRKGPRTVRGGRGPRDTYG*
##    [5]   721 MVDAMPENPAEEPTAASAAPNPEAVPDAVGQP...DAVFSMLMGEDVESRRTFIQQNAKDIRFLDV*
##    ...   ... ...
## [2215]   195 MSAENVTPAPEAEDAVVETPAGQGSRVAEQDD...KIVHDVVADAGLTSESEGEGARRHVVISADD*
## [2216]   333 MFEAILSPFRWLMSWLLGAFHSILEFAGLPAD...QAGRGGAVLNGEVVRQSGQRVQPQRKNRRRK*
## [2217]   123 MTVTPSPFVVLEPSREWGPLRALPSALLAGLL...PPGHRRWPPGRQPRILALNHPPIPPDLPQED*
## [2218]   133 MLPRDRRVRTPAEFRHLGRTGTRAGRRTVVVS...AEADYALLRRETVGALGKALKPHLPAASEHA*
## [2219]    46 MTKRTFQPNNRRRARKHGFRARMRTRAGRAILSARRGKNRAELSA*

Of note is that FindGenes() assumes there are no introns or frame shifts, and as a result performs best with prokaryotic genomes.

Removing Non-Coding Regions

FindGenes() finds genes, but these may be coding or non-coding genes. We’re more interested in the regions that are actually translated into proteins, since these are what we’ll try to annotate later. The FindNonCoding() function is developed specifically for this purpose, to help distinguish between coding and non-coding genes.

Using FindGenes() with FindNonCoding() in this way also greatly improves the accuracy of FindGenes(), since we won’t accidentally misidentify coding genes as non-coding genes.

FindNonCoding() is used with three main datafiles depending on the data to analyze:

  • data("NonCodingRNA_Archaea") for Archaeal data
  • data("NonCodingRNA_Bacteria") for Bacterial data
  • data("NonCodingRNA_Eukarya") for Eukaryotic data

These include pretrained models with common non-coding patterns for the relevant domain of life. If these pretrained models are insufficient, you can train your own dataset using LearnNonCoding(), though this is outside the scope of this workshop.

data("NonCodingRNA_Bacteria")
ncRNA <- NonCodingRNA_Bacteria

geneticRegions <- FindNonCoding(ncRNA, dnaGenome)

## Find annotations of noncoding regions
annotations <- attr(geneticRegions, "annotations")
geneMatches <- match(geneticRegions[,"Gene"], annotations)
noncodingAnnots <- sort(table(names(annotations)[geneMatches]))
# What noncoding regions have we found and annotated?
noncodingAnnots
## 
##            5'_ureB-RF02514            Flavo_1-RF01705 
##                          1                          1 
##     FMN_Riboswitch-RF00050 Glycine_Riboswitch-RF00504 
##                          1                          1 
##    RNase_P_class_A-RF00010           SmallSRP-RF00169 
##                          1                          1 
##              tmRNA-RF00023                   tRNA-Asn 
##                          1                          1 
##                   tRNA-Asp                   tRNA-Cys 
##                          1                          1 
##                   tRNA-His                   tRNA-Phe 
##                          1                          1 
##                   tRNA-Trp                   tRNA-Tyr 
##                          1                          1 
##           rRNA_16S-RF00177           rRNA_23S-RF02541 
##                          2                          2 
##            rRNA_5S-RF00001     SAM_Riboswitch-RF00162 
##                          2                          2 
##     TPP_Riboswitch-RF00059                   tRNA-Gln 
##                          2                          2 
##                   tRNA-Lys                   tRNA-Met 
##                          2                          2 
##                   tRNA-Val                   tRNA-Glu 
##                          2                          3 
##                   tRNA-Ile                   tRNA-Pro 
##                          3                          3 
##                   tRNA-Thr                   tRNA-Ala 
##                          3                          4 
##                   tRNA-Arg                   tRNA-Gly 
##                          4                          4 
##                   tRNA-Ser                   tRNA-Leu 
##                          4                          5

FindNonCoding() returns a Genes object identifying non-coding regions with annotations. We can pass this object to FindGenes() to improve our identification of coding regions, resulting in more accurate gene calling than just running FindGenes() directly.

# Find Genes
genes <- FindGenes(dnaGenome, includeGenes=geneticRegions)
# Genes in the genome
genes 
## Genes object of size 2,251 specifying:
## 2,193 protein coding genes from 99 to 4,653 nucleotides.
## 58 non-coding RNAs from 73 to 3,088 nucleotides.
## 
##   Index Strand Begin   End TotalScore ... Gene
## 1     1      0   256  1803     376.48 ...    1
## 2     1      0  2554  3669     262.51 ...    1
## 3     1      0  3691  4905     225.34 ...    1
## 4     1      0  4898  5461      72.13 ...    1
## 5     1      0  5819  7981     558.05 ...    1
## 6     1      0  8050 10746     695.96 ...    1
## ... with 2,245 more rows.

As before, we can pull out the sequences for these regions using ExtractGenes(). Since these are all coding regions, we’ll also translate them into amino acids.

# Find amino acid sequences corresponding to found genes
# May get a warning if we don't have an even multiple of 3 base pairs
geneSeqs <- ExtractGenes(genes, dnaGenome, type="DNAStringSet")
geneSeqs <- translate(geneSeqs)
geneSeqs
## AAStringSet object of length 2251:
##        width seq
##    [1]   516 VVADQAVLSSWRSVVGSLEDDARVSARLMGFV...RKIRELMAERRTIYNQVTELTNEIKRKQRGA*
##    [2]   372 VKFTVERDILTDAVSWAARSLSPRPPVPVLSG...SAPKPALLTGVNQEDGVVSDYRYLVMPVRIA*
##    [3]   405 VYLSHLTVADFRSYRWADLELTPGSTVLLGAN...RQDAEGSEAVAAAAPVEGDIREPRREGGADG*
##    [4]   188 MAETPAPFEPDRPDLALVQLRRVREAARERGE...TRIEVAGPQAPSWRKGPRTVRGGRGPRDTYG*
##    [5]   721 VVDAMPENPAEEPTAASAAPNPEAVPDAVGQP...DAVFSMLMGEDVESRRTFIQQNAKDIRFLDV*
##    ...   ... ...
## [2247]   242 MTERGSAYERRPGLDPQDRGDRPAPVGQPAPP...KYGGAEPEILTVGEGLLPETTTVVRVSARPR*
## [2248]   195 MSAENVTPAPEAEDAVVETPAGQGSRVAEQDD...KIVHDVVADAGLTSESEGEGARRHVVISADD*
## [2249]   333 MFEAILSPFRWLMSWLLGAFHSILEFAGLPAD...QAGRGGAVLNGEVVRQSGQRVQPQRKNRRRK*
## [2250]   123 MTVTPSPFVVLEPSREWGPLRALPSALLAGLL...PPGHRRWPPGRQPRILALNHPPIPPDLPQED*
## [2251]   133 VLPRDRRVRTPAEFRHLGRTGTRAGRRTVVVS...AEADYALLRRETVGALGKALKPHLPAASEHA*

Classification with IDTAXA

We now have a set of coding regions. Our last step for this section is to try to annotate their function. This functionality is done with IdTaxa() from the DECIPHER package.

IdTaxa() requires a training set, which can be obtained in two ways. The first is to download them from DECIPHER’s downloads page, which contains prebuilt training sets for a variety of organisms. We’ll be using an Actinobacteria dataset obtained from this website.

The other method is to build a training set yourself using LearnTaxa(). This will not be covered in this workshop, but more information is available on the DECIPHER documentation for people that are interested.

In the interest of time, we’ll just classify the first 10 genes. Note that our training set for IdTaxa() is trained on amino acids, so we have to first call translate() on our DNA sequences to be able to provide IdTaxa() with amino acid sequences.

Download Training Set

 

If this button doesn’t work, you can download it from the DECIPHER Downloads Page under “Training sets for functional classification (amino acids)”. The correct training set for Microccocus is KEGG Actinobacteria.

# RData training set file is downloadable from the above button.
# You can also download it directly from the DECIPHER website.

load('/path/to/KEGG_Actinobacteria_r95.RData')
# loads "trainingSet"
# Grab the first 10 genes extracted earlier
geneSeqSubset <- geneSeqs[1:10]

# Classify our sequences!
ids <- IdTaxa(geneSeqSubset, trainingSet)

Once we’ve finished calculating, we can either view the annotations directly, or plot them as a taxonomy.

# Looking at all results
ids
##   A test set of class 'Taxa' with length 10
##      confidence taxon
##  [1]       100% Root; 09130 Environmental Information Processing; 09132 Signa...
##  [2]       100% Root; 09120 Genetic Information Processing; 09124 Replication...
##  [3]       100% Root; 09120 Genetic Information Processing; 09124 Replication...
##  [4]        28% Root; unclassified_Root...                                      
##  [5]        96% Root; 09180 Brite Hierarchies; 09182 Protein families: geneti...
##  [6]       100% Root; 09180 Brite Hierarchies; 09182 Protein families: geneti...
##  [7]        36% Root; unclassified_Root...                                      
##  [8]        85% Root; unclassified_Root...                                      
##  [9]        27% Root; unclassified_Root...                                      
## [10]        21% Root; unclassified_Root...
# Looking at a specific entry
ids[[1]]
## $taxon
## [1] "Root"                                                   
## [2] "09130 Environmental Information Processing"             
## [3] "09132 Signal transduction"                              
## [4] "02020 Two-component system [PATH:ko02020]"              
## [5] "K02313  dnaA, chromosomal replication initiator protein"
## 
## $confidence
## [1] 100 100 100 100 100
# Plot the distribution of results
plot(ids, trainingSet)

Runtime Considerations

Gene calling and annotation are both parallelizable; each compute node can process a single genome.

Runtime of gene calling depends on the length of each genome, but is overall very fast. For a single Micrococcus genome (~2.5 megabase pairs), FindGenes takes approximately 40sec to find ~2200 genes on a 2021 M1 MacBook Pro. The same operation on an 8.5Mbp Streptomyces genome takes around 5min to find ~7000 genes. FindNonCoding has a similar runtime,

Runtime of gene annotation depends on both the number of genes and the length of each gene (in base pairs). While compute could theoretically be parallelized to annotate one gene per compute node, this can be inefficient due to overhead incurred by copying the training set file to each node. Annotation on Streptomyces genes takes around 2 seconds per gene (ranging from 300-2000 bp).

Conclusion

We now have a way to identify genomic regions and annotate them with function. However, for comparative genomics we need some way to compare these genes across organisms so that we can draw conclusions at scale. In the next section, we’ll build on these techniques to find orthologous genomic regions.