Skip to contents

Gene Calling and Annotation

At this point, we’ve learned how to read in some genomic data, and have gained some basic familiarity working with it. The next step in our pipeline is to take a set of genomes, identify the coding regions in them, and predict the function of as many genetic regions as we can. We’ll start off by identifying the genes themselves.

Finding Genes

We’re going to keep using our Micrococcus dataset, but in the interest of time we’ll focus on finding genes in a single sequence. We’ll begin by reading in the data from a .fasta file, as we did in the previous section.

 

library(DECIPHER)

# This file is downloadable at the above link
datafile <- '/path/to/SingleSeq.fa.gz'
dnaGenome <- readDNAStringSet(datafile)
aaGenome <- translate(dnaGenome) 

Next, we’re going to identify the genes in our sequence using FindGenes() from the DECIPHER package.

FindGenes() returns a Genes object with information on where genes start and end in the genome.

geneLocs <- FindGenes(dnaGenome)
# `Genes` object
geneLocs
## Genes object of size 2,217 specifying:
## 2,217 protein coding genes from 75 to 4,653 nucleotides.
## 
##   Index Strand Begin   End TotalScore ... Gene
## 1     1      0   256  1803     366.46 ...    1
## 2     1      0  2554  3669     255.55 ...    1
## 3     1      0  3691  4905     227.58 ...    1
## 4     1      0  4898  5461      75.15 ...    1
## 5     1      0  5819  7981     544.35 ...    1
## 6     1      0  8050 10746     684.34 ...    1
## ... with 2,211 more rows.

We can then extract the sequences corresponding to each gene with ExtractGenes().

genes <- ExtractGenes(geneLocs, dnaGenome, type='AAStringSet')
# Sequences corresponding to each gene
genes
## AAStringSet object of length 2217:
##        width seq
##    [1]   516 MVADQAVLSSWRSVVGSLEDDARVSARLMGFV...RKIRELMAERRTIYNQVTELTNEIKRKQRGA*
##    [2]   372 MKFTVERDILTDAVSWAARSLSPRPPVPVLSG...SAPKPALLTGVNQEDGVVSDYRYLVMPVRIA*
##    [3]   405 MYLSHLTVADFRSYRWADLELTPGSTVLLGAN...RQDAEGSEAVAAAAPVEGDIREPRREGGADG*
##    [4]   188 MAETPAPFEPDRPDLALVQLRRVREAARERGE...TRIEVAGPQAPSWRKGPRTVRGGRGPRDTYG*
##    [5]   721 MVDAMPENPAEEPTAASAAPNPEAVPDAVGQP...DAVFSMLMGEDVESRRTFIQQNAKDIRFLDV*
##    ...   ... ...
## [2213]   195 MSAENVTPAPEAEDAVVETPAGQGSRVAEQDD...KIVHDVVADAGLTSESEGEGARRHVVISADD*
## [2214]   333 MFEAILSPFRWLMSWLLGAFHSILEFAGLPAD...QAGRGGAVLNGEVVRQSGQRVQPQRKNRRRK*
## [2215]   123 MTVTPSPFVVLEPSREWGPLRALPSALLAGLL...PPGHRRWPPGRQPRILALNHPPIPPDLPQED*
## [2216]   133 MLPRDRRVRTPAEFRHLGRTGTRAGRRTVVVS...AEADYALLRRETVGALGKALKPHLPAASEHA*
## [2217]    46 MTKRTFQPNNRRRARKHGFRARMRTRAGRAILSARRGKNRAELSA*

Of note is that FindGenes() assumes there are no introns or frame shifts, and as a result performs best with prokaryotic genomes.

Removing Non-Coding Regions

FindGenes() finds genes, but these may be coding or non-coding genes. We’re more interested in the regions that are actually translated into proteins, since these are what we’ll try to annotate later. The FindNonCoding() function is developed specifically for this purpose, to help distinguish between coding and non-coding genes.

Using FindGenes() with FindNonCoding() in this way also greatly improves the accuracy of FindGenes(), since we won’t accidentally misidentify coding genes as non-coding genes.

FindNonCoding() is used with three main datafiles depending on the data to analyze:

  • data("NonCodingRNA_Archaea") for Archaeal data
  • data("NonCodingRNA_Bacteria") for Bacterial data
  • data("NonCodingRNA_Eukarya") for Eukaryotic data

These include pretrained models with common non-coding patterns for the relevant domain of life. If these pretrained models are insufficient, you can train your own dataset using LearnNonCoding(), though this is outside the scope of this workshop.

data("NonCodingRNA_Bacteria")
ncRNA <- NonCodingRNA_Bacteria

geneticRegions <- FindNonCoding(ncRNA, dnaGenome)

## Find annotations of noncoding regions
annotations <- attr(geneticRegions, "annotations")
geneMatches <- match(geneticRegions[,"Gene"], annotations)
noncodingAnnots <- sort(table(names(annotations)[geneMatches]))
# What noncoding regions have we found and annotated?
noncodingAnnots
## 
##            5'_ureB-RF02514            Flavo_1-RF01705 
##                          1                          1 
##     FMN_Riboswitch-RF00050 Glycine_Riboswitch-RF00504 
##                          1                          1 
##    RNase_P_class_A-RF00010           SmallSRP-RF00169 
##                          1                          1 
##              tmRNA-RF00023                   tRNA-Asn 
##                          1                          1 
##                   tRNA-Asp                   tRNA-Cys 
##                          1                          1 
##                   tRNA-His                   tRNA-Phe 
##                          1                          1 
##                   tRNA-Trp                   tRNA-Tyr 
##                          1                          1 
##           rRNA_16S-RF00177           rRNA_23S-RF02541 
##                          2                          2 
##            rRNA_5S-RF00001     SAM_Riboswitch-RF00162 
##                          2                          2 
##     TPP_Riboswitch-RF00059                   tRNA-Gln 
##                          2                          2 
##                   tRNA-Lys                   tRNA-Met 
##                          2                          2 
##                   tRNA-Val                   tRNA-Glu 
##                          2                          3 
##                   tRNA-Ile                   tRNA-Pro 
##                          3                          3 
##                   tRNA-Thr                   tRNA-Ala 
##                          3                          4 
##                   tRNA-Arg                   tRNA-Gly 
##                          4                          4 
##                   tRNA-Ser                   tRNA-Leu 
##                          4                          5

FindNonCoding() returns a Genes object identifying non-coding regions with annotations. We can pass this object to FindGenes() to improve our identification of coding regions, resulting in more accurate gene calling than just running FindGenes() directly.

# Find Genes
genes <- FindGenes(dnaGenome, includeGenes=geneticRegions)
# Genes in the genome
genes 
## Genes object of size 2,274 specifying:
## 2,216 protein coding genes from 75 to 4,653 nucleotides.
## 58 non-coding RNAs from 73 to 3,088 nucleotides.
## 
##   Index Strand Begin   End TotalScore ... Gene
## 1     1      0   256  1803     365.62 ...    1
## 2     1      0  2554  3669     255.05 ...    1
## 3     1      0  3691  4905     226.98 ...    1
## 4     1      0  4898  5461      74.97 ...    1
## 5     1      0  5819  7981     544.55 ...    1
## 6     1      0  8050 10746     682.20 ...    1
## ... with 2,268 more rows.

As before, we can pull out the sequences for these regions using ExtractGenes(). Since these are all coding regions, we’ll also translate them into amino acids.

# Find amino acid sequences corresponding to found genes
# May get a warning if we don't have an even multiple of 3 base pairs
geneSeqs <- ExtractGenes(genes, dnaGenome, type="DNAStringSet")
geneSeqs <- translate(geneSeqs)
geneSeqs
## AAStringSet object of length 2274:
##        width seq
##    [1]   516 VVADQAVLSSWRSVVGSLEDDARVSARLMGFV...RKIRELMAERRTIYNQVTELTNEIKRKQRGA*
##    [2]   372 VKFTVERDILTDAVSWAARSLSPRPPVPVLSG...SAPKPALLTGVNQEDGVVSDYRYLVMPVRIA*
##    [3]   405 VYLSHLTVADFRSYRWADLELTPGSTVLLGAN...RQDAEGSEAVAAAAPVEGDIREPRREGGADG*
##    [4]   188 MAETPAPFEPDRPDLALVQLRRVREAARERGE...TRIEVAGPQAPSWRKGPRTVRGGRGPRDTYG*
##    [5]   721 VVDAMPENPAEEPTAASAAPNPEAVPDAVGQP...DAVFSMLMGEDVESRRTFIQQNAKDIRFLDV*
##    ...   ... ...
## [2270]   195 MSAENVTPAPEAEDAVVETPAGQGSRVAEQDD...KIVHDVVADAGLTSESEGEGARRHVVISADD*
## [2271]   333 MFEAILSPFRWLMSWLLGAFHSILEFAGLPAD...QAGRGGAVLNGEVVRQSGQRVQPQRKNRRRK*
## [2272]   123 MTVTPSPFVVLEPSR