Gene Calling and Annotation with DECIPHER
Aidan Lakshman1
Source:vignettes/GeneCallingAnnotation.Rmd
GeneCallingAnnotation.Rmd
Gene Calling and Annotation
At this point, we’ve learned how to read in some genomic data, and have gained some basic familiarity working with it. The next step in our pipeline is to take a set of genomes, identify the coding regions in them, and predict the function of as many genetic regions as we can. We’ll start off by identifying the genes themselves.
Finding Genes
We’re going to keep using our Micrococcus dataset, but in
the interest of time we’ll focus on finding genes in a single sequence.
We’ll begin by reading in the data from a .fasta
file, as
we did in the previous section.
library(DECIPHER)
# This file is downloadable at the above link
datafile <- '/path/to/SingleSeq.fa.gz'
dnaGenome <- readDNAStringSet(datafile)
aaGenome <- translate(dnaGenome)
Next, we’re going to identify the genes in our sequence using
FindGenes()
from the DECIPHER
package.
FindGenes()
returns a Genes
object with
information on where genes start and end in the genome.
geneLocs <- FindGenes(dnaGenome)
# `Genes` object
geneLocs
## Genes object of size 2,217 specifying:
## 2,217 protein coding genes from 75 to 4,653 nucleotides.
##
## Index Strand Begin End TotalScore ... Gene
## 1 1 0 256 1803 366.46 ... 1
## 2 1 0 2554 3669 255.55 ... 1
## 3 1 0 3691 4905 227.58 ... 1
## 4 1 0 4898 5461 75.15 ... 1
## 5 1 0 5819 7981 544.35 ... 1
## 6 1 0 8050 10746 684.34 ... 1
## ... with 2,211 more rows.
We can then extract the sequences corresponding to each gene with
ExtractGenes()
.
genes <- ExtractGenes(geneLocs, dnaGenome, type='AAStringSet')
# Sequences corresponding to each gene
genes
## AAStringSet object of length 2217:
## width seq
## [1] 516 MVADQAVLSSWRSVVGSLEDDARVSARLMGFV...RKIRELMAERRTIYNQVTELTNEIKRKQRGA*
## [2] 372 MKFTVERDILTDAVSWAARSLSPRPPVPVLSG...SAPKPALLTGVNQEDGVVSDYRYLVMPVRIA*
## [3] 405 MYLSHLTVADFRSYRWADLELTPGSTVLLGAN...RQDAEGSEAVAAAAPVEGDIREPRREGGADG*
## [4] 188 MAETPAPFEPDRPDLALVQLRRVREAARERGE...TRIEVAGPQAPSWRKGPRTVRGGRGPRDTYG*
## [5] 721 MVDAMPENPAEEPTAASAAPNPEAVPDAVGQP...DAVFSMLMGEDVESRRTFIQQNAKDIRFLDV*
## ... ... ...
## [2213] 195 MSAENVTPAPEAEDAVVETPAGQGSRVAEQDD...KIVHDVVADAGLTSESEGEGARRHVVISADD*
## [2214] 333 MFEAILSPFRWLMSWLLGAFHSILEFAGLPAD...QAGRGGAVLNGEVVRQSGQRVQPQRKNRRRK*
## [2215] 123 MTVTPSPFVVLEPSREWGPLRALPSALLAGLL...PPGHRRWPPGRQPRILALNHPPIPPDLPQED*
## [2216] 133 MLPRDRRVRTPAEFRHLGRTGTRAGRRTVVVS...AEADYALLRRETVGALGKALKPHLPAASEHA*
## [2217] 46 MTKRTFQPNNRRRARKHGFRARMRTRAGRAILSARRGKNRAELSA*
Of note is that FindGenes()
assumes there are no introns
or frame shifts, and as a result performs best with prokaryotic
genomes.
Removing Non-Coding Regions
FindGenes()
finds genes, but these may be coding or
non-coding genes. We’re more interested in the regions that are actually
translated into proteins, since these are what we’ll try to annotate
later. The FindNonCoding()
function is developed
specifically for this purpose, to help distinguish between coding and
non-coding genes.
Using FindGenes()
with FindNonCoding()
in
this way also greatly improves the accuracy of FindGenes()
,
since we won’t accidentally misidentify coding genes as non-coding
genes.
FindNonCoding()
is used with three main datafiles
depending on the data to analyze:
-
data("NonCodingRNA_Archaea")
for Archaeal data -
data("NonCodingRNA_Bacteria")
for Bacterial data -
data("NonCodingRNA_Eukarya")
for Eukaryotic data
These include pretrained models with common non-coding patterns for
the relevant domain of life. If these pretrained models are
insufficient, you can train your own dataset using
LearnNonCoding()
, though this is outside the scope of this
workshop.
data("NonCodingRNA_Bacteria")
ncRNA <- NonCodingRNA_Bacteria
geneticRegions <- FindNonCoding(ncRNA, dnaGenome)
## Find annotations of noncoding regions
annotations <- attr(geneticRegions, "annotations")
geneMatches <- match(geneticRegions[,"Gene"], annotations)
noncodingAnnots <- sort(table(names(annotations)[geneMatches]))
# What noncoding regions have we found and annotated?
noncodingAnnots
##
## 5'_ureB-RF02514 Flavo_1-RF01705
## 1 1
## FMN_Riboswitch-RF00050 Glycine_Riboswitch-RF00504
## 1 1
## RNase_P_class_A-RF00010 SmallSRP-RF00169
## 1 1
## tmRNA-RF00023 tRNA-Asn
## 1 1
## tRNA-Asp tRNA-Cys
## 1 1
## tRNA-His tRNA-Phe
## 1 1
## tRNA-Trp tRNA-Tyr
## 1 1
## rRNA_16S-RF00177 rRNA_23S-RF02541
## 2 2
## rRNA_5S-RF00001 SAM_Riboswitch-RF00162
## 2 2
## TPP_Riboswitch-RF00059 tRNA-Gln
## 2 2
## tRNA-Lys tRNA-Met
## 2 2
## tRNA-Val tRNA-Glu
## 2 3
## tRNA-Ile tRNA-Pro
## 3 3
## tRNA-Thr tRNA-Ala
## 3 4
## tRNA-Arg tRNA-Gly
## 4 4
## tRNA-Ser tRNA-Leu
## 4 5
FindNonCoding()
returns a Genes
object
identifying non-coding regions with annotations. We can pass this object
to FindGenes()
to improve our identification of coding
regions, resulting in more accurate gene calling than just running
FindGenes()
directly.
# Find Genes
genes <- FindGenes(dnaGenome, includeGenes=geneticRegions)
# Genes in the genome
genes
## Genes object of size 2,274 specifying:
## 2,216 protein coding genes from 75 to 4,653 nucleotides.
## 58 non-coding RNAs from 73 to 3,088 nucleotides.
##
## Index Strand Begin End TotalScore ... Gene
## 1 1 0 256 1803 365.62 ... 1
## 2 1 0 2554 3669 255.05 ... 1
## 3 1 0 3691 4905 226.98 ... 1
## 4 1 0 4898 5461 74.97 ... 1
## 5 1 0 5819 7981 544.55 ... 1
## 6 1 0 8050 10746 682.20 ... 1
## ... with 2,268 more rows.
As before, we can pull out the sequences for these regions using
ExtractGenes()
. Since these are all coding regions, we’ll
also translate them into amino acids.
# Find amino acid sequences corresponding to found genes
# May get a warning if we don't have an even multiple of 3 base pairs
geneSeqs <- ExtractGenes(genes, dnaGenome, type="DNAStringSet")
geneSeqs <- translate(geneSeqs)
geneSeqs
## AAStringSet object of length 2274:
## width seq
## [1] 516 VVADQAVLSSWRSVVGSLEDDARVSARLMGFV...RKIRELMAERRTIYNQVTELTNEIKRKQRGA*
## [2] 372 VKFTVERDILTDAVSWAARSLSPRPPVPVLSG...SAPKPALLTGVNQEDGVVSDYRYLVMPVRIA*
## [3] 405 VYLSHLTVADFRSYRWADLELTPGSTVLLGAN...RQDAEGSEAVAAAAPVEGDIREPRREGGADG*
## [4] 188 MAETPAPFEPDRPDLALVQLRRVREAARERGE...TRIEVAGPQAPSWRKGPRTVRGGRGPRDTYG*
## [5] 721 VVDAMPENPAEEPTAASAAPNPEAVPDAVGQP...DAVFSMLMGEDVESRRTFIQQNAKDIRFLDV*
## ... ... ...
## [2270] 195 MSAENVTPAPEAEDAVVETPAGQGSRVAEQDD...KIVHDVVADAGLTSESEGEGARRHVVISADD*
## [2271] 333 MFEAILSPFRWLMSWLLGAFHSILEFAGLPAD...QAGRGGAVLNGEVVRQSGQRVQPQRKNRRRK*
## [2272] 123 MTVTPSPFVVLEPSR