A reference genome is a database assembled from whole-genome sequencing data from multiple members of a species of interest. This genome assembly is considered a representative example of the genome organization of a typical member of the species. For an organism to be studied effectively from a genetics perspective, a high-quality reference genome is a must.
Among other things, it facilitates comparative genomics studies with other species. The purpose of such studies is usually to identify genetic mutations that might be involved in causing disease or conferring a particular phenotype. Many discoveries in human medical genomics have their roots in comparative genomics studies. A high-accuracy, ‘gapless’ reference genome ensures that the significance of genetic variation is properly understood and can be used to make novel findings. In other words, the reference genome is a ‘gold standard’ against which DNA sequencing information from other members of the same species is compared. If there are mistakes in the reference genome, benign and deleterious mutations and their frequencies could be misinterpreted.
The current feline reference genome – positives and negatives
The most current version of the feline reference genome, Felis_catus_9.0, is a significant improvement over the previous reference genome (Felis_catus_8.0). For example, the N50 contig length for Felis_catus_9.0 is 42 Mb, which corresponds to a roughly 1000-fold increase in ungapped sequence length compared to Felis_catus_8.0. This N50 value surpasses the N50 values of all other carnivore reference genome assemblies.
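For readers unfamiliar with the metric, N50 is the contig length at which contigs of that length or longer account for at least half of the total assembled sequence. The short Python sketch below shows the calculation on made-up contig lengths. The function name `n50` and the toy numbers are illustrative assumptions, not values taken from either feline assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Walk the contigs from longest to shortest and return the length of the
    contig at which the running total first reaches half of the assembly.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy contig lengths (arbitrary units, invented for illustration).
contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(contigs))
```

Because the metric depends only on the sorted contig lengths, an assembly with fewer and longer ungapped contigs yields a larger N50, which is why it is a common shorthand for assembly contiguity.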
Felis_catus_9.0 is built on long-read sequencing data and is thus a superior tool for identifying structural variants (SVs), especially ones spanning multiple megabases. In fact, more variants have been identified with Felis_catus_9.0 than with the reference genomes of other mammals, such as dog, sheep, horse, pig and cow.
However, long-read sequencing technology is known to be prone to insertion or deletion (indel) errors in homopolymer regions. Basepaws’ analysis of our internal genomic database reveals a total of 82,362 sites in Felis_catus_9.0 where the non-reference allele frequency is 100%, that is, sites where every genome in our database carries an allele that differs from the reference. Of these sites, about 87.6% (72,139 sites) are indels, which points to potential misassembly errors in the reference genome. Our analysis also shows that 2,794 of these putative misassemblies fall within 2,215 exonic regions. Reference genome errors in gene-coding regions, particularly exons, can have serious negative consequences for genomic medicine applications.
Two such errors (among the 82,362 potential misassemblies) are found in the tyrosinase gene (tyr). In Felis_catus_9.0, tyr has a frameshift in the middle of its sequence, resulting in an incorrectly translated protein. This is illustrated in the figure below, where we compare the tyr gene sequence from a previously published feline study with the tyr sequence obtained from Felis_catus_9.0. The sequence from the reference genome contains an insertion of a G and a deletion of an A (marked in red). These errors lead to a partially incorrect translation, which we identified by aligning the tyr protein sequence from Felis_catus_9.0 with the previously published feline sequence and with tyr sequences from related mammalian species: Puma concolor (cougar), Canis lupus familiaris (dog), Bos taurus (cow) and Homo sapiens (human). The region of the protein sequence altered by the frameshift is highlighted in yellow. Blue and red indicate conserved and altered amino acids, respectively.
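To make the effect of such errors concrete, the Python sketch below translates a short made-up coding sequence three ways: as is, with a single inserted G, and with the inserted G followed by a downstream deletion of an A. It is only an illustration of the general mechanism. The sequence, the positions of the two edits, and the helper name `translate` are assumptions and do not correspond to the real tyr sequence.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string in frame 0, ignoring any trailing partial codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# Toy coding sequence (invented; not the feline tyr gene).
true_seq = "ATGGCTCTGAAACCCTTTGGA"

# A spurious G inserted after the second codon shifts every later codon.
with_insertion = true_seq[:6] + "G" + true_seq[6:]

# Deleting one A further downstream restores the original reading frame,
# so only the codons between the two errors are mistranslated.
with_both = with_insertion[:12] + with_insertion[13:]

for label, seq in [("true", true_seq),
                   ("insertion only", with_insertion),
                   ("insertion + deletion", with_both)]:
    print(f"{label:>22}: {translate(seq)}")
```

In this toy case the insertion alone scrambles everything downstream of it, while the insertion combined with a later deletion alters only the stretch between the two edits. That is the same pattern as a protein that is wrong over a limited region and matches related species elsewhere.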
How can Basepaws help improve the quality of the feline reference genome?
Felis_catus_9.0 is already a high-quality mammalian reference genome. However, perfecting a species’ reference genome is a gradual, iterative process. Basepaws has a vast (and growing) genomic database made up of feline genomes sequenced at either high or low depth. We use this resource daily, in combination with Felis_catus_9.0. While Felis_catus_9.0 is instrumental for our routine sample analysis, our own genome database has given us insights that allow us to identify potential inaccuracies in the reference genome. These include allele frequency misrepresentations, indels, and gene annotation errors. We are currently cataloging all of these errors and plan to release an improved version of the feline reference genome, free for all feline geneticists to use.
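One way to picture this kind of cataloging is a filter that flags sites where every sampled allele differs from the reference and then separates indels from single-base changes. The Python sketch below is a minimal illustration under stated assumptions, not Basepaws’ actual pipeline: the `Site` record, its field names, the `min_alleles` threshold, and the toy data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    """Hypothetical per-site summary across a cohort of sequenced cats."""
    chrom: str
    pos: int
    ref: str        # allele present in the reference genome
    alt: str        # non-reference allele observed in the cohort
    alt_count: int  # sampled alleles carrying alt
    total: int      # sampled alleles genotyped at this site


def is_indel(site):
    """Treat any length difference between ref and alt as an indel."""
    return len(site.ref) != len(site.alt)


def fixed_non_reference(sites, min_alleles=1):
    """Keep sites where the non-reference allele frequency is 100%."""
    return [s for s in sites
            if s.total >= min_alleles and s.alt_count == s.total]


# Toy data (invented for illustration only).
sites = [
    Site("chrA1", 1_000, "A", "AG", 200, 200),  # fixed insertion
    Site("chrA1", 2_500, "C", "T", 57, 200),    # ordinary polymorphism
    Site("chrB2", 9_100, "GA", "G", 180, 180),  # fixed deletion
    Site("chrD1", 4_200, "T", "C", 150, 150),   # fixed single-base change
]

suspects = fixed_non_reference(sites)
indels = [s for s in suspects if is_indel(s)]
print(f"{len(suspects)} fixed non-reference sites, "
      f"{len(indels)} of them indels ({len(indels) / len(suspects):.0%})")
```

A real analysis would likely also require minimum sequencing depth and genotype quality before treating a fixed non-reference site as a candidate reference error.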
Acknowledgments
We would like to thank all our customers and their feline companions, whose support has been instrumental in building our genomic database. In addition, customers who sequenced their cat’s genome at high depth with us deserve a special callout.