Development of Informatics Infrastructure to Support the use of High Throughput Genetic Data for Herd Management

Recipient Organization
INVICTUS INFORMATICS LLC
2316 SARATOGA DR
LOUISVILLE, KY 40205

Performing Department
(N/A)

Non Technical Summary
The use of high throughput genetic data has transformed animal agriculture dramatically over the last 17 years. Large breed associations and national breeding programs are collecting genome-wide genotypes and implementing genomic selection to improve production phenotypes. These technologies have proven both powerful and transformative. They have also become inexpensive. Whole genome genotypes can be generated for cattle for $47/animal by a commercial provider. Unfortunately, smaller breed associations, farmers who wish to remain independent, or farmers working with species out of the mainstream cannot benefit from these technologies. Farmers who work with species such as yak and Texas Longhorns require data management tools and analytical software so specialized that they are inaccessible on a limited budget.
In our Phase I effort we developed a web-based data management system capable of inexpensively storing, managing, and analyzing high throughput data sets, and of providing rudimentary analyses such as DNA-based meat/animal verification, parentage testing, and rudimentary marker-based selection.
Now that these tools are in place, we can build a companion database to store production phenotype data, allowing these smaller breed/species associations to work with genetic researchers and identify genetic markers to improve or maintain production traits, all the while monitoring and maintaining genetic diversity.
The successful completion of this project will allow small farmers to produce, manage, and use whole genome genotypes for their animals which, in aggregate, may subsequently be used in more sophisticated genome-wide association studies aimed at improving or maintaining production traits and efficiency. The product we have generated will serve to democratize the use of genetic data in large animal agriculture.

Animal Health Component

100%

Research Effort Categories

Basic

Applied

100%

Developmental

Classification

Knowledge Area (KA)	Subject of Investigation (SOI)	Field of Science (FOS)	Percent
303	3899	1080	100%

Knowledge Area
303 - Genetic Improvement of Animals;

Subject Of Investigation
3899 - Other animals, general;

Field Of Science
1080 - Genetics;

Keywords

genetics

high throughput genotyping

informatics

Goals / Objectives
This application is in response to the USDA NIFA SBIR/STTR Program Priority of Creating more and better market opportunities via agricultural-related manufacturing technology. Invictus Informatics' platform is a genetic data management system that can benefit small-scale animal breed associations and independent farmers by providing the resources needed to use genetic data in ways similar to those that have proven highly effective for larger breed associations. In many species, standard analyses such as a genetic test for parentage verification are not readily available. Using genetic data to inform breeding decisions can accelerate the improvement of production traits and slow the concomitant loss of genetic diversity in animals [1][2] or plants [3]. Cattle specifically have seen dramatic gains using these technologies, improving traits related to milk production, feed efficiency, and disease resistance to increase profitability by more than $800/cow. These types of data are critical for livestock producers to optimize their production traits while maintaining genetic diversity. With the advancements in high throughput sequencing approaches, genetic data is becoming more affordable than ever; however, the tools and knowledge necessary to transform this data into actionable information are currently beyond the reach of small and medium-sized producers.
With this Phase II proposal, we will expand the data management system and analytical tools developed in Phase I and integrate vertical and horizontal growth by adding a new species (dairy goats) and additional infrastructure (development and incorporation of novel imputation pipelines). Our aim is to broaden the scope of our tools to allow small-scale producers to easily and affordably analyze genetic data from their livestock to optimize productivity while maximizing genetic diversity. We already have three tests that are offered commercially: parentage analysis, coefficient of inbreeding, and animal/meat ID. These are available to any breed or species that can provide the samples required for analysis. By simultaneously collecting high-quality phenotype data from these animals, we will create a powerful, integrated genotype-phenotype database capable of identifying production and health traits to improve the quality of our livestock while improving generational gains, creating significant value for producers. Demonstration of the efficiency and usability of our data management system with new species will support its use for a range of animals and validate commercialization potential. These objectives are in line with USDA priorities of increasing food security and supporting small and mid-sized farms.

Project Methods
Adapting the Neogen Goat BeadChip Imputation Panel for use with Skim Sequencing Data
The work plan at Invictus will involve the skim sequencing of goats and imputing genotypes based on the haplotype imputation panel developed by Dr. Ben Rosen and others at the USDA, in conjunction with the VarGoats project. The current imputation panel is able to extrapolate haplotypes from the Neogen BeadChip, but can be modified and improved by applying it to skim sequencing data. The basis of skim sequencing is the identification of specific polymorphic alleles known to be associated with specific haplotypes. The work of Dr. Rosen and collaborators on the VarGoats project has associated haplotypes with alleles measured on the commercially available goat SNP chip.
In-silico probe analysis of sequence is commonly referred to as a k-mer analysis. Tools such as the REINDEER platform [18] will create a k-mer index on a high throughput sequence dataset such that it can be searched for the presence of a given k-mer (or sequence probe) in seconds. In fact, in work done in Dr. Kalbfleisch's lab with horse sequence data, a presence/absence answer from ~1.3M probes from the equine HD SNP chip (~650,000 probe pairs) is achieved in 14 seconds. As such, we are confident a very large list of probe sequences can be queried against an indexed low pass sequencing fastq file, providing data on haplotype specific alleles very quickly.
Our approach will be to first identify haplotype specific alleles (HSAs) present in Dr. Rosen's haplotype panel. We will then derive "probe sequences" for all HSAs as k-mers using flanking sequence data derived from the goat reference genome.
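The probe derivation and presence/absence lookup described above can be illustrated with a minimal sketch. It is not the REINDEER index or the project's pipeline: it uses a plain in-memory Python set, which would not scale to real low pass datasets, and the function names, the toy reference string, the probe names, and the flank length are all hypothetical choices made for illustration. It assumes uppercase A/C/G/T sequence and single-nucleotide alleles.

```python
from typing import Dict, Iterable, Set

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Strand-independent form: the smaller of a k-mer and its reverse complement."""
    rc = reverse_complement(kmer)
    return kmer if kmer <= rc else rc


def allele_probe(reference: str, pos: int, allele: str, flank: int) -> str:
    """Build a probe of length 2*flank + 1 centered on 0-based `pos` carrying `allele`."""
    if pos - flank < 0 or pos + flank >= len(reference):
        raise ValueError("not enough flanking sequence around the variant")
    left = reference[pos - flank:pos]
    right = reference[pos + 1:pos + 1 + flank]
    return canonical(left + allele + right)


def index_reads(reads: Iterable[str], k: int) -> Set[str]:
    """Collect the canonical k-mers of every read (a naive stand-in for a real k-mer index)."""
    index: Set[str] = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            index.add(canonical(read[i:i + k]))
    return index


def probe_presence(probes: Dict[str, str], index: Set[str]) -> Dict[str, bool]:
    """Report, for each named probe, whether it occurs in the read index."""
    return {name: probe in index for name, probe in probes.items()}


# Toy data only: a made-up reference and one made-up read, not goat sequence.
reference = "ACGTTGCAAGGCTTACGATCG"
probes = {
    "snp1_ref": allele_probe(reference, pos=10, allele="G", flank=3),
    "snp1_alt": allele_probe(reference, pos=10, allele="T", flank=3),
}
# One read carrying the alternate allele, sequenced from the opposite strand.
reads = [reverse_complement("GCAAGTCTTAC")]
index = index_reads(reads, k=7)
print(probe_presence(probes, index))
```

In this toy case the single read carries the alternate allele, so only the `snp1_alt` probe is reported present. Using canonical k-mers means a probe matches regardless of which strand the read came from, which matters because sequencing reads are not strand-specific.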
This list of probe sequences will be used to query each animal's low pass sequencing dataset, and haplotypes will be assigned to the animal based on the presence or absence of these probe sequences in the dataset.
In order to verify the accuracy of the imputation methods, we will generate 20X whole genome sequence data on 22 animals per breed across 3 breeds, for a total of 66 animals. These animals will also undergo low pass sequencing at 1X coverage, with haplotypes imputed based on the above methodology. The imputed genotypes will be compared with the genotypes derived from the whole genome sequence data, as in our Phase I project, and validated for accuracy. This work will be performed by Mr. Morozov and Dr. Kalbfleisch.
Development and implementation of workflows to run genome wide association studies:
We have budgeted funds to trio sequence 500 goats, which will allow us to generate data on the order of 56 families (168 goats) from three different goat breeds (Nubian, Alpine, and Nigerian Dwarf) where the dam has phenotype data collected related to health and milking production traits. With this genetic data coupled with phenotype data, we will be able to 1) provide a foundation for parentage testing in these three breeds, and 2) perform genotype/phenotype association studies and identify genotypes associated with milk production in each breed. The sequencing work and parentage verification will be done by Mr. Morozov and Dr. Kalbfleisch. Methods to perform the association analysis will be developed and executed by Dr. Loux via the subcontract with the University of Kentucky. PLINK will be the tool used initially for GWAS work, with data extracted from the Invictus system and transformed for use in PLINK by Mr. Morozov.
There are commercial resources we will use for the computational analyses, web server, and data storage requirements of Invictus Informatics, described as follows.
1) Amazon S3 storage. We will be able to store the variant files in the standard infrequent access tier of their cloud, and back the data up in their Glacier storage, for $0.0125/GB/month and $0.004/GB/month respectively.
2) Amazon EC2 computing. Our PostgreSQL database is located on an Amazon EC2 Linux server, as is our web server. In a just-in-time fashion, additional compute servers will be spun up to perform analytical tasks such as parentage verification, meat verification, or coefficient of inbreeding estimates for potential sires.
3) An existing public domain software suite, the High Throughput Sequencing Java Development Kit (htsjdk) produced by the Broad Institute. With this software library it is possible to rapidly access, parse, and analyze both genotypes from variant call files and mapped high throughput datasets if necessary.
4) A low pass sequencing test provided by Neogen GeneSeek.
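As a rough illustration of the storage figures quoted under item 1, the sketch below computes a monthly cost for keeping one copy of the variant files in the infrequent access tier and one backup copy in Glacier. The two per-GB rates are the ones stated above; the function name and the 1,000 GB data volume are hypothetical, and the calculation ignores request, retrieval, and transfer charges as well as any later change in provider pricing.

```python
# Per-GB monthly rates as quoted in the text above.
STANDARD_IA_USD_PER_GB_MONTH = 0.0125
GLACIER_USD_PER_GB_MONTH = 0.004


def monthly_storage_cost(total_gb: float) -> dict:
    """Estimate monthly cost of one primary copy plus one Glacier backup copy.

    Hypothetical helper for illustration; storage charges only.
    """
    if total_gb < 0:
        raise ValueError("total_gb must be non-negative")
    primary = total_gb * STANDARD_IA_USD_PER_GB_MONTH
    backup = total_gb * GLACIER_USD_PER_GB_MONTH
    return {"primary": primary, "backup": backup, "total": primary + backup}


# Hypothetical data volume of 1,000 GB of variant files.
cost = monthly_storage_cost(1000)
print(f"primary ${cost['primary']:.2f}, backup ${cost['backup']:.2f}, "
      f"total ${cost['total']:.2f} per month")
```

At the quoted rates, each 1,000 GB stored this way works out to about $12.50 for the primary copy and $4.00 for the backup, or roughly $16.50 per month.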