Development of Informatics Infrastructure to Support the use of High Throughput Genetic Data for Herd Management on Small and Mid-size Farms


Sponsoring Institution

National Institute of Food and Agriculture

Project Status

COMPLETE

Funding Source

SMALL BUSINESS GRANT

Reporting Frequency

Annual

Accession No.

1028588

Grant No.

2022-40000-36948

Cumulative Award Amt.

$124,903.00

Proposal No.

2022-00718

Multistate No.

(N/A)

Project Start Date

Jul 1, 2022

Project End Date

Feb 28, 2023

Grant Year

2022

Program Code

[8.12]- Small and Mid-Size Farms

Recipient Organization
INVICTUS INFORMATICS LLC
2316 SARATOGA DR
LOUISVILLE,KY 40205

Performing Department
INVICTUS INFORMATICS LLC

Non Technical Summary
We propose a web-based data management system capable of inexpensively storing, managing, and analyzing high throughput data sets, and of providing rudimentary analyses such as DNA-based meat/animal identification, parentage testing and other marker-based selections for livestock breeders and associations, as well as the calculation of the Coefficient of Inbreeding. Once these tools are in place, we can subsequently build a companion database to store production phenotype data, allowing these producers to work with genetic researchers and identify genetic markers to improve or maintain production traits, all the while monitoring and maintaining genetic diversity.

Animal Health Component

(N/A)

Research Effort Categories

Basic

30%

Applied

(N/A)

Developmental

70%

Classification

Knowledge Area (KA)	Subject of Investigation (SOI)	Field of Science (FOS)	Percent
304	3310	1080	100%

Knowledge Area
304 - Animal Genome;

Subject Of Investigation
3310 - Beef cattle, live animal;

Field Of Science
1080 - Genetics;

Keywords

Goals / Objectives
Our first goal is to develop a genetic data management system for small-scale breed associations and independent farmers, giving them access to the genetic tools for trait improvement that have been so effective for larger breed associations. Our second goal is to generate genetic data for these animals, starting in Phase I with Texas Longhorn cattle, to demonstrate how farmers will benefit from these tools. We aim to collect genomic data from 50 Texas Longhorn cattle during Phase I, including 20 animals sequenced with 20X genomic coverage, and an additional 10 trios (sire, dam, offspring) which will be skim sequenced. These data will serve as the basis for the development of genetic tests such as parentage verification, animal ID, and identification of genetic markers for breed-specific, economically important traits.

Project Methods
For the generation of the cloud-based genetic database, we will use Amazon S3 storage. The S3 Standard-Infrequent Access tier will be used to store variant files, with backups of all data stored in the S3 Glacier storage tier. Our computational efforts will be performed using Amazon EC2 servers. We will locate our PostgreSQL database on a low-powered Linux server, and create a second server that will host the web server with the customer-facing user interface. In a just-in-time fashion, additional compute servers will be spun up to perform analytical tasks such as parentage verification, meat verification, or coefficient of inbreeding estimates for potential sires.
For genetic analyses, we will rely largely on an existing public domain software suite, the High Throughput Sequencing Java Development Kit (htsjdk; Broad Institute). With this software library, it is possible to rapidly access, parse and analyze both genotypes from variant call files and mapped high throughput datasets.
For single parent verification, we will look for opposing homozygotes in the parent and the progeny. Parent A/A, progeny B/B is impossible under Mendelian inheritance, and this would count as an exclusion. Parent A/A (or B/B), progeny A/B, or parent A/B, progeny A/A (or B/B) would not count as an exclusion. We simply total the exclusions and evaluate the result. In a result set with no errors there would be no exclusions in a proper parent-progeny association. For trio verification, our analysis can be much more exacting.
For example, sire=A/A, dam=A/A, and progeny=A/B would count as an exclusion in trio mode, whereas it would not if either the sire or the dam had been analyzed alone in single parent mode.
We will calculate individual inbreeding (F) as a proportionate reduction in heterozygosity relative to the expectation under Hardy-Weinberg Equilibrium: F = 1 - (count of heterozygous genotypes) / ((count of all genotypes) × (fraction of expected heterozygosity)). If an animal has a fraction of heterozygotes near the expected fraction of heterozygosity, then F will approach zero. If the animal is completely inbred, i.e., the count of heterozygous genotypes is zero, then F will be 1. Genetic diversity within a species is maximized when the count of heterozygous genotypes is near the count of all genotypes; F will then be negative, with a value of roughly 1 minus the reciprocal of the Fraction of Expected Heterozygosity, which for beef cattle is N.
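The exclusion counts and the inbreeding estimate described above can be illustrated with a minimal Python sketch. It is not the project's htsjdk-based implementation: the function names, the two-character genotype strings ("AA", "AB", "BB") and the expected heterozygosity value passed in are all assumptions made for illustration, and F is computed as one minus observed over expected heterozygote count, which is the reading consistent with the limiting cases given in the text.

```python
from itertools import product


def is_hom(gt):
    """True if a two-allele genotype such as "AA" is homozygous."""
    return gt[0] == gt[1]


def single_parent_exclusions(parent, progeny):
    """Count loci where parent and progeny are opposing homozygotes."""
    return sum(
        1
        for p, c in zip(parent, progeny)
        if is_hom(p) and is_hom(c) and p[0] != c[0]
    )


def trio_exclusions(sire, dam, progeny):
    """Count loci where the progeny genotype cannot be formed from
    one sire allele plus one dam allele."""
    count = 0
    for s, d, c in zip(sire, dam, progeny):
        possible = {tuple(sorted(pair)) for pair in product(s, d)}
        if tuple(sorted(c)) not in possible:
            count += 1
    return count


def inbreeding_f(genotypes, expected_het_fraction):
    """F = 1 - observed heterozygotes / expected heterozygotes.

    expected_het_fraction is a caller-supplied assumption; the source
    does not state the value used for beef cattle.
    """
    het = sum(1 for g in genotypes if not is_hom(g))
    return 1.0 - het / (len(genotypes) * expected_het_fraction)


# Hypothetical three-locus example: locus 1 is the A/A x A/A -> A/B case.
sire = ["AA", "AB", "BB"]
dam = ["AA", "AA", "AB"]
calf = ["AB", "AB", "BB"]
sire_only = single_parent_exclusions(sire, calf)
dam_only = single_parent_exclusions(dam, calf)
trio = trio_exclusions(sire, dam, calf)
```

In this toy example neither single parent comparison flags locus 1, while the trio comparison does, which is the sense in which the trio analysis is more exacting.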

Progress 07/01/22 to 02/28/23

Outputs
Target Audience: Our target audience comprises underserved livestock producers who work with breeds and species for which genomic analysis is not currently available. These producers know the power of genetics and have a strong desire to add it to their programs, but lack the resources and technical know-how to use it effectively. They are committed to the health and well-being of their herds, but may not have a large amount of money to spend per animal, so they want to use their resources in a way that maximizes the benefits. This includes breeds and species such as Texas Longhorn cattle, dairy goats, sheep, yak and many more. In Phase I, we worked specifically with Texas Longhorn cattle, although we will be branching into work with dairy goats for Phase II, and hope to move into other underserved breeds and species as we enter Phase III and beyond.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? The investigators and service providers engaged in this project are all well versed in high throughput genetic analyses and in the software infrastructure that supports the storage and analysis of these data. However, we have had the opportunity to educate breed associations on the benefits and limitations of genetic testing, and have helped them focus their direction moving forward. As such, both Dr. Kalbfleisch and Dr. Loux are currently serving as advisors for the Cattlemen's Texas Longhorn Registry. Moving forward, as our objectives turn to genotype-phenotype association studies, we will have opportunities not only to improve the knowledge and skill set of our team members, but also, when working with breed associations, to create processes and infrastructure informed by their needs, and results that they can use to demonstrate a greater value in their animals.
How have the results been disseminated to communities of interest? Going forward, we will begin presenting at breed association meetings to educate producers on the value of our system and how it will benefit them. We will also prepare literature that producers and breed associations can use to learn about us and our capabilities. To that end, one area where we requested TABA funding in our Phase II proposal was assistance writing white papers to help educate the community about the many benefits available to them through genetic testing. We communicate regularly with the breed associations we are working with. Again, both Dr. Kalbfleisch and Dr. Loux are serving as advisors for the Cattlemen's Texas Longhorn Registry, while Dr. Kalbfleisch is also on the scientific advisory board for USYAK.
What do you plan to do during the next reporting period to accomplish the goals? Nothing Reported

Impacts
What was accomplished under these goals? We successfully collected genomic data from 52 Texas Longhorn cattle during Phase I, including 22 animals sequenced with 20X genomic coverage, and an additional 13 trios (sire, dam, offspring) that were skim sequenced. These data served as the basis for the development of genetic tests such as parentage verification, animal ID, and identification of genetic markers for breed-specific, economically important traits. The objective of the Phase I project was the creation of a data management system for the storage, maintenance and analysis of high throughput genetic data produced by smaller animal production operations. To accomplish this objective, we completed the following aims:
Aim 1) Build rudimentary data analysis tools such as parentage confirmation, animal identity (verifying that meat came from an animal for which genetic data, or parental data, are already on file), and estimation of the coefficient of inbreeding for a potential mating pair for use in AI bull selection. One of our technical objectives was demonstrating that the Neogen GeneSeek Low Pass sequencing product would accurately genotype Longhorn cattle, given that this breed was not part of the training dataset for the imputation panel. We independently submitted samples for both low pass sequencing and traditional SNP discovery/genotyping via whole genome sequencing, and compared the resulting genotypes. We demonstrated that, after our quality filters had been applied, there was a less than 1% discordance rate between the genotypes imputed by low pass sequencing and the corresponding genotypes called using the 20X coverage data. An artifact that we identified and reported to GeneSeek was that ~1/3 of the records had duplicate VCF records for the same position, with discordant genotypes across those records. We excluded any polymorphism for an animal if it had multiple genotype calls within the same VCF file at the same locus.
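The duplicate-record exclusion and the genotype comparison described above can be sketched in Python as follows. This is an illustration only, not the project's pipeline: the record layout (chromosome, position, genotype string), the function names and the example data are hypothetical, and the project's quality filters are not reproduced because the source does not specify them.

```python
from collections import Counter


def drop_duplicated_loci(records):
    """records: iterable of (chrom, pos, genotype) tuples for one animal.

    Every locus that appears more than once is excluded entirely,
    rather than picking one of the conflicting calls.
    """
    records = list(records)
    counts = Counter((chrom, pos) for chrom, pos, _ in records)
    return {
        (chrom, pos): gt
        for chrom, pos, gt in records
        if counts[(chrom, pos)] == 1
    }


def discordance_rate(low_pass, reference):
    """Fraction of shared loci whose genotypes differ (allele order ignored)."""
    shared = low_pass.keys() & reference.keys()
    if not shared:
        raise ValueError("no shared loci to compare")
    mismatches = sum(
        1 for key in shared if sorted(low_pass[key]) != sorted(reference[key])
    )
    return mismatches / len(shared)


# Hypothetical data: position 200 is duplicated in the low pass calls.
low_pass_records = [
    ("1", 100, "AA"),
    ("1", 200, "AB"),
    ("1", 200, "BB"),
    ("1", 300, "BA"),
    ("1", 400, "BB"),
]
reference_records = [
    ("1", 100, "AA"),
    ("1", 200, "AB"),
    ("1", 300, "AB"),
    ("1", 400, "AB"),
]
low_pass = drop_duplicated_loci(low_pass_records)
reference = drop_duplicated_loci(reference_records)
rate = discordance_rate(low_pass, reference)
```

Dropping the whole locus, as the text describes, avoids having to decide which of the conflicting calls to trust.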
Our second objective within Aim 1 was to demonstrate that these low pass genotypes could be used for both parentage and meat verification. This test was also successful: for 10 sire/dam/calf trios, we demonstrated a less than 1% discrepancy with a Mendelian inheritance pattern. A discrepancy was counted when a calf was heterozygous at a position for two alleles, such as A/B, and both parents were homozygous for a single allele, such as A/A. A discrepancy was also counted if either or both parents were an opposing homozygote: a calf with an A/A genotype could not have either a sire or a dam who was homozygous B/B. We demonstrated that a properly identified trio had, in nearly all cases, less than 0.1% of genotypes exhibiting non-Mendelian inheritance (one outlier had 0.17%). In one case, we identified an example of mis-attributed paternity, where the trio showed 1.5% of genotypes with non-Mendelian inheritance. A subsequent "Single Parent" analysis of that trio, designed to confirm either the sire or the dam of a calf using only opposing homozygotes between the parent and calf, showed that the dam was correctly attributed but the sire was not. This Single Parent analysis was also used for Meat Verification, by comparing the DNA profile of the meat sample to the parents recorded in the animal's pedigree when genotype data were not available for the animal itself. Finally, it was our objective to provide an analysis of possible matings to identify the pairings that would result in the greatest probability of heterozygosity in the calf. The calculation was done by counting the number of opposing homozygotes in a proposed dam/sire pair; this is the minimum number of heterozygotes in the calf. Then a binomial distribution was calculated based on the number of sites where at least one parent was heterozygous. At each of those sites, there was a 50% chance that the calf would be heterozygous.
The total number of sites heterozygous in either parent was used in the calculation of a binomial distribution centered on half this number. As such, the position of the curve was right shifted by the number of guaranteed hets (opposing homozygotes in the parents), and the width of the curve was determined by the count of sites where at least one parent was heterozygous.
Aim 2) Build a cloud-based data management system with a user-friendly front end that will allow farmers to upload genetic data for their animals for storage and analysis. We have deployed a password-protected, web-based data management system with accounts for the breed associations with which we are actively working, currently the Cattlemen's Texas Longhorn Registry and USYAK. The system allows breed associations to upload animal data for registry within the system and to request analyses, currently parentage and meat identification (i.e., did this piece of meat come from the animal I sent to slaughter), which can be done either by comparing to genetic data for that animal or by verifying against one of the animal's parents. Other algorithms we have developed generate the predicted heterozygosity for a proposed dam/sire pairing, as described above. From within the application, we provide templates for animal data entry and test requests that can be downloaded and filled out locally, or completed fully online. We will also mail data entry sheets upon request. Certain fields are required (species, breed, sex, etc.). There is also an option to download an Excel template, which is more convenient when adding large numbers of animals at once. We have developed software that will generate PDF reports based on genetic data.
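The mating-pair heterozygosity prediction described under Aim 1 can be sketched in Python as a shifted binomial distribution. This is a minimal illustration under stated assumptions, not the deployed algorithm: the function name, the genotype strings and the example pair are hypothetical, and sites are treated as independent (linkage is ignored).

```python
from math import comb, sqrt


def is_hom(gt):
    """True if a two-allele genotype such as "AA" is homozygous."""
    return gt[0] == gt[1]


def calf_het_distribution(sire, dam):
    """Distribution of the calf's heterozygous site count for one pairing.

    guaranteed: opposing homozygotes in the parents (calf is always het).
    variable:   sites where at least one parent is het (50% chance each).
    Sites where both parents share the same homozygous genotype add nothing.
    """
    guaranteed = 0
    variable = 0
    for s, d in zip(sire, dam):
        if is_hom(s) and is_hom(d):
            if s[0] != d[0]:
                guaranteed += 1
        else:
            variable += 1
    mean = guaranteed + variable / 2
    sd = sqrt(variable) / 2  # binomial sd with p = 0.5
    # Exact probabilities; integer division keeps this safe for large counts.
    pmf = {
        guaranteed + k: comb(variable, k) / 2**variable
        for k in range(variable + 1)
    }
    return guaranteed, variable, mean, sd, pmf


# Hypothetical five-locus pairing.
sire = ["AA", "AA", "AB", "BB", "AB"]
dam = ["BB", "AA", "AA", "AB", "AB"]
guaranteed, variable, mean, sd, pmf = calf_het_distribution(sire, dam)
```

Ranking candidate sires for a given dam by `mean` would be one way to compare pairings; for genome-scale site counts the mean and standard deviation are likely more practical than the full table of probabilities.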

Publications