Adapting single-step GBLUP for complex data, models, and sequence information

Sponsoring Institution

National Institute of Food and Agriculture

Project Status

COMPLETE

Funding Source

AFRI COMPETITIVE GRANT

Reporting Frequency

Annual

Accession No.

1022008

Grant No.

2020-67015-31030

Cumulative Award Amt.

$500,000.00

Proposal No.

2019-05566

Multistate No.

(N/A)

Project Start Date

May 1, 2020

Project End Date

May 9, 2024

Grant Year

2020

Program Code

[A1201]- Animal Health and Production and Animal Products: Animal Breeding, Genetics, and Genomics

Recipient Organization
UNIVERSITY OF GEORGIA
200 D.W. BROOKS DR
ATHENS, GA 30602-5016

Performing Department
(N/A)

Non Technical Summary
Animal populations are now selected using not only pedigree and phenotypic information but also genomic information. The standard method of analysis is single-step GBLUP (ssGBLUP), which considers all three sources of information jointly. Biases in genomic predictions have been observed, predominantly in dairy cattle, when pedigrees are incomplete. These biases arise because differences in genetic level among groups of unknown parents are ignored. Older methods of accounting for unknown parent groups were not successful. We will apply a new method, called metafounders, that showed promise in simulations and in analyses of small field data sets. When animals are evaluated genetically, it is desirable to have a measure of the accuracy of their evaluations. Current methods for approximating accuracy are limited to simple models and are less accurate with large data sets. We will try to extend such methods to arbitrarily large data sets and to complex models. There is also interest in finding the actual nucleotides, called causative SNPs, that affect traits of interest. Current methods are limited to single models and are prone to false signals that reflect population structure rather than the actual causative SNPs. The ssGBLUP method, which uses all available information (pedigree, phenotypes, and genotypes) jointly, was modified for genome-wide association studies (GWAS) with an option for significance testing (p-values). We will extend this method to very large data sets and will use it with imputed sequence data.

Animal Health Component

25%

Research Effort Categories

Basic

50%

Applied

25%

Developmental

25%

Classification

Knowledge Area (KA)	Subject of Investigation (SOI)	Field of Science (FOS)	Percent
304	3999	1081	100%

Knowledge Area
304 - Animal Genome;

Subject Of Investigation
3999 - Animal research, general;

Field Of Science
1081 - Breeding;

Keywords

single-step gblup

accuracy

large-scale genomic eval

pre-selection bias

unknown parent groups

Goals / Objectives
1. Develop better ssGBLUP predictive models for large data sets with genotyped animals that have incomplete pedigrees and may be from different breeds. Alternative equations for Unknown Parent Groups and the metafounders approach will be incorporated into ssGBLUP to allow for more accurate and less biased/inflated genomic predictions in complex data sets and models regardless of species with the aim of increasing the rate of genetic gain.
2. Establish a robust approximation of individual theoretical accuracy for very large genotyped populations using APY. An accuracy approximation will be developed that is easily computable for any data set size and model and is reliable and acceptable to the livestock industry.
3. Enable computations of p-values in ssGBLUP to select sequence variants for genomic prediction in very large genotyped populations. The capabilities of ssGBLUP will be extended to compute p-values and accurately select sequence variants for GS without the need for data reduction or projections (i.e., deregressions).

Project Methods
For Objective 1, we will implement the concept of metafounders in the BLUP90IOD software and test it with simulated and real data sets. We will also research the optimal definition of unknown parent groups. For Objective 2, we will decompose the information in estimated breeding values into multiple components, including those due to nongenomic and genomic information, with a focus on avoiding double counting. Initially, we will use the APY algorithm and sparse matrix inversion. Later, we will consider machine learning. For Objective 3, we will use the APY algorithm and a conversion from GEBV to SNP effects based on core animals. We will obtain weights based on the quadratic and NonlinearA algorithms. We will test with pig and possibly dairy data.
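The conversion from GEBV to SNP effects mentioned for Objective 3 can be illustrated with a minimal sketch. It assumes a genomic relationship matrix of the form G = Z D Z'/k, where Z holds centered genotypes, D is a diagonal matrix of SNP weights, and k = 2 Σ p(1 − p), so that SNP effects are backsolved as D Z' G⁻¹ û / k. The function names, the simulated genotypes, and the use of all animals instead of only APY core animals are assumptions made for illustration and are not taken from the project software.

```python
import numpy as np

def centered_genotypes(M, p):
    """Center 0/1/2 genotype codes by twice the allele frequency."""
    return M - 2.0 * p

def snp_effects_from_gebv(Z, gebv, p, d=None):
    """Backsolve SNP effects from GEBV: a_hat = D Z' G^{-1} u_hat / k.

    Z    : centered genotypes (animals x SNPs)
    gebv : genomic estimated breeding values (one per animal)
    p    : allele frequencies (one per SNP)
    d    : optional SNP weights (diagonal of D); defaults to all ones
    """
    n_snps = Z.shape[1]
    d = np.ones(n_snps) if d is None else np.asarray(d, dtype=float)
    k = 2.0 * np.sum(p * (1.0 - p))
    G = (Z * d) @ Z.T / k            # G = Z D Z' / k
    x = np.linalg.solve(G, gebv)     # x = G^{-1} u_hat without forming G^{-1}
    return d * (Z.T @ x) / k

# Hypothetical toy data: 5 animals, 40 SNPs, allele frequencies assumed to be 0.5.
rng = np.random.default_rng(1)
p = np.full(40, 0.5)
M = rng.integers(0, 3, size=(5, 40)).astype(float)
Z = centered_genotypes(M, p)
gebv = rng.normal(size=5)

a_hat = snp_effects_from_gebv(Z, gebv, p)
print(np.allclose(Z @ a_hat, gebv))  # the SNP effects should reproduce the GEBV
```

When G is nonsingular, multiplying the centered genotypes by the backsolved effects returns the original GEBV, which gives a quick consistency check on any implementation of this step.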

Progress 05/01/20 to 05/09/24

Outputs
Target Audience: Academia, USDA, breed associations, farmers Changes/Problems: Nothing Reported What opportunities for training and professional development has the project provided? Training of graduate students, postdocs, and visitors. Training of industry members. How have the results been disseminated to communities of interest? Conference presentations, journal publications, industry meetings What do you plan to do during the next reporting period to accomplish the goals? Nothing Reported

Impacts
What was accomplished under these goals? 1. Analyses involved more than 47 million lactation records registered between 2000 and 2021 in purebred Holsteins and Jerseys and their crosses. A total of 27 million animals were included in the analysis, of which 1.4 million were genotyped. Milk, fat, and protein yields were analyzed in a 3-trait repeatability model using BLUP or ssGBLUP. The 2 models were validated using prediction bias and accuracy computed for genotyped cows with no records in the truncated dataset and at least one lactation in the complete dataset. Bias and accuracy were better in the genomic model than in the pedigree-based one, with accuracies for crossbred cows generally higher than those of purebreds. Accurate evaluation of crossbreds requires a crossbred reference population. Genomic accuracies of crossbred cows may be artificially inflated by the validation methodology because genomic predictions for crossbreds also include means of specific crosses (e.g., F1, F2, various reciprocals).
3. Data sets were simulated with different real and effective population sizes, and causative SNPs were detected by p-values. The detection was much more successful in populations with a high effective population size. In a population with a low effective population size, each causative SNP generated a wide SNP response in Manhattan plots, making it difficult to discern adjacent causative SNPs. Small effective population size in farm animals makes the detection of causative SNPs difficult but enables high-accuracy prediction that is based on the estimation of large chromosome segments.
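Detection of SNPs by p-values can be sketched as follows. The sketch assumes that each SNP effect estimate and the variance of that estimate are already available and that the standardized effect is approximately standard normal under the null hypothesis. The function names, the Bonferroni threshold, and the example numbers are hypothetical and serve only to illustrate the calculation.

```python
import math

def snp_p_value(effect, effect_var):
    """Two-sided p-value, 2 * (1 - Phi(|z|)), for a standardized SNP effect."""
    z = effect / math.sqrt(effect_var)
    return math.erfc(abs(z) / math.sqrt(2.0))

def bonferroni_hits(effects, effect_vars, alpha=0.05):
    """Indices of SNPs whose p-value is below alpha divided by the number of SNPs."""
    threshold = alpha / len(effects)
    pvals = [snp_p_value(a, v) for a, v in zip(effects, effect_vars)]
    return [i for i, pval in enumerate(pvals) if pval < threshold]

# Hypothetical SNP effect estimates and their variances.
effects = [0.02, -0.30, 0.05, 0.01]
effect_vars = [0.004, 0.004, 0.004, 0.004]
print(bonferroni_hits(effects, effect_vars))
```

A Bonferroni threshold is only one possible choice. With strongly correlated SNPs, as in populations with a small effective population size, a threshold based on the number of independent chromosome segments may be less conservative.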

Publications

Type: Journal Articles Status: Published Year Published: 2024 Citation: Cesarani, Alberto; Lourenco, Daniela; Bermann, Matias; Nicolazzi, Ezequiel; VanRaden, Paul; Misztal, Ignacy. 2024. Single-step genomic predictions for crossbred Holstein and Jersey cattle in the US. J. Dairy Sci. Comm. 5:124-128. doi:10.3168/jdsc.2023-0399
Type: Journal Articles Status: Published Year Published: 2023 Citation: Jang, S., S. Tsuruta, N. G. Leite, I. Misztal, and D. Lourenco. 2023. Dimensionality of genomic information and its impact on genome-wide associations and variant selection for genomic prediction: a simulation study. Genet. Sel. Evol. 55:49. doi:10.1186/s12711-023-00823-0
Type: Journal Articles Status: Published Year Published: 2023 Citation: McWhorter, T., M. Sargolzaei, C. Sattler, M. Utt, S. Tsuruta, I. Misztal, and D. Lourenco. 2023. Single-step genomic predictions for heat tolerance of production yields in U.S. Holsteins and Jerseys. Journal of Dairy Science. https://doi.org/10.3168/jds.2022-23144

Progress 05/01/22 to 04/30/23

Outputs
Target Audience: Academia, USDA, breed associations, farmers Changes/Problems: Nothing Reported What opportunities for training and professional development has the project provided? Nothing Reported How have the results been disseminated to communities of interest? Conferences, USDA committee meetings, breed association visits What do you plan to do during the next reporting period to accomplish the goals? We developed and implemented formulas for computing p-values in ssGBLUP to select sequence variants for genomic prediction in a way that is computationally feasible for very large genotyped populations. Tests included up to 600k genotyped Angus cattle. Tentative results indicate that ssGBLUP allows discovery of twice as many statistically significant regions as a classical method. We plan to confirm the results with other data sets and write a refereed journal paper.

Impacts
What was accomplished under these goals? We developed single-step GBLUP multibreed genomic predictions for multiple dairy breeds: Ayrshire (AY), Brown Swiss (BS), Guernsey (GU), Holstein (HO), and Jersey (JE). A 3-trait model with milk (MY), fat (FY), and protein (PY) yields was applied using about 45 million phenotypes recorded from January 2000 to June 2020. The whole data set included about 29.5 million animals, of which almost 4 million were genotyped. All the effects in the model were breed specific, and breed was also considered as fixed unknown parent groups. Evaluations were done for (1) each single breed separately (single); (2) HO and JE together (HO_JE); (3) AY, BS, and GU together (AY_BS_GU); (4) all 5 breeds together (5_BREEDS). The genomic relationship matrix was inverted with the APY algorithm, which minimizes computations by applying recursion on a small number of animals called "core". Initially, 15k core animals were used in APY for AY_BS_GU and 5_BREEDS, but larger core sets with more animals from the least represented breeds were also tested. The HO_JE evaluation had a fixed set of 30k core animals, with an equal representation of the 2 breeds, whereas the HO and JE single-breed analyses involved 15k core animals. Validation for cows was based on correlations between adjusted phenotypes and (G)EBV, whereas validation for bulls was based on the regression of daughter yield deviations on (G)EBV. Because breed was correctly considered in the model, BLUP results for single and multibreed analyses were the same. Under ssGBLUP, predictability and reliability for AY, BS, and GU were on average 7% and 2% lower in 5_BREEDS compared with single-breed evaluations, respectively. However, validation parameters for these 3 breeds became better than in the single-breed evaluations when 45k animals were included in the core set for 5_BREEDS. Evaluations for Holsteins were more stable across scenarios because Holsteins had the greatest number of genotyped animals and the largest amount of data.
Combining AY, BS, and GU into one evaluation resulted in predictions similar to the ones from single-breed evaluations, especially when using about 30k core animals in APY. The results showed that single-step large-scale multibreed evaluations are computationally feasible, but fine tuning is needed to avoid a reduction in reliability when numerically dominant breeds are combined. Having evaluations for AY, BS, and GU separated from HO and JE may reduce inflation of GEBV for the first 3 breeds. The results indirectly suggest that scaling of the genomic relationship matrix for specific breed groups is not critical, but the recursion needs to include a sufficient number of animals of each breed as core animals. Another indirect result is that the genomic predictions for crossbreds are not based on QTLs but on estimating independent chromosome segments specific for each breed or type of crossbred. Therefore, an accurate prediction for a certain breed type requires a reference population of that breed type.
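The recursion on core animals used by APY can be sketched on a small dense matrix. The sketch follows the commonly stated form of the APY inverse, in which only the core block of G is inverted directly and each noncore animal contributes one diagonal element m_ii = g_ii − g_ic Gcc⁻¹ g_ci. The function name, the simulated genotypes, the small constant added to the diagonal, and the choice of core animals are assumptions made for illustration. Production software works with sparse storage and far larger core sets.

```python
import numpy as np

def apy_inverse(G, core):
    """Inverse of G under APY, given the indices of the core animals."""
    n = G.shape[0]
    core = np.asarray(core)
    noncore = np.setdiff1d(np.arange(n), core)

    Gcc_inv = np.linalg.inv(G[np.ix_(core, core)])   # only the core block is inverted
    Gcn = G[np.ix_(core, noncore)]
    P = Gcc_inv @ Gcn                                # recursion coefficients on core animals

    # Diagonal M: variance of each noncore animal conditional on the core animals.
    m = G[noncore, noncore] - np.einsum("ij,ij->j", Gcn, P)
    m_inv = 1.0 / m

    G_inv = np.zeros((n, n))
    G_inv[np.ix_(core, core)] = Gcc_inv + (P * m_inv) @ P.T
    G_inv[np.ix_(core, noncore)] = -P * m_inv
    G_inv[np.ix_(noncore, core)] = (-P * m_inv).T
    G_inv[noncore, noncore] = m_inv                  # diagonal noncore block
    return G_inv

# Hypothetical toy example: 8 animals, 50 SNPs coded -1/0/1, first 5 animals as core.
rng = np.random.default_rng(7)
Z = rng.integers(0, 3, size=(8, 50)) - 1.0
G = Z @ Z.T / 25.0 + 0.01 * np.eye(8)
G_inv_apy = apy_inverse(G, core=[0, 1, 2, 3, 4])
print(np.round(G_inv_apy @ G, 2))
```

The result equals the regular inverse when all animals are core or when noncore animals are conditionally independent given the core animals. Otherwise it is an approximation whose quality depends on the number and composition of core animals, which is consistent with the need reported above for enough core animals from each breed.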

Publications

Type: Journal Articles Status: Published Year Published: 2022 Citation: Bermann, M., I. Aguilar, D. Lourenco, I. Misztal, and A. Legarra. 2023. Reliabilities of estimated breeding values in models with metafounders. Genet. Sel. Evol. 55:6. https://doi.org/10.1186/s12711-023-00778-2
Type: Journal Articles Status: Published Year Published: 2022 Citation: Cesarani, A., D. Lourenco, S. Tsuruta, A. Legarra, E. L. Nicolazzi, P. M. VanRaden, and I. Misztal. 2022. Multibreed genomic evaluation for production traits of dairy cattle in the US using single-step GBLUP. J. Dairy Sci. 105:5141-5152. doi.org/10.3168/jds.2021-21505
Type: Journal Articles Status: Published Year Published: 2022 Citation: Misztal, I., Y. Stein, and D.A.L. Lourenco. 2022. Genomic evaluation with multibreed and crossbred data. J. Dairy Sci. Comm. 3:156-159. doi.org/10.3168/jdsc.2021-0177

Progress 05/01/21 to 04/30/22

Outputs
Target Audience: Academia, industry Changes/Problems: Nothing Reported What opportunities for training and professional development has the project provided? Nothing Reported How have the results been disseminated to communities of interest? The results have been disseminated at 3 scientific meetings and to approximately 10 industry groups What do you plan to do during the next reporting period to accomplish the goals? Nothing Reported

Impacts
What was accomplished under these goals? 2. Establish a robust approximation of individual theoretical accuracy for very large genotyped populations using APY. Reliability can be calculated as a function of prediction error variances (PEV). We developed an efficient algorithm for calculating PEV for genomic best linear unbiased prediction (GBLUP) models using the Algorithm for Proven and Young (APY). The PEV with APY was calculated by block sparse inversion, efficiently exploiting the sparse structure of the inverse of the genomic relationship matrix with APY. Single-step GBLUP reliabilities were approximated by combining reliabilities with and without genomic information in terms of effective record contributions. Multi-trait reliabilities relied on single-trait results adjusted using the genetic and residual covariance matrices among traits. Tests involved two datasets provided by the American Angus Association. A small dataset (Data1) was used for comparing the approximated reliabilities with the reliabilities obtained by the inversion of the left-hand side of the mixed model equations. A large dataset (Data2) was used for evaluating the computational performance of the algorithm. Analyses with both datasets used single-trait and three-trait models. The number of animals in the pedigree ranged from 167,951 in Data1 to 10,213,401 in Data2, with 50,000 and 20,000 genotyped animals for single-trait and multiple-trait analysis, respectively, in Data1 and 335,325 in Data2. Correlations between estimated and exact reliabilities obtained by inversion ranged from 0.97 to 0.99, whereas the intercept and slope of the regression of the exact on the approximated reliabilities ranged from 0.00 to 0.04 and from 0.93 to 1.05, respectively. For the three-trait model with the largest dataset (Data2), the elapsed time for the reliability estimation was 11 min.
The computational complexity of the proposed algorithm increased linearly with the number of genotyped animals and with the number of traits in the model. This algorithm can efficiently approximate the theoretical reliability of genomic estimated breeding values in ssGBLUP with APY for large numbers of genotyped animals at a low cost.
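The relationship between PEV and reliability, and the idea of combining information sources through effective record contributions (ERC), can be sketched as follows. The sketch assumes the textbook relations reliability = 1 − PEV / ((1 + F) σ²a) and reliability = ERC / (ERC + λ) with λ = (1 − h²) / h². It is a simplified illustration under those assumptions, not the published algorithm, which contains additional steps. All function names and numbers are hypothetical.

```python
def reliability_from_pev(pev, var_a, inbreeding=0.0):
    """Reliability from prediction error variance and additive genetic variance."""
    return 1.0 - pev / ((1.0 + inbreeding) * var_a)

def erc_from_reliability(rel, h2):
    """Effective record contribution implied by a reliability (requires rel < 1)."""
    lam = (1.0 - h2) / h2
    return lam * rel / (1.0 - rel)

def reliability_from_erc(erc, h2):
    """Reliability implied by an effective record contribution."""
    lam = (1.0 - h2) / h2
    return erc / (erc + lam)

def combine_reliabilities(rel_nongenomic, rel_genomic, h2):
    """Add the ERC of two information sources that are assumed not to overlap."""
    erc = erc_from_reliability(rel_nongenomic, h2) + erc_from_reliability(rel_genomic, h2)
    return reliability_from_erc(erc, h2)

# Hypothetical values: heritability 0.3, two sources each with reliability 0.5.
print(reliability_from_pev(pev=0.4, var_a=1.0))
print(combine_reliabilities(0.5, 0.5, h2=0.3))
```

Adding ERC is valid only when the two sources carry independent information. If the same records contribute to both, the sum counts them twice and the combined reliability is overstated.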

Publications

Type: Journal Articles Status: Published Year Published: 2022 Citation: Matias Bermann, Daniela Lourenco, Ignacy Misztal, Efficient approximation of reliabilities for single-step genomic best linear unbiased predictor models with the Algorithm for Proven and Young, Journal of Animal Science, Volume 100, Issue 1, January 2022, skab353, https://doi.org/10.1093/jas/skab353

Progress 05/01/20 to 04/30/21

Outputs
Target Audience: Academia, industry Changes/Problems: Nothing Reported What opportunities for training and professional development has the project provided? Nothing Reported How have the results been disseminated to communities of interest? Conferences, industry meetings, personal communications What do you plan to do during the next reporting period to accomplish the goals? Nothing Reported

Impacts
What was accomplished under these goals? We compared 3 formulations of unknown parent groups in single-step GBLUP: groups by QP transformation, groups using the metafounder concept, and encapsulated groups. The last two formulations resulted in unbiased genomic estimated breeding values (GEBV). While the metafounder concept is applicable to multibreed populations, the encapsulated option does not require parameter estimation. In another study, we looked at the efficiency of two unknown parent group formulations and of data truncation in the genomic evaluation of the US Holstein population. The complete data included 80 million records for milk, fat, and protein yields from 31 million cows recorded since 1980. Phenotype-pedigree truncation scenarios included truncation of phenotypes for cows recorded before 1990 and 2000 combined with truncation of pedigree information after 2 or 3 ancestral generations. A total of 861,525 genotyped bulls with progeny and cows with phenotypic records were used in the analyses. Reliability and bias (inflation/deflation) of GEBV were obtained for 2,710 bulls based on deregressed proofs, and for 381,779 cows born after 2014 based on predictivity (adjusted cow phenotypes). GEBV were unbiased with QP-modified unknown parent groups. Eliminating phenotypes recorded before 2000 did not reduce the accuracy of GEBV for the youngest animals.
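Bias (inflation or deflation) of GEBV is commonly assessed by regressing a benchmark, such as deregressed proofs or adjusted phenotypes, on the GEBV. A minimal sketch of that regression is given here. The function name and the example numbers are hypothetical, and the sketch omits the weighting by reliability that a full validation would normally include.

```python
import math

def validate_predictions(benchmark, gebv):
    """Regress benchmark values on GEBV; return (intercept, slope, correlation)."""
    n = len(gebv)
    mean_x = sum(gebv) / n
    mean_y = sum(benchmark) / n
    sxx = sum((x - mean_x) ** 2 for x in gebv)
    syy = sum((y - mean_y) ** 2 for y in benchmark)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(gebv, benchmark))
    slope = sxy / sxx                      # below 1: inflated GEBV, above 1: deflated
    intercept = mean_y - slope * mean_x    # level bias
    corr = sxy / math.sqrt(sxx * syy)      # predictive ability
    return intercept, slope, corr

# Hypothetical values for five validation animals.
gebv = [-1.2, -0.4, 0.1, 0.8, 1.5]
benchmark = [-0.9, -0.5, 0.2, 0.5, 1.3]
intercept, slope, corr = validate_predictions(benchmark, gebv)
print(round(intercept, 3), round(slope, 3), round(corr, 3))
```

A slope close to 1 and an intercept close to 0 indicate unbiased predictions, while the correlation measures predictive ability.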

Publications

Type: Journal Articles Status: Published Year Published: 2021 Citation: Bermann, M., D. Lourenco, and I. Misztal. 2021. Automatic scaling in single-step genomic BLUP. Journal of Dairy Science 104(2).
Type: Journal Articles Status: Published Year Published: 2021 Citation: Masuda, Y., S. Tsuruta, M. Bermann, H. L. Bradford, and I. Misztal. 2021. Comparison of models for missing pedigree in single-step genomic prediction. Journal of Animal Science 99(2).
Type: Journal Articles Status: Published Year Published: 2021 Citation: Cesarani, A., Y. Masuda, S. Tsuruta, E. L. Nicolazzi, P. M. VanRaden, D. Lourenco, and I. Misztal. 2021. Genomic predictions for yield traits in US Holsteins with unknown parent groups. Journal of Dairy Science 104(5):5843-5853.