Development of multi-trait statistical models for Genomic Prediction

Sponsoring Institution

National Institute of Food and Agriculture

Project Status

ACTIVE

Funding Source

AFRI COMPETITIVE GRANT

Reporting Frequency

Annual

Accession No.

1022334

Grant No.

2020-67013-30904

Cumulative Award Amt.

$500,000.00

Proposal No.

2019-05473

Multistate No.

(N/A)

Project Start Date

Jun 1, 2020

Project End Date

May 31, 2025

Grant Year

2020

Program Code

[A1141]- Plant Health and Production and Plant Products: Plant Breeding for Agricultural Production

Recipient Organization
UNIVERSITY OF CALIFORNIA, DAVIS
410 MRAK HALL
DAVIS, CA 95616-8671

Performing Department
Plant Sciences

Non Technical Summary
Plant breeding programs are critical for addressing global food production needs as the human population approaches nine billion. Genetic improvements in yield, stress tolerance, disease resistance and post-harvest quality are required in every crop. Plant breeders have produced steady gains in all these areas over the past century (Prohens, 2011). However, current rates of yield increases in many critical species such as wheat, rice, corn and soy remain insufficient to meet the future global demands (Ray et al., 2013). Innovations in the process of plant breeding can help to fill this gap.
We believe that increasing the use of multiple-trait breeding methods can broadly accelerate crop improvement programs. Measuring multiple traits at once and modeling their relationships can help increase the rate of gain in a single target trait and is critical for improving suites of traits at once. However, the vast majority of statistical models used in plant breeding today handle only a single trait (or at most a few traits) at a time. Multiple-trait data, on the other hand, is widely available. High-throughput phenotyping technologies, such as hyperspectral imaging and molecular profiling, are becoming accessible to many breeding programs and provide new types of data to inform selection decisions. But even traditional programs collect data on many traits in every field, and when multiplied across locations or years (and considering each trait in each field as a different trait), total trait numbers can easily reach into the hundreds. Therefore, the development of powerful, efficient, and usable statistical tools that can be applied to many traits at once would have an immediate impact on a wide range of breeding programs across the country and the world.
We intend to introduce powerful, free, and user-friendly software to support public and private breeding programs in any crop species.
Specifically, our software will assist breeders to i) efficiently select on multiple aspects of quality and performance at once, ii) address local and regional adaptation through modeling gene-environment interactions, and iii) incorporate phenomics data into genomics-enabled plant breeding systems.

Animal Health Component

33%

Research Effort Categories

Basic

33%

Applied

33%

Developmental

34%

Classification

Knowledge Area (KA)	Subject of Investigation (SOI)	Field of Science (FOS)	Percent
202	2410	1081	50%
901	7310	1081	50%

Knowledge Area
901 - Program and Project Design, and Statistics; 202 - Plant Genetic Resources;

Subject Of Investigation
7310 - Experimental design and statistical methods; 2410 - Cross-commodity research--multiple crops;

Field Of Science
1081 - Breeding;

Keywords

Goals / Objectives
Plant breeders rely heavily on statistical models. Statistical prediction of genetic quality (e.g. Pedigree selection, Genomic Prediction) has been validated again and again in different crops and breeding programs. However, the vast majority of statistical models used in plant breeding handle only a single trait at a time. Yet plant breeders must optimize many traits simultaneously. Jointly modeling multiple traits can increase the rate of gain in each trait individually and is critical for reaching targets for suites of traits at once. Therefore the development of accessible, efficient, and powerful statistical tools that can be applied to many traits at once would have an immediate impact on a wide range of breeding programs across the country and the world. The need for such tools is becoming more and more urgent with the introduction of large-scale phenotype data from high-throughput phenotyping systems. This project aims to fill this gap.
Specifically, we will:
1. Develop robust, high-powered statistical models for jointly predicting plant performance from high-dimensional phenotype and genotype data sets.
2. Implement the models in flexible, computationally efficient, and user-friendly open-source software.
3. Design training materials and teaching modules to demonstrate our methods in diverse contexts.
Our approach builds on the statistical framework of the linear mixed effect model. Linear mixed models underlie virtually all statistical tools used in plant breeding, including the most widely used Genomic Prediction models. Linear mixed models are robust, interpretable, and relatively easy to use. However, existing software tools are not extendible to more than approximately 5-10 traits at a time. Beyond that point, they become brittle (i.e. not robust and sensitive to data noise) and extremely computationally demanding.
Because of this, we believe that the potential of novel phenotyping technologies has yet to be realized in plant breeding.
We will overcome these limitations by combining recent innovations in statistical theory and computational architecture, including: i) efficient and tunable Bayesian priors that prioritize only the strongest, most informative signals in Big Data, ii) a latent factor structure for trait covariances, and iii) efficient approximation and implementation schemes for reducing computational costs. Our current prototype software can fit linear mixed models to datasets with >10,000 traits measured on each of >2,000 lines in less than one day on a typical laptop.
We will apply our models to two classes of multi-trait prediction problems using existing publicly available data from corn and wheat. Using data from the multi-environment Genomes to Fields initiative, we will evaluate the accuracy of genetic value imputation for predicting gene-environment interactions. We expect that the multi-trait/multi-environment imputation will be vastly more accurate than typical Genomic Predictions of traits measured in only a single field. Using drone-based estimates of plant height at weekly intervals, we will study whether early-season growth dynamics can be used as an early indicator of genetic differences in final yield. Both applications exemplify how efficient statistical models could be used to dramatically increase accuracy and reduce costs in breeding programs, and therefore will have wide-ranging impacts.
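To illustrate why a latent factor structure for trait covariances helps at this scale, the sketch below compares the number of free parameters in an unstructured trait covariance matrix with the number in a factor-structured one (loadings plus one trait-specific variance per trait). It is a minimal illustration and not the project's software: the function names are hypothetical, the choice of 50 factors is an arbitrary assumption, and the count ignores the rotational redundancy of factor loadings.

```python
def full_covariance_params(n_traits):
    """Free parameters in an unstructured n_traits x n_traits covariance matrix."""
    return n_traits * (n_traits + 1) // 2


def factor_covariance_params(n_traits, n_factors):
    """Parameters in a factor-structured covariance: a k x t loadings matrix
    plus one trait-specific variance per trait (rotation redundancy ignored)."""
    return n_traits * n_factors + n_traits


def factor_covariance(loadings, specific_variances):
    """Build Lambda' * Lambda + diag(psi) from a k x t loadings matrix (list of rows)."""
    n_traits = len(specific_variances)
    cov = [[0.0] * n_traits for _ in range(n_traits)]
    for i in range(n_traits):
        for j in range(n_traits):
            # Shared covariance comes only from the latent factors.
            cov[i][j] = sum(row[i] * row[j] for row in loadings)
        # Trait-specific variance is added on the diagonal.
        cov[i][i] += specific_variances[i]
    return cov


# Hypothetical scale: 10,000 traits and an assumed 50 latent factors.
n_unstructured = full_covariance_params(10_000)
n_factor = factor_covariance_params(10_000, 50)
print(n_unstructured, n_factor)

# Tiny worked case: one factor, two traits.
small_cov = factor_covariance([[1.0, 2.0]], [0.5, 0.5])
print(small_cov)
```

Under these assumptions the factor-structured form needs roughly one hundredth as many parameters as the unstructured form, which is the kind of reduction that makes thousands of traits tractable.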

Project Methods
Our approach builds on the framework of the linear mixed model, which underlies virtually all statistical tools used today in plant breeding and quantitative genetics. We address the statistical and computational limitations of extending linear mixed models to very large numbers of traits by re-parameterizing the model structure around a factor model. This novel structure allows our computational algorithm to break apart the computationally demanding steps into more manageable pieces. We then use the concept of regularized regression to achieve statistical efficiency in high-dimensional data. This combination of regularized regression with a factor model has not previously been applied to quantitative genetics, but shows great promise for advancing this field.
We will develop these methods into a computationally efficient and accessible R package, written in a combination of R and C++.
We will then use our package to analyze several real datasets, either publicly available or provided by collaborators.
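The re-parameterization described above can be written schematically as follows. This is a simplified sketch in notation chosen here for illustration (the symbols are assumptions, and the published model may include additional terms): Y is the n x t matrix of trait observations, F is an n x k matrix of latent factors, Lambda is the k x t loadings matrix, X and Z are fixed- and random-effect design matrices, and the U and E terms are genetic and residual effects whose columns are assumed mutually independent.

```latex
\begin{aligned}
\mathbf{Y} &= \mathbf{X}\mathbf{B} + \mathbf{F}\boldsymbol{\Lambda}
              + \mathbf{Z}\mathbf{U}_R + \mathbf{E}_R, \\
\mathbf{F} &= \mathbf{Z}\mathbf{U}_F + \mathbf{E}_F .
\end{aligned}
```

Under these assumptions, all covariance among traits is carried by the shared term F Lambda. Conditional on F and Lambda, each column of Y - F Lambda follows its own single-trait linear mixed model, and so does each column of F, so one large multivariate fit separates into t + k univariate problems. Regularization then enters through the priors placed on the entries of Lambda.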

Progress 06/01/23 to 05/31/24

Outputs
Target Audience: The target audience is plant breeding entities in both academia and industry, students with interests in contributing to plant breeding, engineers developing tools for measuring plants in new ways for breeding, and ultimately farmers and consumers who will use the products produced by breeding programs. We have presented our work at international and local conferences including representatives and students from both public and private breeding groups and undergraduate, graduate, and postdoctoral students at UC Davis and other California colleges. We have published our work in peer-reviewed publications in genetics-related journals and continued to maintain and update our open-source R packages based on feedback from users.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? This project has contributed to the training of one graduate student and a postdoctoral scholar in the development and application of quantitative genetics tools.
How have the results been disseminated to communities of interest? This period we published three papers based on our work and presented the results at the GxExM symposium, the PAG conference, an AG2PI field day, the Zeaevolution seminar series, a Corteva New Frontiers conference, and to one industrial group.
What do you plan to do during the next reporting period to accomplish the goals? Over the next period, during the no-cost extension, I will attend the 2024 AFRI PD meeting and present a poster on the work, as well as the 2024 NAPD annual conference. We will also publish our work on extending MegaLMM to use environmental covariates for genetic value prediction in multi-environment trials.

Impacts
What was accomplished under these goals? This year, we made progress in several directions.
1) We applied MegaLMM to multiple new breeding contexts. First, we collaborated with a cattle breeding group to study the use of MegaLMM to improve genetic value prediction of milk quality traits. Milk quality traits are complicated to study because they are highly dynamic during a cow's life, and thus must be studied longitudinally. Near Infrared Spectroscopy is a high-throughput phenotyping technique that can be used to indirectly assay quality-related characteristics rapidly on milk samples. However, because this data source is high-dimensional, directly applying linear mixed models for genetic value estimation has been difficult with traditional approaches, requiring breeders to use approximate methods that do not fully leverage the information in the data. We applied MegaLMM to a dataset from a cattle breeding program, showing improved genetic value predictions with longitudinal data. The results were published in the Journal of Dairy Science. Second, there is significant interest in plant breeding in using controlled environment systems to carefully measure physiological traits using phenomics technologies and then leveraging these data to improve breeding for stress conditions in the field. We collaborated with a European consortium that used the PhenoArch platform to measure a suite of maize physiological traits under controlled conditions and, in parallel, ran a large set of field trials. We applied MegaLMM to integrate the chamber and field data to ask whether chamber data could improve genetic value estimation and prediction in the field. The results were promising, though perhaps not as successful as hoped, suggesting that field trial data are more valuable than chamber data in most cases.
Nevertheless, MegaLMM provided a more comprehensive evaluation of this question than was possible before, and we developed several analytical strategies that had improved performance.
2) We continued to develop new methodologies that build on the MegaLMM framework, allowing us to target new challenges in breeding. First, we used the capability of MegaLMM to learn the genetic and non-genetic correlations among multiple traits to design a new Genome-Wide Association Study (GWAS) approach for identifying genetic loci with effects on multiple traits. It has been known for many years that QTL mapping and GWAS have improved power when used to analyze multiple traits together, because correlations among traits can be leveraged both to control experimental noise and to find common weak patterns that together increase confidence in discoveries. However, multi-trait GWAS poses additional challenges in controlling false-positive results compared with single-trait GWAS. By applying MegaLMM to multi-trait datasets to estimate correlations, and then using these outputs in GWAS models, we developed the JointGWAS R package, which is computationally efficient for multi-trait GWAS. We applied this to a set of ~50 traits measured on a large maize panel to identify loci derived from wild teosinte relatives that contribute to trait variation in modern maize. These results were published in Science. We also worked on an extension of MegaLMM to leverage environmental data to improve gene-environment-interaction analyses from multi-environment trials, focusing on the goal of predicting genetic values in new environments. Our approach leverages high-dimensional environmental covariates to learn the relationships among trials. We tested our method on the maize Genomes to Fields data. Results have been submitted to the journal Genetics and are currently under review.
3) Developing training material for users: We improved the documentation of MegaLMM by creating the pkgdown reference site at https://deruncie.github.io/MegaLMM/ and created a new vignette showing how to use MegaLMM to analyze data from multi-environment trials.

Publications

Type: Journal Articles Status: Published Year Published: 2024 Citation: Chen, Yansen, Hadi Atashi, Jiayi Qu, Pauline Delhez, Daniel Runcie, Hélène Soyeurt, and Nicolas Gengler. "Exploring a Bayesian sparse factor model-based strategy for the genetic analysis of thousands of MIR-spectra traits for animal breeding." Journal of Dairy Science (2024).
Type: Journal Articles Status: Published Year Published: 2024 Citation: Baber Ali, Bertrand Huguenin-Bizot, Maxime Laurent, François Chaumont, Laurie C. Maistriaux, Stéphane Nicolas, Hervé Duborjal, Claude Welcker, François Tardieu, Tristan Mary-Huard, Laurence Moreau, Alain Charcosset, Daniel Runcie & Renaud Rincent. 2024. High-dimensional multi-omics measured in controlled conditions are useful for maize platform and field trait predictions. Theoretical and Applied Genetics, 137(7), p.175.
Type: Journal Articles Status: Published Year Published: 2023 Citation: Yang, Ning, Wang, Yuebin, Liu, Xiangguo, Jin, Minliang, Vallebueno-Estrada, Miguel, Calfee, Erin, Chen, Lu, Dilkes, Brian P., Gui, Songtao, Fan, Xingming, Harper, Thomas K., Kennett, Douglas J., Li, Wenqiang, Lu, Yanli, Ding, Junqiang, Chen, Ziqi, Luo, Jingyun, Mambakkam, Sowmya, Menon, Mitra, Snodgrass, Samantha, Veller, Carl, Wu, Shenshen, Wu, Siying, Zhuo, Lin, Xiao, Yingjie, Yang, Xiaohong, Stitzer, Michelle C., Runcie, Daniel, Yan, Jianbing, Ross-Ibarra, Jeffrey. 2023. Two teosintes made modern maize. Science. 2023 Dec 1;382(6674):eadg8940.

Progress 06/01/22 to 05/31/23

Outputs
Target Audience: The target audience is plant breeding entities in both academia and industry, students with interests in contributing to plant breeding, engineers developing tools for measuring plants in new ways for breeding, and ultimately farmers and consumers who will use the products produced by breeding programs. We have presented our work at international and local conferences including representatives and students from both public and private breeding groups and undergraduate, graduate, and postdoctoral students at UC Davis and other California colleges. We have published our work in peer-reviewed publications in genetics-related journals and continued to maintain and update our open-source R packages based on feedback from users.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? This project has contributed to the training of two graduate students and a postdoctoral scholar in the development and application of quantitative genetics tools.
How have the results been disseminated to communities of interest? We have published 7 papers based on these results over the duration of the project. This period, we presented work from this project at the Population, Evolutionary and Quantitative Genetics conference and the Maize Genetics conference. We developed and taught a module on the use of MegaLMM at the 2022 UC Davis Modern Programming in Genome to Phenome short course. We developed a pkgdown manual for MegaLMM, available here: https://deruncie.github.io/MegaLMM/
What do you plan to do during the next reporting period to accomplish the goals? Over the next reporting period, we will complete and publish our extension to MegaLMM to leverage environmental covariates in multi-environment and gene-environment interaction prediction. This will be accompanied by a new tutorial on its use and better documentation of the whole package on the GitHub page.

Impacts
What was accomplished under these goals?
Objective 1: In the first year of the project we fully developed the multi-trait genomic framework MegaLMM for genomic prediction with up to tens of thousands of traits. In the second year of the project we focused on extending MegaLMM to additional genetic models. We developed an extension to MegaLMM called MegaBayesianAlphabet, which permits the suite of Bayesian Alphabet prior distributions for marker effects in genome-wide analyses. We focused on BayesC because it is one of the most commonly used methods in the Bayesian Alphabet family and has been shown to often perform as well as or better than GBLUP for genomic prediction while at the same time providing feature selection and a list of markers driving the genetics. This latter capability makes MegaLMM particularly useful as a tool for gene discovery and GWAS. We published a manuscript on this method in Genetics this year in which we show that MegaBayesC can outperform our previous MegaLMM in a specific genomic prediction case study and also that it can better prioritize genetic variants in GWAS under a range of genetic architectures. Under this objective we also published a study showing the importance of accounting for non-genetic correlations when estimating the genomic prediction accuracy in a wheat breeding program, and two studies investigating the genetic architecture of drought responses in a maize breeding population. This year, in addition to publishing the BayesC paper in Genetics, we published a paper in the International Journal of Molecular Sciences on using MegaLMM to jointly model haploid and doubled haploid (DH) maize lines in a breeding program. Our main focus this year has been on extending MegaLMM to leverage environmental covariates for predicting gene-environment interactions. The original MegaLMM model was purely empirical, only using observed covariances among traits / environments.
However, to make predictions for unseen traits or environments we need to relate these learned covariances to predictive variables, such as weather or soil variables. We have added this functionality to MegaLMM and have tested it on the Genomes to Fields maize dataset. We can show successful predictions into unseen environments in some contexts. A manuscript documenting this is under development and will be published in the next year.
Objective 2: Under this objective, we have extended the MegaLMM R package to accommodate the Bayesian Alphabet priors on marker effects. To facilitate this we have significantly re-implemented much of the underlying C++ code to take advantage of faster floating-point arithmetic where possible. We have also developed a new R package called JointGWAS that takes the output of MegaLMM analyses and extracts GWAS associations at every marker on each trait or set of traits genome-wide. We are currently writing a manuscript describing how this approach provides a powerful way to account for correlated traits in GWAS. This year, we have added functionality to the MegaLMM R package to accept environmental covariance matrices as priors. This allows users to make predictions into unseen environments.
Objective 3: We developed a module on the use of MegaLMM for the 2022 Modern Programming in Genome to Phenome short course at UC Davis. Approximately 25 students attended, a mix of domestic and international graduate students and postdocs. For the module, we developed a new tutorial demonstrating the use of MegaLMM for multi-environment trial analysis. This module has since been published on the GitHub page of MegaLMM.
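As an illustration of how environmental covariates can be turned into a covariance matrix among environments, the sketch below standardizes each covariate across environments and forms a linear similarity kernel. This is a generic construction chosen for illustration, not a description of how MegaLMM builds or uses such matrices; the function names and the weather values are hypothetical.

```python
from statistics import mean, pstdev


def standardize_columns(rows):
    """Center and scale each covariate (column) across environments (rows)."""
    scaled_columns = []
    for column in zip(*rows):
        center = mean(column)
        spread = pstdev(column)
        if spread == 0:
            # A covariate that is constant across environments carries no information.
            scaled_columns.append([0.0] * len(column))
        else:
            scaled_columns.append([(value - center) / spread for value in column])
    return [list(row) for row in zip(*scaled_columns)]


def environment_kernel(covariates):
    """Linear kernel K = W W' / p, where W holds the standardized covariates."""
    w = standardize_columns(covariates)
    n_covariates = len(w[0])
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) / n_covariates for row_j in w]
        for row_i in w
    ]


# Hypothetical covariates per environment: mean temperature (C), rainfall (mm).
covariates = [
    [20.0, 500.0],
    [25.0, 300.0],
    [30.0, 100.0],
]
kernel = environment_kernel(covariates)
for row in kernel:
    print([round(value, 3) for value in row])
```

Because each row of such a kernel depends only on covariates, a row can be computed for an environment with no phenotype records, which is what makes prediction into unseen environments possible in principle.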

Publications

Type: Journal Articles Status: Published Year Published: 2022 Citation: Hu, H., Meng, Y., Liu, W., Chen, Shaojiang, and D. E. Runcie. Multi-Trait Genomic Prediction Improves Accuracy of Selection among Doubled Haploid Lines in Maize. International Journal of Molecular Sciences. (2022), 23(23), 14558
Type: Journal Articles Status: Published Year Published: 2022 Citation: Qu, J., Runcie, D.E., and H. Cheng, Mega-scale Bayesian regression methods for genome-wide prediction and association studies with thousands of traits. Genetics. Volume 223, Issue 3, March 2023, iyac183

Progress 06/01/21 to 05/31/22

Outputs
Target Audience: The target audience is plant breeding entities in both academia and industry, students with interests in contributing to plant breeding, engineers developing tools for measuring plants in new ways for breeding, and ultimately farmers and consumers who will use the products produced by breeding programs. We have presented our work at international and local conferences including representatives and students from both public and private breeding groups and undergraduate, graduate, and postdoctoral students at UC Davis and other California colleges. We have published our work in peer-reviewed publications in genetics-related journals and continued to maintain and update our open-source R packages based on feedback from users.
Changes/Problems: We were unable to host the "Modern Programming in Genomic Prediction" workshop in 2021 due to the COVID-19 pandemic. We will host the workshop and contribute a module on multi-trait analysis featuring MegaLMM in 2022. We were unable to host a post-doc exchange with CIMMYT because of the pandemic.
What opportunities for training and professional development has the project provided? This project has contributed to the training of three graduate students and a postdoctoral scholar in the development and application of quantitative genetics tools.
How have the results been disseminated to communities of interest? The first paper demonstrating the MegaLMM method was published in Genome Biology. We also published three other papers this year and have a fourth in review at Genetics. We gave presentations on this work at the Plant and Animal Genomes conference and the NCCC170 working group.
What do you plan to do during the next reporting period to accomplish the goals?
Objective 1: This objective is largely complete.
Objective 2: We will focus on documentation of the MegaLMM and JointGWAS packages.
A particular focus of the software development will be implementing a better issue tracking system and providing documentation for model diagnostics.
Objective 3: We will develop a module for the 2022 workshop on quantitative genetics at UC Davis. This is a one-week workshop that attracts students, scientists, and industrial staff with diverse backgrounds. The material will include two case studies of MegaLMM, one with wheat and the other with corn.

Impacts
What was accomplished under these goals?
Objective 1: In the first year of the project we fully developed the multi-trait genomic framework MegaLMM for genomic prediction with up to tens of thousands of traits. In the second year of the project we have focused on extending MegaLMM to additional genetic models. We have developed an extension to MegaLMM called MegaBayesianAlphabet, which permits the suite of Bayesian Alphabet prior distributions for marker effects in genome-wide analyses. We focused on BayesC because it is one of the most commonly used methods in the Bayesian Alphabet family and has been shown to often perform as well as or better than GBLUP for genomic prediction while at the same time providing feature selection and a list of markers driving the genetics. This latter capability makes MegaLMM particularly useful as a tool for gene discovery and GWAS. We have a manuscript describing MegaBayesC in review in Genetics in which we show that MegaBayesC can outperform our previous MegaLMM in a specific genomic prediction case study and also that it can better prioritize genetic variants in GWAS under a range of genetic architectures. As a case study, we applied MegaBayesC to a dataset from Arabidopsis to identify genetic variants associated with flowering time. In this dataset each accession was assayed for both flowering time and transcriptomic variation. We leveraged the transcriptomic data to better identify variants associated with flowering time. We used this as a trial dataset because flowering genetics is well characterized in Arabidopsis, and we were able to show that 14 of the 15 most strongly associated variants were close to well-known flowering time-regulating genes in Arabidopsis. This contrasts with much lower enrichment of the top hits from standard GWAS in this same dataset.
Under this objective we also published a study showing the importance of accounting for non-genetic correlations when estimating the genomic prediction accuracy in a wheat breeding program, and two studies investigating the genetic architecture of drought responses in a maize breeding population.
Objective 2: Under this objective, we have extended the MegaLMM R package to accommodate the Bayesian Alphabet priors on marker effects. To facilitate this we have significantly re-implemented much of the underlying C++ code to take advantage of faster floating-point arithmetic where possible. We have also developed a new R package called JointGWAS that takes the output of MegaLMM analyses and extracts GWAS associations at every marker on each trait or set of traits genome-wide. We are currently writing a manuscript describing how this approach provides a powerful way to account for correlated traits in GWAS.
Objective 3: Due to the pandemic our summer course in Quantitative Genetics was canceled in 2021. We will hold this course in August 2022 (https://shortcourse.qtl.rocks/) and are working on registration and developing training materials for this course. We will include a module on the use of MegaLMM for genomic prediction in this course.

Publications

Type: Journal Articles Status: Published Year Published: 2021 Citation: Hu, H., Campbell, M.T., Yeats, T.H., Zheng, X., Runcie, D.E., Covarrubias-Pazaran, G., Broeckling, C., Yao, L., Caffe-Treml, M., Gutiérrez, L., Smith, K.P., Tanaka, J., Hoekenga, O.A., Sorrells, M.E., Gore, M.A., and Jean-Luc Jannink. Multi-omics prediction of oat agronomic and seed nutritional traits across environments and in distantly related populations. Theoretical and Applied Genetics volume 134, pages 4043-4054 (2021)
Type: Journal Articles Status: Published Year Published: 2022 Citation: Hudson, A.I., Odell, S.G., Dubreuil, P., Tixier, M-H., Praud, S., Runcie, D.E., and Jeffrey Ross-Ibarra. Analysis of genotype by environment interactions in a maize mapping population. G3 Genes|Genomes|Genetics, Volume 12, Issue 3, March 2022, jkac013
Type: Journal Articles Status: Published Year Published: 2022 Citation: Odell, S.G., Hudson, A.I., Dubreuil, P., Tixier, M-H., Praud, S., Ross-Ibarra, J., and D.E. Runcie. Modeling Allelic Diversity of Multi-parent Mapping Populations Affects Detection of Quantitative Trait Loci. G3 Genes|Genomes|Genetics, Volume 12, Issue 3, March 2022, jkac011

Progress 06/01/20 to 05/31/21

Outputs
Target Audience: The target audience is plant (and animal) breeders and breeding programs in the public and private sectors. We also aim to reach graduate students who want to increase their familiarity with statistical methodology.
Changes/Problems: We will be unable to host the "Modern Programming in Genomic Prediction" workshop in 2021 due to the COVID-19 pandemic. We will host the workshop and contribute a module on multi-trait analysis featuring MegaLMM in 2022. We were unable to recruit a postdoc until towards the end of the first year, also because of the pandemic, but a postdoc has now started and will continue through at least the coming reporting period. A graduate student helped with some of the activities this year instead.
What opportunities for training and professional development has the project provided? This project has contributed to the training of two graduate students and a postdoctoral scholar in the development and application of quantitative genetics tools.
How have the results been disseminated to communities of interest? The first paper demonstrating the MegaLMM method has been accepted at Genome Biology. We have given presentations on the approach and results at 5 conferences and workshops in the field of quantitative genetics or maize genetics, and to a private company working on crop breeding.
What do you plan to do during the next reporting period to accomplish the goals? To address the first objective, we will extend MegaLMM to accommodate genetic marker data directly using the BayesC prior, and evaluate whether MegaLMM can function as a tool for multivariate genome-wide association studies. To address the second objective, we will refine the documentation of MegaLMM to make it more complete and create a web page to help users get started using MegaLMM. A particular focus of the software development will be implementing a better issue tracking system and providing documentation for model diagnostics.
To address the third objective, we will refine several case-studies of MegaLMM to use as teaching vignettes. We will start with the wheat and corn datasets described above. These will eventually be used in our workshop on quantitative genetics in breeding that will occur in the third year of this project.

Impacts
What was accomplished under these goals? The central goal of this project is to develop statistical methods and software that enable the efficient and practical use of multi-trait data in breeding programs. Multi-trait data includes multiple quality measures of a single individual (plant, genotype, animal, etc.), high-throughput phenotyping data, and measures of genotypes across many environments. In each case the total combined information about the quality of a candidate line in a breeding program from all traits together is greater than the information in any individual trait. However, existing statistical methods and software are not capable of jointly analyzing many traits at once. We have demonstrated using two case studies from wheat and corn breeding programs that our methods improve the accuracy of selections, make more efficient use of all available data, and are feasible to apply to breeding programs using widely available computer systems. Specifically, under our first objective, we have fully developed and published the framework of a multi-trait statistical method called MegaLMM that incorporates genomic data and high-dimensional phenotypic data in a single multi-trait linear mixed model using the technique of Bayesian factor analysis to achieve statistical robustness. We showed that we can model data from a wide range of experimental contexts by including experimental design factors as fixed effects and multiple genetic or environmental terms as random effects. We derived a Markov chain Monte Carlo method to fit the model. We showed that our approach improved genomic prediction accuracy in a wheat breeding program by up to 74% using data from hyperspectral reflectance and in a corn breeding program by up to 20% using data from multi-environment trials. Under our second objective, we developed and published an open-source R package, also called MegaLMM, that is released under the MIT license and is available on GitHub.
We developed a number of new computational algorithms to make the model computationally efficient, including the careful storage of intermediate calculations, a grid-based sampling algorithm for certain parameters, and an efficient method for dealing with the patterns of missing data that are common in many breeding program contexts. We showed that our software could fit linear mixed models with at least 20,000 traits and 650 observations in less than a day, while other programs could only fit dozens to a few hundred traits using simpler, less complete models. We have not started on the third objective in this reporting period.
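The idea of grid-based sampling for a parameter can be sketched generically as follows: restrict the parameter to a fixed grid of candidate values, evaluate its log density at each grid point, and draw one value with probability proportional to the exponentiated density. The toy example below applies this to a variance proportion (called h2 here) for data assumed to have been rotated by the eigenvectors of a relationship matrix. It is an assumption-laden illustration rather than the package's actual algorithm; all names and numbers are made up.

```python
import math
import random


def log_density_h2(h2, rotated_y, eigenvalues):
    """Toy log-likelihood for h2 with total variance fixed at 1:
    each rotated observation is N(0, h2 * d_i + (1 - h2))."""
    total = 0.0
    for y, d in zip(rotated_y, eigenvalues):
        variance = h2 * d + (1.0 - h2)
        total += -0.5 * (math.log(variance) + y * y / variance)
    return total


def sample_on_grid(grid, log_density, rng):
    """Draw one grid value with probability proportional to exp(log_density)."""
    log_weights = [log_density(value) for value in grid]
    max_log_weight = max(log_weights)  # subtract the max for numerical stability
    weights = [math.exp(lw - max_log_weight) for lw in log_weights]
    total = sum(weights)
    probabilities = [w / total for w in weights]
    draw = rng.choices(grid, weights=probabilities, k=1)[0]
    return draw, probabilities


grid = [i / 20 for i in range(20)]             # 0.00, 0.05, ..., 0.95
eigenvalues = [3.0, 1.5, 0.8, 0.4, 0.2, 0.1]   # made-up values
rotated_y = [2.1, -1.0, 0.7, 0.5, -0.3, 0.2]   # made-up values
rng = random.Random(0)                         # fixed seed for repeatability

h2_draw, h2_probs = sample_on_grid(
    grid, lambda h2: log_density_h2(h2, rotated_y, eigenvalues), rng
)
print(h2_draw)
```

One appeal of a fixed grid is that any quantity depending only on the grid value can be computed once and reused across iterations and traits, which fits naturally with storing intermediate calculations.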

Publications

Type: Journal Articles Status: Accepted Year Published: 2021 Citation: Daniel E Runcie, Jiayi Qu, Hao Cheng, Lorin Crawford (2021) MegaLMM: Mega-scale linear mixed models for genomic predictions with thousands of traits. Genome Biology. Accepted Article.
Type: Journal Articles Status: Submitted Year Published: 2021 Citation: Abelardo Montesinos-López, Daniel Runcie, Maria Itria Ibba, Paulino Pérez-Rodríguez, Osval A. Montesinos-López, Leonardo A. Crespo, Alison Bentley, and José Crossa. Measurements for multi-trait genomic-enabled prediction accuracy in multi-years breeding trials. Submitted to G3.