NIFA AG2PI Collaborative: Improving Causal Gene Detection across Crop and Livestock Species

Sponsoring Institution

National Institute of Food and Agriculture

Project Status

ACTIVE

Funding Source

OTHER GRANTS

Reporting Frequency

Annual

Accession No.

1031452

Grant No.

2023-70412-41087

Cumulative Award Amt.

$1,132,877.00

Proposal No.

2023-06073

Multistate No.

(N/A)

Project Start Date

Sep 15, 2023

Project End Date

Sep 14, 2026

Grant Year

2023

Program Code

[AG2PI]- Agricultural Genome to Phenome Initiative

Recipient Organization
IOWA STATE UNIVERSITY
2229 Lincoln Way
AMES, IA 50011

Performing Department
(N/A)

Non Technical Summary
In the face of 21st-century agricultural challenges, our mission is clear: we must produce more food, feed, and fiber for a growing population with evolving dietary preferences, while dealing with limited rural labor and agricultural land, and the need for bio-energy sources. Moreover, climate change introduces more frequent biotic and abiotic stresses. While global crop productivity has matched these challenges, we must intensify our efforts to sustain this progress. This is especially vital as we navigate the rest of the century.
To address these pressing needs, our team of experts, spanning crop and livestock breeding, genetics, biochemistry, and data science, is forging ahead. We're developing innovative tools to decode the genetic basis of traits in crops like maize, soybean, and sorghum, and in pigs. Our advanced statistical models, enhancing methods like GWAS, TWAS, and eQTL mapping, empower biologists to explore data in groundbreaking ways, uncovering new insights.
We are bridging the gap between genetics and traits, from crop yields to Vitamin B levels in maize. Our research probes the interplay of genetics, weather, and environment using diverse data. This newfound knowledge will steer enhancements in crucial U.S. crops and livestock.
Our ambitious endeavor extends beyond discovery. It entails crafting novel statistical tools to comprehend essential genes in both livestock and crops, applicable across species. Aligned with the USDA's strategic goals, our work contributes to an equitable, resilient, and prosperous U.S. agricultural system, ensuring accessible, wholesome food for all. Through education and outreach, we'll empower crop and livestock breeders and cultivate the human capital needed to fulfill these aspirations.

Animal Health Component

Research Effort Categories

Basic

100%

Applied

Developmental

Classification

Knowledge Area (KA)	Subject of Investigation (SOI)	Field of Science (FOS)	Percent
201	2410	1060	25%
201	7310	2090	25%
304	3910	1060	25%
304	7310	2090	25%

Knowledge Area
201 - Plant Genome, Genetics, and Genetic Mechanisms; 304 - Animal Genome;

Subject Of Investigation
3910 - Cross-commodity research--multiple animal species; 7310 - Experimental design and statistical methods; 2410 - Cross-commodity research--multiple crops;

Field Of Science
2090 - Statistics, econometrics, and biometrics; 1060 - Biology (whole systems);

Keywords

crop breeding

eqtl

gene discovery

gene trait association

livestock breeding

team science

Goals / Objectives
The overarching goal of this project is to develop and support new statistical tools for the breeding of superior individuals or cultivars in genetic populations with the long-term goal of enhancing the production, sustainability, and climate and disease resilience of crop and livestock species. One way to enhance livestock and crop breeding strategies is by better understanding gene-trait associations and prioritizing causal genes of diverse agriculturally important phenotypic traits. Towards this end, the project will bring together researchers from a variety of disciplines, including phenomics, genomics, genetic diversity, and data science. Biologists will bring their own biological questions and datasets from different crop and livestock species. Statisticians, with extensive experience in collaborations with biologists, will build statistical models and methodologies to analyze these datasets. The models and analyses will be updated iteratively following feedback from the biologists.
Our objectives in this project are: (1) to build powerful multi-locus methods for combined GWAS, TWAS, and expression quantitative trait loci (eQTL) mapping, (2) to develop user-friendly open-source R packages and Python libraries with detailed manuals, vignettes, and video tutorials, and (3) to interweave research and education through the integration of training and cross-disciplinary research toward producing a skilled STEM agricultural workforce. We plan to achieve these research objectives by pursuing the following three specific objectives:
Objective #1: Develop methods to combine GWAS, TWAS, and eQTL mapping of quantitative traits.
Our working hypothesis is that a hierarchy of high-dimensional partial-linear and linear models, with appropriate shrinkage on SNP and gene expression effects, will be able to mitigate the confounding effects. In this objective, we will focus on traits for which the responses can be assumed to be univariate or multivariate Gaussian (normally distributed), possibly after a suitable transformation (e.g., log).
Objective #2: Develop methods to combine GWAS, TWAS, and eQTL mapping of ordinal traits.
The non-Gaussian traits we will focus on are ordinal scores (e.g., disease and root lodging scores). Our working hypothesis is that we will be able to improve the association results by properly accounting for the nature of the non-Gaussianity through an appropriate hierarchical generalized partial-linear multi-locus model. We will also retain the advantages of Objective #1.
Objective #3: Develop methods to combine GWAS, TWAS, and eQTL mapping of functional data traits.
Here, we will develop multi-locus methods for traits that are measured by a smooth curve (e.g., repeatedly measured phenotypes such as growth rates, time-series, light curves, A/Ci curves). Our working hypothesis is that we will be able to improve the understanding of the genetic basis of variations in the whole trait curves instead of being limited to univariate analyses of summary measurements or independent analysis of individual time points.
Alongside the research outcomes, these initiatives will enhance expertise in agricultural genome-to-phenome research through education and outreach activities. Existing ISU courses will be improved by accommodating GWAS and TWAS methods in the syllabus. Outreach programs will provide education and support the research and training of a broad range of crop and livestock scientists at multiple types of U.S. institutions. Hybrid workshops will be organized to facilitate training students and scientists.
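To make the ordinal-trait setting of Objective #2 more concrete, the following is a minimal Python sketch of a generic cumulative-logit model, one common way to link a latent genetic score to an ordered score such as a disease rating. It is an illustration only: the logit link, the cutpoints, the effect size `beta`, and the function names are all assumptions made here, and the project's actual hierarchical model may differ.

```python
import math


def ordinal_probs(eta, cutpoints):
    """Category probabilities under a cumulative-logit model.

    Assumes P(Y <= k) = logistic(cutpoints[k] - eta) with increasing
    cutpoints; returns len(cutpoints) + 1 probabilities.
    """
    cdf = [1.0 / (1.0 + math.exp(-(c - eta))) for c in cutpoints] + [1.0]
    probs, prev = [], 0.0
    for value in cdf:
        probs.append(value - prev)  # P(Y = k) = P(Y <= k) - P(Y <= k - 1)
        prev = value
    return probs


def log_likelihood(scores, etas, cutpoints):
    """Sum of log P(Y_i = score_i) over individuals (scores are 0-based)."""
    return sum(math.log(ordinal_probs(e, cutpoints)[s])
               for s, e in zip(scores, etas))


# Hypothetical example: a 4-level score and one SNP coded as dosage 0/1/2.
cutpoints = [-1.0, 0.5, 2.0]  # assumed thresholds on the latent scale
beta = 0.8                    # assumed SNP effect on the latent scale
for dosage in (0, 1, 2):
    probs = ordinal_probs(beta * dosage, cutpoints)
    print(dosage, [round(p, 3) for p in probs])
```

In this sketch a larger linear predictor shifts probability mass toward the higher score categories, which is the property that lets a multi-locus model relate SNP and expression effects to an ordered score without treating the score as Gaussian.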

Project Methods
The statistical methods include the development of hierarchical Bayesian models for combining GWAS, TWAS, and eQTL mapping. Latent indicator variables will be assumed, and model size will be penalized through Bernoulli priors on these latent indicator vectors. Theoretical results will be developed for choosing the right shrinkage to accurately detect associated genes. Fast, scalable computational algorithms based on delayed Cholesky factorization and sparse-matrix algebra will be developed and implemented in the C++ programming language. These models and methods will be extended to accommodate more general phenotypic responses through link functions.
Three methods will be used to validate the association results: cross-validation with independent datasets from the literature, biological pathway analysis, and network analysis with functional enrichment (gene ontology, or GO, terms) analysis. In addition, simulation experiments will also be conducted based on the literature.
Two hybrid workshops will be organized yearly to disseminate research and the software and train the broader scientific community. Feedback will be sought from the workshop participants to assess the overall effectiveness of the workshops and to improve the accessibility of the software, manuals, and vignettes.
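The role of the latent indicator variables and Bernoulli priors described in the methods can be sketched with a small, generic spike-and-slab example in Python. The model assumed here is illustrative, not the project's specification: a centered Gaussian response, effects of selected predictors distributed as N(0, sigma^2/lam), a prior on sigma^2 proportional to 1/sigma^2, and independent Bernoulli(w) indicators. The sketch uses a plain dense Cholesky factorization rather than the delayed or sparse variants mentioned above, and the names `log_posterior`, `lam`, and `w` are hypothetical.

```python
import math
import random


def cholesky(a):
    """Lower-triangular L with a = L L^T (a symmetric positive definite)."""
    k = len(a)
    L = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = a[i][j] - sum(L[i][m] * L[j][m] for m in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def forward_solve(L, b):
    """Solve L z = b for lower-triangular L."""
    z = []
    for i, row in enumerate(L):
        z.append((b[i] - sum(row[m] * z[m] for m in range(i))) / row[i])
    return z


def log_posterior(X, y, gamma, lam=1.0, w=0.1):
    """Unnormalized log posterior of an indicator vector gamma (assumed model)."""
    n, p = len(y), len(gamma)
    ybar = sum(y) / n
    yc = [v - ybar for v in y]
    # Centered columns of the predictors switched on by gamma
    cols = []
    for j in range(p):
        if gamma[j]:
            mean_j = sum(X[i][j] for i in range(n)) / n
            cols.append([X[i][j] - mean_j for i in range(n)])
    k = len(cols)
    rss = sum(v * v for v in yc)
    logdet = 0.0
    if k:
        # A = X_g^T X_g + lam * I and xty = X_g^T y for the selected columns
        A = [[sum(u * v for u, v in zip(ca, cb)) + (lam if a == b else 0.0)
              for b, cb in enumerate(cols)] for a, ca in enumerate(cols)]
        xty = [sum(u * v for u, v in zip(c, yc)) for c in cols]
        L = cholesky(A)
        z = forward_solve(L, xty)
        rss -= sum(v * v for v in z)  # y'y - xty' A^{-1} xty
        logdet = 2.0 * sum(math.log(L[i][i]) for i in range(k))
    # Marginal likelihood terms plus the Bernoulli(w) prior on model size
    return (0.5 * k * math.log(lam) - 0.5 * logdet
            - 0.5 * (n - 1) * math.log(rss)
            + k * math.log(w) + (p - k) * math.log(1.0 - w))


# Simulated toy data: only the first of five predictors affects the trait.
random.seed(1)
n, p = 60, 5
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2.0 * row[0] + random.gauss(0, 1) for row in X]
for gamma in ([0, 0, 0, 0, 0], [1, 0, 0, 0, 0], [0, 1, 0, 0, 0]):
    print(gamma, round(log_posterior(X, y, gamma), 2))
```

The Bernoulli term `k * log(w) + (p - k) * log(1 - w)` is what penalizes model size: with a small `w`, each extra selected marker or gene must improve the fit enough to offset its prior cost.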

Progress 09/15/24 to 09/14/25

Outputs
Target Audience: The target audience includes agricultural scientists, geneticists, and biostatisticians seeking advanced tools for identifying and analyzing important genes in crops and livestock. It also includes livestock breeders, crop scientists, biotechnology companies, academic researchers, and policymakers focused on enhancing agricultural productivity and sustainability.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? Seven graduate students in statistics received training in developing advanced statistical methods, implementing software, and conducting interdisciplinary research at the interface of statistics, genetics, and breeding. Additionally, three plant science and two animal science graduate students received training in data analysis and interdisciplinary collaboration with researchers from other fields.
How have the results been disseminated to communities of interest? Findings have been disseminated via conference presentations, publications in peer-reviewed journals, and contributions to book chapters.
What do you plan to do during the next reporting period to accomplish the goals? Nothing Reported

Impacts
What was accomplished under these goals?
Impact statement: This project addresses the pressing need to improve agricultural productivity and resilience amid climate and resource challenges by advancing statistical tools for genetic improvement. Building on last year's progress, we developed hierarchical Bayesian models that jointly perform GWAS and TWAS, enabling more accurate identification of causal genes and regulatory pathways. Implemented in open-source software, these integrative methods enhance the precision and interpretability of genetic analyses and are readily applicable across crops and livestock. Applying this framework to flowering time in sorghum, a key adaptive trait influencing yield stability, demonstrated how combining GWAS and TWAS can reveal biologically meaningful pathways, such as ageing-related MADS-box and SBP transcription factors, that are directly relevant to breeding for resilience. Beyond research outcomes, the project is fostering workforce development by training graduate students at the intersection of statistics, genetics, and breeding. Through hands-on collaboration with biologists and breeders, trainees are gaining experience in applying advanced statistical tools to real-world agricultural problems. Together, these efforts contribute to U.S. agricultural sustainability by accelerating genetic discovery, enhancing data-driven breeding, and preparing a new generation of interdisciplinary scientists.
Objective #1
We developed new hierarchical Bayesian models for jointly performing GWAS and TWAS. The proposed framework effectively incorporates potential group structures among markers and accounts for nonlinear effects of gene expression. Theoretical guarantees for the method have been established. The method has been implemented in the existing R package bravo. In parallel, our statistics group developed specialized software for GWAS applications in crop and livestock species. Collaborating biologists are currently integrating this software into their genetic data preprocessing pipelines. The project has also contributed to workforce development by training four graduate students in statistics, focusing on the interdisciplinary integration of genetics, breeding, and statistical methodology.
The UNL team applied an integrated GWAS-TWAS approach to analyze flowering time in sorghum. GWAS alone identified several genomic regions, such as SbFT8 and a locus near miR172, though many associations lacked strong statistical confidence. In contrast, TWAS pinpointed candidate genes whose expression levels were significantly correlated with flowering time, including MADS-box genes, SBP transcription factors targeted by miR156, and FT-like paralogs. Both methods converged on the ageing pathway, highlighting the central role of small RNAs and their downstream transcription factors in regulating flowering time variation. These findings illustrate how GWAS excels at detecting regulatory variants at the genomic level, while TWAS captures downstream expression effects. Together, the two approaches provide complementary insights, increasing confidence in the identified candidate pathways.
Overall, these results demonstrate the practical impact of developing joint GWAS-TWAS methods: integrative models not only enhance gene discovery but also yield biologically meaningful targets for breeding. This directly advances Objective #1 and lays the groundwork for future work under Objectives #2 and #3, which will extend these methods to ordinal and functional trait data.
Objective #2
Despite some initial challenges, we have made strong progress in developing a multi-locus GWAS method for ordinal traits. Our next step is to apply the method to real datasets provided by our biology collaborators at ISU and UNL.

Publications

Type: Peer Reviewed Journal Articles Status: Published Year Published: 2025 Citation: Run Wang, Somak Dutta, Vivekananda Roy. (2025) Bayesian Iterative Screening in Ultra-high Dimensional Linear Regressions. Bayesian Analysis, Advance Publication 1-26 2025. https://doi.org/10.1214/25-BA1517
Type: Peer Reviewed Journal Articles Status: Published Year Published: 2025 Citation: Rao, Y. and Roy, V. (2025). Necessary and sufficient conditions for posterior propriety for generalized linear mixed models. Sankhya, Series A, 87, 157-190
Type: Book Chapters Status: Published Year Published: 2024 Citation: Roy, V., Khare, K. and Hobert, J. P. (2024). The Data Augmentation Algorithm. In Handbook of Markov Chain Monte Carlo, Second Edition (eds. Steve Brooks, Andrew Gelman, Galin L. Jones, and Xiao-Li Meng), Chapman & Hall/CRC.
Type: Peer Reviewed Journal Articles Status: Published Year Published: 2025 Citation: Davis, J. M., Coffey, L. M., Turkus, J., López-Corona, L., Linders, K., Ullagaddi, C., Santra, D. K., Schnable, P. S., & Schnable, J. C. (2025). Assessing the impact of yield plasticity on hybrid performance in maize. Physiologia Plantarum. Advance online publication. https://doi.org/10.1111/ppl.70278
Type: Conference Papers and Presentations Status: Published Year Published: 2025 Citation: Global-local MCMC using Riemannian geometry. Vivekananda Roy. Fast and the Curious 2, Toronto, Canada, September 2025.
Type: Conference Papers and Presentations Status: Published Year Published: 2025 Citation: A geometric approach to informed MCMC sampling. Vivekananda Roy. Joint Statistical Meetings, Nashville, USA, August 2025
Type: Conference Papers and Presentations Status: Published Year Published: 2024 Citation: Informed MCMC for Bayesian variable selection. Vivekananda Roy. CFE-CMStatistics, London, UK, December 2024.
Type: Conference Papers and Presentations Status: Published Year Published: 2024 Citation: Predicting the Unpredictable: Introduction to Monte Carlo Simulations, Vivekananda Roy. Ahmedabad University, India, August 2024.

Progress 09/15/23 to 09/14/24

Outputs
Target Audience: The target audience includes agricultural scientists, geneticists, and biostatisticians seeking advanced tools for identifying and analyzing important genes in crops and livestock. It also includes livestock breeders, crop scientists, biotechnology companies, academic researchers, and policymakers focused on enhancing agricultural productivity and sustainability.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? Two statistics graduate students received training in developing and optimizing statistical methods. Additionally, one plant science and two animal science graduate students received training in data analysis and interdisciplinary collaboration with researchers from other fields.
How have the results been disseminated to communities of interest? The newly developed method was presented in an online workshop on May 17, 2024, with hands-on training for combined GWAS and TWAS using SVEN from the R package bravo. The workshop attracted 317 participants from 52 countries, received positive feedback, and has since garnered 207 YouTube views and four downloads of related materials. In addition, we are preparing three manuscripts for submission to scientific journals to disseminate the results to communities of interest.
What do you plan to do during the next reporting period to accomplish the goals? Nothing Reported

Impacts
What was accomplished under these goals?
Impact statement: Agriculture in the 21st century faces significant challenges: it must produce more food, feed, and fiber for a growing population with diverse dietary preferences, all while dealing with limited farmland, a shrinking rural workforce, and increased demand for bio-energy resources. Additionally, it must adopt sustainable methods and address the rising impacts of climate change. Although global crop productivity has improved over the past 60 years, continued advancements are necessary to meet these demands. This requires investing in genetic improvements for crops and livestock, studying key agricultural species in real-world conditions, and identifying genes critical to U.S. agriculture. It is also essential to share these findings with breeders and the agricultural community, while fostering the development of skilled professionals through education and outreach initiatives.
GWAS (Genome-Wide Association Studies), TWAS (Transcriptome-Wide Association Studies), and eQTL (expression Quantitative Trait Locus) mapping are powerful tools for identifying the genetic basis of complex traits, validating genes, and guiding genetic improvements in crops and livestock. However, existing methods typically perform GWAS and TWAS separately, combining results afterward through statistical tests, which can limit their ability to detect causal genes. Additionally, many traits are measured in non-Gaussian formats, such as ordered categorical scores (e.g., crop disease ratings), time series (e.g., growth data), or functional curves (e.g., photosynthetic responses). Current models often overlook nonlinear relationships between gene expression and traits, reducing their predictive power. Therefore, innovative generalized or nonlinear models are necessary to enhance these studies.
To address these gaps, we have developed new Bayesian models integrating GWAS and TWAS in a single hierarchical framework, incorporating effect size shrinkage and model penalties to manage confounding factors. Separate models are being designed for different types of response variables, such as ordinal data. Furthermore, the project has started training three graduate students in the interdisciplinary fields of genetics, breeding, and statistics. These students are helping disseminate the methods to the broader research community by assisting with data analyses and hosting hybrid workshops. Additionally, software is being developed to promote broad application and advance U.S. agricultural goals through research and capacity-building efforts.
For Objective #1, we have now successfully extended the SVEN methodology for jointly performing GWAS and TWAS through a single Bayesian hierarchical model. The method was presented in an online workshop on May 17, 2024. During the workshop, hands-on training was given for combined GWAS and TWAS using SVEN from the R package bravo. There were 317 registered participants in this workshop representing 52 countries. We received several positive comments from the participants of the workshop. Also, there were 207 views of the recording since it was posted on YouTube (May 17, 2024) and 4 people downloaded the workshop-related materials. Currently, methods are being developed for incorporating possible group structures among the markers and nonlinear effects of the gene expression for the combined GWAS and TWAS. The corresponding implementation in the bravo package is also in progress. The team at University of Nebraska-Lincoln (UNL) generated, curated, and transferred two large datasets consisting of matched genotype, transcript and phenotype datasets for flowering time in large maize and sorghum diversity panels to the statistics team at Iowa State University (ISU). By mining the literature, we generated a set of high confidence flowering time genes to use as ground truth to evaluate model performance.
For Objective #2, we have almost completed the development of a multi-locus GWAS method for ordinal traits. While we have yet to analyze real datasets from the different biology teams of ISU, we are testing our methodology on simulated data sets. Next, we will extend the methodology for combining GWAS, TWAS, and eQTL mapping of ordinal traits.
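As a rough illustration of how a literature-derived set of high-confidence flowering time genes could serve as ground truth, the Python sketch below scores a ranked candidate list by precision and recall among its top k entries. The report does not state which metric the team uses, so the metric is an assumption, and the gene identifiers and the function name are hypothetical placeholders.

```python
def precision_recall_at_k(ranked_genes, truth, k):
    """Precision and recall of the top-k ranked genes against a truth set."""
    top = ranked_genes[:k]
    hits = sum(1 for gene in top if gene in truth)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical identifiers, ordered from strongest to weakest model support.
ranked = ["gene_03", "gene_11", "gene_07", "gene_02", "gene_19"]
# Hypothetical ground-truth set standing in for the literature-derived genes.
truth = {"gene_03", "gene_07", "gene_42"}

for k in (1, 3, 5):
    precision, recall = precision_recall_at_k(ranked, truth, k)
    print(k, round(precision, 3), round(recall, 3))
```

Comparing these curves across methods (for example, GWAS alone, TWAS alone, and a joint model) on the same ranked output size is one simple way to check whether integration recovers more known genes.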

Publications

Type: Other Status: Submitted Year Published: 2024 Citation: Roy, V. (2024) A geometric approach to informed MCMC sampling, https://arxiv.org/abs/2406.09010
Type: Other Status: Submitted Year Published: 2024 Citation: Rao, Y. and Roy, V. (2024) Necessary and sufficient conditions for posterior propriety for generalized linear mixed models, https://arxiv.org/abs/2302.00665
Type: Book Chapters Status: Submitted Year Published: 2024 Citation: Roy, V., Khare, K., and Hobert, J. P. (2024) The data augmentation algorithm, https://arxiv.org/abs/2406.10464, Handbook of Markov chain Monte Carlo, 2nd Edition, Steve Brooks, Andrew Gelman, Galin L. Jones and Xiao-Li Meng eds., Chapman & Hall/CRC, to appear
Type: Other Status: Other Year Published: 2024 Citation: Escamilla, D.M., D. Li, K.L. Negus, K.L. Kappelmann, A. Kusmec, A.E. Vanous, P.S. Schnable, X. Li, and J. Yu*. Genomic selection: essence, applications, and prospects. Plant Genome. - in preparation