Genetically-Informed Envirotyping Tools to Better Match Test and Target Environments - UNIVERSITY OF NORTH CAROLINA

GENETICALLY-INFORMED ENVIROTYPING TOOLS TO BETTER MATCH TEST AND TARGET ENVIRONMENTS

Sponsoring Institution

National Institute of Food and Agriculture

Project Status

COMPLETE

Funding Source

AFRI COMPETITIVE GRANT

Reporting Frequency

Annual

Accession No.

1011971

Grant No.

2017-67013-26188

Cumulative Award Amt.

$490,000.00

Proposal No.

2016-09663

Multistate No.

(N/A)

Project Start Date

Feb 15, 2017

Project End Date

Feb 14, 2020

Grant Year

2017

Program Code

[A1141]- Plant Health and Production and Plant Products: Plant Breeding for Agricultural Production

Recipient Organization
UNIVERSITY OF NORTH CAROLINA - WILMINGTON
601 SOUTH COLLEGE ROAD
WILMINGTON,NC 28403

Performing Department
Biology and Marine Biology

Non Technical Summary
Better matching of test crop growth environments to target crop production environments is key for efficient crop breeding. We propose to optimize promising new envirotyping analysis and modeling methods and develop publicly accessible known-truth genotype-environment simulations to allow improved breeding schemes for better global crop yield. In conjunction with the development of simulations, we will improve the run speed of our promising PReMiuM profile regression algorithm and develop breeder-relevant output plots and tables. We will combine PReMiuM profile regression covariate variable selection with standard linear model selection and fit methods to create a combined analysis workflow that will allow breeders to fit SNP and environment variates to their data. To illustrate these new analysis methods and inform our breeding program modeling, we will analyze real crop datasets with our improved PReMiuM and PReMiuM+model selection workflow and make spatial results maps to visualize the results in an easily interpretable field context.
To leverage better envirotyping within breeding programs, we need modeling tools that allow exploration of program design constraints. We will develop breeding simulation models that incorporate realistic environment covariate features of test and target environments and flexible, extensible specifications of genetic gain within an open-source, widely used web-accessible modeling system that supports both student training and advanced breeder modeling. Modeling tools and better envirotyping tools will support crop breeders. Breeders will be able to design optimal germplasm exchange programs for maximum genetic gain by using PReMiuM results to inform setup of test and target environments.

Animal Health Component

(N/A)

Research Effort Categories

Basic

100%

Applied

(N/A)

Developmental

(N/A)

Classification

Knowledge Area (KA)	Subject of Investigation (SOI)	Field of Science (FOS)	Percent
201	7310	1081	75%
203	0420	1020	25%

Knowledge Area
203 - Plant Biological Efficiency and Abiotic Stresses Affecting Plants; 201 - Plant Genome, Genetics, and Genetic Mechanisms;

Subject Of Investigation
0420 - Weather; 7310 - Experimental design and statistical methods;

Field Of Science
1081 - Breeding; 1020 - Physiology;

Keywords

genotype-environment interaction

Goals / Objectives
Long-term goals
We will optimize promising new envirotyping analysis and modeling methods and develop relevant, publicly accessible known-truth genotype-environment simulations to allow improved breeding schemes for better global crop yield. Our goal is to develop a flexible, extendable computational framework that allows dynamic adjustment of test-environment-to-target-production-environment relationships as new data, new climate projections, and new analysis methods are added, and to make this framework accessible to plant breeders via public cyberinfrastructure.
Objectives
Objective 1
1a) Create known-truth simulations of crossover genotype by environment interaction with realistic correlated covariate structures which combine to generate nonlinear outcome (yield) levels. These simulations and complete documentation and tutorials will be publicly available for future use by other algorithm developers and breeders.
1b) Improve the PReMiuM profile regression algorithm run speed and develop breeder-relevant output plots and tables.
1c) Combine PReMiuM profile regression covariate variable selection with standard linear model selection and parameter estimation methods to create a combined workflow that will allow breeders to fit SNP and environment variates to their data, with use of our improved simulations to guide workflow development.
1d) Analyze real crop datasets with our improved PReMiuM and PReMiuM+linear model workflow, make spatial results maps to visualize the results in an easily interpretable field context.
Objective 2
Develop breeding simulation models that incorporate realistic environment covariate features of test and target environments (with updates as information is collected from Objective 1) and flexible, extensible specifications of genetic gain within an open web-accessible modeling system (InsightMaker) that supports both student training and advanced breeder modeling of breeding program options to achieve higher yields with relevant constraints.
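As an illustration of what Objective 1a means by a known-truth crossover simulation, the following minimal Python sketch may help. It is not the project's simulation code: the two genotypes, their temperature optima, the quadratic response curve, and the rainfall relationship are all hypothetical values chosen so that the yield ranking of the genotypes reverses between cool and warm environments and the two covariates are correlated.

```python
import random

def simulate_yield(genotype, temperature, rng, noise_sd=0.0):
    """Return a known-truth yield for one plot (hypothetical response curves)."""
    # Assumed optima: genotype "A" does best when cool, "B" when warm, so the
    # genotype ranking reverses (crosses over) as temperature rises.
    optimum = {"A": 18.0, "B": 28.0}[genotype]
    # Nonlinear (quadratic) yield response around each genotype's optimum.
    expected = 10.0 - 0.05 * (temperature - optimum) ** 2
    return expected + rng.gauss(0.0, noise_sd)

def simulate_trial(temperatures, seed=1, noise_sd=0.3):
    """Simulate both genotypes in every environment; returns a list of row dicts."""
    rng = random.Random(seed)
    rows = []
    for env_id, temp in enumerate(temperatures):
        # Rainfall is generated from temperature plus noise, so the two
        # covariates are correlated, as real weather covariates often are.
        rain = 100.0 - 2.0 * temp + rng.gauss(0.0, 5.0)
        for genotype in ("A", "B"):
            rows.append({
                "env": env_id,
                "genotype": genotype,
                "temperature": temp,
                "rainfall": rain,
                "yield": simulate_yield(genotype, temp, rng, noise_sd),
            })
    return rows
```

Because the generating rule is known, an analysis method run on such data can be scored on whether it recovers temperature as the covariate that drives the crossover.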

Project Methods
We will evaluate population and selection simulation software tools and select the most relevant for our simulation of genotype-environment interaction populations. For data-driven simulations, real genotype distributions from the G2F and T3 data will be used as a base for allocation of phenotype values to those populations. To carry out the code profiling work on the PReMiuM clustering package we will use the R package profvis. We will evaluate additional BIC packages for ease of use and scalability along with other open-source software options such as the Julia parallel sparse regression modules, C++ libraries and parallel R implementations for model fitting and model term effect estimation. Our simulated data will be used for benchmarking in addition to our use of standard code function profiling and software requirements analyses, such as documentation completeness, community size and activity, and support quality. We will test and implement SNP shrinkage if reduced SNP sets are needed for efficient computation. Workflow management options that are suitable for use on the XSEDE high-performance compute architecture, such as makeflow and Pegasus, will be evaluated and the most appropriate package selected for our covariate profiling and model selection workflow.
We will model existing plant breeding program simulation packages and create UML diagrams of the functions, then instantiate these 'rules' into InsightMaker equation syntax. The key information to track across breeding cycles is allele frequency as a proportion of the maximum genetic gain, so the equations will be formed using those variables. A web-friendly interface with stock and flow diagrams is automatically created and visualizations can easily be plotted from simulation runs.
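The stock-and-flow idea of tracking allele frequency across breeding cycles can be sketched in a few lines of Python. This is an illustrative assumption, not the InsightMaker model itself and not InsightMaker equation syntax: it treats the frequency of a single favorable allele as the stock and uses the textbook haploid selection recursion p' = p(1 + s) / (1 + s p) as the per-cycle flow. The function name and the parameter `selection_advantage` are hypothetical.

```python
def run_breeding_cycles(p0, selection_advantage, n_cycles):
    """Track favorable-allele frequency (the 'stock') over breeding cycles.

    Under the single-locus additive assumption used here, the frequency can
    be read as the proportion of the maximum genetic gain reached so far.
    """
    if not 0.0 <= p0 <= 1.0:
        raise ValueError("p0 must be a frequency between 0 and 1")
    p = p0
    freqs = [p]
    for _ in range(n_cycles):
        # Flow into the stock for this cycle; adding it to p is algebraically
        # the same as the recursion p' = p(1 + s) / (1 + s p).
        s = selection_advantage
        flow = s * p * (1.0 - p) / (1.0 + s * p)
        p = p + flow
        freqs.append(p)
    return freqs
```

Calling `run_breeding_cycles(0.1, 0.5, 20)` returns 21 frequencies that rise toward, but never exceed, 1.0, which is the saturating genetic-gain curve a stock-and-flow diagram would plot.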

Progress 02/15/17 to 02/14/20

Outputs
Target Audience:
Plant breeding educators, plant breeding graduate students and plant breeding faculty have provided survey answers and tested our breeding program simulations. Foundations, public breeding programs, large agricultural companies and agricultural data providers have expressed interest in the PReMiuM profile regression analysis method and we have assisted them in implementing and testing the method.
Changes/Problems:
Linear model code was not needed, as the PReMiuM package computational improvements that we implemented made it possible to do analyses of realistic-size field data sets. In addition to making our code and tutorials available we were able to implement a web form with simple drop-down menus and viewers for the code to make analysis on HPC easier for new users. This web-based PReMiuM profile regression analysis is available from https://idols.tacc.utexas.edu/. The T3 data did not have adequate information on planting dates to enable matching to weather data, so we were unable to use those data sets as a profile regression example. We instead focused on use of the G2F data sets across multiple years, from 2014 to 2016, for our example real data analyses.
What opportunities for training and professional development has the project provided?
We have trained two mathematics and statistics students in use of R code, in writing efficient functions and in testing using updated coding practice in R. The students also learned the biology and biological vocabulary needed to understand our use cases. One statistics graduate student learned advanced visualization methods (RShiny and ggplot2). One statistics PhD graduate student learned the needed Bayesian theory and PReMiuM function details along with advanced R scripting for generation of simulation/benchmarking data sets.
Two students visited a large agricultural company that had implemented the PReMiuM method, to learn more about industry careers in data analysis and about industry practices in code development and project management. We trained one research assistant on sharing tools for collaborative work and on statistical methods for simulations and their implementation in R. In 2017 TACC hosted a one-week work session and training at TACC for research assistants on the envirotyping project from the University of North Carolina Wilmington and Queen Mary University of London. In 2019 a TACC scientist learned the new statistical programming language Julia and then implemented and tested a newly published Julia package for Dirichlet Process Mixture Model estimation on HPC.
How have the results been disseminated to communities of interest?
The improved R package is publicly available and we have disseminated the InsightMaker breeding program simulation and updated R code links to interested educators and breeders via email. The PReMiuM R package had 10,715 downloads in 2019, 9,384 downloads in 2018, 5,786 downloads in 2017 and 6,786 downloads in 2016. For 2020, downloads from January 1 to February 23 were 2,101. From 2017 to 2019, downloads from RStudio almost doubled. The process of code improvement and the results showing substantial speed-up were presented at the PEARC18 conference, July 22-26, 2018. PD Stapleton presented a poster at the Gordon Research Conference on Quantitative Genetics and Genomics, Feb 9-15, 2019; the poster title was "GENETICALLY-INFORMED ENVIROTYPING TOOLS TO BETTER MATCH TEST AND TARGET ENVIRONMENT". This presentation led to multiple interactions with large agricultural company statisticians who have implemented the method for their data. Our most current tutorials and datasets are always available at https://envirotyping.readthedocs.io/en/master/. This web resource has been viewed by 251 different people, primarily from the US, in the past 1.5 years.
PD Stapleton presented a poster at the National Association of Plant Breeding conference, August 25-29, 2019. The poster title was "NIFA: Genetically-informed Envirotyping Tools to Better Match Test and Target Environments". PD Stapleton presented the PReMiuM method to a Gates Foundation program officer and to the Excellence in Breeding program in Spring 2020. PD Stapleton presented the method to the aWhere team in Spring 2020.
What do you plan to do during the next reporting period to accomplish the goals?
Nothing Reported

Impacts
What was accomplished under these goals?
We have improved the envirotyping PReMiuM R package so that it is practical to use on real breeding datasets. This will allow breeders to better characterize environments in multi-environment trials and thus optimize their breeding programs. We profiled the PReMiuM R package using the profvis tool and identified several issues causing inefficiency and performance bottlenecks. The subsequent code refactoring included changed data structures, algorithm modification and reduced resource usage. The code optimization improved speed 200-fold. Scripts to run the existing package using large computing resources at TACC were developed and tutorials on HPC use were created.
We created breeder-relevant output plot types to improve visualization and understanding of the PReMiuM package output, which is very rich and extensive. Specifically, we created examples of weather covariate graphs and violin plots using ggplot2 to process the PReMiuM analysis output; the code and instructions are available at https://envirotyping.readthedocs.io/en/master/. We also created an interactive shiny app to allow specific views of the PReMiuM results; this allows breeders and modelers to answer specific questions they may have about certain years, sites, hybrids and weather covariates in the G2F datasets. The tutorial and example spatial map view of the G2F dataset is available at https://envirotyping.readthedocs.io/en/master/workflow/human_readable/. The map titled 'Location of Select Hybrids by Post-Hoc Group' gives a simple view of groups of similar environments for groups of hybrids, which is the key information for selection of test environments for specific production environments. We note that geographic distance is not a useful indicator of a good test environment in this example G2F data set.
We cleaned the 2014, 2015 and 2016 data, fit four models (linear regression, mixed model and Bayesian tree model) to the 2014 and 2015 data, then tested our model fit on the 2016 data. We incorporated what we learnt about the hybrid-environment interactions from this G2F data analysis into a method and scripts to create realistic simulated datasets, which are available at https://gitlab.com/EnvirotypingGroupProjectGitlab/envirotypingprojectgitlab. We identified and repaired seed-setting and covariate specification PReMiuM code scale-up problems and then ran the PReMiuM profile regressions on all three years of public G2F data with weather covariates input in monthly increments (using median, max, and min for each weather variable) using the optimized R code. These results are available in our git repository, and explanations and instructions on how to run the analyses are available in our envirotyping readthedocs site.
We have investigated and evaluated several other libraries with parallel implementations for fitting Dirichlet Process Mixture Models (DPMM). These libraries provide alternatives to the PReMiuM package and may be integrated with the current workflow for further performance improvements. We completed initial protocols and tutorials for fast transfer and processing of NOAA weather covariate data for use with PReMiuM using Spark. We have implemented scripts and code to download and preprocess weather data available from NOAA; the scripts are available at https://github.com/TACC/EnviroTyping. The preprocessing implementation is built on the Spark programming framework and Apache Zeppelin for future scalability and interactivity requirements. It can ingest and clean raw downloaded data and transform them into a Spark DataFrame for future use.
We developed an interactive web application to request supercomputing resources, submit PReMiuM envirotyping jobs, monitor job status and show analytic results, which will be useful for teaching and for researchers new to supercomputing. More details and instructions are available from idols.tacc.utexas.edu and https://envirotyping.readthedocs.io/. We created breeding simulations using the InsightMaker web application, for learning about plant breeding, with documentation and tutorials.
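The monthly weather covariate summaries mentioned above (median, max, and min for each weather variable) can be illustrated with a short standard-library Python sketch. This is not the project's Spark or R code; the record layout `(iso_date, variable, value)` and the function name are assumptions made for the example.

```python
from collections import defaultdict
from statistics import median

def monthly_summaries(daily_records):
    """Summarize daily weather values by month and variable.

    daily_records: iterable of (iso_date, variable, value) tuples, where
    iso_date is a "YYYY-MM-DD" string (an assumed layout for this sketch).
    Returns {(month, variable): {"median": ..., "max": ..., "min": ...}}.
    """
    grouped = defaultdict(list)
    for iso_date, variable, value in daily_records:
        month = iso_date[:7]  # keep "YYYY-MM"
        grouped[(month, variable)].append(value)
    return {
        key: {"median": median(values), "max": max(values), "min": min(values)}
        for key, values in grouped.items()
    }
```

Each `(month, variable)` key then yields three covariates per site, which is one plausible way to arrange weather inputs in monthly increments for a profile regression.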

Publications

Type: Journal Articles Status: Published Year Published: 2018 Citation: Lotterhos KE, Moore JH, Stapleton AE (2018) Analysis validation has been neglected in the Age of Reproducibility. PLOS Biol 16: e3000070 (7,076 views since publication)
Type: Book Chapters Status: Awaiting Publication Year Published: 2019 Citation: ON TO THE NEXT CHAPTER FOR CROP BREEDING: CONVERGENCE WITH DATA SCIENCE Elhan S. Ersoz; Nicolas F Martin; Ann Stapleton doi: 10.2135/cropsci2019.03.0149; Date posted: October 05, 2019 (Crop Science, in press) Preprint at Ersoz, E.S.; Martin, N.F.; Stapleton, A.E. On to the Next Chapter for Crop Breeding: Convergence with Data Science. Preprints 2019, 2019030115 (doi: 10.20944/preprints201903.0115.v1). First look early access version available via https://dl.sciencesocieties.org/publications/cs/first-look.

Progress 02/15/18 to 02/14/19

Outputs
Target Audience:
Plant breeders and modelers were surveyed about their teaching and research needs. All software and tutorials are publicly available.
Changes/Problems:
Nothing Reported
What opportunities for training and professional development has the project provided?
TACC hosted a one-week discussion and training at TACC for research assistants on the envirotyping project from the University of North Carolina Wilmington and Queen Mary University of London.
How have the results been disseminated to communities of interest?
The process of code improvement and the results showing substantial speed-up were presented at the PEARC18 conference, July 22-26, 2018. PD Stapleton presented a poster at the Gordon Research Conference on Quantitative Genetics and Genomics, Feb 9-15, 2019; the poster title was "GENETICALLY-INFORMED ENVIROTYPING TOOLS TO BETTER MATCH TEST AND TARGET ENVIRONMENT". Our most current tutorials and datasets are always available at https://envirotyping.readthedocs.io/en/master/.
What do you plan to do during the next reporting period to accomplish the goals?
We will finish adding realistic simulations with true-positive/false-positive graphics to the readthedocs tutorial and code documentation. We are currently testing the feasibility of using scalable DPMM algorithms from 'Scalable Estimation of Dirichlet Process Mixture Models on Distributed Data (Wang & Lin 2017)' on the Stampede2 supercomputer cluster, with the goal of refactoring the R code to allow full use of parallel supercomputing resources. We will use Spark weather covariate processing and refactored PReMiuM code to examine the importance of covariate time step size in modeling (in contrast to our current summary by month). We will analyze all T3 datasets which have varieties planted in >3 field sites where the field sites have latitude-longitude coordinates and which have associated planting-harvest dates.
For Objective 2, in year 3 we will create and release additional systems models that incorporate specific training and test environments as informed by our variety trial profile regression analyses and simulations.

Impacts
What was accomplished under these goals?
Objective 1 accomplishments
1a. Develop genotype-environment crossover simulations, documentation and tutorials. We incorporated what we learnt about the hybrid-environment interactions in the G2F data (see section 1d) to create realistic simulated datasets, which are available at https://gitlab.com/EnvirotypingGroupProjectGitlab/envirotypingprojectgitlab.
1b. Add breeder-relevant output plot types to the PReMiuM package. We created examples of weather covariate graphs and violin plots using ggplot2 to process the PReMiuM analysis output; the code and instructions are available at https://envirotyping.readthedocs.io/en/master/. We also created an interactive shiny app to allow specific views of the PReMiuM results; this allows breeders and modelers to answer specific questions they may have about certain years, sites, hybrids and weather covariates in the G2F datasets. We developed an interactive web application to request supercomputing resources, submit PReMiuM envirotyping jobs, monitor job status and show analytic results, which will be useful for teaching and for researchers new to supercomputing.
1c. Implement code parallelization to improve PReMiuM run times. We profiled and optimized performance of the PReMiuM package, achieving a 60-fold speedup for the initial large test dataset. We developed a Spark (HPC) script to process G2F weather data in parallel.
1d. Analyze G2F real data sets. We cleaned the 2014, 2015 and 2016 data, fit four models (linear regression, mixed model and Bayesian tree model) to the 2014 and 2015 data, then tested our model fit on the 2016 data. We identified and repaired seed-setting and covariate specification PReMiuM code scale-up problems and then ran the Bayesian profile regressions on all three years of public G2F data with weather covariates input in monthly increments (using median, max, and min for each weather variable) using the optimized R code.
Objective 2 accomplishments
We completed the initial system models during our first project year; during this reporting period we maintained and updated the model as needed.
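The evaluation pattern described in 1d, fitting on the 2014 and 2015 data and testing on the 2016 data, can be sketched as a hold-out-year check. The Python below is only an illustration under simplifying assumptions: it fits a one-covariate ordinary least squares line rather than the mixed or Bayesian tree models named in the report, and the row keys `year`, `x`, and `yield` are hypothetical.

```python
from math import sqrt

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def holdout_rmse(rows, train_years, test_year):
    """Fit on rows from train_years, then return RMSE on rows from test_year."""
    train = [r for r in rows if r["year"] in train_years]
    test = [r for r in rows if r["year"] == test_year]
    slope, intercept = fit_line([r["x"] for r in train],
                                [r["yield"] for r in train])
    # Prediction errors on the year that was never used for fitting.
    errors = [r["yield"] - (intercept + slope * r["x"]) for r in test]
    return sqrt(sum(e * e for e in errors) / len(errors))
```

Keeping the whole test year out of the fit gives an error estimate that reflects prediction in a new season rather than fit to seasons already observed.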

Publications

Type: Conference Papers and Presentations Status: Published Year Published: 2018 Citation: Huang R, Xu W, Liverani S, Hiltbrand D, Stapleton AE (2018) A Case Study of R Performance Analysis and Optimization. Proc. Pract. Exp. Adv. Res. Comput. ACM, New York, NY, USA, pp 33:1-33:6

Progress 02/15/17 to 02/14/18

Outputs
Target Audience:
Plant breeding educators, plant breeding graduate students and plant breeding faculty have provided survey answers and tested our simulations and software tools.
Changes/Problems:
Nothing Reported
What opportunities for training and professional development has the project provided?
We have trained one undergraduate mathematics and statistics student in use of R code, in writing efficient functions and testing using updated coding practice in R, and in development of simulations using the built-in PReMiuM simulation functions. The student has also learned the biology and biological vocabulary needed to understand our use cases.
How have the results been disseminated to communities of interest?
The improved R package is publicly available and we have disseminated the InsightMaker breeding program simulation and updated R code links to interested educators and breeders via email. The PReMiuM R package had 5,786 downloads in 2017 and 6,786 downloads in 2016. We are preparing a manuscript based on our experiences and accomplishments described above for the 2018 Practice & Experience in Advanced Research Computing (PEARC) conference.
What do you plan to do during the next reporting period to accomplish the goals?
Our year 2 goals include: Addition of support for crossover interaction analysis and additional breeder-relevant output plots to the PReMiuM package. We have already added support for violin plots. Complete the development of genotype-environment crossover simulations, documentation, and tutorials, and make all materials public. Complete the development of protocols and tutorials for fast transfer and CyVerse Data Store storage of NOAA weather covariate data for use with PReMiuM on XSEDE nodes. Begin analysis of T3 and G2F datasets. Refine InsightMaker breeding simulations using G2F and T3 envirotyping results, and add additional tutorials.
Begin testing of log-linear model-fit code options, to enable selection of one or more packages that can be developed into an Agave-compatible workflow with PReMiuM covariate outputs as input to model selection code. Begin implementation of code parallelization to further improve PReMiuM run times.

Impacts
What was accomplished under these goals?
We have improved the envirotyping PReMiuM R package so that it is practical to use on real breeding datasets. This will allow breeders to better characterize environments in multi-environment trials and thus optimize their breeding programs. Our year 1 goals have been completed; specifically, we have
A. Profiled current PReMiuM R code and completed an effective code optimization that improved speed 200-fold (part of Objective 1b). We have profiled the current PReMiuM R package and identified several issues causing inefficiency and performance bottlenecks using the profvis tool. The subsequent code refactoring included changed data structures, algorithm modification and reduced resource usage. We also prepared scripts to run the existing package using large computing resources at TACC in preparation for future optimization and scaling needs.
B. Created breeding simulation using InsightMaker, with documentation and tutorials (part of Objective 2).
C. Completed initial protocol and tutorials for fast transfer and processing of NOAA weather covariate data for use with PReMiuM using Spark (part of Objective 1b and 1d). We have implemented scripts and code to download and preprocess weather data available from NOAA. The preprocessing implementation is built on the Spark programming framework and Apache Zeppelin for future scalability and interactivity requirements. It can ingest and clean raw downloaded data and transform them into a Spark DataFrame for future use.
D. Begun development of realistic, complex genotype-environment simulations, documentation and tutorials. Current test and benchmark simulations are publicly available in the R package repository.

Publications