Source: UNIVERSITY OF NORTH CAROLINA - WILMINGTON submitted to
GENETICALLY-INFORMED ENVIROTYPING TOOLS TO BETTER MATCH TEST AND TARGET ENVIRONMENTS
Sponsoring Institution
National Institute of Food and Agriculture
Project Status
NEW
Funding Source
Reporting Frequency
Annual
Accession No.
1011971
Grant No.
2017-67013-26188
Project No.
NC.W-2016-09663
Proposal No.
2016-09663
Multistate No.
(N/A)
Program Code
A1141
Project Start Date
Feb 15, 2017
Project End Date
Feb 14, 2020
Grant Year
2017
Project Director
Stapleton, A. E.
Recipient Organization
UNIVERSITY OF NORTH CAROLINA - WILMINGTON
601 SOUTH COLLEGE ROAD
WILMINGTON,NC 28403
Performing Department
Biology and Marine Biology
Non Technical Summary
Better matching of test crop growth environments to target crop production environments is key for efficient crop breeding. We propose to optimize promising new envirotyping analysis and modeling methods and develop publicly accessible known-truth genotype-environment simulations to allow improved breeding schemes for better global crop yield. In conjunction with the development of simulations, we will improve the run speed of our promising PReMiuM profile regression algorithm and develop breeder-relevant output plots and tables. We will combine PReMiuM profile regression covariate variable selection with standard linear model selection and fit methods to create a combined analysis workflow that will allow breeders to fit SNP and environment variates to their data. To illustrate these new analysis methods and inform our breeding program modeling, we will analyze real crop datasets with our improved PReMiuM and PReMiuM+model selection workflow and make spatial results maps to visualize the results in an easily interpretable field context.
To leverage better envirotyping within breeding programs, we need modeling tools that allow exploration of program design constraints. We will develop breeding simulation models that incorporate realistic environment covariate features of test and target environments and flexible, extensible specifications of genetic gain within an open-source, widely used web-accessible modeling system that supports both student training and advanced breeder modeling. Modeling tools and better envirotyping tools will support crop breeders. Breeders will be able to design optimal germplasm exchange programs for maximum genetic gain by using PReMiuM results to inform setup of test and target environments.
Animal Health Component
0%
Research Effort Categories
Basic
100%
Applied
(N/A)
Developmental
(N/A)
Classification

Knowledge Area (KA)    Subject of Investigation (SOI)    Field of Science (FOS)    Percent
201                    7310                              1081                      75%
203                    0420                              1020                      25%
Goals / Objectives
Long-term goals
We will optimize promising new envirotyping analysis and modeling methods and develop relevant, publicly accessible known-truth genotype-environment simulations to allow improved breeding schemes for better global crop yield. Our goal is to develop a flexible, extendable computational framework that allows dynamic adjustment of test-environment-to-target-production-environment relationships as new data, new climate projections, and new analysis methods are added, and to make this framework accessible to plant breeders via public cyberinfrastructure.
Objectives
Objective 1
1a) Create known-truth simulations of crossover genotype by environment interaction with realistic correlated covariate structures which combine to generate nonlinear outcome (yield) levels. These simulations and complete documentation and tutorials will be publicly available for future use by other algorithm developers and breeders.
1b) Improve the PReMiuM profile regression algorithm run speed and develop breeder-relevant output plots and tables.
1c) Combine PReMiuM profile regression covariate variable selection with standard linear model selection and parameter estimation methods to create a combined workflow that will allow breeders to fit SNP and environment variates to their data, with use of our improved simulations to guide workflow development.
1d) Analyze real crop datasets with our improved PReMiuM and PReMiuM+linear model workflow, make spatial results maps to visualize the results in an easily interpretable field context.
Objective 2
Develop breeding simulation models that incorporate realistic environment covariate features of test and target environments (with updates as information is collected from Objective 1) and flexible, extensible specifications of genetic gain within an open web-accessible modeling system (InsightMaker) that supports both student training and advanced breeder modeling of breeding program options to achieve higher yields with relevant constraints.
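The idea of a known-truth crossover simulation in Objective 1a can be illustrated with a minimal Python sketch. It is not the project's simulation code. The genotype labels, intercepts, slopes, noise level, and function names are all hypothetical, and the sketch uses a single covariate with linear responses, whereas the objective calls for correlated covariates and nonlinear yield. The point is only that the true crossover location is fixed in advance, so an analysis method can be scored against it.

```python
import random


def simulate_crossover_trial(n_plots=400, noise_sd=0.5, seed=1):
    """Simulate yields for two hypothetical genotypes along one covariate.

    Known truth (an illustrative assumption): genotype "A" gains yield as
    the covariate rises and genotype "B" loses yield, so their ranking
    crosses over at a covariate value of 0.5.
    """
    rng = random.Random(seed)
    # (intercept, slope) pairs chosen so that 8 + 4x == 12 - 4x at x = 0.5.
    truth = {"A": (8.0, 4.0), "B": (12.0, -4.0)}
    rows = []
    for _ in range(n_plots):
        genotype = rng.choice(["A", "B"])
        covariate = rng.random()  # environment covariate scaled to [0, 1)
        intercept, slope = truth[genotype]
        yield_value = intercept + slope * covariate + rng.gauss(0.0, noise_sd)
        rows.append((genotype, covariate, yield_value))
    return rows


def mean_yield(rows, genotype, low, high):
    """Mean simulated yield of one genotype for covariate values in [low, high)."""
    values = [y for g, x, y in rows if g == genotype and low <= x < high]
    return sum(values) / len(values)


rows = simulate_crossover_trial()
# Below the crossover "B" should out-yield "A"; above it the ranking flips.
print(mean_yield(rows, "B", 0.0, 0.25) > mean_yield(rows, "A", 0.0, 0.25))
print(mean_yield(rows, "A", 0.75, 1.0) > mean_yield(rows, "B", 0.75, 1.0))
```

Because the crossover point is set by construction, any method that claims to detect genotype by environment interaction can be checked against it directly.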
Project Methods
We will evaluate population and selection simulation software tools and select the most relevant for our simulation of genotype-environment interaction populations. For data-driven simulations, real genotype distributions from the G2F and T3 data will be used as a base for allocation of phenotype values to those populations. To carry out the code profiling work on the PReMiuM clustering package we will use the R package profvis. We will evaluate additional BIC packages for ease of use and scalability along with other open-source software options such as the Julia parallel sparse regression modules, C++ libraries and parallel R implementations for model fitting and model term effect estimation. Our simulated data will be used for benchmarking in addition to our use of standard code function profiling and software requirements analyses, such as documentation completeness, community size and activity, and support quality. We will test and implement SNP shrinkage if reduced SNP sets are needed for efficient computation. Workflow management options that are suitable for use on the XSEDE high-performance compute architecture, such as makeflow and Pegasus, will be evaluated and the most appropriate package selected for our covariate profiling and model selection workflow.
We will model existing plant breeding program simulation packages and create UML diagrams of the functions, then instantiate these 'rules' into InsightMaker equation syntax. The key information to track across breeding cycles is allele frequency as a proportion of the maximum genetic gain, so the equations will be formed using those variables. A web-friendly interface with stock and flow diagrams is automatically created and visualizations can easily be plotted from simulation runs.
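As a rough illustration of tracking allele frequency across breeding cycles in a stock-and-flow style, the following Python sketch treats the favorable allele frequency as the stock and a per-cycle selection response as the flow. It is not InsightMaker equation syntax and not the project's model. The update rule `s * p * (1 - p)`, the parameter values, and the function name are assumptions made for illustration. Under a single additive locus, the frequency can be read as the proportion of maximum genetic gain achieved.

```python
def simulate_breeding_cycles(initial_freq=0.1, selection_strength=0.5, n_cycles=20):
    """Track a favorable allele frequency (the stock) over breeding cycles.

    The per-cycle flow is selection_strength * p * (1 - p), a simple
    hypothetical selection response that slows as the allele nears fixation.
    Returns the list of frequencies, starting with the initial value.
    """
    if not 0.0 < initial_freq < 1.0:
        raise ValueError("initial_freq must be strictly between 0 and 1")
    if not 0.0 < selection_strength <= 1.0:
        raise ValueError("selection_strength must be in (0, 1]")
    freq = initial_freq
    history = [freq]
    for _ in range(n_cycles):
        flow = selection_strength * freq * (1.0 - freq)  # change this cycle
        freq += flow  # update the stock
        history.append(freq)
    return history


history = simulate_breeding_cycles()
for cycle in (0, 5, 10, 20):
    print(f"cycle {cycle:2d}: proportion of maximum gain = {history[cycle]:.3f}")
```

A discrete update was chosen here because breeding cycles are discrete events. A stock-and-flow tool would express the same relationship as one stock, one flow, and one parameter.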

Progress 02/15/18 to 02/14/19

Outputs
Target Audience: Plant breeders and modelers were surveyed about their teaching and research needs. All software and tutorials are publicly available.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided?
TACC hosted a one-week discussion and training session for research assistants on the envirotyping project from the University of North Carolina Wilmington and Queen Mary University of London.
How have the results been disseminated to communities of interest?
The process of code improvement and the results showing substantial speed-up were presented at the PEARC18 conference, July 22-26, 2018. PD Stapleton presented a poster at the Gordon Research Conference on Quantitative Genetics and Genomics, Feb 9-15, 2019; the poster title was "GENETICALLY-INFORMED ENVIROTYPING TOOLS TO BETTER MATCH TEST AND TARGET ENVIRONMENT". Our most current tutorials and datasets are always available at https://envirotyping.readthedocs.io/en/master/.
What do you plan to do during the next reporting period to accomplish the goals?
We will finish adding realistic simulations with true-positive/false-positive graphics to the readthedocs tutorial and code documentation. We are currently testing the feasibility of using the scalable DPMM algorithms from 'Scalable Estimation of Dirichlet Process Mixture Models on Distributed Data' (Wang & Lin 2017) on the Stampede2 supercomputer cluster, with the goal of refactoring the R code to allow full use of parallel supercomputing resources. We will use Spark weather covariate processing and refactored PReMiuM code to examine the importance of covariate time step size in modeling (in contrast to our current summary by month). We will analyze all T3 datasets which have varieties planted in >3 field sites where the field sites have latitude-longitude coordinates and which have associated planting-harvest dates.
For Objective 2, in year 3 we will create and release additional systems models that incorporate specific training and test environments as informed by our variety trial profile regression analyses and simulations.

Impacts
What was accomplished under these goals?
Objective 1 accomplishments
1a. Develop genotype-environment crossover simulations, documentation and tutorials. We incorporated what we learned about the hybrid-environment interactions in the G2F data (see section 1d) to create realistic simulated datasets, which are available at https://gitlab.com/EnvirotypingGroupProjectGitlab/envirotypingprojectgitlab.
1b. Add breeder-relevant output plot types to the PReMiuM package. We created examples of weather covariate graphs and violin plots using ggplot2 to process the PReMiuM analysis output; the code and instructions are available at https://envirotyping.readthedocs.io/en/master/. We also created an interactive Shiny app to allow specific views of the PReMiuM results; this allows breeders and modelers to answer specific questions they may have about certain years, sites, hybrids and weather covariates in the G2F datasets. We developed an interactive web application to request supercomputing resources, submit PReMiuM envirotyping jobs, monitor job status and show analytic results, which will be useful for teaching and for researchers new to supercomputing.
1c. Implement code parallelization to improve PReMiuM run times. We profiled and optimized performance of the PReMiuM package, achieving a 60-fold speedup for the initial large test dataset. We developed a Spark (HPC) script to process G2F weather data in parallel.
1d. Analyze real G2F datasets. We cleaned the 2014, 2015 and 2016 data, fit four models (linear regression, mixed model and Bayesian tree model) to the 2014 and 2015 data, then tested our model fit on the 2016 data. We identified and repaired seed-setting and covariate specification PReMiuM code scale-up problems and then ran the Bayesian profile regressions on all three years of public G2F data with weather covariates input in monthly increments (using median, max, and min for each weather variable) using the optimized R code.
Objective 2 accomplishments We completed the initial system models during our first project year; during this reporting period we maintained and updated the model as needed.
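The monthly covariate summaries mentioned under item 1d (median, max, and min for each weather variable) can be sketched in a few lines of plain Python. This is an illustrative stand-in, not the project's Spark script. The record layout of `(ISO date string, value)` pairs, the sample values, and the function name are assumptions.

```python
from collections import defaultdict
from statistics import median


def summarize_by_month(records):
    """Summarize daily values of one weather variable by calendar month.

    records: iterable of (iso_date, value) pairs such as ("2015-06-14", 27.5).
    Returns a dict keyed by "YYYY-MM" with the median, max, and min per month.
    """
    by_month = defaultdict(list)
    for iso_date, value in records:
        by_month[iso_date[:7]].append(value)  # "YYYY-MM" prefix of the date
    return {
        month: {"median": median(values), "max": max(values), "min": min(values)}
        for month, values in sorted(by_month.items())
    }


# Hypothetical daily maximum temperatures (degrees C) for one field site.
daily = [
    ("2015-06-01", 20.0),
    ("2015-06-02", 30.0),
    ("2015-06-03", 25.0),
    ("2015-07-01", 31.0),
]
print(summarize_by_month(daily))
```

Each monthly statistic would become one covariate column for a site-year, which is the granularity the time-step comparison planned for the next period would vary.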

Publications

  • Type: Conference Papers and Presentations Status: Published Year Published: 2018 Citation: Huang R, Xu W, Liverani S, Hiltbrand D, Stapleton AE (2018) A Case Study of R Performance Analysis and Optimization. Proc. Pract. Exp. Adv. Res. Comput. ACM, New York, NY, USA, pp 33:1-33:6


Progress 02/15/17 to 02/14/18

Outputs
Target Audience: Plant breeding educators, plant breeding graduate students and plant breeding faculty have provided survey answers and tested our simulations and software tools.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided?
We have trained one undergraduate mathematics and statistics student in the use of R code, in writing efficient functions and tests using updated coding practice in R, and in the development of simulations using the built-in PReMiuM simulation functions. The student has also learned the biology and biological vocabulary needed to understand our use cases.
How have the results been disseminated to communities of interest?
The improved R package is publicly available and we have disseminated the InsightMaker breeding program simulation and updated R code links to interested educators and breeders via email. The PReMiuM R package had 5786 downloads in 2017 and 6786 downloads in 2016. We are preparing a manuscript based on our experiences and accomplishments described above for the 2018 Practice & Experience in Advanced Research Computing (PEARC) conference.
What do you plan to do during the next reporting period to accomplish the goals?
Our year 2 goals include:
Add support for crossover interaction analysis and additional breeder-relevant output plots to the PReMiuM package. We have already added support for violin plots.
Complete the development of genotype-environment crossover simulations, documentation, and tutorials, and make all materials public.
Complete the development of protocols and tutorials for fast transfer and CyVerse Data Store storage of NOAA weather covariate data for use with PReMiuM on XSEDE nodes.
Begin analysis of T3 and G2F datasets.
Refine InsightMaker breeding simulations using G2F and T3 envirotyping results, and add additional tutorials.
Begin testing of log-linear model-fit code options, to enable selection of one or more packages that can be developed into an Agave-compatible workflow with PReMiuM covariate outputs as input to model selection code.
Begin implementation of code parallelization to further improve PReMiuM run times.

Impacts
What was accomplished under these goals?
We have improved the envirotyping PReMiuM R package so that it is practical to use on real breeding datasets. This will allow breeders to better characterize environments in multi-environment trials and thus optimize their breeding programs. Our year 1 goals have been completed; specifically, we have:
A. Profiled the current PReMiuM R code and completed an effective code optimization that improved speed 200-fold (part of Objective 1b). Using the profvis tool, we profiled the current PReMiuM R package and identified several issues causing inefficiency and performance bottlenecks. The subsequent code refactoring included changing data structures, modifying algorithms and reducing resource usage. We also prepared scripts to run the existing package using large computing resources at TACC in preparation for future optimization and scaling needs.
B. Created a breeding simulation using InsightMaker, with documentation and tutorials (part of Objective 2).
C. Completed an initial protocol and tutorials for fast transfer and processing of NOAA weather covariate data for use with PReMiuM using Spark (part of Objective 1b and 1d). We have implemented scripts and code to download and preprocess weather data available from NOAA. The preprocessing implementation is built on the Spark programming framework and Apache Zeppelin to meet future scalability and interactivity requirements. It can ingest and clean raw downloaded data and transform it into a Spark DataFrame for future use.
D. Begun development of realistic, complex genotype-environment simulations, documentation and tutorials. Current test and benchmark simulations are publicly available in the R package repository.

Publications