Progress 02/15/17 to 02/14/20
Outputs Target Audience: Plant breeding educators, plant breeding graduate students and plant breeding faculty have provided survey answers and tested our breeding program simulations. Foundations, public breeding programs, large agricultural companies and agricultural data providers have expressed interest in the PReMiuM profile regression analysis method, and we have assisted them in implementing and testing the method.
Changes/Problems: Linear model code was not needed, because the computational improvements that we implemented in the PReMiuM package made it possible to analyze realistic-size field data sets. In addition to making our code and tutorials available, we implemented a web form with simple drop-down menus and viewers for the code to make analysis on HPC easier for new users. This web-based PReMiuM profile regression analysis is available from https://idols.tacc.utexas.edu/. The T3 data did not have adequate information on planting dates to enable matching to weather data, so we were unable to use those data sets as a profile regression example. We instead focused on the G2F data sets across multiple years, from 2014 to 2016, for our example real data analyses.
What opportunities for training and professional development has the project provided? We have trained two mathematics and statistics students in use of R code, in writing efficient functions and in testing using updated coding practice in R. The students also learned the biology and biological vocabulary needed to understand our use cases. One statistics graduate student learned advanced visualization methods (R Shiny and ggplot2). One statistics PhD graduate student learned the needed Bayesian theory and PReMiuM function details along with advanced R scripting for generation of simulation/benchmarking data sets.
Two students visited a large agricultural company that had implemented the PReMiuM method, to learn more about industry careers in data analysis and about industry practices in code development and project management. We trained one research assistant on sharing tools for collaborative work and on statistical methods for simulations and their implementation in R. In 2017, TACC hosted a one-week work session and training for research assistants on the envirotyping project from the University of North Carolina Wilmington and Queen Mary University of London. In 2019, a TACC scientist learned the new statistical programming language Julia and then implemented and tested a newly published Julia package for Dirichlet Process Mixture Model estimation on HPC.
How have the results been disseminated to communities of interest? The improved R package is publicly available and we have disseminated the Insightmaker breeding program simulation and updated R code links to interested educators and breeders via email. The PReMiuM R package had 10,715 downloads in 2019, 9,384 downloads in 2018, 5,786 downloads in 2017 and 6,786 downloads in 2016. For 2020, downloads from January 1 to February 23 were 2,101. From 2017 to 2019, downloads from RStudio almost doubled. The process of code improvement and the results showing substantial speed-up were presented at the PEARC18 conference, July 22-26, 2018. PD Stapleton presented a poster at the Gordon Research Conference on Quantitative Genetics and Genomics, Feb 9-15, 2019; the poster title was "GENETICALLY-INFORMED ENVIROTYPING TOOLS TO BETTER MATCH TEST AND TARGET ENVIRONMENT". This presentation led to multiple interactions with large agricultural company statisticians who have implemented the method for their data. Our most current tutorials and datasets are always available at https://envirotyping.readthedocs.io/en/master/. This web resource has been viewed by 251 different people, primarily from the US, in the past 1.5 years.
PD Stapleton presented a poster at the National Association of Plant Breeding conference, August 25-29, 2019. The poster title was "NIFA: Genetically-informed Envirotyping Tools to Better Match Test and Target Environments". PD Stapleton presented the PReMiuM method to a Gates Foundation program officer and to the Excellence in Breeding program in Spring 2020. PD Stapleton presented the method to the aWhere team in Spring 2020.
What do you plan to do during the next reporting period to accomplish the goals?
Nothing Reported
Impacts What was accomplished under these goals?
We have improved the envirotyping PReMiuM R package so that it is practical to use on real breeding datasets. This will allow breeders to better characterize environments in multi-environment trials and thus optimize their breeding programs. We profiled the PReMiuM R package using the profvis tool and identified several issues causing inefficiency and performance bottlenecks. The subsequent code refactoring included changes to data structures, algorithm modifications and reduced resource usage. The code optimization improved speed 200-fold. Scripts to run the existing package using large computing resources at TACC were developed and tutorials on HPC use were created. We created breeder-relevant output plot types to improve visualization and understanding of the PReMiuM package output, which is very rich and extensive. Specifically, we created examples of weather covariate graphs and violin plots using ggplot2 to process the PReMiuM analysis output; the code and instructions are available at https://envirotyping.readthedocs.io/en/master/. We also created an interactive Shiny app to allow specific views of the PReMiuM results; this allows breeders and modelers to answer specific questions they may have about certain years, sites, hybrids and weather covariates in the G2F datasets. The tutorial and example spatial map view of the G2F dataset are available at https://envirotyping.readthedocs.io/en/master/workflow/human_readable/. The map titled 'Location of Select Hybrids by Post-Hoc Group' gives a simple view of groups of similar environments for groups of hybrids, which is the key information for selection of test environments for specific production environments. We note that geographic distance is not a useful indicator of a good test environment in this example G2F data set. We cleaned the 2014, 2015 and 2016 data, fit four models (linear regression, mixed model and Bayesian tree model) to the 2014 and 2015 data, then tested our model fit on the 2016 data.
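The year-based evaluation described above (fit on the 2014 and 2015 data, then test on the 2016 data) can be illustrated with a minimal sketch. This is not the project's code, which was written in R; it is a plain Python illustration in which the records, the single covariate and the function names (`split_by_year`, `fit_line`, `rmse`) are all hypothetical, and a one-covariate least-squares line stands in for the report's models.

```python
import math

# Hypothetical records: (year, covariate value, yield). Values are invented
# for illustration only and are not taken from the G2F data.
records = [
    (2014, 1.0, 2.1), (2014, 2.0, 3.9), (2014, 3.0, 6.2),
    (2015, 1.5, 3.0), (2015, 2.5, 5.1), (2015, 3.5, 6.8),
    (2016, 2.0, 4.2), (2016, 3.0, 5.9),
]


def split_by_year(rows, train_years, test_years):
    """Split rows into training and held-out sets using the year field."""
    train = [r for r in rows if r[0] in train_years]
    held_out = [r for r in rows if r[0] in test_years]
    return train, held_out


def fit_line(rows):
    """Ordinary least squares for yield = intercept + slope * covariate."""
    n = len(rows)
    mean_x = sum(r[1] for r in rows) / n
    mean_y = sum(r[2] for r in rows) / n
    sxx = sum((r[1] - mean_x) ** 2 for r in rows)
    sxy = sum((r[1] - mean_x) * (r[2] - mean_y) for r in rows)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def rmse(rows, intercept, slope):
    """Root mean squared prediction error on the given rows."""
    squared = [(r[2] - (intercept + slope * r[1])) ** 2 for r in rows]
    return math.sqrt(sum(squared) / len(squared))


# Fit on 2014 and 2015 only, then score on the untouched 2016 rows.
train_rows, test_rows = split_by_year(records, {2014, 2015}, {2016})
intercept, slope = fit_line(train_rows)
holdout_rmse = rmse(test_rows, intercept, slope)
print(f"intercept={intercept:.3f} slope={slope:.3f} 2016 RMSE={holdout_rmse:.3f}")
```

Holding out a whole year, rather than a random subset of rows, keeps the test data from sharing a growing season with the training data.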
We incorporated what we learnt about the hybrid-environment interactions from this G2F data analysis into a method and scripts to create realistic simulated datasets, which are available at https://gitlab.com/EnvirotypingGroupProjectGitlab/envirotypingprojectgitlab. We identified and repaired seed-setting and covariate specification PReMiuM code scale-up problems and then ran the PReMiuM profile regressions on all three years of public G2F data with weather covariates input in monthly increments (using median, max, and min for each weather variable) using the optimized R code. These results are available in our git repository, and explanations and instructions on how to run the analyses are available in our envirotyping readthedocs site. We have investigated and evaluated several other libraries with parallel implementations for fitting Dirichlet Process Mixture Models (DPMM). These libraries provide alternatives to the PReMiuM package and may be integrated with the current workflow for further performance improvements. We completed initial protocols and tutorials for fast transfer and processing of NOAA weather covariate data for use with PReMiuM using Spark. We have implemented scripts and code to download and preprocess weather data available from NOAA; the scripts are available at https://github.com/TACC/EnviroTyping. The preprocessing implementation is built on the Spark programming framework and Apache Zeppelin for future scalability and interactivity requirements. It can ingest and clean raw downloaded data and transform them into a Spark DataFrame for future use. We developed an interactive web application to request supercomputing resources, submit PReMiuM envirotyping jobs, monitor job status and show analytic results, which will be useful for teaching and for researchers new to supercomputing. More details and instructions are available from idols.tacc.utexas.edu and https://envirotyping.readthedocs.io/.
We created breeding simulations using the Insightmaker web application, for learning about plant breeding, with documentation and tutorials.
Publications
- Type:
Journal Articles
Status:
Published
Year Published:
2018
Citation:
Lotterhos KE, Moore JH, Stapleton AE (2018) Analysis validation has been neglected in the Age of Reproducibility. PLOS Biol 16: e3000070 (7,076 views since publication)
- Type:
Book Chapters
Status:
Awaiting Publication
Year Published:
2019
Citation:
ON TO THE NEXT CHAPTER FOR CROP BREEDING: CONVERGENCE WITH DATA SCIENCE Elhan S. Ersoz; Nicolas F Martin; Ann Stapleton doi: 10.2135/cropsci2019.03.0149; Date posted: October 05, 2019 (Crop Science, in press)
Preprint at Ersoz, E.S.; Martin, N.F.; Stapleton, A.E. On to the Next Chapter for Crop Breeding: Convergence with Data Science. Preprints 2019, 2019030115 (doi: 10.20944/preprints201903.0115.v1).
First look early access version available via https://dl.sciencesocieties.org/publications/cs/first-look.
Progress 02/15/18 to 02/14/19
Outputs Target Audience: Plant breeders and modelers were surveyed about their teaching and research needs. All software and tutorials are publicly available.
Changes/Problems:
Nothing Reported
What opportunities for training and professional development has the project provided? TACC hosted a one-week discussion and training for research assistants on the envirotyping project from the University of North Carolina Wilmington and Queen Mary University of London.
How have the results been disseminated to communities of interest? The process of code improvement and the results showing substantial speed-up were presented at the PEARC18 conference, July 22-26, 2018. PD Stapleton presented a poster at the Gordon Research Conference on Quantitative Genetics and Genomics, Feb 9-15, 2019; the poster title was "GENETICALLY-INFORMED ENVIROTYPING TOOLS TO BETTER MATCH TEST AND TARGET ENVIRONMENT". Our most current tutorials and datasets are always available at https://envirotyping.readthedocs.io/en/master/.
What do you plan to do during the next reporting period to accomplish the goals? Finish adding realistic simulations with true-positive/false-positive graphics to the readthedocs tutorial and code documentation. We are currently testing the feasibility of using scalable DPMM algorithms from 'Scalable Estimation of Dirichlet Process Mixture Models on Distributed Data' (Wang & Lin 2017) on the Stampede2 supercomputer cluster, with the goal of refactoring the R code to allow full use of parallel supercomputing resources. We will use Spark weather covariate processing and refactored PReMiuM code to examine the importance of covariate time step size in modeling (in contrast to our current summary by month). We will analyze all T3 datasets that have varieties planted in more than three field sites, where the field sites have latitude-longitude coordinates and associated planting-harvest dates. For Objective 2, in year 3 we will create and release additional systems models that incorporate specific training and test environments as informed by our variety trial profile regression analyses and simulations.
Impacts What was accomplished under these goals?
Objective 1 accomplishments
1a. Develop genotype-environment crossover simulations, documentation and tutorials. We incorporated what we learnt about the hybrid-environment interactions in the G2F data (see section 1d) to create realistic simulated datasets, which are available at https://gitlab.com/EnvirotypingGroupProjectGitlab/envirotypingprojectgitlab.
1b. Add breeder-relevant output plot types to the PReMiuM package. We created examples of weather covariate graphs and violin plots using ggplot2 to process the PReMiuM analysis output; the code and instructions are available at https://envirotyping.readthedocs.io/en/master/. We also created an interactive Shiny app to allow specific views of the PReMiuM results; this allows breeders and modelers to answer specific questions they may have about certain years, sites, hybrids and weather covariates in the G2F datasets. We developed an interactive web application to request supercomputing resources, submit PReMiuM envirotyping jobs, monitor job status and show analytic results, which will be useful for teaching and for researchers new to supercomputing.
1c. Implement code parallelization to improve PReMiuM run times. We profiled and optimized performance of the PReMiuM package, achieving a 60-fold speedup for the initial large test dataset. We developed a Spark (HPC) script to process G2F weather data in parallel.
1d. Analyze G2F real data sets. We cleaned the 2014, 2015 and 2016 data, fit four models (linear regression, mixed model and Bayesian tree model) to the 2014 and 2015 data, then tested our model fit on the 2016 data. We identified and repaired seed-setting and covariate specification PReMiuM code scale-up problems and then ran the Bayesian profile regressions on all three years of public G2F data with weather covariates input in monthly increments (using median, max, and min for each weather variable) using the optimized R code.
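The monthly weather covariate summaries mentioned under 1d (median, max, and min for each weather variable) can be sketched as follows. This is an illustrative Python sketch using only the standard library, not the project's Spark or R code; the observation layout, the variable names (`temp_c`, `rain_mm`) and the function `monthly_summaries` are assumptions made for the example, and the values are invented.

```python
from collections import defaultdict
from statistics import median

# Hypothetical daily observations: (ISO date, weather variable, value).
# The layout and values are assumptions, not the NOAA or G2F format.
daily = [
    ("2014-05-01", "temp_c", 18.0),
    ("2014-05-02", "temp_c", 21.0),
    ("2014-05-03", "temp_c", 24.0),
    ("2014-06-01", "temp_c", 25.0),
    ("2014-06-02", "temp_c", 29.0),
    ("2014-05-01", "rain_mm", 0.0),
    ("2014-05-02", "rain_mm", 12.0),
]


def monthly_summaries(observations):
    """Group daily values by (month, variable) and summarize each group."""
    groups = defaultdict(list)
    for date, variable, value in observations:
        month = date[:7]  # "YYYY-MM" prefix of the ISO date
        groups[(month, variable)].append(value)
    return {
        key: {"median": median(values), "max": max(values), "min": min(values)}
        for key, values in groups.items()
    }


summaries = monthly_summaries(daily)
for (month, variable), stats in sorted(summaries.items()):
    print(month, variable, stats)
```

Each (month, variable) pair yields three covariates, so the number of model inputs grows with both the number of weather variables and the number of months covered.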
Objective 2 accomplishments We completed the initial system models during our first project year; during this reporting period we maintained and updated the model as needed.
Publications
- Type:
Conference Papers and Presentations
Status:
Published
Year Published:
2018
Citation:
Huang R, Xu W, Liverani S, Hiltbrand D, Stapleton AE (2018) A Case Study of R Performance Analysis and Optimization. Proc. Pract. Exp. Adv. Res. Comput. ACM, New York, NY, USA, pp 33:1-33:6
Progress 02/15/17 to 02/14/18
Outputs Target Audience: Plant breeding educators, plant breeding graduate students and plant breeding faculty have provided survey answers and tested our simulations and software tools.
Changes/Problems:
Nothing Reported
What opportunities for training and professional development has the project provided? We have trained one undergraduate mathematics and statistics student in use of R code, in writing efficient functions and testing using updated coding practice in R, and in development of simulations using the built-in PReMiuM simulation functions. The student has also learned the biology and biological vocabulary needed to understand our use cases.
How have the results been disseminated to communities of interest? The improved R package is publicly available and we have disseminated the Insightmaker breeding program simulation and updated R code links to interested educators and breeders via email. The PReMiuM R package had 5,786 downloads in 2017 and 6,786 downloads in 2016. We are preparing a manuscript based on our experiences and accomplishments described above for the 2018 Practice & Experience in Advanced Research Computing (PEARC) conference.
What do you plan to do during the next reporting period to accomplish the goals? Our year 2 goals include:
- Add support for crossover interaction analysis and additional breeder-relevant output plots to the PReMiuM package. We have already added support for violin plots.
- Complete the development of genotype-environment crossover simulations, documentation, and tutorials, and make all materials public.
- Complete the development of protocols and tutorials for fast transfer and CyVerse Data Store storage of NOAA weather covariate data for use with PReMiuM on XSEDE nodes.
- Begin analysis of T3 and G2F datasets.
- Refine InsightMaker breeding simulations using G2F and T3 envirotyping results, and add additional tutorials.
- Begin testing of log-linear model-fit code options, to enable selection of one or more packages that can be developed into an Agave-compatible workflow with PReMiuM covariate outputs as input to model selection code.
- Begin implementation of code parallelization to further improve PReMiuM run times.
Impacts What was accomplished under these goals?
We have improved the envirotyping PReMiuM R package so that it is practical to use on real breeding datasets. This will allow breeders to better characterize environments in multi-environment trials and thus optimize their breeding programs. Our year 1 goals have been completed; specifically, we have
A. Profiled current PReMiuM R code and completed an effective code optimization that improved speed 200-fold (part of Objective 1b). Using the profvis tool, we profiled the current PReMiuM R package and identified several issues causing inefficiency and performance bottlenecks. The subsequent code refactoring included changes to data structures, algorithm modifications and reduced resource usage. We also prepared scripts to run the existing package using large computing resources at TACC in preparation for future optimization and scaling needs.
B. Created breeding simulation using InsightMaker, with documentation and tutorials (part of Objective 2).
C. Completed initial protocol and tutorials for fast transfer and processing of NOAA weather covariate data for use with PReMiuM using Spark (part of Objectives 1b and 1d). We have implemented scripts and code to download and preprocess weather data available from NOAA. The preprocessing implementation is built on the Spark programming framework and Apache Zeppelin for future scalability and interactivity requirements. It can ingest and clean raw downloaded data and transform them into a Spark DataFrame for future use.
D. Begun development of realistic, complex genotype-environment simulations, documentation and tutorials. Current test and benchmark simulations are publicly available in the R package repository.
Publications