Progress 02/15/18 to 02/14/19
Target Audience: Plant breeders and modelers were surveyed about their teaching and research needs. All software and tutorials are publicly available.
Changes/Problems:
What opportunities for training and professional development has the project provided?
TACC hosted a one-week discussion and training session for research assistants on the envirotyping project from the University of North Carolina Wilmington and Queen Mary University of London.
How have the results been disseminated to communities of interest?
The code-improvement process and the results showing a substantial speed-up were presented at the PEARC18 conference, July 22-26, 2018. PD Stapleton presented a poster at the Gordon Research Conference on Quantitative Genetics and Genomics, February 9-15, 2019; the poster title was "GENETICALLY-INFORMED ENVIROTYPING TOOLS TO BETTER MATCH TEST AND TARGET ENVIRONMENT". Our most current tutorials and datasets are always available at https://envirotyping.readthedocs.io/en/master/.
What do you plan to do during the next reporting period to accomplish the goals?
We will finish adding realistic simulations with true-positive/false-positive graphics to the readthedocs tutorial and code documentation. We are currently testing the feasibility of using the scalable DPMM algorithms from 'Scalable Estimation of Dirichlet Process Mixture Models on Distributed Data' (Wang & Lin 2017) on the Stampede2 supercomputer cluster, with the goal of refactoring the R code to allow full use of parallel supercomputing resources. We will use Spark weather covariate processing and the refactored PReMiuM code to examine the importance of covariate time-step size in modeling (in contrast to our current summary by month). We will analyze all T3 datasets in which varieties were planted in >3 field sites, the field sites have latitude-longitude coordinates, and planting and harvest dates are available. For Objective 2, in year 3 we will create and release additional systems models that incorporate specific training and test environments as informed by our variety trial profile regression analyses and simulations.
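The T3 dataset selection criteria stated above (varieties planted in >3 field sites, with site coordinates and planting-harvest dates) can be sketched as a simple filter. This is a minimal illustration in Python, not the project's code: the record layout, the field names (`variety`, `site`, `lat`, `lon`, `planted`, `harvested`) and the helper names are all hypothetical and do not reflect the actual T3 schema.

```python
from collections import defaultdict

REQUIRED_FIELDS = ("lat", "lon", "planted", "harvested")  # hypothetical field names


def make_record(variety, site, lat=40.0, lon=-88.0,
                planted="2016-05-01", harvested="2016-10-01"):
    """Build one illustrative trial record (made-up values, not T3 data)."""
    return {"variety": variety, "site": site, "lat": lat, "lon": lon,
            "planted": planted, "harvested": harvested}


def has_required_metadata(record):
    """True when coordinates and planting/harvest dates are all present."""
    return all(record.get(field) is not None for field in REQUIRED_FIELDS)


def eligible_varieties(records, min_sites=3):
    """Return varieties grown at more than `min_sites` distinct usable sites."""
    sites_by_variety = defaultdict(set)
    for record in records:
        if has_required_metadata(record):
            sites_by_variety[record["variety"]].add(record["site"])
    return sorted(v for v, sites in sites_by_variety.items()
                  if len(sites) > min_sites)


# Variety A: four complete sites. Variety B: four sites, one lacking a latitude.
# Variety C: only two sites.
trials = [make_record("A", s) for s in ("S1", "S2", "S3", "S4")]
trials += [make_record("B", s) for s in ("S1", "S2", "S3")]
trials += [make_record("B", "S4", lat=None)]
trials += [make_record("C", s) for s in ("S1", "S2")]

print(eligible_varieties(trials))
```

In this toy input only variety A passes, because variety B has just three sites with complete metadata once the record without a latitude is excluded.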
What was accomplished under these goals?
Objective 1 accomplishments
1a. Develop genotype-environment crossover simulations, documentation and tutorials. We incorporated what we learnt about the hybrid-environment interactions in the G2F data (see section 1d) to create realistic simulated datasets, which are available at https://gitlab.com/EnvirotypingGroupProjectGitlab/envirotypingprojectgitlab.
1b. Add breeder-relevant output plot types to the PReMiuM package. We created examples of weather covariate graphs and violin plots using ggplot2 to process the PReMiuM analysis output; the code and instructions are available at https://envirotyping.readthedocs.io/en/master/. We also created an interactive shiny app that provides specific views of the PReMiuM results; this allows breeders and modelers to answer questions they may have about particular years, sites, hybrids and weather covariates in the G2F datasets. We developed an interactive web application to request supercomputing resources, submit PReMiuM envirotyping jobs, monitor job status and show analytic results, which will be useful for teaching and for researchers new to supercomputing.
1c. Implement code parallelization to improve PReMiuM run times. We profiled and optimized the performance of the PReMiuM package, achieving a 60-fold speedup for the initial large test dataset. We developed a Spark (HPC) script to process G2F weather data in parallel.
1d. Analyze G2F real datasets. We cleaned the 2014, 2015 and 2016 data, fit four models (including linear regression, a mixed model and a Bayesian tree model) to the 2014 and 2015 data, then tested the model fit on the 2016 data. We identified and repaired PReMiuM code scale-up problems in seed setting and covariate specification, and then ran the Bayesian profile regressions on all three years of public G2F data, with weather covariates input in monthly increments (using the median, max, and min for each weather variable), using the optimized R code.
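The monthly covariate summaries mentioned in 1d (median, max, and min for each weather variable) can be illustrated with a short sketch. It is written in plain Python rather than the project's Spark and R code, and the record layout, site name, variable name and values are hypothetical, chosen only to show the grouping step.

```python
from collections import defaultdict
from datetime import date
from statistics import median

# Hypothetical daily records: (site, date, variable, value). Not G2F data.
daily = [
    ("SiteA", date(2016, 5, 1), "temp_c", 14.0),
    ("SiteA", date(2016, 5, 2), "temp_c", 18.0),
    ("SiteA", date(2016, 5, 3), "temp_c", 16.0),
    ("SiteA", date(2016, 6, 1), "temp_c", 22.0),
    ("SiteA", date(2016, 6, 2), "temp_c", 26.0),
]


def monthly_summary(records):
    """Group daily values by (site, year, month, variable) and summarize each group."""
    groups = defaultdict(list)
    for site, day, variable, value in records:
        groups[(site, day.year, day.month, variable)].append(value)
    return {
        key: {"median": median(values), "max": max(values), "min": min(values)}
        for key, values in groups.items()
    }


summary = monthly_summary(daily)
for key in sorted(summary):
    print(key, summary[key])
```

Changing the grouping key (for example, to a week number instead of the month) is one way to explore a different covariate time step with the same structure.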
Objective 2 accomplishments
We completed the initial system models during our first project year; during this reporting period we maintained and updated the models as needed.
Conference Papers and Presentations
Huang R, Xu W, Liverani S, Hiltbrand D, Stapleton AE (2018) A Case Study of R Performance Analysis and Optimization. Proc. Pract. Exp. Adv. Res. Comput. ACM, New York, NY, USA, pp 33:1-33:6.
Progress 02/15/17 to 02/14/18
Target Audience: Plant breeding educators, plant breeding graduate students and plant breeding faculty have provided survey answers and tested our simulations and software tools.
Changes/Problems:
What opportunities for training and professional development has the project provided?
We have trained one undergraduate mathematics and statistics student in the use of R code, in writing efficient functions and tests using updated R coding practices, and in the development of simulations using the built-in PReMiuM simulation functions. The student has also learned the biology and biological vocabulary needed to understand our use cases.
How have the results been disseminated to communities of interest?
The improved R package is publicly available, and we have disseminated links to the InsightMaker breeding program simulation and the updated R code to interested educators and breeders via email. The PReMiuM R package had 5786 downloads in 2017 and 6786 downloads in 2016. We are preparing a manuscript based on the experiences and accomplishments described in this report for the 2018 Practice & Experience in Advanced Research Computing (PEARC) conference.
What do you plan to do during the next reporting period to accomplish the goals?
Our year 2 goals are as follows. Add support for crossover interaction analysis and additional breeder-relevant output plots to the PReMiuM package (we have already added support for violin plots). Complete the development of genotype-environment crossover simulations, documentation, and tutorials, and make all materials public. Complete the development of protocols and tutorials for fast transfer and CyVerse Data Store storage of NOAA weather covariate data for use with PReMiuM on XSEDE nodes. Begin analysis of T3 and G2F datasets. Refine the InsightMaker breeding simulations using G2F and T3 envirotyping results, and add additional tutorials. Begin testing log-linear model-fit code options, to enable selection of one or more packages that can be developed into an Agave-compatible workflow with PReMiuM covariate outputs as input to model selection code. Begin implementation of code parallelization to further improve PReMiuM run times.
What was accomplished under these goals?
We have improved the envirotyping PReMiuM R package so that it is practical to use on real breeding datasets. This will allow breeders to better characterize environments in multi-environment trials and thus optimize their breeding programs. Our year 1 goals have been completed; specifically, we have:
A. Profiled the current PReMiuM R code and completed an effective code optimization that improved speed 200-fold (part of Objective 1b). We profiled the current PReMiuM R package with the profvis tool and identified several issues causing inefficiency and performance bottlenecks. The subsequent code refactoring included changes to data structures, algorithm modifications and reduced resource usage. We also prepared scripts to run the existing package on large computing resources at TACC in preparation for future optimization and scaling needs.
B. Created a breeding simulation using InsightMaker, with documentation and tutorials (part of Objective 2).
C. Completed an initial protocol and tutorials for fast transfer and processing of NOAA weather covariate data for use with PReMiuM using Spark (part of Objectives 1b and 1d). We have implemented scripts and code to download and preprocess weather data available from NOAA. The preprocessing implementation is built on the Spark programming framework and Apache Zeppelin to meet future scalability and interactivity requirements. It can ingest and clean raw downloaded data and transform them into Spark DataFrames for future use.
D. Begun development of realistic, complex genotype-environment simulations, documentation and tutorials. Current test and benchmark simulations are publicly available in the R package repository.
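The profile-then-refactor workflow described in item A can be illustrated with a small sketch. Note the substitution: the project profiled R code with profvis, whereas this example uses Python's standard-library `cProfile` on a toy function. The functions `slow_unique` and `fast_unique` are hypothetical stand-ins for a data-structure change and are not taken from PReMiuM.

```python
import cProfile
import io
import pstats


def slow_unique(values):
    """Deduplicate with a list: every membership test scans the whole list."""
    seen = []
    for v in values:
        if v not in seen:
            seen.append(v)
    return seen


def fast_unique(values):
    """Same result, using a set for constant-time membership tests."""
    seen, out = set(), []
    for v in values:
        if v not in seen:
            seen.add(v)
            out.append(v)
    return out


def profile(func, *args):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(5)
    return result, buffer.getvalue()


data = [i % 500 for i in range(20000)]
slow_result, slow_report = profile(slow_unique, data)
fast_result, fast_report = profile(fast_unique, data)

print("same output:", slow_result == fast_result)
print(slow_report)
```

The report lists cumulative time per function, which is the kind of evidence used to decide where a change of data structure is worth making; confirming that both versions return the same output guards against a refactoring that changes results.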