Source: AGRICULTURAL RESEARCH SERVICE submitted to NRP
ENVIRONMENTALLY AWARE DEEP LEARNING BASED GENOMIC SELECTION AND MANAGEMENT OPTIMIZATION FOR MAIZE YIELD
Sponsoring Institution
National Institute of Food and Agriculture
Project Status
COMPLETE
Funding Source
Reporting Frequency
Annual
Accession No.
1030152
Grant No.
2023-67012-39485
Cumulative Award Amt.
$225,000.00
Proposal No.
2022-09713
Multistate No.
(N/A)
Project Start Date
May 15, 2023
Project End Date
Mar 31, 2025
Grant Year
2023
Program Code
[A1141]- Plant Health and Production and Plant Products: Plant Breeding for Agricultural Production
Recipient Organization
AGRICULTURAL RESEARCH SERVICE
1815 N University
Peoria, IL 61604
Performing Department
(N/A)
Non Technical Summary
Growing enough food is a challenge, especially with changing climate conditions. Improved crop cultivars are required to meet this challenge, as are improved methods for developing cultivars and optimizing management schemes. Ideally, these methods would consider genetics, environment, and management strategy all at once so that cultivars that will thrive in specific environmental conditions or with certain management practices can be identified. These methods will support the development of improved cultivars, helping farmers satisfy our needs for food and fiber.
To improve cultivar development, this project uses a large dataset of corn yields from diverse environments across the United States to produce deep learning models that account for the interactions between genomic, environmental, and management factors. These models can be used to predict yield and determine what factors are associated with high yield. The yield-associated factors will be possible targets for improvement in the future. Deep learning models often need a lot of data to be accurate. This project will determine whether the amount of needed data can be reduced by using data from another crop, in this case supplementing data on wheat with data on corn. The ultimate goal supported by this work is to improve the ability of breeders to develop corn cultivars in diverse environments, which is aided by providing these breeders with accurate models and making it easier to make models for other crops.
Animal Health Component
(N/A)
Research Effort Categories
Basic
20%
Applied
(N/A)
Developmental
80%
Classification

Knowledge Area (KA) | Subject of Investigation (SOI) | Field of Science (FOS) | Percent
201 | 1510 | 1081 | 75%
201 | 1549 | 1081 | 25%
Goals / Objectives
The main goals of this project are to improve the accuracy of phenotypic prediction in crops grown in diverse environments, to incorporate genotype-by-environment interactions and biological theory into phenotypic models, and to make deep learning phenotypic prediction models usable in crops lacking sufficient data to train these models. The objectives and sub-objectives below support the realization of these goals.
Goal 1: Development of State-of-the-Art Deep Learning (DL) Phenotypic Prediction Models
1.a Optimize Deep Learning Models using Genomic Data
  1.a.1 Using maize yield data from the Genomes to Fields Initiative, design fully connected (FCN), convolutional (CNN), and recurrent neural networks (RNN).
  1.a.2 Train models using raw genomic data (FCN, CNN, RNN), genomic data represented as a Hilbert curve (CNN), or genomic data following data reduction. Data will be reduced to loci significantly associated with yield (FCN), autoencoder embeddings (FCN), or the loci of annotated genes (FCN, CNN, RNN).
  1.a.3 Best linear unbiased predictor models (BLUPs) will be created using comparable input data.
  1.a.4 Trained DL models will be evaluated against each other and BLUPs. Performance differences between model types and data treatments (reduction, modification) will be considered.
1.b Optimize Deep Learning Models using Environmental and Management Data
  1.b.1 DL models (CNN, RNN) using environmental and management data will be designed. Inclusion of residual connections between layers will be tested.
  1.b.2 DL models will be trained on data represented as a time series (CNN, RNN) and as a Hilbert curve (CNN), with and without pre-training on low resolution (e.g., farm or county instead of plot) data.
  1.b.3 Performance of the created models will be evaluated.
Goal 2: Incorporate Genotype-by-Environment Interactions (GxE) and Biological Theory
2.a Design a DL model processing genomic data with connections informed by known gene pathways in maize.
2.b Create six models with GxE interactions by combining either the best genome processing model from 1.a or the pathway-informed model from 2.a with the best environmental and management processing model from 1.b, each using one of three interaction networks:
  2.b.1 Directly predicting yield.
  2.b.2 Producing outputs corresponding to variables in a simple crop growth model.
  2.b.3 Producing the same number of outputs as in 2.b.2, but which are not required to correspond to physiological variables.
2.c Train benchmarking BLUP and machine learning models.
2.d Evaluate model performance relative to benchmarking models and the non-interaction models created in Goal 1.
2.e For the best performing DL model, SNPs and environmental variables of high importance to the predicted yield will be identified and compared with those reported in the literature.
Goal 3: Assess Transferability of Trained Models to Other Crops and Multi-Crop Models
3.a Prepare a wheat dataset complete with yield in multiple environments and genomic data.
3.b Evaluate Effectiveness of Transfer Learning
  3.b.1 Using the design of the best DL model identified, train a model exclusively on wheat.
  3.b.2 Fit a BLUP model using these data for use as a benchmark.
  3.b.3 Retrain copies of the maize model and wheat model on different percentages of the other species' data (5%, 25%, 50%, 75%, or 100%).
  3.b.4 Compare model accuracy with respect to percent data available.
3.c Evaluate Effectiveness of Multitarget Learning
  3.c.1 Extend the design of the best DL model identified by adding a second GxE interaction subnetwork so that each species has a distinct interaction subnetwork.
  3.c.2 Train models on half of both species' datasets or all of both, with accuracy being tracked for both species.
  3.c.3 Train models using differing percentages (5%, 25%, 50%, 75%, or 100%) of the total samples available for each crop.
3.d Compare the performance of models using transfer learning and multitarget learning with respect to data availability and with respect to non-retrained single species DL models and benchmark models.
Project Methods
The key model qualities to be evaluated are predictive accuracy and model interpretability. Predictive accuracy will be assessed by splitting the available data into a portion for training the model and a portion for testing it. The accuracy of model predictions in the test portion will be summarized by calculating each model's root mean squared error (RMSE). If inferential statistics on model performance need to be calculated, replicates of each model will be trained and the distributions of the resulting RMSEs will be compared.
Model interpretability will be evaluated in a two-step process. First, the most important features (SNPs, environmental covariates) that influence model output must be identified. At a minimum, saliency and layer-wise relevance propagation will be considered, but other metrics may be added if they are deemed to have a high likelihood of performing well. Following identification of important features, the model features will be compared with the published literature to assess the agreement between the two. If agreement is high, then identified features or feature relationships which are not present in the literature will be more credible targets for future study.
This project will require two datasets, with maize and wheat data respectively, containing genomic, phenotypic, and environmental data coming from multiple locations. Both datasets will undergo quality control checks and imputation of missing or aberrant data using approaches based on published methods. Completion of data cleaning will result in two datasets ready for statistical modeling. The cleaned data, or the steps and scripts for creating it, will be released for public use, provided none of the data is embargoed and its licensing permits distribution of derivative informational products.
Deep neural networks predicting maize yield based on either genomic or environmental and management data will be created.
Fully connected, convolutional, and recurrent neural network architectures will be considered. While these architectures have been applied to agronomic data previously, the use of plot level data and multiple data types (genomic, environmental, and management data) in a single model is rare. In addition to testing more conventional data transformation strategies, genomic data will be represented as a Hilbert curve before processing with a convolutional neural network. This allows a 1-dimensional sequence to be represented in a 2-dimensional plane while keeping base pairs that are in proximity in the sequence in proximity in the plane. To the author's knowledge this approach has not been applied to genomic prediction. This project will also construct a neural network for processing genomic data whose structure is informed by maize gene networks. This and similar approaches using expression data have been used for modeling data from humans and yeast but have not been applied to crops. Following training of the above models, benchmarking models including best linear unbiased predictors will be trained using genomic or environmental and management data to compare algorithm performance using RMSE.
The best performing non-genomic model will be combined with either the best performing genomic model or the gene-network-informed model, and each pairing will be tested with each of three gene-by-environment-by-management interaction architectures, resulting in six models to evaluate. In the first architecture, following previously published work, fully connected networks will represent the interactions and directly predict yield. The second will instead predict input variables for a simplified crop growth model, which will in turn predict yield. The third acts as a companion to the second and will produce the same number of output variables, but is not constrained to produce physiologically interpretable values.
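The Hilbert curve representation mentioned above can be sketched as follows. This is a minimal illustration, assuming a simple per-position encoding of the sequence; the function names `hilbert_d2xy` and `sequence_to_grid` are hypothetical and are not taken from the project's code. It uses the standard index-to-coordinate conversion for a Hilbert curve on a square grid whose side is a power of two.

```python
def hilbert_d2xy(side, d):
    """Map position d along a Hilbert curve to (x, y) on a side x side grid.

    `side` must be a power of two. Consecutive values of d always land on
    neighbouring cells, which is the locality property described in the text.
    """
    x = y = 0
    t = d
    s = 1
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:  # rotate/flip the quadrant so sub-curves join end to end
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def sequence_to_grid(values, fill=0):
    """Lay a 1-D sequence (e.g., encoded SNP calls) out on a 2-D grid.

    The grid is the smallest power-of-two square that holds the sequence;
    unused cells keep the `fill` value (an assumption of this sketch).
    """
    side = 1
    while side * side < len(values):
        side *= 2
    grid = [[fill] * side for _ in range(side)]
    for d, value in enumerate(values):
        x, y = hilbert_d2xy(side, d)
        grid[y][x] = value
    return grid


if __name__ == "__main__":
    # Hypothetical toy input: 16 encoded positions laid out on a 4 x 4 grid.
    for row in sequence_to_grid(list(range(1, 17))):
        print(row)
```

The resulting 2-D array could then be passed to a convolutional network in place of an image; how the genomic data is encoded per position is left open here.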
The second and third architectures represent a deviation from the structure of most deep learning models in this area, which predict the target phenotype directly. The overall strategy sequentially optimizes portions of the final network (i.e., optimizing models for individual data types, then using those to optimize a model containing interactions) instead of optimizing the full model at once. The preliminary data shown in the project narrative supports the effectiveness of this strategy but (to the author's knowledge) it has not been applied in the context of crop modeling. Benchmarking models will be created to evaluate model performance using RMSE as above. The best performing model will be analyzed as described in the steps to promote model interpretability above. In brief, the model will be used to identify relevant input features (SNPs, environmental covariates, etc.). The most influential features will be compared to the published literature to identify agreement or disagreement regarding the key factors influencing maize yield.
The work above will be communicated to the target audience through presentations and a peer reviewed publication. To ensure maximum usefulness, the data and code will be made available at time of publication.
Following development of the optimized maize model, models predicting wheat yield will be generated. The goal of these models is to assess the capability of a phenotypic prediction model trained for one species to be used for another. Two strategies will be considered: transfer learning and multitarget learning. To assess the effectiveness of transfer learning, a new model will be trained using wheat data and the most effective architecture identified for maize. The trained maize and wheat models will then each be retrained on different percentages of the other species' data (5%, 25%, 50%, 75%, or 100%).
To assess the effectiveness of multitarget learning, a second interaction subnetwork will be added to the previously identified architecture, so that one subnetwork predicts maize yield and the other wheat yield. This model will be trained on all available data or on a reduced dataset (to control for the number of observations in the training set) using differing percentages (5%, 25%, 50%, 75%, or 100%) of the total samples available for each crop. These methods are uncommon: few examples of transfer learning and multitarget learning exist for predicting crop phenotypes, and to the author's knowledge none exist that use genomic and environmental data. As in the previous set of experiments, benchmarking models will be created, with performance being measured in RMSE. Unlike the above, the quantity of primary interest here is not raw error but the relative change in performance as a function of the size of the training data, the fraction of each species' data, and the training strategy (i.e., transfer learning vs multitarget learning).
These results will be communicated to the target audience through a peer reviewed publication and presentations. As in the previously described publication, associated data and code will be made available. Special consideration will be given to the ease of model re-use to aid extension of this work to crops beyond maize and wheat.
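The evaluation scheme described in these methods (RMSE on a held-out test portion, replicated, at 5% to 100% of the available training samples) can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: a one-variable least-squares line stands in for the deep learning and BLUP models, and the names `rmse`, `fit_line`, and `learning_curve` and their parameters are hypothetical.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x (a stand-in for any model)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx if sxx > 0 else 0.0
    return mean_y - b * mean_x, b


def learning_curve(xs, ys, fractions=(0.05, 0.25, 0.50, 0.75, 1.00),
                   test_share=0.2, replicates=5, seed=0):
    """Return {fraction: [test RMSE per replicate]}.

    Each replicate draws a new train/test split; the test portion is never
    used for fitting, and each fraction uses a random subset of the training
    portion only.
    """
    rng = random.Random(seed)
    results = {f: [] for f in fractions}
    idx = list(range(len(xs)))
    for _ in range(replicates):
        rng.shuffle(idx)
        n_test = max(1, int(len(idx) * test_share))
        test, train = idx[:n_test], idx[n_test:]
        for f in fractions:
            k = max(2, int(len(train) * f))  # at least two points to fit a line
            subset = train[:k]
            a, b = fit_line([xs[i] for i in subset], [ys[i] for i in subset])
            preds = [a + b * xs[i] for i in test]
            results[f].append(rmse([ys[i] for i in test], preds))
    return results
```

Comparing the per-fraction RMSE distributions across training strategies, rather than a single RMSE value, corresponds to the relative-performance comparison the methods describe.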

Progress 05/15/24 to 04/05/25

Outputs
Target Audience: The target audiences reached predominantly consist of researchers (USDA and non-USDA) and students. Both groups have been reached through scientific presentations (National Association of Plant Breeders), contributions to scientific manuscripts, technical blog articles, and a technical skills seminar series (attended by USDA SYs, post-docs, and students).
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? This project has provided tremendous opportunities for professional development. For aim 2.b.2, I developed project management documents to work on this aim in collaboration with a post-doc and a student who wanted increased experience with deep learning. This preparation was valuable and made me more effective in other collaborations, whether I had a more modest role (designing and implementing improvements to a machine learning pipeline) or a major role (initiating and coordinating a collaboration with three labs to evaluate deep learning models across species). The logistics of the latter were challenging and pushed me to develop better communication and organizational skills. Over the same period, I worked to develop my technical communication skills by leading a deep learning group: organizing, teaching, and collecting feedback. These experiences were critical to my development. In addition to expanding my network, they prepared me to work effectively as part of a team and to communicate technical topics clearly.
How have the results been disseminated to communities of interest? In addition to the manuscripts referenced previously, I disseminated results through a poster at the National Association of Plant Breeders meeting. Additional presentations were planned for spring of 2025, but travel was suspended.
What do you plan to do during the next reporting period to accomplish the goals? Nothing Reported

Impacts
What was accomplished under these goals? As stated in the previous report: "Aims 1.a, 1.b, and 2.a, with the adaptations described in the Changes/Problems section. Extended the method described in 2.a into a standalone model fitting tool (sparsevnn, in "other products"). Prepared for aim 2.b.2, mimicking a simple crop growth model, by building the tools (apsimxSimData, in "other products") to generate synthetic data and producing ~500 Gb of the same." In addition to this, data infrastructure and preliminary models for aim 2.b.2 were produced, with encouraging results. The feasibility of aim 3 was assessed with a collaborator; the aim was not completed because the data available for the two crops were incompatible. Aim 2.a was spun out into a standalone collaboration, with 5 labs contacted and 3 ultimately involved, testing the efficacy of "biologically informed" neural networks for predicting traits in species of agricultural and scientific importance: corn, soy, cattle, and fruit flies. This has resulted in a deep learning library, command line tools, and a manuscript in progress.

Publications

  • Type: Peer Reviewed Journal Articles Status: Under Review Year Published: 2025 Citation: Worasit Sangjan, Daniel R Kick, Jacob D Washburn, Improving Plant Breeding Through AI-Supported Data Integration, 2025, Theoretical and Applied Genetics
  • Type: Peer Reviewed Journal Articles Status: Under Review Year Published: 2025 Citation: Shawn K. Thomas, R. Shawn Abrahams, Daniel Robert Kick, Nora Walden, Gavin Conant, Michael R. McKain, Hong An, Tatiana Arias, Patrick P. Edger, Alex Harkess, Kasper P. Hendriks, Marcus A. Koch, Frederic Lens, Martin A. Lysak, Alex McAlvay, Klaus Mummenhoff, Ihsan A. Al-Shehbaz, Jacob D. Washburn, J. Chris Pires, Polyploidy as a clade marker for mustard crops and wild relatives, 2025, Annals of Botany


Progress 05/15/23 to 03/31/25

Outputs
Target Audience: The target audiences reached predominantly consist of researchers (USDA and non-USDA) and students. Both groups have been reached through scientific presentations (National Association of Plant Breeders), contributions to scientific manuscripts, technical blog articles, and a technical skills seminar series (attended by USDA SYs, post-docs, and students).
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? This project has provided tremendous opportunities for professional development. For aim 2.b.2, I developed project management documents to work on this aim in collaboration with a post-doc and a student who wanted increased experience with deep learning. This preparation was valuable and made me more effective in other collaborations, whether I had a more modest role (designing and implementing improvements to a machine learning pipeline) or a major role (initiating and coordinating a collaboration with three labs to evaluate deep learning models across species). The logistics of the latter were challenging and pushed me to develop better communication and organizational skills. Over the same period, I worked to develop my technical communication skills by leading a deep learning group: organizing, teaching, and collecting feedback. These experiences were critical to my development. In addition to expanding my network, they prepared me to work effectively as part of a team and to communicate technical topics clearly.
How have the results been disseminated to communities of interest? In addition to the manuscripts referenced previously, I disseminated results through a poster at the National Association of Plant Breeders meeting. Additional presentations were planned for spring of 2025, but travel was suspended.
What do you plan to do during the next reporting period to accomplish the goals? Nothing Reported

Impacts
What was accomplished under these goals? As stated in the previous report: "Aims 1.a, 1.b, and 2.a, with the adaptations described in the Changes/Problems section. Extended the method described in 2.a into a standalone model fitting tool (sparsevnn, in "other products"). Prepared for aim 2.b.2, mimicking a simple crop growth model, by building the tools (apsimxSimData, in "other products") to generate synthetic data and producing ~500 Gb of the same." In addition to this, data infrastructure and preliminary models for aim 2.b.2 were produced, with encouraging results. The feasibility of aim 3 was assessed with a collaborator; the aim was not completed because the data available for the two crops were incompatible. Aim 2.a was spun out into a standalone collaboration, with 5 labs contacted and 3 ultimately involved, testing the efficacy of "biologically informed" neural networks for predicting traits in species of agricultural and scientific importance: corn, soy, cattle, and fruit flies. This has resulted in a deep learning library, command line tools, and a manuscript in progress.

Publications

  • Type: Peer Reviewed Journal Articles Status: Under Review Year Published: 2025 Citation: Worasit Sangjan, Daniel R Kick, Jacob D Washburn, Improving Plant Breeding Through AI-Supported Data Integration, 2025, Theoretical and Applied Genetics
  • Type: Peer Reviewed Journal Articles Status: Under Review Year Published: 2025 Citation: Shawn K. Thomas, R. Shawn Abrahams, Daniel Robert Kick, Nora Walden, Gavin Conant, Michael R. McKain, Hong An, Tatiana Arias, Patrick P. Edger, Alex Harkess, Kasper P. Hendriks, Marcus A. Koch, Frederic Lens, Martin A. Lysak, Alex McAlvay, Klaus Mummenhoff, Ihsan A. Al-Shehbaz, Jacob D. Washburn, J. Chris Pires, Polyploidy as a clade marker for mustard crops and wild relatives, 2025, Annals of Botany


Progress 05/15/23 to 05/14/24

Outputs
Target Audience: The target audiences reached thus far predominantly consist of researchers and students. Both groups have been reached through scientific seminar presentations (at University of Michigan, Truman State University, Iowa State University, and University of Georgia) and posters at the Maize Genetics Meeting and the University of Missouri plant research symposium. The latter group has also been served through individual mentoring of undergraduates and a short workshop in the fall attended by undergraduates, graduate students, one post-doc, and a lab tech.
Changes/Problems: I prototyped the portions of this project that I suspected to be the most difficult or most likely to fail: creating networks with biological information, mimicking crop growth models, and creating autoencoder networks. From these experiences, I propose the following changes.
Alteration 1: Expand aim 2.a (biologically informed networks) into a collaboration and possibly a standalone manuscript.
Alteration 2: Alter aim 2.b (gene by environment interactions) such that a crop growth model mimicking interaction uses environmental data directly, and possibly separate it out as a standalone manuscript.
Subtraction 1: Remove select data reduction strategies and network architectures to prioritize the above alterations.
Alteration 1: Expand a portion of aim 2 into an independent manuscript
The first of these, the biologically informed neural networks, yielded preliminary data suggesting that yield can be predicted from genomic data more accurately than I had expected. This is exciting because this model did not yet have environmental data so, while it should be used with environmental data for maximum effect, it could be used even in cases where environmental data is not available or is of questionable quality. The big questions this raises are "do these preliminary results hold across crop species?" and, if they do, what practical problems do I need to solve to make this method accessible to plant breeders, researchers, and other stakeholders? Through a collaboration I'm aiming to acquire the data to determine whether the performance holds and to identify and solve logistical challenges with the method's use. I have met with researchers in the USDA and in universities to pitch this collaboration. The principal investigators I have met with thus far regarding this are:
Dong Xu and Trupti Joshi (machine learning experts with experience in soybean)
Jason Gillman (expert in soybean genetics)
Jean-Luc Jannink (expert in wheat breeding)
Gustavo de los Campos (expert in quantitative methods, experience with wheat data and other systems)
Troy Rowan (expert in quantitative genetics, cattle genetics)
Libby King (quantitative genetics expert, expert in drosophila genetics)
This has the potential to impact more crop systems at once and even animal systems (if I am ultimately able to include cattle data). Working with these individuals has the added benefit of bolstering my soft skill development (project management, collaboration management, communication, etc.). It also aids in procuring the wheat dataset needed to complete aim 3. What I propose is to expand the relevant portion of aim 2 into an independent study and to use data from KEGG rather than CornCyc. The most challenging portions of this (writing and optimizing code to retrieve biological relationships and create these models quickly, identifying and approaching possible collaborators, planning contingencies, etc.) are done. I've identified several places in Alteration 2 and Subtraction 1 where changes can provide the necessary time for this.
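The idea behind a biologically informed network, as described in Alteration 1, can be sketched as a layer whose connections are restricted by pathway membership. This is a minimal illustration under stated assumptions: it is not the sparsevnn tool or its API, the gene and pathway names are hypothetical, and a real model would learn the weights rather than fix them.

```python
def build_mask(genes, pathways):
    """Return (pathway_names, mask) where mask[p][g] is 1.0 if gene g is a
    member of pathway p and 0.0 otherwise.

    `pathways` maps a pathway name to the set of its member genes, e.g. as
    might be retrieved from a pathway database such as KEGG.
    """
    names = sorted(pathways)
    mask = [[1.0 if g in pathways[p] else 0.0 for g in genes] for p in names]
    return names, mask


def masked_forward(x, weights, mask, bias):
    """One sparse layer: each pathway node sees only its member genes.

    Multiplying each weight by its mask entry zeroes every connection that
    has no biological support; a ReLU is applied to each pathway node.
    """
    out = []
    for w_row, m_row, b in zip(weights, mask, bias):
        z = b + sum(w * m * xi for w, m, xi in zip(w_row, m_row, x))
        out.append(max(0.0, z))
    return out


if __name__ == "__main__":
    # Hypothetical example: four genes, two pathways; g4 is in no pathway.
    genes = ["g1", "g2", "g3", "g4"]
    pathways = {"starch": {"g1", "g2"}, "nitrogen": {"g3"}}
    names, mask = build_mask(genes, pathways)
    weights = [[1.0] * len(genes) for _ in names]
    bias = [0.0] * len(names)
    print(names, masked_forward([1, 2, 5, 4], weights, mask, bias))
```

Because the mask removes most connections, such a layer has far fewer effective parameters than a fully connected layer over the same inputs, which is one reason the approach may remain workable with limited data.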
Alteration 2: Alter structure of gene by environment interaction model
In aim 2 I suggested testing three networks that integrate the outputs of either a biologically informed neural network or the best identified gene-only neural network with those of the best identified environmental neural network (6 combinations in total). These three networks are 1) a fully connected neural network, 2) a network that mimics a crop growth model (CGM; e.g., APSIM), and 3) a network with the same number of inputs as a crop growth model (but which isn't constrained to act as a crop growth model). This aim can be substantially improved. I suggest instead building a model that directly mimics a crop growth model while receiving genotype-specific parameters from a neural network. If effective, this would help reduce a limitation of crop growth models (using genomic information is challenging). Below I outline the strategy I have come up with and discussed with an expert on APSIM (Dr. Sotirios Archontoulis).
1. Train the CGM model using simulated data from APSIM (I've generated files on the order of ~500 Gb for this purpose).
2. Use a genomic model (as previously planned) to predict yield through this CGM model.
3. Use these to fine-tune the genomic model.
What I propose is to change aim 2 so that:
1) The CGM network acts on environmental data directly, mimicking APSIM.
2) The network with the same number of input variables as the CGM network (network 3 above) is removed from consideration, and instead the best neural network without the constraint of operating as a CGM is compared against the best one constrained to operate as a CGM.
3) The number of combinations of interaction networks is reduced by using only the best genomic model instead of both the best non-biologically informed genomic model and the best biologically informed genomic model.
Doing so frees time for implementing the optional training step, producing a deeper analysis of the CGM network's behavior, and completing aim 3.
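The training order proposed above (fit a CGM-mimicking model on simulated data, then train a genomic model through it) can be sketched with a deliberately tiny toy. Everything here is an assumption made for illustration: `toy_simulator` is a made-up stand-in and not APSIM, the surrogate is a single fitted coefficient rather than a neural network, and the "genomic model" is a two-parameter linear function of a hypothetical marker dosage.

```python
def toy_simulator(heat_units, efficiency):
    """Stand-in for a crop growth simulator: yield from accumulated heat
    units (environment) and a genotype-specific efficiency parameter."""
    return efficiency * heat_units


def fit_surrogate(samples):
    """Stage 1: fit the surrogate yield ~ c * efficiency * heat_units to
    simulated (heat_units, efficiency, yield) samples by least squares."""
    num = sum(y * eff * hu for hu, eff, y in samples)
    den = sum((eff * hu) ** 2 for hu, eff, y in samples)
    return num / den


def fit_genomic(observations, c, lr=0.1, epochs=2000):
    """Stage 2: with the surrogate coefficient c held fixed, fit a genomic
    model efficiency = w0 + w1 * dosage by gradient descent on squared yield
    error, using (dosage, heat_units, observed_yield) records."""
    w0 = w1 = 0.0
    n = len(observations)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for dosage, hu, y in observations:
            err = c * (w0 + w1 * dosage) * hu - y
            g0 += 2 * err * c * hu           # d(err^2)/d(w0)
            g1 += 2 * err * c * hu * dosage  # d(err^2)/d(w1)
        w0 -= lr * g0 / n
        w1 -= lr * g1 / n
    return w0, w1


if __name__ == "__main__":
    # Stage 1 uses simulated data only; stage 2 uses "field" records.
    sims = [(hu, eff, toy_simulator(hu, eff))
            for hu in (0.8, 1.0, 1.2) for eff in (0.5, 1.0, 1.5)]
    c = fit_surrogate(sims)
    field = [(d, hu, toy_simulator(hu, 1.0 + 0.5 * d))
             for d in (0, 1, 2) for hu in (1.0, 1.2, 1.4)]
    print(c, fit_genomic(field, c))
```

The point of the sketch is the separation of stages: the environment-to-yield mapping is learned from abundant simulated data, and the scarcer observed yields are used only to learn how genotype sets the parameters fed into that mapping.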
Subtraction 1: Remove select data reduction strategies and network architectures to have time for alterations
While the additions in Alteration 1 and changes in Alteration 2 improve this project and expand the set of stakeholders impacted by it, they require time. By reducing the number of combinations of models to be tested in Alteration 2, the net increase in time needed is reduced. Alteration 1, however, represents a substantial net increase in time. To make up this difference I propose to remove certain network architectures and data transformations from consideration, which I have discussed with my primary mentor.
Data Alteration and Reduction
Autoencoder networks provide a way to reduce the number of variables in a dataset. I proposed using this for genomic data. The method I tried, a "variational autoencoder" (VAE), turned out to be prone to introducing severe artifacts. While I think this is solvable, it seems wise to focus on other parts of the project. I also proposed to filter genomic data by each SNP's p-value. I have not seen this outside of a single manuscript. The training procedure should provide a similar benefit without using unconventional practices. From a practical standpoint, this prefiltering by p-value could inflate the apparent performance of the method by "leaking" information from the testing set to the training set. For these reasons I would like to omit this data reduction technique.
Network Architectures
I proposed training recurrent neural networks on genomic data in addition to training fully connected and convolutional models. As I now understand it, this was an unwise idea because the order in which the genomic data is given to the model (order of chromosomes and which strand is provided) should substantially alter the performance of the model. Metaphorically, the model "remembers" more recent events better, so for long sequences relevant information early in the sequence can become drowned out by more recently "seen" information.
Some model architectures (transformer models) are designed to reduce this issue and, because of the success of these models in working with natural language, there are labs working to apply these models to genomic data (I've spoken with several). It seems sensible to not pursue this approach for genomic data and to lean into the biologically informed network approach. I propose not training recurrent neural networks on any genomic data.
What opportunities for training and professional development has the project provided? This project has provided a plethora of opportunities for training and professional development. During the summer, I attended a workshop on the crop growth simulation software "APSIM". This experience radically improved how I think about crop growth models. It spurred the development of one of my "other products", a publicly available programming tool, "apsimxSimData", and caused me to realize that with some adjustments to aim 2, I could potentially benefit a whole new group of stakeholders. This workshop also put me in contact with Dr. Sotirios Archontoulis and Dr. Fernando Miguez, which has benefited the quality of this work. This spring I have formally presented six times: speaking to three university departments (University of Michigan, Truman State University, Iowa State University), presenting two posters (Maize Genetics Meeting and the University of Missouri plant research symposium), and both speaking and acting as a panelist at University of Georgia's AI in Plant Breeding Symposium. Beyond sharpening my communication and presentation skills, these experiences (along with attending the North American Plant Phenotyping Network's meeting) provided me with the opportunity to connect with possible collaborators. That latter piece has proved valuable, as one of the sub-aims of the project (2.a) proved to be both more promising and more challenging (time consuming) than expected.
The apparent performance of this method led me to struggle through the process of optimizing it so that it would be a feasible option for others to use. This resulted in a huge speedup (approximately 198x, from 0.14 to 27.84 iterations per second) and caused me to think more deeply about the technical aspects of this project. This also prompted me to begin building a collaboration between individuals in USDA and academia, a mixture of persons I met through these presentation opportunities and persons I identified and reached out to directly. Learning to wrangle this collaboration has been a great experience and one I expect to continue challenging me as I strive to collaborate more effectively. I have aimed to improve my soft skills not only through the above collaboration but also by training students and junior scientists. My original plan had been to volunteer with the Software Carpentries. Seeing no workshops scheduled, I sought to volunteer with the USDA Data Science Training Program Facilitator Network, but my application received no reply. Instead, I spoke with the students and more junior scientists working with labs in my unit about what training they would find most valuable. After identifying a desire for improving interviewing skills, I designed and ran a short series of professional development sessions. While this did not provide an opportunity to practice communicating technical and scientific topics clearly, mentoring two students in my primary mentor's lab has provided the opportunity to continue practicing these skills.
How have the results been disseminated to communities of interest? The results of this project have been disseminated primarily through scientific presentations (six formal presentations during the spring of 2024). Creating durable, visible, digital artifacts (research repositories, code repositories, etc.) is a secondary dissemination strategy I have taken.
These two work together nicely, as the former has been helpful in identifying people who can benefit from the products of my project beyond the final manuscript.
What do you plan to do during the next reporting period to accomplish the goals? I plan on making the adjustments described in the "Changes/Problems" section, which will result in a model more directly compatible with APSIM and a method potentially useful across crops, thus allowing this project to provide value to a greater number and variety of stakeholders. Developing the collaboration centered around aim 2.a (to demonstrate the degree to which this method is valuable across crops) allows some of my work to do "double duty" by providing the wheat dataset I will need to complete aim 3, or possibly a dataset in a different crop that would be even better suited to aim 3. While the suggested adjustments represent an expansion of some aspects of this project, they're expected to provide more immediate benefits while still advancing the core of this project. After completing these, I should be well positioned to complete the remainder of aim 3, with model architectures and data cleaned and formatted for modeling in hand. I plan on communicating the results from this work to the communities of interest via scientific conferences and at the NIFA PD meeting. While the work I've done to date supports the upcoming objectives, there is much to be accomplished. In the past year, I sought out project management resources to learn to better organize and run projects (especially important for the aforementioned collaboration). In the coming year, I plan on leaning into the practices I've learned and discussing project management (and people management) with my primary mentor and constellation of mentors. I anticipate that discussing and practicing these strategies will increase my effectiveness and support completion of the project objectives.

Impacts
What was accomplished under these goals? Aims 1.a, 1.b, and 2.a, with the adaptations described in the Changes/Problems section. Extended the method described in 2.a into a standalone model fitting tool (sparsevnn, in "other products"). Prepared for aim 2.b.2, mimicking a simple crop growth model, by building the tools (apsimxSimData, in "other products") to generate synthetic data and producing ~500 Gb of the same.

Publications