Progress 05/15/24 to 04/05/25
Outputs Target Audience: The target audiences reached consist predominantly of researchers (USDA and non-USDA) and students. Both groups have been reached through scientific presentations (National Association of Plant Breeders), contributions to scientific manuscripts, technical blog articles, and a technical skills seminar series (attended by USDA SYs, post-docs, and students). Changes/Problems:
Nothing Reported
What opportunities for training and professional development has the project provided?
This project has provided tremendous opportunities for professional development. For aim 2.b.2, I developed project management documents so that I could work on this aim in collaboration with a post-doc and a student who wanted more experience with deep learning. This preparation was valuable and made me more effective in other collaborations, in which I had either a more modest role (designing and implementing improvements to a machine learning pipeline) or a major role (initiating and coordinating a collaboration with three labs to evaluate deep learning models across species). The logistics of the latter were challenging and pushed me to develop better communication and organizational skills. Over the same period, I worked to develop my technical communication skills by leading a deep learning group: organizing, teaching, and collecting feedback. These experiences were critical to my development. In addition to expanding my network, they prepared me to work effectively as part of a team and to communicate technical topics clearly.
How have the results been disseminated to communities of interest?
In addition to the manuscripts referenced previously, I disseminated results through a poster at the National Association of Plant Breeders meeting. Additional presentations were planned for spring of 2025, but travel was suspended.
What do you plan to do during the next reporting period to accomplish the goals?
Nothing Reported
Impacts What was accomplished under these goals?
The previous report stated: "Aims 1.a, 1.b, and 2.a, with the adaptations described in the Changes/Problems section. Extended the method described in 2.a into a standalone model fitting tool (sparsevnn, in 'other products'). Prepared for aim 2.b.2, mimicking a simple crop growth model, by building the tools (apsimxSimData, in 'other products') to generate synthetic data and producing ~500 GB of the same." In addition, data infrastructure and preliminary models for aim 2.b.2 were produced, with encouraging results. The feasibility of aim 3 was assessed with a collaborator; the aim was not completed because the data available for the different crops were incompatible. Aim 2.a was spun out into a standalone collaboration (five labs contacted, three ultimately involved) testing the efficacy of "biologically informed" neural networks for predicting traits in species of agricultural and scientific importance: corn, soy, cattle, and fruit flies. This has resulted in a deep learning library, command line tools, and a manuscript in progress.
Publications
- Type:
Peer Reviewed Journal Articles
Status:
Under Review
Year Published:
2025
Citation:
Worasit Sangjan, Daniel R Kick, Jacob D Washburn, Improving Plant Breeding Through AI-Supported Data Integration, 2025, Theoretical and Applied Genetics
- Type:
Peer Reviewed Journal Articles
Status:
Under Review
Year Published:
2025
Citation:
Shawn K. Thomas, R. Shawn Abrahams, Daniel Robert Kick, Nora Walden, Gavin Conant, Michael R. McKain, Hong An, Tatiana Arias, Patrick P. Edger, Alex Harkess, Kasper P. Hendriks, Marcus A. Koch, Frederic Lens, Martin A. Lysak, Alex McAlvay, Klaus Mummenhoff, Ihsan A. Al-Shehbaz, Jacob D. Washburn, J. Chris Pires, Polyploidy as a clade marker for mustard crops and wild relatives, 2025, Annals of Botany
|
Progress 05/15/23 to 03/31/25
Outputs Target Audience: The target audiences reached consist predominantly of researchers (USDA and non-USDA) and students. Both groups have been reached through scientific presentations (National Association of Plant Breeders), contributions to scientific manuscripts, technical blog articles, and a technical skills seminar series (attended by USDA SYs, post-docs, and students). Changes/Problems:
Nothing Reported
What opportunities for training and professional development has the project provided?
This project has provided tremendous opportunities for professional development. For aim 2.b.2, I developed project management documents so that I could work on this aim in collaboration with a post-doc and a student who wanted more experience with deep learning. This preparation was valuable and made me more effective in other collaborations, in which I had either a more modest role (designing and implementing improvements to a machine learning pipeline) or a major role (initiating and coordinating a collaboration with three labs to evaluate deep learning models across species). The logistics of the latter were challenging and pushed me to develop better communication and organizational skills. Over the same period, I worked to develop my technical communication skills by leading a deep learning group: organizing, teaching, and collecting feedback. These experiences were critical to my development. In addition to expanding my network, they prepared me to work effectively as part of a team and to communicate technical topics clearly.
How have the results been disseminated to communities of interest?
In addition to the manuscripts referenced previously, I disseminated results through a poster at the National Association of Plant Breeders meeting. Additional presentations were planned for spring of 2025, but travel was suspended.
What do you plan to do during the next reporting period to accomplish the goals?
Nothing Reported
Impacts What was accomplished under these goals?
The previous report stated: "Aims 1.a, 1.b, and 2.a, with the adaptations described in the Changes/Problems section. Extended the method described in 2.a into a standalone model fitting tool (sparsevnn, in 'other products'). Prepared for aim 2.b.2, mimicking a simple crop growth model, by building the tools (apsimxSimData, in 'other products') to generate synthetic data and producing ~500 GB of the same." In addition, data infrastructure and preliminary models for aim 2.b.2 were produced, with encouraging results. The feasibility of aim 3 was assessed with a collaborator; the aim was not completed because the data available for the different crops were incompatible. Aim 2.a was spun out into a standalone collaboration (five labs contacted, three ultimately involved) testing the efficacy of "biologically informed" neural networks for predicting traits in species of agricultural and scientific importance: corn, soy, cattle, and fruit flies. This has resulted in a deep learning library, command line tools, and a manuscript in progress.
Publications
- Type:
Peer Reviewed Journal Articles
Status:
Under Review
Year Published:
2025
Citation:
Worasit Sangjan, Daniel R Kick, Jacob D Washburn, Improving Plant Breeding Through AI-Supported Data Integration, 2025, Theoretical and Applied Genetics
- Type:
Peer Reviewed Journal Articles
Status:
Under Review
Year Published:
2025
Citation:
Shawn K. Thomas, R. Shawn Abrahams, Daniel Robert Kick, Nora Walden, Gavin Conant, Michael R. McKain, Hong An, Tatiana Arias, Patrick P. Edger, Alex Harkess, Kasper P. Hendriks, Marcus A. Koch, Frederic Lens, Martin A. Lysak, Alex McAlvay, Klaus Mummenhoff, Ihsan A. Al-Shehbaz, Jacob D. Washburn, J. Chris Pires, Polyploidy as a clade marker for mustard crops and wild relatives, 2025, Annals of Botany
|
Progress 05/15/23 to 05/14/24
Outputs Target Audience: The target audiences reached thus far consist predominantly of researchers and students. Both groups have been reached through scientific seminar presentations (at the University of Michigan, Truman State University, Iowa State University, and the University of Georgia) and posters at the Maize Genetics Meeting and the University of Missouri plant research symposium. Students have also been served through individual mentoring of undergraduates and a short workshop in the fall attended by undergraduates, graduate students, one post-doc, and a lab tech.
Changes/Problems: I prototyped the portions of this project that I suspected to be the most difficult or the most likely to fail: creating networks with biological information, mimicking crop growth models, and creating autoencoder networks. From these experiences, I propose the following changes.
- Alteration 1: Expand aim 2.a (biologically informed networks) into a collaboration and possibly a standalone manuscript.
- Alteration 2: Alter aim 2.b (gene by environment interactions) such that the interaction network mimicking a crop growth model uses environmental data directly, and possibly separate this out as a standalone manuscript.
- Subtraction 1: Remove select data reduction strategies and network architectures to prioritize the above alterations.
Alteration 1: Expand a portion of aim 2 into an independent manuscript
The first of these prototypes, the biologically informed neural networks, yielded preliminary data suggesting that yield can be predicted from genomic data more accurately than I had expected. This is exciting because this model did not yet have environmental data. So, while it should be used with environmental data for maximum effect, it could be used even in cases where environmental data are not available or are of questionable quality. The big questions this raises are "do these preliminary results hold across crop species?" and, if they do, "what practical problems do I need to solve to make this method accessible to plant breeders, researchers, and other stakeholders?" Through a collaboration, I am aiming to acquire the data to determine whether the performance holds and to identify and solve logistical challenges with the method's use. I have met with researchers in the USDA and in universities to pitch this collaboration. The principal investigators I have met with thus far regarding this are:
- Dong Xu and Trupti Joshi (machine learning experts with experience in soybean)
- Jason Gillman (expert in soybean genetics)
- Jean-Luc Jannink (expert in wheat breeding)
- Gustavo de los Campos (expert in quantitative methods, experience with wheat data and other systems)
- Troy Rowan (expert in quantitative genetics, cattle genetics)
- Libby King (quantitative genetics expert, expert in drosophila genetics)
This collaboration has the potential to impact more crop systems at once and even animal systems (if I am ultimately able to include cattle data). Working with these individuals has the added benefit of bolstering my soft skill development (project management, collaboration management, communication, etc.). It also aids in procuring the wheat dataset needed to complete aim 3. What I propose is to expand the relevant portion of aim 2 into an independent study and to use data from KEGG rather than CornCyc. The most challenging portions of this (writing and optimizing code to retrieve biological relationships and create these models quickly, identifying and approaching possible collaborators, planning contingencies, etc.) are done. I have identified several places in Alteration 2 and Subtraction 1 where changes can provide the necessary time for this.
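As a rough illustration of what "biologically informed" connectivity can mean, the sketch below restricts a network layer so that each pathway node receives input only from genes annotated to that pathway. This is a generic, minimal example and not the sparsevnn implementation; the gene names, pathway names, weights, and genotype encoding are all hypothetical, and real annotations would come from a resource such as KEGG.

```python
# Hypothetical gene-to-pathway annotations (illustrative assumptions only).
PATHWAYS = {
    "pathway_a": ["gene_1", "gene_2"],
    "pathway_b": ["gene_2", "gene_3", "gene_4"],
}
GENES = ["gene_1", "gene_2", "gene_3", "gene_4"]


def build_mask(genes, pathways):
    """Return a pathways x genes 0/1 matrix marking the allowed connections."""
    names = sorted(pathways)
    return [[1 if g in pathways[p] else 0 for g in genes] for p in names]


def masked_forward(x, weights, mask):
    """One sparse layer: each pathway node sums only its member genes, then ReLU."""
    out = []
    for w_row, m_row in zip(weights, mask):
        total = sum(w * m * xi for w, m, xi in zip(w_row, m_row, x))
        out.append(max(0.0, total))
    return out


mask = build_mask(GENES, PATHWAYS)
# Dense placeholder weights; the mask zeroes every edge without an annotation.
weights = [[0.5] * len(GENES) for _ in mask]
# Assumed encoding: one numeric score per gene (e.g. summarized allele dosage).
genotype = [1.0, 2.0, 0.0, 1.0]
activations = masked_forward(genotype, weights, mask)
n_edges = sum(map(sum, mask))  # 5 allowed edges instead of 8 in a dense layer
```

The design point is that the annotation, not the training procedure, decides which weights exist, so the number of parameters scales with the number of known gene-pathway relationships rather than with genes times pathways.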
Alteration 2: Alter structure of gene by environment interaction model
In aim 2, I suggested testing three networks that integrate the outputs of either a biologically informed neural network or the best identified gene-only neural network with those of the best identified environmental neural network (6 combinations in total). These three networks are 1) a fully connected neural network, 2) a network that mimics a crop growth model (e.g. APSIM), and 3) a network with the same number of inputs as a crop growth model (but not constrained to act as a crop growth model). This aim can be substantially improved. I suggest instead building a model that directly mimics a crop growth model (CGM) while receiving genotype-specific parameters from a neural network. If effective, this would help reduce a limitation of crop growth models, namely that using genomic information in them is challenging. Below I outline the strategy I have come up with and discussed with an expert on APSIM (Dr. Sotirios Archontoulis).
1) Train the CGM network using simulated data from APSIM (I have generated files on the order of 500 GB for this purpose).
2) Use a genomic model (as previously planned) to predict yield through this CGM network.
3) Use these predictions to fine-tune the genomic model.
What I propose is to change aim 2 so that:
1) The CGM network acts on environmental data directly, mimicking APSIM.
2) The network with the same number of input variables as the CGM network (network 3 above) is removed from consideration. Instead, the best neural network without the constraint of operating as a CGM is compared against the best one constrained to operate as a CGM.
3) The number of combinations of interaction networks is reduced by using only the best genomic model, instead of both the best non-biologically informed genomic model and the best biologically informed genomic model.
Doing so leaves more time for implementing the optional training step, producing a deeper analysis of the CGM network's behavior, and completing aim 3.
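The strategy outlined above (a fixed CGM-like component that receives genotype-specific parameters from a genomic model, which is then fine-tuned against observed yield) can be sketched in a few lines. This is a toy under stated assumptions, not the project's model: `toy_cgm` is a closed-form stand-in for a network trained on APSIM simulations, the single parameter `rue` (radiation-use efficiency), the linear genomic model, and all data values are hypothetical.

```python
def toy_cgm(radiation, rue):
    """Stand-in for a trained CGM-mimicking network (an assumption, not APSIM):
    yield is radiation-use efficiency times season-total radiation."""
    return rue * sum(radiation)


def predict_rue(markers, w, b):
    """Linear genomic model mapping marker scores to a genotype-specific parameter."""
    return b + sum(wi * mi for wi, mi in zip(w, markers))


def fine_tune(data, w, b, lr=1e-3, epochs=1000):
    """Gradient descent on squared yield error, with the CGM component held fixed."""
    for _ in range(epochs):
        for markers, radiation, observed in data:
            total_rad = sum(radiation)
            err = toy_cgm(radiation, predict_rue(markers, w, b)) - observed
            # d(yield)/d(rue) = total_rad, so the chain rule gives these updates.
            w = [wi - lr * err * total_rad * mi for wi, mi in zip(w, markers)]
            b = b - lr * err * total_rad
    return w, b


# Hypothetical "true" genetic effects, used only to simulate observed yields.
TRUE_W, TRUE_B = [0.5, -0.2], 1.0
genotypes = [[0, 0], [1, 0], [0, 1], [1, 1]]
seasons = [[4.0, 6.0], [5.0, 5.0], [3.0, 9.0], [6.0, 2.0]]  # daily radiation per trial
data = [
    (m, r, toy_cgm(r, predict_rue(m, TRUE_W, TRUE_B)))
    for m, r in zip(genotypes, seasons)
]

w_fit, b_fit = fine_tune(data, w=[0.0, 0.0], b=0.0)
```

Because the yield error is propagated through the fixed CGM component, the genomic model learns a physiologically named parameter rather than yield itself, which is the property that would make such a model more directly compatible with a crop growth model.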
Subtraction 1: Remove select data reduction strategies and network architectures to have time for alterations
While the additions in Alteration 1 and the changes in Alteration 2 improve this project and expand the set of stakeholders impacted by it, they require time. By reducing the number of combinations of models to be tested in Alteration 2, the net increase in time needed is reduced. Alteration 1, however, represents a substantial net increase in time. To make up this difference, I propose to remove certain network architectures and data transformations from consideration, which I have discussed with my primary mentor.
Data Alteration and Reduction
Autoencoder networks provide a way to reduce the number of variables in a dataset. I proposed using this for genomic data. The method I tried, a "variational autoencoder" (VAE), turned out to be prone to introducing severe artifacts. While I think this is solvable, it seems wise to focus on other parts of the project. I also proposed to filter genomic data by each SNP's p-value. I have not seen this outside of a single manuscript. The training procedure should provide a similar benefit without using unconventional practices. From a practical standpoint, this prefiltering by p-value could inflate the apparent performance of the method by "leaking" information from the testing set to the training set. For these reasons, I would like to omit this data reduction technique.
Network Architectures
I proposed training recurrent neural networks on genomic data in addition to training fully connected and convolutional models. As I now understand it, this was an unwise idea because the order in which the genomic data are given to the model (the order of chromosomes and which strand is provided) should substantially alter the performance of the model. Metaphorically, the model "remembers" more recent events better, so for long sequences, relevant information early in the sequence can become drowned out by more recently "seen" information. Some model architectures (transformer models) are designed to reduce this issue, and because of the success of these models in working with natural language, there are labs working to apply these models to genomic data (I have spoken with several). It seems sensible not to pursue this approach for genomic data and instead to lean into the biologically informed network approach. I propose not training recurrent neural networks on any genomic data.
What opportunities for training and professional development has the project provided?
This project has provided a plethora of opportunities for training and professional development. During the summer, I attended a workshop on the crop growth simulation software "APSIM". This experience radically improved how I think about crop growth models. It spurred the development of one of my "other products", a publicly available programming tool, "apsimxSimData", and caused me to realize that, with some adjustments to aim 2, I could potentially benefit a whole new group of stakeholders. This workshop also put me in contact with Dr. Sotirios Archontoulis and Dr. Fernando Miguez, which has benefited the quality of this work. This spring I have formally presented six times: speaking to three university departments (University of Michigan, Truman State University, Iowa State University), presenting two posters (Maize Genetics Meeting and the University of Missouri plant research symposium), and both speaking and acting as a panelist at the University of Georgia's AI in Plant Breeding Symposium. Beyond sharpening my communication and presentation skills, these experiences (along with attending the North American Plant Phenotyping Network's meeting) provided me with the opportunity to connect with possible collaborators. That latter piece has proved valuable, as one of the sub-aims of the project (2.a) proved to be both more promising and more challenging (time consuming) than expected.
The apparent performance of this method led me to struggle through the process of optimizing it so that it would be a feasible option for others to use. This resulted in a huge speedup (approximately 198x, from 0.14 to 27.84 iterations per second) and caused me to think more deeply about the technical aspects of this project. This also prompted me to begin building a collaboration between individuals in the USDA and academia, a mixture of people I met through these presentation opportunities and people I identified and reached out to directly. Learning to wrangle this collaboration has been a great experience and one I expect to continue challenging me as I strive to collaborate more effectively. I have aimed to improve my soft skills not only through the above collaboration but also through training students and junior scientists. My original plan had been to volunteer with the Software Carpentries. Seeing no workshops scheduled, I sought to volunteer with the USDA Data Science Training Program Facilitator Network, but my application received no reply. Instead, I spoke with the students and more junior scientists working with labs in my unit about what training they would find most valuable. After identifying a desire for improving interviewing skills, I designed and ran a short series of professional development sessions. While this did not provide an opportunity to practice communicating technical and scientific topics clearly, mentoring two students in my primary mentor's lab has provided the opportunity to continue practicing these skills.
How have the results been disseminated to communities of interest?
The results of this project have been disseminated primarily through scientific presentations (six formal presentations during the spring of 2024). Creating durable, visible, digital artifacts (research repositories, code repositories, etc.) is a secondary dissemination strategy I have taken. These two work together nicely, as the former has been helpful in identifying people who can benefit from the products of my project beyond the final manuscript.
What do you plan to do during the next reporting period to accomplish the goals?
I plan on making the adjustments described in the "Changes/Problems" section, which will result in a model more directly compatible with APSIM and a method potentially useful across crops, thus allowing this project to provide value to a greater number and variety of stakeholders. Developing the collaboration centered around aim 2.a (to demonstrate the degree to which this method is valuable across crops) allows some of my work to do "double duty" by providing the wheat dataset I will need to complete aim 3, or possibly a dataset in a different crop that would be even better suited to aim 3. While the suggested adjustments represent an expansion of some aspects of this project, they are expected to provide more immediate benefits while still advancing the core of this project. After completing these, I should be well positioned to complete the remainder of aim 3, with model architectures in hand and data cleaned and formatted for modeling. I plan to communicate the results of this work to the communities of interest at scientific conferences and at the NIFA PD meeting. While the work I have done to date supports the upcoming objectives, there is much to be accomplished. In the past year, I sought out project management resources to learn to better organize and run projects (especially important for the aforementioned collaboration). In the coming year, I plan on leaning into the practices I have learned and discussing project management (and people management) with my primary mentor and constellation of mentors. I anticipate that discussing and practicing these strategies will increase my effectiveness and support completion of the project objectives.
Impacts What was accomplished under these goals?
Aims 1.a, 1.b, and 2.a, with the adaptations described in the Changes/Problems section. Extended the method described in 2.a into a standalone model fitting tool (sparsevnn, in "other products"). Prepared for aim 2.b.2, mimicking a simple crop growth model, by building the tools (apsimxSimData, in "other products") to generate synthetic data and producing ~500 GB of the same.
Publications
|