Source: CORNELL UNIVERSITY submitted to NRP
DSFAS-CIN: INTEGRATING MULTISCALE REMOTE SENSING DATA FOR ENHANCING DATA-DRIVEN PREDICTIVE ANALYTICS IN CROP BREEDING AND MANAGEMENT
Sponsoring Institution
National Institute of Food and Agriculture
Project Status
ACTIVE
Funding Source
Reporting Frequency
Annual
Accession No.
1028210
Grant No.
2022-67021-36466
Cumulative Award Amt.
$1,000,000.00
Proposal No.
2021-11493
Multistate No.
(N/A)
Project Start Date
Jan 15, 2022
Project End Date
Jan 14, 2027
Grant Year
2022
Program Code
[A1541]- Food and Agriculture Cyberinformatics and Tools
Recipient Organization
CORNELL UNIVERSITY
(N/A)
ITHACA,NY 14853
Performing Department
Plant Breeding & Genetics
Non Technical Summary
Increasing the predictability of crop variety performance in research and production environments is critical for national food and nutrition security. To address these challenges, plant scientists need to have a better understanding of which genes control trait variation and how these genes interact with the environment across plant growth and development. With the ease with which low-cost genomics data can now be generated, efforts of the scientific community have turned to developing low-cost, high-throughput phenomics methods for collecting and analyzing multiple types of sensor and image data across different temporal and spatial scales, providing more detailed information on the dynamic processes driving plant growth and development. Through the joint modeling of genomics and phenomics data, it should be possible to gain a deeper understanding of the underlying genetic and environmental factors responsible for variation in traits over the life history of crop plants, allowing for enhanced prediction and forecasting of variety performance. Our project will develop the open-source digital tools needed to model and predict agronomically and economically important phenotypes by extracting and unifying features of high-dimensional datasets to better dissect and model genotype-by-environment interactions in field environments. The developed interoperable digital ecosystem, consisting of integrated data, software tools, and modeling approaches, will exist in the public domain so that it can be available to and co-evolve with the greater scientific community.
Animal Health Component
30%
Research Effort Categories
Basic
70%
Applied
30%
Developmental
0%
Classification

Knowledge Area (KA): 201; Subject of Investigation (SOI): 1510; Field of Science (FOS): 1080; Percent: 100%
Knowledge Area
201 - Plant Genome, Genetics, and Genetic Mechanisms;

Subject Of Investigation
1510 - Corn;

Field Of Science
1080 - Genetics;
Goals / Objectives
The long-term goal of this project is to build analytical tools comprising a digital ecosystem for processing and unifying the data streams of varied agricultural technologies, allowing for the enhancement of models to accelerate fundamental and applied research in crop sciences. The agroecosystems that underpin the world's crop production regions consist of complex interactions involving a plethora of living and nonliving factors from the molecular to community scales. Understanding how the interplay of these factors gives rise to the characteristics (phenotypes or traits) of a crop is challenging because, in part, it requires overcoming the major obstacle of collecting, registering, and fusing many sources and types of data at scale. However, even when this entry barrier is minimized, there are still limitations to the extent to which modelling can extract biologically relevant insights and accurately predict phenotypes at the field or individual plant level within and across environments. Therefore, we have an incomplete understanding of the manner in which genotype, environment, and their interaction produce phenotypes. Although this resultant knowledge gap between genotype and phenotype has been narrowed for important crop species in recent years, its ongoing existence remains a chasm that limits advancements in the genetic improvement and management of crop species. The collection, integration, and modeling of multiple digital data streams from genomes to fields offer opportunities to unlock our ability to genetically dissect and predict plant phenotypes, as well as better manage crops in variable environments.

The goal of this project is to establish an open-source software system to optimally centralize, fuse, and analyze the diverse and unstructured data from different types of field measurements to support biological inquiry and decision making in crop breeding and management. The overall objective of this open-source, cross-functional collaborative project is to create novel computational tools and data analysis approaches needed to transform disparate remote sensing datasets acquired at varying temporal and spatial scales into biological knowledge. To accomplish our overall objective, we are building a digital ecosystem that will integrate diverse remote sensing data types from across multiple scales.

Obj1: Constructing the computational tools to integrate disparate types of data collected at different temporal and spatial scales from crop field experiments.
Obj2: Developing software, which includes the novel functions developed in Obj1, that has the capability to enable processing, aggregation, and storing of large datasets consisting of a multi-dimensional matrix of biological and environmental features.
Obj3: Analyzing the data matrix produced from efforts in Objs 1 and 2 with mechanistic and data-driven models to uncover insights that relate to the genetic dissection and prediction of phenotypes related to productivity and adaptation to environmental changes.

In summary, the outcomes from all three objectives will be used to establish a virtual network of computational tools for aggregating data, providing access to datasets, and supplying data analysis methods to be used by collaborators and enhanced by others in the community to drive decision-making in the breeding and management of crops.
Project Methods
Objective 1: Registering Different Scales and Modalities of Field Data
Data-driven modeling has tremendous potential in agriculture, where the number and variety of factors that affect yield can make it difficult to develop predictive models that generalize and scale. With robotics and sensing technologies becoming more powerful and affordable--from cameras and GPS on mobile phones to a wide range of sensing modalities on satellites, UAVs, and UGVs--a growing number of tools offer efficient ways to collect field data at scale. At the same time, machine learning (ML) methods have proven remarkably effective at turning such data into powerful predictive models. Unfortunately, putting these two pieces together--an abundance of sensing data and ML algorithms that can make sense of it--is uniquely challenging for agricultural models, which depend on factors that span many image modalities. Obj1 focuses on building tools and procedures to register data across these different modalities, leading to larger, more diverse datasets for ML. Specifically, we focus on developing tools to help co-register images, 3D point clouds, and mobile field survey data, which are straightforward to collect but extremely difficult to combine at scale with existing workflows. By solving this registration problem, we can allow spatial and temporal queries across all the collected modalities simultaneously, providing a seamless pipeline to generate multimodal training data for ML.

The key challenge of Obj1 is determining how different representations of field data correspond to one another. For example, a single plant may be represented by pixels in a 2D image, 3D points in a LiDAR scan, and GPS coordinates in GIS data. Each of these modalities provides different information, but using all of them in the same model requires knowing which pixels, 3D points, and GPS coordinates belong to each plant. Inferring this correspondence directly can be difficult or even impossible: a UGV and UAV, for example, may sense or image entirely different parts of each plant, with no recognizable features appearing in both data streams. As current computational tools cannot reliably solve these difficult registration problems, we propose the development of a mobile application to help users capture additional partial measurements of the environment that overlap with each target modality. The data collected by this application can serve as a frame of reference for registration, with color information from video for aligning images, GPS coordinates to align with GIS data, and 3D points for aligning LiDAR scans and other geometry. To test the feasibility of this approach, we developed a prototype mobile application for an initial field study. Our prototype app is preliminary, but with it we verified that multimodal registration is possible.

The reference frame provided by our mobile app makes registration possible, but our current approach for aligning data with this reference involves an expert running multiple pieces of software and custom scripts. To have an impact on the research community, we need to simplify and automate this process, and to provide tools for exploring, analyzing, and visualizing co-registered data. The remainder of Obj1 focuses on this needed development.

Objective 2: Build an Interoperable Digital Ecosystem for Phenomics Data
Proximal sensing provides a powerful and highly scalable approach to collect phenotypes related to plant development, architecture, and responses to biotic and abiotic stressors. However, as with most new high-throughput technologies, the software needed to process, analyze, and store these new data types limits large-scale adoption and impact. In Obj2, the methods being developed in Objs 1 and 3 will be implemented in software to enable processing, analysis, aggregation, and FAIR access to powerful research datasets. This "digital ecosystem" will enable adoption of cutting-edge methods developed as part of this project, as well as reduce barriers to use for more common applications of proximal sensing data being developed by the plant science community. The approach to building the "digital ecosystem" will seek to engage existing communities (BrAPI, NAPPN, MIAPPE) and leverage existing open-source software by expanding functionality, developing standards for interfacing tools and data sources, and building new pipelines to fill existing gaps in functionality.

Objective 3: Mining Biological Insight from the Digital Ecosystem
While Objs 1 and 2 aim to process, integrate, and organize data streams from UAVs and UGVs, generating biological insight from those data will require extraction and analysis of phenotypes that have biological relevance and practical utility. While the tools described in Objs 1 and 2 are under active development, members of our team (Robbins, Gore, Sun, Gage, and collaborators) will evaluate results and provide feedback to improve them. Once the tools are fully developed, end users will include both internal team members and external stakeholders who need to integrate information-rich data streams from multiple sources, allowing for latent space phenotyping, genome-wide association studies, whole-genome prediction, and crop modelling. We envision stakeholders to be those interested in utilizing dense and multi-faceted sources of data in large-scale, field-based evaluation of plants, but who may not have the technical know-how to properly process and integrate data streams from such a wide array of sensors. With information from integrated data streams, stakeholders will be able to make better applied decisions, such as how to respond to disease pressure detected before visible onset, or to more efficiently select promising varieties by predicting grain yield from early-season phenotypic data. Stakeholders will also be able to perform basic research in real field settings. Studies that rely on high-resolution reconstruction of plant architecture or detailed sensor measurements are commonly conducted in controlled growth chambers. Our methods will make it easier to move such experiments into the field. Examples of basic research that will be enabled by this project include measuring the relationship between photosynthetic efficiency and biomass accumulation in different varieties or over time, or obtaining finely detailed evaluation of plant architecture.
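The registration step above ultimately comes down to estimating a rigid transform that maps points in one modality onto matched reference points captured by the mobile app. As a minimal, illustrative sketch (not the project's actual registration pipeline), matched 3D points could be aligned with the Kabsch algorithm; the function and array names below are hypothetical:

```python
import numpy as np

def rigid_align(source_pts, target_pts):
    """Estimate the rotation R and translation t that best map source_pts onto
    target_pts in a least-squares sense (Kabsch algorithm). Both inputs are
    (N, 3) arrays of corresponding points, e.g. LiDAR points and matching
    GPS-derived reference coordinates from the mobile app."""
    src_centroid = source_pts.mean(axis=0)
    tgt_centroid = target_pts.mean(axis=0)
    src = source_pts - src_centroid
    tgt = target_pts - tgt_centroid
    H = src.T @ tgt                                 # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_centroid - R @ src_centroid
    return R, t

# Example use: align a full point cloud to the reference frame once R and t have
# been estimated from a handful of matched points.
# aligned_cloud = cloud_xyz @ R.T + t   # cloud_xyz is an (M, 3) array
```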

Progress 01/15/24 to 01/14/25

Outputs
Target Audience: During the reporting period, our primary target audiences included researchers and growers working to integrate proximal sensors into crop improvement and precision agriculture workflows. These groups were prioritized because they directly benefit from improved tools for sensor integration, streamlined data collection, and reduced computational barriers--core goals of this project. Our work supports efforts to enhance the applicability of proximal sensing in breeding and production, with broader implications for accelerating trait discovery, reducing breeding cycles, and optimizing resource use--factors that influence grower profitability and consumer health and cost outcomes. We also targeted undergraduate and graduate students through formal instruction. The PD and CoPDs integrated project-related content into lectures, exposing students to interdisciplinary approaches in plant science and data-driven agriculture. Graduate students involved in the project received mentorship and experiential learning through hands-on training in field data collection, sensor operation, and analysis workflows.

Changes/Problems: A major challenge in crop modeling is the lack of ground truth data, which hinders the validation and broader application of the remote-sensed trait extraction pipeline. Without ground truth measurements, traits extracted from UAV-based MSI and UGV-based LiDAR datasets collected over the past four years cannot be reliably assessed. To address this, the next reporting period will focus on collecting ground truth data to validate the pipeline and ensure its accuracy before applying it more broadly to existing remote-sensed datasets. Additionally, the performance of the legacy ImageBreed system, which forms the foundation of the digital ecosystem, has required efforts to improve system performance before the new methodologies being developed in this project can be fully implemented.

What opportunities for training and professional development has the project provided?
The interdisciplinary nature of this project has fostered meaningful cross-disciplinary learning and collaboration. Computer scientists have gained insights into agricultural systems, while agricultural researchers have developed familiarity with computational tools and methods. Within the computer science team, varied expertise has supported our objectives: one graduate student focuses on human-computer interaction, another on machine learning and computer vision, and a programmer on software and API development--all essential to advancing project goals. The project has also provided valuable training and professional development across career stages. Graduate students received weekly mentorship and feedback from faculty across disciplines and have had opportunities to present at conferences, mentor undergraduates, and participate in public outreach--building both technical and communication skills. Notably, two Ph.D. students received expert instruction in latent phenotyping, while an early career CoPD has benefited from mentorship and leadership training through close collaboration with senior PDs and CoPDs. Two undergraduate students were mentored through this project. One student conducted a short project from June through early August as part of the Boyce Thompson Institute (BTI) Research Experience for Undergraduates (REU). She collected, processed, and analyzed UAV multispectral images (MSIs) and UGV LiDAR scans, and she presented her work as a poster at the BTI REU Symposium.
A second student joined the project from September through December and worked to collect and process the proximal sensing data and develop a model for extracting latent phenotypes from genomic data. A newly appointed postdoctoral associate has undergone intensive training over the past six months to align with project goals. As part of his development, he earned his FAA Part 107 license on August 11, 2024, qualifying him to operate UAVs and UGVs--key platforms for collecting multispectral imaging (MSI) and LiDAR data. He also attended the International Plant Phenotyping Symposium (October 6-12, 2024), gaining valuable insights into the field, and received hands-on training in fieldwork and computational analysis. Working closely with team members, he has helped refine measurement protocols and process sensor datasets. Beyond field and phenotyping work, the project's software development activities--particularly those related to analysis and tool creation--have offered early career team members training in collaborative software workflows. These include version control, issue tracking, and code review practices, which are essential for careers in data science, software engineering, and related quantitative fields [Objs 1 and 2].

How have the results been disseminated to communities of interest?
We engaged researchers, growers, external collaborators, and potential partners through presentations at key professional meetings. Project findings were shared with members of the Genomes to Fields Initiative at the North American Plant Phenotyping Network (NAPPN) meeting at Purdue University and the 8th International Plant Phenotyping Symposium at the University of Nebraska-Lincoln. Results on the impact of vertical corn architecture on crop productivity were presented at the American Geophysical Union (AGU) Annual Meeting in Washington, DC. A graduate student from PD Gore's lab also presented related work at the NSF STC CROPPS Annual Meeting, which brought together participants from diverse disciplines--including biology, computer science, and engineering. CoPD Gage presented latent phenotyping results as an invited speaker at the 14th Annual UW-Madison Plant Sciences Symposium and in a guest lecture for the NC State course Horticulture Science 590: Plant Phenomics--Fundamentals, Techniques, and AI. PD Gore shared project outcomes with undergraduate interns in the Boyce Thompson Institute's Research Experiences for Undergraduates (REU) program and with industry stakeholders at Cotton Incorporated in Cary, North Carolina. Computer science team members contributed to activities showcased at the Celebration of Modern Ag on the National Mall in May 2024, where project efforts were shared with the general public and government stakeholders. Toward the end of Q2 2024, recurring monthly meetings were initiated with the D2S team and later expanded to include the T3 team [Objs 1 and 2]. These discussions focus on aligning efforts and ensuring compatibility between platforms through API integration to improve usability and accessibility for the broader research community.

What do you plan to do during the next reporting period to accomplish the goals?
In the next project period, we will apply autoencoder and prediction models to all available datasets (2020-2024), conducting cross-validation experiments to assess performance. Neural network-based predictors will be tested alongside existing regression models to capture potential nonlinear relationships among latent phenotypes.
In addition to improving prediction accuracy, we aim to interpret the structure of the latent space and apply our methods to additional locations to explore genotype-by-environment interactions. Ultimately, the developed models and tools will be integrated into the project's software infrastructure and made publicly available to the research community. We will also continue developing and documenting the ImageBreedPy software package in parallel with ongoing field experiments conducted in collaboration with agricultural partners. This work aims to deliver a functional system for processing, analyzing, and storing features extracted from multispectral images and LiDAR data collected via ground-based rovers. The software will be integrated with Jupyter Notebooks for downstream analysis to support plant breeding and genetics research. A software developer will work closely with modelers to define use cases and connect analysis pipelines to the ImageBreedPy system through user interfaces and API calls linking to the D2S and Breedbase backend databases. In summer 2025, we will collect ground truth data to validate the trait extraction pipeline developed during the previous reporting period. This will include key traits such as plant height, leaf area index, and leaf angle.

Impacts
What was accomplished under these goals?
Objective (Obj) 1: Constructing the computational tools to integrate disparate types of data collected at different temporal and spatial scales from crop field experiments.
We are actively translating our collaborative experiments into a user-friendly software package for the broader community. This involves defining formats for different types of sensor data and creating object classes to facilitate common analyses, with a focus on LiDAR and multispectral image data. We are developing Python object classes for fields and plots to provide a unified interface for importing data and running experiments. These classes will be integrated with existing machine learning (ML) frameworks to simplify model training and testing, while data visualization APIs will enable easy exploration and reuse of data and predictions. Key tasks have included hyperparameter searches for ML models--such as exploring different dimensionalities for latent representations of crop data from autoencoders and testing various network sizes--as well as comparing methods like latent representations from autoencoders, PCA, end-to-end regression models, and LSTMs for predicting phenotypes, particularly grain yield. One challenge we have faced is limited data, especially in the time dimension. This issue increases the risk of overfitting and causes misalignment between data from different years. While addressing overfitting remains a key challenge, we have started exploring sequential LSTM models to better capture temporal information and are considering strategies for latent interpolation. We are also exploring ways to incorporate environmental factors, such as growing degree days, to improve the representation of time.

We have collected UGV LiDAR scans and UAV MSIs for maize hybrids planted as part of the Genomes to Fields (G2F) Initiative at Musgrave Research Farm in Aurora, NY. In previous project years, we developed and tested autoencoder models for these data streams. Autoencoders generate reduced representations of input images and use them to reconstruct the originals. We optimized model architecture, image augmentations, and hyperparameters based on image reconstruction quality, then extracted latent phenotypes from the images and integrated them by date. In 2024, we developed and tested prediction models for eight ground truth traits collected for the G2F hybrids, including grain yield, flowering time, and stalk lodging. These models, which used partial least squares and Bayesian regression, were parameterized by MSI, LiDAR, and integrated latent phenotypes, and were compared to a baseline of 49 vegetation indices (VIs).
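To illustrate the kind of unified plot- and field-level interface described above, the following minimal sketch shows one way such object classes might be organized; the class names, attributes, and methods here are hypothetical and do not represent the released package API:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict
import numpy as np

@dataclass
class Plot:
    """Hypothetical container for one field plot's multi-modal observations.
    Arrays are keyed by acquisition date so different sensors can be joined
    on overlapping dates."""
    plot_id: str
    msi: Dict[date, np.ndarray] = field(default_factory=dict)    # (bands, H, W) images
    lidar: Dict[date, np.ndarray] = field(default_factory=dict)  # (N, 3) point clouds
    traits: Dict[str, float] = field(default_factory=dict)       # ground truth, e.g. yield

    def dates_with_both(self):
        """Dates on which both MSI and LiDAR were collected for this plot."""
        return sorted(set(self.msi) & set(self.lidar))

@dataclass
class FieldTrial:
    """Hypothetical collection of plots for one trial, location, and year."""
    name: str
    plots: Dict[str, Plot] = field(default_factory=dict)

    def training_pairs(self, trait: str):
        """Yield (plot, dates) pairs usable for building ML training examples."""
        for plot in self.plots.values():
            if trait in plot.traits and plot.dates_with_both():
                yield plot, plot.dates_with_both()
```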
Objective 2: Developing software, which includes the novel functions developed in Objective 1, that has the capability to enable processing, aggregation, and storing of large datasets consisting of a multi-dimensional matrix of biological and environmental features.
To support the project's objectives (Objs 1 and 2), we continued developing a more modern version of the image processing pipeline and software. Efforts were focused on two primary areas: the first was maintaining and supporting the existing ImageBreed software system for processing multispectral images in support of modeling efforts, and the second was developing the new ImageBreedPy system to improve performance and functionality. To implement the new system, we modularized the image analysis functionality of the legacy ImageBreed system and separated it from the backend database used to store experimental information and extracted phenotypes. As part of the modularization effort, we leveraged existing community tools compatible with ImageBreedPy, selecting the recently developed data-to-science platform (d2s) for the orthomosaic storage component of the application. By building new functionality on top of d2s, we accelerated development toward Obj 1 and Obj 2, contributing to the broader community. We also initiated the development of hybrid web app/notebook widget applications using React and Python to assist with plot splitting and assignment, which will be integrated into ImageBreedPy and d2s. Additionally, we created Python notebooks to automate the bulk loading of image data into the image management platform. As part of developing client-side analysis tools to support Obj 3, we laid the groundwork for a modular API client, which can query LiDAR data by location and date, retrieve point clouds via streaming, and perform queries by field and date to retrieve files. Key accomplishments include:
- Optimizing ImageBreed's image analysis pipelines for performance (Objs 1 and 2)
- Establishing communication with the D2S team (Objs 1 and 2)
- Reviewing D2S code for compatibility with the existing code base and evaluating performance (Obj 2)
- Deploying D2S instances on the cloud and test servers (Objs 1 and 2)
- Evaluating D2S and providing bug reports to the D2S main project developer (Objs 1 and 2)
- Initiating UI component development for data management and processing tools (Obj 2)
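A minimal sketch of the kind of modular API client described above is shown below for illustration; the endpoint paths, query parameters, and class name are hypothetical placeholders rather than the actual D2S or ImageBreedPy API:

```python
import requests

class LidarClient:
    """Illustrative client for querying point-cloud files by field and date.
    All endpoint paths and parameter names below are hypothetical placeholders."""

    def __init__(self, base_url: str, token: str):
        self.base_url = base_url.rstrip("/")
        self.session = requests.Session()
        self.session.headers["Authorization"] = f"Bearer {token}"

    def list_scans(self, field_name: str, acquisition_date: str):
        """Query available LiDAR scans for a field and date (YYYY-MM-DD)."""
        resp = self.session.get(
            f"{self.base_url}/lidar/scans",
            params={"field": field_name, "date": acquisition_date},
        )
        resp.raise_for_status()
        return resp.json()

    def download_point_cloud(self, scan_id: str, out_path: str):
        """Stream a point cloud to disk without loading it all into memory."""
        with self.session.get(
            f"{self.base_url}/lidar/scans/{scan_id}/pointcloud", stream=True
        ) as resp:
            resp.raise_for_status()
            with open(out_path, "wb") as fh:
                for chunk in resp.iter_content(chunk_size=1 << 20):
                    fh.write(chunk)
```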
Objective 3: Analyzing the data matrix produced from efforts in Objectives 1 and 2 with mechanistic and data-driven models to uncover insights that relate to the genetic dissection and prediction of phenotypes related to productivity and adaptation to environmental changes.
We first focused on a single environment, assessing the LiDAR and MSI latent phenotypes for biologically relevant information, and found heritabilities up to ~0.9. For Bayesian regression models predicting plot-level phenotypes using a baseline of vegetation indices (VIs), prediction accuracies ranged from ~0.26 for stalk lodging to ~0.85 for days to anthesis. Integrated latent features consistently outperformed VIs for all traits tested, except flowering time and grain moisture. Prediction accuracy improvements ranged from a decrease of ~8.8% for grain moisture to an increase of ~19.0% for ear height, with an overall average increase of ~4.6%. When compared to MSI and LiDAR latent phenotypes alone, integrated latent features resulted in ~5.1% and ~20.8% accuracy gains, respectively. To assess the predictive ability of integrated latent features pre-flowering, we added time points iteratively until all collection dates were included in the model training. Prediction accuracies increased with additional data, but gains were more moderate for end-of-season traits like grain yield, suggesting potential for pre-flowering forecasting.

We made significant progress in developing a remote-sensed trait extraction pipeline, which is key for identifying biologically meaningful phenotypes from maize plots. This pipeline processes UAV-based MSI and UGV-based LiDAR datasets to extract important traits such as plant height, leaf area index, and leaf angle, connecting remote sensing data to maize physiology. This framework paves the way for integrating these traits into mechanistic and data-driven crop growth models, improving predictions of phenotypic responses to environmental variation and enhancing productivity and adaptation modeling. Next steps will focus on validating the extracted traits with ground truth data to ensure accuracy before applying them to previously collected datasets.

Additionally, a graduate student has integrated 3D radiative transfer models (RTM), Terrestrial Lidar Scanning (TLS), and solar-induced chlorophyll fluorescence (SIF) to monitor and simulate photosynthesis and crop yield, providing valuable insights for breeding. Progress has been made in quantifying how corn canopy architecture (e.g., leaf angle, LAI) influences photosynthesis by combining TLS and 3D RTM DART (Discrete Anisotropic Radiative Transfer) model simulations. We have also explored how vertical variation in physiological processes, such as energy partitioning among photochemistry, non-photochemical quenching, and chlorophyll fluorescence emission, impacts photosynthesis and yield.
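As a simple illustration of the kind of trait extraction this pipeline performs, and not the validated pipeline itself, a plot-level plant height estimate could be derived from a LiDAR point cloud by contrasting low and high height percentiles; the percentile thresholds below are placeholder values:

```python
import numpy as np

def canopy_height_from_points(points_xyz: np.ndarray, ground_pct=2.0, top_pct=98.0):
    """Illustrative plant-height estimate from a plot's LiDAR point cloud:
    approximate the ground as a low height percentile and the canopy top as a
    high percentile, then take the difference. points_xyz is (N, 3) in meters."""
    z = points_xyz[:, 2]
    ground = np.percentile(z, ground_pct)   # robust ground estimate
    top = np.percentile(z, top_pct)         # robust canopy-top estimate
    return float(top - ground)
```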

Publications


    Progress 01/15/23 to 01/14/24

    Outputs
Target Audience: The PD and CoPDs delivered presentations on the project's research endeavors and their significance in various lectures to both undergraduate and graduate students. Additionally, project findings were disseminated to external collaborators and potential partners, including members of The Genomes to Fields Initiative at the North American Plant Phenotyping Network (NAPPN) meeting. PD Gore also engaged researchers at the minority serving institutions Delaware State University and Tuskegee University to establish new partnerships and training opportunities in digital biology. The lab of PD Gore hosted a Ph.D. level visiting scientist from Senegal to provide him with hands-on experience in applying digital tools to process and analyze proximal sensing data collected from crops. Moreover, a graduate student from PD Gore's Lab showcased analyzed data at the NSF STC CROPPS Annual Meeting in October 2023, hosted on the campus of the University of Illinois Urbana-Champaign. The meeting attracted attendees from diverse disciplines, including biologists, computer scientists, and engineers, representing institutions such as Cornell University, University of Illinois Urbana-Champaign, Boyce Thompson Institute, and the University of Arizona.

Changes/Problems: In preparation for implementing the new APIs planned for the ImageBreed tool, an effort was made to identify and merge code differences between ImageBreed and its parent, BreedBase. These differences were contributed back to BreedBase as a separate branch. As part of this work, the team determined that separating the functionality from ImageBreed as a BrAPP was more straightforward with a Python implementation, ImageBreedPy, that focuses on ImageBreed's multi-spectral analysis functionality.

What opportunities for training and professional development has the project provided?
This project has provided graduate students with training from experts in latent phenotyping. Furthermore, graduate students working on this project have had the opportunity to work closely with faculty, receiving consistent guidance and feedback on a weekly basis. In addition, graduate students have presented their work at conferences and several Cornell seminar series. This project has also provided PI Gage with training in mentorship and leadership through interactions with more senior PIs Gore, Robbins, Davis, and Sun.

How have the results been disseminated to communities of interest?
A graduate student presented their findings from this project at the North American Plant Phenotyping Network Annual Conference in February 2023, the National Association of Plant Breeders Annual Meeting in July 2023, and the Center for Research on Programmable Plant Systems Annual Meeting in October 2023. Changes from the ImageBreed (legacy) code base were contributed back to BreedBase's code repository as a separate branch: https://github.com/solgenomics/sgn/tree/topic/imagebreed. The ImageBreed BrAPP (ImageBreedPy) is being developed, and will continue to be developed, in the following repository: https://github.com/agostof/ImagebreedPy
Progress on ImageBreed was presented at PAG 2024. Session: BreedBase: Genomics and Breeder Data Management and Analysis Tools. Title: ImageBreed: Open-Source Software to Process, Store, and Analyze Image-Based Phenotypes. https://plan.core-apps.com/pag_2024/abstract/c0e864ce-8e15-4295-869f-b66aafbaa1ba

What do you plan to do during the next reporting period to accomplish the goals?
For the next reporting period we plan to:
- Modularize ImageBreedPy further to make UAV/UGV data integration easier.
- Continue work on the LIDAR module to provide point cloud upload capabilities within the BrAPP.
- Continue support of the original ImageBreed (legacy) while we prepare ImageBreedPy for broader use.
- Explore and develop queries that take advantage of multi-spectral images and LIDAR on the same platform through a unified API.
- Continue developing our multi-modal analysis tools and aggregate these in a form that can be disseminated for others to use across our research communities.
- Develop visualization and interaction tools to integrate our work and make it more accessible and applicable to others.
- Train the lidar and MSI autoencoder models on subsequent years of data, testing different cross-validation schemes.
- Train the prediction model on additional phenotypes and compare model accuracies for each sensor separately as well as their integration.
- Further expand on the regression models for prediction, comparing those models with machine learning and extracting meaning from the latent space.
- Publicly release prediction methods to the broader community, integrated with additional tools to store and process data that are currently being developed.

    Impacts
What was accomplished under these goals?
Advances in artificial intelligence have revolutionized many industries, but much of the recent progress has not reached applications in agriculture. However, progress has been accelerating on this front with the development of new tools to analyze field data and with applications such as precision agriculture. In parallel, the availability and variety of proximal remote sensors and platforms for high-throughput phenotyping have increased rapidly in recent years, with applications to breeding and trait discovery. Data can be captured across various scales and modalities, each providing unique information in its stream. However, these are often used separately for characterizing phenotypes, and methods of data processing and analysis tend to be crop, phenotype, sensor, and platform specific, limiting the applicability of developed methods. Therefore, the goal of this project is to integrate disparate proximal remote sensing data to improve characterizations of plant phenotypes, such as grain yield. To achieve this goal, we used unsupervised, deep learning methods to extract latent phenotypes from both unoccupied aerial vehicle (UAV) collected multispectral images (MSIs) and unoccupied ground vehicle (UGV) collected lidar scans. The latent spaces are reduced representations of the original images, enabling the integration and downstream prediction of agronomic phenotypes.

Obj1: UGV lidar scans and UAV MSIs were collected for the G2F fields planted at Musgrave Research Farm in Aurora, NY from 2018 through 2023. Here, we focus on the integration of these two proximal remote sensors. In 2022, we developed and trained an autoencoder model to extract latent phenotypes from UAV MSIs collected over about 10 dates in 2020. Autoencoders are a type of unsupervised, deep learning method, which learns a reduced representation of the original object and then tries to reconstruct the original based on the reduced representation. This reduced representation constitutes the latent space, which gives "latent phenotypes" for each image. In 2023, we optimized this model by changing the model architecture to ResNet-18 and applying image augmentations. We trained the ResNet-18 model with a reconstruction loss, which asks the model to reproduce the input MSI from a compressed representation. We have also trained the same ResNet-18 model on two-dimensional density maps of the three-dimensional lidar point clouds. This allowed us to extract latent phenotypes from each data stream and concatenate these by overlapping dates, thus achieving integration. In addition, we have written and trained a preliminary supervised machine learning model to predict grain yield from the latent phenotypes. This model has been trained for each sensor separately, as well as their integration. We are currently working to incorporate the uncertainty of the LiDAR measurements into the data. Also, we are looking at incorporating domain-specific loss functions for our model that can help learn more useful representations for phenotyping. Given that our current approach loses all 3D information due to projection, we are looking at designing a new model that takes advantage of this 3D information.
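The two-dimensional density maps mentioned above can be illustrated with a small sketch that bins point-cloud returns on the ground plane; this is a simplified stand-in for our preprocessing, and the bin counts, extent handling, and normalization are placeholder choices:

```python
import numpy as np

def lidar_density_map(points_xyz: np.ndarray, bins=(94, 396), extent=None):
    """Illustrative projection of a 3D point cloud to a 2D density map:
    count the points falling in each ground-plane cell, then normalize to
    [0, 1] so the map can be fed to the same image autoencoder.
    points_xyz is an (N, 3) array of x, y, z coordinates."""
    x, y = points_xyz[:, 0], points_xyz[:, 1]
    if extent is None:
        extent = [[x.min(), x.max()], [y.min(), y.max()]]
    density, _, _ = np.histogram2d(x, y, bins=bins, range=extent)
    return density / max(density.max(), 1.0)
```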
Obj2: To support the project's Objs 1 and 2, a copy of the ImageBreed application was set up on two main computer systems: one within our high-performance computing infrastructure, and another, a development version, installed on a workstation within the laboratory. Assistance was provided to users engaged with the platform. A post-doctoral researcher in our lab used ImageBreed to analyze multi-spectral images from his experiments and provided feedback that was crucial for identifying possible shortcomings in the application and resolving bugs. The development work toward creating a lighter, containerized version of the ImageBreed application continued during the current reporting period to support Objs 1 and 2. This effort has been tentatively named ImageBreedPy. ImageBreedPy recreates and modularizes ImageBreed's functionality in a modern framework, making it suitable for expansion by other developers from the community. The application has prototype implementations of the BrAPI-like APIs that were created as part of this project. A pipeline consisting of Python and shell scripts was developed for uploading LIDAR data into the point cloud database to make it available to ImageBreedPy. The APIs and supporting infrastructure will enable the creation of downstream analysis tools that support Obj 3.

Obj3: To elucidate underlying meaning contained within the latent spaces extracted from both the MSIs and lidar scans, we performed principal component analysis on each. PCA plots showed clustering by date of data collection for both data streams, and the PCA plot of the MSI latent space further showed clustering by normalized difference vegetation index (NDVI). In addition, we are exploring regression models for phenotype prediction to further probe the latent space and compare against black-box machine learning methods. We trained models to predict grain yield from MSIs and projected LiDAR data independently, then used the concatenated feature representation. We found that the projected lidar data correlates much more strongly with grain yield, which makes sense since it contains more detailed information, while MSIs only show a very coarse aerial view of the plot. However, using both representations allows for better performance than each modality independently, providing support that both are useful together.

We integrated Solar-Induced Fluorescence (SIF), a mechanistic photosynthetic tracer that can be remotely sensed, with a 3D plant radiative transfer model, DART (Discrete Anisotropic Radiative Transfer), to understand how vertical variations of plant physiological and structural traits can impact crop productivity. Such understanding will inform crop breeding on key physiological and structural traits that should be prioritized in the future to enhance productivity. Specific tasks included the following: At the leaf level, we collected a suite of datasets characterizing leaf properties, including leaf greenness (NDVI), absorptance, reflectance, transmittance, leaf mass and size, biochemical parameters, leaf fluorescence parameters, and incident Photosynthetically Active Radiation (PAR) at various leaf positions. This data collection spanned from the top to the bottom of canopies in a corn field at the Musgrave Research Farm over four weeks following the peak growth in 2023. We then incorporated these vertical variations of leaf properties into DART to simulate vertical profiles for SIF and GPP via the mechanistic light response model described by Co-PI Sun's lab (Han et al., 2022). Our analysis revealed that a significant proportion of SIF and GPP activity is concentrated in the canopy's middle layers, where corn ears develop.
Furthermore, we observed that the size of the leaves and the distribution of radiation within the canopy significantly influence photosynthesis rates within corn canopies. Stress detection with remote sensing measurements: We examined the differences in leaf-level NDVI and the reflectance of the Red and NIR bands across the vertical profiles of the canopy. We then compared these leaf-level measurements with canopy-level measurements collected by a UAV equipped with hyperspectral spectrometers and by satellite platforms (Sentinel-2 and Planet). Our findings suggest that canopy-level observations fail to detect the early stages of corn senescence, which start at the bottom of the canopy. This is primarily caused by the obstruction of upper-canopy leaves and the interference of the ground below in the above-canopy measurements (by UAV and satellite platforms). Therefore, traditional optical remote sensing measurements at the canopy scale might overlook critical phenological and phenotypic variations in crops.
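Returning to the Obj3 latent-phenotype analysis above, the per-modality versus integrated comparison could be set up along the following lines; this is a hedged sketch that assumes latent features have already been extracted and matched to plots, and the array names, model choice, and fold count are illustrative rather than our exact procedure:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

def compare_modalities(msi_latent, lidar_latent, yield_t_ha, n_components=10):
    """Compare grain-yield prediction from MSI latents, LiDAR latents, and
    their concatenation. msi_latent and lidar_latent are (n_plots, n_features)
    arrays aligned by plot; yield_t_ha is an (n_plots,) vector."""
    fused = np.concatenate([msi_latent, lidar_latent], axis=1)  # integrated features
    scores = {}
    for name, X in {"MSI": msi_latent, "LiDAR": lidar_latent, "Integrated": fused}.items():
        model = PLSRegression(n_components=min(n_components, X.shape[1]))
        # 5-fold cross-validated R^2 as a simple accuracy proxy
        scores[name] = cross_val_score(model, X, yield_t_ha, cv=5, scoring="r2").mean()
    return scores
```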

    Publications


      Progress 01/15/22 to 01/14/23

      Outputs
Target Audience: The PD and CoPDs presented the project's research efforts in several lectures to undergraduate and graduate students in courses at Cornell University. Project findings and discussions were shared with external collaborators and potential partners that are members of The Genomes to Fields Initiative. A graduate student in PD Gore's Lab presented analyzed data at the NSF STC CROPPS Annual Meeting in October 2022, held on Cornell's Ithaca campus. Meeting attendees were biologists, computer scientists, and engineers from Cornell University, University of Illinois Urbana-Champaign, Boyce Thompson Institute, and the University of Arizona.

Changes/Problems: We changed the order of some of our goals related to Objective 1. Rather than focusing on registration as a first step, we are focusing on how to build models that can learn from more arbitrary (potentially unregistered) data. The reason for this change is three-fold. First, we observed that it can provide more immediate use for the community. Second, it allows us to start building actual models sooner, which we hope will make it easier to integrate with our work on the other objectives. And third, it can provide a baseline of comparison for the development of more complicated approaches that rely on more advanced registration. The team originally planned to implement and test the new APIs in the ImageBreed tool. However, after a design discussion, it was determined that ImageBreed needed some major refactoring before additional APIs can be added effectively. The testing of the new APIs has been de-prioritized until the work on ImageBreed is finished. The lidar data collected by the unoccupied ground vehicle (UGV) and the stationary terrestrial lidar scanning (TLS) show considerable disparity in comparison with hand-held measurements (as reference). We explored a number of approaches and algorithms to derive meaningful values of leaf area index (LAI) from the original point cloud data from lidar.

What opportunities for training and professional development has the project provided?
This project has provided the opportunity for training and mentorship of two graduate students (Computer Science and Plant Breeding) through weekly project meetings with faculty. Students received consistent guidance and feedback in model training, development, and evaluation. A graduate student (Soil and Crop Sciences) used a dataset generated by this project for a guest lecture given in a course at Cornell University. Additionally, this project has provided CoPD Gage, as a new assistant professor, with professional development in remote project management and mentorship through his weekly meetings with Plant Breeding and Computer Science graduate students based at Cornell. He has also received mentoring from PD Gore on the process of recruiting and interviewing postdoctoral candidates.

How have the results been disseminated to communities of interest?
2022 Aurora Farm (Field Crops) Field Day. Date: Thursday, August 18, 2022. Location: Cornell University's Musgrave Research Farm in Aurora, NY. CoPD Robbins and a postdoctoral associate demonstrated proximal sensing with unoccupied aerial and ground vehicles to attendees that included Cornell faculty, staff, and students, as well as local growers from the surrounding farming communities. ImageBreed software documentation was provided to the plant breeding and digital biology communities: all the API documentation developed during this reporting period is publicly available on SwaggerHub at this URL: https://app.swaggerhub.com/apis/PlantBreedingAPI/ImageBreed/0.1

What do you plan to do during the next reporting period to accomplish the goals?
- We plan to turn the developed components into easily installable Python modules that support integration with BreedBase. The high-level goal is to build infrastructure that helps standardize data from different sensor modalities that may be useful for training plant prediction models and facilitates the sharing of those data.
- We plan to continue collecting UAV and UGV data on maize germplasm in NY and in locations that differ from those collected by the Cornell groups, which can be used for strengthening or testing developed tools.
- We will conduct further analyses to dissect the latent codes extracted from multispectral images. We also plan to construct and train an analogous model on the lidar point cloud data. The latent codes extracted from the MSIs and lidar scans will then be used to train a final neural network to predict phenotypes of interest, such as grain yield in maize. Accuracy of this model will be compared for each data stream used separately and together.
- We will conduct genotype-specific DSSAT model development (calibration and validation).
- We plan to integrate lidar, hyperspectral, and solar-induced fluorescence within a canopy radiative transfer model, which will be integrated with DSSAT.
- We will refactor ImageBreed to make it compatible with other software tools using the API.
- We plan to define, implement, and test a BrAPI specification compatible with imaging and LiDAR data.

      Impacts
What was accomplished under these goals?
Obj1: Constructing the computational tools to integrate disparate types of data collected at different temporal and spatial scales from crop field experiments.
Overview: We are building a framework to support the training of multi-modal phenotype prediction models. Our goal here is to provide the community with a flexible toolbox of code for preparing and training models on field data from different sensor modalities. At a high level, our toolbox provides three types of components. The first two are for modality-specific data prep (e.g., registering, normalizing, and augmenting training images or point clouds) and modality-specific feature learning. The third type of component is for training prediction models on features derived from different modalities. Our first milestones here are to add components that support data prep and self-supervised feature learning for each modality. We have completed a first pass of these components for multispectral image data. Here, our data prep component takes plot images as input and offers per-channel data whitening and basic image augmentation strategies. For feature learning, we use an autoencoder as it makes very few limiting assumptions about the training data.

We focus on two remote sensors and platforms: unoccupied aerial vehicle (UAV) collected multispectral images (MSIs) and unoccupied ground vehicle (UGV) collected lidar scans. Both were collected for about 1,000 maize hybrid plots in summer of 2022 at Musgrave Research Farm in Aurora, NY. Lidar scans were collected weekly from June through August, with a final date immediately prior to harvest. MSIs were collected weekly from July through mid-October and were collected on the same day as lidar scans for overlapping weeks. We also trained an autoencoder model on the MSIs and extracted latent phenotypes. Autoencoders are a type of unsupervised, deep learning model, where the original image is passed through a convolutional neural network to produce the latent codes. The latent codes are then passed through a second neural network to reproduce the original image, and the model is optimized to minimize the difference between the original and the reconstruction. In this way, autoencoders perform dimensionality reduction, filtering out noise to extract the informative features from the images. The autoencoder model was trained on MSIs collected for 4,990 maize hybrid plots in 2020. In our model, the original images of size 5x396x94 (186,120 values) are transformed to vectors of length 256. We showed that the latent vectors were an efficient representation of the original images.
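For illustration, a convolutional autoencoder that compresses a 5 x 396 x 94 plot image to a 256-dimensional latent vector could be sketched as follows; this is a simplified stand-in for the project's model (which later adopted a ResNet-18 backbone), and the layer sizes are placeholder choices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PlotAutoencoder(nn.Module):
    """Illustrative autoencoder compressing a 5-band plot image (5 x 396 x 94)
    to a 256-d latent vector and reconstructing the image from that vector."""

    def __init__(self, bands=5, latent_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(bands, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((6, 3)),            # fixed spatial size before flattening
            nn.Flatten(),
            nn.Linear(128 * 6 * 3, latent_dim),
        )
        self.decoder_fc = nn.Linear(latent_dim, 128 * 6 * 3)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, bands, 4, stride=2, padding=1),
        )

    def forward(self, x):
        z = self.encoder(x)                           # latent phenotypes
        h = self.decoder_fc(z).view(-1, 128, 6, 3)
        recon = self.decoder(h)                       # coarse reconstruction
        # resize to the exact input size so the reconstruction loss is well defined
        recon = F.interpolate(recon, size=x.shape[-2:], mode="bilinear", align_corners=False)
        return recon, z

# Training step (illustrative): minimize pixel-wise reconstruction error.
# model = PlotAutoencoder(); recon, z = model(batch); loss = F.mse_loss(recon, batch)
```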
Obj2: Developing software, which includes the novel functions developed in Obj1, that has the capability to enable processing, aggregation, and storing of large datasets consisting of a multi-dimensional matrix of biological and environmental features.
ImageBreed is a breeding database with functionality to process, analyze, and store high-throughput phenotypes. It is a central component of the digital ecosystem the project seeks to create in Objs 1 and 2. During this reporting period, the existing ImageBreed API has been documented in SwaggerHub following the OpenAPI3 standard documentation. A new, BrAPI-like API specification has been developed for image data and generic image analysis. Developing and documenting these API specifications facilitates connecting and integrating software created in this project to achieve a cohesive goal, contributing directly to Objs 1 and 2 and providing support for downstream analyses in Obj 3. Initial work was done to implement this API using Python; data models and function stubs were defined following the API specification. The API specifications and their software implementation will enable image analysis tools to connect with genomics, phenotypic, germplasm, and experimental data stored in widely adopted and emerging plant breeding and genetics databases like BMS, BreedBase, DeltaBreed, and Gigwa. The functionality of ImageBreed was expanded by implementing new pipelines for the analysis of lidar data, in addition to existing functionality for processing and analyzing multi-spectral images. The interoperability made possible by BrAPI gives analysis tools the ability to access the experimental data needed to process image and lidar data. The integration of these tools via BrAPI fosters the creation of an ecosystem of scientific tools instrumental in enabling the downstream predictive modeling, which is critical for completing Obj 3.

Obj3: Analyzing the data matrix produced from efforts in Objs 1 and 2 with mechanistic and data-driven models to uncover insights that relate to the genetic dissection and prediction of phenotypes related to productivity and adaptation to environmental changes.
We performed initial analyses of the information contained in the latent vectors. Principal component analysis plots show clustering by days after planting (DAP) and normalized difference vegetation index (NDVI), a vegetation index indicative of plant health. Using the subset of 25 principal components explaining the highest variance in the latent codes, we trained a support vector machine to classify the observations by date label, with about 92% accuracy. Finally, linear regression was used to predict NDVI from the first 25 PCs. Based on the results for the test set, the correlation between the actual and predicted values was 0.89. This shows that the latent vectors contain biologically relevant information.

We have also initiated crop modelling efforts in maize, as eventually the latent codes will be included in crop models. At the regional scale, we used DSSAT to disentangle individual drivers of corn yield growth from 1950 to 2020 over the Corn Belt. We first validated our regional-scale simulation with the USDA NASS dataset (the baseline simulation). We then ran scenario simulations by fixing one factor at a time; these factors include radiation use efficiency (RUE), planting density (PD), fertilizer (Fert), genetics (all related coefficients), and leaf angle (xl). We found that N fertilizer and genetic improvement have a similar level of impact on the growth trend of corn yield at the regional scale. At the field scale, we utilized two spectrometers onboard a UAV to concurrently collect surface reflectance (to calculate a vegetation index) and solar-induced chlorophyll fluorescence (SIF). We found no significant differences among three N treatment plots when N supply was beyond the optimal fertilizer rate. The lidar data collected from the mobile rover and the stationary terrestrial lidar scanning (TLS) show considerable disparity in comparison with hand-held measurements (as reference). We explored several approaches and algorithms to derive meaningful values of LAI from the original point cloud data from lidar.
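The Obj3 latent-space probing described above (PCA, followed by date classification and NDVI regression on the top principal components) could be condensed into a sketch like the following; the array names and train/test split are illustrative, and the reported accuracy and correlation figures come from the project's own analysis, not this code:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def probe_latent_space(latents, date_labels, ndvi, n_pcs=25):
    """Project latent vectors onto the top principal components, then ask how
    well those PCs recover the collection date (SVM classifier) and NDVI
    (linear regression). latents: (n_plots, latent_dim);
    date_labels: (n_plots,); ndvi: (n_plots,)."""
    pcs = PCA(n_components=n_pcs).fit_transform(latents)
    X_tr, X_te, d_tr, d_te, n_tr, n_te = train_test_split(
        pcs, date_labels, ndvi, test_size=0.2, random_state=0
    )
    date_acc = SVC().fit(X_tr, d_tr).score(X_te, d_te)          # classification accuracy
    ndvi_pred = LinearRegression().fit(X_tr, n_tr).predict(X_te)
    ndvi_corr = np.corrcoef(n_te, ndvi_pred)[0, 1]              # Pearson correlation
    return date_acc, ndvi_corr
```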

      Publications