Source: OCEANIT LABORATORIES INC submitted to NRP
RAPID - ROOT ANALYSIS FOR PLANT IMPROVEMENT AND DESIGN
Sponsoring Institution
National Institute of Food and Agriculture
Project Status
ACTIVE
Funding Source
Reporting Frequency
Annual
Accession No.
1033872
Grant No.
2025-33610-44923
Cumulative Award Amt.
$174,999.00
Proposal No.
2025-00457
Multistate No.
(N/A)
Project Start Date
Aug 15, 2025
Project End Date
Apr 14, 2026
Grant Year
2025
Program Code
[8.13] - Plant Production and Protection-Engineering
Recipient Organization
OCEANIT LABORATORIES INC
828 FORT STREET MALL
HONOLULU, HI 96813
Performing Department
(N/A)
Non Technical Summary
Modern plant breeding for commercial agriculture is an essential tool for maintaining the supply of food, food stocks, and biofuels for both our nation and the larger world. Even a modest improvement in the agronomic performance and yield of a commercial crop can have a significant impact. However, there is a key area in which plant breeding efforts have lagged: the root system. Roots play a vital role in water uptake, nutrient absorption, and stress resilience, yet the effort required to study them in situ has limited phenotyping at scale.

Traditionally, root phenotyping has been labor-intensive and difficult to scale, relying on manual root tracing techniques that require trained plant scientists and countless hours to complete. Root data from a single test plot can take months to process. Due to this labor and time constraint, root phenotyping with legacy methods is inherently incompatible with modern plant breeding programs.

The Rhizomatic application provides a major step forward in analyzing roots using a combination of deep learning AI and classical image processing techniques; however, it was designed for modest academic studies rather than large-scale breeding studies. In this effort, it will be expanded to support these large datasets by transitioning to a multi-processing cloud service, developing new characterization tools, and optimizing the processing of root images.

The end result of this project will be a user-friendly tool that can automatically characterize roots in thousands of images efficiently, replacing months of labor with mere hours of processing and enabling plant breeders to produce new varieties with positive root traits. These crops will be more productive and resilient, able to withstand drought and disease and to grow quickly and in adverse environments, leading to better and more stable yields on America's farms.
Animal Health Component
25%
Research Effort Categories
Basic
0%
Applied
25%
Developmental
75%
Classification

Knowledge Area (KA) | Subject of Investigation (SOI) | Field of Science (FOS) | Percent
204                 | 2499                           | 1081                   | 50%
203                 | 2499                           | 1081                   | 25%
202                 | 2499                           | 1081                   | 25%
Goals / Objectives
RAPID will enhance Rhizomatic, an existing automatic root characterization platform, to provide tools that support large-scale root detection and characterization for the high-throughput phenotyping used by modern plant breeding programs. This will close the gap between above-ground and below-ground plant phenotyping in order to fully optimize crop performance, creating more productive and resilient plants to improve and safeguard United States agriculture.

Phase I technical work will focus on building on the Rhizomatic infrastructure to support the tools and dataset sizes required for commercial breeding operations:
- Gathering and preparing representative manual root tracing data.
- Expanding the machine learning pipeline for very large datasets.
- Adding tools and interfaces for high-throughput phenotyping.
- Managing throughput and reliability with cloud resources.
- Reviewing tools and detection results with end users.

Task 1 will involve the selection and preparation of root data in support of the other tasks. Large traced root datasets will be a key resource for building and evaluating better machine learning models and for guiding development of tools and interfaces for scientific plant breeders. This will include gathering the data for processing, creating software as needed to accommodate the data format, and filtering and correcting errors as appropriate for validation. Our initial milestone for Task 1 will be ingesting data from Dr. Leakey's massive minirhizotron installation at the University of Illinois, but we may seek out additional data as the project progresses, as needed, to validate our process on different species and formats.

Task 2 will focus on improving the machine learning pipeline to process and exploit very large datasets. This task will include determining how much and what data should be involved in training efforts, methods to distribute processing among multiple CPUs and GPUs for efficient processing, and methods to fail fast and provide feedback if training is not proceeding as expected. Milestones will include the deployment of each feature. New data selection methods and distributed processing will each be compared against the baseline full-dataset, non-distributed method to measure training time and model performance.

Task 3 will focus on developing tools and interfaces that provide scientists with the information they need to better phenotype and compare roots in large datasets, and on ensuring those tools are easy to understand and use. It is expected that different researchers will have different tool, interface, and data requirements for their research, so this task will include discussions with our network of root researchers and potential commercial customers. Examples of improvements that would be performed under this task are comparative measures of traits between different test groups and improvements to the UI for efficiently browsing large datasets. Success in this task will be qualitative, as supported by demonstrations and interviews with interested parties or by review in Task 5.

Task 4 will focus on exploring methods for integrating cloud computing to efficiently handle the load expected from large datasets. We will explore both full-cloud solutions, where the entire processing infrastructure is managed by on-demand cloud processing, and hybrid-cloud solutions, where less expensive local resources process most workloads and particularly large jobs, or jobs that overflow the available local capacity, are transferred to on-demand cloud processing (a minimal routing sketch follows at the end of this section).
Milestones for Task 4 will include implementation of cloud and hybrid-cloud processing frameworks. Each solution will be evaluated based on the tradeoff between speed and cost of processing, with qualitative judgments of software reliability related to system complexity and the ability to withstand disruptions due to network or power outages.

Task 5 will involve collaborating with our research partners and other interested parties to test and review new features, metrics, and user experiences. Additionally, because human root tracing is fallible and does not necessarily provide perfect truth data for comparison, Task 5 will review disagreements between predictions and tracings to determine whether the prediction or the tracing was correct. Task 5 will include user interviews and demonstrations to fine-tune solutions to best support user workflows and needs.
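To illustrate the hybrid-cloud overflow concept described under Task 4, the following is a minimal sketch of a routing decision that keeps most work on less expensive local resources and diverts oversized or overflow jobs to on-demand cloud processing. The `Job` fields, thresholds, and `submit_local`/`submit_cloud` callables are hypothetical placeholders for illustration, not part of the existing Rhizomatic codebase.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Job:
    job_id: str
    image_count: int  # number of root images in this processing job

def route_job(job: Job,
              local_queue_depth: int,
              submit_local: Callable[[Job], None],
              submit_cloud: Callable[[Job], None],
              max_local_queue: int = 10,
              large_job_threshold: int = 5000) -> str:
    """Route a job to local workers by default; send unusually large jobs,
    or jobs that would overflow the local queue, to on-demand cloud compute."""
    if job.image_count > large_job_threshold or local_queue_depth >= max_local_queue:
        submit_cloud(job)
        return "cloud"
    submit_local(job)
    return "local"

# Example usage with stand-in submit functions:
# destination = route_job(Job("plot-17", 12000), local_queue_depth=3,
#                         submit_local=print, submit_cloud=print)
```

The same decision point could also weigh estimated cost per job, which is one of the evaluation criteria named for Task 4.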
Project Methods
In general, where results may include some aspect of randomness, such as random sampling or random model initialization, we will avoid capturing only a single example. Multiple trials will be used and statistics for the results calculated, such as standard deviation and p-value (see the statistics sketch at the end of this section).

In Task 1 we will perform basic counts of images and tracing prevalence to better understand the data distribution, but other than a general preference for large datasets, there is not a quantitative measure of success, as real datasets can come in a variety of forms. We will survey the data to ensure that the tracings are generally complete and correct and, where they are not, we will implement filters to evaluate only those images or areas where they are, so that we do not degrade metrics by comparing against tracings that are missing or erroneous.

For Task 2 we will evaluate our changes on two dimensions:
- The time it takes to train and evaluate a model.
- The quality of models with different amounts of training data.

Time will be measured in two ways: processing time and clock time. Processing time measures the total amount of computing time used, excluding time spent paused while other processes or system tasks take precedence; clock time is the total real-world elapsed time. The reason to measure both is that the former is often tied to the real costs of processing on cloud resources, while the latter determines the user experience (a timing sketch appears at the end of this section).

To evaluate the quality of models at different training data sizes, we will capture four measures:
- The raw machine learning training goal (loss).
- A comparison of overall traced length to overall predicted length.
- The percentage of predicted roots that have a traced root nearby (measuring accuracy).
- The percentage of traced roots that have a predicted root nearby (measuring completeness).

Loss applies directly to machine learning models, while the other measures compare actual tracings to the end product of the entire pipeline (a matching sketch appears below). As the original tracings may sometimes contain errors, we will have domain experts review disagreements that appear to be a result of human error rather than model error.

In Task 3, if new metrics involve some additional measure of prediction beyond what the system already does, we will perform a comparison between system-generated results (predictions) and human-generated results (truth). For UI changes, we will seek user feedback and measure loading times to make the interface more responsive with large datasets.

Task 4 will measure time changes for different cloud integration solutions in the same way as Task 2. In addition, we will compare the actual costs of processing between strategies based on different cloud options.

Task 5 will involve user review and requirements gathering. We will use surveys, demonstrations, and interviews to gather user opinions in support of the other tasks.
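As an illustration of the accuracy and completeness measures described above, the sketch below matches predicted root points against traced root points within a fixed pixel radius. The point-list representation, the 10-pixel threshold, and the sample coordinates are assumptions for illustration; the actual pipeline's output format and matching rule may differ.

```python
import numpy as np
from scipy.spatial import cKDTree

def match_fraction(candidates: np.ndarray, references: np.ndarray,
                   radius_px: float = 10.0) -> float:
    """Fraction of candidate points that have a reference point within radius_px."""
    if len(candidates) == 0 or len(references) == 0:
        return 0.0
    tree = cKDTree(references)
    distances, _ = tree.query(candidates, k=1)
    return float(np.mean(distances <= radius_px))

# predicted and traced are (N, 2) arrays of root centerline points in pixel coordinates
predicted = np.array([[10.0, 12.0], [40.0, 41.0], [200.0, 180.0]])
traced = np.array([[11.0, 12.0], [42.0, 40.0]])

accuracy = match_fraction(predicted, traced)      # predicted roots with a traced root nearby
completeness = match_fraction(traced, predicted)  # traced roots with a predicted root nearby
print(f"accuracy={accuracy:.2f}, completeness={completeness:.2f}")
```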
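The distinction between processing time and clock time can be captured with Python's standard timers, as in this generic sketch; it is not Rhizomatic code, and `run_training_step` is a hypothetical placeholder for any pipeline stage being measured.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return its result plus CPU (process) time and wall-clock time in seconds."""
    cpu_start = time.process_time()   # CPU time consumed by this process only
    wall_start = time.perf_counter()  # real-world elapsed time
    result = fn(*args, **kwargs)
    cpu_elapsed = time.process_time() - cpu_start
    wall_elapsed = time.perf_counter() - wall_start
    return result, cpu_elapsed, wall_elapsed

# Example with a hypothetical training step:
# result, cpu_s, wall_s = timed(run_training_step, batch)
# print(f"processing time: {cpu_s:.1f}s, clock time: {wall_s:.1f}s")
```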
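Where results involve randomness, repeated trials can be summarized and compared as sketched below. The trial values are made-up examples, and a two-sample Welch t-test is only one reasonable choice for the p-value mentioned above.

```python
import numpy as np
from scipy import stats

# Hypothetical training times (hours) from repeated trials of each configuration
baseline_trials = np.array([12.1, 11.8, 12.4, 12.0, 12.3])
distributed_trials = np.array([4.2, 4.5, 4.1, 4.4, 4.3])

for name, trials in [("baseline", baseline_trials), ("distributed", distributed_trials)]:
    print(f"{name}: mean={trials.mean():.2f}h, std={trials.std(ddof=1):.2f}h")

# Two-sample t-test for whether the difference between configurations is significant
t_stat, p_value = stats.ttest_ind(baseline_trials, distributed_trials, equal_var=False)
print(f"p-value={p_value:.4g}")
```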