DSFAS-AI: Deep-Learning Framework for Optimal Selection of Soil Sampling Sites

Sponsoring Institution

National Institute of Food and Agriculture

Project Status

COMPLETE

Funding Source

AFRI COMPETITIVE GRANT

Reporting Frequency

Annual

Accession No.

1028237

Grant No.

2022-67021-36518

Cumulative Award Amt.

$272,201.00

Proposal No.

2021-11481

Multistate No.

(N/A)

Project Start Date

Jan 15, 2022

Project End Date

Jan 14, 2024

Grant Year

2022

Program Code

[A1541]- Food and Agriculture Cyberinformatics and Tools

Recipient Organization
SOUTH DAKOTA STATE UNIVERSITY
PO BOX 2275A
BROOKINGS, SD 57007

Performing Department
Mechanical Engineering

Non Technical Summary
The overarching goal of this research is to develop a deep-learning framework for the automated selection of optimal soil sampling sites based on landscape position. Soil sampling is one of the most fundamental processes in agriculture: it is the crucial first step in soil testing to determine soil health. A soil analysis, which provides information important for maximizing nutrient use efficiency and agricultural productivity, can only be as good as the samples sent to the lab. Good samples require sampling at multiple optimal sites in the field. In current practice, farmers collect and pool samples themselves and send them to a lab for analysis. There are scientific methods by which samples can be pooled based on landscape position and knowledge of how nutrients move in soil due to differences in properties. However, most producers do not follow those procedures because they can be complex and vary from field to field. Pooled soil samples do not represent the actual variation in soil properties. Producers can pull multiple soil samples but have no assurance of how much of their fields each sample actually represents. As a result, there are no reliable methods for farmers to accurately utilize knowledge of soil variation with their precision agriculture technologies. There is an urgent need for an automated tool that will help producers identify which spots are optimal for sampling and can be pooled to get accurate soil analysis results.
To fill this critical need, we aim to develop a deep-learning tool that outputs landscape zones with position elevation and identifies optimal sampling spots for each zone. The training landscape data and scientific methodology will be provided by the Soil Pedologist Co-PI. The artificial intelligence (AI)-enabled tool will be developed with GPS guidance to go from spot to spot. It will also allow for the mixing of appropriate samples and provide information on the estimated cost of analysis. If the producer chooses a price cap, the tool could inform them of the number of samples that could be analyzed within that price range and how accurate the results would be.
Our central hypothesis is that the use of advanced deep-learning techniques to analyze and refine landscape data will enable precise and reliable recommendations of optimal soil sampling spots. In addition, the framework will produce consistent results under the uncertain, variable conditions from field to field. The rationale is that deep learning extracts meaning from landscape data and human-labeled data to train multi-layer neural networks, increasing the reliability and accuracy of sampling location selection. Our team is particularly well prepared to undertake the proposed research because of our extensive and successful track record of AI-enabled and data-driven research in precision agriculture and soil-landscape analysis.
We plan to test the central hypothesis by pursuing the following three specific objectives:
Objective 1. Establish a cyberinfrastructure of landscape data and soil sampling annotations to train deep convolutional neural networks.
Objective 2. Develop a deep-learning pipeline to learn, analyze, and refine landscape data for automated selection of soil sampling locations. The framework takes landscape data as input and outputs optimal soil sampling spots.
Objective 3. Design and implement a set of metrics to assess the success rate of the optimal sampling site prediction tool.
The proposed research is original and transformative because it will create an advanced tool for a crucial and challenging precision agriculture problem, namely automated and reliable selection of soil sampling sites. This tool is currently missing, and it will enable a significant improvement in soil sampling and analysis, which will lead to a better understanding of soil health. It will lay a foundation for novel applications of data science and AI technologies to solve agricultural problems.

Animal Health Component

60%

Research Effort Categories

Basic

20%

Applied

60%

Developmental

20%

Classification

Knowledge Area (KA)	Subject of Investigation (SOI)	Field of Science (FOS)	Percent
404	7299	2020	100%

Knowledge Area
404 - Instrumentation and Control Systems;

Subject Of Investigation
7299 - Research equipment and methods, general/other;

Field Of Science
2020 - Engineering;

Keywords

precision agriculture

soil sampling

soil health

artificial intelligence

Goals / Objectives
Soil sampling is one of the most fundamental agricultural processes. The current soil sampling techniques produce random samples across a field, which are not representative. Despite recent advancements in relevant data science and machine learning, there has not been significant innovation in soil sampling for decades. To fill this gap, this project develops a deep-learning framework capable of refining landscape data and extracting the optimal locations for soil sampling in a given field. We hypothesize that the integration of deep learning into the landscape data refinement and sampling rules will enable automated, precise, and reliable selection of optimal soil sampling sites. These sample sites will be more representative of a landscape area for analysis than the current composite sampling practice. We plan to test this central hypothesis by pursuing the following three specific objectives:
Objective 1. Establish a cyberinfrastructure of landscape data and soil sampling annotations for the training of deep convolutional neural networks.
Objective 2. Develop a deep-learning pipeline to learn, analyze, and refine landscape data for automated selection of soil sampling locations.
Objective 3. Design and implement a set of metrics to assess the success rate of the optimal sampling site prediction tool.
The project outcomes will lead to a novel commercial application that delivers GPS coordinates of sampling sites to producers with a real-time map to reduce human error. Landscape position data from the tool can be used to accurately interpolate soil properties from the collected samples. This will result in more precise and cost-effective soil data for precision agriculture systems.

Project Methods
Objective 1. Establish a cyberinfrastructure of landscape data and soil sampling annotations to train deep convolutional neural networks
Test fields will be selected in Eastern South Dakota to create training data. Since non-dynamic soil properties (texture, rock fragments, etc.) vary by landscape, fields will be selected in such a way that they represent a diverse group of common landscapes. Collectively, these landscapes will represent much of the Midwest glaciated plains, and the resulting tool will be applicable to similar areas.
At each field, we will collect a digital elevation model (DEM) using a drone-mounted lidar array with a high-precision GPS unit. The initial determination of various hillslope positions will be done in the field and completed using geographic information system (GIS) software. Hillslope position boundaries and known training points will be collected in the field using handheld GPS units. Final delineation of training data will then be completed using GIS. The DEM will also be used to generate a suite of landscape derivatives such as slope, aspect, curvature, etc. These derivatives will be used as features in the deep-learning pipeline. DEM derivatives are straightforward and can be automatically generated, unlike hillslope profile positions. Thus, strong relationships between hillslope profiles and soil properties will create a more robust final product.
Soil samples will be taken at each field location for analysis. A stratified random sampling technique will be implemented, stratifying by hillslope profile position. Soils will be analyzed for soil carbon, pH, and a suite of nutrients, which are the standard analyses that a producer would get. These samples will show the total variation of soil properties and composition within and between hillslope positions. Using the DEM and soils data, we will produce a series of maps for each soil property, which will be used to test the accuracy of our deep-learning models.
The collected data will be annotated and fed into a deep-learning pipeline that refines and extracts the optimal locations for soil sampling given a landscape area, as described in the next section.
Objective 2. Develop a deep-learning pipeline to learn, analyze, and refine landscape data for automated selection of soil sampling locations
We divide the deep-learning pipeline into four modules. The fundamental difference between the deep-learning methodology we are pursuing and regular learning techniques is that it has multiple layers of processing and neural networks. It is robust because if one layer does not perform well, then the next layers will rectify the mistake. However, the depth of learning comes with a larger number of parameters to be used in the deep-learning training. Therefore, the physical meaning of data and data refinement are critical to maintaining reasonable training and processing time.
Feature extraction: During the feature extraction process, a convolutional neural network will be developed to extract important characteristics ("features") of an optimal sampling zone that can be used to uniquely distinguish it from other zones that are not desirable for soil sampling. Additionally, the feature extraction process will lead to a decrease in data size by a reduction factor, as only zones with desirable characteristics are considered after this step. Our algorithm will achieve this data refinement while preserving unique characteristics of each optimal sampling zone present in the input landscape. However, this module is only a rough refinement process. To avoid over-filtering, the convolutional neural network's parameters will be tuned in such a way that only coordinates that, with high certainty, do not contain desirable characteristics are removed from further processing.
Region proposal: This module takes a feature map from the feature extraction process as input. At each coordinate of a feature map, we will place a fixed number of rectangles of different dimensions, called anchor boxes. Anchor boxes are a set of predefined boxes and are crucial for optimal sampling zone detection in our deep-learning framework. They can be thought of as zones where the network initially predicts the probability that a set of coordinates contains desirable soil sampling zones (i.e., an initial guess). The algorithm then refines and resizes these anchor boxes as it learns more about the characteristics of optimal sampling spots with guidance from the ground truth.
Region-of-Interest (RoI) pooling: Coming out of the region proposal process, not all regions are of the same size. The main goal of RoI pooling is to separate individual proposed regions and resize them to a fixed size while still maintaining all key features. The outputs of this RoI pooling process are well-defined regions of the same size that may contain desirable soil sampling zones.
Inference: This module is a final refinement step. Here, we will formulate another multilayer neural network that takes the above RoIs as inputs and outputs regions containing optimal sampling locations with high certainty and high precision. Similar to the region proposal module, the neural network will be trained via an optimization scheme. Nonetheless, with much more refined and specific landscape data input (produced by RoI pooling), the results are expected to be more precise and in closer agreement with the ground truth. How close the results are to the ground truth is measured by a set of performance metrics, which dictate the number of times this entire deep-learning pipeline will be iterated. The only constraint is that additional iterations require additional training data and time.
Objective 3. Design and implement a set of metrics to assess the success rate of the optimal sampling site prediction tool
To compute the performance of the developed algorithm, we will run the trained network on a set of evaluation landscape data. Each evaluation landscape has a ground truth, in which every optimal soil sampling spot is manually labeled with a bounding box by the Co-PI soil pedologist. The ground truth is the best possible performance from careful inspection by a domain expert. We will pass each landscape dataset through the trained deep-learning pipeline, compare the results with the ground truth, and compute the accuracy based on the data. This set of metrics will penalize the algorithm for every missed detection and any wrong detection. The value of accuracy will be 1.0 if the sampling site detection algorithm results exactly match the ground truth. With multiple input landscape datasets, the overall accuracy is the average accuracy over all the experimental landscapes. The accuracy defined here will be used as a performance metric to quantify the ability of the model to detect optimal soil sampling sites. We will set the desired level of accuracy to be 90% and iterate the deep-learning pipeline until this accuracy is achieved.
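As an illustration of the DEM derivatives named under Objective 1, the sketch below computes slope and aspect from a gridded elevation model. It is not the project's GIS workflow. It assumes a small DEM stored as a list of rows with row 0 at the north edge, square cells, and central differences on interior cells only; the function name slope_aspect and its parameters are hypothetical.

```python
import math


def slope_aspect(dem, cell_size):
    """Compute slope and aspect (both in degrees) for interior DEM cells.

    dem: list of rows of elevations, row 0 being the northern edge (assumed).
    cell_size: cell width in the same length unit as the elevations.
    Border cells are left as None because central differences need neighbors.
    """
    rows, cols = len(dem), len(dem[0])
    slope = [[None] * cols for _ in range(rows)]
    aspect = [[None] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Central differences: x increases eastward, y increases northward.
            dz_dx = (dem[r][c + 1] - dem[r][c - 1]) / (2 * cell_size)
            dz_dy = (dem[r - 1][c] - dem[r + 1][c]) / (2 * cell_size)
            slope[r][c] = math.degrees(math.atan(math.hypot(dz_dx, dz_dy)))
            if dz_dx != 0 or dz_dy != 0:
                # Aspect is the downslope direction as a compass bearing,
                # measured clockwise from north; flat cells stay None.
                aspect[r][c] = math.degrees(math.atan2(-dz_dx, -dz_dy)) % 360
    return slope, aspect


# A plane that rises one unit per cell toward the east.
dem = [[float(c) for c in range(4)] for _ in range(4)]
slope, aspect = slope_aspect(dem, cell_size=1.0)
```

For this tilted plane, the interior cells have a slope of about 45 degrees and a downslope bearing of about 270 degrees (facing west). Curvature would need second differences and is omitted here.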

Progress 01/15/22 to 08/29/22

Outputs
Target Audience: Our target audience included researchers interested in precision soil sampling, and they were reached through peer-reviewed publication.
Changes/Problems: The rate of expenditure may appear slow from the funding agency's perspective for the following reasons:
Both the PI and Co-PI have been spending very economically by making use of support from their respective departments. In particular, when the project started on 01/15/2022, the Ph.D. student from the PI's group was already in the middle of a Research Assistantship appointment funded by our industry partner, which covered his Spring 2022 semester. Therefore, he started studying deep learning and developing machine-learning models for this project without the need to spend grant funds on his stipend and tuition for Spring 2022.
During Summer 2022, the PI accepted an offer to move from South Dakota State University to the Florida Institute of Technology. In addition, the Ph.D. student obtained a paid summer internship to work on a related machine-learning project at our industry partner. Therefore, the PI did not spend any funds during this transition period.
During the initial period of the project, Jan-May 2022, the winter weather in South Dakota prevented us from organizing the proposed field trips for data collection. Instead, the trips happened in Summer 2022. During this time the Co-PI's department had a budget surplus and offered to cover the Co-PI's M.S. student for the summer. The Co-PI did spend funds on his summer salary.
Despite the above events, there is no significant deviation from the proposed research plan and goals. We have made very promising progress, with a large amount of field data collected during Summer 2022. We also had a paper accepted for publication in Computers and Electronics in Agriculture and have another paper under review. Both papers acknowledged the support from this grant. This was accomplished while preserving a significant portion of the grant for future research activities to be conducted in this project.
What opportunities for training and professional development has the project provided? An M.S. student (Sravanthi Bachina) and a Ph.D. student (Praneel Acharya) are working on this project. The M.S. student from the Co-PI Kris Osterloh's research group is in charge of data collection and labeling (Objective 1). The Ph.D. student from the PI Kim Nguyen's group is in charge of developing the machine-learning models and validating the performance metrics (Objectives 2 and 3). The training and professional development activities for these students include learning knowledge and skills related to machine learning and data processing, collecting soil sampling data, and innovating and applying machine-learning techniques to the data to accomplish the goals.
How have the results been disseminated to communities of interest? The techniques that we have been developing in this project have been applied to data obtained in a different project. The results have been reported in the form of two papers, both of which were submitted to the Computers and Electronics in Agriculture journal. One paper was recently accepted for publication, while the other paper is currently under review.
What do you plan to do during the next reporting period to accomplish the goals? The PI has moved from South Dakota State University to the Florida Institute of Technology. The PI is requesting to transfer the grant to his new institution to continue the proposed research. This Final Report is a required component of the grant transfer process. After the grant transfer process is completed, the remaining research tasks will be performed in close collaboration at two different institutions: Florida Institute of Technology (PI, Objectives 2 and 3) and South Dakota State University (Co-PI, subaward, Objective 1). Below are the remaining objectives to be accomplished in the remainder of the project period:
Objective 1. Establish a cyberinfrastructure of landscape data and soil sampling annotations for the training of deep convolutional neural networks
In Summer 2022, we arranged a number of field trips to the Edinger Brothers Partnership farm in Mount Vernon, SD to collect soil sampling data from over 60 fields, most of which are at least 150 acres. We were also granted access to their soil sampling data collected from 2011 to date. Since non-dynamic soil properties (texture, rock fragments, etc.) vary by landscape, the fields were selected in such a way that they represent a diverse group of common landscapes. In the next phase of the project, the final delineation of training data will be completed using geographic information system (GIS) software. The collected digital elevation model (DEM) will also be used to generate a suite of landscape derivatives such as slope, aspect, curvature, etc. These derivatives will be used as features in the deep-learning pipeline. DEM derivatives are straightforward and can be automatically generated, unlike hillslope profile positions. Thus, strong relationships between hillslope profiles and soil properties will create a more robust final product. The collected data will be annotated and fed into a deep-learning pipeline that refines and extracts the optimal locations for soil sampling given a landscape area, as described in Objective 2.
Objective 2. Develop a deep-learning pipeline to learn, analyze, and refine landscape data for automated selection of soil sampling locations
In the next phase of this research task, we will focus on developing and testing the remaining proposed modules of the machine-learning pipeline:
Region proposal: This module takes a feature map from the feature extraction process as input. At each coordinate of a feature map, we will place a fixed number of rectangles of different dimensions, called anchor boxes. Anchor boxes are a set of predefined boxes and are crucial for optimal sampling zone detection in our deep-learning framework. They can be thought of as zones where the network initially predicts the probability that a set of coordinates contains desirable soil sampling zones (i.e., an initial guess). The algorithm then refines and resizes these anchor boxes as it learns more about the characteristics of optimal sampling spots with guidance from the ground truth. Each anchor box in a feature map covers a region on an input landscape that might have desirable sampling zones in it. Therefore, a region proposal neural network needs to be formulated and well trained with labeled ground truth data to effectively find regions that contain optimal sampling zones with high probability. This is another step of data refinement.
Region-of-Interest (RoI) pooling: Coming out of the region proposal process, not all regions are of the same size. The main goal of RoI pooling is to separate individual proposed regions and resize them to a fixed size while still maintaining all key features. The outputs of this RoI pooling process are well-defined regions of the same size that may contain desirable soil sampling zones.
Inference: This module is a final refinement step in addition to the previous processes. Here, we will formulate another multilayer neural network that takes the above RoIs as inputs and outputs regions containing optimal sampling locations with high certainty and high precision. Similar to the region proposal module, the neural network will be trained via an optimization scheme. Nonetheless, with much more refined and specific landscape data input (produced by RoI pooling), the results are expected to be more precise and in closer agreement with the ground truth. How close the results are to the ground truth is measured by a set of performance metrics, which dictate the number of times this entire deep-learning pipeline will be iterated. The only constraint is that additional iterations require additional training data and time.
Objective 3. Design and implement a set of metrics to assess the success rate of the optimal sampling site prediction tool
To compute the performance of the developed algorithm, we will run the trained network on a set of evaluation landscape data. Each evaluation landscape has a ground truth, in which every optimal soil sampling spot is manually labeled with a bounding box by the Co-PI soil pedologist. The ground truth is the best possible performance from careful inspection by a domain expert. We will pass each landscape dataset through the trained deep-learning pipeline, compare the results with the ground truth, and store the following information for each evaluation landscape:
MD: the total number of optimal soil sampling spots missed by the trained model for a given landscape compared to its ground truth.
FD: the total number of incorrect detections made by the trained model for a given landscape compared to its ground truth. These incorrect detections occur when the trained model detects a desirable sampling spot, but the ground truth does not consider it optimal.
DT: the total number of detections by the trained model for the given input landscape. Note that not all detections contain optimal sampling spots. For example, FD consists of detections counted in DT that are not optimal sampling spots when compared to the ground truth.
TD: the true detection number, computed as the difference between the total number of sampling spots detected by the trained model for the given landscape (DT) and the total number of incorrect detections for that landscape (FD).
Finally, the accuracy is computed as TD / (TD + FD + MD). This is a simple set of metrics we will start with. We will formulate and implement more complicated performance metrics if needed. With this set of metrics, we penalize the algorithm for every missed detection and any wrong detection. The value of accuracy will be 1.0 if the sampling site detection algorithm results exactly match the ground truth. With multiple input landscape datasets, the overall accuracy is the average accuracy over all the experimental landscapes. The accuracy defined here will be used as a performance metric to quantify the ability of the model to detect optimal soil sampling sites.
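The accuracy defined above can be written as a short calculation. The sketch below is only an illustration of the formula TD / (TD + FD + MD), with TD derived from DT and FD as described, and of the averaging over landscapes. The function names and the example counts are hypothetical, and the handling of a landscape with no detections and no ground-truth spots is an assumption.

```python
def landscape_accuracy(total_detections, false_detections, missed_detections):
    """Accuracy for one evaluation landscape: TD / (TD + FD + MD).

    total_detections is DT, false_detections is FD, missed_detections is MD.
    """
    true_detections = total_detections - false_detections  # TD = DT - FD
    denominator = true_detections + false_detections + missed_detections
    if denominator == 0:
        # Assumption: nothing to find and nothing detected counts as an
        # exact match with the ground truth.
        return 1.0
    return true_detections / denominator


def overall_accuracy(landscapes):
    """Average the per-landscape accuracy over (DT, FD, MD) tuples."""
    scores = [landscape_accuracy(dt, fd, md) for dt, fd, md in landscapes]
    return sum(scores) / len(scores)


# Hypothetical counts for two evaluation landscapes, as (DT, FD, MD).
example = [(10, 2, 2), (5, 0, 0)]
print(overall_accuracy(example))
```

In the first hypothetical landscape, 8 of the 10 detections are correct and 2 spots are missed, giving 8 / 12; the second matches its ground truth exactly and scores 1.0.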

Impacts
What was accomplished under these goals?
Objective 1: Establish a cyberinfrastructure of landscape data and soil sampling annotations for the training of deep convolutional neural networks (50% Accomplished)
The first phase of the project focused on data collection and establishing a cyberinfrastructure of landscape data and soil sampling annotations. The official project start date was 01/15/2022. We were not able to make field trips to collect soil sampling data from January to May 2022 because of the winter weather in South Dakota. In Summer 2022, we arranged a number of field trips to the Edinger Brothers Partnership farm in Mount Vernon, SD, to collect soil sampling data from over 60 fields, most of which are at least 150 acres. We were also granted access to their soil sampling data collected from 2011 to date. In Fall 2022, we will focus on organizing, analyzing, and labeling the collected data to make it ready to train the machine-learning models developed in Objective 2.
Objective 2: Develop a deep-learning pipeline to learn, analyze, and refine landscape data for automated selection of soil sampling locations (20% Accomplished)
We accomplished about 20% of this task. During the reporting period, we used a small soil sampling data set, since the proposed data collection had not yet been done. We developed the feature extraction module with a neural network designed to extract important characteristics of an optimal sampling zone that can be used to uniquely distinguish it from other zones that are not desirable for soil sampling. The feature extraction process results in reduced data size, since only zones with desirable characteristics are output from this module. Hence, instead of dealing directly with the original landscape data size, the optimal sampling zone detection algorithm works with feature maps of reduced size, which means much less information to process and lower processing time and computational cost. Our algorithm achieves this data refinement while preserving unique characteristics of each optimal sampling zone present in the input landscape. However, this module is only a rough refinement process. To avoid over-filtering, the artificial neural network's parameters will be tuned in such a way that only coordinates that, with high certainty, do not contain desirable characteristics are removed from further processing. The remaining modules to be developed and tested in this task are region proposal, Region-of-Interest pooling, and inference, followed by testing of the entire machine-learning pipeline.
Objective 3: Design and implement a set of metrics to assess the success rate of the optimal sampling site prediction tool (0% Accomplished)
We will start working on this research task after the first two goals have been accomplished.
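Region proposal, listed above among the remaining modules, places a fixed number of anchor boxes of different dimensions at each coordinate of a feature map. The sketch below shows one common way such boxes can be laid out. It is an illustration under assumptions, not the project's implementation: the function name generate_anchors, the stride that maps feature-map cells back to landscape pixels, and the particular sizes and aspect ratios are all hypothetical.

```python
def generate_anchors(feat_h, feat_w, stride, sizes, ratios):
    """Return anchor boxes as (x_min, y_min, x_max, y_max) tuples.

    feat_h, feat_w: feature-map height and width in cells.
    stride: assumed number of input pixels represented by one cell.
    sizes: anchor side lengths in input pixels (for a square box).
    ratios: width-to-height ratios; each box keeps an area of size * size.
    """
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Center of this feature-map cell in input-landscape coordinates.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for size in sizes:
                for ratio in ratios:
                    w = size * ratio ** 0.5
                    h = size / ratio ** 0.5
                    anchors.append(
                        (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
                    )
    return anchors


# Hypothetical settings: a 2 x 3 feature map with 2 sizes and 3 ratios,
# which gives 6 anchors per cell.
anchors = generate_anchors(2, 3, stride=16, sizes=[32, 64], ratios=[0.5, 1.0, 2.0])
print(len(anchors))
```

Every feature-map cell receives the same set of boxes, so the anchor count is the number of cells times the number of size and ratio combinations. A trained region proposal network would then score and adjust these initial guesses against the labeled ground truth.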

Publications

Type: Journal Articles Status: Accepted Year Published: 2022 Citation: Acharya, P., T. Burgers, K.D. Nguyen. 2022. AI-enabled droplet detection and tracking for agricultural spraying systems. Computers and Electronics in Agriculture. Accepted.
Type: Journal Articles Status: Under Review Year Published: 2022 Citation: Acharya, P., K.D. Nguyen. 2022. A deep-learning framework for spray pattern segmentation and estimation in agricultural spraying systems. Computers and Electronics in Agriculture.