Source: FLORIDA INSTITUTE OF TECHNOLOGY, INC. submitted to
DSFAS-AI: DEEP-LEARNING FRAMEWORK FOR OPTIMAL SELECTION OF SOIL SAMPLING SITES
Sponsoring Institution
National Institute of Food and Agriculture
Project Status
EXTENDED
Funding Source
Reporting Frequency
Annual
Accession No.
1029668
Grant No.
2022-67021-38911
Project No.
FLAW-2022-09228
Proposal No.
2022-09228
Multistate No.
(N/A)
Program Code
A1541
Project Start Date
Sep 1, 2022
Project End Date
Feb 28, 2025
Grant Year
2023
Project Director
Nguyen, K.
Recipient Organization
FLORIDA INSTITUTE OF TECHNOLOGY, INC.
150 W UNIVERSITY BLVD OFC
MELBOURNE, FL 32901-6975
Performing Department
(N/A)
Non Technical Summary
The overarching goal of this research is to develop a deep-learning framework for the automated selection of optimal soil sampling sites based on landscape position. Soil sampling is one of the most fundamental processes in agriculture: it is the crucial first step in soil testing to determine soil health. A soil analysis, which provides information important to maximize nutrient use efficiency and agricultural productivity, can only be as good as the samples sent to the lab. Good samples require sampling at multiple optimal sites in the field. In the current practice, farmers collect and pool samples themselves and send them to a lab for analysis. There are scientific methods where samples can be pooled based on landscape position and knowledge about how nutrients move in soil due to differences in properties. However, most producers do not follow those procedures as they can be complex and vary from field to field. Pooled soil samples do not represent the actual variation in soil properties. Producers can pull multiple soil samples but have no assurance of the extent of their fields which each sample actually represents. As a result, there are no reliable methods for farmers to accurately utilize knowledge of soil variation with their precision agriculture technologies. There is an urgent need for an automated tool that will help producers identify which spots are optimal for sampling and can be pooled to get accurate soil analysis results.
To fill this critical need, we aim to develop a deep-learning tool that outputs landscape zones with position elevation and identifies optimal sampling spots for each zone. The training landscape data and scientific methodology will be provided by the Soil Pedologist Co-PI. The artificial intelligence (AI)-enabled tool will be developed with GPS guidance to go from spot to spot. It will also allow for the mixing of appropriate samples and provide information on the estimated cost of analysis.
If the producer chooses a price cap, the tool could inform them of the number of samples that could be analyzed within that price range and how accurate the results would be.
Our central hypothesis is that the use of advanced deep-learning techniques to analyze and refine landscape data will enable precise and reliable recommendations of optimal soil sampling spots. In addition, the framework will produce consistent results under uncertain variable conditions from field to field. The rationale is that deep learning extracts meanings from landscape data and human-labeled data to train multi-layer neural networks to increase the reliability and accuracy of sampling location selection. Our team is particularly well prepared to undertake the proposed research because of our extensive and successful track record of AI-enabled and data-driven research in precision agriculture and soil-landscape analysis.
We plan to test the central hypothesis by pursuing the following three specific objectives:
Objective 1. Establish a cyberinfrastructure of landscape data and soil sampling annotations to train deep convolutional neural networks.
Objective 2. Develop a deep-learning pipeline to learn, analyze, and refine landscape data for automated selection of soil sampling locations. The framework takes landscape data as input and outputs optimal soil sampling spots.
Objective 3. Design and implement a set of metrics to assess the success rate of the optimal sampling site prediction tool.
The proposed research is original and transformative because it will create an advanced tool for a crucial and challenging precision agriculture problem, namely automated and reliable selection of soil sampling sites. This tool is currently missing, and it will enable a significant improvement in soil sampling and analysis, which will lead to a better understanding of soil health. It will lay a foundation for novel applications of data science and AI technologies to solve agricultural problems.
Animal Health Component
0%
Research Effort Categories
Basic
20%
Applied
60%
Developmental
20%
Classification

Knowledge Area (KA)    Subject of Investigation (SOI)    Field of Science (FOS)    Percent
404                    7299                              2020                      100%
Goals / Objectives
Soil sampling is one of the most fundamental agricultural processes. The current soil sampling techniques produce random samples across a field, which are not representative. Despite recent advancements in relevant data science and machine learning, there has not been significant innovation in soil sampling for decades. To fill the gap, this project develops a deep-learning framework capable of refining landscape data and extracting the optimal locations for soil sampling in a given field. We hypothesize that the integration of deep learning into the landscape data refinement and sampling rules will enable automated, precise, and reliable selection of optimal soil sampling sites. These sample sites will be more representative of a landscape area for analysis than the current composite sampling practice. We plan to test this central hypothesis by pursuing the following three specific objectives:
Objective 1. Establish a cyberinfrastructure of landscape data and soil sampling annotations for the training of deep convolutional neural networks.
Objective 2. Develop a deep-learning pipeline to learn, analyze, and refine landscape data for automated selection of soil sampling locations.
Objective 3. Design and implement a set of metrics to assess the success rate of the optimal sampling site prediction tool.
The project outcomes will lead to a novel commercial application that delivers GPS coordinates of sampling sites to producers with a real-time map to reduce human error. Landscape position data from the tool can be used to accurately interpolate soil properties from the collected samples. This will result in more precise and cost-effective soil data for precision agriculture systems.
Project Methods
Objective 1. Establish a cyberinfrastructure of landscape data and soil sampling annotations to train deep convolutional neural networks
Test fields will be selected in Eastern South Dakota to create training data. Since non-dynamic soil properties (texture, rock fragments, etc.) vary by landscapes, fields will be selected in such a way that they represent a diverse group of common landscapes. Collectively, these landscapes will represent much of the Midwest glaciated plains, and the resulting tool will be applicable to similar areas.
At each field, we will collect a digital elevation model (DEM) using a drone-mounted lidar array with a high-precision GPS unit. The initial determination of various hillslope positions will be done in the field and completed using geographic information system (GIS) software. Hillslope position boundaries, and known training points, will be collected in the field using handheld GPS units. The final delineation of training data will then be completed using GIS. The DEM will also be used to generate a suite of landscape derivatives such as slope, aspect, curvature, etc. These derivatives will be used as features in the deep-learning pipeline. DEM derivatives are straightforward and can be automatically generated, unlike hillslope profile positions. Thus, strong relationships between hillslope profiles and soil properties will create a more robust final product.
Soil samples will be taken at each field location for analysis. A random stratified sampling technique will be implemented, stratifying by hillslope profile position. Soils will be analyzed for soil carbon, pH, and a suite of nutrients, the standard analyses that a producer would get. These samples will show the total variation of soil properties and composition within and between hillslope positions. 
Using the DEM and soil data, we will produce a series of maps for each soil property which will be used to test the accuracy of our deep-learning models.
The collected data will be annotated and fed into a deep-learning pipeline that refines and extracts the optimal locations for soil sampling given a landscape area as described in the next section.
Objective 2. Develop a deep-learning pipeline to learn, analyze, and refine landscape data for automated selection of soil sampling locations.
We divide the deep-learning pipeline into four modules. The fundamental difference between the deep-learning methodology we are pursuing with respect to regular learning techniques is that it has multiple layers of processing and neural networks. It is robust because if one layer does not perform well, then the next layers will rectify the mistake. However, the deepness of learning comes with a larger number of parameters to be trained. Therefore, the physical meaning of data and data refinement are critical to maintaining reasonable training and processing time.
Feature extraction: During the feature extraction process, a convolutional neural network will be developed to extract important characteristics ("features") of an optimal sampling zone that can be used to uniquely identify it from other zones that are not desirable for soil sampling. Additionally, the feature extraction process will lead to a decrease in data size by a reduction factor, as only zones with desirable characteristics are considered after this step. Our algorithm will achieve this data refinement while preserving the unique characteristics of each optimal sampling zone present in the input landscape. However, this module is only a rough refinement process. 
The convolutional neural network's parameters will be tuned in such a way that only coordinates that do not contain desirable characteristics with high certainty are removed from further processing to avoid over-filtering.
Region proposal: This module takes a feature map from the feature extraction process as input. At each coordinate of a feature map, we will place a fixed number of rectangles called anchor boxes of different dimensions. Anchor boxes are a set of predefined boxes and are crucial for optimal sampling zone detection in our deep-learning framework. They can be thought of as zones where the network initially predicts the probability that a set of coordinates contains desirable soil sampling zones (i.e., an initial guess). The algorithm then refines and resizes these anchor boxes when it learns more about the characteristics of optimal sampling spots with guidance from the ground truth.
Region-of-Interest (RoI) pooling: Coming out of the region proposal process, not all regions are of the same size. The main goal of RoI pooling is to separate individual proposed regions and resize them to a fixed size while still maintaining all key features. The outputs of this RoI pooling process are well-defined regions of the same size that may contain desirable soil-sampling zones.
Inference: This module is a final refinement step. Here, we will formulate another multilayer neural network that takes the above RoIs as inputs and outputs regions containing optimal sampling locations with high certainty and high precision. Similar to the region proposal module, the neural network will be trained via an optimization scheme. Nonetheless, with much more refined and specific landscape data input (produced by RoI pooling), the results are expected to be more precise and in closer agreement with the ground truth. The closeness to the ground truth is quantified by a set of performance metrics, which dictate the number of times this entire deep-learning pipeline will be iterated. 
The only constraint is that additional iterations require additional training data and time.
Objective 3. Design and implement a set of metrics to assess the success rate of the optimal sampling site prediction tool
To compute the performance of the developed algorithm, we will run the trained network through a set of evaluation landscape data. Each evaluation landscape has a ground truth, in which every optimal soil sampling spot is manually labeled with a bounding box by the Co-PI soil pedologist. The ground truth is the best possible performance from careful inspection by a domain expert. We will pass each landscape dataset through the trained deep-learning pipeline, compare the results with the ground truth, and compute the accuracy based on the data. This set of metrics will penalize the algorithm for every missed detection and any wrong detection. The value of accuracy will be 1.0 if the sampling site detection algorithm results exactly match the ground truth. With multiple input landscape datasets, the overall accuracy is the average accuracy over all the experimental landscapes. The accuracy defined here will be used as a performance metric to quantify the ability of the model to detect optimal soil sampling sites. We will set the desired level of accuracy to 90% and iterate the deep-learning pipeline until this accuracy is achieved.
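The accuracy described for Objective 3 could be sketched as follows. This is only an illustration under stated assumptions, not the project's implementation: the report does not specify how a predicted bounding box is matched to a labeled one, so the sketch assumes a greedy one-to-one match with a hypothetical intersection-over-union threshold (`iou_threshold`), and it assumes the per-landscape score is hits / (hits + wrong detections + missed detections). The function names are invented for this example.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def landscape_accuracy(predicted: Sequence[Box], truth: Sequence[Box],
                       iou_threshold: float = 0.5) -> float:
    """Score one landscape; every missed or wrong detection lowers the score."""
    unmatched = list(truth)  # labeled boxes not yet claimed by a prediction
    hits = 0
    for box in predicted:
        # Assumption: greedily match each prediction to its best remaining label.
        best = max(unmatched, key=lambda t: box_iou(box, t), default=None)
        if best is not None and box_iou(box, best) >= iou_threshold:
            hits += 1
            unmatched.remove(best)
    wrong = len(predicted) - hits   # predictions with no matching label
    missed = len(unmatched)         # labels with no matching prediction
    total = hits + wrong + missed
    return hits / total if total else 1.0


def overall_accuracy(landscapes: Sequence[Tuple[List[Box], List[Box]]]) -> float:
    """Average the per-landscape accuracy over all evaluation landscapes."""
    scores = [landscape_accuracy(pred, truth) for pred, truth in landscapes]
    return sum(scores) / len(scores)


truth_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
print(landscape_accuracy(truth_boxes, truth_boxes))       # exact match
print(landscape_accuracy([(0, 0, 10, 10)], truth_boxes))  # one site missed
```

With this definition a landscape scores 1.0 only when every labeled site is detected and no extra site is proposed, and the 90% target would be compared against the value returned by `overall_accuracy`.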

Progress 09/01/22 to 08/31/23

Outputs
Target Audience: Our target audience included producers who need to perform soil sampling and analysis in their fields. We have an ongoing partnership with them, and we visited their farms to collect data. Other target audiences include researchers interested in precision soil sampling, whom we reached through conference presentations and peer-reviewed publications.
Changes/Problems: Major changes affected the progress of the project in the last 15 months:
• The PI moved from South Dakota State University (SDSU) to the Florida Institute of Technology (FIT) in 2022: The PI's contract with SDSU ended in May 2022 and the PI's position at FIT started in August 2022. The grant transfer process was completed in December 2022 when FIT received the award notification. Therefore, the PI could not use the funds to make progress, and the project was essentially halted from May 2022 to December 2022.
• The Ph.D. student who worked on the project with the PI at SDSU, Praneel Acharya, decided not to follow the PI to FIT because it was the final year of his graduate program (he is now an assistant professor at Minnesota State University, Mankato). Additionally, FIT officially received the grant in December 2022, and the timing left us with very limited options for recruiting new graduate students who could start in January 2023. As a result, we could not hire students who had substantial relevant skill sets and experience to immediately contribute toward the project objectives. Most of Spring 2023 was dedicated to training the new Ph.D. students in data science, machine learning, and soil sampling and analysis. This challenge caused further delays in executing the research tasks and achieving the project objectives.
Despite these challenges, during the reporting period, we still managed to submit a journal paper to Computers and Electronics in Agriculture, which is awaiting peer review. 
We also presented our work at the USDA NIFA AI in Agriculture: Innovation and Discovery to Equitably Meet Producer Needs and Perceptions Conference in Orlando, in April 2023.
What opportunities for training and professional development has the project provided? An M.S. student (Sravanthi Bachina) and two Ph.D. students (Praneel Acharya at South Dakota State University, who has been with the project from the beginning; and Hanh Pham at Florida Institute of Technology, who started in 01/2023) have been working on this project. The M.S. student from the Co-PI Kris Osterloh's research group is in charge of data collection and labeling (Objective 1). The two Ph.D. students from the PI Kim Nguyen's group are in charge of developing the machine-learning models and performance metrics validation (Objectives 2 and 3). The training and professional development activities for these students include learning knowledge and skills related to machine learning and data processing, collecting soil sampling data, and innovating and applying machine learning techniques to the data to accomplish the goals. One of the Ph.D. students, Praneel Acharya, graduated in May 2023 and is currently an Assistant Professor at Minnesota State University, Mankato.
How have the results been disseminated to communities of interest? We have summarized the current results in the form of a journal paper, which is under review with Computers and Electronics in Agriculture. We also presented our project at the USDA NIFA AI in Agriculture: Innovation and Discovery to Equitably Meet Producer Needs and Perceptions Conference in Orlando, April 2023. The techniques developed in this project have been applied to the data obtained in a different project. The results have been reported in the form of two journal papers. One was published in the Computers and Electronics in Agriculture journal while the other was published in the Scientific Reports journal. 
What do you plan to do during the next reporting period to accomplish the goals? Below are the remaining objectives to be accomplished in the remainder of the project period:
Objective 1: Establish a cyberinfrastructure of landscape data and soil sampling annotations for the training of deep convolutional neural networks
In the next phase of the project, the final delineation of training data will be completed using the geographic information system (GIS) software. The collected digital elevation model (DEM) will also be used to generate a suite of landscape derivatives such as slope, aspect, curvature, etc. These derivatives will be used as features in the deep-learning pipeline. DEM derivatives are straightforward and can be automatically generated, unlike hillslope profile positions. Thus, strong relationships between hillslope profiles and soil properties will create a more robust final product. The collected data will be annotated and fed into a deep-learning pipeline that refines and extracts the optimal locations for soil sampling given a landscape area as described in Objective 2.
Objective 2. Develop a deep-learning pipeline to learn, analyze, and refine landscape data for automated selection of soil sampling locations.
In the next phase of this research task, we will focus on developing and testing the remaining proposed modules of the machine-learning pipeline:
Region-of-Interest (RoI) pooling: Coming out of the region proposal process, not all regions are of the same size. The main goal of RoI pooling is to separate individual proposed regions and resize them to a fixed size while still maintaining all key features. The outputs of this RoI pooling process are well-defined regions of the same size that may contain desirable soil-sampling zones.
Inference: This module is a final refinement step in addition to the previous processes. 
Here, we will formulate another multilayer neural network that takes the above RoIs as inputs and outputs regions containing optimal sampling locations with high certainty and high precision. Similar to the region proposal module, the neural network will be trained via an optimization scheme. Nonetheless, with much more refined and specific landscape data input (produced by RoI pooling), the results are expected to be more precise and in closer agreement with the ground truth. The closeness to the ground truth is quantified by a set of performance metrics, which dictate the number of times this entire deep-learning pipeline will be iterated. The only constraint is that additional iterations require additional training data and time.
Challenges: We will also seek a solution to the most challenging problem we have faced so far: the imbalance of input data. This problem arises from the fact that optimal soil sampling spots occupy only a small number of pixels in the map, while the rest of the pixels are background. This severe imbalance in pixel class means that the loss of the machine learning model is always near zero even if the model produces wrong predictions as compared to the ground truth. We will look into recent advances in machine learning, including vision transformers and attention mechanisms, to address this problem.
Objective 3. Design and implement a set of metrics to assess the success rate of the optimal sampling site prediction tool.
Setting up performance metrics for the proposed machine learning models is a crucial step in evaluating the effectiveness and quality of our models. The design of metrics depends on the nature of the problem, the type of data, and the specific goals of the project. In the next project period, we will continue to reinforce and expand the set of performance metrics we have designed so far. 
Below is a list of performance metrics that we will consider and develop:
  • Accuracy: Overall proportion of correctly predicted instances.
  • Precision: Proportion of true positive predictions among positive predictions.
  • Recall: Proportion of true positive predictions among actual positive instances.
  • F1-Score: Harmonic mean of precision and recall.
  • Area Under the Receiver Operating Characteristic Curve (AUC-ROC): Measures the trade-off between true positive rate and false positive rate.
  • Mean Absolute Error (MAE): Average absolute differences between predicted and actual values.
  • Mean Squared Error (MSE): Average squared differences between predicted and actual values.
  • Root Mean Squared Error (RMSE): Square root of the MSE.
  • R-squared (Coefficient of Determination): Measures the proportion of the variance in the dependent variable that is predictable from the independent variables.
  • Silhouette Score: Measures how close each sample in one cluster is to the samples in the neighboring clusters.
  • Davies-Bouldin Index: Measures the average similarity between each cluster and its most similar cluster.
  • Implement the Metrics: Implement the chosen metrics in our evaluation code. Many machine learning libraries provide built-in functions for calculating common metrics.
  • Cross-Validation: Use cross-validation to estimate the model's performance on unseen data. This helps mitigate issues like overfitting and provides a more reliable estimate of the model's generalization ability.
  • Compare Multiple Models: When trying different algorithms or hyperparameters, compare their performance using the chosen metrics. This helps select the best-performing model for the task.
  • Visualize and Interpret Results: Visualize the results using appropriate visualizations like confusion matrices, ROC curves, or precision-recall curves. Interpret the results to gain insights into model performance and areas for improvement.
  • Iterate and Improve: Based on the results and insights, refine the model, feature engineering, and hyperparameters to improve performance.
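The cross-validation step listed above could be sketched as follows. This is a minimal illustration, not the project's evaluation code. It assumes that whole fields, rather than individual pixels, are held out in each fold, so that the validation score reflects performance on unseen fields; the report does not state how the data are split. The names `k_fold_by_field`, `cross_validate`, and the `train_and_score` callback are hypothetical.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def k_fold_by_field(field_ids: Sequence[str],
                    k: int) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (training_fields, held_out_fields) pairs for k-fold validation."""
    if not 2 <= k <= len(field_ids):
        raise ValueError("k must be between 2 and the number of fields")
    # Deal the fields into k folds in round-robin order.
    folds = [list(field_ids[i::k]) for i in range(k)]
    for i, held_out in enumerate(folds):
        training = [f for j, fold in enumerate(folds) if j != i for f in fold]
        yield training, held_out


def cross_validate(field_ids: Sequence[str], k: int,
                   train_and_score: Callable[[List[str], List[str]], float]) -> float:
    """Average the validation score returned by `train_and_score` over k folds.

    `train_and_score` is a placeholder for whatever routine trains a model on
    the training fields and returns one metric value on the held-out fields.
    """
    scores = [train_and_score(train, held_out)
              for train, held_out in k_fold_by_field(field_ids, k)]
    return sum(scores) / len(scores)


fields = [f"field_{i}" for i in range(6)]
for train, held_out in k_fold_by_field(fields, k=3):
    print("train:", train, "validate:", held_out)
```

Holding out entire fields is a design choice made for this sketch: neighboring pixels within one field are strongly correlated, so a pixel-level split would likely overstate how well a model generalizes.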

Impacts
What was accomplished under these goals?
Objective 1: Establish a cyberinfrastructure of landscape data and soil sampling annotations for the training of deep convolutional neural networks (60% Accomplished)
We arranged a number of field trips to the Edinger Brothers Partnership farm in Davison County, SD, to collect soil sampling data from over 60 fields, ranging from 150 to 200 hectares each. We were also granted access to their soil sampling data collected from 2011 to date. These fields encompassed diverse landscapes and soil properties and were under the management of local farmers. For training and testing the model, each field was characterized by five attributes: aspect, flow accumulation, slope, yield, and normalized difference vegetation index (NDVI). The corresponding ground truths were also annotated by the pedologists in our team. Digital Elevation Models (DEM) for each field were downloaded from the LiDAR (Light Detection and Ranging) dataset of the South Dakota Geological Survey with a spatial resolution of one meter (https://www.sdgs.usd.edu/). Terrain attributes (slope and aspect) and hydrological attributes (flow accumulation) were processed using ArcMap 10.8 (Esri, ArcGIS). Soil properties such as water holding capacity, soil texture, soil organic matter, and Cation Exchange Capacity (CEC) were obtained from the SSURGO data sets. For this research, Sentinel-2A imagery with a spatial resolution of 10 meters was downloaded from the Copernicus Open Access Hub website and used to obtain NDVI values for each field. Yield data were obtained from Field View Plus (Climate LLC) and were further cleaned and processed using SMS Ag software (Ag Leader, SMS Advanced). For model training, terrain and yield attributes including slope, aspect, flow accumulation, yield, and NDVI data are used as features or independent variables. 
The corresponding ground-truth soil sampling sites for these data, which represent the characteristics of each field, were labeled.
Objective 2: Develop a deep-learning pipeline to learn, analyze, and refine landscape data for automated selection of soil sampling locations (30% Accomplished)
Soil sampling location selection plays an important role in soil analysis, but the current soil sampling techniques produce random samples across a field, which are not representative. During the reporting period, we leveraged advances in deep learning for computer vision and image processing to find the locations that represent the important characteristics of a field. The training data were collected at different fields with five features: aspect, flow accumulation, slope, NDVI (normalized difference vegetation index), and yield. The soil sampling dataset is challenging because of the imbalance between the soil sampling site area and the background area. In this work, we first evaluated two baselines: a CNN-based model and a Transformer-based model. We analyzed the performance of the two baselines and then finalized a Transformer-based model. We proposed atrous SegFormer MiT-B3 as a tool for dealing with the soil sampling dataset. The tool has an encoder-decoder architecture and uses the SegFormer MiT-B3 as the backbone. In the encoder, the self-attention mechanism is the key feature extractor, which produces feature maps. In the decoder, we introduce atrous convolution networks to concatenate and fuse the extracted features and then export the optimal locations for soil sampling. Currently, the model achieves a mean accuracy of 99.53%, a mean Intersection over Union (IoU) of 57.35%, and a mean Dice Coefficient of 71.47% on the testing dataset, while the corresponding values for SegFormer MiT-B3 are 99.51%, 53.97%, and 68.69%, respectively. 
Our proposed model outperforms the SegFormer MiT-B3 on the soil sampling dataset. To the best of our knowledge, our work is the first one to provide a soil sampling dataset with multiple attributes and leverage deep learning techniques to support location selection in soil sampling. Objective 3: Design and implement a set of metrics to assess the success rate of the optimal sampling site prediction tool (10% Accomplished) To assess the performance of the trained model, we established a set of performance metrics to compare the ground truth soil data with the output predictions generated by the model. This allows us to compute various variables listed below that capture the differences between the ground truth and the model's output. True positive (TP) is the total number of white pixels predicted from the model overlapping with the white pixels in the ground truths. They have the same positive category. True negative (TN) is the total number of black pixels predicted from the model overlapping with the black pixels in the ground truths. They have the same negative category. False positive (FP) is the total number of white pixels predicted from the model overlapping with the black pixels in the ground truths. They have different categories, positive prediction and negative reality. False negative (FN) is the total number of black pixels predicted from the model overlapping with the white pixels in the ground truths. They have different categories, negative prediction and positive reality. For each validation or testing dataset, we then compute three key metrics to evaluate the performance: Mean Accuracy (MA) represents the average pixel accuracy between the ground truth and the model's output across all images. The accuracy for each image is computed as follows: (TP+TN)/(TP+TN+FP+FN). MA alone does not accurately represent the performance of the model when it comes to segmenting the minority class. 
Relying solely on MA can create a misleading perception that the model is performing well, even if it fails to capture the details of the minority class effectively. Therefore, we consider alternative metrics that provide a more comprehensive evaluation of the model's performance, specifically focusing on the minority class. Aside from the MA metric, two other commonly used criteria to assess the effectiveness of a segmentation model are Dice Coefficient (DC) and Intersection over Union (IoU). Mean Dice Coefficient (MDC) is the mean, over all images, of the similarity between the model-generated output image and the ground truth. The values range from 0 to 1: a score of 1 indicates a perfect alignment between the predicted and ground truth regions, while a score of 0 signifies no overlap whatsoever. This metric is also effective in evaluating model performance on imbalanced datasets. The Dice Coefficient for each image is calculated as DC = (2TP)/(2TP+FP+FN). Mean Intersection over Union (MIoU) is the mean overlap of the ground truths and predictions over the total N images. The IoU for each image is the number of white pixels shared by the prediction and the ground truth divided by the number of pixels that are white in either: IoU = (TP)/(TP+FP+FN).
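The three metrics defined above can be illustrated with a short sketch. It is not the project's evaluation code: masks are represented here as nested lists of 0 (black, background) and 1 (white, sampling site), and returning 1.0 for the Dice Coefficient and IoU when both masks contain no white pixels is an assumption made only to avoid division by zero. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Mask = List[List[int]]  # 1 = white (sampling site), 0 = black (background)


def confusion_counts(pred: Mask, truth: Mask) -> Tuple[int, int, int, int]:
    """Count TP, TN, FP, FN pixels between a predicted and a ground-truth mask."""
    tp = tn = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif not p and not t:
                tn += 1
            elif p and not t:
                fp += 1
            else:
                fn += 1
    return tp, tn, fp, fn


def image_metrics(pred: Mask, truth: Mask) -> Tuple[float, float, float]:
    """Return (accuracy, Dice Coefficient, IoU) for one image."""
    tp, tn, fp, fn = confusion_counts(pred, truth)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    dice_den = 2 * tp + fp + fn
    iou_den = tp + fp + fn
    # Assumption: two empty masks count as a perfect match.
    dice = 2 * tp / dice_den if dice_den else 1.0
    iou = tp / iou_den if iou_den else 1.0
    return accuracy, dice, iou


def mean_metrics(pairs: Sequence[Tuple[Mask, Mask]]) -> Tuple[float, float, float]:
    """Return (MA, MDC, MIoU) averaged over all (prediction, truth) pairs."""
    per_image = [image_metrics(pred, truth) for pred, truth in pairs]
    n = len(per_image)
    return tuple(sum(m[i] for m in per_image) / n for i in range(3))


# A 4x4 ground truth with a single sampling-site pixel, and an all-background prediction.
truth = [[0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
all_background = [[0] * 4 for _ in range(4)]
print(image_metrics(all_background, truth))  # (0.9375, 0.0, 0.0)
```

The example shows why MA alone is misleading on this kind of data: a prediction that misses the only sampling site still scores 0.9375 in pixel accuracy, while its Dice Coefficient and IoU are both 0.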

Publications

  • Type: Conference Papers and Presentations Status: Published Year Published: 2023 Citation: Praneel Acharya, Sravanthi Bachina, Tan-Hanh Pham, Kristopher Osterloh, Kim-Doang Nguyen, "Deep-Learning Framework for Optimal Selection of Soil Sampling Sites," USDA NIFA AI in Agriculture: Innovation and Discovery to Equitably meet Producer Needs and Perceptions Conference, Orlando, April 2023.
  • Type: Journal Articles Status: Submitted Year Published: 2023 Citation: Tan-Hanh Pham, Praneel Acharya, Sravanthi Bachina, Kristopher Osterloh, Kim-Doang Nguyen, "Deep-Learning Framework for Optimal Selection of Soil Sampling Sites," Computers and Electronics in Agriculture, under review, 2023.