Progress 09/01/22 to 08/31/23
Outputs Target Audience: Our target audience included producers who need to perform soil sampling and analysis in their fields. We have an ongoing partnership with them and visited their farms to collect data. Other target audiences include researchers interested in precision soil sampling, whom we reached through conference presentations and peer-reviewed publications. Changes/Problems: Several major changes affected the progress of the project over the last 15 months: • The PI moved from South Dakota State University (SDSU) to the Florida Institute of Technology (FIT) in 2022. The PI's contract with SDSU ended in May 2022, and the PI's position at FIT started in August 2022. The grant transfer process was completed in December 2022, when FIT received the award notification. The PI therefore could not use the funds, and the project was essentially halted from May 2022 to December 2022. • The Ph.D. student who worked on the project with the PI at SDSU, Praneel Acharya, decided not to follow the PI to FIT because it was the final year of his graduate program (he is now an assistant professor at Minnesota State University, Mankato). Additionally, because FIT officially received the grant in December 2022, the timing left us with very limited options for recruiting new graduate students who could start in January 2023. As a result, we could not hire students with substantial relevant skills and experience who could immediately contribute toward the project objectives. Most of Spring 2023 was dedicated to training the new Ph.D. students in data science, machine learning, and soil sampling and analysis. This challenge caused further delays in executing the research tasks and achieving the project objectives. Despite these challenges, during the reporting period we still managed to submit a journal paper to Computers and Electronics in Agriculture, which is awaiting peer review.
We also presented our work at the USDA NIFA AI in Agriculture: Innovation and Discovery to Equitably Meet Producer Needs and Perceptions Conference in Orlando in April 2023. What opportunities for training and professional development has the project provided? An M.S. student (Sravanthi Bachina) and two Ph.D. students (Praneel Acharya at South Dakota State University, who has been with the project from the beginning, and Hanh Pham at Florida Institute of Technology, who started in 01/2023) have been working on this project. The M.S. student, from Co-PI Kris Osterloh's research group, is in charge of data collection and labeling (Objective 1). The two Ph.D. students, from PI Kim Nguyen's group, are in charge of developing the machine-learning models and validating the performance metrics (Objectives 2 and 3). The training and professional development activities for these students include acquiring knowledge and skills related to machine learning and data processing, collecting soil sampling data, and innovating and applying machine learning techniques to the data to accomplish the project goals. One of the Ph.D. students, Praneel Acharya, graduated in May 2023 and is currently an Assistant Professor at Minnesota State University, Mankato. How have the results been disseminated to communities of interest? We have summarized the current results in the form of a journal paper, which is under review with Computers and Electronics in Agriculture. We also presented our project at the USDA NIFA AI in Agriculture: Innovation and Discovery to Equitably Meet Producer Needs and Perceptions Conference in Orlando in April 2023. The techniques developed in this project have also been applied to data obtained in a different project; those results have been reported in two journal papers, one published in Computers and Electronics in Agriculture and the other in Scientific Reports.
What do you plan to do during the next reporting period to accomplish the goals? Below are the remaining objectives to be accomplished in the remainder of the project period: Objective 1: Establish a cyberinfrastructure of landscape data and soil sampling annotations for the training of deep convolutional neural networks. In the next phase of the project, the final delineation of training data will be completed using geographic information system (GIS) software. The collected digital elevation model (DEM) will also be used to generate a suite of landscape derivatives such as slope, aspect, and curvature. These derivatives will be used as features in the deep-learning pipeline. DEM derivatives are straightforward and can be generated automatically, unlike hillslope profile positions. Thus, strong relationships between hillslope profiles and soil properties will create a more robust final product. The collected data will be annotated and fed into a deep-learning pipeline that refines and extracts the optimal locations for soil sampling given a landscape area, as described in Objective 2. Objective 2: Develop a deep-learning pipeline to learn, analyze, and refine landscape data for automated selection of soil sampling locations. In the next phase of this research task, we will focus on developing and testing the remaining proposed modules of the machine-learning pipeline: • Region-of-Interest (RoI) pooling: Coming out of the Region Proposal process, not all regions are of the same size. The main goal of RoI pooling is to separate individual proposed regions and resize them to a fixed size while still maintaining all key features. The outputs of this RoI pooling process are well-defined regions of the same size that may contain desirable soil-sampling zones. • Inference: This module is a final refinement step in addition to the previous processes.
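To make the RoI pooling step concrete, the sketch below shows one standard way to resize variable-size proposed regions to a fixed grid via adaptive max pooling. This is an illustrative NumPy sketch of the general technique, not the project's implementation; the function name, region format, and output size are our own assumptions.

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=(7, 7)):
    """Adaptive max pooling of one proposed region to a fixed size.

    feature_map: 2-D array (H, W) of extracted features.
    roi: (row0, col0, row1, col1) bounds of a proposed region.
    out_size: fixed (rows, cols) of the pooled output.
    """
    r0, c0, r1, c1 = roi
    region = feature_map[r0:r1, c0:c1]
    out_h, out_w = out_size
    # Split the region into an out_h x out_w grid and keep the max of each
    # cell, so the strongest features survive the resizing.
    rows = np.array_split(np.arange(region.shape[0]), out_h)
    cols = np.array_split(np.arange(region.shape[1]), out_w)
    pooled = np.empty(out_size)
    for i, rs in enumerate(rows):
        for j, cs in enumerate(cols):
            pooled[i, j] = region[np.ix_(rs, cs)].max()
    return pooled
```

Regardless of each region's original size, every pooled output has the same fixed shape, which is what allows the downstream inference network to accept all proposals as inputs.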
Here, we will formulate another multilayer neural network that takes the above RoIs as inputs and outputs regions containing optimal sampling locations with high certainty and high precision. As with the Region Proposal module, the neural network will be trained via an optimization scheme. However, with the much more refined and specific landscape data input (produced by RoI pooling), the results are expected to be more precise and in higher agreement with the ground truth. Closeness to the ground truth is defined by a set of performance metrics, which dictate the number of times this entire deep-learning pipeline will be iterated; the only constraint is that additional iterations require additional training data and time. Challenges: We will also seek a solution to the most challenging problem we have faced so far: the imbalance of the input data. This problem arises from the fact that optimal soil sampling spots cover only a small number of pixels in the map, while the rest of the pixels are background. This extreme imbalance between pixel classes means that the loss of the machine learning model is always near zero even when the model produces predictions that are wrong compared to the ground truth. We will look into recent advances in machine learning, including vision transformers and attention mechanisms, to address this problem. Objective 3: Design and implement a set of metrics to assess the success rate of the optimal sampling site prediction tool. We will continue to reinforce and expand the set of performance metrics we have designed so far. Setting up performance metrics for the proposed machine learning models is a crucial step in evaluating their effectiveness and quality. The design of metrics depends on the nature of the problem, the type of data, and the specific goals of the project.
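One candidate remedy for the pixel-class imbalance described above is an overlap-based loss such as the soft Dice loss, which stays far from zero when the model ignores the minority class. The sketch below is a minimal illustration of this standard technique under our own naming assumptions, not a committed design choice for the pipeline:

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - 2|P∩T| / (|P| + |T|).

    Unlike a pixel-wise loss averaged over all pixels, the score is
    driven by overlap on the (tiny) positive class, so a model that
    predicts all background is penalized heavily instead of scoring
    a near-zero loss.
    pred: predicted probabilities in [0, 1]; target: binary ground truth.
    """
    pred = np.asarray(pred, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    intersection = (pred * target).sum()
    return 1.0 - (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```

For a map where one pixel in a hundred is a sampling spot, predicting all background yields a Dice loss near 1, whereas a pixel-accuracy-style loss would report the model as nearly perfect.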
Below is a list of performance metrics that we will consider and develop: • Accuracy: overall proportion of correctly predicted instances. • Precision: proportion of true positive predictions among all positive predictions. • Recall: proportion of true positive predictions among actual positive instances. • F1-Score: harmonic mean of precision and recall. • Area Under the Receiver Operating Characteristic Curve (AUC-ROC): measures the trade-off between the true positive rate and the false positive rate. • Mean Absolute Error (MAE): average absolute difference between predicted and actual values. • Mean Squared Error (MSE): average squared difference between predicted and actual values. • Root Mean Squared Error (RMSE): square root of the MSE. • R-squared (Coefficient of Determination): proportion of the variance in the dependent variable that is predictable from the independent variables. • Silhouette Score: measures how close each sample in one cluster is to the samples in the neighboring clusters. • Davies-Bouldin Index: measures the average similarity between each cluster and its most similar cluster. We will then: Implement the metrics: implement the chosen metrics in our evaluation code; many machine learning libraries provide built-in functions for calculating common metrics. Cross-validation: use cross-validation to estimate the model's performance on unseen data; this helps mitigate issues like overfitting and provides a more reliable estimate of the model's generalization ability. Compare multiple models: when trying different algorithms or hyperparameters, compare their performance using the chosen metrics to select the best-performing model for the task. Visualize and interpret results: visualize the results using appropriate tools such as confusion matrices, ROC curves, or precision-recall curves, and interpret them to gain insights into model performance and areas for improvement.
Iterate and improve: based on the results and insights, refine the model, feature engineering, and hyperparameters to improve performance.
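Several of the classification metrics listed above reduce to counts of true/false positives and negatives. As a minimal sketch of these standard formulas (library implementations, e.g. in scikit-learn, provide equivalent functions; the function name here is our own), they can be computed as:

```python
import numpy as np

def classification_metrics(pred, truth):
    """Accuracy, precision, recall, and F1 for binary arrays pred/truth."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(pred & truth)    # predicted positive, actually positive
    tn = np.sum(~pred & ~truth)  # predicted negative, actually negative
    fp = np.sum(pred & ~truth)   # predicted positive, actually negative
    fn = np.sum(~pred & truth)   # predicted negative, actually positive
    accuracy = (tp + tn) / pred.size
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

The guards against zero denominators matter for our data: a field with no predicted positives would otherwise make precision undefined.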
Impacts What was accomplished under these goals?
Objective 1: Establish a cyberinfrastructure of landscape data and soil sampling annotations for the training of deep convolutional neural networks (60% accomplished). We arranged a number of field trips to the Edinger Brothers Partnership farm in Davison County, SD, to collect soil sampling data from over 60 fields, ranging from 150 to 200 hectares each. We were also granted access to their soil sampling data collected from 2011 to date. These fields encompassed diverse landscapes and soil properties and were under the management of local farmers. For training and testing the model, each field was characterized by five attributes: aspect, flow accumulation, slope, yield, and normalized difference vegetation index (NDVI). The corresponding ground truths were annotated by the pedologists on our team. Digital Elevation Models (DEMs) for each field were downloaded from the LiDAR (Light Detection and Ranging) dataset of the South Dakota Geological Survey (https://www.sdgs.usd.edu/) with a spatial resolution of one meter. Terrain attributes (slope and aspect) and hydrological attributes (flow accumulation) were processed using ArcMap 10.8 (Esri, ArcGIS). Soil properties such as water holding capacity, soil texture, soil organic matter, and cation exchange capacity (CEC) were obtained from the SSURGO data sets. For this research, Sentinel-2A imagery was downloaded from the Copernicus Open Hub website with a spatial resolution of 10 meters and used to obtain NDVI values for each field. Yield data were obtained with Field View Plus (Climate LLC) and then cleaned and processed using SMS Ag software (Ag Leader, SMS Advanced). For model training, the terrain and yield attributes, including slope, aspect, flow accumulation, yield, and NDVI, are used as features or independent variables, and the corresponding ground-truth soil sampling sites, which represent the characteristics of each field, were labeled.
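For reference, NDVI is computed per pixel from the near-infrared and red surface reflectances, which for Sentinel-2 correspond to bands B8 and B4. The following minimal sketch (illustrative only; the function name and the small epsilon guard are our own) shows the computation:

```python
import numpy as np

def ndvi(nir, red, eps=1e-9):
    """NDVI = (NIR - Red) / (NIR + Red), computed per pixel.

    nir, red: reflectance arrays (for Sentinel-2, bands B8 and B4).
    Values fall in [-1, 1]; higher values indicate denser green
    vegetation. eps guards against division by zero over dark pixels.
    """
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    return (nir - red) / (nir + red + eps)
```

Applied band-by-band to a 10 m Sentinel-2A scene, this yields one NDVI value per pixel, which we aggregate per field as a model feature.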
Objective 2: Develop a deep-learning pipeline to learn, analyze, and refine landscape data for automated selection of soil sampling locations (30% accomplished). Soil sampling location selection plays an important role in soil analysis, but current soil sampling techniques produce random samples across a field, which are not representative. During the reporting period, we leveraged advances in deep learning for computer vision and image processing to find the locations that present the important characteristics of a field. The training data were collected at different fields with five features: aspect, flow accumulation, slope, NDVI (normalized difference vegetation index), and yield. The soil sampling dataset is challenging because of the imbalance between the soil sampling site area and the background area. In this work, we approached the problem with three methods: a CNN-based baseline, a Transformer-based baseline, and our proposed Transformer-based model. We analyzed the performance of the two baselines and then finalized a Transformer-based design: we proposed an atrous SegFormer MiT-B3 as a tool for dealing with the soil sampling dataset. The tool has an encoder-decoder architecture and uses SegFormer MiT-B3 as the backbone. In the encoder, the self-attention mechanism is the key feature extractor, producing feature maps. In the decoder, we introduce atrous convolution networks to concatenate and fuse the extracted features and then export the optimal locations for soil sampling. Currently, the model has achieved strong results on the testing dataset, with a mean accuracy of 99.53%, a mean Intersection over Union (IoU) of 57.35%, and a mean Dice Coefficient of 71.47%, while the corresponding values for SegFormer MiT-B3 are 99.51%, 53.97%, and 68.69%. Our proposed model thus outperforms SegFormer MiT-B3 on the soil sampling dataset.
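As background on the atrous (dilated) convolutions used in the decoder, the sketch below shows, in plain NumPy for a single channel, how dilation inserts gaps between kernel taps to widen the receptive field without adding parameters. This is an illustration of the general operation, not the project's training code, which operates on multi-channel feature maps within a deep-learning framework:

```python
import numpy as np

def atrous_conv2d(x, kernel, dilation=2):
    """2-D atrous (dilated) convolution with 'valid' padding.

    Gaps of size `dilation - 1` between kernel taps enlarge the
    receptive field while the number of kernel weights stays the
    same -- the property the decoder exploits when fusing features
    at multiple scales.
    """
    kh, kw = kernel.shape
    # Effective kernel extent after dilation.
    eh, ew = (kh - 1) * dilation + 1, (kw - 1) * dilation + 1
    out_h, out_w = x.shape[0] - eh + 1, x.shape[1] - ew + 1
    out = np.zeros((out_h, out_w))
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * x[i * dilation:i * dilation + out_h,
                                    j * dilation:j * dilation + out_w]
    return out
```

With dilation = 1 this reduces to an ordinary convolution; larger dilation rates let the same 3x3 kernel see a wider neighborhood of the feature map.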
To the best of our knowledge, our work is the first to provide a soil sampling dataset with multiple attributes and to leverage deep learning techniques to support location selection in soil sampling. Objective 3: Design and implement a set of metrics to assess the success rate of the optimal sampling site prediction tool (10% accomplished). To assess the performance of the trained model, we established a set of performance metrics that compare the ground truth soil data with the output predictions generated by the model. This allows us to compute the variables listed below, which capture the differences between the ground truth and the model's output. True positives (TP): the total number of white pixels predicted by the model that overlap with white pixels in the ground truth (both are in the positive category). True negatives (TN): the total number of black pixels predicted by the model that overlap with black pixels in the ground truth (both are in the negative category). False positives (FP): the total number of white pixels predicted by the model that overlap with black pixels in the ground truth (a positive prediction against a negative reality). False negatives (FN): the total number of black pixels predicted by the model that overlap with white pixels in the ground truth (a negative prediction against a positive reality). For each validation or testing dataset, we then compute three key metrics to evaluate performance. Mean Accuracy (MA) represents the average pixel accuracy between the ground truth and the model's output across all images. The accuracy for each image is computed as (TP + TN) / (TP + TN + FP + FN). MA alone does not accurately represent the performance of the model when it comes to segmenting the minority class.
Relying solely on MA can create a misleading perception that the model is performing well even if it fails to capture the details of the minority class effectively. Therefore, we consider alternative metrics that provide a more comprehensive evaluation of the model's performance, specifically focusing on the minority class. Aside from the MA metric, two other commonly used criteria for assessing the effectiveness of a segmentation model are the Dice Coefficient (DC) and Intersection over Union (IoU). Mean Dice Coefficient (MDC) computes the similarity between the model-generated output image and the ground truth. The quantities range from 0 to 1: a score of 1 indicates perfect alignment between the predicted and ground truth regions, while a score of 0 signifies no overlap whatsoever. This metric is also effective in evaluating model performance on imbalanced datasets. The DC for each image is calculated as DC = 2TP / (2TP + FP + FN), and MDC is its mean over all images. Mean Intersection over Union (MIoU) is the mean overlap of the ground truths and predictions over the total of N images. The IoU measures the number of white pixels that overlap between the predictions and the ground truth data: IoU = TP / (TP + FP + FN).
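As an illustration of how these three metrics follow from the pixel counts defined above, the following minimal NumPy sketch (our own illustrative code, not the project's evaluation pipeline) scores a single predicted mask against its ground truth; averaging each score over all validation images then gives MA, MDC, and MIoU:

```python
import numpy as np

def segmentation_scores(pred, truth):
    """Per-image accuracy, Dice coefficient, and IoU for binary masks.

    White (positive) pixels are True; black (background) pixels are False.
    """
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(pred & truth)    # white on white
    tn = np.sum(~pred & ~truth)  # black on black
    fp = np.sum(pred & ~truth)   # white prediction on black truth
    fn = np.sum(~pred & truth)   # black prediction on white truth
    accuracy = (tp + tn) / pred.size
    dice = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 1.0
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 1.0
    return accuracy, dice, iou
```

Note how a tiny positive region leaves accuracy near 1 even for a poor prediction, while Dice and IoU drop sharply, which is why all three are reported together.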
Publications
- Type:
Conference Papers and Presentations
Status:
Published
Year Published:
2023
Citation:
Praneel Acharya, Sravanthi Bachina, Tan-Hanh Pham, Kristopher Osterloh, Kim-Doang Nguyen, "Deep-Learning Framework for Optimal Selection of Soil Sampling Sites," USDA NIFA AI in Agriculture: Innovation and Discovery to Equitably Meet Producer Needs and Perceptions Conference, Orlando, April 2023.
- Type:
Journal Articles
Status:
Submitted
Year Published:
2023
Citation:
Tan-Hanh Pham, Praneel Acharya, Sravanthi Bachina, Kristopher Osterloh, Kim-Doang Nguyen, "Deep-Learning Framework for Optimal Selection of Soil Sampling Sites", Computers and Electronics in Agriculture, under review, 2023.