Progress 09/01/22 to 08/31/23
Outputs Target Audience: Our target audience included producers who need to perform soil sampling and analysis in their fields. We have an ongoing partnership with them and visited their farms to collect data. Other target audiences include researchers interested in precision soil sampling, whom we reached through conference presentations and peer-reviewed publications. Changes/Problems: Several major changes affected the progress of the project over the last 15 months: • The PI moved from South Dakota State University (SDSU) to the Florida Institute of Technology (FIT) in 2022. The PI's contract with SDSU ended in May 2022, and the PI's position at FIT started in August 2022. The grant transfer process was completed in December 2022, when FIT received the award notification. The PI therefore could not use the funds, and the project was essentially halted from May 2022 to December 2022. • The Ph.D. student who worked on the project with the PI at SDSU, Praneel Acharya, decided not to follow the PI to FIT because it was the final year of his graduate program (he is now an assistant professor at Minnesota State University, Mankato). Additionally, because FIT officially received the grant in December 2022, the timing left us with very limited options for recruiting new graduate students who could start in January 2023. As a result, we could not hire students with substantial relevant skills and experience who could immediately contribute toward the project objectives. Most of Spring 2023 was dedicated to training the new Ph.D. students in data science, machine learning, and soil sampling and analysis. This challenge caused further delays in executing the research tasks and achieving the project objectives. Despite these challenges, during the reporting period we still managed to submit a journal paper to Computers and Electronics in Agriculture, which is awaiting peer review.
We also presented our work at the USDA NIFA AI in Agriculture: Innovation and Discovery to Equitably Meet Producer Needs and Perceptions Conference in Orlando in April 2023. What opportunities for training and professional development has the project provided? An M.S. student (Sravanthi Bachina) and two Ph.D. students (Praneel Acharya at South Dakota State University, who has been with the project from the beginning, and Hanh Pham at Florida Institute of Technology, who started in 01/2023) have been working on this project. The M.S. student, from Co-PI Kris Osterloh's research group, is in charge of data collection and labeling (Objective 1). The two Ph.D. students, from PI Kim Nguyen's group, are in charge of developing the machine-learning models and validating the performance metrics (Objectives 2 and 3). The training and professional development activities for these students include acquiring knowledge and skills related to machine learning and data processing, collecting soil sampling data, and innovating and applying machine learning techniques to the data to accomplish the project goals. One of the Ph.D. students, Praneel Acharya, graduated in May 2023 and is currently an Assistant Professor at Minnesota State University, Mankato. How have the results been disseminated to communities of interest? We have summarized the current results in the form of a journal paper, which is under review with Computers and Electronics in Agriculture. We also presented our project at the USDA NIFA AI in Agriculture: Innovation and Discovery to Equitably Meet Producer Needs and Perceptions Conference in Orlando in April 2023. The techniques developed in this project have also been applied to data obtained in a different project; those results have been reported in two journal papers, one published in Computers and Electronics in Agriculture and the other in Scientific Reports.
What do you plan to do during the next reporting period to accomplish the goals? Below are the remaining objectives to be accomplished in the remainder of the project period: Objective 1: Establish a cyberinfrastructure of landscape data and soil sampling annotations for the training of deep convolutional neural networks. In the next phase of the project, the final delineation of training data will be completed using geographic information system (GIS) software. The collected digital elevation model (DEM) will also be used to generate a suite of landscape derivatives such as slope, aspect, and curvature. These derivatives will be used as features in the deep-learning pipeline. DEM derivatives are straightforward and can be generated automatically, unlike hillslope profile positions. Thus, strong relationships between hillslope profiles and soil properties will create a more robust final product. The collected data will be annotated and fed into a deep-learning pipeline that refines and extracts the optimal locations for soil sampling given a landscape area, as described in Objective 2. Objective 2: Develop a deep-learning pipeline to learn, analyze, and refine landscape data for automated selection of soil sampling locations. In the next phase of this research task, we will focus on developing and testing the remaining proposed modules of the machine-learning pipeline: • Region-of-Interest (RoI) pooling: Coming out of the Region Proposal process, not all regions are of the same size. The main goal of RoI pooling is to separate individual proposed regions and resize them to a fixed size while still maintaining all key features. The outputs of this RoI pooling process are well-defined regions of the same size that may contain desirable soil-sampling zones. • Inference: This module is a final refinement step in addition to the previous processes.
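To make the RoI pooling step concrete, the sketch below shows one standard way to resize variable-size proposed regions to a fixed grid via adaptive max pooling. This is an illustrative NumPy sketch of the general technique, not the project's implementation; the function name, region format, and output size are our own assumptions.

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=(7, 7)):
    """Adaptive max pooling of one proposed region to a fixed size.

    feature_map: 2-D array (H, W) of extracted features.
    roi: (row0, col0, row1, col1) bounds of a proposed region.
    out_size: fixed (rows, cols) of the pooled output.
    """
    r0, c0, r1, c1 = roi
    region = feature_map[r0:r1, c0:c1]
    out_h, out_w = out_size
    # Split the region into an out_h x out_w grid and keep the max of each
    # cell, so the strongest features survive the resizing.
    rows = np.array_split(np.arange(region.shape[0]), out_h)
    cols = np.array_split(np.arange(region.shape[1]), out_w)
    pooled = np.empty(out_size)
    for i, rs in enumerate(rows):
        for j, cs in enumerate(cols):
            pooled[i, j] = region[np.ix_(rs, cs)].max()
    return pooled
```

Regardless of each region's original size, every pooled output has the same fixed shape, which is what allows the downstream inference network to accept all proposals as inputs.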
Here, we will formulate another multilayer neural network that takes the above RoIs as inputs and outputs regions containing optimal sampling locations with high certainty and high precision. As with the Region Proposal module, the neural network will be trained via an optimization scheme. However, with the much more refined and specific landscape data input (produced by RoI pooling), the results are expected to be more precise and in higher agreement with the ground truth. Closeness to the ground truth is defined by a set of performance metrics, which dictate the number of times this entire deep-learning pipeline will be iterated; the only constraint is that additional iterations require additional training data and time. Challenges: We will also seek a solution to the most challenging problem we have faced so far: the imbalance of the input data. This problem arises from the fact that optimal soil sampling spots cover only a small number of pixels in the map, while the rest of the pixels are background. This extreme imbalance between pixel classes means that the loss of the machine learning model is always near zero even when the model produces predictions that are wrong compared to the ground truth. We will look into recent advances in machine learning, including vision transformers and attention mechanisms, to address this problem. Objective 3: Design and implement a set of metrics to assess the success rate of the optimal sampling site prediction tool. We will continue to reinforce and expand the set of performance metrics we have designed so far. Setting up performance metrics for the proposed machine learning models is a crucial step in evaluating their effectiveness and quality. The design of metrics depends on the nature of the problem, the type of data, and the specific goals of the project.
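One candidate remedy for the pixel-class imbalance described above is an overlap-based loss such as the soft Dice loss, which stays far from zero when the model ignores the minority class. The sketch below is a minimal illustration of this standard technique under our own naming assumptions, not a committed design choice for the pipeline:

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - 2|P∩T| / (|P| + |T|).

    Unlike a pixel-wise loss averaged over all pixels, the score is
    driven by overlap on the (tiny) positive class, so a model that
    predicts all background is penalized heavily instead of scoring
    a near-zero loss.
    pred: predicted probabilities in [0, 1]; target: binary ground truth.
    """
    pred = np.asarray(pred, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    intersection = (pred * target).sum()
    return 1.0 - (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```

For a map where one pixel in a hundred is a sampling spot, predicting all background yields a Dice loss near 1, whereas a pixel-accuracy-style loss would report the model as nearly perfect.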
Below is a list of performance metrics that we will consider and develop: • Accuracy: overall proportion of correctly predicted instances. • Precision: proportion of true positive predictions among all positive predictions. • Recall: proportion of true positive predictions among actual positive instances. • F1-Score: harmonic mean of precision and recall. • Area Under the Receiver Operating Characteristic Curve (AUC-ROC): measures the trade-off between the true positive rate and the false positive rate. • Mean Absolute Error (MAE): average absolute difference between predicted and actual values. • Mean Squared Error (MSE): average squared difference between predicted and actual values. • Root Mean Squared Error (RMSE): square root of the MSE. • R-squared (Coefficient of Determination): proportion of the variance in the dependent variable that is predictable from the independent variables. • Silhouette Score: measures how close each sample in one cluster is to the samples in the neighboring clusters. • Davies-Bouldin Index: measures the average similarity between each cluster and its most similar cluster. We will then: Implement the metrics: implement the chosen metrics in our evaluation code; many machine learning libraries provide built-in functions for calculating common metrics. Cross-validation: use cross-validation to estimate the model's performance on unseen data; this helps mitigate issues like overfitting and provides a more reliable estimate of the model's generalization ability. Compare multiple models: when trying different algorithms or hyperparameters, compare their performance using the chosen metrics to select the best-performing model for the task. Visualize and interpret results: visualize the results using appropriate tools such as confusion matrices, ROC curves, or precision-recall curves, and interpret them to gain insights into model performance and areas for improvement.
Iterate and improve: based on the results and insights, refine the model, feature engineering, and hyperparameters to improve performance.
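Several of the classification metrics listed above reduce to counts of true/false positives and negatives. As a minimal sketch of these standard formulas (library implementations, e.g. in scikit-learn, provide equivalent functions; the function name here is our own), they can be computed as:

```python
import numpy as np

def classification_metrics(pred, truth):
    """Accuracy, precision, recall, and F1 for binary arrays pred/truth."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(pred & truth)    # predicted positive, actually positive
    tn = np.sum(~pred & ~truth)  # predicted negative, actually negative
    fp = np.sum(pred & ~truth)   # predicted positive, actually negative
    fn = np.sum(~pred & truth)   # predicted negative, actually positive
    accuracy = (tp + tn) / pred.size
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

The guards against zero denominators matter for our data: a field with no predicted positives would otherwise make precision undefined.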
Impacts What was accomplished under these goals?
Objective 1: Establish a cyberinfrastructure of landscape data and soil sampling annotations for the training of deep convolutional neural networks (60% accomplished). We arranged a number of field trips to the Edinger Brothers Partnership farm in Davison County, SD, to collect soil sampling data from over 60 fields, ranging from 150 to 200 hectares each. We were also granted access to their soil sampling data collected from 2011 to date. These fields encompassed diverse landscapes and soil properties and were under the management of local farmers. For training and testing the model, each field was characterized by five attributes: aspect, flow accumulation, slope, yield, and normalized difference vegetation index (NDVI). The corresponding ground truths were annotated by the pedologists on our team. Digital Elevation Models (DEMs) for each field were downloaded from the LiDAR (Light Detection and Ranging) dataset of the South Dakota Geological Survey (https://www.sdgs.usd.edu/) with a spatial resolution of one meter. Terrain attributes (slope and aspect) and hydrological attributes (flow accumulation) were processed using ArcMap 10.8 (Esri, ArcGIS). Soil properties such as water holding capacity, soil texture, soil organic matter, and cation exchange capacity (CEC) were obtained from the SSURGO data sets. For this research, Sentinel-2A imagery was downloaded from the Copernicus Open Hub website with a spatial resolution of 10 meters and used to obtain NDVI values for each field. Yield data were obtained with Field View Plus (Climate LLC) and then cleaned and processed using SMS Ag software (Ag Leader, SMS Advanced). For model training, the terrain and yield attributes, including slope, aspect, flow accumulation, yield, and NDVI, are used as features or independent variables, and the corresponding ground-truth soil sampling sites, which represent the characteristics of each field, were labeled.
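For reference, NDVI is computed per pixel from the near-infrared and red surface reflectances, which for Sentinel-2 correspond to bands B8 and B4. The following minimal sketch (illustrative only; the function name and the small epsilon guard are our own) shows the computation:

```python
import numpy as np

def ndvi(nir, red, eps=1e-9):
    """NDVI = (NIR - Red) / (NIR + Red), computed per pixel.

    nir, red: reflectance arrays (for Sentinel-2, bands B8 and B4).
    Values fall in [-1, 1]; higher values indicate denser green
    vegetation. eps guards against division by zero over dark pixels.
    """
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    return (nir - red) / (nir + red + eps)
```

Applied band-by-band to a 10 m Sentinel-2A scene, this yields one NDVI value per pixel, which we aggregate per field as a model feature.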
Objective 2: Develop a deep-learning pipeline to learn, analyze, and refine landscape data for automated selection of soil sampling locations (30% accomplished). Soil sampling location selection plays an important role in soil analysis, but current soil sampling techniques produce random samples across a field, which are not representative. During the reporting period, we leveraged advances in deep learning for computer vision and image processing to find the locations that present the important characteristics of a field. The training data were collected at different fields with five features: aspect, flow accumulation, slope, NDVI (normalized difference vegetation index), and yield. The soil sampling dataset is challenging because of the imbalance between the soil sampling site area and the background area. In this work, we approached the problem with three methods: a CNN-based baseline, a Transformer-based baseline, and our proposed Transformer-based model. We analyzed the performance of the two baselines and then finalized a Transformer-based design: we proposed an atrous SegFormer MiT-B3 as a tool for dealing with the soil sampling dataset. The tool has an encoder-decoder architecture and uses SegFormer MiT-B3 as the backbone. In the encoder, the self-attention mechanism is the key feature extractor, producing feature maps. In the decoder, we introduce atrous convolution networks to concatenate and fuse the extracted features and then export the optimal locations for soil sampling. Currently, the model has achieved strong results on the testing dataset, with a mean accuracy of 99.53%, a mean Intersection over Union (IoU) of 57.35%, and a mean Dice Coefficient of 71.47%, while the corresponding values for SegFormer MiT-B3 are 99.51%, 53.97%, and 68.69%. Our proposed model thus outperforms SegFormer MiT-B3 on the soil sampling dataset.
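As background on the atrous (dilated) convolutions used in the decoder, the sketch below shows, in plain NumPy for a single channel, how dilation inserts gaps between kernel taps to widen the receptive field without adding parameters. This is an illustration of the general operation, not the project's training code, which operates on multi-channel feature maps within a deep-learning framework:

```python
import numpy as np

def atrous_conv2d(x, kernel, dilation=2):
    """2-D atrous (dilated) convolution with 'valid' padding.

    Gaps of size `dilation - 1` between kernel taps enlarge the
    receptive field while the number of kernel weights stays the
    same -- the property the decoder exploits when fusing features
    at multiple scales.
    """
    kh, kw = kernel.shape
    # Effective kernel extent after dilation.
    eh, ew = (kh - 1) * dilation + 1, (kw - 1) * dilation + 1
    out_h, out_w = x.shape[0] - eh + 1, x.shape[1] - ew + 1
    out = np.zeros((out_h, out_w))
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * x[i * dilation:i * dilation + out_h,
                                    j * dilation:j * dilation + out_w]
    return out
```

With dilation = 1 this reduces to an ordinary convolution; larger dilation rates let the same 3x3 kernel see a wider neighborhood of the feature map.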
To the best of our knowledge, our work is the first to provide a soil sampling dataset with multiple attributes and to leverage deep learning techniques to support location selection in soil sampling. Objective 3: Design and implement a set of metrics to assess the success rate of the optimal sampling site prediction tool (10% accomplished). To assess the performance of the trained model, we established a set of performance metrics that compare the ground truth soil data with the output predictions generated by the model. This allows us to compute the variables listed below, which capture the differences between the ground truth and the model's output. True positives (TP): the total number of white pixels predicted by the model that overlap with white pixels in the ground truth (both are in the positive category). True negatives (TN): the total number of black pixels predicted by the model that overlap with black pixels in the ground truth (both are in the negative category). False positives (FP): the total number of white pixels predicted by the model that overlap with black pixels in the ground truth (a positive prediction against a negative reality). False negatives (FN): the total number of black pixels predicted by the model that overlap with white pixels in the ground truth (a negative prediction against a positive reality). For each validation or testing dataset, we then compute three key metrics to evaluate performance. Mean Accuracy (MA) represents the average pixel accuracy between the ground truth and the model's output across all images. The accuracy for each image is computed as (TP + TN) / (TP + TN + FP + FN). MA alone does not accurately represent the performance of the model when it comes to segmenting the minority class.
Relying solely on MA can create a misleading perception that the model is performing well even if it fails to capture the details of the minority class effectively. Therefore, we consider alternative metrics that provide a more comprehensive evaluation of the model's performance, specifically focusing on the minority class. Aside from the MA metric, two other commonly used criteria for assessing the effectiveness of a segmentation model are the Dice Coefficient (DC) and Intersection over Union (IoU). Mean Dice Coefficient (MDC) computes the similarity between the model-generated output image and the ground truth. The quantities range from 0 to 1: a score of 1 indicates perfect alignment between the predicted and ground truth regions, while a score of 0 signifies no overlap whatsoever. This metric is also effective in evaluating model performance on imbalanced datasets. The DC for each image is calculated as DC = 2TP / (2TP + FP + FN), and MDC is its mean over all images. Mean Intersection over Union (MIoU) is the mean overlap of the ground truths and predictions over the total of N images. The IoU measures the number of white pixels that overlap between the predictions and the ground truth data: IoU = TP / (TP + FP + FN).
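As an illustration of how these three metrics follow from the pixel counts defined above, the following minimal NumPy sketch (our own illustrative code, not the project's evaluation pipeline) scores a single predicted mask against its ground truth; averaging each score over all validation images then gives MA, MDC, and MIoU:

```python
import numpy as np

def segmentation_scores(pred, truth):
    """Per-image accuracy, Dice coefficient, and IoU for binary masks.

    White (positive) pixels are True; black (background) pixels are False.
    """
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(pred & truth)    # white on white
    tn = np.sum(~pred & ~truth)  # black on black
    fp = np.sum(pred & ~truth)   # white prediction on black truth
    fn = np.sum(~pred & truth)   # black prediction on white truth
    accuracy = (tp + tn) / pred.size
    dice = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 1.0
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 1.0
    return accuracy, dice, iou
```

Note how a tiny positive region leaves accuracy near 1 even for a poor prediction, while Dice and IoU drop sharply, which is why all three are reported together.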
Publications
- Type:
Conference Papers and Presentations
Status:
Published
Year Published:
2023
Citation:
Praneel Acharya, Sravanthi Bachina, Tan-Hanh Pham, Kristopher Osterloh, Kim-Doang Nguyen, "Deep-Learning Framework for Optimal Selection of Soil Sampling Sites," USDA NIFA AI in Agriculture: Innovation and Discovery to Equitably Meet Producer Needs and Perceptions Conference, Orlando, April 2023.
- Type:
Journal Articles
Status:
Submitted
Year Published:
2023
Citation:
Tan-Hanh Pham, Praneel Acharya, Sravanthi Bachina, Kristopher Osterloh, Kim-Doang Nguyen, "Deep-Learning Framework for Optimal Selection of Soil Sampling Sites", Computers and Electronics in Agriculture, under review, 2023.