Source: FOUR GROWERS LLC submitted to
A NOVEL ROBOTIC SOLUTION FOR IMPROVING TOMATO HARVESTING PROCESS IN GREENHOUSES. INCLUDES DESIGN OF A NEW MACHINE LEARNING VISION ALGORITHM FOR HARVESTING.
Sponsoring Institution
National Institute of Food and Agriculture
Project Status
COMPLETE
Funding Source
Reporting Frequency
Annual
Accession No.
1016698
Grant No.
2018-33610-28553
Cumulative Award Amt.
$99,991.00
Proposal No.
2018-00241
Multistate No.
(N/A)
Project Start Date
Aug 15, 2018
Project End Date
Apr 14, 2019
Grant Year
2018
Program Code
[8.13]- Plant Production and Protection-Engineering
Recipient Organization
FOUR GROWERS LLC
265 N DITHRIDGE ST APT 5
PITTSBURGH, PA 15213
Performing Department
(N/A)
Non Technical Summary
In the US alone, 3 billion pounds of consumed fresh tomatoes are grown in greenhouse farms. These farms improve food security, require less land, and consume fewer resources; however, for these farms to operate, large labor forces are needed to maintain, harvest, and package the tomatoes. These jobs, though paying well above minimum wage, do not attract local workers and force farmers to rely on immigrant labor. This lack of local labor availability, combined with unsupportive governmental regulations, has led to high turnover rates and high training costs. This prevents growers from reaching their full potential and limits their ability to expand their operations. With new governmental regulations increasing labor costs by 35% in Ontario (one of the largest greenhouse growing regions in North America) and major labor availability issues in the US, growers are in desperate need of solutions. Many are actively seeking out automation to solve this issue, but there are currently no solutions available for them to use. This rapidly rising need, combined with the current absence of a solution, makes the development of greenhouse automation crucial for the success of this industry.
With harvesting the largest expenditure for greenhouse labor, a tomato harvesting robot is proposed. Previous researchers have attempted to develop a tomato harvesting robot for this environment, but none has yet reached the market. The main reason for this has been a lack of accuracy and speed in computer vision algorithms; however, in the last five years, convolutional neural networks (CNNs) have made this once impossible task achievable. By applying CNNs to tomato harvesting, a higher level of accuracy will be reached, thereby making the creation of a tomato harvesting robot feasible.
Animal Health Component
50%
Research Effort Categories
Basic
0%
Applied
50%
Developmental
50%
Classification

Knowledge Area (KA)   Subject of Investigation (SOI)   Field of Science (FOS)   Percent
402                   1460                             2020                     100%
Knowledge Area
402 - Engineering Systems and Equipment;

Subject Of Investigation
1460 - Tomato;

Field Of Science
2020 - Engineering;
Goals / Objectives
The proposed research aims to create an effective way for greenhouses to grow more produce at a lower cost through the development of a tomato harvesting robot. In Phase I, a vision system will be created by researching CNNs and applying them to the greenhouse industry. A deep CNN will be designed and trained to identify ripe grape tomatoes and determine the harvesting locations. The specific objectives are:
1) A properly labeled dataset of over 1,000 images for machine learning training of grape tomato determination.
2) An object detection system that can output the proper grape tomato locations with an mAP of over 80.
3) An object detection vision system that can output the proper three-dimensional coordinates for the harvesting location with an mAP of 80 or above.
4) A fully functional vision control algorithm that is fed a stereo camera input image and outputs a 3D coordinate for the cut location with an mAP > 80.
Project Methods
Image Dataset Creation
Since machine learning based computer vision is highly dependent on the quality of its training data, the first critical task is to create the image datasets needed for proper training, validation, and testing. Collecting these images will require the acquisition of thousands of images of tomato plants in varying light under all types of greenhouse conditions. To perform this task, an image gathering rig will be created that includes a stereo camera and a lighting source. Following the acquisition of the images, the team will carefully label every image by hand with bounding boxes and class labels. For optimal results, these images will be collected at commercial greenhouses, and clusters will be labeled as ripe or unripe based on grower feedback. Because YOLOv2 utilizes "context clues" from pixels outside the bounding box, the dataset will include negative training images to ensure that harvesting locations are determined only for ripe and nearly ripe tomatoes. For this dataset, 20% of the images in each category will be withheld from training and used for testing the algorithms. The 20% of each category will be selected by random sampling, which allows the test images to vary in time of day and greenhouse location and ensures that the algorithms generalize for proper operation in all greenhouses.

CNN Training
With a properly labeled dataset, training and testing for object detection will begin. We will start by training the YOLOv2 detector with the Darknet-19 classifier because the literature review revealed this to be one of the fastest, most accurate systems. Transfer learning will be used because training a CNN from scratch requires millions of images. Since there is an architecture that could provide promising results with some alteration, it was deemed more efficient to build off work that had already been performed than to rebuild everything from scratch. During this task, the YOLO9000 CNN will be retrained on the YOLOv2 architecture as detailed in (Redmon and Farhadi 2016) to identify ripe clusters and cutting ROIs. This will occur through transfer learning using the Darknet open source neural network framework. During the retraining process, parameters such as batch size, subdivisions, learning rate, and decay rate will be varied to bring the retrained network to the highest accuracy possible.

Data Analysis and Interpretation
To evaluate its performance, the trained algorithm will be run on the test images, and the output bounding boxes that label and identify the objects in each image will be compared with the ground truth labeled images. The test images will not have been used for training and therefore simulate the algorithm operating in a new environment. After running the algorithm on these images, a precision-recall (PR) curve will be calculated. This curve will then be used to evaluate the effectiveness of the algorithm, including the determination of the mAP. By comparing the PR curves of CNNs trained using different hyperparameters, patterns between output results and hyperparameters will be identified, and these insights will improve hyperparameter tuning. For each new CNN, a different 20% of the images will be segmented as test images; this ensures that the algorithm is not accidentally overfitted to the test data by the human evaluators.
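A minimal sketch of how the per-category random 80/20 split described above could be implemented is shown below; the `images_by_category` mapping, the directory layout, and the function names are illustrative assumptions rather than the project's actual code.

```python
import random
from pathlib import Path

def split_dataset(images_by_category, test_fraction=0.2, seed=42):
    """Randomly hold out a fraction of each category for testing.

    `images_by_category` is assumed to map a category name (e.g. "ripe",
    "unripe", "negative") to a list of image paths. Sampling within each
    category preserves variation in time of day and greenhouse location.
    """
    rng = random.Random(seed)
    train, test = [], []
    for category, paths in images_by_category.items():
        paths = list(paths)
        rng.shuffle(paths)
        n_test = int(round(len(paths) * test_fraction))
        test.extend(paths[:n_test])
        train.extend(paths[n_test:])
    return train, test

# Hypothetical usage: one directory per label category.
if __name__ == "__main__":
    dataset = {
        d.name: sorted(d.glob("*.jpg"))
        for d in Path("dataset").iterdir() if d.is_dir()
    }
    train_imgs, test_imgs = split_dataset(dataset)
    print(f"{len(train_imgs)} training images, {len(test_imgs)} test images")
```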
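The quantitative evaluation above could be implemented along the lines of the following sketch, which matches predicted boxes to ground truth by IoU and integrates the resulting precision-recall curve; the detection and ground-truth data formats are assumptions for illustration, and the full mAP would average this per-class AP over all classes.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def average_precision(detections, ground_truth, iou_thresh=0.5):
    """AP for one class.

    detections:   list of (image_id, confidence, box), pooled over all images
    ground_truth: dict image_id -> list of ground-truth boxes
    """
    detections = sorted(detections, key=lambda d: d[1], reverse=True)
    matched = {img: [False] * len(boxes) for img, boxes in ground_truth.items()}
    n_gt = sum(len(boxes) for boxes in ground_truth.values())
    tp = np.zeros(len(detections))
    fp = np.zeros(len(detections))
    for i, (img, _, box) in enumerate(detections):
        gt_boxes = ground_truth.get(img, [])
        ious = [iou(box, g) for g in gt_boxes]
        best = int(np.argmax(ious)) if ious else -1
        if best >= 0 and ious[best] >= iou_thresh and not matched[img][best]:
            tp[i], matched[img][best] = 1, True   # true positive: unmatched GT box
        else:
            fp[i] = 1                             # false positive: no usable match
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / max(n_gt, 1)
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-9)
    # Area under the precision-recall curve (simple trapezoidal estimate).
    return float(np.trapz(precision, recall))
```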
In addition to this quantitative evaluation, a qualitative evaluation of the algorithm will also occur. This analysis will focus primarily on the error images and on understanding why these errors occurred, with a strong emphasis on false positive errors, because a human supervisor can always harvest missed tomatoes, but cutting an unripe cluster is lost revenue. From these evaluations, insights will be gained to determine whether changing the architecture design or the dataset labeling is necessary.

Wrapping System
To wrap the system, the following main tasks must be completed: integrating the camera input into CNN1, parallelizing the depth mapping with the CNN analysis, and outputting the proper 3D coordinates. To perform these tasks, libraries such as OpenCV will be used for image management and interfacing. The primary evaluation of this task is whether the system is capable of accepting a stereo camera feed and outputting a 3D harvesting location. Upon successful completion of this task, the speed of the wrapping will be evaluated by timing each of the individual steps performed in the tasks above. By using the time for CNN1, the time for depth mapping, and the total time, the amount of time consumed by the wrapping portions can be determined. The wrapping system will be continually improved until it reaches the criterion for acceptance. In addition, the full system will be evaluated for its mAP on test images, but since every previous task will have reached its mAP metric, the integrated system should have no issues.
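Below is a rough sketch of how the wrapping layer and its per-stage timing might be instrumented; `grab_frames`, `detect_clusters`, `depth_at`, and `deproject` are hypothetical placeholders for the camera interface, CNN1, the depth mapping, and the coordinate transform, not the project's actual functions.

```python
import time

def timed(label, fn, *args, timings=None, **kwargs):
    """Run fn, record its wall-clock duration under `label`, return its result."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if timings is not None:
        timings[label] = time.perf_counter() - start
    return result

def box_center(box):
    """Center pixel of an [x1, y1, x2, y2] bounding box."""
    return (box[0] + box[2]) // 2, (box[1] + box[3]) // 2

def harvest_locations(grab_frames, detect_clusters, depth_at, deproject):
    """One pass of the wrapping layer: camera frames in, 3D cut locations out.

    All four callables are hypothetical stand-ins for the real camera,
    CNN, depth-mapping, and coordinate-transform components.
    """
    timings = {}
    color, depth = timed("camera", grab_frames, timings=timings)
    boxes = timed("cnn", detect_clusters, color, timings=timings)
    points = []
    start = time.perf_counter()
    for (u, v) in (box_center(b) for b in boxes):
        z = depth_at(depth, u, v)            # depth value at the cut pixel
        points.append(deproject(u, v, z))    # pixel + depth -> real-world x, y, z
    timings["wrapping"] = time.perf_counter() - start
    return points, timings
```

Comparing `timings["camera"]`, `timings["cnn"]`, and `timings["wrapping"]` against the total loop time would isolate how much overhead the wrapping itself adds, as described above.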

Progress 08/15/18 to 04/14/19

Outputs
Target Audience: During the project period, our efforts reached a target audience representing 2,026 acres of high-tech greenhouse farms. This group includes operation managers, growers, owners, and partners of farms.
Changes/Problems: Between proposal submission and the start of the project, one change occurred. Instead of designing a vision system for tomatoes on the vine (TOV), a system was instead designed for specialties like cherry and grape tomatoes. Before changing, this was discussed with and approved by Rachel Melnick.
What opportunities for training and professional development has the project provided? Nothing Reported
How have the results been disseminated to communities of interest? Results have been disseminated to the communities of interest through video sharing, phone calls, and in-person meetings. There have been discussions with operation managers, growers, and owners of over 2,600 acres of greenhouse farming. Some of these farms were so excited by the results that one has already deposited $40k in a paid beta, and two others have signed letters of intent totaling $9.1M.
What do you plan to do during the next reporting period to accomplish the goals? Nothing Reported

Impacts
What was accomplished under these goals? The first step to successfully harvest a tomato is to accurately detect the tomato and determine whether it is ripe and ready for harvest. In Phase I of our project, we were able to develop a vision algorithm which accurately classifies specialty tomatoes with an mAP over 88, well above the original criterion of acceptance of 80.

The first step to accomplish this was to create a dataset that would enable us to train a neural network. Using Intel RealSense cameras, we captured images of greenhouse rows and the ripe tomatoes. To increase the generalizability of our CNN, we created a dataset combining images from three different greenhouse farms across the US (California, Ohio, and New York). To prevent overfitting to just one type of tomato, we also included three different varieties of specialty tomatoes in our dataset. Because we planned to use an object detection-based approach, specifically the YOLO architecture (You Only Look Once), we labeled all the tomatoes whose color scale was greater than half of the grower's color scale (meaning any tomato slightly green-orange or darker). After all the image gathering and labeling, we had a dataset of over 2,000 images to train our CNN on.

We then trained different CNNs on our dataset with different hyperparameters and data augmentations. To evaluate its performance, each trained algorithm was run on the test images not included during training, and the output bounding boxes that label and identify the objects in the image were compared with the ground truth labeled images. From this analysis, we found that our algorithm achieved an mAP of 88.

After successfully identifying the ripe tomato, the last step was to determine the 3D harvesting locations. Using our CNN, we had the x-y component of the tomato and needed to add the depth value. Using an Intel RealSense D435, we were able to successfully align the visual and depth frames. After alignment, we extracted the depth value from the corresponding x-y coordinate in the visual frame. Because the D435 performs the depth processing onboard the camera and not on our system, we were able not only to successfully find the 3D coordinates for the vision system, but also to do so while running in real time. The result was then converted to real-world x-y-z coordinates using matrix transforms based on the lens distortions, focal lengths, and principal points of the D435 camera. The resulting output of this perception system is a 3D real-world coordinate for over 88% of tomatoes in the field of view.

In addition to successfully reaching all the goals mentioned in the original Phase I application, the team was able to create a custom end effector (the part of the robot that interacts with the environment) that does not bruise tomatoes, an initial basic manipulation planner to control a 6-degree-of-freedom robotic arm, and a mobile structure to contain it all. A video of the team's accomplishments can be seen here: https://www.youtube.com/watch?v=-qQffIHmlXk&feature=youtu.be
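The depth alignment and pixel-to-point conversion described above can be reproduced with the librealsense Python bindings (`pyrealsense2`) roughly as in the sketch below; the stream resolutions and the example pixel are placeholder values, and this is a generic illustration of the approach rather than the team's actual code.

```python
import pyrealsense2 as rs

# Start color + depth streams on a RealSense D435 (resolutions are placeholders).
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
pipeline.start(config)

# Align the depth frame to the color frame so CNN pixel coordinates line up.
align = rs.align(rs.stream.color)

try:
    frames = pipeline.wait_for_frames()
    aligned = align.process(frames)
    depth_frame = aligned.get_depth_frame()
    color_frame = aligned.get_color_frame()

    # Intrinsics (focal lengths, principal point, distortion) of the color stream.
    intrinsics = color_frame.profile.as_video_stream_profile().intrinsics

    # Hypothetical pixel taken from a CNN bounding box for a ripe cluster.
    u, v = 320, 240
    depth_m = depth_frame.get_distance(u, v)

    # Deproject pixel + depth into real-world x, y, z (meters, camera frame).
    x, y, z = rs.rs2_deproject_pixel_to_point(intrinsics, [u, v], depth_m)
    print(f"Cut point at ({x:.3f}, {y:.3f}, {z:.3f}) m")
finally:
    pipeline.stop()
```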

Publications