Recipient Organization
KANSAS STATE UNIV
(N/A)
MANHATTAN, KS 66506
Performing Department
Entomology
Non Technical Summary
Striving to ensure pollinator populations remain healthy and capable of providing important pollination services requires research by the scientific community and application by the agricultural community, the public, and conservationists. Reliable identification of pollinators, such as bees, is critical to maintaining bee health. However, because bees can have only subtle morphological differences, species-level identification can be difficult, requiring specialized taxonomic knowledge. This creates a bottleneck that is expensive and time consuming, slowing the pace of research and the adoption of new applications. Moreover, setbacks for pollinator research can result from errors based on misidentification if experts are unavailable or if funds are insufficient. However, new technologies currently being developed in the fields of machine learning and computer vision are enabling fast and reliable automated identification of objects from images. Cutting-edge techniques, such as convolutional neural networks (CNNs), are being employed in diverse fields and are helping to drive advances in precision agriculture and insect identification. We propose to develop a large image dataset of expertly identified bees to use as input to CNNs for automated bee identification. We will focus on (1) the bumble bees (Bombus) of North America and (2) the bees (Anthophila) of Kansas, USA, which represent bee subsets of interest for addressing bee health. We will also produce a mobile app that will allow non-experts to identify bees to species from images. Using state-of-the-art technology, our project will thus put a greatly needed tool into the hands of those acting to maintain pollinator health.
Animal Health Component
25%
Research Effort Categories
Basic
25%
Applied
25%
Developmental
50%
Goals / Objectives
There is a critical need for quick and reliable methods of bee identification. State-of-the-art deep learning technology can provide for this need. We propose to gather image data on North American bees to complete three objectives:
1. Develop a CNN model to classify the 46 bumble bee species of North America
2. Develop a CNN model to classify the 305 bee species of Kansas
3. Create a mobile app for novices to experts capable of identifying bees from images of pinned specimens and active bees in nature
To complete the proposed objectives, we will develop an image dataset composed of expertly identified images of pinned specimens from museum and research collections, combined with images of live bees active in nature. Models capable of classifying the complete bee fauna of the world (>20,000 species) or even North America (>4,000 species) are infeasible for this study. Therefore, in objectives 1 and 2 we will focus on two subsets of North American bee species: all bumble bee (Bombus) species occurring in North America (north of Mexico) and all bee species recorded within the state of Kansas since 1950. Focusing on these subsets of North American bees will allow us to address two goals: (1) to provide a reliable model for anyone in the US and Canada to use to identify bumble bees and (2) to take the first steps toward developing a complete dataset and set of models to classify all bee species in North America and beyond. Objective 2 will thus provide a scalable framework for expanding coverage of bee species to ever larger regions. With objective 3, we will provide stakeholders with a tool - a mobile app - for identifying bees. Our long-term goal is to put expert-level bee identification capability in the hands of anyone who needs it.
Project Methods
Dataset development: We will generate a dataset composed of images of pinned specimens with at least 510 images per species. Depending on availability, we will photograph 30 intact and high-quality individuals of each species. Specimens will be obtained from the K-State Department of Entomology collections, PD Spiesman's research collection, and specimens obtained on loan from the University of Kansas Biodiversity Institute, as well as other museum and research collections. Each individual will be photographed from 17 different viewing angles: one from the top; 8 images in a parallel plane (face, rear, left side, right side, and left and right front and back oblique angles); and 8 images looking down at 45° angles (face, rear, left side, right side, and left and right front and back oblique angles). So, for the 341 species to be imaged (North American bumble bees + other Kansas bees), this will require imaging 10,230 individual bees, resulting in 173,910 images.

Preliminary lab testing indicated that, after initial setup, it takes less than 4 minutes to acquire the 17 images, record the specimen and image ID numbers, and replace the specimen with the next in line. Thus, conservatively, working at 32 hrs./week, it would take approximately 21 person-weeks to complete the imaging. This is quite feasible to complete over the course of a summer, especially when dividing the work between a graduate student and an undergraduate assistant.

Images will be taken using a digital SLR camera mounted on a secure stand positioned so that the subject fills the frame, with a focal length that allows the subject to be completely in focus. Images will be stored as 12-megapixel jpegs, a resolution that is more than enough for CNN model input, which is usually less than 500 × 500 pixels.
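The imaging workload figures above follow from simple arithmetic; a quick sketch (all inputs are the figures stated in the text):

```python
# Back-of-the-envelope check of the imaging workload described above.
# All figures (species count, individuals, angles, minutes per specimen,
# weekly hours) come from the text.

N_SPECIES = 341              # North American bumble bees + other Kansas bees
INDIVIDUALS_PER_SPECIES = 30
ANGLES = 17                  # 1 top + 8 parallel-plane + 8 at 45 degrees
MINUTES_PER_SPECIMEN = 4     # upper bound from preliminary lab testing
HOURS_PER_WEEK = 32

specimens = N_SPECIES * INDIVIDUALS_PER_SPECIES   # individual bees to image
images = specimens * ANGLES                       # total images produced
person_weeks = specimens * MINUTES_PER_SPECIMEN / 60 / HOURS_PER_WEEK

print(specimens, images, round(person_weeks, 1))  # 10230 173910 21.3
```

The result matches the 10,230 specimens, 173,910 images, and roughly 21 person-weeks quoted above.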
A rotating stage will be set at a fixed distance so that specimens can be quickly replaced and rotated in front of the camera for multi-angle imaging.

We will also attempt to gather 500 labeled images per species of live bees in nature. In combination with images of pinned specimens, this will result in a total of over 1,000 images per species, which puts us well over the often-reported guideline that at least 100 images per class are required for CNN analysis. Images of live bees will be acquired from online databases including the Wisconsin Bumble Bee Brigade, Bumble Bee Watch, BugGuide (https://bugguide.net), iNaturalist (https://www.inaturalist.org/), and others. We will only include images that have been verified by experts. Image acquisition has already begun for bumble bees, and to date we have assembled more than 6,700 images belonging to 43 of the 46 North American species.

To validate CNN models, we will take new photographs of bees visiting flowers in the field. We will target bee groups with rare representation in field-based image datasets. For the targeted bee groups, we will take multiple photos of each individual before they are captured and returned to the lab for identification. Captured bees will also be photographed after they have been pinned for identification. The quantity and targets of these photos will be determined by data needs once our image datasets are complete.

Model development for objectives 1 and 2: We will implement our CNN models for objectives 1 and 2 in the Python programming language utilizing Keras with a TensorFlow backend. As a starting point, we will use the VGG16 CNN architecture, which is frequently used as a base model because of its ability to generalize and its relatively simple structure compared to other models such as Google's Inception or Microsoft's ResNet. VGG16 is composed of five blocks of convolutional layers.
Each block is comprised of 2 or 3 convolutional layers, which apply a number of 3×3 filters to the preceding input. Blocks end with subsampling using 2×2 MaxPooling. Three fully connected layers follow the five blocks of convolutional layers, ending with the class prediction layer. The input images will be subsampled to 224×224 pixels, which is the standard VGG16 input size. We will explore variations of the CNN architecture that may better fit our application to bee classification. For example, we will explore different image input sizes and convolutional layer dimensions, as well as ways to improve performance and limit overfitting by using dropout and image standardization. This set of model architectures will be trained on the full dataset, including both images of pinned specimens and bees in the field. We will divide the dataset into smaller subsets for more efficient model exploration, with each subset including 80% of the data for training and 20% for testing. Additional image data from field-caught specimens will be used for validation. Similar proportions will be used for training and testing on the entire dataset. We will assess model performance based on model loss, accuracy, precision, and recall, as well as run time.

Mobile app development for objective 3: The proposed work depends on deployed mechanisms for citizen data science, specifically a mobile application for rapid data acquisition, preparation, and transmission to a hybrid cloud platform. Basic requirements for this application are that it: supports a wide variety of smartphones and other mobile devices such as tablets and camera watches, producing images on the order of up to 10 megapixels (10Mb file size, for 0.1 - 10Gb per day bandwidth per user); scales well for the intended use by hundreds to hundreds of thousands of users (10Gb - 1Pb per day enterprise-wide); is usable by nontechnical bee-watchers; and is extensible by enthusiasts and mobile app developers.
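The bandwidth figures above follow from simple arithmetic. A quick check, assuming roughly 10 MB per image and 10 to 1,000 uploaded images per user per day (the per-image size is from the text; the images-per-day range is an assumption consistent with the stated per-user bandwidth):

```python
# Rough check of the per-user and enterprise-wide bandwidth estimates above.
# MB_PER_IMAGE comes from the text; the daily image counts and user counts
# are assumed ranges consistent with the stated figures.

MB_PER_IMAGE = 10

images_low, images_high = 10, 1_000     # assumed uploads per user per day
users_low, users_high = 100, 100_000    # hundreds to hundreds of thousands

per_user_low_gb = images_low * MB_PER_IMAGE / 1_000
per_user_high_gb = images_high * MB_PER_IMAGE / 1_000
enterprise_low_gb = users_low * images_low * MB_PER_IMAGE / 1_000
enterprise_high_pb = users_high * images_high * MB_PER_IMAGE / 1_000_000_000

print(per_user_low_gb, per_user_high_gb, enterprise_low_gb, enterprise_high_pb)
# 0.1 10.0 10.0 1.0
```

These recover the 0.1 - 10 Gb per user per day and 10 Gb - 1 Pb per day enterprise-wide ranges quoted above.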
We plan to develop feature-parallel and interoperable iOS (Swift) and Android (Java) versions with a shared metadata format. These will be backed by custom cloud services running on Beocat, the K-State high-performance computing (HPC) cluster. Co-PI Hsu's lab has experience with mobile application development and citizen data science for social good, and extensive experience with web-enabled front ends for uploading both training data for deep learning and individual images or batches of images for inferencing using previously or incrementally trained models.
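The model development approach described for objectives 1 and 2 (a VGG16 base with new fully connected layers, dropout, and 224×224 inputs) can be sketched in Keras as follows. This is a minimal sketch, not a final design: the dense layer width and dropout rate are illustrative assumptions, and pretrained weights are omitted here to keep the example self-contained (in practice `weights="imagenet"` would typically be used).

```python
# Sketch of the VGG16-based classifier described under model development.
# NUM_CLASSES (46 North American bumble bees) and the 224x224 input size
# come from the text; layer width and dropout rate are illustrative.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

NUM_CLASSES = 46  # objective 1: North American bumble bee species

# Five-block VGG16 convolutional base; weights=None avoids a download here,
# but pretrained ImageNet weights would be used in practice.
base = VGG16(weights=None, include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # initially train only the new top layers

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),  # dropout to limit overfitting, as noted above
    layers.Dense(NUM_CLASSES, activation="softmax"),  # class prediction layer
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Training would then use the 80/20 train/test division described above, with field-caught specimen images held out for validation; swapping `NUM_CLASSES` to 305 gives the Kansas bee model of objective 2.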