Recipient Organization
UNIV OF IDAHO
875 PERIMETER DRIVE
MOSCOW,ID 83844-9803
Performing Department
Entomology, Plant Pathology and Nematology
Non Technical Summary
Knowing the identity and history of insects and other organisms allows us to deal with challenges posed by pest species and contributes to a better understanding of their roles in natural ecosystems. Despite significant progress in understanding insect diversity, emerging pests are sometimes difficult to identify. Not knowing the correct pest identity has led to ineffective management practices in the past. This illustrates the need to improve the tools and resources used for species identification, as well as for species delimitation, the practice of determining species boundaries. Traditionally, species identification relies on increasingly rare expert knowledge, and delimitation involves sophisticated genetic analysis. This proposal focuses on the development of theory, best practices, and new tools that take advantage of the innovative technology of machine learning and artificial intelligence (AI). AI enables species identification from photographs, drastically reducing the time and effort needed. Species delimitation can also benefit from this approach, which promises to be faster and more powerful than traditional approaches to discerning species boundaries.
Animal Health Component
25%
Research Effort Categories
Basic
50%
Applied
25%
Developmental
25%
Goals / Objectives
The three overarching goals of this research program are: 1) conceptual development of effective new approaches for automated species identification and species delimitation, 2) implementation of these approaches in bioinformatics software, and 3) application of these tools to systematics of ants and other organisms.
Within these broad objectives, more specific aims include:
Investigation of the impact of various deep learning strategies (neural network architecture, input data quantity and quality, training parameters) on the efficacy of automated recognition of insect (ants and others) species from photographs.
Development of a deep learning framework for species delimitation from molecular data involving simulations and demographic model selection using neural networks.
Development of software for automated species identification based on photographic images and for molecular species delimitation.
Creation of resources for automated species identification and species delimitation in the taxonomically challenging ant genus Formica, based on software developed in the preceding aim.
Project Methods
Investigation of the impact of various deep learning strategies on automated species identification
Undergraduate research assistant Alex McKeenen and the PI will develop a computer database framework that can store taxonomic information along with images. The design of this database framework is well underway. Once it is complete, two data sets will be constructed: 1) all curated and identified images downloaded from www.BugGuide.net, a community-sourced web portal with images of North American arthropods, and 2) images of Formica ants. Data sets 1 and 2 will be complementary and partially overlapping because many agriculturally important arthropod species are already represented on BugGuide. We have already downloaded 1.5 million images representing over 37,000 species (>35% of North American insects) using a custom-written computer script that automatically downloads large amounts of information from websites. Images for data set 2 will be generated at the University of Idaho by laboratory student technicians using imaging systems available in the PI's lab, the Arthropod Molecular Systematics laboratory.
Development of a deep learning framework for species delimitation from molecular data
The deep learning approach to molecular species delimitation involves training a neural network on a large number of molecular sequences generated by simulation under different demographic models. In this context, a demographic model is a formal description of an evolutionary scenario of how populations changed over time, including splitting and fusion, gene flow, and population size change. These population-level processes leave clues in genetic data that can be used for species delimitation, helping systematists determine, for example, whether two populations of putative species status are exchanging genes. Modern population genetic software allows simulation of sequence data under user-specified evolutionary scenarios.
This means that simulated data can be used to train a neural network which, given a sequence alignment, attempts to predict which demographic model generated it.
For molecular species delimitation we will leverage existing frameworks for simulating molecular sequence data under complex evolutionary scenarios, combined with newly developed neural network training and prediction approaches. Once we understand the behavior of neural networks with simulated data sets, we will use an optimally trained network to delimit known and putative species in the extensive genomic data sets of Formica, which have been collected in the PI's previous projects (see below), as well as in non-insect systems where species identity and boundaries are known (humans) or uncertain (Darwin's finches). This compute-intensive work will be performed on resources available in the Arthropod Molecular Systematics laboratory, which include a computer workstation specifically designed for efficient training of neural networks. Additional workload will be distributed among computers of the University of Idaho Computational Resources Core, where the PI has a dedicated 20-processor server for exclusive use and account access to more than 2,000 processors.
Development of software for automated species identification and molecular species delimitation
The work proposed here will not only advance machine learning approaches in biology but will also produce software that is well documented, usable, free of charge, and maintainable. This will ensure accessibility for users and developers. Our software projects will be modeled after current best practices in scientific software development. Specifically, we will write code that is open source, thoroughly documented, version controlled (i.e., with every modification to the code tracked in a database), and hosted on-line on GitHub, a popular repository for open-source projects.
Our software will be implemented in the Python programming language.
Creation of resources for automated species identification and species delimitation in Formica
We will also apply the newly developed theory and tools to the empirical system of focus in my lab, Formica ants. To improve the poor taxonomic and phylogenetic framework available for these ants, my collaborators and I have been developing a new phylogenetic and taxonomic framework for Formica since 2016. As a result, we have collected specimens of over 100 species and generated genomic data (ultraconserved elements) from over 300 specimens representing those taxa. We will leverage these existing data and continue to collect new specimens and sequences to address outstanding questions in Formica systematics.
We will continue fieldwork and collecting to obtain additional Formica samples, generate high-quality images of all species, and continue generating genomic data at the University of Idaho. The images will be taken with the macro digital imaging system available in the PI's lab.
The images will be used to train different deep learning models and evaluate their performance on morphologically challenging classification problems. The genetic data will be used to evaluate the performance of deep learning models designed to distinguish between demographic models and aid in species delimitation.
Because the genus Formica contains 175 described species, a comprehensive systematic treatment using genomic data for molecular species delimitation is not feasible within the proposed time frame. We will therefore focus our detailed systematic studies on North American members of the Formica rufa species group, the so-called wood ants, as well as other selected clades in the genus, including certain species complexes of North American F. fusca group species. In total, these account for 22 described species and several putative new species.
This smaller taxonomic and geographic focus will allow us to collect comprehensive genomic data from across the species' ranges that will then be used to infer population-level processes that bear on species delimitation.
In addition to collecting genomic data, we will develop an on-line application to aid in automated identification of all North American Formica species based on images generated during the course of this work.
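The simulate-then-classify idea described under "Development of a deep learning framework for species delimitation from molecular data" can be illustrated with a minimal Python sketch. It is not the project's software, and everything in it is an assumption made for illustration. A two-lineage coalescent approximation stands in for real population genetic simulation software, a small logistic regression stands in for a neural network, and only two demographic models are compared: a single panmictic population, and two populations that split with no subsequent gene flow. All function names and parameter values (theta, split_time, n_loci) are hypothetical.

```python
import math
import random


def poisson(rng, lam):
    """Draw a Poisson variate (Knuth's multiplication method; fine for small lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_features(rng, split_time, n_loci=200, theta=2.0):
    """Summarize one simulated data set as two mean pairwise-difference statistics.

    split_time == 0 models one panmictic population; split_time > 0 models two
    populations that diverged that long ago (coalescent units) without gene flow.
    Assumes equal, constant population sizes and independent loci.
    """
    within = between = 0
    for _ in range(n_loci):
        # Two lineages from the same population coalesce after an Exp(1) wait.
        within += poisson(rng, theta * rng.expovariate(1.0))
        # Lineages from different populations cannot coalesce before the split.
        between += poisson(rng, theta * (split_time + rng.expovariate(1.0)))
    return within / n_loci, between / n_loci


def make_dataset(rng, n_per_model, split_time=1.0):
    """Simulate labeled examples: 0 = panmictic model, 1 = split model."""
    data = []
    for _ in range(n_per_model):
        data.append((simulate_features(rng, 0.0), 0))
        data.append((simulate_features(rng, split_time), 1))
    rng.shuffle(data)
    return data


def predict(w, b, x):
    """Predicted probability that the split model generated features x."""
    z = w[0] * x[0] + w[1] * x[1] + b
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))


def train(data, epochs=200, lr=0.05):
    """Fit the logistic regression by plain stochastic gradient descent."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            err = predict(w, b, x) - y
            w[0] -= lr * err * x[0]
            w[1] -= lr * err * x[1]
            b -= lr * err
    return w, b


def accuracy(data, w, b):
    """Fraction of data sets assigned to the model that generated them."""
    hits = sum((predict(w, b, x) > 0.5) == bool(y) for x, y in data)
    return hits / len(data)


if __name__ == "__main__":
    rng = random.Random(0)
    w, b = train(make_dataset(rng, 100))
    print("held-out accuracy:", accuracy(make_dataset(rng, 100), w, b))
```

The classifier is trained only on simulated data and is then scored on freshly simulated data sets it has not seen, which mirrors the step of understanding network behavior on simulations before applying a trained network to empirical alignments. A realistic implementation would replace the summary statistics with richer inputs and add models that include gene flow and population size change.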