Source: CORNELL UNIVERSITY submitted to
INFORMATION TECHNOLOGY IN ORNITHOLOGY: NEW STRATEGIES FOR CONTENT AND DATA MANAGEMENT AND ANALYSIS
Sponsoring Institution
State Agricultural Experiment Station
Project Status
TERMINATED
Funding Source
Reporting Frequency
Annual
Accession No.
0202071
Grant No.
(N/A)
Project No.
NYC-171323
Proposal No.
(N/A)
Multistate No.
(N/A)
Program Code
(N/A)
Project Start Date
Jul 1, 2004
Project End Date
Sep 30, 2009
Grant Year
(N/A)
Project Director
Kelling, S. T.
Recipient Organization
CORNELL UNIVERSITY
(N/A)
ITHACA,NY 14853
Performing Department
ORNITHOLOGY
Non Technical Summary
Our projects will expose advances in high performance computing, information organization and analysis to vast new audiences, from biologists, conservation agencies, and land planners to school classrooms and literally millions of people who watch and appreciate wild birds. We actively promote dissemination and use of this information through collaborative education and conservation programs.
Animal Health Component
(N/A)
Research Effort Categories
Basic
50%
Applied
50%
Developmental
(N/A)
Classification

Knowledge Area (KA)    Subject of Investigation (SOI)    Field of Science (FOS)    Percent
901                    0820                              2080                      50%
901                    0820                              2090                      50%
Goals / Objectives
There is a critical need to bring recent advances in Information Technologies (IT) and the Computer Sciences to a variety of environmental, educational, and conservation initiatives. To accomplish this we are actively working in two areas. First, we are developing new strategies of machine learning and mining of large datasets to analyze the distribution and abundance of bird populations via Web-based data visualizations and explorations. Second, we are expanding existing digital library and content management strategies by creating an open-source infrastructure for disseminating science knowledge and education via the Internet, one that will allow continuous annotation of the content through community input.
Project Methods
We explore new models of machine learning, develop new methodologies for the mining of large data sets, create new techniques in ensemble learning, statistical smoothing, and multi-task machine learning, advance new content management and knowledge dissemination strategies, and expand Grid and other Internet technologies to further our understanding of wild bird populations across North America.

Progress 10/01/08 to 09/30/09

Outputs
OUTPUTS: Over the past year the Avian Knowledge Network (AKN, http://www.avianknowledge.net) has made much progress in building a network of contributors, organizing massive quantities of observational data, making these data available, and providing intuitive explorations, visualizations, and analyses of these data. Members of the AKN project team, a continent-scale group of researchers from academic, federal agency, and non-governmental organizations, have focused on the development of processes that will allow us to accurately predict species occurrence across broad spatial and temporal scales. To accomplish this we have been working with the massive volume of bird occurrence data made available through the AKN. AKN DATA CENTERS: Development of the data centers that serve as the primary access nodes to the AKN progressed well. Three active data centers have been established (California Avian Data Center, Northeast Partners and Flight Data Center, and the Cornell Lab of Ornithology Data Center), with several more in development. Data Centers greatly facilitate data discovery in distributed databases, but in the absence of customized toolkits for their management and query, they depend on expert database administrators and analysts. Because a goal of the AKN is to 'unlock the data' to the broadest possible audience, we have created a series of web-based tools for managing, querying, and visualizing avian observation data. Though large datasets are amenable to summaries commonly used in research and reporting, patterns in the data may not be readily apparent, in part because the data may come from disparate sampling designs with varying filters for quality control. Specialized statistical techniques are often required to tease apart these patterns. AKN REFERENCE DATASETS: As part of our work related to this funding, the Cornell Lab of Ornithology Data Center has released the eBird Reference Dataset.
eBird (http://www.ebird.org) is a citizen science project that enlists the public in collecting large quantities of data across an array of habitats and locations over long spans of time. eBird is the largest dataset housed at the AKN, and contains more than 16.5 million observations, gathered during more than 917 thousand sampling events, at more than 155 thousand locations throughout the western hemisphere and New Zealand. PARTICIPANTS: Participant Individuals: CoPrincipal Investigator(s): Andre A Dhondt; Grant Ballard; Daniel Fink. Technician, programmer(s): Kevin Webb; Tim Levatich; Dan Danowski; Doug Moody; Mark Herzog; Chris Rintoul. Senior personnel(s): Nadav Nur; Leonardo Salas; Marshall Iliff. Technician, programmer(s): Michael Fitzgibbon; Dennis Jongsomjit; Christine Howell; Diana Stralberg. Partner Organizations: PRBO Conservation Science: Financial Support; In-kind Support; Facilities; Collaborative Research; Personnel Exchanges. USDA Forest Service Redwood Science Lab: In-kind Support; Facilities; Collaborative Research; Personnel Exchanges. Bird Studies Canada: In-kind Support; Collaborative Research; Personnel Exchanges. TARGET AUDIENCES: Educators and researchers from academic, federal agency, and non-governmental organizations who are focused on the development of processes that will allow us to accurately predict species occurrence across broad spatial and temporal scales. PROJECT MODIFICATIONS: Not relevant to this project.

Impacts
The SpatioTemporal Exploratory Model (STEM): Modeling dynamic species distributions requires that analyses deal with spatiotemporal variation on two main scales. Ecological systems often exhibit strong homogeneity when viewed at "fine" or "local" scales. There are many processes that induce similarity of nearby observations. For example, the fine-scale spatial and temporal patterning of resources induces corresponding local distribution patterns, and juvenile dispersal limitations help define the extent of "locality". Thus, the importance of accounting for spatial and temporal correlation has been broadly recognized. In contrast to fine-scale homogeneity, many ecological systems also exhibit strong heterogeneity when viewed at "coarse" or "global" scales. For example, it is known that individuals of the same species often occupy different specialized habitats at the edges of their distributions, and population dynamics processes such as the Allee effect and source-sink dynamics can create spatial patterning at relatively large spatial scales. Similarly, in the temporal domain, large-scale effects like El Niño/La Niña and the North Atlantic Oscillation create strong, relatively abrupt changes in population size and composition. The motivation for this work was to explore the continent-wide inter-annual migrations of common North American birds using data from the citizen science project eBird (http://www.ebird.org). This is challenging in part because of the great variation in migration dynamics between species. To deal with this, we sought to develop a highly automated STEM capable of producing objective, dynamic species distribution estimates with a minimum of user inputs. STEM was compared to a simpler bagged decision tree model without any scale-structure. We found that for species with highly dynamic annual migrations, STEM consistently outperformed the simpler bagged decision tree models.
When applied to non-migratory species, STEM and the bagged decision trees achieved comparable performance.
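
The contrast between a single bagged ensemble and a model with scale-structure can be illustrated with a small, self-contained sketch. It is not the STEM implementation: the report does not describe STEM's internals, so the overlapping time windows, the depth-1 trees, the synthetic "habitat" and "time" features, and all function names below are assumptions made for illustration. The simulated species prefers one habitat early in the year and the opposite habitat later, a crude stand-in for a migratory pattern.

```python
import random

def fit_stump(rows):
    """Fit a depth-1 regression tree; rows is a list of ((habitat, time), y)."""
    best = None
    for j in range(len(rows[0][0])):
        for k in range(1, 10):  # candidate thresholds 0.1 .. 0.9
            thr = k / 10
            left = [y for f, y in rows if f[j] <= thr]
            right = [y for f, y in rows if f[j] > thr]
            if not left or not right:
                continue
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, ml, mr)
    if best is None:  # degenerate sample: fall back to the mean
        mean = sum(y for _, y in rows) / len(rows)
        return lambda f: mean
    _, j, thr, ml, mr = best
    return lambda f: ml if f[j] <= thr else mr

def fit_bagged(rows, rng, n_models=10):
    """Average stumps fitted to bootstrap resamples (no scale-structure)."""
    models = [fit_stump([rng.choice(rows) for _ in rows]) for _ in range(n_models)]
    return lambda f: sum(m(f) for m in models) / len(models)

def fit_local_ensemble(rows, rng, windows):
    """Assumed scale-structure: one bagged model per overlapping time window;
    a prediction averages the models whose window contains the query time."""
    local = []
    for lo, hi in windows:
        block = [(f, y) for f, y in rows if lo <= f[1] < hi]
        local.append((lo, hi, fit_bagged(block, rng)))
    def predict(f):
        preds = [m(f) for lo, hi, m in local if lo <= f[1] < hi]
        return sum(preds) / len(preds)
    return predict

def simulate(n, rng):
    """Synthetic species: uses habitat > 0.5 in the first half of the year
    and habitat <= 0.5 in the second half."""
    rows = []
    for _ in range(n):
        h, t = rng.random(), rng.random()
        present = (h > 0.5) if t < 0.5 else (h <= 0.5)
        rows.append(((h, t), 1.0 if present else 0.0))
    return rows

def mse(model, rows):
    return sum((model(f) - y) ** 2 for f, y in rows) / len(rows)

rng = random.Random(0)
train, test = simulate(400, rng), simulate(400, rng)
windows = [(0.0, 0.5), (0.25, 0.75), (0.5, 1.0)]  # overlapping, cover [0, 1)
global_mse = mse(fit_bagged(train, rng), test)
local_mse = mse(fit_local_ensemble(train, rng, windows), test)
print(f"global bagged MSE: {global_mse:.3f}  windowed MSE: {local_mse:.3f}")
```

On this toy data the single bagged ensemble cannot represent the seasonal reversal, while the windowed ensemble recovers it wherever a window falls inside one season. For a species whose habitat association does not change with time, the two approaches would be expected to perform similarly, which matches the qualitative pattern reported above.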

Publications

  • Sorokina D., Caruana R., Riedewald M., and Fink, D. 2008. Detecting Statistical Interactions with Additive Groves of Trees. In Proc. International Conference on Machine Learning (ICML), pages 1000-1007.
  • Sullivan, B. L., C. L. Wood, M. I. Iliff, R. E. Bonney, D. Fink, and S. Kelling. 2009. eBird: A Citizen-based Bird Observation Network in the Biological Sciences. Biological Conservation 142:2282-2292.
  • Fink, D. and Hochachka, W.M. 2009. Gaussian semiparametric analysis using hierarchical predictive models. Environmental and Ecological Statistics, 3, 1011-1035. Appearing in special monograph on Modeling Demographic Processes in Marked Populations, D.L. Thomson et al. (eds).
  • Hochachka, W.M., R. Caruana, D. Fink, A. Munson, M. Riedewald, D. Sorokina, S. Kelling. 2007. Data mining for discovery of pattern and process in ecological systems. Journal of Wildlife Management 71(7):2427-2437.
  • Kelling, S., Hochachka, W.M. Fink, D. Riedewald, M. Caruana, R., Ballard, G. and Hooker, G. 2009. Data-intensive Science: A New Paradigm for Biodiversity Studies. BioScience, 59: 613-620.
  • Shaby, B. and Ruppert, D. 2009. Tapered Covariance: Bayesian Estimation, Asymptotics, and Applications. Submitted to Journal of the American Statistical Association.
  • Fink, D., Hochachka, W., Zuckerberg, B., Winkler, D.W., Shaby, B., Munson, M.A., Hooker, G., Riedewald, M., Sheldon, D., and Kelling, S. 2009. Spatiotemporal exploratory models for broad-scale survey data. Submitted to Ecological Applications.


Progress 10/01/07 to 09/30/08

Outputs
OUTPUTS: Since the last annual report, the AKN has 1) substantially increased the number of observations to more than 52 million records; 2) added the capability to incorporate bird banding data in addition to point-count data; 3) enhanced visualizations; and 4) begun advanced development of tools suitable for user-accessible analysis (BAP/SciencePipes). The AKN has also shifted focus from serving primarily as a data warehouse toward an end-user data-analysis model of operation. PARTICIPANTS: Nothing significant to report during this reporting period. TARGET AUDIENCES: Nothing significant to report during this reporting period. PROJECT MODIFICATIONS: Nothing significant to report during this reporting period.

Impacts
The increase in number of records will help us continue to serve as one of the largest biodiversity resources in the GBIF library. Adding banding data makes our data more complete and valuable for analysis by adding depth and expanding the field of data available. Additional tools and visualizations help make the data more accessible and useful, whereas the shift in focus to end-user data analysis will allow the AKN to maintain its relevance in biodiversity research.

Publications

  • No publications reported this period


Progress 01/01/07 to 11/30/07

Outputs
We have continued to make progress moving the AKN forward over the last 11 months, including: 1) expanding the database to over 36 million records; 2) adding the ability to incorporate an open-ended set of covariates such as land-cover data, lat/long, and climate/weather information; and 3) increasing the number of datasets available by partnering with a growing number of organizations.

Impacts
We have further enhanced the value of data analysis and made the results ever more available to users, due in large part to improvements in the amount and quality of data available.

Publications

  • No publications reported this period


Progress 01/01/06 to 12/31/06

Outputs
Over the past year we have made many advances in the Avian Knowledge Network. These include: 1. Increased the number of bird observations to over 22 million records. This is the largest biodiversity data resource in the GBIF library. 2. Linked over 130,000 locations with almost 1000 environmental variables. This has allowed us to explore the relationship between habitat and other environmental variables with species occurrence. 3. Developed exploratory analysis applications that are available via the Internet (http://www.avianknowledge.net).
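
Linking locations with environmental variables, as in item 2 above, amounts to a join between observation records and per-location covariates. The following is a minimal sketch of that idea only; the field names (location_id, forest_cover, human_pop_density), the sample values, and the link function are hypothetical and do not reflect the AKN schema.

```python
# Hypothetical records; field names and values are illustrative only.
observations = [
    {"location_id": "L1", "species": "Northern Cardinal", "count": 2},
    {"location_id": "L2", "species": "Northern Cardinal", "count": 0},
    {"location_id": "L3", "species": "Northern Cardinal", "count": 1},
]
covariates = {
    "L1": {"forest_cover": 0.62, "human_pop_density": 120.0},
    "L2": {"forest_cover": 0.10, "human_pop_density": 2400.0},
}

def link(observations, covariates):
    """Attach environmental covariates to each observation by location ID.

    Observations whose location has no covariates are returned separately
    so they can be reviewed rather than silently dropped.
    """
    linked, unmatched = [], []
    for obs in observations:
        env = covariates.get(obs["location_id"])
        if env is None:
            unmatched.append(obs)
        else:
            linked.append({**obs, **env})
    return linked, unmatched

linked, unmatched = link(observations, covariates)
print(len(linked), "linked;", len(unmatched), "without covariates")
```

Each linked record then carries both the bird observation and the environmental variables for its location, which is the form an exploratory analysis of habitat and occurrence would need.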

Impacts
As we continue to federate data resources we 1) provide a greater resource for research and analysis, and 2) archive a growing body of primary biodiversity information. We supply this information to a growing number of users globally.

Publications

  • No publications reported this period


Progress 01/01/05 to 12/31/05

Outputs
We have developed a novel strategy for analyzing observational data on birds that allows us to determine the causality of bird distributions and abundances at broad geographic scales. The approach is based on data mining and machine learning techniques that allow us to include hundreds if not thousands of variables that might impact bird populations (weather, habitat, anthropogenic, climate, etc.) in our analysis a priori. One of the big findings is that anthropogenic factors (human population density in particular) have dramatic significance in the distribution and abundance of wintering birds. Our work in digital libraries is developing new techniques that allow us to manage primary scientific periodicals over the Internet. We have developed an open source application that allows any scientific discipline to manage its publications (writing, editing, access, fee structuring, etc.) via the web. This has successfully been implemented for the Birds of North America (http://birds.cornell.edu/bna).

Impacts
The impact of these 2 projects will be in the area of biodiversity conservation and digital library methodologies. We are introducing an entirely new analysis schema for research in ecology: data mining and machine learning. In the area of digital libraries we are developing tools that allow organizations to efficiently manage their primary information resources over the Internet.

Publications

  • No publications reported this period