Source: UNIV OF CONNECTICUT submitted to
FACT: ENABLING ASSOCIATION MAPPING AND LANDSCAPE GENOMICS THROUGH THE ADVANCED INTEGRATION OF GENOTYPE, PHENOTYPE, AND GEOSPATIAL DATA
Sponsoring Institution
National Institute of Food and Agriculture
Project Status
NEW
Funding Source
Reporting Frequency
Annual
Accession No.
1019897
Grant No.
2019-67021-29920
Project No.
CONW-2018-09223
Proposal No.
2018-09223
Multistate No.
(N/A)
Program Code
A1541
Project Start Date
Aug 1, 2019
Project End Date
Jul 31, 2022
Grant Year
2019
Project Director
Wegrzyn, J.
Recipient Organization
UNIV OF CONNECTICUT
(N/A)
STORRS,CT 06269
Performing Department
Environmental & Evol. Biology
Non Technical Summary
Funding agencies have made significant investments to mitigate losses and improve production of economically and ecologically important species, including support for population genetic studies. These studies require the generation and integration of genotype, phenotype, and environmental data. The associated datasets must also be connected to analytical pipelines supported by high-performance computing to contend with high-throughput technologies. Currently, applications for collecting these data and the associated metadata are rare and not universally adopted. Moreover, the outcomes of large-scale studies, which reveal potential adaptive and causal variants, are often lost between genome versions and intermediate assemblies.
We will develop the first web-based application that integrates genotype-phenotype metadata and data for model and non-model plant systems in a geospatial context. The field-to-analysis framework will connect data collection, data submission, ontology-based metadata annotation, storage, and exchange directly to high-performance computing for association and landscape genomics analysis to examine adaptive potential, genotypic diversity from wild accessions, the impact of invasive species, and the productivity of breeding populations. This application is being developed within the open-source Tripal framework, which represents a federation of over 25 plant and animal genomics/phenomics databases.
Plant health, productivity, and biogeographical response to environmental challenges have consequences beyond food and timber production: they affect the environment, spanning areas of biodiversity, carbon cycling, and planetary health. Providing access to high-quality genotypic, phenotypic, and environmental data in a semantically aware, geospatially enabled, web-based platform connected to high-performance computing resources will enable interrogation, storage, and exchange of data from large-scale association studies.
Animal Health Component
0%
Research Effort Categories
Basic
40%
Applied
10%
Developmental
50%
Classification

Knowledge Area (KA)    Subject of Investigation (SOI)    Field of Science (FOS)    Percent
201                    0430                              1080                      25%
202                    0699                              1081                      25%
136                    0699                              1070                      25%
123                    0199                              1060                      25%
Goals / Objectives
The objective of this proposal is to develop CartograTree, the first web-based application that integrates genotype-phenotype metadata and raw data for georeferenced individuals from model and non-model systems. The software will provide an interactive geospatial interface and access to high-performance computing for real-time analysis of important biological questions. This framework will be implemented for tree databases but also developed as an independent Tripal module, CartograPlant.
Goal 1: Leverage Tripal3-enabled databases to build a FAIR model for data submission, storage, and validation across genotype, phenotype, and environmental data that can be applied broadly to plants.
Goal 2: Enable Web-based association and landscape genetics workflows that can interpret heterogeneous input data and provide a robust data filtering, visualization, and analytic framework via Tripal Galaxy.
Goal 3: Integrate diverse environmental layers to provide rapid access to high-resolution data for georeferenced individuals that can be directly assessed for landscape genomics.
Goal 4: Train and educate researchers in landscape genomics via TreeSnap and CartograTree, as well as Tripal plant database providers with CartograPlant.
Project Methods
We will build a global capability for landscape genomics and association mapping through the following steps:
1. Develop and test metadata collection, identifier assignment, and machine-actionable datasets with collaborator data for the Tripal Plant PopGen Submit module
2. Map the scientific protocols for association mapping (AM) and landscape genomics to informatic workflows
3. Develop new approaches to facilitate geospatial data discovery and integration
4. Integrate with TreeSnap to provide a field-to-data connection with CartograTree for scientists
5. Architect and implement the major components: Web-based, interactive geospatially-aware GWAS for forest trees
1. Tripal Plant PopGen Submit Pipeline
The current TGDR pipeline serves as a basic, but not FAIR, representation for data submission for forest tree species. The proposed Tripal Plant PopGen Submit (TPPS) pipeline will provide an interface for the collection of genotypic, phenotypic, and environmental data from any georeferenced (exact or regional) plant-focused experiment. Once the user is logged in to their Tripal database profile, TPPS will ask the user specific questions on the design of the study, starting with the data that were obtained (what combination of genotype, phenotype, and environmental data). The user will be asked to provide a text file identifying the species and georeferenced location of the trees and/or sites that were evaluated. This file will be validated as machine-readable before proceeding. Metadata on the analytical process will be recorded through directed questions relating to tools, statistical assumptions, and population/kinship analysis. Following successful submission, the researcher will be provided with a long-term TreeGenes accession number that is associated with a DOI generated via Zenodo.
2. Informatic components supporting Association Mapping and Landscape Genomics
Submissions of genotype, phenotype, and environmental (G/P/E) data will take place via TPPS, which will be hosted at TreeGenes.
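As an illustration of the TPPS check that the accession file is machine-readable, the following is a minimal sketch and not the TPPS implementation. It assumes a tab-delimited file; the column names (`tree_id`, `species`, `latitude`, `longitude`) and the function name `validate_accessions` are hypothetical and do not come from the project description.

```python
import csv
import io

# Hypothetical column names; the real TPPS file layout is not specified here.
REQUIRED_COLUMNS = ("tree_id", "species", "latitude", "longitude")


def validate_accessions(text):
    """Parse a tab-delimited accession file and return (rows, errors)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [], ["missing column(s): " + ", ".join(missing)]

    rows, errors = [], []
    # The header is line 1, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        try:
            lat = float(row["latitude"])
            lon = float(row["longitude"])
        except (TypeError, ValueError):
            errors.append(f"line {line_no}: coordinates are not numeric")
            continue
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            errors.append(f"line {line_no}: coordinates out of range")
            continue
        species = (row["species"] or "").strip()
        tree_id = (row["tree_id"] or "").strip()
        if not species or not tree_id:
            errors.append(f"line {line_no}: tree_id and species are required")
            continue
        rows.append({"tree_id": tree_id, "species": species,
                     "latitude": lat, "longitude": lon})
    return rows, errors


sample = (
    "tree_id\tspecies\tlatitude\tlongitude\n"
    "T1\tPinus taeda\t35.1\t-79.0\n"
    "T2\tPinus taeda\tabc\t-79.0\n"
    "T3\tQuercus alba\t95.0\t-72.3\n"
)
rows, errors = validate_accessions(sample)
print(len(rows), "valid row(s)")
for message in errors:
    print(message)
```

Reporting every problem with its line number, rather than stopping at the first one, lets a submitter correct the whole file in a single pass.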
In addition to direct submissions, the source databases will import data from TreeSnap (georeferenced phenotype), Dryad (georeferenced genotype/phenotype), and TRY-DB (georeferenced phenotype). Any public user of CartograTree will be able to browse, query, and analyze the public datasets.
Genotypes: The majority of the analytical functionality will focus on SNP markers, the predominant marker type for fine-scale mapping. Upon selecting a set of individuals, users will view a summary of shared SNPs with statistics on missing data, allele frequencies, and previously identified associations. Metadata associated with the genotypes include the sequencing technology, the bioinformatic processes used to identify the polymorphism, physical/relative locations, and validation status. The metadata for a given association will include the imputation techniques applied, the analysis model used, covariates (kinship and population structure), and the multiple-testing methods employed.
Phenotypes: High-throughput phenotyping (phenomics) as well as traditional phenotyping metrics are housed in a variety of repositories and formats. CartograTree will focus on phenotypic data retrieved from three different derivative sources. For the purpose of this application, our goal is to leverage the growing set of repositories that are enabling a variety of reporting standards.
Analysis: After filters are applied, the user may access the active search or combine several saved searches associated with their profile. The combined datasets, which may include any combination of genotype, phenotype, and environmental data, are pre-processed before being uploaded for analysis on a high-performance computing (HPC) cluster. The location of the Galaxy run will be transparent to the user, but the design is optimized to resolve significant loads on the local UCONN clusters.
Since TPPS-submitted datasets contain information on the study design (common garden, landscape, etc.), the workflows will prompt the user with recommended processes based upon the metadata available. Once the analysis is complete, CartograTree will download the results from the Galaxy instance into the user's profile and convert them to a vector layer format, which the user can display on the map as an overlay.
3. Develop new approaches to facilitate geospatial data discovery and integration
In CartograTree, the Mapnik/Renderd/Mod_tile framework will allow tiling of OpenStreetMap (OSM), and GeoServer with GeoWebCache will allow tiling of the other layers (both raster and vector formats) to optimize performance. GeoServer provides different functionality depending on the type of layer. In addition, GeoServer enables cross-querying of the data in vector layers through the Extended Common Query Language (ECQL). We will extend the functionality of GeoServer to allow querying raster layers as well.
4. Architect and implement the major components: Web-based, interactive geospatially-aware GWAS for forest trees
(1) The user experience (UX) is an immersive, full-browser-window view of the geographical mapping area. The client-side architecture is a Single Page Application (SPA), in which page elements are updated continuously and asynchronously. The user does not navigate across multiple pages but carries out all interactions on the opening page. The SPA is enabled by a JavaScript MVC (Model-View-Controller) framework (2) using jQuery, Backbone, and Bootstrap for scaffolding and user interface (UI) components, and MapBox GL JS and other libraries for map tiling interaction. Industry-standard HTML5, CSS3, and JavaScript AJAX technologies are supported by all major browsers. The client-side MVC framework communicates directly with tiling servers for tiling maps (3).
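To make the ECQL cross-query of vector layers described in step 3 concrete, here is a minimal sketch of how a client might build a GeoServer WFS request carrying an ECQL expression in GeoServer's `CQL_FILTER` parameter. The endpoint URL, the layer name `ct:trees`, and the attributes `species` and `elevation_m` are hypothetical placeholders, not names from CartograTree.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def build_wfs_query(base_url, layer, ecql, max_features=100):
    """Build a WFS 2.0.0 GetFeature URL that filters a vector layer with ECQL."""
    params = {
        "service": "WFS",
        "version": "2.0.0",
        "request": "GetFeature",
        "typeNames": layer,
        "outputFormat": "application/json",  # GeoJSON response from GeoServer
        "count": max_features,
        "CQL_FILTER": ecql,  # GeoServer vendor parameter that accepts ECQL
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint, layer, and attribute names, for illustration only.
url = build_wfs_query(
    "https://example.org/geoserver/wfs",
    "ct:trees",
    "species = 'Pinus taeda' AND elevation_m BETWEEN 200 AND 800",
)
print(url)

# Decoding the query string confirms the filter survives URL encoding intact.
print(parse_qs(urlparse(url).query)["CQL_FILTER"][0])
```

Requesting GeoJSON keeps the response in a vector format that a map client can draw as an overlay without further conversion.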
Custom user layers will be pre-processed into mapping tiles and hosted along with open-source OpenStreetMap tiles on either our own open-source GeoServer tiling server or on MapBox. User layers are available for geo-queries (4). For CartograTree persistence, we will use a NoSQL JSON data store such as MongoDB, specifically for high-density genotype and environmental data. This store acts as an interface to the backend source data, yielding the performance benefits that responsive Web applications require. GWAS data is derived from TreeGenes and HWG submissions via TPPS, and pre-processed for Web application querying and delivery (5). Lastly, best practice is to segregate Web servers and analytic servers (6). Specific workflows around PLINK will be developed with support from Galaxy workflows and the Tripal Gateway. The workflow results are stored locally in the user's profile and pushed out to CyVerse, via a connected profile, for longer-term storage.
5. Advance and integrate TreeSnap, a citizen science and phenotyping mobile platform
TreeSnap, a forest health mobile application developed by Co-PI Staton, aims to connect the public with ongoing tree research programs. TreeSnap uses the ubiquity of smartphones to engage the public in scouting for trees affected by invasive insects and diseases. The TreeSnap mobile application, available for free on iOS and Android, has an intuitive interface that allows users to take photos, collect GPS coordinates, and report on tree features. This proposal will support an established connection (API) between TreeSnap and CartograTree to provide a single interface with access to the geolocated data from both sources.
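As a rough illustration of the PLINK-centered workflows mentioned in component (6), the sketch below assembles a PLINK 1.9 command line for a quantitative-trait association test with covariates and multiple-testing adjustment. It is not the project's Galaxy workflow: the file names, the thresholds, and the function name `plink_assoc_command` are assumptions chosen for the example. The command is only built, not executed.

```python
import shlex


def plink_assoc_command(bfile, pheno, covar, out, maf=0.05, geno=0.1):
    """Return the argument list for a PLINK 1.9 linear association run."""
    return [
        "plink",
        "--bfile", bfile,    # prefix of the .bed/.bim/.fam genotype files
        "--pheno", pheno,    # phenotype file
        "--covar", covar,    # covariates, e.g. population-structure components
        "--maf", str(maf),   # drop variants below this minor allele frequency
        "--geno", str(geno), # drop variants with a higher missing-call rate
        "--linear",          # linear regression for a quantitative trait
        "--adjust",          # report multiple-testing-adjusted p-values
        "--allow-no-sex",    # keep phenotypes of samples without a sex code
        "--out", out,        # prefix for the result files
    ]


# Hypothetical file names, for illustration only.
cmd = plink_assoc_command("study", "traits.txt", "structure.txt", "results")
print(shlex.join(cmd))
```

Building the command as a list rather than a single string means it can be passed to `subprocess.run` without shell quoting problems if a workflow step does execute it.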