Recipient Organization
GLOUCESTER MARINE GENOMICS INSTITUTE INCORPORATED
417 MAIN ST
GLOUCESTER, MA 01930-3006
Performing Department
(N/A)
Non Technical Summary
Economic challenges imposed by climate change and disease on the aquaculture industry necessitate advances for improved animal welfare and resiliency. Biomarkers associated with environmental and disease resilience traits can be leveraged in breeding and management strategies. However, their discovery has been limited in part by the complexity of molecular systems and the cost of genomics tools used to understand them. Advances in computational approaches including machine learning algorithms, together with the wealth of genomic data that has amassed, enable powerful meta-analyses for improved biomarker discovery in aquaculture species. The proposed project aims to advance the discovery and characterization of biomarkers through mining publicly available shellfish genomic datasets from resilient populations. The objectives are to 1) develop standardized open-access, user-friendly, reproducible bioinformatics pipelines for resilience biomarker discovery through systematic reanalysis, data integration and meta-analysis and 2) build a user-friendly open-access comprehensive database of candidate resilience biomarkers that is widely available for use by the aquaculture community. The resulting database will enable improved molecular tool development for more efficient phenotype selection and health monitoring, relevant to the AFRI animal genomics program area priority goal of increasing animal fitness and improving animal welfare as well as the priority of implementing selection methods that use a systems biology approach for simultaneous improvement of multiple traits.
Animal Health Component
25%
Research Effort Categories
Basic
75%
Applied
25%
Developmental
(N/A)
Goals / Objectives
The long-term goal of this project is to improve resiliency in aquaculture by creating computational resources for genomics integration. The main objective is to create a multidimensional bioinformatics framework for resilience biomarker discovery that could be used for breeding programs, disease monitoring, and physiological assessment in aquaculture. Using shellfish genomic datasets as a proof of concept, the following supporting objectives are proposed: i. Develop standardized, open-access, user-friendly, reproducible bioinformatics pipelines for resilience biomarker discovery through systematic reanalysis, data integration, and meta-analysis. ii. Build a user-friendly, open-access, comprehensive database of candidate resilience biomarkers that is widely available for use by the aquaculture community.
Project Methods
Crassostrea gigas omics datasets from disease and environmental response population studies will be systematically reanalyzed using standardized pipelines, parameters, and reference data. Once pipeline standardization is established, omics datasets from other bivalve species will be systematically reanalyzed. Datasets and corresponding metadata will be downloaded from public repositories. Any missing metadata will be retrieved from the published study and compiled. For each data type, the same software will be used across all datasets. Available nf-core community pipelines will be used because they have already been standardized to ensure high reproducibility, with stable releases, high portability, thorough documentation, and support. New pipelines will be created for analyses not currently available, following the nf-core template and guidelines. Optimal alignment and feature calling parameters will be determined by iteratively testing the effects of different modifications on alignment statistics, quality control metrics, and total features identified across studies. Modifications that preserve the most data, give results comparable to or better than published results, and yield the highest quality control metrics will be selected. Alignment thresholds will be optimized to increase mapping rates and reduce false-positive alignments. Data will be down-sampled as needed to limit coverage variation effects. Data will go through a normalization pipeline prior to differential and integrative analyses to limit biases from differences in sequencing method. Standardized bioinformatics pipeline output will be in a common format to facilitate downstream analyses. Analysis pipelines will be deployed together on a virtual machine, and several options will be examined for open-access hosting. Both post-analysis data integration and integrated data analysis will be carried out on standardized data, and data integration pipelines will be deployed to the cloud-based virtual machine.
Covariates and random effects will be included in statistical models for differential analysis. Normalization, batch effect correction, and data harmonization methods will be applied. Different omics datasets will first be analyzed in isolation. Experimental group comparisons will include statistics to assess (1) group mean methylation, expression, or protein abundance (e.g. linear regression and Cohen's d effect size) and (2) group variance (e.g. multiple linear regression models of relative standard deviation) as they relate to phenotype. Molecular features associated with common phenotypes will then be compared across studies for biomarker discovery using meta-analysis methods (e.g. weighted Z score for combining p-values and RankSum method for combining effect sizes across datasets). Biomarker discovery via integrated data analysis will involve clustering, dimensionality reduction-based approaches, and systems-based network modeling approaches. The leave-one-out method will be used to rank biomarkers for reliability (not dependent on one dataset), and perturbation clustering will be used to rank their ability to predict phenotype. Biomarkers will be computationally validated by evaluating their ability to discriminate phenotypes within and across species in independent datasets that did not undergo integrated data analysis. Genome feature, biological process and pathway, and inferred protein-protein interaction and metabolic network annotations will be used to resolve relationships among molecular features and resilience. Unannotated features will still receive a biomarker ranking. Results from differential analyses will be compiled into a large list of molecular features of diverse classes significantly associated with resilience and will be validated by comparison with features identified in published studies. Features that are consistently associated with resilience traits across studies and stress types will be considered candidate biomarkers.
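As an illustration of the weighted Z score meta-analysis and the leave-one-out reliability check named above, the following is a minimal Python sketch, not the project's actual pipeline. It assumes one-sided p-values for the same molecular feature from several independent datasets and one weight per dataset (a common choice is the square root of the sample size, which is an assumption here and is not stated in the project description). The function names weighted_z and leave_one_out are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def weighted_z(p_values, weights):
    """Combine one-sided p-values with the weighted Z (Stouffer) method.

    Each p-value must lie strictly between 0 and 1, because the inverse
    normal CDF is undefined at the endpoints.
    Returns (combined Z score, combined one-sided p-value).
    """
    if not p_values or len(p_values) != len(weights):
        raise ValueError("need at least one p-value and one weight per p-value")
    # Small p-values map to large positive Z scores.
    z_scores = [_STD_NORMAL.inv_cdf(1.0 - p) for p in p_values]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = sqrt(sum(w * w for w in weights))
    combined_z = numerator / denominator
    combined_p = 1.0 - _STD_NORMAL.cdf(combined_z)
    return combined_z, combined_p


def leave_one_out(p_values, weights):
    """Recompute the combined p-value with each dataset removed in turn.

    Requires at least two datasets. A feature whose combined p-value stays
    small whichever dataset is dropped does not depend on a single study.
    """
    results = []
    for i in range(len(p_values)):
        remaining_p = p_values[:i] + p_values[i + 1:]
        remaining_w = weights[:i] + weights[i + 1:]
        results.append(weighted_z(remaining_p, remaining_w)[1])
    return results
```

For example, leave_one_out([0.01, 0.04, 0.20], [3.0, 4.0, 5.0]) returns three combined p-values, one per omitted dataset. Taking the largest of them as a conservative reliability measure would be one possible ranking rule, again an assumption rather than the project's stated criterion.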
Pipeline reproducibility will be evaluated by personnel at the PD's institution and by collaborators, who will run the pipelines on independent datasets, assess them against pre-established evaluation criteria (ease of access, installation time, run time, ease of setup, etc.), and provide feedback. The pipelines and systematic reanalysis will be presented at national conferences to solicit community-wide feedback, and the framework will be optimized accordingly. Attendees will use their own datasets or datasets used in the proposed project and will evaluate the framework in real time. Candidate biomarkers will be compiled into a user-friendly, interactive, comprehensive database to facilitate their use across the aquaculture field. Candidate biomarkers will be easily searchable through a column sort feature (e.g. for condition, trait, effect - resilient or sensitive) and a search bar feature for text querying. A 'download' feature will enable users to download the entire database or a filtered subset. An 'evidence' column will denote species-specific or cross-species evidence. A 'class' column will denote the discovery class (genomic feature, ontology enrichment, or protein interaction network). A 'source' column will list the publication(s) where the original data was generated. A 'rank' column will list a confidence score based on supporting evidence that incorporates criteria like the number of studies, number of species, and number of datatypes. A 'feedback' feature will allow users to submit feedback and request updates (e.g. biomarker addition) to the table. Instructions will be listed in text and screencast formats describing how the database can be used and how users can contribute or submit feedback. The table will be generated using the R function 'datatable' [54], and the basis will be a simple CSV file so that no proprietary or complex software is needed.
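The CSV-backed table described above can be sketched in a few lines of Python using only the standard library. This is an illustrative mock-up, not the planned R implementation. The column names loosely follow the features listed above, the rows are invented placeholders rather than project results, and the confidence score (a plain sum of the number of studies, species, and datatypes) is an assumed formula, since the project description does not define one. All function names are hypothetical.

```python
import csv
import io

# Placeholder rows for illustration only; these are not project results.
EXAMPLE_CSV = """biomarker,condition,effect,evidence,class,n_studies,n_species,n_datatypes
feature_A,heat stress,resilient,cross-species,genomic feature,4,2,3
feature_B,disease,sensitive,species-specific,ontology enrichment,2,1,1
feature_C,heat stress,resilient,species-specific,protein interaction network,3,1,2
"""


def load_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def confidence_score(row):
    """Assumed score: the sum of the three supporting-evidence counts."""
    return int(row["n_studies"]) + int(row["n_species"]) + int(row["n_datatypes"])


def search(rows, **filters):
    """Keep rows whose columns exactly match every given filter value."""
    return [r for r in rows if all(r[col] == val for col, val in filters.items())]


def ranked(rows):
    """Sort rows from highest to lowest assumed confidence score."""
    return sorted(rows, key=confidence_score, reverse=True)


def to_csv(rows):
    """Serialize rows back to CSV text, mimicking the 'download' feature."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

A filtered download would then be to_csv(ranked(search(load_rows(EXAMPLE_CSV), condition="heat stress"))). Because the basis is a plain CSV file, the same operations can be reproduced in any spreadsheet or scripting tool.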
The database and webpage code will be hosted on GitHub and backed up on Open Science Framework for redundancy, which allows the code and data to be maintained in perpetuity through web archiving. In addition to being mirrored on GMGI and University of Washington organizational websites, the webpage will be portable: open access through GitHub allows any user to mirror it on any site by cloning the repository. GitHub hosting allows users to join as collaborators and contributors, initially at the discretion of the PD and proposal collaborators (but extending to others in the future), to edit data files and webpage format and to openly submit feedback using the publicly accessible 'issues' function. Stable releases will be made annually to ensure that data, webpage, and software are up to date. Information from previous releases will be permanently available, and documentation of updates for each release will be available through a link on the home page. The initial release will be presented and demonstrated in virtual and in-person training meetings with proposal collaborators and lab members, who will beta-test independently. Database features and webpage functions will be thoroughly tested following a standardized task list and will be evaluated by criteria ranking (e.g. task difficulty, user-friendliness). Communication for beta testing will be facilitated by the GitHub 'discussions' and 'issues' features, which allow for tracked, annotated, prioritized, and searchable Q&A and sharing by anyone in the community. Quantitative evaluation criteria will be used, and the database will be optimized based on majority rankings and on up-voting of suggestions made in GitHub 'discussions'. Following optimization after beta-testing, a new release will be made and demonstrated in an interactive presentation to the broader community at national conferences, where community users can engage and provide further feedback.