Source: NORTH CAROLINA STATE UNIV submitted to NRP
COMPARATIVE TOXICOGENOMICS DATABASE (CTD)
Sponsoring Institution
State Agricultural Experiment Station
Project Status
COMPLETE
Funding Source
Reporting Frequency
Annual
Accession No.
0232460
Grant No.
(N/A)
Cumulative Award Amt.
(N/A)
Proposal No.
(N/A)
Multistate No.
(N/A)
Project Start Date
Oct 1, 2012
Project End Date
Jul 1, 2013
Grant Year
(N/A)
Program Code
[(N/A)]- (N/A)
Recipient Organization
NORTH CAROLINA STATE UNIV
(N/A)
RALEIGH,NC 27695
Performing Department
Biology
Non Technical Summary
There is an expressed need for the inclusion of exposure data into the equations used to prioritize environmental health research. This need is not being met. The lack of centralization and contextualization of exposure information in the public domain is limiting the ability of researchers to fully exploit the potential of these data. For several years, we have been developing the publicly available resource, CTD, to promote mechanistic understanding of environment-disease connections using curation and data integration strategies. Our ability to leverage the CTD infrastructure, its biological framework, and our substantial curation, software engineering and statistical expertise uniquely positions us to immediately address the need for centralization and biological contextualization of exposure data. Importantly, this project has the support of the exposure science community. We formed an Exposure Curation Working Group with active and diverse members of this community to develop an appropriate and valuable scope for curation and an exposure ontology, which will be completed in Summer 2010 and will be used for this proposed project. In addition to addressing the deficits that have long been facing the exposure science community, curation and integration of exposure data will further enhance the value of CTD by providing "real-world" exposure context for our existing chemical, gene/protein, pathway, and disease data sets. The resulting publicly available resource will coordinate data and analysis tools key to enhancing the capacity to prioritize environmental health research and uncover the complex connections between the environment and human disease.
Animal Health Component
(N/A)
Research Effort Categories
Basic
100%
Applied
(N/A)
Developmental
(N/A)
Classification

Knowledge Area (KA)    Subject of Investigation (SOI)    Field of Science (FOS)    Percent
723                    7310                              1150                      100%
Goals / Objectives
Our objective is to provide a centralized, publicly available resource with comprehensive, well-annotated data and analysis tools that informs design and interpretation of environmental health studies and promotes novel insights into the etiologies of environmentally influenced diseases. Most human diseases involve interactions between genetic and environmental factors; however, the basis of these complex interactions is not well understood, which limits improvements in toxicity prediction, risk assessment, research prioritization and therapeutic interventions. We developed the Comparative Toxicogenomics Database (CTD; http://ctdbase.org) to enhance understanding about environment-disease connections by providing manually curated data describing chemical-gene/protein interactions and chemical- and gene/protein-disease relationships from the peer-reviewed literature and integrating these data with select external data sets (e.g., pathways and biological process data) and novel data analysis tools. In this application, we propose to leverage our expertise and CTD infrastructure to: 1) enhance the capacity to identify environment-disease connections by curating and integrating exposure data into CTD; and 2) expand the capacity for prediction, analysis and interpretation of environment-disease networks by developing novel analysis and visualization tools that include exposure data. This proposal responds to the needs expressed by the NIEHS and partner agencies for inclusion of exposure data when prioritizing research and performing toxicity testing; it addresses the need for centralization of exposure data in a broader biological context; and it will provide "real-world" exposure context for existing data in CTD. The resulting resource will enable new opportunities for understanding and prioritizing human health effects from exposure and their underlying etiologies and coordinate data key to enhancing the capacity for toxicity prediction and risk assessment.
Project Methods
Specific aim 1. Enhance the capacity to identify environment-disease connections by curating and integrating exposure data within the broader environmental health framework of the Comparative Toxicogenomics Database (CTD). Prioritize data sources for curation and integration. We will integrate exposure data into CTD. Sources for exposure data will include manually curated information from the published literature and from publicly available sources. Consistent with our existing curation protocol, we will define priority journals for curation and external data sets for integration. Develop data curation protocol and initiate exposure curation. We will develop a formal curation protocol that outlines the requirements, process and standard operating procedures for manual curation of exposure data. The triage process to identify articles for curation will be modified to include exposure-related publications. Expand CTD data model and data load procedures to accommodate integration of exposure data. Once the curation protocol is clearly established, CTD software engineers will begin designing and developing the processes and tools necessary to: a) load the ontology to a database and maintain it, b) enable curators to enter exposure data extracted from scientific publications using the ontology, c) integrate the exposure data with CTD data, and d) view the exposure data via the web. Expand text-mining pipeline and curation application to accommodate exposure data. We will expand our online curation application to allow for curation of exposure data, which will involve making modifications to data fields related to exposure curation, QC measures, and requirements for highlighting curatable terms. Specific aim 2. Expand the capacity for prediction, analysis and interpretation of environment-disease networks by developing novel analysis and visualization tools that include exposure data. 
Prioritize computational research questions and develop and test statistical models that will address these questions. We will identify and prioritize exposure-related research questions and challenges needing computational solutions. We will investigate several different methods that will allow us to build new computational models using multiple lines of evidence for the same interaction or relationship such as Bayesian network analysis. Encode data analysis pipelines and design and implement visualization options for data presentation in the CTD web application. Because data analysis requirements associated with networks tend to be extremely computationally- and resource-intensive, it is anticipated that high performance computing tools will have to be developed to pre-process the data elements required for visual presentation.

Progress 10/01/12 to 07/01/13

Outputs
Target Audience: Researchers and students in environmental health sciences.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? Over its lifetime, this project has provided significant opportunities for technical development by several software developers and for training of multiple scientists in data curation. It has also facilitated collaborations and expansion of research opportunities in undergraduate courses.
How have the results been disseminated to communities of interest? This resource is publicly available. All information and analysis tools are accessible and documented at http://ctdbase.org.
What do you plan to do during the next reporting period to accomplish the goals? Nothing Reported

Impacts
What was accomplished under these goals? Data Curation. CTD is the only publicly available database that provides curated data describing molecular mechanisms of action of chemicals and disease relationships. Currently, CTD provides over 820,000 curated interactions between more than 9,800 chemicals and 29,000 genes and proteins in over 450 species. CTD also presents more than 184,000 direct chemical-disease and 26,000 direct gene-disease relationships. Data integration in CTD also enables novel inferred relationships to be made. For example, an inferred chemical-disease relationship is established via curated chemical–gene interactions (e.g., chemical A is associated with disease B because chemical A has a curated interaction with gene C, and gene C has a direct relationship with disease B). These inferred relationships help users develop hypotheses about mechanisms underlying environmental diseases. We also developed novel statistical analyses of these inferences (King et al. PLoS One. 2012;7(11):e46524.). Similarly, integration of curated chemical-gene-disease data with external data sets like the Gene Ontology, pathways and protein-protein interaction data provides insights into the functions and pathways affected by chemical exposures. Finally, curated information in CTD enables the research community to leverage broad-based legacy data, identify connections and patterns that might not otherwise be apparent, and use these insights coordinately with evolving technologies and emerging experimental approaches. To ensure that our data remain current and as complete as possible, we modified our literature triaging protocol via: a) journal-centric curation and b) curation updates of priority chemicals. We conducted a test phase by prioritizing curation of articles from “Toxicological Sciences” (2009, 2010, and 2011), “Chemico-Biological Interactions” (2009, 2010, and 2011), and “Environmental Health Perspectives” (2009, 2010, and 2011). 
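The inferred chemical-disease relationships described under Data Curation above can be illustrated with a minimal sketch. It joins curated chemical-gene interactions to direct gene-disease relationships through shared genes and records which genes support each inference. The data and the function name `infer_chemical_disease` are hypothetical and chosen for illustration only; this is not CTD's actual implementation or data model.

```python
from collections import defaultdict

# Hypothetical toy records, not actual CTD data.
chemical_gene = [
    ("chemical A", "gene C"),
    ("chemical A", "gene D"),
    ("chemical X", "gene D"),
]
gene_disease = [
    ("gene C", "disease B"),
    ("gene D", "disease B"),
    ("gene D", "disease E"),
]


def infer_chemical_disease(chemical_gene, gene_disease):
    """Return {(chemical, disease): set of genes that support the inference}."""
    # Index the direct gene-disease relationships by gene.
    diseases_by_gene = defaultdict(set)
    for gene, disease in gene_disease:
        diseases_by_gene[gene].add(disease)

    # A chemical is linked to a disease whenever they share at least one gene.
    inferred = defaultdict(set)
    for chemical, gene in chemical_gene:
        for disease in diseases_by_gene.get(gene, ()):
            inferred[(chemical, disease)].add(gene)
    return dict(inferred)


inferences = infer_chemical_disease(chemical_gene, gene_disease)
for (chemical, disease), genes in sorted(inferences.items()):
    print(f"{chemical} -> {disease} via {sorted(genes)}")
```

Keeping the supporting gene set for each pair, rather than a yes/no flag, makes it possible to rank or inspect inferences afterwards, for example by the number of shared genes.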
We also updated curation of many priority chemicals (e.g., bisphenol A, arsenic). Details of these curation modifications were published recently (Database (Oxford). 2012 Dec 6;2012:bas051). Text mining innovation. The Critical Assessment of Information Extraction systems in Biology (BioCreAtIvE) challenge evaluation is a community-wide effort for evaluating text mining and information extraction systems for the biological domain. CTD staff have co-organized two workshops that bring together international groups to advance text-mining capabilities. Ontology development. We continue to enhance our controlled vocabularies for CTD curation and for the broader scientific community. At the Biocuration meeting in April 2012, we presented our disease vocabulary, MEDIC, which merges diseases from MeSH and OMIM into a hierarchical structure. We have since published a report on this vocabulary, and it is available for download in several formats from the CTD site for community use (Database 2012:bar065.). Analysis tool development. Many new analysis tools and data visualization strategies were implemented during this reporting period; several are highlighted below and described in more detail in Nucleic Acids Research 41: D1104-14. Pathway prediction tool. We expanded the pathway prediction capacity of CTD significantly. Initially we connected users with pathways from the KEGG and Reactome databases based on genes associated with a chemical or disease of interest (see enricher tool below). We recently incorporated all of the curated protein-protein interaction and gene regulatory data from the BIND database and use these data in combination with the Cytoscape visualization tool to allow generation of novel interaction pathways among genes in CTD (e.g., interacting genes for a chemical of interest, or genes that form the basis of an inferred chemical-disease relationship). 
This capability allows users to not only identify novel gene sets, but to determine whether there are known interactions among them. We also implemented a separate instance of this functionality that allows users to submit a gene set of interest. This capability is much like that provided by Ingenuity; however, in CTD it is freely accessible. Gene Set Enricher tool. This tool finds enriched GO or Pathway annotations (from KEGG and Reactome) associated with a gene set. A user can access the tool directly with their specific list of genes (http://ctdbase.org/tools/enricher.go), choose their enrichment analysis, and configure the results via any corrected (or raw) p-value threshold. The tool is also linked to all chemical-disease relationships in CTD such that users may see enriched GO and pathway annotations for each direct or inferred chemical-disease gene set. Data filtering capabilities. We calculate “comparable” chemicals based on similar interacting gene sets. Previously these calculations considered all types of interactions. To enhance the consistency of comparison, we added the ability to filter the calculations by interaction types and degree (e.g., increased transcription). Enhanced links to external resources. We now include links from CTD Chemical pages to ChEMBl, a dictionary of molecular entities focused on small chemical compounds, and to PubChem, a repository of chemical compounds and their associated biological activities. CTD Gene pages now have links to WikiGenes, an author-driven wiki system of biological information, and NCBI Gene provides links back to CTD Gene pages. In total, CTD links out to 23 external databases from our Chemical, Gene, Disease, Organism, Gene Ontology, Pathway, and Reference pages.
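As a rough illustration of the kind of calculation a gene set enrichment tool such as the Gene Set Enricher performs, the sketch below scores each annotation term with a one-sided hypergeometric test and filters by a corrected or raw p-value threshold. The choice of the hypergeometric test and Bonferroni correction is an assumption made for this example; the report does not state which statistics CTD uses. The annotation terms, gene identifiers, and function names are all hypothetical. The code requires Python 3.8 or later for `math.comb`.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the annotation."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / total


def enrich(gene_set, annotations, background_size, threshold=0.01, corrected=True):
    """Return (term, raw_p, corrected_p) for terms that pass the threshold.

    Assumes every annotated gene belongs to the background of size
    background_size. Uses Bonferroni correction (an assumption for this sketch).
    """
    gene_set = set(gene_set)
    results = []
    for term, members in annotations.items():
        overlap = gene_set & members
        if not overlap:
            continue
        raw_p = hypergeom_upper_tail(
            len(overlap), len(members), len(gene_set), background_size
        )
        corrected_p = min(1.0, raw_p * len(annotations))
        if (corrected_p if corrected else raw_p) <= threshold:
            results.append((term, raw_p, corrected_p))
    return sorted(results, key=lambda row: row[1])


# Hypothetical background of 100 genes and two annotation terms.
annotations = {
    "term 1": {f"g{i}" for i in range(1, 6)},             # g1..g5
    "term 2": {"g1"} | {f"g{i}" for i in range(50, 90)},  # g1 plus g50..g89
}
results = enrich({"g1", "g2", "g3", "g4"}, annotations, background_size=100)
for term, raw_p, corrected_p in results:
    print(term, raw_p, corrected_p)
```

Passing `corrected=False` applies the threshold to the raw p-value instead, mirroring the option of filtering by either corrected or raw p-values.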

Publications

  • Type: Journal Articles Status: Accepted Year Published: 2012 Citation: 1. Davis, A.P., C.G. Murphy, R. Johnson, J.M. Lay, K. Lennon-Hopkins, C. Saraceni-Richards, D. Sciaky, B.L. King, M.C. Rosenstein, T.C. Wiegers, and C.J. Mattingly, The Comparative Toxicogenomics Database: update 2013. Nucleic Acids Res, 2013. 41(Database issue): p. D1104-14. PMCID: PMC3531134.
  • Type: Journal Articles Status: Accepted Year Published: 2012 Citation: 2. Wiegers, T.C., A.P. Davis, and C.J. Mattingly, Collaborative biocuration--text-mining development task for document prioritization for curation. Database (Oxford), 2012. 2012: p. bas037. PMCID: PMC3504477.
  • Type: Journal Articles Status: Accepted Year Published: 2012 Citation: 4. King, B.L., A.P. Davis, M.C. Rosenstein, T.C. Wiegers, and C.J. Mattingly, Ranking transitive chemical-disease inferences using local network topology in the comparative toxicogenomics database. PLoS One, 2012. 7(11): p. e46524. PMCID: PMC3492369.