Source: IOWA STATE UNIVERSITY submitted to
QTLDB AND CORRDB: RESOURCES TO HELP CLOSE THE GENOTYPE TO PHENOTYPE GAP
Sponsoring Institution
National Institute of Food and Agriculture
Project Status
ACTIVE
Funding Source
Reporting Frequency
Annual
Accession No.
1027775
Grant No.
2022-67015-36217
Cumulative Award Amt.
$650,000.00
Proposal No.
2021-07083
Multistate No.
(N/A)
Project Start Date
Jan 1, 2022
Project End Date
Dec 31, 2024
Grant Year
2022
Program Code
[A1201]- Animal Health and Production and Animal Products: Animal Breeding, Genetics, and Genomics
Project Director
Reecy, J. M.
Recipient Organization
IOWA STATE UNIVERSITY
2229 Lincoln Way
AMES,IA 50011
Performing Department
Animal Science
Non Technical Summary
Thanks to technological advances, animal geneticists have an ever-expanding tool chest with which to study the inheritance of traits in livestock in order to improve production. Our long-range goal is to develop integrated resources that leverage prior investments in cyberinfrastructure to help maximize the utility of genotype-to-phenotype data to functionally annotate livestock genomes. The objectives of this particular application are: 1) development of machine learning-assisted data curation and automated semantic annotation, and 2) manual curation of genotype/phenotype, correlation, and heritability data. With the growing volume and breadth of information, it is increasingly difficult for curators to keep abreast of publications. These complementary objectives target the need to efficiently collect and comprehend large amounts of genotype/phenotype association and correlation/heritability data that are being published at an accelerating rate. First, we expect to begin to automate the functional annotation of livestock genomes by applying artificial intelligence techniques to the curation of published QTL/variant association data into the Animal QTLdb, and genetic and phenotypic correlation and heritability data into the Animal CorrDB, for multiple livestock species. Second, we expect to develop artificial intelligence tools to expedite ontology development. Third, we expect to develop intelligent retrieval tools that can answer queries semantically. Fourth, we expect to curate genotype/phenotype and correlation/heritability data and to expand relevant ontologies. Taken together, our efforts are expected to generate positive long-term effects on researchers' ability to transfer knowledge and analyze QTL/association data to address issues of economic and health importance in livestock species.
Animal Health Component
(N/A)
Research Effort Categories
Basic
50%
Applied
(N/A)
Developmental
50%
Classification

Knowledge Area (KA)    Subject of Investigation (SOI)    Field of Science (FOS)    Percent
304                    3399                              1080                      20%
304                    3810                              1080                      10%
304                    3711                              1080                      10%
304                    3710                              1080                      10%
304                    3599                              1080                      15%
304                    3499                              1080                      15%
304                    3299                              1080                      10%
304                    3699                              1080                      10%
Goals / Objectives
Our long-range goal is to develop integrated resources that leverage prior investments in cyberinfrastructure to help researchers maximize the utility of genotype-to-phenotype data to ultimately address issues of importance to the livestock industry. To accomplish this goal, the objectives of this particular application are: 1) development of machine learning-assisted data curation and automated semantic annotation, and 2) manual curation of genotype/phenotype, correlation, and heritability data. With the growing volume of literature and breadth of information, it is increasingly difficult for curators to keep abreast of publications. These complementary objectives target the need to efficiently collect and comprehend large amounts of genotype/phenotype association and correlation/heritability data that are being published at an accelerating rate.
Objective 1: Machine learning-assisted curation of genotype-to-phenotype and correlation/heritability data, automated semantic annotation, and ontology enrichment to annotate livestock and aquaculture genomes coupled with an intelligent document search system. With the rise of high-throughput technology, vast amounts of genotype/phenotype data including QTL, variant associations, selective sweep information, eQTL, epistatic interactions, genetic and phenotypic correlations, heritability estimates, etc., are being rapidly generated. Animal QTLdb and other genotype/phenotype databases would greatly benefit from automated and expedited curation tools. While it is well recognized that the adoption of common controlled vocabularies and ontologies facilitates data integration and reuse, it is nontrivial to extract such terms from scientific texts automatically. Systematic entity tagging in publications will greatly facilitate the identification of relevant articles, as well as the entities and concepts with broad impact. The ontologies will need continual updating to include emerging concepts and vocabularies.
However, the curator labor required for both tasks is intensive and needs to be reduced. The proposed interactive digital curation tool will support automated detection and normalization of entities and concepts from scientific texts, and constant ontology enrichment for the emerging concepts and vocabularies. With the improved entity recognition and ontology enrichment results, we can further develop an intelligent search engine to retrieve related articles from the massive number of publications by analyzing the semantic meanings of input queries.
Objective 2: Manual curation of genotype-to-phenotype and correlation/heritability data, and ontology development. Since the inception of Animal QTLdb, the amount and types of data generated in an effort to dissect the genetic basis of complex traits have changed dramatically as the use of high-throughput technology has increased. As the primary source for curated livestock QTL and association data, Animal QTLdb needs to be kept up to date. Furthermore, the Animal CorrDB provides data important for a better understanding of the relationships between traits, which provides further possibilities for networked information analysis. Importantly, this gold standard data set is needed to achieve the best possible training/validation data for the tools developed in Objective 1.
Project Methods
We plan to develop an innovative method for this special machine learning scenario. We have formulated the dictionary-based information extraction problem via Multi-class Positive and Unlabeled (MPU) learning and propose a theoretically and practically novel Multi-class Positive and Unlabeled approach.
Positive and Unlabeled (PU) learning generates a classifier from sets of positive and unlabeled data and aims to assign labels to the unlabeled data. In a broad sense, PU learning belongs to semi-supervised learning. However, there is a fundamental difference between them: semi-supervised learning requires labeled negative data, but PU learning does not. PU learning is naturally suited for handling distant supervision, where the dictionary can only annotate the positive examples (such as a gene mention or a trait mention), but not the negative examples. In this way, if a token is not annotated by the dictionary, we do not treat it as a "non-entity," but rather "unknown."
We have formulated the distantly supervised task as a confidence-based Multi-class Positive and Unlabeled (Conf-MPU) learning problem, which is a theoretically and practically novel approach. The proposed Conf-MPU can significantly reduce the impact of noisy annotations during model training. To fully utilize the limited amount of distantly labeled data, Conf-MPU consists of two steps. First, a confidence score of being part of an entity is estimated for each token. This step can detect more positive tokens and can help to avoid overfitting to some false-positive samples. Then, we will perform Conf-MPU, which estimates the empirical loss by combining the confidence scores obtained in the first step and the loss obtained from the MPU learning step.
We propose to design machine learning models based on a small number of pairs from the curated data as training examples. First, we plan to jointly apply contextual patterns and positive-only distributive methods to mine more candidate pairs.
Comparing the results of these existing methods, we will further improve the word embedding, considering rich information provided by existing ontologies, such as the lexicon features and contextual features coming from the following four aspects: (1) the context of the entity mentions; (2) the spelling of the entity mentions; (3) the ontology-provided definitions for each term, which are useful in distinguishing the semantic meanings of different terms; and (4) the co-occurrences of other entity mentions. We propose to represent the features using a semi-supervised embedding technique to overcome the limitations of training size and training bias.
Further, we propose to train two machine learning models, one for hypernym-hyponym extraction and one for synonym extraction, simultaneously so they can mutually enhance each other. The key observation lies in the mutual exclusiveness of the two tasks: if an entity, A, is the hypernym of another entity, B, then A cannot be a synonym of B, and vice versa. Thus, the two models can provide critical negative examples to each other and help to find a more accurate decision boundary. We will apply the Multi-class Positive-Unlabeled (MPU) learning framework proposed in Objective 1A to combine the two models. We also propose to conduct candidate filtering to further speed up the computation.
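The mutual exclusiveness of the hypernym-hyponym and synonym tasks described above can be illustrated with a minimal sketch. It is not the project's implementation: the function name `cross_task_negatives`, the pair representation, and the example term pairs are assumptions made here for illustration only. It shows how the known positives of one task could be turned into negative examples for the other.

```python
from typing import Dict, Set, Tuple

# An ordered pair of term names. For the hypernym task, (a, b) reads
# "a is a hypernym of b"; for the synonym task the order is irrelevant.
Pair = Tuple[str, str]


def cross_task_negatives(
    hypernym_pos: Set[Pair], synonym_pos: Set[Pair]
) -> Dict[str, Set[Pair]]:
    """Turn each task's known positives into negatives for the other task."""
    # A known hypernym-hyponym pair cannot be a synonym pair, in either order.
    synonym_neg = {p for a, b in hypernym_pos for p in ((a, b), (b, a))}
    # A known synonym pair cannot be a hypernym-hyponym pair, in either order.
    hypernym_neg = {p for a, b in synonym_pos for p in ((a, b), (b, a))}
    return {
        "hypernym_negatives": hypernym_neg,
        "synonym_negatives": synonym_neg,
    }


# Illustrative pairs only; they are not taken from VT, LPT, or LBO.
hypernym_pos = {("muscle pH", "semimembranosus pH")}
synonym_pos = {("backfat thickness", "subcutaneous fat thickness")}

negatives = cross_task_negatives(hypernym_pos, synonym_pos)
for task, pairs in sorted(negatives.items()):
    print(task, sorted(pairs))
```

In a positive-unlabeled setting, negatives derived this way would supplement, not replace, the unlabeled candidate pairs each model is trained on.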

Progress 01/01/23 to 12/31/23

Outputs
Target Audience: The results of this project will be of interest to geneticists (human and livestock), animal breeders, animal scientists, and progressive livestock producers.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? During the reporting period, two PhD students were mentored under the supervision of Co-PD Li on this project. One PhD student participated in and presented at the AAAI 2023 conference. One PhD student served on the Program Committee for the AAAI-2023 main track. Both PhD students presented a poster of the published AAAI paper in the Department of Computer Science. Both PhD students interned at Amazon as Applied Scientists working on information extraction-related projects.
How have the results been disseminated to communities of interest? Co-PD Li, in collaboration with ISU Extension for Iowa-4H and the NRT-D4 Graduate Traineeship program, organized Data Science Workshops in Spencer, Iowa, and introduced how AI and text mining can be used to accelerate scientific discovery. All new QTLdb data are ported to NCBI, Ensembl, UCSC genome browser, and Reuters Data Citation Index upon each data release.
What do you plan to do during the next reporting period to accomplish the goals? Objective 1: With the trait entities recognized by the developed method CuPUL, we will develop a machine learning-based method to link them to their normalized or official names in the Vertebrate Trait Ontology (VT) and Livestock Product Trait Ontology (LPT), with the aim of reducing the curation effort spent on trait entity disambiguation. Objective 2: During 2024, we will continue curation of relevant data into the QTLdb and CorrDB, as well as development of the related ontologies, VT, LPT, and LBO.

Impacts
What was accomplished under these goals?
Impact statement: This project specifically addresses the USDA-AFRI Animal Breeding and Functional Annotation of Genomes program by annotating livestock genomes with genotype-by-phenotype association data. The data we curate and tools we develop will benefit all agriculturally important species.
Objective 1: Machine learning-assisted curation of genotype-to-phenotype and correlation/heritability data, automated semantic annotation, and ontology enrichment to annotate livestock and aquaculture genomes coupled with an intelligent document search system. Throughout the reporting period, we have developed two text mining models for CorrDB and QTLdb. We have trained a deep learning model to pre-screen articles retrieved through keyword search from PubMed. This model is particularly adept at discerning and filtering out articles that do not pertain to our research needs before they reach the curation stage. We deployed this model on a batch of 20,508 articles, of which it identified only 13% as relevant to our databases. Further validation through human evaluation on a selected subset of these articles revealed that the model has high recall, identifying over 95% of the relevant articles. This has led to a significant decrease in the amount of time and effort our curators need to spend on filtering out irrelevant content. We also developed an entity recognition method called CuPUL. This machine learning-based method utilizes a dictionary to identify entities of interest in the article. We gathered a dictionary containing 3,884 trait names and the abstracts of 1,716 articles to train the machine learning models in CuPUL. The trained models are capable of marking trait entities in relevant articles, providing a foundation for semantic annotation and ontology development.
Additional verification showed that CuPUL retrieved 98.5% of the trait entities in 102 manually annotated articles, with a precision of 54.4%. This method significantly reduces the workload associated with manual curation.
Objective 2: Manual curation of genotype-to-phenotype and correlation/heritability data, and ontology development. By November 20, 2023, a total of 22,883 new QTL/association data had been curated into QTLdb during the year. Currently, there are 54,816 porcine QTL/associations, 195,011 bovine QTL/associations, 18,646 chicken QTL/associations, 2,649 horse QTL/associations, 4,729 sheep QTL/associations, 558 goat QTL/associations, and 2,329 rainbow trout QTL/associations released to the public domain. All new data have been ported to NCBI, Ensembl, UCSC genome browser, and Reuters Data Citation Index for data sharing upon each data release. Therefore, users can utilize the browsing and data mining tools at these database sites to explore animal QTL/association data. In 2023, a total of 5,709 new correlations and 783 heritability estimates were curated into CorrDB. Currently, the CorrDB contains 29,947 correlations (14,839 cattle; 2,053 chicken; 834 goat; 209 horse; 9,308 pig; and 2,704 sheep) and 5,458 heritabilities (2,636 cattle; 400 chicken; 72 goat; 131 horse; 1,781 pig; and 438 sheep) that have been released to the public domain. (The released numbers will be higher by year-end for both databases.) In addition to the curation of new data, 19,448 previously curated QTL/association data, 6,454 correlation data, and 58 heritability data were updated as part of an overhaul of the trait management system within the QTLdb/CorrDB curator tools.
The new system was developed to deal with an ever-increasing list of traits created to describe the varying circumstances under which traits are assessed, such as muscle pH in different muscles or at multiple post-mortem time points, litter size in different parities, or subcutaneous fat thickness at different locations on the body. Instead of creating a new "sibling trait" associated with QTL/association data (e.g., semimembranosus pH 24 hr post-mortem), the structure now relies on the creation of "trait variants," in which the base trait (muscle pH) is "modified" by one or more additional terms related to a defined set of properties such as anatomical location, environment, time of measurement, etc. This system allows the hierarchy of traits within the databases to remain manageable, since all data for a specific trait, regardless of the conditions of a particular study, are linked back to the same base trait.
Continued development of VT/LPT: As an important part of Animal QTLdb/CorrDB development, expansion and improvement of the Vertebrate Trait Ontology (VT) have been ongoing. During 2023, we released 14 new versions of VT. In addition, we have released 10 versions of the Livestock Breed Ontology (LBO), which is used for annotation of QTL/associations with breed data. Each release of the updated VT/LBO data has been made available on BioPortal, GitHub, and the AnimalGenome.org website.
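As a rough illustration of the "trait variant" structure described above, the following sketch models a base trait modified by property/term pairs and shows how records for different variants roll up to the same base trait. The class name `TraitVariant`, its fields, the property names, and the record identifiers are hypothetical and do not reflect the actual QTLdb/CorrDB schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TraitVariant:
    """A base trait modified by zero or more (property, term) pairs."""
    base_trait: str
    modifiers: Tuple[Tuple[str, str], ...] = ()

    def label(self) -> str:
        # Build a readable name from the base trait and its modifiers.
        if not self.modifiers:
            return self.base_trait
        terms = ", ".join(f"{prop}: {term}" for prop, term in self.modifiers)
        return f"{self.base_trait} ({terms})"


def group_by_base_trait(
    records: List[Tuple[str, TraitVariant]]
) -> Dict[str, List[str]]:
    """Collect record IDs under the base trait their variant points to."""
    grouped: Dict[str, List[str]] = {}
    for record_id, variant in records:
        grouped.setdefault(variant.base_trait, []).append(record_id)
    return grouped


# Hypothetical records; the IDs are placeholders, not real database entries,
# and "parity" as a property name is an assumption made for this example.
ph_24h = TraitVariant(
    "muscle pH",
    (("anatomical location", "semimembranosus"),
     ("time of measurement", "24 hr post-mortem")),
)
ph_plain = TraitVariant("muscle pH")
litter_p1 = TraitVariant("litter size", (("parity", "first"),))

records = [("rec-1", ph_24h), ("rec-2", ph_plain), ("rec-3", litter_p1)]
by_trait = group_by_base_trait(records)
print(ph_24h.label())
print(by_trait)
```

Because every variant carries its base trait, a query on "muscle pH" in this sketch reaches both the unmodified record and the semimembranosus record without a separate sibling trait for each condition.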

Publications

  • Type: Journal Articles Status: Published Year Published: 2023 Citation: Hu, Z.-L., C.A. Park, and J.M. Reecy. 2023. A combinatorial approach implementing new database structures to facilitate practical data curation management of QTL, association, correlation, and heritability data on trait variants. Database (Oxford). 2023:baad024. https://doi.org/10.1093/database/baad024
  • Type: Conference Papers and Presentations Status: Published Year Published: 2023 Citation: Hu, Z.-L., C.A. Park, and J.M. Reecy. 2023. An implementation of new approaches to extend livestock trait ontologies for practical curation management of QTL, association, correlation, and heritability data. Presented at Plant & Animal Genome 30 Meeting, January 8-12, 2023. San Diego, CA. https://animalgenome.org/QTLdb/publications/2023PAG.pdf
  • Type: Conference Papers and Presentations Status: Published Year Published: 2023 Citation: Zhou, K., Q. Qiao, Y. Li, and Q. Li. 2023. Improving distantly supervised relation extraction by natural language inference. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 37, No. 11, pp. 14047-14055). https://doi.org/10.1609/aaai.v37i11.26644
  • Type: Conference Papers and Presentations Status: Submitted Year Published: 2023 Citation: Li, Y., K. Zhou, Q. Qiao, Q. Wang, and Q. Li. 2024. Improving Distantly Supervised NER via Token-Level Curriculum-Based Positive-Unlabeled Learning. NAACL 2024


Progress 01/01/22 to 12/31/22

Outputs
Target Audience: The results of this project will be of interest to geneticists (human and livestock), animal breeders, animal scientists, and progressive livestock producers.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? During the reporting period, two Ph.D. students were mentored under the supervision of Co-PD Li on this project. One Ph.D. student participated in and presented at the ACL 2022 conference.
How have the results been disseminated to communities of interest? The implementation of ConfMPU is publicly available on GitHub. All new QTLdb data are ported to NCBI, Ensembl, UCSC genome browser, and Reuters Data Citation Index upon each data release.
What do you plan to do during the next reporting period to accomplish the goals? Objective 1: We will include the two text mining models in the current curation pipeline of CorrDB and QTLdb. We will design an automated ontology enrichment method to assist the development of the Vertebrate Trait Ontology (VT) and Livestock Product Trait Ontology (LPT). Objective 2: During 2023, we will continue curation of relevant data into the QTLdb and CorrDB, as well as ontology development of the VT, LPT, and LBO.

Impacts
What was accomplished under these goals?
Impact statement: This project specifically addresses the USDA-AFRI Animal Breeding and Functional Annotation of Genomes program by annotating livestock genomes with genotype-by-phenotype association data. The data we curate and the tools we develop will benefit all agriculturally important species.
Objective 1: Machine learning-assisted curation of genotype-to-phenotype and correlation/heritability data, automated semantic annotation, and ontology enrichment to annotate livestock and aquaculture genomes coupled with an intelligent document search system. Throughout the reporting period, we have developed two text mining models for CorrDB and QTLdb. We have trained a deep learning model to pre-screen articles retrieved through keyword search from PubMed. This model can filter irrelevant articles before the curation step. We also developed a new entity recognition method called ConfMPU. It is a machine learning method that uses a dictionary to recognize entity names of interest in the articles. ConfMPU significantly improved upon state-of-the-art distantly supervised named entity recognition methods on multiple benchmark datasets from different domains and achieved results comparable to those of fully supervised methods. This method has been used to train a trait entity recognition model for CorrDB and QTLdb. It built a solid foundation for semantic annotation and ontology development.
Objective 2: Manual curation of genotype-to-phenotype and correlation/heritability data, and ontology development. In 2022, a total of 22,320 new QTL/association data were curated into QTLdb. Currently, there are 36,725 porcine QTL/associations, 193,641 bovine QTL/associations, 18,313 chicken QTL/associations, 2,649 horse QTL/associations, 4,504 sheep QTL/associations, 129 goat QTL/associations, and 2,329 rainbow trout QTL/associations in the database.
All new data have been ported to NCBI, Ensembl, UCSC genome browser, and Reuters Data Citation Index for data sharing upon each data release. Therefore, users can utilize the browsing and data mining tools at these database sites to explore animal QTL/association data. In 2022, a total of 2,735 new correlations and 459 heritability estimates were curated into CorrDB. Currently, the CorrDB contains 26,839 correlations (13,471 cattle; 1,835 chicken; 311 goat; 209 horse; 8,902 pig; and 2,111 sheep) and 4,778 heritabilities (2,330 cattle; 377 chicken; 5 goat; 123 horse; 1,607 pig; and 336 sheep). In addition to the curation of new data, 16,227 previously curated QTL/association data, 5,573 correlation data, and 415 heritability data were updated as part of an overhaul of the trait management system within the QTLdb/CorrDB curator tools. The new system was developed to deal with an ever-increasing list of traits created to describe the varying circumstances under which traits are assessed, such as muscle pH in different muscles or at multiple post-mortem time points, litter size in different parities, or subcutaneous fat thickness at different locations on the body. Instead of creating a new "sibling trait" associated with QTL/association data (e.g., semimembranosus pH 24 hr post-mortem), the structure now relies on the creation of "trait variants," in which the base trait (muscle pH) is "modified" by one or more additional terms related to a defined set of properties such as anatomical location, environment, time of measurement, etc. This system allows the hierarchy of traits within the databases to remain manageable, since all data for a specific trait, regardless of the conditions of a particular study, are linked back to the same base trait. Continued development of VT/LPT: As an important part of Animal QTLdb/CorrDB development, expansion and improvement of the Vertebrate Trait Ontology (VT) and Livestock Product Trait Ontology (LPT) have been ongoing. 
During 2022, we released nine new versions of VT and three new versions of LPT. In addition, we have released 15 versions of the Livestock Breed Ontology (LBO), which is used for annotation of QTL/associations with breed data. Each release of the updated VT/LPT/LBO data has been made available on BioPortal, GitHub, and the AnimalGenome.org website.
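The dictionary-based entity recognition reported under Objective 1 above starts from distant labels: a dictionary marks known entity mentions, and every other token is left unlabeled instead of being assumed to be a non-entity. The sketch below illustrates only that labeling step, under stated assumptions. It is not the ConfMPU implementation: the function name `distant_label`, the `"U"` tag, the longest-match strategy, and the two-entry dictionary are hypothetical.

```python
from typing import Dict, List, Tuple

UNKNOWN = "U"  # unlabeled token: may or may not be part of an entity


def distant_label(
    tokens: List[str], dictionary: Dict[Tuple[str, ...], str]
) -> List[str]:
    """Tag dictionary matches with their entity type; leave the rest unknown."""
    labels = [UNKNOWN] * len(tokens)
    lowered = [t.lower() for t in tokens]
    max_len = max((len(key) for key in dictionary), default=0)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first so multi-word names win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            entity_type = dictionary.get(tuple(lowered[i:i + n]))
            if entity_type is not None:
                labels[i:i + n] = [entity_type] * n
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return labels


# Hypothetical two-entry dictionary keyed by lowercased token tuples.
trait_dictionary = {
    ("backfat", "thickness"): "TRAIT",
    ("litter", "size"): "TRAIT",
}
sentence = "Backfat thickness was correlated with litter size in pigs".split()
labels = distant_label(sentence, trait_dictionary)
print(list(zip(sentence, labels)))
```

A positive-unlabeled learner would then treat the `TRAIT` tokens as positives and the `U` tokens as unlabeled data, so that trait names missing from the dictionary are not penalized as confirmed negatives.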

Publications

  • Type: Conference Papers and Presentations Status: Published Year Published: 2022 Citation: Hu, Z.-L., C.A. Park, and J.M. Reecy. 2022. A database structural improvement for efficient trait variation curation in Animal QTLdb and CorrDB. Presented at the World Congress on Genetics Applied to Livestock Production (WCGALP). The Netherlands.
  • Type: Conference Papers and Presentations Status: Published Year Published: 2022 Citation: Zhou, K., Y. Li, and Q. Li. 2022. Distantly Supervised Named Entity Recognition via Confidence-Based Multi-Class Positive and Unlabeled Learning. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).