Progress 01/01/23 to 12/31/23
Outputs Target Audience: The results of this project will be of interest to geneticists (human and livestock), animal breeders, animal scientists, and progressive livestock producers. Changes/Problems:
Nothing Reported
What opportunities for training and professional development has the project provided? During the reporting period, two PhD students were mentored under the supervision of Co-PD Li on this project. One PhD student participated in and presented at the AAAI 2023 conference. One PhD student served on the Program Committee for the AAAI-2023 main track. Both PhD students presented a poster of the published AAAI paper in the Department of Computer Science. Both PhD students interned at Amazon as Applied Scientists working on projects related to information extraction.
How have the results been disseminated to communities of interest? Co-PD Li, in collaboration with ISU Extension for Iowa-4H and the NRT-D4 Graduate Traineeship program, organized Data Science Workshops in Spencer, Iowa, and introduced how AI and text mining can be used to accelerate scientific discovery. All new QTLdb data are ported to NCBI, Ensembl, UCSC genome browser, and Reuters Data Citation Index upon each data release.
What do you plan to do during the next reporting period to accomplish the goals? Objective 1: Using the trait entities recognized by our CuPUL method, we will develop a machine learning-based method to link them to their normalized or official names in the Vertebrate Trait Ontology (VT) and Livestock Product Trait Ontology (LPT), with the aim of reducing the curation effort spent on trait entity disambiguation. Objective 2: During 2024, we will continue curation of relevant data into the QTLdb and CorrDB, as well as development of the related ontologies, VT, LPT, and LBO.
Impacts What was accomplished under these goals?
Impact statement: This project specifically addresses the USDA-AFRI Animal Breeding and Functional Annotation of Genomes program by annotating livestock genomes with genotype-by-phenotype association data. The data we curate and tools we develop will benefit all agriculturally important species. Objective 1: Machine learning-assisted curation of genotype-to-phenotype and correlation/heritability data, automated semantic annotation, and ontology enrichment to annotate livestock and aquaculture genomes coupled with an intelligent document search system. During the reporting period, we developed two text mining models for CorrDB and QTLdb. We trained a deep learning model to pre-screen articles retrieved through keyword search from PubMed. This model filters out articles that do not pertain to our databases before they reach the curation stage. We deployed this model on a batch of 20,508 articles, of which it flagged only 13% as relevant to our databases. Human evaluation of a selected subset of these articles showed that the model has high recall, identifying over 95% of the relevant articles. This has significantly decreased the time and effort our curators spend filtering out irrelevant content. We also developed an entity recognition method called CuPUL. This machine learning-based method uses a dictionary to identify entities of interest in an article. We gathered a dictionary containing 3,884 trait names and the abstracts of 1,716 articles to train the machine learning models in CuPUL. The trained models can mark trait entities in relevant articles, providing a foundation for semantic annotation and ontology development.
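The dictionary-based starting point of such an entity recognizer can be illustrated with a minimal sketch. This is not the CuPUL implementation: it shows only how a trait dictionary might be matched against text to produce candidate entity spans, which a distantly supervised model would then learn from. The function name `label_traits`, the three-entry dictionary, and the example sentence are all hypothetical.

```python
import re


def label_traits(text, trait_dictionary):
    """Return (start, end, matched_text) spans for dictionary entries found in text."""
    if not trait_dictionary:
        return []
    # Try longer names first so "muscle pH" is preferred over the shorter "pH".
    names = sorted(trait_dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in names) + r")\b",
        flags=re.IGNORECASE,
    )
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]


# Hypothetical toy dictionary; the project's dictionary holds 3,884 trait names.
traits = ["muscle pH", "litter size", "pH"]
sentence = "QTL for muscle pH and litter size were mapped in pigs."
spans = label_traits(sentence, traits)
for start, end, matched in spans:
    print(start, end, matched)
```

Exact dictionary matching of this kind typically misses unlisted trait names and can mislabel ambiguous words, which is the kind of label noise that distantly supervised methods are designed to tolerate.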
Additional verification on 102 manually annotated articles showed that CuPUL retrieved 98.5% of the trait entities (recall), with a precision of 54.4%. This method significantly reduces the workload associated with manual curation. Objective 2: Manual curation of genotype-to-phenotype and correlation/heritability data, and ontology development. In 2023, a total of 22,883 new QTL/association data were curated into QTLdb by November 20. Currently, there are 54,816 porcine QTL/associations, 195,011 bovine QTL/associations, 18,646 chicken QTL/associations, 2,649 horse QTL/associations, 4,729 sheep QTL/associations, 558 goat QTL/associations, and 2,329 rainbow trout QTL/associations released to the public domain. All new data have been ported to NCBI, Ensembl, UCSC genome browser, and Reuters Data Citation Index for data sharing upon each data release. Therefore, users can utilize the browsing and data mining tools at these database sites to explore animal QTL/association data. In 2023, a total of 5,709 new correlations and 783 heritability estimates were curated into CorrDB. Currently, the CorrDB contains 29,947 correlations (14,839 cattle; 2,053 chicken; 834 goat; 209 horse; 9,308 pig; and 2,704 sheep) and 5,458 heritabilities (2,636 cattle; 400 chicken; 72 goat; 131 horse; 1,781 pig; and 438 sheep) that have been released to the public domain. (The released numbers will be higher by year-end for both databases.) In addition to the curation of new data, 19,448 previously curated QTL/association data, 6,454 correlation data, and 58 heritability data were updated as part of an overhaul of the trait management system within the QTLdb/CorrDB curator tools.
The new system was developed to deal with an ever-increasing list of traits created to describe the varying circumstances under which traits are assessed, such as muscle pH in different muscles or at multiple post-mortem time points, litter size in different parities, or subcutaneous fat thickness at different locations on the body. Instead of creating a new "sibling trait" associated with QTL/association data (e.g., semimembranosus pH 24 hr post-mortem), the structure now relies on the creation of "trait variants," in which the base trait (muscle pH) is "modified" by one or more additional terms related to a defined set of properties such as anatomical location, environment, time of measurement, etc. This system allows the hierarchy of traits within the databases to remain manageable, since all data for a specific trait, regardless of the conditions of a particular study, are linked back to the same base trait. Continued development of VT/LPT: As an important part of Animal QTLdb/CorrDB development, expansion and improvement of the Vertebrate Trait Ontology (VT) have been ongoing. During 2023, we released 14 new versions of VT. In addition, we have released 10 versions of the Livestock Breed Ontology (LBO), which is used for annotation of QTL/associations with breed data. Each release of the updated VT/LBO data has been made available on BioPortal, GitHub, and the AnimalGenome.org website.
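The "trait variant" structure described above can be pictured with a small sketch. It is an illustration only, not the QTLdb/CorrDB schema: the class `TraitVariant`, its fields, the helper `group_by_base_trait`, and the example property names are assumptions chosen to mirror the description of a base trait modified by terms from a defined set of properties.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TraitVariant:
    """A base trait plus zero or more (property, value) modifiers (hypothetical model)."""
    base_trait: str
    modifiers: tuple = ()

    def label(self):
        # Render the variant as the base trait followed by its modifiers.
        if not self.modifiers:
            return self.base_trait
        details = "; ".join(f"{prop}: {value}" for prop, value in self.modifiers)
        return f"{self.base_trait} ({details})"


def group_by_base_trait(variants):
    """Link every variant back to its base trait, whatever the study conditions."""
    grouped = defaultdict(list)
    for variant in variants:
        grouped[variant.base_trait].append(variant)
    return dict(grouped)


variants = [
    TraitVariant("muscle pH", (("anatomical location", "semimembranosus"),
                               ("time of measurement", "24 hr post-mortem"))),
    TraitVariant("muscle pH", (("anatomical location", "longissimus"),)),
    TraitVariant("litter size", (("parity", "first"),)),
]
grouped = group_by_base_trait(variants)
print(variants[0].label())
print({trait: len(items) for trait, items in grouped.items()})
```

In this sketch the trait hierarchy only ever contains the base traits, while study-specific conditions live in the modifiers, which is the property the new system relies on to keep the hierarchy manageable.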
Publications
- Type:
Journal Articles
Status:
Published
Year Published:
2023
Citation:
Hu, Z.-L., C.A. Park, and J.M. Reecy. 2023. A combinatorial approach implementing new database structures to facilitate practical data curation management of QTL, association, correlation, and heritability data on trait variants. Database (Oxford). 2023:baad024. https://doi.org/10.1093/database/baad024
- Type:
Conference Papers and Presentations
Status:
Published
Year Published:
2023
Citation:
Hu, Z.-L., C.A. Park, and J.M. Reecy. 2023. An implementation of new approaches to extend livestock trait ontologies for practical curation management of QTL, association, correlation, and heritability data. Presented at Plant & Animal Genome 30 Meeting, January 8-12, 2023. San Diego, CA. https://animalgenome.org/QTLdb/publications/2023PAG.pdf
- Type:
Conference Papers and Presentations
Status:
Published
Year Published:
2023
Citation:
Zhou, K., Q. Qiao, Y. Li, and Q. Li. 2023. Improving distantly supervised relation extraction by natural language inference. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 37, No. 11, pp. 14047-14055).
https://doi.org/10.1609/aaai.v37i11.26644
- Type:
Conference Papers and Presentations
Status:
Submitted
Year Published:
2023
Citation:
Li, Y., K. Zhou, Q. Qiao, Q. Wang, and Q. Li. 2024. Improving Distantly Supervised NER via Token-Level Curriculum-Based Positive-Unlabeled Learning. NAACL 2024
Progress 01/01/22 to 12/31/22
Outputs Target Audience: The results of this project will be of interest to geneticists (human and livestock), animal breeders, animal scientists, and progressive livestock producers. Changes/Problems:
Nothing Reported
What opportunities for training and professional development has the project provided? During the reporting period, two Ph.D. students were mentored under the supervision of Co-PD Li on this project. One Ph.D. student participated in and presented at the ACL 2022 conference.
How have the results been disseminated to communities of interest? The implementation of ConfMPU is publicly available on GitHub. All new QTLdb data are ported to NCBI, Ensembl, UCSC genome browser, and Reuters Data Citation Index upon each data release.
What do you plan to do during the next reporting period to accomplish the goals? Objective 1: We will include the two text mining models in the current curation pipeline of CorrDB and QTLdb. We will design an automated ontology enrichment method to assist the development of the Vertebrate Trait Ontology (VT) and Livestock Product Trait Ontology (LPT). Objective 2: During 2023, we will continue curation of relevant data into the QTLdb and CorrDB, as well as ontology development of the VT, LPT, and LBO.
Impacts What was accomplished under these goals?
Impact statement: This project specifically addresses the USDA-AFRI Animal Breeding and Functional Annotation of Genomes program by annotating livestock genomes with genotype-by-phenotype association data. The data we curate and the tools we develop will benefit all agriculturally important species. Objective 1: Machine learning-assisted curation of genotype-to-phenotype and correlation/heritability data, automated semantic annotation, and ontology enrichment to annotate livestock and aquaculture genomes coupled with an intelligent document search system. During the reporting period, we developed two text mining models for CorrDB and QTLdb. We trained a deep learning model to pre-screen articles retrieved through keyword search from PubMed. This model can filter out irrelevant articles before the curation step. We also developed a new entity recognition method called ConfMPU. It is a machine learning method that uses a dictionary to recognize entity names of interest in the articles. ConfMPU significantly improved on state-of-the-art distantly supervised named entity recognition methods on multiple benchmark datasets from different domains and achieved results comparable to those of fully supervised methods. This method has been used to train a trait entity recognition model for CorrDB and QTLdb, building a solid foundation for semantic annotation and ontology development. Objective 2: Manual curation of genotype-to-phenotype and correlation/heritability data, and ontology development. In 2022, a total of 22,320 new QTL/association data were curated into QTLdb. Currently, there are 36,725 porcine QTL/associations, 193,641 bovine QTL/associations, 18,313 chicken QTL/associations, 2,649 horse QTL/associations, 4,504 sheep QTL/associations, 129 goat QTL/associations, and 2,329 rainbow trout QTL/associations in the database. All new data have been ported to NCBI, Ensembl, UCSC genome browser, and Reuters Data Citation Index for data sharing upon each data release.
Therefore, users can utilize the browsing and data mining tools at these database sites to explore animal QTL/association data. In 2022, a total of 2,735 new correlations and 459 heritability estimates were curated into CorrDB. Currently, the CorrDB contains 26,839 correlations (13,471 cattle; 1,835 chicken; 311 goat; 209 horse; 8,902 pig; and 2,111 sheep) and 4,778 heritabilities (2,330 cattle; 377 chicken; 5 goat; 123 horse; 1,607 pig; and 336 sheep). In addition to the curation of new data, 16,227 previously curated QTL/association data, 5,573 correlation data, and 415 heritability data were updated as part of an overhaul of the trait management system within the QTLdb/CorrDB curator tools. The new system was developed to deal with an ever-increasing list of traits created to describe the varying circumstances under which traits are assessed, such as muscle pH in different muscles or at multiple post-mortem time points, litter size in different parities, or subcutaneous fat thickness at different locations on the body. Instead of creating a new "sibling trait" associated with QTL/association data (e.g., semimembranosus pH 24 hr post-mortem), the structure now relies on the creation of "trait variants," in which the base trait (muscle pH) is "modified" by one or more additional terms related to a defined set of properties such as anatomical location, environment, time of measurement, etc. This system allows the hierarchy of traits within the databases to remain manageable, since all data for a specific trait, regardless of the conditions of a particular study, are linked back to the same base trait. Continued development of VT/LPT: As an important part of Animal QTLdb/CorrDB development, expansion and improvement of the Vertebrate Trait Ontology (VT) and Livestock Product Trait Ontology (LPT) have been ongoing. During 2022, we released nine new versions of VT and three new versions of LPT. 
In addition, we have released 15 versions of the Livestock Breed Ontology (LBO), which is used for annotation of QTL/associations with breed data. Each release of the updated VT/LPT/LBO data has been made available on BioPortal, GitHub, and the AnimalGenome.org website.
Publications
- Type:
Conference Papers and Presentations
Status:
Published
Year Published:
2022
Citation:
Hu, Z.-L., C.A. Park, and J.M. Reecy. 2022. A database structural improvement for efficient trait variation curation in Animal QTLdb and CorrDB. Presented at the World Congress on Genetics Applied to Livestock Production (WCGALP). The Netherlands.
- Type:
Conference Papers and Presentations
Status:
Published
Year Published:
2022
Citation:
Zhou, K., Y. Li, and Q. Li. 2022. Distantly Supervised Named Entity Recognition via Confidence-Based Multi-Class Positive and Unlabeled Learning. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).