Cellulose Materials Informatics: Building A Knowledge Graph For Cellulose Materials Discovery And Research (Cellograph)

Sponsoring Institution

National Institute of Food and Agriculture

Project Status

ACTIVE

Funding Source

AFRI COMPETITIVE GRANT

Reporting Frequency

Annual

Accession No.

1025810

Grant No.

2021-67022-34366

Cumulative Award Amt.

$500,000.00

Proposal No.

2020-08816

Multistate No.

(N/A)

Project Start Date

Mar 1, 2021

Project End Date

Feb 28, 2026

Grant Year

2021

Program Code

[A1541]- Food and Agriculture Cyberinformatics and Tools

Recipient Organization
UNIVERSITY OF MAINE
(N/A)
ORONO,ME 04469

Performing Department
School of Forest Resources

Non Technical Summary
The project will develop a cellulose informatics system named CelloGraph. It will include a cellulose ontology, a cellulose database, a cellulose data analysis tool, and a website that serves as the gateway to the system. CelloGraph is a portmanteau of cellulose and graph, i.e., it is a record of the cellulosic materials described in the literature. An ontology is similar to a taxonomy, thesaurus, or glossary: it is a computer-interpretable set of terms, propositions, and axioms over which computer programs can reason. An ontology is a framework of concepts; knowledge is specific data. The general concept of a tree belongs to an ontology, whereas a particular oak tree in one's backyard named Doggy Doug is an instance of knowledge. In this sense, CelloGraph will first organize the concepts of the cellulose domain into a coherent, searchable tree structure that makes them convenient to look up. Second, CelloGraph will substantiate each concept with specific knowledge from the cellulose literature. The result would be a system that can retrieve data on cellulosic entities, their attributes, and their relations, visualize the data, and apply machine learning or other analytical algorithms to mine patterns, insights, and perspectives. If successful, this effort would create a powerful resource that increases efficiency in searching for cellulosic data, reduces duplication of research effort, creates the potential for larger data sets for evaluating a research question, generates a synergistic capability for finding and filling gaps, and allows for rapid advancement in cellulosic research.

Animal Health Component

60%

Research Effort Categories

Basic

20%

Applied

60%

Developmental

20%

Classification

Knowledge Area (KA)	Subject of Investigation (SOI)	Field of Science (FOS)	Percent
511	0699	3030	100%

Knowledge Area
511 - New and Improved Non-Food Products and Processes;

Subject Of Investigation
0699 - Trees, forests, and forest products, general;

Field Of Science
3030 - Information and communication;

Keywords

cellulose

cellulose informatics

knowledge graph

materials

ontology

Goals / Objectives
The final outcome is expected to be a cellulose informatics system that includes a web application (displaying, querying, and analyzing cellulose data), a cellulose data service (delivering and storing the data via the internet), and an analytical service framework that enables further computing and modeling with the data.
Four objectives:
1) expand a controlled vocabulary - the cellulose ontology that is currently under development - to guide curation and querying of the cellulose database,
2) convert cellulose data from the literature into cellulose ontology compliant RDF datasets and enable querying them (Cellulose Knowledge Graph),
3) learn the features and representations of the data and develop a machine learning tool that applies machine learning algorithms to the datasets to disclose hidden patterns and intrinsic structures, and
4) develop a database management system and a software system that enables users to define, create, maintain, and control access to the database and utilize the machine learning tool.

Project Methods
This project will use Semantic Web technologies to make cellulose data findable, accessible, interoperable, and reusable (FAIR), in line with a national push for FAIR data principles. The project involves four tasks: (1) semi-automatic data extraction from the literature, (2) expansion and population of the cellulose ontology, (3) development of analytical and exploratory tools, and (4) deployment as a web application. The data extraction task involves gathering representative publications on cellulosic materials, deciding on the key data and terms in the publications, and then developing a pipeline for semi-automated annotation, mapping, and information extraction using state-of-the-art natural language processing (NLP) techniques. The cellulose ontology construction and population task will align the extracted data with the cellulose ontology that is currently being developed and add missing terms. The data is then added as instances of the ontology's classes and properties, constructing a full-fledged knowledge graph. The analytical and exploratory tool development task involves specifying suitable templates for the semantic queries needed to explore and understand the data and to relate and compare specific datasets, classes, and properties, and applying machine learning algorithms for classifying and mining the data. The web deployment task is to make the knowledge graph easily accessible and navigable to researchers and other domain experts. It involves developing a user-friendly web interface for the materials community to contribute, share, and use knowledge about cellulose materials.

Progress 03/01/24 to 02/28/25

Outputs
Target Audience: We presented the work on the Ontology for Named Entity Recognition (OnNER) at the 14th International Conference on Formal Ontologies in Information Systems (FOIS 2024) in July 2024 in Enschede, the Netherlands, where it was well received. We're preparing another publication comparing different approaches for Named Entity Recognition of Cellulosic Entities for submission to the 13th International Conference on Knowledge Capture (K-CAP 2025), to be held in December in Dayton, Ohio. In addition, we're planning to prepare a manuscript on the overall system, its construction, and its use for submission to a more cellulose-focused journal or conference.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? Umayer Reza, a PhD student primarily working on the project, has made great progress toward completing his degree by advancing to candidacy. As part of his work on the project, he has studied and compared various LLMs for their suitability for cellulosic NER. We presented his research at the annual conference on Formal Ontologies in Information Systems (FOIS 2024) in July 2024 and are preparing a second paper for submission to the 13th International Conference on Knowledge Capture (K-CAP 2025). These two papers form the bulk of his dissertation work, together with a description of the overall system. We anticipate that he will finish his dissertation in Winter 2026 and graduate by May 2026. Umayer also attended the Maine Forest Biomaterials Week, including the Cellulose Nanomaterials Forum, in Orono, ME in August 2024 to connect with and learn more about the research community on cellulose materials. He will present his work to a related research project on Maine's forest-based economy that involves both AI and materials science researchers working with nanocellulose. He is also finalizing a second manuscript to report the results of his comparison of the different methods for identifying and classifying named entities (NEs) in text. In addition, he is participating in an ongoing challenge (hosted in coordination with the International Semantic Web Conference to be held in November 2025 in Tokyo, Japan) to extract materials science knowledge from text. One undergraduate student in Computer Science, Nicholas Pease, has been primarily involved in the development of TART. He has gained valuable practical experience in developing, testing, and deploying an open software tool and has learned more about cutting-edge applications of AI and NER.
How have the results been disseminated to communities of interest? We presented the work on the Ontology for Named Entity Recognition (OnNER) at the 14th International Conference on Formal Ontologies in Information Systems (FOIS 2024) in July 2024 in Enschede, the Netherlands, where the work was well received. We're preparing another publication comparing the different approaches for Named Entity Recognition of Cellulosic Entities for submission to the 13th International Conference on Knowledge Capture (K-CAP 2025), to be held in December in Dayton, Ohio. In addition, we're planning to prepare a manuscript on the overall system, its construction, and its use for submission to a more cellulose-focused journal or conference.
What do you plan to do during the next reporting period to accomplish the goals? For the coming year, we will deploy the annotation tool, scale up the annotation effort with the involvement of domain experts, and rerun the experiments to finalize the selection of an NER model. We will also revisit objective 1 using input on which labels work well for the NER pipeline, and then expand the ontology to improve the classification of identified terms using the cellulose ontology and to suggest automated expansions of the ontology. This is all in support of scaling up the overall knowledge graph to demonstrate it to potential user communities. We plan to present the overall system, its construction, and its use at more cellulose-focused conferences and workshops to increase awareness of the effort and gather input for future development. At the same time, we plan to prepare a manuscript on the project for submission to a cellulose-focused journal.

Impacts
What was accomplished under these goals?

Project Report on Cellulose Knowledge Graph and Named Entity Recognition (NER) System Development

This project aims to establish a comprehensive digital infrastructure for managing, analyzing, and querying data on cellulose-based materials. The work centers around four primary objectives:
1) Expand the controlled vocabulary, or cellulose ontology, currently under development, to support more effective curation and querying of the cellulose database.
2) Convert cellulose-related literature data into RDF datasets that comply with the cellulose ontology, enabling structured queries and forming the foundation of a Cellulose Knowledge Graph (CKG). This objective was successfully completed in a prior project phase.
3) Design and implement a machine learning tool capable of learning from the structured datasets. This tool will uncover latent patterns and intrinsic structures in the data through the application of machine learning algorithms.
4) Develop a robust database and software system that enables users to manage, interact with, and analyze the cellulose data. This includes defining, creating, maintaining, and controlling access to the database and integrating the machine learning tool for intelligent insights.
During the current project phase, efforts were primarily focused on objectives 3 and 4, with progress on objective 1 supported through foundational work in these areas.

Cellulosic Named Entity Recognition (NER)

In support of Objective 3, we initiated a series of experiments aimed at improving named entity recognition (NER) in the cellulose domain. Cellulose-related literature is inherently interdisciplinary, intersecting areas such as wood science, materials science, chemistry, and biology. Existing NER models and corpora do not adequately capture the diverse and highly specialized terminology used in this field. To overcome these limitations, we structured our work around three major tasks:
1) Define classification categories specific to cellulose-related terms, based on expert knowledge of the field.
2) Develop a labeled corpus of texts annotated with these categories, creating a gold-standard dataset for training and evaluation.
3) Evaluate multiple NER strategies to identify the most accurate and scalable approach for recognizing and classifying named entities in cellulose literature.
The NER strategies tested include: (1) a custom-trained model developed from scratch, (2) a zero-shot learning model based on large language models (LLMs), and (3) a few-shot learning model that provides the LLM with labeled examples to enhance its performance. Initial testing used a limited dataset to validate our experimental framework. Preliminary results show that the few-shot approach with a larger example set (around 20 examples) performs best. This approach demonstrated the most promise in balancing accuracy and generalizability. We are now preparing to repeat these experiments on a larger dataset that spans a broader cross-section of cellulose-related literature. The ultimate goal of this work is to integrate the best-performing NER module into the broader CelloGraph system, enabling automated extraction of terms and structured knowledge generation across large volumes of research publications.

CelloGraph System Architecture

Progress toward Objective 4 involved completing the design and implementation of the full CelloGraph system architecture. The system is built for modularity, allowing components to be improved or replaced over time without overhauling the entire pipeline. The system workflow proceeds as follows:
1) Article Collection Module: Uses the Semantic Scholar API to retrieve open-access cellulose-related publications, including metadata and URLs. The corresponding PDFs are downloaded and stored.
2) SciTex Module: Converts the PDFs into structured XML files.
3) RDF Generator: Processes XML files to generate RDF representations of each document using internal data structures.
4) Graph DB Import-Export Module: Imports RDF data into a GraphDB instance and retrieves content as needed. It also exports content to be processed by other modules.
5) NER Tool Integrator: Extracts paragraphs from the graph database, processes them for entity recognition, and returns structured annotations.
6) JSON Module: Organizes recognized entities and converts the data into a format compatible with the RDF Generator for re-import into the database.
Each cycle through the system enhances the knowledge graph, making it a richer and more queryable resource for researchers. The codebase is open source and available in the CelloGraph GitHub repository. To test the pipeline, we processed five open-source scientific articles on cellulosic materials, beginning with their PDF versions. The system generated a knowledge graph containing over 263,000 RDF triples, describing 260 paragraphs and 4,472 extracted terms. These terms, identified using experimental NER models, can now be queried to reveal relationships and insights relevant to cellulose science.

Text Annotation Tool - TART

A critical bottleneck in developing high-performance NER models is the need for large quantities of annotated training data. To streamline this process, an undergraduate research assistant has been developing the Text Annotation and Review Tool (TART). This tool enhances an existing annotation framework with capabilities for reviewing, editing, and tracking the history of annotations, thereby increasing transparency and trust in the data. TART offers two modes:
Annotation Mode: Users manually tag entities or refine suggestions from automated NER tools.
Review Mode: Expert users validate or correct annotations, handling cases of misidentification or omissions.
TART supports converting annotations from JSON to RDF, ensuring compatibility with the CelloGraph knowledge base. This integration bridges the gap between raw text processing and high-quality knowledge graph construction. TART is expected to be finalized by Summer 2025 and is available open source in the TART GitHub repository.
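The few-shot strategy described above can be illustrated with a minimal sketch of how labeled examples might be assembled into a prompt for an LLM. This is not the project's implementation: the label set, the example sentences, the prompt wording, and the names `Example` and `build_few_shot_prompt` are all assumptions made for illustration, and the call to an actual LLM is deliberately left out.

```python
from dataclasses import dataclass

# Hypothetical label set, chosen for illustration only.
LABELS = ("MATERIAL", "PROPERTY", "PROCESS", "PRODUCT")


@dataclass(frozen=True)
class Example:
    """One labeled sentence: the text plus (surface form, label) pairs."""
    text: str
    entities: tuple


def format_entities(entities):
    """Render entity pairs as 'term -> LABEL', separated by semicolons."""
    return "; ".join(f"{term} -> {label}" for term, label in entities) or "(none)"


def build_few_shot_prompt(examples, query_text, labels=LABELS):
    """Assemble instructions, the labeled examples, and the unlabeled query."""
    lines = [
        "Identify named entities in the sentence and assign one label each.",
        "Allowed labels: " + ", ".join(labels),
        "",
    ]
    for ex in examples:
        lines.append(f"Sentence: {ex.text}")
        lines.append(f"Entities: {format_entities(ex.entities)}")
        lines.append("")
    # The query ends with an open "Entities:" slot for the model to complete.
    lines.append(f"Sentence: {query_text}")
    lines.append("Entities:")
    return "\n".join(lines)


# Made-up example sentences (not taken from the project's corpus).
EXAMPLES = [
    Example(
        "Bacterial cellulose films showed high tensile strength.",
        (("Bacterial cellulose", "MATERIAL"), ("tensile strength", "PROPERTY")),
    ),
    Example(
        "The pulp was treated by acid hydrolysis.",
        (("pulp", "MATERIAL"), ("acid hydrolysis", "PROCESS")),
    ),
]

prompt = build_few_shot_prompt(
    EXAMPLES, "Cellulose nanofibers improved the permeability of the membrane."
)
print(prompt)
```

A zero-shot variant would pass an empty list of examples, and the roughly 20-example setting mentioned above would pass a longer list; the rest of the prompt stays the same.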

Publications

Type: Peer Reviewed Journal Articles Status: Published Year Published: 2024 Citation: Wang, Jinwu, et al. "Cellulose Membranes: Synthesis and Applications for Water and Gas Separation and Purification." Membranes 14.7 (2024): 148.

Progress 03/01/23 to 02/29/24

Outputs
Target Audience: Dr. Hahmann taught a course on Ontology Engineering Principles and Practice in Spring 2024 (17 graduate students) that incorporates lessons learned from the project. In addition, Dr. Hahmann and Umayer Reza, one of the graduate students involved in the project, defined and supervised a Computer Science Capstone Design project that gave the students experience developing software as a team. The product is currently being finalized for use in the project. Umayer Reza also presented an outline of and progress on his dissertation work within the project at the Doctoral Consortium of the FOIS 2023 conference.
Jul 18, 2023: Poster presented on "Populating and Refining an Ontology of Cellulose Materials with Terms from Scientific Publications" at the Early Career Symposium of the 13th International Conference on Formal Ontology in Information Systems 2023, Sherbrooke, Quebec, Canada.
Apr 14, 2023: Poster presented on "The Scientific Publication Ontology (SciPub): Linking publications' content to the domain ontologies" at UMaine Student Symposium, Orono, Maine, USA.
May 23, 2022: Poster presented on "CelloGraph - Accelerating Knowledge Discovery for Cellulosic Materials using AI and Knowledge Graphs" at Manufacturing RENEW3D Conference, Orono, Maine, USA.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? Dr. Hahmann taught a course on Ontology Engineering Principles and Practice in Spring 2024 (17 graduate students) that incorporates lessons learned from the project. In addition, Dr. Hahmann and Umayer Reza, one of the graduate students involved in the project, defined and supervised a Computer Science Capstone Design project that gave the students experience developing software as a team. The product is currently being finalized for use in the project. Umayer Reza also presented an outline of and progress on his dissertation work within the project at the Doctoral Consortium of the FOIS 2023 conference.
How have the results been disseminated to communities of interest? In addition to Umayer Reza presenting his work at the Doctoral Consortium of the FOIS 2023 conference, we published the ontology online and completed a manuscript describing the work. This paper describing the OnNER ontology has been accepted for publication and will be presented to the ontology community in July at the FOIS 2024 conference. In addition, Xuelian Zhang presented progress on her work at the University's student research symposium. We're still working towards publishing the properties ontologies and the results from the Named Entity Recognition from Cellulosic Publications.
What do you plan to do during the next reporting period to accomplish the goals? Towards goal 1, Xuelian Zhang will finalize and publish the Properties module of the Cellograph Ontology, link entity mentions from papers to it, and prepare a manuscript for publication. This work has taken longer than expected due to the steep learning curve Xuelian Zhang faced in systematically designing ontologies and familiarizing herself with the tools that support large-scale ontology development. We can then continue with the development of the Materials and Process modules and their integration with the Properties module. Goal 2 has been completed with the OnNER ontology, with the exception that the NER terms need to be added as they are detected by the NER pipeline (see goal 3 below) and connected to the cellulose ontology from goal 1. This will be added as progress on goals 1 and 3 is made but no longer requires substantial effort. Towards goal 3, we aim to test a complete pipeline for named entity recognition and measure its accuracy as well as how quickly the results can be reviewed. The results will then be integrated into the combined graph as detailed above (under goal 2). Towards goal 4, we will focus on deploying the populated knowledge in GraphDB, which will serve as the overall database management system and provide at least a basic interface, and on showcasing its use with SPARQL queries in a public repository for demonstration and testing by users. Upon completion of the work on the properties ontologies and a basic pipeline for NER, the next major step will be to deploy the ontology together with the graph data to test their compatibility and to devise methods for linking terms from the detailed representation of each publication to the cellulose ontology for improved and smarter querying.

Impacts
What was accomplished under these goals?
Towards goal 1, we have expanded our approach to systematically extract over 300 properties that are mentioned in the abstracts of papers published on cellulosic materials as a reference set of properties. We have designed a pipeline that automatically identifies duplicates (despite variations in spelling) and other similarities and compares the properties extracted from specific papers to the reference set to identify new properties. These are used to more systematically identify any common or high-level properties that we previously missed in the taxonomy. These are again provided as an OWL/RDF ontology that extends the core modules of the Cellograph ontology. We expect to conclude this work by the end of August; it will constitute one chapter of the dissertation of one of the involved PhD students, Xuelian Zhang. Most importantly, this establishes a reproducible workflow that can be quickly executed on other terms (cellulose materials, processes, etc.) to build up the ontology quickly yet systematically.
Towards goal 2, we have finalized the Ontology for Named Entity Representation (OnNER), which we previously referred to as the Scientific Publications (SciPub) Ontology. A manuscript describing the ontology, its development, and its use for constructing semantically queryable Knowledge Graphs of cellulose or other domain knowledge has been accepted for presentation and publication at the International Conference on Formal Ontology in Information Systems (FOIS 2024), to be presented in Amsterdam in July 2024. This work is led by one of the PhD students and is a substantial part of his dissertation work. The ontology itself is publicly shared via GitHub at: https://github.com/theSKAILab/OnNER
Towards goal 4, we have demonstrated how the ontology serves as the framework for automatically populating a Knowledge Graph of cellulose materials using the content of 5 articles from the cellulose materials literature. This is now a fully automated process that we can easily scale up in the near future. In particular, we plan to populate the graph with ~100 articles from an open-source corpus to create a more realistic demonstration and test graph. The ontology and the data have been deployed in a functional knowledge graph, allowing us to query it semantically using SPARQL queries. Examples of such queries in plain English include:
Which "chemicals" are mentioned in conjunction with "permeability"? In which publications and paragraphs?
What are the most recent publications that mention "bacterial cellulose" or "BC nanofibers" at least three times?
Retrieve all paragraphs from publications since 2017 that include named entities labeled as "application" in conjunction with the property "tensile strength".
What named entities have been assigned classification labels by an NER system that have been corrected by human labelers?
These are translated to SPARQL for execution on the knowledge graph. For example, the first query is represented in SPARQL as:
    SELECT ?chemical ?paragraph ?publication
    WHERE {
      ?publication onner:containsDocumentPart ?paragraphID .
      ?paragraphID rdf:type onner:Paragraph ;
                   onner:paragraphText ?paragraph ;
                   onner:directlyContainsLabeledTerm ?ne1 , ?ne2 .
      ?ne1 onner:labeledTermText "permeability"^^xsd:string .
      ?ne2 onner:labeledTermText ?chemical ;
           onner:hasLabeledTermStatus ?ne2Status .
      ?ne2Status onner:hasLabeledTermLabel ?label2 .
      ?label2 onner:labelText "CHEMICAL"^^xsd:string .
    }
More details of the ontology, its implementation, and its evaluation are available from the shared paper.
Towards goal 3, we have continued to test various machine learning algorithms for named entity recognition (NER) on text from the scientific literature. Because of the low accuracy of our previous approaches and the need for a large amount of training data, we have shifted our focus to leveraging a Large Language Model (currently GPT 3.5 Turbo and GPT 4) for Named Entity Recognition using a process called fine-tuning. This work is still in the early stages.
Towards goal 4, and to assist with machine learning (goal 3), we have also identified and customized an online annotation editor that allows domain experts to efficiently review tags identified by NER algorithms, clean the recognized entities, and include them correctly in the ontology and knowledge graph. This work was conducted by a group of 4 undergraduate computer science students as part of their Capstone project and supervised by the co-PI Hahmann and the PhD student Umayer Reza. This also helps generate a larger training set that can be used to improve the NER algorithms.
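The duplicate-detection step described under goal 1 (matching property names despite spelling variations and flagging unseen ones as new) can be sketched as follows. The report does not say which similarity measure the project's pipeline uses; this sketch assumes simple string normalization plus the standard library's `difflib.SequenceMatcher`, and the threshold, function names, and sample terms are all illustrative.

```python
import re
from difflib import SequenceMatcher


def normalize(term):
    """Lowercase, turn hyphens/underscores into spaces, collapse whitespace."""
    term = re.sub(r"[-_]", " ", term.lower())
    return re.sub(r"\s+", " ", term).strip()


def similarity(a, b):
    """Similarity in [0, 1] between two terms after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_new_properties(extracted, reference, threshold=0.9):
    """Split extracted terms into (matched, new) against a reference set.

    `matched` maps each extracted term to its closest reference term when the
    similarity reaches `threshold` (an assumed value); the rest are `new`.
    """
    if not reference:
        return {}, list(extracted)
    matched, new = {}, []
    for term in extracted:
        best = max(reference, key=lambda ref: similarity(term, ref))
        if similarity(term, best) >= threshold:
            matched[term] = best
        else:
            new.append(term)
    return matched, new


# Illustrative terms only; not the project's reference set.
REFERENCE = ["tensile strength", "crystallinity index", "water vapor permeability"]
EXTRACTED = ["Tensile-strength", "water vapour permeability", "thermal conductivity"]

matched, new = find_new_properties(EXTRACTED, REFERENCE)
print(matched)
print(new)
```

Character-level similarity catches spelling and hyphenation variants but not synonyms, so terms reported as new would still need review by a domain expert before being added to the taxonomy.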

Publications

Progress 03/01/22 to 02/28/23

Outputs
Target Audience: 12 graduate and undergraduate students, reached through a newly developed graduate-level course, "Information Extraction from NL Texts" (developed and taught by Dr. Hahmann), that exposed students to machine learning for information extraction from scientific texts (including texts about cellulose materials) and to the construction of formal representations of the information.
Changes/Problems: The involved graduate students have only been working fully on the project since September 2022 because they needed to finish work on a prior project. Overall progress is therefore behind schedule but commensurate with expenditures (currently ~81k of the total budget). The project is now proceeding at full speed.
What opportunities for training and professional development has the project provided? Dr. Hahmann has developed a new graduate course on "Natural Language Processing for Information Extraction" and taught it for the first time in Fall 2022 to 12 graduate and two advanced undergraduate students. The course trained the students in modern machine learning methods, including deep learning, with particular applications to natural language processing and information extraction. The students trained in this course included the two graduate students and one undergraduate student involved in the project, who have been working on testing and comparing named entity recognition approaches on cellulosic literature.
How have the results been disseminated to communities of interest? No dissemination activities have taken place over the last year, though we're currently preparing three manuscripts: (1) on the Scientific Publication Ontology; (2) on the Properties module of the Cellograph Ontology; and (3) on preliminary results for Named Entity Recognition from Cellulosic Publications.
What do you plan to do during the next reporting period to accomplish the goals? Towards goal 1, we will formally verify the Properties module of the Cellograph Ontology by checking its internal logical consistency and its external consistency (against data). Next, we will continue with the development of the Materials and Process modules and their integration with the Properties module. Towards goal 2, we will focus on testing the pipeline that converts PDF publications to RDF representations at a larger scale. Towards goal 4, we will deploy a graph database (GraphDB) to store the extracted data and demonstrate simple queries with it. This will in turn allow us to debug both the ontology and the data representation. It will further allow us to test the extraction and addition of new cellulosic terms to the ontology in order to demonstrate the overall system architecture, namely how the machine learning will help grow the graph and knowledge base over time to support more meaningful and fine-grained exploration and analysis of the cellulose data. At the same time, we aim to continue developing the machine learning tools to improve the recognition and classification of cellulosic terms as the main research thread towards goal 3. By deploying the graph, we will also be able to showcase progress on the machine-learning-based extraction of cellulosic knowledge.

Impacts
What was accomplished under these goals?
Towards goal 1, we have surveyed the literature and extracted over 200 physical and chemical properties that are relevant for cellulose. These properties have been organized into a taxonomy of properties and, together with metadata (origin, natural language definitions, alternative terms, etc.), provided as an OWL/RDF ontology that extends the core modules of the Cellograph ontology.
Towards goal 2, we have revised and completed the Scientific Publications (SciPub) Ontology as a standard RDF format for representing information from the literature. In its revision, we are connecting to and reusing existing ontologies of document components and bibliographic information to avoid redundancy and to improve reuse. We're preparing a manuscript to describe the ontology and preparing to make the ontology accessible to the wider community. Furthermore, the prototype of a tool called SciTex from Year 1 has been improved to more accurately maintain the structure of documents as well as sub- and superscripts (which are important for many chemical formulas) when converting PDF versions of scientific publications into plain text, in preparation for encoding them in RDF formats and further processing them with machine learning tools to identify cellulose-related terms and add them to the cellulose ontology.
Towards goal 3, we have explored and tested machine learning algorithms for named entity recognition on text from the scientific literature. We have deployed and tested a machine learning model for recognizing and classifying chemical entities. As expected, the existing models don't perform particularly well on cellulose data, missing many entities we would like to extract from text and classify. Thus, in a second step, we are currently experimenting with different approaches to retrain and amend existing models with custom corpora of cellulose entities. For this purpose, we used a set of custom tags (materials, properties, processes, products) and tagged a small test corpus for development purposes.
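How well a model recovers entities tagged with the custom labels above can be quantified with span-level precision, recall, and F1 against a hand-tagged test corpus. The report does not state which metric the project used; the following is a minimal sketch assuming exact-match scoring over (start, end, label) spans, with a made-up sentence and made-up predictions.

```python
def prf1(gold, predicted):
    """Exact-match precision, recall, and F1 over (start, end, label) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up sentence with character-offset annotations (end offsets exclusive).
text = "Cellulose nanocrystals were produced by acid hydrolysis."
GOLD = [(0, 22, "MATERIAL"), (40, 55, "PROCESS")]
# Hypothetical model output: one correct span, one truncated and mislabeled span.
PRED = [(0, 22, "MATERIAL"), (40, 44, "MATERIAL")]

precision, recall, f1 = prf1(GOLD, PRED)
print(precision, recall, f1)
```

Exact matching is strict: the truncated span "acid" counts as both a false positive and a missed entity. A more lenient scheme that credits overlapping spans would be an alternative design choice.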

Publications

Progress 03/01/21 to 02/28/22

Outputs
Target Audience: 15 graduate and undergraduate students, through work on and exposure to the development of small excerpts of the cellulose ontology within the course SIE 580: "Ontology Engineering Principles and Practice". The broader scientific community, specifically in the areas of bio-based manufacturing, who attended the renew3D manufacturing conference for large-scale bio-based additive manufacturing (AM) in Orono, ME, May 23/24, 2022, where the graduate students presented an overview of the cellulose informatics pipeline and information system.
Changes/Problems: The start of work on the project has been delayed while the students divide their time between this funded project (approx. 30% effort; currently only paying tuition expenses to cover this portion of the work) and a related project on the ontology development (70% of effort; covering the students' stipends) that was delayed due to the Covid pandemic and is currently still wrapping up. As a result, the project is currently approx. 8 months behind schedule (in expenditures and progress).
What opportunities for training and professional development has the project provided? The 2 graduate students and 1 undergraduate student have been trained in the development of ontologies as part of the course SIE 580: "Ontology Engineering Principles and Practice".
How have the results been disseminated to communities of interest? The graduate students presented an overview of the cellulose informatics information system and the ontology to the broader scientific community, specifically in the areas of bio-based manufacturing, who attended the renew3D manufacturing conference for large-scale bio-based additive manufacturing (AM) in Orono, ME, May 23/24, 2022.
What do you plan to do during the next reporting period to accomplish the goals? Towards goal 2, we will focus on expanding the conversion pipeline from PDF publications to RDF representations, particularly improving its robustness and scalability. Afterwards, the pipeline will be formally evaluated using a set of new (previously unseen) publications. Towards goal 1, we will use the PDF-to-RDF conversion to automatically identify and filter for scientific terms that can be used to expand the vocabulary (the cellulose ontology) and can be reviewed by subject matter experts. This will also be the first step towards goal 4, as it involves designing the overall database/knowledge management system.

Impacts
What was accomplished under these goals? Towards goal 2, we have started developing the Scientific Publications Ontology as a standard RDF format for representing information from the literature and produced a first and second iteration of that ontology. A sample instantiation of this ontology for one publication has been created. Furthermore, proof-of-concept software has been developed that reads a PDF version of a publication from the literature and converts it into an internal structured representation that can subsequently be converted to RDF much more easily and can be used to help identify new terms and add them to the cellulose ontology.
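The step from an internal structured representation to RDF can be sketched as follows. This is not the project's proof-of-concept software: the `Publication` class is hypothetical, both namespace IRIs are placeholders rather than the ontology's real IRIs, and the terms `containsDocumentPart`, `Paragraph`, and `paragraphText` are borrowed from the SPARQL query quoted in the 2023-24 report, whereas the Year 1 ontology may have used different names. The output format is N-Triples, written with the standard library only.

```python
from dataclasses import dataclass, field

ONNER = "http://example.org/onner#"  # placeholder, NOT the real ontology IRI
BASE = "http://example.org/doc/"     # placeholder base IRI for documents
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


@dataclass
class Publication:
    """Hypothetical internal representation: an identifier plus paragraph texts."""
    pub_id: str
    paragraphs: list = field(default_factory=list)


def literal(text):
    """Escape a Python string as an N-Triples string literal."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def to_ntriples(pub):
    """Emit three triples per paragraph: containment, type, and text."""
    pub_iri = f"<{BASE}{pub.pub_id}>"
    lines = []
    for i, para in enumerate(pub.paragraphs, start=1):
        para_iri = f"<{BASE}{pub.pub_id}/paragraph/{i}>"
        lines.append(f"{pub_iri} <{ONNER}containsDocumentPart> {para_iri} .")
        lines.append(f"{para_iri} <{RDF_TYPE}> <{ONNER}Paragraph> .")
        lines.append(f"{para_iri} <{ONNER}paragraphText> {literal(para)} .")
    return lines


pub = Publication("paper1", ["Cellulose is a polysaccharide.", 'A "quoted" line.'])
triples = to_ntriples(pub)
print("\n".join(triples))
```

Keeping the structured representation separate from the serializer means the PDF parsing can change without touching the RDF output, which is in the spirit of the intermediate representation described above.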

Publications