Fact: Innovative Big Data Analytics Technology For Microbiological Risk Mitigation Assuring Fresh Produce Safety

Sponsoring Institution

National Institute of Food and Agriculture

Project Status

COMPLETE

Funding Source

AFRI COMPETITIVE GRANT

Reporting Frequency

Annual

Accession No.

1023720

Grant No.

2020-67021-32459

Cumulative Award Amt.

$499,815.00

Proposal No.

2019-07449

Multistate No.

(N/A)

Project Start Date

Aug 1, 2020

Project End Date

Jul 31, 2023

Grant Year

2020

Program Code

[A1541]- Food and Agriculture Cyberinformatics and Tools

Recipient Organization
UNIVERSITY OF ILLINOIS
2001 S. Lincoln Ave.
URBANA,IL 61801

Performing Department
Food Science & Human Nutrition

Non Technical Summary
Fresh produce has been repeatedly linked to high-profile foodborne disease outbreaks, leading to illness, loss of life, significant economic loss, and erosion of consumer confidence. Conventional methods for risk assessment have attempted to model important events leading to produce contamination by human pathogens. However, the critical need for early identification of emerging produce safety risks and early warning to the public has not been met. In this age of the Internet of Things (IoT), the use of the Internet, especially real-time social media and its rapid proliferation and dissemination, and the emergence of game-changing big data technologies have provided unprecedented opportunities to detect emerging produce safety issues and alert the public at an early stage. The overall goal of this study is to develop an innovative big data analytics infrastructure for fresh produce safety risk prediction and early warning based on cyber-informatics technologies that exploit multi-source big data, including social media, news media, and government reports, to reduce the incidence of foodborne diseases associated with consumption of fresh produce. The specific objectives are to: 1) develop a real-time data retrieval mechanism to extract relevant information from diverse digital online sources, 2) design big data storage fusing risk pattern data sets, 3) discover event patterns about safety risks in fresh produce chains, 4) design machine learning models for predicting outbreaks early, and 5) implement a web-based early warning interface for stakeholders to visually explore levels of risk.

Animal Health Component

30%

Research Effort Categories

Basic

30%

Applied

30%

Developmental

40%

Classification

Knowledge Area (KA)	Subject of Investigation (SOI)	Field of Science (FOS)	Percent
711	5010	2080	50%
712	5010	2080	50%

Knowledge Area
711 - Ensure Food Products Free of Harmful Chemicals, Including Residues from Agricultural and Other Sources; 712 - Protect Food from Contamination by Pathogenic Microorganisms, Parasites, and Naturally Occurring Toxins;

Subject Of Investigation
5010 - Food;

Field Of Science
2080 - Mathematics and computer sciences;

Keywords

Goals / Objectives
The goal of this study is to develop an innovative big data analytics infrastructure for the modeling of fresh produce safety risks and the early warning of fresh produce safety outbreaks. The resulting infrastructure applies state-of-the-art cyber-informatics technologies that leverage multi-source big data, including social media, news media, and government reports, to reduce the incidence of foodborne diseases associated with the consumption of fresh produce.

Project Methods
The big data analytics infrastructure, called ESP (Early Warning System for Fresh Produce), will be composed of several core technologies. A data collector retrieves information related to foodborne outbreaks from digital online sources by identifying relevant posts and reports and then extracting structured properties from these unstructured posts (Objective 1). The project will work with a rich variety of data sources, including social media, news outlets, and authoritative websites from the CDC, FDA, USDA, and other local and state government organizations. As a foundation for extracting relevant reports from these sources, we built an initial lexicon for food safety risks. This lexicon, developed by the PI's group, will be refined by deploying human labelers in a crowdsourcing study on Mechanical Turk to establish the ground truth and by employing deep learning to generalize the food safety vocabulary. Leveraging this lexicon, deep learning models such as RNNs and Transformers will then be used to extract relevant food safety incidents from the identified digital data sources. The extracted data will be uploaded into our big data server (Objective 2). The big data server will be based on a spatiotemporal data model that captures the extracted food safety incidents indexed by location and time of occurrence. The data extractors are run periodically to extract, clean, and upload additional incident data into our integrated data repository for analysis. We will optimize the processing and data management strategies of this data server, as needed, to ensure that the system maintains practical performance as the database grows.
To identify potential risks of fresh produce safety outbreaks based on this integrated data store, state-of-the-art machine learning techniques will be applied to the data, including both unsupervised methods (Objective 3) and supervised methods (Objective 4) that learn models for capturing food safety risk and for predicting food safety outbreaks. Lastly, a web-based interface to our data server will allow stakeholders to visually explore the levels of risk and the outbreaks predicted by our ESP infrastructure (Objective 5).
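
To make the idea of a spatiotemporal data model concrete, the following is a minimal sketch in Python of an incident record indexed by location and time, with a filter over food product, location, and timeframe. It is not the project's actual schema: the class name `Incident`, its fields, the function `query_incidents`, and the sample records are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Incident:
    """One extracted food safety incident (all field names are hypothetical)."""
    food: str
    state: str          # two-letter state code
    county: str
    occurred_on: date   # time of occurrence
    source: str         # e.g. "twitter", "news", "cdc"


def query_incidents(
    incidents: List[Incident],
    food: Optional[str] = None,
    state: Optional[str] = None,
    start: Optional[date] = None,
    end: Optional[date] = None,
) -> List[Incident]:
    """Return the incidents that match every filter that was supplied."""
    matches = []
    for inc in incidents:
        if food is not None and inc.food.lower() != food.lower():
            continue
        if state is not None and inc.state != state:
            continue
        if start is not None and inc.occurred_on < start:
            continue
        if end is not None and inc.occurred_on > end:
            continue
        matches.append(inc)
    return matches


# Made-up sample records, for illustration only.
records = [
    Incident("romaine lettuce", "CA", "Monterey", date(2019, 11, 5), "twitter"),
    Incident("romaine lettuce", "IL", "Champaign", date(2019, 11, 20), "news"),
    Incident("onion", "CA", "Kern", date(2020, 7, 14), "cdc"),
]

california_2020 = query_incidents(records, state="CA", start=date(2020, 1, 1))
```

In a real deployment the same filters would more likely be expressed as database queries with indexes on the location and date columns; the in-memory loop here only illustrates the shape of the data and of the lookups.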

Progress 08/01/20 to 09/23/22

Outputs
Target Audience
Members of the target audience included government agencies, food safety managers, and consumers.

Changes/Problems
Nothing Reported

What opportunities for training and professional development has the project provided?
This is a joint project between the University of Illinois at Urbana-Champaign (UIUC) and the Worcester Polytechnic Institute (WPI). One PhD student in Food Science from UIUC (Dr. Tao, who graduated from UIUC last year), two graduate students in Data Science, and a group of five undergraduate students in Computer Science from WPI were involved. The two groups shared their results, worked jointly to resolve issues when help was needed, and collaboratively identified potential solutions to the problems discussed in weekly meetings on Zoom. One-on-one weekly research meetings for each student with the PIs have allowed students to work side-by-side with faculty and others on their research projects. This allows students to experience how different researchers approach and tackle problems. During the second year, the Co-PI Prof. Rundensteiner and two graduate students in Data Science built a team of five undergraduate students. They designed and implemented the data storage system and the web-based visual analytics interface. The advisor and mentors provided guidance and feedback on how to build a real-world application. Weekly DAISY meetings led by the Co-PI Prof. Rundensteiner with a team of eighteen research students have allowed each student in this project to communicate their ideas and results to a larger audience and get feedback on their presentations.

How have the results been disseminated to communities of interest?
A project web page was developed (https://davis.wpi.edu/dsrg/PROJECTS/ESPP/). The primary interactive interface is available online (https://usda-foodpoisoning.wpi.edu/). The developed dataset was released to the public through the Language Resources and Evaluation Conference, and the primary work was presented to the natural language processing community. Another manuscript has been submitted to a conference to disseminate the results to broader public audiences.

What do you plan to do during the next reporting period to accomplish the goals?
In the third year of this project, we will focus on pattern discovery and outbreak detection, going into more depth on the analysis and the technology itself. We will refine the database system. We will develop a method to discover outbreak events of foodborne illnesses and then use the method to detect events that have not yet been reported. The method will use all sources of information we have identified so far as input, and it will be applied in case studies on the real-world dataset to test its performance. We will refine the design of the web-based visual analytics tools. The tool will support visualization of potential foodborne illness incidents at multiple levels (e.g., nation, state, and county) and will provide filtering options to users. The tool will allow users to analyze trends of a foodborne illness across time and geographic regions, assuming sufficient data can be identified. The tool will present possible (or confirmed) source(s) of foodborne illness and common symptoms of each incident when available.

Impacts
What was accomplished under these goals?
For Objective 1, the data collection pipeline was developed in the first year. However, many tweets reporting a case lack location information, which is important for the outbreak detection model. Therefore, we refined the data pipeline to estimate the location of posts without geolocation information. Specifically, we run a second Twitter API query to get the user profile data. We then check whether the profile has a location, and if it does, we use that as the location of the tweet. We also tokenize the profile description and check each token to see whether it is an indicative location word.

For Objective 2, in the first year we trained state-of-the-art natural language processing models and employed them to identify relevant tweets about foodborne illnesses and to extract critical entities within the tweets. In the second year, we refined the database system used to store the primary data model so that it captures more of the important relevant properties of the data. Mainly, we used the open-source relational database management system PostgreSQL to store the unified spatiotemporal data model and created a simple query API for accessing our database to identify and analyze foodborne illness incidents from the collected data. The query API provides targeted access to information regarding different food products, geolocations, and timeframes. Visual analytics tools were further developed to help us discover food safety risk patterns; this work is also part of the front-end design and implementation for Objective 5.

For Objective 3, we conducted primary data analysis for the five worst foodborne illness outbreaks since 2018 to gain some insight. We picked keywords related to each outbreak and used the Twitter API to retrieve outbreak-related tweets from thirty days before the first reported case to the end of the outbreak. We then applied the trained deep learning models described under Objective 2 to process the tweets and extract critical information. We tried several ways to extract signals from the tweet data and compared them with the timeline of when people got sick, as reported on the CDC website.

For Objective 5, we built the front end of our application with ReactJS and deployed it using Apache Web Server on a WPI machine. Our current application mainly offers three visualization tools: 1) timeline, 2) word cloud, and 3) tracker map. The timeline shows the temporal elements of our research. Each outbreak is displayed as a box whose width is determined by the outbreak's duration and whose position on the timeline is determined by its starting date. The word cloud allows users to explore the data we have made available through visualizations. These visualizations are meant to be more informative than the ones on the home page, where we inform users of interesting findings from this project and of the relationship between Twitter and food poisoning outbreaks. The tracker map uses the county of specific food poisoning instances to display data on a choropleth map of the United States. Overall, the application features a timeline of historical foodborne illness outbreaks, infographics that provide insight into the size of our dataset and the common keywords found in it, and a choropleth map of the United States that shows the number of tweets collected per county that our model has identified as containing a relevant report of foodborne illness.

The findings were reported in three manuscripts: one published in the journal "Scientific Reports", a second accepted by the "Language Resources and Evaluation Conference" in the computer science field, and a third currently submitted to a big data conference.
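
The location estimation described under Objective 1 above can be sketched roughly as follows. This is an illustrative Python sketch, not the project's code. It assumes the user profile is available as a dictionary with `location` and `description` keys, it assumes the description is consulted only when the profile location is empty, and it uses a tiny hypothetical lookup table `LOCATION_WORDS` in place of whatever list of indicative location words the project used.

```python
from typing import Optional

# Hypothetical lookup of indicative location words; a real list would be far larger.
LOCATION_WORDS = {
    "boston": "Boston, MA",
    "chicago": "Chicago, IL",
    "urbana": "Urbana, IL",
}


def estimate_location(profile: dict) -> Optional[str]:
    """Estimate a tweet's location from its author's profile.

    Step 1: use the profile location field if it is non-empty.
    Step 2 (assumed fallback): tokenize the profile description and
    return the first token that is a known location word.
    """
    location = (profile.get("location") or "").strip()
    if location:
        return location

    description = profile.get("description") or ""
    for raw_token in description.lower().split():
        token = raw_token.strip(".,;:!?()#@\"'")  # drop surrounding punctuation
        if token in LOCATION_WORDS:
            return LOCATION_WORDS[token]
    return None  # no usable location signal in the profile


example = estimate_location({"location": "", "description": "Food lover in Chicago."})
```

Whitespace tokenization with a lookup table is the simplest possible choice; multi-word place names and ambiguous tokens would need a proper gazetteer or a named entity recognizer.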

Publications

Type: Journal Articles Status: Published Year Published: 2022 Citation: Hu R, Zhang D, Tao D, Hartvigsen T, Feng H and Rundensteiner E. 2022. TWEET-FID: An annotated dataset for multiple foodborne illness detection tasks, arXiv preprint arXiv:2205.10726.
Type: Journal Articles Status: Published Year Published: 2021 Citation: Tao DD, Zhang D, Hu R, Rundensteiner E and Feng H. 2021. Crowdsourcing and machine learning approaches for extracting entities indicating potential foodborne outbreaks from social media, Scientific Reports, 11, 21678. https://doi.org/10.1038/s41598-021-00766-w.
Type: Journal Articles Status: Published Year Published: 2021 Citation: Tao DD and Feng H. 2021. Text mining of social media data for enhancing food safety of farmer's market, Journal of Food Science and Nutrition, 4(6), 1-9.
Type: Conference Papers and Presentations Status: Published Year Published: 2021 Citation: Tao D, Zhang D, Hu R, Rundensteiner E and Feng H. 2021. Joint learning approaches for identifying unreported foodborne illnesses and extracting entity from social media. International Association for Food Protection Annual Meeting (Virtual).

Progress 08/01/21 to 07/31/22

Outputs
Target Audience
Members of the target audience included government agencies, food safety managers, and consumers.

Changes/Problems
Nothing Reported

What opportunities for training and professional development has the project provided?
This is a joint project between the University of Illinois at Urbana-Champaign (UIUC) and the Worcester Polytechnic Institute (WPI). One PhD student in Food Science from UIUC (Dr. Tao, who graduated from UIUC last year), two graduate students in Data Science, and a group of five undergraduate students in Computer Science from WPI were involved. The two groups shared their results, worked jointly to resolve issues when help was needed, and collaboratively identified potential solutions to the problems discussed in weekly meetings on Zoom. One-on-one weekly research meetings for each student with the PIs have allowed students to work side-by-side with faculty and others on their research projects. This allows students to experience how different researchers approach and tackle problems. During the second year, the Co-PI Prof. Rundensteiner and two graduate students in Data Science built a team of five undergraduate students. They designed and implemented the data storage system and the web-based visual analytics interface. The advisor and mentors provided guidance and feedback on how to build a real-world application. Weekly DAISY meetings led by the Co-PI Prof. Rundensteiner with a team of eighteen research students have allowed each student in this project to communicate their ideas and results to a larger audience and get feedback on their presentations.

How have the results been disseminated to communities of interest?
A project web page was developed (https://davis.wpi.edu/dsrg/PROJECTS/ESPP/). The primary interactive interface is available online (https://usda-foodpoisoning.wpi.edu/). The developed dataset was released to the public through the Language Resources and Evaluation Conference, and the primary work was presented to the natural language processing community. Another manuscript has been submitted to a conference to disseminate the results to broader public audiences.

What do you plan to do during the next reporting period to accomplish the goals?
In the third year of this project, we will focus on pattern discovery and outbreak detection, going into more depth on the analysis and the technology itself. We will refine the database system. We will develop a method to discover outbreak events of foodborne illnesses and then use the method to detect events that have not yet been reported. The method will use all sources of information we have identified so far as input, and it will be applied in case studies on the real-world dataset to test its performance. We will refine the design of the web-based visual analytics tools. The tool will support visualization of potential foodborne illness incidents at multiple levels (e.g., nation, state, and county) and will provide filtering options to users. The tool will allow users to analyze trends of a foodborne illness across time and geographic regions, assuming sufficient data can be identified. The tool will present possible (or confirmed) source(s) of foodborne illness and common symptoms of each incident when available.

Impacts
What was accomplished under these goals?
For Objective 1, the data collection pipeline was developed in the first year. However, many tweets reporting a case lack location information, which is important for the outbreak detection model. Therefore, we refined the data pipeline to estimate the location of posts without geolocation information. Specifically, we run a second Twitter API query to get the user profile data. We then check whether the profile has a location, and if it does, we use that as the location of the tweet. We also tokenize the profile description and check each token to see whether it is an indicative location word.

For Objective 2, in the first year we trained state-of-the-art natural language processing models and employed them to identify relevant tweets about foodborne illnesses and to extract critical entities within the tweets. In the second year, we refined the database system used to store the primary data model so that it captures more of the important relevant properties of the data. Mainly, we used the open-source relational database management system PostgreSQL to store the unified spatiotemporal data model and created a simple query API for accessing our database to identify and analyze foodborne illness incidents from the collected data. The query API provides targeted access to information regarding different food products, geolocations, and timeframes. Visual analytics tools were further developed to help us discover food safety risk patterns; this work is also part of the front-end design and implementation for Objective 5.

For Objective 3, we conducted primary data analysis for the five worst foodborne illness outbreaks since 2018 to gain some insight. We picked keywords related to each outbreak and used the Twitter API to retrieve outbreak-related tweets from thirty days before the first reported case to the end of the outbreak. We then applied the trained deep learning models described under Objective 2 to process the tweets and extract critical information. We tried several ways to extract signals from the tweet data and compared them with the timeline of when people got sick, as reported on the CDC website.

For Objective 5, we built the front end of our application with ReactJS and deployed it using Apache Web Server on a WPI machine. Our current application mainly offers three visualization tools: 1) timeline, 2) word cloud, and 3) tracker map.
1. The timeline shows the temporal elements of our research. Each outbreak is displayed as a box whose width is determined by the outbreak's duration and whose position on the timeline is determined by its starting date.
2. The word cloud allows users to explore the data we have made available through visualizations. These visualizations are meant to be more informative than the ones on the home page, where we inform users of interesting findings from this project and of the relationship between Twitter and food poisoning outbreaks.
3. The tracker map uses the county of specific food poisoning instances to display data on a choropleth map of the United States.
Overall, the application features a timeline of historical foodborne illness outbreaks, infographics that provide insight into the size of our dataset and the common keywords found in it, and a choropleth map of the United States that shows the number of tweets collected per county that our model has identified as containing a relevant report of foodborne illness.

The findings were reported in three manuscripts: one published in the journal "Scientific Reports", a second accepted by the "Language Resources and Evaluation Conference" in the computer science field, and a third currently submitted to a big data conference.
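
The timeline layout described above, in which an outbreak's starting date sets the position of its box and its duration sets the width, comes down to a small piece of date arithmetic. The sketch below shows that arithmetic in Python for illustration only; the actual front end is written in ReactJS, and the function name `timeline_box`, its parameters, and the one-day minimum width are assumptions rather than details from the project.

```python
from datetime import date
from typing import Tuple


def timeline_box(
    start: date,
    end: date,
    axis_start: date,
    axis_end: date,
    axis_width_px: float,
) -> Tuple[float, float]:
    """Return (left offset, width) in pixels for one outbreak box.

    The offset depends only on the outbreak's starting date and the
    width only on its duration, both scaled to the visible time axis.
    """
    total_days = (axis_end - axis_start).days
    if total_days <= 0:
        raise ValueError("axis_end must be after axis_start")
    px_per_day = axis_width_px / total_days

    left = (start - axis_start).days * px_per_day
    # Assumed design choice: give very short outbreaks at least one day of width.
    width = max((end - start).days, 1) * px_per_day
    return left, width


# A 10-day axis drawn 1000 px wide gives 100 px per day.
box = timeline_box(
    start=date(2018, 1, 3),
    end=date(2018, 1, 6),
    axis_start=date(2018, 1, 1),
    axis_end=date(2018, 1, 11),
    axis_width_px=1000,
)
# box == (200.0, 300.0)
```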

Publications

Type: Conference Papers and Presentations Status: Accepted Year Published: 2022 Citation: Hu, R., Zhang, D., Tao, D., Hartvigsen, T., Feng, H. and Rundensteiner, E. 2022. TWEET-FID: An Annotated Dataset for Multiple Foodborne Illness Detection Tasks. Proceedings of the 13th Language Resources and Evaluation Conference.
Type: Conference Papers and Presentations Status: Submitted Year Published: 2022 Citation: Hu, R., Zhang, D., Tao, D., Feng, H. and Rundensteiner, E. 2022. An Interactive System for Food Poisoning Outbreak Analysis and Detection.

Progress 08/01/20 to 07/31/21

Outputs
Target Audience
Members of the target audience included government agencies, food safety managers, and consumers.

Changes/Problems
Nothing Reported

What opportunities for training and professional development has the project provided?
This is a joint project between the University of Illinois at Urbana-Champaign (UIUC) and the Worcester Polytechnic Institute (WPI). Two graduate students from UIUC (one receiving a Ph.D. degree in Food Science and one receiving an M.S. degree in Computer Science) and two graduate students from WPI (both in Data Science) are involved.
1. During the first year, the two groups shared their results, worked jointly to resolve issues when help was needed, and collaboratively identified potential solutions to the problems discussed in weekly meetings on Zoom.
2. One-on-one weekly research meetings for each student with the PIs have given students the opportunity to work side-by-side with faculty and others on their research projects. This lets students experience how different researchers approach and tackle problems.
3. Weekly DAISY meetings led by the Co-PI Prof. Rundensteiner with a team of twelve research students have allowed each student in this project to communicate their ideas and results to a larger audience and get feedback on their presentations.

How have the results been disseminated to communities of interest?
A publicly accessible project web page was developed (https://davis.wpi.edu/dsrg/PROJECTS/ESPP/). Some results were presented via a poster at the International Association for Food Protection (IAFP) Meeting in 2021 to disseminate the findings to the food safety community around the world. In addition, findings from the preliminary studies were reported in manuscripts, with one submitted to the journal Scientific Reports and another to a conference, to disseminate the results to broader public audiences.

What do you plan to do during the next reporting period to accomplish the goals?
In Year 2 of this project, we will continue to focus on social media data analysis, going into more depth on both the analysis and the technology itself. We will refine the current real-time data retrieval mechanism to extract more tweets relevant to foodborne illness incidents. We will search for other relevant sources of foodborne illness incident information. We will refine the current relevant tweet identification and information extraction model to achieve higher retrieval accuracy. We will establish mechanisms to geolocate tweets in the streaming data that lack location information, possibly using the user profile or other sources. We will refine the data model for fusing diverse data from multiple information sources so that it can capture more of the important relevant properties of the data. We will develop a method to discover outbreak events of foodborne illness and then use the method to detect events that have not yet been reported. The method will use all sources of information we have identified so far as input. We will begin the design of web-based visual analytics tools. The tool should support visualization of foodborne illness incidents at multiple levels (e.g., nation, state, and county). The tool should allow users to analyze trends of a foodborne illness across time and geographic regions, assuming sufficient data can be identified. The tool should present possible (or confirmed) source(s) of foodborne illness and common symptoms of each incident when available.

Impacts
What was accomplished under these goals?
To meet the overarching goal of this project, i.e., to construct a big data analytics infrastructure that serves as an early warning system for fresh produce safety using multi-source digital data, we identified five specific objectives to pursue: 1) data collection; 2) entity extraction and storage; 3) pattern discovery; 4) predictive models; and 5) interface design. In our first year, we made great progress on the first two objectives. The research findings have been reported in several manuscripts, with one submitted to the journal "Scientific Reports" and another recently submitted to a prestigious deep learning conference.

For Objective 1, we first conducted a study on source identification of foodborne outbreaks in the United States. Thereafter, all relevant outbreaks and recalls related to fresh produce were extracted for further analysis. We worked with authoritative sources such as the CDC, FDA, USDA, and other local and state government websites to identify confirmed outbreaks related to fresh produce in the U.S. We also investigated news reports about foodborne illnesses. While informative for post-hoc analysis and valuable for warning our citizens, news media is not a valuable source for collecting "early" signals of foodborne outbreaks. We thus also studied social media data sources. In particular, we developed a reliable data retrieval mechanism for the social media platform Twitter via the official Twitter API. The collected data are kept on a server at Worcester Polytechnic Institute (WPI). The stored data is protected with disk mirroring, daily backups, and other means. Full-time system administrators monitor the security and availability of these systems. Appropriate access control and other security policies and mechanisms are in place to protect integrity, security, privacy, confidentiality, and other rights or requirements.

For Objective 2, we used this data warehouse of collected tweet data to construct a labeled dataset. Amazon Mechanical Turk was used for crowdsourcing, in which registered workers were recruited to complete the tasks. For a given tweet, workers were asked to read it carefully, score on a scale of zero to five how much they agreed that the tweet indicated a possible foodborne illness incident (zero: not at all, five: very sure), highlight all words or phrases belonging to specific labels (food, location, symptom, and foodborne illness keywords), and decide whether each highlighted word or phrase was related to a foodborne illness incident. We then trained state-of-the-art natural language processing models on this data set. These models are based on deep learning and pretrained on large corpora of text, such as BERT, RoBERTa, and others. Thereafter, the trained models were employed both to identify relevant tweets about foodborne illnesses and to extract critical entities within the tweets. We have also begun to create a big data store that fuses spatiotemporal risk pattern data sets. For this, we use the open-source flexible database management system MongoDB to store the unified spatiotemporal data model. To identify and analyze foodborne illness incidents from the collected data, we created a simple query API for accessing our database. The query API provides targeted access to information regarding different food products, geolocations, and timeframes.
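
The report does not say how the zero-to-five scores from several workers were combined into one label per tweet. As one plausible illustration, the Python sketch below averages the scores for each tweet and marks the tweet as relevant when the mean reaches a cutoff. The function name `aggregate_relevance`, the input layout, and the cutoff of 3.0 are all assumptions made for this example and should not be read as the project's method.

```python
from statistics import mean
from typing import Dict, List


def aggregate_relevance(
    scores_by_tweet: Dict[str, List[int]],
    threshold: float = 3.0,  # hypothetical cutoff, not taken from the project
) -> Dict[str, bool]:
    """Turn per-worker scores (0 to 5) into one relevance label per tweet.

    A tweet is labeled relevant when the mean of its worker scores is at
    least `threshold`. Tweets with no scores are skipped.
    """
    labels = {}
    for tweet_id, scores in scores_by_tweet.items():
        if not scores:
            continue  # nothing to aggregate for this tweet
        if any(score < 0 or score > 5 for score in scores):
            raise ValueError(f"scores for {tweet_id} must lie between 0 and 5")
        labels[tweet_id] = mean(scores) >= threshold
    return labels


# Made-up worker scores, for illustration only.
labels = aggregate_relevance({
    "tweet_1": [5, 4, 3],   # mean 4 -> relevant
    "tweet_2": [0, 1, 2],   # mean 1 -> not relevant
    "tweet_3": [],          # no scores -> skipped
})
```

Other aggregation rules, such as a majority vote or a weighting by worker reliability, would fit the same interface.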

Publications

Type: Journal Articles Status: Published Year Published: 2021 Citation: Tao, D.D. and Feng, H. 2021. Text mining of social media data for enhancing food safety of farmer's market. Journal of Food Science and Nutrition, 4(6).
Type: Journal Articles Status: Under Review Year Published: 2021 Citation: Tao, D., Zhang, D., Hu, R., Rundensteiner, E. and Feng, H. 2021. Crowdsourcing and machine learning approaches for extracting entities indicating potential foodborne outbreaks from social media. Scientific Reports. Under Review.