Progress 08/01/20 to 09/23/22
Outputs Target Audience: Members of the target audience included government agencies, food safety managers, and consumers. Changes/Problems:
Nothing Reported
What opportunities for training and professional development has the project provided? This is a joint project between the University of Illinois at Urbana-Champaign (UIUC) and the Worcester Polytechnic Institute (WPI). One PhD student in Food Science from UIUC (Dr. Tao, who graduated from UIUC last year), two graduate students in Data Science, and a group of five undergraduate students in Computer Science from WPI are involved. Through weekly meetings on Zoom, the two groups shared their results, resolved issues jointly when help was needed, and collaboratively identified potential solutions to the problems discussed. One-on-one weekly research meetings with the PIs have allowed each student to work side-by-side with faculty and others on their research projects and to experience how different researchers approach and tackle problems. During the second year, the Co-PI Prof. Rundensteiner and two graduate students in Data Science built a team of five undergraduate students. They designed and implemented the data storage system and web-based visual analytics interface. The advisor and mentors provided guidance and feedback on how to build a real-world application. Weekly DAISY meetings led by the Co-PI Prof. Rundensteiner with a team of eighteen research students have allowed each student in this project to communicate their ideas and results to a larger audience and get feedback on their presentations.
How have the results been disseminated to communities of interest? A project web page was developed here: https://davis.wpi.edu/dsrg/PROJECTS/ESPP/. The primary interactive interface can be found at: https://usda-foodpoisoning.wpi.edu/. The developed dataset was released to the public through the Language Resources and Evaluation Conference, and the primary work was presented to the natural language processing community. Another manuscript is in submission to a conference to disseminate the results to broader public audiences.
What do you plan to do during the next reporting period to accomplish the goals? In the third year of this project, we will focus on pattern discovery and outbreak detection, going into more depth on the analysis and the technology itself. Refine the database system. Develop a method to discover outbreaks of foodborne illness and then use the method to detect events that have not yet been reported. The method will use all sources of information we have identified so far as input. The developed method will be used to conduct case studies on the real-world dataset to test its performance. Refine the design of web-based visual analytics tools. The tool will support visualization of potential foodborne illness incidents at multiple levels (e.g., nation, state, county) and provide filtering options to users. The tool will allow users to analyze trends of a foodborne illness across time and geographic regions, assuming sufficient data can be identified. The tool will present possible (or confirmed) source(s) of foodborne illness and common symptoms of each incident when available.
Impacts What was accomplished under these goals?
For Objective 1, the data collection pipeline was developed in the first year. However, much of the data collected from Twitter lacks location information for the reported case, which is important to the outbreak detection model. Therefore, we refined the data pipeline to estimate the location of posts without geolocation information. Specifically, we run a second Twitter API query to get the user profile data. We then check whether the profile has a location, and if it does, we use that as the location of the Tweet. We also tokenize the profile description and check whether each token is a location-indicating word. For Objective 2, in the first year, we trained state-of-the-art natural language processing models and employed them to identify relevant tweets about foodborne illnesses and extract critical entities within the tweets. In the second year, we refined the database system used to store the primary data model so that it captures more of the relevant properties of the data. Mainly, we used the open-source relational database management system PostgreSQL to store the unified spatiotemporal data model and created a simple query API for accessing our database to identify and analyze foodborne illness incidents from collected data. Our query API provides targeted access to information on different food products, geolocations, and timeframes. Visual analytics tools were further developed to help us discover food safety risk patterns, which is also part of the front-end design and implementation to achieve Objective 5. For Objective 3, we conducted primary data analysis for the five worst foodborne illness outbreaks since 2018 to gain some insight. We picked keywords related to each outbreak to retrieve the outbreak-related tweets, via the Twitter API, from thirty days before the first reported case to the end of the outbreak. 
Then we applied the trained deep learning models described in Objective 2 to process the tweets and extract critical information. We tried several ways to extract signals from the tweet data and compared them with the timeline of illness onsets reported on the CDC website. For Objective 5, we built the front end of our application with ReactJS and deployed it using Apache Web Server on a WPI machine. Our current application mainly offers three visualization tools: 1) Timeline, 2) Word Cloud, and 3) Tracker Map. The timeline shows the temporal dimension of our data. Each outbreak is displayed as a box whose width is determined by the outbreak's duration and whose position on the timeline is determined by its starting date. The word cloud allows users to explore the data we have made available through visualizations. These visualizations are intended to be more informative than the ones on the home page, informing users of interesting findings from this project and of the relationship between Twitter and food poisoning outbreaks. The tracker map uses the county of specific food poisoning instances to display data on a choropleth map of the United States. In summary, the application features a timeline of historical foodborne illness outbreaks, infographics that provide insight into the size of our dataset and the common keywords found in it, and a choropleth map of the United States showing, per county, the number of collected Tweets that our model has identified as containing a relevant report of foodborne illness. The findings were reported in three manuscripts: one published in the journal "Scientific Reports", a second accepted by the "Language Resources and Evaluation Conference" in the computer science field, and a third currently submitted to a big data conference.
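The location fallback described under Objective 1 above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's code: the function name `estimate_location`, the shape of the `profile` dictionary (with `location` and `description` keys), and the tiny `LOCATION_WORDS` gazetteer are all hypothetical, and no real Twitter API call is made.

```python
import re
from typing import Optional

# Hypothetical gazetteer of location-indicating words (assumption);
# a real pipeline would use a far larger list of place names.
LOCATION_WORDS = {"boston", "chicago", "illinois", "massachusetts", "worcester"}


def estimate_location(tweet_geo: Optional[str], profile: dict) -> Optional[str]:
    """Return the best available location guess for a tweet, or None."""
    # 1. Prefer the geolocation attached to the tweet itself, if any.
    if tweet_geo:
        return tweet_geo

    # 2. Fall back to the location field of the user profile.
    profile_location = (profile.get("location") or "").strip()
    if profile_location:
        return profile_location

    # 3. Tokenize the profile description and look for a location word.
    description = profile.get("description") or ""
    for token in re.findall(r"[a-z]+", description.lower()):
        if token in LOCATION_WORDS:
            return token.title()

    # No usable location signal was found.
    return None
```

Checking the three sources in a fixed order keeps the most direct signal (the tweet's own geolocation) ahead of the weaker ones inferred from profile text.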
Publications
- Type:
Journal Articles
Status:
Published
Year Published:
2022
Citation:
Hu R, Zhang D, Tao D, Hartvigsen T, Feng H and Rundensteiner E. 2022. TWEET-FID: An annotated dataset for multiple foodborne illness detection tasks, arXiv preprint arXiv:2205.10726.
- Type:
Journal Articles
Status:
Published
Year Published:
2021
Citation:
Tao DD, Zhang D, Hu R, Rundensteiner E and Feng H. 2021. Crowdsourcing and machine learning approaches for extracting entities indicating potential foodborne outbreaks from social media, Scientific Reports, 11, 21678. https://doi.org/10.1038/s41598-021-00766-w.
- Type:
Journal Articles
Status:
Published
Year Published:
2021
Citation:
Tao DD and Feng H. 2021. Text mining of social media data for enhancing food safety of famers market, Journal of Food Science and Nutrition, 4(6), 1-9.
- Type:
Conference Papers and Presentations
Status:
Published
Year Published:
2021
Citation:
Tao D, Zhang D, Hu R, Rundensteiner E and Feng H. 2021. Joint learning approaches for identifying unreported foodborne illnesses and extracting entity from social media. International Association for Food Protection Annual Meeting (Virtual).
|
Progress 08/01/21 to 07/31/22
Outputs Target Audience: Members of the target audience included government agencies, food safety managers, and consumers. Changes/Problems:
Nothing Reported
What opportunities for training and professional development has the project provided? This is a joint project between the University of Illinois at Urbana-Champaign (UIUC) and the Worcester Polytechnic Institute (WPI). One PhD student in Food Science from UIUC (Dr. Tao, who graduated from UIUC last year), two graduate students in Data Science, and a group of five undergraduate students in Computer Science from WPI are involved. Through weekly meetings on Zoom, the two groups shared their results, resolved issues jointly when help was needed, and collaboratively identified potential solutions to the problems discussed. One-on-one weekly research meetings with the PIs have allowed each student to work side-by-side with faculty and others on their research projects and to experience how different researchers approach and tackle problems. During the second year, the Co-PI Prof. Rundensteiner and two graduate students in Data Science built a team of five undergraduate students. They designed and implemented the data storage system and web-based visual analytics interface. The advisor and mentors provided guidance and feedback on how to build a real-world application. Weekly DAISY meetings led by the Co-PI Prof. Rundensteiner with a team of eighteen research students have allowed each student in this project to communicate their ideas and results to a larger audience and get feedback on their presentations. How have the results been disseminated to communities of interest? A project web page was developed here: https://davis.wpi.edu/dsrg/PROJECTS/ESPP/ The primary interactive interface is available here: https://usda-foodpoisoning.wpi.edu/ The developed dataset was released to the public through the Language Resources and Evaluation Conference, and the primary work was presented to the natural language processing community. Another manuscript is in submission to a conference to disseminate the results to broader public audiences. 
What do you plan to do during the next reporting period to accomplish the goals? In the third year of this project, we will focus on pattern discovery and outbreak detection, going into more depth on the analysis and the technology itself. Refine the database system. Develop a method to discover outbreaks of foodborne illness and then use the method to detect events that have not yet been reported. The method will use all sources of information we have identified so far as input. The developed method will be used to conduct case studies on the real-world dataset to test its performance. Refine the design of web-based visual analytics tools. The tool will support visualization of potential foodborne illness incidents at multiple levels (e.g., nation, state, county) and provide filtering options to users. The tool will allow users to analyze trends of a foodborne illness across time and geographic regions, assuming sufficient data can be identified. The tool will present possible (or confirmed) source(s) of foodborne illness and common symptoms of each incident when available.
Impacts What was accomplished under these goals?
For Objective 1, the data collection pipeline was developed in the first year. However, much of the data collected from Twitter lacks location information for the reported case, which is important to the outbreak detection model. Therefore, we refined the data pipeline to estimate the location of posts without geolocation information. Specifically, we run a second Twitter API query to get the user profile data. We then check whether the profile has a location, and if it does, we use that as the location of the Tweet. We also tokenize the profile description and check whether each token is a location-indicating word. For Objective 2, in the first year, we trained state-of-the-art natural language processing models and employed them to identify relevant tweets about foodborne illnesses and extract critical entities within the tweets. In the second year, we refined the database system used to store the primary data model so that it captures more of the relevant properties of the data. Mainly, we used the open-source relational database management system PostgreSQL to store the unified spatiotemporal data model and created a simple query API for accessing our database to identify and analyze foodborne illness incidents from collected data. Our query API provides targeted access to information on different food products, geolocations, and timeframes. Visual analytics tools were further developed to help us discover food safety risk patterns, which is also part of the front-end design and implementation to achieve Objective 5. For Objective 3, we conducted primary data analysis for the five worst foodborne illness outbreaks since 2018 to gain some insight. We picked keywords related to each outbreak to retrieve the outbreak-related tweets, via the Twitter API, from thirty days before the first reported case to the end of the outbreak. 
Then we applied the trained deep learning models described in Objective 2 to process the tweets and extract critical information. We tried several ways to extract signals from the tweet data and compared them with the timeline of illness onsets reported on the CDC website. For Objective 5, we built the front end of our application with ReactJS and deployed it using Apache Web Server on a WPI machine. Our current application mainly offers three visualization tools: 1) timeline, 2) word cloud, and 3) tracker map. 1. The timeline shows the temporal dimension of our data. Each outbreak is displayed as a box whose width is determined by the outbreak's duration and whose position on the timeline is determined by its starting date. 2. The word cloud allows users to explore the data we have made available through visualizations. These visualizations are intended to be more informative than the ones on the home page, informing users of interesting findings from this project and of the relationship between Twitter and food poisoning outbreaks. 3. The tracker map uses the county of specific food poisoning instances to display data on a choropleth map of the United States. In summary, the application features a timeline of historical foodborne illness outbreaks, infographics that provide insight into the size of our dataset and the common keywords found in it, and a choropleth map of the United States showing, per county, the number of collected Tweets that our model has identified as containing a relevant report of foodborne illness. The findings were reported in three manuscripts: one published in the journal "Scientific Reports", a second accepted by the "Language Resources and Evaluation Conference" in the computer science field, and a third currently submitted to a big data conference.
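The timeline layout rule described above (box position from the starting date, box width from the duration) can be illustrated with a short sketch. The deployed front end is written in ReactJS; Python is used here only to show the arithmetic, and the function name `timeline_box`, the pixel-based axis, and the default axis width are assumptions rather than details taken from the application.

```python
from datetime import date
from typing import Tuple


def timeline_box(start: date, end: date,
                 axis_start: date, axis_end: date,
                 axis_width_px: float = 1000.0) -> Tuple[float, float]:
    """Return (left_px, width_px) of the box drawn for one outbreak."""
    total_days = (axis_end - axis_start).days
    if total_days <= 0:
        raise ValueError("axis_end must be after axis_start")

    # Horizontal scale of the timeline axis (hypothetical pixel units).
    px_per_day = axis_width_px / total_days

    # The starting date determines where the box sits on the axis.
    left_px = (start - axis_start).days * px_per_day
    # The duration of the outbreak determines how wide the box is.
    width_px = (end - start).days * px_per_day
    return left_px, width_px
```

For instance, on an axis spanning 100 days drawn 1000 pixels wide, an outbreak that starts 10 days in and lasts 20 days would be placed 100 pixels from the left edge and drawn 200 pixels wide.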
Publications
- Type:
Conference Papers and Presentations
Status:
Accepted
Year Published:
2022
Citation:
Hu, R., Zhang, D., Tao, D., Hartvigsen, T., Feng, H. and Rundensteiner, E. 2022. TWEET-FID: An Annotated Dataset for Multiple Foodborne Illness Detection Tasks. Proceedings of the 13th Language Resources and Evaluation Conference.
- Type:
Conference Papers and Presentations
Status:
Submitted
Year Published:
2022
Citation:
Hu, R., Zhang, D., Tao, D., Feng, H. and Rundensteiner, E. 2022. An Interactive System for Food Poisoning Outbreak Analysis and Detection.
|
Progress 08/01/20 to 07/31/21
Outputs Target Audience: Members of the target audience included government agencies, food safety managers, and consumers. Changes/Problems:
Nothing Reported
What opportunities for training and professional development has the project provided? This is a joint project between the University of Illinois at Urbana-Champaign (UIUC) and the Worcester Polytechnic Institute (WPI). Two graduate students from UIUC (one receiving a Ph.D. degree in Food Science and one receiving an M.S. degree in Computer Science) and two graduate students from WPI (both in Data Science) are involved. 1. During the first year, through weekly meetings on Zoom, the two groups shared their results, resolved issues jointly when help was needed, and collaboratively identified potential solutions to the problems discussed. 2. One-on-one weekly research meetings with the PIs have given each student the opportunity to work side-by-side with faculty and others on their research projects and to experience how different researchers approach and tackle problems. 3. Weekly DAISY meetings led by the Co-PI Prof. Rundensteiner with a team of twelve research students have allowed each student in this project to communicate their ideas and results to a larger audience and get feedback on their presentations. How have the results been disseminated to communities of interest? A project web page accessible to the public was developed: https://davis.wpi.edu/dsrg/PROJECTS/ESPP/ Some results were presented via a poster at the International Association for Food Protection (IAFP) Meeting in 2021 to disseminate the findings to the food safety community around the world. In addition, findings from the preliminary studies were reported in manuscripts, with one submitted to the Scientific Reports journal and another to a conference, to disseminate the results to broader public audiences. 
What do you plan to do during the next reporting period to accomplish the goals? In Year 2 of this project, we will continue to focus on the social media data analysis, going into more depth on both the analysis and the technology itself. Refine the current real-time data retrieval mechanism to extract more tweets relevant to foodborne illness incidents. Search for other relevant sources of foodborne illness incident information. Refine the current relevant tweet identification and information extraction model to achieve higher retrieval accuracy. Establish mechanisms to geolocate tweets without location information in the streaming data, possibly using the user profile or other sources. Refine the data model for fusing diverse data from multiple information sources so that it captures more of the relevant properties of the data. Develop a method to discover outbreaks of foodborne illness and then use the method to detect events that have not yet been reported. The method will use all sources of information we have identified so far as input. Begin the design of web-based visual analytics tools. The tool should support visualization of foodborne illness incidents at multiple levels (e.g., nation, state, county). The tool should allow users to analyze trends of a foodborne illness across time and geographic regions, assuming sufficient data can be identified. The tool should present possible (or confirmed) source(s) of foodborne illness and common symptoms of each incident when available.
Impacts What was accomplished under these goals?
To meet the overarching goal of this project, i.e., to construct a big data analytics infrastructure early warning system for fresh produce safety with multiple-sourced digital data, we identified five specific objectives to pursue: 1) Data collection; 2) Entity extraction and storage; 3) Pattern discovery; 4) Predictive models; and 5) Interface design. In our first year, we made substantial progress on the first two objectives. The research findings have been reported in several manuscripts, with one submitted to the journal "Scientific Reports" and another recently submitted to a deep learning conference. For Objective 1, we first conducted a study on source identification of foodborne outbreaks in the United States. Thereafter, all relevant outbreaks/recalls related to fresh produce were extracted for further analysis. We worked with authoritative sources such as the CDC, FDA, USDA, and other local and state government websites to identify confirmed outbreaks related to fresh produce in the U.S. We also investigated news reports about foodborne illnesses. While informative for post-hoc analysis and valuable for warning citizens, news media is not a valuable source for collecting "early" signals of foodborne outbreaks. We thus also studied social media data sources. In particular, we developed a reliable mechanism for retrieving data from the social media platform Twitter via the official Twitter API. The collected data are stored on a server at Worcester Polytechnic Institute (WPI). The stored data are protected with disk mirroring, daily backups, and other means. Full-time system administrators monitor the security and availability of these systems. Appropriate access control and other security policies and mechanisms are in place to protect the integrity, security, privacy, and confidentiality of the data, along with other rights or requirements. 
For Objective 2, we used this data warehouse of collected tweet data to construct a labeled dataset. Amazon Mechanical Turk was used for crowdsourcing, in which registered workers were recruited to complete the tasks. For a given tweet, workers were asked to read it carefully, score on a scale of zero to five how much they agreed that the tweet indicated a possible foodborne illness incident (zero: not at all, five: very sure), highlight all words/phrases belonging to specific labels (food, location, symptom, and foodborne illness keywords), and decide whether each of the highlighted words/phrases was related to a foodborne illness incident. We then trained state-of-the-art natural language processing models on this dataset. These models are based on deep learning and pretrained on large text corpora, and include BERT, RoBERTa, and others. Thereafter, the trained models were employed both to identify relevant tweets about foodborne illnesses and to extract critical entities within the tweets. We have also begun to create a big data store that fuses the spatiotemporal risk pattern data. For this, we use the open-source flexible database management system MongoDB to store the unified spatiotemporal data model. To identify and analyze foodborne illness incidents from collected data, we created a simple query API for accessing our database. Our query API provides targeted access to information on different food products, geolocations, and timeframes.
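The kind of targeted access the query API provides (by food product, geolocation, and timeframe) might look like the sketch below. It is purely illustrative and filters in-memory records rather than calling MongoDB: the `Incident` record, its field names, and the `query_incidents` function are hypothetical and are not the project's actual schema or API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass
class Incident:
    """Hypothetical record for one extracted foodborne illness report."""
    food: str
    state: str
    reported_on: date


def query_incidents(records: Iterable[Incident],
                    food: Optional[str] = None,
                    state: Optional[str] = None,
                    start: Optional[date] = None,
                    end: Optional[date] = None) -> List[Incident]:
    """Return the records matching every filter that was supplied."""
    results = []
    for record in records:
        # Food product filter (case-insensitive exact match).
        if food is not None and record.food.lower() != food.lower():
            continue
        # Geolocation filter, here simplified to a state code.
        if state is not None and record.state.upper() != state.upper():
            continue
        # Timeframe filter; both bounds are inclusive.
        if start is not None and record.reported_on < start:
            continue
        if end is not None and record.reported_on > end:
            continue
        results.append(record)
    return results
```

Leaving a filter as `None` means "do not restrict on this dimension", so a single function covers queries by product only, by region only, or by any combination of the three.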
Publications
- Type:
Journal Articles
Status:
Published
Year Published:
2021
Citation:
Tao, D.D. and Feng, H. 2021. Text mining of social media data for enhancing food safety of famers market. Journal of Food Science and Nutrition, 4(6).
- Type:
Journal Articles
Status:
Under Review
Year Published:
2021
Citation:
Tao, D., Zhang, D., Hu, R., Rundensteiner, E. and Feng, H. 2021. Crowdsourcing and machine learning approaches for extracting entities indicating potential foodborne outbreaks from social media. Scientific Reports. Under Review.