Cloud-based Fuzzy Data Mining for Diabetes Gene Pathway Analysis

Sponsoring Institution

National Institute of Food and Agriculture

Project Status

COMPLETE

Funding Source

HATCH

Reporting Frequency

Annual

Accession No.

0227068

Grant No.

(N/A)

Cumulative Award Amt.

(N/A)

Proposal No.

(N/A)

Multistate No.

(N/A)

Project Start Date

Nov 1, 2011

Project End Date

Nov 1, 2014

Grant Year

(N/A)

Program Code

[(N/A)]- (N/A)

Recipient Organization
UNIV OF THE DISTRICT OF COLUMBIA
4200 CONNECTICUT AVENUE N.W.
WASHINGTON, DC 20008

Performing Department
SCHOOL OF ENGINEERING & APPLIED SCIENCE

Non Technical Summary
Diabetes is a group of diseases characterized by high levels of blood glucose resulting from defects in insulin production, insulin action, or both. There are 25.8 million people in the United States, or 8.3% of the population, who have diabetes. An estimated 18.8 million have been diagnosed, but 7.0 million people are unaware that they have the disease. There were 1.9 million new cases of diabetes diagnosed in people aged 20 years and older in 2010. Diabetes is also one of the leading causes of death in the U.S.; in 2007, it contributed to 231,404 deaths. The serious complications associated with diabetes include heart disease and stroke, high blood pressure, blindness, kidney disease, nervous system disease, and others. The total monetary cost of diabetes in the United States in 2007 was $218 billion [14]. The cost of diabetes to quality of life is immeasurable. In a statement made on July 1, 2010, Judith E. Fradkin, M.D., Director, Division of Diabetes, Endocrinology, and Metabolic Diseases, NIH, said: "...it is clear that the scientific progress achieved during that time period has been remarkable... However, diabetes still places an enormous personal and economic toll on our country, so it is critically important to continue the pursuit of research to make further improvements in patients' health and quality of life." [15]
The overall goal of this interdisciplinary research is to develop a cloud-computing-based pathway analysis approach, CPA, to identify pathways that are associated with diabetes. We propose this research as an extension of our current project, "Developing Fuzzy-set-theory-based Data Mining Methodologies for Diabetes Data Analysis" (Oct. 2008~Sep. 2011). We propose to: i) design Cloud-computing-based Pathway Analysis (CPA) to identify gene pathways that are significant in diabetes; ii) implement CPA and test it on both synthetic datasets and real-world diabetes gene expression data; and iii) compare the performance of CPA with existing approaches.
To the best of our knowledge, we are among the first to have developed a cloud-computing-based approach to microarray data analysis, and we will be the first to develop a cloud-computing-based approach for gene pathway analysis. The P.I. is one of the few computer scientists who are developing cloud-based and fuzzy-set-theory-based data mining methodologies for diabetes data analysis. Successful development of the proposed research project will greatly help the understanding of diabetes and the development of strategies to prevent and control it. This, in turn, will significantly reduce the burden of diabetes on healthcare systems.

Animal Health Component

(N/A)

Research Effort Categories

Basic

(N/A)

Applied

(N/A)

Developmental

(N/A)

Classification

Knowledge Area (KA)	Subject of Investigation (SOI)	Field of Science (FOS)	Percent
901	7310	2090	100%

Knowledge Area
901 - Program and Project Design, and Statistics;

Subject Of Investigation
7310 - Experimental design and statistical methods;

Field Of Science
2090 - Statistics, econometrics, and biometrics;

Keywords

cloud computing based

fuzzy data mining

diabetes

gene pathway analysis

gene microarray data

Goals / Objectives
The overall goal of this interdisciplinary research is to develop a cloud-computing-based pathway analysis approach, CPA, to identify pathways that are associated with diabetes. We propose this research as an extension of our current project, "Developing Fuzzy-set-theory-based Data Mining Methodologies for Diabetes Data Analysis" (Oct. 2008~Sep. 2011). We propose to: i) design Cloud-computing-based Pathway Analysis (CPA) to identify gene pathways that are significant in diabetes; ii) implement CPA and test it on both synthetic datasets and real-world diabetes gene expression data; and iii) compare the performance of CPA with existing approaches. The proposed research is innovative because, to the best of our knowledge, a cloud computing platform has not yet been used for gene pathway analysis. Thus, the advantage of using the latest technology to meet the challenges of gene pathway analysis has not yet been examined. The proposed research is collaborative and interdisciplinary. The Principal Investigator, Dr. Liang, is a faculty member in the Department of Computer Science and Information Technology at UDC. She has been working in the field of biomedical data mining for the past few years [1-9] and has also made many research contributions in the fields of artificial intelligence, fuzzy logic, and data mining [10-13]. She has built a multi-disciplinary team, which includes a biologist, students in computer science, and professionals in bioinformatics.

Project Methods
Aim 1. Design Cloud-computing-based Pathway Analysis (CPA) to identify gene pathways that are associated with diabetes. We propose CPA to discover genes that are not only tightly associated with each other but also significant in diabetes. We will first generate a table of changes, then compute a correlation coefficient matrix, and then apply a simple genetic algorithm (GA) framework.
Aim 2. Implement this approach and test it on both synthetic datasets and real-world diabetes gene expression data. We plan to implement this approach on the two-node cloud cluster in the P.I.'s research lab. To examine its time performance, we will test it on both the Illinois Cloud Computing Test Bed [23] and Amazon EC2. We will test the approach in the following two steps:
1. Test on synthetic data. We will use the approach of [24] to first generate time series of gene expression data from a pre-defined target S-system model. Then we will test how closely the solution found by CPA resembles the target model.
2. Test on real-world diabetes data. We will test CPA on the dataset used by [25] to investigate the efficiencies of Cyclophosphamide treatment imposed on Type 1 diabetes.
Compared with non-time-series data, time series data adds an additional dimension, time, to the gene expression data. This dimension carries information about changes in a gene's expression over time and thus gives more opportunity to find correlations among the genes. The computational complexity of finding the correlations among thousands of genes over time is very high. We expect parallel processing on a cloud cluster to meet this challenge.
Aim 3. Compare the performance of this approach with existing approaches. To identify the advantages of our approach, we will compare it with the following existing approaches:
1) Bayesian Networks Model. We will compare CPA with Gaussian Mixture Models [16] on the yeast and honey bee datasets [16]. We will compare CPA with the Bayesian Network Model [17] on Saccharomyces cerevisiae cell cycle time series data [17].
2) S-system [24]. We will compare the results CPA achieves in Aim 2 with the results of the S-system.
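The three steps named in Aim 1 (table of changes, correlation coefficient matrix, simple GA) can be illustrated with a minimal sketch. The report does not define these steps in detail, so everything concrete below is an assumption made for illustration: the table of changes is taken to be the difference between consecutive time points, the correlation is Pearson's coefficient, a candidate pathway is a fixed-size gene subset, fitness is the mean absolute pairwise correlation inside the subset, and the GA uses truncation selection with elitism. The fitness here covers only the "tightly associated with each other" criterion, not significance in diabetes, and the expression values are made-up toy data.

```python
import random
from itertools import combinations
from statistics import mean


def table_of_changes(expression):
    """Per-gene differences between consecutive time points (assumed definition)."""
    return [[b - a for a, b in zip(row, row[1:])] for row in expression]


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def correlation_matrix(changes):
    """Gene-by-gene matrix of Pearson coefficients over the table of changes."""
    n = len(changes)
    return [[pearson(changes[i], changes[j]) for j in range(n)] for i in range(n)]


def fitness(subset, corr):
    """Mean absolute pairwise correlation within a candidate gene subset."""
    return mean(abs(corr[i][j]) for i, j in combinations(sorted(subset), 2))


def evolve(corr, subset_size, pop_size=20, generations=30, seed=0):
    """Simple GA over fixed-size gene subsets; returns (best subset, best fitness per generation)."""
    rng = random.Random(seed)
    genes = range(len(corr))
    population = [frozenset(rng.sample(genes, subset_size)) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=lambda s: fitness(s, corr), reverse=True)
        history.append(fitness(population[0], corr))
        survivors = population[: pop_size // 2]  # truncation selection keeps the elite
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            child = set(rng.sample(sorted(a | b), subset_size))  # crossover: draw from both parents
            if rng.random() < 0.3:  # mutation: swap one gene for a gene outside the subset
                outside = [g for g in genes if g not in child]
                if outside:
                    child.remove(rng.choice(sorted(child)))
                    child.add(rng.choice(outside))
            children.append(frozenset(child))
        population = survivors + children
    population.sort(key=lambda s: fitness(s, corr), reverse=True)
    return population[0], history


# Hypothetical toy data: 6 genes x 6 time points. Genes 0-2 move together by construction.
expression = [
    [1.0, 2.0, 4.0, 3.0, 5.0, 6.0],
    [3.0, 5.0, 9.0, 7.0, 11.0, 13.0],  # 2 * gene 0 + 1
    [9.0, 8.0, 6.0, 7.0, 5.0, 4.0],    # 10 - gene 0
    [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
    [2.0, 2.5, 1.0, 3.0, 1.5, 2.0],
    [4.0, 4.0, 4.0, 4.0, 4.0, 4.0],    # constant gene: correlation reported as 0.0
]
corr = correlation_matrix(table_of_changes(expression))
best, history = evolve(corr, subset_size=3)
print(sorted(best), round(fitness(best, corr), 3))
```

Because the top half of each generation is carried over unchanged, the best fitness recorded in `history` never decreases from one generation to the next, which is one simple way to track improvement over generations.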

Progress 11/01/11 to 11/01/14

Outputs
Target Audience: Researchers in bioinformatics, diabetes research, and cloud computing.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? This project has provided the MSCS program with research activities and faculty mentoring, as well as financial support for graduate students. Thus, it is essential to the growth of UDC, an HBCU in the nation's capital. The research activities of this project gave UDC students valuable research experience and training and prepared them for their careers. This project also helps to build up the graduate program(s) in the CSIT department of UDC. With this project, as well as other research projects, the MSCS program has grown quickly in the past few years, from a handful of students to the current enrollment of twenty-two. On that basis, the CSIT department proposed in Fall 2012 to establish a Ph.D. program in collaboration with other departments in the School of Engineering and Applied Sciences at UDC. The proposed program was recently approved by the senate and is included in Vision 2020. A future extension of this project will provide the proposed Ph.D. program with similar support.
How have the results been disseminated to communities of interest? The PI has disseminated the results of this research through her classes in Artificial Intelligence (graduate level, offered in Fall 2013) and Data Mining (graduate level, offered in Fall 2014). Manuscripts have been prepared and are in the final stage of submission to a conference and two journals.
What do you plan to do during the next reporting period to accomplish the goals? Even though this project has ended, we plan to extend it by continuing the new thread that it generated: an innovative multi-objective optimization genetic algorithm on a cloud computing platform for "big data" analysis.

Impacts
What was accomplished under these goals?
Designed and implemented Cloud-computing-based Pathway Analysis (CPA).
Designed and implemented experiments to demonstrate the effectiveness of CPA, including experiments that investigate the improvement in CPA's performance over generations of the simulated population, and experiments that compare our approach with its non-cloud-computing counterpart to demonstrate the benefit of cloud computing.
Developed a manuscript.
Identified one gene pathway with 200+ genes. This pathway is currently being verified by a biologist through a literature search.
Under the umbrella of this project, we have also developed two more threads:
Cloud-Computing-based Parallel Genetic Algorithm for Gene Microarray Data Analysis. This thread addresses gene analysis rather than pathway analysis, and it has been published.
Multi-objective optimization genetic algorithm on a cloud computing platform. A graduate student is currently developing this research for his thesis. We plan to extend the project in this direction.
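The comparison between CPA and its non-cloud counterpart rests on the fact that candidate pathways in a GA population can be scored independently of one another. The sketch below shows that structure only; it is not the project's implementation. As assumptions for illustration, a local thread pool stands in for cloud worker nodes (the report does not name the framework used on the lab cluster or on Amazon EC2), the fitness function is a hypothetical mean absolute pairwise correlation, and the 4-gene correlation matrix is invented.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def fitness(subset, corr):
    """Hypothetical fitness: mean absolute pairwise correlation within a gene subset."""
    pairs = list(combinations(sorted(subset), 2))
    return sum(abs(corr[i][j]) for i, j in pairs) / len(pairs)


def evaluate_serial(population, corr):
    """Non-parallel baseline: score candidates one after another."""
    return [fitness(s, corr) for s in population]


def evaluate_parallel(population, corr, workers=4):
    """Score candidates concurrently; the thread pool is a stand-in for cloud nodes."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the same order as the input population.
        return list(pool.map(lambda s: fitness(s, corr), population))


# Invented 4-gene correlation matrix, for illustration only.
corr = [
    [1.0, 0.9, -0.8, 0.1],
    [0.9, 1.0, -0.7, 0.2],
    [-0.8, -0.7, 1.0, 0.0],
    [0.1, 0.2, 0.0, 1.0],
]
population = [frozenset(c) for c in combinations(range(4), 3)]
serial_scores = evaluate_serial(population, corr)
parallel_scores = evaluate_parallel(population, corr)
print(serial_scores == parallel_scores)
```

Both paths must return identical scores, so any difference between the two setups shows up only in running time. In CPython, threads do not speed up CPU-bound scoring like this; a real timing comparison would need separate processes or machines.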

Publications

Type: Conference Papers and Presentations Status: Published Year Published: 2011 Citation: Anthony Benites and Lily R. Liang, Cloud-Computing-based Parallel Genetic Algorithm for Gene Microarray Data Analysis, in Proc. of International Conference on Tools with Artificial Intelligence (ICTAI), Nov. 7-9, 2011, Boca Raton, FL, USA, pp. 932-933.

Progress 01/01/13 to 09/30/13

Outputs
Target Audience: Researchers in bioinformatics, diabetes research, and cloud computing.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? The research activities of this project have given UDC students valuable research experience and training in two of the most promising interdisciplinary fields: bioinformatics and cloud computing. Students gain first-hand experience in these fields and become prepared for their careers. This project has also helped to build up the graduate program(s) in the CSIT department of UDC. With this project, as well as other research projects, the MSCS program has grown quickly in the past few years. On that basis, the CSIT department proposed in Fall 2012 to establish a Ph.D. program in collaboration with other departments in the School of Engineering and Applied Sciences at UDC. The proposed Ph.D. program is included in the University's Vision 2020. This project provides the master's program with research activities and faculty mentoring, as well as financial support for graduate students. Thus, it is essential to the growth of UDC, an HBCU in the nation's capital.
How have the results been disseminated to communities of interest? Nothing Reported
What do you plan to do during the next reporting period to accomplish the goals? We will continue to test CPA on other datasets and compare it with other approaches on time performance as well as identification accuracy.

Impacts
What was accomplished under these goals? We have designed and implemented CPA. We have tested it on one set of real-world data. We have compared it with one other approach.

Publications

Progress 01/01/12 to 12/31/12

Outputs
OUTPUTS: During this period, we began investigating the proposed topic, building on our previous years of research in diabetes gene analysis. We implemented the approach described in the proposal and tested it on a real-world dataset. Based on the experimental results, we identified problems with the original design. We searched the literature for work related to these problems and revised the design accordingly. The new design has been implemented and tested. The results show that our approach is effective: we can quickly generate and select high-quality gene pathway candidates from thousands of genes by taking advantage of the parallel computing of a cloud platform. We are drafting a manuscript to submit for publication. The PI is currently on sabbatical leave; however, she and her collaborator still meet regularly on this project.
PARTICIPANTS: Principal Investigator: Dr. Liang, Rui (Lily), Professor of Computer Science, University of the District of Columbia; Collaborator: Dr. Kumar, Deepak, Professor of Biology, University of the District of Columbia; Dr. Anthony Benites, IT professional, Adjunct Professor of CSIT Dept., University of the District of Columbia; and Vinay Mandal, IT professional. This project provides an excellent opportunity for students to work in one of the most promising interdisciplinary fields: bioinformatics. Our recent development integrates another highly promising area, cloud computing, into this project. Students who are trained on this project gain first-hand experience in these fields.
TARGET AUDIENCES: Researchers in bioinformatics, diabetes research, and cloud computing.
PROJECT MODIFICATIONS: Nothing significant to report during this reporting period.

Impacts
Publications: During this period, we produced results that will generate two journal publications. One of the manuscripts is being finalized while the other is being revised. We plan to submit both for publication in the next 2-3 months.
Education: During Summer 2011, one master's student in Computer Science, Soufiane Berouel, worked on this project. Anthony Benites, a former UDC MSCS graduate student who received his degree with a thesis on earlier work of this project, is now planning to become a Ph.D. student in the Department of Computer Science at UDC while working as an IT professional and adjunct professor. He is currently working with the PI on this project. We believe the research of this project has become an essential part of his professional preparation.
Collaborations: During this period, the collaborations spanned not only the disciplines of Computer Science and Biology but also academia and industry. Two full-time faculty members, an adjunct faculty member who is also an IT professional, another IT professional, and a student collaborated. The team that we have built since 2005 continues to be dynamic and diverse.
Capacity Building: The research activities of this project gave UDC students valuable research experience and training and prepared them for their careers. This project also helps to build up the graduate program(s) in the CSIT department of UDC. With this project, as well as other research projects, the MSCS program has grown quickly in the past few years. On that basis, the CSIT department proposed in Fall 2012 to establish a Ph.D. program in collaboration with other departments in the School of Engineering and Applied Sciences at UDC. The proposed program was recently approved by the senate and is expected to be implemented in Fall 2013. This project will provide the new Ph.D. program with research activities and faculty mentoring, as well as financial support for graduate students. Thus, it is essential to the growth of UDC, an HBCU in the nation's capital.

Publications

No publications reported this period