Recipient Organization
UNIV OF THE DISTRICT OF COLUMBIA
4200 CONNECTICUT AVENUE N.W
WASHINGTON, DC 20008
Performing Department
SCHOOL OF ENGINEERING & APPLIED SCIENCE
Non Technical Summary
Diabetes is a group of diseases characterized by high levels of blood glucose resulting from defects in insulin production, insulin action, or both. In the United States, 25.8 million people, or 8.3% of the population, have diabetes. An estimated 18.8 million have been diagnosed, but 7.0 million people are unaware that they have the disease. In 2010, 1.9 million new cases of diabetes were diagnosed in people aged 20 years and older. Diabetes is also one of the leading causes of death in the U.S.; in 2007, it contributed to 231,404 deaths. The serious complications associated with diabetes include heart disease and stroke, high blood pressure, blindness, kidney disease, nervous system disease, and others. The total monetary cost of diabetes in the United States in 2007 was $218 billion [14]. The cost of diabetes to quality of life is immeasurable. In a statement on July 1, 2010, Judith E. Fradkin, M.D., Director of the Division of Diabetes, Endocrinology, and Metabolic Diseases at NIH, said: "...it is clear that the scientific progress achieved during that time period has been remarkable... However, diabetes still places an enormous personal and economic toll on our country, so it is critically important to continue the pursuit of research to make further improvements in patients' health and quality of life." [15]
The overall goal of this interdisciplinary research is to develop a cloud-computing-based pathway analysis approach, CPA, to identify pathways that are associated with diabetes. We propose this research as an extension of our current project, "Developing Fuzzy-set-theory-based Data Mining Methodologies for Diabetes Data Analysis" (Oct. 2008 to Sep. 2011). We propose to: i) design Cloud-computing-based Pathway Analysis (CPA) to identify gene pathways that are significant in diabetes; ii) implement CPA and test it on both synthetic datasets and real-world diabetes gene expression data; and iii) compare the performance of CPA with existing approaches.
To the best of our knowledge, we are among the first to have developed a cloud-computing-based approach to microarray data analysis, and we will be the first to develop a cloud-computing-based approach for gene pathway analysis. The P.I. is one of the few computer scientists developing cloud-based and fuzzy-set-theory-based data mining methodologies for diabetes data analysis. Successful completion of the proposed research will advance the understanding of diabetes and the development of strategies to prevent and control it. This, in turn, will help reduce the burden of diabetes on healthcare systems.
Animal Health Component
(N/A)
Research Effort Categories
Basic
(N/A)
Applied
(N/A)
Developmental
(N/A)
Goals / Objectives
The overall goal of this interdisciplinary research is to develop a cloud-computing-based pathway analysis approach, CPA, to identify pathways that are associated with diabetes. We propose this research as an extension of our current project, "Developing Fuzzy-set-theory-based Data Mining Methodologies for Diabetes Data Analysis" (Oct. 2008 to Sep. 2011). We propose to: i) design Cloud-computing-based Pathway Analysis (CPA) to identify gene pathways that are significant in diabetes; ii) implement CPA and test it on both synthetic datasets and real-world diabetes gene expression data; and iii) compare the performance of CPA with existing approaches. The proposed research is innovative because, to the best of our knowledge, a cloud computing platform has not yet been used for gene pathway analysis. Thus, the advantage of using this technology to meet the challenges of gene pathway analysis has not yet been examined. The proposed research is collaborative and interdisciplinary. The Principal Investigator, Dr. Liang, is a faculty member in the Department of Computer Science and Information Technology at UDC. She has been working in the field of biomedical data mining for the past few years [1-9] and has also made many research contributions in the fields of artificial intelligence, fuzzy logic, and data mining [10-13]. She has built a multidisciplinary team that includes a biologist, students in computer science, and professionals in bioinformatics.
Project Methods
Aim 1. Design Cloud-computing-based Pathway Analysis (CPA) to identify gene pathways that are associated with diabetes. We propose CPA to discover genes that are not only tightly associated with each other but also significant in diabetes. We will first generate a table of changes and then compute a correlation coefficient matrix. We will then use a simple genetic algorithm (GA) framework.
Aim 2. Implement this approach and test it on both synthetic datasets and real-world diabetes gene expression data. We plan to implement this approach on the two-node cloud cluster in the P.I.'s research lab. To examine its time performance, we will test it on both the Illinois Cloud Computing Test Bed [23] and Amazon EC2. We will test the approach in the following two steps: 1. Test on synthetic data. We will first use the approach from [24] to generate time series of gene expression data from a pre-defined target S-system model. We will then test how closely the solution found by CPA resembles the target model. 2. Test on real-world diabetes data. We will test CPA on the dataset used by [25] to investigate the efficiencies of Cyclophosphamide treatment imposed on Type 1 diabetes. Compared with non-time-series data, time series data adds a dimension, time, to the gene expression data. This additional dimension carries information about how a gene's expression changes over time and thus gives more opportunity to find correlations among the genes. The computational complexity of finding the correlations among thousands of genes over time is very high. We expect parallel processing on a cloud cluster to meet this challenge.
Aim 3. Compare the performance of this approach with existing approaches. To identify the advantages of our approach, we will compare it with the following existing approaches: 1) Gaussian Mixture Models and Bayesian Network Model. We will compare CPA with Gaussian Mixture Models [16] on the yeast and honey bee datasets [16], and with the Bayesian Network Model [17] on Saccharomyces cerevisiae cell cycle time series data [17]. 2) S-system [24]. We will compare the results that CPA achieves in Aim 2 with the results of S-system.
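As an illustration of the synthetic-data test in Aim 2 and the first two steps named in Aim 1, the following is a minimal Python sketch and not the project's implementation. It assumes the standard S-system form dX_i/dt = alpha_i * prod_j X_j^g_ij - beta_i * prod_j X_j^h_ij, a simple Euler integrator, a "table of changes" defined as the difference between consecutive time points, and Pearson correlation. The three-gene parameter values, the function name simulate_s_system, and all variable names are hypothetical; the GA step and the cloud parallelization are not shown.

```python
import numpy as np


def simulate_s_system(x0, alpha, beta, g, h, dt=0.01, steps=500, sample_every=25):
    """Euler-integrate an S-system and return sampled expression levels.

    dX_i/dt = alpha_i * prod_j X_j**g[i, j] - beta_i * prod_j X_j**h[i, j]
    Returns an array of shape (time points, genes).
    """
    x = np.asarray(x0, dtype=float)
    samples = [x.copy()]
    for step in range(1, steps + 1):
        # x broadcasts across rows, so entry [i, j] is x[j] ** g[i, j]
        production = alpha * np.prod(x ** g, axis=1)
        degradation = beta * np.prod(x ** h, axis=1)
        # Clamp to a small positive value so negative exponents stay defined
        x = np.maximum(x + dt * (production - degradation), 1e-9)
        if step % sample_every == 0:
            samples.append(x.copy())
    return np.array(samples)


# Hypothetical three-gene target model (values chosen only for illustration):
# gene 3 represses gene 1, gene 1 activates gene 2, gene 2 activates gene 3.
alpha = np.array([3.0, 2.0, 2.0])
beta = np.array([2.0, 2.0, 2.0])
g = np.array([[0.0, 0.0, -0.8],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])
h = np.diag([0.5, 0.5, 0.5])

# Synthetic time series: rows are time points, columns are genes
series = simulate_s_system([0.5, 1.5, 1.0], alpha, beta, g, h)

# Assumed "table of changes": expression change between consecutive time points
changes = np.diff(series, axis=0)

# Gene-by-gene Pearson correlation coefficient matrix of those changes
corr = np.corrcoef(changes, rowvar=False)

print(series.shape, changes.shape, corr.shape)
```

For thousands of genes, the correlation matrix grows quadratically with the number of genes, which is the cost that the text proposes to spread across a cloud cluster.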