Source: NORTH CAROLINA STATE UNIV submitted to
GRAMMATICAL EVOLUTION NEURAL NETWORKS FOR GENETIC ASSOCIATION STUDIES
Sponsoring Institution
National Institute of Food and Agriculture
Project Status
EXTENDED
Funding Source
Reporting Frequency
Annual
Accession No.
0212668
Grant No.
(N/A)
Project No.
NC02217
Proposal No.
(N/A)
Multistate No.
(N/A)
Program Code
(N/A)
Project Start Date
Aug 16, 2007
Project End Date
Sep 30, 2013
Grant Year
(N/A)
Project Director
Motsinger, A.
Recipient Organization
NORTH CAROLINA STATE UNIV
(N/A)
RALEIGH,NC 27695
Performing Department
Statistics
Non Technical Summary
Gene-gene and gene-environment interactions are an important component of common, complex disease, and are difficult to detect using traditional statistical methods. The purpose of this study is to continue the development of a machine learning method to detect gene-gene and gene-environment interactions.
Animal Health Component
(N/A)
Research Effort Categories
Basic
(N/A)
Applied
(N/A)
Developmental
(N/A)
Classification

Knowledge Area (KA)    Subject of Investigation (SOI)    Field of Science (FOS)    Percent
304                    6099                              2090                      50%
304                    6099                              2080                      50%
Goals / Objectives
This project is aimed at the continued development of a machine learning method designed to detect gene-gene and gene-environment interactions in genetic association data. Grammatical Evolution Neural Networks (GENN) has been shown to be a powerful tool to detect interactions in both real and simulated data. Theoretical and empirical progress will extend the method to more genetic study designs, and will optimize the speed of the software program for large-scale data.
Project Methods
Both theoretical and empirical approaches will be used to extend and optimize the GENN method. Efficient methods of parallelization will be explored, and the machine learning component of the software package will be optimized. Real and simulated data will be used to extend the GENN method to a variety of study designs.
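
As an illustration of the grammatical evolution component named above, the following is a minimal sketch of the standard genotype-to-phenotype mapping used in grammatical evolution: each integer codon, taken modulo the number of available production rules, selects how the leftmost non-terminal is expanded. The grammar, the variable names (`snp1`, `snp2`, `snp3`), and the function `ge_map` are hypothetical and chosen only for illustration; the grammar GENN actually uses to build neural networks is not given in this report.

```python
# Hypothetical toy grammar: an expression over three SNP variables.
# This is NOT the GENN grammar, which is not specified in this report.
GRAMMAR = {
    "<expr>": [["<expr>", "<op>", "<expr>"], ["<var>"]],
    "<op>": [["+"], ["*"]],
    "<var>": [["snp1"], ["snp2"], ["snp3"]],
}


def ge_map(codons, grammar=GRAMMAR, start="<expr>", max_wraps=2):
    """Map a list of integer codons to a string using the grammar.

    The leftmost non-terminal is always expanded first. The production
    is chosen as codon % (number of productions for that non-terminal).
    Codons are reused ("wrapped") up to max_wraps times; if the mapping
    still has not finished, None is returned (an invalid individual).
    """
    symbols = [start]   # symbols still to process, leftmost first
    output = []         # terminals emitted so far
    used = 0            # number of codons consumed, counting wraps
    limit = len(codons) * (max_wraps + 1)
    while symbols:
        sym = symbols.pop(0)
        if sym not in grammar:          # terminal: emit it
            output.append(sym)
            continue
        if used >= limit:               # ran out of codons
            return None
        choices = grammar[sym]
        choice = choices[codons[used % len(codons)] % len(choices)]
        used += 1
        symbols = list(choice) + symbols
    return " ".join(output)


phenotype = ge_map([0, 1, 0, 1, 1, 2])
print(phenotype)  # snp1 * snp3
```

In an evolutionary search, each individual in the population would be such a codon list, and the mapped expression (in GENN, a neural network) would be scored by a fitness function.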

Progress 10/01/10 to 09/30/11

Outputs
OUTPUTS: Identifying susceptibility genes for complex disease is a major challenge for human geneticists. As genomic technologies rapidly advance, the resulting explosion of genetic information creates a challenge for analysis and interpretation. Many disease susceptibility genes exhibit effects that are dependent on interactions with other genes or environmental factors. Such interactions are difficult to detect using traditional statistical methods because, for the most part, these methods were not developed to detect purely interactive effects; their function has been to identify genes with main effects and then follow up with analysis of interactions between those genes that already exhibit a main effect. Because of this, new statistical and computational methods are needed. In addressing this need, we previously developed a neural network (NN) methodology optimized by an evolutionary computation technique (grammatical evolution, GE) to detect interactions. GENN performs both variable selection and statistical modeling without the computational burden of exhaustively searching all possible variable combinations. In the previous year of the project, we considered adding an additional classifier, decision trees (DT), to the software and method. While the NN modeling has been highly successful, a limitation of NNs in general is the "black-box" nature of the algorithm, which creates problems for interpretation. As an alternative, we implemented a DT option in the method and software. In the last year, the GE-optimized DT method (GEDT) was developed and programmed, and initial simulation studies were performed. In the past year, the development of both practical and theoretical aspects of the GEDT method has progressed. These outputs have been disseminated in local and international presentations to genetic epidemiologists, biostatisticians, biologists, and clinicians interested in the method development and application of GEDT.
The following talks discussing the GENN and GEDT methodology and experimental results have been presented during this last year:
  • 2010. "Grammatical Evolution Decision Trees for Detecting Gene-gene Interactions", 8th European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics, Istanbul, Turkey (Host: Elena Marchiori)
  • 2010. "Methods for Detecting Interactions in High-throughput Genetic Data", Department of Human Genetics Seminar Series, University of California, Los Angeles, Los Angeles, CA (Host: Marc Suchard)
  • 2011. "Capturing Knowledge from Large Datasets", Monsanto Fellows Professional Development Program, SAS Campus, Cary, NC (Host: Catherine Maxwell)
  • 2010. "Data Mining Approaches for Detecting Complex Models that Predict Complex Traits", Monsanto Prospecting Event, North Carolina State University (Host: Kelly Sexton)
PARTICIPANTS: Individuals who worked on the project and received support are the Principal Investigator, Alison Motsinger-Reif, and a graduate research assistant, Sushamna Deohdar. One key participant in the current project is a Bioinformatics MS student, Nicholas Hardison, who is focusing on the development of the methodology as his thesis project. Additionally, two students in the Computation for Undergraduates in Statistics (CUSP) program, Rachael Marceau and Kristopher Hoover, worked on the parameter sweep experiments for GEDT.
TARGET AUDIENCES: The outcomes of this project are meant for geneticists who are mapping complex traits.
PROJECT MODIFICATIONS: Nothing significant to report during this reporting period.

Impacts
The outcomes of the studies funded by the current project have produced theoretical and applied knowledge in genetic epidemiology, and in the last year alone have resulted in three publications. Since last year, the following outcomes have been developed. First, while the GENN method was originally designed for studies with binary traits, the methodology has been extended to quantitative (continuous) traits, and this extension was evaluated in extensive simulation studies and applied to real data in HIV pharmacogenetics. The results demonstrate that GENN has high power to detect a wide range of quantitative trait loci, and interactions between these loci. Second, because neural networks represent "black box" models, in that resulting models are not readily interpretable, the GENN methodology was extended to optionally evolve decision trees in place of neural networks. This extension was integrated into the software and evaluated in simulation studies. Third, simulation studies have been completed comparing the relative performance of the neural network and decision tree options. Generally, the results demonstrate that decision trees provide an improvement in interpretability, while neural networks have increased power to model interactions. The comparison is currently being extended to compare the approaches more completely.

Publications

  • Motsinger-Reif AA, Deohdar S, Winham SJ, Hardison NE. Grammatical Evolution Decision Trees for Detecting Gene-Gene Interactions. BMC BioDataMining. 2010 Nov 18;3(1):8.
  • Hoover KM*, Marceau R*, Harris TP, Reif DM, and Motsinger-Reif AA. Optimization of Grammatical Evolution Decision Trees for Detecting Epistasis. Proceeding of the 2011 Genetic and Evolutionary Computation Conference. 2011:35-37. *Indicates equal contribution.
  • Hardison NE, Motsinger-Reif AA. Power of Quantitative Trait Neural Networks for Detecting Gene-Gene Interactions. Proceeding of the 2011 Genetic and Evolutionary Computation Conference. 2011:299-306.


Progress 10/01/08 to 09/30/09

Outputs
OUTPUTS: Identifying susceptibility genes for complex disease is a major challenge for human geneticists. As genomic technologies rapidly advance, the resulting explosion of genetic information creates a challenge for analysis and interpretation. Many disease susceptibility genes exhibit effects that are dependent on interactions with other genes or environmental factors. Such interactions are difficult to detect using traditional statistical methods because, for the most part, these methods were not developed to detect purely interactive effects; their function has been to identify genes with main effects and then follow up with analysis of interactions between those genes that already exhibit a main effect. Because of this, new statistical and computational methods are needed. In addressing this need, we have developed a neural network (NN) methodology optimized by an evolutionary computation technique (grammatical evolution) to detect interactions. GENN performs both variable selection and statistical modeling without the computational burden of exhaustively searching all possible variable combinations. In the previous year, software for the GENN method was developed and programmed, and initial studies were performed. In the past year, the development of both practical and theoretical aspects of the GENN method has progressed. The GENN software has been extended, debugged, and disseminated to additional users for beta testing. Additionally, the method has been extended to additional modeling strategies and data types and extensively evaluated in both real and simulated data. These outputs have been disseminated in local, national, and international presentations to genetic epidemiologists, biostatisticians, biologists, and clinicians interested in the method development and application of GENN.
The following talks discussing the GENN methodology and experimental results have been presented during this last year:
  • "Grammatical Evolution Neural Networks for Predictive Modeling", Joint Symposium of 5th ICT and 2nd TIES, Seoul, Korea (Host: Weida Tong)
  • "Prediction Accuracy Estimates in Bioinformatics", Toxicogenomics and MAQC Conference, Shanghai, China (Host: Weida Tong)
  • "Predictive Modeling in ToxCast Using Grammatical Evolution Neural Networks", ToxCast Data Analysis Summit, U.S. Environmental Protection Agency, Research Triangle Park, NC (Host: Robert Kavlock)
  • "Comparisons of Data Mining Methods to Detect Gene-Gene Interactions", GlaxoSmithKline, Research Triangle Park (Host: Meg Ehm)
  • "Grammatical Evolution Neural Networks for Genetic Association Mapping", Genetics and Genomics Seminar Series, Department of Genetics, University of Alabama, Birmingham, AL (Host: Brett McKinney)
  • "Data Mining Methods for Detecting Gene-Gene Interactions", Department of Biostatistics, Duke University, Durham, NC (Host: Kouros Owzar)
  • "Grammatical Evolution Neural Networks for Genetic Epidemiology", 5th International Symposium on Bioinformatics Research and Applications, Ft. Lauderdale, FL (Host: Andrew Allen)
  • "Grammatical Evolution Neural Networks for Genetic Association Mapping", Biomathematics Seminar Series, NCSU Department of Statistics, Raleigh, NC (Host: Kevin Gross)
PARTICIPANTS: Individuals who worked on the project and received support are the Principal Investigator, Alison Motsinger-Reif, and a graduate research assistant, Sushamna Deohdar. One key participant in the current project is a Bioinformatics PhD student, Nicholas Hardison, who is focusing on the development of the methodology as his thesis project. Additionally, collaborators at Partner Organizations include David Reif PhD at the U.S. Environmental Protection Agency, and David Haas MD at Vanderbilt University. These individuals have contributed as informal collaborators on the research projects described.
TARGET AUDIENCES: The outcomes of the current project are intended to impact researchers and students of genetic epidemiology, statistical genetics, and computer science. This information is disseminated through efforts including conference presentations and publications. PROJECT MODIFICATIONS: While no major goals of the outlined project have been removed, additional research studies have developed in the last year. The comparison of neural networks to decision trees that has been performed evolved as a natural extension to the project in response to feedback from the field in the form of manuscript reviews and questions/comments at conferences.

Impacts
The outcomes of the studies funded by the current project have produced theoretical and applied knowledge in genetic epidemiology, and in the last year alone have resulted in three publications that are in submission or in preparation for submission in Spring of 2010. Since last year, the following outcomes have been developed. First, while the GENN method was originally designed for studies with binary traits, the methodology has been extended to quantitative (continuous) traits, and this extension was evaluated in extensive simulation studies and applied to real data in HIV pharmacogenetics. The results demonstrate that GENN has high power to detect a wide range of quantitative trait loci, and interactions between these loci. Second, GENN was used to evaluate real data in toxicology, detecting complex models that predict toxicity outcomes in the ToxCast (http://www.epa.gov/ncct/toxcast/) data. The results were presented at the U.S. Environmental Protection Agency's ToxCast Data Analysis Summit. Third, because neural networks represent "black box" models, in that resulting models are not readily interpretable, the GENN methodology was extended to optionally evolve decision trees in place of neural networks. This extension was integrated into the software and evaluated in simulation studies. A graduate research assistant undertook this project as part of his M.S. thesis work. Fourth, simulation studies have been completed comparing the relative performance of the neural network and decision tree options. Generally, the results demonstrate that decision trees provide an improvement in interpretability, while neural networks have increased power to model interactions. The comparison is currently being extended to compare the approaches more completely.
Finally, given the success of the initial application of the quantitative extension of GENN, simulation studies are currently underway to compare the performance of quantitative GENN (qGENN) to other methods in the field that were designed for similar data.

Publications

  • Deohdar S, Motsinger-Reif AA. (2009) Grammatical Evolution Decision Trees for Detecting Gene-Gene Interactions. Proceedings of the 8th European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics. In submission.
  • Hardison NE, Motsinger-Reif AA. (2010) Grammatical Evolution Neural Networks (GENN) to Detect Gene-Gene and Gene-Environment Interactions that Predict Quantitative Traits. In preparation.
  • Deohdar S, Hardison NE, Motsinger-Reif AA. (2010) A Comparison of Evolutionary Optimized Neural Networks and Decision Trees. In preparation.


Progress 10/01/07 to 09/30/08

Outputs
OUTPUTS: Identifying susceptibility genes for complex disease is a major challenge for human geneticists. As genomic technologies rapidly advance, the resulting explosion of genetic information creates a challenge for analysis and interpretation. Many disease susceptibility genes exhibit effects that are dependent on interactions with other genes or environmental factors. Such interactions are difficult to detect using traditional statistical methods because, for the most part, these methods were not developed to detect purely interactive effects; their function has been to identify genes with main effects and then follow up with analysis of interactions between those genes that already exhibit a main effect. Because of this, new statistical and computational methods are needed that have better power to detect interactions in relatively small sample sizes. In addressing this need, we have developed a neural network (NN) methodology optimized by an evolutionary computation technique (grammatical evolution) to detect gene-gene interactions in the presence or absence of marginal main effects. GENN performs both variable selection and statistical modeling without the computational burden of exhaustively searching all possible variable combinations. GENN uses grammatical evolution to build an NN that models complex, nonlinear data. In the past year, the development of both practical and theoretical aspects of the GENN method has progressed. The GENN method has been reprogrammed in C++ and Perl to optimize speed and cross-platform compatibility. The resulting GENN software package is currently in the beta-testing stage and has been disseminated to the first 8 users for beta testing.
Extensive simulation studies and real data application studies have been completed with GENN, and these outputs have been disseminated in local, national, and international presentations to genetic epidemiologists, biostatisticians, and clinicians interested in the method development and application of GENN. The following talks have been presented this year:
1) "A Comparison of Analytical Methods for Genetic Association Studies"; Bioinformatics Seminar Series; Raleigh, NC (Host: Dahlia Nielson); 2008.
2) "Neural Networks for Detecting Gene-Gene Interactions"; National Center for Computational Toxicology; U.S. Environmental Protection Agency; Research Triangle Park, NC (Host: Richard Judson); 2008.
3) "Genetic programming optimized neural network (GPNN) as a method for improved the identification of gene-environment interactions"; Annual Congress of the European Respiratory Society; Berlin, Germany (Host: Patricia Haslam); 2008.
The following abstracts were presented at national conferences this year:
1) Hardison NE, Motsinger-Reif AA. Grammatical Evolution Neural Networks to Detect Gene-Gene and Gene-Environment Interactions in Quantitative Traits. American Society of Human Genetics International Meeting. Philadelphia, PA. 2008.
2) Hardison NE, Fanelli TJ, Dudek SM, Ritchie MD, Reif DM, Motsinger-Reif AA. Balanced accuracy as a fitness function in Grammatical Evolution Neural Networks is robust to imbalanced data. Genetic and Evolutionary Algorithm Conference. Atlanta, GA. 2008.
PARTICIPANTS: Project support was limited to the Principal Investigator, Alison Motsinger-Reif. Additionally, collaborators at Partner Organizations include Marylyn Ritchie PhD, Theresa Fanelli, Anna Davis, and Scott Dudek at Vanderbilt University, David Reif PhD at the U.S. Environmental Protection Agency, and Lance Hahn PhD at the University of Kentucky. These individuals have contributed as informal collaborators on the research projects described.
A graduate student in Bioinformatics has also been participating in the project, with his dissertation focused on this project. TARGET AUDIENCES: Not relevant to this project. PROJECT MODIFICATIONS: While no major goals of the outlined project have been removed, additional research studies have developed in the last year. The comparisons to other methods in the field that have been performed and published as a part of the current project evolved as natural extensions of the research performed as proposed in the project, as well as in response to feedback from the field through reviews and questions at conferences. Additionally, again as a natural extension, a thorough review/perspective was completed discussing the impact of the method under development on the field as a whole.

Impacts
The outcomes of the studies funded by the current project have produced theoretical and applied knowledge in genetic epidemiology, resulting in two abstract presentations as well as five publications communicating the following results. First, the performance of the GENN method was evaluated in a broad range of simulation studies examining the impact of noise common to disease studies in genetic epidemiology. These types of noise included missing data, genetic heterogeneity, phenocopies, and genotyping error. The results of this study indicate that GENN is robust to reasonable levels of noise in data. Second, the GENN method was originally developed for studies of binary traits with balanced data; that is, it previously required that the number of individuals in each class be equal. An extension of the method, utilizing a new fitness function (balanced accuracy), has been implemented and evaluated on both real and simulated data so that the method is robust to class imbalance. Third, a comparison of the empirical performance of GENN to other methods in the field has been completed, using simulated data with a broad range of disease models. GENN was shown to have comparable or favorable performance in a wide range of models. Fourth, GENN was compared to other machine learning applications that evolve NNs to detect interactions in genetic epidemiology. GENN was shown to have higher power to detect interactions, with more reasonable computation time than previous applications. Fifth, while the GENN method was originally designed for studies with binary traits, the methodology has recently been extended to quantitative (continuous) traits. This extension will allow the application of the method to new phenotypes in human genetics. Large-scale simulation studies, as well as collaborative applications to real data, are currently ongoing. Additionally, aspects of the development of the GENN method have become the topic of research for a Bioinformatics PhD student.
Finally, a review of the current state of the GENN method, as well as guidance for its application, has been presented in a book chapter.
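
The balanced accuracy fitness function mentioned above can be illustrated with a short sketch. Balanced accuracy is conventionally defined as the mean of sensitivity and specificity, so a model cannot score well simply by predicting the majority class. The function name and the toy data below are assumptions made for illustration; they are not taken from the GENN software.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels.

    Labels are assumed to be 1 (case) and 0 (control).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2.0


# Toy imbalanced data set: 2 cases and 8 controls.
y_true = [1, 1] + [0] * 8
# A trivial classifier that labels everyone a control.
y_pred = [0] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                      # 0.8
print(balanced_accuracy(y_true, y_pred))   # 0.5
```

In this toy example the trivial classifier looks strong under plain accuracy but scores no better than chance under balanced accuracy, which is the property that makes the measure suitable as a fitness function for imbalanced case-control data.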

Publications

  • Motsinger-Reif AA, Dudek SM, Hahn LW, and Ritchie MD. 2008. Comparison of approaches for machine-learning optimization of neural networks for detecting gene-gene interactions in genetic epidemiology. Genetic Epidemiology Feb 8 [Epub ahead of print]
  • Motsinger AA, Fanelli TJ, Ritchie MD. 2008. Power of Grammatical Evolution Neural Networks to detect gene-gene interactions in the presence of error common to genetic epidemiological studies. BMC Research Notes Aug 13;1:65.
  • Motsinger-Reif AA, Reif DM, Fanelli TJ, Ritchie MD. 2008. Comparison of computational approaches for genetic association studies. Genetic Epidemiology Jun 16. [Epub ahead of print]
  • Motsinger-Reif AA, Ritchie MD. 2008. Neural networks for genetic epidemiology: past, present, and future. BioData Mining. 1:3 (17Jul2008)
  • Hardison NE, Fanelli TJ, Dudek SM, Ritchie MD, Reif DM, Motsinger-Reif AA. 2008. Balanced accuracy as a fitness function in Grammatical Evolution Neural Networks is robust to imbalanced data. Proceedings of the 10th annual conference on Genetic and evolutionary computation. p353-354.
  • Hardison NE, Motsinger AA. 2008. Grammatical Evolution Optimized Neural Networks for Disease Variant Mapping. In Columbus, F (ed.), Chromosome Mapping Research. Nova Publishers, New York, (In Press).