Innovative bioinformatics resources for biomedical research

Sponsoring Institution

National Institute of Food and Agriculture

Project Status

COMPLETE

Funding Source

HATCH

Reporting Frequency

Annual

Accession No.

0213305

Grant No.

(N/A)

Cumulative Award Amt.

(N/A)

Proposal No.

(N/A)

Multistate No.

(N/A)

Project Start Date

Jan 1, 2008

Project End Date

Jun 30, 2013

Grant Year

(N/A)

Program Code

[(N/A)]- (N/A)

Recipient Organization
CLEMSON UNIVERSITY
(N/A)
CLEMSON,SC 29634

Performing Department
Genetics and Biochemistry

Non Technical Summary
With the knowledge of human genome sequence, biomedical research has become increasingly data-intensive. Although genomic medicine has the potential to revolutionize health care with our growing knowledge of the molecular basis of disease, the analysis of large and heterogeneous datasets poses daunting informatic challenges for biomedical researchers. To address the so-called "data rich, information poor" problem, we propose to develop a set of computational tools for biomedical data integration and mining. Machine learning methods will be developed for sequence-based prediction of DNA/RNA-binding residues, lipid-interacting residues and protein stability change upon mutations. These predictive methods will be directly used to characterize a collection of candidate genes in which mutations may cause mental retardation, autism and other genetic disorders. With the accumulation of microarray data from brain research, we will compile and analyze a compendium of gene expression profiles to identify co-expressed gene modules and regulatory networks. The analytical results may provide useful information for understanding the molecular mechanisms underlying genetic brain disorders. A database system will be developed to support the large-scale genomic data analysis and to make our results available to the biomedical research community. Therefore, our timely work will facilitate biomedical research through addressing the informatic challenges in the post-genomic era, and may subsequently improve the quality of health care in the United States. The proposed work will establish a necessary and important computational infrastructure for biomedical data mining at the Greenwood Genetic Center (GGC). 
GGC conducts research on birth defects and genetic disorders including mental retardation and autism, and works closely with the South Carolina Department of Disabilities and Special Needs to provide diagnostic services, treatment and prevention programs to reduce the risk and severity of disabling conditions. Mental retardation is the most frequent developmental disability, affecting 1-3% of people worldwide. The estimated annual cost of mental retardation in the United States is $51 billion according to the US Centers for Disease Control and Prevention (2004). In addition, we will develop web servers to make our computational methods available to the broader scientific community. The predictive methods developed in this project can be used not only to characterize disease-causing mutations but also to identify protein functional sites and model macromolecular interactions. Our BindN web server has already been used by researchers worldwide, and we anticipate that the new web-based tools will also be widely used for related scientific research.

Animal Health Component

(N/A)

Research Effort Categories

Basic

(N/A)

Applied

(N/A)

Developmental

(N/A)

Classification

Knowledge Area (KA)	Subject of Investigation (SOI)	Field of Science (FOS)	Percent
901	7299	1080	75%
304	3910	1040	25%

Knowledge Area
901 - Program and Project Design, and Statistics; 304 - Animal Genome;

Subject Of Investigation
3910 - Cross-commodity research--multiple animal species; 7299 - Research equipment and methods, general/other;

Field Of Science
1040 - Molecular biology; 1080 - Genetics;

Keywords

sequence analysis

predictive methods

machine learning

dna/rna-binding residues

lipid-interacting residues

protein stability change

disease-causing mutations

web server development

microarray data analysis

coexpressed gene modules

database development

bioinformatics

biomedical informatics

Goals / Objectives
This project aims to develop innovative computational methods for biomedical data integration and mining. With the advent of high-throughput technologies, biomedical research now generates large amounts of data, ranging from genomic sequences to microarray gene expression profiles to protein structures. It is no longer efficient to analyze such high-throughput data with paper and pencil, or even a spreadsheet. Instead, sophisticated database systems and data mining tools are required for knowledge discovery from the massive and heterogeneous datasets. At the Greenwood Genetic Center (GGC), clinical and molecular data from patients with mental retardation and other genetic disorders are accumulating. We have identified the following specific objectives for this project:
Objective 1: Development of predictive methods for understanding protein function. We will apply machine learning to sequence-based prediction of DNA-binding residues, RNA-binding residues, lipid-interacting residues, and protein stability changes upon mutations. The predictive methods developed in this project will be used to analyze the GGC collection of candidate genes in which mutations may cause mental retardation and other genetic disorders. We anticipate that the computational results will provide useful information for experimental characterization of these disease-causing mutations. In addition, 3-5 web-based tools will be developed for public access to the predictive methods developed by this project.
Objective 2: Analysis of co-expressed gene modules related to genetic brain disorders. We will compile a compendium of microarray gene expression profiles from published studies as well as experiments at GGC. The microarray profiles from different sources will be integrated after manual curation for quality control. The integrated dataset will then be used for statistical inference of gene modules, each of which includes a group of genes that show similar expression patterns across different clinical conditions. Since these coherent modules often correspond to protein complexes or biological processes, the analytical results may be more interpretable than lists of differentially expressed genes. The analysis may also provide a global view of the gene regulatory networks underlying complex brain disorders.
Objective 3: Development of an integrated database for human genetics research. Relational data models will be developed for clinical and molecular data integration. The web-accessible database will be used to support biomedical data mining and to make the computational results available to the research community. We believe that the database system will provide a necessary and important infrastructure for bioinformatics and biomedical research at GGC.

Project Methods
First, predictive methods will be developed for understanding protein function and the effect of disease-causing mutations. In our previous studies, artificial neural networks and support vector machines were trained with known DNA or RNA-binding residues extracted from available structures. The resulting models have been used to construct the BindN web server, which takes an amino acid sequence as input and predicts potential DNA or RNA-binding residues. We propose to further develop BindN by improving its prediction accuracy. Our preliminary results suggest that the performance of BindN can be enhanced by incorporating evolutionary information for input encoding and by using the random forest learning algorithm. The new classifiers will be used to upgrade the BindN web server (http://bioinfo.ggc.org/bindn/). We will also develop new machine learning methods for sequence-based prediction of lipid-interacting residues and protein stability change upon point mutations. Loss of protein stability has been shown to be the most common mechanism by which a missense mutation results in disease. We will examine various sequence-derived features and select the relevant ones for accurate prediction of protein stability change upon mutations. The predictive methods will then be used to analyze the GGC collection of candidate genes in which mutations may cause mental retardation and other genetic disorders. The computational results may provide useful information for experimental characterization of these disease-causing proteins. Second, we are interested in microarray data analysis for understanding gene regulation in brain development and genetic disorders. We have been collaborating with GGC scientists on statistical analysis of differentially expressed genes in mental retardation patients. With the accumulation of microarray data both at GGC and in the public databases, we propose to perform module-level analysis of microarray expression data from brain research. 
Since genes are grouped into coherent modules with similar expression patterns across a variety of clinical conditions, the analysis may provide results that are more interpretable than lists of differentially expressed genes. A variety of human microarray datasets will be collected from GGC projects, published papers or websites, and public databases. We anticipate that analysis of the integrated microarray data may provide a global view of the regulatory modules underlying genetic brain disorders. Third, a database system will be developed in this project to integrate data from different sources, to support large-scale data analysis, and to make our analytical results available to biomedical scientists. We will apply data warehousing concepts to biomedical data integration. A data staging area will be constructed to transform, cleanse and integrate data from different sources. It is important to note that the system will not be developed as an operational database for clinical data management. Rather, the database will be used to support biomedical data mining. We will thus devote more effort to data integration and analysis than database application development.
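The sequence-based prediction described above can be illustrated with a minimal Python sketch of sliding-window input encoding. It is not the BindN implementation: the one-hot encoding, the window size, and the function name encode_windows are assumptions made for illustration, whereas the project describes biochemical features and evolutionary information as classifier inputs. Each residue is represented by the residues in a window centered on it, and the resulting vectors could then be passed to a classifier such as a support vector machine or a random forest.

```python
# Illustrative sketch only: one-hot sliding-window encoding of a protein sequence.
# The encoding scheme and window size are assumptions, not the project's features.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode_windows(sequence, window=5):
    """Return one feature vector per residue.

    Each vector is the concatenated one-hot encoding of the residues in a
    window centered on the target residue. Positions beyond the sequence ends
    (and any non-standard residues) are encoded as all zeros.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a central residue")
    half = window // 2
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    padded = "-" * half + sequence.upper() + "-" * half
    vectors = []
    for center in range(half, half + len(sequence)):
        vector = []
        for aa in padded[center - half:center + half + 1]:
            one_hot = [0] * len(AMINO_ACIDS)
            if aa in index:
                one_hot[index[aa]] = 1
            vector.extend(one_hot)
        vectors.append(vector)
    return vectors


if __name__ == "__main__":
    features = encode_windows("MKRLS", window=3)
    print(len(features), "residues,", len(features[0]), "features each")
```

A wider window gives the classifier more sequence context per residue at the cost of a longer feature vector (20 values per window position in this encoding).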

Progress 01/01/08 to 06/30/13

Outputs
Target Audience: The target audience of the project includes human geneticists, biomedical and biological researchers. We have developed and maintained 5 web-based software tools which are freely accessible to the research community for molecular genetic studies. In addition, two graduate students and one undergraduate student have been trained through participating in this project.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? Five graduate students and one undergraduate student have been trained. The PI and his students have given eight oral presentations at international conferences to report the findings. The computational methods and data resources developed in this project provide a sound foundation for our future bioinformatics research. The predictive performance of our machine learning models may be further improved, and the integrated microarray database can be used for prioritizing disease candidate genes. The improved models and candidate prioritization system are expected to facilitate the identification and characterization of pathogenic mutations from next-generation sequencing screens. Currently, these are the major research directions in the PI’s group.
How have the results been disseminated to communities of interest? This project has been conducted at the Greenwood Genetic Center, which focuses on birth defects and genetic disorders, especially intellectual disability and autism. Greenwood Genetic Center works closely with the South Carolina Department of Disabilities and Special Needs to provide diagnostic services, treatment and prevention programs to reduce the risk and severity of disabling conditions. The bioinformatic tools and integrated microarray database developed in this project have been shown to provide useful information for identifying pathogenic mutations and understanding the molecular mechanisms causing intellectual disability. The molecular knowledge can be translated into novel diagnostic, preventive and therapeutic approaches, and thus benefit the affected individuals and their families in South Carolina and elsewhere. We have also developed and maintained 5 web servers, including BindN (http://bioinfo.ggc.org/bindn/), BindN-RF (http://bioinfo.ggc.org/bindn-rf/), BindN+ (http://bioinfo.ggc.org/bindn+/), MuStab (http://bioinfo.ggc.org/mustab/), and seeSUMO (http://bioinfo.ggc.org/seesumo/). Our web servers are publicly available, and have been frequently used for human genetic studies and biomedical research. Since 2007, a total of 135,521 queries have been made to our web servers by 5,656 users worldwide. Our papers reporting the web servers have been cited by 175 journal publications. Thus, the findings from this project have been effectively disseminated to the research communities.
What do you plan to do during the next reporting period to accomplish the goals? Nothing Reported

Impacts
What was accomplished under these goals?
We have developed new machine learning models for sequence-based prediction of DNA/RNA-binding residues, lipid-interacting residues, and protein stability changes upon amino acid substitutions. A structure-based method has also been developed to assess the effects of amino acid substitutions on protein stability and function. Our machine learning strategy has been applied to the prediction of protein sumoylation sites and siRNA potency. In collaboration with human geneticists at the Greenwood Genetic Center, our predictive methods have been used to analyze candidate genes in which mutations may cause intellectual disability and other genetic disorders. We have compiled 2,968 microarray gene expression profiles of various human tissues (including 616 brain samples) from published studies. A genomic data integration strategy has been developed to combine the microarray data from different sources into a single dataset. The integrated microarray dataset has been validated, and then used to examine disease gene expression patterns, perform co-expression network analysis, and identify human tissue-specific genes genome-wide. Disease gene co-expression network analysis has revealed the molecular pathways underlying intellectual disability, and the tissue-specific genes identified in this project can be used as potential candidate genes for the screen of disease-causing mutations. We have developed two relational database systems and five web-based software tools. The two database systems (RESDB and DSNDB) have been used to store clinical and genomic data at the Greenwood Genetic Center. The five web-based tools, which have been developed using our machine learning models, are freely accessible to biomedical and biological researchers. These computational systems provide important resources for biomedical data mining and human genetic studies.

Publications

Type: Journal Articles Status: Published Year Published: 2013 Citation: Teng, S., Yang, J.Y. and Wang, L. (2013) Genome-wide prediction and analysis of human tissue-selective genes using microarray expression data. BMC Medical Genomics, 6(Suppl 1):S10.

Progress 01/01/12 to 12/31/12

Outputs
OUTPUTS: We have continued to develop and test predictive software tools for investigating protein function and stability, and to mine a large integrated microarray database for understanding human disease gene regulation and pathways. (1) We have developed and maintained several web servers, including BindN+ (http://bioinfo.ggc.org/bindn+/) for accurate prediction of DNA/RNA-binding residues in protein sequences, MuStab (http://bioinfo.ggc.org/mustab/) for predicting protein stability changes upon single amino acid substitutions, and seeSUMO (http://bioinfo.ggc.org/seesumo/) for sequence-based prediction of protein sumoylation sites. We have also developed a structure-based method to assess the effects of amino acid substitutions on protein stability and function. In collaboration with human geneticists at the Greenwood Genetic Center, we have used both sequence and structure-based methods to investigate the effects of pathogenic mutations. Our methods have been shown to provide useful information for understanding the molecular causes of human genetic disorders. For instance, when our methods were used to analyze the A693V mutation in the human ZBTB20 gene, the results revealed the details of the molecular mechanisms causing intellectual disability. The A693V mutation was found to decrease protein stability and affect protein-DNA interaction. (2) We compiled 2,968 publicly available microarray gene expression profiles of various normal tissue samples. The integrated microarray database has been used to perform gene co-expression network analysis with the WGCNA method. Since co-expressed gene modules often correspond to specific protein complexes or biological processes, the analytical results can be useful for relating disease genes to molecular pathways. We are interested in the co-expression modules enriched with intellectual disability (ID) genes. 
For instance, module 3 (M3) is selectively expressed in brain, and significantly enriched with the Gene Ontology term "synaptic transmission". ID genes are overrepresented in M3, and the well-known ID genes NRXN1 and STXBP1 are hub genes and highly connected with the other ID genes in the M3 co-expression network. NRXN1 and NLGN4X are connected in the network, and the two proteins physically interact for synapse formation. NRXN1 and CNTNAP2 are also connected, and mutations in these two genes can cause Pitt-Hopkins-like ID. As a NRXN1-connected gene, PRKCG has recently been implicated in ID. Our results suggest that co-expression network analysis with the integrated microarray data can not only reveal the molecular pathways underlying various ID syndromes, but also identify potential candidate genes for the screen of ID-causing mutations. The integrated microarray database has been frequently used at the Greenwood Genetic Center. We have been using the database to examine the expression patterns of disease candidate genes, search for co-expressed genes, and identify tissue-specific genes. The results have been shown to provide useful information for human genetic studies. PARTICIPANTS: Nothing significant to report during this reporting period. TARGET AUDIENCES: Nothing significant to report during this reporting period. PROJECT MODIFICATIONS: Nothing significant to report during this reporting period.
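The idea behind the co-expression network analysis reported above can be sketched in a few lines of Python. WGCNA itself is an R package, and this sketch does not reproduce its API or the project's pipeline. It only illustrates the unsigned soft-threshold adjacency, in which the absolute Pearson correlation between two genes is raised to a power beta, and the connectivity of a gene, defined here as the sum of its adjacencies. The gene labels, expression values, and the choice beta=6 are placeholders chosen for illustration.

```python
# Illustrative sketch only: unsigned soft-threshold adjacency and connectivity.
# Gene labels and expression values below are made up for demonstration.
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("profiles must not be constant")
    return sxy / math.sqrt(sxx * syy)


def soft_threshold_adjacency(profiles, beta=6):
    """Return adjacency[g][h] = |cor(g, h)| ** beta for every gene pair."""
    genes = sorted(profiles)
    adjacency = {g: {} for g in genes}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            a = abs(pearson(profiles[g], profiles[h])) ** beta
            adjacency[g][h] = a
            adjacency[h][g] = a
    return adjacency


def connectivity(adjacency):
    """Sum of adjacencies per gene; highly connected genes are hub candidates."""
    return {g: sum(neighbors.values()) for g, neighbors in adjacency.items()}


if __name__ == "__main__":
    demo = {
        "geneA": [1.0, 2.0, 3.0, 4.0],
        "geneB": [2.0, 4.0, 6.0, 8.0],
        "geneC": [4.0, 1.0, 3.0, 2.0],
    }
    for gene, k in sorted(connectivity(soft_threshold_adjacency(demo)).items()):
        print(gene, round(k, 4))
```

Raising the correlation to a power suppresses weak correlations while preserving strong ones, which is why genes that track many others closely stand out by connectivity.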

Impacts
Some of the results have been published in two peer-reviewed journal articles. (1) We have developed and maintained several predictive software tools for understanding protein function and stability. Our web servers, including BindN, BindN-RF, BindN+, MuStab and seeSUMO, are publicly available, and have been frequently used for human genetic studies and biomedical research. Since 2007, a total of 135,521 queries have been made to our web servers by 5,656 users worldwide. Our papers reporting the web servers have been cited by 175 journal publications. In collaboration with human geneticists at the Greenwood Genetic Center, we have been using the computational methods developed in this project to identify pathogenic mutations and understand the molecular mechanisms causing intellectual disability and other genetic disorders. Since loss of protein stability is a common mechanism by which missense mutations cause human diseases, MuStab can be used to distinguish disease-causing mutations from normal sequence variations. The predictive software tool is useful because disease gene identification remains difficult, even with the advent of next-generation sequencing technologies. (2) Our integrated microarray database has been used for disease gene co-expression network analysis, and the findings have provided new insight into the genes and pathways disrupted in intellectual disability. We have identified several coexpressed modules in which intellectual disability genes are significantly enriched, and functional annotation of these modules has revealed the common pathways perturbed in various syndromes. Our integrated microarray database has also been used to examine the expression patterns of disease candidate genes, and identify tissue-specific genes. The findings have been shown to provide useful information for identifying and prioritizing candidate genes of intellectual disability. 
Although a number of disease genes have been identified, the genetic causes of most intellectual disability cases are still unknown. The identification and characterization of new disease genes can provide further insight into the molecular etiology of various intellectual disability syndromes. This project has been conducted at the Greenwood Genetic Center, which focuses on birth defects and genetic disorders, especially intellectual disability and autism. Greenwood Genetic Center works closely with the South Carolina Department of Disabilities and Special Needs to provide diagnostic services, treatment and prevention programs to reduce the risk and severity of disabling conditions. The molecular knowledge of intellectual disability can be translated into novel diagnostic, preventive and therapeutic approaches, and thus benefit the affected individuals and their family members in South Carolina and elsewhere.

Publications

Teng, S., Luo, H. and Wang, L. (2012) Predicting protein sumoylation sites from sequence features. Amino Acids, 43(1):447-455.
Teng, S., Yang, J.Y. and Wang, L. (2012) Genome-wide prediction and analysis of human tissue-selective genes using microarray expression data. BMC Medical Genomics, in press.

Progress 01/01/11 to 12/31/11

Outputs
OUTPUTS: In the fourth year of this project, we have focused on developing predictive software tools for understanding protein function and performing gene coexpression network analysis of human disease genes. (1) We have developed and maintained several web servers, including BindN+ (available at http://bioinfo.ggc.org/bindn+/) for accurate prediction of DNA/RNA-binding residues in protein sequences, MuStab (available at http://bioinfo.ggc.org/mustab/) for predicting protein stability changes upon single amino acid substitutions, and seeSUMO (available at http://bioinfo.ggc.org/seesumo/) for sequence-based prediction of protein sumoylation sites. The new web server, seeSUMO, was developed using random forest and support vector machine models that were constructed with data collected from the literature and domain-specific knowledge in terms of relevant biological features. Our models were found to achieve more accurate prediction of protein sumoylation sites than the other existing classifiers. We have also developed a structure-based method for predicting the effects of amino acid substitutions on protein function and stability. Protein folding and binding energy differences between wild-type and mutant structures were computed to quantitatively assess the effects of amino acid substitutions on protein stability and protein-protein interaction, respectively. In collaboration with human geneticists at the Greenwood Genetic Center, we have used both sequence and structure-based methods to investigate some pathogenic mutations in intellectual disability genes. (2) We have performed gene coexpression network analysis to investigate human disease genes using a large database of microarray expression profiles. We compiled 2,968 publicly available microarray expression profiles of various normal tissue samples, and verified the usefulness of the integrated data for examining human gene expression patterns. 
The integrated microarray data has been used to obtain the coexpressed modules of human disease genes with the Weighted Gene Coexpression Network Analysis (WGCNA) method. We have identified several coexpressed modules in which intellectual disability genes are enriched. Annotations of these modules have provided insight into the functions of intellectual disability genes. In particular, one module shows preferential expression in brain cortex and hippocampus, and is significantly enriched for Gene Ontology terms such as "neuron projection" and "synaptic transmission". This module consists of 247 human disease genes, 54 of which are known to be involved in intellectual disability. The coexpression network can thus be used to investigate the common pathways perturbed in various intellectual disability syndromes. Additional candidate genes of intellectual disability may also be identified in the coexpressed modules based on network properties. The results from gene coexpression network analysis are made available to human genetic studies at the Greenwood Genetic Center. In addition, the integrated microarray database provides a useful resource for the biomedical research community. PARTICIPANTS: Nothing significant to report during this reporting period. TARGET AUDIENCES: Nothing significant to report during this reporting period. PROJECT MODIFICATIONS: Nothing significant to report during this reporting period.
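Whether a module is enriched for intellectual disability genes, as reported above for the module with 54 such genes among 247 disease genes, can be assessed with a one-sided hypergeometric test. The Python sketch below shows one common way to do this and is not necessarily the test used in the project. The module counts (54 of 247) come from the report, but the background sizes in the example (2,000 disease genes in total, 200 of them intellectual disability genes) are hypothetical placeholders, since the report does not state them.

```python
# Illustrative sketch only: one-sided hypergeometric enrichment test.
# Background sizes used in the example are hypothetical placeholders.
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """Return P(X >= k) for a hypergeometric draw.

    N: total genes in the background
    K: background genes carrying the annotation (e.g. known ID genes)
    n: genes in the module
    k: annotated genes observed in the module
    """
    total = comb(N, n)
    upper = min(n, K)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / total


if __name__ == "__main__":
    # 54 of 247 module genes are ID genes (from the report);
    # N=2000 and K=200 are assumed values for demonstration only.
    p_value = hypergeom_upper_tail(k=54, n=247, K=200, N=2000)
    print("enrichment p-value under assumed background:", p_value)
```

The p-value depends strongly on the background chosen, so the real background gene counts would need to be substituted before drawing any conclusion.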

Impacts
The proposed research activities for the fourth year of the project have been successfully completed. Some of the results have recently been published in two peer-reviewed journal articles. (1) We have developed and maintained several predictive software tools for understanding protein function and stability. Our web servers, including BindN+, MuStab and seeSUMO, are publicly available, and have the potential to be widely used for human genetic studies as well as other areas of biomedical research. The predictions made by BindN+ can provide useful information for protein-DNA/RNA docking and experimental studies such as site-directed mutagenesis for understanding protein-nucleic acid interactions. MuStab can be used to help identify disease-causing mutations through predicting protein stability changes upon single amino acid substitutions. Since protein sumoylation as an important post-translational modification is implicated in various diseases, the newly developed seeSUMO web server can be used to facilitate experimental design and data interpretation for biomedical researchers. Furthermore, the structure-based method has been developed for examining the effects of pathogenic mutations on protein function and stability. We have demonstrated that the structure-based analysis can provide useful information for understanding the molecular mechanisms of human genetic disorders. In collaboration with human geneticists at the Greenwood Genetic Center, we have been using the computational methods developed in this project to identify pathogenic mutations and further understand the molecular etiology of intellectual disability. (2) Disease gene coexpression network analysis with the integrated microarray data has provided new insight into the genes and pathways disrupted in intellectual disability. 
We have identified several coexpressed modules in which intellectual disability genes are significantly enriched, and functional annotation of these modules has revealed the common pathways perturbed in various syndromes. The expression patterns of these modules in the brain and other tissues have also been examined using the integrated microarray data. The findings can thus provide useful information for understanding intellectual disability gene function and regulation. Moreover, the coexpression network may be used to identify and prioritize candidate genes of intellectual disability. Although a number of disease genes have been identified, the genetic causes of most intellectual disability cases are still unknown. Since coexpressed modules often correspond to protein complexes or biological processes, additional candidate genes may be identified in the modules enriched with known intellectual disability genes. These candidate genes can be given high priority in genetic screening of patients. Altogether, the molecular knowledge obtained in this project may be translated into novel diagnostic, preventive and therapeutic approaches, and thus benefit individuals with intellectual disability and their family members.

Publications

Teng, S., Srivastava, A.K., Schwartz, C.E., Alexov, E. and Wang, L. (2011) Structural assessment of the effects of amino acid substitutions on protein stability and protein-protein interaction. International Journal of Computational Biology and Drug Design, 3(4):334-349.
Teng, S., Luo, H. and Wang, L. (2011) Predicting protein sumoylation sites from sequence features. Amino Acids, in press (Epub ahead of print: Oct 7, 2011)

Progress 01/01/10 to 12/31/10

Outputs
OUTPUTS: The proposed research in the third year has focused on machine learning model construction, predictive tool development, and microarray gene expression data analysis. (1) New models have been constructed for predicting DNA or RNA-binding residues and protein stability changes upon single amino acid substitutions. Biochemical features and evolutionary information descriptors have been used to train support vector machines for accurate prediction of DNA/RNA-binding residues. A new web server called BindN+ (http://bioinfo.ggc.org/bindn+/) has been developed to make the model freely accessible to the biological research community. For protein stability prediction, we have fine-tuned the MuStab model with relevant biological features. A wrapper approach for feature selection has been used to identify the optimal subset of six biological features for model construction, and the upgraded MuStab web server (freely available at http://bioinfo.ggc.org/mustab/) shows improved performance for predicting protein stability changes upon single amino acid substitutions. In addition, the machine learning strategy developed in this project has been used to construct a random forest model for protein sumoylation site prediction. A new web server is currently being developed and tested. Thus, our research in the third year has established a general machine learning strategy for modeling biological data, and produced two web-based software tools (BindN+ and MuStab) for the biological research community. (2) We have been analyzing a large dataset of human gene expression profiles compiled from 131 different microarray studies. A data integration method has been developed to combine the 2,968 microarray expression profiles of various normal tissue samples (including 616 brain samples) into a single dataset. 
To validate the integrated microarray dataset, we have examined the expression patterns of known tissue-specific genes, including 286 brain-specific genes and 63 liver-specific genes. The dataset has also been used to identify candidate genes that are expressed predominantly in the brain, liver or testis. The results suggest that publicly available microarray data from different sources can be integrated for examining gene expression patterns in various tissues. After verifying the usefulness of the integrated dataset, we have used it to examine the expression patterns of X-linked intellectual disability genes, and to perform gene coexpression network analysis. We are particularly interested in the coexpressed gene modules that contain known intellectual disability genes. These modules have been identified, and are currently being analyzed functionally. The identification and analysis of co-expressed modules represent a new approach for gene functional annotation. The lists of potential tissue-specific genes have recently been published, and the results from gene coexpression network analysis are made available to human genetic studies at the Greenwood Genetic Center. In addition, the integrated microarray dataset provides a useful resource for biomedical research. PARTICIPANTS: Nothing significant to report during this reporting period. TARGET AUDIENCES: Not relevant to this project. PROJECT MODIFICATIONS: Nothing significant to report during this reporting period.
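One simple way to flag genes expressed predominantly in a single tissue, as described above for brain, liver and testis, is to compare the mean expression in the top tissue with the next highest tissue mean. The Python sketch below illustrates that idea only. It is not the method used in the project, and the function name tissue_selectivity, the tissue labels, and the intensity values are hypothetical. It assumes positive expression intensities and at least two tissues.

```python
# Illustrative sketch only: ratio-based tissue selectivity for one gene.
# Not the project's method; all values below are hypothetical.
from statistics import mean


def tissue_selectivity(expression_by_tissue):
    """Return (top_tissue, ratio) for one gene.

    expression_by_tissue maps a tissue name to a list of positive expression
    intensities from samples of that tissue. The ratio is the mean expression
    in the top tissue divided by the highest mean among the other tissues.
    """
    if len(expression_by_tissue) < 2:
        raise ValueError("at least two tissues are required")
    means = {tissue: mean(values) for tissue, values in expression_by_tissue.items()}
    ranked = sorted(means, key=means.get, reverse=True)
    top, runner_up = ranked[0], ranked[1]
    return top, means[top] / means[runner_up]


if __name__ == "__main__":
    gene_profile = {
        "brain": [8.0, 10.0],
        "liver": [3.0, 3.0],
        "testis": [2.0, 4.0],
    }
    print(tissue_selectivity(gene_profile))
```

A gene whose ratio exceeds a chosen cutoff could be treated as a tissue-selective candidate, and the cutoff itself would need to be calibrated against known tissue-specific genes.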

Impacts
All the research activities proposed for the third year of the project have been successfully completed. The findings have been published in five peer-reviewed papers, including three journal articles and two conference papers. The results obtained in the third year are important for the final outcomes of the proposed project. (1) We have developed two machine learning models for protein functional analyses. The BindN+ model can be used for accurate prediction of DNA/RNA-binding residues from amino acid sequence data. The predictions can provide useful information for protein-DNA/RNA docking and experimental studies such as site-directed mutagenesis for understanding protein-nucleic acid interactions. The MuStab model can be used to predict protein stability changes upon single amino acid substitutions. Loss of protein stability is a common mechanism by which single amino acid mutations cause human diseases. Since each person carries a large number of sequence variations including non-synonymous single nucleotide polymorphisms, it is often difficult to distinguish the disease-causing mutation from the neutral sequence variations. By predicting protein stability changes, the MuStab model can help identify the causative mutation among the sequence variations. We have been using the machine learning models to analyze intellectual disability genes for identifying causative mutations. Furthermore, since the predictive models developed in this project are made publicly available through the web servers, they have the potential to be widely used in human genetic studies as well as other areas of biomedical research. (2) The integrated microarray dataset has been shown to be useful for examining human gene expression patterns in various tissues. We have used the integrated microarray dataset to identify candidate genes that are expressed predominantly in the brain as well as other tissues. 
The brain-selective genes provide a good starting list for genetic screening of patients with intellectual disability. Although a number of disease genes have been identified, the genetic causes of most intellectual disability cases are still unknown, and the brain-selective genes can be given high priority for further genetic screening. We have also used the integrated microarray dataset to examine the expression patterns of intellectual disability genes in the brain and other tissues, and to perform gene coexpression network analysis for identifying coexpressed gene modules, which often correspond to protein complexes or biological processes. Since the genetic screening of intellectual disability genes is normally performed using patient blood samples, the molecular functions and expression patterns of these genes in the brain remain to be elucidated. The findings in this project can provide useful information for understanding intellectual disability gene function and regulation. This molecular knowledge may be translated into novel diagnostic, preventive and therapeutic approaches, and thus benefit individuals with intellectual disability and their family members.
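As an illustration of how brain-selective candidate genes might be drawn from an integrated expression dataset, the sketch below flags a gene when its mean expression in the target tissue is at least a chosen multiple of its highest mean in any other tissue. The selection criterion, the `min_ratio` threshold, the function name, and the toy values are assumptions for illustration; the report does not describe the actual criterion, and the values are taken to be on a linear intensity scale.

```python
from statistics import mean


def tissue_selective_genes(expression, tissue_of_sample, target, min_ratio=2.0):
    """Return genes whose mean level in `target` is at least `min_ratio` times
    their highest mean level in any other tissue."""
    selected = []
    for gene, values in expression.items():
        # Group this gene's expression values by the tissue of each sample.
        by_tissue = {}
        for sample, value in values.items():
            by_tissue.setdefault(tissue_of_sample[sample], []).append(value)
        tissue_means = {t: mean(v) for t, v in by_tissue.items()}
        if target not in tissue_means:
            continue
        others = [m for t, m in tissue_means.items() if t != target]
        if others and tissue_means[target] >= min_ratio * max(others):
            selected.append(gene)
    return sorted(selected)


# Toy data (hypothetical): four samples from three tissues, three genes.
tissues = {"s1": "brain", "s2": "brain", "s3": "liver", "s4": "testis"}
expr = {
    "GENE_A": {"s1": 900, "s2": 1100, "s3": 100, "s4": 150},  # high in brain
    "GENE_B": {"s1": 200, "s2": 220, "s3": 210, "s4": 190},   # ubiquitous
    "GENE_C": {"s1": 50, "s2": 60, "s3": 800, "s4": 70},      # high in liver
}
brain_genes = tissue_selective_genes(expr, tissues, "brain")
liver_genes = tissue_selective_genes(expr, tissues, "liver")
print(brain_genes, liver_genes)
```

Comparing against the highest mean among the other tissues, rather than the overall mean, keeps a gene that is strongly expressed in two tissues from being called selective for either.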

Publications

Wang, L., Huang, C., Yang, M.Q. and Yang, J.Y. (2010) BindN+ for accurate prediction of DNA and RNA-binding residues from protein sequence features. BMC Systems Biology, 4(Suppl 1):S3.
Teng, S., Srivastava, A.K. and Wang, L. (2010) Sequence feature-based prediction of protein stability changes upon amino acid substitutions. BMC Genomics, 11(Suppl 2):S5.
Teng, S., Luo, H. and Wang, L. (2010) Random forest-based prediction of protein sumoylation sites from sequence features. In Proceedings of the 2010 ACM International Conference on Bioinformatics and Computational Biology, pp. 120-126. ACM Press.
Wang, L. (2010) Computational identification of human tissue-specific genes using public microarray expression data. In Proceedings of the 2010 International Conference on Bioinformatics and Computational Biology, CSREA Press.
Wang, L., Srivastava, A.K. and Schwartz, C.E. (2010) Microarray data integration for genome wide analysis of human tissue-selective gene expression. BMC Genomics, 11(Suppl 2):S15.

Progress 01/01/09 to 12/31/09

Outputs
OUTPUTS: The proposed research in the second year has focused on construction of machine learning models, collection and curation of public microarray data, and database system design. First, new classifiers have been constructed for sequence-based prediction of DNA-binding residues and protein stability changes upon amino acid substitutions. For DNA-binding site prediction, a random forest (RF) classifier has been constructed with biochemical features and new descriptors of evolutionary information for input vector encoding. The RF classifier has been shown to outperform our previous model used for BindN. A new web server called BindN-RF (publicly available at http://bioinfo.ggc.org/bindn-rf/) has been developed to make the more accurate RF classifier accessible to the biological research community. For protein stability prediction from sequence information, a support vector machine (SVM) classifier has been constructed with relevant sequence features representing biological knowledge for input vector encoding. Twenty sequence features have been examined, and classifier performance varies significantly depending on which features are used. The SVM classifier, which is more accurate than previously reported models, has been used to develop a new web server called MuStab (publicly accessible at http://bioinfo.ggc.org/mustab/) for online prediction of protein stability changes upon amino acid substitutions. Second, we have compiled a compendium of 2,968 human gene expression profiles of various normal tissue samples (including 616 brain samples) from the NCBI GEO database. These expression profiles have been selected from 131 microarray datasets generated at different laboratories. Each microarray profile has been manually curated for data quality control and sample classification. The selected profiles were generated using the Affymetrix HG-U133 Plus 2.0 Array, a recent platform for complete coverage of the human genome with 54,675 probe sets. 
Microarray profiles of diseased tissues or cell cultures were excluded from selection. A data integration approach has been developed to combine the expression profiles from various tissues and different microarray studies. The approach includes microarray data normalization, transformation and quality control. The integrated dataset is now being used to examine the normal expression patterns of mental retardation genes and to perform coexpression network analysis that reveals the coexpressed gene modules in the human brain. Third, we have designed and refined a relational data model for the database to be developed in this project for human genetic studies. The data model consists of twelve schema modules for storing genomic and clinical data. Relational database tables have been designed for each schema module. In particular, the schema modules of Specimen, Sequence and Expression have been implemented to store microarray gene expression profiles as well as sample annotations. A data warehousing approach will be used to develop the database system. Database construction will be completed in the next two years as proposed. PARTICIPANTS: Nothing significant to report during this reporting period. TARGET AUDIENCES: Not relevant to this project. PROJECT MODIFICATIONS: Nothing significant to report during this reporting period.
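The normalization and transformation steps of the data integration approach can be sketched as follows. The report does not name the specific methods, so this example assumes two widely used choices, log2 transformation followed by quantile normalization; the function names and the toy intensities are hypothetical, and quality control is not shown.

```python
import math


def log2_transform(profile, floor=1.0):
    """Log2-transform intensities, clamping values below `floor` first."""
    return [math.log2(max(v, floor)) for v in profile]


def quantile_normalize(profiles):
    """Force every profile to share the same distribution of values.

    `profiles` is a list of equal-length lists, one list per array.
    Ties are broken by position, which is adequate for a sketch.
    """
    n_probes = len(profiles[0])
    # Mean of the k-th smallest value across arrays, for each rank k.
    sorted_profiles = [sorted(p) for p in profiles]
    rank_means = [
        sum(sp[k] for sp in sorted_profiles) / len(profiles)
        for k in range(n_probes)
    ]
    normalized = []
    for p in profiles:
        # Probe indices ordered from lowest to highest value in this array.
        order = sorted(range(n_probes), key=lambda i: p[i])
        out = [0.0] * n_probes
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


# Toy raw intensities (hypothetical) for two arrays and three probe sets.
raw = [[2.0, 8.0, 4.0], [16.0, 4.0, 64.0]]
logged = [log2_transform(p) for p in raw]
normalized = quantile_normalize(logged)
print(normalized)
```

After this step each array has the same set of values and only the ranking of probe sets differs, which removes array-wide intensity differences between studies while preserving the within-array ordering.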

Impacts
All the research activities proposed for the second year of the project have been successfully completed. The results are promising toward the long-term goals of the proposed research. First, we have been developing predictive machine learning methods for understanding protein function. In the past year, we have constructed two new classifiers, one for sequence-based prediction of DNA-binding residues and the other for predicting protein stability changes upon amino acid substitutions. To improve predictive performance, biological features and new descriptors of evolutionary information have been used for classifier construction. The research has resulted in four peer-reviewed publications, one article in BMC Genomics and three papers in conference proceedings. We have also developed two new web servers, BindN-RF and MuStab, which are publicly accessible to the biomedical research community. The BindN-RF web server can be used to provide useful information for protein-DNA docking and experimental studies such as site-directed mutagenesis for understanding protein-DNA interactions. Since protein destabilization is a common mechanism by which point mutations cause diseases, MuStab can be used in human genetic studies to distinguish between deleterious and tolerant alterations in disease candidate genes. Second, the integrated dataset of 2,968 microarray expression profiles provides a useful resource for investigating mental retardation (MR) gene function and regulation. A variety of genetic alterations associated with mental retardation have been identified at the Greenwood Genetic Center. While the genetic screening of MR candidate genes can be performed using patient blood samples, the molecular functions and expression patterns of these genes in the brain remain to be elucidated. 
The integrated microarray data allow us to examine the normal expression patterns of MR genes in the brain and other tissues, and to perform gene coexpression network analysis for identifying coexpressed modules that contain one or more MR genes. Since coexpressed gene modules often correspond to protein complexes or biological processes, the findings can be useful for understanding MR gene function. We have started to analyze the integrated microarray data, and will complete the analysis in the next two years as proposed. The results obtained in the second year are important for the final outcomes of this work. Third, a comprehensive data model has been developed, and system design has been completed for the proposed database. Data modeling and system design are critical for database construction. The database system, which will be constructed in the next two years as proposed, will be used to support genomic and clinical data integration, and provide a computational infrastructure for developing web-based software tools for biomedical research. The database system will also be used to make our analytical results available to the biomedical research community. Thus, the database and software tools will facilitate biomedical research and subsequently improve the quality of health care.
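The idea of finding coexpressed modules that contain genes of interest can be sketched in a few lines. This example assumes a simple construction: genes are linked when the Pearson correlation of their profiles reaches a threshold, and modules are the connected components of the resulting graph. The report does not specify the network method, and the threshold `min_r`, the function names, and the toy profiles are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def coexpression_modules(expression, min_r=0.9):
    """Link genes with Pearson r >= min_r and return the connected components."""
    neighbors = {g: set() for g in expression}
    for g1, g2 in combinations(expression, 2):
        if pearson(expression[g1], expression[g2]) >= min_r:
            neighbors[g1].add(g2)
            neighbors[g2].add(g1)
    modules, seen = [], set()
    for gene in expression:
        if gene in seen:
            continue
        # Depth-first walk over the correlation graph starting from this gene.
        stack, component = [gene], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbors[current] - component)
        seen |= component
        if len(component) > 1:  # singletons are not reported as modules
            modules.append(sorted(component))
    return modules


def modules_with_genes(modules, genes_of_interest):
    """Keep only modules that contain at least one gene of interest."""
    wanted = set(genes_of_interest)
    return [m for m in modules if wanted & set(m)]


# Toy profiles (hypothetical) across four samples.
profiles = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [2.0, 4.0, 6.0, 8.0],
    "GENE_C": [4.0, 1.0, 3.0, 2.0],
    "GENE_D": [8.0, 2.0, 6.0, 4.0],
}
modules = coexpression_modules(profiles)
hits = modules_with_genes(modules, {"GENE_C"})
print(modules, hits)
```

Filtering modules by a list of known disease genes, as `modules_with_genes` does, is what turns a genome-wide network into a short list of candidate partners for those genes.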

Publications

Teng, S., Srivastava, A.K. and Wang, L. (2009) Biological features for sequence-based prediction of protein stability changes upon amino acid substitutions. In Proceedings of the 2009 International Joint Conferences on Bioinformatics, Systems Biology and Intelligent Computing (IJCBS'09), Shanghai, China, pp. 201-206. IEEE Computer Society.
Wang, L. and Huang, C. (2009) New descriptors of evolutionary information for accurate prediction of DNA and RNA-binding residues in protein sequences. In Proceedings of the 2009 International Joint Conferences on Bioinformatics, Systems Biology and Intelligent Computing (IJCBS'09), Shanghai, China, pp. 246-250. IEEE Computer Society.
Wang, L., Yang, M.Q. and Yang, J.Y. (2009) Prediction of DNA-binding residues from protein sequence information using random forests. BMC Genomics, 10:S1.
Wang, L. (2009) Combining biochemical features and evolutionary information for predicting DNA-binding residues in protein sequences. In T.-h. Kim et al. (eds.), Communications in Computer and Information Science, vol. 28, pp. 176-189. Springer-Verlag, Berlin Heidelberg.

Progress 01/01/08 to 12/31/08

Outputs
OUTPUTS: The first year of the proposed research focused on data collection and feature selection for developing machine learning methods, microarray data analysis, and database design. First, two papers were published in international conference proceedings for the use of new evolutionary information descriptors to predict DNA and RNA-binding residues in protein sequences. The new models are currently being used to develop two web-based tools for the biomedical research community. A journal article was published for sequence-based prediction of lipid-interacting residues. The training dataset of known lipid-interacting residues was obtained from the structures of protein-lipid complexes available at the Protein Data Bank (PDB), and Support Vector Machine (SVM) classifiers were constructed using biochemical features for input encoding. For prediction of protein stability changes upon amino acid substitutions, a training dataset was collected and twenty biological features were tested for input encoding. The preliminary results suggest that highly accurate SVM classifiers can be constructed for protein stability prediction using a combination of relevant biological features. Second, as a proof-of-concept study, 906 microarray gene expression profiles obtained from postmortem brain samples were retrieved from the NCBI GEO database. Each microarray profile has been manually curated for data quality control and sample classification. The integrated microarray dataset contains gene expression levels across a variety of tissues and clinical conditions. This dataset is currently being used to investigate brain expression patterns of mental retardation genes identified at the Greenwood Genetic Center (GGC). In collaboration with GGC scientists, microarray gene expression profiles have also been generated using blood-derived cell lines from mental retardation patients with mutations in the UPF3B and MED12 genes. 
Statistical analysis of the microarray data has given lists of candidate genes that are differentially expressed between patient samples and normal controls. The differential expression has been confirmed using quantitative RT-PCR for some candidate genes. Third, database system requirement analysis has been completed, and a conceptual data model has been developed for storing clinical and genomic data at GGC. The data model consists of twelve schema modules, including six modules (Patient, Specimen, Diagnostic Test, Clinical Evaluation, Disease, and Genetics) for clinical data, four modules (Sequence, Expression, Regulation, and Prediction) for genomic data, the Publication module for published references, and the CV module for controlled vocabulary such as Gene Ontology and medical terms. The data model is extensible for potential future expansion of the database. Relational database tables have been designed for each schema module. The database system is currently being implemented to support microarray gene expression data integration and analysis for understanding the molecular function of mental retardation genes. PARTICIPANTS: Nothing significant to report during this reporting period. TARGET AUDIENCES: Nothing significant to report during this reporting period. PROJECT MODIFICATIONS: Nothing significant to report during this reporting period.
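A small part of the relational data model described above can be sketched with an in-memory SQLite database. Only the module names Specimen, Sequence and Expression come from the report; the table and column names, the use of SQLite, and the toy rows are assumptions made for illustration and do not reflect the actual schema.

```python
import sqlite3

# Hypothetical tables for three of the twelve schema modules.
SCHEMA = """
CREATE TABLE specimen (              -- Specimen module
    specimen_id INTEGER PRIMARY KEY,
    tissue      TEXT NOT NULL,
    source      TEXT                 -- e.g. the study a profile came from
);
CREATE TABLE gene (                  -- Sequence module
    gene_id     INTEGER PRIMARY KEY,
    symbol      TEXT NOT NULL UNIQUE
);
CREATE TABLE expression (            -- Expression module
    specimen_id INTEGER NOT NULL REFERENCES specimen(specimen_id),
    gene_id     INTEGER NOT NULL REFERENCES gene(gene_id),
    probe_set   TEXT NOT NULL,
    level       REAL NOT NULL,
    PRIMARY KEY (specimen_id, probe_set)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Toy rows (hypothetical values).
conn.executemany(
    "INSERT INTO specimen VALUES (?, ?, ?)",
    [(1, "brain", "toy"), (2, "brain", "toy"), (3, "liver", "toy")],
)
conn.execute("INSERT INTO gene VALUES (1, 'GENE_A')")
conn.executemany(
    "INSERT INTO expression VALUES (?, ?, ?, ?)",
    [(1, 1, "probe_1", 10.0), (2, 1, "probe_1", 12.0), (3, 1, "probe_1", 4.0)],
)

# Mean expression of one gene per tissue, joining the three modules.
rows = conn.execute(
    """
    SELECT s.tissue, AVG(e.level)
    FROM expression AS e
    JOIN specimen AS s ON s.specimen_id = e.specimen_id
    JOIN gene AS g ON g.gene_id = e.gene_id
    WHERE g.symbol = 'GENE_A'
    GROUP BY s.tissue
    ORDER BY s.tissue
    """
).fetchall()
print(rows)
```

Keeping sample annotations in the Specimen tables and measurements in the Expression tables means a question such as "what is this gene's average level per tissue" becomes a single join, and new modules can be added without altering existing tables.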

Impacts
All the research activities proposed for the first year of the project have been successfully completed. The results are promising toward the long-term goals of the project, including development of predictive machine learning methods for understanding protein function, analysis of co-expressed gene modules related to genetic brain disorders, and construction of an integrated database for human genetic research. The computational methods and software systems developed in this project will promote bioinformatics applications to biomedical research, and thus improve the quality of health care in South Carolina and elsewhere. To develop accurate machine learning methods, we have been selecting relevant biological features for classifier construction. For DNA and RNA-binding site prediction, new descriptors of evolutionary information have been developed to improve classifier performance. The improved classifiers are currently being used to upgrade our previous web server, BindN, which has been used in biomedical research. We are also developing predictive tools for sequence-based prediction of lipid-interacting residues and protein stability changes upon mutations, and our results suggest that accurate classifiers can be constructed using relevant biological features. These predictive tools will be used to analyze the GGC collection of candidate genes in which mutations may cause mental retardation and other genetic disorders. Thus, the promising results obtained in the first year are important for the final outcomes of the project. A variety of genetic alterations that are associated with mental retardation have been identified using patient blood samples at GGC. Nevertheless, for most candidate genes, their molecular functions and regulatory mechanisms are still poorly understood. We have been developing a computational pipeline to integrate the published microarray gene expression profiles obtained from postmortem brain samples. 
Our preliminary results suggest that our data integration method works properly, and the integrated microarray data can be used to investigate mental retardation gene expression patterns. We are now collecting more brain gene expression profiles from published studies. The integrated dataset of microarray gene expression profiles will provide a useful resource for understanding brain function and disorders. We have been developing a database system to facilitate genomic and clinical data integration, and to make our results available to the biomedical research community. The database system will also provide an important computational infrastructure for human genetic research at GGC. In the first year of the project, we developed an extensible database schema for handling clinical and genomic data. This work is critical for constructing a functional database system for biomedical data integration and mining.
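For the residue-level classifiers discussed in these Impacts, each residue has to be turned into a numeric input vector. The sketch below assumes a sliding-window encoding with a single biochemical feature, the Kyte-Doolittle hydropathy index; the window size, the padding value, and the function name are hypothetical, and the project's actual encoding (including its evolutionary information descriptors) is not reproduced here.

```python
# Kyte-Doolittle hydropathy index, used here as one example biochemical feature.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode_windows(sequence, window=5, pad_value=0.0):
    """Return one feature vector per residue: the hydropathy of the residue
    and of its neighbors within a centered window of odd size `window`."""
    if window % 2 == 0:
        raise ValueError("window must be odd so that it can be centered")
    half = window // 2
    # Unknown residue codes fall back to the padding value.
    values = [HYDROPATHY.get(aa, pad_value) for aa in sequence.upper()]
    padded = [pad_value] * half + values + [pad_value] * half
    return [padded[i:i + window] for i in range(len(values))]


vectors = encode_windows("AKR", window=3)
print(vectors)
```

Each vector would then be paired with a label (binding or non-binding residue, for example) to train a classifier such as an SVM or a random forest; additional features extend each window position with more numbers.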

Publications

Wang, L., Irausquin, S.J. and Yang, J.Y. (2008) Prediction of lipid-interacting amino acid residues from sequence features. International Journal of Computational Biology and Drug Design (IJCBDD), 1:14-25.
Wang, L. (2008) Random forests for prediction of DNA-binding residues in protein sequences using evolutionary information. In Proceedings of the Second International Conference on Future Generation Communication and Networking (FGCN 2008), vol. 3, pp. 24-29. Springer.
Wang, L. (2008) BindN+ for improved prediction of DNA or RNA binding residues in protein sequences using evolutionary information. In Proceedings of the 2008 International Conference on Bioinformatics and Computational Biology (BIOCOMP08), pp. 961-964. CSREA Press.