Progress 10/13/15 to 02/21/19
Outputs Target Audience: 1. Research community. The scientific community will be reached through: 1) journal article publications, 2) oral/poster presentations at scientific conferences, and 3) a book chapter on "Big Data Analytics in Genomics." The new insights gained and new tools developed in this project will enable a large community of plant biology researchers to conduct a broad range of new studies that are currently not feasible. Meanwhile, the knowledge gained from this project will be useful to any scientist studying living cells, whether bacterial, plant, animal, or human. This project will lead to an improved qualitative understanding of gene expression for all the sequenced plant genomes. 2. Graduate and undergraduate students. The students involved in this project will receive interdisciplinary STEM training (Science, Technology, Engineering, and Math) while they conduct the research and process the data for dissemination. This training will better prepare the students for the rapidly expanding biotech industry. A broader group of students will benefit from a new undergraduate/graduate course in this area. These activities will contribute to developing a globally competitive STEM workforce, strengthening STEM education programs at SDSU, and increasing the number of students pursuing STEM careers. Changes/Problems: The PI has accepted a position at another university, and so this project is now completed. What opportunities for training and professional development has the project provided? During this project I provided students the opportunity for interdisciplinary and cutting-edge training in computational systems biology. Five graduate students and four undergraduate students were trained during this project. 1. Adam McDermaid. Ph.D. student. (2016.03 - 2019.01). Assisted with RNA-seq data analysis and modeling in Goal 1 and is now working with me as a postdoc. 2. Jinyu Yang. M.S. student. (2016.01 - 2017.12). 
Assisted with the phylogenetic footprinting framework for accurate motif predictions and web server design in Goal 3. 3. Juan Xie. M.S. student. (2016.08 - 2018.09). Involved in the biclustering R package development and new biclustering algorithm design in Goal 2. Is currently working with me as a Ph.D. student. 4. Anjun Ma. M.S. student. (2017.08 - present). Involved in new biclustering algorithm design for single-cell RNA-seq data analysis in Goal 2. 5. Cankun Wang. M.S. student. (2018.01 - present). Involved in motif prediction web server design in Goal 3. 6. Yiran Zhang. Undergraduate student. (2016.04 - 2018.10). Learned big biological data analysis skills in Goal 1 (e.g., data retrieval from reputable databases, and RNA-seq data analysis and modeling through computer programming). 7. Minxuan Sun. Undergraduate student. (2017.08 - 2019.01). Learned to use a deep learning pipeline and apply it to genomic sequences in Goal 3. 8. Xiaozhu Jin. Undergraduate student. (2016.04 - 2017.05). Learned gene expression data analysis skills in Goal 1 and Goal 2. 9. Prajwal Khatiwada. Undergraduate student. (2015.09 - 2016.01). Learned to build an RNA-seq data analysis pipeline using existing tools/software in Goal 1. I taught a graduate-level class in the fall semesters of 2016-2018, titled "Next Generation Sequencing (NGS) Data Analysis (PS-735-S01)". In 2018, thirteen graduate students were enrolled in this course, coming from Agronomy, Horticulture and Plant Science; Biology and Microbiology; Chemistry and Biochemistry; and Mathematics and Statistics. In this class, students were exposed to general and advanced computational techniques for NGS data analysis, current public databases, and major bioinformatics algorithms and programs. Using real data as examples, a project-based strategy was used throughout the class so that students could develop an understanding of algorithms in the context of solving biological problems. 
How have the results been disseminated to communities of interest? Results have been disseminated through journal article publications, presentations at scientific conferences, and invited presentations. What do you plan to do during the next reporting period to accomplish the goals?
Nothing Reported
Impacts What was accomplished under these goals?
Overall, I have addressed departmental, college, and University strategic goals in my scholarship by seeking and receiving internal and external funds, and in my research by focusing on developing and applying bioinformatics techniques to discover essential biological insights from large-scale Omics data. My research projects, funded proposals, collaborations, publications, and multiple presentations provide evidence of focused research, scholarship, and creative activity outcomes. Notably, my citation count has kept increasing over the past five years, and these studies have attracted worldwide attention. The three specific goals of this project and their progress are listed below. Goal One: Development of a de novo RNA-seq analysis pipeline for accurate read mapping and estimation of the expression levels of all genes (100% accomplished). An automatic RNA-seq analysis (SeqTU) and evaluation pipeline has been developed with a user-friendly online interface. The pipeline achieved an accuracy of ~0.95 on multiple RNA-seq datasets of Escherichia coli and Clostridium thermocellum under cross-validation. The corresponding manuscript has been published in the journal Scientific Reports. This pipeline will offer a clear and logical interface to display how it works and what it will deliver. An improved version of this pipeline (GeneQC) focuses on the read mapping uncertainty issue and aims to give a comprehensive evaluation framework for all the genes in a genome through mathematical and statistical modeling. A paper describing this work has been published in Frontiers in Genetics. It reports the list of genes whose expression levels are accurately estimated and the list of genes whose expression levels are biased by ambiguous reads. A detailed report summary of all the deliverables will be generated for users. This pipeline has been used in several collaborations: (i) with Dr. 
Michael Udvardi from the Noble Foundation to identify nitrogen conservation genes and underlying changes in genome activity associated with annual senescence in perennial Switchgrass (leading to a publication in New Phytologist); and (ii) with Dr. Michael Wisniewski from USDA to determine which genes, if any, are differentially expressed between strains of apple trees that have shown cold-hardiness traits and a set of control samples (leading to a joint publication that is ready for submission). Goal Two: Development of a novel biclustering algorithm for analyses of gene expression data generated from RNA-Seq data (100% accomplished). An article describing an R/Bioconductor implementation of biclustering has been published in the journal Bioinformatics. This R/Bioconductor package outperforms similar existing packages in efficiency and prediction accuracy, based on a systematic evaluation using gene expression data from bacteria, plants, and cancer cell lines. It has two unique features: (i) an average 82% improvement in efficiency, obtained by refactoring and optimizing the C source code of QUBIC; and (ii) a set of comprehensive functions to facilitate biclustering-based biological studies, including the qualitative representation (discretization) of expression data, query-based biclustering, bicluster expansion, bicluster comparison, heatmap visualization of any identified biclusters, and co-expression network elucidation. Notably, this tool has been cited 14 times since 2017. A novel biclustering algorithm for bulk RNA-Seq data and single-cell RNA-Seq data has been developed, and the paper has been submitted to Nature Methods. The intellectual merit of the study rests in its endeavor to detect and understand underlying regulatory mechanisms through modeling and analyses of gene expression and transcription factor binding sites (Goal 3). 
The PI will design a reliable qualitative representation of gene expression to reflect different expression states corresponding to various regulatory signals, where the unquantifiable errors in RNA-seq data will be handled by a rigorous truncated model. An information-divergence function will be implemented in a graph-theory-based biclustering framework to identify statistically significant and biologically meaningful co-expression gene modules. Goal Three: Development of a computational framework for motif identification integrated with the ChIP-seq data modeling (100% accomplished). Identification of transcription factor binding sites and cis-regulatory motifs is a frontier on which the rules governing TF-DNA binding are being revealed. We have developed an integrative phylogenetic footprinting framework for accurate motif predictions in prokaryotic genomes (MP3). This study has been published in the journal BMC Genomics. We have applied the framework to the Escherichia coli K-12 genome and evaluated the prediction performance through comparison with seven existing programs. This evaluation was systematically carried out at the nucleotide and binding site levels, and the results showed that MP3 consistently outperformed other popular motif finding tools. We have integrated MP3 into our motif identification and analysis server DMINDA, allowing users to identify and analyze motifs more efficiently. This server has been published in Bioinformatics and contains: (i) five motif prediction and analysis algorithms, including a phylogenetic footprinting framework; (ii) 2,125 species with complete genomes to support the above five functions, covering animals, plants, and bacteria; and (iii) bacterial regulon prediction and visualization. We have developed a new method for cis-regulatory motif prediction by deep neural networks and the binomial distribution model (called DESSO), which is under review in Genome Research. 
DESSO outperformed existing tools, including DeepBind, in predicting motifs in 690 human ENCODE ChIP-Sequencing datasets. Furthermore, we demonstrated that the deep-learning framework of DESSO expands motif discovery beyond the state of the art to allow identification of known and new protein-protein-DNA tethering interactions in human TFs. Specifically, 61 putative tethering interactions were identified among the 100 TFs in the K562 cell line. In this work we further demonstrated the power of DESSO by integrating detection of DNA shape features. We found that shape information has strong predictive power for TF-DNA binding and provides new putative shape motif information for human TFs. Thus, DESSO and the analyses it enables represent a potent improvement in the identification of TF binding sites by accommodating the complexities of DNA binding in a deep-learning framework. DESSO significantly outperforms other popular tools, e.g., DeepBind, Basset, and MEME-ChIP, in terms of the similarity assessment against validated motifs, with all Wilcoxon test p-values < 1 × 10^-3. We believe that this work will be of keen interest to genomics researchers and will have a radiating impact on the field. The source code, predicted motifs, and TFBSs from the 690 ENCODE TF ChIP-Seq datasets are freely available at the DESSO web server: http://bmbl.sdstate.edu/DESSO.
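As an illustration of the qualitative representation (discretization) of expression data mentioned under Goal Two, the following is a minimal sketch only. It assumes a simple rank-based rule in which the lowest and highest fraction q of samples for a gene are labelled -1 and 1 and the rest 0. The function names discretize_row and shared_conditions and the parameter q are hypothetical, and the sketch is not the discretization implemented in QUBIC or in the R/Bioconductor package.

```python
from typing import List


def discretize_row(values: List[float], q: float = 0.25) -> List[int]:
    """Map one gene's expression values to qualitative states.

    Assumption for illustration: the lowest fraction q of samples is
    labelled -1 (low), the highest fraction q is labelled 1 (high), and
    all remaining samples are labelled 0 (baseline).
    """
    n = len(values)
    k = int(n * q)  # number of samples labelled low, and number labelled high
    order = sorted(range(n), key=lambda i: values[i])  # sample indices by value
    states = [0] * n
    for i in order[:k]:
        states[i] = -1
    for i in order[n - k:]:  # empty slice when k == 0
        states[i] = 1
    return states


def shared_conditions(a: List[int], b: List[int]) -> List[int]:
    """Indices of conditions where two discretized genes share a non-baseline state."""
    return [j for j, (x, y) in enumerate(zip(a, b)) if x == y and x != 0]
```

Two genes whose discretized rows agree on a set of conditions are candidates for the same bicluster; a real biclustering method would then search for large gene sets sharing such conditions and assess their statistical significance.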
Publications
- Type:
Journal Articles
Status:
Published
Year Published:
2018
Citation:
Adam McDermaid, Xin Chen, Yiran Zhang, Cankun Wang, Juan Xie, Qin Ma, A new machine learning-based framework for mapping uncertainty analysis in RNA-Seq read alignment and gene expression estimation. Frontiers in Genetics. https://doi.org/10.3389/fgene.2018.00313.
- Type:
Journal Articles
Status:
Published
Year Published:
2018
Citation:
Juan Xie, Anjun Ma, Anne Fennell, Jing Zhao, Qin Ma, A comprehensive review of the biclustering application in addressing biological and biomedical problems. Briefings in Bioinformatics. https://doi.org/10.1093/bib/bby014. 2018.
- Type:
Journal Articles
Status:
Published
Year Published:
2018
Citation:
Adam McDermaid, Brandon Monier, Jing Zhao, Bingqiang Liu, Qin Ma, Interpretation of differential gene expression results of RNA-Seq data: review and integration. Briefings in Bioinformatics. https://doi.org/10.1093/bib/bby067. 2018.
- Type:
Journal Articles
Status:
Published
Year Published:
2018
Citation:
Sen Liang, Anjun Ma, Yan Wang, Qin Ma, Paired Data Feature Selection Methods for Gene Expression Data Analysis: A Comprehensive Review. Computational and Structural Biotechnology Journal. DOI: https://doi.org/10.1016/j.csbj.2018.02.005. 2018.
- Type:
Journal Articles
Status:
Published
Year Published:
2018
Citation:
Fang Zhang, Anjun Ma, Zhao Wang, Qin Ma, Bingqiang Liu, Lan Huang, Yan Wang, A Central Edge Selection Based Overlapping Community Detection Algorithm for the Detection of Overlapping Structures in Protein-Protein Interaction Networks. Molecules, 23(10), 2633; DOI: 10.3390/molecules23102633. 2018.
- Type:
Journal Articles
Status:
Published
Year Published:
2018
Citation:
Siyu Han, Yanchun Liang, Qin Ma, Cankun Wang, Yangyi Xu, Yu Zhang, Wei Du and Ying Li, LncFinder: an integrated package for long non-coding RNA identification utilizing sequence intrinsic composition, structural information and physicochemical property. Briefings in Bioinformatics. bby065, https://doi.org/10.1093/bib/bby065. 2018.
- Type:
Journal Articles
Status:
Published
Year Published:
2018
Citation:
Yu Zhang, Sha Cao, Jing Zhao, Qin Ma, Chi Zhang, MRHCA: a nonparametric statistics-based method for hub and co-expression module identification in large gene co-expression network. Quantitative Biology. DOI: https://doi.org/10.1007/s40484-018-0131-z. 2018.
- Type:
Journal Articles
Status:
Published
Year Published:
2018
Citation:
Surendra Neupane, Qin Ma, Madhav P. Nepal, Febina Mathew, Adam Varenhorst, Ethan J. Andersen, Comparative Analysis of TNL Disease Resistance Proteins in Soybean (Glycine max) and Common Bean (Phaseolus vulgaris). Biochemical Genetics, doi: 10.1007/s10528-018-9851-z. 2018.
- Type:
Journal Articles
Status:
Published
Year Published:
2018
Citation:
Bingqiang Liu, Ling Han, Xiangrong Liu, Jichang Wu, Qin Ma, Computational prediction of sigma-54 promoters in bacterial genomes by integrating motif finding and machine learning strategies. IEEE/ACM Transactions on Computational Biology and Bioinformatics, DOI: 10.1109/TCBB.2018.2816032. 2018.
- Type:
Journal Articles
Status:
Published
Year Published:
2018
Citation:
Guoqing Liu, Qin Ma, Ying Xu, Physical properties of DNA may direct the binding of nucleoid-associated proteins along the E. coli genome. Mathematical Biosciences. DOI: 10.1016/j.mbs.2018.03.026. 2018.
- Type:
Journal Articles
Status:
Published
Year Published:
2018
Citation:
Xin Chen, Anjun Ma, Hanyuan Zhang, Chao Liu, Huansheng Cao, Yan Wang, Qin Ma, RECTA: Regulon Identification Based on Comparative Genomics and Transcriptomics analysis. Genes. 9(6), 278. doi: https://doi.org/10.3390/genes9060278. 2018.
- Type:
Other
Status:
Other
Year Published:
2018
Citation:
Qin Ma. 2018. Development and application of computational methods driven by addressing genomic and transcriptomic questions. September 10. Indiana University, Indianapolis, IN. (Invited by Dr. Chi Zhang).
- Type:
Other
Status:
Other
Year Published:
2018
Citation:
Juan Xie. 2018. Hypothesis-driven and discovery-driven analysis of Grapevine expression data. Plant & Animal Genome Conference, Jan. 14-18. San Diego, CA.
- Type:
Conference Papers and Presentations
Status:
Published
Year Published:
2018
Citation:
Cankun Wang, Qin Ma. 2018. Combining computational methods and experimental data for Motif prediction. Plant Science Research Day, April 26. South Dakota State University, Brookings, SD.
- Type:
Conference Papers and Presentations
Status:
Published
Year Published:
2018
Citation:
Anjun Ma, Qin Ma. 2018. Bioinformatics and Mathematical Biosciences Lab, Faculty Excellence Showcase on Celebration of Faculty Excellence. February 21. Brookings, SD
Progress 10/01/17 to 09/30/18
Outputs Target Audience: Research community. The new insights gained and new tools developed in this project will enable a large community of plant biology researchers to conduct a broad range of new studies that are currently not feasible. Meanwhile, the knowledge gained from this project will be useful to any scientist studying living cells, whether bacterial, plant, animal, or human. It will lead to a better qualitative understanding of gene expression for all the sequenced plant genomes. General public. A broader, non-scientist audience will be reached through (1) oral/poster presentations at national and local conferences; (2) writing and publication of a book chapter on "Big Data Analytics in Genomics"; and (3) developing and teaching a new undergraduate/graduate course on the same topic. Graduate and undergraduate students. The students involved in this project will receive better interdisciplinary STEM training (Science, Technology, Engineering, and Math). The training will better prepare the students for the rapidly expanding biotech industry, meeting the demands of interdisciplinary academic training. These activities will contribute to developing a globally competitive STEM workforce, strengthened STEM education programs at SDSU, and an increased number of students pursuing STEM higher education and careers. Changes/Problems:
Nothing Reported
What opportunities for training and professional development has the project provided? During this project I provided students the opportunity for interdisciplinary and cutting-edge training in computational systems biology. Five graduate students and two undergraduate students were trained on this project. 1. Adam McDermaid. Ph.D. student. (2016.03 - 2018.05). Assisted with the RNA-seq data analysis and modeling in Goal 1, and is now working with me as a postdoc. 2. Jinyu Yang. M.S. student. (2016.01 - 2017.12). Assisted with the phylogenetic footprinting framework for accurate motif predictions and web server design in Goal 3. 3. Juan Xie. M.S. student. (2016.08 - 2018.09). Involved in the biclustering R package development and new biclustering algorithm design in Goal 2. Is now working with me as a Ph.D. student. 4. Anjun Ma. M.S. student. (2017.08 - present). Involved in the new biclustering algorithm design for single-cell RNA-seq data analysis in Goal 2. 5. Cankun Wang. M.S. student. (2018.01 - present). Involved in the motif prediction web server design in Goal 3. 6. Yiran Zhang. Undergraduate student. (2016.04 - present). Learned big biological data analysis skills in Goal 1 (e.g., data retrieval from reputable databases, and RNA-seq data analysis and modeling through computer programming). 7. Minxuan Sun. Undergraduate student. (2017.08 - present). Learned to use a deep learning pipeline and apply it to genomic sequences in Goal 3. I am teaching a graduate-level class in fall 2018, named "Next Generation Sequencing (NGS) Data Analysis (PS-735-S01)". There are thirteen graduate students enrolled in this course, coming from Agronomy, Horticulture and Plant Science; Biology and Microbiology; Biochemistry; and Mathematics and Statistics. In this class, students will be exposed to general and advanced computational techniques for NGS data analysis, current public databases, and major bioinformatics algorithms and programs. 
Using real data as examples, a project-based strategy will be adopted throughout the class so that students can understand algorithms in the context of solving biological problems. How have the results been disseminated to communities of interest? These results have been disseminated through 14 journal articles, five invited lectures, and two conference posters. All of them are listed in the "Products" section. What do you plan to do during the next reporting period to accomplish the goals? Overall, I will continue progress on funded projects which have the common goal of developing advanced computational techniques in support of systems-level understanding of critical biological problems; continue training graduate and undergraduate students in bioinformatics, mathematical modeling, computational programming, and biological data analysis; keep publishing manuscripts in refereed journals; and try to turn these preliminary results into internal and external grant proposals. Goal One: Development of a de novo single-cell RNA-seq analysis pipeline for accurate read mapping and estimation of the expression levels of all genes. Our plan to approach this issue is (1) to consider external information in the form of known co-expressed gene information; (2) to mathematically evaluate the gene expression uncertainty and dropouts through a regression framework; and (3) to classify the genes with expression uncertainty into multiple categories based on a mixture model fitting strategy. Furthermore, to properly use the co-expression information, a statistical model must be created and employed to provide a precise value indicating the preferred gene location for any read or combination of reads. Goal Two: Development of a novel biclustering algorithm for analyses of gene expression data generated from single-cell RNA-Seq data. 
Develop a novel biclustering algorithm and program for large-scale gene expression data analysis and apply these techniques to gene regulatory network (GRN) construction. Eventually, provide a more reliable elucidation of the transcriptional regulatory mechanisms encoded in a genome. Goal Three: Development of a deep-learning framework for motif identification integrated with the ChIP-seq data modeling. To further elucidate the underlying regulatory mechanism, the PI will develop a deep learning (DL) framework for motif prediction, mainly integrating ChIP-seq peaks and DNA shape. A weighted two-stage alignment algorithm, considering the peak signals and the motif conservation property, will be designed to reduce the high noise level in ChIP-seq peak calling. The gated convolutional neural network and DNA-shape information will be integrated to improve the DL performance in motif identification. Overall, a deeper understanding of gene expression and TFBSs from this project will ultimately improve the effectiveness and efficiency of NGS data utilization in transcriptional regulation studies.
Impacts What was accomplished under these goals?
Goal One: Development of a de novo RNA-seq analysis pipeline for accurate read mapping and estimation of the expression levels of all genes (90% accomplished). An automatic RNA-seq analysis (SeqTU) and evaluation pipeline has been developed with a user-friendly online interface. The pipeline achieved an accuracy of ~0.95 on multiple RNA-seq datasets of Escherichia coli and Clostridium thermocellum under cross-validation. The corresponding manuscript has been published in the journal Scientific Reports. This pipeline will offer a clear and logical interface to display how it works and what it will deliver. An improved version of this pipeline (GeneQC) focuses on the read mapping uncertainty issue and aims to give a comprehensive evaluation framework for all the genes in a genome through mathematical and statistical modeling. A paper describing this work has been published in Frontiers in Genetics. It reports the list of genes whose expression levels are accurately estimated and the list of genes whose expression levels are biased by ambiguous reads. A detailed report summary of all the deliverables will be generated for users. This pipeline has been used in several collaborations: (i) with Dr. Michael Udvardi from the Noble Foundation to identify nitrogen conservation genes and underlying changes in genome activity associated with annual senescence in perennial Switchgrass (leading to a publication in New Phytologist); and (ii) with Dr. Michael Wisniewski from USDA to determine which genes, if any, are differentially expressed between strains of apple trees that have shown cold-hardiness traits and a set of control samples (leading to a joint publication that is ready for submission). Goal Two: Development of a novel biclustering algorithm for analyses of gene expression data generated from RNA-Seq data (90% accomplished). An R/Bioconductor implementation of biclustering has been published in the journal Bioinformatics. 
This R/Bioconductor package outperforms similar existing packages in efficiency and prediction accuracy, based on a systematic evaluation using gene expression data from bacteria, plants, and cancer. It has two unique features: (i) an average 82% improvement in efficiency, obtained by refactoring and optimizing the C source code of QUBIC; and (ii) a set of comprehensive functions to facilitate biclustering-based biological studies, including the qualitative representation (discretization) of expression data, query-based biclustering, bicluster expansion, bicluster comparison, heatmap visualization of any identified biclusters, and co-expression network elucidation. Notably, this tool has been cited 14 times since 2017. A novel biclustering algorithm for bulk RNA-Seq data and single-cell RNA-Seq data has been developed, and the paper has been submitted to Nature Methods for peer review. The intellectual merit of the study rests in its endeavor to detect and understand underlying regulatory mechanisms through modeling and analyses of gene expression and transcription factor binding sites (Goal 3). The PI will design a reliable qualitative representation of gene expression to reflect different expression states corresponding to various regulatory signals, where the unquantifiable errors in RNA-seq data will be handled by a rigorous truncated model. An information-divergence function will be implemented in a graph-theory-based biclustering framework to identify statistically significant and biologically meaningful co-expression gene modules. Goal Three: Development of a computational framework for motif identification integrated with the ChIP-seq data modeling (80% accomplished). Identification of transcription factor binding sites and cis-regulatory motifs is a frontier on which the rules governing TF-DNA binding are being revealed. 
We have developed an integrative phylogenetic footprinting framework for accurate motif predictions in prokaryotic genomes (MP3). This study has been published in the journal BMC Genomics. We have applied the framework to the Escherichia coli K-12 genome and evaluated the prediction performance through comparison with seven existing programs. This evaluation was systematically carried out at the nucleotide and binding site levels, and the results showed that MP3 consistently outperformed other popular motif finding tools. We have integrated MP3 into our motif identification and analysis server DMINDA, allowing users to identify and analyze motifs more efficiently. This server has been published in Bioinformatics and contains: (i) five motif prediction and analysis algorithms, including a phylogenetic footprinting framework; (ii) 2,125 species with complete genomes to support the above five functions, covering animals, plants, and bacteria; and (iii) bacterial regulon prediction and visualization. We have developed a new method for cis-regulatory motif prediction by deep neural networks and the binomial distribution model (called DESSO), which is under review in Genome Research. DESSO outperformed existing tools, including DeepBind, in predicting motifs in 690 human ENCODE ChIP-Sequencing datasets. Furthermore, we demonstrated that the deep-learning framework of DESSO expands motif discovery beyond the state of the art to allow identification of known and new protein-protein-DNA tethering interactions in human TFs. Specifically, 61 putative tethering interactions were identified among the 100 TFs in the K562 cell line. In this work we further demonstrated the power of DESSO by integrating detection of DNA shape features. We found that shape information has strong predictive power for TF-DNA binding and provides new putative shape motif information for human TFs. 
Thus, DESSO and the analyses it enables represent a potent improvement in the identification of TF binding sites by accommodating the complexities of DNA binding in a deep-learning framework. DESSO significantly outperforms other popular tools, e.g., DeepBind, Basset, and MEME-ChIP, in terms of the similarity assessment against validated motifs, with all Wilcoxon test p-values < 1 × 10^-3. We believe that this work will be of keen interest to genomics researchers and will have a radiating impact on the field. The source code, predicted motifs, and TFBSs from the 690 ENCODE TF ChIP-Seq datasets are freely available at the DESSO web server: http://bmbl.sdstate.edu/DESSO.
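As an illustration of how a binomial distribution model can be used to score a candidate motif, as mentioned for DESSO under Goal Three, the following is a minimal sketch only. It assumes that each of n peak sequences independently contains a motif instance with background probability p_background, and it computes the upper-tail probability of observing at least the given number of hits. The function name binomial_enrichment_pvalue and its parameters are hypothetical, and this is not DESSO's actual implementation.

```python
from math import comb


def binomial_enrichment_pvalue(hits: int, n: int, p_background: float) -> float:
    """Upper-tail binomial probability P(X >= hits) for X ~ Binomial(n, p_background).

    hits: number of peak sequences containing a motif instance
    n: total number of peak sequences examined
    p_background: assumed chance that a background sequence contains an instance
    """
    if not 0 <= hits <= n:
        raise ValueError("hits must be between 0 and n")
    if not 0.0 <= p_background <= 1.0:
        raise ValueError("p_background must be a probability")
    # Sum the binomial probability mass from `hits` up to `n`.
    return sum(
        comb(n, k) * p_background ** k * (1.0 - p_background) ** (n - k)
        for k in range(hits, n + 1)
    )
```

A small p-value under this assumed model indicates that the motif occurs in more peak sequences than the background rate would explain; for large n, a numerically stabler library routine would be preferable to this direct summation.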
Publications
- Type:
Journal Articles
Status:
Published
Year Published:
2018
Citation:
Juan Xie, Anjun Ma, Anne Fennell, Jing Zhao, Qin Ma. A comprehensive review of the biclustering application in addressing biological and biomedical problems. Briefings in Bioinformatics, 2018. https://doi.org/10.1093/bib/bby014.
- Type:
Journal Articles
Status:
Awaiting Publication
Year Published:
2018
Citation:
Baoguang Tian, Xue Wu, Cheng Chen, Wenying Qiu, Qin Ma, Bin Yu. Predicting protein-protein interactions based on multi-information fusion using wavelet denoising and support vector machine.
Journal of Theoretical Biology. 2018
- Type:
Conference Papers and Presentations
Status:
Awaiting Publication
Year Published:
2018
Citation:
Xing Shi, Zhancheng Gao, Qiang Lin, Liping Zhao, Qin Ma, Yu Kang, Jun Yu. Meta-analysis Reveals Potential Influence of Oxidative Stress on the Airway Microbiomes of Cystic Fibrosis Patients.
Genomics, Proteomics & Bioinformatics. 2018
- Type:
Journal Articles
Status:
Accepted
Year Published:
2018
Citation:
Fang Zhang*, Anjun Ma*, Zhao Wang, Qin Ma, Bingqiang Liu, Lan Huang, Yan Wang. A Central Edge Selection Based Overlapping Community Detection Algorithm for the Detection of Overlapping Structures in Protein-Protein Interaction Networks. 2018. Molecules.
- Type:
Journal Articles
Status:
Published
Year Published:
2018
Citation:
Xiaolan Rao, Xin Chen, Hui Shen, Qin Ma, Guifen Li, Yuhong Tang, Maria Pena, William York, Taylor Frazier, Scott Lenaghan, Xirong Xiao, Fang Chen and Richard A. Dixon. Gene Regulatory Networks for Lignin Biosynthesis in Switchgrass (Panicum virgatum).
Plant Biotechnology Journal. 2018 Aug 22. DOI: 10.1111/pbi.13000.
- Type:
Journal Articles
Status:
Published
Year Published:
2018
Citation:
Siyu Han, Yanchun Liang, Qin Ma, Cankun Wang, Yangyi Xu, Yu Zhang, Wei Du and Ying Li. LncFinder: an integrated package for long non-coding RNA identification utilizing sequence intrinsic composition, structural information and physicochemical property.
Briefings in Bioinformatics. 2018. bby065, https://doi.org/10.1093/bib/bby065.
- Type:
Journal Articles
Status:
Published
Year Published:
2018
Citation:
Adam McDermaid, Brandon Monier, Jing Zhao, Bingqiang Liu, and Qin Ma. Interpretation of differential gene expression results of RNA-Seq data: review and integration.
Briefings in Bioinformatics. 2018. https://doi.org/10.1093/bib/bby067. 06 August 2018.
- Type:
Journal Articles
Status:
Published
Year Published:
2018
Citation:
Yu Zhang, Sha Cao, Jing Zhao, Qin Ma$, and Chi Zhang. MRHCA: a nonparametric statistics-based method for hub and co-expression module identification in large gene co-expression network.
Quantitative Biology. 2018. DOI: https://doi.org/10.1007/s40484-018-0131-z.
- Type:
Journal Articles
Status:
Published
Year Published:
2018
Citation:
Sen Liang, Anjun Ma, Yan Wang$, Qin Ma. Paired Data Feature Selection Methods for Gene Expression Data Analysis: A Comprehensive Review.
Computational and Structural Biotechnology Journal, 2018. DOI: https://doi.org/10.1016/j.csbj.2018.02.005.
- Type:
Journal Articles
Status:
Published
Year Published:
2018
Citation:
Surendra Neupane, Qin Ma, Madhav P. Nepal, Febina Mathew, Adam Varenhorst, and Ethan J. Andersen. Comparative Analysis of TNL Disease Resistance Proteins in Soybean (Glycine max) and Common Bean (Phaseolus vulgaris).
Biochemical Genetics, 2018 Mar 2. doi: 10.1007/s10528-018-9851-z.
- Type:
Journal Articles
Status:
Published
Year Published:
2018
Citation:
Bingqiang Liu, Ling Han, Xiangrong Liu, Jichang Wu, Qin Ma. Computational prediction of sigma-54 promoters in bacterial genomes by integrating motif finding and machine learning strategies.
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 2018. DOI: 10.1109/TCBB.2018.2816032.
- Type:
Journal Articles
Status:
Published
Year Published:
2018
Citation:
Guoqing Liu, Qin Ma, Ying Xu. Physical properties of DNA may direct the binding of nucleoid-associated proteins along the E. coli genome.
Mathematical Biosciences. 2018. DOI: 10.1016/j.mbs.2018.03.026.
- Type:
Conference Papers and Presentations
Status:
Other
Year Published:
2018
Citation:
Anjun Ma, Qin Ma. 2018. Bioinformatics and Mathematical Biosciences Lab, Faculty Excellence Showcase on Celebration of Faculty Excellence. Brookings, SD, February 21, 2018.
- Type:
Journal Articles
Status:
Published
Year Published:
2018
Citation:
Xin Chen*, Anjun Ma*, Hanyuan Zhang, Chao Liu, Huansheng Cao, Yan Wang, Qin Ma. RECTA: Regulon Identification Based on Comparative Genomics and Transcriptomics analysis.
Genes. 2018, 9(6), 278; https://doi.org/10.3390/genes9060278
- Type:
Conference Papers and Presentations
Status:
Other
Year Published:
2018
Citation:
Qin Ma. 2018. Development and application of computational methods driven by addressing genomic and transcriptomic questions. September 10th, 2018. Indiana University, Indianapolis, IN 46202. (Invited by Dr. Chi Zhang)
- Type:
Conference Papers and Presentations
Status:
Other
Year Published:
2018
Citation:
Juan Xie. 2018. Hypothesis-driven and discovery-driven analysis of Grapevine expression data. January 16th, 2018. Plant & Animal Genome Conference, Jan. 14-18, San Diego, CA, USA.
- Type:
Conference Papers and Presentations
Status:
Other
Year Published:
2017
Citation:
Qin Ma. 2017. Integrated and systematic views of regulatory DNA motif identification and analyses. Department of Computer Science, Jilin University. December 12th, 2017. (Invited by Prof. Yan Wang)
- Type:
Conference Papers and Presentations
Status:
Other
Year Published:
2017
Citation:
Qin Ma. 2017. RNA Sequencing Analyses & Mapping Uncertainty, August 25th, 2017, the University of California, Davis campus in Davis, California. (The Y1 meeting for the NSF Plant Genome Research Program).
- Type:
Conference Papers and Presentations
Status:
Other
Year Published:
2018
Citation:
Cankun Wang, Qin Ma. 2018. Combining computational methods and experimental data for Motif prediction. Plant Science Research Day, April 26th, 2018. South Dakota State University.
|
Progress 10/01/16 to 09/30/17
Outputs Target Audience: Research community. The new insights gained and new tools developed in this project will enable a large community of plant biology researchers to conduct a broad range of new studies that are currently not feasible. Meanwhile, the knowledge gained from this project will be useful to any scientist studying living cells - bacteria, plants, animals, and human cells. It will lead to a better qualitative understanding of gene expression for all the sequenced plant genomes. General public. A broader, non-scientist audience will be reached through (1) oral/poster presentations at national and local conferences; (2) writing and publication of a book chapter on "Big Data Analytics in Genomics"; and (3) developing and teaching a new undergraduate/graduate course on the same topic. Graduate and undergraduate students. The students involved in this project will receive better interdisciplinary STEM training (Science, Technology, Engineering, and Math). The training will enable the students to be better prepared for the rapidly expanding biotech industry, meeting the demands of interdisciplinary academic training. These activities will contribute to developing a globally competitive STEM workforce, strengthening STEM education programs at SDSU, and increasing the number of students pursuing STEM higher education and careers. Changes/Problems:
Nothing Reported
What opportunities for training and professional development has the project provided? During this project, I provided students the opportunity for interdisciplinary and cutting-edge training in computational systems biology. Four graduate students and three undergraduate students were trained on this project. 1. Adam McDermaid. Ph.D. student. (2016.03 - present). Assisted with the RNA-seq data analysis and modeling in Goal 1. 2. Jinyu Yang. M.S. student. (2016.01 - 2017.12). Assisted with the phylogenetic footprinting framework for accurate motif predictions and web server design in Goal 3. 3. Juan Xie. M.S. student. (2016.08 - present). Involved in the biclustering R package development and new biclustering algorithm design in Goal 2. 4. Anjun Ma. M.S. student. (2017.08 - present). Involved in the new biclustering algorithm design for single-cell RNA-seq data analysis in Goal 2. 5. Xiaozhu Jin. Undergraduate student. (2016.04 - 2017.05). Learned gene expression data analysis skills in Goal 1 and Goal 2. 6. Yiran Zhang. Undergraduate student. (2016.04 - present). Learned big biological data analysis skills in Goal 1, e.g., data retrieval from reputable databases and RNA-seq data analysis and modeling through computer programming. 7. Minxuan Sun. Undergraduate student. (2017.08 - present). Learned to use a deep learning pipeline and its application to genomic sequences in Goal 3. I am teaching a graduate-level class in fall 2017, named "Next Generation Sequencing (NGS) Data Analysis (PS-792-S03)". Twenty-three graduate students are enrolled in this course, coming from Agronomy, Horticulture and Plant Science, Biology and Microbiology, Biochemistry, Animal Science, Computer Sciences, and Mathematics and Statistics. In this class, students will be exposed to general/advanced computational techniques for NGS data analysis, current public databases, and major bioinformatics algorithms and programs. 
Using real data as examples, a project-based strategy will be adopted throughout the class so that students can understand algorithms in the context of solving biological problems. How have the results been disseminated to communities of interest? These results have been disseminated through nine journal articles, 11 invited lectures, and eight conference posters. All of them are listed in the "other products" section. What do you plan to do during the next reporting period to accomplish the goals? Overall, I will continue progress on funded projects that share the common goal of developing advanced computational techniques in support of systems-level understanding of critical biological problems; continue training graduate and undergraduate students in bioinformatics, mathematical modeling, computational programming, and biological data analysis; keep publishing manuscripts in refereed journals; and try to turn these preliminary results into internal and external grant proposals. Goal 1: Development of a de novo RNA-seq analysis pipeline for accurate read mapping and estimation of the expression levels of all genes. Our plan to approach this issue is (1) to consider external information in the form of known co-expressed gene information; (2) to mathematically evaluate the gene expression uncertainty through a regression framework; and (3) to classify the genes with expression uncertainty into multiple categories based on a mixture model fitting strategy. Furthermore, to properly use the co-expression information, a statistical model must be created and employed to provide a precise value indicating the preferred gene location for any read or combination of reads. Goal 2: Development of a novel biclustering algorithm for analyses of gene expression data generated from RNA-Seq data. Develop a novel biclustering algorithm and program for large-scale gene expression data analysis, and apply these techniques to gene regulatory network (GRN) construction. 
Eventually, provide a more reliable elucidation of the transcriptional regulatory mechanisms encoded in a genome. Goal 3: Development of a computational framework for motif identification integrated with the ChIP-seq data modeling. To further elucidate the underlying regulatory mechanism, the PI will develop a deep learning (DL) framework for motif prediction, mainly integrating ChIP-seq peaks and DNA shape. A weighted two-stage alignment algorithm, considering the peak signals and the motif conservation property, will be designed to reduce the high noise level in ChIP-seq peak calling. The gated convolutional neural network and DNA-shape information will be organically integrated to improve the DL performance in motif identification. Overall, a deeper understanding of gene expression and TFBSs from this project will ultimately improve the effectiveness and efficiency of NGS data utilization in transcriptional regulation.
Impacts What was accomplished under these goals?
Overall, I have addressed departmental, college and University strategic goals in my scholarship by seeking and receiving internal and external funds; and in my research by focusing on developing and applying bioinformatics techniques to discover essential biological insights from large-scale Omics data. My research projects, funded proposals, collaborations, publications and multiple presentations provide evidence of focused research, scholarship and creative activity outcomes. It is noteworthy that my total citation count has kept increasing over the past five years, and these studies have attracted worldwide attention. Three specific goals in this project and their progress are listed below. Goal 1: Development of a de novo RNA-seq analysis pipeline for accurate read mapping and estimation of the expression levels of all genes (60% accomplished). An automatic RNA-seq analysis (SeqTU) and evaluation pipeline has been developed with a user-friendly online interface. The pipeline achieved an accuracy of ~0.95 on multiple RNA-seq datasets of Escherichia coli and Clostridium thermocellum under five-fold cross-validation. The corresponding manuscript has been published in the journal Scientific Reports. This pipeline will offer a clear and logical interface to display how it works and what it will deliver. An improved version of this pipeline (GeneQC) focuses on the read mapping uncertainty issue and aims to give a comprehensive evaluation framework of all the genes in a genome, through mathematical and statistical modeling. It will report the list of genes whose expression levels are accurately estimated, and the list of genes whose expression levels are biased by ambiguous reads. A detailed report summary of all the deliverables will be generated for users. The corresponding manuscript is ready for submission to the journal Bioinformatics. This pipeline has been used in several collaborations: (i) with Dr. 
Michael Udvardi from the Noble Foundation to identify nitrogen conservation and underlying changes in genome activity associated with annual senescence in perennial Switchgrass; and (ii) with Dr. Michael Wisniewski from USDA to determine which genes, if any, are differentially expressed between strains of apple trees that have shown cold-hardiness traits compared to a set of control samples. Goal 2: Development of a novel biclustering algorithm for analyses of gene expression data generated from RNA-Seq data. (60% accomplished). An R/Bioconductor implementation of biclustering has been published in the journal Bioinformatics, which performs much better than other similar existing packages regarding efficiency and prediction accuracy based on systematic evaluation using gene expression data from bacteria, plants, and cancer. It has two unique features: (i) an 82% average improved efficiency by refactoring and optimizing the source C code of QUBIC; and (ii) a set of comprehensive functions to facilitate biclustering-based biological studies, including the qualitative representation (discretization) of expression data, query-based biclustering, bicluster expanding, biclusters comparison, heatmap visualization of any identified biclusters, and co-expression networks elucidation. A novel biclustering algorithm for RNA-seq data and single-cell RNA-seq data has been developed. The intellectual merit of the study rests in its endeavor to detect and understand underlying regulatory mechanisms through modeling and analyses of gene expression and transcription factor binding sites (Goal 3). The PI will design a reliable qualitative representation of the gene expression to reflect different expression states corresponding to various regulatory signals, where the unquantifiable errors in RNA-seq data will be handled by a rigorous truncated model. 
An information-divergence function will be implemented in a graph-theory-based biclustering framework to identify statistically significant and biologically meaningful co-expression gene modules. Goal 3: Development of a computational framework for motif identification integrated with the ChIP-seq data modeling (50% accomplished). We have developed an integrative phylogenetic footprinting framework for accurate motif predictions in prokaryotic genomes (MP3). This study has been published in the journal BMC Genomics. We have applied the framework to the Escherichia coli K-12 genome and evaluated the prediction performance through comparison with seven existing programs. This evaluation was systematically carried out at the nucleotide and binding site level, and the results showed that MP3 consistently outperformed other popular motif finding tools. We have integrated MP3 into our motif identification and analysis server DMINDA, allowing users to identify and analyze motifs efficiently. This server has been published in Bioinformatics and contains: (i) five motif prediction and analysis algorithms, including a phylogenetic footprinting framework; (ii) 2,125 species with complete genomes to support the above five functions, covering animals, plants, and bacteria; and (iii) bacterial regulon prediction and visualization.
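The qualitative representation (discretization) of expression data mentioned under Goal 2 can be illustrated with a minimal sketch. This is a simplified rank-based scheme written for illustration only; it is not the QUBIC implementation, and the function names and the tail fraction `q` are assumptions.

```python
def discretize_gene(values, q=0.25):
    """Map one gene's expression values to -1 (low), 0 (baseline), or 1 (high).

    Hypothetical simplification: the lowest and highest fraction `q` of
    samples are treated as down- and up-regulated states.
    """
    ordered = sorted(values)
    n = len(ordered)
    k = max(1, int(n * q))      # number of samples in each tail
    low_cut = ordered[k - 1]    # largest value inside the low tail
    high_cut = ordered[n - k]   # smallest value inside the high tail
    states = []
    for v in values:
        if v <= low_cut and v < high_cut:
            states.append(-1)
        elif v >= high_cut and v > low_cut:
            states.append(1)
        else:
            states.append(0)    # also covers genes with constant expression
    return states


def discretize_matrix(expression, q=0.25):
    """Apply discretize_gene to every gene in a {gene: [values]} mapping."""
    return {gene: discretize_gene(vals, q) for gene, vals in expression.items()}


if __name__ == "__main__":
    toy = {
        "geneA": [1.0, 5.0, 5.2, 4.9, 9.0, 5.1, 0.5, 8.5],
        "geneB": [3.0, 3.0, 3.0, 3.0],
    }
    for gene, states in discretize_matrix(toy).items():
        print(gene, states)
```

Rows of such a discretized matrix that share the same non-zero states across a subset of samples are the kind of pattern a biclustering method searches for.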
Publications
- Type:
Journal Articles
Status:
Published
Year Published:
2017
Citation:
Huansheng Cao, Qin Ma, Xin Chen, Ying Xu, DOOR: A microbial operon database for gene organization and function discovery. Briefings in Bioinformatics. https://doi.org/10.1093/bib/bbx088.
- Type:
Journal Articles
Status:
Published
Year Published:
2017
Citation:
Sheng-Yong Niu, Jinyu Yang, Adam McDermaid, Jing Zhao, Yu Kang, Qin Ma (corresponding author). Bioinformatics tools for metagenome and metatranscriptome data analysis in microbiome studies. Briefings in Bioinformatics. 2017, 1-15. doi: 10.1093/bib/bbx051.
- Type:
Journal Articles
Status:
Published
Year Published:
2017
Citation:
Jinyu Yang, Xin Chen, Adam McDermaid, and Qin Ma (corresponding author), DMINDA 2.0: integrated and systematic views of regulatory DNA motif identification and analyses. Bioinformatics. 13 April 2017. https://doi.org/10.1093/bioinformatics/btx223.
- Type:
Journal Articles
Status:
Published
Year Published:
2017
Citation:
Bingqiang Liu, Jinyu Yang, Yang Li, Adam McDermaid, Qin Ma (corresponding author). An algorithmic perspective of de-novo cis-regulatory motif finding based on ChIP-seq data. Briefings in Bioinformatics. 2017, 113, doi: 10.1093/bib/bbx026.
- Type:
Journal Articles
Status:
Published
Year Published:
2017
Citation:
Huansheng Cao, Wei Du, Yuedong Yang, Yu Shang, Gaoyang Li, Yaoqi Zhou, Qin Ma, Ying Xu, Systems-Level Understanding of the Impact of Ethanol-Induced Stresses and Adaptation in E. coli. Scientific Reports. Article number: 44150 (2017). doi:10.1038/srep44150.
- Type:
Journal Articles
Status:
Published
Year Published:
2017
Citation:
Xin Chen, Wen-Chi Zhou, Qin Ma (corresponding author) and Ying Xu, SeqTU: A Web Server for RNA-seq based Transcription Unit Identification in Bacteria. Scientific Reports. Article number: 43925 (2017). doi:10.1038/srep43925.
- Type:
Journal Articles
Status:
Published
Year Published:
2017
Citation:
Lina Yuan, Yang Yu, Yanmin Zhu, Yulai Li, Changqing Li, Rujiao Li, Qin Ma, Gilman Kit-Hang Siu, Jun Yu, Taijiao Jiang, Jingfa Xiao, Yu Kang, GAAP: Genome-Organization-Framework-Assisted Assembly Pipeline for the Prokaryotic Genome. BMC Genomics. 2017. 18(Suppl 1):952, DOI: 10.1186/s12864-016-3267-0.
- Type:
Journal Articles
Status:
Published
Year Published:
2017
Citation:
Ying Li, Shi Xiaohu, Liang Yanchun, Juan Xie, Yu Zhang, Qin Ma (corresponding author), RNA-TVcurve: A Web Server for RNA Secondary Structure Comparison based on a Multi-Scale Similarity of its Triple Vector Representation. BMC Bioinformatics (2017) 18:51 DOI 10.1186/s12859-017-1481-7.
- Type:
Journal Articles
Status:
Published
Year Published:
2016
Citation:
G Xu, Y Xia, Y Tang, H Cao, Qin Ma, Y Xu, N Zhang, H Xu, Bibliometric Screening of Helicobacter Pylori Pathogenic Genes Through Pathway Enrichment and Operon Analysis. Clinical Laboratory, 2016 Nov 1;62(11):2125-2137. doi: 10.7754/Clin.Lab.2016.160319.
|
Progress 10/13/15 to 09/30/16
Outputs Target Audience: Research community. The new insights gained and new tools developed in this project will enable a large community of plant biology researchers to conduct a broad range of new studies that are currently not feasible. Meanwhile, the knowledge gained from this project will be useful to any scientist studying living cells - bacteria, plants, animals, and human cells. It will lead to better qualitative understanding of gene expression for all the sequenced plant genomes. General public. A larger, non-scientist audience will be reached through (1) the writing and publication of a book chapter on "Big Data Analytics in Genomics" and (2) developing and teaching a new undergraduate/graduate course on the same topic. Graduate and undergraduate students. The students involved in this project will receive better interdisciplinary STEM training (Science, Technology, Engineering, and Math). This will enable the students to be better prepared for the rapidly expanding biotech industry, meeting the demands of interdisciplinary academic training. These activities will contribute to developing a globally competitive STEM workforce, strengthening STEM education programs at SDSU, and increasing the number of students pursuing STEM higher education and careers. Changes/Problems:
Nothing Reported
What opportunities for training and professional development has the project provided? During this project, I provided students the opportunity for interdisciplinary and cutting-edge training in computational systems biology. Three graduate students and three undergraduate students were trained on this project. 1. Adam McDermaid. Ph.D. student. (2016.03 - present). Assisted with the RNA-seq data analysis and modeling in Goal 1. 2. Jinyu Yang. M.S. student. (2016.01 - present). Assisted with the phylogenetic footprinting framework for accurate motif predictions in Goal 3. 3. Juan Xie. M.S. student. (2016.08 - present). Involved in the biclustering R package development in Goal 2. 4. Xiaozhu Jin. Undergraduate student. (2016.04 - present). Learned gene expression data analysis skills in Goal 1 and Goal 2. 5. Yiran Zhang. Undergraduate student. (2016.04 - present). Learned big biological data analysis skills in Goal 1, e.g., data retrieval from reputable databases and basic computational analysis through computer programming. 6. Prajwal Khatiwada. Undergraduate student. (2015.09 - 2016.01). Learned to build an RNA-seq data analysis pipeline using existing tools/software in Goal 1. In addition, I have created and am teaching a graduate-level class in fall 2016, named "Next Generation Sequencing (NGS) Data Analysis (PS-792-S04)". Twenty-five graduate students are enrolled in this class, coming from Agronomy, Horticulture and Plant Science, Biology and Microbiology, Biochemistry, Animal Science, and Mathematics and Statistics. In my class, students will be exposed to general/advanced computational techniques for NGS data analysis, up-to-date public databases, and major bioinformatics algorithms and programs. Using real data as examples, a project-based strategy will be adopted throughout the class so that students can understand algorithms in the context of solving biological problems. 
How have the results been disseminated to communities of interest? These results have been disseminated through 7 invited lectures and 13 conference posters. All of them are listed in the "other products" section. What do you plan to do during the next reporting period to accomplish the goals? Overall, I will continue progress on funded projects that share the common goal of developing advanced computational techniques in support of systems-level understanding of important biological problems; continue training graduate and undergraduate students in bioinformatics, mathematical modeling, computational programming, and biological data analysis; keep publishing manuscripts in refereed journals; and try to turn these preliminary results into internal and external grant proposals. Goal 1: Development of a de novo RNA-seq analysis pipeline for accurate read mapping and estimation of the expression levels of all genes. We plan to consider external information in the form of known co-expressed gene information. Certain genes are co-expressed with other genes at certain rates, and knowing these rates, along with other acquired information, can lead to a much more certain assignment of previously ambiguous reads. To properly use the co-expression information, a statistical model must be created and employed to provide a clear value indicating the preferred gene location for any read or combination of reads. Goal 2: Development of a novel biclustering algorithm for analyses of gene expression data generated from RNA-seq data. Develop a novel biclustering algorithm and program for large-scale gene expression data analysis, and apply these techniques to gene regulatory network (GRN) construction. Eventually, provide a more reliable elucidation of the transcriptional regulatory mechanisms encoded in a genome. 
Goal 3: Development of a phylogenetic footprinting framework for motif identification integrated with the ChIP-seq data modeling. For our new phylogenetic footprinting pipeline, a potential and reasonable improvement is integrating experimental data, if available, e.g., chromatin immunoprecipitation followed by sequencing (ChIP-seq). ChIP-seq is a technique used for genome-wide profiling of DNA-binding proteins, histone modifications, or nucleosomes, and it has become an indispensable tool for studying gene regulation, as it can provide transcription factor binding information with higher resolution, less noise, and greater coverage than its traditional array-based predecessors, such as ChIP-chip. However, it cannot replace computational prediction tools, particularly for prokaryotes. First, very little ChIP-seq data is available for prokaryotes; second, ChIP-seq is not suitable for TFs with only a few binding sites; third, the complexity of regulation can also lead to bias because TFs may not bind to their binding sites in certain environments. Specifically, the score curves used in MP3 can be further optimized by integrating the binding signal from ChIP-seq, using machine learning or pattern classification. The ChIP-seq based peaks and the CBRs identified by MP3 can be cross-validated against each other in application, aiming to overcome some intrinsic computational challenges in high-throughput data analyses. Once large-scale ChIP-seq data become available for prokaryotes, we believe that the information integration in our framework can further improve the performance of motif prediction and analysis.
Impacts What was accomplished under these goals?
Overall, I have addressed departmental, college and University strategic goals in my scholarship by seeking and receiving internal and external funds; and in my research by focusing on developing and applying bioinformatics techniques to discover important biological insights from large-scale Omics data. My research projects, funded proposals, collaborations, publications and multiple presentations provide evidence of focused research, scholarship and/or creative activity outcomes. It is noteworthy that my total citation count has kept increasing over the past five years, and these studies have attracted worldwide attention. Three specific goals in this project and their progress are listed below. Goal 1: Development of a de novo RNA-seq analysis pipeline for accurate read mapping and estimation of the expression levels of all genes (40% accomplished). An automatic RNA-seq analysis and evaluation pipeline has been developed with a user-friendly online interface. The corresponding manuscript has been accepted by the journal Scientific Reports. This pipeline will offer a clear and logical interface to display how it works and what it will deliver. It will report one list of genes whose mapped reads are mainly unique and another list of genes whose mapped reads are mainly ambiguous. A detailed report summary of all the deliverables will be generated for users. This pipeline has been used in several collaborations with PIs from USDA, Noble Foundation, and USD. 
Goal 2: Development of a novel biclustering algorithm for analyses of gene expression data generated from RNA-seq data (40% accomplished). An R implementation of biclustering has been published in the journal Bioinformatics with two unique features: (i) an 82% average improvement in efficiency from refactoring and optimizing the source C code of QUBIC; and (ii) a set of comprehensive functions to facilitate biclustering-based biological studies, including the qualitative representation (discretization) of expression data, query-based biclustering, bicluster expanding, biclusters comparison, heatmap visualization of any identified biclusters, and co-expression networks elucidation. This R package has also been accepted and integrated into Bioconductor, which is the largest community of bioinformatic research in R. Based on systematic evaluation using gene expression data from bacteria, plants, and cancer, our method performs much better than other similar existing packages in terms of efficiency and prediction accuracy. Goal 3: Development of a phylogenetic footprinting framework for motif identification integrated with the ChIP-seq data modeling (30% accomplished). We have developed an integrative phylogenetic footprinting framework for accurate motif predictions in prokaryotic genomes (MP3). This study has been published in the journal BMC Genomics. The framework includes a new orthologous data preparation procedure, an additional promoter scoring and pruning method, and an integration of six existing motif finding algorithms as basic motif search engines. Specifically, we collected orthologous genes from available prokaryotic genomes and built the orthologous regulatory regions based on sequence similarity of promoter regions. This procedure made full use of the large-scale genomic data and taxonomy information and filtered out the promoters with limited contribution to produce a high-quality orthologous promoter set. 
The promoter scoring and pruning is implemented through motif voting by a set of complementary predicting tools that mine as many motif candidates as possible and simultaneously eliminate the effect of random noise. We have applied the framework to the Escherichia coli K-12 genome and evaluated the prediction performance through comparison with seven existing programs. This evaluation was systematically carried out at the nucleotide and binding site level, and the results showed that MP3 consistently outperformed other popular motif finding tools. We have integrated MP3 into our motif identification and analysis server DMINDA, allowing users to efficiently identify and analyze motifs in 2,072 completely sequenced prokaryotic genomes.
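The idea of promoter scoring and pruning by motif voting can be sketched as follows. This is a minimal illustration under stated assumptions, not the MP3 scoring scheme, whose details are not given in this report; the tool names, promoter identifiers, and the `min_votes` threshold are all hypothetical.

```python
def vote_promoters(predictions, min_votes=2):
    """Score promoters by how many motif finders support them, then prune.

    predictions: {tool_name: iterable of promoter ids in which that tool
                  reported a motif} (hypothetical input format).
    Returns (votes, kept): per-promoter vote counts and the sorted list of
    promoters supported by at least `min_votes` tools.
    """
    votes = {}
    for promoters in predictions.values():
        for promoter in set(promoters):   # each tool votes once per promoter
            votes[promoter] = votes.get(promoter, 0) + 1
    kept = sorted(p for p, count in votes.items() if count >= min_votes)
    return votes, kept


if __name__ == "__main__":
    toy_predictions = {
        "tool_a": ["p1", "p2"],
        "tool_b": ["p1", "p3"],
        "tool_c": ["p1", "p2", "p2"],
    }
    votes, kept = vote_promoters(toy_predictions)
    print(votes)
    print(kept)
```

Requiring agreement between several tools is one simple way to keep many candidates while discarding promoters supported by a single, possibly spurious, prediction.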
Publications
- Type:
Journal Articles
Status:
Published
Year Published:
2015
Citation:
Qin Ma*, Xizeng Mao*, Bingqiang Liu, Zheng Chang, Chuan Zhou, Hanyuan Zhang, Ying Xu. Revisiting Operons: An Analysis of the Landscape of Transcriptional Units in E. coli, BMC Bioinformatics, DOI: 10.1186/s12859-015-0805-8. 2015
- Type:
Journal Articles
Status:
Published
Year Published:
2015
Citation:
Xin Chen*, Qin Ma*, Xiaolan Rao, Yuhong Tang, Yan Wang, Gaoyang Li, Chi Zhang, Xizeng Mao, Richard A. Dixon and Ying Xu, Genome-Scale Identification of Cell-Wall Related Genes in Switchgrass through Comparative Genomics and Computational Analyses of Transcriptomic Data, BioEnergy Research, DOI: 10.1007/s12155-015-9674-2. 2015
- Type:
Journal Articles
Status:
Published
Year Published:
2016
Citation:
Guishen Wang, Lan Huang, Yan Wang, Qin Ma, Wei Pang, Link community detection based on linear graphs with a novel link similarity measure, International Journal of Modern Physics B. 30, 1650023 (2016) [18 pages] DOI: http://dx.doi.org/10.1142/S0217979216500235.
- Type:
Journal Articles
Status:
Published
Year Published:
2016
Citation:
Jiading Yang, Eric Worley, Qin Ma, Jun Li, Ivone Torres-Jerez, Yi-Ching Lee, Jiyi Zhang, Nick Krom, Fuqi Liao, Yuhong Tang, Patrick X. Zhao, Michael Udvardi, Nitrogen conservation and underlying changes in genome activity associated with annual senescence in perennial Switchgrass, Panicum virgatum, New Phytologist. DOI: 10.1111/nph.13898. 2016
- Type:
Journal Articles
Status:
Published
Year Published:
2016
Citation:
Bingqiang Liu, Chuan Zhou, Guojun Li, Hanyuan Zhang, Erliang Zeng, Qin Ma (corresponding author), Bacterial regulon modeling and prediction based on systematic cis regulatory motif analyses. Scientific Reports. doi:10.1038/srep23030. 2016
- Type:
Journal Articles
Status:
Published
Year Published:
2016
Citation:
Guishen Wang, Lan Huang, Yan Wang, Wei Pang, and Qin Ma, A link density clustering algorithm based on automatically selecting density peaks for overlapping community detection. International Journal of Modern Physics B. 30, 1650167 (2016) [15 pages] DOI: http://dx.doi.org/10.1142/S0217979216501678.
- Type:
Journal Articles
Status:
Published
Year Published:
2016
Citation:
Bingqiang Liu, Hanyuan Zhang, Chuan Zhou, Guojun Li, Anne Fennell, Guanghui Wang, Yu Kang, Qi Liu and Qin Ma (corresponding author), An integrative and applicable phylogenetic footprinting framework for cis-regulatory motifs identification in prokaryotic genomes. BMC Genomics. DOI: 10.1186/s12864-016-2982-x. 2016
- Type:
Journal Articles
Status:
Published
Year Published:
2016
Citation:
Yu Zhang*, Juan Xie*, Jinyu Yang, Anne Fennell and Qin Ma (corresponding author), QUBIC: a Bioconductor package for qualitative biclustering analysis of gene co-expression data. Bioinformatics. doi: 10.1093/bioinformatics/btw635. 2016
|
|