Development and application of computational tools driven by inferring the genome-scale gene regulatory networks

Sponsoring Institution

National Institute of Food and Agriculture

Project Status

COMPLETE

Funding Source

HATCH

Reporting Frequency

Annual

Accession No.

1008151

Grant No.

(N/A)

Cumulative Award Amt.

(N/A)

Proposal No.

(N/A)

Multistate No.

(N/A)

Project Start Date

Oct 13, 2015

Project End Date

Feb 21, 2019

Grant Year

(N/A)

Program Code

[(N/A)]- (N/A)

Recipient Organization
SOUTH DAKOTA STATE UNIVERSITY
PO BOX 2275A
BROOKINGS,SD 57007

Performing Department
Plant Science

Non Technical Summary
One of the most important goals in systems biology is to infer how gene regulatory networks (GRNs) respond under various conditions or to specific genetic perturbations, and to understand how different gene expression states are controlled by their underlying regulatory systems. Hence, reconstruction of global GRNs is key to understanding gene function and evolution, especially in the era of high-throughput genome sequencing. These networks are very important to studies of agricultural and energy crops. For example, over the past two decades they have substantially improved the understanding of the seed development and maturation control system and of plant cell wall recalcitrance and conversion. In addition, these networks, integrated with high-throughput Omics data analysis, can contribute to precision agriculture by advising and optimizing experimental designs.
Mathematically, this is modeled as a regulon identification problem, which aims to identify all the genes co-regulated by each regulatory transcription factor (TF). Regulons are the basic units of the response system in a cell, and a successful elucidation of regulons will substantially improve the state of the art in identifying transcriptionally co-regulated genes encoded in a genome, realistically allowing reliable prediction of global GRNs in both eukaryotic and prokaryotic organisms. However, although elucidating all the regulons using experimental approaches is clearly desirable, doing so at a genome scale is far from realistic. Aside from the high costs and effort required, one key issue is that it is difficult to know which conditions trigger which regulons; hence, unless we can exhaustively go through all possible conditions that each trigger at least one regulon encoded in a genome, we will not be able to observe some of the regulons experimentally.
Therefore, the PI believes that computational algorithms will play an essential role in the elucidation of all regulons encoded in a genome.
Two next-generation data types that can significantly benefit regulon elucidation are RNA sequencing (RNA-seq) and chromatin immunoprecipitation followed by sequencing (ChIP-seq). First, RNA-seq data promise a comprehensive picture of the transcriptome, enabling complete annotation and quantification of all genes and supporting downstream co-expression analysis and identification of differentially expressed genes. Second, ChIP-seq is a technique for genome-wide profiling of DNA-binding proteins, histone modifications, or nucleosomes. ChIP-seq has become an indispensable tool for studying gene regulation because it provides transcription factor binding information with higher resolution, less noise, and greater coverage than its array-based predecessor, ChIP-chip.
Overall, the key focus of the planned tasks in this proposal is to perform independent research centered on rapid and reliable reconstruction of genome-scale GRNs encoded in plant genomes, and collaborative research at SDSU in which newly developed computational techniques link the large quantities of Omics data to studies of complex system-level organizations and regulatory behaviors in both plants and associated bacteria. The PI will apply his technical strength in developing advanced computational techniques and web databases, and his experience in Omics data analysis, molecular function, and modeling and simulation of complex biological systems, to extend computational biology approaches to deal with the substantially increased complexity and scale of the target biological systems.

Animal Health Component

60%

Research Effort Categories

Basic

40%

Applied

60%

Developmental

(N/A)

Classification

Knowledge Area (KA)	Subject of Investigation (SOI)	Field of Science (FOS)	Percent
201	2499	1040	40%
901	7299	2080	60%

Knowledge Area
901 - Program and Project Design, and Statistics; 201 - Plant Genome, Genetics, and Genetic Mechanisms;

Subject Of Investigation
7299 - Research equipment and methods, general/other; 2499 - Plant research, general;

Field Of Science
1040 - Molecular biology; 2080 - Mathematics and computer sciences;

Keywords

bioinformatics

gene regulatory network

computational tools development

omics data mining and modeling

Goals / Objectives
One of the most important goals in systems biology is to infer how gene regulatory networks (GRNs) respond under various conditions or to specific genetic perturbations, and to understand how different gene expression states are controlled by their underlying regulatory systems. Hence, the overarching goal of this project and related future studies is to develop advanced computational techniques for information discovery from multiple types of Omics data (including RNA-sequencing and ChIP-sequencing data) of plants and associated bacteria. This will enable independent research centered on rapid and reliable reconstruction of genome-scale GRNs encoded in plant genomes, and collaborative research at SDSU in which newly developed computational techniques link the large quantities of Omics data to studies of complex system-level organizations and regulatory behaviors in both plants and associated bacteria.

Project Methods
Although substantial efforts have been made in gene regulatory network elucidation, the network is still far from complete even for the most well-studied model organism, Escherichia coli (E. coli). The reconstruction process is complex and problematic because the network is more complex than we thought; for example, it is condition-specific and has a hierarchical topology that leads to special structures such as network motifs. Therefore, with the advent of high-throughput Omics data, the methodology in this proposal includes comprehensive RNA-seq data analysis, large-scale gene co-expression data analysis, motif prediction and clustering, and functional genomics analysis. Although these proposed studies are not designed for a specific plant genome, they can be easily applied to agricultural plants, including but not limited to rice, wheat, soybean, and grape, and to energy crops such as switchgrass.
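The co-expression analysis step named above can be illustrated with a minimal sketch. It assumes Pearson correlation and a fixed cutoff of 0.9, neither of which is specified in this report; the gene names and expression values are hypothetical toy data, and the function names are introduced here for illustration only.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / sqrt(var_x * var_y)


def coexpression_edges(expression, threshold=0.9):
    """Return gene pairs whose absolute correlation meets the threshold.

    `threshold` is an assumed cutoff chosen for illustration.
    """
    edges = []
    for gene_a, gene_b in combinations(sorted(expression), 2):
        r = pearson(expression[gene_a], expression[gene_b])
        if abs(r) >= threshold:
            edges.append((gene_a, gene_b, round(r, 3)))
    return edges


# Hypothetical expression values across five conditions (not real data).
toy_expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.0, 4.0, 6.0, 8.0, 10.0],
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],
}

edges = coexpression_edges(toy_expression)
print(edges)
```

In this toy input only geneA and geneB vary together, so they form the single edge of the co-expression network; a genome-scale analysis would need a more careful choice of similarity measure and significance threshold.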

Progress 10/13/15 to 02/21/19

Outputs
Target Audience: 1. Research community. The scientific community will be reached through: 1) journal article publications, 2) oral/poster presentations at scientific conferences, and 3) a book chapter on "Big Data Analytics in Genomics." The new insights gained and new tools developed in this project will enable a large community of plant biology researchers to conduct a broad range of new studies that are currently not feasible. Meanwhile, the knowledge gained from this project will be useful to any scientist studying living cells - bacterial, plant, animal, and human cells. This project will lead to improved qualitative understanding of gene expression for all the sequenced plant genomes. 2. Graduate and undergraduate students. The students involved in this project will receive interdisciplinary STEM training (Science, Technology, Engineering, and Math) while they conduct the research and process the data for dissemination. This training will enable the students to be better prepared for the rapidly expanding biotech industry. A broader group of students will benefit from a new undergraduate/graduate course in this area. These activities will contribute to developing a globally competitive STEM workforce, strengthening STEM education programs at SDSU, and increasing the number of students pursuing STEM careers.
Changes/Problems: The PI has accepted a position at another university, and so this project is now completed.
What opportunities for training and professional development has the project provided? During this project I provided students the opportunity for interdisciplinary and cutting-edge training in computational systems biology. Four graduate students and five undergraduate students were trained during this project. 1. Adam McDermaid. Ph.D. student. (2016.03 - 2019.01). Assisted with RNA-seq data analysis and modeling in Goal 1 and is now working with me as a Postdoc. 2. Jinyu Yang. M.S. student. (2016.01 - 2017.12).
Assisted with the phylogenetic footprinting framework for accurate motif predictions and web server design in Goal 3. 3. Juan Xie. M.S. student. (2016.08 - 2018.09). Involved in the biclustering R package development and new biclustering algorithm design in Goal 2. Is currently working with me as a Ph.D. student. 4. Anjun Ma. M.S. student. (2017.08 - present). Involved in new biclustering algorithm design for single-cell RNA-seq data analysis in Goal 2. 5. Cankun Wang. M.S. student. (2018.01 - present). Involved in motif prediction web server design in Goal 3. 6. Yiran Zhang. Undergraduate student. (2016.04 - 2018.10). Learned big biological data analysis skills in Goal 1 (e.g., data retrieval from reputable databases and RNA-seq data analysis and modeling through computer programming). 7. Minxuan Sun. Undergraduate student. (2017.08 - 2019.01). Learned to use a deep learning pipeline and apply it to genomic sequences in Goal 3. 8. Xiaozhu Jin. Undergraduate student. (2016.04 - 2017.05). Learned gene expression data analysis skills in Goal 1 and Goal 2. 9. Prajwal Khatiwada. Undergraduate student. (2015.09 - 2016.01). Learned to build an RNA-seq data analysis pipeline using existing tools/software in Goal 1. I taught a graduate-level class each fall from 2016 to 2018, titled "Next Generation Sequencing (NGS) Data Analysis (PS-735-S01)". In 2018, thirteen graduate students were enrolled in this course, coming from Agronomy, Horticulture and Plant Science; Biology and Microbiology; Chemistry and Biochemistry; and Mathematics and Statistics. In this class, students were exposed to general and advanced computational techniques for NGS data analysis, current public databases, and major bioinformatics algorithms and programs. Using real data as examples, a project-based strategy was used throughout the class so that students could develop an understanding of algorithms in the context of solving biological problems.
How have the results been disseminated to communities of interest? Results have been disseminated through journal article publication, presentations at scientific conferences, and invited presentations.
What do you plan to do during the next reporting period to accomplish the goals? Nothing Reported

Impacts
What was accomplished under these goals? Overall, I have addressed departmental, college, and University strategic goals in my scholarship by seeking and receiving internal and external funds, and in my research by focusing on developing and applying bioinformatics techniques to discover essential biological insights from large-scale Omics data. My research projects, funded proposals, collaborations, publications, and multiple presentations provide evidence of focused research, scholarship, and creative activity outcomes. Notably, my citation count has increased steadily over the past five years, and these studies have attracted worldwide attention. The three specific goals of this project and their progress are listed below.
Goal One: Development of a de novo RNA-seq analysis pipeline for accurate read mapping and estimation of the expression levels of all genes (100% accomplished). An automatic RNA-seq analysis (SeqTU) and evaluation pipeline has been developed with a user-friendly online interface. The pipeline achieved an accuracy of ~0.95 in cross-validation on multiple RNA-seq datasets of Escherichia coli and Clostridium thermocellum. The corresponding manuscript has been published in the journal Scientific Reports. This pipeline will offer a clear and logical interface to display how it works and what it will deliver. An improved version of this pipeline (GeneQC) focuses on the read mapping uncertainty issue and aims to give a comprehensive evaluation framework of all the genes in a genome through mathematical and statistical modeling. A paper describing this work has been published in Frontiers in Genetics. It reports the list of genes whose expression levels are accurately estimated and the list of genes whose expression levels are biased by ambiguous reads. A detailed report summary of all the deliverables will be generated for users. This pipeline has been used in several collaborations: (i) with Dr.
Michael Udvardi from the Noble Foundation to identify nitrogen conservation genes and underlying changes in genome activity associated with annual senescence in perennial switchgrass (leading to a publication in New Phytologist); and (ii) with Dr. Michael Wisniewski from USDA to determine which genes, if any, are differentially expressed between strains of apple trees that have shown cold-hardiness traits and a set of control samples (leading to a joint publication that is ready for submission).
Goal Two: Development of a novel biclustering algorithm for analyses of gene expression data generated from RNA-Seq data (100% accomplished). An article describing an R/Bioconductor implementation of biclustering has been published in the journal Bioinformatics. This R/Bioconductor package performs much better than other similar existing packages in efficiency and prediction accuracy, based on systematic evaluation using gene expression data from bacteria, plants, and cancer cell lines. It has two unique features: (i) an 82% average improvement in efficiency, obtained by refactoring and optimizing the source C code of QUBIC; and (ii) a set of comprehensive functions to facilitate biclustering-based biological studies, including the qualitative representation (discretization) of expression data, query-based biclustering, bicluster expansion, bicluster comparison, heatmap visualization of any identified biclusters, and co-expression network elucidation. This tool has been cited 14 times since 2017. A novel biclustering algorithm for bulk RNA-Seq data and single-cell RNA-Seq data has been developed, and the paper has been submitted to Nature Methods. The intellectual merit of the study rests in its endeavor to detect and understand underlying regulatory mechanisms through modeling and analyses of gene expression and transcription factor binding sites (Goal 3).
The PI will design a reliable qualitative representation of gene expression to reflect different expression states corresponding to various regulatory signals, where the unquantifiable errors in RNA-seq data will be handled by a rigorous truncated model. An information-divergence function will be implemented in a graph-theory-based biclustering framework to identify statistically significant and biologically meaningful co-expression gene modules.
Goal Three: Development of a computational framework for motif identification integrated with the ChIP-seq data modeling (100% accomplished). Identification of transcription factor binding sites and cis-regulatory motifs is a frontier on which the rules governing TF-DNA binding are being revealed. We have developed an integrative phylogenetic footprinting framework for accurate motif predictions in prokaryotic genomes (MP3). This study has been published in the journal BMC Genomics. We have applied the framework to the Escherichia coli K-12 genome and evaluated the prediction performance through comparison with seven existing programs. This evaluation was systematically carried out at the nucleotide and binding site levels, and the results showed that MP3 consistently outperformed other popular motif finding tools. We have integrated MP3 into our motif identification and analysis server DMINDA, allowing users to identify and analyze motifs more efficiently. This server has been published in Bioinformatics and contains: (i) five motif prediction and analysis algorithms, including a phylogenetic footprinting framework; (ii) 2,125 species with complete genomes to support the above five functions, covering animals, plants, and bacteria; and (iii) bacterial regulon prediction and visualization. We have developed a new method for cis-regulatory motif prediction by deep neural networks and the binomial distribution model (called DESSO), which is under review in Genome Research.
DESSO outperformed existing tools, including DeepBind, in predicting motifs in 690 human ENCODE ChIP-Sequencing datasets. Furthermore, we demonstrated that the deep-learning framework of DESSO expands motif discovery beyond the state of the art to allow identification of known and new protein-protein-DNA tethering interactions in human TFs. Specifically, 61 putative tethering interactions were identified among the 100 TFs in the K562 cell line. In this work, we further demonstrated the power of DESSO by integrating detection of DNA shape features. We found that shape information has strong predictive power for TF-DNA binding and provides new putative shape motif information for human TFs. Thus, DESSO and the analyses it enables represent a potent improvement in the identification of TF binding sites by accommodating the complexities of DNA binding in a deep-learning framework. DESSO significantly outperforms other popular tools, e.g., DeepBind, Basset, and MEME-ChIP, in terms of the similarity assessment against validated motifs, all achieving Wilcoxon test p-values < 1 x 10^-3. We believe that this work will be of keen interest to genomics researchers and will have a radiating impact on the field. The source code, predicted motifs, and TFBSs from the 690 ENCODE TF ChIP-Seq datasets are freely available at the DESSO web server: http://bmbl.sdstate.edu/DESSO.
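The qualitative representation (discretization) of expression data mentioned under Goal Two can be illustrated with a minimal sketch. The quantile rule below is an assumption made for illustration and is not claimed to be the rule used by the published package; the `fraction` parameter, the function names, and the two toy profiles are hypothetical.

```python
def discretize(profile, fraction=0.25):
    """Map each value to -1 (low), 0 (baseline), or +1 (high).

    Assumed rule: the lowest and highest `fraction` of values in the
    profile are labeled -1 and +1; everything else is labeled 0.
    """
    ranked = sorted(profile)
    k = max(1, int(len(ranked) * fraction))
    low_cut = ranked[k - 1]
    high_cut = ranked[-k]
    states = []
    for value in profile:
        if value <= low_cut and value < high_cut:
            states.append(-1)
        elif value >= high_cut and value > low_cut:
            states.append(1)
        else:
            states.append(0)  # includes every value of a flat profile
    return states


def shared_conditions(states_a, states_b):
    """Indices of conditions where two genes share the same non-zero state."""
    return [
        i
        for i, (a, b) in enumerate(zip(states_a, states_b))
        if a != 0 and a == b
    ]


# Hypothetical expression profiles over eight conditions (not real data).
gene_1 = [0.1, 5.0, 5.2, 4.9, 9.8, 5.1, 0.3, 9.5]
gene_2 = [1.0, 6.0, 6.1, 5.9, 12.0, 6.2, 1.2, 11.0]

states_1 = discretize(gene_1)
states_2 = discretize(gene_2)
print(states_1)
print(shared_conditions(states_1, states_2))
```

Working on discrete states rather than raw values lets two genes with different expression scales be recognized as behaving alike under the same subset of conditions, which is the kind of pattern a biclustering method searches for.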

Publications

Type: Journal Articles Status: Published Year Published: 2018 Citation: Adam McDermaid, Xin Chen, Yiran Zhang, Cankun Wang, Juan Xie, Qin Ma, A new machine learning-based framework for mapping uncertainty analysis in RNA-Seq read alignment and gene expression estimation. Frontiers in Genetics. https://doi.org/10.3389/fgene.2018.00313.
Type: Journal Articles Status: Published Year Published: 2018 Citation: Juan Xie, Anjun Ma, Anne Fennell, Jing Zhao, Qin Ma, A comprehensive review of the biclustering application in addressing biological and biomedical problems. Briefings in Bioinformatics. https://doi.org/10.1093/bib/bby014. 2018.
Type: Journal Articles Status: Published Year Published: 2018 Citation: Adam McDermaid, Brandon Monier, Jing Zhao, Bingqiang Liu, Qin Ma, Interpretation of differential gene expression results of RNA-Seq data: review and integration. Briefings in Bioinformatics. https://doi.org/10.1093/bib/bby067. 2018.
Type: Journal Articles Status: Published Year Published: 2018 Citation: Sen Liang, Anjun Ma, Yan Wang, Qin Ma, Paired Data Feature Selection Methods for Gene Expression Data Analysis: A Comprehensive Review. Computational and Structural Biotechnology Journal. DOI: https://doi.org/10.1016/j.csbj.2018.02.005. 2018.
Type: Journal Articles Status: Published Year Published: 2018 Citation: Fang Zhang, Anjun Ma, Zhao Wang, Qin Ma, Bingqiang Liu, Lan Huang, Yan Wang, A Central Edge Selection Based Overlapping Community Detection Algorithm for the Detection of Overlapping Structures in Protein-Protein Interaction Networks. Molecules, 23(10), 2633; DOI: 10.3390/molecules23102633. 2018.
Type: Journal Articles Status: Published Year Published: 2018 Citation: Siyu Han, Yanchun Liang, Qin Ma, Cankun Wang, Yangyi Xu, Yu Zhang, Wei Du and Ying Li, LncFinder: an integrated package for long non-coding RNA identification utilizing sequence intrinsic composition, structural information and physicochemical property. Briefings in Bioinformatics. bby065, https://doi.org/10.1093/bib/bby065. 2018.
Type: Journal Articles Status: Published Year Published: 2018 Citation: Yu Zhang, Sha Cao, Jing Zhao, Qin Ma, Chi Zhang, MRHCA: a nonparametric statistics-based method for hub and co-expression module identification in large gene co-expression network. Quantitative Biology. DOI: https://doi.org/10.1007/s40484-018-0131-z. 2018.
Type: Journal Articles Status: Published Year Published: 2018 Citation: Surendra Neupane, Qin Ma, Madhav P. Nepal, Febina Mathew, Adam Varenhorst, Ethan J. Andersen, Comparative Analysis of TNL Disease Resistance Proteins in Soybean (Glycine max) and Common Bean (Phaseolus vulgaris). Biochemical Genetics, doi: 10.1007/s10528-018-9851-z. 2018.
Type: Journal Articles Status: Published Year Published: 2018 Citation: Bingqiang Liu, Ling Han, Xiangrong Liu, Jichang Wu, Qin Ma, Computational prediction of sigma-54 promoters in bacterial genomes by integrating motif finding and machine learning strategies. IEEE/ACM Transactions on Computational Biology and Bioinformatics, DOI: 10.1109/TCBB.2018.2816032. 2018.
Type: Journal Articles Status: Published Year Published: 2018 Citation: Guoqing Liu, Qin Ma, Ying Xu, Physical properties of DNA may direct the binding of nucleoid-associated proteins along the E. coli genome. Mathematical Biosciences. DOI: 10.1016/j.mbs.2018.03.026. 2018.
Type: Journal Articles Status: Published Year Published: 2018 Citation: Xin Chen, Anjun Ma, Hanyuan Zhang, Chao Liu, Huansheng Cao, Yan Wang, Qin Ma, RECTA: Regulon Identification Based on Comparative Genomics and Transcriptomics analysis. Genes. 9(6), 278. doi: https://doi.org/10.3390/genes9060278. 2018.
Type: Other Status: Other Year Published: 2018 Citation: Qin Ma. 2018. Development and application of computational methods driven by addressing genomic and transcriptomic questions. September 10. Indiana University, Indianapolis, IN. (Invited by Dr. Chi Zhang).
Type: Other Status: Other Year Published: 2018 Citation: Juan Xie. 2018. Hypothesis-driven and discovery-driven analysis of Grapevine expression data. Plant & Animal Genome Conference, Jan. 14-18. San Diego, CA.
Type: Conference Papers and Presentations Status: Published Year Published: 2018 Citation: Cankun Wang, Qin Ma. 2018. Combining computational methods and experimental data for Motif prediction. Plant Science Research Day, April 26. South Dakota State University, Brookings, SD.
Type: Conference Papers and Presentations Status: Published Year Published: 2018 Citation: Anjun Ma, Qin Ma. 2018. Bioinformatics and Mathematical Biosciences Lab, Faculty Excellence Showcase on Celebration of Faculty Excellence. February 21. Brookings, SD

Progress 10/01/17 to 09/30/18

Outputs
Target Audience: Research community. The new insights gained and new tools developed in this project will enable a large community of plant biology researchers to conduct a broad range of new studies that are currently not feasible. Meanwhile, the knowledge gained from this project will be useful to any scientist studying living cells - bacterial, plant, animal, and human cells. It will lead to a better qualitative understanding of gene expression for all the sequenced plant genomes. General public. A broader, non-scientist audience will be reached through (1) oral/poster presentations at national and local conferences; (2) writing and publication of a book chapter on "Big Data Analytics in Genomics"; and (3) developing and teaching a new undergraduate/graduate course on the same topic. Graduate and undergraduate students. The students involved in this project will receive better interdisciplinary STEM training (Science, Technology, Engineering, and Math). The training will enable the students to be better prepared for the rapidly expanding biotech industry, meeting the demands of interdisciplinary academic training. These activities will contribute to developing a globally competitive STEM workforce, strengthened STEM education programs at SDSU, and an increased number of students pursuing STEM higher education and careers.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? During this project I provided students the opportunity for interdisciplinary and cutting-edge training in computational systems biology. Four graduate students and three undergraduate students were trained on this project. 1. Adam McDermaid. Ph.D. student. (2016.03 - 2018.05). Assisted with the RNA-seq data analysis and modeling in Goal 1, and is now working with me as a Postdoc. 2. Jinyu Yang. M.S. student. (2016.01 - 2017.12).
Assisted with the phylogenetic footprinting framework for accurate motif predictions and web server design in Goal 3. 3. Juan Xie. M.S. student. (2016.08 - 2018.09). Involved in the biclustering R package development and new biclustering algorithm design in Goal 2. Is now working with me as a Ph.D. student. 4. Anjun Ma. M.S. student. (2017.08 - present). Involved in the new biclustering algorithm design for single-cell RNA-seq data analysis in Goal 2. 5. Cankun Wang. M.S. student. (2018.01 - present). Involved in the motif prediction web server design in Goal 3. 6. Yiran Zhang. Undergraduate student. (2016.04 - present). Learned big biological data analysis skills in Goal 1 (e.g., data retrieval from reputable databases and RNA-seq data analysis and modeling through computer programming). 7. Minxuan Sun. Undergraduate student. (2017.08 - present). Learned to use a deep learning pipeline and apply it to genomic sequences in Goal 3. I am teaching a graduate-level class in fall 2018, named "Next Generation Sequencing (NGS) Data Analysis (PS-735-S01)". There are thirteen graduate students enrolled in this course, coming from Agronomy, Horticulture and Plant Science, Biology and Microbiology, Biochemistry, and Mathematics and Statistics. In this class, students will be exposed to general and advanced computational techniques for NGS data analysis, current public databases, and major bioinformatics algorithms and programs. Using real data as examples, a project-based strategy will be adopted throughout the class so that students can understand algorithms in the context of solving biological problems.
How have the results been disseminated to communities of interest? These results have been disseminated through 14 journal articles, five invited lectures, and two conference posters. All of them are listed in the "products" section.
What do you plan to do during the next reporting period to accomplish the goals? Overall, I will continue progress on funded projects that share the common goal of developing advanced computational techniques in support of systems-level understanding of critical biological problems; continue training graduate and undergraduate students in bioinformatics, mathematical modeling, computational programming, and biological data analysis; keep publishing manuscripts in refereed journals; and try to turn these preliminary results into internal and external grant proposals.
Goal One: Development of a de novo single-cell RNA-seq analysis pipeline for accurate read mapping and estimation of the expression levels of all genes. Our plan to approach this issue is (1) to consider external information in the form of known co-expressed gene information; (2) to mathematically evaluate the gene expression uncertainty and dropouts through a regression framework; and (3) to classify the genes with expression uncertainty into multiple categories based on a mixture model fitting strategy. Furthermore, to properly use the co-expression information, a statistical model must be created and employed to provide a precise value indicating the preferred gene location for any read or combination of reads.
Goal Two: Development of a novel biclustering algorithm for analyses of gene expression data generated from single-cell RNA-Seq data. Develop a novel biclustering algorithm and program for large-scale gene expression data analysis and apply these techniques to gene regulatory network (GRN) construction. Eventually, provide a more reliable elucidation of the transcriptional regulatory mechanisms encoded in a genome.
Goal Three: Development of a deep-learning framework for motif identification integrated with the ChIP-seq data modeling.
To further elucidate the underlying regulatory mechanism, the PI will develop a deep learning (DL) framework for motif prediction, mainly integrating ChIP-seq peaks and DNA shape. A weighted two-stage alignment algorithm, considering the peak signals and the motif conservation property, will be designed to reduce the high noise level in ChIP-seq peak calling. The gated convolutional neural network and DNA-shape information will be organically integrated to improve the DL performance in motif identification. Overall, a deeper understanding of gene expression and TFBSs from this project will ultimately improve the effectiveness and efficiency of the NGS data utilization in transcriptional regulation.
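The idea that a convolutional layer can act as a motif detector can be illustrated with a minimal sketch: a DNA sequence is one-hot encoded and a single filter is slid along it, giving one score per position. This is not the framework described in this report; gating, DNA-shape features, and ChIP-seq peak weighting are all omitted, and the filter weights, toy sequence, and function names are hypothetical.

```python
BASES = "ACGT"


def one_hot(sequence):
    """Encode a DNA string as rows of four 0/1 values (A, C, G, T).

    Any other character (e.g., N) becomes an all-zero row.
    """
    return [
        [1.0 if base == b else 0.0 for b in BASES]
        for base in sequence.upper()
    ]


def convolve(encoded, kernel):
    """Slide a (width x 4) kernel along the sequence; one score per offset."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = 0.0
        for offset in range(width):
            for channel in range(4):
                score += encoded[start + offset][channel] * kernel[offset][channel]
        scores.append(score)
    return scores


def best_hit(sequence, kernel):
    """Return the offset and score of the highest-scoring window."""
    scores = convolve(one_hot(sequence), kernel)
    position = max(range(len(scores)), key=scores.__getitem__)
    return position, scores[position]


# Hypothetical filter that rewards the 4-mer TATA (columns are A, C, G, T).
toy_kernel = [
    [-1.0, -1.0, -1.0, 1.0],  # prefers T
    [1.0, -1.0, -1.0, -1.0],  # prefers A
    [-1.0, -1.0, -1.0, 1.0],  # prefers T
    [1.0, -1.0, -1.0, -1.0],  # prefers A
]

position, score = best_hit("GCGCTATAGCGC", toy_kernel)
print(position, score)
```

In a trained network the filter weights are learned from data rather than written by hand, and many filters are applied in parallel; the sketch shows only the scoring step that makes a filter behave like a position weight matrix.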

Impacts
What was accomplished under these goals?
Goal One: Development of a de novo RNA-seq analysis pipeline for accurate read mapping and estimation of the expression levels of all genes (90% accomplished). An automatic RNA-seq analysis (SeqTU) and evaluation pipeline has been developed with a user-friendly online interface. The pipeline achieved an accuracy of ~0.95 in cross-validation on multiple RNA-seq datasets of Escherichia coli and Clostridium thermocellum. The corresponding manuscript has been published in the journal Scientific Reports. This pipeline will offer a clear and logical interface to display how it works and what it will deliver. An improved version of this pipeline (GeneQC) focuses on the read mapping uncertainty issue and aims to give a comprehensive evaluation framework of all the genes in a genome through mathematical and statistical modeling. A paper describing this work has been published in Frontiers in Genetics. It reports the list of genes whose expression levels are accurately estimated and the list of genes whose expression levels are biased by ambiguous reads. A detailed report summary of all the deliverables will be generated for users. This pipeline has been used in several collaborations: (i) with Dr. Michael Udvardi from the Noble Foundation to identify nitrogen conservation genes and underlying changes in genome activity associated with annual senescence in perennial switchgrass (leading to a publication in New Phytologist); and (ii) with Dr. Michael Wisniewski from USDA to determine which genes, if any, are differentially expressed between strains of apple trees that have shown cold-hardiness traits and a set of control samples (leading to a joint publication that is ready for submission).
Goal Two: Development of a novel biclustering algorithm for analyses of gene expression data generated from RNA-Seq data (90% accomplished).
An R/Bioconductor implementation of biclustering has been published in the journal Bioinformatics. This R/Bioconductor package performs much better than other similar existing packages in efficiency and prediction accuracy, based on systematic evaluation using gene expression data from bacteria, plants, and cancer. It has two unique features: (i) an 82% average improvement in efficiency, obtained by refactoring and optimizing the source C code of QUBIC; and (ii) a set of comprehensive functions to facilitate biclustering-based biological studies, including the qualitative representation (discretization) of expression data, query-based biclustering, bicluster expansion, bicluster comparison, heatmap visualization of any identified biclusters, and co-expression network elucidation. This tool has been cited 14 times since 2017. A novel biclustering algorithm for bulk RNA-Seq data and single-cell RNA-Seq data has been developed, and the paper has been submitted to Nature Methods for peer review. The intellectual merit of the study rests in its endeavor to detect and understand underlying regulatory mechanisms through modeling and analyses of gene expression and transcription factor binding sites (Goal 3). The PI will design a reliable qualitative representation of gene expression to reflect different expression states corresponding to various regulatory signals, where the unquantifiable errors in RNA-seq data will be handled by a rigorous truncated model. An information-divergence function will be implemented in a graph-theory-based biclustering framework to identify statistically significant and biologically meaningful co-expression gene modules.
Goal Three: Development of a computational framework for motif identification integrated with the ChIP-seq data modeling (80% accomplished). Identification of transcription factor binding sites and cis-regulatory motifs is a frontier on which the rules governing TF-DNA binding are being revealed.
We have developed an integrative phylogenetic footprinting framework for accurate motif predictions in prokaryotic genomes (MP3). This study has been published in the journal BMC Genomics. We have applied the framework to the Escherichia coli k12 genome and evaluated the prediction performance through comparison with seven existing programs. This evaluation was systematically carried out at the nucleotide and binding site levels, and the results showed that MP3 consistently outperformed other popular motif finding tools. We have integrated MP3 into our motif identification and analysis server DMINDA, allowing users to identify and analyze motifs more efficiently. This server has been published in Bioinformatics and contains: (i) five motif prediction and analysis algorithms, including a phylogenetic footprinting framework; (ii) 2,125 species with complete genomes to support the above five functions, covering animals, plants, and bacteria; and (iii) bacterial regulon prediction and visualization. We have developed a new method for cis-regulatory motif prediction based on deep neural networks and the binomial distribution model (called DESSO), which is under review in Genome Research. DESSO outperformed existing tools, including DeepBind, in predicting motifs in 690 human ENCODE ChIP-Sequencing datasets. Furthermore, we demonstrated that the deep-learning framework of DESSO expands motif discovery beyond the state of the art by allowing identification of known and new protein-protein-DNA tethering interactions in human TFs. Specifically, 61 putative tethering interactions were identified among the 100 TFs in the K562 cell line. In this work we further demonstrated the power of DESSO by integrating detection of DNA shape features. We found that shape information has strong predictive power for TF-DNA binding and provides new putative shape motif information for human TFs.
Thus, DESSO and the analyses it enables represent a potent improvement in the identification of TF binding sites by accommodating the complexities of DNA binding in a deep-learning framework. DESSO significantly outperforms other popular tools, e.g., DeepBind, Basset, and MEME-ChIP, in terms of the similarity assessment against validated motifs, all achieving Wilcoxon test p-values < 1 × 10^-3. We believe that this work will be of keen interest to genomics researchers and will have a radiating impact on the field. The source code, predicted motifs, and TFBSs from the 690 ENCODE TF ChIP-Seq datasets are freely available at the DESSO web server: http://bmbl.sdstate.edu/DESSO.
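
To make the notion of a cis-regulatory motif concrete, the following is a minimal sketch of classical position weight matrix (PWM) scanning. It is not DESSO or MP3, whose internals are not described in this report; the count matrix, pseudocount, uniform background, and all function names are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical 4-position motif given as per-position base counts (illustration only).
COUNTS = [
    {"A": 8, "C": 1, "G": 0, "T": 1},
    {"A": 0, "C": 9, "G": 1, "T": 0},
    {"A": 1, "C": 0, "G": 9, "T": 0},
    {"A": 0, "C": 1, "G": 0, "T": 9},
]
BACKGROUND = 0.25   # assumed uniform background base frequency
PSEUDOCOUNT = 1.0   # assumed pseudocount to avoid log(0)


def build_pwm(counts, pseudocount=PSEUDOCOUNT, background=BACKGROUND):
    """Convert base counts into log2-odds scores per motif position."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column[base] + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def scan(sequence, pwm):
    """Return (start, score) for every motif-width window of the sequence."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        hits.append((start, score))
    return hits


def best_hit(sequence, pwm):
    """Return the highest-scoring (start, score) window."""
    return max(scan(sequence, pwm), key=lambda hit: hit[1])
```

For example, `best_hit("TTACGTAA", build_pwm(COUNTS))` picks the window starting at index 2, which is the consensus `ACGT` of the toy matrix.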

Publications

Type: Journal Articles Status: Published Year Published: 2018 Citation: Juan Xie, Anjun Ma, Anne Fennell, Jing Zhao, Qin Ma. A comprehensive review of the biclustering application in addressing biological and biomedical problems. Briefings in Bioinformatics, 2018. https://doi.org/10.1093/bib/bby014.
Type: Journal Articles Status: Awaiting Publication Year Published: 2018 Citation: Baoguang Tian, Xue Wu, Cheng Chen, Wenying Qiu, Qin Ma, Bin Yu. Predicting protein-protein interactions based on multi-information fusion using wavelet denoising and support vector machine. Journal of Theoretical Biology. 2018
Type: Conference Papers and Presentations Status: Awaiting Publication Year Published: 2018 Citation: Xing Shi, Zhancheng Gao, Qiang Lin, Liping Zhao, Qin Ma, Yu Kang, Jun Yu. Meta-analysis Reveals Potential Influence of Oxidative Stress on the Airway Microbiomes of Cystic Fibrosis Patients. Genomics, Proteomics & Bioinformatics. 2018
Type: Journal Articles Status: Accepted Year Published: 2018 Citation: Fang Zhang*, Anjun Ma*, Zhao Wang, Qin Ma, Bingqiang Liu, Lan Huang, Yan Wang. A Central Edge Selection Based Overlapping Community Detection Algorithm for the Detection of Overlapping Structures in Protein-Protein Interaction Networks. 2018. Molecules.
Type: Journal Articles Status: Published Year Published: 2018 Citation: Xiaolan Rao, Xin Chen, Hui Shen, Qin Ma, Guifen Li, Yuhong Tang, Maria Pena, William York, Taylor Frazier, Scott Lenaghan, Xirong Xiao, Fang Chen and Richard A. Dixon. Gene Regulatory Networks for Lignin Biosynthesis in Switchgrass (Panicum virgatum). Plant Biotechnology Journal. 2018 Aug 22. DOI: 10.1111/pbi.13000.
Type: Journal Articles Status: Published Year Published: 2018 Citation: Siyu Han, Yanchun Liang, Qin Ma, Cankun Wang, Yangyi Xu, Yu Zhang, Wei Du and Ying Li. LncFinder: an integrated package for long non-coding RNA identification utilizing sequence intrinsic composition, structural information and physicochemical property. Briefings in Bioinformatics. 2018. bby065, https://doi.org/10.1093/bib/bby065.
Type: Journal Articles Status: Published Year Published: 2018 Citation: Adam McDermaid, Brandon Monier, Jing Zhao, Bingqiang Liu, and Qin Ma. Interpretation of differential gene expression results of RNA-Seq data: review and integration. Briefings in Bioinformatics. 2018. https://doi.org/10.1093/bib/bby067. 06 August 2018.
Type: Journal Articles Status: Published Year Published: 2018 Citation: Yu Zhang, Sha Cao, Jing Zhao, Qin Ma$, and Chi Zhang. MRHCA: a nonparametric statistics-based method for hub and co-expression module identification in large gene co-expression network. Quantitative Biology. 2018. DOI: https://doi.org/10.1007/s40484-018-0131-z.
Type: Journal Articles Status: Published Year Published: 2018 Citation: Sen Liang, Anjun Ma, Yan Wang$, Qin Ma. Paired Data Feature Selection Methods for Gene Expression Data Analysis: A Comprehensive Review. Computational and Structural Biotechnology Journal, 2018. DOI: https://doi.org/10.1016/j.csbj.2018.02.005.
Type: Journal Articles Status: Published Year Published: 2018 Citation: Surendra Neupane, Qin Ma, Madhav P. Nepal, Febina Mathew, Adam Varenhorst, and Ethan J. Andersen. Comparative Analysis of TNL Disease Resistance Proteins in Soybean (Glycine max) and Common Bean (Phaseolus vulgaris). Biochemical Genetics, 2018 Mar 2. doi: 10.1007/s10528-018-9851-z.
Type: Journal Articles Status: Published Year Published: 2018 Citation: Bingqiang Liu, Ling Han, Xiangrong Liu, Jichang Wu, Qin Ma. Computational prediction of sigma-54 promoters in bacterial genomes by integrating motif finding and machine learning strategies. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 2018. DOI: 10.1109/TCBB.2018.2816032.
Type: Journal Articles Status: Published Year Published: 2018 Citation: Guoqing Liu, Qin Ma, Ying Xu. Physical properties of DNA may direct the binding of nucleoid-associated proteins along the E. coli genome. Mathematical Biosciences. 2018. DOI: 10.1016/j.mbs.2018.03.026.
Type: Conference Papers and Presentations Status: Other Year Published: 2018 Citation: Anjun Ma, Qin Ma. 2018. Bioinformatics and Mathematical Biosciences Lab, Faculty Excellence Showcase on Celebration of Faculty Excellence. Brookings, SD, February 21, 2018.
Type: Journal Articles Status: Published Year Published: 2018 Citation: Xin Chen*, Anjun Ma*, Hanyuan Zhang, Chao Liu, Huansheng Cao, Yan Wang, Qin Ma. RECTA: Regulon Identification Based on Comparative Genomics and Transcriptomics analysis. Genes. 2018, 9(6), 278; https://doi.org/10.3390/genes9060278
Type: Conference Papers and Presentations Status: Other Year Published: 2018 Citation: Qin Ma. 2018. Development and application of computational methods driven by addressing genomic and transcriptomic questions. September 10th, 2018. Indiana University, Indianapolis, IN 46202. (Invited by Dr. Chi Zhang)
Type: Conference Papers and Presentations Status: Other Year Published: 2018 Citation: Juan Xie. 2018. Hypothesis-driven and discovery-driven analysis of Grapevine expression data. January 16th, 2018. Plant & Animal Genome Conference, Jan. 14-18, San Diego, CA, USA.
Type: Conference Papers and Presentations Status: Other Year Published: 2017 Citation: Qin Ma. 2017. Integrated and systematic views of regulatory DNA motif identification and analyses. Department of Computer Science, Jilin University. December 12th, 2017. (Invited by Prof. Yan Wang)
Type: Conference Papers and Presentations Status: Other Year Published: 2017 Citation: Qin Ma. 2017. RNA Sequencing Analyses & Mapping Uncertainty, August 25th, 2017, the University of California, Davis campus in Davis, California. (The Y1 meeting for the NSF Plant Genome Research Program).
Type: Conference Papers and Presentations Status: Other Year Published: 2018 Citation: Cankun Wang, Qin Ma. 2018. Combining computational methods and experimental data for Motif prediction. Plant Science Research Day, April 26th, 2018. South Dakota State University.

Progress 10/01/16 to 09/30/17

Outputs
Target Audience: Research community. The new insights gained and new tools developed in this project will enable a large community of plant biology researchers to conduct a broad range of new studies that are currently not feasible. Meanwhile, the knowledge gained from this project will be useful to any scientist studying living cells - bacteria, plants, animals and human cells. It will lead to a better qualitative understanding of gene expression for all the sequenced plant genomes. General public. A broader and non-scientist audience will be reached through (1) the oral/poster presentations at national and local conferences; (2) writing and publication of a book chapter on "Big Data Analytics in Genomics"; and (3) developing and teaching a new undergraduate/graduate course on the same topic. Graduate and undergraduate students. The students involved in this project will receive better interdisciplinary STEM training (Science, Technology, Engineering, and Math). The training will enable the students to be better prepared in the rapidly expanding biotech industry, meeting the demands of interdisciplinary academic training. These activities will contribute to developing a globally competitive STEM workforce, strengthened STEM education programs at SDSU, and an increased number of students pursuing STEM higher education and careers. Changes/Problems: Nothing Reported What opportunities for training and professional development has the project provided? During this project, I provided students the opportunity for interdisciplinary and cutting-edge training in computational systems biology. Four graduate students and three undergraduate students were trained on this project. 1. Adam McDermaid. Ph.D. student. (2016.03 - present). Assisted with the RNA-seq data analysis and modeling in Goal 1. 2. Jinyu Yang. M.S. student. (2016.01 - 2017.12). Assisted with the phylogenetic footprinting framework for accurate motif predictions and web server design in Goal 3. 3. Juan Xie.
M.S. student. (2016.08 - present). Involved in the biclustering R package development and new biclustering algorithm design in Goal 2. 4. Anjun Ma. M.S. student. (2017.08 - present). Involved in the new biclustering algorithm design for single-cell RNA-seq data analysis in Goal 2. 5. Xiaozhu Jin. Undergraduate student. (2016.04 - 2017.05). Learned gene expression data analysis skills in Goal 1 and Goal 2. 6. Yiran Zhang. Undergraduate student. (2016.04 - present). Learned big biological data analysis skills in Goal 1, e.g., data retrieval from reputable databases and RNA-seq data analysis and modeling through computer programming. 7. Minxuan Sun. Undergraduate student. (2017.08 - present). Learned to use a deep learning pipeline and apply it to genomic sequences in Goal 3. I am teaching a graduate-level class in fall of 2017, named "Next Generation Sequencing (NGS) Data Analysis (PS-792-S03)". There are twenty-three graduate students enrolled in this course, coming from Agronomy, Horticulture and Plant Science, Biology and Microbiology, Biochemistry, Animal Science, Computer Sciences, and Mathematics and Statistics. In this class, students will be exposed to general/advanced computational techniques for NGS data analysis, current public databases, major bioinformatics algorithms and programs. Using real data as examples, a project-based strategy will be adopted throughout the class so that students can understand algorithms in the context of solving the biological problems. How have the results been disseminated to communities of interest? These results have been disseminated by nine journal articles, 11 invited lectures, and eight conference posters. All of them have been listed in the "other products" section.
What do you plan to do during the next reporting period to accomplish the goals? Overall, I will continue progress on funded projects which have the common goal of developing advanced computational techniques in support of systems-level understanding of critical biological problems; continue training graduate and undergraduate students in bioinformatics, mathematical modeling, computational programming, and biological data analysis; keep publishing manuscripts in refereed journals; and try to turn these preliminary results into internal and external grant proposals. Goal 1: Development of a de novo RNA-seq analysis pipeline for accurate read mapping and estimation of the expression levels of all genes. Our plan to approach this issue is (1) to consider external information in the form of known co-expressed gene information; (2) to mathematically evaluate the gene expression uncertainty through a regression framework; and (3) to classify the genes with expression uncertainty into multiple categories based on a mixture model fitting strategy. Furthermore, to properly use the co-expression information, a statistical model must be created and employed to provide a precise value indicating the preferred gene location for any read or combination of reads. Goal 2: Development of a novel biclustering algorithm for analyses of gene expression data generated from RNA-Seq data. Develop a novel biclustering algorithm and program for large-scale gene expression data analysis, and apply these techniques to gene regulatory network (GRN) construction. Eventually, provide a more reliable elucidation of the transcriptional regulatory mechanisms encoded in a genome. Goal 3: Development of a computational framework for motif identification integrated with the ChIP-seq data modeling. To further elucidate the underlying regulatory mechanism, the PI will develop a deep learning (DL) framework for motif prediction, mainly integrating ChIP-seq peaks and DNA shape.
A weighted two-stage alignment algorithm, considering the peak signals and the motif conservation property, will be designed to reduce the high noise level in ChIP-seq peak calling. The gated convolutional neural network and DNA-shape information will be organically integrated to improve the DL performance in motif identification. Overall, a deeper understanding of gene expression and TFBSs from this project will ultimately improve the effectiveness and efficiency of the NGS data utilization in transcriptional regulation.
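
As a rough illustration of step (3) of the Goal 1 plan, the sketch below sorts genes into categories by their share of ambiguously mapped reads. It deliberately uses a fixed cutoff as a simple stand-in for the mixture model fitting strategy named in the plan; the cutoff value, input layout, and function name are hypothetical and are not taken from GeneQC.

```python
def classify_genes(read_counts, threshold=0.5):
    """Split genes by the fraction of ambiguously mapped reads.

    read_counts maps gene -> (unique_reads, ambiguous_reads).
    threshold is a hypothetical cutoff, not a value from the report;
    a fitted mixture model would replace this fixed cutoff.
    """
    reliable, uncertain = [], []
    for gene, (unique, ambiguous) in sorted(read_counts.items()):
        total = unique + ambiguous
        if total == 0:
            continue  # no mapped reads, so there is nothing to evaluate
        fraction = ambiguous / total
        if fraction > threshold:
            uncertain.append(gene)
        else:
            reliable.append(gene)
    return reliable, uncertain
```

Calling `classify_genes({"geneA": (90, 10), "geneB": (20, 80)})` places `geneA` in the reliable list and `geneB` in the uncertain list.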

Impacts
What was accomplished under these goals? Overall, I have addressed departmental, college and University strategic goals in my scholarship by seeking and receiving internal and external funds; and in my research by focusing on developing and applying bioinformatics techniques in the discovery of essential biology insights from large-scale Omics data. My research projects, funded proposals, collaborations, publications and multiple presentations provide evidence of focused research, scholarship and creative activity outcomes. It is noteworthy that the number of all my citations has kept increasing over the past five years, and these studies have attracted worldwide attention. Three specific goals in this project and their progress are listed below. Goal 1: Development of a de novo RNA-seq analysis pipeline for accurate read mapping and estimation of the expression levels of all genes (60% accomplished). An automatic RNA-seq analysis (SeqTU) and evaluation pipeline has been developed with a user-friendly online interface. The pipeline achieved an accuracy of ~0.95 on multiple RNA-seq datasets of Escherichia coli and Clostridium thermocellum under five-fold cross-validation. The corresponding manuscript has been published in the journal Scientific Reports. This pipeline will offer a clear and logical interface to display how it works and what it will deliver. An improved version of this pipeline (GeneQC) focuses on the read mapping uncertainty issue and aims to give a comprehensive evaluation framework of all the genes in a genome, through mathematical and statistical modeling. It will report the list of genes whose expression levels are accurately estimated and the list of genes whose expression levels are biased by ambiguous reads. A detailed summary report of all the deliverables will be generated for users. The corresponding manuscript is currently ready for submission to the journal Bioinformatics.
This pipeline has been used in several collaborations: (i) with Dr. Michael Udvardi from the Noble Foundation to identify nitrogen conservation and underlying changes in genome activity associated with annual senescence in perennial Switchgrass; and (ii) with Dr. Michael Wisniewski from USDA to determine which genes, if any, are differentially expressed between strains of apple trees that have shown cold-hardiness traits compared to a set of control samples. Goal 2: Development of a novel biclustering algorithm for analyses of gene expression data generated from RNA-Seq data (60% accomplished). An R/Bioconductor implementation of biclustering has been published in the journal Bioinformatics, which performs much better than other similar existing packages regarding efficiency and prediction accuracy based on systematic evaluation using gene expression data from bacteria, plants, and cancer. It has two unique features: (i) an average efficiency improvement of 82%, achieved by refactoring and optimizing the source C code of QUBIC; and (ii) a set of comprehensive functions to facilitate biclustering-based biological studies, including the qualitative representation (discretization) of expression data, query-based biclustering, bicluster expansion, bicluster comparison, heatmap visualization of any identified biclusters, and co-expression network elucidation. A novel biclustering algorithm for RNA-seq data and single-cell RNA-seq data has been developed. The intellectual merit of the study rests in its endeavor to detect and understand underlying regulatory mechanisms through modeling and analyses of gene expression and transcription factor binding sites (Goal 3). The PI will design a reliable qualitative representation of the gene expression to reflect different expression states corresponding to various regulatory signals, where the unquantifiable errors in RNA-seq data will be handled by a rigorous truncated model.
An information-divergence function will be implemented in a graph-theory-based biclustering framework to identify statistically significant and biologically meaningful co-expression gene modules. Goal 3: Development of a computational framework for motif identification integrated with the ChIP-seq data modeling (50% accomplished). We have developed an integrative phylogenetic footprinting framework for accurate motif predictions in prokaryotic genomes (MP3). This study has been published in the journal of BMC Genomics. We have applied the framework to the Escherichia coli k12 genome and evaluated the prediction performance through comparison with seven existing programs. This evaluation was systematically carried out at the nucleotide and binding site level, and the results showed that MP3 consistently outperformed other popular motif finding tools. We have integrated MP3 into our motif identification and analysis server DMINDA, allowing users to identify and analyze motifs efficiently. This server has been published in Bioinformatics and contains: (i) five motif prediction and analyses algorithms, including a phylogenetic footprinting framework; (ii) 2,125 species with complete genomes to support the above five functions, covering animals, plants, and bacteria; and (iii) bacterial regulon prediction and visualization.
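
The qualitative representation (discretization) of expression data listed under Goal 2 can be pictured with the minimal sketch below, which maps each gene's values to -1, 0, or 1 by rank. This is a generic quantile-style scheme under stated assumptions, not the exact procedure of QUBIC or the R package; the tail fraction `q` and the function names are hypothetical.

```python
def discretize(values, q=0.25):
    """Map one gene's expression values to -1 (low), 0 (baseline), or 1 (high).

    The k = int(n * q) smallest values define the low cutoff and the k largest
    define the high cutoff; ties at a cutoff share its label. q is a
    hypothetical tail fraction chosen for illustration.
    """
    ordered = sorted(values)
    n = len(ordered)
    k = int(n * q)  # number of samples used to define each tail
    if k == 0:
        return [0] * n
    low_cut = ordered[k - 1]
    high_cut = ordered[n - k]
    if low_cut >= high_cut:
        return [0] * n  # too little variation to separate the tails
    return [-1 if v <= low_cut else 1 if v >= high_cut else 0 for v in values]


def discretize_matrix(matrix, q=0.25):
    """Apply discretize() to every gene (row) of a gene -> values mapping."""
    return {gene: discretize(values, q) for gene, values in matrix.items()}
```

Biclustering then searches for gene sets that share the same discrete pattern across a subset of conditions, which is less sensitive to noise in the raw values than comparing the values directly.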

Publications

Type: Journal Articles Status: Published Year Published: 2017 Citation: Huansheng Cao, Qin Ma, Xin Chen, Ying Xu, DOOR: A microbial operon database for gene organization and function discovery. Briefings in Bioinformatics. https://doi.org/10.1093/bib/bbx088.
Type: Journal Articles Status: Published Year Published: 2017 Citation: Sheng-Yong Niu, Jinyu Yang, Adam McDermaid, Jing Zhao, Yu Kang, Qin Ma (corresponding author). Bioinformatics tools for metagenome and metatranscriptome data analysis in microbiome studies. Briefings in Bioinformatics. 2017, 1-15. doi: 10.1093/bib/bbx051.
Type: Journal Articles Status: Published Year Published: 2017 Citation: Jinyu Yang, Xin Chen, Adam McDermaid, and Qin Ma (corresponding author), DMINDA 2.0: integrated and systematic views of regulatory DNA motif identification and analyses. Bioinformatics. 13 April 2017. https://doi.org/10.1093/bioinformatics/btx223.
Type: Journal Articles Status: Published Year Published: 2017 Citation: Bingqiang Liu, Jinyu Yang, Yang Li, Adam McDermaid, Qin Ma (corresponding author). An algorithmic perspective of de-novo cis-regulatory motif finding based on ChIP-seq data. Briefings in Bioinformatics. 2017, 113, doi: 10.1093/bib/bbx026.
Type: Journal Articles Status: Published Year Published: 2017 Citation: Huansheng Cao, Wei Du, Yuedong Yang, Yu Shang, Gaoyang Li, Yaoqi Zhou, Qin Ma, Ying Xu, Systems-Level Understanding of the Impact of Ethanol-Induced Stresses and Adaptation in E. coli. Scientific Reports. Article number: 44150 (2017). doi:10.1038/srep44150.
Type: Journal Articles Status: Published Year Published: 2017 Citation: Xin Chen, Wen-Chi Zhou, Qin Ma (corresponding author) and Ying Xu, SeqTU: A Web Server for RNA-seq based Transcription Unit Identification in Bacteria. Scientific Reports. Article number: 43925 (2017). doi:10.1038/srep43925.
Type: Journal Articles Status: Published Year Published: 2017 Citation: Lina Yuan, Yang Yu, Yanmin Zhu, Yulai Li, Changqing Li, Rujiao Li, Qin Ma, Gilman Kit-Hang Siu, Jun Yu, Taijiao Jiang, Jingfa Xiao, Yu Kang, GAAP: Genome-Organization-Framework-Assisted Assembly Pipeline for the Prokaryotic Genome. BMC Genomics. 2017. 18(Suppl 1):952, DOI: 10.1186/s12864-016-3267-0.
Type: Journal Articles Status: Published Year Published: 2017 Citation: Ying Li, Shi Xiaohu, Liang Yanchun, Juan Xie, Yu Zhang, Qin Ma (corresponding author), RNA-TVcurve: A Web Server for RNA Secondary Structure Comparison based on a Multi-Scale Similarity of its Triple Vector Representation. BMC Bioinformatics (2017) 18:51 DOI 10.1186/s12859-017-1481-7.
Type: Journal Articles Status: Published Year Published: 2016 Citation: G Xu, Y Xia, Y Tang, H Cao, Qin Ma, Y Xu, N Zhang, H Xu, Bibliometric Screening of Helicobacter Pylori Pathogenic Genes Through Pathway Enrichment and Operon Analysis. Clinical laboratory, 2016 Nov 1;62(11):2125-2137. doi: 10.7754/Clin.Lab.2016.160319.

Progress 10/13/15 to 09/30/16

Outputs
Target Audience: Research community. The new insights gained and new tools developed in this project will enable a large community of plant biology researchers to conduct a broad range of new studies that are currently not feasible. Meanwhile, the knowledge gained from this project will be useful to any scientist studying living cells - bacteria, plants, animals and human cells. It will lead to a better qualitative understanding of gene expression for all the sequenced plant genomes. General public. A larger and non-scientist audience will be reached through (1) the writing and publication of a book chapter on "Big Data Analytics in Genomics" and (2) developing and teaching a new undergraduate/graduate course on the same topic. Graduate and undergraduate students. The students involved in this project will receive better interdisciplinary STEM training (Science, Technology, Engineering and Math). This will enable the students to be better prepared in the rapidly expanding biotech industry, meeting the demands of interdisciplinary academic training. These activities will contribute to developing a globally competitive STEM workforce, strengthened STEM education programs at SDSU, and an increased number of students pursuing STEM higher education and careers. Changes/Problems: Nothing Reported What opportunities for training and professional development has the project provided? During this project, I provided students the opportunity for interdisciplinary and cutting-edge training in computational systems biology. Three graduate students and three undergraduate students were trained on this project. 1. Adam McDermaid. Ph.D. student. (2016.03 - present). Assisted with the RNA-seq data analysis and modeling in Goal 1. 2. Jinyu Yang. M.S. student. (2016.01 - present). Assisted with the phylogenetic footprinting framework for accurate motif predictions in Goal 3. 3. Juan Xie. M.S. student. (2016.08 - present). Involved in the biclustering R package development in Goal 2.
4. Xiaozhu Jin. Undergraduate student. (2016.04 - present). Learned gene expression data analysis skills in Goal 1 and Goal 2. 5. Yiran Zhang. Undergraduate student. (2016.04 - present). Learned big biological data analysis skills in Goal 1, e.g., data retrieval from reputable databases and basic computational analysis through computer programming. 6. Prajwal Khatiwada. Undergraduate student. (2015.09 - 2016.01). Learned to build an RNA-seq data analysis pipeline using existing tools/software in Goal 1. I have created and am teaching a graduate-level class in fall of 2016, named "Next Generation Sequencing (NGS) Data Analysis (PS-792-S04)". There are twenty-five graduate students enrolled in this class, coming from Agronomy, Horticulture and Plant Science, Biology and Microbiology, Biochemistry, Animal Science, and Mathematics and Statistics. In my class, students will be exposed to general/advanced computational techniques for NGS data analysis, up-to-date public databases, major bioinformatics algorithms and programs. Using real data as examples, a project-based strategy will be adopted throughout the class so that students can understand algorithms in the context of solving the biological problems. How have the results been disseminated to communities of interest? These results have been disseminated by 7 invited lectures and 13 conference posters. All of them have been listed in the "other products" section.
What do you plan to do during the next reporting period to accomplish the goals? Overall, I will continue progress on funded projects which have the common goal of developing advanced computational techniques in support of systems-level understanding of important biological problems; continue training graduate and undergraduate students in bioinformatics, mathematical modeling, computational programming, and biological data analysis; keep publishing manuscripts in refereed journals; and try to turn these preliminary results into internal and external grant proposals. Goal 1: Development of a de novo RNA-seq analysis pipeline for accurate read mapping and estimation of the expression levels of all genes. We plan to consider external information in the form of known co-expressed gene information. Certain genes are co-expressed with other genes at certain rates, and knowing these rates, along with other acquired information, can lead to much more certain assignment of these previously ambiguous reads. To properly use the co-expression information, a statistical model must be created and employed to provide a clear value indicating the preferred gene location for any read or combination of reads. Goal 2: Development of a novel biclustering algorithm for analyses of gene expression data generated from RNA-seq data. Develop a novel biclustering algorithm and program for large-scale gene expression data analysis, and apply these techniques to gene regulatory network (GRN) construction. Eventually, provide a more reliable elucidation of the transcriptional regulatory mechanisms encoded in a genome. Goal 3: Development of a phylogenetic footprinting framework for motif identification integrated with the ChIP-seq data modeling. For our new phylogenetic footprinting pipeline, a potential and reasonable improvement is integrating experimental data where available, e.g., chromatin immunoprecipitation followed by sequencing (ChIP-seq).
It is a technique used for genome-wide profiling of DNA-binding proteins, histone modifications, or nucleosomes, and has become an indispensable tool for studying gene regulation, as it can provide transcription factor binding information with higher resolution, less noise, and greater coverage than its array-based predecessors, such as ChIP-chip. However, it cannot replace the computational prediction tools, particularly for prokaryotes. Firstly, very little ChIP-seq data is available for prokaryotes; secondly, ChIP-seq is not suitable for TFs with only a few binding sites; thirdly, the complexity of regulation can also lead to bias because TFs may not bind to their binding sites in certain environments. Specifically, the score curves used in MP3 can be further optimized by integrating the binding signal from ChIP-seq, using machine learning or pattern classification. The ChIP-seq based peaks and CBRs identified by MP3 can be cross-validated against each other in application, aiming to overcome some intrinsic computational challenges in high-throughput data analyses. Upon the availability of large-scale ChIP-seq data in prokaryotes, we believe that the information integration in our framework can further improve the performance in motif prediction and analysis.
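
One simple way to picture the cross-validation between ChIP-seq peaks and computationally predicted regions is an interval-overlap check, sketched below. This is only an assumed illustration, not the MP3 procedure; the half-open (start, end) coordinate convention, the example coordinates, and the function names are hypothetical.

```python
def overlaps(a, b):
    """True when half-open intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] < b[1] and b[0] < a[1]


def supported_predictions(predicted, peaks):
    """Return the predicted regions that overlap at least one ChIP-seq peak."""
    return [region for region in predicted
            if any(overlaps(region, peak) for peak in peaks)]


def support_fraction(predicted, peaks):
    """Fraction of predicted regions backed by a peak (0.0 if none are given)."""
    if not predicted:
        return 0.0
    return len(supported_predictions(predicted, peaks)) / len(predicted)
```

With predicted regions `[(100, 120), (500, 520), (900, 915)]` and peaks `[(90, 110), (910, 1000)]`, two of the three predictions are supported. The same check run in the other direction would flag peaks that lack any predicted site.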

Impacts
What was accomplished under these goals? Overall, I have addressed departmental, college and University strategic goals in my scholarship by seeking and receiving internal and external funds; and in my research by focusing on developing and applying bioinformatics techniques in the discovery of important biology insights from large-scale Omics data. My research projects, funded proposals, collaborations, publications and multiple presentations provide evidence of focused research, scholarship and/or creative activity outcomes. It is noteworthy that the number of all my citations has kept increasing over the past five years, and these studies have attracted worldwide attention. Three specific goals in this project and their progress are listed below. Goal 1: Development of a de novo RNA-seq analysis pipeline for accurate read mapping and estimation of the expression levels of all genes (40% accomplished). An automatic RNA-seq analysis and evaluation pipeline has been developed with a user-friendly online interface. The corresponding manuscript has been accepted by the journal Scientific Reports. This pipeline will offer a clear and logical interface to display how it works and what it will deliver. It will report one list of genes whose mapped reads are mainly unique and another list of genes whose mapped reads are mainly ambiguous. A detailed summary report of all the deliverables will be generated for users. This pipeline has been used in several collaborations with PIs from USDA, Noble Foundation, and USD.
Goal 2: Development of a novel biclustering algorithm for analyses of gene expression data generated from RNA-seq data (40% accomplished). An R implementation of biclustering has been published in the journal Bioinformatics with two unique features: (i) an average efficiency improvement of 82%, achieved by refactoring and optimizing the source C code of QUBIC; and (ii) a set of comprehensive functions to facilitate biclustering-based biological studies, including the qualitative representation (discretization) of expression data, query-based biclustering, bicluster expansion, bicluster comparison, heatmap visualization of any identified biclusters, and co-expression network elucidation. This R package has also been accepted and integrated into Bioconductor, which is the largest community of bioinformatic research in R. Based on systematic evaluation using gene expression data from bacteria, plants, and cancer, our method performs much better than other similar existing packages in terms of efficiency and prediction accuracy. Goal 3: Development of a phylogenetic footprinting framework for motif identification integrated with the ChIP-seq data modeling (30% accomplished). We have developed an integrative phylogenetic footprinting framework for accurate motif predictions in prokaryotic genomes (MP3). This study has been published in the journal BMC Genomics. The framework includes a new orthologous data preparation procedure, an additional promoter scoring and pruning method, and an integration of six existing motif finding algorithms as basic motif search engines. Specifically, we collected orthologous genes from available prokaryotic genomes and built the orthologous regulatory regions based on sequence similarity of promoter regions. This procedure made full use of the large-scale genomic data and taxonomy information and filtered out the promoters with limited contribution to produce a high-quality orthologous promoter set.
The promoter scoring and pruning is implemented through motif voting by a set of complementary prediction tools that mine as many motif candidates as possible while eliminating the effect of random noise. We have applied the framework to the Escherichia coli K-12 genome and evaluated its prediction performance through comparison with seven existing programs. This evaluation was carried out systematically at the nucleotide and binding site levels, and the results showed that MP3 consistently outperformed other popular motif finding tools. We have integrated MP3 into our motif identification and analysis server DMINDA, allowing users to efficiently identify and analyze motifs in 2,072 completely sequenced prokaryotic genomes.
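The idea of promoter scoring and pruning by motif voting can be sketched in Python as follows. This is only an illustration under stated assumptions, not the MP3 scoring method itself: here each tool casts one vote for every promoter in which it reports a motif instance, and promoters with fewer than `min_votes` votes are pruned. The function name, the tool labels, the promoter IDs, and the `min_votes` threshold are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def prune_promoters_by_votes(
    tool_hits: Dict[str, Set[str]],
    promoters: Iterable[str],
    min_votes: int = 2,  # hypothetical threshold, not taken from MP3
) -> List[str]:
    """Keep promoters supported by at least min_votes motif finding tools.

    tool_hits maps a tool label to the set of promoter IDs in which that
    tool reported a motif instance. Each tool votes at most once per promoter.
    """
    votes: Counter = Counter()
    for hits in tool_hits.values():
        votes.update(hits)  # one vote per promoter from this tool
    # A Counter returns 0 for promoters that received no votes at all.
    return [p for p in promoters if votes[p] >= min_votes]


# Placeholder tools and promoters, used only to exercise the function.
tool_hits = {
    "tool_1": {"p1", "p2"},
    "tool_2": {"p1", "p3"},
    "tool_3": {"p1", "p2"},
}
kept = prune_promoters_by_votes(tool_hits, ["p1", "p2", "p3", "p4"])
print(kept)  # ['p1', 'p2']
```

Requiring agreement between several tools reflects the stated aim of the voting step: a promoter supported by only one tool is more likely to reflect random noise than one on which complementary tools agree.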

Publications

Type: Journal Articles Status: Published Year Published: 2015 Citation: Qin Ma*, Xizeng Mao*, Bingqiang Liu, Zheng Chang, Chuan Zhou, Hanyuan Zhang, Ying Xu. Revisiting Operons: An Analysis of the Landscape of Transcriptional Units in E. coli, BMC Bioinformatics, DOI: 10.1186/s12859-015-0805-8. 2015
Type: Journal Articles Status: Published Year Published: 2015 Citation: Xin Chen*, Qin Ma*, Xiaolan Rao, Yuhong Tang, Yan Wang, Gaoyang Li, Chi Zhang, Xizeng Mao, Richard A. Dixon and Ying Xu, Genome-Scale Identification of Cell-Wall Related Genes in Switchgrass through Comparative Genomics and Computational Analyses of Transcriptomic Data, BioEnergy Research, DOI: 10.1007/s12155-015-9674-2. 2015
Type: Journal Articles Status: Published Year Published: 2016 Citation: Guishen Wang, Lan Huang, Yan Wang, Qin Ma, Wei Pang, Link community detection based on linear graphs with a novel link similarity measure, International Journal of Modern Physics B. 30, 1650023 (2016) [18 pages] DOI: http://dx.doi.org/10.1142/S0217979216500235.
Type: Journal Articles Status: Published Year Published: 2016 Citation: Jiading Yang, Eric Worley, Qin Ma, Jun Li, Ivone Torres-Jerez, Yi-Ching Lee, Jiyi Zhang, Nick Krom, Fuqi Liao, Yuhong Tang, Patrick X. Zhao, Michael Udvardi, Nitrogen conservation and underlying changes in genome activity associated with annual senescence in perennial Switchgrass, Panicum virgatum, New Phytologist. DOI: 10.1111/nph.13898. 2016
Type: Journal Articles Status: Published Year Published: 2016 Citation: Bingqiang Liu, Chuan Zhou, Guojun Li, Hanyuan Zhang, Erliang Zeng, Qin Ma (corresponding author), Bacterial regulon modeling and prediction based on systematic cis regulatory motif analyses. Scientific Reports. doi:10.1038/srep23030. 2016
Type: Journal Articles Status: Published Year Published: 2016 Citation: Guishen Wang, Lan Huang, Yan Wang, Wei Pang, and Qin Ma, A link density clustering algorithm based on automatically selecting density peaks for overlapping community detection. International Journal of Modern Physics B. 30, 1650167 (2016) [15 pages] DOI: http://dx.doi.org/10.1142/S0217979216501678.
Type: Journal Articles Status: Published Year Published: 2016 Citation: Bingqiang Liu, Hanyuan Zhang, Chuan Zhou, Guojun Li, Anne Fennell, Guanghui Wang, Yu Kang, Qi Liu and Qin Ma (corresponding author), An integrative and applicable phylogenetic footprinting framework for cis-regulatory motifs identification in prokaryotic genomes. BMC Genomics. DOI: 10.1186/s12864-016-2982-x. 2016
Type: Journal Articles Status: Published Year Published: 2016 Citation: Yu Zhang*, Juan Xie*, Jinyu Yang, Anne Fennell and Qin Ma (corresponding author), QUBIC: a Bioconductor package for qualitative biclustering analysis of gene co-expression data. Bioinformatics. doi: 10.1093/bioinformatics/btw635. 2016