Source: UNIVERSITY OF ARIZONA submitted to
STATISTICAL METHODS IN METAGENOMIC ANALYSIS
Sponsoring Institution
National Institute of Food and Agriculture
Project Status
TERMINATED
Funding Source
Reporting Frequency
Annual
Accession No.
1000555
Grant No.
(N/A)
Project No.
ARZT-1360830-H22-138
Proposal No.
(N/A)
Multistate No.
(N/A)
Program Code
(N/A)
Project Start Date
Oct 1, 2013
Project End Date
Sep 30, 2018
Grant Year
(N/A)
Project Director
An, LI.
Recipient Organization
UNIVERSITY OF ARIZONA
888 N EUCLID AVE
TUCSON,AZ 85719-4824
Performing Department
Agri & Biosystems Engineering
Non Technical Summary
Metagenomics is a relatively new but fast-growing field. It is based on the genomic analysis of microbial DNA that is extracted directly from natural (e.g., soil or water) or host-associated (e.g., animal gut) environmental samples that contain microorganisms organized into communities or microbiomes. It is estimated that more than 99% of microbial species on earth cannot be cultured in the laboratory. The metagenomic approach enables researchers to understand the diversity of microbes, their functions, cooperation, and evolution in a particular ecosystem without culturing the microbes. This technology will probably lead to major advances in agriculture, medicine, energy production, bioremediation, and biodefense. For example, in agricultural studies, understanding the roles of beneficial and harmful microorganisms living in, on, and around domesticated plants and animals can help us detect diseases in crops and livestock and then develop improved farming practices. High-throughput next-generation sequencing technologies provide a powerful tool for metagenomic studies. However, because these technologies produce massive numbers of short DNA sequences, there is an urgent need to develop efficient statistical methods that rapidly and accurately detect the features/functions present in a metagenomic sample/community from the sequencing data and that relate the signature features/functions to the status of the microbial communities in which they reside. No project in the CRIS database focuses on statistical methods development in metagenomics. This study focuses on developing statistical methods for 1) identifying all possible functional roles present in metagenomic samples/communities, 2) detecting functional "biomarker" patterns that are linked to the presence of a biological factor (e.g., a disease or biothreat agent), and 3) predicting the virulence/disease level for a new individual.
This project not only proposes novel methods and algorithms for analyzing metagenomic data in general but also develops standalone software, which will be made publicly available so that researchers can use it to analyze metagenomic samples or to generate simulated sequence reads for metagenomic analysis.
Animal Health Component
0%
Research Effort Categories
Basic
20%
Applied
40%
Developmental
40%
Classification

Knowledge Area (KA)   Subject of Investigation (SOI)   Field of Science (FOS)   Percent
723                   7310                             2090                     50%
712                   4099                             1080                     50%
Goals / Objectives
Metagenomics is the genomic analysis of microbial DNA that is extracted directly from natural (e.g., soil or water) or host-associated (e.g., animal gut) environmental samples that contain microorganisms organized into communities or microbiomes. The metagenomic approach enables us to understand the diversity of microbes, their functions, cooperation, and evolution in a particular ecosystem without culturing the microbes. Building on high-throughput next-generation sequencing, our focus is to develop efficient statistical methods that rapidly and accurately detect the features/functions present in a metagenomic sample/community and that relate the signature features/functions to the status of the microbial communities in which they reside.
Project Methods
Tools for sequence data alignment: BLAST(x). Statistical methods, including beta regression and high-dimensional techniques, will be applied or developed for the analysis of the alignment output.
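As a rough illustration of the kind of quantity such methods operate on, the sketch below converts per-sample alignment hit counts into relative abundances and fits a beta distribution to one feature's proportions by the method of moments. This is not the project's method and it is not beta regression itself; it only shows the beta-distributed proportion that a beta regression would model. The feature names (`func_A`, `func_B`), the counts, and both helper functions are hypothetical.

```python
from statistics import mean, variance


def relative_abundance(counts):
    """Convert raw hit counts for one sample into proportions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no aligned reads")
    return {feature: n / total for feature, n in counts.items()}


def beta_method_of_moments(proportions):
    """Estimate Beta(a, b) parameters from proportions lying in (0, 1)."""
    m = mean(proportions)
    v = variance(proportions)  # sample variance (n - 1 denominator)
    if v <= 0 or v >= m * (1 - m):
        raise ValueError("variance is incompatible with a beta distribution")
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common


# Hypothetical hit counts per functional category for three samples.
samples = [
    {"func_A": 20, "func_B": 80},
    {"func_A": 30, "func_B": 70},
    {"func_A": 40, "func_B": 60},
]

# Proportion of reads assigned to func_A in each sample.
props_A = [relative_abundance(s)["func_A"] for s in samples]

# Beta parameters describing how that proportion varies across samples.
a, b = beta_method_of_moments(props_A)
print(f"func_A proportions: {props_A}, beta fit: a={a:.2f}, b={b:.2f}")
```

A full beta regression would additionally link the mean of this distribution to covariates (for example, disease status), which requires a fitting routine beyond this sketch.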

Progress 10/01/13 to 09/30/18

Outputs
Target Audience: The target audiences are 1) researchers/scientists who investigate microbial communities with applications in biology, health and medical sciences, environmental sciences, and forensic science, and 2) statisticians/computational scientists who develop computational methods or algorithms for metagenomic studies.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? This research will have a significant long-term impact due to the increasing number of microbiome projects. This project has integrated research and education and explicitly addresses cross-disciplinary training at multiple levels. It has provided training and professional development in cutting-edge research for 10+ graduate students. By participating in this research, they learned the new sequencing technology and its associated computational challenges. Under Dr. An's supervision, her lab has offered two summer workshops on metagenomics to graduate and undergraduate students in Statistics, Mathematics, Biosystems Engineering, Pharmacy, Public Health, and Medicine. The topics included "Introduction to metagenomics", "Introduction to the current IHMP (Integrative Human Microbiome Projects) and their status", "Statistical and computational challenges in microbial research", and "Advanced quantitative methods in metagenomic data analysis", together with hands-on experience in metagenomic data analysis on the High Performance Computing systems at the University of Arizona, from sequence data alignment to quality control and from upstream analysis to downstream analysis (e.g., differential abundance analysis, cluster analysis, network analysis, and visualization). Through the workshops, the lab hoped to bring this new research area to the next generation of scientists and educators.
How have the results been disseminated to communities of interest? The research findings/outcomes have been presented at various conferences, symposia, and seminars, including the ENAR (Eastern North American Region) and WNAR (Western North American Region) meetings of the International Biometric Society, the ICSA (International Chinese Statistical Association) symposium, JSM (Joint Statistical Meetings), the Annual Scientific Meeting of the American Academy of Forensic Sciences, the International Human Microbiome Consortium, and an NSF workshop. We have published about 10 peer-reviewed journal papers directly or indirectly resulting from this project. All the software/packages we have developed for the project can be downloaded from http://cals.arizona.edu/~anling/software/software.htm. Researchers can use this software for free in their research.
What do you plan to do during the next reporting period to accomplish the goals? Nothing Reported

Impacts
What was accomplished under these goals? Biological threats are associated with the deliberate or accidental release of a pathogen or biotoxin. They can cause illness, death, fear, societal disruption, and economic damage. One of the difficulties in detecting such threats efficiently is that our environment is already rich in microorganisms that are harmless or actively beneficial and about which we know very little. Metagenomics is the study of genetic material recovered directly from natural (e.g., soil and water) or host-associated (e.g., human gut) environmental samples that contain microorganisms organized into communities or microbiomes. With the development of new sequencing technology, the metagenomic approach enables us to detect and study biological threats. In addition, the new technology has led to advances in agriculture, medicine, energy production, and bioremediation. For example, in agricultural studies, understanding the roles of beneficial and harmful microorganisms living in, on, and around domesticated plants and animals can help us detect diseases in crops and livestock and then develop improved farming practices. High-throughput next-generation sequencing technologies provide a powerful tool for metagenomic studies. However, because these technologies produce massive numbers of short DNA sequences, there is an urgent need to develop efficient statistical methods that rapidly and accurately detect the species and gene functions present in a metagenomic sample/community and that relate the signature species/functions to the status of the microbial communities in which they reside.
The project PI, Dr. An, working with her students, has developed several statistical methods for 1) identifying all possible functional roles present in metagenomic samples/communities, 2) detecting functional "biomarker" patterns that are linked to the presence of a biological factor (e.g., a disease or biothreat agent), and 3) predicting the virulence/disease level for a new individual. They not only proposed novel methods and algorithms for analyzing metagenomic data in general but also developed software that implements the proposed methods. All of the software is publicly available to the research community. This will help scientists ask, answer, and evaluate complex, multi-disciplinary biological questions. For example, by applying our methods, scientists can detect harmful microbial species even without any prior microbiological knowledge. They can also predict the disease/virulence level for a new microbial sample. This, in turn, will help in treating disease and controlling biothreats.

Publications

  • Type: Journal Articles Status: Published Year Published: 2018 Citation: Zhu, L., An, L., Ran, D., Lizzarraga, R., Bondy, C., Zhou, X., Harper, R., Liao, S., Chen, Y. (2018). The Club Cell Marker SCGB1A Downstream of FOXA1 is Reduced in Asthma. American Journal of Respiratory Cell and Molecular Biology. doi: 10.1165/rcmb.2018-0199OC
  • Type: Journal Articles Status: Published Year Published: 2018 Citation: Zhang, S., Wang, D., Zhang, H., Skaggs, M., Lloyd, A., Ran, D., An, L., & Yadegari, R. (2018). FERTILIZATION-INDEPENDENT SEED-Polycomb Repressive Complex 2 plays a dual role in regulating type I MADS-box genes in early endosperm development. Plant Physiology, 177(1), 285-299.
  • Type: Book Chapters Status: Awaiting Publication Year Published: 2018 Citation: Jiang H, An L, Ban Y. Introduction to metagenomics. No Boundary Thinking in Bioinformatics, edited by Huang and Moore.
  • Type: Journal Articles Status: Submitted Year Published: 2018 Citation: Klug, K. E., Jennings, C. M., Lytal, N., An, L., & Yoon, J.-Y. (2018). Mie Scattering and Microparticle Based Characterization of Heavy Metal Ions and Classification by Statistical Inference Methods. Royal Society Open Science.
  • Type: Book Chapters Status: Published Year Published: 2018 Citation: Du, R., An, L., & Fang, Z. (2018). Performance evaluation of normalization approaches for metagenomic compositional data on differential abundance analysis. New Frontiers of Biostatistics and Bioinformatics. Edited by Yichuan Zhao and DingGeng Chen. Springer.


Progress 10/01/16 to 09/30/17

Outputs
Target Audience: Researchers/scientists in health and medical sciences, environmental sciences, and forensic science.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? It provided training and professional development in cutting-edge research for 3 graduate students.
How have the results been disseminated to communities of interest? Our findings/outcomes have been presented at various conferences, symposia, and seminars, including the ENAR (Eastern North American Region of the International Biometric Society) meeting, the ICSA (International Chinese Statistical Association) symposium, and JSM (Joint Statistical Meetings). All the software/packages we have developed for the project can be downloaded from http://cals.arizona.edu/~anling/software/software.htm. Researchers can use this software for free in their research.
What do you plan to do during the next reporting period to accomplish the goals? Attend several statistical conferences and one workshop to present our research, and publish peer-reviewed journal papers.

Impacts
What was accomplished under these goals? We have developed one method for trace evidence analysis and one method for time-of-death estimation, both based on microbial sequence data. Both can be applied in forensic science.

Publications

  • Type: Journal Articles Status: Under Review Year Published: 2016 Citation: Shanshan Zhang, Dongfang Wang, Huajian Zhang, Megan Skaggs, Alan Lloyd, Di Ran, Lingling An, Karen Schumaker, Gary Drews, and Ramin Yadegari. FIS-PRC2 plays a dual role in regulation of type I MADS-box genes in early endosperm of Arabidopsis. 2017, Plant Physiology
  • Type: Theses/Dissertations Status: Accepted Year Published: 2017 Citation: Wenchi Lu. A Novel Approach on Differential Abundance Analysis for Matched Metagenomic Samples.
  • Type: Journal Articles Status: Under Review Year Published: 2017 Citation: Fei Jia, Murat Kacira*, Lingling An, Caitlin C. Brown, Kimberly L. Ogden, and Judith K. Brown. Autonomous Detection of an Abiotic and Biotic Disturbance in a Microalgae Culture System Using a Multi-Wavelength Optical Density Sensor. (2017) Biosystems Engineering
  • Type: Journal Articles Status: Submitted Year Published: 2017 Citation: Kyle Carter, Meng Lu, Hongmei Jiang, Lingling An. Suspect Reduction for Culture Independent Microbial Source Tracking in Trace Evidence Analysis Using Community. Journal of Forensic Sciences
  • Type: Conference Papers and Presentations Status: Other Year Published: 2017 Citation: Dan Luo, Lingling An. Differential Abundance Analysis on Longitudinal Metagenomic Count Data. Joint Statistical Meetings.
  • Type: Journal Articles Status: Submitted Year Published: 2017 Citation: Meng Lu, Kyle Carter, Lingling An. Accurate Prediction of Postmortem Interval with Integrating Microbial Community Dynamics. Forensic Science International.


Progress 10/01/15 to 09/30/16

Outputs
Target Audience: Researchers/scientists in health and medical sciences, environmental sciences, and forensic science.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? It provided training and professional development in cutting-edge research for 3 graduate students.
How have the results been disseminated to communities of interest? We presented our findings/outcomes at various conferences, symposia, and seminars, including the ENAR (Eastern North American Region of the International Biometric Society) meeting, the ICSA (International Chinese Statistical Association) symposium, and JSM (Joint Statistical Meetings). All the software/packages we have developed for the project so far have been posted online and can be downloaded from http://cals.arizona.edu/~anling/software/software.htm. Researchers can use this software for free in their research.
What do you plan to do during the next reporting period to accomplish the goals? Attend some statistical conferences to present our research and publish a few more peer-reviewed journal papers.

Impacts
What was accomplished under these goals? We have developed a new method, "metaDprof", which is implemented in R.

Publications

  • Type: Journal Articles Status: Published Year Published: 2016 Citation: Zhang, Y., Kacira, M., & An, L. (2016) A CFD study on improving air flow uniformity in indoor plant factory system. Biosystems Engineering, 147, 193-205.
  • Type: Journal Articles Status: Published Year Published: 2016 Citation: Pena, E., Wu, W., Piegorsch, W., West, R., & An, L. (2016) Model Selection and Estimation with Quantal-Response Data in Benchmark Risk Assessment. Risk Analysis. 37(4):716-732
  • Type: Journal Articles Status: Accepted Year Published: 2016 Citation: Luo D, Ziebell S, An L* (2016) An informative approach on differential abundance analysis for time-course metagenomic sequencing count data. Bioinformatics.
  • Type: Theses/Dissertations Status: Published Year Published: 2015 Citation: Sara Ziebell. A Powerful Correlation Method for Microbial Co-occurrence Networks
  • Type: Theses/Dissertations Status: Published Year Published: 2016 Citation: Dailu Chen. A metagenomic analysis of the microbiome in the colorectal cancer microenvironment


Progress 10/01/14 to 09/30/15

Outputs
Target Audience: Researchers/scientists in health and medical sciences and environmental sciences.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? It provided training and professional development for a postdoc and 3 graduate students.
How have the results been disseminated to communities of interest? We presented our findings/outcomes at various conferences, symposia, and seminars, including an NSF workshop, statistics and informatics conferences, and seminars at various universities. All the software/packages we have developed for the project so far have been posted online and can be downloaded from http://cals.arizona.edu/~anling/software/software.htm. Researchers can use this software for free in their research.
What do you plan to do during the next reporting period to accomplish the goals? A few more manuscripts are in preparation to publish the research findings.

Impacts
What was accomplished under these goals? We have developed a new approach to accurately estimate the relative abundance of closely related species. We have developed a two-step statistical approach for accurately identifying the potential functions in a metagenomic sample and quantifying their abundance. We have developed several R software packages: TAEA, metaFunction, ENNB, and RAIDA.

Publications

  • Type: Journal Articles Status: Published Year Published: 2014 Citation: Sohn M, An L*, Pookhao N, Li Q. Accurate Estimation of Genome Relative Abundance for Closely Related Species in a Metagenomic Sample. BMC Bioinformatics 2014, 15:242, PMID: 25027647 (*: corresponding author)
  • Type: Journal Articles Status: Published Year Published: 2014 Citation: An L*, Pookhao N, Jiang H, Xu J. Statistical approach of functional profiling for a microbial community. PLoS ONE 2014, 9(9): e106588, PMID: 25198674 (*: corresponding author)
  • Type: Journal Articles Status: Published Year Published: 2015 Citation: Pookhao N, Sohn M, Li Q, Jenkins I, Du R, Jiang H, An L*. A two-stage statistical procedure for feature selection and comparison in functional analysis of metagenomes. Bioinformatics, 2015, 31:158-165. PMID: 25256572 (*: corresponding author)
  • Type: Journal Articles Status: Published Year Published: 2015 Citation: Du R, Mercante D, An L*, Fang Z*. A statistical approach to correcting cross-annotations in a metagenomic functional profile generated by short reads. 2015, Journal of Biometrics & Biostatistics, 5:208 (*: co-corresponding author)
  • Type: Journal Articles Status: Published Year Published: 2015 Citation: Sohn M, Du R, An L*. (2015) A robust approach for identifying differentially abundant features in metagenomic samples. Bioinformatics. PMID: 25792553 (*: corresponding author)
  • Type: Journal Articles Status: Published Year Published: 2015 Citation: Ban Y, An L, Jiang H (2015) Investigating microbial co-occurrence patterns based on metagenomic compositional data. Bioinformatics 31(20): 3322-3329
  • Type: Journal Articles Status: Published Year Published: 2015 Citation: Drewry J, Choi C, An L, Gharagozloo P. (2015) A computational fluid dynamics model of algal growth: development and validation. Transactions of the American Society of Agricultural and Biological Engineers, 58:2, 203-213
  • Type: Journal Articles Status: Published Year Published: 2015 Citation: Yigiter A, Chen J, An L, Danacioglu N. (2015) An on-line CNV detection method for short sequencing reads. Journal of Applied Statistics, 42:7, 1556-1571
  • Type: Theses/Dissertations Status: Published Year Published: 2015 Citation: M.B. Sohn. Novel Computational And Statistical Approaches In Metagenomic Studies. PhD Dissertation in Statistics, University of Arizona
  • Type: Theses/Dissertations Status: Published Year Published: 2015 Citation: Ahmad Hakeem Abdul Wahab. Statistical Discovery of Biomarkers in Metagenomics. MS thesis in Statistics, University of Arizona
  • Type: Theses/Dissertations Status: Published Year Published: 2014 Citation: Naruekamol Pookhao. Statistical Methods For Functional Metagenomic Analysis Based On Next-Generation Sequencing Data. PhD Dissertation in ABE, University of Arizona


Progress 10/01/13 to 09/30/14

Outputs
Target Audience: The target audiences are biologists/medical scientists who investigate microbial communities and statisticians/computational scientists who develop computational methods or algorithms for metagenomic studies.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? It has provided training opportunities for graduate students in Biosystems Engineering, Statistics, and Computer Science. It has also brought internship opportunities to several undergraduate students in Biosystems Engineering. Through this project, a postdoc has been mentored early in his career.
How have the results been disseminated to communities of interest? The PI has delivered several presentations on the results of this project at various conferences/meetings/seminars to general audiences, including biologists, medical doctors, statisticians, mathematicians, and computer scientists. The results have also been presented to a general audience at an NSF workshop, including federal funding agents. The PI's lab has also provided a series of workshops on metagenomics to a group of undergraduate students to motivate their interest in related research areas.
What do you plan to do during the next reporting period to accomplish the goals? Develop more statistical methods and computational algorithms (software) on this topic, including methods to:
-- normalize the microbial samples that are collected at different scales
-- robustly and sensitively detect the biomarkers (species or genes) that are related to the changed biological factor(s), e.g., from healthy condition to diseased condition
-- construct a (sub)network among the detected biomarkers

Impacts
What was accomplished under these goals? We have developed statistical approaches for more accurately profiling the functional/taxonomic composition of microbial communities. Accurate annotation results play a critical role in the downstream analysis. In particular, the paper by Sohn et al. from the PI's lab was highlighted as one of the top 5 most-downloaded papers (out of about 500) in BMC Bioinformatics, a leading journal in the field. The method in this paper is designed to accurately detect very closely related species, a long-standing challenge in metagenomic/microbial studies under low sequence coverage.

Publications

  • Type: Journal Articles Status: Published Year Published: 2013 Citation: Piegorsch W*, An L*, Wickens A, West W, Peña E, Wu W. (2013) Information-theoretic model-averaged benchmark dose analysis in environmental risk assessment. Environmetrics 24:143-157 (*: co-first author)
  • Type: Journal Articles Status: Published Year Published: 2014 Citation: An L*, Pookhao N, Jiang H, Xu J. (2014) Statistical approach of functional profiling for a microbial community. PLoS ONE 9(9): e106588 (*: corresponding author)
  • Type: Journal Articles Status: Published Year Published: 2014 Citation: Sohn M, An L*, Pookhao N, Li Q. (2014) Accurate Estimation of Genome Relative Abundance for Closely Related Species in a Metagenomic Sample. BMC Bioinformatics 15:242 (*: corresponding author)
  • Type: Journal Articles Status: Published Year Published: 2014 Citation: Jiang H, An L, Baladandayuthapani V, Auer P. (2014) Classification, predictive modeling, and statistical analysis of cancer data. Cancer informatics 01/2014; 13(Suppl 2):1-3. DOI: 10.4137/CIN.S19328
  • Type: Journal Articles Status: Accepted Year Published: 2014 Citation: Pookhao N, Sohn M, Li Q, Jenkins I, Du R, Jiang H, An L*. (2014) A two-stage statistical procedure for feature selection and comparison in functional analysis of metagenomes. Bioinformatics. (*: corresponding author)