Experimental Annotation of Bovine Respiratory Disease Pathogen Genomes by Proteogenomic Mapping


Sponsoring Institution

National Institute of Food and Agriculture

Project Status

COMPLETE

Funding Source

NRI COMPETITIVE GRANT

Reporting Frequency

Annual

Accession No.

0208405

Grant No.

2006-35600-17688

Cumulative Award Amt.

(N/A)

Proposal No.

2006-04900

Multistate No.

(N/A)

Project Start Date

Sep 15, 2006

Project End Date

Jul 31, 2009

Grant Year

2006

Program Code

[23.2]- Microbial Genome Sequencing

Recipient Organization
MISSISSIPPI STATE UNIV
(N/A)
MISSISSIPPI STATE,MS 39762

Performing Department
COLLEGE OF VETERINARY MEDICINE

Non Technical Summary
Since 2000, the USDA has invested heavily in microbial genome sequencing, which has resulted in the completion of a large number of genome sequences from animal pathogens that impact U.S. agriculture. However, the ability to decipher the information content of sequenced genomes is currently limited and has seriously hindered the full experimental exploitation of these sequences. In particular, there is no experimental evidence for the existence of predicted protein products from the large majority of annotated genes in sequenced microbial genomes. Furthermore, standardized gene ontology (GO) to facilitate data retrieval is not consistently used. Three animal pathogens whose genomes have been sequenced through USDA funding are the most important bacterial pathogens responsible for causing bovine respiratory disease, a syndrome costing more than $1 billion annually. These three pathogens are: Mannheimia haemolytica, Histophilus somni, and Pasteurella multocida. Accurate and accessible annotation of their genomes is needed to maximize the utility of the genome sequences to the U.S. agricultural research community. In this project, we will identify proteins from these three species and map the proteins onto their genome sequences to confirm predicted proteins and identify novel proteins. We will also assign standardized nomenclature and functional annotations for the identified proteins as well as build computational tools to use these annotations, which will be publicly accessible at our AgBase website.

Animal Health Component

(N/A)

Research Effort Categories

Basic

100%

Applied

(N/A)

Developmental

(N/A)

Classification

Knowledge Area (KA)	Subject of Investigation (SOI)	Field of Science (FOS)	Percent
311	4010	1100	100%

Knowledge Area
311 - Animal Diseases;

Subject Of Investigation
4010 - Bacteria;

Field Of Science
1100 - Bacteriology;

Keywords

gene ontology,

annotation,

proteogenomic mapping,

bovine respiratory disease

Goals / Objectives
The objective of this proposal is to conduct experimental annotation of the three bovine respiratory disease (BRD) pathogens Mannheimia haemolytica, Histophilus somni, and Pasteurella multocida by proteogenomic mapping to improve their ongoing structural annotations and provide functional annotation. Another objective of this project is to improve annotation tools available for researchers to conduct functional genomics and systems biology investigations on these pathogens. To accomplish this, multi-dimensional protein identification technology (MuDPIT) will be used to map the proteome of each species by matching tandem mass spectra against all potential reading frames from the genome of each strain. Tools that have already been developed for proteogenomic mapping will then be used to identify protein coding sequences in each of the genomes. A critical component of this proposal is a coordinated effort with the PIs of the genome sequencing projects for each of these pathogens. Information will be disseminated to these annotation projects through an established GO annotation website (AgBase), and GO annotation will be conducted in conjunction with these sequencing projects using a distributed annotation system. Throughout the project, annotation tools that are publicly available on AgBase will continue to be improved. Our rationale for conducting this research is that experimental annotation combined with standardized gene ontology, easily accessible databases, and new computational tools will achieve much more comprehensively annotated microbial genomes and will allow them to be more fully exploited for functional genomic and systems biology studies for agricultural research. Our expected outcomes are more comprehensively annotated genome sequences and tools for BRD pathogens that will be available to the research community in a single place and in an easily usable format.
Specifically, we will: 1) improve structural annotation of the three genomes in terms of protein coding, 2) provide GO functional annotation for the three genomes, 3) provide computational tools for functional annotation, and 4) provide computational tools for conducting functional genomics in agricultural microbial genomes.

Project Methods
Our first aim is to produce proteogenomic maps of the M. haemolytica, H. somni, and P. multocida genomes that will provide direct ex-vivo evidence for the existence of proteins that are predicted by homology, more accurately determine the boundaries and enumeration of functional ORFs, verify predicted proteins, and identify unknown ORFs that are not predicted by homology. Our experimental approach will be to use multi-dimensional protein identification technology (MuDPIT) to produce a map of the coding sequences in these genomes by proteogenomic mapping. Our rationale for taking this experimental approach is that high throughput proteogenomic mapping will provide the most economic, rapid, and comprehensive coverage of microbial genomes. In addition, because of the importance of BRD in cattle production, proteogenomic mapping will provide a highly relevant proteome dataset for the agricultural research community. Our expectation is that completion of the experimental work in this aim will provide comprehensive annotation of these microbial genomes at the protein level. This will result in improved databases for proteomics, a better understanding of the size and diversity of proteomes in Pasteurellaceae, and improved ability to model functional genomics datasets. Our second aim is to provide standardized nomenclature and functional annotations for all expressed protein sequence tags (ePSTs) identified, to make all of this data publicly accessible, to link our data directly to Ensembl, UniProt and PRIDE, to keep current with changes in the UniProt and nomenclature databases, and to provide computational tools to facilitate comparative and functional genomics studies in microbial genomes. One of our major goals is that both the annotations and the tools be easily accessible to the research community. 
Our approach will be to do electronic and manual GO annotation from our experimental data, continue to provide computational tools for functional genomics, and continue to provide public access to these via AgBase. This approach will achieve more comprehensively annotated microbial genomes that are more easily accessible and have more user-friendly tools; in particular, it will enable the M. haemolytica, P. multocida, and H. somni genomes to be more fully exploited for functional genomic and systems biology studies. This expectation is based on the tools for functional genomics studies that we have already produced and that have decreased the time required for functional annotation and analyzing functional genomic data from months to hours.
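Proteogenomic mapping, as stated in the goals, matches spectra against all potential reading frames of a genome. The following is a minimal Python sketch of how such a six-frame search space could be built. It is an illustration only, not the project's pipeline: the function names are hypothetical, and it assumes an unambiguous A/C/G/T sequence and the standard codon assignments (which bacterial translation table 11 shares, differing only in permitted start codons).

```python
from itertools import product

# Standard codon assignments listed in TCAG order (assumption: no ambiguity codes).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X', stops '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translations."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


frames = six_frame_translation("ATGAAATAG")
print(frames["+1"])  # MK*
```

Keeping stop codons as `*` in the output preserves the open reading frame boundaries, which a later mapping step needs.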

Progress 09/15/06 to 07/31/09

Outputs
OUTPUTS: Information from this project is being disseminated through an established GO annotation website, AgBase. Expressed protein sequence tags (ePSTs), which are experimentally derived protein coding sequences that are not in the current genome annotation, are displayed on AgBase using the Generic Genome Browser (GBrowse). The GBrowse display includes biological evidence to allow evaluation of the strength of each ePST, including the presence of a valid start codon, the number of peptides used to identify the ePST, the coverage of the potential ORF by peptides, the presence/absence of a ribosomal binding site, the presence/absence of conserved domain(s), codon bias, and confidence in peptide identifications. All of the ePSTs and the biological evidence for M. haemolytica, H. somni, and P. multocida can be accessed at the AgBase website (http://www.agbase.msstate.edu/epst/) under the link "Microbial GBrowsers". A computational pipeline for proteogenomic mapping was developed as a component of this project, and it is available for download through AgBase at http://www.agbase.msstate.edu/index.html under the "Tools" link. A tool for peptide validation, PepOut, was also developed. This tool uses a target/decoy approach for peptide validation. It combines an outlier-based machine learning approach with Bayesian statistics to determine the probability of a true identification of each peptide, which is a critical step in the proteogenomic mapping process. Results from this project were also disseminated to the bovine respiratory disease community at the annual Conference for Research Workers in Animal Diseases.
PARTICIPANTS: Dr. Mark L. Lawrence at the Mississippi State University College of Veterinary Medicine (MSU-CVM) is the project director. Dr. Shane Burgess (MSU-CVM), Dr. Susan Bridges (MSU Department of Computer Sciences and Engineering), and Dr. Bindu Nanduri (MSU-CVM) are the project co-directors. Dr. James Watt, a postdoctoral scientist, is analyzing proteogenomic mapping data. Protein isolations were conducted by Michelle Banes (MSU-CVM), and mass spectrometry was conducted by Dr. Tibor Pechan at the MSU Life Sciences and Biotechnology Institute. Nan Wang, a Ph.D. student under the direction of Dr. Bridges, constructed the proteogenomic mapping computational pipeline and is conducting the mapping analyses. Ranjit Kumar, a Ph.D. student under the direction of Dr. Lawrence, is responsible for displaying results in GBrowse and establishing links with collaborators at other institutions. Dr. Sarah Highlander at the Baylor College of Medicine directed genome sequencing of M. haemolytica; H. somni genome sequencing was conducted by Dr. Tom Inzana at the Virginia Tech College of Veterinary Medicine; and the 2.38 Mbp genome sequence of P. multocida nontoxigenic porcine pneumonic pasteurellosis isolate 3480 was finished by Dr. Allison Gillaspy at the Oklahoma University Health Sciences Center.
TARGET AUDIENCES: This project specifically benefits the bovine respiratory disease research community by improving the genome annotation for the three most important bacterial pathogens responsible for causing bovine respiratory disease: Mannheimia haemolytica, Histophilus somni, and Pasteurella multocida. Accurate and accessible annotation of their genomes is needed to maximize the utility of the genome sequences to the U.S. agricultural research community. In a broader sense, this project also serves as an important model for experimental annotation of other agricultural microbial genomes. The techniques and tools developed are publicly available to benefit annotation efforts of other microbial genomes. This project also evaluated whether the gene prediction methods used to annotate these genomes are accurate.
PROJECT MODIFICATIONS: Nothing significant to report during this reporting period.
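The peptide validation idea described under OUTPUTS, scoring hits from the true database against the score distribution of decoy hits, can be illustrated with a minimal Python sketch. This is not the PepOut implementation and omits its Bayesian step. It assumes each peptide-spectrum match is reduced to a pair of scores (for example Xcorr and ∆CN, ideally rescaled to comparable ranges), and the function names and the empirical p-value are hypothetical choices made for illustration.

```python
import math


def knn_distance(point, references, k):
    """Mean Euclidean distance from `point` to its k nearest reference points."""
    distances = sorted(math.dist(point, ref) for ref in references)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


def decoy_outlier_pvalues(targets, decoys, k=3):
    """Empirical p-value for each target match being drawn from the decoys.

    `targets` and `decoys` are lists of (score_1, score_2) tuples.
    A small p-value means the target lies far from the decoy cloud.
    """
    # Null distribution: each decoy's kNN distance to the *other* decoys.
    null = [
        knn_distance(decoy, decoys[:i] + decoys[i + 1:], k)
        for i, decoy in enumerate(decoys)
    ]
    pvalues = []
    for target in targets:
        score = knn_distance(target, decoys, k)
        exceed = sum(1 for value in null if value >= score)
        # Add-one correction keeps the estimate strictly above zero.
        pvalues.append((exceed + 1) / (len(null) + 1))
    return pvalues


decoys = [(1.0, 0.05), (1.1, 0.04), (0.9, 0.06), (1.2, 0.05), (1.0, 0.03)]
targets = [(4.5, 0.50), (1.05, 0.05)]
print(decoy_outlier_pvalues(targets, decoys, k=2))
```

A distance-only score treats any point far from the decoys as an outlier, including unusually poor matches, so a real workflow would also need to check the direction of the deviation.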

Impacts
Proteogenomic mapping has provided comprehensive experimental evidence to improve the annotation of the M. haemolytica, P. multocida, and H. somni genomes. For the M. haemolytica genome, which is not a finished genome, it provided biological evidence for the expression of many proteins that were annotated as pseudogenes. Thus, proteogenomic mapping is an efficient method for improving the annotation of unfinished genomes. For P. multocida and H. somni, which are finished genomes, proteogenomic mapping resulted in the identification of novel proteins, but not as many because of the better quality sequence and annotation. Importantly, this project resulted in the generation of computational tools for proteogenomic mapping, and it resulted in the generation of new computational tools to assess the quality of peptides and ePSTs. Our results indicate that assessment of ePST quality is critical to the success of any proteogenomic mapping project.
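One of the evidence types this report lists for judging an ePST is the coverage of the potential ORF by peptides. A minimal Python sketch of that single measure follows. It assumes exact string matches of peptide sequences within the ORF's amino acid sequence, and the function name is hypothetical; it does not reproduce the project's scoring of the other evidence types.

```python
def peptide_coverage(orf, peptides):
    """Fraction of ORF residues covered by at least one identified peptide."""
    if not orf:
        return 0.0
    covered = [False] * len(orf)
    for peptide in peptides:
        if not peptide:
            continue  # ignore empty strings, which would match everywhere
        start = orf.find(peptide)
        while start != -1:
            # Mark every residue spanned by this occurrence of the peptide.
            for i in range(start, start + len(peptide)):
                covered[i] = True
            start = orf.find(peptide, start + 1)
    return sum(covered) / len(orf)


# Two overlapping peptides cover residues 0-10 of a 12-residue ORF.
print(peptide_coverage("MSTPEPTIDEKL", ["PEPTIDEK", "MSTPEP"]))
```

Counting residues rather than summing peptide lengths avoids double-counting where peptides overlap.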

Publications

Wang, N., B. Nanduri, M. L. Lawrence, S. M. Bridges, and S. C. Burgess. 2010. Gene Model Detection Using Mass Spectrometry. Methods in Molecular Biology: Proteome Bioinformatics. Vol. 604 ISBN: 978-1-60761-443-2
Watt, J. M., G. D. Ramsey, S. K. Bridges, B. Nanduri, R. Kumar, S. C. Burgess, and M. L. Lawrence. 2009. Experimental annotation of Mannheimia haemolytica A:1 by proteogenomic mapping. Conference for Research Workers in Animal Disease, Chicago, Illinois.
Wang, N., S. C. Burgess, M. L. Lawrence, and S. M. Bridges. 2009. Proteogenomic Mapping for Structural Annotation of Prokaryote Genomes. Proceeding of IJCBS09, Shanghai, China.
Lawrence, M. L., J. Watt, S. Bridges, B. Nanduri, N. Wang, R. Kumar, and S. C. Burgess. 2008. Improvement of genome annotation for bovine respiratory disease pathogens by proteogenomic mapping. Conference for Research Workers in Animal Disease, Chicago, Illinois.
Wang, N., S. C. Burgess, M. L. Lawrence, B. Nanduri, F. McCarthy, C. Yuan, and S. M. Bridges. 2008. Novel algorithms for structural annotation of prokaryotic genomes. ISBM 2008, Toronto, Canada.
Wang, N., C. Yuan, B. Nanduri, and S. M. Bridges. 2008. Integrating evidence for evaluation of potential protein-coding genes using Bayesian networks. Proceeding of BIOCOMP08, Las Vegas, Nevada.
Wang, N., C. Yuan, D. Wu, S. C. Burgess, B. Nanduri, and S. M. Bridges. 2008. PepOut: Distance-based Outlier Detection Model for Improving MS/MS Peptide Identification Confidence. MCBIOS 2008, Oklahoma City, Oklahoma.

Progress 09/15/07 to 09/14/08

Outputs
OUTPUTS: A critical component of this proposal is a coordinated effort with the PIs of the genome sequencing projects for Mannheimia haemolytica, Histophilus somni, and Pasteurella multocida. Information will be disseminated to these annotation projects through an established GO annotation website, AgBase (http://www.agbase.msstate.edu/). Another objective of this project is to develop proteogenomic mapping tools and improve annotation tools available for researchers to conduct functional genomics and systems biology investigations on these pathogens. These tools are also being made available through AgBase. Expressed protein sequence tags (ePSTs), which are experimentally derived protein coding sequences that are not in the current genome annotation, will be visualized and displayed on AgBase using the Generic Genome Browser (GBrowse). The GBrowse display will include biological evidence to allow evaluation of the strength of each ePST, including the presence of a valid start codon, the number of peptides used to identify the ePST, the coverage of the potential ORF by peptides, the presence/absence of a ribosomal binding site, the presence/absence of conserved domain(s), codon bias, and confidence in peptide identifications. Results from this project are being disseminated through national meetings and will be published in peer-reviewed journals.
PARTICIPANTS: Dr. Mark L. Lawrence at the Mississippi State University College of Veterinary Medicine (MSU-CVM) is the project director. Dr. Shane Burgess (MSU-CVM), Dr. Susan Bridges (MSU Department of Computer Sciences and Engineering), and Dr. Bindu Nanduri (MSU-CVM) are the project co-directors. Dr. James Watt, a postdoctoral scientist, is analyzing proteogenomic mapping data. Protein isolations were conducted by Michelle Banes (MSU-CVM), and mass spectrometry was conducted by Dr. Tibor Pechan at the MSU Life Sciences and Biotechnology Institute. Nan Wang, a Ph.D. student under the direction of Dr. Bridges, constructed the proteogenomic mapping computational pipeline and is conducting the mapping analyses. Ranjit Kumar, a Ph.D. student under the direction of Dr. Lawrence, is responsible for displaying results in GBrowse and establishing links with collaborators at other institutions. Dr. Sarah Highlander at the Baylor College of Medicine directed genome sequencing of M. haemolytica; H. somni genome sequencing was conducted by Dr. Tom Inzana at the Virginia Tech College of Veterinary Medicine; and the 2.38 Mbp genome sequence of P. multocida nontoxigenic porcine pneumonic pasteurellosis isolate 3480 was finished by Dr. Allison Gillaspy at the Oklahoma University Health Sciences Center.
TARGET AUDIENCES: This project will specifically benefit the bovine respiratory disease research community by improving the genome annotation for the three most important bacterial pathogens responsible for causing bovine respiratory disease: Mannheimia haemolytica, Histophilus somni, and Pasteurella multocida. Accurate and accessible annotation of their genomes is needed to maximize the utility of the genome sequences to the U.S. agricultural research community. In a broader sense, this project will also serve as an important model for experimental annotation of other agricultural microbial genomes. The techniques and tools developed will be made publicly available to benefit annotation efforts of other microbial genomes, and the project will result in the establishment of a centralized database for annotation of agricultural microbial genomes. This project will also determine whether the gene prediction methods used to annotate these genomes are accurate, and it is likely that proteins will be identified that were not predicted in the annotation.
PROJECT MODIFICATIONS: Nothing significant to report during this reporting period.

Impacts
High throughput proteogenomic mapping will provide economic, rapid, and comprehensive experimental evidence to improve the annotation of the M. haemolytica, P. multocida, and H. somni genomes. Experimental evidence is being provided for the existence of protein products from the predicted protein coding sequences that were identified during annotation of these three genomes. In addition, proteins are being identified that were not predicted in the genome annotation, which will directly improve the genome annotations. This will result in improved proteomics databases for these species, a better understanding of the size and diversity of proteomes in Pasteurellaceae, and improved ability to model functional genomics datasets for these species. In addition, the computational tools for proteogenomic mapping and annotation that are being developed will be freely available through the AgBase website.

Publications

Lawrence, M. L., J. Watt, S. Bridges, B. Nanduri, N. Wang, R. Kumar, and S. C. Burgess. 2008. Improvement of genome annotation for bovine respiratory disease pathogens by proteogenomic mapping. Conference for Research Workers in Animal Disease, Chicago, Illinois.
Wang, N., C. Yuan, S. C. Burgess, and S. M. Bridges. 2008. Distance-based Outlier Detection Model for Improving MS/MS Peptide Identification Confidence. MCBIOS'08, Oklahoma City, OK.
Wang, N., S. C. Burgess, M. L. Lawrence, B. Nanduri, F. McCarthy, C. Yuan, and S. M. Bridges. 2008. Novel algorithms for structural annotation of prokaryotic genomes. ISBM 2008, Toronto, Canada.

Progress 09/15/06 to 09/14/07

Outputs
OUTPUTS: The objective of this proposal is to conduct experimental annotation of the three bovine respiratory disease (BRD) pathogens Mannheimia haemolytica, Histophilus somni, and Pasteurella multocida by proteogenomic mapping to improve their ongoing structural annotations and provide functional annotation. Another objective of this project is to improve annotation tools available for researchers to conduct functional genomics and systems biology investigations on these pathogens. To date, proteogenomic maps of Mannheimia haemolytica strain PHL213 and P. multocida strain 3480 have been produced. Proteins were isolated from each strain in triplicate and analyzed by multi-dimensional protein identification technology (MuDPIT) using two-dimensional liquid chromatography with electrospray ionization tandem mass spectrometry. The resulting mass spectra were searched against their respective protein databases using SEQUEST (Bioworks 3.2 cluster; ThermoElectron). For proteogenomic mapping, tandem mass spectra were also searched against the respective genome sequences translated in all six potential frames using SEQUEST. Peptides were validated with an outlier detection method that uses a k-nearest neighbor approach to compare Xcorr and ∆CN scores derived from true and randomized databases. The lists of peptides identified from the genome sequences were compared with the lists identified from protein databases. For peptides identified only from the genome sequence, our automated proteogenomic mapping pipeline was then used to produce expressed protein sequence tags (ePSTs), which are the theoretical protein coding sequences that contain the peptide sequence identified by mass spectrometry. To visualize ePSTs, they were displayed against their respective genomes using the Generic Genome Browser (GBrowse). Biological evidence is currently being incorporated into GBrowse to allow evaluation of the strength of each ePST, including the presence of a valid start codon, the number of peptides used to identify the ePST, the coverage of the potential ORF by peptides, the presence/absence of a ribosomal binding site, the presence/absence of conserved domain(s), codon bias, and confidence in peptide identifications. Proteins have been isolated and analyzed from Histophilus somni strain 2336, and proteogenomic mapping of this strain is currently being conducted.
PARTICIPANTS: Dr. Mark L. Lawrence at the Mississippi State University College of Veterinary Medicine (MSU-CVM) is the project director. Dr. Shane Burgess (MSU-CVM), Dr. Susan Bridges (MSU Department of Computer Sciences and Engineering), and Dr. Bindu Nanduri (MSU-CVM) are the project co-directors. Protein isolations were conducted by Michelle Banes (MSU-CVM), and mass spectrometry was conducted by Dr. Tibor Pechan at the MSU Life Sciences and Biotechnology Institute. Nan Wang, a Ph.D. student under the direction of Dr. Bridges, constructed the proteogenomic mapping computational pipeline and is conducting the mapping analyses. Ranjit Kumar, a Ph.D. student under the direction of Dr. Lawrence, is responsible for displaying results in GBrowse and establishing links with collaborators at other institutions. A critical component of this proposal is a coordinated effort with the PIs of the genome sequencing projects for Mannheimia haemolytica, Histophilus somni, and Pasteurella multocida. Information will be disseminated to these annotation projects through an established GO annotation website (AgBase). Dr. Sarah Highlander at the Baylor College of Medicine directed genome sequencing of M. haemolytica; H. somni genome sequencing was conducted by Dr. Tom Inzana at the Virginia Tech College of Veterinary Medicine; and the 2.38 Mbp genome sequence of P. multocida nontoxigenic porcine pneumonic pasteurellosis isolate 3480 was finished by Dr. Allison Gillaspy at the Oklahoma University Health Sciences Center.
TARGET AUDIENCES: Since 2000, the USDA has invested heavily in microbial genome sequencing, which has resulted in the completion of a large number of genome sequences from animal pathogens that impact U.S. agriculture. However, the ability to decipher the information content of sequenced genomes is currently limited and has seriously hindered the full experimental exploitation of these sequences. In particular, there is no experimental evidence for the existence of predicted protein products from the large majority of annotated genes in sequenced microbial genomes. Furthermore, standardized gene ontology (GO) to facilitate data retrieval is not consistently used. This project will specifically benefit the bovine respiratory disease research community by improving the genome annotation for the three most important bacterial pathogens responsible for causing bovine respiratory disease: Mannheimia haemolytica, Histophilus somni, and Pasteurella multocida. Accurate and accessible annotation of their genomes is needed to maximize the utility of the genome sequences to the U.S. agricultural research community. In a broader sense, this project will also serve as an important model for experimental annotation of other agricultural microbial genomes. The techniques and tools developed will be made publicly available to benefit annotation efforts of other microbial genomes, and the project will result in the establishment of a centralized database for annotation of agricultural microbial genomes. This project will also determine whether the gene prediction methods used to annotate these genomes are accurate, and it is likely that proteins will be identified that were not predicted in the annotation.
PROJECT MODIFICATIONS: No Project Modifications information reported.
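The mapping step described under OUTPUTS for this period, keeping peptides found only in the genome search and deriving the coding sequence that contains each one, can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the project's automated pipeline: the function names are hypothetical, the input is a single translated reading frame with stop codons written as `*`, and the candidate sequence is taken as the stop-to-stop open reading frame without any start codon refinement.

```python
def genome_only_peptides(genome_hits, protein_db_hits):
    """Peptides identified from the translated genome but not the protein database."""
    return set(genome_hits) - set(protein_db_hits)


def enclosing_orf(frame_translation, peptide):
    """Return (start, end, sequence) of the stop-to-stop ORF containing `peptide`.

    Coordinates are amino acid positions within the translated frame;
    `end` is exclusive. Returns None if the peptide does not occur.
    """
    position = frame_translation.find(peptide)
    if position == -1:
        return None
    # rfind returns -1 when there is no upstream stop, so start becomes 0.
    start = frame_translation.rfind("*", 0, position) + 1
    downstream_stop = frame_translation.find("*", position + len(peptide))
    end = len(frame_translation) if downstream_stop == -1 else downstream_stop
    return start, end, frame_translation[start:end]


frame = "AK*MSTPEPTIDEKL*GG"
novel = genome_only_peptides({"PEPTIDEK", "AAAK"}, {"AAAK"})
for peptide in sorted(novel):
    print(peptide, enclosing_orf(frame, peptide))
```

Choosing the actual translation start within the stop-to-stop region is left out here; the report lists a valid start codon and a ribosomal binding site among the evidence used for that judgment.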

Impacts
High throughput proteogenomic mapping will provide economic, rapid, and comprehensive experimental evidence to improve the annotation of the M. haemolytica, P. multocida, and H. somni genomes. Experimental evidence is being provided for the existence of protein products from the predicted protein coding sequences that were identified during annotation of these three genomes. In addition, proteins are being identified that were not predicted in the genome annotation, which will directly improve the genome annotations. This will result in improved proteomics databases for these species, a better understanding of the size and diversity of proteomes in Pasteurellaceae, and improved ability to model functional genomics datasets for these species. In addition, the computational tools for proteogenomic mapping and annotation that are being developed will be freely available through the AgBase website.

Publications

No publications reported this period