DEVELOPMENT OF A PATHOGEN SEQUENCE DATABASE - AGRICULTURAL RESEARCH SERVICE

Goals / Objectives
The objective of this research project is to provide a secure, well-annotated database of nucleotide sequence data and associated biological data for relevant viral pathogens affecting livestock.

Project Methods
The core of this system will be a database containing a set of well-annotated genomic sequences of viral pathogens currently on the threat list, and sequences of known pathogens that produce diseases similar to those caused by threat list agents. The core genomic sequences in the database, termed reference sequences, will be curated by experts to ensure that they are of the highest quality possible, since they will provide the basis for both experimental and epidemiologic/forensic analyses. In addition, reference sequences will be highly annotated in terms of their source (host species, geographic origin, etc.), phenotypes (serotype, permissive cell lines, plaque size, etc.) and clinical data (disease signs, lesions caused, kinetics of infection, etc.). Beyond providing the basis for analysis of outbreak pathogen isolates, this high-quality core of reference sequences will enable the discovery of associations between pathogen sequence and phenotypes on a scale not yet approached in livestock infectious disease research. Other sequence data will also be contained in the database, including well-annotated partial sequences of pathogens, non-curated sequences, and genomic sequences with limited annotation.
Special Appropriation for Homeland Security ($1,500,000).
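The annotation categories described above (source, phenotype, clinical data) can be pictured as a simple record structure. The following Python sketch is illustrative only and is not the project's schema: every class name, field name, and the example accession are hypothetical, and the rule used to flag a reference sequence is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SourceAnnotation:
    host_species: str
    geographic_origin: str


@dataclass
class PhenotypeAnnotation:
    serotype: Optional[str] = None
    permissive_cell_lines: List[str] = field(default_factory=list)
    plaque_size_mm: Optional[float] = None


@dataclass
class ClinicalAnnotation:
    disease_signs: List[str] = field(default_factory=list)
    lesions: List[str] = field(default_factory=list)


@dataclass
class SequenceRecord:
    accession: str          # hypothetical identifier, not a real accession
    organism: str
    sequence: str
    curated: bool           # True once an expert has reviewed the record
    complete_genome: bool   # False for partial sequences
    source: Optional[SourceAnnotation] = None
    phenotype: Optional[PhenotypeAnnotation] = None
    clinical: Optional[ClinicalAnnotation] = None

    def is_reference(self) -> bool:
        # Assumed rule: a reference sequence is a curated, complete genome
        # that carries at least source annotation.
        return self.curated and self.complete_genome and self.source is not None


# Placeholder values only; the sequence is not real data.
example = SequenceRecord(
    accession="EXAMPLE-0001",
    organism="foot-and-mouth disease virus",
    sequence="ACGT",
    curated=True,
    complete_genome=True,
    source=SourceAnnotation(host_species="cattle", geographic_origin="unknown"),
)
partial = SequenceRecord(
    accession="EXAMPLE-0002",
    organism="foot-and-mouth disease virus",
    sequence="ACGT",
    curated=False,
    complete_genome=False,
)
```

Under this assumed rule, `example.is_reference()` is true, while `partial` falls into the "other sequence data" category.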

Progress 08/28/02 to 11/27/05

Outputs
1. What major problem or issue is being resolved and how are you resolving it (summarize project aims and objectives)? How serious is the problem? What does it matter?
We are developing a pathogen sequence database that will support rapid characterization (or annotation) of causal exotic viral agents in the event of a disease outbreak. Annotation will include viral strain, genetic history, source, ability to infect different species (host range), evidence of genetic modification, and anticipated impact (infectivity, mortality, morbidity) of the disease on U.S. populations.
Early in an outbreak of an exotic disease, newly developed detection methods (real-time PCR, gene chips, etc.) will provide information about the genomic nucleotide sequence of the causative viral agent. Such sequence not only defines the "species" of agent (e.g., foot-and-mouth disease virus) but also contains other information, including the genetic history of the virus, how readily it may infect different species, and evidence of genetic modifications. This information will help to determine the source of the virus and to predict its behavior in U.S. livestock populations. Current sequence databases are not annotated appropriately to provide this additional information. Furthermore, some information required for inferring important properties of a given agent is sensitive and not available in public databases. These deficiencies in current pathogen sequence databases reduce the efficiency of agencies responding to an outbreak.
Exotic livestock diseases have long represented a serious risk to food-producing animals in the U.S. This risk has increased in recent times due to increased international travel, international trade in livestock and livestock products, and the threat of intentional introduction of disease. The impact of even a small outbreak of these types of diseases would be enormous, both in terms of direct impact on the livestock and producers, and the ripple effect that would resonate throughout the U.S. economy. Recent events (mad cow disease outbreaks) in the U.K., Canada and U.S. clearly indicate the need for a rapid and effective response to outbreaks of exotic livestock diseases to minimize their consequences in the U.S.
This research addresses the goals of ARS National Program Action Plan - Food Animal Production 101: IV. Genomic tools with emphasis on 1. Comprehensive maps, and 5. Bioinformatics and statistical analysis tools. While this program component is often thought of as relating to host genomics, the skills, tools and expertise for database management and analysis of pathogen genomic sequence and annotation are nearly identical to those used for genomic sequencing of livestock species. In fact, the specific goals of this program component, 1. to develop databases of genomic and proteomic information and 2. computer software for efficient manipulation and analysis of genomic and proteomic data, are exactly the goals of the CRIS project. The bioinformatics portion of this project is particularly relevant and will have implications beyond achievement of the current goals.

2. List the milestones (indicators of progress) from your Project Plan.
Year 1 (FY 2003): Procure software and hardware needed to manage and analyze pathogen sequence and annotation. Software includes Vector NTI, Lab Share and Bionumerics. Hardware includes a database server and a 24-node Linux cluster for analysis.
Year 2 (FY 2004): Develop system to detect genetic modification in viral genomic sequence. Prepare and submit manuscript.
Year 3 (FY 2005): Customize the Generic Model Organism Database (Gmod) to make it suitable for managing pathogen sequence and annotation. Prepare and submit manuscript.

3a. List the milestones that were scheduled to be addressed in FY 2005. For each milestone, indicate the status: fully met, substantially met, or not met. If not met, why.
1. Customize the Generic Model Organism Database (Gmod) to make it suitable for managing pathogen sequence and annotation. Prepare and submit manuscript. Milestone substantially met.

3b. List the milestones that you expect to address over the next 3 years (FY 2006, 2007, and 2008). What do you expect to accomplish, year by year, over the next 3 years under each milestone?
This project is scheduled to terminate in 2005.

4a. What was the single most significant accomplishment this past year?
University of South Carolina computer scientists wrote a program that computes the joint probability of a cluster of arbitrary sequence features. This tool can be used to help detect viral manipulation. This research was conducted through a specific cooperative agreement with University of South Carolina (see CRIS 5438-32000-025-01).

5. Describe the major accomplishments over the life of the project, including their predicted or actual impact.
Action Plan components: Strategic Plan 3.2.3.1 - Animal Disease Microorganisms; National Program 101 - IV Genomic tools with emphasis on 1. Comprehensive maps, and 5. Bioinformatics and statistical analysis tools; ARS Strategic Plan Goals 1 (Enhance economic opportunities for agricultural producers) and 3 (Enhance protection and safety of the nation's agriculture and food supply).
We discovered that the distributions of ordered amino acid triplets were different for different viral genome types. We have prepared and submitted a manuscript. This information is fundamental to detecting genome modification. Distribution of amino acid triplets may prove to be a powerful technique for classifying novel anonymous sequences. The impact may extend well beyond pathogen sequencing. This research was conducted through a specific cooperative agreement with University of South Carolina (see CRIS 5438-32000-025-01S).

6. What science and/or technologies have been transferred and to whom? When is the science and/or technology likely to become available to the end-user (industry, farmer, other scientists)? What are the constraints, if known, to the adoption and durability of the technology products?
The technology will be accessible to other scientists and action agencies once the paper is published. Constraints to adoption of the methodology depend on sample sequencing capacity in the event of a disease outbreak. Once the genomic infrastructure (sequencing capacity and associated expertise) is available, the technology can be widely applied.
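The finding reported under item 5, that distributions of ordered amino acid triplets differ between viral genome types, can be illustrated with a small Python sketch. This is not the program written under the cooperative agreement. It assumes that triplets are counted in overlapping windows along a protein sequence, and it uses total variation distance as a stand-in comparison measure; both choices and all function names are hypothetical.

```python
from collections import Counter
from typing import Dict


def triplet_distribution(protein: str) -> Dict[str, float]:
    """Relative frequency of each ordered, overlapping amino acid triplet."""
    protein = protein.upper()
    counts = Counter(protein[i:i + 3] for i in range(len(protein) - 2))
    total = sum(counts.values())
    if total == 0:
        return {}  # sequence shorter than three residues
    return {triplet: n / total for triplet, n in counts.items()}


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance between two triplet distributions.

    0.0 means identical distributions; 1.0 means no triplet in common.
    """
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Toy strings only, not real viral proteins.
dist_a = triplet_distribution("MKVMKV")
dist_b = triplet_distribution("AAAA")
distance = total_variation(dist_a, dist_b)
```

In practice, distributions would be pooled over many proteins per genome type before comparison, and a novel sequence could be assigned to the type whose pooled distribution it most resembles.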

Impacts
(N/A)

Publications

Rose, J.R., Turkett Jr., W.H., Oroian, I.C., Laegreid, W.W., Keele, J.W. 2005. Correlation of amino acid preference and mammalian viral genome type. Bioinformatics 21(8):1349-1357.

Progress 10/01/03 to 09/30/04

Outputs
1. What major problem or issue is being resolved and how are you resolving it (summarize project aims and objectives)? How serious is the problem? What does it matter?
We are developing a pathogen sequence database that will support rapid characterization (or annotation) of causal exotic viral agents in the event of a disease outbreak. Annotation will include viral strain, genetic history, source, ability to infect different species (host range), evidence of genetic modification, and anticipated impact (infectivity, mortality, morbidity) of the disease on U.S. populations.
Early in an outbreak of an exotic disease, newly developed detection methods (real-time PCR, gene chips, etc.) will provide information about the genomic nucleotide sequence of the causative viral agent. Such sequence not only defines the "species" of agent (e.g., foot-and-mouth disease virus) but also contains other information, including the genetic history of the virus, how readily it may infect different species, and evidence of genetic modifications. This information will help to determine the source of the virus and to predict its behavior in U.S. livestock populations. Current sequence databases are not annotated appropriately to provide this additional information. Furthermore, some information required for inferring important properties of a given agent is sensitive and not available in public databases. These deficiencies in current pathogen sequence databases reduce the efficiency of agencies responding to an outbreak.
Exotic livestock diseases have long represented a serious risk to food-producing animals in the U.S. This risk has increased in recent times due to increased international travel, international trade in livestock and livestock products, and the threat of intentional introduction of disease. The impact of even a small outbreak of these types of diseases would be enormous, both in terms of direct impact on the livestock and producers, and the ripple effect that would resonate throughout the U.S. economy. Recent events (mad cow disease outbreaks) in the U.K., Canada and U.S. clearly indicate the need for a rapid and effective response to outbreaks of exotic livestock diseases to minimize their consequences in the U.S.
This research addresses the goals of ARS National Program Action Plan - Food Animal Production 101: IV. Genomic tools with emphasis on 1. Comprehensive maps, and 5. Bioinformatics and statistical analysis tools. While this program component is often thought of as relating to host genomics, the skills, tools and expertise for database management and analysis of pathogen genomic sequence and annotation are nearly identical to those used for genomic sequencing of livestock species. In fact, the specific goals of this program component, 1. to develop databases of genomic and proteomic information and 2. computer software for efficient manipulation and analysis of genomic and proteomic data, are exactly the goals of the CRIS project. The bioinformatics portion of this project is particularly relevant and will have implications beyond achievement of the current goals.

2. List the milestones (indicators of progress) from your Project Plan.
Year 1 (FY 2003): Procure software and hardware needed to manage and analyze pathogen sequence and annotation. Software includes Vector NTI, Lab Share and Bionumerics. Hardware includes a database server and a 24-node Linux cluster for analysis.
Year 2 (FY 2004): Develop system to detect genetic modification in viral genomic sequence. Prepare and submit manuscript.
Year 3 (FY 2005): Customize the Generic Model Organism Database (Gmod) to make it suitable for managing pathogen sequence and annotation. Prepare and submit manuscript.

3. Milestones:
A. List the milestones that were scheduled to be addressed in FY 2004. How many milestones did you fully or substantially meet in FY 2004, and indicate which ones were not fully or substantially met, briefly explain why not, and your plans to do so.
One milestone was not fully or substantially met: Develop system to detect genetic modification in viral genomic sequence. Prepare and submit manuscript. We found that the distributions of ordered amino acid triplets were different for different viral genome types. We have prepared and submitted a manuscript. This information is fundamental to the achievement of this milestone but it does not fully or substantially meet it. We have extended a specific cooperative agreement with University of South Carolina (see CRIS 5438-32000-025-01S) to complete this milestone in the next 6 months. They will use GC content, codon bias, distributions of ordered amino acid triplets, and restriction sites to detect evidence of genetic modification.
B. List the milestones that you expect to address over the next 3 years (FY 2005, 2006, and 2007). What do you expect to accomplish, year by year, over the next 3 years under each milestone?
This project is scheduled to terminate in 2005. Year 3 (FY 2005): Customize the Generic Model Organism Database (Gmod) to make it suitable for managing pathogen sequence and annotation. Prepare and submit manuscript. Currently, Gmod does not easily exchange data with GenBank (it was designed to exchange data with Ensembl). Cold Spring Harbor Laboratory (CSHL; CRIS 5438-32000-025-02S) is adding functionality to Gmod to accommodate seamless data exchange with GenBank.

4. What were the most significant accomplishments this past year?
A. Single Most Significant Accomplishment During FY 2004. We discovered that the distributions of ordered amino acid triplets were different for different viral genome types. We have prepared and submitted a manuscript. This information is fundamental to detecting genome modification. This research was conducted through a specific cooperative agreement with University of South Carolina (see CRIS 5438-32000-025-01S).
B. Other Significant Accomplishment(s). None.
C. Significant Activities That Support Special Target Populations. None.
D. Progress Report. None.

5. Describe the major accomplishments over the life of the project, including their predicted or actual impact.
We discovered that the distributions of ordered amino acid triplets were different for different viral genome types. We have prepared and submitted a manuscript. This information is fundamental to detecting genome modification. Distribution of amino acid triplets may prove to be a powerful technique for classifying novel anonymous sequences. The impact may extend well beyond pathogen sequencing. This research was conducted through a specific cooperative agreement with University of South Carolina (see CRIS 5438-32000-025-01S).
Milestone for 2004: Develop system to detect genetic modification in viral genomic sequence. Prepare and submit manuscript.
Action Plan components: National Program 101 - IV Genomic tools with emphasis on 1. Comprehensive maps, and 5. Bioinformatics and statistical analysis tools.

6. What science and/or technologies have been transferred and to whom? When is the science and/or technology likely to become available to the end-user (industry, farmer, other scientists)? What are the constraints, if known, to the adoption and durability of the technology products?
The technology will be accessible to other scientists and action agencies once the paper is published. Constraints to adoption of the methodology depend on sample sequencing capacity in the event of a disease outbreak. Once the genomic infrastructure (sequencing capacity and associated expertise) is available, the technology can be widely applied.
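Two of the sequence features named under milestone 3A, GC content and restriction sites, are simple to compute, and the Python sketch below shows one way to do so. It is an illustration under stated assumptions, not the cooperators' system: the function names, the window parameters, and the toy sequences are hypothetical. The one outside fact used is the EcoRI recognition sequence, GAATTC.

```python
from typing import List


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq: str, window: int, step: int) -> List[float]:
    """GC content in successive windows; a local shift away from the
    genome-wide value is one possible hint of inserted material."""
    return [gc_content(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]


def find_sites(seq: str, site: str) -> List[int]:
    """0-based start positions of every occurrence of a recognition site,
    including overlapping occurrences."""
    seq, site = seq.upper(), site.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions


ECORI = "GAATTC"  # EcoRI recognition sequence

# Toy sequences only, not real viral data.
toy = "AAGAATTCTTGAATTC"
ecori_positions = find_sites(toy, ECORI)
gc_profile = windowed_gc("GGGGAAAA", window=4, step=4)
```

A feature such as an unexpected cluster of cloning-enzyme sites, or a window whose GC content departs sharply from the rest of the genome, would be a candidate for closer inspection rather than proof of modification.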

Impacts
(N/A)

Publications