DEVELOPMENT OF BIOINFORMATICS TOOLS FOR LIVESTOCK

Knowledge Area (KA)	Subject of Investigation (SOI)	Field of Science (FOS)	Percent
304	3220	1080	20%
304	3320	1080	20%
304	3410	1080	50%
304	3610	1080	10%

Progress 08/10/03 to 09/11/07

Outputs
Progress Report

Objectives (from AD-416)
Develop bioinformatic tools for acquisition and analysis of DNA sequence information system established for genetic marker data. Develop tools to automate data processing. Develop tools for data entry and display for use by researchers.

Approach (from AD-416)
Purchase hardware as needed, and develop or purchase software as needed, to implement the bioinformatics platform needed to acquire, store, and analyze genetic marker, DNA sequence, and gene expression data. Automate data acquisition, sequence analysis, linkage analysis, and QTL mapping using database triggers, scripts, and other appropriate tools. Develop tools for researchers to submit data to the Oracle database. Develop web pages for data queries, reports, and summaries. Tools will be developed cooperatively with other ARS bioinformatics groups, especially Clay Center, Nebraska.

Significant Activities that Support Special Target Populations
Developed a database for storing and disseminating bovine genotype data among the members of the Bovine HapMap Consortium. Several quality control steps were implemented to reduce genotype errors in the database. Multiple levels of access were implemented in the database for roles such as administrator, breed champion, researcher, and animal owner. In addition to the database development, several queries were written to support the manuscript describing the bovine genotype and haplotype data.

Bovine microRNA differential expression among tissues involved in immune response and embryonic stages was studied. Conserved microRNA homologs in cattle were identified by sequence comparison to the bovine genome. Novel microRNAs in cattle were reported after confirmation by multiple criteria: hairpin secondary structure, unique mapping to the bovine genome, and expression of multiple copies in related tissues.

Prediction of transcription factor binding sites (TFBS) using across-species conservation as evidence was completed.
Using well-studied genes as examples, this approach predicted and verified most previously known TFBS and several new candidate sites. A comprehensive catalog of conserved TFBS in mammal promoters was produced and can be overlaid with other information on major genome browsers. A paper describing this application has been accepted and is currently in press.

As a complement to the Bovine HapMap Consortium project, we initiated a systematic study of copy number variation (CNV) within the same cattle population using array comparative genomic hybridization. Our data demonstrated, for the first time, that significant amounts of CNV exist in cattle; many CNV are common both across diverse cattle breeds and among individuals within a breed. The future objective is to integrate the CNV data with the single-nucleotide polymorphism (SNP) data to accelerate genetic improvement through comprehensive prediction of genetic merit.

EST analysis was performed on 11,520 sequences from Lymantria dispar, and the resulting high-quality sequences were submitted to public databases.

Developed a suite of Perl scripts to map pig and cow EST as part of a collaboration with the Animal Biosciences and Biotechnology Laboratory. The scripts parse BLAST output and map the EST onto human RefSeq orthologs.

Developed a software application for breeders and AI companies to donate semen for use in the genotyping project. The database interface will check the inventory for existing semen and flag the number of units required for any given bull. The dairy breeds targeted for this effort are Holstein, Jersey, and Brown Swiss.

Developed a website for the soybean linkage map using SNP-based markers. The web queries provide the primers used for SNP discovery as well as genotypes for several cultivars.

Accomplishments
Generated and validated over 70,000 novel bovine SNP markers.
This work identifies critical gene pathways in the mammary gland involved in increased milk yield and addresses National Program 101 Component 1: Understanding, Improving, and Effectively Using Animal Genetic and Genomic Resources, Problem Statement 1B: Identify Functional Genes and Their Interactions.

Developed a genome-wide SNP genotyping assay that can be used for studies of whole genome selection in dairy and beef cattle populations. This assay contains tests to interrogate 58,663 sites of the genome for genetic variation. The assay will be available for our experiment in genome selection by late August. This product will be sold commercially in December of 2007.

Identified 22 novel microRNAs in cattle through a combination of experimental and bioinformatics analysis. Identified and mapped all the homologous microRNAs from human, mouse, and rat to the latest version of the bovine genome.

Generated a comprehensive catalog of conserved TFBS in mammal promoters. It is open to the public through free internet access and will provide insight into genome-wide regulation and variation in bovine gene expression.

Technology Transfer
Number of Patent Applications filed: 1
Number of Web Sites managed: 7
Number of Non-Peer Reviewed Presentations and Proceedings: 17
Number of Newspaper Articles, Presentations for NonScience Audiences: 3
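
The EST-to-RefSeq mapping step described above (scripts that parse BLAST output and assign each EST to a human RefSeq ortholog) can be illustrated with a minimal sketch. The project's scripts were written in Perl and are not reproduced here; the Python below is only an illustration under stated assumptions: that the BLAST results are in the standard 12-column tab-delimited format, that "ortholog" is approximated by the single highest bit-score hit, and that an E-value cutoff of 1e-10 is used. The function name `best_hits`, the cutoff, and the example identifiers are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def best_hits(blast_lines: Iterable[str],
              max_evalue: float = 1e-10) -> Dict[str, Tuple[str, float]]:
    """Map each query (EST) ID to its highest bit-score subject (RefSeq) ID.

    Assumes 12-column tabular BLAST output:
    query, subject, %identity, length, mismatches, gap opens,
    q.start, q.end, s.start, s.end, e-value, bit score.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # skip malformed rows
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # hit is too weak to call
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


# Hypothetical example rows (identifiers are made up for illustration).
example = [
    "EST001\tNM_000001\t92.5\t200\t15\t0\t1\t200\t50\t249\t1e-80\t290.0",
    "EST001\tNM_000002\t85.0\t180\t27\t0\t1\t180\t10\t189\t1e-40\t160.0",
    "EST002\tNM_000003\t70.0\t60\t18\t0\t1\t60\t5\t64\t0.5\t30.1",
]
mapping = best_hits(example)
```

In this example EST001 is assigned to its stronger hit and EST002 is left unmapped because its only hit fails the E-value cutoff. A best-hit rule is a simplification; a reciprocal best-hit check would be a stricter criterion for orthology.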

Impacts
(N/A)

Publications

Cooper, B., Neelan, A., Campbell, K., Lee, J., Liu, G., Garrett, W.M., Scheffler, B.E., Tucker, M.L. 2007. Protein Accumulation Changes Associated with Germination of the Uromyces appendiculatus Uredospore. Molecular Plant-Microbe Interactions. 20(7):857-866.

Progress 10/01/05 to 09/30/06

Outputs
Progress Report 1. What major problem or issue is being resolved and how are you resolving it (summarize project aims and objectives)? How serious is the problem? Why does it matter? Cutting edge technologies for mapping and identifying genetic information have led to an explosion in the amount of sequence and structure information available to researchers. Data generation from analysis of DNA sequence, gene expression, genetic marker, and protein structure is ongoing and accelerating. The primary focus of this project is to develop bioinformatics tools to acquire, analyze, and store molecular genetic data. Bioinformatics plays a key role in managing, understanding, and utilizing this molecular genetic information. By providing algorithms, databases, user interfaces, and statistical tools, bioinformatics enables genomic researchers to better identify and understand important genes and can lead to their use to improve animal welfare, production, and efficiency. We will automate data acquisition, sequence analysis, linkage analysis, and quantitative trait loci mapping using scripts, database triggers, and other appropriate tools. Tools will also be developed for genomic researchers to submit data to the public databases and query data via the Internet. We are purchasing the necessary hardware and software tools to implement the bioinformatics programs required. With the public effort to sequence the bovine genome, these tools created to leverage that investment are exceptionally important. The bovine genome is expected to generate widespread benefits for agriculture as well as basic and applied biology. Anticipated impact of these efforts includes ability to: o identify genes that control growth efficiency, muscle development and milk composition; o enhance disease resistance in cattle and sheep through selection of naturally occurring genetic variation; and o improve the nutritional value and safety of beef and dairy products. 
With the bovine genome assembly nearing release and the haplotype map project well underway, both inter- and intra-species sequence comparisons combined with functional genomics are leveraged to accelerate the gene identification process and national selection program in cattle. Within-species comparisons will allow identification of SNP that are useful for enhancing genetic selection, for use in parentage verification and traceability, and as positional candidate genes explaining phenotypic differences. Across-species comparisons may allow identification of conserved coding and non-coding functional elements based on sequence similarities across divergent species. This project falls entirely under National Program 101 Food Animal Production and includes elements of four components of the NP101 Action Plan. The project focuses primarily on the components entitled "Genomic Tools" and "Genetic Improvement" and through collaboration the components "Conserve, Characterize and Use of Genetic Resources" and "Reproductive Efficiency." This project also impacts NP103 (Animal Health) component "Genetic Resistance to Disease" and NP 301 (Plant, Microbial, and Insect Genetic Resources, Genomics and Genetic Improvement) and 302 (Plant Biological and Molecular Processes). NP 101 Collaborator Projects: o "Identification, Validation and Fine-Mapping of Quantitative Trait Loci in Dairy Cattle" (Project #1265-31000-081-00D) o "Functional Genomics of Dairy Production" (Project #1265-31000-086-00D) o "Swine Germplasm Preservation, Propagation, and Embryo Developmental Competence" (Project #1265-31000-082-00D) NP 103 Collaborator Projects: o "Immunological and genetic basis of resistance to parasites of cattle" (Project #1265-32000-072-00D) NP 301 and 302 Collaborator Projects: o "A Single Nucleotide Polymorphism-Based Map of Soybean and Applications to Gene Discovery in Germplasm" (Project #1275-21000-164-00D) 2.
List by year the currently approved milestones (indicators of research progress) Year 1, 2004 1. Design and build marker database. 2. Investigate existing resources, design database for EST pipeline. 3. Obtain chromat data for soybean and cattle data for SNP project. 4. Continue SAGE development, initiate microarray developments. Make SAGE databases available online to external researchers for functional genomics and proteomics. Year 2, 2005 1. Initiate tool and interface development for marker data analysis. 2. Develop tools to populate database, process data at least to dbEST submission for EST pipeline. 3. Develop machine learning algorithm and predict SNP. 4. Continue microarray developments, initiate proteomics developments. Year 3, 2006 1. Complete tool development, interface GenoProb for genetic marker data analysis. 2. Develop assembly and annotation tools for EST pipeline. 3. Validate SNP with collaborators. 4. Continue microarray and proteomics developments. Initial database of livestock MALDI-TOF spectra available online to external researchers. 5. In silico prediction and experimental verification of functional elements by comparative analysis of evolutionarily constrained sequences in bovine genome and high throughput assays. Year 4, 2007 1. Develop de novo SNP development tools. 2. Continue microarray and proteomics developments.
4a List the single most significant research accomplishment during FY 2006. Genomic variation analysis was completed comparing cattle, human, and dog. Neutral mutation rates within each lineage were estimated, and cattle-human-dog orthologous fragments were compared to reveal changes in genome size and to measure the potential contribution from transposable elements. Differential turn-over patterns of transposable elements were identified as a source of genome size differences among the three species. Genomic divergence among human, cow, and dog was characterized.
This analysis provides a large-scale and unbiased assessment of genomic divergences and regional variation of mutation events among cattle, dog, and human. It is expected that these data will serve as a baseline for future mammalian molecular evolution studies.
4b List other significant research accomplishment(s), if any. Mapping of single nucleotide polymorphism (SNP) markers that were not placed in the official genome assembly was made possible by using a combination of a radiation hybrid map and human-cow comparative map information. Prediction of transcription factor binding sites (TFBS) using across-species conservation as evidence was completed. A computational approach combining position-weight matrices and phylogenetic footprinting was customized and optimized to systematically identify conserved TFBS in bovine promoters (the 5 kb regions upstream of all RefGene) by cross-species genomic comparison of several mammals. Our expanded TFBS matrices included conserved position-weight matrices extracted from experimental data and over-represented conserved upstream motifs derived from large-scale phylogenetic analyses of several mammals. Using well-studied genes as examples, this approach verified most previously known TFBS and predicted several new candidate sites. Experimental evidence, including gel shifting and cell transfection, further validated these new TFBS. A comprehensive catalog of conserved TFBS in bovine promoters was produced and can be overlaid with other information on major genome browsers to provide insight into genome-wide regulation and variation in bovine gene expression. Developed a suite of Perl scripts to prepare protein target sequences as part of a collaboration with the Soybean Genomics and Improvement Laboratory.
The scripts parse tblastx output and translate soybean rust EST sequence into amino acid sequence in multiple frames to facilitate the identification of peptides obtained from mass spectrometers using MudPIT (Multidimensional Protein Identification Technology). A database and integrated data analysis system to support SNP data as part of the International Bovine HapMap Consortium was developed. This software has a web interface for data analysis and visualization. This effort is in collaboration with scientists at Texas A&M University. Developed a high-density soybean genetic map using SNP markers. Primers were developed using soybean traces for expressed sequence tag sequences to identify SNP in genes. Genomic DNA fragments (BAC ends and BAC) were repeat masked and filtered for known genes to identify SNP distributed throughout the genome. Developed a semen donation website to facilitate producer and stakeholder donation of semen from essential animals required for genomics research. Improved hardware and software for data backups was purchased to support the expanding data storage needs across a variety of servers. High reliability and availability backups are now possible. Additional extension to desktop systems has been deployed.
5. Describe the major accomplishments to date and their predicted or actual impact. Briefly, the development of EST-PAGE has been the major accomplishment to date. This package allows researchers to easily process EST data including: base calling, vector trimming, screening for contaminants and low complexity regions, and dbEST submission. This software is being made available as an open source project. As a result, over 25 groups around the world have requested and received source code for EST-PAGE.
6. What science and/or technologies have been transferred and to whom? When is the science and/or technology likely to become available to the end-user (industry, farmer, other scientists)?
What are the constraints, if known, to the adoption and durability of the technology products? The EST-PAGE software described in Question 5 has been made available to researchers at BARC and worldwide. The existing Unix server continues to provide GCG service to all Beltsville genomic personnel. A number of seminars were presented at scientific and corporate meetings to share information about the research goals and accomplishments of this project as well as results based on tools created under this project. Presentations at scientific meetings are outlined in question 7, below. Invited presentations included.
7. List your most important publications in the popular press and presentations to organizations and articles written about your work. (NOTE: List your peer reviewed publications below).
Popular press articles: None.
Important Presentations:
Liu, G., Van Tassell, C.P., Sonstegard, T.S., Matukumalli, L.K., and Shade, L.L. 2005. Comparative analysis of genomic variation among mammalian genomes: Recent genome expansions and substitution patterns. Genome Informatics. Log Number 184836.
Liu, G., Matukumalli, L.K., Sonstegard, T.S., and Van Tassell, C.P. 2006. Differential turn-over patterns of transposable elements account for the genome size differences among cattle, dog, and human. The Biology of Genomes. Log Number 192947.
Matukumalli, L.K., Couthino, L.L., Sonstegard, T.S., Van Tassell, C.P., Gasbarre, L.C., Capuco, A.V., Smith, T.P. 2006. Identification of bovine microRNAs. 71st Symposium of regulatory RNAs. Log Number 193698.
Liu, G., Yang, J., Matukumalli, L.K., Sonstegard, T.S., Van Tassell, C.P., and Hanson, R.W. 2006. In silico prediction of regulatory elements by phylogenetic footprinting. BARC Poster Day 2006. Log Number 194608.
Liu, G., Yang, J., Matukumalli, L.K., Sonstegard, T.S., Van Tassell, C.P., and Hanson, R.W. 2006. Systematic identification of regulatory elements in cattle. International Conference on Animal Genetics. Log Number 194612.
Articles written about our research: None.
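
The TFBS prediction described under 4b combines position-weight matrices with phylogenetic footprinting. The sketch below is a generic illustration of that idea, not the project's customized method: it builds a log-odds matrix from hypothetical count data, scans a promoter sequence for windows scoring at or above a threshold, and keeps only the positions that also pass in every aligned ortholog. The uniform background, the pseudocount of 1, the threshold, the toy "TATA" motif, and the assumption that the orthologous sequences are already aligned to equal length are all assumptions made for illustration.

```python
import math
from typing import Dict, List

BASES = "ACGT"


def log_odds_matrix(counts: List[Dict[str, int]],
                    pseudocount: float = 1.0,
                    background: float = 0.25) -> List[Dict[str, float]]:
    """Convert per-column base counts into log2 odds against a uniform background."""
    matrix = []
    for column in counts:
        total = sum(column.get(b, 0) for b in BASES) + pseudocount * len(BASES)
        matrix.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in BASES
        })
    return matrix


def scan(sequence: str, matrix: List[Dict[str, float]],
         threshold: float) -> List[int]:
    """Return 0-based start positions whose window score meets the threshold."""
    seq = sequence.upper()
    width = len(matrix)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing gaps or ambiguous bases
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append(start)
    return hits


def conserved_hits(aligned_seqs: List[str], matrix: List[Dict[str, float]],
                   threshold: float) -> List[int]:
    """Keep positions that pass the threshold in every aligned species."""
    per_species = [set(scan(s, matrix, threshold)) for s in aligned_seqs]
    if not per_species:
        return []
    return sorted(set.intersection(*per_species))


# Hypothetical toy motif (consensus TATA) and aligned promoter fragments.
counts = [{"T": 8}, {"A": 8}, {"T": 8}, {"A": 8}]
pwm = log_odds_matrix(counts)
cow, human, dog = "GGCTATAGGC", "GGCTATAGAC", "GGCTACAGGC"
cow_sites = scan(cow, pwm, 6.0)
shared_cow_human = conserved_hits([cow, human], pwm, 6.0)
shared_all = conserved_hits([cow, human, dog], pwm, 6.0)
```

Requiring a hit at the same alignment column in every species is the strictest form of the conservation filter; relaxing it to a subset of species would trade specificity for sensitivity.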

Impacts
(N/A)

Publications

Liu, G., Van Tassell, C.P., Sonstegard, T.S., Matukumalli, L.K., Shade, L.L. 2006. Genomic divergences among cattle, dog, and human estimated from large-scale alignments of genomic sequences. Biomed Central (BMC) Genomics. 7(1):140.
Matukumalli, L.K., Grefenstette, J.J., Van Tassell, C.P., Choi, I., Cregan, P.B. 2006. Application of machine learning in SNP discovery. BMC Bioinformatics. 6(7):4.

Progress 10/01/04 to 09/30/05

Outputs
1. What major problem or issue is being resolved and how are you resolving it (summarize project aims and objectives)? How serious is the problem? What does it matter? Cutting edge technologies for mapping and identifying genetic information have led to an explosion in the amount of sequence and structure information available to researchers. Data generation from analysis of DNA sequence, gene expression, genetic marker, and protein structure is ongoing and accelerating. The primary focus of this project is to develop bioinformatics tools to acquire, analyze, and store molecular genetic data. Bioinformatics plays a key role in managing, understanding, and utilizing this molecular genetic information. By providing algorithms, databases, user interfaces, and statistical tools, bioinformatics enables genomic researchers to better identify and understand important genes and can lead to their use to improve animal welfare, production, and efficiency. We will automate data acquisition, sequence analysis, linkage analysis, and quantitative trait loci mapping using scripts, database triggers, and other appropriate tools. Tools will also be developed for genomic researchers to submit data to the public databases and query data via the Internet. We are purchasing the necessary hardware and software tools to implement the bioinformatics programs required. With the public effort to sequence the bovine genome, these tools created to leverage that investment are exceptionally important. The bovine genome is expected to generate widespread benefits for agriculture as well as basic and applied biology. Anticipated impact of these efforts includes ability to: o identify genes that control growth efficiency, muscle development and milk composition; o enhance disease resistance in cattle and sheep through selection of naturally occurring genetic variation; and o improve the nutritional value and safety of beef and dairy products. 
With the bovine genome assembly nearing release and the haplotype map project well underway, both inter- and intra-species sequence comparisons combined with functional genomics are leveraged to accelerate the gene identification process and national selection program in cattle. Within species comparisons will allow identification of SNP that are useful for enhancing genetic selection, for use in parentage verification and traceability, and as positional candidate genes explaining phenotypic differences. Across species comparisons may allow identification of conserved coding and non-coding functional elements based on sequence similarities across divergent species. This project falls entirely under National Program 101 Food Animal Production and includes elements of four components of the NP101 Action Plan. The project focuses primarily on the components entitled "Genomic Tools" and "Genetic Improvement" and through collaboration the components "Conserve, Characterize and Use of Genetic Resources" and "Reproductive Efficiency." This project also impacts NP103 (Animal Health) component "Genetic Resistance to Disease" and NP 301 (Plant, Microbial, and Insect Genetic Resources, Genomics and Genetic Improvement) and 302 (Plant Biological and Molecular Processes.) NP 101 Collaborator Projects: . "Identification, Validation and Fine-Mapping of Quantitative Trait Loci in Dairy Cattle" (Project 1265-31000-081-00D) . "Functional Genomics of Dairy Production" (Project 1265-31000-086-00D) . "Swine Germplasm Preservation, Propagation, and Embryo Developmental Competence" (Project 1265-31000-082-00D) NP 103 Collaborator Projects: . "Immunological and genetic basis of resistance to parasites of cattle" (Project 1265-32000-072-00D) NP 301 and 302 Collaborator Projects: . "A Single Nucleotide Polymorphism-Based Map of Soybean and Applications to Gene Discovery in Germplasm" (Project 1275-21000-164-00D) 2. List the milestones (indicators of progress) from your Project Plan. 
Year 1, 2004 1. Design and build marker database. 2. Investigate existing resources, design database for EST pipeline. 3. Obtain chromat data for soybean and cattle data for SNP project. 4. Continue SAGE development, initiate microarray developments. Make SAGE databases available online to external researchers for functional genomics and proteomics. Year 2, 2005 1. Initiate tool and interface development for marker data analysis. 2. Develop tools to populate database, process data at least to dbEST submission for EST pipeline. 3. Develop machine learning algorithm and predict SNP. 4. Continue microarray developments, initiate proteomics developments. Year 3, 2006 1. Complete tool development, interface GenoProb for genetic marker data analysis. 2. Develop assembly and annotation tools for EST pipeline. 3. Validate SNP with collaborators. 4. Continue microarray and proteomics developments. Initial database of livestock MALDI-TOF spectra available online to external researchers. 5. In silico prediction and experimental verification of functional elements by comparative analysis of evolutionarily constrained sequences in bovine genome and high throughput assays. Year 4, 2007 1. Develop de novo SNP development tools. 2. Continue microarray and proteomics developments. 3a List the milestones that were scheduled to be addressed in FY 2005. For each milestone, indicate the status: fully met, substantially met, or not met. If not met, why. 1. Continue marker database development. Initiate tool and interface development for marker data analysis. Initiate tests of MTDFREML-Q and integration with microsatellite database for automated data analysis. Milestone Substantially Met 2. Continue development of EST-PAGE, our EST pipeline. Improve tools in that package, such as assembly, Blast, and gene ontology. Milestone Fully Met 3. Develop machine learning algorithm to predict SNP using outbred populations. Milestone Substantially Met 4. 
Continue microarray developments, initiate proteomics developments. Milestone Substantially Met
3b List the milestones that you expect to address over the next 3 years (FY 2006, 2007, and 2008). What do you expect to accomplish, year by year, over the next 3 years under each milestone? Year 3, 2006 1. Complete tool development, interface GenoProb for genetic marker data analysis. 2. Continue development of annotation tools for EST-PAGE. 3. Validate SNP with collaborators. 4. Continue microarray and proteomics developments. Initial database of livestock MALDI-TOF spectra available online to external researchers. Year 4, 2007 1. Develop de novo SNP development tools, depending on the success of SNP discovery from genome data. 2. Continue microarray and proteomics developments. Year 1, New Project, 2008, if approved 1. Develop tools to facilitate genomic selection. 2. Develop tools for QTL fine mapping using dense SNP data.
4a What was the single most significant accomplishment this past year? A software package called SNP-PHAGE was developed and is being released. This package allows researchers to easily process DNA sequence data for polymorphism discovery and submission to the public database (GenBank's dbSNP). This software has a web interface for data analysis and visualization. This software is being used to analyze soybean and bovine sequences and has discovered more than 10,000 SNP. A manuscript describing this application is in preparation. This software is being made available as an open source project. We believe the development of this turnkey package will enable small genomics groups without bioinformatics support to process data rapidly and, most importantly, to submit data to the public databases for widespread use.
4b List other significant accomplishments, if any. A new full-time permanent scientist was hired. The investigator brings strong comparative bioinformatics expertise to the project.
A genomic variation analysis comparing cattle and other mammals was completed. Neutral mutation rates within each lineage were estimated, and cattle-human-dog orthologous fragments were compared to reveal changes in genome size and to measure the potential contribution from transposable elements. Machine learning was implemented in a new software application package, SNP-PHAGE-ML, to improve the accuracy of polymorphism prediction. We achieved a five-fold increase in the success rate of genetic marker discovery with this software. A manuscript describing this application has been submitted for publication and is currently under review. Computer hardware was substantially improved with the purchase of a new 4-processor server and high performance disk storage. This system will replace the aging multi-processor system, which will become the development and testing platform. An improved hardware and software solution for data backups was purchased to support the expanding data storage needs across a variety of servers. High reliability and availability backups are now possible. Additional extension to desktop systems has been purchased and is being deployed. A database schema, interface, and web server are being designed and will be deployed to support the Bovine Genome Project Haplotype Mapping project. This effort is in collaboration with scientists at Texas A&M University.
5. Describe the major accomplishments over the life of the project, including their predicted or actual impact. Briefly, the development of EST-PAGE has been the major accomplishment to date. This package allows researchers to easily process EST data including: base calling, vector trimming, screening for contaminants and low complexity regions, and dbEST submission. This software is being made available as an open source project. As a result, over 25 groups around the world have requested and received source code for EST-PAGE.
6.
What science and/or technologies have been transferred and to whom? When is the science and/or technology likely to become available to the end- user (industry, farmer, other scientists)? What are the constraints, if known, to the adoption and durability of the technology products? The EST-PAGE software described in Question 5 has been made available to researchers at BARC and worldwide. The existing Unix server continues to provide GCG service to all Beltsville genomic personnel. A number of seminars were presented at scientific and corporate meetings to share information about the research goals and accomplishments of this project as well as results based on tools created under this project. Presentations at scientific meetings are outlined in question 7, below. Invited presentations included "An introduction to bovine genomics and bioinformatics: From QTL mapping to the bovine genome project," presented at George Mason University as part of the seminar series in the Bioinformatics and Computational Biology Program. An invited presentation entitled "A Brief Introduction to BFGL, QTL Mapping, QTN Validation, and Bioinformatics" was presented to the Iowa State University animal breeding group. The bioinformatics research program was described in a presentation as part of an ARS- Norway dairy cattle genomics research meeting.

Impacts
(N/A)

Publications

Min, W., Lillehoj, H.S., Ashwell, C.M., Van Tassell, C.P., Dalloul, R.A., Matukumalli, L.K., Han, J.Y., Lillehoj, E.P. 2005. EST analysis of Eimeria-stimulated intestinal intraepithelial lymphocytes in chickens. Molecular Biotechnology. 30:143-150.
Blomberg, L., Long, E.L., Sonstegard, T.S., Van Tassell, C.P., Dobrinsky, J.R., Zuelke, K.A. 2004. Serial analysis of gene expression (SAGE) during elongation of peri-implantation porcine trophectoderm (conceptus). Physiological Genomics. 20:188-194.
Baumann, R.G., Baldwin, R.L., Sonstegard, T.S., Van Tassell, C.P., Matukumalli, L. 2004. Characterization of a normalized cDNA library from bovine intestinal muscle and epithelial tissues [abstract]. 7th International Meeting of the Microarray Gene Expression Data Society. p. 24.
Matukumalli, L.K., Grefenstette, J.J., Sonstegard, T.S., Van Tassell, C.P. 2005. Optimizing bovine SNP discovery efforts [abstract]. Recomb2005. p. 227.
Matukumalli, L.K., Grefenstette, J.J., Van Tassell, C.P., Choi, I., Cregan, P.B. 2004. Development of algorithms for prediction and validation of polymorphisms in polyploids (soybean) using EST data [abstract]. TIGR's Sixteenth International Genome Sequencing and Analysis Conference. p. 33.
Matukumalli, L.K., Grefenstette, J.J., Van Tassell, C.P., Choi, I., Cregan, P.B. 2004. Application of machine learning programs towards accelerating polymorphisms discovery [abstract]. 7th Annual Conference on Computational Genomics. p. 30.

Progress 10/01/03 to 09/30/04

Outputs
1. What major problem or issue is being resolved and how are you resolving it (summarize project aims and objectives)? How serious is the problem? What does it matter? Data generation from analysis of DNA sequence, gene expression, genetic marker, and proteomic structure is ongoing and accelerating. The primary focus of this project is to develop bioinformatic tools to acquire, analyze, and store molecular genetic data. We are purchasing the necessary hardware and software tools to implement the bioinformatics platform required. We will automate data acquisition, sequence analysis, linkage analysis, and quantitative trait loci mapping using triggers, scripts, and other appropriate tools. Tools will also be developed for genomic researchers to submit data to the databases and query data via the Internet. Cutting edge technologies for mapping and identifying genetic information have led to an explosion in the amount of sequence and structural information available to researchers. Bioinformatics plays a key role in managing, understanding, and utilizing this molecular genetic information. By providing algorithms, databases, user interfaces, and statistical tools, bioinformatics enables genomic researchers to better identify and understand important genes and can lead to their use to improve animal welfare, production, and efficiency. With the public effort to sequence the bovine genome, the tools created to leverage that investment are exceptionally important. The bovine genome is expected to generate widespread benefits for agriculture as well as basic and applied biology. Anticipated impact of these efforts includes ability to: . identify genes that control growth efficiency, muscle development and milk composition; . enhance disease resistance in cattle and sheep through selection of naturally occurring genetic variation; and . improve the nutritional value and safety of beef and dairy products.
This project falls entirely under National Program 101, Food Animal Production, and includes elements of four components of the NP 101 Action Plan. The project focuses primarily on the components entitled "Genomic Tools" and "Genetic Improvement" and, through collaboration, on the components "Conserve, Characterize and Use of Genetic Resources" and "Reproductive Efficiency." This project also impacts the NP 103 (Animal Health) component "Genetic Resistance to Disease," NP 301 (Plant, Microbial, and Insect Genetic Resources, Genomics and Genetic Improvement), and NP 302 (Plant Biological and Molecular Processes).
NP 101 Collaborator Projects:
. "Identification, Validation and Fine-Mapping of Quantitative Trait Loci in Dairy Cattle" (CRIS # 1265-31000-081-00)
. "Functional Genomics of Dairy Production" (CRIS # 1265-31000-086-00)
. "Swine Germplasm Preservation, Propagation, and Embryo Developmental Competence" (CRIS # 1265-31000-082-00)
NP 103 Collaborator Projects:
. "Immunological and genetic basis of resistance to parasites of cattle" (CRIS # 1265-32000-072)
NP 301 and 302 Collaborator Projects:
. "A Single Nucleotide Polymorphism-Based Map of Soybean and Applications to Gene Discovery in Germplasm" (CRIS # 1275-21000-164-00)
2. List the milestones (indicators of progress) from your Project Plan.
Year 1, 2004
   1. Design and build marker database.
   2. Investigate existing resources, design database for EST pipeline.
   3. Obtain chromat data for soybean and cattle data for SNP project.
   4. Continue SAGE development, initiate microarray developments. Make SAGE databases available online to external researchers for functional genomics and proteomics.
Year 2, 2005
   1. Initiate tool and interface development for marker data analysis.
   2. Develop tools to populate database, process data at least to dbEST submission for EST pipeline.
   3. Develop machine learning algorithm and predict SNP.
   4. Continue microarray developments, initiate proteomics developments.
Year 3, 2006
   1. Complete tool development, interface GenoProb for genetic marker data analysis.
   2. Develop assembly and annotation tools for EST pipeline.
   3. Validate SNP with collaborators.
   4. Continue microarray and proteomics developments. Initial database of livestock MALDI-TOF spectra available online to external researchers.
Year 4, 2007
   1. Develop de novo SNP development tools.
   2. Continue microarray and proteomics developments.
3. Milestones:
The status of the following milestones scheduled to be met in FY 2004 is indicated below:
   1. Design and build marker database. Status: In progress. Due to the complexities of migrating old data into a new database, this process has been slower than expected.
   2. Investigate existing resources, design database for EST pipeline. Status: Completed.
   3. Obtain chromat data for soybean and cattle data for SNP project. Status: Completed. The bovine project continues as the cattle genome project is underway.
   4. Continue SAGE development, initiate microarray developments. Make SAGE databases available online to external researchers for functional genomics and proteomics. Status: Completed.
B. List the milestones that you expect to address over the next 3 years (FY 2005, 2006, & 2007). What do you expect to accomplish, year by year, over the next 3 years under each milestone?
Year 2, 2005
   1. Continue marker database development. Initiate tool and interface development for marker data analysis. Initiate tests of MTDFREML-Q and integration with microsatellite database for automated data analysis.
   2. Continue development of EST-PAGE, our EST pipeline. Improve tools in that package, such as assembly, Blast, and gene ontology.
   3. Develop machine learning algorithm to predict SNP using outbred populations.
   4. Continue microarray developments, initiate proteomics developments.
Year 3, 2006
   1. Complete tool development, interface GenoProb for genetic marker data analysis.
   2. Continue development of annotation tools for EST-PAGE.
   3. Validate SNP with collaborators.
   4. Continue microarray and proteomics developments. Initial database of livestock MALDI-TOF spectra available online to external researchers.
Year 4, 2007
   1. Develop de novo SNP development tools, depending on the success of SNP discovery from genome data.
   2. Continue microarray and proteomics developments.
4. What were the most significant accomplishments this past year?
A. Single Most Significant Accomplishment during FY2004 (one per Research Project):
A software package called EST-PAGE was developed and released. This package allows researchers to easily process EST data, including base calling, vector trimming, screening for contaminants and low complexity regions, and dbEST submission. This software is being made available as an open source project. As a result, over 25 groups around the world have requested and received source code for EST-PAGE. We believe the development of this turnkey package will enable small genomics groups without bioinformatics support to process data rapidly and, most importantly, to submit data to the public databases for widespread use.
B. Other Significant Accomplishment(s), if any.
Great improvements in computer hardware and software have been achieved. A Linux cluster with 32 processors was purchased with assistance from the Area Office. An additional computing system with hardware optimized for additional sequence similarity search tools has been acquired with assistance from four other laboratories that have collaborating research projects. A vastly improved hardware and software solution for data backups is being purchased to support the expanding data storage needs across a variety of servers. For SAGE and microarray development, microarray data analysis tools were created for a candidate gene cDNA microarray using a hierarchical mixed model in SAS.
SAGE databases were made public via the ANRI web site (including a clickwrap license agreement for access) and were submitted to the GEO database at NIH's National Center for Biotechnology Information.
C. Significant activities that support special target populations
None.
D. Progress Report opportunity to submit additional programmatic information to your Area Office and NPS (optional for all in-house ("D") projects and the projects listed in Appendix A; mandatory from all other subordinate projects).
None.
5. Describe the major accomplishments over the life of the project, including their predicted or actual impact.
This is the first complete year of the project, and as such its accomplishments are those indicated in Question 4. Briefly, the development of EST-PAGE described in Question 4 has been the primary accomplishment to date. Additional tools for single nucleotide polymorphism production are currently under development. This application is expected to have widespread impact and usage with the completion of the bovine genome. Based on primary results from other organisms, this software should facilitate automated high throughput marker development.
6. What science and/or technologies have been transferred and to whom? When is the science and/or technology likely to become available to the end-user (industry, farmer, other scientists)? What are the constraints, if known, to the adoption and durability of the technology products?
The EST-PAGE software described in Question 4 has been made available to researchers at BARC and worldwide. The existing Unix server continues to provide GCG service to all Beltsville genomic personnel. A number of seminars were presented at scientific and corporate meetings to share information about the research goals and accomplishments of this project as well as results based on tools created under this project.
7. List your most important publications in the popular press and presentations to organizations and articles written about your work.
Popular press articles:
. none
Important Presentations:
Choi, I.-Y., Hyten, D.L., Matukumalli, L.K., Yi, S.-I., Cregan, P.B. SNP Discovery in Legume Species Using Primers Derived from Soybean Unigenes. BARC Poster Day 2004.
Matukumalli, L.K., Grefenstette, J.J., Van Tassell, C.P., Choi, I.-Y., Cregan, P.B. In silico prediction and validation of polymorphisms in soybean genome using EST data. BARC Poster Day 2004.
Articles written about our research:
. none
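Illustrative note on the EST processing steps named in Question 4: the report lists vector trimming and screening for low complexity regions among the steps EST-PAGE performs, but does not describe how they are implemented. The following is only a minimal sketch of what those two steps mean, not the EST-PAGE code. The function names, the vector sequence, and the thresholds (`min_overlap`, `max_fraction`, `min_length`) are all hypothetical choices made for illustration.

```python
from collections import Counter
from typing import Optional


def trim_vector(read: str, vector_end: str, min_overlap: int = 8) -> str:
    """Drop a leading vector fragment from a read.

    If the read begins with a suffix of `vector_end` that is at least
    `min_overlap` bases long, that prefix is removed. Exact matching only;
    this is an illustrative simplification.
    """
    read = read.upper()
    vector_end = vector_end.upper()
    # Try the longest possible overlap first, then shorter ones.
    for k in range(min(len(vector_end), len(read)), min_overlap - 1, -1):
        if read.startswith(vector_end[-k:]):
            return read[k:]
    return read


def is_low_complexity(seq: str, max_fraction: float = 0.8) -> bool:
    """Flag a sequence dominated by a single base (for example, a poly-A run)."""
    if not seq:
        return True
    most_common_count = Counter(seq.upper()).most_common(1)[0][1]
    return most_common_count / len(seq) >= max_fraction


def clean_read(read: str, vector_end: str, min_length: int = 20) -> Optional[str]:
    """Return the trimmed insert, or None if the read should be rejected."""
    insert = trim_vector(read, vector_end)
    if len(insert) < min_length or is_low_complexity(insert):
        return None
    return insert
```

A read that passes `clean_read` would move on to later steps such as contaminant screening and submission; a read that returns `None` would be set aside. A production pipeline would normally use alignment-based tools for these steps rather than exact string matching.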

Impacts
(N/A)

Publications

Matukumalli, L.K., Grefenstette, J.J., Sonstegard, T.S., Van Tassell, C.P. 2004. EST-PAGE - a simple web interface for managing and analyzing EST data. Bioinformatics, 20:286-288.
Cregan, P.B., Fickus, E.W., Hyatt, S.M., Zhu, Y.L., Song, Q.J., Young, N.P., Grimm, D.R., Hyten, D.L., Van Tassell, C.P., Matukumalli, L.K. Single nucleotide polymorphisms (SNPs) in soybean. Genetics.
Van Tassell, C.P., Cregan, P.B., Matukumalli, L.K., Grefenstette, J.J., Choi, I. 2003. Identification of paralogs and SNP in the allopolyploid soybean genomics from EST data [abstract]. Proceedings of International Meeting on Single Nucleotide Polymorphism and Complex Genome Analysis. Abstract 64.
Matukumalli, L.K., Grefenstette, J.J., Sonstegard, T.S., Van Tassell, C.P. 2003. EST-PAGE: a tool for managing and analyzing EST data [abstract]. TIGR XV GSAC Conference, September 2003.