Identification of Copy Number Polymorphisms in the Bovine Genome

Recipient Organization
UNIVERSITY OF MISSOURI
(N/A)
COLUMBIA, MO 65211

Performing Department
VETERINARY PATHOLOGY

Non Technical Summary
The project represents a step in the development of DNA markers to improve disease resistance and production traits in cattle by selective breeding. The purpose of the project is to identify copy number polymorphisms that influence disease resistance and/or production traits.

Animal Health Component

Research Effort Categories

Basic

90%

Applied

Developmental

Classification

Knowledge Area (KA)	Subject of Investigation (SOI)	Field of Science (FOS)	Percent
304	3399	1040	100%

Knowledge Area
304 - Animal Genome;

Subject Of Investigation
3399 - Beef cattle, general/other;

Field Of Science
1040 - Molecular biology;

Keywords

disease susceptibility

Goals / Objectives
Two recent publications (Sebat et al., Science 2004; Iafrate et al., Nature Genetics 2004) reported previously unappreciated copy number polymorphisms (CNPs) in the human genome. Many of these CNPs exceeded 100 Kb in length, and it has been suggested that the polymorphisms may make a major contribution to individual differences in disease susceptibility and other traits among people. It is likely that similar CNPs occur in cattle and that they are responsible for a variety of phenotypic differences among individual animals, including differences in disease susceptibility and production traits. Thus, our goal is to take advantage of the soon-to-be-completed bovine genome project and use an alternative strategy to discover and map CNPs in the bovine genome.

Project Methods
We propose to run automated BLAST searches of trace sequences from the Bovine Genome Sequencing Project against consecutive segments of the assembled bovine genome sequence and to calculate the depth of coverage for each segment. This computer analysis will provide a plot of depth of coverage versus chromosomal position for each of the bovine chromosomes (except the Y chromosome), showing the chromosomal positions of repeated genome segments. The next step will be to identify the repeated regions that are CNPs, that is, those whose copy number varies from individual to individual. Due to limited resources, we will be able to evaluate only 5 to 10 of the thousands of repeated genome segments expected from the in silico survey. We will therefore prioritize the repeated genome segments and examine those most likely to exhibit copy number polymorphisms that influence disease resistance or production traits.
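The depth-of-coverage calculation described above can be illustrated with a minimal Python sketch. It is not the project's own code: the function names, the half-open (start, end) alignment tuples, the 10,000-base window, and the rule of flagging windows at twice the median depth are all assumptions made for illustration, and it uses consecutive non-overlapping windows for simplicity.

```python
from statistics import median


def window_depths(alignments, chrom_length, window=10_000):
    """Mean read depth in consecutive, non-overlapping windows.

    alignments: iterable of (start, end) pairs, 0-based and half-open,
    giving where each read aligned on a single chromosome (hypothetical input).
    """
    n_windows = (chrom_length + window - 1) // window
    covered = [0] * n_windows  # aligned bases that fall inside each window

    for start, end in alignments:
        start = max(start, 0)
        end = min(end, chrom_length)
        pos = start
        # Split the alignment at window boundaries and credit each piece.
        while pos < end:
            w = pos // window
            piece_end = min((w + 1) * window, end)
            covered[w] += piece_end - pos
            pos = piece_end

    depths = []
    for w in range(n_windows):
        # The last window may be shorter than the others.
        w_len = min(window, chrom_length - w * window)
        depths.append(covered[w] / w_len)
    return depths


def flag_candidate_repeats(depths, fold=2.0):
    """Indices of windows whose depth is at least `fold` times the median.

    The fold-over-median rule is an illustrative assumption, not a
    threshold taken from the project.
    """
    base = median(depths)
    if base <= 0:
        return []
    return [i for i, d in enumerate(depths) if d >= fold * base]
```

Plotting the list returned by `window_depths` against window index gives the kind of depth-versus-position profile the methods describe. Windows with unusually high depth are candidates for repeated segments only; whether they are polymorphic among individuals still has to be tested separately.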

Progress 10/01/04 to 09/30/06

Outputs
We looked for regions with recent duplications that are not reflected in the NCBI genome assemblies by determining the mean depth of coverage by whole-genome shotgun reads in consecutive overlapping windows along each chromosome. Although our ultimate target was the bovine genome, assembly of the canine genome was completed earlier, so we decided to test the strategy on the canine genome assembly.
The dog genome assembly, in which repeats had been masked in lower case, was downloaded from the UCSC genome ftp site. Fasta and XML files for the genome sequence reads were downloaded from the NCBI TraceDB ftp site. The XML files were parsed to identify reads that had been generated by whole-genome shotgun sequencing. To reduce memory requirements for sequence comparisons, the 35,078,056 reads were divided into sets of 50,000 sequences each. Repeats in the reads were lower-case masked using RepeatMasker and RepBase. The masked reads were searched against the dog genome assembly using BLAT. Perl and R scripts were written to summarize the BLAT results. The number of matching sequencing reads within windows along each chromosome was computed and plotted using user-specified thresholds for percent identity, alignment length and window size.
This has been a computationally intensive project. Preliminary results suggest that some novel, highly repetitive sequences were not filtered from the reads or the genome assembly, and that these have significantly increased the required computation time beyond our initial expectation and have also increased output file size. About 98.8% of the reads have been searched using BLAT. Six to eight 2.2 GHz processors have been dedicated to the project since June 2005, and the data currently occupy 150 GB of hard disk. A setback occurred in November 2005 when a RAID drive failed, and the data had to be partitioned into 20-GB parcels among the hard drives associated with the compute nodes.
It was decided that the technical problems associated with the computer analysis were too difficult and the project was terminated.
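The step of summarizing BLAT results into per-window read counts can be sketched as follows. This is a hedged illustration in Python rather than the Perl and R scripts the project actually used. It assumes BLAT's standard tab-separated PSL output (21 columns) and a simplified percent identity computed from the match and mismatch columns. It assigns each hit to a single non-overlapping window by its target start position. The function name and the default thresholds are hypothetical.

```python
from collections import Counter


def count_hits_per_window(psl_lines, min_identity=95.0, min_length=200,
                          window=10_000):
    """Count BLAT hits per (chromosome, window index) after filtering.

    psl_lines: iterable of lines from a PSL file. Header and malformed
    lines are skipped. Threshold defaults are illustrative assumptions.
    """
    counts = Counter()
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        # A PSL data row has 21 columns and starts with an integer.
        if len(fields) != 21 or not fields[0].isdigit():
            continue

        matches = int(fields[0])
        mismatches = int(fields[1])
        rep_matches = int(fields[2])
        t_name = fields[13]
        t_start = int(fields[15])
        t_end = int(fields[16])

        aligned = matches + mismatches + rep_matches
        if aligned == 0:
            continue
        # Simplified identity: matching bases over aligned bases,
        # ignoring gaps (an assumption, not BLAT's own score).
        identity = 100.0 * (matches + rep_matches) / aligned
        if identity < min_identity or (t_end - t_start) < min_length:
            continue

        counts[(t_name, t_start // window)] += 1
    return counts
```

Streaming the lines one at a time keeps memory use proportional to the number of windows rather than the number of hits, which matters when the output files are large. The resulting counts can be written out per chromosome and plotted against window index.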

Impacts
Because the computer analysis was inadequate for the task, the impact will be low.

Publications

No publications reported this period