Progress 10/01/04 to 09/30/06
Outputs We sought regions containing recent duplications that are not reflected in the NCBI genome assemblies by determining the mean depth of coverage of whole-genome shotgun reads in consecutive overlapping windows along each chromosome. Although our ultimate target was the bovine genome, the canine genome assembly was completed earlier, so we decided to test the strategy on it first. The dog genome assembly, with repeats masked in lower case, was downloaded from the UCSC genome ftp site. Fasta and XML files for the genome sequence reads were downloaded from the NCBI TraceDB ftp site. The XML files were parsed to identify reads generated by whole-genome shotgun sequencing. To reduce memory requirements for sequence comparisons, the 35,078,056 reads were divided into sets of 50,000 sequences each. Repeats in the reads were lower-case masked using RepeatMasker and RepBase. The masked reads were then searched against the dog genome assembly using BLAT.
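The step of dividing the reads into sets of 50,000 can be sketched as follows. This is a minimal illustration in Python, not the scripts used in the project (which were written in Perl and R); the function names `read_fasta`, `batches` and `n_batches` are hypothetical, and the reads are assumed to be in plain Fasta format.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from the lines of a Fasta file."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Case is preserved, so lower-case repeat masking survives.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def batches(records: Iterable[Record], size: int = 50_000) -> Iterator[List[Record]]:
    """Group records into consecutive sets of at most `size` sequences."""
    batch: List[Record] = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, set
        yield batch


def n_batches(total: int, size: int = 50_000) -> int:
    """Number of sets needed for `total` reads (integer ceiling division)."""
    return (total + size - 1) // size
```

Under these assumptions, `n_batches(35_078_056)` gives 702 sets, the last of which is only partly filled. Each set can then be masked and searched independently, which keeps the memory needed per comparison bounded.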
Perl and R scripts were written to summarize the BLAT results. The number of matching sequencing reads within each window along the chromosomes was computed and plotted using user-specified thresholds for percent identity, alignment length and window size. This has been a computationally intensive project. Preliminary results suggest that some novel, highly repetitive elements were not filtered from the reads or the genome assembly; these significantly increased both the computation time and the output file size beyond our initial expectations. Nearly 98.8% of the reads have been searched using BLAT. Six to eight 2.2 GHz processors have been dedicated to the project since June 2005, and the data currently occupy 150 GB of hard disk. A setback occurred in November 2005 when a RAID drive failed, and the data had to be partitioned into 20-GB parcels among the hard drives associated with the compute nodes. It was decided that the technical problems associated with the computer analysis were too difficult, and the project was terminated.
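The windowed read counts described above can be sketched as follows. This is a hedged illustration in Python rather than the project's Perl and R code. The `Hit` record, the default thresholds, the window size and the step between windows are all assumptions made for the example, and a read is counted in every window that its alignment overlaps on a single chromosome.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Hit(NamedTuple):
    """One read alignment on a single chromosome (hypothetical record layout)."""
    start: int           # 0-based, inclusive
    end: int             # 0-based, exclusive
    pct_identity: float  # percent identity of the alignment
    aln_length: int      # alignment length in bases


def window_counts(
    hits: Iterable[Hit],
    chrom_length: int,
    window: int = 10_000,        # assumed window size
    step: int = 5_000,           # assumed offset between overlapping windows
    min_identity: float = 95.0,  # assumed threshold
    min_length: int = 200,       # assumed threshold
) -> List[Tuple[int, int]]:
    """Return (window_start, read_count) for overlapping windows."""
    starts = list(range(0, max(chrom_length - window, 0) + 1, step))
    counts = [0] * len(starts)
    for h in hits:
        if h.pct_identity < min_identity or h.aln_length < min_length:
            continue  # alignment fails the user-specified thresholds
        # Window i spans [i*step, i*step + window); find every i overlapping the hit.
        first = max(0, (h.start - window) // step + 1)
        last = min(len(starts) - 1, (h.end - 1) // step)
        for i in range(first, last + 1):
            counts[i] += 1
    return list(zip(starts, counts))
```

Windows whose counts lie well above the chromosome-wide average would be the candidates for duplications missing from the assembly. Unfiltered repetitive elements would inflate these counts in the same way, which is consistent with the difficulty reported above.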
Impacts Because the computer analysis was inadequate for the task, the impact will be low.
Publications
- No publications reported this period