Source: UNIV OF ARIZONA submitted to
WEB RESOURCES FOR THE COMPUTATION AND DISPLAY OF PHYSICAL MAPPING DATA
Sponsoring Institution
National Institute of Food and Agriculture
Project Status
TERMINATED
Funding Source
Reporting Frequency
Annual
Accession No.
0194404
Grant No.
2001-52100-01994
Project No.
ARZW-2002-04500
Proposal No.
2002-04500
Multistate No.
(N/A)
Program Code
10.4
Project Start Date
Aug 1, 2002
Project End Date
Jul 31, 2005
Grant Year
2003
Project Director
Soderlund, C. A.
Recipient Organization
UNIV OF ARIZONA
(N/A)
TUCSON, AZ 85721
Performing Department
INST FOR BIOMED SCI & BIOTECH
Non Technical Summary
(N/A)
Animal Health Component
(N/A)
Research Effort Categories
Basic
(N/A)
Applied
(N/A)
Developmental
100%
Classification

Knowledge Area (KA)   Subject of Investigation (SOI)   Field of Science (FOS)   Percent
201                   1599                             2080                     100%
Goals / Objectives
The long-term goal of this proposal is to provide flexible web-based resources that make it easier for the user to explore relationships between pieces of physical mapping data. Tools will be created to enhance FPC (FingerPrinted Contigs), a program that assembles restriction fragment fingerprints into contigs; it is written in C and performs the intensive computing. WebFPC displays the results of the FPC computation on the Web. Sequences (ESTs, STCs, and genomic) are generally associated with FPC databases. Flexible tools will be created to search the sequences and their associated annotation files; the results will show the distribution along the chromosome and within a contig. The specific objectives are:
1. Add to WebFPC: (1) a fingerprint comparison tool; (2) the ability to digest a sequence and run it in the fingerprint comparison tool; (3) a display of a chromosome with the framework markers and localized contigs.
2. Sequence comparisons: Associated with a given WebFPC will be zero or more directories of sequences such as EST, STC, or clone sequences. The Web user will be able to compare their own sequence against any of these sets of files, or compare any of the sequence files against each other, and the results will be shown in various formats, including as the distribution along the chromosome or contigs.
3. Search annotations: Associated with each directory of sequences for a given WebFPC will be subdirectories of annotations. The user can search for a word or substring and have the results shown as the distribution along the chromosome or contig.
4. Host FPC web pages: We will host the web displays of FPC databases created at other laboratories. A web coordinator will work with each individual laboratory to customize their pages, provide links to data on other web sites, and provide all options from the above objectives. We will make WebFPC freely available to researchers using it for non-commercial purposes, with the stipulation that we can reference their WebFPC so as to maintain a web site of all FPC databases.
5. Batch processing: A Web interface will provide a means to upload an FPC file to our 512-node Beowulf cluster to be assembled and returned. The FPC algorithm will be extended to try to resolve false positives in problem contigs.
6. Courses: A computational genomics course for graduate computer scientists will be designed and taught. An on-line tutorial for FPC will be developed.
This proposal supports our ongoing collaborations: We are using FPC to build physical maps of various grasses, including rice, maize and sorghum, in collaboration with the University of Georgia in Athens, the University of Missouri in Columbia, North Carolina State University, Cold Spring Harbor Laboratory and the Washington University Genome Sequencing Center. A recently funded proposal will fingerprint tomato in collaboration with Cornell. For all of these projects, ESTs and anchored markers are being hybridized to the clones.
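Objective 1 mentions digesting a sequence and running the result through the fingerprint comparison tool. The Python sketch below illustrates that idea only; it is not FPC or WebFPC code. The function names, the example recognition site (GAATTC), the cut offset, and the tolerance-based band matching are all assumptions made for illustration, and real fingerprints are gel band measurements rather than exact fragment lengths.

```python
def digest(sequence, site, cut_offset=0):
    """Cut `sequence` at every occurrence of `site`; return fragment lengths.

    `cut_offset` is the cut position relative to the start of the site
    (hypothetical parameter; 1 would model a cut after the first base).
    """
    sequence = sequence.upper()
    site = site.upper()
    cuts = []
    pos = sequence.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = sequence.find(site, pos + 1)
    bounds = [0] + cuts + [len(sequence)]
    # Drop zero-length fragments that arise when a cut falls at either end.
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


def shared_bands(bands_a, bands_b, tolerance):
    """Count bands that match within `tolerance`, using each band at most once."""
    a, b = sorted(bands_a), sorted(bands_b)
    i = j = matches = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance:
            matches += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return matches


if __name__ == "__main__":
    fragments = digest("AAGAATTCCCGAATTCTT", "GAATTC", cut_offset=1)
    print(fragments)
    # Compare the simulated fingerprint with a made-up clone fingerprint.
    print(shared_bands(fragments, [3, 7, 20], tolerance=0))
```

The number of shared bands is only the raw ingredient of a comparison; how that count is turned into an overlap decision is left open here.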
Project Methods
The approach used for the first five objectives involves techniques in Web interface and software design. The design must be flexible so that it is easy to maintain WebFPCs for external sites and to extend them to new data types. Our design will use BLAST (Altschul, Madden, Schaffer, Zhang, Zhang, Miller, and Lipman, 1997) against EST, STC or sequence files. To this end, data will be organized in a directory structure that uses suffixes to identify sequence and annotation directories. The data items in the target directory of a search must have names that are superstrings of the name of the clone from which they are derived, or must be attached to clones within FPC. Hence, results may be added as remarks or markers attached to clones. Annotation directories will each have a list of associated keywords, but a general search will also be available. The annotations must correspond to a sequence in the parent directory. New interfaces will be added to FPC and WebFPC to show the distribution of various types of entities along the chromosome and contig. These entities will be selectable so that the user can zoom in to see additional information. In order to provide an acceptable response time on the Web, and to reduce duplication of work between FPC and WebFPC, the above features will be implemented in FPC, which is written in the C language, and WebFPC will execute FPC in batch mode. To further reduce the execution time for large files (e.g. an EST file may have over 20,000 data items of over 500 characters each), the search will be run on a 512-node Beowulf cluster by distributing the file, or pieces of each file, over the nodes. To host FPC databases for external laboratories, a web form will allow the user to sign in and, based on a registered name, write into a given directory. Users will be allowed to upload new sequences, annotations or an FPC database, and the software will automatically detect new data and list it on the query Web form. 
Hence, they will only need to contact the WebFPC coordinator to change an option, such as adding or removing the ability to automatically download new sequences from GenBank. The fifth objective will allow external laboratories to assemble their FPC databases on our Beowulf machine. Users will be able to upload their databases and have them assembled and returned. Because of false-positive and false-negative overlaps, the assembly problem is NP-complete; hence an assembly results in many contigs, some of which are incorrect. The problem contigs are generally due to false-positive overlaps: two regions of the genome are mapped to the same space and there is no linear order. We will try to resolve these contigs automatically by increasing the stringency and re-assembling. This breaks the contig into multiple subcontigs that may have overlapping ends, so the subcontigs will be ordered and disconnected contigs given new contig numbers. We have been using this method interactively, and find that it fixes the large majority of the problems.
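The contig-splitting step described above (raise the stringency, re-assemble, renumber the disconnected pieces) can be illustrated with a toy model in Python. This sketch assumes clones are sets of band sizes, scores an overlap by the plain count of shared bands, and treats a contig as a connected component of the overlap graph. FPC's actual overlap scoring and its ordering of subcontigs are not reproduced; `assemble`, `min_shared`, and the clone data are hypothetical.

```python
from itertools import combinations


def assemble(clones, min_shared):
    """Group clones into contigs.

    `clones` maps a clone name to a set of band sizes. Two clones are joined
    when they share at least `min_shared` bands (a stand-in for a real
    overlap score). Contigs are the connected components of that graph.
    """
    parent = {name: name for name in clones}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(clones, 2):
        if len(clones[a] & clones[b]) >= min_shared:
            parent[find(a)] = find(b)

    groups = {}
    for name in clones:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    # Made-up data: A-B-C and X-Y are true neighbours (3 shared bands each);
    # C and X share 2 bands by chance, a false-positive overlap.
    clones = {
        "A": {1, 2, 3, 4, 5},
        "B": {3, 4, 5, 6, 7},
        "C": {5, 6, 7, 8, 9},
        "X": {8, 9, 20, 21, 22},
        "Y": {20, 21, 22, 23, 24},
    }
    print(assemble(clones, min_shared=2))  # loose cutoff: one merged contig
    print(assemble(clones, min_shared=3))  # stricter cutoff: two contigs
```

In this toy case the stricter cutoff removes only the chance join between C and X. With real data a stricter cutoff can also break true overlaps, which is presumably why the subcontigs need to be ordered and checked for overlapping ends afterwards.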

Progress 08/01/02 to 07/31/05

Outputs
The objectives of this grant are to (1) create web-based approaches for analyzing and viewing an FPC database, (2) parallelize the FPC assembly code, and (3) provide an FPC tutorial. For objective 1, we developed the following suite of tools, referred to as the WebAGCoL package: (a) WebFPC is an interactive Java display of the contigs, with complete support for zooming, scrolling, and coloring markers based on various attributes. (b) WebChrom is an interactive display of the contigs shown along the chromosome. It also has a search tool that allows the user to define a set of marker attributes; the resulting distribution of markers is shown along the chromosome. (c) WebFP allows the user to compare the fingerprint of one or more clones with the rest of the database. (d) WebBSS allows the user to compare a sequence with the BAC end sequences and sequenced clones associated with the FPC project, so that the input is located on the map if it matches a sequence. The results from the last three tools all link to WebFPC. The package comes with an installation script and demo files that make it easy for a novice user to install. The software is downloadable from www.agcol.arizona.edu/software/webagcol and has been published in Pampanwar et al. 2005. Objective 2 has been completed: the three most compute-intensive routines have all been parallelized for a multiprocessor machine. The code uses standard Unix and C parallel constructs, so no special software needs to be installed in order to use it, and it is executed simply by providing a command line argument. It runs on any Unix-based machine that has multiple processors and provides a 3.5x speedup on a four-processor machine. Part of the parallel implementation resulted in a master's thesis in the Department of Electrical and Computer Engineering at the University of Arizona (Gupta 2004). 
The parallel code is part of the FPC distributable, which is downloadable from www.agcol.arizona.edu/software/fpc, and has been published in a paper on High-Information-Content Fingerprinting (HICF; Nelson et al. 2005), a large exploratory project that would have been difficult without the large speedup provided by this implementation. Objective 3 has been completed: the tutorial, along with the demo files, can be downloaded from the FPC web site and has been published (Engler and Soderlund 2002). We are frequently complimented on the tutorial and used it for a 2004 HICF/FPC workshop organized by Jan Dvorak at the University of California, Davis. Also under this grant, we have developed an FPC module for BioPerl, which is downloadable from www.agcol.arizona.edu/software/bioperl and is part of the BioPerl package (www.bioperl.org). Funded in part by this grant, we have developed a Java implementation of the GMOD genome browser (www.gmod.org) and provide the configuration and Perl script necessary to create a genome browser for FPC; this is downloadable from www.agcol.arizona.edu/software/java_gbrowse.
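The reported 3.5x speedup on a four-processor machine can be put in context with a little arithmetic. The Python sketch below computes the parallel efficiency and, under the assumption that Amdahl's law applies, the fraction of run time that must have been parallelized. Amdahl's law is a model brought in here for illustration; the report itself does not state which model, if any, was used, and the function names are hypothetical.

```python
def parallel_efficiency(speedup, processors):
    """Speedup achieved per processor (1.0 would be perfect scaling)."""
    return speedup / processors


def amdahl_parallel_fraction(speedup, processors):
    """Solve Amdahl's law, S = 1 / ((1 - p) + p / n), for the fraction p."""
    return (1 - 1 / speedup) / (1 - 1 / processors)


def amdahl_speedup(parallel_fraction, processors):
    """Amdahl's law: predicted speedup for a given parallel fraction."""
    return 1 / ((1 - parallel_fraction) + parallel_fraction / processors)


if __name__ == "__main__":
    s, n = 3.5, 4  # figures taken from the report
    p = amdahl_parallel_fraction(s, n)
    print(f"efficiency: {parallel_efficiency(s, n):.3f}")
    print(f"implied parallel fraction: {p:.3f}")
    # Round trip: plugging p back in should reproduce the reported speedup.
    print(f"check: {amdahl_speedup(p, n):.2f}x")
```

Under this model, a 3.5x speedup on four processors corresponds to an efficiency of 87.5% and a parallel fraction of about 95%.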

Impacts
Physical maps are built with FPC, and the maps are generally used as a community resource for clone-based sequencing and for locating regions of interest. Though the FPC software package and FPC databases are generally freely available, it takes too much time to get FPC working on a local machine just to look at a region of interest or ask a few questions of the map. The WebAGCoL web interface therefore allows the user community to easily query and view the FPC map. Additionally, the availability of the WebAGCoL package means that each lab does not have to spend valuable resources on making its own web display of the data. The package has been downloaded by 141 laboratories, and each site is potentially viewed by hundreds of people in the community. The parallel implementation of FPC allows it to run on multiprocessor Unix-based machines. As dual processors can now be purchased for under $6000 and quad processors can be purchased for under $1300, this allows considerable speedup on machines that are affordable for a general biology laboratory. For laboratories that will use FPC extensively, the FPC tutorial is extremely helpful in learning how to use the software, which saves the users much time and results in better overall maps for the community. In summary, the impact of this grant is that it saves scientists time when using FPC.

Publications

  • Engler, F. and C. Soderlund (2002). Software for Physical Maps. In Ian Dunham (ed) Genomic Mapping and Sequencing, Horizon Press, Genome Technology series. Norfolk, UK, pp. 201-236.
  • Gupta, G. (2004). Shared Memory Implementation for Building Physical Maps of Genomes. Master's Thesis. University of Arizona.
  • Pampanwar, V., F. Engler, J. Hatfield, S. Blundy, G. Gupta, and C. Soderlund (2005). FPC web tools for rice, maize and distribution. Plant Physiology 138: 116-126.
  • Nelson, W., A. Bharti, E. Butler, F. Wei, G. Fuks, H. Kim, R. Wing, J. Messing, and C. Soderlund (2005). Whole-Genome Validation of High-Information-Content Fingerprinting. Plant Physiology 139: 27-38.