Source: UNIVERSITY OF ARIZONA submitted to
SYMAP - A SOFTWARE PACKAGE TO COMPUTE, DISPLAY AND ANALYZE SYNTENY BETWEEN MULTIPLE GENOMES WITH APPLICATION TO ROSACEAE AND POACEAE
Sponsoring Institution
National Institute of Food and Agriculture
Project Status
TERMINATED
Funding Source
Reporting Frequency
Annual
Accession No.
0214643
Grant No.
2008-35300-04439
Project No.
ARZR-2008-02280
Proposal No.
2008-02280
Multistate No.
(N/A)
Program Code
52.1
Project Start Date
Aug 15, 2008
Project End Date
Aug 14, 2012
Grant Year
2008
Project Director
Soderlund, C. A.
Recipient Organization
UNIVERSITY OF ARIZONA
888 N EUCLID AVE
TUCSON,AZ 85719-4824
Performing Department
(N/A)
Non Technical Summary
Comparative genomics has proved to be an excellent way to learn more about the structure, function and organization of genomes. The increasing ease with which genomes can be sequenced and/or mapped is creating a wealth of resources to be mined. However, the sequenced genomes are generally in hundreds of unordered pieces. Robust software is necessary to support "high-throughput cross-species comparisons" of this data. Our goal is to develop software to compare multiple unordered sequenced genomes to aid in ordering the pieces of the genomes. It will also compute groups of highly conserved regions across the genomes to aid in detecting and annotating genes. The results of the computation will be used by biologists, hence the software will provide an interactive query system and 3D Java graphics to allow maximal exploratory research. To aid the communities studying Rosaceae, Poaceae and Fabaceae, we will extend the publicly available website (www.symapdb.org) for the comparisons of sequenced genomes from these families. Not only does this add considerable value to the sequence data, but it is useful to researchers around the world.
Animal Health Component
(N/A)
Research Effort Categories
Basic
33%
Applied
33%
Developmental
34%
Classification

Knowledge Area (KA)    Subject of Investigation (SOI)    Field of Science (FOS)    Percent
201                    2499                              1080                      100%
Goals / Objectives
The objectives of this proposal are: (1) Compute composite synteny blocks between multiple genomes and display with Java 3D graphics, where the genomes are represented by FPC maps or sequence contigs (ordered or unordered). (2) Using the synteny, confirm predicted genes, discover new genes, and detect species-specific genes. Provide a mechanism to easily add and query associated data. (3) Create publicly available SyMAP projects (www.symapdb.org) for Rosaceae, Poaceae and Fabaceae from publicly available sequenced and FPC-mapped genomes. (4) Create an easy-to-install distributable of the software that executes standalone and as a web application. Provide user support and maintenance for SyMAP and FPC.
Project Methods
We will extend the SyMAP (synteny mapping and alignment program, Soderlund et al. 2006) software program, which currently aligns an FPC map with an ordered sequenced genome or aligns two ordered sequenced genomes. An existing software package, such as MUMmer (Kurtz et al. 2004) or Vmatch/Ramaco (www.vmatch.de), will be used to compute the anchors. The SyMAP algorithm will be used to compute the synteny, with some modifications for efficiency. There will be algorithmic developments for providing a partial order to the unordered sequenced contigs, finding composite synteny blocks over all the genomes being compared, and evaluating existing and new gene predictions. The graphics will be extended to use the Java 3D package to display three or more genomes. For SyMAP to be useful for any community, a curator interface will be developed to add attributes to be associated with the anchors, so that files of attribute/value pairs can be uploaded into the system. A BioMart-style (Durinck et al. 2005) web-based query interface will be developed, where the results can be shown in a downloadable table or in graphic form. The database will be changed to use Chado (Mungall et al. 2007) so that it can easily interface with other Chado-based software applications. By creating SyMAP projects from the sequenced and mapped genomes of Rosaceae, Poaceae and Fabaceae in collaboration with the biologists of these communities, we will receive feedback on whether they agree with the computed synteny and whether the interface allows them to easily explore the results.
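
To illustrate the step from anchors to synteny blocks, the following is a minimal sketch and not the SyMAP algorithm itself. It assumes each anchor is reduced to a pair of positions (one per genome), considers only same-orientation runs, and uses hypothetical parameters max_gap and min_anchors. A greedy single pass of this kind is broken by any stray anchor, which a real synteny algorithm would need to tolerate.

```python
from typing import List, Tuple

Anchor = Tuple[int, int]  # (position on genome A, position on genome B)


def collinear_blocks(anchors: List[Anchor], max_gap: int,
                     min_anchors: int) -> List[List[Anchor]]:
    """Group anchors into runs that advance on both genomes with bounded gaps."""
    blocks: List[List[Anchor]] = []
    current: List[Anchor] = []
    for a, b in sorted(anchors):  # sort by position on genome A
        if current:
            prev_a, prev_b = current[-1]
            # The anchor extends the run only if it is near the previous
            # anchor on genome A and moves forward, nearby, on genome B.
            close = (a - prev_a) <= max_gap and 0 < (b - prev_b) <= max_gap
            if not close:
                if len(current) >= min_anchors:
                    blocks.append(current)
                current = []
        current.append((a, b))
    if len(current) >= min_anchors:
        blocks.append(current)
    return blocks


# Hypothetical anchors: two collinear runs and one isolated hit.
example = [(100, 1000), (200, 1100), (300, 1200),
           (5000, 50), (5100, 150), (5200, 250),
           (9000, 9000)]
blocks = collinear_blocks(example, max_gap=500, min_anchors=3)
```

With these inputs the two three-anchor runs are reported as blocks and the isolated hit is discarded because it falls below min_anchors.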

Progress 08/15/11 to 08/14/12

Outputs
OUTPUTS: ACTIVITIES: A major addition to the software is the Query page, which allows the user to query across two or more genomes within a synteny set (i.e. a set for which all pairwise syntenies have been computed). Based on the selection of genomes from the interface, the user can request to view the collinear genes, orphan genes, synteny hits with no annotation, and complete linkage of species (i.e. the set of genes that all species share). The synteny hits are clustered to create putative multi-species gene families, which may be filtered based on annotation or presence/absence within given lineages. The user can also query on annotation string and/or locations. The results are in a table format where the user can dynamically select columns to show or hide, and select a row to show the synteny graphics. The table results can be aligned with MUSCLE (Edgar, R.C. 2004) and the results are shown graphically. The table or the table sequences can be written to file. Substantial improvements were also made to make it easier for the user to run SyMAP, e.g. all parameters can be set from the graphical manager, and help is provided on each window. Improvements have been made to the draft sequence alignment algorithm and graphical displays. Code has been added to work around a limitation of MUMmer (Kurtz et al. 2004), which previously prevented complete genomic self-alignments. EVENTS: co-PI Nelson presented a computer demonstration (C07) at the Plant and Animal Genome conference, San Diego, Jan. 14, 2012. SERVICES: We provide immediate responses to all queries about SyMAP and FPC. PRODUCTS: SyMAP v3.5 was released in January 2012, and SyMAP v4.0 was released in July 2012. The software includes complete documentation and demo files. From August 1, 2011 until October 26, 2012, there were 575 unique users that downloaded SyMAP and 222 unique users that downloaded FPC. The SyMAP plant genome synteny site (www.symapdb.org) has averaged 580 visitors per month since the beginning of the year.
DISSEMINATION: As the web is an important method of dissemination, we continue to keep the SyMAP software on-line documentation current, along with a Tour of the software. We also continue to keep the www.symapdb.org site updated with the latest releases of Poaceae, Fabaceae, and Rosaceae genomes. PARTICIPANTS: PI Carol Soderlund provided overall focus and direction for the project, testing and aid with the documentation and website. co-PI William Nelson performed the majority of the end-user interaction and development work. Mark Willer, a computer scientist in the Soderlund group, developed the software for displaying the results in a table, all table features, and aided in the software development of the SyMAP manager. COLLABORATIONS: We collaborated with Rajeev Varshney (CGIAR) and others on the synteny analysis for the chickpea genome, where SyMAP was also used to aid the ordering of draft sequence. A manuscript has been submitted to Nature Biotech (PIs Soderlund and Nelson are co-authors, and this grant is listed as a source of funding). TARGET AUDIENCES: The target audience is biologists studying comparative genomics, with an emphasis on plant genomes. PROJECT MODIFICATIONS: Not relevant to this project.

Impacts
Change in knowledge: SyMAP had been designed to distinguish duplicated synteny blocks but we had not considered gene families; with some minor changes, we were able to extend it to compute gene families, which are queryable on the new Query page. Change in action: In our collaboration with Varshney, Chi Song and G. Zhang, they provided a Perl script that we modified to produce a linear image of the SyMAP synteny between three genomes, which is an excellent view for publication; this script has been added to the SyMAP release.
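
As an illustration of how synteny hits can be grouped into putative gene families, here is a minimal sketch. It uses single-linkage clustering with a union-find structure, which is an assumption and may differ from the method implemented in SyMAP. The gene identifiers, the function names, and the definition of an orphan as a gene with no hit at all are hypothetical choices made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Gene = Tuple[str, str]  # (species, gene identifier)


def gene_families(genes: Iterable[Gene],
                  hits: Iterable[Tuple[Gene, Gene]]) -> List[Set[Gene]]:
    """Single linkage: genes joined by any chain of hits share a family.

    Every gene that appears in `hits` must also appear in `genes`.
    """
    parent: Dict[Gene, Gene] = {g: g for g in genes}

    def find(g: Gene) -> Gene:
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g

    for g1, g2 in hits:
        parent[find(g1)] = find(g2)  # merge the two families

    families: Dict[Gene, Set[Gene]] = defaultdict(set)
    for g in parent:
        families[find(g)].add(g)
    return list(families.values())


def orphan_genes(families: List[Set[Gene]]) -> List[Gene]:
    """Genes that form a family of size one (no hit to any other gene)."""
    return sorted(next(iter(f)) for f in families if len(f) == 1)


def shared_by_all(families: List[Set[Gene]],
                  species: Iterable[str]) -> List[Set[Gene]]:
    """Families containing at least one gene from every listed species."""
    wanted = set(species)
    return [f for f in families if wanted <= {s for s, _ in f}]


# Hypothetical genes and hits for three genomes.
genes = [("rice", "r1"), ("rice", "r2"), ("maize", "m1"),
         ("maize", "m2"), ("sorghum", "s1")]
hits = [(("rice", "r1"), ("maize", "m1")),
        (("maize", "m1"), ("sorghum", "s1"))]
fams = gene_families(genes, hits)
```

In this example r1, m1 and s1 form one family shared by all three species, while r2 and m2 are reported as orphans.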

Publications

  • W. Nelson (2012) SyMAP: Dynamic Synteny Exploration. Electronic Jan 14 computer demos (abstract), Plant and Animal Genome Conference, San Diego, CA (C07).


Progress 08/15/08 to 08/14/12

Outputs
OUTPUTS: ACTIVITIES: SyMAP was originally written to compute and display FPC to sequenced genome synteny. It has been greatly extended under this funding to compare sequenced genomes. It computes and displays genome sequence to genome sequence synteny using MUMmer hits, where one of the genomes may be draft. When given gene annotations, it clusters hits to fit the gene annotation and learns the parameters to cluster them when no gene annotation exists. The original software was ported from Perl to Java, and a manager interface was developed to guide the user through the synteny building process. Additional visualization was developed resulting in a static 3-genome view along with three new interactive views, which are the 3D display, multiple chromosome dotplot, and circle view. The 2D view has many enhancements, including the visualization of the gene models. A query page was developed to allow the user to query all genomes in a set (i.e. genomes with complete pairwise synteny) on various attributes including collinear genes, orphan genes, genes with no annotation, and the set of genes shared by all selected genomes. The results are shown in a versatile table format with links to the 2D view. Under this funding, the FPC program was supported; it was modified to extend its use to next-gen fingerprinting, which was developed by van Oeveren et al. (Genome Research, 2011). EVENTS: SyMAP was presented at the Plant and Animal Genome conference in San Diego as a poster in 2009, 2010 and 2011, and was presented as a computer demonstration in 2012. SERVICES: We have provided immediate feedback for all emails sent about SyMAP and FPC. PRODUCTS: The freely available SyMAP and FPC packages are released at www.agcol.arizona.edu. The synteny has been computed for all Rosaceae and Poaceae sequenced genomes along with some salient Fabaceae and Cucurbitaceae genomes, which are displayed at www.symapdb.org.
From August 1, 2011 until October 26, 2012, there were 575 unique users that downloaded SyMAP and 222 unique users that downloaded FPC. The SyMAP plant genome website has averaged 580 visitors per month since the beginning of the year. DISSEMINATION: As the website is an important dissemination medium, we provide a SyMAP Tour of the software so that the user can easily understand its features. For both SyMAP and FPC, we provide complete documentation along with demo files so the user can easily try SyMAP or FPC with no setup. PARTICIPANTS: PI Soderlund provided the direction and focus of the project, was lead author on the SyMAP paper, and aided with testing, documentation and release. Co-PI William Nelson provided user feedback, FPC enhancements, developed about 45% of the software and SyMAP synteny site, aided with the manuscript, and was lead on creating the release packages. Matthew Bomhoff provided about 50% of the software development and SyMAP synteny site, and aided with the releases. Mark Willer provided about 5% of the software development. COLLABORATIONS: We have collaborated with Jan Dvorak's group at U.C. Davis to apply SyMAP to the Brachypodium FPC map and the rice sequenced genome (Gu et al. 2009, BMC Genomics). We have aided Rod Wing's group at UA in using SyMAP to align a 22 Mb maize region to rice and sorghum (Wei et al. 2009, PLoS Genetics). We worked with Jan van Oeveren at Keygene on the FPC extension for the sequence-based fingerprints. We collaborated with Rajeev Varshney (CGIAR) and others on the synteny analysis for the chickpea genome, where SyMAP was also used to aid the ordering of draft sequence. A manuscript has been submitted to Nature Biotech (PIs Soderlund and Nelson are co-authors, and this grant is listed as a source of funding).
TRAINING: Matthew Bomhoff is a computer scientist who joined the Soderlund group with no genomics background; he had on-the-job training, and has since worked with two other genomic laboratories on the UA campus. TARGET AUDIENCES: Biologists studying comparative sequenced genomes, emphasizing plant genomes. PROJECT MODIFICATIONS: Not relevant to this project.

Impacts
From our collaborations and feedback, we added the interactive circle view, the 3-genome display and the ability to have external links to the SyMAP gene annotations. In our own work in creating the SyMAP synteny website, we found it difficult to manage large projects and hence created the SyMAP manager. The manager makes it possible to create SyMAP synteny without having to read any documentation, except in rare situations. SyMAP had been designed to distinguish duplicated synteny blocks but we had not considered gene families; with some minor changes, we were able to extend it to compute gene families, which are queryable on the new query page.

Publications

  • Soderlund, C., Bomhoff, M., and Nelson, W.M. (2011). SyMAP v3.4: a turnkey synteny system with application to plant genomes. Nucleic Acids Res 39(10):e38.
  • W. Nelson (2012) SyMAP: Dynamic Synteny Exploration. Electronic Jan 14 computer demos (abstract), Plant and Animal Genome Conference, San Diego, CA (C07).


Progress 08/15/10 to 08/14/11

Outputs
OUTPUTS: ACTIVITIES: Implementation of the remaining major software objectives was delayed by the departure of the software developer M. Bomhoff, but will be completed during the extension period by W. Nelson. However, this year was productive as we wrote a substantial manuscript (the submitted manuscript we reported last year was just a Bioinformatics Application Note and was not accepted). The writing of the manuscript along with user feedback led to numerous enhancements resulting in the 3.3 and 3.4 releases, where the most significant one was a reorganization of the MUMmer processing, which greatly reduced the compute time taken in this phase. EVENTS: W. Nelson attended the AFRI Project Director Meeting held prior to the Plant and Animal Genome conference in San Diego, CA, and presented the SyMAP poster at the PAG and AFRI poster sessions. SERVICES: Ongoing support for end-users, particularly at Mississippi State, UNC, Clemson, and University of Arizona. PRODUCTS: Two releases (versions 3.3 and 3.4) have been made in the past year, at www.agcol.arizona.edu. The SyMAP plant synteny website (www.symapdb.org) has been kept current with the latest software. DISSEMINATION: An objective of this grant is to provide support for FPC. In March we released version 9.4, containing upgrades to support next-gen fingerprinting; these were made in conjunction with researchers at NCGR, Floragenix, and Keygene, who have been generating this type of fingerprint. An objective of the grant is to align the Rosaceae resources, which continue to grow in number. We have recently added the Strawberry genome sequence to our Rosaceae alignment set at www.symapdb.org. We have also had discussions at PAG with Dr. Amit Dhingra regarding collaborating to apply SyMAP to several other Rosaceae genomes on which his group is working. PARTICIPANTS: Carol Soderlund provided overall focus and direction for the project, and provided testing and aid with the documentation and website.
William Nelson performed the majority of the end-user interaction and development work. TARGET AUDIENCES: Not relevant to this project. PROJECT MODIFICATIONS: Not relevant to this project.

Impacts
IMPACT: The SyMAP package has been downloaded by 455 unique users from 211 identified institutions, and the most current version, v3.4, has been downloaded by 242 unique users. Our SyMAP plant synteny site has had an average of 233 unique visitors per month since the beginning of the year.

Publications

  • Soderlund, C., Bomhoff, M., and Nelson, W.M. 2011. SyMAP v3.4: a turnkey synteny system with application to plant genomes. Nucleic Acids Res 39(10):e38.


Progress 08/15/09 to 08/14/10

Outputs
OUTPUTS: ACTIVITIES: The four main accomplishments for this year were: (1) Fine tuning all aspects of the software, release package and documentation. This includes dealing with portability issues and speed optimizations. (2) A multi-genome dot plot was added, which allows multiple genomes to be viewed against a reference genome. A multi-chromosome dot plot was also developed, which greatly aids viewing ancient duplications. This view was made part of the 3D interface, so that users can easily select different chromosome sets to view in 3D, dot plot or the 2D detail view. (3) We developed an algorithm for ordering unordered sequences that have relatively large divergence, as seen when comparing plant genomes. As part of this objective, we altered SyMAP to allow the input of unordered sequenced contigs. The resulting output shows the syntenic regions of the ordered contigs, but then also shows the syntenic regions of the unordered contigs. We will be experimenting more with this feature as the Rosaceae genomes become available. (4) A paper was written and submitted. EVENTS: C. Soderlund, W. Nelson and M. Bomhoff all attended the AFRI Project Director Meeting held in San Diego, CA. W. Nelson and M. Bomhoff presented the SyMAP poster at the PAG and AFRI poster sessions. PRODUCTS: We have had three releases of the SyMAP software in the last year at www.agcol.arizona.edu. The www.symapdb.org website has been kept current with the latest software. DISSEMINATION: An objective of this grant is to provide support for FPC. NCGR and Floragenix are collaborating to use FPC for sequence-based fingerprinting, and asked for our assistance. We acquired and analyzed their fingerprints, and concluded that their pooling strategy resulted in too much error; they are currently trying to reduce the error. An objective of the grant is to align the Rosaceae resources, which are just now becoming available (e.g. the peach genome); to further aid the Rosaceae community, we contacted Amy Iezzoni of RosBREED and offered to provide links from genes in SyMAP to relevant sites (e.g. at RosBREED) and support requests from the community that would aid them; she was supportive of the idea and was going to discuss it with the executive committee. To help potential users understand what SyMAP can do, a "SyMAP Tour" was developed that walks the user through all the views with a brief description of each. A User Guide provides further details on using SyMAP. A System Guide aids laboratories that seek to install SyMAP; the package includes demo files so that SyMAP can easily be tried.
PARTICIPANTS: Carol Soderlund provided overall focus and direction for the project, and provided testing and aid with the documentation and website. William Nelson answered user questions, fixed problems, and took over the development of SyMAP when Matthew Bomhoff left in February. Matthew Bomhoff performed the majority of the development work. TARGET AUDIENCES: Not relevant to this project. PROJECT MODIFICATIONS: Not relevant to this project.

Impacts
When attempting to order sequence contigs from a plant genome based on similarity to another plant genome, we found that repetitiveness and divergence leave too little information to order the majority of the sequenced contigs with high confidence. The approach is still useful, however, as it shows what ordering is available.
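
The idea of ordering only those contigs that carry enough evidence can be sketched as follows. This is an illustrative simplification, not the algorithm developed in the project. It assumes a single reference chromosome, represents each contig by the reference positions of its anchors, and uses a hypothetical min_anchors threshold below which a contig is left unplaced.

```python
from statistics import median
from typing import Dict, List, Tuple


def order_contigs(anchor_positions: Dict[str, List[int]],
                  min_anchors: int = 3) -> Tuple[List[str], List[str]]:
    """Order contigs by the median reference position of their anchors.

    Contigs with fewer than `min_anchors` anchors are returned separately
    as unplaced, because their position is not well supported.
    """
    placed: List[Tuple[float, str]] = []
    unplaced: List[str] = []
    for contig, positions in anchor_positions.items():
        if len(positions) >= min_anchors:
            placed.append((median(positions), contig))
        else:
            unplaced.append(contig)
    placed.sort()  # ascending along the reference
    return [name for _, name in placed], sorted(unplaced)


# Hypothetical draft contigs with anchor positions on one reference chromosome.
draft = {
    "ctgA": [9000, 9100, 9500],
    "ctgB": [100, 250, 300, 120],
    "ctgC": [4000],
    "ctgD": [5000, 5200, 5100],
}
ordered, unplaced = order_contigs(draft)
```

Using the median rather than the mean keeps a single repeat-induced anchor from dragging a contig far from its main cluster of hits, though heavily repetitive contigs would still need more careful treatment.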

Publications

  • No publications reported this period


Progress 08/15/08 to 08/14/09

Outputs
OUTPUTS: The following refers to the specific objectives of the SyMAP grant. ACTIVITIES: Algorithm: In support of objectives 1 and 2, SyMAP was extended to take as input MUMmer (Kurtz et al. 2004) anchors (short alignments) for sequence to sequence synteny computation. It also takes as input gene annotations, which it uses to cluster the MUMmer anchors to correspond to the gene coordinates, and learns parameters from the genes to cluster anchors in regions of no annotation. It uses general plant-based parameters when there are no gene models. The algorithm has been ported from Perl to Java, resulting in a substantial speedup, and allowing these functions to be incorporated into the project management console. Management Console: In support of objective 4, a Java project manager has been developed to generate and display syntenies between related groups of species. Installation has been greatly simplified; in particular it is no longer necessary to install a MySQL database for projects that run standalone. Query and Display: In support of objective 1, a 3D display has been developed and integrated with the previous 2D display, which has been enhanced for the sequence to sequence synteny. Additionally, the gene models and annotation may be displayed. In support of objective 2, a query page has been developed to search on gene names and annotation, where the results are in a table form and link to the SyMAP display. FPC extension: In support of objective 4, we modified the FPC code so that band sizes could be greater than 65,536 for use with sequence-based fingerprints. EVENTS: PIs C. Soderlund and W. Nelson attended the PAG Plant Genome Project Director meeting in January 2009 and the 2009 RECOMB conference held in Tucson, AZ. PRODUCTS: A SyMAP package has been created that includes a User's Guide and demo, which will be released as soon as our submitted paper has been accepted.
The SyMAP website (www.symapdb.org) has a showcase of syntenic groups for: (i) 5 yeasts, (ii) 4 enterobacteria, (iii) maize, rice, brachypodium, and sorghum, and (iv) arabidopsis, medicago, soybean, and poplar. DISSEMINATION: We have collaborated with Jan Dvorak's group at Davis to apply SyMAP to the Brachypodium FPC map and the rice sequenced genome (Luo et al., accepted by BMC Genomics). We have aided Rod Wing's group at UA in using SyMAP to align a 22 Mb maize region to rice and sorghum (Wu et al., accepted by PLoS Genetics). We are working with Greg May at NCGR to incorporate SyMAP into the Legume Information System (LIS) database. We have been working with Jan van Oeveren at Keygene on the FPC extension, as they have designed a way to use it with sequence-based fingerprints. PARTICIPANTS: PI Carol Soderlund provided the direction of the software development, collaborations, and writing of the User's Manual and manuscript. co-PI William Nelson provided details of the software development, performed some of the algorithm programming, and was lead on the manuscript. Matthew Bomhoff provided the rest of the programming, including all the new graphics, created the distributable package, and helped with the manual and manuscript. TARGET AUDIENCES: Nothing significant to report during this reporting period. PROJECT MODIFICATIONS: Not relevant to this project.

Impacts
EVALUATION: The definition of synteny is nebulous and hard to test as there is no strict metric for evaluation. However, our graphical interface provides the best test of whether the computed synteny is right, as the blocks can be visually confirmed. The addition of gene models was also helpful, as that led us to develop the clustering algorithm guided by the gene models. CHANGE IN ACTION: We had collaborators mention that they liked the circle view (Krzywinski et al. 2009); as it was easy to implement, we added it to the web-based set of views. Based on feedback from Greg May et al., we added a feature to provide external links to the gene annotation.
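
The gene-guided clustering mentioned above can be illustrated with a minimal sketch. It is an assumption-laden simplification rather than the SyMAP implementation: anchors and genes are plain (start, end) intervals on one genome, the single "learned" parameter is the largest gap seen between consecutive anchors inside an annotated gene, and default_gap is a hypothetical stand-in for the general parameters used when no gene models exist.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (start, end) on one genome, inclusive


def cluster_anchors(anchors: List[Interval], genes: Dict[str, Interval],
                    default_gap: int = 1000):
    """Assign anchors to overlapping genes, then cluster the remainder."""
    by_gene: Dict[str, List[Interval]] = {}
    unassigned: List[Interval] = []
    for a in sorted(anchors):
        # First gene whose interval overlaps this anchor, if any.
        name = next((g for g, (s, e) in genes.items()
                     if a[0] <= e and a[1] >= s), None)
        if name is None:
            unassigned.append(a)
        else:
            by_gene.setdefault(name, []).append(a)

    # Learn a gap threshold from the annotated regions: the largest gap
    # between consecutive anchors that fall in the same gene.
    gaps = [b[0] - a[1]
            for hits in by_gene.values()
            for a, b in zip(hits, hits[1:])]
    max_gap = max(gaps) if gaps else default_gap

    # Apply the threshold where there is no annotation.
    clusters: List[List[Interval]] = []
    for a in unassigned:
        if clusters and a[0] - clusters[-1][-1][1] <= max_gap:
            clusters[-1].append(a)
        else:
            clusters.append([a])
    return by_gene, clusters, max_gap


# Hypothetical gene model and anchors.
genes = {"g1": (1000, 3000)}
anchors = [(1100, 1200), (1500, 1600), (2800, 2900),
           (10000, 10100), (10500, 10600), (20000, 20100)]
by_gene, clusters, max_gap = cluster_anchors(anchors, genes)
```

Here the three anchors inside g1 yield a learned gap of 1200, so the two nearby unannotated anchors are merged into one cluster while the distant one stays separate.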

Publications

  • No publications reported this period