Progress 02/01/09 to 01/31/13
Outputs Target Audience: Audiences ranging from the general scientific community to the general public. Changes/Problems:
Nothing Reported
What opportunities for training and professional development has the project provided? We trained several graduate students and postdoctoral associates, all of whom graduated and found successful career placements. How have the results been disseminated to communities of interest? The results have been disseminated to the scientific community. What do you plan to do during the next reporting period to accomplish the goals?
Nothing Reported
Impacts What was accomplished under these goals?
General achievements. Our group has an excellent track record of producing high-quality research and publications and of achieving the specific aims of past NIH proposals. The techniques and software that we have developed and published so far, such as the Assembly Reconciliator (Zimin et al., 2009), the Quorum error corrector (Marcais et al., 2013), the MaSuRCA assembler, and the Jellyfish k-mer counter (Marcais et al., 2011), are widely used by the community. In the past funding period we developed and published a new genome assembler based on a new approach to assembling high-coverage Illumina data sets. The new method transforms very large numbers of paired-end reads into a much smaller number of longer super-reads. The use of super-reads allows us to assemble combinations of Illumina reads of several read lengths together with long reads from 454 and Sanger sequencing technologies, making it one of the few assemblers capable of handling such mixtures. We call our system the Maryland Super-Read Celera Assembler (abbreviated MaSuRCA and pronounced “mazurka”). We compared MaSuRCA against two of the most widely used assemblers for Illumina data, Allpaths-LG and SOAPdenovo, which appeared to offer the strongest competition according to the evaluation in Salzberg et al. (2012). In Zimin et al. (2013) we evaluated the performance of these three assemblers on two data sets from organisms for which high-quality assemblies are available: the bacterium Rhodobacter sphaeroides and chromosome 16 of the mouse (Mus musculus) genome. When the assemblies are evaluated by comparison with finished sequence, MaSuRCA performs better than Allpaths-LG and SOAPdenovo on the sample Illumina data sets. MaSuRCA can also significantly outperform both methods when the original data are augmented with long reads. MaSuRCA is available as open-source code at http://www.genome.umd.edu.
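The super-read idea described above can be illustrated with a minimal sketch. It is not the MaSuRCA implementation. It assumes error-free reads, a single strand, and a k-mer size of at least 2, and the function names (count_kmers, extend_right, extend_left, super_reads) are hypothetical. Each read is extended base by base in both directions for as long as the k-mer table built from all reads offers exactly one possible next base, so overlapping reads from the same unambiguous region collapse into one longer sequence.

```python
from collections import Counter

BASES = "ACGT"

def count_kmers(reads, k):
    """Count every k-mer that occurs in the reads (a toy stand-in for a k-mer counter)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def extend_right(seq, kmers, k):
    """Append bases while exactly one k-mer in the table continues the sequence."""
    seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    while True:
        suffix = seq[-(k - 1):]
        candidates = [suffix + b for b in BASES if suffix + b in kmers]
        # Stop at a dead end, at a branch, or when a k-mer repeats (cycle guard).
        if len(candidates) != 1 or candidates[0] in seen:
            return seq
        seen.add(candidates[0])
        seq += candidates[0][-1]

def extend_left(seq, kmers, k):
    """Mirror image of extend_right: prepend bases while the extension is unique."""
    seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    while True:
        prefix = seq[:k - 1]
        candidates = [b + prefix for b in BASES if b + prefix in kmers]
        if len(candidates) != 1 or candidates[0] in seen:
            return seq
        seen.add(candidates[0])
        seq = candidates[0][0] + seq

def super_reads(reads, k):
    """Map every read to its maximal unique extension and keep the distinct results."""
    kmers = count_kmers(reads, k)
    return {extend_left(extend_right(r, kmers, k), kmers, k) for r in reads}

# Toy example: three overlapping reads drawn from one short made-up sequence.
toy_reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
toy_result = super_reads(toy_reads, 4)
```

In this toy case all three reads extend to the same sequence, so three reads are replaced by one super-read. Real data would additionally need error correction, handling of reverse complements, and use of mate-pair information, none of which is shown here.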
MaSuRCA integrates the work done under Specific Aims 1 and 2. It includes our gap-filling module, which now runs by default to post-process the assembler results. We used MaSuRCA to assemble a number of genomes in collaboration with researchers in the United States and abroad:
1. Cardiocondyla ant (paper in preparation), 454/Illumina mixed data
2. Indian cow, Bos indicus, joint with USDA-ARS, 454/Illumina mixed data
3. Rhesus macaque, Macaca mulatta, joint with Robert Norgren, University of Nebraska, Sanger/Illumina mixed data (please see the attached support/collaboration letter)
4. Water buffalo, Bubalus bubalis, joint with USDA-ARS and CASPUR, Italy, 454/Illumina data
5. Domestic cat, Felis catus, joint with Wes Warren, Washington University, St. Louis, Sanger/454/Illumina mixed data
6. Salmon, Salmo salar, joint with Jason Miller, JCVI, Ben Koop, University of BC, Canada, and Alexandro Maass, MMI, Chile, Sanger/Illumina mixed data
7. American bison, 454/Illumina mixed data
8. Tarsier, Tarsius syrichta, joint with Wes Warren, Washington University, St. Louis, Sanger/Illumina data
9. Loblolly pine, Pinus taeda, 22 Gb genome, preliminary assembly, joint with David Neale and Chuck Langley, UC-Davis, and the Pinerefseq consortium
10. Alpaca, Lama pacos, 454/Sanger/Illumina mixed data, joint with Belinda Appleton, Deakin University, Australia
11. Heliconius melpomene butterfly, as part of the Heliconius Genome consortium
12. Domestic turkey, Meleagris gallopavo, joint with Rami Dalloul of Virginia Tech
13. Chicken, Gallus gallus, joint with Wes Warren, Washington University, St. Louis, Sanger/454/Illumina mixed data (please see the attached support/collaboration letter)
14. Fire ant, joint with Sasha Mikheleev, OIST, Japan, 454/Illumina mixed data
15. Stalk-eyed fly, joint with Jerry Wilkinson, University of Maryland, 454/Illumina data
Most of these genomes have been deposited at and accepted by NCBI and are now the most complete and current assemblies available.
We also know of many researchers who are now using the MaSuRCA assembler on their own to assemble genomes of interest, with minimal help or advice from us. The following is a short list of researchers with whom we have communicated regarding MaSuRCA within the last several months:
· Geoff Waldbieser, USDA-ARS, Pacific cod genome (please see the attached support/collaboration letter)
· Nicolas Tourasse, Institut de Biologie Physico-Chimique (IBPC), France, unicellular algae genome
· Lex Nederbragt, University of Oslo, Norway, Atlantic cod genome
· Robert Schnabel, University of Missouri-Columbia
· Irena Lanc, University of Notre Dame, Indiana, USA
· Joanna L. Kelley, Washington State University, WA, USA
· Asumi Tago, Niigata University, Japan
· Wes Warren, Washington University, MO, USA
· Ted Kalbfleisch, University of Louisville, KY, USA
· Erin Hine, University of Maryland School of Medicine, MD, USA
· Arun Seetharam, Iowa State University, IA, USA
· Peter Clark, The Children's Hospital of Philadelphia, PA, USA
· Karthikeyan Murugesan, Georgia Tech, GA, USA
· Jeffrey Lary, University of Connecticut, CT, USA
· Alex Tunnicliffe, Cambridge University Cancer Research, UK
Publications
- Type:
Journal Articles
Status:
Published
Year Published:
2013
Citation:
Zimin AV, & Yorke JA. The MaSuRCA genome assembler. Bioinformatics (2013). doi:10.1093/bioinformatics/btt476
- Type:
Journal Articles
Status:
Published
Year Published:
2012
Citation:
Dasmahapatra KK, Walters JR, Briscoe AD, Davey JW, Whibley A, Nadeau NJ, Zimin AV, Hughes DS, Ferguson LC, Martin SH, Salazar C, Lewis JJ, Yorke JA, ..., Linares M, Blaxter ML, ffrench-Constant RH, Joron M, Kronforst MR, Mullen SP, Reed RD, Scherer SE, Richards S, Mallet J, McMillan W, Jiggins CD. Butterfly genome reveals promiscuous exchange of mimicry adaptations among species. Nature. 2012 Jul 5;487(7405):94-8.
- Type:
Journal Articles
Status:
Published
Year Published:
2012
Citation:
Smith C, Zimin AV, Yorke JA, Tsutsui N, et al. [this paper has 52 authors] Draft genome of the globally widespread and invasive Argentine ant (Linepithema humile). PNAS advance pub, doi:10.1073/pnas.1008617108
- Type:
Journal Articles
Status:
Published
Year Published:
2011
Citation:
Smith C, Robertson H, Helmkampf M, Zimin A, et al. [this paper has 43 authors] A draft genome of the red harvester ant Pogonomyrmex barbatus. PNAS advance pub, doi:10.1073/pnas.1007901108
Progress 02/01/09 to 01/31/10
Outputs OUTPUTS: Our group generated a new assembly of the domestic cow, Bos taurus, that dramatically improved on our previous assembly. The earlier assembly, UMD2, was released in early 2009 and described in our 2009 publication in Genome Biology. The newer assembly, UMD3, was released on our website and to the broader community in August 2009. Many details about this assembly are on our site at http://www.cbcb.umd.edu/research/bos_taurus_assembly.shtml. UMD3 marks a significant improvement over UMD2, with many fewer gaps, smaller gaps, and approximately 3% more sequence placed on chromosomes. Both assemblies are accessible online for FTP download or BLAST search, and outside groups have made UMD3 available through browsers. We collaborated with NCBI to produce a new annotation of this assembly, which is now available in GenBank and at our site as well. A paper describing the improvements in annotation is under review. More recently, we collaborated with several outside groups, including USDA colleagues, to assemble and annotate the turkey genome. For that project, we assembled a mixture of 454 and Illumina reads to produce a high-quality genome, comparable to the original chicken genome assembly but at far lower cost. We then used our genome annotation pipeline to predict gene models and assign gene names, which formed the basis for much of the analysis that appeared in our recently published (late 2010) genome paper in PLoS Biology. We collaborated with other groups to assemble the strawberry genome, which appeared in Nature Genetics online at the end of 2010 and will appear in print in 2011. This genome used a mixture of 454, SOLiD, and Illumina sequencing technologies, although the low coverage made it a particular challenge. We are currently working on assemblies of a new cow genome, Bos taurus indicus, sequenced by USDA Beltsville, and on a re-assembly of the chicken genome that brings in new 454 data generated by our colleagues at the Washington University Genome Center.
We are continually modifying our suite of assembly tools and other software to allow us to use the latest next-generation sequence data. We have reported in several papers that errors in the sequence data can lead to erroneous assemblies, and it is a constant challenge to detect and correct these errors so as to avoid mis-assembling large genomes. PARTICIPANTS: Steven Salzberg, Ph.D. (PD), oversaw the entire project, including the technical work on genome assembly and genome annotation as well as the manuscripts produced by the project. James Yorke, Ph.D. (Co-PD), oversaw the assembly efforts for all assemblies and participated in depth in the design of new algorithms and new assembly strategies for cow, turkey, and other species. Aleksey Zimin, Ph.D. (Co-PD), ran multiple assemblies using several major assembly packages, including the Celera Assembler, and developed novel or modified routines for use with next-generation sequence data. Geo Pertea (Bioinformatics Engineer) developed and ran the genome annotation pipeline, which was used to annotate the turkey genome and to revise and help re-annotate the cow genome. Art Delcher, Ph.D. (Senior Research Scientist), provided detailed technical advice on modifications to the Celera Assembler (he is one of the original software developers), helped to run and debug the assembler, and assisted with manuscript preparation. He was not funded by the project but contributed significantly. David Kelley (graduate student) developed a new algorithm to identify mis-assembled regions of a genome, particularly duplications, and used it to identify and fix many mis-assemblies in the cow genome. Michael Schatz (graduate student, completed Ph.D. in 2010) modified the Celera Assembler as part of the multiple re-assemblies of the cow genome. He was not funded by this project but contributed substantially. TARGET AUDIENCES: Nothing significant to report during this reporting period.
PROJECT MODIFICATIONS: Nothing significant to report during this reporting period.
Impacts The improved assemblies of the cow, Bos taurus, generated by our group (UMD2 and later UMD3) have had a major impact on cow genetics and genomics. These assemblies have been widely adopted and are recognized by the community as superior to the assemblies produced by the Baylor Human Genome Sequencing Center. For example, the Baylor assembly, BosTau4.2 (latest version), contains thousands of erroneously duplicated segments, which appear to any user of the data to be recently duplicated (because they are nearly identical). These are not genuine, however, but are artifacts of the Atlas assembler, which has difficulty distinguishing two overlapping BACs from different haplotypes from a genuine segmental duplication. In addition to correcting many erroneous duplications, the UMD3 assembly has much more accurate SNP data. Colleagues from USDA reported that a much higher percentage of SNPs were validated in their tests using the UMD assembly than using the Baylor assembly. The new assembly of turkey, released in 2010, provides a major new resource for poultry genetics. The relatively high quality of the assembly should serve the community well for many years. We are committed to improving it further if additional sequence data generated for that purpose become available. Along similar lines, we are working with Washington University to improve the chicken genome assembly using new data generated by them.
Publications
- R.A. Dalloul, J.A. Long, A.V. Zimin, L. Aslam, K. Beal, L. Blomberg, D.W. Burt, O. Crasta, R.P.M.A. Crooijmans, K. Cooper, R.A. Coulombe, S. De, M.E. Delany, J.B. Dodgson, J.J. Dong, C. Evans, P. Flicek, L. Florea, O. Folkerts, M.A.M. Groenen, T.T. Harkins, J. Herrero, S. Hoffmann, H.-J. Megens, A. Jiang, P. de Jong, P. Kaiser, H. Kim, K.-W. Kim, S. Kim, D. Langenberger, M.-K. Lee, T. Lee, S. Mane, G. Marcais, M. Marz, A.P. McElroy, T. Modise, M. Nefedov, C. Notredame, I.R. Paton, W.S. Payne, G. Pertea, D. Prickett, D. Puiu, D. Qioa, E. Raineri, S.L. Salzberg, M.C. Schatz, C. Scheuring, C.J. Schmidt, S. Schroeder, E.J. Smith, J. Smith, T.S. Sonstegard, P.F. Stadler, H. Tafer, Z. Tu, C.P. Van Tassell, A.J. Vilella, K. Williams, J.A. Yorke, L. Zhang, H.-B. Zhang, X. Zhang, Y. Zhang, and K.M. Reed. Multi-platform next generation sequencing of the domestic turkey (Meleagris gallopavo): genome assembly and analysis. PLoS Biology (2010), 8(9): e1000475. doi:10.1371/journal.pbio.1000475.
- D.R. Kelley and S.L. Salzberg. Detection and correction of false segmental duplications caused by genome mis-assembly. Genome Biology (2010), 11:R28. doi:10.1186/gb-2010-11-3-r28.
- A.V. Zimin, A.L. Delcher, L. Florea, D.R. Kelley, M.C. Schatz, D. Puiu, F. Hanrahan, G. Pertea, C.P. Van Tassell, T.S. Sonstegard, G. Marcais, M. Roberts, P. Subramanian, J.A. Yorke, and S.L. Salzberg. A whole-genome assembly of the domestic cow, Bos taurus. Genome Biology (2009), 10:R42.
- V Shulaev, DJ Sargent, RN Crowhurst, TC Mockler, O Folkerts, AL Delcher, P Jaiswal, K Mockaitis, A Liston, SP Mane, P Burns, TM Davis, JP Slovin, N Bassil, RP Hellens, C Evans, T Harkins, C Kodira, B Desany, OR Crasta, RV Jensen, AC Allan, TP Michael, JC Setubal, J-M Celton, DJG Rees, KP Williams, SH Holt, JJR Rojas, M Chatterjee, B Liu, H Silva, L Meisel, A Adato, SA Filichkin, M Troggio, R Viola, T-L Ashman, H Wang, P Dharmawardhana, J Elser, R Raja, HD Priest, DW Bryant Jr, SE Fox, SA Givan, LJ Wilhelm, S Naithani, A Christoffels, DY Salama, J Carter, EL Girona, A Zdepski, W Wang, RA Kerstetter, W Schwab, SS Korban, J Davik, A Monfort, B Denoyes-Rothan, P Arus, R Mittler, B Flinn, A Aharoni, JL Bennetzen, SL Salzberg, AW Dickerman, R Velasco, M Borodovsky, RE Veilleux, and KM Folta. The genome of woodland strawberry (Fragaria vesca). Nature Genetics (2010). Published online 26 December 2010.
- D.R. Kelley, M.C. Schatz, and S.L. Salzberg. Quake: quality-aware detection and correction of sequencing errors. Genome Biology (2010), 11:R116. http://genomebiology.com/2010/11/11/R116/.