Progress 12/01/09 to 11/30/14
Outputs Target Audience: We continue to make software and Web-based materials available from our projects. We cannot, however, track the individuals who are accessing the material or using our software (beyond citations and general analytics). Changes/Problems:
Nothing Reported
What opportunities for training and professional development has the project provided? We continue to participate in a broad range of training and workshops; however, while these workshops were initiated under this grant, we are now funded by two separate NIH and NSF grants to run them. Therefore we limit the reporting here to mention that the tools being developed under this grant are taught as part of the NGS Summer Workshop at MSU, bioinformatics.msu.edu/ngs-summer-course-2014, using documentation created with the funding from this grant. How have the results been disseminated to communities of interest? In addition to the publications, we maintain three Web sites to support training on and reuse of our software: http://ged.msu.edu/angus/ http://khmer.readthedocs.org/ http://khmer-protocols.readthedocs.org/ What do you plan to do during the next reporting period to accomplish the goals?
Nothing Reported
Impacts What was accomplished under these goals?
1) We have continued to maintain, support, and extend the khmer software to provide tools for filtering sequence prior to downstream analysis (http://khmer.readthedocs.org). The khmer software includes tools to help with sequence cleaning and building new gene models. khmer is largely unique in its capabilities, although other packages (the Trinity and Mira assemblers, in particular) have started to adopt our algorithms. We estimate that there are several hundred to several thousand users of khmer. One of khmer's particularly important goals is to enable all of this analysis in the cloud, using rental machines from e.g. Amazon or lower-cost computers on XSEDE or iPlant Collaborative. We can now do essentially all NGS analysis in the cloud using khmer, which is a significant advance for this project. Continued funding for khmer has been assured through an NIH R01, because khmer has proven to be widely useful for bioinformatics across fields. We continue to collaborate widely with USDA researchers, and collaborate locally through a training grant and a cattle grant. khmer has now undergone over a dozen releases, its algorithms have been even more widely adopted (it is included in the TruSeq Long Read pipeline used by Illumina), and it is under a much more stable development cycle. 2) The ANGUS tutorials (http://ged.msu.edu/angus/) on NGS analysis continue to be developed towards the goals of this project, although future support has shifted largely to an NIH R25 course grant that is funded through 2017. 3) We have built several computational "protocols" for NGS sequence analysis, and posted them openly under a Creative Commons/no restrictions license for full reuse and remixing. Called 'khmer protocols', these protocols grew from the tutorials above but are now maintained and supported on a continuing basis. We are currently extending these protocols to be of even wider use for agricultural and veterinary genomes.
Publications
- Type:
Journal Articles
Status:
Published
Year Published:
2014
Citation:
Howe AC, Jansson JK, Malfatti SA, Tringe SG, Tiedje JM, Brown CT.
Proc Natl Acad Sci U S A. 2014 Apr 1;111(13):4904-9. doi: 10.1073/pnas.1402564111.
- Type:
Journal Articles
Status:
Published
Year Published:
2014
Citation:
Zhang Q, Pell J, Canino-Koning R, Howe AC, Brown CT.
PLoS One. 2014 Jul 25;9(7):e101271. doi: 10.1371/journal.pone.0101271.
Progress 12/01/12 to 11/30/13
Outputs Target Audience: Our target audience has been advanced graduate students, postdocs, and faculty interested in doing sequence analysis. We have reached them through a series of talks and workshops, as evidenced by the increasing number of citations (over 60) for products from this grant. Changes/Problems:
Nothing Reported
What opportunities for training and professional development has the project provided? We continue to participate in a broad range of training and workshops; however, while these workshops were initiated under this grant, we are now funded by two separate NIH and NSF grants to run them. Therefore we limit the reporting here to mention that the tools being developed under this grant are taught as part of the NGS Summer Workshop at MSU, bioinformatics.msu.edu/ngs-summer-course-2014, using documentation created with the funding from this grant. How have the results been disseminated to communities of interest? In addition to the publications, we maintain three Web sites to support training on and reuse of our software: http://ged.msu.edu/angus/ http://khmer.readthedocs.org/ http://khmer-protocols.readthedocs.org/ What do you plan to do during the next reporting period to accomplish the goals? We are working towards a 1.0 release of the khmer software, refining the protocols for sequence analysis, and working to publish the results. Currently we are down to 1 FTE working on this, but much of the work nucleated by this grant has gone on to be funded elsewhere (including by the USDA), so we are simply working to complete a software release and publication, which is attainable within the remainder of the grant.
Impacts What was accomplished under these goals?
1) We have continued to maintain, support, and extend the khmer software to provide tools for filtering sequence prior to downstream analysis (http://khmer.readthedocs.org). The khmer software includes tools to help with sequence cleaning and building new gene models. khmer is largely unique in its capabilities, although other packages (the Trinity and Mira assemblers, in particular) have started to adopt our algorithms. We estimate that there are several hundred to several thousand users of khmer. One of khmer's particularly important goals is to enable all of this analysis in the cloud, using rental machines from e.g. Amazon or lower-cost computers on XSEDE or iPlant Collaborative. We can now do essentially all NGS analysis in the cloud using khmer, which is a significant advance for this project. Continued funding for khmer has been assured through an NIH R01, because khmer has proven to be widely useful for bioinformatics across fields. We continue to collaborate widely with USDA researchers, and collaborate locally through a training grant and a cattle grant. 2) The ANGUS tutorials (http://ged.msu.edu/angus/) on NGS analysis continue to be developed towards the goals of this project, although future support has shifted largely to an NIH R25 course grant that is funded through 2017. 3) We have built several computational "protocols" for NGS sequence analysis, and posted them openly under a Creative Commons/no restrictions license for full reuse and remixing. Called 'khmer protocols', these protocols grew from the tutorials above but are now maintained and supported on a continuing basis. 4) We have integrated these tutorials into the Galaxy and iPlant Collaborative workflow engines, to the best of our ability; we are working with the Galaxy and iPlant engineers to extend the workflow engines to support our needs. 
These workflow engines are becoming field-standard and allow any biologist to work with NGS data flexibly and efficiently; when combined with our enabling of the cloud for certain tasks through khmer, this effort represents a significant expansion of functionality for biologists. 5) We have continued to develop the 'gimme' software for building gene models for agricultural and "semi-model" organisms. We expect to submit a paper for publication on this shortly.
Publications
- Type:
Journal Articles
Status:
Accepted
Year Published:
2013
Citation:
Schwarz EM, Korhonen PK, Campbell BE, Young ND, Jex AR, Jabbar A, Hall RS,
Mondal A, Howe AC, Pell J, Hofmann A, Boag PR, Zhu XQ, Gregory TR, Loukas A,
Williams BA, Antoshechkin I, Brown CT, Sternberg PW, Gasser RB. The genome and
developmental transcriptome of the strongylid nematode Haemonchus contortus.
Genome Biol. 2013 Aug 28;14(8):R89. [Epub ahead of print] PubMed PMID: 23985341.
- Type:
Journal Articles
Status:
Accepted
Year Published:
2013
Citation:
Subramaniam S, Johnston J, Preeyanon L, Brown CT, Kung HJ, Cheng HH.
Integrated analyses of genome-wide DNA occupancy and expression profiling
identify key genes and pathways involved in cellular transformation by a Marek's
disease virus oncoprotein, Meq. J Virol. 2013 Aug;87(16):9016-29. doi:
10.1128/JVI.01163-13. Epub 2013 Jun 5. PubMed PMID: 23740999; PubMed Central
PMCID: PMC3754031.
- Type:
Book Chapters
Status:
Awaiting Publication
Year Published:
2013
Citation:
khmer: Working with Big Data in Bioinformatics. Eric McDonald, C Titus Brown. In: The Performance of Open Source Applications, ed. Tavish Armstrong. In press.
Progress 12/01/11 to 11/30/12
Outputs OUTPUTS: The most important output from this grant has been the continued development and maintenance of the khmer software for sequence analysis (see: http://github.com/ged-lab/khmer). This software is increasingly popular for mRNAseq, genome, and metagenome assembly and is being used by a number of agricultural labs in particular. It is also being used in our summer course on next-generation sequence analysis, and has served as the basis for a grant to improve the chicken genome (a submitted proposal) as well as for a funded proposal on bovine tuberculosis resistance (Coussens, PI). We have also provided a number of online tutorials that show how to use khmer to do next-generation sequencing data analysis at http://ged.msu.edu/angus/ and also in the khmer documentation, http://khmer.readthedocs.org. khmer has been presented in over a dozen invited talks, including at the Plant and Animal Genome conference in 2012 and 2013, three talks at the University of Arizona in August 2012, a talk at the University of Miami in Florida, a talk at NESCent in North Carolina, a talk at Notre Dame, a talk at the Extremely Large Databases conference at Stanford, a talk at a Wellcome Trust meeting in Cambridge, and a talk at the Norwegian Sequencing Center in Oslo, Norway. PARTICIPANTS: Kanchan Pavangadkar (PD) has continued to work on mRNAseq analysis and user interfaces for analyzing differential expression data. Eric McDonald (staff programmer) was hired at the beginning of this reporting period and has worked extensively on optimizing and improving the khmer software. Jason Pell (graduate student) published one paper and is nearing graduation. TARGET AUDIENCES: 1. Each year we have taught a two-week summer course on analyzing next-generation sequencing data that has included industry and academic participants from agriculturally relevant labs and companies. This course has benefitted significantly from the expertise developed as a result of this funding. 2. I have integrated work from this project into an interdisciplinary graduate course here at Michigan State University. 3. Last summer, four students from underrepresented minorities did summer undergraduate research that extended this project in various ways. PROJECT MODIFICATIONS: Our major change in approach has been to focus on integrating data analysis approaches relevant to agriculture into biomedically targeted analysis programs. In particular, we are working to integrate our existing khmer software into the Galaxy software package for sequence analysis, which will permit Galaxy to be used for resequencing and mRNAseq analysis of agricultural data in ways that are not currently possible.
Impacts A number of external projects are in motion using our software and approaches, but none have yet been published.
Publications
- Pell J, Hintze A, Canino-Koning R, Howe A, Tiedje JM, Brown CT. Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proc Natl Acad Sci U S A. 2012 Aug 14;109(33):13272-7. doi: 10.1073/pnas.1121464109. Epub 2012 Jul 30. PubMed PMID: 22847406; PubMed Central PMCID: PMC3421212.
Progress 12/01/10 to 11/30/11
Outputs OUTPUTS: Expertise and software funded by this grant were used to run several workshops. In particular, we ran a 2-week workshop at the MSU Kellogg Biological Station on Analyzing Next-Generation Sequence Data that demonstrated the utility of various analysis techniques developed on this grant. We also ran a 1-day workshop in November for local area researchers to learn some of the sequence assembly techniques that we have developed. Many of the course materials available at http://ged.msu.edu/angus/ were developed with expertise supported by this grant. PARTICIPANTS: Dr. C. Titus Brown (PI) supervised research on tools and helped direct the overall research. Dr. Hans Cheng (collaborator) made extensive use of technologies and approaches developed under this grant for analysis of Marek's Disease Virus infections in chicken. His lab collaborates closely with the PI's lab. Likit Preeyanon (graduate student) is primarily funded by the Thai government, but has received equipment and travel support from this grant. He is the primary student working between Dr. Cheng and Dr. Brown. He has been working on transcriptome generation for chick, allele-specific expression analysis of that data, and genome-wide association studies for MDV-resistance in chick. He is an author on several papers emerging from this work. Jason Pell is working to develop a suite of computational tools to help scale and improve analysis of next generation sequencing data. He is an author on several papers emerging from this work. Dr. Arend Hintze co-supervised the development of computational analysis techniques by Jason Pell, and investigated graph theoretic aspects of sequence analysis. He is an author on several papers emerging from this work. Dr. Kanchan Pavangadkar has worked to generate transcriptome data from chick and analyze it using our approaches. She is an author on several papers emerging from this work. TARGET AUDIENCES: Not relevant to this project.
PROJECT MODIFICATIONS: Nothing significant to report during this reporting period.
Impacts During this reporting period, we developed an effective solution to a wide variety of sequence analysis problems of importance to the agricultural community. This approach, termed digital normalization, will make certain kinds of sequence analysis simply and easily accessible to researchers without substantial compute resources. In particular, we have made RNA sequence analysis work on rental "cloud" computers, which are often more limited in capacity than other computers; this eliminates the substantial up-front resource investment currently needed for RNA sequence analysis.
Publications
- No publications reported this period
Progress 12/01/09 to 11/30/10
Outputs OUTPUTS: There are several outputs from our first year. First, driven partly by results from our first year of funding, we ran a summer workshop on next-generation sequencing data analysis. A number of researchers, including two USDA-funded graduate students and three researchers at commercial ag companies, participated in the course. Mr. Jason Pell, a graduate student funded by the grant, and Dr. Kanchan Pavangadkar (a postdoc on the grant), helped develop course materials based on their work on this project. Second, personnel involved in the grant have contributed to the course-associated ANGUS Web site, which contains educational materials on using advanced sequencing technologies. The site is online at ged.msu.edu/angus/. Third, we have developed several technologies for helping to scale data analysis to the volume of data now available. In particular, we have been working on advanced techniques for assembling mRNA sequencing data, detecting splice isoforms, and storing and retrieving annotations. Fourth, we have continued developing a genome display toolkit to index and display hundreds of thousands of genomic features. Fifth, we have begun developing a cloud deployment system to enable researchers to take advantage of cloud computing (rental compute infrastructure) for their analyses. All of our software is being made freely available under a BSD license at github.com/ctb/, and being disseminated to the community through talks, blog posts, and Twitter posts. PARTICIPANTS: Mr. Likit Preeyanon is a graduate student supported on a Thai government grant. He is working with Dr. Hans Cheng at the USDA ADOL on technology for allelotyping and splice isoform analysis in mRNAseq data. Mr. Jason Pell is a graduate student working on synopsis algorithms for scalably dealing with next-generation sequencing data. His work is being used by Mr. Preeyanon as a computational toolset. Dr. Kanchan Pavangadkar is a postdoc documenting and evaluating mapping and assembly of next-generation sequencing data for mRNA and genome analysis. Dr. Arend Hintze is a postdoc working on building Web applications to analyze next-generation sequencing data. Mr. Preeyanon and Mr. Pell helped teach and develop materials for our summer workshop on next-generation sequencing. Dr. Pavangadkar participated in the workshop. TARGET AUDIENCES: Our online materials are aimed at researchers who want to make use of sequencing technology to do genome sequencing, resequencing, transcriptome quantification, and allelotyping. This includes academic and industrial researchers. PROJECT MODIFICATIONS: Nothing significant to report during this reporting period.
Impacts Our primary outcome is in the development of novel synopsis algorithms for dealing with large (50-200 Gb) data sets of mRNA sequencing data. Current assembly methods cannot assemble this volume of data on existing computers. A secondary outcome is our discovery that existing mRNA assembly tools do a very poor job of detecting splice isoforms. This is being driven by a collaboration with Dr. Hans Cheng, who has large quantities of allelotyping and genomic data from both Marek's Disease Virus infected birds and from Cobb-Vantress chicken lines. We are working to make our results robust and our methods reusable now. In both cases, this grant funded the people working on the techniques as well as associated computer equipment.
Publications
- No publications reported this period