Easily accessible Web-based tools for analyzing next-generation sequencing data from agricultural animals

Sponsoring Institution

National Institute of Food and Agriculture

Project Status

COMPLETE

Funding Source

AFRI COMPETITIVE GRANT

Reporting Frequency

Annual

Accession No.

0220610

Grant No.

2010-65205-20361

Cumulative Award Amt.

$689,921.00

Proposal No.

2009-03296

Multistate No.

(N/A)

Project Start Date

Dec 1, 2009

Project End Date

Nov 30, 2014

Grant Year

2010

Program Code

[92120]- Animal Genome, Genetics and Breeding

Recipient Organization
MICHIGAN STATE UNIV
(N/A)
EAST LANSING,MI 48824

Performing Department
Computer Science and Engineering

Non Technical Summary
Revolutionary technological advancements have played, and will continue to play, an important role in determining the power and utility of genomics and, in turn, its impact on all of biology, just as was the case for molecular biology and recombinant DNA. PCR, DNA microarrays, and automated sequencers are only a few of many examples that dramatically changed how scientists conduct their experiments and what biological problems can be addressed. Most recently, next-generation sequencers (e.g., Illumina/Solexa Genome Analyzer or GA, Roche 454 GS FLX, ABI SOLiD) that produce enormous amounts of sequence data are beginning to impact the field. Although these machines were originally designed to provide higher throughput and lower cost for whole-genome de novo sequencing or resequencing, it became readily apparent that they could be applied to other experimental questions and concepts. Animal scientists will soon have great power to characterize genotypic variation and RNA levels within and across individuals and connect this variation with phenotypic trait variation. However, the most critical hindrance to using next-generation sequencing effectively is the lack of easy-to-use computational tools to support data analysis. Specifically, (1) large datasets are not easily manipulated, (2) there is no standard pipeline to handle the data, (3) a long analysis time is required even with high-performance compute clusters, and (4) there is a lack of user interfaces. More importantly, most of the existing tools are designed with human, mouse, and other biomedical models in mind. What may not be readily apparent is that, due to the short read lengths and the algorithms used, a high-quality reference genome is required. While some agricultural animals have genome sequence assemblies, it is likely that they will never reach the quality already achieved for human and mouse.
In short, to utilize next-generation sequencing for agricultural animals, the community needs bioinformatics tools that regular bench scientists can use and that incorporate the unique needs and restrictions of agricultural animals. We propose to fill this gap by building an open source analysis pipeline and associated Web interface to take in next-gen sequencing data, map it onto existing genomes and gene annotations, visualize the mappings, summarize the data, and extract digested data in formats that are easily manipulated and immediately useful to animal scientists.

Animal Health Component

40%

Research Effort Categories

Basic

30%

Applied

40%

Developmental

30%

Classification

Knowledge Area (KA)	Subject of Investigation (SOI)	Field of Science (FOS)	Percent
304	3299	1080	25%
304	3399	1080	25%
304	3699	1080	25%
304	3599	1080	25%

Knowledge Area
304 - Animal Genome;

Subject Of Investigation
3299 - Poultry, general/other; 3699 - Sheep and wool, general/other; 3599 - Swine, general/other; 3399 - Beef cattle, general/other;

Field Of Science
1080 - Genetics;

Keywords

bioinformatics

next generation sequencing

Goals / Objectives
We will build a reusable, open source pipeline for the analysis of next-generation sequencing data, with a Web interface for submitting data and analyzing results. We will specifically address the analysis needs of data sets from genome resequencing and variation analysis and from RNAseq-based expression analysis and genome annotation. Our site design and implementation will center on usability and on integration of existing genome-scale data sets, such as complete and draft genome sequences, gene annotation sets, and private EST data sets, with data from next-generation sequencing projects. We will also provide documentation, screencasts, and online tutorials, and give demos and tutorials at conferences. Our initial effort will focus on providing a basic Web interface capable of mapping uploaded next-generation sequencing data to the chicken, cow, sheep, and pig genomes using Bowtie and maq. Post-mapping analysis options will include SNP calling, CNV detection, RNAseq analysis, and peak detection for ChIP-seq. In addition to this functionality, we will also enable the import of unannotated genome assemblies, Velvet assembly of unmapped reads, mapping from or to "private" sequences from in-lab or commercial EST and genome sequencing projects, import of GFF3 feature files, and handling of bar-coded analyses from multiplex sequencing.
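As a rough illustration of one of the listed import steps, the sketch below reads GFF3 feature lines into simple records. It is not the project's implementation: the names `Feature` and `parse_gff3` are hypothetical, and the code assumes only the standard GFF3 layout of nine tab-separated columns with `key=value` attributes separated by semicolons.

```python
from typing import Dict, Iterable, Iterator, NamedTuple
from urllib.parse import unquote


class Feature(NamedTuple):
    """One GFF3 feature line (hypothetical record type for this sketch)."""
    seqid: str
    source: str
    type: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    score: str
    strand: str
    phase: str
    attributes: Dict[str, str]


def parse_gff3(lines: Iterable[str]) -> Iterator[Feature]:
    """Yield a Feature for every feature line, skipping comments and directives."""
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more feature lines
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
        attributes = {}
        for field in cols[8].split(";"):
            if "=" in field:
                key, value = field.split("=", 1)
                # GFF3 percent-encodes reserved characters in attributes.
                attributes[unquote(key)] = unquote(value)
        yield Feature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                      cols[5], cols[6], cols[7], attributes)
```

A caller would pass an open file handle, for example `for feature in parse_gff3(open(path)): ...`, where `path` is whatever file the user uploaded.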

Project Methods
To ensure that the latest version of the software is always functioning, and that new features are immediately testable, we will use an agile software development methodology, in which multiple short iterations of software development are followed by functional releases. Because users evaluate the software on a regular basis, the next iteration can take into account user comments and feedback immediately. Agile methodologies also use automated testing to keep the software as a whole functioning; a thorough automated test suite also supports continued development, deployment, maintenance, and platform migration, by making sure that the software works across different installations and development environments. A formal development practice is necessary for any large software project, and agile methodologies have a proven track record of efficiently creating useful and functional software. With respect to documentation, in our prior experience with Cartwheel and FamilyRelations, we have found that users just don't read documentation; instead they prefer brief tutorials on solving their specific problem. We will therefore focus our formal documentation effort on technical aspects of the software pipeline (e.g., statistical methodologies and mapping parameters) and otherwise provide written tutorials and video "screencasts" demonstrating the function of the software. Input from the community and software evaluation will be conducted on a regular basis to ensure that our tools are widely disseminated and relevant to the user community. A key part of our proposal is to adapt to the needs of the community. Accordingly, we will have a public mailing list to which all users can subscribe, so that we can conduct a dialog with them. This model has worked well in the case of the Velvet assembler, for example, with many user questions and helpful responses being posted every week.
The biology post-doctoral fellow will also maintain a wiki containing suggestions and comments for upcoming features. We will be proactive in engaging our user community for feedback and evaluation. To allow for the maximum community participation, we will use a common open source development stack: a public developers' mailing list for technical communication, a Trac-based wiki and issue-tracking system for process management, a github-based source code repository for version control, and a buildbot continuous integration system to publicize our software's status. We will also use a number of common automated testing tools, including nose, Selenium, and twill, in order to write test code that can be executed by anyone. All software will be made available under the Apache Software Foundation Open Source license, which permits both commercial and non-commercial use, re-use, and distribution of unmodified and derived software. This will encourage people and companies to make use of our software in whatever way best suits them, without any encumbrances.

Progress 12/01/09 to 11/30/14

Outputs
Target Audience: We continue to make software and Web-based materials available from our projects. We cannot, however, track the individuals who are accessing the material or using our software (beyond citations and general analytics).
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? We continue to participate in a broad range of training and workshops; however, while these workshops were initiated under this grant, we are now funded by two separate NIH and NSF grants to run them. Therefore we limit the reporting here to mention that the tools being developed under this grant are taught as part of the NGS Summer Workshop at MSU, bioinformatics.msu.edu/ngs-summer-course-2014, using documentation created with the funding from this grant.
How have the results been disseminated to communities of interest? In addition to the publications, we maintain three Web sites to support training on and reuse of our software: http://ged.msu.edu/angus/, http://khmer.readthedocs.org/, and http://khmer-protocols.readthedocs.org/
What do you plan to do during the next reporting period to accomplish the goals? Nothing Reported

Impacts
What was accomplished under these goals?
1) We have continued to maintain, support, and extend the khmer software to provide tools for filtering sequence prior to downstream analysis (http://khmer.readthedocs.org). The khmer software includes tools to help with sequence cleaning and building new gene models. khmer is largely unique in its capabilities, although other packages (the Trinity and Mira assemblers, in particular) have started to adopt our algorithms. We estimate that there are several hundred to several thousand users of khmer. One of khmer's particularly important goals is to enable all of this analysis in the cloud, using rental machines from e.g. Amazon or lower-cost computers on XSEDE or iPlant Collaborative. We can now do essentially all NGS analysis in the cloud using khmer, which is a significant advance for this project. Continued funding for khmer has been assured through an NIH R01, because khmer has proven to be widely useful for bioinformatics across fields. We continue to collaborate widely with USDA researchers, and collaborate locally through a training grant and a cattle grant. khmer has now undergone over a dozen releases, its algorithms have been even more widely adopted (it is included in the TruSeq Long Read pipeline used by Illumina), and it is under a much more stable development cycle.
2) The ANGUS tutorials (http://ged.msu.edu/angus/) on NGS analysis continue to be developed towards the goals of this project, although future support has shifted largely to an NIH R25 course grant that is funded through 2017.
3) We have built several computational "protocols" for NGS sequence analysis, and posted them openly under a Creative Commons/no restrictions license for full reuse and remixing. Called 'khmer protocols', these protocols grew from the tutorials above but are now maintained and supported on a continuing basis. We are currently extending these protocols to be of even wider use for ag and vet genomes.

Publications

Type: Journal Articles Status: Published Year Published: 2014 Citation: Howe AC, Jansson JK, Malfatti SA, Tringe SG, Tiedje JM, Brown CT. Proc Natl Acad Sci U S A. 2014 Apr 1;111(13):4904-9. doi: 10.1073/pnas.1402564111.
Type: Journal Articles Status: Published Year Published: 2014 Citation: Zhang Q, Pell J, Canino-Koning R, Howe AC, Brown CT. PLoS One. 2014 Jul 25;9(7):e101271. doi: 10.1371/journal.pone.0101271.

Progress 12/01/12 to 11/30/13

Outputs
Target Audience: Our target audience has been advanced graduate students, postdocs, and faculty interested in doing sequence analysis. We have reached them through a series of talks and workshops, as evidenced by the increasing number of citations (over 60) for products from this grant.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? We continue to participate in a broad range of training and workshops; however, while these workshops were initiated under this grant, we are now funded by two separate NIH and NSF grants to run them. Therefore we limit the reporting here to mention that the tools being developed under this grant are taught as part of the NGS Summer Workshop at MSU, bioinformatics.msu.edu/ngs-summer-course-2014, using documentation created with the funding from this grant.
How have the results been disseminated to communities of interest? In addition to the publications, we maintain three Web sites to support training on and reuse of our software: http://ged.msu.edu/angus/, http://khmer.readthedocs.org/, and http://khmer-protocols.readthedocs.org/
What do you plan to do during the next reporting period to accomplish the goals? We are working towards a 1.0 release of the khmer software, refining the protocols for sequence analysis, and working to publish the results. Currently we are down to 1 FTE working on this, but much of the work nucleated by this grant has gone on to be funded elsewhere (including by the USDA), so we are simply working to complete a software release and publication, which is attainable within the remainder of the grant.

Impacts
What was accomplished under these goals?
1) We have continued to maintain, support, and extend the khmer software to provide tools for filtering sequence prior to downstream analysis (http://khmer.readthedocs.org). The khmer software includes tools to help with sequence cleaning and building new gene models. khmer is largely unique in its capabilities, although other packages (the Trinity and Mira assemblers, in particular) have started to adopt our algorithms. We estimate that there are several hundred to several thousand users of khmer. One of khmer's particularly important goals is to enable all of this analysis in the cloud, using rental machines from e.g. Amazon or lower-cost computers on XSEDE or iPlant Collaborative. We can now do essentially all NGS analysis in the cloud using khmer, which is a significant advance for this project. Continued funding for khmer has been assured through an NIH R01, because khmer has proven to be widely useful for bioinformatics across fields. We continue to collaborate widely with USDA researchers, and collaborate locally through a training grant and a cattle grant.
2) The ANGUS tutorials (http://ged.msu.edu/angus/) on NGS analysis continue to be developed towards the goals of this project, although future support has shifted largely to an NIH R25 course grant that is funded through 2017.
3) We have built several computational "protocols" for NGS sequence analysis, and posted them openly under a Creative Commons/no restrictions license for full reuse and remixing. Called 'khmer protocols', these protocols grew from the tutorials above but are now maintained and supported on a continuing basis.
4) We have integrated these tutorials into the Galaxy and iPlant Collaborative workflow engines, to the best of our ability; we are working with the Galaxy and iPlant engineers to extend the workflow engines to support our needs. These workflow engines are becoming field-standard and allow any biologist to work with NGS data flexibly and efficiently; when combined with our enabling of the cloud for certain tasks through khmer, this effort represents a significant expansion of functionality for biologists.
5) We have continued to develop the 'gimme' software for building gene models for agricultural and "semi-model" organisms. We expect to submit a paper for publication on this shortly.

Publications

Type: Journal Articles Status: Accepted Year Published: 2013 Citation: Schwarz EM, Korhonen PK, Campbell BE, Young ND, Jex AR, Jabbar A, Hall RS, Mondal A, Howe AC, Pell J, Hofmann A, Boag PR, Zhu XQ, Gregory TR, Loukas A, Williams BA, Antoshechkin I, Brown CT, Sternberg PW, Gasser RB. The genome and developmental transcriptome of the strongylid nematode Haemonchus contortus. Genome Biol. 2013 Aug 28;14(8):R89. [Epub ahead of print] PubMed PMID: 23985341.
Type: Journal Articles Status: Accepted Year Published: 2013 Citation: Subramaniam S, Johnston J, Preeyanon L, Brown CT, Kung HJ, Cheng HH. Integrated analyses of genome-wide DNA occupancy and expression profiling identify key genes and pathways involved in cellular transformation by a Marek's disease virus oncoprotein, Meq. J Virol. 2013 Aug;87(16):9016-29. doi: 10.1128/JVI.01163-13. Epub 2013 Jun 5. PubMed PMID: 23740999; PubMed Central PMCID: PMC3754031.
Type: Book Chapters Status: Awaiting Publication Year Published: 2013 Citation: khmer: Working with Big Data in Bioinformatics. Eric McDonald, C Titus Brown. In: The Performance of Open Source Applications, Tavish Armstrong. In press.

Progress 12/01/11 to 11/30/12

Outputs
OUTPUTS: The most important output from this grant has been the continued development and maintenance of the khmer software for sequence analysis (see: http://github.com/ged-lab/khmer). This software is increasingly popular for mRNAseq, genome, and metagenome assembly and is being used by a number of agricultural labs in particular. It is also being used in our summer course on next-generation sequence analysis, and has served as the basis for a submitted grant proposal to improve the chicken genome as well as for a funded proposal on bovine tuberculosis resistance (Coussens, PI). We have also provided a number of online tutorials that show how to use khmer to do next-generation sequencing data analysis at http://ged.msu.edu/angus/ and in the khmer documentation, http://khmer.readthedocs.org. khmer has been presented in over a dozen invited talks, including at the Plant and Animal Genome conference in 2012 and 2013, three talks at the University of Arizona in August 2012, a talk at the University of Miami in Florida, a talk at NESCent in North Carolina, a talk at Notre Dame, a talk at the Extremely Large Databases conference at Stanford, a talk at a Wellcome Trust meeting in Cambridge, and a talk at the Norwegian Sequencing Center in Oslo, Norway.
PARTICIPANTS: Kanchan Pavangadkar (PD) has continued to work on mRNAseq analysis and user interfaces for analyzing differential expression data. Eric McDonald (staff programmer) was hired at the beginning of this reporting period and has worked extensively on optimizing and improving the khmer software. Jason Pell (graduate student) published one paper and is nearing graduation.
TARGET AUDIENCES: 1. Each year we have taught a two-week summer course on analyzing next-generation sequencing data that has included industry and academic participants from agriculturally relevant labs and companies. This course has benefitted significantly from the expertise developed as a result of this funding. 2. I have integrated work from this project into an interdisciplinary graduate course here at Michigan State University. 3. Last summer, four students from underrepresented minorities did summer undergraduate research that extended this project in various ways.
PROJECT MODIFICATIONS: Our major change in approach has been to focus on integrating data analysis approaches relevant to agriculture into biomedically targeted analysis programs. In particular, we are working to integrate our existing khmer software into the Galaxy software package for sequence analysis, which will permit Galaxy to be used for resequencing and mRNAseq analysis of agricultural data in ways that it cannot currently support.

Impacts
A number of external projects are in motion using our software and approaches, but none have yet been published.

Publications

Pell J, Hintze A, Canino-Koning R, Howe A, Tiedje JM, Brown CT. Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proc Natl Acad Sci U S A. 2012 Aug 14;109(33):13272-7. doi: 10.1073/pnas.1121464109. Epub 2012 Jul 30. PubMed PMID: 22847406; PubMed Central PMCID: PMC3421212.

Progress 12/01/10 to 11/30/11

Outputs
OUTPUTS: Expertise and software funded by this grant were used to run several workshops. In particular, we ran a 2-week workshop at the MSU Kellogg Biological Station on Analyzing Next-Generation Sequence Data that demonstrated the utility of various analysis techniques developed on this grant. We also ran a 1-day workshop in November for local area researchers to learn some of the sequence assembly techniques that we have developed. Much of the course material available at http://ged.msu.edu/angus/ was developed with expertise supported by this grant.
PARTICIPANTS: Dr. C. Titus Brown (PI) supervised research on tools and helped direct the overall research. Dr. Hans Cheng (collaborator) made extensive use of technologies and approaches developed under this grant for analysis of Marek's Disease Virus infections in chicken. His lab collaborates closely with the PI's lab. Likit Preeyanon (graduate student) is primarily funded by the Thai government, but has received equipment and travel support from this grant. He is the primary student working between Dr. Cheng and Dr. Brown. He has been working on transcriptome generation for chick, allele-specific expression analysis of that data, and genome-wide association studies for MDV-resistance in chick. He is an author on several papers emerging from this work. Jason Pell is working to develop a suite of computational tools to help scale and improve analysis of next-generation sequencing data. He is an author on several papers emerging from this work. Dr. Arend Hintze co-supervised the development of computational analysis techniques by Jason Pell, and investigated graph theoretic aspects of sequence analysis. He is an author on several papers emerging from this work. Dr. Kanchan Pavangadkar has worked to generate transcriptome data from chick and analyze it using our approaches. She is an author on several papers emerging from this work.
TARGET AUDIENCES: Not relevant to this project.
PROJECT MODIFICATIONS: Nothing significant to report during this reporting period.

Impacts
During this reporting period, we developed an effective solution to a wide variety of sequence analysis problems of importance to the agricultural community. This approach, termed digital normalization, will make certain kinds of sequence analysis simply and easily accessible to researchers without substantial compute resources. In particular, we have made RNA sequence analysis work on rental "cloud" computers, which are often more limited in capacity than other computers; this eliminates the substantial up-front resource investment currently needed for RNA sequence analysis.
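The report names digital normalization without describing it. As we understand the published approach, reads are streamed once, and a read is kept only while the median abundance of its k-mers, counted over the reads kept so far, is still below a coverage cutoff. The sketch below is a minimal illustration under that assumption and is not the khmer implementation: the function names, `k`, and `cutoff` are hypothetical, and an exact in-memory `Counter` stands in for the memory-efficient counting a production tool would need.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of seq (empty list if seq is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=20, cutoff=20):
    """Keep a read only if its estimated coverage is still below `cutoff`.

    Coverage is estimated as the median count of the read's k-mers among
    the reads kept so far. Counts are exact here (an assumption made for
    clarity); this is what limits the sketch to small inputs.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: no coverage estimate possible
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)  # only kept reads contribute to counts
            kept.append(read)
    return kept
```

Because redundant reads from already well-covered regions are discarded in a single pass, the retained set grows with the amount of novel sequence and not with raw sequencing depth, which is consistent with the report's point about running analyses on limited-capacity rental machines.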

Publications

No publications reported this period

Progress 12/01/09 to 11/30/10

Outputs
OUTPUTS: There are several outputs from our first year. First, driven partly by results from our first year of funding, we ran a summer workshop on next-generation sequencing data analysis. A number of researchers, including two USDA-funded graduate students and three researchers at commercial ag companies, participated in the course. Mr. Jason Pell, a graduate student funded by the grant, and Dr. Kanchan Pavangadkar (a postdoc on the grant), helped develop course materials based on their work on this project. Second, personnel involved in the grant have contributed to the course-associated ANGUS Web site, which contains educational materials on using advanced sequencing technologies. The site is online at ged.msu.edu/angus/. Third, we have developed several technologies for helping to scale data analysis to the volume of data now available. In particular, we have been working on advanced techniques for assembling mRNA sequencing data, detecting splice isoforms, and storing and retrieving annotations. Fourth, we have continued developing a genome display toolkit to index and display hundreds of thousands of genomic features. Fifth, we have begun developing a cloud deployment system to enable researchers to take advantage of cloud computing (rental compute infrastructure) for their analyses. All of our software is being made freely available under a BSD license at github.com/ctb/, and is being disseminated to the community through talks, blog posts, and twitter posts.
PARTICIPANTS: Mr. Likit Preeyanon is a graduate student supported on a Thai government grant. He is working with Dr. Hans Cheng at the USDA ADOL on technology for allelotyping and splice isoform analysis in mRNAseq data. Mr. Jason Pell is a graduate student working on synopsis algorithms for scalably dealing with next-generation sequencing data. His work is being used by Mr. Preeyanon as a computational toolset. Dr. Kanchan Pavangadkar is a postdoc documenting and evaluating mapping and assembly of next-generation sequencing data for mRNA and genome analysis. Dr. Arend Hintze is a postdoc working on building Web applications to analyze next-generation sequencing data. Mr. Preeyanon and Mr. Pell helped teach and develop materials for our summer workshop on next-generation sequencing. Dr. Pavangadkar participated in the workshop.
TARGET AUDIENCES: Our online materials are aimed at researchers who want to make use of sequencing technology to do genome sequencing, resequencing, transcriptome quantification, and allelotyping. This includes academic and industrial researchers.
PROJECT MODIFICATIONS: Nothing significant to report during this reporting period.

Impacts
Our primary outcome is the development of novel synopsis algorithms for dealing with large (50-200 Gb) data sets of mRNA sequencing data. Current assembly methods cannot assemble this volume of data on existing computers. A secondary outcome is our discovery that existing mRNA assembly tools do a very poor job of detecting splice isoforms. This is being driven by a collaboration with Dr. Hans Cheng, who has large quantities of allelotyping and genomic data from both Marek's Disease Virus infected birds and from Cobb-Vantress chicken lines. We are now working to make our results robust and our methods reusable. In both cases, this grant funded the people working on the techniques as well as associated computer equipment.
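The report does not say which synopsis data structures were developed. One widely used structure of this general kind is the Bloom filter, which records set membership (here, whether a k-mer has been seen) in a fixed amount of memory at the cost of occasional false positives. The sketch below is an assumption-laden illustration, not the project's code: the class name, the default sizes, and the use of salted SHA-256 as the hash family are all choices made for this example.

```python
import hashlib


class KmerBloomFilter:
    """Fixed-size bit array recording which k-mers have been seen.

    Lookups may report false positives but never false negatives.
    """

    def __init__(self, num_bits=1_000_003, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, kmer):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(kmer))


def load_reads(reads, k, bloom):
    """Insert every overlapping k-mer of every read into the filter."""
    for read in reads:
        for i in range(len(read) - k + 1):
            bloom.add(read[i:i + k])
```

Memory use is set by `num_bits` alone and does not grow with the number of reads, which is the property that makes this family of structures attractive for data sets too large to index exactly; the trade-off is a false-positive rate that rises as the filter fills.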

Publications

No publications reported this period