Big Data Analysis Tools For Agricultural Genomics

Sponsoring Institution

National Institute of Food and Agriculture

Project Status

COMPLETE

Funding Source

HATCH

Reporting Frequency

Annual

Accession No.

1004061

Grant No.

(N/A)

Cumulative Award Amt.

(N/A)

Proposal No.

(N/A)

Multistate No.

(N/A)

Project Start Date

Oct 1, 2014

Project End Date

Jul 16, 2016

Grant Year

(N/A)

Program Code

[(N/A)]- (N/A)

Recipient Organization
CLEMSON UNIVERSITY
(N/A)
CLEMSON,SC 29634

Performing Department
Genetics and Biochemistry

Non Technical Summary
Anyone who has used an Internet search engine knows there are volumes of free digital information available to mine, from the best shoe price to where Fiji is located. The same is true for agricultural genetic information stored on the Internet, which can be mined and used to develop new crops. New crops are always needed, and an acceleration of the crop development cycle is essential to mitigate the effects of population pressure and climate change on food and plant co-product (biofuel, cotton, etc.) yields. The applied basic research described in this proposal could have a powerful impact on human health and commerce. A ramp-up of crop development speed is essential given the competition from world markets for existing crop commodities, the enormous market potential of bioenergy and other co-products, and the threat that climate change poses by shifting the living (e.g. emerging pests and weeds) and non-living (e.g. floods and droughts) factors affecting crop yield. In short, the application of new technologies, such as the analysis of huge DNA sequence datasets described in this proposed research, may very well be essential to maintaining and improving the profitability of the US/SC agricultural industry now and in an uncertain future where the Farmer's Almanac may not be predictive of future climate.

Animal Health Component

20%

Research Effort Categories

Basic

80%

Applied

20%

Developmental

(N/A)

Classification

Knowledge Area (KA)	Subject of Investigation (SOI)	Field of Science (FOS)	Percent
201	2410	1081	100%

Knowledge Area
201 - Plant Genome, Genetics, and Genetic Mechanisms;

Subject Of Investigation
2410 - Cross-commodity research--multiple crops;

Field Of Science
1081 - Breeding;

Keywords

Goals / Objectives
Objective A: Build a Crop Gene Interaction Network Database. We will insert gene interaction networks for crops relevant to SC into our GeneNet Engine data-mining resource. These networks will be collected from public sources and in some cases constructed de novo using NGS data. A key function of the GeneNet Engine is the delivery of known DNA polymorphisms (e.g. SNPs with flanking DNA) near genes that the user finds to be relevant to the biology in which they are interested.
Objective B: Construct Translational Genomics Software. We will construct tools to analyze complex gene interaction patterns (networks). The GeneNet Engine itself is a tool that allows for the exploration of networks by finding interacting partners with specific gene names, genetic signal of a trait, and enriched biological function. We will explore ways to improve this tool for comparative genomics. In addition, we will work on the construction of fast network alignment software to identify conserved interaction patterns between crops.
Objective C: Outreach to SC Crop Development Community. If travel funds are provided, we will deliver on-site training in the use of these tools and their relevance to crop development.

Project Methods
Objective A: Build a Crop Gene Interaction Network Database. In this Objective, we will add networks of agricultural relevance to South Carolina. These could include bioenergy grasses, soybean, cotton, vegetables, and others as they are generated by the Feltus lab or other groups. To obtain the networks, we will A) search the literature for gene interaction networks (RNAseq and hybridization based) for target crops relevant to South Carolina; and B) construct co-expression networks de novo from public RNAseq data. For example, there are at least two soybean networks available from the literature (Yim, Yu et al. 2013; Yu, Zhang et al. 2014), and there are 138 soybean RNAseq datasets in the NCBI SRA database (SRA 2014). Given the PI's previous research, we will be interested in adding bioenergy grass networks, including one we are constructing for sorghum (unpublished data). For de novo RNAseq network construction, FASTQ files will be pre-processed by removing adaptors and soft-trimming with Trimmomatic (Trimmomatic 2013), and then mapped to the relevant genome assembly with gene model coordinates in GTF files using Bowtie2/TopHat (Trapnell, Pachter et al. 2009; bowtie2 2013). The SAM/BAM alignments for all conditions will be used to construct FPKM matrices, which will be input into a gene co-expression network (GCN) construction pipeline for gene interaction module discovery. PI Feltus is currently exploring alternate RNAseq normalization techniques (Dillies, Rau et al. 2013) for GCN construction. We will remove outlier distributions and construct a single global GCN for each species, but we will also explore statistical partitioning of the expression sets prior to network construction, as we have done for rice (Ficklin and Feltus 2013) and Arabidopsis (Feltus, Ficklin et al. 2013). The construction of multiple GCNs allows for the maximal capture of gene co-expression space.
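As a rough illustration of the matrix-to-network step described above, the following Python sketch converts a read-count matrix to FPKM and links gene pairs whose log-transformed expression profiles are strongly correlated. The function names, the toy counts, the use of column sums as the number of mapped reads, and the fixed Pearson cutoff of 0.9 are all assumptions made for this example. It is not the project's GCN pipeline, whose thresholding and normalization choices are not specified here.

```python
import numpy as np


def fpkm(counts, gene_lengths_bp):
    """Convert a genes x samples read-count matrix to FPKM.

    Assumption: the column sum of `counts` stands in for the number of
    mapped reads (fragments) in each sample.
    """
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(gene_lengths_bp, dtype=float)[:, None] / 1e3
    mapped_millions = counts.sum(axis=0, keepdims=True) / 1e6
    return counts / (lengths_kb * mapped_millions)


def coexpression_edges(expr, gene_ids, threshold=0.9):
    """Return (gene_a, gene_b, r) for pairs with |Pearson r| >= threshold.

    `expr` is a genes x samples matrix; values are log2(x + 1) transformed
    before correlation. The hard cutoff is an illustrative choice only.
    """
    logged = np.log2(np.asarray(expr, dtype=float) + 1.0)
    corr = np.corrcoef(logged)  # rows are treated as variables (genes)
    edges = []
    for i in range(len(gene_ids)):
        for j in range(i + 1, len(gene_ids)):
            if abs(corr[i, j]) >= threshold:
                edges.append((gene_ids[i], gene_ids[j], float(corr[i, j])))
    return edges


# Toy example: hypothetical counts for four genes across five samples.
toy_counts = np.array([
    [100, 200, 300, 400, 500],
    [210, 390, 620, 790, 1000],
    [500, 400, 300, 200, 100],
    [50, 300, 20, 400, 90],
])
toy_lengths = [1000, 2000, 1500, 1000]
toy_genes = ["geneA", "geneB", "geneC", "geneD"]

toy_fpkm = fpkm(toy_counts, toy_lengths)
for gene_a, gene_b, r in coexpression_edges(toy_fpkm, toy_genes):
    print(f"{gene_a} -- {gene_b} (r = {r:.3f})")
```

A single global cutoff keeps the sketch short; a production pipeline would typically choose the threshold from the data and scale the all-pairs comparison beyond a nested Python loop.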
Once constructed, each GCN will be partitioned into gene modules (sub-networks) using network community discovery techniques such as MCL (Hwang, Cho et al. 2006) and link community approaches (Ahn, Bagrow et al. 2010; Kalinka and Tomancak 2011), both of which have been extremely informative in our ongoing analysis of Arabidopsis and rice networks, followed by enrichment analysis of GO, Interpro, and KEGG terms (Huang, Sherman et al. 2009; Ficklin, Luo et al. 2010). In this way, we will generate a set of possible co-expressed gene interaction modules with functional annotation. The annotations within a module can be tested for significance using enrichment analysis and simulation techniques for which the Feltus lab has developed relevant code. All modules from public or de novo constructed GCNs will be inserted into the GeneNet Engine database. In addition, any public genetic data in an easily accessible format can be loaded into the database for the association of GWAS and QTL signals with gene interaction modules. Furthermore, any public DNA polymorphism data (e.g. SNP-DB) can be loaded for the association of useful DNA markers with genes of interest. As mentioned above, a key usage example would be for a researcher to identify a candidate gene of interest in the database, find interacting genes, and then select DNA markers near these genes for a molecular breeding experiment.
Objective B: Construct Translational Genomics Software. In this objective, we will create tools to facilitate the translational genomics potential for agricultural crops. For example, we will collaborate with computer engineers to port network alignment algorithms such as IsoRankN (Liao, Lu et al. 2009) to optimized code. We are currently developing global network alignment software, based on homology and topology, in Nvidia's CUDA so that it will run quickly on GPUs and scale with network size (unpublished software).
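To make the idea of aligning networks by homology and topology more concrete, here is a small CPU-only Python sketch in the spirit of IsoRank-style pairwise alignment. A similarity score for each cross-network gene pair is refined iteratively from the scores of neighboring pairs, blended with a homology prior, and then matched greedily. The function names, the `alpha` weight of 0.6, the greedy matching step, and the toy path graphs are assumptions for illustration. This is neither IsoRankN nor the GPU software described in this report.

```python
import numpy as np


def column_normalize(adj):
    """Divide each column of an adjacency matrix by the node's degree."""
    adj = np.asarray(adj, dtype=float)
    degree = adj.sum(axis=0)
    degree[degree == 0] = 1.0  # isolated nodes: avoid division by zero
    return adj / degree


def similarity_scores(adj1, adj2, homology, alpha=0.6, iters=100, tol=1e-9):
    """Iteratively score node pairs (i in network 1, j in network 2).

    Each update mixes topology (scores of neighboring pairs) with a
    normalized homology prior; `alpha` sets the topology weight.
    """
    p1 = column_normalize(adj1)
    p2 = column_normalize(adj2)
    prior = np.asarray(homology, dtype=float)
    prior = prior / prior.sum()
    scores = prior.copy()
    for _ in range(iters):
        updated = alpha * (p1 @ scores @ p2.T) + (1.0 - alpha) * prior
        updated /= updated.sum()
        converged = np.abs(updated - scores).max() < tol
        scores = updated
        if converged:
            break
    return scores


def greedy_alignment(scores):
    """Repeatedly take the best-scoring unmatched pair (one-to-one)."""
    remaining = np.array(scores, dtype=float)
    pairs = []
    for _ in range(min(remaining.shape)):
        i, j = np.unravel_index(np.argmax(remaining), remaining.shape)
        pairs.append((int(i), int(j)))
        remaining[i, :] = -np.inf  # node i of network 1 is now matched
        remaining[:, j] = -np.inf  # node j of network 2 is now matched
    return pairs


# Toy example: two identical three-gene path networks (0 - 1 - 2) and a
# hypothetical homology matrix that mildly favors same-index genes.
path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
homology = np.ones((3, 3)) + 2.0 * np.eye(3)
print(greedy_alignment(similarity_scores(path, path, homology)))
```

The score matrix has one entry per cross-network gene pair, so memory and arithmetic grow with the product of the two network sizes, which is one reason dense matrix updates of this kind are a natural fit for GPUs.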
If funds are available for back-end compute resources, we will create a web-based tool to allow for network alignment using this GPU-optimized code. The GeneNet Engine itself is a tool that allows for the exploration of networks by finding interacting partners with specific gene names, genetic signal of a trait, and enriched biological function. We will explore ways to improve this tool for comparative genomics, such as the creation of mapping tables between syntenic regions of related crop genomes.
Objective C: Outreach to SC Crop Development Community. If travel funds are provided, PI Feltus will provide an on-site, one-day workshop at a target site (e.g. PeeDee REC) on how to use these tools, with examples of their importance. If funds are not provided for travel, a remote workshop will be organized. These workshops will occur at least once per year, probably in the fall after the harvest of summer crops has been completed.
Outputs. The focal output of this project will be the networks deposited into the GeneNet Engine database and any public software tools developed. From a deliverable perspective, each network and tool is a potential publishable unit. Therefore, the number of peer-reviewed publications will be a key metric of progress as the project proceeds. In addition, the utility of these tools will be assessed in a qualitative and quantitative manner. The IP logs of those who access the database and any tools we develop will be captured, and the number of unique South Carolina users counted. In addition, any contacts by South Carolina researchers via workshop, email, and telephone will be recorded to qualitatively assess the impact of this work on South Carolina crop researchers. Furthermore, user email addresses will be recorded and online surveys (e.g. Survey Monkey) will be performed to monitor the success of the project as well as provide feedback for tool improvement.

Progress 10/01/14 to 07/16/16

Outputs
Target Audience: The target audience includes plant genomics researchers supporting plant breeding operations. However, the software tools we are developing are useful in any area engaged in genomics. While focused on this level of researcher, our work is ultimately aimed at identifying breeding targets for crop improvement.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? Seven undergraduates (4 women, 1 African-American) were trained by the PI. Two PhD students were trained by the PI. One student is in computer engineering and one joined the PI's lab in the department of genetics & biochemistry.
How have the results been disseminated to communities of interest? We are in the process of disseminating our findings in the scientific literature and popular press. The first year of this project was heavily focused on software design that will be useful to a broad swath of agricultural researchers. As outlined under Impacts, we have created a public web server located at http://network.genome.clemson.edu, a supercomputer-enabled gene network alignment tool that has been open for usage for over a year.
What do you plan to do during the next reporting period to accomplish the goals? Nothing Reported

Impacts
What was accomplished under these goals?
Accomplishments in Objective A during this reporting period: We have ported our workflow (OSG-GEM) to the Open Science Grid and have created RNAseq-based expression matrices for target organisms. This workflow is in the final stages of publication and is already open source and available to anyone in the world at https://github.com/feltus/OSG-GEM. These matrices are then run on OSG to make networks that are under analysis and will be core data in future systems biology publications.
Accomplishments in Objective B during this reporting period: We have continued to develop software for this objective called the GPU-enabled Global Gene Network Aligner (G3NA). Gene interaction networks help genomics researchers and plant breeders identify complex gene expression relationships relevant to agronomic traits. Network construction is computationally expensive, so we have co-developed with the Melissa Smith lab at Clemson a GPU-based optimized framework to perform multiple global network alignment. The open source software is mature and hosted on GitHub at https://github.com/karansapra/G3NA. The repository is now public and we are awaiting reviews for publication. G3NA is very fast. For example, we aligned rice and maize networks on the Palmetto Supercomputer using NVIDIA K40 GPUs. Maize-rice alignment took 34.2 seconds compared to 21,800 seconds with IsoRankN, a reported speedup of 672.8x. We have extended this software to include a data mining and visualization component that runs on NVIDIA GPUs, called BioDep-Viz. This software was discussed and demoed at the 2015 Supercomputing conference at the Clemson booth and as an invited NVIDIA theater talk, providing high visibility of plant genomics and the Experiment Station to the supercomputing industry. We have submitted BioDep-Viz for publication and will release the source code upon acceptance.
Accomplishments in Objective C during this reporting period: We continue to maintain the GPU-enabled Global Gene Network Aligner (G3NA) Web Server. We have completed the construction of a public gene network alignment web server located at http://network.genome.clemson.edu. The computer resources funded by the experiment station are now available to any researcher in the world.

Publications

Type: Journal Articles Status: Under Review Year Published: 2016 Citation: Accelerating Genome Information Discovery with the Biological Dependency Visualizer (BioDep-Viz) Software by Karan Sapra, Melissa Smith, Frank A. Feltus, and Joshua Levine (submitted to BMC Bioinformatics) TC#6440
Type: Journal Articles Status: Accepted Year Published: 2016 Citation: SOFTWARE: Gene Expression Matrix Construction Using the Open Science Grid by William L. Poehlman, Mats Rynge, Chris Branton, D. Balamurugan, and Frank A. Feltus
Type: Journal Articles Status: Under Review Year Published: 2016 Citation: Fast Gene Interaction Network Alignment with GPUs by Karan Sapra, Melissa Smith, Frank A. Feltus. (submitted to PLOS ONE TC#8381).
Type: Conference Papers and Presentations Status: Published Year Published: 2015 Citation: Amari Lewis, Karan Sapra, Kathleen Kyle, Alex Feltus, Melissa Smith, Jill Gemmill. Visualization and Interaction of Multiple Layers of High Dimensional Biological Data. XSEDE2015
Type: Conference Papers and Presentations Status: Published Year Published: 2015 Citation: Amari Lewis, Karan Sapra, Kathleen Kyle, Alex Feltus, Melissa Smith, Jill Gemmill. Visualization and Interaction of Multiple Layers of High Dimensional Biological Data. MARC U* Star (Maximizing Access to Research Careers for Undergraduate Student Training in Academic Research), Winston-Salem University, November 11-14, 2015.
Type: Conference Papers and Presentations Status: Published Year Published: 2015 Citation: nVidia Theater Talk, Supercomputing 2015, November 16, 2015.
Type: Other Status: Published Year Published: 2015 Citation: Clemson Big Data article: http://phys.org/news/2015-10-clemson-scientists-team-tackle-big.html

Progress 10/01/14 to 09/30/15

Outputs
Target Audience: The target audience includes plant genomics researchers supporting plant breeding operations. However, the software tools we are developing are useful in any area engaged in genomics. While focused on this level of research, our work is ultimately aimed at identifying breeding targets for crop improvement.
Changes/Problems: We attempted to organize a workshop for experiment station breeders, but there was minimal interest from off-site experiment station breeders. The PI has decided to record instructional videos during the 2015-2016 reporting period and share them with breeders to view at their leisure.
What opportunities for training and professional development has the project provided? Three undergraduates (2 were African-American women) were trained by the PI, who was a participant in the NSF "Big Data Visualization REU" (Source: National Science Foundation [1359223]; V Byrd, PI). All three students presented their work at scientific meetings. Two PhD students were trained by the PI. One student is in computer engineering and one joined the PI's lab in the department of genetics & biochemistry.
How have the results been disseminated to communities of interest? We are in the process of disseminating our findings in the scientific literature. The first year of this project was heavily focused on software design that will be useful to a broad swath of agricultural researchers. As outlined under Impacts, we have created a public web server located at http://network.genome.clemson.edu, a supercomputer-enabled gene network alignment tool that has been open for usage for months.
What do you plan to do during the next reporting period to accomplish the goals? Publish two peer-reviewed manuscripts or computer science conference proceedings on G3NA and the prototype visualization tool. Continue to construct and analyze crop gene co-expression networks and upload them to the GeneNet Engine. Construct and disseminate two systems genetics video tutorials on how to interpret and align gene co-expression networks.

Impacts
What was accomplished under these goals?
Accomplishments in Objective A during this reporting period: We have worked very hard developing a high-throughput next generation sequence analysis workflow to generate the gene expression matrices necessary for gene network construction. We have ported our workflow to the Open Science Grid and have created RNAseq-based expression matrices for target organisms. These networks are under analysis and will be core data in future systems biology publications.
Accomplishments in Objective B during this reporting period: We have constructed a core piece of software for this objective called the GPU-enabled Global Gene Network Aligner (G3NA). Gene interaction networks help genomics researchers and plant breeders identify complex gene expression relationships relevant to agronomic traits. Network construction is computationally expensive, so we have co-developed with the Melissa Smith lab at Clemson a GPU-based optimized framework to perform multiple global network alignment. The open source software is mature and held in a private GitHub repository at https://github.com/karansapra/G3NA. The repository will be made public once our work is accepted for publication. G3NA is very fast. For example, we aligned rice and maize networks on the Palmetto Supercomputer using NVIDIA K40 GPUs. Maize-rice alignment took 34.2 seconds compared to 21,800 seconds with IsoRankN, a reported speedup of 672.8x. This work was submitted for peer review as "Large Scale Gene Network Alignment using GPGPUs with G3NA" by Karan Sapra, Asher Sampong, Melissa Smith, Frank A. Feltus (submitted to Bioinformatics). We have extended this software to include a data mining and visualization component that runs on NVIDIA GPUs. This software was discussed and demoed at the 2015 Supercomputing conference at the Clemson booth and as an invited NVIDIA theater talk, providing high visibility of plant genomics and the Experiment Station to the supercomputing industry.
Accomplishments in Objective C during this reporting period: We created a GPU-enabled Global Gene Network Aligner (G3NA) Web Server. We have completed the construction of a public gene network alignment web server located at http://network.genome.clemson.edu. The computer resources funded by the experiment station are now available to any researcher in the world.

Publications

Type: Journal Articles Status: Submitted Year Published: 2015 Citation: "Large Scale Gene Network Alignment using GPGPUs with G3NA" by Karan Sapra, Asher Sampong, Melissa Smith, Frank A. Feltus. (submitted to Bioinformatics)
Type: Websites Status: Published Year Published: 2015 Citation: GPU-enabled Global Gene Network Aligner (G3NA) Web Server: http://network.genome.clemson.edu
Type: Conference Papers and Presentations Status: Published Year Published: 2015 Citation: Kathleen E. Kyle, Karan Sapra, Amari Lewis, Melissa C. Smith, Jill Gemmill, F. Alex Feltus. Detangling genetic hairballs: implementing an abstracted, gene cluster view for the gPICTviz visualization tool. XSEDE2015