Source: UNIVERSITY OF MISSOURI submitted to
THE NEXT GENERATION BOVINE GENOME DATABASE
Sponsoring Institution
National Institute of Food and Agriculture
Project Status
TERMINATED
Funding Source
Reporting Frequency
Annual
Accession No.
0233081
Grant No.
2010-65205-20647
Project No.
MO-CGAS0002
Proposal No.
2013-00627
Multistate No.
(N/A)
Program Code
92120
Project Start Date
Jul 1, 2012
Project End Date
Dec 31, 2014
Grant Year
2013
Project Director
Elsik, C. G.
Recipient Organization
UNIVERSITY OF MISSOURI
(N/A)
COLUMBIA,MO 65211
Performing Department
Animal Sciences
Non Technical Summary
The bovine genome sequence will have a tremendous impact on our understanding of genetic mechanisms underlying important production traits. The genome sequence contains information for all the genes involved in agriculturally and economically important traits, but the genome information must be made accessible to livestock researchers. Most livestock researchers do not have advanced computing skills, and therefore need a simple yet comprehensive web-based resource to access the information in a way that will allow them to use it in experiments in the lab and on the farm. The purpose of this project is to extend the Bovine Genome Database by adding new data and web-based data mining tools that will allow livestock researchers to access and integrate the bovine genome and other genomics data, so they can identify candidate genes and mutations for use in biological experiments. New genomic data that has become available as a result of the bovine genome and haplotype map projects will be included. This will allow the conversion of data to information and ultimately to knowledge that will lead to new technology for enhanced production efficiency and animal well-being.
Animal Health Component
(N/A)
Research Effort Categories
Basic
100%
Applied
(N/A)
Developmental
(N/A)
Classification

Knowledge Area (KA)    Subject of Investigation (SOI)    Field of Science (FOS)    Percent
304                    3399                              1080                      50%
304                    3499                              1080                      50%
Goals / Objectives
The overall objective of this project is to extend the Bovine Genome Database to allow researchers to most efficiently combine what has been learned about functional features of the genome with data to be generated with new genomic resources emerging from the sequencing project. The rationale that underlies the proposed project is that we cannot fully exploit the cattle genome sequence and new genomic resources without integrating the information and making it accessible to livestock researchers. We plan to accomplish the overall objective of this project by pursuing the following specific goals: 1) Create a query and data mining interface designed to enable researchers to identify candidate genes and to select appropriate markers, including tag SNPs, for QTL validation and fine mapping studies, 2) Provide a mechanism to contribute information that will be used to improve the bovine genome assembly, 3) Annotate the latest bovine genome assembly, using RNASeq and other data collected from the community.
Project Methods
To accomplish the first goal, we will incorporate data warehousing capabilities into the Bovine Genome Database (BGD) using the BioMart data management system, which allows advanced querying of many biological data sources through a single query interface. Our implementation of BioMart will allow querying across BGD, Ensembl, UniProt and other external data sources. In addition to providing new simple and intuitive user interfaces for queries, we will port, integrate and update bovine genomic data such as SNPs, haplotype blocks, QTL and genome browser tracks to BGD. The second goal is to provide information that can be used to improve the bovine genome assembly. Our approach includes 1) identifying discrepancies between existing alternate assemblies using data provided by NCBI in order to flag difficult regions, and 2) using assembled transcript sequences to identify misassembled regions. The third objective, to reannotate the newest bovine genome, includes collecting RNASeq data from the community, processing, assembling and aligning the RNASeq data for the newest reference assembly, mapping the current bovine OGS to the newest assembly, performing computational gene prediction, setting up a new Chado database, genome browsers and manual annotation resources for the newest assembly, collecting manual annotation data, creating a new OGS, performing computational functional annotation, populating gene pages with the new data, mapping SNPs, and connecting the BovineQTL database to the newest assembly.

Progress 07/01/12 to 12/31/14

Outputs
Target Audience: The target audience is bovine genomics researchers. Changes/Problems: We made several changes in objectives and approaches throughout the course of the project, and these were described in previous reports. A problem throughout the project has been the existence of competing bovine genome assemblies. Although there is a belief that the community prefers the UMD3.1 assembly, we do have users of the Btau assembly, so we have continued to support it. The existence of two bovine genome assemblies doubled the work in genome browser development, alignment computation, and data parsing. We performed additional work to allow users to navigate across the Btau_4.6.1 and UMD3.1 genome assemblies and to check for incongruent gene annotations. What opportunities for training and professional development has the project provided? Nothing Reported How have the results been disseminated to communities of interest? The results for Objective 1 (BovineMine) and Objective 3 (RNAseq-based annotation) were presented at the Plant and Animal Genome Conference in January 2015 (after the end of the project), and are publicly available at the Bovine Genome Database (http://bovinegenome.org). The JBrowse graphical genome browsers are open to all, but the annotation editing functions require a password to prevent malicious submission of bogus annotations; registration is open to all researchers, with credentials reviewed by our staff. What do you plan to do during the next reporting period to accomplish the goals? Nothing Reported

Impacts
What was accomplished under these goals? The overall goal was to extend the Bovine Genome Database by adding new data and data mining tools to aid livestock researchers in identifying candidate genes and mutations for use in biological experiments. Prior to this project, gene annotation data at the Bovine Genome Database was not well integrated with data from external sources, and was accessible mostly as tracks on graphical genome browsers that were not well suited for current high-throughput sequencing datasets, sequence comparison with BLAST, and large downloadable files. In this project we integrated data from many sources and created new search and download functions to allow researchers to create and download customized datasets that they can combine with their own research data for more powerful analyses. We used new RNA sequencing data to identify novel genes in the bovine genome. We also created new genome browsers that allow users to view large amounts of high-throughput sequencing data mapped to bovine chromosomes. The outcome of this work will allow researchers to more efficiently combine what has been learned about functional features of the bovine genome with data they are generating in their research to make new discoveries that will lead to improved animal health and production. Objective 1. Create a query and data mining interface designed to enable researchers to identify candidate genes and to select appropriate markers, including tag SNPs, for QTL validation and fine mapping studies. Major Activities We created new search tools and new graphical interfaces for the Bovine Genome Database. These can be divided into three main categories: 1) genome browsers, 2) search tools, and 3) a data mining warehouse. The general approach involved performing computations on genome and gene sequences, developing scripts to format data, adapting and configuring open-source code, and developing new code for search tools.
The work performed on genome browsers included 1) setting up JBrowse for bovine assembly Btau_4.6.1, 2) creating links to allow users to navigate from the Btau_4.6.1 assembly to the UMD3.1 assembly, and vice-versa, and 3) enhancing the UMD3.1 JBrowse to provide gene annotation-editing functions with Web Apollo annotation software. These tasks included developing code to reformat data from different sources and computing corresponding features and chromosomal regions across assemblies. We created new search tools using Perl-CGI and these are available in the "Search and Annotation Tools" menu at BovineGenome.org. These include 1) a tool for predicted microRNA:mRNA interactions, 2) a tool to show gene expression levels in different tissues, 3) a tool to identify differences in annotations across assemblies, 4) a tool to identify disagreements in Ensembl and RefSeq annotations, 5) a tool to navigate to locations of candidate novel genes in the genome browser, and 6) a chromosome and scaffold identifier conversion tool. In addition to developing code for the web applications, we performed computational analysis to predict microRNA targets, performed computations to identify incongruent gene models, and developed a MySQL database to host the data. We created a new data warehouse and data mining interface called BovineMine that is publicly available at http://bovinegenome.org/bovinemine. We used InterMine, an open-source package, which is part of the Generic Model Organism Database (GMOD) construction set. We collected and formatted data for genes, proteins, orthologs, homologs, protein interactions, gene ontology, pathways, publications, QTL, SNP from RefSeq, Ensembl, UniProt, Treefam, Homologene, OrthoDB, PubMed, BioGrid, Reactome, dbSNP, the Bovine HapMap, Animal QTLdb and annotation data from the Bovine Genome Database. Data Collected We used data from various sources described above. 
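The computation of incongruent gene models mentioned above can be illustrated with a minimal sketch. It assumes that a disagreement means one gene in one annotation set overlapping more than one gene in the other set on the same chromosome and strand, which suggests a split or merge conflict. The `Gene` record, the function names and the example identifiers are hypothetical and are not taken from the project's code; the pairwise comparison is quadratic and is meant only to show the idea.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    gene_id: str
    chrom: str
    strand: str
    start: int  # 1-based, inclusive
    end: int    # inclusive


def overlaps(a: Gene, b: Gene) -> bool:
    """True when two genes share at least one base on the same chromosome and strand."""
    return (a.chrom == b.chrom and a.strand == b.strand
            and a.start <= b.end and b.start <= a.end)


def find_split_merge_conflicts(set_a, set_b):
    """Map each gene in set_a that overlaps two or more genes in set_b to those genes.

    Simple all-against-all comparison; an interval index would be needed
    for genome-scale annotation sets.
    """
    conflicts = {}
    for a in set_a:
        hits = sorted(b.gene_id for b in set_b if overlaps(a, b))
        if len(hits) > 1:
            conflicts[a.gene_id] = hits
    return conflicts


# Hypothetical example: one Ensembl-style gene spans two RefSeq-style genes.
ensembl = [Gene("ENS_A", "chr1", "+", 100, 900),
           Gene("ENS_B", "chr1", "+", 2000, 2500)]
refseq = [Gene("REF_1", "chr1", "+", 120, 400),
          Gene("REF_2", "chr1", "+", 500, 880),
          Gene("REF_3", "chr1", "+", 2000, 2500)]

conflicts = find_split_merge_conflicts(ensembl, refseq)
print(conflicts)
```

Running the comparison in both directions (Ensembl against RefSeq, then RefSeq against Ensembl) would catch both kinds of disagreement.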
Results We developed a new JBrowse instance (Btau_4.6.1), updated and improved an existing JBrowse instance (UMD3.1), created six new search interfaces, created a new data mining warehouse, and improved linkages across different components of the Bovine Genome Database. The outcome of this work is an integrated bovine genomics database with a flexible search interface that allows researchers to create custom datasets that they can combine with their own data to use in their research analyses. Key Outcomes The key outcome is the potential for a change in knowledge when livestock researchers are able to create integrated bovine genomic data sets that they can use in their research. Objective 2. Provide a mechanism to contribute information that will be used to improve the bovine genome assembly. The objective was removed, as discussed in the December 2013 annual report. Objective 3. Annotate the latest bovine genome assembly, using RNASeq and other data collected from the community. Major Activities We acquired transcriptome data (single-end 100 bp Illumina RNAseq) from tissues of Dominette and relatives (91 libraries) from Lee Alexander (USDA-ARS). We evaluated and optimized procedures and parameters for applying RNAseq data to gene annotation. We created JBrowse tracks for each Illumina library, and compared them to the Ensembl, RefSeq and Gnomon gene prediction sets to determine the level of experimental support for existing gene predictions, and to identify loci for previously uncharacterized genes or exons. We computed transcript assemblies on UMD3.1 scaffolds from the RNAseq datasets using six different combinations of alignment and assembly methods (Tophat/Cufflinks, STAR/Cufflinks, Tophat/StringTie, STAR/StringTie, De novo Trinity/PASA, and Genome Guided Trinity/PASA), because each method had strengths and weaknesses. We developed a pipeline to merge the results into protein and non-protein-coding gene loci.
We evaluated and combined selected assembly sets to create a final transcript set, and used it to predict coding and non-coding genes. Since assembling single-end Illumina RNAseq data resulted in gene fragments, splits and merges, we collaborated with Tim Smith (USDA-ARS) to analyze PacBio (long read) RNAseq data. We developed and optimized a PacBio RNAseq annotation pipeline and applied it to three Dominette tissue datasets (adipose, muscle, lung). Data Collected From data provided by Lee Alexander, we generated 546 transcript assembly sets (91 Illumina datasets multiplied by 6 pipelines) and a combined Illumina-based transcript set. We also created a PacBio-based transcript set based on data provided by Tim Smith. Results We created a non-redundant set of transcripts from the 546 transcript assembly sets (91 Illumina datasets multiplied by 6 pipelines). Grouping transcripts into unique loci resulted in 150,078 gene loci, higher than the expected number of genes for a mammal; many transcripts appeared to be fragments. We developed tests and metrics to eliminate some of the data. We selected sets to combine in order to create larger transcripts and reduce the number of fragments. The optimized combined single-end Illumina based set had 31,938 unique gene loci with 190,878 unique transcripts. We predicted that 167,913 transcripts within 18,556 gene loci were protein coding. The transcripts without protein-coding potential consisted of 17,692 multi-exon and 3,333 single-exon putative long non-coding RNAs (lncRNA). Multi-exon lncRNA were further grouped into 9,855 multi-exon lncRNA genes. Intersection of the protein-coding genes with the Ensembl and RefSeq gene sets indicated 1,025 novel protein coding genes in our gene set. PacBio analysis resulted in 15,073 unique transcripts from adipose, 11,359 from lung, and 8,087 from muscle. These were combined across tissues and grouped into 12,743 unique gene loci.
Comparison with RefSeq showed that the PacBio RNAseq provided 2735 novel gene loci. Key Outcome The key outcome is a change in knowledge through the identification of novel bovine genes.

Publications


    Progress 07/01/14 to 12/31/14

    Outputs
    Target Audience: The target audience is bovine genomics researchers. Changes/Problems: Changes/Problems related to Objective 1 We originally planned to use BioMart open-source software to create data mining tools using federated technology to integrate data, and had started working with BioMart previously. However, we could not overcome problems with compatibility between BioMart and our existing database infrastructure, so we modified our approach and used InterMine instead of BioMart. Rather than using the data federation approach, we acquired data from external sources to load into our InterMine database. InterMine does still allow us to link query results to all external sources. Changes/Problems related to Objective 3 Single-end Illumina RNAseq data was valuable for annotation, but resulted in many gene fragments and split and merged genes when compared to known genes. To overcome these challenges, we analyzed long-read PacBio RNAseq data provided by Tim Smith at USDA-ARS. What opportunities for training and professional development has the project provided? Lab members Colin Diesh and Darren Hagen, who contributed to this project, presented results at the Plant and Animal Genome Conference in January 2015. How have the results been disseminated to communities of interest? The results for Objective 1 (BovineMine) and Objective 3 (RNAseq-based annotation) were presented at the Plant and Animal Genome Conference in January 2015 (after the end of the project), and are available at the Bovine Genome Database (http://bovinegenome.org). The results will also be included in a Bovine Genome Database update article to be submitted to Nucleic Acids Research for the annual database issue, which is published in January. According to NAR guidelines, these manuscripts must be submitted between June and September. What do you plan to do during the next reporting period to accomplish the goals? Nothing Reported

    Impacts
    What was accomplished under these goals? The goal is to extend the Bovine Genome Database by adding new data and data mining tools to aid livestock researchers in identifying candidate genes and mutations for use in biological experiments. During this project period we have continued identifying genes in the bovine genome assembly. We have also created a new data warehouse and data mining tool called BovineMine, which allows researchers to easily integrate bovine genomics data from many sources. Objective 1. Create a query and data mining interface designed to enable researchers to identify candidate genes and to select appropriate markers, including tag SNPs, for QTL validation and fine mapping studies. Major Activities We have created a new data warehouse and data mining interface called BovineMine that is publicly available at http://bovinegenome.org/bovinemine. We used InterMine, an open-source package, which is part of the Generic Model Organism Database (GMOD) construction set. We have collected and formatted data for genes, proteins, orthologs, homologs, protein interactions, gene ontology, pathways, publications, QTL, SNP from RefSeq, Ensembl, UniProt, Treefam, Homologene, OrthoDB, PubMed, BioGrid, Reactome, dbSNP, the Bovine HapMap, Animal QTLdb and annotation data from the Bovine Genome Database. Data Collected We collected data from the sources listed above to integrate into BovineMine. Results The outcome of this work is an integrated bovine genomics database with a flexible search interface that allows researchers to create custom datasets that they can combine with their own data to use in their research analyses. Links are provided to external sources and to BGD, allowing users to navigate from query results to the BGD UMD3.1 JBrowse genome browser. To search BovineMine, researchers can use a keyword search box on the home page to enter either single or lists of identifiers or keywords from external sources or BGD. 
The results page is tabulated and shows a summary of the query. Clicking on a single result in the list will lead to a more detailed page about that result, such as a gene page, publication abstract or ontology term definition. Users can create lists or add query results to a list by clicking on 'Create/Add to List' to perform further analyses. They can also download the search results as tab-delimited, comma-separated values, XML or JSON. If the results are genomic features, they can be downloaded in GFF3 and BED format. In addition to the keyword search box, users can use predefined templates for popular simple queries by clicking on the 'Templates' tab. The Templates page provides users with a list of templates. For example, a template query called "Gene->Pathway" quickly associates a gene ID/symbol with Reactome pathway information. For more complex searches, researchers can use the QueryBuilder tool to create custom queries with custom output formats. The Data Source section of BovineMine provides descriptions of the datasets that are integrated into BovineMine along with their date of download, version or release, citations wherever applicable and any additional comments. The MyMine section of BovineMine serves as a portal for users to manage their lists, queries, templates and account details. Users can save lists and query templates and view the history of recent queries. Researchers who would like to programmatically access BovineMine can use the InterMine Application Programming Interface (API), which is available in Perl, Python, Ruby and Java. Key Outcomes The key outcome is the potential for a change in knowledge when livestock researchers are able to create integrated bovine genomic data sets that they can use in their research. Objective 2. Provide a mechanism to contribute information that will be used to improve the bovine genome assembly.
The objective was removed, as discussed in the December 2013 annual report. Objective 3. Annotate the latest bovine genome assembly, using RNASeq and other data collected from the community. Major Activities We continued the analysis of single-end Illumina RNAseq datasets from Dominette, described in the previous progress report. We finished the last of six transcript assembly pipelines, so by now we have computed transcript assemblies for 91 RNAseq datasets, representing 95 tissues, using six different methods, which include a combination of four reference-based alignment/assembly methods (Tophat/Cufflinks, Tophat/StringTie, STAR/Cufflinks, STAR/StringTie), one de novo assembly method (Trinity) followed by alignment to the genome, and a hybrid RNAseq assembly method (Genome Guided Trinity). We evaluated and combined selected assembly sets to create a final transcript set. Given challenges with gene fragments, splits and merges that resulted from assembling single-end Illumina RNAseq data, we thought long-read RNAseq data generated using PacBio sequencing would be useful. We collaborated with Tim Smith at USDA-ARS to analyze PacBio RNAseq data generated from some of the Dominette tissues. We developed and optimized a PacBio RNAseq annotation pipeline and applied it to three tissue datasets (adipose, muscle, lung). We also applied our bovine RNAseq analysis pipeline in collaboration with a group investigating genomic imprinting in fetal overgrowth syndrome induced by assisted reproduction in cattle, an example of use of our work in bovine genomics research. Data Collected We generated 91 new transcript assembly sets using the sixth pipeline that was not yet completed in the previous reporting year, created a combined Illumina-based annotation set, and created PacBio based gene annotations using data supplied by a collaborator (Tim Smith). 
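The bookkeeping behind the assembly sets described above (one set per RNAseq dataset per pipeline) can be sketched as follows. The pipeline labels come from the report; the dataset names and the identifier format are hypothetical placeholders, since the actual library names are not given here.

```python
from itertools import product

# Pipeline labels as listed in the report.
PIPELINES = [
    "Tophat/Cufflinks",
    "Tophat/StringTie",
    "STAR/Cufflinks",
    "STAR/StringTie",
    "De novo Trinity",
    "Genome Guided Trinity",
]


def assembly_set_ids(dataset_ids, pipelines=PIPELINES):
    """Return one identifier per (dataset, pipeline) combination."""
    return [f"{dataset}|{pipeline}" for dataset, pipeline in product(dataset_ids, pipelines)]


# Hypothetical names standing in for the 91 Illumina RNAseq datasets.
datasets = [f"lib{i:02d}" for i in range(1, 92)]

all_sets = assembly_set_ids(datasets)
first_five = assembly_set_ids(datasets, PIPELINES[:5])

print(len(all_sets))    # 91 datasets x 6 pipelines
print(len(first_five))  # 91 datasets x 5 pipelines
```

The two counts correspond to the 546 assembly sets of the completed analysis and the 455 sets available when only five pipelines had finished.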
Results We created a non-redundant set of transcripts from the 546 transcript assembly sets (91 Illumina datasets multiplied by 6 pipelines). The sum of the number of transcripts in the 546 assembly sets was 14,935,292, and this number was reduced to 10,381,463 unique transcripts. Grouping these into unique loci resulted in 150,078 gene loci, higher than the expected number of genes for a mammal; many transcripts appeared to be fragments. We developed tests and metrics to eliminate some of the data. We evaluated each pipeline set for potential chimeras and perfect-length alignments between Transdecoder peptides and Swissprot proteins to identify those that performed well in gene structure prediction. We selected sets to combine using PASA in order to create larger transcripts and reduce the number of fragments. We ran PASA twice to create two separate gene sets. One PASA run (Run 1) combined assembly sets created with the Tophat/StringTie, De novo Trinity, and Genome Guided Trinity pipelines. The other PASA run (Run 2) combined the Tophat/StringTie assembly set with transcripts from the other sets only if the transcript was found in both a reference-based and a Trinity set. We evaluated the results of the two PASA runs, and selected Run 2 for the final single-end Illumina based set. It had 31,938 unique gene loci with 190,878 unique transcripts. With Transdecoder we predicted that 167,913 transcripts within 18,556 gene loci were protein coding. The transcripts without protein-coding potential consisted of 17,692 multi-exon and 3,333 single-exon putative long non-coding RNAs (lncRNA). Multi-exon lncRNA were further grouped into 9,855 multi-exon lncRNA genes. Intersection of the protein-coding genes with the Ensembl and RefSeq gene sets indicated 1,025 novel protein coding genes in our gene set. PacBio analysis resulted in 15,073 unique transcripts from adipose, 11,359 from lung, and 8,087 from muscle.
These were combined across tissues and grouped into 12,743 unique gene loci. Comparison with RefSeq showed that the PacBio RNAseq provided 2735 novel gene loci. Key Outcome The key outcome is a change in knowledge through the identification of novel bovine genes.
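For the programmatic access to BovineMine mentioned under Objective 1, the sketch below shows how a query could be expressed against an InterMine web service using only the Python standard library, without sending any request. Several things here are assumptions rather than facts from this report: the service root URL, the use of the InterMine core model paths `Gene.primaryIdentifier` and `Gene.symbol`, and the example gene symbol. The InterMine client libraries would normally be used instead of hand-built URLs.

```python
from urllib.parse import urlencode
from xml.etree import ElementTree as ET

# Assumed service root; the report gives only the BovineMine web address.
BOVINEMINE_SERVICE = "http://bovinegenome.org/bovinemine/service"


def build_query_xml(views, constraint_path, value, model="genomic"):
    """Build an InterMine path query as XML with a single equality constraint."""
    query = ET.Element("query", {"model": model, "view": " ".join(views)})
    ET.SubElement(query, "constraint",
                  {"path": constraint_path, "op": "=", "value": value})
    return ET.tostring(query, encoding="unicode")


def build_results_url(query_xml, fmt="tab", service=BOVINEMINE_SERVICE):
    """Return the URL that would request the query results in the given format."""
    return f"{service}/query/results?" + urlencode({"query": query_xml, "format": fmt})


# Hypothetical query: identifier and symbol for a gene with symbol "LEP".
xml = build_query_xml(["Gene.primaryIdentifier", "Gene.symbol"], "Gene.symbol", "LEP")
url = build_results_url(xml)
print(url)
```

The `fmt` argument mirrors the download formats listed above (tab-delimited, CSV, XML, JSON); which values a given BovineMine release accepts should be checked against its own service documentation.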

    Publications

    • Type: Journal Articles Status: Accepted Year Published: 2015 Citation: Chen Z, Hagen DE, Elsik CG, Ji T, Morris CJ, Moon LE, Rivera RM. Characterization of global loss-of-imprinting in fetal overgrowth syndrome induced by assisted reproduction. Proceedings of the National Academy of Sciences USA, In Press
    • Type: Conference Papers and Presentations Status: Other Year Published: 2015 Citation: Diesh CM, Unni DR, Tayal A, Hagen DE, Elsik CG. New Data Mining Interfaces at the Bovine Genome Database. Poster Abstract. Plant and Animal Genome Conference. San Diego, CA, January 10-14, 2015.
    • Type: Conference Papers and Presentations Status: Other Year Published: 2015 Citation: Hagen DE, Unni DR, Diesh CM, Elsik CG. Discovering Novel Protein-Coding Genes and Long Non-Coding RNAs in Bos Taurus. Poster Abstract. Plant and Animal Genome Conference, San Diego, CA, January 10-14, 2015.


    Progress 07/01/13 to 06/30/14

    Outputs
    Target Audience: The target audience is bovine genomics researchers. Changes/Problems: An ongoing problem has been the need to work on two competing bovine genome assemblies. We did not anticipate the need to develop browsers and databases for four assemblies (three Btau and UMD3.1) during the course of this project, and to maintain current browsers for both Btau and UMD. A significant amount of work has been dedicated to this. Although we stated last year that we would focus on the UMD3.1 assembly, we met users at conferences during this project period that want us to also support Btau_4.6.1. What opportunities for training and professional development has the project provided? Dr. Elsik and one staff member (Colin Diesh) presented posters at the Plant and Animal Genome Conference. Dr. Elsik and one staff member (Darren Hagen) presented posters at the Cold Spring Harbor Biology of Genomes Meeting. How have the results been disseminated to communities of interest? We have made the results available on the Bovine Genome Database website (http://BovineGenome.org) and have presented them at the Plant and Animal Genome Conference and the Cold Spring Harbor Biology of Genomes Meeting. What do you plan to do during the next reporting period to accomplish the goals? Objective 1. Since we have recognized gene annotation as a major need before data can be effectively integrated, we will continue to develop interfaces that will aid in annotation and help users navigate browsers. We are also continuing work on the data mining interface, but will likely use InterMine instead of BioMart, because we found that the InterMine API works better with our existing infrastructure. Objective 2. This objective has been removed, as stated in our previous annual report. However, we are continuing to develop interfaces that are needed to navigate across competing assemblies (now as part of Objective 1). Objective 3. 
We plan to focus on genome annotation with RNAseq, because our results have shown that the use of RNAseq to identify novel genes and correct existing gene predictions is more complex than we originally thought. Using a single method will miss many genes, but using too many methods will likely introduce false positives. In addition, some of the predicted gene loci are very complex in terms of individual transcript models, and we need to develop a method to create reliable gene models at loci that may have hundreds of variant transcript models. After completing this work, we will not likely have resources to leverage cross-species comparisons to annotate conserved elements, as mentioned in the previous annual report.

    Impacts
    What was accomplished under these goals? The goal is to extend the Bovine Genome Database by adding new data and data mining tools to aid livestock researchers in identifying candidate genes and mutations for use in biological experiments. Before data can be effectively integrated, there is a need for more complete and correct list of bovine genes and transcripts. During this project period we have made progress in a new bovine gene set, using high-throughput RNA sequence datasets from over 90 bovine tissues. We have identified over 1600 candidate novel protein coding genes. Objective 1. Create a query and data mining interface designed to enable researchers to identify candidate genes and to select appropriate markers, including tag SNPs, for QTL validation and fine mapping studies. Major Activities We set up Jbrowse for Btau_4.6.1. Previously, we had only Gbrowse for this assembly, and it did not allow viewing high throughput datasets. Now we have Jbrowse for both current bovine assemblies (Btau_4.6.1 and UMD3.1). We incorporated links across the Jbrowse browsers using tracks that we developed using UCSC LiftOver. Previously we had links only across GBrowse viewers. We also improved these “Alternate Assembly” tracks by developing a script to filter short alignments that appear in the LiftOver file, but likely represent short repeated regions instead of syntenic chromosome segments across assemblies. We created new search interfaces that focus on annotation and assembly differences. The Annotation Assembly Comparison Tool allows users to lookup locations of genes (Ensembl, RefSeq, OGSv2) on two bovine assemblies (UMD3.1 and Btau_4.6.1). 2) The Ensembl-NCBI Comparison Tool allows users to look up corresponding ids across datasets and investigate disagreements in gene models. 
In some cases, multiple NCBI genes may overlap one Ensembl gene or vice versa, indicating disagreement as to whether a locus should be split into multiple genes or merged into a single gene. Gene loci showing disagreement between Ensembl and NCBI gene predictions are priorities for manual annotation. 3) The Predicted Transcript RNAseq Read Count tool provides raw read counts from spliced alignments of Illumina RNAseq reads to locations of predicted transcripts (Ensembl, RefSeq and Bovine OGSv2) in the UMD3.1 assembly. The read count number helps users determine which of 91 RNAseq data sets to view in Jbrowse/Web Apollo. The spliced read alignments (BAM tracks) are particularly useful in determining whether splice junctions in predicted genes are correct. 4) The Chromosome and Scaffold ID Lookup tool allows users to convert identifiers for the different bovine assemblies to allow users to navigate across genome data sources. They can look up alternative identifiers by entering a chromosome number id, GenBank id, or RefSeq id. 5) The Candidate Novel Protein Coding Gene Search tool allows users to easily navigate to locations of over 1600 candidate novel genes in Jbrowse. Searches may be limited by RNAseq dataset or chromosome. Tissue information allows users to select relevant RNAseq read alignment tracks. Data Collected We used data from NCBI, UCSC and data we acquired during the previous reporting period. Results We developed a new Jbrowse/WebApollo instance (Btau_4.6.1), updated and improved an existing Jbrowse/WebApollo instance (UMD3.1), created five new search interfaces. Key Outcomes The key outcome is the potential for a change in knowledge when livestock researchers use these tools. Objective 2. Provide a mechanism to contribute information that will be used to improve the bovine genome assembly. The objective was removed, as discussed in the previous annual report. Objective 3. 
Annotate the latest bovine genome assembly, using RNASeq and other data collected from the community. Major Activities We have finished evaluating and optimizing procedures and parameters. This includes and investigation of error correction, and a decision not to use it for RNAseq assembly. We redefined cutoffs and stringencies for quality metrics. We also reran repeat masking on Btau_4.6.1 and UMD3.1 with a new (January 2014) RepeatMasker library. Using our redefined parameters, we aligned 91 RNAseq datasets using both Tophat2 and STAR to UMD3.1 scaffolds. Previously, alignments had been done with chromosomes so they could be viewed on our genome browser, but need alignments on scaffolds to use them in gene prediction. We use RNAseq in gene prediction in several ways. 1) Splice junctions in read alignments are used as either weights to support gene models or in training gene predictors. 2) Assembled transcripts are used to correct gene models or add UTR after gene prediction has been run. 3) Assembled transcripts are used to create new gene models. In preparation to use RNAseq-based introns in gene prediction, we investigated the distribution of the number of spliced reads supporting introns in different categories (known introns in protein-coding genes, new introns within protein coding genes, known introns in non-coding genes, new introns in non-coding genes, new intron not within gene boundaries). We performed the analysis with both Tophat and BLAT, since these have both been used in mammalian gene prediction. We computed transcript assemblies on UMD3.1 scaffolds with 91 Dominette (and relatives) RNAseq datasets using 5 different methods (Tophat/Cufflinks, STAR/Cufflinks, Tophat/StringTie, STAR/StringTie, De novo Trinity/PASA) and a sixth method (Genome Guided Trinity/PASA) is in progress. We developed a pipeline to merge the results into protein and non-protein-coding gene loci, and will rerun the pipeline once the sixth assembly method is complete. 
We also redid our estimation of novel genes per tissue (preliminary data were reported last year) using our improved methods. The pipeline for identifying novel gene loci is as follows: For 455 assembly sets with unique identifiers indicating tissue and method for each transcript (91 RNAseq datasets x 5 alignment/assembly methods), we ran TransDecoder to identify open reading frames (ORFs). Transcripts were divided into those with an ORF and those without. For both sets, transcripts with single exons were removed, and intersectBed was used to identify and remove transcripts overlapping Ensembl and RefSeq genes. For transcripts without an ORF (candidate long non-coding RNA), we ran additional methods to eliminate those with coding potential (CPAT, BLAST search vs. NCBI NR). For both protein-coding transcripts and candidate long non-coding RNA, we developed a clustering script to merge transcripts with overlapping coordinates that are likely from the same gene locus. We have improved our A-to-I RNA editing pipeline to use current best practices for calling variants with RNASeq and to improve alignment accuracy by incorporating GATK and STAR. Data Collected We have generated 455 transcript assembly sets (described above). Results Intron Analysis: While the number of previously annotated introns was higher in TopHat2 than in BLAT alignments, BLAT predicted many more novel introns, suggesting that BLAT produced a larger number of false positives. Some of the novel introns identified in both alignments were supported by more than 10,000 reads. Based on this analysis, we decided to use TopHat2 for intron-based gene prediction methods. Novel Genes from Transcript Assembly: Our pipeline resulted in 1605 candidate novel protein-coding gene loci and 17,003 candidate long non-coding RNA loci. We are further filtering candidate long non-coding RNA with CPC. 
An analysis of novel protein-coding gene loci resulting from different alignment/assembly methods showed that results differed across methods, and that a single method may not be the best approach. The number of novel protein-coding genes per method ranged from 544 to 879, with only 127 common to all five methods. Key Outcome The key outcome is a change in knowledge through the identification of novel bovine genes.
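The clustering step and the cross-method comparison described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's clustering script: the function names, the method labels, the 1-based inclusive coordinates, and the choice to merge only transcripts on the same chromosome and strand are all hypothetical.

```python
from collections import defaultdict


def cluster_transcripts(transcripts):
    """Merge transcripts with overlapping coordinates into candidate loci.

    transcripts: iterable of (chrom, strand, start, end, method) tuples with
    1-based inclusive coordinates. Returns a list of locus dicts that record
    the merged span and the set of methods contributing a transcript.
    """
    by_sequence = defaultdict(list)
    for chrom, strand, start, end, method in transcripts:
        by_sequence[(chrom, strand)].append((start, end, method))

    loci = []
    for (chrom, strand), items in sorted(by_sequence.items()):
        items.sort()  # sort by start so one left-to-right sweep finds overlaps
        current = None
        for start, end, method in items:
            if current is not None and start <= current["end"]:
                # Overlaps the open locus: extend it and record the method.
                current["end"] = max(current["end"], end)
                current["methods"].add(method)
            else:
                current = {
                    "chrom": chrom,
                    "strand": strand,
                    "start": start,
                    "end": end,
                    "methods": {method},
                }
                loci.append(current)
    return loci


def loci_supported_by_all(loci, methods):
    """Return the loci that contain at least one transcript from every method."""
    required = set(methods)
    return [locus for locus in loci if required <= locus["methods"]]
```

Counting loci per method, and then the loci returned by `loci_supported_by_all`, gives the kind of per-method and shared totals reported above.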

    Publications

    • Type: Conference Papers and Presentations Status: Other Year Published: 2014 Citation: Hagen DE, Schroeder SG, Alexander LJ, Decker JE, Schnabel RD, Sonstegard TS, Taylor JF, Elsik CG. Reannotation of the Bos taurus genome using RNAseq data from Dominette, the source of the reference genome. Biology of Genomes Meeting, Cold Spring Harbor, NY, May 6-10, 2014.
    • Type: Conference Papers and Presentations Status: Other Year Published: 2014 Citation: Hagen DE, Childers CP, Diesh C, Elsik CG. Bovine Genome Database: Resources for Re-Annotating the Bos taurus Genome. Poster Abstract. Plant and Animal Genome Conference, San Diego, CA, January 11-15, 2014.
    • Type: Conference Papers and Presentations Status: Other Year Published: 2014 Citation: Elsik CG, Munoz-Torres MC, Lee E, Diesh CM, Buels RM, Unni D, Helt GA, Childers CP, Reese JT, Kolicheski AL, Samollow PB, Johnson GS, Holmes IH, Lewis SE. Apollo for ongoing community input to the annotation of reference genomes. Biology of Genomes Meeting, Cold Spring Harbor, NY, May 6-10, 2014.


    Progress 07/01/12 to 06/30/13

    Outputs
    Target Audience: This project targets bovine genomics researchers. Changes/Problems: Due to a delay in work by the research community to generate a new bovine genome assembly, we have decided to remove Aim 2 and to shift more effort to Aim 3. Aim 3 is also being modified, because we will focus on annotating the UMD3.1 assembly, not a future bovine assembly. We will augment Aim 3 by incorporating analysis of non-coding RNAs. We will predict bovine non-coding RNAs using new data that we collect from the community, and will add bovine non-coding RNAs that have been predicted by others. We will also leverage cross-species comparisons to annotate conserved elements that may not be represented in RNASeq data, and will incorporate any "AgEncode" data that may become available during the project period. What opportunities for training and professional development has the project provided? Dr. Elsik attended the Livestock Genomics meeting and the ABWG Animal Cyberinfrastructure Meeting, both in Hinxton, UK. Darren Hagen and Justin Reese, who worked on this project, attended the Plant and Animal Genome Conference. How have the results been disseminated to communities of interest? The results are available to the public through the BovineGenome.org website, and have been presented at the Plant and Animal Genome Conference. What do you plan to do during the next reporting period to accomplish the goals? Aim 1: We will complete deployment of BioMart. After receiving transcriptome data from a large number of tissues, we realized that we must invest effort in applying controlled vocabularies to tissue data types. Whenever possible, we will identify appropriate terms from existing ontologies. These will be integrated into the data mining tools and the faceted track browser for JBrowse. Aim 2: This aim will be modified. (See the Changes section.) 
Aim 3: We will continue to optimize our strategy for computational gene prediction using all available transcriptome data and will continue to collect RNASeq data from the community. We are currently performing digital normalization of transcriptome libraries using Khmer, which was developed by Titus Brown, and Jellyfish, which is part of the Trinity pipeline. We will investigate error correction methods to improve datasets prior to de novo assembly, and will create de novo transcriptome assemblies using Velvet/Oases and Trinity. We will also investigate the generation of an augmented assembly using both de novo and genome-guided assembly. Lee Alexander has provided us with his work using Trinity and PASA, and we will incorporate that into our gene prediction strategy. Once we have identified an optimal strategy, we will create an improved predicted protein-coding gene set. We will also continue to improve the annotation tool interface, particularly by organizing tracks, incorporating controlled vocabularies for RNASeq libraries and creating additional tracks for priority genes. We will publicize the annotation tools and seek input from bovine researchers. We will continue to leverage the RNASeq data to identify RNA editing, and will create browser tracks showing edited sites.

    Impacts
    What was accomplished under these goals? The work this year can be summarized as 1) migrating to a new server infrastructure, 2) updating the Btau genome browser, 3) creating assembly alignment tracks that allow navigation across the Btau_4.6 and UMD3.1 assemblies, 4) creating a new manual annotation system with Web Apollo, 5) identifying priority genes for annotation, 6) organizing and mapping transcriptome data, 7) developing a pipeline to predict RNA editing using transcriptome data, and 8) creating tools to predict microRNA target sites. The lab arrived at the University of Missouri on July 1, 2012 (the beginning of this project period), so some of our effort went into migrating BovineGenome.org to new server infrastructure. Other maintenance included updating our Ruby-on-Rails codebase for Gene Pages. We are currently providing genome browsers for two alternate bovine genome assemblies (Btau_4.6 and UMD3.1). Work this year included updating from Btau_4.2 to Btau_4.6. For each GBrowse genome browser (Btau_4.6 and UMD3.1), we have created tracks that show alignments to the alternate assembly and allow users to navigate across assemblies. Towards Aim 1 (Data Mining): We have created a database and query interface for predicted microRNA:mRNA interactions. We downloaded bovine microRNA sequences from miRBase and used miRanda target prediction software to predict miRNA targets in two datasets, the 3’ UTRs of all RefSeq and Ensembl bovine genes. We have created a web-based interface using Perl-CGI that, after a user enters an identifier for a microRNA or mRNA, provides a list of microRNA:mRNA interactions, local alignment images of the predicted target:miRNA hybrid, and related output data. Other options include UTR dataset (RefSeq, Ensembl, or both), energy cutoff, and score cutoff. Towards Aim 3 (Genome Annotation): We have created new community annotation tools for UMD3.1 based on Web Apollo software. 
These tools include a JBrowse-based genome browser that allows users to view annotations and aligned data. The browser includes gene prediction sets (RefSeq, Ensembl and Bovine OGSv2.0 mapped onto UMD3.1), non-coding RNA, repetitive elements, protein homolog alignments, scaffold regions aligned between UMD3.1 and Btau_4.6, and mapped transcriptome data (shown in 4 ways: read alignments, assembly alignments, XY-plots and heatmaps). To help users organize and select from the 372 data tracks, we have created a faceted track browser. The JBrowse browser becomes an annotation editor when registered users log in using a new user registration system that we developed this year. By registering, users gain access to graphical tools for editing gene annotations. When a user submits an annotation, it is immediately visible on the browser. This is an improvement over the system that the research community used to manually annotate Btau_3.1, which required significant effort by bioinformatics database administrators to process annotation data before it was available to the public. We have created tools to help researchers identify priority genes for annotation. These were genes present in the priority gene list created by the team leaders in the original community annotation project based on Btau_3.1. We identified manually annotated genes that did not map perfectly across assemblies. We have created a webpage that lists the priority genes and provides links to their locations in the JBrowse browser, which includes a separate track for these genes. We have acquired transcriptome data (RNASeq) for tissues of Dominette and relatives (88 libraries) from Lee Alexander. We used FastQC to characterize the overall quality of the libraries and mapped the reads to UMD3.1 using Bowtie2/TopHat. 
For each library, we have created JBrowse tracks using BAM-formatted data for displaying individual reads, BigWig-formatted data for providing coverage information, and GFF3 data for Cufflinks transcript assemblies. These data were compared to the Ensembl, RefSeq and Gnomon gene prediction sets to determine the level of experimental support for existing gene predictions, and to identify loci for previously uncharacterized genes or exons. We have developed a pipeline to detect RNA editing, and we are analyzing RNAseq libraries from brain-derived tissues. The pipeline includes mapping RNASeq reads to the genome assembly, processing alignments to reduce alignment errors, identifying variants and filtering out known SNPs.
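The final filtering steps of such a pipeline can be sketched in Python as below. This is only an illustration under stated assumptions: it assumes A-to-I editing (the type named elsewhere in this report), which appears in RNAseq reads as an A-to-G mismatch for plus-strand transcripts and as T-to-C on the reference for minus-strand transcripts. The function name and the tuple layouts are hypothetical, and the project's actual filters are not described in this report.

```python
def candidate_editing_sites(variants, known_snps):
    """Keep variants consistent with A-to-I editing that are not known SNPs.

    variants: iterable of (chrom, pos, ref, alt, transcript_strand) tuples,
    with ref/alt given on the reference (plus) strand of the assembly.
    known_snps: set of (chrom, pos) positions to exclude.
    """
    # Inosine is read as guanosine, so editing shows up as A>G on plus-strand
    # transcripts and as T>C (the complement) on minus-strand transcripts.
    expected_change = {"+": ("A", "G"), "-": ("T", "C")}

    kept = []
    for chrom, pos, ref, alt, strand in variants:
        if (chrom, pos) in known_snps:
            continue  # genomic polymorphism, not an editing candidate
        if expected_change.get(strand) == (ref.upper(), alt.upper()):
            kept.append((chrom, pos, ref, alt, strand))
    return kept
```

Sites that pass both checks are only candidates; they would still need the alignment-error processing described above before being shown as browser tracks.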

    Publications

    • Type: Conference Papers and Presentations Status: Other Year Published: 2013 Citation: Hagen, D.E., Childers, C.P., Reese, J.T., Elsik C.G. Bovine Genome Database: Working Towards an Improved Bovine Gene Set Using Community-provided RNA-Seq Data. Poster Abstract. Plant and Animal Genome Conference, San Diego, CA, Jan. 11-16, 2013.