Source: UNIVERSITY OF KENTUCKY submitted to NRP
FUNCTIONALLY ANNOTATED EQUINE PANGENOME WITH INFRASTRUCTURE FOR AN ACCESSIBLE, INTEGRATIVE, COMMUNITY GENOMICS RESOURCE
Sponsoring Institution
National Institute of Food and Agriculture
Project Status
ACTIVE
Funding Source
Reporting Frequency
Annual
Accession No.
1032160
Grant No.
2024-67015-42330
Cumulative Award Amt.
$765,000.00
Proposal No.
2023-07847
Multistate No.
(N/A)
Project Start Date
Jul 1, 2024
Project End Date
Jun 30, 2027
Grant Year
2024
Program Code
[A1201]- Animal Health and Production and Animal Products: Animal Breeding, Genetics, and Genomics
Recipient Organization
UNIVERSITY OF KENTUCKY
500 S LIMESTONE 109 KINKEAD HALL
LEXINGTON,KY 40526-0001
Performing Department
(N/A)
Non Technical Summary
Inexpensive and accurate high-throughput sequencing has made it possible to begin to understand the genetic basis of health and performance in animals. In agricultural applications, these data have exposed the limitations of a single reference genome for a species. It is now feasible to create many reference genomes of individual breeds, allowing the impact of breed-specific genetic, epigenetic, and structural differences to be elucidated. A pan-genome allows for the integration of multiple reference genomes with their associated annotation and variant data to facilitate discovery. Our long-term goal is to provide tools and readily accessible data to expedite efforts to connect genome to phenome. In this proposal we will build upon prior community efforts to address our goal by: 1) creating a pangenome from accurate, haplotype-resolved genome assemblies of 13 target breeds and 3 related Equids, 2) annotating the pangenome using tissue-specific transcriptomic and epigenetic data, and 3) developing and making available an integrative genomics portal. As an outcome of this work, in line with FAIR principles, we will have the tools necessary for anyone to quickly and easily access community-generated genomics data without bioinformatics training. The methodology developed for analyses and data access can be applied across species. Our team is well-positioned to do this work, which fits into the blueprint area of "Developing Advanced Genomic Tools, Technologies, and Resources for Agricultural Animals," given our prior experience in haplotype-resolved genome assembly and bioinformatic methodology, a significant amount of data on hand, and the support of a highly collaborative community.
Animal Health Component
50%
Research Effort Categories
Basic
50%
Applied
50%
Developmental
(N/A)
Classification

Knowledge Area (KA) | Subject of Investigation (SOI) | Field of Science (FOS) | Percent
304 | 3810 | 1080 | 100%
Knowledge Area
304 - Animal Genome;

Subject Of Investigation
3810 - Horses, ponies, and mules;

Field Of Science
1080 - Genetics;
Goals / Objectives
Our long-term goal is to facilitate global genome-mapping efforts and provide tools and readily accessible data to expedite efforts to accurately identify the genes and variants underlying phenotypes. We have already curated large datasets of short- and long-read genomes and transcriptomes, and have developed tools to analyze them and to aid in the efficient use of those data by others. In this proposal we build on this work to develop further resources to accelerate understanding of the link between genotype and phenotype by: 1) creating accurate, haplotype-resolved genome assemblies derived from 14 individual horses within 13 target breeds; 2) cataloging genetic diversity in major breed groupings of the domestic horse, with particular focus on structural variants; 3) improving the pan-genome annotation by incorporating data from coding and non-coding tissue-specific transcriptomes, and ChIP-Seq and ATAC-Seq data from the FAANG initiative; and 4) building on our currently available pipelines to provide containerized tools for this large-scale DNA and RNA analysis. Further, we will improve access to the large amounts of sequencing data, without the need for an advanced computational skill set, by creating a genomics data portal that will provide free access to the pan-genome and pan-transcriptome in addition to the allele frequencies of all variants identified in this project.
In objective 1, we generate haplotype-resolved telomere-to-telomere (T2T) reference genome assemblies of 8 diverse equine breeds, which when combined with our preliminary data will provide genome assemblies for 13 breeds representing major breed groupings of genetic diversity in the equids [11], as well as the Przewalski Horse, the Donkey, and the Zebra. Combined with short-read whole genome sequence (WGS) of ~1,983 horses of >70 breeds (see Facilities and Other Resources table 1), we will assemble the first graph-based pangenome for the horse.
In objective 2, RNA-Seq data from >100 tissues and up to 16 individuals will be combined with Iso-Seq data of 33 tissues to create de novo transcriptome assemblies and an equine coding and non-coding expression atlas, and to provide tissue-specific transcript expression data. Combined, these will be used to develop an equine pan-transcriptome. Annotation of the draft pan-genome (objective 1) will be improved through integration of the pan-transcriptome along with a complementary gene structure annotation effort at EBI (see letter of support from Dr. Fergal Martin) to produce data readily amenable for use in comparative genomics efforts across agricultural species, and beyond. In objective 3, we create an integrated equine genomics data portal for the sharing of these data as well as the containerized pipelines used to create them.
Project Methods
Objective 1. Development of a pan-genome for the domestic horse.
Trio creation and fetal collection. In addition to the six trios already collected, a Standardbred and a Lipizzaner mare that will be used in this objective have been identified at UC Davis (Dr. Dini letter of support). The other 2 mares (Morgan and Tennessee Walking Horse) will be obtained either through donation (Dr. Loux, last year's letter of support used with permission) or purchase. Semen from stallions (Belgian, Haflinger, Icelandic, and Shetland) will be obtained through purchase or donation for insemination of the 4 mares to produce breed-breed genomes. Approximately 90 days post fertilization, the mares will be treated with exogenous prostaglandin to induce abortion [86] and the fetus will be obtained. Tissue samples from the fetus (e.g., heart, liver, lung, brain, skin, skeletal muscle, spleen, gonad) will be collected, immediately flash frozen, and retained in a tissue bank at the University of Kentucky for "adopt a tissue" use by the community.
Trio sequencing. For each fetus, we will extract DNA from the liver to generate PacBio HiFi, ONT ultralong read, and HiC data, which will be assembled as described above. The HiC data will be generated for each fetus by PDs Davis and Kalbfleisch with independent funding as part of their work on genome structure. Similarly, these data will be assembled using the Verkko [85] pipeline consistent with best practices. The resulting 14 haplotype-resolved horse assemblies will be submitted to NCBI and ENSEMBL, and ENSEMBL will provide gene structure annotation.
Objective 1c. Graphical representation of multiple genome assemblies to create a draft pan-genome.
A pan-genome model is a data structure used to represent multiple genomic sequences from a population or species. Genome graphs represent whole-genome relationships: sequence information is represented by nodes, and edges describe how these sequences are ordered in each haplotype.
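The node-and-edge description above can be made concrete with a small sketch. This is a toy model in Python, not the data structure used by minigraph-Cactus or vg; the class name GenomeGraph, its methods, the haplotype names, and the example sequences are all hypothetical and chosen only for illustration. Nodes hold sequence, and each haplotype is stored as an ordered walk through the nodes, with the edges derived from those walks.

```python
from dataclasses import dataclass, field


@dataclass
class GenomeGraph:
    """Toy sequence graph: nodes hold sequence, paths record haplotype walks."""

    nodes: dict[int, str] = field(default_factory=dict)
    edges: set[tuple[int, int]] = field(default_factory=set)
    paths: dict[str, list[int]] = field(default_factory=dict)

    def add_node(self, node_id: int, sequence: str) -> None:
        self.nodes[node_id] = sequence

    def add_path(self, name: str, walk: list[int]) -> None:
        """Register a haplotype as an ordered walk and add the edges it uses."""
        for node_id in walk:
            if node_id not in self.nodes:
                raise KeyError(f"unknown node {node_id}")
        # Consecutive nodes in the walk become directed edges.
        self.edges.update(zip(walk, walk[1:]))
        self.paths[name] = list(walk)

    def sequence(self, name: str) -> str:
        """Spell out a haplotype by concatenating its node sequences in order."""
        return "".join(self.nodes[n] for n in self.paths[name])


graph = GenomeGraph()
graph.add_node(1, "ACGT")
graph.add_node(2, "G")      # one allele of a SNP
graph.add_node(3, "T")      # the other allele of the same SNP
graph.add_node(4, "TTAGC")
graph.add_node(5, "CCCCC")  # insertion carried by only one haplotype
graph.add_node(6, "GA")

graph.add_path("hap_A", [1, 2, 4, 6])
graph.add_path("hap_B", [1, 3, 4, 5, 6])

print(graph.sequence("hap_A"))  # ACGTGTTAGCGA
print(graph.sequence("hap_B"))  # ACGTTTTAGCCCCCCGA
```

In this toy example the two haplotypes share nodes 1, 4, and 6, while the SNP and the insertion appear as alternative routes through the graph.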
Here we use the genomes assembled in objectives 1a and 1b to create a genome graph using the minigraph-Cactus pipeline [88,89] described by the Human Pangenome Reference Consortium.
Objective 1e. Capture and catalog genetic variation in the horse using the pan-genome resource.
Population distribution of novel sequences. Using the short-read sequencing catalog, we will evaluate the frequency of novel sequences identified through the de novo assembly in objective 1d for each breed. Phylogenetic and genetic structure analyses will then be performed to evaluate how the different breeds cluster.
Variant detection. WGS sequences from the 1,983 horses will be mapped to the pan-genome created in objective 1c using Giraffe [97]. We will use a combination of vg [98] and Giraffe [97] to identify and genotype variants in this population.
Novel structural variant detection. vg [98] will be used to evaluate where complex structural variants map and to identify which are present in the current reference and which are novel. A major advantage of this software is that it allows discovery of multiple variants within the same region, including SNPs within insertions [80]. Reads can then be mapped to the graphical representation to look for variants. vg limits the false discovery rate of graph-based programs by using text indexing strategies for graphs to allocate nodes based on the origin of the sequence [98,99]. During alignment, this reduces the false positive rate by preventing the 'linking' of variants that can occur with graphical representation alone. vg has been shown to outperform BWA-MEM [19] for reads containing non-reference variants, and to perform at least as well on reads that are highly similar to the reference genome [98]. We will combine vg with GraphTyper [100] to give variant genotypes for each individual.
Objective 1f. Containerized pipeline development.
Combined with workflow management systems (e.g. Snakemake [17]), user-friendly, scalable containerized processing and analysis pipelines may be deployed on anything from personal workstations to High-Performance Compute (HPC) clusters, yielding reproducible results, and are thus critical tools in the analysis of large-scale genomics data. Importantly, containerized workflows make it possible to codify standard workflows and will encourage their use, resulting in datasets that may be created independently but are readily integrated because they were run through exactly the same workflow.
Objective 2. Creation of a tissue-specific pan-transcriptome and improved annotation of the pan-genome.
In this objective, we capitalize on data generated from 4 Thoroughbred horses from the FAANG initiative and 12 Quarter Horses to construct tissue- and breed-specific mRNA, lncRNA and miRNA transcriptomes and combine this information with the pan-genome to create a pan-transcriptome. We also use these data to annotate the pan-genome using the Comparative Annotation Toolkit [101]. Finally, we incorporate ChIP-Seq from four histone marks (H3K4me1, H3K4me3, H3K27ac, and H3K27me3), ATAC-seq and CTCF annotations into the pan-genome graph.
Objective 2a. Integration of RNA-Seq, Iso-Seq, and CAGEseq data to develop comprehensive tissue transcriptomes.
We will generate a containerized pipeline to process Iso-Seq data as described [40], enabling tissue-specific transcriptome comparisons between Thoroughbred and Quarter Horse annotations. The containerized RNA-Seq pipeline (above) may then be utilized to predict coding potential (i.e. distinguish mRNA from lncRNA) and quantitate transcript expression against the improved Iso-Seq based annotation. Additionally, the improved annotation will bolster miRNA target prediction from the small RNA sequencing data generated previously.
Objective 3. Creation of an integrative genomics portal for the horse.
Leveraging URLs to Aggregate and Make Data Available.
The integrative genomics portal will aggregate resources such as containerized pipelines together with raw and derived datasets, for use by scientists who want usable data without having to map and analyze them themselves. We will use a web server located at EquineGenomics.uky.edu. From there, resources such as an IGV .genome json file for the equine genome can be loaded into IGV, and URLs representing each indexed and mapped dataset, whether a bam, vcf, or bed file, can be copied, or dragged and dropped, into IGV, the UCSC genome browser, or a terminal window for command line use.
The current system (see preliminary data) is composed of static html pages that list the URLs for the data, which are stored on a virtual server housed at the University of Kentucky. For this project we have acquired a large, fast server capable of storing 50 TB of data that can be expanded with new storage. With support from this proposal, we will create a PostgreSQL relational database that will manage the links and their metadata. The web interface will remain simple. It will query the relational database to retrieve URLs specified by data type (WGS, RNA-Seq, etc.), with additional query specificity added in the future to include breed, submitting scientist, and other fields as they are requested or become otherwise important. For the foreseeable future, there will be no reason to implement security that controls or otherwise prohibits access to these data, as they will have been made publicly available by the submitting scientist. We will use the Spring framework (https://spring.io) to create our Java-based web application. It will have REST APIs that serve as the middle layer, which may be queried programmatically and will be exposed for use by the bioinformatics community.
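As a rough illustration of the kind of lookup the portal's database layer might perform, the sketch below stores dataset URLs with their metadata and retrieves them by data type, with breed as an optional filter. It is an assumption-laden sketch, not the project's design: it uses Python's built-in sqlite3 module in place of PostgreSQL so that it runs without a database server, the table name, column names, and the urls_for helper are hypothetical, and the example.org URLs are placeholders rather than real portal links.

```python
import sqlite3

# In-memory SQLite database standing in for the proposed PostgreSQL instance.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE dataset (
        id        INTEGER PRIMARY KEY,
        url       TEXT NOT NULL UNIQUE,
        data_type TEXT NOT NULL,  -- e.g. WGS, RNA-Seq
        file_type TEXT NOT NULL,  -- e.g. bam, vcf, bed
        breed     TEXT            -- optional metadata
    )
    """
)

# Placeholder records; none of these URLs exist.
rows = [
    ("https://example.org/data/horse_001.bam", "WGS", "bam", "Thoroughbred"),
    ("https://example.org/data/horse_002.vcf.gz", "WGS", "vcf", "Quarter Horse"),
    ("https://example.org/data/liver_rnaseq.bam", "RNA-Seq", "bam", "Thoroughbred"),
]
conn.executemany(
    "INSERT INTO dataset (url, data_type, file_type, breed) VALUES (?, ?, ?, ?)",
    rows,
)


def urls_for(connection, data_type, breed=None):
    """Return dataset URLs for a data type, optionally restricted to one breed."""
    query = "SELECT url FROM dataset WHERE data_type = ?"
    params = [data_type]
    if breed is not None:
        query += " AND breed = ?"
        params.append(breed)
    query += " ORDER BY url"
    # Parameterized queries keep user-supplied filters out of the SQL text.
    return [row[0] for row in connection.execute(query, params)]


print(urls_for(conn, "WGS"))
print(urls_for(conn, "WGS", breed="Thoroughbred"))
```

A REST endpoint in the middle layer could wrap a query of this shape and return the URL list as JSON, so that the same lookup serves both the web page and programmatic clients.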