Source: UNIV OF MINNESOTA submitted to NRP
TOOLS TO LINK PHENOTYPE TO GENOTYPE IN THE HORSE
Sponsoring Institution
National Institute of Food and Agriculture
Project Status
COMPLETE
Funding Source
Reporting Frequency
Annual
Accession No.
1011941
Grant No.
2017-67015-26296
Cumulative Award Amt.
$500,000.00
Proposal No.
2016-10133
Multistate No.
(N/A)
Project Start Date
Mar 15, 2017
Project End Date
Mar 14, 2022
Grant Year
2017
Program Code
[A1201]- Animal Health and Production and Animal Products: Animal Breeding, Genetics, and Genomics
Recipient Organization
UNIV OF MINNESOTA
(N/A)
ST PAUL,MN 55108
Performing Department
Veterinary Population Medicine
Non Technical Summary
Thousands of genomic regions having strong statistical associations with various economically important traits in agricultural species have been identified. However, in most cases the causal genes and alleles remain unknown due to inadequate power in genome-wide association studies (GWAS) and difficulties in prioritizing candidate genes. Failure to link an animal's genotype to its phenotype is highly relevant in the horse, where the primary goals of genomic studies are to identify the molecular basis for desirable traits, obtain mechanistic insight into physiology/pathophysiology, and pinpoint new/alternative approaches to disease prevention and therapeutic strategies.

Our long-term goals are to facilitate genome-mapping efforts in the horse and provide tools to expedite the identification of the genes and alleles underlying multiple phenotypes of interest. These tools will facilitate GWAS, assist the prioritization of candidate genes, and accelerate the identification of functional alleles through a comprehensive catalog of genetic variants. Although some aspects of this proposal are equine specific, these computational tools can easily be extended to other species. Collectively our work will increase the power of all gene mapping studies, facilitate gene, haplotype, pathway and network-based GWAS, improve genome annotation, provide functional evidence to aid in prioritization of candidate genes, and expedite the identification of the underlying functional alleles.
Animal Health Component
50%
Research Effort Categories
Basic
50%
Applied
50%
Developmental
(N/A)
Classification

Knowledge Area (KA) | Subject of Investigation (SOI) | Field of Science (FOS) | Percent
304 | 3810 | 1080 | 100%
Knowledge Area
304 - Animal Genome;

Subject Of Investigation
3810 - Horses, ponies, and mules;

Field Of Science
1080 - Genetics;
Goals / Objectives
Our long-term goals are to facilitate genome-mapping efforts and provide tools to expedite the accurate identification of the genes and alleles underlying phenotypes in the domestic horse. Towards this end, we have recently developed 2,000,000 (2M) and 670,000 (670K) SNP equine genotyping arrays. In this proposal we build upon this effort to:

Develop tools that further facilitate GWAS in the horse by:
1) enabling complementary GWAS approaches including gene, haplotype, and pathway analyses through SNP-to-gene mapping and construction of haplotype maps;
2) developing an imputation resource; and
3) constructing tissue co-expression networks for integrated network-based association analysis.

We will assist prioritization of candidate genes for genomic studies in the horse by:
4) refining the physical and functional annotation of protein coding mRNAs, long non-coding RNAs (lncRNA), and microRNAs (miRNA) using RNA-seq, tissue-specific gene expression, and gene co-expression networks.

Finally, we will accelerate the identification of functional alleles by:
5) developing a comprehensive catalog of genetic variants.
Project Methods
In objective 1, we use whole genome sequence (WGS) data from 414 horses to create a database of genetic variants, quantify the genetic load across 11 target breeds, and demonstrate the ability of this resource to identify disease alleles for 7 performance-/life-limiting genetic diseases.

In objective 2, we use WGS from 414 horses and 2M SNP array data from 347 horses to develop a fine scale haplotype map and define breed-specific haplotype boundaries to perform robust SNP-to-gene mapping. We then optimize genotype imputation from lower-density (54K/65K/670K) arrays to high-density (~2M SNPs and WGS variants) and develop a user-friendly software library to implement imputation.

In objective 3, RNA-seq data from 12 tissues in 12 horses will be used to create an equine mRNA, lncRNA, and miRNA expression atlas; improve genome annotation; provide tissue-specific gene expression data; and produce de novo transcriptome assemblies. Functional annotation of these genes will be achieved via gene co-expression networks.

In objective 4, the pipelines and databases developed in objectives 1-3 will be joined to build a multi-staged data integration model to prioritize candidate genes applicable to any phenotype of interest, and resources and tools will be deployed on Cyverse servers.

Progress 03/15/20 to 03/14/21

Outputs
Target Audience:
The target audience is primarily animal genomics researchers. Collectively our work will increase the power of gene mapping studies, facilitate gene, haplotype, pathway and network-based GWAS, improve genome annotation, provide functional evidence to aid in prioritization of candidate genes, and expedite the identification of functional alleles in the horse. The Camoco framework developed in Objective 4 will be of use to genomics researchers across species.

Changes/Problems:
Nothing Reported

What opportunities for training and professional development has the project provided?
• Development of computational skills. Objective 1 was developed into the PhD thesis of a boarded large animal internist, Dr. Sian Durward-Akhurst. During this project, through classes, mentoring, and working with other students, Dr. Durward-Akhurst has developed the necessary computational skills to produce and analyze the 30TB of data that make up this project.
• Presentation skills. Dr. Durward-Akhurst has presented these data at several conferences and during our College of Veterinary Medicine graduate student seminar series.
• Collaboration. To gain additional cases for these 11 diseases and additional diseases to investigate in the future, Dr. Durward-Akhurst has worked with horse owners and veterinarians.
• Grant writing skills. Dr. Durward-Akhurst assisted with and was mentored in writing additional grants to support the extra cases included in this analysis, as well as 3 grants as principal investigator and an additional 5 grants as co-investigator to follow up the variants identified in horses with AFIB in 600 Standardbred and 600 Thoroughbred racehorses.
• Mentoring. Dr. Durward-Akhurst co-mentored a DVM summer scholar for 2 summers on part of the disease-causing variant identification work for equine myotonia. Dr. Durward-Akhurst also co-mentored 2 DVM summer scholar students in 2020 on: 1) the work exploring the AFIB variants; and 2) the estimation of the false positive rate of the genetic burden variants. Both students have elected to continue in our lab. One will continue working on the AFIB project and the other is developing the preliminary data required to start developing a genetic variation catalog to improve conservation efforts for endangered raptors.
• Additional training. Dr. Durward-Akhurst has received a Morris Animal Foundation fellowship to support her continuing investigation of the AFIB variants. The success of this work has led to 2 clinician-scientist position offers at well-respected veterinary schools and a job interview at a 3rd well-respected veterinary school.
• PhD student Mr. Jonah Cullen attended the Rocky Mountain Genomics HackCon (RMGHC) at the University of Colorado Boulder (July 2019).
• Mr. Cullen worked within a small team on a webtool for improving functional annotation via co-expression networks.
• Attendance at the RMGHC provided Mr. Cullen the opportunity to meet and network with both peers and senior investigators from a diverse range of biological and computational backgrounds.
• Mr. Cullen has led the expansion of the biweekly hacky hour to include researchers throughout the veterinary medicine college.
• Mr. Cullen will attend the University of Washington's 7th Summer Institute in Statistics for Big Data (SISBID), which will provide the opportunity to gain additional skills in big data visualization and machine learning, as well as to meet and network (virtually) with peers.

How have the results been disseminated to communities of interest?
Presentations:
Sian Durward-Akhurst:
Invited presentations
• Durward-Akhurst SA. The genetics of cardiac arrhythmias. University of Minnesota Equine Center Research symposium, MN, USA. January 2021.
• Durward-Akhurst SA. Cardiology for the horse: a day-to-day approach. 1st Congresso Internacional de Residencia em Medicina Veterinaria, Universidade Federal de Minas Gerais, Brazil. September 2020.
Scientific presentations
• Durward-Akhurst SA, JR Mickelson, C Stauthammer, ME McCue. Genetic bases of cardiac arrhythmias in the horse. Equine Cardiology Retreat, PA, USA. June 2020.
• Durward-Akhurst SA, Schaefer RJ, Mickelson JR, McCue ME. The frequency of loss of function variants in the equine population. Equine and NRSP8 workshops, Plant and Animal Genome Conference, CA, USA. January 2020.
Abstract presentations
• Durward-Akhurst SA, Schaefer RJ, Grantham B, Carey K, Mickelson JR, McCue ME. The frequency of phenotype associated variants in the equine population. Dorothy Havemeyer Equine Genetics Research Retreat, NY, USA. February 2021.
Poster presentations
• Durward-Akhurst SA, Schaefer RJ, Mickelson JR, McCue ME. Loss of function variants in the equine population. Plant and Animal Genome Conference, CA, USA. January 2020.
Jonah Cullen:
• Cullen JN, Schaefer RJ, Durward-Akhurst SA, Mickelson JR, McCue ME. Assessing the impact of sequencing platform on transcriptome assembly, differential expression, and variant discovery in the horse (poster), Plant and Animal Genome Conference, San Diego CA (2020).
Rob Schaefer:
• Schaefer RJ, Beeson SK, Durward-Akhurst SA, Grantham B, Carey K, Mickelson JR, McCue ME. Whole Genome Imputation in the horse. Plant and Animal Genome Conference, CA, USA. January 2020.
Summer Scholar students:
• Springer K, Durward-Akhurst SA, Mickelson JR, McCue ME. Investigating loss of function variants in the general equine population. Student Chapter of the American Veterinary Medicine Association Congress. March 2021.
• Adam E, Durward-Akhurst SA, Mickelson JR, McCue ME. Investigating and validation of the genetic bases of cardiac arrhythmias including atrial fibrillation in Standardbreds. Boehringer Ingelheim-NIH National Veterinary Scholars Symposium. August 2020.
• Springer K, Durward-Akhurst SA, Mickelson JR, McCue ME. Investigating loss of function variants in the general equine population. Boehringer Ingelheim-NIH National Veterinary Scholars Symposium. August 2020.
• Mahoney KC, Tate NM, Wanner NM, Durward-Akhurst SA, McCue ME, Mickelson JR, Friedenberg SG, Furrow E. The use of computational (in silico) tools to predict pathogenicity of missense variants in the horse. Boehringer Ingelheim-NIH National Veterinary Scholars Symposium. August 2020.
High school students:
• Nadkarni I, Durward-Akhurst SA, McCue ME. Computational validation of putative arrhythmia causing variants. Minnesota High School Research Symposium. February 2021.

What do you plan to do during the next reporting period to accomplish the goals?
Objective 1.
• Presentations. We will continue to present our data on the variant catalog and highlight the utility of precision medicine approaches for veterinary species.
• Genetic testing guidelines - we will write up guidelines and recommendations for producing reliable genetic tests that horse owners and veterinarians can use and interpret without concerns about the variants being false positives.
• Structural variant analysis - the structural variant results will be analyzed to establish the best way to determine the overlap between the identification tools, and written up as the first large scale study of structural variants from whole genome sequencing in the horse.
• Disease-causing variant analysis - the final prioritization and ranking of the variants will be performed and then grants submitted to follow up on these variants and determine the true disease-causing variants. The atrial fibrillation follow-up is already funded and we expect to produce a publication in the next 18 months. We have successfully recruited an excellent PhD student who will start working on the idiopathic renal hematuria variants. This will likely make up her PhD thesis.
Objective 2.
• Sustain development of the computational tools and publish pipelines and imputation results.
Objective 3.
• Pipeline construction and comparisons between callers
• Finalize and containerize complete indexing and mapping pipelines
• Prepare RNA for sequencing from 200+ tissues
• RNA sequencing on remaining tissues
• Generate tissue-specific networks as part of the equine expression atlas
• Develop methods to build cross-tissue networks
• Present preliminary results at general and equine-specific conferences

Impacts
What was accomplished under these goals?
Objective 1.
WGS were collected from 534 horses of 44 breeds, including ≥15 horses from each of 10 target breeds (Arabian, Belgian, Clydesdale, Icelandic horse, Morgan, Quarter Horse [QH], Shetland pony, Standardbred, Thoroughbred, Welsh Pony) that represent the major groupings of genetic diversity in domestic horses. The WGS were mapped to the equine reference genome.

SNPs and small structural variants
Single nucleotide polymorphisms (SNPs) and small structural variants were identified using a modified version of the GATK best practices pipeline. Bcftools and GATK HaplotypeCaller were used to identify the variants, and the intersect was used for downstream analysis. ANNOVAR and SnpEff were used to predict the functional effect of the variants, and then all high impact variants (high by both variant effect predictors, or high by one and moderate by the other) were summed to give the genetic burden in the population and within the target breeds. The number of variants identified was associated with the depth of coverage, and therefore estimated marginal means (EMMEANs) accounting for depth of coverage were used for the analysis. The total genetic burden corrected for the estimated false positive rate is 5,807 variants, with each horse carrying on average 846 genetic burden variants (range: 213 - 1,193). Our analysis of the 10 target breeds showed significant differences in the genetic burden between breeds (p <0.001), with the highest average genetic burden in Icelandic horses (755 variants/horse) and lowest in Thoroughbreds (585 variants/horse). Reported locations of causative and associated variants for equine phenotypes were extracted from the Online Mendelian Inheritance in Animals catalogue (https://omia.org/home/). There were 34 reported disease-causing, 68 disease-associated, 50 non-disease-causing, and 4 non-disease-associated variants. We identified between 34% and 91% of these variants in our cohort.
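The genetic burden rule described above (high by both variant effect predictors, or high by one and moderate by the other) can be illustrated with a minimal Python sketch. The impact labels, function names, and example calls below are hypothetical and do not reproduce the exact output strings of ANNOVAR or SnpEff or the project's own code.

```python
from collections import Counter

# Illustrative impact ranks; real predictor output would first be mapped onto these labels.
RANK = {"low": 0, "moderate": 1, "high": 2}


def is_burden_variant(impact_a, impact_b):
    """True if high by both predictors, or high by one and moderate by the other."""
    lower, upper = sorted((RANK[impact_a], RANK[impact_b]))
    return upper == 2 and lower >= 1


def burden_per_horse(calls):
    """Count burden variants per horse.

    calls: iterable of (horse_id, impact_from_predictor_a, impact_from_predictor_b),
    one entry for each variant a horse carries.
    """
    counts = Counter()
    for horse, impact_a, impact_b in calls:
        if is_burden_variant(impact_a, impact_b):
            counts[horse] += 1
    return counts


# Hypothetical example data.
calls = [
    ("horse1", "high", "high"),
    ("horse1", "high", "moderate"),
    ("horse1", "moderate", "moderate"),
    ("horse2", "low", "high"),
    ("horse2", "moderate", "high"),
]
burden = burden_per_horse(calls)
print(dict(burden))
```

This sketch counts raw variants only; the depth-of-coverage adjustment (estimated marginal means) and the false positive correction reported above are not modeled here.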
Large structural variants
Structural variants were identified using BreakDancer (insertions, deletions, duplications, inversions, and intra-chromosomal translocations [SVs] and inter-chromosomal translocations [CTXs]), cn.MOPS (copy number variants [CNVs]), DELLY (SVs and CTXs), and GenomeSTRiP (SVs and CNVs) with each tool's base settings. Python was used to extract the exact intersect (based on matching break points at the 5' and 3' ends of the variant), the union, and the 50% overlap (CNVs and SVs) of the variant callers for each structural variant type. Between 288,802 and 14,842,349 SVs, 1,120,501 and 10,019,966 CTXs, and 135,988 and 370,220 CNVs were identified. Analysis of the overlap between structural variant callers is ongoing, and a publication is being prepared with expected submission by the end of this year.

We have identified and performed WGS (12-20x coverage) on 28 horses representing 11 highly detrimental genetic diseases in the horse that are analogous to human Mendelian diseases. Variants were then further prioritized based on: 1) their frequency in the catalog of equine genetic variation that we developed (cutoffs of minor allele frequency ≤5% and ≤q, where q is the expected Mendelian disease variant frequency under Hardy-Weinberg equilibrium); 2) their presence in genes reported to be intolerant to damaging variants; 3) being computationally predicted to have a high impact on phenotype; and 4a) presence in all disease cases (AFIB, IRH, myotonia); or 4b) following a recessive, de novo, or dominant inheritance pattern in the offspring compared to the parent (alopecia areata, microphthalmia).

Objective 2.
See 2020 report for more details.

Objective 3.
As part of objective 3, we have developed two publicly available containerized workflows: 1) IndexForTheFuture (https://github.com/jonahcullen/IndexForTheFuture), and 2) RNAMapping (https://github.com/UMN-EGGL/RNAMapping). IndexForTheFuture was designed to ensure the stable and reproducible generation of reference genome indices directly from the NCBI and Ensembl servers to be used for mapping RNA-seq data. We are currently improving this pipeline to run entirely within our publicly available Docker container (https://hub.docker.com/repository/docker/jonahcullen/ec3index). This pipeline generates both STAR and Salmon genome indices for any specified NCBI or Ensembl annotation release. The RNAMapping pipeline is capable of processing RNA-seq data regardless of platform or sequencing strategy from raw FASTQs through the generation of a multi-sample, non-redundant transcriptome (i.e. improving the physical annotation) and transcript/gene-level quantification. Moreover, the output from this workflow may be used to improve functional annotation via generation of tissue-specific co-expression networks. Seven additional horses were sacrificed and tissues collected in fall of 2020 after COVID delays. Based on the previously RNA-sequenced tissue samples and available frozen tissue sets, we designed a sound strategy to maximize the number and diversity of tissues to sequence. RNA is currently being isolated, and mRNA and small RNA libraries will be generated and sequenced in summer 2021.
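The three caller comparisons described under Objective 1 for large structural variants (exact intersect on matching break points, union, and 50% overlap) can be sketched in Python as follows. This is an illustration only, not the project's code: the tuple layout (chromosome, start, end), the function names, and the reading of "50% overlap" as reciprocal overlap are all assumptions.

```python
def reciprocal_overlap(a, b, min_frac=0.5):
    """True if two calls on the same chromosome each share >= min_frac of their length."""
    if a[0] != b[0]:
        return False
    shared = min(a[2], b[2]) - max(a[1], b[1])
    if shared <= 0:
        return False
    return shared / (a[2] - a[1]) >= min_frac and shared / (b[2] - b[1]) >= min_frac


def compare_callers(calls_a, calls_b, min_frac=0.5):
    """Return (exact intersect, union, calls from A overlapping any call from B)."""
    set_a, set_b = set(calls_a), set(calls_b)
    exact = set_a & set_b          # identical chromosome and both break points
    union = set_a | set_b          # every distinct call from either caller
    overlapping = {
        a for a in set_a
        if any(reciprocal_overlap(a, b, min_frac) for b in set_b)
    }
    return exact, union, overlapping


# Hypothetical calls from two callers: (chromosome, start, end).
caller_a = [("chr1", 100, 200), ("chr1", 500, 900), ("chr2", 50, 80)]
caller_b = [("chr1", 100, 200), ("chr1", 600, 1000), ("chr2", 300, 400)]

exact, union, overlapping = compare_callers(caller_a, caller_b)
print(len(exact), len(union), len(overlapping))
```

The pairwise scan is quadratic in the number of calls; with millions of calls per tool, as reported above, an interval index or a sorted sweep would be needed in practice.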

Publications

  • Type: Journal Articles Status: Published Year Published: 2020 Citation: Hisey EA, Hermans H, Lounsberry Z, Avila F, Knickelbein K, Durward-Akhurst SA, McCue ME, Kalbfleisch T, Lassaline ME, Back W, Bellone RR. (2020). Whole genome sequencing identified a 16 kilobase deletion on ECA13 associated with distichiasis in Friesian horses. BMC Genomics. 21, 848
  • Type: Journal Articles Status: Submitted Year Published: 2021 Citation: Durward-Akhurst SA, Schaefer RJ, Grantham B, Carey WK, Mickelson JR, McCue ME. Genetic variation and the distribution of variant types in the horse. Submitted to Genome Research April 2021
  • Type: Journal Articles Status: Submitted Year Published: 2021 Citation: Durward-Akhurst SA, Schaefer RJ, Springer K, Grantham B, Carey WK, Mickelson JR, McCue ME. The genetic burden and frequency of disease-associated variants in the equine population. Submitted to Nature Genetics April 2021.


Progress 03/15/19 to 03/14/20

Outputs
Target Audience:
The target audience is primarily animal genomics researchers. Collectively our work will increase the power of gene mapping studies, facilitate gene, haplotype, pathway and network-based GWAS, improve genome annotation, provide functional evidence to aid in prioritization of candidate genes, and expedite the identification of functional alleles in the horse. The Camoco framework developed in Objective 4 will be of use to genomics researchers across species.

Changes/Problems:
Nothing Reported

What opportunities for training and professional development has the project provided?
• Development of computational skills. Objective 1 was developed into the PhD thesis of a boarded large animal internist, Dr. Sian Durward-Akhurst. During this project, through classes, mentoring, and working with other students, Dr. Durward-Akhurst has developed the necessary computational skills to produce and analyze the 30TB of data that make up this project.
• Presentation skills. Dr. Durward-Akhurst has presented these data at several conferences and during our College of Veterinary Medicine graduate student seminar series.
• Collaboration. To gain additional cases for these 10 diseases and additional diseases to investigate in the future, Dr. Durward-Akhurst has worked with horse owners and veterinarians.
• Grant writing skills. Dr. Durward-Akhurst assisted with and was mentored in writing additional grants to support the extra cases included in this analysis, as well as 3 grants as principal investigator to follow up the variants identified in horses with AFIB in 600 Standardbred racehorses.
• Mentoring. Dr. Durward-Akhurst co-mentored a DVM summer scholar for 2 summers on part of the disease-causing variant identification work for equine myotonia. Dr. Durward-Akhurst is also co-mentoring 3 new students on submitting summer scholar proposals to work on the follow-up to the AFIB disease-causing variants.
• Postdoctoral student Dr. Robert Schaefer led a hack-a-thon team at the Rocky Mountain Genome Hack-a-thon where we implemented additional features in COB. Results from these tools were presented at several meetings and conferences (below).
• PhD student Mr. Jonah Cullen attended the Rocky Mountain Genomics HackCon (RMGHC) at the University of Colorado Boulder (July 2019).
• Mr. Cullen worked within a small team on a webtool for improving functional annotation via co-expression networks.
• Attendance at the RMGHC provided Mr. Cullen the opportunity to meet and network with both peers and senior investigators from a diverse range of biological and computational backgrounds.

How have the results been disseminated to communities of interest?
Publications:
All publications in preparation are listed in the text above.
Presentations:
Rob Schaefer:
• (Invited Talk) UCD School of Agriculture and Food Science, Dublin Ireland. Integrating GWAS SNPs with CoExpression Networks to Prioritize Candidate Genes in Complex Traits
• (hack-a-thon) Rocky Mountain Hack-a-thon. Boulder, CO. Team Lead: COB.
• (poster) Maize Genetics Conference, St. Louis, MO. The transcriptional landscape of diverse maize genotypes
• (poster) Plant and Animal Genomics Conference, San Diego, CA. Cloud Scalable Computational Tools for the Horse Genome
• (talk/demo) Havemeyer Horse Genetics Workshop, Pavia, Italy. Processing tens of millions of genotypes with HapDab and analyzing tissue specific gene co-expression networks with Camoco.
• (hack-a-thon) Equine Gene Annotation Hack-Con. Host. St. Paul, MN.
• (hack-a-thon) Rocky Mountain Hack-a-thon. Boulder, CO.
• (hack-a-thon) Mozilla Global Code Sprint. Host. St. Paul, MN.
• (Software demo) Plant and Animal Genome Conference, San Diego, CA. Identifying High Priority Candidate Genes from GWAS using Co-Expression Networks.
• (Invited Talk) Bio5 Institute and Cyverse. Tucson AZ. Using Co-expression networks to unravel gene function in agricultural species.
Sian Durward-Akhurst:
• Durward-Akhurst SA. Genetic variation and the frequency of deleterious mutations (genetic burden) in healthy horses. University of Minnesota Supercomputing Institute seminar, Minneapolis, MN. October 2018.
Scientific presentations
• Durward-Akhurst SA. Using the genome to improve disease diagnosis. ACVIM Forum, Phoenix, Arizona, USA. June 2019.
• Durward-Akhurst SA. Identification of disease-causing variants for pituitary dwarfism in Quarter Horses. Genofling, University of Minnesota Genomics Center, Minneapolis, USA. April 2019. Finalist Genopitch.
• Durward-Akhurst SA, Schaefer RJ, Mickelson JR, McCue ME. Loss of function variants in the equine population. Equine and NRSP8 workshops, Plant and Animal Genome Conference, CA, USA. January 2020.
• Durward-Akhurst SA, Schaefer RJ, Mickelson JR, McCue ME. Tools for Precision Medicine in the Horse. AVMA/AVMF Young Investigator Award competition, Worcester State University, MA, USA. July 2019. Finalist.
• Durward-Akhurst SA, Schaefer RJ, Mickelson JR, McCue ME. Genetic variation and the frequency of deleterious mutations (genetic burden) in healthy horses. Lightning talk: Plant and Animal Genome Conference, San Diego, USA. January 2019.
• Durward-Akhurst SA, Schaefer RJ, Mickelson JR, McCue ME. Loss of function variants in the equine population. Plant and Animal Genome Conference, CA, USA. January 2020.
• Durward-Akhurst SA, Schaefer RJ, Mickelson JR, McCue ME. Loss of function variants in the equine population. Calgary International Equine Symposium: Innovation and Discovery, Calgary, AB, Canada. September 2019.
• Durward-Akhurst SA, Schaefer RJ, Mickelson JR, McCue ME. Genetic variation and the frequency of deleterious mutations (genetic burden) in healthy horses. ACVIM Forum, Phoenix, Arizona, USA. June 2019.
• Durward-Akhurst SA, Schaefer RJ, Mickelson JR, McCue ME. Genetic variation and the frequency of deleterious mutations (genetic burden) in healthy horses. Plant and Animal Genome Conference, San Diego, USA. January 2019.
• Durward-Akhurst SA, Schaefer RJ, Mickelson JR, McCue ME. Genetic variation and the frequency of deleterious mutations (genetic burden) in healthy horses. Calgary International Equine Symposium: Innovation and Discovery, Calgary, AB, Canada. September 2018. 1st place Graduate Student Poster award.
• Durward-Akhurst SA, Schaefer RJ, Mickelson JR, McCue ME. Single nucleotide polymorphism in the equine population. Minnesota Supercomputing Institute research day, University of Minnesota, Minneapolis, MN, USA. April 2018. 2nd place Biological and Medical Sciences category.
Jonah Cullen:
• Cullen JN, Schaefer RJ, Beeson S, Mickelson JR, McCue ME. RNA-seq driven gene annotation in the horse (poster), Plant and Animal Genome Conference, San Diego CA (2018).
• Cullen JN, Schaefer RJ, Mickelson JR, McCue ME. RNA-seq workflow for improved physical and functional annotation of the equine genome (poster), Calgary International Equine Symposium, Calgary Alberta (2019).
• Cullen JN, Schaefer RJ, Durward-Akhurst SA, Mickelson JR, McCue ME. Assessing the impact of sequencing platform on transcriptome assembly, differential expression, and variant discovery in the horse (poster), Plant and Animal Genome Conference, San Diego CA (2020).
Summer Scholar students:
• Ellingson L, Durward-Akhurst SA, Schaefer RJ, Mickelson JR, McCue ME. Identification of genetic variants associated with myotonia in the horse. Morris Animal Foundation Scientific Advisory Board Meeting, Denver, Colorado, USA. June 2018. 2nd place Exemplary Summer Scholar award.
• Ellingson L, Durward-Akhurst SA, Schaefer RJ, Mickelson JR, McCue ME. Identification of genetic variants associated with myotonia in the horse. Boehringer Ingelheim-NIH National Veterinary Scholars Symposium, National Institutes of Health MSU, USA. August 2017.

What do you plan to do during the next reporting period to accomplish the goals?
Objective 1.
• Structural variant analysis - the structural variant results will be analyzed and written up as the first large scale study of structural variants from whole genome sequencing in the horse.
• Disease-causing variant analysis - the final prioritization and ranking of the variants will be performed and then grants submitted to follow up on these variants and determine the true disease-causing variants.
• Publications on the: 1) genetic variation in the equine population; 2) frequency of previously reported disease-causing variants in the equine population; and 3) improvement in number of variants identified with high allele frequencies between EquCab2 and EquCab3 will be written and submitted.
Objective 2.
• Sustain development of the computational tools and publish pipelines and imputation results.
Objective 3.
• In Q1 2020, we plan to sample the 20 tissue types from 6 horses and send to the sequencing core for library preparation and sequencing. RNA-seq data will be processed, assembled, and analyzed in Q2 and Q3, followed by the initial manuscript submission in Q4.
• Website development to begin in Q4 and finish in 2021.

Impacts
What was accomplished under these goals? Objective 1. WGS were collected from 534 horses of 49 breeds, including ≥15 horses from each of 10 target breeds (Arabian, Belgian, Clydesdale, Icelandic horse, Morgan, Quarter Horse [QH], Shetland pony, Standardbred, Thoroughbred, Welsh Pony) that represent the major groupings of genetic diversity in domestic horses. The WGS were mapped to the equine reference genome, and single nucleotide polymorphisms (SNPs) and small structural variants identified using a modified version of the GATK best practices pipeline. Bcftools and GATK-haplotype caller were used to identify the variants and the intersect used for downstream analysis. ANNOVAR and SnpEff were used to predict the functional effect of the variants and then all high impact variants (high by both variant effect predictors, or high by one and moderate by the other) were summed to give the genetic burden in the population and within the target breeds. 29,882,273 variants were identified; these included 28,273,058 SNPs and 1,609,215 small structural variants. The total genetic burden is 8,683 variants with each horse carrying on average 2,409 genetic burden variants (range 538 - 3,674). Our analysis of the 10 target breeds showed significant differences in the genetic burden between breeds (p <0.001), with the highest average genetic burden in Icelandic horses (2,906 variants/horse) and lowest in Morgans (2,037 variants/horse). This provides the first large-scale catalog of genetic variation and estimation of the genetic burden in healthy horses. Larger structural variants have been called with Breakdancer, cn.MOPS, Delly, and GenomeSTRiP and analysis is ongoing. Publications on the: 1) genetic variation in the equine population; 2) frequency of previously reported disease-causing variants in the equine population; and 3) improvement in number of variants identified with high allele frequencies between EquCab2 and EquCab3 are in progress. 
We have identified and performed WGS (12-20x coverage) on 24 horses representing 10 highly detrimental genetic diseases in the horse that are analogous to human Mendelian diseases. These diseases include: 5 skeletal muscle diseases (dystrophic and non-dystrophic myotonia, eosinophilic myositis, malignant hyperthermia [MH] without the RYR1 C7360G allele, and hyperkalemic periodic paralysis [HYPP] without the SCN4A F1416L allele); alopecia areata, which causes severe hair loss and must be managed with corticosteroids that can lead to laminitis; hemochromatosis, an iron storage disease that leads to liver failure; complete bilateral absence of the vas deferens, causing stallion infertility; paroxysmal atrial fibrillation (AFIB), the most common pathologic arrhythmia in horses, which may contribute to sudden cardiac death; idiopathic renal hematuria (IRH), a life-threatening disease of Arabs; and microphthalmia, a performance-limiting disease of Sport Horses. WGS has been mapped to the reference genome and alleles identified using the pipeline developed in objective 1. Variants have been prioritized based on the reported disease prevalence (AFIB) or a presumed disease prevalence of 1%, given that the diseases are considered rare in the general horse population and assuming that they follow recessive inheritance patterns. Variants with an allele frequency in the catalog of genetic variation greater than expected based on disease prevalence were excluded. Variants were then further prioritized based on: 1) their presence in genes reported to be intolerant to damaging variants; 2) being computationally predicted to have a high impact on phenotype; and 3a) presence in all disease cases (AFIB, IRH, myotonia); or 3b) following a recessive, de novo, or dominant inheritance pattern in the offspring compared to the parent (alopecia areata, microphthalmia). Final ranking using the American College of Medical Genetics guidelines for disease-causing variant prioritization is ongoing.
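The allele-frequency filter described above rests on simple Hardy-Weinberg arithmetic: for a recessive disease with prevalence P, affected animals are homozygous, so P = q^2 and the expected disease-allele frequency is q = sqrt(P). A presumed prevalence of 1% therefore gives q = 0.1. The Python sketch below illustrates this under the stated recessive assumption; the function names, variant names, and catalog frequencies are hypothetical.

```python
import math


def expected_allele_frequency(prevalence):
    """Recessive disease under Hardy-Weinberg equilibrium: prevalence = q**2."""
    return math.sqrt(prevalence)


def filter_by_frequency(catalog_freqs, prevalence):
    """Keep variants whose catalog allele frequency does not exceed the expected q."""
    q = expected_allele_frequency(prevalence)
    return [name for name, freq in catalog_freqs.items() if freq <= q]


# Hypothetical catalog allele frequencies for three candidate variants.
catalog_freqs = {"varA": 0.02, "varB": 0.25, "varC": 0.08}

q = expected_allele_frequency(0.01)          # presumed prevalence of 1%
kept = filter_by_frequency(catalog_freqs, 0.01)
print(round(q, 3), kept)
```

A dominant or de novo model would imply a different (lower) frequency ceiling, so the threshold here should be read as specific to the recessive case.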
The long-term goal is to validate the likely variants by sequencing them in a larger population of individuals to ensure segregation of the allele with the disease phenotype, and to develop genetic tests for easy disease diagnosis and identification of carriers to assist breeders with decreasing the frequency of clinical disease in their herd. Objective 2. A comprehensive and reproducible computational pipeline was designed to facilitate the creation of a genotype imputation and haplotype reference panel in the horse. Segments of the pipeline were defined using the workflow management system 'Snakemake', and the computational environment was containerized using both Docker and Singularity. Containers are publicly available on both DockerHub and the Singularity Container Library. Source code to build the pipeline is available at our public laboratory git repository (https://github.com/UMN-EGGL). Whole genome sequences from 549 horses were mapped to EquCab3, and both GATK Haplotype Caller and bcftools 'mpileup' were used to produce joint genotyping calls across every base pair in the genome. SNPs from both callers were assessed for baseline quality control, and variant quality scores were recalibrated using 6 features shared between the GATK and bcftools variant callers. Gaussian mixture models were fit for both callers using training data derived from SNP sets defined from the recent MNEc2M SNP genotyping array. Using a true positive probability threshold of 99.5%, recalibrated scores were used to combine SNPs from GATK and bcftools into a combined set of 20.8M SNPs representing a high-confidence, whole-genome SNP reference panel for the horse (WGSNP). Variants were phased in 50 kb windows using Beagle 5.1. Imputation accuracies from MNEc2M to WGSNP were estimated using leave-one-out cross-validation, resulting in between 85% and 99.8% accuracy under naïve imputation scenarios in over 11 horse breeds. Objective 3. 
As part of objective 3, we have developed two publicly available containerized workflows: 1) IndexForTheFuture (https://github.com/jonahcullen/IndexForTheFuture), and 2) RNAMapping (https://github.com/UMN-EGGL/RNAMapping). IndexForTheFuture was designed to ensure the stable and reproducible generation of reference genome indices directly from the NCBI and ENSEMBL servers to be used for mapping RNA-seq data. The RNAMapping workflow is capable of processing RNA-seq data regardless of platform or sequencing strategy, from raw FASTQs through the generation of a multi-sample, non-redundant transcriptome (i.e. improving the physical annotation) and transcript/gene-level quantification. Moreover, the output from this workflow may be used to improve functional annotation via generation of tissue-specific co-expression networks. Through containerization, use of these workflows removes the variability associated with both lab-specific computational infrastructure (e.g. differences in computing platforms from equine research groups at various universities) and genome release. Additionally, we have begun analyzing the differences in the physical annotation (via transcriptomes), differential expression, and functional annotation (via co-expression networks) observed between samples sequenced on different platforms. Expected results will serve to better inform how to consider platform-specific variance and merge data from multiple sources in the continued development of the equine tissue expression atlas. In addition to data resources, several major contributions were made to general-purpose bioinformatics computational tools. Major features were implemented for storing biological data using Minus80, an open-source bioinformatics Python library. Minus80 has had 14 minor releases and 1 major release throughout the duration of this grant.
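Returning to the leave-one-out imputation estimates reported under Objective 2, the following sketch shows one way such an accuracy figure could be computed. It assumes accuracy means genotype concordance at sites hidden from the imputation step, which the report does not specify. The `mode_impute` function is a toy stand-in for a real imputation tool such as Beagle, and all names and data are hypothetical.

```python
from collections import Counter


def concordance(truth, imputed):
    """Fraction of compared sites where the imputed genotype matches the truth."""
    if not truth:
        raise ValueError("no sites to compare")
    matches = sum(1 for t, i in zip(truth, imputed) if t == i)
    return matches / len(truth)


def mode_impute(reference, masked):
    """Toy imputer: fill each masked (None) site with the most common
    genotype at that site among the reference horses."""
    filled = []
    for site, genotype in enumerate(masked):
        if genotype is None:
            site_counts = Counter(ref[site] for ref in reference)
            genotype = site_counts.most_common(1)[0][0]
        filled.append(genotype)
    return filled


def leave_one_out(panel, array_sites, impute=mode_impute):
    """For each horse, hide all non-array sites, impute them from the
    remaining horses, and score concordance at the hidden sites."""
    scores = {}
    for horse, truth in panel.items():
        reference = [g for other, g in panel.items() if other != horse]
        masked = [g if s in array_sites else None for s, g in enumerate(truth)]
        imputed = impute(reference, masked)
        hidden = [s for s in range(len(truth)) if s not in array_sites]
        scores[horse] = concordance(
            [truth[s] for s in hidden], [imputed[s] for s in hidden]
        )
    return scores


# Toy panel: genotypes coded 0/1/2 at four sites; sites 0 and 1 are "on the array".
example_panel = {
    "a": [0, 1, 2, 0],
    "b": [0, 1, 2, 0],
    "c": [0, 1, 2, 0],
    "d": [0, 1, 0, 2],
}
example_scores = leave_one_out(example_panel, array_sites={0, 1})
```

In this toy panel, horses sharing the common genotypes are imputed perfectly while the divergent horse is not, which mirrors why accuracy can vary across breeds depending on how well they are represented in the reference panel.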

Publications


    Progress 03/15/18 to 03/14/19

    Outputs
    Target Audience: The target audience is primarily animal genomics researchers. Collectively our work will increase the power of gene mapping studies, facilitate gene, haplotype, pathway and network-based GWAS, improve genome annotation, provide functional evidence to aid in prioritization of candidate genes, and expedite the identification of functional alleles in the horse. The Camoco framework developed in Objective 4 will be of use to genomics researchers across species. Changes/Problems: Nothing Reported What opportunities for training and professional development has the project provided? Development of computational skills. Objective 1 was developed into the PhD thesis of a boarded large animal internist, Dr. Sian Durward-Akhurst. During this project, through classes, mentoring, and working with other students, Dr. Durward-Akhurst has developed the computational skills needed to produce and analyze the 30TB of data that make up this project. Presentation skills. Dr. Durward-Akhurst has presented these data at several conferences and during our College of Veterinary Medicine graduate student seminar series. Collaboration. To gain additional cases for these 10 diseases and additional diseases to investigate in the future, Dr. Durward-Akhurst has worked with horse owners and veterinarians. Grant writing skills. Dr. Durward-Akhurst assisted with and was mentored in the writing of additional grants to support the extra cases included in this analysis, as well as 3 grants as principal investigator to follow up the variants identified in horses with AFIB in 600 Standardbred racehorses. Mentoring. Dr. Durward-Akhurst co-mentored a DVM summer scholar for 2 summers on part of the disease-causing variant identification work for equine myotonia. Dr. Durward-Akhurst is also co-mentoring 3 new students on submitting summer scholar proposals to work on the follow-up to the AFIB disease-causing variants. Postdoctoral student Dr. 
Robert Schaefer led a hack-a-thon team at the Rocky Mountain Genome Hack-a-thon where we implemented additional features in COB. Results from these tools were presented at several meetings and conferences (below). PhD student Mr. Jonah Cullen attended the Rocky Mountain Genomics HackCon at the University of Colorado Boulder (July 2019). Mr. Cullen worked within a small team on a webtool for improving functional annotation via co-expression networks. Attendance at the RMGHC gave Mr. Cullen the opportunity to meet and network with both peers and senior investigators from a diverse range of biological and computational backgrounds. How have the results been disseminated to communities of interest? Publications: All publications in preparation are listed in the text above. Presentations: Rob Schaefer: (Invited Talk) UCD School of Agriculture and Food Science, Dublin, Ireland. Integrating GWAS SNPs with CoExpression Networks to Prioritize Candidate Genes in Complex Traits. (hack-a-thon) Rocky Mountain Hack-a-thon. Boulder, CO. Team Lead: COB. (poster) Maize Genetics Conference, St. Louis, MO. The transcriptional landscape of diverse maize genotypes. (poster) Plant and Animal Genome Conference, San Diego, CA. Cloud Scalable Computational Tools for the Horse Genome. (talk/demo) Havemeyer Horse Genetics Workshop, Pavia, Italy. Processing tens of millions of genotypes with HapDab and analyzing tissue specific gene co-expression networks with Camoco. (hack-a-thon) Equine Gene Annotation Hack-Con. Host. St. Paul, MN. (hack-a-thon) Rocky Mountain Hack-a-thon. Boulder, CO. (hack-a-thon) Mozilla Global Code Sprint. Host. St. Paul, MN. (Software demo) Plant and Animal Genome Conference, San Diego, CA. Identifying High Priority Candidate Genes from GWAS using Co-Expression Networks. (Invited Talk) Bio5 Institute and Cyverse. Tucson, AZ. Using Co-expression networks to unravel gene function in agricultural species. Sian Durward-Akhurst: Durward-Akhurst SA. 
Genetic variation and the frequency of deleterious mutations (genetic burden) in healthy horses. University of Minnesota Supercomputing Institute seminar, Minneapolis, MN. October 2018. Scientific presentations Durward-Akhurst SA. Using the genome to improve disease diagnosis. ACVIM Forum, Phoenix, Arizona, USA. June 2019. Durward-Akhurst SA. Identification of disease-causing variants for pituitary dwarfism in Quarter Horses. Genofling, University of Minnesota Genomics Center, Minneapolis, USA. April 2019. Finalist Genopitch. Durward-Akhurst SA, Schaefer RJ, Mickelson JR, McCue ME. Loss of function variants in the equine population. Equine and NRSP8 workshops, Plant and Animal Genome Conference, CA, USA. January 2020. Durward-Akhurst SA, Schaefer RJ, Mickelson JR, McCue ME. Tools for Precision Medicine in the Horse. AVMA/AVMF Young Investigator Award competition, Worcester State University, MA, USA. July 2019. Finalist. Durward-Akhurst SA, Schaefer RJ, Mickelson JR, McCue ME. Genetic variation and the frequency of deleterious mutations (genetic burden) in healthy horses. Lightning talk: Plant and Animal Genome Conference, San Diego, USA. January 2019. Durward-Akhurst SA, Schaefer RJ, Mickelson JR, McCue ME. Loss of function variants in the equine population. Plant and Animal Genome Conference, CA, USA. January 2020. Durward-Akhurst SA, Schaefer RJ, Mickelson JR, McCue ME. Loss of function variants in the equine population. Calgary International Equine Symposium: Innovation and Discovery, Calgary, AB, Canada. September 2019. Durward-Akhurst SA, Schaefer RJ, Mickelson JR, McCue ME. Genetic variation and the frequency of deleterious mutations (genetic burden) in healthy horses. ACVIM Forum, Phoenix, Arizona, USA. June 2019. Durward-Akhurst SA, Schaefer RJ, Mickelson JR, McCue ME. Genetic variation and the frequency of deleterious mutations (genetic burden) in healthy horses. Plant and Animal Genome Conference, San Diego, USA. January 2019. 
Durward-Akhurst SA, Schaefer RJ, Mickelson JR, McCue ME. Genetic variation and the frequency of deleterious mutations (genetic burden) in healthy horses. Calgary International Equine Symposium: Innovation and Discovery, Calgary, AB, Canada. September 2018. 1st place Graduate Student Poster award. Durward-Akhurst SA, Schaefer RJ, Mickelson JR, McCue ME. Single nucleotide polymorphism in the equine population. Minnesota Supercomputing Institute research day, University of Minnesota, Minneapolis, MN, USA. April 2018. 2nd place Biological and Medical Sciences category. Jonah Cullen: Cullen JN, Schaefer RJ, Beeson S, Mickelson JR, McCue ME. RNA-seq driven gene annotation in the horse (poster), Plant and Animal Genome Conference, San Diego CA (2018). Cullen JN, Schaefer RJ, Mickelson JR, McCue ME. RNA-seq workflow for improved physical and functional annotation of the equine genome (poster), Calgary International Equine Symposium, Calgary Alberta (2019). Cullen JN, Schaefer RJ, Durward-Akhurst SA, Mickelson JR, McCue ME. Assessing the impact of sequencing platform on transcriptome assembly, differential expression, and variant discovery in the horse (poster), Plant and Animal Genome Conference, San Diego CA (2020). Summer Scholar students: Ellingson L, Durward-Akhurst SA, Schaefer RJ, Mickelson JR, McCue ME. Identification of genetic variants associated with myotonia in the horse. Morris Animal Foundation Scientific Advisory Board Meeting, Denver, Colorado, USA. June 2018. 2nd place Exemplary Summer Scholar award. Ellingson L, Durward-Akhurst SA, Schaefer RJ, Mickelson JR, McCue ME. Identification of genetic variants associated with myotonia in the horse. Boehringer Ingelheim-NIH National Veterinary Scholars Symposium, National Institutes of Health MSU, USA. August 2017. What do you plan to do during the next reporting period to accomplish the goals? Objective 1. 
Structural variant analysis - the structural variant results will be analyzed and written up as the first large-scale study of structural variants from whole genome sequencing in the horse. Disease-causing variant analysis - the final prioritization and ranking of the variants will be performed, and grants will then be submitted to follow up on these variants and determine the true disease-causing variants. Publications on the: 1) genetic variation in the equine population; 2) frequency of previously reported disease-causing variants in the equine population; and 3) improvement in number of variants identified with high allele frequencies between EquCab2 and EquCab3 will be written and submitted. Objective 2. Sustain development of the computational tools and publish pipelines and imputation results. Objective 3. In Q1 2020, we plan to sample the 20 tissue types from 6 horses and send them to the sequencing core for library preparation and sequencing. RNA-seq data will be processed, assembled, and analyzed in Q2 and Q3, followed by the initial manuscript submission in Q4. Website development will begin in Q4 and finish in 2021.

    Impacts
    What was accomplished under these goals? Objective 1. WGS were collected from 534 horses of 49 breeds, including ≥15 horses from each of 10 target breeds (Arabian, Belgian, Clydesdale, Icelandic horse, Morgan, Quarter Horse [QH], Shetland pony, Standardbred, Thoroughbred, Welsh Pony) that represent the major groupings of genetic diversity in domestic horses. The WGS were mapped to the equine reference genome, and single nucleotide polymorphisms (SNPs) and small structural variants identified using a modified version of the GATK best practices pipeline. Bcftools and GATK-haplotype caller were used to identify the variants and the intersect used for downstream analysis. ANNOVAR and SnpEff were used to predict the functional effect of the variants and then all high impact variants (high by both variant effect predictors, or high by one and moderate by the other) were summed to give the genetic burden in the population and within the target breeds. 29,882,273 variants were identified; these included 28,273,058 SNPs and 1,609,215 small structural variants. The total genetic burden is 8,683 variants with each horse carrying on average 2,409 genetic burden variants (range 538 - 3,674). Our analysis of the 10 target breeds showed significant differences in the genetic burden between breeds (p <0.001), with the highest average genetic burden in Icelandic horses (2,906 variants/horse) and lowest in Morgans (2,037 variants/horse). This provides the first large-scale catalog of genetic variation and estimation of the genetic burden in healthy horses. Larger structural variants have been called with Breakdancer, cn.MOPS, Delly, and GenomeSTRiP and analysis is ongoing. Publications on the: 1) genetic variation in the equine population; 2) frequency of previously reported disease-causing variants in the equine population; and 3) improvement in number of variants identified with high allele frequencies between EquCab2 and EquCab3 are in progress. 
We have identified and performed WGS (12-20x coverage) on 24 horses representing 10 highly detrimental genetic diseases in the horse that are analogous to human Mendelian diseases. These diseases include: 5 skeletal muscle diseases (dystrophic and non-dystrophic myotonia, eosinophilic myositis, malignant hyperthermia (MH) without the RYR1 C7360G allele, and hyperkalemic periodic paralysis (HYPP) without the SCN4A F1416L allele); alopecia areata, which causes severe hair loss and must be managed with corticosteroids that can lead to laminitis; hemochromatosis, an iron storage disease that leads to liver failure; complete bilateral absence of the vas deferens, causing stallion infertility; paroxysmal atrial fibrillation (AFIB), the most common pathologic arrhythmia in horses, which may contribute to sudden cardiac death; idiopathic renal hematuria (IRH), a life-threatening disease of Arabs; and microphthalmia, a performance-limiting disease of Sport Horses. The WGS data have been mapped to the reference genome and alleles identified using the pipeline developed in objective 1. Variants have been prioritized based on the reported disease prevalence (AFIB) or a presumed disease prevalence of 1%, given that the diseases are considered rare in the general horse population and assuming that they follow recessive inheritance patterns. Variants with an allele frequency in the catalog of genetic variation greater than expected based on disease prevalence were excluded. Variants were then further prioritized based on: 1) their presence in genes reported to be intolerant to damaging variants; 2) a computationally predicted high impact on phenotype; and 3) presence in all disease cases (AFIB, IRH, myotonia); or 4) a recessive, de novo, or dominant inheritance pattern in the offspring compared to the parent (alopecia areata, microphthalmia). Final ranking using the American College of Medical Genetics guidelines for disease-causing variant prioritization is ongoing. 
The long-term goal is to validate the likely variants by sequencing them in a larger population of individuals to ensure segregation of the allele with the disease phenotype, and to develop genetic tests for easy disease diagnosis and identification of carriers to assist breeders with decreasing the frequency of clinical disease in their herd. Objective 2. A comprehensive and reproducible computational pipeline was designed to facilitate the creation of a genotype imputation and haplotype reference panel in the horse. Segments of the pipeline were defined using the workflow management system 'Snakemake', and the computational environment was containerized using both Docker and Singularity. Containers are publicly available on both DockerHub and the Singularity Container Library. Source code to build the pipeline is available at our public laboratory git repository (https://github.com/UMN-EGGL). Whole genome sequences from 549 horses were mapped to EquCab3, and both GATK Haplotype Caller and bcftools 'mpileup' were used to produce joint genotyping calls across every base pair in the genome. SNPs from both callers were assessed for baseline quality control, and variant quality scores were recalibrated using 6 features shared between the GATK and bcftools variant callers. Gaussian mixture models were fit for both callers using training data derived from SNP sets defined from the recent MNEc2M SNP genotyping array. Using a true positive probability threshold of 99.5%, recalibrated scores were used to combine SNPs from GATK and bcftools into a combined set of 20.8M SNPs representing a high-confidence, whole-genome SNP reference panel for the horse (WGSNP). Variants were phased in 50 kb windows using Beagle 5.1. Imputation accuracies from MNEc2M to WGSNP were estimated using leave-one-out cross-validation, resulting in between 85% and 99.8% accuracy under naïve imputation scenarios in over 11 horse breeds. Objective 3. 
As part of objective 3, we have developed two publicly available containerized workflows: 1) IndexForTheFuture (https://github.com/jonahcullen/IndexForTheFuture), and 2) RNAMapping (https://github.com/UMN-EGGL/RNAMapping). IndexForTheFuture was designed to ensure the stable and reproducible generation of reference genome indices directly from the NCBI and ENSEMBL servers to be used for mapping RNA-seq data. The RNAMapping workflow is capable of processing RNA-seq data regardless of platform or sequencing strategy, from raw FASTQs through the generation of a multi-sample, non-redundant transcriptome (i.e. improving the physical annotation) and transcript/gene-level quantification. Moreover, the output from this workflow may be used to improve functional annotation via generation of tissue-specific co-expression networks. Through containerization, use of these workflows removes the variability associated with both lab-specific computational infrastructure (e.g. differences in computing platforms from equine research groups at various universities) and genome release. Additionally, we have begun analyzing the differences in the physical annotation (via transcriptomes), differential expression, and functional annotation (via co-expression networks) observed between samples sequenced on different platforms. Expected results will serve to better inform how to consider platform-specific variance and merge data from multiple sources in the continued development of the equine tissue expression atlas. In addition to data resources, several major contributions were made to general-purpose bioinformatics computational tools. Major features were implemented for storing biological data using Minus80, an open-source bioinformatics Python library. Minus80 has had 14 minor releases and 1 major release throughout the duration of this grant.

    Publications


      Progress 03/15/17 to 03/14/18

      Outputs
      Target Audience: Nothing Reported Changes/Problems: Nothing Reported What opportunities for training and professional development has the project provided? Nothing Reported How have the results been disseminated to communities of interest? Nothing Reported What do you plan to do during the next reporting period to accomplish the goals? Nothing Reported

      Impacts
      What was accomplished under these goals? See 2019 report for updated information

      Publications