Source: MICHIGAN STATE UNIV submitted to NRP
FACT: SWIM - A CYBER-ENABLED SWINE GENOME IMPUTATION FRAMEWORK AND PUBLICLY ACCESSIBLE SERVER FOR NUCLEOTIDE RESOLUTION GENETIC MAPPING
Sponsoring Institution
National Institute of Food and Agriculture
Project Status
COMPLETE
Funding Source
Reporting Frequency
Annual
Accession No.
1025629
Grant No.
2021-67021-34149
Cumulative Award Amt.
$499,900.00
Proposal No.
2020-08986
Multistate No.
(N/A)
Project Start Date
Mar 1, 2021
Project End Date
Feb 28, 2025
Grant Year
2021
Program Code
[A1541]- Food and Agriculture Cyberinformatics and Tools
Recipient Organization
MICHIGAN STATE UNIV
(N/A)
EAST LANSING,MI 48824
Performing Department
ANIMAL SCIENCE
Non Technical Summary
Most economically important characters in food animals and plants are complex quantitative traits. Large-scale cyber-informatics and data analytics have greatly facilitated genetic mapping and genomic prediction of economic traits. Genotype imputation, the process by which a small subset of genomic locations is genotyped and whole genome genotypes are inferred from existing whole genome sequences, is a proven, accurate, and cost-effective alternative to direct sequencing of genetic selection candidates. However, scalable and accessible genotype imputation tools are lacking in pigs and other livestock animals. Our project is driven by critical need, high potential impact, and clearly defined goals. Our multi-PD team with complementary expertise will develop a novel cyber-informatics tool to directly address the FACT program priority. The long-term goals of this project are to develop, deploy, and maintain a SWine IMputation (SWIM) server to serve the U.S. and international swine genetics research community. This server will allow users to upload SNP chip genotypes and receive imputed whole genome sequences. Our informatics and data science approach will enable the swine genetics community to effectively utilize existing DNA sequence data, significantly reduce the need to allocate resources to sequence new animals, and accelerate the rate of genetic improvement.
Animal Health Component
30%
Research Effort Categories
Basic
30%
Applied
30%
Developmental
40%
Classification

Knowledge Area (KA) | Subject of Investigation (SOI) | Field of Science (FOS) | Percent
304 | 3599 | 1080 | 100%
Knowledge Area
304 - Animal Genome;

Subject Of Investigation
3599 - Swine, general/other;

Field Of Science
1080 - Genetics;
Goals / Objectives
There are three major goals of this project: 1) Develop and optimize the variant calling and genotype imputation pipelines to best suit the population structure and genome architecture of the domestic pig (Sus scrofa) and achieve high accuracy for existing genotyping platforms. 2) Develop and implement a web server with an accessible user interface to allow users to efficiently utilize the resource. 3) Evaluate and identify the best performing imputation strategies to guide future studies.
Project Methods
Objective 1: Develop and optimize variant calling and genotype imputation pipelines.
Per-animal processing: We will follow the standard GATK pipeline, beginning with quality-controlled sequence FASTQ files and producing one GVCF file per animal at the end. The GVCF (genomic variant call format) is a relatively new file format derived from the legacy VCF format. It contains coverage information for both variant and non-variant sites in the genome.
Population processing: We will utilize the new GenomicsDB framework implemented in GATK. This framework allows additional animals to be added incrementally in a faster and more straightforward way, and genotype calls to be parallelized intuitively across genomic intervals. The GVCFs produced in the previous step will be combined and genotyped to produce a complete variant set across all animals in the population.
Optimization of genotype imputation: We will focus on finding the combination of phasing and imputation tools that maximizes imputation accuracy. We will use cross-validation to evaluate imputation accuracy based on genotype concordance and on the correlation between observed and imputed genotypes and/or dosages. Computation time of genotype imputation at the server end will also be considered. Our goal is to achieve the best imputation accuracy while keeping computation time on the server within a reasonable range.
Objective 2: Develop and implement a web server with accessible user interface for genotype imputation in pigs.
We will focus on developing our prototype Apache web server into a fully functional and intuitive imputation server.
User end: We will provide instructions consisting of scripts, examples, and QC steps to convert one of these formats to the VCF format we accept, including lifting over between genome builds.
Web server user interface: We will ask for an email address for communicating results and/or QC failures, and provide a button to upload genotype files. Importantly, we will also ask for the reference haplotype panel to be used, whether a specific breed or a multi-breed panel, depending on our optimization of the imputation pipeline.
Server end: As soon as a user uploads their data, a PHP action triggers two levels of quality control (per-variant and per-sample), followed by a series of commands that complete the imputation. Subsequently, a download link will be generated and emailed to the user.
Testing: We will extensively test the web server using real SNP chip genotypes provided to us by our collaborators. Our collaborators will also serve as testers with no prior experience with the server to identify problems and provide feedback.
Objective 3: Evaluate and identify best performing imputation strategies.
Evaluation of imputation using SNP chips of different densities: To test whether, and by how much, imputation accuracy changes with SNP chip density, we will perform cross-validation in whole genome sequenced animals by subsetting variants at different densities in the target sample to realistically mimic genotyping using SNP chips. We will focus on a range of densities between 60K and 800K: 60K, 100K, and then 100K increments up to 800K. We will also include SNPs covered by the commercial Illumina Porcine 60K and Affymetrix Axiom Porcine Array (660K) for comparison.
Evaluation of imputation using low-coverage sequencing at different coverages: To evaluate the performance of low-coverage sequencing based imputation, we will perform in silico low-coverage sequencing by down-sampling high-coverage data to low coverage (0.5X, 1X, 2X, 3X, 4X, 5X) in the target set. The reference set (randomly partitioned for cross-validation) remains at its original sequence coverage and is used to construct the reference haplotype panel. We do not need to call variants all over again but will perform phasing in the reference set in each iteration of the cross-validation partition. Imputation and assessment of accuracy will be performed as described in Objective 1.
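The Objective 3 evaluation grid can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual pipeline: the function names, the use of uniform random sampling to pick chip-like variant subsets, and the 20X original coverage are all assumptions made for the example. Only the density steps (60K, then 100K to 800K) and the target coverages (0.5X to 5X) come from the text above.

```python
import random


def chip_densities(first=60_000, step=100_000, last=800_000):
    """SNP densities to test: 60K, then 100K to 800K in 100K increments."""
    return [first] + list(range(step, last + 1, step))


def subsample_positions(positions, n_keep, seed=0):
    """Keep n_keep variant positions, in sorted order, to mimic a SNP chip.

    Uniform random sampling is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    n_keep = min(n_keep, len(positions))
    return sorted(rng.sample(positions, n_keep))


def downsample_fraction(target_cov, original_cov):
    """Fraction of reads to retain so expected coverage drops to target_cov."""
    if not 0 < target_cov <= original_cov:
        raise ValueError("target coverage must be in (0, original coverage]")
    return target_cov / original_cov


if __name__ == "__main__":
    print(chip_densities())
    original_cov = 20.0  # hypothetical coverage of the high-coverage data
    for cov in (0.5, 1, 2, 3, 4, 5):
        print(cov, downsample_fraction(cov, original_cov))
```

The retained-read fraction is the kind of value a read down-sampling tool would take as input. Fixing the seed keeps each cross-validation iteration reproducible.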

Progress 03/01/21 to 02/28/25

Outputs
Target Audience: Throughout the project, there have been publications and presentations at various venues targeting academic and industry personnel in the field of animal genetics.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? Several students and postdocs were trained in this project; they were authors on papers and presented work at conferences.
How have the results been disseminated to communities of interest?
- Peer-reviewed publications
- Conference presentations
- University seminar talks
What do you plan to do during the next reporting period to accomplish the goals? Nothing Reported

Impacts
What was accomplished under these goals? The major goals were achieved. 1) We developed and optimized the variant calling and genotype imputation pipelines to best suit the population structure and genome architecture of the domestic pig (Sus scrofa) and achieved high accuracy for existing genotyping platforms. 2) We developed and implemented a web server (swimgeno.org) with an accessible user interface to allow users to efficiently utilize the resource. As of March 30, 2025, the server had completed 273 imputation jobs for 346,424 genomes. 3) We evaluated and identified the best performing imputation strategies to guide future studies.

Publications

  • Type: Peer Reviewed Journal Articles Status: Published Year Published: 2024 Citation: Quan J=, Yang M=, Cai G=, Ding R, Zhuang Z, Zhou S, Tan S, Ruan D, Wu J, Zheng E, Zhang Z, Liu L, Meng F, Wu J, Xu C, Qiu Y, Wang S, Lin M, Li S, Ye Y, Zhou F, Lin D, Li X, Deng S, Zhang Y, Yao Z, Gao X, Yang Y, Liu Y, Zhan Y, Liu Z, Zhang J, Ma F, Yang J, Chen Q, Yang J, Ye J, Dong L, Gu T, Huang S, Xu Z, Li Z, Yang J, Huang W, Wu Z (2024) Multi-omic characterization of allele-specific regulatory variation in hybrid pigs. Nat Commun 15(1):5587


Progress 03/01/24 to 02/28/25

Outputs
Target Audience: The work was presented at a seminar at Iowa State University. Attendees included academic researchers in the animal science field.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? The project supported the training of a postdoctoral fellow (Dr. Mohammed Bedhane) and, in part, a graduate student (Leland Ackerson), both of whom attended conferences under this support.
How have the results been disseminated to communities of interest? Nothing Reported
What do you plan to do during the next reporting period to accomplish the goals? Nothing Reported

Impacts
What was accomplished under these goals? We have made significant progress under the three major goals of the project. 1) The web server (swimgeno.org) has been up and running since April 2022. As of March 30, 2025, the server has completed 273 imputation jobs for 346,424 genomes. 2) Support for new SNP chip platforms has been added. 3) We have compared imputation using SNP-chip based and low-coverage sequencing based strategies. SNP-chip based imputation was found to achieve accuracy comparable to low-coverage sequencing up to 3x in pigs using either a pure breed or multi-breed reference panel. The work is being written up for publication.

Publications


    Progress 03/01/23 to 02/29/24

    Outputs
    Target Audience: The work was presented at the National Swine Improvement Federation annual meeting. Attendees included industry and academic researchers in the swine genetics field. In addition, a publication has appeared in the journal Communications Biology.
    Changes/Problems: Nothing Reported
    What opportunities for training and professional development has the project provided? The project supports the training of a postdoctoral fellow (Dr. Mohammed Bedhane) and, in part, a graduate student (Leland Ackerson), both of whom have attended conferences.
    How have the results been disseminated to communities of interest? The main paper from this project has been published: Ding et al., The SWine IMputation (SWIM) haplotype reference panel enables nucleotide resolution genetic mapping in pigs. Commun Biol 6:577, 2023. The work has also been presented at the National Swine Improvement Federation annual meeting in St. Louis, MO in 2023.
    What do you plan to do during the next reporting period to accomplish the goals? We will focus on finishing the third aim of the project and develop a maintenance plan so the server can continue to be active after the project is complete.

    Impacts
    What was accomplished under these goals? We have made significant progress under the three major goals of the project. 1) The variant calling pipeline has been optimized and fully implemented. 2) The web server (swimgeno.org) has been up and running since April 2022. As of February 29, 2024, the server has completed 150 imputation jobs for 227,633 genomes, up from 77 imputation jobs for 126,046 genomes a year ago. 3) We have compared imputation using SNP-chip based and low-coverage sequencing based strategies. A manuscript is being written to summarize the comparison results.

    Publications

    • Type: Journal Articles Status: Published Year Published: 2023 Citation: Ding R, Savegnago R, Liu J, Long N, Tan C, Cai G, Zhuang Z, Wu J, Yang M, Qiu Y, Ruan D, Quan J, Zheng E, Hong L, Li Z, Tan S, Bedhane M, Schnabel R, Steibel J, Gondro C, Yang J, Huang W, Wu Z (2023) The SWine IMputation (SWIM) haplotype reference panel enables nucleotide resolution genetic mapping in pigs. Commun Biol 6(1):577


    Progress 03/01/22 to 02/28/23

    Outputs
    Target Audience: The work was presented at the Cattle/Swine Workshop at the Plant and Animal Genome Conference. Attendees included academic and industry researchers in the animal genetics field.
    Changes/Problems: Nothing Reported
    What opportunities for training and professional development has the project provided? The project supports the training of a postdoctoral fellow (initially Dr. Rodrigo Savegnago, and later Dr. Mohammed Bedhane) and, in part, a graduate student (Leland Ackerson), all of whom have attended conferences.
    How have the results been disseminated to communities of interest? The results have been written up for publication (bioRxiv: https://www.biorxiv.org/content/10.1101/2022.05.18.492518v1) and are currently under review. The work has also been presented at the Plant and Animal Genome Conference 30 (January 2023) Cattle/Swine workshop as an invited talk.
    What do you plan to do during the next reporting period to accomplish the goals? We will continue to work on the three goals. We plan to publish the papers and submit a new paper describing the comparison of SNP-chip and low-coverage sequencing based imputation. In addition, we will develop a maintenance plan so that the web server remains available after the project concludes.

    Impacts
    What was accomplished under these goals? We have made significant progress under the three major goals of the project. 1) The variant calling and imputation pipelines have been evaluated and optimized. We found the software combination Shapeit4/Impute5 to outperform others. This has been implemented on the web server. A manuscript describing the development of the resource and server is under review for publication. 2) The web server (swimgeno.org) has been up and running since April 2022. As of February 27, 2023, the server has completed 77 imputation jobs for 126,046 genomes. 3) We have compared imputation using SNP-chip based and low-coverage sequencing based strategies. SNP-chip based imputation was found to achieve accuracy comparable to low-coverage sequencing up to 3x in pigs using either a pure breed or multi-breed reference panel.

    Publications


      Progress 03/01/21 to 02/28/22

      Outputs
      Target Audience: A preliminary report of the work was presented at the Swine Workshop for the USDA NRSP8 National Animal Genome Research Program. Attendees included academic and industry researchers in the swine genetics field.
      Changes/Problems: Nothing Reported
      What opportunities for training and professional development has the project provided? The project supports the training of a postdoctoral fellow (Dr. Rodrigo Savegnago), who has attended meetings.
      How have the results been disseminated to communities of interest? The results have been presented at the Swine Workshop of the NRSP8 National Animal Genome Research Program. The workshop was initially scheduled at the Plant and Animal Genome Conference but was rescheduled and moved online.
      What do you plan to do during the next reporting period to accomplish the goals? We will continue to work on the three goals, which remain on track to be completed on schedule. A new postdoctoral fellow (Dr. Mohammed Bedhane) has been recruited to work on the project. No changes have been made to the goals or plans.

      Impacts
      What was accomplished under these goals? We have made substantial progress under these goals. 1) We have developed and optimized the variant calling pipeline. The pipeline has been used to genotype DNA variants for more than 2,000 animals across multiple pig breeds. The genotype imputation pipeline has been developed and is being evaluated. 2) The web server has been developed and is waiting for the final pipeline to be deployed on it. 3) We have evaluated multiple strategies for imputation and found Shapeit4/Impute5 to be the best performing software combination, achieving accuracy > 98% as measured by concordance rate. We have also found that imputation using low-density arrays achieves accuracy comparable to low-coverage sequencing.
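For readers unfamiliar with the accuracy measures named in this report, the sketch below shows one way to compute genotype concordance rate and the correlation between observed genotypes and imputed dosages. It is a minimal illustration under stated assumptions: genotypes are coded as alternate-allele counts (0, 1, 2), missing calls are `None`, and the function names and toy data are hypothetical rather than taken from the SWIM pipeline.

```python
from math import sqrt


def concordance_rate(observed, imputed):
    """Fraction of non-missing genotype calls (0/1/2) that match exactly."""
    pairs = [(o, i) for o, i in zip(observed, imputed)
             if o is not None and i is not None]
    if not pairs:
        raise ValueError("no non-missing genotype pairs to compare")
    return sum(o == i for o, i in pairs) / len(pairs)


def dosage_correlation(observed, dosages):
    """Pearson correlation between observed genotypes and imputed dosages."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_d = sum(dosages) / n
    cov = sum((o - mean_o) * (d - mean_d) for o, d in zip(observed, dosages))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_d = sum((d - mean_d) ** 2 for d in dosages)
    if var_o == 0 or var_d == 0:
        return float("nan")  # undefined for a monomorphic site
    return cov / sqrt(var_o * var_d)


# Toy data for one variant across eight animals (hypothetical values).
observed = [0, 1, 2, 1, 0, 2, 1, 0]
imputed = [0, 1, 2, 1, 0, 2, 2, 0]
dosages = [0.1, 0.9, 1.8, 1.1, 0.0, 2.0, 1.6, 0.2]

print(concordance_rate(observed, imputed))  # 7 of 8 calls match: 0.875
print(dosage_correlation(observed, dosages))
```

Concordance treats every mismatch equally, while the dosage correlation also reflects how uncertain each imputed call is, which is why the two measures are often reported together.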

      Publications