Source: UNIV OF MINNESOTA submitted to NRP
DSFAS PARTNERSHIP: INTEGRATIVE DATA SCIENCE FOR PRRSV PHENOTYPIC PREDICTION BASED ON STRUCTURAL DIVERSITY, COMPUTATIONAL IMMUNOLOGY AND GENOMIC EPIDEMIOLOGY
Sponsoring Institution
National Institute of Food and Agriculture
Project Status
ACTIVE
Funding Source
Reporting Frequency
Annual
Accession No.
1030677
Grant No.
2023-67021-40018
Cumulative Award Amt.
$799,993.00
Proposal No.
2022-11608
Multistate No.
(N/A)
Project Start Date
Aug 1, 2023
Project End Date
Jul 31, 2027
Grant Year
2023
Program Code
[A1541]- Food and Agriculture Cyberinformatics and Tools
Recipient Organization
UNIV OF MINNESOTA
(N/A)
ST PAUL, MN 55108
Performing Department
Veterinary Population Medicine
Non Technical Summary
Porcine reproductive and respiratory syndrome virus-type 2 (PRRSV-2) is a rapidly evolving RNA virus impacting ~30-50% of breeding farms. Swine producers sequence thousands of viruses annually as part of disease management, but are frustrated by limitations in translating genetic data to phenotypic insights relevant for decision-making. Our objective is to create an "integrative data science" platform to predict the immunogenic and epidemiologic phenotype of PRRSV-2 variants through breaking barriers and building connectivity among AI/ML across structural biology, computational immunology, and genomic epidemiology.
Specifically, we will apply unsupervised learning to identify clusters (variants) of closely related genetic sequences within large-scale databases. We will measure structural divergence between variants (hypothesized to influence immunogenicity and antibody binding) using AlphaFold 2.0, an artificial intelligence tool recently developed for 3D structural predictions of protein folding. We will then apply machine learning to experimental data on antibody cross-reactivity amongst variants to build a model that predicts "antigenic distance" based on genetic dissimilarities and structural divergence metrics. Lastly, we will develop a machine learning algorithm that predicts a variant's epidemiologic fitness (i.e., expansion of a given variant through time) based on antigenic distance, structural divergence, and phylogenetic features.
We will also create an HTML-based visualization platform that allows stakeholders to classify and contextualize their own sequences within the broader phenotypic diversity of the virus. This integration of back-end analytical methods and front-end visualization tools will improve human-data interactions surrounding interpretation of sequence data and support decision-making about appropriate disease management and immunization practices within the swine industry.
Animal Health Component
25%
Research Effort Categories
Basic
75%
Applied
25%
Developmental
(N/A)
Classification

Knowledge Area (KA)    Subject of Investigation (SOI)    Field of Science (FOS)    Percent
311                    3510                              1170                      25%
311                    3510                              1090                      25%
311                    3510                              1101                      25%
311                    3510                              1070                      25%
Goals / Objectives
Develop an integrative data science platform to predict the structural, immunogenic, and epidemiologic phenotype of PRRSV-2 genetic variants through breaking barriers and building connectivity among AI and ML across structural biology, computational immunology, and genomic epidemiology. Specific work packages (WP) will include:
• Develop a PRRSV-2 variant classification system using supervised and unsupervised machine learning on large-scale sequence databases (WP1).
• Utilize AI algorithms to quantify structural dissimilarities of PRRSV-2 variants based on genetic sequence (WP2a).
• Train and test machine learning algorithms to predict antigenic distance (immunological cross-reactivity) based on sequence data and structural divergence metrics (WP2b).
• Quantify the epidemiologic fitness (i.e., population expansion) of new variants, and apply machine learning to identify early indicators of fitness, including structural features and antigenic distance (WP3).
• Create a front-end visualization tool for end-users to support decision-making and contextualize their own sequences within the broader phenotypic diversity of PRRSV-2 (WP4).
Project Methods
WP1: Sequence data / Variant naming
Expected results:
• Evaluation of alternative systems for classifying and naming PRRSV-2 variants
• Development of a pipeline for classification and assimilation of new sequence data
• Generation of reference sequences for phenotype prediction
Detailed methodology
Variant clustering: Sequences will be aligned using the MUSCLE algorithm, and metadata will be assigned, including date and county of collection. Unsupervised learning will be applied to identify clusters in the genetic data using a discriminant analysis of principal components (DAPC), available via the package adegenet 2.0.0 in R[38]. Alternatively, we will apply a tree-based clustering approach using the TreeCluster package available in Python[39]. Here, we will first construct phylogenetic trees using FastTree[40], which is optimized for large sequence datasets. We will then identify clusters within the tree corresponding to clades that meet user-specified thresholds, such as the maximum, mean, or median pairwise patristic distance between sequences within the same cluster. The exact definition that we move forward with will be chosen through an iterative process with the working group and target end-users, with consideration of the classification system's a) ability to discriminate between field strains without over-discretizing, b) rate at which new clusters appear within the data, and c) stability of cluster assignments with different subsets of data. Representative sequences within each variant will be selected as references and used for phenotypic machine learning (WP2-3).
Variant classifier: Once a variant clustering method is selected, we will apply supervised machine learning algorithms to the dataset to train an algorithm that can assign sequences to variants. Algorithms that will be compared include random forest (currently used for variant assignment for SARS-CoV-2), discriminant analysis of principal components, and nucleotide distance to the nearest references (k-nearest neighbors).
Once developed (with a mechanism for periodic updating), the trained algorithm is data-free and can be used as part of an HTML-based platform where end-users can upload and classify their own sequences (WP4).
WP2: Computational immunology & Structural divergence
Expected results:
• Integration of AI-informed structural metrics with predictive ML algorithms to estimate cross-protection in silico
• Generation of phenotypic predictors that may be associated with epidemiologic fitness
Detailed methodology
Experimental data from serum neutralization (SN) assays: We have access to pre-existing data on serum neutralization titers from >30 viruses, which creates a 30x30 panel of virus-serum pairs (n = 870 observations; anti-sera were generated from pigs that were infected and allowed to develop an antibody response to the same set of viruses). We will calculate antigenic distance by subtracting the log2 of the heterologous titer from the log2 of the homologous titer, i.e., $D_{ij} = \log_2(H_{jj}) - \log_2(H_{ij})$, with a one-unit difference representing a 2-fold loss in neutralization ability between the homologous and heterologous pair[25, 51].
Structural divergence metrics: Structures for the ORF5 and M protein (which interacts with ORF5) sequences, and potentially other proteins, will be estimated using AlphaFold 2.0[20] (https://github.com/deepmind/alphafold), and the spatial differences between pairs of protein structures will be calculated for the whole protein and in selected regions (e.g., ectodomain) via custom-built R code. In addition to using AlphaFold for representatives of different variants, we will use Missense3D[56] and/or FoldX[57] or similar tools (e.g., reviewed in [58]) to evaluate the effects of individual amino acid substitutions on structures. Additionally, differences in glycosylation (a process where glycans bind to specific amino acid motifs and thus potentially mask epitopes[59, 61]) might influence the antigenic and fitness properties of PRRSV variants.
Therefore, we will investigate changes in potential glycosylation sites during evolution and possible impacts on epitope shielding using glycoSHIELD[62] or similar, alongside the sequence-motif-based analyses.
Antigenic distance model development and evaluation: ML algorithms with 10-fold cross-validation to prevent over-fitting will be applied to the training set to build a model that classifies virus-serum pairs as antigenically similar (low antigenic distance) or dissimilar (high distance). Features identified in variable importance plots as most associated with the outcome warrant follow-up in vitro or in vivo to ascertain their functional significance, which is an example of how ML tools can be used to support and accelerate discovery in research. An ORF5-only model will be developed in addition to a WGS model, as many end-users will only have access to ORF5 sequences. Preliminary data on PRRSV-1 show that an ORF5 model performs almost as well as one based on additional proteins.
WP3: Genomic epidemiology
Expected results:
• Identification of correlates of epidemiologic fitness that can be quantified from sequence-based phenotypic inference (WP2) and the structure of phylogenetic trees
• Validated methodology that can be applied prospectively to large-scale phylogenetic datasets to rank variants according to their epidemiologic fitness
Detailed methodology
ORF5 sequences with temporal and spatial metadata collected in the U.S.
from 2012-2022 (WP1) will be used to a) measure each variant's structural and immunogenic attributes (WP2) and its evolutionary attributes based on the structure of the phylogenetic tree at time t, b) evaluate the fitness of each variant during a follow-up period of n months, and c) correlate attributes measured at time t to a variant's future epidemiologic fitness during the follow-up period to identify attributes that can serve as early indicators of the epidemiologic fitness of a variant. Increasing frequency and/or geographic extent during the follow-up period are indicative of highly successful emerging variants with expanding viral population sizes[27]. Therefore, a variant's future success during a follow-up period (6, 12, or 24 months) will be measured based on clade growth (i.e., change in number of sequences) and spatial distribution (i.e., increase in number of counties affected). Machine learning algorithms (following the general validation framework described in WP2) will be applied to relate variant attributes at time t with measures of fitness in the follow-up period to identify attributes that are predictive of future success.
WP4: Visualization and human-data interface
Expected results: An annotated database and visualization tool for PRRSV-2 that can be used to identify and visualize contemporary circulating strains and inform on evolutionary trends, relatedness amongst variants, and phenotypic similarities.
Detailed methodology
HTML-based classifier: We will develop an analytical webtool that automatically classifies user-submitted sequence data, with the user receiving the variant ID associated with their sequence. Our team maintains the United States Swine Pathogen Database (US-SPD: https://swinepathogendb.org/)[77].
This is a web-based, curated, relational database hosted on infrastructure developed by the United States Department of Agriculture, Agricultural Research Service (USDA-ARS SCINet).
NextStrain visualization: We will develop a PRRSV-specific NextStrain platform for the analysis and visualization of large-scale PRRSV-2 phylogenies using an easily accessible NextStrain group (https://nextstrain.org/groups/). The end-users of this platform will be both practitioners and diagnosticians, the latter of whom often assist producers/clients in interpreting sequence data.
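As an illustration of the antigenic distance calculation described under WP2 (Dij = log2(Hjj) - log2(Hij)), the following is a minimal Python sketch and not the project's actual code. It assumes that Hij denotes the titer of virus i against antiserum j and that Hjj is the homologous titer for antiserum j. The function names, the dictionary layout, and the titer values are hypothetical.

```python
import math


def antigenic_distance(homologous_titer, heterologous_titer):
    """Return D_ij = log2(H_jj) - log2(H_ij) for one virus-serum pair."""
    if homologous_titer <= 0 or heterologous_titer <= 0:
        raise ValueError("titers must be positive")
    return math.log2(homologous_titer) - math.log2(heterologous_titer)


def distance_table(titers):
    """Compute D_ij for every heterologous pair in a titer panel.

    `titers` maps (virus, serum) -> titer. This layout is an assumption
    made for the sketch; the source does not describe a data format.
    """
    distances = {}
    for (virus, serum), h_ij in titers.items():
        if virus == serum:
            continue  # homologous pairs serve only as the reference titer
        h_jj = titers[(serum, serum)]
        distances[(virus, serum)] = antigenic_distance(h_jj, h_ij)
    return distances


# Hypothetical titers for a 2x2 panel (illustrative values only).
example_titers = {
    ("A", "A"): 256, ("B", "A"): 32,
    ("A", "B"): 64, ("B", "B"): 128,
}
distances = distance_table(example_titers)
print(distances)
```

In this toy panel, virus B is neutralized 8-fold less than virus A by antiserum A, which corresponds to three antigenic units.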

Progress 08/01/23 to 07/31/24

Outputs
Target Audience:
• Scientists, epidemiologists, and swine veterinarians
• Government and policymaking agencies involved in animal disease control and prevention
• Swine industry stakeholders (e.g., swine production companies, producers, pharmaceutical companies, biosecurity companies)
Changes/Problems: Due to the interest of our postdoctoral associate at the University of Minnesota in structural modeling, the primary work for WP2 is now being done at Minnesota rather than at the Roslin Institute.
What opportunities for training and professional development has the project provided? The project has provided two postdoctoral researchers with opportunities to learn and develop skills in machine learning, phylogenetics, and structural bioinformatics. The project also supports one graduate student working on PRRSV bioinformatics.
How have the results been disseminated to communities of interest? The results have been disseminated to communities of interest through various channels. The veterinary diagnostic laboratory at the University of Minnesota, which processes a significant volume of swine disease samples in the U.S. Midwest, has integrated the new machine learning algorithm for PRRSV-2 variant classification into its services, replacing the traditional RFLP typing with our classification system in the ORF5 sequencing report provided to clients. Additionally, we presented the development and application of this classification system to different stakeholders in the swine industry. An infographic detailing the variant classification system was distributed as a leaflet at the 2024 AASV meeting, where swine practitioners convened, and we delivered oral presentations on this topic at the 2024 Allen D. Leman Swine Conference, attended by scientists, practitioners, and producers. Finally, the findings from the WP1 study have been published in a peer-reviewed journal, along with a publicly accessible variant classification platform available on a dedicated website.
Academic conferences attended by industry:
• Kimberly VanderWaal, Nakarin Pamornchainavakul: PRRSV-2 genetic classification (Infographic). AASV Annual Meeting. Nashville, Tennessee, February 24-27, 2024.
• Kimberly VanderWaal, Paul Yeske: PRRSV-2 genetic variant classification: What is it and why we need it? Allen D. Leman Swine Conference, St. Paul, MN, September 21-24, 2024.
Presentations to industry groups:
• We have held three orientations thus far, meeting with veterinary practices and swine production companies to orient and educate them on the new classification system. These include Fairmont Veterinary Clinic, Swine Vet Center, and Vaxxinova.
Media:
• 2024 article in National Hog Farmer: "Surprising but true: PRRSV one of the most sequenced viruses in the world" By: Kimberly VanderWaal. https://www.nationalhogfarmer.com/livestock-management/surprising-but-true-prrsv-one-of-the-most-sequenced-viruses-in-the-world
• 2024 Morrison Swine Health Monitoring Project: "Surprising but true: PRRSV one of the most sequenced viruses in the world" By: Kimberly VanderWaal. https://umnswinenews.com/2024/08/23/surprising-but-true-prrsv-is-one-of-the-most-sequenced-viruses-in-the-world/
What do you plan to do during the next reporting period to accomplish the goals? As we conclude the first year of the project, we have successfully completed Activity 1 (Variant Clustering) of Aim WP1 ahead of schedule, and we anticipate finalizing Activity 2 (Automated Variant Classifier) by the end of this year. In line with our project plan, we have also made at least 25% progress on Aim WP2a. Furthermore, we have initiated Aim WP3 earlier than planned (initially scheduled for year three) to address the active use of the application developed in Aim WP1.
Moving forward, we aim to adhere to the proposed timeline or even accelerate some aims in anticipation of Aim WP1 results, which are foundational to the subsequent aims and will soon be ready for implementation. Our detailed plans for the next reporting period include the following:
WP1 - The manuscript on Activity 2 (Automated Variant Classifier) has been submitted to a journal and is currently under peer review, with expected publication by the end of 2024. During this period, we will continue to refine our machine learning algorithm based on feedback from ongoing and new users, including veterinary diagnostic laboratories and their clients. Any adjustments, along with updates to the variant classification model trained on new sequencing data from the Morrison Swine Health Monitoring Project (MSHMP), will be uploaded to our GitHub page and RShiny app website on a quarterly basis.
WP2a and WP2b - For the AlphaFold analysis, we plan to complete the structural prediction of 158 GP5/M heterodimers, along with separate GP5 and M proteins, using various models and cleavage site predictions. We will compare the structural information, including protein conformations and confidence parameters, across all models to identify the best prediction. This optimized structural configuration will then be applied to GP5 and M proteins from newly generated PRRSV-2 whole genome sequences used in cross-neutralization experiments. We aim to measure the association between structural divergence, amino acid variation, and antigenic distances, and to develop a machine learning model for antigenic and phenotypic prediction based on these features.
WP3 - The primary challenge we currently face with this aim is improving the accuracy of the virus fitness predictive model. We will explore strategies such as downsampling data from overrepresented classes to address dataset imbalance, or modifying the predictor variables to enhance the model's performance.
Our goal is to achieve at least 70% accuracy by the end of the first quarter of 2025.
WP4 - We will continue to add trained models from WP2 and WP3 into the PRRSLoom webtool as they become available. Additionally, we plan to present our research findings at upcoming conferences, including NAPRRS 2024 and the AASV meeting in 2025.
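The WP3 plan above mentions downsampling overrepresented classes to address dataset imbalance. One common way to do this is random downsampling of every class to the size of the smallest class; the sketch below shows that idea in Python. It is an assumption about what such a step could look like, not a description of the project's pipeline. The function name, the record layout, and the "expanded" label are hypothetical.

```python
import random
from collections import Counter, defaultdict


def downsample_to_smallest_class(records, label_key, seed=0):
    """Randomly downsample every class to the size of the rarest class.

    `records` is a list of dicts and `label_key` names the class label.
    A fixed seed keeps the subsample reproducible.
    """
    if not records:
        return []
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for record in records:
        by_class[record[label_key]].append(record)
    target_size = min(len(group) for group in by_class.values())
    balanced = []
    for label in sorted(by_class):
        # Sample without replacement so no record is duplicated.
        balanced.extend(rng.sample(by_class[label], target_size))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 2 expanding variants versus 8 non-expanding ones.
toy_records = (
    [{"variant": f"v{i}", "expanded": True} for i in range(2)]
    + [{"variant": f"v{i}", "expanded": False} for i in range(2, 10)]
)
balanced_records = downsample_to_smallest_class(toy_records, "expanded")
class_counts = Counter(r["expanded"] for r in balanced_records)
print(class_counts)
```

Downsampling discards data from the majority class, so it is usually applied to the training folds only, leaving the held-out data at its natural class balance.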

Impacts
What was accomplished under these goals?
WP1 - Completion status = 95%
We utilized a database of over 28,730 sequences from 2010 to 2021 to develop a fine-scale classification system for PRRSV-2 variants. We systematically compared 140 approaches, assessing various tree-building methods and criteria for defining variants. The three most effective approaches yielded reproducible classifications, with the average genetic distance among sequences within the same variant ranging from 2.1% to 2.5%, while the divergence between variants was 2.5% to 2.7%. We trained machine learning algorithms that assigned new sequences to existing variants with over 95% accuracy. This work is supported by a published paper and another under review. The classification tool is publicly available at https://stemma.shinyapps.io/PRRSLoom-variants/ with code accessible at https://github.com/kvanderwaal/prrsv2_classification.
WP2a - Completion status = 25%
We incorporated 153 nucleotide sequences of ORF5 and ORF6 genes from NCBI GenBank, representing various PRRSV-2 lineages and sub-lineages in the U.S. from 1992 to 2021, along with five common commercial live attenuated vaccine strains. These sequences were translated into GP5 and M proteins for structural prediction analysis. Using AlphaFold2 on the University of Minnesota's supercomputer, we predicted the structures of PRRSV-2 GP5 and M proteins in several configurations, achieving predictions for over 75% of the data. Preliminary findings suggest a disconnect between structural distances and amino acid distances for GP5/M proteins, indicating the need for further exploration, particularly focusing on epitopic sites in subsequent analyses.
WP2b - Completion status = 50%
We have been collaborating with a company, Phibro Animal Health, which has shared with our team data on cross-neutralization between a panel of PRRSV-2 variants.
We used these data to train a randomForest model in R to predict the potential of anti-sera generated against one PRRSV-2 variant to provide protection against a different variant. The model is reasonably accurate, with accuracy >70%. We have also obtained data from this group on lung lesions in immunized and challenged animals, allowing us to assess whether the ML algorithm predictions (trained on in vitro data) provide an approximation of in vivo protection. Results look promising thus far, with general agreement between the in vivo and in vitro studies. A manuscript is being drafted.
WP3 - Completion status = 20%
We utilized the same dataset from WP1 to apply machine learning for predicting the epidemiologic fitness of new PRRSV-2 variants. We identified 20 candidate features from ORF5 sequence alignments, variant classification data, and the phylogenetic tree to forecast variants with potential population expansion exceeding 200% in 12, 24, or 36 months. Fourteen machine learning classifiers were trained using 70% of the data with 10-fold cross-validation, achieving accuracy rates between 50% and 65% depending on the model and prediction timeframe. Notably, the random forest and extra tree classifiers performed the best, with local branching index and within-variant genetic distance emerging as key features. Further refinement of features and methodologies is underway to enhance prediction accuracy and reduce processing time.
WP4 - Completion status = 50%
The classification webtool from WP1 allows end-users to classify their own sequences into variants. This tool also includes additional contextual information that a user can look up once they have their variant ID. Notably, they can view the current prevalence of their variant in the entire dataset and get a report on the "occurrence" trend, i.e., whether the occurrence of the variant is accelerating (doubled over the past 12 months), decelerating (decreased by half over the past 12 months), or stable.
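The occurrence trend rule described for WP4 can be sketched in a few lines of Python. This is a hedged illustration rather than the webtool's actual logic. It assumes the trend compares a variant's sequence count in the most recent 12 months against its count in the preceding 12 months, that the thresholds are inclusive, and that a variant with no earlier detections is treated as accelerating. All names are hypothetical.

```python
def occurrence_trend(previous_count, current_count):
    """Label a variant's 12-month occurrence trend.

    previous_count: detections in the earlier 12-month window (assumed).
    current_count: detections in the most recent 12-month window (assumed).
    """
    if previous_count < 0 or current_count < 0:
        raise ValueError("counts must be non-negative")
    if previous_count == 0:
        # Assumption: a newly detected variant counts as accelerating.
        return "accelerating" if current_count > 0 else "stable"
    ratio = current_count / previous_count
    if ratio >= 2.0:  # doubled or more over 12 months
        return "accelerating"
    if ratio <= 0.5:  # halved or more over 12 months
        return "decelerating"
    return "stable"


# Hypothetical counts for three variants (illustrative only).
for name, prev, curr in [("X", 10, 25), ("Y", 40, 15), ("Z", 30, 33)]:
    print(name, occurrence_trend(prev, curr))
```

A production rule would likely work from prevalence (proportion of all sequences) rather than raw counts, so that changes in overall sequencing volume are not mistaken for a trend.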

Publications

  • Type: Journal Articles Status: Published Year Published: 2024 Citation: VanderWaal K, Pamornchainavakul N, Kikuti M, Linhares DCL, Trevisan G, Zhang J, Anderson TK, Zeller M, Rossow S, Holtkamp DJ, Makau DN, Corzo CA and Paploski IAD (2024) Phylogenetic-based methods for fine-scale classification of PRRSV-2 ORF5 sequences: a comparison of their robustness and reproducibility. Front. Virol. 4:1433931. doi: 10.3389/fviro.2024.1433931
  • Type: Journal Articles Status: Under Review Year Published: 2024 Citation: Kimberly VanderWaal, Nakarin Pamornchainavakul, Mariana Kikuti, Jianqiang Zhang, Michael Zeller, Giovani Trevisan, Stephanie Rossow, Mark Schwartz, Daniel C.L. Linhares, Derald J. Holtkamp, João Paulo Herrera da Silva, Cesar A. Corzo, Julia P. Baker, Tavis K. Anderson, Dennis N. Makau, Igor A.D. Paploski (2024) PRRSV-2 variant classification: a dynamic nomenclature for enhanced monitoring and surveillance. bioRxiv 2024.08.20.608841; doi: https://doi.org/10.1101/2024.08.20.608841
  • Type: Conference Papers and Presentations Status: Published Year Published: 2024 Citation: K. VanderWaal. Rapid evolution of PRRSV: Is it possible to predict the emergence of new PRRSV variants? 27th International Pig Veterinary Society Congress. Leipzig, Germany. June 4-7, 2024.
  • Type: Conference Papers and Presentations Status: Published Year Published: 2024 Citation: K. VanderWaal & N. Pamornchainavakul. Predicting PRRSV-2 Variant Emergence: Insights from a Decade of Genomic Analysis. 27th International Pig Veterinary Society Congress. Leipzig, Germany. June 4-7, 2024.
  • Type: Conference Papers and Presentations Status: Published Year Published: 2024 Citation: Kimberly VanderWaal and Paul Yeske. "PRRSV-2 genetic variant classification: What is it and why we need it?" Oral presentation at Allen D. Leman Swine Conference. September 23, 2024, St. Paul, MN.
  • Type: Conference Papers and Presentations Status: Published Year Published: 2024 Citation: Kimberly VanderWaal. "Chasing a moving target: The emergence of new PRRSV genetic variants in U.S. swine." The University of Minnesota College of Veterinary Medicine Research, Innovation, Discovery, and Education (RIDE) Seminar Series. September 11, 2024, St. Paul, MN.