FACT-CIN: Developing genome and metagenome sequencing and computational tools for disease detection in plant clinics

Sponsoring Institution

National Institute of Food and Agriculture

Project Status

COMPLETE

Funding Source

AFRI COMPETITIVE GRANT

Reporting Frequency

Annual

Accession No.

1025709

Grant No.

2021-67021-34343

Cumulative Award Amt.

$900,001.00

Proposal No.

2020-08949

Multistate No.

(N/A)

Project Start Date

Mar 1, 2021

Project End Date

Feb 28, 2025

Grant Year

2021

Program Code

[A1541]- Food and Agriculture Cyberinformatics and Tools

Recipient Organization
OREGON STATE UNIVERSITY
(N/A)
CORVALLIS,OR 97331

Performing Department
Botany & Plant Pathology

Non Technical Summary
Plant diseases, along with pests and weeds, present significant impediments to agricultural economic prosperity and food biosecurity. The US has a network of university plant disease clinics that diagnose diseases, conduct surveys, and educate stakeholders (growers) to help them manage losses from plant diseases. The rapidity with which diseases can spread, both naturally and artificially through trade, means plant clinics have a critical need to increase their capacity to help stakeholders. Whole genome sequencing has demonstrated benefits in accelerating diagnosis, shortening response times, and mitigating risks of human diseases and is emerging as a promising method for plant clinics to help their stakeholders. However, whole genome sequencing has not been effectively integrated into plant disease diagnostics. The tools, infrastructure, datasets, and expertise necessary for leveraging whole genome sequencing data are not currently available at most clinics.
This research team consists of a consortium of researchers and plant clinic diagnosticians. The team will develop plant pathogen-specific tools to overcome the bottleneck in computational data analysis and empower experts in disease diagnostics to use genomic methods. The objectives of the research project are to: 1) develop standard protocols to ensure high-quality, reliable results, essential for actionable insights, 2) develop computational tools that automate and simplify processing and analyzing whole genome sequencing information, 3) develop new tools that allow end-users to visualize whole genome sequencing information and gain insights to better manage disease and disease spread, 4) develop new methods to sequence DNA of communities of microorganisms and predict possible causative agents of disease, and 5) validate the tools and communicate findings to diagnosticians in plant clinics.
The ultimate goal of this research project is to implement a surveillance network of plant clinics that share and use whole genome sequencing information. This is expected to accelerate diagnosis, shorten response time to outbreaks, provide more accurate and sustainable management strategies, and enable earlier disruption of transmission chains to help stakeholders better mitigate risks of disease to plants.

Animal Health Component

25%

Research Effort Categories

Basic

25%

Applied

25%

Developmental

50%

Classification

Knowledge Area (KA)	Subject of Investigation (SOI)	Field of Science (FOS)	Percent
212	4010	1160	100%

Knowledge Area
212 - Pathogens and Nematodes Affecting Plants;

Subject Of Investigation
4010 - Bacteria;

Field Of Science
1160 - Pathology;

Keywords

Goals / Objectives
The overarching goal of this project is to integrate use of whole genome sequencing information with traditional diagnostics to increase the capacity of plant disease diagnostic clinics to help stakeholders in US agriculture mitigate risks of disease. The project aims to address immediate needs in developing tools usable by diagnosticians and scalable for a nationwide network of adopters. This consortium leverages collaborations between diagnosticians in plant clinics and geneticists/genomicists/population geneticists in three different institutions. This consortium will also interact with the established national network of plant clinics to inform, educate, and initiate adoption of whole genome-based methods. The four goals and their objectives are as follows:
Goal I. Implement and integrate tools to apply WGS for disease diagnostics.
Obj. 1. Develop standards and optimize preparatory workflow for use in plant clinics.
Obj. 2. Integrate currently used WGS analysis tools into a workflow for use in plant clinics.
Obj. 3. Develop visuals to effectively interpret and communicate WGS results.
Goal II. Implement and integrate tools to apply Meta-WGS for disease diagnostics.
Obj. 1. Develop standards and optimize preparatory workflow for use in plant clinics.
Obj. 2. Integrate available Meta-WGS analysis tools into a workflow for use in plant clinics.
Obj. 3. Develop and optimize novel machine learning algorithms.
Goal III. Plant disease clinics validate the developed protocols and tools.
Obj. 1. Validation using inoculated samples.
Obj. 2. Validation using field collected samples.
Goal IV. Engage and inform plant clinics nationally.

Project Methods
We will employ methods that are standard to computational biology, molecular biology, and plant disease diagnostics. There are no significant deviations from usual methods. In general, we will evaluate methods against datasets in which the "truth" is already known. In addition, we focus methods on four taxa of culturable bacteria.
Goal I.1: Sequencing reads from one deeply sequenced strain from each taxon will be subset to varying depths and analyzed to determine the depths necessary for drawing conclusions on species and genotype-level relationships. Results from subsets of reads will be compared to our current understanding of relationships among strains, defined on the basis of deep sequencing and large genomic datasets.
Goal I.2: LINbase is currently operational and will be modified to address the goals of this project. A new pathogen database will also be built to provide genotype-level identification and inferences on molecular phenotypes, and to interact with LINbase. Together, the two will automate many steps of data analyses and provide a user-friendly interface. Diagnosticians in plant clinics will pilot the pipeline and databases to evaluate impact and help design the most user-friendly interfaces.
Goal I.3: We will develop user-friendly methods for visualizing genomic data. Visuals, such as phylogenetic trees and minimum spanning networks, standard in evolutionary and population biology, will serve as the foundation. We will add features that allow users to easily project new information to help identify patterns. Diagnosticians will pilot these tools to evaluate their impact and help design intuitive features.
Goal II.1: We will sequence samples taken directly from plants inoculated with known culturable bacteria (metagenomics) to determine relationships between pathogen load and sequencing depth and to compare the sensitivity of different sequencing technologies. We will also simulate different loads of pathogens to determine the depth of coverage necessary to inform on possible causative agents of disease.
Goal II.2: We will test different workflows for handling metagenomic data. See Goal III.1; this is used to generate datasets in which the truth is known to help with evaluation.
Goal II.3: We will evaluate and test three types of machine learning methods that are based on a family of supervised machine learning methods called classification. See Goal III.1; this is used to generate datasets in which the truth is known to help with evaluation.
Goal III.1: One strain each of Agrobacterium and Xanthomonas will be inoculated onto host plants. Standard methods will be used to enrich for the pathogen and prepare samples for metagenome sequencing. Libraries will be sequenced on two different sequencing platforms and the data will be used for testing and evaluating methods described in Goal II.
Goal III.2: Once methods have been optimized, we will evaluate them by analyzing field samples. These samples will be received from clients of the plant clinics. They will be analyzed in parallel by experts in the clinics, using standard methods in disease diagnostics. Diagnoses determined by computational methods and by traditional methods will be compared and used to evaluate the efficacy of the metagenome-based methods.
Goal IV: Methods in active learning will be used in workshops that teach others how to handle and analyze whole genome data.
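The read-subsetting step described for Goal I.1 can be illustrated with a minimal sketch. It is not the project's code: the function names, the toy reads, and the 5 kb genome size are all hypothetical, and it assumes reads are kept independently at random so that the expected depth matches the target.

```python
import random

def subsample_fraction(total_bases, genome_size, target_depth):
    """Fraction of reads to keep so the expected depth is about target_depth."""
    current_depth = total_bases / genome_size
    # Never ask for more reads than exist.
    return min(1.0, target_depth / current_depth)

def subsample_reads(reads, fraction, seed=0):
    """Keep each read independently with probability `fraction` (seeded for repeatability)."""
    rng = random.Random(seed)
    return [r for r in reads if rng.random() < fraction]

# Hypothetical toy data: 1,000 reads of 150 bp from a 5 kb genome (30x depth).
reads = ["A" * 150] * 1000
genome_size = 5_000
total_bases = sum(len(r) for r in reads)

for depth in (5, 10, 20):
    frac = subsample_fraction(total_bases, genome_size, depth)
    subset = subsample_reads(reads, frac, seed=42)
    achieved = sum(len(r) for r in subset) / genome_size
    print(f"target {depth}x: kept {len(subset)} reads, about {achieved:.1f}x")
```

Each subset would then be assembled or used for SNP calling, and the results compared with those from the full-depth data to find the depth at which conclusions stop changing.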

Progress 03/01/21 to 02/28/25

Outputs
Target Audience: Our target audience was broad. Research efforts were aimed directly at diagnosticians, who serve stakeholders. However, findings also impacted scientific researchers.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? Overall, the project trained undergraduate and graduate students, postdoctoral researchers, and FRAs. It provided cross-training between computational biology, plant pathology, and diagnostics. It also gave diagnosticians training in bioinformatics and helped basic biologists apply their research to a diagnostic setting.
How have the results been disseminated to communities of interest? Results have been disseminated via peer-reviewed publication, as presentations at conferences, as a workshop at APS (2024), and as modules and a pipeline on nf-core. We are scheduled to present PathogenSurveillance at APS (2025), both as an oral presentation (submitted abstract) and as a workshop (accepted).
What do you plan to do during the next reporting period to accomplish the goals? Nothing Reported

Impacts
What was accomplished under these goals?
Aim I. Implement and integrate tools to apply WGS for disease diagnostics.
Obj. 1. Develop standards and optimize preparatory workflow for use in plant clinics. This aim was completed and published in Iruegas-Bocardo et al (2023). Briefly, we reported on the effects of sequencing depth on genome assembly and on the accuracy of calling single nucleotide polymorphisms (SNPs). In addition, we reported on the importance of comparing not only core genomes, but also accessory genomes, when analyzing whole genome sequences for drawing conclusions on epidemiological links. We demonstrated that SNP calling programs and reference genome sequences (relationship to samples and quality of assemblies) can have significant effects on conclusions. Last, in Iruegas-Bocardo et al (2023), we made recommendations on best practices.
Obj. 2. Integrate currently used WGS analysis tools into a workflow for use in plant clinics. We have completed the development of the PathogenSurveillance pipeline. Our pipeline is based on Nextflow, a workflow engine that uses software containers such as Docker and Singularity, or the Conda management system, to run pipelines independent of the execution environment (Di Tommaso et al. 2017). This pipeline takes raw Illumina or Nanopore reads and performs whole genome analyses of bacteria and eukaryotes. Note that the new features include analyses of Nanopore reads and of eukaryotes. PathogenSurveillance automates several analysis steps, including: 1) processing reads, 2) quickly inferring a taxonomic identity, 3) assembling and annotating genome sequences, 4) clustering orthologs, 5) identifying from public databases the closest related strains, 6) building core genome and SNP phylogenies, and 7) building minimum spanning networks. Two powerful attributes allow non-experts without sufficient computing infrastructure to use our pipeline. These are: 1) automated analysis, including automated identification of publicly available close reference genome sequences, and 2) coupling the Seqera Platform to AWS cloud infrastructure. The Seqera Platform manages workflows while AWS provides cost-effective computing infrastructure. A functional pipeline is already available and being beta tested by collaborators. Approximately 15 modules, many of which have general purposes, have been developed. The "continuous integration testing" necessary for pipelines to be accepted by nf-core has taken longer than expected. We are currently experiencing unusual errors that are not due to our pipeline but occur under run variables rarely used by nf-core. This is the last hurdle prior to public release of the pipeline and submission of a manuscript describing it.
Obj. 3. Develop visuals to effectively interpret and communicate WGS data. These were completed in prior years. Our pipeline generates an interactive HTML report as well as a static PDF report. The interactive HTML report includes phylogenetic trees as well as a minimum spanning network. Last, the report provides key information to the diagnostician, such as taxonomic identities of samples and quality of genome sequences. We have also completed the building of R markdowns that provide higher level summaries for researchers interested in mining the data for more basic questions on the biology of samples.
Aim II. Implement and integrate tools to apply Meta-WGS for disease diagnostics.
Obj. 1. Develop standards and optimize preparatory workflow for use in plant clinics. This work was completed and its efficacy has been demonstrated for detecting Xylella fastidiosa. We have also completed testing the applicability of these methods for detecting plant-associated pathogens that reside within leaf tissues and among more complex microbial communities.
Obj. 2. Integrate available Meta-WGS analysis tools into a workflow for use in plant clinics. This was completed. We used Nanopore sequencing to characterize samples infected by members of the genus Xylella to survey population diversity and inform on infections by subspecies complexes. More details can be found in products (Abdelrazek et al., 2024).
Obj. 3. Develop and optimize novel machine learning algorithms. This was completed (Johnson et al., 2023). Briefly, we constructed a k-mer frequency table and found that a random forest model had the best combination of run-time and accuracy, based on the analysis of tomato metagenomes.
Aim III. Plant disease clinics validate the developed protocols and tools.
Obj. 1. Validation using inoculated samples. This was completed previously and is described in Aim II, objective 2.
Obj. 2. Validation using field collected samples. This was completed, as described in Aim I, objective 1 and again in Iruegas-Bocardo et al (2024).
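As an illustration of the k-mer frequency table mentioned under Aim II, objective 3, the following is a minimal sketch rather than the published method. The function names, the choice of k = 3, and the toy reads are assumptions made for the example. Each read becomes one row of normalized k-mer frequencies in a fixed column order.

```python
from collections import Counter
from itertools import product

def kmer_frequencies(seq, k=3):
    """Normalized frequencies of all 4**k ACGT k-mers in seq, in a fixed order.

    k-mers containing other characters (for example N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]

def kmer_table(reads, k=3):
    """One row of k-mer frequencies per read."""
    return [kmer_frequencies(r, k) for r in reads]

# Hypothetical toy reads; real input would be long metagenomic reads.
reads = ["ACGTACGTAC", "GGGCCCGGGC", "ATATATATAT"]
table = kmer_table(reads, k=3)
print(len(table), "reads x", len(table[0]), "k-mer columns")
```

Rows of such a table, paired with known labels, are the kind of input a classifier such as a random forest could be trained on; that training step is omitted here.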

Publications

Type: Peer Reviewed Journal Articles Status: Published Year Published: 2024 Citation: Sudermann, M.A., Foster, Z.S.L., Chang, J.H., Grunwald, N.J. (2024) Metabarcoding for plant pathologists. Canadian Journal of Plant Pathology. 46(2), 142-160.
Type: Peer Reviewed Journal Articles Status: Published Year Published: 2025 Citation: Iruegas-Bocardo, F., Sutton, W., Buchanan, R.A., Grunwald, N.J., Chang, J.H., and Putnam, M.L. (2025) Canker and dieback of Alnus rubra is caused by Lonsdalea quercina. Phytopathology. 115:112-116
Type: Peer Reviewed Journal Articles Status: Awaiting Publication Year Published: 2025 Citation: Abdelrazek, S., Rodriguez Salamanca, L., and Vinatzer, B.A. (2025) Metagenomic sequencing of tomato plants with wilt symptoms allows for strain-level pathogen identification and genome-based characterization. Phytopathology (in press)

Progress 03/01/23 to 02/29/24

Outputs
Target Audience: Diagnosticians, growers and farmers, researchers.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? At Oregon State University, we trained one postdoctoral researcher and two undergraduate students. At Virginia Tech, we trained one research associate and one graduate student. At Ohio State University, we trained one postdoctoral researcher. Training is trans-disciplinary and includes biology, computer science, mathematics, and statistics.
How have the results been disseminated to communities of interest? Results have been disseminated via a peer-reviewed publication and in presentations at conferences.
What do you plan to do during the next reporting period to accomplish the goals?
1. Include use of long-read sequences in the pipeline.
2. Add POCP to the pipeline.
3. Deploy the pipeline at the OSU Plant Clinic.
4. Complete the manuscript on the pipeline.
5. Run a workshop at APS (Memphis, TN) on the pipeline.
6. Complete a manuscript that describes a machine learning pipeline that assigns importance to SNPs or genes.

Impacts
What was accomplished under these goals?
Aim I. Implement and integrate tools to apply WGS for disease diagnostics.
Obj. 1. Develop standards and optimize preparatory workflow for use in plant clinics. This aim was completed in a prior reporting period.
Obj. 2. Integrate currently used WGS analysis tools into a workflow for use in plant clinics. We are extremely excited about our accomplishments under this objective and believe that this pipeline has the potential to be transformational for diagnostic settings and highly useful for research scientists. The pipeline is functional and nearly complete. It was built in the nf-core framework (Ewels et al. 2020). Our pipeline is based on Nextflow, a workflow engine that uses software containers such as Docker and Singularity, or the Conda management system, to run pipelines independent of the execution environment (Di Tommaso et al. 2017). This pipeline takes raw Illumina reads and performs whole genome analyses of bacteria. It automates several analysis steps, including: 1) processing reads, 2) quickly inferring a taxonomic identity, 3) assembling and annotating genome sequences, 4) clustering orthologs, 5) identifying from public databases the closest related strains, 6) building core genome and SNP phylogenies, and 7) building minimum spanning networks. Two powerful attributes allow non-experts without sufficient computing infrastructure to use our pipeline. These are: 1) automated analysis, including automated identification of publicly available close reference genome sequences, and 2) coupling the Seqera Platform to AWS cloud infrastructure. The Seqera Platform manages workflows while AWS provides cost-effective computing infrastructure. A functional pipeline is already available and being beta tested by collaborators. Approximately 15 modules, many of which have general purposes, have been developed.
Obj. 3. Develop visuals to effectively interpret and communicate WGS data. Our pipeline generates an interactive HTML report as well as a static PDF report. The interactive HTML report includes phylogenetic trees as well as a minimum spanning network. Last, the report provides key information to the diagnostician, such as taxonomic identities of samples and quality of genome sequences. We are also building R markdowns that provide higher level summaries for researchers who may be interested in mining the data for more basic questions on the biology of samples.
Aim II. Implement and integrate tools to apply Meta-WGS for disease diagnostics.
Obj. 1. Develop standards and optimize preparatory workflow for use in plant clinics. This aim was completed in a prior reporting period.
Obj. 2. Integrate available Meta-WGS analysis tools into a workflow for use in plant clinics. This has been completed. We used Nanopore sequencing to characterize samples infected by members of the genus Xylella to survey population diversity and inform on infections by subspecies complexes. More details can be found in products (Abdelrazek et al., 2024).
Obj. 3. Develop and optimize novel machine learning algorithms. This was nearly complete in the previous annual report and has been finished during this reporting period. More details can be found in products (Johnson et al., 2023).
Aim III. Plant disease clinics validate the developed protocols and tools.
Obj. 1. Validation using inoculated samples. This was completed previously.
Obj. 2. Validation using field collected samples. The Virginia Tech and Oregon State University Plant Clinics have archived over 50 samples and their corresponding pathogen isolates and have used traditional methods for diagnosis.
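To give a sense of what underlies the minimum spanning network shown in the report, here is a simplified sketch. It is not the pipeline's implementation: it builds a minimum spanning tree with Prim's algorithm over pairwise SNP (Hamming) distances, whereas a minimum spanning network additionally retains equally short alternative links. The isolate names and aligned SNP profiles are hypothetical.

```python
def hamming(a, b):
    """Number of differing positions between two aligned SNP profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))

def minimum_spanning_tree(profiles):
    """Prim's algorithm; returns a list of (isolate_u, isolate_v, distance) edges."""
    names = sorted(profiles)
    if not names:
        return []
    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        # Shortest link from any isolate already in the tree to one outside it.
        d, u, v = min(
            (hamming(profiles[u], profiles[v]), u, v)
            for u in sorted(in_tree)
            for v in names
            if v not in in_tree
        )
        edges.append((u, v, d))
        in_tree.add(v)
    return edges

# Hypothetical aligned SNP profiles for four isolates.
profiles = {
    "isolate_A": "ACGTACGT",
    "isolate_B": "ACGTACGA",
    "isolate_C": "ACGTTCGA",
    "isolate_D": "TCGTTCGA",
}
for u, v, d in minimum_spanning_tree(profiles):
    print(f"{u} -- {v}: {d} SNP(s)")
```

In this toy example the isolates form a chain in which each link differs by a single SNP, the kind of pattern a diagnostician might read as stepwise divergence from a common source.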

Publications

Type: Conference Papers and Presentations Status: Other Year Published: 2023 Citation: Genomic Source Attribution of Salmonella Using Machine Learning on Metagenomics Sequencing Data. School of Plant and Environmental Sciences Annual Symposium. Chinnareddy S, Liao J, Li S. October 5th 2023.
Type: Conference Papers and Presentations Status: Other Year Published: 2023 Citation: PathogenDx: Automated Analysis of Whole Genome Sequencing Data for the Identification and Analysis of Pathogen Populations. ICPP. Bocardo, Foster, Phan, Witherell, Weisberg, Putnam, Chang, and Grunwald. 2023
Type: Conference Papers and Presentations Status: Other Year Published: 2023 Citation: Xanthomonas hortorum pv. pelargonii in geranium. Plant Health 2023. APS. Denver, CO. Roman-Reyna, V. 2023.
Type: Journal Articles Status: Published Year Published: 2023 Citation: Crosby KC, Rojas M, Sharma P, Johnson MA, Mazloom R, Kvitko BH, Smits TH, Venter SN, Coutinho TA, Heath LS, Palmer M, Vinatzer BA (2023) Genomic delineation and description of species and within-species lineages in the genus Pantoea. Frontiers in Microbiology. doi.org/10.3389/fmicb.2023.1254999
Type: Journal Articles Status: Published Year Published: 2024 Citation: Abdelrazek S, Bush E, Oliver CL, Liu H, Sharma P, Flores MA, Donegan MA, Almeida R, Nita M, Vinatzer BA (2024) A survey of Xylella fastidiosa in the US state of Virginia reveals wide distribution of both subspecies fastidiosa and multiplex in grapevine. Phytopathology. doi.org/10.1094/PHYTO-06-23-0212-R
Type: Journal Articles Status: Published Year Published: 2023 Citation: Johnson MA, Vinatzer BA, Li S (2023) Reference-Free Plant Disease Detection Using Machine Learning and Long-Read Metagenomic Sequencing. Applied and Environmental Microbiology. doi.org/10.1128/aem.00260-23
Type: Journal Articles Status: Published Year Published: 2024 Citation: Roman-Reyna, V, Sharma, A, Toth, H, Konkel, Z, Lmiotek, N, Murthy, S, Faith, S, Slot, J, Hand, F, Goss, E, Jacobs, J (2024) Live tracking of a plant pathogen outbreak reveals rapid and successive, multidecade plasmid reduction. mSystems. 9(2): https://doi.org/10.1128/msystems.00795-23.

Progress 03/01/22 to 02/28/23

Outputs
Target Audience: Diagnosticians, growers and farmers, researchers.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? At Oregon State University, two postdoctoral researchers and one post-bacc student were trained. At The Ohio State University, one postdoctoral researcher, one graduate research assistant, and one undergraduate student were trained. At Virginia Tech, one graduate student was trained in developing machine learning tools and one research faculty member was trained in using nanopore sequencing. All were trained at the intersection of plant disease diagnostics, evolutionary biology, genomics as well as metagenomics, and machine learning.
How have the results been disseminated to communities of interest? Results have been disseminated via a peer-reviewed publication to the greater scientific community and, by way of its newsletter, to the National Plant Diagnostic Network. Our project was presented at the 2022 national meeting of the National Plant Diagnostic Network in Davis, CA. This group comprises university and state department of agriculture diagnostic professionals. Marcela A. Johnson, Haijie Liu, Elizabeth Bush, Parul Sharma, Shu Yang, Reza Mazloom, Lenwood S. Heath, Mizuho Nita, Song Li, Boris A. Vinatzer. "Long-read metagenomics to investigate plant disease outbreaks beyond plant pathogen detection and identification". AMS Fall Central Sectional Meeting. Sept 17-18, 2022. El Paso, TX. Oral presentation. Marcela A. Johnson. "From CS to Bioinformatics and Beyond". UTEP Bioinformatics colloquium. Sept 16, 2022. El Paso, TX. Oral presentation.
What do you plan to do during the next reporting period to accomplish the goals?
Aim I. Implement and integrate tools to apply WGS for disease diagnostics.
Obj. 2. Integrate currently used WGS analysis tools into a workflow for use in plant clinics. We have gained a strong familiarity with the Nextflow programming language and expect to complete module and pipeline development during the next period. We will also develop methods that automate the selection of reference genome sequences.
Obj. 3. Develop visuals to effectively interpret and communicate WGS data. We will determine the type of information required in reports. We expect to draft a report that includes easily accessible visualizations. We will test the utility of Nextflow Tower, which allows users to launch Nextflow pipelines from a web browser and execute them in local, HPC, or cloud-based computing environments. Nextflow Tower is a potential method for sharing reports.
Aim II. Implement and integrate tools to apply Meta-WGS for disease diagnostics.
Obj. 2. Integrate available Meta-WGS analysis tools into a workflow for use in plant clinics. We will continue using Meta-WGS for disease diagnostics.
Obj. 3. Develop and optimize novel machine learning algorithms. We will use simulations to develop alignment-free methods for diagnostics, e.g., to predict virulence gene sequences.
Aim III. Plant disease clinics validate the developed protocols and tools.
Obj. 2. Validation using field collected samples. Plant Disease Clinics are currently isolating DNA from samples and isolates. We expect to use our methods to analyze the DNA and compare findings with those derived from traditional diagnostic methods. We will present our project at APS and ICPP in the summer of 2023.

Impacts
What was accomplished under these goals?
Aim I. Implement and integrate tools to apply WGS for disease diagnostics.
Obj. 1. Develop standards and optimize preparatory workflow for use in plant clinics. This aim has been completed and published in Iruegas-Bocardo et al (2023). Briefly, we reported on the effects of sequencing depth on genome assembly and on the accuracy of calling single nucleotide polymorphisms (SNPs). In addition, we reported on the importance of comparing not only core genomes, but also accessory genomes, when analyzing whole genome sequences for drawing conclusions on epidemiological links. We demonstrated that SNP calling programs and reference genome sequences (relationship to samples and quality of assemblies) can have significant effects on conclusions. Last, in Iruegas-Bocardo et al (2023), we made recommendations on best practices. We would also like to highlight that this published work was done in collaboration with multiple plant clinics and showed the importance of sharing genomic data among a network of clinics.
Obj. 2. Integrate currently used WGS analysis tools into a workflow for use in plant clinics. WGS analysis requires expertise in biocomputing, and computing tools often require specific environments. To overcome these hurdles, we are implementing our pipeline in the nf-core framework (Ewels et al. 2020). Our pipeline is based on Nextflow, a workflow engine that uses software containers such as Docker and Singularity, or the Conda management system, to run pipelines independent of the execution environment (Di Tommaso et al. 2017). A powerful feature of nf-core is that it has a large community of developers who contribute open-source modules. This flexibility is highly advantageous because modules can be incorporated into a diversity of pipelines. Moreover, nf-core has oversight to ensure all contributed modules meet its standards. To date, we have contributed four modules to nf-core and connected them to several others available in nf-core to produce a partial pipeline that aligns reads to a reference to identify SNPs. Using the same dataset reported in Iruegas-Bocardo et al (2023), we demonstrated a proof-of-concept implementation of a Nextflow pipeline and showed it performed as expected.
Obj. 3. Develop visuals to effectively interpret and communicate WGS data. As part of the nf-core pipeline, we included tools to visualize relationships in a minimum spanning network. We are also in the process of building modules for genome assembly and constructing core genome phylogenies as an additional visualization tool. We are also working on report formats suitable for sharing by email or as a web presentation.
Aim II. Implement and integrate tools to apply Meta-WGS for disease diagnostics.
Obj. 1. Develop standards and optimize preparatory workflow for use in plant clinics. This work was completed in the previous year.
Obj. 2. Integrate available Meta-WGS analysis tools into a workflow for use in plant clinics. We have begun using meta-WGS in the diagnostic setting. We used Nanopore sequencing to characterize samples infected by members of the Ralstonia and Xylella genera to survey population diversity and inform on infections by subspecies complexes. We used meta-WGS to support diagnoses of an emergent fungal pathogen that causes vascular disease on dogwood, redbud, and maple. Additionally, we used Illumina sequencing to diagnose fungal and bacterial pathogens of diverse plant hosts including tomato, pepper, potato, cabbage, and geranium. We are comparing different Illumina read depths to define the appropriate limits for pathogen detection with meta-WGS.
Obj. 3. Develop and optimize novel machine learning algorithms. We previously developed and tested a k-mer based machine learning method using both convolutional neural networks and random forest. The machine learning methods were tested with nanopore sequencing data generated for Pseudomonas syringae-infected tomato leaves and for Xylella-infected grapevine. Findings showed that for the P. syringae data, the majority of reads from infected samples were from the pathogen and other microorganisms, and the machine learning methods were able to distinguish pathogen-derived reads from plant-derived reads based largely on differential abundance. For the Xylella data, a substantial number of reads in the samples were from the host plant. Nonetheless, the machine learning method was able to differentiate pathogen reads based on GC content, as reads derived from host plants tended to be more AT rich. A revised manuscript is currently under review.
Aim III. Plant disease clinics validate the developed protocols and tools.
Obj. 1. Validation using inoculated samples. This was completed. See objective 2 of Aim II.
Obj. 2. Validation using field collected samples. The three Plant Clinics have archived over 250 tissue samples and their corresponding pathogen isolates and have used multiple means to arrive at a diagnosis. This material will be used as a check on the Meta-WGS analyses of the samples.
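The GC-content signal noted under Aim II, objective 3 can be made concrete with a short sketch. This is only an illustration of the feature, not the machine learning method itself. The function names, the toy reads, and the 0.5 threshold are assumptions chosen for the example; a real cutoff would depend on the host and pathogen genomes.

```python
def gc_content(seq):
    """Fraction of G and C among the A, C, G, T bases of a read (0.0 if none)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt

def split_by_gc(reads, threshold):
    """Partition reads into (GC at or above threshold, GC below threshold)."""
    high = [r for r in reads if gc_content(r) >= threshold]
    low = [r for r in reads if gc_content(r) < threshold]
    return high, low

# Hypothetical toy reads: two GC-rich, two AT-rich.
reads = ["GGCCGCGGCA", "ATATTTAATA", "GCGCGGCCAT", "TTAAATATGA"]
high_gc, low_gc = split_by_gc(reads, threshold=0.5)  # threshold is an assumption
print(len(high_gc), "reads at or above threshold;", len(low_gc), "below")
```

A trained classifier effectively learns such a boundary from labeled reads, together with other sequence features, instead of relying on a fixed cutoff.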

Publications

Type: Journal Articles Status: Accepted Year Published: 2023 Citation: Iruegas-Bocardo, F., Weisberg, A. J., Riutta, E. R., Kilday, K., Bonkowski, J. C., Creswell, C., Daughtrey, M. L., Rane, K., Grünwald, N. J., Chang, J. H., and Putnam, M. L. (2023). Whole genome sequencing-based tracing of a 2022 introduction and outbreak of Xanthomonas hortorum pv. pelargonii. Phytopath. https://doi.org/10.1094/PHYTO-09-22-0321-R.
Type: Journal Articles Status: Published Year Published: 2022 Citation: Bernal E, Rotondo F, Roman-Reyna V, Klass T, Timilsina S, Minsavage GV, Iruegas-Bocardo F, Goss EM, Jones JB, Jacobs JM, Miller SA, Francis DM. Migration Drives the Replacement of Xanthomonas perforans Races in the Absence of Widely Deployed Resistance. Front Microbiol. 2022 Mar 18;13:826386. doi: 10.3389/fmicb.2022.826386. https://www.frontiersin.org/articles/10.3389/fmicb.2022.826386/full
Type: Journal Articles Status: Accepted Year Published: 2023 Citation: Roman-Reyna V, Curland RD, Velez-Negron Y, Ledman KE, Gutierrez Castillo DE, Beutler J, Butchacas J, Brar G, Roberts R, Dill-Macky R, Jacobs JM. Development of genome-driven, lifestyle-informed markers for identification of the cereal-infecting pathogens Xanthomonas translucens pathovars undulosa and translucens. Phytopathology. 12 Oct 2022 (epub ahead of print) https://doi.org/10.1094/PHYTO-07-22-0262-SA
Type: Theses/Dissertations Status: Accepted Year Published: 2023 Citation: https://vtechworks.lib.vt.edu/handle/10919/113825

Progress 03/01/21 to 02/28/22

Outputs
Target Audience: Plant disease diagnosticians and researchers. Changes/Problems: Nothing Reported What opportunities for training and professional development has the project provided? One postdoctoral researcher (Oregon State University), one postdoctoral researcher (The Ohio State University), one graduate student (Virginia Tech), and one undergraduate (Oregon State University) have been trained at the intersection of plant disease diagnostics, evolutionary biology, genomics, metagenomics, and machine learning. How have the results been disseminated to communities of interest? Results have been disseminated via peer-reviewed publications and manuscripts in review. We have also communicated results to our advisory board, whose members represent federal and state regulatory agencies, lead diagnostic clinics, and/or have leadership roles in the national plant diagnostics network. We (investigators from each institution) have presented at departmental and national meetings. We have also communicated our project to individual stakeholders and have invited their participation (submitting samples and providing feedback on findings). Last, we developed and presented workshops on using these methods in diagnostics. Workshops were delivered at four US universities and two international institutions. What do you plan to do during the next reporting period to accomplish the goals? Aim I. Implement and integrate tools to apply WGS for disease diagnostics. Obj. 2. Integrate currently used WGS analysis tools into a workflow for use in plant clinics. We will continue to develop PathID for automating data processing and analyses. We will start developing the three databases of marker genes. We will circumscribe additional species, subspecies, and phylogroups within species in LINbase with a focus on Agrobacterium and related genera. Obj. 3. Develop visuals to effectively interpret and communicate WGS data. We will continue adapting tools such as Nextstrain for data visualization. Aim II. 
Implement and integrate tools to apply Meta-WGS for disease diagnostics. Obj. 2. Integrate available Meta-WGS analysis tools into a workflow for use in plant clinics. We will write detailed protocols on how to use the devised Illumina and nanopore analysis workflow so that it can be tested by disease clinic personnel. Obj. 3. Develop and optimize novel machine learning algorithms. We will further study the machine learning methods using interpretable machine learning approaches. We will test DeepLIFT and RF-SHAP, two advanced feature selection methods, to select informative k-mers for the classification. We will determine the k-mer distributions in the genome and identify the minimum set of informative k-mers needed to achieve high classification accuracy. For samples with low accuracy, we plan to test a two-stage ML model in which the first-layer model determines whether reads are from the host genome or the pathogen genome, and the second-layer model determines whether the sample can be classified as infected or healthy. Aim III. Plant disease clinics validate the developed protocols and tools. Obj. 1. Validation using inoculated samples. Oregon State University will continue infecting plants to generate samples for future use. VT will also start this objective and compare results with those obtained from samples consisting of known combinations of mock microbial community DNA, healthy plant DNA, and known pathogen DNA at different concentrations. VT will share DNA of these samples and of inoculated samples with the group at Ohio State University so that results obtained at VT with nanopore sequencing can be compared with those obtained at Ohio State University with Illumina sequencing. Obj. 2. Validation using field-collected samples. Plant Disease Clinics will continue archiving samples for use once pipelines are ready to be tested. 
For Goal IV, we have arranged a speaking engagement at the National Plant Diagnostic Network meeting in April of 2022. At this event, we will present our accomplishments and future goals to scientists from the 50 states and US territories.
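The two-stage model planned under Aim II, Obj. 3 could be organized along the following lines. This is a structural sketch only: the read-level classifier is a hypothetical placeholder callable, the sample-level stage is reduced to a simple threshold on the fraction of pathogen reads, and the 0.1 cutoff and toy reads are assumptions rather than project values.

```python
from typing import Callable, Iterable


def pathogen_fraction(reads: Iterable[str],
                      is_pathogen_read: Callable[[str], bool]) -> float:
    """Stage 1: label each read as host or pathogen; return the pathogen fraction."""
    reads = list(reads)
    if not reads:
        return 0.0
    return sum(1 for read in reads if is_pathogen_read(read)) / len(reads)


def classify_sample(reads: Iterable[str],
                    is_pathogen_read: Callable[[str], bool],
                    min_fraction: float = 0.1) -> str:
    """Stage 2: call the sample infected or healthy from the stage 1 output.

    A trained second-layer model would replace this threshold rule.
    """
    fraction = pathogen_fraction(reads, is_pathogen_read)
    return "infected" if fraction >= min_fraction else "healthy"


def toy_read_classifier(read: str) -> bool:
    """Hypothetical stand-in for a trained read-level model."""
    return "GCGC" in read


infected_like = ["ATATGCGCAT", "ATATATATAT", "TTTTAAAATT", "AATTAATTAA"]
healthy_like = ["ATATATATAT", "TTTTAAAATT", "AATTAATTAA"]
print(classify_sample(infected_like, toy_read_classifier))
print(classify_sample(healthy_like, toy_read_classifier))
```

Separating the two stages keeps the read-level and sample-level decisions independently testable, so either stage can be swapped for a trained model without changing the other.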

Impacts
What was accomplished under these goals? Goal I. Obj. 1. Reads from deeply sequenced and previously assembled genomes varying in size from approximately 5 Mb to 50 Mb have been used to assess the impact of sequencing depth on genome coverage, genome assembly, and robustness of SNP calls. For bacterial genomes, a depth of coverage of 20X consistently yielded results comparable to assemblies derived from all reads. For eukaryotic pathogen genomes, 40X coverage is necessary. Findings will be used to guide diagnosticians on multiplexing strategies prior to sequencing and on the reliability of results after sequencing. A similar approach was employed to determine the minimal quality for genome assemblies to be used in LINbase. First, assemblies using only Illumina reads were compared with hybrid assemblies using both short Illumina reads and long nanopore reads. It was found that the two types of assemblies were assigned the same LINs up to position U (99.99% ANI). Therefore, we concluded that closing a genome assembly with long reads does not need to be included in our workflow. Second, assemblies of different quality (with regard to number of contigs, N50, and length of shortest contig) were made using different numbers of Illumina reads. We found that as long as the number of contigs was below 500, the N50 was above 50,000 bp, and the shortest contig was longer than 500 bp, the assigned LINs stayed the same up to position P (99.925% ANI). This ANI threshold is higher than the breadth of the clonal, cool-virulent brown rot pandemic lineage (approximately corresponding to the select agent R. solanacearum Race 3 biovar 2), which has an ANI breadth of 99.9% (manuscript in preparation based on results obtained as part of USDA APHIS Farmbill project AP19PPQS&T00C083). Therefore, the minimal assembly quality (contig number <500; N50 >50,000 bp; shortest contig >500 bp) will be used for our disease diagnostics workflow for WGS-based identification using LINbase. 
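The minimal assembly quality criteria stated above can be checked with a short sketch. The three thresholds come from the text; the function names and the example contig lengths are assumptions for illustration, and this is not the project's actual workflow code.

```python
def n50(contig_lengths):
    """Return N50: the contig length at which the cumulative sum of contigs,
    sorted longest first, reaches at least half of the total assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


def passes_assembly_qc(contig_lengths,
                       max_contigs=500,
                       min_n50=50_000,
                       min_shortest=500):
    """Apply the three criteria from the text as strict inequalities:
    contig number < 500, N50 > 50,000, shortest contig > 500."""
    if not contig_lengths:
        return False
    return (len(contig_lengths) < max_contigs
            and n50(contig_lengths) > min_n50
            and min(contig_lengths) > min_shortest)


# Hypothetical contig lengths (bp) for two assemblies.
good_assembly = [2_000_000, 1_500_000, 800_000, 1_000]
poor_assembly = [2_000_000, 400]
print(n50(good_assembly), passes_assembly_qc(good_assembly))
print(n50(poor_assembly), passes_assembly_qc(poor_assembly))
```

The second assembly fails only because its shortest contig is 400 bp, which shows that all three criteria must hold together.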
Last for this aim, using an in-house Illumina MiniSeq purchased with funds received from a state agency, we have made significant progress in developing a standard workflow for preparing DNA and making libraries. Obj. 2. We are in the process of scripting an automated workflow that will initiate data processing and analyses once sequencing reads are transferred onto our servers (PathID). We are adapting our pipeline for Nextflow, which overcomes key issues with portability, reproducibility, and continuous checkpointing. We are currently examining a recently developed Nextflow-based pipeline called Bactopia to assess the ease with which we can adapt it for our needs. We have circumscribed Pseudomonas, Xanthomonas, Xylella, and Ralstonia species, subspecies, and phylogroups within species in LINbase. The Ralstonia circumscriptions were done as part of the USDA APHIS Farmbill project AP19PPQS&T00C083. Any user of LINbase can now identify a genome sequence as a member of any of the circumscribed groups. We are in the process of adding circumscriptions of genome-based taxa (including validly published named species as well as genome-similarity-based genomospecies) for Agrobacterium and related genera as well. We have also used currently available genome sequences to calibrate the LINbase classification scheme to that of hierBAPS. This is a crucial step for determining sub-species level relationships of plant pathogens and for identifying suitable references for calling single nucleotide polymorphisms. To make LINbase and the planned PathID Web server compatible with each other, we are developing an application programming interface (API) for LINbase so that the future PathID Web server can communicate with LINbase. Obj. 3. We are testing Nextstrain as a potential method for rapidly visualizing data. Goal II. Obj. 1. This work has been completed and its efficacy has been demonstrated for detecting Xylella fastidiosa (see products). 
We are currently testing the applicability of these methods for detecting plant-associated pathogens that reside within leaf tissues and among more complex microbial communities. Obj. 2. For Meta-WGS using nanopore sequencing, we have devised the workflow shown in Figure 11. Once this workflow has been validated, we will train the Virginia Tech Plant Disease Clinic personnel in using this workflow. Obj. 3. We have developed and tested a k-mer based machine learning method using both convolutional neural networks and random forest. The machine learning methods were tested with nanopore sequencing data generated for Pseudomonas syringae-infected tomato leaves and for Xylella-infected grapevine. For the P. syringae data, we have achieved accuracy above 95% using both models. For the Xylella data, we can achieve only 65% accuracy. The success rates seem to be related to the complexity of the sequencing library. In P. syringae samples, the majority of reads from infected samples were from the pathogen and other microorganisms. In contrast, for the Xylella data, a substantial number of reads in the samples were from the host plant. Goal III. Obj. 1. For nanopore sequencing, we have mostly used field-collected samples so far. Analyzing these samples with Meta-WGS and comparing the obtained results with qPCR, we have realized the need to start validation using known concentrations of DNA from mock microbial communities mixed with known concentrations of pure healthy plant DNA and pure pathogen DNA. We will compare Meta-WGS results obtained with these samples with qPCR to compute detection thresholds. Once this step is completed, we will transition to inoculated samples. We have inoculated tomato plants with a strain of agrobacteria. Total DNA will be collected from galls of various ages. These samples will be used in the future to test the efficacy of Meta-WGS methods. Obj. 2. 
Not yet started, but the associated plant disease clinics have begun archiving diagnosed samples that will be used in the future to assess the efficacy of WGS- and Meta-WGS-based detection methods.
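The k-mer based classification described under Goal II, Obj. 3 starts from a numeric representation of each read. A minimal sketch of that feature-extraction step follows, assuming normalized k-mer frequencies as features; the function name, the choice of k, and the example read are illustrative assumptions, and the project's actual feature encoding may differ.

```python
from collections import Counter
from itertools import product


def kmer_profile(read: str, k: int = 3):
    """Return normalized frequencies of all 4**k A/C/G/T k-mers in a fixed
    lexicographic order. k-mers containing other symbols (e.g. N) are ignored."""
    seq = read.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in kmers)
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


# "ACGTACGT" contains the 2-mers AC, CG, GT, TA, AC, CG, GT (7 in total).
profile = kmer_profile("ACGTACGT", k=2)
print(len(profile))
print(profile[1])  # frequency of "AC", the second 2-mer in lexicographic order
```

A fixed-length vector of this kind could be passed to a random forest or a neural network; the fixed k-mer ordering keeps feature positions comparable across reads.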

Publications

Type: Journal Articles Status: Accepted Year Published: 2021 Citation: Roman-Reyna et al., (2021; https://doi.org/10.1128/mSystems.00591-21)