Source: UNIV OF MARYLAND submitted to NRP
SEEING DOUBLE: EVIDENCE FOR COPY NUMBER VARIANTS IN TRANSGENIC CROP RESISTANCE AND METHODS FOR THEIR EARLY DETECTION
Sponsoring Institution
National Institute of Food and Agriculture
Project Status
ACTIVE
Funding Source
Reporting Frequency
Annual
Accession No.
1032647
Grant No.
2024-33522-43706
Cumulative Award Amt.
$649,847.00
Proposal No.
2024-03788
Multistate No.
(N/A)
Project Start Date
Sep 1, 2024
Project End Date
Aug 31, 2027
Grant Year
2024
Program Code
[HX]- Biotechnology Risk Assessment
Recipient Organization
UNIV OF MARYLAND
(N/A)
COLLEGE PARK,MD 20742
Performing Department
(N/A)
Non Technical Summary
This research seeks to understand the frequency and mechanisms by which pests overcome plant resistance traits conferred by engineered genes (BRAG priority 5i), and develop a monitoring framework to improve transgenic crop durability. We have shown that genomic monitoring can track changes in resistance genotypes of wild pests over time, and this information can be leveraged to detect emerging resistance and trigger remediation. Our prior work focused on single nucleotide polymorphisms, but genomic monitoring can detect multiple types of genetic variants conferring pest resistance. We recently found that gene copy number variants (CNVs) strongly contribute to the field-evolved transgenic crop resistance observed in H. zea, our pest model. Understanding the role of CNVs in resistance evolution will be key to insect resistance management because CNVs often act to broadly increase detoxification and metabolism, conferring cross resistance to many compounds (including insecticides expressed by transgenic crops). Our proposed work will generate data and develop algorithms to improve detection of emerging resistance caused by CNVs. Using long read sequencing and targeted historical sequencing, we will characterize genome wide CNVs in H. zea, which will provide important insight into their role in transgenic crop resistance. To improve CNV detection from genomic monitoring data, we will develop a novel algorithm and benchmark it against available tools, providing a novel resource for agricultural researchers and regulators. This work will advance our understanding of which genomic patterns correspond to resistance associated CNV evolution and develop approaches for detection of those evolutionary signals from genomic monitoring data.
Animal Health Component
50%
Research Effort Categories
Basic
25%
Applied
50%
Developmental
25%
Classification

Knowledge Area (KA) | Subject of Investigation (SOI) | Field of Science (FOS) | Percent
211 | 3110 | 1080 | 100%
Knowledge Area
211 - Insects, Mites, and Other Arthropods Affecting Plants;

Subject Of Investigation
3110 - Insects;

Field Of Science
1080 - Genetics;
Goals / Objectives
Objective 1: Measure variation among CNV resistance haplotypes using long-read sequencing. To accomplish Objective 1, we will identify CNVs by sequencing H. zea samples collected before, during, and after resistance evolution with long-read technology. Using whole genome nanopore sequencing, we will reconstruct genome-wide CNV haplotypes in three Cry1Ab resistant samples (Objective 1a). We will also describe population level haplotypic variation in CNVs for 75 samples using targeted sequencing of one resistance associated chromosome (Chr9; Objective 1b). These data will allow us to recover full resistance associated haplotypes, improve our understanding of the evolutionary processes giving rise to resistance associated CNVs, and provide a set of validated CNVs for benchmarking CNV detection algorithms.
Objective 2. Characterize the ancestral state of the Cry1Ab resistance associated genomic region in H. zea. To accomplish Objective 2, we will sequence the region of Chr9 containing the resistance-related CNV for field-collected museum quality H. zea from 1996. These data will allow us to examine whether the CNV existed as standing genetic variation in H. zea prior to commercial release of Bt crops. Sequence data from these samples will then be compared to existing DNA sequence data from samples collected in later years, allowing us to determine both the extent of DNA variation linked to the CNV and how selection shaped Chr9 immediately following commercial release of Bt transgenic crops.
Objective 3. Develop novel computational methods for discovering resistance-related CNVs from data produced by genomic monitoring. To accomplish Objective 3, we will first generate benchmark datasets and use them to evaluate the utility of existing methods for discovering resistance-related CNVs from genomic monitoring data (Supporting Objective 3a).
Second, we will develop new algorithms, specifically targeting wild insect population data collected across multiple time points (Supporting Objective 3b). Finally, we will apply our methods to real H. zea time-series data to evaluate their potential for improved resistance monitoring (Supporting Objective 3c).
Project Methods
Our plan of work combines two evolutionary genomics experiments to determine the number and types of mutational events that resulted in Bt resistance evolution (Objective 1), and to test whether the trypsin CNV on Chr9 existed as standing genetic variation prior to commercial release of Bt crops (Objective 2). Results from these studies will help determine how selection shaped H. zea's Chr9 as Bt adoption grew. Understanding these patterns will allow us to improve upon existing genomic approaches for resistance monitoring. Specifically, we will use data from our first two objectives, gather existing data from previous BRAG-funded work, and develop simulated datasets to benchmark existing software for identification of CNVs in insects (Objective 3). Understanding the strengths and limitations of existing algorithms, typically developed for use in human cancer genomics, will enable us to improve upon or develop novel algorithms for use in other organisms (Objective 3).
For Objective 1, we will sequence genomes of field-collected individuals from 2002, 2012, and 2019 with ultra-long reads from an Oxford Nanopore PromethION. This will allow us to assemble through repetitive regions and recover full sequences of each CNV. With the addition of Illumina short read data for polishing of the error prone long reads, we will resolve full CNV haplotypes with high confidence. We will initially use long read whole genome sequencing for one individual per time point (Objective 1a). This will allow us to characterize the CNV landscape across the genome, identify genomic breakpoints, and recover full copies of each duplicated gene separately. We will also identify which CNVs are likely to play a role in resistance by comparison to already identified genomic windows under selection and resistance QTL.
We will then use targeted Nanopore adaptive sequencing to identify resistance associated haplotypic diversity on Chr9 (containing our Bt resistance CNV) in 25 individuals from each of our three time points (Objective 1b), as Bt resistance spread. Data sets generated for Objective 1a and 1b will allow us to use both alignment-based and assembly-based CNV detection, providing two lines of evidence for each CNV. After assembly, we will predict and manually curate gene models for the CNV, and reconstruct gene amplification history. Based upon our gene models, we will also identify potential effects of variants on gene copy function, detect signals of selection separately for each gene copy, and compare gene expression across copies using existing RNA-seq data.
For Objective 2, we will empirically generate ancestral genetic information (e.g. prior to deployment of a widespread management tool) for wild H. zea exposed to increasing Bt selective pressure. This will reveal early diversity at the trypsin CNV and its flanking regions. To accomplish this objective, we will use target sequence capture to prepare Illumina libraries for 36 ancestral H. zea samples. These target capture libraries will allow us to sequence much of the 5-6 Mb region of Chr9. Target capture reads will be mapped to the H. zea genome, followed by SNP calling. This will generate SNPs and coverage depth data which can be used to analyze evolutionary genomic patterns at (and outside of) our target CNV. SNPs and coverage depth data from this ancestral dataset will be compared to samples collected from the same geographic region in later years (2002, 2008, 2010, 2012, and 2017; n = 25-30 per year). Within this comparative framework, we will test our hypothesis that 1996 samples will have the highest diversity in our CNV region, but that strong selection on the CNV in later years should result in an appreciable decline in diversity, both within the trypsin cluster and in the flanking regions.
We will also compare depth of coverage across years (scaled to mean overall depth per sample) to test for evidence of the CNV in the ancestral population. Finally, we will ask whether the CNV or the haplotypes flanking the modern CNV existed at detectable frequencies in these ancestral samples, providing insights into their presence as standing genetic variation and qualities that might influence their detection in genomic monitoring data.
For Objective 3, we will combine existing and novel data streams to test existing algorithms/software, as well as improve upon or develop new algorithms/software for CNV detection from genomic monitoring data in the following stages.
Stage 1 (Supporting Objective 3a): Generate benchmark datasets and use them to evaluate the utility of existing methods for discovering resistance-related CNVs from genomic monitoring data.
We will generate three benchmark datasets. Benchmark (A) will be created by Illumina sequencing of up to four H. zea trios. Benchmark (B) will be synthetic genomic monitoring data for H. zea. Specifically, we will perform SLiM4 population simulations to create synthetic genomes with variants under neutral evolution (genetic drift) and positive selection (e.g. due to conferring a resistance phenotype). The resulting simulated genomes will be used to guide the introduction of copy number and single nucleotide variants into an H. zea reference genome, producing time-series population data. Lastly, reads will be generated from these synthetic genomes using the ART simulator. Benchmark (C) will be created in a similar fashion to benchmark B but using a broader set of genomes drawn from the USDA-ARS Ag100Pest Initiative. Benchmark A will be used to evaluate the utility of existing methods for discovering and genotyping CNVs from insect genomes (e.g. CNVnator, CNVcaller, DELLY, and MANTIS). Benchmarks B and C will be used to evaluate the utility of existing methods for discovering resistance-related CNVs from time-series data.
First, we will apply methods of CNV calling and genotyping (e.g. CNVnator, CNVcaller, DELLY, and MANTIS). Second, we will transform the resulting CNV genotypes into two-state variants, indicating whether there are 2 copies or not. Third, we will apply methods for detecting two-state variants under positive selection from time-series population data (e.g. FIT and Timesweeper). At each stage, we will evaluate methods for accuracy and computational efficiency.
Stage 2 (Supporting Objective 3b): Develop new algorithms, specifically targeting wild insect population data collected across multiple time points.
We will develop new methods that address the limitations of existing methods identified during benchmarking. We will also develop a novel statistical framework for identifying copy number variants (i.e. multi-state variants) under positive selection from time-series population data, as no such method exists for this task. Lastly, we will develop tools for validating and inspecting results.
Stage 3 (Supporting Objective 3c): Apply our methods to real H. zea time-series data to evaluate the potential for improved resistance monitoring.
We will apply the bioinformatics pipelines developed in our project to genomic monitoring data collected for H. zea from 2002, 2008, 2010, 2012, and 2017 (n = 25-30 per year). We will run pipelines taking each of these years as the present time point to determine the earliest year that resistance-related CNVs (e.g. the trypsin 77 CNV) can be detected. This will help us to determine the utility of our methods for early detection.
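The two-state transformation described in Stage 1 can be illustrated with a minimal Python sketch. It assumes per-individual integer copy-number genotypes grouped by collection year; the function names and the toy values are hypothetical and are not drawn from the project's data or from any CNV caller's interface. An individual is coded 1 when its copy number differs from the diploid baseline of 2, and the per-year frequency of that state is the kind of summary a time-series selection test could take as input.

```python
from typing import Dict, List

DIPLOID_BASELINE = 2  # expected copy number when there is no gain or loss


def to_two_state(copy_numbers: List[int], baseline: int = DIPLOID_BASELINE) -> List[int]:
    """Code each genotype as 0 (baseline copy number) or 1 (any other copy number)."""
    return [0 if cn == baseline else 1 for cn in copy_numbers]


def variant_frequency_by_year(calls: Dict[int, List[int]]) -> Dict[int, float]:
    """Return, for each year, the fraction of individuals in the non-baseline state."""
    freqs = {}
    for year in sorted(calls):
        states = to_two_state(calls[year])
        freqs[year] = sum(states) / len(states) if states else float("nan")
    return freqs


# Hypothetical copy-number genotypes at one CNV locus (illustrative values only).
toy_calls = {
    2002: [2, 2, 2, 2],
    2012: [2, 3, 2, 4],
    2017: [3, 4, 3, 2],
}

if __name__ == "__main__":
    for year, freq in variant_frequency_by_year(toy_calls).items():
        print(year, freq)
```

Collapsing genotypes to two states discards the difference between, for example, 3 and 4 copies, which is the limitation the multi-state framework proposed in Stage 2 is intended to address.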

Progress 09/01/24 to 08/31/25

Outputs
Target Audience: Our target audience for this reporting period included basic and applied entomologists, applied geneticists, and regulatory agencies involved in resistance management for pesticidal biotechnologies. An additional target audience includes developers of algorithms for variant calling from short read data.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided?
Funding for this project has provided Co-PD Taylor with the opportunity to recruit an undergraduate student researcher at Hofstra University. This student will complete a year-long senior capstone project focused on the downstream analysis of the genome assemblies produced this year. The undergraduate student will learn about the research process, fundamental genetics and genomics, and will develop bioinformatics skills including gene annotation, variant effect prediction, and gene tree reconstruction. PD Fritz has had the opportunity to train one postdoctoral researcher (Ben Schultz) at the University of Maryland. Ben began his postdoctoral position in June of 2025 after a lengthy search. He is learning the H. zea study system, further developing his bioinformatics/data analytics skills, and is beginning to mentor other trainees involved in this project. Co-PD Molloy has been provided the opportunity to train one University of Maryland graduate student (Junyan Dai). Junyan Dai will enter his second year of the computer science PhD program in fall 2025. Since his hire in spring 2025, Junyan has learned fundamentals of genomics and population genetics, familiarized himself with the H. zea study system, and developed bioinformatics skills. This training is critical for Junyan to apply knowledge acquired through his PhD coursework in computer science (e.g., algorithm design and analysis, parallel computing, and machine learning) to this project.
How have the results been disseminated to communities of interest?
PD Fritz and co-PD Taylor each shared results from this project in presentations given at the 2025 Plant and Animal Genome Conference. These presentations reached academic researchers in the fields of agriculture, entomology, and applied evolutionary biology. PD Fritz also gave one invited seminar on results from this project at the University of Virginia during the spring of 2025, reaching academic researchers in the fields of basic and applied evolutionary biology.
What do you plan to do during the next reporting period to accomplish the goals?
Objective 1: Measure variation among CNV resistance haplotypes using long-read sequencing
In the next reporting period we will complete the downstream analysis of the two newly assembled haplotypes. Downstream analysis will include manual gene model curation, analysis of potential functional impacts of gene variants, and reconstruction of gene amplification evolutionary history. Additionally, using the sequencing and analysis approach validated in this project period, we will target 2 more individuals known to have different serine protease homologue gene copy numbers for whole genome PacBio HiFi sequencing. Whole genome HiFi sequencing is more expensive than our original targeted adaptive nanopore approach. Using only whole genome HiFi sequencing we could accomplish our planned objectives, but with a more limited sample size. Due to our interest in describing haplotypic variation in and among multiple populations, we plan to try one more targeted adaptive nanopore sequencing run, with a modified protocol. In consultation with the sequencing center, we decided that hard masking the reference sequence and modifying DNA isolation protocols to minimize fragmentation could yield significantly better results.
If we are unable to produce satisfactory results with this second attempt at adaptive nanopore sequencing of chromosome 9, we will fully pivot to using only whole genome PacBio HiFi sequencing, which we have shown is capable of describing variation in these important resistance haplotypes.
Objective 2. Characterize the ancestral state of the Cry1Ab resistance associated genomic region in H. zea.
In the upcoming reporting period, we will use the newly assembled haploid genomes from resistant and susceptible individuals (see Obj 1) to develop baits for target-capture sequencing. We will work with Arbor BioSciences or a similar company to optimize baitsets and synthesize them. Museum quality H. zea samples collected from before or shortly after the commercial release of Cry1-expressing crops will be identified and submitted for sequencing at Arbor BioSciences. Upon receipt of the sequencing data, we will begin bioinformatic analyses to characterize haplotypic diversity at the time of commercial release of Bt crops.
Objective 3. Develop novel computational methods for discovering resistance-related CNVs from data produced by genomic monitoring.
In Fall 2025, we will complete our CNV benchmarking study and prepare this work for publication in a peer-reviewed scientific journal. We have planned several analyses to finalize our preliminary results. First, we will evaluate whether filtering CNV regions impacts relative method performance; for example, we will filter CNV regions associated with repetitive sequences (e.g., transposable elements) as well as those identified in all individuals, which likely reflect common differences from the reference genome. Second, we will compare methods in terms of copy number. Specifically, many methods, including LUMPY, identify CNV regions associated with gains or losses but do not indicate a specific copy number, whereas other methods like CNVkit indicate both the regions and the copy number.
As our preliminary results indicate that LUMPY more accurately identifies CNV regions than CNVkit, we will evaluate the impact of calling copy number in the CNV regions identified by LUMPY with different methods. For this evaluation, we will consider read mappability scores of the CNV regions. Finally, we will extend our benchmarking study to include a greater number of CNV callers, including those developed in recent years (e.g., SurVIndel2, Nat Comm, 2024). In Spring 2026, we will begin working on supporting objective 3b: developing an algorithm for identifying CNVs in a population associated with resistance evolution.

Impacts
What was accomplished under these goals?
At the time of writing this proposal, we had identified that one chromosome with the strongest effect on Cry1Ab resistance in H. zea contained a copy number variant of multiple trypsin-like genes. This unexpected finding of a non-target site mechanism of resistance led to our proposed objectives, which we use to organize our accomplishments below.
Objective 1: Measure variation among CNV resistance haplotypes using long-read sequencing (20% complete)
The major goal of objective 1 was to characterize variation in Cry resistance copy number haplotypes using long-read sequencing. In year 1 we planned to complete the sequencing and in years 2 and 3 we planned to complete analysis and publication. Our planned approach focused on targeted Nanopore long read sequencing of chromosome 9 for three populations with whole genome sequencing for a small number of individuals. As targeted Nanopore sequencing of chromosomes is an emerging technology, we completed a trial sequencing run on a subset of samples. Analysis of that trial sequencing run suggested that repetitive content on the targeted chromosome negatively impacted sequencing yield, coverage, and the resulting assemblies. Based on those preliminary results from the targeted sequencing we pivoted to focus on using whole genome long read sequencing with PacBio HiFi reads as an alternative approach to characterize resistance haplotypes. Using whole genome PacBio HiFi sequencing and a trio-binning assembly approach, we successfully produced high quality resistant and susceptible whole genome assemblies. These assemblies contained fully-resolved resistant and susceptible haplotypes of the target region. The resistant haplotype included three copies of the ~200 kb region described in our previously published papers.
Automated annotation also identified three copies of most of the genes in that ~200 kb region, including the trypsin-like genes (serine protease homologs), which are the gene candidates we have linked to resistance evolution. The resistant assembly includes full sequences of each gene copy and break points for the amplification events. Pivoting to whole genome PacBio HiFi sequencing not only allowed us to recover the sequences of the amplified genes we were targeting, but will also make possible genome-wide analyses investigating the role of gene amplification in resistance evolution broadly. With these first assemblies, we have developed an analysis pipeline that can be applied to other samples to begin describing variation in resistance haplotypes.
Objective 2. Characterize the ancestral state of the Cry1Ab resistance associated genomic region in H. zea (2% complete)
The major goal of objective 2 was to characterize ancestral variation at the region of Chr9 containing the resistance-related CNV for H. zea by sequencing historical, museum-quality samples from our collection. We had planned to begin bait development and sequencing samples starting in year 1, and to complete analyses and publication in year 2. During this performance period, we advertised for a postdoc. However, we were unable to find a candidate after our first call for applications, delaying the start of this objective. We were able to hire a postdoctoral researcher to assist with objective 2 in June of 2025. They have begun identifying and working with gDNA samples that we wish to use for objective 2.
Objective 3. Develop novel computational methods for discovering resistance-related CNVs from data produced by genomic monitoring (25% complete)
The major goal of objective 3 was to develop a computational method for discovering resistance-related copy number variants (CNVs) from genomic monitoring data (Years 1-3). We proposed to accomplish this goal through three supporting objectives.
First, we planned to evaluate the utility of existing methods for calling CNVs from whole genome sequencing of H. zea individuals (Supporting Objective 3a). Second, we planned to develop an algorithm that takes CNVs called from genomic monitoring data as input and identifies those likely to be associated with resistance evolution (Supporting Objective 3b). Third, we planned to apply CNV callers and our method to real H. zea time-series data to evaluate its potential for resistance monitoring (Supporting Objective 3c). We proposed to complete supporting objectives 3a, 3b, and 3c in years 1-2, years 2-3, and year 3, respectively. Trio data sets are considered a gold standard tool for benchmarking variant callers, as Mendelian consistency can be evaluated between the parents and offspring. To address our supporting objective 3a, we have generated a benchmarking data set by performing whole genome sequencing of 4 H. zea trios (12 individuals total). To our knowledge, this is the first trio data set generated for H. zea or related systems. Moreover, each trio is an F1 cross between Cry resistant and susceptible individuals, with the respective presence or absence of the CNV on chromosome 9 confirmed via digital droplet PCR. Simultaneously, we recruited a computer science PhD student (Junyan Dai), who is leading the CNV calling benchmarking efforts. Junyan has just completed benchmarking three of the most popular CNV callers: CNVkit, GATK-gCNV, and LUMPY. These respective methods made 16.5, 10358.5, and 45982 CNV calls on average, of which 6 (36%), 2976.5 (29%), and 1745.5 (4%) were inconsistent with Mendelian inheritance. Not only did LUMPY achieve the lowest inconsistency rate, but it was also the only method to correctly identify CNVs in the regions targeted by digital droplet PCR across all individuals. To summarize, our preliminary results demonstrate that (1) popular CNV callers are highly variable in terms of their performance on H. zea, highlighting the need for benchmarking, and that (2) the method LUMPY may produce sufficiently accurate CNV calls for genomic monitoring of H. zea.
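The inconsistency percentages reported for the three callers follow directly from the average call counts. As a check on that arithmetic, the short Python sketch below recomputes them; the counts are those stated in this report, while the dictionary layout and the function name are illustrative only.

```python
# Average CNV calls and average Mendelian-inconsistent calls, as stated in the report.
reported = {
    "CNVkit": (16.5, 6.0),
    "GATK-gCNV": (10358.5, 2976.5),
    "LUMPY": (45982.0, 1745.5),
}


def inconsistency_rate(total_calls: float, inconsistent_calls: float) -> float:
    """Fraction of calls that are inconsistent with Mendelian inheritance."""
    if total_calls <= 0:
        raise ValueError("total_calls must be positive")
    return inconsistent_calls / total_calls


rates = {name: inconsistency_rate(total, bad) for name, (total, bad) in reported.items()}

if __name__ == "__main__":
    for name, rate in rates.items():
        print(f"{name}: {rate:.1%}")
```

Rounded to whole percentages, the three rates come to 36%, 29%, and 4%, matching the figures given in the text.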

Publications