Progress 04/15/24 to 04/14/25
Outputs
Target Audience: We reached bioinformaticians and software developers by releasing two major software packages that help analyze whole-genome sequencing data on pangenome graphs and help analyze the results. The KhufuEnv includes a set of 132 tools that efficiently analyze large NGS datasets. We reached plant breeders, genomicists, geneticists, and peanut researchers by presenting at conferences including Advances in Arachis Genomics and Biotechnology (2025, Goa, India), the American Peanut Research and Education Society annual meeting (2025, Richmond, VA), the National Association of Plant Breeders annual conference (2025, Hawaii), and Plant and Animal Genomics (2025, San Diego, CA). We reached professionals in the field of breeding and genomics by presenting at the first PAG webinar series (July 10, 2025). The webinar had more than 500 registrations, and attendees represented 30 countries.
Changes/Problems: There have been no major changes in approach. We did add one intriguing activity: we selected the most promising single plants, increased them in winter nurseries, and started evaluating plots for drought tolerance and white mold resistance. The first-year plots were extremely tolerant to drought. We look forward to the second-year evaluations.
What opportunities for training and professional development has the project provided? This project has provided training for one PhD student and two postdoctoral researchers.
How have the results been disseminated to communities of interest? Results have been disseminated through national and international conferences (PAG, AAGB, APRES, NAPB) and through webinars (PAG/GenomeWeb webinar series). We have released two pieces of software through GitHub for bioinformaticians.
What do you plan to do during the next reporting period to accomplish the goals? The next reporting period will include a third season of field phenotyping, full analysis of the first two years, and final analysis of the recombination events within the population. We will also finalize the remaining publications and complete the review process for the manuscripts currently under review.
Impacts
What was accomplished under these goals?
The project has three main objectives:
Objective 1 - Evaluate the impact of a multiparent crossing scheme for creating novel combinations of alleles in peanut and identifying functional variation
Objective 2 - Genetically map molecular targets for strong drought tolerance and disease resistance segregating in the MAGIC population
Objective 3 - Utilize pan-genomic graphs as a method to increase the efficiency of genetic mapping in complex populations
Objective 1
The first step toward accurately identifying crossovers across the generations of the MAGIC population is to derive a novel imputation strategy that utilizes the population structure. Using genotyping information from each level of the 18-way MAGIC population (specific to each cross), we can iteratively impute genotypes across each level for skim-seq data and assess the recombination architecture at the F1s. Because we have sequencing data through the final crossing scheme, we can also assess how much of the initial diversity was lost and whether there is any bias toward certain founders in the population. Given that MAGIC populations are often used for mapping, we will also compare the accuracy of using the initial founder HiFi data to impute the members of the 16-way cross versus an iterative imputation pipeline that accounts for genotypes across each generation. This may provide insight into the importance of subsampling within MAGIC generations for genotyping, since imputed genotypes serve as the foundation for identifying QTL in MAGIC populations.
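The layer-by-layer idea behind iterative imputation can be illustrated with a minimal Python sketch. This is not the project's pipeline: the function names, the sample names, and the toy rule (a missing call is filled only when both parents carry the same homozygous genotype) are assumptions made purely for illustration. What it shows is how the imputed genotypes of one crossing level become the reference for the next level.

```python
from typing import Dict, List, Optional, Tuple

# Toy genotype coding (an assumption for this sketch):
# 0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate, None = missing
Genotype = Optional[int]


def impute_from_parents(child: List[Genotype],
                        parent1: List[Genotype],
                        parent2: List[Genotype]) -> List[Genotype]:
    """Fill a missing child call only when both parents share one homozygous call."""
    imputed = []
    for c, a, b in zip(child, parent1, parent2):
        if c is None and a is not None and a == b and a in (0, 2):
            imputed.append(a)   # offspring of identical homozygotes must match them
        else:
            imputed.append(c)   # keep observed calls; leave ambiguous sites missing
    return imputed


def iterative_impute(founders: Dict[str, List[Genotype]],
                     levels: List[Dict[str, Tuple[str, str]]],
                     observed: Dict[str, List[Genotype]]) -> Dict[str, List[Genotype]]:
    """Impute level by level; each imputed level joins the reference for the next."""
    reference = dict(founders)
    for level in levels:
        for child, (p1, p2) in level.items():
            reference[child] = impute_from_parents(
                observed[child], reference[p1], reference[p2])
    return reference


# Hypothetical 4-founder example with three variant sites.
founders = {"A": [0, 2, 0], "B": [0, 2, 2], "C": [0, 0, 2], "D": [0, 2, 2]}
levels = [{"AB": ("A", "B"), "CD": ("C", "D")},   # 2-way crosses
          {"ABCD": ("AB", "CD")}]                 # 4-way cross
observed = {"AB": [None, None, None],
            "CD": [None, 1, None],
            "ABCD": [None, None, None]}

result = iterative_impute(founders, levels, observed)
print(result["AB"], result["CD"], result["ABCD"])
```

In this toy example, sites that cannot be resolved at one level stay missing and therefore cannot inform the next level, which is one reason the accuracy of each layer matters for the layers built on it.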
Pangenome graph construction: Captures 4,795,144 variants across 18 haplotypes
Sequencing of MAGIC population: PanMap consists of 669,891 filtered variants across 770 individuals. Average coverage: 0.77x
Sequencing of "ground-truth" MAGIC population subset: PanMap consists of filtered variants across 34 "ground-truth" samples, which are being down-sampled to 1x, 0.5x, and 0.25x to assess the impact of coverage on variant calling and imputation. Average coverage: 6.3x
Iterative imputation pipeline: Starting at the F1 generation, imputation initially uses HiFi (>20x) long-read sequencing data of the parents as the reference and then iteratively imputes missing genotypes in the following generations of each cross, with the imputed genotypes of each layer serving as the reference for the next. The current working script is available on GitHub (https://github.com/hw85241/MAGIC) and will serve as a resource for others who take on this approach.
Objective 2
In 2023, two field experiments were conducted. First, 740 individual plants (50 checks) - representing F1 families from an 8-way cross - were germinated in the greenhouse. The plants were sampled for DNA, tagged, and transplanted in the field within rows of the white mold-tolerant cultivar Georgia 12Y. One month after transplanting, plants were inoculated with agar disk punches of Sclerotium rolfsii mycelium, and the plots were well watered for two days to promote fungal growth. Disease ratings were taken every week starting one week after inoculation. Disease ratings were based on a 1 - 5 scale where 1 represents no disease and 5 represents plant death. Ratings of 2, 3, and 4 represent disease progression from small lesions on the mainstem to larger lesions on the mainstem and lateral stems. A total of 700 plants were sequenced using Twist 96-plex whole-genome sequencing library prep. The average sequencing depth across the population was 0.86x.
The resistant and susceptible tails of the disease distribution were identified for genetic mapping. Analysis is still in progress.
Second, 900 plants (100 checks) - representing F1 families from a separate 8-way cross - were germinated in the greenhouse. The plants were sampled for DNA, tagged, and transplanted in the field spaced one foot apart. After 100 days, drought shelters were placed over the plots to induce stress. Visual drought stress ratings were taken at three time points starting 10 days after stress induction, with the final rating taken 40 days after stress induction. The visual rating used a 1 to 5 scale where 1 is no visual stress (wilting) and 5 is completely brown and dead. To control for edge effects (the edges of the shelter received less stress than the middle rows), the plants were separated into edge rows (600) and middle rows (300). All 900 plants were sequenced to achieve ~1x coverage. For the edge rows (600 plants), the bulks averaged a rating of 1 for the tolerant bulk and 4 for the susceptible bulk. For the middle rows (300 plants), the tolerant bulk averaged a rating of 1.4 and the susceptible bulk averaged a rating of 4.8. Both sets of bulks will be analyzed separately and together.
In 2024, the experiment was replicated with 700 plants analyzed for white mold and 900 plants analyzed for drought. All 1,600 plants were genotyped with ~1x coverage WGS, and the variation has been cataloged on the pangenome graph. Analysis is currently underway.
Objective 3
To efficiently profile variation on the pangenome graph, we developed a comprehensive pipeline that prepares graphs from Minigraph-Cactus (Hickey et al., 2024), calls variants with vg tools and filters them for accuracy, and filters the final variant calls by allele frequency, depth, and missing data. We named this pipeline KhufuPAN, a pangenome-aware version of our internal software Khufu (https://www.hudsonalpha.org/khufudata/). KhufuPAN can be accessed publicly at https://github.com/w-korani/KhufuPAN.
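As an illustration of the final filtering step described above (allele frequency, depth, and missing data), the following Python sketch applies all three criteria to a handful of made-up variant records. It is not KhufuPAN code: the `Variant` record, the function names, and the threshold values are hypothetical and chosen only to make the logic concrete.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Variant:
    variant_id: str
    genotypes: List[Optional[int]]  # alt-allele count per sample (0, 1, 2); None = missing
    mean_depth: float               # mean read depth across samples


def missing_rate(v: Variant) -> float:
    """Fraction of samples with no genotype call."""
    return sum(g is None for g in v.genotypes) / len(v.genotypes)


def minor_allele_frequency(v: Variant) -> float:
    """Minor allele frequency among called diploid genotypes (0.0 if none are called)."""
    called = [g for g in v.genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


def passes_filters(v: Variant,
                   min_maf: float = 0.05,      # hypothetical threshold
                   min_depth: float = 0.5,     # hypothetical threshold
                   max_missing: float = 0.5) -> bool:  # hypothetical threshold
    """Keep a variant only if it meets all three criteria."""
    return (minor_allele_frequency(v) >= min_maf
            and v.mean_depth >= min_depth
            and missing_rate(v) <= max_missing)


# Made-up records for six samples.
variants = [
    Variant("v1", [0, 2, 2, 0, None, 2], 0.9),        # informative, well covered
    Variant("v2", [0, 0, 0, 0, 0, 0], 1.0),           # monomorphic
    Variant("v3", [None, None, None, None, 2, 0], 0.8),  # mostly missing
    Variant("v4", [0, 2, 0, 2, 0, 2], 0.1),           # too shallow
]

kept = [v.variant_id for v in variants if passes_filters(v)]
print(kept)
```

With low-pass data (around 1x or less), the depth and missing-data thresholds have to be far more permissive than for deep sequencing, which is why they are exposed as parameters rather than fixed values in this sketch.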
To efficiently filter and query graph-based variants, we also created the KhufuEnv (Wright et al., 2025). KhufuEnv is a suite of tools that can be used to convert, filter, and query hapmap and panmap files. We developed a new format, the panmap, that contains the information necessary to understand the variation from a graph and how it relates to the graph structure. It includes critical information such as the graph genome from which the allele is derived and the size and sequence of the variant.
References
Hickey G, Monlong J, Ebler J, Novak AM, Eizenga JM, Gao Y, Abel HJ, Antonacci-Fulton LL, Asri M, Baid G, Baker CA, Belyaeva A, Billis K, Bourque G, Buonaiuto S, Carroll A, Chaisson MJP, Chang P-C, Chang XH, Cheng H, Chu J, Cody S, Colonna V, Cook DE, Cook-Deegan RM, Cornejo OE, Diekhans M, Doerr D, Ebert P, Ebler J, Eichler EE, Fairley S, Fedrigo O, Felsenfeld AL, Feng X, Fischer C, Flicek P, Formenti G, Frankish A, Fulton RS, Garg S, Garrison E, Garrison NA, Giron CG, Green RE, Groza C, Guarracino A, Haggerty L, Hall IM, Harvey WT, Haukness M, Haussler D, Heumos S, Hoekzema K, Hourlier T, Howe K, Jain M, Jarvis ED, Ji HP, Kenny EE, Koenig BA, Kolesnikov A, Korbel JO, Kordosky J, Koren S, Lee H, Lewis AP, Liao W-W, Lu S, Lu T-Y, Lucas JK, Magalhães H, Marco-Sola S, Marijon P, Markello C, Marschall T, Martin FJ, McCartney A, McDaniel J, Miga KH, Mitchell MW, Mountcastle J, Munson KM, Mwaniki MN, Nattestad M, Nurk S, Olsen HE, Olson ND, Pesout T, Phillippy AM, Popejoy AB, Porubsky D, Prins P, Puiu D, Rautiainen M, Regier AA, Rhie A, Sacco S, Sanders AD, Schneider VA, Schultz BI, Shafin K, Sibbesen JA, Sirén J, Smith MW, Sofia HJ, Tayoun ANA, Thibaud-Nissen F, Tomlinson C, Tricomi FF, Villani F, Vollger MR, Wagner J, Walenz B, Wang T, Wood JMD, Zimin AV, Zook JM, Marschall T, Li H, Paten B, Human Pangenome Reference Consortium (2024) Pangenome graph construction from genome alignments with Minigraph-Cactus. Nature Biotechnology 42:663-673
Lee K, Korani W, Bentz PC, Pokhrel S, Ozias-Akins P, Harkess A, Vaughn J, Clevenger J (2025) Long-Read Low-Pass Sequencing for High-Resolution Trait Mapping. bioRxiv 2025.01.09.632261; doi: https://doi.org/10.1101/2025.01.09.632261
Wright HC, Davis CEM, Clevenger J, Korani W (2025) KhufuEnv, an auxiliary toolkit for building computational pipelines for plant and animal breeding. bioRxiv 2025.03.28.645917; doi: https://doi.org/10.1101/2025.03.28.645917
Publications