Microbial Pathogenomics of the Major Cacao Diseases

Goals / Objectives
The aim of this project is to sequence the major fungal and oomycete pathogens of cacao and to build a database that will serve as a platform for studying the geographical diversity, host specificity, and biology of these pathogens. The project is divided into three main goals: 1) Genomic DNA sequencing and assembly: sequence, assemble, and annotate the genomes of the major fungal and oomycete pathogens of cacao. All data will be promptly deposited in existing, robust online platforms for genomic data sharing and mining so that researchers and plant breeders can benefit from this effort in the short term. 2) Multiple isolate re-sequencing and variant identification: assess the diversity of pathogen populations in a specific region or in different hosts. 3) Protein-coding gene prediction: functionally annotate the protein-coding genes in the genomes, with a focus on cataloguing potential virulence factors, including effectors.

Project Methods
A combination of single-molecule sequencing, using Pacific Biosciences RS (PacBio) technology, and Illumina sequencing will be used to generate high-quality reference genomes for each of the species sequenced (Moniliophthora perniciosa, Moniliophthora roreri, and Phytophthora spp.). For genome sequencing of reference isolates, PacBio will be the preferred technology for all pathogens, provided that sufficient amounts of high molecular weight DNA are available. Libraries of 20 to 30 kb will be prepared in the Cantu Lab and sequenced by Novogene. We will target at least 100X coverage for all pathogens. For exceptionally challenging DNA extractions, a combination of mate-pair and paired-end libraries will be sequenced on an Illumina MiSeq (600 bp reads) and HiSeq2500 (150 bp reads) to achieve 100X coverage. We expect the longer reads generated by the MiSeq to significantly improve the current assemblies for the pathogens whose draft genomes are already available.
For genome re-sequencing of additional isolates (genetic diversity, see WS 2), paired-end reads (150 bp) will be generated on the Illumina HiSeq2500 to achieve at least 60X coverage per isolate.
Both Illumina sequencing of large-insert mate-pair libraries and PacBio sequencing significantly improve the scaffolding of complex genomes. PacBio produces very long reads that can span repeats and structural variation breakpoints. Although PacBio reads average 15 kb, reads longer than 50 kb, without GC bias or systematic errors, are not uncommon. Their length and low bias make PacBio reads particularly appropriate for resolving complex repeats and filling gaps in de novo assemblies. Mate-pair library sequencing is now commonly used to resolve genomic repeats, detect structural variants, and scaffold together distant regions of an assembly.
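The coverage targets above translate directly into sequencing yield. The following is a minimal sketch of that arithmetic, assuming a hypothetical 50 Mb genome (the actual genome sizes are not stated here) and treating every read as having the nominal length quoted in the text; the function names are illustrative only.

```python
import math

def reads_needed(genome_size_bp: int, target_coverage: float, read_length_bp: int) -> int:
    """Smallest number of reads whose total bases reach the target coverage.

    Coverage is defined as (number of reads * read length) / genome size.
    """
    return math.ceil(genome_size_bp * target_coverage / read_length_bp)

def coverage(n_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average depth of coverage obtained from n_reads reads of equal length."""
    return n_reads * read_length_bp / genome_size_bp

# ASSUMPTION: placeholder genome size, not a value from the project description.
GENOME_SIZE_BP = 50_000_000

# Reference isolates: 100X with PacBio reads averaging 15 kb.
pacbio_reads = reads_needed(GENOME_SIZE_BP, 100, 15_000)

# Re-sequenced isolates: 60X with 150 bp Illumina reads.
hiseq_reads = reads_needed(GENOME_SIZE_BP, 60, 150)

if __name__ == "__main__":
    print(f"PacBio reads for 100X: {pacbio_reads:,}")
    print(f"HiSeq reads for 60X:   {hiseq_reads:,}")
    print(f"Check: {coverage(hiseq_reads, 150, GENOME_SIZE_BP):.1f}X")
```

Real read-length distributions are not uniform, so in practice the yield would be planned from total bases rather than read counts; the sketch only illustrates how the 100X and 60X targets scale with genome size.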
PacBio reads will be assembled using the HGAP 3.0 (for haploid genomes) or FALCON-Unzip (for diploid genomes) pipeline, followed by Quiver for error correction. In previous work, mapping Illumina reads to a PacBio assembly, followed by the GATK variant-discovery pipeline, detected 103 sites at which the Illumina and PacBio sequences disagreed. Assuming that the Illumina reads were correct, we concluded that the genome sequence obtained using SMRT technology was 99.99976% accurate. In this project, the same strains used for PacBio sequencing will also be sequenced with Illumina to estimate residual error rates in the PacBio assemblies and, where the base calls of the PacBio assemblies and the Illumina reads disagree, to correct those sites. Illumina reads will be assembled using multiple approaches (MaSuRCA, CLC, SOAPdenovo, SPAdes). The assembler that yields the most complete scaffolds and gene space and the best assembly metrics will be chosen, and its parameters will be optimized to improve the assemblies. K-mer analysis will be conducted to estimate genome sizes and to determine the proportion of each genome represented in the assembled scaffolds. Gene space completeness will be estimated with CEGMA. Raw read processing, read assembly, gene annotation, expression analysis, and the genetic diversity study will be conducted in the Cantu Lab.
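Two of the estimates in the paragraph above are simple ratios, sketched below under stated assumptions. The accuracy figure follows from the number of discordant sites divided by the assembly length; the assembly length is not given in the text, so the sketch back-calculates the length implied by 103 sites and 99.99976% accuracy. The k-mer estimate uses the common approximation of total counted k-mers divided by the depth of the main k-mer peak; the k-mer counts shown are hypothetical placeholders, and all function names are illustrative.

```python
def consensus_accuracy(discordant_sites: int, assembly_length_bp: int) -> float:
    """Fraction of assembly positions that agree with the Illumina reads."""
    return 1.0 - discordant_sites / assembly_length_bp

def implied_assembly_length(discordant_sites: int, accuracy: float) -> float:
    """Assembly length (bp) implied by a discordant-site count and an accuracy."""
    return discordant_sites / (1.0 - accuracy)

def kmer_genome_size(total_kmers: int, peak_depth: float) -> float:
    """Genome size estimate: total k-mers counted / depth of the main k-mer peak."""
    return total_kmers / peak_depth

def fraction_assembled(assembled_bp: int, estimated_genome_bp: float) -> float:
    """Proportion of the estimated genome represented in the scaffolds."""
    return assembled_bp / estimated_genome_bp

if __name__ == "__main__":
    # 103 sites and 99.99976% come from the text; the length is derived, not reported.
    length = implied_assembly_length(103, 0.9999976)
    print(f"Implied assembly length: {length / 1e6:.1f} Mb")
    print(f"Accuracy at that length: {consensus_accuracy(103, round(length)) * 100:.5f}%")

    # ASSUMPTION: hypothetical k-mer statistics, for illustration only.
    size = kmer_genome_size(total_kmers=2_500_000_000, peak_depth=50)
    print(f"k-mer genome size estimate: {size / 1e6:.1f} Mb")
    print(f"Fraction assembled: {fraction_assembled(45_000_000, size):.2f}")
```

The simple peak-depth formula ignores sequencing errors, heterozygosity, and repeats, so it serves only as a first approximation of genome size.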