MaSuRCA-SY software for synteny-assisted high quality de novo individual genome assembly from Illumina data


Sponsoring Institution

National Institute of Food and Agriculture

Project Status

COMPLETE

Funding Source

AFRI COMPETITIVE GRANT

Reporting Frequency

Annual

Accession No.

1015789

Grant No.

2018-67015-28199

Cumulative Award Amt.

$500,000.00

Proposal No.

2017-05202

Multistate No.

(N/A)

Project Start Date

May 15, 2018

Project End Date

May 14, 2022

Grant Year

2018

Program Code

[A1201]- Animal Health and Production and Animal Products: Animal Breeding, Genetics, and Genomics

Recipient Organization
JOHNS HOPKINS UNIVERSITY
3400 N CHARLES ST W400 WYMAN PARK BLDG
BALTIMORE,MD 21218-2680

Performing Department
Biomedical Engineering

Non Technical Summary
High-quality individual animal genome assemblies will help in reaching one of the central goals of genetic studies: determining the relationship between phenotype and genotype. Discovering the genetic features that govern growth, meat quality, feed efficiency and disease resistance is of the utmost importance for livestock breeding programs. Ideally one would determine the complete genomic sequence of many different individuals and then perform studies based on the SNP data and on structural variation discovered by aligning their genome sequences to each other. At present this is cost-prohibitive, both because of the sequencing cost of producing long-read data for high-quality de novo assembly of a mammalian genome and because of high computational costs. The long-term goal of this project is to address these issues by developing computational techniques that use inexpensive Illumina data, allow fast computation on inexpensive hardware, and yield high-quality assemblies of individual genomes aided by synteny to one or more reference genomes of the same or a closely related species.

Animal Health Component

25%

Research Effort Categories

Basic

75%

Applied

25%

Developmental

(N/A)

Classification

Knowledge Area (KA)	Subject of Investigation (SOI)	Field of Science (FOS)	Percent
304	3999	1080	100%

Knowledge Area
304 - Animal Genome;

Subject Of Investigation
3999 - Animal research, general;

Field Of Science
1080 - Genetics;

Keywords

Goals / Objectives
Determining the relationship between phenotype and genotype is a central goal of genetic studies. Discovering the genetic features that govern growth, meat quality, feed efficiency and disease resistance is of the utmost importance for livestock breeding programs. Ideally one would determine the complete genomic sequence of many different individuals and then perform statistical studies based on aligning their genome sequences to each other. At present this is cost-prohibitive, both because of the sequencing cost of producing data for de novo assembly of a mammalian genome and because of high computational costs. The goal of this project is to address these issues with the following two Specific Objectives:
Specific Objective 1: We propose to develop a technique that will allow the production of high-quality de novo genome assemblies of individual animals from low-cost Illumina paired-end data with the aid of synteny information from one or more reference genomes of closely related species.
Specific Objective 2: We propose to study how the effectiveness of our technique varies depending on: (a) how distant the related reference is from the target genome; (b) how much Illumina coverage is available; and (c) the number of reference genomes available. We propose to use yeast, American bison and domestic cow data sets to evaluate the performance of the technique.

Project Methods
Specific objective 1 -- methods.
We will start with an Illumina paired-end read data set with 2x150 bp reads covering the genome at 50x-100x. Based on the results of past projects, about 60x may be optimal, but this value changes constantly as sequencing technology changes. The proposed assembly technique is shown in Figure 3 and can be outlined in the following steps:
Super-read construction. We first transform Illumina paired-end reads into what we call "super-reads". To do that, we build a compressed database of all K-mers (note the uppercase K) in all reads. Then, starting with each read as a seed, we use K-mers to extend the read sequence forwards and backwards, stopping when it is ambiguous how to continue. The K-mers used for building the super-reads are typically 51 to 127 bp with current Illumina technologies. The extended sequence is called a "super-read" and is described in more detail in (Zimin et al., 2013). Generally many reads extend to the same sequence, stopping at the same ambiguity, a branching of the K-mer graph. Virtually every read is in some super-read, and there are typically 100 to 400 times fewer super-reads than Illumina paired-end reads. By design, super-reads provide contiguous overlapping coverage of the genome. For the different genomes to which we have applied this technique, average super-read sizes have ranged from 300 to 2,200 bp, depending on the Illumina read sizes, heterozygosity and repeat content of the genome sequenced.
Approximate alignment. We then create approximate alignments of the super-reads to the reference genomes using short (15 to 25 bp) k-mer seeds (we use lowercase k for the short k-mers used in mapping super-reads to the reference). We can efficiently create a query database of all short k-mers in the super-reads using a partial suffix array (PSA), in which we record the positions of all suffixes up to a given length in the super-reads.
Using the PSA, we can examine all k-mers in a given reference contig and look for a Longest Common Subsequence (LCS) of k-mers between the contig and each super-read. The k-mers can be overlapping. Once we have identified a super-read with at least the minimum required number of k-mers in the LCS, we create an approximate positioning of the super-read on the reference contig based on the positions of the LCS k-mers in the reference and in the super-read. We record the number of k-mers in the LCS for each super-read alignment.
Graph traversal. When super-reads are constructed, we record the exact K-overlaps between the super-reads (where K is the K-mer size that was used in generating the super-reads, and the overlaps are at least of length K). Using the super-read positions on the reference contigs, we then look for confirmed K-overlaps between the super-reads, where the overlap length is consistent with the alignment positions. We then form connected components in which super-reads are the nodes and confirmed K-overlaps are the edges. Each such connected component represents a directed acyclic graph (DAG) of K-overlapping super-reads spanning a region of the reference. We impose a direction on each DAG from the 5' end of the reference contig to the 3' end. For each node (super-read) in each DAG we know the number of k-mers in the LCS for that node. The approximate positioning of the super-reads on the reference imposes a topological order on the DAG. We then look for the path through the DAG of each connected component that maximizes the sum of matching k-mers between the path and the reference. We call each such path a synteny-read.
Tiling. To make sure we keep only the best and longest synteny-reads, we tile the reference with the synteny-reads in a greedy way, not allowing overlaps longer than K. We do not allow synteny-reads that are contained in a larger one, and we choose synteny-reads so as to maximize the number of k-mers in the tiling.
Assembly.
Finally, we assemble the synteny-reads together with the original super-reads and Illumina linking mates (the paired-end reads where both mates did not map to the same super-read) into contigs and scaffolds using the CABOG assembler version 8.2. We can use other linking information for scaffolding, such as Illumina mate pairs, if available.
At this time, as part of our preliminary studies, we have developed experimental code that implements a preliminary version of the synteny-reads technique. We list the preliminary results from the feasibility study in the introduction. Our experimental code is currently implemented in shell scripts, Perl and C++.
Specific objective 2 -- methods.
We propose to use first the yeast and then the bison data set (the bison data provided by Lauren Dobson of TAMU and Tim Smith of USDA-ARS, see attached letter of collaboration) to study the performance of the technique and make adjustments to the algorithm. For the reference we propose to use the latest available Bos taurus reference genome, version 5.0.1, which is based on the UMD 3.1.1 assembly and was upgraded with PBJelly by Baylor College of Medicine (NCBI Genbank accession GCA_000003205.6). This reference genome has an N50 contig size of 276 Kbp. We also propose to use the available water buffalo reference (NCBI accession GCA_000471725.1) as well as goat and sheep reference genomes. We will investigate the impact of using more than one reference and identify possible assembly errors. We propose to analyze any misassemblies that we identify by mapping the synteny-assisted assembly against the independently assembled reference genome from Genbank accession GCA_000754665.1, and to improve the assembly algorithm to address the misassemblies.
Our proposed method is based on creating synteny-assisted assemblies and then mapping them to the alternative assemblies or to the reference using the MUMmer software (Kurtz et al., 2004).
We then filter the alignments to discover the best one-to-one mapping and analyze the discrepancies. We examine all assembled contigs that do not align in a single chunk, map the original Illumina reads back to these contigs, and examine the alignments for "weak spots", where a weak spot is defined as a region not spanned by Illumina reads or fragments. If a weak spot is found at the location of a discrepancy between the assembled contigs and the reference, that location is an apparent misassembly. We can then go back to the placements of the synteny-reads in the assembled contigs, as output by the assembler, and see whether the apparent misassembly was due to a misassembled synteny-read. If so, we re-examine the algorithm used for building the synteny-reads to identify possible weaknesses and refine the method.
Finally, we propose to use the re-sequencing data for Hereford bulls (see Taylor letter of support) to develop the fastest solution for building individual genomes of these animals assisted by synteny with the latest available Hereford reference (Bos taurus 5.0.1 or later). We will investigate how the Illumina coverage of the individual affects assembly quality, and we will develop a re-sequencing study protocol, including requirements on the data and hardware, for quickly obtaining high-quality individual animal assemblies in an automated fashion.
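The super-read construction step outlined under Specific objective 1 can be illustrated with a minimal Python sketch. This is not the MaSuRCA implementation, which uses a compressed K-mer database; the function names `build_kmer_set` and `extend_right`, the `max_len` guard and the toy reads in the usage note are assumptions made for illustration. The sketch extends a seed in one direction only and ignores reverse complements and sequencing errors.

```python
from typing import Iterable, Set


def build_kmer_set(reads: Iterable[str], k: int) -> Set[str]:
    """Collect every K-mer that occurs in any read (a plain set, not a compressed database)."""
    kmers: Set[str] = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return kmers


def extend_right(seed: str, kmers: Set[str], k: int, max_len: int = 10_000) -> str:
    """Extend `seed` to the right one base at a time.

    Extension stops when the next base is ambiguous (more than one K-mer
    continues the sequence), when no K-mer continues it, or when the
    hypothetical `max_len` guard is reached (protects against cycles).
    """
    if k < 2 or len(seed) < k - 1:
        raise ValueError("need k >= 2 and a seed of at least k-1 bases")
    seq = seed
    while len(seq) < max_len:
        suffix = seq[-(k - 1):]
        candidates = [base for base in "ACGT" if suffix + base in kmers]
        if len(candidates) != 1:
            break  # dead end (0 candidates) or branch in the K-mer graph (>1)
        seq += candidates[0]
    return seq
```

For example, with the toy reads `["ATGCGTAC", "CGTACCTGA"]` and `k = 4`, the seed `"ATGCGTAC"` extends to `"ATGCGTACCTGA"`; adding a third read `"GTACGGT"` creates a branch after `TAC`, so the same seed is no longer extended.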

Progress 05/15/18 to 05/14/22

Outputs
Target Audience: Researchers ranging from undergraduate students to senior faculty in educational and research institutions, and in private industry and government, in the fields of genomics, DNA sequence data analysis, breeding and population studies.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? This project funded 40% of the salary for postdoctoral fellow Christopher Pockrandt at JHU.
How have the results been disseminated to communities of interest? Nothing Reported
What do you plan to do during the next reporting period to accomplish the goals? Nothing Reported

Impacts
What was accomplished under these goals?
Both Specific Objectives have been accomplished in full and tested on A. thaliana and human data sets. In addition, we have developed the additional software described below.
Reference Guided Assembly
Figure 1. MaSuRCA-Syntigs strategy. We turn Illumina reads into super-reads and then use 25-mers in the super-reads to compute approximate alignments of the super-reads to reference contigs. Then we build syntigs as paths of exactly overlapping super-reads where the overlaps are confirmed by the reference alignment positions. Finally we assemble super-reads and syntigs de novo.
The number of reference or high-quality genome assemblies for different species is growing rapidly thanks to the proliferation of third-generation long-read sequencing technologies. The reference genomes can be used as templates to assist individual de novo genome assemblies of closely related (or the same) species from low-cost short-read Illumina data. We show that using one or more reference genomes yields a short-read de novo assembly that is superior in contiguity and completeness to an assembly made without a reference. The MaSuRCA-Syntig software is a new addition to the MaSuRCA genome assembly package that enables synteny-assisted de novo genome assembly from Illumina paired-end read data guided by one or more reference sequences of closely related species. The principal difference of the new technique is that multiple references can be used at the same time, and we show that assembly contiguity improves as more reference sequences are added. We achieved an N50 contig size of 986 Kbp for the de novo synteny-assisted assembly of A. thaliana, 2.8 times larger than the N50 for an assembly of the same data that did not use reference information. Use of the human reference genome version GRCh38 resulted in an N50 contig size of 482 Kbp for the de novo Illumina-only assembly of the NA12878 data set, 5 times larger than the corresponding N50 for the assembly without use of the reference.
The MaSuRCA-Syntig strategy is shown in Figure 1. We split the reference assembly (or assemblies) into contigs at gaps. We then compute the super-reads from the Illumina reads in the standard way used in MaSuRCA (Zimin et al., 2013). Next we create approximate alignments of the super-reads to each reference contig using the 25-mers that the reference contigs have in common with the super-reads. The 25-mer seeds work for closely related species, where DNA sequences are >98% similar; smaller seeds may be needed for more divergent species.
For the alignments, we first build a database of all 25-mers in the super-reads. We use this database to compute, for each super-read, its approximate start and end positions on each reference contig using the LCS algorithm described in (Zimin et al., 2017). For each reference contig R, we walk down the contig looking at each 25-mer. We use the 25-mer database to determine (in constant time for each 25-mer) which 25-mers are found in super-reads. Once we have the super-reads that match R, for each such super-read S we look for ordered subsequences of the 25-mers that R and S have in common. We then assign a score to each super-read S, where the score is the number of 25-mers in the longest common subsequence (LCS) of 25-mers in the two sequences. We label an alignment as plausible if the score of S exceeds some specified minimum. For each plausible alignment, we compute an approximate position of S along R based on the positions of the LCS 25-mers in R and S.
Using all super-read positions on a reference contig R, we create possible paths of (plausible) super-reads along R. Each path consists of a sequence of super-reads in which two adjacent super-reads must have an exact overlap of at least 40 bases and must also have positions on R that make it possible for them to overlap. We call each such path a synteny read, or syntig. We then assemble the super-reads and syntigs using the Flye assembler (Kolmogorov et al., 2019) in "subassemblies" mode.
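The scoring step described above can be sketched in a few lines of Python. This is a minimal illustration and not the code from (Zimin et al., 2017): the function name `kmer_lcs_score` is hypothetical, a plain dictionary stands in for the 25-mer database, and reverse-complement matches are ignored. The score is the length of the longest chain of shared k-mers whose positions increase in both the reference contig and the super-read, computed as a longest increasing subsequence.

```python
from bisect import bisect_left
from collections import defaultdict


def kmer_lcs_score(reference: str, super_read: str, k: int = 25) -> int:
    """Number of k-mers in the longest common subsequence of k-mers of the two sequences."""
    # "Database" of k-mer -> positions in the super-read (expected constant-time lookup).
    positions = defaultdict(list)
    for j in range(len(super_read) - k + 1):
        positions[super_read[j:j + k]].append(j)

    # Walk down the reference; the super-read positions of matching k-mers,
    # taken in reference order, must form a strictly increasing chain.
    tails = []  # tails[n] = smallest super-read position ending a chain of n+1 k-mers
    for i in range(len(reference) - k + 1):
        hits = positions.get(reference[i:i + k])
        if not hits:
            continue
        # Descending order ensures one reference k-mer is used at most once per chain.
        for j in sorted(hits, reverse=True):
            idx = bisect_left(tails, j)
            if idx == len(tails):
                tails.append(j)
            else:
                tails[idx] = j
    return len(tails)
```

With `k = 4`, the toy super-read `"GTTGCAAG"` scores 5 against the reference `"ACGTTGCAAGCT"`, while a super-read whose two halves are swapped relative to the reference (`"CAAGCTACGTTG"`) shares six 4-mers with it but scores only 3, because the shared k-mers are not in a consistent order.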
Table 1. Reference genome sequences used for Arabidopsis thaliana experiments
ID	Genbank accession	Total Sequence (Mbp)	N50 Contig (bp)	N50 Scaffold (bp)
TAIR1.0 (Col)	GCA_000001735.1	118.96	10,898,021	23,459,830
Ler1 (Ler)	GCA_001651475.1	117.11	862,972	22,588,203
Ler2 (Ler)	GCA_000835945.1	127.42	11,163,166	11,163,166
We show the performance of our preliminary algorithm on assemblies of an Arabidopsis thaliana Ler (Landsberg erecta) ecotype data set consisting of 100x coverage by 2x300 Illumina MiSeq paired-end reads. The references that we use, with their Genbank accessions, are listed in Table 1. We use the official reference genome for A. thaliana Col (Columbia ecotype), TAIR1.0, and two references for A. thaliana Ler, the same ecotype as the sequenced sample. The Ler2 reference is the most contiguous one, because it was produced using third-generation PacBio sequencing data (Berlin et al., 2015). We set up four reference-assisted assembly experiments, shown in Table 2:
Experiment 1. Use the TAIR1.0 reference: different ecotype.
Experiment 2. Use the Ler1 reference: less contiguous, same ecotype.
Experiment 3. Use two references, TAIR1.0 and Ler1.
Experiment 4. Use the most contiguous and the closest reference, Ler2.
Table 2 shows that using a more contiguous reference improves the assembly: the assembly produced using the Ler2 reference has an N50 contig size of 986 Kbp, whereas the assembly produced using the less contiguous Ler1 reference has an N50 contig size of 723 Kbp. Using a more closely related reference works better, since the Ler1 reference, albeit less contiguous, yielded a better result than the more contiguous TAIR1.0 reference. Using two references yields longer contigs than using a single reference, as shown in Experiment 3, even though we are combining the much less contiguous Ler1 reference with the more contiguous TAIR1.0 reference. The most contiguous and the closest reference, Ler2, yields the best reference-assisted assembly result, with contigs 2.8 times longer than the ones produced from Illumina data alone.
Table 2. Synteny-assisted assemblies of Arabidopsis thaliana Ler
Reference used	Total Sequence (bp)	N50 Contig (bp)	N50 Scaffold (bp)
none	127,353,458	351,096	433,094
TAIR1.0	121,573,373	501,958	503,045
Ler1	126,141,674	723,190	726,839
TAIR+Ler1	123,405,969	800,823	801,815
Ler2	131,677,486	986,399	993,221
The reference-assisted assembly code is included in MaSuRCA assembler version 3.3.3 and later, and it is available on GitHub at https://github.com/alekseyzimin/masurca.
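For readers unfamiliar with the N50 statistic reported in Tables 1 and 2, the short Python sketch below shows the usual definition: the length L such that contigs of length L or longer contain at least half of the total assembled sequence. The function name `n50` and the toy contig lengths are assumptions for illustration; the last lines only recompute the 2.8-fold ratio from the N50 contig values in Table 2.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig (or scaffold) lengths; 0 if empty."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:  # reached at least half of the total sequence
            return length
    return 0


# Toy example: total is 20, and the two longest contigs (6 + 5) cover more than half.
toy_n50 = n50([2, 3, 4, 5, 6])

# Ratio of the Ler2-assisted N50 contig size to the reference-free one (values from Table 2).
improvement = 986_399 / 351_096
```

Here `toy_n50` is 5 and `improvement` rounds to 2.8, matching the figure quoted in the text.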

Publications

Type: Journal Articles Status: Published Year Published: 2022 Citation: Zimin AV, Salzberg SL. The SAMBA tool uses long reads to improve the contiguity of genome assemblies. PLoS computational biology. 2022 Feb 4;18(2):e1009860.
Type: Journal Articles Status: Awaiting Publication Year Published: 2022 Citation: Guo A, Salzberg S, Zimin AV. JASPER: a fast genome polishing tool that improves accuracy and creates population-specific reference genomes. bioRxiv. 2022 Jan 1.
Type: Journal Articles Status: Published Year Published: 2022 Citation: Zimin AV, Shumate A, Shinder I, Heinz J, Puiu D, Pertea M, Salzberg SL. A reference-quality, fully annotated genome from a Puerto Rican individual. Genetics. 2022 Feb;220(2):iyab227.

Progress 05/15/20 to 05/14/21

Outputs
Target Audience: Researchers ranging from undergraduate students to senior faculty in educational and research institutions, and in private industry and government, in the fields of genomics, DNA sequence data analysis, breeding and population studies.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? This project funded 40% of the salary for postdoctoral fellow Christopher Pockrandt at JHU.
How have the results been disseminated to communities of interest? Nothing Reported
What do you plan to do during the next reporting period to accomplish the goals? Nothing Reported

Impacts
What was accomplished under these goals?
Synteny-guided assembly approach. The number of reference or high-quality genome assemblies for different species is growing rapidly thanks to the proliferation of third-generation long-read sequencing technologies. The reference genomes can be used as templates to assist individual de novo genome assemblies of closely related (or the same) species from low-cost short-read Illumina data. We show that using one or more reference genomes yields a short-read de novo assembly that is superior in contiguity and completeness to an assembly made without a reference. The MaSuRCA-Syntig software is a new addition to the MaSuRCA genome assembly package that enables synteny-assisted de novo genome assembly from Illumina paired-end read data guided by one or more reference sequences of closely related species. The principal difference of the new technique is that multiple references can be used at the same time, and we show that assembly contiguity improves as more reference sequences are added. We achieved an N50 contig size of 986 Kbp for the de novo synteny-assisted assembly of A. thaliana, 2.8 times larger than the N50 for an assembly of the same data that did not use reference information. Use of the human reference genome version GRCh38 resulted in an N50 contig size of 482 Kbp for the de novo Illumina-only assembly of the NA12878 data set, 5 times larger than the corresponding N50 for the assembly without use of the reference.
The MaSuRCA-Syntig strategy is shown in Figure 1. We split the reference assembly (or assemblies) into contigs at gaps. We then compute the super-reads from the Illumina reads in the standard way used in MaSuRCA (Zimin et al., 2013). Next we create approximate alignments of the super-reads to each reference contig using the 25-mers that the reference contigs have in common with the super-reads. The 25-mer seeds work for closely related species, where DNA sequences are >98% similar; smaller seeds may be needed for more divergent species.
For the alignments, we first build a database of all 25-mers in the super-reads. We use this database to compute, for each super-read, its approximate start and end positions on each reference contig using the LCS algorithm described in (Zimin et al., 2017). For each reference contig R, we walk down the contig looking at each 25-mer. We use the 25-mer database to determine (in constant time for each 25-mer) which 25-mers are found in super-reads. Once we have the super-reads that match R, for each such super-read S we look for ordered subsequences of the 25-mers that R and S have in common. We then assign a score to each super-read S, where the score is the number of 25-mers in the longest common subsequence (LCS) of 25-mers in the two sequences. We label an alignment as plausible if the score of S exceeds some specified minimum. For each plausible alignment, we compute an approximate position of S along R based on the positions of the LCS 25-mers in R and S.
Using all super-read positions on a reference contig R, we create possible paths of (plausible) super-reads along R. Each path consists of a sequence of super-reads in which two adjacent super-reads must have an exact overlap of at least 40 bases and must also have positions on R that make it possible for them to overlap. We call each such path a synteny read, or syntig. We then assemble the super-reads and syntigs using the Flye assembler (Kolmogorov et al., 2019) in "subassemblies" mode.
Figure 1. MaSuRCA-Syntigs strategy. We turn Illumina reads into super-reads and then use 25-mers in the super-reads to compute approximate alignments of the super-reads to reference contigs. Then we build syntigs as paths of exactly overlapping super-reads where the overlaps are confirmed by the reference alignment positions. Finally we assemble super-reads and syntigs de novo.
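The path-building step (super-reads chained by exact overlaps that agree with their positions on R) can be sketched as a small dynamic program. This is an illustrative sketch only, not the MaSuRCA code: the `Placed` record, the `slack` tolerance on position agreement, and the choice to return the single highest-scoring path (the criterion stated in Project Methods) are assumptions made here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Placed:
    seq: str    # super-read sequence
    start: int  # approximate start position on the reference contig R
    score: int  # number of k-mers in the LCS with R


def exact_overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b` (>= min_len), else 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def best_path(reads: List[Placed], min_overlap: int = 40, slack: int = 20) -> Tuple[str, int]:
    """Return (sequence, total score) of the highest-scoring chain of overlapping super-reads."""
    if not reads:
        return "", 0
    order = sorted(range(len(reads)), key=lambda i: reads[i].start)  # topological order
    best: Dict[int, int] = {}
    prev: Dict[int, Optional[int]] = {}
    ovl: Dict[int, int] = {}
    for pos, j in enumerate(order):
        best[j], prev[j] = reads[j].score, None
        for i in order[:pos]:
            n = exact_overlap(reads[i].seq, reads[j].seq, min_overlap)
            if n == 0:
                continue
            expected_start = reads[i].start + len(reads[i].seq) - n
            if abs(reads[j].start - expected_start) > slack:
                continue  # overlap not confirmed by the reference positions
            if best[i] + reads[j].score > best[j]:
                best[j], prev[j], ovl[j] = best[i] + reads[j].score, i, n
    # Trace back from the best-scoring end node and merge the sequences.
    node: Optional[int] = max(best, key=lambda idx: best[idx])
    total = best[node]
    path: List[int] = []
    while node is not None:
        path.append(node)
        node = prev[node]
    path.reverse()
    merged = reads[path[0]].seq
    for j in path[1:]:
        merged += reads[j].seq[ovl[j]:]
    return merged, total
```

The quadratic pairwise overlap check is acceptable only for toy inputs; a real implementation would reuse the overlaps recorded during super-read construction.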

Publications

Type: Journal Articles Status: Published Year Published: 2021 Citation: Masonbrink RE, Alt D, Bayles DO, Boggiatto P, Edwards W, Tatum F, Williams J, Wilson-Welder J, Zimin A, Severin A, Olsen S. A pseudomolecule assembly of the Rocky Mountain elk genome. PloS one. 2021 Apr 28;16(4):e0249899.

Progress 05/15/19 to 05/14/20

Outputs
Target Audience: A community of scientists doing work in genome assembly, analysis and annotation.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? This project supported a postdoctoral research associate, Christopher Pockrandt.
How have the results been disseminated to communities of interest? We presented the results of all activities related to this project in multiple meetings and seminars, including PAG 2020.
What do you plan to do during the next reporting period to accomplish the goals? We will proceed with implementation of the Specific Objectives as planned.

Impacts
What was accomplished under these goals?
In this project year we modified the syntigs algorithm for better performance. Instead of super-reads we now use pre-assembled contigs (assembled from the super-reads and paired-end linking mates) to create the syntigs, which makes the syntigs longer and improves the resulting performance of the assembler. We also added the option of filling gaps in the resulting reference-guided assembly with the reference sequence, in lowercase. We are currently working on the manuscript describing the technique. The reference-guided assembly code is available in the current release of MaSuRCA, v3.4.1, and the usage of the option is described in the documentation. We downloaded several bovine genomic data sets produced at the University of Missouri, Columbia from NCBI SRA, and we are now conducting experiments on low-coverage reference-assisted cattle genome assembly using the latest v5 cow reference genome.
We implemented and published POLCA, a fast and accurate tool for polishing genome assemblies. We use this tool in our reference-assisted genome assembly pipeline. POLCA is implemented as a bash script that takes as input a file of Illumina reads and the target assembly to be polished. The outputs are the polished assembly and a VCF (variant call format) file containing the variants used for polishing. The basic outline of the script is to align the Illumina reads to the genome and then call short variants from the alignments. A variant call is treated as a putative error in the consensus if the count of alternative allele observations is greater than 1 and at least twice the count of reference allele observations. Each error is fixed by replacing the error variant with the highest-scoring alternative allele suggested by the Illumina reads. The variants can be substitutions or insertions/deletions of one or more bases.
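The error-calling rule stated above translates directly into code. The Python sketch below is an illustration of that rule only, not the POLCA script itself (which is written in bash): the function names, the tuple layout `(position, ref, alt, ref_count, alt_count)` with 0-based positions, and the simplification to one alternative allele per site are assumptions. Overlapping variants are not handled.

```python
from typing import Iterable, Tuple

# (0-based position, reference allele, alternative allele, ref observations, alt observations)
Variant = Tuple[int, str, str, int, int]


def is_putative_error(ref_count: int, alt_count: int) -> bool:
    """Alt allele seen more than once and at least twice as often as the reference allele."""
    return alt_count > 1 and alt_count >= 2 * ref_count


def polish(seq: str, variants: Iterable[Variant]) -> str:
    """Apply every variant that passes the rule, from right to left.

    Working from the highest position downwards keeps the coordinates of the
    remaining variants valid even when an indel changes the sequence length.
    """
    for pos, ref, alt, ref_count, alt_count in sorted(variants, reverse=True):
        if not is_putative_error(ref_count, alt_count):
            continue
        if seq[pos:pos + len(ref)] != ref:
            continue  # variant does not match the consensus at this position; skip it
        seq = seq[:pos] + alt + seq[pos + len(ref):]
    return seq
```

For instance, `polish("ACGTACGT", [(2, "G", "T", 0, 5), (5, "CG", "C", 1, 4), (0, "A", "C", 3, 2)])` applies the substitution at position 2 and the one-base deletion at position 5 but leaves position 0 unchanged, because its alternative allele has fewer observations than the rule requires.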
POLCA uses bwa mem (Li and Durbin, 2009) to align reads to the assembly, but another short-read aligner can easily be substituted. For variant calling, it uses FreeBayes (Garrison and Marth, 2012) because of its stability and portability; however, by default FreeBayes can only use a single thread (processor). In POLCA we use shell-level multiprocessing to run multiple instances of FreeBayes in parallel, which significantly speeds up the variant calling. We also tuned the alignment and variant calling parameters to improve sensitivity, specificity, and speed for detecting consensus errors. The FreeBayes binary is included with the POLCA distribution as part of the MaSuRCA package. (Note that POLCA installs with MaSuRCA but can be run independently to polish assemblies produced with third-party assemblers.)
POLCA first builds an index of the target assembly and then aligns the Illumina reads to the target with bwa. It then uses samtools to sort the alignment (bam) file. For variant calling we run FreeBayes in 5 Mb batches, merging the variant call (vcf) files after all batches finish. We then process the assembly using the computed variant calls in parallel, where the number of batches is equal to the user-specified number of CPUs. We extract all target sequence names, sort them in lexicographic order and split the sorted list into batches. This helps balance the amount of target sequence in each batch, thus balancing the load on the CPUs. Parallel execution is achieved using the "xargs -P" command, which ensures compatibility between different Unix-based systems.
We compared the performance of POLCA to other published polishing techniques on a real data set, using a previously published assembly of the NA12878 human genome, Genbank accession GCA_001013985.1. That assembly was produced from PacBio SMRT data (Pendleton et al., 2015), and as such it was likely to contain more consensus-level sequence errors than an assembly based on Illumina data.
Alignment of this assembly to the GRCh38.p12 human reference genome with nucmer, followed by dnadiff to compute differences, yields an average alignment identity of 99.66%. For polishing this assembly, we used Illumina data for the same subject, NA12878, from the Genome In A Bottle project (Zook et al., 2014), dataset 140115_D00360_0009_AH8962ADXX, which contains 553,657,530 149-bp reads. Because the "true" sequence of the NA12878 genome is not known, we evaluated, for each of the four polishing programs, whether the polished genome yielded a better alignment to the GRCh38.p12 sequence. The NA12878 assembly polished with POLCA had the closest alignment by a small margin, with 99.752% identity to GRCh38, while the assemblies polished with NextPolish, Pilon and Racon had 99.750%, 99.746% and 99.749% identity respectively. Thus all four polishing programs gave very similar results in terms of accuracy; however, POLCA and NextPolish ran considerably faster, completing the task in 4 hours and less than 1 hour respectively, while Racon took 15h 39m and Pilon took far longer, 150h 16m. We note that Pilon is designed to do more than correct single-base substitutions and short indel errors, which explains its longer run times: it also attempts to identify and correct misassembled or collapsed repeats, a much more computationally demanding problem.
POLCA provides an effective way to correct single-base substitution and short insertion/deletion errors in draft genome assemblies. On simulated data, it proves to be more accurate than Pilon and Racon and equivalent to the newer NextPolish method. POLCA is faster than Racon and Pilon, but slower than NextPolish. On simulated data, the most accurate polishing was achieved by using a combination of both POLCA and NextPolish. On real human and bacterial genome data, POLCA and NextPolish performed similarly, and better than Pilon and Racon, although POLCA appeared to be marginally better for human genome polishing.
The manuscript describing the performance of the POLCA tool has been published in PLoS Computational Biology (Zimin and Salzberg, 2020).
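The batching scheme described for POLCA above (5 Mb windows for variant calling, and sorted sequence names split into one batch per CPU) can be sketched as follows. This is a hedged illustration in Python rather than the actual bash implementation: the function names are hypothetical, windows are expressed as half-open `(name, start, end)` tuples rather than in any tool-specific region syntax, and splitting the sorted list into contiguous, near-equal chunks is an assumption, since the report does not say how the list is divided.

```python
from typing import Dict, Iterator, List, Tuple


def windows(lengths: Dict[str, int], size: int = 5_000_000) -> Iterator[Tuple[str, int, int]]:
    """Yield half-open (name, start, end) windows of at most `size` bases per sequence."""
    for name in sorted(lengths):
        for start in range(0, lengths[name], size):
            yield name, start, min(start + size, lengths[name])


def split_batches(names: List[str], n_batches: int) -> List[List[str]]:
    """Sort names lexicographically and split them into `n_batches` near-equal chunks."""
    if n_batches < 1:
        raise ValueError("n_batches must be at least 1")
    ordered = sorted(names)
    quotient, remainder = divmod(len(ordered), n_batches)
    batches, begin = [], 0
    for b in range(n_batches):
        step = quotient + (1 if b < remainder else 0)  # spread the remainder over the first batches
        batches.append(ordered[begin:begin + step])
        begin += step
    return batches
```

Each window or batch would then be handed to a separate worker process, which is the role that "xargs -P" plays in the shell implementation.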

Publications

Type: Journal Articles Status: Published Year Published: 2020 Citation: Alonge M, Shumate A, Puiu D, Zimin A, Salzberg SL. Chromosome-Scale Assembly of the Bread Wheat Genome Reveals Thousands of Additional Gene Copies. Genetics. 2020 Aug 12.
Type: Journal Articles Status: Published Year Published: 2020 Citation: Shumate A, Zimin AV, Sherman RM, Puiu D, Wagner JM, Olson ND, Pertea M, Salit ML, Zook JM, Salzberg SL. Assembly and annotation of an Ashkenazi human reference genome. Genome biology. 2020 Dec;21(1):1-8.
Type: Journal Articles Status: Awaiting Publication Year Published: 2020 Citation: Scott AD, Zimin AV, Puiu D, Workman R, Britton M, Zaman S, Caballero M, Read AC, Bogdanove AJ, Burns E, Wegrzyn J. A Reference Genome Sequence for Giant Sequoia. G3: Genes, Genomes, Genetics. 2020 Sep 18.
Type: Journal Articles Status: Published Year Published: 2020 Citation: Giordano R, Donthu RK, Zimin A, Chavez IC, Gabaldon T, van Munster M, Hon L, Hall R, Badger J, Nguyen M, Flores A. Soybean aphid biotype 1 genome: Insights into the invasive biology and adaptive evolution of a major agricultural pest. Insect Biochemistry and Molecular Biology. 2020 Feb 25:103334.
Type: Journal Articles Status: Published Year Published: 2020 Citation: Zimin AV, Salzberg SL. The genome polishing tool POLCA makes fast and accurate corrections in genome assemblies. PLoS computational biology. 2020 Jun 26;16(6):e1007981.
Type: Journal Articles Status: Published Year Published: 2020 Citation: Rosen BD, Bickhart DM, Schnabel RD, Koren S, Elsik CG, Tseng E, Rowan TN, Low WY, Zimin A, Couldrey C, Hall R. De novo assembly of the cattle reference genome with single-molecule sequencing. GigaScience. 2020 Mar;9(3):giaa021.

Progress 05/15/18 to 05/14/19

Outputs
Target Audience: Nothing Reported
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? Nothing Reported
How have the results been disseminated to communities of interest? We presented the results at the Biology of Genomes meeting in 2019.
What do you plan to do during the next reporting period to accomplish the goals? We will proceed with implementation of the Specific Objectives 1 and 2 as planned.

Impacts
What was accomplished under these goals?
In the first year of the project we spent the most effort on Specific Objective 1: to develop a technique that will allow the production of high quality de novo genome assemblies of individual animals from low-cost Illumina paired-end data with the aid of synteny information from one or more reference genomes of closely related species. The number of reference or high quality genome assemblies for different species is growing rapidly thanks to the proliferation of third-generation long-read sequencing technologies. These reference genomes can be used as templates to assist in individual de novo genome assemblies of closely related (or the same) species from low-cost short-read Illumina data. We show that using one or more reference genomes yields a short-read de novo assembly that is superior in contiguity and completeness.
The MaSuRCA-Syntig software is a new addition to the MaSuRCA genome assembly package that enables synteny-assisted de novo genome assembly from Illumina paired-end read data, guided by one or more reference sequences of closely related species. What principally distinguishes the new technique is that multiple references can be used at the same time, and we show that assembly contiguity improves as more reference sequences are added. We achieved an N50 contig size of 986 Kbp for the de novo synteny-assisted assembly of A. thaliana, 2.8 times bigger than the N50 for an assembly of the same data that did not use reference information. Use of the human reference genome version GRCh38 resulted in an N50 contig size of 482 Kbp for the de novo Illumina-only assembly of the NA12878 data set, 5 times bigger than the corresponding N50 for the assembly without use of the reference.
The MaSuRCA-Syntig strategy is shown in Figure 1. We split the reference assembly (or assemblies) into contigs at gaps. We then compute the super-reads from the Illumina reads in the standard way done in MaSuRCA (Zimin et al., 2013).
Next we create approximate alignments of the super-reads to each reference contig, using 25-mers that the reference contigs have in common with the super-reads. The 25-mer seeds work for closely related species, where DNA sequences are >98% similar; smaller seeds may be needed for more divergent species. For the alignments, we first build a database of all 25-mers in the super-reads. We use this database to compute, for each super-read, its approximate start and end positions on each reference contig using the LCS algorithm described in (Zimin et al., 2017). For each reference contig R, we walk down the contig looking at each 25-mer. We use the 25-mer database to determine (in constant time for each 25-mer) which 25-mers are found in super-reads. Once we have the super-reads that match R, for each such super-read S we look for ordered subsequences of the 25-mers that both R and S have in common. We then assign a score to each super-read S, where the score is the number of 25-mers in the longest common subsequence (LCS) of 25-mers in the two sequences. We label an alignment as plausible if the score of S exceeds some specified minimum. For each plausible alignment, we compute an approximate position of S along R based on the positions of the LCS 25-mers in R and S.
Using all super-read positions on a reference contig R, we create possible paths of (plausible) super-reads along R. Each path consists of a sequence of super-reads where two adjacent super-reads must have an exact overlap of at least 40 bases, and must also have positions on R that make it possible for them to overlap. We call each such path a synteny read, or syntig. We then assemble the super-reads and syntigs using the Flye assembler (Kolmogorov et al., 2019) in "subassemblies" mode.
Table 1. Reference genome sequences used for A. thaliana experiments
ID             Genbank accession  Total Sequence (Mbp)  N50 Contig (bp)  N50 Scaffold (bp)
TAIR1.0 (Col)  GCA_000001735.1    118.96                10,898,021       23,459,830
Ler1 (Ler)     GCA_001651475.1    117.11                862,972          22,588,203
Ler2 (Ler)     GCA_000835945.1    127.42                11,163,166       11,163,166
We show the performance of our preliminary algorithm on assemblies of an Arabidopsis thaliana Ler (Landsberg erecta) ecotype data set, consisting of 100x coverage by 2x300 Illumina MiSeq paired-end reads. The references that we use, with their Genbank accessions, are listed in Table 1. We use the official reference genome for A. thaliana Col (Columbia ecotype), TAIR1.0, and two references from the more closely related A. thaliana Ler ecotype. The Ler2 reference is the most contiguous one, because it was produced using third-generation PacBio sequencing data (Berlin et al., 2015). We set up four reference-assisted assembly experiments, shown in Table 2:
Experiment 1. Use TAIR1.0 reference - different ecotype
Experiment 2. Use Ler1 reference - less contiguous, same ecotype
Experiment 3. Use two references, TAIR1.0 and Ler1
Experiment 4. Use the most contiguous and the closest reference, Ler2
Table 2 shows that using a more contiguous reference improves the assembly: the assembly produced using the Ler2 reference has an N50 contig size of 986 Kbp, whereas the assembly produced using the less contiguous Ler1 reference has an N50 contig size of 723 Kbp. Using a more closely related reference also works better, since the Ler1 reference, albeit less contiguous, yielded a better result than the more contiguous TAIR1.0 reference. Using two references yields longer contigs than using a single reference, as shown in Experiment 3, even though we are combining the much less contiguous Ler1 reference with the more contiguous TAIR1.0 reference.
The most contiguous and closest reference, Ler2, yields the best reference-assisted assembly, with contigs 2.8 times longer than those produced from Illumina data alone.
Table 2. Reference-assisted assemblies of A. thaliana Ler
Reference used  Total Sequence (bp)  N50 Contig (bp)  N50 Scaffold (bp)
none            127,353,458          351,096          433,094
TAIR1.0         121,573,373          501,958          503,045
Ler1            126,141,674          723,190          726,839
TAIR1.0+Ler1    123,405,969          800,823          801,815
Ler2            131,677,486          986,399          993,221
The reference-assisted assembly code is included as an alpha version in MaSuRCA assembler version 3.3.3 and later, and it is available on GitHub at https://github.com/alekseyzimin/masurca.
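The seed-and-score step described above (index the k-mers of the super-reads, walk along a reference contig, and score each super-read by its longest ordered run of shared k-mers) can be illustrated with a minimal Python sketch. This is not the MaSuRCA-Syntig code. The function names build_kmer_index, longest_chain and score_super_reads and the min_score parameter are hypothetical. The ordered run is computed here as a longest increasing subsequence over matching positions, used as a simple stand-in for the LCS algorithm of (Zimin et al., 2017). The sketch considers the forward strand only, and the toy example uses k=4 rather than 25.

```python
from bisect import bisect_left
from collections import defaultdict

K = 25  # seed length named in the text; the toy example below overrides it


def build_kmer_index(super_reads, k=K):
    """Map each k-mer to the (read_id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for read_id, seq in super_reads.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((read_id, i))
    return index


def longest_chain(matches):
    """Longest run of (ref_pos, read_pos) matches increasing in both coordinates.

    Sorting by ref_pos ascending and read_pos descending lets a strictly
    increasing subsequence on read_pos enforce strict order in both.
    """
    matches = sorted(matches, key=lambda m: (m[0], -m[1]))
    tails = []      # tails[i]: smallest read_pos that ends a chain of length i + 1
    tail_idx = []   # index into matches of the element stored in tails[i]
    prev = [-1] * len(matches)
    for j, (_, read_pos) in enumerate(matches):
        i = bisect_left(tails, read_pos)
        if i == len(tails):
            tails.append(read_pos)
            tail_idx.append(j)
        else:
            tails[i] = read_pos
            tail_idx[i] = j
        prev[j] = tail_idx[i - 1] if i > 0 else -1
    chain = []
    j = tail_idx[-1] if tail_idx else -1
    while j != -1:
        chain.append(matches[j])
        j = prev[j]
    return chain[::-1]


def score_super_reads(ref_contig, index, k=K, min_score=3):
    """Return {read_id: (score, approx_start_on_ref)} for plausible alignments."""
    hits = defaultdict(list)
    for r in range(len(ref_contig) - k + 1):
        # Dictionary lookup: constant expected time per reference k-mer.
        for read_id, s in index.get(ref_contig[r:r + k], ()):
            hits[read_id].append((r, s))
    plausible = {}
    for read_id, matches in hits.items():
        chain = longest_chain(matches)
        if len(chain) > min_score:  # "plausible" if the score exceeds the minimum
            ref_pos, read_pos = chain[0]
            # Approximate start of the read on the contig (may be negative
            # if the read overhangs the contig start).
            plausible[read_id] = (len(chain), ref_pos - read_pos)
    return plausible


if __name__ == "__main__":
    ref = "ACGTTGCAAGCTTAGGCATCGA"
    reads = {"s1": ref[5:17], "s2": "TTTTTTTTTT"}
    index = build_kmer_index(reads, k=4)
    print(score_super_reads(ref, index, k=4, min_score=3))  # {'s1': (9, 5)}
```

In the toy run, read s1 shares nine consecutive 4-mers with the contig and is placed at offset 5, while s2 shares none and is dropped. Building syntig paths from these placements would additionally require the exact-overlap check of at least 40 bases between adjacent super-reads, which is not shown here.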

Publications

Type: Journal Articles Status: Published Year Published: 2019 Citation: Martinez-Viaud KA, Lawley CT, Vergara MM, Ben-Zvi G, Biniashvili T, ... New de novo assembly of the Atlantic bottlenose dolphin (Tursiops truncatus) improves genome completeness and provides haplotype phasing. GigaScience. 2019;8(3):giy168.
Type: Journal Articles Status: Published Year Published: 2019 Citation: Breitwieser FP, Pertea M, Zimin AV, Salzberg SL. Human contamination in bacterial genomes has created thousands of spurious proteins. Genome research. 2019 Jun 1;29(6):954-60.
Type: Journal Articles Status: Under Review Year Published: 2019 Citation: Zimin AV, Salzberg SL. The genome polishing tool POLCA makes fast and accurate corrections in genome assemblies. bioRxiv. 2019 Jan 1.
Type: Journal Articles Status: Awaiting Publication Year Published: 2019 Citation: Read AC, Moscou MJ, Zimin AV, Pertea G, Meyer RS, Purugganan MD, Leach JE, Triplett LR, Salzberg SL, Bogdanove AJ. Genome assembly and characterization of a complex zfBED-NLR gene-containing disease resistance locus in Carolina Gold Select rice with Nanopore sequencing. PLOS Genetics. 2020 Jan 27;16(1):e1008571.