
Pneumocystis carinii Genomic Assembly Process

The Pc genome has been assembled with two assembly tools: the Phrap Assembler and the more recently developed Arachne Assembler. The two tools produce different versions of the proposed Pc genome. We are currently evaluating the benefits of each by comparing and contrasting the two assemblies.

Arachne Genomic Assembly

General Information

Originally, we used the Phrap Assembler for the Pc project. Although Phrap was not designed as a genome-scale assembler, we selected it because of its long-standing use in assembly projects and the availability of a comprehensive Phred/Phrap/Consed genomic assembly and analysis package.

Recently, we implemented the Arachne Whole Genome Shotgun Assembler (used by the Whitehead Institute to assemble fungal genomes) to assemble the Pc genome. PHRED and portions of the PhredPhrap script are still used extensively for nucleotide base calling and quality score determination of sequence reads. Arachne takes better advantage of the multi-level insert length libraries because it allows precise control of insert lengths. For example, Arachne accepts an insert length and standard deviation for each insert library, which allows a hierarchy of reads to be built at more precise distances. In addition, the Arachne Assembler builds SuperContigs by examining paired read names and joining the contigs they connect. This is similar to the Phrap-based Contig Linker program developed in house (below); however, Arachne builds this feature into the assembler itself.
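The role of per-library insert statistics can be illustrated with a minimal Python sketch. It is not Arachne's input format or its algorithm; the `InsertLibrary` class, the library names and sizes, and the three-standard-deviation tolerance are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InsertLibrary:
    """Hypothetical description of one insert library (not Arachne's file format)."""
    name: str
    mean_insert: float  # expected insert length in bases
    stddev: float       # standard deviation of the insert length in bases


def pair_is_consistent(library: InsertLibrary, observed_span: float,
                       max_sigma: float = 3.0) -> bool:
    """Return True if the observed span between two paired reads lies within
    max_sigma standard deviations of the library's mean insert length."""
    return abs(observed_span - library.mean_insert) <= max_sigma * library.stddev


# Illustrative libraries only; the real Pc library sizes are not given here.
small = InsertLibrary("small_insert", mean_insert=3000, stddev=300)
large = InsertLibrary("large_insert", mean_insert=40000, stddev=4000)

print(pair_is_consistent(small, 3500))   # True: 500 bases off, tolerance is 900
print(pair_is_consistent(small, 5000))   # False: 2000 bases off, tolerance is 900
print(pair_is_consistent(large, 45000))  # True: 5000 bases off, tolerance is 12000
```

A tighter standard deviation narrows the range of spans accepted for a library, which is one way to picture how per-library statistics let reads be placed at more precise distances.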

Overview of Arachne Assembly Results

Genome sequencing progress is measured by the number of SuperContigs, the sum of the SuperContig lengths, the number of Contigs, and the sum of the Contig lengths. The sum of the SuperContig lengths provides an estimate of the breadth of Pc genomic coverage, while the sum of the Contig lengths represents contiguously covered sequence at a more detailed level. The following link gives an overview of the Arachne Assembly Results in terms of these four measures. Note that the size of the Pc genome is approximately 7.0 megabases, excluding the chromosomal telomeric ends and centromeres.
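The four progress measures can be computed as in the following Python sketch. The input layout, the function name `assembly_summary`, and the example lengths are hypothetical; the sketch also assumes that a SuperContig length is the sum of its Contig lengths plus its gaps, which is not stated here. Only the 7.0 megabase genome size comes from the text.

```python
GENOME_SIZE = 7_000_000  # approximate Pc genome size in bases (~7.0 Mb)


def assembly_summary(supercontigs):
    """Summarize an assembly.

    supercontigs maps a SuperContig name to a dict with
    'contigs' (list of Contig lengths) and 'gaps' (total gap length).
    """
    n_contigs = sum(len(sc["contigs"]) for sc in supercontigs.values())
    contig_sum = sum(sum(sc["contigs"]) for sc in supercontigs.values())
    # Assumption: SuperContig length = Contig lengths + gaps between them.
    supercontig_sum = sum(sum(sc["contigs"]) + sc["gaps"]
                          for sc in supercontigs.values())
    return {
        "supercontigs": len(supercontigs),
        "supercontig_length_sum": supercontig_sum,
        "contigs": n_contigs,
        "contig_length_sum": contig_sum,
        "supercontig_fraction_of_genome": supercontig_sum / GENOME_SIZE,
        "contig_fraction_of_genome": contig_sum / GENOME_SIZE,
    }


# Made-up example data, not real Pc assembly results.
example = {
    "super_1": {"contigs": [12000, 8000], "gaps": 500},
    "super_2": {"contigs": [5000], "gaps": 0},
}
summary = assembly_summary(example)
for key, value in summary.items():
    print(key, value)
```

Comparing the two fractions against the genome size gives a rough sense of how much of the estimated genome is spanned by SuperContigs versus covered by contiguous sequence.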

The Arachne Acefile display is available at this link. The Arachne Acefile lists all SuperContigs, their lengths, the sum of their gaps, and the Contigs that make up each SuperContig.

The Arachne Assembler generates a detailed assembly report in PostScript format. The report, titled The Report for Arachne Assembly of P. carinii, may be downloaded to your computer and converted to PDF with the Adobe Acrobat Distiller program for viewing in the freely downloadable Adobe Acrobat Reader. It includes Individual Read Statistics, Contig and SuperContig Statistics, Contig Coverage Statistics, and Library Statistics generated by the Arachne Assembler.

Arachne Assembly and Homology Details

Homology searches have been conducted on the Arachne-assembled Contigs of the latest Pc genomic assembly. The Contigs (grouped by SuperContig) and BLASTX Homology Details are available at this link. The BLASTX table lists homology hits to the SWISS-PROT and NR databases for all the Arachne Contigs. To maximize BLASTX coverage and minimize missed homologies, each Contig was searched in overlapping 2000 base pair sections, with the start of each section advanced 500 base pairs from the start of the previous one. The first search on Contig 1 therefore covered bases 1 to 2000, the second covered bases 501 to 2500, and so on for the length of each Contig, yielding homology search results all along each Arachne-assembled Contig.
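The overlapping-section scheme can be sketched in Python as follows. This is an illustration, not the script used for the project: the function names are hypothetical, coordinates are assumed to be 1-based and inclusive (so a 2000-base section starting at base 501 ends at base 2500), and the handling of the final, shorter section at the end of a Contig is an assumption.

```python
def sliding_windows(contig_length, window=2000, step=500):
    """Yield (start, end) sections in 1-based inclusive coordinates.

    Each section is `window` bases long and starts `step` bases after the
    previous one. The last section is truncated at the end of the contig.
    """
    start = 1
    while start <= contig_length:
        end = min(start + window - 1, contig_length)
        yield (start, end)
        if end == contig_length:
            break  # the contig end has been reached; no further sections
        start += step


def extract_sections(sequence, window=2000, step=500):
    """Return (start, end, subsequence) tuples ready to use as search queries."""
    return [(start, end, sequence[start - 1:end])
            for start, end in sliding_windows(len(sequence), window, step)]


# A hypothetical 3200-base contig.
print(list(sliding_windows(3200)))
# [(1, 2000), (501, 2500), (1001, 3000), (1501, 3200)]
```

Because consecutive sections overlap by 1500 bases, a homologous region that straddles the boundary of one section falls well inside a neighboring section, which is the motivation for the overlap.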
