Gene Finding

Pneumocystis carinii Putative (Hypothetical) Genes

As an ongoing project, we are aligning Pneumocystis carinii (Pc) Expressed Sequence Tags (ESTs) with the Pc genomic database, which is organized as contigs. Both the 3' ends of the initial ESTs and the full sequences of the unigene set will be aligned. These alignments will permit identification of putative genes and will supply structural information such as intron usage and size, the presence of splice intermediates, and gene density. The data are provided on an ongoing basis and give a first glimpse into the Pc genome from a gene structure point of view. A local copy of GeneSeqer (Usuka, J., Zhu, W. and Brendel, V., "Optimal spliced alignment of homologous cDNA to a genomic DNA template". Bioinformatics 16, 203-211.) was used to align cDNA to genomic sequences and to provide an inventory of putative Pneumocystis carinii genes and gene structures.
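The structural summaries mentioned above (intron size and gene density) can be derived from exon coordinates alone. The following is a minimal sketch, not part of the project's pipeline: the function names, the 1-based inclusive coordinate convention, and the example coordinates are all assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed convention: 1-based inclusive (start, end) positions on the contig.
Exon = Tuple[int, int]


def intron_sizes(exons: List[Exon]) -> List[int]:
    """Return the length in bp of each gap between consecutive exons."""
    ordered = sorted(exons)
    return [nxt[0] - cur[1] - 1 for cur, nxt in zip(ordered, ordered[1:])]


def gene_density(gene_count: int, contig_length: int) -> float:
    """Return putative genes per kilobase of contig sequence."""
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    return gene_count * 1000.0 / contig_length


# Hypothetical spliced alignment: three exons from one cDNA on a 5,000 bp contig.
example_exons = [(101, 250), (296, 480), (531, 900)]
print(intron_sizes(example_exons))   # intron lengths in bp
print(gene_density(2, 5000))         # genes per kb, assuming 2 putative genes
```

Expressing density per kilobase keeps contigs of different lengths comparable, which matters while the assembly is still changing.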

Methods and Limitations

We have found putative Pc genes on a contig-by-contig basis by aligning Pc cDNAs with genomic data. The Pc genomic sequencing project is not yet complete, but a number of contiguous stretches of sequence are available. The genomic sequence of each contig is available by clicking on its "Contig Name" link. cDNA-to-genomic spliced alignments (i.e. splice threading) were performed with GeneSeqer. Complete GeneSeqer reports are also available for each contig in the gene finding results. Lastly, where an alignment between Pc cDNA and genomic data was found, gene predictions were made. These predictions are limited by: 1) the ability of GeneSeqer to identify the alignment; 2) the number of Pc cDNAs (5030); and 3) the amount of Pc genomic coverage. In contigs where "No Genes are Predicted", GeneSeqer found no cDNA-to-genomic alignments. Where cDNA-to-genomic spliced alignments were found, the "Genes and Proteins" links indicate intron/exon boundaries, surrounding boundary sequence, alternative protein predictions, and protein homolog predictions.
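To illustrate what "intron/exon boundaries" and "surrounding boundary sequence" mean in practice, the sketch below takes the exon coordinates of one spliced alignment and reports each intron with its terminal dinucleotides and a few flanking exon bases. It does not parse GeneSeqer output; the function name, the dictionary keys, the forward-strand assumption, and the toy contig are all hypothetical.

```python
from typing import Dict, List, Tuple


def describe_introns(contig: str, exons: List[Tuple[int, int]],
                     flank: int = 3) -> List[Dict[str, object]]:
    """List introns implied by forward-strand exons (1-based inclusive coordinates)."""
    ordered = sorted(exons)
    introns = []
    for (_, exon_end), (next_start, _) in zip(ordered, ordered[1:]):
        start, end = exon_end + 1, next_start - 1
        if end - start + 1 < 4:
            continue  # gap too short to carry separate donor and acceptor sites
        seq = contig[start - 1:end].upper()
        donor, acceptor = seq[:2], seq[-2:]
        introns.append({
            "start": start,
            "end": end,
            "donor": donor,
            "acceptor": acceptor,
            # GT...AG is the most common splice signal pair in eukaryotes.
            "canonical": donor == "GT" and acceptor == "AG",
            "upstream_exon": contig[max(0, start - 1 - flank):start - 1],
            "downstream_exon": contig[end:end + flank],
        })
    return introns


# Toy contig: exon (6 bp), intron (12 bp), exon (6 bp).
toy_contig = "ATGGCC" + "GTAAGTTTTCAG" + "GGATAA"
for intron in describe_introns(toy_contig, [(1, 6), (19, 24)]):
    print(intron)
```

A real report would also need to handle alignments on the reverse strand, which this sketch deliberately omits.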

Putative gene predictions for the Pc genome, organized by contig, can be found in the resulting tables. The sequencing portion of the Pc genome project is NOT complete. However, we are conducting gene finding operations prior to sequence completion to provide the Pneumocystis and other scientific research communities with the most up-to-date data available. As sequencing efforts continue, we will re-assemble all available shotgun sequences, cosmid contigs, and cosmid end sequences. When this is done, the contig names and the content of the contig sequences will change. Some contigs will be folded into other existing contigs, while others may be re-formed as more appropriate alignments are found. In general, as Pc sequencing continues, the number of contigs will decrease and contig lengths will increase. Lastly, as additional shotgun sequences are added to the growing Pc genomic assembly, we will re-run the putative gene prediction systems and regularly update the predictions within each contig.

Future Plans

Despite the growing number of fungal genomes being sequenced, only a few gene prediction programs for fungi have been developed. Because of biases in A/T content, intron/exon boundaries, promoter sequences, and gene densities, as well as large differences in organization and structure between fungal genomes, a training set of well-characterized genes and splice signals needs to be developed for each distinct genome. Moreover, a high percentage of genes share no similarity with the sequences of known genes. Approximately half of the putative Pc genes have no identified homologs (a situation similar to that in other fungi, such as yeast and Neurospora). Gene finding in Pc, as in other fungal genomes, therefore represents a significant challenge. Assigning putative protein structure and function to identified Pc genes will also be difficult for genes with no homologs, requiring fold recognition methods that go beyond sequence homology.
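Two of the genome-specific properties named above, A/T content and splice signal composition, are simple to measure once a training set exists. The sketch below is an illustration only and not the project's method: the function names and the example donor-site sequences are assumptions, and the per-position frequency table is just one common starting point for signal models.

```python
from collections import Counter
from typing import Dict, List


def at_content(seq: str) -> float:
    """Fraction of unambiguous bases (A, C, G, T) that are A or T."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    return sum(b in "AT" for b in bases) / len(bases)


def position_frequencies(sites: List[str]) -> List[Dict[str, float]]:
    """Per-position base frequencies for aligned, equal-length signal sequences."""
    if not sites:
        return []
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    table = []
    for column in zip(*sites):
        counts = Counter(base.upper() for base in column)
        table.append({base: counts.get(base, 0) / len(sites) for base in "ACGT"})
    return table


# Hypothetical donor-site sequences taken from the start of confirmed introns.
donor_sites = ["GTAAGT", "GTATGT", "GTAAGA"]
print(round(at_content("ATGGCCGTAAGTTTTCAGGGATAA"), 3))
for position, freqs in enumerate(position_frequencies(donor_sites), start=1):
    print(position, freqs)
```

Ambiguous bases such as N are ignored when computing A/T content so that low-quality regions of an unfinished assembly do not skew the estimate.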

We have an NIH-funded grant to develop software and analysis tools to identify putative genes in the Pc genome and predict their function. We plan to employ an integrated biological-computational strategy for gene finding and annotation in Pc. Our specific aims are to: (1) build a representative Pc gene database that will identify intron/exon boundaries and other related signals; (2) develop and train Pc-specific gene recognition algorithms based on standard pattern recognition approaches; and (3) assign putative structure and function to identified Pc genes using fold recognition methods, leading to a full annotation of the Pc genome. The members of our research team are:
