A microbial genome sequence provides a wealth of information that cannot be obtained by other experimental approaches. It can be regarded as the starting point for detailed bioinformatics analysis, metabolic reconstruction and systematic functional examination of all identified genes. In particular, genome sequencing has revealed important information on the genes and deduced proteins of human pathogens, including putative virulence factors and potential new drug targets. The most widely used strategy for sequencing a microbial genome is the direct shotgun approach, which applies Sanger technology to a small-insert clone library of the organism of interest [1]. An emerging alternative for direct shotgun sequencing is provided by the Genome Sequencer System. This technology relies on small DNA fragments and adaptor sequences, rather than small-insert clone libraries, and on highly parallel sequencing in picoliter-scale volumes, which results in an ultrafast sequencing process [2]. This sequencing strategy is therefore well suited for the rapid determination of genome sequences of hitherto uncharacterized human pathogens.
In this application note, we describe the de novo sequencing of the human pathogen Corynebacterium urealyticum using the Genome Sequencer 20 System. C. urealyticum is frequently isolated from urine samples of catheterized intensive-care patients and causes urinary tract infections that are significantly associated with stone formation [3]. Moreover, C. urealyticum is resistant to multiple clinically relevant antibiotics and is often susceptible only to glycopeptides. The new sequencing approach provided detailed information on the genome of C. urealyticum for the first time. In combination with sophisticated bioinformatics tools, the annotated genome sequence revealed valuable insights into the gene inventory of this emerging nosocomial pathogen.
Materials and Methods
Preparation of genomic DNA
C. urealyticum DSM7109, originally isolated from a bladder stone, was grown in brain heart broth supplemented with 1% (v/v) Tween 80. Genomic DNA was isolated from 8 × 10⁹ cells and purified by an iterative phenol-chloroform extraction method [4]. This procedure yielded high-purity genomic DNA with a concentration of 1.175 ng/µl and an A260 nm/A280 nm ratio of 1.89.
Genome sequencing and assembly
Samples were prepared using the Genome Sequencer DNA Library Preparation Kit (5 µg sample DNA was processed in the standard DNA Library Preparation procedure) and the Genome Sequencer emPCR Kit I (shotgun). Sequencing reactions were performed with the Genome Sequencer 20 Sequencing Kit and the Genome Sequencer PicoTiterPlate Kit. Sequencing runs were carried out using the Genome Sequencer 20 Instrument.
Post-run analysis: Flow data were assembled using the assembly software (Newbler Assembler) of the Genome Sequencer 20 Software, version 1.0.52, as described in the Genome Sequencer 20 Data Processing Software manual.
Genome annotation with GenDB and SAMS
After de novo assembly of the C. urealyticum genome, the resulting sequence contigs were filtered by size. The 69 contigs with a minimal length of 501 bp were then chained together into a pseudochromosome in which the contigs were separated by 12-mer linkers (CTAGCTAGCATG) containing stop codons in all six reading frames. This pseudochromosome was automatically annotated using the GenDB platform [5] and a standard analysis pipeline: after gene prediction with a combined approach using GLIMMER and CRITICA, an automatic function prediction (METANOR) combining standard bioinformatics tools, such as BLAST, HMMer and InterPro, was performed for each identified protein-coding sequence. This approach led to consistent gene annotations, assigning gene names, gene products, EC numbers, functional protein categories (COGs), and other attributes.
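The size filtering and chaining step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used with GenDB: the function names (`split_by_size`, `build_pseudochromosome`, `contig_offsets`) are hypothetical, contigs are assumed to be plain strings, and the linker is taken verbatim from the text.

```python
LINKER = "CTAGCTAGCATG"  # 12-mer linker as printed in the text
MIN_LENGTH = 501         # minimal contig length in bp, as stated in the text


def split_by_size(contigs, min_length=MIN_LENGTH):
    """Separate contigs into large (>= min_length) and small (< min_length)."""
    large = [c for c in contigs if len(c) >= min_length]
    small = [c for c in contigs if len(c) < min_length]
    return large, small


def build_pseudochromosome(contigs, linker=LINKER):
    """Chain contigs into one sequence with the linker between neighbours."""
    return linker.join(contigs)


def contig_offsets(contigs, linker=LINKER):
    """Return 0-based, end-exclusive (start, end) coordinates of each contig
    within the pseudochromosome (hypothetical helper for mapping genes back)."""
    offsets = []
    pos = 0
    for contig in contigs:
        offsets.append((pos, pos + len(contig)))
        pos += len(contig) + len(linker)
    return offsets
```

Keeping the offsets alongside the chained sequence makes it possible to map a gene predicted on the pseudochromosome back to the contig it came from.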
Small sequence contigs (≤500 bp) were analyzed using the Sequence Analysis and Management System (SAMS), which is based on GenDB. SAMS was originally designed and implemented for quality control of sequence data obtained during the high-throughput phase of genome sequencing projects. Similar to the analysis of potential coding regions predicted on a bacterial genome sequence, individual (small) sequences can be processed and annotated using a bioinformatics pipeline analogous to that described above. Accordingly, the application of SAMS results in consistently annotated small sequence contigs.
Results and Discussion
Sequencing of the C. urealyticum genome with the Genome Sequencer System yielded 657,410 sequence reads that were used for de novo genome assembly. With a contig length cut-off of 500 bp, a total of 2,294,755 bases were assembled into 69 contigs with an average contig size of 33,257 bp. Of these assembled bases, 2,291,059 (99.8%) were determined with PHRED 40 quality, meaning that the accuracy of the base call is at least 99.99% at the respective position. The largest assembled contig has a size of 175,964 bp. In addition, 154 small contigs (≤500 bp) with an average length of 144 bp were assembled, giving a total of 22,211 bp. The contiguous sequences of the C. urealyticum genome were uploaded into SAMS and GenDB for rapid annotation of the sequence data. Small contigs were analyzed with a set of bioinformatics tools included in SAMS, and the resulting observations were stored in the SAMS database so that all assembled sequence contigs could be evaluated. Although many of the small contigs revealed no significant hits to public database entries, this automated analysis identified 57 partial protein-coding regions, of which 25 (44%) apparently code for transposases of mobile genetic elements. The 69 large contigs were used for a precise gene prediction with the software tools GLIMMER and CRITICA, which are integrated in the GenDB platform, resulting in the prediction of 2,027 protein-coding regions. Relevant data derived from this genome annotation are summarized in Table 1.
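The summary figures in this paragraph follow from simple arithmetic, which the short Python sketch below makes explicit. The helper names are hypothetical, and the PHRED conversion uses the standard definition Q = -10 log10(P), where P is the probability of a wrong base call. The code is a plausibility check on the reported numbers and not part of the assembly software.

```python
def phred_to_accuracy(q):
    """Base-call accuracy implied by a PHRED score: 1 - 10**(-q / 10)."""
    return 1.0 - 10 ** (-q / 10.0)


def mean_contig_size(total_bases, n_contigs):
    """Average contig length in bp."""
    return total_bases / n_contigs


def percent(part, whole):
    """Share of `part` in `whole`, in percent."""
    return 100.0 * part / whole


# Figures taken from the text above
accuracy_q40 = phred_to_accuracy(40)            # accuracy at PHRED 40
large_mean = mean_contig_size(2_294_755, 69)    # 69 large contigs
small_mean = mean_contig_size(22_211, 154)      # 154 small contigs
q40_share = percent(2_291_059, 2_294_755)       # bases at PHRED 40
transposase_share = percent(25, 57)             # transposase-like regions
```

Rounded to the precision used in the text, these values match the reported average contig sizes, the 99.8% share of PHRED 40 bases, and the 44% share of transposase-like coding regions.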
To assess the quality of the C. urealyticum genome sequence by an additional bioinformatics approach, we used the automatic frameshift prediction performed by the CRITICA tool during the annotation process (Figure 1). Among the 121 candidate regions predicted by CRITICA, only seven turned out to contain a frameshift when compared with the highly similar gene arrangement in the genome of the closely related pathogen Corynebacterium jeikeium [6]. Interestingly, six of these potential frameshifts are located in sequence regions that contain homopolymer stretches of length six and seven. These frameshifts might be due to the way nucleotide incorporation is detected during the sequencing process, which is based on the release of pyrophosphate and the generation of photons. The number of incorporated nucleotides is indicated by the signal intensity, which is, in principle, linear for homopolymers up to at least length eight [2]. The low number of frameshifts apparently associated with homopolymers indicates the very high quality of the established contig sequences of the C. urealyticum genome, especially considering the presence of 451 homopolymer stretches of length 6 to 13 within the 69 contigs. Subsequently, functional annotation of the predicted coding regions was performed with the METANOR tool, using orthology information from the C. jeikeium genome [6] to generate functional protein assignments. This approach allowed us to distinguish proteins that are homologous in both corynebacteria from those that are encoded only by the C. urealyticum genome. The functional classification of the deduced proteins into clusters of orthologous groups of proteins (COGs) [7] is shown in Figure 2a.
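A count of homopolymer stretches such as the one quoted above can be obtained with a simple scan over the contig sequences. The text does not describe how the count was made, so the following Python sketch is only one possible approach. The function names are hypothetical, contigs are assumed to be plain strings, and only maximal runs of A, C, G or T within the given length range are reported.

```python
import re
from collections import Counter


def homopolymer_runs(seq, min_len=6, max_len=13):
    """Yield (start, base, length) for each maximal single-base run whose
    length lies between min_len and max_len (inclusive)."""
    # "(.)\1*" matches one character followed by all of its repeats,
    # so every match is a maximal run of identical characters.
    for match in re.finditer(r"(.)\1*", seq.upper()):
        length = match.end() - match.start()
        if min_len <= length <= max_len and match.group(1) in "ACGT":
            yield match.start(), match.group(1), length


def run_length_histogram(contigs, min_len=6, max_len=13):
    """Count homopolymer runs per run length across all contigs."""
    hist = Counter()
    for contig in contigs:
        for _, _, length in homopolymer_runs(contig, min_len, max_len):
            hist[length] += 1
    return hist
```

Summing the histogram values gives the total number of stretches in the chosen range, and the start positions can be compared with the coordinates of predicted frameshift candidates.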
Approximately three quarters of the predicted proteins (1,589) revealed significant similarity to proteins encoded by the C. jeikeium genome and are apparently shared between both corynebacteria (Figure 2b), whereas 438 proteins had no homolog in C. jeikeium (Figure 2c). Among the proteins lacking homology with proteins from C. jeikeium, we identified the components of a microbial urease and accessory proteins, which together could constitute a functional urease machinery involved in the utilization of urea as a nitrogen source and in the concomitant splitting of urea. This enzymatic reaction could be responsible for alkalinization of human urine, which in turn could cause damage to epithelial cells of the urinary tract along with struvite stone formation. The respective proteins can therefore be regarded as prominent virulence factors of C. urealyticum.
This study demonstrates that the Genome Sequencer System provides a powerful technology for determining high-quality de novo sequences of bacterial genomes without prior DNA cloning. In conjunction with sophisticated bioinformatics platforms for genome annotation, this new sequencing approach enables researchers to rapidly decipher and analyze the gene inventory of previously uncharacterized human pathogens. In-depth analyses of the annotated genome sequence may soon provide important insights into the lifestyle and pathophysiology of C. urealyticum and into the molecular mechanisms of its multidrug resistance.
References
1. Frangeul L et al. (1999) Microbiology 145: 2625–2634
2. Margulies M et al. (2005) Nature 437: 376–380
3. Funke G et al. (1997) Clin Microbiol Rev 10: 125–159
4. Tauch A et al. (1995) Plasmid 33: 168–179
5. Meyer F et al. (2003) Nucleic Acids Res 31: 2187–2195
6. Tauch A et al. (2005) J Bacteriol 187: 4671–4682
7. Tatusov RL et al. (1997) Science 278: 631–637
This article was originally published in Biochemica 4/2006, pages 4-6. ©Springer Medizin Verlag 2006