Sequence assembly

In bioinformatics, sequence assembly refers to aligning and merging many fragments of a much longer DNA sequence in order to reconstruct the original sequence. Typically, the short fragments result from shotgun sequencing of genomic DNA or of gene transcripts (ESTs).

First-generation sequence assemblers began to appear in the late 1980s and early 1990s to piece together the vast quantities of fragments generated by automated sequencing instruments. These assemblers often employed a greedy approach to the shortest common supersequence problem, outlined below. As a result, they are not well suited to reconstructing original sequences that contain repeats or noise. The following first-generation assemblers are widely used within the industry:

  • Phrap, by Phil Green, The University of Washington
  • TIGR Assembler, The Institute for Genomic Research
  • CAP3, by Xiaoqiu Huang, Michigan Technological University
  • Sequencher, Gene Codes Corporation

Modern assemblers such as DNA Baser and Celera Assembler bring many improvements over the first generation of assemblers.


Assemblers for large genomes

Faced with the challenge of assembling the much larger genome of the fruit fly Drosophila melanogaster in 2000 and the human genome just a year later, scientists developed assemblers such as Celera Assembler (first developed by a private company) and Arachne, which are able to handle genomes of 100-300 million base pairs. Following these efforts, several other groups, mostly at the major genome sequencing centers, built large-scale assemblers, and an open source effort known as AMOS was launched to bring together the innovations in genome assembly technology under an open source framework.

EST assembly differs from genome assembly in several ways. For instance, genomes often have large amounts of repetitive sequence, mainly in the intergenic regions. Since ESTs represent gene transcripts, they will not contain these repeats. On the other hand, genes sometimes overlap in the genome (sense-antisense transcription), and should ideally still be assembled separately. EST assembly is also complicated by features such as (cis-) alternative splicing, trans-splicing, single-nucleotide polymorphisms, recoding, and post-transcriptional modification. These differences make the new generation of assemblers less applicable to EST assembly.

Greedy algorithm

Given a set of sequence fragments, the objective is to find the shortest common supersequence.

  1. calculate pairwise alignments of all fragments
  2. choose two fragments with the largest overlap
  3. merge chosen fragments
  4. repeat steps 2 and 3 until only one fragment is left

Because each merge is chosen greedily, the result is in general a suboptimal solution to the problem.
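
The greedy procedure above can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions, not the method of any particular assembler: the pairwise alignment of step 1 is replaced by an exact suffix-prefix match, the reads are assumed to be error-free and all on the same strand, and the names overlap and greedy_assemble are hypothetical helpers introduced for this example.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Assumption: exact matching stands in for a real pairwise alignment.
    """
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: list[str]) -> str:
    """Repeatedly merge the two fragments with the largest overlap."""
    # Remove exact duplicates while keeping the input order.
    frags = list(dict.fromkeys(fragments))
    while True:
        # Drop fragments that are fully contained in another fragment.
        frags = [f for f in frags if not any(f != g and f in g for g in frags)]
        if len(frags) <= 1:
            break
        # Steps 1 and 2: score every ordered pair, keep the largest overlap.
        best_k, best_i, best_j = -1, 0, 1
        for i, j in permutations(range(len(frags)), 2):
            k = overlap(frags[i], frags[j])
            if k > best_k:
                best_k, best_i, best_j = k, i, j
        # Step 3: merge the chosen pair, writing the shared region only once.
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for n, f in enumerate(frags) if n not in (best_i, best_j)]
        frags.append(merged)
    return frags[0] if frags else ""


# Four overlapping reads taken from one made-up 19-base sequence.
reads = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
print(greedy_assemble(reads))  # ATTAGACCTGCCGGAATAC
```

Rescoring every pair after each merge keeps the sketch short but is wasteful; a practical implementation would cache the overlaps. When two fragments share no overlap they are simply concatenated, so the loop always ends with a single fragment, as in step 4.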

See also


  • Phrap
  • TIGR Assembler
  • Staden Package - GAP4
  • CAP3
  • CLC bio Advanced contig assembly
  • DNA Baser
  • AMOS
  • MIRA and miraEST
  • CodonCode Aligner
  • Sequencher
  • Mosaik
This article is licensed under the GNU Free Documentation License. It uses material from the Wikipedia article "Sequence_assembly". A list of authors is available in Wikipedia.