To use all functions of this page, please activate cookies in your browser.
With an accout for my.bionity.com you can always see everything at a glance – and you can configure your own website and individual newsletter.
- My watch list
- My saved searches
- My saved topics
- My newsletter
Additional recommended knowledge
In the process of evolution, from one generation to the next the amino acid sequences of an organism's proteins are gradually altered through the action of DNA mutations. For example, the sequence
could mutate into the sequence
in one generation and possibly
over a longer period of evolutionary time. Each amino acid is more or less likely to mutate into various other amino acids. For instance, a hydrophobic residue such as valine is more likely to stay hydrophobic than not, since replacing it with a hydrophilic residue could affect the folding and/or activity of the protein.
If we have two amino acid sequences in front of us, we should be able to say something about how likely they are to be derived from a common ancestor, or homologous. If we can line up the two sequences using a sequence alignment algorithm such that the mutations required to transform a hypothetical ancestor sequence into both of the current sequences would be evolutionarily plausible, then we'd like to assign a high score to the comparison of the sequences.
To this end, we will construct a 20x20 matrix where the (i,j)th entry is equal to the probability of the ith amino acid being transformed into the jth amino acid in a certain amount of evolutionary time. There are many different ways to construct such a matrix, called a substitution matrix. Here are the most commonly used ones:
The simplest possible substitution matrix would be one in which each amino acid is considered maximally similar to itself, but not able to transform into any other amino acid. This matrix would look like:
This identity matrix will succeed in the alignment of very similar amino acid sequences but will be miserable at aligning two distantly related sequences. We need to figure out all the probabilities in a more rigorous fashion. It turns out that an empirical examination of previously aligned sequences works best.
We express the probabilities of transformation in what are called log-odds scores. The scores matrix S is defined as
where Mi,j is the probability that amino acid i transforms into amino acid j and pi is the frequency of amino acid i. The base of the logarithm is not important, and you will often see the same substitution matrix expressed in different bases.
One of the first amino acid substitution matrices, the PAM (Point Accepted Mutation) matrix was developed by Margaret Dayhoff in the 1970s. This matrix is calculated by observing the differences in closely related proteins. The PAM1 matrix estimates what rate of substitution would be expected if 1% of the amino acids had changed. The PAM1 matrix is used as the basis for calculating other matrices by assuming that repeated mutations would follow the same pattern as those in the PAM1 matrix, and multiple substitutions can occur at the same site. Using this logic, Dayhoff derived matrices as high as PAM250. Usually the PAM 30 and the PAM70 are used.
A matrix for divergent sequences can be calculated from a matrix for closely related sequences by taking the second matrix to a power. For instance, we can roughly approximate the WIKI2 matrix from the WIKI1 matrix by saying where W1 is WIKI1 and W2 is WIKI2. This is how the PAM250 matrix is calculated.
Dayhoff's methodology of comparing closely related species turned out not to work very well for aligning evolutionarily divergent sequences. Sequence changes over long evolutionary time scales are not well approximated by compounding small changes that occur over short time scales. The BLOSUM (BLOck SUbstitution Matrix) series of matrices rectifies this problem. Henikoff and Henikoff constructed these matrices using multiple alignments of evolutionarily divergent proteins. The probabilities used in the matrix calculation are computed by looking at "blocks" of conserved sequences found in multiple protein alignments. These conserved sequences are assumed to be of functional importance within related proteins. The BLOSUM62 matrix is calculated from observed substitutions between proteins that share 62% sequence identity or more - clustering together proteins showing 62% sequence identity or more and giving weight 1 to each such cluster (Henikoff and Henikoff), (and the BLOSUM100 matrix is calculated from alignments between proteins showing 100% identity the proteins in the BLOCKS database (since all have 100% sequence identity or less)?). One would use a higher numbered BLOSUM matrix for aligning two closely related sequences and a lower number for more divergent sequences.
It turns out that the BLOSUM62 matrix does an excellent job detecting similarities in distant sequences, and this is the matrix used by default in most recent alignment applications such as BLAST.
Differences between PAM and BLOSUM
Current innovative approaches include incorporating secondary structure information into the sequences and substitution matrices. See this paper for an example of this direction of research.
|This article is licensed under the GNU Free Documentation License. It uses material from the Wikipedia article "Substitution_matrix". A list of authors is available in Wikipedia.|