fprotdist
Wiki
The master copies of EMBOSS documentation are available at http://emboss.open-bio.org/wiki/Appdocs on the EMBOSS Wiki. Please help by correcting and extending the Wiki pages.
Function
Protein distance algorithm
Description
Computes a distance measure for protein sequences, using maximum likelihood estimates based on the Dayhoff PAM matrix, the JTT matrix model, the PMB model, Kimura's 1983 approximation to these, or a model based on the genetic code plus a constraint on changing to a different category of amino acid. The distances can also be corrected for gamma-distributed and gamma-plus-invariant-sites-distributed rates of change in different sites. Rates of evolution can vary among sites in a prespecified way, and also according to a Hidden Markov model. The program can also make a table of percentage similarity among sequences. The distances can be used in the distance matrix programs.
Algorithm
This program uses protein sequences to compute a distance matrix, under five different models of amino acid replacement. It can also compute a table of similarity between the amino acid sequences. The distance for each pair of species estimates the total branch length between the two species, and can be used in the distance matrix programs FITCH, KITSCH or NEIGHBOR. This is an alternative to use of the sequence data itself in the parsimony program PROTPARS.
The program reads in protein sequences and writes an output file containing the distance matrix or similarity table. The five models of amino acid substitution are: one based on the Jones, Taylor and Thornton (1992) model of amino acid change; the PMB model (Veerassamy, Smith and Tillier, 2004), which is derived from the Blocks database of conserved protein motifs; one based on the PAM matrices of Margaret Dayhoff; one due to Kimura (1983), which approximates the PAM distance based simply on the fraction of similar amino acids; and one based on a model in which the amino acids are divided up into groups, with change occurring based on the genetic code but with greater difficulty of changing between groups. The program correctly takes into account a variety of sequence ambiguities.
The five methods are:
(1) The Dayhoff PAM matrix. This uses Dayhoff's PAM 001 matrix from Dayhoff (1979), page 348. The PAM model is an empirical one that scales probabilities of change from one amino acid to another in terms of a unit which is an expected 1% change between two amino acid sequences. The PAM 001 matrix is used to make a transition probability matrix which allows prediction of the probability of changing from any one amino acid to any other, and also predicts equilibrium amino acid composition. The program assumes that these probabilities are correct and bases its computations of distance on them. The distance that is computed is scaled in units of expected fraction of amino acids changed. This is a unit such that 1.0 is 100 PAM's.
(2) The Jones-Taylor-Thornton model. This is similar to the Dayhoff PAM model, except that it is based on a recounting of the number of observed changes in amino acids by Jones, Taylor, and Thornton (1992). They used a much larger sample of protein sequences than did Dayhoff. The distance is scaled in units of the expected fraction of amino acids changed (100 PAM's). Because its sample is so much larger this model is to be preferred over the original Dayhoff PAM model. It is the default model in this program.
(3) The PMB (Probability Matrix from Blocks) model. This is derived using the Blocks database of conserved protein motifs. It will be described in a paper by Veerassamy, Smith and Tillier (2004). Elisabeth Tillier kindly made the matrices available for this model.
(4) Kimura's distance. This is a rough-and-ready distance formula for approximating PAM distance by simply measuring the fraction of amino acids, p, that differs between two sequences and computing the distance as (Kimura, 1983) $D = -\ln(1 - p - 0.2\,p^2)$. This is very quick to do but has some obvious limitations. It does not take into account which amino acids differ or to what amino acids they change, so some information is lost. The units of the distance measure are fraction of amino acids differing, as also in the case of the PAM distance. If the fraction of amino acids differing gets larger than 0.8541 the distance becomes infinite.
(5) The Categories distance. This is my own concoction. I imagined a nucleotide sequence changing according to Kimura's 2-parameter model, with the exception that some changes of amino acids are less likely than others. The amino acids are grouped into a series of categories. Any base change that does not change which category the amino acid is in is allowed, but if an amino acid changes category this is allowed only a certain fraction of the time. The fraction is called the "ease" and there is a parameter for it, which is 1.0 when all changes are allowed and near 0.0 when changes between categories are nearly impossible.
In this option I have allowed the user to select the Transition/Transversion ratio, which of several genetic codes to use, and which categorization of amino acids to use. There are three of them, a somewhat random sample:
- The George-Hunt-Barker (1988) classification of amino acids,
- A classification provided by my colleague Ben Hall when I asked him for one,
- One I found in an old "baby biochemistry" book (Conn and Stumpf, 1963), which contains most of the biochemistry I was ever taught, and all that I ever learned.
Interestingly enough, all of them are consistent with the same linear ordering of amino acids, which they divide up in different ways. For the Categories model I have set as default the George/Hunt/Barker classification with the "ease" parameter set to 0.457 which is approximately the value implied by the empirical rates in the Dayhoff PAM matrix.
The method uses, as I have noted, Kimura's (1980) 2-parameter model of DNA change. The Kimura "2-parameter" model allows for a difference between transition and transversion rates. Its transition probability matrix for a short interval of time is:
              To:     A         G         C         T
                   ---------------------------------------
               A  | 1-a-2b      a         b         b
       From:   G  |   a       1-a-2b      b         b
               C  |   b         b       1-a-2b      a
               T  |   b         b         a       1-a-2b

where a is u dt, the product of the rate of transitions per unit time and the length dt of the time interval, and b is v dt, the product of half the rate of transversions (i.e., the rate of a specific transversion) and the length dt of the time interval.
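As a minimal sketch of the matrix above, the following Python builds it for given values of a and b. The function name `k2p_short_interval_matrix` and the numbers passed to it are illustrative assumptions, not part of fprotdist.

```python
def k2p_short_interval_matrix(a, b):
    """Kimura 2-parameter matrix for a short time interval.

    Rows and columns are in the order A, G, C, T, as in the text.
    a: probability of the transition (A<->G or C<->T) in the interval.
    b: probability of each specific transversion in the interval.
    """
    stay = 1.0 - a - 2.0 * b  # probability that the base does not change
    return [
        [stay, a, b, b],  # from A
        [a, stay, b, b],  # from G
        [b, b, stay, a],  # from C
        [b, b, a, stay],  # from T
    ]


# Illustrative values only: a = u*dt and b = v*dt for some small dt.
matrix = k2p_short_interval_matrix(a=0.02, b=0.005)
for base, row in zip("AGCT", matrix):
    print(base, " ".join(f"{x:.3f}" for x in row))
```

Each row sums to 1, and the matrix is symmetric, which is what the layout in the text implies.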
Each distance that is calculated is an estimate, from that particular pair of species, of the divergence time between those two species. The Kimura distance is straightforward to compute. The other methods are considerably slower: they look at all positions and find the distance that makes the likelihood highest. This distance is in effect the length of the internal branch in a two-species tree that connects these two species. Its likelihood is just the product, under the model, of the probabilities of each position having the (one or) two amino acids that are actually found. This is fairly slow to compute.
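Kimura's formula from method (4), D = -ln(1 - p - 0.2 p^2), is simple enough to sketch directly. The function name `kimura_distance` is an assumption for illustration, and returning `math.inf` when the logarithm's argument is not positive is a choice made here, not a description of what the program does internally.

```python
import math


def kimura_distance(p):
    """Kimura (1983) approximation to the PAM distance.

    p is the fraction of amino acid positions that differ between
    two aligned sequences (0 <= p <= 1).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a fraction between 0 and 1")
    arg = 1.0 - p - 0.2 * p * p
    if arg <= 0.0:
        # Beyond roughly p = 0.8541 the distance is infinite.
        return math.inf
    return -math.log(arg)


for p in (0.0, 0.1, 0.5, 0.85, 0.86):
    print(p, kimura_distance(p))
```

The threshold p of about 0.8541 is the positive root of 1 - p - 0.2 p^2 = 0.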
The computation proceeds from an eigenanalysis (spectral decomposition) of the transition probability matrix. In the case of the PAM 001 matrix the eigenvalues and eigenvectors are precomputed and are hard-coded into the program in over 400 statements. In the case of the Categories model the program computes the eigenvalues and eigenvectors itself, which will add a delay. But the delay is independent of the number of species as the calculation is done only once, at the outset.
The actual algorithm for estimating the distance is in both cases a bisection algorithm which tries to find the point at which the derivative of the likelihood is zero. Some of the kinds of ambiguous amino acids like "glx" are correctly taken into account. Gaps, however, are treated as if they are unknown amino acids, which means those positions get dropped from that particular comparison. They are not dropped from the whole analysis. You need not eliminate regions containing gaps, as long as you are reasonably sure of the alignment there.
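The bisection idea can be sketched on a toy model. The code below assumes an equal-rates model with 20 states, in which every amino acid is equally likely to change to any other; this is not the PAM, JTT, PMB or Categories model, whose matrices are not reproduced here. All names are hypothetical. For this toy model the maximum likelihood distance has a closed form, which makes the bisection result easy to check.

```python
import math

K = 20  # number of amino acid states in the toy equal-rates model


def dloglik_dt(t, n_same, n_diff):
    """Derivative of the log-likelihood with respect to the distance t.

    n_same: number of positions where the two sequences agree.
    n_diff: number of positions where they differ.
    """
    e = math.exp(-K * t / (K - 1))
    p_same = 1.0 / K + (K - 1.0) / K * e  # P(same amino acid after t)
    p_diff = (1.0 - e) / K                # P(one specific other amino acid)
    return n_same * (-e) / p_same + n_diff * (e / (K - 1)) / p_diff


def ml_distance_by_bisection(n_same, n_diff, lo=1e-9, hi=50.0, tol=1e-10):
    """Find the t where the derivative of the log-likelihood is zero."""
    if n_diff == 0:
        return 0.0
    if dloglik_dt(hi, n_same, n_diff) > 0:
        return math.inf  # likelihood still rising: no finite estimate
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if dloglik_dt(mid, n_same, n_diff) > 0:
            lo = mid  # likelihood still increasing, root lies to the right
        else:
            hi = mid  # likelihood decreasing, root lies to the left
    return 0.5 * (lo + hi)


estimate = ml_distance_by_bisection(n_same=10, n_diff=3)
closed_form = -(K - 1) / K * math.log(1.0 - K / (K - 1) * (3 / 13))
print(estimate, closed_form)
```

With a real substitution matrix the per-site probabilities depend on which amino acids are observed, so the derivative has to be summed position by position, but the search for its zero proceeds in the same way.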
Note that there is an assumption that we are looking at all positions, including those that have not changed at all. It is important not to restrict attention to some positions based on whether or not they have changed; doing that would bias the distances by making them too large, and that in turn would cause the distances to misinterpret the meaning of those positions that had changed.
The program can now correct distances for unequal rates of change at different amino acid positions. This correction, which was introduced for DNA sequences by Jin and Nei (1990), assumes that the distribution of rates of change among amino acid positions follows a Gamma distribution. The user is asked for the value of a parameter that determines the amount of variation of rates among amino acid positions. Instead of the more widely-known coefficient alpha, PROTDIST uses the coefficient of variation (ratio of the standard deviation to the mean) of rates among amino acid positions. So if there is 20% variation in rates, the CV is 0.20. The square of the CV is also the reciprocal of the better-known "shape parameter", alpha, of the Gamma distribution, so in this case the shape parameter alpha = 1/(0.20*0.20) = 25. If you want to achieve a particular value of alpha, such as 100, you will want to use a CV of 1/sqrt(100) = 1/10 = 0.1.
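The relation between the coefficient of variation and the Gamma shape parameter, alpha = 1/CV^2, can be written as two small helper functions. The names `alpha_from_cv` and `cv_from_alpha` are illustrative and do not come from the program.

```python
import math


def alpha_from_cv(cv):
    """Gamma shape parameter alpha for a given coefficient of variation."""
    if cv <= 0:
        raise ValueError("cv must be positive")
    return 1.0 / (cv * cv)


def cv_from_alpha(alpha):
    """Coefficient of variation to supply for a desired alpha."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return 1.0 / math.sqrt(alpha)


print(alpha_from_cv(0.20))   # CV of 0.20 corresponds to alpha near 25
print(cv_from_alpha(100.0))  # alpha of 100 corresponds to CV near 0.1
```

The second function gives the value to pass as the coefficient of variation when a paper or another program reports alpha instead.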
In addition to the five distance calculations, the program can also compute a table of similarities between amino acid sequences. These values are the fractions of amino acid positions identical between the sequences. The diagonal values are 1.0000. No attempt is made to count similarity of nonidentical amino acids, so that no credit is given for having (for example) different hydrophobic amino acids at the corresponding positions in the two sequences. This option has been requested by many users, who need it for descriptive purposes. It is not intended that the table be used for inferring the tree.
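A minimal sketch of such a similarity table follows. It counts only exact identities, as described above. How the program treats gaps and ambiguity codes in this table is not stated here, so the sketch simply compares characters as they are; that is an assumption, and the function names are hypothetical.

```python
def fraction_identical(seq1, seq2):
    """Fraction of aligned positions at which two sequences are identical."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    if not seq1:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for x, y in zip(seq1, seq2) if x == y)
    return matches / len(seq1)


def similarity_table(sequences):
    """Build a nested dict of pairwise identity fractions.

    sequences: dict mapping species name to aligned sequence.
    """
    return {
        a: {b: fraction_identical(sa, sb) for b, sb in sequences.items()}
        for a, sa in sequences.items()
    }


aligned = {
    "Alpha": "AACGTGGCCACAT",
    "Beta": "AAGGTCGCCACAC",
    "Gamma": "CAGTTCGCCACAA",
}
table = similarity_table(aligned)
for name, row in table.items():
    print(name, " ".join(f"{v:.4f}" for v in row.values()))
```

The diagonal entries are 1.0 because each sequence is compared with itself.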
Usage
Here is a sample session with fprotdist
% fprotdist
Protein distance algorithm
Input (aligned) protein sequence set(s): protdist.dat
Phylip distance matrix output file [protdist.fprotdist]:

Computing distances:
Alpha
Beta .
Gamma ..
Delta ...
Epsilon ....

Output written to file "protdist.fprotdist"

Done.
Command line arguments
Protein distance algorithm
Version: EMBOSS:6.3.0

   Standard (Mandatory) qualifiers:
  [-sequence]          seqsetall  File containing one or more sequence alignments
  [-outfile]           outfile    [*.fprotdist] Phylip distance matrix output file

   Additional (Optional) qualifiers (* if not always prompted):
   -ncategories        integer    [1] Number of substitution rate categories (Integer from 1 to 9)
*  -rate               array      Rate for each category
*  -categories         properties File of substitution rate categories
   -weights            properties Weights file
   -method             menu       [j] Choose the method to use (Values: j (Jones-Taylor-Thornton matrix); h (Henikoff/Tiller PMB matrix); d (Dayhoff PAM matrix); k (Kimura formula); s (Similarity table); c (Categories model))
*  -gamma              menu       [c] Rate variation among sites (Values: g (Gamma distributed rates); i (Gamma+invariant sites); c (Constant rate))
*  -gammacoefficient   float      [1] Coefficient of variation of substitution rate among sites (Number 0.001 or more)
*  -invarcoefficient   float      [1] Coefficient of variation of substitution rate among sites (Number 0.001 or more)
*  -aacateg            menu       [G] Choose the category to use (Values: G (George/Hunt/Barker (Cys), (Met Val Leu Ileu), (Gly Ala Ser Thr Pro)); C (Chemical (Cys Met), (Val Leu Ileu Gly Ala Ser Thr), (Pro)); H (Hall (Cys), (Met Val Leu Ileu), (Gly Ala Ser Thr),(Pro)))
*  -whichcode          menu       [u] Which genetic code (Values: u (Universal); c (Ciliate); m (Universal mitochondrial); v (Vertebrate mitochondrial); f (Fly mitochondrial); y (Yeast mitochondrial))
*  -ease               float      [0.457] Prob change category (1.0=easy) (Number from 0.000 to 1.000)
*  -ttratio            float      [2.0] Transition/transversion ratio (Number 0.000 or more)
*  -basefreq           array      [0.25 0.25 0.25 0.25] Base frequencies for A C G T/U (use blanks to separate)
   -printdata          boolean    [N] Print data at start of run
   -[no]progress       boolean    [Y] Print indications of progress of run

   Advanced (Unprompted) qualifiers: (none)

   Associated qualifiers:

   "-sequence" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-outfile" associated qualifiers
   -odirectory2        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages
   -version            boolean    Report version number and exit
Qualifier | Type | Description | Allowed values | Default |
---|---|---|---|---|
Standard (Mandatory) qualifiers | | | | |
[-sequence] (Parameter 1) | seqsetall | File containing one or more sequence alignments | Readable sets of sequences | Required |
[-outfile] (Parameter 2) | outfile | Phylip distance matrix output file | Output file | <*>.fprotdist |
Additional (Optional) qualifiers | | | | |
-ncategories | integer | Number of substitution rate categories | Integer from 1 to 9 | 1 |
-rate | array | Rate for each category | List of floating point numbers | |
-categories | properties | File of substitution rate categories | Property value(s) | |
-weights | properties | Weights file | Property value(s) | |
-method | list | Choose the method to use | j (Jones-Taylor-Thornton matrix); h (Henikoff/Tiller PMB matrix); d (Dayhoff PAM matrix); k (Kimura formula); s (Similarity table); c (Categories model) | j |
-gamma | list | Rate variation among sites | g (Gamma distributed rates); i (Gamma+invariant sites); c (Constant rate) | c |
-gammacoefficient | float | Coefficient of variation of substitution rate among sites | Number 0.001 or more | 1 |
-invarcoefficient | float | Coefficient of variation of substitution rate among sites | Number 0.001 or more | 1 |
-aacateg | list | Choose the category to use | G (George/Hunt/Barker (Cys), (Met Val Leu Ileu), (Gly Ala Ser Thr Pro)); C (Chemical (Cys Met), (Val Leu Ileu Gly Ala Ser Thr), (Pro)); H (Hall (Cys), (Met Val Leu Ileu), (Gly Ala Ser Thr),(Pro)) | G |
-whichcode | list | Which genetic code | u (Universal); c (Ciliate); m (Universal mitochondrial); v (Vertebrate mitochondrial); f (Fly mitochondrial); y (Yeast mitochondrial) | u |
-ease | float | Prob change category (1.0=easy) | Number from 0.000 to 1.000 | 0.457 |
-ttratio | float | Transition/transversion ratio | Number 0.000 or more | 2.0 |
-basefreq | array | Base frequencies for A C G T/U (use blanks to separate) | List of floating point numbers | 0.25 0.25 0.25 0.25 |
-printdata | boolean | Print data at start of run | Boolean value Yes/No | No |
-[no]progress | boolean | Print indications of progress of run | Boolean value Yes/No | Yes |
Advanced (Unprompted) qualifiers | | | | |
(none) | | | | |
Associated qualifiers | | | | |
"-sequence" associated seqsetall qualifiers | | | | |
-sbegin1 -sbegin_sequence | integer | Start of each sequence to be used | Any integer value | 0 |
-send1 -send_sequence | integer | End of each sequence to be used | Any integer value | 0 |
-sreverse1 -sreverse_sequence | boolean | Reverse (if DNA) | Boolean value Yes/No | N |
-sask1 -sask_sequence | boolean | Ask for begin/end/reverse | Boolean value Yes/No | N |
-snucleotide1 -snucleotide_sequence | boolean | Sequence is nucleotide | Boolean value Yes/No | N |
-sprotein1 -sprotein_sequence | boolean | Sequence is protein | Boolean value Yes/No | N |
-slower1 -slower_sequence | boolean | Make lower case | Boolean value Yes/No | N |
-supper1 -supper_sequence | boolean | Make upper case | Boolean value Yes/No | N |
-sformat1 -sformat_sequence | string | Input sequence format | Any string | |
-sdbname1 -sdbname_sequence | string | Database name | Any string | |
-sid1 -sid_sequence | string | Entryname | Any string | |
-ufo1 -ufo_sequence | string | UFO features | Any string | |
-fformat1 -fformat_sequence | string | Features format | Any string | |
-fopenfile1 -fopenfile_sequence | string | Features file name | Any string | |
"-outfile" associated outfile qualifiers | | | | |
-odirectory2 -odirectory_outfile | string | Output directory | Any string | |
General qualifiers | | | | |
-auto | boolean | Turn off prompts | Boolean value Yes/No | N |
-stdout | boolean | Write first file to standard output | Boolean value Yes/No | N |
-filter | boolean | Read first file from standard input, write first file to standard output | Boolean value Yes/No | N |
-options | boolean | Prompt for standard and additional values | Boolean value Yes/No | N |
-debug | boolean | Write debug output to program.dbg | Boolean value Yes/No | N |
-verbose | boolean | Report some/full command line options | Boolean value Yes/No | Y |
-help | boolean | Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose | Boolean value Yes/No | N |
-warning | boolean | Report warnings | Boolean value Yes/No | Y |
-error | boolean | Report errors | Boolean value Yes/No | Y |
-fatal | boolean | Report fatal errors | Boolean value Yes/No | Y |
-die | boolean | Report dying program messages | Boolean value Yes/No | Y |
-version | boolean | Report version number and exit | Boolean value Yes/No | N |
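For scripted use, a non-interactive command line can be assembled from the qualifiers listed above. The helper below, `build_fprotdist_command`, is a hypothetical convenience written for this page, not part of EMBOSS. It only builds the argument list; running it assumes fprotdist is installed and on the PATH. Only qualifiers and values that appear in the table are used, and the rule that a coefficient of variation is only accepted together with a non-constant `-gamma` setting is a simplification made here.

```python
VALID_METHODS = {"j", "h", "d", "k", "s", "c"}  # values of -method
VALID_GAMMA = {"g", "i", "c"}                   # values of -gamma


def build_fprotdist_command(sequence, outfile, method="j", gamma="c",
                            gammacoefficient=None):
    """Return an argument list for a non-interactive fprotdist run."""
    if method not in VALID_METHODS:
        raise ValueError(f"unknown method: {method!r}")
    if gamma not in VALID_GAMMA:
        raise ValueError(f"unknown gamma setting: {gamma!r}")
    cmd = ["fprotdist", "-sequence", sequence, "-outfile", outfile,
           "-method", method]
    if gamma != "c":
        cmd += ["-gamma", gamma]
    if gammacoefficient is not None:
        if gamma == "c":
            raise ValueError("gammacoefficient needs gamma 'g' or 'i'")
        if gammacoefficient < 0.001:
            raise ValueError("gammacoefficient must be 0.001 or more")
        cmd += ["-gammacoefficient", str(gammacoefficient)]
    cmd.append("-auto")  # turn off prompts
    return cmd


print(" ".join(build_fprotdist_command("protdist.dat", "protdist.fprotdist")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where the program is available.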
Input file format
fprotdist reads any normal sequence USAs.
Input files for usage example
File: protdist.dat
    5   13
Alpha     AACGTGGCCACAT
Beta      AAGGTCGCCACAC
Gamma     CAGTTCGCCACAA
Delta     GAGATTTCCGCCT
Epsilon   GAGATCTCCGCCC
Output file format
fprotdist output contains on its first line the number of species. The distance matrix is then printed in standard form, with each species starting on a new line with the species name, followed by the distances to the species in order. These continue onto a new line after every nine distances. The distance matrix is square with zero distances on the diagonal. In general the format of the distance matrix is such that it can serve as input to any of the distance matrix programs.
If the similarity table is selected, the table that is produced is not in a format that can be used as input to the distance matrix programs. It has a heading, and the species names are also put at the tops of the columns of the table (or rather, the first 8 characters of each species name are there, the other two characters omitted to save space). There is not an option to put the table into a format that can be read by the distance matrix programs, nor is there one to make it into a table of fractions of difference by subtracting the similarity values from 1. This is done deliberately to make it more difficult for the user to use these values to construct trees. The similarity values are not corrected for multiple changes, and their use to construct trees (even after converting them to fractions of difference) would be wrong, as it would lead to severe conflict between the distant pairs of sequences and the close pairs of sequences.
If the option to print out the data is selected, the output file will precede the data by more complete information on the input and the menu selections. The output file begins by giving the number of species and the number of characters, and the identity of the distance measure that is being used.
In the Categories model of substitution, the distances printed out are scaled in terms of expected numbers of substitutions, counting both transitions and transversions but not replacements of a base by itself, and scaled so that the average rate of change is set to 1.0. For the Dayhoff PAM and Kimura models the distances are scaled in terms of the expected numbers of amino acid substitutions per site. Of course, when a branch is twice as long this does not mean that there will be twice as much net change expected along it, since some of the changes may occur in the same site and overlie or even reverse each other. The branch length estimates here are in terms of the expected underlying numbers of changes. That means that a branch of length 0.26 is 26 times as long as one which would show a 1% difference between the protein (or nucleotide) sequences at the beginning and end of the branch. But we would not expect the sequences at the beginning and end of the branch to be 26% different, as there would be some overlaying of changes.
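The gap between branch length and observed difference can be illustrated with a toy calculation. The sketch assumes an equal-rates model with 20 amino acid states, which is not one of the program's models, so the number it prints only indicates the direction and rough size of the effect. The function name is hypothetical.

```python
import math


def expected_observed_difference(t, k=20):
    """Expected fraction of differing sites after branch length t.

    Toy equal-rates model with k states; t is measured in expected
    substitutions per site, so repeated hits on one site are counted in t
    but are not visible in the observed difference.
    """
    return (k - 1) / k * (1.0 - math.exp(-k * t / (k - 1)))


t = 0.26
print(f"branch length {t}: expected observed difference "
      f"{expected_observed_difference(t):.3f}")
```

Under this assumption a branch of length 0.26 shows an expected difference of about 0.23 rather than 0.26, and the shortfall grows as the branch gets longer.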
One problem that can arise is that two or more of the species can be so dissimilar that the distance between them would have to be infinite, as the likelihood rises indefinitely as the estimated divergence time increases. For example, with the Kimura model, if the two sequences differ in 85.41% or more of their positions then the estimate of divergence time would be infinite. Since there is no way to represent an infinite distance in the output file, the program regards this as an error, issues a warning message indicating which pair of species are causing the problem, and computes a distance of -1.0.
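A downstream script can read the square matrix described above and look for the -1.0 marker before using the distances. The parser below is a minimal sketch: it assumes species names contain no whitespace, splits on whitespace so that rows wrapped after nine distances are handled, and does not attempt to read the similarity table. Function names are hypothetical.

```python
def parse_distance_matrix(text):
    """Parse a square distance matrix into (names, rows)."""
    tokens = text.split()
    n = int(tokens[0])  # first line: number of species
    pos = 1
    names, rows = [], []
    for _ in range(n):
        names.append(tokens[pos])
        row = [float(x) for x in tokens[pos + 1:pos + 1 + n]]
        if len(row) != n:
            raise ValueError(f"row for {names[-1]} is incomplete")
        rows.append(row)
        pos += 1 + n
    return names, rows


def undefined_pairs(names, rows):
    """List species pairs whose distance is negative (reported as -1.0)."""
    return [(names[i], names[j])
            for i in range(len(names))
            for j in range(i + 1, len(names))
            if rows[i][j] < 0]


sample = """    5
Alpha       0.000000  0.331834  0.628142  1.036660  1.365098
Beta        0.331834  0.000000  0.377406  1.102689  0.682218
Gamma       0.628142  0.377406  0.000000  0.979550  0.866781
Delta       1.036660  1.102689  0.979550  0.000000  0.227515
Epsilon     1.365098  0.682218  0.866781  0.227515  0.000000
"""
names, rows = parse_distance_matrix(sample)
print(names)
print(undefined_pairs(names, rows))
```

Any pair returned by `undefined_pairs` should be treated as having no finite estimate rather than as a real negative distance.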
Output files for usage example
File: protdist.fprotdist
    5
Alpha       0.000000  0.331834  0.628142  1.036660  1.365098
Beta        0.331834  0.000000  0.377406  1.102689  0.682218
Gamma       0.628142  0.377406  0.000000  0.979550  0.866781
Delta       1.036660  1.102689  0.979550  0.000000  0.227515
Epsilon     1.365098  0.682218  0.866781  0.227515  0.000000
Data files
None
Notes
None.
References
None.
Warnings
None.
Diagnostic Error Messages
None.
Exit status
It always exits with status 0.
Known bugs
None.
See also
Program name | Description |
---|---|
distmat | Create a distance matrix from a multiple sequence alignment |
ednacomp | DNA compatibility algorithm |
ednadist | Nucleic acid sequence Distance Matrix program |
ednainvar | Nucleic acid sequence Invariants method |
ednaml | Phylogenies from nucleic acid Maximum Likelihood |
ednamlk | Phylogenies from nucleic acid Maximum Likelihood with clock |
ednapars | DNA parsimony algorithm |
ednapenny | Penny algorithm for DNA |
eprotdist | Protein distance algorithm |
eprotpars | Protein parsimony algorithm |
erestml | Restriction site Maximum Likelihood method |
eseqboot | Bootstrapped sequences algorithm |
fdiscboot | Bootstrapped discrete sites algorithm |
fdnacomp | DNA compatibility algorithm |
fdnadist | Nucleic acid sequence Distance Matrix program |
fdnainvar | Nucleic acid sequence Invariants method |
fdnaml | Estimates nucleotide phylogeny by maximum likelihood |
fdnamlk | Estimates nucleotide phylogeny by maximum likelihood |
fdnamove | Interactive DNA parsimony |
fdnapars | DNA parsimony algorithm |
fdnapenny | Penny algorithm for DNA |
fdolmove | Interactive Dollo or Polymorphism Parsimony |
ffreqboot | Bootstrapped genetic frequencies algorithm |
fproml | Protein phylogeny by maximum likelihood |
fpromlk | Protein phylogeny by maximum likelihood |
fprotpars | Protein parsimony algorithm |
frestboot | Bootstrapped restriction sites algorithm |
frestdist | Distance matrix from restriction sites or fragments |
frestml | Restriction site maximum Likelihood method |
fseqboot | Bootstrapped sequences algorithm |
fseqbootall | Bootstrapped sequences algorithm |
Author(s)
This program is an EMBOSS conversion of a program written by Joe Felsenstein as part of his PHYLIP package.
Please report all bugs to the EMBOSS bug team (emboss-bug © emboss.open-bio.org) not to the original author.
History
Written (2004) - Joe Felsenstein, University of Washington.
Converted (August 2004) to an EMBASSY program by the EMBOSS team.