fcontml |
Wiki
The master copies of EMBOSS documentation are available at http://emboss.open-bio.org/wiki/Appdocs on the EMBOSS Wiki.Please help by correcting and extending the Wiki pages.
Function
Gene frequency and continuous character Maximum LikelihoodDescription
Estimates phylogenies from gene frequency data by maximum likelihood under a model in which all divergence is due to genetic drift in the absence of new mutations. Does not assume a molecular clock. An alternative method of analyzing this data is to compute Nei's genetic distance and use one of the distance matrix programs. This program can also do maximum likelihood analysis of continuous characters that evolve by a Brownian Motion model, but it assumes that the characters evolve at equal rates and in an uncorrelated fashion, so that it does not take into account the usual correlations of characters.Algorithm
This program estimates phylogenies by the restricted maximum likelihood method based on the Brownian motion model. It is based on the model of Edwards and Cavalli-Sforza (1964; Cavalli-Sforza and Edwards, 1967). Gomberg (1966), Felsenstein (1973b, 1981c) and Thompson (1975) have done extensive further work leading to efficient algorithms. CONTML uses restricted maximum likelihood estimation (REML), which is the criterion used by Felsenstein (1973b). The actual algorithm is an iterative EM Algorithm (Dempster, Laird, and Rubin, 1977) which is guaranteed to always give increasing likelihoods. The algorithm is described in detail in a paper of mine (Felsenstein, 1981c), which you should definitely consult if you are going to use this program. Some simulation tests of it are given by Rohlf and Wooten (1988) and Kim and Burgman (1988).The default (gene frequency) mode treats the input as gene frequencies at a series of loci, and square-root-transforms the allele frequencies (constructing the frequency of the missing allele at each locus first). This enables us to use the Brownian motion model on the resulting coordinates, in an approximation equivalent to using Cavalli-Sforza and Edwards's (1967) chord measure of genetic distance and taking that to give distance between particles undergoing pure Brownian motion. It assumes that each locus evolves independently by pure genetic drift.
The alternative continuous characters mode (menu option C) treats the input as a series of coordinates of each species in N dimensions. It assumes that we have transformed the characters to remove correlations and to standardize their variances.
A word about microsatellite data
Many current users of CONTML use it to analyze microsatellite data. There are three ways to do this:
- Coding each copy number as an allele, and feeding in the frequencies of these alleles. As CONTML's gene frequency mode assumes that all change is by genetic drift, this means that no copy number arises by mutation during the divergence of the populations. Since microsatellite loci have very high mutation rates, this is questionable.
-
Use some other program, one not in the PHYLIP package, to compute
distances among the populations. Some of the programs that can do this
are RSTCalc, poptrfdos, Microsat, and Populations. Links to them can
be found at my Phylogeny Programs web site at
http://evolution.gs.washington.edu/phylip/software.html.
Those distance measures allow for mutation during the divergence of the populations. But even they are not perfect -- they do not allow us to use all the information contained in the gene frequency differences of within a copy number allele. There is a need for a more complete statistical treatment of inference of phylogenies from microsatellite models, ones that take both mutation and genetic drift fully into account.
-
Alternatively, there is the Brownian motion approximation to mean
population copy number. This is described in my book (Felsenstein,
2004, Chapter 15, pp. 242-245), and it is implicit also in the
microsatellite distances. Each locus is coded as a single continuous
character, the mean of the copy number in at that microsatellite locus
in that species. Thus if the species (or population) has frequencies
0.10, 0.24, 0.60, and 0.06 of alleles that have 18, 19, 20, and 21
copies, it is coded as having
0.10 X 18 + 0.24 X 19 + 0.60 X 20 + 0.06 X 21 = 19.62
copies. These values can, I believe, be calculated by a spreadsheet program. Each microsatellite is represented by one character, and the continuous character mode of CONTML is used (not the gene frequencies mode). This coding allows for mutation that changes copy number. It does not make complete use of all data, but neither does the treatment of microsatellite gene frequencies as changing only genetic drift. frequency
Usage
Here is a sample session with fcontml
% fcontml -printdata Gene frequency and continuous character Maximum Likelihood Input file: contml.dat Phylip tree file (optional): Phylip contml program output file [contml.fcontml]: Adding species: 1. European 2. African 3. Chinese 4. American 5. Australian Output written to file "contml.fcontml" Tree also written onto file "contml.treefile" Done. |
Go to the input files for this example
Go to the output files for this example
Command line arguments
Gene frequency and continuous character Maximum Likelihood Version: EMBOSS:6.3.0 Standard (Mandatory) qualifiers: [-infile] frequencies File containing one or more sets of data [-intreefile] tree Phylip tree file (optional) [-outfile] outfile [*.fcontml] Phylip contml program output file Additional (Optional) qualifiers (* if not always prompted): -datatype menu [g] Input type in infile (Values: g (Gene frequencies); i (Continuous characters)) * -lengths boolean [N] Use branch lengths from user trees * -njumble integer [0] Number of times to randomise (Integer 0 or more) * -seed integer [1] Random number seed between 1 and 32767 (must be odd) (Integer from 1 to 32767) * -global boolean [N] Global rearrangements -outgrno integer [0] Species number to use as outgroup (Integer 0 or more) -[no]trout toggle [Y] Write out trees to tree file * -outtreefile outfile [*.fcontml] Phylip tree output file (optional) -printdata boolean [N] Print data at start of run -[no]progress boolean [Y] Print indications of progress of run -[no]treeprint boolean [Y] Print out tree Advanced (Unprompted) qualifiers: (none) Associated qualifiers: "-outfile" associated qualifiers -odirectory3 string Output directory "-outtreefile" associated qualifiers -odirectory string Output directory General qualifiers: -auto boolean Turn off prompts -stdout boolean Write first file to standard output -filter boolean Read first file from standard input, write first file to standard output -options boolean Prompt for standard and additional values -debug boolean Write debug output to program.dbg -verbose boolean Report some/full command line options -help boolean Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose -warning boolean Report warnings -error boolean Report errors -fatal boolean Report fatal errors -die boolean Report dying program messages -version boolean Report version number and exit |
Qualifier | Type | Description | Allowed values | Default | ||||
---|---|---|---|---|---|---|---|---|
Standard (Mandatory) qualifiers | ||||||||
[-infile] (Parameter 1) |
frequencies | File containing one or more sets of data | Frequency value(s) | |||||
[-intreefile] (Parameter 2) |
tree | Phylip tree file (optional) | Phylogenetic tree | |||||
[-outfile] (Parameter 3) |
outfile | Phylip contml program output file | Output file | <*>.fcontml | ||||
Additional (Optional) qualifiers | ||||||||
-datatype | list | Input type in infile |
|
g | ||||
-lengths | boolean | Use branch lengths from user trees | Boolean value Yes/No | No | ||||
-njumble | integer | Number of times to randomise | Integer 0 or more | 0 | ||||
-seed | integer | Random number seed between 1 and 32767 (must be odd) | Integer from 1 to 32767 | 1 | ||||
-global | boolean | Global rearrangements | Boolean value Yes/No | No | ||||
-outgrno | integer | Species number to use as outgroup | Integer 0 or more | 0 | ||||
-[no]trout | toggle | Write out trees to tree file | Toggle value Yes/No | Yes | ||||
-outtreefile | outfile | Phylip tree output file (optional) | Output file | <*>.fcontml | ||||
-printdata | boolean | Print data at start of run | Boolean value Yes/No | No | ||||
-[no]progress | boolean | Print indications of progress of run | Boolean value Yes/No | Yes | ||||
-[no]treeprint | boolean | Print out tree | Boolean value Yes/No | Yes | ||||
Advanced (Unprompted) qualifiers | ||||||||
(none) | ||||||||
Associated qualifiers | ||||||||
"-outfile" associated outfile qualifiers | ||||||||
-odirectory3 -odirectory_outfile |
string | Output directory | Any string | |||||
"-outtreefile" associated outfile qualifiers | ||||||||
-odirectory | string | Output directory | Any string | |||||
General qualifiers | ||||||||
-auto | boolean | Turn off prompts | Boolean value Yes/No | N | ||||
-stdout | boolean | Write first file to standard output | Boolean value Yes/No | N | ||||
-filter | boolean | Read first file from standard input, write first file to standard output | Boolean value Yes/No | N | ||||
-options | boolean | Prompt for standard and additional values | Boolean value Yes/No | N | ||||
-debug | boolean | Write debug output to program.dbg | Boolean value Yes/No | N | ||||
-verbose | boolean | Report some/full command line options | Boolean value Yes/No | Y | ||||
-help | boolean | Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose | Boolean value Yes/No | N | ||||
-warning | boolean | Report warnings | Boolean value Yes/No | Y | ||||
-error | boolean | Report errors | Boolean value Yes/No | Y | ||||
-fatal | boolean | Report fatal errors | Boolean value Yes/No | Y | ||||
-die | boolean | Report dying program messages | Boolean value Yes/No | Y | ||||
-version | boolean | Report version number and exit | Boolean value Yes/No | N |
Input file format
fcontml reads continuous character data.Continuous character data
The programs in this group use gene frequencies and quantitative character values. One (CONTML) constructs maximum likelihood estimates of the phylogeny, another (GENDIST) computes genetic distances for use in the distance matrix programs, and the third (CONTRAST) examines correlation of traits as they evolve along a given phylogeny.When the gene frequencies data are used in CONTML or GENDIST, this involves the following assumptions:
- Different lineages evolve independently.
- After two lineages split, their characters change independently.
- Each gene frequency changes by genetic drift, with or without mutation (this varies from method to method).
- Different loci or characters drift independently.
How these assumptions affect the methods will be seen in my papers on inference of phylogenies from gene frequency and continuous character data (Felsenstein, 1973b, 1981c, 1985c).
The input formats are fairly similar to the discrete-character programs, but with one difference. When CONTML is used in the gene-frequency mode (its usual, default mode), or when GENDIST is used, the first line contains the number of species (or populations) and the number of loci and the options information. There then follows a line which gives the numbers of alleles at each locus, in order. This must be the full number of alleles, not the number of alleles which will be input: i. e. for a two-allele locus the number should be 2, not 1. There then follow the species (population) data, each species beginning on a new line. The first 10 characters are taken as the name, and thereafter the values of the individual characters are read free-format, preceded and separated by blanks. They can go to a new line if desired, though of course not in the middle of a number. Missing data is not allowed - an important limitation. In the default configuration, for each locus, the numbers should be the frequencies of all but one allele. The menu option A (All) signals that the frequencies of all alleles are provided in the input data -- the program will then automatically ignore the last of them. So without the A option, for a three-allele locus there should be two numbers, the frequencies of two of the alleles (and of course it must always be the same two!). Here is a typical data set without the A option:
5 3 2 3 2 Alpha 0.90 0.80 0.10 0.56 Beta 0.72 0.54 0.30 0.20 Gamma 0.38 0.10 0.05 0.98 Delta 0.42 0.40 0.43 0.97 Epsilon 0.10 0.30 0.70 0.62
whereas here is what it would have to look like if the A option were invoked:
5 3 2 3 2 Alpha 0.90 0.10 0.80 0.10 0.10 0.56 0.44 Beta 0.72 0.28 0.54 0.30 0.16 0.20 0.80 Gamma 0.38 0.62 0.10 0.05 0.85 0.98 0.02 Delta 0.42 0.58 0.40 0.43 0.17 0.97 0.03 Epsilon 0.10 0.90 0.30 0.70 0.00 0.62 0.38
The first line has the number of species (or populations) and the number of loci. The second line has the number of alleles for each of the 3 loci. The species lines have names (filled out to 10 characters with blanks) followed by the gene frequencies of the 2 alleles for the first locus, the 3 alleles for the second locus, and the 2 alleles for the third locus. You can start a new line after any of these allele frequencies, and continue to give the frequencies on that line (without repeating the species name).
If all alleles of a locus are given, it is important to have them add up to 1. Roundoff of the frequencies may cause the program to conclude that the numbers do not sum to 1, and stop with an error message.
While many compilers may be more tolerant, it is probably wise to make sure that each number, including the first, is preceded by a blank, and that there are digits both preceding and following any decimal points.
CONTML and CONTRAST also treat quantitative characters (the continuous-characters mode in CONTML, which is option C). It is assumed that each character is evolving according to a Brownian motion model, at the same rate, and independently. In reality it is almost always impossible to guarantee this. The issue is discussed at length in my review article in Annual Review of Ecology and Systematics (Felsenstein, 1988a), where I point out the difficulty of transforming the characters so that they are not only genetically independent but have independent selection acting on them. If you are going to use CONTML to model evolution of continuous characters, then you should at least make some attempt to remove genetic correlations between the characters (usually all one can do is remove phenotypic correlations by transforming the characters so that there is no within-population covariance and so that the within-population variances of the characters are equal -- this is equivalent to using Canonical Variates). However, this will only guarantee that one has removed phenotypic covariances between characters. Genetic covariances could only be removed by knowing the coheritabilities of the characters, which would require genetic experiments, and selective covariances (covariances due to covariation of selection pressures) would require knowledge of the sources and extent of selection pressure in all variables.
CONTRAST is a program designed to infer, for a given phylogeny that is provided to the program, the covariation between characters in a data set. Thus we have a program in this set that allow us to take information about the covariation and rates of evolution of characters and make an estimate of the phylogeny (CONTML), and a program that takes an estimate of the phylogeny and infers the variances and covariances of the character changes. But we have no program that infers both the phylogenies and the character covariation from the same data set.
In the quantitative characters mode, a typical small data set would be:
5 6 Alpha 0.345 0.467 1.213 2.2 -1.2 1.0 Beta 0.457 0.444 1.1 1.987 -0.2 2.678 Gamma 0.6 0.12 0.97 2.3 -0.11 1.54 Delta 0.68 0.203 0.888 2.0 1.67 Epsilon 0.297 0.22 0.90 1.9 1.74
Note that in the latter case, there is no line giving the numbers of alleles at each locus. In this latter case no square-root transformation of the coordinates is done: each is assumed to give directly the position on the Brownian motion scale.
For further discussion of options and modifiable constants in CONTML, GENDIST, and CONTRAST see the documentation files for those programs.
Input files for usage example
File: contml.dat
5 10 2 2 2 2 2 2 2 2 2 2 European 0.2868 0.5684 0.4422 0.4286 0.3828 0.7285 0.6386 0.0205 0.8055 0.5043 African 0.1356 0.4840 0.0602 0.0397 0.5977 0.9675 0.9511 0.0600 0.7582 0.6207 Chinese 0.1628 0.5958 0.7298 1.0000 0.3811 0.7986 0.7782 0.0726 0.7482 0.7334 American 0.0144 0.6990 0.3280 0.7421 0.6606 0.8603 0.7924 0.0000 0.8086 0.8636 Australian 0.1211 0.2274 0.5821 1.0000 0.2018 0.9000 0.9837 0.0396 0.9097 0.2976 |
Output file format
fcontml output has a standard appearance. The topology of the tree is given by an unrooted tree diagram. The lengths (in time or in expected amounts of variance) are given in a table below the topology, and a rough confidence interval given for each length. Negative lower bounds on length indicate that rearrangements may be acceptable.The units of length are amounts of expected accumulated variance (not time). The log likelihood (natural log) of each tree is also given, and it is indicated how many topologies have been tried. The tree does not necessarily have all tips contemporary, and the log likelihood may be either positive or negative (this simply corresponds to whether the density function does or does not exceed 1) and a negative log likelihood does not indicate any error. The log likelihood allows various formal likelihood ratio hypothesis tests. The description of the tree includes approximate standard errors on the lengths of segments of the tree. These are calculated by considering only the curvature of the likelihood surface as the length of the segment is varied, holding all other lengths constant. As such they are most probably underestimates of the variance, and hence may give too much confidence in the given tree.
One should use caution in interpreting the likelihoods that are printed out. If the model is wrong, it will not be possible to use the likelihoods to make formal statistical statements. Thus, if gene frequencies are being analyzed, but the gene frequencies change not only by genetic drift, but also by mutation, the model is not correct. It would be as well-justified in this case to use GENDIST to compute the Nei (1972) genetic distance and then use FITCH, KITSCH or NEIGHBOR to make a tree. If continuous characters are being analyzed, but if the characters have not been transformed to new coordinates that evolve independently and at equal rates, then the model is also violated and no statistical analysis is possible. Doing such a transformation is not easy, and usually not even possible.
If the U (User Tree) option is used and more than one tree is supplied, the program also performs a statistical test of each of these trees against the one with highest likelihood. If there are two user trees, the test done is one which is due to Kishino and Hasegawa (1989), a version of a test originally introduced by Templeton (1983). In this implementation it uses the mean and variance of log-likelihood differences between trees, taken across loci. If the two trees means are more than 1.96 standard deviations different then the trees are declared significantly different. This use of the empirical variance of log-likelihood differences is more robust and nonparametric than the classical likelihood ratio test, and may to some extent compensate for the any lack of realism in the model underlying this program.
If there are more than two trees, the test done is an extension of the KHT test, due to Shimodaira and Hasegawa (1999). They pointed out that a correction for the number of trees was necessary, and they introduced a resampling method to make this correction. The version used here is a multivariate normal approximation to their test; it is due to Shimodaira (1998). The variances and covariances of the sum of log likelihoods across loci are computed for all pairs of trees. To test whether the difference between each tree and the best one is larger than could have been expected if they all had the same expected log-likelihood, log-likelihoods for all trees are sampled with these covariances and equal means (Shimodaira and Hasegawa's "least favorable hypothesis"), and a P value is computed from the fraction of times the difference between the tree's value and the highest log-likelihood exceeds that actually observed. Note that this sampling needs random numbers, and so the program will prompt the user for a random number seed if one has not already been supplied. With the two-tree KHT test no random numbers are used.
In either the KHT or the SH test the program prints out a table of the log-likelihoods of each tree, the differences of each from the highest one, the variance of that quantity as determined by the log-likelihood differences at individual sites, and a conclusion as to whether that tree is or is not significantly worse than the best one.
One problem which sometimes arises is that the program is fed two species (or populations) with identical transformed gene frequencies: this can happen if sample sizes are small and/or many loci are monomorphic. In this case the program "gets its knickers in a twist" and can divide by zero, usually causing a crash. If you suspect that this has happened, check for two species with identical coordinates. If you find them, eliminate one from the problem: the two must always show up as being at the same point on the tree anyway.
Output files for usage example
File: contml.fcontml
Continuous character Maximum Likelihood method version 3.69 5 Populations, 10 Loci Numbers of alleles at the loci: ------- -- ------- -- --- ----- 2 2 2 2 2 2 2 2 2 2 Name Gene Frequencies ---- ---- ----------- locus: 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 European 0.28680 0.71320 0.56840 0.43160 0.44220 0.55780 0.42860 0.57140 0.38280 0.61720 0.72850 0.27150 0.63860 0.36140 0.02050 0.97950 0.80550 0.19450 0.50430 0.49570 African 0.13560 0.86440 0.48400 0.51600 0.06020 0.93980 0.03970 0.96030 0.59770 0.40230 0.96750 0.03250 0.95110 0.04890 0.06000 0.94000 0.75820 0.24180 0.62070 0.37930 Chinese 0.16280 0.83720 0.59580 0.40420 0.72980 0.27020 1.00000 0.00000 0.38110 0.61890 0.79860 0.20140 0.77820 0.22180 0.07260 0.92740 0.74820 0.25180 0.73340 0.26660 American 0.01440 0.98560 0.69900 0.30100 0.32800 0.67200 0.74210 0.25790 0.66060 0.33940 0.86030 0.13970 0.79240 0.20760 0.00000 1.00000 0.80860 0.19140 0.86360 0.13640 Australian 0.12110 0.87890 0.22740 0.77260 0.58210 0.41790 1.00000 0.00000 0.20180 0.79820 0.90000 0.10000 0.98370 0.01630 0.03960 0.96040 0.90970 0.09030 0.29760 0.70240 +-----------------------------------------------------------African ! ! +-------------------------------Australian 1-------------3 ! ! +-----------------------American ! +-----2 ! +Chinese ! +European remember: this is an unrooted tree! Ln Likelihood = 38.71914 Between And Length Approx. Confidence Limits ------- --- ------ ------- ---------- ------ 1 African 0.09693444 ( 0.03123910, 0.19853604) 1 3 0.02252816 ( 0.00089799, 0.05598045) 3 Australian 0.05247405 ( 0.01177094, 0.11542374) 3 2 0.00945315 ( -0.00897717, 0.03795670) 2 American 0.03806240 ( 0.01095938, 0.07997877) 2 Chinese 0.00208822 ( -0.00960622, 0.02017433) 1 European 0.00000000 ( -0.01627246, 0.02516630) |
File: contml.treefile
(African:0.09693444,(Australian:0.05247405,(American:0.03806240,Chinese:0.00208822):0.00945315):0.02252816, European:0.00000000); |
Data files
NoneNotes
None.References
None.Warnings
None.Diagnostic Error Messages
None.Exit status
It always exits with status 0.Known bugs
None.See also
Program name | Description |
---|---|
egendist | Genetic Distance Matrix program |
fgendist | Compute genetic distances from gene frequencies |
Author(s)
This program is an EMBOSS conversion of a program written by Joe Felsenstein as part of his PHYLIP package.Please report all bugs to the EMBOSS bug team (emboss-bug © emboss.open-bio.org) not to the original author.
History
Written (2004) - Joe Felsenstein, University of Washington.Converted (August 2004) to an EMBASSY program by the EMBOSS team.