EMBOSS at CSC

Tehdyt toimenpiteet

EMBOSS: fcontrast

fcontrast

Wiki

The master copies of EMBOSS documentation are available at http://emboss.open-bio.org/wiki/Appdocs on the EMBOSS Wiki.

Please help by correcting and extending the Wiki pages.

Function

Continuous character Contrasts

Description

Reads a tree from a tree file, and a data set with continuous characters data, and produces the independent contrasts for those characters, for use in any multivariate statistics package. Will also produce covariances, regressions and correlations between characters for those contrasts. Can also correct for within-species sampling variation when individual phenotypes are available within a population.

Algorithm

This program implements the contrasts calculation described in my 1985 paper on the comparative method (Felsenstein, 1985d). It reads in a data set of the standard quantitative characters sort, and also a tree from the treefile. It then forms the contrasts between species that, according to that tree, are statistically independent. This is done for each character. The contrasts are all standardized by branch lengths (actually, square roots of branch lengths).

The method is explained in the 1985 paper. It assumes a Brownian motion model. This model was introduced by Edwards and Cavalli-Sforza (1964; Cavalli-Sforza and Edwards, 1967) as an approximation to the evolution of gene frequencies. I have discussed (Felsenstein, 1973b, 1981c, 1985d, 1988b) the difficulties inherent in using it as a model for the evolution of quantitative characters. Chief among these is that the characters do not necessarily evolve independently or at equal rates. This program allows one to evaluate this, if there is independent information on the phylogeny. You can compute the variance of the contrasts for each character, as a measure of the variance accumulating per unit branch length. You can also test covariances of characters.

The statistics that are printed out include the covariances between all pairs of characters, the regressions of each character on each other (column j is regressed on row i), and the correlations between all pairs of characters. In assessing degress of freedom it is important to realize that each contrast was taken to have expectation zero, which is known because each contrast could as easily have been computed xi-xj instead of xj-xi. Thus there is no loss of a degree of freedom for estimation of a mean. The degrees of freedom is thus the same as the number of contrasts, namely one less than the number of species (tips). If you feed these contrasts into a multivariate statistics program make sure that it knows that each variable has expectation exactly zero.

Within-species variation

With the W option selected, CONTRAST analyzes data sets with variation within species, using a model like that proposed by Michael Lynch (1990). The method is described in vague terms in my book (Felsenstein, 2004, pp. 441). If you select the W option for within-species variation, the data set should have this structure (on the left are the data, on the right my comments:

   10    5                           number of species, number of characters
Alpha        2                       name of 1st species, # of individuals
 2.01 5.3 1.5  -3.41 0.3             data for individual #1
 1.98 4.3 2.1  -2.98 0.45            data for individual #2
Gammarus     3                       name of 2nd species, # of individuals
 6.57 3.1 2.0  -1.89 0.6             data for individual #1
 7.62 3.4 1.9  -2.01 0.7             data for individual #2
 6.02 3.0 1.9  -2.03 0.6             data for individual #3
...                                  (and so on)

The covariances, correlations, and regressions for the "additive" (between-species evolutionary variation) and "environmental" (within-species phenotypic variation) are printed out (the maximum likelihood estimates of each). The program also estimates the within-species phenotypic variation in the case where the between-species evolutionary covariances are forced to be zero. The log-likelihoods of these two cases are compared and a likelihood ratio test (LRT) is carried out. The program prints the result of this test as a chi-square variate, and gives the number of degrees of freedom of the LRT. You have to look up the chi-square variable on a table of the chi-square distribution. The A option is available (if the W option is invoked) to allow you to turn off the doing of this test if you want to.

The log-likelihood of the data under the models with and without between-species For the moment the program cannot handle the case where within-species variation is to be taken into account but where only species means are available. (It can handle cases where some species have only one member in their sample).

We hope to fix this soon. We are also on our way to incorporating full-sib, half-sib, or clonal groups within species, so as to do one analysis for within-species genetic and between-species phylogenetic variation.

The data set used as an example below is the example from a paper by Michael Lynch (1990), his characters having been log-transformed. In the case where there is only one specimen per species, Lynch's model is identical to our model of within-species variation (for multiple individuals per species it is not a subcase of his model).

Usage

Here is a sample session with fcontrast


% fcontrast 
Continuous character Contrasts
Input file: contrast.dat
Phylip tree file (optional): contrast.tree
Phylip contrast program output file [contrast.fcontrast]: 


Output written to file "contrast.fcontrast"

Done.

Go to the input files for this example
Go to the output files for this example

Command line arguments

Continuous character Contrasts
Version: EMBOSS:6.3.0

   Standard (Mandatory) qualifiers:
  [-infile]            frequencies File containing one or more sets of data
  [-intreefile]        tree       Phylip tree file (optional)
  [-outfile]           outfile    [*.fcontrast] Phylip contrast program output
                                  file

   Additional (Optional) qualifiers (* if not always prompted):
   -varywithin         boolean    [N] Within-population variation in data
*  -[no]reg            boolean    [Y] Print out correlations and regressions
*  -writecont          boolean    [N] Print out contrasts
*  -[no]nophylo        boolean    [Y] LRT test of no phylogenetic component,
                                  with and without VarA
   -printdata          boolean    [N] Print data at start of run
   -[no]progress       boolean    [Y] Print indications of progress of run

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-outfile" associated qualifiers
   -odirectory3        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options and exit. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages
   -version            boolean    Report version number and exit

Qualifier	Type	Description	Allowed values	Default
Standard (Mandatory) qualifiers
[-infile] (Parameter 1)	frequencies	File containing one or more sets of data	Frequency value(s)
[-intreefile] (Parameter 2)	tree	Phylip tree file (optional)	Phylogenetic tree
[-outfile] (Parameter 3)	outfile	Phylip contrast program output file	Output file	<>*.fcontrast
Additional (Optional) qualifiers
-varywithin	boolean	Within-population variation in data	Boolean value Yes/No	No
-[no]reg	boolean	Print out correlations and regressions	Boolean value Yes/No	Yes
-writecont	boolean	Print out contrasts	Boolean value Yes/No	No
-[no]nophylo	boolean	LRT test of no phylogenetic component, with and without VarA	Boolean value Yes/No	Yes
-printdata	boolean	Print data at start of run	Boolean value Yes/No	No
-[no]progress	boolean	Print indications of progress of run	Boolean value Yes/No	Yes
Advanced (Unprompted) qualifiers
(none)
Associated qualifiers
"-outfile" associated outfile qualifiers
-odirectory3 -odirectory_outfile	string	Output directory	Any string
General qualifiers
-auto	boolean	Turn off prompts	Boolean value Yes/No	N
-stdout	boolean	Write first file to standard output	Boolean value Yes/No	N
-filter	boolean	Read first file from standard input, write first file to standard output	Boolean value Yes/No	N
-options	boolean	Prompt for standard and additional values	Boolean value Yes/No	N
-debug	boolean	Write debug output to program.dbg	Boolean value Yes/No	N
-verbose	boolean	Report some/full command line options	Boolean value Yes/No	Y
-help	boolean	Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose	Boolean value Yes/No	N
-warning	boolean	Report warnings	Boolean value Yes/No	Y
-error	boolean	Report errors	Boolean value Yes/No	Y
-fatal	boolean	Report fatal errors	Boolean value Yes/No	Y
-die	boolean	Report dying program messages	Boolean value Yes/No	Y
-version	boolean	Report version number and exit	Boolean value Yes/No	N

Input file format

fcontrast reads continuous character data.

Continuous character data

The programs in this group use gene frequencies and quantitative character values. One (CONTML) constructs maximum likelihood estimates of the phylogeny, another (GENDIST) computes genetic distances for use in the distance matrix programs, and the third (CONTRAST) examines correlation of traits as they evolve along a given phylogeny.

When the gene frequencies data are used in CONTML or GENDIST, this involves the following assumptions:

Different lineages evolve independently.
After two lineages split, their characters change independently.
Each gene frequency changes by genetic drift, with or without mutation (this varies from method to method).
Different loci or characters drift independently.

How these assumptions affect the methods will be seen in my papers on inference of phylogenies from gene frequency and continuous character data (Felsenstein, 1973b, 1981c, 1985c).

The input formats are fairly similar to the discrete-character programs, but with one difference. When CONTML is used in the gene-frequency mode (its usual, default mode), or when GENDIST is used, the first line contains the number of species (or populations) and the number of loci and the options information. There then follows a line which gives the numbers of alleles at each locus, in order. This must be the full number of alleles, not the number of alleles which will be input: i. e. for a two-allele locus the number should be 2, not 1. There then follow the species (population) data, each species beginning on a new line. The first 10 characters are taken as the name, and thereafter the values of the individual characters are read free-format, preceded and separated by blanks. They can go to a new line if desired, though of course not in the middle of a number. Missing data is not allowed - an important limitation. In the default configuration, for each locus, the numbers should be the frequencies of all but one allele. The menu option A (All) signals that the frequencies of all alleles are provided in the input data -- the program will then automatically ignore the last of them. So without the A option, for a three-allele locus there should be two numbers, the frequencies of two of the alleles (and of course it must always be the same two!). Here is a typical data set without the A option:

     5    3
2 3 2
Alpha      0.90 0.80 0.10 0.56
Beta       0.72 0.54 0.30 0.20
Gamma      0.38 0.10 0.05  0.98
Delta      0.42 0.40 0.43 0.97
Epsilon    0.10 0.30 0.70 0.62

whereas here is what it would have to look like if the A option were invoked:

     5    3
2 3 2
Alpha      0.90 0.10 0.80 0.10 0.10 0.56 0.44
Beta       0.72 0.28 0.54 0.30 0.16 0.20 0.80
Gamma      0.38 0.62 0.10 0.05 0.85  0.98 0.02
Delta      0.42 0.58 0.40 0.43 0.17 0.97 0.03
Epsilon    0.10 0.90 0.30 0.70 0.00 0.62 0.38

The first line has the number of species (or populations) and the number of loci. The second line has the number of alleles for each of the 3 loci. The species lines have names (filled out to 10 characters with blanks) followed by the gene frequencies of the 2 alleles for the first locus, the 3 alleles for the second locus, and the 2 alleles for the third locus. You can start a new line after any of these allele frequencies, and continue to give the frequencies on that line (without repeating the species name).

If all alleles of a locus are given, it is important to have them add up to 1. Roundoff of the frequencies may cause the program to conclude that the numbers do not sum to 1, and stop with an error message.

While many compilers may be more tolerant, it is probably wise to make sure that each number, including the first, is preceded by a blank, and that there are digits both preceding and following any decimal points.

CONTML and CONTRAST also treat quantitative characters (the continuous-characters mode in CONTML, which is option C). It is assumed that each character is evolving according to a Brownian motion model, at the same rate, and independently. In reality it is almost always impossible to guarantee this. The issue is discussed at length in my review article in Annual Review of Ecology and Systematics (Felsenstein, 1988a), where I point out the difficulty of transforming the characters so that they are not only genetically independent but have independent selection acting on them. If you are going to use CONTML to model evolution of continuous characters, then you should at least make some attempt to remove genetic correlations between the characters (usually all one can do is remove phenotypic correlations by transforming the characters so that there is no within-population covariance and so that the within-population variances of the characters are equal -- this is equivalent to using Canonical Variates). However, this will only guarantee that one has removed phenotypic covariances between characters. Genetic covariances could only be removed by knowing the coheritabilities of the characters, which would require genetic experiments, and selective covariances (covariances due to covariation of selection pressures) would require knowledge of the sources and extent of selection pressure in all variables.

CONTRAST is a program designed to infer, for a given phylogeny that is provided to the program, the covariation between characters in a data set. Thus we have a program in this set that allow us to take information about the covariation and rates of evolution of characters and make an estimate of the phylogeny (CONTML), and a program that takes an estimate of the phylogeny and infers the variances and covariances of the character changes. But we have no program that infers both the phylogenies and the character covariation from the same data set.

In the quantitative characters mode, a typical small data set would be:

     5   6
Alpha      0.345 0.467 1.213  2.2  -1.2 1.0
Beta       0.457 0.444 1.1    1.987 -0.2 2.678
Gamma      0.6 0.12 0.97 2.3  -0.11 1.54
Delta      0.68  0.203 0.888 2.0  1.67
Epsilon    0.297  0.22 0.90 1.9 1.74

Note that in the latter case, there is no line giving the numbers of alleles at each locus. In this latter case no square-root transformation of the coordinates is done: each is assumed to give directly the position on the Brownian motion scale.

For further discussion of options and modifiable constants in CONTML, GENDIST, and CONTRAST see the documentation files for those programs.

Input files for usage example

File: contrast.dat

    5   2
Homo        4.09434  4.74493
Pongo       3.61092  3.33220
Macaca      2.37024  3.36730
Ateles      2.02815  2.89037
Galago     -1.46968  2.30259

File: contrast.tree

((((Homo:0.21,Pongo:0.21):0.28,Macaca:0.49):0.13,Ateles:0.62):0.38,Galago:1.00);

Output file format

fcontrast statistics that are printed out include the covariances between all pairs of characters, the regressions of each character on each other (column j is regressed on row i), and the correlations between all pairs of characters. In assessing degress of freedom it is important to realize that each contrast was taken to have expectation zero, which is known because each contrast could as easily have been computed xi-xj instead of xj-xi. Thus there is no loss of a degree of freedom for estimation of a mean. The degrees of freedom is thus the same as the number of contrasts, namely one less than the number of species (tips). If you feed these contrasts into a multivariate statistics program make sure that it knows that each variable has expectation exactly zero. With the W option selected, the covariances, correlations, and regressions for the "additive" (between-species evolutionary variation) and "environmental" (within-species phenotypic variation) are printed out (the maximum likelihood estimates of each). The program also estimates the within-species phenotypic variation in the case where the between-species evolutionary covariances are forced to be zero. The log-likelihoods of these two cases are compared and a likelihood ratio test (LRT) is carried out. The program prints the result of this test as a chi-square variate, and gives the number of degrees of freedom of the LRT. You have to look up the chi-square variable on a table of the chi-square distribution. The A option is available (if the W option is invoked) to allow you to turn off the doing of this test if you want to.

Output files for usage example

File: contrast.fcontrast


Covariance matrix
---------- ------

    3.9423    1.7028
    1.7028    1.7062

Regressions (columns on rows)
----------- -------- -- -----

    1.0000    0.4319
    0.9980    1.0000

Correlations
------------

    1.0000    0.6566
    0.6566    1.0000

Data files

None

Notes

None.

References

None.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It always exits with status 0.

Known bugs

None.

Author(s)

This program is an EMBOSS conversion of a program written by Joe Felsenstein as part of his PHYLIP package.

History

Written (2004) - Joe Felsenstein, University of Washington.

Converted (August 2004) to an EMBASSY program by the EMBOSS team.

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

CSC

Tehdyt toimenpiteet

Wiki

Function

Description

Algorithm

Within-species variation

Usage

Command line arguments

Input file format

Continuous character data

Input files for usage example

File: contrast.dat

File: contrast.tree

Output file format

Output files for usage example

File: contrast.fcontrast

Data files

Notes

References

Warnings

Diagnostic Error Messages

Exit status

Known bugs

See also

Author(s)

History

Target users