CSC

 
 
Tehdyt toimenpiteet
EMBOSS: fneighbor
fneighbor

 

Wiki

The master copies of EMBOSS documentation are available at http://emboss.open-bio.org/wiki/Appdocs on the EMBOSS Wiki.

Please help by correcting and extending the Wiki pages.

Function

Phylogenies from distance matrix by N-J or UPGMA method

Description

An implementation by Mary Kuhner and John Yamato of Saitou and Nei's "Neighbor Joining Method," and of the UPGMA (Average Linkage clustering) method. Neighbor Joining is a distance matrix method producing an unrooted tree without the assumption of a clock. UPGMA does assume a clock. The branch lengths are not optimized by the least squares criterion but the methods are very fast and thus can handle much larger data sets.

Algorithm

This program implements the Neighbor-Joining method of Saitou and Nei (1987) and the UPGMA method of clustering. The program was written by Mary Kuhner and Jon Yamato, using some code from program FITCH. An important part of the code was translated from FORTRAN code from the neighbor-joining program written by Naruya Saitou and by Li Jin, and is used with the kind permission of Drs. Saitou and Jin.

NEIGHBOR constructs a tree by successive clustering of lineages, setting branch lengths as the lineages join. The tree is not rearranged thereafter. The tree does not assume an evolutionary clock, so that it is in effect an unrooted tree. It should be somewhat similar to the tree obtained by FITCH. The program cannot evaluate a User tree, nor can it prevent branch lengths from becoming negative. However the algorithm is far faster than FITCH or KITSCH. This will make it particularly effective in their place for large studies or for bootstrap or jackknife resampling studies which require runs on multiple data sets.

The UPGMA option constructs a tree by successive (agglomerative) clustering using an average-linkage method of clustering. It has some relationship to KITSCH, in that when the tree topology turns out the same, the branch lengths with UPGMA will turn out to be the same as with the P = 0 option of KITSCH.

The programs FITCH, KITSCH, and NEIGHBOR are for dealing with data which comes in the form of a matrix of pairwise distances between all pairs of taxa, such as distances based on molecular sequence data, gene frequency genetic distances, amounts of DNA hybridization, or immunological distances. In analyzing these data, distance matrix programs implicitly assume that:

  • Each distance is measured independently from the others: no item of data contributes to more than one distance.
  • The distance between each pair of taxa is drawn from a distribution with an expectation which is the sum of values (in effect amounts of evolution) along the tree from one tip to the other. The variance of the distribution is proportional to a power p of the expectation.

These assumptions can be traced in the least squares methods of programs FITCH and KITSCH but it is not quite so easy to see them in operation in the Neighbor-Joining method of NEIGHBOR, where the independence assumptions is less obvious.

THESE TWO ASSUMPTIONS ARE DUBIOUS IN MOST CASES: independence will not be expected to be true in most kinds of data, such as genetic distances from gene frequency data. For genetic distance data in which pure genetic drift without mutation can be assumed to be the mechanism of change CONTML may be more appropriate. However, FITCH, KITSCH, and NEIGHBOR will not give positively misleading results (they will not make a statistically inconsistent estimate) provided that additivity holds, which it will if the distance is computed from the original data by a method which corrects for reversals and parallelisms in evolution. If additivity is not expected to hold, problems are more severe. A short discussion of these matters will be found in a review article of mine (1984a). For detailed, if sometimes irrelevant, controversy see the papers by Farris (1981, 1985, 1986) and myself (1986, 1988b).

For genetic distances from gene frequencies, FITCH, KITSCH, and NEIGHBOR may be appropriate if a neutral mutation model can be assumed and Nei's genetic distance is used, or if pure drift can be assumed and either Cavalli-Sforza's chord measure or Reynolds, Weir, and Cockerham's (1983) genetic distance is used. However, in the latter case (pure drift) CONTML should be better.

Restriction site and restriction fragment data can be treated by distance matrix methods if a distance such as that of Nei and Li (1979) is used. Distances of this sort can be computed in PHYLIp by the program RESTDIST.

For nucleic acid sequences, the distances computed in DNADIST allow correction for multiple hits (in different ways) and should allow one to analyse the data under the presumption of additivity. In all of these cases independence will not be expected to hold. DNA hybridization and immunological distances may be additive and independent if transformed properly and if (and only if) the standards against which each value is measured are independent. (This is rarely exactly true).

FITCH and the Neighbor-Joining option of NEIGHBOR fit a tree which has the branch lengths unconstrained. KITSCH and the UPGMA option of NEIGHBOR, by contrast, assume that an "evolutionary clock" is valid, according to which the true branch lengths from the root of the tree to each tip are the same: the expected amount of evolution in any lineage is proportional to elapsed time.

Usage

Here is a sample session with fneighbor


% fneighbor 
Phylogenies from distance matrix by N-J or UPGMA method
Phylip distance matrix file: neighbor.dat
Phylip neighbor program output file [neighbor.fneighbor]: 



Cycle   4: species 1 (   0.91769) joins species 2 (   0.76891)
Cycle   3: node 1 (   0.42027) joins species 3 (   0.35793)
Cycle   2: species 6 (   0.15168) joins species 7 (   0.11752)
Cycle   1: node 1 (   0.04648) joins species 4 (   0.28469)
last cycle:
 node 1  (   0.02696) joins species 5  (   0.15393) joins node 6  (   0.03982)

Output written on file "neighbor.fneighbor"

Tree written on file "neighbor.treefile"

Done.


Go to the input files for this example
Go to the output files for this example

Command line arguments

Phylogenies from distance matrix by N-J or UPGMA method
Version: EMBOSS:6.3.0

   Standard (Mandatory) qualifiers:
  [-datafile]          distances  Phylip distance matrix file
  [-outfile]           outfile    [*.fneighbor] Phylip neighbor program output
                                  file

   Additional (Optional) qualifiers (* if not always prompted):
   -matrixtype         menu       [s] Type of data matrix (Values: s (Square);
                                  u (Upper triangular); l (Lower triangular))
   -treetype           menu       [n] Neighbor-joining or UPGMA tree (Values:
                                  n (Neighbor-joining); u (UPGMA))
*  -outgrno            integer    [0] Species number to use as outgroup
                                  (Integer 0 or more)
   -jumble             toggle     [N] Randomise input order of species
*  -seed               integer    [1] Random number seed between 1 and 32767
                                  (must be odd) (Integer from 1 to 32767)
   -replicates         boolean    [N] Subreplicates
   -[no]trout          toggle     [Y] Write out trees to tree file
*  -outtreefile        outfile    [*.fneighbor] Phylip tree output file
                                  (optional)
   -printdata          boolean    [N] Print data at start of run
   -[no]progress       boolean    [Y] Print indications of progress of run
   -[no]treeprint      boolean    [Y] Print out tree

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-outfile" associated qualifiers
   -odirectory2        string     Output directory

   "-outtreefile" associated qualifiers
   -odirectory         string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options and exit. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages
   -version            boolean    Report version number and exit

Qualifier Type Description Allowed values Default
Standard (Mandatory) qualifiers
[-datafile]
(Parameter 1)
distances Phylip distance matrix file Distance matrix  
[-outfile]
(Parameter 2)
outfile Phylip neighbor program output file Output file <*>.fneighbor
Additional (Optional) qualifiers
-matrixtype list Type of data matrix
s (Square)
u (Upper triangular)
l (Lower triangular)
s
-treetype list Neighbor-joining or UPGMA tree
n (Neighbor-joining)
u (UPGMA)
n
-outgrno integer Species number to use as outgroup Integer 0 or more 0
-jumble toggle Randomise input order of species Toggle value Yes/No No
-seed integer Random number seed between 1 and 32767 (must be odd) Integer from 1 to 32767 1
-replicates boolean Subreplicates Boolean value Yes/No No
-[no]trout toggle Write out trees to tree file Toggle value Yes/No Yes
-outtreefile outfile Phylip tree output file (optional) Output file <*>.fneighbor
-printdata boolean Print data at start of run Boolean value Yes/No No
-[no]progress boolean Print indications of progress of run Boolean value Yes/No Yes
-[no]treeprint boolean Print out tree Boolean value Yes/No Yes
Advanced (Unprompted) qualifiers
(none)
Associated qualifiers
"-outfile" associated outfile qualifiers
-odirectory2
-odirectory_outfile
string Output directory Any string  
"-outtreefile" associated outfile qualifiers
-odirectory string Output directory Any string  
General qualifiers
-auto boolean Turn off prompts Boolean value Yes/No N
-stdout boolean Write first file to standard output Boolean value Yes/No N
-filter boolean Read first file from standard input, write first file to standard output Boolean value Yes/No N
-options boolean Prompt for standard and additional values Boolean value Yes/No N
-debug boolean Write debug output to program.dbg Boolean value Yes/No N
-verbose boolean Report some/full command line options Boolean value Yes/No Y
-help boolean Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose Boolean value Yes/No N
-warning boolean Report warnings Boolean value Yes/No Y
-error boolean Report errors Boolean value Yes/No Y
-fatal boolean Report fatal errors Boolean value Yes/No Y
-die boolean Report dying program messages Boolean value Yes/No Y
-version boolean Report version number and exit Boolean value Yes/No N

Input file format

fneighbor reads any normal sequence USAs.

Input files for usage example

File: neighbor.dat

    7
Bovine      0.0000  1.6866  1.7198  1.6606  1.5243  1.6043  1.5905
Mouse       1.6866  0.0000  1.5232  1.4841  1.4465  1.4389  1.4629
Gibbon      1.7198  1.5232  0.0000  0.7115  0.5958  0.6179  0.5583
Orang       1.6606  1.4841  0.7115  0.0000  0.4631  0.5061  0.4710
Gorilla     1.5243  1.4465  0.5958  0.4631  0.0000  0.3484  0.3083
Chimp       1.6043  1.4389  0.6179  0.5061  0.3484  0.0000  0.2692
Human       1.5905  1.4629  0.5583  0.4710  0.3083  0.2692  0.0000

Output file format

fneighbor output consists of an tree (rooted if UPGMA, unrooted if Neighbor-Joining) and the lengths of the interior segments. The Average Percent Standard Deviation is not computed or printed out. If the tree found by Neighbor is fed into FITCH as a User Tree, it will compute this quantity if one also selects the N option of FITCH to ensure that none of the branch lengths is re-estimated.

As NEIGHBOR runs it prints out an account of the successive clustering levels, if you allow it to. This is mostly for reassurance and can be suppressed using menu option 2. In this printout of cluster levels the word "OTU" refers to a tip species, and the word "NODE" to an interior node of the resulting tree.

Output files for usage example

File: neighbor.fneighbor


Neighbor-Joining/UPGMA method version 3.69


   7 Populations

 Neighbor-joining method

 Negative branch lengths allowed


  +---------------------------------------------Mouse     
  ! 
  !                        +---------------------Gibbon    
  1------------------------2 
  !                        !  +----------------Orang     
  !                        +--4 
  !                           ! +--------Gorilla   
  !                           +-5 
  !                             ! +--------Chimp     
  !                             +-3 
  !                               +------Human     
  ! 
  +------------------------------------------------------Bovine    


remember: this is an unrooted tree!

Between        And            Length
-------        ---            ------
   1          Mouse           0.76891
   1             2            0.42027
   2          Gibbon          0.35793
   2             4            0.04648
   4          Orang           0.28469
   4             5            0.02696
   5          Gorilla         0.15393
   5             3            0.03982
   3          Chimp           0.15168
   3          Human           0.11752
   1          Bovine          0.91769


File: neighbor.treefile

(Mouse:0.76891,(Gibbon:0.35793,(Orang:0.28469,(Gorilla:0.15393,
(Chimp:0.15168,Human:0.11752):0.03982):0.02696):0.04648):0.42027,Bovine:0.91769);

Data files

None

Notes

None.

References

None.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It always exits with status 0.

Known bugs

None.

See also

Program name Description
efitch Fitch-Margoliash and Least-Squares Distance Methods
ekitsch Fitch-Margoliash method with contemporary tips
eneighbor Phylogenies from distance matrix by N-J or UPGMA method
ffitch Fitch-Margoliash and Least-Squares Distance Methods
fkitsch Fitch-Margoliash method with contemporary tips

Author(s)

This program is an EMBOSS conversion of a program written by Joe Felsenstein as part of his PHYLIP package.

Please report all bugs to the EMBOSS bug team (emboss-bug © emboss.open-bio.org) not to the original author.

History

Written (2004) - Joe Felsenstein, University of Washington.

Converted (August 2004) to an EMBASSY program by the EMBOSS team.

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.