version 3.6
RESTDIST -- Program to compute distance matrix from restriction sites or fragments
© Copyright 2000-2002 by the University of
Washington. Written by Joseph Felsenstein. Permission is granted to copy
this document provided that no fee is charged for it and that this copyright
notice is not removed.
Restdist reads the same restriction sites format as RESTML and
computes a restriction sites distance. It can also compute a restriction
fragments distance. The original restriction fragments and restriction
sites distance methods were introduced by Nei and Li (1979). Their original
method for restriction fragments is
also available in this program, although its default methods
are my modifications of the original Nei and Li methods.
These two distances assume that the restriction sites are accidental byproducts
of random change of nucleotide sequences. For my restriction sites distance
the DNA sequences are assumed to be changing according to the Kimura
2-parameter model of DNA change (Kimura, 1980). The user can set the
transition/transversion rate for the model. For my restriction fragments
distance there is
there is an implicit assumption of a Jukes-Cantor (1969) model of change,
The user can also set the
parameter of a correction for unequal rates of evolution between sites in
the DNA sequences, using a Gamma distribution of rates among sites.
The Jukes-Cantor model is also implicit in the restriction fragments
distance of Nei and Li(1979). It
does not allow us to correct for a Gamma distribution of rates among
sites.
Restriction Sites Distance
The restriction sites distances use data coded for the presence of absence of
individual restriction sites (usually as + and - or 0 and 1). My
distance is based on the proportion, out of all sites observed in one species
or the other, which are present in both species. This is done to correct for
the ascertainment of sites, for the fact that we are not aware of many sites
because they do not appear in any species.
My distance starts by computing from the particular pair of species the fraction
n++
f = ---------------------
n++ + 1/2 (n+- + n-+)
where n++ is the number of sites contained in both species,
n+- is the number of sites contained in the first of the
two species but not in the second, and n-+ is the
number of sites contained in the second of the two species but not in the
first. This is the fraction of sites that are present in one species which are
present in both. Since the number of sites present in the two species will
often differ, the denominator is the average of the number of sites found in
the two species.
If each restriction site is s nucleotides long, the probability
that a restriction site is present in the other species, given that it is
present in a species, is
Qs,
where Q is the probability that a nucleotide has no net change as one
goes from the one species to the other. It may have changed in between; we
are interested in the probability that that nucleotide site is in the same base
in both species, irrespective of what has happened in between.
The distance is then computed by finding the branch length of a two-species
tree (connecting these two species with a single branch) such
that Q equals the s-th root of f. For this the
program computes Q for various values of branch length, iterating them
by a Newton-Raphson algorithm until the two quantities are equal.
The resulting distance should be numerically close to the original
restriction sites distance of Nei and Li (1979). It is inspired by theirs,
but theirs differs by implicitly assuming a symmetric Jukes-Cantor (1969)
model of nucleotide change, and theirs does not include a correction for
Gamma distribution of rate of change among nucleotide sites.
Restriction Fragments Distance
For restriction fragments data we use a different distance. If we
average over all restriction fragment lengths, each at its own
expected frequency, the probability that the fragment will still be
in existence after a certain amount of branch length, we must take into
account the probability that the two restriction sites at the ends of
the fragment do not mutate, and the probability that no new
restriction site occurs within the fragment in that amount of branch
length. The result for a restriction site length of s is:
Q2s
f = --------
2 - Qs
(The details of the derivation will be given in my forthcoming book
Inferring Phylogenies (to be published by Sinauer Associates
in 2001).
Given the observed fraction of restriction sites retained, f,
we can solve a quadratic equation from the above expression for
Qs. That makes it easy to obtain a value of Q,
and the branch length can then be estimated by adjusting it so the
probability of a base not changing is equal to that value.
Alternatively, if we use the Nei and Li (1979) restriction fragments
distance, this involves solving for g in the nonlinear
equation
g = [ f (3 - 2g) ]1/4
and then the distance is given by
d = - (2/r) loge(g)
where r is the length of the restriction site.
Comparing these two restriction fragments distances in a case
where their underlying DNA model is the same (which is when the
transition/transversion ratio of the modified model is set to
0.5), you will find
that they are very close to each other, differing very little at
small distances, with the modified distance become smaller than
the Nei/Li distance at larger distances. It will therefore matter
very little which one you use.
A Comment About RAPDs and AFLPs
Although these distances are designed for restriction sites
and restriction fragments data, they can be applied to RAPD and
AFLP data as well. RAPD (Randomly Amplified Polymorphic DNA)
and AFLP (Amplified Fragment Length Polymorphism) data consist
of presence or absence of individual bands on a gel. The bands
are segments of DNA with PCR primers at each end. These
primers are defined sequences of known length (often about
10 nucleotides each). For AFLPs the reolevant length is the primer
length, plus three nucleotides. Mutation in these sequences makes them no
longer be primers, just as in the case of restriction sites.
Thus a pair of 10-nucleotide primers will behave much the same
as a 20-nucleotide restriction site. You can use the restriction
sites distance as the distance between RAPD or AFLP patterns if you
set the proper value for the total length of the site to the
total length of the primers (plus 6 in the case of AFLPs).
Of course there are many possible sources of noise in these data,
including confusing fragments of similar length for each other
and having primers near each other in the genome, and these are
not taken into account in the statistical model used here.
INPUT FORMAT AND OPTIONS
The input is fairly standard, with one addition. As usual the first line of
the file gives the number of species and the number of sites, but there is also
a third number, which is the number of different restriction enzymes that were
used to detect the restriction sites. Thus a data set with 10 species and
35 different sites, representing digestion with 4 different enzymes, would
have the first line of the data file look like this:
10 35 4
The site data are in standard form. Each species starts with a species name
whose maximum length is given by the constant "nmlngth"
(whose value in the
program as distributed is 10 characters). The name should, as usual, be padded
out to that length with blanks if necessary. The sites data then follows, one
character per site (any blanks will
be skipped and ignored). Like the DNA and protein sequence data, the
restriction sites data may be either in the "interleaved" form or the
"sequential" form. Note that if you are analyzing restriction sites
data with the programs DOLLOP or MIX or other discrete character
programs, at the moment those programs do not use the "aligned" or
"interleaved" data format. Therefore you may want to avoid that format
when you have restriction sites data that you will want to feed into
those programs.
The presence of a site is indicated by a "+" and the absence by a "-". I have
also allowed the use of "1" and "0" as synonyms for "+" and "-", for
compatibility with MIX and DOLLOP which do not allow "+" and "-". If the
presence of
the site is unknown (for example, if the DNA containing it has been deleted so
that one
does not know whether it would have contained the site) then the state "?" can
be used to indicate that the state of this site is unknown.
The options are selected using an interactive menu. The menu looks like this:
Restriction site or fragment distances, version 3.6a3
Settings for this run:
R Restriction sites or fragments? Sites
G Gamma distribution of rates among sites? No
T Transition/transversion ratio? 2.000000
S Site length? 6.0
L Form of distance matrix? Square
M Analyze multiple data sets? No
I Input sequences interleaved? Yes
0 Terminal type (IBM PC, ANSI, none)? (none)
1 Print out the data at start of run? No
2 Print indications of progress of run? Yes
Y to accept these or type the letter for one to change
|
The user either types "Y" (followed, of course, by a carriage-return)
if the settings shown are to be accepted, or the letter or digit corresponding
to an option that is to be changed.
The R option toggles between a restriction sites distance, which
is the default setting, and a restriction fragments distance. In
the latter case, another option appears, the N (Nei/Li) option.
This allows the user to choose the original Nei and Li (1979)
restriction fragments distance rather than my modified Nei/Li
distance, which is the default.
If the G (Gamma distribution) option is selected, the user will be
asked to supply the coefficient of variation of the rate of substitution
among sites. This is different from the parameters used by Nei and Jin, who
introduced Gamma distribution of rates in DNA distances, but
related to their parameters: their parameter a is also known as
"alpha", the shape parameter of the Gamma distribution. It is
related to the coefficient of variation by
CV = 1 / a1/2
or
a = 1 / (CV)2
(their parameter b is absorbed here by the requirement that time is scaled so
that the mean rate of evolution is 1 per unit time, which means that a = b).
As we consider cases in which the rates are less variable we should set a
larger and larger, as CV gets smaller and smaller.
The Gamma distribution option is not available when using the
original Nei/Li restriction fragments distance.
The T option is the Transition/transversion option. The user is prompted for
a real number greater than 0.0, as the expected ratio of transitions to
transversions. Note
that this is the resulting expected ratio of transitions to transversions.
The default value of the T parameter if you do not use the T
option is 2.0. The T option is not available when you choose the original
Nei/Li restriction fragment distance, which assumes a Jukes-Cantor (1969)
model of DNA change, for which the transition/transversion ratio is
in effect fixed at 0.5.
The S option selects the site length. This is set to a default
value of 6. It can be set to any positive integer. While in
the RESTML program there is an upper limit on the restriction
site length (set by memory limitations), in RESTDIST there is
no effective limit on the size of the restriction sites. A value
of 20, which might be appropriate in many cases for RAPD or AFLP
data, is typically not practical in RESTML, but it is useable in
RESTDIST.
Option L specifies that the output file will have a square matrix
of distances. It can be used to change to lower-triangular
data matrices. This will usually not be
necessary, but if the distance matrices are going to be very
large, this alternative can reduce their size by half. The
programs which are to use them should then of course be informed
that they can expect lower-triangular distance matrices.
The M, I, and 0 options are the usual Multiple data set,
Interleaved input, and screen terminal type options. These are
described in the main documentation file.
Option 1 specifies that the input data will be written out
on the output file before the distances. This is off by
default. If it is done, it will make the output file unusable
as input to our distance matrix programs.
Option 2 turns off or on the indications of the progress of the run.
The program prints out a row of dots (".") indicating the
calculation of individual distances. Since the distance matrix
is symmetrical, the program only computes the distances for the
upper triangle of the distance matrix, and then duplicates
the distance to the other corner of the matrix. Thus the rows of
dots start out of full length, and then egt shorter and shorter.
OUTPUT FORMAT
The output file contains on its first line the number of species. The
distance matrix is then printed in standard
form, with each species starting on a new line with the species name, followed
by the distances to the species in order. These continue onto a new line
after every nine distances. If the L option is used, the matrix or distances
is in lower triangular form, so that only the distances to the other species
that precede each species are printed. Otherwise the distance matrix is square
with zero distances on the diagonal. In general the format of the distance
matrix is such that it can serve as input to any of the distance matrix
programs.
If the option to print out the data is selected, the output file will
precede the data by more complete information on the input and the menu
selections. The output file begins by giving the number of species and the
number of characters.
The distances printed out are scaled in terms of expected numbers of
substitutions per DNA site, counting both transitions and transversions but not
replacements of a base by itself, and scaled so that the average rate of
change, averaged over all sites analyzed, is set to 1.0. Thus when the
G option is used, the rate of change at one site may be higher than
at another, but their mean is expected to be 1.
PROGRAM CONSTANTS
The constants available to be changed are "initialv" and
"iterationsr". The constant "initialv" is the starting
value of the distance in the iterations. This will typically not need to
be changed. The constant "iterationsr" is the number of
times that the Newton-Raphson method which is used to solve the
equations for the distances is iterated. The program can be
speeded up by reducing the number of iterations from the default
value of 20, but at the possible risk of computing the distance
less accurately.
FUTURE OF THE PROGRAM
The present program does not compute the original distance of Nei and Li (1979)
for restriction sites (though it does have an option to compute their original
distance for restriction fragments). I hope to add their restriction
sites distance in the near future.
TEST DATA SET
5 13 2
Alpha ++-+-++--+++-
Beta ++++--+--+++-
Gamma -+--+-++-+-++
Delta ++-+----++---
Epsilon ++++----++---
|
CONTENTS OF OUTPUT FILE (with all numerical options on)
(Note that when the options for displaying the input data are turned off,
the output is in a form suitable for use as an input file in the distance
matrix programs).
5 Species, 13 Sites
Name Sites
---- -----
Alpha ++-+-++--+ ++-
Beta ++++--+--+ ++-
Gamma -+--+-++-+ -++
Delta ++-+----++ ---
Epsilon ++++----++ ---
Alpha 0.0000 0.0224 0.1077 0.0688 0.0826
Beta 0.0224 0.0000 0.1077 0.0688 0.0442
Gamma 0.1077 0.1077 0.0000 0.1765 0.1925
Delta 0.0688 0.0688 0.1765 0.0000 0.0197
Epsilon 0.0826 0.0442 0.1925 0.0197 0.0000
|
|