A number of different Markov models of DNA sequence evolution have been proposed. These substitution models differ in the parameters used to describe the rates at which one nucleotide replaces another during evolution. They are frequently used in molecular phylogenetic analyses. In particular, they are used when calculating the likelihood of a tree (in Bayesian and maximum likelihood approaches to tree estimation) and when estimating the evolutionary distance between sequences from the observed differences between them.
These models are phenomenological descriptions of the evolution of DNA as a string of four discrete states. These Markov models do not explicitly depict the mechanism of mutation nor the action of natural selection. Rather, they describe the relative rates of different changes. For example, mutational biases and purifying selection favoring conservative changes are probably both responsible for
A point mutation that changes a purine nucleotide to another purine (A ↔ G), or a pyrimidine nucleotide to another pyrimidine (C ↔ T). Approximately two out of three single nucleotide polymorphisms (SNPs) are transitions. Transitions can be caused by oxidative deamination and tautomerization. Although there are twice as many possible transversions, transitions appear more often in genomes, possibly due to
A decrease in the level of variation surrounding the locus under selection. The incidental purging of non-deleterious alleles due to such spatial proximity to deleterious alleles is called background selection. This effect increases with lower mutation rate but decreases with higher recombination rate. Purifying selection can be split into purging by non-random mating (assortative mating) and purging by genetic drift. Purging by genetic drift can remove primarily deeply recessive alleles, whereas natural selection can remove any type of deleterious alleles. The idea that those genes of an organism that are expressed in
A set of sequences. They are often used for analyzing the evolution of an entire locus by making the simplifying assumption that different sites evolve independently and are identically distributed. This assumption may be justifiable if the sites can be assumed to be evolving neutrally. If the primary effect of natural selection on the evolution of the sequences is to constrain some sites, then models of among-site rate-heterogeneity can be used. This approach allows one to estimate only one matrix of relative rates of substitution, and another set of parameters describing
A similar (but not equivalent) model in 1984 using a different parameterization; that latter model is referred to as the F84 model. ] Rate matrix $Q = \begin{pmatrix} * & \kappa\pi_G & \pi_C & \pi_T \\ \kappa\pi_A & * & \pi_C & \pi_T \\ \pi_A & \pi_G & * & \kappa\pi_T \\ \pi_A & \pi_G & \kappa\pi_C & * \end{pmatrix}$ If we express
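As a hedged illustration of the HKY85 rate matrix, the following Python sketch fills in the diagonal entries (marked * in the matrix) so that each row sums to zero, and checks numerically that the base frequencies satisfy πQ = 0. The names `hky85_rate_matrix` and `stationarity_residual`, the dictionary layout for the frequencies, and any example values are assumptions made for this example; the matrix is left unscaled.

```python
ORDER = "AGCT"  # row/column order used by the rate matrices in this article
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def hky85_rate_matrix(kappa, pi):
    """Build the unscaled HKY85 rate matrix as a 4x4 list of lists.

    kappa is the transition/transversion rate ratio and pi maps each base to
    its equilibrium frequency (the four values are assumed to sum to 1).
    """
    q = []
    for i, x in enumerate(ORDER):
        row = []
        for y in ORDER:
            if x == y:
                row.append(0.0)  # placeholder, replaced below
            elif (x, y) in TRANSITIONS:
                row.append(kappa * pi[y])
            else:
                row.append(pi[y])
        row[i] = -sum(row)  # the diagonal makes the row sum to zero
        q.append(row)
    return q


def stationarity_residual(q, pi):
    """Return the vector pi * Q, which should be (numerically) zero."""
    return [sum(pi[x] * q[i][j] for i, x in enumerate(ORDER))
            for j in range(len(ORDER))]
```

Setting `kappa` to 1 in this sketch gives a matrix whose off-diagonal entries depend only on the target base frequency.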
Is called a transversion. Consider a DNA sequence of fixed length $m$ evolving in time by base replacement. Assume that the processes followed by the $m$ sites are Markovian, independent and identically distributed, and that the process is constant over time. For a particular site, let $\mathcal{E} = \{A, G, C, T\}$ be the set of possible states for the site, and $\mathbf{p}(t) = (p_A(t), p_G(t), p_C(t), p_T(t))$ their respective probabilities at time $t$. For two distinct $x, y \in \mathcal{E}$, let $\mu_{xy}$ be
Is difficult, and usually not necessary. Instead, branch lengths (and path lengths) in phylogenetic analyses are usually expressed in the expected number of changes per site. The path length is the product of the duration of the path in time and the mean rate of substitutions. While their product can be estimated, the rate and time are not identifiable from sequence divergence. The descriptions of rate matrices on this page accurately reflect
Is equal to the amount of change from $y$ to $x$ (although the two states may occur with different frequencies). This means that: $\pi_x \mu_{xy} = \pi_y \mu_{yx}$. Not all stationary processes are reversible; however, most commonly used DNA evolution models assume time reversibility, which is considered to be a reasonable assumption. Under
Is equal to zero. It follows that $\mathbf{p}'(t) = \mathbf{p}(t) Q$. For a stationary process, where $Q$ does not depend on time $t$, this differential equation can be solved. First, $P(t) = \exp(tQ)$, where $\exp(tQ)$ denotes the exponential of the matrix $tQ$. As a result, $\mathbf{p}(t) = \mathbf{p}(0) \exp(tQ)$. If
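The relation $P(t) = \exp(tQ)$ between a time-independent rate matrix and its transition probabilities can be explored numerically. The sketch below is one possible approach and not a reference implementation: it approximates the matrix exponential by scaling and squaring with a truncated Taylor series, in pure Python. The function names, the default `squarings` and `terms` values, and the Jukes–Cantor helper (off-diagonal rate μ/4, diagonal −3μ/4) are assumptions made for this example.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(q, t, squarings=10, terms=12):
    """Approximate P(t) = exp(tQ) by scaling and squaring.

    exp(tQ / 2**squarings) is computed from a truncated Taylor series and the
    result is then squared `squarings` times.
    """
    n = len(q)
    scale = t / (2 ** squarings)
    small = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # After this step, term holds small**k / k!
        term = [[v / k for v in row] for row in mat_mul(term, small)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def jc69_rate_matrix(mu):
    """Jukes-Cantor rate matrix with overall substitution rate mu."""
    return [[-0.75 * mu if i == j else 0.25 * mu for j in range(4)]
            for i in range(4)]
```

Each row of the returned matrix should sum to 1 up to rounding error, which is a convenient sanity check for any rate matrix whose rows sum to zero.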
Is frequently referred to as the $p$-distance. It is a sufficient statistic for calculating the Jukes–Cantor distance correction, but is not sufficient for the calculation of the evolutionary distance under the more complex models that follow (also note that the $p$ used in subsequent formulae is not identical to the "$p$-distance"). K80,
Is given by: $K = -\frac{1}{2} \ln\left((1 - 2p - q)\sqrt{1 - 2q}\right)$, where $p$ is the proportion of sites that show transitional differences and $q$ is the proportion of sites that show transversional differences. K81, the Kimura 1981 model, often called Kimura's three parameter model (K3P model) or the Kimura three substitution type (K3ST) model, has distinct rates for transitions and two distinct types of transversions. The two transversion types are those that conserve
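A minimal Python sketch of the Kimura two-parameter distance, using the standard formula $K = -\frac{1}{2}\ln\left((1 - 2p - q)\sqrt{1 - 2q}\right)$, may make the roles of $p$ and $q$ concrete. The function names are hypothetical, and the code assumes two aligned, equal-length sequences containing only the upper-case letters A, C, G and T (no gaps or ambiguity codes).

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def transition_transversion_proportions(seq1, seq2):
    """Return (p, q): proportions of transitional and transversional sites."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be non-empty and of equal length")
    transitions = transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1  # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1  # purine <-> pyrimidine
    n = len(seq1)
    return transitions / n, transversions / n


def k80_distance(p, q):
    """Kimura two-parameter distance in expected changes per site."""
    a = 1.0 - 2.0 * p - q
    b = 1.0 - 2.0 * q
    if a <= 0.0 or b <= 0.0:
        raise ValueError("distance is undefined for this much divergence")
    return -0.5 * math.log(a * math.sqrt(b))
```

The explicit error for non-positive logarithm arguments reflects that the correction breaks down for highly diverged sequences; how to handle that case in practice is a design choice left open here.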
Is here a possible confusion between two meanings of the word transition. (i) In the context of Markov chains, transition is the general term for the change between two states. (ii) In the context of nucleotide changes in DNA sequences, transition is a specific term for the exchange between either the two purines (A ↔ G) or the two pyrimidines (C ↔ T) (for additional details, see the article about transitions in genetics). By contrast, an exchange between one purine and one pyrimidine
Is the fraction of the frequency of state $x$ that is the result of transitions from state $y$ to state $x$. Corollary: The 12 off-diagonal entries of the rate matrix $Q$ (note the off-diagonal entries determine
Is therefore $\mu$, the overall substitution rate. As previously mentioned, this variable becomes a constant when we normalize the mean rate to 1. When branch length, $\nu$, is measured in the expected number of changes per site then: $P_{ij}(\nu) = \frac{1}{4} + \frac{3}{4} e^{-4\nu/3}$ if $i = j$, and $P_{ij}(\nu) = \frac{1}{4} - \frac{1}{4} e^{-4\nu/3}$ if $i \neq j$. It is worth noticing that $\nu = \frac{3}{4} t\mu = \left(\frac{\mu}{4} + \frac{\mu}{4} + \frac{\mu}{4}\right) t$ is the sum of the off-diagonal entries of any row (or column) of the matrix $Q$ multiplied by time, and is therefore the expected number of substitutions per site over time $t$ (the branch duration) when
Is used in the transition probability formulae below in place of μt. Note that ν is a parameter to be estimated from data, and is referred to as the branch length, while β is simply a number that can be calculated from the rate matrix (it is not a separate free parameter). The value of β can be found by forcing the expected rate of flux of states to 1. The diagonal entries of the rate-matrix (the Q matrix) represent -1 times
Is used much less often than the K80 (K2P) model for distance estimation and it is seldom the best-fitting model in maximum likelihood phylogenetics. Despite these facts, the K81 model has continued to be studied in the context of mathematical phylogenetics. One important property is the ability to perform a Hadamard transform assuming the site patterns were generated on a tree with nucleotides evolving under
799-518: The Felsenstein's 1981 model, is an extension of the JC69 model in which base frequencies are allowed to vary from 0.25 ( π A ≠ π G ≠ π C ≠ π T {\displaystyle \pi _{A}\neq \pi _{G}\neq \pi _{C}\neq \pi _{T}} ) Rate matrix: When branch length, ν, is measured in
The Kimura 1980 model, often referred to as Kimura's two parameter model (or the K2P model), distinguishes between transitions ($A \leftrightarrow G$, i.e. from purine to purine, or $C \leftrightarrow T$, i.e. from pyrimidine to pyrimidine) and transversions (from purine to pyrimidine or vice versa). In Kimura's original description of
893-421: The haploid stage are under more efficient natural selection than those genes expressed exclusively in the diploid stage is referred to as the “masking theory”. This theory implies that purifying selection is more efficient in the haploid stage of the life cycle where fitness effects are more fully expressed than in the diploid stage of the life cycle. Evidence supporting the masking theory has been reported in
940-417: The population genetics level, with as little as a single point mutation being the unit of selection. In such a case, carriers of the harmful point mutation have fewer offspring each generation, reducing the frequency of the mutation in the gene pool. In the case of strong negative selection on a locus, the purging of deleterious variants will result in the occasional removal of linked variation, producing
The Hadamard transform can even provide evidence that the data do not fit a tree. The Hadamard transform can also be combined with a wide variety of methods to accommodate among-sites rate heterogeneity, using continuous distributions rather than the discrete approximations typically used in maximum likelihood phylogenetics (although one must sacrifice the invertibility of the Hadamard transform to use certain among-sites rate heterogeneity distributions). F81,
The Jukes–Cantor, the scaling factor would be $4/(3\mu)$ because the rate of leaving each state is $3\mu/4$. JC69, the Jukes and Cantor 1969 model, is the simplest substitution model. There are several assumptions. It assumes equal base frequencies $\left(\pi_A = \pi_G = \pi_C = \pi_T = \frac{1}{4}\right)$ and equal mutation rates. The only parameter of this model
The K81 model. When used in the context of phylogenetics, the Hadamard transform provides an elegant and fully invertible means to calculate expected site pattern frequencies given a set of branch lengths (or vice versa). Unlike many maximum likelihood calculations, the relative values for $\alpha$, $\beta$, and $\gamma$ can vary across branches and
The Markov chain is irreducible, i.e. if it is always possible to go from a state $x$ to a state $y$ (possibly in several steps), then it is also ergodic. As a result, it has a unique stationary distribution $\boldsymbol{\pi} = \{\pi_x,\, x \in \mathcal{E}\}$, where $\pi_x$ corresponds to
The Markov chain is in state $E_i$, then the probability that at time $t_0 + t$ it will be in state $E_j$ depends only upon $i$, $j$ and $t$. This then allows us to write that probability as $p_{ij}(t)$. Theorem: Continuous-time transition matrices satisfy: Note: There
1222-469: The amount of sequence divergence. This raw measurement of divergence provides information about the number of changes that have occurred along the path separating the sequences. The simple count of differences (the Hamming distance ) between sequences will often underestimate the number of substitution because of multiple hits (see homoplasy ). Trying to estimate the exact number of changes that have occurred
1269-444: The branch length, ν in terms of the expected number of changes per site then: Purifying selection In natural selection , negative selection or purifying selection is the selective removal of alleles that are deleterious . This can result in stabilising selection through the purging of deleterious genetic polymorphisms that arise through random mutations. Purging of deleterious alleles can be achieved on
1316-400: The diagonal entries, since the rows of Q {\displaystyle Q\ } sum to zero) can be completely determined by 9 numbers; these are: 6 exchangeability terms and 3 stationary frequencies π x {\displaystyle \pi _{x}\ } , (since the stationary frequencies sum to 1). By comparing extant sequences, one can determine
1363-703: The expected number of changes per site then: HKY85, the Hasegawa, Kishino and Yano 1985 model, can be thought of as combining the extensions made in the Kimura80 and Felsenstein81 models. Namely, it distinguishes between the rate of transitions and transversions (using the κ parameter), and it allows unequal base frequencies ( π A ≠ π G ≠ π C ≠ π T {\displaystyle \pi _{A}\neq \pi _{G}\neq \pi _{C}\neq \pi _{T}} ). [ Felsenstein described
1410-500: The frequencies of p A ( t ) , p G ( t ) , p C ( t ) , p T ( t ) {\displaystyle p_{A}(t),\,p_{G}(t),\,p_{C}(t),\,p_{T}(t)} do not change. Definition : A stationary Markov process is time reversible if (in the steady state) the amount of change from state x {\displaystyle x\ } to y {\displaystyle y\ }
1457-419: The frequency of A {\displaystyle A} 's at time t + Δ t {\displaystyle t+\Delta t} is equal to the frequency at time t {\displaystyle t} minus the frequency of the lost A {\displaystyle A} 's plus the frequency of the newly created A {\displaystyle A} 's. Similarly for
Models of DNA evolution - Misplaced Pages Continue
1504-468: The instantaneous rates of change between different states (the Q matrices below). If we are given a starting (ancestral) state at one position, the model's Q matrix and a branch length expressing the expected number of changes to have occurred since the ancestor, then we can derive the probability of the descendant sequence having each of the four states. The mathematical details of this transformation from rate-matrix to probability matrix are described in
1551-406: The mathematics of substitution models section of the substitution model page. By expressing models in terms of the instantaneous rates of change we can avoid estimating a large numbers of parameters for each branch on a phylogenetic tree (or each comparison if the analysis involves many pairwise sequence comparisons). The models described on this page describe the evolution of a single site within
1598-1341: The model the α and β were used to denote the rates of these types of substitutions, but it is now more common to set the rate of transversions to 1 and use κ to denote the transition/transversion rate ratio (as is done below). The K80 model assumes that all of the bases are equally frequent ( π A = π G = π C = π T = 1 4 {\displaystyle \pi _{A}=\pi _{G}=\pi _{C}=\pi _{T}={1 \over 4}} ). Rate matrix Q = ( ∗ κ 1 1 κ ∗ 1 1 1 1 ∗ κ 1 1 κ ∗ ) {\displaystyle Q={\begin{pmatrix}{*}&{\kappa }&{1}&{1}\\{\kappa }&{*}&{1}&{1}\\{1}&{1}&{*}&{\kappa }\\{1}&{1}&{\kappa }&{*}\end{pmatrix}}} with columns corresponding to A {\displaystyle A} , G {\displaystyle G} , C {\displaystyle C} , and T {\displaystyle T} , respectively. The Kimura two-parameter distance
1645-454: The probabilities p G ( t ) {\displaystyle p_{G}(t)} , p C ( t ) {\displaystyle p_{C}(t)} and p T ( t ) {\displaystyle p_{T}(t)} . These equations can be written compactly as where is known as the rate matrix . Note that, by definition, the sum of the entries in each row of Q {\displaystyle Q}
1692-522: The proportion of time spent in state x {\displaystyle x} after the Markov chain has run for an infinite amount of time. In DNA evolution, under the assumption of a common process for each site, the stationary frequencies π A , π G , π C , π T {\displaystyle \pi _{A},\,\pi _{G},\,\pi _{C},\,\pi _{T}} correspond to equilibrium base compositions. Indeed, note that since
1739-458: The rate of leaving each state. For time-reversible models, we know the equilibrium state frequencies (these are simply the π i parameter value for state i ). Thus we can find the expected rate of change by calculating the sum of flux out of each state weighted by the proportion of sites that are expected to be in that class. Setting β to be the reciprocal of this sum will guarantee that scaled process has an expected flux of 1: For example, in
1786-408: The rate of substitution equals μ {\displaystyle \mu } . Given the proportion p {\displaystyle p} of sites that differ between the two sequences the Jukes–Cantor estimate of the evolutionary distance (in terms of the expected number of changes) between two sequences is given by The p {\displaystyle p} in this formula
1833-434: The relative magnitude of different substitutions, but these rate matrices are not scaled such that a branch length of 1 yields one expected change. This scaling can be accomplished by multiplying every element of the matrix by the same factor, or simply by scaling the branch lengths. If we use the β to denote the scaling factor, and ν to denote the branch length measured in the expected number of substitutions per site then βν
1880-416: The relatively high rate of transitions compared to transversions in evolving sequences. However, the Kimura (K80) model described below only attempts to capture the effect of both forces in a parameter that reflects the relative rate of transitions to transversions. Evolutionary analyses of sequences are conducted on a wide variety of time scales. Thus, it is convenient to express these models in terms of
1927-400: The single-celled yeast Saccharomyces cerevisiae . Further evidence of strong purifying selection in haploid tissue-specific genes, in support of the masking theory, has been reported for the plant, Scots Pine . This genetics article is a stub . You can help Misplaced Pages by expanding it . Transition (genetics) Transition , in genetics and molecular biology , refers to
1974-465: The stationary distribution π {\displaystyle {\boldsymbol {\pi }}} satisfies π Q = 0 {\displaystyle {\boldsymbol {\pi }}Q=0} , we see that when the current distribution p ( t ) {\displaystyle \mathbf {p} (t)} is the stationary distribution π {\displaystyle {\boldsymbol {\pi }}} we have In other words,
2021-612: The time reversibility assumption, let s x y = μ x y / π y {\displaystyle s_{xy}=\mu _{xy}/\pi _{y}\ } , then it is easy to see that: Definition The symmetric term s x y {\displaystyle s_{xy}\ } is called the exchangeability between states x {\displaystyle x\ } and y {\displaystyle y\ } . In other words, s x y {\displaystyle s_{xy}\ }
2068-537: The transition matrix Example: We would like to model the substitution process in DNA sequences ( i.e. Jukes–Cantor , Kimura, etc. ) in a continuous-time fashion. The corresponding transition matrices will look like: where the top-left and bottom-right 2 × 2 blocks correspond to transition probabilities and the top-right and bottom-left 2 × 2 blocks corresponds to transversion probabilities . Assumption: If at some time t 0 {\displaystyle t_{0}} ,
2115-561: The transition rate from state x {\displaystyle x} to state y {\displaystyle y} . Similarly, for any x {\displaystyle x} , let the total rate of change from x {\displaystyle x} be The changes in the probability distribution p A ( t ) {\displaystyle p_{A}(t)} for small increments of time Δ t {\displaystyle \Delta t} are given by In other words, (in frequentist language),
2162-415: The variance in the total rate of substitution across sites. Continuous-time Markov chains have the usual transition matrices which are, in addition, parameterized by time, t {\displaystyle t} . Specifically, if E 1 , E 2 , E 3 , E 4 {\displaystyle E_{1},E_{2},E_{3},E_{4}} are the states, then
2209-1818: The weak/strong properties of the nucleotides (i.e., A ↔ T {\displaystyle A\leftrightarrow T} and C ↔ G {\displaystyle C\leftrightarrow G} , denoted by symbol γ {\displaystyle \gamma } ) and those that conserve the amino/keto properties of the nucleotides (i.e., A ↔ C {\displaystyle A\leftrightarrow C} and G ↔ T {\displaystyle G\leftrightarrow T} , denoted by symbol β {\displaystyle \beta } ). The K81 model assumes that all equilibrium base frequencies are equal (i.e., π A = π G = π C = π T = 0.25 {\displaystyle \pi _{A}=\pi _{G}=\pi _{C}=\pi _{T}=0.25} ). Rate matrix Q = ( ∗ α β γ α ∗ γ β β γ ∗ α γ β α ∗ ) {\displaystyle Q={\begin{pmatrix}{*}&{\alpha }&{\beta }&{\gamma }\\{\alpha }&{*}&{\gamma }&{\beta }\\{\beta }&{\gamma }&{*}&{\alpha }\\{\gamma }&{\beta }&{\alpha }&{*}\end{pmatrix}}} with columns corresponding to A {\displaystyle A} , G {\displaystyle G} , C {\displaystyle C} , and T {\displaystyle T} , respectively. The K81 model