In probability theory and statistics, the Jensen–Shannon divergence, named after Johan Jensen and Claude Shannon, is a method of measuring the similarity between two probability distributions. It is also known as information radius (IRad) or total divergence to the average. It is based on the Kullback–Leibler divergence, with some notable (and useful) differences: it is symmetric and it always has a finite value. The square root of the Jensen–Shannon divergence is a metric often referred to as the Jensen–Shannon distance; the closer this distance is to zero, the more similar the two distributions are.
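As a minimal sketch of these properties (assuming NumPy and SciPy are available; the example distributions are made up), note that SciPy exposes the Jensen–Shannon *distance*, so squaring it recovers the divergence:

```python
# Sketch: Jensen-Shannon distance (a metric) and divergence for two toy
# distributions. scipy.spatial.distance.jensenshannon returns the distance,
# i.e. the square root of the divergence; base=2 keeps the divergence in [0, 1].
import numpy as np
from scipy.spatial.distance import jensenshannon

p = np.array([0.10, 0.40, 0.50])
q = np.array([0.80, 0.15, 0.05])

distance = jensenshannon(p, q, base=2)    # the metric
divergence = distance ** 2                # symmetric, finite, bounded by 1

print(f"JS distance   = {distance:.4f}")
print(f"JS divergence = {divergence:.4f}")
print(jensenshannon(p, p, base=2))        # identical distributions -> 0.0
```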
A Bernoulli process. The entropy of the unknown result of the next toss of the coin is maximized if the coin is fair (that is, if heads and tails both have equal probability 1/2). This is the situation of maximum uncertainty, as it is most difficult to predict the outcome of the next toss; the result of each toss of the coin delivers one full bit of information. This is because
$$\mathrm{H}(X) = -\sum_{i=1}^{n} p(x_i)\log_b p(x_i) = -\sum_{i=1}^{2} \tfrac{1}{2}\log_2 \tfrac{1}{2} = -\sum_{i=1}^{2} \tfrac{1}{2}\cdot(-1) = 1.$$
However, if we know
210-438: A mixture distribution between P {\displaystyle P} and Q {\displaystyle Q} and the binary indicator variable Z {\displaystyle Z} that is used to switch between P {\displaystyle P} and Q {\displaystyle Q} to produce the mixture. Let X {\displaystyle X} be some abstract function on
280-420: A random variable quantifies the average level of uncertainty or information associated with the variable's potential states or possible outcomes. This measures the expected amount of information needed to describe the state of the variable, considering the distribution of probabilities across all potential states. Given a discrete random variable X {\displaystyle X} , which takes values in
350-420: A sigma-algebra on X {\displaystyle X} . The entropy of M {\displaystyle M} is H μ ( M ) = sup P ⊆ M H μ ( P ) . {\displaystyle \mathrm {H} _{\mu }(M)=\sup _{P\subseteq M}\mathrm {H} _{\mu }(P).} Finally, the entropy of the probability space
420-490: A coin with probability p of landing on heads and probability 1 − p of landing on tails. The maximum surprise is when p = 1/2 , for which one outcome is not expected over the other. In this case a coin flip has an entropy of one bit . (Similarly, one trit with equiprobable values contains log 2 3 {\displaystyle \log _{2}3} (about 1.58496) bits of information because it can have one of three values.) The minimum surprise
490-498: A competing measure in structures dual to that of subsets of a universal set. Information is quantified as "dits" (distinctions), a measure on partitions. "Dits" can be converted into Shannon's bits , to get the formulas for conditional entropy, and so on. Another succinct axiomatic characterization of Shannon entropy was given by Aczél , Forte and Ng, via the following properties: It was shown that any function H {\displaystyle \mathrm {H} } satisfying
560-627: A finite set of probability distributions can be defined as the minimizer of the average sum of the Jensen-Shannon divergences between a probability distribution and the prescribed set of distributions: C ∗ = arg min Q ∑ i = 1 n J S D ( P i ∥ Q ) {\displaystyle C^{*}=\arg \min _{Q}\sum _{i=1}^{n}{\rm {JSD}}(P_{i}\parallel Q)} An efficient algorithm (CCCP) based on difference of convex functions
630-428: A logarithm mediates between these two operations. The conditional entropy and related quantities inherit simple relation, in turn. The measure theoretic definition in the previous section defined the entropy as a sum over expected surprisals μ ( A ) ⋅ ln μ ( A ) {\displaystyle \mu (A)\cdot \ln \mu (A)} for an extremal partition. Here
700-634: A message, as in data compression . For example, consider the transmission of sequences comprising the 4 characters 'A', 'B', 'C', and 'D' over a binary channel. If all 4 letters are equally likely (25%), one cannot do better than using two bits to encode each letter. 'A' might code as '00', 'B' as '01', 'C' as '10', and 'D' as '11'. However, if the probabilities of each letter are unequal, say 'A' occurs with 70% probability, 'B' with 26%, and 'C' and 'D' with 2% each, one could assign variable length codes. In this case, 'A' would be coded as '0', 'B' as '10', 'C' as '110', and 'D' as '111'. With this representation, 70% of
770-478: A particular number will win a lottery has high informational value because it communicates the occurrence of a very low probability event. The information content , also called the surprisal or self-information, of an event E {\displaystyle E} is a function which increases as the probability p ( E ) {\displaystyle p(E)} of an event decreases. When p ( E ) {\displaystyle p(E)}
a perfectly noiseless channel. Shannon strengthened this result considerably for noisy channels in his noisy-channel coding theorem. Entropy in information theory is directly analogous to the entropy in statistical thermodynamics. The analogy results when the values of the random variable designate energies of microstates, so Gibbs's formula for the entropy is formally identical to Shannon's formula. Entropy has relevance to other areas of mathematics such as combinatorics and machine learning. The definition can be derived from
910-450: A probability distribution π = ( π 1 , … , π n ) {\displaystyle \pi =(\pi _{1},\ldots ,\pi _{n})} as where S ( ρ ) {\displaystyle S(\rho )} is the von Neumann entropy of ρ {\displaystyle \rho } . This quantity was introduced in quantum information theory, where it
980-414: A set of axioms establishing that entropy should be a measure of how informative the average outcome of a variable is. For a continuous random variable, differential entropy is analogous to entropy. The definition E [ − log p ( X ) ] {\displaystyle \mathbb {E} [-\log p(X)]} generalizes the above. The core idea of information theory
1050-409: A variable. The concept of information entropy was introduced by Claude Shannon in his 1948 paper " A Mathematical Theory of Communication ", and is also referred to as Shannon entropy . Shannon's theory defines a data communication system composed of three elements: a source of data, a communication channel , and a receiver. The "fundamental problem of communication" – as expressed by Shannon –
1120-444: Is H μ ( Σ ) {\displaystyle \mathrm {H} _{\mu }(\Sigma )} , that is, the entropy with respect to μ {\displaystyle \mu } of the sigma-algebra of all measurable subsets of X {\displaystyle X} . Consider tossing a coin with known, not necessarily fair, probabilities of coming up heads or tails; this can be modelled as
1190-520: Is log b ( 2 ) {\displaystyle \log _{b}(2)} : A more general bound, the Jensen–Shannon divergence is bounded by log b ( n ) {\displaystyle \log _{b}(n)} for more than two probability distributions: The Jensen–Shannon divergence is the mutual information between a random variable X {\displaystyle X} associated to
1260-553: Is σ μ ( A ) = − ln μ ( A ) . {\displaystyle \sigma _{\mu }(A)=-\ln \mu (A).} The expected surprisal of A {\displaystyle A} is h μ ( A ) = μ ( A ) σ μ ( A ) . {\displaystyle h_{\mu }(A)=\mu (A)\sigma _{\mu }(A).} A μ {\displaystyle \mu } -almost partition
1330-541: Is a set family P ⊆ P ( X ) {\displaystyle P\subseteq {\mathcal {P}}(X)} such that μ ( ∪ P ) = 1 {\displaystyle \mu (\mathop {\cup } P)=1} and μ ( A ∩ B ) = 0 {\displaystyle \mu (A\cap B)=0} for all distinct A , B ∈ P {\displaystyle A,B\in P} . (This
1400-460: Is a relaxation of the usual conditions for a partition.) The entropy of P {\displaystyle P} is H μ ( P ) = ∑ A ∈ P h μ ( A ) . {\displaystyle \mathrm {H} _{\mu }(P)=\sum _{A\in P}h_{\mu }(A).} Let M {\displaystyle M} be
1470-469: Is a set provided with some σ-algebra of measurable subsets. In particular we can take A {\displaystyle A} to be a finite or countable set with all subsets being measurable. The Jensen–Shannon divergence (JSD) is a symmetrized and smoothed version of the Kullback–Leibler divergence D ( P ∥ Q ) {\displaystyle D(P\parallel Q)} . It
is a symmetric function, everywhere defined, bounded and equal to zero only if two density matrices are the same. It is a square of a metric for pure states, and it was recently shown that this metric property holds for mixed states as well. The Bures metric is closely related to the quantum JS divergence; it is the quantum analog of the Fisher information metric. The centroid C* of
1610-438: Is approximately 0.693 n nats or 0.301 n decimal digits. The meaning of the events observed (the meaning of messages ) does not matter in the definition of entropy. Entropy only takes into account the probability of observing a specific event, so the information it encapsulates is information about the underlying probability distribution , not the meaning of the events themselves. Another characterization of entropy uses
1680-402: Is bounded by 1 for two probability distributions, given that one uses the base 2 logarithm: With this normalization, it is a lower bound on the total variation distance between P and Q: With base-e logarithm, which is commonly used in statistical thermodynamics, the upper bound is ln ( 2 ) {\displaystyle \ln(2)} . In general, the bound in base b
1750-672: Is called the Holevo information: it gives the upper bound for amount of classical information encoded by the quantum states ( ρ 1 , … , ρ n ) {\displaystyle (\rho _{1},\ldots ,\rho _{n})} under the prior distribution π {\displaystyle \pi } (see Holevo's theorem ). Quantum Jensen–Shannon divergence for π = ( 1 2 , 1 2 ) {\displaystyle \pi =\left({\frac {1}{2}},{\frac {1}{2}}\right)} and two density matrices
1820-401: Is central to the definition of information entropy. The connection between thermodynamics and what is now known as information theory was first made by Ludwig Boltzmann and expressed by his equation : where S {\displaystyle S} is the thermodynamic entropy of a particular macrostate (defined by thermodynamic parameters such as temperature, volume, energy, etc.), W
1890-419: Is close to 1, the surprisal of the event is low, but if p ( E ) {\displaystyle p(E)} is close to 0, the surprisal of the event is high. This relationship is described by the function log ( 1 p ( E ) ) , {\displaystyle \log \left({\frac {1}{p(E)}}\right),} where log {\displaystyle \log }
1960-428: Is defined by where M = 1 2 ( P + Q ) {\displaystyle M={\frac {1}{2}}(P+Q)} is a mixture distribution of P {\displaystyle P} and Q {\displaystyle Q} . The geometric Jensen–Shannon divergence (or G-Jensen–Shannon divergence) yields a closed-form formula for divergence between two Gaussian distributions by taking
2030-405: Is equiprobable. That is, we are choosing X {\displaystyle X} according to the probability measure M = ( P + Q ) / 2 {\displaystyle M=(P+Q)/2} , and its distribution is the mixture distribution. We compute It follows from the above result that the Jensen–Shannon divergence is bounded by 0 and 1 because mutual information
2100-415: Is fairly predictable. We can be fairly certain that, for example, 'e' will be far more common than 'z', that the combination 'qu' will be much more common than any other combination with a 'q' in it, and that the combination 'th' will be more common than 'z', 'q', or 'qu'. After the first few letters one can often guess the rest of the word. English text has between 0.6 and 1.3 bits of entropy per character of
2170-401: Is for the receiver to be able to identify what data was generated by the source, based on the signal it receives through the channel. Shannon considered various ways to encode, compress, and transmit messages from a data source, and proved in his source coding theorem that the entropy represents an absolute mathematical limit on how well data from the source can be losslessly compressed onto
2240-474: Is interpreted as being proportional to the amount of further Shannon information needed to define the detailed microscopic state of the system, that remains uncommunicated by a description solely in terms of the macroscopic variables of classical thermodynamics, with the constant of proportionality being just the Boltzmann constant . Adding heat to a system increases its thermodynamic entropy because it increases
2310-403: Is non-negative and bounded by H ( Z ) = 1 {\displaystyle H(Z)=1} in base 2 logarithm. One can apply the same principle to a joint distribution and the product of its two marginal distribution (in analogy to Kullback–Leibler divergence and mutual information) and to measure how reliably one can decide if a given response comes from the joint distribution or
2380-414: Is reported to calculate the Jensen-Shannon centroid of a set of discrete distributions (histograms). The Jensen–Shannon divergence has been applied in bioinformatics and genome comparison , in protein surface comparison, in the social sciences, in the quantitative study of history, in fire experiments, and in machine learning. Shannon entropy In information theory , the entropy of
2450-530: Is that the "informational value" of a communicated message depends on the degree to which the content of the message is surprising. If a highly likely event occurs, the message carries very little information. On the other hand, if a highly unlikely event occurs, the message is much more informative. For instance, the knowledge that some particular number will not be the winning number of a lottery provides very little information, because any particular chosen number will almost certainly not win. However, knowledge that
2520-414: Is the base of the logarithm used. Common values of b are 2, Euler's number e , and 10, and the corresponding units of entropy are the bits for b = 2 , nats for b = e , and bans for b = 10 . In the case of p ( x ) = 0 {\displaystyle p(x)=0} for some x ∈ X {\displaystyle x\in {\mathcal {X}}} ,
2590-533: Is the expected value operator , and I is the information content of X . I ( X ) {\displaystyle \operatorname {I} (X)} is itself a random variable. The entropy can explicitly be written as: H ( X ) = − ∑ x ∈ X p ( x ) log b p ( x ) , {\displaystyle \mathrm {H} (X)=-\sum _{x\in {\mathcal {X}}}p(x)\log _{b}p(x),} where b
2660-727: Is the logarithm , which gives 0 surprise when the probability of the event is 1. In fact, log is the only function that satisfies а specific set of conditions defined in section § Characterization . Hence, we can define the information, or surprisal, of an event E {\displaystyle E} by I ( E ) = − log 2 ( p ( E ) ) , {\displaystyle I(E)=-\log _{2}(p(E)),} or equivalently, I ( E ) = log 2 ( 1 p ( E ) ) . {\displaystyle I(E)=\log _{2}\left({\frac {1}{p(E)}}\right).} Entropy measures
2730-405: Is the trace . At an everyday practical level, the links between information entropy and thermodynamic entropy are not evident. Physicists and chemists are apt to be more interested in changes in entropy as a system spontaneously evolves away from its initial conditions, in accordance with the second law of thermodynamics , rather than an unchanging probability distribution. As the minuteness of
2800-471: Is the number of microstates (various combinations of particles in various energy states) that can yield the given macrostate, and k B is the Boltzmann constant . It is assumed that each microstate is equally likely, so that the probability of a given microstate is p i = 1/ W . When these probabilities are substituted into the above expression for the Gibbs entropy (or equivalently k B times
2870-419: Is when p = 0 or p = 1 , when the event outcome is known ahead of time, and the entropy is zero bits. When the entropy is zero bits, this is sometimes referred to as unity, where there is no uncertainty at all – no freedom of choice – no information . Other values of p give entropies between zero and one bits. Information theory is useful to calculate the smallest amount of information required to convey
2940-542: Is worth noting that if we drop the "small for small probabilities" property, then H {\displaystyle \mathrm {H} } must be a non-negative linear combination of the Shannon entropy and the Hartley entropy . The Shannon entropy satisfies the following properties, for some of which it is useful to interpret entropy as the expected amount of information learned (or uncertainty eliminated) by revealing
3010-417: The Boltzmann constant k B indicates, the changes in S / k B for even tiny amounts of substances in chemical and physical processes represent amounts of entropy that are extremely large compared to anything in data compression or signal processing . In classical thermodynamics, entropy is defined in terms of macroscopic measurements and makes no reference to any probability distribution, which
3080-424: The Boltzmann constant , and p i is the probability of a microstate . The Gibbs entropy was defined by J. Willard Gibbs in 1878 after earlier work by Boltzmann (1872). The Gibbs entropy translates over almost unchanged into the world of quantum physics to give the von Neumann entropy introduced by John von Neumann in 1927: where ρ is the density matrix of the quantum mechanical system and Tr
3150-847: The base for the logarithm . Thus, entropy is characterized by the above four properties. This differential equation leads to the solution I ( u ) = k log u + c {\displaystyle \operatorname {I} (u)=k\log u+c} for some k , c ∈ R {\displaystyle k,c\in \mathbb {R} } . Property 2 gives c = 0 {\displaystyle c=0} . Property 1 and 2 give that I ( p ) ≥ 0 {\displaystyle \operatorname {I} (p)\geq 0} for all p ∈ [ 0 , 1 ] {\displaystyle p\in [0,1]} , so that k < 0 {\displaystyle k<0} . The different units of information ( bits for
3220-415: The binary logarithm log 2 , nats for the natural logarithm ln , bans for the decimal logarithm log 10 and so on) are constant multiples of each other. For instance, in case of a fair coin toss, heads provides log 2 (2) = 1 bit of information, which is approximately 0.693 nats or 0.301 decimal digits. Because of additivity, n tosses provide n bits of information, which
3290-407: The Shannon entropy), Boltzmann's equation results. In information theoretic terms, the information entropy of a system is the amount of "missing" information needed to determine a microstate, given the macrostate. In the view of Jaynes (1957), thermodynamic entropy, as explained by statistical mechanics , should be seen as an application of Shannon's information theory: the thermodynamic entropy
3360-493: The above properties must be a constant multiple of Shannon entropy, with a non-negative constant. Compared to the previously mentioned characterizations of entropy, this characterization focuses on the properties of entropy as a function of random variables (subadditivity and additivity), rather than the properties of entropy as a function of the probability vector p 1 , … , p n {\displaystyle p_{1},\ldots ,p_{n}} . It
3430-1280: The coin is not fair, but comes up heads or tails with probabilities p and q , where p ≠ q , then there is less uncertainty. Every time it is tossed, one side is more likely to come up than the other. The reduced uncertainty is quantified in a lower entropy: on average each toss of the coin delivers less than one full bit of information. For example, if p = 0.7, then H ( X ) = − p log 2 ( p ) − q log 2 ( q ) = − 0.7 log 2 ( 0.7 ) − 0.3 log 2 ( 0.3 ) ≈ − 0.7 ⋅ ( − 0.515 ) − 0.3 ⋅ ( − 1.737 ) = 0.8816 < 1. {\displaystyle {\begin{aligned}\mathrm {H} (X)&=-p\log _{2}(p)-q\log _{2}(q)\\&=-0.7\log _{2}(0.7)-0.3\log _{2}(0.3)\\&\approx -0.7\cdot (-0.515)-0.3\cdot (-1.737)\\&=0.8816<1.\end{aligned}}} Uniform probability yields maximum uncertainty and therefore maximum entropy. Entropy, then, can only decrease from
3500-543: The efficiency of a source set with n symbols can be defined simply as being equal to its n -ary entropy. See also Redundancy (information theory) . The characterization here imposes an additive property with respect to a partition of a set . Meanwhile, the conditional probability is defined in terms of a multiplicative property, P ( A ∣ B ) ⋅ P ( B ) = P ( A ∩ B ) {\displaystyle P(A\mid B)\cdot P(B)=P(A\cap B)} . Observe that
3570-435: The expected (i.e., average) amount of information conveyed by identifying the outcome of a random trial. This implies that rolling a die has higher entropy than tossing a coin because each outcome of a die toss has smaller probability ( p = 1 / 6 {\displaystyle p=1/6} ) than each outcome of a coin toss ( p = 1 / 2 {\displaystyle p=1/2} ). Consider
the following properties. We denote $p_i = \Pr(X = x_i)$ and $\mathrm{H}_n(p_1,\ldots,p_n) = \mathrm{H}(X)$. The rule of additivity has the following consequences: for positive integers $b_i$ where $b_1 + \cdots + b_k = n$, Choosing $k = n$, $b_1 = \cdots = b_n = 1$, this implies that the entropy of a certain outcome is zero: $\mathrm{H}_1(1) = 0$. This implies that
3710-545: The geometric mean. A more general definition, allowing for the comparison of more than two probability distributions, is: where M := ∑ i = 1 n π i P i {\displaystyle {\begin{aligned}M&:=\sum _{i=1}^{n}\pi _{i}P_{i}\end{aligned}}} and π 1 , … , π n {\displaystyle \pi _{1},\ldots ,\pi _{n}} are weights that are selected for
3780-562: The link to point directly to the intended article. Retrieved from " https://en.wikipedia.org/w/index.php?title=JSD&oldid=1125986595 " Category : Disambiguation pages Hidden categories: Short description is different from Wikidata All article disambiguation pages All disambiguation pages Jensen%E2%80%93Shannon divergence Consider the set M + 1 ( A ) {\displaystyle M_{+}^{1}(A)} of probability distributions where A {\displaystyle A}
3850-407: The logarithm is ad hoc and the entropy is not a measure in itself. At least in the information theory of a binary string, log 2 {\displaystyle \log _{2}} lends itself to practical interpretations. Motivated by such relations, a plethora of related and competing quantities have been defined. For example, David Ellerman 's analysis of a "logic of partitions" defines
3920-926: The message. Named after Boltzmann's Η-theorem , Shannon defined the entropy Η (Greek capital letter eta ) of a discrete random variable X {\textstyle X} , which takes values in the set X {\displaystyle {\mathcal {X}}} and is distributed according to p : X → [ 0 , 1 ] {\displaystyle p:{\mathcal {X}}\to [0,1]} such that p ( x ) := P [ X = x ] {\displaystyle p(x):=\mathbb {P} [X=x]} : H ( X ) = E [ I ( X ) ] = E [ − log p ( X ) ] . {\displaystyle \mathrm {H} (X)=\mathbb {E} [\operatorname {I} (X)]=\mathbb {E} [-\log p(X)].} Here E {\displaystyle \mathbb {E} }
3990-447: The number of possible microscopic states of the system that are consistent with the measurable values of its macroscopic variables, making any complete state description longer. (See article: maximum entropy thermodynamics ). Maxwell's demon can (hypothetically) reduce the thermodynamic entropy of a system by using information about the states of individual molecules; but, as Landauer (from 1961) and co-workers have shown, to function
4060-431: The observation of event i follows from Shannon's solution of the fundamental properties of information : Given two independent events, if the first event can yield one of n equiprobable outcomes and another has one of m equiprobable outcomes then there are mn equiprobable outcomes of the joint event. This means that if log 2 ( n ) bits are needed to encode the first value and log 2 ( m ) to encode
4130-586: The only possible values of I {\displaystyle \operatorname {I} } are I ( u ) = k log u {\displaystyle \operatorname {I} (u)=k\log u} for k < 0 {\displaystyle k<0} . Additionally, choosing a value for k is equivalent to choosing a value x > 1 {\displaystyle x>1} for k = − 1 / log x {\displaystyle k=-1/\log x} , so that x corresponds to
4200-1047: The probability distributions P 1 , P 2 , … , P n {\displaystyle P_{1},P_{2},\ldots ,P_{n}} , and H ( P ) {\displaystyle H(P)} is the Shannon entropy for distribution P {\displaystyle P} . For the two-distribution case described above, P 1 = P , P 2 = Q , π 1 = π 2 = 1 2 . {\displaystyle P_{1}=P,P_{2}=Q,\pi _{1}=\pi _{2}={\frac {1}{2}}.\ } Hence, for those distributions P , Q {\displaystyle P,Q} J S D = H ( M ) − 1 2 ( H ( P ) + H ( Q ) ) {\displaystyle JSD=H(M)-{\frac {1}{2}}{\bigg (}H(P)+H(Q){\bigg )}} The Jensen–Shannon divergence
4270-431: The product distribution—subject to the assumption that these are the only two possibilities. The generalization of probability distributions on density matrices allows to define quantum Jensen–Shannon divergence (QJSD). It is defined for a set of density matrices ( ρ 1 , … , ρ n ) {\displaystyle (\rho _{1},\ldots ,\rho _{n})} and
the remaining randomness in the random variable $X$ given the random variable $Y$. Entropy can be formally defined in the language of measure theory as follows: Let $(X, \Sigma, \mu)$ be a probability space. Let $A\in\Sigma$ be an event. The surprisal of $A$
4410-477: The second, one needs log 2 ( mn ) = log 2 ( m ) + log 2 ( n ) to encode both. Shannon discovered that a suitable choice of I {\displaystyle \operatorname {I} } is given by: I ( p ) = log ( 1 p ) = − log ( p ) . {\displaystyle \operatorname {I} (p)=\log \left({\tfrac {1}{p}}\right)=-\log(p).} In fact,
4480-594: The set X {\displaystyle {\mathcal {X}}} and is distributed according to p : X → [ 0 , 1 ] {\displaystyle p\colon {\mathcal {X}}\to [0,1]} , the entropy is H ( X ) := − ∑ x ∈ X p ( x ) log p ( x ) , {\displaystyle \mathrm {H} (X):=-\sum _{x\in {\mathcal {X}}}p(x)\log p(x),} where Σ {\displaystyle \Sigma } denotes
4550-419: The sum over the variable's possible values. The choice of base for log {\displaystyle \log } , the logarithm , varies for different applications. Base 2 gives the unit of bits (or " shannons "), while base e gives "natural units" nat , and base 10 gives units of "dits", "bans", or " hartleys ". An equivalent definition of entropy is the expected value of the self-information of
4620-439: The time only one bit needs to be sent, 26% of the time two bits, and only 4% of the time 3 bits. On average, fewer than 2 bits are required since the entropy is lower (owing to the high prevalence of 'A' followed by 'B' – together 96% of characters). The calculation of the sum of probability-weighted log probabilities measures and captures this effect. English text, treated as a string of characters, has fairly low entropy; i.e. it
4690-428: The underlying set of events that discriminates well between events, and choose the value of X {\displaystyle X} according to P {\displaystyle P} if Z = 0 {\displaystyle Z=0} and according to Q {\displaystyle Q} if Z = 1 {\displaystyle Z=1} , where Z {\displaystyle Z}
4760-536: The value associated with uniform probability. The extreme case is that of a double-headed coin that never comes up tails, or a double-tailed coin that never results in a head. Then there is no uncertainty. The entropy is zero: each toss of the coin delivers no new information as the outcome of each coin toss is always certain. To understand the meaning of −Σ p i log( p i ) , first define an information function I in terms of an event i with probability p i . The amount of information acquired due to
4830-450: The value of a random variable X : The inspiration for adopting the word entropy in information theory came from the close resemblance between Shannon's formula and very similar known formulae from statistical mechanics . In statistical thermodynamics the most general formula for the thermodynamic entropy S of a thermodynamic system is the Gibbs entropy where k B is
4900-1486: The value of the corresponding summand 0 log b (0) is taken to be 0 , which is consistent with the limit : lim p → 0 + p log ( p ) = 0. {\displaystyle \lim _{p\to 0^{+}}p\log(p)=0.} One may also define the conditional entropy of two variables X {\displaystyle X} and Y {\displaystyle Y} taking values from sets X {\displaystyle {\mathcal {X}}} and Y {\displaystyle {\mathcal {Y}}} respectively, as: H ( X | Y ) = − ∑ x , y ∈ X × Y p X , Y ( x , y ) log p X , Y ( x , y ) p Y ( y ) , {\displaystyle \mathrm {H} (X|Y)=-\sum _{x,y\in {\mathcal {X}}\times {\mathcal {Y}}}p_{X,Y}(x,y)\log {\frac {p_{X,Y}(x,y)}{p_{Y}(y)}},} where p X , Y ( x , y ) := P [ X = x , Y = y ] {\displaystyle p_{X,Y}(x,y):=\mathbb {P} [X=x,Y=y]} and p Y ( y ) = P [ Y = y ] {\displaystyle p_{Y}(y)=\mathbb {P} [Y=y]} . This quantity should be understood as