Information theory

Article snapshot taken from Wikipedia under the Creative Commons Attribution-ShareAlike license. Give it a read and then ask your questions in the chat. We can research this topic together.

Information theory is the mathematical study of the quantification, storage, and communication of information. The field was established and put on a firm footing by Claude Shannon in the 1940s, though early contributions were made in the 1920s through the works of Harry Nyquist and Ralph Hartley. It is at the intersection of electronic engineering, mathematics, statistics, computer science, neurobiology, physics, and electrical engineering.

A key measure in information theory is entropy. Entropy quantifies the amount of uncertainty involved in the value of a random variable or the outcome of a random process. For example, identifying the outcome of a fair coin flip (which has two equally likely outcomes) provides less information (lower entropy, less uncertainty) than identifying the outcome from a roll of a die (which has six equally likely outcomes). Some other important measures in information theory are mutual information, channel capacity, error exponents, and relative entropy. Important sub-fields of information theory include source coding, algorithmic complexity theory, algorithmic information theory and information-theoretic security. Applications of fundamental topics of information theory include source coding/data compression (e.g. for ZIP files), and channel coding/error detection and correction (e.g. for DSL). Its impact has been crucial to
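To make the coin-versus-die comparison concrete, here is a minimal Python sketch of the entropy calculation, assuming base-2 logarithms (the helper name shannon_entropy is illustrative, not from any standard library):

    import math

    def shannon_entropy(probs, base=2):
        # H = -sum p_i * log_b(p_i), skipping zero-probability outcomes
        return -sum(p * math.log(p, base) for p in probs if p > 0)

    print(shannon_entropy([0.5, 0.5]))   # fair coin: 1.0 bit
    print(shannon_entropy([1/6] * 6))    # fair die: log2(6), about 2.585 bits

The die's higher entropy reflects the greater uncertainty of six equally likely outcomes compared with two.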

232-433: A source of information. A memoryless source is one in which each message is an independent identically distributed random variable , whereas the properties of ergodicity and stationarity impose less restrictive constraints. All such sources are stochastic . These terms are well studied in their own right outside information theory. Information rate is the average entropy per symbol. For memoryless sources, this

A Bernoulli process. The entropy of the unknown result of the next toss of the coin is maximized if the coin is fair (that is, if heads and tails both have equal probability 1/2). This is the situation of maximum uncertainty as it is most difficult to predict the outcome of the next toss; the result of each toss of the coin delivers one full bit of information. This is because
H(X) = -\sum_{i=1}^{n} p(x_i) \log_b p(x_i) = -\sum_{i=1}^{2} \frac{1}{2} \log_2 \frac{1}{2} = -\sum_{i=1}^{2} \frac{1}{2} \cdot (-1) = 1.
However, if we know

464-548: A code is a method used to encrypt a message that operates at the level of meaning; that is, words or phrases are converted into something else. A code might transform "change" into "CVGDK" or "cocktail lounge". The U.S. National Security Agency defined a code as "A substitution cryptosystem in which the plaintext elements are primarily words, phrases, or sentences, and the code equivalents (called "code groups") typically consist of letters or digits (or both) in otherwise meaningless combinations of identical length." A codebook

580-520: A monoalphabetic substitution cipher is easy, solving even a simple code is difficult. Decrypting a coded message is a little like trying to translate a document written in a foreign language, with the task basically amounting to building up a "dictionary" of the codegroups and the plaintext words they represent. One fingerhold on a simple code is the fact that some words are more common than others, such as "the" or "a" in English. In telegraphic messages,

A sigma-algebra on X. The entropy of M is H_\mu(M) = \sup_{P \subseteq M} H_\mu(P). Finally, the entropy of the probability space

A coin with probability p of landing on heads and probability 1 − p of landing on tails. The maximum surprise is when p = 1/2, for which one outcome is not expected over the other. In this case a coin flip has an entropy of one bit. (Similarly, one trit with equiprobable values contains \log_2 3 (about 1.58496) bits of information because it can have one of three values.) The minimum surprise
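A short Python sketch of the binary entropy function described here, assuming base-2 logarithms (the function name is illustrative):

    import math

    def binary_entropy(p):
        # Entropy, in bits, of a coin that lands heads with probability p
        if p in (0.0, 1.0):
            return 0.0        # no uncertainty at all
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    print(binary_entropy(0.5))   # 1.0 bit: maximum surprise
    print(math.log2(3))          # ~1.58496 bits: one equiprobable trit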

A competing measure in structures dual to that of subsets of a universal set. Information is quantified as "dits" (distinctions), a measure on partitions. "Dits" can be converted into Shannon's bits, to get the formulas for conditional entropy, and so on. Another succinct axiomatic characterization of Shannon entropy was given by Aczél, Forte and Ng, via the following properties: It was shown that any function H satisfying

A continuous random variable, differential entropy is analogous to entropy. The definition \mathbb{E}[-\log p(X)] generalizes the above. The core idea of information theory is that the "informational value" of a communicated message depends on the degree to which the content of the message

1160-448: A data source, and proved in his source coding theorem that the entropy represents an absolute mathematical limit on how well data from the source can be losslessly compressed onto a perfectly noiseless channel. Shannon strengthened this result considerably for noisy channels in his noisy-channel coding theorem . Entropy in information theory is directly analogous to the entropy in statistical thermodynamics . The analogy results when

A logarithm mediates between these two operations. The conditional entropy and related quantities inherit simple relations in turn. The measure-theoretic definition in the previous section defined the entropy as a sum over expected surprisals h_\mu(A) = -\mu(A) \ln \mu(A) for an extremal partition. Here

1392-414: A lot of work for both cryptographers and the code users. In practice, when codes were in widespread use, they were usually changed on a periodic basis to frustrate codebreakers, and to limit the useful life of stolen or copied codebooks. Once codes have been created, codebook distribution is logistically clumsy, and increases chances the code will be compromised. There is a saying that "Three people can keep

1508-448: A measurement in bytes per symbol, and a logarithm of base 10 will produce a measurement in decimal digits (or hartleys ) per symbol. Intuitively, the entropy H X of a discrete random variable X is a measure of the amount of uncertainty associated with the value of X when only its distribution is known. The entropy of a source that emits a sequence of N symbols that are independent and identically distributed (iid)

1624-634: A message, as in data compression . For example, consider the transmission of sequences comprising the 4 characters 'A', 'B', 'C', and 'D' over a binary channel. If all 4 letters are equally likely (25%), one cannot do better than using two bits to encode each letter. 'A' might code as '00', 'B' as '01', 'C' as '10', and 'D' as '11'. However, if the probabilities of each letter are unequal, say 'A' occurs with 70% probability, 'B' with 26%, and 'C' and 'D' with 2% each, one could assign variable length codes. In this case, 'A' would be coded as '0', 'B' as '10', 'C' as '110', and 'D' as '111'. With this representation, 70% of

1740-524: A particular army and nowhere else might very well indicate the commander of that army. A codegroup that appears in messages preceding an attack on a particular location may very well stand for that location. Cribs can be an immediate giveaway to the definitions of codegroups. As codegroups are determined, they can gradually build up a critical mass, with more and more codegroups revealed from context and educated guesswork. One-part codes are more vulnerable to such educated guesswork than two-part codes, since if

A particular number will win a lottery has high informational value because it communicates the occurrence of a very low probability event. The information content, also called the surprisal or self-information, of an event E is a function which increases as the probability p(E) of an event decreases. When p(E)

1972-442: A random variable with two outcomes is the binary entropy function, usually taken to the logarithmic base 2, thus having the shannon (Sh) as unit: The joint entropy of two discrete random variables X and Y is merely the entropy of their pairing: ( X , Y ) . This implies that if X and Y are independent , then their joint entropy is the sum of their individual entropies. For example, if ( X , Y ) represents

2088-488: A secret if two of them are dead," ( Benjamin Franklin - Wikiquote ) and though it may be something of an exaggeration, a secret becomes harder to keep if it is shared among several people. Codes can be thought reasonably secure if they are only used by a few careful people, but if whole armies use the same codebook, security becomes much more difficult. In contrast, the security of ciphers is generally dependent on protecting

A signal; noise, periods of silence, and other forms of signal corruption often degrade quality.

Information entropy

In information theory, the entropy of a random variable quantifies the average level of uncertainty or information associated with the variable's potential states or possible outcomes. This measures the expected amount of information needed to describe the state of

2320-399: A single random variable. Another useful concept is mutual information defined on two random variables, which describes the measure of information in common between those variables, which can be used to describe their correlation. The former quantity is a property of the probability distribution of a random variable and gives a limit on the rate at which data generated by independent samples with

2436-447: A statistical description for data, information theory quantifies the number of bits needed to describe the data, which is the information entropy of the source. This division of coding theory into compression and transmission is justified by the information transmission theorems, or source–channel separation theorems that justify the use of bits as the universal currency for information in many contexts. However, these theorems only hold in

2552-486: A theoretical section quantifying "intelligence" and the "line speed" at which it can be transmitted by a communication system, giving the relation W = K log m (recalling the Boltzmann constant ), where W is the speed of transmission of intelligence, m is the number of different voltage levels to choose from at each time step, and K is a constant. Ralph Hartley 's 1928 paper, Transmission of Information , uses

2668-424: A trained analyst monitoring the communications of someone who has already aroused suspicion might be able to recognize a comment like "Aunt Bertha has gone into labor" as having an ominous meaning. Famous example of one time codes include: Sometimes messages are not prearranged and rely on shared knowledge hopefully known only to the recipients. An example is the telegram sent to U.S. President Harry Truman , then at

2784-510: A unit or scale or measure of information. Alan Turing in 1940 used similar ideas as part of the statistical analysis of the breaking of the German second world war Enigma ciphers. Much of the mathematics behind information theory with events of different probabilities were developed for the field of thermodynamics by Ludwig Boltzmann and J. Willard Gibbs . Connections between information-theoretic entropy and thermodynamic entropy, including

is H_\mu(\Sigma), that is, the entropy with respect to \mu of the sigma-algebra of all measurable subsets of X. Consider tossing a coin with known, not necessarily fair, probabilities of coming up heads or tails; this can be modelled as

is \sigma_\mu(A) = -\ln \mu(A). The expected surprisal of A is h_\mu(A) = \mu(A)\,\sigma_\mu(A). A \mu-almost partition

3132-433: Is N ⋅ H bits (per message of N symbols). If the source data symbols are identically distributed but not independent, the entropy of a message of length N will be less than N ⋅ H . If one transmits 1000 bits (0s and 1s), and the value of each of these bits is known to the receiver (has a specific value with certainty) ahead of transmission, it is clear that no information is transmitted. If, however, each bit

3248-408: Is symmetric : Mutual information can be expressed as the average Kullback–Leibler divergence (information gain) between the posterior probability distribution of X given the value of Y and the prior distribution on X : In other words, this is a measure of how much, on the average, the probability distribution on X will change if we are given the value of Y . This is often recalculated as

is a set family P \subseteq \mathcal{P}(X) such that \mu(\cup P) = 1 and \mu(A \cap B) = 0 for all distinct A, B \in P. (This

is a relaxation of the usual conditions for a partition.) The entropy of P is H_\mu(P) = \sum_{A \in P} h_\mu(A). Let M be

is a way of comparing two distributions: a "true" probability distribution p(X), and an arbitrary probability distribution q(X). If we compress data in a manner that assumes q(X)

3712-473: Is also referred to as Shannon entropy . Shannon's theory defines a data communication system composed of three elements: a source of data, a communication channel , and a receiver. The "fundamental problem of communication" – as expressed by Shannon – is for the receiver to be able to identify what data was generated by the source, based on the signal it receives through the channel. Shannon considered various ways to encode, compress, and transmit messages from

3828-438: Is approximately 0.693 n nats or 0.301 n decimal digits. The meaning of the events observed (the meaning of messages ) does not matter in the definition of entropy. Entropy only takes into account the probability of observing a specific event, so the information it encapsulates is information about the underlying probability distribution , not the meaning of the events themselves. Another characterization of entropy uses

is central to the definition of information entropy. The connection between thermodynamics and what is now known as information theory was first made by Ludwig Boltzmann and expressed by his equation S = k_B \ln W, where S is the thermodynamic entropy of a particular macrostate (defined by thermodynamic parameters such as temperature, volume, energy, etc.), W

is close to 1, the surprisal of the event is low, but if p(E) is close to 0, the surprisal of the event is high. This relationship is described by the function \log\left(\frac{1}{p(E)}\right), where \log

4176-466: Is defined as: It is common in information theory to speak of the "rate" or "entropy" of a language. This is appropriate, for example, when the source of information is English prose. The rate of a source of information is related to its redundancy and how well it can be compressed, the subject of source coding . Communications over a channel is the primary motivation of information theory. However, channels often fail to produce exact reconstruction of

4292-415: Is fairly predictable. We can be fairly certain that, for example, 'e' will be far more common than 'z', that the combination 'qu' will be much more common than any other combination with a 'q' in it, and that the combination 'th' will be more common than 'z', 'q', or 'qu'. After the first few letters one can often guess the rest of the word. English text has between 0.6 and 1.3 bits of entropy per character of

4408-467: Is important in communication where it can be used to maximize the amount of information shared between sent and received signals. The mutual information of X relative to Y is given by: where SI ( S pecific mutual Information) is the pointwise mutual information . A basic property of the mutual information is that That is, knowing Y , we can save an average of I ( X ; Y ) bits in encoding X compared to not knowing Y . Mutual information

is independently equally likely to be 0 or 1, 1000 shannons of information (more often called bits) have been transmitted. Between these two extremes, information can be quantified as follows. If \mathbb{X} is the set of all messages {x_1, ..., x_n} that X could be, and p(x) is the probability of some x \in \mathbb{X}, then

4640-418: Is interpreted as being proportional to the amount of further Shannon information needed to define the detailed microscopic state of the system, that remains uncommunicated by a description solely in terms of the macroscopic variables of classical thermodynamics, with the constant of proportionality being just the Boltzmann constant . Adding heat to a system increases its thermodynamic entropy because it increases

4756-434: Is merely the entropy of each symbol, while, in the case of a stationary stochastic process, it is: that is, the conditional entropy of a symbol given all the previous symbols generated. For the more general case of a process that is not necessarily stationary, the average rate is: that is, the limit of the joint entropy per symbol. For stationary sources, these two expressions give the same result. The information rate

4872-417: Is needed to encrypt, and decrypt the phrases or words. By contrast, ciphers encrypt messages at the level of individual letters, or small groups of letters, or even, in modern ciphers, individual bits . Messages can be transformed first by a code, and then by a cipher. Such multiple encryption , or "superencryption" aims to make cryptanalysis more difficult. Another comparison between codes and ciphers

is not symmetric and does not satisfy the triangle inequality (making it a semi-quasimetric). Another interpretation of the KL divergence is the "unnecessary surprise" introduced by a prior from the truth: suppose a number X is about to be drawn randomly from a discrete set with probability distribution p(x). If Alice knows

is not symmetric. The directed information I(X^n \to Y^n) measures the information bits that are transmitted causally from X^n to Y^n. Directed information has many applications in problems where causality plays an important role, such as the capacity of channels with feedback, the capacity of discrete memoryless networks with feedback, gambling with causal side information, compression with causal side information, real-time control communication settings, and statistical physics. Other important information theoretic quantities include

5220-412: Is surprising. If a highly likely event occurs, the message carries very little information. On the other hand, if a highly unlikely event occurs, the message is much more informative. For instance, the knowledge that some particular number will not be the winning number of a lottery provides very little information, because any particular chosen number will almost certainly not win. However, knowledge that

is that a code typically represents a letter or groups of letters directly without the use of mathematics. As such, the numbers are configured to represent these three values: 1001 = A, 1002 = B, 1003 = C, ... . The resulting message, then, would be 1001 1002 1003 to communicate ABC. Ciphers, however, utilize a mathematical formula to represent letters or groups of letters. For example, A = 1, B = 2, C = 3, ... . Thus
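The distinction can be illustrated with a toy Python sketch of the two schemes just described; the codebook entries and the multiply-by-13 rule are only the article's illustrative examples, not a real cryptosystem:

    codebook = {"A": "1001", "B": "1002", "C": "1003"}   # code: direct lookup, no mathematics

    def encode(message):
        return " ".join(codebook[ch] for ch in message)

    def encipher(message):
        # cipher: a formula applied letter by letter (A=1, B=2, ..., then multiply by 13)
        return " ".join(str((ord(ch) - ord("A") + 1) * 13) for ch in message)

    print(encode("ABC"))     # 1001 1002 1003
    print(encipher("ABC"))   # 13 26 39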

is the base of the logarithm used. Common values of b are 2, Euler's number e, and 10, and the corresponding units of entropy are the bits for b = 2, nats for b = e, and bans for b = 10. In the case of p(x) = 0 for some x \in \mathcal{X},

is the conditional mutual information I(X_1, X_2, ..., X_i; Y_i \mid Y_1, Y_2, ..., Y_{i-1}). In contrast to mutual information, directed information

is the expected value operator, and I is the information content of X. I(X) is itself a random variable. The entropy can explicitly be written as: H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_b p(x), where b

is the logarithm, which gives 0 surprise when the probability of the event is 1. In fact, log is the only function that satisfies a specific set of conditions defined in section § Characterization. Hence, we can define the information, or surprisal, of an event E by I(E) = -\log_2(p(E)), or equivalently, I(E) = \log_2\left(\frac{1}{p(E)}\right). Entropy measures
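A few worked values of the surprisal, as a small Python sketch assuming base-2 logarithms:

    import math

    def surprisal(p):
        # Self-information I(E) = -log2 p(E), in bits
        return -math.log2(p)

    print(surprisal(1.0))     # 0.0   -- a certain event carries no surprise
    print(surprisal(0.5))     # 1.0   -- a fair coin flip carries one bit
    print(surprisal(0.001))   # ~9.97 -- a rare event is highly informative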

5916-405: Is the trace . At an everyday practical level, the links between information entropy and thermodynamic entropy are not evident. Physicists and chemists are apt to be more interested in changes in entropy as a system spontaneously evolves away from its initial conditions, in accordance with the second law of thermodynamics , rather than an unchanging probability distribution. As the minuteness of

6032-473: Is the average conditional entropy over Y : Because entropy can be conditioned on a random variable or on that random variable being a certain value, care should be taken not to confuse these two definitions of conditional entropy, the former of which is in more common use. A basic property of this form of conditional entropy is that: Mutual information measures the amount of information that can be obtained about one random variable by observing another. It

is the distribution underlying some data, when, in reality, p(X) is the correct distribution, the Kullback–Leibler divergence is the number of average additional bits per datum necessary for compression. It is thus defined as D_{KL}(p(X) \| q(X)) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}. Although it is sometimes used as a 'distance metric', KL divergence is not a true metric since it
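A minimal Python sketch of the Kullback–Leibler divergence in bits, using a hypothetical two-outcome example to show both the "extra bits" interpretation and the asymmetry mentioned in the text:

    import math

    def kl_divergence(p, q):
        # D_KL(p || q) in bits: average extra bits per symbol when data distributed
        # as p is compressed with a code optimized for q
        return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    p = [0.7, 0.3]
    q = [0.5, 0.5]
    print(kl_divergence(p, q))   # ~0.119 bits of avoidable overhead
    print(kl_divergence(q, p))   # ~0.126 bits -- note the asymmetry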

is the number of microstates (various combinations of particles in various energy states) that can yield the given macrostate, and k_B is the Boltzmann constant. It is assumed that each microstate is equally likely, so that the probability of a given microstate is p_i = 1/W. When these probabilities are substituted into the above expression for the Gibbs entropy (or equivalently k_B times
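Spelling out the substitution described here as a short worked step, using the Gibbs entropy with equiprobable microstates p_i = 1/W:
S = -k_B \sum_i p_i \ln p_i = -k_B \sum_{i=1}^{W} \frac{1}{W} \ln \frac{1}{W} = k_B \ln W,
which is Boltzmann's equation.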

is used. A common unit of information is the bit or shannon, based on the binary logarithm. Other units include the nat, which is based on the natural logarithm, and the decimal digit, which is based on the common logarithm. In what follows, an expression of the form p log p is considered by convention to be equal to zero whenever p = 0. This is justified because \lim_{p \to 0^+} p \log p = 0 for any logarithmic base. Based on

6496-419: Is when p = 0 or p = 1 , when the event outcome is known ahead of time, and the entropy is zero bits. When the entropy is zero bits, this is sometimes referred to as unity, where there is no uncertainty at all – no freedom of choice – no information . Other values of p give entropies between zero and one bits. Information theory is useful to calculate the smallest amount of information required to convey

is worth noting that if we drop the "small for small probabilities" property, then H must be a non-negative linear combination of the Shannon entropy and the Hartley entropy. The Shannon entropy satisfies the following properties, for some of which it is useful to interpret entropy as the expected amount of information learned (or uncertainty eliminated) by revealing

6728-417: The Boltzmann constant k B indicates, the changes in S / k B for even tiny amounts of substances in chemical and physical processes represent amounts of entropy that are extremely large compared to anything in data compression or signal processing . In classical thermodynamics, entropy is defined in terms of macroscopic measurements and makes no reference to any probability distribution, which

6844-424: The Boltzmann constant , and p i is the probability of a microstate . The Gibbs entropy was defined by J. Willard Gibbs in 1878 after earlier work by Boltzmann (1872). The Gibbs entropy translates over almost unchanged into the world of quantum physics to give the von Neumann entropy introduced by John von Neumann in 1927: where ρ is the density matrix of the quantum mechanical system and Tr

6960-518: The Potsdam Conference to meet with Soviet premier Joseph Stalin , informing Truman of the first successful test of an atomic bomb . See also one-time pad , an unrelated cypher algorithm An idiot code is a code that is created by the parties using it. This type of communication is akin to the hand signals used by armies in the field. Example: Any sentence where 'day' and 'night' are used means 'attack'. The location mentioned in

7076-630: The Rényi entropy and the Tsallis entropy (generalizations of the concept of entropy), differential entropy (a generalization of quantities of information to continuous distributions), and the conditional mutual information . Also, pragmatic information has been proposed as a measure of how much information has been used in making a decision. Coding theory is one of the most important and direct applications of information theory. It can be subdivided into source coding theory and channel coding theory. Using

the base for the logarithm. Thus, entropy is characterized by the above four properties. This differential equation leads to the solution I(u) = k \log u + c for some k, c \in \mathbb{R}. Property 2 gives c = 0. Properties 1 and 2 give that I(p) \geq 0 for all p \in [0, 1], so that k < 0. The different units of information (bits for

the binary logarithm log_2, nats for the natural logarithm ln, bans for the decimal logarithm log_{10} and so on) are constant multiples of each other. For instance, in case of a fair coin toss, heads provides log_2(2) = 1 bit of information, which is approximately 0.693 nats or 0.301 decimal digits. Because of additivity, n tosses provide n bits of information, which
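The constant factors relating these units can be checked directly with a small Python sketch:

    import math

    bits = 1.0                     # information from one fair coin toss
    nats = bits * math.log(2)      # 1 bit = ln 2, about 0.693 nats
    bans = bits * math.log10(2)    # 1 bit = log10 2, about 0.301 decimal digits (bans/hartleys)
    print(nats, bans)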

the log is in base 2. In this way, the extent to which Bob's prior is "wrong" can be quantified in terms of how "unnecessarily surprised" it is expected to make him. Directed information, I(X^n \to Y^n), is an information theory measure that quantifies the information flow from the random process X^n = \{X_1, X_2, \dots, X_n\} to

7540-450: The logarithm , varies for different applications. Base 2 gives the unit of bits (or " shannons "), while base e gives "natural units" nat , and base 10 gives units of "dits", "bans", or " hartleys ". An equivalent definition of entropy is the expected value of the self-information of a variable. The concept of information entropy was introduced by Claude Shannon in his 1948 paper " A Mathematical Theory of Communication ", and

7656-418: The probability mass function of each source symbol to be communicated, the Shannon entropy H , in units of bits (per symbol), is given by where p i is the probability of occurrence of the i -th possible value of the source symbol. This equation gives the entropy in the units of "bits" (per symbol) because it uses a logarithm of base 2, and this base-2 measure of entropy has sometimes been called

the shannon in his honor. Entropy is also commonly computed using the natural logarithm (base e, where e is Euler's number), which produces a measurement of entropy in nats per symbol and sometimes simplifies the analysis by avoiding the need to include extra constants in the formulas. Other bases are also possible, but less commonly used. For example, a logarithm of base 2^8 = 256 will produce

7888-541: The unit ban . The landmark event establishing the discipline of information theory and bringing it to immediate worldwide attention was the publication of Claude E. Shannon's classic paper "A Mathematical Theory of Communication" in the Bell System Technical Journal in July and October 1948. Historian James Gleick rated the paper as the most important development of 1948, noting that the paper

8004-491: The German diplomatic "0075" two-part code system which contained upwards of 10,000 phrases and individual words. A one-time code is a prearranged word, phrase or symbol that is intended to be used only once to convey a simple message, often the signal to execute or abort some plan or confirm that it has succeeded or failed. One-time codes are often designed to be included in what would appear to be an innocent conversation. Done properly they are almost impossible to detect, though

8120-407: The Shannon entropy), Boltzmann's equation results. In information theoretic terms, the information entropy of a system is the amount of "missing" information needed to determine a microstate, given the macrostate. In the view of Jaynes (1957), thermodynamic entropy, as explained by statistical mechanics , should be seen as an application of Shannon's information theory: the thermodynamic entropy

the above properties must be a constant multiple of Shannon entropy, with a non-negative constant. Compared to the previously mentioned characterizations of entropy, this characterization focuses on the properties of entropy as a function of random variables (subadditivity and additivity), rather than the properties of entropy as a function of the probability vector p_1, \ldots, p_n. It

8352-409: The analysis of music , art creation , imaging system design, study of outer space , the dimensionality of space , and epistemology . Information theory studies the transmission, processing, extraction, and utilization of information . Abstractly, information can be thought of as the resolution of uncertainty. In the case of communication of information over a noisy channel, this abstract concept

8468-480: The assertion: With it came the ideas of: Information theory is based on probability theory and statistics, where quantified information is usually described in terms of bits. Information theory often concerns itself with measures of information of the distributions associated with random variables. One of the most important measures is called entropy , which forms the building block of many other measures. Entropy allows quantification of measure of information in

8584-483: The channel capacity. These codes can be roughly subdivided into data compression (source coding) and error-correction (channel coding) techniques. In the latter case, it took many years to find the methods Shannon's work proved were possible. A third class of information theory codes are cryptographic algorithms (both codes and ciphers ). Concepts, methods and results from coding theory and information theory are widely used in cryptography and cryptanalysis , such as

8700-485: The cipher keys. Cipher keys can be stolen and people can betray them, but they are much easier to change and distribute. It was common to encipher a message after first encoding it, to increase the difficulty of cryptanalysis. With a numerical code, this was commonly done with an "additive" - simply a long key number which was digit-by-digit added to the code groups, modulo 10. Unlike the codebooks, additives would be changed frequently. The famous Japanese Navy code, JN-25 ,
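A small Python sketch of the additive step just described, i.e. non-carrying digit-by-digit addition modulo 10; the five-digit code groups and key values here are made up purely for illustration:

    def apply_additive(code_groups, additive):
        # Superencipher numeric code groups with a running key, digit by digit, mod 10
        return ["".join(str((int(g) + int(k)) % 10) for g, k in zip(group, key))
                for group, key in zip(code_groups, additive)]

    groups   = ["17598", "26839"]   # hypothetical code groups from a codebook
    additive = ["48271", "90536"]   # hypothetical key digits from an additive book
    print(apply_additive(groups, additive))   # ['55769', '16365']

The recipient subtracts the same additive, again digit by digit without borrows, before looking the groups up in the codebook.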

8816-486: The code designed, or the encoder. For example, in a code using numeric code groups, a plaintext word starting with "a" would have a low-value group, while one starting with "z" would have a high-value group. The same codebook could be used to "encode" a plaintext message into a coded message or "codetext", and "decode" a codetext back into plaintext message. In order to make life more difficult for codebreakers, codemakers designed codes with no predictable relationship between

the codegroup for "STOP" (i.e., end of sentence or paragraph) is usually very common. This helps define the structure of the message in terms of sentences, if not their meaning, and this is cryptanalytically useful. Further progress can be made against a code by collecting many codetexts encrypted with the same code and then using information from other sources. For example, a particular codegroup found almost exclusively in messages from

9048-485: The codegroups and the ordering of the matching plaintext. In practice, this meant that two codebooks were now required, one to find codegroups for encoding, the other to look up codegroups to find plaintext for decoding. Such "two-part" codes required more effort to develop, and twice as much effort to distribute (and discard safely when replaced), but they were harder to break. The Zimmermann Telegram in January 1917 used

9164-433: The codenumber "26839" of a one-part code is determined to stand for "bulldozer", then the lower codenumber "17598" will likely stand for a plaintext word that starts with "a" or "b". At least, for simple one part codes. Various tricks can be used to " plant " or "sow" information into a coded message, for example by executing a raid at a particular time and location against an enemy, and then examining code messages sent after

the coin is not fair, but comes up heads or tails with probabilities p and q, where p ≠ q, then there is less uncertainty. Every time it is tossed, one side is more likely to come up than the other. The reduced uncertainty is quantified in a lower entropy: on average each toss of the coin delivers less than one full bit of information. For example, if p = 0.7, then
H(X) = -p \log_2(p) - q \log_2(q) = -0.7 \log_2(0.7) - 0.3 \log_2(0.3) \approx -0.7 \cdot (-0.515) - 0.3 \cdot (-1.737) = 0.8816 < 1.
Uniform probability yields maximum uncertainty and therefore maximum entropy. Entropy, then, can only decrease from
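The same number can be checked numerically; note that 0.8816 reflects the rounded intermediate values used above, while full precision gives about 0.8813 bits:

    import math

    p, q = 0.7, 0.3
    print(-p * math.log2(p) - q * math.log2(q))   # ~0.8813 bits, below the 1 bit of a fair coin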

the demon himself must increase thermodynamic entropy in the process, by at least the amount of Shannon information he proposes to first acquire and store; and so the total thermodynamic entropy does not decrease (which resolves the paradox). Landauer's principle imposes a lower bound on the amount of heat a computer must generate to process a given amount of information, though modern computers are far less efficient.

Code (cryptography)

In cryptology,

the divergence from the product of the marginal distributions to the actual joint distribution: Mutual information is closely related to the log-likelihood ratio test in the context of contingency tables and the multinomial distribution and to Pearson's χ² test: mutual information can be considered a statistic for assessing independence between a pair of variables, and has a well-specified asymptotic distribution. The Kullback–Leibler divergence (or information divergence, information gain, or relative entropy)
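A minimal Python sketch of mutual information computed exactly this way, as the divergence from the product of the marginals to the joint distribution (the function name and the toy joint tables are illustrative):

    import math

    def mutual_information(joint):
        # I(X;Y) in bits; joint[x][y] holds P(X=x, Y=y)
        px = [sum(row) for row in joint]
        py = [sum(col) for col in zip(*joint)]
        return sum(pxy * math.log2(pxy / (px[i] * py[j]))
                   for i, row in enumerate(joint)
                   for j, pxy in enumerate(row) if pxy > 0)

    print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))   # 0.0: X and Y independent
    print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))       # 1.0: Y determines X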

the efficiency of a source set with n symbols can be defined simply as being equal to its n-ary entropy. See also Redundancy (information theory). The characterization here imposes an additive property with respect to a partition of a set. Meanwhile, the conditional probability is defined in terms of a multiplicative property, P(A \mid B) \cdot P(B) = P(A \cap B). Observe that

the entropy is H(X) := -\sum_{x \in \mathcal{X}} p(x) \log p(x), where Σ denotes the sum over the variable's possible values. The choice of base for \log,

the entropy, H, of X is defined: (Here, I(x) is the self-information, which is the entropy contribution of an individual message, and \mathbb{E}_X is the expected value.) A property of entropy is that it is maximized when all the messages in the message space are equiprobable p(x) = 1/n; i.e., most unpredictable, in which case H(X) = log n. The special case of information entropy for

the expected (i.e., average) amount of information conveyed by identifying the outcome of a random trial. This implies that rolling a die has higher entropy than tossing a coin because each outcome of a die toss has smaller probability (p = 1/6) than each outcome of a coin toss (p = 1/2). Consider

the following properties. We denote p_i = Pr(X = x_i) and Η_n(p_1, ..., p_n) = Η(X). The rule of additivity has the following consequences: for positive integers b_i where b_1 + ... + b_k = n, Choosing k = n, b_1 = ... = b_n = 1 this implies that the entropy of a certain outcome is zero: Η_1(1) = 0. This implies that

10208-525: The following sentence specifies the location to be attacked. An early use of the term appears to be by George Perrault, a character in the science fiction book Friday by Robert A. Heinlein : Terrorism expert Magnus Ranstorp said that the men who carried out the September 11 attacks on the United States used basic e-mail and what he calls "idiot code" to discuss their plans. While solving

10324-416: The given distribution can be reliably compressed. The latter is a property of the joint distribution of two random variables, and is the maximum rate of reliable communication across a noisy channel in the limit of long block lengths, when the channel statistics are determined by the joint distribution. The choice of logarithmic base in the following formulae determines the unit of information entropy that

10440-486: The important contributions by Rolf Landauer in the 1960s, are explored in Entropy in thermodynamics and information theory . In Shannon's revolutionary and groundbreaking paper, the work for which had been substantially completed at Bell Labs by the end of 1944, Shannon for the first time introduced the qualitative and quantitative model of communication as a statistical process underlying information theory, opening with

10556-470: The individual codebook elements. In the example, the message 13 26 39 can be cracked by dividing each number by 13 and then ranking them alphabetically. However, the focus of codebook cryptanalysis is the comparative frequency of the individual code elements matching the same frequency of letters within the plaintext messages using frequency analysis . In the above example, the code group, 1001, 1002, 1003, might occur more than once and that frequency might match

10672-402: The limit of many channel uses, the rate of information that is asymptotically achievable is equal to the channel capacity, a quantity dependent merely on the statistics of the channel over which the messages are sent. Coding theory is concerned with finding explicit methods, called codes , for increasing the efficiency and reducing the error rate of data communication over noisy channels to near

the logarithm is ad hoc and the entropy is not a measure in itself. At least in the information theory of a binary string, \log_2 lends itself to practical interpretations. Motivated by such relations, a plethora of related and competing quantities have been defined. For example, David Ellerman's analysis of a "logic of partitions" defines

10904-416: The message ABC results by multiplying each letter's value by 13. The message ABC, then would be 13 26 39. Codes have a variety of drawbacks, including susceptibility to cryptanalysis and the difficulty of managing the cumbersome codebooks , so ciphers are now the dominant technique in modern cryptography. In contrast, because codes are representational, they are not susceptible to mathematical analysis of

the message. Named after Boltzmann's Η-theorem, Shannon defined the entropy Η (Greek capital letter eta) of a discrete random variable X, which takes values in the set \mathcal{X} and is distributed according to p : \mathcal{X} \to [0, 1] such that p(x) := \mathbb{P}[X = x]:
H(X) = \mathbb{E}[I(X)] = \mathbb{E}[-\log p(X)].
Here \mathbb{E}

11136-447: The number of possible microscopic states of the system that are consistent with the measurable values of its macroscopic variables, making any complete state description longer. (See article: maximum entropy thermodynamics ). Maxwell's demon can (hypothetically) reduce the thermodynamic entropy of a system by using information about the states of individual molecules; but, as Landauer (from 1961) and co-workers have shown, to function

11252-403: The number of times that ABC occurs in plain text messages. (In the past, or in non-technical contexts, code and cipher are often used to refer to any form of encryption ). Codes are defined by "codebooks" (physical or notional), which are dictionaries of codegroups listed with their corresponding plaintext. Codes originally had the codegroups assigned in 'plaintext order' for convenience of

the observation of event i follows from Shannon's solution of the fundamental properties of information: Given two independent events, if the first event can yield one of n equiprobable outcomes and another has one of m equiprobable outcomes then there are mn equiprobable outcomes of the joint event. This means that if \log_2(n) bits are needed to encode the first value and \log_2(m) to encode

the only possible values of I are I(u) = k \log u for k < 0. Additionally, choosing a value for k is equivalent to choosing a value x > 1 for k = -1/\log x, so that x corresponds to

11600-433: The position of a chess piece— X the row and Y the column, then the joint entropy of the row of the piece and the column of the piece will be the entropy of the position of the piece. Despite similar notation, joint entropy should not be confused with cross-entropy . The conditional entropy or conditional uncertainty of X given random variable Y (also called the equivocation of X about Y )
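For the chess-piece example, if the row and the column are independent and each uniform over 8 values, the additivity claim can be checked directly:
H(X, Y) = H(X) + H(Y) = \log_2 8 + \log_2 8 = 3 + 3 = 6 bits,
which is the entropy of a uniformly random square on the board (\log_2 64 = 6).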

11716-420: The raid. Coding errors are a particularly useful fingerhold into a code; people reliably make errors, sometimes disastrous ones. Planting data and exploiting errors works against ciphers as well. Constructing a new code is like building a new language and writing a dictionary for it; it was an especially big job before computers. If a code is compromised, the entire task must be done all over again, and that means

the random process Y^n = \{Y_1, Y_2, \dots, Y_n\}. The term directed information was coined by James Massey and is defined as I(X^n \to Y^n) = \sum_{i=1}^{n} I(X^i; Y_i \mid Y^{i-1}), where I(X^i; Y_i \mid Y^{i-1})

the remaining randomness in the random variable X given the random variable Y. Entropy can be formally defined in the language of measure theory as follows: Let (X, \Sigma, \mu) be a probability space. Let A \in \Sigma be an event. The surprisal of A

the second, one needs \log_2(mn) = \log_2(m) + \log_2(n) to encode both. Shannon discovered that a suitable choice of I is given by: I(p) = \log\left(\frac{1}{p}\right) = -\log(p). In fact,

12180-417: The situation where one transmitting user wishes to communicate to one receiving user. In scenarios with more than one transmitter (the multiple-access channel), more than one receiver (the broadcast channel ) or intermediary "helpers" (the relay channel ), or more general networks , compression followed by transmission may no longer be optimal. Any process that generates successive messages can be considered

12296-734: The success of the Voyager missions to deep space, the invention of the compact disc , the feasibility of mobile phones and the development of the Internet and artificial intelligence . The theory has also found applications in other areas, including statistical inference , cryptography , neurobiology , perception , signal processing , linguistics , the evolution and function of molecular codes ( bioinformatics ), thermal physics , molecular dynamics , black holes , quantum computing , information retrieval , intelligence gathering , plagiarism detection , pattern recognition , anomaly detection ,

12412-439: The time only one bit needs to be sent, 26% of the time two bits, and only 4% of the time 3 bits. On average, fewer than 2 bits are required since the entropy is lower (owing to the high prevalence of 'A' followed by 'B' – together 96% of characters). The calculation of the sum of probability-weighted log probabilities measures and captures this effect. English text, treated as a string of characters, has fairly low entropy; i.e. it
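The averages quoted here can be reproduced with a short Python sketch (the probabilities and the variable-length code are the ones from the example above):

    import math

    probs = {"A": 0.70, "B": 0.26, "C": 0.02, "D": 0.02}
    code  = {"A": "0",  "B": "10", "C": "110", "D": "111"}

    avg_len = sum(p * len(code[s]) for s, p in probs.items())
    entropy = -sum(p * math.log2(p) for p in probs.values())
    print(avg_len)   # 1.34 bits per symbol on average, below the 2 bits of a fixed-length code
    print(entropy)   # ~1.09 bits per symbol: the theoretical compression limit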

the true distribution p(x), while Bob believes (has a prior) that the distribution is q(x), then Bob will be more surprised than Alice, on average, upon seeing the value of X. The KL divergence is the (objective) expected value of Bob's (subjective) surprisal minus Alice's surprisal, measured in bits if

the value associated with uniform probability. The extreme case is that of a double-headed coin that never comes up tails, or a double-tailed coin that never results in a head. Then there is no uncertainty. The entropy is zero: each toss of the coin delivers no new information as the outcome of each coin toss is always certain. To understand the meaning of -\sum p_i \log(p_i), first define an information function I in terms of an event i with probability p_i. The amount of information acquired due to

the value of a random variable X: The inspiration for adopting the word entropy in information theory came from the close resemblance between Shannon's formula and very similar known formulae from statistical mechanics. In statistical thermodynamics the most general formula for the thermodynamic entropy S of a thermodynamic system is the Gibbs entropy S = -k_B \sum_i p_i \ln p_i, where k_B is

the value of the corresponding summand 0 \log_b(0) is taken to be 0, which is consistent with the limit: \lim_{p \to 0^+} p \log(p) = 0. One may also define the conditional entropy of two variables X and Y taking values from sets \mathcal{X} and \mathcal{Y} respectively, as:
H(X|Y) = -\sum_{x, y \in \mathcal{X} \times \mathcal{Y}} p_{X,Y}(x, y) \log \frac{p_{X,Y}(x, y)}{p_Y(y)},
where p_{X,Y}(x, y) := \mathbb{P}[X = x, Y = y] and p_Y(y) = \mathbb{P}[Y = y]. This quantity should be understood as
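A minimal Python sketch of this conditional entropy, computed from a joint probability table (the function name and the toy tables are illustrative):

    import math

    def conditional_entropy(joint):
        # H(X|Y) in bits; joint[x][y] holds P(X=x, Y=y)
        py = [sum(col) for col in zip(*joint)]
        return -sum(pxy * math.log2(pxy / py[j])
                    for row in joint
                    for j, pxy in enumerate(row) if pxy > 0)

    print(conditional_entropy([[0.5, 0.0], [0.0, 0.5]]))       # 0.0: Y determines X completely
    print(conditional_entropy([[0.25, 0.25], [0.25, 0.25]]))   # 1.0: knowing Y tells nothing about X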

12992-407: The values of the random variable designate energies of microstates, so Gibbs's formula for the entropy is formally identical to Shannon's formula. Entropy has relevance to other areas of mathematics such as combinatorics and machine learning . The definition can be derived from a set of axioms establishing that entropy should be a measure of how informative the average outcome of a variable is. For

the variable, considering the distribution of probabilities across all potential states. Given a discrete random variable X, which takes values in the set \mathcal{X} and is distributed according to p : \mathcal{X} \to [0, 1],

the word information as a measurable quantity, reflecting the receiver's ability to distinguish one sequence of symbols from any other, thus quantifying information as H = \log S^n = n \log S, where S was the number of possible symbols, and n the number of symbols in a transmission. The unit of information was therefore the decimal digit, which since has sometimes been called the hartley in his honor as

13340-561: Was "even more profound and more fundamental" than the transistor . He came to be known as the "father of information theory". Shannon outlined some of his initial ideas of information theory as early as 1939 in a letter to Vannevar Bush . Prior to this paper, limited information-theoretic ideas had been developed at Bell Labs , all implicitly assuming events of equal probability. Harry Nyquist 's 1924 paper, Certain Factors Affecting Telegraph Speed , contains

13456-419: Was formalized in 1948 by Claude Shannon in a paper entitled A Mathematical Theory of Communication , in which information is thought of as a set of possible messages, and the goal is to send these messages over a noisy channel, and to have the receiver reconstruct the message with low probability of error, in spite of the channel noise. Shannon's main result, the noisy-channel coding theorem , showed that, in
