The Akaike information criterion (AIC) is an estimator of prediction error and thereby relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Thus, AIC provides a means for model selection.
129-409: AIC is founded on information theory . When a statistical model is used to represent the process that generated the data, the representation will almost never be exact; so some information will be lost by using the model to represent the process. AIC estimates the relative amount of information lost by a given model: the less information a model loses, the higher the quality of that model. In estimating
258-433: A source of information. A memoryless source is one in which each message is an independent identically distributed random variable , whereas the properties of ergodicity and stationarity impose less restrictive constraints. All such sources are stochastic . These terms are well studied in their own right outside information theory. Information rate is the average entropy per symbol. For memoryless sources, this
387-453: A statistical model of the process that generates the data and (second) deducing propositions from the model. Konishi and Kitagawa state "The majority of the problems in statistical inference can be considered to be problems related to statistical modeling". Relatedly, Sir David Cox has said, "How [the] translation from subject-matter problem to statistical model is done is often the most critical part of an analysis". The conclusion of
516-472: A "data generating mechanism" does exist in reality, then according to Shannon 's source coding theorem it provides the MDL description of the data, on average and asymptotically. In minimizing description length (or descriptive complexity), MDL estimation is similar to maximum likelihood estimation and maximum a posteriori estimation (using maximum-entropy Bayesian priors ). However, MDL avoids assuming that
645-420: A feature of Bayesian procedures which use proper priors (i.e. those integrable to one) is that they are guaranteed to be coherent . Some advocates of Bayesian inference assert that inference must take place in this decision-theoretic framework, and that Bayesian inference should not conclude with the evaluation and summarization of posterior beliefs. Likelihood-based inference is a paradigm used to estimate
774-745: A frequentist or repeated sampling interpretation. In contrast, Bayesian inference works in terms of conditional probabilities (i.e. probabilities conditional on the observed data), compared to the marginal (but conditioned on unknown parameters) probabilities used in the frequentist approach. The frequentist procedures of significance testing and confidence intervals can be constructed without regard to utility functions . However, some elements of frequentist statistics, such as statistical decision theory , do incorporate utility functions . In particular, frequentist developments of optimal inference (such as minimum-variance unbiased estimators , or uniformly most powerful testing ) make use of loss functions , which play
903-412: A hypothesis test, consider the t -test to compare the means of two normally-distributed populations. The input to the t -test comprises a random sample from each of the two populations. To formulate the test as a comparison of models, we construct two different models. The first model models the two populations as having potentially different means and standard deviations. The likelihood function for
1032-448: A measurement in bytes per symbol, and a logarithm of base 10 will produce a measurement in decimal digits (or hartleys ) per symbol. Intuitively, the entropy H X of a discrete random variable X is a measure of the amount of uncertainty associated with the value of X when only its distribution is known. The entropy of a source that emits a sequence of N symbols that are independent and identically distributed (iid)
1161-420: A model of transformed data . Following is an illustration of how to deal with data transforms (adapted from Burnham & Anderson (2002 , §2.11.3): "Investigators should be sure that all hypotheses are modeled using the same response variable"). Suppose that we want to compare two models: one with a normal distribution of y and one with a normal distribution of log( y ) . We should not directly compare
1290-413: A much better model than BIC even when the "true model" is in the candidate set. The reason is that, for finite n , BIC can have a substantial risk of selecting a very bad model from the candidate set. This reason can arise even when n is much larger than k . With AIC, the risk of selecting a very bad model is minimized. Information theory Information theory is the mathematical study of
1419-546: A preliminary step before more formal inferences are drawn. Statisticians distinguish between three levels of modeling assumptions: Whatever level of assumption is made, correctly calibrated inference, in general, requires these assumptions to be correct; i.e. that the data-generating mechanisms really have been correctly specified. Incorrect assumptions of 'simple' random sampling can invalidate statistical inference. More complex semi- and fully parametric assumptions are also cause for concern. For example, incorrectly assuming
a random variable with two outcomes is the binary entropy function, usually taken to the logarithmic base 2, thus having the shannon (Sh) as unit: H_b(p) = −p log₂ p − (1 − p) log₂(1 − p). The joint entropy of two discrete random variables X and Y is merely the entropy of their pairing: (X, Y). This implies that if X and Y are independent, then their joint entropy is the sum of their individual entropies. For example, if ( X , Y ) represents
1677-436: A randomly-chosen member of the first population is in category #1. Hence, the probability that a randomly-chosen member of the first population is in category #2 is 1 − p . Note that the distribution of the first population has one parameter. Let q be the probability that a randomly-chosen member of the second population is in category #1. Note that the distribution of the second population also has one parameter. To compare
1806-410: A signal; noise, periods of silence, and other forms of signal corruption often degrade quality. Statistical inference Statistical inference is the process of using data analysis to infer properties of an underlying probability distribution . Inferential statistical analysis infers properties of a population , for example by testing hypotheses and deriving estimates. It is assumed that
1935-400: A single random variable. Another useful concept is mutual information defined on two random variables, which describes the measure of information in common between those variables, which can be used to describe their correlation. The former quantity is a property of the probability distribution of a random variable and gives a limit on the rate at which data generated by independent samples with
2064-431: A statistic (under the null-hypothesis) is defined by evaluating the test statistic for all of the plans that could have been generated by the randomization design. In frequentist inference, the randomization allows inferences to be based on the randomization distribution rather than a subjective model, and this is important especially in survey sampling and design of experiments. Statistical inference from randomized studies
2193-447: A statistical description for data, information theory quantifies the number of bits needed to describe the data, which is the information entropy of the source. This division of coding theory into compression and transmission is justified by the information transmission theorems, or source–channel separation theorems that justify the use of bits as the universal currency for information in many contexts. However, these theorems only hold in
2322-473: A statistical inference is a statistical proposition . Some common forms of statistical proposition are the following: Any statistical inference requires some assumptions. A statistical model is a set of assumptions concerning the generation of the observed data and similar data. Descriptions of statistical models usually emphasize the role of population quantities of interest, about which we wish to draw inference. Descriptive statistics are typically used as
2451-486: A theoretical section quantifying "intelligence" and the "line speed" at which it can be transmitted by a communication system, giving the relation W = K log m (recalling the Boltzmann constant ), where W is the speed of transmission of intelligence, m is the number of different voltage levels to choose from at each time step, and K is a constant. Ralph Hartley 's 1928 paper, Transmission of Information , uses
2580-511: A unit or scale or measure of information. Alan Turing in 1940 used similar ideas as part of the statistical analysis of the breaking of the German second world war Enigma ciphers. Much of the mathematics behind information theory with events of different probabilities were developed for the field of thermodynamics by Ludwig Boltzmann and J. Willard Gibbs . Connections between information-theoretic entropy and thermodynamic entropy, including
2709-427: A user's utility function need not be stated for this sort of inference, these summaries do all depend (to some extent) on stated prior beliefs, and are generally viewed as subjective conclusions. (Methods of prior construction which do not require external input have been proposed but not yet fully developed.) Formally, Bayesian inference is calibrated with reference to an explicitly stated utility, or loss function;
#17327880693012838-433: Is N ⋅ H bits (per message of N symbols). If the source data symbols are identically distributed but not independent, the entropy of a message of length N will be less than N ⋅ H . If one transmits 1000 bits (0s and 1s), and the value of each of these bits is known to the receiver (has a specific value with certainty) ahead of transmission, it is clear that no information is transmitted. If, however, each bit
2967-933: Is entropy . Entropy quantifies the amount of uncertainty involved in the value of a random variable or the outcome of a random process . For example, identifying the outcome of a fair coin flip (which has two equally likely outcomes) provides less information (lower entropy, less uncertainty) than identifying the outcome from a roll of a die (which has six equally likely outcomes). Some other important measures in information theory are mutual information , channel capacity , error exponents , and relative entropy . Important sub-fields of information theory include source coding , algorithmic complexity theory , algorithmic information theory and information-theoretic security . Applications of fundamental topics of information theory include source coding/ data compression (e.g. for ZIP files ), and channel coding/ error detection and correction (e.g. for DSL ). Its impact has been crucial to
3096-408: Is symmetric : Mutual information can be expressed as the average Kullback–Leibler divergence (information gain) between the posterior probability distribution of X given the value of Y and the prior distribution on X : In other words, this is a measure of how much, on the average, the probability distribution on X will change if we are given the value of Y . This is often recalculated as
3225-404: Is a way of comparing two distributions: a "true" probability distribution p ( X ) {\displaystyle p(X)} , and an arbitrary probability distribution q ( X ) {\displaystyle q(X)} . If we compress data in a manner that assumes q ( X ) {\displaystyle q(X)}
3354-496: Is also more straightforward than many other situations. In Bayesian inference , randomization is also of importance: in survey sampling , use of sampling without replacement ensures the exchangeability of the sample with the population; in randomized experiments, randomization warrants a missing at random assumption for covariate information. Objective randomization allows properly inductive procedures. Many statisticians prefer randomization-based analysis of data that
3483-459: Is also widely used for statistical inference . Suppose that we have a statistical model of some data. Let k be the number of estimated parameters in the model. Let L ^ {\displaystyle {\hat {L}}} be the maximized value of the likelihood function for the model. Then the AIC value of the model is the following. Given a set of candidate models for
3612-425: Is argued to be appropriate for selecting the "true model" (i.e. the process that generated the data) from the set of candidate models, whereas AIC is not appropriate. To be specific, if the "true model" is in the set of candidates, then BIC will select the "true model" with probability 1, as n → ∞ ; in contrast, when selection is done via AIC, the probability can be less than 1. Proponents of AIC argue that this issue
3741-466: Is defined as: It is common in information theory to speak of the "rate" or "entropy" of a language. This is appropriate, for example, when the source of information is English prose. The rate of a source of information is related to its redundancy and how well it can be compressed, the subject of source coding . Communications over a channel is the primary motivation of information theory. However, channels often fail to produce exact reconstruction of
3870-454: Is essentially AIC with an extra penalty term for the number of parameters. Note that as n → ∞ , the extra penalty term converges to 0, and thus AICc converges to AIC. If the assumption that the model is univariate and linear with normal residuals does not hold, then the formula for AICc will generally be different from the formula above. For some models, the formula can be difficult to determine. For every model that has AICc available, though,
3999-516: Is given by Ding et al. (2018) The formula for the Bayesian information criterion (BIC) is similar to the formula for AIC, but with a different penalty for the number of parameters. With AIC the penalty is 2 k , whereas with BIC the penalty is ln( n ) k . A comparison of AIC/AICc and BIC is given by Burnham & Anderson (2002 , §6.3-6.4), with follow-up remarks by Burnham & Anderson (2004) . The authors show that AIC/AICc can be derived in
4128-467: Is important in communication where it can be used to maximize the amount of information shared between sent and received signals. The mutual information of X relative to Y is given by: where SI ( S pecific mutual Information) is the pointwise mutual information . A basic property of the mutual information is that That is, knowing Y , we can save an average of I ( X ; Y ) bits in encoding X compared to not knowing Y . Mutual information
4257-466: Is independently equally likely to be 0 or 1, 1000 shannons of information (more often called bits) have been transmitted. Between these two extremes, information can be quantified as follows. If X {\displaystyle \mathbb {X} } is the set of all messages { x 1 , ..., x n } that X could be, and p ( x ) is the probability of some x ∈ X {\displaystyle x\in \mathbb {X} } , then
4386-504: Is many times larger than k , then the extra penalty term will be negligible; hence, the disadvantage in using AIC, instead of AICc, will be negligible. The Akaike information criterion was formulated by the statistician Hirotugu Akaike . It was originally named "an information criterion". It was first announced in English by Akaike at a 1971 symposium; the proceedings of the symposium were published in 1973. The 1973 publication, though,
4515-435: Is merely the entropy of each symbol, while, in the case of a stationary stochastic process, it is: that is, the conditional entropy of a symbol given all the previous symbols generated. For the more general case of a process that is not necessarily stationary, the average rate is: that is, the limit of the joint entropy per symbol. For stationary sources, these two expressions give the same result. The information rate
4644-500: Is negligible, because the "true model" is virtually never in the candidate set. Indeed, it is a common aphorism in statistics that " all models are wrong "; hence the "true model" (i.e. reality) cannot be in the candidate set. Another comparison of AIC and BIC is given by Vrieze (2012) . Vrieze presents a simulation study—which allows the "true model" to be in the candidate set (unlike with virtually all real data). The simulation study demonstrates, in particular, that AIC sometimes selects
4773-411: Is not estimated from the data, but instead given in advance, then there are only p + 1 parameters.) The AIC values of the candidate models must all be computed with the same data set. Sometimes, though, we might want to compare a model of the response variable , y , with a model of the logarithm of the response variable, log( y ) . More generally, we might want to compare a model of the data with
4902-454: Is not symmetric and does not satisfy the triangle inequality (making it a semi-quasimetric). Another interpretation of the KL divergence is the "unnecessary surprise" introduced by a prior from the truth: suppose a number X is about to be drawn randomly from a discrete set with probability distribution p ( x ) {\displaystyle p(x)} . If Alice knows
5031-766: Is not symmetric. The I ( X n → Y n ) {\displaystyle I(X^{n}\to Y^{n})} measures the information bits that are transmitted causally from X n {\displaystyle X^{n}} to Y n {\displaystyle Y^{n}} . The Directed information has many applications in problems where causality plays an important role such as capacity of channel with feedback, capacity of discrete memoryless networks with feedback, gambling with causal side information, compression with causal side information, real-time control communication settings, and in statistical physics. Other important information theoretic quantities include
5160-410: Is standard practice to refer to a statistical model, e.g., a linear or logistic models, when analyzing data from randomized experiments. However, the randomization scheme guides the choice of a statistical model. It is not possible to choose an appropriate model without knowing the randomization scheme. Seriously misleading results can be obtained analyzing data from randomized experiments while ignoring
5289-426: Is that it is applicable only in terms of frequency probability ; that is, in terms of repeated sampling from a population. However, the approach of Neyman develops these procedures in terms of pre-experiment probabilities. That is, before undertaking an experiment, one decides on a rule for coming to a conclusion such that the probability of being correct is controlled in a suitable way: such a probability need not have
5418-429: Is the conditional mutual information I ( X 1 , X 2 , . . . , X i ; Y i | Y 1 , Y 2 , . . . , Y i − 1 ) {\displaystyle I(X_{1},X_{2},...,X_{i};Y_{i}|Y_{1},Y_{2},...,Y_{i-1})} . In contrast to mutual information, directed information
5547-489: Is the asymptotic property under well-specified and misspecified model classes. Their fundamental differences have been well-studied in regression variable selection and autoregression order selection problems. In general, if the goal is prediction, AIC and leave-one-out cross-validations are preferred. If the goal is selection, inference, or interpretation, BIC or leave-many-out cross-validations are preferred. A comprehensive overview of AIC and other popular model selection methods
5676-474: Is the average conditional entropy over Y : Because entropy can be conditioned on a random variable or on that random variable being a certain value, care should be taken not to confuse these two definitions of conditional entropy, the former of which is in more common use. A basic property of this form of conditional entropy is that: Mutual information measures the amount of information that can be obtained about one random variable by observing another. It
5805-458: Is the distribution underlying some data, when, in reality, p ( X ) {\displaystyle p(X)} is the correct distribution, the Kullback–Leibler divergence is the number of average additional bits per datum necessary for compression. It is thus defined Although it is sometimes used as a 'distance metric', KL divergence is not a true metric since it
5934-399: Is the probability density function for the log-normal distribution . We then compare the AIC value of the normal model against the AIC value of the log-normal model. For misspecified model, Takeuchi's Information Criterion (TIC) might be more appropriate. However, TIC often suffers from instability caused by estimation errors. The critical difference between AIC and BIC (and their variants)
6063-581: Is used. A common unit of information is the bit or shannon , based on the binary logarithm . Other units include the nat , which is based on the natural logarithm , and the decimal digit , which is based on the common logarithm . In what follows, an expression of the form p log p is considered by convention to be equal to zero whenever p = 0 . This is justified because lim p → 0 + p log p = 0 {\displaystyle \lim _{p\rightarrow 0+}p\log p=0} for any logarithmic base. Based on
6192-655: The Bayesian paradigm, the likelihoodist paradigm, and the Akaikean-Information Criterion -based paradigm. This paradigm calibrates the plausibility of propositions by considering (notional) repeated sampling of a population distribution to produce datasets similar to the one at hand. By considering the dataset's characteristics under repeated sampling, the frequentist properties of a statistical proposition can be quantified—although in practice this quantification may be challenging. One interpretation of frequentist inference (or classical inference)
6321-630: The Rényi entropy and the Tsallis entropy (generalizations of the concept of entropy), differential entropy (a generalization of quantities of information to continuous distributions), and the conditional mutual information . Also, pragmatic information has been proposed as a measure of how much information has been used in making a decision. Coding theory is one of the most important and direct applications of information theory. It can be subdivided into source coding theory and channel coding theory. Using
6450-605: The log is in base 2. In this way, the extent to which Bob's prior is "wrong" can be quantified in terms of how "unnecessarily surprised" it is expected to make him. Directed information , I ( X n → Y n ) {\displaystyle I(X^{n}\to Y^{n})} , is an information theory measure that quantifies the information flow from the random process X n = { X 1 , X 2 , … , X n } {\displaystyle X^{n}=\{X_{1},X_{2},\dots ,X_{n}\}} to
6579-473: The probability mass function of each source symbol to be communicated, the Shannon entropy H , in units of bits (per symbol), is given by where p i is the probability of occurrence of the i -th possible value of the source symbol. This equation gives the entropy in the units of "bits" (per symbol) because it uses a logarithm of base 2, and this base-2 measure of entropy has sometimes been called
#17327880693016708-466: The quantification , storage , and communication of information . The field was established and put on a firm footing by Claude Shannon in the 1940s, though early contributions were made in the 1920s through the works of Harry Nyquist and Ralph Hartley . It is at the intersection of electronic engineering , mathematics , statistics , computer science , neurobiology , physics , and electrical engineering . A key measure in information theory
6837-406: The shannon in his honor. Entropy is also commonly computed using the natural logarithm (base e , where e is Euler's number), which produces a measurement of entropy in nats per symbol and sometimes simplifies the analysis by avoiding the need to include extra constants in the formulas. Other bases are also possible, but less commonly used. For example, a logarithm of base 2 = 256 will produce
6966-532: The unit ban . The landmark event establishing the discipline of information theory and bringing it to immediate worldwide attention was the publication of Claude E. Shannon's classic paper "A Mathematical Theory of Communication" in the Bell System Technical Journal in July and October 1948. Historian James Gleick rated the paper as the most important development of 1948, noting that
7095-464: The ε i are the residuals from the straight line fit. If the ε i are assumed to be i.i.d. Gaussian (with zero mean), then the model has three parameters: b 0 , b 1 , and the variance of the Gaussian distributions. Thus, when calculating the AIC value of this model, we should use k =3. More generally, for any least squares model with i.i.d. Gaussian residuals, the variance of
7224-463: The 'Bayes rule' is the one which maximizes expected utility, averaged over the posterior uncertainty. Formal Bayesian inference therefore automatically provides optimal decisions in a decision theoretic sense. Given assumptions, data and utility, Bayesian inference can be made for essentially any problem, although not every statistical inference need have a Bayesian interpretation. Analyses which are not formally Bayesian can be (logically) incoherent ;
7353-401: The 'error' of the approximation) can be assessed using simulation. The heuristic application of limiting results to finite samples is common practice in many applications, especially with low-dimensional models with log-concave likelihoods (such as with one-parameter exponential families ). For a given dataset that was produced by a randomization design, the randomization distribution of
7482-442: The AIC paradigm: it is provided by maximum likelihood estimation . Interval estimation can also be done within the AIC paradigm: it is provided by likelihood intervals . Hence, statistical inference generally can be done within the AIC paradigm. The most commonly used paradigms for statistical inference are frequentist inference and Bayesian inference . AIC, though, can be used to do statistical inference without relying on either
7611-417: The AIC values of the two models. Instead, we should transform the normal cumulative distribution function to first take the logarithm of y . To do that, we need to perform the relevant integration by substitution : thus, we need to multiply by the derivative of the (natural) logarithm function, which is 1/ y . Hence, the transformed distribution has the following probability density function : —which
7740-489: The Cox model can in some cases lead to faulty conclusions. Incorrect assumptions of Normality in the population also invalidates some forms of regression-based inference. The use of any parametric model is viewed skeptically by most experts in sampling human populations: "most sampling statisticians, when they deal with confidence intervals at all, limit themselves to statements about [estimators] based on very large samples, where
7869-410: The absolute quality of a model, only the quality relative to other models. Thus, if all the candidate models fit poorly, AIC will not give any warning of that. Hence, after selecting a model via AIC, it is usually good practice to validate the absolute quality of the model. Such validation commonly includes checks of the model's residuals (to determine whether the residuals seem like random) and tests of
#17327880693017998-486: The amount of information lost by a model, AIC deals with the trade-off between the goodness of fit of the model and the simplicity of the model. In other words, AIC deals with both the risk of overfitting and the risk of underfitting. The Akaike information criterion is named after the Japanese statistician Hirotugu Akaike , who formulated it. It now forms the basis of a paradigm for the foundations of statistics and
8127-416: The analysis of music , art creation , imaging system design, study of outer space , the dimensionality of space , and epistemology . Information theory studies the transmission, processing, extraction, and utilization of information . Abstractly, information can be thought of as the resolution of uncertainty. In the case of communication of information over a noisy channel, this abstract concept
8256-510: The application of confidence intervals , it does not necessarily invalidate conclusions drawn from fiducial arguments. An attempt was made to reinterpret the early work of Fisher's fiducial argument as a special case of an inference theory using upper and lower probabilities . Developing ideas of Fisher and of Pitman from 1938 to 1939, George A. Barnard developed "structural inference" or "pivotal inference", an approach using invariant probabilities on group families . Barnard reformulated
8385-734: The approach is founded on the concept of entropy in information theory . Indeed, minimizing AIC in a statistical model is effectively equivalent to maximizing entropy in a thermodynamic system; in other words, the information-theoretic approach in statistics is essentially applying the Second Law of Thermodynamics . As such, AIC has roots in the work of Ludwig Boltzmann on entropy . For more on these issues, see Akaike (1985) and Burnham & Anderson (2002 , ch. 2). A statistical model must account for random errors . A straight line model might be formally described as y i = b 0 + b 1 x i + ε i . Here,
8514-455: The area of statistical inference . Predictive inference is an approach to statistical inference that emphasizes the prediction of future observations based on past observations. Initially, predictive inference was based on observable parameters and it was the main purpose of studying probability , but it fell out of favor in the 20th century due to a new parametric approach pioneered by Bruno de Finetti . The approach modeled phenomena as
8643-472: The arguments behind fiducial inference on a restricted class of models on which "fiducial" procedures would be well-defined and useful. Donald A. S. Fraser developed a general theory for structural inference based on group theory and applied this to linear models. The theory formulated by Fraser has close links to decision theory and Bayesian statistics and can provide optimal frequentist decision rules if they exist. The topics below are usually included in
8772-480: The assertion: With it came the ideas of: Information theory is based on probability theory and statistics, where quantified information is usually described in terms of bits. Information theory often concerns itself with measures of information of the distributions associated with random variables. One of the most important measures is called entropy , which forms the building block of many other measures. Entropy allows quantification of measure of information in
8901-400: The asymptotic theory of limiting distributions is often invoked for work with finite samples. For example, limiting results are often invoked to justify the generalized method of moments and the use of generalized estimating equations , which are popular in econometrics and biostatistics . The magnitude of the difference between the limiting distribution and the true distribution (formally,
9030-463: The available posterior beliefs as the basis for making statistical propositions. There are several different justifications for using the Bayesian approach. Many informal Bayesian inferences are based on "intuitively reasonable" summaries of the posterior. For example, the posterior mean, median and mode, highest posterior density intervals, and Bayes Factors can all be motivated in this way. While
9159-447: The candidate model that minimized the information loss. We cannot choose with certainty, because we do not know f . Akaike (1974) showed, however, that we can estimate, via AIC, how much more (or less) information is lost by g 1 than by g 2 . The estimate, though, is only valid asymptotically ; if the number of data points is small, then some correction is often necessary (see AICc , below). Note that AIC tells nothing about
9288-429: The central limit theorem ensures that these [estimators] will have distributions that are nearly normal." In particular, a normal distribution "would be a totally unrealistic and catastrophically unwise assumption to make if we were dealing with any kind of economic population." Here, the central limit theorem states that the distribution of the sample mean "for very large samples" is approximately normally distributed, if
9417-484: The channel capacity. These codes can be roughly subdivided into data compression (source coding) and error-correction (channel coding) techniques. In the latter case, it took many years to find the methods Shannon's work proved were possible. A third class of information theory codes are cryptographic algorithms (both codes and ciphers ). Concepts, methods and results from coding theory and information theory are widely used in cryptography and cryptanalysis , such as
9546-484: The contextual affinities of a process and learning the intrinsic characteristics of the observations. For example, model-free simple linear regression is based either on: In either case, the model-free randomization inference for features of the common conditional distribution D x ( . ) {\displaystyle D_{x}(.)} relies on some regularity conditions, e.g. functional smoothness. For instance, model-free randomization inference for
9675-401: The data, AIC estimates the quality of each model, relative to each of the other models. Thus, AIC provides a means for model selection . AIC is founded on information theory : it offers an estimate of the relative information lost when a given model is used to represent the process that generated the data. (In doing so, it deals with the trade-off between the goodness of fit of the model and
9804-399: The data, the preferred model is the one with the minimum AIC value. Thus, AIC rewards goodness of fit (as assessed by the likelihood function), but it also includes a penalty that is an increasing function of the number of estimated parameters. The penalty discourages overfitting , which is desired because increasing the number of parameters in the model almost always improves the goodness of
9933-425: The distribution is not heavy-tailed. Given the difficulty in specifying exact distributions of sample statistics, many methods have been developed for approximating these. With finite samples, approximation results measure how close a limiting distribution approaches the statistic's sample distribution : For example, with 10,000 independent samples the normal distribution approximates (to two digits of accuracy)
10062-521: The distribution of the sample mean for many population distributions, by the Berry–Esseen theorem . Yet for many practical purposes, the normal approximation provides a good approximation to the sample-mean's distribution when there are 10 (or more) independent samples, according to simulation studies and statisticians' experience. Following Kolmogorov's work in the 1950s, advanced statistics uses approximation theory and functional analysis to quantify
10191-407: The distributions of the two populations, we construct two different models. The first model models the two populations as having potentially different distributions. The likelihood function for the first model is thus the product of the likelihoods for two distinct binomial distributions; so it has two parameters: p , q . To be explicit, the likelihood function is as follows. The second model models
10320-536: The divergence from the product of the marginal distributions to the actual joint distribution: Mutual information is closely related to the log-likelihood ratio test in the context of contingency tables and the multinomial distribution and to Pearson's χ test : mutual information can be considered a statistic for assessing independence between a pair of variables, and has a well-specified asymptotic distribution. The Kullback–Leibler divergence (or information divergence , information gain , or relative entropy )
10449-490: The entropy, H , of X is defined: (Here, I ( x ) is the self-information , which is the entropy contribution of an individual message, and E X {\displaystyle \mathbb {E} _{X}} is the expected value .) A property of entropy is that it is maximized when all the messages in the message space are equiprobable p ( x ) = 1/ n ; i.e., most unpredictable, in which case H ( X ) = log n . The special case of information entropy for
10578-649: The error of approximation. In this approach, the metric geometry of probability distributions is studied; this approach quantifies approximation error with, for example, the Kullback–Leibler divergence , Bregman divergence , and the Hellinger distance . With indefinitely large samples, limiting results like the central limit theorem describe the sample statistic's limiting distribution if one exists. Limiting results are not statements about finite samples, and indeed are irrelevant to finite samples. However,
10707-408: The example above, has an advantage by not making such assumptions. For another example of a hypothesis test, suppose that we have two populations, and each member of each population is in one of two categories —category #1 or category #2. Each population is binomially distributed . We want to know whether the distributions of the two populations are the same. We are given a random sample from each of
10836-464: The experimental protocol; common mistakes include forgetting the blocking used in an experiment and confusing repeated measurements on the same experimental unit with independent replicates of the treatment applied to different experimental units. Model-free techniques provide a complement to model-based methods, which employ reductionist strategies of reality-simplification. The former combine, evolve, ensemble and train algorithms dynamically adapting to
10965-422: The first model is thus the product of the likelihoods for two distinct normal distributions; so it has four parameters: μ 1 , σ 1 , μ 2 , σ 2 . To be explicit, the likelihood function is as follows (denoting the sample sizes by n 1 and n 2 ). The second model models the two populations as having the same means but potentially different standard deviations. The likelihood function for
11094-521: The fit. AIC is founded in information theory . Suppose that the data is generated by some unknown process f . We consider two candidate models to represent f : g 1 and g 2 . If we knew f , then we could find the information lost from using g 1 to represent f by calculating the Kullback–Leibler divergence , D KL ( f ‖ g 1 ) ; similarly, the information lost from using g 2 to represent f could be found by calculating D KL ( f ‖ g 2 ) . We would then, generally, choose
11223-514: The formula for AICc is given by AIC plus terms that includes both k and k . In comparison, the formula for AIC includes k but not k . In other words, AIC is a first-order estimate (of the information loss), whereas AICc is a second-order estimate . Further discussion of the formula, with examples of other assumptions, is given by Burnham & Anderson (2002 , ch. 7) and by Konishi & Kitagawa (2008 , ch. 7–8). In particular, with other assumptions, bootstrap estimation of
11352-474: The formula is often feasible. To summarize, AICc has the advantage of tending to be more accurate than AIC (especially for small samples), but AICc also has the disadvantage of sometimes being much more difficult to compute than AIC. Note that if all the candidate models have the same k and the same formula for AICc, then AICc and AIC will give identical (relative) valuations; hence, there will be no disadvantage in using AIC, instead of AICc. Furthermore, if n
11481-470: The frequentist paradigm or the Bayesian paradigm: because AIC can be interpreted without the aid of significance levels or Bayesian priors . In other words, AIC can be used to form a foundation of statistics that is distinct from both frequentism and Bayesianism. When the sample size is small, there is a substantial probability that AIC will select models that have too many parameters, i.e. that AIC will overfit. To address such potential overfitting, AICc
11610-416: The given distribution can be reliably compressed. The latter is a property of the joint distribution of two random variables, and is the maximum rate of reliable communication across a noisy channel in the limit of long block lengths, when the channel statistics are determined by the joint distribution. The choice of logarithmic base in the following formulae determines the unit of information entropy that
11739-410: The goal is to find the set of parameter values that maximizes the likelihood function, or equivalently, maximizes the probability of observing the given data. The process of likelihood-based inference usually involves the following steps: The Akaike information criterion (AIC) is an estimator of the relative quality of statistical models for a given set of data. Given a collection of models for
11868-487: The important contributions by Rolf Landauer in the 1960s, are explored in Entropy in thermodynamics and information theory . In Shannon's revolutionary and groundbreaking paper, the work for which had been substantially completed at Bell Labs by the end of 1944, Shannon for the first time introduced the qualitative and quantitative model of communication as a statistical process underlying information theory, opening with
11997-495: The information loss. In this example, we would omit the third model from further consideration. We then have three options: (1) gather more data, in the hope that this will allow clearly distinguishing between the first two models; (2) simply conclude that the data is insufficient to support selecting one model from among the first two; (3) take a weighted average of the first two models, with weights proportional to 1 and 0.368, respectively, and then do statistical inference based on
12126-522: The likelihood-ratio test is valid only for nested models , whereas AIC (and AICc) has no such restriction. Every statistical hypothesis test can be formulated as a comparison of statistical models. Hence, every statistical hypothesis test can be replicated via AIC. Two examples are briefly described in the subsections below. Details for those examples, and many more examples, are given by Sakamoto, Ishiguro & Kitagawa (1986 , Part II) and Konishi & Kitagawa (2008 , ch. 4). As an example of
12255-403: The limit of many channel uses, the rate of information that is asymptotically achievable is equal to the channel capacity, a quantity dependent merely on the statistics of the channel over which the messages are sent. Coding theory is concerned with finding explicit methods, called codes , for increasing the efficiency and reducing the error rate of data communication over noisy channels to near
12384-462: The model is referred to as training or learning (rather than inference ), and using a model for prediction is referred to as inference (instead of prediction ); see also predictive inference . Statistical inference makes propositions about a population, using data drawn from the population with some form of sampling . Given a hypothesis about a population, for which we wish to draw inferences, statistical inference consists of (first) selecting
12513-409: The model that minimizes the information loss. We cannot choose with certainty, but we can minimize the estimated information loss. Suppose that there are R candidate models. Denote the AIC values of those models by AIC 1 , AIC 2 , AIC 3 , ..., AIC R . Let AIC min be the minimum of those values. Then the quantity exp((AIC min − AIC i )/2) can be interpreted as being proportional to
12642-408: The model's predictions. For more on this topic, see statistical model validation . To apply AIC in practice, we start with a set of candidate models, and then find the models' corresponding AIC values. There will almost always be information lost due to using a candidate model to represent the "true model," i.e. the process that generated the data. We wish to select, from among the candidate models,
12771-481: The observed data set is sampled from a larger population. Inferential statistics can be contrasted with descriptive statistics . Descriptive statistics is solely concerned with properties of the observed data, and it does not rest on the assumption that the data come from a larger population. In machine learning , the term inference is sometimes used instead to mean "make a prediction, by evaluating an already trained model"; in this context inferring properties of
12900-576: The paper was "even more profound and more fundamental" than the transistor . He came to be known as the "father of information theory". Shannon outlined some of his initial ideas of information theory as early as 1939 in a letter to Vannevar Bush . Prior to this paper, limited information-theoretic ideas had been developed at Bell Labs , all implicitly assuming events of equal probability. Harry Nyquist 's 1924 paper, Certain Factors Affecting Telegraph Speed , contains
13029-479: The parameters of a statistical model based on observed data. Likelihoodism approaches statistics by using the likelihood function , denoted as L ( x | θ ) {\displaystyle L(x|\theta )} , quantifies the probability of observing the given data x {\displaystyle x} , assuming a specific set of parameter values θ {\displaystyle \theta } . In likelihood-based inference,
13158-404: The parameters of interest, and the estimators / test statistic to be used, the absence of obviously explicit utilities and prior distributions has helped frequentist procedures to become widely viewed as 'objective'. The Bayesian calculus describes degrees of belief using the 'language' of probability; beliefs are positive, integrate into one, and obey probability axioms. Bayesian inference uses
13287-462: The population feature conditional mean , μ ( x ) = E ( Y | X = x ) {\displaystyle \mu (x)=E(Y|X=x)} , can be consistently estimated via local averaging or local polynomial fitting, under the assumption that μ ( x ) {\displaystyle \mu (x)} is smooth. Also, relying on asymptotic normality or resampling, we can construct confidence intervals for
13416-468: The population feature, in this case, the conditional mean , μ ( x ) {\displaystyle \mu (x)} . Different schools of statistical inference have become established. These schools—or "paradigms"—are not mutually exclusive, and methods that work well under one paradigm often have attractive interpretations under other paradigms. Bandyopadhyay and Forster describe four paradigms: The classical (or frequentist ) paradigm,
13545-433: The position of a chess piece— X the row and Y the column, then the joint entropy of the row of the piece and the column of the piece will be the entropy of the position of the piece. Despite similar notation, joint entropy should not be confused with cross-entropy . The conditional entropy or conditional uncertainty of X given random variable Y (also called the equivocation of X about Y )
13674-417: The probability that the i th model minimizes the (estimated) information loss. As an example, suppose that there are three candidate models, whose AIC values are 100, 102, and 110. Then the second model is exp((100 − 102)/2) = 0.368 times as probable as the first model to minimize the information loss. Similarly, the third model is exp((100 − 110)/2) = 0.007 times as probable as the first model to minimize
13803-450: The random process Y n = { Y 1 , Y 2 , … , Y n } {\displaystyle Y^{n}=\{Y_{1},Y_{2},\dots ,Y_{n}\}} . The term directed information was coined by James Massey and is defined as where I ( X i ; Y i | Y i − 1 ) {\displaystyle I(X^{i};Y_{i}|Y^{i-1})}
13932-458: The residuals' distributions should be counted as one of the parameters. As another example, consider a first-order autoregressive model , defined by x i = c + φx i −1 + ε i , with the ε i being i.i.d. Gaussian (with zero mean). For this model, there are three parameters: c , φ , and the variance of the ε i . More generally, a p th-order autoregressive model has p + 2 parameters. (If, however, c
14061-576: The role of (negative) utility functions. Loss functions need not be explicitly stated for statistical theorists to prove that a statistical procedure has an optimality property. However, loss-functions are often useful for stating optimality properties: for example, median-unbiased estimators are optimal under absolute value loss functions, in that they minimize expected loss, and least squares estimators are optimal under squared error loss functions, in that they minimize expected loss. While statisticians using frequentist inference must choose for themselves
14190-474: The same Bayesian framework as BIC, just by using different prior probabilities . In the Bayesian derivation of BIC, though, each candidate model has a prior probability of 1/ R (where R is the number of candidate models). Additionally, the authors present a few simulation studies that suggest AICc tends to have practical/performance advantages over BIC. A point made by several researchers is that AIC and BIC are appropriate for different tasks. In particular, BIC
14319-511: The same phenomena. However, a good observational study may be better than a bad randomized experiment. The statistical analysis of a randomized experiment may be based on the randomization scheme stated in the experimental protocol and does not need a subjective model. However, at any time, some hypotheses cannot be tested using objective statistical models, which accurately describe randomized experiments or random samples. In some cases, such randomized studies are uneconomical or unethical. It
14448-399: The second model from further consideration: so we would conclude that the two populations have different means. The t -test assumes that the two populations have identical standard deviations; the test tends to be unreliable if the assumption is false and the sizes of the two samples are very different ( Welch's t -test would be better). Comparing the means of the populations via AIC, as in
14577-431: The second model thus sets μ 1 = μ 2 in the above equation; so it has three parameters. We then maximize the likelihood functions for the two models (in practice, we maximize the log-likelihood functions); after that, it is easy to calculate the AIC values of the models. We next calculate the relative likelihood. For instance, if the second model was only 0.01 times as likely as the first model, then we would omit
14706-490: The second model was only 0.01 times as likely as the first model, then we would omit the second model from further consideration: so we would conclude that the two populations have different distributions. Statistical inference is generally regarded as comprising hypothesis testing and estimation . Hypothesis testing can be done via AIC, as discussed above. Regarding estimation, there are two types: point estimation and interval estimation . Point estimation can be done within
14835-465: The simplicity of the model.) The minimum description length (MDL) principle has been developed from ideas in information theory and the theory of Kolmogorov complexity . The (MDL) principle selects statistical models that maximally compress the data; inference proceeds without assuming counterfactual or non-falsifiable "data-generating mechanisms" or probability models for the data, as might be done in frequentist or Bayesian approaches. However, if
14964-417: The situation where one transmitting user wishes to communicate to one receiving user. In scenarios with more than one transmitter (the multiple-access channel), more than one receiver (the broadcast channel ) or intermediary "helpers" (the relay channel ), or more general networks , compression followed by transmission may no longer be optimal. Any process that generates successive messages can be considered
15093-747: The success of the Voyager missions to deep space, the invention of the compact disc , the feasibility of mobile phones and the development of the Internet and artificial intelligence . The theory has also found applications in other areas, including statistical inference , cryptography , neurobiology , perception , signal processing , linguistics , the evolution and function of molecular codes ( bioinformatics ), thermal physics , molecular dynamics , black holes , quantum computing , information retrieval , intelligence gathering , plagiarism detection , pattern recognition , anomaly detection ,
15222-460: The true distribution p ( x ) {\displaystyle p(x)} , while Bob believes (has a prior ) that the distribution is q ( x ) {\displaystyle q(x)} , then Bob will be more surprised than Alice, on average, upon seeing the value of X . The KL divergence is the (objective) expected value of Bob's (subjective) surprisal minus Alice's surprisal, measured in bits if
15351-427: The two populations as having the same distribution. The likelihood function for the second model thus sets p = q in the above equation; so the second model has one parameter. We then maximize the likelihood functions for the two models (in practice, we maximize the log-likelihood functions); after that, it is easy to calculate the AIC values of the models. We next calculate the relative likelihood. For instance, if
15480-414: The two populations. Let m be the size of the sample from the first population. Let m 1 be the number of observations (in the sample) in category #1; so the number of observations in category #2 is m − m 1 . Similarly, let n be the size of the sample from the second population. Let n 1 be the number of observations (in the sample) in category #1. Let p be the probability that
15609-447: The underlying probability model is known; the MDL principle can also be applied without assumptions that e.g. the data arose from independent sampling. The MDL principle has been applied in communication- coding theory in information theory , in linear regression , and in data mining . The evaluation of MDL-based inferential procedures often uses techniques or criteria from computational complexity theory . Fiducial inference
15738-447: The weighted multimodel . The quantity exp((AIC min − AIC i )/2) is known as the relative likelihood of model i . It is closely related to the likelihood ratio used in the likelihood-ratio test . Indeed, if all the models in the candidate set have the same number of parameters, then using AIC might at first appear to be very similar to using the likelihood-ratio test. There are, however, important distinctions. In particular,
15867-425: The word information as a measurable quantity, reflecting the receiver's ability to distinguish one sequence of symbols from any other, thus quantifying information as H = log S = n log S , where S was the number of possible symbols, and n the number of symbols in a transmission. The unit of information was therefore the decimal digit , which since has sometimes been called the hartley in his honor as
15996-510: The work of Hurvich & Tsai (1989) , and several further papers by the same authors, which extended the situations in which AICc could be applied. The first general exposition of the information-theoretic approach was the volume by Burnham & Anderson (2002) . It includes an English presentation of the work of Takeuchi. The volume led to far greater use of AIC, and it now has more than 64,000 citations on Google Scholar . Akaike called his approach an "entropy maximization principle", because
16125-412: Was an approach to statistical inference based on fiducial probability , also known as a "fiducial distribution". In subsequent work, this approach has been called ill-defined, extremely limited in applicability, and even fallacious. However this argument is the same as that which shows that a so-called confidence distribution is not a valid probability distribution and, since this has not invalidated
16254-404: Was developed: AICc is AIC with a correction for small sample sizes. The formula for AICc depends upon the statistical model. Assuming that the model is univariate , is linear in its parameters, and has normally-distributed residuals (conditional upon regressors), then the formula for AICc is as follows. —where n denotes the sample size and k denotes the number of parameters. Thus, AICc
16383-419: Was formalized in 1948 by Claude Shannon in a paper entitled A Mathematical Theory of Communication , in which information is thought of as a set of possible messages, and the goal is to send these messages over a noisy channel, and to have the receiver reconstruct the message with low probability of error, in spite of the channel noise. Shannon's main result, the noisy-channel coding theorem , showed that, in
16512-460: Was generated by well-defined randomization procedures. (However, it is true that in fields of science with developed theoretical knowledge and experimental control, randomized experiments may increase the costs of experimentation without improving the quality of inferences. ) Similarly, results from randomized experiments are recommended by leading statistical authorities as allowing inferences with greater reliability than do observational studies of
16641-462: Was only an informal presentation of the concepts. The first formal publication was a 1974 paper by Akaike. The initial derivation of AIC relied upon some strong assumptions. Takeuchi (1976) showed that the assumptions could be made much weaker. Takeuchi's work, however, was in Japanese and was not widely known outside Japan for many years. (Translated in ) AIC was originally proposed for linear regression (only) by Sugiura (1978) . That instigated