In statistics , maximum likelihood estimation ( MLE ) is a method of estimating the parameters of an assumed probability distribution , given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model , the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference .
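For instance, a minimal sketch of the idea (assuming NumPy and SciPy are available; names and numbers are illustrative) estimates the mean and standard deviation of a normal sample by numerically maximizing the log-likelihood and checks the result against the closed-form estimates.

```python
# Minimal MLE sketch: maximize the normal log-likelihood numerically.
# NumPy/SciPy assumed; variable names are illustrative.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=500)   # observed sample

def negative_log_likelihood(params):
    mu, log_sigma = params                        # log-sigma keeps sigma > 0
    sigma = np.exp(log_sigma)
    return -np.sum(norm.logpdf(data, loc=mu, scale=sigma))

result = minimize(negative_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

# The numerical optimum should agree with the closed-form MLEs:
print(mu_hat, data.mean())                        # sample mean
print(sigma_hat, data.std(ddof=0))                # sqrt of (1/n) * sum of squared deviations
```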
If the likelihood function is differentiable, the derivative test for finding maxima can be applied. In some cases, the first-order conditions of the likelihood function can be solved analytically; for instance, the ordinary least squares estimator for a linear regression model maximizes the likelihood when the random errors are assumed to have normal distributions with the same variance.
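As a concrete check of this equivalence, the following sketch (NumPy and SciPy assumed; the data and names are illustrative) compares the closed-form least-squares coefficients with coefficients obtained by numerically maximizing the Gaussian log-likelihood.

```python
# Sketch: for a linear model with normal errors, the coefficients that
# maximize the likelihood coincide with ordinary least squares.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one regressor
beta_true = np.array([1.0, 3.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Closed-form ordinary least squares estimate
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Numerical MLE under y ~ N(X beta, sigma^2)
def neg_loglik(params):
    beta, log_sigma = params[:2], params[2]
    return -np.sum(norm.logpdf(y, loc=X @ beta, scale=np.exp(log_sigma)))

beta_mle = minimize(neg_loglik, x0=np.zeros(3)).x[:2]
print(np.allclose(beta_ols, beta_mle, atol=1e-3))        # expected: True (up to optimizer tolerance)
```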
244-392: A {\textstyle x=a} when Although this definition looks similar to the differentiability of single-variable real functions, it is however a more restrictive condition. A function f : C → C {\textstyle f:\mathbb {C} \to \mathbb {C} } , that is complex-differentiable at a point x = a {\textstyle x=a}
{\displaystyle \mathop {\operatorname {arg\,max} } _{\theta }{\mathcal {L}}(\theta \mid x\in [x_{j},x_{j}+h])=\mathop {\operatorname {arg\,max} } _{\theta }{\frac {1}{h}}{\mathcal {L}}(\theta \mid x\in [x_{j},x_{j}+h]),} since {\textstyle h}
{\displaystyle \mathop {\operatorname {arg\,max} } _{\theta }{\frac {1}{h}}{\mathcal {L}}(\theta \mid x\in [x_{j},x_{j}+h])=\mathop {\operatorname {arg\,max} } _{\theta }{\frac {1}{h}}\Pr(x_{j}\leq x\leq x_{j}+h\mid \theta )=\mathop {\operatorname {arg\,max} } _{\theta }{\frac {1}{h}}\int _{x_{j}}^{x_{j}+h}f(x\mid \theta )\,dx,} where {\textstyle f(x\mid \theta )}
{\displaystyle {\begin{aligned}&\mathop {\operatorname {arg\,max} } _{\theta }{\mathcal {L}}(\theta \mid x_{j})=\mathop {\operatorname {arg\,max} } _{\theta }\left[\lim _{h\to 0^{+}}{\mathcal {L}}(\theta \mid x\in [x_{j},x_{j}+h])\right]\\[4pt]={}&\mathop {\operatorname {arg\,max} } _{\theta }\left[\lim _{h\to 0^{+}}{\frac {1}{h}}\int _{x_{j}}^{x_{j}+h}f(x\mid \theta )\,dx\right]=\mathop {\operatorname {arg\,max} } _{\theta }f(x_{j}\mid \theta ).\end{aligned}}} Therefore,
{\displaystyle \mathop {\operatorname {arg\,max} } _{\theta }{\mathcal {L}}(\theta \mid x_{j})=\mathop {\operatorname {arg\,max} } _{\theta }f(x_{j}\mid \theta ),} and so maximizing the probability density at {\textstyle x_{j}} amounts to maximizing the likelihood of the specific observation.
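A small numerical illustration of this argument (NumPy and SciPy assumed; the normal-location model is only an example): as h shrinks, the averaged interval probability approaches the density at the observation, and its maximizer approaches the maximizer of the density.

```python
# As h -> 0, (1/h) * P(x_j <= X <= x_j + h | theta) approaches f(x_j | theta),
# so the arg max over theta is the same as maximizing the density.
import numpy as np
from scipy.stats import norm

x_j = 1.3
theta_grid = np.linspace(-3, 3, 601)      # candidate values of the normal mean
for h in [1.0, 0.1, 0.001]:
    avg_prob = (norm.cdf(x_j + h, loc=theta_grid) - norm.cdf(x_j, loc=theta_grid)) / h
    print(h, theta_grid[np.argmax(avg_prob)])            # approaches x_j = 1.3
print(theta_grid[np.argmax(norm.pdf(x_j, loc=theta_grid))])  # density arg max: 1.3
```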
a jump discontinuity, it is possible for the derivative to have an essential discontinuity. For example, the function {\displaystyle f(x)\;=\;{\begin{cases}x^{2}\sin(1/x)&{\text{ if }}x\neq 0\\0&{\text{ if }}x=0\end{cases}}}
1098-575: A log-normal distribution . The density of Y follows with f X {\displaystyle f_{X}} standard Normal and g − 1 ( y ) = log ( y ) {\displaystyle g^{-1}(y)=\log(y)} , | ( g − 1 ( y ) ) ′ | = 1 y {\displaystyle |(g^{-1}(y))^{\prime }|={\frac {1}{y}}} for y > 0 {\displaystyle y>0} . As assumed above, if
1220-401: A parametric family { f ( ⋅ ; θ ) ∣ θ ∈ Θ } , {\displaystyle \;\{f(\cdot \,;\theta )\mid \theta \in \Theta \}\;,} where Θ {\displaystyle \,\Theta \,} is called the parameter space , a finite-dimensional subset of Euclidean space . Evaluating
1342-430: A vector-valued function mapping R k {\displaystyle \,\mathbb {R} ^{k}\,} into R r . {\displaystyle \;\mathbb {R} ^{r}~.} Estimating the true parameter θ {\displaystyle \theta } belonging to Θ {\displaystyle \Theta } then, as a practical matter, means to find
a classifier that minimizes total expected risk; in particular, when the costs (the loss function) associated with different decisions are equal, the classifier minimizes the error over the whole distribution. Thus, the Bayes decision rule is stated as where {\displaystyle \;w_{1}\,,w_{2}\;} are predictions of different classes. From
1586-447: A conclusion which could only be reached via Bayes' theorem given knowledge about the marginal probabilities P ( p H = 0.5 ) {\textstyle P(p_{\text{H}}=0.5)} and P ( HH ) {\textstyle P({\text{HH}})} . Now suppose that the coin is not a fair coin, but instead that p H = 0.3 {\textstyle p_{\text{H}}=0.3} . Then
1708-407: A density f ( x ∣ θ ) {\textstyle f(x\mid \theta )} , where the sum of all the p {\textstyle p} 's added to the integral of f {\textstyle f} is always one. Assuming that it is possible to distinguish an observation corresponding to one of the discrete probability masses from one which corresponds to
a factor that does not depend on the model parameters. For example, the MLE parameters of the log-normal distribution are the same as those of the normal distribution fitted to the logarithm of the data. In fact, in the log-normal case, if {\displaystyle X\sim {\mathcal {N}}(0,1)}, then {\displaystyle Y=g(X)=e^{X}} follows a log-normal distribution.
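A sketch of this fact (NumPy and SciPy assumed; note that SciPy's lognorm parameterization uses shape s = sigma and scale = exp(mu), with the location fixed at zero):

```python
# The MLE parameters of a log-normal fit equal the normal MLE fitted to the
# logarithm of the same data.
import numpy as np
from scipy.stats import lognorm, norm

rng = np.random.default_rng(2)
y = rng.lognormal(mean=0.4, sigma=0.8, size=2000)

s_hat, _, scale_hat = lognorm.fit(y, floc=0)         # log-normal MLE (loc fixed at 0)
mu_hat_lognormal, sigma_hat_lognormal = np.log(scale_hat), s_hat

mu_hat_logs, sigma_hat_logs = norm.fit(np.log(y))    # normal MLE on the log-data

print(mu_hat_lognormal, mu_hat_logs)                 # essentially identical
print(sigma_hat_lognormal, sigma_hat_logs)
```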
1952-415: A function is necessarily infinitely differentiable, and in fact analytic . If M is a differentiable manifold , a real or complex-valued function f on M is said to be differentiable at a point p if it is differentiable with respect to some (or any) coordinate chart defined around p . If M and N are differentiable manifolds, a function f : M → N is said to be differentiable at
2074-478: A function that is continuous everywhere but differentiable nowhere is the Weierstrass function . A function f {\textstyle f} is said to be continuously differentiable if the derivative f ′ ( x ) {\textstyle f^{\prime }(x)} exists and is itself a continuous function. Although the derivative of a differentiable function never has
a given significance level. Numerous other tests can be viewed as likelihood-ratio tests or approximations thereof. The asymptotic distribution of the log-likelihood ratio, considered as a test statistic, is given by Wilks' theorem. The likelihood ratio is also of central importance in Bayesian inference, where it is known as the Bayes factor, and is used in Bayes' rule. Stated in terms of odds, Bayes' rule states that the posterior odds of two alternatives equal the prior odds multiplied by the likelihood ratio.
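A minimal sketch of a likelihood-ratio test (NumPy and SciPy assumed; the normal model and the null value are illustrative): the null hypothesis fixes the normal mean at 0, and by Wilks' theorem the statistic is compared against a chi-square distribution with one degree of freedom.

```python
# Likelihood-ratio test of H0: mu = 0 for a normal sample with unknown variance.
import numpy as np
from scipy.stats import norm, chi2

rng = np.random.default_rng(3)
x = rng.normal(loc=0.2, scale=1.0, size=100)

def max_loglik(mu):
    sigma = np.sqrt(np.mean((x - mu) ** 2))     # MLE of sigma for the given mu
    return np.sum(norm.logpdf(x, loc=mu, scale=sigma))

llr = 2 * (max_loglik(x.mean()) - max_loglik(0.0))   # full fit vs. restricted fit
p_value = chi2.sf(llr, df=1)                         # Wilks: chi-square(1) reference
print(llr, p_value)
```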
2318-403: A multi-variable function, while not being complex-differentiable. For example, f ( z ) = z + z ¯ 2 {\displaystyle f(z)={\frac {z+{\overline {z}}}{2}}} is differentiable at every point, viewed as the 2-variable real function f ( x , y ) = x {\displaystyle f(x,y)=x} , but it
2440-464: A parameter θ {\textstyle \theta } . Then the function L ( θ ∣ x ) = f θ ( x ) , {\displaystyle {\mathcal {L}}(\theta \mid x)=f_{\theta }(x),} considered as a function of θ {\textstyle \theta } , is the likelihood function (of θ {\textstyle \theta } , given
2562-440: A parameter θ {\textstyle \theta } . Then the function L ( θ ∣ x ) = p θ ( x ) = P θ ( X = x ) , {\displaystyle {\mathcal {L}}(\theta \mid x)=p_{\theta }(x)=P_{\theta }(X=x),} considered as a function of θ {\textstyle \theta } ,
a perspective of minimizing error, it can also be stated as where

Differentiable function

In mathematics, a differentiable function of one real variable is a function whose derivative exists at each point in its domain. In other words, the graph of a differentiable function has a non-vertical tangent line at each interior point in its domain. A differentiable function
a point p if it is differentiable with respect to some (or any) coordinate charts defined around p and f(p).

Likelihood function

A likelihood function (often simply called the likelihood) measures how well a statistical model explains observed data by calculating the probability of seeing that data under different parameter values of the model. It is constructed from the joint probability distribution of
2928-494: A set h 1 , h 2 , … , h r , h r + 1 , … , h k {\displaystyle \;h_{1},h_{2},\ldots ,h_{r},h_{r+1},\ldots ,h_{k}\;} in such a way that h ∗ = [ h 1 , h 2 , … , h k ] {\displaystyle \;h^{\ast }=\left[h_{1},h_{2},\ldots ,h_{k}\right]\;}
3050-535: A statistical test of the "validity" of the constraint, known as the Lagrange multiplier test . Nonparametric maximum likelihood estimation can be performed using the empirical likelihood . A maximum likelihood estimator is an extremum estimator obtained by maximizing, as a function of θ , the objective function ℓ ^ ( θ ; x ) {\displaystyle {\widehat {\ell \,}}(\theta \,;x)} . If
3172-438: A stronger condition of uniform convergence almost surely has to be imposed: Additionally, if (as assumed above) the data were generated by f ( ⋅ ; θ 0 ) {\displaystyle f(\cdot \,;\theta _{0})} , then under certain conditions, it can also be shown that the maximum likelihood estimator converges in distribution to a normal distribution. Specifically, where I
3294-418: A unique global maximum. Compactness implies that the likelihood cannot approach the maximum value arbitrarily close at some other point (as demonstrated for example in the picture on the right). Compactness is only a sufficient condition and not a necessary condition. Compactness can be replaced by some other conditions, such as: The dominance condition can be employed in the case of i.i.d. observations. In
3416-500: Is negative definite for every θ ∈ Θ {\textstyle \,\theta \in \Theta \,} at which the gradient ∇ L ≡ [ ∂ L ∂ θ i ] i = 1 n i {\textstyle \;\nabla L\equiv \left[\,{\frac {\partial L}{\,\partial \theta _{i}\,}}\,\right]_{i=1}^{n_{\mathrm {i} }}\;} vanishes, and if
3538-498: Is negative semi-definite at θ ^ {\displaystyle {\widehat {\theta \,}}} , as this indicates local concavity . Conveniently, most common probability distributions – in particular the exponential family – are logarithmically concave . While the domain of the likelihood function—the parameter space —is generally a finite-dimensional subset of Euclidean space , additional restrictions sometimes need to be incorporated into
3660-636: Is positive definite and | I ( θ ) | {\textstyle \,\left|\mathbf {I} (\theta )\right|\,} is finite. This ensures that the score has a finite variance. The above conditions are sufficient, but not necessary. That is, a model that does not meet these regularity conditions may or may not have a maximum likelihood estimator of the properties mentioned above. Further, in case of non-independently or non-identically distributed observations additional properties may need to be assumed. In Bayesian statistics, almost identical regularity conditions are imposed on
3782-415: Is smooth (the function is locally well approximated as a linear function at each interior point) and does not contain any break, angle, or cusp . If x 0 is an interior point in the domain of a function f , then f is said to be differentiable at x 0 if the derivative f ′ ( x 0 ) {\displaystyle f'(x_{0})} exists. In other words,
3904-471: Is a one-to-one function from R k {\displaystyle \mathbb {R} ^{k}} to itself, and reparameterize the likelihood function by setting ϕ i = h i ( θ 1 , θ 2 , … , θ k ) . {\displaystyle \;\phi _{i}=h_{i}(\theta _{1},\theta _{2},\ldots ,\theta _{k})~.} Because of
4026-497: Is a column-vector of Lagrange multipliers and ∂ h ( θ ) T ∂ θ {\displaystyle \;{\frac {\partial h(\theta )^{\mathsf {T}}}{\partial \theta }}\;} is the k × r Jacobian matrix of partial derivatives. Naturally, if the constraints are not binding at the maximum, the Lagrange multipliers should be zero. This in turn allows for
4148-470: Is a model, often in idealized form, of the process generated by the data. It is a common aphorism in statistics that all models are wrong . Thus, true consistency does not occur in practical applications. Nevertheless, consistency is often considered to be a desirable property for an estimator to have. To establish consistency, the following conditions are sufficient. In other words, different parameter values θ correspond to different distributions within
4270-624: Is a real upper triangular matrix and Γ T {\displaystyle \Gamma ^{\mathsf {T}}} is its transpose . In practice, restrictions are usually imposed using the method of Lagrange which, given the constraints as defined above, leads to the restricted likelihood equations where λ = [ λ 1 , λ 2 , … , λ r ] T {\displaystyle ~\lambda =\left[\lambda _{1},\lambda _{2},\ldots ,\lambda _{r}\right]^{\mathsf {T}}~}
is assumed that the information matrix, {\displaystyle \mathbf {I} (\theta )=\int _{-\infty }^{\infty }{\frac {\partial \log f}{\partial \theta _{r}}}\ {\frac {\partial \log f}{\partial \theta _{s}}}\ f\ \mathrm {d} z}
4514-432: Is automatically differentiable at that point, when viewed as a function f : R 2 → R 2 {\displaystyle f:\mathbb {R} ^{2}\to \mathbb {R} ^{2}} . This is because the complex-differentiability implies that However, a function f : C → C {\textstyle f:\mathbb {C} \to \mathbb {C} } can be differentiable as
4636-444: Is called the maximum likelihood estimator . It is generally a function defined over the sample space , i.e. taking a given sample as its argument. A sufficient but not necessary condition for its existence is for the likelihood function to be continuous over a parameter space Θ {\displaystyle \,\Theta \,} that is compact . For an open Θ {\displaystyle \,\Theta \,}
is central to likelihoodist statistics: the law of likelihood states that the degree to which data (considered as evidence) supports one parameter value versus another is measured by the likelihood ratio. In frequentist inference, the likelihood ratio is the basis for a test statistic, the so-called likelihood-ratio test. By the Neyman–Pearson lemma, this is the most powerful test for comparing two simple hypotheses at
is defined to be {\displaystyle R(\theta )={\frac {{\mathcal {L}}(\theta \mid x)}{{\mathcal {L}}({\hat {\theta }}\mid x)}}.} Thus, the relative likelihood is the likelihood ratio (discussed above) with the fixed denominator {\textstyle {\mathcal {L}}({\hat {\theta }}\mid x)}.
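A short sketch (NumPy assumed; the binomial sample is illustrative) computing the relative likelihood on a grid and the corresponding likelihood region for a chosen threshold:

```python
# Relative likelihood R(p) = L(p | x) / L(p_hat | x) for a binomial sample,
# and the 15% likelihood region {p : R(p) >= 0.15}.
import numpy as np

n, k = 30, 18                                 # 18 successes in 30 trials
p_grid = np.linspace(0.001, 0.999, 999)
log_lik = k * np.log(p_grid) + (n - k) * np.log(1 - p_grid)
rel_lik = np.exp(log_lik - log_lik.max())     # standardized so the maximum is 1

region = p_grid[rel_lik >= 0.15]              # 15% likelihood region
print(k / n, region.min(), region.max())      # MLE and the region's endpoints
```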
is differentiable at 0, since {\displaystyle f'(0)=\lim _{\varepsilon \to 0}\left({\frac {\varepsilon ^{2}\sin(1/\varepsilon )-0}{\varepsilon }}\right)=0} exists. However, for {\textstyle x\neq 0,} differentiation rules imply {\displaystyle f'(x)=2x\sin(1/x)-\cos(1/x)\;,} which has no limit as {\textstyle x\to 0.} Thus, this example shows that a function can be differentiable without being continuously differentiable.
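A quick numerical check of both claims (NumPy assumed):

```python
# f(x) = x^2 sin(1/x), f(0) = 0: the difference quotient at 0 tends to 0,
# but f'(x) keeps oscillating near x = 0, so f' has no limit there.
import numpy as np

def f(x):
    return 0.0 if x == 0 else x**2 * np.sin(1.0 / x)

for eps in [1e-2, 1e-4, 1e-6]:
    print(eps, (f(eps) - f(0.0)) / eps)        # tends to 0, so f'(0) = 0

def f_prime(x):                                # derivative formula, valid for x != 0
    return 2 * x * np.sin(1.0 / x) - np.cos(1.0 / x)

for eps in [1e-2, 1e-4, 1e-6]:
    print(eps, f_prime(eps))                   # values keep oscillating, no limit
```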
5124-589: Is expressed in terms of a set of parameters . The goal of maximum likelihood estimation is to determine the parameters for which the observed data have the highest joint probability. We write the parameters governing the joint distribution as a vector θ = [ θ 1 , θ 2 , … , θ k ] T {\displaystyle \;\theta =\left[\theta _{1},\,\theta _{2},\,\ldots ,\,\theta _{k}\right]^{\mathsf {T}}\;} so that this distribution falls within
is given by {\textstyle {\mathcal {L}}(\theta \mid x\in [x_{j},x_{j}+h])}. Observe that
5368-450: Is given in the section Differentiability classes ). If f is differentiable at a point x 0 , then f must also be continuous at x 0 . In particular, any differentiable function must be continuous at every point in its domain. The converse does not hold : a continuous function need not be differentiable. For example, a function with a bend, cusp , or vertical tangent may be continuous, but fails to be differentiable at
5490-421: Is not complex-differentiable at any point because the limit lim h → 0 h + h ¯ 2 h {\textstyle \lim _{h\to 0}{\frac {h+{\bar {h}}}{2h}}} does not exist (the limit depends on the angle of approach). Any function that is complex-differentiable in a neighborhood of a point is called holomorphic at that point. Such
5612-460: Is not directly used in AIC-based statistics. Instead, what is used is the relative likelihood of models (see below). In evidence-based medicine , likelihood ratios are used in diagnostic testing to assess the value of performing a diagnostic test . Since the actual value of the likelihood function depends on the sample, it is often convenient to work with a standardized measure. Suppose that
5734-408: Is not necessarily differentiable, but a differentiable function is necessarily continuous (at every point where it is differentiable) as being shown below (in the section Differentiability and continuity ). A function is said to be continuously differentiable if its derivative is also a continuous function; there exist functions that are differentiable but not continuously differentiable (an example
5856-801: Is of class C 2 {\displaystyle C^{2}} if the first and second derivative of the function both exist and are continuous. More generally, a function is said to be of class C k {\displaystyle C^{k}} if the first k {\displaystyle k} derivatives f ′ ( x ) , f ′ ′ ( x ) , … , f ( k ) ( x ) {\textstyle f^{\prime }(x),f^{\prime \prime }(x),\ldots ,f^{(k)}(x)} all exist and are continuous. If derivatives f ( n ) {\displaystyle f^{(n)}} exist for all positive integers n , {\textstyle n,}
5978-433: Is often avoided and instead f ( x ; θ ) {\textstyle f(x;\theta )} or f ( x , θ ) {\textstyle f(x,\theta )} are used to indicate that θ {\textstyle \theta } is regarded as a fixed unknown quantity rather than as a random variable being conditioned on. The likelihood function does not specify
is positive and constant. Because
6222-508: Is possible to estimate the second-order bias of the maximum likelihood estimator, and correct for that bias by subtracting it: This estimator is unbiased up to the terms of order 1 / n , and is called the bias-corrected maximum likelihood estimator . This bias-corrected estimator is second-order efficient (at least within the curved exponential family), meaning that it has minimal mean squared error among all second-order bias-corrected estimators, up to
#17327879723636344-498: Is said to be differentiable at a ∈ U {\displaystyle a\in U} if the derivative exists. This implies that the function is continuous at a . This function f is said to be differentiable on U if it is differentiable at every point of U . In this case, the derivative of f is thus a function from U into R . {\displaystyle \mathbb {R} .} A continuous function
6466-406: Is such that ∫ − ∞ ∞ H r s t ( z ) d z ≤ M < ∞ . {\textstyle \,\int _{-\infty }^{\infty }H_{rst}(z)\mathrm {d} z\leq M<\infty \;.} This boundedness of the derivatives is needed to allow for differentiation under the integral sign . And lastly, it
6588-497: Is taken with respect to the true density. Maximum-likelihood estimators have no optimum properties for finite samples, in the sense that (when evaluated on finite samples) other estimators may have greater concentration around the true parameter-value. However, like other estimation methods, maximum likelihood estimation possesses a number of attractive limiting properties : As the sample size increases to infinity, sequences of maximum likelihood estimators have these properties: Under
6710-402: Is that in finite samples, there may exist multiple roots for the likelihood equations. Whether the identified root θ ^ {\displaystyle \,{\widehat {\theta \,}}\,} of the likelihood equations is indeed a (local) maximum depends on whether the matrix of second-order partial and cross-partial derivatives, the so-called Hessian matrix
6832-569: Is the Fisher information matrix . The maximum likelihood estimator selects the parameter value which gives the observed data the largest possible probability (or probability density, in the continuous case). If the parameter consists of a number of components, then we define their separate maximum likelihood estimators, as the corresponding component of the MLE of the complete parameter. Consistent with this, if θ ^ {\displaystyle {\widehat {\theta \,}}}
6954-456: Is the likelihood function , given the outcome x {\textstyle x} of the random variable X {\textstyle X} . Sometimes the probability of "the value x {\textstyle x} of X {\textstyle X} for the parameter value θ {\textstyle \theta } " is written as P ( X = x | θ ) or P ( X = x ; θ ) . The likelihood
is the MLE for {\displaystyle \theta }, and if {\displaystyle g(\theta )} is any transformation of {\displaystyle \theta }, then the MLE for {\displaystyle \alpha =g(\theta )} is by definition {\displaystyle {\widehat {\alpha }}=g({\widehat {\theta \,}})}. It maximizes the so-called profile likelihood.
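A sketch of this invariance property (NumPy and SciPy assumed; the Bernoulli model and the odds transformation g(p) = p/(1 - p) are illustrative): maximizing the likelihood directly in the transformed parameter gives the same answer as transforming the MLE.

```python
# Functional invariance: the MLE of the odds equals the odds of the MLE.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
x = rng.binomial(1, 0.7, size=200)            # Bernoulli sample
k, n = x.sum(), x.size
p_hat = k / n                                 # closed-form MLE of p

def neg_loglik_in_odds(omega):                # likelihood reparameterized by the odds
    p = omega / (1.0 + omega)
    return -(k * np.log(p) + (n - k) * np.log(1 - p))

odds_hat = minimize_scalar(neg_loglik_in_odds, bounds=(1e-6, 100), method="bounded").x
print(odds_hat, p_hat / (1 - p_hat))          # essentially equal
```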
7198-422: Is the index of the discrete probability mass corresponding to observation x {\textstyle x} , because maximizing the probability mass (or probability) at x {\textstyle x} amounts to maximizing the likelihood of the specific observation. The fact that the likelihood function can be defined in a way that includes contributions that are not commensurate (the density and
7320-505: Is the posterior probability of θ {\textstyle \theta } given the data x {\textstyle x} . Consider a simple statistical model of a coin flip: a single parameter p H {\textstyle p_{\text{H}}} that expresses the "fairness" of the coin. The parameter is the probability that a coin lands heads up ("H") when tossed. p H {\textstyle p_{\text{H}}} can take on any value within
is the probability density function, it follows that {\displaystyle \mathop {\operatorname {arg\,max} } _{\theta }{\mathcal {L}}(\theta \mid x\in [x_{j},x_{j}+h])=\mathop {\operatorname {arg\,max} } _{\theta }{\frac {1}{h}}\int _{x_{j}}^{x_{j}+h}f(x\mid \theta )\,dx.} The first fundamental theorem of calculus provides that {\displaystyle \lim _{h\to 0^{+}}{\frac {1}{h}}\int _{x_{j}}^{x_{j}+h}f(x\mid \theta )\,dx=f(x_{j}\mid \theta ).} Then
#17327879723637564-547: Is the probability of the data averaged over all parameters. Since the denominator is independent of θ , the Bayesian estimator is obtained by maximizing f ( x 1 , x 2 , … , x n ∣ θ ) P ( θ ) {\displaystyle f(x_{1},x_{2},\ldots ,x_{n}\mid \theta )\operatorname {\mathbb {P} } (\theta )} with respect to θ . If we further assume that
7686-661: Is the probability that a particular outcome x {\textstyle x} is observed when the true value of the parameter is θ {\textstyle \theta } , equivalent to the probability mass on x {\textstyle x} ; it is not a probability density over the parameter θ {\textstyle \theta } . The likelihood, L ( θ ∣ x ) {\textstyle {\mathcal {L}}(\theta \mid x)} , should not be confused with P ( θ ∣ x ) {\textstyle P(\theta \mid x)} , which
7808-440: Is this density interpreted as a function of the parameter, rather than the random variable. Thus, we can construct a likelihood function for any distribution, whether discrete, continuous, a mixture, or otherwise. (Likelihoods are comparable, e.g. for parameter estimation, only if they are Radon–Nikodym derivatives with respect to the same dominating measure.) The above discussion of the likelihood for discrete random variables uses
7930-453: Is usually defined differently for discrete and continuous probability distributions (a more general definition is discussed below). Given a probability density or mass function x ↦ f ( x ∣ θ ) , {\displaystyle x\mapsto f(x\mid \theta ),} where x {\textstyle x} is a realization of the random variable X {\textstyle X} ,
8052-485: Is viewed as a function of x {\textstyle x} with θ {\textstyle \theta } fixed, it is a probability density function, and when viewed as a function of θ {\textstyle \theta } with x {\textstyle x} fixed, it is a likelihood function. In the frequentist paradigm , the notation f ( x ∣ θ ) {\textstyle f(x\mid \theta )}
8174-452: The Cramér–Rao bound . Specifically, where I {\displaystyle ~{\mathcal {I}}~} is the Fisher information matrix : In particular, it means that the bias of the maximum likelihood estimator is equal to zero up to the order 1 / √ n . However, when we consider the higher-order terms in the expansion of
8296-431: The counting measure , under which the probability density at any outcome equals the probability of that outcome. The above can be extended in a simple way to allow consideration of distributions which contain both discrete and continuous components. Suppose that the distribution consists of a number of discrete probability masses p k ( θ ) {\textstyle p_{k}(\theta )} and
the matrix of second partials {\displaystyle \mathbf {H} (\theta )\equiv \left[\,{\frac {\partial ^{2}L}{\,\partial \theta _{i}\,\partial \theta _{j}\,}}\,\right]_{i,j=1,1}^{n_{\mathrm {i} },n_{\mathrm {j} }}\;}
8540-534: The maximum a posteriori estimate is the parameter θ that maximizes the probability of θ given the data, given by Bayes' theorem: where P ( θ ) {\displaystyle \operatorname {\mathbb {P} } (\theta )} is the prior distribution for the parameter θ and where P ( x 1 , x 2 , … , x n ) {\displaystyle \operatorname {\mathbb {P} } (x_{1},x_{2},\ldots ,x_{n})}
8662-407: The maximum likelihood estimate for the parameter θ is θ ^ {\textstyle {\hat {\theta }}} . Relative plausibilities of other θ values may be found by comparing the likelihoods of those other values with the likelihood of θ ^ {\textstyle {\hat {\theta }}} . The relative likelihood of θ
8784-426: The outcome X = x {\textstyle X=x} ). Again, L {\textstyle {\mathcal {L}}} is not a probability density or mass function over θ {\textstyle \theta } , despite being a function of θ {\textstyle \theta } given the observation X = x {\textstyle X=x} . The use of
the posterior odds of two alternatives, {\displaystyle A_{1}} and {\displaystyle A_{2}}, given an event {\displaystyle B}, is the prior odds, times the likelihood ratio. As an equation: {\displaystyle O(A_{1}:A_{2}\mid B)=O(A_{1}:A_{2})\cdot \Lambda (A_{1}:A_{2}\mid B).}
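A small worked example of the odds form (pure Python; the numbers are illustrative, in the spirit of the diagnostic-testing use of likelihood ratios mentioned elsewhere in the text):

```python
# Posterior odds = prior odds * likelihood ratio.
prior_prob = 0.01                              # P(A1): e.g. prevalence of a condition
p_b_given_a1 = 0.90                            # P(B | A1): sensitivity
p_b_given_a2 = 0.05                            # P(B | A2): false-positive rate

prior_odds = prior_prob / (1 - prior_prob)
likelihood_ratio = p_b_given_a1 / p_b_given_a2
posterior_odds = prior_odds * likelihood_ratio
posterior_prob = posterior_odds / (1 + posterior_odds)
print(likelihood_ratio, posterior_odds, posterior_prob)   # LR = 18, posterior ~ 0.154
```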
9028-404: The probability density in specifying the likelihood function above is justified as follows. Given an observation x j {\textstyle x_{j}} , the likelihood for the interval [ x j , x j + h ] {\textstyle [x_{j},x_{j}+h]} , where h > 0 {\textstyle h>0} is a constant,
9150-405: The random variable that (presumably) generated the observations. When evaluated on the actual data points, it becomes a function solely of the model parameters. In maximum likelihood estimation , the argument that maximizes the likelihood function serves as a point estimate for the unknown parameter, while the Fisher information (often approximated by the likelihood's Hessian matrix at
9272-414: The Bayesian estimator coincides with the maximum likelihood estimator for a uniform prior distribution P ( θ ) {\displaystyle \operatorname {\mathbb {P} } (\theta )} . In many practical applications in machine learning , maximum-likelihood estimation is used as the model for parameter estimation. The Bayesian Decision theory is about designing
9394-460: The conditions outlined below, the maximum likelihood estimator is consistent . The consistency means that if the data were generated by f ( ⋅ ; θ 0 ) {\displaystyle f(\cdot \,;\theta _{0})} and we have a sufficiently large number of observations n , then it is possible to find the value of θ 0 with arbitrary precision. In mathematical terms this means that as n goes to infinity
the corresponding likelihood. The result of such calculations is displayed in Figure 1. The integral of {\textstyle {\mathcal {L}}} over [0, 1] is 1/3; likelihoods need not integrate or sum to one over the parameter space. Let {\textstyle X} be a random variable following an absolutely continuous probability distribution with density function {\textstyle f} (a function of {\textstyle x}) which depends on
9638-430: The data are independent and identically distributed , then we have this being the sample analogue of the expected log-likelihood ℓ ( θ ) = E [ ln f ( x i ∣ θ ) ] {\displaystyle \ell (\theta )=\operatorname {\mathbb {E} } [\,\ln f(x_{i}\mid \theta )\,]} , where this expectation
9760-414: The data were generated by f ( ⋅ ; θ 0 ) , {\displaystyle ~f(\cdot \,;\theta _{0})~,} then under certain conditions, it can also be shown that the maximum likelihood estimator converges in distribution to a normal distribution. It is √ n -consistent and asymptotically efficient, meaning that it reaches
9882-499: The density component, the likelihood function for an observation from the continuous component can be dealt with in the manner shown above. For an observation from the discrete component, the likelihood function for an observation from the discrete component is simply L ( θ ∣ x ) = p k ( θ ) , {\displaystyle {\mathcal {L}}(\theta \mid x)=p_{k}(\theta ),} where k {\textstyle k}
the distribution of this estimator, it turns out that {\textstyle {\widehat {\theta \,}}_{\text{mle}}} has bias of order 1⁄n. This bias is equal to (componentwise) where {\displaystyle {\mathcal {I}}^{jk}} (with superscripts) denotes the (j,k)-th component of the inverse Fisher information matrix {\displaystyle {\mathcal {I}}^{-1}}, and Using these formulae it is possible to estimate the second-order bias of the estimator and correct for it.
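The classic example of an O(1/n) bias is the MLE of the normal variance, whose expectation is (n-1)/n times the true variance. A short simulation (NumPy assumed; sample size and variance are illustrative) makes the bias visible:

```python
# The MLE of the normal variance, (1/n) * sum (x - xbar)^2, is biased downward.
import numpy as np

rng = np.random.default_rng(5)
n, sigma2, reps = 10, 4.0, 200_000
samples = rng.normal(scale=np.sqrt(sigma2), size=(reps, n))
var_mle = samples.var(axis=1, ddof=0)          # MLE of the variance in each replication
print(var_mle.mean(), (n - 1) / n * sigma2)    # both close to 3.6, not the true 4.0
```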
10126-445: The domain of the function f {\textstyle f} . For a multivariable function, as shown here , the differentiability of it is something more complex than the existence of the partial derivatives of it. A function f : U → R {\displaystyle f:U\to \mathbb {R} } , defined on an open set U ⊂ R {\textstyle U\subset \mathbb {R} } ,
10248-548: The equivariance of the maximum likelihood estimator, the properties of the MLE apply to the restricted estimates also. For instance, in a multivariate normal distribution the covariance matrix Σ {\displaystyle \,\Sigma \,} must be positive-definite ; this restriction can be imposed by replacing Σ = Γ T Γ , {\displaystyle \;\Sigma =\Gamma ^{\mathsf {T}}\Gamma \;,} where Γ {\displaystyle \Gamma }
10370-400: The estimation process. The parameter space can be expressed as where h ( θ ) = [ h 1 ( θ ) , h 2 ( θ ) , … , h r ( θ ) ] {\displaystyle \;h(\theta )=\left[h_{1}(\theta ),h_{2}(\theta ),\ldots ,h_{r}(\theta )\right]\;} is
10492-565: The estimator θ ^ {\displaystyle {\widehat {\theta \,}}} converges in probability to its true value: Under slightly stronger conditions, the estimator converges almost surely (or strongly ): In practical applications, data is never generated by f ( ⋅ ; θ 0 ) {\displaystyle f(\cdot \,;\theta _{0})} . Rather, f ( ⋅ ; θ 0 ) {\displaystyle f(\cdot \,;\theta _{0})}
the existence of a Taylor expansion. Second, for almost all {\textstyle x} and for every {\textstyle \,\theta \in \Theta \,} it must be that {\displaystyle \left|{\frac {\partial f}{\partial \theta _{r}}}\right|<F_{r}(x)\,,\quad \left|{\frac {\partial ^{2}f}{\partial \theta _{r}\,\partial \theta _{s}}}\right|<F_{rs}(x)\,,\quad \left|{\frac {\partial ^{3}f}{\partial \theta _{r}\,\partial \theta _{s}\,\partial \theta _{t}}}\right|<H_{rst}(x)} where {\textstyle H}
10736-569: The existence of a function that is differentiable but not continuously differentiable (i.e., the derivative is not a continuous function). Nevertheless, Darboux's theorem implies that the derivative of any function satisfies the conclusion of the intermediate value theorem . Similarly to how continuous functions are said to be of class C 0 , {\displaystyle C^{0},} continuously differentiable functions are sometimes said to be of class C 1 {\displaystyle C^{1}} . A function
10858-462: The existence of a global maximum of the likelihood function is of the utmost importance. By the extreme value theorem , it suffices that the likelihood function is continuous on a compact parameter space for the maximum likelihood estimator to exist. While the continuity assumption is usually met, the compactness assumption about the parameter space is often not, as the bounds of the true parameter values might be unknown. In that case, concavity of
10980-421: The existence of the partial derivatives (or even of all the directional derivatives ) does not guarantee that a function is differentiable at a point. For example, the function f : R → R defined by is not differentiable at (0, 0) , but all of the partial derivatives and directional derivatives exist at this point. For a continuous example, the function is not differentiable at (0, 0) , but again all of
11102-413: The fixed denominator L ( θ ^ ) {\textstyle {\mathcal {L}}({\hat {\theta }})} . This corresponds to standardizing the likelihood to have a maximum of 1. A likelihood region is the set of all values of θ whose relative likelihood is greater than or equal to a given threshold. In terms of percentages, a p % likelihood region for θ
11224-422: The function is smooth or equivalently, of class C ∞ . {\displaystyle C^{\infty }.} A function of several real variables f : R → R is said to be differentiable at a point x 0 if there exists a linear map J : R → R such that If a function is differentiable at x 0 , then all of the partial derivatives exist at x 0 , and
11346-821: The graph of f has a non-vertical tangent line at the point ( x 0 , f ( x 0 )) . f is said to be differentiable on U if it is differentiable at every point of U . f is said to be continuously differentiable if its derivative is also a continuous function over the domain of the function f {\textstyle f} . Generally speaking, f is said to be of class C k {\displaystyle C^{k}} if its first k {\displaystyle k} derivatives f ′ ( x ) , f ′ ′ ( x ) , … , f ( k ) ( x ) {\textstyle f^{\prime }(x),f^{\prime \prime }(x),\ldots ,f^{(k)}(x)} exist and are continuous over
11468-508: The joint density at the observed data sample y = ( y 1 , y 2 , … , y n ) {\displaystyle \;\mathbf {y} =(y_{1},y_{2},\ldots ,y_{n})\;} gives a real-valued function, which is called the likelihood function . For independent and identically distributed random variables , f n ( y ; θ ) {\displaystyle f_{n}(\mathbf {y} ;\theta )} will be
11590-419: The likelihood function L n {\displaystyle \,{\mathcal {L}}_{n}\,} is called the maximum likelihood estimate. Further, if the function θ ^ n : R n → Θ {\displaystyle \;{\hat {\theta }}_{n}:\mathbb {R} ^{n}\to \Theta \;} so defined is measurable , then it
11712-405: The likelihood function approaches a constant on the boundary of the parameter space, ∂ Θ , {\textstyle \;\partial \Theta \;,} i.e., lim θ → ∂ Θ L ( θ ) = 0 , {\displaystyle \lim _{\theta \to \partial \Theta }L(\theta )=0\;,} which may include
the likelihood function in order to prove asymptotic normality of the posterior probability, and therefore to justify a Laplace approximation of the posterior in large samples. A likelihood ratio is the ratio of any two specified likelihoods, frequently written as: {\displaystyle \Lambda (\theta _{1}:\theta _{2}\mid x)={\frac {{\mathcal {L}}(\theta _{1}\mid x)}{{\mathcal {L}}(\theta _{2}\mid x)}}.} The likelihood ratio
11956-430: The likelihood function is θ ↦ f ( x ∣ θ ) , {\displaystyle \theta \mapsto f(x\mid \theta ),} often written L ( θ ∣ x ) . {\displaystyle {\mathcal {L}}(\theta \mid x).} In other words, when f ( x ∣ θ ) {\textstyle f(x\mid \theta )}
12078-411: The likelihood function may increase without ever reaching a supremum value. In practice, it is often convenient to work with the natural logarithm of the likelihood function, called the log-likelihood : Since the logarithm is a monotonic function , the maximum of ℓ ( θ ; y ) {\displaystyle \;\ell (\theta \,;\mathbf {y} )\;} occurs at
12200-516: The likelihood function plays a key role. More specifically, if the likelihood function is twice continuously differentiable on the k -dimensional parameter space Θ {\textstyle \Theta } assumed to be an open connected subset of R k , {\textstyle \mathbb {R} ^{k}\,,} there exists a unique maximum θ ^ ∈ Θ {\textstyle {\hat {\theta }}\in \Theta } if
the likelihood of observing "HH" assuming {\textstyle p_{\text{H}}=0.5} is {\displaystyle {\mathcal {L}}(p_{\text{H}}=0.5\mid {\text{HH}})=0.25.} This is not the same as saying that {\textstyle P(p_{\text{H}}=0.5\mid HH)=0.25},
12444-499: The linear map J is given by the Jacobian matrix , an n × m matrix in this case. A similar formulation of the higher-dimensional derivative is provided by the fundamental increment lemma found in single-variable calculus. If all the partial derivatives of a function exist in a neighborhood of a point x 0 and are continuous at the point x 0 , then the function is differentiable at that point x 0 . However,
12566-421: The location of the anomaly. Most functions that occur in practice have derivatives at all points or at almost every point. However, a result of Stefan Banach states that the set of functions that have a derivative at some point is a meagre set in the space of all continuous functions. Informally, this means that differentiable functions are very atypical among continuous functions. The first known example of
12688-498: The maximum of the likelihood function subject to the constraint h ( θ ) = 0 . {\displaystyle ~h(\theta )=0~.} Theoretically, the most natural approach to this constrained optimization problem is the method of substitution, that is "filling out" the restrictions h 1 , h 2 , … , h r {\displaystyle \;h_{1},h_{2},\ldots ,h_{r}\;} to
12810-424: The maximum) gives an indication of the estimate's precision . In contrast, in Bayesian statistics , the estimate of interest is the converse of the likelihood, the so-called posterior probability of the parameter given the observed data, which is calculated via Bayes' rule . The likelihood function, parameterized by a (possibly multivariate) parameter θ {\textstyle \theta } ,
12932-412: The model. If this condition did not hold, there would be some value θ 1 such that θ 0 and θ 1 generate an identical distribution of the observable data. Then we would not be able to distinguish between these two parameters even with an infinite amount of data—these parameters would have been observationally equivalent . The identification condition establishes that the log-likelihood has
13054-547: The non-i.i.d. case, the uniform convergence in probability can be checked by showing that the sequence ℓ ^ ( θ ∣ x ) {\displaystyle {\widehat {\ell \,}}(\theta \mid x)} is stochastically equicontinuous . If one wants to demonstrate that the ML estimator θ ^ {\displaystyle {\widehat {\theta \,}}} converges to θ 0 almost surely , then
13176-430: The occurrence of a maximum (or a minimum) are known as the likelihood equations. For some models, these equations can be explicitly solved for θ ^ , {\displaystyle \,{\widehat {\theta \,}}\,,} but in general no closed-form solution to the maximization problem is known or available, and an MLE can only be found via numerical optimization . Another problem
13298-420: The partial derivatives and directional derivatives exist. In complex analysis , complex-differentiability is defined using the same definition as single-variable real functions. This is allowed by the possibility of dividing complex numbers . So, a function f : C → C {\textstyle f:\mathbb {C} \to \mathbb {C} } is said to be differentiable at x =
13420-421: The perspective of Bayesian inference , MLE is generally equivalent to maximum a posteriori (MAP) estimation with a prior distribution that is uniform in the region of interest. In frequentist inference , MLE is a special case of an extremum estimator , with the objective function being the likelihood. We model a set of observations as a random sample from an unknown joint probability distribution which
13542-428: The points at infinity if Θ {\textstyle \,\Theta \,} is unbounded. Mäkeläinen and co-authors prove this result using Morse theory while informally appealing to a mountain pass property. Mascarenhas restates their proof using the mountain pass theorem . In the proofs of consistency and asymptotic normality of the maximum likelihood estimator, additional assumptions are made about
the prior {\displaystyle \operatorname {\mathbb {P} } (\theta )} is a uniform distribution, the Bayesian estimator is obtained by maximizing the likelihood function {\displaystyle f(x_{1},x_{2},\ldots ,x_{n}\mid \theta )}. Thus the Bayesian estimator coincides with the maximum likelihood estimator for a uniform prior.
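A grid-based sketch of this coincidence (NumPy assumed; Bernoulli data with a flat prior on [0, 1] are illustrative):

```python
# With a uniform (flat) prior, the posterior mode (MAP) equals the MLE.
import numpy as np

k, n = 27, 40                                   # successes, trials
p_grid = np.linspace(0.001, 0.999, 9999)
log_lik = k * np.log(p_grid) + (n - k) * np.log(1 - p_grid)
log_prior = np.zeros_like(p_grid)               # uniform prior: constant log-density
log_post = log_lik + log_prior                  # up to an additive normalizing constant

print(p_grid[np.argmax(log_lik)])               # MLE: approximately k / n = 0.675
print(p_grid[np.argmax(log_post)])              # MAP: the same grid point
```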
the probability densities that form the basis of a particular likelihood function. These conditions were first established by Chanda. In particular, for almost all {\textstyle x}, and for all {\textstyle \,\theta \in \Theta \,,} {\displaystyle {\frac {\partial \log f}{\partial \theta _{r}}}\,,\quad {\frac {\partial ^{2}\log f}{\partial \theta _{r}\partial \theta _{s}}}\,,\quad {\frac {\partial ^{3}\log f}{\partial \theta _{r}\,\partial \theta _{s}\,\partial \theta _{t}}}\,} exist for all {\textstyle \,r,s,t=1,2,\ldots ,k\,} in order to ensure
13908-478: The probability density at x j {\textstyle x_{j}} amounts to maximizing the likelihood of the specific observation x j {\textstyle x_{j}} . In measure-theoretic probability theory , the density function is defined as the Radon–Nikodym derivative of the probability distribution relative to a common dominating measure. The likelihood function
14030-624: The probability mass) arises from the way in which the likelihood function is defined up to a constant of proportionality, where this "constant" can change with the observation x {\textstyle x} , but not with the parameter θ {\textstyle \theta } . In the context of parameter estimation, the likelihood function is usually assumed to obey certain conditions, known as regularity conditions. These conditions are assumed in various proofs involving likelihood functions, and need to be verified in each particular application. For maximum likelihood estimation,
the probability of two heads on two flips is {\displaystyle P({\text{HH}}\mid p_{\text{H}}=0.3)=0.3^{2}=0.09.} Hence {\displaystyle {\mathcal {L}}(p_{\text{H}}=0.3\mid {\text{HH}})=0.09.} More generally, for each value of {\textstyle p_{\text{H}}}, we can calculate the corresponding likelihood.
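A short sketch of that calculation (NumPy assumed) reproduces the quoted values and the fact, noted elsewhere in the text, that the likelihood integrates to 1/3 rather than 1 over the parameter space:

```python
# Likelihood of two heads as a function of p_H: L(p_H | HH) = p_H ** 2.
import numpy as np

p = np.linspace(0.0, 1.0, 100_001)
likelihood = p ** 2

print(0.5 ** 2, 0.3 ** 2)       # 0.25 and 0.09, the values quoted in the text
print(likelihood.mean())        # ~ 1/3: a likelihood need not integrate to one
```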
14274-456: The probability that θ {\textstyle \theta } is the truth, given the observed sample X = x {\textstyle X=x} . Such an interpretation is a common error, with potentially disastrous consequences (see prosecutor's fallacy ). Let X {\textstyle X} be a discrete random variable with probability mass function p {\textstyle p} depending on
14396-597: The product of univariate density functions : The goal of maximum likelihood estimation is to find the values of the model parameters that maximize the likelihood function over the parameter space, that is Intuitively, this selects the parameter values that make the observed data most probable. The specific value θ ^ = θ ^ n ( y ) ∈ Θ {\displaystyle ~{\hat {\theta }}={\hat {\theta }}_{n}(\mathbf {y} )\in \Theta ~} that maximizes
the range 0.0 to 1.0. For a perfectly fair coin, {\textstyle p_{\text{H}}=0.5}. Imagine flipping a fair coin twice, and observing two heads in two tosses ("HH"). Assuming that each successive coin flip is i.i.d., then the probability of observing HH is {\displaystyle P({\text{HH}}\mid p_{\text{H}}=0.5)=0.5^{2}=0.25.} Equivalently,
14640-454: The same value of θ {\displaystyle \theta } as does the maximum of L n . {\displaystyle \,{\mathcal {L}}_{n}~.} If ℓ ( θ ; y ) {\displaystyle \ell (\theta \,;\mathbf {y} )} is differentiable in Θ , {\displaystyle \,\Theta \,,} sufficient conditions for
14762-506: The so-called profile likelihood : The MLE is also equivariant with respect to certain transformations of the data. If y = g ( x ) {\displaystyle y=g(x)} where g {\displaystyle g} is one to one and does not depend on the parameters to be estimated, then the density functions satisfy and hence the likelihood functions for X {\displaystyle X} and Y {\displaystyle Y} differ only by
14884-413: The terms of the order 1 / n . It is possible to continue this process, that is to derive the third-order bias-correction term, and so on. However, the maximum likelihood estimator is not third-order efficient. A maximum likelihood estimator coincides with the most probable Bayesian estimator given a uniform prior distribution on the parameters . Indeed,