In the theory of probability, the Glivenko–Cantelli theorem (sometimes referred to as the Fundamental Theorem of Statistics), named after Valery Ivanovich Glivenko and Francesco Paolo Cantelli, describes the asymptotic behaviour of the empirical distribution function as the number of independent and identically distributed observations grows. Specifically, the empirical distribution function converges uniformly to the true distribution function almost surely.
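The uniform convergence stated above can be illustrated numerically. The following is a minimal sketch, assuming Uniform(0, 1) observations so that the true distribution function is F(x) = x; the helper name `sup_distance_uniform` and the sample sizes are illustrative choices, not part of the theorem.

```python
import random

def sup_distance_uniform(sample):
    """Return sup_x |F_n(x) - F(x)| when F is the Uniform(0, 1) CDF, F(x) = x."""
    xs = sorted(sample)
    n = len(xs)
    # F_n jumps by 1/n at each ordered observation, so the supremum is
    # attained at an observation, either just before or right at the jump.
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

rng = random.Random(0)  # fixed seed so the run is reproducible
distances = {}
for n in (100, 10_000):
    distances[n] = sup_distance_uniform([rng.random() for _ in range(n)])
    print(n, round(distances[n], 4))
```

The printed distance should typically shrink as n grows, which is what the theorem guarantees in the limit with probability 1.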
The uniform convergence of more general empirical measures becomes an important property of the Glivenko–Cantelli classes of functions or sets. The Glivenko–Cantelli classes arise in Vapnik–Chervonenkis theory, with applications to machine learning. Applications can be found in econometrics making use of M-estimators. Assume that $X_1, X_2, \dots$ are independent and identically distributed random variables in $\mathbb{R}$ with common cumulative distribution function $F(x)$. The empirical distribution function for $X_1, \dots, X_n$
a fair coin toss is a Bernoulli trial. When a fair coin is flipped once, the theoretical probability that the outcome will be heads is equal to 1/2. Therefore, according to the law of large numbers, the proportion of heads in a "large" number of coin flips "should be" roughly 1/2. In particular, the proportion of heads after n flips will almost surely converge to 1/2 as n approaches infinity. Although
a case of continuous random variable $X$. Fix $-\infty = x_0 < x_1 < \cdots < x_{m-1} < x_m = \infty$ such that $F(x_j) - F(x_{j-1}) = \frac{1}{m}$ for $j = 1, \dots, m$. Now for all $x \in \mathbb{R}$ there exists $j \in \{1, \dots, m\}$ such that $x \in [x_{j-1}, x_j]$. Since $F_n$ and $F$ are nondecreasing, $F_n(x) - F(x) \le F_n(x_j) - F(x_{j-1}) = F_n(x_j) - F(x_j) + \frac{1}{m}$ and $F_n(x) - F(x) \ge F_n(x_{j-1}) - F(x_j) = F_n(x_{j-1}) - F(x_{j-1}) - \frac{1}{m}$. Therefore, $\|F_n - F\|_\infty = \sup_{x \in \mathbb{R}} |F_n(x) - F(x)| \le \max_{j \in \{1, \dots, m\}} |F_n(x_j) - F(x_j)| + \frac{1}{m}$. Since $\max_{j \in \{1, \dots, m\}} |F_n(x_j) - F(x_j)| \to 0$ a.s. by the strong law of large numbers, we can guarantee that for any positive $\varepsilon$ and any integer $m$ such that $1/m < \varepsilon$, we can find $N$ such that for all $n \ge N$, we have $\max_{j \in \{1, \dots, m\}} |F_n(x_j) - F(x_j)| \le \varepsilon - 1/m$ a.s. Combined with
a class of sets $\mathcal{C}$ to obtain an empirical measure indexed by sets $C \in \mathcal{C}$: $P_n(C) = \frac{1}{n} \sum_{i=1}^{n} I_C(X_i)$, where $I_C(x)$ is the indicator function of each set $C$. Further generalization
a collection of independent and identically distributed (iid) samples from a random variable with finite mean, the sample mean converges in probability to the expected value: $\overline{X}_n \overset{P}{\to} \mu$ as $n \to \infty$. That is, for any positive number ε, $\lim_{n \to \infty} \Pr\left( |\overline{X}_n - \mu| < \varepsilon \right) = 1$. Interpreting this result,
a law of large numbers. A special form of the LLN (for a binary random variable) was first proved by Jacob Bernoulli. It took him over 20 years to develop a sufficiently rigorous mathematical proof, which was published in his Ars Conjectandi (The Art of Conjecturing) in 1713. He named this his "Golden Theorem" but it became generally known as "Bernoulli's theorem". This should not be confused with Bernoulli's principle, named after Jacob Bernoulli's nephew Daniel Bernoulli. In 1837, S. D. Poisson further described it under
a random variable that does not have a finite variance under some other weaker assumption, and Khinchin showed in 1929 that if the series consists of independent identically distributed random variables, it suffices that the expected value exists for the weak law of large numbers to be true. These further studies have given rise to two prominent forms of the LLN. One is called the "weak" law and
a related distribution function $F$ by means of the empirical measure or empirical distribution function, respectively. These are uniformly good estimates under certain conditions. Theorems in the area of empirical processes provide rates of this convergence. Let $X_1, X_2, \dots$ be
a sequence of independent identically distributed random variables with values in the state space S with probability distribution P. To generalize this notion further, observe that the empirical measure $P_n$ maps measurable functions $f : S \to \mathbb{R}$ to their empirical mean, $f \mapsto P_n f = \int_S f \, dP_n = \frac{1}{n} \sum_{i=1}^{n} f(X_i)$. In particular,
a sequence of random variables which converge to $F(x)$ almost surely by the strong law of large numbers. Glivenko and Cantelli strengthened this result by proving uniform convergence of $F_n$ to $F$. Theorem: This theorem originates with Valery Glivenko and Francesco Cantelli, in 1933. For simplicity, consider
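a set $\mathcal{S}$ with a sigma algebra of Borel subsets A and a probability measure $\mathbb{P}$. For a class of subsets, $\mathcal{C} \subset \{C : C \text{ is a measurable subset of } \mathcal{S}\}$, and a class of functions, $\mathcal{F} \subset \{f : \mathcal{S} \to \mathbb{R},\ f \text{ is measurable}\}$, define random variables $\|\mathbb{P}_n - \mathbb{P}\|_{\mathcal{C}} = \sup_{C \in \mathcal{C}} |\mathbb{P}_n(C) - \mathbb{P}(C)|$ and $\|\mathbb{P}_n - \mathbb{P}\|_{\mathcal{F}} = \sup_{f \in \mathcal{F}} |\mathbb{P}_n f - \mathbb{E} f|$, where $\mathbb{P}_n(C)$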
is a random measure arising from a particular realization of a (usually finite) sequence of random variables. The precise definition is found below. Empirical measures are relevant to mathematical statistics. The motivation for studying empirical measures is that it is often impossible to know the true underlying probability measure $P$. We collect observations $X_1, X_2, \dots, X_n$ and compute relative frequencies. We can estimate $P$, or
is central to statistical learning of binary classification tasks. Theorem (Vapnik and Chervonenkis, 1968) There exist a variety of consistency conditions for the equivalence of uniform Glivenko–Cantelli and Vapnik–Chervonenkis classes. In particular, either of the following conditions for a class $\mathcal{C}$ suffices: Empirical measure In probability theory, an empirical measure
is defined by $F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I_{[X_i, \infty)}(x) = \frac{1}{n} \bigl| \{ i : X_i \le x,\ 1 \le i \le n \} \bigr|$, where $I_C$ is the indicator function of the set $C$. For every (fixed) $x$, $F_n(x)$ is
is difficult or impossible to use other approaches. The average of the results obtained from a large number of trials may fail to converge in some cases. For instance, the average of n results taken from the Cauchy distribution or some Pareto distributions (α < 1) will not converge as n becomes larger; the reason is heavy tails. The Cauchy distribution and the Pareto distribution represent two cases:
is given by $F_n(x) = P_n((-\infty, x]) = P_n I_{(-\infty, x]}$. In this case, empirical measures are indexed by a class $\mathcal{C} = \{(-\infty, x] : x \in \mathbb{R}\}$. It has been shown that $\mathcal{C}$ is a uniform Glivenko–Cantelli class; in particular, $\sup_F \|F_n - F\|_\infty \to 0$ with probability 1. Law of large numbers In probability theory,
is known as Kolmogorov's strong law; see e.g. Sen & Singer (1993, Theorem 2.3.10). The weak law states that for a specified large n, the average $\overline{X}_n$ is likely to be near μ. Thus, it leaves open the possibility that $|\overline{X}_n - \mu| > \varepsilon$ happens an infinite number of times, although at infrequent intervals. (Not necessarily $|\overline{X}_n - \mu| \neq 0$ for all n.) The strong law shows that this almost surely will not occur. It does not imply that with probability 1, we have that for any ε > 0
is no principle that a small number of observations will coincide with the expected value or that a streak of one value will immediately be "balanced" by the others (see the gambler's fallacy). The LLN only applies to the average of the results obtained from repeated trials and claims that this average converges to the expected value; it does not claim that the sum of n results gets close to
is that the probability that, as the number of trials n goes to infinity, the average of the observations converges to the expected value, is equal to one. The modern proof of the strong law is more complex than that of the weak law, and relies on passing to an appropriate subsequence. The strong law of large numbers can itself be seen as a special case of the pointwise ergodic theorem. This view justifies
is the empirical measure, $\mathbb{P}_n f$ is the corresponding map, and $\mathbb{E} f = \int_{\mathcal{S}} f \, d\mathbb{P} = \mathbb{P} f$, assuming that it exists. Definitions: Glivenko–Cantelli classes of functions (as well as their uniform and universal forms) are defined similarly, replacing all instances of $\mathcal{C}$ with $\mathcal{F}$. The weak and strong versions of
is the map induced by $P_n$ on measurable real-valued functions f, which is given by $f \mapsto P_n f = \int_S f \, dP_n = \frac{1}{n} \sum_{i=1}^{n} f(X_i)$. Then it becomes an important property of these classes whether the strong law of large numbers holds uniformly on $\mathcal{F}$ or $\mathcal{C}$. Consider
the law of large numbers (LLN) is a mathematical law that states that the average of the results obtained from a large number of independent random samples converges to the true value, if it exists. More formally, the LLN states that given a sample of independent and identically distributed values, the sample mean converges to the true mean. The LLN is important because it guarantees stable long-term results for
1196-667: The Cauchy distribution does not have an expectation, whereas the expectation of the Pareto distribution ( α <1) is infinite. One way to generate the Cauchy-distributed example is where the random numbers equal the tangent of an angle uniformly distributed between −90° and +90°. The median is zero, but the expected value does not exist, and indeed the average of n such variables have the same distribution as one such variable. It does not converge in probability toward zero (or any other value) as n goes to infinity. And if
1248-519: The above result, this further implies that ‖ F n − F ‖ ∞ ≤ ε a.s. {\textstyle \|F_{n}-F\|_{\infty }\leq \varepsilon {\text{ a.s.}}} , which is the definition of almost sure convergence. One can generalize the empirical distribution function by replacing the set ( − ∞ , x ] {\displaystyle (-\infty ,x]} by an arbitrary set C from
1300-437: The average of a set of normally distributed variables). The variance of the sum is equal to the sum of the variances, which is asymptotic to n 2 / log n {\displaystyle n^{2}/\log n} . The variance of the average is therefore asymptotic to 1 / log n {\displaystyle 1/\log n} and goes to zero. There are also examples of
1352-434: The average of the first n values goes to zero as n goes to infinity. As an example, assume that each random variable in the series follows a Gaussian distribution (normal distribution) with mean zero, but with variance equal to 2 n / log ( n + 1 ) {\displaystyle 2n/\log(n+1)} , which is not bounded. At each stage, the average will be normally distributed (as
1404-664: The average to converge almost surely on something (this can be considered another statement of the strong law), it is necessary that they have an expected value (and then of course the average will converge almost surely on that). If the summands are independent but not identically distributed, then provided that each X k has a finite second moment and ∑ k = 1 ∞ 1 k 2 Var [ X k ] < ∞ . {\displaystyle \sum _{k=1}^{\infty }{\frac {1}{k^{2}}}\operatorname {Var} [X_{k}]<\infty .} This statement
1456-424: The averages of some random events . For example, while a casino may lose money in a single spin of the roulette wheel, its earnings will tend towards a predictable percentage over a large number of spins. Any winning streak by a player will eventually be overcome by the parameters of the game. Importantly, the law applies (as the name indicates) only when a large number of observations are considered. There
1508-512: The case where X 1 , X 2 , ... is an infinite sequence of independent and identically distributed (i.i.d.) Lebesgue integrable random variables with expected value E( X 1 ) = E( X 2 ) = ... = μ , both versions of the law state that the sample average X ¯ n = 1 n ( X 1 + ⋯ + X n ) {\displaystyle {\overline {X}}_{n}={\frac {1}{n}}(X_{1}+\cdots +X_{n})} converges to
1560-414: The convergence is only weak (in probability). See differences between the weak law and the strong law . The strong law applies to independent identically distributed random variables having an expected value (like the weak law). This was proved by Kolmogorov in 1930. It can also apply in other cases. Kolmogorov also showed, in 1933, that if the variables are independent and identically distributed, then for
1612-561: The empirical measure of A is simply the empirical mean of the indicator function, P n ( A ) = P n I A . For a fixed measurable function f {\displaystyle f} , P n f {\displaystyle P_{n}f} is a random variable with mean E f {\displaystyle \mathbb {E} f} and variance 1 n E ( f − E f ) 2 {\displaystyle {\frac {1}{n}}\mathbb {E} (f-\mathbb {E} f)^{2}} . By
Glivenko–Cantelli theorem - Misplaced Pages Continue
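The mean and variance of $P_n f$ for a fixed function can be checked by simulation. The sketch below rests on illustrative assumptions that do not come from the text: observations are Uniform(0, 1) and f(x) = x², for which $\mathbb{E} f = 1/3$ and $\mathbb{E}(f - \mathbb{E} f)^2 = 1/5 - 1/9 = 4/45$. The names `empirical_mean`, `n` and `reps` are hypothetical.

```python
import random
import statistics

def empirical_mean(f, sample):
    """P_n f: the average of f over the observations."""
    return sum(f(x) for x in sample) / len(sample)

def f(x):
    return x * x  # assumed test function; E f = 1/3 under Uniform(0, 1)

rng = random.Random(1)  # fixed seed for reproducibility
n, reps = 50, 5000

# Draw many independent samples of size n and record P_n f for each one.
values = [empirical_mean(f, [rng.random() for _ in range(n)]) for _ in range(reps)]

mean_hat = statistics.fmean(values)     # compare with E f = 1/3
var_hat = statistics.pvariance(values)  # compare with (1/n) * 4/45
theory_var = (4 / 45) / n

print(round(mean_hat, 4), round(var_hat, 6), round(theory_var, 6))
```

The simulated variance should be close to, but not exactly equal to, the theoretical value, since it is itself an estimate based on a finite number of repetitions.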
1664-483: The expected difference grows, but at a slower rate than the number of flips. Another good example of the LLN is the Monte Carlo method . These methods are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. The larger the number of repetitions, the better the approximation tends to be. The reason that this method is important is mainly that, sometimes, it
1716-407: The expected value times n as n increases. Throughout its history, many mathematicians have refined this law. Today, the LLN is used in many fields including statistics, probability theory, economics, and insurance. For example, a single roll of a fair, six-sided die produces one of the numbers 1, 2, 3, 4, 5, or 6, each with equal probability . Therefore, the expected value of the average of
1768-639: The expected value: (Lebesgue integrability of X j means that the expected value E( X j ) exists according to Lebesgue integration and is finite. It does not mean that the associated probability measure is absolutely continuous with respect to Lebesgue measure .) Introductory probability texts often additionally assume identical finite variance Var ( X i ) = σ 2 {\displaystyle \operatorname {Var} (X_{i})=\sigma ^{2}} (for all i {\displaystyle i} ) and no correlation between random variables. In that case,
1820-415: The inequality | X ¯ n − μ | < ε {\displaystyle |{\overline {X}}_{n}-\mu |<\varepsilon } holds for all large enough n , since the convergence is not necessarily uniform on the set where it holds. The strong law does not hold in the following cases, but the weak law does. There are extensions of
1872-411: The intuitive interpretation of the expected value (for Lebesgue integration only) of a random variable when sampled repeatedly as the "long-term average". Law 3 is called the strong law because random variables which converge strongly (almost surely) are guaranteed to converge weakly (in probability). However the weak law is known to hold in certain conditions where the strong law does not hold and then
1924-406: The law of large numbers that the empirical probability of success in a series of Bernoulli trials will converge to the theoretical probability. For a Bernoulli random variable , the expected value is the theoretical probability of success, and the average of n such variables (assuming they are independent and identically distributed (i.i.d.) ) is precisely the relative frequency. For example,
1976-411: The law of large numbers to collections of estimators, where the convergence is uniform over the collection; thus the name uniform law of large numbers . Suppose f ( x , θ ) is some function defined for θ ∈ Θ, and continuous in θ . Then for any fixed θ , the sequence { f ( X 1 , θ ), f ( X 2 , θ ), ...} will be a sequence of independent and identically distributed random variables, such that
2028-417: The name "la loi des grands nombres" ("the law of large numbers"). Thereafter, it was known under both names, but the "law of large numbers" is most frequently used. After Bernoulli and Poisson published their efforts, other mathematicians also contributed to refinement of the law, including Chebyshev , Markov , Borel , Cantelli , Kolmogorov and Khinchin . Markov showed that the law can apply to
2080-399: The other the "strong" law, in reference to two different modes of convergence of the cumulative sample means to the expected value; in particular, as explained below, the strong form implies the weak. There are two different versions of the law of large numbers that are described below. They are called the strong law of large numbers and the weak law of large numbers . Stated for
2132-573: The proofs. This assumption of finite variance is not necessary . Large or infinite variance will make the convergence slower, but the LLN holds anyway. Mutual independence of the random variables can be replaced by pairwise independence or exchangeability in both versions of the law. The difference between the strong and the weak version is concerned with the mode of convergence being asserted. For interpretation of these modes, see Convergence of random variables . The weak law of large numbers (also called Khinchin 's law) states that given
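The weak law, $\Pr(|\overline{X}_n - \mu| < \varepsilon) \to 1$, can be made concrete by estimating that probability for a small and a large n. This is a rough sketch under assumptions chosen for illustration: rolls of a fair six-sided die (so μ = 3.5), margin ε = 0.25, and 500 repetitions per estimate. The function name `prob_within` is hypothetical.

```python
import random

def prob_within(n, eps, reps, rng, mu=3.5):
    """Estimate Pr(|mean of n die rolls - mu| < eps) from `reps` repetitions."""
    hits = 0
    for _ in range(reps):
        mean = sum(rng.randint(1, 6) for _ in range(n)) / n
        if abs(mean - mu) < eps:
            hits += 1
    return hits / reps

rng = random.Random(2)  # fixed seed for reproducibility
p_small = prob_within(10, 0.25, 500, rng)    # few rolls: often outside the margin
p_large = prob_within(1000, 0.25, 500, rng)  # many rolls: almost always inside
print(p_small, p_large)
```

The estimate for the larger n should be much closer to 1, in line with the limit statement; the exact figures depend on the seed and on the number of repetitions.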
2184-432: The proportion of heads (and tails) approaches 1 ⁄ 2 , almost surely the absolute difference in the number of heads and tails will become large as the number of flips becomes large. That is, the probability that the absolute difference is a small number approaches zero as the number of flips becomes large. Also, almost surely the ratio of the absolute difference to the number of flips will approach zero. Intuitively,
2236-413: The rolls is: 1 + 2 + 3 + 4 + 5 + 6 6 = 3.5 {\displaystyle {\frac {1+2+3+4+5+6}{6}}=3.5} According to the law of large numbers, if a large number of six-sided dice are rolled, the average of their values (sometimes called the sample mean ) will approach 3.5, with the precision increasing as more dice are rolled. It follows from
2288-857: The sample mean of this sequence converges in probability to E[ f ( X , θ )]. This is the pointwise (in θ ) convergence. A particular example of a uniform law of large numbers states the conditions under which the convergence happens uniformly in θ . If Then E[ f ( X , θ )] is continuous in θ , and sup θ ∈ Θ ‖ 1 n ∑ i = 1 n f ( X i , θ ) − E [ f ( X , θ ) ] ‖ → P 0. {\displaystyle \sup _{\theta \in \Theta }\left\|{\frac {1}{n}}\sum _{i=1}^{n}f(X_{i},\theta )-\operatorname {E} [f(X,\theta )]\right\|{\overset {\mathrm {P} }{\rightarrow }}\ 0.} This result
2340-412: The series, keeping the expected value constant. If the variances are bounded, then the law applies, as shown by Chebyshev as early as 1867. (If the expected values change during the series, then we can simply apply the law to the average deviation from the respective expected values. The law then states that this converges in probability to zero.) In fact, Chebyshev's proof works so long as the variance of
2392-408: The strong law of large numbers , P n ( A ) converges to P ( A ) almost surely for fixed A . Similarly P n f {\displaystyle P_{n}f} converges to E f {\displaystyle \mathbb {E} f} almost surely for a fixed measurable function f {\displaystyle f} . The problem of uniform convergence of P n to P
2444-409: The trials embed a selection bias , typical in human economic/rational behaviour, the law of large numbers does not help in solving the bias. Even if the number of trials is increased the selection bias remains. The Italian mathematician Gerolamo Cardano (1501–1576) stated without proof that the accuracies of empirical statistics tend to improve with the number of trials. This was then formalized as
2496-789: The variance of the average of n random variables is Var ( X ¯ n ) = Var ( 1 n ( X 1 + ⋯ + X n ) ) = 1 n 2 Var ( X 1 + ⋯ + X n ) = n σ 2 n 2 = σ 2 n . {\displaystyle \operatorname {Var} ({\overline {X}}_{n})=\operatorname {Var} ({\tfrac {1}{n}}(X_{1}+\cdots +X_{n}))={\frac {1}{n^{2}}}\operatorname {Var} (X_{1}+\cdots +X_{n})={\frac {n\sigma ^{2}}{n^{2}}}={\frac {\sigma ^{2}}{n}}.} which can be used to shorten and simplify
2548-494: The various Glivenko-Cantelli properties often coincide under certain regularity conditions. The following definition commonly appears in such regularity conditions: Theorems The following two theorems give sufficient conditions for the weak and strong versions of the Glivenko-Cantelli property to be equivalent. Theorem ( Talagrand , 1987) Theorem ( Dudley , Giné, and Zinn, 1991) The following theorem
2600-494: The weak law applying even though the expected value does not exist. The strong law of large numbers (also called Kolmogorov 's law) states that the sample average converges almost surely to the expected value That is, Pr ( lim n → ∞ X ¯ n = μ ) = 1. {\displaystyle \Pr \!\left(\lim _{n\to \infty }{\overline {X}}_{n}=\mu \right)=1.} What this means
2652-455: The weak law states that for any nonzero margin specified ( ε ), no matter how small, with a sufficiently large sample there will be a very high probability that the average of the observations will be close to the expected value; that is, within the margin. As mentioned earlier, the weak law applies in the case of i.i.d. random variables, but it also applies in some other cases. For example, the variance may be different for each random variable in
was open until Vapnik and Chervonenkis solved it in 1968. If the class $\mathcal{C}$ (or $\mathcal{F}$) is Glivenko–Cantelli with respect to P, then $P_n$ converges to P uniformly over $c \in \mathcal{C}$ (or $f \in \mathcal{F}$). In other words, with probability 1 we have $\|P_n - P\|_{\mathcal{C}} = \sup_{c \in \mathcal{C}} |P_n(c) - P(c)| \to 0$ and $\|P_n - P\|_{\mathcal{F}} = \sup_{f \in \mathcal{F}} |P_n f - \mathbb{E} f| \to 0$. The empirical distribution function provides an example of empirical measures. For real-valued iid random variables $X_1, \dots, X_n$ it
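The relationship between the empirical measure and the empirical distribution function can be sketched in a few lines. This is only an illustration: the helper names `empirical_measure` and `ecdf` and the five sample values are made up, and sets are represented by their indicator functions.

```python
def empirical_measure(sample, indicator):
    """P_n(C): the fraction of observations for which indicator(x) is true."""
    return sum(1 for x in sample if indicator(x)) / len(sample)

def ecdf(sample, x):
    """F_n(x) = P_n((-inf, x]): the empirical measure of a half-line."""
    return empirical_measure(sample, lambda t: t <= x)

sample = [0.2, 1.5, -0.7, 3.1, 0.9]  # made-up observations

# Three of the five observations (0.2, -0.7, 0.9) are <= 1.0.
print(ecdf(sample, 1.0))

# Any other set works the same way, for example the interval [0, 2].
print(empirical_measure(sample, lambda t: 0 <= t <= 2))
```

Representing a set by its indicator keeps the code close to the formula $P_n(C) = \frac{1}{n} \sum_{i=1}^{n} I_C(X_i)$, at the cost of one pass over the sample per query.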