In statistical theory, U-statistics are a class of statistics, each defined as the average of a given function applied to all tuples of a fixed size drawn from the sample. The letter "U" stands for unbiased. In elementary statistics, U-statistics arise naturally in producing minimum-variance unbiased estimators.
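The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `u_statistic`, `kernel`, and `data` are hypothetical, the kernel is assumed to be symmetric in its arguments, and the brute-force enumeration of all subsets is only practical for small samples.

```python
from itertools import combinations
from math import comb
from statistics import mean, variance


def u_statistic(kernel, r, sample):
    """Average a symmetric kernel of r arguments over all size-r subsets of the sample."""
    n = len(sample)
    if n < r:
        raise ValueError("need at least r observations")
    # combinations() yields each unordered subset of size r exactly once.
    total = sum(kernel(*subset) for subset in combinations(sample, r))
    return total / comb(n, r)


# Illustrative data (made up for this sketch).
data = [2.0, 4.0, 4.0, 7.0, 9.0]

# Kernel f(x) = x gives the sample mean.
sample_mean = u_statistic(lambda x: x, 1, data)

# Kernel f(x1, x2) = (x1 - x2)^2 / 2 gives the sample variance (divisor n - 1).
sample_var = u_statistic(lambda x, y: (x - y) ** 2 / 2, 2, data)

# Kernel f(x1, x2) = |x1 - x2| gives the mean pairwise deviation.
mean_pairwise_dev = u_statistic(lambda x, y: abs(x - y), 2, data)

print(sample_mean, mean(data))
print(sample_var, variance(data))
print(mean_pairwise_dev)
```

Comparing the first two results with `statistics.mean` and `statistics.variance` is a quick way to check that the pairwise kernel really does reproduce the usual variance estimator with divisor n − 1.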
The theory of U-statistics allows a minimum-variance unbiased estimator to be derived from each unbiased estimator of an estimable parameter (alternatively, statistical functional) for large classes of probability distributions. An estimable parameter is a measurable function of the population's cumulative probability distribution. For example, for every probability distribution,
A $K$-valued function of $r$ $d$-dimensional variables. For each $n \geq r$ the associated U-statistic $f_n\colon (K^d)^n \to K$
A minimum-variance unbiased estimator (MVUE) or uniformly minimum-variance unbiased estimator (UMVUE) is an unbiased estimator that has lower variance than any other unbiased estimator for all possible values of the parameter. For practical statistics problems, it is important to determine the MVUE if one exists, since less-than-optimal procedures would naturally be avoided, other things being equal. This has led to substantial development of statistical theory related to
A family of densities $p_\theta$, $\theta \in \Omega$, where $\Omega$ is the parameter space. An unbiased estimator $\delta(X_1, X_2, \ldots, X_n)$ of $g(\theta)$
A few observations: this defines the basic estimator based on a given number of observations. For example, a single observation is itself an unbiased estimate of the mean and a pair of observations can be used to derive an unbiased estimate of the variance. The U-statistic based on this estimator is defined as the average (across all combinatorial selections of the given size from the full set of observations) of
A finite population, where the defining property is termed ‘inheritance on the average’. Fisher's $k$-statistics and Tukey's polykays are examples of homogeneous polynomial U-statistics (Fisher, 1929; Tukey, 1950). For a simple random sample $\varphi$ of size $n$ taken from a population of size $N$, the U-statistic has the property that the average over sample values $f_n(x_\varphi)$
Is $\operatorname{MSE}(\delta) = \operatorname{var}(\delta) + [\operatorname{bias}(\delta)]^2$, the MVUE minimizes MSE among unbiased estimators. In some cases biased estimators have lower MSE because they have a smaller variance than does any unbiased estimator; see estimator bias. Consider the data to be a single observation from an absolutely continuous distribution on $\mathbb{R}$ with density $p_\theta(x) = \frac{\theta e^{-x}}{(1+e^{-x})^{\theta+1}}$, where $\theta > 0$, and we wish to find the UMVU estimator of $g(\theta) = \frac{1}{\theta^2}$. First we recognize that
Is UMVUE if, for all $\theta \in \Omega$, $\operatorname{var}(\delta(X_1, X_2, \ldots, X_n)) \leq \operatorname{var}(\tilde{\delta}(X_1, X_2, \ldots, X_n))$ for any other unbiased estimator $\tilde{\delta}$. If an unbiased estimator of $g(\theta)$ exists, then one can prove there
Is a complete sufficient statistic for the family of densities. Then $\eta(X_1, X_2, \ldots, X_n) = \operatorname{E}(\delta(X_1, X_2, \ldots, X_n) \mid T)$ is the MVUE for $g(\theta)$. A Bayesian analog is a Bayes estimator, particularly with minimum mean square error (MMSE). An efficient estimator need not exist, but if it does and if it is unbiased, it is the MVUE. Since the mean squared error (MSE) of an estimator $\delta$
Is a function of a complete, sufficient statistic is the UMVUE. Put formally, suppose $\delta(X_1, X_2, \ldots, X_n)$ is unbiased for $g(\theta)$, and that $T$
Is an essentially unique MVUE. Using the Rao–Blackwell theorem one can also prove that determining the MVUE is simply a matter of finding a complete sufficient statistic for the family $p_\theta$, $\theta \in \Omega$, and conditioning any unbiased estimator on it. Further, by the Lehmann–Scheffé theorem, an unbiased estimator that
Is defined to be the average of the values $f(x_{i_1}, \dotsc, x_{i_r})$ over the set $I_{r,n}$ of $r$-tuples of indices from $\{1, 2, \dotsc, n\}$ with distinct entries. Formally, $f_n(x_1, \dotsc, x_n) = \frac{1}{n(n-1)\cdots(n-r+1)} \sum_{(i_1, \dotsc, i_r) \in I_{r,n}} f(x_{i_1}, \dotsc, x_{i_r})$. In particular, if $f$
Is exactly equal to the population value $f_N(x)$. Some examples: If $f(x) = x$ the U-statistic $f_n(x) = \bar{x}_n = (x_1 + \cdots + x_n)/n$
Is not the median of $n$ values. However, it is a minimum variance unbiased estimate of the expected value of the median of three values, not the median of the population. Similar estimates play a central role where the parameters of a family of probability distributions are being estimated by probability weighted moments or L-moments.

Minimum-variance unbiased estimator

In statistics
Is symmetric the above is simplified to $f_n(x_1, \dotsc, x_n) = \frac{1}{\binom{n}{r}} \sum_{(i_1, \dotsc, i_r) \in J_{r,n}} f(x_{i_1}, \dotsc, x_{i_r})$, where now $J_{r,n}$ denotes the subset of $I_{r,n}$ of increasing tuples. Each U-statistic $f_n$ is necessarily a symmetric function. U-statistics are very natural in statistical work, particularly in Hoeffding's context of independent and identically distributed random variables, or more generally for exchangeable sequences, such as in simple random sampling from
Is the sample mean. If $f(x_1, x_2) = |x_1 - x_2|$, the U-statistic is the mean pairwise deviation $f_n(x_1, \ldots, x_n) = \frac{2}{n(n-1)} \sum_{i>j} |x_i - x_j|$, defined for $n \geq 2$. If $f(x_1, x_2) = (x_1 - x_2)^2/2$,
The asymptotic normality and to the variance (in finite samples) of such quantities. The theory has been used to study more general statistics as well as stochastic processes, such as random graphs. Suppose that a problem involves independent and identically distributed random variables and that estimation of a certain parameter is required. Suppose that a simple unbiased estimate can be constructed based on only
The MVUE. Clearly $\delta(X) = \frac{T^2}{2}$ is unbiased and $T = \log(1+e^{-x})$ is complete sufficient, thus the UMVU estimator is $\eta(X) = \operatorname{E}(\delta(X) \mid T) = \operatorname{E}\left(\frac{T^2}{2} \,\Big|\, T\right) = \frac{T^2}{2} = \frac{\log(1+e^{-X})^2}{2}$. This example illustrates that an unbiased function of
The U-statistic is the sample variance $f_n(x) = \sum (x_i - \bar{x}_n)^2/(n-1)$ with divisor $n-1$, defined for $n \geq 2$. The third $k$-statistic $k_{3,n}(x) = \sum (x_i - \bar{x}_n)^3 \, n/((n-1)(n-2))$,
The basic estimator applied to the sub-samples. Pranab K. Sen (1992) provides a review of the paper by Wassily Hoeffding (1948), which introduced U-statistics and set out the theory relating to them, and in doing so Sen outlines the importance U-statistics have in statistical theory. Sen says, “The impact of Hoeffding (1948) is overwhelming at the present time and is very likely to continue in
525-473: The density can be written as Which is an exponential family with sufficient statistic T = log ( 1 + e − x ) {\displaystyle T=\log(1+e^{-x})} . In fact this is a full rank exponential family, and therefore T {\displaystyle T} is complete sufficient. See exponential family for a derivation which shows Therefore, Here we use Lehmann–Scheffé theorem to get
#1732790978054550-433: The population median is an estimable parameter. The theory of U-statistics applies to general classes of probability distributions. Many statistics originally derived for particular parametric families have been recognized as U-statistics for general distributions. In non-parametric statistics , the theory of U-statistics is used to establish for statistical procedures (such as estimators and tests) and estimators relating to
575-656: The problem of optimal estimation. While combining the constraint of unbiasedness with the desirability metric of least variance leads to good results in most practical settings—making MVUE a natural starting point for a broad range of analyses—a targeted specification may perform better for a given problem; thus, MVUE is not always the best stopping point. Consider estimation of g ( θ ) {\displaystyle g(\theta )} based on data X 1 , X 2 , … , X n {\displaystyle X_{1},X_{2},\ldots ,X_{n}} i.i.d. from some member of
600-494: The sample skewness defined for n ≥ 3 {\displaystyle n\geq 3} , is a U-statistic. The following case highlights an important point. If f ( x 1 , x 2 , x 3 ) {\displaystyle f(x_{1},x_{2},x_{3})} is the median of three values, f n ( x 1 , … , x n ) {\displaystyle f_{n}(x_{1},\ldots ,x_{n})}
625-491: The years to come.” Note that the theory of U-statistics is not limited to the case of independent and identically-distributed random variables or to scalar random-variables. The term U-statistic, due to Hoeffding (1948), is defined as follows. Let K {\displaystyle K} be either the real or complex numbers, and let f : ( K d ) r → K {\displaystyle f\colon (K^{d})^{r}\to K} be