Ordinary least squares

Article snapshot taken from Wikipedia under the Creative Commons Attribution-ShareAlike license. Give it a read and then ask your questions in the chat. We can research this topic together.

In statistics, ordinary least squares (OLS) is a type of linear least squares method for choosing the unknown parameters in a linear regression model (with fixed level-one effects of a linear function of a set of explanatory variables) by the principle of least squares: minimizing the sum of the squared differences between the observed values of the dependent variable in the input dataset and the values predicted by the linear function of the independent variables. Some sources consider OLS to be synonymous with linear regression.
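
To make the definition concrete, here is a minimal sketch in Python/NumPy that fits a line by minimizing the sum of squared residuals; the data, seed, and coefficient values are invented purely for illustration.

```python
# Minimal sketch: ordinary least squares on made-up data (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)           # explanatory variable
y = 2.0 + 0.5 * x + rng.normal(0, 1, 50)  # dependent variable with noise

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])

# OLS: choose b to minimize the sum of squared differences ||y - X b||^2.
beta_hat, rss, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print("estimated intercept and slope:", beta_hat)
```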

129-401: Geometrically, this is seen as the sum of the squared distances, parallel to the axis of the dependent variable, between each data point in the set and the corresponding point on the regression surface—the smaller the differences, the better the model fits the data. The resulting estimator can be expressed by a simple formula, especially in the case of a simple linear regression , in which there

{\displaystyle \mathrm {var} \left({\hat {A}}_{2}\right)=\mathrm {var} \left({\frac {1}{N}}\sum _{n=0}^{N-1}x[n]\right){\overset {\text{independence}}{=}}{\frac {1}{N^{2}}}\left[\sum _{n=0}^{N-1}\mathrm {var} (x[n])\right]={\frac {1}{N^{2}}}\left[N\sigma ^{2}\right]={\frac {\sigma ^{2}}{N}}} It would seem that
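
The σ²/N result above is easy to check numerically. The following sketch (with arbitrary illustrative values for A, σ, and N) simulates many realizations of the sample-mean estimator and compares its empirical variance with σ²/N.

```python
# Monte Carlo check of var(A_hat_2) = sigma^2 / N for the sample-mean estimator
# (A, sigma and N below are arbitrary illustrative values).
import numpy as np

A, sigma, N, trials = 3.0, 2.0, 25, 200_000
rng = np.random.default_rng(1)

x = A + sigma * rng.standard_normal((trials, N))  # x[n] = A + w[n]
A_hat_2 = x.mean(axis=1)                          # sample mean per trial

print("empirical variance:", A_hat_2.var())
print("theoretical sigma^2 / N:", sigma**2 / N)
```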

A discrete uniform distribution {\displaystyle 1,2,\dots ,N} with unknown maximum, the UMVU estimator for the maximum is given by {\displaystyle {\frac {k+1}{k}}m-1=m+{\frac {m}{k}}-1} where m is the sample maximum and k
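
As a quick illustration of this estimator, the sketch below draws a hypothetical sample without replacement from 1..N and compares the sample maximum with the UMVU estimate; the values of N and k are invented.

```python
# Illustrative sketch of the UMVU estimator (k+1)/k * m - 1 for the maximum of a
# discrete uniform distribution 1..N (the "German tank problem"); N is made up.
import numpy as np

rng = np.random.default_rng(2)
N_true, k = 1000, 15                       # unknown maximum and sample size
sample = rng.choice(np.arange(1, N_true + 1), size=k, replace=False)

m = sample.max()                           # sample maximum (biased low)
umvu = (k + 1) / k * m - 1                 # equivalently m + m/k - 1
print("sample maximum:", m, " UMVU estimate:", umvu)
```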

516-483: A linear regression model , the response variable, y i {\displaystyle y_{i}} , is a linear function of the regressors: or in vector form, where x i {\displaystyle \mathbf {x} _{i}} , as introduced previously, is a column vector of the i {\displaystyle i} -th observation of all the explanatory variables; β {\displaystyle {\boldsymbol {\beta }}}

A mean of {\displaystyle A}, which can be shown through taking the expected value of each estimator {\displaystyle \mathrm {E} \left[{\hat {A}}_{1}\right]=\mathrm {E} \left[x[0]\right]=A} and {\displaystyle \mathrm {E} \left[{\hat {A}}_{2}\right]=\mathrm {E} \left[{\frac {1}{N}}\sum _{n=0}^{N-1}x[n]\right]={\frac {1}{N}}\left[\sum _{n=0}^{N-1}\mathrm {E} \left[x[n]\right]\right]={\frac {1}{N}}\left[NA\right]=A} At this point, these two estimators would appear to perform

774-454: A "candidate" value for the parameter vector β . The quantity y i − x i b , called the residual for the i -th observation, measures the vertical distance between the data point ( x i , y i ) and the hyperplane y = x b , and thus assesses the degree of fit between the actual data and the model. The sum of squared residuals ( SSR ) (also called the error sum of squares ( ESS ) or residual sum of squares ( RSS ))

903-410: A common value for the given predictor variable. This is the only interpretation of "held fixed" that can be used in an observational study . The notion of a "unique effect" is appealing when studying a complex system where multiple interrelated components influence the response variable. In some cases, it can literally be interpreted as the causal effect of an intervention that is linked to the value of

A constant and a scalar regressor x i , then this is called the "simple regression model". This case is often considered in beginner statistics classes, as it provides much simpler formulas, even suitable for manual calculation. The parameters are commonly denoted as ( α , β ) : The least squares estimates in this case are given by simple formulas. In the previous section the least squares estimator {\displaystyle {\hat {\beta }}}
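
The "simple formulas" referred to above are elided in this snapshot; assuming the standard textbook expressions β̂ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and α̂ = ȳ − β̂ x̄, a minimal sketch on made-up data looks like this.

```python
# Sketch of the closed-form simple-regression estimates (standard textbook formulas;
# the data here are made up for illustration).
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 5, 40)
y = 1.5 + 2.0 * x + rng.normal(0, 0.5, 40)

beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()
print("alpha_hat:", alpha_hat, "beta_hat:", beta_hat)
```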

A constant; it simply subtracts the mean from a variable.) In order for R² to be meaningful, the matrix X of data on regressors must contain a column vector of ones to represent the constant whose coefficient is the regression intercept. In that case, R² will always be a number between 0 and 1, with values close to 1 indicating a good degree of fit. If the data matrix X contains only two variables,

A fixed, unknown parameter corrupted by AWGN. To find the Cramér–Rao lower bound (CRLB) of the sample mean estimator, it is first necessary to find the Fisher information number {\displaystyle {\mathcal {I}}(A)=\mathrm {E} \left(\left[{\frac {\partial }{\partial A}}\ln p(\mathbf {x} ;A)\right]^{2}\right)=-\mathrm {E} \left[{\frac {\partial ^{2}}{\partial A^{2}}}\ln p(\mathbf {x} ;A)\right]} and copying from above {\displaystyle {\frac {\partial }{\partial A}}\ln p(\mathbf {x} ;A)={\frac {1}{\sigma ^{2}}}\left[\sum _{n=0}^{N-1}x[n]-NA\right]} Taking

1419-576: A group of predictor variables, say, { x 1 , x 2 , … , x q } {\displaystyle \{x_{1},x_{2},\dots ,x_{q}\}} , a group effect ξ ( w ) {\displaystyle \xi (\mathbf {w} )} is defined as a linear combination of their parameters where w = ( w 1 , w 2 , … , w q ) ⊺ {\displaystyle \mathbf {w} =(w_{1},w_{2},\dots ,w_{q})^{\intercal }}

A linear model as above, not all elements in {\displaystyle \mathbf {X} } contain information on the data points. The first column is populated with ones, {\displaystyle X_{i1}=1}. Only the other columns contain actual data. So here {\displaystyle p} is equal to the number of regressors plus one. Such

1677-472: A particular candidate, based on some demographic features, such as age. Or, for example, in radar the aim is to find the range of objects (airplanes, boats, etc.) by analyzing the two-way transit timing of received echoes of transmitted pulses. Since the reflected pulses are unavoidably embedded in electrical noise, their measured values are randomly distributed, so that the transit time must be estimated. As another example, in electrical communication theory,

1806-407: A predictor variable. However, it has been argued that in many cases multiple regression analysis fails to clarify the relationships between the predictor variables and the response variable when the predictors are correlated with each other and are not assigned following a study design. Numerous extensions of linear regression have been developed, which allow some or all of the assumptions underlying

1935-506: A probability distribution (e.g., Bayesian statistics ). It is then necessary to define the Bayesian probability π ( θ ) . {\displaystyle \pi ({\boldsymbol {\theta }}).\,} After the model is formed, the goal is to estimate the parameters, with the estimates commonly denoted θ ^ {\displaystyle {\hat {\boldsymbol {\theta }}}} , where

2064-535: A scalar response y i {\displaystyle y_{i}} and a column vector x i {\displaystyle \mathbf {x} _{i}} of p {\displaystyle p} parameters (regressors), i.e., x i = [ x i 1 , x i 2 , … , x i p ] T {\displaystyle \mathbf {x} _{i}=\left[x_{i1},x_{i2},\dots ,x_{ip}\right]^{\operatorname {T} }} . In

2193-400: A study design, the comparisons of interest may literally correspond to comparisons among units whose predictor variables have been "held fixed" by the experimenter. Alternatively, the expression "held fixed" can refer to a selection that takes place in the context of data analysis. In this case, we "hold a variable fixed" by restricting our attention to the subsets of the data that happen to have

2322-516: A system usually has no exact solution, so the goal is instead to find the coefficients β {\displaystyle {\boldsymbol {\beta }}} which fit the equations "best", in the sense of solving the quadratic minimization problem where the objective function S {\displaystyle S} is given by A justification for choosing this criterion is given in Properties below. This minimization problem has

A unique global minimum at {\displaystyle b={\hat {\beta }}}, which can be given by the explicit formula: The product N = XᵀX is a Gram matrix and its inverse, Q = N⁻¹, is the cofactor matrix of β̂, closely related to its covariance matrix, C β̂ . The matrix (XᵀX)⁻¹Xᵀ = QXᵀ is called the Moore–Penrose pseudoinverse matrix of X. This formulation highlights

A unique solution, provided that the {\displaystyle p} columns of the matrix {\displaystyle \mathbf {X} } are linearly independent, given by solving the so-called normal equations: The matrix {\displaystyle \mathbf {X} ^{\operatorname {T} }\mathbf {X} } is known as the normal matrix or Gram matrix and
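
A small sketch of solving the normal equations directly is given below; the design matrix and coefficients are made up, and in practice a QR- or SVD-based solver such as numpy.linalg.lstsq is usually preferred for numerical stability.

```python
# Sketch: solving the normal equations (X^T X) beta = X^T y on made-up data.
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # includes constant column
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

gram = X.T @ X                       # the normal (Gram) matrix
moment = X.T @ y                     # moment matrix of regressand by regressors
beta_hat = np.linalg.solve(gram, moment)
print(beta_hat)
```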

A variance of {\displaystyle {\frac {1}{k}}{\frac {(N-k)(N+1)}{(k+2)}}\approx {\frac {N^{2}}{k^{2}}}{\text{ for small samples }}k\ll N} so a standard deviation of approximately {\displaystyle N/k},

2838-417: Is a p × 1 {\displaystyle p\times 1} vector of unknown parameters; and the scalar ε i {\displaystyle \varepsilon _{i}} represents unobserved random variables ( errors ) of the i {\displaystyle i} -th observation. ε i {\displaystyle \varepsilon _{i}} accounts for

2967-420: Is a simple linear regression ; a model with two or more explanatory variables is a multiple linear regression . This term is distinct from multivariate linear regression , which predicts multiple correlated dependent variables rather than a single dependent variable. In linear regression, the relationships are modeled using linear predictor functions whose unknown model parameters are estimated from

3096-417: Is a framework for modeling response variables that are bounded or discrete. This is used, for example: Generalized linear models allow for an arbitrary link function , g , that relates the mean of the response variable(s) to the predictors: E ( Y ) = g − 1 ( X B ) {\displaystyle E(Y)=g^{-1}(XB)} . The link function is often related to

3225-476: Is a generalization of simple linear regression to the case of more than one independent variable, and a special case of general linear models, restricted to one dependent variable. The basic model for multiple linear regression is for each observation i = 1 , … , n {\textstyle i=1,\ldots ,n} . In the formula above we consider n observations of one dependent variable and p independent variables. Thus, Y i

3354-577: Is a meaningful effect. It can be accurately estimated by its minimum-variance unbiased linear estimator ξ ^ A = 1 q ( β ^ 1 ′ + β ^ 2 ′ + ⋯ + β ^ q ′ ) {\textstyle {\hat {\xi }}_{A}={\frac {1}{q}}({\hat {\beta }}_{1}'+{\hat {\beta }}_{2}'+\dots +{\hat {\beta }}_{q}')} , even when individually none of

3483-485: Is a measure of the overall model fit: where T denotes the matrix transpose , and the rows of X , denoting the values of all the independent variables associated with a particular value of the dependent variable, are X i = x i . The value of b which minimizes this sum is called the OLS estimator for β . The function S ( b ) is quadratic in b with positive-definite Hessian , and therefore this function possesses

is a single regressor on the right side of the regression equation. The OLS estimator is consistent for the level-one fixed effects when the regressors are exogenous and there is no perfect multicollinearity among them (rank condition), consistent for the variance estimate of the residuals when regressors have finite fourth moments and—by the Gauss–Markov theorem — optimal in the class of linear unbiased estimators when

3741-435: Is a special group effect with weights w 1 = 1 {\displaystyle w_{1}=1} and w j = 0 {\displaystyle w_{j}=0} for j ≠ 1 {\displaystyle j\neq 1} , but it cannot be accurately estimated by β ^ 1 ′ {\displaystyle {\hat {\beta }}'_{1}} . It

3870-551: Is a weight vector satisfying ∑ j = 1 q | w j | = 1 {\textstyle \sum _{j=1}^{q}|w_{j}|=1} . Because of the constraint on w j {\displaystyle {w_{j}}} , ξ ( w ) {\displaystyle \xi (\mathbf {w} )} is also referred to as a normalized group effect. A group effect ξ ( w ) {\displaystyle \xi (\mathbf {w} )} has an interpretation as

3999-695: Is also not a meaningful effect. In general, for a group of q {\displaystyle q} strongly correlated predictor variables in an APC arrangement in the standardized model, group effects whose weight vectors w {\displaystyle \mathbf {w} } are at or near the centre of the simplex ∑ j = 1 q w j = 1 {\textstyle \sum _{j=1}^{q}w_{j}=1} ( w j ≥ 0 {\displaystyle w_{j}\geq 0} ) are meaningful and can be accurately estimated by their minimum-variance unbiased linear estimators. Effects with weight vectors far away from

is also sometimes called the hat matrix because it "puts a hat" onto the variable y . Another matrix, closely related to P , is the annihilator matrix M = I n − P ; this is a projection matrix onto the space orthogonal to V . Both matrices P and M are symmetric and idempotent (meaning that P² = P and M² = M ), and relate to the data matrix X via identities PX = X and MX = 0 . Matrix M creates
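
The stated properties of P and M are easy to verify numerically; the sketch below builds both matrices from a small made-up design matrix and checks symmetry, idempotence, PX = X, and MX = 0.

```python
# Numerical check of the projection ("hat") matrix P and annihilator M = I - P
# on a small made-up design matrix.
import numpy as np

rng = np.random.default_rng(5)
X = np.column_stack([np.ones(8), rng.normal(size=(8, 2))])

P = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(8) - P

print(np.allclose(P, P.T), np.allclose(P @ P, P))    # symmetric, idempotent
print(np.allclose(P @ X, X), np.allclose(M @ X, 0))  # PX = X, MX = 0
```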

4257-443: Is called the intercept . Without the intercept, the fitted line is forced to cross the origin when x i = 0 → {\displaystyle x_{i}={\vec {0}}} . Regressors do not have to be independent for estimation to be consistent e.g. they may be non-linearly dependent. Short of perfect multicollinearity, parameter estimates may still be consistent; however, as multicollinearity rises

4386-417: Is captured by x j . In this case, including the other variables in the model reduces the part of the variability of y that is unrelated to x j , thereby strengthening the apparent relationship with x j . The meaning of the expression "held fixed" may depend on how the values of the predictor variables arise. If the experimenter directly sets the values of the predictor variables according to

4515-413: Is equal to zero for any conformal vector, v . This means that y − X β ^ {\displaystyle \mathbf {y} -\mathbf {X} {\boldsymbol {\hat {\beta }}}} is the shortest of all possible vectors y − X β {\displaystyle \mathbf {y} -\mathbf {X} {\boldsymbol {\beta }}} , that is,

4644-468: Is identical to the maximum likelihood estimator (MLE) under the normality assumption for the error terms. This normality assumption has historical importance, as it provided the basis for the early work in linear regression analysis by Yule and Pearson . From the properties of MLE, we can infer that the OLS estimator is asymptotically efficient (in the sense of attaining the Cramér–Rao bound for variance) if

4773-443: Is just a certain linear combination of the vectors of regressors. Thus, the residual vector y − Xβ will have the smallest length when y is projected orthogonally onto the linear subspace spanned by the columns of X . The OLS estimator β ^ {\displaystyle {\hat {\beta }}} in this case can be interpreted as the coefficients of vector decomposition of y = Py along

4902-754: Is meaningful when the latter is. Thus meaningful group effects of the original variables can be found through meaningful group effects of the standardized variables. In Dempster–Shafer theory , or a linear belief function in particular, a linear regression model may be represented as a partially swept matrix, which can be combined with similar matrices representing observations and other assumed normal distributions and state equations. The combination of swept or unswept matrices provides an alternative method for estimating linear regression models. A large number of procedures have been developed for parameter estimation and inference in linear regression. These methods differ in computational simplicity of algorithms, presence of

5031-400: Is minimized. For example, it is common to use the sum of squared errors ‖ ε ‖ 2 2 {\displaystyle \|{\boldsymbol {\varepsilon }}\|_{2}^{2}} as a measure of ε {\displaystyle {\boldsymbol {\varepsilon }}} for minimization. Consider a situation where a small ball is being tossed up in

5160-403: Is often unimportant, since estimation and inference is carried out while conditioning on X . All results stated in this article are within the random design framework. The classical model focuses on the "finite sample" estimation and inference, meaning that the number of observations n is fixed. This contrasts with the other approaches, which study the asymptotic behavior of OLS, and in which

5289-401: Is probable. Group effects provide a means to study the collective impact of strongly correlated predictor variables in linear regression models. Individual effects of such variables are not well-defined as their parameters do not have good interpretations. Furthermore, when the sample size is not large, none of their parameters can be accurately estimated by the least squares regression due to

5418-433: Is regressed on C . It is often used where the variables of interest have a natural hierarchical structure such as in educational statistics, where students are nested in classrooms, classrooms are nested in schools, and schools are nested in some administrative grouping, such as a school district. The response variable might be a measure of student achievement such as a test score, and different covariates would be collected at

5547-461: Is still assumed, with a matrix B replacing the vector β of the classical linear regression model. Multivariate analogues of ordinary least squares (OLS) and generalized least squares (GLS) have been developed. "General linear models" are also called "multivariate linear models". These are not the same as multivariable linear models (also called "multiple linear models"). Various models have been created that allow for heteroscedasticity , i.e.

5676-496: Is strongly correlated with other predictor variables, it is improbable that x j {\displaystyle x_{j}} can increase by one unit with other variables held constant. In this case, the interpretation of β j {\displaystyle \beta _{j}} becomes problematic as it is based on an improbable condition, and the effect of x j {\displaystyle x_{j}} cannot be evaluated in isolation. For

5805-403: Is the total sum of squares for the dependent variable, L = I n − 1 n J n {\textstyle L=I_{n}-{\frac {1}{n}}J_{n}} , and J n {\textstyle J_{n}} is an n × n matrix of ones. ( L {\displaystyle L} is a centering matrix which is equivalent to regression on

is the i-th observation of the dependent variable, X ij is the i-th observation of the j-th independent variable, j = 1, 2, ..., p . The values β j represent parameters to be estimated, and ε i is the i-th independent identically distributed normal error. In the more general multivariate linear regression, there is one equation of the above form for each of m > 1 dependent variables that share

is the i-th row of matrix X . Using these residuals we can estimate the sample variance s² using the reduced chi-squared statistic: The denominator, n − p , is the statistical degrees of freedom . The first quantity, s², is the OLS estimate for σ², whereas the second, {\displaystyle \scriptstyle {\hat {\sigma }}^{2}},
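
A short sketch contrasting the two variance estimates follows; the data-generating values are invented, and the point is only that s² divides the residual sum of squares by n − p while the MLE divides by n.

```python
# Sketch: the two error-variance estimates computed from OLS residuals on made-up data:
# s^2 = RSS/(n - p) (unbiased) and sigma_hat^2 = RSS/n (the MLE).
import numpy as np

rng = np.random.default_rng(6)
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([0.5, 1.0, -1.5]) + rng.normal(scale=2.0, size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat
rss = resid @ resid

s2 = rss / (n - p)        # OLS estimate, degrees of freedom n - p
sigma2_mle = rss / n      # biased MLE, smaller mean squared error
print(s2, sigma2_mle)
```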

is the sample size , sampling without replacement. This problem is commonly known as the German tank problem , due to the application of maximum estimation to estimates of German tank production during World War II . The formula may be understood intuitively as the sample maximum plus the average gap between observations in the sample, the gap being added to compensate for the negative bias of the sample maximum as an estimator for the population maximum. This has

is the MLE estimate for σ². The two estimators are quite similar in large samples; the first estimator is always unbiased , while the second estimator is biased but has a smaller mean squared error . In practice s² is used more often, since it is more convenient for the hypothesis testing. The square root of s² is called the regression standard error , standard error of the regression , or standard error of

6450-449: Is the least squares estimator of β j ′ {\displaystyle \beta _{j}'} . In particular, the average group effect of the q {\displaystyle q} standardized variables is which has an interpretation as the expected change in y ′ {\displaystyle y'} when all x j ′ {\displaystyle x_{j}'} in

6579-437: Is the so-called classical GMM case, when the estimator does not depend on the choice of the weighting matrix. Note that the original strict exogeneity assumption E[ ε i  |  x i ] = 0 implies a far richer set of moment conditions than stated above. In particular, this assumption implies that for any vector-function ƒ , the moment condition E[ ƒ ( x i )· ε i ] = 0 will hold. However it can be shown using

is the unknown. Assuming the system cannot be solved exactly (the number of equations n is much larger than the number of unknowns p ), we are looking for a solution that could provide the smallest discrepancy between the right- and left-hand sides. In other words, we are looking for the solution that satisfies where ‖ · ‖ is the standard L² norm in the n -dimensional Euclidean space Rⁿ . The predicted quantity Xβ

is then squared and the expected value of this squared value is minimized for the MMSE estimator. Commonly used estimators (estimation methods) and topics related to them include: Consider a received discrete signal {\displaystyle x[n]} of {\displaystyle N} independent samples that consists of an unknown constant {\displaystyle A} with additive white Gaussian noise (AWGN) {\displaystyle w[n]} with zero mean and known variance {\displaystyle \sigma ^{2}} (i.e., {\displaystyle {\mathcal {N}}(0,\sigma ^{2})}). Since

6966-434: Is through statistical probability that optimal solutions are sought to extract as much information from the data as possible. Linear regression model In statistics , linear regression is a model that estimates the linear relationship between a scalar response ( dependent variable ) and one or more explanatory variables ( regressor or independent variable ). A model with exactly one explanatory variable

7095-412: The β j ′ {\displaystyle \beta _{j}'} can be accurately estimated by β ^ j ′ {\displaystyle {\hat {\beta }}_{j}'} . Not all group effects are meaningful or can be accurately estimated. For example, β 1 ′ {\displaystyle \beta _{1}'}

7224-551: The i {\displaystyle i} -th observations on all the explanatory variables. Typically, a constant term is included in the set of regressors X {\displaystyle \mathbf {X} } , say, by taking x i 1 = 1 {\displaystyle x_{i1}=1} for all i = 1 , … , n {\displaystyle i=1,\dots ,n} . The coefficient β 1 {\displaystyle \beta _{1}} corresponding to this regressor

7353-413: The q {\displaystyle q} variables via testing H 0 : ξ A = 0 {\displaystyle H_{0}:\xi _{A}=0} versus H 1 : ξ A ≠ 0 {\displaystyle H_{1}:\xi _{A}\neq 0} , and (3) characterizing the region of the predictor variable space over which predictions by

7482-399: The Gauss–Markov theorem that the optimal choice of function ƒ is to take ƒ ( x ) = x , which results in the moment equation posted above. There are several different frameworks in which the linear regression model can be cast in order to make the OLS technique applicable. Each of these settings produces the same formulas and same results. The only difference is the interpretation and

7611-496: The Mean Squared Error (MSE) as the cost on a dataset that has many large outliers, can result in a model that fits the outliers more than the true data due to the higher importance assigned by MSE to large errors. So, cost functions that are robust to outliers should be used if the dataset has many large outliers . Conversely, the least squares approach can be used to fit models that are not linear models. Thus, although

7740-405: The data . Most commonly, the conditional mean of the response given the values of the explanatory variables (or predictors) is assumed to be an affine function of those values; less commonly, the conditional median or some other quantile is used. Like all forms of regression analysis , linear regression focuses on the conditional probability distribution of the response given the values of

7869-729: The errors are homoscedastic and serially uncorrelated . Under these conditions, the method of OLS provides minimum-variance mean-unbiased estimation when the errors have finite variances . Under the additional assumption that the errors are normally distributed with zero mean, OLS is the maximum likelihood estimator that outperforms any non-linear unbiased estimator. Suppose the data consists of n {\displaystyle n} observations { x i , y i } i = 1 n {\displaystyle \left\{\mathbf {x} _{i},y_{i}\right\}_{i=1}^{n}} . Each observation i {\displaystyle i} includes

7998-464: The maximum likelihood estimator. One of the simplest non-trivial examples of estimation is the estimation of the maximum of a uniform distribution. It is used as a hands-on classroom exercise and to illustrate basic principles of estimation theory. Further, in the case of estimation based on a single sample, it demonstrates philosophical issues and possible misunderstandings in the use of maximum likelihood estimators and likelihood functions . Given

8127-490: The multicollinearity problem. Nevertheless, there are meaningful group effects that have good interpretations and can be accurately estimated by the least squares regression. A simple way to identify these meaningful group effects is to use an all positive correlations (APC) arrangement of the strongly correlated variables under which pairwise correlations among these variables are all positive, and standardize all p {\displaystyle p} predictor variables in

the natural logarithm of the pdf {\displaystyle \ln p(\mathbf {x} ;A)=-N\ln \left(\sigma {\sqrt {2\pi }}\right)-{\frac {1}{2\sigma ^{2}}}\sum _{n=0}^{N-1}(x[n]-A)^{2}} and

the residuals from the regression: The variances of the predicted values {\displaystyle s_{{\hat {y}}_{i}}^{2}} are found in the main diagonal of the variance-covariance matrix of predicted values: where P is the projection matrix and s² is the sample variance. The full matrix is very large; its diagonal elements can be calculated individually as: where X i
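
The elided diagonal formula is assumed here to be the standard OLS expression Var(ŷᵢ) = s² xᵢᵀ(XᵀX)⁻¹xᵢ, i.e. s² times the i-th leverage (diagonal element of the hat matrix); the sketch below computes it on made-up data.

```python
# Sketch: per-observation variance of the fitted values, taken as the diagonal of s^2 * P
# (equivalently s^2 * x_i (X^T X)^{-1} x_i^T); data are made up for illustration.
import numpy as np

rng = np.random.default_rng(7)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([2.0, 1.0]) + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)

XtX_inv = np.linalg.inv(X.T @ X)
leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)  # h_ii, diagonal of the hat matrix
var_yhat = s2 * leverage
print(var_yhat[:5])
```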

8514-580: The transpose , so that x i β is the inner product between vectors x i and β . Often these n equations are stacked together and written in matrix notation as where Fitting a linear model to a given data set usually requires estimating the regression coefficients β {\displaystyle {\boldsymbol {\beta }}} such that the error term ε = y − X β {\displaystyle {\boldsymbol {\varepsilon }}=\mathbf {y} -\mathbf {X} {\boldsymbol {\beta }}}

8643-441: The "hat" indicates the estimate. One common estimator is the minimum mean squared error (MMSE) estimator, which utilizes the error between the estimated parameters and the actual value of the parameters e = θ ^ − θ {\displaystyle \mathbf {e} ={\hat {\boldsymbol {\theta }}}-{\boldsymbol {\theta }}} as the basis for optimality. This error term

8772-500: The (population) average size of a gap between samples; compare m k {\displaystyle {\frac {m}{k}}} above. This can be seen as a very simple case of maximum spacing estimation . The sample maximum is the maximum likelihood estimator for the population maximum, but, as discussed above, it is biased. Numerous fields require the use of estimation theory. Some of these fields include: Measured data are likely to be subject to noise or uncertainty and it

the Fisher information into {\displaystyle \mathrm {var} \left({\hat {A}}\right)\geq {\frac {1}{\mathcal {I}}}} results in {\displaystyle \mathrm {var} \left({\hat {A}}\right)\geq {\frac {\sigma ^{2}}{N}}} Comparing this to

9030-416: The air and then we measure its heights of ascent h i at various moments in time t i . Physics tells us that, ignoring the drag , the relationship can be modeled as where β 1 determines the initial velocity of the ball, β 2 is proportional to the standard gravity , and ε i is due to measurement errors. Linear regression can be used to estimate the values of β 1 and β 2 from

9159-401: The assumptions which have to be imposed in order for the method to give meaningful results. The choice of the applicable framework depends mostly on the nature of data in hand, and on the inference task which has to be performed. One of the lines of difference in interpretation is whether to treat the regressors as random variables, or as predefined constants. In the first case ( random design )

9288-458: The basic model to be relaxed. The simplest case of a single scalar predictor variable x and a single scalar response variable y is known as simple linear regression . The extension to multiple and/or vector -valued predictor variables (denoted with a capital X ) is known as multiple linear regression , also known as multivariable linear regression (not to be confused with multivariate linear regression ). Multiple linear regression

9417-619: The basis of X . In other words, the gradient equations at the minimum can be written as: A geometrical interpretation of these equations is that the vector of residuals, y − X β ^ {\displaystyle \mathbf {y} -X{\hat {\boldsymbol {\beta }}}} is orthogonal to the column space of X , since the dot product ( y − X β ^ ) ⋅ X v {\displaystyle (\mathbf {y} -\mathbf {X} {\hat {\boldsymbol {\beta }}})\cdot \mathbf {X} \mathbf {v} }

9546-433: The behavior at a large number of samples is studied. In some applications, especially with cross-sectional data , an additional assumption is imposed — that all observations are independent and identically distributed . This means that all observations are taken from a random sample which makes all the assumptions listed earlier simpler and easier to interpret. Also this framework allows one to state asymptotic results (as

9675-401: The central role of the linear predictor β ′ x as in the classical linear regression model. Under certain conditions, simply applying OLS to data from a single-index model will consistently estimate β up to a proportionality constant. Hierarchical linear models (or multilevel regression ) organizes the data into a hierarchy of regressions, for example where A is regressed on B , and B

9804-450: The centre are not meaningful as such weight vectors represent simultaneous changes of the variables that violate the strong positive correlations of the standardized variables in an APC arrangement. As such, they are not probable. These effects also cannot be accurately estimated. Applications of the group effects include (1) estimation and inference for meaningful group effects on the response variable, (2) testing for "group significance" of

9933-586: The centred y {\displaystyle y} and x j ′ {\displaystyle x_{j}'} be the standardized x j {\displaystyle x_{j}} . Then, the standardized linear regression model is Parameters β j {\displaystyle \beta _{j}} in the original model, including β 0 {\displaystyle \beta _{0}} , are simple functions of β j ′ {\displaystyle \beta _{j}'} in

10062-607: The classroom, school, and school district levels. Errors-in-variables models (or "measurement error models") extend the traditional linear regression model to allow the predictor variables X to be observed with error. This error causes standard estimators of β to become biased. Generally, the form of bias is an attenuation, meaning that the effects are biased toward zero. In a multiple linear regression model parameter β j {\displaystyle \beta _{j}} of predictor variable x j {\displaystyle x_{j}} represents

10191-430: The continuous probability density function (pdf) or its discrete counterpart, the probability mass function (pmf), of the underlying distribution that generated the data must be stated conditional on the values of the parameters: p ( x | θ ) . {\displaystyle p(\mathbf {x} |{\boldsymbol {\theta }}).\,} It is also possible for the parameters themselves to have

10320-419: The data strongly influence the performance of different estimation methods: A fitted linear regression model can be used to identify the relationship between a single predictor variable x j and the response variable y when all the other predictor variables in the model are "held fixed". Specifically, the interpretation of β j is the expected change in y for a one-unit change in x j when

the dependent variable y and the vector of regressors x is linear. This relationship is modeled through a disturbance term or error variable ε —an unobserved random variable that adds "noise" to the linear relationship between the dependent variable and regressors. Thus the model takes the form {\displaystyle y_{i}=\beta _{0}+\beta _{1}x_{i1}+\cdots +\beta _{p}x_{ip}+\varepsilon _{i}=\mathbf {x} _{i}^{\mathsf {T}}{\boldsymbol {\beta }}+\varepsilon _{i},\qquad i=1,\ldots ,n,} where {\displaystyle {}^{\mathsf {T}}} denotes

10578-440: The distribution of the response, and in particular it typically has the effect of transforming between the ( − ∞ , ∞ ) {\displaystyle (-\infty ,\infty )} range of the linear predictor and the range of the response variable. Some common examples of GLMs are: Single index models allow some degree of nonlinearity in the relationship between x and y , while preserving

the equation . It is common to assess the goodness-of-fit of the OLS regression by comparing how much the initial variation in the sample can be reduced by regressing onto X . The coefficient of determination R² is defined as a ratio of "explained" variance to the "total" variance of the dependent variable y , in the cases where the regression sum of squares equals the sum of squares of residuals: where TSS
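
A sketch computing R² as 1 − RSS/TSS for a fit whose design matrix includes a constant column follows; the data are made up.

```python
# Sketch: the coefficient of determination R^2 = 1 - RSS/TSS for an OLS fit that
# includes a column of ones; data are made up.
import numpy as np

rng = np.random.default_rng(8)
n = 80
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 3.0 + 1.2 * x + rng.normal(scale=1.5, size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
rss = np.sum((y - X @ beta_hat) ** 2)     # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)         # total sum of squares
print("R^2:", 1 - rss / tss)
```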

10836-514: The errors for different response variables may have different variances . For example, weighted least squares is a method for estimating linear regression models when the response variables may have different error variances, possibly with correlated errors. (See also Weighted linear least squares , and Generalized least squares .) Heteroscedasticity-consistent standard errors is an improved method for use with uncorrelated but potentially heteroscedastic errors. The Generalized linear model (GLM)

10965-427: The expected change in y {\displaystyle y} when variables in the group x 1 , x 2 , … , x q {\displaystyle x_{1},x_{2},\dots ,x_{q}} change by the amount w 1 , w 2 , … , w q {\displaystyle w_{1},w_{2},\dots ,w_{q}} , respectively, at

the following two broad categories: Linear regression models are often fitted using the least squares approach, but they may also be fitted in other ways, such as by minimizing the " lack of fit " in some other norm (as with least absolute deviations regression), or by minimizing a penalized version of the least squares cost function as in ridge regression ( L²-norm penalty) and lasso ( L¹-norm penalty). Use of

11223-470: The group effect also reduces to an individual effect. A group effect ξ ( w ) {\displaystyle \xi (\mathbf {w} )} is said to be meaningful if the underlying simultaneous changes of the q {\displaystyle q} variables ( x 1 , x 2 , … , x q ) ⊺ {\displaystyle (x_{1},x_{2},\dots ,x_{q})^{\intercal }}

11352-403: The individual effect of x j {\displaystyle x_{j}} . It has an interpretation as the expected change in the response variable y {\displaystyle y} when x j {\displaystyle x_{j}} increases by one unit with other predictor variables held constant. When x j {\displaystyle x_{j}}

11481-523: The influences upon the responses y i {\displaystyle y_{i}} from sources other than the explanatory variables x i {\displaystyle \mathbf {x} _{i}} . This model can also be written in matrix notation as where y {\displaystyle \mathbf {y} } and ε {\displaystyle {\boldsymbol {\varepsilon }}} are n × 1 {\displaystyle n\times 1} vectors of

11610-400: The information in x j , so that once that variable is in the model, there is no contribution of x j to the variation in y . Conversely, the unique effect of x j can be large while its marginal effect is nearly zero. This would happen if the other covariates explained a great deal of the variation of y , but they mainly explain variation in a way that is complementary to what

11739-543: The least squares estimated model are accurate. A group effect of the original variables { x 1 , x 2 , … , x q } {\displaystyle \{x_{1},x_{2},\dots ,x_{q}\}} can be expressed as a constant times a group effect of the standardized variables { x 1 ′ , x 2 ′ , … , x q ′ } {\displaystyle \{x_{1}',x_{2}',\dots ,x_{q}'\}} . The former

11868-399: The matrix X T y {\displaystyle \mathbf {X} ^{\operatorname {T} }\mathbf {y} } is known as the moment matrix of regressand by regressors. Finally, β ^ {\displaystyle {\hat {\boldsymbol {\beta }}}} is the coefficient vector of the least-squares hyperplane , expressed as or Suppose b is

the maximum likelihood estimator {\displaystyle {\hat {A}}={\frac {1}{N}}\sum _{n=0}^{N-1}x[n]} which is simply the sample mean. From this example, it was found that the sample mean is the maximum likelihood estimator for {\displaystyle N} samples of

the maximum likelihood estimator is {\displaystyle {\hat {A}}=\arg \max \ln p(\mathbf {x} ;A)} Taking the first derivative of the log-likelihood function {\displaystyle {\frac {\partial }{\partial A}}\ln p(\mathbf {x} ;A)={\frac {1}{\sigma ^{2}}}\left[\sum _{n=0}^{N-1}(x[n]-A)\right]={\frac {1}{\sigma ^{2}}}\left[\sum _{n=0}^{N-1}x[n]-NA\right]} and setting it to zero {\displaystyle 0={\frac {1}{\sigma ^{2}}}\left[\sum _{n=0}^{N-1}x[n]-NA\right]=\sum _{n=0}^{N-1}x[n]-NA} This results in
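
The conclusion that the sample mean maximizes this log-likelihood can be sanity-checked numerically; the sketch below compares a grid maximizer of the log-likelihood with the sample mean, using invented values for A, σ, and N.

```python
# Numerical check that the sample mean maximizes the Gaussian log-likelihood in A
# for the model x[n] = A + w[n]; the values of A, sigma and N are illustrative.
import numpy as np

rng = np.random.default_rng(9)
A_true, sigma, N = 1.7, 0.8, 200
x = A_true + sigma * rng.standard_normal(N)

def log_likelihood(A):
    # ln p(x; A) for i.i.d. Gaussian noise with known variance sigma^2
    return -N * np.log(sigma * np.sqrt(2 * np.pi)) - np.sum((x - A) ** 2) / (2 * sigma**2)

grid = np.linspace(A_true - 1, A_true + 1, 2001)
A_grid_max = grid[np.argmax([log_likelihood(a) for a in grid])]
print("grid maximizer:", A_grid_max, " sample mean:", x.mean())
```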

the measured data. This model is non-linear in the time variable, but it is linear in the parameters β 1 and β 2 ; if we take regressors x i  = ( x i 1 , x i 2 )  = ( t i , t i ² ), the model takes on the standard form Standard linear regression models with standard estimation techniques make a number of assumptions about the predictor variables,
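
The ball-toss model can be fitted exactly as described, by treating t and t² as two regressors in an ordinary linear regression; the initial velocity, gravity-related coefficient, and noise level below are invented for illustration.

```python
# Sketch of the ball-toss example: h = beta1 * t + beta2 * t^2 fitted by OLS using
# regressors (t, t^2); the velocity, gravity and noise values are made up.
import numpy as np

rng = np.random.default_rng(10)
t = np.linspace(0.1, 2.0, 25)
h = 12.0 * t - 4.9 * t**2 + rng.normal(scale=0.2, size=t.size)  # noisy heights

X = np.column_stack([t, t**2])               # model is linear in beta1, beta2
beta_hat = np.linalg.lstsq(X, h, rcond=None)[0]
print("beta1 (initial velocity):", beta_hat[0], " beta2 (~ -g/2):", beta_hat[1])
```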

the measurements which contain information regarding the parameters of interest are often associated with a noisy signal . For a given model, several statistical "ingredients" are needed so the estimator can be implemented. The first is a statistical sample – a set of data points taken from a random vector (RV) of size N . Put into a vector, {\displaystyle \mathbf {x} ={\begin{bmatrix}x[0]\\x[1]\\\vdots \\x[N-1]\end{bmatrix}}.} Secondly, there are M parameters {\displaystyle {\boldsymbol {\theta }}={\begin{bmatrix}\theta _{1}\\\theta _{2}\\\vdots \\\theta _{M}\end{bmatrix}},} whose values are to be estimated. Third,

12513-472: The model so that they all have mean zero and length one. To illustrate this, suppose that { x 1 , x 2 , … , x q } {\displaystyle \{x_{1},x_{2},\dots ,x_{q}\}} is a group of strongly correlated variables in an APC arrangement and that they are not strongly correlated with predictor variables outside the group. Let y ′ {\displaystyle y'} be

the negative expected value is trivial since it is now a deterministic constant {\displaystyle -\mathrm {E} \left[{\frac {\partial ^{2}}{\partial A^{2}}}\ln p(\mathbf {x} ;A)\right]={\frac {N}{\sigma ^{2}}}} Finally, putting

12771-406: The normality assumption is satisfied. In iid case the OLS estimator can also be viewed as a GMM estimator arising from the moment conditions These moment conditions state that the regressors should be uncorrelated with the errors. Since x i is a p -vector, the number of moment conditions is equal to the dimension of the parameter vector β , and thus the system is exactly identified. This

12900-511: The other covariates are held fixed—that is, the expected value of the partial derivative of y with respect to x j . This is sometimes called the unique effect of x j on y . In contrast, the marginal effect of x j on y can be assessed using a correlation coefficient or simple linear regression model relating only x j to y ; this effect is the total derivative of y with respect to x j . Care must be taken when interpreting regression results, as some of

the point that estimation can be carried out if, and only if, there is no perfect multicollinearity between the explanatory variables (which would cause the Gram matrix to have no inverse). After we have estimated β , the fitted values (or predicted values ) from the regression will be where P = X ( XᵀX )⁻¹ Xᵀ is the projection matrix onto the space V spanned by the columns of X . This matrix P

13158-430: The predictors, rather than on the joint probability distribution of all of these variables, which is the domain of multivariate analysis . Linear regression is also a type of machine learning algorithm , more specifically a supervised algorithm, that learns from the labelled datasets and maps the data points to the most optimized linear functions that can be used for prediction on new datasets. Linear regression

the probability of {\displaystyle \mathbf {x} } becomes {\displaystyle p(\mathbf {x} ;A)=\prod _{n=0}^{N-1}p(x[n];A)={\frac {1}{\left(\sigma {\sqrt {2\pi }}\right)^{N}}}\exp \left(-{\frac {1}{2\sigma ^{2}}}\sum _{n=0}^{N-1}(x[n]-A)^{2}\right)} Taking

the probability of {\displaystyle x[n]} becomes ({\displaystyle x[n]} can be thought of as {\displaystyle {\mathcal {N}}(A,\sigma ^{2})}) {\displaystyle p(x[n];A)={\frac {1}{\sigma {\sqrt {2\pi }}}}\exp \left(-{\frac {1}{2\sigma ^{2}}}(x[n]-A)^{2}\right)} By independence ,

13545-464: The regressors x i are random and sampled together with the y i ' s from some population , as in an observational study . This approach allows for more natural study of the asymptotic properties of the estimators. In the other interpretation ( fixed design ), the regressors X are treated as known constants set by a design , and y is sampled conditionally on the values of X as in an experiment . For practical purposes, this distinction

the regressors may not allow for marginal changes (such as dummy variables , or the intercept term), while others cannot be held fixed (recall the example from the introduction: it would be impossible to "hold t i fixed" and at the same time change the value of t i ² ). It is possible for the unique effect to be nearly zero even when the marginal effect is large. This may imply that some other covariate captures all

13803-438: The residual vector should satisfy the following equation: The equation and solution of linear least squares are thus described as follows: Another way of looking at it is to consider the regression line to be a weighted average of the lines passing through the combination of any two points in the dataset. Although this way of calculation is more computationally expensive, it provides a better intuition on OLS. The OLS estimator

13932-918: The response depends linearly both on a value and its square; in which case we would include one regressor whose value is just the square of another regressor. In that case, the model would be quadratic in the second regressor, but none-the-less is still considered a linear model because the model is still linear in the parameters ( β {\displaystyle {\boldsymbol {\beta }}} ). Consider an overdetermined system of n {\displaystyle n} linear equations in p {\displaystyle p} unknown coefficients , β 1 , β 2 , … , β p {\displaystyle \beta _{1},\beta _{2},\dots ,\beta _{p}} , with n > p {\displaystyle n>p} . This can be written in matrix form as where (Note: for

14061-552: The response variable y is still a scalar. Another term, multivariate linear regression , refers to cases where y is a vector, i.e., the same as general linear regression . The general linear model considers the situation when the response variable is not a scalar (for each observation) but a vector, y i . Conditional linearity of E ( y ∣ x i ) = x i T B {\displaystyle E(\mathbf {y} \mid \mathbf {x} _{i})=\mathbf {x} _{i}^{\mathsf {T}}B}

14190-755: The response variable and their relationship. Numerous extensions have been developed that allow each of these assumptions to be relaxed (i.e. reduced to a weaker form), and in some cases eliminated entirely. Generally these extensions make the estimation procedure more complex and time-consuming, and may also require more data in order to produce an equally precise model. The following are the major assumptions made by standard linear regression models with standard estimation techniques (e.g. ordinary least squares ): Violations of these assumptions can result in biased estimations of β , biased standard errors, untrustworthy confidence intervals and significance tests. Beyond these assumptions, several other statistical properties of

14319-492: The response variables and the errors of the n {\displaystyle n} observations, and X {\displaystyle \mathbf {X} } is an n × p {\displaystyle n\times p} matrix of regressors, also sometimes called the design matrix , whose row i {\displaystyle i} is x i T {\displaystyle \mathbf {x} _{i}^{\operatorname {T} }} and contains

14448-420: The same set of explanatory variables and hence are estimated simultaneously with each other: for all observations indexed as i = 1, ... , n and for all dependent variables indexed as j = 1, ... , m . Nearly all real-world regression models involve multiple predictors, and basic descriptions of linear regression are often phrased in terms of the multiple regression model. Note, however, that in these cases

14577-611: The same time with other variables (not in the group) held constant. It generalizes the individual effect of a variable to a group of variables in that ( i {\displaystyle i} ) if q = 1 {\displaystyle q=1} , then the group effect reduces to an individual effect, and ( i i {\displaystyle ii} ) if w i = 1 {\displaystyle w_{i}=1} and w j = 0 {\displaystyle w_{j}=0} for j ≠ i {\displaystyle j\neq i} , then

the same. However, the difference between them becomes apparent when comparing the variances. {\displaystyle \mathrm {var} \left({\hat {A}}_{1}\right)=\mathrm {var} \left(x[0]\right)=\sigma ^{2}} and

the sample mean is a better estimator since its variance is lower for every  N  > 1. Continuing the example using the maximum likelihood estimator, the probability density function (pdf) of the noise for one sample {\displaystyle w[n]} is {\displaystyle p(w[n])={\frac {1}{\sigma {\sqrt {2\pi }}}}\exp \left(-{\frac {1}{2\sigma ^{2}}}w[n]^{2}\right)} and

14964-466: The sample size n  → ∞ ), which are understood as a theoretical possibility of fetching new independent observations from the data generating process . The list of assumptions in this case is: First of all, under the strict exogeneity assumption the OLS estimators β ^ {\displaystyle \scriptstyle {\hat {\beta }}} and s are unbiased , meaning that their expected values coincide with

the second derivative {\displaystyle {\frac {\partial ^{2}}{\partial A^{2}}}\ln p(\mathbf {x} ;A)={\frac {1}{\sigma ^{2}}}(-N)={\frac {-N}{\sigma ^{2}}}} and finding

15222-443: The standard error around such estimates increases and reduces the precision of such estimates. When there is perfect multicollinearity, it is no longer possible to obtain unique estimates for the coefficients to the related regressors; estimation for these parameters cannot converge (thus, it cannot be consistent). As a concrete example where regressors are non-linearly dependent yet estimation may still be consistent, we might suspect

15351-422: The standardized model. A group effect of { x 1 ′ , x 2 ′ , … , x q ′ } {\displaystyle \{x_{1}',x_{2}',\dots ,x_{q}'\}} is and its minimum-variance unbiased linear estimator is where β ^ j ′ {\displaystyle {\hat {\beta }}_{j}'}

15480-431: The standardized model. The standardization of variables does not change their correlations, so { x 1 ′ , x 2 ′ , … , x q ′ } {\displaystyle \{x_{1}',x_{2}',\dots ,x_{q}'\}} is a group of strongly correlated variables in an APC arrangement and they are not strongly correlated with other predictor variables in

15609-469: The strongly correlated group increase by ( 1 / q ) {\displaystyle (1/q)} th of a unit at the same time with variables outside the group held constant. With strong positive correlations and in standardized units, variables in the group are approximately equal, so they are likely to increase at the same time and in similar amount. Thus, the average group effect ξ A {\displaystyle \xi _{A}}

15738-421: The terms "least squares" and "linear model" are closely linked, they are not synonymous. Given a data set { y i , x i 1 , … , x i p } i = 1 n {\displaystyle \{y_{i},\,x_{i1},\ldots ,x_{ip}\}_{i=1}^{n}} of n statistical units , a linear regression model assumes that the relationship between

15867-408: The true values of the parameters: Statistical estimation Estimation theory is a branch of statistics that deals with estimating the values of parameters based on measured empirical data that has a random component. The parameters describe an underlying physical setting in such a way that their value affects the distribution of the measured data. An estimator attempts to approximate

15996-413: The unknown parameters using the measurements. In estimation theory, two approaches are generally considered: For example, it is desired to estimate the proportion of a population of voters who will vote for a particular candidate. That proportion is the parameter sought; the estimate is based on a small random sample of voters. Alternatively, it is desired to estimate the probability of a voter voting for

the variance is known then the only unknown parameter is {\displaystyle A}. The model for the signal is then {\displaystyle x[n]=A+w[n]\quad n=0,1,\dots ,N-1} Two possible (of many) estimators for the parameter {\displaystyle A} are: Both of these estimators have

16254-431: The variance of the residuals is the minimum possible. This is illustrated at the right. Introducing γ ^ {\displaystyle {\hat {\boldsymbol {\gamma }}}} and a matrix K with the assumption that a matrix [ X   K ] {\displaystyle [\mathbf {X} \ \mathbf {K} ]} is non-singular and K X = 0 (cf. Orthogonal projections ),

16383-459: The variance of the sample mean (determined previously) shows that the sample mean is equal to the Cramér–Rao lower bound for all values of N {\displaystyle N} and A {\displaystyle A} . In other words, the sample mean is the (necessarily unique) efficient estimator , and thus also the minimum variance unbiased estimator (MVUE), in addition to being

was obtained as a value that minimizes the sum of squared residuals of the model. However, it is also possible to derive the same estimator from other approaches. In all cases the formula for the OLS estimator remains the same: β̂ = ( XᵀX )⁻¹ Xᵀ y ; the only difference is in how we interpret this result. For mathematicians, OLS is an approximate solution to an overdetermined system of linear equations Xβ ≈ y , where β

16641-448: Was the first type of regression analysis to be studied rigorously, and to be used extensively in practical applications. This is because models which depend linearly on their unknown parameters are easier to fit than models which are non-linearly related to their parameters and because the statistical properties of the resulting estimators are easier to determine. Linear regression has many practical uses. Most applications fall into one of
