Regression Assumptions, one by one
Many textbooks on linear regression start with a fully-fletched regression model, including assumptions such as independence and homoscedasticity right from the start. I always found this treatment not very intuitive, as many results, such as the unbiasedness of the parameter estimates, hold without making these stronger assumptions. The goal of this post is to introduce the assumptions of the OLS model one by one, and only when they become necessary. The major assumptions are, in the order they are introduced: linearity, no perfect collinearity, zero conditional mean, independence, homoscedasticity, and the normal distribution of errors.
The material is inspired by a number of textbooks, most importantly Gelman and Hill (2006) and Hayashi (2000). The latter is especially helpful, because it is clearly stated which assumptions are needed for which result.
The basic assumption: Linearity
In a regression model, we model the outcome as a linear function of a number of predictors. The model is of the following form:
\[ y=\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\ldots+\epsilon \]
We know the outcome, \(y\), and the predictors \(x_{1},x_{2},\ldots\) The \(\beta_{j}\) are unknown, and we want to estimate them from the data. Because we rarely assume that the relationship between \(y\) and \(x_{1},x_{2},\ldots\) is completely deterministic, we also add an error term, \(\epsilon.\) This term captures anything about \(y\) that is not captured by the predictors. The most important assumption in regression is in this equation: We assume that the relationship between the outcome and the predictors is additive and linear. This assumption can be relaxed somewhat. For instance, we could interact predictors to produce the model
\[ y=\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\beta_{3}(x_{1}x_{2})+\epsilon. \]
In this model, \(y\) is no longer a linear function of \(x_{1}\) and \(x_{2}\). However, the regression model is still linear in the parameters, although we have to pay attention when interpreting the coefficients. The same is true for a model of the form \(y=\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{1}^{2}+\epsilon,\) which includes a squared term. If we assume a multiplicative model \(y=x_{1}x_{2},\) we could transform this is into a linear model using \(\log y=\beta_{1}\log x_{1}+\beta_{2}\log x_{2}.\)
Regression as an optimization problem
We now collect some data, in order to estimate the unknown quantities of this model, the \(\beta_{j}.\) We collect data on the outcome \(y\) and the predictors \(x_{1},x_{2},\ldots\) To find values for the \(\beta_{j}\), first, let’s just guess. For instance, we could roll a die repeatedly and record the results. We call the result \(\hat{\beta}_{j},\) to show that we have now estimated \(\beta_{j}\). To be sure, this estimate will be pretty bad (we didn’t use any of our data to find it!). To formalize the idea that our made-up \(\hat{\beta}_{j}\)’s will likely be a bad estimate, we compute for each observation the predicted values, \(\hat{y}_{i}\):
\[ \hat{y}_{i} = \hat{\beta}_{0}+\hat{\beta}_{1}x_{i1}+\hat{\beta}_{2}x_{i2}+\ldots \]
We use only the deterministic part of the model to compute \(\hat{y},\) because, by definition, we do not know \(\epsilon_{i}.\) Intuitively, if we simply guess the parameter values, the predicted values \(\hat{y}_{i}\) will have little to do with the actual values \(y_{i}\). One way to quantify how good or bad our guesses are is to compute a loss function, which is some function that quantifies how much the predicted values deviate from the true, known values. One such loss function is
\[ \mathcal{\mathscr{L}}_{2}(\hat{\beta}_{0},\hat{\beta}_{1},\ldots)=\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}, \]
This loss function, called the squared error loss, is always positive, and ideally, we would like its value to be small. For a random guess, the differences between \(y_{i}\) and \(\hat{y}_{i}\) will likely be large. A better guess would reduce the value of the loss function. This choice of loss function may seem somewhat arbitrary. For instance, one might ask why we didn’t choose the following loss function:
\[ \mathcal{\mathscr{L}}_{1}(\hat{\beta}_{0},\hat{\beta}_{1},\ldots)=\sum_{i=1}^{n}|y_{i}-\hat{y}_{i}|. \]
This is also a very reasonable loss function: it is always positive, and we would like its value to be small. There is an infinite number of other possible loss functions, and in many situations it will be advantageous to choose one of these alternatives. However, the squared error loss has some advantages. Maybe most important of all, its easy to find a closed-form expression for the \(\hat{\beta}_j\)’s that minimizes the loss. It is from this choice of loss function that ordinary least squares gets its name; but notably, this refers to the estimation method. Nothing in our regression model so far has told us that we require \(\mathcal{\mathscr{L}}_{2}\) as a loss function, it’s just one way to estimate the parameters \(\beta_{j}\).
The goal is now to find a set of parameters \(\hat{\beta}_{j}\) that minimize the value of \(\mathcal{\mathscr{L}}_{2}(\hat{\beta}_{0},\hat{\beta}_{1},\ldots)\). At this point, it becomes convenient to switch to matrix notation. We arrange the outcome values and the predictors into a \(n\times1\) column vector \(\mathbf{y}=(y_{1},y_{2},\ldots,y_{n})\), and a \(n\times p\) matrix \(\mathbf{X}\), respectively. The first column of \(\mathbf{X}\) is just a vector of 1’s—this way we can include \(\beta_{0}\) in the parameter vector. The parameters are collected in a \(p\times1\) column vector \(\boldsymbol{\beta}=(\beta_{0},\beta_{1},\ldots,\beta_{p-1})\).
Through the use of some matrix algebra, we can then rewrite the loss function as
\[ \mathcal{\mathscr{L}}_{2}(\hat{\boldsymbol{\beta}})=||\mathbf{y}-\mathbf{\hat{y}}||_{2}^{2}=\mathbf{\mathbf{y}}^{T}\mathbf{\mathbf{y}}-2\hat{\boldsymbol{\beta}}^{T}\mathbf{X}^{T}\mathbf{y}+\hat{\boldsymbol{\beta}}^{T}\mathbf{X}^{T}\mathbf{X}\hat{\boldsymbol{\beta}}. \]
To find the minimum, we differentiate \(\mathcal{\mathscr{L}}_{2}(\hat{\boldsymbol{\beta}})\) with respect to \(\hat{\boldsymbol{\beta}}\) (this requires a bit of matrix calculus):
\[ \frac{\partial}{\partial\hat{\boldsymbol{\beta}}}\mathcal{\mathscr{L}}_{2}(\hat{\boldsymbol{\beta}})=-2\mathbf{X}^{T}\mathbf{y}+2\mathbf{X}^{T}\mathbf{X}\hat{\boldsymbol{\beta}}. \]
Setting this equal to zero, we get
\[ \mathbf{X}^{T}\mathbf{X}\hat{\boldsymbol{\beta}}=\mathbf{X}^{T}\mathbf{y}, \]
the so-called “normal equations.” The remaining step is to premultiply both sides by the inverse of \(\mathbf{X}^{T}\mathbf{X}\), from which we obtain
\[ \hat{\boldsymbol{\beta}}=(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}. \]
To show that this is really the minimum of \(\mathcal{\mathscr{L}}_{2}(\hat{\boldsymbol{\beta}})\), we would need to show that \(\mathcal{\mathscr{L}}_{2}(\hat{\boldsymbol{\beta}})\) is convex. We skip that step here.
The expression that we derived for \(\hat{\boldsymbol{\beta}}\) is attractive in its simplicity. However, the expression relies on that fact that \(\mathbf{X}^{T}\mathbf{X}\) is invertible. This will only be the case if \(\mathbf{X}\) has full column rank, or, to put it differently, that no two columns in \(\mathbf{X}\) are perfectly correlated. It is also required that \(n\ge p,\) i.e. that we have at least as many observations as predictors. If \(\mathbf{X}\) does not have full column rank, we could replace \((\mathbf{X}^{T}\mathbf{X})^{-1}\) with a pseudo-inverse. However, that inverse won’t be unique, and so the estimates \(\hat{\boldsymbol{\beta}}\) will no longer be unique either. Going forward, we will assume that no two columns in \(\mathbf{X}\) are perfectly correlated.
If we just want to learn about \(\hat{\boldsymbol{\beta}},\) we are done at this point. We made two big assumptions: We assumed that \(y\) depends linearly on the inputs \(x_{1},x_{2},\ldots\), and we assumed that no two predictors are perfectly correlated.
The zero conditional mean assumption
Arguably, until now, the math involved only some optimization, but no statistics. However, in statistics, we also care about uncertainty. For instance, we are often interested in whether some input \(x_{j}\) is associated with the outcome \(y\). If \(\hat{\beta}_{j}\), the coefficient for \(x_{j}\), is zero, we would conclude that there is no association. But \(\hat{\beta}_{j}\) might be non-zero just by chance, and so we would like some measure of the uncertainty of \(\hat{\beta}_{j}\).
To do so, we make an assumption about the error term. Currently our model can be written in matrix notation as
\[ \mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}. \]
We now additionally assume that \(\boldsymbol{\epsilon}\) is a random vector with mean \(\mathbf{0}\) and covariance matrix \(\boldsymbol{\Sigma}\). The assumption that \(\text{E}[\boldsymbol{\epsilon}]\) is specifically zero is not very strict. For instance, we could assume \(\text{E}[\boldsymbol{\epsilon}]=c\), where \(c\) is some constant, and then absorb this constant in the coefficient \(\beta_{0}\). However, assuming that the mean of the errors is constant is a big assumption: Essentially it means that, after we have accounted for the predictors \(x_{1},x_{2},\ldots\), there is no other systematic variation in \(y\).
It is possibly easier to recognize the severity of this assumption by finding the expressions for the mean and variance of \(\mathbf{y}\). These are now also random variables, as they are functions of \(\boldsymbol{\epsilon}\). We still assume that the predictors are known and fixed. It is then straightforward to derive the expectation and variance of \(\mathbf{y}\):
\[ \begin{aligned} \text{E}[\mathbf{y}] & =\text{E}[\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}]=\mathbf{X}\boldsymbol{\beta}+\text{E}[\boldsymbol{\epsilon}]=\mathbf{X}\boldsymbol{\beta}\\ \text{Var}[\mathbf{y}] & =\text{Var}[\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}]=\text{Var}[\boldsymbol{\epsilon}]=\boldsymbol{\Sigma}. \end{aligned} \]
Note that the assumption about \(\boldsymbol{\epsilon}\) is equivalent to assuming that \(\mathbf{y}\) is a random vector with expectation \(\mathbf{X}\boldsymbol{\beta}\) and covariance matrix \(\boldsymbol{\Sigma}\). We therefore assume that the information we have in \(\mathbf{X}\) is sufficient to model \(\text{E}[\mathbf{y}]\). The zero conditional mean assumption is frequently violated—the most common occurrence is omitted variable bias.
If \(\mathbf{y}\) is a random vector, this also has consequences for our OLS estimator, \(\hat{\boldsymbol{\beta}}\), which is a function of \(\mathbf{y}\). Therefore, \(\hat{\boldsymbol{\beta}}\) is a random vector as well. We can derive the expectation of \(\hat{\boldsymbol{\beta}}\) as follows:
\[ \begin{aligned} \text{E}[\hat{\boldsymbol{\beta}}] & =\text{E}[(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}]\\ & =(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T} \text{E}[\mathbf{y}]\\ & =(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T} \mathbf{X} \boldsymbol{\beta}\\ & =\boldsymbol{\beta}. \end{aligned} \]
Hence, under the present assumptions, \(\hat{\boldsymbol{\beta}}\) is unbiased. To arrive at this result, we had to assume linearity, no perfect collinearity, and \(\text{E}[\boldsymbol{\epsilon}]=\mathbf{0}\) (zero conditional mean).
Standard errors
Next, making use of the fact that for a non-random matrix \(\mathbf{A}\) and random vector \(\mathbf{y}\), \(\text{Var}[\mathbf{A}\mathbf{y}]=\mathbf{A}\ \text{Var}[\mathbf{y}]\mathbf{A}^{T}\), we find that
\[ \begin{aligned} \text{Var}[\hat{\boldsymbol{\beta}}] & =\text{Var}[(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}]\\ & =(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\text{Var}[\mathbf{y}]\ \mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\\ & =(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\boldsymbol{\Sigma}\mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}. \end{aligned} \]
This is the expression that we interested in. The diagonal elements of \(\text{Var}[\hat{\boldsymbol{\beta}}]\) (a \(p\times p\) matrix) tell us about the uncertainty in estimating the elements of \(\hat{\boldsymbol{\beta}}\). To estimate the variance, however, we need to know or estimate \(\boldsymbol{\Sigma}\), and this is where we run into trouble. This matrix will be of the form
\[ \boldsymbol{\Sigma}=\begin{bmatrix}\text{Var}[y_{1}] & \text{Cov}[y_{1},y_{2}] & \cdots & \text{Cov}[y_{1},y_{n}]\\ \text{Cov}[y_{2},y_{1}] & \text{Var}[y_{2}] & \cdots & \text{Cov}[y_{2},y_{n}]\\ \vdots & \vdots & \ddots & \vdots\\ \text{Cov}[y_{n},y_{1}] & \text{Cov}[y_{n},y_{2}] & \cdots & \text{Var}[y_{n}] \end{bmatrix} \]
This is a \(n\times n\) symmetric matrix (\(\text{Cov}[y_{i},y_{j}]=\text{Cov}[y_{j},y_{i}]\)). We therefore need to estimate \(n\) variances and \(\frac{1}{2}(n^{2}-n)\) covariances, for a total of \(\frac{1}{2}(n^{2}+n)\) elements. However, we only have \(n\) observations in total! Clearly, this won’t work, and we therefore have to make another assumption: First, we will assume that the \(y_{i}\) are independent. We do not assume the \(y_{i}\) are i.i.d., independent and identically distributed. The mean of \(y_{i}\) depends on \(x_{i}\), and therefore two different \(y_{i}\)’s may have a different mean. We can also state this assumption in terms of the errors, as \(\boldsymbol{\Sigma}\) is also the covariance matrix of \(\boldsymbol{\epsilon}\). (Strictly speaking, we only require here that the outcomes/errors are uncorrelated, but the independence assumption is more transparent.)
Substantively, the assumption of independence means that we assume \(E[y_i|y_j] = E[y_i]\), i.e. that the mean of any outcome does not depend on the values of the other outcomes. Depending on the problem, this assumption can be unrealistic, and it can relaxed with techniques such as clustered standard errors or multilevel models. The assumption of independence is sometimes replaced with the assumption that our data is a random sample from a population of interest, but I find the statement that the \(y_i\) are independent more precise.
With the independence assumption, the form of \(\boldsymbol{\Sigma}\) is radically simplified, as all off-diagonal values are now zero:
\[ \boldsymbol{\Sigma}'=\begin{bmatrix}\text{Var}[y_{1}] & & & \mathbf{0}\\ & \text{Var}[y_{2}]\\ & & \ddots\\ \mathbf{0} & & & \text{Var}[y_{n}] \end{bmatrix} \]
We are left with the \(n\) variances on the diagonal. Estimating \(n\) variances with \(n\) data points might still sounds impossible, but it is routinely done: When using robust or heteroscedasticity-consistent standard errors, the variances of \(\text{Var}[y_{i}]\) are estimated as \(\widehat{\text{Var}}[y_{i}]=(y_{i}-\hat{y}_{i})^{2}\) (White 1980; but see Freedman 2006 and King and Roberts 2015). However, in the standard regression model, we instead make another assumption: homoscedasticity. This means that we assume that all the \(y_{i}\) have the same variance, which will denote by \(\sigma^{2}\). Thus,
\[ \boldsymbol{\Sigma}''=\begin{bmatrix}\sigma^{2} & & & \mathbf{0}\\ & \sigma^{2}\\ & & \ddots\\ \mathbf{0} & & & \sigma^{2} \end{bmatrix}=\sigma^{2}\mathbf{I}, \]
where \(\mathbf{I}\) is the \(n\times n\) identity matrix. Now, we only have to estimate one parameter with \(n\) observations. Before we show how to estimate \(\sigma^{2}\), we first note that the expression for the variance becomes much simpler when we assume that the covariance matrix of \(\mathbf{y}\) is given by \(\sigma^{2}\mathbf{I}\):
\[ \begin{aligned} \text{Var}[\hat{\boldsymbol{\beta}}] & =(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}(\sigma^{2}\mathbf{I})\mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\\ & =\sigma^{2}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\\ & =\sigma^{2}(\mathbf{X}^{T}\mathbf{X})^{-1} \end{aligned} \]
It remains to estimate \(\sigma^{2}\). Under the assumptions made, it can be shown that
\[ \hat{\sigma}^{2}=\frac{1}{n-p}\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2} \]
is an unbiased estimator for \(\sigma^{2}\). The expression is just the average error that our model makes in predicting \(\hat{y}\), adjusted for the \(p\) degrees of freedom that we required to estimate \(\hat{\boldsymbol{\beta}}\). Hence, for an estimate of the variance of \(\hat{\boldsymbol{\beta}}\), we can use
\[ \widehat{\text{Var}}[\hat{\boldsymbol{\beta}}]= \hat{\sigma}^2 (\mathbf{X}^{T}\mathbf{X})^{-1}, \]
and the square root of this expression is, of course, the standard error. If we are just interested in the standard error, we can stop here.
The Normality assumption
We are now able to estimate \(\text{E}[\hat{\boldsymbol{\beta}}]\) and \(\text{Var}[\hat{\boldsymbol{\beta}}]\) . But if we want to perform statistical inference, we need to know more than just these first two moments. This is where the normality assumption comes in. This assumption applies to the errors, i.e. we assume that \(\boldsymbol{\epsilon}\sim\text{MVN}(0,\sigma^{2}\mathbf{I})\). From this follows that \(\mathbf{y}\sim\text{MVN}(\mathbf{X}\boldsymbol{\beta},\sigma^{2}\mathbf{I}),\) and
\[ \hat{\boldsymbol{\beta}}\sim\text{MVN}\left(\boldsymbol{\beta},\sigma^{2}(\mathbf{X}^{T}\mathbf{X})^{-1}\right). \]
After this assumption, we know the exact sampling distribution of \(\hat{\boldsymbol{\beta}}\), which allows us to derive confidence intervals for \(\hat{\boldsymbol{\beta}}\). These expressions show that the normality assumption is computationally convenient, as making the normal assumption about the errors translates into the result that \(\hat{\boldsymbol{\beta}}\) is also normal. The normality assumption is also justified through the central limit theorem. In other words, even without the explicit assumption, we could have justified the use of a normal approximation.
Conclusion
It turns out that this list is similar the one presented by Gelman and Hill (2006, p. 45f.), who list validity, additivity and linearity, independence of errors, equal variance of errors, and normality of errors as assumptions, in order of their importance. Their list however, adds another major assumption to the front of the list: Validity. In their words,
Most importantly, the data you are analyzing should map to the research question you are trying to answer. This sounds obvious but is often overlooked or ignored because it can be inconvenient. Optimally, this means that the outcome measure should accurately reflect the phenomenon of interest, the model should include all relevant predictors, and the model should generalize to the cases to which it will be applied. (p. 45)
Furthermore, none of the mathematical assumptions guarantee that the regression coefficients can be interpreted as causal.
References
- Freedman, David A. (2006). “On The So-Called ‘Huber Sandwich Estimator’ and ‘Robust Standard Errors’”. The American Statistician. 60 (4): 299–302. doi:10.1198/000313006X152207.
- Gelman, Andrew and Jennifer Hill. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
- Hayashi, Fumio. (2000). Econometrics. Princeton University Press.
- King, Gary; Roberts, Margaret E. (2015). “How Robust Standard Errors Expose Methodological Problems They Do Not Fix, and What to Do About It”. Political Analysis. 23 (2): 159–179. doi:10.1093/pan/mpu015.
- White, Halbert (1980). “A Heteroscedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroscedasticity”. Econometrica. 48 (4): 817–838. doi:10.2307/1912934.