Understanding regression coefficients and multicollinearity through the standardized regression model
The so-called standardized regression model is often presented in textbooks^{1} as a solution to numerical issues that can arise in regression analysis, or as a method to bring the regression coefficients to a common, more interpretable scale. However, this transformation can also be useful for gaining a deeper understanding of the construction of regression coefficients, the problem of multicollinearity, and the inflation of standard errors. It can thus also be a useful educational tool.
Correlation transformation
The standardized model refers to the model that is estimated after applying the correlation transformation to the outcome and the predictor variables. Let \( \mathbf{a}=(a_{1},a_{2},\ldots,a_{n})^T \) be a column vector of length n, then the correlation transformation is defined by
\[ a_{i}^{*}=\frac{a_{i}-\bar{a}}{\sqrt{\sum_{i=1}^{n}(a_{i}-\bar{a})^{2}}}, \]where \(\bar{a}\) denotes the mean of the components of \(\mathbf{a}\). The correlation transformation is similar to a z-standardization, but instead of dividing by the standard deviation, we divide by the square root of the sum of squares. If we now consider another vector \(\mathbf{b}^{*}\), to which the same transformation has been applied, we find that
\[ \left(\mathbf{a}^{*}\right)^{T}\mathbf{b}^{*}=r_{a,b}, \]where \(r_{a,b}\) denotes the Pearson correlation coefficient between vectors \(\mathbf{a}\) and \(\mathbf{b}\). From this it also follows that the dot product of the transformed vector with itself will be 1, i.e. \(\left(\mathbf{a}^{*}\right)^{T}\mathbf{a}^{*}=1.\) The correlation transformation is the key “trick” that will be used to estimate the standardized model.
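The two identities above can be checked numerically. The following is a minimal sketch in plain Python; the helper names (`correlation_transform`, `dot`, `pearson`) and the two data vectors are made up for illustration and are not part of the original text.

```python
import math

def correlation_transform(a):
    """Center a vector and divide by the root of its sum of squared deviations."""
    mean = sum(a) / len(a)
    centered = [x - mean for x in a]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def pearson(a, b):
    """Pearson correlation computed directly, for comparison."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(ss_a * ss_b)

# Illustrative data (hypothetical values)
a = [1.0, 2.0, 4.0, 7.0, 11.0]
b = [2.0, 1.0, 5.0, 6.0, 9.0]

a_star = correlation_transform(a)
b_star = correlation_transform(b)

print(dot(a_star, b_star), pearson(a, b))  # the two numbers should agree
print(dot(a_star, a_star))                 # should be 1 up to rounding error
```

Up to floating-point rounding, the dot product of the transformed vectors reproduces the Pearson correlation, and each transformed vector has unit length.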
The standardized model
In a standard regression problem, we have an \(n\times1\) outcome vector \(\mathbf{y}\) and an \(n\times p\) matrix \(\mathbf{X}\) containing the p predictors. To estimate the standardized model, we apply the correlation transformation to the outcome vector \(\mathbf{y}\) and to each of the predictors. We then estimate the model
\[ \begin{aligned} \mathbf{y} & =\mathbf{X}\bm{\beta}+\bm{\epsilon},\\ \bm{\epsilon} & \sim N(0,\sigma^{2}\mathbf{I}). \end{aligned} \]The design matrix \(\mathbf{X}=\begin{bmatrix}\mathbf{x}_{1} & \mathbf{x}_{2} & \cdots & \mathbf{x}_p\end{bmatrix}\) contains the p transformed predictors, but no intercept. This is because any intercept term would always be estimated to be zero after the correlation transformation has been applied.
The correlation transformation makes it much easier to understand the role of the key components that are required when finding the estimates for the vector \(\hat{\bm{\beta}}\):
\[ \hat{\bm{\beta}}=(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y} \]The first component is the matrix \(\mathbf{X}^{T}\mathbf{X}\), which now has the simple form
\[ \mathbf{X}^{T}\mathbf{X}=\begin{bmatrix}1 & r_{1,2} & \cdots & r_{1,p}\\ r_{2,1} & 1 & \cdots & r_{2,p}\\ \vdots & \vdots & \ddots & \vdots\\ r_{p,1} & r_{p,2} & \cdots & 1 \end{bmatrix}=\mathbf{r}_{XX}, \]where \(r_{1,2}\) stands for the correlation between predictors \(\mathbf{x}_{1}\) and \(\mathbf{x}_{2}\). Since this matrix is simply the correlation matrix \(\mathbf{r}_{XX}\) between the predictor variables, all of its diagonal elements are 1, and all off-diagonal elements are between \(-1\) and \(1\).
The second component is the vector \(\mathbf{X}^{T}\mathbf{y}\), which has the simple form
\[ \mathbf{X}^{T}\mathbf{y}=\begin{bmatrix}r_{1,y}\\ r_{2,y}\\ \vdots\\ r_{p,y} \end{bmatrix}=\mathbf{r}_{XY}, \]where \(r_{p,y}\) stands for the correlation between the \(p\)th predictor and the outcome vector. Thus, the expression for \(\hat{\bm{\beta}}\) simply involves the two correlation matrices:
\[ \hat{\bm{\beta}}=(\mathbf{r}_{XX})^{-1}\mathbf{r}_{XY}. \]Not only the estimates for \(\hat{\bm{\beta}}\) are of interest, but also their standard errors. The expression for the variance of \(\hat{\bm{\beta}}\) is
\[ \text{Var}(\hat{\bm{\beta}})=\hat{\sigma}^{2}(\mathbf{X}^{T}\mathbf{X})^{-1}=\hat{\sigma}^{2}(\mathbf{r}_{XX})^{-1}, \]where \(\hat{\sigma}^{2}\) is estimated through the mean squared error.
Finding the estimates for \(\hat{\bm{\beta}}\) and the standard errors requires inverting the correlation matrix \(\mathbf{r}_{XX}\), which is complicated for large p. We will thus look at two limiting cases, which will make inverting the matrix possible: uncorrelated predictors, and a small number of predictors.
(1) Uncorrelated predictors
We first consider perfectly uncorrelated predictors. When all the predictors are uncorrelated with each other, the correlation matrix \(\mathbf{r}_{XX}\) has an extremely simple expression:
\[ \mathbf{r}_{XX}=\mathbf{X}^{T}\mathbf{X}=\mathbf{I}, \]where \(\mathbf{I}\) is the identity matrix. This fact should be obvious from inspection of the matrix above. The full expression for \(\hat{\bm{\beta}}\) simply becomes:
\[ \hat{\bm{\beta}} =(\mathbf{r}_{XX})^{-1}\mathbf{r}_{XY} =\mathbf{r}_{XY}=\begin{bmatrix}r_{1,y}\\ r_{2,y}\\ \vdots\\ r_{p,y} \end{bmatrix} \]Thus, when the predictors are all uncorrelated with each other, the coefficients are simply given by the correlation coefficients between the predictor and the outcome \(\mathbf{y}.\)
The standard errors for the regression are constant, i.e. each coefficient will have the same standard error regardless of the size of the correlation between the predictor and \(\mathbf{y}.\) It can be shown^{2} that the standard errors are given by
\[ \text{s.e.}(\hat{\bm{\beta}})=\frac{1}{\sqrt{n-p}}\sqrt{1-\sum_{i=1}^{p}r_{i,y}^{2}}. \]Thus, the standard errors depend only on the sample size, the number of predictors, and the sum of the squared coefficients. Generally, the standard errors will decrease with increasing sample size, increase with an increasing number of predictors, and increase with lower correlations between the predictors and the outcome. All of these results should make intuitive sense.
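The dependence on sample size, number of predictors, and predictor-outcome correlations can be seen by evaluating the formula for a few inputs. This is a small sketch assuming mutually uncorrelated predictors and the \(n-p\) divisor used above; the function name `se_uncorrelated` and the example values are hypothetical.

```python
import math

def se_uncorrelated(n, r_y):
    """Common standard error of all coefficients when predictors are uncorrelated.

    n   -- sample size
    r_y -- list of correlations r_{i,y} between each predictor and the outcome
    """
    p = len(r_y)
    unexplained = 1.0 - sum(r * r for r in r_y)
    if unexplained < 0:
        # Uncorrelated predictors cannot explain more than all of the variance
        raise ValueError("squared correlations must sum to at most 1")
    return math.sqrt(unexplained / (n - p))

base = se_uncorrelated(100, [0.5, 0.3])
larger_sample = se_uncorrelated(400, [0.5, 0.3])
extra_predictor = se_uncorrelated(100, [0.5, 0.3, 0.0])
weaker_correlations = se_uncorrelated(100, [0.2, 0.3])

print(base, larger_sample, extra_predictor, weaker_correlations)
```

Compared with the baseline, the larger sample gives a smaller standard error, while the additional (useless) predictor and the weaker correlations both give larger ones.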
(2) Two correlated predictors
In actual applications, perfectly uncorrelated predictors are rare. In fact, the goal of regression is often to control for correlated predictors. We now look at the case of two correlated predictors.
In this case, it is also straightforward to find an expression for \(\hat{\bm{\beta}}.\) First, we need to find the inverse of
\[ \mathbf{r}_{XX}=\begin{bmatrix}1 & r_{1,2}\\ r_{2,1} & 1 \end{bmatrix}=\begin{bmatrix}1 & r_{1,2}\\ r_{1,2} & 1 \end{bmatrix}. \]The determinant of this matrix is \(\det\mathbf{r}_{XX}=1-r_{1,2}^{2}\), and the inverse is then
\[ \begin{aligned} (\mathbf{r}_{XX})^{-1} & =\frac{1}{1-r_{1,2}^{2}}\begin{bmatrix}1 & -r_{1,2}\\ -r_{1,2} & 1 \end{bmatrix}. \end{aligned} \]As an aside, this form of the matrix \(\mathbf{r}_{XX}=\mathbf{X}^{T}\mathbf{X}\) also makes it easy to see why perfectly correlated predictors are problematic: when \(r_{1,2}=\pm1\), the determinant of the matrix is zero and the matrix does not have an inverse.
The full expression for \(\hat{\bm{\beta}}\) is:
\[ \begin{aligned} \hat{\bm{\beta}} & =(\mathbf{r}_{XX})^{-1}\mathbf{r}_{XY}\\ &=\frac{1}{1-r_{1,2}^{2}}\begin{bmatrix}1 & -r_{1,2}\\ -r_{1,2} & 1 \end{bmatrix}\begin{bmatrix}r_{1,y}\\ r_{2,y} \end{bmatrix}\\ & =\frac{1}{1-r_{1,2}^{2}}\begin{bmatrix}r_{1,y}-r_{1,2}r_{2,y}\\ r_{2,y}-r_{1,2}r_{1,y} \end{bmatrix} \end{aligned} \]Thus,
\[ \begin{aligned} \hat{\beta}_{1} & =\frac{r_{1,y}-r_{1,2}r_{2,y}}{1-r_{1,2}^{2}},\\ \hat{\beta}_{2} & =\frac{r_{2,y}-r_{1,2}r_{1,y}}{1-r_{1,2}^{2}}. \end{aligned} \]It is immediately evident that, when the two predictors are uncorrelated \((r_{1,2}=0),\) the estimated regression coefficients are simply given by their correlation with \(\mathbf{y}\) (as seen above). When \(r_{1,2}\neq 0,\) both coefficients will change, and the effect will be larger for larger values of \(r_{1,2}.\) If we assume that all three correlations are positive, the formula provides an intuitive way of thinking about what it means to “control” for another variable: the raw correlation between \(\mathbf{x}_1\) and \(\mathbf{y}\) will be reduced by an amount that depends both on the size of the correlation between \(\mathbf{x}_1\) and \(\mathbf{x}_2\) and on the correlation between \(\mathbf{x}_2\) and \(\mathbf{y}.\)
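The closed-form coefficients are easy to evaluate directly. The sketch below implements \(\hat{\beta}_{1}=(r_{1,y}-r_{1,2}r_{2,y})/(1-r_{1,2}^{2})\) and the analogous expression for \(\hat{\beta}_{2}\); the function name `beta_two_predictors` and the chosen input values are illustrative assumptions.

```python
def beta_two_predictors(r1y, r2y, r12):
    """Standardized coefficients for two predictors, from the three correlations."""
    if abs(r12) >= 1:
        # The correlation matrix is singular when the predictors are perfectly correlated
        raise ValueError("r12 must lie strictly between -1 and 1")
    denom = 1.0 - r12 ** 2
    beta1 = (r1y - r12 * r2y) / denom
    beta2 = (r2y - r12 * r1y) / denom
    return beta1, beta2

# Uncorrelated predictors: coefficients equal the raw correlations
uncorrelated = beta_two_predictors(0.5, 0.7, 0.0)

# Positively correlated predictors: beta1 shrinks below its raw correlation
correlated = beta_two_predictors(0.5, 0.7, 0.3)

print(uncorrelated)
print(correlated)
```

With \(r_{1,2}=0.3\), \(\hat{\beta}_{1}\) falls below the raw correlation of 0.5, which is the "controlling" effect described above.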
For instance, assume that we are interested in the coefficient \(\hat{\beta}_{1}\). We let \(r_{1,y}=0.5\) and \(r_{2,y}=0.7.\) In a simple regression, where we just include \(\mathbf{x}_1,\) we would find the coefficient to be 0.5. Now we want to control for another predictor, \(\mathbf{x}_2,\) which is also correlated with the outcome at 0.7. For any “controlling” to happen, \(\mathbf{x}_1\) and \(\mathbf{x}_2\) need to be correlated as well. One interesting question is: How large does this correlation need to be to make \(\hat{\beta}_{1}\) zero? This is straightforward: simply plug in the values, set the expression to zero, and solve for \(r_{1,2}:\)
\[ \begin{aligned} \hat{\beta}_{1} & =\frac{0.5-0.7\,r_{1,2}}{1-r_{1,2}^{2}} = 0\\ r_{1,2} & =\frac{0.5}{0.7} \approx 0.71 \end{aligned} \]Hence, the effect for \(\mathbf{x}_1\) would only vanish completely if \(r_{1,2}\) is fairly large, as should be expected.
In other situations, the coefficient cannot become zero by introducing a control variable. Assume, for instance, that \(r_{1,y}=0.5\) and \(r_{2,y}=0.4.\) The solution here is \(r_{1,2}=1.25,\) which is impossible for a correlation. It turns out that the minimum is attained at \(\hat{\beta}_{1}=0.4\), where \(r_{1,2}=0.5\). In other words, controlling for \(\mathbf{x}_2\) will at most reduce \(\hat{\beta}_{1}\) from 0.5 to 0.4, and this will happen when \(r_{1,2}=0.5\).
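The minimum quoted above can be confirmed with a simple grid search over \(r_{1,2}\). This is only a numerical sketch: the grid limits of \(-0.59\) and \(0.99\) are an assumption chosen to stay inside the range of admissible correlations for these inputs, and the names `beta1` and `grid` are illustrative.

```python
def beta1(r1y, r2y, r12):
    """Standardized coefficient of the first predictor in the two-predictor model."""
    return (r1y - r12 * r2y) / (1.0 - r12 ** 2)

r1y, r2y = 0.5, 0.4

# Value of r12 that would make beta1 zero; greater than 1, so not attainable
zero_at = r1y / r2y

# Grid of candidate correlations in steps of 0.001 (assumed admissible range)
grid = [k / 1000 for k in range(-590, 991)]
best_r12 = min(grid, key=lambda r: beta1(r1y, r2y, r))
best_beta1 = beta1(r1y, r2y, best_r12)

print(zero_at, best_r12, best_beta1)
```

The search lands on \(r_{1,2}=0.5\) with \(\hat{\beta}_{1}\) of about 0.4. Setting the derivative of \(\hat{\beta}_{1}\) with respect to \(r_{1,2}\) to zero gives \(0.4\,r_{1,2}^{2}-r_{1,2}+0.4=0\), whose only root inside \((-1,1)\) is 0.5, in agreement with the grid result.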
The combined effect of different correlations can be explored in the Shiny app shown below. The plot shows the coefficient \(\hat{\beta}_1\) (y-axis) as a function of the correlation between the two predictors (x-axis). Because we are dealing with the correlations among three variables, the range of possible values for \(r_{1,2}\) may be restricted depending on the values of \(r_{1,y}\) and \(r_{2,y}.\)^{3} The Shiny app will show only the range of possible values.
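The restriction on \(r_{1,2}\) can be computed explicitly. The sketch below relies on the standard requirement that the \(3\times3\) correlation matrix of \(\mathbf{x}_1,\) \(\mathbf{x}_2,\) and \(\mathbf{y}\) have a non-negative determinant, which gives the bounds \(r_{1,y}r_{2,y}\pm\sqrt{(1-r_{1,y}^{2})(1-r_{2,y}^{2})}\). This derivation is not spelled out in the text above, and the function names are hypothetical.

```python
import math

def r12_bounds(r1y, r2y):
    """Lower and upper bound on r12 for a valid 3x3 correlation matrix."""
    center = r1y * r2y
    half_width = math.sqrt((1.0 - r1y ** 2) * (1.0 - r2y ** 2))
    return center - half_width, center + half_width

def corr_det(r1y, r2y, r12):
    """Determinant of the correlation matrix of (x1, x2, y)."""
    return 1.0 + 2.0 * r12 * r1y * r2y - r12 ** 2 - r1y ** 2 - r2y ** 2

low, high = r12_bounds(0.5, 0.7)
print(low, high)

# The determinant vanishes at the bounds and is positive in between
print(corr_det(0.5, 0.7, low), corr_det(0.5, 0.7, high))
print(corr_det(0.5, 0.7, (low + high) / 2))
```

For \(r_{1,y}=0.5\) and \(r_{2,y}=0.7,\) the admissible range is roughly \(-0.27\) to \(0.97,\) so strongly negative correlations between the two predictors are ruled out.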
Using the sliders, one can adjust the correlations between the predictors and the outcome variable. In the default setting, the correlations are set as \(r_{1,y}=0.5\) and \(r_{2,y}=0.7.\) For this example, when \(r_{1,2}<0,\) the estimated coefficient will be inflated compared to the raw correlation \(r_{1,y}\) (indicated by the orange line). When \(r_{1,2}>0,\) the estimated coefficient will be attenuated instead. The attenuation will be especially severe as \(r_{1,2}\) approaches 1. This is the problem of multicollinearity and can also be seen from the formula for \(\hat{\beta}_1\): As \(r_{1,2}\) approaches 1, \(\hat{\beta}_1\) approaches \( \pm \infty.\)
Another interesting fact to note is that the coefficient of a predictor can be nonzero even if the predictor is completely uncorrelated with the outcome. For instance, if we let \(r_{1,y}=0\) and \(r_{2,y}=0.5,\) the plot shows a sigmoid shape: \(\hat{\beta}_1\) will be positive when \(\mathbf{x}_1\) and \(\mathbf{x}_2\) are negatively correlated, and vice versa. This happens, of course, because multiple regression provides conditional inference: While \(\mathbf{x}_1\) and \(\mathbf{y}\) may be uncorrelated, they may well be correlated once we condition on \(\mathbf{x}_2\).
As a last step, we consider the standard errors for the two regression coefficients. As before,
\[ \begin{aligned} \text{Var}(\hat{\bm{\beta}}) & =\hat{\sigma}^{2}(\mathbf{r}_{XX})^{-1}\\ & =\frac{\hat{\sigma}^{2}}{1-r_{1,2}^{2}}\begin{bmatrix}1 & -r_{1,2}\\ -r_{1,2} & 1 \end{bmatrix} \end{aligned} \]Thus, the standard errors are again constant:
\[ \text{s.e.}(\hat{\beta}_{1})=\text{s.e.}(\hat{\beta}_{2})=\frac{\hat{\sigma}}{\sqrt{1-r_{1,2}^{2}}} \]This clearly shows that any correlation between \(\mathbf{x}_{1}\) and \(\mathbf{x}_{2}\) increases the variance and standard errors of the estimated coefficients. In fact, as \(r_{1,2}\) approaches 1, the standard errors approach \(\infty\). This is an important result, because, even though either \(\mathbf{x}_{1}\) or \(\mathbf{x}_{2}\) might be highly correlated with \(\mathbf{y}\), under multicollinearity the standard errors might be very large. Thus, statistical tests might not reject the null hypothesis, despite strong correlation.
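To get a feel for how quickly the standard errors grow, one can tabulate the multiplier \(1/\sqrt{1-r_{1,2}^{2}}\) that is applied to \(\hat{\sigma}\). The sketch below holds \(\hat{\sigma}\) fixed for illustration, although in practice it also depends on the correlations; the function name `se_inflation` is hypothetical. The square of this multiplier is what is commonly called the variance inflation factor in the two-predictor case.

```python
import math

def se_inflation(r12):
    """Factor by which the standard errors are multiplied relative to r12 = 0."""
    if abs(r12) >= 1:
        raise ValueError("r12 must lie strictly between -1 and 1")
    return 1.0 / math.sqrt(1.0 - r12 ** 2)

factors = {r12: se_inflation(r12) for r12 in (0.0, 0.5, 0.9, 0.99)}
for r12, factor in factors.items():
    print(r12, round(factor, 3))
```

A correlation of 0.5 inflates the standard errors by only about 15 percent, whereas a correlation of 0.99 inflates them roughly sevenfold.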
Conclusion
The standardized regression model, as defined by the correlation transformation, can be used to explore the construction of regression coefficients and standard errors in simple cases. In the model with two predictors, all quantities depend only on the three correlations between \(\mathbf{x}_{1},\) \(\mathbf{x}_{2},\) and \(\mathbf{y}\). This makes it easy to see the impact of different correlations on the estimated regression coefficients.

See, for instance, Kutner et al. (2005), Applied Linear Statistical Models (esp. p. 271 ff.), on which much of this material is based. ↩︎

This result can be shown through the use of the “hat” matrix, which is the matrix \(\mathbf{H}\) that satisfies \(\hat{\mathbf{y}}=\mathbf{X}\hat{\bm{\beta}}=\mathbf{H}\mathbf{y}\). Because this matrix is a projection matrix, it is idempotent.
We use the mean squared error to estimate \(\sigma^{2}\). The vector of residuals is denoted by \(\mathbf{e}=\mathbf{y}-\hat{\mathbf{y}}=(\mathbf{I}-\mathbf{H})\mathbf{y}\). The variance of \(\hat{\bm{\beta}}\) can then be found through some matrix algebra:
\[ \begin{aligned} \text{Var}(\hat{\bm{\beta}}) & =\text{MSE}(\mathbf{X}^{T}\mathbf{X})^{-1}\\ & =\frac{1}{n-p}\left(\mathbf{e}^{T}\mathbf{e}\right)\mathbf{I}\\ & =\frac{1}{n-p}\mathbf{y}^{T}(\mathbf{I}-\mathbf{H})^{T}(\mathbf{I}-\mathbf{H})\mathbf{y}\\ & =\frac{1}{n-p}\mathbf{y}^{T}(\mathbf{I}-\mathbf{H})\mathbf{y}\\ & =\frac{1}{n-p}\left(\mathbf{y}^{T}\mathbf{y}-\mathbf{y}^{T}\mathbf{H}\mathbf{y}\right)\\ & =\frac{1}{n-p}\left(1-\hat{\bm{\beta}}^{T}\hat{\bm{\beta}}\right)\\ & =\frac{1}{n-p}\left(1-\sum_{i=1}^{p}r_{i,y}^{2}\right) \end{aligned} \] ↩︎

See, for instance, this blogpost for an explanation. ↩︎