Understanding regression coefficients and multicollinearity through the standardized regression model
The so-called standardized regression model is often presented in textbooks¹ as a solution to numerical issues that can arise in regression analysis, or as a method to bring the regression coefficients onto a common, more interpretable scale. However, the underlying transformation is also useful for gaining a deeper understanding of how regression coefficients are constructed, of the problem of multicollinearity, and of the inflation of standard errors. It can thus also serve as a useful educational tool.
Correlation transformation
The standardized model refers to the model that is estimated after applying the correlation transformation to the outcome and the predictor variables. Let \(\mathbf{a}=(a_{1},a_{2},\ldots,a_{n})^T\) be a column vector of length n. The correlation transformation is then defined by
\[ a_{i}^{*}=\frac{a_{i}-\bar{a}}{\sqrt{\sum_{i=1}^{n}(a_{i}-\bar{a})^{2}}}, \]
where \(\bar{a}\) denotes the mean of the components of \(\mathbf{a}\). The correlation transformation is similar to a z-standardization, but instead of dividing by the standard deviation, we divide by the square root of the sum of squares. If we now consider another vector \(\mathbf{b}^{*}\) to which the same transformation has been applied, we find that
\[ \left(\mathbf{a}^{*}\right)^{T}\mathbf{b}^{*}=r_{a,b}, \]
where \(r_{a,b}\) denotes the Pearson correlation coefficient between vectors \(\mathbf{a}\) and \(\mathbf{b}\). From this it also follows that the dot product of the transformed vector with itself will be 1, i.e. \(\left(\mathbf{a}^{*}\right)^{T}\mathbf{a}^{*}=1.\) The correlation transformation is the key “trick” that will be used to estimate the standardized model.
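As a quick numerical illustration, here is a minimal NumPy sketch with made-up data (the function `corr_transform` and all variable names are only for this example): it checks that a transformed vector has unit dot product with itself, and that the dot product of two transformed vectors equals their Pearson correlation.

```python
# Minimal sketch of the correlation transformation (illustrative names/data).
import numpy as np

def corr_transform(a):
    """Center a vector, then divide by the square root of its sum of squares."""
    a = np.asarray(a, dtype=float)
    centered = a - a.mean()
    return centered / np.sqrt(np.sum(centered**2))

rng = np.random.default_rng(1)
a = rng.normal(size=100)
b = 0.5 * a + rng.normal(size=100)

a_star, b_star = corr_transform(a), corr_transform(b)

print(a_star @ a_star)          # (a*)^T a* = 1
print(a_star @ b_star)          # (a*)^T b* = Pearson correlation r_{a,b}
print(np.corrcoef(a, b)[0, 1])  # agrees with the dot product above
```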
The standardized model
In a standard regression problem, we have an \(n\times1\) outcome vector \(\mathbf{y}\) and an \(n\times p\) matrix \(\mathbf{X}\) containing the p predictors. To estimate the standardized model, we apply the correlation transformation to the outcome vector \(\mathbf{y}\) and to each of the predictors. We then estimate the model
\[ \begin{aligned} \mathbf{y} & =\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon},\\ \boldsymbol{\epsilon} & \sim N(0,\sigma^{2}\mathbf{I}). \end{aligned} \]
Here, \(\mathbf{y}\) denotes the transformed outcome, and the design matrix \(\mathbf{X}=\begin{bmatrix}\mathbf{x}_{1} & \mathbf{x}_{2} & \cdots & \mathbf{x}_p\end{bmatrix}\) contains the p transformed predictors, but no intercept. This is because any intercept term would always be estimated to be zero after the correlation transformation has been applied.
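The zero intercept is easy to verify numerically. The following sketch uses simulated data (again, `corr_transform` and all variable names are illustrative, not part of the original text): it fits the transformed data with an explicit intercept column and shows that the intercept estimate is numerically zero.

```python
# Sketch: after the correlation transformation, an explicit intercept
# is estimated as (numerically) zero, so it can be dropped from the model.
import numpy as np

def corr_transform(a):
    """Apply the correlation transformation column-wise."""
    centered = a - a.mean(axis=0)
    return centered / np.sqrt(np.sum(centered**2, axis=0))

rng = np.random.default_rng(2)
n, p = 200, 3
X_raw = rng.normal(size=(n, p))
y_raw = X_raw @ np.array([1.0, -0.5, 2.0]) + rng.normal(size=n)

X_star, y_star = corr_transform(X_raw), corr_transform(y_raw)

# Least-squares fit that still includes an explicit intercept column.
design = np.column_stack([np.ones(n), X_star])
coefs, *_ = np.linalg.lstsq(design, y_star, rcond=None)
print(coefs[0])  # intercept estimate: numerically zero
```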
The correlation transformation makes it much easier to understand the role of the key components that are required when finding the estimates for the vector \(\hat{\boldsymbol{\beta}}\):
\[ \hat{\boldsymbol{\beta}}=(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y} \]
The first component is the matrix \(\mathbf{X}^{T}\mathbf{X}\), which now has the simple form
\[ \mathbf{X}^{T}\mathbf{X}=\begin{bmatrix}1 & r_{1,2} & \cdots & r_{1,p}\\ r_{2,1} & 1 & \cdots & r_{2,p}\\ \vdots & \vdots & \ddots & \vdots\\ r_{p,1} & r_{p,2} & \cdots & 1 \end{bmatrix}=\mathbf{r}_{XX}, \]
where \(r_{1,2}\) stands for the correlation between predictors \(\mathbf{x}_{1}\) and \(\mathbf{x}_{2}\). Since this matrix is simply the correlation matrix \(\mathbf{r}_{XX}\) between the predictor variables, all of its diagonal elements are 1, and all off-diagonal elements are between -1 and 1.
The second component is the vector \(\mathbf{X}^{T}\mathbf{y}\), which has the simple form
\[ \mathbf{X}^{T}\mathbf{y}=\begin{bmatrix}r_{1,y}\\ r_{2,y}\\ \vdots\\ r_{p,y} \end{bmatrix}=\mathbf{r}_{XY}, \]
where \(r_{p,y}\) stands for the correlation between the \(p\)th predictor and the outcome vector. Thus, the expression for \(\hat{\boldsymbol{\beta}}\) involves nothing more than the correlation matrix \(\mathbf{r}_{XX}\) and the correlation vector \(\mathbf{r}_{XY}\):
\[ \hat{\boldsymbol{\beta}}=(\mathbf{r}_{XX})^{-1}\mathbf{r}_{XY}. \]
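The following sketch (simulated data with illustrative names) verifies both results numerically: \(\mathbf{X}^{T}\mathbf{X}\) reproduces the correlation matrix of the raw predictors, and \((\mathbf{r}_{XX})^{-1}\mathbf{r}_{XY}\) reproduces the ordinary least-squares coefficients on the transformed data.

```python
# Sketch: the standardized coefficients follow from the two correlation
# quantities alone (simulated data; names are illustrative).
import numpy as np

def corr_transform(a):
    centered = a - a.mean(axis=0)
    return centered / np.sqrt(np.sum(centered**2, axis=0))

rng = np.random.default_rng(3)
n, p = 200, 3
X_raw = rng.normal(size=(n, p))
X_raw[:, 1] += 0.8 * X_raw[:, 0]  # make two predictors correlated
y_raw = X_raw @ np.array([1.0, -0.5, 2.0]) + rng.normal(size=n)

X, y = corr_transform(X_raw), corr_transform(y_raw)

r_XX = X.T @ X  # correlation matrix of the predictors
r_XY = X.T @ y  # correlations between each predictor and the outcome
beta_hat = np.linalg.solve(r_XX, r_XY)

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS on transformed data
print(np.allclose(r_XX, np.corrcoef(X_raw, rowvar=False)))  # True
print(np.allclose(beta_hat, beta_ols))                      # True
```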
Not only are the estimates \(\hat{\boldsymbol{\beta}}\) of interest, but so are their standard errors. The expression for the variance of \(\hat{\boldsymbol{\beta}}\) is
\[ \text{Var}(\hat{\boldsymbol{\beta}})=\hat{\sigma}^{2}(\mathbf{X}^{T}\mathbf{X})^{-1}=\hat{\sigma}^{2}(\mathbf{r}_{XX})^{-1}, \]
where \(\hat{\sigma}^{2}\) is estimated through the mean squared error.
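To see why this matters for the inflation of standard errors mentioned at the outset, here is a short sketch with two simulated predictors (entirely made-up data, not an example from the text): as the correlation between the predictors increases, the diagonal elements of \((\mathbf{r}_{XX})^{-1}\) grow, and the standard errors of the coefficients grow with them.

```python
# Sketch: correlated predictors inflate the diagonal of inv(r_XX),
# and with it the standard errors (simulated data, illustrative names).
import numpy as np

def corr_transform(a):
    centered = a - a.mean(axis=0)
    return centered / np.sqrt(np.sum(centered**2, axis=0))

rng = np.random.default_rng(4)
n, p = 200, 2

for rho in (0.0, 0.95):
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
    y_raw = x1 + x2 + rng.normal(size=n)

    X = corr_transform(np.column_stack([x1, x2]))
    y = corr_transform(y_raw)

    r_XX_inv = np.linalg.inv(X.T @ X)
    beta_hat = r_XX_inv @ (X.T @ y)
    resid = y - X @ beta_hat
    mse = resid @ resid / (n - p)          # estimate of sigma^2
    se = np.sqrt(mse * np.diag(r_XX_inv))  # standard errors of beta_hat
    print(rho, se)  # standard errors grow as the predictors become correlated
```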
Finding the estimates \(\hat{\boldsymbol{\beta}}\) and their standard errors requires inverting the correlation matrix \(\mathbf{r}_{XX}\), which becomes complicated for large p. We will therefore look at two limiting cases in which inverting the matrix is straightforward: uncorrelated predictors, and a small number of predictors.
Conclusion
The standardized regression model, as defined by the correlation transformation, can be used to explore how regression coefficients and standard errors are constructed in simple cases. In the model with two predictors, all quantities depend only on the three pairwise correlations among \(\mathbf{x}_{1},\) \(\mathbf{x}_{2},\) and \(\mathbf{y}\). This makes it easy to see the impact of different correlations on the estimated regression coefficients.
Footnotes
See, for instance, Kutner et al. (2005), Applied Linear Statistical Models (esp. p. 271 ff.), on which much of this material is based.↩︎
This result can be shown through the use of the “hat” matrix, which is the matrix \(\mathbf{H}\) that satisfies \(\hat{\mathbf{y}}=\mathbf{X}\hat{\boldsymbol{\beta}}=\mathbf{H}\mathbf{y}\). Because this matrix is a projection matrix, it is idempotent.
We use the mean squared error (MSE) to estimate \(\sigma^{2}\). The vector of residuals is denoted by \(\mathbf{e}=\mathbf{y}-\hat{\mathbf{y}}=(\mathbf{I}-\mathbf{H})\mathbf{y}\). In the case of uncorrelated predictors we have \(\mathbf{X}^{T}\mathbf{X}=\mathbf{r}_{XX}=\mathbf{I}\), so that \(\text{Var}(\hat{\boldsymbol{\beta}})=\text{MSE}\,(\mathbf{X}^{T}\mathbf{X})^{-1}=\text{MSE}\,\mathbf{I}\). The MSE can then be found through some matrix algebra:
\[ \begin{aligned} \text{MSE} & =\frac{1}{n-p}\,\mathbf{e}^{T}\mathbf{e}\\ & =\frac{1}{n-p}\,\mathbf{y}^{T}(\mathbf{I}-\mathbf{H})^{T}(\mathbf{I}-\mathbf{H})\mathbf{y}\\ & =\frac{1}{n-p}\,\mathbf{y}^{T}(\mathbf{I}-\mathbf{H})\mathbf{y}\\ & =\frac{1}{n-p}\left(\mathbf{y}^{T}\mathbf{y}-\mathbf{y}^{T}\mathbf{H}\mathbf{y}\right)\\ & =\frac{1}{n-p}\left(1-\hat{\boldsymbol{\beta}}^{T}\hat{\boldsymbol{\beta}}\right)\\ & =\frac{1}{n-p}\left(1-\sum_{i=1}^{p}r_{i,y}^{2}\right). \end{aligned} \]↩︎
See, for instance, this blogpost for an explanation.↩︎