Logistic regression with categorical predictors
I’ve written a little bit about linear regression on this blog before, for instance on the correlation model. Mathematically, linear regression and the OLS estimator are nice to work with exactly because of the linearity. Once we get to more complicated models, such as logistic regression, there is no longer a closed-form solution for the estimator, which makes it a bit harder to see what’s going on under the hood.
For logistic regression, the parameters are usually estimated by maximum likelihood, which has no closed-form solution in general. One exception to this rule is the case of categorical predictors: here there is a simple (and indeed somewhat trivial) closed-form solution for the estimates.
Setup
In the simplest possible setup, we have a binary predictor \(X\) and a binary outcome \(Y\). For instance, similarly to the example on Wikipedia, \(X\) could be whether a student has studied for an exam, and \(Y\) could be whether the student has passed. Let’s say 80% of the students who studied passed the exam, and 40% of the students who didn’t study passed. This means that, effectively, we are fitting a logistic curve to the two points \((0, 0.4)\) and \((1, 0.8)\), where the x coordinate specifies whether a student has studied, and the y coordinate specifies the probability of passing the exam. Graphically, this looks like this:
Curve-fitting
With this setup, this is an exercise in fitting a logistic function to the data – and because we have only two points, the fit is perfect. The curve we’re fitting is this one:
\[ p(x) = \text{logit}^{-1}(\beta_0+\beta_1 x) = \frac{1}{1+\text{exp}(-(\beta_0+\beta_1 x))} \]
where \(p(x)\) is the probability of passing, and \(x\) is 1 if a student has studied, and 0 otherwise. The function \(\text{logit}^{-1}(x) = \frac{1}{1+\text{exp}(-x)}\) is the inverse logit, also known as the logistic function. The logit function is the inverse of this function, i.e. \(\text{logit}(x) = \text{log}\frac{x}{(1-x)}\), which means taking the logarithm of the odds.
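In R, by the way, both functions are built in as plogis() (the inverse logit, i.e. the distribution function of the standard logistic distribution) and qlogis() (the logit, its quantile function). A quick check that they behave as described:
plogis(0)            # inverse logit of 0 is 0.5
#> [1] 0.5
qlogis(0.8)          # log odds of 0.8, i.e. log(0.8 / 0.2)
#> [1] 1.386294
plogis(qlogis(0.8))  # the two functions are inverses of each other
#> [1] 0.8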
Let’s use \(a\) for the probability of passing if a student didn’t study, and \(b\) for the probability of passing if a student did study. Then we need to solve this system of equations for \(\beta_0\) and \(\beta_1\):
\[ \begin{align} a &=\text{logit}^{-1}(\beta_0) \\ b &=\text{logit}^{-1}(\beta_0+\beta_1) \end{align} \]
Trivially, by just applying the logit transformation, we obtain:
\[ \begin{align} \beta_0 &=\text{logit}(a) \\ \beta_1 &=\text{logit}(b) - \beta_0 = \text{logit}(b) - \text{logit}(a) \end{align} \]
This is hopefully intuitive. The results are equivalent to those that would be obtained by OLS in a linear model (where we would get \(\beta_0=a\) and \(\beta_1=b-a\)), but on the logit scale instead of on the probability scale.
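As a quick sanity check of that analogy, fitting an ordinary linear model to a small made-up data set with these proportions (five students per group, the same data used with glm further below) gives exactly \(\beta_0=a=0.4\) and \(\beta_1=b-a=0.4\):
# linear probability model: intercept is a, slope is b - a
exam <- data.frame(
  x = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
  y = c(1, 1, 0, 0, 0, 1, 1, 1, 1, 0)
)
coef(lm(y ~ x, data = exam))
#> (Intercept)           x 
#>         0.4         0.4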
Example
For the example above, we therefore obtain
\[ \begin{align} \beta_0 &= \text{log}\frac{0.4}{(1-0.4)} =\text{log}\frac{2}{3} \approx -0.405 \\ \beta_1 &= \text{log}\frac{0.8}{(1-0.8)} - \beta_0 = \text{log}(6) \approx 1.792 \end{align} \]
Let’s check this against what we get from R using the glm function:
# create data for 10 students
exam <- data.frame(
  x = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
  y = c(1, 1, 0, 0, 0, 1, 1, 1, 1, 0)
)
model <- glm(y ~ x, data = exam, family = binomial)
coef(model)
#> (Intercept)           x 
#>  -0.4054651   1.7917595
The MLE gives the same answer.
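Applying the inverse logit to these coefficients gives back the two observed proportions, as expected:
# back-transform the coefficients to the probability scale
plogis(coef(model)[[1]])    # P(pass | did not study)
#> [1] 0.4
plogis(sum(coef(model)))    # P(pass | studied)
#> [1] 0.8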
Another way to obtain these coefficients is to use an OLS estimator with aggregate data, where we use the logit transform:
# define a logit function to use in linear model
logit <- function(x) log(x / (1-x))
(logit(0.4))
#> [1] -0.4054651
(logit(0.8) - logit(0.4))
#> [1] 1.791759
exam_aggregate <- data.frame(x = c(0,1), y = c(0.4, 0.8))
model <- lm(logit(y) ~ x, data = exam_aggregate, weights = c(5,5))
coef(model)
#> (Intercept)           x 
#>  -0.4054651   1.7917595
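The weights reflect the five students per group, but because two points and two parameters give an exact fit, they don’t actually change anything here:
# with two points and two parameters the fit is exact, so the weights are irrelevant
coef(lm(logit(y) ~ x, data = exam_aggregate))
#> (Intercept)           x 
#>  -0.4054651   1.7917595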
Extending to a categorical predictor
Until now, \(X\) was assumed to be binary. What happens if \(X\) is categorical? It turns out that this case is not very different: with a single categorical predictor, we are effectively fitting several separate logistic curves. As an example, assume we have three academic departments with different admission rates. Department 0 admits 40% of the students who apply, department 1 admits 80%, and department 2 admits 60%.
To use a categorical predictor in a regression model, we “dummy code” the department information with department 0 as the reference category. Define two binary variables \(X_1\) and \(X_2\):
- Department 0 is coded as \(X_1=0\) and \(X_2=0\)
- Department 1 is coded as \(X_1=1\) and \(X_2=0\)
- Department 2 is coded as \(X_1=0\) and \(X_2=1\)
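This is exactly the “treatment” coding that R applies by default when a factor enters a model formula; model.matrix() shows the resulting design matrix:
# R's default treatment coding: department 0 is the reference level
model.matrix(~ as.factor(x), data = data.frame(x = c(0, 1, 2)))
#>   (Intercept) as.factor(x)1 as.factor(x)2
#> 1           1             0             0
#> 2           1             1             0
#> 3           1             0             1
#> attr(,"assign")
#> [1] 0 1 1
#> attr(,"contrasts")
#> attr(,"contrasts")$`as.factor(x)`
#> [1] "contr.treatment"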
We then define the logistic regression model as
\[ p(x) = \text{logit}^{-1}(\beta_0+\beta_1 x_1+\beta_2 x_2) \]
Because the predictors are binary, we are now effectively fitting two separate curves, as in this figure:
Both logistic curves run through the point \((0, 0.4)\) because we picked department 0 as the reference category. The difference in slopes between these two curves is the reason that interaction terms in logistic regression models are much trickier to interpret than in linear models.
Again, use \(a\), \(b\), and \(c\) for the probability of admittance for departments 0, 1, and 2, respectively. The system of equations to solve now becomes
\[ \begin{align} a &=\text{logit}^{-1}(\beta_0) \\ b &=\text{logit}^{-1}(\beta_0+\beta_1) \\ c &=\text{logit}^{-1}(\beta_0+\beta_2) \end{align} \]
with solutions
\[ \begin{align} \beta_0 &=\text{logit}(a) \\ \beta_1 &=\text{logit}(b) - \text{logit}(a) \\ \beta_2 &=\text{logit}(c) - \text{logit}(a). \end{align} \]
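More generally, with \(K\) categories and category 0 as the reference, the intercept is the logit of the reference category’s probability, and every other coefficient is a difference of logits, i.e. a log odds ratio:
\[ \begin{align} \beta_0 &=\text{logit}(p_0) \\ \beta_k &=\text{logit}(p_k) - \text{logit}(p_0) = \text{log}\frac{p_k/(1-p_k)}{p_0/(1-p_0)}, \qquad k = 1, \dots, K-1 \end{align} \]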
For the example, the resulting coefficients are
(logit(0.4))
#> [1] -0.4054651
(logit(0.8) - logit(0.4))
#> [1] 1.791759
(logit(0.6) - logit(0.4))
#> [1] 0.8109302
And here’s the same using the glm function:
# create data for 15 applicants (5 for each department)
depts <- data.frame(
  x = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  y = c(1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0)
)
model <- glm(y ~ as.factor(x), data = depts, family = binomial)
coef(model)
#>   (Intercept) as.factor(x)1 as.factor(x)2 
#>    -0.4054651     1.7917595     0.8109302
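And again, back-transforming the coefficients recovers the three admission rates:
# back-transform to the probability scale: one admission rate per department
plogis(coef(model)[[1]])                     # department 0
#> [1] 0.4
plogis(coef(model)[[1]] + coef(model)[[2]])  # department 1
#> [1] 0.8
plogis(coef(model)[[1]] + coef(model)[[3]])  # department 2
#> [1] 0.6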