Comparing the scales of the Dissimilarity index and Theil’s index of segregation

Category: segregation

Author: Ben Elbers

Published: 27 July 2025

When studying segregation, the question of how to interpret its extent comes up often. Is an index value of 0.3 a meaningful amount of segregation? At what threshold do we speak of “high segregation”? For the Dissimilarity index, D, Massey and Denton (1993) proposed

a simple rule of thumb … values under 0.3 are low, those between 0.3 and 0.6 are moderate and anything above 0.6 is high. (p. 20)

This simple rule has been frequently used when interpreting the D, but it’s not directly transferable to other segregation indices that operate on a different scale.

In this post, I’ll explore some properties of the Dissimilarity index and Theil’s H index, both of which are frequently used in studies of segregation, and compare how the scales of the D and the H relate to each other.

Understanding D and H

The Dissimilarity index operates on a linear scale. To illustrate this point, let’s assume we have a city with two racial groups A and B, and two schools. To compare the outcomes for different indices, we define a single parameter \(n\), so that we can generate different segregation scenarios using this parameter:

| School | A | B |
|---|---|---|
| School 1 | \(n\) | \(2000 - n\) |
| School 2 | \(2000 - n\) | \(n\) |

For instance, if we set \(n=1000\), there is no segregation, or perfect integration: Both schools have equal numbers of A and B students. If we set \(n=0\), there is absolute segregation: Each school contains only a single racial group. For values between 0 and 1000, we will have intermediate levels of segregation. It’s important to keep in mind that this is a very restricted scenario: Regardless of the value of \(n\), we only have two schools, both of which are of equal size, and the two racial groups are also always of equal size.

To see how the Dissimilarity index changes when we move away from perfect integration, let’s simplify the formula for our case. Here, \(A\) and \(B\) are the totals for each racial group, and \(a_i\) and \(b_i\) refer to the number of students of racial groups A and B in school \(i\). We then have

\[ \begin{align} D &= \frac{1}{2} \sum_i \left| \frac{a_i}{A} - \frac{b_i}{B} \right| \\ &= \frac{1}{2} \left| \frac{n}{2000} - \frac{2000 - n}{2000} \right| + \frac{1}{2} \left| \frac{2000 - n}{2000} - \frac{n}{2000} \right| \\ &= \frac{1}{4000} \left( \left| 2n - 2000 \right| + \left| 2000 - 2n \right| \right) \\ &= \frac{1}{2000} \left| 2000 - 2n \right| \\ &= 1 - \frac{1}{1000} n \qquad \text{for } 0 \le n \le 1000 \end{align} \]

and \(\frac{\partial}{\partial n} D = -\frac{1}{1000}\). Hence, for every pair of students that switches places to increase segregation (decreasing \(n\) by 1), the D increases by a constant amount, \(1/1000\). This is an important property of the Dissimilarity index: it operates on a linear scale.
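The simplification can be checked numerically. The following is a minimal sketch, not part of the original analysis: the helper names `dissimilarity` and `scenario_d` are made up for this illustration. It evaluates the general formula for D on the two-school table and compares it with the closed form \(1 - n/1000\).

```r
# Dissimilarity index for two groups; a and b are per-school counts.
# (Hypothetical helper, named only for this sketch.)
dissimilarity <- function(a, b) 0.5 * sum(abs(a / sum(a) - b / sum(b)))

# Two-school scenario from the text for a given n (0 <= n <= 1000).
scenario_d <- function(n) dissimilarity(a = c(n, 2000 - n), b = c(2000 - n, n))

n <- 0:1000
d_general <- vapply(n, scenario_d, numeric(1))
d_closed <- 1 - n / 1000

# Largest gap between the general formula and the closed form
max_gap <- max(abs(d_general - d_closed))

# Change in D for each one-unit decrease in n (moving from n = 1000 to n = 0)
steps <- diff(rev(d_general))
range(steps)
```

Up to floating-point error, `max_gap` is zero and every element of `steps` equals 1/1000, which is the linearity described above.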

Let’s do the same exercise for Theil’s index of segregation, which we’ll call the H index. Here, \(E\) refers to the entropy of the racial group distribution, \(E_i\) is the entropy of the racial group distribution within school \(i\), and \(p_i\) is the proportion of students in school \(i\). Because we have two groups of equal size, \(E=\log 2\), and we also have \(E_1=E_2\), as the distributions are just flipped. We then have:

\[ \begin{align} H &= \frac{1}{ E } \sum_i p_i \left(E - E_i \right) \\ &= \sum_i \frac{1}{2} \left(1 - \frac{E_i}{\log 2} \right) \\ &= 1 - \frac{E_1}{\log 2} \\ \end{align} \]

where \(E_1=E_2=-\frac{n}{2000} \log\frac{n}{2000} - \frac{2000-n}{2000} \log\frac{2000-n}{2000}\). We therefore have \[\frac{\partial}{\partial n} H = \frac{1}{2000 \log 2} \left( \log \frac{n}{2000} - \log \left(1 - \frac{n}{2000}\right) \right),\]

which shows that for the H index, the change in segregation depends on \(n\) and is not constant. The fact that the H index operates on the log scale means that a marginal change in segregation when there is little segregation will have a smaller absolute effect compared to when there’s already a lot of segregation:

\[ \begin{align} \left. \frac{\partial}{\partial n} H \right|_{n=900} &\approx -0.0001 \\ \left. \frac{\partial}{\partial n} H \right|_{n=100} &\approx -0.0021 \end{align} \]
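These two values can be reproduced with a few lines of R. This is a sketch under the assumptions of the two-school scenario; `dh_analytic` and `dh_numeric` are names invented here. It compares the analytic derivative with a central finite difference as an independent check.

```r
N <- 2000

# Entropy of one school's racial distribution (valid for 0 < n < N)
e <- function(n) -n / N * log(n / N) - (N - n) / N * log((N - n) / N)
h <- function(n) 1 - e(n) / log(2)

# Analytic derivative of H with respect to n, as derived in the text
dh_analytic <- function(n) (log(n / N) - log(1 - n / N)) / (N * log(2))

# Central finite difference, used only as a cross-check
dh_numeric <- function(n, eps = 1e-3) (h(n + eps) - h(n - eps)) / (2 * eps)

round(dh_analytic(c(900, 100)), 4)
round(dh_numeric(c(900, 100)), 4)
```

Both approaches agree. Before rounding, the slope at \(n=100\) is roughly fifteen times the slope at \(n=900\).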

To make these results a bit more intuitive, let’s directly compare the D and H values across the range of possible values for \(n\):

library(ggplot2)

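N <- 2000
# Entropy of one school's racial distribution (NaN at n = 0 because of 0 * log(0))
e <- function(n) -n / N * log(n / N) - (N - n) / N * log((N - n) / N)
# Theil's H and the Dissimilarity index in the two-school scenario
h <- function(n) 1 - e(n) / log(2)
d <- function(n) 1 - 2/N * n

seg <- data.frame(
  n = rep(0:1000, 2),
  measure = c(rep("D", 1001), rep("H", 1001)),
  # h(0) is NaN, so the limiting value H = 1 at n = 0 is supplied by hand
  value = c(d(0:1000), 1, h(1:1000))
)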

(
ggplot(seg, aes(x = n, y = value, color = measure))
    + geom_line()
    + scale_x_reverse()
    + labs(y = "Segregation", x = "< Less segregated | More segregated >")
    + theme_minimal()
    + theme(legend.title = element_blank())
)

Clearly, the D increases linearly as \(n\) decreases, while the \(H\) shows logarithmic behavior: small increases when segregation is low, large increases when segregation is high. In this scenario, the \(H\) index is always smaller than the D index, except at the two extreme cases of complete integration and complete segregation, where the index values are identical.
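That H never exceeds D in this scenario also follows from a short argument. As a sketch, write \(p = n/2000\) (a shorthand introduced only for this note) and let \(h_2\) denote the binary entropy in bits. The two formulas derived above then become

\[ \begin{align} D &= 1 - 2p, \\ H &= 1 - h_2(p), \qquad h_2(p) = -p \log_2 p - (1-p) \log_2 (1-p), \qquad 0 \le p \le \tfrac{1}{2}. \end{align} \]

Because \(h_2\) is strictly concave with \(h_2(0) = 0\) and \(h_2(\tfrac{1}{2}) = 1\), it lies above the straight line \(2p\) that connects these two points. Hence \(h_2(p) \ge 2p\) and therefore \(H \le D\), with equality only at \(p = 0\) and \(p = \tfrac{1}{2}\).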

Let’s link this back up to Massey and Denton’s interpretation of the D:

library(kableExtra)

compare <- data.frame(
  D = d(rev(seq(0, 1000, by = 100))),
  H = c(h(rev(seq(100, 1000, by = 100))), 1),
  level = c("low", "low", "low", "moderate", "moderate", "moderate", "moderate", "high", "high", "high", "high")
)

kable(compare, digits = 2, col.names = c("D index", "H index", "Massey/Denton"))

| D index | H index | Massey/Denton |
|---|---|---|
| 0.0 | 0.00 | low |
| 0.1 | 0.01 | low |
| 0.2 | 0.03 | low |
| 0.3 | 0.07 | moderate |
| 0.4 | 0.12 | moderate |
| 0.5 | 0.19 | moderate |
| 0.6 | 0.28 | moderate |
| 0.7 | 0.39 | high |
| 0.8 | 0.53 | high |
| 0.9 | 0.71 | high |
| 1.0 | 1.00 | high |

Hence, in this example, we should consider an H value above ~0.07 already as “moderate segregation”, and a value above ~0.28 already as “high” segregation!
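In this symmetric scenario the translation between the two scales can be written as a function, because \(D = 1 - n/1000\) implies \(n/2000 = (1 - D)/2\). The sketch below uses hypothetical helper names (`binary_entropy`, `h_from_d`, `d_from_h`) and applies only to the two-school, equal-size case discussed so far. It maps D to H directly and inverts the mapping numerically.

```r
# Binary entropy in nats, with the convention 0 * log(0) = 0
binary_entropy <- function(p) {
  term <- function(q) ifelse(q > 0, -q * log(q), 0)
  term(p) + term(1 - p)
}

# Symmetric scenario only: n / 2000 = (1 - D) / 2, so H depends on D alone
h_from_d <- function(d) 1 - binary_entropy((1 - d) / 2) / log(2)

# Numerical inverse; h_from_d increases from 0 to 1 on the interval [0, 1]
d_from_h <- function(h) {
  uniroot(function(d) h_from_d(d) - h, interval = c(0, 1), tol = 1e-10)$root
}

round(h_from_d(c(0.3, 0.6)), 2)  # the Massey/Denton cut-offs on the H scale
d_from_h(0.5)                    # the D value that corresponds to H = 0.5
```

The first call reproduces the 0.07 and 0.28 thresholds from the table. The inverse can be used to read any H value on the D scale, always under the restrictive assumptions of this example.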

A more general situation

There is a big problem with the table above: It should not be used to translate D into H values for all types of situations. The example is an edge case – only two schools, each school has the same size, and the two racial groups are of equal size as well. Ultimately, the D and the H index work differently and evaluate the same situations differently. To show this point, we generalize our example slightly by studying all possible 2x2 tables with a fixed total population count.

Generating all possible tables can be computationally expensive, so I’m using all tables with a total population of 100 here. That yields 176,451 unique tables, after removing tables that have empty schools or empty racial groups. I then calculated the H and D for each of these tables, and the result is a two-dimensional distribution that looks like this:

library(data.table)

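# Enumerate every way to distribute n people over k cells (here: the four cells of a 2x2 table)
generate_tables <- function(n, k) {
    # Recursively fix one cell count at a time; prefix holds the cells chosen so far
    helper <- function(n, k, prefix = c()) {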
        if (k == 1) return(list(c(prefix, n)))

        result <- list()
        for (i in 0:n) {
            result <- c(result, helper(n - i, k - 1, c(prefix, i)))
        }
        return(result)
    }

    helper(n, k)
}

dt <- rbindlist(lapply(generate_tables(100, 4), function(x) as.list(x)))
names(dt) <- c("w1", "w2", "b1", "b2")
dt[, s1 := w1 + b1]
dt[, s2 := w2 + b2]
dt[, w := w1 + w2]
dt[, b := b1 + b2]
dt[, n := w + b]
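# Drop tables with an empty school or an empty racial group
dt <- dt[s1 > 0 & s2 > 0 & w > 0 & b > 0]

# Logarithm with log(0) set to 0, so that 0 * log(0) contributes nothing to an entropy
logf <- function(x) ifelse(x == 0, 0, log(x))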

dt[, D := 0.5 * (abs(w1 / w - b1 / b) + abs(w2 / w - b2 / b))]
dt[, E := -w / n * logf(w / n) - b / n * logf(b / n)]
dt[, E1 := -w1 / s1 * logf(w1 / s1) - b1 / s1 * logf(b1 / s1)]
dt[, E2 := -w2 / s2 * logf(w2 / s2) - b2 / s2 * logf(b2 / s2)]
dt[, H := 1 / E * (s1 / n * (E - E1) + s2 / n * (E - E2))]

(
    ggplot(dt, aes(x=D, y=H))
    + stat_bin_hex(bins=c(60, 30))
    + scale_fill_viridis_c()
    + geom_abline(color = "gray")
    + labs(title = paste0("Correlation r = ", round(dt[, cor(D, H)], 2)))
    + theme_minimal()
    + theme(legend.position = "none")
)

Lighter areas have higher density, and we can see that there is a thin band of tables where some sort of general relationship between the H and D holds. There are however many scenarios where, for any given value of D, there is a wide range of H values. What the plot also shows is that out of all the possible contingency tables, many are concentrated in the area where segregation is low. Lastly, we can again see that the H index is always lower than the D index.
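The number of tables quoted earlier (176,451) can be double-checked without enumerating anything. The sketch below uses two made-up helper names. `count_formula` counts all 2x2 tables with a given total by the stars-and-bars argument, then removes tables with an empty margin by inclusion-exclusion: four margins can be empty, and four tables per school-group pair are otherwise removed twice. `count_brute` confirms the formula by brute force.

```r
# Closed-form count of valid 2x2 tables for a given total (total >= 1):
# all tables, minus those with one empty margin, plus those removed twice
count_formula <- function(total) choose(total + 3, 3) - 4 * (total + 1) + 4

# Brute-force count: enumerate three cells, derive the fourth, filter margins
count_brute <- function(total) {
  g <- expand.grid(w1 = 0:total, w2 = 0:total, b1 = 0:total)
  g <- g[g$w1 + g$w2 + g$b1 <= total, ]
  g$b2 <- total - g$w1 - g$w2 - g$b1
  keep <- (g$w1 + g$b1 > 0) & (g$w2 + g$b2 > 0) &
    (g$w1 + g$w2 > 0) & (g$b1 + g$b2 > 0)
  sum(keep)
}

count_formula(100)                      # 176451
count_brute(10) == count_formula(10)    # TRUE
```

The formula grows with the cube of the total population, which is why enumerating all tables quickly becomes expensive.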

For a new version of the table above, I’m now showing the possible range of H values by showing the 5th, 50th, and 95th percentile of the distribution for any given value of D:

tab <- rbindlist(lapply(seq(0, 1, by = 0.1), function(r) {
    dt[abs(D - r) < 0.00001, .(D = mean(D), q5 = quantile(H, 0.05), q50 = median(H), q95 = quantile(H, 0.95))]
}))

kable(tab, digits = 2, col.names = c("D", "5th percentile", "Median", "95th percentile"))

| D | 5th percentile | Median | 95th percentile |
|---|---|---|---|
| 0.0 | 0.00 | 0.00 | 0.00 |
| 0.1 | 0.01 | 0.01 | 0.05 |
| 0.2 | 0.02 | 0.03 | 0.13 |
| 0.3 | 0.06 | 0.07 | 0.18 |
| 0.4 | 0.10 | 0.13 | 0.28 |
| 0.5 | 0.15 | 0.22 | 0.40 |
| 0.6 | 0.22 | 0.29 | 0.46 |
| 0.7 | 0.34 | 0.40 | 0.57 |
| 0.8 | 0.44 | 0.55 | 0.70 |
| 0.9 | 0.60 | 0.73 | 0.83 |
| 1.0 | 1.00 | 1.00 | 1.00 |

The median scenario closely matches our simplified example above, but there’s quite a range of values. For instance, among tables with a D value of 0.8, the middle 90% of H values (5th to 95th percentile) range from 0.44 to 0.70.

Let’s have a closer look at this scenario. The next figure shows two segplots, a visual display of the contingency table that is used to produce the segregation index. Here we have two examples, both of which have a 90%-10% distribution for the racial groups. This overall distribution is shown to the right of each segplot. The D is identical in these two examples (0.8), but the H index is quite different: 0.44 on the left, 0.70 on the right, which makes the second value about 60% larger than the first! In the scenario on the left, there is one school in which about 36% of the students come from the minority group, and a second school that is completely segregated. In the scenario on the right, there is one very small school that consists only of students of the minority group, and a large school that contains a small share of minority group students (~2%).

library(segregation)
library(patchwork)

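# Rows of each matrix are the two schools, columns are the two racial groups
example1 <- matrix_to_long(matrix(c(0, 10, 72, 18), nrow=2))
example2 <- matrix_to_long(matrix(c(8, 2, 0, 90), nrow=2))
# Both examples share the same 90%-10% group distribution, so one entropy value serves for both
ent <- entropy(example1, "group", weight = "n")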

mutual_local(example1, "group", "unit", weight = "n", wide = TRUE)[, .(unit, ls = ls / ent, p, contrib = ls / ent * p)]
#> Key: <unit>
#>      unit        ls     p   contrib
#>    <char>     <num> <num>     <num>
#> 1:      1 0.3241035  0.72 0.2333545
#> 2:      2 0.7331267  0.28 0.2052755
mutual_local(example2, "group", "unit", weight = "n", wide = TRUE)[, .(unit, ls = ls / ent, p, contrib = ls / ent * p)]
#> Key: <unit>
#>      unit        ls     p   contrib
#>    <char>     <num> <num>     <num>
#> 1:      1 7.0830689  0.08 0.5666455
#> 2:      2 0.1488661  0.92 0.1369568

(
  segplot(example1, "group", "unit", weight = "n", bar_space = 0.01)
  + labs(title = "D = 0.8, H = 0.44")
  + segplot(example2, "group", "unit", weight = "n", bar_space = 0.01)
  + labs(title = "D = 0.8, H = 0.70")
  & theme(legend.position = "none", axis.title.x = element_text(size = 10))
)

The key to understanding why the H index sees the second example as more segregated is to think about how surprised one is to find any of these schools. With a 90%-10% split, how surprising is it to find a school that is distributed 64%-36%? How surprising is it to find a school that is 100%-0%? This is the scenario on the left, and the H index quantifies this amount of surprise as 0.73 for the first of these schools and as 0.32 for the second. (These are adjusted local segregation scores: local segregation scores divided by the racial group entropy.) After weighting each score by the school’s share of all students (0.28 and 0.72, respectively) and summing, we arrive at an H value of 0.44.

For the second scenario we ask: With a 90%-10% split, how surprising is it to find a school that is distributed 0%-100%? How surprising is it to find a school that is 98%-2%? Here, the H index quantifies the amount of surprise as 7.1 (!) and 0.15, and we arrive at a total H index of 0.70. This intuitively reflects the fact that in a city where the minority group makes up only 10% of the overall student population, it is extremely surprising to find a school that is minority-only. To the D index, these two scenarios are identical, but there is definitely an argument to be made here that the second scenario is, in fact, more segregated.
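To make the "surprise" calculation concrete, here is a minimal base R sketch that recomputes the numbers for both examples without the segregation package. The function `seg_summary` and the helper `xlogxy` are names made up for this illustration. The local score of a school is taken to be the divergence between the school's racial composition and the overall composition, divided by the overall entropy, which matches the adjusted scores printed above.

```r
# Rows are schools, columns are groups (minority, majority); counts as in the text
example1 <- matrix(c(0, 10, 72, 18), nrow = 2)
example2 <- matrix(c(8, 2, 0, 90), nrow = 2)

# x * log(x / y), with the convention that the term is 0 when x is 0
xlogxy <- function(x, y) ifelse(x > 0, x * log(x / y), 0)

seg_summary <- function(m) {
  p_unit <- rowSums(m) / sum(m)      # each school's share of all students
  p_group <- colSums(m) / sum(m)     # overall group shares (here 10% and 90%)
  shares <- m / rowSums(m)           # group shares inside each school
  ent <- -sum(xlogxy(p_group, 1))    # entropy of the overall group shares
  # Local score: how far each school's composition is from the overall shares
  overall <- matrix(p_group, nrow(m), ncol(m), byrow = TRUE)
  ls <- rowSums(xlogxy(shares, overall))
  list(
    D = 0.5 * sum(abs(m[, 1] / sum(m[, 1]) - m[, 2] / sum(m[, 2]))),
    adjusted_ls = ls / ent,
    H = sum(p_unit * ls / ent)
  )
}

round(unlist(seg_summary(example1)), 2)
round(unlist(seg_summary(example2)), 2)
```

Both tables give D = 0.8. The adjusted local scores are 0.32 and 0.73 in the first example and 7.08 and 0.15 in the second, so the size-weighted sums are 0.44 and 0.70.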

Conclusion

This last example has shown that there is no unique mapping between H and D index values. In fact, if there were one, there would be no need to have another index in the first place! The example also showed that it is quite intuitive how the H index arrives at a slightly different conclusion compared to the D index. The H index is therefore its own unique index, with unique properties and its own scale.

Nonetheless, situations that many would regard as already highly segregated yield relatively low absolute values for the H index. When interpreting H index values, it is therefore important to consider small deviations from 0 already as moderately segregated. Some might consider this a downside of the H, but in exchange we gain a lot of desirable properties, such as decomposability, local segregation scores, multigroup indices, and the avoidance of many problems that the D index has (see Winship 1977 for some of these). Instead of purely relying on the index value, it is also a good idea to visualize the data, for instance by using a segplot.

Lastly, if we think about the segregation process from a statistical standpoint, any small deviations that might just be due to noise lead to an increase in the segregation score. The H index is much less susceptible to this than the D index, which is also a desirable property. More details on this aspect are found in an earlier post of mine on the bias of segregation indices.

References

Massey, Douglas S., and Nancy A. Denton. 1993. American Apartheid: Segregation and the Making of the Underclass. Harvard University Press.

Winship, Christopher. 1977. A Revaluation of Indexes of Residential Segregation. Social Forces 55(4): 1058-1066.